Architecture Taxonomy


Edifice provides 184 registered architectures across 25 families, spanning from foundational MLPs to cutting-edge state space models, audio codecs, robotics, and 3D generation. This document serves as a comprehensive reference catalog — organized as a taxonomy with provenance, strengths/weaknesses, and adoption context for every architecture.

For guided learning, see Learning Path. For choosing architectures by problem type, see Problem Landscape.


Family Overview

| Family | Count | Primary Input | Core Use Case | Era |
| --- | --- | --- | --- | --- |
| Transformer | 4 | Sequences | Autoregressive generation backbone | 2017–2025 |
| SSM (State Space Models) | 19 | Sequences | Efficient long-range sequence modeling | 2021–2025 |
| Attention & Linear Attention | 34 | Sequences | Token mixing, global dependencies | 2017–2025 |
| Recurrent | 15 | Sequences | Sequential state tracking, temporal memory | 1997–2025 |
| Vision | 15 | Images/Patches | Image classification, segmentation, synthesis | 2020–2025 |
| Convolutional | 6 | Images/Sequences | Local pattern extraction | 2015–2019 |
| Generative | 22 | Noise/Latents | Data generation, density estimation | 2014–2025 |
| Graph | 9 | Graphs | Node/graph classification, molecular modeling | 2017–2025 |
| Contrastive & Self-Supervised | 8 | Paired views | Representation learning without labels | 2020–2025 |
| Meta-Learning & Composition | 22 | Varied | Routing, adaptation, preference optimization | 2016–2025 |
| Feedforward | 5 | Tabular/Vectors | Feature transformation, tabular data | 1986–2024 |
| Memory | 3 | Sequences | External differentiable memory | 2014–2025 |
| Energy | 3 | Varied | Energy landscapes, continuous dynamics | 2006–2020 |
| Probabilistic | 3 | Varied | Uncertainty quantification | 2015–2018 |
| Audio | 3 | Audio/Speech | Neural audio codecs, TTS | 2023–2025 |
| Robotics | 2 | Vision+Actions | Imitation learning, VLA | 2024–2025 |
| Interpretability | 2 | Activations | Feature extraction, mechanistic analysis | 2024–2025 |
| Sets | 2 | Point sets | Permutation-invariant set functions | 2017 |
| Neuromorphic | 2 | Spike trains | Biologically-plausible, ultra-low-power | 2015–2019 |
| Scientific | 1 | Functions/PDEs | Operator learning, PDE solving | 2021–2025 |
| Multimodal | 1 | Multi-modal | Cross-modal fusion | 2024–2025 |
| World Model | 1 | Observations | Latent dynamics, planning | 2024–2025 |
| RL | 1 | States/Actions | Actor-critic policy networks | 2024–2025 |
| Liquid | 1 | Sequences | Continuous-time adaptive dynamics | 2021 |
| Inference | 1 | Sequences | Speculative decoding acceleration | 2024–2025 |

Detailed Taxonomy

Sequence Processing

State Space Models (19 architectures)

State Space Models (SSMs) model sequences through continuous-time linear dynamical systems discretized for neural networks. The key equation is:

h[t] = A * h[t-1] + B * x[t]    (state update)
y[t] = C * h[t]                  (output)

SSMs achieve O(N) complexity for sequence length N (vs O(N²) for attention), making them the leading architecture family for long-range sequence modeling. The evolution from S4 to Mamba to Mamba-3 represents one of the most active research fronts in deep learning.
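
The recurrence above can be run directly; here is a minimal NumPy toy with illustrative shapes (not the Edifice API):

```python
import numpy as np

def ssm_scan(A, B, C, xs):
    """h[t] = A @ h[t-1] + B @ x[t]; y[t] = C @ h[t], one step per input."""
    h = np.zeros(A.shape[0])
    ys = []
    for x in xs:
        h = A @ h + B @ x   # state update
        ys.append(C @ h)    # output projection
    return np.stack(ys)

rng = np.random.default_rng(0)
d_state, d_in, d_out, T = 4, 2, 3, 8
A = 0.9 * np.eye(d_state)               # stable: eigenvalues inside the unit circle
B = rng.normal(size=(d_state, d_in))
C = rng.normal(size=(d_out, d_state))
ys = ssm_scan(A, B, C, rng.normal(size=(T, d_in)))
print(ys.shape)  # (8, 3)
```

The loop does constant work per step, which is where the O(N) claim comes from; trained SSMs replace the Python loop with a convolution or a parallel scan.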

S4
Module: Edifice.SSM.S4
Paper: Gu et al., "Efficiently Modeling Long Sequences with Structured State Spaces" (ICLR 2022)
Reference: arXiv:2111.00396
Origin: Stanford / Albert Gu

The foundational SSM architecture. S4 introduced HiPPO initialization — a mathematically principled way to initialize the state matrix A so it optimally compresses continuous signals into finite-dimensional state. Uses DPLR (Diagonal Plus Low-Rank) decomposition for efficient computation.

Strengths: Exceptional long-range modeling (Path-X, LRA benchmarks), mathematically principled initialization, parallelizable via convolution mode. Weaknesses: Complex implementation (DPLR decomposition), fixed (non-selective) dynamics, superseded by simpler variants. Who uses it: Research benchmarking, long-range dependency tasks, historical baseline for SSM comparisons.

S4D
Module: Edifice.SSM.S4D
Paper: Gu et al., "On the Parameterization and Initialization of Diagonal State Space Models"
Reference: arXiv:2206.11893
Origin: Stanford / Albert Gu

Simplified S4 with purely diagonal state matrix, eliminating the DPLR decomposition. Serves as the bridge between original S4 and modern SSMs like Mamba.

Strengths: Dramatically simpler than S4, nearly identical performance, fewer parameters. Weaknesses: Still uses fixed (non-selective) dynamics. Who uses it: Ablation studies, educational purposes, lightweight SSM needs.

S5
Module: Edifice.SSM.S5
Paper: Smith et al., "Simplified State Space Layers for Sequence Modeling" (ICLR 2023)
Reference: arXiv:2208.04933
Origin: Stanford / Scott Linderman's lab

Uses a single MIMO (Multi-Input Multi-Output) SSM instead of many parallel SISO systems. Simpler architecture with fewer parameters than Mamba while maintaining strong performance.

Strengths: Conceptually simple, efficient, good ablation baseline. Weaknesses: Fixed dynamics (not input-dependent like Mamba), less expressive. Who uses it: Research ablations to isolate what Mamba's selectivity contributes.

H3
Module: Edifice.SSM.H3
Paper: Fu et al., "Hungry Hungry Hippos: Towards Language Modeling with State Space Models" (ICLR 2023)
Reference: arXiv:2212.14052
Origin: Stanford / Christopher Ré

Combines two SSM layers with a short convolution and multiplicative gating. The "two-SSM + short conv" pattern bridges the gap between SSMs and Transformers on language modeling. One SSM captures local shifts, the other broader patterns, and a short convolution handles very local (1-4 token) patterns.

Strengths: Closes SSM-Transformer gap on language modeling, elegant multiplicative design. Weaknesses: More complex than single-SSM approaches, superseded by Mamba. Who uses it: Language modeling research, hybrid architecture exploration.

Hyena
Module: Edifice.SSM.Hyena
Paper: Poli et al., "Hyena Hierarchy: Towards Larger Convolutional Language Models" (ICML 2023)
Reference: arXiv:2302.10866
Origin: Together AI / Michael Poli

Replaces attention with a hierarchy of long convolutions and element-wise gating. Uses implicit filters (small MLPs that generate convolution kernels) for O(L log L) training via FFT and O(L) inference via recurrence.

Strengths: Sub-quadratic complexity, no attention needed, strong on language tasks. Weaknesses: FFT-based training can be tricky to optimize, long-conv requires careful implementation. Who uses it: Together AI (StripedHyena models), long-sequence processing.

Mamba
Module: Edifice.SSM.Mamba
Paper: Gu & Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" (2023)
Reference: arXiv:2312.00752
Origin: Carnegie Mellon & Princeton / Albert Gu & Tri Dao

The breakthrough SSM architecture. Mamba makes the SSM parameters (B, C, Δ) input-dependent (selective), enabling content-based reasoning that fixed SSMs cannot do. Uses a parallel associative scan for O(N) training.

Strengths: O(N) complexity, selective dynamics, excellent language/sequence modeling, simple gated block design, widely adopted. Weaknesses: Sequential scan limits GPU utilization vs attention's matmul-heavy approach, inference not yet as optimized as FlashAttention. Who uses it: Widely adopted — state-of-the-art for efficient sequence models, AI21 (Jamba), Zyphra (Zamba), many research groups.
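
A toy selective scan conveys the key idea; this is a heavily simplified sketch with hypothetical shapes and projections (not the Edifice.SSM.Mamba API), showing only that the step size, B, and C vary with the input:

```python
import numpy as np

def selective_scan(xs, A_diag, w_delta, W_B, W_C):
    """xs: (T, d_in). One diagonal SSM; delta, B_t, C_t computed from each x."""
    h = np.zeros_like(A_diag)
    ys = []
    for x in xs:
        delta = np.log1p(np.exp(w_delta @ x))  # softplus: positive step size
        A_bar = np.exp(delta * A_diag)         # input-dependent decay
        B_t = W_B @ x                          # input-dependent input projection
        C_t = W_C @ x                          # input-dependent output projection
        h = A_bar * h + delta * B_t            # B_t already carries the input
        ys.append(C_t @ h)                     # scalar output per step
    return np.array(ys)

rng = np.random.default_rng(1)
T, d_in, d_state = 10, 4, 8
ys = selective_scan(
    rng.normal(size=(T, d_in)),
    -np.abs(rng.normal(size=d_state)),         # negative real part: stability
    rng.normal(size=d_in),
    rng.normal(size=(d_state, d_in)),
    rng.normal(size=(d_state, d_in)),
)
print(ys.shape)  # (10,)
```

A fixed SSM like S4 would hold delta, B, and C constant across timesteps; making them functions of the input is what enables content-based selection.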

Mamba-2 (SSD)
Module: Edifice.SSM.MambaSSD
Paper: Dao & Gu, "Transformers are SSMs: Generalized Models and Efficient Algorithms" (2024)
Origin: Princeton / Tri Dao & Albert Gu

Reformulates Mamba as structured matrix multiplication (State Space Duality), enabling tensor core utilization. Splits sequences into chunks and uses dense matmul within chunks, tiny scan between chunks.

Strengths: 2-8x faster training than Mamba-1 via tensor cores, mathematically equivalent output. Weaknesses: Chunk boundary effects, XLA implementation less optimized than fused CUDA kernels. Who uses it: Production training of large SSM models.

Mamba-3
Module: Edifice.SSM.Mamba3
Paper: Extension of Gu & Dao (2023, 2024)
Origin: Carnegie Mellon / Gu & Dao

Extends Mamba with three innovations: (1) complex-valued state dynamics via 2×2 rotation matrices, (2) generalized trapezoidal discretization reducing approximation error, (3) MIMO rank-r updates for better hardware utilization.

Strengths: More expressive state dynamics, reduced discretization error, better GPU/TPU utilization. Weaknesses: More parameters and complexity than Mamba-1. Who uses it: Cutting-edge SSM research, high-performance sequence modeling.

Mamba (Cumsum)
Module: Edifice.SSM.MambaCumsum
Origin: Edifice experimental variant

Experimental Mamba variant for testing alternative scan algorithms (cumsum-based log-space reformulation). Currently uses Blelloch scan (same as regular Mamba) as the cumsum approach proved slower in XLA.

Strengths: Useful for scan algorithm experimentation. Weaknesses: No performance advantage over standard Mamba. Who uses it: Internal benchmarking and scan algorithm research.

Mamba (Hillis-Steele)
Module: Edifice.SSM.MambaHillisSteele
Origin: Edifice experimental variant

Mamba variant using Hillis-Steele parallel scan instead of Blelloch. O(L log L) work but ALL elements active at every level, maximizing GPU occupancy at the cost of more total operations.

Strengths: Maximum parallelism per scan level, potentially better GPU occupancy. Weaknesses: O(L log L) total work vs O(L) for Blelloch. Who uses it: GPU utilization experiments, comparison of scan algorithms.

BiMamba
Module: Edifice.SSM.BiMamba
Origin: Multiple concurrent works extending Mamba bidirectionally

Bidirectional Mamba that runs two parallel SSMs (forward and backward) for non-causal tasks. Combines outputs via concatenation and projection.

Strengths: Full sequence context for classification and offline analysis. Weaknesses: Not usable for causal/real-time tasks, 2x compute of standard Mamba. Who uses it: Sequence classification, offline analysis, fill-in-the-blank tasks.

GatedSSM
Module: Edifice.SSM.GatedSSM
Origin: Edifice simplified variant

Simplified gated temporal network inspired by SSMs but using sigmoid gating instead of true parallel scan. NOT a true Mamba implementation — uses mean pooling + projection instead of depthwise separable convolution.

Strengths: Numerically stable, lightweight, simple implementation, no NaN issues. Weaknesses: Less expressive than true Mamba, no parallel scan. Who uses it: Lightweight temporal processing, prototyping, when stability is paramount.

Jamba
Module: Edifice.SSM.Hybrid
Paper: AI21 Labs, "Jamba: A Hybrid Transformer-Mamba Language Model" (2024)
Origin: AI21 Labs

Hybrid architecture alternating Mamba blocks with periodic attention blocks. Most layers are O(L) Mamba blocks with attention at configurable intervals (default: every 4 layers) for long-range dependencies.

Strengths: Best of both worlds — Mamba efficiency with attention's long-range capability, configurable ratio. Weaknesses: More complex than pure Mamba, still has some O(L²) attention layers. Who uses it: AI21 Labs (Jamba-1.5 production model), hybrid architecture research.
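
The alternation can be sketched as a simple layer schedule (illustrative only; `interval` mirrors the default described above):

```python
def layer_schedule(n_layers, interval=4):
    """One attention block every `interval` layers; Mamba blocks elsewhere."""
    return ["attention" if i % interval == interval - 1 else "mamba"
            for i in range(n_layers)]

print(layer_schedule(8))  # ['mamba', 'mamba', 'mamba', 'attention', ...]
```

With 8 layers this places attention at positions 3 and 7, so only 2 of 8 layers pay the O(L²) cost.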

Zamba
Module: Edifice.SSM.Zamba
Paper: Zyphra, "Zamba: A Compact 7B SSM Hybrid Model" (2024)
Reference: arXiv:2405.16712
Origin: Zyphra

Like Jamba but uses a single shared attention layer with reused weights instead of multiple independent attention layers. This achieves 10x KV cache reduction versus Jamba.

Strengths: 10x KV cache reduction, fewer parameters, insight that attention mainly provides global information flow. Weaknesses: Less attention diversity than Jamba. Who uses it: Zyphra (Zamba production models), memory-constrained deployments.

StripedHyena
Module: Edifice.SSM.StripedHyena
Paper: Together AI, "StripedHyena: Moving Beyond Transformers with Hybrid Signal Processing Models" (2023)
Origin: Together AI

Interleaves Hyena long convolution blocks (even layers, global context) with gated depthwise convolution blocks (odd layers, local patterns). The striped pattern balances efficiency and expressivity.

Strengths: Better efficiency than pure Hyena, retains global context capability, attention-free. Weaknesses: Complex implementation, less widely adopted than Mamba-based hybrids. Who uses it: Together AI (StripedHyena-7B), attention-free model research.


Attention & Linear Attention (19 architectures)

The attention family spans the full evolution from O(N²) quadratic softmax attention to O(N) linear approximations. This is the largest family in Edifice, reflecting the central role of attention mechanisms in modern deep learning.

Multi-Head Attention
Module: Edifice.Attention.MultiHead
Paper: Vaswani et al., "Attention Is All You Need" (NeurIPS 2017)
Origin: Google Brain

The foundational attention mechanism. The module also implements sliding-window attention (O(N·K) for window size K over sequence length N) and a hybrid LSTM + attention mode, with an optional QK LayerNorm for stable training.

Strengths: Universally understood, highly expressive, hardware-optimized (FlashAttention). Weaknesses: O(N²) complexity for full attention, quadratic memory. Who uses it: Universal — the backbone of GPT, BERT, and virtually all modern LLMs.
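
For reference, the quadratic baseline that the rest of this family approximates or replaces, as a NumPy sketch (single head, no masking; not the Edifice API):

```python
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])        # (N, N): the O(N^2) bottleneck
    A = np.exp(scores - scores.max(-1, keepdims=True))
    A /= A.sum(-1, keepdims=True)                  # row-wise softmax
    return A @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(10, 8)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (10, 8)
```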

Grouped Query Attention (GQA)
Module: Edifice.Attention.GQA
Paper: Ainslie et al., "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints" (2023)
Origin: Google Research

Interpolation between Multi-Head Attention (MHA) and Multi-Query Attention (MQA). Groups of query heads share key/value heads, reducing KV cache size while maintaining near-MHA quality.

Strengths: Significant KV cache reduction, near-MHA quality, now standard in production LLMs. Weaknesses: Slightly lower quality than full MHA. Who uses it: LLaMA 2 70B, Mistral 7B, Gemma — now the default for production LLMs.

Multi-Head Latent Attention (MLA)
Module: Edifice.Attention.MLA
Paper: DeepSeek-AI, "DeepSeek-V2" (2024)
Reference: arXiv:2405.04434
Origin: DeepSeek

Compresses key-value representations into low-rank latent vectors, reconstructing K,V on-the-fly during attention. Also uses decoupled RoPE to keep position information separate from compressed content.

Strengths: Dramatic KV cache reduction (beyond GQA), maintains attention quality via low-rank reconstruction. Weaknesses: Additional computation for KV reconstruction, complex implementation. Who uses it: DeepSeek-V2/V3 production models.

Differential Transformer
Module: Edifice.Attention.DiffTransformer
Paper: Ye et al., "Differential Transformer" (Microsoft Research, 2024)
Origin: Microsoft Research

Computes two independent attention maps per head and subtracts them. Shared noise patterns (tokens that universally attract attention) cancel out, amplifying signal-to-noise ratio — analogous to differential amplifiers in electronics.

Strengths: Better signal-to-noise ratio, reduces attention noise, same complexity as standard attention. Weaknesses: 2x attention score computation per head (minor overhead), relatively new. Who uses it: Microsoft Research, noise-sensitive attention applications.

RetNet
Module: Edifice.Attention.RetNet
Paper: Sun et al., "Retentive Network: A Successor to Transformer for Large Language Models" (Microsoft, 2023)
Reference: arXiv:2307.08621
Origin: Microsoft Research

Replaces attention with "retention" — a decay-based mechanism supporting three computation modes: parallel (training, O(L²)), recurrent (inference, O(1) per token), and chunkwise (long sequences, O(L)).

Strengths: Triple paradigm (parallel/recurrent/chunkwise), O(1) inference, multi-scale decay rates. Weaknesses: Decay-based mechanism may lose fine-grained long-range patterns, less widely adopted than attention. Who uses it: Microsoft (RetNet research models), efficient inference research.
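
The parallel/recurrent duality is easy to verify on a toy example (single head, scalar decay; an illustrative sketch, not the Edifice API):

```python
import numpy as np

def retention_parallel(Q, K, V, gamma):
    """Training mode: one masked matmul with a causal decay matrix D."""
    T = Q.shape[0]
    n, m = np.arange(T)[:, None], np.arange(T)[None, :]
    D = np.where(n >= m, gamma ** (n - m), 0.0)   # D[n, m] = gamma^(n-m), n >= m
    return (D * (Q @ K.T)) @ V

def retention_recurrent(Q, K, V, gamma):
    """Inference mode: O(1)-per-token update of a running state matrix."""
    S = np.zeros((K.shape[1], V.shape[1]))
    out = []
    for q, k, v in zip(Q, K, V):
        S = gamma * S + np.outer(k, v)            # decay old state, add new pair
        out.append(q @ S)
    return np.stack(out)

rng = np.random.default_rng(2)
Q, K, V = (rng.normal(size=(6, 4)) for _ in range(3))
assert np.allclose(retention_parallel(Q, K, V, 0.9),
                   retention_recurrent(Q, K, V, 0.9))
```

Both modes compute the same outputs; chunkwise mode simply mixes the two at chunk granularity.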

RWKV-7
Module: Edifice.Attention.RWKV
Paper: Peng et al., "RWKV: Reinventing RNNs for the Transformer Era"
Reference: arXiv:2305.13048
Origin: Bo Peng / RWKV community

Linear attention architecture with O(1) space complexity per inference step. RWKV-7 ("Goose") uses a generalized delta rule, giving it expressivity beyond the TC0 complexity class that bounds standard transformers. Combines WKV (weighted key-value) attention with a channel-mixing FFN using R- and K-gates.

Strengths: O(1) inference, parallelizable training, shipped to 1.5B Windows devices, large community. Weaknesses: O(1) state may lose long-range fine detail, community-driven (less corporate backing). Who uses it: RWKV community (14B+ parameter models), Microsoft (on-device Copilot).

GLA
Module: Edifice.Attention.GLA
Paper: "Gated Linear Attention Transformers with Hardware-Efficient Training"
Origin: Flash-linear-attention project

Linear attention with data-dependent gating for improved expressiveness. Particularly effective on short sequences (<2K tokens) where it can outperform FlashAttention-2.

Strengths: O(L) complexity, data-dependent gating, competitive with attention on short sequences, native tensor ops. Weaknesses: Less effective on very long sequences, less widely adopted. Who uses it: Efficient attention research, short-sequence modeling.

HGRN-2
Module: Edifice.Attention.HGRN
Paper: "HGRN2: Gated Linear RNNs with State Expansion"
Reference: arXiv:2404.07904
Origin: Linear attention research community

Linear RNN with hierarchical gating and state expansion. Expands hidden state dimension during recurrence for richer internal representation, then contracts back.

Strengths: O(1) inference, O(L) training, state expansion for richer representations. Weaknesses: Less expressive than attention, state expansion increases compute. Who uses it: Efficient sequence modeling research.

Griffin/Hawk
Module: Edifice.Attention.Griffin
Paper: De et al., "Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models" (2024)
Reference: arXiv:2402.19427
Origin: Google DeepMind

Hybrid architecture using Real-Gated Linear Recurrent Unit (RG-LRU) with periodic local attention. Pattern: 2 RG-LRU blocks → 1 Local Attention block → repeat. The sqrt(1-a²) term preserves hidden state norm for stable long-sequence training.

Strengths: Simpler than Mamba (just gates, no SSM projections), stable training, Google DeepMind backing. Weaknesses: Local attention windows limit truly global reasoning, less widely adopted than Mamba. Who uses it: Google DeepMind (RecurrentGemma), efficient language model research.

Perceiver
Module: Edifice.Attention.Perceiver
Paper: Jaegle et al., "Perceiver IO: A General Architecture for Structured Inputs & Outputs" (DeepMind, 2021)
Origin: Google DeepMind

Uses cross-attention to map arbitrary-size inputs to a fixed-size learned latent array, then self-attends over latents. Total complexity: O(N·M + M²) where M = num_latents << N.

Strengths: Input-agnostic (works with any modality), decouples compute from input size, elegant design. Weaknesses: Latent bottleneck may lose fine-grained detail, slower than specialized architectures. Who uses it: Multi-modal learning, variable-length input processing, DeepMind research.
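
A sketch of the latent bottleneck (single head, projections omitted for brevity; illustrative, not the Edifice API):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attend(latents, inputs):
    """M latents query N inputs: the score matrix is (M, N), never (N, N)."""
    A = softmax(latents @ inputs.T)
    return A @ inputs

rng = np.random.default_rng(3)
N, M, d = 1000, 32, 16
latents = rng.normal(size=(M, d))   # learned, fixed-size latent array
inputs = rng.normal(size=(N, d))    # arbitrary-length input
out = cross_attend(latents, inputs)
print(out.shape)  # (32, 16)
```

After this step, all subsequent self-attention operates on the 32 latents regardless of whether the input had a thousand or a million tokens.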

FNet
Module: Edifice.Attention.FNet
Paper: Lee-Thorp et al., "FNet: Mixing Tokens with Fourier Transforms" (Google, 2021)
Origin: Google Research

Replaces self-attention with an unparameterized Fourier Transform for token mixing. The FFT provides global token mixing at O(N log N) with zero learnable attention parameters.

Strengths: ~7x faster training than attention, zero attention parameters, 92-97% of BERT quality. Weaknesses: Slightly lower quality than attention, no learnable selectivity. Who uses it: Efficiency research, speed-critical applications where near-attention quality suffices.
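
The mixing step itself is essentially one line (sketch; not the Edifice API):

```python
import numpy as np

def fourier_mix(x):
    """FNet-style mixing: real part of a 2D FFT over sequence and hidden dims."""
    return np.fft.fft2(x).real   # same shape as x, zero learnable parameters

x = np.arange(12, dtype=float).reshape(4, 3)   # (seq_len, d_model)
y = fourier_mix(x)
print(y.shape)  # (4, 3)
```

Every output position depends on every input position, so tokens are globally mixed even though there is nothing to learn in this sublayer.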

Linear Transformer
Module: Edifice.Attention.LinearTransformer
Paper: Katharopoulos et al., "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention" (2020)
Origin: EPFL

Replaces softmax attention with kernel-based linear attention using the ELU+1 feature map. Avoids the N×N attention matrix entirely by computing phi(K)ᵀ·V first (a d×d matrix).

Strengths: O(N) complexity, causal formulation equivalent to an RNN, simple implementation. Weaknesses: Lower quality than softmax attention (feature map approximation), best when N > d. Who uses it: Long-sequence processing, real-time applications, efficiency research.
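
The reordering trick in NumPy (non-causal sketch; not the Edifice API): with the ELU+1 feature map, `(phi(Q) phi(K)^T) V` equals `phi(Q) (phi(K)^T V)`, and the second form never materializes the N×N matrix.

```python
import numpy as np

def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))   # ELU(x) + 1: always positive

def linear_attention(Q, K, V):
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                    # (d, d_v): the only O(N) contraction
    Z = Qf @ Kf.sum(axis=0)          # per-query normalizer
    return (Qf @ KV) / Z[:, None]

rng = np.random.default_rng(4)
Q, K, V = (rng.normal(size=(16, 8)) for _ in range(3))
# Identical to materializing the 16 x 16 attention matrix explicitly:
A = phi(Q) @ phi(K).T
assert np.allclose(linear_attention(Q, K, V),
                   (A / A.sum(-1, keepdims=True)) @ V)
```

The causal version maintains `KV` as a running sum, which is exactly the RNN formulation named in the paper's title.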

Nystromformer
Module: Edifice.Attention.Nystromformer
Paper: Xiong et al., "Nystromformer: A Nystrom-Based Algorithm for Approximating Self-Attention" (AAAI 2021)
Origin: University of Wisconsin

Approximates the full attention matrix using Nyström method with landmark points. Samples M landmarks and reconstructs attention through them: A ≈ F1 · pinv(F2) · F3.

Strengths: O(N·M) complexity, good attention quality with few landmarks (M=32-64 typical). Weaknesses: Kernel inverse O(M³) overhead, landmark selection heuristic. Who uses it: Efficient attention research, long-document processing.
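
A sketch of the reconstruction (strided landmark picks for simplicity, where the paper uses segment means; illustrative, not the Edifice API):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def nystrom_attention(Q, K, V, m=8):
    idx = np.linspace(0, Q.shape[0] - 1, m).astype(int)   # m landmark rows
    Qm, Km = Q[idx], K[idx]
    F1 = softmax(Q @ Km.T)    # (N, m): queries vs landmark keys
    F2 = softmax(Qm @ Km.T)   # (m, m): landmark core, pseudo-inverted
    F3 = softmax(Qm @ K.T)    # (m, N): landmark queries vs all keys
    return F1 @ np.linalg.pinv(F2) @ (F3 @ V)

rng = np.random.default_rng(5)
Q, K, V = (rng.normal(size=(32, 8)) for _ in range(3))
out = nystrom_attention(Q, K, V)
print(out.shape)  # (32, 8)
```

Only (N, m) and (m, m) matrices are ever formed, giving the O(N·M) cost with the O(M³) pseudo-inverse overhead noted above.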

Performer
Module: Edifice.Attention.Performer
Paper: Choromanski et al., "Rethinking Attention with Performers" (ICLR 2021)
Origin: Google Research

Approximates softmax attention using FAVOR+ (Fast Attention Via positive Orthogonal Random features). Uses orthogonal random features to approximate the exponential kernel.

Strengths: O(N) time and space, mathematically principled approximation, unbiased estimator. Weaknesses: Approximation quality degrades for very sharp attention patterns, random features add variance. Who uses it: Long-sequence tasks, Google Research, efficient attention research.

Mega
Module: Edifice.Attention.Mega
Paper: Ma et al., "Mega: Moving Average Equipped Gated Attention" (ICLR 2023)
Reference: arXiv:2209.10655
Origin: Meta AI

Combines exponential moving averages (EMA) for local context with single-head gated attention for global context. Sub-quadratic complexity: EMA is O(L·D_ema) and only a single attention head.

Strengths: Sub-quadratic complexity, elegant EMA + attention combination, strong benchmark results. Weaknesses: Single-head attention may limit global context diversity. Who uses it: Meta AI research, efficient sequence modeling.

Based
Module: Edifice.Attention.Based
Paper: Arora et al., "Simple linear attention language models balance the recall-throughput tradeoff" (2024)
Origin: Stanford / Hazy Research

Linear attention with Taylor expansion feature map. Replaces softmax(QK^T) with polynomial feature map phi(x) = [1, x, x²/√2!, ...] for linear-time computation.

Strengths: O(N·d²·p) complexity (p=Taylor order, typically 2-3), simple implementation, good recall. Weaknesses: Quality depends on Taylor order, higher orders increase compute. Who uses it: Efficient language model research, recall-throughput optimization.
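
The order-2 feature map can be checked exactly (sketch; not the Edifice API): it turns the Taylor approximation of exp(q·k) into a plain dot product of features.

```python
import numpy as np

def phi(x):
    """Order-2 Taylor map: phi(q) . phi(k) = 1 + q.k + (q.k)^2 / 2."""
    return np.concatenate([[1.0], x, np.outer(x, x).ravel() / np.sqrt(2)])

rng = np.random.default_rng(6)
q, k = rng.normal(size=4), rng.normal(size=4)
s = q @ k
assert np.isclose(phi(q) @ phi(k), 1 + s + s**2 / 2)
```

Since the approximation is a dot product of fixed features, the same `phi(K)^T V` reordering as in other linear attention variants applies, at feature dimension 1 + d + d².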

InfiniAttention
Module: Edifice.Attention.InfiniAttention
Paper: Munkhdalai et al., "Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention" (2024)
Origin: Google

Extends standard attention with a compressive memory system for unbounded context length. Each layer maintains a learnable memory matrix that accumulates information from past segments, blended with local attention via a learnable gate.

Strengths: Effectively unbounded context, combines fine-grained local and compressed global information. Weaknesses: Memory compression loses detail, more complex than standard attention. Who uses it: Long-context modeling, infinite-context research.

Conformer
Module: Edifice.Attention.Conformer
Paper: Gulati et al., "Conformer: Convolution-augmented Transformer for Speech Recognition" (2020)
Origin: Google

Macaron-style architecture combining self-attention with convolution: Half-FFN → MHSA → Conv Module → Half-FFN. Captures both global (attention) and local (convolution) patterns.

Strengths: State-of-the-art for speech recognition, captures both local and global patterns elegantly. Weaknesses: More complex block design, designed for audio/speech (less general). Who uses it: Speech recognition (Google, many ASR systems), audio processing.

Ring Attention
Module: Edifice.Attention.RingAttention
Paper: Liu et al., "Ring Attention with Blockwise Transformers for Near-Infinite Context" (2023)
Reference: arXiv:2310.01889
Origin: UC Berkeley

Splits sequences into chunks and processes attention in a rotating ring pattern. Each query chunk attends to all key/value chunks in ring communication order. Enables distributed attention across devices.

Strengths: Near-infinite context on distributed systems, memory-efficient chunked attention. Weaknesses: Communication overhead in distributed setting, single-device version is equivalent to chunked attention. Who uses it: Distributed training, very long context processing.


Recurrent Networks (10 architectures)

The "recurrence renaissance" — a revival of RNN architectures with modern innovations like parallel scans, exponential gating, and test-time learning. These architectures combine the O(1) inference cost of RNNs with training parallelism.

LSTM
Module: Edifice.Recurrent (cell_type: :lstm)
Paper: Hochreiter & Schmidhuber, "Long Short-Term Memory" (1997)
Origin: TU Munich

The classic gated recurrent network with forget, input, and output gates plus a cell state. Still one of the most widely deployed sequence models in production.

Strengths: Well-understood, robust, extensive tooling, good for moderate sequences. Weaknesses: Purely sequential (not parallelizable), vanishing gradients on very long sequences, O(N) training. Who uses it: Production systems everywhere — NLP, speech, time series, games.

GRU
Module: Edifice.Recurrent (cell_type: :gru)
Paper: Cho et al., "Learning Phrase Representations using RNN Encoder-Decoder" (2014)
Origin: Yoshua Bengio's lab

Simplified LSTM with two gates (reset, update) instead of three. Merges cell and hidden state. Fewer parameters with comparable performance on many tasks.

Strengths: Simpler than LSTM, fewer parameters, comparable performance. Weaknesses: Same parallelism limitations as LSTM. Who uses it: Edge deployment, resource-constrained environments, sequence modeling.

xLSTM
Module: Edifice.Recurrent.XLSTM
Paper: Beck et al., "xLSTM: Extended Long Short-Term Memory" (NeurIPS 2024)
Reference: arXiv:2405.04517
Origin: NXAI / Sepp Hochreiter

Addresses three LSTM limitations with: (1) exponential gating for storage revision, (2) matrix memory (mLSTM) for increased capacity, (3) parallelizable mLSTM via covariance updates. Two variants: sLSTM (sequential, state-tracking) and mLSTM (parallelizable, memorization).

Strengths: Bridges LSTM and modern architectures, exponential gating, matrix memory, parallelizable mLSTM. Weaknesses: More complex than original LSTM, sLSTM variant still sequential. Who uses it: NXAI research, LSTM modernization research.

mLSTM
Module: Edifice.Recurrent.XLSTM (variant: :mlstm)
Paper: Beck et al., "xLSTM" (NeurIPS 2024)
Origin: NXAI / Sepp Hochreiter

The parallelizable variant of xLSTM with matrix memory cell: C_t = f_t · C_{t-1} + i_t · (v_t · k_t^T). Key-value storage mechanism similar to linear attention, fully parallelizable during training.

Strengths: Parallelizable, matrix memory for higher capacity, key-value storage. Weaknesses: Higher memory than scalar LSTM, primarily a memorization engine. Who uses it: Parallel-trainable RNN research, memorization-heavy tasks.

MinGRU
Module: Edifice.Recurrent.MinGRU
Paper: Feng et al., "Were RNNs All We Needed?" (2024)
Reference: arXiv:2410.01201
Origin: Research community

Strips GRU down to a single gate: z_t = σ(W_z · x_t), h_t = (1 − z_t) · h_{t-1} + z_t · (W_h · x_t). The gate depends only on the input (not the hidden state), enabling parallel prefix scan training. ~30 lines of core logic.

Strengths: Extremely simple (~30 lines), parallel-scannable, retains core gating mechanism. Weaknesses: Less expressive than full GRU (no hidden-to-hidden gate dependency). Who uses it: Minimalist RNN research, parallel-scannable RNN applications.
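
The recurrence in NumPy (sequential form for clarity; not the Edifice API):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def min_gru(xs, W_z, W_h):
    h = np.zeros(W_h.shape[0])
    hs = []
    for x in xs:
        z = sigmoid(W_z @ x)            # gate: depends on the input only
        h_tilde = W_h @ x               # candidate state: input only
        h = (1 - z) * h + z * h_tilde   # convex blend of old and new
        hs.append(h)
    return np.stack(hs)

rng = np.random.default_rng(7)
hs = min_gru(rng.normal(size=(7, 3)),
             rng.normal(size=(5, 3)), rng.normal(size=(5, 3)))
print(hs.shape)  # (7, 5)
```

Because z_t and the candidate W_h·x_t involve no hidden state, they can be computed for all timesteps at once, leaving a linear recurrence in h that a parallel prefix scan evaluates in O(log T) depth.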

MinLSTM
Module: Edifice.Recurrent.MinLSTM
Paper: Feng et al., "Were RNNs All We Needed?" (2024)
Reference: arXiv:2410.01201
Origin: Research community

Simplified LSTM with normalized gates (f + i = 1), no output gate, and no hidden-to-hidden gate dependency. Cell state IS the hidden state. Parallel-scannable during training.

Strengths: Parallel-scannable, normalized gates ensure stability, simpler than LSTM. Weaknesses: Less expressive than full LSTM (no output gate, no hidden-dependent gates). Who uses it: Parallel RNN research, stable training applications.

DeltaNet
Module: Edifice.Recurrent.DeltaNet
Paper: Schlag et al., "Linear Transformers Are Secretly Fast Weight Programmers" (ICML 2021)
Reference: arXiv:2102.11174
Origin: IDSIA

Linear attention with the delta rule update. Maintains an associative memory matrix S updated by error correction: S_t = S_{t-1} + β · (v_t − S_{t-1} · k_t) · k_t^T. Subtracts the current retrieval before adding, giving superior retrieval accuracy.

Strengths: Error-correcting memory updates, better retrieval than standard linear attention, O(d²) memory. Weaknesses: Sequential memory update, learnable beta adds complexity. Who uses it: In-context learning research, associative memory applications.
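
The error-correcting behaviour is easy to demonstrate (sketch with a unit-norm key; not the Edifice API): repeatedly writing the same key-value pair drives the retrieval error toward zero instead of accumulating interference.

```python
import numpy as np

def delta_update(S, k, v, beta):
    pred = S @ k                              # what the memory currently returns
    return S + beta * np.outer(v - pred, k)   # write back only the error

rng = np.random.default_rng(8)
S = np.zeros((4, 4))
k = rng.normal(size=4)
k /= np.linalg.norm(k)                        # unit-norm key
v = rng.normal(size=4)
for _ in range(50):
    S = delta_update(S, k, v, beta=0.5)

assert np.allclose(S @ k, v, atol=1e-6)       # retrieval has converged to v
```

With a unit-norm key, each update scales the residual by (1 − β), so a plain additive update (which just keeps adding v·k^T) would diverge where this converges.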

TTT (Test-Time Training)
Module: Edifice.Recurrent.TTT
Paper: Sun et al., "Learning to (Learn at Test Time): RNNs with Expressive Hidden States" (2024)
Reference: arXiv:2407.04620
Origin: Stanford / Yu Sun

The hidden state IS a model (a weight matrix) that is updated via self-supervised gradient steps at each token. TTT-Linear is mathematically equivalent to linear attention with the delta rule.

Strengths: Infinite expressiveness (hidden state is a full model), self-supervised adaptation at inference. Weaknesses: Sequential per-token gradient steps, computationally expensive, requires careful initialization. Who uses it: Test-time adaptation research, self-supervised sequence modeling.

Titans
Module: Edifice.Recurrent.Titans
Paper: Behrouz et al., "Titans: Learning to Memorize at Test Time" (2025)
Reference: arXiv:2501.00663
Origin: Google Research

Extends TTT with surprise-gated memory: the memory update magnitude scales with prediction error. Higher surprise → larger updates. Uses gradient momentum for smoother evolution.

Strengths: Surprise-gated adaptive memory, momentum-based updates, builds on TTT insights. Weaknesses: Sequential, expensive per-token computation, very recent. Who uses it: Cutting-edge memory-augmented sequence research.

Reservoir
Module: Edifice.Recurrent.Reservoir
Paper: Jaeger, "Echo State Networks" (2001); Maass et al., "Real-Time Computing Without Stable States" (2002)
Origin: Herbert Jaeger / Wolfgang Maass

Echo State Networks with fixed random reservoir weights. Only the readout (output) layer is trained, making training extremely fast. Spectral radius < 1 ensures the echo state property.

Strengths: Extremely fast training (only linear layer), simple, good for time series. Weaknesses: Random reservoir limits expressiveness, can't learn complex representations. Who uses it: Time series forecasting, chaotic system modeling, resource-constrained environments.
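
An end-to-end ESN sketch on a tiny next-step prediction task (illustrative hyperparameters; not the Edifice API). Note that only `w_out` is ever fit:

```python
import numpy as np

rng = np.random.default_rng(9)
n_res, T = 50, 300
W = rng.normal(size=(n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # rescale spectral radius to 0.9
W_in = rng.normal(size=n_res)

u = np.sin(np.arange(T) * 0.2)                   # input signal
h = np.zeros(n_res)
states = []
for t in range(T):
    h = np.tanh(W @ h + W_in * u[t])             # fixed, untrained reservoir
    states.append(h.copy())

X = np.stack(states[:-1])
y = u[1:]                                        # target: next input value
w_out, *_ = np.linalg.lstsq(X, y, rcond=None)    # the ONLY trained weights
mse = np.mean((X @ w_out - y) ** 2)
print(f"train MSE: {mse:.2e}")
```

Training reduces to one least-squares solve over the recorded states, which is why ESN training is orders of magnitude faster than backpropagation through time.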


Transformer (1 architecture)

DecoderOnly
Module: Edifice.Transformer.DecoderOnly
Paper: Radford et al. (GPT-2, 2019); Brown et al. (GPT-3, 2020); Touvron et al. (LLaMA, 2023)
Origin: OpenAI / Meta

GPT-style decoder-only transformer combining modern LLM techniques: GQA for efficient KV cache, RoPE for position encoding, SwiGLU gated FFN, and RMSNorm. This is the backbone architecture of all major production LLMs.

Strengths: The de facto standard for autoregressive generation, extensively optimized, hardware-friendly. Weaknesses: O(N²) attention, requires large parameter counts for best results. Who uses it: GPT-4, Claude, LLaMA, Gemini — virtually all production LLMs.


Perception

Vision (9 architectures)

Vision architectures process images (and related spatial data) through various token-mixing strategies — from attention (ViT) to convolution (ConvNeXt) to pure pooling (PoolFormer) to coordinate mapping (NeRF).

ViT (Vision Transformer)
Module: Edifice.Vision.ViT
Paper: Dosovitskiy et al., "An Image is Worth 16x16 Words" (ICLR 2021)
Origin: Google Brain

Treats an image as a sequence of fixed-size patches, linearly embeds each, prepends a CLS token, adds position embeddings, and processes through standard transformer blocks.

Strengths: Simple, scalable, strong with large datasets, established paradigm. Weaknesses: Requires large datasets (less data-efficient than CNNs), O(N²) in patch count. Who uses it: Google, Meta, virtually all vision research since 2021.
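
The patchify-and-embed step in NumPy (sketch with hypothetical sizes; CLS token and position embeddings omitted; not the Edifice API):

```python
import numpy as np

def patchify(img, P):
    """Split an (H, W, C) image into flattened P x P patches."""
    H, W, C = img.shape
    patches = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, P * P * C)        # (num_patches, P*P*C)

rng = np.random.default_rng(10)
img = rng.normal(size=(32, 32, 3))
embed = rng.normal(size=(8 * 8 * 3, 64))         # linear patch embedding
tokens = patchify(img, 8) @ embed
print(tokens.shape)  # (16, 64)
```

From here on the image is just a 16-token sequence, and everything else is a standard transformer.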

DeiT
Module: Edifice.Vision.DeiT
Paper: Touvron et al., "Training data-efficient image transformers & distillation through attention" (ICML 2021)
Origin: Meta AI / Facebook

Data-efficient ViT with a distillation token that learns from a teacher model. CLS token for classification, distillation token for teacher alignment; outputs averaged at inference.

Strengths: Data-efficient (works with ImageNet-1K, no JFT), knowledge distillation built in. Weaknesses: Requires a teacher model for distillation, extra token adds minor overhead. Who uses it: Vision tasks with limited data, knowledge distillation research.

Swin Transformer
ModuleEdifice.Vision.SwinTransformer
PaperLiu et al., "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows" (ICCV 2021)
OriginMicrosoft Research Asia

Hierarchical ViT computing attention within local windows, with shifted windows between layers for cross-window connections. Produces multi-scale feature maps like a CNN.

Strengths: Linear complexity in image size, hierarchical features for dense prediction, versatile backbone. Weaknesses: Window-based attention limits global receptive field, complex implementation. Who uses it: Object detection (COCO), semantic segmentation, Microsoft Research.

U-Net
ModuleEdifice.Vision.UNet
PaperRonneberger et al., "U-Net: Convolutional Networks for Biomedical Image Segmentation" (MICCAI 2015)
OriginUniversity of Freiburg

Symmetric encoder-decoder with skip connections concatenating encoder features at each level with decoder features. Uses real 2D convolutions, max-pooling, and transposed convolutions.

Strengths: Excellent for segmentation, preserves fine spatial detail via skip connections, well-proven. Weaknesses: Memory-intensive (skip connections double feature maps), primarily for dense prediction. Who uses it: Medical imaging, diffusion model backbones (Stable Diffusion), segmentation tasks.

ConvNeXt
ModuleEdifice.Vision.ConvNeXt
PaperLiu et al., "A ConvNet for the 2020s" (CVPR 2022)
OriginMeta AI / Facebook

Modernized ResNet applying transformer-era techniques: depthwise-separable convolutions, inverted bottleneck, GELU, LayerNorm, fewer activations, LayerScale. Proves CNNs can match ViT quality.

Strengths: CNN simplicity with ViT-level accuracy, no attention needed, efficient. Weaknesses: Limited global receptive field (vs attention), less flexible than ViT. Who uses it: Vision backbone when attention isn't needed, efficiency-focused vision.

MLP-Mixer
ModuleEdifice.Vision.MLPMixer
PaperTolstikhin et al., "MLP-Mixer: An all-MLP Architecture for Vision" (NeurIPS 2021)
OriginGoogle Brain

Pure MLP architecture: token-mixing MLPs (across patches) and channel-mixing MLPs (within patches), alternated. No attention, no convolution.

Strengths: Extremely simple, demonstrates that MLPs alone can be competitive, fast. Weaknesses: Requires large datasets, slightly below ViT/CNN on standard benchmarks. Who uses it: Architecture research, simplicity-focused applications.
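One Mixer block can be sketched in NumPy (a sketch of the mechanism, not the Edifice API): a token-mixing MLP applied across the patch axis, then a channel-mixing MLP applied within each patch. Layer norms are omitted for brevity, and ReLU stands in for the paper's GELU.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, dim = 16, 32
x = rng.normal(size=(n_tokens, dim))

def mlp(y, w1, w2):
    return np.maximum(y @ w1, 0) @ w2     # Linear -> ReLU -> Linear

W1t, W2t = rng.normal(size=(n_tokens, 64)) * 0.1, rng.normal(size=(64, n_tokens)) * 0.1
W1c, W2c = rng.normal(size=(dim, 128)) * 0.1, rng.normal(size=(128, dim)) * 0.1

x = x + mlp(x.T, W1t, W2t).T   # token mixing: operates along the patch axis
x = x + mlp(x, W1c, W2c)       # channel mixing: operates along the feature axis
```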

FocalNet
ModuleEdifice.Vision.FocalNet
PaperYang et al., "Focal Modulation Networks" (NeurIPS 2022)
ReferencearXiv:2203.11926
OriginMicrosoft Research

Replaces self-attention with focal modulation — hierarchical depthwise convolutions aggregating context at multiple granularity levels with gated aggregation.

Strengths: Captures both local and global context without attention, simple yet effective. Weaknesses: Less widely adopted, hierarchical convolutions add parameters. Who uses it: Vision tasks where attention-free alternatives are preferred.

PoolFormer
ModuleEdifice.Vision.PoolFormer
PaperYu et al., "MetaFormer is Actually What You Need for Vision" (CVPR 2022)
ReferencearXiv:2111.11418
OriginSea AI Lab

Replaces self-attention with simple average pooling. Demonstrates that the MetaFormer architecture (norm → mixer → residual → norm → FFN → residual) matters more than the specific attention mechanism.

Strengths: Extremely simple token mixer, competitive accuracy, very fast, proves MetaFormer thesis. Weaknesses: No learnable token mixing, limited on tasks requiring precise attention. Who uses it: Efficiency research, MetaFormer architecture studies.

NeRF
ModuleEdifice.Vision.NeRF
PaperMildenhall et al., "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis" (ECCV 2020)
ReferencearXiv:2003.08934
OriginUC Berkeley / Google

Maps 3D coordinates (and optional viewing directions) to color and density values. Uses Fourier positional encoding and an MLP with skip connections. Unlike other vision models, takes raw coordinates rather than image inputs.

Strengths: Novel view synthesis from sparse images, implicit 3D representation, elegant design. Weaknesses: Slow rendering (per-ray MLP evaluation), requires known camera poses. Who uses it: 3D reconstruction, virtual reality, autonomous driving, visual effects.


Convolutional (6 architectures)

Classic convolutional networks for local pattern extraction. Despite the rise of transformers, CNNs remain relevant for efficiency, mobile deployment, and well-understood training dynamics.

Conv1D/2D
ModuleEdifice.Convolutional.Conv
OriginFoundational CNN building blocks

Configurable convolution blocks following the pattern: convolution → batch normalization → activation → dropout. Both 1D (sequence) and 2D (image) variants with optional pooling.

Strengths: Flexible building blocks, composable, well-understood. Weaknesses: Building blocks only — need composition for full architectures. Who uses it: Custom architecture building, feature extraction layers.

ResNet
ModuleEdifice.Convolutional.ResNet
PaperHe et al., "Deep Residual Learning for Image Recognition" (CVPR 2016)
OriginMicrosoft Research

Deep residual networks using skip connections. Residual and bottleneck block variants. Configurations from ResNet-18 (~11M params) to ResNet-152 (~60M params).

Strengths: Enables very deep networks (100+ layers), well-proven, extensive pretrained models available. Weaknesses: Large models, older architecture compared to ConvNeXt/ViT. Who uses it: Still widely used as backbone, benchmark baseline, transfer learning source.

DenseNet
ModuleEdifice.Convolutional.DenseNet
PaperHuang et al., "Densely Connected Convolutional Networks" (CVPR 2017)
OriginCornell / Tsinghua

Each layer receives the feature maps of all preceding layers via concatenation. Encourages feature reuse and reduces parameter count. Transition layers compress features between dense blocks.

Strengths: Feature reuse, parameter efficient, strong gradient flow. Weaknesses: Memory-intensive (concatenated features), slower than ResNet. Who uses it: Medical imaging, small-dataset applications, feature-reuse research.

TCN
ModuleEdifice.Convolutional.TCN
PaperBai et al., "An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling" (2018)
OriginCMU

Dilated causal convolutions with exponentially growing dilation rates (1, 2, 4, 8...). Receptive field grows exponentially with depth while parameters grow linearly. Causal: output at time t depends only on inputs ≤ t.

Strengths: Parallelizable (unlike RNNs), causal, flexible receptive field, residual connections. Weaknesses: Fixed receptive field (must be designed for task), no content-based selection. Who uses it: Audio processing, time series, real-time sequence classification.
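The receptive-field claim is plain arithmetic, sketched here (not the Edifice API): with kernel size k and dilations 1, 2, 4, ..., each layer adds (k-1)·dilation steps of history.

```python
def tcn_receptive_field(kernel_size, num_layers):
    """Receptive field of a stack of dilated causal convolutions."""
    rf = 1
    for layer in range(num_layers):
        rf += (kernel_size - 1) * (2 ** layer)   # dilation doubles each layer
    return rf

# kernel 3, 8 layers: 1 + 2*(1+2+4+...+128) = 511 time steps
rf = tcn_receptive_field(3, 8)
```

Eight layers of a kernel-3 TCN already see about 500 steps of context, while parameter count grows only linearly in depth.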

MobileNet
ModuleEdifice.Convolutional.MobileNet
PaperHoward et al., "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications" (2017)
ReferencearXiv:1704.04861
OriginGoogle

Depthwise separable convolutions factor a standard convolution into per-channel (depthwise) and pointwise operations, reducing computation to roughly 1/output_channels + 1/kernel_size² of a standard convolution's cost.

Strengths: Extremely efficient, designed for mobile/edge deployment, width multiplier for scaling. Weaknesses: Lower accuracy ceiling than full convolutions, approximated as dense layers in Edifice. Who uses it: Mobile applications, edge deployment, real-time inference.
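The cost ratio from the paper can be checked as arithmetic (a sketch, not the Edifice API):

```python
def separable_cost_ratio(out_channels, kernel_size):
    """Multiply-add cost of a depthwise-separable conv relative to a standard conv."""
    return 1 / out_channels + 1 / kernel_size ** 2

ratio = separable_cost_ratio(256, 3)   # 3x3 kernel, 256 output channels
# ratio ~ 0.115, i.e. roughly 8.7x fewer multiply-adds
```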

EfficientNet
ModuleEdifice.Convolutional.EfficientNet
PaperTan & Le, "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks" (ICML 2019)
ReferencearXiv:1905.11946
OriginGoogle Brain

Compound scaling of depth, width, and resolution with fixed scaling coefficients. Core building block is MBConv (Mobile Inverted Bottleneck) with squeeze-and-excitation attention.

Strengths: Principled scaling strategy, excellent accuracy/efficiency tradeoff, SE attention. Weaknesses: Complex scaling rules, approximated as dense layers in Edifice (designed for images). Who uses it: Efficient image classification, mobile deployment, scaling research.


Structured Data

Graph Networks (8 architectures)

Graph neural networks for processing graph-structured data — molecular graphs, social networks, knowledge graphs, and spatial interactions.

GCN
ModuleEdifice.Graph.GCN
PaperKipf & Welling, "Semi-Supervised Classification with Graph Convolutional Networks" (ICLR 2017)
OriginUniversity of Amsterdam

Spectral graph convolutions approximated by a first-order Chebyshev expansion. Each layer: H' = σ(D̂^{-½}ÂD̂^{-½}HW), where Â = A + I adds self-loops and D̂ is its degree matrix. Symmetric normalization keeps feature magnitudes from scaling with node degree.

Strengths: Simple, foundational, well-understood, efficient. Weaknesses: Fixed aggregation (no learned weighting), transductive (can't generalize to new graphs). Who uses it: Baseline for all GNN research, node classification, social networks.
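One propagation step, written out in NumPy (a sketch of the math, not the Edifice API), using the common renormalization with self-loops:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)   # 3-node path graph
H = rng.normal(size=(3, 4))              # node features
W = rng.normal(size=(4, 4)) * 0.1

A_hat = A + np.eye(3)                    # add self-loops
d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
A_norm = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]   # D^-1/2 A D^-1/2

H_next = np.maximum(A_norm @ H @ W, 0)   # sigma = ReLU here
```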

GAT
ModuleEdifice.Graph.GAT
PaperVeličković et al., "Graph Attention Networks" (ICLR 2018)
OriginUniversity of Cambridge / DeepMind

Attention-based message passing where each node attends to neighbors with learned attention weights. Multi-head attention for attending to different subspaces.

Strengths: Learned neighbor weighting (adaptive), multi-head attention, inductive. Weaknesses: O(E) attention computation (E = edges), more parameters than GCN. Who uses it: Citation networks, social networks, any graph task needing adaptive aggregation.

GIN
ModuleEdifice.Graph.GIN
PaperXu et al., "How Powerful are Graph Neural Networks?" (ICLR 2019)
ReferencearXiv:1810.00826
OriginMIT

Provably the most expressive GNN under message passing, matching the discriminative power of the Weisfeiler-Lehman graph isomorphism test. Each layer: h_v' = MLP((1+ε)·h_v + Σ_{u∈N(v)} h_u).

Strengths: Maximally expressive under message passing, learnable ε for self-vs-neighbor weighting. Weaknesses: Sum aggregation can be unstable, limited by WL expressiveness ceiling. Who uses it: Graph classification, molecular property prediction, GNN expressiveness research.
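The layer update is compact enough to write out in NumPy (a sketch, not the Edifice API):

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)   # star graph: node 0 linked to 1 and 2
H = rng.normal(size=(3, 4))              # node features
eps = 0.1
W1 = rng.normal(size=(4, 8)) * 0.1
W2 = rng.normal(size=(8, 4)) * 0.1

agg = (1 + eps) * H + A @ H              # (1 + eps) * self + summed neighbors
H_next = np.maximum(agg @ W1, 0) @ W2    # the MLP: Linear -> ReLU -> Linear
```

The learnable ε weights the node's own features against the neighbor sum; sum (rather than mean or max) aggregation is what gives GIN its WL-level expressiveness.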

GINv2
ModuleEdifice.Graph.GINv2
PaperHu et al., "Strategies for Pre-training Graph Neural Networks" (ICLR 2020)
ReferencearXiv:1905.12265
OriginStanford

Extends GIN with edge feature incorporation. Edge features are projected and combined with neighbor features before aggregation, enabling learning from bond types, distances, and relationship properties.

Strengths: Edge feature awareness, more expressive for graphs with rich edge information. Weaknesses: Requires edge features (3D input tensor), more complex than GIN. Who uses it: Molecular graphs (bond types), knowledge graphs (relation types).

GraphSAGE
ModuleEdifice.Graph.GraphSAGE
PaperHamilton et al., "Inductive Representation Learning on Large Graphs" (NeurIPS 2017)
ReferencearXiv:1706.02216
OriginStanford

Inductive graph learning via neighborhood sampling and aggregation. Concatenates self-features with aggregated neighbor features. Supports mean, max, and pool aggregation.

Strengths: Inductive (generalizes to unseen nodes), scalable via sampling, multiple aggregators. Weaknesses: Sampling introduces variance, concatenation doubles feature dimension. Who uses it: Large-scale graphs, evolving graphs, Pinterest recommendation system.

PNA
ModuleEdifice.Graph.PNA
PaperCorso et al., "Principal Neighbourhood Aggregation for Graph Nets" (NeurIPS 2020)
ReferencearXiv:2004.05718
OriginUniversity of Cambridge

Combines multiple aggregators (mean, max, sum, std) with degree-based scalers (identity, amplification) for maximally expressive message passing. Concatenates all aggregator×scaler combinations.

Strengths: Most expressive single-layer message passing, diverse aggregation captures richer structure. Weaknesses: High feature dimension (num_aggregators × num_scalers × hidden), more compute. Who uses it: Molecular property prediction, graph benchmarks requiring maximum expressiveness.

Graph Transformer
ModuleEdifice.Graph.GraphTransformer
PaperDwivedi & Bresson (AAAI 2021); Ying et al. (NeurIPS 2021)
OriginMultiple research groups

Full multi-head attention over graph nodes with adjacency as attention bias/mask. Includes structural positional encoding via random walk or adjacency matrix powers.

Strengths: Full global attention (any node can attend to any other), structural encoding. Weaknesses: O(N²) in number of nodes, may ignore graph sparsity. Who uses it: Molecular property prediction (Graphormer/OGB-LSC winner), knowledge graphs.

SchNet
ModuleEdifice.Graph.SchNet
PaperSchütt et al., "SchNet: A continuous-filter convolutional neural network for modeling quantum interactions" (NeurIPS 2017)
ReferencearXiv:1706.08566
OriginTU Berlin

Continuous-filter convolutions for molecular graphs. Edge weights derived from interatomic distances via radial basis functions. Operates on continuous geometry, not discrete adjacency.

Strengths: Physics-aware (continuous distances, not binary edges), excellent for molecular properties. Weaknesses: Requires distance information, specialized for molecular/atomic data. Who uses it: Molecular property prediction, quantum chemistry, drug discovery.


Set Networks (2 architectures)

Architectures for permutation-invariant processing of unordered sets.

DeepSets
ModuleEdifice.Sets.DeepSets
PaperZaheer et al., "Deep Sets" (NeurIPS 2017)
OriginCMU

Processes each element independently through phi, aggregates with a permutation-invariant operation (sum), then post-processes with rho. Provably universal for permutation-invariant set functions.

Strengths: Theoretically principled, simple, universal approximator for set functions. Weaknesses: Sum aggregation loses ordering information (by design), limited inter-element interaction. Who uses it: Point cloud classification, set-based inference, multi-instance learning.
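The rho(sum(phi(x_i))) form is short enough to sketch directly in NumPy (illustrative, not the Edifice API); permuting the set leaves the output unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
W_phi = rng.normal(size=(3, 8)) * 0.1    # phi: per-element embedding
W_rho = rng.normal(size=(8, 2)) * 0.1    # rho: post-aggregation map

def deep_sets(xs):
    pooled = np.maximum(xs @ W_phi, 0).sum(axis=0)  # phi each element, then sum
    return pooled @ W_rho                            # rho on the pooled vector

xs = rng.normal(size=(5, 3))             # an unordered set of 5 elements
out = deep_sets(xs)
```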

PointNet
ModuleEdifice.Sets.PointNet
PaperQi et al., "PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation" (CVPR 2017)
OriginStanford

Point cloud processing with per-point MLPs and max pooling for permutation invariance. Optional T-Net learns spatial transformation matrices for input canonicalization.

Strengths: Direct point cloud processing (no voxelization), T-Net for geometric invariance. Weaknesses: No local structure modeling (each point processed independently before pooling). Who uses it: 3D object classification, autonomous driving, robotics point cloud processing.


Generation

Generative Models (11 architectures)

Architectures for generating data — from noise to images, actions, and latent representations.

VAE
ModuleEdifice.Generative.VAE
PaperKingma & Welling, "Auto-Encoding Variational Bayes" (ICLR 2014)
OriginUniversity of Amsterdam

Encodes inputs into distribution parameters (mu, log_var) and samples via the reparameterization trick. A KL divergence regularizer pushes the learned posterior toward a standard normal prior. Beta-VAE variant for disentangled representations.

Strengths: Smooth latent space, principled probabilistic framework, generation + inference. Weaknesses: Blurry outputs (KL regularization trades off with reconstruction), posterior collapse risk. Who uses it: Latent space learning, disentangled representations, anomaly detection.
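The reparameterization trick and the closed-form KL term can be sketched in NumPy (illustrative values, not the Edifice API):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, -0.2])               # illustrative encoder outputs
log_var = np.array([-1.0, 0.3])

eps = rng.normal(size=mu.shape)
z = mu + np.exp(0.5 * log_var) * eps     # sample z = mu + sigma * eps: gradients flow to mu, log_var

# closed-form KL(q(z|x) || N(0, I)), summed over latent dimensions
kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
```

Writing the sample as a deterministic function of (mu, log_var) plus external noise is what makes the encoder trainable by backpropagation.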

VQ-VAE
ModuleEdifice.Generative.VQVAE
Papervan den Oord et al., "Neural Discrete Representation Learning" (NeurIPS 2017)
OriginGoogle DeepMind

Discrete codebook replaces continuous latent space. Encoder output quantized to nearest codebook vector via straight-through estimator. Avoids posterior collapse and produces sharper reconstructions.

Strengths: Discrete latents (composable, interpretable), sharp reconstructions, no posterior collapse. Weaknesses: Codebook utilization can collapse, straight-through gradient is approximate. Who uses it: Audio generation (Jukebox), image tokenization, discrete representation learning.
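Nearest-code quantization with the straight-through trick, sketched in NumPy (a sketch of the mechanism, not the Edifice API):

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))       # 8 codes of dimension 4
z_e = rng.normal(size=(5, 4))            # encoder outputs at 5 positions

# squared distance from every encoder output to every code
d = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
idx = d.argmin(axis=1)                   # nearest code index per position
z_q = codebook[idx]

# straight-through estimator: forward uses z_q; in an autodiff framework
# this is written z_e + stop_gradient(z_q - z_e) so gradients copy to z_e
z_st = z_e + (z_q - z_e)
```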

GAN
ModuleEdifice.Generative.GAN
PaperGoodfellow et al., "Generative Adversarial Networks" (NeurIPS 2014)
OriginUniversity of Montreal

Generator/discriminator adversarial training. Supports WGAN-GP variant for stable training. Generator transforms noise to data; discriminator classifies real vs fake.

Strengths: Sharp, high-quality generations, powerful implicit density estimation. Weaknesses: Training instability (mode collapse, vanishing gradients), no latent inference. Who uses it: Image generation, style transfer, data augmentation, super-resolution.

Diffusion (DDPM)
ModuleEdifice.Generative.Diffusion
PaperHo et al., "Denoising Diffusion Probabilistic Models" (NeurIPS 2020); Chi et al., "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion" (RSS 2023)
ReferencearXiv:2303.04137
OriginColumbia University

Denoising diffusion probabilistic model adapted for action generation. Iteratively denoises random noise into actions conditioned on observations. Multi-modal output distribution (can represent multiple valid actions).

Strengths: Multi-modal output, stable MSE training, scales to high-dimensional action sequences. Weaknesses: Slow inference (~100-1000 denoising steps), high compute. Who uses it: Robot manipulation (Diffusion Policy), imitation learning, action generation.

DDIM
ModuleEdifice.Generative.DDIM
PaperSong et al., "Denoising Diffusion Implicit Models" (ICLR 2021)
ReferencearXiv:2010.02502
OriginStanford

Deterministic sampling variant of DDPM. Same training objective, but reformulates the reverse process as deterministic, enabling ~50 steps instead of ~1000. The eta parameter interpolates between deterministic and stochastic sampling.

Strengths: 20x fewer sampling steps, deterministic generation, same training as DDPM. Weaknesses: Slightly lower quality than full DDPM at equal training. Who uses it: Fast diffusion sampling, real-time generation applications.

DiT
ModuleEdifice.Generative.DiT
PaperPeebles & Xie, "Scalable Diffusion Models with Transformers" (ICCV 2023)
ReferencearXiv:2212.09748
OriginUC Berkeley / Meta

Replaces U-Net backbone in diffusion with a Transformer. Uses AdaLN-Zero conditioning: LayerNorm parameters modulated by conditioning signal, with alpha initialized to zero for stable deep training.

Strengths: Scalable (transformer backbone), elegant conditioning mechanism, state-of-the-art image generation. Weaknesses: Higher compute than U-Net for small models, requires large-scale training. Who uses it: Sora (OpenAI), state-of-the-art image generation, scalable diffusion research.

Latent Diffusion
ModuleEdifice.Generative.LatentDiffusion
PaperRombach et al., "High-Resolution Image Synthesis with Latent Diffusion Models" (CVPR 2022)
ReferencearXiv:2112.10752
OriginLMU Munich / Stability AI

Runs diffusion in VAE latent space instead of pixel space. Train VAE first (perceptual compression), freeze it, then train diffusion in the compact latent space. Returns {encoder, decoder, denoiser}.

Strengths: Dramatically faster than pixel-space diffusion, lower memory, perceptual quality preserved. Weaknesses: Two-stage training (VAE then diffusion), latent space may lose fine detail. Who uses it: Stable Diffusion, Midjourney, virtually all production image generation.

Consistency Model
ModuleEdifice.Generative.ConsistencyModel
PaperSong et al., "Consistency Models" (ICML 2023)
ReferencearXiv:2303.01469
OriginOpenAI

Learns a function f(x_t, t) that maps any point on a diffusion trajectory directly to its origin. Enables single-step generation (f(x_T, T) → x_0) or few-step refinement.

Strengths: Single-step generation (1000x faster than DDPM), can refine with more steps for quality. Weaknesses: Lower quality than full diffusion with equivalent training, consistency constraint is hard to enforce. Who uses it: Real-time generation, OpenAI research, fast diffusion inference.

Score SDE
ModuleEdifice.Generative.ScoreSDE
PaperSong et al., "Score-Based Generative Modeling through Stochastic Differential Equations" (ICLR 2021)
ReferencearXiv:2011.13456
OriginStanford

Unifies DDPM, SMLD, and other score-based methods under a single SDE framework. VP-SDE (variance preserving, generalizes DDPM) and VE-SDE (variance exploding, generalizes SMLD) variants. Learns the score function ∇_x log p_t(x).

Strengths: Unified theoretical framework, exact log-likelihood via probability flow ODE, flexible. Weaknesses: SDE/ODE solving is computationally expensive, complex mathematical framework. Who uses it: Generative modeling theory, diffusion research, advanced sampling methods.

Flow Matching
ModuleEdifice.Generative.FlowMatching
PaperLipman et al., "Flow Matching for Generative Modeling" (ICLR 2023)
ReferencearXiv:2210.02747
OriginMeta AI

Learns a velocity field transporting noise to data via ODE. Uses simple linear interpolation (optimal transport paths) — no noise schedule needed. Training: MSE on velocity prediction.

Strengths: No noise schedule, simpler than diffusion, fewer inference steps (10-20), deterministic ODE. Weaknesses: ODE integration at inference, less established than diffusion. Who uses it: Meta AI (Llama 3 tokenizer training), modern generative modeling.
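The training target is simple enough to verify in NumPy (a sketch, not the Edifice API): with linear interpolation x_t = (1-t)·x0 + t·x1 between noise x0 and data x1, the regression target for the velocity field is x1 - x0, independent of t.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4,))      # noise sample
x1 = rng.normal(size=(4,))      # data sample
t = 0.3

x_t = (1 - t) * x0 + t * x1     # point on the interpolation path
v_target = x1 - x0              # velocity the network regresses to (MSE)

# at inference, an Euler step of the learned ODE is: x <- x + dt * v_theta(x, t)
```

Following the constant velocity from any x_t for the remaining time (1 - t) lands exactly on the data point, which is why a handful of ODE steps suffice.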

Normalizing Flow
ModuleEdifice.Generative.NormalizingFlow
PaperDinh et al., "Density estimation using Real-NVP" (ICLR 2017)
OriginUniversity of Montreal

Invertible transformations between simple base distribution and complex target. RealNVP-style affine coupling layers with tractable Jacobian determinant for exact log-likelihood computation.

Strengths: Exact log-likelihood (not a bound like VAE), invertible (both generation and inference). Weaknesses: Limited expressiveness per layer (coupling splits), many layers needed, restricted architectures. Who uses it: Density estimation, exact likelihood applications, physics simulations.


Contrastive & Self-Supervised (6 architectures)

Architectures for learning representations without labels, using contrastive losses or prediction-based objectives.

SimCLR
ModuleEdifice.Contrastive.SimCLR
PaperChen et al., "A Simple Framework for Contrastive Learning of Visual Representations" (ICML 2020)
ReferencearXiv:2002.05709
OriginGoogle Brain

Contrastive learning with NT-Xent loss. Two augmented views of each example processed by shared encoder + projection head. Negative pairs from other examples in the batch.

Strengths: Simple framework, strong representations, well-understood. Weaknesses: Requires large batch sizes for enough negatives, sensitive to augmentations. Who uses it: Self-supervised pretraining, visual representation learning.

BYOL
ModuleEdifice.Contrastive.BYOL
PaperGrill et al., "Bootstrap Your Own Latent" (NeurIPS 2020)
ReferencearXiv:2006.07733
OriginGoogle DeepMind

No negative pairs needed. Online network (with predictor head) trained to match EMA target network. Asymmetric design prevents collapse without negatives.

Strengths: No negatives needed (any batch size works), simple EMA update, strong representations. Weaknesses: EMA schedule is sensitive, predictor head adds complexity. Who uses it: Self-supervised learning when batch size is limited, general representation learning.

Barlow Twins
ModuleEdifice.Contrastive.BarlowTwins
PaperZbontar et al., "Barlow Twins: Self-Supervised Learning via Redundancy Reduction" (ICML 2021)
ReferencearXiv:2103.03230
OriginMeta AI / FAIR

Pushes cross-correlation matrix of two augmented views toward identity. Invariance term (diagonal → 1) and redundancy reduction term (off-diagonal → 0).

Strengths: Elegant loss function, no negatives/momentum/asymmetry, batch-size robust. Weaknesses: Requires large projection dimension for decorrelation to work well. Who uses it: Self-supervised learning research, decorrelated representation learning.
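The loss can be sketched in NumPy (an illustration with toy embeddings, not the Edifice API): cross-correlate batch-normalized embeddings of the two views, then penalize the diagonal's distance from 1 and the off-diagonal's distance from 0.

```python
import numpy as np

rng = np.random.default_rng(0)
z1 = rng.normal(size=(32, 8))             # view-1 embeddings (batch, dim)
z2 = z1 + rng.normal(size=(32, 8)) * 0.1  # view 2: a slightly perturbed copy

def normalize(z):
    return (z - z.mean(0)) / z.std(0)     # batch-normalize each dimension

c = normalize(z1).T @ normalize(z2) / len(z1)   # cross-correlation matrix (dim, dim)

lam = 5e-3                                # off-diagonal weight (paper's lambda)
on_diag = ((np.diag(c) - 1) ** 2).sum()   # invariance term
off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()  # redundancy-reduction term
loss = on_diag + lam * off_diag
```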

MAE
ModuleEdifice.Contrastive.MAE
PaperHe et al., "Masked Autoencoders Are Scalable Vision Learners" (CVPR 2022)
ReferencearXiv:2111.06377
OriginMeta AI / FAIR

Masks 75% of input patches and trains autoencoder to reconstruct them. Asymmetric encoder-decoder: encoder processes only unmasked patches (efficient), lightweight decoder processes all.

Strengths: Simple, scalable, efficient (encoder sees only 25% of patches), strong pretraining. Weaknesses: Requires patch-based input, reconstruction quality depends on masking ratio. Who uses it: Vision pretraining (BERT-style for images), Meta AI.

VICReg
ModuleEdifice.Contrastive.VICReg
PaperBardes et al., "VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning" (ICLR 2022)
ReferencearXiv:2105.04906
OriginMeta AI / FAIR

Three explicit regularization terms: variance (keep each embedding dimension's standard deviation above a margin), invariance (MSE between views), covariance (decorrelate dimensions). No architectural tricks — symmetric, no momentum, no stop-gradient.

Strengths: Interpretable loss terms, symmetric architecture, explicit collapse prevention. Weaknesses: Three hyperparameters to balance (lambda, mu, nu). Who uses it: Self-supervised learning research, when interpretable loss is desired.

JEPA
ModuleEdifice.Contrastive.JEPA
PaperAssran et al., "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture" (CVPR 2023)
ReferencearXiv:2301.08243
OriginMeta AI / Yann LeCun

Predicts representations of masked regions (not pixel values), learning more abstract features. Context encoder processes visible patches, narrow predictor bridges to target encoder's representation space. EMA target updates.

Strengths: Learns abstract representations (not pixel-level), aligns with Yann LeCun's world model vision. Weaknesses: Complex training setup (two encoders + predictor), relatively new paradigm. Who uses it: Meta AI (core research direction for Yann LeCun's AGI vision).


Composition & Enhancement

Meta-Learning (10 architectures)

Architectures for routing, adaptation, composition, and model enhancement. These are typically composed with other architectures rather than used standalone.

MoE (Mixture of Experts)
ModuleEdifice.Meta.MoE
PaperShazeer et al., "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" (ICLR 2017)
OriginGoogle Brain

Routes each input to a subset of specialized expert networks via learned routing. Top-K, switch (top-1), soft (all experts), and hash routing strategies.

Strengths: Massive capacity with sparse activation, expert specialization, established technique. Weaknesses: Load balancing issues, routing instability, requires auxiliary loss. Who uses it: GPT-4 (rumored), Mixtral, Google Switch Transformer, large-scale models.

Switch MoE
ModuleEdifice.Meta.SwitchMoE
PaperFedus et al., "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity" (JMLR 2022)
ReferencearXiv:2101.03961
OriginGoogle Brain

Top-1 expert routing — each token goes to exactly one expert. Simplifies MoE routing while maintaining capacity benefits. Peaked softmax distribution for near-one-hot routing.

Strengths: Simpler than top-K MoE, best load balance, scales to trillion parameters. Weaknesses: Single expert per token may limit quality, token dropping on imbalance. Who uses it: Google (Switch Transformer), large-scale sparse model research.

Soft MoE
ModuleEdifice.Meta.SoftMoE
PaperPuigcerver et al., "From Sparse to Soft Mixtures of Experts" (ICLR 2024)
ReferencearXiv:2308.00951
OriginGoogle Brain

Fully differentiable soft routing — computes weighted combination of all expert outputs for every token. Eliminates token dropping, load balancing issues, and routing instability.

Strengths: Fully differentiable, no token dropping, stable training, no load balancing loss needed. Weaknesses: All experts computed for every token (no sparsity benefit at inference). Who uses it: Soft routing research, when training stability is paramount.

LoRA
ModuleEdifice.Meta.LoRA
PaperHu et al., "LoRA: Low-Rank Adaptation of Large Language Models" (ICLR 2022)
ReferencearXiv:2106.09685
OriginMicrosoft Research

Freezes original weights, injects trainable low-rank decomposition: output = Wx + (α/r)·B(Ax). Reduces trainable parameters by orders of magnitude while maintaining quality.

Strengths: Orders of magnitude fewer trainable params, no inference overhead (merge weights), simple. Weaknesses: Limited capacity (rank constraint), may not match full fine-tuning on complex tasks. Who uses it: Virtually all LLM fine-tuning — now the default PEFT method.
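The forward pass can be sketched in NumPy (a sketch of the math, not the Edifice API): the frozen weight W plus a scaled low-rank update (α/r)·BA, with B zero-initialized so the adapter starts as an exact no-op.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 16, 16, 4, 8

W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01     # trainable down-projection
B = np.zeros((d_out, r))                  # trainable up-projection, zero init

x = rng.normal(size=(d_in,))
y = W @ x + (alpha / r) * (B @ (A @ x))   # adapted forward pass

# B = 0 at init, so the adapted model starts identical to the base model
```

After training, B·A can be folded into W (W + (α/r)·BA), which is why LoRA adds no inference overhead.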

Adapter
ModuleEdifice.Meta.Adapter
PaperHoulsby et al., "Parameter-Efficient Transfer Learning for NLP" (ICML 2019)
ReferencearXiv:1902.00751
OriginGoogle Research

Small bottleneck modules (down-project → activation → up-project + residual) inserted between frozen pretrained layers. Adds minimal trainable parameters.

Strengths: Simple, modular, can be stacked/combined, minimal interference with pretrained weights. Weaknesses: Adds inference overhead (extra layers), largely superseded by LoRA. Who uses it: Transfer learning, multi-task adaptation, historical PEFT method.

Hypernetwork
ModuleEdifice.Meta.Hypernetwork
PaperHa et al., "HyperNetworks" (2016)
ReferencearXiv:1609.09106
OriginGoogle Brain

Networks that generate other networks' weights. A hypernetwork takes conditioning input and produces weight matrices for a target network. Enables conditional computation and task adaptation.

Strengths: Conditional weight generation, task adaptation, compression (small hyper → large target). Weaknesses: Complex training dynamics, hypernetwork capacity limits target quality. Who uses it: Multi-task learning, neural architecture search, dynamic network research.

Capsule
ModuleEdifice.Meta.Capsule
PaperSabour et al., "Dynamic Routing Between Capsules" (2017)
ReferencearXiv:1710.09829
OriginGoogle Brain / Geoffrey Hinton

Vector "capsules" encoding both entity existence (vector length) and properties (vector direction). Dynamic routing by agreement replaces max-pooling, preserving spatial hierarchies.

Strengths: Preserves spatial hierarchies, rotation/pose equivariance, interpretable. Weaknesses: Computationally expensive (routing iterations), difficult to scale, largely abandoned. Who uses it: Research on equivariant representations, medical imaging (small-scale).

Mixture of Depths
ModuleEdifice.Meta.MixtureOfDepths
PaperRaposo et al., "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" (2024)
ReferencearXiv:2404.02258
OriginGoogle DeepMind

Per-token routing where a learned router scores tokens and only top-C% receive full transformer processing; rest skip via residual. Dynamic compute allocation.

Strengths: Dynamic compute allocation (easy tokens skip layers), reduces average FLOPs. Weaknesses: Differentiable approximation (soft gating) less efficient than true sparsity. Who uses it: Efficient transformer research, dynamic compute allocation.

Mixture of Agents
ModuleEdifice.Meta.MixtureOfAgents
PaperWang et al., "Mixture-of-Agents Enhances Large Language Model Capabilities" (2024)
OriginTogether AI

N independent proposer transformer stacks process input in parallel, outputs concatenated and fed to a larger aggregator transformer that combines proposals.

Strengths: Diverse proposals from multiple models, aggregator learns to combine best aspects. Weaknesses: N× compute for proposers, complex architecture, relatively new. Who uses it: Multi-model combination research, ensemble-style architectures.

RLHF Head
ModuleEdifice.Meta.RLHFHead
PaperOuyang et al. (RLHF, 2022); Rafailov et al. (DPO, 2023)
OriginOpenAI / Stanford

Composable head modules for alignment: Reward head (sequence → scalar reward) and DPO head (chosen/rejected → preference logit). Designed to be attached to any backbone.

Strengths: Composable with any backbone, supports both RLHF reward modeling and DPO. Weaknesses: Head only (requires backbone), reward modeling has known limitations. Who uses it: LLM alignment pipelines, reward model training, DPO fine-tuning.


Specialized

Energy-Based (3 architectures)

Architectures modeling energy landscapes and continuous dynamics.

EBM
ModuleEdifice.Energy.EBM
PaperDu & Mordatch, "Implicit Generation and Modeling with Energy-Based Models" (2019); LeCun et al., "A Tutorial on Energy-Based Learning" (2006)
OriginMIT / NYU

Energy function network assigning scalar energy to inputs. Trained via contrastive divergence: push down energy on real data, push up on negatives from Langevin dynamics MCMC.

Strengths: Flexible unnormalized density model, can model complex distributions, no mode collapse. Weaknesses: MCMC sampling is slow, training instability, energy landscape hard to control. Who uses it: Energy-based generative modeling research, anomaly detection.
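The negative-sampling loop described above can be sketched as follows. This is an illustrative NumPy version of Langevin dynamics under an analytically simple quadratic energy (so convergence is checkable), not the `Edifice.Energy.EBM` implementation; `langevin_negatives` is a hypothetical name.

```python
import numpy as np

def langevin_negatives(energy_grad, x0, steps=100, step_size=0.1, noise=0.005,
                       rng=None):
    """Draw negative samples via Langevin dynamics:
    x <- x - (step/2) * dE/dx + noise * N(0, I).
    Training then pushes energy down on real data and up on these negatives."""
    if rng is None:
        rng = np.random.default_rng(0)
    x = x0.copy()
    for _ in range(steps):
        x = x - 0.5 * step_size * energy_grad(x) + noise * rng.normal(size=x.shape)
    return x

# Quadratic energy E(x) = 0.5 * ||x - mu||^2 has gradient (x - mu), so the
# chain should settle near the energy minimum mu:
mu = np.array([2.0, -1.0])
neg = langevin_negatives(lambda x: x - mu, np.zeros(2), steps=2000)
```

The slow part in practice is exactly this loop: each negative sample costs `steps` gradient evaluations of the energy network.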

Hopfield
ModuleEdifice.Energy.Hopfield
PaperRamsauer et al., "Hopfield Networks is All You Need" (2020)
ReferencearXiv:2008.02217
OriginJohannes Kepler University

Modern continuous Hopfield networks with an exponential interaction function. Stores exponentially many patterns (classical binary Hopfield capacity is only linear in dimension). The update rule softmax(β·X·Y^T)·Y is exactly the attention mechanism.

Strengths: Exponential storage capacity, single-step convergence, mathematical attention connection. Weaknesses: Primarily theoretical interest, specialized use case. Who uses it: Associative memory research, attention-memory connection studies.
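The update rule and its attention connection fit in a few lines. A minimal NumPy sketch for a single query (Edifice is Elixir, so `hopfield_retrieve` is a hypothetical illustration, not the `Edifice.Energy.Hopfield` API):

```python
import numpy as np

def hopfield_retrieve(patterns, query, beta=8.0, steps=1):
    """Modern Hopfield update: xi <- patterns^T @ softmax(beta * patterns @ xi).
    For a stack of queries X this is softmax(beta X Y^T) Y -- attention with the
    stored patterns Y acting as both keys and values."""
    xi = query.copy()
    for _ in range(steps):
        logits = beta * (patterns @ xi)
        weights = np.exp(logits - logits.max())
        weights /= weights.sum()              # softmax over stored patterns
        xi = patterns.T @ weights             # move toward the best-matching pattern
    return xi

patterns = np.array([[1.0, 1.0, -1.0],
                     [-1.0, 1.0, 1.0],
                     [1.0, -1.0, 1.0]])
noisy = np.array([0.9, 0.8, -0.7])            # corrupted copy of pattern 0
out = hopfield_retrieve(patterns, noisy)
```

With a large β, a single update already lands on the stored pattern closest to the query, which is the "single-step convergence" claimed above.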

Neural ODE
ModuleEdifice.Energy.NeuralODE
PaperChen et al., "Neural Ordinary Differential Equations" (NeurIPS 2018)
ReferencearXiv:1806.07366
OriginUniversity of Toronto

Parameterizes continuous hidden state dynamics: dh/dt = f(h(t), t; θ). Constant memory cost via adjoint method, adaptive computation via solver control, continuous-depth networks.

Strengths: Continuous-depth, constant memory, adaptive computation, elegant mathematics. Weaknesses: Slow (ODE solving), adjoint method can be unstable, harder to train than discrete layers. Who uses it: Physics-informed ML, continuous dynamics modeling, normalizing flows.
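The continuous-depth idea can be made concrete with a fixed-step solver. This NumPy RK4 sketch is illustrative only (Edifice lists its own solvers elsewhere, and production Neural ODEs use adaptive solvers plus the adjoint method for gradients); `odeint_rk4` is a hypothetical name, checked here against a linear system with a known closed form.

```python
import numpy as np

def odeint_rk4(f, h0, t0, t1, steps=100):
    """Integrate dh/dt = f(h, t) with fixed-step RK4; 'depth' is the length of
    the integration interval rather than a count of discrete layers."""
    h, t = h0.astype(float).copy(), t0
    dt = (t1 - t0) / steps
    for _ in range(steps):
        k1 = f(h, t)
        k2 = f(h + 0.5 * dt * k1, t + 0.5 * dt)
        k3 = f(h + 0.5 * dt * k2, t + 0.5 * dt)
        k4 = f(h + dt * k3, t + dt)
        h = h + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
        t += dt
    return h

# Linear dynamics dh/dt = A h with a rotation generator: integrating a quarter
# turn should rotate [1, 0] to [0, 1], which validates the solver.
A = np.array([[0.0, -1.0], [1.0, 0.0]])
h1 = odeint_rk4(lambda h, t: A @ h, np.array([1.0, 0.0]), 0.0, np.pi / 2.0)
```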


Probabilistic (3 architectures)

Architectures for principled uncertainty quantification.

Bayesian NN
ModuleEdifice.Probabilistic.Bayesian
PaperBlundell et al., "Weight Uncertainty in Neural Networks" (2015)
ReferencearXiv:1505.05424
OriginGoogle DeepMind

Each weight is a Gaussian distribution parameterized by (mu, rho) and sampled via the reparameterization W = mu + softplus(rho)·ε with ε ~ N(0, 1). Trained by maximizing the ELBO. Provides uncertainty estimation and learned regularization.

Strengths: Principled uncertainty, learned regularization, multiple weight samples reduce overconfidence. Weaknesses: 2x parameters (mu + rho), expensive sampling, difficult to scale. Who uses it: Safety-critical applications (medical, autonomous), uncertainty research.
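The reparameterized sampling step can be sketched directly (an illustrative NumPy version, not the `Edifice.Probabilistic.Bayesian` API; `sample_weights` is a hypothetical name):

```python
import numpy as np

def sample_weights(mu, rho, rng):
    """Reparameterized weight sample W = mu + softplus(rho) * eps, eps ~ N(0, I).
    softplus keeps the standard deviation positive while rho stays unconstrained,
    so the ELBO can be optimized with plain gradient descent."""
    sigma = np.log1p(np.exp(rho))             # softplus(rho) > 0
    return mu + sigma * rng.normal(size=mu.shape)

rng = np.random.default_rng(0)
mu = np.zeros((3, 2))
rho = np.full((3, 2), -5.0)                   # softplus(-5) ~ 0.0067: tight posterior
samples = np.stack([sample_weights(mu, rho, rng) for _ in range(1000)])
```

The "2x parameters" weakness is visible here: every weight carries both a `mu` and a `rho`, and every forward pass draws fresh noise.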

MC Dropout
ModuleEdifice.Probabilistic.MCDropout
PaperGal & Ghahramani, "Dropout as a Bayesian Approximation" (2016)
ReferencearXiv:1506.02142
OriginUniversity of Cambridge

Keep dropout ON at inference, run N forward passes, compute mean (prediction) and variance (uncertainty). Practical Bayesian approximation requiring zero training modification.

Strengths: Zero training modification needed, practical uncertainty estimation, simple. Weaknesses: N forward passes at inference (slow), approximate (not exact Bayesian). Who uses it: Any model needing uncertainty estimates with minimal effort, medical AI.
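The whole procedure is a loop around an existing model. A minimal NumPy sketch with a toy dropout "model" (all names hypothetical; the variance numbers below follow from the Bernoulli mask, not from any real network):

```python
import numpy as np

def mc_dropout_predict(forward, x, n_samples=5000, rng=None):
    """Keep dropout ON at inference: N stochastic passes yield a predictive
    mean and a variance that serves as the uncertainty estimate."""
    if rng is None:
        rng = np.random.default_rng(0)
    preds = np.stack([forward(x, rng) for _ in range(n_samples)])
    return preds.mean(axis=0), preds.var(axis=0)

# Toy 'model': one linear layer whose inputs are dropped with p = 0.5.
W = np.array([[1.0, 2.0, 3.0]])

def forward(x, rng):
    mask = rng.random(3) < 0.5                # Bernoulli keep-mask
    return (W * mask / 0.5) @ x               # inverted-dropout rescaling

mean, var = mc_dropout_predict(forward, np.ones(3))
```

The mean recovers the deterministic output (here 6), and the variance (here Σ wᵢ² = 14 for keep probability 0.5) is the uncertainty signal; the cost is the N forward passes noted above.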

Evidential NN
ModuleEdifice.Probabilistic.EvidentialNN
PaperSensoy et al., "Evidential Deep Learning to Quantify Classification Uncertainty" (NeurIPS 2018)
ReferencearXiv:1806.01768
OriginNational Defense University (Turkey)

Places Dirichlet distribution over class probabilities. Network outputs evidence parameters (alpha) for single-pass epistemic and aleatoric uncertainty estimation.

Strengths: Single forward pass uncertainty, separates epistemic and aleatoric uncertainty, no ensemble needed. Weaknesses: Dirichlet assumption may not hold, evidence calibration can be tricky. Who uses it: Out-of-distribution detection, safety-critical classification.
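The single-pass uncertainty computation is just arithmetic on the evidence vector. An illustrative NumPy sketch of the subjective-logic quantities from Sensoy et al. (the function name is hypothetical):

```python
import numpy as np

def dirichlet_uncertainty(evidence):
    """Map non-negative evidence to Dirichlet parameters alpha = evidence + 1.
    Expected probabilities are alpha / S and vacuity uncertainty is K / S
    (S = sum of alpha): one forward pass, no sampling or ensembling."""
    alpha = evidence + 1.0
    S = alpha.sum()
    probs = alpha / S                         # expected class probabilities
    uncertainty = len(alpha) / S              # high when total evidence is low
    return probs, uncertainty

probs_conf, u_conf = dirichlet_uncertainty(np.array([18.0, 1.0, 1.0]))  # in-dist
probs_ood, u_ood = dirichlet_uncertainty(np.array([0.0, 0.0, 0.0]))     # no evidence
```

With zero evidence the prediction collapses to uniform and uncertainty saturates at 1, which is what makes this useful for out-of-distribution detection.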


Memory-Augmented (2 architectures)

Architectures with external differentiable memory.

NTM (Neural Turing Machine)
ModuleEdifice.Memory.NTM
PaperGraves et al., "Neural Turing Machines" (2014)
ReferencearXiv:1410.5401
OriginGoogle DeepMind

Neural network controller with external memory matrix read/written via differentiable attention. 4-stage addressing: content-based → interpolation → circular shift → sharpening.

Strengths: Can learn algorithms (copy, sort, recall), external memory decoupled from computation. Weaknesses: Difficult to train, complex addressing mechanism, largely superseded. Who uses it: Algorithm learning research, differentiable memory research (historical significance).
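The four addressing stages can be written out directly. A minimal NumPy sketch of the read-head weighting (illustrative of Graves et al.'s mechanism, not the `Edifice.Memory.NTM` implementation; the shift vector here is a distribution over circular offsets −1, 0, +1):

```python
import numpy as np

def ntm_address(memory, key, beta, g, shift, gamma, w_prev):
    """NTM four-stage addressing: content -> interpolation -> shift -> sharpen."""
    sim = memory @ key / (np.linalg.norm(memory, axis=1) *
                          np.linalg.norm(key) + 1e-8)
    wc = np.exp(beta * sim)
    wc /= wc.sum()                            # 1. content addressing (cosine + softmax)
    wg = g * wc + (1.0 - g) * w_prev          # 2. interpolation with previous focus
    ws = np.zeros(len(wg))
    for o, s in zip((-1, 0, 1), shift):       # 3. circular convolutional shift
        ws += s * np.roll(wg, o)
    w = ws ** gamma                           # 4. sharpening
    return w / w.sum()

mem = np.eye(4)                               # 4 memory slots, one-hot contents
w = ntm_address(mem, key=np.array([0.0, 1.0, 0.0, 0.0]), beta=10.0, g=1.0,
                shift=np.array([0.0, 0.0, 1.0]), gamma=2.0,
                w_prev=np.full(4, 0.25))      # content match on slot 1, then shift +1
```

Content addressing locks onto slot 1, and the +1 shift moves the focus to slot 2, which is how the NTM iterates over memory like a tape head.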

Memory Network
ModuleEdifice.Memory.MemoryNetwork
PaperSukhbaatar et al., "End-To-End Memory Networks" (2015)
ReferencearXiv:1503.08895
OriginMeta AI / FAIR

Iterative reasoning over memory slots via multi-hop attention. Each hop: attend to memory, read, update query. Different embedding matrices at each hop for different attention aspects.

Strengths: Multi-hop reasoning, iterative query refinement, interpretable attention. Weaknesses: Fixed memory slots, largely superseded by transformer attention. Who uses it: Question answering research, multi-hop reasoning (historical significance).
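The hop loop itself is compact. An illustrative NumPy sketch (hypothetical names, per-hop key/value matrices standing in for the per-hop embedding matrices described above):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def memnet_read(query, hops):
    """Multi-hop memory reading: at each hop, attend over memory slots with the
    current query, read a value, and refine the query (u <- u + o).
    Each hop carries its own key/value embeddings of the same memory."""
    u = query.copy()
    for keys, values in hops:
        p = softmax(keys @ u)                 # attention over memory slots
        u = u + values.T @ p                  # read vector refines the query
    return u

keys = np.array([[100.0, 0.0], [0.0, 100.0]])   # 2 slots, sharply separable
values = np.array([[0.0, 1.0], [1.0, 0.0]])
u = memnet_read(np.array([1.0, 0.0]), [(keys, values), (keys, values)])
```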


Neuromorphic (2 architectures)

Biologically-plausible architectures for ultra-low-power deployment on neuromorphic hardware.

SNN (Spiking Neural Network)
ModuleEdifice.Neuromorphic.SNN
PaperNeftci et al., "Surrogate Gradient Learning in SNNs" (2019)
ReferencearXiv:1901.09948
OriginUC San Diego

Leaky Integrate-and-Fire (LIF) neurons processing information as discrete spikes. Surrogate gradients (sigmoid approximation) enable backpropagation through non-differentiable spike function.

Strengths: Energy-efficient on neuromorphic hardware (Intel Loihi, IBM TrueNorth), biologically plausible. Weaknesses: Surrogate gradients are approximate, spike-based processing limits precision, specialized hardware needed. Who uses it: Neuromorphic computing research, ultra-low-power applications, brain-inspired AI.
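The LIF dynamics and the surrogate trick are easy to show side by side. A minimal NumPy sketch (illustrative; `lif_forward`, `surrogate_grad`, and the parameter values are hypothetical, and a real SNN would apply the surrogate inside autodiff rather than as a standalone function):

```python
import numpy as np

def lif_forward(inputs, tau=0.9, v_th=1.0):
    """Leaky Integrate-and-Fire neuron: the membrane potential leaks by tau,
    integrates the input current, and emits a spike (then resets) on crossing
    the threshold. The spike is a hard, non-differentiable step."""
    v, spikes = 0.0, []
    for i in inputs:
        v = tau * v + i                       # leak + integrate
        s = 1.0 if v >= v_th else 0.0         # non-differentiable spike
        v = v * (1.0 - s)                     # reset to 0 after a spike
        spikes.append(s)
    return np.array(spikes)

def surrogate_grad(v, v_th=1.0, k=4.0):
    """Sigmoid surrogate for d(spike)/d(v): a smooth stand-in used only in the
    backward pass so gradients can flow through the spike function."""
    s = 1.0 / (1.0 + np.exp(-k * (v - v_th)))
    return k * s * (1.0 - s)

spikes = lif_forward(np.array([0.6, 0.6, 0.6, 0.0, 0.6, 0.6]))
```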

ANN2SNN
ModuleEdifice.Neuromorphic.ANN2SNN
PaperDiehl et al. (IJCNN 2015); Rueckauer et al. (2017)
OriginETH Zurich

Converts trained ANNs to spiking networks by replacing ReLU with integrate-and-fire neurons that encode activation magnitudes as spike rates. Train with standard backprop, deploy on neuromorphic hardware.

Strengths: Train with standard tools, deploy on neuromorphic hardware, leverages existing ANN training. Weaknesses: Rate coding requires many timesteps for accuracy, conversion loss. Who uses it: ANN-to-neuromorphic deployment pipeline, edge computing.
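The rate-coding idea, and why it needs many timesteps, can be demonstrated with one integrate-and-fire neuron driven by a constant input (an illustrative sketch with hypothetical names; soft reset, i.e. subtracting the threshold, follows Rueckauer et al.'s conversion recipe):

```python
import numpy as np

def if_spike_rate(activation, timesteps):
    """An integrate-and-fire neuron driven by a constant input fires at a rate
    approaching the input magnitude: this is how conversion represents a ReLU
    activation (clipped to [0, 1]) as a spike rate."""
    a = float(np.clip(activation, 0.0, 1.0))
    v, count = 0.0, 0
    for _ in range(timesteps):
        v += a                                # integrate constant input current
        if v >= 1.0:                          # threshold crossing -> spike
            v -= 1.0                          # soft reset keeps residual charge
            count += 1
    return count / timesteps

# The rate estimate of a 0.73 activation sharpens as the time window grows:
rates = [if_spike_rate(0.73, T) for T in (9, 99, 999)]
```

The quantization error shrinks roughly as 1/T, which is exactly the "rate coding requires many timesteps" weakness above.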


Feedforward (5 architectures)

Foundational and specialized feedforward architectures.

MLP
ModuleEdifice.Feedforward.MLP
OriginRosenblatt (1958), Rumelhart et al. (1986)

Multi-layer perceptron — stacked dense layers with activations. Configurable hidden sizes, activation, dropout, layer norm, and residual connections.

Strengths: Universal approximator, simple, fast, building block for everything. Weaknesses: No inductive bias (doesn't exploit structure in data), requires more data. Who uses it: Everywhere — backbone component of virtually every architecture.

KAN (Kolmogorov-Arnold Network)
ModuleEdifice.Feedforward.KAN
PaperLiu et al., "KAN: Kolmogorov-Arnold Networks" (2024)
ReferencearXiv:2404.19756
OriginMIT / Ziming Liu

Learnable activation functions on edges (not fixed activations on nodes). Based on Kolmogorov-Arnold representation theorem. Supports B-spline, sine, Chebyshev, Fourier, and RBF basis functions.

Strengths: Learnable activations, interpretable (visualizable), good for symbolic/scientific tasks. Weaknesses: O(n²·g) parameters (g=grid_size), slower than MLP, less general-purpose. Who uses it: Scientific ML, symbolic regression, interpretable AI research.
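A single KAN edge is just a learnable univariate function. The sketch below uses an RBF basis (one of the bases listed above) and fits one edge to sin(x) by least squares to show that the edge itself is the learnable part; all names are hypothetical and this is not the `Edifice.Feedforward.KAN` API.

```python
import numpy as np

def kan_edge(x, centers, weights, width=1.0):
    """One KAN edge: phi(x) = sum_i w_i * rbf_i(x). In a KAN every edge carries
    such a learned univariate function and each node simply sums its incoming
    edges, per the Kolmogorov-Arnold representation."""
    basis = np.exp(-(((x - centers) / width) ** 2))
    return basis @ weights

# Fit one edge's weights to sin(x) on a grid via least squares:
centers = np.linspace(-np.pi, np.pi, 9)       # g = 9 grid points for this edge
xs = np.linspace(-np.pi, np.pi, 200)
B = np.exp(-(((xs[:, None] - centers) / 1.0) ** 2))
w, *_ = np.linalg.lstsq(B, np.sin(xs), rcond=None)
err = float(np.max(np.abs(B @ w - np.sin(xs))))
```

The O(n²·g) parameter cost quoted above comes from storing such a g-coefficient function on every one of the n² edges of a layer.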

KAT (KAN-Attention Transformer)
ModuleEdifice.Feedforward.KAT
PaperCombines Liu et al. (KAN, 2024) with Vaswani et al. (Attention, 2017)
OriginEdifice composition

Standard multi-head attention with KAN layers replacing the FFN sublayer. Learnable activation functions provide more expressive feature transformation than fixed-activation FFNs.

Strengths: More expressive FFN via learnable activations, attention + KAN synergy. Weaknesses: Higher parameter count (grid_size multiplier), slower than standard transformer. Who uses it: Hybrid architecture research, when standard FFN is the bottleneck.

TabNet
ModuleEdifice.Feedforward.TabNet
PaperArik & Pfister, "TabNet: Attentive Interpretable Tabular Learning" (AAAI 2021)
ReferencearXiv:1908.07442
OriginGoogle Cloud AI

Sequential attention for instance-wise feature selection in tabular data. Each decision step selects relevant features via sparse mask (sparsemax), with a relaxation factor controlling feature reuse.

Strengths: Inherently interpretable, instance-wise feature selection, designed for tabular data. Weaknesses: Tabular-specific, sequential decision steps, complex training. Who uses it: Tabular ML (alternative to gradient boosting), feature importance analysis.
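The sparse masks come from sparsemax (Martins & Astudillo, 2016), which projects logits onto the probability simplex and, unlike softmax, produces exact zeros. A minimal NumPy sketch of that projection (illustrative; not the TabNet or Edifice implementation):

```python
import numpy as np

def sparsemax(z):
    """Sparsemax: Euclidean projection of logits onto the probability simplex.
    Exact zeros in the output are what make TabNet's instance-wise feature
    masks genuinely sparse."""
    z_sorted = np.sort(z)[::-1]
    cssv = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1.0 + k * z_sorted > cssv       # support condition from the paper
    k_max = k[support].max()
    tau = (cssv[k_max - 1] - 1.0) / k_max     # threshold so the output sums to 1
    return np.maximum(z - tau, 0.0)

mask = sparsemax(np.array([2.0, 1.0, -1.0]))  # one dominant feature wins outright
soft = sparsemax(np.array([1.0, 0.9, -1.0]))  # two close features share the mass
```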

BitNet
ModuleEdifice.Feedforward.BitNet
PaperWang et al., "BitNet" (2023); Ma et al., "The Era of 1-bit LLMs" (2024)
ReferencearXiv:2310.11453
OriginMicrosoft Research

Quantizes weights to binary {-1, +1} or ternary {-1, 0, +1} in the forward pass while maintaining full precision for gradients. BitLinear layers with a straight-through estimator. Binary (1-bit, BitNet) and ternary (1.58-bit, BitNet b1.58) modes.

Strengths: Massive memory reduction (1-1.58 bits/weight), faster inference (integer ops), maintains quality. Weaknesses: Quantization-aware training required, straight-through estimator is approximate. Who uses it: Efficient LLM deployment, edge inference, Microsoft Research.
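The ternary mode can be sketched with absmean quantization in the style of BitNet b1.58 (an illustrative NumPy version with hypothetical names, not the Edifice or Microsoft implementation; real training keeps a full-precision shadow copy of W and routes gradients through the straight-through estimator):

```python
import numpy as np

def quantize_ternary(W, eps=1e-8):
    """Absmean quantization (BitNet b1.58 style): scale by the mean absolute
    weight, then round each entry to the nearest value in {-1, 0, +1}."""
    scale = np.abs(W).mean() + eps
    Wq = np.clip(np.round(W / scale), -1.0, 1.0)
    return Wq, scale

def bitlinear_forward(x, W):
    """Forward pass with quantized weights. In training, the backward pass
    sends gradients to the full-precision W by treating round() as identity
    (straight-through estimator)."""
    Wq, scale = quantize_ternary(W)
    return (x @ Wq.T) * scale                 # integer matmul plus one rescale

W = np.array([[0.9, -0.05, -1.1],
              [0.4, 0.02, -0.3]])
Wq, scale = quantize_ternary(W)
```

Since Wq holds only three values, the matmul reduces to additions and subtractions, which is where the inference speedup comes from.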


Liquid (1 architecture)

Liquid Neural Network
ModuleEdifice.Liquid
PaperHasani et al., "Liquid Time-constant Networks" (AAAI 2021)
OriginMIT CSAIL / Liquid AI

Continuous-time ODE dynamics with learnable time constants: dx/dt = -x/τ + f(x, I, θ)/τ. Multiple ODE solvers: exact, Euler, midpoint, RK4, Dormand-Prince (adaptive).

Strengths: Continuous adaptation during inference, robust to distributional drift, physically interpretable. Weaknesses: ODE solving is computationally expensive, more complex than discrete RNNs. Who uses it: Liquid AI ($250M from AMD), autonomous driving, time series with drift.
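The dynamics equation above can be stepped with the simplest of the listed solvers. An illustrative NumPy Euler integration of dx/dt = -x/τ + f(x, I)/τ with a constant drive (hypothetical names, toy f; not the `Edifice.Liquid` API):

```python
import numpy as np

def ltc_step(x, inp, tau, dt, f):
    """One explicit-Euler step of dx/dt = -x/tau + f(x, I)/tau, with a
    learnable per-neuron time constant tau."""
    return x + dt * (-x / tau + f(x, inp) / tau)

tau = np.array([0.5, 2.0])                    # a fast neuron and a slow neuron
x = np.zeros(2)
for _ in range(1000):                         # integrate to t = 10 with dt = 0.01
    x = ltc_step(x, 1.0, tau, dt=0.01, f=lambda s, i: np.array([i, i]))
# Both neurons relax toward f's fixed point, each at its own rate 1/tau.
```

The per-neuron τ is what lets the network mix fast-reacting and slow-integrating state, and why it keeps adapting as inputs drift.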


Architecture Gaps

Missing architectures identified through survey of current research, organized by priority.

High Priority

ArchitectureFamilyWhy It Matters
HymbaSSMHybrid Mamba + attention with learned routing — next-gen hybrid architecture
RWKV-6/7 (full)AttentionUpdated RWKV with improved token shift and time mixing — current v7 is simplified
DeltaNet v2RecurrentImproved delta rule with gated memory — significant accuracy improvements
xLSTM v2 (sLSTM)RecurrentScalar LSTM with exponential gating — sequential state-tracking specialist
Sparse AttentionAttentionBlock-sparse and sliding window patterns — essential for long-context LLMs

Medium Priority

ArchitectureFamilyWhy It Matters
Hawk/Griffin v2AttentionImproved gated linear recurrence — Google's RecurrentGemma successor
RetNet v2AttentionImproved multi-scale retention with better decay schedules
GLA v2AttentionEnhanced gated linear attention with improved forget gates
HGRN v2AttentionMulti-resolution hierarchical gated RNN
FlashLinearAttentionAttentionHardware-efficient linear attention kernel for production
LoRA+/DoRAMetaEnhanced parameter-efficient fine-tuning with direction-aware updates
DiT v2GenerativeImproved diffusion transformer conditioning
Byte Latent TransformerTransformerByte-level processing with latent patching — emerging paradigm

Low Priority / Exploratory

ArchitectureFamilyWhy It Matters
Test-Time ComputeMetaDynamic inference-time scaling strategies
Mixture of TokenizersMetaMulti-granularity tokenization routing
Medusa/EAGLEMetaSpeculative decoding heads for faster inference
Distillation HeadMetaKnowledge distillation output head
QAT utilitiesMetaQuantization-aware training beyond BitNet
Native RecurrenceRecurrentHardware-native recurrent implementations for specialized accelerators
State Space TransformerSSM/TransformerDeeper SSM-attention integration (beyond simple interleaving)

See ARCHITECTURE_ROADMAP.md for implementation timeline and prioritization.


Cross-Reference

Every architecture in Edifice.list_architectures/0 appears in this taxonomy. Family counts match Edifice.list_families/0:

FamilyCountVerified
ssm15yes
attention19yes
recurrent10yes
transformer1yes
vision9yes
convolutional6yes
graph8yes
sets2yes
generative11yes
contrastive6yes
meta10yes
energy3yes
probabilistic3yes
memory2yes
neuromorphic2yes
feedforward5yes
liquid1yes
Total113