Nemotron-H: NVIDIA's Hybrid Mamba-Transformer Architecture
Nemotron-H is a hybrid language model that combines roughly 90% Mamba2 (SSD) layers with roughly 10% full attention layers. The design aims for Transformer-level quality while the Mamba2 layers, which carry a fixed-size recurrent state instead of a growing KV cache, keep inference cost close to linear in sequence length.
Key Innovation: Hybrid Layer Mixing
Rather than using all-attention or all-SSM, Nemotron-H interleaves them:
- 90% of layers use Mamba2 (State Space Duality) for efficient linear-time processing
- 10% of layers use full multi-head attention for global reasoning
- Attention blocks are placed at regular intervals (every 10th layer by default); see the placement sketch below
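A minimal Elixir sketch of that placement rule (it assumes nothing beyond the rem(block_idx, attention_every_n) == attention_every_n - 1 check documented under nemotron_block/3 below):

defmodule LayerPlan do
  # Returns :attention or :mamba2 for a 0-indexed block position,
  # following the documented placement rule.
  def block_type(block_idx, attention_every_n) do
    if rem(block_idx, attention_every_n) == attention_every_n - 1,
      do: :attention,
      else: :mamba2
  end
end

# With the defaults (32 layers, attention every 10th layer), attention
# lands at indices 9, 19, and 29 -- 3 of 32 layers, roughly 10%:
for i <- 0..31, LayerPlan.block_type(i, 10) == :attention, do: i
#=> [9, 19, 29]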
Architecture
Input [batch, seq_len, embed_dim]
|
v
[Shared Embedding Projection]
|
v
+========================================+
| Layer 0 (Mamba2) |
| RMSNorm -> Mamba2 SSD -> Residual |
| RMSNorm -> SwiGLU FFN -> Residual |
+========================================+
|
... (Mamba2 layers 1-8)
|
+========================================+
| Layer 9 (Attention) |
| RMSNorm -> MultiHead Attn -> Residual |
| RMSNorm -> SwiGLU FFN -> Residual |
+========================================+
|
... (pattern repeats)
|
v
[Final RMSNorm]
|
v
[Output Projection (tied weights)]
|
v
Output [batch, hidden_dim]

Mamba2 (SSD) Blocks
Use State Space Duality from Mamba-2; a minimal recurrence sketch follows this list:
- Chunked matmul for tensor core utilization
- Selective state space with input-dependent parameters
- Depthwise convolution + gating
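For intuition, here is the recurrence that SSD ultimately computes, in a deliberately naive one-channel Elixir form (the real Mamba2 kernel evaluates it via chunked matmuls for tensor cores; all names here are illustrative, not this module's internals):

defmodule SelectiveSSMSketch do
  # h_t = a_t * h_(t-1) + b_t * x_t;  y_t = c_t * h_t
  # a/b/c vary per step because they are input-dependent ("selective").
  def scan(xs, a_list, b_list, c_list) do
    [xs, a_list, b_list, c_list]
    |> Enum.zip()
    |> Enum.map_reduce(0.0, fn {x, a, b, c}, h ->
      h = a * h + b * x
      {c * h, h}
    end)
    |> elem(0)
  end
end

SelectiveSSMSketch.scan([1.0, 2.0, 3.0], [0.9, 0.9, 0.9], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0])
#=> [1.0, 2.9, 5.61]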
Attention Blocks
Standard multi-head attention with:
- Grouped Query Attention (optional; a head-sharing sketch follows this list)
- RoPE position embeddings (optional)
- Causal masking
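A small Nx sketch of the key/value head sharing that GQA implies, using the documented defaults (16 query heads, 4 KV heads); the shapes are illustrative, not this module's internals:

# [batch, kv_heads, seq, head_dim] -> [batch, heads, seq, head_dim]
# by repeating each of the 4 KV heads across 16 / 4 = 4 query heads.
k = Nx.iota({1, 4, 60, 64}, type: :f32)

k
|> Nx.new_axis(2)              # {1, 4, 1, 60, 64}
|> Nx.tile([1, 1, 4, 1, 1])    # {1, 4, 4, 60, 64}
|> Nx.reshape({1, 16, 60, 64}) # each KV head now serves 4 query heads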
Usage
model = NemotronH.build(
  embed_dim: 287,
  hidden_dim: 2048,
  num_layers: 32,
  attention_every_n: 10,
  num_heads: 16
)
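To run the built model, the standard Axon build/init/predict flow should apply (a hedged sketch: the template matches the call above, and passing a bare tensor assumes the model has a single input):

{init_fn, predict_fn} = Axon.build(model)
params = init_fn.(Nx.template({1, 60, 287}, :f32), %{})

input = Nx.broadcast(0.0, {1, 60, 287})
predict_fn.(params, input)
#=> tensor of shape {1, 2048}, per build/1's return contract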
References
- Paper: "Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Language Models" (NVIDIA, 2025)
- Mamba-2: "Transformers are SSMs" (Dao & Gu, 2024)
Summary
Functions
- Build a Nemotron-H hybrid model.
- Build an attention block with RMSNorm and SwiGLU FFN.
- Build a Mamba2 (SSD) block with RMSNorm and SwiGLU FFN.
- Build a single Nemotron-H block.
- Get the output size of a Nemotron-H model.
- Calculate approximate parameter count for a Nemotron-H model.
- Recommended default configuration for Nemotron-H.
- Get small model configuration (for testing/prototyping).
Types
@type build_opt() ::
        {:embed_dim, pos_integer()}
        | {:hidden_dim, pos_integer()}
        | {:num_layers, pos_integer()}
        | {:attention_every_n, pos_integer()}
        | {:num_heads, pos_integer()}
        | {:num_kv_heads, pos_integer()}
        | {:mamba_d_state, pos_integer()}
        | {:mamba_d_conv, pos_integer()}
        | {:mamba_expand, pos_integer()}
        | {:dropout, float()}
        | {:rope, boolean()}
        | {:window_size, pos_integer()}
        | {:seq_len, pos_integer()}
Options for build/1.
Functions
Build a Nemotron-H hybrid model.
Options
- :embed_dim - Input embedding dimension (required)
- :hidden_dim - Model hidden dimension (default: 2048)
- :num_layers - Total number of layers (default: 32)
- :attention_every_n - Place attention at every Nth layer (default: 10)
- :num_heads - Number of attention heads (default: 16)
- :num_kv_heads - Number of KV heads for GQA (default: 4)
- :mamba_d_state - Mamba SSM state dimension (default: 64)
- :mamba_d_conv - Mamba convolution kernel size (default: 4)
- :mamba_expand - Mamba expansion factor (default: 2)
- :dropout - Dropout rate (default: 0.0)
- :rope - Apply RoPE to attention layers (default: false)
- :window_size / :seq_len - Expected sequence length (default: 60)
Returns
An Axon model that outputs [batch, hidden_dim].
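For example, a hedged way to seed these options from recommended_defaults/0 (Keyword.merge is standard Elixir; the override values here are arbitrary):

opts = Keyword.merge(NemotronH.recommended_defaults(), embed_dim: 287, rope: true)
model = NemotronH.build(opts)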
Build an attention block with RMSNorm and SwiGLU FFN.
Architecture:
RMSNorm -> MultiHead Attention -> (residual handled by caller)
RMSNorm -> SwiGLU FFN -> (residual handled by caller)

Options
Same as build/1, plus :layer_idx for naming.
Build a Mamba2 (SSD) block with RMSNorm and SwiGLU FFN.
Architecture:
RMSNorm -> Mamba2 -> (residual handled by caller)
RMSNorm -> SwiGLU FFN -> (residual handled by caller)

Options
Same as build/1, plus :layer_idx for naming.
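Since both block types leave residuals to the caller, the enclosing wiring presumably follows the pre-norm pattern shown in the diagram above. A hedged Axon sketch, where rms_norm/1, mamba2_mixer/2, and swiglu_ffn/2 are hypothetical stand-in names rather than this module's API:

# Caller-side residual adds around the two pre-norm sub-blocks.
x = Axon.add(x, x |> rms_norm() |> mamba2_mixer(opts))
x = Axon.add(x, x |> rms_norm() |> swiglu_ffn(opts))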
@spec nemotron_block(Axon.t(), non_neg_integer(), keyword()) :: Axon.t()
Build a single Nemotron-H block.
Dispatches to either Mamba2 or attention based on the block index.
Attention blocks are placed at positions where rem(block_idx, attention_every_n) == attention_every_n - 1; with the defaults, these are block indices 9, 19, and 29.
Parameters
- input - Input Axon node
- block_idx - 0-indexed block position
- opts - Model options
Returns
Block output (before residual connection).
@spec output_size(keyword()) :: pos_integer()
Get the output size of a Nemotron-H model.
@spec param_count(keyword()) :: pos_integer()
Calculate approximate parameter count for a Nemotron-H model.
@spec recommended_defaults() :: keyword()
Recommended default configuration for Nemotron-H.
@spec small_config() :: keyword()
Get small model configuration (for testing/prototyping).
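As a hypothetical quick start, the small config can seed a test build (embed_dim is required by build/1, so it is merged in here):

model = NemotronH.build(Keyword.merge(NemotronH.small_config(), embed_dim: 64))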