Edifice.Transformer.NemotronH (Edifice v0.2.0)


Nemotron-H: NVIDIA's Hybrid Mamba-Transformer Architecture.

Nemotron-H is a hybrid language model that combines 90% Mamba2 (SSD) layers with 10% full attention layers. This design achieves Transformer-level quality while retaining the linear-time inference cost of the SSM components.

Key Innovation: Hybrid Layer Mixing

Rather than using all-attention or all-SSM, Nemotron-H interleaves them:

  • 90% of layers use Mamba2 (State Space Duality) for efficient linear-time processing
  • 10% of layers use full multi-head attention for global reasoning
  • Attention blocks are placed at regular intervals (every 10th layer by default); the sketch below shows the resulting schedule
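
The schedule follows directly from the block index. A minimal sketch in plain Elixir (the rule matches the placement condition documented for nemotron_block/3 below; the variable names are illustrative):

num_layers = 32
attention_every_n = 10

schedule =
  for idx <- 0..(num_layers - 1) do
    if rem(idx, attention_every_n) == attention_every_n - 1, do: :attention, else: :mamba2
  end

# attention lands at 0-indexed positions 9, 19, 29; every other position is a Mamba2 (SSD) block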

Architecture

Input [batch, seq_len, embed_dim]
      |
      v
[Shared Embedding Projection]
      |
      v
+========================================+
|            Layer 0 (Mamba2)            |
|  RMSNorm -> Mamba2 SSD -> Residual     |
|  RMSNorm -> SwiGLU FFN -> Residual     |
+========================================+
      |
     ... (Mamba2 layers 1-8)
      |
+========================================+
|           Layer 9 (Attention)          |
|  RMSNorm -> MultiHead Attn -> Residual |
|  RMSNorm -> SwiGLU FFN -> Residual     |
+========================================+
      |
     ... (pattern repeats)
      |
      v
[Final RMSNorm]
      |
      v
[Output Projection (tied weights)]
      |
      v
Output [batch, hidden_dim]

Mamba2 (SSD) Blocks

These blocks use the State Space Duality (SSD) formulation from Mamba-2:

  • Chunked matmul for tensor core utilization
  • Selective state space with input-dependent parameters
  • Depthwise convolution + gating
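
For intuition, the selective recurrence behind SSD is h_t = a_t * h_(t-1) + b_t * x_t with readout y_t = c_t . h_t, where a_t, b_t, c_t are computed from the input. A minimal eager-mode Nx sketch of that sequential form follows; the actual blocks compute the equivalent chunked matmul (dual) form for tensor core utilization, and all names and shapes here are illustrative, not Edifice APIs:

seq_len = 8
d_state = 4

# input-dependent parameters, one set per time step (constants here for brevity)
x = Nx.iota({seq_len}, type: :f32)
a = Nx.broadcast(0.9, {seq_len, d_state})
b = Nx.broadcast(0.1, {seq_len, d_state})
c = Nx.broadcast(1.0, {seq_len, d_state})

{_h, ys} =
  Enum.reduce(0..(seq_len - 1), {Nx.broadcast(0.0, {d_state}), []}, fn t, {h, ys} ->
    # h_t = a_t * h_(t-1) + b_t * x_t
    h = Nx.add(Nx.multiply(a[t], h), Nx.multiply(b[t], x[t]))
    # y_t = c_t . h_t
    y = Nx.sum(Nx.multiply(c[t], h))
    {h, [y | ys]}
  end)

outputs = ys |> Enum.reverse() |> Nx.stack()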

Attention Blocks

Standard multi-head attention with:

  • Grouped Query Attention (optional)
  • RoPE position embeddings (optional)
  • Causal masking
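
With the defaults documented under build/1 (num_heads: 16, num_kv_heads: 4), grouped query attention shares each KV head across a fixed group of query heads; the ratio is a quick arithmetic check:

# 16 query heads / 4 KV heads: each KV head serves 4 query heads
queries_per_kv_head = div(16, 4)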

Usage

alias Edifice.Transformer.NemotronH

model = NemotronH.build(
  embed_dim: 287,
  hidden_dim: 2048,
  num_layers: 32,
  attention_every_n: 10,
  num_heads: 16
)
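
Once built, the graph is initialized and run with the standard Axon workflow. A hedged sketch (the zero-valued input and the shapes are assumptions based on the dimensions above; with seq_len defaulting to 60 the input is [batch, seq_len, embed_dim] and the output is [batch, hidden_dim]):

{init_fn, predict_fn} = Axon.build(model)

params = init_fn.(Nx.template({1, 60, 287}, :f32), %{})
output = predict_fn.(params, Nx.broadcast(0.0, {1, 60, 287}))

# output shape: {1, 2048}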

References

  • Paper: "Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Language Models" (NVIDIA, 2025)
  • Mamba-2: "Transformers are SSMs" (Gu & Dao, 2024)

Summary

Types

build_opt()
    Options for build/1.

Functions

build(opts \\ [])
    Build a Nemotron-H hybrid model.

build_attention_block(input, opts)
    Build an attention block with RMSNorm and SwiGLU FFN.

build_mamba_block(input, opts)
    Build a Mamba2 (SSD) block with RMSNorm and SwiGLU FFN.

nemotron_block(input, block_idx, opts)
    Build a single Nemotron-H block.

output_size(opts \\ [])
    Get the output size of a Nemotron-H model.

param_count(opts)
    Calculate approximate parameter count for a Nemotron-H model.

Recommended default configuration for Nemotron-H.

small_config()
    Get small model configuration (for testing/prototyping).

Types

build_opt()

@type build_opt() ::
  {:embed_dim, pos_integer()}
  | {:hidden_dim, pos_integer()}
  | {:num_layers, pos_integer()}
  | {:attention_every_n, pos_integer()}
  | {:num_heads, pos_integer()}
  | {:num_kv_heads, pos_integer()}
  | {:mamba_d_state, pos_integer()}
  | {:mamba_d_conv, pos_integer()}
  | {:mamba_expand, pos_integer()}
  | {:dropout, float()}
  | {:rope, boolean()}
  | {:window_size, pos_integer()}
  | {:seq_len, pos_integer()}

Options for build/1.

Functions

build(opts \\ [])

@spec build([build_opt()]) :: Axon.t()

Build a Nemotron-H hybrid model.

Options

  • :embed_dim - Input embedding dimension (required)
  • :hidden_dim - Model hidden dimension (default: 2048)
  • :num_layers - Total number of layers (default: 32)
  • :attention_every_n - Place attention at every Nth layer (default: 10)
  • :num_heads - Number of attention heads (default: 16)
  • :num_kv_heads - Number of KV heads for GQA (default: 4)
  • :mamba_d_state - Mamba SSM state dimension (default: 64)
  • :mamba_d_conv - Mamba convolution kernel size (default: 4)
  • :mamba_expand - Mamba expansion factor (default: 2)
  • :dropout - Dropout rate (default: 0.0)
  • :rope - Apply RoPE to attention layers (default: false)
  • :window_size / :seq_len - Expected sequence length (default: 60)

Returns

An Axon model that outputs [batch, hidden_dim].
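
For reference, a build call that spells out every documented option at its stated default (embed_dim has no default; the value here is arbitrary):

model = NemotronH.build(
  embed_dim: 287,
  hidden_dim: 2048,
  num_layers: 32,
  attention_every_n: 10,
  num_heads: 16,
  num_kv_heads: 4,
  mamba_d_state: 64,
  mamba_d_conv: 4,
  mamba_expand: 2,
  dropout: 0.0,
  rope: false,
  seq_len: 60
)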

build_attention_block(input, opts)

@spec build_attention_block(
  Axon.t(),
  keyword()
) :: Axon.t()

Build an attention block with RMSNorm and SwiGLU FFN.

Architecture:

  RMSNorm -> MultiHead Attention -> (residual handled by caller)
  RMSNorm -> SwiGLU FFN -> (residual handled by caller)

Options

Same as build/1, plus :layer_idx for naming.

build_mamba_block(input, opts)

@spec build_mamba_block(
  Axon.t(),
  keyword()
) :: Axon.t()

Build a Mamba2 (SSD) block with RMSNorm and SwiGLU FFN.

Architecture:

  RMSNorm -> Mamba2 -> (no residual here, handled by caller)
  RMSNorm -> SwiGLU FFN -> (no residual here)

Options

Same as build/1, plus :layer_idx for naming.

nemotron_block(input, block_idx, opts)

@spec nemotron_block(Axon.t(), non_neg_integer(), keyword()) :: Axon.t()

Build a single Nemotron-H block.

Dispatches to either Mamba2 or attention based on the block index. Attention blocks are placed at positions where rem(block_idx, attention_every_n) == attention_every_n - 1.

Parameters

  • input - Input Axon node
  • block_idx - 0-indexed block position
  • opts - Model options

Returns

Block output (before residual connection).
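
build/1 performs this wiring internally. Purely to illustrate the dispatch rule, here is a hedged sketch of stacking blocks by hand; the input name/shape and the single residual add per block are assumptions, not the library's exact wiring:

hidden = Axon.input("hidden_states", shape: {nil, 60, 2048})

stacked =
  Enum.reduce(0..31, hidden, fn idx, acc ->
    block = NemotronH.nemotron_block(acc, idx, hidden_dim: 2048, attention_every_n: 10)
    # residual added by the caller, as noted above
    Axon.add(acc, block)
  end)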

output_size(opts \\ [])

@spec output_size(keyword()) :: pos_integer()

Get the output size of a Nemotron-H model.

param_count(opts)

@spec param_count(keyword()) :: pos_integer()

Calculate approximate parameter count for a Nemotron-H model.

small_config()

@spec small_config() :: keyword()

Get small model configuration (for testing/prototyping).
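
A hedged usage sketch for the configuration and sizing helpers above (returned values depend on the configuration, so they are only indicated in comments):

opts = NemotronH.small_config()

model = NemotronH.build(Keyword.put(opts, :embed_dim, 287))

NemotronH.output_size(opts)   # => the configured hidden dimension
NemotronH.param_count(opts)   # => approximate parameter count for that configuration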