Edifice.SSM.MambaSSD (Edifice v0.2.0)

Mamba variant using the State Space Duality (SSD) algorithm from Mamba-2.

SSD Algorithm

The key insight: SSM computation can be decomposed into matrix multiplications that leverage tensor cores (10-20x faster than scalar operations).

Algorithm Steps

1. Split into chunks: Divide the sequence into chunks of size C
2. Intra-chunk (matmul): Compute outputs within each chunk using dense matmuls
   - This uses tensor cores!
   - O(C²) work per chunk, but highly parallel
3. Inter-chunk (scan): Run a small sequential scan over chunk boundaries
   - Only L/C elements to scan
4. Combine: Merge chunk outputs with inter-chunk states
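The four steps can be sketched on a toy scalar SSM, h_t = a_t * h_{t-1} + b_t * x_t with y_t = c_t * h_t. This is a minimal illustration in plain Elixir lists, not Edifice's implementation: the module `SSDSketch`, its functions, and the `{a, b, c, x}` step tuples are all hypothetical names chosen for this example, and the intra-chunk part is written as explicit sums rather than the tensor matmuls a real implementation would use.

```elixir
defmodule SSDSketch do
  # Hypothetical toy model (not part of Edifice):
  #   h_t = a_t * h_{t-1} + b_t * x_t,   y_t = c_t * h_t
  # Each step is a tuple {a, b, c, x} of floats.

  # Reference path: plain sequential recurrence.
  def sequential(steps, h0 \\ 0.0) do
    {ys, _h} =
      Enum.map_reduce(steps, h0, fn {a, b, c, x}, h ->
        h = a * h + b * x
        {c * h, h}
      end)

    ys
  end

  # Chunked path: split, solve each chunk in closed form, carry one state.
  def chunked(steps, chunk_size, h0 \\ 0.0) do
    {ys, _h} =
      steps
      # Step 1: split into chunks of size C.
      |> Enum.chunk_every(chunk_size)
      # Step 3: the only sequential part is this scan over chunks.
      |> Enum.map_reduce(h0, fn chunk, h_prev ->
        hs = chunk_states(chunk, h_prev)
        ys = Enum.zip_with(chunk, hs, fn {_a, _b, c, _x}, h -> c * h end)
        {ys, List.last(hs)}
      end)

    Enum.concat(ys)
  end

  # Steps 2 and 4: every position i in the chunk is computed independently.
  defp chunk_states(chunk, h_prev) do
    as = Enum.map(chunk, &elem(&1, 0))
    indexed = Enum.with_index(chunk)

    for i <- 0..(length(chunk) - 1) do
      # Intra-chunk term: a lower-triangular (j <= i) decay-weighted sum.
      local =
        for {{_a, b, _c, x}, j} <- indexed, j <= i, reduce: 0.0 do
          acc -> acc + decay(as, j + 1, i) * b * x
        end

      # Combine with the state entering the chunk, decayed up to position i.
      local + decay(as, 0, i) * h_prev
    end
  end

  # Product of a_k for k in from..to (inclusive); 1.0 for an empty range.
  defp decay(_as, from, to) when from > to, do: 1.0
  defp decay(as, from, to), do: as |> Enum.slice(from..to) |> Enum.product()
end

steps = [
  {0.9, 1.0, 0.5, 1.0},
  {0.8, 0.5, 1.0, 2.0},
  {0.7, 1.0, 1.0, -1.0},
  {0.6, 2.0, 0.5, 0.5},
  {0.5, 1.0, 2.0, 1.0}
]

seq = SSDSketch.sequential(steps)
chk = SSDSketch.chunked(steps, 2)

# Both paths agree up to floating-point rounding.
true = Enum.zip(seq, chk) |> Enum.all?(fn {s, c} -> abs(s - c) < 1.0e-9 end)
```

Within a chunk no position depends on another position's result, which is what lets a tensor implementation replace the inner loops with dense matmuls.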

Complexity

Intra-chunk: O(L/C × C²) = O(L × C) work, but tensor core accelerated
Inter-chunk: O(L/C) sequential work (tiny)
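Total: O(L × C) work, more than the O(L) of a pure scan, but typically much faster in practice because the dominant term runs as dense matmuls on tensor cores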
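To make the trade-off in chunk size concrete, the counts above can be tabulated for a given L and C. This is only a back-of-the-envelope sketch: `SSDCost.estimate/2` is a hypothetical helper, not an Edifice function, and it counts abstract work units rather than measured FLOPs or time.

```elixir
defmodule SSDCost do
  @doc "Rough operation counts for sequence length `len` and chunk size `chunk`."
  def estimate(len, chunk) when len > 0 and chunk > 0 do
    # Round up so a trailing partial chunk still counts as one chunk.
    num_chunks = div(len + chunk - 1, chunk)

    %{
      intra_chunk_work: num_chunks * chunk * chunk,
      inter_chunk_steps: num_chunks
    }
  end
end

# L = 2048, C = 64: 32 chunks, each doing 64 * 64 units of intra-chunk work.
%{intra_chunk_work: 131_072, inter_chunk_steps: 32} = SSDCost.estimate(2048, 64)

# Larger chunks shorten the sequential scan but raise the matmul work.
%{intra_chunk_work: 524_288, inter_chunk_steps: 8} = SSDCost.estimate(2048, 256)
```

Which chunk size is fastest depends on the hardware, so these counts are a guide to the shape of the trade-off rather than a tuning rule.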

Training Mode

When training_mode: true is set, the SSD algorithm uses a matrix-multiplication formulation optimized for tensor cores:

y = (L ⊙ (C @ B^T)) @ x + cumsum(A) @ h_prev

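Where L is a lower-triangular (causal) mask over the positions within a chunk (not the sequence length L used under Complexity), and h_prev is the state carried in from the previous chunk. This formulation: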

Uses dense matmuls for tensor core utilization
Computes all positions in parallel within each chunk
Is significantly faster for batched training
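For reference, the schematic formula above can be unrolled from the per-step recurrence. The block below is a sketch under stated assumptions rather than a description of Edifice's internals: it assumes a scalar decay a_k per step (in Mamba-2 notation, a_k = exp(Δ_k A), so products of a_k become exponentials of a cumulative sum), a chunk of length T indexed from 1, and h_prev as the state entering the chunk.

```latex
% Recurrence within one chunk, i = 1..T, with h_0 = h_prev:
%   h_i = a_i h_{i-1} + B_i x_i,   y_i = C_i^T h_i
\begin{aligned}
y_i &= \sum_{j=1}^{i} L_{ij}\,\bigl(C_i^{\top} B_j\bigr)\,x_j
      \;+\; \Bigl(\prod_{k=1}^{i} a_k\Bigr)\, C_i^{\top} h_{\text{prev}}, \\
L_{ij} &=
  \begin{cases}
    \prod_{k=j+1}^{i} a_k & i \ge j \\
    0 & i < j
  \end{cases}
\end{aligned}
```

Under these assumptions the mask is not binary: its entry at (i, j) carries the decay accumulated between positions j and i, with ones on the diagonal.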

For inference, use training_mode: false (the default), which uses efficient scans with O(1) memory per step.

Current Performance

Note: The XLA implementation has limitations compared to fused CUDA kernels. For production training, consider using a custom Triton kernel.

Usage

alias Edifice.SSM.MambaSSD

# Training (matmul formulation)
model = MambaSSD.build(embed_dim: 287, hidden_size: 256, training_mode: true)

# Inference (scan formulation, the default)
model = MambaSSD.build(embed_dim: 287, hidden_size: 256, training_mode: false)

Summary

Types

build_opt()

Options for build/1.

Functions

build(opts \\ [])

Build an SSD Mamba model.

output_size(opts \\ [])

See Edifice.SSM.Common.output_size/1.

param_count(opts)

See Edifice.SSM.Common.param_count/1.

recommended_defaults()

Get recommended defaults for real-time sequence processing (60fps).

training_defaults()

Get training-optimized defaults.