GPT-style decoder-only transformer with GQA + RoPE + SwiGLU + RMSNorm.
Combines modern LLM techniques into a single decoder-only transformer:
- Grouped Query Attention (GQA) for efficient KV cache
- Rotary Position Embeddings (RoPE) for position encoding
- SwiGLU gated feed-forward network
- RMSNorm for faster normalization (this and the SwiGLU FFN are sketched below)
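The normalization and feed-forward pieces are small enough to sketch directly. Below is a minimal, illustrative Nx.Defn version of the RMSNorm and SwiGLU sub-blocks used inside each decoder layer; the weight names and epsilon value are assumptions, not this module's internals.

defmodule BlockSketch do
  import Nx.Defn

  @eps 1.0e-6

  # RMSNorm: scale by the inverse root mean square over the hidden dim.
  # Unlike LayerNorm there is no mean subtraction and no bias term,
  # which is what makes it cheaper.
  # x: [batch, seq, hidden], gamma: [hidden]
  defn rms_norm(x, gamma) do
    ms = Nx.mean(x * x, axes: [-1], keep_axes: true)
    x * Nx.rsqrt(ms + @eps) * gamma
  end

  # SwiGLU FFN: silu(x W_gate) gates (x W_up), then project back down.
  # w_gate, w_up: [hidden, inner]; w_down: [inner, hidden]
  defn swiglu_ffn(x, w_gate, w_up, w_down) do
    a = Nx.dot(x, w_gate)
    gated = a * Nx.sigmoid(a) * Nx.dot(x, w_up)
    Nx.dot(gated, w_down)
  end
end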
Attention Variants
The :attention_type option allows switching between attention mechanisms:
- :gqa (default) — Grouped Query Attention with RoPE
- :lightning — Lightning Attention (hybrid linear/softmax block attention)
- :dual_chunk — Dual Chunk Attention (intra-chunk + inter-chunk for long contexts)
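To make the :gqa variant concrete, here is a hedged Nx.Defn sketch of grouped-query attention. Shapes, the causal-mask handling, and the head-expansion strategy are illustrative only, and RoPE is omitted. With num_heads: 8 and num_kv_heads: 2, each key/value head is shared by 4 query heads, so the KV cache is a quarter the size of standard multi-head attention.

defmodule GQASketch do
  import Nx.Defn

  # q: [batch, num_heads, seq, head_dim]
  # k, v: [batch, num_kv_heads, seq, head_dim]
  defn grouped_attention(q, k, v) do
    batch = Nx.axis_size(q, 0)
    num_heads = Nx.axis_size(q, 1)
    seq = Nx.axis_size(q, 2)
    head_dim = Nx.axis_size(q, 3)
    num_kv_heads = Nx.axis_size(k, 1)
    groups = div(num_heads, num_kv_heads)

    # Repeat-interleave KV heads so query head i reads KV head div(i, groups)
    k =
      k
      |> Nx.new_axis(2)
      |> Nx.broadcast({batch, num_kv_heads, groups, seq, head_dim})
      |> Nx.reshape({batch, num_heads, seq, head_dim})

    v =
      v
      |> Nx.new_axis(2)
      |> Nx.broadcast({batch, num_kv_heads, groups, seq, head_dim})
      |> Nx.reshape({batch, num_heads, seq, head_dim})

    # Standard causal scaled dot-product attention on the expanded heads
    scores = Nx.dot(q, [3], [0, 1], k, [3], [0, 1]) / Nx.sqrt(head_dim)
    causal = Nx.greater(Nx.iota({seq, seq}, axis: 1), Nx.iota({seq, seq}, axis: 0))
    scores = Nx.select(causal, Nx.Constants.neg_infinity(), scores)

    weights = Nx.exp(scores - Nx.reduce_max(scores, axes: [-1], keep_axes: true))
    weights = weights / Nx.sum(weights, axes: [-1], keep_axes: true)

    Nx.dot(weights, [3], [0, 1], v, [2], [0, 1])
  end
end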
Architecture
Input [batch, seq_len, embed_dim]
  |
Input projection to hidden_size
  |
+--------------------------------------+
|  Decoder Block (x num_layers)        |
|                                      |
|  RMSNorm -> Attention                |
|    (GQA / Lightning / DualChunk)     |
|    + RoPE on Q and K (GQA only)      |
|  -> Residual                         |
|  RMSNorm -> SwiGLU FFN               |
|  -> Residual                         |
+--------------------------------------+
  |
Final LayerNorm
  |
Last timestep -> [batch, hidden_size]

Usage
# Default GQA attention
model = DecoderOnly.build(
  embed_dim: 287,
  hidden_size: 256,
  num_heads: 8,
  num_kv_heads: 2,
  num_layers: 6
)
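The result is an ordinary Axon graph, so it can be initialized and run the usual way. A rough sketch, assuming the model exposes a single input and the shapes above:

# Compile the graph into init/predict functions
{init_fn, predict_fn} = Axon.build(model)

# Initialize parameters from an input template: [batch, seq_len, embed_dim]
template = Nx.template({1, 60, 287}, :f32)
params = init_fn.(template, %{})

# Forward pass -> [batch, hidden_size] taken from the last position
input = Nx.iota({4, 60, 287}, type: :f32)
output = predict_fn.(params, input)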

# Lightning Attention for subquadratic complexity
model = DecoderOnly.build(
  embed_dim: 287,
  hidden_size: 256,
  num_heads: 8,
  num_layers: 6,
  attention_type: :lightning,
  block_size: 64
)

# Dual Chunk Attention for long contexts
model = DecoderOnly.build(
  embed_dim: 287,
  hidden_size: 256,
  num_heads: 8,
  num_layers: 6,
  attention_type: :dual_chunk,
  chunk_size: 64
)

References
- GPT-2/3 decoder-only architecture (Radford et al., 2019; Brown et al., 2020)
- LLaMA architecture combining GQA + RoPE + SwiGLU + RMSNorm (Touvron et al., 2023)
- Lightning Attention-2 (Qin et al., 2024)
- DeepSeek/Qwen2.5 Dual Chunk Attention (2024)
Summary

Functions

build/1
Build a GPT-style decoder-only transformer model.

output_size/1
Get the output size of a decoder-only model.

Types

@type build_opt() ::
        {:embed_dim, pos_integer()}
        | {:hidden_size, pos_integer()}
        | {:num_heads, pos_integer()}
        | {:num_kv_heads, pos_integer()}
        | {:num_layers, pos_integer()}
        | {:attention_type, :gqa | :lightning | :dual_chunk}
        | {:block_size, pos_integer()}
        | {:chunk_size, pos_integer()}
        | {:use_rope, boolean()}
        | {:interleave_rope, boolean()}
        | {:yarn, boolean()}
        | {:yarn_scale, number()}
        | {:yarn_original_max_position, pos_integer()}
        | {:dropout, float()}
        | {:window_size, pos_integer()}

Options for build/1.
Functions
Build a GPT-style decoder-only transformer model.
Options
- :embed_dim - Size of input embedding per timestep (required)
- :hidden_size - Internal hidden dimension (default: 256)
- :num_heads - Number of query heads (default: 8)
- :num_kv_heads - Number of key/value heads for GQA (default: 2)
- :num_layers - Number of decoder blocks (default: 4)
- :attention_type - Attention mechanism to use (default: :gqa)
  - :gqa — Grouped Query Attention with optional RoPE
  - :lightning — Lightning Attention (hybrid linear/softmax block attention)
  - :dual_chunk — Dual Chunk Attention (intra + inter-chunk for long contexts)
- :block_size - Block size for Lightning Attention (default: 64)
- :chunk_size - Chunk size for Dual Chunk Attention (default: 64)
- :use_rope - Apply Rotary Position Embeddings, GQA only (default: true)
- :interleave_rope - When true, even-indexed layers (0, 2, 4, ...) use RoPE and odd-indexed layers (1, 3, 5, ...) use NoPE (content-only attention). Overrides :use_rope on a per-layer basis. This is the iRoPE pattern used by Llama 4. (default: false)
- :yarn - Enable YaRN context extension for longer sequences (default: false); see the configuration sketch after this list
- :yarn_scale - YaRN scaling factor, e.g., 8 extends 2048 to 16384 (default: 8)
- :yarn_original_max_position - Original trained context length (default: 2048)
- :dropout - Dropout rate (default: 0.1)
- :window_size - Expected sequence length for JIT optimization (default: 60)
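For instance, the interleaved-RoPE and YaRN options compose with the default GQA path. A configuration sketch; the dimension values are illustrative, only the option names come from the list above:

# iRoPE layer interleaving plus YaRN context extension (2048 -> 16384)
model = DecoderOnly.build(
  embed_dim: 287,
  hidden_size: 256,
  num_heads: 8,
  num_kv_heads: 2,
  num_layers: 6,
  interleave_rope: true,
  yarn: true,
  yarn_scale: 8,
  yarn_original_max_position: 2048
)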
Returns
An Axon model that outputs [batch, hidden_size] from the last position.
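Because the returned model ends in the last position's hidden state, a task head can be stacked on top with ordinary Axon layers. A minimal sketch; the vocabulary size and layer name are illustrative, not part of this module:

# Project [batch, hidden_size] to next-token logits
model = DecoderOnly.build(embed_dim: 287, hidden_size: 256)
logits = Axon.dense(model, 32_000, name: "lm_head")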
@spec output_size(keyword()) :: pos_integer()
Get the output size of a decoder-only model.
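A usage sketch, assuming the reported size mirrors the :hidden_size option (which matches the [batch, hidden_size] output shape above):

DecoderOnly.output_size(hidden_size: 256)
#=> 256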