Edifice.Attention.RWKV (Edifice v0.2.0)


RWKV-7 "Goose": Linear attention with O(1) space complexity.

RWKV (Receptance Weighted Key Value) is a linear attention architecture that combines the parallelizable training of Transformers with the efficient O(1) inference of RNNs.

Key Innovation: Generalized Delta Rule

RWKV-7 uses a generalized delta rule for its recurrent state update, giving it expressive power beyond the TC0 constraint that bounds standard Transformers and letting it handle state-tracking tasks they cannot.
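
For intuition, the classic delta rule writes a value into an associative state with a rank-one correction toward the target; RWKV-7 generalizes that update with learned per-channel decay. The snippet below is only a background sketch of the classic rule in Nx (hypothetical names, not this module's implementation):

defmodule DeltaRuleSketch do
  # state: [d_v, d_k] associative memory, k: [d_k] key, v: [d_v] value, beta: scalar rate
  def update(state, k, v, beta) do
    prediction = Nx.dot(state, k)              # value the state currently recalls for k
    error = Nx.subtract(v, prediction)         # delta toward the target value
    Nx.add(state, Nx.multiply(beta, Nx.outer(error, k)))
  end
end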

Architecture

Input [batch, seq_len, embed_dim]
      |
      v
+--------------------------------------+
|  RWKV Block                          |
|                                      |
|  +--------------------------------+  |
|  | Time-Mixing (WKV Attention)    |  |
|  | - R-gate: receptance           |  |
|  | - W: time decay                |  |
|  | - K, V: key-value pairs        |  |
|  | - time_first: first token bias |  |
|  +--------------------------------+  |
|                                      |
|  +--------------------------------+  |
|  | Channel-Mixing (FFN)           |  |
|  | - R-gate * K-gate              |  |
|  +--------------------------------+  |
+--------------------------------------+
      | (repeat for num_layers)
      v
[batch, hidden_size]

Complexity

| Phase     | Time          | Space |
|-----------|---------------|-------|
| Training  | O(L)          | O(L)  |
| Inference | O(1) per step | O(1)  |

Key Difference from Mamba

| Aspect    | RWKV                     | Mamba                  |
|-----------|--------------------------|------------------------|
| Attention | WKV (weighted key-value) | SSM (state space)      |
| State     | O(1) fixed size          | O(L) for full sequence |
| Decay     | Learned per-channel      | Input-dependent        |
| Gating    | R-gate, K-gate           | SiLU gating            |

Usage

alias Edifice.Attention.RWKV

model = RWKV.build(
  embed_dim: 287,
  hidden_size: 256,
  num_layers: 6
)
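
To run the built graph, a minimal Axon sketch (the batch size, sequence length, and :f32 type below are assumptions for illustration, not requirements stated by this module):

{init_fn, predict_fn} = Axon.build(model)

# Dummy [batch, seq_len, embed_dim] input matching the options above.
input = Nx.broadcast(0.0, {1, 60, 287})
params = init_fn.(Nx.template({1, 60, 287}, :f32), %{})

output = predict_fn.(params, input)
# expected output shape: {1, 256}  ([batch, hidden_size])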

Summary

Types

build_opt()
Options for build/1.

Functions

build(opts \\ [])
Build an RWKV-7 model for sequence processing.

build_channel_mixing(input, opts)
Build the Channel-Mixing sub-block (FFN with gating).

build_rwkv_block(input, opts)
Build a single RWKV block.

build_time_mixing(input, opts)
Build the Time-Mixing sub-block (WKV attention).

init_cache(opts \\ [])
Initialize hidden state for O(1) incremental inference.

output_size(opts \\ [])
Get the output size of an RWKV model.

param_count(opts)
Calculate approximate parameter count for an RWKV model.

Recommended default configuration for sequence processing.

Types

build_opt()

@type build_opt() ::
  {:dropout, float()}
  | {:embed_dim, pos_integer()}
  | {:head_size, pos_integer()}
  | {:hidden_size, pos_integer()}
  | {:num_layers, pos_integer()}
  | {:seq_len, pos_integer()}
  | {:window_size, pos_integer()}

Options for build/1.

Functions

build(opts \\ [])

@spec build([build_opt()]) :: Axon.t()

Build an RWKV-7 model for sequence processing.

Options

  • :embed_dim - Size of input embedding per timestep (required)
  • :hidden_size - Internal hidden dimension (default: 256)
  • :num_layers - Number of RWKV blocks (default: 6)
  • :head_size - Size per attention head (default: 64)
  • :dropout - Dropout rate (default: 0.1)
  • :window_size - Expected sequence length for JIT optimization (default: 60)

Returns

An Axon model that outputs [batch, hidden_size] from the last position.

build_channel_mixing(input, opts)

@spec build_channel_mixing(
  Axon.t(),
  keyword()
) :: Axon.t()

Build the Channel-Mixing sub-block (FFN with gating).

Channel-mixing uses a gated FFN structure:

output = sigmoid(r) * (k * v)

Where:

  • r: receptance gate
  • k: key (square activation)
  • v: value projection
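
A rough Nx sketch of that gating, assuming the common RWKV formulation where the squared-ReLU key activation feeds the value projection (all names below are illustrative, not the module's parameter names):

# Illustrative only: r_proj and k_proj are learned projections of the input,
# w_value is the value projection matrix.
channel_mix = fn r_proj, k_proj, w_value ->
  k = Nx.max(k_proj, 0)
  k = Nx.multiply(k, k)                     # squared-ReLU key activation
  v = Nx.dot(k, w_value)                    # value projection of the activated key
  Nx.multiply(Nx.sigmoid(r_proj), v)        # receptance gate
end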

build_rwkv_block(input, opts)

@spec build_rwkv_block(
  Axon.t(),
  keyword()
) :: Axon.t()

Build a single RWKV block.

Each block has two sub-blocks:

  1. Time-mixing: WKV attention mechanism
  2. Channel-mixing: Feed-forward with gating
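
Conceptually the block wires together as in the sketch below; the pre-norm residual placement is an assumption about the usual RWKV layout, not a guarantee about this module's internals:

# Hypothetical wiring of one block (pre-norm residuals assumed).
x = Axon.add(input, build_time_mixing(Axon.layer_norm(input), opts))
x = Axon.add(x, build_channel_mixing(Axon.layer_norm(x), opts))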

build_time_mixing(input, opts)

@spec build_time_mixing(
  Axon.t(),
  keyword()
) :: Axon.t()

Build the Time-Mixing sub-block (WKV attention).

Time-mixing implements the WKV (Weighted Key-Value) attention mechanism:

wkv[t] = (sum_{i<t} exp(w*(t-1-i) + k[i]) * v[i] + exp(u + k[t]) * v[t]) /
         (sum_{i<t} exp(w*(t-1-i) + k[i]) + exp(u + k[t]))

Where:

  • w: learned time decay (per head)
  • u: learned "time_first" bias for current token
  • k, v: keys and values from input
  • r: receptance gate

Output = sigmoid(r) * wkv
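
A direct per-channel transcription of that recurrence in plain Elixir, with toy scalar values; a real implementation vectorizes this over heads and channels:

# Toy single-channel WKV over a 3-step sequence (illustrative numbers).
w = -0.5              # learned time decay (more negative = faster forgetting)
u = 0.3               # "time_first" bonus applied to the current token
ks = [0.1, 0.2, 0.3]  # keys
vs = [1.0, 2.0, 3.0]  # values

wkv =
  for t <- 0..2 do
    {num, den} =
      Enum.reduce(0..(t - 1)//1, {0.0, 0.0}, fn i, {n, d} ->
        weight = :math.exp(w * (t - 1 - i) + Enum.at(ks, i))
        {n + weight * Enum.at(vs, i), d + weight}
      end)

    bonus = :math.exp(u + Enum.at(ks, t))
    (num + bonus * Enum.at(vs, t)) / (den + bonus)
  end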

init_cache(opts \\ [])

@spec init_cache(keyword()) :: map()

Initialize hidden state for O(1) incremental inference.

RWKV's key advantage: constant memory per inference step.

output_size(opts \\ [])

@spec output_size(keyword()) :: non_neg_integer()

Get the output size of an RWKV model.

param_count(opts)

@spec param_count(keyword()) :: non_neg_integer()

Calculate approximate parameter count for an RWKV model.
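
A hedged usage sketch, assuming param_count/1 accepts the same sizing options as build/1 (that assumption is not confirmed by this page):

# Hypothetical call; option names mirror build/1's documented options.
Edifice.Attention.RWKV.param_count(
  embed_dim: 287,
  hidden_size: 256,
  num_layers: 6
)
#=> approximate total parameter count (integer)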