Attention with Linear Biases (ALiBi).
ALiBi replaces positional embeddings with a simple linear bias added to the attention scores. Each attention head uses a different slope, giving it its own sensitivity to token distance. ALiBi extrapolates well to sequences longer than those seen in training, without any learned position parameters.
Formula
attention(Q, K) = softmax(QK^T / sqrt(d) + m * distance_matrix)

where m is a head-specific slope and distance_matrix[i, j] = -|i - j|.
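As a concrete illustration, the sketch below builds the bias term for a single head in plain Nx. It is only a sketch of the formula, not this module's implementation; seq_len, m, and scores are placeholder values.

# Pairwise distance bias for one head with slope m
seq_len = 4
m = 0.5

positions = Nx.iota({seq_len})

# distance_matrix[i, j] = -|i - j|
distance_matrix =
  positions
  |> Nx.new_axis(1)
  |> Nx.subtract(Nx.new_axis(positions, 0))
  |> Nx.abs()
  |> Nx.negate()

bias = Nx.multiply(m, distance_matrix)

# Stand-in for QK^T / sqrt(d); in practice these are the real attention scores
scores = Nx.broadcast(0.0, {seq_len, seq_len})

# The bias is added before the softmax
biased_scores = Nx.add(scores, bias)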
Slope Schedule
Slopes are geometric: m_i = 2^(-8i / n_heads) for i = 1..n_heads. Lower-indexed heads get steeper slopes (more local attention), higher-indexed heads get gentler slopes (more global attention).
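For example, with 8 heads the exponent reduces to -i, so the slopes are successive powers of 1/2. A quick check in plain Elixir (illustrative code, not this module's implementation):

n_heads = 8
slopes = Enum.map(1..n_heads, fn i -> :math.pow(2, -8 * i / n_heads) end)
# => [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125, 0.00390625]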
Usage
# Get ALiBi bias matrix for attention
bias = ALiBi.compute_bias(seq_len: 128, num_heads: 8)
# Add to attention scores before softmax
scores = Nx.add(scores, bias)
References
- "Train Short, Test Long" (Press et al., 2022)
- https://arxiv.org/abs/2108.12409
Functions
@spec compute_bias(keyword()) :: Nx.Tensor.t()
Compute ALiBi bias matrix for a given sequence length and number of heads.
Returns a bias tensor of shape [num_heads, seq_len, seq_len] to add to attention scores before the softmax.
Options
- :seq_len - Sequence length (required)
- :num_heads - Number of attention heads (required)
- :causal - Use causal (lower-triangular) distances (default: true)
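For example (the shape follows directly from the documented return value above):

iex> bias = ALiBi.compute_bias(seq_len: 4, num_heads: 2)
iex> Nx.shape(bias)
{2, 4, 4}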
@spec compute_slopes(pos_integer()) :: Nx.Tensor.t()
Compute ALiBi slopes for each attention head.
Returns a tensor of shape [num_heads] containing the geometric slopes.
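For example, with 8 heads the schedule from "Slope Schedule" above gives successive powers of 1/2 (assuming slopes are returned in head order, starting at i = 1):

iex> slopes = ALiBi.compute_slopes(8)
iex> Nx.to_flat_list(slopes)
[0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125, 0.00390625]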