Attention with Linear Biases (ALiBi).
ALiBi replaces positional embeddings with a simple linear bias added to the attention scores. Each attention head uses a different slope, giving it its own sensitivity to token distance. ALiBi extrapolates well to sequences longer than those seen in training, without any learned position parameters.
Formula
attention(Q, K) = softmax(QK^T / sqrt(d) + m * distance_matrix)

where m is a head-specific slope and distance_matrix[i, j] = -|i - j|.
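As a concrete illustration, the sketch below builds the bias term for a single head in plain Nx. It is only a sketch of the formula, not this module's implementation; seq_len, m, and scores are placeholder values.

# Pairwise distance bias for one head with slope m
seq_len = 4
m = 0.5

positions = Nx.iota({seq_len})

# distance_matrix[i, j] = -|i - j|
distance_matrix =
  positions
  |> Nx.new_axis(1)
  |> Nx.subtract(Nx.new_axis(positions, 0))
  |> Nx.abs()
  |> Nx.negate()

bias = Nx.multiply(m, distance_matrix)

# Stand-in for QK^T / sqrt(d); in practice these are the real attention scores
scores = Nx.broadcast(0.0, {seq_len, seq_len})

# The bias is added before the softmax
biased_scores = Nx.add(scores, bias)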
Slope Schedule
Slopes are geometric: m_i = 2^(-8i / n_heads) for i = 1..n_heads. Lower-indexed heads get steeper slopes (more local attention), higher-indexed heads get gentler slopes (more global attention).
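For example, with 8 heads the exponent reduces to -i, so the slopes are successive powers of 1/2. A quick check in plain Elixir (illustrative code, not this module's implementation):

n_heads = 8
slopes = Enum.map(1..n_heads, fn i -> :math.pow(2, -8 * i / n_heads) end)
# => [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125, 0.00390625]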
Usage
# Get ALiBi bias matrix for attention
bias = ALiBi.compute_bias(seq_len: 128, num_heads: 8)
# Add to attention scores before softmax
scores = Nx.add(scores, bias)
References
- "Train Short, Test Long" (Press et al., 2022)
- https://arxiv.org/abs/2108.12409
Functions
@spec compute_bias(keyword()) :: Nx.Tensor.t()
Compute ALiBi bias matrix for a given sequence length and number of heads.
Returns a bias tensor of shape [num_heads, seq_len, seq_len] to add to attention scores before the softmax.
Options
- :seq_len - Sequence length (required)
- :num_heads - Number of attention heads (required)
- :causal - Use causal (lower-triangular) distances (default: true)
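For example (the shape follows directly from the documented return value above):

iex> bias = ALiBi.compute_bias(seq_len: 4, num_heads: 2)
iex> Nx.shape(bias)
{2, 4, 4}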
@spec compute_slopes(pos_integer()) :: Nx.Tensor.t()
Compute ALiBi slopes for each attention head.
Returns a tensor of shape [num_heads] containing the geometric slopes.
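For example, with 8 heads the schedule from "Slope Schedule" above gives successive powers of 1/2 (assuming slopes are returned in head order, starting at i = 1):

iex> slopes = ALiBi.compute_slopes(8)
iex> Nx.to_flat_list(slopes)
[0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125, 0.00390625]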