RNoPE-SWA: Sliding Window Attention without positional encoding.
A minimalist attention mechanism that combines:
- Sliding Window Attention: Each position only attends to the last window_size positions
- No Positional Encoding: Pure content-based attention without position bias
Key Innovation
By removing positional encoding, the model learns purely content-based attention patterns. Combined with sliding window, this creates an efficient local attention mechanism that:
- Has O(L * W) complexity instead of O(L^2), where W = window_size
- Extrapolates to sequence lengths beyond those seen in training, since no learned or fixed positional encoding ties the model to a training length
- Forces the model to rely on content similarity rather than position heuristics (the mask sketch after this list shows the resulting attention pattern)
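The attention pattern can be made concrete with a minimal sketch, assuming Nx is available. The module and function names here are illustrative, not part of this library: position i attends to position j only when j <= i and i - j < window_size.

defmodule SWAMaskSketch do
  # Illustrative only: builds a {seq_len, seq_len} boolean mask where
  # entry {i, j} is true iff position i may attend to position j.
  def build(seq_len, window_size) do
    rows = Nx.iota({seq_len, seq_len}, axis: 0)
    cols = Nx.iota({seq_len, seq_len}, axis: 1)

    # Causal: j <= i. Windowed: i - j < window_size.
    Nx.logical_and(
      Nx.greater_equal(rows, cols),
      Nx.less(Nx.subtract(rows, cols), window_size)
    )
  end
end

# Each row has at most window_size ones, hence O(L * W) scored pairs.
SWAMaskSketch.build(6, 3)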
Architecture
Input [batch, seq_len, embed_dim]
|
v (no positional encoding)
+--------------------------------+
| Sliding Window Attention |
| |
| Each position attends to |
| last W positions only |
| Q, K, V projections |
| Attention(Q, K, V) |
| Output projection |
+--------------------------------+
|
[batch, seq_len, hidden_size]

When to Use
- Long sequences where full attention is too expensive
- Tasks where local context is most important (e.g., language modeling)
- When you want length generalization at inference time
- When you want to ablate the effect of positional encoding
Usage
model = RNoPESWA.build(
embed_dim: 256,
hidden_size: 256,
num_heads: 4,
window_size: 128,
num_layers: 6
)
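A sketch of driving the built model, assuming the standard Axon compile-and-run flow (the exact arguments to the init function vary across Axon versions):

{init_fn, predict_fn} = Axon.build(model)

# Initialize parameters from an input template: [batch, seq_len, embed_dim].
params = init_fn.(Nx.template({2, 512, 256}, :f32), %{})

# Any seq_len works at inference time; the output is [batch, hidden_size].
input = Nx.broadcast(0.0, {2, 512, 256})
output = predict_fn.(params, input)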
- "RoPE is Overrated: Positional Encoding Ablations" (2025)
- "Longformer: The Long-Document Transformer" (Beltagy et al., 2020)
Summary
Functions
Build an RNoPE-SWA model.
Build a sliding window attention layer without positional encoding.
Get the output dimension for a model configuration.
Recommended default configuration.
Types
@type build_opt() ::
        {:embed_dim, pos_integer()}
        | {:hidden_size, pos_integer()}
        | {:num_heads, pos_integer()}
        | {:num_layers, pos_integer()}
        | {:window_size, pos_integer()}
        | {:dropout, float()}
Options for build/1.
Functions
Build an RNoPE-SWA model.
Options
- :embed_dim - Size of input embedding per timestep (required)
- :hidden_size - Internal hidden dimension (default: 256)
- :num_heads - Number of attention heads (default: 4)
- :num_layers - Number of transformer blocks (default: 6)
- :window_size - Attention window size (default: 128)
- :dropout - Dropout rate (default: 0.1)
Returns
An Axon model that outputs [batch, hidden_size] from the last position.
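Since build/1 returns an ordinary Axon model, it composes with further layers. For example, a sketch of attaching a classification head (the layer choice here is illustrative):

model =
  RNoPESWA.build(embed_dim: 256, hidden_size: 256, num_heads: 4)
  |> Axon.dense(10, activation: :softmax)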
Build a sliding window attention layer without positional encoding.
Options
- :hidden_size - Hidden dimension (default: 256)
- :num_heads - Number of attention heads (default: 4)
- :window_size - Attention window size (default: 128)
- :rope - Whether to use RoPE (default: false for RNoPE-SWA)
- :name - Layer name prefix
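The layer builder's name is not shown above; assuming it is exposed as RNoPESWA.attention/2 (a hypothetical name, used here only for illustration), a single layer could be wired into an Axon graph like this:

input = Axon.input("sequence", shape: {nil, nil, 256})

# `attention/2` is a hypothetical name; flip :rope to true to ablate
# the effect of positional encoding, per the options above.
out =
  RNoPESWA.attention(input,
    hidden_size: 256,
    num_heads: 4,
    window_size: 128,
    rope: false
  )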
@spec output_size(keyword()) :: non_neg_integer()
Get the output dimension for a model configuration.
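Assuming output_size/1 simply reports the :hidden_size the model will emit (consistent with the [batch, hidden_size] return shape documented above):

RNoPESWA.output_size(hidden_size: 512)
#=> 512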
@spec recommended_defaults() :: keyword()
Recommended default configuration.
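Assuming recommended_defaults/0 returns the defaults documented above as a keyword list, it can seed build/1 with only the required option added:

opts = Keyword.put(RNoPESWA.recommended_defaults(), :embed_dim, 256)
model = RNoPESWA.build(opts)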