Edifice.Attention.RNoPESWA (Edifice v0.2.0)

RNoPE-SWA: Sliding Window Attention without positional encoding.

A minimalist attention mechanism that combines:

  • Sliding Window Attention: Each position attends only to the last window_size positions
  • No Positional Encoding: Pure content-based attention without position bias

Key Innovation

By removing positional encoding, the model learns purely content-based attention patterns. Combined with the sliding window, this yields an efficient local attention mechanism that (see the mask sketch after this list):

  • Has O(L * W) complexity instead of O(L^2), where L is the sequence length and W = window_size
  • Generalizes to arbitrary sequence lengths at inference time, since there are no position embeddings to outgrow
  • Forces the model to rely on content similarity rather than position heuristics
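
To make the windowing concrete, here is a minimal sketch of the implied attention mask in plain Nx. This is illustrative, not the library's code; the seq_len and window_size values are arbitrary:

# Position i may attend to positions max(0, i - W + 1)..i, so each row
# of the mask has at most W entries set: O(L * W) scores in total.
seq_len = 8
window_size = 3

rows = Nx.iota({seq_len, seq_len}, axis: 0)
cols = Nx.iota({seq_len, seq_len}, axis: 1)

mask =
  Nx.logical_and(
    # causal: keys cannot come from the future
    Nx.less_equal(cols, rows),
    # local: keys older than window_size are masked out
    Nx.greater(cols, Nx.subtract(rows, window_size))
  )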

Architecture

Input [batch, seq_len, embed_dim]
      |
      v (no positional encoding)
+--------------------------------+
|  Sliding Window Attention      |
|                                |
|  Each position attends to      |
|  last W positions only         |
|  Q, K, V projections           |
|  Attention(Q, K, V)            |
|  Output projection             |
+--------------------------------+
      |
[batch, seq_len, hidden_size]
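
As a hedged illustration of the block above, a single-head forward pass in plain Nx could look like the following. The real layer is multi-head and built with Axon; the module name SWASketch and the weight arguments are hypothetical:

defmodule SWASketch do
  # x: [batch, seq_len, d]; wq/wk/wv/wo: [d, d]; mask: [seq_len, seq_len]
  def sliding_window_attention(x, wq, wk, wv, wo, mask) do
    q = Nx.dot(x, wq)
    k = Nx.dot(x, wk)
    v = Nx.dot(x, wv)

    # Content-only scores: no RoPE and no learned position bias anywhere.
    scores =
      q
      |> Nx.dot([2], [0], k, [2], [0])
      |> Nx.divide(Nx.sqrt(Nx.axis_size(x, -1)))

    # Out-of-window positions get -inf, hence zero attention weight.
    scores = Nx.select(mask, scores, Nx.Constants.neg_infinity())

    # Numerically stable softmax over the key axis.
    maxes = Nx.reduce_max(scores, axes: [-1], keep_axes: true)
    exps = Nx.exp(Nx.subtract(scores, maxes))
    weights = Nx.divide(exps, Nx.sum(exps, axes: [-1], keep_axes: true))

    weights
    |> Nx.dot([2], [0], v, [1], [0])
    |> Nx.dot(wo)
  end
end

Because the scores depend only on query/key content similarity, the same learned weights apply unchanged at any sequence length, which is the length-generalization property claimed above.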

When to Use

  • Long sequences where full attention is too expensive
  • Tasks where local context is most important (e.g., language modeling)
  • When you want length generalization at inference time
  • When you want to ablate the effect of positional encoding

Usage

alias Edifice.Attention.RNoPESWA

model =
  RNoPESWA.build(
    embed_dim: 256,
    hidden_size: 256,
    num_heads: 4,
    window_size: 128,
    num_layers: 6
  )
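
A hedged sketch of running the built model with Axon's standard build/init/predict flow (the batch size and sequence length are illustrative; any length works, per the length-generalization point above):

{init_fn, predict_fn} = Axon.build(model)

input = Nx.iota({2, 512, 256}, type: :f32)
params = init_fn.(Nx.template({2, 512, 256}, :f32), %{})

output = predict_fn.(params, input)
# => shape {2, 256}: [batch, hidden_size], see build/1 below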

Reference

  • "RoPE is Overrated: Positional Encoding Ablations" (2025)
  • "Longformer: The Long-Document Transformer" (Beltagy et al., 2020)

Summary

Types

build_opt()

Options for build/1.

Functions

build(opts \\ [])

Build an RNoPE-SWA model.

build_sliding_window_attention(input, opts)

Build a sliding window attention layer without positional encoding.

output_size(opts \\ [])

Get the output dimension for a model configuration.

Recommended default configuration.

Types

build_opt()

@type build_opt() ::
  {:embed_dim, pos_integer()}
  | {:hidden_size, pos_integer()}
  | {:num_heads, pos_integer()}
  | {:num_layers, pos_integer()}
  | {:window_size, pos_integer()}
  | {:dropout, float()}

Options for build/1.

Functions

build(opts \\ [])

@spec build([build_opt()]) :: Axon.t()

Build an RNoPE-SWA model.

Options

  • :embed_dim - Size of input embedding per timestep (required)
  • :hidden_size - Internal hidden dimension (default: 256)
  • :num_heads - Number of attention heads (default: 4)
  • :num_layers - Number of transformer blocks (default: 6)
  • :window_size - Attention window size (default: 128)
  • :dropout - Dropout rate (default: 0.1)

Returns

An Axon model that outputs the last position's representation, with shape [batch, hidden_size].

build_sliding_window_attention(input, opts)

@spec build_sliding_window_attention(
  Axon.t(),
  keyword()
) :: Axon.t()

Build a sliding window attention layer without positional encoding.

Options

  • :hidden_size - Hidden dimension (default: 256)
  • :num_heads - Number of attention heads (default: 4)
  • :window_size - Attention window size (default: 128)
  • :rope - Whether to use RoPE (default: false for RNoPE-SWA)
  • :name - Layer name prefix
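
A hedged composition sketch, wiring this layer into a custom Axon graph (the input name "sequence" and the dimensions are illustrative):

input = Axon.input("sequence", shape: {nil, nil, 256})

attended =
  Edifice.Attention.RNoPESWA.build_sliding_window_attention(input,
    hidden_size: 256,
    num_heads: 4,
    window_size: 128,
    name: "swa"
  )

Setting rope: true switches the layer back to rotary embeddings, which is useful for the positional-encoding ablations mentioned under "When to Use".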

output_size(opts \\ [])

@spec output_size(keyword()) :: non_neg_integer()

Get the output dimension for a model configuration.
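
Since build/1 documents its output as [batch, hidden_size], output_size/1 presumably mirrors the :hidden_size option; a hedged example:

Edifice.Attention.RNoPESWA.output_size(hidden_size: 512)
#=> 512

Edifice.Attention.RNoPESWA.output_size([])
#=> 256 (the :hidden_size default)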