Edifice.Attention.GatedAttention (Edifice v0.2.0)

Gated Attention: learned gating over attention output.

Applies a learnable sigmoid gate to attention output:

output = sigmoid(g) * Attention(Q, K, V)

where g is a learned gate vector with one scalar per hidden dimension. This lets the model selectively suppress or amplify the attention output along each feature dimension.
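
For intuition, the gate is a single elementwise multiply. A minimal Nx sketch of the math (the shapes and the zero-valued gate below are illustrative stand-ins, not Edifice internals):

# [batch, seq_len, hidden]; dummy values stand in for real attention output
attn_out = Nx.broadcast(1.0, {2, 60, 256})

# one learned scalar per hidden dimension; zeros here are only for illustration
g = Nx.broadcast(0.0, {256})

# sigmoid(0) = 0.5, so an untrained zero gate passes half of each feature
gated = Nx.multiply(Nx.sigmoid(g), attn_out)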

Key Innovation

Standard attention outputs are weighted sums over value vectors and can be noisy. The gate learns which dimensions of the attention output are informative and which should be dampened. This is similar to the gating in LSTMs and GRUs, but applied to the attention output.

Architecture

Input [batch, seq_len, embed_dim]
      |
+------------------------------+
|  Gated Attention Block       |
|                              |
|  Q, K, V projections         |
|         |                    |
|  Standard attention          |
|         |                    |
|  sigmoid(g) * attn_out       |
|         |                    |
|  Output projection           |
+------------------------------+
      |
[batch, seq_len, hidden_size]

Usage

alias Edifice.Attention.GatedAttention

model = GatedAttention.build(
  embed_dim: 256,
  hidden_size: 256,
  num_heads: 4,
  num_layers: 6
)
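
Running the model follows the standard Axon build/init/predict flow. A sketch, assuming embed_dim: 256 and the default window_size of 60 (template and batch sizes are arbitrary):

{init_fn, predict_fn} = Axon.build(model)
params = init_fn.(Nx.template({1, 60, 256}, :f32), %{})

input = Nx.broadcast(0.0, {8, 60, 256})
output = predict_fn.(params, input)
# output shape: {8, 256}, i.e. [batch, hidden_size] from the last position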

Reference

  • "Gated Attention Networks" (NeurIPS 2025 Best Paper)

Summary

Types

Options for build/1.

Functions

Build a Gated Attention model.

Build the gated attention layer.

Get the output dimension for a model configuration.

Recommended default configuration.

Types

build_opt()

@type build_opt() ::
  {:embed_dim, pos_integer()}
  | {:hidden_size, pos_integer()}
  | {:num_heads, pos_integer()}
  | {:num_layers, pos_integer()}
  | {:dropout, float()}
  | {:window_size, pos_integer()}

Options for build/1.

Functions

build(opts \\ [])

@spec build([build_opt()]) :: Axon.t()

Build a Gated Attention model.

Options

  • :embed_dim - Size of input embedding per timestep (required)
  • :hidden_size - Internal hidden dimension (default: 256)
  • :num_heads - Number of attention heads (default: 4)
  • :num_layers - Number of transformer blocks (default: 6)
  • :dropout - Dropout rate (default: 0.1)
  • :window_size - Expected sequence length for JIT optimization (default: 60)

Returns

An Axon model that outputs [batch, hidden_size] from the last position.

build_gated_attention(input, opts)

@spec build_gated_attention(
  Axon.t(),
  keyword()
) :: Axon.t()

Build the gated attention layer.

Projects the input to Q, K, and V, computes standard attention, then applies the learned sigmoid gate to the output.
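
As a sketch, a gate like this can be expressed as a custom Axon layer. The parameter name, zero initializer, and layer name below are illustrative, not Edifice's actual implementation:

# stand-in for the attention output node, [batch, seq_len, hidden_size]
attn_out = Axon.input("attn_out", shape: {nil, 60, 256})

# one learnable scalar per hidden dimension
gate = Axon.param("gate", {256}, initializer: :zeros)

gated =
  Axon.layer(
    fn attn_out, g, _opts -> Nx.multiply(Nx.sigmoid(g), attn_out) end,
    [attn_out, gate],
    name: "sigmoid_gate",
    op_name: :sigmoid_gate
  )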

output_size(opts \\ [])

@spec output_size(keyword()) :: non_neg_integer()

Get the output dimension for a model configuration.
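
Given the Returns note on build/1 above, this presumably mirrors the configured :hidden_size; a hypothetical call:

GatedAttention.output_size(hidden_size: 512)
#=> 512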