Softpick: non-saturating, naturally sparse normalization.
Softpick normalizes inputs by dividing each element by one plus the total absolute magnitude:
Softpick(x)_i = x_i / (1 + sum_j(|x_j|))
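For concreteness, a minimal Nx sketch of this formula (an illustration only; the module and function names here are made up, and `Softpick.compute/2` below is the actual API):

```elixir
defmodule SoftpickSketch do
  import Nx.Defn

  # Divide each element by 1 plus the sum of absolute values
  # along the last axis.
  defn softpick(x) do
    x / (1 + Nx.sum(Nx.abs(x), axes: [-1], keep_axes: true))
  end
end

SoftpickSketch.softpick(Nx.tensor([2.0, -1.0, 0.0]))
#=> [0.5, -0.25, 0.0]  (denominator: 1 + 3 = 4)
```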
Key Properties
- Non-saturating: Unlike softmax, gradients don't vanish for large inputs
- Naturally sparse: Small inputs map to near-zero outputs instead of being inflated to sum to 1, and sign and relative magnitudes are preserved
- Bounded: Output magnitudes are always < 1, since the denominator 1 + sum_j(|x_j|) always exceeds |x_i|
- Simple: No exponentials, just absolute values and division
Comparison with Softmax
| Property | Softmax | Softpick |
|---|---|---|
| Output range | (0, 1) | (-1, 1) |
| Sum of outputs | 1 | varies |
| Preserves sign | No | Yes |
| Saturation | Yes (exp) | No |
| Sparsity | Low (sum=1) | Natural |
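A quick numeric check of the sign, sum, and saturation rows, using plain Nx calls (values rounded):

```elixir
x = Nx.tensor([4.0, -4.0, 0.5, 0.0])

# Softmax: all positive, forced to sum to 1, the large input dominates
Nx.divide(Nx.exp(x), Nx.sum(Nx.exp(x)))
#=> [0.953, 0.0003, 0.029, 0.017]

# Softpick: sign preserved, exact zero stays zero, sum unconstrained
Nx.divide(x, Nx.add(1, Nx.sum(Nx.abs(x))))
#=> [0.421, -0.421, 0.053, 0.0]
```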
Use Cases
- Attention alternatives where sign matters
- Routing in mixture-of-experts
- Feature selection where sparsity is desired
- Any normalization where you want bounded outputs without saturation
Usage as Nx Function
```elixir
# Direct computation
normalized = Softpick.compute(logits)
```

Usage in Axon Model

```elixir
model = Softpick.build(embed_dim: 256, hidden_size: 256)
```

Reference
- "Beyond Softmax: Sparse and Non-Saturating Attention" (2025)
Summary
Functions
Build a transformer model using Softpick instead of softmax in attention.
Build attention layer using Softpick instead of softmax.
Apply Softpick normalization to a tensor.
Create a Softpick Axon layer.
Get the output dimension for a model configuration.
Types
@type build_opt() ::
        {:embed_dim, pos_integer()}
        | {:hidden_size, pos_integer()}
        | {:num_heads, pos_integer()}
        | {:num_layers, pos_integer()}
        | {:dropout, float()}
        | {:window_size, pos_integer()}
Options for build/1.
Functions
Build a transformer model using Softpick instead of softmax in attention.
Options
- :embed_dim - Size of input embedding per timestep (required)
- :hidden_size - Internal hidden dimension (default: 256)
- :num_heads - Number of attention heads (default: 4)
- :num_layers - Number of transformer blocks (default: 6)
- :dropout - Dropout rate (default: 0.1)
- :window_size - Expected sequence length for JIT optimization (default: 60)
Returns
An Axon model that outputs shape [batch, hidden_size], taken from the last sequence position.
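A sketch of the full Axon build/init/predict flow (the batch size, window, and zero input below are illustrative; the model is assumed to take a single [batch, window_size, embed_dim] input):

```elixir
model = Softpick.build(embed_dim: 256, hidden_size: 256, num_heads: 4, num_layers: 2)

{init_fn, predict_fn} = Axon.build(model)
params = init_fn.(Nx.template({8, 60, 256}, :f32), %{})

predict_fn.(params, Nx.broadcast(0.0, {8, 60, 256}))
#=> tensor of shape {8, 256}, the hidden state at the last position
```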
Build attention layer using Softpick instead of softmax.
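This page doesn't document the layer's signature, but the core idea is straightforward to sketch in Nx: compute scaled dot-product scores as usual, then normalize each query row with Softpick rather than softmax. Everything below (module name, shapes) is illustrative, not the library's implementation:

```elixir
defmodule AttentionSketch do
  import Nx.Defn

  # q, k, v: [batch, seq, head_dim]. Weights keep their signs, and
  # irrelevant positions stay near zero instead of receiving exp() mass.
  defn softpick_attention(q, k, v) do
    scores = Nx.dot(q, [2], [0], k, [2], [0]) / Nx.sqrt(Nx.axis_size(q, 2))
    weights = scores / (1 + Nx.sum(Nx.abs(scores), axes: [-1], keep_axes: true))
    Nx.dot(weights, [2], [0], v, [1], [0])
  end
end
```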
@spec compute(Nx.Tensor.t(), keyword()) :: Nx.Tensor.t()
Apply Softpick normalization to a tensor.
Parameters
- x - Input tensor of any shape
- opts - Options:
  - :axis - Axis to normalize over (default: -1, the last axis)
Returns
Normalized tensor: x_i / (1 + sum(|x_j|)) over the specified axis.
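For example, worked out by hand from the formula:

```elixir
x = Nx.tensor([[1.0, -2.0, 0.0], [3.0, 1.0, -1.0]])

Softpick.compute(x, axis: -1)
# Row 1: denominator 1 + (1 + 2 + 0) = 4 -> [0.25, -0.5, 0.0]
# Row 2: denominator 1 + (3 + 1 + 1) = 6 -> [0.5, 0.1667, -0.1667]
```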
Create a Softpick Axon layer.
Options
- :name - Layer name prefix (default: "softpick")
- :axis - Axis to normalize over (default: -1)
Returns
An Axon layer that applies Softpick normalization.
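A composition sketch, assuming layer/2 follows the usual Axon convention of taking the upstream node as its first argument (the input name and sizes are illustrative):

```elixir
model =
  Axon.input("features", shape: {nil, 128})
  |> Axon.dense(64)
  |> Softpick.layer(name: "softpick", axis: -1)
```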
@spec output_size(keyword()) :: non_neg_integer()
Get the output dimension for a model configuration.
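Given that build/1 outputs [batch, hidden_size], this presumably echoes the configured :hidden_size; a sketch under that assumption:

```elixir
Softpick.output_size(hidden_size: 512)
#=> 512 (assumed: the documented output shape is [batch, hidden_size])
```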