Softpick: non-saturating, naturally sparse normalization.
Softpick normalizes inputs by dividing each element by one plus the total absolute magnitude:
Softpick(x)_i = x_i / (1 + sum_j(|x_j|))
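For concreteness, a minimal Nx sketch of this formula (an illustration only; the module and function names here are made up, and `Softpick.compute/2` below is the actual API):

```elixir
defmodule SoftpickSketch do
  import Nx.Defn

  # Divide each element by 1 plus the sum of absolute values
  # along the last axis.
  defn softpick(x) do
    x / (1 + Nx.sum(Nx.abs(x), axes: [-1], keep_axes: true))
  end
end

SoftpickSketch.softpick(Nx.tensor([2.0, -1.0, 0.0]))
#=> [0.5, -0.25, 0.0]  (denominator: 1 + 3 = 4)
```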
Key Properties
- Non-saturating: Unlike softmax, gradients don't vanish for large inputs
- Naturally sparse: Small inputs map to near-zero outputs instead of being inflated to sum to 1, and sign and relative magnitudes are preserved
- Bounded: Output magnitudes are always < 1, since the denominator 1 + sum_j(|x_j|) always exceeds |x_i|
- Simple: No exponentials, just absolute values and division
Comparison with Softmax
| Property | Softmax | Softpick |
|---|---|---|
| Output range | (0, 1) | (-1, 1) |
| Sum of outputs | 1 | varies |
| Preserves sign | No | Yes |
| Saturation | Yes (exp) | No |
| Sparsity | Low (sum=1) | Natural |
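A quick numeric check of the sign, sum, and saturation rows, using plain Nx calls (values rounded):

```elixir
x = Nx.tensor([4.0, -4.0, 0.5, 0.0])

# Softmax: all positive, forced to sum to 1, the large input dominates
Nx.divide(Nx.exp(x), Nx.sum(Nx.exp(x)))
#=> [0.953, 0.0003, 0.029, 0.017]

# Softpick: sign preserved, exact zero stays zero, sum unconstrained
Nx.divide(x, Nx.add(1, Nx.sum(Nx.abs(x))))
#=> [0.421, -0.421, 0.053, 0.0]
```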
Use Cases
- Attention alternatives where sign matters
- Routing in mixture-of-experts
- Feature selection where sparsity is desired
- Any normalization where you want bounded outputs without saturation
Usage as Nx Function
```elixir
# Direct computation
normalized = Softpick.compute(logits)
```

Usage in Axon Model

```elixir
model = Softpick.build(embed_dim: 256, hidden_size: 256)
```

Reference
- "Beyond Softmax: Sparse and Non-Saturating Attention" (2025)
Summary
Functions
Build a transformer model using Softpick instead of softmax in attention.
Build attention layer using Softpick instead of softmax.
Apply Softpick normalization to a tensor.
Create a Softpick Axon layer.
Get the output dimension for a model configuration.
Types
@type build_opt() ::
        {:embed_dim, pos_integer()}
        | {:hidden_size, pos_integer()}
        | {:num_heads, pos_integer()}
        | {:num_layers, pos_integer()}
        | {:dropout, float()}
        | {:window_size, pos_integer()}
Options for build/1.
Functions
Build a transformer model using Softpick instead of softmax in attention.
Options
- :embed_dim - Size of input embedding per timestep (required)
- :hidden_size - Internal hidden dimension (default: 256)
- :num_heads - Number of attention heads (default: 4)
- :num_layers - Number of transformer blocks (default: 6)
- :dropout - Dropout rate (default: 0.1)
- :window_size - Expected sequence length for JIT optimization (default: 60)
Returns
An Axon model that outputs shape [batch, hidden_size], taken from the last sequence position.
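A sketch of the full Axon build/init/predict flow (the batch size, window, and zero input below are illustrative; the model is assumed to take a single [batch, window_size, embed_dim] input):

```elixir
model = Softpick.build(embed_dim: 256, hidden_size: 256, num_heads: 4, num_layers: 2)

{init_fn, predict_fn} = Axon.build(model)
params = init_fn.(Nx.template({8, 60, 256}, :f32), %{})

predict_fn.(params, Nx.broadcast(0.0, {8, 60, 256}))
#=> tensor of shape {8, 256}, the hidden state at the last position
```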
Build attention layer using Softpick instead of softmax.
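This page doesn't document the layer's signature, but the core idea is straightforward to sketch in Nx: compute scaled dot-product scores as usual, then normalize each query row with Softpick rather than softmax. Everything below (module name, shapes) is illustrative, not the library's implementation:

```elixir
defmodule AttentionSketch do
  import Nx.Defn

  # q, k, v: [batch, seq, head_dim]. Weights keep their signs, and
  # irrelevant positions stay near zero instead of receiving exp() mass.
  defn softpick_attention(q, k, v) do
    scores = Nx.dot(q, [2], [0], k, [2], [0]) / Nx.sqrt(Nx.axis_size(q, 2))
    weights = scores / (1 + Nx.sum(Nx.abs(scores), axes: [-1], keep_axes: true))
    Nx.dot(weights, [2], [0], v, [1], [0])
  end
end
```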
@spec compute(Nx.Tensor.t(), keyword()) :: Nx.Tensor.t()
Apply Softpick normalization to a tensor.
Parameters
- x - Input tensor of any shape
- opts - Options:
  - :axis - Axis to normalize over (default: -1, the last axis)
Returns
Normalized tensor: x_i / (1 + sum(|x_j|)) over the specified axis.
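For example, worked out by hand from the formula:

```elixir
x = Nx.tensor([[1.0, -2.0, 0.0], [3.0, 1.0, -1.0]])

Softpick.compute(x, axis: -1)
# Row 1: denominator 1 + (1 + 2 + 0) = 4 -> [0.25, -0.5, 0.0]
# Row 2: denominator 1 + (3 + 1 + 1) = 6 -> [0.5, 0.1667, -0.1667]
```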
Create a Softpick Axon layer.
Options
- :name - Layer name prefix (default: "softpick")
- :axis - Axis to normalize over (default: -1)
Returns
An Axon layer that applies Softpick normalization.
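A composition sketch, assuming layer/2 follows the usual Axon convention of taking the upstream node as its first argument (the input name and sizes are illustrative):

```elixir
model =
  Axon.input("features", shape: {nil, 128})
  |> Axon.dense(64)
  |> Softpick.layer(name: "softpick", axis: -1)
```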
@spec output_size(keyword()) :: non_neg_integer()
Get the output dimension for a model configuration.
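Given that build/1 outputs [batch, hidden_size], this presumably echoes the configured :hidden_size; a sketch under that assumption:

```elixir
Softpick.output_size(hidden_size: 512)
#=> 512 (assumed: the documented output shape is [batch, hidden_size])
```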