Edifice.Blocks.Softpick (Edifice v0.2.0)


Softpick: non-saturating, naturally sparse normalization.

Softpick normalizes inputs by dividing each element by one plus the sum of the absolute values of all elements:

Softpick(x)_i = x_i / (1 + sum_j(|x_j|))

Key Properties

  • Non-saturating: Unlike softmax, gradients do not vanish for large inputs
  • Naturally sparse: Outputs preserve sign and relative magnitudes, and near-zero inputs stay near zero
  • Bounded: Output magnitudes are always < 1, since each element is divided by 1 + the sum of all absolute values
  • Simple: No exponentials, just absolute values and division
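The properties above can be checked directly from the formula. A minimal sketch on a plain Elixir list (the real module operates on Nx tensors; `SoftpickSketch` is a hypothetical name for illustration):

```elixir
defmodule SoftpickSketch do
  # softpick(x)_i = x_i / (1 + sum_j(|x_j|))
  def compute(xs) do
    denom = 1 + (xs |> Enum.map(&abs/1) |> Enum.sum())
    Enum.map(xs, &(&1 / denom))
  end
end

SoftpickSketch.compute([2.0, -1.0, 0.0, 3.0])
# denominator = 1 + (2 + 1 + 0 + 3) = 7
# => [2/7, -1/7, 0.0, 3/7]
```

Note that the zero input maps exactly to zero (sparsity), the `-1.0` keeps its sign, and every output magnitude is strictly below 1.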

Comparison with Softmax

| Property       | Softmax      | Softpick |
|----------------|--------------|----------|
| Output range   | (0, 1)       | (-1, 1)  |
| Sum of outputs | 1            | varies   |
| Preserves sign | No           | Yes      |
| Saturation     | Yes (exp)    | No       |
| Sparsity       | Low (sum = 1) | Natural |
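The rows of this table can be observed on a single vector. A sketch in plain Elixir, assuming the formula above (the anonymous helpers here are illustrative, not part of the Edifice API):

```elixir
xs = [2.0, -1.0, 0.0, 3.0]

softmax = fn v ->
  exps = Enum.map(v, &:math.exp/1)
  total = Enum.sum(exps)
  Enum.map(exps, &(&1 / total))
end

softpick = fn v ->
  denom = 1 + (v |> Enum.map(&abs/1) |> Enum.sum())
  Enum.map(v, &(&1 / denom))
end

Enum.sum(softmax.(xs))    # ≈ 1.0 — softmax always sums to 1
Enum.sum(softpick.(xs))   # ≈ 0.571 — softpick's sum varies with the input
Enum.min(softmax.(xs))    # > 0 — softmax discards sign
Enum.at(softpick.(xs), 1) # < 0 — softpick preserves the sign of -1.0
```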

Use Cases

  • Attention alternatives where sign matters
  • Routing in mixture-of-experts
  • Feature selection where sparsity is desired
  • Any normalization where you want bounded outputs without saturation

Usage as Nx Function

# Direct computation
normalized = Softpick.compute(logits)

Usage in Axon Model

model = Softpick.build(embed_dim: 256, hidden_size: 256)

Reference

  • "Beyond Softmax: Sparse and Non-Saturating Attention" (2025)

Summary

Types

Options for build/1.

Functions

Build a transformer model using Softpick instead of softmax in attention.

Build attention layer using Softpick instead of softmax.

Apply Softpick normalization to a tensor.

Create a Softpick Axon layer.

Get the output dimension for a model configuration.

Types

build_opt()

@type build_opt() ::
  {:embed_dim, pos_integer()}
  | {:hidden_size, pos_integer()}
  | {:num_heads, pos_integer()}
  | {:num_layers, pos_integer()}
  | {:dropout, float()}
  | {:window_size, pos_integer()}

Options for build/1.

Functions

build(opts \\ [])

@spec build([build_opt()]) :: Axon.t()

Build a transformer model using Softpick instead of softmax in attention.

Options

  • :embed_dim - Size of input embedding per timestep (required)
  • :hidden_size - Internal hidden dimension (default: 256)
  • :num_heads - Number of attention heads (default: 4)
  • :num_layers - Number of transformer blocks (default: 6)
  • :dropout - Dropout rate (default: 0.1)
  • :window_size - Expected sequence length for JIT optimization (default: 60)

Returns

An Axon model that outputs [batch, hidden_size] from the last position.

build_softpick_attention(input, opts)

@spec build_softpick_attention(
  Axon.t(),
  keyword()
) :: Axon.t()

Build attention layer using Softpick instead of softmax.

compute(x, opts \\ [])

@spec compute(
  Nx.Tensor.t(),
  keyword()
) :: Nx.Tensor.t()

Apply Softpick normalization to a tensor.

Parameters

  • x - Input tensor of any shape
  • opts - Options:
    • :axis - Axis to normalize over (default: -1, last axis)

Returns

Normalized tensor: x_i / (1 + sum(|x_j|)) over the specified axis.
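With the default `axis: -1`, each slice along the last axis is normalized independently. A sketch of that behavior on a 2-D input, with plain lists standing in for Nx tensors:

```elixir
rows = [[1.0, -1.0], [4.0, 0.0]]

# Normalize each row on its own, as axis: -1 would on a {2, 2} tensor.
normalized =
  Enum.map(rows, fn row ->
    denom = 1 + (row |> Enum.map(&abs/1) |> Enum.sum())
    Enum.map(row, &(&1 / denom))
  end)

# row 1: denom = 1 + 2 = 3 -> [1/3, -1/3]
# row 2: denom = 1 + 4 = 5 -> [0.8, 0.0]
```

Each row gets its own denominator, so a large value in one row does not shrink the outputs of another.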

layer(input, opts \\ [])

@spec layer(
  Axon.t(),
  keyword()
) :: Axon.t()

Create a Softpick Axon layer.

Options

  • :name - Layer name prefix (default: "softpick")
  • :axis - Axis to normalize over (default: -1)

Returns

An Axon layer that applies Softpick normalization.

output_size(opts \\ [])

@spec output_size(keyword()) :: non_neg_integer()

Get the output dimension for a model configuration.