SigLIP - Sigmoid Loss for Language-Image Pre-training.
Implements SigLIP, which replaces the softmax-based contrastive loss (used in CLIP) with a simpler sigmoid-based binary classification loss. Each image-text pair is treated as an independent binary classification problem.
Key Innovation
Instead of softmax cross-entropy over all pairs:
CLIP: -log(exp(sim_pos) / sum(exp(sim_all)))

SigLIP uses a sigmoid for each pair independently:

SigLIP: -sum_ij(log(sigmoid(t * sim_ij * y_ij)))

where:

- sim_ij is the cosine similarity between embeddings i and j
- y_ij = +1 for matching pairs, -1 for non-matching pairs
- t is a learnable temperature parameter
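To make the formula concrete, here is a minimal Nx sketch of the loss (an illustration, not this module's actual implementation; it assumes z_img and z_txt are already L2-normalized [batch, dim] tensors):

defmodule SigmoidLossSketch do
  import Nx.Defn

  # loss = -mean_ij(log(sigmoid(t * sim_ij * y_ij)))
  defn sigmoid_loss(z_img, z_txt, t) do
    # Pairwise dot products; with L2-normalized rows this is cosine similarity.
    sim = Nx.dot(z_img, [1], z_txt, [1])

    # y_ij: +1 on the diagonal (matching pairs), -1 everywhere else.
    y = Nx.eye(Nx.axis_size(sim, 0)) * 2 - 1

    -Nx.mean(Nx.log(Nx.sigmoid(t * sim * y)))
  end
end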
Advantages
- Simpler: No need to normalize over all negatives
- Scalable: Each pair is independent, enabling larger batch sizes (see the chunked sketch below)
- Stable: Sigmoid gradients are bounded, avoiding the instabilities of temperature-scaled softmax
- Effective: Matches or exceeds CLIP performance in practice
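To illustrate the scalability point: because no term normalizes over the others, the loss can be accumulated over column chunks of the similarity matrix instead of materializing it all at once. A hypothetical sketch (the module name, helper, and chunk_size are illustrative, not part of this library):

defmodule ChunkedLossSketch do
  # Sum the per-pair sigmoid losses chunk by chunk; the partial sums are
  # exact because no softmax couples the pairs together.
  def chunked_sigmoid_loss(z_img, z_txt, t, chunk_size) do
    n = Nx.axis_size(z_img, 0)

    0..(n - 1)//chunk_size
    |> Enum.map(fn start ->
      len = min(chunk_size, n - start)
      chunk = Nx.slice_along_axis(z_txt, start, len, axis: 0)

      # Similarities of every image against this chunk of texts: [n, len]
      sim = Nx.dot(z_img, [1], chunk, [1])

      # +1 where the global column index equals the row index, -1 otherwise.
      rows = Nx.iota({n, 1})
      cols = Nx.add(Nx.iota({1, len}), start)
      y = Nx.select(Nx.equal(rows, cols), 1, -1)

      sim
      |> Nx.multiply(y)
      |> Nx.multiply(t)
      |> Nx.sigmoid()
      |> Nx.log()
      |> Nx.sum()
      |> Nx.negate()
    end)
    |> Enum.reduce(&Nx.add/2)
    |> Nx.divide(n * n)
  end
end

The SigLIP paper exploits the same independence with a chunked formulation to scale the loss across devices.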
Architecture
Image Encoder              Text Encoder
      |                          |
      v                          v
+------------+             +------------+
|  Backbone  |             |  Backbone  |
+------------+             +------------+
      |                          |
      v                          v
+------------+             +------------+
| Projection |             | Projection |
+------------+             +------------+
      |                          |
      v                          v
    z_img        SigLIP        z_txt
      |           Loss           |
      +---------->    <----------+

Usage
{encoder, _} = SigLIP.build(input_dim: 512, projection_dim: 256)
# Compute SigLIP loss
loss = SigLIP.loss(z_img, z_txt, temperature: 1.0)
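A fuller round trip, building on the snippet above (the Axon.build/1 workflow, input shapes, and dummy tensors are assumptions for illustration):

{encoder, _temperature} = SigLIP.build(input_dim: 512, projection_dim: 256)

# Compile the Axon graph into init/predict functions.
{init_fn, predict_fn} = Axon.build(encoder)
params = init_fn.(Nx.template({8, 512}, :f32), %{})

# Encode pre-extracted image and text features (dummy batches here).
z_img = predict_fn.(params, Nx.broadcast(0.1, {8, 512}))
z_txt = predict_fn.(params, Nx.broadcast(0.2, {8, 512}))

SigLIP.loss(z_img, z_txt, temperature: 1.0)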
- "Sigmoid Loss for Language Image Pre-Training" (Zhai et al., 2023)
Summary
Functions
Build a SigLIP encoder model.
Get the default hidden size.
Get the default projection dimension.
Get the default temperature.
Compute SigLIP loss between two sets of embeddings.
Compute SigLIP loss using log temperature parameter.
Types
@type build_opt() ::
  {:input_dim, pos_integer()}
  | {:embed_dim, pos_integer()}
  | {:projection_dim, pos_integer()}
  | {:hidden_size, pos_integer()}
  | {:temperature_init, float()}
Options for build/1.
Functions
Build a SigLIP encoder model.
Options
- :input_dim or :embed_dim - Input feature dimension (required)
- :projection_dim - Projection head output dimension (default: 256)
- :hidden_size - Hidden dimension (default: 512)
- :temperature_init - Initial temperature value (default: 1.0)
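For example, a call spelling out every option (values arbitrary):

{encoder, temperature} =
  SigLIP.build(
    input_dim: 768,        # or equivalently embed_dim: 768
    projection_dim: 256,
    hidden_size: 512,
    temperature_init: 1.0
  )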
Returns
Tuple of {encoder_model, temperature_param} where:
- encoder_model is an Axon model mapping inputs to normalized embeddings
- temperature_param is an Axon parameter for the learnable temperature
@spec default_hidden_size() :: pos_integer()
Get the default hidden size.
@spec default_projection_dim() :: pos_integer()
Get the default projection dimension.
@spec default_temperature() :: float()
Get the default temperature.
@spec loss(Nx.Tensor.t(), Nx.Tensor.t(), keyword()) :: Nx.Tensor.t()
Compute SigLIP loss between two sets of embeddings.
Parameters
- z_a - Embeddings from modality A (e.g., images): [batch, dim]
- z_b - Embeddings from modality B (e.g., text): [batch, dim]
Options
- :temperature - Temperature scaling (default: 1.0)
- :log_temperature - Log temperature (overrides :temperature if provided)
Returns
Scalar loss tensor.
Notes
Assumes diagonal pairs (z_a[i], z_b[i]) are positive matches. All other pairs are treated as negatives.
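A minimal call under those conventions (toy tensors; any two [batch, dim] tensors work):

# z_a and z_b are identical here, so each diagonal pair is a perfect match.
z_a = Nx.divide(Nx.iota({4, 8}, type: :f32), 10.0)
z_b = Nx.divide(Nx.iota({4, 8}, type: :f32), 10.0)

loss = SigLIP.loss(z_a, z_b, temperature: 1.0)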
@spec loss_with_log_temp(Nx.Tensor.t(), Nx.Tensor.t(), Nx.Tensor.t()) :: Nx.Tensor.t()
Compute SigLIP loss using log temperature parameter.
Convenience function that takes the log temperature directly (useful when temperature is a learnable parameter in log space).
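Assuming t = exp(log_t), learning the parameter in log space keeps the effective temperature strictly positive without explicit constraints. Under the option semantics above, these two calls should be equivalent (values illustrative):

log_t = Nx.tensor(0.0)  # exp(0.0) = 1.0, i.e. unit temperature

loss_a = SigLIP.loss_with_log_temp(z_img, z_txt, log_t)
loss_b = SigLIP.loss(z_img, z_txt, log_temperature: log_t)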