Edifice.Contrastive.SigLIP (Edifice v0.2.0)


SigLIP - Sigmoid Loss for Language-Image Pre-training.

Implements SigLIP, which replaces the softmax-based contrastive loss (used in CLIP) with a simpler sigmoid-based binary classification loss. Each image-text pair is treated as an independent binary classification problem.

Key Innovation

Instead of softmax cross-entropy over all pairs:

CLIP: -log(exp(t * sim_pos) / sum(exp(t * sim_all)))

SigLIP uses sigmoid for each pair independently:

SigLIP: -sum_ij(log(sigmoid(t * sim_ij * y_ij)))

Where:

  • sim_ij is the cosine similarity between embeddings i and j
  • y_ij = +1 for matching pairs, -1 for non-matching pairs
  • t is a learnable temperature parameter
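
For concreteness, here is a minimal Nx sketch of this pairwise loss (not the module's implementation; z_a and z_b are assumed to be L2-normalized [batch, dim] tensors, and the function name is illustrative):

def sigmoid_loss_sketch(z_a, z_b, temperature) do
  batch = Nx.axis_size(z_a, 0)

  # Cosine similarity of every (i, j) pair: [batch, batch]
  sim = Nx.dot(z_a, [1], z_b, [1])

  # y_ij: +1 on the diagonal (matching pairs), -1 elsewhere
  labels = Nx.subtract(Nx.multiply(2, Nx.eye(batch)), 1)

  # -log sigmoid(t * sim * y), reduced over all pairs
  # (a mean is used here; the paper divides the pair sum by batch size)
  Nx.multiply(sim, temperature)
  |> Nx.multiply(labels)
  |> Nx.sigmoid()
  |> Nx.log()
  |> Nx.mean()
  |> Nx.negate()
end

A production version would compute a numerically stable log-sigmoid (log sigmoid(x) = -log(1 + exp(-x)), via Nx.log1p/1) rather than composing Nx.log/1 with Nx.sigmoid/1.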

Advantages

  • Simpler: No need to normalize over all negatives
  • Scalable: Each pair is independent, enabling larger batch sizes
  • Stable: Per-pair sigmoid gradients are bounded, unlike temperature-scaled softmax, whose gradients can spike at low temperatures
  • Effective: Matches or exceeds CLIP performance in practice

Architecture

Image Encoder          Text Encoder
      |                      |
      v                      v
+------------+        +------------+
|  Backbone  |        |  Backbone  |
+------------+        +------------+
      |                      |
      v                      v
+------------+        +------------+
| Projection |        | Projection |
+------------+        +------------+
      |                      |
      v                      v
     z_img    SigLIP       z_txt
      |       Loss           |
      +-------> <-----------+
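
Each tower is a backbone followed by a projection head whose outputs are L2-normalized. As a rough illustration (not necessarily the layer layout build/1 produces), one projection tower could look like this in Axon:

input_dim = 512
hidden_size = 512
projection_dim = 256

tower =
  Axon.input("features", shape: {nil, input_dim})
  |> Axon.dense(hidden_size, activation: :gelu)
  |> Axon.dense(projection_dim)
  |> Axon.nx(fn z ->
    # L2-normalize so that dot products are cosine similarities
    norm = Nx.LinAlg.norm(z, axes: [1], keep_axes: true)
    Nx.divide(z, Nx.add(norm, 1.0e-6))
  end)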

Usage

alias Edifice.Contrastive.SigLIP

# Build the projection encoder and its learnable temperature parameter
{encoder, _temperature} = SigLIP.build(input_dim: 512, projection_dim: 256)

# z_img, z_txt: [batch, projection_dim] embeddings from each modality
loss = SigLIP.loss(z_img, z_txt, temperature: 1.0)
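
To produce z_img and z_txt in the first place, the compiled encoder can be applied to pre-extracted features from each modality. A sketch, assuming both modalities share the encoder returned by build/1 and img_features/txt_features are [batch, input_dim] tensors:

{init_fn, predict_fn} = Axon.build(encoder)
params = init_fn.(Nx.template({32, 512}, :f32), %{})

z_img = predict_fn.(params, img_features)  # [32, 256]
z_txt = predict_fn.(params, txt_features)  # [32, 256]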

Reference

  • "Sigmoid Loss for Language Image Pre-Training" (Zhai et al., 2023)

Summary

Types

build_opt() - Options for build/1.

Functions

build(opts \\ []) - Build a SigLIP encoder model.

default_hidden_size() - Get the default hidden size.

default_projection_dim() - Get the default projection dimension.

default_temperature() - Get the default temperature.

loss(z_a, z_b, opts \\ []) - Compute SigLIP loss between two sets of embeddings.

loss_with_log_temp(z_a, z_b, log_temperature) - Compute SigLIP loss using log temperature parameter.

Types

build_opt()

@type build_opt() ::
  {:input_dim, pos_integer()}
  | {:embed_dim, pos_integer()}
  | {:projection_dim, pos_integer()}
  | {:hidden_size, pos_integer()}
  | {:temperature_init, float()}

Options for build/1.

Functions

build(opts \\ [])

@spec build([build_opt()]) :: {Axon.t(), struct()}

Build a SigLIP encoder model.

Options

  • :input_dim or :embed_dim - Input feature dimension (required)
  • :projection_dim - Projection head output dimension (default: 256)
  • :hidden_size - Hidden dimension (default: 512)
  • :temperature_init - Initial temperature value (default: 1.0)

Returns

Tuple of {encoder_model, temperature_param} where:

  • encoder_model is an Axon model mapping inputs to normalized embeddings
  • temperature_param is an Axon parameter for learnable temperature
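
For example, with every documented option spelled out at its default:

{encoder, temperature} =
  Edifice.Contrastive.SigLIP.build(
    input_dim: 512,
    projection_dim: 256,
    hidden_size: 512,
    temperature_init: 1.0
  )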

default_hidden_size()

@spec default_hidden_size() :: pos_integer()

Get the default hidden size.

default_projection_dim()

@spec default_projection_dim() :: pos_integer()

Get the default projection dimension.

default_temperature()

@spec default_temperature() :: float()

Get the default temperature.

loss(z_a, z_b, opts \\ [])

@spec loss(Nx.Tensor.t(), Nx.Tensor.t(), keyword()) :: Nx.Tensor.t()

Compute SigLIP loss between two sets of embeddings.

Parameters

  • z_a - Embeddings from modality A (e.g., images): [batch, dim]
  • z_b - Embeddings from modality B (e.g., text): [batch, dim]

Options

  • :temperature - Temperature scaling (default: 1.0)
  • :log_temperature - Log temperature (overrides :temperature if provided)

Returns

Scalar loss tensor.

Notes

Assumes diagonal pairs (z_a[i], z_b[i]) are positive matches. All other pairs are treated as negatives.
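
A toy invocation with random embeddings (shapes illustrative; real inputs would come from the encoder):

key = Nx.Random.key(0)
{z_a, key} = Nx.Random.normal(key, shape: {8, 256})
{z_b, _key} = Nx.Random.normal(key, shape: {8, 256})

loss = Edifice.Contrastive.SigLIP.loss(z_a, z_b, temperature: 1.0)
# => scalar Nx tensor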

loss_with_log_temp(z_a, z_b, log_temperature)

@spec loss_with_log_temp(Nx.Tensor.t(), Nx.Tensor.t(), Nx.Tensor.t()) :: Nx.Tensor.t()

Compute SigLIP loss using log temperature parameter.

Convenience function that takes the log temperature directly (useful when temperature is a learnable parameter in log space).
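
Given the :log_temperature option documented for loss/3, the following calls should be equivalent:

log_t = Nx.log(Nx.tensor(2.0))

loss_a = Edifice.Contrastive.SigLIP.loss_with_log_temp(z_a, z_b, log_t)
loss_b = Edifice.Contrastive.SigLIP.loss(z_a, z_b, log_temperature: log_t)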