SigLIP - Sigmoid Loss for Language-Image Pre-training.
Implements SigLIP, which replaces the softmax-based contrastive loss (used in CLIP) with a simpler sigmoid-based binary classification loss. Each image-text pair is treated as an independent binary classification problem.
Key Innovation
Instead of softmax cross-entropy over all pairs:
CLIP: -log(exp(sim_pos) / sum(exp(sim_all)))

SigLIP uses a sigmoid for each pair independently:

SigLIP: -sum_ij(log(sigmoid(t * sim_ij * y_ij)))

where:

- sim_ij is the cosine similarity between embeddings i and j
- y_ij = +1 for matching pairs, -1 for non-matching pairs
- t is a learnable temperature parameter
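To make the formula concrete, here is a minimal Nx sketch of the loss (an illustration, not this module's actual implementation; it assumes z_img and z_txt are already L2-normalized [batch, dim] tensors):

defmodule SigmoidLossSketch do
  import Nx.Defn

  # loss = -mean_ij(log(sigmoid(t * sim_ij * y_ij)))
  defn sigmoid_loss(z_img, z_txt, t) do
    # Pairwise dot products; with L2-normalized rows this is cosine similarity.
    sim = Nx.dot(z_img, [1], z_txt, [1])

    # y_ij: +1 on the diagonal (matching pairs), -1 everywhere else.
    y = Nx.eye(Nx.axis_size(sim, 0)) * 2 - 1

    -Nx.mean(Nx.log(Nx.sigmoid(t * sim * y)))
  end
end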
Advantages
- Simpler: No need to normalize over all negatives
- Scalable: Each pair is independent, enabling larger batch sizes (see the chunked sketch below)
- Stable: Sigmoid gradients are bounded, avoiding the instabilities of temperature-scaled softmax
- Effective: Matches or exceeds CLIP performance in practice
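To illustrate the scalability point: because no term normalizes over the others, the loss can be accumulated over column chunks of the similarity matrix instead of materializing it all at once. A hypothetical sketch (the module name, helper, and chunk_size are illustrative, not part of this library):

defmodule ChunkedLossSketch do
  # Sum the per-pair sigmoid losses chunk by chunk; the partial sums are
  # exact because no softmax couples the pairs together.
  def chunked_sigmoid_loss(z_img, z_txt, t, chunk_size) do
    n = Nx.axis_size(z_img, 0)

    0..(n - 1)//chunk_size
    |> Enum.map(fn start ->
      len = min(chunk_size, n - start)
      chunk = Nx.slice_along_axis(z_txt, start, len, axis: 0)

      # Similarities of every image against this chunk of texts: [n, len]
      sim = Nx.dot(z_img, [1], chunk, [1])

      # +1 where the global column index equals the row index, -1 otherwise.
      rows = Nx.iota({n, 1})
      cols = Nx.add(Nx.iota({1, len}), start)
      y = Nx.select(Nx.equal(rows, cols), 1, -1)

      sim
      |> Nx.multiply(y)
      |> Nx.multiply(t)
      |> Nx.sigmoid()
      |> Nx.log()
      |> Nx.sum()
      |> Nx.negate()
    end)
    |> Enum.reduce(&Nx.add/2)
    |> Nx.divide(n * n)
  end
end

The SigLIP paper exploits the same independence with a chunked formulation to scale the loss across devices.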
Architecture
Image Encoder              Text Encoder
      |                          |
      v                          v
+------------+             +------------+
|  Backbone  |             |  Backbone  |
+------------+             +------------+
      |                          |
      v                          v
+------------+             +------------+
| Projection |             | Projection |
+------------+             +------------+
      |                          |
      v                          v
    z_img        SigLIP        z_txt
      |           Loss           |
      +---------->    <----------+

Usage
{encoder, _} = SigLIP.build(input_dim: 512, projection_dim: 256)
# Compute SigLIP loss
loss = SigLIP.loss(z_img, z_txt, temperature: 1.0)
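A fuller round trip, building on the snippet above (the Axon.build/1 workflow, input shapes, and dummy tensors are assumptions for illustration):

{encoder, _temperature} = SigLIP.build(input_dim: 512, projection_dim: 256)

# Compile the Axon graph into init/predict functions.
{init_fn, predict_fn} = Axon.build(encoder)
params = init_fn.(Nx.template({8, 512}, :f32), %{})

# Encode pre-extracted image and text features (dummy batches here).
z_img = predict_fn.(params, Nx.broadcast(0.1, {8, 512}))
z_txt = predict_fn.(params, Nx.broadcast(0.2, {8, 512}))

SigLIP.loss(z_img, z_txt, temperature: 1.0)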
- "Sigmoid Loss for Language Image Pre-Training" (Zhai et al., 2023)
Summary
Functions
Build a SigLIP encoder model.
Get the default hidden size.
Get the default projection dimension.
Get the default temperature.
Compute SigLIP loss between two sets of embeddings.
Compute SigLIP loss using log temperature parameter.
Types
@type build_opt() ::
  {:input_dim, pos_integer()}
  | {:embed_dim, pos_integer()}
  | {:projection_dim, pos_integer()}
  | {:hidden_size, pos_integer()}
  | {:temperature_init, float()}
Options for build/1.
Functions
Build a SigLIP encoder model.
Options
- :input_dim or :embed_dim - Input feature dimension (required)
- :projection_dim - Projection head output dimension (default: 256)
- :hidden_size - Hidden dimension (default: 512)
- :temperature_init - Initial temperature value (default: 1.0)
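For example, a call spelling out every option (values arbitrary):

{encoder, temperature} =
  SigLIP.build(
    input_dim: 768,        # or equivalently embed_dim: 768
    projection_dim: 256,
    hidden_size: 512,
    temperature_init: 1.0
  )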
Returns
Tuple of {encoder_model, temperature_param} where:
- encoder_model is an Axon model mapping inputs to normalized embeddings
- temperature_param is an Axon parameter for the learnable temperature
@spec default_hidden_size() :: pos_integer()
Get the default hidden size.
@spec default_projection_dim() :: pos_integer()
Get the default projection dimension.
@spec default_temperature() :: float()
Get the default temperature.
@spec loss(Nx.Tensor.t(), Nx.Tensor.t(), keyword()) :: Nx.Tensor.t()
Compute SigLIP loss between two sets of embeddings.
Parameters
- z_a - Embeddings from modality A (e.g., images): [batch, dim]
- z_b - Embeddings from modality B (e.g., text): [batch, dim]
Options
- :temperature - Temperature scaling (default: 1.0)
- :log_temperature - Log temperature (overrides :temperature if provided)
Returns
Scalar loss tensor.
Notes
Assumes diagonal pairs (z_a[i], z_b[i]) are positive matches. All other pairs are treated as negatives.
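A minimal call under those conventions (toy tensors; any two [batch, dim] tensors work):

# z_a and z_b are identical here, so each diagonal pair is a perfect match.
z_a = Nx.divide(Nx.iota({4, 8}, type: :f32), 10.0)
z_b = Nx.divide(Nx.iota({4, 8}, type: :f32), 10.0)

loss = SigLIP.loss(z_a, z_b, temperature: 1.0)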
@spec loss_with_log_temp(Nx.Tensor.t(), Nx.Tensor.t(), Nx.Tensor.t()) :: Nx.Tensor.t()
Compute SigLIP loss using log temperature parameter.
Convenience function that takes the log temperature directly (useful when temperature is a learnable parameter in log space).
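Assuming t = exp(log_t), learning the parameter in log space keeps the effective temperature strictly positive without explicit constraints. Under the option semantics above, these two calls should be equivalent (values illustrative):

log_t = Nx.tensor(0.0)  # exp(0.0) = 1.0, i.e. unit temperature

loss_a = SigLIP.loss_with_log_temp(z_img, z_txt, log_t)
loss_b = SigLIP.loss(z_img, z_txt, log_temperature: log_t)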