# `Edifice.Contrastive.SigLIP`
[🔗](https://github.com/blasphemetheus/edifice/blob/main/lib/edifice/contrastive/siglip.ex#L1)

SigLIP - Sigmoid Loss for Language-Image Pre-training.

Implements SigLIP, which replaces the softmax-based contrastive loss (used in CLIP)
with a simpler sigmoid-based binary classification loss. Each image-text pair is
treated as an independent binary classification problem.

## Key Innovation

Instead of softmax cross-entropy over all pairs in the batch:
```
CLIP: -log(exp(t * sim_pos) / sum_j(exp(t * sim_j)))
```

SigLIP minimizes a sigmoid loss on each pair independently:
```
SigLIP: -sum_ij(log(sigmoid(t * sim_ij * y_ij)))
```

Where:
- `sim_ij` is the cosine similarity between embeddings `i` and `j`
- `y_ij` = +1 for matching pairs, -1 for non-matching pairs
- `t` is a learnable temperature parameter

(The original paper additionally learns a bias added to the logits inside the sigmoid; the simplified form above omits it.)
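
As a concrete reference, here is a minimal Nx sketch of this loss (an illustration of the formula, not the module's internal implementation; `z_a` and `z_b` are assumed to be L2-normalized `[batch, dim]` tensors):

```elixir
defmodule SigLIPSketch do
  # Hypothetical helper module, not part of Edifice: a direct Nx
  # translation of the formula above, normalized by batch size as in
  # the paper. Assumes z_a and z_b are L2-normalized [batch, dim].
  def loss(z_a, z_b, t \\ 1.0) do
    {batch, _dim} = Nx.shape(z_a)

    # sim_ij: pairwise cosine similarities for every (i, j) pair
    sim = Nx.dot(z_a, Nx.transpose(z_b))

    # y_ij: +1 on the diagonal (matching pairs), -1 everywhere else
    labels = Nx.subtract(Nx.multiply(Nx.eye(batch, type: :f32), 2.0), 1.0)

    # loss = -1/batch * sum_ij log sigmoid(t * sim_ij * y_ij)
    sim
    |> Nx.multiply(labels)
    |> Nx.multiply(t)
    |> Nx.sigmoid()
    |> Nx.log()
    |> Nx.sum()
    |> Nx.divide(batch)
    |> Nx.negate()
  end
end
```

A call like `SigLIPSketch.loss(z_img, z_txt, 10.0)` then yields a scalar loss tensor.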

## Advantages

- **Simpler**: No need to normalize over all negatives
- **Scalable**: Each pair is independent, enabling larger batch sizes
- **Stable**: each pair contributes a bounded, independent gradient term, making training less sensitive to batch composition
- **Effective**: Matches or exceeds CLIP performance in practice

## Architecture

```
Image Encoder          Text Encoder
      |                      |
      v                      v
+------------+        +------------+
|  Backbone  |        |  Backbone  |
+------------+        +------------+
      |                      |
      v                      v
+------------+        +------------+
| Projection |        | Projection |
+------------+        +------------+
      |                      |
      v                      v
     z_img    SigLIP       z_txt
      |       Loss           |
      +-------> <-----------+
```

## Usage

    {encoder, _temperature} = SigLIP.build(input_dim: 512, projection_dim: 256)

    # Compute SigLIP loss on paired [batch, dim] embeddings
    loss = SigLIP.loss(z_img, z_txt, temperature: 1.0)

## Reference

- "Sigmoid Loss for Language Image Pre-Training" (Zhai et al., 2023)

# `build_opt`

```elixir
@type build_opt() ::
  {:input_dim, pos_integer()}
  | {:embed_dim, pos_integer()}
  | {:projection_dim, pos_integer()}
  | {:hidden_size, pos_integer()}
  | {:temperature_init, float()}
```

Options for `build/1`.
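
For example, a typical options list (values here are illustrative):

```elixir
[input_dim: 512, projection_dim: 256, hidden_size: 512, temperature_init: 1.0]
```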

# `build`

```elixir
@spec build([build_opt()]) :: {Axon.t(), struct()}
```

Build a SigLIP encoder model.

## Options

  - `:input_dim` or `:embed_dim` - Input feature dimension (required)
  - `:projection_dim` - Projection head output dimension (default: 256)
  - `:hidden_size` - Hidden dimension (default: 512)
  - `:temperature_init` - Initial temperature value (default: 1.0)

## Returns

  Tuple of `{encoder_model, temperature_param}` where:
  - `encoder_model` is an Axon model mapping inputs to normalized embeddings
  - `temperature_param` is an Axon parameter for learnable temperature
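
A sketch of the standard Axon workflow with the returned encoder (the `{8, 512}` shapes are illustrative, and the exact init arguments can vary slightly across Axon versions):

```elixir
alias Edifice.Contrastive.SigLIP

{encoder, _temperature} = SigLIP.build(input_dim: 512, projection_dim: 256)

# Compile the model into init/predict functions.
{init_fn, predict_fn} = Axon.build(encoder)

# Initialize parameters from an input template of shape [batch, input_dim].
params = init_fn.(Nx.template({8, 512}, :f32), %{})

# Forward pass: returns normalized embeddings of shape [batch, projection_dim].
z = predict_fn.(params, Nx.broadcast(0.5, {8, 512}))
```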

# `default_hidden_size`

```elixir
@spec default_hidden_size() :: pos_integer()
```

Get the default hidden size.

# `default_projection_dim`

```elixir
@spec default_projection_dim() :: pos_integer()
```

Get the default projection dimension.

# `default_temperature`

```elixir
@spec default_temperature() :: float()
```

Get the default temperature.

# `loss`

```elixir
@spec loss(Nx.Tensor.t(), Nx.Tensor.t(), keyword()) :: Nx.Tensor.t()
```

Compute SigLIP loss between two sets of embeddings.

## Parameters

  - `z_a` - Embeddings from modality A (e.g., images): [batch, dim]
  - `z_b` - Embeddings from modality B (e.g., text): [batch, dim]

## Options

  - `:temperature` - Temperature scaling (default: 1.0)
  - `:log_temperature` - Log-space temperature; when provided, the effective temperature is `exp(log_temperature)` and `:temperature` is ignored

## Returns

  Scalar loss tensor.

## Notes

  Assumes diagonal pairs (z_a[i], z_b[i]) are positive matches.
  All other pairs are treated as negatives.
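
A toy invocation (values are illustrative; each row of `z_a` and `z_b` is unit-norm, so the pairwise dot products are already cosine similarities):

```elixir
# z_a[i] is most similar to z_b[i], so the diagonal pairs act as positives.
z_a = Nx.tensor([[1.0, 0.0], [0.0, 1.0]])
z_b = Nx.tensor([[0.8, 0.6], [0.6, 0.8]])

SigLIP.loss(z_a, z_b, temperature: 1.0)
#=> scalar loss tensor
```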

# `loss_with_log_temp`

```elixir
@spec loss_with_log_temp(Nx.Tensor.t(), Nx.Tensor.t(), Nx.Tensor.t()) :: Nx.Tensor.t()
```

Compute SigLIP loss using log temperature parameter.

Convenience function that takes the log temperature directly
(useful when temperature is a learnable parameter in log space).
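
For example, with `z_a`/`z_b` as in the `loss/3` example above, and assuming the log temperature is exponentiated internally, these two calls should compute the same value:

```elixir
# exp(0.0) == 1.0, so a log temperature of 0.0 matches temperature: 1.0.
SigLIP.loss_with_log_temp(z_a, z_b, Nx.tensor(0.0))
SigLIP.loss(z_a, z_b, temperature: 1.0)
```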

---

*Consult [api-reference.md](api-reference.md) for complete listing*
