# `Edifice.Vision.DINOv2`
[🔗](https://github.com/blasphemetheus/edifice/blob/main/lib/edifice/vision/dino_v2.ex#L1)

DINOv2: Self-supervised vision backbone via self-distillation.

Implements DINOv2 from "DINOv2: Learning Robust Visual Features without
Supervision" (Oquab et al., Meta 2023). Learns powerful visual representations
through self-distillation without labels, using a student-teacher framework
with masked patch prediction.

## Key Innovations

- **Self-distillation**: Student network learns to match teacher's output distribution
- **EMA teacher**: Teacher parameters are exponential moving average of student
- **Register tokens**: Learnable tokens (beyond CLS) that improve attention maps
- **KoLeo regularizer**: Maximizes entropy of patch token distribution
- **No labels needed**: Completely self-supervised pretraining

## Architecture

```
Student (trained)                 Teacher (EMA, no grad)
      |                                 |
Augmented/Masked Image            Full Image
      |                                 |
      v                                 v
+================+              +================+
|  Patch Embed   |              |  Patch Embed   |
+================+              +================+
      |                                 |
      v                                 v
+================+              +================+
| [CLS] + [REG]  |              | [CLS] + [REG]  |  (register tokens)
| + Patch Tokens |              | + Patch Tokens |
+================+              +================+
      |                                 |
      v                                 v
+================+              +================+
| Position Embed |              | Position Embed |
+================+              +================+
      |                                 |
      v                                 v
+================+              +================+
| Transformer    |              | Transformer    |
| Blocks x N     |              | Blocks x N     |
+================+              +================+
      |                                 |
      v                                 v
Extract CLS token              Extract CLS token
      |                                 |
      v                                 v
+================+              +================+
|   DINO Head    |              |   DINO Head    |
| (MLP + L2 norm)|              | (MLP + L2 norm)|
+================+              +================+
      |                                 |
      v                                 v
    Student                          Teacher
 Distribution                     Distribution
      |                                 |
      +---------> DINO Loss <-----------+
                (cross-entropy)
```

## DINO Loss

Cross-entropy between student and teacher output distributions:
  - Teacher outputs are centered, then sharpened with a low temperature τ_t
  - Student outputs are softmaxed at a higher temperature τ_s (less sharp)
  - Centering and sharpening counteract each other's failure modes:
    centering keeps any one dimension from dominating, while sharpening
    prevents collapse to the uniform distribution
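The recipe above can be sketched numerically. This is a NumPy illustration of the loss math only, not this module's Nx implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dino_loss(student_out, teacher_out, center,
              student_temp=0.1, teacher_temp=0.04):
    # Teacher: subtract the running center, then sharpen (low temperature).
    t = softmax((teacher_out - center) / teacher_temp)
    # Student: softmax at a higher temperature (no centering).
    log_s = np.log(softmax(student_out / student_temp))
    # Cross-entropy between the two distributions, averaged over the batch.
    return -np.mean(np.sum(t * log_s, axis=-1))
```

An aligned student/teacher pair scores a lower loss than a mismatched one, which is what drives the distillation.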

## Usage

    # Build student and teacher ViT backbones
    {student, teacher} = DINOv2.build(
      image_size: 224,
      patch_size: 14,
      embed_dim: 384,
      num_heads: 6,
      num_layers: 12,
      num_register_tokens: 4
    )

    # Compute DINO loss
    loss = DINOv2.dino_loss(student_out, teacher_out,
      student_temp: 0.1,
      teacher_temp: 0.04,
      center: center_tensor
    )

    # Update teacher via EMA after each step
    teacher_params = DINOv2.update_teacher(student_params, teacher_params, momentum: 0.996)

## References

- Paper: "DINOv2: Learning Robust Visual Features without Supervision"
- arXiv: https://arxiv.org/abs/2304.07193
- Original DINO: "Emerging Properties in Self-Supervised Vision Transformers" (2021)

# `build_opt`

```elixir
@type build_opt() ::
  {:embed_dim, pos_integer()}
  | {:head_bottleneck_dim, pos_integer()}
  | {:head_hidden_dim, pos_integer()}
  | {:head_output_dim, pos_integer()}
  | {:image_size, pos_integer()}
  | {:in_channels, pos_integer()}
  | {:mlp_ratio, float()}
  | {:num_heads, pos_integer()}
  | {:num_layers, pos_integer()}
  | {:num_register_tokens, non_neg_integer()}
  | {:patch_size, pos_integer()}
```

Options for `build/1`.

# `build`

```elixir
@spec build([build_opt()]) :: {Axon.t(), Axon.t()}
```

Build both student and teacher DINOv2 networks.

Returns `{student_model, teacher_model}` tuple. The teacher should be
updated via EMA after each training step using `update_teacher/3`.

## Options

  - `:image_size` - Input image size, square (default: 224)
  - `:patch_size` - Patch size, square (default: 14)
  - `:in_channels` - Number of input channels (default: 3)
  - `:embed_dim` - Embedding dimension (default: 384)
  - `:num_heads` - Number of attention heads (default: 6)
  - `:num_layers` - Number of transformer blocks (default: 12)
  - `:mlp_ratio` - MLP expansion ratio (default: 4.0)
  - `:num_register_tokens` - Number of register tokens (default: 4)
  - `:head_hidden_dim` - DINO head hidden dimension (default: 2048)
  - `:head_bottleneck_dim` - DINO head bottleneck dimension (default: 256)
  - `:head_output_dim` - DINO head output dimension (default: 65536)

## Returns

  `{student_model, teacher_model}` tuple of Axon models.
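With the defaults above (224px images, 14px patches, one CLS token, 4 registers), the transformer's token sequence length works out as follows; a quick arithmetic check:

```python
# Token count for the default configuration (224px image, 14px patches).
image_size, patch_size, num_register_tokens = 224, 14, 4

num_patches = (image_size // patch_size) ** 2    # 16 x 16 = 256 patch tokens
seq_len = 1 + num_register_tokens + num_patches  # [CLS] + registers + patches
```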

# `build_backbone`

```elixir
@spec build_backbone(keyword()) :: Axon.t()
```

Build a single DINOv2 backbone (ViT with register tokens + DINO head).

## Options

  Same as `build/1`, plus:
  - `:prefix` - Layer name prefix ("student" or "teacher")
  - `:include_head` - Whether to include DINO head (default: true)

# `dino_loss`

```elixir
@spec dino_loss(Nx.Tensor.t(), Nx.Tensor.t(), keyword()) :: Nx.Tensor.t()
```

Compute the DINO self-distillation loss.

Cross-entropy between student and teacher output distributions, where:
- Teacher outputs are centered, then sharpened with a low temperature
- Student outputs are softmaxed at a higher temperature (less sharp than the
  teacher's); the centering/sharpening combination prevents collapse

## Parameters

  - `student_out` - Student network output: [batch, output_dim]
  - `teacher_out` - Teacher network output: [batch, output_dim]

## Options

  - `:student_temp` - Student temperature (default: 0.1)
  - `:teacher_temp` - Teacher temperature (default: 0.04)
  - `:center` - Running center for teacher outputs (optional, zeros if not provided)

## Returns

  Scalar loss tensor.

# `koleo_loss`

```elixir
@spec koleo_loss(Nx.Tensor.t()) :: Nx.Tensor.t()
```

Compute KoLeo regularizer (Kozachenko-Leonenko entropy estimator).

Encourages uniform spreading of patch representations in feature space.
The returned loss is `-mean(log(distance to nearest neighbor))`, so
minimizing it pushes representations apart, maximizing their entropy.
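The estimator can be sketched with plain NumPy over a single set of vectors, as an illustration of the formula rather than this module's Nx code:

```python
import numpy as np

def koleo_loss(x, eps=1e-8):
    # x: [n, d] feature vectors.
    # Pairwise Euclidean distances, with self-distances masked out.
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)
    nn = dist.min(axis=1)  # distance from each point to its nearest neighbor
    # Negative mean log nearest-neighbor distance: minimizing this
    # spreads the points apart (maximizes the entropy estimate).
    return -np.mean(np.log(nn + eps))
```

Well-spread points score lower than tightly clustered ones.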

## Parameters

  - `patch_tokens` - Patch token representations: [batch, num_patches, embed_dim]

## Returns

  Scalar KoLeo loss (negative, to maximize entropy).

# `output_size`

```elixir
@spec output_size(keyword()) :: pos_integer()
```

Get the output size of a DINOv2 model.

# `recommended_defaults`

```elixir
@spec recommended_defaults(atom()) :: keyword()
```

Get recommended defaults for different model sizes.
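For reference, the backbone sizes defined in the DINOv2 paper (ViT-S/B/L/g at patch size 14) are listed below. The size names used as keys are illustrative assumptions; the dimensions themselves come from the paper:

```python
# DINOv2 backbone sizes from the paper; the key names here are illustrative,
# not necessarily the atoms this function accepts.
SIZES = {
    "small": {"embed_dim": 384,  "num_heads": 6,  "num_layers": 12},
    "base":  {"embed_dim": 768,  "num_heads": 12, "num_layers": 12},
    "large": {"embed_dim": 1024, "num_heads": 16, "num_layers": 24},
    "giant": {"embed_dim": 1536, "num_heads": 24, "num_layers": 40},
}
```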

# `update_center`

```elixir
@spec update_center(Nx.Tensor.t(), Nx.Tensor.t(), keyword()) :: Nx.Tensor.t()
```

Update the running center for teacher outputs.

The center is an exponential moving average of the batch-mean teacher
output. Subtracting it before the teacher softmax (together with
temperature sharpening) prevents representation collapse.
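In NumPy terms, the update reduces the batch to its per-dimension mean and folds that into the running center (a sketch of the math, not the module's Nx code):

```python
import numpy as np

def update_center(teacher_out, center, momentum=0.9):
    # EMA of the per-dimension batch mean of teacher outputs.
    batch_center = teacher_out.mean(axis=0)
    return momentum * center + (1.0 - momentum) * batch_center
```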

## Parameters

  - `teacher_out` - Current batch teacher output: [batch, output_dim]
  - `center` - Current center: [output_dim]

## Options

  - `:momentum` - Center update momentum (default: 0.9)

## Returns

  Updated center tensor.

# `update_teacher`

```elixir
@spec update_teacher(map(), map(), keyword()) :: map()
```

Update teacher network parameters via exponential moving average.

`teacher = momentum * teacher + (1 - momentum) * student`
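A minimal NumPy sketch of the per-tensor update, assuming a flat map of parameter arrays (real Axon parameter maps are nested, so the actual implementation presumably recurses):

```python
import numpy as np

def update_teacher(student_params, teacher_params, momentum=0.996):
    # Per-tensor EMA: the teacher drifts slowly toward the student.
    return {
        name: momentum * teacher_params[name] + (1.0 - momentum) * student
        for name, student in student_params.items()
    }
```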

## Parameters

  - `student_params` - Student network parameters (map of tensors)
  - `teacher_params` - Teacher network parameters (map of tensors)

## Options

  - `:momentum` - EMA momentum (default: 0.996)

## Returns

  Updated teacher parameters.

---

*Consult [api-reference.md](api-reference.md) for complete listing*
