Edifice.Vision.DINOv2 (Edifice v0.2.0)

DINOv2: Self-supervised vision backbone via self-distillation.

Implements DINOv2 from "DINOv2: Learning Robust Visual Features without Supervision" (Oquab et al., Meta 2023). Learns powerful visual representations through self-distillation without labels, using a student-teacher framework with masked patch prediction.

Key Innovations

  • Self-distillation: Student network learns to match teacher's output distribution
  • EMA teacher: Teacher parameters are exponential moving average of student
  • Register tokens: Learnable tokens (beyond CLS) that improve attention maps
  • KoLeo regularizer: Maximizes entropy of patch token distribution
  • No labels needed: Completely self-supervised pretraining

Architecture

Student (trained)                 Teacher (EMA, no grad)
      |                                 |
Augmented/Masked Image            Full Image
      |                                 |
      v                                 v
+================+              +================+
|  Patch Embed   |              |  Patch Embed   |
+================+              +================+
      |                                 |
      v                                 v
+================+              +================+
| [CLS] + [REG]  |              | [CLS] + [REG]  |  (register tokens)
| + Patch Tokens |              | + Patch Tokens |
+================+              +================+
      |                                 |
      v                                 v
+================+              +================+
| Position Embed |              | Position Embed |
+================+              +================+
      |                                 |
      v                                 v
+================+              +================+
| Transformer    |              | Transformer    |
| Blocks x N     |              | Blocks x N     |
+================+              +================+
      |                                 |
      v                                 v
Extract CLS token              Extract CLS token
      |                                 |
      v                                 v
+================+              +================+
|   DINO Head    |              |   DINO Head    |
| (MLP + L2 norm)|              | (MLP + L2 norm)|
+================+              +================+
      |                                 |
      v                                 v
    Student                          Teacher
 Distribution                     Distribution
      |                                 |
      +---------> DINO Loss <-----------+
                (cross-entropy)

DINO Loss

Cross-entropy between student and teacher output distributions:

  • Teacher outputs are centered, then sharpened with a low temperature τ_t
  • Student outputs are sharpened with a higher temperature τ_s (τ_s > τ_t)
  • Centering prevents any single dimension from dominating, while sharpening counteracts collapse to the uniform distribution
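The loss can be sketched in plain Elixir on single logit vectors (a simplified stand-in for the module's batched implementation; `DinoLossSketch` and its argument names are illustrative, not part of this module's API):

```elixir
defmodule DinoLossSketch do
  # Single-sample version of the DINO loss:
  # cross-entropy H(p_teacher, p_student), with centering applied
  # to the teacher logits before its (sharper) softmax.
  def loss(student_logits, teacher_logits, center, s_temp \\ 0.1, t_temp \\ 0.04) do
    p_s = softmax(student_logits, s_temp)

    p_t =
      teacher_logits
      |> Enum.zip_with(center, fn z, c -> z - c end)
      |> softmax(t_temp)

    # -sum(p_t * log(p_s))
    p_t
    |> Enum.zip_with(p_s, fn pt, ps -> -pt * :math.log(ps) end)
    |> Enum.sum()
  end

  # Numerically stable softmax with temperature.
  defp softmax(logits, temp) do
    scaled = Enum.map(logits, &(&1 / temp))
    m = Enum.max(scaled)
    exps = Enum.map(scaled, &:math.exp(&1 - m))
    z = Enum.sum(exps)
    Enum.map(exps, &(&1 / z))
  end
end
```

When student and teacher agree, the loss reduces to the entropy of the (sharp) teacher distribution; disagreement increases it.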

Usage

# Build student and teacher ViT backbones
{student, teacher} = DINOv2.build(
  image_size: 224,
  patch_size: 14,
  embed_dim: 384,
  num_heads: 6,
  num_layers: 12,
  num_register_tokens: 4
)

# Compute DINO loss
loss = DINOv2.dino_loss(student_out, teacher_out,
  student_temp: 0.1,
  teacher_temp: 0.04,
  center: center_tensor
)

# Update teacher via EMA after each step
teacher_params = DINOv2.update_teacher(student_params, teacher_params, momentum: 0.996)
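The DINO recipe ramps the EMA momentum from its base value toward 1.0 over training with a cosine schedule, so the teacher freezes late in training. A plain-Elixir sketch of that schedule (`MomentumSchedule` is an illustrative name, not part of this module):

```elixir
defmodule MomentumSchedule do
  # Cosine ramp of the EMA momentum from `base` at step 0 to 1.0 at
  # `total_steps`, as in the DINO training recipe.
  def at(step, total_steps, base \\ 0.996) do
    1.0 - (1.0 - base) * (:math.cos(:math.pi() * step / total_steps) + 1) / 2
  end
end
```

The result can be passed as the `:momentum` option to `update_teacher/3` each step.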

References

  • Paper: "DINOv2: Learning Robust Visual Features without Supervision"
  • arXiv: https://arxiv.org/abs/2304.07193
  • Original DINO: "Emerging Properties in Self-Supervised Vision Transformers" (2021)

Summary

Types

Options for build/1.

Functions

Build both student and teacher DINOv2 networks.

Build a single DINOv2 backbone (ViT with register tokens + DINO head).

Compute the DINO self-distillation loss.

Compute KoLeo regularizer (Kozachenko-Leonenko entropy estimator).

Get the output size of a DINOv2 model.

Get recommended defaults for different model sizes.

Update the running center for teacher outputs.

Update teacher network parameters via exponential moving average.

Types

build_opt()

@type build_opt() ::
  {:embed_dim, pos_integer()}
  | {:head_bottleneck_dim, pos_integer()}
  | {:head_hidden_dim, pos_integer()}
  | {:head_output_dim, pos_integer()}
  | {:image_size, pos_integer()}
  | {:in_channels, pos_integer()}
  | {:mlp_ratio, float()}
  | {:num_heads, pos_integer()}
  | {:num_layers, pos_integer()}
  | {:num_register_tokens, non_neg_integer()}
  | {:patch_size, pos_integer()}

Options for build/1.

Functions

build(opts \\ [])

@spec build([build_opt()]) :: {Axon.t(), Axon.t()}

Build both student and teacher DINOv2 networks.

Returns {student_model, teacher_model} tuple. The teacher should be updated via EMA after each training step using update_teacher/3.

Options

  • :image_size - Input image size, square (default: 224)
  • :patch_size - Patch size, square (default: 14)
  • :in_channels - Number of input channels (default: 3)
  • :embed_dim - Embedding dimension (default: 384)
  • :num_heads - Number of attention heads (default: 6)
  • :num_layers - Number of transformer blocks (default: 12)
  • :mlp_ratio - MLP expansion ratio (default: 4.0)
  • :num_register_tokens - Number of register tokens (default: 4)
  • :head_hidden_dim - DINO head hidden dimension (default: 2048)
  • :head_bottleneck_dim - DINO head bottleneck dimension (default: 256)
  • :head_output_dim - DINO head output dimension (default: 65536)

Returns

{student_model, teacher_model} tuple of Axon models.

build_backbone(opts \\ [])

@spec build_backbone(keyword()) :: Axon.t()

Build a single DINOv2 backbone (ViT with register tokens + DINO head).

Options

Same as build/1, plus:

  • :prefix - Layer name prefix ("student" or "teacher")
  • :include_head - Whether to include DINO head (default: true)

dino_loss(student_out, teacher_out, opts \\ [])

@spec dino_loss(Nx.Tensor.t(), Nx.Tensor.t(), keyword()) :: Nx.Tensor.t()

Compute the DINO self-distillation loss.

Cross-entropy between student and teacher output distributions, where:

  • Teacher outputs are centered and sharpened with a low temperature (prevents collapse)
  • Student outputs use a higher temperature, giving a softer distribution

Parameters

  • student_out - Student network output: [batch, output_dim]
  • teacher_out - Teacher network output: [batch, output_dim]

Options

  • :student_temp - Student temperature (default: 0.1)
  • :teacher_temp - Teacher temperature (default: 0.04)
  • :center - Running center for teacher outputs (optional, zeros if not provided)

Returns

Scalar loss tensor.

koleo_loss(patch_tokens)

@spec koleo_loss(Nx.Tensor.t()) :: Nx.Tensor.t()

Compute KoLeo regularizer (Kozachenko-Leonenko entropy estimator).

Encourages patch representations to spread uniformly in feature space. KoLeo loss = -mean(log(distance to nearest neighbor)), so minimizing the loss maximizes the entropy estimate.

Parameters

  • patch_tokens - Patch token representations: [batch, num_patches, embed_dim]

Returns

Scalar KoLeo loss (a negated entropy estimate; minimizing it maximizes entropy).
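A plain-Elixir sketch of the estimator on a flat list of vectors (the module operates on batched `[batch, num_patches, embed_dim]` tensors; `KoLeoSketch` is illustrative only):

```elixir
defmodule KoLeoSketch do
  # KoLeo loss = -(1/n) * sum_i log(d_i), where d_i is the Euclidean
  # distance from vector i to its nearest neighbor. Points packed close
  # together give small d_i, hence a large loss; spreading them out
  # lowers the loss.
  def loss(vectors) do
    n = length(vectors)
    indexed = Enum.with_index(vectors)

    indexed
    |> Enum.map(fn {v, i} ->
      indexed
      |> Enum.reject(fn {_, j} -> j == i end)
      |> Enum.map(fn {u, _} -> dist(v, u) end)
      |> Enum.min()
      |> :math.log()
    end)
    |> Enum.sum()
    |> Kernel.*(-1 / n)
  end

  defp dist(a, b) do
    a
    |> Enum.zip_with(b, fn x, y -> (x - y) * (x - y) end)
    |> Enum.sum()
    |> :math.sqrt()
  end
end
```

Note the brute-force O(n²) nearest-neighbor search; it is fine for a sketch but not for large token counts.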

output_size(opts \\ [])

@spec output_size(keyword()) :: pos_integer()

Get the output size of a DINOv2 model.

update_center(teacher_out, center, opts \\ [])

@spec update_center(Nx.Tensor.t(), Nx.Tensor.t(), keyword()) :: Nx.Tensor.t()

Update the running center for teacher outputs.

The center is an exponential moving average of teacher outputs. Subtracting it before the teacher softmax prevents any single output dimension from dominating; combined with temperature sharpening, this avoids collapse.

Parameters

  • teacher_out - Current batch teacher output: [batch, output_dim]
  • center - Current center: [output_dim]

Options

  • :momentum - Center update momentum (default: 0.9)

Returns

Updated center tensor.
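The update rule is center' = momentum * center + (1 - momentum) * batch_mean(teacher_out). A plain-number sketch over lists (`CenterSketch` is illustrative, not this module's API):

```elixir
defmodule CenterSketch do
  # teacher_out: list of per-sample output vectors; center: one vector.
  # Averages teacher outputs over the batch, then blends that mean into
  # the running center with the given momentum.
  def update(teacher_out, center, momentum \\ 0.9) do
    n = length(teacher_out)

    # Column-wise batch mean of the teacher outputs.
    batch_mean = Enum.zip_with(teacher_out, fn col -> Enum.sum(col) / n end)

    Enum.zip_with(batch_mean, center, fn m, c ->
      momentum * c + (1 - momentum) * m
    end)
  end
end
```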

update_teacher(student_params, teacher_params, opts \\ [])

@spec update_teacher(map(), map(), keyword()) :: map()

Update teacher network parameters via exponential moving average.

teacher = momentum * teacher + (1 - momentum) * student

Parameters

  • student_params - Student network parameters (map of tensors)
  • teacher_params - Teacher network parameters (map of tensors)

Options

  • :momentum - EMA momentum (default: 0.996)

Returns

Updated teacher parameters.
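The rule applied leaf-wise can be sketched on flat maps of plain numbers (real parameter maps are nested maps of tensors; `TeacherEMASketch` is illustrative only):

```elixir
defmodule TeacherEMASketch do
  # Leaf-wise EMA: each teacher parameter drifts slowly toward the
  # corresponding student parameter. With momentum near 1.0 the teacher
  # is a smoothed, lagging copy of the student.
  def update(student_params, teacher_params, momentum \\ 0.996) do
    Map.new(teacher_params, fn {name, t} ->
      {name, momentum * t + (1 - momentum) * Map.fetch!(student_params, name)}
    end)
  end
end
```

Note the teacher receives no gradients; this EMA update is its only learning signal.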