DINOv2: Self-supervised vision backbone via self-distillation.
Implements DINOv2 from "DINOv2: Learning Robust Visual Features without Supervision" (Oquab et al., Meta 2023). Learns powerful visual representations through self-distillation without labels, using a student-teacher framework with masked patch prediction.
Key Innovations
- Self-distillation: Student network learns to match teacher's output distribution
- EMA teacher: Teacher parameters are exponential moving average of student
- Register tokens: Learnable tokens (beyond CLS) that improve attention maps
- KoLeo regularizer: Maximizes entropy of patch token distribution
- No labels needed: Completely self-supervised pretraining
Architecture
Student (trained) Teacher (EMA, no grad)
| |
Augmented/Masked Image Full Image
| |
v v
+================+ +================+
| Patch Embed | | Patch Embed |
+================+ +================+
| |
v v
+================+ +================+
| [CLS] + [REG] | | [CLS] + [REG] | (register tokens)
| + Patch Tokens | | + Patch Tokens |
+================+ +================+
| |
v v
+================+ +================+
| Position Embed | | Position Embed |
+================+ +================+
| |
v v
+================+ +================+
| Transformer | | Transformer |
| Blocks x N | | Blocks x N |
+================+ +================+
| |
v v
Extract CLS token Extract CLS token
| |
v v
+================+ +================+
| DINO Head | | DINO Head |
| (MLP + L2 norm)| | (MLP + L2 norm)|
+================+ +================+
| |
v v
Student Teacher
Distribution Distribution
| |
+---------> DINO Loss <-----------+
              (cross-entropy)

DINO Loss
Cross-entropy between student and teacher output distributions:
- Teacher outputs are centered, then sharpened with a low temperature τ_t
- Student outputs are sharpened with a higher temperature τ_s
- Teacher centering prevents collapse where one output dimension dominates; sharpening counteracts collapse to the uniform distribution
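As a framework-neutral illustration of the temperature and centering mechanics (a NumPy sketch, not the module's Nx API; the center values are hypothetical):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

logits = np.array([2.0, 1.0, 0.5])

# Dividing by a temperature before softmax sharpens the distribution;
# the teacher's lower temperature makes its target more peaked.
student = softmax(logits / 0.1)   # tau_s = 0.1
teacher = softmax(logits / 0.04)  # tau_t = 0.04 (sharper)

# Centering subtracts a running mean from the teacher logits before the
# softmax, so no single dimension can dominate every target.
center = np.array([1.5, 0.5, 0.0])  # hypothetical running center
teacher_centered = softmax((logits - center) / 0.04)
```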
Usage
# Build student and teacher ViT backbones
{student, teacher} = DINOv2.build(
image_size: 224,
patch_size: 14,
embed_dim: 384,
num_heads: 6,
num_layers: 12,
num_register_tokens: 4
)
# Compute DINO loss
loss = DINOv2.dino_loss(student_out, teacher_out,
student_temp: 0.1,
teacher_temp: 0.04,
center: center_tensor
)
# Update teacher via EMA after each step
teacher_params = DINOv2.update_teacher(student_params, teacher_params, momentum: 0.996)

References
- Paper: "DINOv2: Learning Robust Visual Features without Supervision"
- arXiv: https://arxiv.org/abs/2304.07193
- Original DINO: "Emerging Properties in Self-Supervised Vision Transformers" (2021)
Summary
Functions
Build both student and teacher DINOv2 networks.
Build a single DINOv2 backbone (ViT with register tokens + DINO head).
Compute the DINO self-distillation loss.
Compute KoLeo regularizer (Kozachenko-Leonenko entropy estimator).
Get the output size of a DINOv2 model.
Get recommended defaults for different model sizes.
Update the running center for teacher outputs.
Update teacher network parameters via exponential moving average.
Types
@type build_opt() ::
        {:embed_dim, pos_integer()}
        | {:head_bottleneck_dim, pos_integer()}
        | {:head_hidden_dim, pos_integer()}
        | {:head_output_dim, pos_integer()}
        | {:image_size, pos_integer()}
        | {:in_channels, pos_integer()}
        | {:mlp_ratio, float()}
        | {:num_heads, pos_integer()}
        | {:num_layers, pos_integer()}
        | {:num_register_tokens, non_neg_integer()}
        | {:patch_size, pos_integer()}
Options for build/1.
Functions
Build both student and teacher DINOv2 networks.
Returns {student_model, teacher_model} tuple. The teacher should be
updated via EMA after each training step using update_teacher/3.
Options
- :image_size - Input image size, square (default: 224)
- :patch_size - Patch size, square (default: 14)
- :in_channels - Number of input channels (default: 3)
- :embed_dim - Embedding dimension (default: 384)
- :num_heads - Number of attention heads (default: 6)
- :num_layers - Number of transformer blocks (default: 12)
- :mlp_ratio - MLP expansion ratio (default: 4.0)
- :num_register_tokens - Number of register tokens (default: 4)
- :head_hidden_dim - DINO head hidden dimension (default: 2048)
- :head_bottleneck_dim - DINO head bottleneck dimension (default: 256)
- :head_output_dim - DINO head output dimension (default: 65536)
Returns
{student_model, teacher_model} tuple of Axon models.
Build a single DINOv2 backbone (ViT with register tokens + DINO head).
Options
Same as build/1, plus:
- :prefix - Layer name prefix ("student" or "teacher")
- :include_head - Whether to include DINO head (default: true)
@spec dino_loss(Nx.Tensor.t(), Nx.Tensor.t(), keyword()) :: Nx.Tensor.t()
Compute the DINO self-distillation loss.
Cross-entropy between student and teacher output distributions, where:
- Teacher outputs are centered (prevents collapse) and sharpened
- Student outputs are sharpened with a higher temperature
Parameters
- student_out - Student network output: [batch, output_dim]
- teacher_out - Teacher network output: [batch, output_dim]
Options
- :student_temp - Student temperature (default: 0.1)
- :teacher_temp - Teacher temperature (default: 0.04)
- :center - Running center for teacher outputs (optional, zeros if not provided)
Returns
Scalar loss tensor.
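A minimal NumPy sketch of this loss (framework-neutral math only, not the module's Nx implementation):

```python
import numpy as np

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def softmax(x, axis=-1):
    return np.exp(log_softmax(x, axis=axis))

def dino_loss(student_out, teacher_out, student_temp=0.1,
              teacher_temp=0.04, center=None):
    # Center defaults to zeros when no running center is supplied.
    if center is None:
        center = np.zeros(teacher_out.shape[-1])
    # Teacher target: centered, then sharpened with the lower temperature.
    # In training this branch carries no gradient.
    t = softmax((teacher_out - center) / teacher_temp)
    # Cross-entropy against the student's (less sharp) log-distribution,
    # averaged over the batch.
    ce = -(t * log_softmax(student_out / student_temp)).sum(axis=-1)
    return ce.mean()
```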
@spec koleo_loss(Nx.Tensor.t()) :: Nx.Tensor.t()
Compute KoLeo regularizer (Kozachenko-Leonenko entropy estimator).
Encourages patch representations to spread uniformly in feature space. The entropy estimate is mean(log(distance to nearest neighbor)); the returned loss is its negative, so minimizing the loss pushes neighboring points apart.
Parameters
- patch_tokens - Patch token representations: [batch, num_patches, embed_dim]
Returns
Scalar KoLeo loss (negative, to maximize entropy).
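A small NumPy sketch of this estimator (an O(n²) pairwise-distance illustration, not the module's Nx implementation; the epsilon guard is an assumption):

```python
import numpy as np

def koleo_loss(patch_tokens):
    """Negative mean log nearest-neighbor distance over L2-normalized tokens."""
    x = patch_tokens.reshape(-1, patch_tokens.shape[-1])
    x = x / np.linalg.norm(x, axis=-1, keepdims=True)
    # Pairwise Euclidean distances; mask the diagonal so a point is
    # never its own nearest neighbor.
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = d.min(axis=1)
    # Epsilon (assumed) guards log(0) for coincident points.
    return -np.mean(np.log(nn + 1e-8))
```

Well-spread tokens have large nearest-neighbor distances, so the loss is lower than for tokens clustered in one region.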
@spec output_size(keyword()) :: pos_integer()
Get the output size of a DINOv2 model.
Get recommended defaults for different model sizes.
@spec update_center(Nx.Tensor.t(), Nx.Tensor.t(), keyword()) :: Nx.Tensor.t()
Update the running center for teacher outputs.
The center is an exponential moving average of teacher outputs, subtracted from the teacher logits before the softmax to prevent collapse where one output dimension dominates.
Parameters
- teacher_out - Current batch teacher output: [batch, output_dim]
- center - Current center: [output_dim]
Options
- :momentum - Center update momentum (default: 0.9)
Returns
Updated center tensor.
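The update is a plain EMA of the batch-mean teacher output; a NumPy sketch (illustration only, not the module's Nx code):

```python
import numpy as np

def update_center(teacher_out, center, momentum=0.9):
    # EMA of the per-dimension batch mean of teacher outputs.
    batch_mean = teacher_out.mean(axis=0)
    return momentum * center + (1.0 - momentum) * batch_mean
```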
Update teacher network parameters via exponential moving average.
teacher = momentum * teacher + (1 - momentum) * student
Parameters
- student_params - Student network parameters (map of tensors)
- teacher_params - Teacher network parameters (map of tensors)
Options
- :momentum - EMA momentum (default: 0.996)
Returns
Updated teacher parameters.
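The EMA above is applied per tensor across the parameter map; a NumPy sketch over a dict of arrays (illustration only, not the module's Nx code):

```python
import numpy as np

def update_teacher(student_params, teacher_params, momentum=0.996):
    # teacher = momentum * teacher + (1 - momentum) * student, per tensor.
    return {
        name: momentum * teacher_params[name] + (1.0 - momentum) * p
        for name, p in student_params.items()
    }
```

With the default momentum of 0.996 the teacher drifts slowly toward the student, which is what keeps its targets stable across training steps.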