Patch Embedding for Vision Transformers.
Splits images into fixed-size patches and linearly projects each patch into an embedding vector. This is the standard input processing for ViT, DeiT, MAE, and other vision transformer architectures.
How It Works
- Split image into non-overlapping patches of size P x P
- Flatten each patch into a vector of size P * P * C, where C is the number of channels
- Linearly project each flattened patch to the embedding dimension

For a 224x224 RGB image with 16x16 patches: (224 / 16)^2 = 196 patches, each 768-dim (16 * 16 * 3 = 768).
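The arithmetic, in an iex session:

iex> div(224, 16) * div(224, 16)
196
iex> 16 * 16 * 3
768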
Architecture
Image [batch, channels, height, width]
|
v
+----------------------------------+
| Split into P x P patches |
| (H/P * W/P = num_patches total) |
+----------------------------------+
|
v
[batch, num_patches, P*P*C]
|
v
+----------------------------------+
| Linear projection to embed_dim |
+----------------------------------+
|
v
[batch, num_patches, embed_dim]
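The diagram is just a reshape, a transpose, and a learned linear projection. A minimal sketch of that pipeline with Nx and Axon (the PatchSketch module name and the channels-first layout are illustrative assumptions, not this module's actual internals):

defmodule PatchSketch do
  # [batch, channels, height, width] -> [batch, num_patches, P*P*C]
  def patchify(image, patch_size) do
    {b, c, h, w} = Nx.shape(image)
    {gh, gw} = {div(h, patch_size), div(w, patch_size)}

    image
    |> Nx.reshape({b, c, gh, patch_size, gw, patch_size})
    |> Nx.transpose(axes: [0, 2, 4, 3, 5, 1])  # -> [batch, gh, gw, ph, pw, c]
    |> Nx.reshape({b, gh * gw, patch_size * patch_size * c})
  end
end

# Axon.nx/2 wraps the patch split; Axon.dense/3 is the learned projection.
input = Axon.input("image", shape: {nil, 3, 224, 224})

embedded =
  input
  |> Axon.nx(&PatchSketch.patchify(&1, 16))
  |> Axon.dense(768, name: "proj")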
Usage

patches = PatchEmbed.layer(image,
  image_size: 224,
  patch_size: 16,
  in_channels: 3,
  embed_dim: 768
)

References
- "An Image is Worth 16x16 Words" (Dosovitskiy et al., 2021)
Functions
Build a patch embedding Axon layer.
Options
- :image_size - Input image size (square, default: 224)
- :patch_size - Patch size (square, default: 16)
- :in_channels - Number of input channels (default: 3)
- :embed_dim - Output embedding dimension (required)
- :name - Layer name prefix (default: "patch_embed")
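A quick end-to-end shape check (a sketch assuming Axon ~> 0.6, where init_fn takes a template plus a plain params map, and that the layer is attached to an Axon input node as in the usage example):

input = Axon.input("image", shape: {nil, 3, 224, 224})
model = PatchEmbed.layer(input, embed_dim: 768)

{init_fn, predict_fn} = Axon.build(model)
# Initialize parameters from a shape/type template, with no starting params
params = init_fn.(Nx.template({1, 3, 224, 224}, :f32), %{})

image = Nx.iota({1, 3, 224, 224}, type: :f32)
predict_fn.(params, image) |> Nx.shape()
#=> {1, 196, 768}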
@spec num_patches(pos_integer(), pos_integer()) :: pos_integer()
Calculate the number of patches for given image and patch sizes.
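For the defaults above (assuming the argument order is image size, then patch size):

iex> PatchEmbed.num_patches(224, 16)
196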