Edifice.Blocks.PatchEmbed (Edifice v0.2.0)


Patch Embedding for Vision Transformers.

Splits images into fixed-size patches and linearly projects each patch into an embedding vector. This is the standard input processing for ViT, DeiT, MAE, and other vision transformer architectures.

How It Works

  1. Split the image into non-overlapping patches of size P x P
  2. Flatten each patch into a vector of length P * P * C (C = input channels)
  3. Linearly project each vector to the embedding dimension

For a 224x224 RGB image with 16x16 patches: (224 / 16)^2 = 196 patches, each flattened to 16 * 16 * 3 = 768 values.
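The arithmetic above can be checked directly in plain Elixir (this is just the shape math, independent of the layer itself):

```elixir
image_size = 224
patch_size = 16
channels = 3

# Patches per side: 224 / 16 = 14, so 14 * 14 = 196 patches total.
per_side = div(image_size, patch_size)
num_patches = per_side * per_side

# Flattened length of one patch: 16 * 16 * 3 = 768.
patch_dim = patch_size * patch_size * channels
```

Note that 768 here is the flattened patch length; it coincides with the ViT-Base embedding dimension, but the two are independent in general.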

Architecture

Image [batch, channels, height, width]
      |
      v
+----------------------------------+
| Split into P x P patches         |
| (H/P * W/P = num_patches total)  |
+----------------------------------+
      |
      v
[batch, num_patches, P*P*C]
      |
      v
+----------------------------------+
| Linear projection to embed_dim   |
+----------------------------------+
      |
      v
[batch, num_patches, embed_dim]

Usage

input = Axon.input("image", shape: {nil, 3, 224, 224})

patches = PatchEmbed.layer(input,
  image_size: 224,
  patch_size: 16,
  in_channels: 3,
  embed_dim: 768
)

References

  • "An Image is Worth 16x16 Words" (Dosovitskiy et al., 2021)

Summary

Functions

layer(input, opts \\ [])

Build a patch embedding Axon layer.

num_patches(image_size, patch_size)

Calculate the number of patches for given image and patch sizes.

Functions

layer(input, opts \\ [])

@spec layer(
  Axon.t(),
  keyword()
) :: Axon.t()

Build a patch embedding Axon layer.

Options

  • :image_size - Input image size (square, default: 224)
  • :patch_size - Patch size (square, default: 16)
  • :in_channels - Number of input channels (default: 3)
  • :embed_dim - Output embedding dimension (required)
  • :name - Layer name prefix (default: "patch_embed")

num_patches(image_size, patch_size)

@spec num_patches(pos_integer(), pos_integer()) :: pos_integer()

Calculate the number of patches for given image and patch sizes.
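For a square image and square patches this is (image_size / patch_size)^2, matching the 196 patches in the example above. A minimal sketch of that computation (a hypothetical reimplementation for illustration, assuming image_size is divisible by patch_size; the library's actual source may differ):

```elixir
defmodule PatchMath do
  # Sketch of the patch-count formula: (image_size / patch_size)^2.
  def num_patches(image_size, patch_size) do
    per_side = div(image_size, patch_size)
    per_side * per_side
  end
end

PatchMath.num_patches(224, 16)
# => 196
```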