Patch Embedding for Vision Transformers.
Splits images into fixed-size patches and linearly projects each patch into an embedding vector. This is the standard input processing for ViT, DeiT, MAE, and other vision transformer architectures.
How It Works
- Split image into non-overlapping patches of size P x P
- Flatten each patch into a vector of size P * P * C, where C is the number of channels
- Linearly project each flattened patch to the embedding dimension

For a 224x224 RGB image with 16x16 patches: (224 / 16)^2 = 196 patches, each 768-dim (16 * 16 * 3 = 768).
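The arithmetic, in an iex session:

iex> div(224, 16) * div(224, 16)
196
iex> 16 * 16 * 3
768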
Architecture
Image [batch, channels, height, width]
|
v
+----------------------------------+
| Split into P x P patches |
| (H/P * W/P = num_patches total) |
+----------------------------------+
|
v
[batch, num_patches, P*P*C]
|
v
+----------------------------------+
| Linear projection to embed_dim |
+----------------------------------+
|
v
[batch, num_patches, embed_dim]
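The diagram is just a reshape, a transpose, and a learned linear projection. A minimal sketch of that pipeline with Nx and Axon (the PatchSketch module name and the channels-first layout are illustrative assumptions, not this module's actual internals):

defmodule PatchSketch do
  # [batch, channels, height, width] -> [batch, num_patches, P*P*C]
  def patchify(image, patch_size) do
    {b, c, h, w} = Nx.shape(image)
    {gh, gw} = {div(h, patch_size), div(w, patch_size)}

    image
    |> Nx.reshape({b, c, gh, patch_size, gw, patch_size})
    |> Nx.transpose(axes: [0, 2, 4, 3, 5, 1])  # -> [batch, gh, gw, ph, pw, c]
    |> Nx.reshape({b, gh * gw, patch_size * patch_size * c})
  end
end

# Axon.nx/2 wraps the patch split; Axon.dense/3 is the learned projection.
input = Axon.input("image", shape: {nil, 3, 224, 224})

embedded =
  input
  |> Axon.nx(&PatchSketch.patchify(&1, 16))
  |> Axon.dense(768, name: "proj")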
Usage

patches = PatchEmbed.layer(image,
  image_size: 224,
  patch_size: 16,
  in_channels: 3,
  embed_dim: 768
)

References
- "An Image is Worth 16x16 Words" (Dosovitskiy et al., 2021)
Functions
Build a patch embedding Axon layer.
Options
- :image_size - Input image size (square, default: 224)
- :patch_size - Patch size (square, default: 16)
- :in_channels - Number of input channels (default: 3)
- :embed_dim - Output embedding dimension (required)
- :name - Layer name prefix (default: "patch_embed")
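A quick end-to-end shape check (a sketch assuming Axon ~> 0.6, where init_fn takes a template plus a plain params map, and that the layer is attached to an Axon input node as in the usage example):

input = Axon.input("image", shape: {nil, 3, 224, 224})
model = PatchEmbed.layer(input, embed_dim: 768)

{init_fn, predict_fn} = Axon.build(model)
# Initialize parameters from a shape/type template, with no starting params
params = init_fn.(Nx.template({1, 3, 224, 224}, :f32), %{})

image = Nx.iota({1, 3, 224, 224}, type: :f32)
predict_fn.(params, image) |> Nx.shape()
#=> {1, 196, 768}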
@spec num_patches(pos_integer(), pos_integer()) :: pos_integer()
Calculate the number of patches for given image and patch sizes.
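For the defaults above (assuming the argument order is image size, then patch size):

iex> PatchEmbed.num_patches(224, 16)
196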