Image.ZeroShot lets you classify an image against arbitrary labels you provide at call time, without training or fine-tuning anything. Where Image.Classification is constrained to the 1,000 ImageNet labels its model was trained on, zero-shot classification lets you ask "here are five categories I care about right now — which fits best?".
This is enormously useful when your label space:
- Doesn't exist in standard datasets (custom product categories, brand-specific taxonomies, ad-hoc tagging)
- Changes over time (new categories appear without retraining)
- Is unknown until query time (interactive search, user-driven filtering)
Zero-shot classification is powered by CLIP, a contrastive vision-language model that learned a shared embedding space for images and text from 400 million image-caption pairs.
Classifying
iex> photo = Image.open!("portrait.jpg")
iex> Image.ZeroShot.classify(photo, [
...> "a person on a horse",
...> "a person walking a dog",
...> "a parked car",
...> "an empty street"
...> ])
[
%{label: "a person on a horse", score: 0.94},
%{label: "a person walking a dog", score: 0.04},
%{label: "an empty street", score: 0.01},
%{label: "a parked car", score: 0.01}
]
Scores sum to 1.0 (softmax over the candidate set). Results are sorted by descending score.
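Because results are plain maps, you can post-process them with ordinary Enum calls. For example, keeping only confident matches (the 0.5 threshold here is purely illustrative):
iex> results = Image.ZeroShot.classify(photo, ["a person on a horse", "a parked car"])
iex> Enum.filter(results, fn %{score: score} -> score > 0.5 end)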
Just the best label
When you only want the winner:
iex> Image.ZeroShot.label(photo, ["dog", "cat", "horse"])
"horse"
Image-to-image similarity
CLIP's image embeddings live in the same space as its text embeddings, so two images can also be compared directly:
iex> a = Image.open!("dog1.jpg")
iex> b = Image.open!("dog2.jpg")
iex> c = Image.open!("car.jpg")
iex>
iex> Image.ZeroShot.similarity(a, b)
0.82
iex> Image.ZeroShot.similarity(a, c)
0.41
Useful for "find similar images" without standing up a vector database. For larger collections, compute embeddings once and cache them.
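As a minimal sketch of that idea, a naive nearest-match search over a small folder can be built from similarity/2 alone (the file names and take count are illustrative; every query reopens and recompares every file, so this only suits small collections):
iex> query = Image.open!("query.jpg")
iex> scores =
...>   Enum.map(Path.wildcard("photos/*.jpg"), fn path ->
...>     {path, Image.ZeroShot.similarity(query, Image.open!(path))}
...>   end)
iex> scores |> Enum.sort_by(fn {_path, score} -> score end, :desc) |> Enum.take(5)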
Prompt templates matter
CLIP was trained on natural-language captions, so it understands sentences much better than bare nouns. Wrapping each label in a simple template reliably improves accuracy. The default is "a photo of {label}" — every label is wrapped before tokenisation.
You can override:
# Photography-domain template
iex> Image.ZeroShot.classify(image, ["sunset", "rain", "snow"],
...> template: "a high-quality photograph of {label}")
# Document-domain template
iex> Image.ZeroShot.classify(scan, ["invoice", "receipt", "letter"],
...> template: "a scanned {label}")
# Disable templating if your labels are already full sentences
iex> Image.ZeroShot.classify(image,
...> ["a black cat sitting on a chair", "an empty room with a chair"],
...> template: nil)
A general rule: if your labels are nouns, keep the default template. If they're already descriptive sentences, use template: nil. If they're domain-specific (medical imagery, document scans, product photography), a domain-tailored template can help.
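As an illustrative sketch of that rule (the MyApp.Labeling module and its word-count heuristic are not part of the library), a thin wrapper could choose the template automatically:
defmodule MyApp.Labeling do
  # Skip templating when every label already reads like a sentence;
  # otherwise keep the library's default "a photo of {label}".
  def classify(image, labels) do
    sentence_like? = fn label -> length(String.split(label)) > 3 end
    template = if Enum.all?(labels, sentence_like?), do: nil, else: "a photo of {label}"

    Image.ZeroShot.classify(image, labels, template: template)
  end
end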
Choosing a model
The default is openai/clip-vit-base-patch32 — MIT licensed, ~600 MB, broad training coverage, well-validated. For higher quality at a larger size:
iex> Image.ZeroShot.classify(image, labels, repo: "openai/clip-vit-large-patch14")
(About 1.7 GB, roughly 3× the default.)
Pre-downloading
mix image_vision.download_models --zero-shot
Default model
OpenAI CLIP ViT-B/32 — MIT licensed, ~600 MB. The original CLIP, the most well-validated and broadly applicable variant. Contains both a vision encoder (ViT-B/32) and a text encoder (transformer) that produce vectors in a shared 512-dim space.
Dependencies
Zero-shot classification requires :bumblebee, :nx, and an Nx backend such as :exla. Add to mix.exs:
{:bumblebee, "~> 0.6"},
{:nx, "~> 0.10"},
{:exla, "~> 0.10"}
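You will usually also want EXLA set as the default Nx backend so inference does not fall back to the pure-Elixir tensor implementation. A typical config/config.exs entry (adjust if you use a different backend):
import Config

config :nx, default_backend: EXLA.Backend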