Image.ZeroShot lets you classify an image against arbitrary labels you provide at call time, without training or fine-tuning anything. Where Image.Classification is constrained to the 1,000 ImageNet labels its model was trained on, zero-shot classification lets you ask "here are five categories I care about right now — which fits best?".
This is enormously useful when your label space:
- Doesn't exist in standard datasets (custom product categories, brand-specific taxonomies, ad-hoc tagging)
- Changes over time (new categories appear without retraining)
- Is unknown until query time (interactive search, user-driven filtering)
Zero-shot classification is powered by CLIP, a contrastive vision-language model that learned a shared embedding space for images and text from 400 million image-caption pairs.
Classifying
iex> photo = Image.open!("portrait.jpg")
iex> Image.ZeroShot.classify(photo, [
...> "a person on a horse",
...> "a person walking a dog",
...> "a parked car",
...> "an empty street"
...> ])
[
%{label: "a person on a horse", score: 0.94},
%{label: "a person walking a dog", score: 0.04},
%{label: "an empty street", score: 0.01},
%{label: "a parked car", score: 0.01}
]
Scores sum to 1.0 (softmax over the candidate set). Results are sorted by descending score.
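Because results are plain maps, you can post-process them with ordinary Enum calls. For example, keeping only confident matches (the 0.5 threshold here is purely illustrative):
iex> results = Image.ZeroShot.classify(photo, ["a person on a horse", "a parked car"])
iex> Enum.filter(results, fn %{score: score} -> score > 0.5 end)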
Just the best label
When you only want the winner:
iex> Image.ZeroShot.label(photo, ["dog", "cat", "horse"])
"horse"
Image-to-image similarity
CLIP's image embeddings live in the same space as its text embeddings, so two images can also be compared directly:
iex> a = Image.open!("dog1.jpg")
iex> b = Image.open!("dog2.jpg")
iex> c = Image.open!("car.jpg")
iex>
iex> Image.ZeroShot.similarity(a, b)
0.82
iex> Image.ZeroShot.similarity(a, c)
0.41
Useful for "find similar images" without standing up a vector database. For larger collections, compute embeddings once and cache them.
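As a minimal sketch of that idea, a naive nearest-match search over a small folder can be built from similarity/2 alone (the file names and take count are illustrative; every query reopens and recompares every file, so this only suits small collections):
iex> query = Image.open!("query.jpg")
iex> scores =
...>   Enum.map(Path.wildcard("photos/*.jpg"), fn path ->
...>     {path, Image.ZeroShot.similarity(query, Image.open!(path))}
...>   end)
iex> scores |> Enum.sort_by(fn {_path, score} -> score end, :desc) |> Enum.take(5)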
Prompt templates matter
CLIP was trained on natural-language captions, so it understands sentences much better than bare nouns. Wrapping each label in a simple template reliably improves accuracy. The default is "a photo of {label}" — every label is wrapped before tokenisation.
You can override:
# Photography-domain template
iex> Image.ZeroShot.classify(image, ["sunset", "rain", "snow"],
...> template: "a high-quality photograph of {label}")
# Document-domain template
iex> Image.ZeroShot.classify(scan, ["invoice", "receipt", "letter"],
...> template: "a scanned {label}")
# Disable templating if your labels are already full sentences
iex> Image.ZeroShot.classify(image,
...> ["a black cat sitting on a chair", "an empty room with a chair"],
...> template: nil)
A general rule: if your labels are nouns, keep the default template. If they're already descriptive sentences, use template: nil. If they're domain-specific (medical imagery, document scans, product photography), a domain-tailored template can help.
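As an illustrative sketch of that rule (the MyApp.Labeling module and its word-count heuristic are not part of the library), a thin wrapper could choose the template automatically:
defmodule MyApp.Labeling do
  # Skip templating when every label already reads like a sentence;
  # otherwise keep the library's default "a photo of {label}".
  def classify(image, labels) do
    sentence_like? = fn label -> length(String.split(label)) > 3 end
    template = if Enum.all?(labels, sentence_like?), do: nil, else: "a photo of {label}"

    Image.ZeroShot.classify(image, labels, template: template)
  end
end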
Choosing a model
The default is openai/clip-vit-base-patch32 — MIT licensed, ~600 MB, broad training coverage, well-validated. For higher quality at a larger size:
iex> Image.ZeroShot.classify(image, labels, repo: "openai/clip-vit-large-patch14")
(About 1.7 GB, roughly 3× the default.)
Pre-downloading
mix image_vision.download_models --zero-shot
Default model
OpenAI CLIP ViT-B/32 — MIT licensed, ~600 MB. The original CLIP, the most well-validated and broadly applicable variant. Contains both a vision encoder (ViT-B/32) and a text encoder (transformer) that produce vectors in a shared 512-dim space.
Dependencies
Zero-shot classification requires :bumblebee, :nx, and an Nx backend such as :exla. Add to mix.exs:
{:bumblebee, "~> 0.6"},
{:nx, "~> 0.10"},
{:exla, "~> 0.10"}
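You will usually also want EXLA set as the default Nx backend so inference does not fall back to the pure-Elixir tensor implementation. A typical config/config.exs entry (adjust if you use a different backend):
import Config

config :nx, default_backend: EXLA.Backend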