Zero-shot image classification — classify an image against arbitrary labels you supply at call time, without retraining.
Where `Image.Classification` is constrained to the 1,000 ImageNet
labels its model was trained on, `Image.ZeroShot` lets you provide
your own set of candidate labels and asks the model "which of these
best describes this image?". It is powered by CLIP, a contrastive
vision-language model that embeds images and text in a shared space.
Three entry points cover different needs:
- `classify/3` — return all candidate labels with scores, sorted
  descending. Best when you want to see the full distribution.
- `label/3` — return just the single highest-scoring label.
- `similarity/3` — compute cosine similarity between two images in
  CLIP's embedding space, useful for "find similar images".
Quick start
iex> _puppy = Image.open!("./test/support/images/puppy.webp")
iex> # Image.ZeroShot.classify(puppy, ["a dog", "a cat", "a horse"])
iex> # => [
iex> # %{label: "a dog", score: 0.97},
iex> # %{label: "a cat", score: 0.02},
iex> # %{label: "a horse", score: 0.01}
iex> # ]

Default model
`openai/clip-vit-base-patch32` — MIT licensed, ~600 MB. This is the
original CLIP, with broad training coverage and well-validated
zero-shot behaviour. Override it via the `:repo` option to use a
larger CLIP variant.
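For example, switching to the larger patch-14 checkpoint. This is a
sketch: `openai/clip-vit-large-patch14` is one published variant, and
whether a given checkpoint loads depends on your Bumblebee version.

    Image.ZeroShot.classify(image, ["a dog", "a cat"],
      repo: "openai/clip-vit-large-patch14")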
Prompt templates
CLIP's accuracy on short labels improves significantly when each
label is wrapped in a natural-language sentence. The default
template is "a photo of {label}", applied to every label before
tokenisation. Override with the :template option:
    Image.ZeroShot.classify(image, ["dog", "cat"],
      template: "a close-up photograph of {label}")

Pass `template: nil` to disable the template entirely (useful if your
labels are already full sentences).
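For instance, when the labels are already complete sentences:

    Image.ZeroShot.classify(image,
      ["a studio portrait of a dog", "a blurry snapshot of a cat"],
      template: nil)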
Optional dependency
This module is only available when `Bumblebee`, `Nx`, and an Nx
compiler such as `EXLA` are added as dependencies in your
application's `mix.exs`.
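As a sketch, the dependency set might look like this (the version
requirements are illustrative, not prescriptive):

    # mix.exs
    defp deps do
      [
        {:image, "~> 0.40"},
        {:bumblebee, "~> 0.5"},
        {:nx, "~> 0.7"},
        {:exla, "~> 0.7"}
      ]
    end

    # config/config.exs: make EXLA the default Nx backend
    config :nx, default_backend: EXLA.Backend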
Summary
Functions
Classifies an image against a list of candidate labels.
Returns the single highest-scoring label for an image.
Computes cosine similarity between two images in CLIP's embedding space.
Functions
@spec classify(Vix.Vips.Image.t(), [String.t(), ...], Keyword.t()) :: [%{label: String.t(), score: float()}]
Classifies an image against a list of candidate labels.
Wraps each label in the prompt template (default
`"a photo of {label}"`), tokenises the result, and asks CLIP which
label best matches the image. Returns all labels with their scores,
sorted descending.
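Conceptually, the template expansion is a plain string substitution
applied before tokenisation (a sketch of the idea, not the module's
internal code):

    template = "a photo of {label}"

    Enum.map(["dog", "cat"], &String.replace(template, "{label}", &1))
    #=> ["a photo of dog", "a photo of cat"]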
Arguments
- `image` is any `Vix.Vips.Image.t/0`.
- `labels` is a non-empty list of label strings to classify against,
  e.g. `["a dog", "a cat", "a sports car"]`.
- `options` is a keyword list of options.
Options
- `:repo` is the HuggingFace repository for the CLIP model. The
  default is `"openai/clip-vit-base-patch32"`.
- `:template` is a prompt template — a string with `{label}` as a
  placeholder. The default is `"a photo of {label}"`. Pass `nil` to
  use labels verbatim.
Returns
- A list of `%{label: String.t(), score: float()}` maps sorted by
  descending `:score`. Scores sum to `1.0`.
Examples
iex> _puppy = Image.open!("./test/support/images/puppy.webp")
iex> # Image.ZeroShot.classify(puppy, ["a dog", "a cat"])
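Because the return value is a plain list of maps, post-processing is
ordinary `Enum` work. A sketch of keeping only confident labels (the
`0.1` threshold and the output shown are illustrative):

    Image.ZeroShot.classify(puppy, ["a dog", "a cat", "a horse"])
    |> Enum.filter(fn %{score: score} -> score > 0.1 end)
    |> Enum.map(& &1.label)
    #=> ["a dog"]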
@spec label(Vix.Vips.Image.t(), [String.t(), ...], Keyword.t()) :: String.t()
Returns the single highest-scoring label for an image.
Convenience wrapper over `classify/3` for the common "just tell me
which one it is" case.
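In effect (a sketch, not necessarily the literal implementation),
`label/3` behaves like the following; because `classify/3` sorts
descending, the head of the list is the winner:

    def label(image, labels, options \\ []) do
      image
      |> Image.ZeroShot.classify(labels, options)
      |> hd()
      |> Map.fetch!(:label)
    end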
Arguments
- `image` is any `Vix.Vips.Image.t/0`.
- `labels` is a non-empty list of label strings.
- `options` is a keyword list of options. Same as `classify/3`.
Returns
- The label string with the highest score.
Examples
iex> _puppy = Image.open!("./test/support/images/puppy.webp")
iex> # Image.ZeroShot.label(puppy, ["a dog", "a cat"])
iex> # => "a dog"
@spec similarity(Vix.Vips.Image.t(), Vix.Vips.Image.t(), Keyword.t()) :: float()
Computes cosine similarity between two images in CLIP's embedding space.
Returns a value in `[-1.0, 1.0]`, where higher means more similar.
CLIP's image embeddings capture semantic content, so two images of
different dogs typically score higher than a dog and a car, even
when pixel-level differences are large.
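For reference, cosine similarity between two embedding vectors is
their dot product divided by the product of their norms. In Nx terms
(a sketch of the definition, not this module's internals):

    cosine_similarity = fn a, b ->
      Nx.dot(a, b)
      |> Nx.divide(Nx.multiply(Nx.LinAlg.norm(a), Nx.LinAlg.norm(b)))
      |> Nx.to_number()
    end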
Arguments
- `image1` and `image2` are any `Vix.Vips.Image.t/0`.
- `options` is a keyword list of options.
Options
- `:repo` is the HuggingFace repository for the CLIP model. The
  default is `"openai/clip-vit-base-patch32"`.
Returns
- A float in `[-1.0, 1.0]`.
Examples
iex> a = Image.open!("./test/support/images/puppy.webp")
iex> b = Image.open!("./test/support/images/cat.png")
iex> # Image.ZeroShot.similarity(a, b)
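Building on `similarity/3`, a sketch of the "find similar images"
use case: ranking candidate images against a query (the file paths
here are hypothetical):

    query = Image.open!("query.jpg")

    {best_path, _image} =
      ["a.jpg", "b.jpg", "c.jpg"]
      |> Enum.map(&{&1, Image.open!(&1)})
      |> Enum.max_by(fn {_path, candidate} ->
        Image.ZeroShot.similarity(query, candidate)
      end)

Each call embeds both images from scratch, so for large collections a
lower-level embedding cache would be more efficient than repeated
pairwise calls in a loop.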