Image.ZeroShot (image_vision v0.2.0)


Zero-shot image classification — classify an image against arbitrary labels you supply at call time, without retraining.

Where Image.Classification is constrained to the 1,000 ImageNet labels its model was trained on, Image.ZeroShot lets you supply your own set of candidate labels at call time and asks the model "which of these best describes this image?". It is powered by CLIP, a contrastive vision-language model.

Three entry points cover different needs:

  • classify/3 — return all candidate labels with scores, sorted descending. Best when you want to see the full distribution.

  • label/3 — return just the single highest-scoring label.

  • similarity/3 — compute cosine similarity between two images in CLIP's embedding space, useful for "find similar images".

Quick start

iex> _puppy = Image.open!("./test/support/images/puppy.webp")
iex> # Image.ZeroShot.classify(puppy, ["a dog", "a cat", "a horse"])
iex> # => [
iex> #      %{label: "a dog", score: 0.97},
iex> #      %{label: "a cat", score: 0.02},
iex> #      %{label: "a horse", score: 0.01}
iex> #    ]

Default model

openai/clip-vit-base-patch32 — MIT licensed, ~600 MB. This is the original CLIP checkpoint, with broad training coverage and well-validated zero-shot behaviour. Override it via the :repo option to use a larger CLIP variant.
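
For example, to trade download size and memory for accuracy with the larger ViT-L/14 checkpoint (a sketch; openai/clip-vit-large-patch14 is the publicly available larger CLIP repository, and image is assumed to be in scope):

Image.ZeroShot.classify(image, ["a dog", "a cat"],
  repo: "openai/clip-vit-large-patch14")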

Prompt templates

CLIP's accuracy on short labels improves significantly when each label is wrapped in a natural-language sentence. The default template is "a photo of {label}", applied to every label before tokenisation. Override with the :template option:

Image.ZeroShot.classify(image, ["dog", "cat"],
  template: "a close-up photograph of {label}")

Pass template: nil to disable the template entirely (useful if your labels are already full sentences).
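
For example, with labels that are already complete sentences (a sketch, assuming image is in scope):

Image.ZeroShot.classify(image,
  ["a photo of a dog sleeping", "a photo of a dog running"],
  template: nil)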

Optional dependency

This module is only available when Bumblebee, Nx, and an Nx compiler such as EXLA are configured in your application's mix.exs.
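
A minimal sketch of the relevant mix.exs entries — the version requirements shown here are illustrative assumptions, not pinned by this library:

def deps do
  [
    {:image_vision, "~> 0.2"},
    {:bumblebee, "~> 0.5"},
    {:nx, "~> 0.7"},
    {:exla, "~> 0.7"}
  ]
end

You will also typically point Nx at the compiler, e.g. config :nx, default_backend: EXLA.Backend in config/config.exs.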

Summary

Functions

classify(image, labels, options \\ [])

Classifies an image against a list of candidate labels.

label(image, labels, options \\ [])

Returns the single highest-scoring label for an image.

similarity(image1, image2, options \\ [])

Computes cosine similarity between two images in CLIP's embedding space.

Functions

classify(image, labels, options \\ [])

@spec classify(Vix.Vips.Image.t(), [String.t(), ...], Keyword.t()) :: [
  %{label: String.t(), score: float()}
]

Classifies an image against a list of candidate labels.

Wraps each label in the prompt template (default "a photo of {label}"), tokenises the result, and asks CLIP which label best matches the image. Returns every candidate label with its score, sorted in descending order.

Arguments

  • image is any Vix.Vips.Image.t/0.

  • labels is a non-empty list of label strings to classify against, e.g. ["a dog", "a cat", "a sports car"].

  • options is a keyword list of options.

Options

  • :repo is the HuggingFace repository for the CLIP model. The default is "openai/clip-vit-base-patch32".

  • :template is a prompt template — a string with {label} as a placeholder. The default is "a photo of {label}". Pass nil to use labels verbatim.

Returns

  • A list of %{label: String.t(), score: float()} maps sorted by descending :score. Scores sum to 1.0.

Examples

iex> _puppy = Image.open!("./test/support/images/puppy.webp")
iex> # Image.ZeroShot.classify(puppy, ["a dog", "a cat"])
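
Because results arrive sorted, pattern matching on the head gives the best guess. A sketch of discarding low-confidence results — the 0.5 cut-off is an arbitrary illustration, not part of the API, and image is assumed to be in scope:

case Image.ZeroShot.classify(image, ["a dog", "a cat", "a horse"]) do
  # Highest-scoring label comes first; accept it only when reasonably confident.
  [%{label: label, score: score} | _rest] when score > 0.5 -> label
  _otherwise -> :unknown
end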

label(image, labels, options \\ [])

@spec label(Vix.Vips.Image.t(), [String.t(), ...], Keyword.t()) :: String.t()

Returns the single highest-scoring label for an image.

Convenience wrapper over classify/3 for the common "just tell me which one it is" case.

Arguments

  • Same as classify/3: an image, a non-empty list of candidate label strings, and a keyword list of options.

Returns

  • The label string with the highest score.

Examples

iex> _puppy = Image.open!("./test/support/images/puppy.webp")
iex> # Image.ZeroShot.label(puppy, ["a dog", "a cat"])
iex> # => "a dog"

similarity(image1, image2, options \\ [])

@spec similarity(Vix.Vips.Image.t(), Vix.Vips.Image.t(), Keyword.t()) :: float()

Computes cosine similarity between two images in CLIP's embedding space.

Returns a value in [-1.0, 1.0] where higher means more visually similar. CLIP's image embeddings capture semantic content, so two images of different dogs typically score higher than a dog and a car, even if pixel-level differences are large.

Arguments

  • image1 and image2 are any Vix.Vips.Image.t/0.

  • options is a keyword list of options.

Options

  • :repo is the HuggingFace repository for the CLIP model. The default is "openai/clip-vit-base-patch32".

Returns

  • A float in [-1.0, 1.0].

Examples

iex> a = Image.open!("./test/support/images/puppy.webp")
iex> b = Image.open!("./test/support/images/cat.png")
iex> # Image.ZeroShot.similarity(a, b)
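
"Find similar images" then reduces to a max over pairwise similarities; a sketch, assuming query is an opened image and candidates is a list of opened images:

best_match = Enum.max_by(candidates, &Image.ZeroShot.similarity(query, &1))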