# `Image.ZeroShot`
[🔗](https://github.com/elixir-image/image_vision/blob/v0.2.0/lib/zero_shot.ex#L2)

Zero-shot image classification — classify an image against
arbitrary labels you supply at call time, without retraining.

Where `Image.Classification` is limited to the 1,000 ImageNet
labels its model was trained on, `Image.ZeroShot` lets you supply
your own set of candidate labels and asks the model "which of
these best describes this image?". It is powered by CLIP, a
contrastive vision-language model.

Three entry points cover different needs:

* `classify/3` — return all candidate labels with scores, sorted
  descending. Best when you want to see the full distribution.

* `label/3` — return just the single highest-scoring label.

* `similarity/3` — compute cosine similarity between two images
  in CLIP's embedding space, useful for "find similar images".

## Quick start

    iex> _puppy = Image.open!("./test/support/images/puppy.webp")
    iex> # Image.ZeroShot.classify(puppy, ["a dog", "a cat", "a horse"])
    iex> # => [
    iex> #      %{label: "a dog", score: 0.97},
    iex> #      %{label: "a cat", score: 0.02},
    iex> #      %{label: "a horse", score: 0.01}
    iex> #    ]

## Default model

[`openai/clip-vit-base-patch32`](https://huggingface.co/openai/clip-vit-base-patch32)
— MIT licensed, ~600 MB. This is the original CLIP checkpoint,
with broad training coverage and well-validated zero-shot
behaviour. Override it via the `:repo` option to use a larger
CLIP variant, as shown below.
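
For example, to use the larger ViT-L/14 checkpoint (any CLIP
repository that Bumblebee can load should work; this one is
illustrative):

```elixir
Image.ZeroShot.classify(image, ["a dog", "a cat"],
  repo: "openai/clip-vit-large-patch14")
```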

## Prompt templates

CLIP's accuracy on short labels improves significantly when each
label is wrapped in a natural-language sentence. The default
template is `"a photo of {label}"`, applied to every label before
tokenisation. Override with the `:template` option:

    Image.ZeroShot.classify(image, ["dog", "cat"],
      template: "a close-up photograph of {label}")

Pass `template: nil` to disable the template entirely (useful if
your labels are already full sentences).
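
For example, with labels that are already complete sentences:

```elixir
# No template is applied; the labels are tokenised as-is.
Image.ZeroShot.classify(image,
  ["a studio portrait of a dog", "a blurry snapshot of a cat"],
  template: nil)
```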

## Optional dependency

This module is only available when [Bumblebee](https://hex.pm/packages/bumblebee),
[Nx](https://hex.pm/packages/nx), and an Nx compiler such as
[EXLA](https://hex.pm/packages/exla) are configured in your
application's `mix.exs`.
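
As a minimal sketch, the relevant `mix.exs` entries might look
like this (the version requirements are illustrative; pin to the
current Hex releases):

```elixir
defp deps do
  [
    {:bumblebee, "~> 0.6"},
    {:nx, "~> 0.9"},
    {:exla, "~> 0.9"}
  ]
end
```

This is typically paired with `config :nx, default_backend: EXLA.Backend`
in `config/config.exs` so inference runs on the EXLA backend.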

# `classify`

```elixir
@spec classify(Vix.Vips.Image.t(), [String.t(), ...], Keyword.t()) :: [
  %{label: String.t(), score: float()}
]
```

Classifies an image against a list of candidate labels.

Wraps each label in the prompt template (default
`"a photo of {label}"`), tokenises the result, and asks CLIP
which label best matches the image. Returns every label with its
score, sorted descending.

### Arguments

* `image` is any `t:Vix.Vips.Image.t/0`.

* `labels` is a non-empty list of label strings to classify
  against, e.g. `["a dog", "a cat", "a sports car"]`.

* `options` is a keyword list of options.

### Options

* `:repo` is the HuggingFace repository for the CLIP model. The
  default is `"openai/clip-vit-base-patch32"`.

* `:template` is a prompt template — a string with `{label}` as a
  placeholder. The default is `"a photo of {label}"`. Pass
  `nil` to use labels verbatim.

### Returns

* A list of `%{label: String.t(), score: float()}` maps sorted
  by descending `:score`. Scores sum to `1.0`.

### Examples

    iex> _puppy = Image.open!("./test/support/images/puppy.webp")
    iex> # Image.ZeroShot.classify(puppy, ["a dog", "a cat"])
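
Since scores sum to `1.0`, a common pattern is to accept the top
result only above a confidence cutoff. A minimal sketch (the
`0.5` threshold is arbitrary):

```elixir
puppy = Image.open!("./test/support/images/puppy.webp")

# Pattern-match the highest-scoring entry; reject low-confidence results.
case Image.ZeroShot.classify(puppy, ["a dog", "a cat", "a horse"]) do
  [%{label: label, score: score} | _rest] when score > 0.5 -> {:ok, label}
  _results -> {:error, :low_confidence}
end
```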

# `label`

```elixir
@spec label(Vix.Vips.Image.t(), [String.t(), ...], Keyword.t()) :: String.t()
```

Returns the single highest-scoring label for an image.

Convenience wrapper over `classify/3` for the common "just tell
me which one it is" case.
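
Conceptually, it behaves like the following sketch (illustrative
only; the actual implementation may differ):

```elixir
# Take the head of the descending-sorted classification results.
def label(image, labels, options \\ []) do
  [%{label: best} | _rest] = Image.ZeroShot.classify(image, labels, options)
  best
end
```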

### Arguments

* `image` is any `t:Vix.Vips.Image.t/0`.

* `labels` is a non-empty list of label strings.

* `options` is a keyword list of options. Same as `classify/3`.

### Returns

* The label string with the highest score.

### Examples

    iex> _puppy = Image.open!("./test/support/images/puppy.webp")
    iex> # Image.ZeroShot.label(puppy, ["a dog", "a cat"])
    iex> # => "a dog"

# `similarity`

```elixir
@spec similarity(Vix.Vips.Image.t(), Vix.Vips.Image.t(), Keyword.t()) :: float()
```

Computes cosine similarity between two images in CLIP's
embedding space.

Returns a value in `[-1.0, 1.0]`, where higher means more
visually similar. CLIP's image embeddings capture semantic
content, so two images of different dogs will typically score
higher than a dog paired with a car, even when the pixel-level
differences are large.
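
One way to use this for a "find similar images" search, as a
hedged sketch (`candidate_paths` is a hypothetical list of file
paths):

```elixir
query = Image.open!("./test/support/images/puppy.webp")

# Rank candidates by semantic similarity to the query image.
candidate_paths
|> Enum.map(fn path ->
  {path, Image.ZeroShot.similarity(query, Image.open!(path))}
end)
|> Enum.sort_by(fn {_path, score} -> score end, :desc)
```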

### Arguments

* `image1` and `image2` are any `t:Vix.Vips.Image.t/0`.

* `options` is a keyword list of options.

### Options

* `:repo` is the HuggingFace repository for the CLIP model. The
  default is `"openai/clip-vit-base-patch32"`.

### Returns

* A float in `[-1.0, 1.0]`.

### Examples

    iex> a = Image.open!("./test/support/images/puppy.webp")
    iex> b = Image.open!("./test/support/images/cat.png")
    iex> # Image.ZeroShot.similarity(a, b)

---

*Consult [api-reference.md](api-reference.md) for a complete listing*
