Image.Captioning (image_vision v0.2.0)


Image captioning — generates a natural-language description of an image.

Pass a Vix.Vips.Image.t/0 to caption/2 and get back a string like "a small dog sitting on a wooden floor" or "a man riding a horse with a bird of prey".

Quick start

# The captioner serving is heavyweight and not autostarted by
# default. Either set `autostart: true` in config (see below)
# or add the child spec to your own supervision tree:
#
#     children = [Image.Captioning.captioner()]

iex> _puppy = Image.open!("./test/support/images/puppy.webp")
iex> # Image.Captioning.caption(puppy)
iex> # => "a brown and white puppy sitting on a white surface"

Default model

Salesforce/blip-image-captioning-base — BSD-3-Clause licensed, ~990 MB. The base BLIP variant fine-tuned for image captioning. Solid baseline quality across general subject matter.

Note that this is by far the heaviest of the library's default models: the first call (or the first application boot with autostart: true) blocks until a ~990 MB download from Hugging Face completes.

Configuration

Configure in config/runtime.exs:

config :image_vision, :captioner,
  model: {:hf, "Salesforce/blip-image-captioning-base"},
  featurizer: {:hf, "Salesforce/blip-image-captioning-base"},
  tokenizer: {:hf, "Salesforce/blip-image-captioning-base"},
  generation_config: {:hf, "Salesforce/blip-image-captioning-base"},
  model_options: [],
  featurizer_options: [],
  tokenizer_options: [],
  generation_config_options: [],
  batch_size: 1,
  name: Image.Captioning.Server,
  autostart: false

To use the larger and higher-quality variant:

config :image_vision, :captioner,
  model: {:hf, "Salesforce/blip-image-captioning-large"},
  featurizer: {:hf, "Salesforce/blip-image-captioning-large"},
  tokenizer: {:hf, "Salesforce/blip-image-captioning-large"},
  generation_config: {:hf, "Salesforce/blip-image-captioning-large"}
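
The override above sets only four keys, which suggests that omitted keys keep their defaults. On that assumption, a minimal sketch of enabling autostart on its own in config/runtime.exs might look like this:

```elixir
# config/runtime.exs
import Config

# Assumption: keys not listed here fall back to the defaults shown earlier.
# With autostart: true the serving starts with the :image_vision application,
# so the first boot blocks on the model download.
config :image_vision, :captioner,
  autostart: true
```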

Servings and supervision

BLIP is a multi-module model (vision encoder, text decoder, cross-attention), and loading it takes several seconds. The captioning entry point therefore runs against a named serving process, so the model is loaded once and reused across calls.

The serving is not autostarted by default — most apps either don't need captioning at all or want explicit control over when the download happens. To run it in your own supervision tree:

# application.ex
def start(_type, _args) do
  children = [Image.Captioning.captioner()]
  Supervisor.start_link(children, strategy: :one_for_one)
end

Or set autostart: true to have ImageVision.Supervisor start it when the :image_vision application starts.
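
If the default process name does not suit your application, the serving can presumably be started under another name. The following is a sketch rather than a confirmed recipe: it assumes that options passed to captioner/1 are merged over the defaults (as that function's documentation states) and that the :server option of caption/2 selects the process registered under that name. MyApp.Application, MyApp.Captioner, MyApp.Supervisor, and MyApp.Describe are hypothetical names.

```elixir
defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      # Hypothetical name; the documented default is Image.Captioning.Server.
      Image.Captioning.captioner(name: MyApp.Captioner)
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end

defmodule MyApp.Describe do
  # Routes the request to the custom-named serving via the :server option.
  def caption(image) do
    Image.Captioning.caption(image, server: MyApp.Captioner)
  end
end
```

Keeping the name in one place (for example a module attribute) avoids the child spec and the callers drifting apart.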

Optional dependency

This module is only available when Bumblebee, Nx, and an Nx compiler such as EXLA are configured in your application's mix.exs.

Summary

Functions

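caption(image, options \\ [])

Generates a natural-language caption for an image.

captioner(configuration \\ Application.get_env(:image_vision, :captioner, []))

Returns a child spec suitable for starting an image captioning process as part of a supervision tree.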

Functions

caption(image, options \\ [])

@spec caption(image :: Vix.Vips.Image.t(), options :: Keyword.t()) ::
  String.t() | {:error, Image.error()}

Generates a natural-language caption for an image.

Arguments

  • image is any Vix.Vips.Image.t/0.

  • options is a keyword list of options.

Options

  • :backend is any valid Nx backend used for the image-to-tensor conversion. The default is Nx.default_backend/0.

  • :server is the name of the captioning serving process. The default is Image.Captioning.Server.

Returns

  • The caption as a String.t/0, or

  • {:error, reason} if the input could not be processed.
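
Because success is a bare string rather than an {:ok, caption} tuple, callers that rely on `with` or `case` pipelines may want to normalise the result. A minimal sketch, assuming only the return shape documented above; MyApp.Captions and its functions are hypothetical and not part of image_vision:

```elixir
defmodule MyApp.Captions do
  # Hypothetical wrapper module; not part of image_vision.

  @doc "Wraps the bare-string success value in an :ok tuple; passes errors through."
  def normalize(caption) when is_binary(caption), do: {:ok, caption}
  def normalize({:error, _reason} = error), do: error

  @doc "Captions an image and returns {:ok, caption} or {:error, reason}."
  def describe(image, options \\ []) do
    image
    |> Image.Captioning.caption(options)
    |> normalize()
  end
end
```

With this wrapper, `MyApp.Captions.describe(image)` can be used directly as a clause in a `with` chain.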

Examples

iex> _puppy = Image.open!("./test/support/images/puppy.webp")
iex> # Image.Captioning.caption(puppy)
iex> # => "a small dog sitting on a wooden surface"

captioner(configuration \\ Application.get_env(:image_vision, :captioner, []))

@spec captioner(configuration :: Keyword.t()) ::
  {Nx.Serving, Keyword.t()} | {:error, Image.error()}

Returns a child spec suitable for starting an image captioning process as part of a supervision tree.

Arguments

  • configuration is a keyword list merged over the default configuration.

Options

  • :model is any BLIP-family image captioning model supported by Bumblebee. The default is {:hf, "Salesforce/blip-image-captioning-base"}.

  • :featurizer is the BLIP featurizer. The default is {:hf, "Salesforce/blip-image-captioning-base"}.

  • :tokenizer is the BLIP tokenizer. The default is {:hf, "Salesforce/blip-image-captioning-base"}.

  • :generation_config is a Bumblebee generation config repo. The default is {:hf, "Salesforce/blip-image-captioning-base"}.

  • :model_options, :featurizer_options, :tokenizer_options, and :generation_config_options are keyword lists passed to the corresponding Bumblebee.load_* functions. Defaults are [].

  • :name is the name of the serving process. The default is Image.Captioning.Server.

  • :batch_size is the maximum batch size. The default is 1.

Returns