Image captioning — generates a natural-language description of an image.
Pass a Vix.Vips.Image.t/0 to caption/2 and get back a string
like "a small dog sitting on a wooden floor" or "a man riding a horse with a bird of prey".
Quick start
# The captioner serving is heavyweight and not autostarted by
# default. Either set `autostart: true` in config (see below)
# or add the child spec to your own supervision tree:
#
# children = [Image.Captioning.captioner()]
iex> _puppy = Image.open!("./test/support/images/puppy.webp")
iex> # Image.Captioning.caption(puppy)
iex> # => "a brown and white puppy sitting on a white surface"
Default model
Salesforce/blip-image-captioning-base (BSD-3-Clause licensed, ~990 MB). The base BLIP variant fine-tuned
for image captioning. Solid baseline quality across general subject
matter.
Note that this is by far the heaviest of the library's default
models: the first call (or first app boot with autostart: true)
blocks on a ~990 MB download from Hugging Face.
Configuration
Configure in config/runtime.exs:
config :image_vision, :captioner,
  model: {:hf, "Salesforce/blip-image-captioning-base"},
  featurizer: {:hf, "Salesforce/blip-image-captioning-base"},
  tokenizer: {:hf, "Salesforce/blip-image-captioning-base"},
  generation_config: {:hf, "Salesforce/blip-image-captioning-base"},
  model_options: [],
  featurizer_options: [],
  tokenizer_options: [],
  generation_config_options: [],
  batch_size: 1,
  name: Image.Captioning.Server,
  autostart: false
To use the larger and higher-quality variant:
config :image_vision, :captioner,
  model: {:hf, "Salesforce/blip-image-captioning-large"},
  featurizer: {:hf, "Salesforce/blip-image-captioning-large"},
  tokenizer: {:hf, "Salesforce/blip-image-captioning-large"},
  generation_config: {:hf, "Salesforce/blip-image-captioning-large"}
Servings and supervision
BLIP is a multi-module model (vision encoder, text decoder, cross-attention) and a load takes several seconds. The captioning entry point runs against a named serving process so the model loads once and is reused.
The serving is not autostarted by default — most apps either don't need captioning at all or want explicit control over when the download happens. To run it in your own supervision tree:
# application.ex
def start(_type, _args) do
  children = [Image.Captioning.captioner()]
  Supervisor.start_link(children, strategy: :one_for_one)
end
Or set autostart: true to have ImageVision.Supervisor start it
when the :image_vision application starts.
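As a minimal sketch of the autostart route, the fragment below sets only the `autostart` key. It assumes that keys left unspecified fall back to the defaults listed under Configuration; that merge behaviour is inferred from the shorter "larger variant" example rather than stated outright.

```elixir
# config/runtime.exs
import Config

# Assumption: omitted keys (model, featurizer, tokenizer, ...) keep their defaults.
# With autostart enabled, the first application boot blocks on the model download.
config :image_vision, :captioner, autostart: true
```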
Optional dependency
This module is only available when Bumblebee, Nx, and an Nx compiler such as EXLA are configured in your application's mix.exs.
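A dependency list along these lines would satisfy that requirement. This is a sketch only: the version requirements are illustrative assumptions, not taken from this library's documentation, so check each package's current release before copying them.

```elixir
# mix.exs (fragment). Version requirements are illustrative assumptions.
defp deps do
  [
    {:bumblebee, "~> 0.6"},
    {:nx, "~> 0.9"},
    {:exla, "~> 0.9"}
  ]
end

# config/config.exs (fragment): optionally make EXLA the default Nx backend.
# config :nx, default_backend: EXLA.Backend
```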
Functions
@spec caption(image :: Vix.Vips.Image.t(), options :: Keyword.t()) :: String.t() | {:error, Image.error()}
Generates a natural-language caption for an image.
Arguments
image is any Vix.Vips.Image.t/0.
options is a keyword list of options.
Options
:backend is any valid Nx backend used for the image-to-tensor conversion. The default is Nx.default_backend/0.
:server is the name of the captioning serving process. The default is Image.Captioning.Server.
Returns
The caption as a String.t/0, or {:error, reason} if the input could not be processed.
Examples
iex> _puppy = Image.open!("./test/support/images/puppy.webp")
iex> # Image.Captioning.caption(puppy)
iex> # => "a small dog sitting on a wooden surface"
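Because caption/2 returns either a bare string or an `{:error, reason}` tuple, callers may find it convenient to normalise the result before pattern matching on it. The sketch below shows one way to do that; `CaptionExample`, `normalize/1`, and `describe/2` are hypothetical names introduced for illustration and are not part of the library.

```elixir
defmodule CaptionExample do
  @moduledoc false

  # Hypothetical helper: maps the two documented return shapes of
  # Image.Captioning.caption/2 onto conventional ok/error tuples.
  def normalize(caption) when is_binary(caption), do: {:ok, caption}
  def normalize({:error, reason}), do: {:error, reason}

  # Captions an image and normalises the result.
  # Requires the captioning serving to be running (see "Servings and supervision").
  def describe(image, options \\ []) do
    image
    |> Image.Captioning.caption(options)
    |> normalize()
  end
end
```

With this in place, a caller can write `with {:ok, caption} <- CaptionExample.describe(image)` and handle failures in the usual `else` clause.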
@spec captioner(configuration :: Keyword.t()) :: {Nx.Serving, Keyword.t()} | {:error, Image.error()}
Returns a child spec suitable for starting an image captioning process as part of a supervision tree.
Arguments
configurationis a keyword list merged over the default configuration.
Options
:model is any BLIP-family image captioning model supported by Bumblebee. The default is {:hf, "Salesforce/blip-image-captioning-base"}.
:featurizer is the BLIP featurizer. The default is {:hf, "Salesforce/blip-image-captioning-base"}.
:tokenizer is the BLIP tokenizer. The default is {:hf, "Salesforce/blip-image-captioning-base"}.
:generation_config is a Bumblebee generation config repo. The default is {:hf, "Salesforce/blip-image-captioning-base"}.
:model_options, :featurizer_options, :tokenizer_options, and :generation_config_options are keyword lists passed to the corresponding Bumblebee.load_* functions. Defaults are [].
:name is the name of the serving process. The default is Image.Captioning.Server.
:batch_size is the maximum batch size. The default is 1.
Returns
A child spec tuple suitable for Supervisor.start_link/2, or {:error, reason} if the model could not be loaded.
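Since captioner/1 can return `{:error, reason}` instead of a child spec, a supervision tree may want to guard against passing the error tuple to the supervisor. The following is a sketch under that assumption; `MyApp.CaptionerSetup`, `to_children/1`, and the serving name `MyApp.Captioner` are hypothetical and exist only for illustration.

```elixir
defmodule MyApp.CaptionerSetup do
  @moduledoc false

  # Hypothetical helper: turns the result of Image.Captioning.captioner/1
  # into a (possibly empty) list of children for a supervisor.
  def to_children({:error, _reason}), do: []
  def to_children(child_spec), do: [child_spec]

  # Starts a supervisor with a captioner registered under a custom name.
  # If the model could not be loaded, the supervisor starts with no children.
  def start_link do
    children =
      [name: MyApp.Captioner]
      |> Image.Captioning.captioner()
      |> to_children()

    Supervisor.start_link(children, strategy: :one_for_one)
  end
end
```

When the serving runs under a custom name, the same name would be passed to caption/2 through its `:server` option, for example `Image.Captioning.caption(image, server: MyApp.Captioner)`.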