# `ImageTextToText`
[🔗](https://github.com/huggingface/huggingface_client/blob/v0.1.0/lib/huggingface_client/inference/tasks/tasks.ex#L306)

Multimodal vision-language models (VLMs).

Combines an image and a text prompt to generate a text response.
Used for GPT-4V-style tasks: image captioning with context, visual reasoning,
chart/document understanding, and multi-turn vision conversations.

Unlike `image_to_text`, which generates a caption from the image alone, this task conditions generation on a user-supplied text prompt.

# `run`

Runs image-text-to-text generation.

## Options
- `:image` — image URL, raw binary, or base64-encoded string (required)
- `:prompt` — text prompt to condition the generation (required)
- `:model` — override model (e.g. `"llava-hf/llava-1.5-7b-hf"`)
- `:max_new_tokens` — max tokens to generate
- `:temperature` — sampling temperature
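
A minimal usage sketch. Only the option names come from the list above; the module path, the `{:ok, result}`/`{:error, reason}` return convention, and the `"generated_text"` response key are assumptions for illustration.

```elixir
# Hypothetical usage — module path, call shape, and response shape
# are assumptions; only the option names are documented above.
alias HuggingfaceClient.Inference.Tasks.ImageTextToText

# :image also accepts raw binary or a base64-encoded string, e.g.
# File.read!("chart.png") |> Base.encode64()
case ImageTextToText.run(
       image: "https://example.com/chart.png",
       prompt: "Summarize the trend shown in this chart.",
       model: "llava-hf/llava-1.5-7b-hf",
       max_new_tokens: 256,
       temperature: 0.2
     ) do
  {:ok, %{"generated_text" => text}} -> IO.puts(text)
  {:error, reason} -> IO.inspect(reason, label: "inference failed")
end
```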

---

*Consult [api-reference.md](api-reference.md) for the complete listing.*
