VLLM (VLLM v0.1.1)

VLLM - vLLM for Elixir via SnakeBridge.

Easy, fast, and cheap LLM serving for everyone. This library provides transparent access to Python vLLM through SnakeBridge's Universal FFI.

Quick Start

VLLM.run(fn ->
  # Create an LLM instance
  llm = VLLM.llm!("facebook/opt-125m")

  # Generate text
  outputs = VLLM.generate!(llm, ["Hello, my name is"])

  # Process results
  Enum.each(outputs, fn output ->
    prompt = VLLM.attr!(output, "prompt")
    generated = VLLM.attr!(output, "outputs") |> Enum.at(0)
    text = VLLM.attr!(generated, "text")
    IO.puts("Prompt: #{prompt}")
    IO.puts("Generated: #{text}")
  end)
end)

Chat Interface

VLLM.run(fn ->
  llm = VLLM.llm!("Qwen/Qwen2-0.5B-Instruct")

  messages = [[
    %{"role" => "system", "content" => "You are a helpful assistant."},
    %{"role" => "user", "content" => "What is the capital of France?"}
  ]]

  outputs = VLLM.chat!(llm, messages)
  # Process chat outputs...
end)

Sampling Parameters

Control generation with SamplingParams created via VLLM.sampling_params!/1:

VLLM.run(fn ->
  llm = VLLM.llm!("facebook/opt-125m")
  params = VLLM.sampling_params!(temperature: 0.8, top_p: 0.95, max_tokens: 100)

  outputs = VLLM.generate!(llm, ["Once upon a time"], sampling_params: params)
end)

Timeout Configuration

VLLM leverages SnakeBridge's timeout architecture for LLM workloads. By default, all vLLM calls use the :ml_inference profile (10 minute timeout).

Timeout Profiles

Profile          Timeout    Use Case
:default         2 min      Standard Python calls
:streaming       30 min     Streaming responses
:ml_inference    10 min     LLM inference (VLLM default)
:batch_job       1 hour     Long-running batch operations

Per-Call Timeout Override

VLLM.generate!(llm, prompts,
  sampling_params: params,
  __runtime__: [timeout_profile: :batch_job]
)

Architecture

VLLM uses SnakeBridge's Universal FFI to call vLLM directly:

Elixir (VLLM.call/4)
    |
SnakeBridge.call/4
    |
Snakepit gRPC
    |
Python vLLM
    |
GPU/TPU Inference

All Python lifecycle is managed automatically by Snakepit.
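
Because every helper ultimately goes through this same path, parts of the vLLM Python API that have no dedicated wrapper remain reachable with call/4, get/2, and method/4. A minimal sketch of the raw path, mirroring the documented call/4 example below (illustrative, not a guaranteed internal):

VLLM.run(fn ->
  # Construct vllm.LLM directly through the Universal FFI; this is the
  # same class that VLLM.llm!/2 wraps for convenience.
  llm = VLLM.call!("vllm", "LLM", ["facebook/opt-125m"])

  # The result is a Python object reference, like any other bridged value.
  true = VLLM.ref?(llm)
end)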

Summary

Functions

async_engine(model, opts \\ []) - Create an AsyncLLMEngine for asynchronous inference.
async_engine!(model, opts \\ []) - Bang version of async_engine/2.
attr(ref, attribute) - Get an attribute from a Python object reference.
attr!(ref, attribute) - Bang version of attr/2.
bytes(data) - Encode binary data as Python bytes.
call(module, function, args \\ [], opts \\ []) - Call any vLLM function or class.
call!(module, function, args \\ [], opts \\ []) - Bang version of call/4 - raises on error, returns the value directly.
chat(llm, messages, opts \\ []) - Generate chat completions from messages.
chat!(llm, messages, opts \\ []) - Bang version of chat/3 - raises on error.
embed(llm, texts, opts \\ []) - Generate embeddings for texts using a pooling model.
embed!(llm, texts, opts \\ []) - Bang version of embed/3.
encode(llm, text, opts \\ []) - Encode text to token IDs.
encode!(llm, text, opts \\ []) - Bang version of encode/3.
engine(model, opts \\ []) - Create an LLMEngine for fine-grained control over inference.
engine!(model, opts \\ []) - Bang version of engine/2.
generate(llm, prompts, opts \\ []) - Generate text completions from prompts.
generate!(llm, prompts, opts \\ []) - Bang version of generate/3 - raises on error.
get(module, attr) - Get a module attribute.
get!(module, attr) - Bang version of get/2.
guided_decoding_params(opts \\ []) - Create guided decoding parameters for structured outputs.
guided_decoding_params!(opts \\ []) - Bang version of guided_decoding_params/1.
guided_decoding_supported?() - Check whether guided decoding parameters are available in the installed vLLM.
llm(model, opts \\ []) - Create a vLLM LLM instance for offline inference.
llm!(model, opts \\ []) - Bang version of llm/2 - raises on error.
lora_request(name, lora_int_id, lora_path, opts \\ []) - Create a LoRARequest for serving LoRA adapters.
lora_request!(name, lora_int_id, lora_path, opts \\ []) - Bang version of lora_request/4.
method(ref, method, args \\ [], opts \\ []) - Call a method on a Python object reference.
method!(ref, method, args \\ [], opts \\ []) - Bang version of method/4.
pooling_params(opts \\ []) - Create PoolingParams for embedding models.
pooling_params!(opts \\ []) - Bang version of pooling_params/1.
ref?(value) - Check if a value is a Python object reference.
run(fun, opts \\ []) - Run VLLM code with automatic Python lifecycle management.
sampling_params(opts \\ []) - Create SamplingParams for controlling text generation.
sampling_params!(opts \\ []) - Bang version of sampling_params/1 - raises on error.
set_attr(ref, attribute, value) - Set an attribute on a Python object reference.
timeout_ms(milliseconds) - Create a timeout option for exact milliseconds.
timeout_profile(profile) - Create a timeout option for a predefined timeout profile.
version() - Get the installed vLLM version.
version!() - Bang version of version/0.
with_timeout(opts, timeout_opts) - Add timeout configuration to options.

Functions

async_engine(model, opts \\ [])

Create an AsyncLLMEngine for asynchronous inference.

Useful for building online serving applications with concurrent requests.

Examples

{:ok, engine} = VLLM.async_engine("facebook/opt-125m")

async_engine!(model, opts \\ [])

Bang version of async_engine/2.

attr(ref, attribute)

Get an attribute from a Python object reference.

attr!(ref, attribute)

Bang version of attr/2.

bytes(data)

Encode binary data as Python bytes.
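
A minimal sketch (reading a local image file is purely illustrative):

# Wrap an Elixir binary so it is sent to Python as `bytes`.
image = File.read!("cat.png")
py_bytes = VLLM.bytes(image)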

call(module, function, args \\ [], opts \\ [])

Call any vLLM function or class.

Examples

{:ok, result} = VLLM.call("vllm", "LLM", ["facebook/opt-125m"])
{:ok, config} = VLLM.call("vllm.config", "ModelConfig", [], model: "...")

call!(module, function, args \\ [], opts \\ [])

Bang version of call/4 - raises on error, returns the value directly.

chat(llm, messages, opts \\ [])

Generate chat completions from messages.

Arguments

  • llm - LLM instance from VLLM.llm!/1
  • messages - List of message conversations, where each conversation is a list of message maps
  • opts - Options including:
    • :sampling_params - SamplingParams instance
    • :use_tqdm - Show progress bar
    • :chat_template - Custom chat template (Jinja2 format)

Message Format

Each message is a map with:

  • "role" - One of "system", "user", "assistant"
  • "content" - Message content string

Examples

messages = [[
  %{"role" => "system", "content" => "You are helpful."},
  %{"role" => "user", "content" => "Hello!"}
]]

outputs = VLLM.chat!(llm, messages)

Returns

List of RequestOutput objects (same as generate/3).
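
For example, the assistant reply can be read off each RequestOutput with attr!/2, following the same pattern as the Quick Start (a sketch that takes only the first completion of each output):

outputs = VLLM.chat!(llm, messages)

Enum.each(outputs, fn output ->
  reply =
    output
    |> VLLM.attr!("outputs")
    |> Enum.at(0)
    |> VLLM.attr!("text")

  IO.puts("Assistant: #{reply}")
end)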

chat!(llm, messages, opts \\ [])

Bang version of chat/3 - raises on error.

embed(llm, texts, opts \\ [])

Generate embeddings for texts using a pooling model.

Arguments

  • llm - LLM instance configured with an embedding model
  • texts - String or list of strings to embed
  • opts - Options including:
    • :pooling_params - PoolingParams instance

Examples

llm = VLLM.llm!("intfloat/e5-mistral-7b-instruct", runner: "pooling")
outputs = VLLM.embed!(llm, ["Hello, world!", "How are you?"])

Returns

List of EmbeddingRequestOutput objects with:

  • outputs - List of embeddings
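
A sketch of reading the results back out; the exact inner shape (for example an "embedding" attribute on the pooled output, as in upstream vLLM) varies by vLLM version, so this just inspects it:

outputs = VLLM.embed!(llm, ["Hello, world!", "How are you?"])

Enum.each(outputs, fn output ->
  # Inspect the pooled result for this input text.
  output
  |> VLLM.attr!("outputs")
  |> IO.inspect(label: "embedding output")
end)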

embed!(llm, texts, opts \\ [])

Bang version of embed/3.

encode(llm, text, opts \\ [])

Encode text to token IDs.

Examples

{:ok, token_ids} = VLLM.encode(llm, "Hello, world!")

encode!(llm, text, opts \\ [])

Bang version of encode/3.

engine(model, opts \\ [])

Create an LLMEngine for fine-grained control over inference.

The LLMEngine provides lower-level access to vLLM's inference capabilities, useful for building custom serving solutions.

Options

Same as llm/2 plus:

  • :max_num_seqs - Maximum number of sequences per batch
  • :max_num_batched_tokens - Maximum tokens per batch

Examples

{:ok, engine} = VLLM.engine("facebook/opt-125m")
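
A sketch of a manual request/step loop driven through method!/4; add_request, step, and has_unfinished_requests are upstream vLLM LLMEngine methods and may differ across versions:

VLLM.run(fn ->
  engine = VLLM.engine!("facebook/opt-125m")
  params = VLLM.sampling_params!(max_tokens: 32)

  # Queue one request, then step the engine until nothing is pending.
  VLLM.method!(engine, "add_request", ["req-1", "Hello, my name is", params])

  Stream.repeatedly(fn -> VLLM.method!(engine, "step") end)
  |> Enum.take_while(fn _step_outputs ->
    VLLM.method!(engine, "has_unfinished_requests")
  end)
end)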

engine!(model, opts \\ [])

Bang version of engine/2.

generate(llm, prompts, opts \\ [])

Generate text completions from prompts.

Arguments

  • llm - LLM instance from VLLM.llm!/1
  • prompts - String or list of strings to complete
  • opts - Options including:
    • :sampling_params - SamplingParams instance
    • :use_tqdm - Show progress bar (default: true)
    • :lora_request - LoRA adapter request

Examples

outputs = VLLM.generate!(llm, "Hello, my name is")
outputs = VLLM.generate!(llm, ["Prompt 1", "Prompt 2"], sampling_params: params)

Returns

List of RequestOutput objects. Each has:

  • prompt - Original prompt
  • outputs - List of CompletionOutput objects
    • text - Generated text
    • token_ids - Generated token IDs
    • finish_reason - Reason for completion ("length", "stop", etc.)
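
A sketch of walking this structure with attr!/2, reading only the first completion of each output:

outputs = VLLM.generate!(llm, ["Prompt 1", "Prompt 2"])

Enum.each(outputs, fn output ->
  completion = output |> VLLM.attr!("outputs") |> Enum.at(0)

  IO.puts("Prompt:   #{VLLM.attr!(output, "prompt")}")
  IO.puts("Text:     #{VLLM.attr!(completion, "text")}")
  IO.puts("Finished: #{VLLM.attr!(completion, "finish_reason")}")
end)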

generate!(llm, prompts, opts \\ [])

Bang version of generate/3 - raises on error.

get(module, attr)

Get a module attribute.
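
For instance, a module-level attribute such as vllm.__version__ can be read directly (this mirrors what version/0 reports):

{:ok, version} = VLLM.get("vllm", "__version__")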

get!(module, attr)

Bang version of get/2.

guided_decoding_params(opts \\ [])

Create guided decoding parameters for structured outputs.

Options

  • :json - JSON schema string for JSON output
  • :json_object - Python dict/Pydantic model for JSON
  • :regex - Regex pattern for output
  • :choice - List of allowed string choices
  • :grammar - BNF grammar string

Examples

# JSON schema
{:ok, guided} = VLLM.guided_decoding_params(
  json: ~s({"type": "object", "properties": {"name": {"type": "string"}}})
)

# Regex pattern
{:ok, guided} = VLLM.guided_decoding_params(regex: "[0-9]{3}-[0-9]{4}")

# Choice
{:ok, guided} = VLLM.guided_decoding_params(choice: ["yes", "no", "maybe"])

Support

Guided decoding requires a vLLM build that exposes GuidedDecodingParams. Use guided_decoding_supported?/0 to check availability.
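
A sketch of wiring guided decoding into generation. Passing the result through a :guided_decoding option on sampling_params!/1 mirrors upstream vLLM's SamplingParams(guided_decoding=...) and is an assumption about this wrapper's keyword passthrough:

VLLM.run(fn ->
  llm = VLLM.llm!("Qwen/Qwen2-0.5B-Instruct")

  if VLLM.guided_decoding_supported?() do
    guided = VLLM.guided_decoding_params!(choice: ["yes", "no", "maybe"])

    # Assumption: options are forwarded to Python SamplingParams, which
    # accepts guided_decoding= in upstream vLLM.
    params = VLLM.sampling_params!(guided_decoding: guided, max_tokens: 5)

    VLLM.generate!(llm, ["Is Elixir fun? Answer:"], sampling_params: params)
  end
end)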

guided_decoding_params!(opts \\ [])

Bang version of guided_decoding_params/1.

guided_decoding_supported?()

Check whether guided decoding parameters are available in the installed vLLM.

llm(model, opts \\ [])

Create a vLLM LLM instance for offline inference.

Options

Common options passed as keyword arguments:

  • :dtype - Data type ("auto", "float16", "bfloat16", "float32")
  • :tensor_parallel_size - Number of GPUs for tensor parallelism
  • :gpu_memory_utilization - Fraction of GPU memory to use (0.0-1.0)
  • :max_model_len - Maximum sequence length
  • :quantization - Quantization method ("awq", "gptq", "squeezellm", etc.)
  • :trust_remote_code - Whether to trust remote code from HuggingFace

Examples

{:ok, llm} = VLLM.llm("facebook/opt-125m")
{:ok, llm} = VLLM.llm("Qwen/Qwen2-7B", tensor_parallel_size: 2)
{:ok, llm} = VLLM.llm("TheBloke/Llama-2-7B-AWQ", quantization: "awq")

llm!(model, opts \\ [])

Bang version of llm/2 - raises on error.

lora_request(name, lora_int_id, lora_path, opts \\ [])

Create a LoRARequest for serving LoRA adapters.

Arguments

  • name - Unique name for this LoRA adapter
  • lora_int_id - Integer ID for the adapter
  • lora_path - Path to the LoRA adapter weights

Examples

{:ok, lora} = VLLM.lora_request("my-adapter", 1, "/path/to/adapter")
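
A sketch of serving the adapter during generation. The :lora_request option on generate/3 is documented above; the enable_lora option follows upstream vLLM's LLM(enable_lora=True) and is assumed to be passed through as a keyword argument:

VLLM.run(fn ->
  # Assumption: keyword options are forwarded to Python's LLM constructor,
  # which requires enable_lora=True before LoRA requests are accepted.
  llm = VLLM.llm!("meta-llama/Llama-2-7b-hf", enable_lora: true)

  lora = VLLM.lora_request!("my-adapter", 1, "/path/to/adapter")
  params = VLLM.sampling_params!(max_tokens: 64)

  VLLM.generate!(llm, ["Write a haiku about Elixir."],
    sampling_params: params,
    lora_request: lora
  )
end)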

lora_request!(name, lora_int_id, lora_path, opts \\ [])

Bang version of lora_request/4.

method(ref, method, args \\ [], opts \\ [])

Call a method on a Python object reference.
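
For example, methods on the underlying Python object can be called by name; get_tokenizer/0 here is upstream vLLM's LLM API and may differ across versions:

# `llm` is an instance from VLLM.llm!/1; the returned tokenizer is
# another Python object reference.
{:ok, tokenizer} = VLLM.method(llm, "get_tokenizer")
true = VLLM.ref?(tokenizer)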

method!(ref, method, args \\ [], opts \\ [])

Bang version of method/4.

pooling_params(opts \\ [])

Create PoolingParams for embedding models.

Options

  • :additional_data - Additional metadata for the pooling request

Examples

{:ok, params} = VLLM.pooling_params()
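
The result is consumed by embed/3 via its :pooling_params option; a minimal sketch using the embedding model from the embed/3 example:

llm = VLLM.llm!("intfloat/e5-mistral-7b-instruct", runner: "pooling")
{:ok, params} = VLLM.pooling_params()
outputs = VLLM.embed!(llm, ["Hello, world!"], pooling_params: params)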

pooling_params!(opts \\ [])

Bang version of pooling_params/1.

ref?(value)

Check if a value is a Python object reference.
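
Useful for telling bridged Python objects apart from values that were converted to native Elixir terms. A sketch, assuming llm!/1 returns a reference as the attr!/method! usage elsewhere on this page suggests:

llm = VLLM.llm!("facebook/opt-125m")
VLLM.ref?(llm)        # => true
VLLM.ref?("a string") # => false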

run(fun, opts \\ [])

Run VLLM code with automatic Python lifecycle management.

Wraps your code in Snakepit.run_as_script/2 which:

  • Starts the Python process pool
  • Runs your code
  • Cleans up on exit

Pass halt: true in opts if you need to force the BEAM to exit (for example, when running inside wrapper scripts).

Example

VLLM.run(fn ->
  llm = VLLM.llm!("facebook/opt-125m")
  outputs = VLLM.generate!(llm, ["Hello, world"])
  # ... process outputs
end)

sampling_params(opts \\ [])

Create SamplingParams for controlling text generation.

Options

  • :temperature - Sampling temperature (default: 1.0)
  • :top_p - Nucleus sampling probability (default: 1.0)
  • :top_k - Top-k sampling (default: -1, disabled)
  • :max_tokens - Maximum tokens to generate (default: 16)
  • :min_tokens - Minimum tokens to generate (default: 0)
  • :presence_penalty - Presence penalty (default: 0.0)
  • :frequency_penalty - Frequency penalty (default: 0.0)
  • :repetition_penalty - Repetition penalty (default: 1.0)
  • :stop - List of stop strings
  • :stop_token_ids - List of stop token IDs
  • :n - Number of completions to generate (default: 1)
  • :best_of - Number of sequences to generate and select best from
  • :seed - Random seed for reproducibility

Examples

{:ok, params} = VLLM.sampling_params(temperature: 0.8, max_tokens: 100)
{:ok, params} = VLLM.sampling_params(top_p: 0.9, stop: ["\n", "END"])

sampling_params!(opts \\ [])

Bang version of sampling_params/1 - raises on error.

set_attr(ref, attribute, value)

Set an attribute on a Python object reference.

timeout_ms(milliseconds)

Create a timeout option for exact milliseconds.

Examples

VLLM.generate!(llm, prompts,
  Keyword.merge([sampling_params: params], VLLM.timeout_ms(300_000))
)

timeout_profile(profile)

Create a timeout option for a predefined timeout profile, for use with the __runtime__ option.

Examples

VLLM.generate!(llm, prompts,
  Keyword.merge([sampling_params: params], VLLM.timeout_profile(:batch_job))
)

version()

Get the installed vLLM version.

version!()

Bang version of version/0.

with_timeout(opts, timeout_opts)

Add timeout configuration to options.

Options

  • :timeout - Exact timeout in milliseconds
  • :timeout_profile - Use a predefined profile

Examples

opts = VLLM.with_timeout([], timeout: 60_000)
VLLM.generate!(llm, prompts, Keyword.merge(opts, sampling_params: params))