VLLM (VLLM v0.1.1)

VLLM - vLLM for Elixir via SnakeBridge.

Easy, fast, and cheap LLM serving for everyone. This library provides transparent access to Python vLLM through SnakeBridge's Universal FFI.

Quick Start

VLLM.run(fn ->
  # Create an LLM instance
  llm = VLLM.llm!("facebook/opt-125m")

  # Generate text
  outputs = VLLM.generate!(llm, ["Hello, my name is"])

  # Process results
  Enum.each(outputs, fn output ->
    prompt = VLLM.attr!(output, "prompt")
    generated = VLLM.attr!(output, "outputs") |> Enum.at(0)
    text = VLLM.attr!(generated, "text")
    IO.puts("Prompt: #{prompt}")
    IO.puts("Generated: #{text}")
  end)
end)

Chat Interface

VLLM.run(fn ->
  llm = VLLM.llm!("Qwen/Qwen2-0.5B-Instruct")

  messages = [[
    %{"role" => "system", "content" => "You are a helpful assistant."},
    %{"role" => "user", "content" => "What is the capital of France?"}
  ]]

  outputs = VLLM.chat!(llm, messages)
  # Process chat outputs...
end)

Sampling Parameters

Control generation with SamplingParams created via VLLM.sampling_params!/1:

VLLM.run(fn ->
  llm = VLLM.llm!("facebook/opt-125m")
  params = VLLM.sampling_params!(temperature: 0.8, top_p: 0.95, max_tokens: 100)

  outputs = VLLM.generate!(llm, ["Once upon a time"], sampling_params: params)
end)

Timeout Configuration

VLLM leverages SnakeBridge's timeout architecture for LLM workloads. By default, all vLLM calls use the :ml_inference profile (10 minute timeout).

Timeout Profiles

Profile          Timeout    Use Case
:default         2 min      Standard Python calls
:streaming       30 min     Streaming responses
:ml_inference    10 min     LLM inference (VLLM default)
:batch_job       1 hour     Long-running batch operations

Per-Call Timeout Override

VLLM.generate!(llm, prompts,
  sampling_params: params,
  __runtime__: [timeout_profile: :batch_job]
)

Architecture

VLLM uses SnakeBridge's Universal FFI to call vLLM directly:

Elixir (VLLM.call/4)
    |
SnakeBridge.call/4
    |
Snakepit gRPC
    |
Python vLLM
    |
GPU/TPU Inference

All Python lifecycle is managed automatically by Snakepit.
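
Because every helper ultimately goes through this same path, parts of the vLLM Python API that have no dedicated wrapper remain reachable with call/4, get/2, and method/4. A minimal sketch of the raw path, mirroring the documented call/4 example below (illustrative, not a guaranteed internal):

VLLM.run(fn ->
  # Construct vllm.LLM directly through the Universal FFI; this is the
  # same class that VLLM.llm!/2 wraps for convenience.
  llm = VLLM.call!("vllm", "LLM", ["facebook/opt-125m"])

  # The result is a Python object reference, like any other bridged value.
  true = VLLM.ref?(llm)
end)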

Summary

Functions

async_engine(model, opts \\ []) - Create an AsyncLLMEngine for asynchronous inference.
async_engine!(model, opts \\ []) - Bang version of async_engine/2.
attr(ref, attribute) - Get an attribute from a Python object reference.
attr!(ref, attribute) - Bang version of attr/2.
bytes(data) - Encode binary data as Python bytes.
call(module, function, args \\ [], opts \\ []) - Call any vLLM function or class.
call!(module, function, args \\ [], opts \\ []) - Bang version of call/4 - raises on error, returns the value directly.
chat(llm, messages, opts \\ []) - Generate chat completions from messages.
chat!(llm, messages, opts \\ []) - Bang version of chat/3 - raises on error.
embed(llm, texts, opts \\ []) - Generate embeddings for texts using a pooling model.
embed!(llm, texts, opts \\ []) - Bang version of embed/3.
encode(llm, text, opts \\ []) - Encode text to token IDs.
encode!(llm, text, opts \\ []) - Bang version of encode/3.
engine(model, opts \\ []) - Create an LLMEngine for fine-grained control over inference.
engine!(model, opts \\ []) - Bang version of engine/2.
generate(llm, prompts, opts \\ []) - Generate text completions from prompts.
generate!(llm, prompts, opts \\ []) - Bang version of generate/3 - raises on error.
get(module, attr) - Get a module attribute.
get!(module, attr) - Bang version of get/2.
guided_decoding_params(opts \\ []) - Create guided decoding parameters for structured outputs.
guided_decoding_params!(opts \\ []) - Bang version of guided_decoding_params/1.
guided_decoding_supported?() - Check whether guided decoding parameters are available in the installed vLLM.
llm(model, opts \\ []) - Create a vLLM LLM instance for offline inference.
llm!(model, opts \\ []) - Bang version of llm/2 - raises on error.
lora_request(name, lora_int_id, lora_path, opts \\ []) - Create a LoRARequest for serving LoRA adapters.
lora_request!(name, lora_int_id, lora_path, opts \\ []) - Bang version of lora_request/4.
method(ref, method, args \\ [], opts \\ []) - Call a method on a Python object reference.
method!(ref, method, args \\ [], opts \\ []) - Bang version of method/4.
pooling_params(opts \\ []) - Create PoolingParams for embedding models.
pooling_params!(opts \\ []) - Bang version of pooling_params/1.
ref?(value) - Check if a value is a Python object reference.
run(fun, opts \\ []) - Run VLLM code with automatic Python lifecycle management.
sampling_params(opts \\ []) - Create SamplingParams for controlling text generation.
sampling_params!(opts \\ []) - Bang version of sampling_params/1 - raises on error.
set_attr(ref, attribute, value) - Set an attribute on a Python object reference.
timeout_ms(milliseconds) - Create a timeout option for exact milliseconds.
timeout_profile(profile) - Create a timeout option for a predefined timeout profile.
version() - Get the installed vLLM version.
version!() - Bang version of version/0.
with_timeout(opts, timeout_opts) - Add timeout configuration to options.

Functions

async_engine(model, opts \\ [])

Create an AsyncLLMEngine for asynchronous inference.

Useful for building online serving applications with concurrent requests.

Examples

{:ok, engine} = VLLM.async_engine("facebook/opt-125m")

async_engine!(model, opts \\ [])

Bang version of async_engine/2.

attr(ref, attribute)

Get an attribute from a Python object reference.

attr!(ref, attribute)

Bang version of attr/2.

bytes(data)

Encode binary data as Python bytes.
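
A minimal sketch (reading a local image file is purely illustrative):

# Wrap an Elixir binary so it is sent to Python as `bytes`.
image = File.read!("cat.png")
py_bytes = VLLM.bytes(image)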

call(module, function, args \\ [], opts \\ [])

Call any vLLM function or class.

Examples

{:ok, result} = VLLM.call("vllm", "LLM", ["facebook/opt-125m"])
{:ok, config} = VLLM.call("vllm.config", "ModelConfig", [], model: "...")

call!(module, function, args \\ [], opts \\ [])

Bang version of call/4 - raises on error, returns the value directly.

chat(llm, messages, opts \\ [])

Generate chat completions from messages.

Arguments

  • llm - LLM instance from VLLM.llm!/1
  • messages - List of message conversations, where each conversation is a list of message maps
  • opts - Options including:
    • :sampling_params - SamplingParams instance
    • :use_tqdm - Show progress bar
    • :chat_template - Custom chat template (Jinja2 format)

Message Format

Each message is a map with:

  • "role" - One of "system", "user", "assistant"
  • "content" - Message content string

Examples

messages = [[
  %{"role" => "system", "content" => "You are helpful."},
  %{"role" => "user", "content" => "Hello!"}
]]

outputs = VLLM.chat!(llm, messages)

Returns

List of RequestOutput objects (same as generate/3).
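
For example, the assistant reply can be read off each RequestOutput with attr!/2, following the same pattern as the Quick Start (a sketch that takes only the first completion of each output):

outputs = VLLM.chat!(llm, messages)

Enum.each(outputs, fn output ->
  reply =
    output
    |> VLLM.attr!("outputs")
    |> Enum.at(0)
    |> VLLM.attr!("text")

  IO.puts("Assistant: #{reply}")
end)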

chat!(llm, messages, opts \\ [])

Bang version of chat/3 - raises on error.

embed(llm, texts, opts \\ [])

Generate embeddings for texts using a pooling model.

Arguments

  • llm - LLM instance configured with an embedding model
  • texts - String or list of strings to embed
  • opts - Options including:
    • :pooling_params - PoolingParams instance

Examples

llm = VLLM.llm!("intfloat/e5-mistral-7b-instruct", runner: "pooling")
outputs = VLLM.embed!(llm, ["Hello, world!", "How are you?"])

Returns

List of EmbeddingRequestOutput objects with:

  • outputs - List of embeddings
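
A sketch of reading the results back out; the exact inner shape (for example an "embedding" attribute on the pooled output, as in upstream vLLM) varies by vLLM version, so this just inspects it:

outputs = VLLM.embed!(llm, ["Hello, world!", "How are you?"])

Enum.each(outputs, fn output ->
  # Inspect the pooled result for this input text.
  output
  |> VLLM.attr!("outputs")
  |> IO.inspect(label: "embedding output")
end)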

embed!(llm, texts, opts \\ [])

Bang version of embed/3.

encode(llm, text, opts \\ [])

Encode text to token IDs.

Examples

{:ok, token_ids} = VLLM.encode(llm, "Hello, world!")

encode!(llm, text, opts \\ [])

Bang version of encode/3.

engine(model, opts \\ [])

Create an LLMEngine for fine-grained control over inference.

The LLMEngine provides lower-level access to vLLM's inference capabilities, useful for building custom serving solutions.

Options

Same as llm/2 plus:

  • :max_num_seqs - Maximum number of sequences per batch
  • :max_num_batched_tokens - Maximum tokens per batch

Examples

{:ok, engine} = VLLM.engine("facebook/opt-125m")
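
A sketch of a manual request/step loop driven through method!/4; add_request, step, and has_unfinished_requests are upstream vLLM LLMEngine methods and may differ across versions:

VLLM.run(fn ->
  engine = VLLM.engine!("facebook/opt-125m")
  params = VLLM.sampling_params!(max_tokens: 32)

  # Queue one request, then step the engine until nothing is pending.
  VLLM.method!(engine, "add_request", ["req-1", "Hello, my name is", params])

  Stream.repeatedly(fn -> VLLM.method!(engine, "step") end)
  |> Enum.take_while(fn _step_outputs ->
    VLLM.method!(engine, "has_unfinished_requests")
  end)
end)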

engine!(model, opts \\ [])

Bang version of engine/2.

generate(llm, prompts, opts \\ [])

Generate text completions from prompts.

Arguments

  • llm - LLM instance from VLLM.llm!/1
  • prompts - String or list of strings to complete
  • opts - Options including:
    • :sampling_params - SamplingParams instance
    • :use_tqdm - Show progress bar (default: true)
    • :lora_request - LoRA adapter request

Examples

outputs = VLLM.generate!(llm, "Hello, my name is")
outputs = VLLM.generate!(llm, ["Prompt 1", "Prompt 2"], sampling_params: params)

Returns

List of RequestOutput objects. Each has:

  • prompt - Original prompt
  • outputs - List of CompletionOutput objects
    • text - Generated text
    • token_ids - Generated token IDs
    • finish_reason - Reason for completion ("length", "stop", etc.)
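
A sketch of walking this structure with attr!/2, reading only the first completion of each output:

outputs = VLLM.generate!(llm, ["Prompt 1", "Prompt 2"])

Enum.each(outputs, fn output ->
  completion = output |> VLLM.attr!("outputs") |> Enum.at(0)

  IO.puts("Prompt:   #{VLLM.attr!(output, "prompt")}")
  IO.puts("Text:     #{VLLM.attr!(completion, "text")}")
  IO.puts("Finished: #{VLLM.attr!(completion, "finish_reason")}")
end)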

generate!(llm, prompts, opts \\ [])

Bang version of generate/3 - raises on error.

get(module, attr)

Get a module attribute.
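
For instance, a module-level attribute such as vllm.__version__ can be read directly (this mirrors what version/0 reports):

{:ok, version} = VLLM.get("vllm", "__version__")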

get!(module, attr)

Bang version of get/2.

guided_decoding_params(opts \\ [])

Create guided decoding parameters for structured outputs.

Options

  • :json - JSON schema string for JSON output
  • :json_object - Python dict/Pydantic model for JSON
  • :regex - Regex pattern for output
  • :choice - List of allowed string choices
  • :grammar - BNF grammar string

Examples

# JSON schema
{:ok, guided} = VLLM.guided_decoding_params(
  json: ~s({"type": "object", "properties": {"name": {"type": "string"}}})
)

# Regex pattern
{:ok, guided} = VLLM.guided_decoding_params(regex: "[0-9]{3}-[0-9]{4}")

# Choice
{:ok, guided} = VLLM.guided_decoding_params(choice: ["yes", "no", "maybe"])

Support

Guided decoding requires a vLLM build that exposes GuidedDecodingParams. Use guided_decoding_supported?/0 to check availability.
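
A sketch of wiring guided decoding into generation. Passing the result through a :guided_decoding option on sampling_params!/1 mirrors upstream vLLM's SamplingParams(guided_decoding=...) and is an assumption about this wrapper's keyword passthrough:

VLLM.run(fn ->
  llm = VLLM.llm!("Qwen/Qwen2-0.5B-Instruct")

  if VLLM.guided_decoding_supported?() do
    guided = VLLM.guided_decoding_params!(choice: ["yes", "no", "maybe"])

    # Assumption: options are forwarded to Python SamplingParams, which
    # accepts guided_decoding= in upstream vLLM.
    params = VLLM.sampling_params!(guided_decoding: guided, max_tokens: 5)

    VLLM.generate!(llm, ["Is Elixir fun? Answer:"], sampling_params: params)
  end
end)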

guided_decoding_params!(opts \\ [])

Bang version of guided_decoding_params/1.

guided_decoding_supported?()

Check whether guided decoding parameters are available in the installed vLLM.

llm(model, opts \\ [])

Create a vLLM LLM instance for offline inference.

Options

Common options passed as keyword arguments:

  • :dtype - Data type ("auto", "float16", "bfloat16", "float32")
  • :tensor_parallel_size - Number of GPUs for tensor parallelism
  • :gpu_memory_utilization - Fraction of GPU memory to use (0.0-1.0)
  • :max_model_len - Maximum sequence length
  • :quantization - Quantization method ("awq", "gptq", "squeezellm", etc.)
  • :trust_remote_code - Whether to trust remote code from HuggingFace

Examples

{:ok, llm} = VLLM.llm("facebook/opt-125m")
{:ok, llm} = VLLM.llm("Qwen/Qwen2-7B", tensor_parallel_size: 2)
{:ok, llm} = VLLM.llm("TheBloke/Llama-2-7B-AWQ", quantization: "awq")

llm!(model, opts \\ [])

Bang version of llm/2 - raises on error.

lora_request(name, lora_int_id, lora_path, opts \\ [])

Create a LoRARequest for serving LoRA adapters.

Arguments

  • name - Unique name for this LoRA adapter
  • lora_int_id - Integer ID for the adapter
  • lora_path - Path to the LoRA adapter weights

Examples

{:ok, lora} = VLLM.lora_request("my-adapter", 1, "/path/to/adapter")
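
A sketch of serving the adapter during generation. The :lora_request option on generate/3 is documented above; the enable_lora option follows upstream vLLM's LLM(enable_lora=True) and is assumed to be passed through as a keyword argument:

VLLM.run(fn ->
  # Assumption: keyword options are forwarded to Python's LLM constructor,
  # which requires enable_lora=True before LoRA requests are accepted.
  llm = VLLM.llm!("meta-llama/Llama-2-7b-hf", enable_lora: true)

  lora = VLLM.lora_request!("my-adapter", 1, "/path/to/adapter")
  params = VLLM.sampling_params!(max_tokens: 64)

  VLLM.generate!(llm, ["Write a haiku about Elixir."],
    sampling_params: params,
    lora_request: lora
  )
end)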

lora_request!(name, lora_int_id, lora_path, opts \\ [])

Bang version of lora_request/4.

method(ref, method, args \\ [], opts \\ [])

Call a method on a Python object reference.
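
For example, methods on the underlying Python object can be called by name; get_tokenizer/0 here is upstream vLLM's LLM API and may differ across versions:

# `llm` is an instance from VLLM.llm!/1; the returned tokenizer is
# another Python object reference.
{:ok, tokenizer} = VLLM.method(llm, "get_tokenizer")
true = VLLM.ref?(tokenizer)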

method!(ref, method, args \\ [], opts \\ [])

Bang version of method/4.

pooling_params(opts \\ [])

Create PoolingParams for embedding models.

Options

  • :additional_data - Additional metadata for the pooling request

Examples

{:ok, params} = VLLM.pooling_params()
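
The result is consumed by embed/3 via its :pooling_params option; a minimal sketch using the embedding model from the embed/3 example:

llm = VLLM.llm!("intfloat/e5-mistral-7b-instruct", runner: "pooling")
{:ok, params} = VLLM.pooling_params()
outputs = VLLM.embed!(llm, ["Hello, world!"], pooling_params: params)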

pooling_params!(opts \\ [])

Bang version of pooling_params/1.

ref?(value)

Check if a value is a Python object reference.
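
Useful for telling bridged Python objects apart from values that were converted to native Elixir terms. A sketch, assuming llm!/1 returns a reference as the attr!/method! usage elsewhere on this page suggests:

llm = VLLM.llm!("facebook/opt-125m")
VLLM.ref?(llm)        # => true
VLLM.ref?("a string") # => false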

run(fun, opts \\ [])

Run VLLM code with automatic Python lifecycle management.

Wraps your code in Snakepit.run_as_script/2 which:

  • Starts the Python process pool
  • Runs your code
  • Cleans up on exit

Pass halt: true in opts if you need to force the BEAM to exit (for example, when running inside wrapper scripts).

Example

VLLM.run(fn ->
  llm = VLLM.llm!("facebook/opt-125m")
  outputs = VLLM.generate!(llm, ["Hello, world"])
  # ... process outputs
end)

sampling_params(opts \\ [])

Create SamplingParams for controlling text generation.

Options

  • :temperature - Sampling temperature (default: 1.0)
  • :top_p - Nucleus sampling probability (default: 1.0)
  • :top_k - Top-k sampling (default: -1, disabled)
  • :max_tokens - Maximum tokens to generate (default: 16)
  • :min_tokens - Minimum tokens to generate (default: 0)
  • :presence_penalty - Presence penalty (default: 0.0)
  • :frequency_penalty - Frequency penalty (default: 0.0)
  • :repetition_penalty - Repetition penalty (default: 1.0)
  • :stop - List of stop strings
  • :stop_token_ids - List of stop token IDs
  • :n - Number of completions to generate (default: 1)
  • :best_of - Number of sequences to generate and select best from
  • :seed - Random seed for reproducibility

Examples

{:ok, params} = VLLM.sampling_params(temperature: 0.8, max_tokens: 100)
{:ok, params} = VLLM.sampling_params(top_p: 0.9, stop: ["\n", "END"])

sampling_params!(opts \\ [])

Bang version of sampling_params/1 - raises on error.

set_attr(ref, attribute, value)

Set an attribute on a Python object reference.

timeout_ms(milliseconds)

Create a timeout option for exact milliseconds.

Examples

VLLM.generate!(llm, prompts,
  Keyword.merge([sampling_params: params], VLLM.timeout_ms(300_000))
)

timeout_profile(profile)

Create a timeout option for a predefined timeout profile, for use with the __runtime__ option.

Examples

VLLM.generate!(llm, prompts,
  Keyword.merge([sampling_params: params], VLLM.timeout_profile(:batch_job))
)

version()

Get the installed vLLM version.

version!()

Bang version of version/0.

with_timeout(opts, timeout_opts)

Add timeout configuration to options.

Options

  • :timeout - Exact timeout in milliseconds
  • :timeout_profile - Use a predefined profile

Examples

opts = VLLM.with_timeout([], timeout: 60_000)
VLLM.generate!(llm, prompts, Keyword.merge(opts, sampling_params: params))