VLLM (VLLM v0.1.1)
VLLM - vLLM for Elixir via SnakeBridge.
Easy, fast, and cheap LLM serving for everyone. This library provides transparent access to Python vLLM through SnakeBridge's Universal FFI.
Quick Start
VLLM.run(fn ->
  # Create an LLM instance
  llm = VLLM.llm!("facebook/opt-125m")

  # Generate text
  outputs = VLLM.generate!(llm, ["Hello, my name is"])

  # Process results
  Enum.each(outputs, fn output ->
    prompt = VLLM.attr!(output, "prompt")
    generated = VLLM.attr!(output, "outputs") |> Enum.at(0)
    text = VLLM.attr!(generated, "text")
    IO.puts("Prompt: #{prompt}")
    IO.puts("Generated: #{text}")
  end)
end)
Chat Interface
VLLM.run(fn ->
  llm = VLLM.llm!("Qwen/Qwen2-0.5B-Instruct")

  messages = [[
    %{"role" => "system", "content" => "You are a helpful assistant."},
    %{"role" => "user", "content" => "What is the capital of France?"}
  ]]

  outputs = VLLM.chat!(llm, messages)
  # Process chat outputs...
end)
Sampling Parameters
Control generation with VLLM.SamplingParams:
VLLM.run(fn ->
  llm = VLLM.llm!("facebook/opt-125m")

  params = VLLM.sampling_params!(temperature: 0.8, top_p: 0.95, max_tokens: 100)
  outputs = VLLM.generate!(llm, ["Once upon a time"], sampling_params: params)
end)
Timeout Configuration
VLLM leverages SnakeBridge's timeout architecture for LLM workloads.
By default, all vLLM calls use the :ml_inference profile (10 minute timeout).
Timeout Profiles
| Profile | Timeout | Use Case |
|---|---|---|
| :default | 2 min | Standard Python calls |
| :streaming | 30 min | Streaming responses |
| :ml_inference | 10 min | LLM inference (VLLM default) |
| :batch_job | 1 hour | Long-running batch operations |
Per-Call Timeout Override
VLLM.generate!(llm, prompts,
  sampling_params: params,
  __runtime__: [timeout_profile: :batch_job]
)
Architecture
VLLM uses SnakeBridge's Universal FFI to call vLLM directly:
Elixir (VLLM.call/4)
|
SnakeBridge.call/4
|
Snakepit gRPC
|
Python vLLM
|
GPU/TPU Inference

The Python process lifecycle is managed automatically by Snakepit.
Summary
Functions
Create an AsyncLLMEngine for asynchronous inference.
Bang version of async_engine/2.
Get an attribute from a Python object reference.
Bang version of attr/2.
Encode binary data as Python bytes.
Call any vLLM function or class.
Bang version - raises on error, returns value directly.
Generate chat completions from messages.
Bang version of chat/3 - raises on error.
Generate embeddings for texts using a pooling model.
Bang version of embed/3.
Encode text to token IDs.
Bang version of encode/3.
Create an LLMEngine for fine-grained control over inference.
Bang version of engine/2.
Generate text completions from prompts.
Bang version of generate/3 - raises on error.
Get a module attribute.
Bang version of get/2.
Create guided decoding parameters for structured outputs.
Bang version of guided_decoding_params/1.
Check whether guided decoding parameters are available in the installed vLLM.
Create a vLLM LLM instance for offline inference.
Bang version of llm/2 - raises on error.
Create a LoRARequest for serving LoRA adapters.
Bang version of lora_request/4.
Call a method on a Python object reference.
Bang version of method/4.
Create PoolingParams for embedding models.
Bang version of pooling_params/1.
Check if a value is a Python object reference.
Run VLLM code with automatic Python lifecycle management.
Create SamplingParams for controlling text generation.
Bang version of sampling_params/1 - raises on error.
Set an attribute on a Python object reference.
Create a timeout option for exact milliseconds.
Timeout profile atoms for use with the __runtime__ option.
Get the installed vLLM version.
Bang version of version/0.
Add timeout configuration to options.
Functions
Create an AsyncLLMEngine for asynchronous inference.
Useful for building online serving applications with concurrent requests.
Examples
{:ok, engine} = VLLM.async_engine("facebook/opt-125m")
Bang version of async_engine/2.
Get an attribute from a Python object reference.
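A minimal sketch of the non-bang form, assuming attr/2 returns {:ok, value} tuples to match the attr!/2 calls shown in the Quick Start; the output variable stands for a RequestOutput from generate!/3:
{:ok, prompt} = VLLM.attr(output, "prompt")

# Nested references work the same way: take the first CompletionOutput,
# then read its generated text.
completion = VLLM.attr!(output, "outputs") |> Enum.at(0)
{:ok, text} = VLLM.attr(completion, "text")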
Bang version of attr/2.
Encode binary data as Python bytes.
Call any vLLM function or class.
Examples
{:ok, result} = VLLM.call("vllm", "LLM", ["facebook/opt-125m"])
{:ok, config} = VLLM.call("vllm.config", "ModelConfig", [], model: "...")
Bang version - raises on error, returns value directly.
Generate chat completions from messages.
Arguments
- llm - LLM instance from VLLM.llm!/1
- messages - List of message conversations, where each conversation is a list of message maps
- opts - Options including:
  - :sampling_params - SamplingParams instance
  - :use_tqdm - Show progress bar
  - :chat_template - Custom chat template (Jinja2 format)
Message Format
Each message is a map with:
"role"- One of "system", "user", "assistant""content"- Message content string
Examples
messages = [[
  %{"role" => "system", "content" => "You are helpful."},
  %{"role" => "user", "content" => "Hello!"}
]]

outputs = VLLM.chat!(llm, messages)
Returns
List of RequestOutput objects (same as generate/3).
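A sketch of pulling the assistant reply out of the chat outputs, reusing the attr!/2 pattern from the Quick Start; it assumes chat results expose the same outputs/text attributes as generate/3 results:
outputs = VLLM.chat!(llm, messages)

Enum.each(outputs, fn output ->
  # The first CompletionOutput holds the assistant's reply text.
  reply =
    output
    |> VLLM.attr!("outputs")
    |> Enum.at(0)
    |> VLLM.attr!("text")

  IO.puts("Assistant: #{reply}")
end)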
Bang version of chat/3 - raises on error.
Generate embeddings for texts using a pooling model.
Arguments
- llm - LLM instance configured with an embedding model
- texts - String or list of strings to embed
- opts - Options including:
  - :pooling_params - PoolingParams instance
Examples
llm = VLLM.llm!("intfloat/e5-mistral-7b-instruct", runner: "pooling")
outputs = VLLM.embed!(llm, ["Hello, world!", "How are you?"])Returns
List of EmbeddingRequestOutput objects with:
- outputs - List of embeddings
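A short sketch of walking these results; reading the outputs attribute follows the Returns note above, and any deeper nesting of the numeric vector is left as a version-dependent assumption:
outputs = VLLM.embed!(llm, ["Hello, world!", "How are you?"])

Enum.each(outputs, fn output ->
  # Per the Returns note above, "outputs" holds the embeddings for this input.
  # (Depending on the installed vLLM version, the numeric vector may sit one
  # level deeper, e.g. under an "embedding" attribute as in vLLM's Python API.)
  embeddings = VLLM.attr!(output, "outputs")
  IO.inspect(embeddings, label: "embeddings")
end)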
Bang version of embed/3.
Encode text to token IDs.
Examples
{:ok, token_ids} = VLLM.encode(llm, "Hello, world!")
Bang version of encode/3.
Create an LLMEngine for fine-grained control over inference.
The LLMEngine provides lower-level access to vLLM's inference capabilities, useful for building custom serving solutions.
Options
Same as llm/2 plus:
- :max_num_seqs - Maximum number of sequences per batch
- :max_num_batched_tokens - Maximum tokens per batch
Examples
{:ok, engine} = VLLM.engine("facebook/opt-125m")
Bang version of engine/2.
Generate text completions from prompts.
Arguments
- llm - LLM instance from VLLM.llm!/1
- prompts - String or list of strings to complete
- opts - Options including:
  - :sampling_params - SamplingParams instance
  - :use_tqdm - Show progress bar (default: true)
  - :lora_request - LoRA adapter request
Examples
outputs = VLLM.generate!(llm, "Hello, my name is")
outputs = VLLM.generate!(llm, ["Prompt 1", "Prompt 2"], sampling_params: params)
Returns
List of RequestOutput objects. Each has:
- prompt - Original prompt
- outputs - List of CompletionOutput objects, each with:
  - text - Generated text
  - token_ids - Generated token IDs
  - finish_reason - Reason for completion ("length", "stop", etc.)
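Putting the structure above together, a minimal sketch of walking the completions with attr!/2, mirroring the Quick Start:
outputs = VLLM.generate!(llm, ["Prompt 1", "Prompt 2"], sampling_params: params)

Enum.each(outputs, fn output ->
  prompt = VLLM.attr!(output, "prompt")

  output
  |> VLLM.attr!("outputs")
  |> Enum.each(fn completion ->
    text = VLLM.attr!(completion, "text")
    reason = VLLM.attr!(completion, "finish_reason")
    IO.puts("#{prompt} -> #{text} (#{reason})")
  end)
end)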
Bang version of generate/3 - raises on error.
Get a module attribute.
Bang version of get/2.
Create guided decoding parameters for structured outputs.
Options
- :json - JSON schema string for JSON output
- :json_object - Python dict/Pydantic model for JSON
- :regex - Regex pattern for output
- :choice - List of allowed string choices
- :grammar - BNF grammar string
Examples
# JSON schema
{:ok, guided} = VLLM.guided_decoding_params(
  json: ~s({"type": "object", "properties": {"name": {"type": "string"}}})
)

# Regex pattern
{:ok, guided} = VLLM.guided_decoding_params(regex: "[0-9]{3}-[0-9]{4}")

# Choice
{:ok, guided} = VLLM.guided_decoding_params(choice: ["yes", "no", "maybe"])
Support
Guided decoding requires a vLLM build that exposes GuidedDecodingParams.
Use guided_decoding_supported?/0 to check availability.
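A sketch of gating on availability before building guided params; how the result is then attached to a request (for example through sampling parameters, as in Python vLLM) is version-dependent and left as an assumption here:
if VLLM.guided_decoding_supported?() do
  {:ok, guided} = VLLM.guided_decoding_params(choice: ["yes", "no", "maybe"])
  # Wiring `guided` into a generate/chat call depends on the installed vLLM
  # version (Python vLLM attaches it via SamplingParams); inspect it for now.
  IO.inspect(guided)
else
  IO.puts("Guided decoding is not available in this vLLM build")
end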
Bang version of guided_decoding_params/1.
Check whether guided decoding parameters are available in the installed vLLM.
Create a vLLM LLM instance for offline inference.
Options
Common options passed as keyword arguments:
- :dtype - Data type ("auto", "float16", "bfloat16", "float32")
- :tensor_parallel_size - Number of GPUs for tensor parallelism
- :gpu_memory_utilization - Fraction of GPU memory to use (0.0-1.0)
- :max_model_len - Maximum sequence length
- :quantization - Quantization method ("awq", "gptq", "squeezellm", etc.)
- :trust_remote_code - Whether to trust remote code from HuggingFace
Examples
{:ok, llm} = VLLM.llm("facebook/opt-125m")
{:ok, llm} = VLLM.llm("Qwen/Qwen2-7B", tensor_parallel_size: 2)
{:ok, llm} = VLLM.llm("TheBloke/Llama-2-7B-AWQ", quantization: "awq")
Bang version of llm/2 - raises on error.
Create a LoRARequest for serving LoRA adapters.
Arguments
- name - Unique name for this LoRA adapter
- lora_int_id - Integer ID for the adapter
- lora_path - Path to the LoRA adapter weights
Examples
{:ok, lora} = VLLM.lora_request("my-adapter", 1, "/path/to/adapter")
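A minimal sketch of serving the adapter through generate!/3's :lora_request option (documented under generate/3); the adapter path is a placeholder:
{:ok, lora} = VLLM.lora_request("my-adapter", 1, "/path/to/adapter")

# Route the request through the LoRA adapter.
outputs = VLLM.generate!(llm, ["Hello from the adapter"], lora_request: lora)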
Bang version of lora_request/4.
Call a method on a Python object reference.
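A sketch under the assumption that method/4 takes the object reference, the Python method name, a list of positional arguments, and keyword options (mirroring call/4); get_tokenizer is only an illustrative Python-side method:
# Call a zero-argument Python method on an LLM reference.
{:ok, tokenizer} = VLLM.method(llm, "get_tokenizer", [], [])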
Bang version of method/4.
Create PoolingParams for embedding models.
Options
- :additional_data - Additional metadata for the pooling request
Examples
{:ok, params} = VLLM.pooling_params()
Bang version of pooling_params/1.
Check if a value is a Python object reference.
Run VLLM code with automatic Python lifecycle management.
Wraps your code in Snakepit.run_as_script/2 which:
- Starts the Python process pool
- Runs your code
- Cleans up on exit
Pass halt: true in opts if you need to force the BEAM to exit
(for example, when running inside wrapper scripts).
Example
VLLM.run(fn ->
  llm = VLLM.llm!("facebook/opt-125m")
  outputs = VLLM.generate!(llm, ["Hello, world"])
  # ... process outputs
end)
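When invoked from a wrapper script that should shut the BEAM down afterwards, the halt: true option mentioned above can be passed along; a sketch, assuming the options are given as a second argument:
VLLM.run(
  fn ->
    llm = VLLM.llm!("facebook/opt-125m")
    VLLM.generate!(llm, ["Hello, world"])
  end,
  halt: true
)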
Create SamplingParams for controlling text generation.
Options
- :temperature - Sampling temperature (default: 1.0)
- :top_p - Nucleus sampling probability (default: 1.0)
- :top_k - Top-k sampling (default: -1, disabled)
- :max_tokens - Maximum tokens to generate (default: 16)
- :min_tokens - Minimum tokens to generate (default: 0)
- :presence_penalty - Presence penalty (default: 0.0)
- :frequency_penalty - Frequency penalty (default: 0.0)
- :repetition_penalty - Repetition penalty (default: 1.0)
- :stop - List of stop strings
- :stop_token_ids - List of stop token IDs
- :n - Number of completions to generate (default: 1)
- :best_of - Number of sequences to generate and select the best from
- :seed - Random seed for reproducibility
Examples
{:ok, params} = VLLM.sampling_params(temperature: 0.8, max_tokens: 100)
{:ok, params} = VLLM.sampling_params(top_p: 0.9, stop: ["\n", "END"])
Bang version of sampling_params/1 - raises on error.
Set an attribute on a Python object reference.
Create a timeout option for exact milliseconds.
Examples
VLLM.generate!(llm, prompts,
  Keyword.merge([sampling_params: params], VLLM.timeout_ms(300_000))
)
Timeout profile atoms for use with the __runtime__ option.
Examples
VLLM.generate!(llm, prompts,
  Keyword.merge([sampling_params: params], VLLM.timeout_profile(:batch_job))
)
Get the installed vLLM version.
Bang version of version/0.
Add timeout configuration to options.
Options
- :timeout - Exact timeout in milliseconds
- :timeout_profile - Use a predefined profile
Examples
opts = VLLM.with_timeout([], timeout: 60_000)
VLLM.generate!(llm, prompts, Keyword.merge(opts, sampling_params: params))