This directory contains examples demonstrating VLLM for Elixir. VLLM wraps Python's vLLM library via SnakeBridge to provide high-throughput LLM inference.

Prerequisites

IMPORTANT: vLLM requires a CUDA-capable NVIDIA GPU. If you don't have a compatible GPU, the inference examples will fail with CUDA errors.

# Install dependencies and set up Python environment
mix deps.get
mix snakebridge.setup

# Verify you have a CUDA-capable GPU
nvidia-smi

GPU Requirements

  • CUDA-capable NVIDIA GPU (e.g., RTX 3090, A100, V100)
  • CUDA toolkit installed and configured
  • Sufficient GPU memory for your chosen model (8GB+ recommended)

Running Examples

Run any example individually:

mix run examples/basic.exs

Or run all examples with the test script:

./examples/run_all.sh

Runtime Options

Some examples accept CLI flags for overrides:

# Embeddings example (optional override)
mix run examples/embeddings.exs -- --model "BAAI/bge-large-en-v1.5"

# LoRA example (auto-downloads a public default adapter on first run)
mix run examples/lora.exs

# LoRA example (optional overrides)
mix run examples/lora.exs -- \
  --adapter /path/to/adapter \
  --model "your-base-model" \
  --name "adapter" \
  --prompt "Write a short SQL query to list users." \
  --rank 64

# Timeout example (optional overrides)
mix run examples/timeout_config.exs -- --model "facebook/opt-125m"
mix run examples/timeout_config.exs -- --prompt "Explain Elixir in one sentence."

The default LoRA adapter comes from edbeeching/opt-125m-lora (base model facebook/opt-125m) and is downloaded automatically. This requires network access the first time it runs.


Core Examples

Basic Generation (basic.exs)

The foundational VLLM example showing core concepts:

  • Creating an LLM instance
  • Generating text completions
  • Processing results
{:ok, llm} = Vllm.LLM.new("facebook/opt-125m")
llm_ref = SnakeBridge.Ref.from_wire_format(llm)

runtime_opts =
  case llm_ref.pool_name do
    nil -> [session_id: llm_ref.session_id]
    pool_name -> [session_id: llm_ref.session_id, pool_name: pool_name]
  end

{:ok, outputs} =
  Vllm.LLM.generate(llm, ["Hello, my name is"], [], __runtime__: runtime_opts)

Run: mix run examples/basic.exs
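
To print what came back, here is a minimal result-processing sketch. It assumes each entry in outputs decodes to a map shaped like vLLM's RequestOutput, with a "prompt" key and an "outputs" list whose items carry the generated "text"; adjust the accessors to whatever your SnakeBridge configuration actually returns.

# Assumed shape: [%{"prompt" => ..., "outputs" => [%{"text" => ...} | _]} | _]
Enum.each(outputs, fn output ->
  prompt = output["prompt"]
  completion = output["outputs"] |> List.first() |> Map.get("text")
  IO.puts("#{prompt} -> #{completion}")
end)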


Sampling Parameters (sampling_params.exs)

Control text generation behavior:

  • Temperature for randomness
  • Top-p (nucleus) sampling
  • Max tokens limit
  • Stop sequences
  • Multiple completions
llm_ref = SnakeBridge.Ref.from_wire_format(llm)

runtime_opts =
  case llm_ref.pool_name do
    nil -> [session_id: llm_ref.session_id]
    pool_name -> [session_id: llm_ref.session_id, pool_name: pool_name]
  end

{:ok, params} =
  Vllm.SamplingParams.new([], temperature: 0.8, top_p: 0.95, max_tokens: 100, __runtime__: runtime_opts)

{:ok, outputs} =
  Vllm.LLM.generate(llm, [prompt], [], sampling_params: params, __runtime__: runtime_opts)

Run: mix run examples/sampling_params.exs
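
The bullet list above also mentions stop sequences and multiple completions. The following is a hedged sketch that assumes the wrapper forwards vLLM's n and stop keyword arguments to SamplingParams unchanged.

# Assumes n and stop pass straight through to vLLM's SamplingParams.
{:ok, multi_params} =
  Vllm.SamplingParams.new([],
    temperature: 1.0,
    n: 3,
    stop: ["\n\n"],
    max_tokens: 50,
    __runtime__: runtime_opts
  )

{:ok, outputs} =
  Vllm.LLM.generate(llm, [prompt], [], sampling_params: multi_params, __runtime__: runtime_opts)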


Chat Completions (chat.exs)

Chat-style interactions with instruction-tuned models:

  • System prompts
  • Multi-turn conversations
  • Batch chat processing
llm_ref = SnakeBridge.Ref.from_wire_format(llm)

runtime_opts =
  case llm_ref.pool_name do
    nil -> [session_id: llm_ref.session_id]
    pool_name -> [session_id: llm_ref.session_id, pool_name: pool_name]
  end

messages = [[
  %{"role" => "system", "content" => "You are helpful."},
  %{"role" => "user", "content" => "Hello!"}
]]

{:ok, outputs} = Vllm.LLM.chat(llm, messages, [], __runtime__: runtime_opts)

Run: mix run examples/chat.exs
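
Multi-turn and batch chat use the same call: each conversation is its own list of messages, and several conversations can be passed in one batch. A sketch, assuming Vllm.LLM.chat/4 accepts a list of conversations as in the call above:

# Two independent conversations processed in one batch call.
conversations = [
  [
    %{"role" => "system", "content" => "You are helpful."},
    %{"role" => "user", "content" => "What is Elixir?"},
    %{"role" => "assistant", "content" => "Elixir is a functional language on the BEAM."},
    %{"role" => "user", "content" => "Name one of its strengths."}
  ],
  [
    %{"role" => "user", "content" => "Give me a haiku about GPUs."}
  ]
]

{:ok, outputs} = Vllm.LLM.chat(llm, conversations, [], __runtime__: runtime_opts)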


Batch Inference (batch_inference.exs)

High-throughput batch processing:

  • Processing multiple prompts efficiently
  • Continuous batching
  • Performance measurement
llm_ref = SnakeBridge.Ref.from_wire_format(llm)

runtime_opts =
  case llm_ref.pool_name do
    nil -> [session_id: llm_ref.session_id]
    pool_name -> [session_id: llm_ref.session_id, pool_name: pool_name]
  end

prompts = ["Prompt 1", "Prompt 2", "Prompt 3", ...]
{:ok, outputs} = Vllm.LLM.generate(llm, prompts, [], sampling_params: params, __runtime__: runtime_opts)

Run: mix run examples/batch_inference.exs
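
A simple way to measure throughput is to wrap the batched call in :timer.tc/1, which ships with Erlang/OTP; this sketch reuses only names already bound above.

# Time the batched call and report elapsed seconds.
{elapsed_us, {:ok, outputs}} =
  :timer.tc(fn ->
    Vllm.LLM.generate(llm, prompts, [], sampling_params: params, __runtime__: runtime_opts)
  end)

IO.puts("#{length(prompts)} prompts in #{elapsed_us / 1_000_000} s")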


Advanced Examples

Structured Output (structured_output.exs)

Guided generation for structured outputs:

  • JSON schema constraints
  • Regex patterns
  • Choice constraints
{:ok, params} = Vllm.SamplingParams.new([], structured_outputs: %{choice: ["yes", "no", "maybe"]})

Run: mix run examples/structured_output.exs
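
The JSON-schema and regex constraints follow the same pattern. This sketch assumes the structured_outputs map also accepts json and regex keys mirroring vLLM's structured-output options; check the generated wrapper if your version differs.

# Assumes json and regex keys are supported alongside choice.
schema = %{
  "type" => "object",
  "properties" => %{"name" => %{"type" => "string"}, "age" => %{"type" => "integer"}},
  "required" => ["name", "age"]
}

{:ok, json_params} = Vllm.SamplingParams.new([], structured_outputs: %{json: schema})
{:ok, regex_params} = Vllm.SamplingParams.new([], structured_outputs: %{regex: "\\d{4}-\\d{2}-\\d{2}"})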


Quantization (quantization.exs)

Memory-efficient inference with quantized models:

  • AWQ quantization
  • GPTQ quantization
  • Memory comparison
{:ok, llm} = Vllm.LLM.new("TheBloke/Llama-2-7B-AWQ", quantization: "awq")

Run: mix run examples/quantization.exs
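
GPTQ models load the same way with quantization: "gptq"; the checkpoint name below is illustrative, so substitute a GPTQ model you actually have access to.

# Illustrative GPTQ checkpoint; swap in your own quantized model.
{:ok, llm} = Vllm.LLM.new("TheBloke/Llama-2-7B-GPTQ", quantization: "gptq")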


Multi-GPU (multi_gpu.exs)

Distributed inference across GPUs:

  • Tensor parallelism
  • Pipeline parallelism
  • Memory utilization
{:ok, llm} = Vllm.LLM.new("meta-llama/Llama-2-13b-hf",
  tensor_parallel_size: 2,
  gpu_memory_utilization: 0.9
)

Run: mix run examples/multi_gpu.exs
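
Pipeline parallelism is configured alongside tensor parallelism. A hedged sketch, assuming the wrapper passes pipeline_parallel_size through to vLLM:

# Two pipeline stages, each stage sharded across two GPUs (four GPUs total).
{:ok, llm} =
  Vllm.LLM.new("meta-llama/Llama-2-13b-hf",
    tensor_parallel_size: 2,
    pipeline_parallel_size: 2,
    gpu_memory_utilization: 0.9
  )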


Embeddings (embeddings.exs)

Vector embeddings for semantic search:

  • Loading embedding models
  • Batch embedding
  • Use cases
{:ok, llm} = Vllm.LLM.new("intfloat/e5-mistral-7b-instruct", runner: "pooling")
llm_ref = SnakeBridge.Ref.from_wire_format(llm)

runtime_opts =
  case llm_ref.pool_name do
    nil -> [session_id: llm_ref.session_id]
    pool_name -> [session_id: llm_ref.session_id, pool_name: pool_name]
  end

{:ok, outputs} = Vllm.LLM.embed(llm, ["Hello, world!"], __runtime__: runtime_opts)

Run: mix run examples/embeddings.exs
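
For the semantic-search use case, embeddings can be compared with cosine similarity in plain Elixir. In the sketch below, query_vec and doc_vec are hypothetical stand-ins for two vectors extracted from the embed outputs, each assumed to be a flat list of floats.

# Cosine similarity between two embedding vectors.
cosine = fn a, b ->
  dot = Enum.zip(a, b) |> Enum.map(fn {x, y} -> x * y end) |> Enum.sum()
  norm = fn v -> :math.sqrt(Enum.reduce(v, 0.0, &(&2 + &1 * &1))) end
  dot / (norm.(a) * norm.(b))
end

IO.puts("similarity: #{cosine.(query_vec, doc_vec)}")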


LoRA Adapters (lora.exs)

Fine-tuned model serving:

  • Loading LoRA adapters
  • Multi-LoRA serving
  • Configuration
{:ok, llm} = Vllm.LLM.new("meta-llama/Llama-2-7b-hf", enable_lora: true)
llm_ref = SnakeBridge.Ref.from_wire_format(llm)

runtime_opts =
  case llm_ref.pool_name do
    nil -> [session_id: llm_ref.session_id]
    pool_name -> [session_id: llm_ref.session_id, pool_name: pool_name]
  end

{:ok, lora} =
  Vllm.BeamSearch.LoRARequest.new(["my-adapter", 1, "/path/to/adapter"], __runtime__: runtime_opts)

{:ok, outputs} = Vllm.LLM.generate(llm, [prompt], [], lora_request: lora, __runtime__: runtime_opts)

Run: mix run examples/lora.exs
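
Multi-LoRA serving follows the same pattern: register one LoRARequest per adapter, each with a unique integer ID, and pass the relevant request to each generate call. A sketch with illustrative adapter names and paths:

# Two adapters with distinct integer IDs (names and paths are illustrative).
{:ok, sql_lora} =
  Vllm.BeamSearch.LoRARequest.new(["sql-adapter", 1, "/path/to/sql-adapter"], __runtime__: runtime_opts)

{:ok, chat_lora} =
  Vllm.BeamSearch.LoRARequest.new(["chat-adapter", 2, "/path/to/chat-adapter"], __runtime__: runtime_opts)

{:ok, sql_out} =
  Vllm.LLM.generate(llm, ["List all users."], [], lora_request: sql_lora, __runtime__: runtime_opts)

{:ok, chat_out} =
  Vllm.LLM.generate(llm, ["Say hello."], [], lora_request: chat_lora, __runtime__: runtime_opts)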


Timeout Configuration (timeout_config.exs)

Configure timeouts for long-running operations:

  • Timeout profiles
  • Per-call overrides
  • Helper functions
llm_ref = SnakeBridge.Ref.from_wire_format(llm)

runtime_opts =
  case llm_ref.pool_name do
    nil -> [session_id: llm_ref.session_id]
    pool_name -> [session_id: llm_ref.session_id, pool_name: pool_name]
  end

{:ok, outputs} =
  Vllm.LLM.generate(llm, prompts, [],
    sampling_params: params,
    __runtime__: Keyword.merge(runtime_opts, timeout_profile: :batch_job)
  )

Run: mix run examples/timeout_config.exs
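
The helper functions mentioned above can be as small as a closure that merges a profile into the runtime options before each call; this sketch reuses only names defined earlier in the example.

# Wrap generate so every call picks up a given timeout profile.
generate_with_profile = fn profile, prompts ->
  Vllm.LLM.generate(llm, prompts, [],
    sampling_params: params,
    __runtime__: Keyword.merge(runtime_opts, timeout_profile: profile)
  )
end

{:ok, outputs} = generate_with_profile.(:batch_job, prompts)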


Wrapper API (direct_api.exs)

Demonstrates wrapper-only usage:

  1. Generated wrappers (type-safe): Vllm.LLM.new/2, Vllm.SamplingParams.new/2
  2. Runtime attribute access for Python refs via SnakeBridge.Runtime.get_attr/2
{:ok, llm} = Vllm.LLM.new("facebook/opt-125m")
llm_ref = SnakeBridge.Ref.from_wire_format(llm)

runtime_opts =
  case llm_ref.pool_name do
    nil -> [session_id: llm_ref.session_id]
    pool_name -> [session_id: llm_ref.session_id, pool_name: pool_name]
  end

{:ok, params} = Vllm.SamplingParams.new([], temperature: 0.8, __runtime__: runtime_opts)
{:ok, outputs} = Vllm.LLM.generate(llm, ["Hello"], [], sampling_params: params, __runtime__: runtime_opts)

Run: mix run examples/direct_api.exs
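
For the runtime attribute access in point 2, a hedged sketch follows. It assumes SnakeBridge.Runtime.get_attr/2 takes a ref and an attribute name and returns {:ok, value}; the attribute name itself is illustrative.

# Read a Python attribute off the underlying ref (attribute name is illustrative).
{:ok, value} = SnakeBridge.Runtime.get_attr(llm_ref, "llm_engine")
IO.inspect(value)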


Running All Examples

The run_all.sh script runs all examples sequentially with:

  • Colorized output
  • Per-example timing
  • Pass/fail summary
  • Automatic timeout handling
# Run with default timeout
./examples/run_all.sh

# Run with custom timeout (300s per example)
VLLM_RUN_TIMEOUT_SECONDS=300 ./examples/run_all.sh

# Disable timeout
VLLM_RUN_TIMEOUT_SECONDS=0 ./examples/run_all.sh

Example Index

Example                   Focus            Description
basic.exs                 Core             Simple text generation
sampling_params.exs       Core             Generation control parameters
chat.exs                  Core             Chat completions
batch_inference.exs       Performance      High-throughput batching
structured_output.exs     Advanced         Constrained generation
quantization.exs          Advanced         Memory-efficient models
multi_gpu.exs             Advanced         Distributed inference
embeddings.exs            Advanced         Vector embeddings
lora.exs                  Advanced         Fine-tuned adapters
timeout_config.exs        Configuration    Timeout settings
direct_api.exs            Advanced         Wrapper-only API usage

Troubleshooting

No CUDA-Capable GPU / CUDA Errors

CUDA error: no kernel image is available for execution on the device

or

RuntimeError: CUDA error
  • vLLM requires a CUDA-capable NVIDIA GPU; it cannot run on CPU-only systems
  • Verify your GPU is detected: nvidia-smi
  • Ensure CUDA toolkit is properly installed
  • Check GPU compute capability matches vLLM requirements (compute capability 7.0+)

CUDA Out of Memory

CUDA out of memory
  • Reduce gpu_memory_utilization (see the sketch below)
  • Use a smaller model
  • Use a quantized model
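
For example, leave more headroom when constructing the LLM (the 0.7 value below is illustrative; tune it for your card):

# Lower the fraction of GPU memory vLLM is allowed to claim.
{:ok, llm} = Vllm.LLM.new("facebook/opt-125m", gpu_memory_utilization: 0.7)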

Model Not Found

Model not found
  • Check model name on HuggingFace
  • Check internet connection

Timeout Errors

For long-running operations, increase the timeout:

Vllm.LLM.generate(llm, prompts, [],
  __runtime__: Keyword.merge(runtime_opts, timeout_profile: :batch_job)
)

Python/vLLM Not Installed

Module vllm not found

Run: mix snakebridge.setup