VLLM

Easy, fast, and cheap LLM serving for everyone - in Elixir

Hex.pm Documentation License: MIT

| Documentation | GitHub | vLLM Python Docs |


About

VLLM is an Elixir client for vLLM, the high-throughput LLM inference engine. It provides transparent access to vLLM's powerful features through SnakeBridge.

Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

vLLM is fast with:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Continuous batching of incoming requests
  • Fast model execution with CUDA/HIP graph
  • Quantizations: GPTQ, AWQ, AutoRound, INT4, INT8, and FP8
  • Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
  • Speculative decoding
  • Chunked prefill

vLLM is flexible and easy to use with:

  • Seamless integration with popular Hugging Face models
  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
  • Tensor, pipeline, data and expert parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server
  • Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, Arm CPUs, and TPU
  • Prefix caching support
  • Multi-LoRA support

vLLM seamlessly supports most popular open-source models on HuggingFace, including:

  • Transformer-like LLMs (e.g., Llama)
  • Mixture-of-Experts LLMs (e.g., Mixtral, DeepSeek-V2 and V3)
  • Embedding Models (e.g., E5-Mistral)
  • Multi-modal LLMs (e.g., LLaVA)

Requirements

IMPORTANT: This library requires a CUDA-capable NVIDIA GPU; it cannot run on CPU-only systems.

  • NVIDIA GPU with CUDA support (compute capability 7.0+)
  • CUDA toolkit installed
  • 8GB+ GPU memory recommended (varies by model)

Verify your GPU setup:

nvidia-smi

Installation

Add vllm to your list of dependencies in mix.exs:

def deps do
  [
    {:vllm, "~> 0.1.1"}
  ]
end

Then fetch dependencies and set up Python:

mix deps.get
mix snakebridge.setup

Quick Start

Basic Text Generation

VLLM.run(fn ->
  # Create an LLM instance
  llm = VLLM.llm!("facebook/opt-125m")

  # Generate text
  outputs = VLLM.generate!(llm, ["Hello, my name is"])

  # Process results
  Enum.each(outputs, fn output ->
    prompt = VLLM.attr!(output, "prompt")
    generated = VLLM.attr!(output, "outputs") |> Enum.at(0)
    text = VLLM.attr!(generated, "text")
    IO.puts("#{prompt}#{text}")
  end)
end)

Chat Completions

VLLM.run(fn ->
  llm = VLLM.llm!("Qwen/Qwen2-0.5B-Instruct")

  messages = [[
    %{"role" => "system", "content" => "You are a helpful assistant."},
    %{"role" => "user", "content" => "What is the capital of France?"}
  ]]

  outputs = VLLM.chat!(llm, messages)

  output = Enum.at(outputs, 0)
  completion = VLLM.attr!(output, "outputs") |> Enum.at(0)
  IO.puts(VLLM.attr!(completion, "text"))
end)

Sampling Parameters

VLLM.run(fn ->
  llm = VLLM.llm!("facebook/opt-125m")

  params = VLLM.sampling_params!(
    temperature: 0.8,
    top_p: 0.95,
    max_tokens: 100
  )

  outputs = VLLM.generate!(llm, ["Once upon a time"], sampling_params: params)
end)

Batch Processing

VLLM.run(fn ->
  llm = VLLM.llm!("facebook/opt-125m")
  params = VLLM.sampling_params!(temperature: 0.7, max_tokens: 50)

  prompts = [
    "The meaning of life is",
    "Artificial intelligence will",
    "The best programming language is"
  ]

  # Process all prompts efficiently with continuous batching
  outputs = VLLM.generate!(llm, prompts, sampling_params: params)
end)

Features

Quantization

Load quantized models for memory-efficient inference:

llm = VLLM.llm!("TheBloke/Llama-2-7B-AWQ", quantization: "awq")
llm = VLLM.llm!("TheBloke/Llama-2-7B-GPTQ", quantization: "gptq")

Multi-GPU / Tensor Parallelism

Distribute large models across multiple GPUs:

llm = VLLM.llm!("meta-llama/Llama-2-13b-hf",
  tensor_parallel_size: 2,
  gpu_memory_utilization: 0.9
)

LoRA Adapters

Serve fine-tuned models with LoRA:

llm = VLLM.llm!("meta-llama/Llama-2-7b-hf", enable_lora: true)
lora = VLLM.lora_request!("my-adapter", 1, "/path/to/adapter")
outputs = VLLM.generate!(llm, prompt, lora_request: lora)

Structured Outputs

Constrain generation with JSON schema, regex, or choices:

# JSON schema
guided = VLLM.guided_decoding_params!(
  json: ~s({"type": "object", "properties": {"name": {"type": "string"}}})
)

# Regex pattern
guided = VLLM.guided_decoding_params!(regex: "[0-9]{3}-[0-9]{4}")

# Choice
guided = VLLM.guided_decoding_params!(choice: ["yes", "no", "maybe"])
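
To actually constrain generation, the guided params must accompany the request. A minimal usage sketch, assuming this wrapper mirrors vLLM's Python API, where guided decoding is attached to the sampling params via a guided_decoding option (that option name is an assumption, not confirmed above):

# Assumption: guided_decoding mirrors Python's SamplingParams(guided_decoding=...)
guided = VLLM.guided_decoding_params!(choice: ["yes", "no", "maybe"])
params = VLLM.sampling_params!(temperature: 0.0, guided_decoding: guided)

outputs = VLLM.generate!(llm, ["Is Elixir dynamically typed?"], sampling_params: params)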

Embeddings

Generate vector embeddings with pooling models:

llm = VLLM.llm!("intfloat/e5-mistral-7b-instruct", runner: "pooling")
embeddings = VLLM.embed!(llm, ["Hello, world!", "How are you?"])
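
The embed! call returns one result per input string. A hedged sketch for reading the vectors back, assuming each result exposes an "outputs" attribute with an "embedding" list of floats, mirroring vLLM's Python EmbeddingRequestOutput (an assumption, not shown above):

# Assumption: results mirror Python's output.outputs.embedding
Enum.each(embeddings, fn result ->
  vector = VLLM.attr!(result, "outputs") |> VLLM.attr!("embedding")
  IO.puts("embedding dimension: #{length(vector)}")
end)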

Timeout Configuration

VLLM uses SnakeBridge's timeout architecture optimized for ML workloads:

Profile         Timeout   Use Case
:default        2 min     Standard Python calls
:streaming      30 min    Streaming responses
:ml_inference   10 min    LLM inference (default)
:batch_job      1 hour    Long-running batches

Override per-call:

VLLM.generate!(llm, prompts,
  sampling_params: params,
  __runtime__: [timeout_profile: :batch_job]
)

Architecture

Elixir (VLLM)
      ↓
SnakeBridge.call/4
      ↓
Snakepit gRPC
      ↓
Python vLLM
      ↓
GPU/TPU Inference

Documentation

Examples

See the examples for comprehensive usage:

  • basic.exs - Simple text generation
  • sampling_params.exs - Generation control
  • chat.exs - Chat completions
  • batch_inference.exs - High-throughput batching
  • structured_output.exs - Constrained generation
  • quantization.exs - Memory-efficient models
  • multi_gpu.exs - Distributed inference
  • embeddings.exs - Vector embeddings
  • lora.exs - Fine-tuned adapters
  • timeout_config.exs - Timeout settings
  • direct_api.exs - Raw Python access

Run all examples:

./examples/run_all.sh

Citation

If you use vLLM for your research, please cite the paper:

@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}

License

MIT License - see LICENSE for details.