VLLM

Easy, fast, and cheap LLM serving for everyone - in Elixir

Hex.pm Documentation License: MIT

| Documentation | GitHub | vLLM Python Docs |


About

VLLM is an Elixir client for vLLM, the high-throughput LLM inference engine. It provides transparent access to vLLM's powerful features through SnakeBridge.

Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

vLLM is fast with:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Continuous batching of incoming requests
  • Fast model execution with CUDA/HIP graph
  • Quantizations: GPTQ, AWQ, AutoRound, INT4, INT8, and FP8
  • Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
  • Speculative decoding
  • Chunked prefill

vLLM is flexible and easy to use with:

  • Seamless integration with popular Hugging Face models
  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
  • Tensor, pipeline, data and expert parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server
  • Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, Arm CPUs, and TPU
  • Prefix caching support
  • Multi-LoRA support

vLLM seamlessly supports most popular open-source models on HuggingFace, including:

  • Transformer-like LLMs (e.g., Llama)
  • Mixture-of-Experts LLMs (e.g., Mixtral, DeepSeek-V2 and V3)
  • Embedding Models (e.g., E5-Mistral)
  • Multi-modal LLMs (e.g., LLaVA)

Requirements

IMPORTANT: This library requires a CUDA-capable NVIDIA GPU; it cannot run on CPU-only systems.

  • NVIDIA GPU with CUDA support (compute capability 7.0+)
  • CUDA toolkit installed
  • 8GB+ GPU memory recommended (varies by model)

Verify your GPU setup:

nvidia-smi

Installation

Add vllm to your list of dependencies in mix.exs:

def deps do
  [
    {:vllm, "~> 0.1.1"}
  ]
end

Then fetch dependencies and set up Python:

mix deps.get
mix snakebridge.setup

Quick Start

Basic Text Generation

VLLM.run(fn ->
  # Create an LLM instance
  llm = VLLM.llm!("facebook/opt-125m")

  # Generate text
  outputs = VLLM.generate!(llm, ["Hello, my name is"])

  # Process results
  Enum.each(outputs, fn output ->
    prompt = VLLM.attr!(output, "prompt")
    generated = VLLM.attr!(output, "outputs") |> Enum.at(0)
    text = VLLM.attr!(generated, "text")
    IO.puts("#{prompt}#{text}")
  end)
end)

Chat Completions

VLLM.run(fn ->
  llm = VLLM.llm!("Qwen/Qwen2-0.5B-Instruct")

  messages = [[
    %{"role" => "system", "content" => "You are a helpful assistant."},
    %{"role" => "user", "content" => "What is the capital of France?"}
  ]]

  outputs = VLLM.chat!(llm, messages)

  output = Enum.at(outputs, 0)
  completion = VLLM.attr!(output, "outputs") |> Enum.at(0)
  IO.puts(VLLM.attr!(completion, "text"))
end)

Sampling Parameters

VLLM.run(fn ->
  llm = VLLM.llm!("facebook/opt-125m")

  params = VLLM.sampling_params!(
    temperature: 0.8,
    top_p: 0.95,
    max_tokens: 100
  )

  outputs = VLLM.generate!(llm, ["Once upon a time"], sampling_params: params)
end)

Batch Processing

VLLM.run(fn ->
  llm = VLLM.llm!("facebook/opt-125m")
  params = VLLM.sampling_params!(temperature: 0.7, max_tokens: 50)

  prompts = [
    "The meaning of life is",
    "Artificial intelligence will",
    "The best programming language is"
  ]

  # Process all prompts efficiently with continuous batching
  outputs = VLLM.generate!(llm, prompts, sampling_params: params)
end)

Features

Quantization

Load quantized models for memory-efficient inference:

llm = VLLM.llm!("TheBloke/Llama-2-7B-AWQ", quantization: "awq")
llm = VLLM.llm!("TheBloke/Llama-2-7B-GPTQ", quantization: "gptq")

Multi-GPU / Tensor Parallelism

Distribute large models across multiple GPUs:

llm = VLLM.llm!("meta-llama/Llama-2-13b-hf",
  tensor_parallel_size: 2,
  gpu_memory_utilization: 0.9
)

LoRA Adapters

Serve fine-tuned models with LoRA:

llm = VLLM.llm!("meta-llama/Llama-2-7b-hf", enable_lora: true)
lora = VLLM.lora_request!("my-adapter", 1, "/path/to/adapter")
outputs = VLLM.generate!(llm, prompt, lora_request: lora)

Structured Outputs

Constrain generation with JSON schema, regex, or choices:

# JSON schema
guided = VLLM.guided_decoding_params!(
  json: ~s({"type": "object", "properties": {"name": {"type": "string"}}})
)

# Regex pattern
guided = VLLM.guided_decoding_params!(regex: "[0-9]{3}-[0-9]{4}")

# Choice
guided = VLLM.guided_decoding_params!(choice: ["yes", "no", "maybe"])
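
To actually constrain generation, the guided params must accompany the request. A minimal usage sketch, assuming this wrapper mirrors vLLM's Python API, where guided decoding is attached to the sampling params via a guided_decoding option (that option name is an assumption, not confirmed above):

# Assumption: guided_decoding mirrors Python's SamplingParams(guided_decoding=...)
guided = VLLM.guided_decoding_params!(choice: ["yes", "no", "maybe"])
params = VLLM.sampling_params!(temperature: 0.0, guided_decoding: guided)

outputs = VLLM.generate!(llm, ["Is Elixir dynamically typed?"], sampling_params: params)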

Embeddings

Generate vector embeddings with pooling models:

llm = VLLM.llm!("intfloat/e5-mistral-7b-instruct", runner: "pooling")
embeddings = VLLM.embed!(llm, ["Hello, world!", "How are you?"])
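
The embed! call returns one result per input string. A hedged sketch for reading the vectors back, assuming each result exposes an "outputs" attribute with an "embedding" list of floats, mirroring vLLM's Python EmbeddingRequestOutput (an assumption, not shown above):

# Assumption: results mirror Python's output.outputs.embedding
Enum.each(embeddings, fn result ->
  vector = VLLM.attr!(result, "outputs") |> VLLM.attr!("embedding")
  IO.puts("embedding dimension: #{length(vector)}")
end)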

Timeout Configuration

VLLM uses SnakeBridge's timeout architecture optimized for ML workloads:

Profile         Timeout   Use Case
:default        2 min     Standard Python calls
:streaming      30 min    Streaming responses
:ml_inference   10 min    LLM inference (default)
:batch_job      1 hour    Long-running batches

Override per-call:

VLLM.generate!(llm, prompts,
  sampling_params: params,
  __runtime__: [timeout_profile: :batch_job]
)

Architecture

Elixir (VLLM)
      ↓
SnakeBridge.call/4
      ↓
Snakepit gRPC
      ↓
Python vLLM
      ↓
GPU/TPU Inference

Documentation

Examples

See the examples for comprehensive usage:

  • basic.exs - Simple text generation
  • sampling_params.exs - Generation control
  • chat.exs - Chat completions
  • batch_inference.exs - High-throughput batching
  • structured_output.exs - Constrained generation
  • quantization.exs - Memory-efficient models
  • multi_gpu.exs - Distributed inference
  • embeddings.exs - Vector embeddings
  • lora.exs - Fine-tuned adapters
  • timeout_config.exs - Timeout settings
  • direct_api.exs - Raw Python access

Run all examples:

./examples/run_all.sh

Citation

If you use vLLM for your research, please cite the paper:

@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}

License

MIT License - see LICENSE for details.