# Quickstart Guide
This guide will help you get started with VLLM for Elixir, providing high-throughput LLM inference via vLLM.
## Prerequisites
- Elixir 1.18 or later
- Python 3.8 or later
- CUDA-capable GPU (recommended) or CPU-only mode
## Installation
Add `:vllm` to your `mix.exs` dependencies:

```elixir
def deps do
  [
    {:vllm, "~> 0.1.0"}
  ]
end
```

Fetch dependencies and set up the Python environment:
```bash
mix deps.get
mix snakebridge.setup
```
This will install vLLM and its dependencies in a managed Python environment.
## Your First Generation
Here's a minimal example to generate text:
```elixir
VLLM.run(fn ->
  # Load a small model for testing
  llm = VLLM.llm!("facebook/opt-125m")

  # Generate completions
  outputs = VLLM.generate!(llm, "Hello, my name is")

  # Print the result
  output = Enum.at(outputs, 0)
  completion = VLLM.attr!(output, "outputs") |> Enum.at(0)
  text = VLLM.attr!(completion, "text")
  IO.puts(text)
end)
```

Save this as `hello_vllm.exs` and run:
```bash
mix run hello_vllm.exs
```
## Understanding the Output
vLLM returns `RequestOutput` objects with the following structure:

- `prompt` - The original input prompt
- `outputs` - List of `CompletionOutput` objects, each with:
  - `text` - The generated text
  - `token_ids` - List of generated token IDs
  - `finish_reason` - Why generation stopped (`"length"`, `"stop"`, etc.)
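You can read any of these attributes with `VLLM.attr!`, just as in the first example. A minimal sketch, assuming `output` is one element of the list returned by `VLLM.generate!` and that `token_ids` comes back as a plain Elixir list:

```elixir
# Sketch: pull the commonly used fields out of a RequestOutput.
completion = VLLM.attr!(output, "outputs") |> Enum.at(0)

text = VLLM.attr!(completion, "text")
token_ids = VLLM.attr!(completion, "token_ids")
finish_reason = VLLM.attr!(completion, "finish_reason")

IO.puts("Generated #{length(token_ids)} tokens (#{finish_reason}): #{text}")
```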
## Controlling Generation
Use `SamplingParams` to control text generation:
```elixir
VLLM.run(fn ->
  llm = VLLM.llm!("facebook/opt-125m")

  # Create sampling parameters
  params = VLLM.sampling_params!(
    temperature: 0.8,  # Higher = more random
    top_p: 0.95,       # Nucleus sampling
    max_tokens: 100,   # Maximum tokens to generate
    stop: ["\n"]       # Stop at newline
  )

  outputs = VLLM.generate!(llm, "The secret to happiness is",
    sampling_params: params
  )

  # Print the result
  completion = outputs |> Enum.at(0) |> VLLM.attr!("outputs") |> Enum.at(0)
  IO.puts(VLLM.attr!(completion, "text"))
end)
```
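In vLLM, a `temperature` of `0.0` disables sampling entirely: decoding becomes greedy, so repeated runs of the same prompt produce the same text. A minimal sketch, reusing `llm` from the example above:

```elixir
# Sketch: deterministic (greedy) decoding - temperature 0.0 always
# picks the most likely next token.
greedy = VLLM.sampling_params!(temperature: 0.0, max_tokens: 100)
outputs = VLLM.generate!(llm, "The secret to happiness is", sampling_params: greedy)

completion = outputs |> Enum.at(0) |> VLLM.attr!("outputs") |> Enum.at(0)
IO.puts(VLLM.attr!(completion, "text"))
```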
## Chat Mode

For instruction-tuned models, use the chat interface:
```elixir
VLLM.run(fn ->
  llm = VLLM.llm!("Qwen/Qwen2-0.5B-Instruct")

  messages = [[
    %{"role" => "system", "content" => "You are a helpful assistant."},
    %{"role" => "user", "content" => "Explain quantum computing in simple terms."}
  ]]

  params = VLLM.sampling_params!(temperature: 0.7, max_tokens: 200)
  outputs = VLLM.chat!(llm, messages, sampling_params: params)

  # Print the assistant's reply
  completion = outputs |> Enum.at(0) |> VLLM.attr!("outputs") |> Enum.at(0)
  IO.puts(VLLM.attr!(completion, "text"))
end)
```
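The chat call is stateless: each invocation renders the full message list through the model's chat template. To continue a conversation, append the assistant's reply and the next user turn, then call `VLLM.chat!` again. A minimal sketch, assuming `llm` and `params` from the example above and `reply` holding the extracted assistant text:

```elixir
# Sketch: multi-turn chat by resending the growing message list.
followup = [[
  %{"role" => "system", "content" => "You are a helpful assistant."},
  %{"role" => "user", "content" => "Explain quantum computing in simple terms."},
  %{"role" => "assistant", "content" => reply},
  %{"role" => "user", "content" => "Now explain it to a five-year-old."}
]]

outputs = VLLM.chat!(llm, followup, sampling_params: params)
# Extract the new reply from `outputs` exactly as before.
```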
## Batch Processing

Process multiple prompts efficiently:
```elixir
VLLM.run(fn ->
  llm = VLLM.llm!("facebook/opt-125m")
  params = VLLM.sampling_params!(temperature: 0.7, max_tokens: 50)

  prompts = [
    "The capital of France is",
    "Machine learning is",
    "The best way to learn programming is"
  ]

  # vLLM processes these efficiently with continuous batching
  outputs = VLLM.generate!(llm, prompts, sampling_params: params)

  Enum.each(outputs, fn output ->
    prompt = VLLM.attr!(output, "prompt")
    completion = VLLM.attr!(output, "outputs") |> Enum.at(0)
    IO.puts("#{prompt}#{VLLM.attr!(completion, "text")}")
  end)
end)
```
## Next Steps

- Sampling Parameters - Fine-tune generation behavior
- Configuration - Model and engine options
- Supported Models - Full list of supported models
- Examples - Comprehensive code examples