Nous.Providers.VLLM (nous v0.13.3)
vLLM provider implementation.
vLLM is a high-performance inference engine that provides an
OpenAI-compatible API. By default it runs on http://localhost:8000/v1.
Configuration
No API key is required for local usage. Configure the base URL if needed:
config :nous, :vllm,
  base_url: "http://localhost:8000/v1"

Or use an environment variable:

export VLLM_BASE_URL="http://localhost:8000/v1"

Usage
# Via Model.parse
model = Nous.Model.parse("vllm:meta-llama/Llama-3-8B-Instruct")
# Direct provider usage
{:ok, response} = Nous.Providers.VLLM.chat(%{
  "model" => "meta-llama/Llama-3-8B-Instruct",
  "messages" => [%{"role" => "user", "content" => "Hello"}]
})

Features
vLLM supports:
- OpenAI-compatible chat completions
- Streaming responses
- High-throughput batched inference
- PagedAttention for memory efficiency
- Tensor parallelism for multi-GPU
vLLM-Specific Parameters
Additional parameters supported (pass in params map):
- best_of - Number of outputs to generate, returning the best
- use_beam_search - Use beam search instead of sampling
- ignore_eos - Ignore the end-of-sequence token
- skip_special_tokens - Skip special tokens in the output
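For illustration, these vLLM-specific options go in the same params map passed to chat (a sketch based on the list above; the model name and option values are placeholders):

```elixir
# Standard chat fields plus vLLM-specific sampling options.
params = %{
  "model" => "meta-llama/Llama-3-8B-Instruct",
  "messages" => [%{"role" => "user", "content" => "Summarize vLLM in one line"}],
  # vLLM extensions: generate 3 candidates via beam search, keep the best,
  # and strip special tokens from the returned text.
  "best_of" => 3,
  "use_beam_search" => true,
  "skip_special_tokens" => true
}

# {:ok, response} = Nous.Providers.VLLM.chat(params)
```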
Summary
Functions
Get the API key from options, environment, or application config.
Get the base URL from options, application config, or default.
Count tokens in messages (rough estimate).
High-level request with message conversion, telemetry, and error wrapping.
High-level streaming request with message conversion and telemetry.
Functions
Get the API key from options, environment, or application config.
Lookup order:
- :api_key option passed directly
- Environment variable (VLLM_API_KEY)
- Application config:
  config :nous, :vllm, api_key: "..."
Get the base URL from options, application config, or default.
Lookup order:
- :base_url option passed directly
- Application config:
  config :nous, :vllm, base_url: "..."
- Default: http://localhost:8000/v1
Count tokens in messages (rough estimate).
Override this in your provider for more accurate counting.
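As a sketch of what an override might look like, here is a character-based estimate; the ~4 characters per token heuristic and the per-message overhead are assumptions for illustration, not the library's actual formula:

```elixir
defmodule MyProvider.TokenCount do
  # Rough estimate: ~4 characters per token, plus a small fixed
  # per-message overhead for role and framing tokens (both values
  # are assumptions, not measured for any particular tokenizer).
  def count_tokens(messages) do
    Enum.reduce(messages, 0, fn %{"content" => content}, acc ->
      acc + div(String.length(content), 4) + 4
    end)
  end
end
```

A provider that needs accurate counts would replace this with a real tokenizer for its model family.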
High-level request with message conversion, telemetry, and error wrapping.
Default implementation that:
- Converts messages to provider format
- Builds request params
- Calls chat/2
- Parses response
- Emits telemetry events
- Wraps errors
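The steps above can be sketched as a `with` pipeline. All helper names and return shapes below are hypothetical stubs standing in for the library's internals; only the overall flow mirrors the list above:

```elixir
defmodule RequestPipeline do
  # Hypothetical sketch of the default high-level request flow.
  # The real implementation converts messages, calls chat/2, parses
  # the response, emits telemetry, and wraps errors; the stubs here
  # just make the shape of that pipeline concrete.
  def request(messages, params) do
    started_at = System.monotonic_time()

    with {:ok, converted} <- convert_messages(messages),
         body = Map.put(params, "messages", converted),
         {:ok, raw} <- chat(body),
         {:ok, response} <- parse_response(raw) do
      emit_telemetry(:stop, started_at)
      {:ok, response}
    else
      {:error, reason} ->
        emit_telemetry(:exception, started_at)
        {:error, wrap_error(reason)}
    end
  end

  # Stub steps; real versions would do provider-specific work.
  defp convert_messages(msgs), do: {:ok, msgs}
  defp chat(body), do: {:ok, %{"choices" => [], "echo" => body}}
  defp parse_response(raw), do: {:ok, raw}
  defp emit_telemetry(_event, _started_at), do: :ok
  defp wrap_error(reason), do: {:provider_error, reason}
end
```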
High-level streaming request with message conversion and telemetry.