Nous.Providers.VLLM (nous v0.13.3)
vLLM provider implementation.
vLLM is a high-performance inference engine that provides an
OpenAI-compatible API. By default it runs on http://localhost:8000/v1.
Configuration
No API key is required for local usage. Configure the base URL if needed:
config :nous, :vllm,
  base_url: "http://localhost:8000/v1"

Or use an environment variable:

export VLLM_BASE_URL="http://localhost:8000/v1"

Usage
# Via Model.parse
model = Nous.Model.parse("vllm:meta-llama/Llama-3-8B-Instruct")
# Direct provider usage
{:ok, response} = Nous.Providers.VLLM.chat(%{
  "model" => "meta-llama/Llama-3-8B-Instruct",
  "messages" => [%{"role" => "user", "content" => "Hello"}]
})

Features
vLLM supports:
- OpenAI-compatible chat completions
- Streaming responses
- High-throughput batched inference
- PagedAttention for memory efficiency
- Tensor parallelism for multi-GPU
vLLM-Specific Parameters
Additional parameters supported (pass in params map):
- best_of - Number of outputs to generate, returning the best
- use_beam_search - Use beam search instead of sampling
- ignore_eos - Ignore the end-of-sequence token
- skip_special_tokens - Skip special tokens in the output
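For illustration, these vLLM-specific options go in the same params map passed to chat (a sketch based on the list above; the model name and option values are placeholders):

```elixir
# Standard chat fields plus vLLM-specific sampling options.
params = %{
  "model" => "meta-llama/Llama-3-8B-Instruct",
  "messages" => [%{"role" => "user", "content" => "Summarize vLLM in one line"}],
  # vLLM extensions: generate 3 candidates via beam search, keep the best,
  # and strip special tokens from the returned text.
  "best_of" => 3,
  "use_beam_search" => true,
  "skip_special_tokens" => true
}

# {:ok, response} = Nous.Providers.VLLM.chat(params)
```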
Summary
Functions
Get the API key from options, environment, or application config.
Get the base URL from options, application config, or default.
Count tokens in messages (rough estimate).
High-level request with message conversion, telemetry, and error wrapping.
High-level streaming request with message conversion and telemetry.
Functions
Get the API key from options, environment, or application config.
Lookup order:
- :api_key option passed directly
- Environment variable (VLLM_API_KEY)
- Application config:
  config :nous, :vllm, api_key: "..."
Get the base URL from options, application config, or default.
Lookup order:
- :base_url option passed directly
- Application config:
  config :nous, :vllm, base_url: "..."
- Default: http://localhost:8000/v1
Count tokens in messages (rough estimate).
Override this in your provider for more accurate counting.
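As a sketch of what an override might look like, here is a character-based estimate; the ~4 characters per token heuristic and the per-message overhead are assumptions for illustration, not the library's actual formula:

```elixir
defmodule MyProvider.TokenCount do
  # Rough estimate: ~4 characters per token, plus a small fixed
  # per-message overhead for role and framing tokens (both values
  # are assumptions, not measured for any particular tokenizer).
  def count_tokens(messages) do
    Enum.reduce(messages, 0, fn %{"content" => content}, acc ->
      acc + div(String.length(content), 4) + 4
    end)
  end
end
```

A provider that needs accurate counts would replace this with a real tokenizer for its model family.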
High-level request with message conversion, telemetry, and error wrapping.
Default implementation that:
- Converts messages to provider format
- Builds request params
- Calls chat/2
- Parses response
- Emits telemetry events
- Wraps errors
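The steps above can be sketched as a `with` pipeline. All helper names and return shapes below are hypothetical stubs standing in for the library's internals; only the overall flow mirrors the list above:

```elixir
defmodule RequestPipeline do
  # Hypothetical sketch of the default high-level request flow.
  # The real implementation converts messages, calls chat/2, parses
  # the response, emits telemetry, and wraps errors; the stubs here
  # just make the shape of that pipeline concrete.
  def request(messages, params) do
    started_at = System.monotonic_time()

    with {:ok, converted} <- convert_messages(messages),
         body = Map.put(params, "messages", converted),
         {:ok, raw} <- chat(body),
         {:ok, response} <- parse_response(raw) do
      emit_telemetry(:stop, started_at)
      {:ok, response}
    else
      {:error, reason} ->
        emit_telemetry(:exception, started_at)
        {:error, wrap_error(reason)}
    end
  end

  # Stub steps; real versions would do provider-specific work.
  defp convert_messages(msgs), do: {:ok, msgs}
  defp chat(body), do: {:ok, %{"choices" => [], "echo" => body}}
  defp parse_response(raw), do: {:ok, raw}
  defp emit_telemetry(_event, _started_at), do: :ok
  defp wrap_error(reason), do: {:provider_error, reason}
end
```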
High-level streaming request with message conversion and telemetry.