GenServer for continuous batched multi-sequence inference.
Manages a shared model/context and serves multiple concurrent callers using a slot pool with continuous batching — one forward pass per tick with decode tokens and prefill chunks mixed in a single batch.
Example
{:ok, server} = LlamaCppEx.Server.start_link(
model_path: "model.gguf",
n_gpu_layers: -1,
n_parallel: 4,
n_ctx: 8192
)
# Sync generation
{:ok, text} = LlamaCppEx.Server.generate(server, "Once upon a time", max_tokens: 100)
# Streaming
LlamaCppEx.Server.stream(server, "Tell me a story", max_tokens: 200)
|> Enum.each(&IO.write/1)

Telemetry
The server emits the following telemetry events:
[:llama_cpp_ex, :server, :tick]
Emitted after each batch forward pass.
Measurements:
- :batch_size - Total tokens in the batch.
- :decode_tokens - Number of decode (generation) tokens.
- :prefill_tokens - Number of prefill (prompt) tokens.
- :active_slots - Slots currently prefilling or generating.
- :queue_depth - Requests waiting for a slot.
- :eval_ms - Forward pass wall time in milliseconds.
Metadata:
- :server - PID of the server process.
[:llama_cpp_ex, :server, :request, :done]
Emitted when a request (generate or stream) completes.
Measurements:
- :prompt_tokens - Number of prompt tokens.
- :generated_tokens - Number of tokens generated.
- :duration_ms - Total request duration in milliseconds.
- :ttft_ms - Time to first token in milliseconds.
- :prompt_eval_rate - Prompt evaluation speed (tokens/sec).
- :generation_rate - Generation speed (tokens/sec).
- :prefix_cache_tokens - Number of prompt tokens skipped via prefix cache.
- :prefix_cache_ratio - Ratio of cached to total prompt tokens (0.0–1.0).
Metadata:
- :server - PID of the server process.
- :seq_id - Slot sequence ID (integer).
- :mode - :generate or :stream.
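The events above can be consumed with a standard :telemetry handler. A minimal sketch that logs batch utilization after each tick (the handler id and log format are illustrative; only the event name and measurement keys come from the tables above):

```elixir
:telemetry.attach(
  "llama-server-tick-logger",
  [:llama_cpp_ex, :server, :tick],
  fn _event, measurements, _metadata, _config ->
    # Log how full each forward pass was and how long it took.
    IO.puts(
      "tick: batch=#{measurements.batch_size} " <>
        "decode=#{measurements.decode_tokens} " <>
        "prefill=#{measurements.prefill_tokens} " <>
        "queue=#{measurements.queue_depth} " <>
        "eval=#{measurements.eval_ms}ms"
    )
  end,
  nil
)
```

In production you would typically forward these measurements to a metrics library rather than log them directly.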
Summary
Functions
Returns a specification to start this module under a supervisor.
Generates text synchronously. Blocks until generation is complete.
Generates text from pre-tokenized input. Blocks until generation is complete.
Returns the model struct for external tokenization.
Returns a snapshot of the server's current state.
Starts the server.
Returns a stream of generated text chunks.
Returns a stream of generated text chunks from pre-tokenized input.
Functions
Returns a specification to start this module under a supervisor.
See Supervisor.
@spec generate(GenServer.server(), String.t(), keyword()) :: {:ok, String.t()} | {:error, term()}
Generates text synchronously. Blocks until generation is complete.
Options
- :max_tokens - Maximum tokens to generate. Defaults to 256.
- :timeout - Call timeout in ms. Defaults to 60_000.
@spec generate_tokens(GenServer.server(), [integer()], keyword()) :: {:ok, String.t()} | {:error, term()}
Generates text from pre-tokenized input. Blocks until generation is complete.
Use get_model/1 to obtain the model for tokenization outside the server.
Options
- :max_tokens - Maximum tokens to generate. Defaults to 256.
- :timeout - Call timeout in ms. Defaults to 60_000.
@spec get_model(GenServer.server()) :: LlamaCppEx.Model.t()
Returns the model struct for external tokenization.
The model resource is reference-counted and thread-safe for read-only operations like tokenization.
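A sketch of the intended workflow: fetch the model once, tokenize outside the server process, then submit pre-tokenized requests. The tokenizer call shown (LlamaCppEx.tokenize/2) is an assumption for illustration; substitute the library's actual tokenization function:

```elixir
# Tokenization happens in the caller, off the server's tick loop.
model = LlamaCppEx.Server.get_model(server)

# Assumed tokenizer API -- replace with the real one.
tokens = LlamaCppEx.tokenize(model, "Once upon a time")

{:ok, text} =
  LlamaCppEx.Server.generate_tokens(server, tokens, max_tokens: 100)
```

Pre-tokenizing is useful when the same prompt is reused across calls or when you need the token count before submitting.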
@spec get_stats(GenServer.server()) :: map()
Returns a snapshot of the server's current state.
@spec start_link(keyword()) :: GenServer.on_start()
Starts the server.
Options
- :model_path (required) - Path to the GGUF model file.
- :n_gpu_layers - GPU layers. Defaults to 99.
- :n_ctx - Total context size (shared across slots). Defaults to 8192.
- :n_parallel - Number of concurrent slots. Defaults to 4.
- :n_batch - Batch size. Defaults to n_ctx.
- :chunk_size - Max prefill tokens per slot per tick. Defaults to 512.
- :max_queue - Max queued requests. 0 for unlimited. Defaults to 0.
- :cache_prompt - Retain KV cache between requests on the same slot for prefix reuse. Defaults to false. Set to true for multi-turn chat.
- :batch_strategy - Batch building strategy module. Defaults to LlamaCppEx.Server.Strategy.DecodeMaximal. See LlamaCppEx.Server.BatchStrategy.
- Sampling options: :temp, :top_k, :top_p, :min_p, :seed, :penalty_repeat, :penalty_freq, :penalty_present, :grammar, :grammar_root.
- GenServer options like :name.
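Since the module provides a child_spec, the server can be placed directly in a supervision tree and registered under a name so callers don't need the pid. A minimal sketch (module and path names are illustrative):

```elixir
# In your application's supervision tree:
children = [
  {LlamaCppEx.Server,
   model_path: "/models/model.gguf",
   n_parallel: 4,
   n_ctx: 8192,
   cache_prompt: true,
   name: MyApp.LlamaServer}
]

Supervisor.start_link(children, strategy: :one_for_one)

# Later, from any process:
# {:ok, text} = LlamaCppEx.Server.generate(MyApp.LlamaServer, prompt)
```

Registering a name also means callers survive server restarts without holding a stale pid.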
@spec stream(GenServer.server(), String.t(), keyword()) :: Enumerable.t()
Returns a stream of generated text chunks.
Options
- :max_tokens - Maximum tokens to generate. Defaults to 256.
- :timeout - Per-token timeout. Defaults to 30_000.
@spec stream_tokens(GenServer.server(), [integer()], keyword()) :: Enumerable.t()
Returns a stream of generated text chunks from pre-tokenized input.
Options
- :max_tokens - Maximum tokens to generate. Defaults to 256.
- :timeout - Per-token timeout. Defaults to 30_000.
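Because the server batches continuously across its slot pool, several streams can run at once; with n_parallel slots available, concurrent requests are interleaved in each tick rather than serialized. A sketch using Task.async_stream (prompts and concurrency settings are illustrative):

```elixir
prompts = ["Write a haiku", "Explain OTP in one line", "Name three colors"]

prompts
|> Task.async_stream(
  fn prompt ->
    # Each task consumes its own stream; the server mixes the
    # requests' decode tokens into shared forward passes.
    server
    |> LlamaCppEx.Server.stream(prompt, max_tokens: 64)
    |> Enum.join()
  end,
  timeout: :infinity
)
|> Enum.each(fn {:ok, text} -> IO.puts(text) end)
```

If more requests arrive than there are slots, the extras queue (see :max_queue) and show up in the :queue_depth measurement of the tick telemetry event.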