LlamaCppEx.Server (LlamaCppEx v0.7.0)

GenServer for continuous batched multi-sequence inference.

Manages a shared model and context and serves multiple concurrent callers using a slot pool with continuous batching: each tick runs a single forward pass whose batch mixes decode tokens and prefill chunks.

Example

{:ok, server} = LlamaCppEx.Server.start_link(
  model_path: "model.gguf",
  n_gpu_layers: -1,
  n_parallel: 4,
  n_ctx: 8192
)

# Sync generation
{:ok, text} = LlamaCppEx.Server.generate(server, "Once upon a time", max_tokens: 100)

# Streaming
LlamaCppEx.Server.stream(server, "Tell me a story", max_tokens: 200)
|> Enum.each(&IO.write/1)

Telemetry

The server emits the following telemetry events:

[:llama_cpp_ex, :server, :tick]

Emitted after each batch forward pass.

Measurements:

  • :batch_size - Total tokens in the batch.
  • :decode_tokens - Number of decode (generation) tokens.
  • :prefill_tokens - Number of prefill (prompt) tokens.
  • :active_slots - Slots currently prefilling or generating.
  • :queue_depth - Requests waiting for a slot.
  • :eval_ms - Forward pass wall time in milliseconds.

Metadata:

  • :server - PID of the server process.

[:llama_cpp_ex, :server, :request, :done]

Emitted when a request (generate or stream) completes.

Measurements:

  • :prompt_tokens - Number of prompt tokens.
  • :generated_tokens - Number of tokens generated.
  • :duration_ms - Total request duration in milliseconds.
  • :ttft_ms - Time to first token in milliseconds.
  • :prompt_eval_rate - Prompt evaluation speed (tokens/sec).
  • :generation_rate - Generation speed (tokens/sec).
  • :prefix_cache_tokens - Number of prompt tokens skipped via prefix cache.
  • :prefix_cache_ratio - Ratio of cached to total prompt tokens (0.0–1.0).

Metadata:

  • :server - PID of the server process.
  • :seq_id - Slot sequence ID (integer).
  • :mode - :generate or :stream.
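
A handler for both events can be attached with :telemetry.attach_many/4 from the standard :telemetry package. A minimal logging sketch (the handler id and log format are illustrative; the measurement and metadata keys are those documented above):

```elixir
# Attach one handler to both server events.
:telemetry.attach_many(
  "llama-server-logger",
  [
    [:llama_cpp_ex, :server, :tick],
    [:llama_cpp_ex, :server, :request, :done]
  ],
  fn
    [:llama_cpp_ex, :server, :tick], measurements, _metadata, _config ->
      IO.puts("tick: batch=#{measurements.batch_size} eval=#{measurements.eval_ms}ms")

    [:llama_cpp_ex, :server, :request, :done], measurements, metadata, _config ->
      IO.puts(
        "#{metadata.mode} done: #{measurements.generated_tokens} tokens " <>
          "in #{measurements.duration_ms}ms (ttft #{measurements.ttft_ms}ms)"
      )
  end,
  nil
)
```

In production, prefer a named handler module over an anonymous function so the handler survives code reloads.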

Summary

Functions

child_spec(init_arg)
  Returns a specification to start this module under a supervisor.

generate(server, prompt, opts \\ [])
  Generates text synchronously. Blocks until generation is complete.

generate_tokens(server, token_ids, opts \\ [])
  Generates text from pre-tokenized input. Blocks until generation is complete.

get_model(server)
  Returns the model struct for external tokenization.

get_stats(server)
  Returns a snapshot of the server's current state.

start_link(opts)
  Starts the server.

stream(server, prompt, opts \\ [])
  Returns a stream of generated text chunks.

stream_tokens(server, token_ids, opts \\ [])
  Returns a stream of generated text chunks from pre-tokenized input.

Functions

child_spec(init_arg)

Returns a specification to start this module under a supervisor.

See Supervisor.

generate(server, prompt, opts \\ [])

@spec generate(GenServer.server(), String.t(), keyword()) ::
  {:ok, String.t()} | {:error, term()}

Generates text synchronously. Blocks until generation is complete.

Options

  • :max_tokens - Maximum tokens to generate. Defaults to 256.
  • :timeout - Call timeout in ms. Defaults to 60_000.

generate_tokens(server, token_ids, opts \\ [])

@spec generate_tokens(GenServer.server(), [integer()], keyword()) ::
  {:ok, String.t()} | {:error, term()}

Generates text from pre-tokenized input. Blocks until generation is complete.

Use get_model/1 to obtain the model for tokenization outside the server.

Options

  • :max_tokens - Maximum tokens to generate. Defaults to 256.
  • :timeout - Call timeout in ms. Defaults to 60_000.
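
Combined with get_model/1, a typical flow looks like the sketch below. The tokenizer call LlamaCppEx.tokenize/2 is a hypothetical name used for illustration; substitute whatever tokenization function the library exposes for a Model struct:

```elixir
# Tokenize outside the server process, then submit pre-tokenized input.
# `LlamaCppEx.tokenize/2` is illustrative, not a confirmed API.
model = LlamaCppEx.Server.get_model(server)
{:ok, token_ids} = LlamaCppEx.tokenize(model, "Once upon a time")
{:ok, text} = LlamaCppEx.Server.generate_tokens(server, token_ids, max_tokens: 100)
```

Pre-tokenizing in the caller keeps CPU-bound tokenization work off the server's tick loop.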

get_model(server)

@spec get_model(GenServer.server()) :: LlamaCppEx.Model.t()

Returns the model struct for external tokenization.

The model resource is reference-counted and thread-safe for read-only operations like tokenization.

get_stats(server)

@spec get_stats(GenServer.server()) :: map()

Returns a snapshot of the server's current state.

start_link(opts)

@spec start_link(keyword()) :: GenServer.on_start()

Starts the server.

Options

  • :model_path (required) - Path to the GGUF model file.
  • :n_gpu_layers - Number of model layers to offload to the GPU. Defaults to 99.
  • :n_ctx - Total context size (shared across slots). Defaults to 8192.
  • :n_parallel - Number of concurrent slots. Defaults to 4.
  • :n_batch - Batch size. Defaults to n_ctx.
  • :chunk_size - Max prefill tokens per slot per tick. Defaults to 512.
  • :max_queue - Max queued requests. 0 for unlimited. Defaults to 0.
  • :cache_prompt - Retain KV cache between requests on the same slot for prefix reuse. Defaults to false. Set to true for multi-turn chat.
  • :batch_strategy - Batch building strategy module. Defaults to LlamaCppEx.Server.Strategy.DecodeMaximal. See LlamaCppEx.Server.BatchStrategy.
  • Sampling options: :temp, :top_k, :top_p, :min_p, :seed, :penalty_repeat, :penalty_freq, :penalty_present, :grammar, :grammar_root.
  • GenServer options like :name.
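
Since the module provides child_spec/1, it can sit directly in a supervision tree via the usual child-spec tuple. A sketch for a multi-turn chat workload (paths, names, and option values are illustrative):

```elixir
children = [
  {LlamaCppEx.Server,
   model_path: "/models/model.gguf",
   n_ctx: 8192,
   n_parallel: 4,
   # Retain KV cache between requests on the same slot for prefix reuse:
   cache_prompt: true,
   name: MyApp.LlamaServer}
]

Supervisor.start_link(children, strategy: :one_for_one)
```

The registered :name lets callers reach the server as MyApp.LlamaServer without threading a PID through the application.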

stream(server, prompt, opts \\ [])

@spec stream(GenServer.server(), String.t(), keyword()) :: Enumerable.t()

Returns a stream of generated text chunks.

Options

  • :max_tokens - Maximum tokens to generate. Defaults to 256.
  • :timeout - Per-token timeout in ms. Defaults to 30_000.
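
Because the return value is a lazy Enumerable, callers can stop consuming early; a sketch that keeps only the first chunks:

```elixir
# Consume lazily and stop after 10 chunks; whether the server cancels
# the remaining generation on early termination is not documented here.
LlamaCppEx.Server.stream(server, "Tell me a story", max_tokens: 200)
|> Enum.take(10)
|> Enum.join()
```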

stream_tokens(server, token_ids, opts \\ [])

@spec stream_tokens(GenServer.server(), [integer()], keyword()) :: Enumerable.t()

Returns a stream of generated text chunks from pre-tokenized input.

Options

  • :max_tokens - Maximum tokens to generate. Defaults to 256.
  • :timeout - Per-token timeout in ms. Defaults to 30_000.