LlamaCppEx.Server (LlamaCppEx v0.7.0)

GenServer for continuous batched multi-sequence inference.

Manages a shared model and context and serves multiple concurrent callers using a slot pool with continuous batching: each tick runs a single forward pass whose batch mixes decode tokens and prefill chunks.

Example

{:ok, server} = LlamaCppEx.Server.start_link(
  model_path: "model.gguf",
  n_gpu_layers: -1,
  n_parallel: 4,
  n_ctx: 8192
)

# Sync generation
{:ok, text} = LlamaCppEx.Server.generate(server, "Once upon a time", max_tokens: 100)

# Streaming
LlamaCppEx.Server.stream(server, "Tell me a story", max_tokens: 200)
|> Enum.each(&IO.write/1)

Telemetry

The server emits the following telemetry events:

[:llama_cpp_ex, :server, :tick]

Emitted after each batch forward pass.

Measurements:

  • :batch_size - Total tokens in the batch.
  • :decode_tokens - Number of decode (generation) tokens.
  • :prefill_tokens - Number of prefill (prompt) tokens.
  • :active_slots - Slots currently prefilling or generating.
  • :queue_depth - Requests waiting for a slot.
  • :eval_ms - Forward pass wall time in milliseconds.

Metadata:

  • :server - PID of the server process.

[:llama_cpp_ex, :server, :request, :done]

Emitted when a request (generate or stream) completes.

Measurements:

  • :prompt_tokens - Number of prompt tokens.
  • :generated_tokens - Number of tokens generated.
  • :duration_ms - Total request duration in milliseconds.
  • :ttft_ms - Time to first token in milliseconds.
  • :prompt_eval_rate - Prompt evaluation speed (tokens/sec).
  • :generation_rate - Generation speed (tokens/sec).
  • :prefix_cache_tokens - Number of prompt tokens skipped via prefix cache.
  • :prefix_cache_ratio - Ratio of cached to total prompt tokens (0.0–1.0).

Metadata:

  • :server - PID of the server process.
  • :seq_id - Slot sequence ID (integer).
  • :mode - :generate or :stream.
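
A handler for both events can be attached with :telemetry.attach_many/4 from the standard :telemetry package. A minimal logging sketch (the handler id and log format are illustrative; the measurement and metadata keys are those documented above):

```elixir
# Attach one handler to both server events.
:telemetry.attach_many(
  "llama-server-logger",
  [
    [:llama_cpp_ex, :server, :tick],
    [:llama_cpp_ex, :server, :request, :done]
  ],
  fn
    [:llama_cpp_ex, :server, :tick], measurements, _metadata, _config ->
      IO.puts("tick: batch=#{measurements.batch_size} eval=#{measurements.eval_ms}ms")

    [:llama_cpp_ex, :server, :request, :done], measurements, metadata, _config ->
      IO.puts(
        "#{metadata.mode} done: #{measurements.generated_tokens} tokens " <>
          "in #{measurements.duration_ms}ms (ttft #{measurements.ttft_ms}ms)"
      )
  end,
  nil
)
```

In production, prefer a named handler module over an anonymous function so the handler survives code reloads.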

Summary

Functions

child_spec(init_arg)
  Returns a specification to start this module under a supervisor.

generate(server, prompt, opts \\ [])
  Generates text synchronously. Blocks until generation is complete.

generate_tokens(server, token_ids, opts \\ [])
  Generates text from pre-tokenized input. Blocks until generation is complete.

get_model(server)
  Returns the model struct for external tokenization.

get_stats(server)
  Returns a snapshot of the server's current state.

start_link(opts)
  Starts the server.

stream(server, prompt, opts \\ [])
  Returns a stream of generated text chunks.

stream_tokens(server, token_ids, opts \\ [])
  Returns a stream of generated text chunks from pre-tokenized input.

Functions

child_spec(init_arg)

Returns a specification to start this module under a supervisor.

See Supervisor.

generate(server, prompt, opts \\ [])

@spec generate(GenServer.server(), String.t(), keyword()) ::
  {:ok, String.t()} | {:error, term()}

Generates text synchronously. Blocks until generation is complete.

Options

  • :max_tokens - Maximum tokens to generate. Defaults to 256.
  • :timeout - Call timeout in ms. Defaults to 60_000.

generate_tokens(server, token_ids, opts \\ [])

@spec generate_tokens(GenServer.server(), [integer()], keyword()) ::
  {:ok, String.t()} | {:error, term()}

Generates text from pre-tokenized input. Blocks until generation is complete.

Use get_model/1 to obtain the model for tokenization outside the server.

Options

  • :max_tokens - Maximum tokens to generate. Defaults to 256.
  • :timeout - Call timeout in ms. Defaults to 60_000.
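
Combined with get_model/1, a typical flow looks like the sketch below. The tokenizer call LlamaCppEx.tokenize/2 is a hypothetical name used for illustration; substitute whatever tokenization function the library exposes for a Model struct:

```elixir
# Tokenize outside the server process, then submit pre-tokenized input.
# `LlamaCppEx.tokenize/2` is illustrative, not a confirmed API.
model = LlamaCppEx.Server.get_model(server)
{:ok, token_ids} = LlamaCppEx.tokenize(model, "Once upon a time")
{:ok, text} = LlamaCppEx.Server.generate_tokens(server, token_ids, max_tokens: 100)
```

Pre-tokenizing in the caller keeps CPU-bound tokenization work off the server's tick loop.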

get_model(server)

@spec get_model(GenServer.server()) :: LlamaCppEx.Model.t()

Returns the model struct for external tokenization.

The model resource is reference-counted and thread-safe for read-only operations like tokenization.

get_stats(server)

@spec get_stats(GenServer.server()) :: map()

Returns a snapshot of the server's current state.

start_link(opts)

@spec start_link(keyword()) :: GenServer.on_start()

Starts the server.

Options

  • :model_path (required) - Path to the GGUF model file.
  • :n_gpu_layers - Number of model layers to offload to the GPU. Defaults to 99.
  • :n_ctx - Total context size (shared across slots). Defaults to 8192.
  • :n_parallel - Number of concurrent slots. Defaults to 4.
  • :n_batch - Batch size. Defaults to n_ctx.
  • :chunk_size - Max prefill tokens per slot per tick. Defaults to 512.
  • :max_queue - Max queued requests. 0 for unlimited. Defaults to 0.
  • :cache_prompt - Retain KV cache between requests on the same slot for prefix reuse. Defaults to false. Set to true for multi-turn chat.
  • :batch_strategy - Batch building strategy module. Defaults to LlamaCppEx.Server.Strategy.DecodeMaximal. See LlamaCppEx.Server.BatchStrategy.
  • Sampling options: :temp, :top_k, :top_p, :min_p, :seed, :penalty_repeat, :penalty_freq, :penalty_present, :grammar, :grammar_root.
  • GenServer options like :name.
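
Since the module provides child_spec/1, it can sit directly in a supervision tree via the usual child-spec tuple. A sketch for a multi-turn chat workload (paths, names, and option values are illustrative):

```elixir
children = [
  {LlamaCppEx.Server,
   model_path: "/models/model.gguf",
   n_ctx: 8192,
   n_parallel: 4,
   # Retain KV cache between requests on the same slot for prefix reuse:
   cache_prompt: true,
   name: MyApp.LlamaServer}
]

Supervisor.start_link(children, strategy: :one_for_one)
```

The registered :name lets callers reach the server as MyApp.LlamaServer without threading a PID through the application.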

stream(server, prompt, opts \\ [])

@spec stream(GenServer.server(), String.t(), keyword()) :: Enumerable.t()

Returns a stream of generated text chunks.

Options

  • :max_tokens - Maximum tokens to generate. Defaults to 256.
  • :timeout - Per-token timeout in ms. Defaults to 30_000.
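
Because the return value is a lazy Enumerable, callers can stop consuming early; a sketch that keeps only the first chunks:

```elixir
# Consume lazily and stop after 10 chunks; whether the server cancels
# the remaining generation on early termination is not documented here.
LlamaCppEx.Server.stream(server, "Tell me a story", max_tokens: 200)
|> Enum.take(10)
|> Enum.join()
```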

stream_tokens(server, token_ids, opts \\ [])

@spec stream_tokens(GenServer.server(), [integer()], keyword()) :: Enumerable.t()

Returns a stream of generated text chunks from pre-tokenized input.

Options

  • :max_tokens - Maximum tokens to generate. Defaults to 256.
  • :timeout - Per-token timeout in ms. Defaults to 30_000.