# `LlamaCppEx.Server`
[🔗](https://github.com/nyo16/llama_cpp_ex/blob/main/lib/llama_cpp_ex/server.ex#L1)

GenServer for continuous batched multi-sequence inference.

Manages a shared model/context and serves multiple concurrent callers
using a slot pool with continuous batching: each tick runs a single forward
pass whose batch mixes decode tokens and prefill chunks.

## Example

    {:ok, server} = LlamaCppEx.Server.start_link(
      model_path: "model.gguf",
      n_gpu_layers: -1,
      n_parallel: 4,
      n_ctx: 8192
    )

    # Sync generation
    {:ok, text} = LlamaCppEx.Server.generate(server, "Once upon a time", max_tokens: 100)

    # Streaming
    LlamaCppEx.Server.stream(server, "Tell me a story", max_tokens: 200)
    |> Enum.each(&IO.write/1)

## Telemetry

The server emits the following telemetry events:

### `[:llama_cpp_ex, :server, :tick]`

Emitted after each batch forward pass.

Measurements:

  * `:batch_size` - Total tokens in the batch.
  * `:decode_tokens` - Number of decode (generation) tokens.
  * `:prefill_tokens` - Number of prefill (prompt) tokens.
  * `:active_slots` - Slots currently prefilling or generating.
  * `:queue_depth` - Requests waiting for a slot.
  * `:eval_ms` - Forward pass wall time in milliseconds.

Metadata:

  * `:server` - PID of the server process.
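
For instance, a minimal sketch of a handler that logs each tick, using the
measurement and metadata keys documented above (the handler ID is arbitrary):

    :telemetry.attach(
      "llama-tick-logger",
      [:llama_cpp_ex, :server, :tick],
      fn _event, measurements, metadata, _config ->
        IO.puts(
          "tick on #{inspect(metadata.server)}: #{measurements.batch_size} tokens " <>
            "(#{measurements.decode_tokens} decode / #{measurements.prefill_tokens} prefill), " <>
            "#{measurements.active_slots} active slots, eval #{measurements.eval_ms}ms"
        )
      end,
      nil
    )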

### `[:llama_cpp_ex, :server, :request, :done]`

Emitted when a request (generate or stream) completes.

Measurements:

  * `:prompt_tokens` - Number of prompt tokens.
  * `:generated_tokens` - Number of tokens generated.
  * `:duration_ms` - Total request duration in milliseconds.
  * `:ttft_ms` - Time to first token in milliseconds.
  * `:prompt_eval_rate` - Prompt evaluation speed (tokens/sec).
  * `:generation_rate` - Generation speed (tokens/sec).
  * `:prefix_cache_tokens` - Number of prompt tokens skipped via prefix cache.
  * `:prefix_cache_ratio` - Ratio of cached to total prompt tokens (0.0–1.0).

Metadata:

  * `:server` - PID of the server process.
  * `:seq_id` - Slot sequence ID (integer).
  * `:mode` - `:generate` or `:stream`.

# `child_spec`

Returns a specification to start this module under a supervisor.

See `Supervisor`.

# `generate`

```elixir
@spec generate(GenServer.server(), String.t(), keyword()) ::
  {:ok, String.t()} | {:error, term()}
```

Generates text synchronously. Blocks until generation is complete.

## Options

  * `:max_tokens` - Maximum tokens to generate. Defaults to `256`.
  * `:timeout` - Call timeout in ms. Defaults to `60_000`.
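
For example, matching on the documented return values (the prompt is
illustrative):

    case LlamaCppEx.Server.generate(server, "Explain OTP in one paragraph.",
           max_tokens: 512,
           timeout: 120_000
         ) do
      {:ok, text} -> IO.puts(text)
      {:error, reason} -> IO.warn("generation failed: #{inspect(reason)}")
    end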

# `generate_tokens`

```elixir
@spec generate_tokens(GenServer.server(), [integer()], keyword()) ::
  {:ok, String.t()} | {:error, term()}
```

Generates text from pre-tokenized input. Blocks until generation is complete.

Use `get_model/1` to obtain the model for tokenization outside the server,
as sketched below.

## Options

  * `:max_tokens` - Maximum tokens to generate. Defaults to `256`.
  * `:timeout` - Call timeout in ms. Defaults to `60_000`.
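
A sketch of the pre-tokenized path. The tokenizer call below is an
assumption (check `LlamaCppEx.Model` for the actual function name and arity):

    model = LlamaCppEx.Server.get_model(server)

    # Hypothetical tokenizer call; the real API lives in LlamaCppEx.Model.
    tokens = LlamaCppEx.Model.tokenize(model, "Once upon a time")

    {:ok, text} = LlamaCppEx.Server.generate_tokens(server, tokens, max_tokens: 100)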

# `get_model`

```elixir
@spec get_model(GenServer.server()) :: LlamaCppEx.Model.t()
```

Returns the model struct for external tokenization.

The model resource is reference-counted and thread-safe for read-only
operations like tokenization.

# `get_stats`

```elixir
@spec get_stats(GenServer.server()) :: map()
```

Returns a snapshot of the server's current state.

# `start_link`

```elixir
@spec start_link(keyword()) :: GenServer.on_start()
```

Starts the server.

## Options

  * `:model_path` (required) - Path to the GGUF model file.
  * `:n_gpu_layers` - Number of model layers to offload to the GPU. Defaults to `99`.
  * `:n_ctx` - Total context size (shared across slots). Defaults to `8192`.
  * `:n_parallel` - Number of concurrent slots. Defaults to `4`.
  * `:n_batch` - Maximum tokens per forward-pass batch. Defaults to `n_ctx`.
  * `:chunk_size` - Max prefill tokens per slot per tick. Defaults to `512`.
  * `:max_queue` - Max queued requests. `0` for unlimited. Defaults to `0`.
  * `:cache_prompt` - Retain KV cache between requests on the same slot for
    prefix reuse. Defaults to `false`. Set to `true` for multi-turn chat.
  * `:batch_strategy` - Batch building strategy module. Defaults to
    `LlamaCppEx.Server.Strategy.DecodeMaximal`. See `LlamaCppEx.Server.BatchStrategy`.
  * Sampling options: `:temp`, `:top_k`, `:top_p`, `:min_p`, `:seed`, `:penalty_repeat`,
    `:penalty_freq`, `:penalty_present`, `:grammar`, `:grammar_root`.
  * GenServer options like `:name`.
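
Because `child_spec/1` is provided, the server can be placed directly in a
supervision tree; a minimal sketch (the `MyApp.LlamaServer` name is
illustrative):

    children = [
      {LlamaCppEx.Server,
       model_path: "model.gguf",
       n_parallel: 4,
       n_ctx: 8192,
       name: MyApp.LlamaServer}
    ]

    Supervisor.start_link(children, strategy: :one_for_one)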

# `stream`

```elixir
@spec stream(GenServer.server(), String.t(), keyword()) :: Enumerable.t()
```

Returns a stream of generated text chunks.

## Options

  * `:max_tokens` - Maximum tokens to generate. Defaults to `256`.
  * `:timeout` - Per-token timeout in ms. Defaults to `30_000`.
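
The result is a lazy enumerable, so it composes with `Stream`/`Enum`; for
example, collecting all chunks into a single string:

    text =
      server
      |> LlamaCppEx.Server.stream("Tell me a story", max_tokens: 200)
      |> Enum.join()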

# `stream_tokens`

```elixir
@spec stream_tokens(GenServer.server(), [integer()], keyword()) :: Enumerable.t()
```

Returns a stream of generated text chunks from pre-tokenized input.

## Options

  * `:max_tokens` - Maximum tokens to generate. Defaults to `256`.
  * `:timeout` - Per-token timeout in ms. Defaults to `30_000`.
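
As with `generate_tokens/3`, tokenize outside the server first (see the
sketch under `generate_tokens`); the stream itself behaves like `stream/3`:

    server
    |> LlamaCppEx.Server.stream_tokens(tokens, max_tokens: 200)
    |> Enum.each(&IO.write/1)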

---

*Consult [api-reference.md](api-reference.md) for the complete listing*
