Status
Superseded by ADR 006
Context
Serving multiple concurrent users with a single model requires efficient GPU utilization. Without batching, each user's token generation runs as a separate forward pass, leaving the GPU underutilized during the decode phase.
llama.cpp's llama_batch API supports multi-sequence batching — processing tokens from multiple independent sequences in a single llama_decode call.
Decision
We will implement a GenServer-based batching server that accumulates requests from multiple callers and flushes them as a single batched llama_decode call.
Background: Why Batching Matters
The Prefill vs Decode Asymmetry
| Phase | Operation | Bottleneck | GPU Utilization |
|---|---|---|---|
| Prefill | Process entire prompt | Compute-bound (matrix-matrix) | ~100% |
| Decode | Generate 1 token | Memory-bandwidth-bound (matrix-vector) | 10-30% |
During decode, each sequence contributes a single token per step. The GPU performs a matrix-vector multiply (accessing all model weights to produce one output), which is limited by memory bandwidth, not compute.
Batching Converts the Bottleneck
Batching N sequences together converts N matrix-vector multiplies into one matrix-matrix multiply:
| Sequences | Operation | Throughput |
|---|---|---|
| 1 | Matrix × Vector | ~30 tok/s |
| 8 | Matrix × Matrix (8 cols) | ~200 tok/s total (~25 tok/s per user) |
| 16 | Matrix × Matrix (16 cols) | ~350 tok/s total (~22 tok/s per user) |
Per-user throughput drops slightly (from ~30 to ~22 tok/s at 16 sequences, so per-token latency rises slightly), but total throughput increases dramatically.
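The trade-off in the table can be checked with a little arithmetic. The sketch below is illustrative only: the module name ThroughputExample is hypothetical, and the numbers are the approximate figures from the table, not new measurements.

```elixir
defmodule ThroughputExample do
  # Approximate aggregate decode throughput (tok/s) per batch size,
  # copied from the table above. These are not new measurements.
  @total_tok_per_s %{1 => 30, 8 => 200, 16 => 350}

  # Per-user rate: the aggregate rate split evenly across the batch.
  def per_user(n_sequences) do
    Map.fetch!(@total_tok_per_s, n_sequences) / n_sequences
  end

  # Aggregate speedup relative to the unbatched single-sequence case.
  def speedup(n_sequences) do
    Map.fetch!(@total_tok_per_s, n_sequences) / Map.fetch!(@total_tok_per_s, 1)
  end
end
```

With these figures, 16 sequences cost each user roughly a quarter of their single-user rate while the aggregate rate grows more than tenfold.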
Design
NIF Layer
Two new NIFs for batched operation:
prefill(ctx, tokens, seq_id, n_past)
Processes a prompt for a single sequence in n_batch-sized chunks. Only computes logits on the very last token. Runs on DirtyCPU.
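The chunking rule can be sketched in Elixir, even though the real work happens inside the NIF. This is an illustration under assumptions: the module PrefillChunks, the function plan/3, and the {token, position, logits?} tuple shape are hypothetical and not part of the library's API.

```elixir
defmodule PrefillChunks do
  # Splits prompt tokens into n_batch-sized chunks. Each token is tagged
  # with its absolute position (offset by n_past) and a flag that is true
  # only for the very last prompt token, the one that needs logits.
  def plan(tokens, n_past, n_batch) when n_batch > 0 do
    last_index = length(tokens) - 1

    tokens
    |> Enum.with_index()
    |> Enum.map(fn {token, i} -> {token, n_past + i, i == last_index} end)
    |> Enum.chunk_every(n_batch)
  end
end
```

Each inner list corresponds to one decode call during prefill; only the final chunk contains a token with the logits flag set.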
decode_batch(ctx, entries)
Accepts a list of {seq_id, token_id, position} tuples, builds a single llama_batch, calls llama_decode once, and samples a next token for each sequence. Returns [{seq_id, next_token_id, token_text}]. Runs on DirtyCPU.
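On the Elixir side, the server has to turn its per-sequence state into the entries list and fold the NIF's results back in. A minimal sketch, assuming each active sequence is tracked as a map with hypothetical keys last_token and n_past (the module BatchEntries is also hypothetical):

```elixir
defmodule BatchEntries do
  # sequences: %{seq_id => %{last_token: token_id, n_past: position}}
  # (an assumed shape for illustration).
  # Returns the [{seq_id, token_id, position}] list that decode_batch accepts.
  def build(sequences) do
    sequences
    |> Enum.map(fn {seq_id, %{last_token: token, n_past: pos}} ->
      {seq_id, token, pos}
    end)
    |> Enum.sort()
  end

  # results: [{seq_id, next_token_id, token_text}], as returned by decode_batch.
  # Records the sampled token and advances the position of each sequence.
  def apply_results(sequences, results) do
    Enum.reduce(results, sequences, fn {seq_id, next_token, _text}, acc ->
      Map.update!(acc, seq_id, fn seq ->
        %{seq | last_token: next_token, n_past: seq.n_past + 1}
      end)
    end)
  end
end
```

Sorting the entries is not required by the batching itself; it only makes the output deterministic for inspection.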
GenServer Batcher
                    ┌───────────────────────────┐
    Caller 1 ──────→│                           │
    Caller 2 ──────→│  LlamaCppEx.Server        │
    Caller 3 ──────→│                           │
                    │  State:                   │
                    │  - pending: [{from, ...}] │
                    │  - sequences: %{id => ..} │
                    │  - seq_pool: MapSet       │
                    │                           │
                    │  Flush trigger:           │
                    │  - batch_size reached     │
                    │  - batch_timeout (20ms)   │
                    └─────────────┬─────────────┘
                                  │
                           decode_batch NIF
                                  │
                    ┌─────────────┴─────────────┐
                    │  GenServer.reply/2        │
                    │  per caller               │
                    └───────────────────────────┘

Sequence Lifecycle
- Caller sends {:generate, prompt, opts} → server acquires a seq_id from the pool
- Server prefills prompt tokens for the sequence
- Decode loop: server batches all active sequences, calls decode_batch, and replies to callers whose sequences finished or produced a token
- On completion/error/timeout: server calls llama_memory_seq_rm to free the KV cache slot and returns the seq_id to the pool
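The acquire and release steps of this lifecycle can be sketched with a MapSet, matching the seq_pool field in the server state. The module SeqPool and its function names are hypothetical; the real server may track the pool differently.

```elixir
defmodule SeqPool do
  # A pool of free sequence IDs 0..n_parallel-1, held in a MapSet.
  def new(n_parallel) when n_parallel > 0 do
    MapSet.new(0..(n_parallel - 1))
  end

  # Takes the lowest free seq_id, or returns :empty when all are in use.
  def acquire(pool) do
    case Enum.min(pool, fn -> nil end) do
      nil -> :empty
      seq_id -> {:ok, seq_id, MapSet.delete(pool, seq_id)}
    end
  end

  # Returns a seq_id to the pool. The caller is expected to have cleared
  # the KV cache for that sequence first (llama_memory_seq_rm).
  def release(pool, seq_id), do: MapSet.put(pool, seq_id)
end
```

An :empty result is the point where a server would either queue the request or reject it; which of the two applies is not specified here.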
Configuration
    {LlamaCppEx.Server,
     model_path: "model.gguf",
     n_ctx: 8192,          # Total KV cache (shared across all sequences)
     n_parallel: 8,        # Max concurrent sequences
     n_gpu_layers: -1,     # GPU layer offload
     batch_size: 512,      # Max tokens per decode call
     batch_timeout: 20}    # ms accumulation window

Flush Strategy
The server returns {:noreply, state} from handle_call to hold callers without replying, then flushes on either of two triggers:
- Batch size: When n_pending >= batch_size, flush immediately
- Batch timeout: After batch_timeout ms with pending entries, flush whatever is accumulated
This balances latency (small batches flush quickly) vs throughput (large batches amortize the forward pass).
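The two flush triggers can be sketched as a small, self-contained GenServer. This is not the LlamaCppEx.Server implementation: the module BatcherSketch and the :decode_fun option are hypothetical, and :decode_fun stands in for the decode_batch NIF by mapping a list of requests to a list of results in the same order.

```elixir
defmodule BatcherSketch do
  use GenServer

  # opts: :batch_size, :batch_timeout (ms), and :decode_fun, a hypothetical
  # stand-in for the NIF call (list of requests in, list of results out).
  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  def generate(server, request), do: GenServer.call(server, {:generate, request})

  @impl true
  def init(opts) do
    {:ok,
     %{
       pending: [],
       timer: nil,
       batch_size: Keyword.get(opts, :batch_size, 8),
       batch_timeout: Keyword.get(opts, :batch_timeout, 20),
       decode_fun: Keyword.fetch!(opts, :decode_fun)
     }}
  end

  @impl true
  def handle_call({:generate, request}, from, state) do
    # Hold the caller: store `from` and reply later from flush/1.
    state = %{state | pending: [{from, request} | state.pending]}

    if length(state.pending) >= state.batch_size do
      {:noreply, flush(state)}
    else
      {:noreply, ensure_timer(state)}
    end
  end

  @impl true
  def handle_info(:flush, state), do: {:noreply, flush(state)}

  # Start the accumulation window when the first request of a batch arrives.
  defp ensure_timer(%{timer: nil} = state) do
    %{state | timer: Process.send_after(self(), :flush, state.batch_timeout)}
  end

  defp ensure_timer(state), do: state

  defp flush(state) do
    if state.timer, do: Process.cancel_timer(state.timer)

    case Enum.reverse(state.pending) do
      [] ->
        :ok

      entries ->
        {froms, requests} = Enum.unzip(entries)
        results = state.decode_fun.(requests)

        froms
        |> Enum.zip(results)
        |> Enum.each(fn {from, result} -> GenServer.reply(from, result) end)
    end

    %{state | pending: [], timer: nil}
  end
end
```

A stale :flush message that arrives after a size-triggered flush only causes an early flush of whatever is pending, which is harmless in this sketch. A caller would use it as `BatcherSketch.generate(pid, request)` after starting the server with a `decode_fun`.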
Shared System Prompt
For chat applications where every request starts with the same system prompt:
- Prefill system prompt tokens tagged with ALL sequence IDs
- When a new sequence starts, llama_memory_seq_cp copies the shared prefix
- Only user-specific tokens need prefilling per request
This avoids redundant computation of the system prompt for every request.
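The last step, working out which tokens still need prefilling, can be sketched as a pure function. The module SharedPrefix and its split/2 function are hypothetical; the sketch assumes the full prompt is tokenized with the system prompt at the front and that the cached prefix is reused only on an exact token match.

```elixir
defmodule SharedPrefix do
  # Returns {n_past, remaining_tokens}: how many leading tokens are already
  # covered by the cached system prompt, and which tokens still need prefill.
  # Falls back to prefilling everything if the prompt does not start with
  # the system prompt tokens.
  def split(prompt_tokens, system_tokens) do
    if List.starts_with?(prompt_tokens, system_tokens) do
      n = length(system_tokens)
      {n, Enum.drop(prompt_tokens, n)}
    else
      {0, prompt_tokens}
    end
  end
end
```

The returned n_past is the position at which prefill of the user-specific tokens would begin for the new sequence.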
KV Cache Management
Total KV cache capacity (n_ctx) is shared across all active sequences:
    n_ctx = 8192
    n_parallel = 8
    max_per_sequence = n_ctx / n_parallel = 1024 tokens

The server tracks positions per sequence and enforces limits. When a sequence completes, llama_memory_seq_rm frees its cache slots for reuse.
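The per-sequence limit check can be sketched as follows. The module KvBudget, its function names, and the {:error, {:context_full, limit}} return shape are hypothetical; the sketch assumes the even split of n_ctx across n_parallel described above.

```elixir
defmodule KvBudget do
  # Even split of the shared KV cache across the maximum number of sequences.
  def max_per_sequence(n_ctx, n_parallel) when n_parallel > 0 do
    div(n_ctx, n_parallel)
  end

  # Checks whether a sequence currently at position n_past can accept
  # n_new more tokens without exceeding its share of the cache.
  def check(n_past, n_new, n_ctx, n_parallel) do
    limit = max_per_sequence(n_ctx, n_parallel)

    if n_past + n_new <= limit do
      :ok
    else
      {:error, {:context_full, limit}}
    end
  end
end
```

A server would run a check like this before prefill (with n_new set to the prompt length) and before each decode step (with n_new set to 1).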
Alternatives Considered
Thread Pool in C++
Run the batching loop entirely in C++ with a thread pool. Rejected because:
- Harder to debug and monitor from Elixir
- Loses BEAM's process supervision and fault tolerance
- More complex error handling across the language boundary
One GenServer per Sequence
Each sequence gets its own GenServer + context. Rejected because:
- No batching benefit — each forward pass is still single-sequence
- N contexts × N KV caches = much higher memory usage
- Does not leverage GPU parallelism
Nx.Serving
Wrap as an Nx.Serving for automatic batching. This is planned as an optional Phase 5 integration, but the core batcher remains a GenServer for these reasons:
- No Nx dependency in the core library
- More control over sequence lifecycle and KV cache management
- Simpler mental model for users not using Nx
Consequences
- The GenServer is a serialization point: all requests funnel through one process
- This is by design: llama_decode is not thread-safe, and batching requires coordinated access to the shared context
- The batch_timeout adds up to 20ms latency for the first request in a batch window
- Memory usage is bounded by n_ctx total, divided among n_parallel sequences
- Callers that are slower than generation speed won't cause backpressure issues (GenServer.reply is non-blocking)