ADR 005: Batching Architecture

Status

Superseded by ADR 006

Context

Serving multiple concurrent users with a single model requires efficient GPU utilization. Without batching, each user's token generation runs as a separate forward pass, leaving the GPU underutilized during the decode phase.

llama.cpp's llama_batch API supports multi-sequence batching — processing tokens from multiple independent sequences in a single llama_decode call.

Decision

We will implement a GenServer-based batching server that accumulates requests from multiple callers and flushes them as a single batched llama_decode call.

Background: Why Batching Matters

The Prefill vs Decode Asymmetry

Phase	Operation	Bottleneck	GPU Utilization
Prefill	Process entire prompt	Compute-bound (matrix-matrix)	~100%
Decode	Generate 1 token	Memory-bandwidth-bound (matrix-vector)	10-30%

During decode, each sequence contributes a single token per step. The GPU performs a matrix-vector multiply (accessing all model weights to produce one output), which is limited by memory bandwidth, not compute.

Batching Converts the Bottleneck

Batching N sequences together converts N matrix-vector multiplies into one matrix-matrix multiply:

Sequences	Operation	Throughput
1	Matrix × Vector	~30 tok/s
8	Matrix × Matrix (8 cols)	~200 tok/s total (~25 tok/s per user)
16	Matrix × Matrix (16 cols)	~350 tok/s total (~22 tok/s per user)

Per-user latency decreases slightly, but total throughput increases dramatically.

Design

NIF Layer

Two new NIFs for batched operation:

`prefill(ctx, tokens, seq_id, n_past)`

Processes a prompt for a single sequence in n_batch-sized chunks. Only computes logits on the very last token. Runs on DirtyCPU.

`decode_batch(ctx, entries)`

Accepts a list of {seq_id, token_id, position} tuples, builds a single llama_batch, calls llama_decode once, and samples a next token for each sequence. Returns [{seq_id, next_token_id, token_text}]. Runs on DirtyCPU.

GenServer Batcher

                   ┌─────────────────────────┐
  Caller 1 ──────→│                           │
  Caller 2 ──────→│  LlamaCppEx.Server        │
  Caller 3 ──────→│                           │
                   │  State:                   │
                   │  - pending: [{from, ...}] │
                   │  - sequences: %{id => ..} │
                   │  - seq_pool: MapSet       │
                   │                           │
                   │  Flush trigger:            │
                   │  - batch_size reached      │
                   │  - batch_timeout (20ms)    │
                   └─────────┬─────────────────┘
                             │
                     decode_batch NIF
                             │
                   ┌─────────┴─────────────────┐
                   │  GenServer.reply/2         │
                   │  per caller                │
                   └───────────────────────────┘

Sequence Lifecycle

Caller sends {:generate, prompt, opts} → server acquires a seq_id from pool
Server prefills prompt tokens for the sequence
Decode loop: server batches all active sequences, calls decode_batch, replies to callers whose sequences finished or produced a token
On completion/error/timeout: server calls llama_memory_seq_rm to free the KV cache slot, returns seq_id to pool

Configuration

{LlamaCppEx.Server,
  model_path: "model.gguf",
  n_ctx: 8192,            # Total KV cache (shared across all sequences)
  n_parallel: 8,          # Max concurrent sequences
  n_gpu_layers: -1,       # GPU layer offload
  batch_size: 512,        # Max tokens per decode call
  batch_timeout: 20}      # ms accumulation window

Flush Strategy

The server uses :noreply for handle_call to hold callers, then flushes on:

Batch size: When n_pending >= batch_size, flush immediately
Batch timeout: After batch_timeout ms with pending entries, flush whatever is accumulated

This balances latency (small batches flush quickly) vs throughput (large batches amortize the forward pass).

Shared System Prompt

For chat applications where every request starts with the same system prompt:

Prefill system prompt tokens tagged with ALL sequence IDs
When a new sequence starts, llama_memory_seq_cp copies the shared prefix
Only user-specific tokens need prefilling per request

This avoids redundant computation of the system prompt for every request.

KV Cache Management

Total KV cache capacity (n_ctx) is shared across all active sequences:

n_ctx = 8192
n_parallel = 8
max_per_sequence = n_ctx / n_parallel = 1024 tokens

The server tracks positions per sequence and enforces limits. When a sequence completes, llama_memory_seq_rm frees its cache slots for reuse.

Alternatives Considered

Thread Pool in C++

Run the batching loop entirely in C++ with a thread pool. Rejected because:

Harder to debug and monitor from Elixir
Loses BEAM's process supervision and fault tolerance
More complex error handling across the language boundary

One GenServer per Sequence

Each sequence gets its own GenServer + context. Rejected because:

No batching benefit — each forward pass is still single-sequence
N contexts × N KV caches = much higher memory usage
Does not leverage GPU parallelism

Nx.Serving

Wrap as an Nx.Serving for automatic batching. This is planned as an optional Phase 5 integration, but the core batcher is a GenServer for:

No Nx dependency in the core library
More control over sequence lifecycle and KV cache management
Simpler mental model for users not using Nx

Consequences

The GenServer is a serialization point — all requests funnel through one process
This is by design: llama_decode is not thread-safe, and batching requires coordinated access to the shared context
The batch_timeout adds up to 20ms latency for the first request in a batch window
Memory usage is bounded by n_ctx total, divided among n_parallel sequences
Callers that are slower than generation speed won't cause backpressure issues (GenServer.reply is non-blocking)

← Previous Page ADR 004: Streaming via enif_send

Next Page → ADR 006: Continuous Batching