ADR 006: Continuous Batching

Status

Accepted (supersedes ADR 005)

Context

ADR 005 proposed a batch-accumulate-flush model with batch_timeout for serving multiple concurrent users. In practice, a tick-driven continuous batching loop, inspired by llama.cpp's server, vLLM, and SGLang, is simpler and delivers lower latency.

The previous Server implementation processed slots sequentially: N active slots meant N separate llama_decode calls per tick, each a full forward pass. This wasted GPU parallelism and stalled generation during long prompt prefills.

Decision

Implement continuous batching with a single forward pass per tick, mixing decode tokens and prefill chunks in one batch.

Key Design Elements

Decode-maximal scheduling: Decode tokens (one per generating slot) are always added to the batch first. They represent active generation that users are waiting on, so they get priority.

Chunked prefill: Long prompts are split into chunks (default 512 tokens) and processed across multiple ticks, interleaved with decode tokens from other slots. This prevents a large prompt from stalling all generation.

Token budget: Each tick's batch is capped at n_batch tokens. The Elixir scheduler enforces this — no sub-batching in C++.
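
For illustration, the batch-assembly step might look roughly like the sketch below, written as a private helper inside a hypothetical server module. The slot fields (:state, :last_token, :pos, :seq_id, :remaining_prompt) are assumed names, not the actual implementation.

```elixir
# Sketch only: assemble one tick's batch within the n_batch token budget.
defp build_batch(slots, n_batch, chunk_size) do
  # Decode-maximal: one token per generating slot goes in first, logits requested for sampling.
  decode =
    for %{state: :generating} = slot <- slots do
      {slot.last_token, slot.pos, slot.seq_id, true}
    end

  # Whatever budget remains is spent on prefill chunks from prefilling slots.
  {prefill, _budget_left} =
    slots
    |> Enum.filter(&(&1.state == :prefilling))
    |> Enum.flat_map_reduce(n_batch - length(decode), fn slot, budget ->
      chunk = Enum.take(slot.remaining_prompt, min(chunk_size, max(budget, 0)))
      finishes_prompt? = length(chunk) == length(slot.remaining_prompt)
      last_pos = slot.pos + length(chunk) - 1

      entries =
        chunk
        |> Enum.with_index(slot.pos)
        |> Enum.map(fn {token, pos} ->
          # Logits are only needed at the final prompt position, where the first
          # generated token will be sampled on the next tick.
          {token, pos, slot.seq_id, finishes_prompt? and pos == last_pos}
        end)

      {entries, budget - length(entries)}
    end)

  decode ++ prefill
end
```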

Two new NIFs:

  • batch_eval(ctx, entries) — Builds a llama_batch from a list of {token_id, pos, seq_id, logits_flag} tuples and calls llama_decode. Forward pass only, no sampling. Runs on DirtyCPU.
  • sampler_sample_at(sampler, ctx, idx) — Calls llama_sampler_sample with an explicit batch index, enabling sampling at specific positions after a batched decode. Runs on Normal scheduler (fast — just reads logits).
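
A rough usage sketch of batch_eval follows; the Llama.NIF module name, the ctx variable, and the token ids are placeholders, not the real identifiers.

```elixir
# Illustrative entries for one tick: slot 0 is generating, slot 1 is prefilling a 3-token chunk.
# Tuple shape: {token_id, pos, seq_id, logits_flag}.
entries = [
  {4521, 87, 0, true},   # seq 0: decode token at position 87, logits needed for sampling
  {1135, 0, 1, false},   # seq 1: prompt positions 0..2 ...
  {2287, 1, 1, false},
  {9640, 2, 1, true}     # ... last prompt token in this chunk, so request logits
]

# Single forward pass over the mixed batch (runs on a DirtyCPU scheduler).
Llama.NIF.batch_eval(ctx, entries)
```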

Per-slot samplers with batch-index-aware sampling: After batch_eval, each slot samples at its specific batch index using sampler_sample_at. This preserves per-slot sampler state (grammar, penalties, etc.).
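
Continuing the sketch above, sampling after the forward pass might look like the following; the batch indices refer to positions in the entries list, and slot0/slot1 with their :sampler fields are assumed names.

```elixir
# Index 0 carried logits for seq 0's decode token, index 3 for seq 1's last prompt token.
# Each slot samples with its own sampler resource, so grammar/penalty state stays per-request.
next_token_seq0 = Llama.NIF.sampler_sample_at(slot0.sampler, ctx, 0)
next_token_seq1 = Llama.NIF.sampler_sample_at(slot1.sampler, ctx, 3)
```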

Request queue: When all slots are busy, requests enter a FIFO :queue instead of being rejected with {:error, :no_slots}. Requests are served as slots become available. An optional :max_queue limit provides backpressure.
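
A minimal sketch of the queueing decision, assuming the server state holds a queue field and a :max_queue option; the :queue_full error atom is likewise an assumption.

```elixir
# Hypothetical handling of a new request when every slot is busy.
defp enqueue_or_reject(%{queue: queue, max_queue: max_queue} = state, from, request) do
  if max_queue != :infinity and :queue.len(queue) >= max_queue do
    # Backpressure: the queue is full, so reject rather than grow without bound.
    {:reply, {:error, :queue_full}, state}
  else
    # Defer the reply; the caller is answered once a slot frees up and the request is dequeued.
    {:noreply, %{state | queue: :queue.in({from, request}, queue)}}
  end
end
```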

Telemetry events: :telemetry events are emitted for request completion and per-tick batch metrics, enabling monitoring without coupling to a specific metrics backend.
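
A sketch of subscribing to such events is shown below; the event names and the handler id are assumptions, not the library's documented names.

```elixir
# Hypothetical event names; consult the actual documentation for the real ones.
:telemetry.attach_many(
  "llm-server-monitoring",
  [[:my_app, :server, :request, :stop], [:my_app, :server, :tick]],
  fn event, measurements, metadata, _config ->
    IO.inspect({event, measurements, metadata}, label: "llm telemetry")
  end,
  nil
)
```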

Tick Loop

Phase 1  Finish completed slots
Phase 2  Build batch (decode tokens first, then prefill chunks)
Phase 3  Forward pass (single batch_eval call)
Phase 4  Sample (sampler_sample_at per slot at their batch index)
Phase 5  Continue (schedule next tick if any active slots)
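
Condensed into a GenServer callback, the loop might look roughly like this; every helper function named here is hypothetical.

```elixir
def handle_info(:tick, state) do
  state = finish_completed_slots(state)              # Phase 1: release finished slots, reply to callers
  {entries, state} = build_tick_batch(state)         # Phase 2: decode tokens first, then prefill chunks

  state =
    if entries == [] do
      state
    else
      Llama.NIF.batch_eval(state.ctx, entries)       # Phase 3: one forward pass for the whole batch
      sample_slots(state, entries)                   # Phase 4: sampler_sample_at at each slot's batch index
    end

  if any_active_slots?(state), do: send(self(), :tick)   # Phase 5: keep ticking while work remains
  {:noreply, state}
end
```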

Slot States

:idle → :prefilling → :generating → :idle
  • :idle — Slot is available for new requests
  • :prefilling — Prompt tokens being chunked into batches across ticks
  • :generating — Actively producing tokens, one per tick
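
For illustration, a slot could be modelled as a small struct whose :state field walks through these values; the field names are assumptions.

```elixir
# Hypothetical slot record; the actual fields may differ.
defmodule Slot do
  defstruct id: nil,
            seq_id: nil,
            state: :idle,           # :idle | :prefilling | :generating
            pos: 0,                 # next KV-cache position for this sequence
            remaining_prompt: [],   # prompt tokens not yet prefilled
            last_token: nil,        # most recently sampled token, decoded on the next tick
            sampler: nil            # per-slot sampler resource (grammar, penalties, ...)
end
```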

Consequences

  • Single forward pass per tick improves GPU utilization dramatically compared to sequential per-slot decode calls
  • Chunked prefill prevents generation stalls — existing generating slots continue producing tokens while a new prompt is being prefilled
  • Per-slot samplers preserve grammar/penalty state across the batched decode
  • :telemetry enables monitoring without coupling to a specific metrics backend
  • Request queue provides graceful degradation under load instead of immediate rejection
  • The tick-based approach adds an inherent latency of one message round-trip per tick (negligible compared to forward-pass time)

Alternatives Considered

Sub-batch processing in C++

Push the entire tick's work into a single C++ NIF call and let the C++ side split it into sub-batches (batch construction, decode, and sampling all behind one call). Rejected because it moves scheduling logic into C++, making it harder to debug and losing per-slot sampler flexibility from Elixir.

One-NIF-per-tick with sampling in C++

Have a single NIF that builds the batch, decodes, and samples all tokens. Rejected because it loses the ability to use different sampler configurations per slot (grammar, temperature, etc.) managed from Elixir.

Batch-accumulate-flush (ADR 005)

The original design accumulated requests over a batch_timeout window before flushing. The continuous batching approach is simpler (no timeout to tune) and has lower latency (a tick fires as soon as work is available).