ADR 007: Prefix Caching (Same-Slot KV Reuse)


Status

Accepted

Context

The continuous batching server (ADR 006) clears the KV cache for every slot on every request via memory_seq_rm(ctx, seq_id, 0, -1). This means repeated requests with shared prefixes — common in multi-turn chat, few-shot prompting, and system prompt reuse — must re-compute the same KV entries each time.

llama.cpp's own server implements per-slot prompt caching: when a slot finishes and gets a new request, it checks the common prefix between the old and new token sequences and skips prefill for the matching portion. SGLang takes this further with a global radix attention tree, but that requires page-level KV cache management that llama.cpp doesn't expose.

Decision

Implement same-slot KV reuse — after a slot completes a request, keep its KV cache alive and store the full token history (prompt + generated tokens). When the next request arrives:

  1. Compute common_prefix_length(new_tokens, cached_tokens) — a simple zip-and-compare
  2. If match > 0: call memory_seq_rm(ctx, seq_id, n_match, -1) to trim only the non-matching suffix, then set prefill_pos = n_match to skip the matched portion
  3. If no match: clear fully with memory_seq_rm(ctx, seq_id, 0, -1)
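The steps above can be sketched as follows. This is a Python illustration only (the project itself is Elixir), and `memory_seq_rm` is passed in as a callable standing in for the llama.cpp binding so the sketch is self-contained:

```python
def common_prefix_length(new_tokens, cached_tokens):
    """Zip-and-compare: count leading positions where both sequences agree."""
    n = 0
    for a, b in zip(new_tokens, cached_tokens):
        if a != b:
            break
        n += 1
    return n

def prepare_slot(cached_tokens, new_tokens, seq_id, memory_seq_rm):
    """Trim the slot's KV cache; return the position prefill should start at."""
    n_match = common_prefix_length(new_tokens, cached_tokens)
    if n_match > 0:
        # Keep positions [0, n_match); remove only the stale suffix.
        memory_seq_rm(seq_id, n_match, -1)
    else:
        # No shared prefix: clear the whole sequence.
        memory_seq_rm(seq_id, 0, -1)
    return n_match  # prefill_pos: the matched prefix is skipped
```

For example, a slot that last served tokens `[1, 2, 3, 9]` and now receives `[1, 2, 3, 4]` trims only position 3 onward and prefills from position 3.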

Additionally, prefix-affinity slot selection: when acquiring an idle slot, prefer the one whose cached tokens have the longest prefix match with the incoming request's tokens. This maximizes cache hits when multiple slots are available.
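Prefix-affinity selection is a linear scan over the idle slots. A minimal Python sketch, modeling each slot as a `(seq_id, cached_tokens)` pair (the real slot structure is an assumption here):

```python
def common_prefix_length(a, b):
    """Count leading positions where both token sequences agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def acquire_slot(idle_slots, new_tokens):
    """Pick the idle slot whose cached tokens share the longest prefix
    with the incoming request's tokens."""
    return max(idle_slots,
               key=lambda slot: common_prefix_length(new_tokens, slot[1]))
```

With slots caching `[1, 9]`, `[1, 2, 3]`, and `[]`, a request for `[1, 2, 3, 4]` lands on the second slot and skips three tokens of prefill.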

Configuration

  • cache_prompt: false (default) — identical to the existing clear-on-every-request behavior
  • cache_prompt: true — enable same-slot KV reuse (opt-in)

Telemetry

New measurements in the [:llama_cpp_ex, :server, :request, :done] telemetry event:

  • prefix_cache_tokens — number of tokens skipped via cache
  • prefix_cache_ratio — ratio of cached to total prompt tokens

Performance

Benchmarked with Qwen3-0.6B-Q8_0 on Apple M1 Max, simulating 4-turn multi-turn chat where each request extends the previous:

Scenario                 Average   Median
WITH prefix cache        487 ms    452 ms
WITHOUT prefix cache     597 ms    591 ms
Improvement              1.23x     1.31x

The improvement grows with longer shared prefixes (system prompts, multi-turn history).

Consequences

  • Multi-turn chat time-to-first-token (TTFT) improves significantly — the system prompt and prior turns don't need re-computation
  • Memory usage is unchanged — KV cache was already allocated; we just avoid clearing it
  • Prefix-affinity slot selection adds a linear scan of idle slots (negligible for typical n_parallel values)
  • Generated tokens are tracked in a reverse list (O(1) prepend) and reversed once on slot reset
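The reverse-list bookkeeping can be modeled with cons cells, which mirror Elixir's O(1) `[tok | acc]` prepend (a Python sketch; the exact accumulator shape in the real code is an assumption):

```python
def prepend(acc, tok):
    """O(1) cons: newest token sits at the head of the list."""
    return (tok, acc)

def to_list(acc):
    """Walk the cons list (newest first) and reverse exactly once,
    as happens on slot reset."""
    out = []
    while acc is not None:
        tok, acc = acc
        out.append(tok)
    out.reverse()
    return out
```

Each generated token costs a single allocation, and the full in-order history is only materialized once per request.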

Alternatives Considered

Cross-slot prefix sharing via token trie (SGLang-style)

A global radix tree mapping token sequences to dedicated cache sequence IDs, using memory_seq_cp to copy KV cache between slots. Deferred because:

  • Requires additional sequence IDs beyond n_parallel (API change)
  • KV cache capacity management across slots + cache sequences is complex
  • Same-slot reuse captures the most common case (multi-turn chat)
  • Can be added as a future enhancement building on this foundation
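For concreteness, a minimal sketch of what the deferred token-trie design might look like (Python illustration; node layout, eviction, and the `memory_seq_cp` copy step are all omitted and the details here are assumptions, not the decided design):

```python
class TrieNode:
    """One node per token; a node's seq_id names a cache sequence whose
    KV covers the path from the root to this node."""
    def __init__(self):
        self.children = {}   # token -> TrieNode
        self.seq_id = None

def insert(root, tokens, seq_id):
    """Register a cached sequence; its KV also covers every prefix."""
    node = root
    for t in tokens:
        node = node.children.setdefault(t, TrieNode())
        node.seq_id = seq_id

def longest_cached_prefix(root, tokens):
    """Return (n_match, seq_id) for the deepest cached node on the path;
    the caller would then memory_seq_cp that prefix into the slot."""
    node, best = root, (0, None)
    for i, t in enumerate(tokens):
        node = node.children.get(t)
        if node is None:
            break
        if node.seq_id is not None:
            best = (i + 1, node.seq_id)
    return best
```

Even this toy version shows the cost driver: every cached sequence needs its own sequence ID beyond n_parallel, which is exactly the API change the decision defers.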

No caching (current behavior)

Always clear KV cache. Simple but wastes compute on repeated prefixes. The cache_prompt: false option preserves this behavior for users who need it.