## Status

Accepted
## Context
The continuous batching server (ADR 006) clears the KV cache for every slot on every request via `memory_seq_rm(ctx, seq_id, 0, -1)`. This means repeated requests with shared prefixes — common in multi-turn chat, few-shot prompting, and system prompt reuse — must re-compute the same KV entries each time.
llama.cpp's own server implements per-slot prompt caching: when a slot finishes and gets a new request, it checks the common prefix between the old and new token sequences and skips prefill for the matching portion. SGLang takes this further with a global radix attention tree, but that requires page-level KV cache management that llama.cpp doesn't expose.
## Decision
Implement same-slot KV reuse — after a slot completes a request, keep its KV cache alive and store the full token history (prompt + generated tokens). When the next request arrives:

- Compute `common_prefix_length(new_tokens, cached_tokens)` — a simple zip-and-compare
- If the match is > 0: call `memory_seq_rm(ctx, seq_id, n_match, -1)` to trim only the non-matching suffix, then set `prefill_pos = n_match` to skip the matched portion
- If there is no match: clear fully with `memory_seq_rm(ctx, seq_id, 0, -1)`
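The zip-and-compare step is small enough to sketch. A minimal C model — the `llama_token` typedef and the function name mirror the description above, not the project's actual code:

```c
#include <stddef.h>

typedef int llama_token; /* stand-in for the real token type */

/* Shared-prefix length of two token sequences: zip-and-compare. */
static size_t common_prefix_length(const llama_token *a, size_t a_len,
                                   const llama_token *b, size_t b_len) {
    size_t limit = a_len < b_len ? a_len : b_len;
    size_t i = 0;
    while (i < limit && a[i] == b[i])
        i++;
    return i;
}
```

With `n_match` in hand, the slot either trims the stale suffix (`memory_seq_rm(ctx, seq_id, n_match, -1)`, then `prefill_pos = n_match`) or, on a zero-length match, clears the whole sequence.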
Additionally, prefix-affinity slot selection: when acquiring an idle slot, prefer the one whose cached tokens have the longest prefix match with the incoming request's tokens. This maximizes cache hits when multiple slots are available.
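A sketch of that linear scan — `slot_t` and `pick_slot` are illustrative names, not the server's real types:

```c
#include <stddef.h>

typedef int llama_token;

/* Hypothetical per-slot view: idle flag plus the cached token history. */
typedef struct {
    int is_idle;
    const llama_token *cached;
    size_t n_cached;
} slot_t;

static size_t common_prefix_length(const llama_token *a, size_t a_len,
                                   const llama_token *b, size_t b_len) {
    size_t limit = a_len < b_len ? a_len : b_len;
    size_t i = 0;
    while (i < limit && a[i] == b[i])
        i++;
    return i;
}

/* Linear scan over idle slots; returns the index of the slot whose cache
 * shares the longest prefix with the request, or -1 if no slot is idle. */
static int pick_slot(const slot_t *slots, size_t n_slots,
                     const llama_token *req, size_t n_req) {
    int best = -1;
    size_t best_match = 0;
    for (size_t i = 0; i < n_slots; i++) {
        if (!slots[i].is_idle)
            continue;
        size_t m = common_prefix_length(slots[i].cached, slots[i].n_cached,
                                        req, n_req);
        if (best < 0 || m > best_match) {
            best = (int)i;
            best_match = m;
        }
    }
    return best;
}
```

Ties fall to the first idle slot scanned; any idle slot beats no slot, even with a zero-length match.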
## Configuration
- `cache_prompt: false` (default) — identical to pre-caching behavior
- `cache_prompt: true` — enable same-slot KV reuse (opt-in)
## Telemetry
New measurements in `[:llama_cpp_ex, :server, :request, :done]`:

- `prefix_cache_tokens` — number of tokens skipped via the cache
- `prefix_cache_ratio` — ratio of cached to total prompt tokens
## Performance
Benchmarked with Qwen3-0.6B-Q8_0 on Apple M1 Max, simulating 4-turn multi-turn chat where each request extends the previous:
| Scenario | Average | Median |
|---|---|---|
| WITH prefix cache | 487ms | 452ms |
| WITHOUT prefix cache | 597ms | 591ms |
| Improvement | 1.23x | 1.31x |
The improvement grows with longer shared prefixes (system prompts, multi-turn history).
## Consequences
- Multi-turn chat TTFT improves significantly — the system prompt and prior turns don't need re-computation
- Memory usage is unchanged — KV cache was already allocated; we just avoid clearing it
- Prefix-affinity slot selection adds a linear scan of idle slots (negligible for typical `n_parallel` values)
- Generated tokens are tracked in a reverse list (`O(1)` prepend) and reversed once on slot reset
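A small C model of that last point — prepend each generated token in O(1), reverse once when the slot resets. The node type and function names are illustrative; the host language's own list type would normally play this role:

```c
#include <stdlib.h>

typedef int llama_token;

typedef struct tok_node {
    llama_token tok;
    struct tok_node *next;
} tok_node;

/* O(1): push each newly generated token onto the head of the list. */
static tok_node *push_token(tok_node *head, llama_token tok) {
    tok_node *n = malloc(sizeof *n);
    n->tok = tok;
    n->next = head;
    return n;
}

/* O(n), run once on slot reset: restore generation order in place. */
static tok_node *reverse_tokens(tok_node *head) {
    tok_node *prev = NULL;
    while (head) {
        tok_node *next = head->next;
        head->next = prev;
        prev = head;
        head = next;
    }
    return prev;
}
```

The single O(n) reversal amortizes to constant work per token, versus O(n) per append if tokens were kept in generation order throughout.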
## Alternatives Considered
### Cross-slot prefix sharing via token trie (SGLang-style)
A global radix tree mapping token sequences to dedicated cache sequence IDs, using `memory_seq_cp` to copy KV cache between slots. Deferred because:
- Requires additional sequence IDs beyond `n_parallel` (API change)
- KV cache capacity management across slots + cache sequences is complex
- Same-slot reuse captures the most common case (multi-turn chat)
- Can be added as a future enhancement building on this foundation
### No caching (current behavior)
Always clear the KV cache. Simple, but wastes compute on repeated prefixes. The `cache_prompt: false` option preserves this behavior for users who need it.