ADR 007: Prefix Caching (Same-Slot KV Reuse)


Status

Accepted

Context

The continuous batching server (ADR 006) clears the KV cache for every slot on every request via memory_seq_rm(ctx, seq_id, 0, -1). This means repeated requests with shared prefixes — common in multi-turn chat, few-shot prompting, and system prompt reuse — must re-compute the same KV entries each time.

llama.cpp's own server implements per-slot prompt caching: when a slot finishes and gets a new request, it checks the common prefix between the old and new token sequences and skips prefill for the matching portion. SGLang takes this further with a global radix attention tree, but that requires page-level KV cache management that llama.cpp doesn't expose.

Decision

Implement same-slot KV reuse — after a slot completes a request, keep its KV cache alive and store the full token history (prompt + generated tokens). When the next request arrives:

  1. Compute common_prefix_length(new_tokens, cached_tokens) — a simple zip-and-compare
  2. If match > 0: call memory_seq_rm(ctx, seq_id, n_match, -1) to trim only the non-matching suffix, then set prefill_pos = n_match to skip the matched portion
  3. If no match: clear fully with memory_seq_rm(ctx, seq_id, 0, -1)
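The steps above can be sketched as follows. This is a Python illustration only (the project itself is Elixir), and `memory_seq_rm` is passed in as a callable standing in for the llama.cpp binding so the sketch is self-contained:

```python
def common_prefix_length(new_tokens, cached_tokens):
    """Zip-and-compare: count leading positions where both sequences agree."""
    n = 0
    for a, b in zip(new_tokens, cached_tokens):
        if a != b:
            break
        n += 1
    return n

def prepare_slot(cached_tokens, new_tokens, seq_id, memory_seq_rm):
    """Trim the slot's KV cache; return the position prefill should start at."""
    n_match = common_prefix_length(new_tokens, cached_tokens)
    if n_match > 0:
        # Keep positions [0, n_match); remove only the stale suffix.
        memory_seq_rm(seq_id, n_match, -1)
    else:
        # No shared prefix: clear the whole sequence.
        memory_seq_rm(seq_id, 0, -1)
    return n_match  # prefill_pos: the matched prefix is skipped
```

For example, a slot that last served tokens `[1, 2, 3, 9]` and now receives `[1, 2, 3, 4]` trims only position 3 onward and prefills from position 3.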

Additionally, prefix-affinity slot selection: when acquiring an idle slot, prefer the one whose cached tokens have the longest prefix match with the incoming request's tokens. This maximizes cache hits when multiple slots are available.
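Prefix-affinity selection is a linear scan over the idle slots. A minimal Python sketch, modeling each slot as a `(seq_id, cached_tokens)` pair (the real slot structure is an assumption here):

```python
def common_prefix_length(a, b):
    """Count leading positions where both token sequences agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def acquire_slot(idle_slots, new_tokens):
    """Pick the idle slot whose cached tokens share the longest prefix
    with the incoming request's tokens."""
    return max(idle_slots,
               key=lambda slot: common_prefix_length(new_tokens, slot[1]))
```

With slots caching `[1, 9]`, `[1, 2, 3]`, and `[]`, a request for `[1, 2, 3, 4]` lands on the second slot and skips three tokens of prefill.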

Configuration

  • cache_prompt: false (default) — identical to the existing clear-on-every-request behavior
  • cache_prompt: true — enable same-slot KV reuse (opt-in)

Telemetry

New measurements in the [:llama_cpp_ex, :server, :request, :done] telemetry event:

  • prefix_cache_tokens — number of tokens skipped via cache
  • prefix_cache_ratio — ratio of cached to total prompt tokens

Performance

Benchmarked with Qwen3-0.6B-Q8_0 on Apple M1 Max, simulating 4-turn multi-turn chat where each request extends the previous:

Scenario                 Average   Median
WITH prefix cache        487 ms    452 ms
WITHOUT prefix cache     597 ms    591 ms
Improvement              1.23x     1.31x

The improvement grows with longer shared prefixes (system prompts, multi-turn history).

Consequences

  • Multi-turn chat time-to-first-token (TTFT) improves significantly — the system prompt and prior turns don't need re-computation
  • Memory usage is unchanged — KV cache was already allocated; we just avoid clearing it
  • Prefix-affinity slot selection adds a linear scan of idle slots (negligible for typical n_parallel values)
  • Generated tokens are tracked in a reverse list (O(1) prepend) and reversed once on slot reset
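The reverse-list bookkeeping can be modeled with cons cells, which mirror Elixir's O(1) `[tok | acc]` prepend (a Python sketch; the exact accumulator shape in the real code is an assumption):

```python
def prepend(acc, tok):
    """O(1) cons: newest token sits at the head of the list."""
    return (tok, acc)

def to_list(acc):
    """Walk the cons list (newest first) and reverse exactly once,
    as happens on slot reset."""
    out = []
    while acc is not None:
        tok, acc = acc
        out.append(tok)
    out.reverse()
    return out
```

Each generated token costs a single allocation, and the full in-order history is only materialized once per request.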

Alternatives Considered

Cross-slot prefix sharing via token trie (SGLang-style)

A global radix tree mapping token sequences to dedicated cache sequence IDs, using memory_seq_cp to copy KV cache between slots. Deferred because:

  • Requires additional sequence IDs beyond n_parallel (API change)
  • KV cache capacity management across slots + cache sequences is complex
  • Same-slot reuse captures the most common case (multi-turn chat)
  • Can be added as a future enhancement building on this foundation
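For concreteness, a minimal sketch of what the deferred token-trie design might look like (Python illustration; node layout, eviction, and the `memory_seq_cp` copy step are all omitted and the details here are assumptions, not the decided design):

```python
class TrieNode:
    """One node per token; a node's seq_id names a cache sequence whose
    KV covers the path from the root to this node."""
    def __init__(self):
        self.children = {}   # token -> TrieNode
        self.seq_id = None

def insert(root, tokens, seq_id):
    """Register a cached sequence; its KV also covers every prefix."""
    node = root
    for t in tokens:
        node = node.children.setdefault(t, TrieNode())
        node.seq_id = seq_id

def longest_cached_prefix(root, tokens):
    """Return (n_match, seq_id) for the deepest cached node on the path;
    the caller would then memory_seq_cp that prefix into the slot."""
    node, best = root, (0, None)
    for i, t in enumerate(tokens):
        node = node.children.get(t)
        if node is None:
            break
        if node.seq_id is not None:
            best = (i + 1, node.seq_id)
    return best
```

Even this toy version shows the cost driver: every cached sequence needs its own sequence ID beyond n_parallel, which is exactly the API change the decision defers.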

No caching (current behavior)

Always clear KV cache. Simple but wastes compute on repeated prefixes. The cache_prompt: false option preserves this behavior for users who need it.