Agents

Instructions for AI coding agents working on this project.

Project Overview

erllama is a native Erlang/OTP wrapper around llama.cpp providing OpenAI-compatible inference with full supervision and a tiered KV cache. Requires Erlang/OTP 28 and rebar3 3.25+.

Single application, flat layout:

src/        Erlang sources (erllama_*, erllama_cache_*, erllama_nif)
include/    Shared headers (erllama_cache.hrl)
c_src/      C sources for the single NIF (erllama_nif.so)
test/       eunit + PropEr property tests
priv/       Build artefact: erllama_nif.so
config/     sys.config

The KV cache is logically a subsystem (its own supervisor, modules prefixed erllama_cache_*) but lives in the same OTP application as the rest of erllama. There is one NIF (erllama_nif) that holds the entire native surface (cache pipeline + future llama.cpp wrappers).

Authoritative behaviour is encoded in the test suites under test/ (EUnit, PropEr, and Common Test) and the module docstrings. The README has the public-API tables and configuration reference.

Required Checks

Every change must be formatted and pass all checks before committing:

rebar3 fmt          # Auto-format (always run first)
rebar3 compile      # Must compile cleanly (warnings_as_errors)
rebar3 eunit        # Unit tests
rebar3 proper       # Property tests
rebar3 ct           # Common Test suites
rebar3 lint         # Elvis linter
rebar3 dialyzer     # Type checking
rebar3 xref         # Cross-reference analysis

Build & Development Commands

rebar3 compile                                    # Build
rebar3 shell                                      # Boot the umbrella
rebar3 eunit                                      # All EUnit tests
rebar3 eunit --module=erllama_cache_kvc_tests     # Specific test module
rebar3 proper                                     # All PropEr property-based tests
rebar3 ct --suite=erllama_cache_meta_SUITE        # Specific Common Test suite
rebar3 fmt                                        # Auto-format (erlfmt)
rebar3 fmt --check                                # Format check, no writes
rebar3 lint                                       # Elvis linter
rebar3 dialyzer                                   # Type checking
rebar3 xref                                       # Cross-reference
rebar3 ex_doc                                     # Generate docs

Architecture

Cache subsystem (`erllama_cache_*` modules)

erllama_cache_sup
├── erllama_cache_meta_srv     sole writer for meta + LRU + reservations
├── erllama_cache_ram          RAM tier (ETS slab store)
├── erllama_cache_ramfile_sup
│   └── erllama_cache_ramfile_srv  per ram_file root dir
├── erllama_cache_disk_sup
│   └── erllama_cache_disk_srv     per disk root dir (plain read/write)
└── erllama_cache_writer_pool  poolboy: dirty-IO save workers

Public API lives in erllama_cache.erl (a stateless facade). Hot-path read lookups go through ETS directly. Writes (claim, release, evict, save announce) go through erllama_cache_meta_srv via gen_server:call.

Save pipeline correctness invariants (do not change without review)

Cache hits are token-exact by construction. The cache key includes model fingerprint, quant byte, ctx hash, and the full token list as little-endian u32. Approximate match is not allowed at this layer.
A save's payload is read from a paused live llama_context*. The context worker pauses decode for the pack window; no off-thread reads of the live context occur.
Disk publication is via link(2) (atomic create-if-not-exists), preceded by a meta-server reservation and a check_reservation immediately before link to defeat stale-writer races. EEXIST is validated and either adopted or replaced under the current reservation; never silently skipped.
Disk reads use plain file:read_file/1 into a fresh BEAM heap binary. mmap is deliberately avoided: the process already mmaps multi-GB GGUF weights, so a second mapping per cache restore doubles the VM footprint, and a region binary surviving the NIF call would expose the BEAM to SIGBUS from any external truncation.

Multi-turn warmth

v1 has no semantic candidate proposer. Multi-turn warmth is exact and deterministic: the session layer holds the previous turn's cache key in its state and passes it as parent_key on the next request. The cache uses lookup_exact_or_wait/2 (default 500 ms) to wait for an in-flight finish save to publish before falling through to cold.

The canonical pattern is claim, unpack, checkin (in that order): the holder is released before generation, so the slab returns to refcount=0 and is evictable while the user reads the streamed response.

Test Organization

test/<mod>_tests.erl: EUnit unit tests
test/prop_<mod>.erl: PropEr property-based tests
test/<feature>_SUITE.erl: Common Test suites

Real-model CT suite

erllama_real_model_SUITE exercises the llama.cpp backend against a real GGUF file. Disabled unless LLAMA_TEST_MODEL points at a valid model:

LLAMA_TEST_MODEL=/path/to/tinyllama-1.1b-q4_k_m.gguf rebar3 ct \
    --suite=test/erllama_real_model_SUITE

Without the env var the suite skips so default rebar3 ct runs stay green on CI without model files.

Linting Notes

Elvis rules are configured in elvis.config. The set is intentionally lean at v0; per-module ignores will be added as the codebase grows. The atom naming regex allows _SUITE suffix for CT suites: ^[a-z](_?[a-zA-Z0-9]+)*(_SUITE)?$. Max line length is 120.

Coding conventions

Default to writing no comments. Only annotate non-obvious why (a hidden constraint, an invariant, a workaround). Don't restate what the code does.
Erlang/OTP idioms: gen_statem, gen_server, supervisor, ETS. No magic, no DSL wrappers.
ETS reads are hot-path; ETS writes are funnelled through one owner process per table.
NIFs run on dirty schedulers (CPU or IO as appropriate). No NIF performs file I/O directly; framing and validation live in Erlang code that calls a small set of pure-data NIFs (kv_pack, kv_unpack, crc32c).
Configuration validation runs at supervisor init/1; misconfig is a hard start_link error, not a runtime warning.

What to avoid

No iolist_to_binary flattening of multi-GB payloads. Use iolists for prim_file:write/2.
No ets:select_replace/2 on the hot path; the meta server is the arbitration authority.
No silent EEXIST handling on link; always validate-and-adopt or replace.
No reliance on enif_resource_refcount; use the two-resource lifetime pattern (handle resource holding a mapping resource via enif_keep_resource).
No semantic candidate proposer in v1 (deferred to v2).
No KV state compression in v1 (TurboQuant is unproven for that; generic lz4/zstd is a future option).

When in doubt

Re-read the test suite for the area you're touching — every non-obvious invariant in the cache publish protocol, the reservation state machine, the warm-restore logits primer, and the NIF safety wrappers has a dedicated case. Surface tension with existing tests to the human reviewer before changing the behaviour.

← Previous Page NIF safety

Next Page → Updating llama.cpp