erllama
erllama is a native Erlang/OTP wrapper around llama.cpp with a
token-exact, multi-tier, supervised KV cache. It turns a
multi-second prefill into a millisecond restore, and lets you keep
more warm state than fits in RAM by promoting cold-but-popular
prefixes down to the disk tier.
If you have ever waited five seconds for a chat assistant to acknowledge "hello" — that's prompt prefill. erllama caches the work so the second turn, the third turn, and every subsequent agent sharing the same system prompt skip it.
A 30-second taste
1> {ok, _} = application:ensure_all_started(erllama).
2> Path = "/srv/models/tinyllama-1.1b-chat.Q4_K_M.gguf".
3> {ok, Bin} = file:read_file(Path).
4> {ok, M} = erllama:load_model(#{
backend => erllama_model_llama,
model_path => Path,
fingerprint => crypto:hash(sha256, Bin)
}).
{ok, <<"erllama_model_2375">>}
5> {ok, Reply, _} = erllama:complete(M, <<"Once upon a time">>).
%% ~3 s on a CPU box. Prompt prefill, async cold save fired.
6> {ok, Reply2, _} = erllama:complete(M, <<"Once upon a time">>).
%% ~10 ms. Cache hit; KV state restored, one decode for fresh logits.
7> {ok, _, _} = erllama:complete(M, <<"Once upon a time, in a quiet village">>).
%% ~50 ms. Longest-prefix walk found the cached row even though
%% the new prompt is longer.
load_model/1 returns a binary model_id that is also the registered
name for the underlying gen_statem. Pass it to complete/2,3,
unload/1, etc.
That is the whole pitch. The cache is on by default, runs under its own supervisor, and never returns approximate matches.
What you get
- Many models in one BEAM. Load TinyLlama and Llama-3-8B side by side, hot-swap a model without bouncing the cache, give each model its own policy and tier. One shared cache; rows are fingerprint-segregated so models never collide on identical prompts.
- Token-exact hits. Cache key is sha256(model_fp || quant || ctx_params || tokens_le32). Same tokens, same key, guaranteed-correct restore (see the sketch after this list).
- Three storage tiers. ETS slabs for the hottest data, files on /dev/shm for the warm working set, on-disk files (plain read I/O) for everything else. Each tier supervised independently with its own quota and LRU.
- Bigger than RAM. Disk is a first-class tier, not a fallback. A 70B model in Q4 already takes ~40 GB of weights; the disk tier holds the warm KV state your working set needs without crowding weights out of RAM.
- Shared-prefix hits across agents. Spawn N workers that all start with the same system prompt: the first cold-prefills, every subsequent worker gets a longest-prefix hit on the shared part.
- Multi-turn warmth. Pass the previous turn's parent_key and the cache waits up to session_resume_wait_ms for the in-flight finish save to publish.
- Stateless-friendly. OpenAI/Anthropic-shaped servers that resend the full conversation each turn get hits automatically through a longest-prefix walk. No parent_key needed.
- Crash-safe saves. Reserve, write temp, validate, atomic link(2), announce. Two-stage TTL cleanup adopts orphans on writer crash.
- Memory-pressure-driven eviction. Pluggable pressure source (memsup, nvidia-smi, or your own callback). Off by default.
- Always-on metrics. Hits, misses, saves, evictions, and per-path latency totals exposed via erllama_cache:get_counters/0. Per-counter cost is ~10-20 ns; you cannot meaningfully turn them off.
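To make the token-exact claim concrete, here is a sketch of a key with the shape listed above. It is illustrative only: erllama_cache_key:make/1 (used in the multi-turn example further down) is the real entry point, and the exact field encodings below are assumptions.
%% Illustrative sketch: the same fingerprint, quant, context params and
%% token ids always yield the same 256-bit key. Encodings are assumptions.
sketch_key(Fingerprint, QuantType, CtxParamsHash, Tokens) ->
    TokensLe32 = << <<T:32/little-unsigned>> || T <- Tokens >>,
    crypto:hash(sha256, [Fingerprint,
                         atom_to_binary(QuantType),
                         CtxParamsHash,
                         TokensLe32]).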
Installation
erllama targets Erlang/OTP 28 with rebar3 3.25+.
Add to rebar.config:
{deps, [
{erllama, "~> 0.1"}
]}.
Then in your supervision tree, wait for the application to start before loading models:
ok = application:ensure_started(erllama).
The first compile builds the vendored llama.cpp (~3 minutes on a fast machine). Subsequent builds are cached. See requirements for the toolchain.
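For example, an application callback that follows that rule and loads one model at boot. The module names and the model path are hypothetical; only the erllama calls are from the examples above.
-module(myapp_app).
-behaviour(application).
-export([start/2, stop/1]).

%% Hypothetical application callback: erllama is up before any model loads.
start(_Type, _Args) ->
    {ok, _Started} = application:ensure_all_started(erllama),
    Path = "/srv/models/tinyllama-1.1b-chat.Q4_K_M.gguf",
    {ok, Bin} = file:read_file(Path),
    {ok, _ModelId} = erllama:load_model(#{
        backend     => erllama_model_llama,
        model_path  => Path,
        fingerprint => crypto:hash(sha256, Bin)
    }),
    myapp_sup:start_link().

stop(_State) ->
    ok.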
Documentation
| Guide | What it covers |
|---|---|
| Loading a model | Every option to erllama:load_model/1,2, with examples and pitfalls. |
| Caching | Tiers, save reasons, lookup paths, watermarks. The operator's manual. |
| Configuration | Full sys.config and per-model option reference. |
| Building | Platform-specific build notes (Linux, macOS, FreeBSD), CUDA/Metal toggles, common build issues. |
| Examples | Drop-in patterns for one-shot completion, stateless HTTP servers, multi-turn sessions, concurrent agents, cache inspection. |
For the API reference (erllama, erllama_cache, erllama_scheduler,
erllama_nif), see the generated module docs on
HexDocs or run rebar3 ex_doc
locally.
For the design rationale behind the cache:
- Cache design — why multi-tier, why token-exact, what was deliberately left out.
- Publish protocol — the five-stage crash-safe save protocol.
- NIF safety — two-resource lifetime,
exception shim, why disk reads use plain
file:read_file/1.
Many models in one BEAM
Each loaded model is its own supervised gen_statem under
erllama_model_sup. The cache is process-wide and segregates rows
by fingerprint, so the only thing two models share is the byte
budget.
{ok, _} = erllama:load_model(<<"tiny">>, TinyConfig).
{ok, _} = erllama:load_model(<<"big">>, BigConfig).
{ok, R1, _} = erllama:complete(<<"tiny">>, <<"summarise: ...">>).
{ok, R2, _} = erllama:complete(<<"big">>, <<"deep analysis of: ...">>).
ok = erllama:unload(<<"tiny">>).
| Capability | How |
|---|---|
| N models in one BEAM | load_model/2 per binary id; each is one gen_statem |
| No cross-model collisions | Cache key includes the model fingerprint |
| Hot-swap a model | unload/1 then load_model/2; the cache survives |
| Per-model policy | policy => #{...} on the load; merges over app-env defaults |
| Per-model tier | tier_srv => MyDisk, tier => disk per model |
| Shared-prefix hits across agents | Longest-prefix walk on every cold prompt |
| Concurrent saves bounded | Single writer pool with a leak-proof semaphore |
Tested end-to-end in
test/erllama_SUITE.erl:concurrent_complete_under_writer_cap —
four models with distinct fingerprints running parallel completions
under one writer cap.
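A sketch of the shared-prefix pattern behind that test: N concurrent agents built on the same system prompt. The first completion pays the cold prefill; the rest get longest-prefix hits on the shared part. Prompt text and message shapes here are illustrative.
run_agents(ModelId, N) ->
    System = <<"You are a terse assistant. ">>,
    Parent = self(),
    [spawn_link(fun() ->
         Prompt = <<System/binary, "Task #", (integer_to_binary(I))/binary>>,
         {ok, Reply, _Tokens} = erllama:complete(ModelId, Prompt),
         Parent ! {agent_done, I, Reply}
     end) || I <- lists:seq(1, N)],
    %% Collect replies; the shared system prompt was prefilled only once.
    [receive {agent_done, I, Reply} -> {I, Reply} end || I <- lists:seq(1, N)].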
A slightly longer example
A real load with all the cache parameters. The disk tier requires a
running erllama_cache_disk_srv started by the operator; the RAM tier
(erllama_cache_ram) starts automatically with the application.
{ok, _} = erllama_cache_disk_srv:start_link(my_disk, "/var/lib/erllama/kvc"),
{ok, Bin} = file:read_file("/srv/models/llama-3.1-8b.Q4_K_M.gguf"),
Fp = crypto:hash(sha256, Bin),
CtxHash = crypto:hash(sha256, term_to_binary({8192, 4096})),
{ok, M} = erllama:load_model(#{
backend => erllama_model_llama,
model_path => "/srv/models/llama-3.1-8b.Q4_K_M.gguf",
model_opts => #{n_gpu_layers => 99},
context_opts => #{n_ctx => 8192, n_batch => 4096},
fingerprint => Fp,
fingerprint_mode => safe,
quant_type => q4_k_m,
quant_bits => 4,
ctx_params_hash => CtxHash,
context_size => 8192,
tier_srv => my_disk,
tier => disk,
policy => #{
boundary_trim_tokens => 32,
boundary_align_tokens => 256,
session_resume_wait_ms => 500
}
}).
Stateless OpenAI/Anthropic-shaped server:
handle_completion(ModelId, Prompt) ->
{ok, Reply, _Tokens} = erllama:complete(ModelId, Prompt),
Reply.
No parent_key. The cache walks the new prompt backward by the
configured stride and finds the longest cached prefix. If the new
prompt is yesterday's conversation plus one fresh turn, the walk
hits.
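Conceptually, the walk looks like this. A simplified stand-alone sketch, not erllama's internals; the real stride and probe order live behind the cache API.
%% Probe progressively shorter token prefixes, stepping back Stride tokens
%% at a time, until a cached row is found or the prompt is exhausted.
longest_prefix_walk(Tokens, Stride, Lookup) when is_function(Lookup, 1) ->
    walk(length(Tokens), Tokens, Stride, Lookup).

walk(Len, _Tokens, _Stride, _Lookup) when Len =< 0 ->
    miss;
walk(Len, Tokens, Stride, Lookup) ->
    case Lookup(lists:sublist(Tokens, Len)) of
        {ok, Row} -> {hit, Len, Row};
        miss      -> walk(Len - Stride, Tokens, Stride, Lookup)
    end.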
Stateful Erlang-native multi-turn: the session layer threads
parent_key between turns. The previous turn's finish-save key is
the parent of the next call. It is held by the calling session
process, not retrieved from the cache.
%% First turn: cold prefill. The model fires an async finish save
%% whose key is sha256(fingerprint || quant || ctx_params || tokens).
{ok, R1, Tokens1} = erllama:complete(M, Prompt1),
ParentKey = erllama_cache_key:make(#{
fingerprint => Fp,
quant_type => q4_k_m,
ctx_params_hash => CtxHash,
tokens => Tokens1
}),
%% Second turn: pass ParentKey to skip the longest-prefix walk.
{ok, R2, _} = erllama:complete(M, Prompt2, #{parent_key => ParentKey}).
Inspect cache state from a shell:
1> erllama_cache:get_counters().
#{hits_exact => 142, hits_resume => 17, hits_longest_prefix => 89,
misses => 12, saves_cold => 12, saves_continued => 67,
saves_finish => 31, evictions => 3, ...}
2> erllama_cache_meta_srv:dump().
%% List of raw ETS rows:
%% {Key, Tier, Size, LastUsedNs, Refcount, Status, HeaderBin,
%% Location, TokensRef, Hits}
[{<<_:256>>, disk, 8388608, 1737..., 0, available, _, _, _, 4}, ...]
Requirements
- Erlang/OTP 28
- rebar3 3.25+
- C++17 toolchain (Apple clang or recent gcc; cmake >= 3.20)
- Apple Silicon: Metal + Accelerate auto-detected.
- Linux: BLAS auto-detected; CUDA off by default (toggle via ERLLAMA_OPTS=-DGGML_CUDA=ON).
- FreeBSD: erlang-runtime28 from ports, plus cmake, bash, and gmake.
Architecture at a glance
erllama_sup
├── erllama_cache_sup
│ ├── erllama_cache_meta_srv sole writer; meta + LRU + reservations
│ ├── erllama_cache_ram RAM tier (ETS slabs)
│ ├── erllama_cache_ramfile_srv ram_file tier
│ ├── erllama_cache_disk_srv disk tier (plain read/write I/O)
│ └── erllama_cache_writer writer pool, leak-proof semaphore
├── erllama_model_sup simple_one_for_one for dynamic models
└── erllama_scheduler memory-pressure poller (off by default)
Inside a request:
- erllama:complete/2 enters the per-model gen_statem.
- prefilling — tokenize, then either hit the cache and kv_unpack (warm) or run llama_decode over the prompt (cold). The cold path fires an async cold save at the trimmed-prefix boundary.
- generating — token-by-token greedy llama_decode. Every continued_interval tokens, fire an async continued save.
- idle — fire an async finish save for the full prompt + reply. The KV state becomes evictable.
For the publish protocol, the reservation state machine, and the exception-safe NIF wrappers, see internals/publish-protocol.md and internals/nif-safety.md.
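The core of that save protocol is write-to-temp then hard-link. A minimal stand-alone sketch of the idea, with the reservation, header validation, and meta announcement reduced to placeholders; names here are hypothetical, not erllama's modules.
%% Write the payload under a temp name, validate it, then link(2) it to the
%% final name: the final name either exists fully written or not at all.
publish(Dir, Key, Payload, Announce) when is_function(Announce, 2) ->
    Bytes = iolist_to_binary(Payload),
    Tmp   = filename:join(Dir, "tmp-" ++ integer_to_list(erlang:unique_integer([positive]))),
    Final = filename:join(Dir, binary_to_list(binary:encode_hex(Key))),
    ok = file:write_file(Tmp, Bytes),
    true = filelib:file_size(Tmp) =:= byte_size(Bytes),   %% validate before exposing
    ok = file:make_link(Tmp, Final),                      %% atomic publish
    ok = file:delete(Tmp),                                %% retire the temp name
    Announce(Key, Final).                                 %% e.g. notify the meta owner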
Status
Pre-release. Core cache, scheduler, and NIF: 166 EUnit + 11
PropEr + 7 stub Common Test cases. End-to-end CT suite gated on
LLAMA_TEST_MODEL (6 cases, passing locally with TinyLlama 1.1B
Q4_K_M).
See CHANGELOG.md for the release notes.
Contributing
The contributor guide is AGENTS.md. The short version:
rebar3 fmt # auto-format (always run first)
rebar3 compile # warnings_as_errors
rebar3 eunit # unit tests
rebar3 proper # property tests
rebar3 ct # Common Test (without a real model)
rebar3 lint # Elvis
rebar3 dialyzer # static analysis
rebar3 xref # cross-reference
End-to-end against a real GGUF:
LLAMA_TEST_MODEL=/path/to/tinyllama-1.1b-chat.gguf \
rebar3 ct --suite=test/erllama_real_model_SUITE
Bumping the vendored llama.cpp: see UPDATE_LLAMA.md.
Coming next: erllama_cluster
A separate OTP application is in development to coordinate a fleet of erllama nodes into a single inference cluster. Each node continues to run erllama as a standalone library — local model loading, local KV cache, local inference. The cluster layer sits on top and decides which node serves which request.
Three distribution strategies, all in v1:
- Request distribution with pluggable load-balancing and cache-affinity routing — follow-up requests prefer the node that warmed the KV cache for the prefix.
- Speculative decoding across nodes — small draft model on one node, large verifier on another, coordinated per request.
- Pipeline parallelism — models too large for one node split by layer ranges across multiple nodes, hidden states passed between shards as Erlang binaries.
Transport is QUIC, via Erlang distribution carried over
erlang_quic — a pure Erlang
QUIC implementation, no C NIF in the protocol path. Circuit breakers
per {Node, ModelId} driven by nodeup/nodedown rather than
application-level pings. A globally registered scheduler handles
cluster-wide GPU budgeting and on-demand model placement, with local
fallback schedulers elected by pg quorum on network partition.
Repository: https://github.com/erllama/erllama_cluster (under construction).
Acknowledgements
Same idea as antirez/ds4.
License
MIT. Copyright (c) 2026 Benoit Chesneau. See LICENSE.
The vendored c_src/llama.cpp/ retains its upstream MIT license; see
c_src/llama.cpp/LICENSE.