# Changelog
All notable changes to erllama are documented here. The format follows Keep a Changelog, and this project adheres to Semantic Versioning.
## Unreleased
## 0.1.0 — 2026-05-11
Initial public release.
### Added
- Native Erlang/OTP wrapper around llama.cpp via a single dirty-scheduler NIF
  (`erllama_nif`) covering model load, context construction, tokenisation,
  prefill, single-token decode, and KV pack/unpack.
- Models are identified by `binary()` on the public API. `erllama:load_model/2`,
  `complete/2,3`, `unload/1`, `status/1`, `evict/1`, and `shutdown/1` take
  `binary() | pid()`. Internal registration uses
  `{via, erllama_registry, BinaryId}` so user-supplied ids cannot exhaust the
  atom table.
- `erllama:list_models/0` returning `[model_info()]`, and `erllama:model_info/1`
  keyed on a model id.
- Public `erllama:tokenize/2` and `erllama:detokenize/2` keyed on a model id.
  The low-level `erllama_nif:tokenize/3` and `erllama_nif:detokenize/2` remain
  available.
- `erllama:unload_model/1` as an alias for `erllama:unload/1`, matching the
  OpenAI/Ollama-style naming downstream HTTP servers use.
- `erllama:infer/4` streaming inference. Returns `{ok, Ref}`; tokens are
  delivered to the caller as `{erllama_token, Ref, _}`,
  `{erllama_done, Ref, Stats}`, and `{erllama_error, Ref, Reason}`.
- `erllama:cancel/1`. Idempotent and fire-and-forget; observed between tokens.
- `erllama:apply_chat_template/2`. Renders a normalised chat request
  (`messages`, `system`, `tools`) through the model's GGUF chat template and
  tokenises. Backed by `llama_chat_apply_template`.
- `erllama:embed/2`. Per-sequence pooled embedding via
  `llama_get_embeddings_seq` with last-token fallback.
- Sampler parameters. `complete/3` and `infer/4` honour `temperature`, `top_k`,
  `top_p`, `min_p`, `repetition_penalty`, `seed`, and `grammar` via one combined
  chain builder (`erllama_nif:configure_sampler/2`). Chain order:
  `grammar -> repetition_penalty -> top_k -> top_p -> min_p ->
  (temperature > 0 ? temp -> dist(seed) : greedy)`. `set_grammar/2` is retained
  as a backwards-compatible alias.
- LoRA adapters. `erllama:load_adapter/2`, `unload_adapter/2`,
  `set_adapter_scale/3`, `list_adapters/1`. Per-adapter sha256 + scale fold into
  the cache key via `erllama_cache_key:effective_fingerprint/2`, so rows
  produced with adapter A never collide with rows from adapter B.
  Snapshot-at-admission semantics keep in-flight requests on their original
  fingerprint even if an adapter mutation arrives mid-generation.
- Concurrent request queue. A second `complete/3` or `infer/4` arriving while
  one is in flight is queued FIFO instead of getting `{error, busy}`. The reply
  `{ok, Ref}` is sent as soon as the call is admitted; streaming events follow
  once the queue head advances to the request.
- Seq-aware NIFs (infrastructure for 0.2 multi-seq batching). `nif_kv_pack/4`
  accepts an explicit `seq_id`; a new `erllama_sampler_t` resource owns a
  standalone `llama_sampler*`; `nif_sampler_new/2`, `nif_sampler_free/1`.
  Cache rows stay seq-id-free: `seq_id` is a save/load call argument, never row
  metadata.
- Grammar-constrained sampling. Pass `grammar => GBNF` in the `complete/3` Opts
  or `infer/4` Params; the per-model sampler chain is rebuilt as grammar then
  greedy for the duration of the request and reset on completion or
  cancellation.
- `erllama_registry` module: ETS-backed `via` callback for binary model ids.
- `erllama_inflight` module: a `Ref -> ModelPid` table so `cancel/1` routes to
  the right gen_statem.
- `erllama_model_backend` optional callbacks: `apply_chat_template/2`,
  `embed/2`, `set_grammar/2`, `configure_sampler/2`, `clear_sampler/1`,
  `load_adapter/2`, `unload_adapter/2`, and `apply_adapters/2`. Backends that
  omit them surface `{error, not_supported}` from the public API.
- Decode loop schedules each step via `gen_statem:cast(self(), decode_step)`
  instead of `next_event`, so cancel, evict, status, and queued requests
  interleave fairly between tokens.
- Token-exact KV cache with three independently supervised tiers: RAM (ETS
  slabs), `ram_file` (`/dev/shm`), and disk (plain read I/O).
- Sole-writer arbitration through `erllama_cache_meta_srv`; reads on the hot
  path go to ETS directly. `lookup_exact/1` is a single atomic `ets:lookup`
  (no two-call race), and the meta server cancels waiter timers when an early
  reply lands so the mailbox doesn't bloat under load.
- Crash-safe save publish protocol: reserve, write_tmp, check, `link(2)`,
  mark_published; two-stage TTL cleanup with orphan adoption.
- Five save reasons (`cold`, `continued`, `finish`, `evict`, `shutdown`) with
  async/sync semantics matching their use.
- `saves_dropped` counter: bumps whenever a back-pressured writer pool refuses
  a save the model wanted to fire.
- Multi-turn warmth via explicit `parent_key` resume and a stateless
  longest-prefix walk for OpenAI/Anthropic-shaped clients.
- `erllama_scheduler` memory-pressure poller with pluggable sources (`memsup`,
  `nvidia-smi`, custom callback). Off by default.
- `erllama_cache_writer`: a poolboy-backed dirty-IO writer pool with a
  leak-proof reservation semaphore.
- Persisted hit counters (u32 in the disk header) so popular prefixes survive
  an LRU walk after restart.
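Taken together, the sampler and grammar entries above describe the options surface of `complete/3`. The following is a hedged sketch of a call site, not documented usage: the option keys come from the entries above, while the model id, the `load_model/2` options map, the prompt, and the binding of the result are illustrative assumptions.

```erlang
%% Hedged sketch. Option keys (temperature, top_k, top_p, min_p,
%% repetition_penalty, seed, grammar) come from the changelog entries above;
%% the model id, load options, and prompt are illustrative only.
{ok, _Pid} = erllama:load_model(<<"tinyllama">>,
                                #{path => "models/tinyllama.gguf"}),
Opts = #{temperature => 0.7,
         top_k => 40,
         top_p => 0.9,
         min_p => 0.05,
         repetition_penalty => 1.1,
         seed => 42,
         %% Optional: constrain the output with a GBNF grammar.
         grammar => <<"root ::= \"yes\" | \"no\"">>},
Result = erllama:complete(<<"tinyllama">>, <<"Is Erlang fun?">>, Opts).
```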
- End-to-end metrics: hits/misses/saves/evictions plus per-path latency totals
  (`pack_total_ns`, `load_total_ns`, `longest_prefix_ns`,
  `longest_prefix_probes`).
- Per-resource `pthread_mutex` and a two-resource lifetime pattern for safe
  concurrent `free_*/1` plus dirty NIF ops.
- `extern "C"` `noexcept` shim catching every llama.cpp C++ exception at the
  boundary; a `decode_one` defensive guard against `GGML_ASSERT` aborts.
- Bench harness (`bench/run.sh`) with a cold-vs-warm matrix and a 4-agent
  shared-prefix scenario. TinyLlama and LLaMA-3 8B presets.
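The streaming events listed under `erllama:infer/4` above are plain Erlang messages, so a caller can drive a request with an ordinary receive loop. A minimal sketch follows; the message tuple shapes are taken from the entries above, while the `infer/4` argument shapes and the timeout value are illustrative assumptions.

```erlang
%% Hedged sketch. The {erllama_token, _, _}, {erllama_done, _, _}, and
%% {erllama_error, _, _} shapes match the changelog entries above; the
%% infer/4 arguments (model id, prompt, params, opts) are illustrative.
stream(Model, Prompt) ->
    {ok, Ref} = erllama:infer(Model, Prompt, #{temperature => 0.8}, #{}),
    collect(Ref, []).

collect(Ref, Acc) ->
    receive
        {erllama_token, Ref, Tok} ->
            collect(Ref, [Tok | Acc]);
        {erllama_done, Ref, _Stats} ->
            {ok, lists:reverse(Acc)};
        {erllama_error, Ref, Reason} ->
            {error, Reason}
    after 30000 ->
        %% Cancellation is idempotent and observed between tokens.
        erllama:cancel(Ref),
        {error, timeout}
    end.
```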
### Changed
- NIF hardening. `llama_backend_init` is deferred to the first `nif_load_model`
  via `pthread_once`; cache-only and unit-test workloads no longer pay the
  `ggml_backend_load_all` cost at NIF load. `nif_tokenize` and `nif_detokenize`
  honour `release_pending`, so a model returned by `free_model/1` as
  `{ok, deferred}` cannot be reused via tokenize. `nif_detokenize` fails closed
  on `n_vocab <= 0` (matching `nif_prefill`); a 16 M-token cap on the size
  computation silences gcc's `-Walloc-size`.
- Dead `atom_not_implemented` and the `_unused_anchor` hack removed
  (`-Wpedantic`). `make_errno_atom` now maps FreeBSD's `EINTEGRITY` to
  `eintegrity` instead of `unknown`.
- Cache meta server. The sweep timer is cancelled in `terminate/2`, so a
  supervisor restart never leaves a zombie timer firing into a fresh server.
  `pin_and_load/2` wraps load + unpack in `try/after` so the holder is always
  checked in.
- Build. `FindErlang.cmake`, adopted from erlang-rocksdb, replaces the inline
  `erl -noshell -eval` snippet that detected `ERTS_INCLUDE_DIR`.
  `set(GGML_CCACHE OFF CACHE BOOL "" FORCE)` silences the "ccache not found"
  diagnostic ggml emits on every build.
- Scheduler tests. Three bad-config cases now call
  `erllama_scheduler:validate_config/1` directly (now exported) instead of
  spawning a `gen_server` with bad config, so eunit no longer prints
  `=CRASH REPORT=` SASL output.
- Dialyzer. `response_target` typed as `non_neg_integer()` (the idle state
  holds 0); `inet:gethostname` and `application:get_key` results are
  pattern-matched directly. The `erllama.erl` moduledoc fence closer is fixed.
### Removed
- mmap from the disk tier. The cache now reads files via plain
  `file:read_file/1` into a fresh BEAM heap binary; the `iommap` dependency,
  the `disk_io` configuration option, and the 4-arity `start_link` form of
  `erllama_cache_disk_srv` are gone. The process already mmaps multi-GB GGUF
  weights, and a region binary that outlived its closing NIF call would have
  exposed the BEAM to SIGBUS from any external truncation.
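The replacement path is ordinary stdlib I/O. A one-line sketch (the path variable is illustrative); because `file:read_file/1` copies the bytes into a fresh heap binary, later truncation of the on-disk file cannot crash the VM the way a dangling mmap region could:

```erlang
%% Plain read into a heap-owned binary; RowPath is an illustrative name.
{ok, RowBin} = file:read_file(RowPath),
```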
### CI
- `actions/checkout` and `actions/cache` bumped to `@v5` (Node.js 24).
- `xref`, `dialyzer`, `erlfmt`, and `elvis` promoted to gate jobs; `build`,
  `eunit`, `proper`, `ct`, and `freebsd` depend on them.
- macOS matrix is `macos-14, macos-15`.
- FreeBSD matrix added: `release: ['14.2', '14.4']`. Inside the VM: refresh
  `pcre2` so git can run, install `git`, and set
  `git config --global --add safe.directory '*'` so llama.cpp's build-info
  `git rev-parse` succeeds.
- `erllama_nif_tests:load_model_rejects_non_existent_path_test` is now a
  generator with a 60 s timeout to absorb the lazy Metal init on macOS.
### Tests
- 211 EUnit + PropEr property tests, plus 7 stub Common Test cases.
- Real-model Common Test suite gated on `LLAMA_TEST_MODEL` (14 cases, including
  seed determinism, grammar + sampler, apply_chat_template, embeddings, and a
  KV pack/unpack round-trip).
- New stub-backed coverage: sampler params (`erllama_sampler_tests`), LoRA
  adapters + cache identity (`erllama_lora_tests`), and FIFO queueing of
  concurrent infers (`erllama_streaming_tests`).
- Multi-platform CI: Ubuntu 24.04 amd64, Ubuntu 24.04 arm64, macOS 14 + 15
  (Apple Silicon), and FreeBSD 14.2 + 14.4. OTP 28 across the matrix.
### Documentation
- README rewritten as a friendly entry point with snippets.
- User guides: loading, caching, configuration, building, examples.
- Internal design notes: cache design, publish protocol, NIF safety.
- ex_doc-friendly module documentation throughout.
- `ROADMAP.md`: what 0.1 doesn't do yet (multi-seq concurrent decoding,
  speculative decoding, vision, audio, ONNX/safetensors, stop-sequences,
  telemetry hooks, multi-GPU pressure, KV compression, cluster).
- The README closes with a teaser for the upcoming `erllama_cluster`
  application: a separate OTP project that coordinates a fleet of erllama nodes
  (request distribution, cross-node speculative decoding, pipeline parallelism
  over QUIC).
### Acknowledgements
Same idea as antirez/ds4.