erllama_model (erllama v0.1.0)
Per-model gen_statem that drives the request flow and wires the cache subsystem into the model lifecycle.
State machine (v0.1):
idle ──complete──▶ prefilling ──prefill_done──▶ generating ──finish──▶ idle

On the prefilling → generating transition the model fires a cold save (boundary-trimmed prefix, async). Inside generating it fires a finish save (full live token list, async) just before returning to idle.
The continued save reason (every N tokens of new generation),
the evict save reason (driven by an external scheduler), and the
shutdown save reason (driven by application:prep_stop) are
defined in erllama_cache_policy but not yet wired here; they
land in follow-up steps.
Model operations (tokenize, prefill, decode, kv_pack, kv_unpack)
are stubbed — the gen_statem's tokens field IS the "context".
When step 2b lands the real erllama_nif for llama.cpp, those
stubs get replaced; the cache integration is unaffected.
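As a rough illustration of the prefilling → generating transition described above, the cold save could be fired from the gen_statem state function along these lines. This is a hypothetical sketch only: erllama_cache:save_async/2 and trim_to_boundary/1 are illustrative names, not actual erllama functions, and the state data is assumed to be a map.

    %% Hypothetical sketch; the real state function lives in erllama_model
    %% and may differ in shape and naming.
    prefilling(info, {prefill_done, Tokens}, Data) ->
        %% Fire the cold save asynchronously with a boundary-trimmed prefix,
        %% then move on to generating without waiting for the save to finish.
        erllama_cache:save_async(cold, trim_to_boundary(Tokens)),
        {next_state, generating, Data#{tokens => Tokens}}.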
Summary
Functions
apply_chat_template/2
  Render a normalised chat request through the model's chat template and tokenise in one step.
cancel/1
  Cancel an in-flight streaming inference. Idempotent and fire-and-forget.
detokenize/2
  Detokenise a list of token IDs back to a string.
embed/2
  Compute an embedding vector for the given prompt tokens.
evict/1
  Request that the model evict its current state, firing an evict save synchronously if there is anything in the context.
infer/4
  Streaming inference. Admits a request and immediately returns a unique reference(); tokens are delivered to the caller via asynchronous messages.
list_adapters/1
  List currently attached adapters as [#{handle => H, scale => F}].
load_adapter/2
  Load a LoRA adapter from a GGUF file and attach it to the model with scale 1.0.
model_info/1
  Snapshot of the model's metadata.
set_adapter_scale/3
  Change an attached adapter's scale. Re-applies the full set on the underlying context.
shutdown/1
  Fire a shutdown save synchronously and return, so live state survives a graceful restart.
tokenize/2
  Tokenise a string using the model's tokenizer.
unload_adapter/2
  Detach + free a previously loaded adapter. Idempotent.
Types
-type cache_hit_kind() :: exact | partial | cold.
-type finish_reason() :: stop | length | cancelled.
-type infer_params() ::
    #{response_tokens => pos_integer(),
      parent_key => term(),
      temperature => float(),
      top_p => float(),
      top_k => pos_integer(),
      min_p => float(),
      repetition_penalty => float(),
      seed => non_neg_integer(),
      stop => [binary()],
      grammar => binary(),
      _ => _}.
-type model() :: erllama_registry:model_id() | pid().
-type model_info() ::
    #{id := binary(),
      pid := pid(),
      status := idle | prefilling | generating,
      backend := module(),
      context_size := non_neg_integer(),
      quant_type := atom(),
      quant_bits := non_neg_integer(),
      tier := disk | ram_file,
      fingerprint := binary()}.
-type stats() ::
    #{prompt_tokens := non_neg_integer(),
      completion_tokens := non_neg_integer(),
      prefill_ms := non_neg_integer(),
      generation_ms := non_neg_integer(),
      cache_hit_kind := cache_hit_kind(),
      finish_reason := finish_reason(),
      cancelled := boolean()}.
Functions
-spec apply_chat_template(model(), erllama_model_backend:chat_request()) -> {ok, [non_neg_integer()]} | {error, term()}.
Render a normalised chat request through the model's chat template
and tokenise in one step. The Request map carries messages,
system, and tools; the per-model template decides where each
field lands in the prompt.
Returns {error, not_supported} if the backend does not implement
chat templating.
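A usage sketch, assuming Model is a registered model id or pid. The exact shape of the message entries is defined by erllama_model_backend:chat_request(); the role/content keys and the message contents shown here are illustrative assumptions.

    Request = #{messages => [#{role => user, content => <<"What is OTP?">>}],
                system => <<"You are a concise assistant.">>,
                tools => []},
    {ok, PromptTokens} = erllama_model:apply_chat_template(Model, Request).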
-spec cancel(reference()) -> ok.
Cancel an in-flight streaming inference. Idempotent and fire-and-forget: returns ok even if the ref is unknown (already finished or never existed). The cancellation is observed at the next inter-token boundary; the model emits a final {erllama_done, Ref, Stats} with cancelled => true after the running decode step completes.
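For example, as a small helper that starts a request, cancels it, and waits for the final done message (tokens already emitted stay in the mailbox thanks to the selective receive):

    cancel_and_wait(Model, PromptTokens) ->
        {ok, Ref} = erllama_model:infer(Model, PromptTokens, #{}, self()),
        ok = erllama_model:cancel(Ref),
        %% The final done message carries cancelled => true.
        receive
            {erllama_done, Ref, #{cancelled := true} = Stats} -> Stats
        end.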
-spec complete(model(), binary()) -> {ok, binary(), [non_neg_integer()]} | {error, term()}.
-spec detokenize(model(), [non_neg_integer()]) -> {ok, binary()} | {error, term()}.
Detokenise a list of token IDs back to a string. Safe to call
concurrently with complete/2,3.
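A quick round-trip with tokenize/2, assuming Model is a registered model id or pid:

    {ok, Ids} = erllama_model:tokenize(Model, <<"hello, world">>),
    {ok, Text} = erllama_model:detokenize(Model, Ids).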
-spec embed(model(), [non_neg_integer()]) -> {ok, [float()]} | {error, term()}.
Compute an embedding vector for the given prompt tokens.
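For example, embedding a tokenised prompt (Model is a registered model id or pid; the vector length depends on the model):

    {ok, Ids} = erllama_model:tokenize(Model, <<"the quick brown fox">>),
    {ok, Vector} = erllama_model:embed(Model, Ids).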
-spec evict(model()) -> ok.
Request that the model evict its current state. Fires an evict
save synchronously if there is anything in the context. Called by
erllama_scheduler (future) when GPU memory pressure requires this
model to release its context handle. No-op when the model is idle
with no live context.
-spec infer(model(), [non_neg_integer()], infer_params(), pid()) -> {ok, reference()} | {error, term()}.
Streaming inference. Admits a request and immediately returns a
unique reference(); tokens are delivered to CallerPid via
asynchronous messages:
{erllama_token, Ref, binary()}   per generated token (text fragment)
{erllama_done, Ref, stats()}     on normal completion
{erllama_error, Ref, term()}     on failure
Tokens is the prompt as a list of token ids - tokenisation is the
caller's responsibility (use tokenize/2 or apply a chat template
first). Params is an infer_params() map.
Calls that arrive while a previous request is in flight are queued
FIFO. The reply {ok, Ref} is sent as soon as the call is admitted;
streaming events follow once the queue head advances to this
request.
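A minimal caller-side sketch; collect/2 is an illustrative local helper, not part of the API, and the params shown are arbitrary example values:

    run(Model, PromptTokens) ->
        {ok, Ref} = erllama_model:infer(Model, PromptTokens,
                                        #{response_tokens => 128,
                                          temperature => 0.7},
                                        self()),
        collect(Ref, []).

    collect(Ref, Acc) ->
        receive
            {erllama_token, Ref, Text}   -> collect(Ref, [Text | Acc]);
            {erllama_done, Ref, Stats}   -> {ok, iolist_to_binary(lists:reverse(Acc)), Stats};
            {erllama_error, Ref, Reason} -> {error, Reason}
        end.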
List currently attached adapters as [#{handle => H, scale => F}].
The handle is the same opaque value load_adapter/2 returned.
-spec load_adapter(model(), file:filename_all()) -> {ok, term()} | {error, term()}.
Load a LoRA adapter from a GGUF file and attach it to the model
with scale 1.0. Returns an opaque handle the caller threads into
unload_adapter/2 and set_adapter_scale/3. The adapter's sha256 is
folded into the effective fingerprint so cache rows produced under
this adapter never collide with rows from a different adapter set.
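A lifecycle sketch covering list_adapters, set_adapter_scale and unload_adapter as well. The list_adapters/1 arity and the set_adapter_scale/3 argument order shown are assumptions based on the surrounding docs; the file path and scale value are illustrative.

    {ok, Adapter} = erllama_model:load_adapter(Model, "adapters/my-style.gguf"),
    [#{handle := Adapter, scale := _}] = erllama_model:list_adapters(Model),
    ok = erllama_model:set_adapter_scale(Model, Adapter, 0.5),
    ok = erllama_model:unload_adapter(Model, Adapter).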
-spec model_info(model()) -> model_info().
Snapshot of the model's metadata.
Returns a model_info() map with status, context size, quantisation,
backend, fingerprint, and tier. Safe to call from any state - the
gen_statem handles it as a common event without disrupting in-flight
inference.
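For example, picking a few fields out of the returned map:

    #{status := Status,
      context_size := CtxSize,
      fingerprint := Fingerprint} = erllama_model:model_info(Model).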
Change an attached adapter's scale. Re-applies the full set on the underlying context.
-spec shutdown(model()) -> ok.
Fire a shutdown save synchronously and return. Called from the
application's prep_stop hook so live state survives a graceful
restart.
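An illustrative helper as it might be called from the application's prep_stop/1 hook; how the list of live models is obtained is up to the application and is not specified here:

    %% Models is whatever list of model ids or pids the application tracks.
    shutdown_all(Models) ->
        lists:foreach(fun(M) -> ok = erllama_model:shutdown(M) end, Models).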
-spec status(model()) -> idle | prefilling | generating.
-spec stop(model()) -> ok.
-spec tokenize(model(), binary()) -> {ok, [non_neg_integer()]} | {error, term()}.
Tokenise a string using the model's tokenizer. Returns a list of
token IDs. Safe to call concurrently with complete/2,3; tokenisation
runs against the model's static vocabulary, not the live KV cache.
Detach + free a previously loaded adapter. Idempotent: a second call
on the same handle returns ok.