erllama (erllama v0.1.0)
Public façade for the erllama application.
The cache subsystem (erllama_cache) is independent. This module
is the user-facing surface for loading and running models.
Typical usage:
ok = application:ensure_all_started(erllama).
{ok, Bin} = file:read_file("/srv/models/tinyllama-1.1b-q4_k_m.gguf").
{ok, Model} = erllama:load_model(#{
backend => erllama_model_llama,
model_path => "/srv/models/tinyllama-1.1b-q4_k_m.gguf",
fingerprint => crypto:hash(sha256, Bin)
}).
{ok, Reply, _Tokens} = erllama:complete(Model, <<"hello">>).
ok = erllama:unload(Model).

Extra cache parameters (tier, tier_srv, quant_type,
ctx_params_hash, policy, ...) are optional; the defaults route
saves to the RAM tier (erllama_cache_ram). See the loading guide
for the full option map and instructions to wire up
ram_file / disk tier servers.
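As an illustrative sketch only: the key names below are the ones listed above, but the values shown are assumptions; the loading guide has the real shapes.

{ok, Model2} = erllama:load_model(#{
    backend     => erllama_model_llama,
    model_path  => "/srv/models/tinyllama-1.1b-q4_k_m.gguf",
    fingerprint => crypto:hash(sha256, Bin),
    tier        => ram_file,       %% assumed tier name
    tier_srv    => my_tier_srv     %% assumed: registered name of the tier server
}).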
Models are dynamic children of erllama_model_sup (simple_one_for_one).
A registered name is auto-generated when the caller does not provide
an explicit model_id in the config map.
Summary
Functions
Render a chat request through the model's chat template and
tokenise. The Request map carries messages, system, and tools.
Cancel an in-flight streaming inference. Idempotent and
fire-and-forget; cancellation is observed at the next inter-token
boundary. The caller still receives a final {erllama_done, Ref, Stats} with cancelled => true.
Run a completion against a loaded model.
Run a completion against a loaded model with options.
Snapshot of the cache subsystem's operational counters.
Detokenise a list of token ids back to text.
Compute an embedding vector for the given prompt tokens.
Fire an evict save synchronously and release the model's live KV
state. Used by an external memory-pressure scheduler when it wants
this model's working set off the heap without unloading the model.
Streaming inference. Returns immediately with a reference() that
identifies this request; tokens are delivered to CallerPid via
async messages.
List currently attached adapters with their scales.
List currently-loaded models as model_info() maps. Each entry
includes the model id, status, backend, context size, and
quantisation.
Load a LoRA adapter from a GGUF file and attach it to the model with
scale 1.0. Returns an opaque handle to pass to set_adapter_scale/3
and unload_adapter/2.
Load a model with an auto-generated id.
Load a model with an explicit id.
Inspect a single loaded model. Returns the same map shape
list_models/0 produces. Crashes with noproc if the model is not
loaded.
List currently-loaded model pids (low-level supervisor view). Most
callers want list_models/0, which returns metadata maps.
Change an attached adapter's scale. The scale is folded into the effective fingerprint, so changes split the cache namespace.
Fire a shutdown save synchronously and return. Called from a
release stop hook; bounded by evict_save_timeout_ms.
Current model state. idle means no request is in flight;
prefilling and generating are the two active phases.
Tokenise text against a loaded model's tokeniser. Safe to call
concurrently with complete/2,3.
Unload a model. Terminates the gen_statem cleanly.
Detach and free a previously loaded adapter. Idempotent.
Alias for unload/1. Provided for API symmetry with load_model/1,2
and the OpenAI/Ollama-style naming used by downstream HTTP servers.
Types
-type model() :: erllama_model:model().
-type model_id() :: erllama_registry:model_id().
-type model_info() :: erllama_model:model_info().
Functions
-spec apply_chat_template(model(), erllama_model_backend:chat_request()) -> {ok, [erllama_nif:token_id()]} | {error, term()}.
Render a chat request through the model's chat template and
tokenise. The Request map carries messages, system, and tools.
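A minimal sketch, assuming a map-based message shape; the authoritative layout is erllama_model_backend:chat_request().

Request = #{
    system   => <<"You are a helpful assistant.">>,
    messages => [#{role => user, content => <<"What is OTP?">>}],
    tools    => []
},
{ok, PromptTokens} = erllama:apply_chat_template(Model, Request).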
-spec cancel(reference()) -> ok.
Cancel an in-flight streaming inference. Idempotent and
fire-and-forget; cancellation is observed at the next inter-token
boundary. The caller still receives a final {erllama_done, Ref, Stats} with cancelled => true.
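A sketch of the cancel-then-drain pattern; drain/1 is a hypothetical helper that discards fragments already in flight.

cancel_stream(Model, PromptTokens, Params) ->
    {ok, Ref} = erllama:infer(Model, PromptTokens, Params, self()),
    ok = erllama:cancel(Ref),
    drain(Ref).

drain(Ref) ->
    receive
        {erllama_token, Ref, _Bin} -> drain(Ref);  %% fragments already in flight
        {erllama_done, Ref, Stats} -> Stats        %% carries cancelled => true
    end.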
-spec complete(model(), binary()) -> {ok, binary(), [erllama_nif:token_id()]} | {error, term()}.
Run a completion against a loaded model.
-spec complete(model(), binary(), map()) -> {ok, binary(), [erllama_nif:token_id()]} | {error, term()}.
Run a completion against a loaded model with options.
Recognised keys in Opts:
response_tokens (non_neg_integer()): cap on the number of tokens
generated. Defaults to the model's n_ctx minus prompt length.
parent_key (erllama_cache:cache_key()): the previous turn's
finish-save key. Skips the longest-prefix walk and resumes directly
from that row.
Returns {ok, ReplyText, FullTokenList} on success.
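For example, to cap the reply at 64 generated tokens (parent_key is not shown because obtaining a previous turn's finish-save key is outside the scope of this call):

{ok, Reply, AllTokens} =
    erllama:complete(Model, <<"hello">>, #{response_tokens => 64}).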
-spec counters() -> #{atom() => non_neg_integer()}.
Snapshot of the cache subsystem's operational counters.
-spec detokenize(model(), [erllama_nif:token_id()]) -> {ok, binary()} | {error, term()}.
Detokenise a list of token ids back to text.
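A round trip with tokenize/2; the recovered text is normally byte-identical to the input, though tokenisers may normalise some byte sequences.

{ok, Ids}  = erllama:tokenize(Model, <<"hello world">>),
{ok, Text} = erllama:detokenize(Model, Ids).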
-spec embed(model(), [erllama_nif:token_id()]) -> {ok, [float()]} | {error, term()}.
Compute an embedding vector for the given prompt tokens.
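A sketch that embeds two texts and compares them with cosine similarity; embed_text/2 and similarity/3 are hypothetical helpers.

embed_text(Model, Text) ->
    {ok, Toks} = erllama:tokenize(Model, Text),
    erllama:embed(Model, Toks).

similarity(Model, A, B) ->
    {ok, Va} = embed_text(Model, A),
    {ok, Vb} = embed_text(Model, B),
    Dot = lists:sum(lists:zipwith(fun(X, Y) -> X * Y end, Va, Vb)),
    Norm = fun(V) -> math:sqrt(lists:sum([X * X || X <- V])) end,
    Dot / (Norm(Va) * Norm(Vb)).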
-spec evict(model()) -> ok.
Fire an evict save synchronously and release the model's live KV
state. Used by an external memory-pressure scheduler when it wants
this model's working set off the heap without unloading the model.
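A sketch of such a scheduler hook, assuming the pids returned by models/0 are valid model() values:

maybe_relieve_pressure(ThresholdBytes) ->
    case erlang:memory(total) > ThresholdBytes of
        true ->
            %% Evict only idle models; active ones keep their KV state.
            [ok = erllama:evict(M) || M <- erllama:models(),
                                      erllama:status(M) =:= idle],
            ok;
        false ->
            ok
    end.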
-spec infer(model(), [erllama_nif:token_id()], erllama_model:infer_params(), pid()) -> {ok, reference()} | {error, term()}.
Streaming inference. Returns immediately with a reference() that
identifies this request; tokens are delivered to CallerPid via
async messages:
{erllama_token, Ref, Bin :: binary()}: text fragment
{erllama_done, Ref, Stats}: normal completion
{erllama_error, Ref, Reason}: failure
Tokens is the prompt as a list of token ids; tokenisation is the
caller's responsibility (use tokenize/2 or apply a chat template
first).
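A minimal consumer loop built on those three messages:

stream(Model, Prompt, Params) ->
    {ok, Tokens} = erllama:tokenize(Model, Prompt),
    {ok, Ref} = erllama:infer(Model, Tokens, Params, self()),
    collect(Ref, []).

collect(Ref, Acc) ->
    receive
        {erllama_token, Ref, Bin}    -> collect(Ref, [Bin | Acc]);
        {erllama_done, Ref, _Stats}  -> {ok, iolist_to_binary(lists:reverse(Acc))};
        {erllama_error, Ref, Reason} -> {error, Reason}
    end.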
List currently attached adapters with their scales.
-spec list_models() -> [model_info()].
List currently-loaded models as model_info() maps. Each entry
includes the model id, status, backend, context size, and
quantisation.
-spec load_adapter(model(), file:filename_all()) -> {ok, term()} | {error, term()}.
Load a LoRA adapter from a GGUF file and attach it to the model with
scale 1.0. Returns an opaque handle to pass to set_adapter_scale/3
and unload_adapter/2.
The adapter's file sha256 is folded into the model's effective fingerprint so cache rows produced with the adapter attached never collide with rows from a different attachment set. In-flight requests keep their original fingerprint snapshot; the new value takes effect from the next request.
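A lifecycle sketch; only load_adapter's {ok, Handle} return is documented above, so the other return values are left unmatched.

{ok, Adapter} = erllama:load_adapter(Model, "/srv/adapters/my-lora.gguf"),
%% Halve the adapter's influence; this also splits the cache namespace.
erllama:set_adapter_scale(Model, Adapter, 0.5),
%% Detach when done (idempotent).
erllama:unload_adapter(Model, Adapter).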
Load a model with an auto-generated id.
Load a model with an explicit id.
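One way to pin the id, per the note above about supplying model_id in the config map (my_model is an illustrative id):

{ok, Model} = erllama:load_model(#{
    model_id    => my_model,
    backend     => erllama_model_llama,
    model_path  => "/srv/models/tinyllama-1.1b-q4_k_m.gguf",
    fingerprint => crypto:hash(sha256, Bin)
}).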
-spec model_info(model()) -> model_info().
Inspect a single loaded model. Returns the same map shape
list_models/0 produces. Crashes with noproc if the model is not
loaded.
-spec models() -> [pid()].
List currently-loaded model pids (low-level supervisor view). Most
callers want list_models/0, which returns metadata maps.
Change an attached adapter's scale. The scale is folded into the effective fingerprint, so changes split the cache namespace.
-spec shutdown(model()) -> ok.
Fire a shutdown save synchronously and return. Called from a
release stop hook; bounded by evict_save_timeout_ms.
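A stop-hook sketch, assuming the pids returned by models/0 are valid model() values:

stop_all() ->
    lists:foreach(fun erllama:shutdown/1, erllama:models()).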
-spec status(model()) -> idle | prefilling | generating.
Current model state. idle means no request is in flight;
prefilling and generating are the two active phases.
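A gating sketch; submit/2 is a hypothetical caller-side function.

maybe_submit(Model, Work) ->
    case erllama:status(Model) of
        idle -> submit(Model, Work);
        Busy -> {busy, Busy}   %% prefilling | generating
    end.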
-spec tokenize(model(), binary()) -> {ok, [erllama_nif:token_id()]} | {error, term()}.
Tokenise text against a loaded model's tokeniser. Safe to call
concurrently with complete/2,3.
Unload a model. Terminates the gen_statem cleanly.
Detach and free a previously loaded adapter. Idempotent.
Alias for unload/1. Provided for API symmetry with load_model/1,2
and the OpenAI/Ollama-style naming used by downstream HTTP servers.