Configuration reference

erllama configuration lives in two places: the OTP application environment (config/sys.config) and the per-model option map passed to erllama:load_model/1,2. This page documents both in full.

Application environment

{erllama, [
  %% --------------- Save-policy gates -----------------------------
  {min_tokens,             512},
  {cold_min_tokens,        512},
  {cold_max_tokens,      30000},
  {continued_interval,    2048},
  {boundary_trim_tokens,    32},
  {boundary_align_tokens, 2048},

  %% --------------- Cache flow tunables ---------------------------
  {evict_save_timeout_ms,  30000},
  {session_resume_wait_ms,   500},
  {fingerprint_mode,         safe},   %% safe | gguf_chunked | fast_unsafe

  %% --------------- Memory-pressure scheduler ---------------------
  {scheduler, #{
    enabled         => false,
    pressure_source => noop,
    interval_ms     => 5000,
    high_watermark  => 0.85,
    low_watermark   => 0.75,
    min_evict_bytes => 1048576,
    evict_tiers     => [ram, ram_file]
  }}
]}.

Tiers

The RAM tier (erllama_cache_ram) starts automatically with the application. For the ram_file or disk tiers, start the matching server (erllama_cache_ramfile_srv or erllama_cache_disk_srv) per cache root in your own supervision tree (or from a release start hook) and pass its registered name as tier_srv on the relevant load_model/1,2 call:

{ok, _} = erllama_cache_disk_srv:start_link(my_disk,    "/var/lib/erllama/kvc"),
{ok, _} = erllama_cache_ramfile_srv:start_link(my_shm,  "/dev/shm/erllama").
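
A model is then pointed at one of these tiers on load. A minimal sketch, assuming the usual {ok, Handle} return shape (model path illustrative):

{ok, _Model} = erllama:load_model(#{
  backend    => erllama_model_llama,
  model_path => "/path/to/x.gguf",
  tier       => disk,       %% tier to save KV caches to
  tier_srv   => my_disk     %% registered name from start_link/2 above
}).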

There is no single tiers env key in v0.1: per-process supervision gives you crisper restart semantics than a static list.
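
If the tier servers live under your own supervisor, plain worker child specs are all that is needed. A sketch, assuming a permanent restart is what you want:

%% In your supervisor's init/1 child list; the id atom is arbitrary.
#{id      => erllama_disk_tier,
  start   => {erllama_cache_disk_srv, start_link,
              [my_disk, "/var/lib/erllama/kvc"]},
  restart => permanent,
  type    => worker}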

Save-policy gates

See the caching guide for what each threshold does. All are overridable per-model via the policy map.

evict_save_timeout_ms

How long synchronous evict and shutdown saves wait for the writer to finish before giving up. Defaults to 30 s. Bump for 8B-class models on slow disks.
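
For example, to give large saves on slow disks more headroom (the two-minute value is illustrative):

{erllama, [
  {evict_save_timeout_ms, 120000}   %% 120 s instead of the 30 s default
]}.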

session_resume_wait_ms

When a parent_key is supplied and the cache sees a matching in-flight finish save, it waits up to this long for the save to publish before falling through to a cold prefill. 500 ms is enough for SSD-backed deployments; bump if you observe back-to-back multi-turn cold misses on slow storage.

fingerprint_mode

How to verify the model fingerprint at load:

  • safe — full SHA-256 over the file. Slow on multi-GB GGUFs.
  • gguf_chunked — fingerprint metadata + first weights tensor. Catches accidental corruption, not malicious tampering.
  • fast_unsafe — trust the supplied fingerprint blindly. Use only when you fingerprint upstream and pass the digest through.
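
A sketch of the fast_unsafe flow: hash the GGUF once (for example in a deploy script), then pass the digest through so later loads skip re-hashing. Reading the whole file is fine for small models; stream with crypto:hash_init/1, crypto:hash_update/2 and crypto:hash_final/1 for multi-GB files. The {ok, _} return shape is assumed.

%% Compute the digest once, ahead of time.
{ok, Bin} = file:read_file("/path/to/x.gguf"),
Digest    = crypto:hash(sha256, Bin),          %% 32-byte binary

%% Every subsequent load trusts the precomputed digest.
{ok, _Model} = erllama:load_model(#{
  backend          => erllama_model_llama,
  model_path       => "/path/to/x.gguf",
  fingerprint      => Digest,
  fingerprint_mode => fast_unsafe
}).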

scheduler

See the caching guide.

Per-model options

Passed to erllama:load_model/1,2:

#{
  backend           => erllama_model_llama,
  model_path        => "/path/to/x.gguf",
  model_opts        => #{n_gpu_layers => 99},
  context_opts      => #{n_ctx => 4096, n_batch => 512},
  fingerprint       => <<32 bytes>>,
  fingerprint_mode  => safe,
  quant_type        => q4_k_m,
  quant_bits        => 4,
  ctx_params_hash   => <<32 bytes>>,
  context_size      => 4096,
  tier_srv          => my_disk,
  tier              => disk,
  policy            => #{
    min_tokens             => 256,
    cold_min_tokens        => 256,
    cold_max_tokens        => 8192,
    continued_interval     => 256,
    boundary_trim_tokens   => 32,
    boundary_align_tokens  => 256,
    session_resume_wait_ms => 500
  }
}

See loading a model for the per-field walkthrough.

Inspecting effective config

1> application:get_env(erllama, scheduler).
{ok, #{enabled => true, ...}}

2> erllama_scheduler:status().
#{enabled => true, pressure_source => system, ...}

3> erllama_cache_meta_srv:dump().
%% List of raw ETS tuples; see include/erllama_cache.hrl for the
%% position layout.
[{<<_:256>>, disk, 8388608, _, 0, available, _, _, _, 4}, ...]