Inference context with KV cache.
Summary

Functions

- Clears the KV cache.
- Creates a new inference context for the given model.
- Decodes a list of tokens through the model.
- Runs the generation loop: decodes prompt tokens and generates up to `max_tokens` new tokens.
- Returns the context size.
- Returns the max number of sequences.
Types
@type t() :: %LlamaCppEx.Context{model: LlamaCppEx.Model.t(), ref: reference()}
Functions
@spec clear(t()) :: :ok
Clears the KV cache.
@spec create(LlamaCppEx.Model.t(), keyword()) :: {:ok, t()} | {:error, String.t()}
Creates a new inference context for the given model.
Options
- `:n_ctx` - Context size (max tokens). Defaults to `2048`.
- `:n_batch` - Max tokens per decode batch. Defaults to `n_ctx`.
- `:n_ubatch` - Max tokens per micro-batch. Defaults to `512`.
- `:n_threads` - Number of threads for generation. Defaults to the system CPU count.
- `:n_threads_batch` - Number of threads for prompt processing. Defaults to `:n_threads`.
- `:embeddings` - Enable embedding extraction. Defaults to `false`.
- `:pooling_type` - Pooling type for embeddings. Defaults to `:unspecified`. Values: `:unspecified`, `:none`, `:mean`, `:cls`, `:last`.
- `:n_seq_max` - Max number of concurrent sequences. Defaults to `1`.
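A minimal sketch of creating a context with a larger window and explicit thread count. The model-loading call and the model path are assumptions for illustration; only `create/2` and its options are documented here:

```elixir
# Hypothetical model load; substitute your library's actual loading function.
{:ok, model} = LlamaCppEx.Model.load("/path/to/model.gguf")

{:ok, ctx} =
  LlamaCppEx.Context.create(model,
    # Room for the prompt plus generated tokens.
    n_ctx: 4096,
    # Threads used during generation; prompt processing
    # falls back to this value via :n_threads_batch.
    n_threads: 8
  )
```

Unspecified options keep their defaults, so `n_batch` here equals the `n_ctx` of `4096`.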
Decodes a list of tokens through the model.
@spec generate(t(), LlamaCppEx.Sampler.t(), [integer()], keyword()) :: {:ok, String.t()} | {:error, String.t()}
Runs the generation loop: decodes the prompt tokens, then generates up to `max_tokens` new tokens.
Returns the generated text (not including the prompt).
Options
- `:max_tokens` - Maximum tokens to generate. Defaults to `256`.
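An end-to-end sketch of the loop. The model-loading, sampler-creation, and tokenization calls are assumptions not documented in this section; only `create/2` and `generate/4` appear above:

```elixir
{:ok, model} = LlamaCppEx.Model.load("/path/to/model.gguf")     # assumed API
{:ok, ctx} = LlamaCppEx.Context.create(model, n_ctx: 2048)
{:ok, sampler} = LlamaCppEx.Sampler.create(temp: 0.8)           # assumed API

# generate/4 takes prompt tokens as [integer()], so the prompt
# must be tokenized first; this helper is an assumed API.
tokens = LlamaCppEx.Model.tokenize(model, "Hello, world")

# Returns only the newly generated text, not the prompt.
{:ok, text} =
  LlamaCppEx.Context.generate(ctx, sampler, tokens, max_tokens: 128)
```

Because the context holds a KV cache, `clear/1` should be called between unrelated prompts when reusing the same context.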
Returns the context size.
Returns the max number of sequences.