Inference context with KV cache.
Summary

Functions

- Clears the KV cache.
- Creates a new inference context for the given model.
- Decodes a list of tokens through the model.
- Runs the generation loop: decodes prompt tokens and generates up to `max_tokens` new tokens.
- Returns the context size.
- Returns the max number of sequences.
Types
@type t() :: %LlamaCppEx.Context{model: LlamaCppEx.Model.t(), ref: reference()}
Functions
@spec clear(t()) :: :ok
Clears the KV cache.
@spec create(LlamaCppEx.Model.t(), keyword()) :: {:ok, t()} | {:error, String.t()}
Creates a new inference context for the given model.
Options
- `:n_ctx` - Context size (max tokens). Defaults to `2048`.
- `:n_batch` - Max tokens per decode batch. Defaults to `n_ctx`.
- `:n_ubatch` - Max tokens per micro-batch. Defaults to `512`.
- `:n_threads` - Number of threads for generation. Defaults to the system CPU count.
- `:n_threads_batch` - Number of threads for prompt processing. Defaults to `:n_threads`.
- `:embeddings` - Enable embedding extraction. Defaults to `false`.
- `:pooling_type` - Pooling type for embeddings. Defaults to `:unspecified`. Values: `:unspecified`, `:none`, `:mean`, `:cls`, `:last`.
- `:n_seq_max` - Max number of concurrent sequences. Defaults to `1`.
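A minimal sketch of creating a context with a larger window and explicit thread count. The model-loading call and the model path are assumptions for illustration; only `create/2` and its options are documented here:

```elixir
# Hypothetical model load; substitute your library's actual loading function.
{:ok, model} = LlamaCppEx.Model.load("/path/to/model.gguf")

{:ok, ctx} =
  LlamaCppEx.Context.create(model,
    # Room for the prompt plus generated tokens.
    n_ctx: 4096,
    # Threads used during generation; prompt processing
    # falls back to this value via :n_threads_batch.
    n_threads: 8
  )
```

Unspecified options keep their defaults, so `n_batch` here equals the `n_ctx` of `4096`.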
Decodes a list of tokens through the model.
@spec generate(t(), LlamaCppEx.Sampler.t(), [integer()], keyword()) :: {:ok, String.t()} | {:error, String.t()}
Runs the generation loop: decodes the prompt tokens, then generates up to `max_tokens` new tokens.
Returns the generated text (not including the prompt).
Options
- `:max_tokens` - Maximum tokens to generate. Defaults to `256`.
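An end-to-end sketch of the loop. The model-loading, sampler-creation, and tokenization calls are assumptions not documented in this section; only `create/2` and `generate/4` appear above:

```elixir
{:ok, model} = LlamaCppEx.Model.load("/path/to/model.gguf")     # assumed API
{:ok, ctx} = LlamaCppEx.Context.create(model, n_ctx: 2048)
{:ok, sampler} = LlamaCppEx.Sampler.create(temp: 0.8)           # assumed API

# generate/4 takes prompt tokens as [integer()], so the prompt
# must be tokenized first; this helper is an assumed API.
tokens = LlamaCppEx.Model.tokenize(model, "Hello, world")

# Returns only the newly generated text, not the prompt.
{:ok, text} =
  LlamaCppEx.Context.generate(ctx, sampler, tokens, max_tokens: 128)
```

Because the context holds a KV cache, `clear/1` should be called between unrelated prompts when reusing the same context.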
Returns the context size.
Returns the max number of sequences.