Elixir bindings for llama.cpp.
Provides a high-level API for loading GGUF models and generating text.
## Quick Start

```elixir
# Initialize the backend (once per application)
:ok = LlamaCppEx.init()

# Load a model
{:ok, model} = LlamaCppEx.load_model("model.gguf", n_gpu_layers: -1)

# Generate text
{:ok, text} = LlamaCppEx.generate(model, "Once upon a time", max_tokens: 200)
```

## Lower-level API
For fine-grained control, use the individual modules:
- `LlamaCppEx.Model` - Model loading and introspection
- `LlamaCppEx.Context` - Inference context with KV cache
- `LlamaCppEx.Sampler` - Token sampling configuration
- `LlamaCppEx.Tokenizer` - Text tokenization and detokenization
- `LlamaCppEx.Embedding` - Embedding generation
## Summary

### Functions

- `chat/3` - Applies the chat template and generates a response.
- `chat_completion/3` - Generates an OpenAI-compatible chat completion response.
- `embed/3` - Computes an embedding for a single text.
- `embed_batch/3` - Computes embeddings for multiple texts.
- `generate/3` - Generates text from a prompt.
- `init/0` - Initializes the llama.cpp backend. Call once at application start.
- `load_model/2` - Loads a GGUF model from the given file path.
- `load_model_from_hub/3` - Downloads a GGUF model from HuggingFace Hub and loads it.
- `stream/3` - Returns a lazy stream of generated text chunks (tokens).
- `stream_chat/3` - Returns a lazy stream of chat response chunks.
- `stream_chat_completion/3` - Returns a lazy stream of OpenAI-compatible chat completion chunks.
## Functions
### chat/3

```elixir
@spec chat(LlamaCppEx.Model.t(), [LlamaCppEx.Chat.message()], keyword()) ::
        {:ok, String.t()} | {:error, String.t()}
```

Applies the chat template and generates a response.

#### Options

Accepts all options from `generate/3` plus:

- `:template` - Custom chat template string. Defaults to the model's embedded template.

#### Examples

```elixir
{:ok, reply} = LlamaCppEx.chat(model, [
  %{role: "system", content: "You are helpful."},
  %{role: "user", content: "What is Elixir?"}
], max_tokens: 200)
```
### chat_completion/3

```elixir
@spec chat_completion(LlamaCppEx.Model.t(), [LlamaCppEx.Chat.message()], keyword()) ::
        {:ok, LlamaCppEx.ChatCompletion.t()} | {:error, term()}
```

Generates an OpenAI-compatible chat completion response.

Applies the chat template, runs generation, and returns a `%ChatCompletion{}` struct with choices, usage counts, and finish reason.

#### Options

Accepts all options from `generate/3` plus:

- `:template` - Custom chat template string. Defaults to the model's embedded template.

#### Examples

```elixir
{:ok, completion} = LlamaCppEx.chat_completion(model, [
  %{role: "user", content: "What is Elixir?"}
], max_tokens: 200)

completion.choices |> hd() |> Map.get(:message) |> Map.get(:content)
```
### embed/3

```elixir
@spec embed(LlamaCppEx.Model.t(), String.t(), keyword()) ::
        {:ok, LlamaCppEx.Embedding.t()} | {:error, String.t()}
```

Computes an embedding for a single text.

See `LlamaCppEx.Embedding.embed/3` for options.
### embed_batch/3

```elixir
@spec embed_batch(LlamaCppEx.Model.t(), [String.t()], keyword()) ::
        {:ok, [LlamaCppEx.Embedding.t()]} | {:error, String.t()}
```

Computes embeddings for multiple texts.

See `LlamaCppEx.Embedding.embed_batch/3` for options.
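A common use for batched embeddings is comparing texts by similarity. As a sketch, here is a cosine-similarity helper over plain float lists; the lists are hypothetical stand-ins for the embedding vectors, since the internal shape of `LlamaCppEx.Embedding.t()` is not shown here:

```elixir
# With a loaded model this would be:
#   {:ok, [a, b]} = LlamaCppEx.embed_batch(model, ["hello", "world"])
# Plain float lists stand in for the embedding vectors below so the
# helper is self-contained.
defmodule EmbeddingDemo do
  # Cosine similarity between two equal-length float lists.
  def cosine(a, b) do
    dot = Enum.zip(a, b) |> Enum.map(fn {x, y} -> x * y end) |> Enum.sum()
    norm = fn v -> :math.sqrt(Enum.reduce(v, 0.0, fn x, acc -> acc + x * x end)) end
    dot / (norm.(a) * norm.(b))
  end
end

EmbeddingDemo.cosine([1.0, 0.0, 0.0], [1.0, 0.0, 0.0])
# => 1.0
```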
### generate/3

```elixir
@spec generate(LlamaCppEx.Model.t(), String.t(), keyword()) ::
        {:ok, String.t()} | {:error, String.t()}
```

Generates text from a prompt.

Creates a temporary context and sampler, tokenizes the prompt, runs generation, and returns the generated text.

#### Options

- `:max_tokens` - Maximum tokens to generate. Defaults to `256`.
- `:n_ctx` - Context size. Defaults to `2048`.
- `:temp` - Sampling temperature. `0.0` for greedy. Defaults to `0.8`.
- `:top_k` - Top-K filtering. Defaults to `40`.
- `:top_p` - Top-P (nucleus) filtering. Defaults to `0.95`.
- `:min_p` - Min-P filtering. Defaults to `0.05`.
- `:seed` - Random seed. Defaults to random.
- `:penalty_repeat` - Repetition penalty. Defaults to `1.0`.
- `:penalty_freq` - Frequency penalty (0.0–2.0). Defaults to `0.0`.
- `:penalty_present` - Presence penalty (0.0–2.0). Defaults to `0.0`.
- `:grammar` - GBNF grammar string for constrained generation.
- `:grammar_root` - Root rule name for grammar. Defaults to `"root"`.
- `:json_schema` - JSON Schema map for structured output. Automatically converted to a GBNF grammar. Cannot be used together with `:grammar`. Tip: set `"additionalProperties" => false` for tighter grammars.
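As a sketch of the `:json_schema` option, here is a hypothetical schema map; with a loaded model, the commented call would constrain generation to JSON matching it:

```elixir
# A JSON Schema map for structured output. Setting
# "additionalProperties" => false keeps the derived GBNF grammar tight,
# per the tip above.
schema = %{
  "type" => "object",
  "properties" => %{
    "name" => %{"type" => "string"},
    "age" => %{"type" => "integer"}
  },
  "required" => ["name", "age"],
  "additionalProperties" => false
}

# With a loaded model (prompt is illustrative), the output is constrained
# to valid JSON matching the schema:
# {:ok, json} =
#   LlamaCppEx.generate(model, "Describe a person as JSON: ",
#     max_tokens: 128, json_schema: schema, temp: 0.0)
```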
### init/0

```elixir
@spec init() :: :ok
```

Initializes the llama.cpp backend. Call once at application start.
### load_model/2

```elixir
@spec load_model(String.t(), keyword()) ::
        {:ok, LlamaCppEx.Model.t()} | {:error, String.t()}
```

Loads a GGUF model from the given file path.

See `LlamaCppEx.Model.load/2` for options.
### load_model_from_hub/3

```elixir
@spec load_model_from_hub(String.t(), String.t(), keyword()) ::
        {:ok, LlamaCppEx.Model.t()} | {:error, String.t()}
```

Downloads a GGUF model from HuggingFace Hub and loads it.

Requires the optional `:req` dependency.

#### Examples

```elixir
:ok = LlamaCppEx.init()

{:ok, model} = LlamaCppEx.load_model_from_hub(
  "Qwen/Qwen3-4B-GGUF",
  "qwen3-4b-q4_k_m.gguf",
  n_gpu_layers: -1
)
```

#### Options

Accepts all options from `load_model/2` plus:

- `:cache_dir` - Local cache directory for downloaded models.
- `:token` - HuggingFace API token for private repos.
- `:progress` - Download progress callback.
- `:revision` - Git revision (branch, tag, commit). Defaults to `"main"`.
### stream/3

```elixir
@spec stream(LlamaCppEx.Model.t(), String.t(), keyword()) :: Enumerable.t()
```

Returns a lazy stream of generated text chunks (tokens).

Each element is a string (the text piece for one token). The stream ends when an end-of-generation token is produced or `:max_tokens` is reached.

Accepts the same options as `generate/3`.

#### Examples

```elixir
model
|> LlamaCppEx.stream("Tell me a story", max_tokens: 500)
|> Enum.each(&IO.write/1)
```
### stream_chat/3

```elixir
@spec stream_chat(LlamaCppEx.Model.t(), [LlamaCppEx.Chat.message()], keyword()) :: Enumerable.t()
```

Returns a lazy stream of chat response chunks.

Applies the chat template and streams the generated response token by token.

Accepts the same options as `chat/3`.
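Because the stream yields plain string chunks, standard `Stream`/`Enum` pipelines apply. A self-contained sketch, with a hypothetical stand-in stream in place of a real model call:

```elixir
# With a loaded model this would be:
#   stream = LlamaCppEx.stream_chat(model, messages, max_tokens: 200)
# A stand-in stream of string chunks is used here so the pipeline runs
# without a model.
stream = Stream.map(["Eli", "xir ", "is ", "fun"], & &1)

reply =
  stream
  |> Stream.each(&IO.write/1)   # print each chunk as it arrives
  |> Enum.join()                # and collect the full reply
```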
### stream_chat_completion/3

```elixir
@spec stream_chat_completion(LlamaCppEx.Model.t(), [LlamaCppEx.Chat.message()], keyword()) ::
        Enumerable.t()
```

Returns a lazy stream of OpenAI-compatible chat completion chunks.

Each element is a `%ChatCompletionChunk{}` struct. The first chunk contains `delta: %{role: "assistant", content: ""}`. Subsequent chunks contain `delta: %{content: "token_text"}`. The final chunk contains the `finish_reason`. All chunks share the same `id` and `created` timestamp.

#### Options

Accepts the same options as `chat_completion/3`.

#### Examples

```elixir
model
|> LlamaCppEx.stream_chat_completion(messages, max_tokens: 200)
|> Enum.each(fn chunk ->
  chunk.choices |> hd() |> get_in([:delta, :content]) |> IO.write()
end)
```
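To reassemble the streamed deltas into the full reply, the chunks can be folded over. A sketch using plain maps as hypothetical stand-ins for `%ChatCompletionChunk{}` structs, mirroring the delta shapes described above:

```elixir
# Plain maps stand in for %ChatCompletionChunk{} structs.
chunks = [
  %{choices: [%{delta: %{role: "assistant", content: ""}}]},
  %{choices: [%{delta: %{content: "Hello"}}]},
  %{choices: [%{delta: %{content: ", world!"}}]},
  %{choices: [%{delta: %{}, finish_reason: "stop"}]}
]

text =
  chunks
  |> Enum.map(fn chunk -> chunk.choices |> hd() |> get_in([:delta, :content]) end)
  |> Enum.reject(&is_nil/1)   # the final finish_reason chunk carries no content
  |> Enum.join()
# => "Hello, world!"
```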