Elixir bindings for llama.cpp.
Provides a high-level API for loading GGUF models and generating text.
## Quick Start

```elixir
# Initialize the backend (once per application)
:ok = LlamaCppEx.init()

# Load a model
{:ok, model} = LlamaCppEx.load_model("model.gguf", n_gpu_layers: -1)

# Generate text
{:ok, text} = LlamaCppEx.generate(model, "Once upon a time", max_tokens: 200)
```

## Lower-level API
For fine-grained control, use the individual modules:
- `LlamaCppEx.Model` - Model loading and introspection
- `LlamaCppEx.Context` - Inference context with KV cache
- `LlamaCppEx.Sampler` - Token sampling configuration
- `LlamaCppEx.Tokenizer` - Text tokenization and detokenization
- `LlamaCppEx.Embedding` - Embedding generation
## Summary

### Functions

- `chat/3` - Applies the chat template and generates a response.
- `chat_completion/3` - Generates an OpenAI-compatible chat completion response.
- `embed/3` - Computes an embedding for a single text.
- `embed_batch/3` - Computes embeddings for multiple texts.
- `generate/3` - Generates text from a prompt.
- `init/0` - Initializes the llama.cpp backend. Call once at application start.
- `load_model/2` - Loads a GGUF model from the given file path.
- `load_model_from_hub/3` - Downloads a GGUF model from HuggingFace Hub and loads it.
- `stream/3` - Returns a lazy stream of generated text chunks (tokens).
- `stream_chat/3` - Returns a lazy stream of chat response chunks.
- `stream_chat_completion/3` - Returns a lazy stream of OpenAI-compatible chat completion chunks.
## Functions
### chat/3

```elixir
@spec chat(LlamaCppEx.Model.t(), [LlamaCppEx.Chat.message()], keyword()) ::
        {:ok, String.t()} | {:error, String.t()}
```

Applies the chat template and generates a response.

#### Options

Accepts all options from `generate/3` plus:

- `:template` - Custom chat template string. Defaults to the model's embedded template.

#### Examples

```elixir
{:ok, reply} = LlamaCppEx.chat(model, [
  %{role: "system", content: "You are helpful."},
  %{role: "user", content: "What is Elixir?"}
], max_tokens: 200)
```
### chat_completion/3

```elixir
@spec chat_completion(LlamaCppEx.Model.t(), [LlamaCppEx.Chat.message()], keyword()) ::
        {:ok, LlamaCppEx.ChatCompletion.t()} | {:error, term()}
```

Generates an OpenAI-compatible chat completion response.

Applies the chat template, runs generation, and returns a `%ChatCompletion{}` struct with choices, usage counts, and finish reason.

#### Options

Accepts all options from `generate/3` plus:

- `:template` - Custom chat template string. Defaults to the model's embedded template.

#### Examples

```elixir
{:ok, completion} = LlamaCppEx.chat_completion(model, [
  %{role: "user", content: "What is Elixir?"}
], max_tokens: 200)

completion.choices |> hd() |> Map.get(:message) |> Map.get(:content)
```
### embed/3

```elixir
@spec embed(LlamaCppEx.Model.t(), String.t(), keyword()) ::
        {:ok, LlamaCppEx.Embedding.t()} | {:error, String.t()}
```

Computes an embedding for a single text.

See `LlamaCppEx.Embedding.embed/3` for options.
### embed_batch/3

```elixir
@spec embed_batch(LlamaCppEx.Model.t(), [String.t()], keyword()) ::
        {:ok, [LlamaCppEx.Embedding.t()]} | {:error, String.t()}
```

Computes embeddings for multiple texts.

See `LlamaCppEx.Embedding.embed_batch/3` for options.
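A common use for batched embeddings is comparing texts by similarity. As a sketch, here is a cosine-similarity helper over plain float lists; the lists are hypothetical stand-ins for the embedding vectors, since the internal shape of `LlamaCppEx.Embedding.t()` is not shown here:

```elixir
# With a loaded model this would be:
#   {:ok, [a, b]} = LlamaCppEx.embed_batch(model, ["hello", "world"])
# Plain float lists stand in for the embedding vectors below so the
# helper is self-contained.
defmodule EmbeddingDemo do
  # Cosine similarity between two equal-length float lists.
  def cosine(a, b) do
    dot = Enum.zip(a, b) |> Enum.map(fn {x, y} -> x * y end) |> Enum.sum()
    norm = fn v -> :math.sqrt(Enum.reduce(v, 0.0, fn x, acc -> acc + x * x end)) end
    dot / (norm.(a) * norm.(b))
  end
end

EmbeddingDemo.cosine([1.0, 0.0, 0.0], [1.0, 0.0, 0.0])
# => 1.0
```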
### generate/3

```elixir
@spec generate(LlamaCppEx.Model.t(), String.t(), keyword()) ::
        {:ok, String.t()} | {:error, String.t()}
```

Generates text from a prompt.

Creates a temporary context and sampler, tokenizes the prompt, runs generation, and returns the generated text.

#### Options

- `:max_tokens` - Maximum tokens to generate. Defaults to `256`.
- `:n_ctx` - Context size. Defaults to `2048`.
- `:temp` - Sampling temperature. `0.0` for greedy. Defaults to `0.8`.
- `:top_k` - Top-K filtering. Defaults to `40`.
- `:top_p` - Top-P (nucleus) filtering. Defaults to `0.95`.
- `:min_p` - Min-P filtering. Defaults to `0.05`.
- `:seed` - Random seed. Defaults to random.
- `:penalty_repeat` - Repetition penalty. Defaults to `1.0`.
- `:penalty_freq` - Frequency penalty (0.0–2.0). Defaults to `0.0`.
- `:penalty_present` - Presence penalty (0.0–2.0). Defaults to `0.0`.
- `:grammar` - GBNF grammar string for constrained generation.
- `:grammar_root` - Root rule name for grammar. Defaults to `"root"`.
- `:json_schema` - JSON Schema map for structured output. Automatically converted to a GBNF grammar. Cannot be used together with `:grammar`. Tip: set `"additionalProperties" => false` for tighter grammars.
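As a sketch of the `:json_schema` option, here is a hypothetical schema map; with a loaded model, the commented call would constrain generation to JSON matching it:

```elixir
# A JSON Schema map for structured output. Setting
# "additionalProperties" => false keeps the derived GBNF grammar tight,
# per the tip above.
schema = %{
  "type" => "object",
  "properties" => %{
    "name" => %{"type" => "string"},
    "age" => %{"type" => "integer"}
  },
  "required" => ["name", "age"],
  "additionalProperties" => false
}

# With a loaded model (prompt is illustrative), the output is constrained
# to valid JSON matching the schema:
# {:ok, json} =
#   LlamaCppEx.generate(model, "Describe a person as JSON: ",
#     max_tokens: 128, json_schema: schema, temp: 0.0)
```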
### init/0

```elixir
@spec init() :: :ok
```

Initializes the llama.cpp backend. Call once at application start.
### load_model/2

```elixir
@spec load_model(String.t(), keyword()) ::
        {:ok, LlamaCppEx.Model.t()} | {:error, String.t()}
```

Loads a GGUF model from the given file path.

See `LlamaCppEx.Model.load/2` for options.
### load_model_from_hub/3

```elixir
@spec load_model_from_hub(String.t(), String.t(), keyword()) ::
        {:ok, LlamaCppEx.Model.t()} | {:error, String.t()}
```

Downloads a GGUF model from HuggingFace Hub and loads it.

Requires the optional `:req` dependency.

#### Examples

```elixir
:ok = LlamaCppEx.init()

{:ok, model} = LlamaCppEx.load_model_from_hub(
  "Qwen/Qwen3-4B-GGUF",
  "qwen3-4b-q4_k_m.gguf",
  n_gpu_layers: -1
)
```

#### Options

Accepts all options from `load_model/2` plus:

- `:cache_dir` - Local cache directory for downloaded models.
- `:token` - HuggingFace API token for private repos.
- `:progress` - Download progress callback.
- `:revision` - Git revision (branch, tag, commit). Defaults to `"main"`.
### stream/3

```elixir
@spec stream(LlamaCppEx.Model.t(), String.t(), keyword()) :: Enumerable.t()
```

Returns a lazy stream of generated text chunks (tokens).

Each element is a string (the text piece for one token). The stream ends when an end-of-generation token is produced or `:max_tokens` is reached.

Accepts the same options as `generate/3`.

#### Examples

```elixir
model
|> LlamaCppEx.stream("Tell me a story", max_tokens: 500)
|> Enum.each(&IO.write/1)
```
### stream_chat/3

```elixir
@spec stream_chat(LlamaCppEx.Model.t(), [LlamaCppEx.Chat.message()], keyword()) :: Enumerable.t()
```

Returns a lazy stream of chat response chunks.

Applies the chat template and streams the generated response token by token.

Accepts the same options as `chat/3`.
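Because the stream yields plain string chunks, standard `Stream`/`Enum` pipelines apply. A self-contained sketch, with a hypothetical stand-in stream in place of a real model call:

```elixir
# With a loaded model this would be:
#   stream = LlamaCppEx.stream_chat(model, messages, max_tokens: 200)
# A stand-in stream of string chunks is used here so the pipeline runs
# without a model.
stream = Stream.map(["Eli", "xir ", "is ", "fun"], & &1)

reply =
  stream
  |> Stream.each(&IO.write/1)   # print each chunk as it arrives
  |> Enum.join()                # and collect the full reply
```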
### stream_chat_completion/3

```elixir
@spec stream_chat_completion(LlamaCppEx.Model.t(), [LlamaCppEx.Chat.message()], keyword()) ::
        Enumerable.t()
```

Returns a lazy stream of OpenAI-compatible chat completion chunks.

Each element is a `%ChatCompletionChunk{}` struct. The first chunk contains `delta: %{role: "assistant", content: ""}`. Subsequent chunks contain `delta: %{content: "token_text"}`. The final chunk contains the `finish_reason`. All chunks share the same `id` and `created` timestamp.

#### Options

Accepts the same options as `chat_completion/3`.

#### Examples

```elixir
model
|> LlamaCppEx.stream_chat_completion(messages, max_tokens: 200)
|> Enum.each(fn chunk ->
  chunk.choices |> hd() |> get_in([:delta, :content]) |> IO.write()
end)
```
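To reassemble the streamed deltas into the full reply, the chunks can be folded over. A sketch using plain maps as hypothetical stand-ins for `%ChatCompletionChunk{}` structs, mirroring the delta shapes described above:

```elixir
# Plain maps stand in for %ChatCompletionChunk{} structs.
chunks = [
  %{choices: [%{delta: %{role: "assistant", content: ""}}]},
  %{choices: [%{delta: %{content: "Hello"}}]},
  %{choices: [%{delta: %{content: ", world!"}}]},
  %{choices: [%{delta: %{}, finish_reason: "stop"}]}
]

text =
  chunks
  |> Enum.map(fn chunk -> chunk.choices |> hd() |> get_in([:delta, :content]) end)
  |> Enum.reject(&is_nil/1)   # the final finish_reason chunk carries no content
  |> Enum.join()
# => "Hello, world!"
```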