# `LlamaCppEx`
[🔗](https://github.com/nyo16/llama_cpp_ex/blob/main/lib/llama_cpp_ex.ex#L1)

Elixir bindings for llama.cpp.

Provides a high-level API for loading GGUF models and generating text.

## Quick Start

    # Initialize the backend (once per application)
    :ok = LlamaCppEx.init()

    # Load a model
    {:ok, model} = LlamaCppEx.load_model("model.gguf", n_gpu_layers: -1)

    # Generate text
    {:ok, text} = LlamaCppEx.generate(model, "Once upon a time", max_tokens: 200)

## Lower-level API

For fine-grained control, use the individual modules:

  * `LlamaCppEx.Model` - Model loading and introspection
  * `LlamaCppEx.Context` - Inference context with KV cache
  * `LlamaCppEx.Sampler` - Token sampling configuration
  * `LlamaCppEx.Tokenizer` - Text tokenization and detokenization
  * `LlamaCppEx.Embedding` - Embedding generation
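As a rough sketch of how these modules might compose (only `LlamaCppEx.Model.load/2` is documented in this reference; the tokenizer call below uses a hypothetical name for illustration — consult each module's docs for the actual functions):

```elixir
# Documented entry point: load a model directly via the Model module.
{:ok, model} = LlamaCppEx.Model.load("model.gguf", n_gpu_layers: -1)

# Hypothetical lower-level step (name and arity illustrative only):
# tokenize a prompt before driving a context manually.
{:ok, tokens} = LlamaCppEx.Tokenizer.tokenize(model, "Once upon a time")
```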

# `chat`

```elixir
@spec chat(LlamaCppEx.Model.t(), [LlamaCppEx.Chat.message()], keyword()) ::
  {:ok, String.t()} | {:error, String.t()}
```

Applies the chat template and generates a response.

## Options

Accepts all options from `generate/3` plus:

  * `:template` - Custom chat template string. Defaults to the model's embedded template.

## Examples

    {:ok, reply} = LlamaCppEx.chat(model, [
      %{role: "system", content: "You are helpful."},
      %{role: "user", content: "What is Elixir?"}
    ], max_tokens: 200)

# `chat_completion`

```elixir
@spec chat_completion(LlamaCppEx.Model.t(), [LlamaCppEx.Chat.message()], keyword()) ::
  {:ok, LlamaCppEx.ChatCompletion.t()} | {:error, term()}
```

Generates an OpenAI-compatible chat completion response.

Applies the chat template, runs generation, and returns a `%ChatCompletion{}`
struct with choices, usage counts, and finish reason.

## Options

Accepts all options from `generate/3` plus:

  * `:template` - Custom chat template string. Defaults to the model's embedded template.

## Examples

    {:ok, completion} = LlamaCppEx.chat_completion(model, [
      %{role: "user", content: "What is Elixir?"}
    ], max_tokens: 200)

    completion.choices |> hd() |> Map.get(:message) |> Map.get(:content)

# `embed`

```elixir
@spec embed(LlamaCppEx.Model.t(), String.t(), keyword()) ::
  {:ok, LlamaCppEx.Embedding.t()} | {:error, String.t()}
```

Computes an embedding for a single text.

See `LlamaCppEx.Embedding.embed/3` for options.
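A minimal call might look like the following (the input text is arbitrary; available options are documented on `LlamaCppEx.Embedding.embed/3`):

```elixir
# Compute a single embedding; `model` is a previously loaded GGUF model.
{:ok, embedding} = LlamaCppEx.embed(model, "The quick brown fox")
```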

# `embed_batch`

```elixir
@spec embed_batch(LlamaCppEx.Model.t(), [String.t()], keyword()) ::
  {:ok, [LlamaCppEx.Embedding.t()]} | {:error, String.t()}
```

Computes embeddings for multiple texts.

See `LlamaCppEx.Embedding.embed_batch/3` for options.
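Batch embeddings are commonly compared with cosine similarity. A self-contained sketch in plain Elixir (the vectors below are stand-ins for real `embed_batch/3` output, whose struct layout is not shown here):

```elixir
defmodule Similarity do
  # Cosine similarity between two float lists of equal length.
  def cosine(a, b) do
    dot = Enum.zip(a, b) |> Enum.map(fn {x, y} -> x * y end) |> Enum.sum()
    norm = fn v -> v |> Enum.map(&(&1 * &1)) |> Enum.sum() |> :math.sqrt() end
    dot / (norm.(a) * norm.(b))
  end
end

Similarity.cosine([1.0, 0.0], [1.0, 0.0])
# => 1.0
```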

# `generate`

```elixir
@spec generate(LlamaCppEx.Model.t(), String.t(), keyword()) ::
  {:ok, String.t()} | {:error, String.t()}
```

Generates text from a prompt.

Creates a temporary context and sampler, tokenizes the prompt, runs generation,
and returns the generated text.

## Options

  * `:max_tokens` - Maximum tokens to generate. Defaults to `256`.
  * `:n_ctx` - Context size. Defaults to `2048`.
  * `:temp` - Sampling temperature. `0.0` for greedy. Defaults to `0.8`.
  * `:top_k` - Top-K filtering. Defaults to `40`.
  * `:top_p` - Top-P (nucleus) filtering. Defaults to `0.95`.
  * `:min_p` - Min-P filtering. Defaults to `0.05`.
  * `:seed` - Random seed. Defaults to random.
  * `:penalty_repeat` - Repetition penalty. Defaults to `1.0`.
  * `:penalty_freq` - Frequency penalty (0.0–2.0). Defaults to `0.0`.
  * `:penalty_present` - Presence penalty (0.0–2.0). Defaults to `0.0`.
  * `:grammar` - GBNF grammar string for constrained generation.
  * `:grammar_root` - Root rule name for grammar. Defaults to `"root"`.
  * `:json_schema` - JSON Schema map for structured output. Automatically converted
    to a GBNF grammar. Cannot be used together with `:grammar`. Tip: set
    `"additionalProperties" => false` for tighter grammars.
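For example, structured output might combine `:json_schema` with greedy sampling, per the options above (the schema and prompt are illustrative):

```elixir
# A JSON Schema map; "additionalProperties" => false yields a tighter grammar.
schema = %{
  "type" => "object",
  "properties" => %{
    "name" => %{"type" => "string"},
    "age" => %{"type" => "integer"}
  },
  "required" => ["name", "age"],
  "additionalProperties" => false
}

{:ok, json} =
  LlamaCppEx.generate(model, "Describe a person as JSON:",
    json_schema: schema,
    temp: 0.0,
    max_tokens: 128
  )
```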

# `init`

```elixir
@spec init() :: :ok
```

Initializes the llama.cpp backend. Call once at application start.

# `load_model`

```elixir
@spec load_model(
  String.t(),
  keyword()
) :: {:ok, LlamaCppEx.Model.t()} | {:error, String.t()}
```

Loads a GGUF model from the given file path.

See `LlamaCppEx.Model.load/2` for options.

# `load_model_from_hub`

```elixir
@spec load_model_from_hub(String.t(), String.t(), keyword()) ::
  {:ok, LlamaCppEx.Model.t()} | {:error, String.t()}
```

Downloads a GGUF model from HuggingFace Hub and loads it.

Requires the optional `:req` dependency.

## Examples

    :ok = LlamaCppEx.init()
    {:ok, model} = LlamaCppEx.load_model_from_hub(
      "Qwen/Qwen3-4B-GGUF",
      "qwen3-4b-q4_k_m.gguf",
      n_gpu_layers: -1
    )

## Options

Accepts all options from `load_model/2` plus:

  * `:cache_dir` - Local cache directory for downloaded models.
  * `:token` - HuggingFace API token for private repos.
  * `:progress` - Download progress callback.
  * `:revision` - Git revision (branch, tag, commit). Defaults to `"main"`.

# `stream`

```elixir
@spec stream(LlamaCppEx.Model.t(), String.t(), keyword()) :: Enumerable.t()
```

Returns a lazy stream of generated text chunks (tokens).

Each element is a string (the text piece for one token). The stream ends
when an end-of-generation token is produced or `max_tokens` is reached.

Accepts the same options as `generate/3`.

## Examples

    model
    |> LlamaCppEx.stream("Tell me a story", max_tokens: 500)
    |> Enum.each(&IO.write/1)

# `stream_chat`

```elixir
@spec stream_chat(LlamaCppEx.Model.t(), [LlamaCppEx.Chat.message()], keyword()) ::
  Enumerable.t()
```

Returns a lazy stream of chat response chunks.

Applies the chat template and streams the generated response token by token.
Accepts the same options as `chat/3`.
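A usage sketch, mirroring the `stream/3` example (the messages list is illustrative):

```elixir
messages = [%{role: "user", content: "What is Elixir?"}]

model
|> LlamaCppEx.stream_chat(messages, max_tokens: 200)
|> Enum.each(&IO.write/1)
```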

# `stream_chat_completion`

```elixir
@spec stream_chat_completion(
  LlamaCppEx.Model.t(),
  [LlamaCppEx.Chat.message()],
  keyword()
) ::
  Enumerable.t()
```

Returns a lazy stream of OpenAI-compatible chat completion chunks.

Each element is a `%ChatCompletionChunk{}` struct. The first chunk contains
`delta: %{role: "assistant", content: ""}`. Subsequent chunks contain
`delta: %{content: "token_text"}`. The final chunk contains the `finish_reason`.

All chunks share the same `id` and `created` timestamp.

## Options

Accepts the same options as `chat_completion/3`.

## Examples

    model
    |> LlamaCppEx.stream_chat_completion(messages, max_tokens: 200)
    |> Enum.each(fn chunk ->
      chunk.choices |> hd() |> get_in([:delta, :content]) |> IO.write()
    end)

---

*Consult [api-reference.md](api-reference.md) for the complete listing.*
