Architecture

Overview

LlamaCppEx provides Elixir bindings for llama.cpp via C++ NIFs (Native Implemented Functions). The design follows the same pattern used by production Elixir ML libraries like EXLA and Evision.

Layer Diagram

graph TD
    A[Elixir API<br/>LlamaCppEx] --> B[NIF Stubs<br/>LlamaCppEx.NIF]
    B --> C[C++ NIF Layer<br/>c_src/llama_nif.cpp]
    C --> D[fine.hpp<br/>Type Encoding + RAII]
    C --> E[llama.cpp Static Libs<br/>libllama.a + libggml.a]
    E --> F[Hardware Backend]
    F --> G[Metal<br/>macOS GPU]
    F --> H[CUDA<br/>NVIDIA GPU]
    F --> I[Vulkan<br/>Cross-platform GPU]
    F --> J[CPU<br/>Fallback]

Module Structure

graph LR
    subgraph "High-Level API"
        A[LlamaCppEx]
    end

    subgraph "Core Modules"
        B[Model]
        C[Context]
        D[Sampler]
        E[Tokenizer]
        F[Chat]
        H[Embedding]
        I[Server]
    end

    subgraph "Internal"
        G[NIF]
    end

    A --> B
    A --> C
    A --> D
    A --> E
    A --> F
    A --> H
    A --> I
    B --> G
    C --> G
    D --> G
    E --> G
    F --> G
    H --> G
    I --> G

Resource Lifecycle

All C++ objects are wrapped in RAII classes registered with the BEAM as NIF resources via fine. When the last Elixir reference to a resource is garbage collected, the C++ destructor runs automatically.

sequenceDiagram
    participant Elixir as Elixir Process
    participant NIF as C++ NIF
    participant BEAM as BEAM GC

    Elixir->>NIF: Model.load("model.gguf")
    NIF->>NIF: llama_model_load_from_file()
    NIF->>NIF: Wrap in ResourcePtr<LlamaModel>
    NIF-->>Elixir: {:ok, %Model{ref: resource}}

    Elixir->>NIF: Context.create(model, opts)
    NIF->>NIF: llama_init_from_model()
    NIF->>NIF: Wrap in ResourcePtr<LlamaContext>
    Note over NIF: Context holds ResourcePtr<LlamaModel><br/>preventing model GC
    NIF-->>Elixir: {:ok, %Context{ref: resource}}

    Note over Elixir: Context goes out of scope
    BEAM->>NIF: ~LlamaContext()
    NIF->>NIF: llama_free(ctx)
    Note over NIF: Model ref count drops

    Note over Elixir: Model goes out of scope
    BEAM->>NIF: ~LlamaModel()
    NIF->>NIF: llama_model_free(model)

Resource Types

| C++ Wrapper | Wraps | Destructor | Prevents GC of |
| --- | --- | --- | --- |
| `LlamaModel` | `llama_model*` | `llama_model_free()` | - |
| `LlamaContext` | `llama_context*` | `llama_free()` | `LlamaModel` |
| `LlamaSampler` | `llama_sampler*` | `llama_sampler_free()` | - |

The Context holds a `ResourcePtr<LlamaModel>` so the model cannot be garbage collected while the context is alive. This is critical because `llama_context` internally references the model's weights.

NIF Scheduler Assignment

NIFs are assigned to the appropriate scheduler based on their execution characteristics:

| NIF | Scheduler | Reason |
| --- | --- | --- |
| `model_load` | DirtyIO | Reads multi-GB file from disk |
| `context_create` | DirtyCPU | GPU memory allocation |
| `decode` | DirtyCPU | Forward pass (compute-heavy) |
| `generate` | DirtyCPU | Tight decode+sample loop |
| `generate_tokens` | DirtyCPU | Streaming decode+sample loop |
| `tokenize`, `detokenize` | Normal | Fast string operations |
| `sampler_*` | Normal | Lightweight operations |
| `model_*` (introspection) | Normal | Simple field reads |
| `prefill` | DirtyCPU | Prompt processing forward pass |
| `embed_decode` | DirtyCPU | Embedding forward pass |
| `get_embeddings` | Normal | Read embedding vectors |
| `batch_eval` | DirtyCPU | Batched forward pass (continuous batching) |
| `sampler_sample_at` | Normal | Sample at specific batch index |
| `decode_token` | DirtyCPU | Single-token forward pass |
| `decode_batch` | DirtyCPU | Multi-sequence decode + sample |

Why dirty schedulers? Regular NIF calls should return within about 1 ms; anything longer blocks a BEAM scheduler thread and delays every process scheduled on it. Model loading and inference can take seconds to minutes. Dirty schedulers run these long operations on separate OS threads, so the normal schedulers stay responsive.

Text Generation Flow

sequenceDiagram
    participant User as Elixir Caller
    participant API as LlamaCppEx
    participant Tok as Tokenizer
    participant Ctx as Context
    participant Sam as Sampler
    participant NIF as C++ NIF

    User->>API: generate(model, "Hello", max_tokens: 100)
    API->>Tok: encode(model, "Hello")
    Tok->>NIF: tokenize(vocab, "Hello")
    NIF-->>Tok: [token_ids]
    Tok-->>API: {:ok, [15496]}

    API->>Ctx: create(model, n_ctx: 2048)
    Ctx->>NIF: context_create(model, params)
    NIF-->>Ctx: {:ok, ctx_ref}

    API->>Sam: create(temp: 0.8)
    Sam->>NIF: sampler_init(params)
    NIF-->>Sam: {:ok, sampler_ref}

    API->>Ctx: generate(ctx, sampler, tokens, max_tokens: 100)
    Ctx->>NIF: generate(ctx, sampler, tokens, 100)

    loop For each token (on DirtyCPU scheduler)
        NIF->>NIF: llama_decode(batch)
        NIF->>NIF: llama_sampler_sample(sampler, ctx, -1)
        NIF->>NIF: llama_sampler_accept(sampler, token)
        Note over NIF: Check for EOG token
    end

    NIF->>NIF: Detokenize all generated tokens
    NIF-->>Ctx: {:ok, "world, how are you?"}
    Ctx-->>API: {:ok, "world, how are you?"}
    API-->>User: {:ok, "world, how are you?"}

Streaming Flow

Streaming uses enif_send to send tokens from the dirty scheduler to the calling Elixir process:

sequenceDiagram
    participant User as Elixir Caller
    participant Stream as Stream.resource/3
    participant Gen as Generator (spawn_link)
    participant NIF as C++ NIF (DirtyCPU)

    User->>Stream: LlamaCppEx.stream(model, prompt)

    Stream->>Stream: Tokenize, create ctx + sampler
    Stream->>Gen: spawn_link(generate_tokens NIF)

    loop Token generation
        NIF->>NIF: llama_decode + llama_sampler_sample
        NIF-->>Stream: enif_send {ref, {:token, id, "text"}}
        Stream-->>User: "text" (via Enum.each)
    end

    alt End of generation
        NIF-->>Stream: enif_send {ref, :eog}
    else Max tokens reached
        NIF-->>Stream: enif_send {ref, :done}
    end

    Stream->>Gen: Process.exit(pid, :kill)
    Stream->>Stream: Flush remaining messages

Key design decisions:

- The generator runs in a `spawn_link`ed process on a dirty scheduler
- Messages carry a unique ref to prevent cross-stream interference
- `Stream.resource/3` provides lazy enumeration with proper cleanup
- Early termination (e.g., `Enum.take/2`) kills the generator and flushes its remaining messages
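
The message protocol and cleanup described above can be approximated in plain Elixir without loading a model. This is a minimal sketch, not the library's implementation: `StreamSketch` and `fake_generate_tokens/3` are hypothetical names, and an ordinary process stands in for the `generate_tokens` NIF. The sketch also unlinks and monitors the generator before killing it, which is an assumption made here so that the kill does not propagate to the consumer and the flush runs only after the generator has stopped sending.

```elixir
defmodule StreamSketch do
  # Builds a lazy stream of text pieces. `pieces` stands in for model output.
  def stream(pieces) do
    Stream.resource(
      # start: runs in the consuming process once enumeration begins
      fn ->
        parent = self()
        ref = make_ref()
        pid = spawn_link(fn -> fake_generate_tokens(parent, ref, pieces) end)
        {ref, pid}
      end,
      # next: block until the generator sends the next message tagged with ref
      fn {ref, _pid} = acc ->
        receive do
          {^ref, {:token, _id, text}} -> {[text], acc}
          {^ref, :eog} -> {:halt, acc}
          {^ref, :done} -> {:halt, acc}
        end
      end,
      # cleanup: runs on normal completion and on early termination
      fn {ref, pid} ->
        mon = Process.monitor(pid)
        Process.unlink(pid)
        Process.exit(pid, :kill)

        # Wait until the generator is gone so no message arrives after the flush
        receive do
          {:DOWN, ^mon, :process, ^pid, _reason} -> :ok
        end

        flush(ref)
      end
    )
  end

  # Hypothetical stand-in for the NIF: one message per token, then :done
  defp fake_generate_tokens(parent, ref, pieces) do
    pieces
    |> Enum.with_index()
    |> Enum.each(fn {piece, id} -> send(parent, {ref, {:token, id, piece}}) end)

    send(parent, {ref, :done})
  end

  # Drop any messages for this stream that were sent but never consumed
  defp flush(ref) do
    receive do
      {^ref, _msg} -> flush(ref)
    after
      0 -> :ok
    end
  end
end
```

Calling `StreamSketch.stream(["Hello", ",", " world"]) |> Enum.take(2)` stops early, and the cleanup function removes the unconsumed token and `:done` messages from the caller's mailbox.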

Build System

graph TD
    A[mix compile] --> B[elixir_make]
    B --> C[Makefile]
    C --> D{Backend Detection}
    D -->|LLAMA_BACKEND=metal| E[CMake -DGGML_METAL=ON]
    D -->|LLAMA_BACKEND=cuda| F[CMake -DGGML_CUDA=ON]
    D -->|LLAMA_BACKEND=vulkan| G[CMake -DGGML_VULKAN=ON]
    D -->|LLAMA_BACKEND=cpu| H[CMake - CPU only]
    D -->|Auto-detect| I{Platform?}
    I -->|macOS| E
    I -->|nvcc found| F
    I -->|Otherwise| H

    E --> J[Build llama.cpp static libs]
    F --> J
    G --> J
    H --> J

    J --> K[Compile llama_nif.cpp]
    K --> L[Link into priv/llama_cpp_ex_nif.so]

    subgraph "Static Libraries"
        J --> M[libllama.a]
        J --> N[libggml.a]
        J --> O[libggml-base.a]
        J --> P[libggml-metal.a / libggml-cuda.a / ...]
    end

Continuous Batching

The Server uses continuous batching to serve multiple concurrent users with a single forward pass per tick:

graph TD
    subgraph "Elixir Layer"
        A[Caller 1] --> D[LlamaCppEx.Server<br/>Tick-driven GenServer]
        B[Caller 2] --> D
        C[Caller 3] --> D
        D -->|Queue| Q[Request Queue<br/>FIFO :queue]
    end

    D -->|"One tick = one forward pass"| E[batch_eval NIF]

    subgraph "C++ Layer"
        E --> F[Build llama_batch<br/>decode tokens + prefill chunks]
        F --> G[Single llama_decode call]
    end

    G --> H[sampler_sample_at per slot]
    H --> I[Stream/reply to callers]

Tick Loop

Each tick executes five phases:

1. Finish: Complete slots that hit EOG or max tokens, then dequeue waiting requests
2. Build batch: Add decode tokens first (priority), then fill the remaining budget with prefill chunks
3. Forward pass: Single `batch_eval` NIF call
4. Sample: `sampler_sample_at` for each generating/completing slot at its batch index
5. Continue: Schedule the next tick if any active slots remain

Chunked Prefill

Long prompts are split into chunks (default 512 tokens) and processed across multiple ticks, interleaved with decode tokens from generating slots:

sequenceDiagram
    participant S as Server
    participant NIF as batch_eval NIF

    Note over S: Tick 1: Slot 0 generating, Slot 1 prefilling (2048 tok prompt)
    S->>NIF: batch_eval([slot0_decode_tok, slot1_prefill_chunk_0..511])
    NIF-->>S: :ok
    Note over S: Sample slot 0, advance slot 1 prefill_pos

    Note over S: Tick 2: Slot 0 generating, Slot 1 still prefilling
    S->>NIF: batch_eval([slot0_decode_tok, slot1_prefill_chunk_512..1023])
    NIF-->>S: :ok

    Note over S: Tick 3: Continue chunking (chunk 1024..1535)

    Note over S: Tick 4: Slot 1 prefill complete (last chunk has logits=true)
    S->>NIF: batch_eval([slot0_decode_tok, slot1_prefill_chunk_1536..2047])
    NIF-->>S: :ok
    Note over S: Sample both slots — slot 1 now generating
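
The chunk boundaries in the diagram follow from simple arithmetic on the prompt length and chunk size. The sketch below shows one way to compute them. `PrefillSketch.chunks/2` and its `{first, last, logits?}` tuples are hypothetical and only illustrate the idea that the final chunk is the one that requests logits.

```elixir
defmodule PrefillSketch do
  # Splits token positions 0..prompt_len-1 into chunks of at most chunk_size.
  # Returns {first_pos, last_pos, logits?}; only the last chunk requests logits.
  def chunks(prompt_len, chunk_size \\ 512) when prompt_len > 0 and chunk_size > 0 do
    0..(prompt_len - 1)
    |> Enum.chunk_every(chunk_size)
    |> Enum.map(fn positions ->
      first = hd(positions)
      last = List.last(positions)
      {first, last, last == prompt_len - 1}
    end)
  end
end
```

For a 2048-token prompt and the default chunk size, this yields four chunks ending at positions 511, 1023, 1535, and 2047, with logits requested only for the last one.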

Prefix Caching

When `cache_prompt: true` is set, the server retains a slot's KV cache after the slot finishes. The next request assigned to that slot gets automatic prefix detection:

Request 1: [system_prompt, user_turn_1]  → full prefill
Request 2: [system_prompt, user_turn_1, assistant_reply, user_turn_2]
                                          → skip prefill for common prefix
                                            only process new tokens

The common_prefix_length helper compares new tokens with cached tokens. Prefix-affinity slot selection picks the idle slot with the best match.
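
A minimal sketch of both ideas, assuming token sequences are plain lists of integers. Only the name `common_prefix_length` comes from the text above. The module name, the `pick_slot/2` helper, and the slot map shape (`:id`, `:state`, `:cached`) are hypothetical.

```elixir
defmodule PrefixSketch do
  # Number of leading tokens shared by the cached and the new sequence
  def common_prefix_length(cached, new), do: count(cached, new, 0)

  # Same head token in both lists: count it and keep going
  defp count([t | rest_a], [t | rest_b], n), do: count(rest_a, rest_b, n + 1)
  # First mismatch, or either list exhausted
  defp count(_, _, n), do: n

  # Prefix-affinity selection: the idle slot whose cache matches best, or nil
  def pick_slot(slots, tokens) do
    case Enum.filter(slots, &(&1.state == :idle)) do
      [] -> nil
      idle -> Enum.max_by(idle, &common_prefix_length(&1.cached, tokens))
    end
  end
end
```

With the prefix length `n` known, only `Enum.drop(tokens, n)` still needs to go through prefill.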

Pluggable Batching Strategies

The batch building logic is extracted into a BatchStrategy behaviour. Three strategies are provided:

- DecodeMaximal (default): Decode tokens first, prefill fills remaining budget
- PrefillPriority: Prefill first, decode fills remainder (throughput-oriented)
- Balanced: Equal budget split between decode and prefill

Custom strategies implement build_batch(slots, budget, chunk_size, opts).
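
As an illustration of what a custom strategy might look like, here is a decode-first sketch. Only the `build_batch(slots, budget, chunk_size, opts)` argument list comes from the text above. The module names, the slot map shape (`:id`, `:state`, `:pending`), and the return value (a list of `{slot_id, kind, token_count}` entries) are assumptions made for this example and may not match the library's actual behaviour definition.

```elixir
defmodule BatchStrategySketch do
  # Hypothetical behaviour mirroring the callback named in the text
  @callback build_batch(
              slots :: [map()],
              budget :: pos_integer(),
              chunk_size :: pos_integer(),
              opts :: keyword()
            ) :: [{term(), :decode | :prefill, pos_integer()}]
end

defmodule DecodeFirstSketch do
  @behaviour BatchStrategySketch

  @impl true
  def build_batch(slots, budget, chunk_size, _opts) do
    generating = Enum.filter(slots, &(&1.state == :generating))
    prefilling = Enum.filter(slots, &(&1.state == :prefilling))

    # One decode token per generating slot, capped by the token budget
    decode = generating |> Enum.take(budget) |> Enum.map(&{&1.id, :decode, 1})
    remaining = budget - length(decode)

    # Spend what is left on prefill chunks, at most chunk_size tokens per slot
    {prefill, _left} =
      Enum.reduce(prefilling, {[], remaining}, fn slot, {acc, left} ->
        n = Enum.min([chunk_size, slot.pending, left])
        if n > 0, do: {[{slot.id, :prefill, n} | acc], left - n}, else: {acc, left}
      end)

    decode ++ Enum.reverse(prefill)
  end
end
```

Because decode entries are placed first, generating slots always advance by one token per tick, and prefill only consumes whatever budget is left over.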

Why Batching Matters

- Prefill (prompt processing): Already GPU-efficient, compute-bound
- Decode (token generation): Memory-bandwidth-bound, GPU utilization 10-30%
- Batching: Converts N serial matrix-vector ops into one matrix-matrix multiply

File Map

llama_cpp_ex/
├── mix.exs                          # Project config, deps, Hex package metadata
├── Makefile                         # CMake + NIF build system
├── vendor/llama.cpp/                # Git submodule (pinned to release)
├── c_src/llama_cpp_ex/
│   ├── llama_nif.h                  # RAII wrappers (LlamaModel, LlamaContext, LlamaSampler)
│   └── llama_nif.cpp                # All NIF implementations (~900 lines)
├── lib/
│   ├── llama_cpp_ex.ex              # High-level API: generate, stream, chat, embed
│   └── llama_cpp_ex/
│       ├── nif.ex                   # @on_load + NIF stubs
│       ├── model.ex                 # Model loading + introspection
│       ├── context.ex               # Inference context with KV cache
│       ├── sampler.ex               # Sampling chain configuration
│       ├── tokenizer.ex             # Text <-> token conversion
│       ├── chat.ex                  # Chat template formatting
│       ├── embedding.ex             # Text embeddings (L2 norm, batched)
│       ├── server.ex                # Continuous batching GenServer
│       ├── server/
│       │   ├── batch_strategy.ex    # BatchStrategy behaviour
│       │   └── strategy/
│       │       ├── decode_maximal.ex  # Decode-first (default)
│       │       ├── prefill_priority.ex # Prefill-first (throughput)
│       │       └── balanced.ex        # Equal split
│       └── hub.ex                   # HuggingFace Hub downloads
├── priv/                            # Build output (.so / .dylib)
├── bench/                           # Benchee benchmarks
├── docs/                            # Architecture docs + ADRs
└── test/
    ├── llama_cpp_ex_test.exs        # Model-dependent tests
    ├── batch_strategy_test.exs      # Strategy unit tests (no model)
    └── hub_test.exs                 # Hub unit tests (no network)
