Overview

LlamaCppEx provides Elixir bindings for llama.cpp via C++ NIFs (Native Implemented Functions). The design follows the same pattern used by production Elixir ML libraries like EXLA and Evision.

Layer Diagram

graph TD
    A[Elixir API<br/>LlamaCppEx] --> B[NIF Stubs<br/>LlamaCppEx.NIF]
    B --> C[C++ NIF Layer<br/>c_src/llama_nif.cpp]
    C --> D[fine.hpp<br/>Type Encoding + RAII]
    C --> E[llama.cpp Static Libs<br/>libllama.a + libggml.a]
    E --> F[Hardware Backend]
    F --> G[Metal<br/>macOS GPU]
    F --> H[CUDA<br/>NVIDIA GPU]
    F --> I[Vulkan<br/>Cross-platform GPU]
    F --> J[CPU<br/>Fallback]

Module Structure

graph LR
    subgraph "High-Level API"
        A[LlamaCppEx]
    end

    subgraph "Core Modules"
        B[Model]
        C[Context]
        D[Sampler]
        E[Tokenizer]
        F[Chat]
        H[Embedding]
        I[Server]
    end

    subgraph "Internal"
        G[NIF]
    end

    A --> B
    A --> C
    A --> D
    A --> E
    A --> F
    A --> H
    A --> I
    B --> G
    C --> G
    D --> G
    E --> G
    F --> G
    H --> G
    I --> G

Resource Lifecycle

All C++ objects are wrapped in RAII classes registered with the BEAM as NIF resources via fine. When the last Elixir term referencing a resource is garbage collected, the C++ destructor runs automatically.

sequenceDiagram
    participant Elixir as Elixir Process
    participant NIF as C++ NIF
    participant BEAM as BEAM GC

    Elixir->>NIF: Model.load("model.gguf")
    NIF->>NIF: llama_model_load_from_file()
    NIF->>NIF: Wrap in ResourcePtr<LlamaModel>
    NIF-->>Elixir: {:ok, %Model{ref: resource}}

    Elixir->>NIF: Context.create(model, opts)
    NIF->>NIF: llama_init_from_model()
    NIF->>NIF: Wrap in ResourcePtr<LlamaContext>
    Note over NIF: Context holds ResourcePtr<LlamaModel><br/>preventing model GC
    NIF-->>Elixir: {:ok, %Context{ref: resource}}

    Note over Elixir: Context goes out of scope
    BEAM->>NIF: ~LlamaContext()
    NIF->>NIF: llama_free(ctx)
    Note over NIF: Model ref count drops

    Note over Elixir: Model goes out of scope
    BEAM->>NIF: ~LlamaModel()
    NIF->>NIF: llama_model_free(model)

Resource Types

| C++ Wrapper | Wraps | Destructor | Prevents GC of |
|---|---|---|---|
| LlamaModel | llama_model* | llama_model_free() | - |
| LlamaContext | llama_context* | llama_free() | LlamaModel |
| LlamaSampler | llama_sampler* | llama_sampler_free() | - |

The Context holds a ResourcePtr<LlamaModel> to prevent the model from being garbage collected while the context is alive. This is critical since llama_context internally references the model's weights.

NIF Scheduler Assignment

NIFs are assigned to the appropriate scheduler based on their execution characteristics:

| NIF | Scheduler | Reason |
|---|---|---|
| model_load | DirtyIO | Reads multi-GB file from disk |
| context_create | DirtyCPU | GPU memory allocation |
| decode | DirtyCPU | Forward pass (compute-heavy) |
| generate | DirtyCPU | Tight decode+sample loop |
| generate_tokens | DirtyCPU | Streaming decode+sample loop |
| tokenize, detokenize | Normal | Fast string operations |
| sampler_* | Normal | Lightweight operations |
| model_* (introspection) | Normal | Simple field reads |
| prefill | DirtyCPU | Prompt processing forward pass |
| embed_decode | DirtyCPU | Embedding forward pass |
| get_embeddings | Normal | Read embedding vectors |
| batch_eval | DirtyCPU | Batched forward pass (continuous batching) |
| sampler_sample_at | Normal | Sample at specific batch index |
| decode_token | DirtyCPU | Single-token forward pass |
| decode_batch | DirtyCPU | Multi-sequence decode + sample |

Why dirty schedulers? Regular NIF calls must return within ~1ms to avoid blocking BEAM schedulers. Model loading and inference can take seconds to minutes. Dirty schedulers provide dedicated OS threads for these long-running operations without impacting BEAM responsiveness.
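
The scheduler pools mentioned above can be inspected on any running node. The following sketch uses only standard OTP calls (`:erlang.system_info/1`); the `SchedulerInfo` module name is made up for illustration and is not part of LlamaCppEx.

```elixir
# Inspect the BEAM's scheduler pools (standard OTP calls, not LlamaCppEx API).
defmodule SchedulerInfo do
  def summary do
    %{
      # Normal schedulers: where tokenize, sampler_* and other fast NIFs run.
      normal: :erlang.system_info(:schedulers_online),
      # Dirty CPU schedulers: where decode/generate-style NIFs run.
      dirty_cpu: :erlang.system_info(:dirty_cpu_schedulers_online),
      # Dirty IO schedulers: where model_load runs.
      dirty_io: :erlang.system_info(:dirty_io_schedulers)
    }
  end
end

IO.inspect(SchedulerInfo.summary(), label: "schedulers")
```

Because the dirty CPU pool is finite, the number of inference NIFs that can run truly in parallel is bounded by `dirty_cpu` on that node.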

Text Generation Flow

sequenceDiagram
    participant User as Elixir Caller
    participant API as LlamaCppEx
    participant Tok as Tokenizer
    participant Ctx as Context
    participant Sam as Sampler
    participant NIF as C++ NIF

    User->>API: generate(model, "Hello", max_tokens: 100)
    API->>Tok: encode(model, "Hello")
    Tok->>NIF: tokenize(vocab, "Hello")
    NIF-->>Tok: [token_ids]
    Tok-->>API: {:ok, [15496]}

    API->>Ctx: create(model, n_ctx: 2048)
    Ctx->>NIF: context_create(model, params)
    NIF-->>Ctx: {:ok, ctx_ref}

    API->>Sam: create(temp: 0.8)
    Sam->>NIF: sampler_init(params)
    NIF-->>Sam: {:ok, sampler_ref}

    API->>Ctx: generate(ctx, sampler, tokens, max_tokens: 100)
    Ctx->>NIF: generate(ctx, sampler, tokens, 100)

    loop For each token (on DirtyCPU scheduler)
        NIF->>NIF: llama_decode(batch)
        NIF->>NIF: llama_sampler_sample(sampler, ctx, -1)
        NIF->>NIF: llama_sampler_accept(sampler, token)
        Note over NIF: Check for EOG token
    end

    NIF->>NIF: Detokenize all generated tokens
    NIF-->>Ctx: {:ok, "world, how are you?"}
    Ctx-->>API: {:ok, "world, how are you?"}
    API-->>User: {:ok, "world, how are you?"}
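
The control flow of the decode+sample loop in the diagram can be sketched in plain Elixir. This is only an illustration of the loop's shape: the real loop runs in C++ inside the NIF, `GenerateLoopSketch` and its `step` callback are hypothetical stand-ins for `llama_decode` plus `llama_sampler_sample`, and the `:eog` / `:done` tags are borrowed here for readability.

```elixir
defmodule GenerateLoopSketch do
  # `step` is a hypothetical stand-in for one decode + sample round:
  # it receives all tokens so far and returns the next token id.
  def run(prompt_tokens, step, eog_token, max_tokens) do
    do_run(prompt_tokens, step, eog_token, max_tokens, [])
  end

  # Token budget exhausted: stop and return what was generated.
  defp do_run(_tokens, _step, _eog, 0, acc), do: {:done, Enum.reverse(acc)}

  defp do_run(tokens, step, eog, remaining, acc) do
    case step.(tokens) do
      # End-of-generation token: stop without emitting it.
      ^eog -> {:eog, Enum.reverse(acc)}
      token -> do_run(tokens ++ [token], step, eog, remaining - 1, [token | acc])
    end
  end
end

# Toy "model": the next token is always the last token + 1.
step = fn tokens -> List.last(tokens) + 1 end

GenerateLoopSketch.run([1, 2], step, 5, 100)  # {:eog, [3, 4]}
GenerateLoopSketch.run([1, 2], step, 99, 2)   # {:done, [3, 4]}
```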

Streaming Flow

Streaming uses enif_send to send tokens from the dirty scheduler to the calling Elixir process:

sequenceDiagram
    participant User as Elixir Caller
    participant Stream as Stream.resource/3
    participant Gen as Generator (spawn_link)
    participant NIF as C++ NIF (DirtyCPU)

    User->>Stream: LlamaCppEx.stream(model, prompt)

    Stream->>Stream: Tokenize, create ctx + sampler
    Stream->>Gen: spawn_link(generate_tokens NIF)

    loop Token generation
        NIF->>NIF: llama_decode + llama_sampler_sample
        NIF-->>Stream: enif_send {ref, {:token, id, "text"}}
        Stream-->>User: "text" (via Enum.each)
    end

    alt End of generation
        NIF-->>Stream: enif_send {ref, :eog}
    else Max tokens reached
        NIF-->>Stream: enif_send {ref, :done}
    end

    Stream->>Gen: Process.exit(pid, :kill)
    Stream->>Stream: Flush remaining messages

Key design decisions:

  • The generator is a spawn_link-ed process; its generate_tokens NIF call runs on a dirty CPU scheduler
  • Messages use a unique ref to prevent cross-stream interference
  • Stream.resource/3 provides lazy enumeration with proper cleanup
  • Early termination (e.g., Enum.take/2) kills the generator and flushes messages
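
The design decisions above can be sketched without the NIF. In this minimal example a plain Elixir process plays the role of the generator and sends the same message shapes shown in the diagram; `StreamSketch` and `fake_tokens` are hypothetical, and the cleanup shown (unlink, kill, wait for `:DOWN`, flush) is one possible way to do it, not necessarily what the library does.

```elixir
defmodule StreamSketch do
  # `fake_tokens` is a list of {id, text} pairs standing in for the output
  # of the generate_tokens NIF.
  def stream(fake_tokens) do
    Stream.resource(
      fn ->
        parent = self()
        # Unique ref so messages from different streams cannot be confused.
        ref = make_ref()

        pid =
          spawn_link(fn ->
            Enum.each(fake_tokens, fn {id, text} ->
              send(parent, {ref, {:token, id, text}})
            end)

            send(parent, {ref, :done})
          end)

        {pid, ref}
      end,
      fn {_pid, ref} = state ->
        receive do
          {^ref, {:token, _id, text}} -> {[text], state}
          {^ref, finished} when finished in [:eog, :done] -> {:halt, state}
        end
      end,
      fn {pid, ref} ->
        # Runs on normal completion and on early termination alike.
        mon = Process.monitor(pid)
        # Unlink first so killing the generator does not take us down too.
        Process.unlink(pid)
        Process.exit(pid, :kill)

        receive do
          {:DOWN, ^mon, :process, ^pid, _reason} -> :ok
        end

        flush(ref)
      end
    )
  end

  # Drop any messages for this stream that are still in the mailbox.
  defp flush(ref) do
    receive do
      {^ref, _} -> flush(ref)
    after
      0 -> :ok
    end
  end
end

tokens = [{1, "Hello"}, {2, ","}, {3, " world"}]
StreamSketch.stream(tokens) |> Enum.to_list()  # ["Hello", ",", " world"]
StreamSketch.stream(tokens) |> Enum.take(1)    # ["Hello"]
```

Waiting for the `:DOWN` message before flushing ensures no token message can arrive after the flush.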

Build System

graph TD
    A[mix compile] --> B[elixir_make]
    B --> C[Makefile]
    C --> D{Backend Detection}
    D -->|LLAMA_BACKEND=metal| E[CMake -DGGML_METAL=ON]
    D -->|LLAMA_BACKEND=cuda| F[CMake -DGGML_CUDA=ON]
    D -->|LLAMA_BACKEND=vulkan| G[CMake -DGGML_VULKAN=ON]
    D -->|LLAMA_BACKEND=cpu| H[CMake - CPU only]
    D -->|Auto-detect| I{Platform?}
    I -->|macOS| E
    I -->|nvcc found| F
    I -->|Otherwise| H

    E --> J[Build llama.cpp static libs]
    F --> J
    G --> J
    H --> J

    J --> K[Compile llama_nif.cpp]
    K --> L[Link into priv/llama_cpp_ex_nif.so]

    subgraph "Static Libraries"
        J --> M[libllama.a]
        J --> N[libggml.a]
        J --> O[libggml-base.a]
        J --> P[libggml-metal.a / libggml-cuda.a / ...]
    end

Continuous Batching

The Server uses continuous batching to serve multiple concurrent users with a single forward pass per tick:

graph TD
    subgraph "Elixir Layer"
        A[Caller 1] --> D[LlamaCppEx.Server<br/>Tick-driven GenServer]
        B[Caller 2] --> D
        C[Caller 3] --> D
        D -->|Queue| Q[Request Queue<br/>FIFO :queue]
    end

    D -->|"One tick = one forward pass"| E[batch_eval NIF]

    subgraph "C++ Layer"
        E --> F[Build llama_batch<br/>decode tokens + prefill chunks]
        F --> G[Single llama_decode call]
    end

    G --> H[sampler_sample_at per slot]
    H --> I[Stream/reply to callers]

Tick Loop

Each tick executes five phases:

  1. Finish: Complete slots that hit EOG or max tokens, then dequeue waiting requests
  2. Build batch: Add decode tokens first (priority), then fill the remaining budget with prefill chunks
  3. Forward pass: Single batch_eval NIF call
  4. Sample: Call sampler_sample_at for each generating/completing slot at its batch index
  5. Continue: Schedule the next tick if any active slots remain

Chunked Prefill

Long prompts are split into chunks (default 512 tokens) and processed across multiple ticks, interleaved with decode tokens from generating slots:

sequenceDiagram
    participant S as Server
    participant NIF as batch_eval NIF

    Note over S: Tick 1: Slot 0 generating, Slot 1 prefilling (2048 tok prompt)
    S->>NIF: batch_eval([slot0_decode_tok, slot1_prefill_chunk_0..511])
    NIF-->>S: :ok
    Note over S: Sample slot 0, advance slot 1 prefill_pos

    Note over S: Tick 2: Slot 0 generating, Slot 1 still prefilling
    S->>NIF: batch_eval([slot0_decode_tok, slot1_prefill_chunk_512..1023])
    NIF-->>S: :ok

    Note over S: Tick 3: Slot 1 prefills chunk 1024..1535

    Note over S: Tick 4: Slot 1 prefill complete (last chunk has logits=true)
    S->>NIF: batch_eval([slot0_decode_tok, slot1_prefill_chunk_1536..2047])
    NIF-->>S: :ok
    Note over S: Sample both slots — slot 1 now generating
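
The chunk boundaries in the diagram follow directly from the prompt length and the chunk size. A minimal sketch of that arithmetic, assuming 0-based token positions and that only the final chunk requests logits (`PrefillChunks` is a hypothetical name, not a library module):

```elixir
defmodule PrefillChunks do
  # Returns one {first_pos, last_pos, logits?} tuple per chunk.
  def plan(prompt_len, chunk_size \\ 512) when prompt_len > 0 and chunk_size > 0 do
    0
    |> Stream.iterate(&(&1 + chunk_size))
    |> Enum.take_while(&(&1 < prompt_len))
    |> Enum.map(fn start ->
      last = min(start + chunk_size, prompt_len) - 1
      # Only the chunk that ends the prompt needs logits for sampling.
      {start, last, last == prompt_len - 1}
    end)
  end
end

PrefillChunks.plan(2048)
# [{0, 511, false}, {512, 1023, false}, {1024, 1535, false}, {1536, 2047, true}]
```

A 2048-token prompt at the default chunk size therefore takes four chunks.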

Prefix Caching

When cache_prompt: true is set, the server retains a slot's KV cache after the slot finishes. The next request assigned to that slot skips prefill for any token prefix it shares with the cached sequence:

Request 1: [system_prompt, user_turn_1]
           -> full prefill
Request 2: [system_prompt, user_turn_1, assistant_reply, user_turn_2]
           -> skip prefill for the common prefix [system_prompt, user_turn_1]
           -> only process the new tokens [assistant_reply, user_turn_2]

The common_prefix_length helper compares new tokens with cached tokens. Prefix-affinity slot selection picks the idle slot with the best match.
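
A minimal sketch of what such a helper and the slot selection might look like. The function name `common_prefix_length` comes from the text above; its body, the `pick_slot` function, and the slot map shape (`:id`, `:state`, `:cached_tokens`) are assumptions made for illustration.

```elixir
defmodule PrefixCacheSketch do
  # Number of leading tokens that two token lists have in common.
  def common_prefix_length(new_tokens, cached_tokens) do
    new_tokens
    |> Enum.zip(cached_tokens)
    |> Enum.take_while(fn {a, b} -> a == b end)
    |> length()
  end

  # Prefix-affinity selection: among idle slots, pick the one whose cached
  # tokens share the longest prefix with the new request. Returns nil when
  # no slot is idle. The slot map shape is hypothetical.
  def pick_slot(slots, new_tokens) do
    case Enum.filter(slots, &(&1.state == :idle)) do
      [] -> nil
      idle -> Enum.max_by(idle, &common_prefix_length(new_tokens, &1.cached_tokens))
    end
  end
end

cached = [1, 2, 3, 4]
PrefixCacheSketch.common_prefix_length([1, 2, 3, 4, 9, 10], cached)  # 4
```

With a shared prefix of 4 tokens, only the last two tokens of the new request would need prefill.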

Pluggable Batching Strategies

The batch building logic is extracted into a BatchStrategy behaviour. Three strategies are provided:

  • DecodeMaximal (default): Decode tokens first, prefill fills remaining budget
  • PrefillPriority: Prefill first, decode fills remainder (throughput-oriented)
  • Balanced: Equal budget split between decode and prefill

Custom strategies implement build_batch(slots, budget, chunk_size, opts).
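
As an illustration of the decode-first idea, here is a sketch of a `build_batch/4` function. Only the argument list comes from the text above; the slot map shape, the return value (a list of `{slot_id, n_tokens}` pairs), and the module name are assumptions, and the module deliberately does not declare the behaviour because its callback contract is not shown here.

```elixir
defmodule DecodeFirstSketch do
  # Hypothetical slot shape:
  #   %{id: integer, state: :generating | :prefilling | :idle, pending: integer}
  # where `pending` is the number of prompt tokens still to prefill.
  # Returns [{slot_id, n_tokens}] with token counts summing to <= budget.
  def build_batch(slots, budget, chunk_size, _opts \\ []) do
    decoding = Enum.filter(slots, &(&1.state == :generating))
    prefilling = Enum.filter(slots, &(&1.state == :prefilling))

    # Decode tokens first: one token per generating slot.
    decode = decoding |> Enum.take(budget) |> Enum.map(&{&1.id, 1})
    remaining = budget - length(decode)

    # Fill what is left with prefill chunks, each capped by chunk_size.
    {prefill, _left} =
      Enum.reduce(prefilling, {[], remaining}, fn slot, {acc, left} ->
        n = Enum.min([slot.pending, chunk_size, left])
        if n > 0, do: {[{slot.id, n} | acc], left - n}, else: {acc, left}
      end)

    decode ++ Enum.reverse(prefill)
  end
end

slots = [
  %{id: 0, state: :generating, pending: 0},
  %{id: 1, state: :prefilling, pending: 2048}
]

DecodeFirstSketch.build_batch(slots, 2048, 512)  # [{0, 1}, {1, 512}]
DecodeFirstSketch.build_batch(slots, 100, 512)   # [{0, 1}, {1, 99}]
```

A prefill-first variant would swap the order of the two passes; an equal-split variant would cap each pass at half the budget.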

Why Batching Matters

  • Prefill (prompt processing): Already GPU-efficient, compute-bound
  • Decode (token generation): Memory-bandwidth-bound, GPU utilization 10-30%
  • Batching: Converts N serial matrix-vector ops into one matrix-matrix multiply
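
A back-of-the-envelope model shows why batching helps decode. Assume, as a deliberate simplification, that each forward pass reads every weight once and uses it in one multiply-add (2 FLOPs) per sequence in the batch, ignoring KV cache and activation traffic. The `BatchingMath` module is hypothetical and only encodes that assumption.

```elixir
defmodule BatchingMath do
  # FLOPs performed per byte of weights read, for one decode step over
  # `batch` sequences. Simplified model: KV cache and activations ignored.
  def flops_per_weight_byte(batch, bytes_per_weight) do
    2 * batch / bytes_per_weight
  end
end

BatchingMath.flops_per_weight_byte(1, 2)  # 1.0 (16-bit weights, one sequence)
BatchingMath.flops_per_weight_byte(8, 2)  # 8.0 (same weight traffic, 8x the work)
```

Under this model the weight traffic per tick stays constant while useful work grows linearly with the number of active slots, which is the effect of turning N matrix-vector products into one matrix-matrix multiply.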

File Map

llama_cpp_ex/
├── mix.exs                          # Project config, deps, Hex package metadata
├── Makefile                         # CMake + NIF build system
├── vendor/llama.cpp/                # Git submodule (pinned to release)
├── c_src/llama_cpp_ex/
│   ├── llama_nif.h                  # RAII wrappers (LlamaModel, LlamaContext, LlamaSampler)
│   └── llama_nif.cpp                # All NIF implementations (~900 lines)
├── lib/
│   ├── llama_cpp_ex.ex              # High-level API: generate, stream, chat, embed
│   └── llama_cpp_ex/
│       ├── nif.ex                   # @on_load + NIF stubs
│       ├── model.ex                 # Model loading + introspection
│       ├── context.ex               # Inference context with KV cache
│       ├── sampler.ex               # Sampling chain configuration
│       ├── tokenizer.ex             # Text <-> token conversion
│       ├── chat.ex                  # Chat template formatting
│       ├── embedding.ex             # Text embeddings (L2 norm, batched)
│       ├── server.ex                # Continuous batching GenServer
│       ├── server/
│       │   ├── batch_strategy.ex    # BatchStrategy behaviour
│       │   └── strategy/
│       │       ├── decode_maximal.ex    # Decode-first (default)
│       │       ├── prefill_priority.ex  # Prefill-first (throughput)
│       │       └── balanced.ex          # Equal split
│       └── hub.ex                   # HuggingFace Hub downloads
├── priv/                            # Build output (.so / .dylib)
├── bench/                           # Benchee benchmarks
├── docs/                            # Architecture docs + ADRs
└── test/
    ├── llama_cpp_ex_test.exs        # Model-dependent tests
    ├── batch_strategy_test.exs      # Strategy unit tests (no model)
    └── hub_test.exs                 # Hub unit tests (no network)