## v0.7.0

### Added

- Prefix caching — Same-slot KV cache reuse for multi-turn chat. When a new request shares a prefix with the slot's previous request, the common prefix is skipped during prefill. 1.23x faster for multi-turn conversations. Controlled by the `cache_prompt` option (default `false`, opt-in). Includes prefix-affinity slot selection. See ADR 007.
- Pluggable batching strategies — Extracted batch building into a `BatchStrategy` behaviour with three built-in strategies: `DecodeMaximal` (default, generation-latency optimized), `PrefillPriority` (throughput optimized), and `Balanced` (fair split). Custom strategies can implement the behaviour. See ADR 008.
- Pre-tokenized API — `Server.generate_tokens/3`, `Server.stream_tokens/3`, and `Server.get_model/1` allow callers to tokenize outside the GenServer, reducing mailbox contention under concurrent load.
- HuggingFace Hub integration — New `LlamaCppEx.Hub` module with `search/2` (find GGUF models), `list_gguf_files/2` (with file sizes via the tree API), `download/3` (with local caching, ETag support, and offline mode via `LLAMA_OFFLINE=1`), and `get_model_info/2`. Authentication via the `HF_TOKEN` or `HUGGING_FACE_HUB_TOKEN` env vars. New `LlamaCppEx.load_model_from_hub/3` convenience wrapper. Requires the optional `:req` dependency.
- Performance guide — New `docs/performance.md` with server tuning, prefix caching patterns, a strategy selection guide, and optimization recipes.
- Benchee benchmarks — New `bench/prefix_cache.exs`, `bench/strategies.exs`, and `bench/tokenize_overhead.exs` for measuring prefix cache impact, strategy comparison, and tokenization overhead.
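As a rough sketch of how the pre-tokenized API and prefix caching might combine in a multi-turn loop — the model filename, prompt text, and the exact shape returned by `get_model/1` and `generate_tokens/3` are assumptions for illustration, not verbatim API:

```elixir
# Illustrative sketch — file name, prompt, and return shapes are assumptions.
{:ok, server} = LlamaCppEx.Server.start_link(model_path: "qwen3.5-7b-q4_k_m.gguf")

# Tokenize outside the GenServer to avoid mailbox contention under load.
model = LlamaCppEx.Server.get_model(server)
{:ok, tokens} = LlamaCppEx.Tokenizer.encode(model, "You are helpful.\nUser: hi\n")

# Opt in to same-slot prefix caching so the follow-up turn skips
# prefill for the shared conversation prefix.
{:ok, reply} = LlamaCppEx.Server.generate_tokens(server, tokens, cache_prompt: true)
```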
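A hedged sketch of the Hub workflow — the repo id and GGUF filename below are made up, and the exact return values are assumptions:

```elixir
# Requires the optional :req dependency; repo id and filename are illustrative.
{:ok, files} = LlamaCppEx.Hub.list_gguf_files("Qwen/Qwen3.5-7B-GGUF")
{:ok, path} = LlamaCppEx.Hub.download("Qwen/Qwen3.5-7B-GGUF", "qwen3.5-7b-q4_k_m.gguf")

# Or search, download, and load in one step via the convenience wrapper:
{:ok, model} =
  LlamaCppEx.load_model_from_hub("Qwen/Qwen3.5-7B-GGUF", "qwen3.5-7b-q4_k_m.gguf")
```

Set `LLAMA_OFFLINE=1` to force the local cache, and `HF_TOKEN` for gated repos.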
### Changed

- Graceful `batch_eval` error handling — The server now fails active slots with error replies instead of crashing the GenServer when `batch_eval` returns an error (e.g., KV cache overflow).

### Fixed

- CI warning suppression — Suppress `-Wunused-function` warnings from vendored llama.cpp jinja headers (`runtime.h`, `lexer.h`).
## v0.6.14

### Changed

- llama.cpp submodule — Updated from 50e0ad08f to b8635075f (7 commits).
- common: add Gemma 4 specialized parser (#21418), respect specified tag fallback when tag is empty (#21413)
- llama-model: read `final_logit_softcapping` for Gemma 4 (#21390)
- llama: add custom newline split for Gemma 4 (#21406)
- server: fix undefined timing measurement errors in server context (#21201)
- ggml-webgpu: move from parameter buffer pool to single buffer with offsets (#21278)
- ci: add Windows Vulkan backend testing on Intel (#21292)
## v0.6.13

### Changed

- llama.cpp submodule — Updated from 95a6ebabb to 50e0ad08f (32 commits).
- server: save and clear idle slots on new task (`--clear-idle`) (#20993)
- common/parser: fix call ID detection (Mistral parser mostly) + atomicity for tag-json parsers (#21230)
- common: fix tool call type detection for nullable and enum schemas (#21327), add commentary rules for gpt-oss-20b (#21286)
- chat: avoid including json in chat.h (#21306), add Granite 4.0 chat template (#20804), Gemma4 tool response support
- jinja: coerce input for string-specific filters (#21370)
- vocab: fix Gemma4 tokenizer (#21343)
- ggml: bump to 0.9.11 (ggml/1456)
- ggml-webgpu: add vectorized flash attention (#20709)
- ggml-zendnn: add MUL_MAT_ID op support for MoE models (#21315)
- rpc: reuse compute graph buffers (#21299)
- kv-cache: do not quantize SWA KV cache (#21277)
- SYCL: fix llama_kv_cache hang when kv_cache is huge: 5GB (#21283)
- hexagon: add cumsum op support (#21246)
- model/mtmd: fix gguf conversion for audio/vision mmproj (#21309)
- tests: add unit test coverage for llama_tensor_get_type (#20112), allow exporting graph ops from HF file without downloading weights (#21182)
- fix: remove stale assert (#21369), fix gemma 4 template (#21326)
## v0.6.12

### Changed
- llama.cpp submodule — Updated from 08f21453a to 95a6ebabb (37 commits).
- CUDA: add FA support for head dim 512 (#20998), fix FA kernel selection logic (#21271), add generic NVFP4 MMQ kernel (#21074), fix kernel selection for mmvq mmid kernel (#21238)
- opencl: fix leak in Adreno q8_0 path (#21212)
- ggml: bump to 0.9.10 (ggml/1454), fix RWKV ops thread assignment (#21226)
- ggml-cpu: fix fallback for RVV kernels without zvfh (#21157)
- ggml-webgpu: quantized buffers to u32 + wider browser/device support (#21046), port AOT operators to JIT (#20728)
- kleidiai: add CPU feature detection to CI run script (#20394)
- hexagon: improve RMS_NORM and DIV accuracy (#21251)
- SYCL: support nvfp4 in mul_mat (#21227), enhance fattn perf (#21185)
- CANN: fix multi-thread set_tensor race conditions (#20151)
- memory: respect unified KV cache in hybrid memory for eval tasks (#21224)
- llama: rotate activations for better quantization (#21038), refactor llama_model_quantize_params to pure C interface (#20346)
- common: gpt-oss handle builtin/unsolicited tool calls (#21213), cleanup logs and modernize progress bar (#21215), disable backend sampling if reasoning budget enabled (#21209), add bounds check to prevent segfault on failed model load (#21082), move up common_init() and fix Windows UTF-8 logs (#21176)
- server: bypass API key validation for WebUI static assets (#21269), no more gzip compression for webui (#21073), cleanup dual representation to openai-compat (#21090)
- fix: tool call parsing for LFM2/LFM2.5 (#21242), correct misspellings (#21217), use lower-case proxy headers (#21235), include API key in CORS proxy for MCP (#21193)
- vendor: update BoringSSL to 0.20260327.0 (#21211)
## v0.6.11

### Changed
- llama.cpp submodule — Updated from 82b703f8b to 08f21453a (21 commits).
- opencl: add q4_K gemm and gemv kernels for Adreno (#20919)
- CUDA: fix CUB's argsort when nrows % block_size == 0 (#21181), optimize MOE GEMV kernel for BS > 1 (#20905)
- jinja: handle empty expressions correctly (#20913)
- common/parser: fix handling of tool definition with missing properties key (#21128), add reasoning_format = none support to gpt-oss (#21094)
- common/json-schema: fix non-capturing groups in pattern converter (#21124)
- common: add character class support to glob_match (#21111)
- server: wrap headers for mcp proxy (#21072), fix processing of multiple back-to-back mtmd chunks (#21107)
- model: add missing ROPE_FACTORS_LONG/SHORT for MiniCPM (#21150)
- llama-model-loader: print warning when using overrides with mmap (#20978)
- hexagon: dma optimizations (#21137)
- SYCL: enhance build script to use half cores to avoid OS hang (#21093)
- rpc: fix misleading error log (#21184)
## v0.6.10

### Changed
- llama.cpp submodule — Updated from 5c1a7b835 to 82b703f8b (7 commits).
- vendor: update cpp-httplib to 0.40.0 (#21100)
- vulkan: add noncontiguous GLU support (#21081)
- common/parser: fix reasoning whitespace bugs + extra parser tests (#21085)
- cli: add /glob command (#21084)
- webui: conversation forking + branching improvements (#21021)
- docker: fix and enable ARM64 image build (#20929)
## v0.6.9

### Changed
- llama.cpp submodule — Updated from 9f102a140 to 1743d9805 (38 commits).
- model: F2LLM-v2 support, allow causal_attn and pooling_type on all architectures (#20973)
- convert: register Qwen3Model architecture (#20967), support Qwen3.5/Qwen3.5 Moe NVFP4 and add input scales (#20505), add RuGPT3XL support (#21011)
- ggml-cuda: add NVFP4 dp4a kernel (#20644), support F32 kernel type for CONV_TRANSPOSE_2D (#17094)
- hip: use fnuz fp8 for conversion on CDNA3 (#21040)
- opencl: allow large buffer for Adreno (#20997)
- jinja: fix macro with kwargs (#20960)
- common: make LLAMA_CACHE the one cache for everything (#21009), fix split model migration (#21019), fix verbosity setup (#20989), add getpwuid fallback for HF cache (#21035), filter out imatrix when finding models (#21023)
- llama: fix llama-model-saver (#20503)
- mtmd: add DeepSeekOCR support (#17400), refactor image preprocessing (#21031), fix quant and im2col ops on Metal for deepseek-ocr (#21027)
- imatrix: fix crash with --show-statistics and zero counts (#19532)
## v0.6.8

### Changed
- llama.cpp submodule — Updated from 1772701f9 to 9f102a140 (15 commits).
- models: move the token embedding norms to the first layer (#20943)
- ggml-backend: re-enable graph reuse with pipeline parallelism (#20927)
- metal: add FLOOR, CEIL, ROUND, TRUNC unary ops (#20930), add FA instantiations for HSK=512, HSV=512 (#20902)
- common: add standard Hugging Face cache support (#20775), add a WARNING for HF cache migration (#20935), fix get_gguf_split_info (#20946), replace wrap_for_generation with a prefix convenience function (#20912)
- hexagon: general DMA and Binary Op fixes for large strides (#20918)
- llama-fit: fix regex pattern for gate_up tensors (#20910)
- vendor: update cpp-httplib to 0.39.0 (#20933)
## v0.6.7

### Changed
- llama.cpp submodule — Updated from eac9c6ea8 to 1772701f9 (30 commits).
- rpc: RCE patch (#20908), prevent division by zero in deserialize_tensor (#20712)
- memory: fix seq_id bounds in llama_memory_recurrent::state_read_meta() (#20887)
- server: use httplib dynamic threads (#20817), allow router to report child instances sleep status (#20849), fix Host header (#20843)
- metal: add CONV_3D (#19927)
- common/autoparser: detect reasoning markers when enable_thinking changes system prompt (#20859)
- common/grammar: fix grammar parsing issues to prevent stack overflow and hangs (#18604)
- context: use n_embd_out for pooled embedding extraction (#20840)
- jinja: refactor token advancement (#20864)
- CUDA: fix BF16 FA compilation (#20865), native bf16 flash attention for vec kernel (#20525), increase output elements per-thread block for small K-dimension (#20635)
- CANN: add RoPE cache preload before ACL graph capture (#20747)
- opencl: add q6_K gemm and gemv kernels for Adreno (#20089), add flattened Q4_K mv and general Q4_K mm (#20773)
- openvino: explicit memset in buffer_context allocation (#20857)
- mtmd: add dynamic high-resolution image preprocessing for InternVL model (#20847), fix LightOnOCR image preprocessing (#20877)
- ggml: support bf16 and quantized type (#20803)
- webui: improve chat form positioning (#20901), fix --webui-config-file settings not applied on load (#20823)
## v0.6.6

### Changed
- llama.cpp submodule — Updated from 6729d4920 to eac9c6ea8 (47 commits).
- context: zero output buffer on allocation (#20781)
- model: assert nextn_predict_layers to prevent underflow (#20783), fix Granite Hybrid type check for 7B.A1B (#20795)
- jinja: fix heap OOB read in value equality comparison (#20782)
- common/parser: fix nasty bug causing subtle corruption of generation prompt (#20825), fix out_of_range crash in throw path (#20777), add proper reasoning tag prefill reading (#20424), fix gpt-oss content removal (#20745)
- chat: handle tool calls with no required args in TAG_WITH_TAGGED format (#20764)
- server: fix router mode deadlock on child crash and TOCTOU race (#20763), add cached_tokens info to oaicompat responses (#19361), improve mtmd ctx checkpoints (#20726), become source of truth for sampling defaults (#20558)
- vulkan: change gated_delta_net to shard across subgroup (#20662), dequantize iq4_xs 4 at a time (#20657)
- hip: avoid compiler bug in RDNA code generation during debug builds on Windows (#20655)
- hexagon: add Matrix Extensions (HMX) for NPU backend (#20693)
- CANN: add BF16 support for core operators (#20152), handle in-place ROPE on non-contiguous f32 tensors (#20274), support flash attention for head dim not multiple of 16 (#20031)
- ggml-cpu: add always_inline to tinyBLAS_PPC accumulator saves (#20791)
- ggml-webgpu: ops support for qwen3.5 (SET, TRI_SOLVE, SSM_CONV, GATED_DELTA_NET) (#20687), add DIAG/TRI ops (#20664), update RMS_NORM/L2_NORM (#20665)
- vocab: assert array size of scores and toktypes (#20737)
- convert: support is_causal hyperparameter (#20746), make NVFP4/MXFP4 say correct type (#20730)
- cmake: fix build warning when kleidiai is enabled (#20457), guard KleidiAI DOWNLOAD_EXTRACT_TIMESTAMP for cmake < 3.24 (#20767)
## v0.6.5

### Changed
- llama.cpp submodule — Updated from b6c83aad5 to 6729d4920 (26 commits).
- model: add control vector support where missing (#20653)
- ggml: bump version to 0.9.8 (ggml/1442), restore ggml_type_sizef() to avoid major version bump (ggml/1441)
- ggml-cpu: fix RVV checks in quants and repacking (#20682), fix unused changemask warning in repack (#20692)
- ggml-blas: set MKL threads from thread context (#20602)
- Vulkan: async and event fixes (#20518), disable MMVQ on Intel Windows driver (#20672), allow graphics queue only through env var (#20599)
- HIP: ignore return of hipMemAdvise (#20696)
- hexagon: add neg, exp, sigmoid, softplus, cont, repeat ops (#20701)
- kleidiai: fix MUL_MAT support for batched (3D) inputs (#20620)
- server: fix ctx checkpoint invalidation (#20671)
- context: fix graph not resetting when control vector changes (#20381)
- llama: re-enable manual LoRA adapter free (#19983)
- common: rework gpt-oss parser (#20393), add `--skip-chat-parsing` to force pure content parser (#20289)
- webui: fix duplicated messages on q param (#20715), improve tooltip wording for attachment requirements (#20688)
- OpenCL: no timeout for WaitAny in graph submission to avoid deadlocks on llvm-pipe backends (#20618)
## v0.6.4

### Changed
- llama.cpp submodule — Updated from 463b6a963 to b6c83aad5 (56 commits).
- model: Mistral Small 4 support (#20649), Nemotron-H NVFP4 tensors (#20561), Qwen3.5/Qwen3.5MoE NVFP4 tensors (#20506)
- ggml: OpenVINO backend (#15307), native AVX512-FP16 support for F16 operations (#20529), extend im2col f16 (#1434), guard against sumq2 being 0 in IQ4_NL (#20460)
- CUDA: GDN shared mem latency hiding (#20537), limit FA stream-k block count (#20586), RDNA4-specific MMVQ for bs=1 decode (#19478), FP32 cuBLAS for V100 to avoid overflows (#19959), fix data race in cpy kernel (#20507), avoid creating CUDA context during device init (#20595)
- metal: FA specialization for HSK=320, HSV=256 (#20549)
- Vulkan: fix flash attention dot product precision (#20589), use graphics queue on AMD (#20551)
- HIP: APU compatibility — soft error handling for hipMemAdviseSetCoarseGrain (#20536)
- SYCL: fix untransposed GDA recurrent state (#20583), enhance UPSCALE to support all UT cases (#20637)
- OpenCL: fix l2_norm (#20480)
- server: support refusal content for Responses API (#20285), fix wait in test_cancel_requests() (#20601), fix model selector locked to first loaded model (#20580)
- tools/cli: fix disable reasoning (#20606)
- convert: support mixed-precision ModelOpt NVFP4/FP8 quantization (#20539), support contiguous method on lora tensors (#20489)
- kv-cache: fix reading llama_kv_cell_ext during state read (#20273)
- common: fix iterator::end() dereference (#20445)
- vendor: cpp-httplib 0.37.2 → 0.38.0 (#20484, #20578)
- webui: model information dialog (#20600), MCP CORS proxy detection (#20167), code preview iframe isolation (#20477)
- hexagon: Q4_0 and MXFP4 repack fixes (#20527)
## v0.6.3

### Added

- CI workflow — New `.github/workflows/ci.yml` runs `mix compile --warnings-as-errors`, `mix format --check-formatted`, `mix test`, and `mix dialyzer` on push/PR to master.
- Dialyzer — Added the `dialyxir` dependency for static analysis. All modules pass with zero warnings.
- Example scripts — New `examples/` directory with 6 runnable scripts: `basic_generation.exs`, `streaming.exs`, `chat.exs`, `structured_output.exs`, `embeddings.exs`, and `server.exs`.
- Expanded test coverage — New `test/schema_test.exs` covering `embeds_one`, `embeds_many`, additional Ecto types (`:date`, `:utc_datetime`, `:decimal`, `:map`), empty schemas, and end-to-end nested-schema-to-GBNF conversion. Added edge-case tests to `test/thinking_test.exs` for unicode content, nested/malformed tags, and very long content.

### Fixed

- `Chat.apply_template/3` — Now accepts string-keyed message maps (`%{"role" => ..., "content" => ...}`) in addition to atom-keyed maps and tuples.
- `Schema.to_json_schema/1` — Fixed a Dialyzer opaque type warning (replaced `MapSet.member?/2` with the `in` operator).
- GitHub Actions Node.js 20 deprecation — Updated `actions/checkout` to v5 and added the `FORCE_JAVASCRIPT_ACTIONS_TO_NODE24` env var to the precompile workflow, preparing for the June 2026 Node.js 24 migration.
- Stream test reliability — Fixed the `stream with early halt` test to use a prompt compatible with instruction-tuned models.
### Changed
- llama.cpp submodule — Updated from fdb17643d to 463b6a963 (31 commits).
- tools: enable kvu in perplexity for hellaswag, winogrande, multiple-choice (#19954)
- graph: remove redundant GDN state transposes (#20443)
- llama: fix pooling assertion crash in chunked GDN detection path (#20468), disable graph reuse with pipeline parallelism (#20463)
- metal: fix l2 norm scale (#20493), avoid divisions in bin kernel (#20426)
- Vulkan: add GATED_DELTA_NET op support (#20334), fix l2_norm epsilon handling (#20350), fix OOB check in flash_attn_mask_opt (#20296), fix ErrorOutOfHostMemory on Intel GPU with --no-mmap (#20059)
- OpenCL: add cumsum op (#18981), use larger workgroup size for get_rows (#20316)
- HIP: compile debug builds with -O2 to avoid compiler bug (#20392)
- ggml-cpu: add RVV vec dot kernels for quantization types (#18859)
- server: reset counter related to kill-switch on client error (#20513), auto-select first loaded model for new conversations (#20403)
- common/parser: gracefully handle undetected tool parser (#20286), add GigaChatV3/3.1 models support (#19931)
- grammar: fix root symbol check (#19761)
- vendor: update cpp-httplib to 0.37.1 (#20390)
- convert: better mtp check and fix return (#20419)
## v0.6.1

### Changed
- llama.cpp submodule — Updated from c5a778891 to fdb17643d (70 commits).
- model: add support for Phi4ForCausalLMV, Nemotron 3 Super, Qwen3VL reranker text
- ggml: add NVFP4 quantization type support
- llama: chunked fused GDN path, dynamic head_dim and n_rot for SWA
- metal: extend mul_mv_ext to BF16/Q2_K/Q3_K, fix q5_k register spill, add upscale, handle command buffer failures gracefully
- CUDA/HIP: GDN shared mem for HIP, fix loop unrolling in ssm-conv, display VRAM capacity on init
- Vulkan: add SGN and ELU ops, fix data races in coopmat1, skip zero size tensors in copies
- SYCL: Flash Attention support for fp32/fp16/Q4/Q5/Q8
- WebGPU: add REPEAT op, faster quant matrix operations
- KleidiAI: concurrent SME and NEON kernel execution
- ggml-cpu: add RVV repack GEMM/GEMV for quantization types
- server: kill switch when stuck, fix checkpoints and OAI completion stream index
- common: fix --n-cpu-moe/--cpu-moe for fused gate+up models, gracefully handle incomplete output
- vendor: update cpp-httplib to 0.37.0, miniaudio to 0.11.25
- llama-quant: fail early on missing imatrix, refactor type selection
## v0.6.0

### Added

- Qwen 3.5 support — llama.cpp updated to c5a778891 (35 commits since v0.5.0).
- `reasoning_content` in ChatCompletion — `chat_completion/3` now splits `<think>...</think>` blocks from the response when `enable_thinking: true`. The choice message includes `reasoning_content` (the thinking text) and `content` (the final answer). Returns `nil` when thinking is not enabled or no thinking block is present.
- `reasoning_content` in ChatCompletionChunk — `stream_chat_completion/3` emits chunks with `reasoning_content` in the delta while the model is thinking, then switches to `content` after `</think>`.
- `LlamaCppEx.Thinking` — New module with `parse/1` for one-shot parsing, plus `stream_parser/1` and `feed/2` for streaming, token-boundary-safe parsing of think blocks. Handles the real-world Qwen3/3.5 template behavior where `<think>` is opened by the template itself.
### Changed
- llama.cpp submodule — Updated from 7f5ee54 to c5a778891.
- ggml: add GATED_DELTA_NET op for Qwen 3.5 hybrid architecture
- model: update Qwen 3.5 model type detection
- convert: register Qwen 3.5 ForCausalLM for text only
- CUDA: use shared mem for ssm_conv, improve performance via fewer synchronizations
- Hexagon: add f32 ssm_conv, fp16 binary ops, Flash Attention optimizations
- OpenCL: add l2_norm, neg, exp, diag ops
- CPU: skip redundant ROPE cache updates, fix data race for debug asserts
- quants: add memsets and other fixes for IQ quants
- kv-cache: fix M-RoPE checkpoints, checkpoint every n tokens
- server: preserve Anthropic thinking blocks in conversion
### Unchanged

- `chat/3` and `stream_chat/3` continue to return raw text (no breaking change).
## v0.5.0

### Added

- Structured output via JSON Schema — New `:json_schema` option on `generate/3`, `stream/3`, `chat/3`, `stream_chat/3`, `chat_completion/3`, and `stream_chat_completion/3`. Pass a JSON Schema map and the model output is automatically constrained to valid JSON matching the schema. Uses llama.cpp's built-in `json_schema_to_grammar()` under the hood.

  ```elixir
  schema = %{
    "type" => "object",
    "properties" => %{"name" => %{"type" => "string"}, "age" => %{"type" => "integer"}},
    "required" => ["name", "age"],
    "additionalProperties" => false
  }

  {:ok, json} = LlamaCppEx.chat(model, messages, json_schema: schema, temp: 0.0)
  ```

- `LlamaCppEx.Grammar` — New module for JSON Schema to GBNF conversion.
  - `from_json_schema/1` — returns `{:ok, gbnf_string}` or `{:error, reason}`
  - `from_json_schema!/1` — returns the GBNF string or raises
- `LlamaCppEx.Schema` — New module for converting Ecto schema modules to JSON Schema maps. Maps all standard Ecto types (`:string`, `:integer`, `:float`, `:boolean`, `:date`, `{:array, inner}`, etc.) and supports nested `embeds_one`/`embeds_many`. Automatically excludes `:id` and timestamp fields.
- NIF: `json_schema_to_grammar_nif/1` — Exposes llama.cpp's `json_schema_to_grammar()` via `nlohmann::ordered_json`.
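Combining the two new modules, an Ecto embedded schema can drive constrained generation end to end. The `Person` module below is hypothetical, and the intermediate inspection of the GBNF is optional:

```elixir
# Hypothetical embedded schema; field names are illustrative.
defmodule Person do
  use Ecto.Schema

  embedded_schema do
    field :name, :string
    field :age, :integer
  end
end

# Convert the Ecto schema to a JSON Schema map (:id is excluded automatically).
json_schema = LlamaCppEx.Schema.to_json_schema(Person)

# Optionally inspect the GBNF grammar the schema compiles to...
{:ok, _gbnf} = LlamaCppEx.Grammar.from_json_schema(json_schema)

# ...or pass the schema straight to generation for constrained JSON output.
{:ok, json} = LlamaCppEx.chat(model, messages, json_schema: json_schema, temp: 0.0)
```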
### Changed

- Elixir requirement bumped to `~> 1.18` (for the built-in `JSON.encode!/1`).
- Dependencies — added `{:ecto, "~> 3.0", optional: true}` for optional Ecto schema integration.
## v0.4.4

### Changed

- llama.cpp submodule — Updated to latest upstream (b8198).
- ggml: fix `ggml_is_contiguous_n` for ne == 1
- ggml: use a simple `std::thread` in AMX without OpenMP
- KleidiAI: add SME fp16 compute path for q4_0 GEMM on aarch64
- OpenCL: add optimized q4_1 mm kernel for Adreno
- Vulkan: tune MMVQ for Intel Windows
- WebGPU: fix workgroup dispatch limit for large batch sizes
- Fix locale-dependent float printing in GGUF metadata
## v0.4.3

### Changed
- llama.cpp submodule — Updated to latest upstream (b8185).
- Vulkan: improve partial offloading performance on AMD
- CUDA: cap grid.y at 65535 in non-contiguous dequantize/convert kernels
- ggml-cpu: optimise s390x multiply extend instructions
- Vendors: update cpp-httplib to 0.35.0, miniaudio to 0.11.24
## v0.4.2

### Changed
- llama.cpp submodule — Updated to latest upstream (b8179).
## v0.4.1

### Improved

- Error handling — `Chat.apply_template/3`, `Tokenizer.encode/3`, and `Tokenizer.decode/2` now return `{:error, reason}` instead of crashing when NIFs raise.
- Telemetry documentation — The Server moduledoc documents all telemetry events, measurements, and metadata.
- Typespecs — Added `@spec` to `Server.start_link/1`.
### Changed
- llama.cpp submodule — Updated to latest upstream (b8157).
## v0.4.0

### Added

- Full model loading params — `main_gpu`, `split_mode`, and `tensor_split` for multi-GPU placement; `use_mlock` and `use_direct_io` for memory control; `vocab_only` for cheap model introspection without loading weights.
- Server GPU forwarding — `Server.start_link/1` now forwards `main_gpu`, `split_mode`, `tensor_split`, `use_mlock`, and `use_direct_io` to `Model.load/2`.
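A hedged sketch of the multi-GPU loading options. The file name and values are illustrative, and the accepted atoms for `split_mode` are an assumption here:

```elixir
# Values are illustrative; split_mode's accepted atoms are an assumption.
{:ok, model} =
  LlamaCppEx.Model.load("model.gguf",
    main_gpu: 0,
    split_mode: :layer,
    tensor_split: [0.6, 0.4],
    use_mlock: true
  )

# Cheap introspection: read vocab and metadata without loading weights.
{:ok, info} = LlamaCppEx.Model.load("model.gguf", vocab_only: true)
```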
## v0.3.0

### Added

- Jinja chat templates — Switched from the `llama_chat_apply_template()` C API to the full Jinja-based `common_chat_templates_apply()` engine from llama.cpp's common library.
- `enable_thinking` option — Pass `enable_thinking: false` to `Chat.apply_template/3`, `chat/3`, `stream_chat/3`, `chat_completion/3`, and `stream_chat_completion/3` to disable CoT reasoning for models like Qwen3/3.5.
- `chat_template_kwargs` option — Pass arbitrary key-value pairs to the Jinja template engine.
- Penalty parameters — `penalty_repeat`, `penalty_freq`, and `penalty_present` options for repetition/frequency/presence penalties in sampling.
- OpenAI-compatible response format — `chat_completion/3` and `stream_chat_completion/3` return `ChatCompletion` and `ChatCompletionChunk` structs.
- Qwen3.5 benchmark results in README — Qwen3.5-27B and Qwen3.5-35B-A3B on Apple M4 Max.
Changed
Chat.apply_template/3now uses the Jinja engine and takes the model ref directly (no longer accepts:templateoption for raw template strings).- Linked
libcommon.afrom llama.cpp build (previously excluded). LlamaModelRAII wrapper now cachescommon_chat_templatesat model load time.
## v0.2.0

### Added

- Continuous batching server (`LlamaCppEx.Server`) — GenServer with a slot pool for concurrent multi-sequence inference. One forward pass per tick, with decode tokens and prefill chunks mixed in a single batch.
- Embeddings (`LlamaCppEx.Embedding`) — `embed/3` and `embed_batch/3` with L2 normalization and a configurable pooling type.
- Grammar-constrained generation — GBNF grammar support via the `grammar` and `grammar_root` options in `Sampler.create/2` and `generate/3`.
- Batched inference primitives — `prefill/3`, `decode_batch/3`, `decode_token/4`, `batch_eval/2`, and `sampler_sample_at/3` NIFs for building custom inference loops.
- Streaming via Server — `LlamaCppEx.Server.stream/3` for token-by-token streaming through the batched server.
- Telemetry events — `[:llama_cpp_ex, :server, :tick]` and `[:llama_cpp_ex, :server, :request, :done]` for observability.
- Benchmark suite (`bench/`) — Benchee-based benchmarks for single-sequence and server generation, plus a custom continuous batching harness measuring throughput scaling.
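A minimal streaming sketch against the batched server. The `model_path`, `n_slots`, and `max_tokens` option names are assumptions, as is `stream/3` returning an enumerable of token strings:

```elixir
# Option names and the enumerable return shape are assumptions.
{:ok, server} = LlamaCppEx.Server.start_link(model_path: "model.gguf", n_slots: 4)

# Token-by-token streaming through the batched server; concurrent
# callers share the slot pool and are batched into one forward pass per tick.
LlamaCppEx.Server.stream(server, "Once upon a time", max_tokens: 64)
|> Enum.each(&IO.write/1)
```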
### Changed

- `Sampler.create/2` now requires the model as the first argument: `Sampler.create(model, opts)` (was `Sampler.create/1`).
- `Context.create/2` accepts new options: `:embeddings`, `:pooling_type`, and `:n_seq_max`.
## v0.1.0

Initial release.

- Model loading and introspection
- Text generation with configurable sampling
- Streaming token generation via `Stream.resource/3`
- Chat template support
- Tokenization and detokenization
- Metal, CUDA, Vulkan, and CPU backends
- RAII resource management via `fine`