LlamaCppEx.Server.BatchStrategy behaviour (LlamaCppEx v0.7.0)


Behaviour for batch-building strategies.

On each tick, a strategy decides how to split the token budget between decode tokens (generation) and prefill chunks (prompt processing).

Built-in Strategies

Custom Strategies

Implement the build_batch/4 callback:

defmodule MyStrategy do
  @behaviour LlamaCppEx.Server.BatchStrategy

  @impl true
  def build_batch(slots, _budget, _chunk_size, _opts) do
    # Placeholder: schedule nothing and leave the slots unchanged.
    # A real strategy returns {entries, updated_slots}.
    {[], slots}
  end
end

Summary

Callbacks

Build a batch of entries from the current slot state.

Types

entry()

@type entry() ::
  {token_id :: integer(), pos :: integer(), seq_id :: integer(),
   logits :: boolean()}
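
As an illustration of the entry() shape, the sketch below turns a run of prompt tokens into entry tuples for one sequence. The module EntryExample and the function prefill_entries/3 are hypothetical helpers, not part of LlamaCppEx. Requesting logits only for the last token of the run is an assumption about typical usage, not something this page states.

```elixir
defmodule EntryExample do
  # Hypothetical helper, not part of LlamaCppEx: builds entry() tuples
  # {token_id, pos, seq_id, logits} for consecutive tokens of one sequence.
  def prefill_entries(tokens, start_pos, seq_id) do
    last = length(tokens) - 1

    tokens
    |> Enum.with_index()
    |> Enum.map(fn {token_id, i} ->
      # Assumption: only the final token of the run asks for logits.
      {token_id, start_pos + i, seq_id, i == last}
    end)
  end
end

EntryExample.prefill_entries([101, 2023, 2003], 0, 7)
# => [{101, 0, 7, false}, {2023, 1, 7, false}, {2003, 2, 7, true}]
```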

Callbacks

build_batch(slots, budget, chunk_size, opts)

@callback build_batch(
  slots :: %{required(non_neg_integer()) => map()},
  budget :: pos_integer(),
  chunk_size :: pos_integer(),
  opts :: keyword()
) ::
  {entries :: [entry()],
   updated_slots :: %{required(non_neg_integer()) => map()}}

Build a batch of entries from the current slot state.

Returns {entries, updated_slots}, where entries is a list of {token_id, pos, seq_id, logits} tuples in forward order (the caller reverses it).

Parameters

  • slots - Map of seq_id to slot state maps.
  • budget - Maximum tokens allowed in this batch (n_batch).
  • chunk_size - Maximum prefill tokens per slot per tick.
  • opts - Additional context:
    • :queue_depth - Number of requests waiting for a slot.
    • :model_ref - Model reference for detokenization.
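
The parameters above can be tied together in a minimal decode-first sketch: each decoding slot gets one token, and whatever budget remains goes to prefill chunks of at most chunk_size tokens per slot. This is not one of the library's built-in strategies. The slot fields :phase, :token, :prompt, and :pos are hypothetical, since this page does not document the slot state maps, and the module name DecodeFirstSketch is invented for illustration. The @behaviour and @impl annotations are left out so the sketch runs without LlamaCppEx installed.

```elixir
defmodule DecodeFirstSketch do
  # In a real project, add `@behaviour LlamaCppEx.Server.BatchStrategy`
  # and `@impl true` on build_batch/4.
  #
  # Hypothetical slot fields (assumptions, not the library's real layout):
  #   :phase  - :decode or :prefill
  #   :token  - next token to feed for a decoding slot
  #   :prompt - prompt tokens not yet prefilled
  #   :pos    - next position in the sequence
  def build_batch(slots, budget, chunk_size, _opts) do
    ordered = Enum.sort_by(slots, fn {seq_id, _slot} -> seq_id end)

    # Pass 1: one decode token per decoding slot while budget remains.
    {decode, slots, budget} =
      Enum.reduce(ordered, {[], slots, budget}, fn
        {seq_id, %{phase: :decode} = slot}, {acc, slots, left} when left > 0 ->
          entry = {slot.token, slot.pos, seq_id, true}
          slot = %{slot | pos: slot.pos + 1}
          {[entry | acc], Map.put(slots, seq_id, slot), left - 1}

        _other, state ->
          state
      end)

    # Pass 2: spend the remaining budget on prefill chunks.
    {prefill, slots, _left} =
      Enum.reduce(ordered, {[], slots, budget}, fn
        {seq_id, %{phase: :prefill, prompt: [_ | _] = prompt} = slot},
        {acc, slots, left}
        when left > 0 ->
          {chunk, rest} = Enum.split(prompt, min(chunk_size, left))
          n = length(chunk)

          entries =
            for {token_id, i} <- Enum.with_index(chunk) do
              # Assumption: logits only for the very last prompt token.
              {token_id, slot.pos + i, seq_id, rest == [] and i == n - 1}
            end

          slot = %{slot | prompt: rest, pos: slot.pos + n}
          {Enum.reverse(entries) ++ acc, Map.put(slots, seq_id, slot), left - n}

        _other, state ->
          state
      end)

    # Both accumulators were built back to front; return forward order.
    {Enum.reverse(decode) ++ Enum.reverse(prefill), slots}
  end
end

slots = %{
  0 => %{phase: :decode, token: 42, pos: 10, prompt: []},
  1 => %{phase: :prefill, token: nil, pos: 0, prompt: [5, 6, 7, 8]}
}

{entries, _updated} = DecodeFirstSketch.build_batch(slots, 3, 8, queue_depth: 0)
# entries == [{42, 10, 0, true}, {5, 0, 1, false}, {6, 1, 1, false}]
```

With a budget of 3, the decode slot takes one token and the prefill slot receives only two of its four prompt tokens, even though chunk_size would allow eight. A strategy that favours prefill could reverse the two passes, or consult :queue_depth from opts to decide how much budget to reserve.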