Tinkex.Tokenizer (Tinkex v0.2.0)


Tokenization entrypoint for the Tinkex SDK.

This module wraps the HuggingFace tokenizers NIF, resolves tokenizer IDs (TrainingClient metadata plus the Llama-3 workaround), and coordinates caching via ETS handles. Tokenizers are keyed by the resolved tokenizer ID and reused across calls to avoid repeated downloads. Chat templating is out of scope for v1.0; callers must provide fully formatted prompts before encoding.

Summary

Types

Identifier for a tokenizer (e.g., HuggingFace repo name).

Functions

Decode token IDs back to text using a cached tokenizer.

Encode text into token IDs using a cached tokenizer.

Convenience alias for encode/3.

Get a tokenizer handle from cache or load and cache it using the resolved ID.

Resolve the tokenizer ID for the given model.

Types

tokenizer_id()

@type tokenizer_id() :: String.t()

Identifier for a tokenizer (e.g., HuggingFace repo name).

Functions

decode(ids, model_name, opts \\ [])

@spec decode([integer()], tokenizer_id() | String.t(), keyword()) ::
  {:ok, String.t()} | {:error, Tinkex.Error.t()}

Decode token IDs back to text using a cached tokenizer.

Mirrors encode/3 with the same caching and error contract.
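A round-trip sketch in the style of the encode/3 example below, assuming the "gpt2" tokenizer can be loaded (first use downloads and caches it):

```elixir
iex> {:ok, ids} = Tinkex.Tokenizer.encode("Hello", "gpt2")
iex> {:ok, text} = Tinkex.Tokenizer.decode(ids, "gpt2")
iex> is_binary(text)
true
```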

encode(text, model_name, opts \\ [])

@spec encode(String.t(), tokenizer_id() | String.t(), keyword()) ::
  {:ok, [integer()]} | {:error, Tinkex.Error.t()}

Encode text into token IDs using a cached tokenizer.

Loads (or reuses) the tokenizer keyed by the resolved tokenizer ID and returns {:ok, ids} on success or an {:error, Tinkex.Error.t()} tuple on failure. Does not apply chat templates; pass the already formatted string you want to tokenize.

Examples

iex> {:ok, ids} = Tinkex.Tokenizer.encode("Hello", "gpt2")
iex> Enum.all?(ids, &is_integer/1)
true

encode_text(text, model_name, opts \\ [])

@spec encode_text(String.t(), tokenizer_id() | String.t(), keyword()) ::
  {:ok, [integer()]} | {:error, Tinkex.Error.t()}

Convenience alias for encode/3.

Accepts the same options and returns the same tuple contract. Useful for user-facing API symmetry with Tinkex.Types.ModelInput.from_text/2.

get_or_load_tokenizer(tokenizer_id, opts \\ [])

@spec get_or_load_tokenizer(
  tokenizer_id(),
  keyword()
) :: {:ok, Tokenizers.Tokenizer.t()} | {:error, Tinkex.Error.t()}

Get a tokenizer handle from cache or load and cache it using the resolved ID.

The ETS table :tinkex_tokenizers is created on demand if the application has not already started it.
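The get-or-load pattern described above can be sketched in plain Elixir. The table name :tinkex_tokenizers comes from this doc; the `load_fun` callback is a hypothetical stand-in for the real NIF-backed tokenizer download, used here so the sketch is self-contained:

```elixir
defmodule TokenizerCacheSketch do
  @table :tinkex_tokenizers

  # Look up a cached handle; on a miss, load it and cache the result.
  # `load_fun` stands in for the real tokenizer download/NIF load.
  def get_or_load(tokenizer_id, load_fun) do
    ensure_table()

    case :ets.lookup(@table, tokenizer_id) do
      [{^tokenizer_id, handle}] ->
        {:ok, handle}

      [] ->
        case load_fun.(tokenizer_id) do
          {:ok, handle} ->
            :ets.insert(@table, {tokenizer_id, handle})
            {:ok, handle}

          {:error, _reason} = error ->
            error
        end
    end
  end

  # Create the named table on demand if the application has not started it.
  defp ensure_table do
    case :ets.whereis(@table) do
      :undefined -> :ets.new(@table, [:named_table, :public, :set])
      _ref -> @table
    end
  end
end
```

A second call with the same ID returns the cached handle without invoking `load_fun` again.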

get_tokenizer_id(model_name, training_client \\ nil, opts \\ [])

@spec get_tokenizer_id(String.t(), Tinkex.TrainingClient.t() | nil, keyword()) ::
  tokenizer_id()

Resolve the tokenizer ID for the given model.

  • If a training_client is provided, attempts to fetch model_data.tokenizer_id via the provided :info_fun (defaults to &TrainingClient.get_info/1).
  • Applies the Llama-3 gating workaround ("thinkingmachineslabinc/meta-llama-3-tokenizer").
  • Falls back to the provided model_name.
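The resolution order above can be sketched as a pure function. The default `info_fun` stub, the shape of the success tuple, and the "llama-3" name heuristic are assumptions for illustration (the real default is &TrainingClient.get_info/1); only the workaround tokenizer ID is taken from this doc:

```elixir
defmodule TokenizerIdSketch do
  @llama3_workaround "thinkingmachineslabinc/meta-llama-3-tokenizer"

  # Resolve in order: training-client metadata, Llama-3 workaround,
  # then fall back to the model name itself.
  def get_tokenizer_id(model_name, training_client \\ nil, opts \\ []) do
    info_fun = Keyword.get(opts, :info_fun, fn _client -> {:error, :no_info} end)

    case tokenizer_id_from_client(training_client, info_fun) do
      nil -> if llama3?(model_name), do: @llama3_workaround, else: model_name
      id -> id
    end
  end

  defp tokenizer_id_from_client(nil, _info_fun), do: nil

  defp tokenizer_id_from_client(client, info_fun) do
    case info_fun.(client) do
      {:ok, %{model_data: %{tokenizer_id: id}}} when is_binary(id) -> id
      _other -> nil
    end
  end

  # Assumed heuristic: treat any model name mentioning "llama-3" as gated.
  defp llama3?(model_name),
    do: String.contains?(String.downcase(model_name), "llama-3")
end
```

With no training client, a Llama-3 model name maps to the workaround repo and anything else falls through unchanged.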