Tinkex.Tokenizer (Tinkex v0.3.4)
Tokenization entrypoint for the Tinkex SDK.
This module wraps the HuggingFace tokenizers NIF, resolves tokenizer IDs
(TrainingClient metadata plus the Llama-3 workaround), and coordinates the caching strategy via
ETS handles. Tokenizers are keyed by the resolved tokenizer ID and reused
across calls to avoid repeated downloads. Chat templating is out of scope for
v1.0; callers must provide fully formatted prompts/strings before encoding.
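A minimal round-trip sketch, assuming the "gpt2" tokenizer can be resolved and loaded (the prompt string here is illustrative; chat prompts must already be fully formatted):

```elixir
# Encode a pre-formatted prompt; on first use the tokenizer is
# loaded and cached in ETS, keyed by its resolved tokenizer ID.
{:ok, ids} = Tinkex.Tokenizer.encode("Hello, world!", "gpt2")

# Decode the token IDs back to text; this reuses the cached handle
# rather than downloading the tokenizer again.
{:ok, text} = Tinkex.Tokenizer.decode(ids, "gpt2")
```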
Summary
Functions
Return the ETS table used for tokenizer caching.
Decode token IDs back to text using a cached tokenizer.
Encode text into token IDs using a cached tokenizer.
Convenience alias for encode/3.
Get a tokenizer handle from cache or load and cache it using the resolved ID.
Resolve the tokenizer ID for the given model.
Types
@type handle() :: Tokenizers.Tokenizer.t() | TiktokenEx.Encoding.t()
Loaded tokenizer handle.
@type tokenizer_id() :: String.t()
Identifier for a tokenizer (e.g., HuggingFace repo name).
Functions
Return the ETS table used for tokenizer caching.
Tests can override this via __supertester_set_table__/2.
@spec decode([integer()], tokenizer_id() | String.t(), keyword()) :: {:ok, String.t()} | {:error, Tinkex.Error.t()}
Decode token IDs back to text using a cached tokenizer.
Mirrors encode/3 with the same caching and error contract.
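A sketch of the shared error contract, assuming failures surface as a Tinkex.Error struct as the spec indicates:

```elixir
# decode/3 returns the same {:ok, _} | {:error, Tinkex.Error.t()}
# shape as encode/3, so both can be handled by one case expression.
case Tinkex.Tokenizer.decode([15496], "gpt2") do
  {:ok, text} -> IO.puts(text)
  {:error, %Tinkex.Error{} = err} -> IO.inspect(err)
end
```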
@spec encode(String.t(), tokenizer_id() | String.t(), keyword()) :: {:ok, [integer()]} | {:error, Tinkex.Error.t()}
Encode text into token IDs using a cached tokenizer.
Loads (or reuses) the tokenizer keyed by the resolved tokenizer ID and returns
{:ok, [integer()]}. Does not apply chat templates; pass the already
formatted string you want to tokenize.
Examples
iex> {:ok, ids} = Tinkex.Tokenizer.encode("Hello", "gpt2")
iex> Enum.all?(ids, &is_integer/1)
true
@spec encode_text(String.t(), tokenizer_id() | String.t(), keyword()) :: {:ok, [integer()]} | {:error, Tinkex.Error.t()}
Convenience alias for encode/3.
Accepts the same options and returns the same tuple contract. Useful for
user-facing API symmetry with Tinkex.Types.ModelInput.from_text/2.
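Since encode_text/3 is a thin alias, the two calls should be interchangeable; a sketch of the expected symmetry (assumed, not verified against the implementation):

```elixir
# Both functions accept the same arguments and return the same tuple.
{:ok, ids} = Tinkex.Tokenizer.encode_text("Hello", "gpt2")
{:ok, same_ids} = Tinkex.Tokenizer.encode("Hello", "gpt2")
# same_ids is expected to equal ids for identical input and tokenizer.
```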
@spec get_or_load_tokenizer( tokenizer_id(), keyword() ) :: {:ok, handle()} | {:error, Tinkex.Error.t()}
Get a tokenizer handle from cache or load and cache it using the resolved ID.
The ETS table :tinkex_tokenizers is created on demand if the
application has not already started it.
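A caching sketch, assuming repeated calls with the same resolved ID return the cached handle rather than reloading:

```elixir
# First call loads the tokenizer and stores it in :tinkex_tokenizers,
# creating the ETS table if the application has not started it.
{:ok, handle} = Tinkex.Tokenizer.get_or_load_tokenizer("gpt2", [])

# Subsequent calls for the same tokenizer ID hit the ETS cache.
{:ok, cached} = Tinkex.Tokenizer.get_or_load_tokenizer("gpt2", [])
# cached is expected to be the same handle loaded above.
```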
@spec get_tokenizer_id(String.t(), Tinkex.TrainingClient.t() | nil, keyword()) :: tokenizer_id()
Resolve the tokenizer ID for the given model.
- If a training_client is provided, attempts to fetch model_data.tokenizer_id via the provided :info_fun (defaults to &TrainingClient.get_info/1).
- Applies the Llama-3 gating workaround ("thinkingmachineslabinc/meta-llama-3-tokenizer").
- Falls back to the provided model_name.