Core tokenizer API.
This module is the main entrypoint for loading tokenizers and running inference-time encode/decode operations.
Supported load paths:
- local or in-memory Hugging Face tokenizer.json
- local or in-memory OpenAI .tiktoken
- local or in-memory SentencePiece .model
- remote Hugging Face repositories via from_pretrained/2
Supported model families:
- BPE
- WordPiece
- Unigram
The API is intentionally inference-focused. It mirrors a useful subset of
elixir-nx/tokenizers while keeping IREE as the underlying runtime.
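As a sketch of the typical flow, assuming a local tokenizer.json on disk (the `ids` field on the returned encoding struct is assumed here for illustration):

```elixir
alias IREE.Tokenizers.Tokenizer

# Load a Hugging Face tokenizer.json (format inferred from the .json
# extension), encode a string, then decode the IDs back into text.
{:ok, tokenizer} = Tokenizer.from_file("tokenizer.json")
{:ok, encoding} = Tokenizer.encode(tokenizer, "Hello, world!")
{:ok, text} = Tokenizer.decode(tokenizer, encoding.ids)
```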
Summary
Types
Input accepted by encode operations.
Supported serialized tokenizer formats accepted by the constructor family.
Common {:ok, value} | {:error, {kind, message}} result shape used by the
public API.
A loaded tokenizer handle.
Functions
Returns the token ID for the requested special token, or nil when that
token is not defined.
Returns the token ID for the requested special token, or nil when that
token is not defined.
Decodes a list of token IDs back into text.
Decodes multiple token ID lists in one batch call.
Encodes a single binary input into an IREE.Tokenizers.Encoding.
Encodes multiple binary inputs in one batch call.
Returns the token ID for the requested special token, or nil when that
token is not defined.
Loads a tokenizer from an in-memory buffer.
Loads a tokenizer from a local file.
Downloads, caches, and loads a tokenizer from a remote repository.
Returns the model specification used to build this tokenizer when available.
Returns the tokenizer vocabulary as a %{token => id} map.
Returns the size of the tokenizer vocabulary.
Looks up the token string for a token ID.
Builds a tokenizer from a pure Elixir model specification.
Returns the token ID for the requested special token, or nil when that
token is not defined.
Returns the tokenizer model type name, such as "BPE", "WordPiece", or
"Unigram".
Returns the token ID for the requested special token, or nil when that
token is not defined.
Returns the token ID for the requested special token, or nil when that
token is not defined.
Replaces the tokenizer model with the given model specification.
Returns the predefined IREE tiktoken encoding names supported by the loader.
Infers a tiktoken encoding name from a known model or deployment name.
Looks up the token ID for a token string.
Returns the token ID for the requested special token, or nil when that
token is not defined.
Returns the number of active vocabulary entries.
Types
@type encode_input() :: binary()
Input accepted by encode operations.
The current implementation supports only single binary sequences.
@type load_format() :: :huggingface_json | :tiktoken | :sentencepiece_model
Supported serialized tokenizer formats accepted by the constructor family.
Common {:ok, value} | {:error, {kind, message}} result shape used by the
public API.
@type t() :: %IREE.Tokenizers.Tokenizer{resource: reference()}
A loaded tokenizer handle.
Functions
Returns the token ID for the requested special token, or nil when that
token is not defined.
Returns the token ID for the requested special token, or nil when that
token is not defined.
Decodes a list of token IDs back into text.
Supported options:
- :skip_special_tokens - suppress special tokens in the output text, defaults to true
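For instance, to keep special tokens in the decoded text instead of suppressing them (the token IDs below are purely illustrative):

```elixir
# Decode with special tokens preserved, e.g. [CLS]/[SEP]-style markers.
{:ok, text} =
  IREE.Tokenizers.Tokenizer.decode(tokenizer, [101, 7592, 102],
    skip_special_tokens: false)
```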
Decodes multiple token ID lists in one batch call.
@spec encode(t(), encode_input(), keyword()) :: result(IREE.Tokenizers.Encoding.t())
Encodes a single binary input into an IREE.Tokenizers.Encoding.
Supported options:
- :add_special_tokens - include tokenizer post-processing special tokens, defaults to true
- :track_offsets - track byte offsets, defaults to false
- :encoding_transformations - list of IREE.Tokenizers.Encoding.Transformation values applied after encoding
@spec encode_batch(t(), [encode_input()], keyword()) :: result([IREE.Tokenizers.Encoding.t()])
Encodes multiple binary inputs in one batch call.
Uses the same options as encode/3.
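A sketch of single and batch encoding with the documented options (assuming a loaded `tokenizer` handle):

```elixir
# Single input, with byte-offset tracking enabled.
{:ok, encoding} =
  IREE.Tokenizers.Tokenizer.encode(tokenizer, "Hello, world!",
    track_offsets: true)

# Batch of inputs; encode_batch/3 accepts the same options as encode/3.
{:ok, encodings} =
  IREE.Tokenizers.Tokenizer.encode_batch(tokenizer,
    ["first text", "second text"],
    add_special_tokens: true)
```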
Returns the token ID for the requested special token, or nil when that
token is not defined.
Loads a tokenizer from an in-memory buffer.
Supported options:
- :format - one of :huggingface_json, :tiktoken, or :sentencepiece_model
- :tiktoken_encoding - required for raw .tiktoken buffers when the encoding cannot be inferred from a filename or model name
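A sketch of loading a raw tiktoken buffer; the encoding name "cl100k_base" is an assumption here (check supported_tiktoken_encodings/0 for the names this loader actually provides):

```elixir
# A bare buffer carries no filename, so :tiktoken_encoding must be
# given explicitly alongside :format.
buffer = File.read!("cl100k_base.tiktoken")

{:ok, tokenizer} =
  IREE.Tokenizers.Tokenizer.from_buffer(buffer,
    format: :tiktoken,
    tiktoken_encoding: "cl100k_base")
```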
Loads a tokenizer from a local file.
Format can be inferred from the file extension:
- .json -> Hugging Face tokenizer JSON
- .tiktoken -> OpenAI tiktoken
- .model -> SentencePiece model
You can also override the inferred format with :format.
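For example, relying on extension inference versus overriding it (file names are illustrative):

```elixir
# Format inferred from the extension: .model -> SentencePiece.
{:ok, sp} = IREE.Tokenizers.Tokenizer.from_file("tokenizer.model")

# Explicit override when the extension is missing or misleading.
{:ok, hf} =
  IREE.Tokenizers.Tokenizer.from_file("tokenizer.config",
    format: :huggingface_json)
```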
Downloads, caches, and loads a tokenizer from a remote repository.
By default this expects a Hugging Face repository containing
tokenizer.json. For .tiktoken and SentencePiece .model loads, pass
:format.
Common options:
- :revision - revision or branch name, defaults to "main"
- :use_cache - whether to reuse an existing cached file, defaults to true
- :cache_dir - cache directory, defaults to a per-user application cache
- :http_client - {module, opts} tuple implementing request/1
- :token - optional Hugging Face token for gated/private repos
- :filename - optional explicit remote filename override
- :format - serialized tokenizer format
- :tiktoken_encoding - optional explicit tiktoken encoding override
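A sketch of the remote path; the repository name is illustrative, not an endorsement of any particular model:

```elixir
# Default behaviour: fetch tokenizer.json from the repo and cache it.
{:ok, tokenizer} =
  IREE.Tokenizers.Tokenizer.from_pretrained("bert-base-uncased")

# Pin a revision and point the cache at a custom directory.
{:ok, pinned} =
  IREE.Tokenizers.Tokenizer.from_pretrained("bert-base-uncased",
    revision: "main",
    cache_dir: "/tmp/tokenizer-cache")
```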
@spec get_model(t()) :: IREE.Tokenizers.Model.t()
Returns the model specification used to build this tokenizer when available.
For tokenizers loaded from serialized files, this returns a minimal
%IREE.Tokenizers.Model{} containing only the model type metadata.
Returns the tokenizer vocabulary as a %{token => id} map.
The :with_added_tokens option is accepted for compatibility and currently
defaults to true.
@spec get_vocab_size(t(), keyword()) :: non_neg_integer()
Returns the size of the tokenizer vocabulary.
The :with_added_tokens option is accepted for compatibility and currently
defaults to true.
Looks up the token string for a token ID.
@spec init(IREE.Tokenizers.Model.t()) :: result(t())
Builds a tokenizer from a pure Elixir model specification.
See IREE.Tokenizers.Model.BPE, IREE.Tokenizers.Model.WordPiece, and
IREE.Tokenizers.Model.Unigram.
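A minimal sketch of building a tokenizer from a model spec. The struct fields shown (:vocab, :unk_token) are assumptions for illustration only; consult IREE.Tokenizers.Model.WordPiece for the actual field set:

```elixir
# Hypothetical field names (:vocab, :unk_token) - verify against the
# IREE.Tokenizers.Model.WordPiece documentation before using.
model = %IREE.Tokenizers.Model.WordPiece{
  vocab: %{"[UNK]" => 0, "hello" => 1, "world" => 2},
  unk_token: "[UNK]"
}

{:ok, tokenizer} = IREE.Tokenizers.Tokenizer.init(model)
```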
Returns the token ID for the requested special token, or nil when that
token is not defined.
Returns the tokenizer model type name, such as "BPE", "WordPiece", or
"Unigram".
Returns the token ID for the requested special token, or nil when that
token is not defined.
Returns the token ID for the requested special token, or nil when that
token is not defined.
@spec set_model(t(), IREE.Tokenizers.Model.t()) :: t()
Replaces the tokenizer model with the given model specification.
This currently rebuilds a new tokenizer from the provided model and returns that tokenizer.
@spec supported_tiktoken_encodings() :: [binary()]
Returns the predefined IREE tiktoken encoding names supported by the loader.
Infers a tiktoken encoding name from a known model or deployment name.
Returns nil when the model name is not recognized.
Looks up the token ID for a token string.
Returns the token ID for the requested special token, or nil when that
token is not defined.
@spec vocab_size(t()) :: non_neg_integer()
Returns the number of active vocabulary entries.