IREE.Tokenizers.Tokenizer (iree_tokenizers v0.4.0)


Core tokenizer API.

This module is the main entrypoint for loading tokenizers and running inference-time encode/decode operations.

Supported load paths:

  • local or in-memory Hugging Face tokenizer.json
  • local or in-memory OpenAI .tiktoken
  • local or in-memory SentencePiece .model
  • remote Hugging Face repositories via from_pretrained/2

Supported model families:

  • BPE
  • WordPiece
  • Unigram

The API is intentionally inference-focused. It mirrors a useful subset of elixir-nx/tokenizers while keeping IREE as the underlying runtime.
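A typical round trip, sketched under the assumption that the repository name is illustrative and that the returned IREE.Tokenizers.Encoding exposes its token IDs in an ids field:

```elixir
# Download and cache a tokenizer.json from the Hugging Face Hub.
{:ok, tokenizer} = IREE.Tokenizers.Tokenizer.from_pretrained("bert-base-uncased")

# Encode a single binary into an Encoding, then decode the IDs back to text.
{:ok, encoding} = IREE.Tokenizers.Tokenizer.encode(tokenizer, "Hello, world!")
{:ok, text} = IREE.Tokenizers.Tokenizer.decode(tokenizer, encoding.ids)
```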

Summary

Types

encode_input()

Input accepted by encode operations.

load_format()

Supported serialized tokenizer formats accepted by the constructor family.

result(value)

Common {:ok, value} | {:error, {kind, message}} result shape used by the public API.

t()

A loaded tokenizer handle.

Functions

bos_token_id(tokenizer)

Returns the token ID for the beginning-of-sequence (BOS) token, or nil when that token is not defined.

cls_token_id(tokenizer)

Returns the token ID for the classification (CLS) token, or nil when that token is not defined.

decode(tokenizer, ids, opts \\ [])

Decodes a list of token IDs back into text.

decode_batch(tokenizer, batch_ids, opts \\ [])

Decodes multiple token ID lists in one batch call.

encode(tokenizer, input, opts \\ [])

Encodes a single binary input into an IREE.Tokenizers.Encoding.

encode_batch(tokenizer, inputs, opts \\ [])

Encodes multiple binary inputs in one batch call.

eos_token_id(tokenizer)

Returns the token ID for the end-of-sequence (EOS) token, or nil when that token is not defined.

from_buffer(buffer, opts \\ [])

Loads a tokenizer from an in-memory buffer.

from_file(path, opts \\ [])

Loads a tokenizer from a local file.

from_pretrained(repo_id, opts \\ [])

Downloads, caches, and loads a tokenizer from a remote repository.

get_model(tokenizer)

Returns the model specification used to build this tokenizer when available.

get_vocab(tokenizer, opts \\ [])

Returns the tokenizer vocabulary as a %{token => id} map.

get_vocab_size(tokenizer, opts \\ [])

Returns the size of the tokenizer vocabulary.

id_to_token(tokenizer, id)

Looks up the token string for a token ID.

init(model)

Builds a tokenizer from a pure Elixir model specification.

mask_token_id(tokenizer)

Returns the token ID for the mask token, or nil when that token is not defined.

model_type(tokenizer)

Returns the tokenizer model type name, such as "BPE", "WordPiece", or "Unigram".

pad_token_id(tokenizer)

Returns the token ID for the padding token, or nil when that token is not defined.

sep_token_id(tokenizer)

Returns the token ID for the separator (SEP) token, or nil when that token is not defined.

set_model(tokenizer, model)

Replaces the tokenizer model with the given model specification.

supported_tiktoken_encodings()

Returns the predefined IREE tiktoken encoding names supported by the loader.

tiktoken_encoding_for_model(model)

Infers a tiktoken encoding name from a known model or deployment name.

token_to_id(tokenizer, token)

Looks up the token ID for a token string.

unk_token_id(tokenizer)

Returns the token ID for the unknown (UNK) token, or nil when that token is not defined.

vocab_size(tokenizer)

Returns the number of active vocabulary entries.

Types

encode_input()

@type encode_input() :: binary()

Input accepted by encode operations.

The current implementation supports only single binary sequences.

load_format()

@type load_format() :: :huggingface_json | :tiktoken | :sentencepiece_model

Supported serialized tokenizer formats accepted by the constructor family.

result(value)

@type result(value) :: {:ok, value} | {:error, {atom(), binary()}}

Common {:ok, value} | {:error, {kind, message}} result shape used by the public API.

t()

@type t() :: %IREE.Tokenizers.Tokenizer{resource: reference()}

A loaded tokenizer handle.

Functions

bos_token_id(tokenizer)

@spec bos_token_id(t()) :: integer() | nil

Returns the token ID for the beginning-of-sequence (BOS) token, or nil when that token is not defined.
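The special-token accessors all follow the same shape; a quick sketch (the fallback value is an illustrative choice, not a library default):

```elixir
# Each accessor returns an integer ID, or nil when the tokenizer
# does not define that special token.
bos = IREE.Tokenizers.Tokenizer.bos_token_id(tokenizer)
eos = IREE.Tokenizers.Tokenizer.eos_token_id(tokenizer)

# Fall back to an application-chosen ID when padding is undefined.
pad = IREE.Tokenizers.Tokenizer.pad_token_id(tokenizer) || 0
```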

cls_token_id(tokenizer)

@spec cls_token_id(t()) :: integer() | nil

Returns the token ID for the classification (CLS) token, or nil when that token is not defined.

decode(tokenizer, ids, opts \\ [])

@spec decode(t(), [integer()], keyword()) :: result(binary())

Decodes a list of token IDs back into text.

Supported options:

  • :skip_special_tokens - suppress special tokens in the output text, defaults to true
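For example, keeping special tokens in the decoded text (the token IDs here are illustrative):

```elixir
{:ok, text} =
  IREE.Tokenizers.Tokenizer.decode(tokenizer, [101, 7592, 102],
    skip_special_tokens: false
  )
```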

decode_batch(tokenizer, batch_ids, opts \\ [])

@spec decode_batch(t(), [[integer()]], keyword()) :: result([binary()])

Decodes multiple token ID lists in one batch call.

encode(tokenizer, input, opts \\ [])

@spec encode(t(), encode_input(), keyword()) :: result(IREE.Tokenizers.Encoding.t())

Encodes a single binary input into an IREE.Tokenizers.Encoding.

Supported options:

  • :add_special_tokens - include tokenizer post-processing special tokens, defaults to true
  • :track_offsets - track byte offsets, defaults to false
  • :encoding_transformations - list of IREE.Tokenizers.Encoding.Transformation values applied after encoding
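A sketch combining the scalar options (see IREE.Tokenizers.Encoding.Transformation for the transformation values actually available):

```elixir
{:ok, encoding} =
  IREE.Tokenizers.Tokenizer.encode(tokenizer, "Hello, world!",
    add_special_tokens: true,
    track_offsets: true
  )
```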

encode_batch(tokenizer, inputs, opts \\ [])

@spec encode_batch(t(), [encode_input()], keyword()) ::
  result([IREE.Tokenizers.Encoding.t()])

Encodes multiple binary inputs in one batch call.

Uses the same options as encode/3.

eos_token_id(tokenizer)

@spec eos_token_id(t()) :: integer() | nil

Returns the token ID for the end-of-sequence (EOS) token, or nil when that token is not defined.

from_buffer(buffer, opts \\ [])

@spec from_buffer(
  binary(),
  keyword()
) :: result(t())

Loads a tokenizer from an in-memory buffer.

Supported options:

  • :format - one of :huggingface_json, :tiktoken, or :sentencepiece_model
  • :tiktoken_encoding - required for raw .tiktoken buffers when the encoding cannot be inferred from a filename or model name
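For example, loading a raw .tiktoken buffer read from disk, where the encoding name must be passed explicitly because a bare buffer carries no filename to infer it from (the path and encoding name are illustrative):

```elixir
buffer = File.read!("cl100k_base.tiktoken")

{:ok, tokenizer} =
  IREE.Tokenizers.Tokenizer.from_buffer(buffer,
    format: :tiktoken,
    tiktoken_encoding: "cl100k_base"
  )
```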

from_file(path, opts \\ [])

@spec from_file(
  Path.t(),
  keyword()
) :: result(t())

Loads a tokenizer from a local file.

Format can be inferred from the file extension:

  • .json -> Hugging Face tokenizer JSON
  • .tiktoken -> OpenAI tiktoken
  • .model -> SentencePiece model

You can also override the inferred format with :format.
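Inference by extension, plus an explicit override (the paths are illustrative):

```elixir
# Format inferred from the .json extension.
{:ok, hf} = IREE.Tokenizers.Tokenizer.from_file("priv/tokenizer.json")

# Extension-less file: state the format explicitly.
{:ok, sp} =
  IREE.Tokenizers.Tokenizer.from_file("priv/spm_vocab",
    format: :sentencepiece_model
  )
```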

from_pretrained(repo_id, opts \\ [])

@spec from_pretrained(
  binary(),
  keyword()
) :: result(t())

Downloads, caches, and loads a tokenizer from a remote repository.

By default this expects a Hugging Face repository containing tokenizer.json. For .tiktoken and SentencePiece .model loads, pass :format.

Common options:

  • :revision - revision or branch name, defaults to "main"
  • :use_cache - whether to reuse an existing cached file, defaults to true
  • :cache_dir - cache directory, defaults to a per-user application cache
  • :http_client - {module, opts} tuple implementing request/1
  • :token - optional Hugging Face token for gated/private repos
  • :filename - optional explicit remote filename override
  • :format - serialized tokenizer format
  • :tiktoken_encoding - optional explicit tiktoken encoding override
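A sketch of a pinned, authenticated load (the repository name, environment variable, and cache path are illustrative):

```elixir
{:ok, tokenizer} =
  IREE.Tokenizers.Tokenizer.from_pretrained("org/private-model",
    revision: "main",
    token: System.get_env("HF_TOKEN"),
    cache_dir: "/tmp/tokenizer_cache"
  )
```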

get_model(tokenizer)

@spec get_model(t()) :: IREE.Tokenizers.Model.t()

Returns the model specification used to build this tokenizer when available.

For tokenizers loaded from serialized files, this returns a minimal %IREE.Tokenizers.Model{} containing only the model type metadata.

get_vocab(tokenizer, opts \\ [])

@spec get_vocab(
  t(),
  keyword()
) :: %{required(binary()) => integer()}

Returns the tokenizer vocabulary as a %{token => id} map.

The :with_added_tokens option is accepted for compatibility and currently defaults to true.

get_vocab_size(tokenizer, opts \\ [])

@spec get_vocab_size(
  t(),
  keyword()
) :: non_neg_integer()

Returns the size of the tokenizer vocabulary.

The :with_added_tokens option is accepted for compatibility and currently defaults to true.

id_to_token(tokenizer, id)

@spec id_to_token(t(), integer()) :: binary() | nil

Looks up the token string for a token ID.

init(model)

@spec init(IREE.Tokenizers.Model.t()) :: result(t())

Builds a tokenizer from a pure Elixir model specification.

See IREE.Tokenizers.Model.BPE, IREE.Tokenizers.Model.WordPiece, and IREE.Tokenizers.Model.Unigram.
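A minimal sketch, assuming the BPE spec struct accepts vocab and merges fields along the lines of elixir-nx/tokenizers; the actual field names are defined by IREE.Tokenizers.Model.BPE:

```elixir
# Hypothetical tiny BPE spec; consult IREE.Tokenizers.Model.BPE
# for the real struct fields.
model = %IREE.Tokenizers.Model.BPE{
  vocab: %{"h" => 0, "i" => 1, "hi" => 2},
  merges: [{"h", "i"}]
}

{:ok, tokenizer} = IREE.Tokenizers.Tokenizer.init(model)
```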

mask_token_id(tokenizer)

@spec mask_token_id(t()) :: integer() | nil

Returns the token ID for the mask token, or nil when that token is not defined.

model_type(tokenizer)

@spec model_type(t()) :: binary()

Returns the tokenizer model type name, such as "BPE", "WordPiece", or "Unigram".

pad_token_id(tokenizer)

@spec pad_token_id(t()) :: integer() | nil

Returns the token ID for the padding token, or nil when that token is not defined.

sep_token_id(tokenizer)

@spec sep_token_id(t()) :: integer() | nil

Returns the token ID for the separator (SEP) token, or nil when that token is not defined.

set_model(tokenizer, model)

@spec set_model(t(), IREE.Tokenizers.Model.t()) :: t()

Replaces the tokenizer model with the given model specification.

This currently rebuilds a new tokenizer from the provided model and returns that tokenizer.

supported_tiktoken_encodings()

@spec supported_tiktoken_encodings() :: [binary()]

Returns the predefined IREE tiktoken encoding names supported by the loader.

tiktoken_encoding_for_model(model)

@spec tiktoken_encoding_for_model(binary()) :: binary() | nil

Infers a tiktoken encoding name from a known model or deployment name.

Returns nil when the model name is not recognized.
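For example (the model names follow OpenAI conventions; whether a given name resolves depends on the loader's known-model table):

```elixir
IREE.Tokenizers.Tokenizer.tiktoken_encoding_for_model("gpt-4")
#=> an encoding name such as "cl100k_base", if gpt-4 is in the table

IREE.Tokenizers.Tokenizer.tiktoken_encoding_for_model("my-custom-model")
#=> nil
```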

token_to_id(tokenizer, token)

@spec token_to_id(t(), binary()) :: integer() | nil

Looks up the token ID for a token string.
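Round-tripping between tokens and IDs (the token string is illustrative):

```elixir
# token_to_id/2 and id_to_token/2 are inverse lookups;
# both return nil on a miss.
id = IREE.Tokenizers.Tokenizer.token_to_id(tokenizer, "hello")
token = if id, do: IREE.Tokenizers.Tokenizer.id_to_token(tokenizer, id)
```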

unk_token_id(tokenizer)

@spec unk_token_id(t()) :: integer() | nil

Returns the token ID for the unknown (UNK) token, or nil when that token is not defined.

vocab_size(tokenizer)

@spec vocab_size(t()) :: non_neg_integer()

Returns the number of active vocabulary entries.