IREE.Tokenizers.Tokenizer (iree_tokenizers v0.4.0)


Core tokenizer API.

This module is the main entrypoint for loading tokenizers and running inference-time encode/decode operations.

Supported load paths:

  • local or in-memory Hugging Face tokenizer.json
  • local or in-memory OpenAI .tiktoken
  • local or in-memory SentencePiece .model
  • remote Hugging Face repositories via from_pretrained/2

Supported model families:

  • BPE
  • WordPiece
  • Unigram

The API is intentionally inference-focused. It mirrors a useful subset of elixir-nx/tokenizers while keeping IREE as the underlying runtime.
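A typical round trip, sketched under the assumption that the repository name is illustrative and that the returned IREE.Tokenizers.Encoding exposes its token IDs in an ids field:

```elixir
# Download and cache a tokenizer.json from the Hugging Face Hub.
{:ok, tokenizer} = IREE.Tokenizers.Tokenizer.from_pretrained("bert-base-uncased")

# Encode a single binary into an Encoding, then decode the IDs back to text.
{:ok, encoding} = IREE.Tokenizers.Tokenizer.encode(tokenizer, "Hello, world!")
{:ok, text} = IREE.Tokenizers.Tokenizer.decode(tokenizer, encoding.ids)
```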

Summary

Types

encode_input()

Input accepted by encode operations.

load_format()

Supported serialized tokenizer formats accepted by the constructor family.

result(value)

Common {:ok, value} | {:error, {kind, message}} result shape used by the public API.

t()

A loaded tokenizer handle.

Functions

bos_token_id(tokenizer)

Returns the token ID for the beginning-of-sequence (BOS) token, or nil when that token is not defined.

cls_token_id(tokenizer)

Returns the token ID for the classification (CLS) token, or nil when that token is not defined.

decode(tokenizer, ids, opts \\ [])

Decodes a list of token IDs back into text.

decode_batch(tokenizer, batch_ids, opts \\ [])

Decodes multiple token ID lists in one batch call.

encode(tokenizer, input, opts \\ [])

Encodes a single binary input into an IREE.Tokenizers.Encoding.

encode_batch(tokenizer, inputs, opts \\ [])

Encodes multiple binary inputs in one batch call.

eos_token_id(tokenizer)

Returns the token ID for the end-of-sequence (EOS) token, or nil when that token is not defined.

from_buffer(buffer, opts \\ [])

Loads a tokenizer from an in-memory buffer.

from_file(path, opts \\ [])

Loads a tokenizer from a local file.

from_pretrained(repo_id, opts \\ [])

Downloads, caches, and loads a tokenizer from a remote repository.

get_model(tokenizer)

Returns the model specification used to build this tokenizer when available.

get_vocab(tokenizer, opts \\ [])

Returns the tokenizer vocabulary as a %{token => id} map.

get_vocab_size(tokenizer, opts \\ [])

Returns the size of the tokenizer vocabulary.

id_to_token(tokenizer, id)

Looks up the token string for a token ID.

init(model)

Builds a tokenizer from a pure Elixir model specification.

mask_token_id(tokenizer)

Returns the token ID for the mask token, or nil when that token is not defined.

model_type(tokenizer)

Returns the tokenizer model type name, such as "BPE", "WordPiece", or "Unigram".

pad_token_id(tokenizer)

Returns the token ID for the padding token, or nil when that token is not defined.

sep_token_id(tokenizer)

Returns the token ID for the separator (SEP) token, or nil when that token is not defined.

set_model(tokenizer, model)

Replaces the tokenizer model with the given model specification.

supported_tiktoken_encodings()

Returns the predefined IREE tiktoken encoding names supported by the loader.

tiktoken_encoding_for_model(model)

Infers a tiktoken encoding name from a known model or deployment name.

token_to_id(tokenizer, token)

Looks up the token ID for a token string.

unk_token_id(tokenizer)

Returns the token ID for the unknown (UNK) token, or nil when that token is not defined.

vocab_size(tokenizer)

Returns the number of active vocabulary entries.

Types

encode_input()

@type encode_input() :: binary()

Input accepted by encode operations.

The current implementation supports only single binary sequences.

load_format()

@type load_format() :: :huggingface_json | :tiktoken | :sentencepiece_model

Supported serialized tokenizer formats accepted by the constructor family.

result(value)

@type result(value) :: {:ok, value} | {:error, {atom(), binary()}}

Common {:ok, value} | {:error, {kind, message}} result shape used by the public API.

t()

@type t() :: %IREE.Tokenizers.Tokenizer{resource: reference()}

A loaded tokenizer handle.

Functions

bos_token_id(tokenizer)

@spec bos_token_id(t()) :: integer() | nil

Returns the token ID for the beginning-of-sequence (BOS) token, or nil when that token is not defined.
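The special-token accessors all follow the same shape; a quick sketch (the fallback value is an illustrative choice, not a library default):

```elixir
# Each accessor returns an integer ID, or nil when the tokenizer
# does not define that special token.
bos = IREE.Tokenizers.Tokenizer.bos_token_id(tokenizer)
eos = IREE.Tokenizers.Tokenizer.eos_token_id(tokenizer)

# Fall back to an application-chosen ID when padding is undefined.
pad = IREE.Tokenizers.Tokenizer.pad_token_id(tokenizer) || 0
```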

cls_token_id(tokenizer)

@spec cls_token_id(t()) :: integer() | nil

Returns the token ID for the classification (CLS) token, or nil when that token is not defined.

decode(tokenizer, ids, opts \\ [])

@spec decode(t(), [integer()], keyword()) :: result(binary())

Decodes a list of token IDs back into text.

Supported options:

  • :skip_special_tokens - suppress special tokens in the output text, defaults to true
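For example, keeping special tokens in the decoded text (the token IDs here are illustrative):

```elixir
{:ok, text} =
  IREE.Tokenizers.Tokenizer.decode(tokenizer, [101, 7592, 102],
    skip_special_tokens: false
  )
```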

decode_batch(tokenizer, batch_ids, opts \\ [])

@spec decode_batch(t(), [[integer()]], keyword()) :: result([binary()])

Decodes multiple token ID lists in one batch call.

encode(tokenizer, input, opts \\ [])

@spec encode(t(), encode_input(), keyword()) :: result(IREE.Tokenizers.Encoding.t())

Encodes a single binary input into an IREE.Tokenizers.Encoding.

Supported options:

  • :add_special_tokens - include tokenizer post-processing special tokens, defaults to true
  • :track_offsets - track byte offsets, defaults to false
  • :encoding_transformations - list of IREE.Tokenizers.Encoding.Transformation values applied after encoding
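A sketch combining the scalar options (see IREE.Tokenizers.Encoding.Transformation for the transformation values actually available):

```elixir
{:ok, encoding} =
  IREE.Tokenizers.Tokenizer.encode(tokenizer, "Hello, world!",
    add_special_tokens: true,
    track_offsets: true
  )
```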

encode_batch(tokenizer, inputs, opts \\ [])

@spec encode_batch(t(), [encode_input()], keyword()) ::
  result([IREE.Tokenizers.Encoding.t()])

Encodes multiple binary inputs in one batch call.

Uses the same options as encode/3.

eos_token_id(tokenizer)

@spec eos_token_id(t()) :: integer() | nil

Returns the token ID for the end-of-sequence (EOS) token, or nil when that token is not defined.

from_buffer(buffer, opts \\ [])

@spec from_buffer(
  binary(),
  keyword()
) :: result(t())

Loads a tokenizer from an in-memory buffer.

Supported options:

  • :format - one of :huggingface_json, :tiktoken, or :sentencepiece_model
  • :tiktoken_encoding - required for raw .tiktoken buffers when the encoding cannot be inferred from a filename or model name
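For example, loading a raw .tiktoken buffer read from disk, where the encoding name must be passed explicitly because a bare buffer carries no filename to infer it from (the path and encoding name are illustrative):

```elixir
buffer = File.read!("cl100k_base.tiktoken")

{:ok, tokenizer} =
  IREE.Tokenizers.Tokenizer.from_buffer(buffer,
    format: :tiktoken,
    tiktoken_encoding: "cl100k_base"
  )
```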

from_file(path, opts \\ [])

@spec from_file(
  Path.t(),
  keyword()
) :: result(t())

Loads a tokenizer from a local file.

Format can be inferred from the file extension:

  • .json -> Hugging Face tokenizer JSON
  • .tiktoken -> OpenAI tiktoken
  • .model -> SentencePiece model

You can also override the inferred format with :format.
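Inference by extension, plus an explicit override (the paths are illustrative):

```elixir
# Format inferred from the .json extension.
{:ok, hf} = IREE.Tokenizers.Tokenizer.from_file("priv/tokenizer.json")

# Extension-less file: state the format explicitly.
{:ok, sp} =
  IREE.Tokenizers.Tokenizer.from_file("priv/spm_vocab",
    format: :sentencepiece_model
  )
```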

from_pretrained(repo_id, opts \\ [])

@spec from_pretrained(
  binary(),
  keyword()
) :: result(t())

Downloads, caches, and loads a tokenizer from a remote repository.

By default this expects a Hugging Face repository containing tokenizer.json. For .tiktoken and SentencePiece .model loads, pass :format.

Common options:

  • :revision - revision or branch name, defaults to "main"
  • :use_cache - whether to reuse an existing cached file, defaults to true
  • :cache_dir - cache directory, defaults to a per-user application cache
  • :http_client - {module, opts} tuple implementing request/1
  • :token - optional Hugging Face token for gated/private repos
  • :filename - optional explicit remote filename override
  • :format - serialized tokenizer format
  • :tiktoken_encoding - optional explicit tiktoken encoding override
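A sketch of a pinned, authenticated load (the repository name, environment variable, and cache path are illustrative):

```elixir
{:ok, tokenizer} =
  IREE.Tokenizers.Tokenizer.from_pretrained("org/private-model",
    revision: "main",
    token: System.get_env("HF_TOKEN"),
    cache_dir: "/tmp/tokenizer_cache"
  )
```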

get_model(tokenizer)

@spec get_model(t()) :: IREE.Tokenizers.Model.t()

Returns the model specification used to build this tokenizer when available.

For tokenizers loaded from serialized files, this returns a minimal %IREE.Tokenizers.Model{} containing only the model type metadata.

get_vocab(tokenizer, opts \\ [])

@spec get_vocab(
  t(),
  keyword()
) :: %{required(binary()) => integer()}

Returns the tokenizer vocabulary as a %{token => id} map.

The :with_added_tokens option is accepted for compatibility and currently defaults to true.

get_vocab_size(tokenizer, opts \\ [])

@spec get_vocab_size(
  t(),
  keyword()
) :: non_neg_integer()

Returns the size of the tokenizer vocabulary.

The :with_added_tokens option is accepted for compatibility and currently defaults to true.

id_to_token(tokenizer, id)

@spec id_to_token(t(), integer()) :: binary() | nil

Looks up the token string for a token ID.

init(model)

@spec init(IREE.Tokenizers.Model.t()) :: result(t())

Builds a tokenizer from a pure Elixir model specification.

See IREE.Tokenizers.Model.BPE, IREE.Tokenizers.Model.WordPiece, and IREE.Tokenizers.Model.Unigram.
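A minimal sketch, assuming the BPE spec struct accepts vocab and merges fields along the lines of elixir-nx/tokenizers; the actual field names are defined by IREE.Tokenizers.Model.BPE:

```elixir
# Hypothetical tiny BPE spec; consult IREE.Tokenizers.Model.BPE
# for the real struct fields.
model = %IREE.Tokenizers.Model.BPE{
  vocab: %{"h" => 0, "i" => 1, "hi" => 2},
  merges: [{"h", "i"}]
}

{:ok, tokenizer} = IREE.Tokenizers.Tokenizer.init(model)
```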

mask_token_id(tokenizer)

@spec mask_token_id(t()) :: integer() | nil

Returns the token ID for the mask token, or nil when that token is not defined.

model_type(tokenizer)

@spec model_type(t()) :: binary()

Returns the tokenizer model type name, such as "BPE", "WordPiece", or "Unigram".

pad_token_id(tokenizer)

@spec pad_token_id(t()) :: integer() | nil

Returns the token ID for the padding token, or nil when that token is not defined.

sep_token_id(tokenizer)

@spec sep_token_id(t()) :: integer() | nil

Returns the token ID for the separator (SEP) token, or nil when that token is not defined.

set_model(tokenizer, model)

@spec set_model(t(), IREE.Tokenizers.Model.t()) :: t()

Replaces the tokenizer model with the given model specification.

This currently rebuilds a new tokenizer from the provided model and returns that tokenizer.

supported_tiktoken_encodings()

@spec supported_tiktoken_encodings() :: [binary()]

Returns the predefined IREE tiktoken encoding names supported by the loader.

tiktoken_encoding_for_model(model)

@spec tiktoken_encoding_for_model(binary()) :: binary() | nil

Infers a tiktoken encoding name from a known model or deployment name.

Returns nil when the model name is not recognized.
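For example (the model names follow OpenAI conventions; whether a given name resolves depends on the loader's known-model table):

```elixir
IREE.Tokenizers.Tokenizer.tiktoken_encoding_for_model("gpt-4")
#=> an encoding name such as "cl100k_base", if gpt-4 is in the table

IREE.Tokenizers.Tokenizer.tiktoken_encoding_for_model("my-custom-model")
#=> nil
```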

token_to_id(tokenizer, token)

@spec token_to_id(t(), binary()) :: integer() | nil

Looks up the token ID for a token string.
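Round-tripping between tokens and IDs (the token string is illustrative):

```elixir
# token_to_id/2 and id_to_token/2 are inverse lookups;
# both return nil on a miss.
id = IREE.Tokenizers.Tokenizer.token_to_id(tokenizer, "hello")
token = if id, do: IREE.Tokenizers.Tokenizer.id_to_token(tokenizer, id)
```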

unk_token_id(tokenizer)

@spec unk_token_id(t()) :: integer() | nil

Returns the token ID for the unknown (UNK) token, or nil when that token is not defined.

vocab_size(tokenizer)

@spec vocab_size(t()) :: non_neg_integer()

Returns the number of active vocabulary entries.