# `IREE.Tokenizers.Tokenizer`
[🔗](https://github.com/goodhamgupta/iree_tokenizers/blob/v0.7.0/lib/iree/tokenizers/tokenizer.ex#L1)

Core tokenizer API.

This module is the main entrypoint for loading tokenizers and running
inference-time encode/decode operations.

Supported load paths:

- local or in-memory Hugging Face `tokenizer.json`
- local or in-memory OpenAI `.tiktoken`
- local or in-memory SentencePiece `.model`
- remote Hugging Face repositories via `from_pretrained/2`

Supported model families:

- BPE
- WordPiece
- Unigram

The API is intentionally inference-focused. It mirrors a useful subset of
`elixir-nx/tokenizers` while keeping IREE as the underlying runtime.
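
A minimal end-to-end sketch of the typical flow. The repository name is illustrative, and `IREE.Tokenizers.Encoding.get_ids/1` is assumed to mirror the accessor of the same name in `elixir-nx/tokenizers`:

```elixir
alias IREE.Tokenizers.Tokenizer

{:ok, tokenizer} = Tokenizer.from_pretrained("bert-base-uncased")
{:ok, encoding} = Tokenizer.encode(tokenizer, "Hello, world!")

# Assumed accessor on the encoding struct, mirroring elixir-nx/tokenizers.
ids = IREE.Tokenizers.Encoding.get_ids(encoding)

{:ok, text} = Tokenizer.decode(tokenizer, ids)
```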

# `encode_input`

```elixir
@type encode_input() :: binary()
```

Input accepted by encode operations.

The current implementation supports only single binary sequences.

# `load_format`

```elixir
@type load_format() :: :huggingface_json | :tiktoken | :sentencepiece_model
```

Supported serialized tokenizer formats accepted by the constructor family.

# `result`

```elixir
@type result(value) :: {:ok, value} | {:error, {atom(), binary()}}
```

Common `{:ok, value} | {:error, {kind, message}}` result shape used by the
public API.

# `t`

```elixir
@type t() :: %IREE.Tokenizers.Tokenizer{resource: reference()}
```

A loaded tokenizer handle.

# `bos_token_id`

```elixir
@spec bos_token_id(t()) :: integer() | nil
```

Returns the token ID for the beginning-of-sequence (BOS) token, or `nil` when
that token is not defined.
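
The same `nil`-vs-integer contract holds for all special-token getters in this module, so callers should branch on `nil` rather than assume a token exists. A sketch, where `tokenizer` is a previously loaded handle:

```elixir
case IREE.Tokenizers.Tokenizer.bos_token_id(tokenizer) do
  nil -> :no_bos_token
  id when is_integer(id) -> {:bos, id}
end
```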

# `cls_token_id`

```elixir
@spec cls_token_id(t()) :: integer() | nil
```

Returns the token ID for the classification (CLS) token, or `nil` when that
token is not defined.

# `decode`

```elixir
@spec decode(t(), [integer()], keyword()) :: result(binary())
```

Decodes a list of token IDs back into text.

Supported options:

- `:skip_special_tokens` - suppress special tokens in the output text,
  defaults to `true`
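
A hedged sketch of both decode modes, where `tokenizer` is a previously loaded handle and the token IDs are purely illustrative:

```elixir
alias IREE.Tokenizers.Tokenizer

# Special tokens stripped (the default):
{:ok, text} = Tokenizer.decode(tokenizer, [101, 7592, 102])

# Keep special tokens in the output text:
{:ok, raw} =
  Tokenizer.decode(tokenizer, [101, 7592, 102], skip_special_tokens: false)
```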

# `decode_batch`

```elixir
@spec decode_batch(t(), [[integer()]], keyword()) :: result([binary()])
```

Decodes multiple token ID lists in one batch call.

# `encode`

```elixir
@spec encode(t(), encode_input(), keyword()) :: result(IREE.Tokenizers.Encoding.t())
```

Encodes a single binary input into an `IREE.Tokenizers.Encoding`.

Supported options:

- `:add_special_tokens` - include tokenizer post-processing special tokens,
  defaults to `true`
- `:track_offsets` - track byte offsets, defaults to `false`
- `:encoding_transformations` - list of
  `IREE.Tokenizers.Encoding.Transformation` values applied after encoding
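
A sketch combining the documented options, where `tokenizer` is a previously loaded handle:

```elixir
{:ok, encoding} =
  IREE.Tokenizers.Tokenizer.encode(tokenizer, "Hello, world!",
    add_special_tokens: true,
    track_offsets: true
  )
```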

# `encode_batch`

```elixir
@spec encode_batch(t(), [encode_input()], keyword()) ::
  result([IREE.Tokenizers.Encoding.t()])
```

Encodes multiple binary inputs in one batch call.

Uses the same options as `encode/3`.
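
A sketch of a batch call; the result list is expected to preserve input order, one encoding per input:

```elixir
{:ok, encodings} =
  IREE.Tokenizers.Tokenizer.encode_batch(
    tokenizer,
    ["first sentence", "second sentence"],
    add_special_tokens: true
  )
```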

# `eos_token_id`

```elixir
@spec eos_token_id(t()) :: integer() | nil
```

Returns the token ID for the end-of-sequence (EOS) token, or `nil` when that
token is not defined.

# `from_buffer`

```elixir
@spec from_buffer(
  binary(),
  keyword()
) :: result(t())
```

Loads a tokenizer from an in-memory buffer.

Supported options:

- `:format` - one of `:huggingface_json`, `:tiktoken`, or
  `:sentencepiece_model`
- `:tiktoken_encoding` - required for raw `.tiktoken` buffers when the
  encoding cannot be inferred from a filename or model name
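
A sketch of both buffer paths. The file paths are illustrative, and `"cl100k_base"` is one plausible encoding name (see `supported_tiktoken_encodings/0` for the actual list):

```elixir
alias IREE.Tokenizers.Tokenizer

# A Hugging Face tokenizer.json already held in memory:
json = File.read!("priv/tokenizer.json")
{:ok, tokenizer} = Tokenizer.from_buffer(json, format: :huggingface_json)

# A raw tiktoken buffer has no filename to infer from, so the
# encoding name must be given explicitly:
tiktoken = File.read!("priv/encoder.tiktoken")
{:ok, tt} =
  Tokenizer.from_buffer(tiktoken,
    format: :tiktoken,
    tiktoken_encoding: "cl100k_base"
  )
```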

# `from_file`

```elixir
@spec from_file(
  Path.t(),
  keyword()
) :: result(t())
```

Loads a tokenizer from a local file.

Format can be inferred from the file extension:

- `.json` -> Hugging Face tokenizer JSON
- `.tiktoken` -> OpenAI tiktoken
- `.model` -> SentencePiece model

You can also override the inferred format with `:format`.
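
A sketch of extension-based inference and an explicit override (paths illustrative):

```elixir
alias IREE.Tokenizers.Tokenizer

# Format inferred from the .json extension:
{:ok, tokenizer} = Tokenizer.from_file("priv/tokenizer.json")

# Non-standard extension, so name the format explicitly:
{:ok, sp} = Tokenizer.from_file("priv/spm.bin", format: :sentencepiece_model)
```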

# `from_pretrained`

```elixir
@spec from_pretrained(
  binary(),
  keyword()
) :: result(t())
```

Downloads, caches, and loads a tokenizer from a remote repository.

By default this expects a Hugging Face repository containing
`tokenizer.json`. For `.tiktoken` and SentencePiece `.model` loads, pass
`:format`.

Common options:

- `:revision` - revision or branch name, defaults to `"main"`
- `:use_cache` - whether to reuse an existing cached file, defaults to `true`
- `:cache_dir` - cache directory, defaults to a per-user application cache
- `:http_client` - `{module, opts}` tuple implementing `request/1`
- `:token` - optional Hugging Face token for gated/private repos
- `:filename` - optional explicit remote filename override
- `:format` - serialized tokenizer format
- `:subfolder` - optional subdirectory within the repository that holds
  the tokenizer assets. Diffusers-style repositories such as
  `stabilityai/stable-diffusion-xl-base-1.0` ship their tokenizer under
  `tokenizer/tokenizer.json` (and a second under `tokenizer_2/`). When
  `:subfolder` is omitted, `from_pretrained/2` tries the repository root,
  `tokenizer/`, `tokenizer_2/`, and `text_encoder/` in order and returns
  the first successful download. Pass an explicit value (or `""` for the
  root) to disable the fallback walk.
- `:tiktoken_encoding` - optional explicit tiktoken encoding override
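
A sketch of a plain load and a pinned-subfolder load for the diffusers-style repository named above; passing `:subfolder` explicitly disables the fallback walk:

```elixir
alias IREE.Tokenizers.Tokenizer

# Default: expects tokenizer.json at the repository root
# (falling back to tokenizer/, tokenizer_2/, text_encoder/):
{:ok, tokenizer} = Tokenizer.from_pretrained("bert-base-uncased")

# Pin the second tokenizer of an SDXL-style repository:
{:ok, tok2} =
  Tokenizer.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0",
    subfolder: "tokenizer_2",
    revision: "main"
  )
```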

# `get_model`

```elixir
@spec get_model(t()) :: IREE.Tokenizers.Model.t()
```

Returns the model specification used to build this tokenizer when available.

For tokenizers loaded from serialized files, this returns a minimal
`%IREE.Tokenizers.Model{}` containing only the model type metadata.

# `get_vocab`

```elixir
@spec get_vocab(
  t(),
  keyword()
) :: %{required(binary()) => integer()}
```

Returns the tokenizer vocabulary as a `%{token => id}` map.

The `:with_added_tokens` option is accepted for compatibility and currently
defaults to `true`.
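
Since the map is keyed by token, inverting it is a common step when inspecting IDs by hand. A sketch, where `tokenizer` is a previously loaded handle:

```elixir
vocab = IREE.Tokenizers.Tokenizer.get_vocab(tokenizer)

# Build the reverse id => token map for manual lookups:
id_to_token = Map.new(vocab, fn {token, id} -> {id, token} end)
```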

# `get_vocab_size`

```elixir
@spec get_vocab_size(
  t(),
  keyword()
) :: non_neg_integer()
```

Returns the size of the tokenizer vocabulary.

The `:with_added_tokens` option is accepted for compatibility and currently
defaults to `true`.

# `id_to_token`

```elixir
@spec id_to_token(t(), integer()) :: binary() | nil
```

Looks up the token string for a token ID.

# `init`

```elixir
@spec init(IREE.Tokenizers.Model.t()) :: result(t())
```

Builds a tokenizer from a pure Elixir model specification.

See `IREE.Tokenizers.Model.BPE`, `IREE.Tokenizers.Model.WordPiece`, and
`IREE.Tokenizers.Model.Unigram`.
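
A hypothetical sketch. The `WordPiece.init/2` constructor shape below (a `vocab` map plus an `:unk_token` option) is assumed to mirror `elixir-nx/tokenizers` and may differ in this library:

```elixir
# Toy vocabulary; real models carry thousands of entries.
vocab = %{"[UNK]" => 0, "hello" => 1, "world" => 2}

# Assumed constructor, modeled on elixir-nx/tokenizers:
{:ok, model} = IREE.Tokenizers.Model.WordPiece.init(vocab, unk_token: "[UNK]")
{:ok, tokenizer} = IREE.Tokenizers.Tokenizer.init(model)
```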

# `mask_token_id`

```elixir
@spec mask_token_id(t()) :: integer() | nil
```

Returns the token ID for the mask token, or `nil` when that token is not
defined.

# `model_type`

```elixir
@spec model_type(t()) :: binary()
```

Returns the tokenizer model type name, such as `"BPE"`, `"WordPiece"`, or
`"Unigram"`.

# `pad_token_id`

```elixir
@spec pad_token_id(t()) :: integer() | nil
```

Returns the token ID for the padding token, or `nil` when that token is not
defined.

# `sep_token_id`

```elixir
@spec sep_token_id(t()) :: integer() | nil
```

Returns the token ID for the separator (SEP) token, or `nil` when that token
is not defined.

# `set_model`

```elixir
@spec set_model(t(), IREE.Tokenizers.Model.t()) :: t()
```

Replaces the tokenizer model with the given model specification.

The handle is not mutated in place: this currently builds a new tokenizer from
the provided model and returns it.

# `supported_tiktoken_encodings`

```elixir
@spec supported_tiktoken_encodings() :: [binary()]
```

Returns the predefined IREE tiktoken encoding names supported by the loader.

# `tiktoken_encoding_for_model`

```elixir
@spec tiktoken_encoding_for_model(binary()) :: binary() | nil
```

Infers a tiktoken encoding name from a known model or deployment name.

Returns `nil` when the model name is not recognized.
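
A sketch of both outcomes. The exact model-to-encoding mapping is an assumption (`"gpt-4"` maps to `"cl100k_base"` in OpenAI's tiktoken, but which names this loader recognizes may differ):

```elixir
alias IREE.Tokenizers.Tokenizer

# A recognized model name yields an encoding name such as "cl100k_base":
encoding = Tokenizer.tiktoken_encoding_for_model("gpt-4")

# An unrecognized name yields nil:
nil = Tokenizer.tiktoken_encoding_for_model("not-a-real-model")
```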

# `token_to_id`

```elixir
@spec token_to_id(t(), binary()) :: integer() | nil
```

Looks up the token ID for a token string.
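
A sketch round-tripping a token through both lookup directions, guarding against out-of-vocabulary tokens (the token string is illustrative):

```elixir
alias IREE.Tokenizers.Tokenizer

id = Tokenizer.token_to_id(tokenizer, "hello")

# id is nil for out-of-vocabulary tokens, so guard before reversing:
token = if id, do: Tokenizer.id_to_token(tokenizer, id)
```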

# `unk_token_id`

```elixir
@spec unk_token_id(t()) :: integer() | nil
```

Returns the token ID for the unknown (UNK) token, or `nil` when that token is
not defined.

# `vocab_size`

```elixir
@spec vocab_size(t()) :: non_neg_integer()
```

Returns the number of active vocabulary entries.

---

*Consult [api-reference.md](api-reference.md) for the complete listing.*
