IREE.Tokenizers.Model.BPE (iree_tokenizers v0.7.0)


BPE model specification compatible with IREE.Tokenizers.Tokenizer.init/1.

Use this module when you already have a vocabulary and merge list in memory or on disk and want to build an IREE-backed tokenizer from those pieces.

Summary

Types

options()

Options for BPE model construction.

Functions

empty()

Returns an empty BPE model specification.

from_file(vocab_path, merges_path, opts \\ [])

Builds a BPE model specification from a vocabulary JSON file and a merges file.

init(vocab, merges, opts \\ [])

Builds a BPE model specification from an in-memory vocabulary and merge list.

Types

options()

@type options() :: [
  cache_capacity: number(),
  dropout: float(),
  unk_token: String.t(),
  continuing_subword_prefix: String.t(),
  end_of_word_suffix: String.t(),
  fuse_unk: boolean(),
  byte_fallback: boolean()
]

Options for BPE model construction.

The option set intentionally mirrors that of elixir-nx/tokenizers. Only the subset of options that the current IREE-backed load path can represent is applied.

Functions

empty()

@spec empty() :: {:ok, IREE.Tokenizers.Model.t()}

Returns an empty BPE model specification.

from_file(vocab_path, merges_path, opts \\ [])

@spec from_file(String.t(), String.t(), options()) ::
  {:ok, IREE.Tokenizers.Model.t()} | {:error, term()}

Builds a BPE model specification from a vocabulary JSON file and a merges file.

The vocabulary file is expected to be a JSON object mapping token strings to integer IDs. The merges file is expected to contain one merge pair per line.
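As a rough sketch of the two inputs described above, the following writes a tiny vocabulary and merges file to disk and loads them. The `BPEFixture` module, the tokens, and the IDs are hypothetical. The space-separated `left right` layout of each merge line is an assumption borrowed from the common `merges.txt` convention and is not confirmed by this page.

```elixir
defmodule BPEFixture do
  # Hypothetical helper: writes a minimal vocabulary and merges file into `dir`
  # and returns both paths.
  def write!(dir) do
    vocab_path = Path.join(dir, "vocab.json")
    merges_path = Path.join(dir, "merges.txt")

    # JSON object mapping token strings to integer IDs.
    File.write!(vocab_path, ~s({"h": 0, "i": 1, "hi": 2, "<unk>": 3}))

    # One merge pair per line (assumed format: "left right").
    File.write!(merges_path, "h i\n")

    {vocab_path, merges_path}
  end

  # Builds the model specification from the files written above.
  def load!(dir) do
    {vocab_path, merges_path} = write!(dir)

    case IREE.Tokenizers.Model.BPE.from_file(vocab_path, merges_path, unk_token: "<unk>") do
      {:ok, model} -> model
      {:error, reason} -> raise "could not load BPE files: #{inspect(reason)}"
    end
  end
end
```

Calling `BPEFixture.load!(System.tmp_dir!())` should then return the model specification, provided the assumed merge line format matches what the loader accepts.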

init(vocab, merges, opts \\ [])

@spec init(
  %{required(String.t()) => integer()},
  [{String.t(), String.t()}],
  options()
) ::
  {:ok, IREE.Tokenizers.Model.t()}

Builds a BPE model specification from an in-memory vocabulary and merge list.

The returned %IREE.Tokenizers.Model{} can be passed to IREE.Tokenizers.Tokenizer.init/1.
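A minimal sketch of the in-memory path, using the argument shapes from the spec above. The `BPEInMemory` module and its toy vocabulary are hypothetical. The return shape of `IREE.Tokenizers.Tokenizer.init/1` is not documented on this page, so the sketch returns its result without matching on it.

```elixir
defmodule BPEInMemory do
  # Toy vocabulary: token string => integer ID (illustrative values only).
  def vocab, do: %{"h" => 0, "i" => 1, "hi" => 2, "<unk>" => 3}

  # Merge list: each tuple is one {left, right} pair.
  def merges, do: [{"h", "i"}]

  def build do
    # init/3 is specified to return {:ok, model}.
    {:ok, model} = IREE.Tokenizers.Model.BPE.init(vocab(), merges(), unk_token: "<unk>")

    # Hand the model specification to the tokenizer constructor.
    IREE.Tokenizers.Tokenizer.init(model)
  end
end
```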