BPE model specification compatible with IREE.Tokenizers.Tokenizer.init/1.
Use this module when you already have a vocabulary and merge list in memory or on disk and want to build an IREE-backed tokenizer from those pieces.
Summary
Functions
empty/0: Returns an empty BPE model specification.
from_file/3: Builds a BPE model specification from a vocabulary JSON file and a merges file.
init/3: Builds a BPE model specification from an in-memory vocabulary and merge list.
Types
@type options() :: [cache_capacity: number(), dropout: float(), unk_token: String.t(), continuing_subword_prefix: String.t(), end_of_word_suffix: String.t(), fuse_unk: boolean(), byte_fallback: boolean()]
Options for BPE model construction.
The supported options intentionally mirror those of elixir-nx/tokenizers, but
only the subset that the current IREE-backed load path can represent is
applied.
Functions
@spec empty() :: {:ok, IREE.Tokenizers.Model.t()}
Returns an empty BPE model specification.
@spec from_file(String.t(), String.t(), options()) :: {:ok, IREE.Tokenizers.Model.t()} | {:error, term()}
Builds a BPE model specification from a vocabulary JSON file and a merges file.
The vocabulary file is expected to be a JSON object mapping token strings to integer IDs. The merges file is expected to contain one merge pair per line.
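To make the merges format concrete, here is a minimal sketch of how such a file's text could be turned into the `[{String.t(), String.t()}]` shape that the in-memory constructor accepts. It is not this library's parser. `MergesSketch` is a hypothetical helper, and two details are assumptions borrowed from the common Hugging Face `merges.txt` layout rather than from this page: the two halves of a pair are separated by a single space, and a leading `#version` header line is skipped.

```elixir
defmodule MergesSketch do
  @moduledoc false

  # Hypothetical helper: parses merges-file text into a list of {left, right} tuples.
  def parse(text) do
    text
    |> String.split("\n", trim: true)
    # Trimming also removes a trailing "\r" from files with Windows line endings.
    |> Enum.map(&String.trim/1)
    # ASSUMPTION: blank lines and a "#version" header line carry no merge pair.
    |> Enum.reject(&(&1 == "" or String.starts_with?(&1, "#version")))
    |> Enum.map(fn line ->
      # ASSUMPTION: one space separates the pair.
      # A line without a space raises a MatchError here.
      [left, right] = String.split(line, " ", parts: 2)
      {left, right}
    end)
  end
end
```

Under these assumptions, `MergesSketch.parse("h e\nl l\n")` returns `[{"h", "e"}, {"l", "l"}]`. The matching vocabulary file would be a JSON object such as `{"<unk>": 0, "h": 1, "e": 2, "l": 3, "he": 4, "ll": 5}`.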
@spec init(%{required(String.t()) => integer()}, [{String.t(), String.t()}], options()) :: {:ok, IREE.Tokenizers.Model.t()}
Builds a BPE model specification from an in-memory vocabulary and merge list.
The returned %IREE.Tokenizers.Model{} can be passed to
IREE.Tokenizers.Tokenizer.init/1.
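
The argument shapes in the spec can be illustrated with a small sketch. The vocabulary and merges are toy data, `BPEInitSketch` is a hypothetical wrapper, and `IREE.Tokenizers.Model.BPE` is an assumed name for this module, since the page does not state it. The consistency check reflects a general BPE expectation (each merge's parts and their concatenation exist in the vocabulary), not a documented requirement of this library.

```elixir
defmodule BPEInitSketch do
  @moduledoc false

  # Toy vocabulary: token string => integer ID, matching the first argument of init/3.
  def vocab do
    %{"<unk>" => 0, "h" => 1, "e" => 2, "l" => 3, "o" => 4, "he" => 5, "ll" => 6}
  end

  # Toy merge list: {left, right} pairs, matching the second argument of init/3.
  def merges, do: [{"h", "e"}, {"l", "l"}]

  # Checks that both halves of every merge, and the merged token, are in the vocabulary.
  # This check is illustrative and holds only when no subword prefix or suffix is used.
  def consistent?(vocab, merges) do
    Enum.all?(merges, fn {left, right} ->
      Map.has_key?(vocab, left) and Map.has_key?(vocab, right) and
        Map.has_key?(vocab, left <> right)
    end)
  end

  # ASSUMPTION: the module documented on this page is IREE.Tokenizers.Model.BPE.
  # Pass the real module if the name differs.
  def build(bpe \\ IREE.Tokenizers.Model.BPE) do
    with {:ok, model} <- bpe.init(vocab(), merges(), unk_token: "<unk>") do
      # The model specification is handed to the tokenizer constructor named above.
      IREE.Tokenizers.Tokenizer.init(model)
    end
  end
end
```

Calling `BPEInitSketch.build/1` requires the library to be installed; `consistent?/2` is plain Elixir and can be run on its own before loading real data.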