Tokenizers.Tokenizer (Tokenizers v0.3.2)
The struct and associated functions for a tokenizer.
A Tokenizers.Tokenizer.t() is a container that holds the constituent parts of the tokenization pipeline. When you call Tokenizers.Tokenizer.encode/3, the input text goes through the following pipeline:
- normalization
- pre-tokenization
- model
- post-processing
This returns a Tokenizers.Encoding.t(), which can then give you the token ids for each token in the input text. These token ids are usually used as the input for natural language processing machine learning models.
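Putting the pipeline together, a typical end-to-end use looks like the following sketch (it assumes network access to the Hugging Face Hub; "bert-base-cased" stands in for any model repository):

```elixir
# Download a pretrained tokenizer from the Hugging Face Hub
{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("bert-base-cased")

# Encode a single sequence; the text runs through normalization,
# pre-tokenization, the model, and post-processing
{:ok, encoding} = Tokenizers.Tokenizer.encode(tokenizer, "Hello there!")

# The resulting Tokenizers.Encoding.t() yields the token ids
ids = Tokenizers.Encoding.get_ids(encoding)
```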
Summary
Functions
Decode the given list of ids or list of lists of ids back to strings.
Encode the given sequence or batch of sequences to a Tokenizers.Encoding.t().
Instantiate a new tokenizer from the file at the given path.
Instantiate a new tokenizer from an existing file on the Hugging Face Hub.
Get the Tokenizer's Model.
Get the tokenizer's vocabulary as a map of token to id.
Get the number of tokens in the vocabulary.
Convert a given id to its token.
Save the tokenizer to the provided path.
Convert a given token to its id.
Types
encode_input() - An input subject to tokenization. Can be either a single sequence or a pair of sequences.
Functions
@spec decode(Tokenizer.t(), non_neg_integer() | [non_neg_integer()], Keyword.t()) :: {:ok, String.t() | [String.t()]} | {:error, term()}
Decode the given list of ids or list of lists of ids back to strings.
Options

:skip_special_tokens - whether the special tokens should be removed from the decoded string. Defaults to true.
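A minimal encode/decode round trip might look like this sketch (it assumes a tokenizer has already been loaded into `tokenizer`):

```elixir
{:ok, encoding} = Tokenizers.Tokenizer.encode(tokenizer, "Hello there!")
ids = Tokenizers.Encoding.get_ids(encoding)

# Decode back to a string; special tokens are skipped by default
{:ok, text} = Tokenizers.Tokenizer.decode(tokenizer, ids)

# Keep the special tokens in the decoded output instead
{:ok, with_special} =
  Tokenizers.Tokenizer.decode(tokenizer, ids, skip_special_tokens: false)
```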
@spec encode(Tokenizer.t(), encode_input() | [encode_input()], Keyword.t()) :: {:ok, Encoding.t() | [Encoding.t()]} | {:error, term()}
Encode the given sequence or batch of sequences to a Tokenizers.Encoding.t()
.
Options

:add_special_tokens - whether to add special tokens to the encoding. Defaults to true.
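Per the encode_input() type, a sketch of the input shapes encode/3 accepts (assuming a loaded `tokenizer`; the sample strings are illustrative):

```elixir
# Encode a batch of sequences in one call; returns a list of encodings
{:ok, encodings} =
  Tokenizers.Tokenizer.encode(tokenizer, ["First sequence", "Second sequence"])

# A pair of sequences is given as a tuple
{:ok, pair_encoding} =
  Tokenizers.Tokenizer.encode(tokenizer, {"A question?", "An answer."})

# Skip the model's special tokens (e.g. [CLS]/[SEP] for BERT-style models)
{:ok, plain} =
  Tokenizers.Tokenizer.encode(tokenizer, "Hello", add_special_tokens: false)
```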
Instantiate a new tokenizer from the file at the given path.
Options

:additional_special_tokens - A list of special tokens to append to the tokenizer. Defaults to [].
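Loading from disk might look like this sketch (the file path and the "<custom>" token are illustrative):

```elixir
# Load a tokenizer from a local tokenizer.json file
{:ok, tokenizer} = Tokenizers.Tokenizer.from_file("path/to/tokenizer.json")

# Append extra special tokens while loading
{:ok, tokenizer} =
  Tokenizers.Tokenizer.from_file("path/to/tokenizer.json",
    additional_special_tokens: ["<custom>"]
  )
```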
Instantiate a new tokenizer from an existing file on the Hugging Face Hub.
This downloads the tokenizer file, saves it to disk, and loads that file.
Options

:http_client - A tuple with a module and options. This module should implement the request/1 function, accepting a keyword list with the options for a request. This is inspired by Req.request/1: https://hexdocs.pm/req/Req.html#request/1. The default HTTP client config is {Tokenizers.HTTPClient, []}. Since it's inspired by Req, it's possible to use that client without any adjustments. When making a request, the :url and :method options are overridden, and :headers contains a "user-agent" set by default.

:revision - The revision name that should be used for fetching the tokenizer from Hugging Face.

:use_cache - Whether to read from the cache when the file already exists. Defaults to true.

:cache_dir - The directory where the cache is saved. Files are written to the cache even if :use_cache is false. By default it uses :filename.basedir/3 to get a cache dir based on the "tokenizers_elixir" application name.

:additional_special_tokens - A list of special tokens to append to the tokenizer. Defaults to [].
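A sketch of fetching from the Hub with these options set explicitly (network access assumed; the repository name, revision, and cache directory are illustrative values):

```elixir
{:ok, tokenizer} =
  Tokenizers.Tokenizer.from_pretrained("bert-base-cased",
    revision: "main",
    use_cache: true,
    cache_dir: "/tmp/tokenizers_cache"
  )
```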
@spec get_model(Tokenizer.t()) :: Tokenizers.Model.t()
Get the Tokenizer's Model.
Get the tokenizer's vocabulary as a map of token to id.
@spec get_vocab_size(Tokenizer.t()) :: non_neg_integer()
Get the number of tokens in the vocabulary.
Convert a given id to its token.
Save the tokenizer to the provided path.
@spec token_to_id(Tokenizer.t(), binary()) :: non_neg_integer()
Convert a given token to its id.
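The vocabulary helpers can be combined for token/id lookups, as in this sketch (assuming a loaded `tokenizer`; "hello" is an illustrative token that may not exist in a given vocabulary):

```elixir
# Map a token to its id and back again
id = Tokenizers.Tokenizer.token_to_id(tokenizer, "hello")
token = Tokenizers.Tokenizer.id_to_token(tokenizer, id)

# Inspect the full vocabulary and its size
vocab = Tokenizers.Tokenizer.get_vocab(tokenizer)
size = Tokenizers.Tokenizer.get_vocab_size(tokenizer)
```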