Tokenizers.Tokenizer (Tokenizers v0.5.1)
Functions to load, apply and train tokenizers.
The Tokenizers.Tokenizer.t/0 struct represents the tokenization
pipeline. When you call Tokenizers.Tokenizer.encode/3, the input
text goes through the following steps:
- normalization
- pre-tokenization
- model
- post-processing
This pipeline returns a Tokenizers.Encoding.t/0, which can then
give you the token ids representing the input text. These token ids
are usually used as the input for natural language processing (NLP)
machine learning models.
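For example, a minimal end-to-end sketch (the repository name is illustrative; from_pretrained/2 and the Tokenizers.Encoding accessors get_ids/1 and get_tokens/1 are library functions assumed here, not documented in this section):

    # Load a pretrained tokenizer and run the full pipeline on one sentence.
    {:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("bert-base-cased")
    {:ok, encoding} = Tokenizers.Tokenizer.encode(tokenizer, "Hello there!")

    # The encoding exposes the pipeline's output.
    ids = Tokenizers.Encoding.get_ids(encoding)
    tokens = Tokenizers.Encoding.get_tokens(encoding)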
Summary
Loading
Instantiate a new tokenizer from the buffer.
Instantiate a new tokenizer from the file at the given path.
Loads a new tokenizer from a repository on Hugging Face Hub.
Save the tokenizer to the provided path.
Inference
Decodes the given list of ids back to a string.
Batched version of decode/3.
Encode the given sequence to a Tokenizers.Encoding.t().
Batched version of encode/3.
Convert a given id to its token.
Convert a given token to its id.
Configuration
Adds special tokens to tokenizer's vocabulary.
Adds tokens to tokenizer's vocabulary.
Disable padding on tokenizer.
Disable truncation on tokenizer.
Returns the decoder currently used by tokenizer.
Returns the model currently used by tokenizer.
Returns the normalizer currently used by tokenizer.
Returns the post-processor currently used by tokenizer.
Returns the pre-tokenizer currently used by tokenizer.
Get the tokenizer's vocabulary as a map of token to id.
Get the number of tokens in the vocabulary.
Instantiate a new tokenizer from an existing model.
Sets tokenizer's decoder.
Sets tokenizer's model.
Sets tokenizer's normalizer.
Configures tokenizer with padding.
Sets tokenizer's post-processor.
Sets tokenizer's pre-tokenizer.
Configures tokenizer with truncation.
Training
Train the tokenizer on the given files.
Loading
Instantiate a new tokenizer from the buffer.
Instantiate a new tokenizer from the file at the given path.
Loads a new tokenizer from a repository on Hugging Face Hub.
This downloads the tokenizer file, saves it to disk, and loads it from there.
Options
- :http_client - a tuple with a module and options. This module should
  implement the request/1 function, accepting a keyword list with the
  options for a request. This is inspired by Req.request/1:
  https://hexdocs.pm/req/Req.html#request/1
  The default HTTP client config is {Tokenizers.HTTPClient, []}. Since
  it's inspired by Req, it's possible to use that client without any
  adjustments.
  When making a request, the options :url and :method are going to be
  overridden. :headers contains the "user-agent" set by default.
- :revision - the revision name that should be used for fetching the
  tokenizer from the Hugging Face repository
- :use_cache - whether to read from cache when the file already exists.
  Defaults to true
- :cache_dir - the directory where the cache is saved. Files are written
  to the cache even if :use_cache is false. By default it uses
  :filename.basedir/3 to get a cache dir based on the "tokenizers_elixir"
  application name
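A sketch of loading with these options (the revision and cache directory are placeholders, and from_pretrained/2 is assumed as the function name):

    # Fetch a specific revision; skip the cache read but still write the file to cache.
    {:ok, tokenizer} =
      Tokenizers.Tokenizer.from_pretrained("bert-base-cased",
        revision: "main",
        use_cache: false,
        cache_dir: "/tmp/tokenizers_cache"
      )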
Save the tokenizer to the provided path.
Options
- :pretty - whether to pretty print the JSON file. Defaults to true
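A save-and-reload sketch (save/3 and from_file/2 are assumed as the function names, the path is a placeholder, and the exact success value of the save call is not shown in this section):

    # Persist the tokenizer as pretty-printed JSON, then load it back from disk.
    path = "/tmp/tokenizer.json"
    {:ok, _} = Tokenizers.Tokenizer.save(tokenizer, path, pretty: true)
    {:ok, reloaded} = Tokenizers.Tokenizer.from_file(path)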
Inference
@spec decode(t(), [non_neg_integer()], keyword()) :: {:ok, String.t()} | {:error, term()}
Decodes the given list of ids back to a string.
Options
- :skip_special_tokens - whether to exclude special tokens from the decoded string. Defaults to true
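For example, round-tripping an encoding (Tokenizers.Encoding.get_ids/1 is assumed):

    # Encode a sentence, then decode the ids back to a string.
    {:ok, encoding} = Tokenizers.Tokenizer.encode(tokenizer, "Hello there!")
    ids = Tokenizers.Encoding.get_ids(encoding)
    {:ok, text} = Tokenizers.Tokenizer.decode(tokenizer, ids, skip_special_tokens: true)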
@spec decode_batch(t(), [[non_neg_integer()]], keyword()) :: {:ok, [String.t()]} | {:error, term()}
Batched version of decode/3.
@spec encode(t(), encode_input(), keyword()) :: {:ok, Tokenizers.Encoding.t()} | {:error, term()}
Encode the given sequence to a Tokenizers.Encoding.t().
Options
- :add_special_tokens - whether to add special tokens to the sequence. Defaults to true
- :encoding_transformations - a list of Tokenizers.Encoding.Transformation.t/0 to apply to the encoding. Check Tokenizers.Encoding.transform/2 for more information. Defaults to []
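A sketch of the :add_special_tokens option, together with the pair form of encode_input() (assumed here to be a two-element tuple):

    # Encode without the model's special tokens (e.g. [CLS]/[SEP] for BERT).
    {:ok, encoding} =
      Tokenizers.Tokenizer.encode(tokenizer, "Hello there!", add_special_tokens: false)

    # A pair of sequences, e.g. for sentence-pair tasks.
    {:ok, pair} = Tokenizers.Tokenizer.encode(tokenizer, {"Question?", "Answer."})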
@spec encode_batch(t(), [encode_input()], keyword()) :: {:ok, [Tokenizers.Encoding.t()]} | {:error, term()}
Batched version of encode/3.
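A batched sketch (Tokenizers.Encoding.get_ids/1 is assumed):

    # Encode several inputs at once, then decode all of them back.
    {:ok, encodings} =
      Tokenizers.Tokenizer.encode_batch(tokenizer, ["Hello there!", "General Kenobi!"])

    ids = Enum.map(encodings, &Tokenizers.Encoding.get_ids/1)
    {:ok, texts} = Tokenizers.Tokenizer.decode_batch(tokenizer, ids)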
@spec id_to_token(t(), non_neg_integer()) :: String.t() | nil
Convert a given id to its token.
@spec token_to_id(t(), String.t()) :: non_neg_integer() | nil
Convert a given token to its id.
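Both lookups return nil for unknown input. For example, with a BERT-style vocabulary:

    # Map a special token to its id and back again.
    id = Tokenizers.Tokenizer.token_to_id(tokenizer, "[CLS]")
    token = Tokenizers.Tokenizer.id_to_token(tokenizer, id)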
Configuration
@spec add_special_tokens(tokenizer :: t(), tokens :: [String.t()]) :: non_neg_integer()
Adds special tokens to tokenizer's vocabulary.
These tokens are special. To add regular tokens use add_tokens/2.
@spec add_tokens(tokenizer :: t(), tokens :: [String.t()]) :: non_neg_integer()
Adds tokens to tokenizer's vocabulary.
These tokens are not special. To add special tokens use add_special_tokens/2.
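Both functions return the number of tokens actually added. A sketch:

    # Extend the vocabulary with regular tokens and a special token.
    num_added = Tokenizers.Tokenizer.add_tokens(tokenizer, ["tyrannosaurus", "brachiosaurus"])
    num_special = Tokenizers.Tokenizer.add_special_tokens(tokenizer, ["<obs>"])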
Disable padding on tokenizer.
Disable truncation on tokenizer.
@spec get_decoder(t()) :: Tokenizers.Decoder.t() | nil
Returns the decoder currently used by tokenizer.
@spec get_model(t()) :: Tokenizers.Model.t()
Returns the model currently used by tokenizer.
@spec get_normalizer(t()) :: Tokenizers.Normalizer.t() | nil
Returns the normalizer currently used by tokenizer.
@spec get_post_processor(t()) :: Tokenizers.PostProcessor.t() | nil
Returns the post-processor currently used by tokenizer.
@spec get_pre_tokenizer(t()) :: Tokenizers.PreTokenizer.t() | nil
Returns the pre-tokenizer currently used by tokenizer.
Get the tokenizer's vocabulary as a map of token to id.
Options
- :with_added_tokens - whether to include the tokens explicitly added to the tokenizer. Defaults to true
@spec get_vocab_size(t(), keyword()) :: non_neg_integer()
Get the number of tokens in the vocabulary.
Options
- :with_added_tokens - whether to include the tokens explicitly added to the tokenizer. Defaults to true
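For example (get_vocab/2 is assumed as the name of the vocabulary accessor described above):

    # Fetch the token-to-id map and compare vocabulary sizes.
    vocab = Tokenizers.Tokenizer.get_vocab(tokenizer)
    full_size = Tokenizers.Tokenizer.get_vocab_size(tokenizer)
    base_size = Tokenizers.Tokenizer.get_vocab_size(tokenizer, with_added_tokens: false)
    id = vocab["hello"] # nil if the token is absent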
@spec init(Tokenizers.Model.t()) :: {:ok, t()} | {:error, any()}
Instantiate a new tokenizer from an existing model.
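A minimal sketch that reuses the model of an already-loaded tokenizer; a tokenizer built this way carries only the model, with no normalizer, pre-tokenizer, post-processor, or decoder configured:

    # Build a bare tokenizer around an existing model.
    {:ok, pretrained} = Tokenizers.Tokenizer.from_pretrained("bert-base-cased")
    model = Tokenizers.Tokenizer.get_model(pretrained)
    {:ok, bare} = Tokenizers.Tokenizer.init(model)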
@spec set_decoder(t(), Tokenizers.Decoder.t()) :: t()
Sets tokenizer's decoder.
@spec set_model(t(), Tokenizers.Model.t()) :: t()
Sets tokenizer's model.
@spec set_normalizer(t(), Tokenizers.Normalizer.t()) :: t()
Sets tokenizer's normalizer.
@spec set_padding(tokenizer :: t(), opts) :: t()
      when opts: [
             strategy: :batch_longest | {:fixed, non_neg_integer()},
             direction: :left | :right,
             pad_to_multiple_of: non_neg_integer(),
             pad_id: non_neg_integer(),
             pad_type_id: non_neg_integer(),
             pad_token: String.t()
           ]
Configures tokenizer with padding.
To disable padding use disable_padding/1.
Options
- :strategy (default: :batch_longest) - the strategy to use when padding
- :direction (default: :right) - the direction to use when padding
- :pad_to_multiple_of (default: 0) - the multiple to pad to
- :pad_id (default: 0) - the id of the token to use for padding
- :pad_type_id (default: 0) - the id of the token type to use for padding
- :pad_token (default: "[PAD]") - the token to use for padding
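For example:

    # Pad batch members to the longest encoding, rounded up to a multiple of 8.
    tokenizer =
      Tokenizers.Tokenizer.set_padding(tokenizer,
        strategy: :batch_longest,
        pad_to_multiple_of: 8,
        pad_token: "[PAD]"
      )

If you change :pad_token, also set :pad_id to the matching vocabulary id; the two are configured independently.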
@spec set_post_processor(t(), Tokenizers.PostProcessor.t()) :: t()
Sets tokenizer's post-processor.
@spec set_pre_tokenizer(t(), Tokenizers.PreTokenizer.t()) :: t()
Sets tokenizer's pre-tokenizer.
@spec set_truncation(t(), opts) :: t()
      when opts: [
             max_length: non_neg_integer(),
             stride: non_neg_integer(),
             strategy: :longest_first | :only_first | :only_second,
             direction: :left | :right
           ]
Configures tokenizer with truncation.
To disable truncation use disable_truncation/1.
Options
- :max_length (default: 512) - the maximum length to truncate the model's input to
- :stride (default: 0) - the stride to use when overflowing the model's input
- :strategy (default: :longest_first) - the strategy to use when overflowing the model's input
- :direction (default: :right) - the direction to use when overflowing the model's input
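For example:

    # Truncate model input to 128 tokens; overflowing windows overlap by 16 tokens.
    tokenizer =
      Tokenizers.Tokenizer.set_truncation(tokenizer, max_length: 128, stride: 16)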
Training
Train the tokenizer on the given files.
Types
@type encode_input() :: String.t() | {String.t(), String.t()}
An input subject to tokenization. Can be either a single sequence or a pair of sequences.
@type t() :: %Tokenizers.Tokenizer{resource: reference()}