Tokenizers.Tokenizer (Tokenizers v0.5.0)

Functions to load, apply and train tokenizers.

The Tokenizers.Tokenizer.t/0 struct represents the tokenization pipeline. When you call Tokenizers.Tokenizer.encode/3, the input text goes through the following steps:

  • normalization
  • pre-tokenization
  • model
  • post-processing

This pipeline returns a Tokenizers.Encoding.t/0, which can then give you the token ids representing the input text. These token ids are usually used as the input for natural language processing (NLP) machine learning models.
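
As a quick sketch of the whole pipeline (the repository name and input text are illustrative, and Tokenizers.Encoding.get_ids/1 is assumed to be the accessor for the ids):

    # Load a pretrained tokenizer (repository name is illustrative).
    {:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("bert-base-cased")

    # Run the full pipeline on a single sequence.
    {:ok, encoding} = Tokenizers.Tokenizer.encode(tokenizer, "Hello, world!")

    # Read the token ids off the resulting encoding
    # (assumes Tokenizers.Encoding.get_ids/1).
    ids = Tokenizers.Encoding.get_ids(encoding)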

Summary

Types

encode_input()
  An input subject to tokenization.

t()

Loading

from_buffer(data, opts \\ [])
  Instantiates a new tokenizer from a buffer.

from_file(path, opts \\ [])
  Instantiates a new tokenizer from the file at the given path.

from_pretrained(identifier, opts \\ [])
  Loads a new tokenizer from a repository on the Hugging Face Hub.

save(tokenizer, path, opts \\ [])
  Saves the tokenizer to the provided path.

Inference

decode(tokenizer, ids, opts \\ [])
  Decodes the given list of ids back to a string.

decode_batch(tokenizer, sentences, opts \\ [])
  Batched version of decode/3.

encode(tokenizer, input, opts \\ [])
  Encodes the given sequence to a Tokenizers.Encoding.t().

encode_batch(tokenizer, input, opts \\ [])
  Batched version of encode/3.

id_to_token(tokenizer, id)
  Converts a given id to its token.

token_to_id(tokenizer, token)
  Converts a given token to its id.

Configuration

add_special_tokens(tokenizer, tokens)
  Adds special tokens to the tokenizer's vocabulary.

add_tokens(tokenizer, tokens)
  Adds tokens to the tokenizer's vocabulary.

disable_padding(tokenizer)
  Disables padding on the tokenizer.

disable_truncation(tokenizer)
  Disables truncation on the tokenizer.

get_decoder(tokenizer)
  Returns the decoder currently used by the tokenizer.

get_model(tokenizer)
  Returns the model currently used by the tokenizer.

get_normalizer(tokenizer)
  Returns the normalizer currently used by the tokenizer.

get_post_processor(tokenizer)
  Returns the post-processor currently used by the tokenizer.

get_pre_tokenizer(tokenizer)
  Returns the pre-tokenizer currently used by the tokenizer.

get_vocab(tokenizer, opts \\ [])
  Returns the tokenizer's vocabulary as a map of token to id.

get_vocab_size(tokenizer, opts \\ [])
  Returns the number of tokens in the vocabulary.

init(model)
  Instantiates a new tokenizer from an existing model.

set_decoder(tokenizer, decoder)
  Sets the tokenizer's decoder.

set_model(tokenizer, model)
  Sets the tokenizer's model.

set_normalizer(tokenizer, normalizer)
  Sets the tokenizer's normalizer.

set_padding(tokenizer, opts)
  Configures the tokenizer with padding.

set_post_processor(tokenizer, post_processor)
  Sets the tokenizer's post-processor.

set_pre_tokenizer(tokenizer, pre_tokenizer)
  Sets the tokenizer's pre-tokenizer.

set_truncation(tokenizer, opts \\ [])
  Configures the tokenizer with truncation.

Training

train_from_files(tokenizer, paths, opts \\ [])
  Trains the tokenizer on the given files.

Types

@type encode_input() :: String.t() | {String.t(), String.t()}

An input subject to tokenization.

Can be either a single sequence or a pair of sequences.

@type t() :: %Tokenizers.Tokenizer{resource: reference()}

Loading

from_buffer(data, opts \\ [])

@spec from_buffer(
  data :: String.t(),
  keyword()
) :: {:ok, t()} | {:error, term()}

Instantiates a new tokenizer from a buffer.

from_file(path, opts \\ [])

@spec from_file(
  path :: String.t(),
  keyword()
) :: {:ok, t()} | {:error, term()}

Instantiates a new tokenizer from the file at the given path.

from_pretrained(identifier, opts \\ [])

@spec from_pretrained(String.t(), Keyword.t()) :: {:ok, t()} | {:error, term()}

Loads a new tokenizer from a repository on the Hugging Face Hub.

This downloads the tokenizer file, saves it to disk, and loads it from that file.

Options

  • :http_client - a tuple with a module and options. The module must implement a request/1 function that accepts a keyword list of request options. This is inspired by Req.request/1: https://hexdocs.pm/req/Req.html#request/1

    The default HTTP client configuration is {Tokenizers.HTTPClient, []}. Since the interface is inspired by Req, that client can be used without any adjustments.

    When making a request, the :url and :method options are overridden. :headers contains a "user-agent" entry set by default.

  • :revision - the revision name that should be used for fetching the tokenizer from the Hugging Face repository

  • :use_cache - whether to read from the cache when the file already exists. Defaults to true

  • :cache_dir - the directory where the cache is saved. Files are written to the cache even if :use_cache is false. By default it uses :filename.basedir/3 to get a cache directory based on the "tokenizers_elixir" application name
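
For example, pinning a revision and forcing a fresh download (the identifier and revision are illustrative):

    {:ok, tokenizer} =
      Tokenizers.Tokenizer.from_pretrained("bert-base-cased",
        revision: "main",
        use_cache: false
      )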

save(tokenizer, path, opts \\ [])

Saves the tokenizer to the provided path.

Options

  • :pretty - whether to pretty print the JSON file. Defaults to true
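
A minimal sketch (the path is illustrative):

    # Write a compact (non-pretty) tokenizer.json to disk.
    Tokenizers.Tokenizer.save(tokenizer, "/tmp/tokenizer.json", pretty: false)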

Inference

decode(tokenizer, ids, opts \\ [])

@spec decode(t(), [non_neg_integer()], keyword()) ::
  {:ok, String.t()} | {:error, term()}

Decodes the given list of ids back to a string.

Options

  • :skip_special_tokens - whether to exclude special tokens from the decoded string. Defaults to true
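
For example (the ids are illustrative; actual values depend on the vocabulary):

    # Special tokens are skipped by default; pass
    # skip_special_tokens: false to keep them.
    {:ok, text} = Tokenizers.Tokenizer.decode(tokenizer, [101, 8667, 106, 102])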

decode_batch(tokenizer, sentences, opts \\ [])

@spec decode_batch(t(), [[non_neg_integer()]], keyword()) ::
  {:ok, [String.t()]} | {:error, term()}

Batched version of decode/3.

encode(tokenizer, input, opts \\ [])

@spec encode(t(), encode_input(), keyword()) ::
  {:ok, Tokenizers.Encoding.t()} | {:error, term()}

Encodes the given sequence to a Tokenizers.Encoding.t().

Options

  • :add_special_tokens - whether to add special tokens to the sequence. Defaults to true
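
Both forms of encode_input/0 are accepted, for example:

    # A single sequence.
    {:ok, encoding} = Tokenizers.Tokenizer.encode(tokenizer, "Hello, world!")

    # A pair of sequences, e.g. for sentence-pair tasks.
    {:ok, pair} =
      Tokenizers.Tokenizer.encode(tokenizer, {"A question?", "An answer."})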

encode_batch(tokenizer, input, opts \\ [])

@spec encode_batch(t(), [encode_input()], keyword()) ::
  {:ok, [Tokenizers.Encoding.t()]} | {:error, term()}

Batched version of encode/3.
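
For example, mixing single sequences and pairs in one batch:

    {:ok, encodings} =
      Tokenizers.Tokenizer.encode_batch(tokenizer, [
        "First sequence",
        {"A question?", "An answer."}
      ])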

id_to_token(tokenizer, id)

@spec id_to_token(t(), integer()) :: String.t() | nil

Converts a given id to its token.

token_to_id(tokenizer, token)

@spec token_to_id(t(), String.t()) :: non_neg_integer() | nil

Converts a given token to its id.
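
The two conversions mirror each other, and both return nil when the lookup misses. A sketch (the token is illustrative):

    # nil if "hello" is not in the vocabulary.
    id = Tokenizers.Tokenizer.token_to_id(tokenizer, "hello")

    # Round-trip back to the token (assuming the lookup above hit).
    "hello" = Tokenizers.Tokenizer.id_to_token(tokenizer, id)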

Configuration

add_special_tokens(tokenizer, tokens)

@spec add_special_tokens(tokenizer :: t(), tokens :: [String.t()]) ::
  non_neg_integer()

Adds special tokens to the tokenizer's vocabulary.

These tokens are special. To add regular tokens, use add_tokens/2.

add_tokens(tokenizer, tokens)

@spec add_tokens(tokenizer :: t(), tokens :: [String.t()]) :: non_neg_integer()

Adds tokens to the tokenizer's vocabulary.

These tokens are not special. To add special tokens, use add_special_tokens/2.
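
A sketch of both calls; per the specs they return a non-negative integer, which in the underlying Hugging Face tokenizers is the number of tokens actually added (the token names are illustrative):

    # Regular tokens.
    num_added = Tokenizers.Tokenizer.add_tokens(tokenizer, ["foo", "bar"])

    # Special tokens.
    num_special = Tokenizers.Tokenizer.add_special_tokens(tokenizer, ["[CUSTOM]"])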

disable_padding(tokenizer)

@spec disable_padding(tokenizer :: t()) :: t()

Disables padding on the tokenizer.

disable_truncation(tokenizer)

@spec disable_truncation(t()) :: t()

Disables truncation on the tokenizer.

get_decoder(tokenizer)

@spec get_decoder(t()) :: Tokenizers.Decoder.t() | nil

Returns the decoder currently used by the tokenizer.

get_model(tokenizer)

@spec get_model(t()) :: Tokenizers.Model.t()

Returns the model currently used by the tokenizer.

get_normalizer(tokenizer)

@spec get_normalizer(t()) :: Tokenizers.Normalizer.t() | nil

Returns the normalizer currently used by the tokenizer.

get_post_processor(tokenizer)

@spec get_post_processor(t()) :: Tokenizers.PostProcessor.t() | nil

Returns the post-processor currently used by the tokenizer.

get_pre_tokenizer(tokenizer)

@spec get_pre_tokenizer(t()) :: Tokenizers.PreTokenizer.t() | nil

Returns the pre-tokenizer currently used by the tokenizer.

get_vocab(tokenizer, opts \\ [])

@spec get_vocab(
  t(),
  keyword()
) :: %{required(String.t()) => integer()}

Returns the tokenizer's vocabulary as a map of token to id.

Options

  • :with_added_tokens - whether to include the tokens explicitly added to the tokenizer. Defaults to true
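
For example, a direct lookup in the returned map (the token is illustrative):

    vocab = Tokenizers.Tokenizer.get_vocab(tokenizer)

    # nil if the token is not in the vocabulary.
    id = Map.get(vocab, "hello")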

get_vocab_size(tokenizer, opts \\ [])

@spec get_vocab_size(
  t(),
  keyword()
) :: non_neg_integer()

Returns the number of tokens in the vocabulary.

Options

  • :with_added_tokens - whether to include the tokens explicitly added to the tokenizer. Defaults to true

init(model)

@spec init(Tokenizers.Model.t()) :: {:ok, t()} | {:error, any()}

Instantiates a new tokenizer from an existing model.
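
A minimal sketch, assuming a Tokenizers.Model.t() obtained elsewhere, for instance via get_model/1 on an existing tokenizer:

    # Reuse the model of an existing tokenizer (illustrative source).
    model = Tokenizers.Tokenizer.get_model(existing_tokenizer)
    {:ok, tokenizer} = Tokenizers.Tokenizer.init(model)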

set_decoder(tokenizer, decoder)

@spec set_decoder(t(), Tokenizers.Decoder.t()) :: t()

Sets the tokenizer's decoder.

set_model(tokenizer, model)

@spec set_model(t(), Tokenizers.Model.t()) :: t()

Sets the tokenizer's model.

set_normalizer(tokenizer, normalizer)

@spec set_normalizer(t(), Tokenizers.Normalizer.t()) :: t()

Sets the tokenizer's normalizer.

set_padding(tokenizer, opts)

@spec set_padding(tokenizer :: t(), opts) :: t()
when opts: [
       strategy: :batch_longest | {:fixed, non_neg_integer()},
       direction: :left | :right,
       pad_to_multiple_of: non_neg_integer(),
       pad_id: non_neg_integer(),
       pad_type_id: non_neg_integer(),
       pad_token: String.t()
     ]

Configures the tokenizer with padding.

To disable padding, use disable_padding/1.

Options

  • :strategy (default: :batch_longest) - the strategy to use when padding

  • :direction (default: :right) - the direction to use when padding

  • :pad_to_multiple_of (default: 0) - the multiple to pad to

  • :pad_id (default: 0) - the id of the token to use for padding

  • :pad_type_id (default: 0) - the id of the token type to use for padding

  • :pad_token (default: "[PAD]") - the token to use for padding
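
For example, padding every encoding to a fixed length (the values are illustrative):

    tokenizer =
      Tokenizers.Tokenizer.set_padding(tokenizer,
        strategy: {:fixed, 128},
        direction: :right,
        pad_token: "[PAD]"
      )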

set_post_processor(tokenizer, post_processor)

@spec set_post_processor(t(), Tokenizers.PostProcessor.t()) :: t()

Sets the tokenizer's post-processor.

set_pre_tokenizer(tokenizer, pre_tokenizer)

@spec set_pre_tokenizer(t(), Tokenizers.PreTokenizer.t()) :: t()

Sets the tokenizer's pre-tokenizer.

set_truncation(tokenizer, opts \\ [])

@spec set_truncation(t(), opts) :: t()
when opts: [
       max_length: non_neg_integer(),
       stride: non_neg_integer(),
       strategy: :longest_first | :only_first | :only_second,
       direction: :left | :right
     ]

Configures the tokenizer with truncation.

To disable truncation, use disable_truncation/1.

Options

  • :max_length (default: 512) - the maximum length to truncate the model's input to

  • :stride (default: 0) - the stride to use when overflowing the model's input

  • :strategy (default: :longest_first) - the strategy to use when overflowing the model's input

  • :direction (default: :right) - the direction to use when overflowing the model's input
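
For example, truncating to 512 tokens with a small overflow stride (the values are illustrative):

    tokenizer =
      Tokenizers.Tokenizer.set_truncation(tokenizer,
        max_length: 512,
        stride: 32,
        strategy: :longest_first
      )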

Training

train_from_files(tokenizer, paths, opts \\ [])

@spec train_from_files(t(), [String.t()], keyword()) :: {:ok, t()} | {:error, term()}

Trains the tokenizer on the given files.

Options

  • :trainer - the trainer to use. Defaults to the default trainer corresponding to the tokenizer's model
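
A minimal sketch using the default trainer (the file paths are illustrative):

    {:ok, tokenizer} =
      Tokenizers.Tokenizer.train_from_files(tokenizer, [
        "corpus/part_1.txt",
        "corpus/part_2.txt"
      ])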