Tokenizers.Tokenizer (Tokenizers v0.4.0)

Functions to load, apply and train tokenizers.

The Tokenizers.Tokenizer.t/0 struct represents the tokenization pipeline. When you call Tokenizers.Tokenizer.encode/3, the input text goes through the following steps:

  • normalization
  • pre-tokenization
  • model
  • post-processing

This pipeline returns a Tokenizers.Encoding.t/0, which can then give you the token ids representing the input text. These token ids are usually used as the input for natural language processing (NLP) machine learning models.
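
For illustration, a minimal end-to-end sketch (it assumes network access and uses the public "bert-base-cased" repository as an arbitrary example):

    # Load a pretrained tokenizer, encode a sentence, and read the token ids.
    {:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("bert-base-cased")
    {:ok, encoding} = Tokenizers.Tokenizer.encode(tokenizer, "Hello there!")
    ids = Tokenizers.Encoding.get_ids(encoding)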

Summary

Types

encode_input()
An input subject to tokenization.

t()

Loading

from_buffer(data, opts \\ [])
Instantiates a new tokenizer from the given buffer.

from_file(path, opts \\ [])
Instantiates a new tokenizer from the file at the given path.

from_pretrained(identifier, opts \\ [])
Loads a new tokenizer from a repository on the Hugging Face Hub.

save(tokenizer, path, opts \\ [])
Saves the tokenizer to the provided path.

Inference

decode(tokenizer, ids, opts \\ [])
Decodes the given list of ids back to a string.

decode_batch(tokenizer, sentences, opts \\ [])
Batched version of decode/3.

encode(tokenizer, input, opts \\ [])
Encodes the given sequence to a Tokenizers.Encoding.t().

encode_batch(tokenizer, input, opts \\ [])
Batched version of encode/3.

id_to_token(tokenizer, id)
Converts a given id to its token.

token_to_id(tokenizer, token)
Converts a given token to its id.

Configuration

add_special_tokens(tokenizer, tokens)
Adds special tokens to the tokenizer's vocabulary.

add_tokens(tokenizer, tokens)
Adds tokens to the tokenizer's vocabulary.

disable_padding(tokenizer)
Disables padding on the tokenizer.

disable_truncation(tokenizer)
Disables truncation on the tokenizer.

get_decoder(tokenizer)
Returns the decoder currently used by the tokenizer.

get_model(tokenizer)
Returns the model currently used by the tokenizer.

get_normalizer(tokenizer)
Returns the normalizer currently used by the tokenizer.

get_post_processor(tokenizer)
Returns the post-processor currently used by the tokenizer.

get_pre_tokenizer(tokenizer)
Returns the pre-tokenizer currently used by the tokenizer.

get_vocab(tokenizer, opts \\ [])
Gets the tokenizer's vocabulary as a map of token to id.

get_vocab_size(tokenizer, opts \\ [])
Gets the number of tokens in the vocabulary.

init(model)
Instantiates a new tokenizer from an existing model.

set_decoder(tokenizer, decoder)
Sets the tokenizer's decoder.

set_model(tokenizer, model)
Sets the tokenizer's model.

set_normalizer(tokenizer, normalizer)
Sets the tokenizer's normalizer.

set_padding(tokenizer, opts)
Configures the tokenizer with padding.

set_post_processor(tokenizer, post_processor)
Sets the tokenizer's post-processor.

set_pre_tokenizer(tokenizer, pre_tokenizer)
Sets the tokenizer's pre-tokenizer.

set_truncation(tokenizer, opts \\ [])
Configures the tokenizer with truncation.

Training

train_from_files(tokenizer, paths, opts \\ [])
Trains the tokenizer on the given files.

Types

@type encode_input() :: String.t() | {String.t(), String.t()}

An input subject to tokenization.

Can be either a single sequence or a pair of sequences.
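
For example (both values are arbitrary sample inputs):

    "Hello there!"                                            # a single sequence
    {"What is Elixir?", "A dynamic, functional language."}    # a pair of sequences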

@type t() :: %Tokenizers.Tokenizer{resource: reference()}

Loading

from_buffer(data, opts \\ [])

@spec from_buffer(
  data :: String.t(),
  keyword()
) :: {:ok, t()} | {:error, term()}

Instantiates a new tokenizer from the given buffer.

from_file(path, opts \\ [])

@spec from_file(
  path :: String.t(),
  keyword()
) :: {:ok, t()} | {:error, term()}

Instantiates a new tokenizer from the file at the given path.

from_pretrained(identifier, opts \\ [])

@spec from_pretrained(String.t(), Keyword.t()) :: {:ok, t()} | {:error, term()}

Loads a new tokenizer from a repository on the Hugging Face Hub.

This downloads the tokenizer file, saves it to disk, and loads it from there.

Options

  • :http_client - a tuple with a module and options. The module must implement a request/1 function that accepts a keyword list of request options. This contract is inspired by Req.request/1: https://hexdocs.pm/req/Req.html#request/1

    The default HTTP client configuration is {Tokenizers.HTTPClient, []}. Since the contract mirrors Req's, Req itself can be used as the client without any adjustments.

    When making a request, the :url and :method options are overridden. :headers contains a "user-agent" entry set by default.

  • :revision - the revision name to use when fetching the tokenizer from the Hugging Face repository

  • :use_cache - whether to read from the cache when the file already exists. Defaults to true

  • :cache_dir - the directory where the cache is saved. Files are written to the cache even if :use_cache is false. By default it uses :filename.basedir/3 to get a cache directory based on the "tokenizers_elixir" application name
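
For example, a sketch that pins a revision and bypasses the cache ("bert-base-cased" and "main" are arbitrary sample values):

    {:ok, tokenizer} =
      Tokenizers.Tokenizer.from_pretrained("bert-base-cased",
        revision: "main",
        use_cache: false
      )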

save(tokenizer, path, opts \\ [])

Saves the tokenizer to the provided path.

Options

  • :pretty - whether to pretty print the JSON file. Defaults to true
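
For example (the target path is an arbitrary choice):

    # Write the tokenizer to disk as compact, single-line JSON.
    Tokenizers.Tokenizer.save(tokenizer, "/tmp/tokenizer.json", pretty: false)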

Inference

decode(tokenizer, ids, opts \\ [])

@spec decode(t(), [non_neg_integer()], keyword()) ::
  {:ok, String.t()} | {:error, term()}

Decodes the given list of ids back to a string.

Options

  • :skip_special_tokens - whether to exclude special tokens from the decoded string. Defaults to true
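
For example, round-tripping the ids produced by encode/3 (sketch):

    {:ok, encoding} = Tokenizers.Tokenizer.encode(tokenizer, "Hello there!")
    ids = Tokenizers.Encoding.get_ids(encoding)
    {:ok, text} = Tokenizers.Tokenizer.decode(tokenizer, ids, skip_special_tokens: true)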

decode_batch(tokenizer, sentences, opts \\ [])

@spec decode_batch(t(), [[non_neg_integer()]], keyword()) ::
  {:ok, [String.t()]} | {:error, term()}

Batched version of decode/3.

encode(tokenizer, input, opts \\ [])

@spec encode(t(), encode_input(), keyword()) ::
  {:ok, Tokenizers.Encoding.t()} | {:error, term()}

Encodes the given sequence to a Tokenizers.Encoding.t().

Options

  • :add_special_tokens - whether to add special tokens to the sequence. Defaults to true
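
For example, encoding a single sequence and a pair of sequences (sketch):

    {:ok, single} = Tokenizers.Tokenizer.encode(tokenizer, "Hello there!")

    {:ok, pair} =
      Tokenizers.Tokenizer.encode(tokenizer, {"What is Elixir?", "A dynamic, functional language."})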

encode_batch(tokenizer, input, opts \\ [])

@spec encode_batch(t(), [encode_input()], keyword()) ::
  {:ok, [Tokenizers.Encoding.t()]} | {:error, term()}

Batched version of encode/3.
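
For example, a sketch of a batched call mixing single sequences and pairs:

    {:ok, encodings} =
      Tokenizers.Tokenizer.encode_batch(tokenizer, [
        "Hello there!",
        {"What is Elixir?", "A dynamic, functional language."}
      ])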

id_to_token(tokenizer, id)

@spec id_to_token(t(), integer()) :: String.t() | nil

Converts a given id to its token.

token_to_id(tokenizer, token)

@spec token_to_id(t(), String.t()) :: non_neg_integer() | nil

Converts a given token to its id.
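
For example, assuming a BERT-style vocabulary in which "[CLS]" happens to map to id 101:

    Tokenizers.Tokenizer.token_to_id(tokenizer, "[CLS]")
    #=> 101

    Tokenizers.Tokenizer.id_to_token(tokenizer, 101)
    #=> "[CLS]"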

Configuration

add_special_tokens(tokenizer, tokens)

@spec add_special_tokens(tokenizer :: t(), tokens :: [String.t()]) ::
  non_neg_integer()

Adds special tokens to the tokenizer's vocabulary.

These tokens are treated as special, for example they can be skipped during decoding (see the :skip_special_tokens option of decode/3). To add regular tokens, use add_tokens/2.

add_tokens(tokenizer, tokens)

@spec add_tokens(tokenizer :: t(), tokens :: [String.t()]) :: non_neg_integer()

Adds tokens to the tokenizer's vocabulary.

These tokens are not special. To add special tokens, use add_special_tokens/2.
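
For example (the token strings are arbitrary):

    # Returns the number of tokens that were added to the vocabulary.
    num_added = Tokenizers.Tokenizer.add_tokens(tokenizer, ["<custom_a>", "<custom_b>"])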

disable_padding(tokenizer)

@spec disable_padding(tokenizer :: t()) :: t()

Disables padding on the tokenizer.

disable_truncation(tokenizer)

@spec disable_truncation(t()) :: t()

Disables truncation on the tokenizer.

get_decoder(tokenizer)

@spec get_decoder(t()) :: Tokenizers.Decoder.t() | nil

Returns the decoder currently used by the tokenizer.

get_model(tokenizer)

@spec get_model(t()) :: Tokenizers.Model.t()

Returns the model currently used by the tokenizer.

get_normalizer(tokenizer)

@spec get_normalizer(t()) :: Tokenizers.Normalizer.t() | nil

Returns the normalizer currently used by the tokenizer.

get_post_processor(tokenizer)

@spec get_post_processor(t()) :: Tokenizers.PostProcessor.t() | nil

Returns the post-processor currently used by the tokenizer.

get_pre_tokenizer(tokenizer)

@spec get_pre_tokenizer(t()) :: Tokenizers.PreTokenizer.t() | nil

Returns the pre-tokenizer currently used by the tokenizer.

get_vocab(tokenizer, opts \\ [])

@spec get_vocab(
  t(),
  keyword()
) :: %{required(String.t()) => integer()}

Gets the tokenizer's vocabulary as a map of token to id.

Options

  • :with_added_tokens - whether to include the tokens explicitly added to the tokenizer. Defaults to true
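
For example, looking up the id of a token (sketch):

    vocab = Tokenizers.Tokenizer.get_vocab(tokenizer)
    # The map is keyed by token, so lookups return the id (or nil when absent).
    id = vocab["hello"]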

get_vocab_size(tokenizer, opts \\ [])

@spec get_vocab_size(
  t(),
  keyword()
) :: non_neg_integer()

Gets the number of tokens in the vocabulary.

Options

  • :with_added_tokens - whether to include the tokens explicitly added to the tokenizer. Defaults to true

init(model)

@spec init(Tokenizers.Model.t()) :: {:ok, t()} | {:error, any()}

Instantiates a new tokenizer from an existing model.
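
A sketch, assuming a model constructed via Tokenizers.Model.BPE.empty/0 (any Tokenizers.Model.t() can be passed the same way):

    # Build an empty BPE model and wrap it in a fresh tokenizer.
    {:ok, model} = Tokenizers.Model.BPE.empty()
    {:ok, tokenizer} = Tokenizers.Tokenizer.init(model)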

set_decoder(tokenizer, decoder)

@spec set_decoder(t(), Tokenizers.Decoder.t()) :: t()

Sets the tokenizer's decoder.

set_model(tokenizer, model)

@spec set_model(t(), Tokenizers.Model.t()) :: t()

Sets the tokenizer's model.

set_normalizer(tokenizer, normalizer)

@spec set_normalizer(t(), Tokenizers.Normalizer.t()) :: t()

Sets the tokenizer's normalizer.

set_padding(tokenizer, opts)

@spec set_padding(tokenizer :: t(), opts) :: t()
when opts: [
       strategy: :batch_longest | {:fixed, non_neg_integer()},
       direction: :left | :right,
       pad_to_multiple_of: non_neg_integer(),
       pad_id: non_neg_integer(),
       pad_type_id: non_neg_integer(),
       pad_token: String.t()
     ]

Configures the tokenizer with padding.

To disable padding, use disable_padding/1.

Options

  • :strategy (default: :batch_longest) - the strategy to use when padding

  • :direction (default: :right) - the direction to use when padding

  • :pad_to_multiple_of (default: 0) - the multiple to pad to

  • :pad_id (default: 0) - the id of the token to use for padding

  • :pad_type_id (default: 0) - the id of the token type to use for padding

  • :pad_token (default: "[PAD]") - the token to use for padding
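
For example, padding every encoded sequence to a fixed length of 128 (the values are arbitrary):

    tokenizer =
      Tokenizers.Tokenizer.set_padding(tokenizer,
        strategy: {:fixed, 128},
        pad_token: "[PAD]"
      )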

set_post_processor(tokenizer, post_processor)

@spec set_post_processor(t(), Tokenizers.PostProcessor.t()) :: t()

Sets the tokenizer's post-processor.

set_pre_tokenizer(tokenizer, pre_tokenizer)

@spec set_pre_tokenizer(t(), Tokenizers.PreTokenizer.t()) :: t()

Sets the tokenizer's pre-tokenizer.

set_truncation(tokenizer, opts \\ [])

@spec set_truncation(t(), opts) :: t()
when opts: [
       max_length: non_neg_integer(),
       stride: non_neg_integer(),
       strategy: :longest_first | :only_first | :only_second,
       direction: :left | :right
     ]

Configures the tokenizer with truncation.

To disable truncation, use disable_truncation/1.

Options

  • :max_length (default: 512) - the maximum length to truncate the model's input to

  • :stride (default: 0) - the stride to use when overflowing the model's input

  • :strategy (default: :longest_first) - the strategy to use when overflowing the model's input

  • :direction (default: :right) - the direction to use when overflowing the model's input
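
For example, truncating inputs to at most 128 tokens (the values are arbitrary):

    tokenizer = Tokenizers.Tokenizer.set_truncation(tokenizer, max_length: 128, direction: :right)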

Training

train_from_files(tokenizer, paths, opts \\ [])

@spec train_from_files(t(), [String.t()], keyword()) :: {:ok, t()} | {:error, term()}

Trains the tokenizer on the given files.

Options

  • :trainer - the trainer to use. Defaults to the trainer corresponding to the tokenizer's model
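
For example, a sketch using the default trainer (the corpus paths are placeholders):

    {:ok, tokenizer} =
      Tokenizers.Tokenizer.train_from_files(tokenizer, ["corpus_a.txt", "corpus_b.txt"])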