Tokenizers.Tokenizer (Tokenizers v0.3.2)

The struct and associated functions for a tokenizer.

A Tokenizers.Tokenizer.t() is a container that holds the constituent parts of the tokenization pipeline.

When you call Tokenizers.Tokenizer.encode/3, the input text goes through the following pipeline:

  • normalization
  • pre-tokenization
  • model
  • post-processing

This returns a Tokenizers.Encoding.t(), which can then give you the token ids for each token in the input text. These token ids are usually used as the input for natural language processing machine learning models.
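
For illustration, a minimal end-to-end sketch. The "bert-base-cased" identifier is an assumption; any tokenizer available on the Hugging Face Hub works:

    # Hypothetical model id; substitute any tokenizer from the Hub.
    {:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("bert-base-cased")
    {:ok, encoding} = Tokenizers.Tokenizer.encode(tokenizer, "Hello, world!")
    # The ids are a list of integers, ready to feed to a model.
    ids = Tokenizers.Encoding.get_ids(encoding)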

Summary

Types

encode_input()

An input subject to tokenization.

t()

Functions

decode(tokenizer, ids, opts \\ [])

Decode the given list of ids or list of lists of ids back to strings.

encode(tokenizer, input, opts \\ [])

Encode the given sequence or batch of sequences to a Tokenizers.Encoding.t().

from_file(path, opts \\ [])

Instantiate a new tokenizer from the file at the given path.

from_pretrained(identifier, opts \\ [])

Instantiate a new tokenizer from an existing file on the Hugging Face Hub.

get_model(tokenizer)

Get the tokenizer's model.

get_vocab(tokenizer)

Get the tokenizer's vocabulary as a map of token to id.

get_vocab_size(tokenizer)

Get the number of tokens in the vocabulary.

id_to_token(tokenizer, id)

Convert a given id to its token.

save(tokenizer, path)

Save the tokenizer to the provided path.

token_to_id(tokenizer, token)

Convert a given token to its id.

Types

@type encode_input() :: String.t() | {String.t(), String.t()}

An input subject to tokenization.

Can be either a single sequence or a pair of sequences.

@type t() :: %Tokenizers.Tokenizer{reference: reference(), resource: binary()}

Functions

decode(tokenizer, ids, opts \\ [])
@spec decode(Tokenizer.t(), non_neg_integer() | [non_neg_integer()], Keyword.t()) ::
  {:ok, String.t() | [String.t()]} | {:error, term()}

Decode the given list of ids or list of lists of ids back to strings.

Options

  • :skip_special_tokens - whether the special tokens should be removed from the decoded string. Defaults to true.
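
A sketch, assuming tokenizer and ids come from an earlier encode/3 call:

    {:ok, text} = Tokenizers.Tokenizer.decode(tokenizer, ids)
    # Keep special tokens (e.g. separators) in the decoded string:
    {:ok, raw} = Tokenizers.Tokenizer.decode(tokenizer, ids, skip_special_tokens: false)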
encode(tokenizer, input, opts \\ [])
@spec encode(Tokenizer.t(), encode_input() | [encode_input()], Keyword.t()) ::
  {:ok, Encoding.t() | [Encoding.t()]} | {:error, term()}

Encode the given sequence or batch of sequences to a Tokenizers.Encoding.t().

Options

  • :add_special_tokens - whether to add special tokens to the encoding. Defaults to true.
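
A sketch of the three input shapes, assuming a loaded tokenizer:

    # Single sequence
    {:ok, encoding} = Tokenizers.Tokenizer.encode(tokenizer, "Hello, world!")
    # Pair of sequences, e.g. for sentence-pair tasks
    {:ok, pair} = Tokenizers.Tokenizer.encode(tokenizer, {"A question?", "An answer."})
    # Batch: a list of inputs yields a list of encodings
    {:ok, encodings} = Tokenizers.Tokenizer.encode(tokenizer, ["First text", "Second text"])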
from_file(path, opts \\ [])
@spec from_file(String.t(), Keyword.t()) :: {:ok, Tokenizer.t()} | {:error, term()}

Instantiate a new tokenizer from the file at the given path.

Options

  • :additional_special_tokens - A list of special tokens to append to the tokenizer. Defaults to [].
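
A sketch, with a hypothetical path to a tokenizer.json file (such as one produced by save/2 or exported from Hugging Face tokenizers):

    {:ok, tokenizer} = Tokenizers.Tokenizer.from_file("/path/to/tokenizer.json")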
from_pretrained(identifier, opts \\ [])
@spec from_pretrained(String.t(), Keyword.t()) ::
  {:ok, Tokenizer.t()} | {:error, term()}

Instantiate a new tokenizer from an existing file on the Hugging Face Hub.

This downloads a tokenizer file, saves it to disk, and loads that file.

Options

  • :http_client - A tuple with a module and its options. The module must implement a request/1 function that accepts a keyword list of request options. This is inspired by Req.request/1: https://hexdocs.pm/req/Req.html#request/1

    The default HTTP client config is: {Tokenizers.HTTPClient, []}. Since it's inspired by Req, it's possible to use that client without any adjustments.

    When making a request, the :url and :method options are overridden. :headers contains a "user-agent" header set by default.

  • :revision - The revision name to use when fetching the tokenizer from Hugging Face.

  • :use_cache - Whether to read from the cache when the file already exists. Defaults to true.

  • :cache_dir - The directory where the cache is saved. Files are written to the cache even if :use_cache is false. By default it uses :filename.basedir/3 with the application name "tokenizers_elixir" to pick a cache directory.

  • :additional_special_tokens - A list of special tokens to append to the tokenizer. Defaults to [].
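
A sketch of a download with explicit options; the identifier and revision are assumptions:

    {:ok, tokenizer} =
      Tokenizers.Tokenizer.from_pretrained("bert-base-cased",
        # A branch, tag, or commit on the Hub.
        revision: "main",
        # Skip the cache and re-download.
        use_cache: false
      )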

get_model(tokenizer)

@spec get_model(Tokenizer.t()) :: Tokenizers.Model.t()

Get the tokenizer's model.

get_vocab(tokenizer)

@spec get_vocab(Tokenizer.t()) :: %{required(binary()) => integer()}

Get the tokenizer's vocabulary as a map of token to id.

get_vocab_size(tokenizer)
@spec get_vocab_size(Tokenizer.t()) :: non_neg_integer()

Get the number of tokens in the vocabulary.
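
A sketch of inspecting the vocabulary, assuming a loaded tokenizer:

    vocab = Tokenizers.Tokenizer.get_vocab(tokenizer)
    # vocab maps tokens to ids, e.g. Map.get(vocab, "hello") returns an id or nil.
    size = Tokenizers.Tokenizer.get_vocab_size(tokenizer)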

id_to_token(tokenizer, id)
@spec id_to_token(Tokenizer.t(), integer()) :: String.t()

Convert a given id to its token.

save(tokenizer, path)

@spec save(Tokenizer.t(), String.t()) :: {:ok, String.t()} | {:error, term()}

Save the tokenizer to the provided path.
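
A sketch of saving and reloading; the path is hypothetical:

    {:ok, path} = Tokenizers.Tokenizer.save(tokenizer, "/tmp/tokenizer.json")
    {:ok, reloaded} = Tokenizers.Tokenizer.from_file(path)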

token_to_id(tokenizer, token)
@spec token_to_id(Tokenizer.t(), binary()) :: non_neg_integer()

Convert a given token to its id.
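
A sketch of a round trip between the two lookups, assuming "hello" is in the vocabulary:

    id = Tokenizers.Tokenizer.token_to_id(tokenizer, "hello")
    # id_to_token/2 inverts the lookup, returning "hello" again.
    token = Tokenizers.Tokenizer.id_to_token(tokenizer, id)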