Tokenizers.Tokenizer (Tokenizers v0.3.2)

The struct and associated functions for a tokenizer.

A Tokenizers.Tokenizer.t() is a container that holds the constituent parts of the tokenization pipeline.

When you call Tokenizers.Tokenizer.encode/3, the input text goes through the following pipeline:

  • normalization
  • pre-tokenization
  • model
  • post-processing

This returns a Tokenizers.Encoding.t(), which can then give you the token ids for each token in the input text. These token ids are usually used as the input for natural language processing machine learning models.
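
For illustration, a minimal end-to-end sketch. The "bert-base-cased" identifier is an assumption; any tokenizer available on the Hugging Face Hub works:

    # Hypothetical model id; substitute any tokenizer from the Hub.
    {:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("bert-base-cased")
    {:ok, encoding} = Tokenizers.Tokenizer.encode(tokenizer, "Hello, world!")
    # The ids are a list of integers, ready to feed to a model.
    ids = Tokenizers.Encoding.get_ids(encoding)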

Summary

Types

encode_input()

An input subject to tokenization.

t()

Functions

decode(tokenizer, ids, opts \\ [])

Decode the given list of ids or list of lists of ids back to strings.

encode(tokenizer, input, opts \\ [])

Encode the given sequence or batch of sequences to a Tokenizers.Encoding.t().

from_file(path, opts \\ [])

Instantiate a new tokenizer from the file at the given path.

from_pretrained(identifier, opts \\ [])

Instantiate a new tokenizer from an existing file on the Hugging Face Hub.

get_model(tokenizer)

Get the tokenizer's model.

get_vocab(tokenizer)

Get the tokenizer's vocabulary as a map of token to id.

get_vocab_size(tokenizer)

Get the number of tokens in the vocabulary.

id_to_token(tokenizer, id)

Convert a given id to its token.

save(tokenizer, path)

Save the tokenizer to the provided path.

token_to_id(tokenizer, token)

Convert a given token to its id.

Types

@type encode_input() :: String.t() | {String.t(), String.t()}

An input subject to tokenization.

Can be either a single sequence or a pair of sequences.

@type t() :: %Tokenizers.Tokenizer{reference: reference(), resource: binary()}

Functions

decode(tokenizer, ids, opts \\ [])
@spec decode(Tokenizer.t(), non_neg_integer() | [non_neg_integer()], Keyword.t()) ::
  {:ok, String.t() | [String.t()]} | {:error, term()}

Decode the given list of ids or list of lists of ids back to strings.

Options

  • :skip_special_tokens - whether the special tokens should be removed from the decoded string. Defaults to true.
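
A sketch, assuming tokenizer and ids come from an earlier encode/3 call:

    {:ok, text} = Tokenizers.Tokenizer.decode(tokenizer, ids)
    # Keep special tokens (e.g. separators) in the decoded string:
    {:ok, raw} = Tokenizers.Tokenizer.decode(tokenizer, ids, skip_special_tokens: false)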
encode(tokenizer, input, opts \\ [])
@spec encode(Tokenizer.t(), encode_input() | [encode_input()], Keyword.t()) ::
  {:ok, Encoding.t() | [Encoding.t()]} | {:error, term()}

Encode the given sequence or batch of sequences to a Tokenizers.Encoding.t().

Options

  • :add_special_tokens - whether to add special tokens to the encoding. Defaults to true.
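
A sketch of the three input shapes, assuming a loaded tokenizer:

    # Single sequence
    {:ok, encoding} = Tokenizers.Tokenizer.encode(tokenizer, "Hello, world!")
    # Pair of sequences, e.g. for sentence-pair tasks
    {:ok, pair} = Tokenizers.Tokenizer.encode(tokenizer, {"A question?", "An answer."})
    # Batch: a list of inputs yields a list of encodings
    {:ok, encodings} = Tokenizers.Tokenizer.encode(tokenizer, ["First text", "Second text"])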
from_file(path, opts \\ [])
@spec from_file(String.t(), Keyword.t()) :: {:ok, Tokenizer.t()} | {:error, term()}

Instantiate a new tokenizer from the file at the given path.

Options

  • :additional_special_tokens - A list of special tokens to append to the tokenizer. Defaults to [].
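
A sketch, with a hypothetical path to a tokenizer.json file (such as one produced by save/2 or exported from Hugging Face tokenizers):

    {:ok, tokenizer} = Tokenizers.Tokenizer.from_file("/path/to/tokenizer.json")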
from_pretrained(identifier, opts \\ [])
@spec from_pretrained(String.t(), Keyword.t()) ::
  {:ok, Tokenizer.t()} | {:error, term()}

Instantiate a new tokenizer from an existing file on the Hugging Face Hub.

This downloads a tokenizer file, saves it to disk, and loads that file.

Options

  • :http_client - A tuple with a module and its options. The module must implement a request/1 function that accepts a keyword list of request options. This is inspired by Req.request/1: https://hexdocs.pm/req/Req.html#request/1

    The default HTTP client config is: {Tokenizers.HTTPClient, []}. Since it's inspired by Req, it's possible to use that client without any adjustments.

    When making a request, the :url and :method options are overridden. :headers contains a "user-agent" header set by default.

  • :revision - The revision name to use when fetching the tokenizer from Hugging Face.

  • :use_cache - Whether to read from the cache when the file already exists. Defaults to true.

  • :cache_dir - The directory where the cache is saved. Files are written to the cache even if :use_cache is false. By default it uses :filename.basedir/3 with the application name "tokenizers_elixir" to pick a cache directory.

  • :additional_special_tokens - A list of special tokens to append to the tokenizer. Defaults to [].
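
A sketch of a download with explicit options; the identifier and revision are assumptions:

    {:ok, tokenizer} =
      Tokenizers.Tokenizer.from_pretrained("bert-base-cased",
        # A branch, tag, or commit on the Hub.
        revision: "main",
        # Skip the cache and re-download.
        use_cache: false
      )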

get_model(tokenizer)

@spec get_model(Tokenizer.t()) :: Tokenizers.Model.t()

Get the tokenizer's model.

get_vocab(tokenizer)

@spec get_vocab(Tokenizer.t()) :: %{required(binary()) => integer()}

Get the tokenizer's vocabulary as a map of token to id.

get_vocab_size(tokenizer)
@spec get_vocab_size(Tokenizer.t()) :: non_neg_integer()

Get the number of tokens in the vocabulary.
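
A sketch of inspecting the vocabulary, assuming a loaded tokenizer:

    vocab = Tokenizers.Tokenizer.get_vocab(tokenizer)
    # vocab maps tokens to ids, e.g. Map.get(vocab, "hello") returns an id or nil.
    size = Tokenizers.Tokenizer.get_vocab_size(tokenizer)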

id_to_token(tokenizer, id)
@spec id_to_token(Tokenizer.t(), integer()) :: String.t()

Convert a given id to its token.

save(tokenizer, path)

@spec save(Tokenizer.t(), String.t()) :: {:ok, String.t()} | {:error, term()}

Save the tokenizer to the provided path.
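
A sketch of saving and reloading; the path is hypothetical:

    {:ok, path} = Tokenizers.Tokenizer.save(tokenizer, "/tmp/tokenizer.json")
    {:ok, reloaded} = Tokenizers.Tokenizer.from_file(path)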

token_to_id(tokenizer, token)
@spec token_to_id(Tokenizer.t(), binary()) :: non_neg_integer()

Convert a given token to its id.
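
A sketch of a round trip between the two lookups, assuming "hello" is in the vocabulary:

    id = Tokenizers.Tokenizer.token_to_id(tokenizer, "hello")
    # id_to_token/2 inverts the lookup, returning "hello" again.
    token = Tokenizers.Tokenizer.id_to_token(tokenizer, id)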