Tokenizers.Encoding (Tokenizers v0.5.0)

Encoding is the result of passing a text through the tokenization pipeline.

This module defines a struct and a number of functions to retrieve information about the encoded text.

For further machine learning processing, you most likely want to access the encoded token ids via get_ids/1. If you want to convert the ids to a tensor, use get_u32_ids/1 to get a zero-copy binary.
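A minimal sketch, assuming a pretrained tokenizer (the model name is purely illustrative) and the Nx library for the tensor conversion:

    {:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("bert-base-cased")
    {:ok, encoding} = Tokenizers.Tokenizer.encode(tokenizer, "Hello there!")

    # Token ids as a plain list
    ids = Tokenizers.Encoding.get_ids(encoding)

    # Token ids as a zero-copy u32 binary, converted to a tensor
    tensor =
      encoding
      |> Tokenizers.Encoding.get_u32_ids()
      |> Nx.from_binary(:u32)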

Summary

Types

padding_opts()

Padding configuration.

t()

truncation_opts()

Truncation configuration.

Functions

char_to_token(encoding, position, seq_id)

Returns the token that contains the given char.

char_to_word(encoding, position, seq_id)

Returns the word that contains the given char.

get_attention_mask(encoding)

Returns the attention mask from encoding.

get_ids(encoding)

Returns the ids from encoding.

get_length(encoding)

Returns the number of tokens in encoding.

get_n_sequences(encoding)

Returns the number of sequences combined in encoding.

get_offsets(encoding)

Returns offsets from encoding.

get_overflowing(encoding)

Returns the overflow from encoding.

get_sequence_ids(encoding)

Returns sequence ids from encoding.

get_special_tokens_mask(encoding)

Returns the special tokens mask from encoding.

get_tokens(encoding)

Returns the tokens from encoding.

get_type_ids(encoding)

Returns token type ids from encoding.

get_u32_attention_mask(encoding)

Same as get_attention_mask/1, but returns binary with u32 values.

get_u32_ids(encoding)

Same as get_ids/1, but returns binary with u32 values.

get_u32_special_tokens_mask(encoding)

Same as get_special_tokens_mask/1, but returns binary with u32 values.

get_u32_type_ids(encoding)

Same as get_type_ids/1, but returns binary with u32 values.

get_word_ids(encoding)

Returns word ids from encoding.

n_tokens(encoding)

Returns the number of tokens in encoding.

pad(encoding, target_length, opts \\ [])

Pad the encoding to the given length.

set_sequence_id(encoding, id)

Sets the given sequence id for all tokens contained in encoding.

token_to_chars(encoding, token)

Returns the offsets of the token at the given index.

token_to_sequence(encoding, token)

Returns the index of the sequence containing the given token.

token_to_word(encoding, token)

Returns the word that contains the token at the given index.

transform(encoding, transformations)

Performs a set of transformations on the given encoding, creating a new one. Transformations are applied in the order they are given.

truncate(encoding, max_length, opts \\ [])

Truncate the encoding to the given length.

word_to_chars(encoding, word, seq_id)

Returns the offsets of the word at the given index in the input sequence.

word_to_tokens(encoding, word, seq_id)

Returns the encoded tokens corresponding to the word at the given index in the input sequence, with the form {start_token, end_token + 1}.

Types

@type padding_opts() :: [
  pad_id: non_neg_integer(),
  pad_type_id: non_neg_integer(),
  pad_token: String.t(),
  direction: :left | :right
]

Padding configuration.

  • :direction - the padding direction. Defaults to :right

  • :pad_id - the id corresponding to the padding token. Defaults to 0

  • :pad_type_id - the type ID corresponding to the padding token. Defaults to 0

  • :pad_token - the padding token to use. Defaults to "[PAD]"

@type t() :: %Tokenizers.Encoding{resource: reference()}
@type truncation_opts() :: [stride: non_neg_integer(), direction: :left | :right]

Truncation configuration.

  • :stride - the length of previous content to be included in each overflowing piece. Defaults to 0

  • :direction - the truncation direction. Defaults to :right

Functions

char_to_token(encoding, position, seq_id)
@spec char_to_token(t(), non_neg_integer(), non_neg_integer()) ::
  non_neg_integer() | nil

Returns the token that contains the given char.
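Continuing the sketch from the module introduction, a lookup of the token covering a given character position (the position is illustrative):

    # Token index covering the character at position 6 of sequence 0,
    # or nil if no token covers it (e.g. whitespace dropped by the tokenizer)
    Tokenizers.Encoding.char_to_token(encoding, 6, 0)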

char_to_word(encoding, position, seq_id)
@spec char_to_word(t(), non_neg_integer(), non_neg_integer()) ::
  non_neg_integer() | nil

Returns the word that contains the given char.

get_attention_mask(encoding)
@spec get_attention_mask(t()) :: [integer()]

Returns the attention mask from encoding.

get_ids(encoding)

@spec get_ids(t()) :: [integer()]

Returns the ids from encoding.

get_length(encoding)

@spec get_length(t()) :: non_neg_integer()

Returns the number of tokens in encoding.

get_n_sequences(encoding)
@spec get_n_sequences(t()) :: non_neg_integer()

Returns the number of sequences combined in encoding.

get_offsets(encoding)

@spec get_offsets(t()) :: [{integer(), integer()}]

Returns offsets from encoding.

The offsets are expressed in terms of UTF-8 bytes.
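Because the offsets are byte-based, each token's slice of the input can be recovered with binary_part/3. A sketch (special tokens such as [CLS] typically carry empty {0, 0} offsets):

    text = "Hello there!"
    {:ok, encoding} = Tokenizers.Tokenizer.encode(tokenizer, text)

    for {start, stop} <- Tokenizers.Encoding.get_offsets(encoding) do
      # The exact input bytes each token was produced from
      binary_part(text, start, stop - start)
    end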

get_overflowing(encoding)
@spec get_overflowing(t()) :: [t()]

Returns the overflow from encoding.

get_sequence_ids(encoding)
@spec get_sequence_ids(t()) :: [non_neg_integer() | nil]

Returns sequence ids from encoding.

get_special_tokens_mask(encoding)
@spec get_special_tokens_mask(t()) :: [integer()]

Returns the special tokens mask from encoding.

get_tokens(encoding)

@spec get_tokens(t()) :: [binary()]

Returns the tokens from encoding.

get_type_ids(encoding)

@spec get_type_ids(t()) :: [integer()]

Returns token type ids from encoding.

get_u32_attention_mask(encoding)
@spec get_u32_attention_mask(t()) :: binary()

Same as get_attention_mask/1, but returns binary with u32 values.

get_u32_ids(encoding)

@spec get_u32_ids(t()) :: binary()

Same as get_ids/1, but returns binary with u32 values.

get_u32_special_tokens_mask(encoding)
@spec get_u32_special_tokens_mask(t()) :: binary()

Same as get_special_tokens_mask/1, but returns binary with u32 values.

get_u32_type_ids(encoding)
@spec get_u32_type_ids(t()) :: binary()

Same as get_type_ids/1, but returns binary with u32 values.

get_word_ids(encoding)

@spec get_word_ids(t()) :: [non_neg_integer() | nil]

Returns word ids from encoding.

n_tokens(encoding)

@spec n_tokens(encoding :: t()) :: non_neg_integer()

Returns the number of tokens in encoding.

pad(encoding, target_length, opts \\ [])
@spec pad(t(), non_neg_integer(), opts :: padding_opts()) :: t()

Pad the encoding to the given length.

For available options see padding_opts/0.
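For example, right-padding to 10 tokens (the option values shown are the documented defaults):

    padded =
      Tokenizers.Encoding.pad(encoding, 10,
        pad_id: 0,
        pad_type_id: 0,
        pad_token: "[PAD]",
        direction: :right
      )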

set_sequence_id(encoding, id)
@spec set_sequence_id(t(), non_neg_integer()) :: t()

Sets the given sequence id for all tokens contained in encoding.

token_to_chars(encoding, token)
@spec token_to_chars(t(), non_neg_integer()) ::
  {non_neg_integer(), {non_neg_integer(), non_neg_integer()}} | nil

Returns the offsets of the token at the given index.

token_to_sequence(encoding, token)
@spec token_to_sequence(t(), non_neg_integer()) :: non_neg_integer() | nil

Returns the index of the sequence containing the given token.

token_to_word(encoding, token)
@spec token_to_word(t(), non_neg_integer()) ::
  {non_neg_integer(), non_neg_integer()} | nil

Returns the word that contains the token at the given index.

transform(encoding, transformations)

Performs a set of transformations on the given encoding, creating a new one. Transformations are applied in the order they are given.

While all these transformations can be done one by one, this function is more efficient, as it avoids multiple allocations and garbage collection of intermediate encodings.

Check the Tokenizers.Encoding.Transformation module for handy functions that can be used to build the transformations list. You can also build this list manually, as long as it follows the expected format.
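A sketch that truncates and then pads in one pass, assuming truncate/2 and pad/2 helpers in Tokenizers.Encoding.Transformation that mirror truncate/3 and pad/3 on this module (the target lengths are illustrative):

    alias Tokenizers.Encoding.Transformation

    transformed =
      Tokenizers.Encoding.transform(encoding, [
        Transformation.truncate(5),
        Transformation.pad(10)
      ])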

truncate(encoding, max_length, opts \\ [])
@spec truncate(t(), non_neg_integer(), opts :: truncation_opts()) :: t()

Truncate the encoding to the given length.

For available options see truncation_opts/0.
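For example, truncating to 5 tokens while carrying 2 tokens of overlap into each overflowing piece (any overflow can then be read back with get_overflowing/1):

    truncated = Tokenizers.Encoding.truncate(encoding, 5, stride: 2)
    Tokenizers.Encoding.get_overflowing(truncated)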

word_to_chars(encoding, word, seq_id)
@spec word_to_chars(t(), non_neg_integer(), non_neg_integer()) ::
  {non_neg_integer(), non_neg_integer()} | nil

Returns the offsets of the word at the given index in the input sequence.

word_to_tokens(encoding, word, seq_id)
@spec word_to_tokens(t(), non_neg_integer(), non_neg_integer()) ::
  {non_neg_integer(), non_neg_integer()} | nil

Returns the encoded tokens corresponding to the word at the given index in the input sequence, with the form {start_token, end_token + 1}.
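A sketch combining the word-level lookups for word 0 of sequence 0 (both calls return nil when the word is not found, which the matches below assume does not happen):

    # Token range covering the first word, as {start_token, end_token + 1}
    {token_start, token_stop} = Tokenizers.Encoding.word_to_tokens(encoding, 0, 0)

    # Byte offsets of the same word in the original input
    {byte_start, byte_stop} = Tokenizers.Encoding.word_to_chars(encoding, 0, 0)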