Tokenizers.Encoding (Tokenizers v0.4.0)
Encoding is the result of passing a text through the tokenization pipeline.
This module defines a struct and a number of functions to retrieve information about the encoded text.
For further machine learning processing you most likely want to access the encoded token ids via get_ids/1. If you want to convert the ids to a tensor, use get_u32_ids/1 to get a zero-copy binary.
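As a minimal sketch of the typical flow (the model name here is just an example):

    # Load a pretrained tokenizer and encode some text
    {:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("bert-base-cased")
    {:ok, encoding} = Tokenizers.Tokenizer.encode(tokenizer, "Hello world!")

    # A plain list of token ids, ready for further processing
    ids = Tokenizers.Encoding.get_ids(encoding)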
Summary
Functions
Returns the token that contains the given char.
Returns the word that contains the given char.
Returns the attention mask from encoding.
Returns the ids from encoding.
Returns the number of tokens in encoding.
Returns the number of sequences combined in encoding.
Returns offsets from encoding.
Returns the overflow from encoding.
Returns sequence ids from encoding.
Returns the special tokens mask from encoding.
Returns the tokens from encoding.
Returns token type ids from encoding.
Same as get_attention_mask/1, but returns binary with u32 values.
Same as get_ids/1, but returns binary with u32 values.
Same as get_special_tokens_mask/1, but returns binary with u32 values.
Same as get_type_ids/1, but returns binary with u32 values.
Returns word ids from encoding.
Returns the number of tokens in encoding.
Pad the encoding to the given length.
Sets the given sequence id for all tokens contained in encoding.
Returns the offsets of the token at the given index.
Returns the index of the sequence containing the given token.
Returns the word that contains the token at the given index.
Performs a set of transformations on the given encoding, creating a new one. Transformations are applied in the order they are given.
Truncate the encoding to the given length.
Returns the offsets of the word at the given index in the input sequence.
Returns the encoded tokens corresponding to the word at the given index in the input sequence, with the form {start_token, end_token + 1}.
Types
@type padding_opts() :: [ pad_id: non_neg_integer(), pad_type_id: non_neg_integer(), pad_token: String.t(), direction: :left | :right ]
Padding configuration.
:direction - the padding direction. Defaults to :right.
:pad_id - the id corresponding to the padding token. Defaults to 0.
:pad_type_id - the type ID corresponding to the padding token. Defaults to 0.
:pad_token - the padding token to use. Defaults to "[PAD]".
@type t() :: %Tokenizers.Encoding{resource: reference()}
@type truncation_opts() :: [stride: non_neg_integer(), direction: :left | :right]
Truncation configuration.
:stride - the length of previous content to be included in each overflowing piece. Defaults to 0.
:direction - the truncation direction. Defaults to :right.
Functions
@spec char_to_token(t(), non_neg_integer(), non_neg_integer()) :: non_neg_integer() | nil
Returns the token that contains the given char.
@spec char_to_word(t(), non_neg_integer(), non_neg_integer()) :: non_neg_integer() | nil
Returns the word that contains the given char.
Returns the attention mask from encoding.
Returns the ids from encoding.
@spec get_length(t()) :: non_neg_integer()
Returns the number of tokens in encoding.
@spec get_n_sequences(t()) :: non_neg_integer()
Returns the number of sequences combined in encoding.
Returns offsets from encoding.
The offsets are expressed in terms of UTF-8 bytes.
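For illustration, a small sketch of mapping tokens back to the input string; since the offsets are byte-based, binary_part/3 is the appropriate way to slice (the accessor names get_tokens/1 and get_offsets/1 are assumed from the summaries above):

    text = "Hello world!"
    {:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("bert-base-cased")
    {:ok, encoding} = Tokenizers.Tokenizer.encode(tokenizer, text)

    tokens = Tokenizers.Encoding.get_tokens(encoding)
    offsets = Tokenizers.Encoding.get_offsets(encoding)

    # Slice the original text by byte offsets, not by characters
    for {token, {start, stop}} <- Enum.zip(tokens, offsets) do
      {token, binary_part(text, start, stop - start)}
    end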
Returns the overflow from encoding.
@spec get_sequence_ids(t()) :: [non_neg_integer() | nil]
Returns sequence ids from encoding.
Returns the special tokens mask from encoding.
Returns the tokens from encoding.
Returns token type ids from encoding.
Same as get_attention_mask/1, but returns binary with u32 values.
Same as get_ids/1, but returns binary with u32 values.
Same as get_special_tokens_mask/1, but returns binary with u32 values.
Same as get_type_ids/1, but returns binary with u32 values.
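These u32 variants pair well with Nx, because the returned binary can back a tensor without copying. A sketch, assuming Nx is available as a dependency:

    # encoding obtained as in the introductory sketch
    u32 = Tokenizers.Encoding.get_u32_ids(encoding)

    # Wraps the binary directly; no intermediate list is built
    tensor = Nx.from_binary(u32, :u32)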
@spec get_word_ids(t()) :: [non_neg_integer() | nil]
Returns word ids from encoding.
@spec n_tokens(encoding :: t()) :: non_neg_integer()
Returns the number of tokens in encoding.
@spec pad(t(), non_neg_integer(), opts :: padding_opts()) :: t()
Pad the encoding to the given length.
For available options see padding_opts/0.
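For example, padding (on the right) to a fixed length of 128 tokens:

    # encoding obtained as in the introductory sketch
    padded =
      Tokenizers.Encoding.pad(encoding, 128,
        pad_id: 0,
        pad_token: "[PAD]",
        direction: :right
      )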
@spec set_sequence_id(t(), non_neg_integer()) :: t()
Sets the given sequence id for all tokens contained in encoding.
@spec token_to_chars(t(), non_neg_integer()) :: {non_neg_integer(), {non_neg_integer(), non_neg_integer()}} | nil
Returns the offsets of the token at the given index.
@spec token_to_sequence(t(), non_neg_integer()) :: non_neg_integer() | nil
Returns the index of the sequence containing the given token.
@spec token_to_word(t(), non_neg_integer()) :: {non_neg_integer(), non_neg_integer()} | nil
Returns the word that contains the token at the given index.
Performs a set of transformations on the given encoding, creating a new one. Transformations are applied in the order they are given.
While all these transformations can be done one by one, this function is more efficient, as it avoids multiple allocations and garbage collection of intermediate encodings.
Check the Tokenizers.Encoding.Transformation module for handy functions that can be used to build the transformations list. You can also build this list manually, as long as it follows the expected format.
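A hedged sketch of such a list, assuming truncate/2 and pad/2 helpers in Tokenizers.Encoding.Transformation that mirror truncate/3 and pad/3 here:

    alias Tokenizers.Encoding.Transformation

    # Truncate first, then pad, in a single pass over the encoding
    encoding =
      Tokenizers.Encoding.transform(encoding, [
        Transformation.truncate(128, direction: :right),
        Transformation.pad(128, pad_token: "[PAD]")
      ])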
@spec truncate(t(), non_neg_integer(), opts :: truncation_opts()) :: t()
Truncate the encoding to the given length.
For available options see truncation_opts/0.
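For example, truncating to 128 tokens while keeping 16 tokens of overlap in each overflowing piece:

    # encoding obtained as in the introductory sketch
    truncated = Tokenizers.Encoding.truncate(encoding, 128, stride: 16, direction: :right)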
@spec word_to_chars(t(), non_neg_integer(), non_neg_integer()) :: {non_neg_integer(), non_neg_integer()} | nil
Returns the offsets of the word at the given index in the input sequence.
@spec word_to_tokens(t(), non_neg_integer(), non_neg_integer()) :: {non_neg_integer(), non_neg_integer()} | nil
Returns the encoded tokens corresponding to the word at the given index in the input sequence, with the form {start_token, end_token + 1}.
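A sketch of recovering the tokens of one word, assuming the second argument is the word index and the third the sequence index:

    # Tokens for word 1 of sequence 0; the end of the range is exclusive
    case Tokenizers.Encoding.word_to_tokens(encoding, 1, 0) do
      {start_token, end_token} ->
        encoding
        |> Tokenizers.Encoding.get_tokens()
        |> Enum.slice(start_token, end_token - start_token)

      nil ->
        # No such word in the encoding
        nil
    end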