Tokenizers.Encoding (Tokenizers v0.4.0)
Encoding is the result of passing a text through the tokenization pipeline.
This module defines a struct and a number of functions to retrieve information about the encoded text.
For further machine learning processing you most likely want to access the encoded token ids via get_ids/1. If you want to convert the ids to a tensor, use get_u32_ids/1 to get a zero-copy binary.
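As a minimal sketch of the typical flow (the model name here is just an example):

    # Load a pretrained tokenizer and encode some text
    {:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("bert-base-cased")
    {:ok, encoding} = Tokenizers.Tokenizer.encode(tokenizer, "Hello world!")

    # A plain list of token ids, ready for further processing
    ids = Tokenizers.Encoding.get_ids(encoding)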
Summary
Functions
Returns the token that contains the given char.
Returns the word that contains the given char.
Returns the attention mask from encoding.
Returns the ids from encoding.
Returns the number of tokens in encoding.
Returns the number of sequences combined in encoding.
Returns offsets from encoding.
Returns the overflow from encoding.
Returns sequence ids from encoding.
Returns the special tokens mask from encoding.
Returns the tokens from encoding.
Returns token type ids from encoding.
Same as get_attention_mask/1, but returns binary with u32 values.
Same as get_ids/1, but returns binary with u32 values.
Same as get_special_tokens_mask/1, but returns binary with u32 values.
Same as get_type_ids/1, but returns binary with u32 values.
Returns word ids from encoding.
Returns the number of tokens in encoding.
Pad the encoding to the given length.
Sets the given sequence id for all tokens contained in encoding.
Returns the offsets of the token at the given index.
Returns the index of the sequence containing the given token.
Returns the word that contains the token at the given index.
Performs a set of transformations on the given encoding, creating a new one. Transformations are applied in the order they are given.
Truncate the encoding to the given length.
Returns the offsets of the word at the given index in the input sequence.
Returns the encoded tokens corresponding to the word at the given index in the input sequence, with the form {start_token, end_token + 1}.
Types
@type padding_opts() :: [ pad_id: non_neg_integer(), pad_type_id: non_neg_integer(), pad_token: String.t(), direction: :left | :right ]
Padding configuration.
:direction - the padding direction. Defaults to :right.
:pad_id - the id corresponding to the padding token. Defaults to 0.
:pad_type_id - the type ID corresponding to the padding token. Defaults to 0.
:pad_token - the padding token to use. Defaults to "[PAD]".
@type t() :: %Tokenizers.Encoding{resource: reference()}
@type truncation_opts() :: [stride: non_neg_integer(), direction: :left | :right]
Truncation configuration.
:stride - the length of previous content to be included in each overflowing piece. Defaults to 0.
:direction - the truncation direction. Defaults to :right.
Functions
@spec char_to_token(t(), non_neg_integer(), non_neg_integer()) :: non_neg_integer() | nil
Returns the token that contains the given char.
@spec char_to_word(t(), non_neg_integer(), non_neg_integer()) :: non_neg_integer() | nil
Returns the word that contains the given char.
Returns the attention mask from encoding.
Returns the ids from encoding.
@spec get_length(t()) :: non_neg_integer()
Returns the number of tokens in encoding.
@spec get_n_sequences(t()) :: non_neg_integer()
Returns the number of sequences combined in encoding.
Returns offsets from encoding.
The offsets are expressed in terms of UTF-8 bytes.
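For illustration, a small sketch of mapping tokens back to the input string; since the offsets are byte-based, binary_part/3 is the appropriate way to slice (the accessor names get_tokens/1 and get_offsets/1 are assumed from the summaries above):

    text = "Hello world!"
    {:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("bert-base-cased")
    {:ok, encoding} = Tokenizers.Tokenizer.encode(tokenizer, text)

    tokens = Tokenizers.Encoding.get_tokens(encoding)
    offsets = Tokenizers.Encoding.get_offsets(encoding)

    # Slice the original text by byte offsets, not by characters
    for {token, {start, stop}} <- Enum.zip(tokens, offsets) do
      {token, binary_part(text, start, stop - start)}
    end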
Returns the overflow from encoding.
@spec get_sequence_ids(t()) :: [non_neg_integer() | nil]
Returns sequence ids from encoding.
Returns the special tokens mask from encoding.
Returns the tokens from encoding.
Returns token type ids from encoding.
Same as get_attention_mask/1, but returns binary with u32 values.
Same as get_ids/1, but returns binary with u32 values.
Same as get_special_tokens_mask/1, but returns binary with u32 values.
Same as get_type_ids/1, but returns binary with u32 values.
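These u32 variants pair well with Nx, because the returned binary can back a tensor without copying. A sketch, assuming Nx is available as a dependency:

    # encoding obtained as in the introductory sketch
    u32 = Tokenizers.Encoding.get_u32_ids(encoding)

    # Wraps the binary directly; no intermediate list is built
    tensor = Nx.from_binary(u32, :u32)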
@spec get_word_ids(t()) :: [non_neg_integer() | nil]
Returns word ids from encoding.
@spec n_tokens(encoding :: t()) :: non_neg_integer()
Returns the number of tokens in encoding.
@spec pad(t(), non_neg_integer(), opts :: padding_opts()) :: t()
Pad the encoding to the given length.
For available options see padding_opts/0.
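For example, padding (on the right) to a fixed length of 128 tokens:

    # encoding obtained as in the introductory sketch
    padded =
      Tokenizers.Encoding.pad(encoding, 128,
        pad_id: 0,
        pad_token: "[PAD]",
        direction: :right
      )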
@spec set_sequence_id(t(), non_neg_integer()) :: t()
Sets the given sequence id for all tokens contained in encoding.
@spec token_to_chars(t(), non_neg_integer()) :: {non_neg_integer(), {non_neg_integer(), non_neg_integer()}} | nil
Returns the offsets of the token at the given index.
@spec token_to_sequence(t(), non_neg_integer()) :: non_neg_integer() | nil
Returns the index of the sequence containing the given token.
@spec token_to_word(t(), non_neg_integer()) :: {non_neg_integer(), non_neg_integer()} | nil
Returns the word that contains the token at the given index.
Performs a set of transformations on the given encoding, creating a new one. Transformations are applied in the order they are given.
While all these transformations can be done one by one, this function is more efficient, as it avoids multiple allocations and garbage collection of intermediate encodings.
Check the Tokenizers.Encoding.Transformation module for handy functions that can be used to build the transformations list. You can also build this list manually, as long as it follows the expected format.
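A hedged sketch of such a list, assuming truncate/2 and pad/2 helpers in Tokenizers.Encoding.Transformation that mirror truncate/3 and pad/3 here:

    alias Tokenizers.Encoding.Transformation

    # Truncate first, then pad, in a single pass over the encoding
    encoding =
      Tokenizers.Encoding.transform(encoding, [
        Transformation.truncate(128, direction: :right),
        Transformation.pad(128, pad_token: "[PAD]")
      ])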
@spec truncate(t(), non_neg_integer(), opts :: truncation_opts()) :: t()
Truncate the encoding to the given length.
For available options see truncation_opts/0.
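For example, truncating to 128 tokens while keeping 16 tokens of overlap in each overflowing piece:

    # encoding obtained as in the introductory sketch
    truncated = Tokenizers.Encoding.truncate(encoding, 128, stride: 16, direction: :right)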
@spec word_to_chars(t(), non_neg_integer(), non_neg_integer()) :: {non_neg_integer(), non_neg_integer()} | nil
Returns the offsets of the word at the given index in the input sequence.
@spec word_to_tokens(t(), non_neg_integer(), non_neg_integer()) :: {non_neg_integer(), non_neg_integer()} | nil
Returns the encoded tokens corresponding to the word at the given index in the input sequence, with the form {start_token, end_token + 1}.
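A sketch of recovering the tokens of one word, assuming the second argument is the word index and the third the sequence index:

    # Tokens for word 1 of sequence 0; the end of the range is exclusive
    case Tokenizers.Encoding.word_to_tokens(encoding, 1, 0) do
      {start_token, end_token} ->
        encoding
        |> Tokenizers.Encoding.get_tokens()
        |> Enum.slice(start_token, end_token - start_token)

      nil ->
        # No such word in the encoding
        nil
    end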