IREE.Tokenizers.Encoding (iree_tokenizers v0.5.0)


Result returned by encoding operations.

This module intentionally mirrors the most useful Tokenizers.Encoding helpers so callers can inspect token IDs, offsets, masks, and derived metadata without dealing with the NIF directly.

Summary

Types

t()

An encoded token sequence with optional offsets and derived masks.

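Functions

get_attention_mask(encoding)

Returns the attention mask.

get_ids(encoding)

Returns the token IDs.

get_length(encoding)

Returns the number of tokens in the encoding.

get_n_sequences(encoding)

Returns the number of sequences represented by the encoding.

get_offsets(encoding)

Returns byte offsets for each token.

get_overflowing(encoding)

Returns overflowing encodings, if any.

get_sequence_ids(encoding)

Returns sequence IDs for each token, with special tokens represented as nil.

get_special_tokens_mask(encoding)

Returns the special-tokens mask.

get_tokens(encoding)

Returns the token strings corresponding to the encoding.

get_type_ids(encoding)

Returns the type IDs.

get_u32_attention_mask(encoding)

Returns the attention mask packed into a little-endian u32 binary.

get_u32_ids(encoding)

Returns the token IDs packed into a little-endian u32 binary.

get_u32_special_tokens_mask(encoding)

Returns the special-tokens mask packed into a little-endian u32 binary.

get_u32_type_ids(encoding)

Returns the type IDs packed into a little-endian u32 binary.

get_word_ids(encoding)

Returns word IDs for each token.

n_tokens(encoding)

Alias for get_length/1.

pad(encoding, target_length, opts \\ [])

Pads the encoding to target_length.

set_sequence_id(encoding, id)

Replaces all sequence IDs in the encoding with the given value.

transform(encoding, transformations)

Applies a list of transformations in order.

truncate(encoding, max_length, opts \\ [])

Truncates the encoding to max_length.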

Types

t()

@type t() :: %IREE.Tokenizers.Encoding{
  attention_mask: [non_neg_integer()],
  ids: [integer()],
  offsets: nil | [{non_neg_integer(), non_neg_integer()}],
  special_tokens_mask: [non_neg_integer()],
  tokens: [binary()],
  type_ids: [non_neg_integer()]
}

An encoded token sequence with optional offsets and derived masks.

Functions

get_attention_mask(encoding)

@spec get_attention_mask(t()) :: [integer()]

Returns the attention mask.

get_ids(encoding)

@spec get_ids(t()) :: [integer()]

Returns the token IDs.

get_length(encoding)

@spec get_length(t()) :: non_neg_integer()

Returns the number of tokens in the encoding.

get_n_sequences(encoding)

@spec get_n_sequences(t()) :: non_neg_integer()

Returns the number of sequences represented by the encoding.

The current IREE-backed implementation only emits single-sequence encodings.

get_offsets(encoding)

@spec get_offsets(t()) :: [{integer(), integer()}]

Returns byte offsets for each token.
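Because the offsets count bytes rather than characters, slicing the original input calls for binary functions such as binary_part/3 instead of String.slice/3. The following is a minimal sketch in plain Elixir. The text and the offsets list are written by hand for illustration, and the end-exclusive {start, end} convention is an assumption that this page does not confirm.

```elixir
# Hypothetical input and offsets, written by hand for illustration.
# Assumption: each tuple is {start_byte, end_byte} with an exclusive end.
text = "h\u00E9llo w\u00F6rld"
offsets = [{0, 6}, {7, 13}]

# Slice by bytes: the two accented letters each occupy two bytes in UTF-8,
# so each five-letter word spans six bytes.
pieces =
  for {start, stop} <- offsets do
    binary_part(text, start, stop - start)
  end

IO.inspect(pieces)
```

Each slice recovers one whole word even though the byte positions do not match the character positions.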

get_overflowing(encoding)

@spec get_overflowing(t()) :: [t()]

Returns overflowing encodings, if any.

The current implementation does not emit overflowing pieces and always returns an empty list.

get_sequence_ids(encoding)

@spec get_sequence_ids(t()) :: [non_neg_integer() | nil]

Returns sequence IDs for each token, with special tokens represented as nil.

get_special_tokens_mask(encoding)

@spec get_special_tokens_mask(t()) :: [integer()]

Returns the special-tokens mask.

get_tokens(encoding)

@spec get_tokens(t()) :: [binary()]

Returns the token strings corresponding to the encoding.

get_type_ids(encoding)

@spec get_type_ids(t()) :: [integer()]

Returns the type IDs.

get_u32_attention_mask(encoding)

@spec get_u32_attention_mask(t()) :: binary()

Returns the attention mask packed into a little-endian u32 binary.

get_u32_ids(encoding)

@spec get_u32_ids(t()) :: binary()

Returns the token IDs packed into a little-endian u32 binary.

get_u32_special_tokens_mask(encoding)

@spec get_u32_special_tokens_mask(t()) :: binary()

Returns the special-tokens mask packed into a little-endian u32 binary.

get_u32_type_ids(encoding)

@spec get_u32_type_ids(t()) :: binary()

Returns the type IDs packed into a little-endian u32 binary.
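To make the packed layout used by the get_u32_* functions concrete, here is a minimal sketch in plain Elixir of what "packed into a little-endian u32 binary" means: every value takes four bytes, least significant byte first. The U32Sketch module and its functions are hypothetical helpers for illustration, not part of this library, and the sketch does not claim to reproduce how the library builds its binaries.

```elixir
defmodule U32Sketch do
  # Illustrative helper, not part of IREE.Tokenizers.
  # Packs non-negative integers as consecutive little-endian u32 values.
  def pack(values) do
    for value <- values, into: <<>>, do: <<value::unsigned-little-integer-size(32)>>
  end

  # Reverses pack/1 by reading the binary four bytes at a time.
  def unpack(binary) do
    for <<value::unsigned-little-integer-size(32) <- binary>>, do: value
  end
end

ids = [101, 7592, 102]
packed = U32Sketch.pack(ids)

# Three values at four bytes each give a 12-byte binary.
IO.inspect(byte_size(packed))
IO.inspect(U32Sketch.unpack(packed))
```

A binary in this layout can be handed to code that expects raw u32 buffers. For instance, Nx.from_binary/2 with the :u32 type should be able to read it on a little-endian machine, though that pairing is not something this page documents.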

get_word_ids(encoding)

@spec get_word_ids(t()) :: [nil]

Returns word IDs for each token.

The current implementation does not track word IDs, so every entry in the returned list is nil.

n_tokens(encoding)

@spec n_tokens(t()) :: non_neg_integer()

Alias for get_length/1.

pad(encoding, target_length, opts \\ [])

@spec pad(t(), non_neg_integer(), keyword()) :: t()

Pads the encoding to target_length.

Supported options:

  • :direction - :left or :right, defaults to :right
  • :pad_id - token ID used for padding, defaults to 0
  • :pad_type_id - type ID used for padding, defaults to 0
  • :pad_token - token string used for padding, defaults to "[PAD]"
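The effect of :direction and :pad_id can be modelled on a bare list of token IDs. The sketch below is plain Elixir with a hypothetical PadSketch module; it is not the library's implementation. It assumes that an input already at or beyond target_length is returned unchanged, which this page does not state.

```elixir
defmodule PadSketch do
  # Illustrative model of padding a single list of IDs; not the library's code.
  def pad_ids(ids, target_length, opts \\ []) do
    direction = Keyword.get(opts, :direction, :right)
    pad_id = Keyword.get(opts, :pad_id, 0)

    # Assumption: input already at or beyond target_length is left unchanged.
    padding = List.duplicate(pad_id, max(target_length - length(ids), 0))

    case direction do
      :right -> ids ++ padding
      :left -> padding ++ ids
    end
  end
end

right_padded = PadSketch.pad_ids([101, 7592, 102], 5)
left_padded = PadSketch.pad_ids([101, 7592, 102], 5, direction: :left, pad_id: 9)

IO.inspect(right_padded)
IO.inspect(left_padded)
```

With the real module the corresponding call would be pad(encoding, 5, direction: :left, pad_id: 9), which presumably extends the tokens, type IDs, and masks in step with the IDs.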

set_sequence_id(encoding, id)

@spec set_sequence_id(t(), non_neg_integer()) :: t()

Replaces all sequence IDs in the encoding with the given value.

transform(encoding, transformations)

@spec transform(t(), [IREE.Tokenizers.Encoding.Transformation.t()]) :: t()

Applies a list of transformations in order.

truncate(encoding, max_length, opts \\ [])

@spec truncate(t(), non_neg_integer(), keyword()) :: t()

Truncates the encoding to max_length.

Supported options:

  • :direction - :left or :right, defaults to :right
  • :stride - accepted for compatibility, but currently ignored
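The :direction option can be illustrated on a bare list of token IDs. The sketch below uses a hypothetical TruncateSketch module in plain Elixir and is not the library's implementation. It assumes that :right removes tokens from the end and :left removes them from the start, which follows the usual tokenizer convention but is not spelled out on this page.

```elixir
defmodule TruncateSketch do
  # Illustrative model of truncating a single list of IDs; not the library's code.
  def truncate_ids(ids, max_length, opts \\ []) do
    case Keyword.get(opts, :direction, :right) do
      # Assumption: :right drops tokens from the end, keeping the first max_length.
      :right -> Enum.take(ids, max_length)
      # Assumption: :left drops tokens from the start, keeping the last max_length.
      :left -> Enum.take(ids, -max_length)
    end
  end
end

ids = [101, 2023, 2003, 102]

kept_from_start = TruncateSketch.truncate_ids(ids, 2)
kept_from_end = TruncateSketch.truncate_ids(ids, 2, direction: :left)

IO.inspect(kept_from_start)
IO.inspect(kept_from_end)
```

A list that is already within max_length passes through unchanged in this model.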