Bumblebee.Tokenizer behaviour (Bumblebee v0.1.2)

An interface for configuring and applying tokenizers.

A tokenizer is used to convert raw text data into model input.

Every module implementing this behaviour is expected to also define a configuration struct.

Summary

Types

input()

special_token_type()

A type corresponding to a special token in the vocabulary.

t()

token()

token_id()

Callbacks

apply/3

Performs tokenization and encoding on the given input.

decode/2

Decodes a list of token ids into a sentence.

id_to_token/2

Converts the given token id to the corresponding token.

special_tokens/1

Returns a map with special tokens.

token_to_id/2

Converts the given token into the corresponding numeric id.

Functions

decode/2

Decodes a list of token ids into a sentence.

id_to_token/2

Converts the given token id to the corresponding token.

special_token/2

Returns a special token by name.

special_token_id/2

Returns the id of a special token by name.

token_to_id/2

Converts the given token into the corresponding numeric id.

Types

@type input() :: String.t() | {String.t(), String.t()}
@type special_token_type() :: atom()

A type corresponding to a special token in the vocabulary.

Common types

  • :bos - a token representing the beginning of a sentence

  • :eos - a token representing the end of a sentence

  • :unk - a token representing an out-of-vocabulary token

  • :sep - a token separating two different sentences in the same input

  • :pad - a token added when processing a batch of sequences of different lengths

  • :cls - a token representing the class of the input

  • :mask - a token representing a masked token, used for masked language modeling tasks
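As a rough illustration of how these token types are used, the sketch below frames a single sentence and a sentence pair with :cls and :sep, then pads the shorter row with :pad. The token strings ("[CLS]", "[SEP]", "[PAD]") follow the common BERT convention and are assumptions, as is the `frame` helper; a real tokenizer defines its own tokens and layout.

```elixir
# Illustrative special-token map (assumed BERT-style strings).
special_tokens = %{cls: "[CLS]", sep: "[SEP]", pad: "[PAD]"}

# Hypothetical helper: frame either a single sentence or a sentence pair,
# mirroring the two shapes allowed by the input() type.
frame = fn
  {first, second} ->
    [special_tokens.cls, first, special_tokens.sep, second, special_tokens.sep]

  text when is_binary(text) ->
    [special_tokens.cls, text, special_tokens.sep]
end

batch = [frame.("hello"), frame.({"hello", "world"})]

# Pad every row to the length of the longest row in the batch.
max_len = batch |> Enum.map(&length/1) |> Enum.max()

padded =
  Enum.map(batch, fn row ->
    row ++ List.duplicate(special_tokens.pad, max_len - length(row))
  end)

IO.inspect(padded)
```

Sentences are kept as whole strings here to keep the focus on where the special tokens go; an actual tokenizer would split them into sub-tokens first.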

@type t() :: struct()
@type token() :: String.t()
@type token_id() :: non_neg_integer()

Callbacks

@callback apply(t(), input() | [input()], keyword()) :: any()

Performs tokenization and encoding on the given input.

@callback decode(t(), [token_id()] | [[token_id()]]) :: String.t()

Decodes a list of token ids into a sentence.

@callback id_to_token(t(), token_id()) :: token()

Converts the given token id to the corresponding token.

@callback special_tokens(t()) :: %{required(special_token_type()) => token()}

Returns a map with special tokens.

@callback token_to_id(t(), token()) :: token_id()

Converts the given token into the corresponding numeric id.
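To make the shape of the callbacks concrete, here is a minimal sketch of a whitespace tokenizer exposing the same five functions. `ToyTokenizer`, its vocabulary, and its struct fields are hypothetical and not part of Bumblebee. A real implementation would declare `@behaviour Bumblebee.Tokenizer`; since the `apply/3` callback returns `any()`, the plain list of ids returned here is only one possible choice.

```elixir
defmodule ToyTokenizer do
  # Hypothetical module for illustration only. The line
  # `@behaviour Bumblebee.Tokenizer` is omitted so that the sketch
  # compiles without Bumblebee installed.

  # The configuration struct that every implementation is expected to define.
  defstruct vocab: %{"[UNK]" => 0, "[PAD]" => 1, "hello" => 2, "world" => 3},
            special_tokens: %{unk: "[UNK]", pad: "[PAD]"}

  # Splits the text on whitespace and maps each token to its id.
  def apply(tokenizer, input, _opts \\ []) when is_binary(input) do
    input
    |> String.split()
    |> Enum.map(&token_to_id(tokenizer, &1))
  end

  # Maps each id back to its token and joins the tokens with spaces.
  def decode(tokenizer, ids) do
    ids
    |> Enum.map(&id_to_token(tokenizer, &1))
    |> Enum.join(" ")
  end

  # Reverse lookup in the vocabulary; raises on an unknown id.
  def id_to_token(tokenizer, id) do
    {token, _id} =
      Enum.find(tokenizer.vocab, fn {_token, token_id} -> token_id == id end)

    token
  end

  def special_tokens(tokenizer), do: tokenizer.special_tokens

  # Out-of-vocabulary tokens fall back to the id of the :unk token.
  def token_to_id(tokenizer, token) do
    unk_id = Map.fetch!(tokenizer.vocab, tokenizer.special_tokens.unk)
    Map.get(tokenizer.vocab, token, unk_id)
  end
end

# struct!/1 builds the struct at runtime, so this also works when the
# module is defined earlier in the same script.
tokenizer = struct!(ToyTokenizer)

ids = ToyTokenizer.apply(tokenizer, "hello world")
IO.inspect(ids)
IO.inspect(ToyTokenizer.decode(tokenizer, ids))
```

Only single-string input is handled; supporting `{String.t(), String.t()}` pairs and batches would need additional clauses.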

Functions

@spec decode(
  t(),
  token() | [token_id()] | [[token_id()]] | Nx.Tensor.t()
) :: String.t()

Decodes a list of token ids into a sentence.

id_to_token(tokenizer, id)

@spec id_to_token(t(), token_id()) :: token()

Converts the given token id to the corresponding token.

special_token(tokenizer, type)

@spec special_token(t(), special_token_type()) :: token() | nil

Returns a special token by name.

special_token_id(tokenizer, type)

@spec special_token_id(t(), special_token_type()) :: token_id() | nil

Returns the id of a special token by name.
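One plausible way to derive a special token's id, assuming it is resolved by first looking up the token for the given type and then converting that token to an id. The sketch uses plain maps in place of a real tokenizer; `vocab`, `special_tokens`, and the anonymous function are illustrative and do not reproduce Bumblebee's implementation.

```elixir
# Plain maps standing in for a tokenizer's vocabulary and special-token map.
vocab = %{"[UNK]" => 0, "[PAD]" => 1, "hello" => 2}
special_tokens = %{unk: "[UNK]", pad: "[PAD]"}

# Resolve the type to a token, then the token to an id.
# When the type has no token, `with` returns the unmatched value (nil),
# which lines up with the `token_id() | nil` return type.
special_token_id = fn type ->
  with token when is_binary(token) <- Map.get(special_tokens, type) do
    Map.fetch!(vocab, token)
  end
end

IO.inspect(special_token_id.(:pad))
IO.inspect(special_token_id.(:bos))
```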

token_to_id(tokenizer, token)

@spec token_to_id(t(), token()) :: token_id()

Converts the given token into the corresponding numeric id.