Bumblebee.Tokenizer behaviour (Bumblebee v0.5.3)

An interface for configuring and applying tokenizers.

A tokenizer is used to convert raw text data into model input.

Every module implementing this behaviour is expected to also define a configuration struct.
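
As a rough illustration of "raw text in, model input out", the sketch below loads a tokenizer and applies it to a small batch. It assumes network access to the Hugging Face Hub, and the repository name is only an example choice. The exact keys of the returned map depend on the tokenizer, so the script prints them rather than assuming them.

```elixir
# Assumption: run as a standalone script with network access.
Mix.install([{:bumblebee, "~> 0.5.3"}])

# The repository below is an illustrative choice, not one this page prescribes.
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "google-bert/bert-base-uncased"})

# A plain string and a {first, second} pair are both valid input() values.
inputs = Bumblebee.apply_tokenizer(tokenizer, ["Hello world", {"A question?", "An answer."}])

# Inspect which encoded fields this tokenizer produced.
IO.inspect(Map.keys(inputs), label: "keys")

# Decode the first row of token ids back into text.
# decode/2 accepts an Nx tensor as well as a list of ids.
text = Bumblebee.Tokenizer.decode(tokenizer, inputs["input_ids"][0])
IO.puts(text)
```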

Summary

Types

input()

special_token_type()
A type corresponding to a special token in the vocabulary.

t()

token()

token_id()

Callbacks

additional_special_tokens/1
Returns a set of extra special tokens, in addition to the named ones returned by special_tokens/1.

apply/2
Performs tokenization and encoding on the given input.

decode/2
Decodes a list of token ids into a sentence.

id_to_token/2
Converts the given token id into the corresponding token.

special_tokens/1
Returns a map with special tokens.

token_to_id/2
Converts the given token into the corresponding numeric id.

Functions

all_special_tokens/1
Returns all special tokens, including any extra tokens.

decode/2
Decodes a list of token ids into a sentence.

id_to_token/2
Converts the given token id into the corresponding token.

special_token/2
Returns a special token by name.

special_token_id/2
Returns the id of a special token by name.

token_to_id/2
Converts the given token into the corresponding numeric id.

Types

@type input() :: String.t() | {String.t(), String.t()}
@type special_token_type() :: atom()

A type corresponding to a special token in the vocabulary.

Common types

  • :bos - a token representing the beginning of a sentence

  • :eos - a token representing the end of a sentence

  • :unk - a token representing an out-of-vocabulary token

  • :sep - a token separating two different sentences in the same input

  • :pad - a token added when processing a batch of sequences with different lengths

  • :cls - a token representing the class of the input

  • :mask - a token representing a masked token, used for masked language modeling tasks
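
To make the roles of :cls, :sep and :pad concrete, here is a minimal, self-contained sketch of BERT-style framing. The module name, the token strings and the whitespace splitting are all hypothetical. Real tokenizers define their own vocabulary and framing rules.

```elixir
defmodule SpecialTokenSketch do
  # Hypothetical token strings; each real vocabulary defines its own.
  @special_tokens %{cls: "[CLS]", sep: "[SEP]", pad: "[PAD]", unk: "[UNK]", mask: "[MASK]"}

  def special_tokens, do: @special_tokens

  # Accepts a single sentence or a {first, second} pair, mirroring input().
  def frame(input, target_length) do
    %{cls: cls, sep: sep, pad: pad} = @special_tokens

    tokens =
      case input do
        {first, second} -> [cls] ++ split(first) ++ [sep] ++ split(second) ++ [sep]
        text when is_binary(text) -> [cls] ++ split(text) ++ [sep]
      end

    # Pad on the right so every sequence in a batch ends up the same length.
    tokens ++ List.duplicate(pad, max(target_length - length(tokens), 0))
  end

  # Naive whitespace splitting stands in for real subword tokenization.
  defp split(text), do: String.split(text)
end

IO.inspect(SpecialTokenSketch.frame({"how are you", "fine thanks"}, 10))
# → ["[CLS]", "how", "are", "you", "[SEP]", "fine", "thanks", "[SEP]", "[PAD]", "[PAD]"]
```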

@type t() :: struct()
@type token() :: String.t()
@type token_id() :: non_neg_integer()

Callbacks

additional_special_tokens(t)

@callback additional_special_tokens(t()) :: MapSet.t(token())

Returns a set of extra special tokens, in addition to the named ones returned by special_tokens/1.

@callback apply(t(), input() | [input()]) :: any()

Performs tokenization and encoding on the given input.

@callback decode(t(), [token_id()] | [[token_id()]]) :: String.t()

Decodes a list of token ids into a sentence.

id_to_token(t, token_id)

@callback id_to_token(t(), token_id()) :: token()

Converts the given token id into the corresponding token.

@callback special_tokens(t()) :: %{required(special_token_type()) => token()}

Returns a map with special tokens.

@callback token_to_id(t(), token()) :: token_id()

Converts the given token into the corresponding numeric id.
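
The six callbacks above can be sketched with a toy whitespace tokenizer. This is an assumption-laden illustration, not how Bumblebee's own tokenizers work: the module name, the four-entry vocabulary and the shape of the map returned by apply/2 are all made up. The `@behaviour Bumblebee.Tokenizer` declaration is left out so the code runs in plain Elixir without dependencies.

```elixir
defmodule WhitespaceTokenizer do
  # With Bumblebee as a dependency, `@behaviour Bumblebee.Tokenizer` and
  # `@impl true` annotations would be added; omitted here on purpose.

  # The configuration struct that every implementation is expected to define.
  defstruct vocab: %{"[PAD]" => 0, "[UNK]" => 1, "hello" => 2, "world" => 3},
            special_tokens: %{pad: "[PAD]", unk: "[UNK]"},
            additional_special_tokens: MapSet.new()

  def special_tokens(tokenizer), do: tokenizer.special_tokens

  def additional_special_tokens(tokenizer), do: tokenizer.additional_special_tokens

  def token_to_id(tokenizer, token) do
    # Out-of-vocabulary tokens fall back to the id of the :unk token.
    Map.get(tokenizer.vocab, token, tokenizer.vocab[tokenizer.special_tokens.unk])
  end

  def id_to_token(tokenizer, id) do
    # Linear scan is fine for a toy vocabulary; returns nil for unknown ids.
    Enum.find_value(tokenizer.vocab, fn {token, token_id} -> token_id == id && token end)
  end

  def apply(tokenizer, input) do
    # Accepts one string or a list of strings; pairs are left out for brevity.
    batch =
      for text <- List.wrap(input) do
        text |> String.downcase() |> String.split() |> Enum.map(&token_to_id(tokenizer, &1))
      end

    %{"input_ids" => batch}
  end

  def decode(tokenizer, ids) do
    special = MapSet.new(Map.values(tokenizer.special_tokens))

    ids
    |> Enum.map(&id_to_token(tokenizer, &1))
    |> Enum.reject(&MapSet.member?(special, &1))
    |> Enum.join(" ")
  end
end

tokenizer = %WhitespaceTokenizer{}
IO.inspect(WhitespaceTokenizer.apply(tokenizer, "Hello brave world"))
# → %{"input_ids" => [[2, 1, 3]]}
IO.inspect(WhitespaceTokenizer.decode(tokenizer, [2, 1, 3]))
# → "hello world"
```

Dropping special tokens in decode/2 is a design choice of this sketch. It shows why encoding followed by decoding need not return the original text.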

Functions

all_special_tokens(tokenizer)

@spec all_special_tokens(t()) :: [token()]

Returns all special tokens, including any extra tokens.

@spec decode(
  t(),
  token() | [token_id()] | [[token_id()]] | Nx.Tensor.t()
) :: String.t()

Decodes a list of token ids into a sentence.

id_to_token(tokenizer, id)

@spec id_to_token(t(), token_id()) :: token()

Converts the given token id into the corresponding token.

special_token(tokenizer, type)

@spec special_token(t(), special_token_type()) :: token() | nil

Returns a special token by name.

special_token_id(tokenizer, type)

@spec special_token_id(t(), special_token_type()) :: token_id() | nil

Returns the id of a special token by name.
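
One plausible way to build such helpers on top of the callbacks is to dispatch on the struct's module. The sketch below is a self-contained approximation under that assumption, not Bumblebee's actual source. `TinyTokenizer`, its tokens and its ids are hypothetical.

```elixir
defmodule TinyTokenizer do
  # Hypothetical implementation exposing only the callbacks the helpers need.
  defstruct []

  def special_tokens(_tokenizer), do: %{cls: "[CLS]", sep: "[SEP]"}
  def additional_special_tokens(_tokenizer), do: MapSet.new(["<extra_0>"])

  def token_to_id(_tokenizer, token) do
    Map.fetch!(%{"[CLS]" => 101, "[SEP]" => 102, "<extra_0>" => 5}, token)
  end
end

defmodule TokenizerHelpersSketch do
  # Each helper pattern-matches the struct's module and calls its callbacks,
  # so the same code works for any implementation.

  def special_token(%module{} = tokenizer, type) do
    # Map.get/2 yields nil when the tokenizer does not define this type.
    Map.get(module.special_tokens(tokenizer), type)
  end

  def special_token_id(%module{} = tokenizer, type) do
    # `if` without `else` returns nil, matching the `token_id() | nil` spec.
    if token = special_token(tokenizer, type) do
      module.token_to_id(tokenizer, token)
    end
  end

  def all_special_tokens(%module{} = tokenizer) do
    named = Map.values(module.special_tokens(tokenizer))
    extra = MapSet.to_list(module.additional_special_tokens(tokenizer))
    Enum.uniq(named ++ extra)
  end
end

tokenizer = %TinyTokenizer{}
IO.inspect(TokenizerHelpersSketch.special_token_id(tokenizer, :sep))
# → 102
IO.inspect(TokenizerHelpersSketch.special_token_id(tokenizer, :mask))
# → nil
```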

token_to_id(tokenizer, token)

@spec token_to_id(t(), token()) :: token_id()

Converts the given token into the corresponding numeric id.