Bumblebee.Tokenizer behaviour (Bumblebee v0.5.3)
An interface for configuring and applying tokenizers.
A tokenizer is used to convert raw text data into model input.
Every module implementing this behaviour is expected to also define a configuration struct.
Types
@type special_token_type() :: atom()
A type corresponding to a special token in the vocabulary.
Common types
:bos - a token representing the beginning of a sentence
:eos - a token representing the end of a sentence
:unk - a token representing an out-of-vocabulary token
:sep - a token separating two different sentences in the same input
:pad - a token added when processing a batch of sequences with different length
:cls - a token representing the class of the input
:mask - a token representing a masked token, used for masked language modeling tasks
@type t() :: struct()
@type token() :: String.t()
@type token_id() :: non_neg_integer()
Callbacks
Returns a list with extra special tokens, in addition to the named special_tokens/1.
Performs tokenization and encoding on the given input.
Decodes a list of token ids into a sentence.
Converts the given token id to the corresponding token.
@callback special_tokens(t()) :: %{required(special_token_type()) => token()}
Returns a map with special tokens.
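As a rough illustration, a configuration struct might carry its special tokens and expose them through a function shaped like this callback. This is a minimal sketch, not Bumblebee's code: `ToyTokenizer`, its `special_tokens` field, and the token strings are hypothetical, and the module deliberately does not declare `@behaviour Bumblebee.Tokenizer`, so it runs without Bumblebee installed.

```elixir
defmodule ToyTokenizer do
  # Hypothetical configuration struct; the field name and defaults are illustrative.
  defstruct special_tokens: %{
              unk: "[UNK]",
              sep: "[SEP]",
              pad: "[PAD]",
              cls: "[CLS]",
              mask: "[MASK]"
            }

  # Same shape as the callback: special token type => token string.
  def special_tokens(%__MODULE__{special_tokens: tokens}), do: tokens
end

tokens = ToyTokenizer.special_tokens(%ToyTokenizer{})

IO.inspect(tokens[:pad])
# prints "[PAD]"

# A type this toy vocabulary does not define is simply absent from the map.
IO.inspect(tokens[:bos])
# prints nil
```

Only the types a given vocabulary actually uses need to appear as keys; here `:bos` and `:eos` are left out on purpose.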
Converts the given token into the corresponding numeric id.
Functions
Returns all special tokens, including any extra tokens.
Decodes a list of token ids into a sentence.
Converts the given token id to the corresponding token.
@spec special_token(t(), special_token_type()) :: token() | nil
Returns a special token by name.
@spec special_token_id(t(), special_token_type()) :: token_id() | nil
Returns the id of a special token by name.
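One plausible reading of these two specs is that the type is first looked up among the named special tokens, and the resulting token is then resolved to an id through the vocabulary. The sketch below assumes that relationship rather than reproducing Bumblebee's implementation; `ToyLookup`, its `vocab` field, and the sample vocabulary are hypothetical.

```elixir
defmodule ToyLookup do
  # Hypothetical struct: a token => id vocabulary plus named special tokens.
  defstruct vocab: %{"[PAD]" => 0, "[UNK]" => 1, "hello" => 2},
            special_tokens: %{pad: "[PAD]", unk: "[UNK]"}

  # Mirrors the `token() | nil` return: nil when the type is not defined.
  def special_token(%__MODULE__{special_tokens: tokens}, type) do
    Map.get(tokens, type)
  end

  # Mirrors the `token_id() | nil` return: nil when the type or token is unknown.
  def special_token_id(%__MODULE__{vocab: vocab} = tokenizer, type) do
    case special_token(tokenizer, type) do
      nil -> nil
      token -> Map.get(vocab, token)
    end
  end
end

lookup = %ToyLookup{}

IO.inspect(ToyLookup.special_token(lookup, :pad))
# prints "[PAD]"
IO.inspect(ToyLookup.special_token_id(lookup, :unk))
# prints 1
IO.inspect(ToyLookup.special_token_id(lookup, :bos))
# prints nil
```

Returning `nil` instead of raising lets callers handle tokenizers that lack a given special token, which matches the `| nil` in both specs.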
Converts the given token into the corresponding numeric id.
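The conversions between tokens and ids, and the decoding built on them, can be pictured with a toy vocabulary. This is only a sketch under stated assumptions: `ToyVocab`, its function names and arities, and the whitespace-joined decoding are hypothetical simplifications, and a real tokenizer would take the tokenizer struct as an argument and handle subwords and special tokens.

```elixir
defmodule ToyVocab do
  # Hypothetical token => id vocabulary.
  @vocab %{"[UNK]" => 0, "hello" => 1, "world" => 2}

  # Inverse mapping, id => token, built once at compile time.
  @inverse Map.new(@vocab, fn {token, id} -> {id, token} end)

  # Token to numeric id; nil when the token is not in the vocabulary.
  def token_to_id(token), do: Map.get(@vocab, token)

  # Numeric id to token; nil when the id is not in the vocabulary.
  def id_to_token(id), do: Map.get(@inverse, id)

  # Naive decoding: map each id back to its token and join with spaces.
  def decode(ids), do: Enum.map_join(ids, " ", &id_to_token/1)
end

IO.inspect(ToyVocab.token_to_id("hello"))
# prints 1
IO.inspect(ToyVocab.id_to_token(2))
# prints "world"
IO.inspect(ToyVocab.id_to_token(99))
# prints nil
IO.inspect(ToyVocab.decode([1, 2]))
# prints "hello world"
```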