Bumblebee.Text.PreTrainedTokenizer (Bumblebee v0.6.0)

Wraps a pre-trained tokenizer from the Tokenizers library.
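
As a minimal sketch of how such a tokenizer is typically loaded and applied, the snippet below uses `Bumblebee.load_tokenizer/1` and `Bumblebee.apply_tokenizer/2`. The repository name and the input sentences are illustrative assumptions, not something this page prescribes, and loading the tokenizer downloads files from Hugging Face.

```elixir
# Assumption: run as a standalone script, so Bumblebee is installed on the fly.
Mix.install([{:bumblebee, "~> 0.6.0"}])

# Illustrative repository; any repository that ships a tokenizer should work similarly.
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "google-bert/bert-base-uncased"})

# Tokenize a batch of two sentences with the default configuration.
inputs = Bumblebee.apply_tokenizer(tokenizer, ["Hello world", "A slightly longer sentence"])

# The result is a map of tensors keyed by name, such as "input_ids".
IO.inspect(Map.keys(inputs))
```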

Configuration

:add_special_tokens - whether to add special tokens during tokenization. Defaults to true
:length - applies fixed length padding or truncation to the given input if set. Can be either a specific number or a list of numbers. When a list is given, the smallest number that exceeds all input lengths is used as the padding length
:pad_direction - the padding direction, either :right or :left. Defaults to :right
:truncate_direction - the truncation direction, either :right or :left. Defaults to :right
:return_attention_mask - whether to return the attention mask for the encoded sequence. The mask is a boolean tensor indicating which tokens are padding and should effectively be ignored by the model. Defaults to true
:return_token_type_ids - whether to return token type ids for encoded sequence. Defaults to true
:return_special_tokens_mask - whether to return the special tokens mask for the encoded sequence. The mask is a boolean tensor indicating which tokens are special. Defaults to false
:return_offsets - whether to return token offsets for the encoded sequence. This tensor includes a list of position pairs that map tokens to the input text. Defaults to false
:return_length - whether to return the sequence length. The length is the effective number of tokens, so it is calculated after truncation, but does not include padding. Defaults to false
:template_options - options configuring the tokenization template, specific to the given tokenizer type. Defaults to []. Recognised options are:
- :language_token - for tokenizers: :nllb
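
The options above can be set on a loaded tokenizer with `Bumblebee.configure/2`. The following is a sketch under stated assumptions: the repository name, the fixed length of 16, and the input sentences are chosen only for illustration.

```elixir
# Assumption: run as a standalone script, so Bumblebee is installed on the fly.
Mix.install([{:bumblebee, "~> 0.6.0"}])

# Illustrative repository; loading it downloads files from Hugging Face.
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "google-bert/bert-base-uncased"})

# Override a few of the documented options; the rest keep their defaults.
tokenizer =
  Bumblebee.configure(tokenizer,
    length: 16,
    pad_direction: :right,
    truncate_direction: :right,
    return_token_type_ids: false,
    return_special_tokens_mask: true,
    return_length: true
  )

inputs = Bumblebee.apply_tokenizer(tokenizer, ["Hello world", "A slightly longer sentence"])

# With length: 16, every sequence is padded or truncated to 16 tokens,
# so the batch of two inputs is expected to have shape {2, 16}.
IO.inspect(Nx.shape(inputs["input_ids"]))

# Inspect which tensors were returned for the chosen :return_* options.
IO.inspect(Map.keys(inputs))
```

Passing a list such as `length: [16, 32, 64]` instead of a single number lets the tokenizer pick a padding length from the list per batch, as described for `:length` above.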