Bumblebee.Text.PreTrainedTokenizer (Bumblebee v0.6.0)
Wraps a pre-trained tokenizer from the Tokenizers library.
Configuration
:add_special_tokens - whether to add special tokens during tokenization. Defaults to true
:length - applies fixed length padding or truncation to the given input if set. Can be either a specific number or a list of numbers. When a list is given, the smallest number that exceeds all input lengths is used as the padding length
:pad_direction - the padding direction, either :right or :left. Defaults to :right
:truncate_direction - the truncation direction, either :right or :left. Defaults to :right
:return_attention_mask - whether to return the attention mask for the encoded sequence. The mask is a boolean tensor indicating which tokens are padding and should effectively be ignored by the model. Defaults to true
:return_token_type_ids - whether to return token type ids for the encoded sequence. Defaults to true
:return_special_tokens_mask - whether to return the special tokens mask for the encoded sequence. The mask is a boolean tensor indicating which tokens are special. Defaults to false
:return_offsets - whether to return token offsets for the encoded sequence. This tensor includes a list of position pairs that map tokens to the input text. Defaults to false
:return_length - whether to return the sequence length. The length is the effective number of tokens, so it is calculated after truncation, but does not include padding. Defaults to false
:template_options - options configuring the tokenization template, specific to the given tokenizer type. Defaults to []. Recognised options are:
  :language_token - for tokenizers: :nllb
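As an illustration of how the options above might be set, here is a minimal sketch. It assumes Bumblebee is already available in the session (for example through Mix.install), that the Hugging Face repository named below is reachable, and that the chosen option values suit your model. The repository name and the sample sentences are placeholders, not part of this module's documentation.

```elixir
# Assumption: Bumblebee is installed, e.g. Mix.install([{:bumblebee, "~> 0.6.0"}]).
# Assumption: the repository below is only an example; any repository that
# ships a supported tokenizer could be used instead.
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "google-bert/bert-base-uncased"})

# Override a few of the configuration options listed above.
tokenizer =
  Bumblebee.configure(tokenizer,
    # pad or truncate every input to exactly 16 tokens
    length: 16,
    pad_direction: :right,
    truncate_direction: :right,
    # skip token type ids, but also return the effective sequence length
    return_token_type_ids: false,
    return_length: true
  )

# Tokenize a small batch; the result is a map of tensors.
inputs =
  Bumblebee.apply_tokenizer(tokenizer, [
    "Hello world",
    "A slightly longer example sentence"
  ])

IO.inspect(inputs)
```

Options left out of the Bumblebee.configure/2 call keep the defaults listed above.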