Bumblebee.Text.PreTrainedTokenizer (Bumblebee v0.5.3)
Wraps a pre-trained tokenizer from the Tokenizers library.
Configuration
:add_special_tokens - whether to add special tokens during tokenization. Defaults to true
:length - applies fixed length padding or truncation to the given input if set. Can be either a specific number or a list of numbers. When a list is given, the smallest number that exceeds all input lengths is used as the padding length
:pad_direction - the padding direction, either :right or :left. Defaults to :right
:truncate_direction - the truncation direction, either :right or :left. Defaults to :right
:return_attention_mask - whether to return attention mask for encoded sequence. The mask is a boolean tensor indicating which tokens are padding and should effectively be ignored by the model. Defaults to true
:return_token_type_ids - whether to return token type ids for encoded sequence. Defaults to true
:return_special_tokens_mask - whether to return special tokens mask for encoded sequence. The mask is a boolean tensor indicating which tokens are special. Defaults to false
:return_offsets - whether to return token offsets for encoded sequence. This tensor includes a list of position pairs that map tokens to the input text. Defaults to false
:return_length - whether to return the sequence length. The length is the effective number of tokens, so it is calculated after truncation, but does not include padding. Defaults to false
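
The options above could be set roughly as follows. This is a minimal sketch, not an excerpt from the library's documentation: it assumes the tokenizer is loaded with Bumblebee.load_tokenizer/2, configured with Bumblebee.configure/2, and applied with Bumblebee.apply_tokenizer/2. The Hugging Face repository name and the input sentences are illustrative choices, and running it requires network access to download the tokenizer.

```elixir
# Illustrative setup; the version requirement matches the version this page documents.
Mix.install([{:bumblebee, "~> 0.5.3"}])

# The repository name is an example only; any repository with a supported tokenizer should work.
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "google-bert/bert-base-uncased"})

# Override a few of the configuration options listed above.
tokenizer =
  Bumblebee.configure(tokenizer,
    # pad or truncate every sequence to 16 tokens
    length: 16,
    pad_direction: :right,
    truncate_direction: :right,
    # skip token type ids, which default to being returned
    return_token_type_ids: false,
    # additionally return the effective (unpadded) sequence length
    return_length: true
  )

# Tokenize a small batch and print the resulting map of tensors.
inputs = Bumblebee.apply_tokenizer(tokenizer, ["Hello world", "A somewhat longer sentence"])
IO.inspect(inputs)
```

Passing a list such as length: [16, 32, 64] instead of a single number would, per the :length description above, pad to the smallest listed number that exceeds all input lengths.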