Bumblebee.Text.PreTrainedTokenizer (Bumblebee v0.6.0)

Wraps a pre-trained tokenizer from the Tokenizers library.

Configuration

  • :add_special_tokens - whether to add special tokens during tokenization. Defaults to true

  • :length - applies fixed length padding or truncation to the given input if set. Can be either a specific number or a list of numbers. When a list is given, the smallest number that exceeds all input lengths is used as the padding length

  • :pad_direction - the padding direction, either :right or :left. Defaults to :right

  • :truncate_direction - the truncation direction, either :right or :left. Defaults to :right

  • :return_attention_mask - whether to return attention mask for encoded sequence. The mask is a boolean tensor indicating which tokens are padding and should effectively be ignored by the model. Defaults to true

  • :return_token_type_ids - whether to return token type ids for encoded sequence. Defaults to true

  • :return_special_tokens_mask - whether to return special tokens mask for encoded sequence. The mask is a boolean tensor indicating which tokens are special. Defaults to false

  • :return_offsets - whether to return token offsets for encoded sequence. This tensor includes a list of position pairs that map tokens to the input text. Defaults to false

  • :return_length - whether to return the sequence length. The length is the effective number of tokens, so it is calculated after truncation, but does not include padding. Defaults to false

  • :template_options - options configuring the tokenization template, specific to the given tokenizer type. Defaults to []. Recognised options are:

    • :language_token - for tokenizers: :nllb
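
The options above can be set on a loaded tokenizer with Bumblebee.configure/2 before it is applied to text. The following is a minimal sketch, not a definitive recipe: it assumes network access to the Hugging Face Hub, and the repository name and the input sentences are illustrative choices that do not come from this page.

```elixir
# Assumes a fresh script or IEx session with network access.
Mix.install([{:bumblebee, "~> 0.6.0"}])

# The repository name is an illustrative assumption.
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "google-bert/bert-base-uncased"})

# Override a few of the options listed above; the rest keep their defaults.
tokenizer =
  Bumblebee.configure(tokenizer,
    length: 16,
    pad_direction: :right,
    return_token_type_ids: false,
    return_special_tokens_mask: true
  )

# Tokenize a small batch of example sentences.
inputs =
  Bumblebee.apply_tokenizer(tokenizer, [
    "Hello world",
    "A somewhat longer second sentence"
  ])

# Inspect which tensors were returned and the shape of the token ids.
IO.inspect(Map.keys(inputs))
IO.inspect(Nx.shape(inputs["input_ids"]))
```

Because :length is set to a single number here, both sequences are padded or truncated to that fixed length, so the batch can be stacked into one tensor regardless of the individual input lengths.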