Tokenizers.PostProcessor (Tokenizers v0.4.0)
Post-processors.
After the whole pipeline, we sometimes want to insert special tokens before feeding the encoded text into a model, for example "[CLS] My horse is amazing [SEP]". A post-processor lets us do that.
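For instance, here is a minimal sketch of adding those tokens around a single sequence. It assumes a template post-processor with BERT's usual [CLS]/[SEP] ids (101/102) and a setter along the lines of Tokenizers.Tokenizer.set_post_processor/2; check that both match your installed version and your tokenizer's vocabulary.

```elixir
{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("bert-base-cased")

# Wrap every single sequence as "[CLS] $A [SEP]"; the ids 101/102 are
# BERT's usual [CLS]/[SEP] ids and must match your vocabulary.
post_processor =
  Tokenizers.PostProcessor.template(
    single: "[CLS] $A [SEP]",
    special_tokens: [{"[CLS]", 101}, {"[SEP]", 102}]
  )

# Assumed setter name; adjust if your version exposes it differently.
tokenizer = Tokenizers.Tokenizer.set_post_processor(tokenizer, post_processor)

{:ok, encoding} = Tokenizers.Tokenizer.encode(tokenizer, "My horse is amazing")
Tokenizers.Encoding.get_tokens(encoding)
#=> e.g. ["[CLS]", "My", "horse", "is", "amazing", "[SEP]"] (exact pieces depend on the vocabulary)
```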
Summary
Functions
Creates a Bert post-processor with the given tokens.
Creates a ByteLevel post-processor.
Creates a Roberta post-processor.
Creates a Sequence post-processor.
Creates a Template post-processor.
Types
@type t() :: %Tokenizers.PostProcessor{resource: reference()}
Functions
Creates a Bert post-processor with the given tokens.
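A short sketch, assuming the function takes the separator and classifier as {token, id} tuples in that order, with BERT's usual vocabulary ids:

```elixir
# Assumed argument order: separator token first, then classifier token.
# Produces "[CLS] A [SEP]" for single sequences and "[CLS] A [SEP] B [SEP]" for pairs.
post_processor =
  Tokenizers.PostProcessor.bert_postprocessing({"[SEP]", 102}, {"[CLS]", 101})
```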
Creates a ByteLevel post-processor.
Options
:trim_offsets - whether to trim the whitespace in the produced offsets. Defaults to true.
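For example, to keep the raw byte offsets instead of trimming whitespace (a sketch assuming a keyword list of options):

```elixir
# :trim_offsets defaults to true; pass false to keep the raw byte offsets.
post_processor = Tokenizers.PostProcessor.byte_level(trim_offsets: false)
```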
Creates a Roberta post-processor.
Options
:trim_offsets - whether to trim the whitespace in the produced offsets. Defaults to true.
:add_prefix_space - whether add_prefix_space was ON during the pre-tokenization. Defaults to true.
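A sketch assuming the separator and classifier are passed as {token, id} tuples followed by the options; the ids 2 and 0 used here are RoBERTa's usual </s> and <s> ids and may differ in your vocabulary.

```elixir
post_processor =
  Tokenizers.PostProcessor.roberta_postprocessing(
    # Assumed order: separator token first, then classifier token.
    {"</s>", 2},
    {"<s>", 0},
    trim_offsets: true,
    add_prefix_space: true
  )
```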
Creates a Sequence post-processor.
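Runs several post-processors in order. For example, a sketch that chains a ByteLevel pass with a template (assuming the function takes a plain list of post-processors):

```elixir
post_processor =
  Tokenizers.PostProcessor.sequence([
    # Applied in order: byte-level offset handling, then special-token templating.
    Tokenizers.PostProcessor.byte_level(trim_offsets: false),
    Tokenizers.PostProcessor.template(
      single: "<s> $A </s>",
      special_tokens: [{"<s>", 0}, {"</s>", 2}]
    )
  ])
```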
Creates a Template post-processor.
Lets you easily template the post-processing, adding special tokens and specifying the type id for each sequence/special token. The template is given two strings representing the single sequence and the pair of sequences, as well as a set of special tokens to use.
For example, when specifying a template with these values:
- single: "[CLS] $A [SEP]"
- pair: "[CLS] $A [SEP] $B [SEP]"
- special tokens: "[CLS]", "[SEP]"
Input: ("I like this", "but not this")
Output: "[CLS] I like this [SEP] but not this [SEP]"
Options
:single - a string describing the template for a single sequence
:pair - a string describing the template for a pair of sequences
:special_tokens - a list of special tokens to use in the template. Must be a list of {token, token_id} tuples.
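Putting the example above together as a sketch (the ids 101 and 102 are BERT's usual [CLS]/[SEP] ids and are an assumption here; use the ids from your own vocabulary):

```elixir
post_processor =
  Tokenizers.PostProcessor.template(
    single: "[CLS] $A [SEP]",
    pair: "[CLS] $A [SEP] $B [SEP]",
    special_tokens: [{"[CLS]", 101}, {"[SEP]", 102}]
  )
```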