Tokenizers.PostProcessor (Tokenizers v0.4.0)

Post-processors.

After the whole pipeline has run, we sometimes want to insert special tokens before feeding the encoded text into a model, as in "[CLS] My horse is amazing [SEP]". A post-processor does that.

Summary

Functions

bert(sep, cls)

Creates a Bert post-processor with the given tokens.

byte_level(opts \\ [])

Creates a ByteLevel post-processor.

roberta(sep, cls, opts \\ [])

Creates a Roberta post-processor.

sequence(post_processors)

Instantiates a new Sequence post-processor.

template(opts \\ [])

Creates a Template post-processor.

Types

t()

@type t() :: %Tokenizers.PostProcessor{resource: reference()}

Functions

bert(sep, cls)

@spec bert({String.t(), integer()}, {String.t(), integer()}) :: t()

Creates a Bert post-processor with the given tokens.
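
A minimal sketch of the call, assuming the package is fetched with `Mix.install` and that the IDs 102 and 101 correspond to `[SEP]` and `[CLS]` in your vocabulary. Those IDs are illustrative; check them against your own tokenizer.

```elixir
Mix.install([{:tokenizers, "~> 0.4.0"}])

alias Tokenizers.PostProcessor

# Each argument is a {token, token_id} tuple. Note the order: sep first, then cls.
# The IDs below are assumptions and must agree with the tokenizer's vocabulary.
bert_pp = PostProcessor.bert({"[SEP]", 102}, {"[CLS]", 101})
```

Per the spec, the result is a `%Tokenizers.PostProcessor{}` struct wrapping a native resource.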

byte_level(opts \\ [])

@spec byte_level(keyword()) :: t()

Creates a ByteLevel post-processor.

Options

:trim_offsets - whether to trim the whitespaces in the produced offsets. Defaults to true
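
As a short sketch of the option above, the following builds one post-processor with the default and one with offset trimming turned off. The variable names are only for illustration.

```elixir
Mix.install([{:tokenizers, "~> 0.4.0"}])

alias Tokenizers.PostProcessor

# No options: :trim_offsets keeps its documented default of true.
byte_level_default = PostProcessor.byte_level()

# Keep surrounding whitespace inside the produced offsets.
byte_level_untrimmed = PostProcessor.byte_level(trim_offsets: false)
```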

roberta(sep, cls, opts \\ [])

@spec roberta({String.t(), integer()}, {String.t(), integer()}, keyword()) :: t()

Creates a Roberta post-processor.

Options

:trim_offsets - whether to trim the whitespaces in the produced offsets. Defaults to true
:add_prefix_space - whether add_prefix_space was ON during the pre-tokenization. Defaults to true
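
A hedged sketch of a call with both options. It assumes the first option key is spelled `:trim_offsets`, as it is for `byte_level/1`, and that `</s>` and `<s>` have the IDs 2 and 0 in your vocabulary. Both are assumptions to verify against your setup.

```elixir
Mix.install([{:tokenizers, "~> 0.4.0"}])

alias Tokenizers.PostProcessor

# sep and cls are {token, token_id} tuples; the IDs here are illustrative.
roberta_pp =
  PostProcessor.roberta(
    {"</s>", 2},
    {"<s>", 0},
    # Assumed spelling of the option key, matching byte_level/1.
    trim_offsets: true,
    # Set to match whatever the pre-tokenizer was configured with.
    add_prefix_space: false
  )
```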

sequence(post_processors)

@spec sequence(post_processors :: [t()]) :: t()

Instantiates a new Sequence post-processor.
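
One possible use, sketched under assumptions: combining a ByteLevel post-processor with a Template post-processor into a single one. The pairing, the template strings, and the token IDs are illustrative choices, not something this module prescribes.

```elixir
Mix.install([{:tokenizers, "~> 0.4.0"}])

alias Tokenizers.PostProcessor

# First post-processor in the list: ByteLevel with offset trimming disabled.
byte_level_step = PostProcessor.byte_level(trim_offsets: false)

# Second post-processor: a template that wraps the input in special tokens.
# Token IDs are assumptions and must match the tokenizer's vocabulary.
template_step =
  PostProcessor.template(
    single: "[CLS] $A [SEP]",
    pair: "[CLS] $A [SEP] $B [SEP]",
    special_tokens: [{"[CLS]", 101}, {"[SEP]", 102}]
  )

# sequence/1 takes a list of post-processors and returns a single one.
combined = PostProcessor.sequence([byte_level_step, template_step])
```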

template(opts \\ [])

@spec template(keyword()) :: t()

Creates a Template post-processor.

Lets you easily template the post-processing, adding special tokens and specifying the type id for each sequence or special token. The template takes two strings, one describing a single sequence and one describing a pair of sequences, together with the set of special tokens to use.

For example, when specifying a template with these values:

single: "[CLS] $A [SEP]"
pair: "[CLS] $A [SEP] $B [SEP]"
special tokens:
- "[CLS]"
- "[SEP]"

Input: ("I like this", "but not this")
Output: "[CLS] I like this [SEP] but not this [SEP]"

Options

:single - a string describing the template for a single sequence
:pair - a string describing the template for a pair of sequences
:special_tokens - a list of special tokens to use in the template. Must be a list of {token, token_id} tuples
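
The template from the example above could be built roughly as follows. This is a sketch: the token IDs 101 and 102 are assumptions and must match the IDs of `[CLS]` and `[SEP]` in the tokenizer's vocabulary.

```elixir
Mix.install([{:tokenizers, "~> 0.4.0"}])

alias Tokenizers.PostProcessor

template_pp =
  PostProcessor.template(
    # $A stands for the first sequence, $B for the second.
    single: "[CLS] $A [SEP]",
    pair: "[CLS] $A [SEP] $B [SEP]",
    # Every special token used in the templates is listed as {token, token_id}.
    special_tokens: [{"[CLS]", 101}, {"[SEP]", 102}]
  )
```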
