Tokenizers.PostProcessor (Tokenizers v0.4.0)

Post-processors.

After the rest of the pipeline has run, we sometimes want to insert special tokens before feeding the encoded text into a model, for example "[CLS] My horse is amazing [SEP]". A post-processor does this.

Summary

Functions

Creates a Bert post-processor with the given tokens.

Creates a ByteLevel post-processor.

Creates a Roberta post-processor.

Instantiates a new Sequence post-processor.

Creates a Template post-processor.

Types

@type t() :: %Tokenizers.PostProcessor{resource: reference()}

Functions

@spec bert({String.t(), integer()}, {String.t(), integer()}) :: t()

Creates a Bert post-processor with the given tokens.
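
As a rough illustration, a Bert post-processor might be built as shown here. The token IDs 102 and 101 are assumptions borrowed from the common bert-base-uncased vocabulary, and the argument order (separator first, then classifier) is assumed to mirror roberta(sep, cls, opts); check both against your own tokenizer.

```elixir
# Assumes the :tokenizers package can be fetched, e.g. in a script or IEx session.
Mix.install([{:tokenizers, "~> 0.4.0"}])

alias Tokenizers.PostProcessor

# Each token is a {token, token_id} tuple.
# The IDs below are assumptions (bert-base-uncased); look them up in your vocabulary.
sep = {"[SEP]", 102}
cls = {"[CLS]", 101}

# Assumed argument order: sep first, then cls.
bert = PostProcessor.bert(sep, cls)

# The result is an opaque struct wrapping a native resource.
%PostProcessor{} = bert
```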

@spec byte_level(keyword()) :: t()

Creates a ByteLevel post-processor.

Options

  • :trim_offsets - whether to trim the whitespaces in the produced offsets. Defaults to true
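
A minimal sketch of the option in use. The variable names are illustrative only, and the options are passed explicitly rather than relying on defaults.

```elixir
# Assumes the :tokenizers package can be fetched, e.g. in a script or IEx session.
Mix.install([{:tokenizers, "~> 0.4.0"}])

alias Tokenizers.PostProcessor

# Trim whitespace from the produced offsets (the documented default).
trimming = PostProcessor.byte_level(trim_offsets: true)

# Keep offsets exactly as the byte-level pre-tokenizer produced them.
non_trimming = PostProcessor.byte_level(trim_offsets: false)

# Both calls return an opaque post-processor struct.
%PostProcessor{} = trimming
%PostProcessor{} = non_trimming
```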

roberta(sep, cls, opts \\ [])

@spec roberta({String.t(), integer()}, {String.t(), integer()}, keyword()) :: t()

Creates a Roberta post-processor.

Options

  • :trim_offsets - whether to trim the whitespaces in the produced offsets. Defaults to true

  • :add_prefix_space - whether add_prefix_space was ON during the pre-tokenization. Defaults to true
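
The call below sketches how these options might be combined. The token IDs 2 and 0 are assumptions taken from the common roberta-base vocabulary, and the option name is assumed to be spelled :trim_offsets, as in byte_level; verify both against your own setup.

```elixir
# Assumes the :tokenizers package can be fetched, e.g. in a script or IEx session.
Mix.install([{:tokenizers, "~> 0.4.0"}])

alias Tokenizers.PostProcessor

# {token, token_id} tuples; the IDs are assumptions (roberta-base).
sep = {"</s>", 2}
cls = {"<s>", 0}

# Argument order follows the documented signature: roberta(sep, cls, opts).
roberta =
  PostProcessor.roberta(sep, cls,
    trim_offsets: true,
    # Set this to match the pre-tokenizer configuration actually used.
    add_prefix_space: false
  )

%PostProcessor{} = roberta
```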

sequence(post_processors)

@spec sequence(post_processors :: [t()]) :: t()

Instantiates a new Sequence post-processor that chains the given post-processors.
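
For illustration, a byte-level step and a template step could be combined like this. The token IDs are placeholder assumptions, and the sketch assumes the post-processors are applied in list order.

```elixir
# Assumes the :tokenizers package can be fetched, e.g. in a script or IEx session.
Mix.install([{:tokenizers, "~> 0.4.0"}])

alias Tokenizers.PostProcessor

# First step: byte-level offset handling.
byte_level = PostProcessor.byte_level(trim_offsets: false)

# Second step: wrap sequences in special tokens.
# Token IDs are assumptions; use the IDs from your own vocabulary.
template =
  PostProcessor.template(
    single: "[CLS] $A [SEP]",
    pair: "[CLS] $A [SEP] $B [SEP]",
    special_tokens: [{"[CLS]", 101}, {"[SEP]", 102}]
  )

# sequence/1 takes a list of post-processors and returns a single one.
chain = PostProcessor.sequence([byte_level, template])

%PostProcessor{} = chain
```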

@spec template(keyword()) :: t()

Creates a Template post-processor.

Lets you easily template the post-processing, adding special tokens and specifying the type id for each sequence or special token. The template takes two strings, one describing a single sequence and one describing a pair of sequences, plus a set of special tokens to use.

For example, when specifying a template with these values:

  • single: "[CLS] $A [SEP]"
  • pair: "[CLS] $A [SEP] $B [SEP]"
  • special tokens:
    • "[CLS]"
    • "[SEP]"

Input: ("I like this", "but not this")
Output: "[CLS] I like this [SEP] but not this [SEP]"

Options

  • :single - a string describing the template for a single sequence

  • :pair - a string describing the template for a pair of sequences

  • :special_tokens - a list of special tokens to use in the template. Must be a list of {token, token_id} tuples
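
The options above might be put together as in this sketch, which reuses the template strings from the example. The token IDs 101 and 102 are assumptions (bert-base-uncased style). The ":1" type-id suffix in the second template follows the syntax of the underlying Hugging Face tokenizers library and is assumed to pass through the Elixir binding unchanged.

```elixir
# Assumes the :tokenizers package can be fetched, e.g. in a script or IEx session.
Mix.install([{:tokenizers, "~> 0.4.0"}])

alias Tokenizers.PostProcessor

# Special tokens as {token, token_id} tuples; the IDs are assumptions.
special_tokens = [{"[CLS]", 101}, {"[SEP]", 102}]

# $A and $B stand for the first and second input sequence.
template =
  PostProcessor.template(
    single: "[CLS] $A [SEP]",
    pair: "[CLS] $A [SEP] $B [SEP]",
    special_tokens: special_tokens
  )

# Variant that assigns type id 1 to the second sequence and its separator
# (assumed syntax, borrowed from the underlying library).
typed_template =
  PostProcessor.template(
    single: "[CLS] $A [SEP]",
    pair: "[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens: special_tokens
  )

%PostProcessor{} = template
%PostProcessor{} = typed_template
```

Every special token that appears in a template string should also be listed in :special_tokens so that it can be mapped to an ID.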