Tokenizers.PreTokenizer (Tokenizers v0.5.1)
Pre-tokenizers.
A pre-tokenizer takes care of splitting the input according to a set of rules. This pre-processing lets you ensure that the underlying model does not build tokens across multiple "splits". For example, if you don't want to have whitespace inside a token, you can have a pre-tokenizer that splits on whitespace.
You can easily combine multiple pre-tokenizers together using sequence/1.
A pre-tokenizer is also allowed to modify the string, just like a normalizer does. This is necessary to allow some complicated algorithms that require splitting before normalizing (e.g. ByteLevel).
Summary
Types
Specifies how the delimiter should behave for several pre-tokenizers.
Functions
Creates a BertPreTokenizer pre-tokenizer.
Creates a ByteLevel pre-tokenizer.
Gets ByteLevel pre-tokenizer's alphabet.
Creates a CharDelimiterSplit pre-tokenizer.
Creates a Digits pre-tokenizer.
Creates a Metaspace pre-tokenizer.
Converts a string into a sequence of pre-tokens.
Creates a Punctuation pre-tokenizer.
Creates a Sequence pre-tokenizer.
Creates a Split pre-tokenizer using a string as split pattern.
Creates a Split pre-tokenizer using a regular expression as split pattern.
Creates a Whitespace pre-tokenizer.
Creates a WhitespaceSplit pre-tokenizer.
Types
@type split_delimiter_behaviour() ::
:removed | :isolated | :merged_with_previous | :merged_with_next | :contiguous
Specifies how the delimiter should behave for several pre-tokenizers.
@type t() :: %Tokenizers.PreTokenizer{resource: reference()}
Functions
@spec bert_pre_tokenizer() :: t()
Creates a BertPreTokenizer pre-tokenizer.
Splits on whitespace and punctuation, as used for BERT models.
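A minimal sketch of what this splitting could look like via pre_tokenize/2 (the exact pre-tokens and offsets shown are illustrative):
iex> pre_tokenizer = Tokenizers.PreTokenizer.bert_pre_tokenizer()
iex> Tokenizers.PreTokenizer.pre_tokenize(pre_tokenizer, "Hey, friend!")
{:ok, [{"Hey", {0, 3}}, {",", {3, 4}}, {"friend", {5, 11}}, {"!", {11, 12}}]}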
Creates a ByteLevel pre-tokenizer.
Splits on whitespace while remapping all the bytes to a set of visible characters. This technique was introduced by OpenAI with GPT-2 and has some more or less nice properties:
Since it maps on bytes, a tokenizer using this only requires 256 characters as an initial alphabet (the number of values a byte can have), as opposed to the 130,000+ Unicode characters.
A consequence of the previous point is that it is absolutely unnecessary to have an unknown token, since we can represent anything with 256 tokens (Youhou!!).
For non-ASCII characters, it gets completely unreadable, but it works nonetheless!
Options
:add_prefix_space - whether to add a space to the first word if there isn't already one. This lets us treat "hello" exactly like "say hello". Defaults to true.
:use_regex - set this to false to prevent this pre-tokenizer from using the GPT-2 specific regexp for splitting on whitespace. Defaults to true.
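As a minimal sketch, this is how one might construct the pre-tokenizer without the prefix space (the inspected output is illustrative):
iex> Tokenizers.PreTokenizer.byte_level(add_prefix_space: false)
#Tokenizers.PreTokenizer<[pre_tokenizer_type: "ByteLevel"]>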
@spec byte_level_alphabet() :: charlist()
Gets ByteLevel pre-tokenizer's alphabet.
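Since the alphabet covers every possible byte value, it should contain exactly 256 characters; a quick sketch of checking that:
iex> Tokenizers.PreTokenizer.byte_level_alphabet() |> length()
256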
Creates a CharDelimiterSplit pre-tokenizer.
This pre-tokenizer simply splits on the provided delimiter. It works almost like a simple split function, except that it accounts for multiple consecutive spaces.
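A minimal sketch, assuming the delimiter is passed as a single codepoint (e.g. ?-) and with illustrative pre-token output:
iex> pre_tokenizer = Tokenizers.PreTokenizer.char_delimiter_split(?-)
iex> Tokenizers.PreTokenizer.pre_tokenize(pre_tokenizer, "pre-tokenizer")
{:ok, [{"pre", {0, 3}}, {"tokenizer", {4, 13}}]}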
Creates a Digits pre-tokenizer.
Splits the numbers from any other characters.
Options
:individual_digits - whether to split individual digits or not. Defaults to false.
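A sketch of splitting with individual digits (the exact pre-tokens and offsets shown are illustrative):
iex> pre_tokenizer = Tokenizers.PreTokenizer.digits(individual_digits: true)
iex> Tokenizers.PreTokenizer.pre_tokenize(pre_tokenizer, "Call 123")
{:ok, [{"Call ", {0, 5}}, {"1", {5, 6}}, {"2", {6, 7}}, {"3", {7, 8}}]}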
Creates a Metaspace pre-tokenizer.
Splits on whitespace and replaces it with a special character "▁" (U+2581).
Options
:replacement - the replacement character to use. Defaults to "▁".
:prepend_scheme - whether to add a space to the first word if there isn't already one. This lets us treat "hello" exactly like "say hello". Either of :always, :never or :first. :first means the space is only added on the first token (relevant when special tokens or other pre-tokenizers are used). Defaults to :always.
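For example, one might create it so the space is only prepended to the first token (a sketch; the inspected output is illustrative):
iex> Tokenizers.PreTokenizer.metaspace(prepend_scheme: :first)
#Tokenizers.PreTokenizer<[pre_tokenizer_type: "Metaspace"]>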
Converts a string into a sequence of pre-tokens.
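This is useful to see what a pre-tokenizer does to a string before it reaches the model. A minimal sketch with a whitespace pre-tokenizer (the exact offsets shown are illustrative):
iex> pre_tokenizer = Tokenizers.PreTokenizer.whitespace()
iex> Tokenizers.PreTokenizer.pre_tokenize(pre_tokenizer, "Hello world")
{:ok, [{"Hello", {0, 5}}, {"world", {6, 11}}]}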
@spec punctuation(split_delimiter_behaviour()) :: t()
Creates a Punctuation pre-tokenizer.
Will isolate all punctuation characters.
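A sketch of isolating punctuation (the pre-token output shown is illustrative):
iex> pre_tokenizer = Tokenizers.PreTokenizer.punctuation(:isolated)
iex> Tokenizers.PreTokenizer.pre_tokenize(pre_tokenizer, "Hi!!")
{:ok, [{"Hi", {0, 2}}, {"!", {2, 3}}, {"!", {3, 4}}]}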
Creates a Sequence pre-tokenizer.
Lets you compose multiple pre-tokenizers that will be run in the given order.
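For instance, to first split on whitespace and then isolate punctuation, the two could be composed as below (a sketch; the inspected output is illustrative):
iex> Tokenizers.PreTokenizer.sequence([
...>   Tokenizers.PreTokenizer.whitespace_split(),
...>   Tokenizers.PreTokenizer.punctuation(:isolated)
...> ])
#Tokenizers.PreTokenizer<[pre_tokenizer_type: "Sequence"]>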
@spec split(String.t(), split_delimiter_behaviour(), keyword()) :: t()
Creates a Split pre-tokenizer using a string as split pattern.
Versatile pre-tokenizer that splits on the provided pattern according to the provided behavior.
Options
:invert - whether to invert the split or not. Defaults to false.
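Following the same shape as the split_regex/3 example below, a literal-pattern split could be created like this (a sketch; the inspected output is illustrative):
iex> Tokenizers.PreTokenizer.split(" ", :removed)
#Tokenizers.PreTokenizer<[pre_tokenizer_type: "Split"]>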
@spec split_regex(String.t(), split_delimiter_behaviour(), keyword()) :: t()
Creates a Split pre-tokenizer using a regular expression as split pattern.
Versatile pre-tokenizer that splits on the provided regex pattern according to the provided behavior.
The pattern should be a string representing a regular expression according to the Oniguruma Regex Engine.
Options
:invert - whether to invert the split or not. Defaults to false.
Example
iex> Tokenizers.PreTokenizer.split_regex(~S(\?\d{2}\?), :removed)
#Tokenizers.PreTokenizer<[pre_tokenizer_type: "Split"]>
@spec whitespace() :: t()
Creates a Whitespace pre-tokenizer.
Splits on word boundaries. Uses the following regular expression: \w+|[^\w\s]+
@spec whitespace_split() :: t()
Creates a WhitespaceSplit pre-tokenizer.
Splits on any whitespace character.
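Unlike whitespace/0, punctuation stays attached to the neighbouring word. A sketch (the pre-token output shown is illustrative):
iex> pre_tokenizer = Tokenizers.PreTokenizer.whitespace_split()
iex> Tokenizers.PreTokenizer.pre_tokenize(pre_tokenizer, "Hello, world!")
{:ok, [{"Hello,", {0, 6}}, {"world!", {7, 13}}]}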