Tokenizers.PreTokenizer (Tokenizers v0.4.0)
Pre-tokenizers.
A pre-tokenizer takes care of splitting the input according to a set of rules. This pre-processing lets you ensure that the underlying model does not build tokens across multiple “splits”. For example, if you don’t want whitespace inside a token, you can use a pre-tokenizer that splits on whitespace.
You can easily combine multiple pre-tokenizers together using sequence/1.
A pre-tokenizer is also allowed to modify the string, just like a normalizer does. This is necessary to allow some complicated algorithms that require splitting before normalizing (e.g. ByteLevel).
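For example, here is a minimal sketch (not verified output; it assumes the constructors' option lists default to empty when omitted) that combines two pre-tokenizers and inspects the resulting pre-tokens:

    pre_tokenizer =
      Tokenizers.PreTokenizer.sequence([
        Tokenizers.PreTokenizer.whitespace_split(),
        Tokenizers.PreTokenizer.digits()
      ])

    # Each pre-token comes with its offsets in the original string.
    {:ok, pre_tokens} = Tokenizers.PreTokenizer.pre_tokenize(pre_tokenizer, "abc 123")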
Summary
Types
Specifies how the delimiter should behave for several pre-tokenizers.
Functions
Creates a BertPreTokenizer pre-tokenizer.
Creates a ByteLevel pre-tokenizer.
Gets ByteLevel pre-tokenizer's alphabet.
Creates a CharDelimiterSplit pre-tokenizer.
Creates a Digits pre-tokenizer.
Creates Metaspace pre-tokenizer.
Converts a string into a sequence of pre-tokens.
Creates a Punctuation pre-tokenizer.
Creates a Sequence pre-tokenizer.
Creates a Split pre-tokenizer.
Creates a Whitespace pre-tokenizer.
Creates a WhitespaceSplit pre-tokenizer.
Types
@type split_delimiter_behaviour() ::
:removed | :isolated | :merged_with_previous | :merged_with_next | :contiguous
Specifies how the delimiter should behave for several pre-tokenizers.
@type t() :: %Tokenizers.PreTokenizer{resource: reference()}
Functions
@spec bert_pre_tokenizer() :: t()
Creates a BertPreTokenizer pre-tokenizer.
Splits the input for use in BERT models.
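A quick way to inspect its behaviour (a sketch; the pre-tokens listed in the comment are indicative, not verified output):

    pre_tokenizer = Tokenizers.PreTokenizer.bert_pre_tokenizer()
    {:ok, pre_tokens} = Tokenizers.PreTokenizer.pre_tokenize(pre_tokenizer, "Hello, world!")
    # Splits on whitespace and punctuation, e.g. "Hello", ",", "world", "!"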
Creates a ByteLevel pre-tokenizer.
Splits on whitespace while remapping all bytes to a set of visible characters. This technique was introduced by OpenAI with GPT-2 and has some more or less nice properties:
Since it maps on bytes, a tokenizer using this only requires 256 characters as initial alphabet (the number of values a byte can have), as opposed to the 130,000+ Unicode characters.
A consequence of the previous point is that it is absolutely unnecessary to have an unknown token when using this pre-tokenizer, since we can represent anything with 256 tokens (Youhou!! 🎉🎉)
For non-ASCII characters, it gets completely unreadable, but it works nonetheless!
Options
:add_prefix_space - whether to add a space to the first word if there isn’t already one. This lets us treat "hello" exactly like "say hello". Defaults to true.
:use_regex - set this to false to prevent this pre-tokenizer from using the GPT-2 specific regexp for splitting on whitespace. Defaults to true.
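A minimal sketch of constructing it with non-default options (assuming the options list defaults to empty when omitted):

    pre_tokenizer = Tokenizers.PreTokenizer.byte_level(add_prefix_space: false)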
@spec byte_level_alphabet() :: charlist()
Gets ByteLevel pre-tokenizer's alphabet.
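A sketch of retrieving the alphabet; per the description above it should contain one entry per possible byte value:

    alphabet = Tokenizers.PreTokenizer.byte_level_alphabet()
    # A charlist with 256 entries, one visible character per byte value.
    length(alphabet)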
Creates a CharDelimiterSplit pre-tokenizer.
This pre-tokenizer simply splits on the provided delimiter. Works almost like a simple split function, except that it accounts for multiple consecutive spaces.
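A sketch of its use, assuming the delimiter is passed as a single character (codepoint) rather than a string:

    # Assumption: the delimiter is a char, e.g. ?- to split on "-".
    pre_tokenizer = Tokenizers.PreTokenizer.char_delimiter_split(?-)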
Creates a Digits pre-tokenizer.
Splits the numbers from any other characters.
Options
:individual_digits - whether to split individual digits or not. Defaults to false.
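A sketch (the pre-tokens in the comment are indicative, not verified output):

    pre_tokenizer = Tokenizers.PreTokenizer.digits(individual_digits: true)
    {:ok, pre_tokens} = Tokenizers.PreTokenizer.pre_tokenize(pre_tokenizer, "year 2024")
    # With individual_digits: true, "2024" is split into "2", "0", "2", "4".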
Creates Metaspace pre-tokenizer.
Splits on whitespace and replaces it with a special character “▁” (U+2581).
Options
:replacement - the replacement character to use. Defaults to "▁".
:add_prefix_space - whether to add a space to the first word if there isn’t already one. This lets us treat "hello" exactly like "say hello". Defaults to true.
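A minimal sketch with a non-default option (assuming the options list defaults to empty when omitted):

    pre_tokenizer = Tokenizers.PreTokenizer.metaspace(add_prefix_space: false)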
Converts a string into a sequence of pre-tokens.
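A sketch of its use (the result shape in the comment pairs each pre-token with its offsets and is indicative, not verified output):

    pre_tokenizer = Tokenizers.PreTokenizer.whitespace()
    {:ok, pre_tokens} = Tokenizers.PreTokenizer.pre_tokenize(pre_tokenizer, "Hello there")
    # e.g. [{"Hello", {0, 5}}, {"there", {6, 11}}]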
@spec punctuation(split_delimiter_behaviour()) :: t()
Creates a Punctuation pre-tokenizer.
Will isolate all punctuation characters.
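A minimal sketch using one of the behaviours listed in split_delimiter_behaviour():

    pre_tokenizer = Tokenizers.PreTokenizer.punctuation(:isolated)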
Creates a Sequence pre-tokenizer.
Lets you compose multiple pre-tokenizers that will be run in the given order.
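For instance, a sketch composing two of the pre-tokenizers above, run in the given order:

    pre_tokenizer =
      Tokenizers.PreTokenizer.sequence([
        Tokenizers.PreTokenizer.whitespace_split(),
        Tokenizers.PreTokenizer.punctuation(:isolated)
      ])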
@spec split(String.t(), split_delimiter_behaviour(), keyword()) :: t()
Creates a Split pre-tokenizer.
Versatile pre-tokenizer that splits on the provided pattern and according to the provided behaviour. The pattern can be inverted if necessary.
Options
:invert - whether to invert the split or not. Defaults to false.
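A sketch splitting on a space and dropping the delimiter (assuming the pattern is treated as a literal string here):

    pre_tokenizer = Tokenizers.PreTokenizer.split(" ", :removed, invert: false)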
@spec whitespace() :: t()
Creates a Whitespace pre-tokenizer.
Splits on word boundaries. Uses the following regular expression: \w+|[^\w\s]+.
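A sketch illustrating the difference from whitespace_split/0 (the pre-tokens in the comment are indicative, not verified output):

    pre_tokenizer = Tokenizers.PreTokenizer.whitespace()
    {:ok, pre_tokens} = Tokenizers.PreTokenizer.pre_tokenize(pre_tokenizer, "don't panic!")
    # Word characters and non-word characters end up in separate pre-tokens,
    # e.g. "don", "'", "t", "panic", "!"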
@spec whitespace_split() :: t()
Creates a WhitespaceSplit pre-tokenizer.
Splits on any whitespace character.
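A sketch (the pre-tokens in the comment are indicative, not verified output):

    pre_tokenizer = Tokenizers.PreTokenizer.whitespace_split()
    {:ok, pre_tokens} = Tokenizers.PreTokenizer.pre_tokenize(pre_tokenizer, "don't panic!")
    # Unlike whitespace/0, punctuation stays attached, e.g. "don't", "panic!"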