Tokenizers.PreTokenizer (Tokenizers v0.5.1)
Pre-tokenizers.
A pre-tokenizer takes care of splitting the input according to a set of rules. This pre-processing lets you ensure that the underlying model does not build tokens across multiple "splits". For example, if you don't want to have whitespace inside a token, you can have a pre-tokenizer that splits on whitespace.
You can easily combine multiple pre-tokenizers together using sequence/1.
A pre-tokenizer is also allowed to modify the string, just like a normalizer does. This is necessary to allow some complicated algorithms that require splitting before normalizing (e.g. ByteLevel).
Summary
Types
Specifies how the delimiter should behave for several pre-tokenizers.
Functions
Creates a BertPreTokenizer pre-tokenizer.
Creates a ByteLevel pre-tokenizer.
Gets ByteLevel pre-tokenizer's alphabet.
Creates a CharDelimiterSplit pre-tokenizer.
Creates a Digits pre-tokenizer.
Creates a Metaspace pre-tokenizer.
Converts a string into a sequence of pre-tokens.
Creates a Punctuation pre-tokenizer.
Creates a Sequence pre-tokenizer.
Creates a Split pre-tokenizer using a string as split pattern.
Creates a Split pre-tokenizer using a regular expression as split pattern.
Creates a Whitespace pre-tokenizer.
Creates a WhitespaceSplit pre-tokenizer.
Types
@type split_delimiter_behaviour() ::
:removed | :isolated | :merged_with_previous | :merged_with_next | :contiguous
Specifies how the delimiter should behave for several pre-tokenizers.
@type t() :: %Tokenizers.PreTokenizer{resource: reference()}
Functions
@spec bert_pre_tokenizer() :: t()
Creates a BertPreTokenizer pre-tokenizer.
Splits on whitespace and punctuation, as used for BERT models.
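A minimal sketch of what this splitting could look like via pre_tokenize/2 (the exact pre-tokens and offsets shown are illustrative):
iex> pre_tokenizer = Tokenizers.PreTokenizer.bert_pre_tokenizer()
iex> Tokenizers.PreTokenizer.pre_tokenize(pre_tokenizer, "Hey, friend!")
{:ok, [{"Hey", {0, 3}}, {",", {3, 4}}, {"friend", {5, 11}}, {"!", {11, 12}}]}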
Creates a ByteLevel pre-tokenizer.
Splits on whitespace while remapping all the bytes to a set of visible characters. This technique was introduced by OpenAI with GPT-2 and has some more or less nice properties:
Since it maps on bytes, a tokenizer using this only requires 256 characters as an initial alphabet (the number of values a byte can have), as opposed to the 130,000+ Unicode characters.
A consequence of the previous point is that it is absolutely unnecessary to have an unknown token, since we can represent anything with 256 tokens (Youhou!!).
For non-ASCII characters, it gets completely unreadable, but it works nonetheless!
Options
:add_prefix_space - whether to add a space to the first word if there isn't already one. This lets us treat "hello" exactly like "say hello". Defaults to true.
:use_regex - set this to false to prevent this pre-tokenizer from using the GPT-2 specific regexp for splitting on whitespace. Defaults to true.
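As a minimal sketch, this is how one might construct the pre-tokenizer without the prefix space (the inspected output is illustrative):
iex> Tokenizers.PreTokenizer.byte_level(add_prefix_space: false)
#Tokenizers.PreTokenizer<[pre_tokenizer_type: "ByteLevel"]>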
@spec byte_level_alphabet() :: charlist()
Gets ByteLevel pre-tokenizer's alphabet.
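Since the alphabet covers every possible byte value, it should contain exactly 256 characters; a quick sketch of checking that:
iex> Tokenizers.PreTokenizer.byte_level_alphabet() |> length()
256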
Creates a CharDelimiterSplit pre-tokenizer.
This pre-tokenizer simply splits on the provided delimiter. It works almost like a simple split function, except that it accounts for multiple consecutive spaces.
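A minimal sketch, assuming the delimiter is passed as a single codepoint (e.g. ?-) and with illustrative pre-token output:
iex> pre_tokenizer = Tokenizers.PreTokenizer.char_delimiter_split(?-)
iex> Tokenizers.PreTokenizer.pre_tokenize(pre_tokenizer, "pre-tokenizer")
{:ok, [{"pre", {0, 3}}, {"tokenizer", {4, 13}}]}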
Creates a Digits pre-tokenizer.
Splits the numbers from any other characters.
Options
:individual_digits - whether to split individual digits or not. Defaults to false.
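A sketch of splitting with individual digits (the exact pre-tokens and offsets shown are illustrative):
iex> pre_tokenizer = Tokenizers.PreTokenizer.digits(individual_digits: true)
iex> Tokenizers.PreTokenizer.pre_tokenize(pre_tokenizer, "Call 123")
{:ok, [{"Call ", {0, 5}}, {"1", {5, 6}}, {"2", {6, 7}}, {"3", {7, 8}}]}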
Creates a Metaspace pre-tokenizer.
Splits on whitespace and replaces it with a special character "▁" (U+2581).
Options
:replacement - the replacement character to use. Defaults to "▁".
:prepend_scheme - whether to add a space to the first word if there isn't already one. This lets us treat "hello" exactly like "say hello". Either of :always, :never or :first. :first means the space is only added on the first token (relevant when special tokens or other pre-tokenizers are used). Defaults to :always.
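For example, one might create it so the space is only prepended to the first token (a sketch; the inspected output is illustrative):
iex> Tokenizers.PreTokenizer.metaspace(prepend_scheme: :first)
#Tokenizers.PreTokenizer<[pre_tokenizer_type: "Metaspace"]>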
Converts a string into a sequence of pre-tokens.
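This is useful to see what a pre-tokenizer does to a string before it reaches the model. A minimal sketch with a whitespace pre-tokenizer (the exact offsets shown are illustrative):
iex> pre_tokenizer = Tokenizers.PreTokenizer.whitespace()
iex> Tokenizers.PreTokenizer.pre_tokenize(pre_tokenizer, "Hello world")
{:ok, [{"Hello", {0, 5}}, {"world", {6, 11}}]}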
@spec punctuation(split_delimiter_behaviour()) :: t()
Creates a Punctuation pre-tokenizer.
Will isolate all punctuation characters.
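A sketch of isolating punctuation (the pre-token output shown is illustrative):
iex> pre_tokenizer = Tokenizers.PreTokenizer.punctuation(:isolated)
iex> Tokenizers.PreTokenizer.pre_tokenize(pre_tokenizer, "Hi!!")
{:ok, [{"Hi", {0, 2}}, {"!", {2, 3}}, {"!", {3, 4}}]}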
Creates a Sequence pre-tokenizer.
Lets you compose multiple pre-tokenizers that will be run in the given order.
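For instance, to first split on whitespace and then isolate punctuation, the two could be composed as below (a sketch; the inspected output is illustrative):
iex> Tokenizers.PreTokenizer.sequence([
...>   Tokenizers.PreTokenizer.whitespace_split(),
...>   Tokenizers.PreTokenizer.punctuation(:isolated)
...> ])
#Tokenizers.PreTokenizer<[pre_tokenizer_type: "Sequence"]>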
@spec split(String.t(), split_delimiter_behaviour(), keyword()) :: t()
Creates a Split pre-tokenizer using a string as split pattern.
Versatile pre-tokenizer that splits on the provided pattern according to the provided behavior.
Options
:invert - whether to invert the split or not. Defaults to false.
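Following the same shape as the split_regex/3 example below, a literal-pattern split could be created like this (a sketch; the inspected output is illustrative):
iex> Tokenizers.PreTokenizer.split(" ", :removed)
#Tokenizers.PreTokenizer<[pre_tokenizer_type: "Split"]>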
@spec split_regex(String.t(), split_delimiter_behaviour(), keyword()) :: t()
Creates a Split pre-tokenizer using a regular expression as split pattern.
Versatile pre-tokenizer that splits on the provided regex pattern according to the provided behavior.
The pattern should be a string representing a regular expression according to the Oniguruma Regex Engine.
Options
:invert - whether to invert the split or not. Defaults to false.
Example
iex> Tokenizers.PreTokenizer.split_regex(~S(\?\d{2}\?), :removed)
#Tokenizers.PreTokenizer<[pre_tokenizer_type: "Split"]>
@spec whitespace() :: t()
Creates a Whitespace pre-tokenizer.
Splits on word boundaries. Uses the following regular expression: \w+|[^\w\s]+
@spec whitespace_split() :: t()
Creates a WhitespaceSplit pre-tokenizer.
Splits on any whitespace character.
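Unlike whitespace/0, punctuation stays attached to the neighbouring word. A sketch (the pre-token output shown is illustrative):
iex> pre_tokenizer = Tokenizers.PreTokenizer.whitespace_split()
iex> Tokenizers.PreTokenizer.pre_tokenize(pre_tokenizer, "Hello, world!")
{:ok, [{"Hello,", {0, 6}}, {"world!", {7, 13}}]}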