Tokenizers.PreTokenizer (Tokenizers v0.5.0)

Pre-tokenizers.

A pre-tokenizer takes care of splitting the input according to a set of rules. This pre-processing lets you ensure that the underlying model does not build tokens across multiple "splits". For example, if you don't want to have whitespaces inside a token, then you can have a pre-tokenizer that splits on these whitespaces.

You can easily combine multiple pre-tokenizers together using sequence/1.

A pre-tokenizer is also allowed to modify the string, just like a normalizer does. This is necessary to allow some complicated algorithms that require splitting before normalizing (e.g. ByteLevel).
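
For instance, here is a minimal sketch of composing pre-tokenizers with sequence/1 and applying them to a string. The tokens and offsets shown are illustrative expectations, not output copied from this documentation:

iex> alias Tokenizers.PreTokenizer
iex> seq = PreTokenizer.sequence([PreTokenizer.whitespace_split(), PreTokenizer.digits()])
iex> PreTokenizer.pre_tokenize(seq, "It costs 5USD")
{:ok, [{"It", {0, 2}}, {"costs", {3, 8}}, {"5", {9, 10}}, {"USD", {10, 13}}]}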

Summary

Types

split_delimiter_behaviour()

Specifies how the delimiter should behave for several pre-tokenizers.

t()

Functions

bert_pre_tokenizer()

Creates a BertPreTokenizer pre-tokenizer.

byte_level(opts \\ [])

Creates a ByteLevel pre-tokenizer.

byte_level_alphabet()

Gets the ByteLevel pre-tokenizer's alphabet.

char_delimiter_split(delimiter)

Creates a CharDelimiterSplit pre-tokenizer.

digits(opts \\ [])

Creates a Digits pre-tokenizer.

metaspace(opts \\ [])

Creates a Metaspace pre-tokenizer.

pre_tokenize(pre_tokenizer, input)

Converts a string into a sequence of pre-tokens.

punctuation(behavior)

Creates a Punctuation pre-tokenizer.

sequence(pre_tokenizers)

Creates a Sequence pre-tokenizer.

split(pattern, behavior, opts \\ [])

Creates a Split pre-tokenizer using a string as split pattern.

split_regex(pattern, behavior, opts \\ [])

Creates a Split pre-tokenizer using a regular expression as split pattern.

whitespace()

Creates a Whitespace pre-tokenizer.

whitespace_split()

Creates a WhitespaceSplit pre-tokenizer.

Types

split_delimiter_behaviour()
@type split_delimiter_behaviour() ::
  :removed | :isolated | :merged_with_previous | :merged_with_next | :contiguous

Specifies how the delimiter should behave for several pre-tokenizers.

@type t() :: %Tokenizers.PreTokenizer{resource: reference()}

Functions

bert_pre_tokenizer()

@spec bert_pre_tokenizer() :: t()

Creates a BertPreTokenizer pre-tokenizer.

Splits for use in Bert models.
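
A usage sketch; the expected splits are an assumption based on BERT-style pre-tokenization (splitting on whitespace and punctuation), not verified output:

iex> pre = Tokenizers.PreTokenizer.bert_pre_tokenizer()
iex> Tokenizers.PreTokenizer.pre_tokenize(pre, "Hey friend!")
{:ok, [{"Hey", {0, 3}}, {"friend", {4, 10}}, {"!", {10, 11}}]}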

byte_level(opts \\ [])

@spec byte_level(keyword()) :: t()

Creates a ByteLevel pre-tokenizer.

Splits on whitespaces while remapping all the bytes to a set of visible characters. This technique has been introduced by OpenAI with GPT-2 and has some more or less nice properties:

  • Since it maps on bytes, a tokenizer using this only requires 256 characters as initial alphabet (the number of values a byte can have), as opposed to the 130,000+ Unicode characters.

  • A consequence of the previous point is that it is absolutely unnecessary to have an unknown token using this since we can represent anything with 256 tokens (Youhou!! 🎉🎉)

  • For non-ASCII characters, it gets completely unreadable, but it works nonetheless!

Options

  • :add_prefix_space - whether to add a space to the first word if there isn't already one. This lets us treat "hello" exactly like "say hello". Defaults to true

  • :use_regex - set this to false to prevent this pre-tokenizer from using the GPT-2-specific regexp for splitting on whitespace. Defaults to true
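
A usage sketch; the "Ġ"-remapped form of the leading space is how the byte-level mapping typically renders it, and the exact output is illustrative rather than taken from these docs:

iex> pre = Tokenizers.PreTokenizer.byte_level(add_prefix_space: false)
iex> Tokenizers.PreTokenizer.pre_tokenize(pre, "Hello world")
{:ok, [{"Hello", {0, 5}}, {"Ġworld", {5, 11}}]}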

byte_level_alphabet()

@spec byte_level_alphabet() :: charlist()

Gets the ByteLevel pre-tokenizer's alphabet.
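
Since the byte-level alphabet covers every possible byte value, it should contain 256 entries (a sketch, with the result assumed from the description above):

iex> alphabet = Tokenizers.PreTokenizer.byte_level_alphabet()
iex> length(alphabet)
256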

char_delimiter_split(delimiter)
@spec char_delimiter_split(char()) :: t()

Creates a CharDelimiterSplit pre-tokenizer.

This pre-tokenizer simply splits on the provided delimiter. It works almost like a simple split function, except that it accounts for multiple consecutive delimiters.
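
A usage sketch with ?- as the delimiter (the expected output is an assumption, not verified against the library):

iex> pre = Tokenizers.PreTokenizer.char_delimiter_split(?-)
iex> Tokenizers.PreTokenizer.pre_tokenize(pre, "one-two-three")
{:ok, [{"one", {0, 3}}, {"two", {4, 7}}, {"three", {8, 13}}]}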

digits(opts \\ [])

@spec digits(keyword()) :: t()

Creates a Digits pre-tokenizer.

Splits the numbers from any other characters.

Options

  • :individual_digits - whether to split individual digits or not. Defaults to false
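
A sketch of the effect of :individual_digits (the splits and offsets are illustrative assumptions):

iex> pre = Tokenizers.PreTokenizer.digits(individual_digits: true)
iex> Tokenizers.PreTokenizer.pre_tokenize(pre, "year 2024")
{:ok, [{"year ", {0, 5}}, {"2", {5, 6}}, {"0", {6, 7}}, {"2", {7, 8}}, {"4", {8, 9}}]}

With the default individual_digits: false, "2024" would be kept as a single pre-token.
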
metaspace(opts \\ [])

@spec metaspace(keyword()) :: t()

Creates a Metaspace pre-tokenizer.

Splits on whitespaces and replaces them with the special character "▁" (U+2581).

Options

  • :replacement - the replacement character to use. Defaults to "▁"

  • :prepend_scheme - whether to add a space to the first word if there isn't already one. This lets us treat "hello" exactly like "say hello". One of :always, :never, or :first. :first means the space is only added on the first token (relevant when special tokens or other pre-tokenizers are used). Defaults to :always
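
A usage sketch with the default options (the "▁"-prefixed tokens and offsets are assumptions, not documented output):

iex> pre = Tokenizers.PreTokenizer.metaspace()
iex> Tokenizers.PreTokenizer.pre_tokenize(pre, "Hey friend!")
{:ok, [{"▁Hey", {0, 3}}, {"▁friend!", {3, 11}}]}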

pre_tokenize(pre_tokenizer, input)
@spec pre_tokenize(t(), String.t()) :: {:ok, [{String.t(), {integer(), integer()}}]}

Converts a string into a sequence of pre-tokens.
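
The result is an :ok tuple wrapping a list of {pre_token, {start, stop}} pairs, as in this sketch (the exact offsets are assumed, not documented):

iex> pre = Tokenizers.PreTokenizer.whitespace()
iex> {:ok, pre_tokens} = Tokenizers.PreTokenizer.pre_tokenize(pre, "Hello, world")
iex> pre_tokens
[{"Hello", {0, 5}}, {",", {5, 6}}, {"world", {7, 12}}]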

punctuation(behavior)

@spec punctuation(split_delimiter_behaviour()) :: t()

Creates a Punctuation pre-tokenizer.

Will isolate all punctuation characters.
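
A usage sketch with the :isolated behaviour; note that whitespace is left untouched, so it stays attached to the neighbouring word (output assumed, not verified):

iex> pre = Tokenizers.PreTokenizer.punctuation(:isolated)
iex> Tokenizers.PreTokenizer.pre_tokenize(pre, "Hey, you!")
{:ok, [{"Hey", {0, 3}}, {",", {3, 4}}, {" you", {4, 8}}, {"!", {8, 9}}]}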

sequence(pre_tokenizers)
@spec sequence([t()]) :: t()

Creates a Sequence pre-tokenizer.

Lets you compose multiple pre-tokenizers that will be run in the given order.
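
For example, composing whitespace and punctuation splitting roughly approximates bert_pre_tokenizer/0 (a sketch; the output shown is an assumption):

iex> alias Tokenizers.PreTokenizer
iex> pre = PreTokenizer.sequence([PreTokenizer.whitespace_split(), PreTokenizer.punctuation(:isolated)])
iex> PreTokenizer.pre_tokenize(pre, "Hey, you!")
{:ok, [{"Hey", {0, 3}}, {",", {3, 4}}, {"you", {5, 8}}, {"!", {8, 9}}]}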

split(pattern, behavior, opts \\ [])
@spec split(String.t(), split_delimiter_behaviour(), keyword()) :: t()

Creates a Split pre-tokenizer using a string as split pattern.

Versatile pre-tokenizer that splits on the provided pattern according to the provided behavior.

Options

  • :invert - whether to invert the split or not. Defaults to false
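
A sketch contrasting two delimiter behaviours (the outputs are illustrative assumptions):

iex> pre = Tokenizers.PreTokenizer.split("-", :isolated)
iex> Tokenizers.PreTokenizer.pre_tokenize(pre, "pre-tokenizer")
{:ok, [{"pre", {0, 3}}, {"-", {3, 4}}, {"tokenizer", {4, 13}}]}
iex> pre = Tokenizers.PreTokenizer.split("-", :removed)
iex> Tokenizers.PreTokenizer.pre_tokenize(pre, "pre-tokenizer")
{:ok, [{"pre", {0, 3}}, {"tokenizer", {4, 13}}]}
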
split_regex(pattern, behavior, opts \\ [])
@spec split_regex(String.t(), split_delimiter_behaviour(), keyword()) :: t()

Creates a Split pre-tokenizer using a regular expression as split pattern.

Versatile pre-tokenizer that splits on the provided regex pattern according to the provided behavior.

The pattern should be a string representing a regular expression according to the Oniguruma Regex Engine.

Options

  • :invert - whether to invert the split or not. Defaults to false

Example

iex> Tokenizers.PreTokenizer.split_regex(~S(\?\d{2}\?), :removed)
#Tokenizers.PreTokenizer<[pre_tokenizer_type: "Split"]>

whitespace()

@spec whitespace() :: t()

Creates a Whitespace pre-tokenizer.

Splits on word boundaries. Uses the following regular expression: \w+|[^\w\s]+.
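
A sketch of the effect of that pattern: runs of word characters and runs of punctuation each become one pre-token (output assumed, not verified):

iex> pre = Tokenizers.PreTokenizer.whitespace()
iex> Tokenizers.PreTokenizer.pre_tokenize(pre, "Wait... what?!")
{:ok, [{"Wait", {0, 4}}, {"...", {4, 7}}, {"what", {8, 12}}, {"?!", {12, 14}}]}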

whitespace_split()

@spec whitespace_split() :: t()

Creates a WhitespaceSplit pre-tokenizer.

Splits on any whitespace character.
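
Unlike whitespace/0, punctuation stays attached to the surrounding word (a sketch; the output shown is an assumption):

iex> pre = Tokenizers.PreTokenizer.whitespace_split()
iex> Tokenizers.PreTokenizer.pre_tokenize(pre, "Hello, world!")
{:ok, [{"Hello,", {0, 6}}, {"world!", {7, 13}}]}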