Tokenizers.Normalizer (Tokenizers v0.5.1)
Normalizers and normalization functions.
A normalizer is in charge of pre-processing the input string in order to normalize it as relevant for the given use case.
Some common examples of normalization are the Unicode normalization algorithms (NFD, NFKD, NFC and NFKC) and lowercasing. What sets Tokenizers apart is that it keeps track of the alignment while normalizing. This is essential for mapping the generated tokens back to the input text.
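As a rough illustration of what these normalization forms do to a string, the sketch below uses only Elixir's standard String module rather than this library, so it performs no alignment tracking. It is meant to show the effect on code points, not how Tokenizers implements normalization.

```elixir
# "é" as a single precomposed code point (U+00E9).
composed = "\u00E9"

# NFD decomposes it into "e" (101) plus a combining acute accent (U+0301 = 769).
decomposed = String.normalize(composed, :nfd)
IO.inspect(String.to_charlist(decomposed))
# prints [101, 769]

# NFC recombines the pair into the single precomposed code point.
recomposed = String.normalize(decomposed, :nfc)
IO.inspect(recomposed == composed)
# prints true

# NFKC also folds compatibility characters, such as the "fi" ligature (U+FB01).
folded = String.normalize("\uFB01", :nfkc)
IO.inspect(folded)
# prints "fi"

# Lowercasing is another common normalization step.
lowered = String.downcase("HeLLo")
IO.inspect(lowered)
# prints "hello"
```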
Summary
Functions
Takes care of normalizing raw text before giving it to a BERT model.
Creates a ByteLevel normalizer.
Gets the ByteLevel normalizer's alphabet.
Replaces all uppercase characters with lowercase.
Creates an NFC Unicode normalizer.
Creates an NFD Unicode normalizer.
Creates an NFKC Unicode normalizer.
Creates an NFKD Unicode normalizer.
Creates an Nmt normalizer.
Normalizes the given text input.
Precompiled normalizer.
Creates a Prepend normalizer.
Replaces a custom search string with the given content.
Replaces occurrences of a custom regexp pattern with the given content.
Composes multiple normalizers that will run in the provided order.
Creates a Strip normalizer.
Creates a Strip Accent normalizer.
Types
@type t() :: %Tokenizers.Normalizer{resource: reference()}
Functions
Takes care of normalizing raw text before giving it to a BERT model.
This includes cleaning the text, handling accents, Chinese chars and lowercasing.
Options
:clean_text - whether to clean the text by removing control characters and replacing every whitespace character with a standard space. Defaults to true
:handle_chinese_chars - whether to handle Chinese characters by putting spaces around them. Defaults to true
:strip_accents - whether to strip all accents. If this option is not specified, it is determined by the value of lowercase (as in the original BERT)
:lowercase - whether to lowercase. Defaults to true
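The options above could be passed as a keyword list, as in the following minimal sketch. The function names `bert_normalizer/1` and `normalize/2`, and the `{:ok, string}` return shape, are assumptions here because the function signatures are not shown on this page; check them against the installed version before relying on them.

```elixir
# Assumes the :tokenizers package can be fetched and loaded in a script.
Mix.install([{:tokenizers, "~> 0.5"}])

alias Tokenizers.Normalizer

# Assumed constructor name: bert_normalizer/1 taking the options listed above.
normalizer =
  Normalizer.bert_normalizer(
    clean_text: true,
    handle_chinese_chars: true,
    strip_accents: true,
    lowercase: true
  )

# Assumed: normalize/2 returns {:ok, normalized_string}.
{:ok, normalized} = Normalizer.normalize(normalizer, "Héllo\tWorld 你好")
IO.puts(normalized)
```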
@spec byte_level() :: t()
Creates a ByteLevel normalizer.
Gets the ByteLevel normalizer's alphabet.
@spec lowercase() :: t()
Replaces all uppercase characters with lowercase.
@spec nfc() :: t()
Creates an NFC Unicode normalizer.
@spec nfd() :: t()
Creates an NFD Unicode normalizer.
@spec nfkc() :: t()
Creates an NFKC Unicode normalizer.
@spec nfkd() :: t()
Creates an NFKD Unicode normalizer.
@spec nmt() :: t()
Creates an Nmt normalizer.
Normalizes the given text input.
Precompiled normalizer.
Do not use it manually; it exists for compatibility with SentencePiece.
Creates a Prepend normalizer.
Replaces a custom search string with the given content.
Replaces occurrences of a custom regexp pattern with the given content.
The pattern should be a string representing a regular expression according to the Oniguruma Regex Engine.
Composes multiple normalizers that will run in the provided order.
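A minimal sketch of composing normalizers in order is shown below. The constructors `nfd/0`, `strip_accents/0` and `lowercase/0` appear in the specs on this page; `sequence/1`, `normalize/2` and the `{:ok, string}` return shape are assumed names and behavior, since their signatures are not shown here.

```elixir
# Assumes the :tokenizers package can be fetched and loaded in a script.
Mix.install([{:tokenizers, "~> 0.5"}])

alias Tokenizers.Normalizer

# Order matters: decompose first, then drop the accents, then lowercase.
# sequence/1 is an assumed name for the composing function described above.
normalizer =
  Normalizer.sequence([
    Normalizer.nfd(),
    Normalizer.strip_accents(),
    Normalizer.lowercase()
  ])

# normalize/2 and its {:ok, string} return shape are also assumptions.
{:ok, normalized} = Normalizer.normalize(normalizer, "Crème BRÛLÉE")
IO.puts(normalized)
```

Placing the NFD step first means accented letters are already split into a base letter and a combining mark by the time the accent-stripping step runs.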
Creates a Strip normalizer.
Removes all whitespace characters on the specified sides (left, right or both) of the input.
Options
:left - whether to strip the left side. Defaults to true
:right - whether to strip the right side. Defaults to true
@spec strip_accents() :: t()
Creates a Strip Accent normalizer.
Removes all accent symbols in Unicode (to be used with NFD for consistency).
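The reason for pairing accent stripping with NFD can be sketched with Elixir's standard library alone. The helper `strip_marks` below is hypothetical and only approximates the idea by deleting Unicode nonspacing marks; it is not the library's implementation.

```elixir
# Hypothetical helper: delete every Unicode nonspacing mark (category Mn).
strip_marks = fn text -> String.replace(text, ~r/\p{Mn}/u, "") end

# "café" written with a precomposed "é" (U+00E9).
precomposed = "caf\u00E9"

# Without NFD there is no separate combining mark to remove,
# so the string comes back unchanged.
untouched = strip_marks.(precomposed)
IO.inspect(untouched == precomposed)
# prints true

# After NFD the accent is its own code point and can be stripped.
stripped = strip_marks.(String.normalize(precomposed, :nfd))
IO.inspect(stripped)
# prints "cafe"
```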