Tokenizers.Normalizer (Tokenizers v0.5.1)

Normalizers and normalization functions.

A normalizer pre-processes the input string, normalizing it as appropriate for the given use case.

Common examples of normalization are the Unicode normalization algorithms (NFD, NFKD, NFC and NFKC) and lowercasing. What sets Tokenizers apart is that it keeps track of the alignment while normalizing. This is essential for mapping the generated tokens back to the input text.

Summary

Types

t()

Functions

bert_normalizer(opts \\ [])

Takes care of normalizing raw text before giving it to a BERT model.

byte_level()

Creates a ByteLevel normalizer.

byte_level_alphabet()

Gets ByteLevel normalizer's alphabet.

lowercase()

Replaces all uppercase characters with lowercase.

nfc()

Creates an NFC Unicode normalizer.

nfd()

Creates an NFD Unicode normalizer.

nfkc()

Creates an NFKC Unicode normalizer.

nfkd()

Creates an NFKD Unicode normalizer.

nmt()

Creates an Nmt normalizer.

normalize(normalizer, input)

Normalizes the given text input.

precompiled(data)

Precompiled normalizer.

prepend(prepend)

Creates a Prepend normalizer.

replace(search, content)

Replaces a custom search string with the given content.

replace_regex(pattern, content)

Replaces occurrences of a custom regexp pattern with the given content.

sequence(normalizers)

Composes multiple normalizers that will run in the provided order.

strip(opts \\ [])

Creates a Strip normalizer.

strip_accents()

Creates a Strip Accent normalizer.

Types

t()

@type t() :: %Tokenizers.Normalizer{resource: reference()}

Functions

bert_normalizer(opts \\ [])

@spec bert_normalizer(keyword()) :: t()

Takes care of normalizing raw text before giving it to a BERT model.

This includes cleaning the text, handling accents and Chinese characters, and lowercasing.

Options

:clean_text - whether to clean the text by removing control characters and replacing all whitespace characters with a regular space. Defaults to true.
:handle_chinese_chars - whether to handle Chinese characters by putting spaces around them. Defaults to true.
:strip_accents - whether to strip all accents. If this option is not specified, it is determined by the value of :lowercase (as in the original BERT).
:lowercase - whether to lowercase the text. Defaults to true.
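
As an illustration of the defaults described above, the following is a minimal sketch that builds a BERT normalizer without options and runs it on a short string. It assumes the `tokenizers` package is fetched with `Mix.install/1`; the input string and variable names are made up for the example.

```elixir
# Assumption: fetch the package for a standalone script.
Mix.install([{:tokenizers, "~> 0.5.1"}])

alias Tokenizers.Normalizer

# No options given: :lowercase defaults to true, and :strip_accents
# follows :lowercase when it is not specified.
normalizer = Normalizer.bert_normalizer()

{:ok, text} = Normalizer.normalize(normalizer, "Héllo Wörld")

IO.inspect(text)
```

Passing `lowercase: false` without setting `:strip_accents` should, per the option description, also leave accents in place.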

byte_level()

@spec byte_level() :: t()

Creates a ByteLevel normalizer.

byte_level_alphabet()

Gets ByteLevel normalizer's alphabet.

lowercase()

@spec lowercase() :: t()

Replaces all uppercase characters with lowercase.

nfc()

@spec nfc() :: t()

Creates an NFC Unicode normalizer.

nfd()

@spec nfd() :: t()

Creates an NFD Unicode normalizer.

nfkc()

@spec nfkc() :: t()

Creates an NFKC Unicode normalizer.

nfkd()

@spec nfkd() :: t()

Creates an NFKD Unicode normalizer.

nmt()

@spec nmt() :: t()

Creates an Nmt normalizer.

normalize(normalizer, input)

@spec normalize(t(), String.t()) :: {:ok, String.t()}

Normalizes the given text input.
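
A minimal usage sketch, assuming the `tokenizers` package is fetched with `Mix.install/1`. It applies the `lowercase/0` normalizer to an example string and unpacks the `{:ok, text}` tuple that the spec describes.

```elixir
# Assumption: fetch the package for a standalone script.
Mix.install([{:tokenizers, "~> 0.5.1"}])

alias Tokenizers.Normalizer

# Build the normalizer once; it can be reused for many inputs.
normalizer = Normalizer.lowercase()

# normalize/2 returns an {:ok, text} tuple.
{:ok, text} = Normalizer.normalize(normalizer, "Hello WORLD")

IO.puts(text)
# prints "hello world"
```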

precompiled(data)

@spec precompiled(binary()) :: {:ok, t()} | {:error, any()}

Precompiled normalizer.

Don’t use this manually; it exists for compatibility with SentencePiece.

prepend(prepend)

@spec prepend(prepend :: String.t()) :: t()

Creates a Prepend normalizer.

replace(search, content)

@spec replace(String.t(), String.t()) :: t()

Replaces a custom search string with the given content.
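
For instance, a literal replacement could look like the sketch below. The search string, replacement, and input are invented for the example, and the package is assumed to be fetched with `Mix.install/1`.

```elixir
# Assumption: fetch the package for a standalone script.
Mix.install([{:tokenizers, "~> 0.5.1"}])

alias Tokenizers.Normalizer

# Replace the literal string "colour" with "color".
normalizer = Normalizer.replace("colour", "color")

{:ok, text} = Normalizer.normalize(normalizer, "The colour red")

IO.puts(text)
# prints "The color red"
```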

replace_regex(pattern, content)

@spec replace_regex(String.t(), String.t()) :: t()

Replaces occurrences of a custom regexp pattern with the given content.

The pattern should be a string containing a regular expression written in the syntax of the Oniguruma regex engine.
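
The sketch below collapses runs of whitespace into a single space. It assumes the package is fetched with `Mix.install/1`; the pattern and input are illustrative. Note that the pattern is passed as a plain string, not as an Elixir `~r` regex, so the backslash has to be escaped (or written with the `~S` sigil).

```elixir
# Assumption: fetch the package for a standalone script.
Mix.install([{:tokenizers, "~> 0.5.1"}])

alias Tokenizers.Normalizer

# "\\s+" is the two-character-escaped form of the regex \s+ (one or more
# whitespace characters); ~S"\s+" would be an equivalent spelling.
normalizer = Normalizer.replace_regex("\\s+", " ")

{:ok, text} = Normalizer.normalize(normalizer, "a   b\t\tc")

IO.puts(text)
# prints "a b c"
```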

sequence(normalizers)

@spec sequence([t()]) :: t()

Composes multiple normalizers that will run in the provided order.
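
As a sketch of composition, the example below chains three normalizers documented on this page: trimming, lowercasing, and whitespace collapsing. The particular combination and the input string are assumptions made for illustration; the package is assumed to be fetched with `Mix.install/1`.

```elixir
# Assumption: fetch the package for a standalone script.
Mix.install([{:tokenizers, "~> 0.5.1"}])

alias Tokenizers.Normalizer

# The normalizers run in list order: strip, then lowercase,
# then collapse inner whitespace runs.
normalizer =
  Normalizer.sequence([
    Normalizer.strip(),
    Normalizer.lowercase(),
    Normalizer.replace_regex("\\s+", " ")
  ])

{:ok, text} = Normalizer.normalize(normalizer, "  Hello   WORLD  ")

IO.inspect(text)
# prints "hello world" (with quotes, since IO.inspect is used)
```

The sequence is itself a normalizer, so it can be passed anywhere a single normalizer is accepted, including inside another sequence.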

strip(opts \\ [])

@spec strip(keyword()) :: t()

Creates a Strip normalizer.

Removes all whitespace characters on the specified sides (left, right or both) of the input.

Options

:left - whether to strip the left side. Defaults to true.
:right - whether to strip the right side. Defaults to true.
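
A short sketch of the options in use, assuming the package is fetched with `Mix.install/1`. It strips only the left side of an example string so the trailing whitespace stays visible.

```elixir
# Assumption: fetch the package for a standalone script.
Mix.install([{:tokenizers, "~> 0.5.1"}])

alias Tokenizers.Normalizer

# Strip leading whitespace only; trailing whitespace is kept.
normalizer = Normalizer.strip(left: true, right: false)

{:ok, text} = Normalizer.normalize(normalizer, "  hi  ")

IO.inspect(text)
# prints "hi  " (with quotes, since IO.inspect is used)
```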

strip_accents()

@spec strip_accents() :: t()

Creates a Strip Accent normalizer.

Removes all Unicode accent marks. Use it together with NFD for consistent results.
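
The pairing with NFD can be sketched as follows, assuming the package is fetched with `Mix.install/1`. NFD first decomposes a precomposed character such as "é" into a base letter plus a combining mark, which the accent stripper can then remove; the input string is illustrative.

```elixir
# Assumption: fetch the package for a standalone script.
Mix.install([{:tokenizers, "~> 0.5.1"}])

alias Tokenizers.Normalizer

# Decompose first, then drop the combining accent marks.
normalizer =
  Normalizer.sequence([
    Normalizer.nfd(),
    Normalizer.strip_accents()
  ])

{:ok, text} = Normalizer.normalize(normalizer, "café")

IO.puts(text)
# prints "cafe"
```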