Localize.Collation.Normalizer (Localize v0.38.0)


Unicode NFD normalization for collation. Delegates to Erlang's :unicode module.

Summary

Functions

nfd(string)

Normalize a string to NFD (Canonical Decomposition) form.

normalize_to_codepoints(string, normalize? \\ false)

Optionally normalize a string and convert it to a list of integer codepoints.

to_codepoints(string)

Convert a string to a list of integer codepoints.

Functions

nfd(string)

@spec nfd(String.t()) :: String.t()

Normalize a string to NFD (Canonical Decomposition) form.

Uses Erlang's :unicode.characters_to_nfd_binary/1, followed by a canonical reordering pass that uses Canonical Combining Class (CCC) data from the unicode package to correct the ordering of newer Unicode codepoints.
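The two steps described above can be sketched roughly as follows. This is an illustration only, not the library's implementation: the NfdSketch module and its tiny @ccc table are hypothetical, and real CCC data covers every combining mark in the Unicode Character Database. The only external call is :unicode.characters_to_nfd_binary/1 from Erlang/OTP.

```elixir
defmodule NfdSketch do
  # Hypothetical, deliberately tiny CCC table (assumption for illustration).
  # U+0301 COMBINING ACUTE ACCENT = 230, U+0323 COMBINING DOT BELOW = 220,
  # U+0327 COMBINING CEDILLA = 202. Anything not listed is treated as CCC 0.
  @ccc %{0x0301 => 230, 0x0323 => 220, 0x0327 => 202}

  # Step 1: canonical decomposition via Erlang's :unicode module.
  def decompose(string) when is_binary(string) do
    :unicode.characters_to_nfd_binary(string)
  end

  # Step 2: canonical reordering. Each run of combining marks (CCC > 0)
  # is stable-sorted by CCC; starters (CCC == 0) are left where they are.
  def reorder(codepoints) when is_list(codepoints) do
    codepoints
    |> Enum.chunk_by(&(ccc(&1) == 0))
    |> Enum.flat_map(fn [first | _] = run ->
      if ccc(first) == 0, do: run, else: Enum.sort_by(run, &ccc/1)
    end)
  end

  defp ccc(codepoint), do: Map.get(@ccc, codepoint, 0)
end

# Acute (230) followed by dot below (220) is reordered so the lower CCC comes first.
NfdSketch.reorder([?a, 0x0301, 0x0323])
# => [97, 803, 769]
```

Enum.sort_by/2 is stable, which matters here: marks with the same CCC must keep their original relative order.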

Arguments

  • string - a UTF-8 binary string.

Returns

The NFD-normalized string as a UTF-8 binary.
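A decomposed string usually renders identically to its precomposed input, so the difference is easiest to see in the bytes and codepoints. The snippet below is a small sketch using only Erlang/OTP and the Elixir standard library rather than this module, on the assumption that the results match for a simple case like this one.

```elixir
# U+00E9 LATIN SMALL LETTER E WITH ACUTE, written as an escape so the
# source file's own normalization form cannot affect the result.
precomposed = "\u00E9"
decomposed = :unicode.characters_to_nfd_binary(precomposed)

byte_size(precomposed)
# => 2
byte_size(decomposed)
# => 3

# "e" followed by U+0301 COMBINING ACUTE ACCENT
String.to_charlist(decomposed)
# => [101, 769]

# Different binaries, but canonically equivalent text.
precomposed == decomposed
# => false
String.equivalent?(precomposed, decomposed)
# => true
```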

Examples

iex> "café" |> Localize.Collation.Normalizer.nfd() |> String.to_charlist() |> length()
5

iex> Localize.Collation.Normalizer.nfd("é")
"é"

normalize_to_codepoints(string, normalize? \\ false)

@spec normalize_to_codepoints(String.t(), boolean()) :: [non_neg_integer()]

Optionally normalize a string and convert it to a list of integer codepoints.

Arguments

  • string - a UTF-8 binary string.

  • normalize? - whether to apply NFD normalization first (default: false).

Returns

A list of integer codepoints, in NFD form when normalize? is true.

Examples

iex> Localize.Collation.Normalizer.normalize_to_codepoints("abc")
[97, 98, 99]

iex> Localize.Collation.Normalizer.normalize_to_codepoints("café", true)
[99, 97, 102, 101, 769]
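For readers who want to see how the two codepoint functions relate, here is a minimal standard-library sketch. The CodepointsSketch module is hypothetical and is not the library's source; it assumes the behaviour described above and omits the extra canonical reordering pass that nfd/1 applies.

```elixir
defmodule CodepointsSketch do
  # Convert a UTF-8 binary to a list of integer codepoints.
  def to_codepoints(string) when is_binary(string) do
    String.to_charlist(string)
  end

  # Function head carrying the default, mirroring the documented signature.
  def normalize_to_codepoints(string, normalize? \\ false)

  # Decompose first, then convert.
  def normalize_to_codepoints(string, true) do
    string
    |> :unicode.characters_to_nfd_binary()
    |> to_codepoints()
  end

  # No normalization requested: convert as-is.
  def normalize_to_codepoints(string, false), do: to_codepoints(string)
end

# Precomposed "é" (U+00E9) stays one codepoint unless normalization is requested.
CodepointsSketch.normalize_to_codepoints("caf\u00E9")
# => [99, 97, 102, 233]
CodepointsSketch.normalize_to_codepoints("caf\u00E9", true)
# => [99, 97, 102, 101, 769]
```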

to_codepoints(string)

@spec to_codepoints(String.t()) :: [non_neg_integer(), ...]

Convert a string to a list of integer codepoints.

Arguments

  • string - a UTF-8 binary string.

Returns

A list of integer codepoints.

Examples

iex> Localize.Collation.Normalizer.to_codepoints("abc")
[97, 98, 99]