Unicode.Transform (Unicode Transform v1.0.0)

Implements the CLDR Transform specification for transforming text from one script to another.

Transforms are defined by the Unicode CLDR specification and support operations such as transliteration between scripts, normalization, and case mapping.

Usage Examples

iex> Unicode.Transform.transform("Ä Ö Ü ß", from: :latin, to: :ascii)
{:ok, "A O U ss"}

iex> Unicode.Transform.transform("hello", to: :upper)
{:ok, "HELLO"}

iex> Unicode.Transform.transform("Ä ö ü", transform: "de-ASCII")
{:ok, "AE oe ue"}

Transform ID resolution

When transform/2 is called, the transform ID is resolved through one of two paths depending on the options provided.

Direct ID (:transform option)

The string is used as-is. If the ID is not found as a built-in or in the CLDR transform files, and it has the form "Any-Target", unicode_transform falls back to automatic script detection (see below).

Script-based (:from / :to options)

The :from and :to values are normalized to canonical script names (case-insensitive, supporting both Unicode names like :greek and BCP47 codes like :grek). Resolution then proceeds as follows:

  1. Built-in check — if the ID matches a built-in transform (e.g., Any-NFC, Any-Upper), it is dispatched directly to the corresponding String function.

  2. Forward file lookup — unicode_transform looks for a CLDR XML file matching "From-To" (e.g., "Greek-Latin"), checking the alias index built from file metadata.

  3. Reverse file lookup — if no forward match is found, unicode_transform looks for "To-From" and marks the direction as :reverse (e.g., to: :greek, from: :latin resolves to "Greek-Latin" in reverse).

  4. BCP47 fallback — if neither exact nor case-insensitive matches succeed, the ID is resolved as a BCP47 transform ID (e.g., "Grek-Latn" → "Greek-Latin").
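The lookup order above can be sketched in a few lines of plain Elixir. This is only an illustration, not the library's implementation: the module name, the two lists standing in for the built-in table and the CLDR alias index, and the error tuple are all hypothetical. The BCP47 fallback (step 4) is left out for brevity.

```elixir
defmodule ResolutionSketch do
  # Hypothetical stand-ins for the built-in table and the CLDR alias index.
  @builtins ["Any-NFC", "Any-Upper"]
  @file_index ["Greek-Latin", "Cyrillic-Latin"]

  # Takes already-normalized script names and returns
  # {:ok, transform_id, direction} or {:error, reason}.
  def resolve(from, to) do
    forward = "#{from}-#{to}"
    reverse = "#{to}-#{from}"

    cond do
      # Step 1: built-in check
      forward in @builtins -> {:ok, forward, :forward}
      # Step 2: forward file lookup
      forward in @file_index -> {:ok, forward, :forward}
      # Step 3: reverse file lookup, flagged as :reverse
      reverse in @file_index -> {:ok, reverse, :reverse}
      # Hypothetical error shape; the real library may differ.
      true -> {:error, {:unknown_transform, forward}}
    end
  end
end
```

With these stand-in tables, resolving Latin to Greek finds no "Latin-Greek" entry and instead selects "Greek-Latin" with the direction set to :reverse.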

The Any source and script detection

When :from is :any (the default) or when a transform: "Any-X" ID is used, unicode_transform first checks for a specific Any-X transform (built-in or file-based, such as Any-Accents or Any-Publishing).

If no specific Any-X transform exists, unicode_transform falls back to automatic script detection: it calls Unicode.script_dominance/1 to identify the scripts present in the input string, then chains a {detected_script}-X transform for each detected script. Common, inherited, and unknown scripts are skipped.

For example, transform("αβγδ абвг", from: :any, to: :latin) detects Greek and Cyrillic, then applies Greek-Latin followed by Cyrillic-Latin.

This fallback behaves the same as from: :detect, except that :detect always uses script detection and never checks for a specific Any-X transform first.
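As a rough sketch of the chaining step, the following plain-Elixir module builds the list of transform IDs from a list of detected scripts. It assumes the detected scripts arrive as a list of lowercase atoms; the actual return shape of Unicode.script_dominance/1 is not shown here, and the module and function names are hypothetical.

```elixir
defmodule DetectionChainSketch do
  # Scripts that carry no transliteration target of their own.
  @skipped [:common, :inherited, :unknown]

  # `detected` is assumed to be a list of script atoms, in the order
  # in which their transforms should be applied.
  def chain(detected, target) do
    detected
    |> Enum.reject(&(&1 in @skipped))
    |> Enum.map(fn script -> "#{name(script)}-#{name(target)}" end)
  end

  # :greek -> "Greek"
  defp name(atom), do: atom |> Atom.to_string() |> String.capitalize()
end
```

For the Greek and Cyrillic example above, `DetectionChainSketch.chain([:greek, :common, :cyrillic], :latin)` yields the two IDs Greek-Latin and Cyrillic-Latin, with the common script dropped.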

Sub-transform narrowing

CLDR transform files can reference sub-transforms via ::Name; rules. When a sub-transform is a bare script name (e.g., ::Latin; inside Greek-Latin.xml), it is narrowed using the parent transform's source and target scripts — resolving ::Latin; to Greek-Latin. Sub-transforms that are already compound names (e.g., ::Bengali-InterIndic;) or built-ins (e.g., ::NFC;) are used as-is.
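The narrowing rule can be illustrated with a small stand-alone function. This is a sketch under stated assumptions: the built-in list is a hypothetical subset, a bare name is narrowed only when it matches the parent's source or target script, and any other bare name is passed through unchanged. The library's real rules may be broader.

```elixir
defmodule NarrowingSketch do
  # Hypothetical subset of built-in transform names.
  @builtins ["NFC", "NFD", "NFKC", "NFKD", "Lower", "Upper"]

  # `sub` is the name found in a `::Name;` rule; `{source, target}` are
  # the parent transform's scripts, e.g. {"Greek", "Latin"}.
  def narrow(sub, {source, target}) do
    cond do
      # Built-ins are used as-is.
      sub in @builtins -> sub
      # Compound names are used as-is.
      String.contains?(sub, "-") -> sub
      # A bare script name is narrowed to the parent's pair.
      sub in [source, target] -> "#{source}-#{target}"
      # Assumption: anything else is left untouched.
      true -> sub
    end
  end
end
```

Calling `NarrowingSketch.narrow("Latin", {"Greek", "Latin"})` gives "Greek-Latin", while "Bengali-InterIndic" and "NFC" come back unchanged.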

Summary

Functions

available_transforms()
Returns a list of available transform IDs.

default_backend()
Returns the default transform backend.

transform(string, options)
Transforms a string using the specified transform.

transform!(string, options)
Transforms a string using the specified transform, raising on error.

Types

transform_option()

@type transform_option() ::
  {:from, atom() | String.t()}
  | {:to, atom() | String.t()}
  | {:transform, String.t()}
  | {:direction, :forward | :reverse}
  | {:backend, :nif | :elixir}

Functions

available_transforms()

@spec available_transforms() :: [String.t()]

Returns a list of available transform IDs.

Returns

A list of transform ID strings.

default_backend()

@spec default_backend() :: :nif | :elixir

Returns the default transform backend.

Returns

  • :nif if the ICU NIF is loaded and available.

  • :elixir otherwise.

transform(string, options)

@spec transform(String.t(), [transform_option()]) ::
  {:ok, String.t()} | {:error, term()}

Transforms a string using the specified transform.

There are two ways to specify which transform to apply:

  1. Script-based — use :from and :to to specify source and target scripts as atoms or strings. The transform ID and direction are inferred.

  2. Direct — use :transform with the string transform ID, and optionally :direction (default :forward).

See the Transform ID resolution section in the module documentation for details on how transform IDs are resolved, including Any- handling and automatic script detection.

Arguments

  • string — the input string to transform.

Options

Either :to (optionally with :from) or :transform must be provided:

  • :to — the target script as an atom or string (e.g., :latin, "ASCII", :upper, :nfc). Required unless :transform is given. Resolution is case-insensitive.

  • :from — the source script as an atom or string (default: :any). E.g., :greek, "Cyrillic". Resolution is case-insensitive. Use :detect to automatically detect scripts in the input and chain a transform for each detected script.

  • :transform — a string transform ID (e.g., "de-ASCII", "Armenian-Latin-BGN"). Mutually exclusive with :from/:to.

  • :direction — :forward (default) or :reverse. Only used with :transform.

  • :backend — :nif or :elixir. Selects the transform engine. When set to :nif, transforms are executed via ICU4C's native transliterator. When set to :elixir, the pure-Elixir CLDR-based engine is used. Defaults to :nif when the NIF is available, otherwise :elixir.

Returns

  • {:ok, transformed_string} on success.

  • {:error, reason} on failure.

Examples

iex> Unicode.Transform.transform("Ä Ö Ü ß", from: :latin, to: :ascii)
{:ok, "A O U ss"}

iex> Unicode.Transform.transform("αβγδ", from: :greek, to: :latin)
{:ok, "abgd"}

iex> Unicode.Transform.transform("hello", to: :upper)
{:ok, "HELLO"}

iex> Unicode.Transform.transform("Ä ö ü", transform: "de-ASCII")
{:ok, "AE oe ue"}
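Because transform/2 returns a tagged tuple, callers typically pattern-match on the result. The helper below is a minimal sketch of that pattern; the module name, function name, and message wording are hypothetical, and the shape of the error reason is not specified by the documentation above, so it is only inspected, never matched.

```elixir
defmodule ResultHandlingSketch do
  # Success: unwrap the transformed string.
  def to_message({:ok, transformed}), do: "transformed: " <> transformed

  # Failure: the reason may be any term, so render it with inspect/1.
  def to_message({:error, reason}), do: "transform failed: " <> inspect(reason)
end
```

In an application this could be used as `Unicode.Transform.transform(input, from: :greek, to: :latin) |> ResultHandlingSketch.to_message()`. When a failure should crash the caller instead, transform!/2 is the more direct choice.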

transform!(string, options)

@spec transform!(String.t(), [transform_option()]) :: String.t()

Transforms a string using the specified transform, raising on error.

Arguments

  • string — the input string to transform.

Options

Same as transform/2.

Returns

The transformed string.

Examples

iex> Unicode.Transform.transform!("Ä Ö Ü ß", from: :latin, to: :ascii)
"A O U ss"