Localize.Collation.Table.Parser (Localize v0.38.0)


Parses the FractionalUCA.txt file into a map of codepoint sequences to collation elements.

FractionalUCA.txt is the single source of truth for the collation table. Each data line contains both fractional weights (used for script reordering) and, in its trailing comment, allkeys-format hexadecimal weights (used for collation element construction):

  • Single codepoint: 0041; [2B, 05, 9C] # Latn Lu [23EC.0020.0008] * LATIN CAPITAL LETTER A.

  • Multi-CE: 00E9; [2B 86, 05, 05] # Latn Ll [2453.0020.0002][0000.0024.0002] * LATIN SMALL LETTER E WITH ACUTE.

  • Context entry: 004C | 00B7; [, FB B6, 05] # Zyyy Po [0000.011F.0002] * MIDDLE DOT.
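The three line shapes above share the same layout: a codepoint field, a `;`, the fractional weights, then a `#` comment holding the allkeys-format weights. The following is a minimal sketch of pulling those fields apart. `LineSketch`, `split_line/1`, and `allkeys_weights/1` are hypothetical names used for illustration only and are not part of the Localize API.

```elixir
defmodule LineSketch do
  # Hypothetical helper: split a data line into its codepoint field,
  # fractional-weight field, and trailing comment.
  def split_line(line) do
    with [data, comment] <- String.split(line, "#", parts: 2),
         [cps, weights] <- String.split(data, ";", parts: 2) do
      {:ok, String.trim(cps), String.trim(weights), String.trim(comment)}
    else
      # Lines without both a ";" and a "#" are not data lines in this sketch.
      _ -> :skip
    end
  end

  # Hypothetical helper: extract each bracketed allkeys-format weight
  # group (for example "[23EC.0020.0008]") from the comment text.
  def allkeys_weights(comment) do
    ~r/\[[0-9A-Fa-f]+\.[0-9A-Fa-f]+\.[0-9A-Fa-f]+\]/
    |> Regex.scan(comment)
    |> List.flatten()
  end
end
```

A context entry can then be recognised by the `|` inside the codepoint field, and a multi-CE entry by `allkeys_weights/1` returning more than one group.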

Context entries represent CLDR-specific contractions where a target codepoint's weights change depending on the preceding context codepoint. These are converted to explicit contraction entries (e.g., {0x004C, 0x00B7} => L's CEs ++ modified CEs).
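The conversion described above can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: `ContextSketch.resolve_context/2` is a hypothetical name, the table is assumed to be a plain map from keys to lists of collation elements, and the weights in the usage example are made up.

```elixir
defmodule ContextSketch do
  # Hypothetical helper: turn a context entry into an explicit contraction.
  # The contraction's elements are the context codepoint's own elements
  # followed by the modified elements of the target codepoint.
  def resolve_context(table, {:context, context_cp, target_cp, modified_ces}) do
    # Assumption: a context codepoint missing from the table contributes nothing.
    context_ces = Map.get(table, context_cp, [])
    Map.put(table, {context_cp, target_cp}, context_ces ++ modified_ces)
  end
end
```

For example, with an illustrative table `%{0x004C => [{0x2A00, 0x0020, 0x0008, false}]}`, resolving `{:context, 0x004C, 0x00B7, [{0x0000, 0x011F, 0x0002, false}]}` adds the key `{0x004C, 0x00B7}` whose value is the two elements in that order.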

Variable status (spaces, punctuation, symbols, currency) is derived from the [last variable] header line rather than per-entry markers.

Summary

Functions

codepoints_to_key(cps)

Convert a codepoint list to a table key.

parse(path)

Parse FractionalUCA.txt into a collation table.

parse_elements(str)

Parse weight elements from an allkeys weight string.

parse_fractional_entry(line)

Parse a single FractionalUCA.txt data entry.

Functions

codepoints_to_key(cps)

Convert a codepoint list to a table key.

Single codepoints become bare integers and multi-codepoint sequences (contractions) become tuples, which keeps persistent_term storage compact.

Arguments

  • cps - a list of integer codepoints.

Returns

An integer for single codepoints, or a tuple for contractions.

Examples

iex> Localize.Collation.Table.Parser.codepoints_to_key([0x0041])
0x0041

iex> Localize.Collation.Table.Parser.codepoints_to_key([0x006C, 0x00B7])
{0x006C, 0x00B7}
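The behaviour shown in these examples can be reproduced with two function clauses. This is a minimal sketch that matches the documented results; the module name `KeySketch` is hypothetical and the real implementation may differ.

```elixir
defmodule KeySketch do
  # A single codepoint is stored as a bare integer.
  def codepoints_to_key([cp]) when is_integer(cp), do: cp

  # Any longer sequence (a contraction) is stored as a tuple.
  def codepoints_to_key(cps) when is_list(cps), do: List.to_tuple(cps)
end
```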

parse(path)

Parse FractionalUCA.txt into a collation table.

This is the primary parser that builds the complete collation table from a single data file. Variable status is derived from the [last variable] header line.

Arguments

  • path - file path to the FractionalUCA.txt data file.

Returns

A map with two keys:

  • :entries - a map of type %{integer() | tuple() => [Element.t()]} from table keys (integers for single codepoints, tuples for contractions) to lists of collation elements.

  • :version - the UCA version string from the file header, or nil.
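A map of this shape can be queried by converting a codepoint list to the matching key form. The sketch below uses a hand-built map rather than a real parse result; `LookupSketch.lookup/2` is a hypothetical helper, and the 4-tuples stand in for `Element.t()`, whose actual structure is not shown here.

```elixir
defmodule LookupSketch do
  # Hypothetical helper: look up the collation elements for a codepoint list.
  # A single codepoint is looked up by its integer key.
  def lookup(%{entries: entries}, [cp]), do: Map.fetch(entries, cp)

  # A longer sequence is looked up by its tuple key.
  def lookup(%{entries: entries}, cps) when is_list(cps) do
    Map.fetch(entries, List.to_tuple(cps))
  end
end

# Illustrative data only; the version string and weights are placeholders.
table = %{
  entries: %{
    0x0041 => [{0x23EC, 0x0020, 0x0008, false}],
    {0x004C, 0x00B7} => [{0x2A00, 0x0020, 0x0008, false}, {0x0000, 0x011F, 0x0002, false}]
  },
  version: nil
}

{:ok, _elements} = LookupSketch.lookup(table, [0x0041])
:error = LookupSketch.lookup(table, [0x0042])
```

`Map.fetch/2` is used so that a missing key is reported as `:error` instead of being confused with an empty element list.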

parse_elements(str)

Parse weight elements from an allkeys weight string.

Arguments

  • str - the weight portion of an allkeys line (e.g., "[.23EC.0020.0008]").

Returns

A list of collation element tuples {primary, secondary, tertiary, variable}.

Examples

iex> Localize.Collation.Table.Parser.parse_elements("[.23EC.0020.0008]")
[{0x23EC, 0x0020, 0x0008, false}]

iex> Localize.Collation.Table.Parser.parse_elements("[*0269.0020.0002]")
[{0x0269, 0x0020, 0x0002, true}]
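One way to obtain the results above is a regular-expression scan over the bracketed groups. This is a sketch consistent with the two documented examples, assuming `.` marks a non-variable element and `*` a variable one; `ElementsSketch` is a hypothetical module and the library's own parsing strategy is not shown in this page.

```elixir
defmodule ElementsSketch do
  # Hypothetical re-implementation: scan for "[.PPPP.SSSS.TTTT]" or
  # "[*PPPP.SSSS.TTTT]" groups and convert the hexadecimal fields to integers.
  def parse_elements(str) do
    ~r/\[([.*])([0-9A-Fa-f]+)\.([0-9A-Fa-f]+)\.([0-9A-Fa-f]+)\]/
    |> Regex.scan(str, capture: :all_but_first)
    |> Enum.map(fn [marker, primary, secondary, tertiary] ->
      {
        String.to_integer(primary, 16),
        String.to_integer(secondary, 16),
        String.to_integer(tertiary, 16),
        # "*" marks a variable element, "." a non-variable one.
        marker == "*"
      }
    end)
  end
end
```

Because the scan collects every group it finds, a string holding several adjacent groups yields one tuple per group, in order.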

parse_fractional_entry(line)

Parse a single FractionalUCA.txt data entry.

Arguments

  • line - a single data line from FractionalUCA.txt.

Returns

  • {:ok, codepoints, elements} - the parsed codepoint list and collation elements.

  • {:context, context_cp, target_cp, elements} - a context entry to be resolved later.

  • :skip - the line could not be parsed.
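A caller has to handle all three return shapes. The following sketch folds a list of such results into a map of entries plus a list of context entries left for later resolution. `DispatchSketch.collect/1` is a hypothetical helper, and the inputs in the usage example are hand-written tuples rather than real parser output.

```elixir
defmodule DispatchSketch do
  # Hypothetical helper: returns {entries, pending_contexts}.
  def collect(results) do
    Enum.reduce(results, {%{}, []}, fn
      # Single codepoint: keyed by the bare integer.
      {:ok, [cp], elements}, {entries, pending} ->
        {Map.put(entries, cp, elements), pending}

      # Multi-codepoint sequence: keyed by a tuple.
      {:ok, cps, elements}, {entries, pending} ->
        {Map.put(entries, List.to_tuple(cps), elements), pending}

      # Context entries are set aside; they are prepended, so the
      # pending list ends up in reverse input order.
      {:context, _context_cp, _target_cp, _elements} = ctx, {entries, pending} ->
        {entries, [ctx | pending]}

      # Unparseable lines are ignored.
      :skip, acc ->
        acc
    end)
  end
end
```

Deferring context entries until every `{:ok, ...}` result has been stored means the context codepoint's own elements are available when the contraction is built.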