This guide explains how to use Localize.Collation for locale-sensitive string sorting and comparison.
What Localize.Collation does
Localize.Collation implements the Unicode Collation Algorithm (UCA) with CLDR locale-specific tailoring. It provides:
sort/2: sort a list of strings in locale-appropriate order.
compare/3: compare two strings, returning :lt, :eq, or :gt.
sort_key/2: generate a binary sort key for external sorting (e.g., database ORDER BY).
These functions handle multi-level comparison (base character, accents, case, punctuation), locale-specific letter ordering, script reordering, and special rules for digraphs, contractions, and expansions.
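Because compare/3 returns :lt, :eq, or :gt, it can also drive Enum.sort/2 when the values being sorted are maps or structs rather than bare strings. The following is a minimal sketch under stated assumptions: the module name CollationSortSketch and the :name field are hypothetical, and the comparator is passed in as a two-argument function so the snippet runs without the library installed.

```elixir
defmodule CollationSortSketch do
  # Hypothetical helper, not part of Localize.
  # `compare` is any function returning :lt, :eq, or :gt for two strings.
  # With the library available it could be, for example:
  #   &Localize.Collation.compare(&1, &2, locale: "sv")
  def sort_by_name(people, compare) do
    # Returning true for :lt and :eq keeps the sort stable.
    Enum.sort(people, fn a, b -> compare.(a.name, b.name) != :gt end)
  end
end

# Stand-in comparator (plain downcased comparison) so the sketch is runnable.
stand_in = fn a, b ->
  {x, y} = {String.downcase(a), String.downcase(b)}

  cond do
    x < y -> :lt
    x > y -> :gt
    true -> :eq
  end
end

people = [%{name: "bob"}, %{name: "Alice"}, %{name: "carol"}]
IO.inspect(CollationSortSketch.sort_by_name(people, stand_in))
```

The stand-in comparator is only there to make the example executable; it does not perform collation.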
Why Enum.sort is not enough
Elixir's Enum.sort/1 compares strings by Unicode codepoint value. This produces results that are incorrect for most human-facing use cases:
iex> # Codepoint sorting — wrong for users
iex> Enum.sort(["résumé", "resume", "Résumé", "RESUME"])
["RESUME", "Résumé", "resume", "résumé"]
Problems with codepoint sorting:
Case: uppercase letters (A–Z, U+0041–005A) sort before all lowercase letters (a–z, U+0061–007A), so "RESUME" appears before "resume".
Accents: accented characters sort after all ASCII letters, so "résumé" appears last.
Non-Latin scripts: Cyrillic, Greek, CJK, and other scripts sort in arbitrary codepoint order that doesn't match any language's expectations.
Locale conventions: many languages treat certain character combinations as single letters (e.g., Croatian "dž", Spanish traditional "ch", Hungarian "cs").
UCA-based collation fixes all of these:
iex> Localize.Collation.sort(["résumé", "resume", "Résumé", "RESUME"])
["resume", "RESUME", "résumé", "Résumé"]
Base letters sort together, case and accents are secondary/tertiary distinctions, and locale-specific rules apply automatically.
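The codepoint behaviour described earlier can be reproduced with the standard library alone. This short sketch uses no Localize functions; it only shows how Elixir's default term ordering treats case and accents.

```elixir
# Elixir's default term ordering compares binaries byte by byte, which for
# UTF-8 strings gives the same result as comparing codepoints.
words = ["résumé", "resume", "Résumé", "RESUME"]

# Every uppercase ASCII letter has a lower codepoint than any lowercase one.
IO.inspect("Zebra" < "apple")
# => true

# "é" (U+00E9, written as an escape here) lies above the whole ASCII range.
IO.inspect("zebra" < "\u00E9cole")
# => true

IO.inspect(Enum.sort(words))
# => ["RESUME", "Résumé", "resume", "résumé"]
```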
How locale affects collation
Every locale can define:
Letter ordering: which characters sort where. For example, Swedish places å, ä, ö after z; German standard treats ä as a variant of a.
Digraphs and contractions: character sequences that sort as single units. Croatian treats "lj" as a letter between l and m.
Expansions: single characters that sort as if they were multiple characters. German phonebook treats "ä" as "ae".
Default options: some locales set case_first: :upper by default (Danish, Norwegian).
You specify the locale with the :locale option:
iex> # Croatian: č sorts between c and d
iex> Localize.Collation.sort(["č", "c", "d"], locale: "hr")
["c", "č", "d"]
iex> # Spanish: ñ sorts between n and o
iex> Localize.Collation.sort(["ñ", "n", "o"], locale: "es")
["n", "ñ", "o"]
When no locale is specified, Localize.get_locale() is used.
Examples
Basic DUCET sorting
The Default Unicode Collation Element Table (DUCET) provides the base ordering for all characters. Letters sort by base character first, then by accent, then by case:
iex> Localize.Collation.sort(["résumé", "resume", "Résumé", "RESUME"])
["resume", "RESUME", "résumé", "Résumé"]
iex> Localize.Collation.sort(["banana", "Apple", "cherry"])
["Apple", "banana", "cherry"]
Case-insensitive sorting
Set strength: :secondary to ignore case differences (level 3). Characters that differ only in case compare as equal:
iex> Localize.Collation.compare("a", "A", strength: :secondary)
:eq
iex> Localize.Collation.sort(["banana", "Apple", "cherry"],
...> strength: :secondary)
["Apple", "banana", "cherry"]
Accent-insensitive sorting
Set strength: :primary to ignore both accent and case differences. Characters that share the same base letter compare as equal:
iex> Localize.Collation.compare("cafe", "café", strength: :primary)
:eq
iex> Localize.Collation.sort(["cafe", "café", "caff"],
...> strength: :primary)
["cafe", "café", "caff"]
Case-first sorting
Control whether uppercase or lowercase sorts first among otherwise equal strings:
iex> # Uppercase first
iex> Localize.Collation.sort(["apple", "Apple", "APPLE"],
...> case_first: :upper)
["APPLE", "Apple", "apple"]
iex> # Lowercase first (default for most locales)
iex> Localize.Collation.sort(["apple", "Apple", "APPLE"],
...> case_first: :lower)
["apple", "Apple", "APPLE"]
iex> # Danish defaults to uppercase first
iex> Localize.Collation.sort(["apple", "Apple"], locale: "da")
["Apple", "apple"]
German phonebook sorting
German has two collation types. Standard collation treats ä as a variant of a. Phonebook collation expands ä to "ae", placing it between "ad" and "af":
iex> # Standard: Ä sorts near A
iex> Localize.Collation.sort(["Ärger", "Anger", "Azur"], locale: "de")
["Anger", "Ärger", "Azur"]
iex> # Phonebook: Ä expands to AE, sorts between AD and AF
iex> Localize.Collation.sort(["Ärger", "Anger", "Azur"],
...> locale: "de", type: :phonebook)
["Ärger", "Anger", "Azur"]
French Canadian backwards accent sorting
French Canadian uses "backwards" level-2 comparison, meaning accents are compared from the end of the string rather than the beginning. This affects words that differ in accent position:
iex> # French Canadian (backwards): the last accent difference decides, so côte sorts before coté
iex> Localize.Collation.sort(["côte", "coté", "cote", "côté"],
...> locale: "fr-CA")
["cote", "côte", "coté", "côté"]
iex> # Default (forwards): the first accent difference decides, so coté sorts before côte
iex> Localize.Collation.sort(["côte", "coté", "cote", "côté"])
["cote", "coté", "côte", "côté"]
Numeric sorting
Enable numeric collation to sort embedded digit sequences by numeric value rather than digit-by-digit:
iex> # Without numeric: "10" < "2" (digits compare one at a time: "1" < "2")
iex> Localize.Collation.sort(["file10", "file2", "file1"])
["file1", "file10", "file2"]
iex> # With numeric: 2 < 10
iex> Localize.Collation.sort(["file10", "file2", "file1"],
...> numeric: true)
["file1", "file2", "file10"]
This applies to decimal digits in every script, not just the ASCII digits 0-9.
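To make the idea concrete, here is a simplified illustration of sorting by numeric value using only the standard library. It is not how Localize.Collation implements numeric: true; the module name NaturalKeySketch is hypothetical, and unlike the library option this toy version only recognises ASCII digits.

```elixir
defmodule NaturalKeySketch do
  # Hypothetical helper: splits a string into digit runs and non-digit runs.
  # Digit runs become integers so that 2 < 10; other runs stay as text.
  def key(string) do
    ~r/[0-9]+|[^0-9]+/u
    |> Regex.scan(string)
    |> Enum.map(fn [chunk] ->
      case Integer.parse(chunk) do
        # Tagged 0 so that numbers sort before text in the same position.
        {number, ""} -> {0, number}
        _not_a_number -> {1, chunk}
      end
    end)
  end
end

files = ["file10", "file2", "file1"]

IO.inspect(Enum.sort(files))
# => ["file1", "file10", "file2"]

IO.inspect(Enum.sort_by(files, &NaturalKeySketch.key/1))
# => ["file1", "file2", "file10"]
```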
Search collation
Search collation provides tailored equivalences for loose matching. For example, Arabic presentation forms are equated to their base forms, and Korean composed jamo are decomposed:
iex> Localize.Collation.compare("cafe", "café",
...> type: :search, strength: :primary)
:eq
Ignoring punctuation (shifted)
Set alternate: :shifted to treat whitespace and punctuation as ignorable at primary and secondary levels:
iex> Localize.Collation.sort(
...> ["black bird", "blackbird", "black-bird"],
...> alternate: :shifted)
["black bird", "blackbird", "black-bird"]
Collation options
Keyword options
All functions accept the following keyword options:
| Option | Values | Default | Description |
|---|---|---|---|
| :locale | locale atom or string | Localize.get_locale() | Locale for tailoring and defaults. |
| :type | :standard, :search, :phonebook, :pinyin, :stroke, :traditional, etc. | :standard | Collation type. |
| :strength | :primary, :secondary, :tertiary, :quaternary, :identical | :tertiary | Comparison depth. |
| :alternate | :non_ignorable, :shifted | :non_ignorable | How to handle variable-weight characters (whitespace, punctuation). |
| :backwards | true, false | false | Reverse level-2 (accent) comparison direction. |
| :normalization | true, false | false | Force NFD normalisation (auto-enabled when tailoring requires it). |
| :case_level | true, false | false | Insert a case-comparison level between secondary and tertiary. |
| :case_first | :upper, :lower, false | false | Whether uppercase or lowercase sorts first. |
| :numeric | true, false | false | Sort embedded digit sequences by numeric value. |
| :max_variable | :space, :punct, :symbol, :currency | :punct | Highest character class treated as variable when alternate: :shifted. |
| :reorder | list of script atoms | [] | Reorder script groups (e.g., [:Cyrl, :Latn]). |
Shorthand options
These convenience options map to the core options above. If a corresponding core option is also provided, the core option takes precedence.
| Shorthand | Equivalent | Description |
|---|---|---|
| ignore_accents: true | strength: :primary | Ignore accent and case differences. |
| ignore_case: true | strength: :secondary | Ignore case differences but respect accents. |
| ignore_punctuation: true | strength: :tertiary, alternate: :shifted | Treat whitespace and punctuation as ignorable. |
| casing: :insensitive | strength: :secondary | Alias for case-insensitive comparison. |
| casing: :sensitive | (no change) | Explicit case-sensitive comparison (the default). |
iex> Localize.Collation.compare("cafe", "café", ignore_accents: true)
:eq
iex> Localize.Collation.compare("a", "A", ignore_case: true)
:eq
iex> Localize.Collation.compare("a", "A", casing: :insensitive)
:eq
Options encoded in a locale identifier
Collation options can be embedded in a BCP 47 locale identifier using the -u- Unicode extension. The collation keys in the extension are parsed and applied automatically:
iex> # German phonebook via locale string
iex> Localize.Collation.sort(["Ärger", "Anger", "Azur"], locale: "de-u-co-phonebk")
["Ärger", "Anger", "Azur"]
iex> # Case-insensitive via locale string
iex> Localize.Collation.compare("a", "A", locale: "en-u-ks-level2")
:eq
iex> # Numeric sorting via locale string
iex> Localize.Collation.sort(["file10", "file2", "file1"], locale: "en-u-kn-true")
["file1", "file2", "file10"]
iex> # Uppercase-first via locale string
iex> Localize.Collation.sort(["apple", "Apple", "APPLE"], locale: "en-u-kf-upper")
["APPLE", "Apple", "apple"]
The BCP 47 keys and their mappings:
| BCP 47 Key | Option | Values |
|---|---|---|
| co | :type | standard, search, phonebk, pinyin, stroke, trad, etc. |
| ks | :strength | level1, level2, level3, level4, identic |
| ka | :alternate | noignore, shifted |
| kb | :backwards | true, false |
| kk | :normalization | true, false |
| kc | :case_level | true, false |
| kf | :case_first | upper, lower, false |
| kn | :numeric | true, false |
| kr | :reorder | Script codes (e.g., latn-arab) |
| kv | :max_variable | space, punct, symbol, currency |
Keyword options and locale-encoded options can be combined. When both are present, keyword options take precedence.
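The mapping table and the precedence rule can be sketched as follows. This is an illustration only, not the parser Localize uses: the module LocaleOptionsSketch and its functions are hypothetical, only a handful of keys from the table are handled, and validation of the locale string is skipped.

```elixir
defmodule LocaleOptionsSketch do
  # Hypothetical helper, not part of Localize: maps a few -u- keys to options.
  @strength %{
    "level1" => :primary,
    "level2" => :secondary,
    "level3" => :tertiary,
    "level4" => :quaternary,
    "identic" => :identical
  }

  # Returns the collation options encoded in the -u- extension, if any.
  def from_locale(locale) do
    case String.split(String.downcase(locale), "-u-", parts: 2) do
      [_language, extension] ->
        extension
        |> String.split("-")
        # A one-character subtag starts another extension, so stop there.
        |> Enum.take_while(&(byte_size(&1) > 1))
        |> group_keys()
        |> Enum.flat_map(&to_option/1)

      [_no_extension] ->
        []
    end
  end

  # Keyword options win over options taken from the locale string.
  def effective(locale, keyword_opts) do
    Keyword.merge(from_locale(locale), keyword_opts)
  end

  # Two-character subtags are keys; longer subtags are values of the last key.
  defp group_keys(subtags) do
    subtags
    |> Enum.reduce([], fn
      <<_::binary-size(2)>> = key, acc -> [{key, []} | acc]
      value, [{key, values} | rest] -> [{key, values ++ [value]} | rest]
      _stray_value, [] -> []
    end)
    |> Enum.reverse()
  end

  defp to_option({"ks", [level]}) when is_map_key(@strength, level),
    do: [strength: @strength[level]]

  defp to_option({"co", ["phonebk"]}), do: [type: :phonebook]
  defp to_option({"ka", ["shifted"]}), do: [alternate: :shifted]
  defp to_option({"ka", ["noignore"]}), do: [alternate: :non_ignorable]
  defp to_option({"kf", ["upper"]}), do: [case_first: :upper]
  defp to_option({"kf", ["lower"]}), do: [case_first: :lower]
  # In BCP 47 a bare "kn" means the same as "kn-true".
  defp to_option({"kn", values}), do: [numeric: values in [[], ["true"]]]
  # Remaining keys are omitted from this sketch.
  defp to_option(_other), do: []
end

IO.inspect(LocaleOptionsSketch.from_locale("en-u-ks-level2"))
# => [strength: :secondary]
```

Calling LocaleOptionsSketch.effective("en-u-ks-level2", strength: :primary) yields a :strength of :primary, mirroring the rule that keyword options take precedence.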
How options interact
Strength controls how many levels are compared:
| Strength | Compares | Ignores |
|---|---|---|
| :primary | Base character | Accents, case, punctuation |
| :secondary | Base + accents | Case, punctuation |
| :tertiary (default) | Base + accents + case | Punctuation differences |
| :quaternary | Base + accents + case + punctuation | Nothing (except NFD ordering) |
| :identical | Everything including codepoint | Nothing |
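The first three strength levels can be illustrated with a toy key built from the standard library. This is a sketch under strong assumptions, not the UCA: the module LevelsSketch is hypothetical, it only makes sense for simple Latin text, every combining mark counts as an "accent", and lowercase is placed before uppercase.

```elixir
defmodule LevelsSketch do
  # Hypothetical toy model of multi-level comparison (not the real UCA).
  def key(string, strength \\ :tertiary) do
    parts =
      string
      |> String.normalize(:nfd)
      |> String.graphemes()
      |> Enum.map(fn grapheme ->
        # After NFD, a grapheme is a base codepoint plus any combining marks.
        [base | marks] = String.codepoints(grapheme)
        lower = String.downcase(base)
        {lower, marks, if(base == lower, do: 0, else: 1)}
      end)

    primary = Enum.map(parts, &elem(&1, 0))
    secondary = Enum.map(parts, &elem(&1, 1))
    tertiary = Enum.map(parts, &elem(&1, 2))

    # Lower strengths simply drop the later levels from the key.
    case strength do
      :primary -> {primary}
      :secondary -> {primary, secondary}
      :tertiary -> {primary, secondary, tertiary}
    end
  end

  def compare(a, b, strength \\ :tertiary) do
    {key_a, key_b} = {key(a, strength), key(b, strength)}

    cond do
      key_a < key_b -> :lt
      key_a > key_b -> :gt
      true -> :eq
    end
  end
end

IO.inspect(LevelsSketch.compare("cafe", "café", :primary))
# => :eq
IO.inspect(LevelsSketch.compare("a", "A", :secondary))
# => :eq
IO.inspect(Enum.sort_by(["résumé", "resume", "Résumé", "RESUME"], &LevelsSketch.key/1))
# => ["resume", "RESUME", "résumé", "Résumé"]
```

Because the whole primary list is compared before any accent or case information, base letters always dominate, which is the behaviour the table describes.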
Case options interact with strength:
case_first: :upper only has an effect at tertiary strength or above. At :primary or :secondary strength, case is already ignored.
case_level: true inserts an extra comparison level between secondary and tertiary. Combined with strength: :primary, this allows case-sensitive, accent-insensitive sorting.
Alternate interacts with max_variable:
alternate: :shifted makes characters up to max_variable ignorable at primary/secondary levels. With max_variable: :punct (default), whitespace and punctuation are ignored.
Setting max_variable: :symbol additionally ignores symbols. Setting max_variable: :currency ignores currency symbols too.
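A rough way to picture the effect of max_variable is to strip the affected character classes before comparing. The sketch below does exactly that with Unicode general categories; it is an approximation for illustration, the module ShiftedSketch is hypothetical, and the real variable-weight classes come from the collation tables rather than from these regular expressions.

```elixir
defmodule ShiftedSketch do
  # Hypothetical toy model: removes the characters that alternate: :shifted
  # would treat as ignorable for a given max_variable setting.
  def strip_variable(string, max_variable \\ :punct) do
    String.replace(string, variable_class(max_variable), "")
  end

  # Each setting includes everything covered by the settings before it.
  defp variable_class(:space), do: ~r/[\s\p{Z}]/u
  defp variable_class(:punct), do: ~r/[\s\p{Z}\p{P}]/u
  defp variable_class(:symbol), do: ~r/[\s\p{Z}\p{P}\p{Sk}\p{Sm}\p{So}]/u
  defp variable_class(:currency), do: ~r/[\s\p{Z}\p{P}\p{S}]/u
end

IO.inspect(ShiftedSketch.strip_variable("black-bird"))
# => "blackbird"
IO.inspect(ShiftedSketch.strip_variable("a+b", :symbol))
# => "ab"
IO.inspect(ShiftedSketch.strip_variable("a$b", :symbol))
# => "a$b"
```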
Backwards only affects level 2 (accents). It reverses the direction of accent comparison — accents at the end of a string take precedence over accents at the beginning.
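The direction change can be demonstrated with a toy model that records, for each letter, whether it carries an accent, and then compares those flags from the start or from the end. This is a simplified sketch: the module BackwardsSketch is hypothetical, all accents are treated as equal, and the words are assumed to share the same base letters so the primary level is skipped.

```elixir
defmodule BackwardsSketch do
  # 1 where a letter carries at least one combining mark after NFD, else 0.
  def accent_flags(word) do
    word
    |> String.normalize(:nfd)
    |> String.graphemes()
    |> Enum.map(fn grapheme ->
      if length(String.codepoints(grapheme)) > 1, do: 1, else: 0
    end)
  end

  # Sorts words by their accent flags, read forwards or backwards.
  def sort(words, backwards \\ false) do
    Enum.sort_by(words, fn word ->
      flags = accent_flags(word)
      if backwards, do: Enum.reverse(flags), else: flags
    end)
  end
end

words = ["côte", "coté", "cote", "côté"]

IO.inspect(BackwardsSketch.sort(words))
# => ["cote", "coté", "côte", "côté"]

IO.inspect(BackwardsSketch.sort(words, true))
# => ["cote", "côte", "coté", "côté"]
```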
Available locale-specific collations
The following table lists all locales with CLDR collation tailoring. Locales not listed here use the default DUCET ordering with no modifications.
| Locale | Types | Description |
|---|---|---|
| aa | standard | Afar. |
| af | standard | Afrikaans. |
| am | standard | Amharic: Ethiopic script ordering. |
| ar | standard, compat | Arabic script ordering with presentation form mappings. |
| as | standard | Assamese: Bengali script ordering. |
| az | standard, search | Azerbaijani: Latin with Turkish-style dotted/dotless i. |
| bal | standard | Baluchi. |
| bal-Latn | standard | Baluchi (Latin script). |
| be | standard | Belarusian: Cyrillic ordering. |
| bg | standard | Bulgarian: Cyrillic ordering. |
| blo | standard | Anii. |
| bn | standard, traditional | Bengali script ordering. |
| bo | standard | Tibetan script ordering. |
| br | standard | Breton. |
| bs | standard, search | Bosnian: imports Croatian standard rules. |
| bs-Cyrl | standard | Bosnian (Cyrillic): imports Serbian rules. |
| ca | standard, search | Catalan. |
| ceb | standard | Cebuano. |
| chr | standard | Cherokee: Cherokee script ordering. |
| cs | standard, digits-after | Czech: č, ř, š, ž as separate letters. digits-after places digits after letters. |
| cu | standard | Church Slavic: Cyrillic with case-first:upper. |
| cy | standard | Welsh: ch, dd, ff, ng, ll, ph, rh, th digraphs. |
| da | standard, search | Danish: æ, ø, å after z; case-first:upper by default. |
| de | phonebook, eor, search | German: phonebook expands ä→ae, ö→oe, ü→ue. EOR is European Ordering Rules. |
| de-AT | phonebook | Austrian German phonebook. |
| dsb | standard | Lower Sorbian. |
| dz | standard | Dzongkha: Tibetan script ordering. |
| ee | standard | Ewe: additional letters ɖ, ɛ, ƒ, ɣ, ŋ, ɔ, ʋ. |
| el | standard | Greek script ordering. |
| en-US-POSIX | standard | POSIX sort order (codepoint-like). |
| eo | standard | Esperanto: ĉ, ĝ, ĥ, ĵ, ŝ, ŭ placement. |
| es | standard, search, traditional | Spanish: traditional treats ch and ll as separate letters. |
| et | standard | Estonian: š, ž, ö, ä, ü, õ after z. |
| fa | standard | Persian: Arabic script with Persian-specific ordering. |
| fa-AF | standard | Dari (Afghan Persian). |
| ff-Adlm | standard | Fulah (Adlam script). |
| fi | standard, search, traditional | Finnish: å after z; traditional differs in w/v ordering. |
| fil | standard | Filipino: Spanish-derived letter ordering. |
| fo | standard, search | Faroese: Nordic letter ordering. |
| fr-CA | standard | French Canadian: backwards level-2 (accent) comparison. |
| fy | standard | Western Frisian. |
| gl | standard, search | Galician: imports Spanish/Catalan rules. |
| gu | standard | Gujarati script ordering. |
| ha | standard | Hausa: additional letters ɓ, ɗ, ƙ. |
| haw | standard | Hawaiian: ʻokina placement. |
| he | standard | Hebrew script ordering. |
| hi | standard | Hindi: Devanagari script ordering. |
| hr | standard, search | Croatian: č, ć, dž, đ, lj, nj, š, ž as separate letters. |
| hsb | standard | Upper Sorbian. |
| hu | standard | Hungarian: digraphs cs, dz, dzs, gy, ly, ny, sz, ty, zs; double consonant expansion (ccs→cs). |
| hy | standard | Armenian script ordering. |
| ig | standard | Igbo: additional letters ị, ọ, ụ, ñ, gb, gw, kp, kw, nw, ny, sh. |
| is | standard, search | Icelandic: Nordic letter ordering with ð, þ. |
| ja | standard, unihan, private-kana | Japanese: kana ordering with Kanji by radical/stroke. Unihan uses Han reading order. |
| ka | standard | Georgian script ordering. |
| kk | standard | Kazakh: Cyrillic ordering. |
| kk-Arab | standard | Kazakh (Arabic script). |
| kl | standard, search | Kalaallisut: Nordic letter ordering. |
| km | standard | Khmer script ordering. |
| kn | standard, traditional | Kannada script ordering. |
| ko | standard, search, unihan | Korean: Hangul jamo decomposition; unihan uses Han reading order. |
| kok | standard | Konkani: Devanagari ordering. |
| ku | standard | Kurdish. |
| ky | standard | Kyrgyz: Cyrillic ordering. |
| lkt | standard | Lakota. |
| ln | standard, phonetic | Lingala: phonetic variant available. |
| lo | standard | Lao script ordering. |
| lt | standard | Lithuanian: y between i and j. |
| lv | standard | Latvian: č, ģ, ķ, ļ, ņ, š, ž as separate letters. |
| mk | standard | Macedonian: Cyrillic ordering. |
| ml | standard | Malayalam script ordering. |
| mn | standard | Mongolian: Cyrillic ordering. |
| mr | standard | Marathi: Devanagari ordering. |
| mt | standard | Maltese: ċ, ġ, għ, ħ, ż; case-first:upper by default. |
| my | standard | Myanmar (Burmese) script ordering. |
| ne | standard | Nepali: Devanagari ordering. |
| no | standard, search | Norwegian: æ, ø, å after z; case-first:upper by default. |
| nso | standard | Northern Sotho. |
| om | standard | Oromo. |
| or | standard | Odia script ordering. |
| pa | standard | Punjabi: Gurmukhi script ordering. |
| pl | standard | Polish: ą, ć, ę, ł, ń, ó, ś, ź, ż as separate letters. |
| ps | standard | Pashto: Arabic script ordering. |
| ro | standard | Romanian: ă, â, î, ș, ț placement. |
| ru | standard | Russian: Cyrillic ordering. |
| sa | standard, traditional | Sanskrit: Devanagari ordering. |
| se | standard, search | Northern Sami. |
| sgs | standard | Samogitian. |
| si | standard, dictionary | Sinhala: dictionary variant available. |
| sk | standard, search | Slovak: č, ď, ĺ, ľ, ň, ô, ŕ, š, ť, ž, á, ä, é, í, ó, ú, ý. |
| sl | standard | Slovenian. |
| smn | standard, search | Inari Sami. |
| sq | standard | Albanian: ç, dh, ë, gj, ll, nj, rr, sh, th, xh, zh. |
| sr | standard | Serbian: Cyrillic ordering. |
| sr-Latn | standard, search | Serbian (Latin): imports Croatian rules. |
| ssy | standard | Saho. |
| sv | standard, search, traditional | Swedish: å, ä, ö after z. |
| ta | standard | Tamil script ordering. |
| te | standard | Telugu script ordering. |
| th | standard | Thai script ordering. |
| tk | standard | Turkmen. |
| tn | standard | Tswana. |
| to | standard | Tongan. |
| tr | standard, search | Turkish: dotted/dotless i distinction; ç, ğ, ı, ö, ş, ü. |
| ug | standard | Uyghur: Arabic script ordering. |
| uk | standard | Ukrainian: Cyrillic ordering. |
| und | search | Root search rules: Arabic form equivalences, Korean jamo decomposition, Thai/Lao contraction suppression. |
| ur | standard | Urdu: Arabic script ordering. |
| uz | standard | Uzbek. |
| vi | standard, traditional | Vietnamese: extensive accent and tone ordering. |
| vo | standard | Volapük. |
| wae | standard | Walser. |
| wo | standard | Wolof. |
| yi | standard | Yiddish: Hebrew script ordering. |
| yo | standard | Yoruba: additional letters ẹ, ọ, ṣ, gb. |
| zh | pinyin, stroke, unihan, zhuyin, private-pinyin | Chinese: pinyin (pronunciation), stroke (stroke count), zhuyin (Bopomofo), unihan (radical-stroke). |
Sort keys
For bulk sorting or database indexing, generate sort keys once and compare the raw binaries:
iex> key_a = Localize.Collation.sort_key("café")
iex> key_b = Localize.Collation.sort_key("caff")
iex> key_a < key_b
true
Sort keys are binaries that preserve the collation ordering when compared with <, >, and ==. They encode all comparison levels into a single binary, so comparison is a simple byte comparison with no need to re-apply collation rules.
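One way to apply this is to compute the key once per record, store it next to the data, and order by the stored binary later. The sketch below shows that pattern under stated assumptions: the module SortKeyIndexSketch, the :name field, and the :name_key field are hypothetical, and the key function is passed in so the snippet runs without the library.

```elixir
defmodule SortKeyIndexSketch do
  # Hypothetical helper, not part of Localize.
  # `key_fun` turns a string into a sortable binary. With the library
  # available it could be, for example:
  #   &Localize.Collation.sort_key(&1, locale: "de")
  def index(records, key_fun) do
    # Collation work happens here, once per record.
    Enum.map(records, fn record ->
      Map.put(record, :name_key, key_fun.(record.name))
    end)
  end

  # Later orderings only compare the stored binaries.
  def ordered(indexed_records) do
    Enum.sort_by(indexed_records, & &1.name_key)
  end
end

# Stand-in key function so the sketch is runnable; it does not collate.
stand_in = &String.downcase/1

[%{name: "banana"}, %{name: "Apple"}, %{name: "cherry"}]
|> SortKeyIndexSketch.index(stand_in)
|> SortKeyIndexSketch.ordered()
|> Enum.map(& &1.name)
|> IO.inspect()
# => ["Apple", "banana", "cherry"]
```

The same idea carries over to a database: the stored key column can be indexed and used in ORDER BY, provided the column type compares values as raw bytes.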
Performance notes
Collation tables are loaded once into :persistent_term on first use. Subsequent calls have no file I/O overhead.
Locale tailoring overlays are computed once per locale/type pair and cached. Overlay lookup is a hash map get (~575ns for 7,000+ entry CJK overlays).
Optional NIF: when LOCALIZE_NIF=true is set at compile time, sort key generation uses a C NIF for significantly faster performance. See Localize.Nif for details.