This guide explains how to use Localize.Collation for locale-sensitive string sorting and comparison.

What Localize.Collation does

Localize.Collation implements the Unicode Collation Algorithm (UCA) with CLDR locale-specific tailoring. It provides:

  • sort/2 — sort a list of strings in locale-appropriate order.

  • compare/3 — compare two strings, returning :lt, :eq, or :gt.

  • sort_key/2 — generate a binary sort key for external sorting (e.g., database ORDER BY).

These functions handle multi-level comparison (base character, accents, case, punctuation), locale-specific letter ordering, script reordering, and special rules for digraphs, contractions, and expansions.

Why Enum.sort is not enough

Elixir's Enum.sort/1 compares strings by Unicode codepoint value. This produces results that are incorrect for most human-facing use cases:

iex> # Codepoint sorting — wrong for users
iex> Enum.sort(["résumé", "resume", "Résumé", "RESUME"])
["RESUME", "Résumé", "resume", "résumé"]

Problems with codepoint sorting:

  • Case: uppercase letters (A–Z, U+0041–005A) sort before all lowercase letters (a–z, U+0061–007A), so "RESUME" appears before "resume".

  • Accents: accented characters sort after all ASCII letters, so "résumé" appears last.

  • Non-Latin scripts: Cyrillic, Greek, CJK, and other scripts sort in arbitrary codepoint order that doesn't match any language's expectations.

  • Locale conventions: many languages treat certain character combinations as single letters (e.g., Croatian "dž", Spanish traditional "ch", Hungarian "cs").
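The case and accent problems above follow directly from the numeric codepoint values. The sketch below uses only the Elixir standard library to inspect them; the variable names are illustrative, and the accented letters are written as `\u` escapes so the result does not depend on how the source file is normalized.

```elixir
# Leading codepoints of three sample strings ("\u00E9" is é).
codepoints =
  for ch <- ["R", "r", "\u00E9"] do
    <<codepoint::utf8, _rest::binary>> = ch
    {ch, codepoint}
  end

# => [{"R", 82}, {"r", 114}, {"é", 233}]

# Because 82 < 114 < 233, Enum.sort/1 puts uppercase first and accented last.
words = ["r\u00E9sum\u00E9", "resume", "R\u00E9sum\u00E9", "RESUME"]
sorted = Enum.sort(words)
# => ["RESUME", "Résumé", "resume", "résumé"]
```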

UCA-based collation fixes all of these:

iex> Localize.Collation.sort(["résumé", "resume", "Résumé", "RESUME"])
["resume", "RESUME", "résumé", "Résumé"]

Base letters sort together, accents and case act as secondary and tertiary distinctions respectively, and locale-specific rules apply automatically.

How locale affects collation

Every locale can define:

  • Letter ordering — which characters sort where. For example, Swedish places å, ä, ö after z; German standard treats ä as a variant of a.

  • Digraphs and contractions — character sequences that sort as single units. Croatian treats "lj" as a letter between l and m.

  • Expansions — single characters that sort as if they were multiple characters. German phonebook treats "ä" as "ae".

  • Default options — some locales set case_first: :upper by default (Danish, Norwegian).

You specify the locale with the :locale option:

iex> # Croatian: č sorts between c and d
iex> Localize.Collation.sort(["č", "c", "d"], locale: "hr")
["c", "č", "d"]

iex> # Spanish: ñ sorts between n and o
iex> Localize.Collation.sort(["ñ", "n", "o"], locale: "es")
["n", "ñ", "o"]

When no locale is specified, Localize.get_locale() is used.

Examples

Basic DUCET sorting

The Default Unicode Collation Element Table (DUCET) provides the base ordering for all characters. Letters sort by base character first, then by accent, then by case:

iex> Localize.Collation.sort(["résumé", "resume", "Résumé", "RESUME"])
["resume", "RESUME", "résumé", "Résumé"]

iex> Localize.Collation.sort(["banana", "Apple", "cherry"])
["Apple", "banana", "cherry"]

Case-insensitive sorting

Set strength: :secondary to ignore case differences (level 3). Characters that differ only in case compare as equal:

iex> Localize.Collation.compare("a", "A", strength: :secondary)
:eq

iex> Localize.Collation.sort(["banana", "Apple", "cherry"],
...>   strength: :secondary)
["Apple", "banana", "cherry"]

Accent-insensitive sorting

Set strength: :primary to ignore both accent and case differences. Characters that share the same base letter compare as equal:

iex> Localize.Collation.compare("cafe", "café", strength: :primary)
:eq

iex> Localize.Collation.sort(["cafe", "café", "caff"],
...>   strength: :primary)
["cafe", "café", "caff"]

Case-first sorting

Control whether uppercase or lowercase sorts first among otherwise equal strings:

iex> # Uppercase first
iex> Localize.Collation.sort(["apple", "Apple", "APPLE"],
...>   case_first: :upper)
["APPLE", "Apple", "apple"]

iex> # Lowercase first (default for most locales)
iex> Localize.Collation.sort(["apple", "Apple", "APPLE"],
...>   case_first: :lower)
["apple", "Apple", "APPLE"]

iex> # Danish defaults to uppercase first
iex> Localize.Collation.sort(["apple", "Apple"], locale: "da")
["Apple", "apple"]

German phonebook sorting

German has two collation types. Standard collation treats ä as a variant of a. Phonebook collation expands ä to "ae", placing it between "ad" and "af":

iex> # Standard: Ä sorts near A
iex> Localize.Collation.sort(["Ärger", "Anger", "Azur"], locale: "de")
["Anger", "Ärger", "Azur"]

iex> # Phonebook: Ä expands to AE, sorts between AD and AF
iex> Localize.Collation.sort(["Ärger", "Anger", "Azur"],
...>   locale: "de", type: :phonebook)
["Ärger", "Anger", "Azur"]
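To make the idea of an expansion concrete, here is a toy sketch in plain Elixir. It is not how Localize.Collation works internally: `ToyPhonebook` and its expansion table are hypothetical, and it simply rewrites each umlaut before an ordinary codepoint sort.

```elixir
defmodule ToyPhonebook do
  # Hypothetical expansion table: ä, ö, ü expand to the base letter plus "e".
  @expansions %{"\u00E4" => "ae", "\u00F6" => "oe", "\u00FC" => "ue"}

  # Builds a comparison key by lowercasing and applying the expansions.
  def key(string) do
    string
    |> String.downcase()
    |> String.replace(Map.keys(@expansions), &Map.fetch!(@expansions, &1))
  end
end

# "Ärger" is keyed as "aerger", which sorts before "anger".
key = ToyPhonebook.key("\u00C4rger")
sorted = Enum.sort_by(["\u00C4rger", "Anger", "Azur"], &ToyPhonebook.key/1)
# => ["Ärger", "Anger", "Azur"]
```

A real tailoring works on collation weights rather than rewritten strings, so this sketch only shows why the expanded form moves ahead of "Anger".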

French Canadian backwards accent sorting

French Canadian uses "backwards" level-2 comparison, meaning accents are compared from the end of the string rather than the beginning. This affects words that differ in accent position:

iex> # French Canadian: the last accent difference decides, so "côte" sorts before "coté"
iex> Localize.Collation.sort(["côte", "coté", "cote", "côté"],
...>   locale: "fr-CA")
["cote", "côte", "coté", "côté"]

iex> # Default (non-backwards): the first accent difference decides, so "coté" sorts before "côte"
iex> Localize.Collation.sort(["côte", "coté", "cote", "côté"])
["cote", "coté", "côte", "côté"]
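The effect of comparison direction can be reproduced with a small standard-library sketch. `ToyBackwards` is a hypothetical module, not part of Localize: it uses the combining marks found after NFD normalization as stand-ins for real secondary weights, and reverses that list when backwards comparison is requested.

```elixir
defmodule ToyBackwards do
  # Key is {base letters, accent marks per letter}. These are NOT real UCA
  # weights, just decomposed codepoints used for illustration.
  def key(string, backwards?) do
    clusters =
      string
      |> String.normalize(:nfd)
      |> String.graphemes()
      |> Enum.map(&String.to_charlist/1)

    bases = Enum.map(clusters, &hd/1)
    accents = Enum.map(clusters, &tl/1)
    {bases, if(backwards?, do: Enum.reverse(accents), else: accents)}
  end

  def sort(strings, opts \\ []) do
    backwards? = Keyword.get(opts, :backwards, false)
    Enum.sort_by(strings, &key(&1, backwards?))
  end
end

words = ["c\u00F4te", "cot\u00E9", "cote", "c\u00F4t\u00E9"]

forward = ToyBackwards.sort(words)
# => ["cote", "coté", "côte", "côté"]

backward = ToyBackwards.sort(words, backwards: true)
# => ["cote", "côte", "coté", "côté"]
```

With forward comparison the first differing accent decides; with backwards comparison the last one does, which is why "côte" and "coté" swap places.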

Numeric sorting

Enable numeric collation to sort embedded digit sequences by numeric value rather than digit-by-digit:

iex> # Without numeric: digits compare one at a time, so "10" sorts before "2"
iex> Localize.Collation.sort(["file10", "file2", "file1"])
["file1", "file10", "file2"]

iex> # With numeric: 2 < 10
iex> Localize.Collation.sort(["file10", "file2", "file1"],
...>   numeric: true)
["file1", "file2", "file10"]

This applies to decimal digits in every script, not only the ASCII digits 0-9.
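As a rough illustration of what numeric ordering means, the following sketch builds a "natural sort" key with the standard library alone. `ToyNumeric` is hypothetical and much simpler than real numeric collation: it recognises ASCII digits only and compares the text chunks by codepoint.

```elixir
defmodule ToyNumeric do
  # Splits a string into digit and non-digit runs and turns each digit run
  # into an integer, so that 2 compares below 10.
  def key(string) do
    ~r/\d+|\D+/
    |> Regex.scan(string)
    |> Enum.map(fn [chunk] ->
      case Integer.parse(chunk) do
        {number, ""} -> number
        _ -> chunk
      end
    end)
  end
end

key = ToyNumeric.key("file10")
# => ["file", 10]

sorted = Enum.sort_by(["file10", "file2", "file1"], &ToyNumeric.key/1)
# => ["file1", "file2", "file10"]
```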

Search collation

Search collation provides tailored equivalences for loose matching. For example, Arabic presentation forms are equated to their base forms, and Korean composed jamo are decomposed:

iex> Localize.Collation.compare("cafe", "café",
...>   type: :search, strength: :primary)
:eq

Ignoring punctuation (shifted)

Set alternate: :shifted to treat whitespace and punctuation as ignorable at the primary, secondary, and tertiary levels, so they only make a difference at quaternary strength:

iex> Localize.Collation.sort(
...>   ["black bird", "blackbird", "black-bird"],
...>   alternate: :shifted)
["black bird", "blackbird", "black-bird"]
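One way to picture shifted handling is as a key that drops whitespace and punctuation before comparing. The sketch below is a deliberate simplification using a regular expression, not the library's variable-weighting logic; `strip_variable` is a hypothetical helper.

```elixir
# Remove whitespace and Unicode punctuation to approximate "ignorable".
strip_variable = fn string -> String.replace(string, ~r/[\s\p{P}]+/u, "") end

words = ["black bird", "blackbird", "black-bird"]

keys = Enum.map(words, strip_variable)
# => ["blackbird", "blackbird", "blackbird"]

# All keys are equal, and Enum.sort_by/2 is stable, so input order is kept.
sorted = Enum.sort_by(words, strip_variable)
# => ["black bird", "blackbird", "black-bird"]
```

Under this simplified model the three strings are indistinguishable, which is consistent with the sorted result above matching the input order.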

Collation options

Keyword options

All functions accept the following keyword options:

| Option | Values | Default | Description |
| --- | --- | --- | --- |
| :locale | locale atom or string | Localize.get_locale() | Locale for tailoring and defaults. |
| :type | :standard, :search, :phonebook, :pinyin, :stroke, :traditional, etc. | :standard | Collation type. |
| :strength | :primary, :secondary, :tertiary, :quaternary, :identical | :tertiary | Comparison depth. |
| :alternate | :non_ignorable, :shifted | :non_ignorable | How to handle variable-weight characters (whitespace, punctuation). |
| :backwards | true, false | false | Reverse level-2 (accent) comparison direction. |
| :normalization | true, false | false | Force NFD normalisation (auto-enabled when tailoring requires it). |
| :case_level | true, false | false | Insert a case-comparison level between secondary and tertiary. |
| :case_first | :upper, :lower, false | false | Whether uppercase or lowercase sorts first. |
| :numeric | true, false | false | Sort embedded digit sequences by numeric value. |
| :max_variable | :space, :punct, :symbol, :currency | :punct | Highest character class treated as variable when alternate: :shifted. |
| :reorder | list of script atoms | [] | Reorder script groups (e.g., [:Cyrl, :Latn]). |

Shorthand options

These convenience options map to the core options above. If a corresponding core option is also provided, the core option takes precedence.

| Shorthand | Equivalent | Description |
| --- | --- | --- |
| ignore_accents: true | strength: :primary | Ignore accent and case differences. |
| ignore_case: true | strength: :secondary | Ignore case differences but respect accents. |
| ignore_punctuation: true | strength: :tertiary, alternate: :shifted | Treat whitespace and punctuation as ignorable. |
| casing: :insensitive | strength: :secondary | Alias for case-insensitive comparison. |
| casing: :sensitive | (no change) | Explicit case-sensitive comparison (the default). |

iex> Localize.Collation.compare("cafe", "café", ignore_accents: true)
:eq

iex> Localize.Collation.compare("a", "A", ignore_case: true)
:eq

iex> Localize.Collation.compare("a", "A", casing: :insensitive)
:eq

Options encoded in a locale identifier

Collation options can be embedded in a BCP 47 locale identifier using the -u- Unicode extension. The collation keys in that extension are parsed and applied automatically:

iex> # German phonebook via locale string
iex> Localize.Collation.sort(["Ärger", "Anger", "Azur"], locale: "de-u-co-phonebk")
["Ärger", "Anger", "Azur"]

iex> # Case-insensitive via locale string
iex> Localize.Collation.compare("a", "A", locale: "en-u-ks-level2")
:eq

iex> # Numeric sorting via locale string
iex> Localize.Collation.sort(["file10", "file2", "file1"], locale: "en-u-kn-true")
["file1", "file2", "file10"]

iex> # Uppercase-first via locale string
iex> Localize.Collation.sort(["apple", "Apple", "APPLE"], locale: "en-u-kf-upper")
["APPLE", "Apple", "apple"]

The BCP 47 keys and their mappings:

| BCP 47 Key | Option | Values |
| --- | --- | --- |
| co | :type | standard, search, phonebk, pinyin, stroke, trad, etc. |
| ks | :strength | level1, level2, level3, level4, identic |
| ka | :alternate | noignore, shifted |
| kb | :backwards | true, false |
| kk | :normalization | true, false |
| kc | :case_level | true, false |
| kf | :case_first | upper, lower, false |
| kn | :numeric | true, false |
| kr | :reorder | Script codes (e.g., latn-arab) |
| kv | :max_variable | space, punct, symbol, currency |

Keyword options and locale-encoded options can be combined. When both are present, keyword options take precedence.
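The key mapping and the precedence rule can be sketched in a few lines of plain Elixir. `ToyLocaleOptions` is a hypothetical module, not the library's parser: it assumes every key is followed by exactly one value (so multi-value keys such as kr are not handled) and it leaves the values as raw strings.

```elixir
defmodule ToyLocaleOptions do
  # BCP 47 collation keys mapped to the option names used in this guide.
  @keys %{
    "co" => :type,
    "ks" => :strength,
    "ka" => :alternate,
    "kb" => :backwards,
    "kk" => :normalization,
    "kc" => :case_level,
    "kf" => :case_first,
    "kn" => :numeric,
    "kr" => :reorder,
    "kv" => :max_variable
  }

  # Returns a keyword list of {option, raw_value} pairs from the -u- extension.
  def parse(locale) do
    case String.split(locale, "-u-", parts: 2) do
      [_language, extension] ->
        extension
        |> String.split("-")
        |> Enum.chunk_every(2)
        |> Enum.flat_map(&to_option/1)

      [_no_extension] ->
        []
    end
  end

  # Explicit keyword options override anything encoded in the locale.
  def merge(locale, keyword_opts), do: Keyword.merge(parse(locale), keyword_opts)

  defp to_option([key, value]) do
    case Map.fetch(@keys, key) do
      {:ok, option} -> [{option, value}]
      :error -> []
    end
  end

  defp to_option(_incomplete), do: []
end

parsed = ToyLocaleOptions.parse("en-u-kn-true-kf-upper")
# => [numeric: "true", case_first: "upper"]

merged = ToyLocaleOptions.merge("en-u-ks-level2", strength: :tertiary)
# => [strength: :tertiary]
```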

How options interact

Strength controls how many levels are compared:

| Strength | Compares | Ignores |
| --- | --- | --- |
| :primary | Base character | Accents, case, punctuation |
| :secondary | Base + accents | Case, punctuation |
| :tertiary (default) | Base + accents + case | Punctuation differences |
| :quaternary | Base + accents + case + punctuation | Nothing (except NFD ordering) |
| :identical | Everything including codepoint | Nothing |
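The first three strength levels can be mimicked with a toy key built from the standard library. This is only a sketch of the level idea, not the UCA: `ToyLevels` is hypothetical, it uses decomposed codepoints instead of collation weights, and it covers :primary, :secondary, and :tertiary only.

```elixir
defmodule ToyLevels do
  @levels [primary: 1, secondary: 2, tertiary: 3]

  # Builds [base letters, letters with accents, case flags] and keeps only
  # as many levels as the requested strength compares.
  def key(string, strength \\ :tertiary) do
    decomposed =
      string |> String.normalize(:nfd) |> String.downcase() |> String.to_charlist()

    # Level 1: drop combining marks (U+0300..U+036F) to get base letters.
    primary = Enum.reject(decomposed, &(&1 in 0x0300..0x036F))

    # Level 3: 0 for lowercase, 1 for uppercase, so lowercase sorts first.
    tertiary =
      for g <- String.graphemes(string), do: if(String.downcase(g) == g, do: 0, else: 1)

    Enum.take([primary, decomposed, tertiary], Keyword.fetch!(@levels, strength))
  end

  def compare(a, b, strength \\ :tertiary) do
    case {key(a, strength), key(b, strength)} do
      {same, same} -> :eq
      {key_a, key_b} when key_a < key_b -> :lt
      _ -> :gt
    end
  end
end

primary_result = ToyLevels.compare("cafe", "caf\u00E9", :primary)
# => :eq
secondary_result = ToyLevels.compare("a", "A", :secondary)
# => :eq
tertiary_result = ToyLevels.compare("a", "A", :tertiary)
# => :lt

words = ["r\u00E9sum\u00E9", "resume", "R\u00E9sum\u00E9", "RESUME"]
sorted = Enum.sort_by(words, &ToyLevels.key/1)
# => ["resume", "RESUME", "résumé", "Résumé"]
```

Each lower strength simply stops comparing earlier, which is why strings that differ only in accents or case become equal.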

Case options interact with strength:

  • case_first: :upper only has effect at tertiary level or above. At :primary or :secondary strength, case is already ignored.

  • case_level: true inserts an extra comparison level between secondary and tertiary. Combined with strength: :primary, this allows case-sensitive, accent-insensitive sorting.

Alternate interacts with max_variable:

  • alternate: :shifted makes characters up to max_variable ignorable at the primary, secondary, and tertiary levels. With max_variable: :punct (default), whitespace and punctuation are ignored.

  • Setting max_variable: :symbol additionally ignores symbols. Setting max_variable: :currency ignores currency symbols too.

Backwards only affects level 2 (accents). It reverses the direction of accent comparison — accents at the end of a string take precedence over accents at the beginning.

Available locale-specific collations

The following table lists all locales with CLDR collation tailoring. Locales not listed here use the default DUCET ordering with no modifications.

| Locale | Types | Description |
| --- | --- | --- |
| aa | standard | Afar. |
| af | standard | Afrikaans. |
| am | standard | Amharic: Ethiopic script ordering. |
| ar | standard, compat | Arabic script ordering with presentation form mappings. |
| as | standard | Assamese: Bengali script ordering. |
| az | standard, search | Azerbaijani: Latin with Turkish-style dotted/dotless i. |
| bal | standard | Baluchi. |
| bal-Latn | standard | Baluchi (Latin script). |
| be | standard | Belarusian: Cyrillic ordering. |
| bg | standard | Bulgarian: Cyrillic ordering. |
| blo | standard | Anii. |
| bn | standard, traditional | Bengali script ordering. |
| bo | standard | Tibetan script ordering. |
| br | standard | Breton. |
| bs | standard, search | Bosnian: imports Croatian standard rules. |
| bs-Cyrl | standard | Bosnian (Cyrillic): imports Serbian rules. |
| ca | standard, search | Catalan. |
| ceb | standard | Cebuano. |
| chr | standard | Cherokee: Cherokee script ordering. |
| cs | standard, digits-after | Czech: č, ř, š, ž as separate letters. digits-after places digits after letters. |
| cu | standard | Church Slavic: Cyrillic with case-first:upper. |
| cy | standard | Welsh: ch, dd, ff, ng, ll, ph, rh, th digraphs. |
| da | standard, search | Danish: æ, ø, å after z; case-first:upper by default. |
| de | phonebook, eor, search | German: phonebook expands ä→ae, ö→oe, ü→ue. EOR is European Ordering Rules. |
| de-AT | phonebook | Austrian German phonebook. |
| dsb | standard | Lower Sorbian. |
| dz | standard | Dzongkha: Tibetan script ordering. |
| ee | standard | Ewe: additional letters ɖ, ɛ, ƒ, ɣ, ŋ, ɔ, ʋ. |
| el | standard | Greek script ordering. |
| en-US-POSIX | standard | POSIX sort order (codepoint-like). |
| eo | standard | Esperanto: ĉ, ĝ, ĥ, ĵ, ŝ, ŭ placement. |
| es | standard, search, traditional | Spanish: traditional treats ch and ll as separate letters. |
| et | standard | Estonian: š, ž, ö, ä, ü, õ after z. |
| fa | standard | Persian: Arabic script with Persian-specific ordering. |
| fa-AF | standard | Dari (Afghan Persian). |
| ff-Adlm | standard | Fulah (Adlam script). |
| fi | standard, search, traditional | Finnish: å after z; traditional differs in w/v ordering. |
| fil | standard | Filipino: Spanish-derived letter ordering. |
| fo | standard, search | Faroese: Nordic letter ordering. |
| fr-CA | standard | French Canadian: backwards level-2 (accent) comparison. |
| fy | standard | Western Frisian. |
| gl | standard, search | Galician: imports Spanish/Catalan rules. |
| gu | standard | Gujarati script ordering. |
| ha | standard | Hausa: additional letters ɓ, ɗ, ƙ. |
| haw | standard | Hawaiian: ʻokina placement. |
| he | standard | Hebrew script ordering. |
| hi | standard | Hindi: Devanagari script ordering. |
| hr | standard, search | Croatian: č, ć, dž, đ, lj, nj, š, ž as separate letters. |
| hsb | standard | Upper Sorbian. |
| hu | standard | Hungarian: digraphs cs, dz, dzs, gy, ly, ny, sz, ty, zs; double consonant expansion (ccs→cs). |
| hy | standard | Armenian script ordering. |
| ig | standard | Igbo: additional letters ị, ọ, ụ, ñ, gb, gw, kp, kw, nw, ny, sh. |
| is | standard, search | Icelandic: Nordic letter ordering with ð, þ. |
| ja | standard, unihan, private-kana | Japanese: kana ordering with Kanji by radical/stroke. Unihan uses Han reading order. |
| ka | standard | Georgian script ordering. |
| kk | standard | Kazakh: Cyrillic ordering. |
| kk-Arab | standard | Kazakh (Arabic script). |
| kl | standard, search | Kalaallisut: Nordic letter ordering. |
| km | standard | Khmer script ordering. |
| kn | standard, traditional | Kannada script ordering. |
| ko | standard, search, unihan | Korean: Hangul jamo decomposition; unihan uses Han reading order. |
| kok | standard | Konkani: Devanagari ordering. |
| ku | standard | Kurdish. |
| ky | standard | Kyrgyz: Cyrillic ordering. |
| lkt | standard | Lakota. |
| ln | standard, phonetic | Lingala: phonetic variant available. |
| lo | standard | Lao script ordering. |
| lt | standard | Lithuanian: y between i and j. |
| lv | standard | Latvian: č, ģ, ķ, ļ, ņ, š, ž as separate letters. |
| mk | standard | Macedonian: Cyrillic ordering. |
| ml | standard | Malayalam script ordering. |
| mn | standard | Mongolian: Cyrillic ordering. |
| mr | standard | Marathi: Devanagari ordering. |
| mt | standard | Maltese: ċ, ġ, għ, ħ, ż; case-first:upper by default. |
| my | standard | Myanmar (Burmese) script ordering. |
| ne | standard | Nepali: Devanagari ordering. |
| no | standard, search | Norwegian: æ, ø, å after z; case-first:upper by default. |
| nso | standard | Northern Sotho. |
| om | standard | Oromo. |
| or | standard | Odia script ordering. |
| pa | standard | Punjabi: Gurmukhi script ordering. |
| pl | standard | Polish: ą, ć, ę, ł, ń, ó, ś, ź, ż as separate letters. |
| ps | standard | Pashto: Arabic script ordering. |
| ro | standard | Romanian: ă, â, î, ș, ț placement. |
| ru | standard | Russian: Cyrillic ordering. |
| sa | standard, traditional | Sanskrit: Devanagari ordering. |
| se | standard, search | Northern Sami. |
| sgs | standard | Samogitian. |
| si | standard, dictionary | Sinhala: dictionary variant available. |
| sk | standard, search | Slovak: č, ď, ĺ, ľ, ň, ô, ŕ, š, ť, ž, á, ä, é, í, ó, ú, ý. |
| sl | standard | Slovenian. |
| smn | standard, search | Inari Sami. |
| sq | standard | Albanian: ç, dh, ë, gj, ll, nj, rr, sh, th, xh, zh. |
| sr | standard | Serbian: Cyrillic ordering. |
| sr-Latn | standard, search | Serbian (Latin): imports Croatian rules. |
| ssy | standard | Saho. |
| sv | standard, search, traditional | Swedish: å, ä, ö after z. |
| ta | standard | Tamil script ordering. |
| te | standard | Telugu script ordering. |
| th | standard | Thai script ordering. |
| tk | standard | Turkmen. |
| tn | standard | Tswana. |
| to | standard | Tongan. |
| tr | standard, search | Turkish: dotted/dotless i distinction; ç, ğ, ı, ö, ş, ü. |
| ug | standard | Uyghur: Arabic script ordering. |
| uk | standard | Ukrainian: Cyrillic ordering. |
| und | search | Root search rules: Arabic form equivalences, Korean jamo decomposition, Thai/Lao contraction suppression. |
| ur | standard | Urdu: Arabic script ordering. |
| uz | standard | Uzbek. |
| vi | standard, traditional | Vietnamese: extensive accent and tone ordering. |
| vo | standard | Volapük. |
| wae | standard | Walser. |
| wo | standard | Wolof. |
| yi | standard | Yiddish: Hebrew script ordering. |
| yo | standard | Yoruba: additional letters ẹ, ọ, ṣ, gb. |
| zh | pinyin, stroke, unihan, zhuyin, private-pinyin | Chinese: pinyin (pronunciation), stroke (stroke count), zhuyin (Bopomofo), unihan (radical-stroke). |

Sort keys

For bulk sorting or database indexing, generate sort keys once and compare the raw binaries:

iex> key_a = Localize.Collation.sort_key("café")
iex> key_b = Localize.Collation.sort_key("caff")
iex> key_a < key_b
true

Sort keys are binaries that preserve the collation ordering when compared with <, >, and ==. They encode all comparison levels into a single binary, so comparison is a simple byte comparison with no need to re-apply collation rules.
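A common pattern is to compute each key once, store it next to the record, and order by the stored key afterwards. The sketch below shows that shape with a hypothetical `SortKeyIndex` module. So that it runs with the standard library alone, `key_fun` is a stand-in (`String.downcase/1` here); in real use you would presumably pass `&Localize.Collation.sort_key/1` instead.

```elixir
defmodule SortKeyIndex do
  # Computes the key once per name and stores it alongside the record,
  # much like a precomputed column used for ORDER BY.
  def build(names, key_fun) do
    Enum.map(names, fn name -> %{name: name, sort_key: key_fun.(name)} end)
  end

  # Later orderings compare only the stored keys; key_fun is not called again.
  def ordered_names(rows) do
    rows
    |> Enum.sort_by(& &1.sort_key)
    |> Enum.map(& &1.name)
  end
end

# String.downcase/1 is only a placeholder key generator for this sketch.
rows = SortKeyIndex.build(["banana", "Apple", "cherry"], &String.downcase/1)
ordered = SortKeyIndex.ordered_names(rows)
# => ["Apple", "banana", "cherry"]
```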

Performance notes

  • Collation tables are loaded once into :persistent_term on first use. Subsequent calls have no file I/O overhead.

  • Locale tailoring overlays are computed once per locale/type pair and cached. Looking up an overlay entry is a single hash map access (~575ns for 7,000+ entry CJK overlays).

  • Optional NIF — when LOCALIZE_NIF=true is set at compile time, sort key generation uses a C NIF for significantly faster performance. See Localize.Nif for details.