Localize.LanguageTag (Localize v0.5.0)


Represents a language tag as defined in RFC 5646, with the "u" and "t" extensions as defined in BCP 47.

Language tags are used to help identify languages, whether spoken, written, signed, or otherwise signaled, for the purpose of communication. This includes constructed and artificial languages but excludes languages not intended primarily for human communication, such as programming languages.

Syntax

A language tag is composed from a sequence of one or more "subtags", each of which refines or narrows the range of language identified by the overall tag. Subtags, in turn, are a sequence of alphanumeric characters (letters and digits), distinguished and separated from other subtags in a tag by a hyphen ("-", [Unicode] U+002D).

There are different types of subtag, each of which is distinguished by length, position in the tag, and content: each subtag's type can be recognized solely by these features. This makes it possible to extract and assign some semantic information to the subtags, even if the specific subtag values are not recognized. Thus, a language tag processor need not have a list of valid tags or subtags (that is, a copy of some version of the IANA Language Subtag Registry) in order to perform common searching and matching operations. The only exceptions to this ability to infer meaning from subtag structure are the grandfathered tags listed in the productions 'regular' and 'irregular' below. These tags were registered under [RFC3066] and are a fixed list that can never change.

The syntax of the language tag in ABNF is:

Language-Tag  = langtag             ; normal language tags
              / privateuse          ; private use tag
              / grandfathered       ; grandfathered tags

langtag       = language
                ["-" script]
                ["-" region]
                *("-" variant)
                *("-" extension)
                ["-" privateuse]

language      = 2*3ALPHA            ; shortest ISO 639 code
                ["-" extlang]       ; sometimes followed by
                                    ; extended language subtags
              / 4ALPHA              ; or reserved for future use
              / 5*8ALPHA            ; or registered language subtag

extlang       = 3ALPHA              ; selected ISO 639 codes
                *2("-" 3ALPHA)      ; permanently reserved

script        = 4ALPHA              ; ISO 15924 code

region        = 2ALPHA              ; ISO 3166-1 code
              / 3DIGIT              ; UN M.49 code

variant       = 5*8alphanum         ; registered variants
              / (DIGIT 3alphanum)

extension     = singleton 1*("-" (2*8alphanum))

                                    ; Single alphanumerics
                                    ; "x" reserved for private use
singleton     = DIGIT               ; 0 - 9
              / %x41-57             ; A - W
              / %x59-5A             ; Y - Z
              / %x61-77             ; a - w
              / %x79-7A             ; y - z

privateuse    = "x" 1*("-" (1*8alphanum))

grandfathered = irregular           ; non-redundant tags registered
              / regular             ; during the RFC 3066 era

irregular     = "en-GB-oed"         ; irregular tags do not match
              / "i-ami"             ; the 'langtag' production and
              / "i-bnn"             ; would not otherwise be
              / "i-default"         ; considered 'well-formed'
              / "i-enochian"        ; These tags are all valid,
              / "i-hak"             ; but most are deprecated
              / "i-klingon"         ; in favor of more modern
              / "i-lux"             ; subtags or subtag
              / "i-mingo"           ; combination
              / "i-navajo"
              / "i-pwn"
              / "i-tao"
              / "i-tay"
              / "i-tsu"
              / "sgn-BE-FR"
              / "sgn-BE-NL"
              / "sgn-CH-DE"

regular       = "art-lojban"        ; these tags match the 'langtag'
              / "cel-gaulish"       ; production, but their subtags
              / "no-bok"            ; are not extended language
              / "no-nyn"            ; or variant subtags: their meaning
              / "zh-guoyu"          ; is defined by their registration
              / "zh-hakka"          ; and all of these are deprecated
              / "zh-min"            ; in favor of a more modern
              / "zh-min-nan"        ; subtag or sequence of subtags
              / "zh-xiang"

alphanum      = (ALPHA / DIGIT)     ; letters and numbers

All subtags have a maximum length of eight characters. Whitespace is not permitted in a language tag. There is a subtlety in the ABNF production 'variant': a variant starting with a digit has a minimum length of four characters, while those starting with a letter have a minimum length of five characters.
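As a rough illustration of how a subtag's type follows from its length, content, and position, the sketch below labels the subtags of a simple tag using only the shape rules from the ABNF. `SubtagShape` and `classify/1` are hypothetical names, not part of Localize, and the sketch deliberately ignores extlang, extensions, private use, and grandfathered tags.

```elixir
defmodule SubtagShape do
  # Hypothetical helper, not part of Localize: labels each subtag of a simple
  # `langtag` by shape and position. Extlang, extensions, private use and
  # grandfathered tags are out of scope for this sketch.
  def classify(tag) when is_binary(tag) do
    [language | rest] = String.split(tag, "-")

    if language =~ ~r/^[A-Za-z]{2,3}$/ do
      {:ok, [{:language, language} | walk(rest, :script)]}
    else
      {:error, :unsupported_language_subtag}
    end
  end

  # `stage` is the earliest subtag type still allowed at this position.
  defp walk([], _stage), do: []

  defp walk([subtag | rest], stage) do
    cond do
      # script = 4ALPHA, only directly after the language
      stage == :script and subtag =~ ~r/^[A-Za-z]{4}$/ ->
        [{:script, subtag} | walk(rest, :region)]

      # region = 2ALPHA / 3DIGIT, only before any variant
      stage in [:script, :region] and subtag =~ ~r/^([A-Za-z]{2}|[0-9]{3})$/ ->
        [{:region, subtag} | walk(rest, :variant)]

      # variant = 5*8alphanum / (DIGIT 3alphanum)
      subtag =~ ~r/^([A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3})$/ ->
        [{:variant, subtag} | walk(rest, :variant)]

      true ->
        [{:unknown, subtag} | walk(rest, stage)]
    end
  end
end
```

For instance, `SubtagShape.classify("de-1996")` labels `1996` as a variant, because a four-character subtag that starts with a digit can only match the `variant` production.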

Unicode BCP 47 Extension type "u" - Locale

| Extension | Description | Examples |
|-----------|-------------|----------|
| ca | Calendar type | buddhist, chinese, gregory |
| cf | Currency format style | standard, account |
| co | Collation type | standard, search, phonetic, pinyin |
| cu | Currency type | ISO4217 code like "USD", "EUR" |
| fw | First day of the week identifier | sun, mon, tue, wed, ... |
| hc | Hour cycle identifier | h12, h23, h11, h24 |
| lb | Line break style identifier | strict, normal, loose |
| lw | Word break identifier | normal, breakall, keepall, phrase |
| ms | Measurement system identifier | metric, ussystem, uksystem |
| mu | Measurement unit override | celsius, fahrenhe, kelvin (overrides the ms key) |
| nu | Number system identifier | arabext, armnlow, roman, tamldec |
| rg | Region override | The value is a unicode_region_subtag for a regular region (not a macroregion), suffixed by "ZZZZ" |
| sd | Subdivision identifier | A unicode_subdivision_id, which is a unicode_region_subtag concatenated with a unicode_subdivision_suffix |
| ss | Break suppressions identifier | none, standard |
| tz | Timezone identifier | Short identifiers defined in terms of a TZ time zone database |
| va | Common variant type | POSIX style locale variant |

Unicode BCP 47 Extension type "t" - Transforms

| Extension | Description |
|-----------|-------------|
| m0 | Transform extension mechanism: to reference an authority or rules for a type of transformation |
| s0 | Transform source: for non-languages/scripts, such as fullwidth-halfwidth conversion |
| d0 | Transform destination: for non-languages/scripts, such as fullwidth-halfwidth conversion |
| i0 | Input Method Engine transform |
| k0 | Keyboard transform |
| t0 | Machine Translation: used to indicate content that has been machine translated |
| h0 | Hybrid Locale Identifiers: h0 with the value 'hybrid' indicates that the -t- value is a language that is mixed into the main language tag to form a hybrid |
| x0 | Private use transform |

Extensions are formatted by specifying keyword pairs after an extension separator. For example, "de-DE-u-co-phonebk" specifies German as spoken in Germany with the "phonebk" (phonebook) collation. Another example, "en-latn-AU-u-cf-account", represents English as spoken in Australia, written in the Latin script ("latn" in this position is the script subtag, not a number system), with currencies formatted in the "account" (accounting) style.
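To make the keyword-pair layout concrete, here is a minimal sketch that pulls the "-u-" keywords out of a tag string into a map. `UExtensionSketch` and `keywords/1` are hypothetical names and this is not how Localize parses extensions; it assumes two-character keys, skips attributes, and treats a key without a value as "true".

```elixir
defmodule UExtensionSketch do
  # Hypothetical helper, not part of Localize: returns the "-u-" keywords of a
  # tag as a map of key => value. Attributes and other extensions are ignored.
  def keywords(tag) when is_binary(tag) do
    tag
    |> String.downcase()
    |> String.split("-")
    # Skip everything up to and including the "u" singleton.
    |> Enum.drop_while(&(&1 != "u"))
    |> Enum.drop(1)
    # Stop at the next singleton (for example "t" or "x").
    |> Enum.take_while(&(String.length(&1) > 1))
    |> Enum.reduce({%{}, nil}, &step/2)
    |> elem(0)
    |> Map.new(fn {key, parts} -> {key, join(parts)} end)
  end

  # Two-character subtags start a new keyword; longer subtags extend the
  # value of the current keyword.
  defp step(subtag, {acc, key}) do
    cond do
      String.length(subtag) == 2 -> {Map.put(acc, subtag, []), subtag}
      key == nil -> {acc, nil}
      true -> {Map.update!(acc, key, &(&1 ++ [subtag])), key}
    end
  end

  # A key with no value subtags is shorthand for the value "true".
  defp join([]), do: "true"
  defp join(parts), do: Enum.join(parts, "-")
end
```

With this sketch, `UExtensionSketch.keywords("de-DE-u-co-phonebk")` yields `%{"co" => "phonebk"}`.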

Summary

Functions

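add_likely_subtags(language_tag)

Add likely subtags to a language tag.

add_likely_subtags!(language_tag)

Add likely subtags to a language tag, raising on error.

best_match(desired, supported, distance \\ 80)

Find the best matching supported locale for a desired locale.

canonicalize(language_tag)

Canonicalize a parsed language tag.

canonicalize!(language_tag)

Canonicalize a parsed language tag, raising on error.

match_distance(desired, supported)

Compute the match distance between two locale tags.

new(locale_id)

Create a fully resolved language tag from a locale string.

new!(locale_id)

Create a fully resolved language tag, raising on error.

parse(locale_id)

Parse a locale identifier into a %Localize.LanguageTag{} struct.

parse!(locale_string)

Parse a locale identifier into a %Localize.LanguageTag{} struct, raising on error.

remove_likely_subtags(language_tag)

Remove likely subtags from a language tag.

remove_likely_subtags!(language_tag)

Remove likely subtags from a language tag, raising on error.

to_string(language_tag)

Produce the canonical locale identifier string from a Localize.LanguageTag struct.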

Types

t()

@type t() :: %Localize.LanguageTag{
  canonical_locale_id: String.t() | nil,
  cldr_locale_id: Localize.Locale.locale_id(),
  extensions: map(),
  language: Localize.Locale.language(),
  language_subtags: [String.t()],
  language_variants: [String.t()],
  locale: Localize.LanguageTag.U.t() | %{},
  private_use: [String.t()],
  requested_locale_id: String.t(),
  script: Localize.Locale.script(),
  territory: Localize.Locale.territory(),
  transform: Localize.LanguageTag.T.t() | %{}
}

Functions

add_likely_subtags(language_tag)

@spec add_likely_subtags(t()) :: {:ok, t()} | {:error, Exception.t()}

Add likely subtags to a language tag.

Implements the Add Likely Subtags algorithm from Unicode TR35. This fills in missing script and region subtags with the most likely values from the CLDR likely subtags data.

Arguments

  • language_tag is a %Localize.LanguageTag{} struct.

Returns

  • {:ok, maximized_tag} with all subtags filled in and canonical_locale_id updated, or

  • {:error, reason} if no likely subtags data is found.

Examples

iex> {:ok, tag} = Localize.LanguageTag.parse("en")
iex> {:ok, max} = Localize.LanguageTag.add_likely_subtags(tag)
iex> max.canonical_locale_id
"en-Latn-US"

iex> {:ok, tag} = Localize.LanguageTag.parse("zh-TW")
iex> {:ok, max} = Localize.LanguageTag.add_likely_subtags(tag)
iex> max.canonical_locale_id
"zh-Hant-TW"

add_likely_subtags!(language_tag)

@spec add_likely_subtags!(t()) :: t() | no_return()

Add likely subtags to a language tag, raising on error.

Same as add_likely_subtags/1 but returns the struct directly or raises an exception.

best_match(desired, supported, distance \\ 80)

@spec best_match(
  t() | String.t() | atom(),
  [t() | String.t() | atom()],
  non_neg_integer()
) ::
  {:ok, t() | String.t() | atom(), non_neg_integer()} | {:error, String.t()}

Find the best matching supported locale for a desired locale.

Implements the CLDR Language Matching algorithm. The desired locale is compared against each supported locale and the closest match (lowest distance score) is returned.

Arguments

  • desired is a %Localize.LanguageTag{} struct, a BCP 47 locale string, or an atom.

  • supported is a list of %Localize.LanguageTag{} structs, BCP 47 locale strings, or atoms.

  • distance is the maximum acceptable distance score. Matches with a score above this threshold are rejected. The default is 80.

Returns

  • {:ok, matched_locale, score} where matched_locale is the best supported match and score is the numeric distance.

  • {:error, reason} if no match is found within the threshold.

Fallback behaviour

When using the default threshold (80), the CLDR algorithm always returns a result when the supported list is non-empty — even if the best match is very distant. This matches the CLDR specification, which says the algorithm should always select a locale rather than fail. The first supported locale is returned as a last resort.

When an explicit threshold below the default is provided, no fallback occurs. If nothing matches within the threshold, an error is returned. This is useful for strict validation (e.g. resolving configuration values) where a distant match would be surprising.

Examples

iex> {:ok, match, _score} = Localize.LanguageTag.best_match("en-AU", ["en", "en-GB", "fr"])
iex> match
"en-GB"

iex> # Strict matching: threshold 0 rejects non-exact matches
iex> {:error, _} = Localize.LanguageTag.best_match("xyzzy", ["en", "fr"], 0)
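The threshold and fallback rules described under "Fallback behaviour" can be modelled with a toy selector. This is only a sketch of those rules, not the CLDR Language Matching algorithm and not the Localize implementation: `MatchSketch`, `pick/4`, and the caller-supplied `distance_fun` are hypothetical, and the score reported for a last-resort fallback is an assumption.

```elixir
defmodule MatchSketch do
  @default_threshold 80

  # Hypothetical model of the documented rules. `distance_fun` stands in for
  # a real distance computation and must return a non-negative number.
  def pick(desired, supported, distance_fun, threshold \\ @default_threshold)

  def pick(_desired, [], _distance_fun, _threshold) do
    {:error, "no supported locales"}
  end

  def pick(desired, supported, distance_fun, threshold) do
    # Score every supported locale and keep the closest one.
    {best, score} =
      supported
      |> Enum.map(&{&1, distance_fun.(desired, &1)})
      |> Enum.min_by(fn {_locale, score} -> score end)

    cond do
      score <= threshold ->
        {:ok, best, score}

      # Default threshold: fall back to the first supported locale.
      # (Assumption: the fallback reports that locale's own distance.)
      threshold == @default_threshold ->
        first = hd(supported)
        {:ok, first, distance_fun.(desired, first)}

      # Explicit stricter threshold: no fallback.
      true ->
        {:error, "no match within distance #{threshold}"}
    end
  end
end
```

Passing a stricter threshold is what turns a distant match into an error, which is the behaviour the strict-validation use case relies on.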

canonicalize(language_tag)

@spec canonicalize(t()) :: {:ok, t()} | {:error, term()}

Canonicalize a parsed language tag.

Takes a %Localize.LanguageTag{} struct (typically returned by parse/1) and applies canonical syntax rules:

  • Sorts variants alphabetically.

  • Canonicalizes the -u- and -t- extension keys.

  • Sorts extensions by their singleton letter.

  • Computes and stores the canonical locale name string.

Arguments

  • language_tag is a %Localize.LanguageTag{} struct.

Returns

  • {:ok, canonicalized_tag} with the canonical_locale_id field populated, or

  • {:error, reason} if extension validation fails.
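The effect of canonicalizing -u- extension keys can be sketched as sorting keywords by key before rendering them. `KeywordOrderSketch` and `render/2` are hypothetical names used for illustration; the real canonicalization also validates keys and values, which this sketch does not attempt.

```elixir
defmodule KeywordOrderSketch do
  # Hypothetical helper, not part of Localize: appends "-u-" keywords to a
  # base tag in canonical (sorted by key) order.
  def render(base, keywords) when map_size(keywords) == 0, do: base

  def render(base, keywords) when is_map(keywords) do
    fields =
      keywords
      |> Enum.sort_by(fn {key, _value} -> key end)
      |> Enum.flat_map(fn
        # The value "true" is implied by the key alone, so it is dropped.
        {key, "true"} -> [key]
        {key, value} -> [key, value]
      end)

    Enum.join([base, "u" | fields], "-")
  end
end
```

Under these assumptions, rendering the keywords `nu` => `arab` and `ca` => `gregory` on the base `en-US` places `ca` before `nu`, whatever order they were supplied in.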

Examples

iex> {:ok, tag} = Localize.LanguageTag.parse("en-US-u-nu-arab-ca-gregory")
iex> {:ok, canonical} = Localize.LanguageTag.canonicalize(tag)
iex> canonical.canonical_locale_id
"en-US-u-ca-gregory-nu-arab"

canonicalize!(language_tag)

@spec canonicalize!(t()) :: t() | no_return()

Canonicalize a parsed language tag, raising on error.

Same as canonicalize/1 but returns the struct directly or raises an exception.

match_distance(desired, supported)

@spec match_distance(t() | String.t(), t() | String.t()) ::
  number() | {:error, String.t()}

Compute the match distance between two locale tags.

Arguments

  • desired is a %Localize.LanguageTag{} struct or locale string.

  • supported is a %Localize.LanguageTag{} struct or locale string.

Returns

  • A non-negative integer distance score. 0 is a perfect match. Scores below 10 indicate a good fit. Scores above 50 indicate a poor fit.

Examples

iex> Localize.LanguageTag.match_distance("en", "en")
0

iex> Localize.LanguageTag.match_distance("en-AU", "en-GB")
3

new(locale_id)

@spec new(String.t()) :: {:ok, t()} | {:error, term()}

Create a fully resolved language tag from a locale string.

Parses the input, canonicalizes extensions, adds likely subtags to populate missing fields, then computes the minimized canonical locale name using the Remove Likely Subtags algorithm. The resulting struct has all fields populated, but its canonical_locale_id is the shortest unambiguous form.

Arguments

  • locale_id is any BCP 47 locale string.

Returns

  • {:ok, language_tag} with all fields resolved, or

  • {:error, reason} if parsing, canonicalization, or likely subtag resolution fails.

Examples

iex> {:ok, tag} = Localize.LanguageTag.new("zh-TW")
iex> tag.language
:zh
iex> tag.script
:Hant
iex> tag.territory
:TW
iex> tag.canonical_locale_id
"zh-Hant"

new!(locale_id)

@spec new!(String.t()) :: t() | no_return()

Create a fully resolved language tag, raising on error.

Same as new/1 but returns the struct directly or raises an exception.

parse(locale_id)

@spec parse(String.t()) :: {:ok, t()} | {:error, Exception.t()}

Parse a locale identifier into a %Localize.LanguageTag{} struct.

Arguments

  • locale_id is any BCP 47 string.

Returns

  • {:ok, language_tag} where language_tag is a %Localize.LanguageTag{} struct, or

  • {:error, reason} if the string cannot be parsed.

parse!(locale_string)

@spec parse!(String.t()) :: t() | none()

Parse a locale identifier into a %Localize.LanguageTag{} struct, raising on error.

Arguments

  • locale_string is any BCP 47 string.

Returns

  • A %Localize.LanguageTag{} struct, or

  • raises an exception.

remove_likely_subtags(language_tag)

@spec remove_likely_subtags(t()) :: {:ok, t()} | {:error, Exception.t()}

Remove likely subtags from a language tag.

Implements the Remove Likely Subtags algorithm from Unicode TR35, using the "favor script" variant.

This removes script and/or region subtags that can be inferred from the remaining subtags using the likely subtags data, producing the shortest unambiguous locale identifier.

Arguments

  • language_tag is a %Localize.LanguageTag{} struct.

Returns

  • {:ok, minimized_tag} with redundant subtags removed and canonical_locale_id updated, or

  • {:error, reason} if maximization fails.

Examples

iex> {:ok, tag} = Localize.LanguageTag.parse("en-Latn-US")
iex> {:ok, min} = Localize.LanguageTag.remove_likely_subtags(tag)
iex> min.canonical_locale_id
"en"

iex> {:ok, tag} = Localize.LanguageTag.parse("zh-Hant-TW")
iex> {:ok, min} = Localize.LanguageTag.remove_likely_subtags(tag)
iex> min.canonical_locale_id
"zh-Hant"

remove_likely_subtags!(language_tag)

@spec remove_likely_subtags!(t()) :: t() | no_return()

Remove likely subtags from a language tag, raising on error.

Same as remove_likely_subtags/1 but returns the struct directly or raises an exception.

to_string(language_tag)

@spec to_string(t()) :: String.t()

Produce the canonical locale identifier string from a Localize.LanguageTag struct.

The canonical form follows the Unicode CLDR specification at TR35 Locale ID Canonicalization:

  • Language subtag is lowercase.

  • Script subtag is title case.

  • Region subtag is uppercase.

  • All other subtags are lowercase.

  • Variants are sorted alphabetically.

  • Extensions are sorted alphabetically by their singleton.

  • Within extensions, attributes are sorted alphabetically and fields are sorted by key.

  • The keyword value "true" is removed from the canonical form.

If the canonical_locale_id has already been computed, it is returned directly.
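The casing and variant-ordering rules above can be illustrated with a small stand-alone sketch that works on already-separated subtags. `CasingSketch` and `format/4` are hypothetical names, not part of Localize, and extensions are left out entirely.

```elixir
defmodule CasingSketch do
  # Hypothetical helper, not part of Localize: joins subtags using the
  # canonical casing rules. `script` and `region` may be nil.
  def format(language, script, region, variants \\ []) do
    sorted_variants =
      variants
      |> Enum.map(&String.downcase/1)
      |> Enum.sort()

    head = [
      # Language is lowercase.
      String.downcase(language),
      # Script is title case: first letter upper, the rest lower.
      script && String.capitalize(script),
      # Region is uppercase.
      region && String.upcase(region)
    ]

    (head ++ sorted_variants)
    |> Enum.reject(&is_nil/1)
    |> Enum.join("-")
  end
end
```

For example, `CasingSketch.format("EN", "latn", "us")` produces `"en-Latn-US"`.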

Arguments

  • language_tag is a %Localize.LanguageTag{} struct.

Returns

  • A canonical string representation of the language tag.

Examples

iex> {:ok, tag} = Localize.LanguageTag.parse("en-US")
iex> Localize.LanguageTag.to_string(tag)
"en-US"