# `Localize.LanguageTag`
[🔗](https://github.com/elixir-localize/localize/blob/v0.6.0/lib/localize/language_tag.ex#L1)

Represents a language tag as defined in [RFC 5646](https://tools.ietf.org/html/rfc5646),
with the "u" and "t" extensions as defined in [BCP 47](https://tools.ietf.org/html/bcp47).

Language tags are used to help identify languages, whether spoken,
written, signed, or otherwise signaled, for the purpose of
communication.  This includes constructed and artificial languages
but excludes languages not intended primarily for human
communication, such as programming languages.

## Syntax

A language tag is composed from a sequence of one or more "subtags",
each of which refines or narrows the range of language identified by
the overall tag.  Subtags, in turn, are a sequence of alphanumeric
characters (letters and digits), distinguished and separated from
other subtags in a tag by a hyphen ("-", [Unicode] U+002D).

There are different types of subtag, each of which is distinguished
by length, position in the tag, and content: each subtag's type can
be recognized solely by these features.  This makes it possible to
extract and assign some semantic information to the subtags, even if
the specific subtag values are not recognized.  Thus, a language tag
processor need not have a list of valid tags or subtags (that is, a
copy of some version of the IANA Language Subtag Registry) in order
to perform common searching and matching operations.  The only
exceptions to this ability to infer meaning from subtag structure are
the grandfathered tags listed in the productions 'regular' and
'irregular' below.  These tags were registered under [RFC3066] and
are a fixed list that can never change.

The syntax of the language tag in ABNF is:

    Language-Tag  = langtag             ; normal language tags
                  / privateuse          ; private use tag
                  / grandfathered       ; grandfathered tags

    langtag       = language
                    ["-" script]
                    ["-" region]
                    *("-" variant)
                    *("-" extension)
                    ["-" privateuse]

    language      = 2*3ALPHA            ; shortest ISO 639 code
                    ["-" extlang]       ; sometimes followed by
                                        ; extended language subtags
                  / 4ALPHA              ; or reserved for future use
                  / 5*8ALPHA            ; or registered language subtag

    extlang       = 3ALPHA              ; selected ISO 639 codes
                    *2("-" 3ALPHA)      ; permanently reserved

    script        = 4ALPHA              ; ISO 15924 code

    region        = 2ALPHA              ; ISO 3166-1 code
                  / 3DIGIT              ; UN M.49 code

    variant       = 5*8alphanum         ; registered variants
                  / (DIGIT 3alphanum)

    extension     = singleton 1*("-" (2*8alphanum))

                                        ; Single alphanumerics
                                        ; "x" reserved for private use
    singleton     = DIGIT               ; 0 - 9
                  / %x41-57             ; A - W
                  / %x59-5A             ; Y - Z
                  / %x61-77             ; a - w
                  / %x79-7A             ; y - z

    privateuse    = "x" 1*("-" (1*8alphanum))

    grandfathered = irregular           ; non-redundant tags registered
                  / regular             ; during the RFC 3066 era

    irregular     = "en-GB-oed"         ; irregular tags do not match
                  / "i-ami"             ; the 'langtag' production and
                  / "i-bnn"             ; would not otherwise be
                  / "i-default"         ; considered 'well-formed'
                  / "i-enochian"        ; These tags are all valid,
                  / "i-hak"             ; but most are deprecated
                  / "i-klingon"         ; in favor of more modern
                  / "i-lux"             ; subtags or subtag
                  / "i-mingo"           ; combination
                  / "i-navajo"
                  / "i-pwn"
                  / "i-tao"
                  / "i-tay"
                  / "i-tsu"
                  / "sgn-BE-FR"
                  / "sgn-BE-NL"
                  / "sgn-CH-DE"

    regular       = "art-lojban"        ; these tags match the 'langtag'
                  / "cel-gaulish"       ; production, but their subtags
                  / "no-bok"            ; are not extended language
                  / "no-nyn"            ; or variant subtags: their meaning
                  / "zh-guoyu"          ; is defined by their registration
                  / "zh-hakka"          ; and all of these are deprecated
                  / "zh-min"            ; in favor of a more modern
                  / "zh-min-nan"        ; subtag or sequence of subtags
                  / "zh-xiang"

    alphanum      = (ALPHA / DIGIT)     ; letters and numbers

All subtags have a maximum length of eight characters.  Whitespace is
not permitted in a language tag.  There is a subtlety in the ABNF
production 'variant': a variant starting with a digit has a minimum
length of four characters, while those starting with a letter have a
minimum length of five characters.
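
As a rough illustration of how a subtag's type follows from its length, position, and content, the sketch below classifies the subtags of a simple tag. `SubtagSketch` is a hypothetical module and is not part of Localize. It covers only the language, script, region, and variant productions, and it deliberately ignores extlang, extensions, private use, and the grandfathered tags.

```elixir
defmodule SubtagSketch do
  # Hypothetical helper, not part of Localize. Covers only
  # language ["-" script] ["-" region] *("-" variant).

  def classify(tag) when is_binary(tag) do
    [language | rest] = String.split(tag, "-")

    if language =~ ~r/\A[A-Za-z]{2,3}\z/ do
      walk(rest, :language, [{:language, language}])
    else
      {:error, {:unrecognized, language}}
    end
  end

  defp walk([], _previous, acc), do: {:ok, Enum.reverse(acc)}

  defp walk([subtag | rest], previous, acc) do
    case type_of(subtag, previous) do
      nil -> {:error, {:unrecognized, subtag}}
      type -> walk(rest, type, [{type, subtag} | acc])
    end
  end

  # Position matters: a script may only follow the language, and a region
  # may only follow the language or the script.
  defp type_of(subtag, previous) do
    cond do
      previous == :language and subtag =~ ~r/\A[A-Za-z]{4}\z/ -> :script
      previous in [:language, :script] and subtag =~ ~r/\A([A-Za-z]{2}|[0-9]{3})\z/ -> :region
      subtag =~ ~r/\A([A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3})\z/ -> :variant
      true -> nil
    end
  end
end

IO.inspect(SubtagSketch.classify("zh-Hant-TW"))
IO.inspect(SubtagSketch.classify("de-1996"))
```

Note that `"1996"` is classified as a variant without any registry lookup: it is four characters long and starts with a digit, which is the subtlety in the 'variant' production described above.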

## Unicode BCP 47 Extension type "u" - Locale

Extension | Description                      | Examples
--------- | -------------------------------- | ---------
ca        | Calendar type                    | buddhist, chinese, gregory
cf        | Currency format style            | standard, account
co        | Collation type                   | standard, search, phonetic, pinyin
cu        | Currency type                    | ISO 4217 code such as "USD", "EUR"
fw        | First day of the week identifier | sun, mon, tue, wed, ...
hc        | Hour cycle identifier            | h12, h23, h11, h24
lb        | Line break style identifier      | strict, normal, loose
lw        | Word break identifier            | normal, breakall, keepall, phrase
ms        | Measurement system identifier    | metric, ussystem, uksystem
mu        | Measurement unit override        | celsius, fahrenhe, kelvin (overrides the `ms` key)
nu        | Number system identifier         | arabext, armnlow, roman, tamldec
rg        | Region override                  | A unicode_region_subtag for a regular region (not a macroregion), suffixed by "ZZZZ"
sd        | Subdivision identifier           | A unicode_subdivision_id, which is a unicode_region_subtag concatenated with a unicode_subdivision_suffix
ss        | Break suppressions identifier    | none, standard
tz        | Time zone identifier             | Short identifiers defined in terms of a TZ time zone database
va        | Common variant type              | POSIX style locale variant

## Unicode BCP 47 Extension type "t" - Transforms

Extension | Description
--------- | -----------------------------------------
m0        | Transform extension mechanism: references an authority or rules for a type of transformation
s0        | Transform source: for non-languages/scripts, such as fullwidth-halfwidth conversion
d0        | Transform destination: for non-languages/scripts, such as fullwidth-halfwidth conversion
i0        | Input Method Engine transform
k0        | Keyboard transform
t0        | Machine translation: indicates content that has been machine translated
h0        | Hybrid locale identifiers: `h0` with the value `hybrid` indicates that the `-t-` value is a language that is mixed into the main language tag to form a hybrid
x0        | Private use transform

Extensions are formatted by specifying keyword pairs after an extension
separator. The example `de-DE-u-co-phonebk` specifies German as spoken in
Germany with a collation of `phonebk`. Another example, `en-latn-AU-u-cf-account`,
represents English as spoken in Australia, written in the Latin script (`latn`),
with currencies formatted in the accounting style (`account`).
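
To make the keyword-pair structure concrete, here is a minimal sketch that pulls the `-u-` keywords out of a tag using only the standard library. `UExtensionSketch` is a hypothetical module, not the parser Localize uses. It assumes that two-character subtags are keys and longer subtags are values, skips any attributes that precede the first key, and does no validation against the tables above.

```elixir
defmodule UExtensionSketch do
  # Hypothetical helper, not part of Localize.
  def keywords(tag) when is_binary(tag) do
    tag
    |> String.downcase()
    |> String.split("-")
    |> Enum.drop_while(&(&1 != "u"))
    |> Enum.drop(1)
    # Stop at the next singleton (another extension or private use).
    |> Enum.take_while(&(byte_size(&1) > 1))
    |> Enum.reduce({nil, %{}}, fn
      key, {_current, acc} when byte_size(key) == 2 -> {key, Map.put(acc, key, [])}
      _attribute, {nil, acc} -> {nil, acc}
      value, {current, acc} -> {current, Map.update!(acc, current, &(&1 ++ [value]))}
    end)
    |> elem(1)
    |> Map.new(fn
      # A key without a value is treated as "true".
      {key, []} -> {key, "true"}
      {key, values} -> {key, Enum.join(values, "-")}
    end)
  end
end

IO.inspect(UExtensionSketch.keywords("de-DE-u-co-phonebk"))
IO.inspect(UExtensionSketch.keywords("en-latn-AU-u-cf-account"))
```

The `latn` subtag in the second tag appears before the `-u-` separator, so the sketch never sees it as a keyword value; it is a script subtag of the base tag.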

# `t`

```elixir
@type t() :: %Localize.LanguageTag{
  canonical_locale_id: String.t() | nil,
  cldr_locale_id: Localize.Locale.locale_id(),
  extensions: map(),
  language: Localize.Locale.language(),
  language_subtags: [String.t()],
  language_variants: [String.t()],
  locale: Localize.LanguageTag.U.t() | %{},
  private_use: [String.t()],
  requested_locale_id: String.t(),
  script: Localize.Locale.script(),
  territory: Localize.Locale.territory(),
  transform: Localize.LanguageTag.T.t() | %{}
}
```

# `add_likely_subtags`

```elixir
@spec add_likely_subtags(t()) :: {:ok, t()} | {:error, Exception.t()}
```

Add likely subtags to a language tag.

Implements the *Add Likely Subtags* algorithm from
[Unicode TR35](https://www.unicode.org/reports/tr35/tr35.html#Likely_Subtags).
This fills in missing script and region subtags with the most
likely values from the CLDR likely subtags data.
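
The idea can be sketched without the library by using a tiny hand-written table. Everything here is an assumption made for illustration: `LikelySubtagsSketch` is a hypothetical module, tags are modelled as `{language, script, region}` tuples rather than `%Localize.LanguageTag{}` structs, the table holds three entries instead of the full CLDR data, and the lookup order is one plausible reading rather than the normative order in TR35.

```elixir
defmodule LikelySubtagsSketch do
  # Three entries written out by hand for illustration only.
  # The real table is the CLDR likely subtags data.
  @likely %{
    {"en", nil, nil} => {"en", "Latn", "US"},
    {"zh", nil, nil} => {"zh", "Hans", "CN"},
    {"zh", nil, "TW"} => {"zh", "Hant", "TW"}
  }

  def maximize({language, script, region}) do
    # Try the most specific key first, then progressively less specific ones.
    # This order is an assumption of the sketch; consult TR35 for the real one.
    candidates = [
      {language, script, region},
      {language, nil, region},
      {language, script, nil},
      {language, nil, nil}
    ]

    case Enum.find_value(candidates, &Map.get(@likely, &1)) do
      nil ->
        {:error, :no_likely_subtags}

      {_language, likely_script, likely_region} ->
        # Subtags that were supplied win; only missing ones are filled in.
        {:ok, {language, script || likely_script, region || likely_region}}
    end
  end
end

IO.inspect(LikelySubtagsSketch.maximize({"zh", nil, "TW"}))
```

Only missing subtags are filled in, which is why `zh-TW` resolves to the Traditional script rather than the default for `zh` alone.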

### Arguments

* `language_tag` is a `%Localize.LanguageTag{}` struct.

### Returns

* `{:ok, maximized_tag}` with all subtags filled in and
  `canonical_locale_id` updated, or

* `{:error, reason}` if no likely subtags data is found.

### Examples

    iex> {:ok, tag} = Localize.LanguageTag.parse("en")
    iex> {:ok, max} = Localize.LanguageTag.add_likely_subtags(tag)
    iex> max.canonical_locale_id
    "en-Latn-US"

    iex> {:ok, tag} = Localize.LanguageTag.parse("zh-TW")
    iex> {:ok, max} = Localize.LanguageTag.add_likely_subtags(tag)
    iex> max.canonical_locale_id
    "zh-Hant-TW"

# `add_likely_subtags!`

```elixir
@spec add_likely_subtags!(t()) :: t() | no_return()
```

Add likely subtags to a language tag, raising on error.

Same as `add_likely_subtags/1` but returns the struct directly
or raises an exception.

# `best_match`

```elixir
@spec best_match(
  t() | String.t() | atom(),
  [t() | String.t() | atom()],
  non_neg_integer()
) ::
  {:ok, t() | String.t() | atom(), non_neg_integer()} | {:error, String.t()}
```

Find the best matching supported locale for a desired locale.

Implements the [CLDR Language Matching](https://www.unicode.org/reports/tr35/tr35.html#LanguageMatching)
algorithm. The desired locale is compared against each supported
locale and the closest match (lowest distance score) is returned.

### Arguments

* `desired` is a `%Localize.LanguageTag{}` struct or a BCP 47
  locale string.

* `supported` is a list of `%Localize.LanguageTag{}` structs
  or BCP 47 locale strings.

* `distance` is the maximum acceptable distance score. Matches
  with a score above this threshold are rejected.
  The default is 80.

### Returns

* `{:ok, matched_locale, score}` where `matched_locale` is
  the best supported match and `score` is the numeric distance.

* `{:error, reason}` if no match is found within the threshold.

### Fallback behaviour

When using the default threshold (80), the CLDR
algorithm always returns a result when the supported list is
non-empty, even if the best match is very distant. This
matches the CLDR specification, which says the algorithm should
always select a locale rather than fail. The first supported
locale is returned as a last resort.

When an explicit threshold below the default is provided, no
fallback occurs. If nothing matches within the threshold, an
error is returned. This is useful for strict validation (e.g.
resolving configuration values) where a distant match would be
surprising.
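
The selection rule described above can be sketched over precomputed distances. `BestMatchSketch` is a hypothetical module, and the distances passed to it are made up; in practice they would come from the matching algorithm. How a threshold above the default behaves is not stated here, so the sketch simply assumes it falls back like the default does.

```elixir
defmodule BestMatchSketch do
  @default_threshold 80

  # `scored` is a list of `{supported_locale, distance}` pairs in the
  # caller's order. The distances are illustrative inputs, not real scores.
  def pick(scored, threshold \\ @default_threshold)

  def pick([], _threshold), do: {:error, "no supported locales"}

  def pick([{first, first_distance} | _] = scored, threshold) do
    {locale, distance} = Enum.min_by(scored, fn {_locale, distance} -> distance end)

    cond do
      # Closest candidate is within the threshold: accept it.
      distance <= threshold -> {:ok, locale, distance}
      # Default threshold (assumed: or higher): fall back to the first entry.
      threshold >= @default_threshold -> {:ok, first, first_distance}
      # Explicit stricter threshold: no fallback.
      true -> {:error, "no match within distance #{threshold}"}
    end
  end
end

IO.inspect(BestMatchSketch.pick([{"en", 5}, {"en-GB", 3}, {"fr", 84}]))
IO.inspect(BestMatchSketch.pick([{"en", 100}, {"fr", 100}], 0))
```

The second call shows the strict mode: with a threshold of `0` nothing is close enough, so the sketch returns an error instead of falling back.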

### Examples

    iex> {:ok, match, _score} = Localize.LanguageTag.best_match("en-AU", ["en", "en-GB", "fr"])
    iex> match
    "en-GB"

    iex> # Strict matching: threshold 0 rejects non-exact matches
    iex> {:error, _} = Localize.LanguageTag.best_match("xyzzy", ["en", "fr"], 0)

# `canonicalize`

```elixir
@spec canonicalize(t()) :: {:ok, t()} | {:error, term()}
```

Canonicalize a parsed language tag.

Takes a `%Localize.LanguageTag{}` struct (typically returned
by `parse/1`) and applies canonical syntax rules:

* Sorts variants alphabetically.

* Canonicalizes the `-u-` and `-t-` extension keys.

* Sorts extensions by their singleton letter.

* Computes and stores the canonical locale name string.
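
The keyword ordering rule can be illustrated with plain string handling. `KeywordSortSketch` is a hypothetical module and not how Localize implements canonicalization; it assumes the `-u-` extension is the last part of the tag and contains only key/value keywords.

```elixir
defmodule KeywordSortSketch do
  # Hypothetical helper, not part of Localize.
  def sort_u_keywords(tag) when is_binary(tag) do
    case String.split(tag, "-u-", parts: 2) do
      [base, extension] ->
        sorted =
          extension
          |> String.split("-")
          |> group_keywords()
          |> Enum.sort_by(fn [key | _values] -> key end)
          |> Enum.map_join("-", &Enum.join(&1, "-"))

        base <> "-u-" <> sorted

      [_tag_without_u_extension] ->
        tag
    end
  end

  # A two-character subtag starts a new keyword; longer subtags are values
  # of the keyword currently being collected.
  defp group_keywords(subtags) do
    subtags
    |> Enum.reduce([], fn
      key, groups when byte_size(key) == 2 -> [[key] | groups]
      value, [current | groups] -> [current ++ [value] | groups]
      # Values before the first key (attributes) are dropped by this sketch.
      _leading_value, [] -> []
    end)
    |> Enum.reverse()
  end
end

IO.puts(KeywordSortSketch.sort_u_keywords("en-US-u-nu-arab-ca-gregory"))
# prints "en-US-u-ca-gregory-nu-arab"
```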

### Arguments

* `language_tag` is a `%Localize.LanguageTag{}` struct.

### Returns

* `{:ok, canonicalized_tag}` with the `canonical_locale_id`
  field populated, or

* `{:error, reason}` if extension validation fails.

### Examples

    iex> {:ok, tag} = Localize.LanguageTag.parse("en-US-u-nu-arab-ca-gregory")
    iex> {:ok, canonical} = Localize.LanguageTag.canonicalize(tag)
    iex> canonical.canonical_locale_id
    "en-US-u-ca-gregory-nu-arab"

# `canonicalize!`

```elixir
@spec canonicalize!(t()) :: t() | no_return()
```

Canonicalize a parsed language tag, raising on error.

Same as `canonicalize/1` but returns the struct directly
or raises an exception.

# `match_distance`

```elixir
@spec match_distance(t() | String.t(), t() | String.t()) ::
  number() | {:error, String.t()}
```

Compute the match distance between two locale tags.

### Arguments

* `desired` is a `%Localize.LanguageTag{}` struct or locale string.

* `supported` is a `%Localize.LanguageTag{}` struct or locale string.

### Returns

* A non-negative integer distance score. `0` is a perfect match.
  Scores below `10` indicate a good fit. Scores above `50`
  indicate a poor fit.

* `{:error, reason}`, as declared in the spec, if the distance
  cannot be computed.

### Examples

    iex> Localize.LanguageTag.match_distance("en", "en")
    0

    iex> Localize.LanguageTag.match_distance("en-AU", "en-GB")
    3
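
Because the return value is either a number or an error tuple, callers usually need to branch on it. The sketch below turns a result into a coarse label using the thresholds given under Returns. `DistanceSketch` and its labels are hypothetical; in particular, the `:moderate` label for scores from 10 to 50 is an assumption, since that range is not characterised above.

```elixir
defmodule DistanceSketch do
  # Hypothetical helper, not part of Localize.
  def fit({:error, _reason} = error), do: error
  def fit(0), do: :perfect
  def fit(score) when is_number(score) and score < 10, do: :good
  def fit(score) when is_number(score) and score > 50, do: :poor
  # Assumed label for the range the documentation leaves unnamed.
  def fit(score) when is_number(score), do: :moderate
end

IO.inspect(DistanceSketch.fit(3))
```

With the library loaded, the argument would be the result of `Localize.LanguageTag.match_distance/2`.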

# `new`

```elixir
@spec new(String.t()) :: {:ok, t()} | {:error, term()}
```

Create a fully resolved language tag from a locale string.

Parses the input, canonicalizes extensions, adds likely subtags
to populate missing fields, then computes the minimized
canonical locale name by applying the *Remove Likely Subtags*
algorithm (see `remove_likely_subtags/1`). The resulting
struct has all fields populated, but its `canonical_locale_id`
is the shortest unambiguous form.

### Arguments

* `locale_id` is any BCP 47 locale string.

### Returns

* `{:ok, language_tag}` with all fields resolved, or

* `{:error, reason}` if parsing, canonicalization, or likely
  subtag resolution fails.

### Examples

    iex> {:ok, tag} = Localize.LanguageTag.new("zh-TW")
    iex> tag.language
    :zh
    iex> tag.script
    :Hant
    iex> tag.territory
    :TW
    iex> tag.canonical_locale_id
    "zh-Hant"

# `new!`

```elixir
@spec new!(String.t()) :: t() | no_return()
```

Create a fully resolved language tag, raising on error.

Same as `new/1` but returns the struct directly or raises
an exception.

# `parse`

```elixir
@spec parse(String.t()) :: {:ok, t()} | {:error, Exception.t()}
```

Parse a locale identifier into a `%Localize.LanguageTag{}` struct.

### Arguments

* `locale_id` is any [BCP 47](https://tools.ietf.org/search/bcp47)
  string.

### Returns

* `{:ok, language_tag}` or

* `{:error, reason}`

# `parse!`

```elixir
@spec parse!(String.t()) :: t() | none()
```

Parse a locale identifier into a `%Localize.LanguageTag{}` struct, raising on error.

### Arguments

* `locale_id` is any [BCP 47](https://tools.ietf.org/search/bcp47)
  string.

### Returns

* A `%Localize.LanguageTag{}` struct, or

* raises an exception.

# `remove_likely_subtags`

```elixir
@spec remove_likely_subtags(t()) :: {:ok, t()} | {:error, Exception.t()}
```

Remove likely subtags from a language tag.

Implements the *Remove Likely Subtags* algorithm from
[Unicode TR35](https://www.unicode.org/reports/tr35/tr35.html#Likely_Subtags),
using the "favor script" variant.

This removes script and/or region subtags that can be inferred
from the remaining subtags using the likely subtags data, producing
the shortest unambiguous locale identifier.
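
A minimal sketch of the "favor script" trial order follows. It is not the library's implementation: `MinimizeSketch` is a hypothetical module, tags are modelled as `{language, script, region}` tuples, the input is assumed to be already maximized, and the Add Likely Subtags step is stubbed with a four-entry lookup table written out by hand.

```elixir
defmodule MinimizeSketch do
  # Stub for Add Likely Subtags: maps a partial tag to its maximized form.
  # Hand-written for illustration; the real data comes from CLDR.
  @maximized %{
    {"en", nil, nil} => {"en", "Latn", "US"},
    {"zh", nil, nil} => {"zh", "Hans", "CN"},
    {"zh", "Hant", nil} => {"zh", "Hant", "TW"},
    {"zh", nil, "TW"} => {"zh", "Hant", "TW"}
  }

  # `max` is assumed to be an already maximized tag.
  def minimize({language, script, region} = max) do
    trials = [
      {language, nil, nil},
      # "Favor script": try language-script before language-region.
      {language, script, nil},
      {language, nil, region}
    ]

    # The first trial that maximizes back to `max` is the answer;
    # if none does, the maximized tag is returned unchanged.
    Enum.find(trials, max, fn trial -> Map.get(@maximized, trial) == max end)
  end
end

IO.inspect(MinimizeSketch.minimize({"zh", "Hant", "TW"}))
```

For `zh-Hant-TW`, the language-only trial maximizes to a different tag, so the sketch settles on the language-script trial, which corresponds to `zh-Hant`.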

### Arguments

* `language_tag` is a `%Localize.LanguageTag{}` struct.

### Returns

* `{:ok, minimized_tag}` with redundant subtags removed and
  `canonical_locale_id` updated, or

* `{:error, reason}` if maximization fails.

### Examples

    iex> {:ok, tag} = Localize.LanguageTag.parse("en-Latn-US")
    iex> {:ok, min} = Localize.LanguageTag.remove_likely_subtags(tag)
    iex> min.canonical_locale_id
    "en"

    iex> {:ok, tag} = Localize.LanguageTag.parse("zh-Hant-TW")
    iex> {:ok, min} = Localize.LanguageTag.remove_likely_subtags(tag)
    iex> min.canonical_locale_id
    "zh-Hant"

# `remove_likely_subtags!`

```elixir
@spec remove_likely_subtags!(t()) :: t() | no_return()
```

Remove likely subtags from a language tag, raising on error.

Same as `remove_likely_subtags/1` but returns the struct directly
or raises an exception.

# `to_string`

```elixir
@spec to_string(t()) :: String.t()
```

Produce the canonical locale identifier string from a
`Localize.LanguageTag` struct.

The canonical form follows the Unicode CLDR specification
at [TR35 Locale ID Canonicalization](https://www.unicode.org/reports/tr35/tr35.html#LocaleId_Canonicalization):

* Language subtag is lowercase.

* Script subtag is title case.

* Region subtag is uppercase.

* All other subtags are lowercase.

* Variants are sorted alphabetically.

* Extensions are sorted alphabetically by their singleton.

* Within extensions, attributes are sorted alphabetically
  and fields are sorted by key.

* The keyword value `"true"` is removed from the canonical form.

If the `canonical_locale_id` has already been computed, it
is returned directly.
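
The casing rules in the list above can be sketched with the standard library alone. `CasingSketch` is a hypothetical module that applies only the casing rules; it does no sorting and no validation, and it assumes that everything after the first singleton subtag belongs to an extension or private use section and is therefore lowercased.

```elixir
defmodule CasingSketch do
  # Hypothetical helper, not part of Localize. Casing rules only.
  def normalize_case(tag) when is_binary(tag) do
    [language | rest] = String.split(tag, "-")

    # Subtags before the first singleton are script/region/variant subtags;
    # the singleton and everything after it are lowercased.
    {main, tail} = Enum.split_while(rest, &(byte_size(&1) > 1))

    subtags =
      [String.downcase(language)] ++
        Enum.map(main, &case_subtag/1) ++
        Enum.map(tail, &String.downcase/1)

    Enum.join(subtags, "-")
  end

  defp case_subtag(subtag) do
    cond do
      # Four letters: script, title case.
      subtag =~ ~r/\A[A-Za-z]{4}\z/ -> String.capitalize(subtag)
      # Two letters or three digits: region, uppercase.
      subtag =~ ~r/\A([A-Za-z]{2}|[0-9]{3})\z/ -> String.upcase(subtag)
      # Anything else (for example variants): lowercase.
      true -> String.downcase(subtag)
    end
  end
end

IO.puts(CasingSketch.normalize_case("ZH-hant-tw"))
# prints "zh-Hant-TW"
```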

### Arguments

* `language_tag` is a `%Localize.LanguageTag{}` struct.

### Returns

* A canonical string representation of the language tag.

### Examples

    iex> {:ok, tag} = Localize.LanguageTag.parse("en-US")
    iex> Localize.LanguageTag.to_string(tag)
    "en-US"

---

*Consult [api-reference.md](api-reference.md) for complete listing*