Text.Language.Classifier.Fasttext.Locale (Text v0.5.0)


Resolves a language detection into a CLDR-canonical locale string.

fastText's lid.176 reports a bare language code ("en", "zh", "sr"). To consume that result, the wider Elixir localisation ecosystem generally needs three pieces of information: the language, the script (Hans vs Hant, Latn vs Cyrl, ...), and the territory. This module assembles all three.

Inputs

  • The detection itself, which already carries the language and the text-derived script.

  • Optional :region and :script overrides — typically wired to an Accept-Language header, an IP geolocation, or a user preference.

Resolution algorithm

  1. Form a candidate BCP-47 tag from {language, script_override OR detection.script, region_override} — omitting any unspecified piece.

  2. If localize is available (the optional dep is loaded), call Localize.validate_locale/1 to run CLDR's likely-subtags algorithm. This fills in the remaining pieces and produces a canonical locale id like "zh-Hans-CN".

  3. If localize is not available, fall back to a hand-rolled map of the most common language defaults (e.g. "en" → "en-US", "zh" → "zh-Hans-CN", "pt" → "pt-BR"). The fallback set is deliberately conservative: it covers the languages most users will hit but does not pretend to span all 176 fastText labels.
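Steps 1 and 3 can be sketched in a few lines. This is a minimal illustration and not the module's actual source: LocaleSketch, candidate/3, and fallback/1 are hypothetical names, and the table holds only the three defaults quoted above.

```elixir
defmodule LocaleSketch do
  # Hypothetical fallback table; only these three entries come from the docs.
  @fallback %{"en" => "en-US", "zh" => "zh-Hans-CN", "pt" => "pt-BR"}

  # Step 1: build a candidate tag from language, script, and region.
  # A :script override wins over the detected script; nil pieces are omitted.
  def candidate(language, detected_script, options \\ []) do
    script = Keyword.get(options, :script) || detected_script
    region = Keyword.get(options, :region)

    [language, script, region]
    |> Enum.reject(&is_nil/1)
    |> Enum.map_join("-", &to_string/1)
  end

  # Step 3: without CLDR data, look up the bare language code and
  # return it unchanged when the table has no entry for it.
  def fallback(language), do: Map.get(@fallback, language, language)
end
```

Under these assumptions, LocaleSketch.candidate("zh", :Hans, region: :TW) yields "zh-Hans-TW", and LocaleSketch.fallback("pt") yields "pt-BR".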

Hans vs Hant

When the detected language is "zh" and the script signal is :Hani (the generic Han atom from ScriptDetector), the text alone does not settle whether the content is Simplified or Traditional, so this module uses the region and script preferences on the language tag to pick Hans or Hant. With localize present the choice flows through CLDR likely-subtags; without it, the default for bare "zh" is Hans-CN.
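One way to picture the choice is a lookup keyed on the region. The sketch below is an assumption about the shape of the logic, not the module's code: HanScriptSketch and pick/2 are hypothetical, and the region list reflects the territories whose CLDR likely-subtags entry for "zh" resolves to Hant.

```elixir
defmodule HanScriptSketch do
  # Territories where CLDR likely-subtags maps "zh" to Traditional (Hant).
  @traditional_regions ~w(TW HK MO)

  # A concrete script (from the detector or an override) is kept as-is.
  def pick(script, _region) when script in [:Hans, :Hant], do: script

  # Generic :Hani: decide from the region, defaulting to Hans,
  # which matches the documented default of Hans-CN for bare "zh".
  def pick(:Hani, region) do
    if to_string(region) in @traditional_regions, do: :Hant, else: :Hans
  end
end
```

With this sketch, a :Hani detection paired with region "TW" resolves to :Hant, while the same detection with no region resolves to :Hans.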

Functions

resolve(detection, options \\ [])

@spec resolve(
  Text.Language.Classifier.Fasttext.Detection.t(),
  keyword()
) :: {:ok, String.t()} | {:error, term()}

Resolves a Detection into a canonical CLDR locale string.

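Arguments

  • detection is a Text.Language.Classifier.Fasttext.Detection.t() struct.

  • options is a keyword list of options.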

Options

  • :region — overrides the territory inferred by likely-subtags. Useful when the caller has stronger evidence (Accept-Language, geolocation, user preference). An ISO 3166-1 alpha-2 code as either a binary or atom.

  • :script — overrides the script inferred from the text. Useful when the caller knows better than codepoint-frequency analysis (e.g. a publisher tagging Traditional Chinese content explicitly).

  • :fallback — controls behaviour when :localize is not available or the language is not in the fallback map. Either :language_only (return just the language code) or :tag_with_script (include the script subtag if known). Defaults to :language_only to match the behaviour of fastText's own outputs.
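The two :fallback modes differ only in whether a known script is appended to the language code. A minimal sketch of that difference, assuming a hypothetical helper (FallbackModeSketch.apply_mode/3 is not part of the library):

```elixir
defmodule FallbackModeSketch do
  # :language_only returns the bare code, as fastText itself would.
  def apply_mode(language, _script, :language_only), do: language

  # :tag_with_script has nothing to append when the script is unknown.
  def apply_mode(language, nil, :tag_with_script), do: language

  # :tag_with_script appends the script subtag when one is known.
  def apply_mode(language, script, :tag_with_script), do: "#{language}-#{script}"
end
```

For a Serbian detection with script :Cyrl, this sketch returns "sr" under :language_only and "sr-Cyrl" under :tag_with_script.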

Returns

  • {:ok, locale_string} — for example "en-Latn-US" or "zh-Hans-CN" when :localize is available, "en-US" or "zh-Hans-CN" from the fallback table, or just "en" if the language is unknown to the fallback.

  • {:error, reason} — when :localize is loaded and rejects the candidate tag.
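Because the error tuple only arises when localize rejects a tag, callers may want a default locale instead of propagating the failure. The helper below is a hypothetical wrapper around the documented return shapes, offered as one possible pattern:

```elixir
defmodule ResolveResultSketch do
  # Hypothetical helper: unwrap a successful result.
  def locale_or_default({:ok, locale}, _default) when is_binary(locale), do: locale

  # On rejection, use the caller-supplied default locale instead.
  def locale_or_default({:error, _reason}, default), do: default
end
```

A call site could then pipe the result of resolve/2 into ResolveResultSketch.locale_or_default("en") to always end up with a string.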

Examples

iex> alias Text.Language.Classifier.Fasttext.Detection
iex> det = %Detection{language: "en", confidence: 0.9, script: :Latn,
...>                  alternatives: [], text: "hello"}
iex> {:ok, locale} = Text.Language.Classifier.Fasttext.Locale.resolve(det)
iex> locale =~ "en"
true

iex> alias Text.Language.Classifier.Fasttext.Detection
iex> det = %Detection{language: "zh", confidence: 0.95, script: :Hani,
...>                  alternatives: [], text: "你好世界"}
iex> {:ok, locale} = Text.Language.Classifier.Fasttext.Locale.resolve(det)
iex> String.starts_with?(locale, "zh")
true