Identifies the dominant Unicode script of a piece of text.
fastText's lid.176 classifier reports a language code (e.g. zh,
sr) without distinguishing scripts: Chinese is zh whether the input
is Simplified or Traditional Han, Serbian is sr whether written in
Latin or Cyrillic. The script signal is needed downstream to assemble a
full CLDR locale (e.g. zh-Hans-CN vs zh-Hant-TW).
This module is a thin wrapper around Unicode.script_dominance/1 from
the unicode Hex package. The wrapper does two things:
Returns the most frequent script as a single ISO 15924 four-letter atom (:Latn, :Cyrl, :Hans, ...) suitable for direct use in BCP-47 locale strings.
Folds Unicode's :common script (digits, punctuation, whitespace: "characters used in many scripts") out of the dominance computation, so a sentence with one Cyrillic word and three trailing punctuation marks still resolves to :Cyrl.
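The folding of the common bucket can be pictured with a small standalone sketch. It does not call the unicode package; DominanceSketch is a hypothetical module, and it assumes the per-script counts are already keyed by ISO 15924 atoms with :Zyyy holding the common characters.

```elixir
defmodule DominanceSketch do
  # Hypothetical helper, not part of this module's API.
  # Assumption: `counts` maps ISO 15924 atoms to codepoint counts and
  # the :Zyyy key holds the "common" characters.
  @spec dominant(%{optional(atom()) => non_neg_integer()}) :: atom()
  def dominant(counts) do
    # Drop the common bucket before looking for the largest count.
    scripted = Map.delete(counts, :Zyyy)

    if map_size(scripted) == 0 do
      # Nothing but common characters (or no input at all).
      :Zyyy
    else
      {script, _count} = Enum.max_by(scripted, fn {_script, count} -> count end)
      script
    end
  end
end
```

With this sketch, DominanceSketch.dominant(%{Cyrl: 6, Zyyy: 9}) evaluates to :Cyrl even though the common bucket has the larger count.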
Han disambiguation
Unicode.script_dominance/1 reports CJK ideographs as :han. When
the dominant script of the input is Han, detect/1 runs a second
pass over the codepoints comparing them against curated lists of
Simplified-only (Hans) and Traditional-only (Hant) characters.
The variant whose count is higher wins; ties (including text using
only shared Han codepoints) fall back to the generic :Hani.
The curated lists cover the ~50 most distinguishing characters for
each variant: the high-frequency function words and pronouns that
reliably differ between Simplified and Traditional. They are not
exhaustive (the Unihan Variants database has thousands of entries)
but cover the realistic case where the input is more than a handful
of characters long. For shorter or ambiguous input, the
disambiguation may stay at :Hani.
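The counting rule described above can be sketched as follows. HanVariantSketch is a hypothetical module, and its two five-character lists are illustrative samples borrowed from the han_variant/1 examples on this page, not the curated lists the module actually ships.

```elixir
defmodule HanVariantSketch do
  # Illustrative samples only (assumption): the real curated lists hold
  # roughly 50 characters per variant and are not reproduced here.
  @simplified_only MapSet.new(String.codepoints("这来国学时"))
  @traditional_only MapSet.new(String.codepoints("這來國學時"))

  @spec variant(binary()) :: :Hans | :Hant | :Hani
  def variant(text) when is_binary(text) do
    chars = String.codepoints(text)

    # Count how many characters appear in each variant-only list.
    hans = Enum.count(chars, &MapSet.member?(@simplified_only, &1))
    hant = Enum.count(chars, &MapSet.member?(@traditional_only, &1))

    cond do
      hans > hant -> :Hans
      hant > hans -> :Hant
      # Ties, including text with no variant-only characters at all.
      true -> :Hani
    end
  end
end
```

Characters that are absent from both lists never move either counter, which is why short text made only of shared codepoints stays at :Hani.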
Summary
Functions
detect/1: Returns the dominant script of text as an ISO 15924 four-letter atom.
han_variant/1: Returns the Han variant (:Hans, :Hant, or :Hani) for the Han-script content of text.
tally/1: Returns the per-script codepoint counts as a map keyed by ISO 15924 atoms.
Types
@type script() :: atom()
Functions
Returns the dominant script of text as an ISO 15924 four-letter atom.
Arguments
text is a UTF-8 binary.
Returns
An ISO 15924 four-letter script code (e.g. :Latn, :Cyrl, :Hani, :Hira, :Hang).
:Zyyy (the ISO 15924 "common" sentinel) when the input is empty or contains only digits, punctuation, and other non-script characters.
Examples
iex> Text.Language.Classifier.Fasttext.ScriptDetector.detect("Hello world")
:Latn
iex> Text.Language.Classifier.Fasttext.ScriptDetector.detect("Привет мир")
:Cyrl
iex> Text.Language.Classifier.Fasttext.ScriptDetector.detect("你好世界,这是中文")
:Hans
iex> Text.Language.Classifier.Fasttext.ScriptDetector.detect("你好世界,這是中文")
:Hant
iex> Text.Language.Classifier.Fasttext.ScriptDetector.detect("你好世界")
:Hani
iex> Text.Language.Classifier.Fasttext.ScriptDetector.detect("こんにちは")
:Hira
iex> Text.Language.Classifier.Fasttext.ScriptDetector.detect("123 !!!")
:Zyyy
@spec han_variant(binary()) :: :Hans | :Hant | :Hani
Returns the Han variant (:Hans, :Hant, or :Hani) for the
Han-script content of text.
Counts codepoints against curated lists of Simplified-only and
Traditional-only characters. The variant with the higher count
wins; ties (or text containing only codepoints shared between the
variants) return :Hani.
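Useful when the caller already knows the text contains Han
characters and wants the variant on its own, for instance for the
Han portion of mixed-script input whose dominant script is not Han.
detect/1 calls this internally whenever the dominant script is Han,
so a :Hani result from detect/1 already reflects a tie here.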
Examples
iex> Text.Language.Classifier.Fasttext.ScriptDetector.han_variant("国学时来这")
:Hans
iex> Text.Language.Classifier.Fasttext.ScriptDetector.han_variant("國學時來這")
:Hant
iex> Text.Language.Classifier.Fasttext.ScriptDetector.han_variant("你好世界")
:Hani
iex> Text.Language.Classifier.Fasttext.ScriptDetector.han_variant("Hello world")
:Hani
@spec tally(binary()) :: %{required(script()) => non_neg_integer()}
Returns the per-script codepoint counts as a map keyed by ISO 15924 atoms.
Useful when a caller needs more than the dominant script, for example to compute a confidence ratio across scripts for mixed-script input.
Examples
iex> Text.Language.Classifier.Fasttext.ScriptDetector.tally("Hello мир")
%{Latn: 5, Cyrl: 3}
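One way a caller might turn such a map into per-script ratios is sketched below. ScriptRatioSketch is a hypothetical helper that is not part of this module; it only assumes the map shape shown in the example above.

```elixir
defmodule ScriptRatioSketch do
  # Hypothetical helper: converts a script => count map into
  # a script => share-of-total map.
  @spec ratios(%{optional(atom()) => non_neg_integer()}) :: %{optional(atom()) => float()}
  def ratios(counts) do
    total = counts |> Map.values() |> Enum.sum()

    if total == 0 do
      # Empty map or all-zero counts: no ratios to report.
      %{}
    else
      Map.new(counts, fn {script, count} -> {script, count / total} end)
    end
  end
end
```

For the map above, ScriptRatioSketch.ratios(%{Latn: 5, Cyrl: 3}) gives %{Cyrl: 0.375, Latn: 0.625}, so a caller could treat the input as Latin-dominant with about 63% confidence.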