Text.Language.Classifier.Fasttext.ScriptDetector (Text v0.5.0)

Identifies the dominant Unicode script of a piece of text.

fastText's lid.176 classifier reports a language code (e.g. zh, sr) without distinguishing scripts: Chinese is zh whether the input is Simplified or Traditional Han, Serbian is sr whether written in Latin or Cyrillic. The script signal is needed downstream to assemble a full CLDR locale (e.g. zh-Hans-CN vs zh-Hant-TW).
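As a rough illustration of how the script signal feeds locale assembly, the sketch below joins a language code, a script atom, and a territory into a BCP-47-style tag. The LocaleSketch module and its tag/3 function are hypothetical names used only for this example and are not part of Text; dropping the script subtag for :Zyyy is likewise an assumption, and a real application would probably validate the result against CLDR.

```elixir
defmodule LocaleSketch do
  # Hypothetical helper, not part of the Text library.

  # Assumption: :Zyyy carries no script information, so the script
  # subtag is left out entirely.
  def tag(language, :Zyyy, territory) do
    Enum.join([language, territory], "-")
  end

  # Joins language, script, and territory subtags with hyphens.
  def tag(language, script, territory) when is_atom(script) do
    Enum.join([language, Atom.to_string(script), territory], "-")
  end
end

IO.puts(LocaleSketch.tag("zh", :Hans, "CN"))
IO.puts(LocaleSketch.tag("zh", :Hant, "TW"))
```

The language code alone ("zh") is the same in both calls; only the script atom separates the two tags.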

This module is a thin wrapper around Unicode.script_dominance/1 from the unicode Hex package. The wrapper does two things:

  • Returns the most-frequent script as a single ISO 15924 four-letter atom (:Latn, :Cyrl, :Hans, ...) suitable for direct use in BCP-47 locale strings.

  • Excludes Unicode's :common script (digits, punctuation, whitespace: "characters used in many scripts") from the dominance computation, so a sentence with one Cyrillic word and three trailing punctuation marks still resolves to :Cyrl.
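The effect of setting :common aside can be sketched with plain maps. The DominanceSketch module below is hypothetical and does not call the unicode package; it assumes per-script counts are already available as a map keyed by lowercase script atoms, and the counts in the usage line are invented for illustration.

```elixir
defmodule DominanceSketch do
  # Hypothetical helper, not the module's implementation.
  # `counts` maps script atoms to codepoint counts. The :common entry
  # is dropped before the maximum is taken; if nothing else remains,
  # :common is returned.
  def dominant(counts) when is_map(counts) do
    scripted = Map.delete(counts, :common)

    if map_size(scripted) == 0 do
      :common
    else
      {script, _count} = Enum.max_by(scripted, fn {_script, count} -> count end)
      script
    end
  end
end

# Invented counts: :common outnumbers :cyrillic but is ignored.
IO.inspect(DominanceSketch.dominant(%{cyrillic: 6, common: 9}))
```

Without the Map.delete/2 step, the invented input above would resolve to :common rather than :cyrillic.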

Han disambiguation

Unicode.script_dominance/1 reports CJK ideographs as :han. When the dominant script of the input is Han, detect/1 runs a second pass over the codepoints comparing them against curated lists of Simplified-only (Hans) and Traditional-only (Hant) characters. The variant whose count is higher wins; ties (including text using only shared Han codepoints) fall back to the generic :Hani.

The curated lists cover the ~50 most distinctive characters for each variant: the high-frequency function words and pronouns that reliably differ between Simplified and Traditional. They are not exhaustive (the Unihan Variants database has thousands of entries), but they suffice for the realistic case where the input is more than a handful of characters long. For shorter or ambiguous input, the result may stay at :Hani.
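The counting rule described above can be sketched as follows. HanVariantSketch is a hypothetical module, and its two five-character lists are tiny stand-ins chosen for this example; they are not the module's curated lists.

```elixir
defmodule HanVariantSketch do
  # Tiny illustrative lists, NOT the curated lists used by the module.
  @simplified_only String.codepoints("这国学时来")
  @traditional_only String.codepoints("這國學時來")

  # Counts codepoints found in each list. The higher count wins;
  # a tie (including zero matches on both sides) yields :Hani.
  def variant(text) when is_binary(text) do
    codepoints = String.codepoints(text)
    hans = Enum.count(codepoints, &(&1 in @simplified_only))
    hant = Enum.count(codepoints, &(&1 in @traditional_only))

    cond do
      hans > hant -> :Hans
      hant > hans -> :Hant
      true -> :Hani
    end
  end
end

IO.inspect(HanVariantSketch.variant("你好世界,这是中文"))
IO.inspect(HanVariantSketch.variant("你好世界"))
```

Because only list members are counted, text made entirely of shared codepoints scores zero on both sides and ends in the tie branch.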

Types

script()

@type script() :: atom()

Functions

detect(text)

@spec detect(binary()) :: script()

Returns the dominant script of text as an ISO 15924 four-letter atom.

Arguments

  • text is a UTF-8 binary.

Returns

  • An ISO 15924 four-letter script code (e.g. :Latn, :Cyrl, :Hani, :Hira, :Hang).

  • :Zyyy (the ISO 15924 code that corresponds to Unicode's Common script) when the input is empty or contains only digits, punctuation, and other non-script characters.

Examples

iex> Text.Language.Classifier.Fasttext.ScriptDetector.detect("Hello world")
:Latn

iex> Text.Language.Classifier.Fasttext.ScriptDetector.detect("Привет мир")
:Cyrl

iex> Text.Language.Classifier.Fasttext.ScriptDetector.detect("你好世界,这是中文")
:Hans

iex> Text.Language.Classifier.Fasttext.ScriptDetector.detect("你好世界,這是中文")
:Hant

iex> Text.Language.Classifier.Fasttext.ScriptDetector.detect("你好世界")
:Hani

iex> Text.Language.Classifier.Fasttext.ScriptDetector.detect("こんにちは")
:Hira

iex> Text.Language.Classifier.Fasttext.ScriptDetector.detect("123 !!!")
:Zyyy

han_variant(text)

@spec han_variant(binary()) :: :Hans | :Hant | :Hani

Returns the Han variant (:Hans, :Hant, or :Hani) for the Han-script content of text.

Counts codepoints against curated lists of Simplified-only and Traditional-only characters. The variant with the higher count wins; ties (or text containing only codepoints shared between the variants) return :Hani.

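Useful when the caller already knows the text contains Han characters and wants the variant on its own, for example when Han is not the dominant script and detect/1 therefore returns a different code. detect/1 calls this function internally whenever Han is dominant, so re-running it on text for which detect/1 already returned :Hani yields :Hani again.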

Examples

iex> Text.Language.Classifier.Fasttext.ScriptDetector.han_variant("国学时来这")
:Hans

iex> Text.Language.Classifier.Fasttext.ScriptDetector.han_variant("國學時來這")
:Hant

iex> Text.Language.Classifier.Fasttext.ScriptDetector.han_variant("你好世界")
:Hani

iex> Text.Language.Classifier.Fasttext.ScriptDetector.han_variant("Hello world")
:Hani

tally(text)

@spec tally(binary()) :: %{required(script()) => non_neg_integer()}

Returns the per-script codepoint counts as a map keyed by ISO 15924 atoms.

Useful when a caller needs more than the dominant script, for example to compute a confidence ratio across scripts for mixed-script input.

Examples

iex> Text.Language.Classifier.Fasttext.ScriptDetector.tally("Hello мир")
%{Latn: 5, Cyrl: 3}
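One way a caller might turn such a map into a confidence ratio is sketched below. TallyRatioSketch and top_share/1 are hypothetical names, not part of Text; the function only assumes a map shaped like the result above, and returning {:Zyyy, 0.0} for an empty tally is a choice made for this example.

```elixir
defmodule TallyRatioSketch do
  # Hypothetical helper, not part of the Text library.
  # Returns {script, share}, where share is the top script's fraction
  # of all counted codepoints. An empty or all-zero tally returns
  # {:Zyyy, 0.0} (an assumption made for this sketch).
  def top_share(tally) when is_map(tally) do
    total = tally |> Map.values() |> Enum.sum()

    if total == 0 do
      {:Zyyy, 0.0}
    else
      {script, count} = Enum.max_by(tally, fn {_script, count} -> count end)
      {script, count / total}
    end
  end
end

IO.inspect(TallyRatioSketch.top_share(%{Latn: 5, Cyrl: 3}))
```

For the map shown above, Latin accounts for 5 of 8 counted codepoints, so the share is 0.625.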