# `Text.Language.Classifier.Fasttext.ScriptDetector`
[🔗](https://github.com/kipcole9/text/blob/v0.5.0/lib/language/classifier/fasttext/script_detector.ex#L1)

Identifies the dominant Unicode script of a piece of text.

fastText's `lid.176` classifier reports a language code (e.g. `zh`,
`sr`) without distinguishing scripts: Chinese is `zh` whether the input
is Simplified or Traditional Han, and Serbian is `sr` whether it is
written in Latin or Cyrillic. The script signal is needed downstream to
assemble a full CLDR locale (e.g. `zh-Hans-CN` vs `zh-Hant-TW`).
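
As a rough illustration of that downstream step, the sketch below joins a
language code with a detected script atom into a locale-style tag.
`LocaleSketch` and `tag/3` are hypothetical names, not part of this
library, and dropping `:Zyyy` and `:Hani` from the tag is an assumed
design choice rather than documented behaviour.

```elixir
defmodule LocaleSketch do
  # Hypothetical helper, not part of the `text` library.
  # Scripts that say nothing specific are left out of the tag (an assumption).
  @uninformative [:Zyyy, :Hani]

  # Joins language, script and optional territory with hyphens.
  def tag(language, script, territory \\ nil) do
    [language, script_part(script), territory]
    |> Enum.reject(&is_nil/1)
    |> Enum.join("-")
  end

  defp script_part(script) when script in @uninformative, do: nil
  defp script_part(script), do: Atom.to_string(script)
end

LocaleSketch.tag("zh", :Hans, "CN") # => "zh-Hans-CN"
LocaleSketch.tag("sr", :Cyrl)       # => "sr-Cyrl"
LocaleSketch.tag("zh", :Hani)       # => "zh"
```

In practice the script atom would come from `detect/1` and the language
code from the classifier.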

This module is a thin wrapper around `Unicode.script_dominance/1` from
the [`unicode`](https://hex.pm/packages/unicode) Hex package. The
wrapper does two things:

* Returns the most-frequent script as a single ISO 15924 four-letter
  atom (`:Latn`, `:Cyrl`, `:Hans`, ...) suitable for direct use in
  BCP-47 locale strings.

* Folds Unicode's `:common` script (digits, punctuation, whitespace —
  "characters used in many scripts") out of the dominance computation,
  so a sentence with one Cyrillic word and three trailing punctuation
  marks still resolves to `:Cyrl`.
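
The folding step can be sketched on a plain count map. This is a minimal
illustration, not the module's implementation: `DominanceSketch` is a
hypothetical name, and the lowercase script keys (`:cyrillic`, `:common`)
are an assumed shape for the per-script counts.

```elixir
defmodule DominanceSketch do
  # Hypothetical stand-in for the folding step described above.
  # `counts` maps script names to codepoint counts, `:common` included.
  def dominant(counts) do
    case counts |> Map.delete(:common) |> Enum.to_list() do
      # Nothing but common characters: report common itself.
      [] -> :common
      # Otherwise pick the script with the highest count.
      scripts -> scripts |> Enum.max_by(fn {_script, n} -> n end) |> elem(0)
    end
  end
end

# Two Cyrillic letters followed by three punctuation marks.
DominanceSketch.dominant(%{cyrillic: 2, common: 3}) # => :cyrillic
DominanceSketch.dominant(%{common: 7})              # => :common
```

Without the `Map.delete/2` call, the first example would resolve to
`:common` because punctuation outnumbers the letters.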

### Han disambiguation

`Unicode.script_dominance/1` reports CJK ideographs as `:han`. When
the dominant script of the input is Han, `detect/1` runs a second
pass over the codepoints comparing them against curated lists of
Simplified-only (`Hans`) and Traditional-only (`Hant`) characters.
The variant whose count is higher wins; ties (including text using
only shared Han codepoints) fall back to the generic `:Hani`.

The curated lists cover the ~50 most distinguishing characters for
each variant: the high-frequency function words and pronouns that
reliably differ between Simplified and Traditional. They are not
exhaustive (the [Unihan
Variants](https://www.unicode.org/reports/tr38/) database has
thousands of entries), but they suffice for the realistic case where
the input is more than a handful of characters long. For shorter or
ambiguous input, the result may stay at `:Hani`.
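
The counting rule can be sketched as follows. The five-character lists
are illustrative only and stand in for the module's curated lists, which
are not reproduced here; `HanVariantSketch` is a hypothetical name.

```elixir
defmodule HanVariantSketch do
  # Illustrative lists only (assumption); the real lists are larger.
  @simplified ~c"国学时来这"
  @traditional ~c"國學時來這"

  def variant(text) do
    # Count codepoints that appear in exactly one of the two lists.
    {hans, hant} =
      text
      |> String.to_charlist()
      |> Enum.reduce({0, 0}, fn cp, {s, t} ->
        cond do
          cp in @simplified -> {s + 1, t}
          cp in @traditional -> {s, t + 1}
          true -> {s, t}
        end
      end)

    # Higher count wins; a tie (including 0 vs 0) stays generic.
    cond do
      hans > hant -> :Hans
      hant > hans -> :Hant
      true -> :Hani
    end
  end
end

HanVariantSketch.variant("你好世界，这是中文") # => :Hans
HanVariantSketch.variant("你好世界，這是中文") # => :Hant
HanVariantSketch.variant("你好世界")           # => :Hani
```

The last call returns `:Hani` because every character in it is shared
between the two variants, so both counts are zero.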

# `script`

```elixir
@type script() :: atom()
```

# `detect`

```elixir
@spec detect(binary()) :: script()
```

Returns the dominant script of `text` as an ISO 15924 four-letter atom.

### Arguments

* `text` is a UTF-8 binary.

### Returns

* An ISO 15924 four-letter script code (e.g. `:Latn`, `:Cyrl`, `:Hani`,
  `:Hira`, `:Hang`).

* `:Zyyy` (the ISO 15924 "common" sentinel) when the input is empty or
  contains only digits, punctuation, and other non-script characters.

### Examples

    iex> Text.Language.Classifier.Fasttext.ScriptDetector.detect("Hello world")
    :Latn

    iex> Text.Language.Classifier.Fasttext.ScriptDetector.detect("Привет мир")
    :Cyrl

    iex> Text.Language.Classifier.Fasttext.ScriptDetector.detect("你好世界，这是中文")
    :Hans

    iex> Text.Language.Classifier.Fasttext.ScriptDetector.detect("你好世界，這是中文")
    :Hant

    iex> Text.Language.Classifier.Fasttext.ScriptDetector.detect("你好世界")
    :Hani

    iex> Text.Language.Classifier.Fasttext.ScriptDetector.detect("こんにちは")
    :Hira

    iex> Text.Language.Classifier.Fasttext.ScriptDetector.detect("123 !!!")
    :Zyyy

# `han_variant`

```elixir
@spec han_variant(binary()) :: :Hans | :Hant | :Hani
```

Returns the Han variant (`:Hans`, `:Hant`, or `:Hani`) for the
Han-script content of `text`.

Counts codepoints against curated lists of Simplified-only and
Traditional-only characters. The variant with the higher count
wins; ties (or text containing only codepoints shared between the
variants) return `:Hani`.

Useful when the caller already knows the text is Han and wants only
the variant. `detect/1` calls this internally whenever the dominant
script is Han.

### Examples

    iex> Text.Language.Classifier.Fasttext.ScriptDetector.han_variant("国学时来这")
    :Hans

    iex> Text.Language.Classifier.Fasttext.ScriptDetector.han_variant("國學時來這")
    :Hant

    iex> Text.Language.Classifier.Fasttext.ScriptDetector.han_variant("你好世界")
    :Hani

    iex> Text.Language.Classifier.Fasttext.ScriptDetector.han_variant("Hello world")
    :Hani

# `tally`

```elixir
@spec tally(binary()) :: %{required(script()) => non_neg_integer()}
```

Returns the per-script codepoint counts as a map keyed by ISO 15924
atoms.

Useful when a caller needs more than the dominant script, e.g. to
compute a confidence ratio across scripts for mixed-script input.

### Examples

    iex> Text.Language.Classifier.Fasttext.ScriptDetector.tally("Hello мир")
    %{Latn: 5, Cyrl: 3}
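
One way a caller might derive such a ratio from a tally-shaped map is
sketched below. `RatioSketch` and `shares/1` are hypothetical names, not
part of this module.

```elixir
defmodule RatioSketch do
  # Hypothetical helper: converts per-script counts into per-script shares.
  def shares(tally) do
    case tally |> Map.values() |> Enum.sum() do
      # Empty or all-zero tally: no shares to report.
      0 -> %{}
      total -> Map.new(tally, fn {script, n} -> {script, n / total} end)
    end
  end
end

RatioSketch.shares(%{Latn: 5, Cyrl: 3})
# => %{Cyrl: 0.375, Latn: 0.625}
```

The input map here matches the `tally/1` example above.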
