Unicode.String.DictionaryBreak (Unicode String v2.1.0)

Implements ICU's lookahead-based dictionary word break algorithm for scripts that don't use spaces between words.

This module handles word segmentation for Thai, Lao, Khmer, and Burmese (Myanmar) using the same approach as ICU's DictionaryBreakEngine: a 3-word lookahead with fallback for non-dictionary character sequences.

The algorithm works by:

  1. At each position, gathering all dictionary word candidates (shortest to longest match).

  2. When exactly one candidate exists, accepting it immediately.

  3. When multiple candidates exist, using a 3-word lookahead to select the candidate that leads to the best overall segmentation, preferring candidates whose following words are also found in the dictionary.

  4. When no dictionary word follows a short word, scanning forward through non-dictionary characters until finding a position where dictionary words resume, then combining the non-dictionary stretch with the preceding word.

  5. Absorbing combining marks (Unicode General Category M) into the preceding word so that vowel signs, tone marks, and virama/coeng characters stay attached to their base.
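
Steps 1 to 3 can be illustrated with a minimal sketch. It is not the module's implementation: the `LookaheadSketch` module, its toy Latin-letter dictionary, and the scoring rule (most consecutive dictionary words, ties broken by the longer candidate) are all assumptions made for illustration, and step 4 is reduced to emitting a single grapheme.

```elixir
defmodule LookaheadSketch do
  # Hypothetical toy dictionary; the real module uses per-locale dictionaries.
  @dictionary MapSet.new(~w(the them men))
  @max_word_length 4

  # Step 1: all dictionary words starting at the head of `text`, shortest first.
  def candidates(text) do
    graphemes = String.graphemes(text)
    limit = min(length(graphemes), @max_word_length)

    for len <- 1..limit//1,
        word = graphemes |> Enum.take(len) |> Enum.join(),
        MapSet.member?(@dictionary, word),
        do: word
  end

  # Longest run of consecutive dictionary words (at most `depth`) starting at `text`.
  def chain_length(_text, 0), do: 0

  def chain_length(text, depth) do
    text
    |> candidates()
    |> Enum.reduce(0, fn word, best ->
      max(best, 1 + chain_length(drop_prefix(text, word), depth - 1))
    end)
  end

  def split(""), do: []

  def split(text) do
    case candidates(text) do
      [] ->
        # Simplification (assumption): emit one grapheme instead of the
        # forward scan described in step 4.
        {grapheme, remainder} = String.next_grapheme(text)
        [grapheme | split(remainder)]

      [only] ->
        # Step 2: a single candidate is accepted immediately.
        [only | split(drop_prefix(text, only))]

      many ->
        # Step 3: the candidate plus up to two following words gives a
        # 3-word lookahead; prefer the longer candidate on ties.
        best =
          Enum.max_by(many, fn word ->
            {chain_length(drop_prefix(text, word), 2), String.length(word)}
          end)

        [best | split(drop_prefix(text, best))]
    end
  end

  defp drop_prefix(text, word), do: String.replace_prefix(text, word, "")
end
```

With this toy dictionary, `LookaheadSketch.split("themen")` returns `["the", "men"]`. A greedy longest match would take `"them"` first and leave `"en"`, which is not in the dictionary.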

Functions

split(string, locale)

@spec split(String.t(), atom()) :: [String.t()]

Splits a string into word segments using dictionary-based lookahead for the given locale.

Returns a list of word segments (strings). Unlike a simple greedy match, this function considers multiple word candidates at each position and uses a 3-word lookahead to select the segmentation in which the most consecutive words are found in the dictionary.

Arguments

  • string is a binary string to segment.

  • locale is a dictionary locale atom (:th, :lo, :km, or :my).

Returns

  • A list of binary strings representing word segments.
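
Example

A possible call, based only on the signature documented above. The sample text is an arbitrary Thai greeting, and the `Code.ensure_loaded?/1` guard is an addition so the snippet also runs where the library is not installed. The actual segmentation depends on the library's Thai dictionary, so no expected output is shown.

```elixir
# Thai text written without spaces between words.
text = "สวัสดีครับ"

segments =
  if Code.ensure_loaded?(Unicode.String.DictionaryBreak) do
    # split/2 takes the string and a dictionary locale atom.
    Unicode.String.DictionaryBreak.split(text, :th)
  else
    # Library not available: leave the text unsegmented.
    [text]
  end

IO.inspect(segments, label: "segments")
```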

split_with_fallback(string, locale, fallback_fn)

@spec split_with_fallback(String.t(), atom(), (String.t() -> [String.t()])) :: [
  String.t()
]

Splits a string using the dictionary break algorithm for target-script ranges and a fallback function for other ranges.

Text is partitioned into ranges belonging to the locale's script and ranges that don't. Dictionary breaking is applied to the former; fallback_fn is called on the latter.
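
The partitioning can be pictured with the following sketch. It is an illustration under stated assumptions, not the library's code: `ScriptPartitionSketch` and `partition_and_split/3` are hypothetical names, the script test covers only the Thai block (U+0E00..U+0E7F), and the dictionary breaker is passed in as a function rather than selected by locale.

```elixir
defmodule ScriptPartitionSketch do
  # Assumption: a grapheme counts as target-script when its first code point
  # lies in the Thai block.
  defp thai?(grapheme) do
    <<code_point::utf8, _rest::binary>> = grapheme
    code_point in 0x0E00..0x0E7F
  end

  # Splits `text` into runs of Thai and non-Thai graphemes, then applies
  # `dictionary_fn` to the Thai runs and `fallback_fn` to the others.
  def partition_and_split(text, dictionary_fn, fallback_fn) do
    text
    |> String.graphemes()
    |> Enum.chunk_by(&thai?/1)
    |> Enum.flat_map(fn [first | _] = run ->
      chunk = Enum.join(run)

      if thai?(first) do
        dictionary_fn.(chunk)
      else
        fallback_fn.(chunk)
      end
    end)
  end
end

# Stub dictionary breaker that keeps each Thai run whole; whitespace
# splitting serves as the fallback for everything else.
ScriptPartitionSketch.partition_and_split(
  "abc สวัสดี def",
  fn run -> [run] end,
  &String.split/1
)
```

With the stub breaker above, the call returns `["abc", "สวัสดี", "def"]`: the Latin runs go through `String.split/1` and the Thai run goes through the dictionary function.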