# `Unicode.String.DictionaryBreak`
[🔗](https://github.com/elixir-unicode/unicode_string/blob/v2.1.0/lib/unicode/dictionary_break.ex#L1)

Implements ICU's lookahead-based dictionary word break algorithm
for scripts that don't use spaces between words.

This module handles word segmentation for Thai, Lao, Khmer, and
Burmese (Myanmar) using the same approach as ICU's
`DictionaryBreakEngine`: a 3-word lookahead with fallback for
non-dictionary character sequences.

The algorithm works by:

1. At each position, gathering all dictionary word candidates
   (shortest to longest match).

2. When exactly one candidate exists, accepting it immediately.

3. When multiple candidates exist, using a 3-word lookahead to
   select the candidate that leads to the best overall
   segmentation, preferring the candidate whose following
   words are also found in the dictionary.

4. When no dictionary word follows a short word, scanning
   forward through non-dictionary characters until finding a
   position where dictionary words resume, then combining the
   non-dictionary stretch with the preceding word.

5. Absorbing combining marks (Unicode General Category M) into
   the preceding word so that vowel signs, tone marks, and
   virama/coeng characters stay attached to their base.
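
Steps 1 to 3 can be illustrated with a minimal sketch. It is not the module's actual implementation: `LookaheadSketch`, its functions, the toy English dictionary, and the 4-grapheme word limit are all assumptions made for illustration. Step 4 is reduced to emitting a single unknown grapheme, and step 5 is omitted.

```elixir
defmodule LookaheadSketch do
  # Illustrative only: every name here is hypothetical, not the library's API.
  @max_word_length 4
  @lookahead 3

  def split(text, dict), do: do_split(text, dict, [])

  defp do_split("", _dict, acc), do: Enum.reverse(acc)

  defp do_split(text, dict, acc) do
    word =
      case candidates(text, dict) do
        # Simplified stand-in for step 4: emit one unknown grapheme.
        [] -> String.first(text)
        # Step 2: a single candidate is accepted immediately.
        [only] -> only
        # Step 3: score each candidate by how far the lookahead gets.
        many -> Enum.max_by(many, &score(text, &1, dict))
      end

    do_split(rest(text, word), dict, [word | acc])
  end

  # Step 1: dictionary words at the start of `text`, shortest first.
  def candidates(text, dict) do
    head = String.slice(text, 0, @max_word_length)

    for len <- 1..String.length(head)//1,
        word = String.slice(head, 0, len),
        MapSet.member?(dict, word),
        do: word
  end

  # Score is {words matched within the lookahead window, word length},
  # so ties on the first element go to the longer candidate.
  defp score(text, word, dict) do
    {1 + chain_length(rest(text, word), dict, @lookahead - 1), String.length(word)}
  end

  defp chain_length(_text, _dict, 0), do: 0
  # Reaching the end of the text counts as a full match.
  defp chain_length("", _dict, depth), do: depth

  defp chain_length(text, dict, depth) do
    text
    |> candidates(dict)
    |> Enum.map(&(1 + chain_length(rest(text, &1), dict, depth - 1)))
    |> Enum.max(fn -> 0 end)
  end

  defp rest(text, word), do: String.replace_prefix(text, word, "")
end

dict = MapSet.new(~w(the them men))
LookaheadSketch.split("themen", dict)
# => ["the", "men"]
```

A greedy longest match would take `"them"` first and leave `"en"`, which is not in the toy dictionary. The lookahead scores `"the"` higher because `"men"` follows it.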

# `split`

```elixir
@spec split(String.t(), atom()) :: [String.t()]
```

Splits a string into word segments using dictionary-based
lookahead for the given locale.

Returns a list of word segments (strings). Unlike a simple
greedy match, which commits to one candidate at each position
without looking ahead, this function considers multiple word
candidates at each position and uses a 3-word lookahead to
select the segmentation whose following words are also found
in the dictionary.

### Arguments

* `string` is a binary string to segment.

* `locale` is a dictionary locale atom (`:th`, `:lo`, `:km`,
  or `:my`).

### Returns

* A list of binary strings representing word segments.
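
### Example

The call shape follows from the `@spec` above. The sketch below wraps it in a small helper that only accepts the four documented locale atoms. `SegmentHelper`, `segment/3`, and the injectable `splitter` argument are hypothetical names introduced for illustration, and the default splitter assumes the `unicode_string` package is available as a dependency. How the library itself treats other locale atoms is not covered here.

```elixir
defmodule SegmentHelper do
  # Hypothetical wrapper, not part of the library.
  @locales [:th, :lo, :km, :my]

  # `splitter` defaults to the documented split/2 and can be replaced
  # with a stub when the library is not loaded.
  def segment(string, locale, splitter \\ &Unicode.String.DictionaryBreak.split/2)
      when is_binary(string) and locale in @locales do
    splitter.(string, locale)
  end
end

# With the library available, Thai text would be segmented like this:
# SegmentHelper.segment("ภาษาไทย", :th)
```

An unsupported locale such as `:en` raises a `FunctionClauseError` in this helper because of the guard.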

# `split_with_fallback`

```elixir
@spec split_with_fallback(String.t(), atom(), (String.t() -> [String.t()])) :: [
  String.t()
]
```

Splits a string using the dictionary break algorithm for
target-script ranges and a fallback function for other ranges.

Text is partitioned into ranges belonging to the locale's
script and ranges that don't. Dictionary breaking is applied
to the former; `fallback_fn`, which takes a string and returns
a list of strings, is called on the latter.
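
The partitioning idea can be sketched as follows. This is an assumption-laden illustration, not the module's code: `ScriptRunSketch` is a hypothetical name, the Thai block (U+0E00 to U+0E7F) stands in for "the locale's script", and a `dict_fn` argument replaces the `locale` argument so the sketch stays self-contained.

```elixir
defmodule ScriptRunSketch do
  # Hypothetical sketch: splits text into Thai and non-Thai runs and
  # hands each run to the matching function.
  def split_with_fallback(string, dict_fn, fallback_fn) do
    string
    |> String.graphemes()
    |> Enum.chunk_by(&thai?/1)
    |> Enum.flat_map(fn [first | _] = run ->
      text = Enum.join(run)
      if thai?(first), do: dict_fn.(text), else: fallback_fn.(text)
    end)
  end

  # A grapheme counts as Thai when its first code point is in the
  # Thai block. Real script detection may be more involved.
  defp thai?(grapheme) do
    <<codepoint::utf8, _rest::binary>> = grapheme
    codepoint in 0x0E00..0x0E7F
  end
end

# Stand-in for dictionary breaking: wrap the whole run in angle brackets.
dict_fn = fn text -> ["<" <> text <> ">"] end

ScriptRunSketch.split_with_fallback("hello ไทย world", dict_fn, &String.split/1)
# => ["hello", "<ไทย>", "world"]
```

Here `String.split/1` plays the role of `fallback_fn`, so the Latin runs are split on whitespace while the Thai run goes to the dictionary stand-in.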

