# `Text.Segment`
[🔗](https://github.com/kipcole9/text/blob/v0.5.0/lib/segment.ex#L1)

Locale-aware word and sentence segmentation.

Implements the public surface area you reach for when you want to
break a string into its semantic pieces — words or sentences —
following the Unicode segmentation rules in
[UAX #29 (Text Segmentation)](https://unicode.org/reports/tr29/).
Line-break segmentation (UAX #14) is available via `stream/2` for
the small set of callers who need it.

This module is a thin facade over the
[`unicode_string`](https://hex.pm/packages/unicode_string) package,
exposed under the `Text` namespace so callers don't have to know
where the underlying primitives live. Two adjustments over the raw
`Unicode.String.split/2`:

* **`words/2`** drops punctuation tokens by default. The Unicode word
  segmentation algorithm treats `","` and `"!"` as words just like
  `"hello"`; in practice every text-processing pipeline immediately
  filters those out, so this module does it by default.

* **`sentences/2`** trims trailing whitespace from each sentence so
  callers don't have to. The Unicode rules attach trailing whitespace
  to the preceding sentence, which is rarely what you want.

Pass `punctuation: :keep` or `trim: false` to opt back in to the raw
Unicode behaviour.

### Locale input shapes

The `:locale` option (where it appears) accepts an atom (`:fr`), a
string (`"fr"`, `"fr-CA"`, `"zh-Hans-CN"`), or a
`Localize.LanguageTag` struct when the optional `:localize`
dependency is loaded. See `Text.Language` for details.
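
The two default adjustments can be sketched side by side. This is
illustrative only — the exact raw token list depends on the
`unicode_string` version in use:

```elixir
# Raw UAX #29 word segmentation: punctuation (and, depending on
# options, whitespace) comes back as tokens in their own right.
Unicode.String.split("Hello, world!", break: :word)

# The facade's default: only the "real" words survive.
Text.Segment.words("Hello, world!")
#=> ["Hello", "world"]
```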

### Locale-awareness and abbreviation suppressions

Word and sentence segmentation rules differ across languages —
German treats `"Donaudampfschiffahrtsgesellschaft"` as one word,
Japanese needs dictionary-based segmentation for any meaningful
tokenisation, and Thai is written without spaces between words. Pass
`locale: "ja"`, `locale: "th"`, etc., to pick up those tailorings;
without a `:locale` option the default Unicode root rules apply.

When a locale is supplied, `sentences/2` also applies CLDR's
*suppression* list for that locale by default, which keeps common
abbreviations like `"i.e."` and `"e.g."` from being treated as
sentence terminators. The list is partial — it covers the most
common forms but not every abbreviation in use — so callers needing
comprehensive abbreviation handling should still layer their own
rules on top. Pass `suppressions: false` to opt out of even the
default list.
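
Toggling the suppression list makes its effect visible. The first
call below mirrors the doctest for `sentences/2`; the second is a
sketch of the opt-out, whose exact splits depend on the locale data:

```elixir
text = "He used i.e. and e.g. in his memo. Then stopped."

# Default with a locale: "i.e." and "e.g." are suppressed as
# sentence terminators.
Text.Segment.sentences(text, locale: "en")
#=> ["He used i.e. and e.g. in his memo.", "Then stopped."]

# Opting out: the full stops inside the abbreviations may end
# sentences again, splitting the text mid-phrase.
Text.Segment.sentences(text, locale: "en", suppressions: false)
```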

### Streaming

For very large inputs (log files, full books) prefer `stream/2`,
which returns a lazy `Stream` rather than realising the entire token
list in memory.
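
A lazy pipeline over a large input might look like the following
sketch (the file path is hypothetical; note that, unlike `words/2`,
the word stream may still include punctuation tokens):

```elixir
# Count the ten most frequent word tokens in a large text without
# building the full token list in memory.
"war_and_peace.txt"
|> File.read!()
|> Text.Segment.stream(break: :word)
|> Stream.map(&String.downcase/1)
|> Enum.frequencies()
|> Enum.sort_by(fn {_word, count} -> count end, :desc)
|> Enum.take(10)
```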

# `break`

```elixir
@type break() :: :word | :sentence | :line | :grapheme
```

A segmentation break type. `:word`, `:sentence`, and `:grapheme`
follow [UAX #29](https://unicode.org/reports/tr29/); `:line` follows
[UAX #14 (Line Breaking)](https://unicode.org/reports/tr14/).

# `sentences`

```elixir
@spec sentences(
  String.t(),
  keyword()
) :: [String.t()]
```

Splits `text` into sentences, trimming surrounding whitespace from
each by default.

### Arguments

* `text` is a UTF-8 string.

### Options

* `:trim` — `true` (default) trims leading and trailing whitespace
  from each sentence. `false` preserves the raw Unicode segmentation,
  in which the whitespace separating two sentences stays attached to
  the preceding one.

* `:locale` — a BCP-47 locale string. When given, locale-specific
  sentence break rules from CLDR are applied. Also enables CLDR's
  abbreviation suppression list (see `:suppressions`).

* `:suppressions` — `true` (default) applies CLDR's per-locale
  abbreviation suppression list, which prevents common forms like
  `"i.e."` and `"e.g."` from being treated as sentence terminators.
  Has no effect when `:locale` is not set.

### Returns

* A list of strings.

### Examples

    iex> Text.Segment.sentences("Hello, world! How are you?")
    ["Hello, world!", "How are you?"]

    iex> Text.Segment.sentences("First. Second! Third? Yes.")
    ["First.", "Second!", "Third?", "Yes."]

    iex> Text.Segment.sentences("Hello, world! How are you?", trim: false)
    ["Hello, world! ", "How are you?"]

    iex> Text.Segment.sentences("He used i.e. and e.g. in his memo. Then stopped.", locale: "en")
    ["He used i.e. and e.g. in his memo.", "Then stopped."]

    iex> Text.Segment.sentences("")
    []

# `stream`

```elixir
@spec stream(
  String.t(),
  keyword()
) :: Enumerable.t()
```

Returns a `Stream` that lazily yields segments of `text`.

Use this for inputs large enough that materialising the entire token
list at once would be wasteful. The `:break` option chooses the
segmentation level (`:word`, `:sentence`, `:line`, or `:grapheme`).

### Arguments

* `text` is a UTF-8 string.

### Options

* `:break` — required. One of `:word`, `:sentence`, `:line`, or
  `:grapheme`.

* `:trim` — defaults to `true`. Drops whitespace-only tokens.

* `:locale` — a BCP-47 locale string.

### Returns

* A `Stream`.

### Examples

    iex> Text.Segment.stream("Hello world", break: :word) |> Enum.to_list()
    ["Hello", "world"]

# `words`

```elixir
@spec words(
  String.t(),
  keyword()
) :: [String.t()]
```

Splits `text` into word tokens, dropping whitespace and punctuation
by default.

### Arguments

* `text` is a UTF-8 string.

### Options

* `:punctuation` — `:drop` (default) or `:keep`. Whether to include
  word-typed punctuation tokens (`","`, `"!"`, `"?"`, etc.) in the
  output.

* `:locale` — a BCP-47 locale string (`"en"`, `"ja"`, ...). When
  given, locale-specific break rules from CLDR are applied. Defaults
  to the Unicode root locale rules.

### Returns

* A list of strings.

### Examples

    iex> Text.Segment.words("Hello, world! How are you?")
    ["Hello", "world", "How", "are", "you"]

    iex> Text.Segment.words("Hello, world!", punctuation: :keep)
    ["Hello", ",", "world", "!"]

    iex> Text.Segment.words("naïve café")
    ["naïve", "café"]

    iex> Text.Segment.words("")
    []

---

*Consult [api-reference.md](api-reference.md) for the complete listing*
