Text.Segment (Text v0.5.0)


Locale-aware word and sentence segmentation.

Implements the public surface area you reach for when you want to break a string into its semantic pieces — words or sentences — following the Unicode segmentation rules in UAX #29 (Text Segmentation). Line-break segmentation (UAX #14) is available via stream/2 for the small set of callers who need it.

This module is a thin facade over the unicode_string package, exposed under the Text namespace so callers don't have to know where the underlying primitives live. Two adjustments over the raw Unicode.String.split/2:

  • words/2 drops punctuation tokens by default. The Unicode word segmentation algorithm treats "," and "!" as words just like "hello"; in practice every text-processing pipeline immediately filters those out, so this module does it by default.

  • sentences/2 trims trailing whitespace from each sentence so callers don't have to. The Unicode rules attach trailing whitespace to the preceding sentence, which is rarely what you want.

Pass punctuation: :keep or trim: false to opt back in to the raw Unicode behaviour.

Locale input shapes

The :locale option (where it appears) accepts an atom (:fr), a string ("fr", "fr-CA", "zh-Hans-CN"), or a Localize.LanguageTag struct when the optional :localize dependency is loaded. See Text.Language for details.

Locale-awareness and abbreviation suppressions

Word and sentence segmentation rules differ across languages — German treats "Donaudampfschiffahrtsgesellschaft" as one word, Japanese needs dictionary-based segmentation for any meaningful tokenisation, and Thai has no spaces at all. Pass locale: "ja", locale: "th", etc., to pick up those tailorings; without a locale option the default Unicode rules apply.

When a locale is supplied, sentences/2 also applies CLDR's suppression list for that locale by default, which keeps common abbreviations like "i.e." and "e.g." from being treated as sentence terminators. The list is partial — it covers the most common forms but not every abbreviation in use — so callers needing comprehensive abbreviation handling should still layer their own rules on top. Pass suppressions: false to opt out of even the default list.
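As an illustrative sketch of the difference (the first result follows the documented suppression behaviour; the second call's exact output depends on the raw Unicode rules, so it is left unstated):

```elixir
# Default behaviour with a locale: "e.g." is suppressed as a sentence terminator.
Text.Segment.sentences("See e.g. the appendix. Then stop.", locale: "en")
#=> ["See e.g. the appendix.", "Then stop."]

# Opt out of the suppression list to fall back to the raw break rules:
Text.Segment.sentences("See e.g. the appendix. Then stop.",
  locale: "en",
  suppressions: false
)
```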

Streaming

For very large inputs (log files, full books) prefer stream/2, which returns a lazy Stream rather than realising the entire token list in memory.
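For instance, a sketch of tokenising a large file lazily (the file name and downstream pipeline steps are illustrative assumptions, not part of this API):

```elixir
"war_and_peace.txt"
|> File.read!()
|> Text.Segment.stream(break: :word)
|> Stream.map(&String.downcase/1)
|> Stream.uniq()
|> Enum.take(100)
```

Note that stream/2 takes a String, so the input itself is still read into memory; what stays lazy is the token sequence, which avoids realising a second, list-sized copy of the text.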

Summary

Types

A segmentation break type understood by Unicode UAX #29.

Functions

Splits text into sentences, trimming trailing whitespace by default.

Returns a Stream that lazily yields segments of text.

Splits text into word tokens, dropping whitespace and punctuation by default.

Types

break()

@type break() :: :word | :sentence | :line | :grapheme

A segmentation break type understood by Unicode UAX #29.

Functions

sentences(text, options \\ [])

@spec sentences(
  String.t(),
  keyword()
) :: [String.t()]

Splits text into sentences, trimming trailing whitespace by default.

Arguments

  • text is a UTF-8 string.

Options

  • :trim — true (default) trims leading and trailing whitespace from each sentence. false preserves the raw Unicode-segmentation output, in which the whitespace before the next sentence is attached to the preceding sentence.

  • :locale — a BCP-47 locale (atom, string, or language-tag struct; see "Locale input shapes" above). When given, locale-specific sentence break rules from CLDR are applied. Also enables CLDR's abbreviation suppression list (see :suppressions).

  • :suppressions — true (default) applies CLDR's per-locale abbreviation suppression list, which prevents common forms like "i.e." and "e.g." from being treated as sentence terminators. Has no effect when :locale is not set.

Returns

  • A list of strings.

Examples

iex> Text.Segment.sentences("Hello, world! How are you?")
["Hello, world!", "How are you?"]

iex> Text.Segment.sentences("First. Second! Third? Yes.")
["First.", "Second!", "Third?", "Yes."]

iex> Text.Segment.sentences("Hello, world! How are you?", trim: false)
["Hello, world! ", "How are you?"]

iex> Text.Segment.sentences("He used i.e. and e.g. in his memo. Then stopped.", locale: "en")
["He used i.e. and e.g. in his memo.", "Then stopped."]

iex> Text.Segment.sentences("")
[]

stream(text, options)

@spec stream(
  String.t(),
  keyword()
) :: Enumerable.t()

Returns a Stream that lazily yields segments of text.

Use this for inputs large enough that materialising the entire token list at once would be wasteful. The :break option chooses the segmentation level (:word, :sentence, :line, or :grapheme).

Arguments

  • text is a UTF-8 string.

Options

  • :break — required. One of :word, :sentence, :line, or :grapheme.

  • :trim — defaults to true. Drops whitespace-only tokens.

  • :locale — a BCP-47 locale (atom, string, or language-tag struct; see "Locale input shapes" above).

Returns

  • An Enumerable.t() (a lazy Stream) of strings.

Examples

iex> Text.Segment.stream("Hello world", break: :word) |> Enum.to_list()
["Hello", "world"]

words(text, options \\ [])

@spec words(
  String.t(),
  keyword()
) :: [String.t()]

Splits text into word tokens, dropping whitespace and punctuation by default.

Arguments

  • text is a UTF-8 string.

Options

  • :punctuation — :drop (default) or :keep. Whether to include word-typed punctuation tokens (",", "!", "?", etc.) in the output.

  • :locale — a BCP-47 locale ("en", "ja", ...; atom, string, or language-tag struct, see "Locale input shapes" above). When given, locale-specific break rules from CLDR are applied. Defaults to the Unicode root locale rules.

Returns

  • A list of strings.

Examples

iex> Text.Segment.words("Hello, world! How are you?")
["Hello", "world", "How", "are", "you"]

iex> Text.Segment.words("Hello, world!", punctuation: :keep)
["Hello", ",", "world", "!"]

iex> Text.Segment.words("naïve café")
["naïve", "café"]

iex> Text.Segment.words("")
[]