Text.Hyphenation (Text v0.5.0)

Hyphenation via Liang's algorithm with TeX hyphenation patterns.

Returns the legal hyphenation points for a word — the positions where a soft hyphen could be inserted for line-breaking, and where a syllable boundary lies. This is the engine that backs both hyphenation (hyphenate/2) and pattern-based syllable counting (count/2).
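
To make the mechanism concrete, here is a minimal sketch of the pattern-matching step in Liang's algorithm. It is not this module's implementation: the LiangSketch module and the nine patterns are illustrative assumptions (a hand-picked set in TeX pattern notation, not the bundled en-us file), and lookup is a naive substring scan where a real implementation would typically use a trie.

```elixir
defmodule LiangSketch do
  # Hypothetical helper module, for illustration only.

  # "hy3ph" -> {"hyph", [0, 0, 3, 0, 0]}: one score per gap around the letters.
  def parse_pattern(pattern) do
    {letters, scores, pending} =
      pattern
      |> String.graphemes()
      |> Enum.reduce({[], [], 0}, fn g, {letters, scores, pending} ->
        case Integer.parse(g) do
          {digit, ""} -> {letters, scores, digit}
          :error -> {[g | letters], [pending | scores], 0}
        end
      end)

    {letters |> Enum.reverse() |> Enum.join(), Enum.reverse([pending | scores])}
  end

  def points(word, patterns, left_min \\ 2, right_min \\ 3) do
    table = Map.new(patterns, &parse_pattern/1)

    # Wrap the word in "." so patterns can anchor to its start or end.
    chars = String.graphemes("." <> String.downcase(word) <> ".")
    n = length(chars)

    # Gap i sits just before chars[i]; every gap starts with score 0.
    zero = Map.new(0..n, &{&1, 0})

    # For every substring that is a known pattern, keep the maximum score per gap.
    scores =
      for start <- 0..(n - 1), len <- 1..(n - start), reduce: zero do
        acc ->
          key = chars |> Enum.slice(start, len) |> Enum.join()

          case Map.fetch(table, key) do
            {:ok, values} ->
              values
              |> Enum.with_index(start)
              |> Enum.reduce(acc, fn {v, i}, a -> Map.update!(a, i, &max(&1, v)) end)

            :error ->
              acc
          end
      end

    # An odd score permits a break; the minima trim both ends of the word.
    word_len = n - 2

    for k <- left_min..(word_len - right_min)//1,
        rem(Map.fetch!(scores, k + 1), 2) == 1,
        do: k
  end
end

patterns = ~w(hy3ph he2n hena4 hen5at 1na n2at 1tio 2io o2n)
IO.inspect(LiangSketch.points("hyphenation", patterns))
# prints [2, 6]
```

Even scores suppress a break and odd scores allow one, which is how the pattern "he2n" overrides the break that "1na" would otherwise permit between "e" and "n".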

Bundled and on-demand language packs

Seven hyph-utf8 packs are bundled and loaded at compile time, so the common European-language paths are zero-I/O:

  • en-us — American English (~5,000 patterns plus DEK's exception list)
  • de-1996 — German (modern spelling)
  • fr — French
  • es — Spanish
  • it — Italian
  • nl — Dutch
  • pt — Portuguese

Every other hyph-utf8 language pack (Russian, Hindi, and ~75 others) can be loaded on demand from the hyph-utf8 upstream.

On-demand loading goes through Text.Data, which means:

  • Loaded language packs are cached in the hyphenation/ subdirectory of :data_dir (default ~/.cache/text/hyphenation/), so subsequent calls do no I/O beyond the cache lookup.

  • Auto-download from upstream is opt-in, gated by config :text, auto_download_hyphenation_data: true. Without that flag the package never reaches out to the network — calls for an unbundled language raise an ArgumentError explaining the situation.

  • If you would rather populate the cache manually, drop the hyph-<tag>.tex file from upstream into the configured :data_dir/hyphenation/ directory and it will be picked up without any download.
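
As a sketch of the opt-in described above, the flag would normally be set in the host application's configuration. The file name config/config.exs is the usual Mix convention and is an assumption here; only the :text application key and the auto_download_hyphenation_data flag come from this documentation.

```elixir
# config/config.exs (assumed location; any config file evaluated at boot would do)
import Config

# Allow Text.Data to fetch unbundled hyph-utf8 packs from upstream on first use.
# Leave this unset to guarantee the package never touches the network.
config :text, auto_download_hyphenation_data: true
```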

Bundled files live in priv/hyphenation/. Each .tex file ships under its upstream license: en-us, de-1996, fr, es, nl, and pt are MIT/X11/BSD; it is LPPL. All are compatible with redistribution; the licenses (and original copyright notices) are preserved verbatim in the headers of each file.

Language input shapes

Every option that takes a :language accepts:

  • an atom (:fr, :"de-1996"),

  • a string ("fr", "en-GB", "de-CH"),

  • or a Localize.LanguageTag struct (when the optional localize dependency is loaded). The :language and :territory fields are used to derive the upstream file tag.

The mapping prefers the most common variant when the input is ambiguous: :en resolves to en-us, :de to the modern spelling de-1996, :el to monotonic Greek, etc. Pass an explicit BCP-47 tag ("en-GB", "de-CH", "de-1901") to override.

Left and right minima

TeX hyphenation patterns are designed with minimum gaps at the start and end of a word — \lefthyphenmin and \righthyphenmin. The American English file recommends left: 2, right: 3. The recommended values are read from the upstream .tex file's header when available, and applied as defaults in hyphenate/2, points/2, and count/2. They can be overridden per-call via the :left_min and :right_min options.
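
The effect of the two minima can be sketched as a filter over candidate break points. MinimaSketch is a hypothetical helper, not part of this library, and the candidate list [3, 6] for "computer" is inferred from the hyphenate/2 example further down ("com·put·er" with right_min: 1).

```elixir
defmodule MinimaSketch do
  # Keep only the points that leave at least `left_min` characters before
  # the break and at least `right_min` characters after it.
  def apply_minima(points, word, left_min, right_min) do
    len = String.length(word)
    Enum.filter(points, fn p -> p >= left_min and len - p >= right_min end)
  end
end

candidates = [3, 6]

IO.inspect(MinimaSketch.apply_minima(candidates, "computer", 2, 3))
# prints [3]  (a break at 6 would leave only "er", two characters, on the right)

IO.inspect(MinimaSketch.apply_minima(candidates, "computer", 2, 1))
# prints [3, 6]
```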

Summary

Functions

count(word, options \\ [])
Returns the number of syllables in a word, using hyphenation pattern boundaries as a proxy for syllable boundaries.

hyphenate(word, options \\ [])
Inserts soft hyphens at every legal break point in a word.

load_language(language)
Pre-loads a hyphenation pattern file for a language.

points(word, options \\ [])
Returns the list of valid hyphenation points within a word.

Functions

count(word, options \\ [])

@spec count(
  String.t(),
  keyword()
) :: pos_integer()

Returns the number of syllables in a word, using hyphenation pattern boundaries as a proxy for syllable boundaries.

This is more accurate than the heuristic in Text.Syllable.count/2 for words that match the bundled patterns, but the two occasionally disagree on edge cases. For readability metrics, use Text.Syllable. For typographic syllable counts, prefer this function.

Arguments

  • word is a single word as a string.

Options

  • :language, :left_min, :right_min — see points/2.

Returns

  • A positive integer: the number of break points plus one.

Examples

iex> Text.Hyphenation.count("hyphenation")
3

iex> Text.Hyphenation.count("cat")
1

hyphenate(word, options \\ [])

@spec hyphenate(
  String.t(),
  keyword()
) :: String.t()

Inserts soft hyphens at every legal break point in a word.

Arguments

  • word is a single word as a string.

Options

  • :hyphen is the string inserted at each break point. Default is "-". For typesetting use, "\u00AD" (soft hyphen) is common.

  • :language, :left_min, :right_min — see points/2.

Returns

  • The word with the hyphen string inserted at every break point.

Examples

iex> Text.Hyphenation.hyphenate("hyphenation")
"hy-phen-ation"

iex> Text.Hyphenation.hyphenate("computer", hyphen: "·", right_min: 1)
"com·put·er"

load_language(language)

@spec load_language(atom()) :: :ok

Pre-loads a hyphenation pattern file for a language.

Calling this is optional. The first call to points/2, hyphenate/2, or count/2 for a given language already loads (and, if necessary, downloads) the pattern file via Text.Data. load_language/1,2,3 is a way to warm the cache up front (e.g. during application startup) so the first user-facing call has no latency, or to register a custom pattern file under a name of your choosing.

Forms

load_language(language)
load_language(language, options)
load_language(language, tex_path)
load_language(language, tex_path, options)

Without an explicit tex_path, the upstream URL is derived from the language input and the file is fetched via Text.Data (see the moduledoc — auto-download must be enabled for the network fetch to occur).

With an explicit tex_path, that file is read directly and registered under language.

Arguments

  • language is an atom identifying the language (e.g. :de, :"de-ch", :my_custom_language).

  • tex_path is an optional path to a TeX hyphenation pattern file. When omitted, the file is resolved through Text.Data.

Options

  • :left_min is the minimum left hyphenation distance for the language. Default is parsed from the file's header, falling back to 2.

  • :right_min is the minimum right hyphenation distance for the language. Default is parsed from the file's header, falling back to 2.

Returns

  • :ok on success.

load_language(language, options)

@spec load_language(atom(), keyword() | Path.t()) :: :ok

load_language(language, tex_path, options)

@spec load_language(atom(), Path.t(), keyword()) :: :ok
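
A minimal sketch of warming the cache at application startup, using the forms listed above. MyApp.Application, the :my_app application name, and the priv/hyph-custom.tex path are all hypothetical; only the load_language calls and their :left_min and :right_min options come from this documentation.

```elixir
defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    # Pre-load a language so the first user-facing call pays no load latency.
    :ok = Text.Hyphenation.load_language(:de)

    # Register a custom pattern file under a name of our choosing.
    # The file path is an assumption; startup fails if the file is missing.
    custom = Application.app_dir(:my_app, "priv/hyph-custom.tex")
    :ok = Text.Hyphenation.load_language(:my_custom_language, custom, left_min: 2, right_min: 3)

    Supervisor.start_link([], strategy: :one_for_one, name: MyApp.Supervisor)
  end
end
```

Matching on :ok makes a failed load crash the boot instead of surfacing later on the first hyphenation call. Whether that is desirable depends on how critical hyphenation is to the application.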

points(word, options \\ [])

@spec points(
  String.t(),
  keyword()
) :: [pos_integer()]

Returns the list of valid hyphenation points within a word.

A hyphenation point is the number of characters from the start of the word at which a hyphen is permitted. For example, the points of "hyphenation" are [2, 6], indicating the breaks hy-phen-ation (the first break falls two characters from the left, the second six).

Arguments

  • word is a single word as a string. Surrounding punctuation is not stripped — pass tokens that have already been split.

Options

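  • :language is the hyphenation language, in any of the input shapes described in the moduledoc. The default is :en (American English). Bundled languages are available immediately; other languages are loaded on first use through Text.Data (subject to the auto-download setting) or can be pre-loaded with load_language/1.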

  • :left_min is the minimum number of characters at the start of the word before the first allowed break. Default is the language's recommended value (2 for English).

  • :right_min is the minimum number of characters at the end of the word after the last allowed break. Default is the language's recommended value (3 for English).

Returns

  • A list of integer hyphenation points, in ascending order. An empty list means no legal break points exist (very short words, or words below the left/right minima).

Examples

iex> Text.Hyphenation.points("hyphenation")
[2, 6]

iex> Text.Hyphenation.points("computer")
[3]

iex> Text.Hyphenation.points("a")
[]
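
How the three functions relate can be sketched without the library: given the list that points/2 returns, the hyphenated string and the syllable count both follow directly. PointsSketch and its functions are hypothetical helpers, not part of Text.Hyphenation, and the points are supplied by hand from the examples above.

```elixir
defmodule PointsSketch do
  # Split a word into the fragments delimited by its hyphenation points.
  def fragments(word, points) do
    ([0 | points] ++ [String.length(word)])
    |> Enum.chunk_every(2, 1, :discard)
    |> Enum.map(fn [from, to] -> String.slice(word, from, to - from) end)
  end

  # Mirrors what hyphenate/2 is documented to return.
  def join(word, points, hyphen \\ "-"), do: word |> fragments(points) |> Enum.join(hyphen)

  # Mirrors count/2: the number of break points plus one.
  def count(points), do: length(points) + 1
end

IO.inspect(PointsSketch.join("hyphenation", [2, 6]))
# prints "hy-phen-ation"

IO.inspect(PointsSketch.count([2, 6]))
# prints 3

IO.inspect(PointsSketch.join("a", []))
# prints "a"
```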