Text.Clean (Text v0.5.0)


Text cleanup utilities: HTML stripping, whitespace collapse, Unicode normalization, and mojibake repair.

These are the small, fiddly transforms that every text-processing pipeline needs but no single library exposes coherently. The clean/2 function chains them in a sensible default order; the individual transforms are also exposed so callers can compose their own pipeline.

What each function does

  • strip_html/1 — removes HTML/XML tags and decodes the most common HTML entities.

  • collapse_whitespace/1 — replaces runs of any whitespace (including non-breaking space and other Unicode spaces) with a single ASCII space, and trims the ends.

  • strip_control/1 — removes ASCII and Unicode control characters except \n, \t, and \r.

  • normalize/2 — Unicode normalization. Defaults to NFC.

  • fix_mojibake/1 — repairs the most common mojibake patterns (UTF-8 misinterpreted as Windows-1252 or Latin-1). Inspired by ftfy. Only handles the well-known cases — not a complete replacement for ftfy.

  • clean/2 — applies all of the above in a sensible default order. Steps can be turned off via options.
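Callers that want a different order or subset can compose the individual transforms directly with the pipe operator. A sketch (the wrapper module name is illustrative):

```elixir
defmodule MyApp.TextPipeline do
  # Keep markup intact but still repair encoding, control characters,
  # normalization, and whitespace, i.e. everything except strip_html/1.
  def tidy(text) do
    text
    |> Text.Clean.fix_mojibake()
    |> Text.Clean.strip_control()
    |> Text.Clean.normalize(:nfc)
    |> Text.Clean.collapse_whitespace()
  end
end
```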

Summary

Functions

clean/2 — Applies the full cleanup pipeline.

collapse_whitespace/1 — Collapses runs of whitespace to single ASCII spaces and trims.

fix_mojibake/1 — Repairs the most common mojibake patterns.

normalize/2 — Applies a Unicode normalization form to the text.

strip_control/1 — Removes control characters except \n, \t, and \r.

strip_html/1 — Removes HTML/XML tags and decodes common HTML entities.

unaccent/1 — Removes diacritics, accents, and other Latin-script decorations from text by transliterating to ASCII.

Functions

clean(text, options \\ [])

@spec clean(
  String.t(),
  keyword()
) :: String.t()

Applies the full cleanup pipeline.

Default order: HTML strip → mojibake fix → control-char strip → Unicode normalize (NFC) → whitespace collapse.

Arguments

  • text is the input string.

Options

  • :strip_html (default true) — apply strip_html/1.

  • :fix_mojibake (default true) — apply fix_mojibake/1.

  • :strip_control (default true) — apply strip_control/1.

  • :normalize (default :nfc) — Unicode form to apply, or false to skip.

  • :collapse_whitespace (default true) — apply collapse_whitespace/1.

  • :unaccent (default false) — apply unaccent/1 (after the other steps) to fold accented Latin characters to ASCII.

Returns

  • The cleaned string.

Examples

iex> Text.Clean.clean("<p>Hello,&nbsp;<b>world</b>!</p>")
"Hello, world!"

iex> Text.Clean.clean("it’s   <em>cool</em>")
"it’s cool"

iex> Text.Clean.clean("CAFÉ <em>RÉSUMÉ</em>", unaccent: true)
"CAFE RESUME"
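Steps can also be switched off rather than hand-composing a pipeline; for instance, assuming the options behave as documented above:

```elixir
# Preserve markup while still collapsing whitespace and repairing text.
Text.Clean.clean("<b>bold</b>   text", strip_html: false)
#=> "<b>bold</b> text"
```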

collapse_whitespace(text)

@spec collapse_whitespace(String.t()) :: String.t()

Collapses runs of whitespace to single ASCII spaces and trims.

Recognises Unicode whitespace, not just ASCII — non-breaking spaces and ideographic spaces collapse too.
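A minimal sketch of the documented behaviour using Elixir's Unicode-aware regex engine (not necessarily the library's own code):

```elixir
collapse = fn text ->
  # With the `u` modifier, \s matches Unicode whitespace,
  # including U+00A0 (no-break space) and U+3000 (ideographic space).
  text
  |> String.replace(~r/\s+/u, " ")
  |> String.trim()
end

collapse.("  hello \u00A0 world\t\n")  #=> "hello world"
```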

Arguments

  • text is the input string.

Returns

  • The text with each run of whitespace replaced by a single space and leading/trailing whitespace trimmed.

Examples

iex> Text.Clean.collapse_whitespace("  hello   world  \n")
"hello world"

iex> Text.Clean.collapse_whitespace("non\u00A0breaking\tspaces")
"non breaking spaces"

fix_mojibake(text)

@spec fix_mojibake(String.t()) :: String.t()

Repairs the most common mojibake patterns.

Mojibake happens when UTF-8 bytes are decoded as Windows-1252 or Latin-1, producing strings like â€™ (a real U+2019 right single quote read as three single bytes). This function reverses the common cases.

Only well-known patterns are repaired — for harder cases, use a dedicated tool. Output is unchanged if no patterns match.
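The Latin-1 flavour of that reversal can be sketched in plain Elixir: re-encode the text's codepoints as Latin-1 bytes and accept the result only if it reads back as valid UTF-8. This is a sketch of the idea, not the library's implementation; the Windows-1252 flavour additionally needs a mapping table for bytes 0x80–0x9F, which :unicode does not provide.

```elixir
fix_latin1 = fn text ->
  # Encode each codepoint back to a single Latin-1 byte...
  case :unicode.characters_to_binary(text, :utf8, :latin1) do
    bytes when is_binary(bytes) ->
      # ...and keep the reinterpretation only if it is valid UTF-8.
      if String.valid?(bytes), do: bytes, else: text

    # Codepoints above U+00FF cannot be Latin-1 mojibake; leave as-is.
    _error_or_incomplete ->
      text
  end
end

fix_latin1.("caf\u00C3\u00A9")  #=> "café"
fix_latin1.("hello")            #=> "hello"
```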

Arguments

  • text is the input string.

Returns

  • The repaired string.

Examples

iex> Text.Clean.fix_mojibake("itâ€™s working")
"it’s working"

iex> Text.Clean.fix_mojibake("cafÃ©")
"café"

normalize(text, form \\ :nfc)

@spec normalize(String.t(), :nfc | :nfd | :nfkc | :nfkd) :: String.t()

Applies a Unicode normalization form to the text.
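In plain Elixir the same forms are available through String.normalize/2, which a function like this presumably wraps (an assumption):

```elixir
# NFC composes combining sequences; NFD decomposes them. The
# compatibility forms (NFKC/NFKD) additionally fold compatibility
# characters such as ligatures and full-width letters.
e_decomposed = "e" <> <<0x0301::utf8>>   # "e" + combining acute accent

String.normalize(e_decomposed, :nfc)     #=> "é" (single codepoint U+00E9)
String.normalize("\uFB01", :nfkc)        #=> "fi" (the ﬁ ligature, folded)
```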

Arguments

  • text is the input string.

  • form is :nfc, :nfd, :nfkc, or :nfkd. The default is :nfc.

Returns

  • The text in the requested normalization form.

Examples

iex> e_decomposed = "e" <> <<0x0301::utf8>>
iex> Text.Clean.normalize(e_decomposed) == "é"
true

strip_control(text)

@spec strip_control(String.t()) :: String.t()

Removes control characters except \n, \t, and \r.
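A minimal sketch of this behaviour with a Unicode-aware regex (not the library's own code): match any control (Cc) codepoint except the three whitelisted characters.

```elixir
strip = fn text ->
  # \p{Cc} covers the C0 and C1 control blocks; the negative
  # lookahead spares tab, newline, and carriage return.
  String.replace(text, ~r/(?![\t\n\r])\p{Cc}/u, "")
end

strip.("hello" <> <<7>> <> "world")  #=> "helloworld"
strip.("keep\nnewlines")             #=> "keep\nnewlines"
```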

Arguments

  • text is the input string.

Returns

  • The text with non-printable control characters removed.

Examples

iex> Text.Clean.strip_control("hello\u0007world")
"helloworld"

iex> Text.Clean.strip_control("keep\nnewlines")
"keep\nnewlines"

strip_html(text)

@spec strip_html(String.t()) :: String.t()

Removes HTML/XML tags and decodes common HTML entities.

This is a pragmatic regex-based stripper, not a security-grade HTML parser. Use it for cleaning user input or scraped text, not for sanitizing output to a browser.
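In the same pragmatic spirit, a sketch of how such a stripper can work (illustrative only; the entity table here is deliberately tiny, and this is not the library's code):

```elixir
# Decode &amp; last, so decoding cannot manufacture new entities
# (e.g. "&amp;lt;" must become the text "&lt;", not "<").
entities = [
  {"&lt;", "<"}, {"&gt;", ">"}, {"&quot;", "\""},
  {"&#39;", "'"}, {"&nbsp;", "\u00A0"}, {"&amp;", "&"}
]

strip_html = fn text ->
  stripped = String.replace(text, ~r/<[^>]*>/, "")

  Enum.reduce(entities, stripped, fn {entity, char}, acc ->
    String.replace(acc, entity, char)
  end)
end

strip_html.("<p>Tom &amp; Jerry</p>")  #=> "Tom & Jerry"
```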

Arguments

  • text is a string that may contain HTML/XML tags and entities.

Returns

  • The text with tags removed and entities decoded.

Examples

iex> Text.Clean.strip_html("<p>Hello, <b>world</b>!</p>")
"Hello, world!"

iex> Text.Clean.strip_html("Tom &amp; Jerry &mdash; cats &amp; mice")
"Tom & Jerry — cats & mice"

unaccent(text)

@spec unaccent(String.t()) :: String.t()

Removes diacritics, accents, and other Latin-script decorations from text by transliterating to ASCII.

Delegates to Unicode.Transform.LatinAscii.transform/1 (a CLDR Latin-ASCII transform compiled to pattern-matched function heads for O(1) per-codepoint dispatch).

Unlike a naïve "NFD then strip Mn" recipe, this also handles non-decomposable letters: Þ → TH, ß → ss, Æ → AE, ł → l, đ → d, etc.
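For comparison, that naïve recipe really does stop at decomposable letters (illustrative, not the library's code):

```elixir
# Decompose (NFD), then drop combining marks (general category Mn).
naive_unaccent = fn text ->
  text
  |> String.normalize(:nfd)
  |> String.replace(~r/\p{Mn}/u, "")
end

naive_unaccent.("café")  #=> "cafe" (é decomposes to e + U+0301)
naive_unaccent.("ß")     #=> "ß" (no decomposition, left untouched)
```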

Useful as a preprocessing step before fuzzy matching, search-index insertion, or filename sanitization.

Arguments

  • text is the input string.

Returns

  • The transliterated ASCII string.

Examples

iex> Text.Clean.unaccent("naïve café résumé")
"naive cafe resume"

iex> Text.Clean.unaccent("Þórbergur Þórðarson")
"THorbergur THordarson"

iex> Text.Clean.unaccent("Łódź")
"Lodz"