Text.Clean (Text v0.5.0)


Text cleanup utilities: HTML stripping, whitespace collapse, Unicode normalization, and mojibake repair.

These are the small, fiddly transforms that every text-processing pipeline needs but no single library exposes coherently. The clean/2 function chains them in a sensible default order; the individual transforms are also exposed so callers can compose their own pipeline.

What each function does

  • strip_html/1 — removes HTML/XML tags and decodes the most common HTML entities.

  • collapse_whitespace/1 — replaces runs of any whitespace (including non-breaking space and other Unicode spaces) with a single ASCII space, and trims the ends.

  • strip_control/1 — removes ASCII and Unicode control characters except \n, \t, and \r.

  • normalize/2 — Unicode normalization. Defaults to NFC.

  • fix_mojibake/1 — repairs the most common mojibake patterns (UTF-8 misinterpreted as Windows-1252 or Latin-1). Inspired by ftfy. Only handles the well-known cases — not a complete replacement for ftfy.

  • clean/2 — applies all of the above in a sensible default order. Steps can be turned off via options.
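Callers that want a different order or subset can compose the individual transforms directly with the pipe operator. A sketch (the wrapper module name is illustrative):

```elixir
defmodule MyApp.TextPipeline do
  # Keep markup intact but still repair encoding, control characters,
  # normalization, and whitespace, i.e. everything except strip_html/1.
  def tidy(text) do
    text
    |> Text.Clean.fix_mojibake()
    |> Text.Clean.strip_control()
    |> Text.Clean.normalize(:nfc)
    |> Text.Clean.collapse_whitespace()
  end
end
```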

Summary

Functions

clean/2 — Applies the full cleanup pipeline.

collapse_whitespace/1 — Collapses runs of whitespace to single ASCII spaces and trims.

fix_mojibake/1 — Repairs the most common mojibake patterns.

normalize/2 — Applies a Unicode normalization form to the text.

strip_control/1 — Removes control characters except \n, \t, and \r.

strip_html/1 — Removes HTML/XML tags and decodes common HTML entities.

unaccent/1 — Removes diacritics, accents, and other Latin-script decorations from text by transliterating to ASCII.

Functions

clean(text, options \\ [])

@spec clean(
  String.t(),
  keyword()
) :: String.t()

Applies the full cleanup pipeline.

Default order: HTML strip → mojibake fix → control-char strip → Unicode normalize (NFC) → whitespace collapse.

Arguments

  • text is the input string.

Options

  • :strip_html (default true) — apply strip_html/1.

  • :fix_mojibake (default true) — apply fix_mojibake/1.

  • :strip_control (default true) — apply strip_control/1.

  • :normalize (default :nfc) — Unicode form to apply, or false to skip.

  • :collapse_whitespace (default true) — apply collapse_whitespace/1.

  • :unaccent (default false) — apply unaccent/1 (after the other steps) to fold accented Latin characters to ASCII.

Returns

  • The cleaned string.

Examples

iex> Text.Clean.clean("<p>Hello,&nbsp;<b>world</b>!</p>")
"Hello, world!"

iex> Text.Clean.clean("it’s   <em>cool</em>")
"it’s cool"

iex> Text.Clean.clean("CAFÉ <em>RÉSUMÉ</em>", unaccent: true)
"CAFE RESUME"
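Steps can also be switched off rather than hand-composing a pipeline; for instance, assuming the options behave as documented above:

```elixir
# Preserve markup while still collapsing whitespace and repairing text.
Text.Clean.clean("<b>bold</b>   text", strip_html: false)
#=> "<b>bold</b> text"
```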

collapse_whitespace(text)

@spec collapse_whitespace(String.t()) :: String.t()

Collapses runs of whitespace to single ASCII spaces and trims.

Recognises Unicode whitespace, not just ASCII — non-breaking spaces and ideographic spaces collapse too.
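A minimal sketch of the documented behaviour using Elixir's Unicode-aware regex engine (not necessarily the library's own code):

```elixir
collapse = fn text ->
  # With the `u` modifier, \s matches Unicode whitespace,
  # including U+00A0 (no-break space) and U+3000 (ideographic space).
  text
  |> String.replace(~r/\s+/u, " ")
  |> String.trim()
end

collapse.("  hello \u00A0 world\t\n")  #=> "hello world"
```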

Arguments

  • text is the input string.

Returns

  • The text with each run of whitespace replaced by a single space and leading/trailing whitespace trimmed.

Examples

iex> Text.Clean.collapse_whitespace("  hello   world  \n")
"hello world"

iex> Text.Clean.collapse_whitespace("non\u00A0breaking\tspaces")
"non breaking spaces"

fix_mojibake(text)

@spec fix_mojibake(String.t()) :: String.t()

Repairs the most common mojibake patterns.

Mojibake happens when UTF-8 bytes are decoded as Windows-1252 or Latin-1, producing strings like â€™ (a real U+2019 right single quote read as three single bytes). This function reverses the common cases.

Only well-known patterns are repaired — for harder cases, use a dedicated tool. Output is unchanged if no patterns match.
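The Latin-1 flavour of that reversal can be sketched in plain Elixir: re-encode the text's codepoints as Latin-1 bytes and accept the result only if it reads back as valid UTF-8. This is a sketch of the idea, not the library's implementation; the Windows-1252 flavour additionally needs a mapping table for bytes 0x80–0x9F, which :unicode does not provide.

```elixir
fix_latin1 = fn text ->
  # Encode each codepoint back to a single Latin-1 byte...
  case :unicode.characters_to_binary(text, :utf8, :latin1) do
    bytes when is_binary(bytes) ->
      # ...and keep the reinterpretation only if it is valid UTF-8.
      if String.valid?(bytes), do: bytes, else: text

    # Codepoints above U+00FF cannot be Latin-1 mojibake; leave as-is.
    _error_or_incomplete ->
      text
  end
end

fix_latin1.("caf\u00C3\u00A9")  #=> "café"
fix_latin1.("hello")            #=> "hello"
```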

Arguments

  • text is the input string.

Returns

  • The repaired string.

Examples

iex> Text.Clean.fix_mojibake("itâ€™s working")
"it’s working"

iex> Text.Clean.fix_mojibake("cafÃ©")
"café"

normalize(text, form \\ :nfc)

@spec normalize(String.t(), :nfc | :nfd | :nfkc | :nfkd) :: String.t()

Applies a Unicode normalization form to the text.
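In plain Elixir the same forms are available through String.normalize/2, which a function like this presumably wraps (an assumption):

```elixir
# NFC composes combining sequences; NFD decomposes them. The
# compatibility forms (NFKC/NFKD) additionally fold compatibility
# characters such as ligatures and full-width letters.
e_decomposed = "e" <> <<0x0301::utf8>>   # "e" + combining acute accent

String.normalize(e_decomposed, :nfc)     #=> "é" (single codepoint U+00E9)
String.normalize("\uFB01", :nfkc)        #=> "fi" (the ﬁ ligature, folded)
```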

Arguments

  • text is the input string.

  • form is :nfc, :nfd, :nfkc, or :nfkd. The default is :nfc.

Returns

  • The text in the requested normalization form.

Examples

iex> e_decomposed = "e" <> <<0x0301::utf8>>
iex> Text.Clean.normalize(e_decomposed) == "é"
true

strip_control(text)

@spec strip_control(String.t()) :: String.t()

Removes control characters except \n, \t, and \r.
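A minimal sketch of this behaviour with a Unicode-aware regex (not the library's own code): match any control (Cc) codepoint except the three whitelisted characters.

```elixir
strip = fn text ->
  # \p{Cc} covers the C0 and C1 control blocks; the negative
  # lookahead spares tab, newline, and carriage return.
  String.replace(text, ~r/(?![\t\n\r])\p{Cc}/u, "")
end

strip.("hello" <> <<7>> <> "world")  #=> "helloworld"
strip.("keep\nnewlines")             #=> "keep\nnewlines"
```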

Arguments

  • text is the input string.

Returns

  • The text with non-printable control characters removed.

Examples

iex> Text.Clean.strip_control("hello\u0007world")
"helloworld"

iex> Text.Clean.strip_control("keep\nnewlines")
"keep\nnewlines"

strip_html(text)

@spec strip_html(String.t()) :: String.t()

Removes HTML/XML tags and decodes common HTML entities.

This is a pragmatic regex-based stripper, not a security-grade HTML parser. Use it for cleaning user input or scraped text, not for sanitizing output to a browser.
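In the same pragmatic spirit, a sketch of how such a stripper can work (illustrative only; the entity table here is deliberately tiny, and this is not the library's code):

```elixir
# Decode &amp; last, so decoding cannot manufacture new entities
# (e.g. "&amp;lt;" must become the text "&lt;", not "<").
entities = [
  {"&lt;", "<"}, {"&gt;", ">"}, {"&quot;", "\""},
  {"&#39;", "'"}, {"&nbsp;", "\u00A0"}, {"&amp;", "&"}
]

strip_html = fn text ->
  stripped = String.replace(text, ~r/<[^>]*>/, "")

  Enum.reduce(entities, stripped, fn {entity, char}, acc ->
    String.replace(acc, entity, char)
  end)
end

strip_html.("<p>Tom &amp; Jerry</p>")  #=> "Tom & Jerry"
```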

Arguments

  • text is a string that may contain HTML/XML tags and entities.

Returns

  • The text with tags removed and entities decoded.

Examples

iex> Text.Clean.strip_html("<p>Hello, <b>world</b>!</p>")
"Hello, world!"

iex> Text.Clean.strip_html("Tom &amp; Jerry &mdash; cats &amp; mice")
"Tom & Jerry — cats & mice"

unaccent(text)

@spec unaccent(String.t()) :: String.t()

Removes diacritics, accents, and other Latin-script decorations from text by transliterating to ASCII.

Delegates to Unicode.Transform.LatinAscii.transform/1 (a CLDR Latin-ASCII transform compiled to pattern-matched function heads for O(1) per-codepoint dispatch).

Unlike a naïve "NFD then strip Mn" recipe, this also handles non-decomposable letters: Þ → TH, ß → ss, Æ → AE, ł → l, đ → d, etc.
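For comparison, that naïve recipe really does stop at decomposable letters (illustrative, not the library's code):

```elixir
# Decompose (NFD), then drop combining marks (general category Mn).
naive_unaccent = fn text ->
  text
  |> String.normalize(:nfd)
  |> String.replace(~r/\p{Mn}/u, "")
end

naive_unaccent.("café")  #=> "cafe" (é decomposes to e + U+0301)
naive_unaccent.("ß")     #=> "ß" (no decomposition, left untouched)
```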

Useful as a preprocessing step before fuzzy matching, search-index insertion, or filename sanitization.

Arguments

  • text is the input string.

Returns

  • The transliterated ASCII string.

Examples

iex> Text.Clean.unaccent("naïve café résumé")
"naive cafe resume"

iex> Text.Clean.unaccent("Þórbergur Þórðarson")
"THorbergur THordarson"

iex> Text.Clean.unaccent("Łódź")
"Lodz"