Text cleanup utilities: HTML stripping, whitespace collapse, Unicode normalization, and mojibake repair.
These are the small, fiddly transforms that every text-processing
pipeline needs but no single library exposes coherently. The
clean/2 function chains them in a sensible default order; the
individual transforms are also exposed so callers can compose
their own pipeline.
What each function does
- strip_html/1: removes HTML/XML tags and decodes the most common HTML entities.
- collapse_whitespace/1: replaces runs of any whitespace (including non-breaking space and other Unicode spaces) with a single ASCII space, and trims the ends.
- strip_control/1: removes ASCII and Unicode control characters except \n, \t, and \r.
- normalize/2: Unicode normalization. Defaults to NFC.
- fix_mojibake/1: repairs the most common mojibake patterns (UTF-8 misinterpreted as Windows-1252 or Latin-1). Inspired by ftfy. Only handles the well-known cases; it is not a complete replacement for ftfy.
- clean/2: applies all of the above in a sensible default order. Steps can be turned off via options.
Summary
Functions
- clean/2: Applies the full cleanup pipeline.
- collapse_whitespace/1: Collapses runs of whitespace to single ASCII spaces and trims.
- fix_mojibake/1: Repairs the most common mojibake patterns.
- normalize/2: Applies a Unicode normalization form to the text.
- strip_control/1: Removes control characters except \n, \t, and \r.
- strip_html/1: Removes HTML/XML tags and decodes common HTML entities.
- unaccent/1: Removes diacritics, accents, and other Latin-script decorations from text by transliterating to ASCII.
Functions
clean/2
Applies the full cleanup pipeline.
Default order: HTML strip → mojibake fix → control-char strip → Unicode normalize (NFC) → whitespace collapse.
Arguments
- text is the input string.
Options
- :strip_html (default true): apply strip_html/1.
- :fix_mojibake (default true): apply fix_mojibake/1.
- :strip_control (default true): apply strip_control/1.
- :normalize (default :nfc): Unicode form to apply, or false to skip.
- :collapse_whitespace (default true): apply collapse_whitespace/1.
- :unaccent (default false): apply unaccent/1 (after the other steps) to fold accented Latin characters to ASCII.
Returns
- The cleaned string.
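One way to picture the option handling is as a chain of steps, each of which is skipped when its option is false. The following is a minimal sketch under stated assumptions, not the implementation of clean/2: the module PipelineSketch, its maybe/3 helper, the option names, and the three simplified stand-in steps are all hypothetical and cover only part of the real pipeline.

```elixir
defmodule PipelineSketch do
  # Hypothetical defaults: every step is on unless the caller disables it.
  @defaults [strip_tags: true, normalize: :nfc, collapse: true]

  def clean(text, opts \\ []) when is_binary(text) do
    opts = Keyword.merge(@defaults, opts)
    form = opts[:normalize]

    text
    |> maybe(opts[:strip_tags], &strip_tags/1)
    |> maybe(form, fn t -> String.normalize(t, form) end)
    |> maybe(opts[:collapse], &collapse/1)
  end

  # A step is skipped when its option is false or nil.
  defp maybe(text, enabled, _fun) when enabled in [false, nil], do: text
  defp maybe(text, _enabled, fun), do: fun.(text)

  # Stand-in steps, much simpler than the real transforms.
  defp strip_tags(text), do: String.replace(text, ~r/<[^>]*>/, "")

  defp collapse(text) do
    text
    |> String.replace(~r/\s+/u, " ")
    |> String.trim()
  end
end
```

Keeping the order fixed inside the function and exposing only on/off switches means callers cannot accidentally run whitespace collapse before tag stripping, which would leave stray spaces where tags used to be.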
Examples
iex> Text.Clean.clean("<p>Hello, <b>world</b>!</p>")
"Hello, world!"
iex> Text.Clean.clean("itâ€™s <em>cool</em>")
"it’s cool"
iex> Text.Clean.clean("CAFÉ <em>RÉSUMÉ</em>", unaccent: true)
"CAFE RESUME"
collapse_whitespace/1
Collapses runs of whitespace to single ASCII spaces and trims.
Recognises Unicode whitespace, not just ASCII — non-breaking spaces and ideographic spaces collapse too.
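A Unicode-aware collapse can be approximated with a single regular expression. This is a rough sketch, not necessarily how collapse_whitespace/1 is written; the module name WhitespaceSketch is hypothetical, and the exact set of characters the real function treats as whitespace is an assumption.

```elixir
defmodule WhitespaceSketch do
  def collapse(text) when is_binary(text) do
    text
    # \p{Z} is listed next to \s so that Unicode separators such as
    # U+00A0 (no-break space) and U+3000 (ideographic space) always match.
    |> String.replace(~r/[\s\p{Z}]+/u, " ")
    |> String.trim()
  end
end
```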
Arguments
- text is the input string.
Returns
- The text with each run of whitespace replaced by a single space and leading/trailing whitespace trimmed.
Examples
iex> Text.Clean.collapse_whitespace(" hello world \n")
"hello world"
iex> Text.Clean.collapse_whitespace("non\u00A0breaking\tspaces")
"non breaking spaces"
fix_mojibake/1
Repairs the most common mojibake patterns.
Mojibake happens when UTF-8 bytes are decoded as Windows-1252 or
Latin-1, producing strings like â€™ (a real ’ U+2019 whose three
UTF-8 bytes were each read as a separate character). This function reverses the common cases.
Only well-known patterns are repaired; for harder cases, use a dedicated tool. The output is unchanged if no patterns match.
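The underlying idea can be shown as a round trip: map each character back to the byte it would have had in Windows-1252 or Latin-1, then accept the result only if those bytes form valid UTF-8. The sketch below illustrates that idea only and is not the implementation of fix_mojibake/1. The module MojibakeSketch is hypothetical, and its Windows-1252 table is deliberately partial (five entries), so it handles far fewer cases than a real repair tool.

```elixir
defmodule MojibakeSketch do
  # Partial, illustrative table: code points that Windows-1252 assigns to
  # bytes in the 0x80..0x9F range. A full table has 27 entries.
  @cp1252 %{0x20AC => 0x80, 0x201C => 0x93, 0x201D => 0x94, 0x2122 => 0x99, 0x0153 => 0x9C}

  def repair(text) when is_binary(text) do
    with true <- String.valid?(text),
         {:ok, bytes} <- to_bytes(String.to_charlist(text), []),
         true <- String.valid?(bytes),
         true <- bytes != text do
      bytes
    else
      # Anything that does not round-trip cleanly is returned unchanged.
      _ -> text
    end
  end

  defp to_bytes([], acc), do: {:ok, acc |> Enum.reverse() |> :erlang.list_to_binary()}

  # Code points up to U+00FF map to the same byte value in Latin-1.
  defp to_bytes([cp | rest], acc) when cp <= 0xFF, do: to_bytes(rest, [cp | acc])

  defp to_bytes([cp | rest], acc) do
    case Map.fetch(@cp1252, cp) do
      {:ok, byte} -> to_bytes(rest, [byte | acc])
      :error -> :error
    end
  end
end
```

Correctly encoded text such as "café" is left alone because its é becomes the lone byte 0xE9, which is not valid UTF-8. The approach is still a heuristic: text that legitimately contains a sequence like Ã followed by © would be rewritten.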
Arguments
- text is the input string.
Returns
- The repaired string.
Examples
iex> Text.Clean.fix_mojibake("itâ€™s working")
"it’s working"
iex> Text.Clean.fix_mojibake("cafÃ©")
"café"
normalize/2
Applies a Unicode normalization form to the text.
Arguments
- text is the input string.
- form is :nfc, :nfd, :nfkc, or :nfkd. The default is :nfc.
Returns
- The text in the requested normalization form.
Examples
iex> e_decomposed = "e" <> <<0x0301::utf8>>
iex> Text.Clean.normalize(e_decomposed) == "é"
true
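To see how the four forms differ, the same atoms can be passed to Elixir's standard String.normalize/2. This snippet uses only the standard library for illustration; whether Text.Clean.normalize/2 delegates to it is an assumption this page does not confirm.

```elixir
decomposed = "e" <> <<0x0301::utf8>>

nfc = String.normalize(decomposed, :nfc)
nfd = String.normalize(nfc, :nfd)

# NFC composes the pair into one code point (U+00E9);
# NFD splits it back into base letter plus combining accent.
IO.inspect({length(String.codepoints(nfc)), length(String.codepoints(nfd))})

# The compatibility forms also fold presentation variants,
# for example the single-character ligature U+FB01 into "f" and "i".
IO.inspect(String.normalize("\uFB01", :nfkc))
```

NFC is a reasonable default for storage and comparison because it keeps text compact without discarding distinctions. The NFKC and NFKD forms are lossy and are better suited to search keys than to text shown to users.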
strip_control/1
Removes control characters except \n, \t, and \r.
Arguments
- text is the input string.
Returns
- The text with non-printable control characters removed.
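A filter of this kind can be sketched as a walk over the code points that drops those in the Unicode Cc category (U+0000 to U+001F and U+007F to U+009F) while keeping the three exceptions. The module ControlSketch is hypothetical, and treating "control characters" as exactly the Cc category is an assumption; the real function may cover more, such as format characters.

```elixir
defmodule ControlSketch do
  def strip(text) when is_binary(text) do
    # Rebuild the string from the code points that pass keep?/1.
    for <<cp::utf8 <- text>>, keep?(cp), into: "", do: <<cp::utf8>>
  end

  # Newline, tab, and carriage return are preserved.
  defp keep?(cp) when cp in [?\n, ?\t, ?\r], do: true
  # C0 controls, DEL, and C1 controls are dropped.
  defp keep?(cp) when cp in 0x00..0x1F or cp in 0x7F..0x9F, do: false
  defp keep?(_cp), do: true
end
```

The sketch assumes valid UTF-8 input; the binary generator stops at the first invalid byte rather than raising.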
Examples
iex> Text.Clean.strip_control("hello\u0007world")
"helloworld"
iex> Text.Clean.strip_control("keep\nnewlines")
"keep\nnewlines"
strip_html/1
Removes HTML/XML tags and decodes common HTML entities.
This is a pragmatic regex-based stripper, not a security-grade HTML parser. Use it for cleaning user input or scraped text, not for sanitizing output to a browser.
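To make "regex-based" concrete, a stripper of this style typically removes anything that looks like a tag and then decodes a fixed table of entities. The sketch below shows that pattern under stated assumptions: the module HtmlSketch, its regexes, and its entity table are illustrative and are not taken from strip_html/1.

```elixir
defmodule HtmlSketch do
  # Illustrative subset of named and numeric entities.
  @entities %{
    "&amp;" => "&",
    "&lt;" => "<",
    "&gt;" => ">",
    "&quot;" => "\"",
    "&#39;" => "'",
    "&nbsp;" => "\u00A0",
    "&mdash;" => "\u2014"
  }

  def strip(text) when is_binary(text) do
    text
    # Tags go first, so a decoded "<" is never mistaken for markup.
    |> String.replace(~r/<[^>]*>/, "")
    # Each entity is decoded once; unknown entities are left as they are.
    |> String.replace(~r/&#?[A-Za-z0-9]+;/, fn entity ->
      Map.get(@entities, entity, entity)
    end)
  end
end
```

The tag pattern is easy to defeat (for instance with a ">" inside an attribute value), which is why an approach like this is unsuitable for sanitizing output sent to a browser.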
Arguments
- text is a string that may contain HTML/XML tags and entities.
Returns
- The text with tags removed and entities decoded.
Examples
iex> Text.Clean.strip_html("<p>Hello, <b>world</b>!</p>")
"Hello, world!"
iex> Text.Clean.strip_html("Tom &amp; Jerry &mdash; cats &amp; mice")
"Tom & Jerry — cats & mice"
unaccent/1
Removes diacritics, accents, and other Latin-script decorations
from text by transliterating to ASCII.
Delegates to Unicode.Transform.LatinAscii.transform/1 (a CLDR
Latin-ASCII transform compiled to pattern-matched function heads
for O(1) per-codepoint dispatch).
Unlike a naïve "NFD then strip Mn" recipe, this also handles
non-decomposable letters: Þ → TH, ß → ss, Æ → AE,
ł → l, đ → d, etc.
Useful as a preprocessing step before fuzzy matching, search-index insertion, or filename sanitization.
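For contrast, the naïve "NFD then strip Mn" recipe mentioned above can be written in two standard-library calls. The module NaiveUnaccent is a hypothetical illustration of that recipe and of its limitation; it is not part of Text.Clean.

```elixir
defmodule NaiveUnaccent do
  def strip_marks(text) when is_binary(text) do
    text
    # Decompose accented letters into base letter plus combining marks.
    |> String.normalize(:nfd)
    # Drop the nonspacing marks (Unicode category Mn).
    |> String.replace(~r/\p{Mn}/u, "")
  end
end
```

This works for letters that decompose, such as é or ź, but letters like ß and Ł have no canonical decomposition and pass through unchanged, so the result is not guaranteed to be ASCII.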
Arguments
- text is the input string.
Returns
- The transliterated ASCII string.
Examples
iex> Text.Clean.unaccent("naïve café résumé")
"naive cafe resume"
iex> Text.Clean.unaccent("Þórbergur Þórðarson")
"THorbergur THordarson"
iex> Text.Clean.unaccent("Łódź")
"Lodz"