# `Text.Clean`
[🔗](https://github.com/kipcole9/text/blob/v0.5.0/lib/clean.ex#L1)

Text cleanup utilities: HTML stripping, whitespace collapse,
Unicode normalization, and mojibake repair.

These are the small, fiddly transforms that every text-processing
pipeline needs but no single library exposes coherently. The
`clean/2` function chains them in a sensible default order; the
individual transforms are also exposed so callers can compose
their own pipeline.

### What each function does

* `strip_html/1` — removes HTML/XML tags and decodes the most
  common HTML entities.

* `collapse_whitespace/1` — replaces runs of any whitespace
  (including non-breaking space and other Unicode spaces) with a
  single ASCII space, and trims the ends.

* `strip_control/1` — removes ASCII and Unicode control
  characters except `\n`, `\t`, and `\r`.

* `normalize/2` — Unicode normalization. Defaults to NFC.

* `fix_mojibake/1` — repairs the most common mojibake patterns
  (UTF-8 misinterpreted as Windows-1252 or Latin-1). Inspired by
  `ftfy`. Only handles the well-known cases — not a complete
  replacement for `ftfy`.

* `clean/2` — applies all of the above in a sensible default
  order. Steps can be turned off via options.

# `clean`

```elixir
@spec clean(
  String.t(),
  keyword()
) :: String.t()
```

Applies the full cleanup pipeline.

Default order: HTML strip → mojibake fix → control-char strip →
Unicode normalize (NFC) → whitespace collapse.
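
With default options this is roughly equivalent to piping the individual
transforms yourself (illustrative; the internal wiring may differ):

```elixir
# Roughly what clean/2 does by default, spelled out as a pipe.
# Order matters: mojibake repair runs before normalization, and
# whitespace collapse runs last so decoded entities (e.g. &nbsp;)
# are collapsed too.
"<p>itâ€™s&nbsp;<b>cool</b></p>"
|> Text.Clean.strip_html()
|> Text.Clean.fix_mojibake()
|> Text.Clean.strip_control()
|> Text.Clean.normalize(:nfc)
|> Text.Clean.collapse_whitespace()
# => "it’s cool"
```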

### Arguments

* `text` is the input string.

### Options

* `:strip_html` (default `true`) — apply `strip_html/1`.

* `:fix_mojibake` (default `true`) — apply `fix_mojibake/1`.

* `:strip_control` (default `true`) — apply `strip_control/1`.

* `:normalize` (default `:nfc`) — Unicode form to apply, or
  `false` to skip.

* `:collapse_whitespace` (default `true`) — apply
  `collapse_whitespace/1`.

* `:unaccent` (default `false`) — apply `unaccent/1` (after the
  other steps) to fold accented Latin characters to ASCII.

### Returns

* The cleaned string.

### Examples

    iex> Text.Clean.clean("<p>Hello,&nbsp;<b>world</b>!</p>")
    "Hello, world!"

    iex> Text.Clean.clean("itâ€™s   <em>cool</em>")
    "it’s cool"

    iex> Text.Clean.clean("CAFÉ <em>RÉSUMÉ</em>", unaccent: true)
    "CAFE RESUME"

# `collapse_whitespace`

```elixir
@spec collapse_whitespace(String.t()) :: String.t()
```

Collapses runs of whitespace to single ASCII spaces and trims.

Recognizes Unicode whitespace, not just ASCII — non-breaking
spaces and ideographic spaces collapse too.
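
A minimal standalone sketch of this behaviour (an assumption about the
implementation, not the library's actual code) is a Unicode-aware regex
replace followed by a trim:

```elixir
# Sketch: with the `u` modifier, PCRE's \s matches Unicode whitespace
# (including U+00A0 NO-BREAK SPACE), not just ASCII.
collapse = fn text ->
  text
  |> String.replace(~r/\s+/u, " ")
  |> String.trim()
end

collapse.("  non\u00A0breaking\tspaces  ")
# => "non breaking spaces"
```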

### Arguments

* `text` is the input string.

### Returns

* The text with each run of whitespace replaced by a single space
  and leading/trailing whitespace trimmed.

### Examples

    iex> Text.Clean.collapse_whitespace("  hello   world  \n")
    "hello world"

    iex> Text.Clean.collapse_whitespace("non\u00A0breaking\tspaces")
    "non breaking spaces"

# `fix_mojibake`

```elixir
@spec fix_mojibake(String.t()) :: String.t()
```

Repairs the most common mojibake patterns.

Mojibake happens when UTF-8 bytes are decoded as Windows-1252 or
Latin-1, producing strings like `â€™` (a real `’` U+2019 read as
three single bytes). This function reverses the common cases.

Only well-known patterns are repaired — for harder cases, use a
dedicated tool. Output is unchanged if no patterns match.
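
The repair can be sketched as mapping each suspect codepoint back to the
Windows-1252 byte it was decoded from, then reinterpreting the byte
string as UTF-8. This is an illustrative recipe, not the library's code:

```elixir
# Windows-1252 codepoints whose bytes fall in 0x80–0x9F (the range where
# Windows-1252 differs from Latin-1); codepoints below 0x100 map to
# themselves, i.e. their Latin-1 byte.
cp1252 = %{
  0x20AC => 0x80, 0x201A => 0x82, 0x0192 => 0x83, 0x201E => 0x84,
  0x2026 => 0x85, 0x2020 => 0x86, 0x2021 => 0x87, 0x02C6 => 0x88,
  0x2030 => 0x89, 0x0160 => 0x8A, 0x2039 => 0x8B, 0x0152 => 0x8C,
  0x2018 => 0x91, 0x2019 => 0x92, 0x201C => 0x93, 0x201D => 0x94,
  0x2022 => 0x95, 0x2013 => 0x96, 0x2014 => 0x97, 0x02DC => 0x98,
  0x2122 => 0x99, 0x0161 => 0x9A, 0x203A => 0x9B, 0x0153 => 0x9C
}

repair = fn text ->
  bytes =
    text
    |> String.to_charlist()
    |> Enum.map(&Map.get(cp1252, &1, &1))

  # Only attempt the repair if every codepoint fits in one byte and the
  # resulting byte string is valid UTF-8; otherwise leave the input alone.
  with true <- Enum.all?(bytes, &(&1 < 0x100)),
       fixed when is_binary(fixed) <-
         :unicode.characters_to_binary(:erlang.list_to_binary(bytes), :utf8) do
    fixed
  else
    _ -> text
  end
end

repair.("itâ€™s working")
# => "it’s working"
```

The fall-through branch gives the documented "unchanged if no patterns
match" behaviour: clean text such as `"café"` produces invalid UTF-8
under the round trip and is returned as-is.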

### Arguments

* `text` is the input string.

### Returns

* The repaired string.

### Examples

    iex> Text.Clean.fix_mojibake("itâ€™s working")
    "it’s working"

    iex> Text.Clean.fix_mojibake("café")
    "café"

# `normalize`

```elixir
@spec normalize(String.t(), :nfc | :nfd | :nfkc | :nfkd) :: String.t()
```

Applies a Unicode normalization form to the text.

### Arguments

* `text` is the input string.

* `form` is `:nfc`, `:nfd`, `:nfkc`, or `:nfkd`. The default is
  `:nfc`.

### Returns

* The text in the requested normalization form.

### Examples

    iex> e_decomposed = "e" <> <<0x0301::utf8>>
    iex> Text.Clean.normalize(e_decomposed) == "é"
    true
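
In Elixir this is most likely a thin wrapper over `String.normalize/2`,
which accepts the same four forms (an assumption, shown for orientation):

```elixir
# NFC composes "e" + U+0301 COMBINING ACUTE into the single
# codepoint U+00E9; NFD decomposes it back.
e_decomposed = "e" <> <<0x0301::utf8>>

String.normalize(e_decomposed, :nfc)
# => "é" (one codepoint)

String.normalize(<<0xE9::utf8>>, :nfd)
# => "é" (two codepoints)
```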

# `strip_control`

```elixir
@spec strip_control(String.t()) :: String.t()
```

Removes control characters except `\n`, `\t`, and `\r`.
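
One way to express this rule (a sketch, not necessarily the library's
implementation) is a regex over the Unicode `Cc` category with the three
whitelisted characters excluded via lookahead:

```elixir
# Sketch: drop Unicode control characters (category Cc) except
# tab, newline, and carriage return.
strip = fn text ->
  String.replace(text, ~r/(?![\t\n\r])\p{Cc}/u, "")
end

strip.("hello\aworld")
# => "helloworld"
```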

### Arguments

* `text` is the input string.

### Returns

* The text with non-printable control characters removed.

### Examples

    iex> Text.Clean.strip_control("hello\u0007world")
    "helloworld"

    iex> Text.Clean.strip_control("keep\nnewlines")
    "keep\nnewlines"

# `strip_html`

```elixir
@spec strip_html(String.t()) :: String.t()
```

Removes HTML/XML tags and decodes common HTML entities.

This is a pragmatic regex-based stripper, not a security-grade
HTML parser. Use it for cleaning user input or scraped text, not
for sanitizing output to a browser.
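
The tag-stripping half can be sketched as a regex plus a small entity
table (illustrative only; a real implementation covers many more named
entities plus `&#...;` numeric forms):

```elixir
# Sketch: strip tags first, then decode a handful of named entities,
# leaving unknown entities untouched.
entities = %{"&amp;" => "&", "&lt;" => "<", "&gt;" => ">",
             "&quot;" => "\"", "&nbsp;" => "\u00A0", "&mdash;" => "—"}

strip_html = fn text ->
  stripped = String.replace(text, ~r/<[^>]*>/, "")
  Regex.replace(~r/&\w+;/, stripped, fn ent -> Map.get(entities, ent, ent) end)
end

strip_html.("<p>Tom &amp; Jerry</p>")
# => "Tom & Jerry"
```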

### Arguments

* `text` is a string that may contain HTML/XML tags and entities.

### Returns

* The text with tags removed and entities decoded.

### Examples

    iex> Text.Clean.strip_html("<p>Hello, <b>world</b>!</p>")
    "Hello, world!"

    iex> Text.Clean.strip_html("Tom &amp; Jerry &mdash; cats &amp; mice")
    "Tom & Jerry — cats & mice"

# `unaccent`

```elixir
@spec unaccent(String.t()) :: String.t()
```

Removes diacritics, accents, and other Latin-script decorations
from `text` by transliterating to ASCII.

Delegates to `Unicode.Transform.LatinAscii.transform/1` (a CLDR
`Latin-ASCII` transform compiled to pattern-matched function heads
for O(1) per-codepoint dispatch).

Unlike a naïve "NFD then strip Mn" recipe, this also handles
non-decomposable letters: `Þ` → `TH`, `ß` → `ss`, `Æ` → `AE`,
`ł` → `l`, `đ` → `d`, etc.
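
For comparison, the naïve recipe looks like this, and shows exactly
where it falls short:

```elixir
# The naïve approach: decompose (NFD), then drop combining marks
# (category Mn). Letters with no canonical decomposition survive intact.
naive = fn text ->
  text
  |> String.normalize(:nfd)
  |> String.replace(~r/\p{Mn}/u, "")
end

naive.("café")   # => "cafe"  (works: é decomposes to e + mark)
naive.("Łódź")   # => "Łodz"  (fails: Ł has no decomposition)
```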

Useful as a preprocessing step before fuzzy matching, search-index
insertion, or filename sanitization.

### Arguments

* `text` is the input string.

### Returns

* The transliterated ASCII string.

### Examples

    iex> Text.Clean.unaccent("naïve café résumé")
    "naive cafe resume"

    iex> Text.Clean.unaccent("Þórbergur Þórðarson")
    "THorbergur THordarson"

    iex> Text.Clean.unaccent("Łódź")
    "Lodz"

---

*Consult [api-reference.md](api-reference.md) for the complete listing.*
