Text.PII (Text v0.5.0)


Pattern-based detection and redaction of personally-identifiable information.

Useful as a sanitisation step before logging text, before sending user input to a third-party service, or before training or evaluating on a corpus that may contain accidental PII. The detectors are plain regular expressions (plus a Luhn check for credit cards), so they are fast, deterministic, and small enough to inspect.

Pattern coverage is conservative: false positives are minimised at the cost of missing unusual formats. For broader recall on names, addresses, and other open-class entities, combine this with Text.NER (which uses a Bumblebee model).

Detected types

Type            What it catches
:email          RFC-5322-ish email addresses.
:phone          International E.164 (+1234567890) and common US/EU dashed/parens forms with 7+ digits.
:credit_card    13–19 digit sequences that pass the Luhn check.
:iban           IBAN format (country code + 2 check digits + up to 30 alphanumerics).
:ssn            US Social Security numbers NNN-NN-NNNN.
:ipv4           Dotted-quad IP addresses with octets 0–255.
:ipv6           IPv6 addresses (full and compressed forms).
:url            http(s):// URLs.
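The Luhn check mentioned for :credit_card can be illustrated with a minimal sketch. LuhnSketch is a hypothetical module written for this example; it is not the code Text.PII uses, and ignoring spaces and dashes is an assumption made here, not documented behaviour.

```elixir
defmodule LuhnSketch do
  @moduledoc "Hypothetical helper for illustration; not part of Text.PII."

  # Returns true when `candidate` holds 13..19 digits that satisfy the
  # Luhn checksum. Spaces and dashes are ignored (an assumption).
  def valid?(candidate) when is_binary(candidate) do
    digits =
      candidate
      |> String.replace(~r/[\s-]/, "")
      |> String.to_charlist()

    length(digits) in 13..19 and
      Enum.all?(digits, &(&1 in ?0..?9)) and
      checksum(digits) == 0
  end

  # Walk from the rightmost digit, doubling every second digit and
  # subtracting 9 whenever the doubled value exceeds 9.
  defp checksum(digits) do
    digits
    |> Enum.reverse()
    |> Enum.with_index()
    |> Enum.reduce(0, fn {char, index}, acc ->
      digit = char - ?0
      doubled = if rem(index, 2) == 1, do: digit * 2, else: digit
      acc + if(doubled > 9, do: doubled - 9, else: doubled)
    end)
    |> rem(10)
  end
end
```

A digit run of the right length that fails this checksum (for example a 16-digit order number) would not be reported as a credit card, which is how the check keeps false positives down.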

Summary

Functions

detect(text, options \\ [])
Detects PII matches in the text.

redact(text, options \\ [])
Replaces every detected PII match with a redaction placeholder.

types()
Returns the list of detector type atoms supported by this module.

Functions

detect(text, options \\ [])

@spec detect(
  String.t(),
  keyword()
) :: [
  %{
    type: atom(),
    value: String.t(),
    start: non_neg_integer(),
    length: pos_integer()
  }
]

Detects PII matches in the text.

Arguments

  • text is the input string.

Options

  • :types is the list of detector types to run. Default is all types from types/0. Pass [:email, :phone] to limit detection.

Returns

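  • A list of maps %{type: atom(), value: String.t(), start: non_neg_integer(), length: pos_integer()} sorted by :start. The :start is a byte offset, so it is suitable for binary_part/3 rather than String.slice/3, which counts graphemes and returns a different span when multi-byte characters precede the match. Credit-card matches are filtered to only those that pass the Luhn check.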
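Because :start is a byte offset, it matters how a match is cut back out of text that contains multi-byte characters. The sketch below does not call Text.PII; it uses Regex.run/3 with return: :index, which also reports byte offsets, and a deliberately simplified email pattern chosen only for this illustration.

```elixir
text = "café: alice@example.com"

# return: :index yields {byte_offset, byte_length} for the match.
# The pattern is a simplified stand-in, not the one Text.PII uses.
[{start, len}] = Regex.run(~r/[\w.+-]+@[\w.-]+\.\w+/, text, return: :index)

# "é" takes two bytes in UTF-8, so the byte offset (7) is one more than
# the grapheme offset (6) of the address.
by_bytes = binary_part(text, start, len)
by_graphemes = String.slice(text, start, len)

IO.inspect(by_bytes)      # "alice@example.com"
IO.inspect(by_graphemes)  # "lice@example.com"
```

binary_part/3 takes byte offsets and recovers the full address, while String.slice/3 counts graphemes and starts one character too late.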

Examples

iex> [m] = Text.PII.detect("contact me at alice@example.com please")
iex> {m.type, m.value}
{:email, "alice@example.com"}

iex> Text.PII.detect("nothing here")
[]

redact(text, options \\ [])

@spec redact(
  String.t(),
  keyword()
) :: String.t()

Replaces every detected PII match with a redaction placeholder.

Arguments

  • text is the input string.

Options

  • :types — same as detect/2.

  • :placeholder — either a string (used for every match) or a function (type :: atom -> String.t()) returning the placeholder for each match type. The default is fn type -> "[" <> String.upcase(to_string(type)) <> "]" end.
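As a rough illustration of the function form of :placeholder, the sketch below defines a per-type policy. PlaceholderSketch and its replacement strings are hypothetical; only the default/1 body is taken from the documented default.

```elixir
defmodule PlaceholderSketch do
  @moduledoc "Hypothetical placeholder policies for illustration."

  # Same expression as the documented default: :credit_card -> "[CREDIT_CARD]".
  def default(type), do: "[" <> String.upcase(to_string(type)) <> "]"

  # A custom policy: friendlier text for two types, default for the rest.
  def custom(:email), do: "<email removed>"
  def custom(:url), do: "<link removed>"
  def custom(type), do: default(type)
end
```

Since the option accepts any function from a type atom to a string, a capture such as &PlaceholderSketch.custom/1 could be passed as placeholder: to redact/2.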

Returns

  • The text with every detected match replaced by the configured placeholder. If matches overlap, the earlier-starting match wins.
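One way the "earlier-starting match wins" rule could work is sketched below. RedactSketch is hypothetical and is not the library's implementation; it assumes matches in the shape returned by detect/2 and assumes that :length, like :start, is measured in bytes.

```elixir
defmodule RedactSketch do
  @moduledoc "Hypothetical overlap-resolving splice; not Text.PII's code."

  # `matches` are maps with :type, :start and :length (bytes, by assumption).
  def apply_matches(text, matches, placeholder_fun) do
    {pieces, cursor} =
      matches
      |> Enum.sort_by(& &1.start)
      |> Enum.reduce({[], 0}, fn match, {acc, cursor} ->
        if match.start < cursor do
          # Starts inside an already-redacted span: the earlier match wins.
          {acc, cursor}
        else
          untouched = binary_part(text, cursor, match.start - cursor)
          {[placeholder_fun.(match.type), untouched | acc], match.start + match.length}
        end
      end)

    # Append whatever follows the last redacted span.
    tail = binary_part(text, cursor, byte_size(text) - cursor)
    IO.iodata_to_binary(Enum.reverse([tail | pieces]))
  end
end

text = "see http://10.0.0.1/x now"

# Hand-written matches: the IPv4 address lies inside the URL.
matches = [
  %{type: :ipv4, value: "10.0.0.1", start: 11, length: 8},
  %{type: :url, value: "http://10.0.0.1/x", start: 4, length: 17}
]

result =
  RedactSketch.apply_matches(text, matches, fn type ->
    "[" <> String.upcase(to_string(type)) <> "]"
  end)

IO.inspect(result)  # "see [URL] now"
```

The URL starts first, so it is replaced and the nested IPv4 match is dropped instead of producing a second placeholder.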

Examples

iex> Text.PII.redact("email me at alice@example.com")
"email me at [EMAIL]"

iex> Text.PII.redact("phone +1-555-123-4567 email alice@x.io",
...>   placeholder: fn _ -> "***" end)
"phone *** email ***"

types()

@spec types() :: [atom()]

Returns the list of detector type atoms supported by this module.