Extract URLs and email addresses from arbitrary text at social-media quality.
The extractor follows the rules that twitter-text uses (and which
Slack, Mastodon, and most "auto-link" implementations imitate),
layered with full UTS #46 IDNA processing for internationalised
domain names via the :unicode_idna package.
Pipeline
1. Scan: Text.Extract.Scanner performs one linear pass over the text and emits candidate spans. Boundary rules (twitter-text §3) reject candidates immediately preceded by $, _, @, #, -, ., or alphanumerics.
2. Validate: each candidate goes through Text.Extract.Url (or Text.Extract.Email once that ships) for:
   - RFC 3986 / RFC 5322 structural parsing.
   - UTS #46 ToASCII for every host label (rejects DISALLOWED codepoints, invalid hyphen positions, oversized labels, …).
   - Twitter-style host-label rules: no leading/trailing - or _, and underscores are forbidden in the registrable domain and TLD labels.
   - TLD lookup against Text.Extract.Tld's bundled IANA list.
3. Boundary cleanup: Text.Extract.Boundary shrinks the span to drop trailing punctuation and unbalanced brackets without losing legitimate inner punctuation (Wikipedia-style URLs with parentheses are preserved).
Each result is a map with the original Unicode form, the all-ASCII Punycode form, byte offsets into the source, and the parsed RFC 3986 components.
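The boundary rules in the Scan and Boundary cleanup steps can be illustrated with a small standalone sketch. It is not the Text.Extract.Scanner or Text.Extract.Boundary source: the module name, the helper names, and the exact set of trailing punctuation are assumptions made for illustration, and only the closing parenthesis is treated as a bracket.

```elixir
defmodule BoundarySketch do
  # Illustrative only: hypothetical helpers, not the Text.Extract source.

  # Punctuation dropped from the end of a candidate (assumed set).
  @trailing [".", ",", ";", ":", "!", "?"]

  # Scan step: a candidate starting at byte `offset` is rejected when the
  # character just before it is $, _, @, #, -, ., or alphanumeric.
  def start_ok?(_text, 0), do: true

  def start_ok?(text, offset) do
    prefix = binary_part(text, 0, offset)
    not Regex.match?(~r/[$_@#\-.\p{L}\p{N}]\z/u, prefix)
  end

  # Boundary cleanup: drop trailing punctuation, and drop a trailing ")"
  # only when the candidate has more ")" than "(".
  def trim(candidate) do
    cond do
      candidate == "" ->
        ""

      String.ends_with?(candidate, @trailing) ->
        candidate |> drop_last_byte() |> trim()

      String.ends_with?(candidate, ")") and unbalanced?(candidate) ->
        candidate |> drop_last_byte() |> trim()

      true ->
        candidate
    end
  end

  # All characters removed here are single-byte ASCII.
  defp drop_last_byte(s), do: binary_part(s, 0, byte_size(s) - 1)

  defp unbalanced?(s), do: count(s, ")") > count(s, "(")

  defp count(s, char), do: s |> String.graphemes() |> Enum.count(&(&1 == char))
end
```

Under these assumptions, BoundarySketch.trim/1 keeps the closing parenthesis of http://en.wikipedia.org/wiki/URI_(disambiguation). while dropping the final full stop, and BoundarySketch.start_ok?/2 rejects a candidate that directly follows an @.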
Phoenix integration
Text.Extract.autolink/2 returns Phoenix.HTML.safe() (a
{:safe, iodata} tuple), so it drops directly into a Phoenix
template:
<%= Text.Extract.autolink(@user_post) %>
No raw/1 is needed: Phoenix knows the iodata is already escaped
appropriately. For non-Phoenix callers, convert with
Phoenix.HTML.safe_to_string/1.
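For callers that do not want to depend on Phoenix at all, the {:safe, iodata} tuple can also be flattened with the standard library. The sketch below assumes only the tuple shape described above; the module name is hypothetical and the sample tuple is hand-built rather than produced by autolink/2.

```elixir
defmodule SafeTupleSketch do
  # Flattens a {:safe, iodata} tuple into a binary using only the
  # standard library. The content is assumed to be escaped already.
  def to_binary({:safe, iodata}), do: IO.iodata_to_binary(iodata)
end

# Hand-built stand-in for the kind of value autolink/2 returns.
safe =
  {:safe,
   ["Visit ", ["<a href=\"", "https://example.com", "\">", "https://example.com", "</a>"], " today."]}

IO.puts(SafeTupleSketch.to_binary(safe))
```

This should be equivalent in effect to Phoenix.HTML.safe_to_string/1 for tuples of this shape; it performs no escaping of its own.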
Examples
iex> Text.Extract.urls("see http://example.com today") |> Enum.map(& &1.url)
["http://example.com"]
iex> Text.Extract.urls("foo.com bar.net baz.org") |> Enum.map(& &1.url)
["foo.com", "bar.net", "baz.org"]
iex> Text.Extract.urls("see http://en.wikipedia.org/wiki/URI_(disambiguation).")
...> |> Enum.map(& &1.url)
["http://en.wikipedia.org/wiki/URI_(disambiguation)"]
Functions
Extracts both URLs and email addresses from text, interleaved in
document order.
Where a candidate matches both — e.g. mailto:alice@example.com —
the email wins (it's the canonical interpretation).
Arguments
- text is a UTF-8 string.
Options
- Same options as urls/2 and emails/2 (:require_scheme, :schemes, :tld_mode, :eai, :strict_idn, :twitter_quirks).
Returns
- A list of records, each with a :kind field (:url or :email) plus the kind-specific fields, sorted by :span start position.
Examples
iex> [url, email] = Text.Extract.all("Visit https://example.com or email alice@example.com.")
iex> {url.kind, url.url, email.kind, email.email}
{:url, "https://example.com", :email, "alice@example.com"}
iex> Text.Extract.all("no links or emails here")
[]
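The interleaving and the "email wins" rule can be pictured as a merge of two record lists. The sketch below is an assumption about how such a merge could work, not the library's implementation: the module name is hypothetical, the records are hand-built, spans are assumed to be {byte_offset, byte_length}, and a URL is dropped whenever its span overlaps an email's span.

```elixir
defmodule MergeSketch do
  # Merges URL and email records into document order. A URL whose span
  # overlaps an email span is dropped, so the email interpretation wins.
  def merge(urls, emails) do
    kept_urls =
      Enum.reject(urls, fn url ->
        Enum.any?(emails, fn email -> overlap?(url.span, email.span) end)
      end)

    Enum.sort_by(kept_urls ++ emails, fn %{span: {start, _len}} -> start end)
  end

  # Spans are assumed to be {byte_offset, byte_length}.
  defp overlap?({s1, l1}, {s2, l2}), do: s1 < s2 + l2 and s2 < s1 + l1
end

urls = [
  %{kind: :url, url: "https://example.com", span: {6, 19}},
  %{kind: :url, url: "mailto:alice@example.com", span: {40, 24}}
]

emails = [%{kind: :email, email: "alice@example.com", span: {47, 17}}]

MergeSketch.merge(urls, emails) |> Enum.map(& &1.kind) |> IO.inspect()
```

With these hand-built records the mailto: candidate is discarded in favour of the email record that it contains.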
@spec autolink(String.t(), keyword()) :: Phoenix.HTML.safe()
Wraps URLs and email addresses in text with HTML <a> anchors.
Plain-text segments are HTML-escaped. Anchor display text is the
original matched form (so bücher.de displays as bücher.de,
not as its Punycode form xn--bcher-kva.de); the href uses the ASCII
Punycode form for URLs, and mailto: plus the ASCII form for emails.
Browsers resolve IDN links through the same ASCII form, so the
href points at the host the user sees.
For schemeless URLs (example.com) the href gets an explicit
scheme, https by default; change it with :href_scheme.
Arguments
- text is a UTF-8 string.
Options
All all/2 options pass through, plus:
- :href_scheme - :https (default) or :http. Used to construct the href for schemeless URL matches like example.com.
- :url_attrs - extra HTML attributes (a keyword list) added to every URL anchor. Defaults to []. Common choices: [target: "_blank", rel: "noopener noreferrer nofollow"].
- :email_attrs - extra HTML attributes added to every email anchor.
- :url_renderer - (entity -> iodata) callback that fully overrides URL rendering. When supplied, :url_attrs and :href_scheme are ignored for URL output.
- :email_renderer - (entity -> iodata) callback that fully overrides email rendering.
Returns
- Phoenix.HTML.safe(), i.e. {:safe, iodata}. Drop the result straight into a Phoenix template (<%= autolink(text) %>) and Phoenix renders it without re-escaping. For non-Phoenix callers, convert with Phoenix.HTML.safe_to_string/1.
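A :url_renderer callback is a function from an entity map to iodata. The sketch below shows one possible shape for such a callback. It assumes the entity carries the :url and :ascii fields listed for URL records, that the match already includes a scheme, and that the callback's output is inserted verbatim and must therefore be escaped by the callback itself; none of this is confirmed by the documentation above. The module name, the CSS class, and the escape helper are hypothetical.

```elixir
defmodule RendererSketch do
  # Hypothetical :url_renderer callback: entity map in, iodata out.
  # Display text is the original form, the href is the ASCII form.
  def render_url(%{url: display, ascii: ascii}) do
    [
      ~s(<a class="ext" rel="nofollow" href="),
      escape(ascii),
      ~s(">),
      escape(display),
      "</a>"
    ]
  end

  # Minimal HTML escaping for text and attribute values.
  # "&" is replaced first so later entities are not double-escaped.
  defp escape(text) do
    text
    |> String.replace("&", "&amp;")
    |> String.replace("<", "&lt;")
    |> String.replace(">", "&gt;")
    |> String.replace("\"", "&quot;")
  end
end

entity = %{url: "https://bücher.de/?a=1&b=2", ascii: "https://xn--bcher-kva.de/?a=1&b=2"}
entity |> RendererSketch.render_url() |> IO.iodata_to_binary() |> IO.puts()
```

If the callback contract matches these assumptions, the function could be passed as url_renderer: &RendererSketch.render_url/1. A schemeless match such as example.com would need its own scheme handling, since :href_scheme is ignored once a renderer is supplied.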
Examples
iex> "Visit https://example.com today."
...> |> Text.Extract.autolink()
...> |> Phoenix.HTML.safe_to_string()
~s|Visit <a href="https://example.com">https://example.com</a> today.|
iex> "Email alice@example.com please."
...> |> Text.Extract.autolink()
...> |> Phoenix.HTML.safe_to_string()
~s|Email <a href="mailto:alice@example.com">alice@example.com</a> please.|
iex> "foo.com is a site."
...> |> Text.Extract.autolink()
...> |> Phoenix.HTML.safe_to_string()
...> |> String.contains?(~s|href="https://foo.com"|)
true
iex> "plain text & symbols < >"
...> |> Text.Extract.autolink()
...> |> Phoenix.HTML.safe_to_string()
"plain text &amp; symbols &lt; &gt;"
@spec emails(String.t(), keyword()) :: [Text.Extract.Email.email_record()]
Extracts email addresses from text.
Arguments
- text is a UTF-8 string.
Options
- :eai - when true, allow non-ASCII codepoints in the local part per RFC 6531 SMTPUTF8 (e.g. 用户@example.com). Default true.
- :tld_mode - :iana (default) or :any.
- :strict_idn - when true, IDNA uses STD3 ASCII rules. Default false.
Returns
- A list of email records in document order. Each record has :email, :ascii, :span, :local, :host, :ascii_host.
Examples
iex> [r] = Text.Extract.emails("Contact alice@example.com today.")
iex> {r.email, r.local, r.host}
{"alice@example.com", "alice", "example.com"}
iex> Text.Extract.emails("no email here") |> length()
0
Splits text into a list of plain-string fragments and entity maps,
preserving document order.
This is the primitive Text.Extract.autolink/2 is built on, and the
building block users typically want when rendering extracted text:
walk the list, leave strings alone, render entity maps however you
like (anchors, mentions, badges, link previews, …).
Adjacent fragments and entities concatenate to the original text
byte-for-byte, so Enum.map_join(split, &render/1) with a renderer
that returns strings unchanged and each entity's original matched
text reproduces the input exactly.
Arguments
- text is a UTF-8 string.
Options
- Same options as all/2 (:require_scheme, :schemes, :tld_mode, :eai, :strict_idn, :twitter_quirks).
Returns
- A list whose elements are either String.t() (plain text segments) or entity maps (%{kind: :url, …} / %{kind: :email, …}). Empty string segments are omitted, so two adjacent entities appear with no separator between them.
Examples
iex> Text.Extract.split("Visit https://example.com today.")
...> |> Enum.map(fn
...>   text when is_binary(text) -> {:text, text}
...>   entity -> {entity.kind, entity.url || entity.email}
...> end)
[{:text, "Visit "}, {:url, "https://example.com"}, {:text, " today."}]
iex> Text.Extract.split("plain text only")
["plain text only"]
iex> Text.Extract.split("")
[]
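Walking a split/2-style list with a custom renderer might look like the sketch below. The list is hand-built to mimic the documented return shape, with only the :kind, :url, and :email fields filled in; the module and function names are hypothetical.

```elixir
defmodule SplitRenderSketch do
  # Renders a split/2-style list: plain strings pass through untouched,
  # entity maps are handed to the `render_entity` callback.
  def render(parts, render_entity) do
    Enum.map_join(parts, fn
      text when is_binary(text) -> text
      entity -> render_entity.(entity)
    end)
  end

  # Renders an entity as its original matched text, which makes
  # render/2 reproduce the source string.
  def original(%{kind: :url, url: url}), do: url
  def original(%{kind: :email, email: email}), do: email
end

parts = [
  "Visit ",
  %{kind: :url, url: "https://example.com"},
  " or mail ",
  %{kind: :email, email: "alice@example.com"},
  "."
]

# Round trip back to the source text.
IO.puts(SplitRenderSketch.render(parts, &SplitRenderSketch.original/1))

# A different renderer: replace each entity with a badge.
IO.puts(SplitRenderSketch.render(parts, fn entity -> "[#{entity.kind}]" end))
```

Keeping the renderer as a plain function argument means the same walk can produce anchors, badges, or link previews without touching the traversal code.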
@spec urls(String.t(), keyword()) :: [Text.Extract.Url.url_record()]
Extracts URLs from text.
Arguments
- text is a UTF-8 string.
Options
- :require_scheme - when true, only scheme://… URLs validate; schemeless candidates like example.com are rejected. Default false (matches Twitter / auto-linking behaviour).
- :schemes - allowlist of accepted schemes. Default ["http", "https", "ftp", "ftps"].
- :tld_mode - :iana (default) or :any. See Text.Extract.Tld.
- :strict_idn - when true, IDNA uses STD3 ASCII rules (rejects _ in labels). Default false.
Returns
- A list of URL records in document order. Each record has :url, :ascii, :span, :scheme, :userinfo, :host, :ascii_host, :port, :path, :query, :fragment.
Examples
iex> [r] = Text.Extract.urls("see http://example.com today")
iex> {r.url, r.host, r.span}
{"http://example.com", "example.com", {4, 18}}
iex> Text.Extract.urls("nothing here") |> length()
0
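The :span in the first example, {4, 18}, is consistent with a {byte_offset, byte_length} pair: the URL starts at byte 4 and is 18 bytes long. That reading is inferred from the example rather than stated in the documentation. Under that assumption, a span can be turned back into the matched text with binary_part/3, as in this sketch (the module name is hypothetical).

```elixir
defmodule SpanSketch do
  # Slices the matched text out of the source, assuming the span is
  # {byte_offset, byte_length} as the doctest above suggests.
  def slice(text, {offset, len}), do: binary_part(text, offset, len)
end

IO.puts(SpanSketch.slice("see http://example.com today", {4, 18}))

# Byte offsets differ from character offsets once the text before the
# match contains multi-byte characters: "naïve " is 6 characters but 7 bytes.
IO.puts(SpanSketch.slice("naïve http://example.com", {7, 18}))
```

Because the offsets are in bytes, character-based functions such as String.slice/3 would select the wrong range whenever non-ASCII text precedes the match.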