Phase 1 of the URL / email extraction pipeline: find candidate spans.
The scanner is intentionally permissive. It identifies anything that
could be a URL or email — full structural validation, IDNA mapping,
TLD lookup and boundary cleanup happen later in
Text.Extract.Url, Text.Extract.Email, and
Text.Extract.Boundary.
Boundary rules
Following twitter-text §3, a candidate's first character must be preceded by one of:
- Beginning of the text.
- Whitespace (any Unicode Zs/Zl/Zp).
- A CJK character or other "word break" character.
- One of the safe punctuation chars: ( [ { < > " ' , ; : ! ?
Candidates immediately preceded by $, _, alphanumerics, @,
#, -, . (or the same set on the other side of a CJK char,
when the previous grapheme is itself part of a token) are rejected.
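The preceding-character rule can be sketched as a small predicate. This is an illustrative approximation, not the scanner's actual code: the module name BoundarySketch and both function names are hypothetical, only the three named whitespace categories are checked, "CJK" is approximated by four Unicode scripts, and the other "word break" characters and the CJK carve-out in the rejection rule are not modeled.

```elixir
defmodule BoundarySketch do
  # Hypothetical helper, not part of Text.Extract.
  # Safe punctuation taken from the boundary rules above.
  @safe_punct ["(", "[", "{", "<", ">", "\"", "'", ",", ";", ":", "!", "?"]

  # A candidate at byte offset 0 sits at the beginning of the text.
  def allowed_start?(text, 0) when is_binary(text), do: true

  # Otherwise inspect the grapheme that ends just before `offset`.
  # Assumes `offset` falls on a character boundary.
  def allowed_start?(text, offset)
      when is_binary(text) and offset > 0 and offset <= byte_size(text) do
    text
    |> binary_part(0, offset)
    |> String.last()
    |> allowed_before?()
  end

  def allowed_before?(char) when is_binary(char) do
    cond do
      char in @safe_punct ->
        true

      # Unicode separator categories Zs, Zl and Zp.
      Regex.match?(~r/\A[\p{Zs}\p{Zl}\p{Zp}]\z/u, char) ->
        true

      # Assumption: CJK approximated by these four scripts.
      Regex.match?(~r/\A[\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}]\z/u, char) ->
        true

      # Everything else, including $ _ @ # - . and alphanumerics, is rejected.
      true ->
        false
    end
  end
end
```

Under these assumptions, BoundarySketch.allowed_start?("(http://example.com)", 1) is true because the candidate follows "(", while BoundarySketch.allowed_start?("$http://example.com", 1) is false because it follows "$".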
Output
scan/1 returns an interleaved list of {:text, fragment} elements and
{kind, candidate, {start_byte, length_bytes}} candidates in source order.
kind is :url or :email.
Types
Element of the scanner's interleaved output.
@type span() :: {non_neg_integer(), non_neg_integer()}
Byte offsets {start, length} into the source text.
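Because spans are byte offsets rather than grapheme indices, they should be resolved with binary_part/3 and not String.slice/3. A minimal sketch follows; the module name SpanSketch and the span in the usage note are hand-written for illustration and are not taken from scanner output.

```elixir
defmodule SpanSketch do
  # Hypothetical helper: resolves a {start, length} byte span against `text`.
  def slice(text, {start, len}) when is_binary(text) do
    binary_part(text, start, len)
  end
end

# "é" occupies 2 bytes in UTF-8, so the URL begins at byte 6
# even though it is preceded by only 5 graphemes.
text = "café http://example.com"
"http://example.com" = SpanSketch.slice(text, {6, 18})

# Grapheme-based slicing with the same numbers lands one character late.
"ttp://example.com" = String.slice(text, 6, 18)
```

The two results differ only because of the multi-byte "é"; on pure ASCII text, byte offsets and grapheme indices coincide.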
Functions
Walks text once and returns an interleaved list of plain-text
fragments and URL / email candidates, preserving document order.
Concatenating every element's content reproduces text byte-for-byte:
text == scan(text) |> Enum.map_join(&content/1)
This shape is the building block for everything else in
Text.Extract. Text.Extract.urls/2 filters for :url and
validates; Text.Extract.split/2 validates each candidate and
promotes failures back to :text; Text.Extract.autolink/2
renders the result.
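The round-trip property can be illustrated without calling the scanner by working on a literal element list copied from the Examples section. The module RoundTripSketch and its functions are hypothetical; in particular, content/1 here is only a plausible reading of the helper named in the expression above, and url_candidates/1 does no validation, unlike Text.Extract.urls/2.

```elixir
defmodule RoundTripSketch do
  # Assumed shape of the `content/1` helper: the textual payload of an element.
  def content({:text, fragment}), do: fragment
  def content({kind, candidate, {_start, _len}}) when kind in [:url, :email], do: candidate

  # Concatenates every element's content, in order.
  def rebuild(elements), do: Enum.map_join(elements, &content/1)

  # Keeps only the URL-shaped candidates, without validating them.
  def url_candidates(elements) do
    for {:url, candidate, _span} <- elements, do: candidate
  end
end

# Element list copied from the first documented example of scan/1.
elements = [{:text, "see "}, {:url, "http://example.com", {4, 18}}, {:text, " today"}]

"see http://example.com today" = RoundTripSketch.rebuild(elements)
["http://example.com"] = RoundTripSketch.url_candidates(elements)
```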
Arguments
- text is a UTF-8 string.
Returns
- A list of elements:
  - {:text, fragment}: a span of text containing no candidate.
  - {:url, candidate, {start, length}}: a URL-shaped candidate. candidate is the substring; start/length are byte offsets back into text.
  - {:email, candidate, {start, length}}: an email-shaped candidate.
Where a URL match is wholly contained inside an email match, the
email wins and the URL is dropped (which is the canonical
interpretation: mailto: aside, an email is never a URL).
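One way to express the "email wins" rule is a containment check over spans. The sketch below assumes a hypothetical intermediate list of raw matches in which a URL matcher has also fired on the domain part of an address; the module name OverlapSketch and the shape of that intermediate list are assumptions, not the scanner's documented internals.

```elixir
defmodule OverlapSketch do
  # True when the `inner` span lies wholly inside the `outer` span.
  def contained?({s1, l1}, {s2, l2}), do: s1 >= s2 and s1 + l1 <= s2 + l2

  # Drops every URL candidate whose span is wholly contained in an email span.
  def drop_urls_inside_emails(candidates) do
    email_spans = for {:email, _candidate, span} <- candidates, do: span

    Enum.reject(candidates, fn
      {:url, _candidate, span} -> Enum.any?(email_spans, &contained?(span, &1))
      _other -> false
    end)
  end
end

# Hand-written raw matches for "alice@example.com": the bare domain
# at bytes 6..16 sits inside the email match at bytes 0..16.
raw = [{:url, "example.com", {6, 11}}, {:email, "alice@example.com", {0, 17}}]

[{:email, "alice@example.com", {0, 17}}] = OverlapSketch.drop_urls_inside_emails(raw)
```

A URL that merely overlaps an email span, or that lies outside it, is kept by this check; only full containment triggers the drop.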
Examples
iex> Text.Extract.Scanner.scan("see http://example.com today")
[{:text, "see "}, {:url, "http://example.com", {4, 18}}, {:text, " today"}]
iex> Text.Extract.Scanner.scan("alice@example.com")
[{:email, "alice@example.com", {0, 17}}]
iex> Text.Extract.Scanner.scan("hello world")
[{:text, "hello world"}]
iex> Text.Extract.Scanner.scan("$invalid http://example.com")
[{:text, "$invalid "}, {:url, "http://example.com", {9, 18}}]