Text.Extract.Scanner (Text v0.6.1)

Phase 1 of the URL / email extraction pipeline: find candidate spans.

The scanner is intentionally permissive. It identifies anything that could be a URL or email — full structural validation, IDNA mapping, TLD lookup and boundary cleanup happen later in Text.Extract.Url, Text.Extract.Email, and Text.Extract.Boundary.

Boundary rules

Following twitter-text §3, a candidate's first character must be preceded by one of:

  • Beginning of the text.

  • Whitespace (any Unicode Zs/Zl/Zp, \t, \n, \r).

  • A CJK character or other "word break" character.

  • One of the safe punctuation chars: (, [, {, <, >, ", ', ,, ;, :, !, ?.

A candidate is rejected when it is immediately preceded by $, _, an alphanumeric, @, #, -, or . (or when one of those characters sits on the other side of a CJK character and the previous grapheme is itself part of a token).
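The preceding-character rule can be sketched as a small predicate. This is a simplified illustration, not the scanner's code: BoundarySketch and allowed_start?/2 are hypothetical names, tab, newline and carriage return are assumed to count as whitespace, and "CJK or other word break character" is approximated by the Han script only.

```elixir
defmodule BoundarySketch do
  # Hypothetical helper, not part of Text.Extract.Scanner.
  @safe_punctuation ["(", "[", "{", "<", ">", "\"", "'", ",", ";", ":", "!", "?"]

  # `start` is the candidate's byte offset into `text`.
  # A candidate at the beginning of the text is always allowed.
  def allowed_start?(_text, 0), do: true

  def allowed_start?(text, start) when is_binary(text) and start > 0 do
    # Grapheme immediately before the candidate (assumes `start` falls on
    # a character boundary).
    previous =
      text
      |> binary_part(0, start)
      |> String.last()

    previous in @safe_punctuation or
      # Assumption: \t, \n and \r are treated as whitespace alongside Z*.
      String.match?(previous, ~r/^[\p{Zs}\p{Zl}\p{Zp}\t\n\r]$/u) or
      # Simplification: only Han characters stand in for "CJK / word break".
      String.match?(previous, ~r/^\p{Han}$/u)
  end
end
```

With this sketch, a URL starting at byte 4 of "see http://example.com" is accepted (preceded by a space), while one starting at byte 1 of "$http://example.com" is rejected.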

Output

scan/1 returns a list of elements in source order. Plain text appears as {:text, fragment}; each candidate appears as {kind, candidate, {start_byte, length_bytes}}, where kind is :url or :email.

Types

element()

@type element() ::
  {:text, String.t()}
  | {:url, String.t(), span()}
  | {:email, String.t(), span()}

Element of the scanner's interleaved output.

span()

@type span() :: {non_neg_integer(), non_neg_integer()}

Byte offsets {start, length} into the source text.
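Because a span counts bytes rather than graphemes, it should be applied with binary_part/3, not String.slice/3. The sketch below illustrates the difference on a string with a multi-byte character; SpanSketch is a hypothetical helper, and the span {6, 18} is what a byte-based scanner would be expected to report for this input, not captured output.

```elixir
defmodule SpanSketch do
  # Hypothetical helper: recover the substring a span() points at.
  def slice(text, {start, len}) when is_binary(text) do
    binary_part(text, start, len)
  end
end

# "café " occupies 6 bytes (é is 2 bytes in UTF-8) but only 5 graphemes.
text = "café http://example.com"

SpanSketch.slice(text, {6, 18})
# => "http://example.com"

String.slice(text, 6, 18)
# => "ttp://example.com" (grapheme indexing starts one character too late)
```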

Functions

scan(text)

@spec scan(String.t()) :: [element()]

Walks text once and returns an interleaved list of plain-text fragments and URL / email candidates, preserving document order.

Concatenating every element's content reproduces text byte-for-byte:

text == text |> Text.Extract.Scanner.scan() |> Enum.map_join(&elem(&1, 1))

This shape is the building block for everything else in Text.Extract. Text.Extract.urls/2 filters for :url and validates; Text.Extract.split/2 validates each candidate and demotes failures back to :text; Text.Extract.autolink/2 renders the result.
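As a rough illustration of how downstream functions could consume this shape, the sketch below filters URL candidates and turns rejected candidates back into text. PipelineSketch, url_candidates/1 and demote/2 are hypothetical names, and the validator is a caller-supplied function; the real Text.Extract functions may work differently.

```elixir
defmodule PipelineSketch do
  # Hypothetical helpers; not the Text.Extract implementation.

  # Keep only the strings of URL-shaped candidates.
  def url_candidates(elements) do
    for {:url, candidate, _span} <- elements, do: candidate
  end

  # Turn candidates that fail `validator` back into text, then merge
  # neighbouring text fragments so the output stays interleaved.
  def demote(elements, validator) when is_function(validator, 1) do
    elements
    |> Enum.map(fn
      {:text, _fragment} = text_element ->
        text_element

      {_kind, candidate, _span} = element ->
        if validator.(element), do: element, else: {:text, candidate}
    end)
    |> merge_text()
  end

  # Join adjacent {:text, _} elements into one.
  defp merge_text(elements) do
    elements
    |> Enum.reduce([], fn
      {:text, b}, [{:text, a} | rest] -> [{:text, a <> b} | rest]
      element, acc -> [element | acc]
    end)
    |> Enum.reverse()
  end
end
```

Merging adjacent text after demotion keeps the round-trip property intact: concatenating the contents still reproduces the original text.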

Arguments

  • text is a UTF-8 string.

Returns

  • A list of elements:
    • {:text, fragment} — a span of text containing no candidate.

    • {:url, candidate, {start, length}} — a URL-shaped candidate. candidate is the substring; start/length are byte offsets back into text.

    • {:email, candidate, {start, length}} — an email-shaped candidate.

Where a URL match is wholly contained inside an email match, the email wins and the URL is dropped. This is the canonical interpretation: apart from mailto: links, an email address is never a URL.
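The containment rule reduces to a comparison of byte spans. The following is a minimal sketch under the assumption that URL and email matches are first collected as separate span lists; OverlapSketch and its functions are hypothetical and say nothing about how the scanner is actually structured.

```elixir
defmodule OverlapSketch do
  # True when span `inner` lies wholly inside span `outer`.
  # Both spans are {start_byte, length_bytes}.
  def contained?({in_start, in_len}, {out_start, out_len}) do
    in_start >= out_start and in_start + in_len <= out_start + out_len
  end

  # Drop every URL span that some email span fully covers.
  def drop_shadowed_urls(url_spans, email_spans) do
    Enum.reject(url_spans, fn url ->
      Enum.any?(email_spans, &contained?(url, &1))
    end)
  end
end
```

For "alice@example.com", a URL-shaped match on "example.com" would have span {6, 11}, which sits inside the email span {0, 17} and is therefore dropped.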

Examples

iex> Text.Extract.Scanner.scan("see http://example.com today")
[{:text, "see "}, {:url, "http://example.com", {4, 18}}, {:text, " today"}]

iex> Text.Extract.Scanner.scan("alice@example.com")
[{:email, "alice@example.com", {0, 17}}]

iex> Text.Extract.Scanner.scan("hello world")
[{:text, "hello world"}]

iex> Text.Extract.Scanner.scan("$invalid http://example.com")
[{:text, "$invalid "}, {:url, "http://example.com", {9, 18}}]
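The round-trip property and the span offsets can be checked together against the documented output. The sketch below uses the first example's result as literal data; ScanCheck and consistent?/2 are hypothetical and not part of the library.

```elixir
defmodule ScanCheck do
  # Hypothetical checker for a scan result.
  def consistent?(text, elements) do
    # Every element carries its content in position 1 of the tuple.
    round_trip = Enum.map_join(elements, &elem(&1, 1)) == text

    # Each candidate's span must point back at the same bytes in `text`.
    # binary_part/3 raises ArgumentError if a span runs past the end.
    spans_ok =
      Enum.all?(elements, fn
        {:text, _fragment} -> true
        {_kind, candidate, {start, len}} -> binary_part(text, start, len) == candidate
      end)

    round_trip and spans_ok
  end
end

text = "see http://example.com today"
elements = [{:text, "see "}, {:url, "http://example.com", {4, 18}}, {:text, " today"}]

ScanCheck.consistent?(text, elements)
# => true
```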