Phase 2 validator for URL candidates.
Takes a scanner candidate span, applies boundary cleanup, parses the URL into RFC 3986 components, and validates each piece against:
Twitter-style host-label rules: no leading or trailing dashes, and underscores allowed only in subdomain labels (not in the registrable domain or TLD).
UTS #46 IDNA via Unicode.IDNA.to_ascii/2: every non-ASCII host label must encode to a valid Punycode form. The original Unicode form is preserved in :host; the all-ASCII form is in :ascii_host.
TLD existence in the bundled IANA list (or the caller-selected :tld_mode).
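The host-label rules can be illustrated with a small standalone sketch. This is not the module's implementation: HostLabelSketch and its functions are hypothetical names, only ASCII labels are handled, and the last two labels are assumed to stand in for the registrable domain and TLD (real code would consult a TLD or public-suffix list).

```elixir
defmodule HostLabelSketch do
  # Hypothetical helper, not part of Text.Extract.Url.
  # Checks ASCII host labels against the label rules described above.

  @spec valid_host?(String.t()) :: boolean()
  def valid_host?(host) do
    labels = String.split(host, ".")

    # Assumption: the last two labels are the registrable domain and the TLD.
    {subdomains, registrable} = Enum.split(labels, -2)

    length(labels) >= 2 and
      Enum.all?(subdomains, &valid_label?(&1, true)) and
      Enum.all?(registrable, &valid_label?(&1, false))
  end

  # Empty labels (for example from "a..com") are never valid.
  defp valid_label?("", _allow_underscore), do: false

  defp valid_label?(label, allow_underscore) do
    # Underscores are only permitted when the caller says so (subdomain labels).
    pattern =
      if allow_underscore, do: ~r/\A[a-z0-9_-]+\z/i, else: ~r/\A[a-z0-9-]+\z/i

    Regex.match?(pattern, label) and
      not String.starts_with?(label, "-") and
      not String.ends_with?(label, "-")
  end
end
```

Under these assumptions, "my_app.example.com" passes because the underscore sits in a subdomain label, while "my_app.com" fails because the underscore falls in the registrable domain.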
Returns {:ok, record} on success or {:error, reason} on rejection.
Types
@type reason() ::
:empty
| :no_host
| :invalid_label
| :invalid_tld
| :idna_failed
| :unsupported_scheme
| :mixed_script
| :twitter_quirk_rejected
Reasons for rejecting a URL candidate.
@type url_record() :: %{
  url: String.t(),
  ascii: String.t(),
  span: {non_neg_integer(), non_neg_integer()},
  scheme: String.t() | nil,
  userinfo: String.t() | nil,
  host: String.t(),
  ascii_host: String.t(),
  port: non_neg_integer() | nil,
  path: String.t() | nil,
  query: String.t() | nil,
  fragment: String.t() | nil
}
Parsed URL record.
Functions
@spec validate(String.t(), {non_neg_integer(), non_neg_integer()}, keyword()) :: {:ok, url_record()} | {:error, reason()}
Validates a URL candidate span.
Arguments
candidate is the candidate substring as emitted by Text.Extract.Scanner.scan/1.
span is the {start_byte, length_bytes} tuple positioning candidate within the original source text; it is preserved through to the returned record's :span field.
Options
:require_scheme - when true, only scheme://… URLs validate; schemeless candidates are rejected. Default false.
:schemes - allowlist of accepted schemes. Default ["http", "https", "ftp", "ftps"].
:tld_mode - :iana (default) or :any. See Text.Extract.Tld.
:strict_idn - when true, applies two extra defences against homograph attacks: STD3 ASCII rules in the IDNA call (rejects _ in labels), and the UTR #39 §5.1 single-script restriction (rejects mixed-script hosts like аpple.com where the а is Cyrillic). Default false (matches Twitter behaviour).
:twitter_quirks - when true, applies the Twitter-text-specific rules in Text.Extract.Twitter (t.co slug max 40 chars, English possessive 's stripping). Default false.
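To make the single-script restriction behind :strict_idn more concrete, here is a much-simplified sketch of a mixed-script check. It is not the library's code: MixedScriptSketch is a hypothetical name, only Latin, Cyrillic and Greek are considered, and checking each label separately is an assumption made for this example. The full UTR #39 rules cover many more scripts and special cases.

```elixir
defmodule MixedScriptSketch do
  # Hypothetical helper; a simplified stand-in for a UTR #39 style check.

  # Script detectors, built at call time rather than stored in a module attribute.
  defp scripts do
    [
      latin: ~r/\p{Latin}/u,
      cyrillic: ~r/\p{Cyrillic}/u,
      greek: ~r/\p{Greek}/u
    ]
  end

  # Lists which of the known scripts occur in a single label.
  @spec scripts_in(String.t()) :: [atom()]
  def scripts_in(label) do
    for {name, pattern} <- scripts(), Regex.match?(pattern, label), do: name
  end

  # True when any one label mixes two or more of the known scripts.
  # Assumption: labels are checked independently, so an all-Cyrillic label
  # followed by an ASCII TLD is not treated as mixed.
  @spec mixed_script?(String.t()) :: boolean()
  def mixed_script?(host) do
    host
    |> String.split(".")
    |> Enum.any?(fn label -> length(scripts_in(label)) > 1 end)
  end
end
```

With this sketch, MixedScriptSketch.mixed_script?("\u0430pple.com") is true because the first label combines Cyrillic U+0430 with Latin letters, whereas the all-Latin "apple.com" is not flagged.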
Returns
{:ok, record} on success.
{:error, reason} if the candidate fails validation.
The record fields are documented in the module docs.
Examples
iex> {:ok, r} = Text.Extract.Url.validate("http://example.com", {0, 18})
iex> {r.scheme, r.host, r.span}
{"http", "example.com", {0, 18}}
iex> Text.Extract.Url.validate("http://no-tld", {0, 13})
{:error, :invalid_tld}
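Callers will usually pattern-match on the returned tuple. The sketch below shows one way that might look; ValidationResultSketch and its message strings are hypothetical and only illustrate the {:ok, record} / {:error, reason} shapes described above.

```elixir
defmodule ValidationResultSketch do
  # Hypothetical caller-side helper; the messages are illustrative only.

  @spec describe({:ok, map()} | {:error, atom()}) :: String.t()
  def describe({:ok, %{host: host, ascii_host: ascii_host}}) do
    # :host keeps the original Unicode form, :ascii_host the all-ASCII form.
    "accepted #{host} (ASCII: #{ascii_host})"
  end

  def describe({:error, :invalid_tld}), do: "rejected: TLD not recognised"
  def describe({:error, :idna_failed}), do: "rejected: host label failed IDNA encoding"

  # Fallback for every other reason() value.
  def describe({:error, reason}), do: "rejected: #{reason}"
end
```

In practice the argument would be the value returned by Text.Extract.Url.validate/3, for example `"http://no-tld" |> Text.Extract.Url.validate({0, 13}) |> ValidationResultSketch.describe()`.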