Text.Extract.Url (Text v0.6.1)


Phase 2 validator for URL candidates.

Takes a scanner candidate span, applies boundary cleanup, parses the URL into RFC 3986 components, and validates each piece against:

  • Twitter-style host-label rules — no leading/trailing dashes, underscores allowed only in subdomain labels (not in the registrable domain or TLD).

  • UTS #46 IDNA via Unicode.IDNA.to_ascii/2 — every non-ASCII host label must encode to a valid Punycode form. The original Unicode form is preserved in :host; the all-ASCII form is in :ascii_host.

  • TLD existence in the bundled IANA list (or the caller-selected :tld_mode).

Returns {:ok, record} on success, where record is a map of parsed URL fields, or {:error, reason} on rejection.

Summary

Types

reason(): Reasons for rejecting a URL candidate.

url_record(): Parsed URL record.

Functions

validate(candidate, span, options \\ []): Validates a URL candidate span.

Types

reason()

@type reason() ::
  :empty
  | :no_host
  | :invalid_label
  | :invalid_tld
  | :idna_failed
  | :unsupported_scheme
  | :mixed_script
  | :twitter_quirk_rejected

Reasons for rejecting a URL candidate.
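
Callers often want to turn a rejection reason into a log line or user-facing message. The following is a minimal sketch of one way to do that. ReasonText is a hypothetical helper, not part of Text.Extract.Url, and the message wording is inferred from the atom names and the option descriptions rather than taken from the library source.

```elixir
defmodule ReasonText do
  # Hypothetical helper, not part of the Text library. The messages are
  # assumptions based on the atom names in reason(), not library output.
  @spec describe(atom()) :: String.t()
  def describe(:empty), do: "candidate is empty"
  def describe(:no_host), do: "no host component"
  def describe(:invalid_label), do: "a host label breaks the label rules"
  def describe(:invalid_tld), do: "TLD not accepted under the current :tld_mode"
  def describe(:idna_failed), do: "IDNA conversion of the host failed"
  def describe(:unsupported_scheme), do: "scheme is not in the :schemes allowlist"
  def describe(:mixed_script), do: "host mixes scripts"
  def describe(:twitter_quirk_rejected), do: "rejected by a Twitter-specific rule"
  # Fallback so that reasons added in later versions do not crash the caller.
  def describe(other), do: "unrecognised reason: #{inspect(other)}"
end

IO.puts(ReasonText.describe(:invalid_tld))
```

The catch-all clause is a deliberate choice: it keeps the helper total even if the set of reasons grows.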

url_record()

@type url_record() :: %{
  url: String.t(),
  ascii: String.t(),
  span: {non_neg_integer(), non_neg_integer()},
  scheme: String.t() | nil,
  userinfo: String.t() | nil,
  host: String.t(),
  ascii_host: String.t(),
  port: non_neg_integer() | nil,
  path: String.t() | nil,
  query: String.t() | nil,
  fragment: String.t() | nil
}

Parsed URL record.
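
Because several fields of url_record() may be nil, code that consumes a record usually has to branch on them. The sketch below shows this with a hand-written sample map that has the url_record() shape; it is not captured library output, and RecordDemo.origin/1 is a hypothetical helper introduced only for illustration.

```elixir
defmodule RecordDemo do
  # Hypothetical helper: renders "scheme://host[:port]" from a map shaped like
  # url_record(), using the ASCII host and skipping fields that are nil.
  def origin(%{scheme: scheme, ascii_host: host, port: port}) do
    prefix = if scheme, do: scheme <> "://", else: ""
    suffix = if port, do: ":" <> Integer.to_string(port), else: ""
    prefix <> host <> suffix
  end
end

# Hand-written sample. The nil values here are assumptions, not a statement
# about what validate/3 returns for this input.
sample = %{
  url: "http://example.com",
  ascii: "http://example.com",
  span: {0, 18},
  scheme: "http",
  userinfo: nil,
  host: "example.com",
  ascii_host: "example.com",
  port: nil,
  path: nil,
  query: nil,
  fragment: nil
}

IO.puts(RecordDemo.origin(sample))
# prints http://example.com
```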

Functions

validate(candidate, span, options \\ [])

@spec validate(String.t(), {non_neg_integer(), non_neg_integer()}, keyword()) ::
  {:ok, url_record()} | {:error, reason()}

Validates a URL candidate span.

Arguments

  • candidate is the candidate substring as emitted by Text.Extract.Scanner.scan/1.

  • span is the {start_byte, length_bytes} tuple positioning candidate within the original source text — preserved through to the returned record's :span field.
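
Since span is expressed in bytes rather than characters, it can be used directly with binary_part/3 to recover the candidate from the source text. The following sketch uses only the Elixir and Erlang standard libraries; the source string is made up for illustration, and in practice the span would come from the scanner.

```elixir
source = "see http://example.com today"
candidate = "http://example.com"

# :binary.match/2 returns {start, length} in bytes, the same shape as span.
{start, len} = :binary.match(source, candidate)
span = {start, len}

# Slicing the source with the span yields the candidate again.
^candidate = binary_part(source, start, len)

IO.inspect(span)
# prints {4, 18}
```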

Options

  • :require_scheme — when true, only scheme://… URLs validate; schemeless candidates are rejected. Default false.

  • :schemes — allowlist of accepted schemes. Default ["http", "https", "ftp", "ftps"].

  • :tld_mode is :iana (default) or :any. See Text.Extract.Tld.

  • :strict_idn — when true, applies two extra defences against homograph attacks: STD3 ASCII rules in the IDNA call (rejects _ in labels), and the UTR #39 §5.1 single-script restriction (rejects mixed-script hosts like аpple.com where the а is Cyrillic). Default false (matches Twitter behaviour).

  • :twitter_quirks — when true, applies the Twitter-text-specific rules in Text.Extract.Twitter (t.co slug max 40 chars, English possessive 's stripping). Default false.
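
The options above are passed as a keyword list in the third argument. As a hedged sketch, a caller wanting a stricter profile than the defaults might combine them as shown below. The specific combination is an illustrative choice, not a recommendation from the library, and the call is guarded so the snippet also runs in a session where the Text package is not loaded.

```elixir
# A stricter-than-default profile built only from the documented options.
strict_opts = [
  require_scheme: true,   # reject schemeless candidates
  schemes: ["https"],     # narrow the allowlist to https
  tld_mode: :iana,        # the default, spelled out for clarity
  strict_idn: true,       # STD3 rules plus the single-script restriction
  twitter_quirks: false   # the default
]

candidate = "https://example.com"
span = {0, byte_size(candidate)}

# Only call the library when it is available. :text_not_loaded is a placeholder
# used by this sketch, not a value the library returns.
result =
  if Code.ensure_loaded?(Text.Extract.Url) do
    apply(Text.Extract.Url, :validate, [candidate, span, strict_opts])
  else
    :text_not_loaded
  end

IO.inspect(result)
```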

Returns

  • {:ok, record} on success. The fields of record are documented in the module docs (see url_record()).

  • {:error, reason} if the candidate fails validation (see reason()).

Examples

iex> {:ok, r} = Text.Extract.Url.validate("http://example.com", {0, 18})
iex> {r.scheme, r.host, r.span}
{"http", "example.com", {0, 18}}

iex> Text.Extract.Url.validate("http://no-tld", {0, 13})
{:error, :invalid_tld}