# `Text.Extract.Url`
[🔗](https://github.com/kipcole9/text/blob/v0.6.1/lib/extract/url.ex#L1)

Phase 2 validator for URL candidates.

Takes a scanner candidate span, applies boundary cleanup, parses the
URL into RFC 3986 components, and validates each piece against:

* Twitter-style host-label rules — no leading/trailing dashes,
  underscores allowed only in subdomain labels (not in the
  registrable domain or TLD).

* UTS #46 IDNA via `Unicode.IDNA.to_ascii/2` — every non-ASCII host
  label must encode to a valid Punycode form. The original Unicode
  form is preserved in `:host`; the all-ASCII form is in
  `:ascii_host`.

* TLD existence in the bundled IANA list, unless the caller selects a
  different `:tld_mode`.

Returns `{:ok, record}` on success or `{:error, reason}` on rejection.
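
The host-label rules listed above can be sketched in a few lines. This is a minimal approximation, not the module's implementation: `HostLabelSketch` is a hypothetical name, the check handles ASCII hosts only, it rejects empty labels as an extra assumption, and it assumes the registrable domain and TLD are simply the last two labels (which does not hold for suffixes such as `co.uk`).

```elixir
defmodule HostLabelSketch do
  # Hypothetical helper, not part of Text.Extract.Url.
  # Approximates the documented label rules: no leading or trailing
  # hyphen, underscores only in subdomain labels.

  def valid_host?(host) when is_binary(host) do
    labels = String.split(host, ".")
    last_index = length(labels) - 1

    labels
    |> Enum.with_index()
    |> Enum.all?(fn {label, index} ->
      # Assumption: everything left of the last two labels is a subdomain.
      subdomain? = index < last_index - 1
      valid_label?(label, subdomain?)
    end)
  end

  # Assumption: an empty label (as in "a..com") is never valid.
  defp valid_label?("", _subdomain?), do: false

  defp valid_label?(label, subdomain?) do
    clean_edges? =
      not (String.starts_with?(label, "-") or String.ends_with?(label, "-"))

    underscore_ok? = subdomain? or not String.contains?(label, "_")
    clean_edges? and underscore_ok?
  end
end
```

Under these assumptions `my_sub.example.com` passes, while `my_domain.com` and `-bad.example.com` do not.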

# `reason`

```elixir
@type reason() ::
  :empty
  | :no_host
  | :invalid_label
  | :invalid_tld
  | :idna_failed
  | :unsupported_scheme
  | :mixed_script
  | :twitter_quirk_rejected
```

Reasons for rejecting a URL candidate.
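
Callers that surface rejections to users may want one message per reason. The sketch below is a hypothetical helper (`ReasonMessages` is not part of the library), and the wording of each message is only a reading of the atom names and the module docs, not documented behaviour.

```elixir
defmodule ReasonMessages do
  # Hypothetical helper. One clause per atom in the reason() type;
  # the message texts are illustrative guesses, not library output.
  def describe(:empty), do: "candidate is empty"
  def describe(:no_host), do: "no host component found"
  def describe(:invalid_label), do: "a host label breaks the label rules"
  def describe(:invalid_tld), do: "TLD not accepted under the selected :tld_mode"
  def describe(:idna_failed), do: "IDNA conversion of the host failed"
  def describe(:unsupported_scheme), do: "scheme is not in the :schemes allowlist"
  def describe(:mixed_script), do: "host mixes scripts"
  def describe(:twitter_quirk_rejected), do: "rejected by the :twitter_quirks rules"
end
```

Writing one clause per atom, with no catch-all, means an unexpected reason raises a `FunctionClauseError` instead of being silently mislabelled.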

# `url_record`

```elixir
@type url_record() :: %{
  url: String.t(),
  ascii: String.t(),
  span: {non_neg_integer(), non_neg_integer()},
  scheme: String.t() | nil,
  userinfo: String.t() | nil,
  host: String.t(),
  ascii_host: String.t(),
  port: non_neg_integer() | nil,
  path: String.t() | nil,
  query: String.t() | nil,
  fragment: String.t() | nil
}
```

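Parsed URL record. `:host` holds the original Unicode host and
`:ascii_host` its all-ASCII form; `:span` is the `{start_byte, length_bytes}`
tuple passed to `validate/3`.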
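
To illustrate how the component fields fit together, here is a rough sketch that recomposes an ASCII URL from a `url_record`-shaped map, following the component order of RFC 3986 section 5.3. `UrlRecordSketch` is a hypothetical helper. It assumes that `:path` keeps its leading `/` and that `:query` and `:fragment` are stored without their `?` and `#` delimiters; the source does not confirm either detail.

```elixir
defmodule UrlRecordSketch do
  # Hypothetical helper, not part of Text.Extract.Url.
  # Assumptions: :path includes its leading "/", while :query and
  # :fragment exclude their "?" and "#" delimiters.
  def to_ascii_url(record) do
    IO.iodata_to_binary([
      wrap(record.scheme, "", "://"),
      wrap(record.userinfo, "", "@"),
      record.ascii_host,
      port(record.port),
      record.path || "",
      wrap(record.query, "?", ""),
      wrap(record.fragment, "#", "")
    ])
  end

  # Optional components are nil when absent and contribute nothing.
  defp wrap(nil, _lead, _trail), do: ""
  defp wrap(value, lead, trail), do: [lead, value, trail]

  defp port(nil), do: ""
  defp port(number), do: [":", Integer.to_string(number)]
end
```

For a record with scheme `"http"`, host `"example.com"`, port `8080`, path `"/a"`, query `"b=1"`, and fragment `"c"`, this yields `"http://example.com:8080/a?b=1#c"`.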

# `validate`

```elixir
@spec validate(String.t(), {non_neg_integer(), non_neg_integer()}, keyword()) ::
  {:ok, url_record()} | {:error, reason()}
```

Validates a URL candidate span.

### Arguments

* `candidate` is the candidate substring as emitted by
  `Text.Extract.Scanner.scan/1`.

* `span` is the `{start_byte, length_bytes}` tuple positioning
  `candidate` within the original source text. It is preserved through
  to the returned record's `:span` field.

* `options` is a keyword list of options. See below.
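
Because `span` counts bytes rather than graphemes, it pairs with `binary_part/3` and `:binary.match/2`, not with `String.slice/3`. The following sketch shows the difference on a source string containing a multi-byte character. `SpanSketch` and its functions are hypothetical names used only for illustration.

```elixir
defmodule SpanSketch do
  # Hypothetical helper, not part of Text.Extract.Url.

  # Recovers the candidate text from the source using a
  # {start_byte, length_bytes} span.
  def slice(source, {start_byte, length_bytes}) do
    binary_part(source, start_byte, length_bytes)
  end

  # Finds the byte span of the first occurrence of `candidate`.
  # :binary.match/2 returns {start_byte, length_bytes} or :nomatch.
  def locate(source, candidate) do
    :binary.match(source, candidate)
  end
end

source = "Voilà: http://example.com today"

# "à" occupies two bytes in UTF-8, so the URL starts at byte 8,
# although it is only the eighth grapheme (index 7).
span = SpanSketch.locate(source, "http://example.com")
candidate = SpanSketch.slice(source, span)
```

Here `span` is `{8, 18}` and `candidate` is `"http://example.com"`.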

### Options

* `:require_scheme` — when `true`, only `scheme://…` URLs validate;
  schemeless candidates are rejected. Default `false`.

* `:schemes` — allowlist of accepted schemes. Default
  `["http", "https", "ftp", "ftps"]`.

* `:tld_mode` — `:iana` (default) or `:any`. See `Text.Extract.Tld`.

* `:strict_idn` — when `true`, applies two extra defences against
  homograph attacks: STD3 ASCII rules in the IDNA call (rejects `_`
  in labels), and the UTS #39 §5.1 single-script restriction
  (rejects mixed-script hosts like `аpple.com` where the `а` is
  Cyrillic). Default `false` (matches Twitter behaviour).

* `:twitter_quirks` — when `true`, applies the Twitter-text-specific
  rules in `Text.Extract.Twitter` (t.co slug max 40 chars, English
  possessive `'s` stripping). Default `false`.

### Returns

* `{:ok, record}` on success. The `record` fields are described by the
  `url_record` type and the module docs.

* `{:error, reason}` if the candidate fails validation. The possible
  reasons are listed in the `reason` type.

### Examples

    iex> {:ok, r} = Text.Extract.Url.validate("http://example.com", {0, 18})
    iex> {r.scheme, r.host, r.span}
    {"http", "example.com", {0, 18}}

    iex> Text.Extract.Url.validate("http://no-tld", {0, 13})
    {:error, :invalid_tld}
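
When validating many candidates, the two return shapes are usually handled with a `case`. The sketch below splits a list of `{candidate, span}` pairs into accepted records and rejections, using a strict set of the options documented above. `ValidateSketch` is a hypothetical wrapper, and the validator is passed in as a function (defaulting to `Text.Extract.Url.validate/3`) only so that the sketch can be exercised with a stub.

```elixir
defmodule ValidateSketch do
  # Hypothetical wrapper, not part of the library.
  # Option names come from the Options section; the combination is
  # just one possible strict profile.
  @strict_options [require_scheme: true, strict_idn: true, tld_mode: :iana]

  def partition(candidates, validator \\ &Text.Extract.Url.validate/3) do
    {records, rejections} =
      Enum.reduce(candidates, {[], []}, fn {candidate, span}, {oks, errors} ->
        case validator.(candidate, span, @strict_options) do
          {:ok, record} -> {[record | oks], errors}
          {:error, reason} -> {oks, [{span, reason} | errors]}
        end
      end)

    # Restore input order after the prepend-based accumulation.
    {Enum.reverse(records), Enum.reverse(rejections)}
  end
end
```

Keeping the span alongside each rejection reason makes it possible to point back at the offending text in the original source.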
