# `Text.Extract.Scanner`
[🔗](https://github.com/kipcole9/text/blob/v0.6.1/lib/extract/scanner.ex#L1)

Phase 1 of the URL / email extraction pipeline: find candidate spans.

The scanner is intentionally permissive. It identifies anything that
*could* be a URL or email — full structural validation, IDNA mapping,
TLD lookup and boundary cleanup happen later in
`Text.Extract.Url`, `Text.Extract.Email`, and
`Text.Extract.Boundary`.

### Boundary rules

Following twitter-text §3, a candidate's first character must be
preceded by one of:

* Beginning of the text.

* Whitespace (any Unicode `Zs`/`Zl`/`Zp`, `\t`, `\n`, `\r`).

* A CJK character or other "word break" character.

* One of the safe punctuation chars: `(`, `[`, `{`, `<`, `>`, `"`,
  `'`, `,`, `;`, `:`, `!`, `?`.

Candidates immediately preceded by `$`, `_`, alphanumerics, `@`,
`#`, `-`, `.` (or the same set on the *other* side of a CJK char,
when the previous grapheme is itself part of a token) are rejected.
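
The rules above can be approximated with a small predicate over the text that precedes a candidate. This is a minimal sketch, not the scanner's actual implementation: the module name `BoundarySketch` and the function `allowed_before?/1` are hypothetical, the CJK check is narrowed to four scripts as a stand-in for "word break" characters, and the special case for tokens on the other side of a CJK character is not modelled.

```elixir
defmodule BoundarySketch do
  # Hypothetical helper, not part of Text.Extract.Scanner.
  @safe_punctuation ["(", "[", "{", "<", ">", "\"", "'", ",", ";", ":", "!", "?"]

  # `prefix` is everything in the text before the candidate's first character.
  # An empty prefix means the candidate starts the text.
  def allowed_before?(""), do: true

  def allowed_before?(prefix) when is_binary(prefix) do
    previous = String.last(prefix)

    previous in @safe_punctuation or
      String.match?(previous, ~r/^[\p{Zs}\p{Zl}\p{Zp}\t\n\r]$/u) or
      # Assumption: these scripts approximate the CJK / word-break class.
      String.match?(previous, ~r/^[\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}]$/u)
  end
end

BoundarySketch.allowed_before?("see ")   # => true
BoundarySketch.allowed_before?("price$") # => false
```

Anything outside the three allowed classes, including `$`, `_`, alphanumerics, `@`, `#`, `-` and `.`, is rejected by default in this sketch.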

### Output

`scan/1` returns an interleaved list of `element()` values in source
order: `{:text, fragment}` for plain text, and
`{kind, candidate, {start_byte, length_bytes}}` for candidates, where
`kind` is `:url` or `:email`.

# `element`

```elixir
@type element() ::
  {:text, String.t()}
  | {:url, String.t(), span()}
  | {:email, String.t(), span()}
```

Element of the scanner's interleaved output.

# `span`

```elixir
@type span() :: {non_neg_integer(), non_neg_integer()}
```

The byte offset and byte length, as `{start, length}`, of a candidate
within the source text.
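
Because spans count bytes, they pair with `binary_part/3` rather than `String.slice/3`, which counts graphemes. The sketch below uses a made-up sample string containing a multi-byte character and a span worked out by hand, not real scanner output:

```elixir
# Hand-built sample: "café " occupies 6 bytes because "é" is 2 bytes in UTF-8.
text = "café http://example.com"
{start, len} = {6, 18}

# Byte-based extraction recovers the candidate.
by_bytes = binary_part(text, start, len)
# => "http://example.com"

# Grapheme-based slicing starts one character too late.
by_graphemes = String.slice(text, start, len)
# => "ttp://example.com"
```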

# `scan`

```elixir
@spec scan(String.t()) :: [element()]
```

Walks `text` once and returns an interleaved list of plain-text
fragments and URL / email candidates, preserving document order.

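Concatenating every element's string content reproduces `text`
byte-for-byte. Here `content/1` stands for a helper, not defined on
this page, that returns an element's fragment or candidate string:

    text == scan(text) |> Enum.map_join(&content/1)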

This shape is the building block for everything else in
`Text.Extract`. `Text.Extract.urls/2` filters for `:url` and
validates; `Text.Extract.split/2` validates each candidate and
promotes failures back to `:text`; `Text.Extract.autolink/2`
renders the result.
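
The round-trip property can be checked against the element shapes documented here. In the sketch below, `RoundTripSketch`, `content/1` and `round_trips?/2` are assumed names introduced for illustration, and the element list is copied from the first example on this page rather than produced by calling the scanner:

```elixir
defmodule RoundTripSketch do
  # Hypothetical helper: the string carried by any element.
  def content({:text, fragment}), do: fragment
  def content({_kind, candidate, _span}), do: candidate

  # True when the elements concatenate back to the original text.
  def round_trips?(text, elements) do
    text == Enum.map_join(elements, &content/1)
  end
end

text = "see http://example.com today"

elements = [
  {:text, "see "},
  {:url, "http://example.com", {4, 18}},
  {:text, " today"}
]

RoundTripSketch.round_trips?(text, elements)
# => true
```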

### Arguments

* `text` is a UTF-8 string.

### Returns

* A list of elements:

  * `{:text, fragment}` — a span of `text` containing no candidate.

  * `{:url, candidate, {start, length}}` — a URL-shaped candidate.
    `candidate` is the substring; `start`/`length` are byte offsets
    back into `text`.

  * `{:email, candidate, {start, length}}` — an email-shaped
    candidate.

Where a URL match is wholly contained inside an email match, the
email wins and the URL is dropped (which is the canonical
interpretation: `mailto:` aside, an email is never a URL).
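
One way to picture the containment rule is as a filter over candidate spans. This is a sketch under assumptions, not the scanner's code: `OverlapSketch` and `drop_contained_urls/1` are hypothetical, and candidates are simplified to `{kind, span}` pairs.

```elixir
defmodule OverlapSketch do
  # Hypothetical helper: drop every URL span that lies wholly inside an email span.
  def drop_contained_urls(candidates) do
    email_spans = for {:email, span} <- candidates, do: span

    Enum.reject(candidates, fn
      {:url, span} -> Enum.any?(email_spans, &contains?(&1, span))
      {:email, _span} -> false
    end)
  end

  # True when the inner span starts and ends within the outer span.
  defp contains?({outer_start, outer_len}, {inner_start, inner_len}) do
    inner_start >= outer_start and
      inner_start + inner_len <= outer_start + outer_len
  end
end

# Spans for "alice@example.com": the whole address, and "example.com" inside it,
# followed by an unrelated URL span further along.
candidates = [{:email, {0, 17}}, {:url, {6, 11}}, {:url, {18, 18}}]

OverlapSketch.drop_contained_urls(candidates)
# => [{:email, {0, 17}}, {:url, {18, 18}}]
```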

### Examples

    iex> Text.Extract.Scanner.scan("see http://example.com today")
    [{:text, "see "}, {:url, "http://example.com", {4, 18}}, {:text, " today"}]

    iex> Text.Extract.Scanner.scan("alice@example.com")
    [{:email, "alice@example.com", {0, 17}}]

    iex> Text.Extract.Scanner.scan("hello world")
    [{:text, "hello world"}]

    iex> Text.Extract.Scanner.scan("$invalid http://example.com")
    [{:text, "$invalid "}, {:url, "http://example.com", {9, 18}}]
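
The element shapes shown above lend themselves to pattern matching in downstream code. As a sketch of the "promote failures back to `:text`" step described for `Text.Extract.split/2`, the hypothetical `DemoteSketch.demote/2` below takes a caller-supplied check (a stand-in for real validation, which is not shown on this page) and folds rejected candidates into the surrounding text:

```elixir
defmodule DemoteSketch do
  # Hypothetical helper: `valid?` receives the kind and the candidate string.
  def demote(elements, valid?) when is_function(valid?, 2) do
    elements
    |> Enum.map(fn
      {:text, _fragment} = text ->
        text

      {kind, candidate, _span} = element ->
        if valid?.(kind, candidate), do: element, else: {:text, candidate}
    end)
    |> merge_text()
  end

  # Join neighbouring :text fragments so no two text elements are adjacent.
  defp merge_text(elements) do
    elements
    |> Enum.reduce([], fn
      {:text, later}, [{:text, earlier} | rest] -> [{:text, earlier <> later} | rest]
      element, acc -> [element | acc]
    end)
    |> Enum.reverse()
  end
end

elements = [
  {:text, "see "},
  {:url, "http://example.com", {4, 18}},
  {:text, " today"}
]

# Rejecting every candidate collapses the list back to a single text element.
DemoteSketch.demote(elements, fn _kind, _candidate -> false end)
# => [{:text, "see http://example.com today"}]
```

Because a demoted candidate keeps its original string, the concatenation of all element contents is unchanged by this step.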

