Text.Extract.Tld (Text v0.6.1)


Top-level domain validation for Text.Extract.

At compile time, this module reads priv/extract/tlds.txt (the IANA TLD list, refreshed by mix text.download_tlds) and bakes the entries into a MapSet for O(1) lookup. The bundled file is committed to source control; the mix task exists to make refreshes reproducible.
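The compile-time step described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual source: `TldSketch` and its functions are hypothetical names, and the inline `@raw` string (including its made-up version header) stands in for the contents of priv/extract/tlds.txt so the sketch runs without the file.

```elixir
defmodule TldSketch do
  # Hypothetical stand-in for the file contents. A real module would call
  # File.read!/1 on the bundled file and declare it with @external_resource
  # so that changes to the file trigger recompilation.
  @raw """
  # Version 2024010100, Last Updated Mon Jan  1 07:07:01 2024 UTC
  COM
  ORG
  XN--P1AI
  """

  @lines String.split(@raw, "\n", trim: true)

  # The "#" line is the version header; every other line is a TLD label.
  @version_line Enum.find(@lines, &String.starts_with?(&1, "#"))

  # Evaluated once, at compile time; the resulting MapSet is embedded
  # in the compiled module.
  @tlds MapSet.new(
          for line <- @lines,
              not String.starts_with?(line, "#"),
              do: String.downcase(line)
        )

  def version_line, do: @version_line
  def count, do: MapSet.size(@tlds)

  # Lowercase the input so the lookup is case-insensitive.
  def member?(label), do: MapSet.member?(@tlds, String.downcase(label))
end
```

Because the set is built while the module compiles, no file I/O or parsing happens at lookup time.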

TLD comparison is case-insensitive and operates on the ASCII form of a label. Internationalised TLDs in the IANA list are stored in Punycode (xn--…), so pass Unicode labels through Unicode.IDNA.to_ascii/2 before lookup.

Modes

  • :iana — match against the full bundled IANA list (~1,440 entries).

  • :any — accept any non-empty ASCII label (used by callers that need to bypass TLD validation, e.g. for intranet hostnames or ad-hoc strings).
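One way to picture the difference between the two modes is the sketch below. It is an assumption-laden illustration rather than the library's implementation: `ModeSketch` is a hypothetical module, its three-entry set stands in for the bundled IANA list, and the ASCII check is one plausible reading of "any non-empty ASCII label".

```elixir
defmodule ModeSketch do
  # Hypothetical miniature list; the real module uses the bundled IANA set.
  @known MapSet.new(["com", "org", "xn--p1ai"])

  # :iana requires membership in the known set, ignoring case.
  def tld?(label, :iana), do: MapSet.member?(@known, String.downcase(label))

  # :any accepts every non-empty label made up only of ASCII bytes.
  def tld?(label, :any), do: label != "" and ascii?(label)

  defp ascii?(label) do
    label |> :binary.bin_to_list() |> Enum.all?(&(&1 < 128))
  end
end
```

Under this sketch, `ModeSketch.tld?("baz", :iana)` is false while `ModeSketch.tld?("baz", :any)` is true, which mirrors the documented examples for `tld?/2`.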

Twitter-style tiered ccTLD/gTLD lists could be layered on top by a caller, but in practice the IANA list and a "must end in a known TLD" rule reproduce twitter-text's behaviour for every URL conformance fixture we've checked: the TLDs that twitter-text rejects (e.g. .baz, .govedu, .comm) are simply not in IANA either.
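The "must end in a known TLD" rule can be illustrated with a short sketch. `HostRuleSketch` and its two-entry set are hypothetical and only stand in for a lookup against the full IANA list.

```elixir
defmodule HostRuleSketch do
  # Hypothetical miniature list standing in for the bundled IANA set.
  @known MapSet.new(["com", "org"])

  # Takes the last dot-separated label of the host and checks it,
  # case-insensitively, against the known set.
  def ends_in_known_tld?(host) do
    last = host |> String.split(".") |> List.last() |> String.downcase()
    MapSet.member?(@known, last)
  end
end

HostRuleSketch.ends_in_known_tld?("example.com")
# => true

HostRuleSketch.ends_in_known_tld?("example.comm")
# => false
```

Hosts ending in .baz, .govedu or .comm fail this check simply because those labels are absent from the list, with no separate deny-list needed.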

Summary

Functions

ascii_sorted()

Returns the ASCII TLDs sorted longest-first.

count()

Returns the count of TLDs in the bundled IANA list.

iana()

Returns the IANA TLD list as a MapSet of lowercased ASCII labels.

idn_unicode()

Returns the IDN TLDs in their Unicode form.

tld?(label, mode \\ :iana)

Returns whether label is a known TLD under mode.

version_line()

Returns the version header line from the bundled tlds.txt.

Functions

ascii_sorted()

@spec ascii_sorted() :: [String.t()]

Returns the ASCII TLDs sorted longest-first.

Useful for building regex alternations where longer TLDs must be tried first. Punycode (xn--…) entries are not part of this list, as the second example shows.
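To see why ordering matters, consider the sketch below. `AlternationSketch` and the four-entry sample are hypothetical; in real use the input would be the list returned by `ascii_sorted/0`, which is already ordered. Regex alternatives are tried left to right, so a short TLD listed first can match a prefix of a longer one.

```elixir
defmodule AlternationSketch do
  # Builds a case-insensitive regex that captures the TLD after a dot.
  def build(tlds) do
    alternation = Enum.map_join(tlds, "|", &Regex.escape/1)
    Regex.compile!("\\.(" <> alternation <> ")", "i")
  end

  # Orders labels by length, longest first.
  def longest_first(tlds), do: Enum.sort_by(tlds, &String.length/1, :desc)
end

# Hypothetical sample, deliberately in shortest-first order.
sample = ["co", "com", "community", "org"]

# Unsorted: "co" is tried first and wins on "example.community".
Regex.run(AlternationSketch.build(sample), "example.community", capture: :all_but_first)
# => ["co"]

# Longest-first: the full TLD is matched.
sorted = AlternationSketch.longest_first(sample)
Regex.run(AlternationSketch.build(sorted), "example.community", capture: :all_but_first)
# => ["community"]
```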

Examples

iex> ascii = Text.Extract.Tld.ascii_sorted()
iex> "com" in ascii
true

iex> "xn--p1ai" in Text.Extract.Tld.ascii_sorted()
false

count()

@spec count() :: non_neg_integer()

Returns the count of TLDs in the bundled IANA list.

Examples

iex> Text.Extract.Tld.count() > 1000
true

iana()

@spec iana() :: MapSet.t(String.t())

Returns the IANA TLD list as a MapSet of lowercased ASCII labels.

Examples

iex> "com" in Text.Extract.Tld.iana()
true

iex> "googleusercontent" in Text.Extract.Tld.iana()
false

idn_unicode()

@spec idn_unicode() :: [String.t()]

Returns the IDN TLDs in their Unicode form.

Built at compile time from the xn-- ACE entries by passing each through Unicode.IDNA.to_unicode/1. Used by Text.Extract.Scanner to extend its bare-host regex with explicit alternatives for IDN TLDs.
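As a rough illustration of how Unicode TLDs could extend a bare-host regex, consider the sketch below. It does not reproduce what Text.Extract.Scanner actually does: the pattern, the variable names and the two-entry sample list are assumptions, with the sample standing in for the result of `idn_unicode/0`.

```elixir
# Hypothetical two-entry sample of IDN TLDs in Unicode form.
idn_tlds = ["みんな", "рф"]

alternation = Enum.map_join(idn_tlds, "|", &Regex.escape/1)

# The "u" option makes the regex treat the pattern and input as Unicode,
# so \p{L} and \p{N} match letters and digits in any script.
bare_host = Regex.compile!("[\\p{L}\\p{N}-]+\\.(?:" <> alternation <> ")", "u")

Regex.run(bare_host, "see example.みんな for details")
# => ["example.みんな"]
```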

Examples

iex> tlds = Text.Extract.Tld.idn_unicode()
iex> length(tlds) > 100
true

iex> "みんな" in Text.Extract.Tld.idn_unicode()
true

tld?(label, mode \\ :iana)

@spec tld?(String.t(), :iana | :any) :: boolean()

Returns whether label is a known TLD under mode.

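Arguments

  • label is the candidate TLD as a string, in ASCII form. Comparison is case-insensitive.

  • mode is :iana or :any. The default is :iana.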

Returns

  • true if the label is a known TLD under the mode, false otherwise.

Examples

iex> Text.Extract.Tld.tld?("com")
true

iex> Text.Extract.Tld.tld?("COM")
true

iex> Text.Extract.Tld.tld?("baz")
false

iex> Text.Extract.Tld.tld?("baz", :any)
true

version_line()

@spec version_line() :: String.t() | nil

Returns the version header line from the bundled tlds.txt.

Examples

iex> Text.Extract.Tld.version_line() =~ "Last Updated"
true