Top-level domain validation for Text.Extract.
At compile time, this module reads priv/extract/tlds.txt (the IANA
TLD list, refreshed by mix text.download_tlds) and bakes the entries
into a MapSet for O(1) lookup. The bundled file is committed to
source control; the mix task exists to make refreshes reproducible.
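The compile-time baking described above can be sketched as follows. This is a minimal illustration, not the module's actual source: `TldBakeSketch` is a hypothetical name, and an inline string with a made-up version header stands in for the contents of priv/extract/tlds.txt so that the snippet is self-contained. A real module would presumably read the file with `File.read!/1` and declare it with `@external_resource` so that edits trigger recompilation.

```elixir
defmodule TldBakeSketch do
  # Illustrative stand-in for the bundled file; the header and entries
  # below are sample data, not the real list.
  raw = """
  # Version 2024010100, Last Updated Mon Jan  1 07:07:01 2024 UTC
  COM
  NET
  XN--P1AI
  """

  # This code runs once, while the module is being compiled: drop the
  # comment header and lowercase every label.
  labels =
    raw
    |> String.split("\n", trim: true)
    |> Enum.reject(&String.starts_with?(&1, "#"))
    |> Enum.map(&String.downcase/1)

  # The resulting MapSet is stored in a module attribute, so lookups at
  # runtime do not touch the file system.
  @tlds MapSet.new(labels)

  # Case-insensitive membership test against the baked set.
  def tld?(label) when is_binary(label) do
    MapSet.member?(@tlds, String.downcase(label))
  end
end
```

Under these assumptions, `TldBakeSketch.tld?("COM")` returns `true` and `TldBakeSketch.tld?("baz")` returns `false`.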
TLD comparison is case-insensitive and operates on the ASCII form
of a label. Internationalised TLDs in the IANA list are stored in
Punycode (xn--…), so pass such labels through Unicode.IDNA.to_ascii/2
before lookup.
Modes
:iana - match against the full bundled IANA list (~1,440 entries).
:any - accept any non-empty ASCII label (used by callers that need to bypass TLD validation, e.g. for intranet hostnames or ad-hoc strings).
Twitter-style tiered ccTLD/gTLD lists could be layered on top by a
caller, but in practice the IANA list and a "must end in a known TLD"
rule reproduce twitter-text's behaviour for every URL conformance
fixture we've checked: the TLDs that twitter-text rejects (e.g.
.baz, .govedu, .comm) are simply not in IANA either.
Functions
@spec ascii_sorted() :: [String.t()]
Returns the ASCII TLDs sorted longest-first.
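Useful for building regex alternations, where longer TLDs must be tried first so that a shorter TLD that is a prefix of a longer one (e.g. co and com) does not end the match early.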
Examples
iex> ascii = Text.Extract.Tld.ascii_sorted()
iex> "com" in ascii
true
iex> "xn--p1ai" in Text.Extract.Tld.ascii_sorted()
false
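To illustrate why longest-first ordering matters for an alternation, here is a small sketch. `TldRegexSketch.build/1` is a hypothetical helper, not part of this module, and the literal list stands in for the output of `ascii_sorted/0`; the sort is repeated inside the helper only so that the example does not depend on its input order.

```elixir
defmodule TldRegexSketch do
  # Hypothetical helper: turns a list of TLD labels into a
  # case-insensitive regex that matches a dot followed by one of them.
  def build(tlds) do
    alternation =
      tlds
      # Longest labels first, so "community" is tried before "com" and "co".
      |> Enum.sort_by(&String.length/1, :desc)
      # Escape each label in case it contains regex metacharacters.
      |> Enum.map_join("|", &Regex.escape/1)

    Regex.compile!("\\.(?:" <> alternation <> ")", "i")
  end
end

regex = TldRegexSketch.build(["co", "com", "community"])
Regex.run(regex, "example.community")
# => [".community"]
```

With the alternatives in shortest-first order, the same input would match only ".co", because a regex alternation takes the first branch that succeeds.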
@spec count() :: non_neg_integer()
Returns the count of TLDs in the bundled IANA list.
Examples
iex> Text.Extract.Tld.count() > 1000
true
Returns the IANA TLD list as a MapSet of lowercased ASCII labels.
Examples
iex> "com" in Text.Extract.Tld.iana()
true
iex> "googleusercontent" in Text.Extract.Tld.iana()
false
@spec idn_unicode() :: [String.t()]
Returns the IDN TLDs in their Unicode form.
Built at compile time from the xn-- ACE entries by passing each
through Unicode.IDNA.to_unicode/1. Used by Text.Extract.Scanner
to extend its bare-host regex with explicit alternatives for IDN
TLDs.
Examples
iex> tlds = Text.Extract.Tld.idn_unicode()
iex> length(tlds) > 100
true
iex> "みんな" in Text.Extract.Tld.idn_unicode()
true
Returns whether label is a known TLD under mode.
Arguments
label is an ASCII string. Pass IDN labels through Unicode.IDNA.to_ascii/2 first.
mode is :iana (default) or :any.
Returns
true if the label is a known TLD under the mode, false otherwise.
Examples
iex> Text.Extract.Tld.tld?("com")
true
iex> Text.Extract.Tld.tld?("COM")
true
iex> Text.Extract.Tld.tld?("baz")
false
iex> Text.Extract.Tld.tld?("baz", :any)
true
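The two modes could be dispatched roughly as below. This is a sketch under stated assumptions rather than the module's implementation: `TldModeSketch` is a hypothetical name, the three-entry set stands in for the bundled IANA list, and the `:any` clause encodes one reading of "any non-empty ASCII label" (every codepoint below 128).

```elixir
defmodule TldModeSketch do
  # Tiny stand-in for the bundled IANA list (sample data only).
  @known MapSet.new(["com", "net", "org"])

  # Function head carrying the default mode.
  def tld?(label, mode \\ :iana)

  # :iana checks membership in the known set, case-insensitively.
  def tld?(label, :iana) when is_binary(label) do
    MapSet.member?(@known, String.downcase(label))
  end

  # :any accepts every non-empty label made only of ASCII codepoints.
  def tld?(label, :any) when is_binary(label) do
    label != "" and ascii?(label)
  end

  defp ascii?(label) do
    label |> String.to_charlist() |> Enum.all?(&(&1 < 128))
  end
end
```

Under these assumptions, `TldModeSketch.tld?("baz")` is `false` while `TldModeSketch.tld?("baz", :any)` is `true`, mirroring the examples above; an empty string or a label containing non-ASCII characters is rejected in both modes.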
@spec version_line() :: String.t() | nil
Returns the version header line from the bundled tlds.txt.
Examples
iex> Text.Extract.Tld.version_line() =~ "Last Updated"
true