View Source Pdf.Reader.Wordlist (ExPDF v1.0.3)
Compile-time dictionaries used by Pdf.Reader.read/2 to recover word
boundaries that the PDF producer collapsed (e.g. iniciode →
inicio + de).
Lookup is performed against a MapSet of lowercase words. Membership
checks are case-insensitive — the input is downcased before the
MapSet query, so "De", "DE", and "de" all match a stored
"de" entry.
Bundled dictionaries
The :es dictionary is built from TWO bundled wordlists merged at
compile time:
1. priv/wordlists/spanish.txt (50,000 entries, ~428 KB)
The 50,000 most-frequent Spanish words from the hermitdave/FrequencyWords project (MIT License, © Hermit Dave), derived from the OpenSubtitles 2018 corpus. Covers conversational, technical, and legal Spanish vocabulary. Format: one lowercase word per line (the upstream repo also includes per-word frequencies; we strip those and keep the words only).
2. priv/wordlists/spanish_mx_extras.txt (~700 entries)
Mexican tax/legal/government vocabulary curated specifically for this project, covering terms that are missing from the subtitle-derived 50k list. This includes:
- SAT (Servicio de Administración Tributaria) terminology:
padrón,tributario/tributarios,federativa,asimilado,lineamientos,contribuyente,recaudación,fiscalización,gravable,deducible, etc. - Common labour / employment terms:
asalariado,honorario,arrendamiento,prestación,salario/sueldovariants. - Document/process terms:
constancia,cédula,expedición,notarial,apoderado,domiciliado, etc. - Adverbs and verb conjugations not in the base list:
inmediatamente,posteriormente,denúnciala,conferidas,corresponda, etc. - Common periodicity words:
mensual,trimestral,bimestral,quincenal, etc.
Released under the same MIT License as the rest of the project.
Filtering
A small @es_blacklist removes colloquial/slang merges that show
up in subtitle frequency lists but harm partition splitting (e.g.
dela is colloquial for "de la"; if left in the dict the
whole-token guard would prevent the correct "de" + "la" split).
Currently blacklisted: dela, pal.
Final size
~50,500 unique entries after merge and blacklist filter. Loaded
into a MapSet at compile time via @external_resource so there
is zero IO cost at runtime.
Custom dictionaries
Callers can supply any MapSet.t() of lowercase strings as the
dictionary opt — e.g. a Hunspell .dic file loaded into a MapSet,
or a frequency list specific to the user's domain.
Spec references
Wordlist licensing follows the upstream MIT license — attribution is preserved in the project README and LICENSE.md.
Summary
Functions
Case-insensitive membership check.
Resolves a dictionary specifier to a MapSet.
:es→ bundled Spanish wordlist%MapSet{}→ returned as-isnil→ returned as-is (caller should treat as "no dictionary")
Returns the bundled Spanish wordlist as a MapSet of lowercase
strings.
Lazy-loaded — the wordlist files are only read from priv/ the
first time this function is called. Result is cached in
:persistent_term so subsequent calls are O(1). Callers who never
pass dictionary: :es to Pdf.Reader.read/2 never touch the disk
and don't pay the ~50 ms one-time load cost.