Word frequency lookup tables.
A drop-in equivalent of Python's
wordfreq for the use cases
that matter most: ranking candidate words during spell correction,
filtering rare-but-not-OOV terms in keyword extraction, and
reporting how common a word is on a human-readable scale.
Bundled and on-demand language packs
Seven frequency tables are bundled at compile time and loaded with zero I/O on first lookup:
- en: top 30,000 American English words from the Google Web Trillion Word Corpus (Peter Norvig's distribution at https://norvig.com/ngrams/).
- de, fr, es, it, nl, pt: top 30,000 entries each from Hermit Dave's MIT-licensed FrequencyWords OpenSubtitles 2018 corpus.
Other languages are resolved through Text.Data from the cache
directory (:data_dir/wordfreq/, default
~/.cache/text/wordfreq/). There is no canonical per-language
download URL for word-frequency data, so auto-download is not
configured by default; set auto_download_wordfreq_data: true
and call load_language/2 with an explicit URL/path when you
have a frequency table to register, or drop pre-built
<lang>.tsv files (with <word>\t<count> per line) into the
cache directory.
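The `<word>\t<count>` layout described above can be parsed in a few lines. The following is only a sketch: the module `TsvSketch` and its `parse/1` function are hypothetical names, not part of Text.WordFreq, and the library's own loader may treat malformed rows or duplicates differently. The counts are made up.

```elixir
defmodule TsvSketch do
  # Hypothetical helper, not part of Text.WordFreq: parses lines of the form
  # word<TAB>count into a map of word => count, skipping rows that do not match.
  def parse(contents) when is_binary(contents) do
    contents
    |> String.split("\n", trim: true)
    |> Enum.reduce(%{}, fn line, acc ->
      with [word, raw_count] <- String.split(line, "\t"),
           {count, ""} <- Integer.parse(String.trim(raw_count)) do
        Map.put(acc, word, count)
      else
        # Wrong number of columns or a non-integer count: ignore the row.
        _ -> acc
      end
    end)
  end
end

# Made-up counts; the third row is deliberately malformed and is dropped.
table = TsvSketch.parse("the\t1000\nof\t600\nnot a valid row\n")
IO.inspect(table)
```

Skipping bad rows silently is a design choice made for brevity here; a real loader might prefer to report them.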
Frequency tables are loaded lazily on first access and cached in
:persistent_term for the lifetime of the runtime, so subsequent
calls skip file I/O and parsing and read the table directly from memory.
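The lazy-load-then-cache pattern can be sketched as follows. This illustrates the general technique with Erlang's `:persistent_term`, not the library's internals: the module `LazyTableSketch`, the key shape, and the `loader` function are all assumptions.

```elixir
defmodule LazyTableSketch do
  # Hypothetical illustration of lazy loading backed by :persistent_term.
  # The key shape and the zero-arity loader are assumptions for this sketch.
  def fetch(language, loader) when is_function(loader, 0) do
    key = {__MODULE__, language}

    case :persistent_term.get(key, :not_loaded) do
      :not_loaded ->
        table = loader.()
        # Writing to :persistent_term is costly, so it happens once per language.
        :persistent_term.put(key, table)
        table

      table ->
        # Already cached: returned without running the loader again.
        table
    end
  end
end

loader = fn ->
  IO.puts("loading table")
  %{"the" => 1000}
end

# The loader runs on the first call only; the second call hits the cache.
LazyTableSketch.fetch(:en, loader)
LazyTableSketch.fetch(:en, loader)
```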
Language input shapes
Every option that takes a :language accepts an atom (:fr), a
string ("fr", "fr-CA"), or a Localize.LanguageTag. The base
language subtag is used for lookup.
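As a rough illustration of reducing an input to its base language subtag, the sketch below handles atoms and strings only. `LanguageSketch.base_language/1` is a hypothetical helper; whether the library normalises to a string or an atom, and how it treats a Localize.LanguageTag, is not stated here.

```elixir
defmodule LanguageSketch do
  # Hypothetical normaliser: reduces an atom or string language identifier
  # to its lowercase base subtag. Localize.LanguageTag structs are not
  # handled because their fields are not described in this document.
  def base_language(language) when is_atom(language) do
    language |> Atom.to_string() |> base_language()
  end

  def base_language(language) when is_binary(language) do
    language
    |> String.split(["-", "_"], parts: 2)
    |> hd()
    |> String.downcase()
  end
end

IO.inspect(LanguageSketch.base_language(:fr))
IO.inspect(LanguageSketch.base_language("fr-CA"))
```

Under this sketch both calls yield "fr", so a regional tag falls back to the table for its base language.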
Zipf scale
The Zipf scale, popularised by Marc Brysbaert and reproduced by
the Python wordfreq library, expresses frequency as
log10(count_per_billion) = log10(frequency) + 9. Useful values:
- 7+: extremely common (the, of, and).
- 5-6: common conversational vocabulary.
- 3-4: recognisable, less frequent.
- 1-2: rare or technical.
- 0: not in the corpus at all.
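To make the formula concrete, here is a minimal sketch that computes a Zipf score from a raw count and a corpus total. `ZipfSketch` is a hypothetical module and the numbers are made up; it mirrors the formula above and the documented 0.0 result for unknown words, but it is not the library's implementation.

```elixir
defmodule ZipfSketch do
  # Hypothetical helper mirroring log10(frequency) + 9; not the library's code.
  # A zero count maps to 0.0, matching the documented unknown-word behaviour.
  def zipf(0, _total), do: 0.0

  def zipf(count, total) when count > 0 and total > 0 do
    :math.log10(count / total) + 9
  end
end

# Made-up numbers: a word seen 1,000 times in a 1,000,000-token corpus has
# frequency 0.001, which is one million occurrences per billion tokens.
IO.inspect(ZipfSketch.zipf(1_000, 1_000_000))
```

With these numbers the score comes out at about 6, since log10(0.001) is -3, which places the word at the top of the common conversational band.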
Functions
@spec count(String.t(), keyword()) :: non_neg_integer()
Returns the raw corpus count of a word in the chosen language.
Arguments
- word is a string. The lookup is case-insensitive.
Options
- :language is the language. The default is :en.
Returns
- A non-negative integer count. Returns 0 for unknown words.
Examples
iex> Text.WordFreq.count("the") > Text.WordFreq.count("rare")
true
iex> Text.WordFreq.count("the_definitely_not_a_real_word_xyz")
0
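One use for raw counts is ordering spelling-correction candidates from most to least frequent. The sketch below takes the lookup as a function argument so that it runs without the library; `CandidateRankSketch` and the counts are hypothetical, and in real use the lookup could be something like `&Text.WordFreq.count/1`.

```elixir
defmodule CandidateRankSketch do
  # Hypothetical helper: orders spelling candidates by descending count.
  # `count_fun` stands in for a lookup such as Text.WordFreq.count/1.
  def rank(candidates, count_fun) when is_function(count_fun, 1) do
    Enum.sort_by(candidates, count_fun, :desc)
  end
end

# Made-up counts, used only for illustration. Unknown words count as 0,
# which mirrors the documented return value for unknown words.
counts = %{"there" => 900, "three" => 400, "threw" => 50}
lookup = fn word -> Map.get(counts, String.downcase(word), 0) end

IO.inspect(CandidateRankSketch.rank(["threw", "thre", "three", "there"], lookup))
```

Because unknown words count as 0, out-of-vocabulary candidates such as "thre" sort to the end of the list.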
frequency/2
Returns the normalised frequency of a word: count divided by the corpus total.
Arguments
- word is a string.
Options
- :language is the language. The default is :en.
Returns
- A float between 0.0 and 1.0. Returns 0.0 for unknown words.
Examples
iex> Text.WordFreq.frequency("the") > 0.0
true
iex> Text.WordFreq.frequency("definitely_not_a_real_word_xyz")
0.0
load_language/2
Pre-loads a frequency table for a language.
Calling this is optional when the file already lives in the
cache directory under <lang>.tsv; the first lookup will pick
it up automatically. Use this to warm the cache during application
startup or to register a custom dictionary under a name of your
choosing.
Forms
load_language(language)
load_language(language, tsv_path)
Without an explicit path, the file is resolved through Text.Data
(the cache directory is consulted; no canonical URL is configured
for :wordfreq, so download is not attempted).
Arguments
- language is an atom, string, or Localize.LanguageTag.
- tsv_path is an optional path to a TSV file with word<TAB>count entries.
Returns
- :ok on success.
@spec rank(String.t(), keyword()) :: pos_integer() | nil
Returns the descending-frequency rank of a word.
Rank 1 is the most frequent word in the corpus.
Arguments
- word is a string.
Options
- :language is the language. The default is :en.
Returns
- A positive integer rank, or nil for unknown words.
Examples
iex> Text.WordFreq.rank("the")
1
iex> Text.WordFreq.rank("definitely_not_a_real_word_xyz")
nil
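A descending-frequency rank is the 1-based position of a word once the table is sorted by count. The sketch below derives it from a plain word-to-count map; `RankSketch` and the counts are hypothetical, and the library's internal representation may well differ (for example, ranks could be precomputed at load time).

```elixir
defmodule RankSketch do
  # Hypothetical helpers over a plain word => count map.
  # Returns {word, count} tuples ordered by descending count.
  def ordered(table) do
    Enum.sort_by(table, fn {_word, count} -> count end, :desc)
  end

  def rank(table, word) do
    index =
      table
      |> ordered()
      |> Enum.find_index(fn {w, _count} -> w == word end)

    # find_index is zero-based; ranks start at 1, and unknown words yield nil.
    if index, do: index + 1, else: nil
  end
end

# Made-up counts, used only for illustration.
table = %{"the" => 1000, "of" => 600, "cat" => 20}
IO.inspect(RankSketch.rank(table, "the"))
IO.inspect(RankSketch.rank(table, "dog"))
```

Taking the first n entries of the same ordering gives a list of the n most frequent words with their counts.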
@spec top(pos_integer(), keyword()) :: [{String.t(), pos_integer()}]
Returns the top n most frequent words in the language.
Arguments
- n is the number of entries to return.
Options
- :language is the language. The default is :en.
Returns
- A list of {word, count} tuples, ordered by descending count.
Examples
iex> [{first, _} | _] = Text.WordFreq.top(5)
iex> first
"the"
@spec vocabulary_size(keyword()) :: non_neg_integer()
Returns the size of the loaded vocabulary for a language.
Arguments
- No positional arguments.
Options
- :language is the language. The default is :en.
Returns
- The number of distinct words in the loaded frequency table.
Examples
iex> Text.WordFreq.vocabulary_size() > 1000
true
zipf/2
Returns the Zipf score of a word: log10(frequency) + 9.
Arguments
- word is a string.
Options
- :language is the language. The default is :en.
Returns
- A float Zipf score, or 0.0 for unknown words.
Examples
iex> Text.WordFreq.zipf("the") > 6.0
true
iex> Text.WordFreq.zipf("definitely_not_a_real_word_xyz")
0.0