Text.WordFreq (Text v0.5.0)

Word frequency lookup tables.

A drop-in equivalent of Python's wordfreq for the use cases that matter most: ranking candidate words during spell correction, filtering rare-but-not-OOV terms in keyword extraction, and reporting how common a word is on a human-readable scale.

Bundled and on-demand language packs

Seven frequency tables are bundled at compile time and loaded with zero I/O on first lookup:

  • en — top 30,000 American English words from the Google Web Trillion Word Corpus (Peter Norvig's distribution at https://norvig.com/ngrams/).

  • de, fr, es, it, nl, pt — top 30,000 entries each from Hermit Dave's MIT-licensed FrequencyWords OpenSubtitles 2018 corpus.

Other languages are resolved through Text.Data from the cache directory (:data_dir/wordfreq/, default ~/.cache/text/wordfreq/). There is no canonical per-language download URL for word-frequency data, so auto-download is not configured by default. To add a language, either set auto_download_wordfreq_data: true and call load_language/2 with an explicit URL or path to the frequency table you want to register, or drop a pre-built <lang>.tsv file (one <word>\t<count> entry per line) into the cache directory.
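As a rough illustration of the TSV shape described above, the following sketch parses word and count pairs into a map. The module name TsvSketch, the sample words, and their counts are made up for this example, and skipping malformed lines is an assumption; the library's own parser may behave differently.

```elixir
defmodule TsvSketch do
  # Hypothetical helper, not part of Text.WordFreq: parses "<word>\t<count>"
  # lines into a map of word => count, skipping blank or malformed lines.
  def parse(contents) when is_binary(contents) do
    contents
    |> String.split("\n", trim: true)
    |> Enum.reduce(%{}, fn line, table ->
      with [word, raw_count] <- String.split(line, "\t", parts: 2),
           {count, ""} <- Integer.parse(String.trim(raw_count)) do
        Map.put(table, word, count)
      else
        # Lines without a tab or without an integer count are ignored.
        _ -> table
      end
    end)
  end
end

# Illustrative counts only; not taken from any real corpus.
table = TsvSketch.parse("maison\t1200\nchat\t800\n")
IO.inspect(table)
```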

Frequency tables are loaded lazily on first access and cached in :persistent_term for the lifetime of the runtime, so subsequent calls are essentially free.
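The load-once behaviour can be pictured with a small sketch built on Erlang's :persistent_term. The module LazyTableSketch, its key shape, and the loader function are assumptions for illustration; the library's internal keys and loading code are not documented here.

```elixir
defmodule LazyTableSketch do
  # Hypothetical cache key; the real key used by the library is unknown.
  defp key(language), do: {__MODULE__, language}

  # Returns the cached table, calling `loader` only when nothing is cached yet.
  def fetch(language, loader) when is_function(loader, 0) do
    case :persistent_term.get(key(language), :not_loaded) do
      :not_loaded ->
        table = loader.()
        :persistent_term.put(key(language), table)
        table

      table ->
        table
    end
  end
end

loader = fn ->
  IO.puts("loading table")
  %{"the" => 3, "of" => 2}
end

LazyTableSketch.fetch(:demo, loader) # prints "loading table"
LazyTableSketch.fetch(:demo, loader) # served from :persistent_term, prints nothing
```

Reads from :persistent_term do not copy the stored term into the calling process, which is why repeated lookups against a large table stay cheap.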

Language input shapes

Every function that takes a :language option accepts an atom (:fr), a string ("fr", "fr-CA"), or a Localize.LanguageTag. Only the base language subtag is used for lookup, so "fr-CA" resolves to the fr table.
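A minimal sketch of reducing these input shapes to a base subtag might look like the following. LanguageSketch is a hypothetical module, returning a lower-case string is an assumption, and the Localize.LanguageTag case is left out so the example has no dependencies.

```elixir
defmodule LanguageSketch do
  # Hypothetical normaliser: reduces an atom or string tag to its base
  # language subtag. A Localize.LanguageTag struct would need its own
  # clause, which is omitted here.
  def base(language) when is_atom(language), do: base(Atom.to_string(language))

  def base(language) when is_binary(language) do
    language
    |> String.split(["-", "_"], parts: 2)
    |> hd()
    |> String.downcase()
  end
end

IO.inspect(LanguageSketch.base(:fr))     # "fr"
IO.inspect(LanguageSketch.base("fr-CA")) # "fr"
```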

Zipf scale

The Zipf scale, popularised by Marc Brysbaert and reproduced by the Python wordfreq library, expresses frequency as log10(count_per_billion) = log10(frequency) + 9. Useful values:

  • 7+ — extremely common (the, of, and).
  • 5-6 — common conversational vocabulary.
  • 3-4 — recognisable, less frequent.
  • 1-2 — rare or technical.
  • 0 — not in the corpus at all.
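The formula can be worked through with a short sketch. ZipfSketch is a hypothetical module; it takes a raw count and a corpus total, and mapping a zero count to 0.0 follows the scale above rather than any documented internals.

```elixir
defmodule ZipfSketch do
  # Zipf score from a raw count and the corpus total: log10(count / total) + 9.
  # A count of 0 (word not in the corpus) maps to 0.0.
  def score(0, _total), do: 0.0

  def score(count, total) when count > 0 and total > 0 do
    :math.log10(count / total) + 9
  end
end

# 1,000 occurrences in a 1,000,000,000-word corpus is 1,000 per billion,
# so the score is approximately 3.0.
IO.inspect(ZipfSketch.score(1_000, 1_000_000_000))
```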

Summary

Functions

count(word, options \\ [])

Returns the raw corpus count of a word in the chosen language.

frequency(word, options \\ [])

Returns the normalised frequency of a word: count divided by the corpus total.

load_language(language)

Pre-loads a frequency table for a language.

rank(word, options \\ [])

Returns the descending-frequency rank of a word.

top(n, options \\ [])

Returns the top n most frequent words in the language.

vocabulary_size(options \\ [])

Returns the size of the loaded vocabulary for a language.

zipf(word, options \\ [])

Returns the Zipf score of a word: log10(frequency) + 9.

Functions

count(word, options \\ [])

@spec count(
  String.t(),
  keyword()
) :: non_neg_integer()

Returns the raw corpus count of a word in the chosen language.

Arguments

  • word is a string. The lookup is case-insensitive.

Options

  • :language is the language. The default is :en.

Returns

  • A non-negative integer count. Returns 0 for unknown words.

Examples

iex> Text.WordFreq.count("the") > Text.WordFreq.count("rare")
true

iex> Text.WordFreq.count("the_definitely_not_a_real_word_xyz")
0
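One way raw counts could feed spell correction is to order candidate corrections by frequency. The sketch below is an assumption about usage, not library code: CandidateSketch is hypothetical, and the counts come from a stand-in map so the example runs without the package.

```elixir
defmodule CandidateSketch do
  # Orders candidate corrections from most to least frequent.
  # `count_fun` is any function from a word to a non-negative integer.
  def rank_candidates(candidates, count_fun) when is_function(count_fun, 1) do
    Enum.sort_by(candidates, count_fun, :desc)
  end
end

# Stand-in counts for illustration only; not real corpus figures.
stub_counts = %{"there" => 900, "three" => 400, "threw" => 50}
count_fun = fn word -> Map.get(stub_counts, word, 0) end

IO.inspect(CandidateSketch.rank_candidates(["threw", "there", "thrre", "three"], count_fun))
# ["there", "three", "threw", "thrre"]
```

With the library loaded, &Text.WordFreq.count/1 should be usable in place of the stub, since count/2 has a default for its options argument.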

frequency(word, options \\ [])

@spec frequency(
  String.t(),
  keyword()
) :: float()

Returns the normalised frequency of a word: count divided by the corpus total.

Arguments

  • word is a string.

Options

  • :language is the language. The default is :en.

Returns

  • A float between 0.0 and 1.0. Returns 0.0 for unknown words.

Examples

iex> Text.WordFreq.frequency("the") > 0.0
true

iex> Text.WordFreq.frequency("definitely_not_a_real_word_xyz")
0.0
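To make "count divided by the corpus total" concrete, here is a sketch over a toy table. FrequencySketch is hypothetical, and treating the total as the sum of the counts in the table is an assumption; the library may use a different corpus total.

```elixir
defmodule FrequencySketch do
  # Normalised frequency: the word's count divided by the sum of all counts.
  # Unknown words return 0.0 without dividing.
  def frequency(table, word) when is_map(table) do
    total = table |> Map.values() |> Enum.sum()

    case Map.get(table, word, 0) do
      0 -> 0.0
      count -> count / total
    end
  end
end

# Toy table with made-up counts.
table = %{"the" => 6, "of" => 3, "zebra" => 1}
IO.inspect(FrequencySketch.frequency(table, "the"))    # 0.6
IO.inspect(FrequencySketch.frequency(table, "quokka")) # 0.0
```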

load_language(language)

@spec load_language(atom() | String.t() | struct()) :: :ok

Pre-loads a frequency table for a language.

Calling this is optional when the file already lives in the cache directory under <lang>.tsv — the first lookup will pick it up automatically. Use this to warm the cache during application startup or to register a custom dictionary under a name of your choosing.

Forms

load_language(language)
load_language(language, tsv_path)

Without an explicit path, the file is resolved through Text.Data (the cache directory is consulted; no canonical URL is configured for :wordfreq, so download is not attempted).

Arguments

  • language is an atom, string, or Localize.LanguageTag.

  • tsv_path is an optional path to a TSV file with word<TAB>count entries.

Returns

  • :ok on success.
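Preparing a custom dictionary file could look like the sketch below. The vocabulary, counts, and file name are invented for illustration, and the file is written to the system temp directory rather than the cache directory.

```elixir
# Made-up counts for a hypothetical domain vocabulary.
entries = [{"kubectl", 120}, {"ingress", 45}, {"sidecar", 12}]

# One "word<TAB>count" entry per line.
contents =
  Enum.map_join(entries, "\n", fn {word, count} -> "#{word}\t#{count}" end) <> "\n"

# The file name is arbitrary; only the TSV layout matters.
path = Path.join(System.tmp_dir!(), "wordfreq_sketch_devops.tsv")
File.write!(path, contents)

IO.puts(File.read!(path))
```

A path like this is the kind of value the tsv_path argument expects, so a call such as Text.WordFreq.load_language(:devops, path) would presumably register the table under the name :devops for later lookups with language: :devops.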

load_language(language, tsv_path)

@spec load_language(atom() | String.t() | struct(), Path.t()) :: :ok

rank(word, options \\ [])

@spec rank(
  String.t(),
  keyword()
) :: pos_integer() | nil

Returns the descending-frequency rank of a word.

Rank 1 is the most frequent word in the corpus.

Arguments

  • word is a string.

Options

  • :language is the language. The default is :en.

Returns

  • A positive integer rank, or nil for unknown words.

Examples

iex> Text.WordFreq.rank("the")
1

iex> Text.WordFreq.rank("definitely_not_a_real_word_xyz")
nil
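Descending-frequency rank can be derived from a count table as in this sketch. RankSketch and the toy counts are hypothetical, and how the library breaks ties between equal counts is not specified here.

```elixir
defmodule RankSketch do
  # Builds a word => rank map in which rank 1 goes to the highest count.
  def ranks(table) when is_map(table) do
    table
    |> Enum.sort_by(fn {_word, count} -> count end, :desc)
    |> Enum.with_index(1)
    |> Map.new(fn {{word, _count}, rank} -> {word, rank} end)
  end
end

# Toy table with made-up counts.
ranks = RankSketch.ranks(%{"of" => 3, "the" => 6, "zebra" => 1})
IO.inspect(Map.get(ranks, "the"))    # 1
IO.inspect(Map.get(ranks, "quokka")) # nil
```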

top(n, options \\ [])

@spec top(
  pos_integer(),
  keyword()
) :: [{String.t(), pos_integer()}]

Returns the top n most frequent words in the language.

Arguments

  • n is the number of entries to return.

Options

  • :language is the language. The default is :en.

Returns

  • A list of {word, count} tuples, ordered by descending count.

Examples

iex> [{first, _} | _] = Text.WordFreq.top(5)
iex> first
"the"

vocabulary_size(options \\ [])

@spec vocabulary_size(keyword()) :: non_neg_integer()

Returns the size of the loaded vocabulary for a language.

Arguments

  • No positional arguments.

Options

  • :language is the language. The default is :en.

Returns

  • The number of distinct words in the loaded frequency table.

Examples

iex> Text.WordFreq.vocabulary_size() > 1000
true

zipf(word, options \\ [])

@spec zipf(
  String.t(),
  keyword()
) :: float()

Returns the Zipf score of a word: log10(frequency) + 9.

Arguments

  • word is a string.

Options

  • :language is the language. The default is :en.

Returns

  • A float Zipf score, or 0.0 for unknown words.

Examples

iex> Text.WordFreq.zipf("the") > 6.0
true

iex> Text.WordFreq.zipf("definitely_not_a_real_word_xyz")
0.0
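For keyword extraction, a Zipf score can separate rare-but-known words from both common words and unknown ones. The sketch below is one possible approach: KeywordFilterSketch, the 3.0 cutoff, and the stub scores are all assumptions made for illustration.

```elixir
defmodule KeywordFilterSketch do
  # Keeps words that are known (score above 0.0) but rarer than `max_zipf`.
  # `zipf_fun` is any function from a word to a float score.
  def rare_known(words, zipf_fun, max_zipf \\ 3.0) when is_function(zipf_fun, 1) do
    Enum.filter(words, fn word ->
      score = zipf_fun.(word)
      score > 0.0 and score < max_zipf
    end)
  end
end

# Stand-in scores for illustration only; not real corpus values.
stub_scores = %{"the" => 7.4, "ontology" => 2.6}
zipf_fun = fn word -> Map.get(stub_scores, word, 0.0) end

IO.inspect(KeywordFilterSketch.rare_known(["the", "ontology", "qwzx"], zipf_fun))
# ["ontology"]
```

With the library loaded, &Text.WordFreq.zipf/1 should be usable in place of the stub, since zipf/2 has a default for its options argument.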