# `Text.WordFreq`
[🔗](https://github.com/kipcole9/text/blob/v0.5.0/lib/word_freq.ex#L1)

Word frequency lookup tables.

An Elixir counterpart to Python's
[`wordfreq`](https://pypi.org/project/wordfreq/) covering its most
common use cases: ranking candidate words during spell correction,
filtering rare-but-not-OOV terms in keyword extraction, and
reporting how common a word is on a human-readable scale.
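
As an illustration of the spell-correction use case, the sketch below orders candidate corrections by corpus count. The `CandidateRanker` module and the stand-in counts are hypothetical and exist only for this example; with the library available, passing `&Text.WordFreq.count/1` as the lookup function should fit the same shape.

```elixir
defmodule CandidateRanker do
  # Orders candidate corrections from most to least frequent.
  # `count_fun` is any 1-arity function returning a non-negative count.
  def rank(candidates, count_fun) when is_function(count_fun, 1) do
    Enum.sort_by(candidates, count_fun, :desc)
  end
end

# Stand-in counts, made up for illustration (not real corpus data).
counts = %{"their" => 900, "there" => 1200, "thier" => 0}
count_fun = fn word -> Map.get(counts, String.downcase(word), 0) end

CandidateRanker.rank(["thier", "their", "there"], count_fun)
# => ["there", "their", "thier"]
```

Taking the lookup as a function keeps the ranking logic independent of where the counts come from.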

### Bundled and on-demand language packs

Seven frequency tables are bundled at compile time and loaded
with zero I/O on first lookup:

* `en` — top 30,000 American English words from the Google Web
  Trillion Word Corpus (Peter Norvig's distribution at
  <https://norvig.com/ngrams/>).

* `de`, `fr`, `es`, `it`, `nl`, `pt` — top 30,000 entries each
  from Hermit Dave's MIT-licensed
  [FrequencyWords](https://github.com/hermitdave/FrequencyWords)
  OpenSubtitles 2018 corpus.

Other languages are resolved through `Text.Data` from the cache
directory (`:data_dir`/`wordfreq/`, default
`~/.cache/text/wordfreq/`). Word-frequency data has no canonical
per-language download URL, so auto-download is not configured by
default. To register a frequency table, either set
`auto_download_wordfreq_data: true` and call `load_language/2` with
an explicit URL or path, or drop a pre-built `<lang>.tsv` file (one
`<word>\t<count>` entry per line) into the cache directory.
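
To make the `<word>\t<count>` file format concrete, here is a minimal parsing sketch. It is not the library's loader: the `TsvSketch` module is hypothetical, and skipping malformed lines is an assumption made for this example.

```elixir
defmodule TsvSketch do
  # Parses `word<TAB>count` lines into a %{word => count} map.
  # Blank or malformed lines are skipped rather than raising.
  def parse(contents) when is_binary(contents) do
    contents
    |> String.split("\n", trim: true)
    |> Enum.reduce(%{}, fn line, acc ->
      with [word, raw_count] <- String.split(line, "\t"),
           {count, ""} <- Integer.parse(String.trim(raw_count)) do
        Map.put(acc, word, count)
      else
        _ -> acc
      end
    end)
  end
end

TsvSketch.parse("le\t120\nchat\t7\nbroken line\n")
# => %{"chat" => 7, "le" => 120}
```

To parse a file on disk, the contents could be read first with `File.read!/1` and passed to `parse/1`.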

Frequency tables are loaded lazily on first access and cached in
`:persistent_term` for the lifetime of the VM, so subsequent
lookups involve no file I/O or parsing.
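
The lazy-load-then-cache pattern can be sketched with `:persistent_term` as follows. The `LazyTableSketch` module, the key shape, and the `loader` callback are assumptions for illustration and do not reflect the library's internals.

```elixir
defmodule LazyTableSketch do
  # Returns the table for `lang`, invoking `loader` only on the first call.
  # The key shape {__MODULE__, lang} is an arbitrary choice for this sketch.
  def fetch(lang, loader) when is_function(loader, 0) do
    key = {__MODULE__, lang}

    case :persistent_term.get(key, :not_loaded) do
      :not_loaded ->
        table = loader.()
        :persistent_term.put(key, table)
        table

      table ->
        table
    end
  end
end

# The first call runs the loader; the second is served from the cache.
LazyTableSketch.fetch(:xx, fn -> %{"the" => 10} end)
LazyTableSketch.fetch(:xx, fn -> raise "not called again" end)
```

In this sketch, concurrent first calls from different processes may each run the loader once; the stored result is the same, so the only cost is duplicated work.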

### Language input shapes

Every option that takes a `:language` accepts an atom (`:fr`), a
string (`"fr"`, `"fr-CA"`), or a `Localize.LanguageTag`. The base
language subtag is used for lookup.
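
A rough sketch of reducing these input shapes to a base language subtag is shown below. The `LanguageKeySketch` module is hypothetical, it covers only atoms and strings (the `Localize.LanguageTag` case is left out), and returning a lowercase string is a choice made for this example.

```elixir
defmodule LanguageKeySketch do
  # Reduces an atom or string language identifier to its base subtag.
  def base(language) when is_atom(language) do
    language |> Atom.to_string() |> base()
  end

  def base(language) when is_binary(language) do
    language
    |> String.split("-", parts: 2)
    |> hd()
    |> String.downcase()
  end
end

LanguageKeySketch.base(:fr)      # => "fr"
LanguageKeySketch.base("fr-CA")  # => "fr"
```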

### Zipf scale

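The Zipf scale, popularised by Marc Brysbaert and reproduced by
the Python `wordfreq` library, expresses frequency as the base-10
logarithm of occurrences per billion words:
`log10(count_per_billion) = log10(frequency) + 9`, where `frequency`
is the normalised value (count divided by the corpus total).
Typical values: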

* 7+ — extremely common (`the`, `of`, `and`).
* 5-6 — common conversational vocabulary.
* 3-4 — recognisable, less frequent.
* 1-2 — rare or technical.
* 0   — not in the corpus at all.
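
As a worked example of the formula, the sketch below computes a Zipf score from a relative frequency or from a raw count and corpus total. The `ZipfSketch` module and its function names are hypothetical; only the arithmetic comes from the definition above.

```elixir
defmodule ZipfSketch do
  # Zipf score from a relative frequency in the range 0.0..1.0.
  # A zero frequency maps to 0.0, matching the "not in the corpus" row above.
  def from_frequency(frequency) when frequency > 0, do: :math.log10(frequency) + 9
  def from_frequency(_zero), do: 0.0

  # Convenience wrapper for a raw count and a corpus total.
  def from_count(count, total) when total > 0, do: from_frequency(count / total)
end

# 1,000 occurrences in a billion-token corpus is one per million.
ZipfSketch.from_count(1_000, 1_000_000_000) |> Float.round(2)
# => 3.0
```

The guard on `frequency > 0` avoids calling `:math.log10/1` on zero, which would raise an `ArithmeticError`.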

# `count`

```elixir
@spec count(
  String.t(),
  keyword()
) :: non_neg_integer()
```

Returns the raw corpus count of a word in the chosen language.

### Arguments

* `word` is a string. The lookup is case-insensitive.

### Options

* `:language` is the language. The default is `:en`.

### Returns

* A non-negative integer count. Returns `0` for unknown words.

### Examples

    iex> Text.WordFreq.count("the") > Text.WordFreq.count("rare")
    true

    iex> Text.WordFreq.count("the_definitely_not_a_real_word_xyz")
    0

# `frequency`

```elixir
@spec frequency(
  String.t(),
  keyword()
) :: float()
```

Returns the normalised frequency of a word: count divided by the
corpus total.

### Arguments

* `word` is a string.

### Options

* `:language` is the language. The default is `:en`.

### Returns

* A float between `0.0` and `1.0`. Returns `0.0` for unknown words.

### Examples

    iex> Text.WordFreq.frequency("the") > 0.0
    true

    iex> Text.WordFreq.frequency("definitely_not_a_real_word_xyz")
    0.0
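
The relationship between count and normalised frequency can be sketched on a plain map. This is an illustration only: the `FrequencySketch` module and the toy table are hypothetical, and a real implementation would presumably precompute the corpus total instead of summing on every call.

```elixir
defmodule FrequencySketch do
  # Relative frequency of `word` in a %{word => count} table.
  def frequency(table, word) when is_map(table) do
    total = table |> Map.values() |> Enum.sum()

    case Map.get(table, word, 0) do
      0 -> 0.0
      count -> count / total
    end
  end
end

table = %{"the" => 6, "cat" => 3, "sat" => 1}
FrequencySketch.frequency(table, "cat")
# => 0.3
```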

# `load_language`

```elixir
@spec load_language(atom() | String.t() | struct()) :: :ok
```

Pre-loads a frequency table for a language.

Calling this is **optional** when the file already lives in the
cache directory under `<lang>.tsv` — the first lookup will pick
it up automatically. Use this to warm the cache during application
startup or to register a custom dictionary under a name of your
choosing.

### Forms

    load_language(language)
    load_language(language, tsv_path)

Without an explicit path, the file is resolved through `Text.Data`
(the cache directory is consulted; no canonical URL is configured
for `:wordfreq`, so download is not attempted).

### Arguments

* `language` is an atom, string, or `Localize.LanguageTag`.

* `tsv_path` is an optional path to a TSV file with
  `word<TAB>count` entries.

### Returns

* `:ok` on success.

# `load_language`

```elixir
@spec load_language(atom() | String.t() | struct(), Path.t()) :: :ok
```

# `rank`

```elixir
@spec rank(
  String.t(),
  keyword()
) :: pos_integer() | nil
```

Returns the descending-frequency rank of a word.

Rank `1` is the most frequent word in the corpus.

### Arguments

* `word` is a string.

### Options

* `:language` is the language. The default is `:en`.

### Returns

* A positive integer rank, or `nil` for unknown words.

### Examples

    iex> Text.WordFreq.rank("the")
    1

    iex> Text.WordFreq.rank("definitely_not_a_real_word_xyz")
    nil
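
One way to derive descending-frequency ranks from a count table is sketched below. The `RankSketch` module is hypothetical, and the alphabetical tie-break is a choice made for this example; how the library orders words with equal counts is not stated here.

```elixir
defmodule RankSketch do
  # Builds %{word => rank} from %{word => count}; rank 1 is the highest count.
  # Ties are broken alphabetically, a choice made for this sketch.
  def ranks(table) when is_map(table) do
    table
    |> Enum.sort_by(fn {word, count} -> {-count, word} end)
    |> Enum.with_index(1)
    |> Map.new(fn {{word, _count}, rank} -> {word, rank} end)
  end
end

ranks = RankSketch.ranks(%{"the" => 6, "cat" => 3, "sat" => 1})
Map.get(ranks, "the")  # => 1
Map.get(ranks, "dog")  # => nil
```

Building the rank map once turns each later lookup into a single map access, with `nil` falling out naturally for unknown words.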

# `top`

```elixir
@spec top(
  pos_integer(),
  keyword()
) :: [{String.t(), pos_integer()}]
```

Returns the top `n` most frequent words in the language.

### Arguments

* `n` is the number of entries to return.

### Options

* `:language` is the language. The default is `:en`.

### Returns

* A list of `{word, count}` tuples, ordered by descending count.

### Examples

    iex> [{first, _} | _] = Text.WordFreq.top(5)
    iex> first
    "the"

# `vocabulary_size`

```elixir
@spec vocabulary_size(keyword()) :: non_neg_integer()
```

Returns the size of the loaded vocabulary for a language.

### Arguments

* No positional arguments.

### Options

* `:language` is the language. The default is `:en`.

### Returns

* The number of distinct words in the loaded frequency table.

### Examples

    iex> Text.WordFreq.vocabulary_size() > 1000
    true

# `zipf`

```elixir
@spec zipf(
  String.t(),
  keyword()
) :: float()
```

Returns the Zipf score of a word: `log10(frequency) + 9`.

### Arguments

* `word` is a string.

### Options

* `:language` is the language. The default is `:en`.

### Returns

* A float Zipf score, or `0.0` for unknown words.

### Examples

    iex> Text.WordFreq.zipf("the") > 6.0
    true

    iex> Text.WordFreq.zipf("definitely_not_a_real_word_xyz")
    0.0
