# `Text.Lemma`
[🔗](https://github.com/kipcole9/text/blob/v0.5.0/lib/lemma.ex#L1)

Dictionary-driven lemmatization.

Reduces an inflected word to its dictionary form (the *lemma*).
Examples: `running → run`, `mice → mouse`, `cats → cat`.

This module follows the same approach as Python's
[`simplemma`](https://github.com/adbar/simplemma): a flat
inflected-form-to-lemma lookup table per language. There is no POS
tagging and no morphological analysis, just a dictionary. Coverage is
broad for common vocabulary but poor for rare or technical terms.
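
The flat-table approach can be illustrated with a minimal sketch. This is not the module's actual implementation: `LemmaSketch` and its three-entry table are hypothetical, and a real dictionary would hold many thousands of entries.

```elixir
defmodule LemmaSketch do
  # Hypothetical toy table: lowercase inflected form => lemma.
  @table %{
    "running" => "run",
    "mice" => "mouse",
    "cats" => "cat"
  }

  # Returns the lemma for `word`, or `word` unchanged when it is not a key.
  # The input is downcased before lookup, so matching is case-insensitive.
  def lemmatize(word, table \\ @table) when is_binary(word) do
    Map.get(table, String.downcase(word), word)
  end

  # True only when `word` appears as an inflected form (a key) in the table.
  def known?(word, table \\ @table) when is_binary(word) do
    Map.has_key?(table, String.downcase(word))
  end
end
```

Because the table is keyed by inflected form, a lemma such as `run` is not a key in this sketch unless it is also listed as a form of some entry.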

### Bundled and on-demand language packs

English is bundled at compile time and loaded with zero I/O. The
other lemma dictionaries published by Michal Boleslav Měchura's
[`lemmatization-lists`](https://github.com/michmech/lemmatization-lists)
project (Open Database License) are large enough that bundling
them all would push the package over Hex's size limit, so they
are **not** shipped with the package and are instead loaded on
demand via `Text.Data`. Available upstream:

* Western European: `de`, `es`, `fr`, `it`, `pt`, `ca`, `gl`, `ast`
* Northern European: `sv`, `et`
* Central / Eastern European: `cs`, `sk`, `sl`, `bg`, `ro`, `ru`,
  `uk`, `hu`
* Celtic: `ga`, `gd`, `cy`, `gv`
* Other: `fa`

No `nl` (Dutch) file exists upstream. If you need Dutch, register a
third-party dictionary via `load_language/2`.

Three ways to make a non-English pack available:

* Run `mix text.download_lemma_data <lang> [...]` once. This
  fetches the upstream files and places them in the configured
  `Text.Data` cache so that subsequent lookups need no network
  access. Set `data_dir:` in app config if you want them written
  somewhere other than `~/.cache/text/lemma/`.

* Set `config :text, auto_download_lemma_data: true` and let the
  first lookup for an uncached language fetch from upstream
  automatically. Without that flag, calls for an uncached
  language raise an `ArgumentError` explaining the situation.

* Drop the `lemmatization-<lang>.txt` file into the configured
  cache directory yourself, or call `load_language/2` with an
  explicit path.
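
As a rough illustration, the two configuration keys mentioned above might be set as follows. This is a sketch under assumptions: the text does not say which application key `data_dir:` belongs to, so placing it under `:text` is a guess, and the directory path is made up.

```elixir
# config/config.exs
import Config

config :text,
  # Let the first lookup for an uncached language fetch its dictionary.
  auto_download_lemma_data: true,
  # Assumption: `data_dir:` lives under the `:text` key. Hypothetical path.
  data_dir: "/var/lib/my_app/text_data"
```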

### Language input shapes

The `:language` option accepts an atom (`:de`), a string (`"de"`,
`"de-CH"`), or a `Localize.LanguageTag` struct. The base language
subtag is used to pick the upstream file.

Lookup is case-insensitive on the input; the returned lemma uses
the casing of the dictionary entry. Unknown words are returned
unchanged.
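
To show what "base language subtag" means in practice, here is a minimal sketch that reduces an atom or string to that subtag. `LanguageSketch.base_language/1` is a hypothetical helper, not part of the library, and it deliberately leaves out the `Localize.LanguageTag` case.

```elixir
defmodule LanguageSketch do
  # Hypothetical helper: reduces an atom or a language-tag string to its
  # lowercase base language subtag. Structs are out of scope here.
  def base_language(language) when is_atom(language) do
    language |> Atom.to_string() |> base_language()
  end

  def base_language(language) when is_binary(language) do
    language
    # Split on the first "-" or "_" only; the head is the base subtag.
    |> String.split(["-", "_"], parts: 2)
    |> hd()
    |> String.downcase()
  end
end
```

Under this sketch, `:de`, `"de"`, and `"de-CH"` all resolve to `"de"`, so they would all select the same upstream file.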

# `known?`

```elixir
@spec known?(
  String.t(),
  keyword()
) :: boolean()
```

Returns `true` if the word has a known lemma in the dictionary.

### Arguments

* `word` is a string.

### Options

* `:language` is the language. Default `:en`.

### Returns

* `true` if the word appears as an inflected form in the
  dictionary, `false` otherwise. A word that *is* its own lemma
  returns `true` only if it also appears as an inflected form
  (e.g. `runs` is a form of `run`; `run` itself may not be a key).

### Examples

    iex> Text.Lemma.known?("running")
    true

    iex> Text.Lemma.known?("xyznotaword")
    false

# `lemmatize`

```elixir
@spec lemmatize(
  String.t(),
  keyword()
) :: String.t()
```

Returns the lemma of a single word.

### Arguments

* `word` is a string. Lookup is case-insensitive.

### Options

* `:language` is the language. Default `:en`. Accepts an atom, a
  string, or a `Localize.LanguageTag`.

### Returns

* The lemma if known, otherwise the input unchanged.

### Examples

    iex> Text.Lemma.lemmatize("running")
    "run"

    iex> Text.Lemma.lemmatize("mice")
    "mouse"

    iex> Text.Lemma.lemmatize("xyznotaword")
    "xyznotaword"

# `lemmatize_text`

```elixir
@spec lemmatize_text(
  String.t(),
  keyword()
) :: String.t()
```

Lemmatizes every word in a string.

### Arguments

* `text` is a string of zero or more words.

### Options

* `:language` is the language. Default `:en`.

### Returns

* A string with each word replaced by its lemma. Whitespace and
  surrounding punctuation are preserved as-is.

### Examples

    iex> Text.Lemma.lemmatize_text("the cats are running")
    "the cat be run"
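
One way to replace words while leaving whitespace and punctuation untouched is a regex replacement over runs of letters. The sketch below assumes that strategy; the library may tokenize differently, and `TextSketch` and its table are hypothetical.

```elixir
defmodule TextSketch do
  # Hypothetical toy table: lowercase inflected form => lemma.
  @table %{"cats" => "cat", "are" => "be", "running" => "run"}

  def lemmatize_text(text, table \\ @table) when is_binary(text) do
    # \p{L}+ matches runs of Unicode letters. Everything between matches
    # (spaces, commas, digits, ...) is copied through unchanged.
    Regex.replace(~r/\p{L}+/u, text, fn word ->
      Map.get(table, String.downcase(word), word)
    end)
  end
end
```

A letter-run pattern like this splits contractions such as `don't` at the apostrophe, which is one reason a real tokenizer may need a more careful definition of a word.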

# `load_language`

```elixir
@spec load_language(atom() | String.t() | struct()) :: :ok
```

Pre-loads a lemma dictionary for a language.

Calling this is **optional**. The first lookup for a given
language already triggers the same load (and, if enabled, the
download). Use this to warm the cache during application startup
or to register a custom dictionary under a name of your choosing.

### Forms

    load_language(language)
    load_language(language, tsv_path)

Without an explicit path, the upstream URL is derived from the
language input and the file is fetched via `Text.Data`. The network
fetch only occurs when auto-download is enabled.

### Arguments

* `language` is an atom, string, or `Localize.LanguageTag`.

* `tsv_path` is an optional path to a TSV file with
  `lemma<TAB>form` entries.

### Returns

* `:ok` on success.
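
For readers preparing a custom dictionary, the following sketch shows how `lemma<TAB>form` lines could be turned into a form-to-lemma map. It is an illustration only: `TsvSketch.parse/1` is hypothetical, and skipping malformed lines is an assumption about desirable behaviour, not a description of what `load_language/2` does.

```elixir
defmodule TsvSketch do
  # Parses `lemma<TAB>form` lines into a %{form => lemma} map.
  # Forms are downcased so lookups can be case-insensitive; lemmas keep
  # the casing found in the file.
  def parse(contents) when is_binary(contents) do
    contents
    |> String.split(~r/\r?\n/, trim: true)
    |> Enum.reduce(%{}, fn line, acc ->
      case String.split(line, "\t", parts: 2) do
        [lemma, form] when lemma != "" and form != "" ->
          Map.put(acc, String.downcase(form), lemma)

        # Assumption: lines without exactly one usable pair are skipped.
        _malformed ->
          acc
      end
    end)
  end
end
```

If the same form appears on more than one line, the last entry wins in this sketch.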

# `load_language`

```elixir
@spec load_language(atom() | String.t() | struct(), Path.t()) :: :ok
```
