Text.Lemma (Text v0.5.0)

Dictionary-driven lemmatization.

Reduces an inflected word to its dictionary form (the lemma). Examples: running → run, mice → mouse, cats → cat.

This module follows the same approach as Python's simplemma: a flat inflected-form-to-lemma lookup table per language. There is no POS tagging and no morphological analysis, only a dictionary. Coverage is broad for common vocabulary and sparse for rare or technical terms.
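
The lookup-table approach described above can be sketched in a few lines. This is a minimal illustration, not the package's implementation: LemmaSketch and its functions are hypothetical names, and the input uses the lemma, tab, form line layout that load_language/2 documents.

```elixir
defmodule LemmaSketch do
  # Hypothetical module for illustration; not part of the Text package.

  # Builds a form => lemma map from lines shaped as "lemma\tform".
  def parse(tsv) do
    tsv
    |> String.split("\n", trim: true)
    |> Enum.reduce(%{}, fn line, acc ->
      case String.split(line, "\t") do
        [lemma, form] -> Map.put(acc, String.downcase(form), lemma)
        _malformed -> acc
      end
    end)
  end

  # Case-insensitive lookup; unknown words come back unchanged.
  def lemmatize(dict, word) do
    Map.get(dict, String.downcase(word), word)
  end

  # True only when the word appears as an inflected-form key.
  def known?(dict, word) do
    Map.has_key?(dict, String.downcase(word))
  end
end

dict = LemmaSketch.parse("run\trunning\nmouse\tmice\ncat\tcats\n")

IO.inspect(LemmaSketch.lemmatize(dict, "Running"))     # "run"
IO.inspect(LemmaSketch.lemmatize(dict, "xyznotaword")) # "xyznotaword"
IO.inspect(LemmaSketch.known?(dict, "run"))            # false
```

Because the map is keyed by inflected form, a lemma such as run is only found when it also appears in the form column.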

Bundled and on-demand language packs

English is bundled at compile time and loaded with zero I/O. The other lemma dictionaries come from Michal Boleslav Měchura's lemmatization-lists project (Open Database License). Bundling them all would push the package over Hex's size limit, so they are not shipped with the package and are loaded on demand via Text.Data instead. Available upstream:

  • Western European: de, es, fr, it, pt, ca, gl, ast
  • Northern European: sv, et
  • Central / Eastern European: cs, sk, sl, bg, ro, ru, uk, hu
  • Celtic: ga, gd, cy, gv
  • Other: fa

No nl (Dutch) file exists upstream. If you need Dutch, register a third-party dictionary via load_language/2.

Three ways to make a non-English pack available:

  • Run mix text.download_lemma_data <lang> [...] once. This fetches the upstream files into the configured Text.Data cache, so subsequent lookups need no network access. Set data_dir: in the application config to write them somewhere other than ~/.cache/text/lemma/.

  • Set config :text, auto_download_lemma_data: true and let the first lookup for an uncached language fetch from upstream automatically. Without that flag, calls for an uncached language raise an ArgumentError explaining the situation.

  • Drop the lemmatization-<lang>.txt file into the configured cache directory yourself, or call load_language/2 with an explicit path.
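
The two configuration keys mentioned above might be set together as follows. This is a sketch under assumptions: the source names auto_download_lemma_data under config :text, but it only says data_dir: lives "in app config", so placing it under :text is a guess, and the path is a placeholder.

```elixir
# config/config.exs
import Config

config :text,
  # Fetch a missing language pack on first lookup instead of raising.
  auto_download_lemma_data: true,
  # Assumption: data_dir is read from the same :text application
  # environment. The path below is a placeholder.
  data_dir: "/var/cache/my_app/text"
```

Leaving auto_download_lemma_data unset keeps lookups free of network access, at the cost of an ArgumentError for any language that has not been cached.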

Language input shapes

The :language option accepts an atom (:de), a string ("de", "de-CH"), or a Localize.LanguageTag struct. The base language subtag is used to pick the upstream file.

Lookup is case-insensitive on the input; the returned lemma uses the casing of the dictionary entry. Unknown words are returned unchanged.
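
How a language input could be reduced to the base subtag that selects the upstream file is sketched below. LangSketch is a hypothetical helper, not the package's code, and Localize.LanguageTag structs are left out so the example has no dependencies.

```elixir
defmodule LangSketch do
  # Hypothetical helper for illustration; not part of the Text package.

  # Atoms are converted to strings and handled by the clause below.
  def base_language(language) when is_atom(language) do
    language |> Atom.to_string() |> base_language()
  end

  # Keeps only the first subtag, so "de-CH" becomes "de".
  def base_language(language) when is_binary(language) do
    language
    |> String.split("-", parts: 2)
    |> hd()
    |> String.downcase()
  end

  # File name following the lemmatization-LANG.txt pattern used in the cache.
  def file_name(language) do
    "lemmatization-#{base_language(language)}.txt"
  end
end

IO.inspect(LangSketch.base_language(:de))     # "de"
IO.inspect(LangSketch.base_language("de-CH")) # "de"
IO.inspect(LangSketch.file_name("de-CH"))     # "lemmatization-de.txt"
```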

Summary

Functions

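known?(word, options \\ [])

Returns true if a word has a known lemma in the dictionary.

lemmatize(word, options \\ [])

Returns the lemma of a single word.

lemmatize_text(text, options \\ [])

Lemmatizes every word in a string.

load_language(language)

Pre-loads a lemma dictionary for a language.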

Functions

known?(word, options \\ [])

@spec known?(
  String.t(),
  keyword()
) :: boolean()

Returns true if a word has a known lemma in the dictionary.

Arguments

  • word is a string.

Options

  • :language is the language. Default :en.

Returns

  • true if the word appears as an inflected form in the dictionary, false otherwise. A word that is its own lemma returns true only if it is also listed as an inflected form. For example, runs is a key because it is a form of run, but run itself may not be a key.

Examples

iex> Text.Lemma.known?("running")
true

iex> Text.Lemma.known?("xyznotaword")
false

lemmatize(word, options \\ [])

@spec lemmatize(
  String.t(),
  keyword()
) :: String.t()

Returns the lemma of a single word.

Arguments

  • word is a string. Lookup is case-insensitive.

Options

  • :language is the language. Default :en.

Returns

  • The lemma if known, otherwise the input unchanged.

Examples

iex> Text.Lemma.lemmatize("running")
"run"

iex> Text.Lemma.lemmatize("mice")
"mouse"

iex> Text.Lemma.lemmatize("xyznotaword")
"xyznotaword"

lemmatize_text(text, options \\ [])

@spec lemmatize_text(
  String.t(),
  keyword()
) :: String.t()

Lemmatizes every word in a string.

Arguments

  • text is a string of zero or more words.

Options

  • :language is the language. Default :en.

Returns

  • A string with each word replaced by its lemma. Whitespace and surrounding punctuation are preserved as-is.
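
One way to get the behaviour described above, where words are replaced and everything between them is kept, is a regex replacement over runs of letters. This is a sketch only: TextSketch is a hypothetical module, the dictionary is a hand-written map, and the package's real tokenization rules are not documented here and may differ (for example around apostrophes or digits).

```elixir
defmodule TextSketch do
  # Hypothetical module for illustration; not part of the Text package.

  # Replaces each run of letters with its lemma and leaves everything
  # between the runs (spaces, punctuation) untouched.
  def lemmatize_text(dict, text) do
    Regex.replace(~r/\p{L}+/u, text, fn word ->
      Map.get(dict, String.downcase(word), word)
    end)
  end
end

dict = %{"cats" => "cat", "are" => "be", "running" => "run"}

IO.inspect(TextSketch.lemmatize_text(dict, "The cats, are  running!"))
# "The cat, be  run!"
```

The comma, the double space, and the exclamation mark pass through unchanged because the replacement function only ever sees the matched letters.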

Examples

iex> Text.Lemma.lemmatize_text("the cats are running")
"the cat be run"

load_language(language)

@spec load_language(atom() | String.t() | struct()) :: :ok

Pre-loads a lemma dictionary for a language.

Calling this is optional. The first lookup for a given language already triggers the same load (and, if enabled, the download). Use this to warm the cache during application startup or to register a custom dictionary under a name of your choosing.

Forms

load_language(language)
load_language(language, tsv_path)

Without an explicit path, the upstream URL is derived from the language input and the file is fetched via Text.Data (auto-download must be enabled for the network fetch to occur).

Arguments

  • language is an atom, string, or Localize.LanguageTag.

  • tsv_path is an optional path to a TSV file with lemma<TAB>form entries.

Returns

  • :ok on success.

load_language(language, tsv_path)

@spec load_language(atom() | String.t() | struct(), Path.t()) :: :ok
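
Registering a custom dictionary, as suggested for Dutch, could look like the sketch below. The file name and the two entries are made up for the example, and the call is guarded so the script also runs where the Text package is not installed. No output is shown for the final lookup because it depends on the package being present.

```elixir
# Write a tiny Dutch dictionary, one "lemma\tform" entry per line.
# The entries and the file name are invented for this example.
path = Path.join(System.tmp_dir!(), "lemmatization-nl-example.txt")
File.write!(path, "lopen\tloopt\nhuis\thuizen\n")

# Register the file under :nl when the Text package is available.
# apply/3 avoids undefined-module warnings when the dependency is absent.
if Code.ensure_loaded?(Text.Lemma) do
  :ok = apply(Text.Lemma, :load_language, [:nl, path])
  IO.inspect(apply(Text.Lemma, :lemmatize, ["huizen", [language: :nl]]))
end
```

In an application you would call Text.Lemma.load_language(:nl, path) directly, typically during startup so the first lookup does not pay the loading cost.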