Dictionary-driven lemmatization.
Reduces an inflected word to its dictionary form (the lemma).
Examples: running → run, mice → mouse, cats → cat.
This module follows the same approach as Python's simplemma: a flat
inflected-form-to-lemma lookup table per language. There is no POS
tagging and no morphological analysis, just a dictionary. Coverage is
broad for common vocabulary but sparse for rare or technical terms.
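To make the flat-table idea concrete, here is a minimal sketch. The module name FlatLemmaSketch and its three-entry table are hypothetical and are not part of this library; the sketch only illustrates a case-insensitive lookup that falls back to the input.

```elixir
defmodule FlatLemmaSketch do
  # Hypothetical three-entry table; real dictionaries hold far more forms.
  @table %{
    "running" => "run",
    "mice" => "mouse",
    "cats" => "cat"
  }

  # Downcase the input, look it up, and fall back to the input unchanged.
  def lemmatize(word) when is_binary(word) do
    Map.get(@table, String.downcase(word), word)
  end

  # True only when the word appears as a key, that is, as an inflected form.
  def known?(word) when is_binary(word) do
    Map.has_key?(@table, String.downcase(word))
  end
end
```

Because the keys are inflected forms, a lemma such as "run" is not a key in this sketch, so `FlatLemmaSketch.known?("run")` is false while `FlatLemmaSketch.lemmatize("run")` still returns the word unchanged.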
Bundled and on-demand language packs
English is bundled at compile time and loaded with zero I/O. The
other lemma dictionaries published by Michal Boleslav Měchura's
lemmatization-lists
project (Open Database License) are large enough that bundling
them all would push the package over hex's size limit, so they
are not shipped with the package and are instead loaded on
demand via Text.Data. Available upstream:
- Western European: de, es, fr, it, pt, ca, gl, ast
- Northern European: sv, et
- Central / Eastern European: cs, sk, sl, bg, ro, ru, uk, hu
- Celtic: ga, gd, cy, gv
- Other: fa
No nl (Dutch) file exists upstream — register a third-party
Dutch dictionary via load_language/2 if you need it.
Three ways to make a non-English pack available:
- Run mix text.download_lemma_data <lang> [...] once. This fetches the upstream files and places them in the configured Text.Data cache so subsequent lookups are zero-network. Set data_dir: in app config if you want them written somewhere other than ~/.cache/text/lemma/.
- Set config :text, auto_download_lemma_data: true and let the first lookup for an uncached language fetch from upstream automatically. Without that flag, calls for an uncached language raise an ArgumentError explaining the situation.
- Drop the lemmatization-<lang>.txt file into the configured cache directory yourself, or call load_language/2 with an explicit path.
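As a rough illustration of the configuration-based options, the fragment below combines the auto-download flag with a custom cache location. It assumes that data_dir: sits under the same :text application key as the flag, which the text above implies but does not spell out; the directory path is a placeholder.

```elixir
# config/config.exs
import Config

config :text,
  # Fetch an uncached language pack on first lookup instead of raising.
  auto_download_lemma_data: true,
  # Assumption: data_dir: is read from the :text application config.
  # The path below is a placeholder, not a library default.
  data_dir: "/var/cache/my_app/text"
```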
Language input shapes
The :language option accepts an atom (:de), a string ("de",
"de-CH"), or a Localize.LanguageTag struct. The base language
subtag is used to pick the upstream file.
Lookup is case-insensitive on the input; the returned lemma uses the casing of the dictionary entry. Unknown words are returned unchanged.
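The reduction from a language input to its base subtag can be pictured as follows. This is a hypothetical helper (LanguageKeySketch is not part of the library) that handles only atoms and strings; a Localize.LanguageTag struct would need the Localize package and is left out.

```elixir
defmodule LanguageKeySketch do
  # Atoms such as :de are converted to strings first.
  def base_language(language) when is_atom(language) do
    language |> Atom.to_string() |> base_language()
  end

  # Strings such as "de-CH" keep only the part before the first hyphen.
  def base_language(language) when is_binary(language) do
    language
    |> String.split("-", parts: 2)
    |> hd()
    |> String.downcase()
  end
end
```

Under this sketch, :de, "de", and "de-CH" all resolve to "de", which matches the statement that only the base language subtag selects the upstream file.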
Functions
Returns true if a word has a known lemma in the dictionary.
Arguments
- word is a string.
Options
- :language is the language. Default :en.
Returns
- true if the word appears as an inflected form in the dictionary, false otherwise. A word that is its own lemma only returns true if it also appears as an inflected form (e.g. runs is a form of run; run itself may not be a key).
Examples
iex> Text.Lemma.known?("running")
true
iex> Text.Lemma.known?("xyznotaword")
false
Returns the lemma of a single word.
Arguments
- word is a string. Lookup is case-insensitive.
Options
- :language is the language. Default :en. Accepts an atom, a string, or a Localize.LanguageTag.
Returns
- The lemma if known, otherwise the input unchanged.
Examples
iex> Text.Lemma.lemmatize("running")
"run"
iex> Text.Lemma.lemmatize("mice")
"mouse"
iex> Text.Lemma.lemmatize("xyznotaword")
"xyznotaword"
Lemmatizes every word in a string.
Arguments
- text is a string of zero or more words.
Options
- :language is the language. Default :en.
Returns
- A string with each word replaced by its lemma. Whitespace and surrounding punctuation are preserved as-is.
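One way to get this "replace words, leave everything else alone" behaviour is a regex replacement over runs of letters. The sketch below is an assumption about how such a function could be written, not the library's implementation; TextLemmaSketch and its small table are hypothetical.

```elixir
defmodule TextLemmaSketch do
  # Hypothetical table covering only the words in the example sentence.
  @table %{"cats" => "cat", "are" => "be", "running" => "run"}

  def lemmatize_text(text) when is_binary(text) do
    # Replace each run of letters or apostrophes; the whitespace and
    # punctuation between runs is never matched, so it passes through.
    Regex.replace(~r/[\p{L}']+/u, text, fn word ->
      Map.get(@table, String.downcase(word), word)
    end)
  end
end
```

With this approach, doubled spaces, commas, and trailing punctuation survive because the replacement only ever touches the matched word itself.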
Examples
iex> Text.Lemma.lemmatize_text("the cats are running")
"the cat be run"
Pre-loads a lemma dictionary for a language.
Calling this is optional. The first lookup for a given language already triggers the same load (and, if enabled, the download). Use this to warm the cache during application startup or to register a custom dictionary under a name of your choosing.
Forms
load_language(language)
load_language(language, tsv_path)
Without an explicit path, the upstream URL is derived from the
language input and the file is fetched via Text.Data (auto-download
must be enabled for the network fetch to occur).
Arguments
- language is an atom, string, or Localize.LanguageTag.
- tsv_path is an optional path to a TSV file with lemma<TAB>form entries.
Returns
- :ok on success.
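For anyone preparing a custom dictionary, the sketch below shows how a file of lemma<TAB>form lines maps onto a form-to-lemma table. LemmaTsvSketch is hypothetical and the parsing rules (skipping malformed lines, downcasing forms) are assumptions; how the library itself reads the file is not documented here.

```elixir
defmodule LemmaTsvSketch do
  # Parses "lemma<TAB>form" lines into a form => lemma map.
  def parse(contents) when is_binary(contents) do
    contents
    |> String.split("\n", trim: true)
    |> Enum.reduce(%{}, fn line, acc ->
      # Tolerate Windows line endings by dropping a trailing "\r".
      case String.split(String.trim_trailing(line, "\r"), "\t") do
        [lemma, form] -> Map.put(acc, String.downcase(form), lemma)
        # Skip lines that do not have exactly two tab-separated fields.
        _other -> acc
      end
    end)
  end
end
```

A file containing lines such as `huis<TAB>huizen` has the shape described for tsv_path, so it could be saved to disk and registered with load_language(:nl, path) to cover the missing Dutch pack.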