Nasty.Language.Catalan.Morphology (Nasty v0.3.0)

Morphological analyzer for Catalan tokens.

Provides lemmatization (finding the base form of words) using:

  • Dictionary lookup for irregular forms
  • Rule-based suffix removal for regular conjugations/declensions
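The two-step strategy above can be sketched as follows. This is a minimal illustration, not the module's actual implementation: the `LemmaSketch` module, its dictionary entries, and its suffix rules (first-conjugation -ar verbs only) are all assumptions made for the example.

```elixir
defmodule LemmaSketch do
  # Hypothetical irregular-form dictionary; the real module's entries are not shown here.
  @irregular %{"sóc" => "ser", "vaig" => "anar", "he" => "haver"}

  # Hypothetical {suffix, replacement} rules for -ar verbs, longest suffixes first.
  @verb_rules [
    {"àvem", "ar"},
    {"aven", "ar"},
    {"ant", "ar"},
    {"em", "ar"},
    {"es", "ar"},
    {"a", "ar"}
  ]

  # Step 1: dictionary lookup. Step 2: fall back to suffix rules.
  def lemmatize(word, :verb) do
    case Map.fetch(@irregular, word) do
      {:ok, lemma} -> lemma
      :error -> strip_suffix(word, @verb_rules)
    end
  end

  # Other tags: dictionary lookup only, otherwise return the word unchanged.
  def lemmatize(word, _pos_tag), do: Map.get(@irregular, word, word)

  # Applies the first rule whose suffix matches; returns the word unchanged if none does.
  defp strip_suffix(word, rules) do
    Enum.find_value(rules, word, fn {suffix, replacement} ->
      if String.ends_with?(word, suffix) and String.length(word) > String.length(suffix) do
        String.replace_suffix(word, suffix, replacement)
      end
    end)
  end
end
```

Checking the dictionary before the rules keeps irregular forms such as "vaig" from being mangled by suffix stripping.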

Catalan-Specific Features

  • Verb lemmatization: all conjugations → infinitive (-ar, -er/-re, -ir)
  • Noun lemmatization: plural → singular, gender variations
  • Adjective lemmatization: gender/number agreement
  • Morphological features: gender, number, tense, mood, person
  • Clitic handling (em, et, es, el, la, etc.)
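As a rough illustration of clitic handling, Catalan enclitic pronouns are attached to the verb with a hyphen or an apostrophe (for example "donar-me" or "anar-se'n"), so they can be separated before lemmatizing the verb. The `CliticSketch` module below is hypothetical and says nothing about how this module actually treats clitics; it deliberately ignores proclitic forms such as "m'agrada".

```elixir
defmodule CliticSketch do
  # Splits a verb form from its trailing (enclitic) pronouns.
  # Assumes ASCII hyphen and apostrophe as the only separators.
  def split(word) do
    [base | clitics] = String.split(word, ["-", "'"])
    {base, clitics}
  end
end
```

The base returned here could then be passed to a verb lemmatizer, with the clitics kept as separate tokens or features.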

Functions

analyze(tokens)

@spec analyze([Nasty.AST.Token.t()]) :: {:ok, [Nasty.AST.Token.t()]}

Analyzes tokens to add lemma and morphological features.

Updates each token with:

  • :lemma - Base form of the word (infinitive for verbs, singular for nouns)
  • :morphology - Map of morphological features (gender, number, tense, etc.)

Parameters

  • tokens - List of Token structs (with POS tags)

Returns

  • {:ok, tokens} - Tokens with lemma and morphology fields updated
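The return shape can be illustrated with a small self-contained sketch. The `AnalyzeSketch` module and its `Token` struct are stand-ins, not the real `Nasty.AST.Token`: only `:lemma` and `:morphology` are named in this page, so the `:text` and `:pos_tag` fields are assumed, and the plural rule is deliberately naive.

```elixir
defmodule AnalyzeSketch do
  defmodule Token do
    # Stand-in struct; :text and :pos_tag are assumed field names.
    defstruct [:text, :pos_tag, lemma: nil, morphology: %{}]
  end

  # Mirrors the documented contract: the result is always {:ok, tokens}.
  def analyze(tokens) do
    {:ok, Enum.map(tokens, &annotate/1)}
  end

  # Naive noun rule: a trailing "s" marks a plural (wrong for words like "país").
  defp annotate(%Token{text: text, pos_tag: :noun} = token) do
    word = String.downcase(text)

    if String.ends_with?(word, "s") and String.length(word) > 1 do
      %{token | lemma: String.replace_suffix(word, "s", ""), morphology: %{number: :plural}}
    else
      %{token | lemma: word, morphology: %{number: :singular}}
    end
  end

  # Any other tag: the lowercased text is used as the lemma, with no features added.
  defp annotate(%Token{text: text} = token) do
    %{token | lemma: String.downcase(text)}
  end
end

# Callers can pattern match on the {:ok, tokens} tuple directly.
{:ok, [noun, conj]} =
  AnalyzeSketch.analyze([
    %AnalyzeSketch.Token{text: "Gats", pos_tag: :noun},
    %AnalyzeSketch.Token{text: "i", pos_tag: :conj}
  ])

IO.inspect({noun.lemma, noun.morphology, conj.lemma})
```

Because the spec lists only an `{:ok, tokens}` result, a direct match like the one above is enough; no error branch is documented.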

lemmatize(word, pos_tag)

@spec lemmatize(String.t(), atom()) :: String.t()

Lemmatizes a Catalan word based on its part-of-speech tag.

Returns the base form (lemma) of a word using dictionary lookup for irregular forms and rule-based suffix removal for regular forms.

Parameters

  • word - The word to lemmatize (lowercase string)
  • pos_tag - Part-of-speech tag atom (:verb, :noun, :adj, etc.)

Returns

  • String.t() - The lemmatized form of the word