Catalan Language Support
Comprehensive Catalan language support for the Nasty NLP library.
Status
Implemented (Phases 1-7):
- Tokenization with Catalan-specific features
- POS tagging with Universal Dependencies tagset
- Morphological analysis and lemmatization
- Grammar resource files (phrase and dependency rules)
- Phrase and sentence parsing (NP, VP, PP, clause detection)
- Dependency extraction (Universal Dependencies relations)
- Named entity recognition (PERSON, LOCATION, ORGANIZATION, DATE, MONEY, PERCENT)
Pending (Phase 8):
- Text summarization (stub implementation)
- Coreference resolution
- Semantic role labeling
Features
Tokenization
The Catalan tokenizer handles the following language-specific features:
Interpunct (l·l): Kept as single token
- Example: "Col·laborar" → ["Col·laborar"]
- Common in words with a geminate l: col·laborar, intel·ligent, il·lusió
Apostrophe Contractions: Separated as distinct tokens
- Determiners: l' (el/la)
- Prepositions: d' (de)
- Pronouns: s' (es/se), n' (en), m' (me), t' (te)
- Example: "L'home d'or" → ["L'", "home", "d'", "or"]
Article Contractions: Recognized as single tokens
- del = de + el
- al = a + el
- pel = per + el
- Example: "Vaig al mercat" → ["Vaig", "al", "mercat"]
Diacritics: Complete support for all 10 Catalan diacritics
- Vowels: à, è, é, í, ï, ò, ó, ú, ü
- Consonant: ç (ce trencada)
- Unicode NFC normalization
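The tokenization rules above can be approximated with a single regular expression. The following is a minimal sketch, not the library's tokenizer: the module name `CatalanTokenSketch` and the pattern are assumptions made for illustration. It normalizes to NFC first, so a decomposed accent does not split a word, then tries an elided form, then a word (optionally containing the interpunct), then any other non-space character.

```elixir
defmodule CatalanTokenSketch do
  # Hypothetical helper for illustration; not part of Nasty.
  def tokenize(text) do
    # Alternatives are tried left to right:
    #   1. an elided form (l', d', s', n', m', t') directly before a letter
    #   2. a run of letters, optionally joined by the interpunct (l·l)
    #   3. any other single non-space character (punctuation, symbols)
    pattern = ~r/[ldsnmt]['’](?=\p{L})|\p{L}+(?:·\p{L}+)*|\S/iu

    # NFC turns "a" + U+0300 into the precomposed "à", so \p{L}+ sees one word.
    normalized = String.normalize(text, :nfc)

    pattern
    |> Regex.scan(normalized)
    |> List.flatten()
  end
end

IO.inspect(CatalanTokenSketch.tokenize("L'home d'or"))
IO.inspect(CatalanTokenSketch.tokenize("Vaig al mercat"))
```

Article contractions such as "al" need no special case in this sketch because they are ordinary letter runs. A real tokenizer would also track token positions, which this example omits.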
POS Tagging
Rule-based POS tagger using Universal Dependencies tagset:
Comprehensive Lexicon: 300+ word forms
- Articles, pronouns, prepositions
- Common verbs, nouns, adjectives, adverbs
- Function words and particles
Verb Conjugations: All tenses supported
- Present, preterite, imperfect, future, conditional
- Subjunctive mood patterns
- Gerunds and past participles
Context-Based Disambiguation
- Post-nominal adjective detection
- Determiner-noun sequences
- Preposition-noun patterns
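The three context patterns listed above can be pictured as a pass over already tagged tokens that fills in tags the lexicon could not assign. This is a sketch under assumptions: the `{word, tag}` tuple shape, the `:unknown` placeholder, and the module name `ContextRulesSketch` are illustrative and may not match the tagger's internal representation.

```elixir
defmodule ContextRulesSketch do
  # Hypothetical helper for illustration; not part of Nasty.
  # `tagged` is a list of {word, tag} pairs; :unknown marks lexicon misses.
  def disambiguate(tagged), do: walk(tagged, nil, [])

  defp walk([], _prev_tag, acc), do: Enum.reverse(acc)

  defp walk([{word, :unknown} | rest], prev_tag, acc) do
    tag =
      case prev_tag do
        :det -> :noun    # determiner-noun sequence
        :adp -> :noun    # preposition-noun pattern
        :noun -> :adj    # post-nominal adjective
        _ -> :unknown    # no rule applies; leave unresolved
      end

    walk(rest, tag, [{word, tag} | acc])
  end

  defp walk([{word, tag} | rest], _prev_tag, acc) do
    walk(rest, tag, [{word, tag} | acc])
  end
end

IO.inspect(
  ContextRulesSketch.disambiguate([{"la", :det}, {"casa", :unknown}, {"gran", :unknown}])
)
```

Because each resolved tag becomes the context for the next token, "casa" is resolved to a noun first and "gran" is then read as a post-nominal adjective.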
Morphology
Morphological analyzer with lemmatization:
Verb Classes: 3 conjugation classes
- -ar verbs: parlar → parlar, parlant → parlar
- -re verbs: viure → viure, vivint → viure
- -ir verbs: dormir → dormir, dormint → dormir
Irregular Verbs: Dictionary of 100+ forms
- ser, estar, haver (auxiliaries)
- anar, fer, dir, poder, voler (common verbs)
- tenir, venir, veure (irregulars)
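Lemmatization of this kind usually combines a lookup table for irregular forms with suffix rules for regular ones. The sketch below covers only gerunds and assumes the input has already been identified as a gerund; the module name `GerundLemmaSketch` and the one-entry table are illustrative, not the library's dictionary.

```elixir
defmodule GerundLemmaSketch do
  # Hypothetical helper for illustration; not part of Nasty.
  # Forms whose lemma cannot be derived by suffix replacement.
  @irregular %{"vivint" => "viure"}

  def lemmatize(gerund) do
    form = String.downcase(gerund)

    cond do
      # Irregular lookup takes priority over the suffix rules.
      Map.has_key?(@irregular, form) -> Map.fetch!(@irregular, form)
      # -ar class: parlant -> parlar
      String.ends_with?(form, "ant") -> String.replace_suffix(form, "ant", "ar")
      # -ir class: dormint -> dormir
      String.ends_with?(form, "int") -> String.replace_suffix(form, "int", "ir")
      # Unknown pattern: return the form unchanged.
      true -> form
    end
  end
end

IO.inspect(Enum.map(["parlant", "vivint", "dormint"], &GerundLemmaSketch.lemmatize/1))
```

Checking the table first matters: without it, "vivint" would fall through to the -int rule and produce the non-existent "vivir".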
Morphological Features
- Gender: masculine/feminine
- Number: singular/plural
- Tense: present, past, future, conditional, imperfect
- Mood: indicative, conditional, subjunctive
- Aspect: progressive, perfective
Grammar Rules
Externalized grammar files in priv/languages/ca/grammars/:
Phrase Rules (phrase_rules.exs):
- Noun phrases with post-nominal adjectives
- Verb phrases with flexible word order
- Prepositional, adjectival, adverbial phrases
- Relative clause patterns
- Special rules for Catalan-specific features
Dependency Rules (dependency_rules.exs):
- Universal Dependencies v2 relations
- Core arguments (subject, object, indirect object)
- Non-core dependents (oblique, adverbials)
- Function word relations
- Catalan-specific patterns (clitics, pro-drop)
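Keeping rules in `.exs` files means they are plain Elixir terms that can be evaluated at runtime. The sketch below shows the general idea only: the rule shape (`name` and `pattern` keys) is hypothetical and the actual contents of phrase_rules.exs may look quite different. It evaluates an inline string so that the example is self-contained; `Code.eval_file/1` does the same for a path on disk.

```elixir
# Hypothetical rule-file contents; the real phrase_rules.exs may differ.
source = """
[
  %{name: :noun_phrase, pattern: [:det, :noun, :adj]},
  %{name: :prep_phrase, pattern: [:adp, :noun_phrase]}
]
"""

# Code.eval_string/1 returns {value, bindings}; only the value is needed here.
{rules, _bindings} = Code.eval_string(source)

# Look up one rule by name and inspect its pattern.
noun_phrase = Enum.find(rules, fn rule -> rule.name == :noun_phrase end)
IO.inspect(noun_phrase.pattern)
```

Evaluating files executes arbitrary code, so this approach is only appropriate for rule files that ship with the application.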
Usage
alias Nasty.Language.Catalan
# Complete pipeline
text = "El gat dorm al sofà."
{:ok, tokens} = Catalan.tokenize(text)
{:ok, tagged} = Catalan.tag_pos(tokens)
{:ok, document} = Catalan.parse(tagged)
# Extract entities
alias Nasty.Language.Catalan.EntityRecognizer
{:ok, entities} = EntityRecognizer.recognize(tagged)
# => a list of entity structs, e.g. [%Entity{type: :person, text: "Joan Garcia", ...}] when the text names a person
# Extract dependencies
alias Nasty.Language.Catalan.DependencyExtractor
sentences = document.paragraphs |> Enum.flat_map(& &1.sentences)
deps = Enum.flat_map(sentences, &DependencyExtractor.extract/1)
# => [%Dependency{relation: :nsubj, head: "dorm", dependent: "gat", ...}]
# Individual components
{:ok, tokens} = Catalan.Tokenizer.tokenize("El gat dorm al sofà.")
{:ok, tagged} = Catalan.POSTagger.tag_pos(tokens)
{:ok, analyzed} = Catalan.Morphology.analyze(tagged)
# Access lemmas and features
Enum.each(analyzed, fn token ->
  IO.puts("#{token.text} [#{token.pos_tag}] → #{token.lemma}")
end)
Linguistic Features
Word Order
Catalan allows flexible word order while maintaining SVO as default:
- SVO (Subject-Verb-Object): "El gat menja peix" (The cat eats fish)
- VSO (Verb-Subject-Object): "Menja el gat peix" (Eats the cat fish) - emphatic
- VOS (Verb-Object-Subject): "Menja peix el gat" (Eats fish the cat) - rare
Pro-Drop
Subject pronouns are often omitted when context is clear:
- "Parla català" (He/she/it speaks Catalan) - subject implicit
- "Hem anat al mercat" (We have gone to the market) - subject implicit
Post-Nominal Adjectives
Descriptive adjectives typically follow nouns:
- "casa gran" (big house)
- "llibre interessant" (interesting book)
- Exception: "bon dia" (good day) - some adjectives precede for emphasis
Clitic Pronouns
Pronouns can attach to verbs as clitics:
- "Dona'm el llibre" (Give me the book) - 'm = me
- "Digues-li la veritat" (Tell him/her the truth) - li = him/her
Test Coverage
74 tests, 0 failures
Tokenization: 54 tests
- Interpunct words
- Apostrophe and article contractions
- Diacritics
- Position tracking
- Edge cases
POS Tagging: 20 tests
- Basic word classes
- Verb conjugations
- Catalan-specific features
- Context-based tagging
Implementation Details
Phrase Parser (lib/language/catalan/phrase_parser.ex - 334 lines)
- parse_noun_phrase/2: Handles quantifiers, determiners, adjectives, and post-modifiers
- parse_verb_phrase/2: Processes auxiliaries, main verbs, objects, and complements
- parse_prep_phrase/2: Parses preposition + noun phrase structures
- Catalan-specific: Post-nominal adjectives, quantifying adjectives (molt, poc, algun, tot)
Sentence Parser (lib/language/catalan/sentence_parser.ex - 281 lines)
- parse_sentences/2: Sentence boundary detection and splitting
- parse_clause/2: Subject and predicate extraction
- Catalan subordinators: que, perquè, quan, on, si, encara, mentre, així, doncs, ja
- Coordination: i, o, però, sinó, ni
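One way to picture how these word lists drive clause detection is to start a new chunk whenever a subordinator or coordinator appears. This is a deliberately naive sketch (the module name `ClauseSplitSketch` is hypothetical, and a real parser must distinguish "i" joining two clauses from "i" joining two nouns); the word lists are copied from the bullets above.

```elixir
defmodule ClauseSplitSketch do
  # Hypothetical helper for illustration; not part of Nasty.
  @subordinators ~w(que perquè quan on si encara mentre així doncs ja)
  @coordinators ~w(i o però sinó ni)

  # Classifies a word as a subordinating or coordinating conjunction.
  def classify(word) do
    lower = String.downcase(word)

    cond do
      lower in @subordinators -> :sconj
      lower in @coordinators -> :cconj
      true -> :other
    end
  end

  # Starts a new chunk at every conjunction that is not sentence-initial.
  def split_clauses(words) do
    chunk_fun = fn word, acc ->
      if classify(word) != :other and acc != [] do
        {:cont, Enum.reverse(acc), [word]}
      else
        {:cont, [word | acc]}
      end
    end

    # Emit whatever is left in the accumulator as the final chunk.
    after_fun = fn
      [] -> {:cont, []}
      acc -> {:cont, Enum.reverse(acc), []}
    end

    Enum.chunk_while(words, [], chunk_fun, after_fun)
  end
end

IO.inspect(ClauseSplitSketch.split_clauses(~w(El gat dorm i el gos menja)))
```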
Dependency Extractor (lib/language/catalan/dependency_extractor.ex - 226 lines)
- Extracts Universal Dependencies relations from parsed structures
- Core relations: nsubj (nominal subject), obj (object), iobj (indirect object)
- Modifiers: det (determiner), amod (adjectival modifier), advmod (adverbial modifier)
- Function words: aux (auxiliary), case (case marking), mark (subordinating conjunction)
- Coordination: cc (coordinating conjunction), conj (conjunct)
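As an illustration of how modifier and function-word relations attach to a head, the sketch below derives `det`, `amod`, and `case` relations inside a single noun phrase. It assumes a simplified `{word, tag}` input and `{relation, head, dependent}` output; the module name `NounPhraseDepsSketch` is hypothetical, and the library's Dependency struct carries more fields than this.

```elixir
defmodule NounPhraseDepsSketch do
  # Hypothetical helper for illustration; not part of Nasty.
  # Returns {relation, head, dependent} triples for one noun phrase.
  def extract(tagged) do
    case Enum.find(tagged, fn {_word, tag} -> tag == :noun end) do
      nil ->
        # No noun means no head to attach dependents to.
        []

      {head, :noun} ->
        # The filter drops words whose tag has no relation (nil is falsy).
        for {word, tag} <- tagged, rel = relation(tag), do: {rel, head, word}
    end
  end

  defp relation(:det), do: :det    # determiner
  defp relation(:adj), do: :amod   # adjectival modifier
  defp relation(:adp), do: :case   # case-marking preposition
  defp relation(_other), do: nil
end

IO.inspect(NounPhraseDepsSketch.extract([{"el", :det}, {"gat", :noun}, {"negre", :adj}]))
```

The first noun is taken as head, which suits simple phrases with a post-nominal adjective but not nested noun phrases.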
Entity Recognizer (lib/language/catalan/entity_recognizer.ex - 285 lines)
- Rule-based NER with 6 entity types
- PERSON: Catalan titles (Sr., Sra., Dr., Dra., Don, Donya), capitalized name sequences
- LOCATION: Catalan places (Barcelona, Catalunya, València, Girona, Tarragona, Lleida, Andorra)
- ORGANIZATION: Indicators (banc, universitat, hospital, ajuntament, govern)
- DATE: Catalan months and days (gener, febrer, març, dilluns, dimarts)
- MONEY: Currency symbols and words (€, euros, dòlar, dòlars)
- PERCENT: Percentage symbols (%, per cent)
- Confidence scoring: 0.5-0.95 based on pattern strength
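Pattern-strength scoring can be sketched as a rule cascade in which stronger evidence yields a higher score. In the example below, only the 0.5 to 0.95 range and the word lists come from the description above; the specific values per rule, the module name `EntitySketch`, and the input shape (a list of words forming one candidate span) are assumptions.

```elixir
defmodule EntitySketch do
  # Hypothetical helper for illustration; not part of Nasty.
  @titles ~w(Sr. Sra. Dr. Dra. Don Donya)
  @places ~w(Barcelona Catalunya València Girona Tarragona Lleida Andorra)

  # Returns {type, confidence} for a candidate span, or nil.
  def classify([first | rest] = words) do
    cond do
      # Title followed by capitalized names: strongest evidence.
      first in @titles and rest != [] and Enum.all?(rest, &capitalized?/1) ->
        {:person, 0.95}

      # Single word found in the place lexicon.
      match?([_], words) and first in @places ->
        {:location, 0.9}

      # Bare sequence of capitalized words: weakest evidence.
      length(words) >= 2 and Enum.all?(words, &capitalized?/1) ->
        {:person, 0.5}

      true ->
        nil
    end
  end

  # True when the first grapheme is an uppercase letter.
  defp capitalized?(word) do
    first = String.first(word)
    first != nil and first == String.upcase(first) and first != String.downcase(first)
  end
end

IO.inspect(EntitySketch.classify(["Sr.", "Joan", "Garcia"]))
IO.inspect(EntitySketch.classify(["Joan", "Garcia"]))
```

Ordering the rules from strongest to weakest means a span is always reported with the highest score it qualifies for.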
Future Work (Phase 8 and Beyond)
- Summarizer: Extractive and abstractive text summarization
- Coreference Resolution: Link mentions across sentences
- Semantic Role Labeling: Predicate-argument structure
- End-to-end Tests: Integration tests for complete pipeline
- Advanced Entity Recognition: ML-based NER with larger lexicons
- Question Answering: Extractive QA for Catalan texts
- Text Classification: Sentiment analysis, topic classification
References
- Universal Dependencies Catalan Treebank: UD_Catalan-AnCora
- Catalan Grammar: Institut d'Estudis Catalans
- Linguistic Patterns: Based on Central Catalan (Barcelona dialect)
Language Code
ISO 639-1: ca
ISO 639-3: cat
Contributing
When enhancing Catalan support:
- Maintain consistency with Spanish implementation patterns
- Follow Universal Dependencies standards
- Document Catalan-specific features
- Add comprehensive tests for new functionality
- Update this documentation