Nasty.Language.Catalan (Nasty v0.3.0)
Catalan (Català) language implementation for Nasty.
Provides a complete NLP pipeline for Catalan text:
- Tokenization with Catalan-specific features (interpunct, contractions)
- POS tagging using the Universal Dependencies tagset
- Morphological analysis (lemmatization, features)
- Syntactic parsing (phrases, sentences, clauses)
- Dependency extraction (Universal Dependencies)
- Named entity recognition
- Text summarization
Catalan-Specific Features
- Interpunct (l·l): Handled in tokenization (e.g., "col·laborar")
- Apostrophe contractions: l', d', s', n', m', t'
- Article contractions: del (de + el), al (a + el), pel (per + el)
- Pro-drop: Subject pronouns often omitted
- Post-nominal adjectives: "casa blanca" (white house)
- Clitic pronouns: em, et, es, ens, us
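The article contractions listed above can be illustrated with a small lookup table. This is a minimal sketch, not Nasty's implementation: the `ContractionSketch` module and its functions are hypothetical names, and only the three contractions named in this section are covered.

```elixir
defmodule ContractionSketch do
  # Hypothetical lookup table (not Nasty's internal data): each contracted
  # preposition + article form mapped to its two parts.
  @contractions %{
    "del" => ["de", "el"],
    "al" => ["a", "el"],
    "pel" => ["per", "el"]
  }

  # Expands one word into its parts; any other word passes through unchanged.
  # Matching is case-insensitive, so an expanded "Del" loses its capital.
  def expand(word) do
    Map.get(@contractions, String.downcase(word), [word])
  end

  # Expands every contraction in a list of words.
  def expand_all(words), do: Enum.flat_map(words, &expand/1)
end

IO.inspect(ContractionSketch.expand_all(["El", "gat", "dorm", "al", "sofà"]))
# prints ["El", "gat", "dorm", "a", "el", "sofà"]
```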
Usage
iex> alias Nasty.Language.Catalan
iex> {:ok, tokens} = Catalan.tokenize("El gat dorm al sofà.")
iex> {:ok, tagged} = Catalan.tag_pos(tokens)
iex> {:ok, document} = Catalan.parse(tagged)
Language Code
Catalan uses the ISO 639-1 code :ca.
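The three calls in the usage example each return an `{:ok, value}` tuple, so they can be chained with `with`. The sketch below is one possible way to do that. `PipelineSketch` and `FakeLanguage` are hypothetical names, `FakeLanguage` only stands in for the real module so the code runs without Nasty installed, and the `{:error, reason}` failure shape is an assumption that this page does not confirm.

```elixir
defmodule PipelineSketch do
  # Runs tokenize -> tag_pos -> parse on any module that exposes those three
  # functions. The first result that is not {:ok, _} is returned unchanged.
  def analyze(lang, text) do
    with {:ok, tokens} <- lang.tokenize(text),
         {:ok, tagged} <- lang.tag_pos(tokens),
         {:ok, document} <- lang.parse(tagged) do
      {:ok, document}
    end
  end
end

# Stand-in module so the sketch runs on its own; it is not Nasty's API.
defmodule FakeLanguage do
  def tokenize(text), do: {:ok, String.split(text)}
  def tag_pos(tokens), do: {:ok, Enum.map(tokens, &{&1, :x})}
  # Assumed failure shape, used here only to show the short-circuit.
  def parse([]), do: {:error, :empty_input}
  def parse(tagged), do: {:ok, %{tokens: tagged}}
end

IO.inspect(PipelineSketch.analyze(FakeLanguage, "El gat dorm"))
IO.inspect(PipelineSketch.analyze(FakeLanguage, ""))
```

With Nasty available, the same function would presumably be called as `PipelineSketch.analyze(Nasty.Language.Catalan, "El gat dorm al sofà.")`.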
Functions
@spec extract_entities(Nasty.AST.Document.t()) :: [Nasty.AST.Semantic.Entity.t()]
Extracts named entities from a parsed Catalan document.
Recognizes:
- Person names (with Catalan naming patterns)
- Organizations
- Locations (Catalan place names)
- Dates
Examples
iex> {:ok, document} = Catalan.parse(tagged)
iex> Catalan.extract_entities(document)
[%Entity{type: :person, text: "Josep Maria"}, ...]
@spec language_code() :: :ca
Returns the ISO 639-1 language code for Catalan.
Examples
iex> Nasty.Language.Catalan.language_code()
:ca
Returns metadata about the Catalan language implementation.
Examples
iex> Catalan.metadata()
%{
name: "Catalan",
native_name: "Català",
iso_639_1: "ca",
family: "Romance",
speakers: "~10 million"
}
Parses tagged Catalan tokens into a complete Document AST.
The parsing pipeline:
- Morphological analysis (lemmatization, features)
- Phrase parsing (NP, VP, PP, AdjP, AdvP)
- Sentence parsing (clauses, coordination, subordination)
- Document construction (paragraphs, sentences)
Options
- :dependencies - Extract dependency relations (default: false)
- :entities - Recognize named entities (default: false)
- :semantic_roles - Extract semantic roles (default: false)
Examples
iex> {:ok, tokens} = Catalan.tokenize("La Maria treballa a Barcelona.")
iex> {:ok, tagged} = Catalan.tag_pos(tokens)
iex> Catalan.parse(tagged)
{:ok, %Document{paragraphs: [%Paragraph{sentences: [...]}]}}
Renders a Catalan AST node back to natural language text.
Handles:
- Subject-verb agreement
- Gender/number agreement (adjectives, articles)
- Catalan word order (post-nominal adjectives)
- Proper punctuation and capitalization
Examples
iex> document = %Document{...}
iex> Catalan.render(document)
{:ok, "El gat dorm al sofà."}
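Gender/number agreement and post-nominal adjective order can be illustrated on a single noun phrase. This is a minimal sketch under stated assumptions, not how `render` is implemented: `RenderSketch`, the input map shape, and the adjective form table are all hypothetical, and article elision (l') is left out.

```elixir
defmodule RenderSketch do
  # Definite articles keyed by {gender, number}. Elision to l' before a
  # vowel is deliberately left out of this sketch.
  @articles %{
    {:masc, :sing} => "el",
    {:fem, :sing} => "la",
    {:masc, :plur} => "els",
    {:fem, :plur} => "les"
  }

  # Renders article + noun + optional adjective, in that order, so the
  # adjective ends up after the noun. `adjective` is a map from
  # {gender, number} to the inflected form (a hypothetical input shape).
  def noun_phrase(%{noun: noun, gender: gender, number: number} = np) do
    key = {gender, number}

    adjective =
      case Map.get(np, :adjective) do
        nil -> []
        forms -> [Map.fetch!(forms, key)]
      end

    Enum.join([Map.fetch!(@articles, key), noun] ++ adjective, " ")
  end
end

blanc = %{
  {:masc, :sing} => "blanc",
  {:fem, :sing} => "blanca",
  {:masc, :plur} => "blancs",
  {:fem, :plur} => "blanques"
}

IO.puts(RenderSketch.noun_phrase(%{noun: "casa", gender: :fem, number: :sing, adjective: blanc}))
# prints "la casa blanca"
IO.puts(RenderSketch.noun_phrase(%{noun: "gats", gender: :masc, number: :plur}))
# prints "els gats"
```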
@spec summarize(Nasty.AST.Document.t(), keyword()) :: String.t()
Summarizes Catalan text using extractive summarization.
Options
- :ratio - Compression ratio (0.0-1.0)
- :max_sentences - Maximum sentences in summary
- :method - :textrank or :mmr (default: :textrank)
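To make the idea of extractive summarization and a sentence budget concrete, here is a small standalone sketch. It is not Nasty's algorithm: it scores sentences by plain word frequency rather than TextRank or MMR, works on a list of strings instead of a Document, and assumes that `:ratio` sets the budget while `:max_sentences` caps it. The `SummarySketch` module and the 0.3 default ratio are hypothetical.

```elixir
defmodule SummarySketch do
  # Picks the highest-scoring sentences and returns them in original order.
  # Scoring is summed word frequency, which is NOT TextRank or MMR and
  # tends to favour longer sentences.
  def summarize(sentences, opts \\ []) do
    ratio = Keyword.get(opts, :ratio, 0.3)
    cap = Keyword.get(opts, :max_sentences, length(sentences))

    # Assumed interaction: the ratio sets the budget, :max_sentences caps
    # it, and at least one sentence is kept.
    budget = (length(sentences) * ratio) |> ceil() |> min(cap) |> max(1)

    freqs =
      sentences
      |> Enum.flat_map(&words/1)
      |> Enum.frequencies()

    sentences
    |> Enum.with_index()
    |> Enum.sort_by(fn {sentence, _index} -> -score(sentence, freqs) end)
    |> Enum.take(budget)
    |> Enum.sort_by(fn {_sentence, index} -> index end)
    |> Enum.map(fn {sentence, _index} -> sentence end)
  end

  # Lowercased runs of letters; everything else is treated as a separator.
  defp words(sentence) do
    sentence
    |> String.downcase()
    |> String.split(~r/[^\p{L}]+/u, trim: true)
  end

  defp score(sentence, freqs) do
    sentence
    |> words()
    |> Enum.map(&Map.fetch!(freqs, &1))
    |> Enum.sum()
  end
end

sentences = ["El gat dorm.", "El gat menja peix.", "Plou.", "El gat beu."]
IO.inspect(SummarySketch.summarize(sentences, ratio: 0.5))
# prints ["El gat dorm.", "El gat menja peix."]
```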
Examples
iex> {:ok, document} = Catalan.parse(tagged)
iex> Catalan.summarize(document, ratio: 0.3)
"El gat dorm. La casa és gran."
Assigns part-of-speech tags to Catalan tokens using the Universal Dependencies tagset.
Supports multiple tagging models:
- :rule - Rule-based tagging (default, ~85% accuracy)
- :hmm - Hidden Markov Model (future, ~95% accuracy)
- :neural - Neural network (future, ~97% accuracy)
Options
- :model - Tagging model to use (default: :rule)
Examples
iex> {:ok, tokens} = Catalan.tokenize("El gat dorm.")
iex> Catalan.tag_pos(tokens)
{:ok, [%Token{text: "El", pos_tag: :det}, %Token{text: "gat", pos_tag: :noun}, ...]}
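As a rough illustration of what rule-based tagging means, the sketch below combines a tiny closed-class lexicon with suffix heuristics. It is not the `:rule` model described above: `RuleTaggerSketch`, its lexicon, and its suffix rules are hypothetical and far too small for real text. Finite verbs such as "dorm", for example, fall through to the `:noun` default and are mistagged.

```elixir
defmodule RuleTaggerSketch do
  # Hypothetical closed-class lexicon; a usable tagger needs far more entries.
  @lexicon %{
    "el" => :det, "la" => :det, "els" => :det, "les" => :det,
    "un" => :det, "una" => :det,
    "a" => :adp, "de" => :adp, "per" => :adp, "en" => :adp,
    "i" => :cconj, "o" => :cconj
  }

  # Tags one word. Rules are tried in order: punctuation, lexicon lookup,
  # suffix heuristics, then a default of :noun.
  def tag(word) do
    lower = String.downcase(word)

    cond do
      word in [".", ",", ";", ":", "!", "?"] -> :punct
      Map.has_key?(@lexicon, lower) -> Map.fetch!(@lexicon, lower)
      # -ment often marks adverbs, but also some nouns (a known weakness).
      String.ends_with?(lower, "ment") -> :adv
      # Infinitive endings; nouns such as "mar" are mistagged by this rule.
      String.ends_with?(lower, ["ar", "er", "ir"]) -> :verb
      true -> :noun
    end
  end

  # Pairs every word with its tag.
  def tag_all(words), do: Enum.map(words, &{&1, tag(&1)})
end

IO.inspect(RuleTaggerSketch.tag_all(["treballar", "ràpidament", "a", "la", "casa", "."]))
# prints [{"treballar", :verb}, {"ràpidament", :adv}, {"a", :adp},
#         {"la", :det}, {"casa", :noun}, {".", :punct}]
```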
Tokenizes Catalan text into tokens with position tracking.
Handles Catalan-specific features:
- Interpunct (l·l) kept as single token
- Apostrophe contractions (l'home → ["l'", "home"])
- Article contractions (del → ["de", "el"])
- Catalan diacritics (à, è, é, í, ï, ò, ó, ú, ü, ç)
Options
- :preserve_contractions - Keep contractions intact (default: false)
Examples
iex> Catalan.tokenize("L'home col·labora.")
{:ok, [%Token{text: "L'"}, %Token{text: "home"}, %Token{text: "col·labora"}, %Token{text: "."}]}
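The splitting behaviour in the example above can be approximated with a single regular expression. This is a hedged sketch, not Nasty's tokenizer: `TokenizerSketch` is a hypothetical module that returns plain strings rather than `%Token{}` structs, tracks no positions, does not expand article contractions, and assumes precomposed (NFC) input.

```elixir
defmodule TokenizerSketch do
  # Alternatives are tried left to right at each position:
  #   1. an elided form: one letter plus an apostrophe (l', d', s', ...)
  #   2. a word: letters, optionally joined by an interpunct (col·laborar)
  #   3. a run of digits
  #   4. any other single character that is not whitespace
  def tokenize(text) do
    ~r/\p{L}['’]|\p{L}+(?:·\p{L}+)*|\p{N}+|[^\p{L}\p{N}\s]/u
    |> Regex.scan(text)
    |> List.flatten()
  end
end

IO.inspect(TokenizerSketch.tokenize("L'home col·labora."))
# prints ["L'", "home", "col·labora", "."]
```

Because the elided-form alternative only fires at the start of a token, enclitic forms such as "dona'm" are split differently ("dona", "'", "m"). A fuller tokenizer would need a dedicated rule for them.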