Nasty.Language.Catalan (Nasty v0.3.0)

Catalan (Català) language implementation for Nasty.

Provides a complete NLP pipeline for Catalan text:

  • Tokenization with Catalan-specific features (interpunct, contractions)
  • POS tagging using the Universal Dependencies tagset
  • Morphological analysis (lemmatization, features)
  • Syntactic parsing (phrases, sentences, clauses)
  • Dependency extraction (Universal Dependencies)
  • Named entity recognition
  • Text summarization

Catalan-Specific Features

  • Interpunct (l·l): Handled in tokenization (e.g., "col·laborar")
  • Apostrophe contractions: l', d', s', n', m', t'
  • Article contractions: del (de + el), al (a + el), pel (per + el)
  • Pro-drop: Subject pronouns often omitted
  • Post-nominal adjectives: "casa blanca" (white house)
  • Clitic pronouns: em, et, es, ens, us
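The article contractions listed above map one surface word to two underlying words. A minimal sketch of that expansion, using a hypothetical ContractionSketch module and plain strings rather than Nasty's own tokens (the lookup table holds only the three forms named in the list, and the original casing is dropped on expansion):

```elixir
defmodule ContractionSketch do
  # Hypothetical lookup table built from the three contractions listed above.
  @article_contractions %{
    "del" => ["de", "el"],
    "al" => ["a", "el"],
    "pel" => ["per", "el"]
  }

  # Returns the expanded parts for a known contraction (matched
  # case-insensitively), or the word unchanged in a one-element list.
  def expand(word) do
    Map.get(@article_contractions, String.downcase(word), [word])
  end

  # Expands every contraction in a list of words.
  def expand_all(words), do: Enum.flat_map(words, &expand/1)
end
```

For example, ContractionSketch.expand_all(["El", "gat", "dorm", "al", "sofà"]) yields ["El", "gat", "dorm", "a", "el", "sofà"].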

Usage

iex> alias Nasty.Language.Catalan
iex> {:ok, tokens} = Catalan.tokenize("El gat dorm al sofà.")
iex> {:ok, tagged} = Catalan.tag_pos(tokens)
iex> {:ok, document} = Catalan.parse(tagged)
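
Because each stage above returns an {:ok, value} tuple, the three calls can be chained with Elixir's with. The sketch below is illustrative only: PipelineSketch and StubLanguage are hypothetical names, the stub stands in for Nasty.Language.Catalan so the example runs without the library, and the {:error, reason} shape for failures is an assumption that this page does not confirm.

```elixir
# Hypothetical stand-in that mimics the {:ok, value} return shape shown above.
defmodule StubLanguage do
  def tokenize(""), do: {:error, :empty_input}
  def tokenize(text), do: {:ok, String.split(text)}
  def tag_pos(tokens), do: {:ok, Enum.map(tokens, &{&1, :x})}
  def parse(tagged), do: {:ok, %{tokens: tagged}}
end

defmodule PipelineSketch do
  # Runs tokenize -> tag_pos -> parse on any module exposing those functions.
  # The first result that does not match {:ok, _} is returned unchanged.
  def run(lang, text) do
    with {:ok, tokens} <- lang.tokenize(text),
         {:ok, tagged} <- lang.tag_pos(tokens),
         {:ok, document} <- lang.parse(tagged) do
      {:ok, document}
    end
  end
end
```

With the real module, the call would presumably look like PipelineSketch.run(Nasty.Language.Catalan, "El gat dorm al sofà.").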

Language Code

Catalan uses the ISO 639-1 code :ca.

Summary

Functions

extract_entities(document)
Extracts named entities from Catalan text.

language_code()
Returns the ISO 639-1 language code for Catalan.

metadata()
Returns metadata about the Catalan language implementation.

parse(tokens, opts \\ [])
Parses tagged Catalan tokens into a complete Document AST.

render(ast, opts \\ [])
Renders a Catalan AST node back to natural language text.

summarize(document, opts \\ [])
Summarizes Catalan text using extractive summarization.

tag_pos(tokens, opts \\ [])
Assigns part-of-speech tags to Catalan tokens using the Universal Dependencies tagset.

tokenize(text, opts \\ [])
Tokenizes Catalan text into tokens with position tracking.

Functions

extract_entities(document)

@spec extract_entities(Nasty.AST.Document.t()) :: [Nasty.AST.Semantic.Entity.t()]

Extracts named entities from Catalan text.

Recognizes:

  • Person names (with Catalan naming patterns)
  • Organizations
  • Locations (Catalan place names)
  • Dates

Examples

iex> {:ok, document} = Catalan.parse(tagged)
iex> Catalan.extract_entities(document)
[%Entity{type: :person, text: "Josep Maria"}, ...]
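
The returned list can be post-processed with ordinary Enum functions. A minimal sketch that groups entity texts by type, assuming only the :type and :text fields visible in the example above; plain maps stand in for the Entity struct, EntitySketch is a hypothetical name, and :location is an assumed type value:

```elixir
defmodule EntitySketch do
  # Groups entity texts by their type. Works on any map or struct
  # that exposes :type and :text keys.
  def texts_by_type(entities) do
    Enum.group_by(entities, & &1.type, & &1.text)
  end
end

# Plain maps standing in for %Entity{} structs.
entities = [
  %{type: :person, text: "Josep Maria"},
  %{type: :location, text: "Barcelona"},
  %{type: :person, text: "Maria"}
]

grouped = EntitySketch.texts_by_type(entities)
IO.inspect(grouped)
```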

language_code()

@spec language_code() :: :ca

Returns the ISO 639-1 language code for Catalan.

Examples

iex> Nasty.Language.Catalan.language_code()
:ca

metadata()

Returns metadata about the Catalan language implementation.

Examples

iex> Catalan.metadata()
%{
  name: "Catalan",
  native_name: "Català",
  iso_639_1: "ca",
  family: "Romance",
  speakers: "~10 million"
}

parse(tokens, opts \\ [])

Parses tagged Catalan tokens into a complete Document AST.

The parsing pipeline:

  1. Morphological analysis (lemmatization, features)
  2. Phrase parsing (NP, VP, PP, AdjP, AdvP)
  3. Sentence parsing (clauses, coordination, subordination)
  4. Document construction (paragraphs, sentences)

Options

  • :dependencies - Extract dependency relations (default: false)
  • :entities - Recognize named entities (default: false)
  • :semantic_roles - Extract semantic roles (default: false)

Examples

iex> {:ok, tokens} = Catalan.tokenize("La Maria treballa a Barcelona.")
iex> {:ok, tagged} = Catalan.tag_pos(tokens)
iex> Catalan.parse(tagged)
{:ok, %Document{paragraphs: [%Paragraph{sentences: [...]}]}}
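
All three options default to false, so callers only pass the ones they want enabled. The sketch below shows one common Elixir way to resolve such keyword options against defaults. It is not taken from Nasty's source: ParseOptionsSketch is a hypothetical module, and Keyword.validate!/2 requires Elixir 1.13 or later.

```elixir
defmodule ParseOptionsSketch do
  # Defaults copied from the option list documented above.
  @defaults [dependencies: false, entities: false, semantic_roles: false]

  # Merges caller options over the defaults and returns them as a map.
  # Keyword.validate!/2 raises ArgumentError on keys that are not listed.
  def resolve(opts \\ []) do
    opts
    |> Keyword.validate!(@defaults)
    |> Map.new()
  end
end
```

Rejecting unknown keys early means a typo such as :entites fails loudly instead of being silently ignored.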

render(ast, opts \\ [])

Renders a Catalan AST node back to natural language text.

Handles:

  • Subject-verb agreement
  • Gender/number agreement (adjectives, articles)
  • Catalan word order (post-nominal adjectives)
  • Proper punctuation and capitalization

Examples

iex> document = %Document{...}
iex> Catalan.render(document)
{:ok, "El gat dorm al sofà."}
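
To illustrate two of the points above (article agreement and post-nominal adjectives), here is a toy noun-phrase renderer. It is a sketch under stated assumptions, not Nasty's renderer: RenderSketch and the map keys are hypothetical, the noun and adjective are expected to be already inflected, and elision of the article to l' before a vowel is ignored.

```elixir
defmodule RenderSketch do
  # Picks the Catalan definite article for a gender/number pair.
  # Elision to l' before vowels is deliberately not handled here.
  defp article(:masc, :sg), do: "el"
  defp article(:fem, :sg), do: "la"
  defp article(:masc, :pl), do: "els"
  defp article(:fem, :pl), do: "les"

  # Renders article + noun + adjective, with the adjective after the noun.
  def noun_phrase(%{noun: noun, adj: adj, gender: gender, number: number}) do
    Enum.join([article(gender, number), noun, adj], " ")
  end
end
```

For example, RenderSketch.noun_phrase(%{noun: "casa", adj: "blanca", gender: :fem, number: :sg}) returns "la casa blanca".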

summarize(document, opts \\ [])

@spec summarize(
  Nasty.AST.Document.t(),
  keyword()
) :: String.t()

Summarizes Catalan text using extractive summarization.

Options

  • :ratio - Compression ratio (0.0-1.0)
  • :max_sentences - Maximum sentences in summary
  • :method - :textrank or :mmr (default: :textrank)

Examples

iex> {:ok, document} = Catalan.parse(tagged)
iex> Catalan.summarize(document, ratio: 0.3)
"El gat dorm. La casa és gran."
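
How :ratio and :max_sentences interact is not spelled out on this page. One plausible reading, sketched below as an assumption rather than documented behavior, is that :ratio gives the fraction of sentences to keep and :max_sentences caps the result. SummaryBudgetSketch is a hypothetical helper that works on a sentence count only.

```elixir
defmodule SummaryBudgetSketch do
  # Computes how many sentences a summary would keep.
  # Assumption: ratio is the fraction of sentences kept, at least one
  # sentence is kept when any exist, and max_sentences (if given) caps it.
  def sentence_budget(total, ratio, max_sentences \\ nil) do
    by_ratio =
      (total * ratio)
      |> round()
      |> max(1)
      |> min(total)

    if max_sentences, do: min(by_ratio, max_sentences), else: by_ratio
  end
end
```

Under this reading, a 10-sentence document with ratio: 0.3 keeps 3 sentences, and adding max_sentences: 2 reduces that to 2.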

tag_pos(tokens, opts \\ [])

Assigns part-of-speech tags to Catalan tokens using the Universal Dependencies tagset.

Supports multiple tagging models:

  • :rule - Rule-based tagging (default, ~85% accuracy)
  • :hmm - Hidden Markov Model (planned, ~95% accuracy)
  • :neural - Neural network (planned, ~97% accuracy)

Options

  • :model - Tagging model to use (default: :rule)

Examples

iex> {:ok, tokens} = Catalan.tokenize("El gat dorm.")
iex> Catalan.tag_pos(tokens)
{:ok, [%Token{text: "El", pos_tag: :det}, %Token{text: "gat", pos_tag: :noun}, ...]}
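
Rule-based tagging of this kind typically combines a small lexicon of closed-class words with fallback heuristics. The sketch below is a toy illustration of that idea, not Nasty's actual rule set: RuleTaggerSketch, its lexicon, and the suffix rule are all hypothetical, and tags other than :det and :noun are assumed to be lowercase forms of the corresponding Universal Dependencies tags.

```elixir
defmodule RuleTaggerSketch do
  # Tiny closed-class lexicon; a real tagger would need far more entries.
  @lexicon %{
    "el" => :det, "la" => :det, "els" => :det, "les" => :det,
    "a" => :adp, "de" => :adp, "per" => :adp,
    "i" => :cconj, "o" => :cconj
  }

  # Tags each word, returning {word, tag} pairs.
  def tag(words), do: Enum.map(words, &{&1, tag_word(&1)})

  # Lexicon lookup first, then heuristics for everything else.
  defp tag_word(word) do
    lower = String.downcase(word)

    case Map.fetch(@lexicon, lower) do
      {:ok, tag} -> tag
      :error -> fallback(lower)
    end
  end

  defp fallback(lower) do
    cond do
      # Tokens made up only of punctuation characters.
      String.match?(lower, ~r/^\p{P}+$/u) -> :punct
      # Crude suffix heuristic: also mis-tags nouns such as "moment".
      String.ends_with?(lower, "ment") -> :adv
      # Default guess; verbs and adjectives are mis-tagged by this toy rule.
      true -> :noun
    end
  end
end
```

The default-to-noun fallback is one reason a purely rule-based tagger tops out well below statistical models.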

tokenize(text, opts \\ [])

Tokenizes Catalan text into tokens with position tracking.

Handles Catalan-specific features:

  • Interpunct (l·l) kept as a single token
  • Apostrophe contractions (l'home → ["l'", "home"])
  • Article contractions (del → ["de", "el"])
  • Catalan diacritics (à, è, é, í, ï, ò, ó, ú, ü, ç)

Options

  • :preserve_contractions - Keep contractions intact (default: false)

Examples

iex> Catalan.tokenize("L'home col·labora.")
{:ok, [%Token{text: "L'"}, %Token{text: "home"}, %Token{text: "col·labora"}, %Token{text: "."}]}
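
The interpunct and apostrophe rules above can be approximated with a single regular expression. The following is a minimal sketch, not Nasty's tokenizer: TokenizerSketch is a hypothetical module, tokens are plain maps, the :byte_offset key is an assumed way of recording position, and article contractions such as "del" are left unexpanded.

```elixir
defmodule TokenizerSketch do
  # Alternatives are tried left to right at each scan position:
  #   1. an elided form (l', d', s', n', m', t') followed by a letter
  #   2. a word, optionally with interpuncts between runs of letters
  #   3. a run of digits
  #   4. any other single non-space character (punctuation)
  defp token_regex do
    ~r/[ldsnmt]['’](?=\p{L})|\p{L}+(?:·\p{L}+)*|\p{N}+|[^\s\p{L}\p{N}]/iu
  end

  # Returns tokens as maps holding the text and its starting byte offset.
  def tokenize(text) do
    token_regex()
    |> Regex.scan(text, return: :index)
    |> Enum.map(fn [{start, len}] ->
      %{text: binary_part(text, start, len), byte_offset: start}
    end)
  end
end
```

Offsets here are in bytes, not characters: the interpunct occupies two bytes in UTF-8, so the final period of "L'home col·labora." sits at byte offset 18.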