Nasty → Natural Abstract Syntax Tree Yeoman
A comprehensive NLP library for Elixir that treats natural language with the same rigor as programming languages.
Nasty provides a complete grammatical Abstract Syntax Tree (AST) for multiple natural languages (English, Spanish, and Catalan), with a full NLP pipeline from tokenization to text summarization.
- Tokenization - NimbleParsec-based text segmentation
- POS Tagging - Rule-based + Statistical (HMM with Viterbi) + Neural (BiLSTM-CRF)
- Morphological Analysis - Lemmatization and features
- Phrase Structure Parsing - NP, VP, PP, and relative clauses
- Complex Sentences - Coordination, subordination
- Dependency Extraction - Universal Dependencies relations
- Named Entity Recognition - Person, place, organization
- Semantic Role Labeling - Predicate-argument structure (who did what to whom)
- Coreference Resolution - Link mentions across sentences
- Text Summarization - Extractive summarization with MMR
- Question Answering - Extractive QA for factoid questions
- Text Classification - Multinomial Naive Bayes classifier with multiple feature types
- Information Extraction - Relation extraction, event extraction, and template-based extraction
- Statistical Models - HMM POS tagger with 95% accuracy
- Neural Models - BiLSTM-CRF with 97-98% accuracy using Axon/EXLA
- Code Interoperability - Bidirectional NL ↔ Code conversion (Natural language commands to Elixir code and vice versa)
- AST Rendering - Convert AST back to natural language text
- Translation - AST-based translation with morphological agreement and word order transformations
- AST Utilities - Traversal, queries, validation, and transformations
- Visualization - Export to DOT/Graphviz and JSON formats
- Multi-Language Support - English, Spanish, and Catalan with language-agnostic architecture
Quick Start
# Run the complete demo
mix run demo.exs
# Or try specific examples
mix run examples/catalan_example.exs
mix run examples/roundtrip_translation.exs
mix run examples/multilingual_pipeline.exs
New to Nasty? Start with the Getting Started Guide for a beginner-friendly tutorial.
alias Nasty.Language.English
# Simple example
text = "John Smith works at Google in New York."
{:ok, tokens} = English.tokenize(text)
{:ok, tagged} = English.tag_pos(tokens)
{:ok, document} = English.parse(tagged)
# Extract entities
alias Nasty.Language.English.EntityRecognizer
entities = EntityRecognizer.recognize(tagged)
# => [%Entity{type: :person, text: "John Smith"},
# %Entity{type: :org, text: "Google"}, ...]
# Extract dependencies
alias Nasty.Language.English.DependencyExtractor
sentences = document.paragraphs |> Enum.flat_map(& &1.sentences)
deps = Enum.flat_map(sentences, &DependencyExtractor.extract/1)
# Semantic role labeling
{:ok, document_with_srl} = Nasty.Language.English.parse(tagged, semantic_roles: true)
# Access semantic frames
frames = document_with_srl.semantic_frames
# => [%SemanticFrame{predicate: "works", roles: [%Role{type: :agent, text: "John Smith"}, ...]}]
# Coreference resolution
{:ok, document_with_coref} = Nasty.Language.English.parse(tagged, coreference: true)
# Access coreference chains
chains = document_with_coref.coref_chains
# => [%CorefChain{representative: "John Smith", mentions: ["John Smith", "he"], ...}]
# Summarize
summary = English.summarize(document, ratio: 0.3) # 30% compression
# or
summary = English.summarize(document, max_sentences: 3) # Fixed count
# MMR (Maximal Marginal Relevance) for reduced redundancy
summary_mmr = English.summarize(document, max_sentences: 3, method: :mmr, mmr_lambda: 0.5)
# Question answering
{:ok, answers} = English.answer_question(document, "Who works at Google?")
# => [%Answer{text: "John Smith", confidence: 0.85, ...}]
# Statistical POS tagging (auto-loads from priv/models/)
{:ok, tokens_hmm} = English.tag_pos(tokens, model: :hmm)
# Neural POS tagging (97-98% accuracy)
{:ok, tokens_neural} = English.tag_pos(tokens, model: :neural)
# Or ensemble mode (combines neural + statistical + rule-based)
{:ok, tokens_ensemble} = English.tag_pos(tokens, model: :ensemble)
# Text classification
# Train a sentiment classifier
training_data = [
{positive_doc1, :positive},
{positive_doc2, :positive},
{negative_doc1, :negative},
{negative_doc2, :negative}
]
model = English.train_classifier(training_data, features: [:bow, :lexical])
# Classify new documents
{:ok, predictions} = English.classify(test_doc, model)
# => [%Classification{class: :positive, confidence: 0.85, ...}, ...]
# Information extraction
# Extract relations between entities
{:ok, relations} = English.extract_relations(document)
# => [%Relation{type: :works_at, subject: person, object: org, confidence: 0.8}]
# Extract events with participants
{:ok, events} = English.extract_events(document)
# => [%Event{type: :business_acquisition, trigger: "acquired", participants: %{agent: ..., patient: ...}}]
# Template-based extraction
templates = [TemplateExtractor.employment_template()]
{:ok, results} = English.extract_templates(document, templates)
# => [%{template: "employment", slots: %{employee: "John", employer: "Google"}, confidence: 0.85}]

Architecture
graph LR
A[Text] --> B[Tokenization]
B --> C[POS Tagging]
C --> D[Phrase Parsing]
D --> E[Sentence Parsing]
E --> F[Document AST]
F --> G[Dependencies]
F --> H[Entities]
F --> I[Summarization]
F --> J[Translation]
F --> K[More...]
style F fill:#e1f5ff
style A fill:#fff3e0

Complete Pipeline
- Tokenization (English.Tokenizer) → Split text into tokens
- POS Tagging (English.POSTagger) → Assign grammatical categories
- Morphology (English.Morphology) → Lemmatization and features
- Phrase Parsing (English.PhraseParser) → Build NP, VP, PP structures
- Sentence Parsing (English.SentenceParser) → Detect clauses and structure
- Dependency Extraction (English.DependencyExtractor) → Grammatical relations
- Entity Recognition (English.EntityRecognizer) → Named entities
- Semantic Role Labeling (English.SemanticRoleLabeler) → Predicate-argument structure
- Coreference Resolution (English.CoreferenceResolver) → Link mentions
- Summarization (English.Summarizer) → Extract key sentences
- Question Answering (English.QuestionAnalyzer, English.AnswerExtractor) → Answer questions
- Text Classification (English.FeatureExtractor, English.TextClassifier) → Train and classify documents
- Information Extraction (English.RelationExtractor, English.EventExtractor, English.TemplateExtractor) → Extract structured information
- AST Rendering (Rendering.Text) → Convert AST back to natural language
- AST Utilities (Utils.Traversal, Utils.Query, Utils.Validator, Utils.Transform) → Traverse, query, validate, and transform trees
- Visualization (Rendering.Visualization, Rendering.PrettyPrint) → Export to DOT/JSON and debug output
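The pipeline stages above chain naturally with `with`. A minimal sketch using only the English functions documented in this README (error handling elided; the option names come from the examples shown elsewhere in this file):

```elixir
alias Nasty.Language.English

# Run the documented pipeline end to end; each stage returns {:ok, result}.
with {:ok, tokens} <- English.tokenize("John Smith works at Google."),
     {:ok, tagged} <- English.tag_pos(tokens),
     {:ok, document} <- English.parse(tagged, semantic_roles: true, coreference: true) do
  # The Document AST is the hub: every downstream analysis consumes it.
  English.summarize(document, max_sentences: 1)
end
```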
Features
Phrase Structures
- Noun Phrases (NP): Det? Adj* Noun PP* RelClause*
- Verb Phrases (VP): Aux* Verb NP? PP* Adv*
- Prepositional Phrases (PP): Prep NP
- Relative Clauses: RelPron/RelAdv Clause
Sentence Types
- Simple, Compound, Complex sentences
- Coordination (and, or, but)
- Subordination (because, although, if)
- Relative clauses (who, which, that)
Dependencies (Universal Dependencies)
- Core arguments: nsubj, obj, iobj
- Modifiers: amod, advmod, det, case
- Clausal: acl, advcl, mark
- Coordination: conj, cc
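Extracted dependencies can be filtered by UD label. A brief sketch, assuming each dependency struct exposes a `:relation` field (an assumption, not confirmed by this README):

```elixir
alias Nasty.Language.English.DependencyExtractor

# Keep only core-argument relations from a parsed sentence.
# NOTE: the :relation field name is an assumption for illustration.
deps = DependencyExtractor.extract(sentence)
core_args = Enum.filter(deps, fn dep -> dep.relation in [:nsubj, :obj, :iobj] end)
```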
Entity Types
- Person, Organization, Place (GPE)
- With confidence scores and multi-word support
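Since entities carry type and confidence scores, results can be narrowed after recognition. A sketch using the `EntityRecognizer` call from the Quick Start; the `:confidence` field name is an assumption based on the documented scores:

```elixir
alias Nasty.Language.English.EntityRecognizer

# Keep only high-confidence person entities.
# NOTE: the :confidence field name is assumed for illustration.
entities = EntityRecognizer.recognize(tagged)

people =
  entities
  |> Enum.filter(&(&1.type == :person))
  |> Enum.filter(&(&1.confidence >= 0.7))
```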
Multi-Language Support
Nasty provides a language-agnostic architecture using Elixir behaviours, enabling support for multiple natural languages:
Supported Languages
- English (Nasty.Language.English) - Fully implemented
- Spanish (Nasty.Language.Spanish) - Fully implemented
  - Spanish-specific tokenization (¿?, ¡!, contractions del/al, accented characters)
  - Spanish morphology (verb conjugations, gender/number agreement)
  - Complete NLP pipeline (tokenization → parsing → summarization)
- Catalan (Nasty.Language.Catalan) - Fully implemented (Phases 1-7)
  - Catalan-specific tokenization (interpunct l·l, apostrophe contractions, 10 diacritics)
  - Catalan morphology (3 verb classes, irregular verbs, gender/number agreement)
  - Full parsing pipeline (phrase/sentence parsing, dependency extraction, NER)
Usage
alias Nasty.Language.Spanish
# Spanish text processing
text = "El gato duerme en el sofá."
{:ok, tokens} = Spanish.tokenize(text)
{:ok, tagged} = Spanish.tag_pos(tokens)
{:ok, document} = Spanish.parse(tagged)
# Works identically to English
summary = Spanish.summarize(document, ratio: 0.3)
{:ok, entities} = Spanish.extract_entities(document)
# Catalan text processing
alias Nasty.Language.Catalan
text_ca = "El gat dorm al sofà."
{:ok, tokens_ca} = Catalan.tokenize(text_ca)
{:ok, tagged_ca} = Catalan.tag_pos(tokens_ca)
{:ok, document_ca} = Catalan.parse(tagged_ca)
# Extract entities (Catalan-specific lexicons)
alias Nasty.Language.Catalan.EntityRecognizer
{:ok, entities_ca} = EntityRecognizer.recognize(tagged_ca)
# Translate between languages (AST-based)
alias Nasty.Translation.Translator
# English to Spanish
{:ok, tokens_en} = English.tokenize("The quick cat runs.")
{:ok, tagged_en} = English.tag_pos(tokens_en)
{:ok, doc_en} = English.parse(tagged_en)
{:ok, doc_es} = Translator.translate_document(doc_en, :es)
{:ok, text_es} = Nasty.render(doc_es)
# => "El gato rápido corre."
# Spanish to English
{:ok, tokens_es} = Spanish.tokenize("La casa grande.")
{:ok, tagged_es} = Spanish.tag_pos(tokens_es)
{:ok, doc_es} = Spanish.parse(tagged_es)
{:ok, doc_en} = Translator.translate_document(doc_es, :en)
{:ok, text_en} = Nasty.render(doc_en)
# => "The big house."

Language Registry
All languages are registered in Nasty.Language.Registry and can be accessed dynamically:
# Auto-detect language
{:ok, lang} = Nasty.Language.Registry.detect_language("¿Cómo estás?")
# => :es
# Get language module
{:ok, Spanish} = Nasty.Language.Registry.get(:es)

See complete language-specific examples:

- examples/spanish_example.exs - Spanish NLP pipeline demonstration
- examples/catalan_example.exs - Catalan tokenization, morphology, and parsing
- examples/roundtrip_translation.exs - Translation quality analysis with roundtrip testing
- examples/multilingual_pipeline.exs - Side-by-side comparison of English/Spanish/Catalan
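Because every language module implements the same behaviour, the registry enables fully dynamic dispatch. A sketch combining only the `Registry` functions and pipeline callbacks documented above (the wrapper module name is illustrative):

```elixir
defmodule MultilingualPipeline do
  alias Nasty.Language.Registry

  # Detect the language, fetch its module, and run the shared pipeline.
  def process(text) do
    with {:ok, lang} <- Registry.detect_language(text),
         {:ok, mod} <- Registry.get(lang),
         {:ok, tokens} <- mod.tokenize(text),
         {:ok, tagged} <- mod.tag_pos(tokens) do
      mod.parse(tagged)
    end
  end
end

MultilingualPipeline.process("¿Cómo estás?")  # dispatches to the Spanish module
```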
Text Summarization
- Extractive summarization - Select important sentences from document
- Multiple scoring features:
- Position weight (early sentences score higher)
- Entity density (sentences with named entities)
- Discourse markers ("in conclusion", "importantly", etc.)
- Keyword frequency (TF scoring)
- Sentence length (prefer moderate length)
- Coreference participation (sentences in coref chains)
- Selection methods:
  - :greedy - Top-N by score (default)
  - :mmr - Maximal Marginal Relevance (reduces redundancy)
- Flexible options: compression ratio or fixed sentence count
Question Answering
- Extractive QA - Extract answer spans from documents
- Question classification:
- WHO (person entities)
- WHAT (things, organizations)
- WHEN (temporal expressions)
- WHERE (locations)
- WHY (reasons, clauses)
- HOW (manner, quantity)
- YES/NO (boolean questions)
- Answer extraction strategies:
- Keyword matching with lemmatization
- Entity type filtering (person, organization, location)
- Temporal expression recognition
- Confidence scoring and ranking
- Multiple answer support with confidence scores
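Because answers come back ranked with confidence scores, batch QA reduces to picking each question's top answer. A sketch using the documented `answer_question/2` and the `confidence` field shown in the Quick Start (it assumes each question yields at least one answer):

```elixir
alias Nasty.Language.English

# Ask several questions and keep the highest-confidence answer for each.
questions = ["Who works at Google?", "Where does John Smith work?"]

best_answers =
  for question <- questions do
    {:ok, answers} = English.answer_question(document, question)
    {question, Enum.max_by(answers, & &1.confidence)}
  end
```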
Text Classification
- Multinomial Naive Bayes - Probabilistic classifier with Laplace smoothing
- Multiple feature types:
  - :bow - Bag of words (lemmatized, stop word filtering)
  - :ngrams - Word sequences (bigrams, trigrams, etc.)
  - :pos_patterns - POS tag sequences
  - :syntactic - Sentence structure statistics
  - :entities - Named entity distributions
  - :lexical - Vocabulary richness and text statistics
- Training and prediction:
  - Train on labeled documents: {document, class} tuples
  - Multi-class classification support
  - Confidence scores and probability distributions
- Model evaluation:
- Accuracy, precision, recall, F1 metrics
- Per-class performance breakdowns
- Use cases:
- Sentiment analysis (positive/negative reviews)
- Spam detection (spam/ham classification)
- Topic classification (sports, tech, politics, etc.)
- Formality detection (formal/informal text)
Information Extraction
Relation Extraction - Extract semantic relationships between entities
- Supported relations:
  - Employment: works_at, employed_by, member_of
  - Organization: founded, acquired_by, subsidiary_of
  - Location: located_in, based_in, headquarters_in
  - Temporal: occurred_on, founded_in
- Pattern-based extraction using verb patterns and prepositions
- Confidence scoring (0.5-0.8 based on pattern strength)
- Integrates with NER and dependency parsing
Event Extraction - Identify events with triggers and participants
- Event types:
  - Business: business_acquisition, business_merger, company_founding, product_launch
  - Employment: employment_start, employment_end
  - Communication: announcement, meeting
  - Other: movement, transaction
- Verb and nominalization triggers
- Participant extraction using semantic role labeling
- Temporal expression linking
- Confidence scoring (0.7-0.8)
Template-Based Extraction - Structured information using custom templates
- Define extraction templates with typed slots
- Pre-defined templates: employment, acquisition, location
- Flexible pattern matching
- Required/optional slot support
- Confidence based on slot fill rate
API Functions:
# Extract relations
{:ok, relations} = English.extract_relations(document, min_confidence: 0.6)

# Extract events
{:ok, events} = English.extract_events(document, max_events: 10)

# Template extraction
templates = [TemplateExtractor.employment_template()]
{:ok, results} = English.extract_templates(document, templates)
Code Interoperability
Convert between natural language and Elixir code bidirectionally:
NL → Code Generation - Convert natural language commands to executable Elixir code
- List operations: "Sort the numbers" → Enum.sort(numbers)
- Filtering: "Filter users where age > 18" → Enum.filter(users, fn item -> item > 18 end)
- Mapping: "Map the list" → Enum.map(list, fn item -> item end)
- Arithmetic: "X plus Y" → x + y
- Assignments: "X is 5" → x = 5
- Conditionals: "If X then Y" → if x, do: y
Code → NL Explanation - Generate natural language explanations from code
- Enum.sort(numbers) → "sort numbers"
- x = a + b → "X is a plus b"
- if x > 5, do: :ok → "If x is greater than 5, then :ok"
- Pipeline support: list |> Enum.map(&(&1 * 2)) |> Enum.sum() → "map list to each element times 2, then sum list"
API Functions:
# Natural language → Code
{:ok, code} = English.to_code("Sort the numbers")
# => "Enum.sort(numbers)"

# Code → Natural language
{:ok, explanation} = English.explain_code("Enum.filter(users, fn u -> u.age > 18 end)")
# => "filter users where u u age is greater than 18"

# Get intent without generating code
{:ok, intent} = English.recognize_intent("Filter the users")
# => %Intent{type: :action, action: "filter", target: "users", confidence: 0.95}

# Optional: Enhance with Ragex for context-aware suggestions
{:ok, code} = English.to_code("Sort the list", enhance_with_ragex: true)

Example Scripts:

- examples/code_generation.exs - Natural language to code demos
- examples/code_explanation.exs - Code to natural language demos
AST Rendering & Utilities
Convert AST back to text, traverse and query trees, validate structures, and export visualizations:
Text Rendering - Regenerate natural language from AST
alias Nasty.Rendering.Text

# Render AST to text
{:ok, text} = Text.render(document)
# => "The cat sat on the mat."

# Custom rendering options
{:ok, text} = Text.render(document,
  capitalize_sentences: false,
  add_punctuation: false,
  paragraph_separator: "\n\n"
)

AST Traversal - Walk the tree with visitor pattern
alias Nasty.Utils.Traversal

# Count all tokens
token_count = Traversal.reduce(document, 0, fn
  %Token{}, acc -> acc + 1
  _, acc -> acc
end)

# Collect all nouns
nouns = Traversal.collect(document, fn
  %Token{pos_tag: :noun} -> true
  _ -> false
end)

# Transform tree (lowercase all text)
lowercased = Traversal.map(document, fn
  %Token{} = token -> %{token | text: String.downcase(token.text)}
  node -> node
end)

AST Queries - High-level query API
alias Nasty.Utils.Query

# Find all noun phrases
noun_phrases = Query.find_all(document, :noun_phrase)

# Find tokens by POS tag
verbs = Query.find_by_pos(document, :verb)

# Extract entities
people = Query.extract_entities(document, type: :PERSON)

# Find sentence subject
subject = Query.find_subject(sentence)

# Count nodes
token_count = Query.count(document, :token)

Pretty Printing - Debug AST structures
alias Nasty.Rendering.PrettyPrint

# Indented output
IO.puts(PrettyPrint.print(document, color: true))

# Tree-style output with box characters
IO.puts(PrettyPrint.tree(document))

# Statistics
IO.puts(PrettyPrint.stats(document))
# => AST Statistics:
#      Paragraphs: 3
#      Sentences: 12
#      Tokens: 127

Visualization - Export for graphical rendering
alias Nasty.Rendering.Visualization

# Export to DOT format (Graphviz)
dot = Visualization.to_dot(document, type: :parse_tree)
File.write("tree.dot", dot)
# Then: dot -Tpng tree.dot -o tree.png

# Dependency graph
deps_dot = Visualization.to_dot(sentence, type: :dependencies)

# Entity graph
entity_dot = Visualization.to_dot(document, type: :entities)

# JSON export for d3.js
json = Visualization.to_json(document)

Validation - Ensure AST integrity
alias Nasty.Utils.Validator

# Validate structure
{:ok, document} = Validator.validate(document)

# Check spans
:ok = Validator.validate_spans(document)

# Check language consistency
:ok = Validator.validate_language(document)

Transformations - Modify AST structures
alias Nasty.Utils.Transform

# Normalize case
lowercased = Transform.normalize_case(document, :lower)

# Remove punctuation
no_punct = Transform.remove_punctuation(document)

# Remove stop words
no_stops = Transform.remove_stop_words(document)

# Lemmatize all tokens
lemmatized = Transform.lemmatize(document)

# Apply pipeline of transformations
processed = Transform.pipeline(document, [
  &Transform.normalize_case(&1, :lower),
  &Transform.remove_punctuation/1,
  &Transform.remove_stop_words/1
])
Testing
# Run all tests
mix test
# Run specific module tests
mix test test/language/english/tokenizer_test.exs
mix test test/language/english/phrase_parser_test.exs
mix test test/language/english/dependency_extractor_test.exs
Documentation
Comprehensive documentation is available in the docs/ directory:
Getting Started
- STRENGTHS_AND_LIMITATIONS.md - A comprehensive analysis of what Nasty is and isn't good for
- GETTING_STARTED.md - Beginner-friendly tutorial with step-by-step examples
- EXAMPLES.md - Complete catalog of all 18 example scripts with usage guides
Core Documentation
- PLAN.md - Original vision and architectural design
- TODO.md - Unimplemented features and future enhancements
- PARSING_GUIDE.md - Complete parsing algorithms reference (tokenization, POS tagging, morphology, phrase/sentence parsing, dependencies)
- ARCHITECTURE.md - System architecture and design patterns
- USER_GUIDE.md - User guide with examples and API reference
- API.md - Complete API reference for all modules
- AST_REFERENCE.md - Complete AST node reference
- PERFORMANCE.md - Benchmarks, optimization tips, and performance considerations
Language-Specific Documentation
- ENGLISH_GRAMMAR.md - Formal English grammar specification with CFG rules
- SPANISH.md - Spanish language support details
- CATALAN.md - Catalan language support details
- TRANSLATION.md - AST-based translation system guide
- GRAMMAR_CUSTOMIZATION.md - Guide for custom grammar rules and domain variants
Statistical & Neural Models
Nasty includes comprehensive statistical and neural network models for state-of-the-art NLP:
Statistical Models
Sequence Labeling
- HMM POS Tagger: Hidden Markov Model with Viterbi decoding (~95% accuracy)
- CRF (Conditional Random Fields): Feature-based sequence labeling
- Named Entity Recognition
- POS tagging
- Chunking and segmentation
- Forward-backward algorithm for training
- Viterbi decoding for prediction
- Multiple optimization methods (SGD, Momentum, AdaGrad)
Parsing
- PCFG (Probabilistic Context-Free Grammar): Statistical phrase structure parsing
- CYK algorithm for efficient parsing
- Grammar learning from treebanks
- Chomsky Normal Form (CNF) conversion
- Smoothing and probability estimation
- Beam search for pruning
Classification
- Naive Bayes Classifier: Fast text classification
- Multiple feature types (BOW, n-grams, POS patterns)
- Laplace smoothing
- Multi-class support
Neural Models
- BiLSTM-CRF: Bidirectional LSTM with CRF for sequence tagging (97-98% accuracy)
- Axon/EXLA: Pure Elixir neural networks with GPU acceleration
- Pre-trained embeddings: Support for GloVe, FastText
- Training infrastructure: Train custom models on your own data
- Evaluation metrics: Accuracy, precision, recall, F1, confusion matrices
Transformer Models (Bumblebee Integration)
- Pre-trained Models: BERT, RoBERTa, DistilBERT, XLM-RoBERTa via Hugging Face
- Fine-tuning: Full fine-tuning pipeline for POS tagging and NER (98-99% accuracy)
- Zero-shot Classification: Classify without training using NLI models (70-85% accuracy)
- Model Quantization: INT8 quantization for 4x compression and 2-3x speedup
- Multilingual Support: XLM-RoBERTa for cross-lingual transfer
- Mix Tasks: CLI tools for model management, fine-tuning, and inference
See Statistical Models for complete reference, Neural Models for neural architecture details, Training Neural for training guide, Pretrained Models for transformer usage, Zero Shot for zero-shot classification, and Quantization for model optimization.
Quick Start: Model Management
# List available models
mix nasty.models list
# Train HMM POS tagger (fast, 95% accuracy)
mix nasty.train.pos \
--corpus data/UD_English-EWT/en_ewt-ud-train.conllu \
--test data/UD_English-EWT/en_ewt-ud-test.conllu \
--output priv/models/en/pos_hmm_v1.model
# Train neural POS tagger (slower, 97-98% accuracy)
mix nasty.train.neural_pos \
--corpus data/UD_English-EWT/en_ewt-ud-train.conllu \
--output priv/models/en/pos_neural_v1.axon \
--epochs 10 \
--batch-size 32
# Train CRF for NER
mix nasty.train.crf \
--corpus data/train.conllu \
--test data/test.conllu \
--output priv/models/en/ner_crf.model \
--task ner \
--iterations 100
# Train PCFG parser
mix nasty.train.pcfg \
--corpus data/en_ewt-ud-train.conllu \
--test data/en_ewt-ud-test.conllu \
--output priv/models/en/pcfg.model \
--smoothing 0.001
# Evaluate models
mix nasty.eval.pos \
--model priv/models/en/pos_hmm_v1.model \
--test data/UD_English-EWT/en_ewt-ud-test.conllu \
--baseline
mix nasty.eval \
--model priv/models/en/ner_crf.model \
--test data/test.conllu \
--type crf \
--task ner
mix nasty.eval \
--model priv/models/en/pcfg.model \
--test data/test.conllu \
--type pcfg
Future Enhancements
- [x] Statistical models for improved accuracy (HMM POS tagger - done!)
- [x] Neural models (BiLSTM-CRF POS tagger with 97-98% accuracy - done!)
- [x] PCFG parser for phrase structure (done!)
- [x] CRF for named entity recognition (done!)
- [x] Semantic role labeling (rule-based SRL - done!)
- [x] Coreference resolution (heuristic-based - done!)
- [x] Question answering (extractive QA - done!)
- [x] Information extraction (relations, events, templates - done!)
- [x] Code ↔ NL bidirectional conversion (done!)
- [x] Pre-trained transformers (BERT, RoBERTa via Bumblebee - done!)
- [x] Fine-tuning infrastructure for POS tagging and NER (done!)
- [x] Zero-shot classification using NLI models (done!)
- [x] Model quantization (INT8 with 4x compression) (done!)
- [x] Integration of PCFG/CRF with main pipeline (done!)
- [x] Multi-language support - Spanish and Catalan complete
- [ ] Advanced coreference (neural models)
License
MIT License — see LICENSE file for details.
Built with ❤️ using Elixir and NimbleParsec