Nasty → Natural Abstract Syntax Tree Yeoman
A comprehensive NLP library for Elixir that treats natural language with the same rigor as programming languages.
Nasty provides a complete grammatical Abstract Syntax Tree (AST) for multiple natural languages (English, Spanish, and Catalan), with a full NLP pipeline from tokenization to text summarization.
- Tokenization - NimbleParsec-based text segmentation
- POS Tagging - Rule-based + Statistical (HMM with Viterbi) + Neural (BiLSTM-CRF)
- Morphological Analysis - Lemmatization and features
- Phrase Structure Parsing - NP, VP, PP, and relative clauses
- Complex Sentences - Coordination, subordination
- Dependency Extraction - Universal Dependencies relations
- Named Entity Recognition - Person, place, organization
- Semantic Role Labeling - Predicate-argument structure (who did what to whom)
- Coreference Resolution - Link mentions across sentences
- Text Summarization - Extractive summarization with MMR
- Question Answering - Extractive QA for factoid questions
- Text Classification - Multinomial Naive Bayes classifier with multiple feature types
- Information Extraction - Relation extraction, event extraction, and template-based extraction
- Statistical Models - HMM POS tagger with 95% accuracy
- Neural Models - BiLSTM-CRF with 97-98% accuracy using Axon/EXLA
- Code Interoperability - Bidirectional NL ↔ Code conversion (Natural language commands to Elixir code and vice versa)
- AST Rendering - Convert AST back to natural language text
- Translation - AST-based translation with morphological agreement and word order transformations
- AST Utilities - Traversal, queries, validation, and transformations
- Visualization - Export to DOT/Graphviz and JSON formats
- Multi-Language Support - English, Spanish, and Catalan with language-agnostic architecture
Quick Start
# Run the complete demo
mix run demo.exs
# Or try specific examples
mix run examples/catalan_example.exs
mix run examples/roundtrip_translation.exs
mix run examples/multilingual_pipeline.exs
New to Nasty? Start with the Getting Started Guide for a beginner-friendly tutorial.
alias Nasty.Language.English
# Simple example
text = "John Smith works at Google in New York."
{:ok, tokens} = English.tokenize(text)
{:ok, tagged} = English.tag_pos(tokens)
{:ok, document} = English.parse(tagged)
# Extract entities
alias Nasty.Language.English.EntityRecognizer
entities = EntityRecognizer.recognize(tagged)
# => [%Entity{type: :person, text: "John Smith"},
# %Entity{type: :org, text: "Google"}, ...]
# Extract dependencies
alias Nasty.Language.English.DependencyExtractor
sentences = document.paragraphs |> Enum.flat_map(& &1.sentences)
deps = Enum.flat_map(sentences, &DependencyExtractor.extract/1)
# Semantic role labeling
{:ok, document_with_srl} = Nasty.Language.English.parse(tagged, semantic_roles: true)
# Access semantic frames
frames = document_with_srl.semantic_frames
# => [%SemanticFrame{predicate: "works", roles: [%Role{type: :agent, text: "John Smith"}, ...]}]
# Coreference resolution
{:ok, document_with_coref} = Nasty.Language.English.parse(tagged, coreference: true)
# Access coreference chains
chains = document_with_coref.coref_chains
# => [%CorefChain{representative: "John Smith", mentions: ["John Smith", "he"], ...}]
# Summarize
summary = English.summarize(document, ratio: 0.3) # 30% compression
# or
summary = English.summarize(document, max_sentences: 3) # Fixed count
# MMR (Maximal Marginal Relevance) for reduced redundancy
summary_mmr = English.summarize(document, max_sentences: 3, method: :mmr, mmr_lambda: 0.5)
# Question answering
{:ok, answers} = English.answer_question(document, "Who works at Google?")
# => [%Answer{text: "John Smith", confidence: 0.85, ...}]
# Statistical POS tagging (auto-loads from priv/models/)
{:ok, tokens_hmm} = English.tag_pos(tokens, model: :hmm)
# Neural POS tagging (97-98% accuracy)
{:ok, tokens_neural} = English.tag_pos(tokens, model: :neural)
# Or ensemble mode (combines neural + statistical + rule-based)
{:ok, tokens_ensemble} = English.tag_pos(tokens, model: :ensemble)
# Text classification
# Train a sentiment classifier
training_data = [
{positive_doc1, :positive},
{positive_doc2, :positive},
{negative_doc1, :negative},
{negative_doc2, :negative}
]
model = English.train_classifier(training_data, features: [:bow, :lexical])
# Classify new documents
{:ok, predictions} = English.classify(test_doc, model)
# => [%Classification{class: :positive, confidence: 0.85, ...}, ...]
# Information extraction
# Extract relations between entities
{:ok, relations} = English.extract_relations(document)
# => [%Relation{type: :works_at, subject: person, object: org, confidence: 0.8}]
# Extract events with participants
{:ok, events} = English.extract_events(document)
# => [%Event{type: :business_acquisition, trigger: "acquired", participants: %{agent: ..., patient: ...}}]
# Template-based extraction
templates = [TemplateExtractor.employment_template()]
{:ok, results} = English.extract_templates(document, templates)
# => [%{template: "employment", slots: %{employee: "John", employer: "Google"}, confidence: 0.85}]

Architecture
graph LR
A[Text] --> B[Tokenization]
B --> C[POS Tagging]
C --> D[Phrase Parsing]
D --> E[Sentence Parsing]
E --> F[Document AST]
F --> G[Dependencies]
F --> H[Entities]
F --> I[Summarization]
F --> J[Translation]
F --> K[More...]
style F fill:#e1f5ff
style A fill:#fff3e0

Complete Pipeline
- Tokenization (English.Tokenizer) → Split text into tokens
- POS Tagging (English.POSTagger) → Assign grammatical categories
- Morphology (English.Morphology) → Lemmatization and features
- Phrase Parsing (English.PhraseParser) → Build NP, VP, PP structures
- Sentence Parsing (English.SentenceParser) → Detect clauses and structure
- Dependency Extraction (English.DependencyExtractor) → Grammatical relations
- Entity Recognition (English.EntityRecognizer) → Named entities
- Semantic Role Labeling (English.SemanticRoleLabeler) → Predicate-argument structure
- Coreference Resolution (English.CoreferenceResolver) → Link mentions
- Summarization (English.Summarizer) → Extract key sentences
- Question Answering (English.QuestionAnalyzer, English.AnswerExtractor) → Answer questions
- Text Classification (English.FeatureExtractor, English.TextClassifier) → Train and classify documents
- Information Extraction (English.RelationExtractor, English.EventExtractor, English.TemplateExtractor) → Extract structured information
- AST Rendering (Rendering.Text) → Convert AST back to natural language
- AST Utilities (Utils.Traversal, Utils.Query, Utils.Validator, Utils.Transform) → Traverse, query, validate, and transform trees
- Visualization (Rendering.Visualization, Rendering.PrettyPrint) → Export to DOT/JSON and debug output
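The pipeline stages above chain naturally with `with`. A minimal sketch using only the English functions documented in this README (error handling elided; the option names come from the examples shown elsewhere in this file):

```elixir
alias Nasty.Language.English

# Run the documented pipeline end to end; each stage returns {:ok, result}.
with {:ok, tokens} <- English.tokenize("John Smith works at Google."),
     {:ok, tagged} <- English.tag_pos(tokens),
     {:ok, document} <- English.parse(tagged, semantic_roles: true, coreference: true) do
  # The Document AST is the hub: every downstream analysis consumes it.
  English.summarize(document, max_sentences: 1)
end
```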
Features
Phrase Structures
- Noun Phrases (NP): Det? Adj* Noun PP* RelClause*
- Verb Phrases (VP): Aux* Verb NP? PP* Adv*
- Prepositional Phrases (PP): Prep NP
- Relative Clauses: RelPron/RelAdv Clause
Sentence Types
- Simple, Compound, Complex sentences
- Coordination (and, or, but)
- Subordination (because, although, if)
- Relative clauses (who, which, that)
Dependencies (Universal Dependencies)
- Core arguments: nsubj, obj, iobj
- Modifiers: amod, advmod, det, case
- Clausal: acl, advcl, mark
- Coordination: conj, cc
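Extracted dependencies can be filtered by UD label. A brief sketch, assuming each dependency struct exposes a `:relation` field (an assumption, not confirmed by this README):

```elixir
alias Nasty.Language.English.DependencyExtractor

# Keep only core-argument relations from a parsed sentence.
# NOTE: the :relation field name is an assumption for illustration.
deps = DependencyExtractor.extract(sentence)
core_args = Enum.filter(deps, fn dep -> dep.relation in [:nsubj, :obj, :iobj] end)
```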
Entity Types
- Person, Organization, Place (GPE)
- With confidence scores and multi-word support
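Since entities carry type and confidence scores, results can be narrowed after recognition. A sketch using the `EntityRecognizer` call from the Quick Start; the `:confidence` field name is an assumption based on the documented scores:

```elixir
alias Nasty.Language.English.EntityRecognizer

# Keep only high-confidence person entities.
# NOTE: the :confidence field name is assumed for illustration.
entities = EntityRecognizer.recognize(tagged)

people =
  entities
  |> Enum.filter(&(&1.type == :person))
  |> Enum.filter(&(&1.confidence >= 0.7))
```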
Multi-Language Support
Nasty provides a language-agnostic architecture using Elixir behaviours, enabling support for multiple natural languages:
Supported Languages
- English (Nasty.Language.English) - Fully implemented
- Spanish (Nasty.Language.Spanish) - Fully implemented
  - Spanish-specific tokenization (¿?, ¡!, contractions del/al, accented characters)
  - Spanish morphology (verb conjugations, gender/number agreement)
  - Complete NLP pipeline (tokenization → parsing → summarization)
- Catalan (Nasty.Language.Catalan) - Fully implemented (Phases 1-7)
  - Catalan-specific tokenization (interpunct l·l, apostrophe contractions, 10 diacritics)
  - Catalan morphology (3 verb classes, irregular verbs, gender/number agreement)
  - Full parsing pipeline (phrase/sentence parsing, dependency extraction, NER)
Usage
alias Nasty.Language.Spanish
# Spanish text processing
text = "El gato duerme en el sofá."
{:ok, tokens} = Spanish.tokenize(text)
{:ok, tagged} = Spanish.tag_pos(tokens)
{:ok, document} = Spanish.parse(tagged)
# Works identically to English
summary = Spanish.summarize(document, ratio: 0.3)
{:ok, entities} = Spanish.extract_entities(document)
# Catalan text processing
alias Nasty.Language.Catalan
text_ca = "El gat dorm al sofà."
{:ok, tokens_ca} = Catalan.tokenize(text_ca)
{:ok, tagged_ca} = Catalan.tag_pos(tokens_ca)
{:ok, document_ca} = Catalan.parse(tagged_ca)
# Extract entities (Catalan-specific lexicons)
alias Nasty.Language.Catalan.EntityRecognizer
{:ok, entities_ca} = EntityRecognizer.recognize(tagged_ca)
# Translate between languages (AST-based)
alias Nasty.Translation.Translator
# English to Spanish
{:ok, tokens_en} = English.tokenize("The quick cat runs.")
{:ok, tagged_en} = English.tag_pos(tokens_en)
{:ok, doc_en} = English.parse(tagged_en)
{:ok, doc_es} = Translator.translate_document(doc_en, :es)
{:ok, text_es} = Nasty.render(doc_es)
# => "El gato rápido corre."
# Spanish to English
{:ok, tokens_es} = Spanish.tokenize("La casa grande.")
{:ok, tagged_es} = Spanish.tag_pos(tokens_es)
{:ok, doc_es} = Spanish.parse(tagged_es)
{:ok, doc_en} = Translator.translate_document(doc_es, :en)
{:ok, text_en} = Nasty.render(doc_en)
# => "The big house."

Language Registry
All languages are registered in Nasty.Language.Registry and can be accessed dynamically:
# Auto-detect language
{:ok, lang} = Nasty.Language.Registry.detect_language("¿Cómo estás?")
# => :es
# Get language module
{:ok, Spanish} = Nasty.Language.Registry.get(:es)

See complete language-specific examples:

- examples/spanish_example.exs - Spanish NLP pipeline demonstration
- examples/catalan_example.exs - Catalan tokenization, morphology, and parsing
- examples/roundtrip_translation.exs - Translation quality analysis with roundtrip testing
- examples/multilingual_pipeline.exs - Side-by-side comparison of English/Spanish/Catalan
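Because every language module implements the same behaviour, the registry enables fully dynamic dispatch. A sketch combining only the `Registry` functions and pipeline callbacks documented above (the wrapper module name is illustrative):

```elixir
defmodule MultilingualPipeline do
  alias Nasty.Language.Registry

  # Detect the language, fetch its module, and run the shared pipeline.
  def process(text) do
    with {:ok, lang} <- Registry.detect_language(text),
         {:ok, mod} <- Registry.get(lang),
         {:ok, tokens} <- mod.tokenize(text),
         {:ok, tagged} <- mod.tag_pos(tokens) do
      mod.parse(tagged)
    end
  end
end

MultilingualPipeline.process("¿Cómo estás?")  # dispatches to the Spanish module
```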
Text Summarization
- Extractive summarization - Select important sentences from document
- Multiple scoring features:
- Position weight (early sentences score higher)
- Entity density (sentences with named entities)
- Discourse markers ("in conclusion", "importantly", etc.)
- Keyword frequency (TF scoring)
- Sentence length (prefer moderate length)
- Coreference participation (sentences in coref chains)
- Selection methods:
  - :greedy - Top-N by score (default)
  - :mmr - Maximal Marginal Relevance (reduces redundancy)
- Flexible options: compression ratio or fixed sentence count
Question Answering
- Extractive QA - Extract answer spans from documents
- Question classification:
- WHO (person entities)
- WHAT (things, organizations)
- WHEN (temporal expressions)
- WHERE (locations)
- WHY (reasons, clauses)
- HOW (manner, quantity)
- YES/NO (boolean questions)
- Answer extraction strategies:
- Keyword matching with lemmatization
- Entity type filtering (person, organization, location)
- Temporal expression recognition
- Confidence scoring and ranking
- Multiple answer support with confidence scores
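Because answers come back ranked with confidence scores, batch QA reduces to picking each question's top answer. A sketch using the documented `answer_question/2` and the `confidence` field shown in the Quick Start (it assumes each question yields at least one answer):

```elixir
alias Nasty.Language.English

# Ask several questions and keep the highest-confidence answer for each.
questions = ["Who works at Google?", "Where does John Smith work?"]

best_answers =
  for question <- questions do
    {:ok, answers} = English.answer_question(document, question)
    {question, Enum.max_by(answers, & &1.confidence)}
  end
```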
Text Classification
- Multinomial Naive Bayes - Probabilistic classifier with Laplace smoothing
- Multiple feature types:
  - :bow - Bag of words (lemmatized, stop word filtering)
  - :ngrams - Word sequences (bigrams, trigrams, etc.)
  - :pos_patterns - POS tag sequences
  - :syntactic - Sentence structure statistics
  - :entities - Named entity distributions
  - :lexical - Vocabulary richness and text statistics
- Training and prediction:
  - Train on labeled documents: {document, class} tuples
  - Multi-class classification support
  - Confidence scores and probability distributions
- Model evaluation:
- Accuracy, precision, recall, F1 metrics
- Per-class performance breakdowns
- Use cases:
- Sentiment analysis (positive/negative reviews)
- Spam detection (spam/ham classification)
- Topic classification (sports, tech, politics, etc.)
- Formality detection (formal/informal text)
Information Extraction
Relation Extraction - Extract semantic relationships between entities
- Supported relations:
  - Employment: works_at, employed_by, member_of
  - Organization: founded, acquired_by, subsidiary_of
  - Location: located_in, based_in, headquarters_in
  - Temporal: occurred_on, founded_in
- Pattern-based extraction using verb patterns and prepositions
- Confidence scoring (0.5-0.8 based on pattern strength)
- Integrates with NER and dependency parsing
Event Extraction - Identify events with triggers and participants
- Event types:
  - Business: business_acquisition, business_merger, company_founding, product_launch
  - Employment: employment_start, employment_end
  - Communication: announcement, meeting
  - Other: movement, transaction
- Verb and nominalization triggers
- Participant extraction using semantic role labeling
- Temporal expression linking
- Confidence scoring (0.7-0.8)
Template-Based Extraction - Structured information using custom templates
- Define extraction templates with typed slots
- Pre-defined templates: employment, acquisition, location
- Flexible pattern matching
- Required/optional slot support
- Confidence based on slot fill rate
API Functions:
# Extract relations
{:ok, relations} = English.extract_relations(document, min_confidence: 0.6)

# Extract events
{:ok, events} = English.extract_events(document, max_events: 10)

# Template extraction
templates = [TemplateExtractor.employment_template()]
{:ok, results} = English.extract_templates(document, templates)
Code Interoperability
Convert between natural language and Elixir code bidirectionally:
NL → Code Generation - Convert natural language commands to executable Elixir code
- List operations: "Sort the numbers" → Enum.sort(numbers)
- Filtering: "Filter users where age > 18" → Enum.filter(users, fn item -> item > 18 end)
- Mapping: "Map the list" → Enum.map(list, fn item -> item end)
- Arithmetic: "X plus Y" → x + y
- Assignments: "X is 5" → x = 5
- Conditionals: "If X then Y" → if x, do: y
Code → NL Explanation - Generate natural language explanations from code
- Enum.sort(numbers) → "sort numbers"
- x = a + b → "X is a plus b"
- if x > 5, do: :ok → "If x is greater than 5, then :ok"
- Pipeline support: list |> Enum.map(&(&1 * 2)) |> Enum.sum() → "map list to each element times 2, then sum list"
API Functions:
# Natural language → Code
{:ok, code} = English.to_code("Sort the numbers")
# => "Enum.sort(numbers)"

# Code → Natural language
{:ok, explanation} = English.explain_code("Enum.filter(users, fn u -> u.age > 18 end)")
# => "filter users where u u age is greater than 18"

# Get intent without generating code
{:ok, intent} = English.recognize_intent("Filter the users")
# => %Intent{type: :action, action: "filter", target: "users", confidence: 0.95}

# Optional: Enhance with Ragex for context-aware suggestions
{:ok, code} = English.to_code("Sort the list", enhance_with_ragex: true)

Example Scripts:

- examples/code_generation.exs - Natural language to code demos
- examples/code_explanation.exs - Code to natural language demos
AST Rendering & Utilities
Convert AST back to text, traverse and query trees, validate structures, and export visualizations:
Text Rendering - Regenerate natural language from AST
alias Nasty.Rendering.Text

# Render AST to text
{:ok, text} = Text.render(document)
# => "The cat sat on the mat."

# Custom rendering options
{:ok, text} = Text.render(document,
  capitalize_sentences: false,
  add_punctuation: false,
  paragraph_separator: "\n\n"
)

AST Traversal - Walk the tree with visitor pattern
alias Nasty.Utils.Traversal

# Count all tokens
token_count = Traversal.reduce(document, 0, fn
  %Token{}, acc -> acc + 1
  _, acc -> acc
end)

# Collect all nouns
nouns = Traversal.collect(document, fn
  %Token{pos_tag: :noun} -> true
  _ -> false
end)

# Transform tree (lowercase all text)
lowercased = Traversal.map(document, fn
  %Token{} = token -> %{token | text: String.downcase(token.text)}
  node -> node
end)

AST Queries - High-level query API
alias Nasty.Utils.Query

# Find all noun phrases
noun_phrases = Query.find_all(document, :noun_phrase)

# Find tokens by POS tag
verbs = Query.find_by_pos(document, :verb)

# Extract entities
people = Query.extract_entities(document, type: :PERSON)

# Find sentence subject
subject = Query.find_subject(sentence)

# Count nodes
token_count = Query.count(document, :token)

Pretty Printing - Debug AST structures
alias Nasty.Rendering.PrettyPrint

# Indented output
IO.puts(PrettyPrint.print(document, color: true))

# Tree-style output with box characters
IO.puts(PrettyPrint.tree(document))

# Statistics
IO.puts(PrettyPrint.stats(document))
# => AST Statistics:
#      Paragraphs: 3
#      Sentences: 12
#      Tokens: 127

Visualization - Export for graphical rendering
alias Nasty.Rendering.Visualization

# Export to DOT format (Graphviz)
dot = Visualization.to_dot(document, type: :parse_tree)
File.write("tree.dot", dot)
# Then: dot -Tpng tree.dot -o tree.png

# Dependency graph
deps_dot = Visualization.to_dot(sentence, type: :dependencies)

# Entity graph
entity_dot = Visualization.to_dot(document, type: :entities)

# JSON export for d3.js
json = Visualization.to_json(document)

Validation - Ensure AST integrity
alias Nasty.Utils.Validator

# Validate structure
{:ok, document} = Validator.validate(document)

# Check spans
:ok = Validator.validate_spans(document)

# Check language consistency
:ok = Validator.validate_language(document)

Transformations - Modify AST structures
alias Nasty.Utils.Transform

# Normalize case
lowercased = Transform.normalize_case(document, :lower)

# Remove punctuation
no_punct = Transform.remove_punctuation(document)

# Remove stop words
no_stops = Transform.remove_stop_words(document)

# Lemmatize all tokens
lemmatized = Transform.lemmatize(document)

# Apply pipeline of transformations
processed = Transform.pipeline(document, [
  &Transform.normalize_case(&1, :lower),
  &Transform.remove_punctuation/1,
  &Transform.remove_stop_words/1
])
Testing
# Run all tests
mix test
# Run specific module tests
mix test test/language/english/tokenizer_test.exs
mix test test/language/english/phrase_parser_test.exs
mix test test/language/english/dependency_extractor_test.exs
Documentation
Comprehensive documentation is available in the docs/ directory:
Getting Started
- STRENGTHS_AND_LIMITATIONS.md - A comprehensive analysis of what Nasty is and isn't good for
- GETTING_STARTED.md - Beginner-friendly tutorial with step-by-step examples
- EXAMPLES.md - Complete catalog of all 18 example scripts with usage guides
Core Documentation
- PLAN.md - Original vision and architectural design
- TODO.md - Unimplemented features and future enhancements
- PARSING_GUIDE.md - Complete parsing algorithms reference (tokenization, POS tagging, morphology, phrase/sentence parsing, dependencies)
- ARCHITECTURE.md - System architecture and design patterns
- USER_GUIDE.md - User guide with examples and API reference
- API.md - Complete API reference for all modules
- AST_REFERENCE.md - Complete AST node reference
- PERFORMANCE.md - Benchmarks, optimization tips, and performance considerations
Language-Specific Documentation
- ENGLISH_GRAMMAR.md - Formal English grammar specification with CFG rules
- SPANISH.md - Spanish language support details
- CATALAN.md - Catalan language support details
- TRANSLATION.md - AST-based translation system guide
- GRAMMAR_CUSTOMIZATION.md - Guide for custom grammar rules and domain variants
Statistical & Neural Models
Nasty includes comprehensive statistical and neural network models for state-of-the-art NLP:
Statistical Models
Sequence Labeling
- HMM POS Tagger: Hidden Markov Model with Viterbi decoding (~95% accuracy)
- CRF (Conditional Random Fields): Feature-based sequence labeling
- Named Entity Recognition
- POS tagging
- Chunking and segmentation
- Forward-backward algorithm for training
- Viterbi decoding for prediction
- Multiple optimization methods (SGD, Momentum, AdaGrad)
Parsing
- PCFG (Probabilistic Context-Free Grammar): Statistical phrase structure parsing
- CYK algorithm for efficient parsing
- Grammar learning from treebanks
- Chomsky Normal Form (CNF) conversion
- Smoothing and probability estimation
- Beam search for pruning
Classification
- Naive Bayes Classifier: Fast text classification
- Multiple feature types (BOW, n-grams, POS patterns)
- Laplace smoothing
- Multi-class support
Neural Models
- BiLSTM-CRF: Bidirectional LSTM with CRF for sequence tagging (97-98% accuracy)
- Axon/EXLA: Pure Elixir neural networks with GPU acceleration
- Pre-trained embeddings: Support for GloVe, FastText
- Training infrastructure: Train custom models on your own data
- Evaluation metrics: Accuracy, precision, recall, F1, confusion matrices
Transformer Models (Bumblebee Integration)
- Pre-trained Models: BERT, RoBERTa, DistilBERT, XLM-RoBERTa via Hugging Face
- Fine-tuning: Full fine-tuning pipeline for POS tagging and NER (98-99% accuracy)
- Zero-shot Classification: Classify without training using NLI models (70-85% accuracy)
- Model Quantization: INT8 quantization for 4x compression and 2-3x speedup
- Multilingual Support: XLM-RoBERTa for cross-lingual transfer
- Mix Tasks: CLI tools for model management, fine-tuning, and inference
See Statistical Models for complete reference, Neural Models for neural architecture details, Training Neural for training guide, Pretrained Models for transformer usage, Zero Shot for zero-shot classification, and Quantization for model optimization.
Quick Start: Model Management
# List available models
mix nasty.models list
# Train HMM POS tagger (fast, 95% accuracy)
mix nasty.train.pos \
--corpus data/UD_English-EWT/en_ewt-ud-train.conllu \
--test data/UD_English-EWT/en_ewt-ud-test.conllu \
--output priv/models/en/pos_hmm_v1.model
# Train neural POS tagger (slower, 97-98% accuracy)
mix nasty.train.neural_pos \
--corpus data/UD_English-EWT/en_ewt-ud-train.conllu \
--output priv/models/en/pos_neural_v1.axon \
--epochs 10 \
--batch-size 32
# Train CRF for NER
mix nasty.train.crf \
--corpus data/train.conllu \
--test data/test.conllu \
--output priv/models/en/ner_crf.model \
--task ner \
--iterations 100
# Train PCFG parser
mix nasty.train.pcfg \
--corpus data/en_ewt-ud-train.conllu \
--test data/en_ewt-ud-test.conllu \
--output priv/models/en/pcfg.model \
--smoothing 0.001
# Evaluate models
mix nasty.eval.pos \
--model priv/models/en/pos_hmm_v1.model \
--test data/UD_English-EWT/en_ewt-ud-test.conllu \
--baseline
mix nasty.eval \
--model priv/models/en/ner_crf.model \
--test data/test.conllu \
--type crf \
--task ner
mix nasty.eval \
--model priv/models/en/pcfg.model \
--test data/test.conllu \
--type pcfg
Future Enhancements
- [x] Statistical models for improved accuracy (HMM POS tagger - done!)
- [x] Neural models (BiLSTM-CRF POS tagger with 97-98% accuracy - done!)
- [x] PCFG parser for phrase structure (done!)
- [x] CRF for named entity recognition (done!)
- [x] Semantic role labeling (rule-based SRL - done!)
- [x] Coreference resolution (heuristic-based - done!)
- [x] Question answering (extractive QA - done!)
- [x] Information extraction (relations, events, templates - done!)
- [x] Code ↔ NL bidirectional conversion (done!)
- [x] Pre-trained transformers (BERT, RoBERTa via Bumblebee - done!)
- [x] Fine-tuning infrastructure for POS tagging and NER (done!)
- [x] Zero-shot classification using NLI models (done!)
- [x] Model quantization (INT8 with 4x compression) (done!)
- [x] Integration of PCFG/CRF with main pipeline (done!)
- [x] Multi-language support - Spanish and Catalan complete
- [ ] Advanced coreference (neural models)
License
MIT License — see LICENSE file for details.
Built with ❤️ using Elixir and NimbleParsec