Nasty Architecture
This document describes the architecture of Nasty, a language-agnostic NLP library for Elixir that treats natural language with the same rigor as programming languages.
Design Philosophy
Nasty is built on three core principles:
- Grammar-First: Treat natural language as a formal grammar with an Abstract Syntax Tree (AST), similar to how compilers handle programming languages
- Language-Agnostic: Use behaviours to define a common interface, allowing multiple natural languages to coexist
- Pure Elixir: No external NLP dependencies; built entirely in Elixir using NimbleParsec and functional programming patterns
System Architecture
High-Level Overview
flowchart TD
API["Public API (Nasty)<br/>parse/2, render/2, summarize/2, to_code/2, explain_code/2"]
Registry["Language Registry<br/>Manages language implementations & auto-detection"]
English["Nasty.Language.English<br/>(Full implementation)"]
Other["Nasty.Language.Spanish/Catalan<br/>(Future)"]
Pipeline["NLP Pipeline<br/>Tokenization → POS Tagging → Parsing → Semantic Analysis"]
AST["AST Structures<br/>Document → Paragraph → Sentence → Clause → Phrases → Token"]
Translation["Translation System"]
Operations["AST Operations<br/>Query, Validation, Transform, Traversal"]
API --> Registry
Registry --> English
Registry --> Other
English --> Pipeline
Pipeline --> AST
AST --> Translation
AST --> Operations
Core Components
1. Language Behaviour System
The Nasty.Language.Behaviour defines the interface that all language implementations must follow:
Required Callbacks
@callback language_code() :: atom()
@callback tokenize(String.t(), options()) :: {:ok, [Token.t()]} | {:error, term()}
@callback tag_pos([Token.t()], options()) :: {:ok, [Token.t()]} | {:error, term()}
@callback parse([Token.t()], options()) :: {:ok, Document.t()} | {:error, term()}
@callback render(struct(), options()) :: {:ok, String.t()} | {:error, term()}
Optional Callbacks
@callback metadata() :: map()
Benefits
- Pluggability: New languages can be added without changing core code
- Type Safety: Dialyzer ensures implementations follow the contract
- Consistency: All languages provide the same interface
- Testing: Easy to mock and test language-specific behavior
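Concretely, a new language starts as a module that adopts the behaviour. The skeleton below is a hypothetical illustration (the module name, language code, and stub bodies are invented; a real implementation returns Token.t() structs and a full Document.t()):

```elixir
defmodule MyApp.Language.Lojban do
  @moduledoc "Hypothetical skeleton showing the Nasty.Language.Behaviour contract."
  @behaviour Nasty.Language.Behaviour

  @impl true
  def language_code, do: :jbo

  @impl true
  def tokenize(text, _opts) do
    # A real implementation returns {:ok, [Token.t()]} with positions;
    # this stub just splits on whitespace to show the return shape.
    {:ok, String.split(text)}
  end

  @impl true
  def tag_pos(tokens, _opts), do: {:ok, tokens}

  @impl true
  def parse(_tokens, _opts), do: {:error, :not_implemented}

  @impl true
  def render(_ast, _opts), do: {:error, :not_implemented}
end
```

Because the contract is a behaviour, Dialyzer and compiler warnings catch missing or mis-typed callbacks at build time rather than at runtime.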
2. Language Registry
The Nasty.Language.Registry is an Agent-based registry that:
- Registers language implementations at runtime
- Validates implementations comply with the Behaviour
- Provides lookup by language code (:en, :es, :ca)
- Detects language from text using heuristics
# Registration (happens at application startup)
Registry.register(Nasty.Language.English)
# Lookup
{:ok, module} = Registry.get(:en)
# Detection
{:ok, :en} = Registry.detect_language("Hello world")
3. NLP Pipeline
Each language implementation follows a multi-stage pipeline:
Stage 1: Tokenization
Purpose: Split raw text into atomic units (tokens)
Responsibilities:
- Sentence boundary detection
- Word segmentation
- Contraction handling ("don't" → "do" + "n't")
- Position tracking (line, column, byte offsets)
Implementation: NimbleParsec combinators for efficient parsing
Output: [Token.t()] with text and position information
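To illustrate the NimbleParsec approach, here is a toy tokenizer that separates words from punctuation. This is a sketch, not Nasty's tokenizer: the real one also tracks line/column/byte positions and splits contractions.

```elixir
defmodule ToyTokenizer do
  import NimbleParsec

  # Words are runs of letters (apostrophes kept so "don't" stays one token here).
  word = ascii_string([?a..?z, ?A..?Z, ?'], min: 1) |> unwrap_and_tag(:word)
  # Punctuation is matched one character at a time.
  punct = ascii_string([?., ?,, ?!, ??, ?;], 1) |> unwrap_and_tag(:punct)
  # Whitespace separates tokens but is not emitted.
  whitespace = ignore(ascii_string([?\s, ?\t, ?\n], min: 1))

  defparsec :tokenize, repeat(choice([word, punct, whitespace]))
end

# ToyTokenizer.tokenize("Don't stop!") matches
# {:ok, [word: "Don't", word: "stop", punct: "!"], "", _context, _line, _offset}
```

Because `defparsec` compiles the combinators into function clauses at build time, tokenization runs as plain pattern matching on binaries with no interpretation overhead.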
Stage 2: POS Tagging
Purpose: Assign part-of-speech tags and morphological features
Responsibilities:
- Tag assignment using Universal Dependencies tagset
- Morphological analysis (tense, number, person, case, etc.)
- Lemmatization (reduce to dictionary form)
Methods:
- Rule-based tagging
- Statistical models (HMM)
- Hybrid approaches
Output: [Token.t()] with pos_tag, lemma, and morphology filled
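The rule-based method can be sketched with a few closed-class lookups and suffix heuristics. This is illustrative only; Nasty's tagger combines such rules with trained models and a lexicon:

```elixir
defmodule RuleTagger do
  # A tiny closed-class lexicon (Universal Dependencies tag names as atoms).
  @closed_class %{"the" => :det, "a" => :det, "on" => :adp, "and" => :cconj}

  def tag(word) do
    down = String.downcase(word)

    cond do
      Map.has_key?(@closed_class, down) -> @closed_class[down]
      String.ends_with?(down, "ly") -> :adv
      String.ends_with?(down, "ing") -> :verb
      String.match?(down, ~r/^\d+$/) -> :num
      # Open-class fallback: nouns are the most common default.
      true -> :noun
    end
  end
end

# RuleTagger.tag("quickly") #=> :adv
```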
Stage 3: Parsing
Purpose: Build hierarchical syntactic structure
Responsibilities:
- Phrase structure parsing (NP, VP, PP, AP, AdvP)
- Clause identification (independent, subordinate, relative)
- Sentence structure determination (simple, compound, complex)
- Document and paragraph organization
Approaches:
- Recursive descent parsing
- Chart parsing (future)
- Statistical parsing (future)
Output: Document.t() with complete AST hierarchy
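Recursive descent maps naturally onto Elixir pattern matching: each grammar rule becomes a function head. The toy parser below recognizes the rule NP → (DET) ADJ* NOUN over pre-tagged tokens; the rule and the map-based node are invented for illustration and are much simpler than Nasty's phrase structures:

```elixir
defmodule NPParser do
  # NP -> (DET) ADJ* NOUN
  def parse_np([{:det, det} | rest]) do
    case parse_np(rest) do
      {:ok, np, remaining} -> {:ok, Map.put(np, :determiner, det), remaining}
      error -> error
    end
  end

  def parse_np([{:adj, adj} | rest]) do
    case parse_np(rest) do
      {:ok, np, remaining} -> {:ok, Map.update!(np, :modifiers, &[adj | &1]), remaining}
      error -> error
    end
  end

  def parse_np([{:noun, noun} | rest]) do
    {:ok, %{head: noun, determiner: nil, modifiers: []}, rest}
  end

  def parse_np(_), do: {:error, :no_noun_phrase}
end

# NPParser.parse_np(det: "the", adj: "big", noun: "house")
# => {:ok, %{head: "house", determiner: "the", modifiers: ["big"]}, []}
```

Returning the unconsumed remainder alongside the node is what lets rule functions compose into larger clause and sentence parsers.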
Stage 4: Semantic Analysis (Optional)
Purpose: Extract meaning and relationships
Components:
- Named Entity Recognition (NER): Identify persons, organizations, locations, dates
- Dependency Extraction: Extract grammatical relationships between words
- Semantic Role Labeling (SRL): Identify who did what to whom
- Coreference Resolution: Link pronouns to referents
- Relation Extraction: Extract entity relationships
- Event Extraction: Identify events and participants
Output: Enriched Document.t() with semantic annotations
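As a flavor of what the NER component does, a deliberately naive baseline groups runs of capitalized tokens into candidate entities (Nasty's NER is of course more sophisticated; this sketch ignores sentence-initial capitalization, entity types, and multi-language rules):

```elixir
defmodule NaiveNER do
  # Groups consecutive capitalized tokens into candidate named entities.
  def candidates(tokens) do
    tokens
    |> Enum.chunk_by(&String.match?(&1, ~r/^[A-Z]/))
    |> Enum.filter(fn [t | _] -> String.match?(t, ~r/^[A-Z]/) end)
    |> Enum.map(&Enum.join(&1, " "))
  end
end

# NaiveNER.candidates(~w(Alice met Bob Smith in Paris))
# => ["Alice", "Bob Smith", "Paris"]
```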
Stage 5: Rendering
Purpose: Convert AST back to natural language text
Responsibilities:
- Surface realization (choose correct word forms)
- Agreement enforcement (subject-verb, etc.)
- Word order application (language-specific)
- Punctuation insertion
- Capitalization and formatting
Output: Rendered text string
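Agreement enforcement during surface realization can be as small as a pattern match on morphological features. The function below is a toy English rule (third-person singular present adds -s), not Nasty's renderer, and skips irregular verbs entirely:

```elixir
defmodule Agree do
  # Toy subject-verb agreement: inflect a base verb form from subject features.
  def inflect(verb, %{person: 3, number: :singular}), do: verb <> "s"
  def inflect(verb, _features), do: verb
end

# Agree.inflect("run", %{person: 3, number: :singular}) #=> "runs"
# Agree.inflect("run", %{person: 1, number: :plural})   #=> "run"
```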
4. AST Structure
The AST is a hierarchical, linguistically precise representation:
graph TD
Doc["Document (root)"]
P1[Paragraph]
P2[Paragraph]
S1[Sentence]
S2[Sentence]
C1["Clause (main)"]
C2["Clause (subordinate)"]
Subj["Subject (NounPhrase)"]
Pred["Predicate (VerbPhrase)"]
V["Verb (Token)"]
Comp["Complement (NounPhrase)"]
Adv["Adverbial (PrepositionalPhrase)"]
Doc --> P1
Doc --> P2
P1 --> S1
P1 --> S2
S1 --> C1
S1 --> C2
C1 --> Subj
C1 --> Pred
Pred --> V
Pred --> Comp
Pred --> Adv
Node Types
Document Nodes:
- Document - Root container
- Paragraph - Topic-related sentences
Sentence Nodes:
- Sentence - Complete grammatical unit
- Clause - Subject + predicate
Phrase Nodes:
- NounPhrase - Noun-headed (the cat, big house)
- VerbPhrase - Verb-headed (is running, gave a book)
- PrepositionalPhrase - Preposition-headed (on the mat)
- AdjectivalPhrase - Adjective-headed (very happy)
- AdverbialPhrase - Adverb-headed (quite quickly)
Atomic Nodes:
- Token - Single word/punctuation with POS tag
Semantic Nodes:
- Entity - Named entity
- Relation - Entity relationship
- Event - Event with participants
- CorefChain - Coreference links
- Frame - Semantic role frame
Universal Properties
All nodes include:
%{
language: atom(), # :en, :es, :ca
span: %{ # Position tracking
start_pos: {line, column},
start_byte: integer(),
end_pos: {line, column},
end_byte: integer()
}
}
5. AST Utilities
Query Module
Search and extract information from AST:
Nasty.AST.Query.find_subject(sentence)
Nasty.AST.Query.extract_tokens(document)
Nasty.AST.Query.find_entities(document)
Validation Module
Ensure AST structural integrity:
case Nasty.AST.Validation.validate(document) do
:ok -> :ok
{:error, errors} -> handle_errors(errors)
end
Transform Module
Modify AST nodes:
transformed = Nasty.AST.Transform.map(document, fn node ->
# Transform logic
node
end)
Traversal Module
Navigate AST with different strategies:
Nasty.AST.Traversal.pre_order(document, visitor_fn)
Nasty.AST.Traversal.post_order(document, visitor_fn)
Nasty.AST.Traversal.breadth_first(document, visitor_fn)
6. Statistical & Neural Models
Model Infrastructure
Registry: Agent-based model storage
- ModelRegistry.register/2 - Store model
- ModelRegistry.get/1 - Retrieve model
- ModelRegistry.list_models/0 - List all
Loader: Serialize/deserialize models
- ModelLoader.load/1 - Load from file
- ModelLoader.save/2 - Save to file
- ModelLoader.load_from_priv/1 - Load from app resources
Model Types
HMM (Hidden Markov Model):
- POS tagging with ~95% accuracy
- Viterbi algorithm for decoding
- Fast inference, low memory
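Viterbi decoding finds the most probable tag sequence by keeping, for each state, only the best path reaching it. The sketch below is a generic textbook Viterbi over hand-set toy probabilities, not Nasty's trained HMM:

```elixir
defmodule TinyViterbi do
  # Returns {probability, best_state_path} for an observation sequence.
  def decode(observations, states, start_p, trans_p, emit_p) do
    initial =
      for s <- states, into: %{} do
        {s, {start_p[s] * emit_p[s][hd(observations)], [s]}}
      end

    observations
    |> tl()
    |> Enum.reduce(initial, fn obs, prev ->
      for s <- states, into: %{} do
        # Best predecessor for state s at this step.
        {best_prob, best_path} =
          states
          |> Enum.map(fn ps ->
            {p, path} = prev[ps]
            {p * trans_p[ps][s] * emit_p[s][obs], path}
          end)
          |> Enum.max_by(&elem(&1, 0))

        {s, {best_prob, best_path ++ [s]}}
      end
    end)
    |> Map.values()
    |> Enum.max_by(&elem(&1, 0))
  end
end

states = [:noun, :verb]
start = %{noun: 0.6, verb: 0.4}
trans = %{noun: %{noun: 0.3, verb: 0.7}, verb: %{noun: 0.8, verb: 0.2}}
emit = %{
  noun: %{"dogs" => 0.4, "bark" => 0.1},
  verb: %{"dogs" => 0.05, "bark" => 0.5}
}

TinyViterbi.decode(["dogs", "bark"], states, start, trans, emit)
# best path: [:noun, :verb]
```

A production decoder would work in log space to avoid underflow on long sentences; the multiplication form above is kept for readability.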
BiLSTM-CRF (Neural):
- POS tagging with 97-98% accuracy
- Bidirectional LSTM with CRF layer
- Built with Axon/EXLA for GPU acceleration
- Character-level CNN for OOV handling
- Pre-trained embedding support
Naive Bayes:
- Text classification
- Multinomial variant for document classification
Future Models:
- PCFG (Probabilistic Context-Free Grammar) for parsing
- CRF (Conditional Random Fields) for NER
- Pre-trained transformers (BERT, RoBERTa via Bumblebee)
7. Code Interoperability
Bidirectional conversion between natural language and code:
NL → Code Pipeline
Natural Language
↓
Intent Recognition (parse to Intent AST)
↓
Code Generation (Intent → Elixir AST)
↓
Validation
↓
Elixir Code String
Example:
Nasty.to_code("Filter users where age is greater than 18",
source_language: :en,
target_language: :elixir)
# => "Enum.filter(users, fn item -> item > 18 end)"Code → NL Pipeline
Elixir Code String
↓
Parse to Elixir AST
↓
Traverse & Explain (AST → Natural Language)
↓
Natural Language Description
Example:
Nasty.explain_code("Enum.sort(list)",
source_language: :elixir,
target_language: :en)
# => "Sort list"8. Translation System
AST-based translation between natural languages:
Translation Pipeline
Source AST (Language A)
↓
AST Transformation (structural changes)
↓
Token Translation (lemma-to-lemma mapping)
↓
Morphological Agreement (gender/number/person)
↓
Word Order Application (language-specific rules)
↓
Target AST (Language B)
↓
Rendering
↓
Target Text
Components:
ASTTransformer - Transforms AST nodes between languages:
alias Nasty.Translation.ASTTransformer
{:ok, spanish_doc} = ASTTransformer.transform_document(english_doc, :es)
TokenTranslator - Lemma-to-lemma translation with POS awareness:
alias Nasty.Translation.TokenTranslator
# cat (noun) → gato (noun)
translated = TokenTranslator.translate_token(token, :en, :es)
Agreement - Enforces morphological agreement:
alias Nasty.Translation.Agreement
# Ensure "el gato" (masc) not "la gato"
adjusted = Agreement.apply_agreement(tokens, :es)
WordOrder - Applies language-specific word order:
alias Nasty.Translation.WordOrder
# "the big house" → "la casa grande" (adjective after noun in Spanish)
ordered = WordOrder.apply_order(phrase, :es)
LexiconLoader - Manages bidirectional lexicons with ETS caching:
alias Nasty.Translation.LexiconLoader
# Load English-Spanish lexicon
{:ok, lexicon} = LexiconLoader.load(:en, :es)
# Bidirectional lookup
"gato" = LexiconLoader.lookup(lexicon, "cat", :noun)
"cat" = LexiconLoader.lookup(lexicon, "gato", :noun)Features:
- AST-aware translation preserving grammatical structure
- Morphological feature agreement
- Language-specific word order rules (SVO, pro-drop, adjective position)
- Idiomatic expression support
- Fallback to original text for untranslatable content
- Bidirectional translation (English ↔ Spanish, English ↔ Catalan)
9. Rendering & Visualization
Text Rendering
Convert AST to formatted text:
Nasty.Rendering.Text.render(document)
Pretty Printing
Human-readable AST inspection:
Nasty.Rendering.PrettyPrint.inspect(ast)
DOT Visualization
Generate Graphviz diagrams:
{:ok, dot} = Nasty.Rendering.Visualization.to_dot(ast)
File.write("ast.dot", dot)JSON Export
Export to JSON for external tools:
{:ok, json} = Nasty.Rendering.Visualization.to_json(ast)
10. Data Layer
CoNLL-U Support
Parse and generate Universal Dependencies format:
{:ok, sentences} = Nasty.Data.CoNLLU.parse_file("corpus.conllu")
conllu_string = Nasty.Data.CoNLLU.format(sentence)
Corpus Management
Manage training corpora:
{:ok, corpus} = Nasty.Data.Corpus.load("path/to/corpus")
stats = Nasty.Data.Corpus.statistics(corpus)
Application Supervision
defmodule Nasty.Application do
use Application
def start(_type, _args) do
children = [
# Language Registry Agent
Nasty.Language.Registry,
# Model Registry Agent
Nasty.Statistics.ModelRegistry
]
opts = [strategy: :one_for_one, name: Nasty.Supervisor]
result = Supervisor.start_link(children, opts)
# Register languages at startup
Nasty.Language.Registry.register(Nasty.Language.English)
result
end
end
Extension Points
Adding a New Language
- Implement Nasty.Language.Behaviour
- Create language module in lib/language/your_language/
- Implement required callbacks
- Register in application.ex
- Add tests
See Language Guide for details.
Adding New NLP Features
- Create module in appropriate layer (lib/language/, lib/semantic/, etc.)
- Define behaviour if language-agnostic
- Implement for each language
- Add to pipeline if needed
- Update AST if new node types needed
Adding Statistical Models
- Implement model training in lib/statistics/
- Create Mix task for training
- Add model to registry
- Integrate into pipeline
Performance Considerations
Efficiency
- NimbleParsec: Compiled parser combinators for fast tokenization
- Agent-based registries: Fast in-memory lookup
- Streaming: Process documents incrementally where possible
- Lazy evaluation: Use streams for large corpora
Scalability
- Stateless processing: All functions are pure
- Concurrent processing: Parse multiple documents in parallel
- Distributed: Can run across multiple nodes (future)
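Because processing is stateless, batch parsing parallelizes with no shared state; a sketch using Task.async_stream (the documents variable and the language: option are illustrative assumptions, not Nasty's documented API):

```elixir
# Parse a batch of texts concurrently; each Nasty.parse/2 call is pure,
# so no coordination beyond the task pool is needed.
results =
  documents
  |> Task.async_stream(&Nasty.parse(&1, language: :en),
    max_concurrency: System.schedulers_online(),
    timeout: :timer.seconds(30)
  )
  |> Enum.map(fn {:ok, result} -> result end)
```

Task.async_stream applies backpressure automatically: it never spawns more than max_concurrency tasks at once, so memory stays bounded even for large batches.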
Testing Strategy
Unit Tests
- Test each module in isolation
- Use async: true for parallel execution
- Mock language implementations when testing core
Integration Tests
- Test full pipeline from text to AST
- Test rendering round-trips
- Test code interoperability
Property-Based Testing
- Generate random ASTs and validate
- Test parsing/rendering round-trips
- Verify AST invariants
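A round-trip property might look like the sketch below, which assumes StreamData as a test dependency; the parse/render option names and the assertion are illustrative, not Nasty's documented test suite:

```elixir
defmodule RoundTripPropertyTest do
  use ExUnit.Case, async: true
  use ExUnitProperties

  property "rendering a parsed sentence preserves its first word" do
    check all words <- list_of(member_of(~w(the cat sat on a mat)), min_length: 2) do
      text = Enum.join(words, " ")

      {:ok, doc} = Nasty.parse(text, language: :en)
      {:ok, rendered} = Nasty.render(doc, [])

      # A weak but robust invariant: the head word survives the round-trip.
      assert String.downcase(rendered) =~ hd(words)
    end
  end
end
```

Generating sentences from a small closed vocabulary keeps the property focused on pipeline invariants rather than on lexicon coverage.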
Future Directions
Architecture Evolution
- Generic Layers: Extract lib/parsing/, lib/semantic/, lib/operations/
- Plugin System: Dynamic language loading
- Streaming Pipeline: Process infinite text streams
- Distributed Processing: Multi-node coordination
Advanced Features
- Neural Models: Transformer-based parsing and tagging
- Multi-lingual: True cross-language support
- Incremental Parsing: Update AST on edits
- Error Recovery: Graceful handling of malformed input