Nasty Architecture

This document describes the architecture of Nasty, a language-agnostic NLP library for Elixir that treats natural language with the same rigor as programming languages.

Design Philosophy

Nasty is built on three core principles:

  1. Grammar-First: Treat natural language as a formal grammar with an Abstract Syntax Tree (AST), similar to how compilers handle programming languages
  2. Language-Agnostic: Use behaviours to define a common interface, allowing multiple natural languages to coexist
  3. Pure Elixir: No external NLP dependencies; built entirely in Elixir using NimbleParsec and functional programming patterns

System Architecture

High-Level Overview

flowchart TD
    API["Public API (Nasty)<br/>parse/2, render/2, summarize/2, to_code/2, explain_code/2"]
    Registry["Language Registry<br/>Manages language implementations & auto-detection"]
    English["Nasty.Language.English<br/>(Full implementation)"]
    Other["Nasty.Language.Spanish/Catalan<br/>(Future)"]
    Pipeline["NLP Pipeline<br/>Tokenization → POS Tagging → Parsing → Semantic Analysis"]
    AST["AST Structures<br/>Document → Paragraph → Sentence → Clause → Phrases → Token"]
    Translation["Translation System"]
    Operations["AST Operations<br/>Query, Validation, Transform, Traversal"]
    
    API --> Registry
    Registry --> English
    Registry --> Other
    English --> Pipeline
    Pipeline --> AST
    AST --> Translation
    AST --> Operations

Core Components

1. Language Behaviour System

The Nasty.Language.Behaviour defines the interface that all language implementations must follow:

Required Callbacks

@callback language_code() :: atom()
@callback tokenize(String.t(), options()) :: {:ok, [Token.t()]} | {:error, term()}
@callback tag_pos([Token.t()], options()) :: {:ok, [Token.t()]} | {:error, term()}
@callback parse([Token.t()], options()) :: {:ok, Document.t()} | {:error, term()}
@callback render(struct(), options()) :: {:ok, String.t()} | {:error, term()}

Optional Callbacks

@callback metadata() :: map()
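A minimal implementation of the contract might look like the following sketch. The module name is hypothetical, and plain maps stand in for `Token.t()` structs; a real implementation would build proper AST structs and use NimbleParsec for tokenization.

```elixir
# Hypothetical skeleton satisfying Nasty.Language.Behaviour.
defmodule Nasty.Language.Minimal do
  @behaviour Nasty.Language.Behaviour

  @impl true
  def language_code, do: :xx

  @impl true
  def tokenize(text, _opts) do
    # Naive whitespace tokenization; plain maps stand in for Token.t().
    tokens = text |> String.split() |> Enum.map(&%{text: &1})
    {:ok, tokens}
  end

  @impl true
  def tag_pos(tokens, _opts), do: {:ok, tokens}

  @impl true
  def parse(_tokens, _opts), do: {:error, :not_implemented}

  @impl true
  def render(_ast, _opts), do: {:error, :not_implemented}
end
```

Because every callback returns a tagged tuple, the pipeline can thread results through `with` clauses and surface the first failing stage.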

Benefits

  • Pluggability: New languages can be added without changing core code
  • Type Safety: Dialyzer ensures implementations follow the contract
  • Consistency: All languages provide the same interface
  • Testing: Easy to mock and test language-specific behavior

2. Language Registry

The Nasty.Language.Registry is an Agent-based registry that:

  • Registers language implementations at runtime
  • Validates implementations comply with the Behaviour
  • Provides lookup by language code (:en, :es, :ca)
  • Detects language from text using heuristics
# Registration (happens at application startup)
Registry.register(Nasty.Language.English)

# Lookup
{:ok, module} = Registry.get(:en)

# Detection
{:ok, :en} = Registry.detect_language("Hello world")

3. NLP Pipeline

Each language implementation follows a multi-stage pipeline:

Stage 1: Tokenization

Purpose: Split raw text into atomic units (tokens)

Responsibilities:

  • Sentence boundary detection
  • Word segmentation
  • Contraction handling ("don't" → "do" + "n't")
  • Position tracking (line, column, byte offsets)

Implementation: NimbleParsec combinators for efficient parsing

Output: [Token.t()] with text and position information
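A stripped-down tokenizer built from NimbleParsec combinators might look like this sketch (character classes and tag names are illustrative, not the library's actual grammar):

```elixir
defmodule TokenizerSketch do
  import NimbleParsec

  # Words: runs of letters, digits, or apostrophes (for contractions).
  word = ascii_string([?a..?z, ?A..?Z, ?0..?9, ?'], min: 1) |> tag(:word)

  # Single punctuation characters become their own tokens.
  punct = ascii_string([?., ?,, ?!, ??, ?;], 1) |> tag(:punct)

  # Whitespace separates tokens but produces no output.
  space = ignore(ascii_string([?\s, ?\t, ?\n], min: 1))

  defparsec :tokens, repeat(choice([word, punct, space]))
end
```

Calling `TokenizerSketch.tokens("Hello, world")` yields a keyword list of tagged fragments (`word: ["Hello"]`, `punct: [","]`, …); the real tokenizer would also thread line/column/byte offsets through the parser context.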

Stage 2: POS Tagging

Purpose: Assign part-of-speech tags and morphological features

Responsibilities:

  • Tag assignment using Universal Dependencies tagset
  • Morphological analysis (tense, number, person, case, etc.)
  • Lemmatization (reduce to dictionary form)

Methods:

  • Rule-based tagging
  • Statistical models (HMM)
  • Hybrid approaches

Output: [Token.t()] with pos_tag, lemma, and morphology filled
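The rule-based layer can be pictured as a cascade of suffix and lexicon checks. This is a toy sketch, not the library's tagger; the closed-class lexicon and suffix rules shown are illustrative assumptions, and tokens are plain maps:

```elixir
defmodule RuleTaggerSketch do
  # Tiny closed-class lexicon (assumed); real taggers use much larger tables.
  @closed_class %{"the" => :det, "a" => :det, "on" => :adp, "and" => :cconj}

  def tag(%{text: text} = token) do
    lower = String.downcase(text)

    pos =
      cond do
        Map.has_key?(@closed_class, lower) -> @closed_class[lower]
        String.ends_with?(lower, "ly") -> :adv
        String.ends_with?(lower, "ing") -> :verb
        String.match?(text, ~r/^[A-Z]/) -> :propn
        true -> :noun
      end

    Map.put(token, :pos_tag, pos)
  end
end
```

In the hybrid setup described above, rules like these typically backstop the statistical model for out-of-vocabulary words.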

Stage 3: Parsing

Purpose: Build hierarchical syntactic structure

Responsibilities:

  • Phrase structure parsing (NP, VP, PP, AP, AdvP)
  • Clause identification (independent, subordinate, relative)
  • Sentence structure determination (simple, compound, complex)
  • Document and paragraph organization

Approaches:

  • Recursive descent parsing
  • Chart parsing (future)
  • Statistical parsing (future)

Output: Document.t() with complete AST hierarchy
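The recursive-descent approach can be sketched for a single phrase type. This toy parser recognizes a noun phrase as an optional determiner, any number of adjectives, and a head noun; the map shapes are illustrative stand-ins for the real `NounPhrase` struct:

```elixir
defmodule NPParserSketch do
  # NP -> (det) adj* (noun | propn)
  def noun_phrase([%{pos_tag: :det} = det | rest]) do
    with {:ok, np, remaining} <- noun_phrase(rest),
         do: {:ok, %{np | determiner: det}, remaining}
  end

  def noun_phrase([%{pos_tag: :adj} = adj | rest]) do
    with {:ok, np, remaining} <- noun_phrase(rest),
         do: {:ok, Map.update!(np, :modifiers, &[adj | &1]), remaining}
  end

  def noun_phrase([%{pos_tag: tag} = noun | rest]) when tag in [:noun, :propn] do
    {:ok, %{head: noun, determiner: nil, modifiers: []}, rest}
  end

  def noun_phrase(_tokens), do: {:error, :no_noun_phrase}
end
```

Each clause consumes what it recognizes and returns the unconsumed remainder, which is what lets phrase parsers compose into clause and sentence parsers.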

Stage 4: Semantic Analysis (Optional)

Purpose: Extract meaning and relationships

Components:

  • Named Entity Recognition (NER): Identify persons, organizations, locations, dates
  • Dependency Extraction: Extract grammatical relationships between words
  • Semantic Role Labeling (SRL): Identify who did what to whom
  • Coreference Resolution: Link pronouns to referents
  • Relation Extraction: Extract entity relationships
  • Event Extraction: Identify events and participants

Output: Enriched Document.t() with semantic annotations
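One simple way NER can work is gazetteer lookup, sketched below. The lexicon and entity map shape are illustrative assumptions; production NER would combine gazetteers with statistical models (see the planned CRF support):

```elixir
defmodule NERSketch do
  # Assumed mini-gazetteer; real systems load these from data files.
  @gazetteer %{"London" => :location, "Alice" => :person}

  def find_entities(tokens) do
    for %{text: text} = token <- tokens,
        label = @gazetteer[text],
        label != nil do
      %{label: label, text: text, tokens: [token]}
    end
  end
end
```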

Stage 5: Rendering

Purpose: Convert AST back to natural language text

Responsibilities:

  • Surface realization (choose correct word forms)
  • Agreement enforcement (subject-verb, etc.)
  • Word order application (language-specific)
  • Punctuation insertion
  • Capitalization and formatting

Output: Rendered text string
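Agreement enforcement during surface realization can be pictured as choosing word forms from morphological features. A toy English sketch (the function name and feature map are assumptions, not the library's API):

```elixir
defmodule AgreementSketch do
  # Pick a verb form that agrees with the subject's person and number.
  def inflect("be", %{person: 3, number: :singular}), do: "is"
  def inflect("be", %{number: :plural}), do: "are"
  def inflect(verb, %{person: 3, number: :singular}), do: verb <> "s"
  def inflect(verb, _features), do: verb
end

# AgreementSketch.inflect("run", %{person: 3, number: :singular})
# => "runs"
```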

4. AST Structure

The AST is a hierarchical, linguistically precise representation:

graph TD
    Doc["Document (root)"]
    P1[Paragraph]
    P2[Paragraph]
    S1[Sentence]
    S2[Sentence]
    C1["Clause (main)"]
    C2["Clause (subordinate)"]
    Subj["Subject (NounPhrase)"]
    Pred["Predicate (VerbPhrase)"]
    V["Verb (Token)"]
    Comp["Complement (NounPhrase)"]
    Adv["Adverbial (PrepositionalPhrase)"]
    
    Doc --> P1
    Doc --> P2
    P1 --> S1
    P1 --> S2
    S1 --> C1
    S1 --> C2
    C1 --> Subj
    C1 --> Pred
    Pred --> V
    Pred --> Comp
    Pred --> Adv

Node Types

Document Nodes:

  • Document - Root container
  • Paragraph - Topic-related sentences

Sentence Nodes:

  • Sentence - Complete grammatical unit
  • Clause - Subject + predicate

Phrase Nodes:

  • NounPhrase - Noun-headed (the cat, big house)
  • VerbPhrase - Verb-headed (is running, gave a book)
  • PrepositionalPhrase - Preposition-headed (on the mat)
  • AdjectivalPhrase - Adjective-headed (very happy)
  • AdverbialPhrase - Adverb-headed (quite quickly)

Atomic Nodes:

  • Token - Single word/punctuation with POS tag

Semantic Nodes:

  • Entity - Named entity
  • Relation - Entity relationship
  • Event - Event with participants
  • CorefChain - Coreference links
  • Frame - Semantic role frame

Universal Properties

All nodes include:

%{
  language: atom(),  # :en, :es, :ca
  span: %{          # Position tracking
    start_pos: {line, column},
    start_byte: integer(),
    end_pos: {line, column},
    end_byte: integer()
  }
}

5. AST Utilities

Query Module

Search and extract information from AST:

Nasty.AST.Query.find_subject(sentence)
Nasty.AST.Query.extract_tokens(document)
Nasty.AST.Query.find_entities(document)

Validation Module

Ensure AST structural integrity:

case Nasty.AST.Validation.validate(document) do
  :ok -> :ok
  {:error, errors} -> handle_errors(errors)
end

Transform Module

Modify AST nodes:

transformed = Nasty.AST.Transform.map(document, fn node ->
  # Transform logic
  node
end)

Traversal Module

Navigate AST with different strategies:

Nasty.AST.Traversal.pre_order(document, visitor_fn)
Nasty.AST.Traversal.post_order(document, visitor_fn)
Nasty.AST.Traversal.breadth_first(document, visitor_fn)

6. Statistical & Neural Models

Model Infrastructure

Registry: Agent-based model storage

  • ModelRegistry.register/2 - Store model
  • ModelRegistry.get/1 - Retrieve model
  • ModelRegistry.list_models/0 - List all

Loader: Serialize/deserialize models

  • ModelLoader.load/1 - Load from file
  • ModelLoader.save/2 - Save to file
  • ModelLoader.load_from_priv/1 - Load from app resources
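Together these support a train-once, load-everywhere flow. The sketch below assumes the return shapes, argument order, and file path shown; only the function names and arities come from the list above:

```elixir
# Load a persisted model from the application's priv directory (path assumed),
# register it under a name, and retrieve it elsewhere in the pipeline.
{:ok, model} = Nasty.Statistics.ModelLoader.load_from_priv("models/en_pos_hmm.model")
:ok = Nasty.Statistics.ModelRegistry.register(:en_pos_hmm, model)
{:ok, model} = Nasty.Statistics.ModelRegistry.get(:en_pos_hmm)
```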

Model Types

HMM (Hidden Markov Model):

  • POS tagging with ~95% accuracy
  • Viterbi algorithm for decoding
  • Fast inference, low memory
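The Viterbi decoding step can be sketched generically. This is not the library's implementation; it is a minimal dynamic-programming decoder over log-probability maps, with unseen events falling back to a large negative score:

```elixir
defmodule ViterbiSketch do
  @neg_inf -1.0e9

  # observations: list of words; states: list of tags;
  # start/trans/emit: maps from state, {state, state}, {state, word} to log-probs.
  def decode([first | rest], states, start, trans, emit) do
    # Best score and path ending in each state after the first observation.
    init =
      Map.new(states, fn s ->
        {s, {logp(start, s) + logp(emit, {s, first}), [s]}}
      end)

    best =
      Enum.reduce(rest, init, fn obs, prev ->
        Map.new(states, fn s ->
          # Choose the predecessor state that maximizes the transition score.
          {p, {sc, path}} =
            Enum.max_by(prev, fn {p, {sc, _path}} -> sc + logp(trans, {p, s}) end)

          {s, {sc + logp(trans, {p, s}) + logp(emit, {s, obs}), [s | path]}}
        end)
      end)

    {_s, {score, path}} = Enum.max_by(best, fn {_s, {sc, _}} -> sc end)
    {score, Enum.reverse(path)}
  end

  defp logp(table, key), do: Map.get(table, key, @neg_inf)
end
```

The backpointer paths are kept inline as lists here for brevity; a production decoder would store backpointers separately to avoid copying.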

BiLSTM-CRF (Neural):

  • POS tagging with 97-98% accuracy
  • Bidirectional LSTM with CRF layer
  • Built with Axon/EXLA for GPU acceleration
  • Character-level CNN for OOV handling
  • Pre-trained embedding support

Naive Bayes:

  • Text classification
  • Multinomial variant for document classification

Future Models:

  • PCFG (Probabilistic Context-Free Grammar) for parsing
  • CRF (Conditional Random Fields) for NER
  • Pre-trained transformers (BERT, RoBERTa via Bumblebee)

7. Code Interoperability

Bidirectional conversion between natural language and code:

NL → Code Pipeline

Natural Language
    ↓
Intent Recognition (parse to Intent AST)
    ↓
Code Generation (Intent AST → Elixir AST)
    ↓
Validation
    ↓
Elixir Code String

Example:

Nasty.to_code("Filter users where age is greater than 18", 
  source_language: :en, 
  target_language: :elixir)
# => "Enum.filter(users, fn item -> item > 18 end)"

Code → NL Pipeline

Elixir Code String
    ↓
Parse to Elixir AST
    ↓
Traverse & Explain (AST → Natural Language)
    ↓
Natural Language Description

Example:

Nasty.explain_code("Enum.sort(list)", 
  source_language: :elixir, 
  target_language: :en)
# => "Sort list"

8. Translation System

AST-based translation between natural languages:

Translation Pipeline

Source AST (Language A)
    ↓
AST Transformation (structural changes)
    ↓
Token Translation (lemma-to-lemma mapping)
    ↓
Morphological Agreement (gender/number/person)
    ↓
Word Order Application (language-specific rules)
    ↓
Target AST (Language B)
    ↓
Rendering
    ↓
Target Text

Components:

ASTTransformer - Transforms AST nodes between languages:

alias Nasty.Translation.ASTTransformer

{:ok, spanish_doc} = ASTTransformer.transform_document(english_doc, :es)

TokenTranslator - Lemma-to-lemma translation with POS awareness:

alias Nasty.Translation.TokenTranslator

# cat (noun) → gato (noun)
translated = TokenTranslator.translate_token(token, :en, :es)

Agreement - Enforces morphological agreement:

alias Nasty.Translation.Agreement

# Ensure "el gato" (masc) not "la gato"
adjusted = Agreement.apply_agreement(tokens, :es)

WordOrder - Applies language-specific word order:

alias Nasty.Translation.WordOrder

# "the big house" → "la casa grande" (adjective after noun in Spanish)
ordered = WordOrder.apply_order(phrase, :es)

LexiconLoader - Manages bidirectional lexicons with ETS caching:

alias Nasty.Translation.LexiconLoader

# Load English-Spanish lexicon
{:ok, lexicon} = LexiconLoader.load(:en, :es)

# Bidirectional lookup
"gato" = LexiconLoader.lookup(lexicon, "cat", :noun)
"cat" = LexiconLoader.lookup(lexicon, "gato", :noun)

Features:

  • AST-aware translation preserving grammatical structure
  • Morphological feature agreement
  • Language-specific word order rules (SVO, pro-drop, adjective position)
  • Idiomatic expression support
  • Fallback to original text for untranslatable content
  • Bidirectional translation (English ↔ Spanish, English ↔ Catalan)

9. Rendering & Visualization

Text Rendering

Convert AST to formatted text:

Nasty.Rendering.Text.render(document)

Pretty Printing

Human-readable AST inspection:

Nasty.Rendering.PrettyPrint.inspect(ast)

DOT Visualization

Generate Graphviz diagrams:

{:ok, dot} = Nasty.Rendering.Visualization.to_dot(ast)
File.write("ast.dot", dot)

JSON Export

Export to JSON for external tools:

{:ok, json} = Nasty.Rendering.Visualization.to_json(ast)

10. Data Layer

CoNLL-U Support

Parse and generate Universal Dependencies format:

{:ok, sentences} = Nasty.Data.CoNLLU.parse_file("corpus.conllu")
conllu_string = Nasty.Data.CoNLLU.format(sentence)

Corpus Management

Manage training corpora:

{:ok, corpus} = Nasty.Data.Corpus.load("path/to/corpus")
stats = Nasty.Data.Corpus.statistics(corpus)

Application Supervision

defmodule Nasty.Application do
  use Application

  def start(_type, _args) do
    children = [
      # Language Registry Agent
      Nasty.Language.Registry,
      
      # Model Registry Agent
      Nasty.Statistics.ModelRegistry
    ]

    opts = [strategy: :one_for_one, name: Nasty.Supervisor]
    result = Supervisor.start_link(children, opts)
    
    # Register languages at startup
    Nasty.Language.Registry.register(Nasty.Language.English)
    
    result
  end
end

Extension Points

Adding a New Language

  1. Implement Nasty.Language.Behaviour
  2. Create language module in lib/language/your_language/
  3. Implement required callbacks
  4. Register in application.ex
  5. Add tests

See Language Guide for details.

Adding New NLP Features

  1. Create module in appropriate layer (lib/language/, lib/semantic/, etc.)
  2. Define behaviour if language-agnostic
  3. Implement for each language
  4. Add to pipeline if needed
  5. Update AST if new node types needed

Adding Statistical Models

  1. Implement model training in lib/statistics/
  2. Create Mix task for training
  3. Add model to registry
  4. Integrate into pipeline

Performance Considerations

Efficiency

  • NimbleParsec: Compiled parser combinators for fast tokenization
  • Agent-based registries: Fast in-memory lookup
  • Streaming: Process documents incrementally where possible
  • Lazy evaluation: Use streams for large corpora

Scalability

  • Stateless processing: All functions are pure
  • Concurrent processing: Parse multiple documents in parallel
  • Distributed: Can run across multiple nodes (future)
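Because parsing is pure and stateless, fanning out over a document collection is a one-liner with `Task.async_stream`. The `language:` option name is an assumption about `Nasty.parse/2`:

```elixir
# Parse many documents concurrently; results stream back in input order.
results =
  texts
  |> Task.async_stream(fn text -> Nasty.parse(text, language: :en) end,
    max_concurrency: System.schedulers_online(),
    timeout: :timer.seconds(30)
  )
  |> Enum.map(fn {:ok, result} -> result end)
```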

Testing Strategy

Unit Tests

  • Test each module in isolation
  • Use async: true for parallel execution
  • Mock language implementations when testing core

Integration Tests

  • Test full pipeline from text to AST
  • Test rendering round-trips
  • Test code interoperability

Property-Based Testing

  • Generate random ASTs and validate
  • Test parsing/rendering round-trips
  • Verify AST invariants
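A round-trip property might be written with StreamData/ExUnitProperties along these lines (a sketch assuming those dependencies and the `language:` option name; random strings are allowed to fail parsing):

```elixir
defmodule RoundTripPropertyTest do
  use ExUnit.Case, async: true
  use ExUnitProperties

  property "rendered output re-parses" do
    check all text <- StreamData.string(:alphanumeric, min_length: 1) do
      case Nasty.parse(text, language: :en) do
        {:ok, doc} ->
          {:ok, rendered} = Nasty.render(doc, language: :en)
          assert {:ok, _doc} = Nasty.parse(rendered, language: :en)

        {:error, _reason} ->
          # Arbitrary strings need not be parseable; only successful
          # parses must round-trip.
          :ok
      end
    end
  end
end
```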

Future Directions

Architecture Evolution

  1. Generic Layers: Extract lib/parsing/, lib/semantic/, lib/operations/
  2. Plugin System: Dynamic language loading
  3. Streaming Pipeline: Process infinite text streams
  4. Distributed Processing: Multi-node coordination

Advanced Features

  1. Neural Models: Transformer-based parsing and tagging
  2. Multi-lingual: True cross-language support
  3. Incremental Parsing: Update AST on edits
  4. Error Recovery: Graceful handling of malformed input

See Also