Parsing Guide
This document provides a comprehensive technical guide to all parsing algorithms implemented in Nasty, including tokenization, POS tagging, morphological analysis, phrase parsing, sentence parsing, and dependency extraction.
Table of Contents
- Pipeline Overview
- Tokenization
- POS Tagging
- Morphological Analysis
- Phrase Parsing
- Sentence Parsing
- Dependency Extraction
- Integration Example
Pipeline Overview
The Nasty NLP pipeline processes text through the following stages:
flowchart TD
A[Input Text]
B["[1] Tokenization (NimbleParsec)"]
C["[2] POS Tagging (Rule-based / HMM / Neural)"]
D["[3] Morphological Analysis (Lemmatization + Features)"]
E["[4] Phrase Parsing (Bottom-up CFG)"]
F["[5] Sentence Parsing (Clause Detection)"]
G["[6] Dependency Extraction (UD Relations)"]
H[Complete AST]
A --> B
B --> C
C --> D
D --> E
E --> F
F --> G
G --> H
Each stage:
- Takes structured input from the previous stage
- Adds linguistic annotations
- Preserves position tracking (span information)
- Maintains language metadata
Tokenization
Algorithm: NimbleParsec Combinator Parsing
Module: Nasty.Language.English.Tokenizer
Approach: Bottom-up combinator-based parsing using NimbleParsec, processing text left-to-right with greedy longest-match.
Token Types
- Hyphenated words: well-known, twenty-one
- Contractions: don't, I'm, we've, it's
- Numbers: integers (123), decimals (3.14)
- Words: alphabetic sequences
- Punctuation: sentence-ending (., !, ?), commas, quotes, brackets, etc.
Parser Combinators
# Order matters - more specific patterns first
token = choice([
hyphenated, # "well-known"
contraction, # "don't"
number, # "123", "3.14"
word, # "cat"
punctuation # ".", ",", etc.
])
Position Tracking
Every token includes precise position information:
%Token{
text: "cat",
span: %{
start_pos: {1, 5}, # {line, column}
start_offset: 4, # byte offset
end_pos: {1, 8},
end_offset: 7
}
}
Position tracking handles:
- Multi-line text with newline counting
- Whitespace between tokens (ignored but tracked)
- UTF-8 byte offsets vs. character positions
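The byte-offset vs. character-position distinction is easy to see with plain Elixir string functions (standard library only, not Nasty code):

```elixir
text = "café sat"

# "é" occupies two bytes in UTF-8, so byte offsets and
# character positions diverge after it.
String.length(text)   # => 8 characters
byte_size(text)       # => 9 bytes

# "sat" is the 6th character (index 5) but starts at byte offset 6.
{byte_offset, _len} = :binary.match(text, "sat")
byte_offset           # => 6
```

This is why `%Token{}` spans carry both `{line, column}` positions and byte offsets.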
Edge Cases
- Empty text: Returns {:ok, []}
- Whitespace-only: Returns {:ok, []}
- Unparseable text: Returns {:error, {:parse_incomplete, ...}}
- Contractions: Parsed as single tokens, not split
Example
{:ok, tokens} = Tokenizer.tokenize("I don't know.")
# => [
# %Token{text: "I", pos_tag: :x, span: ...},
# %Token{text: "don't", pos_tag: :x, span: ...},
# %Token{text: "know", pos_tag: :x, span: ...},
# %Token{text: ".", pos_tag: :punct, span: ...}
# ]
POS Tagging
Three Tagging Models
Module: Nasty.Language.English.POSTagger
Nasty supports three POS tagging approaches with different accuracy/speed tradeoffs:
| Model | Accuracy | Speed | Method |
|---|---|---|---|
| Rule-based | ~85% | Very Fast | Lexical lookup + morphology + context |
| HMM (Trigram) | ~95% | Fast | Viterbi decoding with add-k smoothing |
| Neural (BiLSTM-CRF) | 97-98% | Moderate | Deep learning with contextual embeddings |
1. Rule-Based Tagging
Algorithm: Sequential pattern matching with three-tier lookup
Tagging Strategy
Lexical Lookup: Closed-class words (determiners, pronouns, prepositions, etc.)
- 450+ words in lookup tables
- Example: "the" → :det, "in" → :adp, "and" → :cconj
Morphological Analysis: Suffix-based tagging for open-class words
Nouns: -tion, -sion, -ment, -ness, -ity, -ism
Verbs: -ing, -ed, -s/-es (3rd person singular)
Adjectives: -ful, -less, -ous, -ive, -able, -ible
Adverbs: -ly
Contextual Disambiguation: Local context rules
- Word after determiner → likely noun
- Word after preposition → likely noun
- Word before noun → likely adjective
- Capitalized words → proper nouns
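A minimal sketch of such context rules (module and function names here are illustrative, not Nasty's internals):

```elixir
defmodule ContextRules do
  # Refine the tag of an unknown word using the previous tag
  # and surface form, mirroring the rules listed above.
  def refine(:unknown, prev_tag, word) do
    cond do
      prev_tag in [:det, :adp] -> :noun          # after determiner/preposition → noun
      String.match?(word, ~r/^[A-Z]/) -> :propn  # capitalized → proper noun
      true -> :noun                              # default open-class guess
    end
  end

  # Words already tagged by lexical or morphological lookup pass through.
  def refine(tag, _prev_tag, _word), do: tag
end
```

For example, `ContextRules.refine(:unknown, :det, "cat")` yields `:noun`.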
Third-Person Singular Verb Detection
Conservative approach to avoid mistagging plural nouns as verbs:
# "walks" → :verb (stem "walk" in common verb list)
# "books" → :noun (not a verb stem)
# "stations" → :noun (ends with -tions, noun suffix)
Checks:
- Exclude capitalized words (proper nouns)
- Exclude words with clear noun suffixes (-tions, -ments, etc.)
- Verify stem is in common verb list (140+ verbs)
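The conservative check can be sketched as follows (the verb list and suffix set are tiny illustrations; Nasty's actual tables are much larger):

```elixir
defmodule ThirdPersonCheck do
  # Toy stand-ins for Nasty's 140+ verb list and noun-suffix table.
  @common_verbs MapSet.new(~w(walk run eat see know sit make))
  @noun_suffixes ~w(tions ments nesses ities)

  # True only when the -s form is safely a 3rd-person-singular verb.
  def verb?(word) do
    not String.match?(word, ~r/^[A-Z]/) and
      not Enum.any?(@noun_suffixes, &String.ends_with?(word, &1)) and
      String.ends_with?(word, "s") and
      MapSet.member?(@common_verbs, String.trim_trailing(word, "s"))
  end
end
```

So `verb?("walks")` is true, while `verb?("books")` and `verb?("stations")` are false.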
2. HMM-Based Tagging
Algorithm: Viterbi decoding with trigram Hidden Markov Model
Model Components
Emission Probabilities: P(word|tag)
- Learned from tagged training data
- Smoothing for unknown words: add-k smoothing (k=0.001)
Transition Probabilities: P(tag₃|tag₁, tag₂)
- Trigram model for better context
- Special START markers for sentence boundaries
- Add-k smoothing for unseen trigrams
Initial Probabilities: P(tag) at sentence start
- Distribution of first tags in training sentences
Training Process
training_data = [
{["The", "cat", "sat"], [:det, :noun, :verb]},
...
]
model = HMMTagger.new()
{:ok, trained} = HMMTagger.train(model, training_data, [])
Counts:
- Emission counts: {word, tag} pairs
- Transition counts: {tag1, tag2} → tag3 trigrams
- Initial counts: first tag in each sequence
Normalization:
P(word|tag) = (count(word, tag) + k) / (count(tag) + k * vocab_size)
P(tag3|tag1,tag2) = (count(tag1,tag2,tag3) + k) / (count(tag1,tag2) + k * num_tags)
Viterbi Decoding
Dynamic programming algorithm to find most likely tag sequence:
score[t][tag] = max over prev_tags of:
score[t-1][prev_tag] +
log P(tag|prev_prev_tag, prev_tag) +
log P(word_t|tag)
Steps:
- Initialization: Score each tag for first word
- Forward Pass: Compute best score for each (position, tag) pair
- Backpointers: Track best previous tag for reconstruction
- Backtracking: Reconstruct best path from end to start
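The four steps above can be sketched in Elixir. This toy version uses a bigram model for brevity (the real tagger is trigram), works in log space, and backs unseen events off to a small constant; the data shapes are assumptions for illustration:

```elixir
defmodule ViterbiSketch do
  # words: ["the", "cat"]; tags: [:det, :noun]
  # emit: %{{word, tag} => prob}; trans: %{{prev_tag, tag} => prob}
  # init: %{tag => prob}; unseen events fall back to 1.0e-6.
  def decode(words, tags, emit, trans, init) do
    lp = fn map, key -> :math.log(Map.get(map, key, 1.0e-6)) end

    # Initialization: score each tag for the first word.
    first =
      Map.new(tags, fn t ->
        {t, {lp.(init, t) + lp.(emit, {hd(words), t}), nil}}
      end)

    # Forward pass: best score and backpointer per (position, tag).
    cols =
      Enum.reduce(tl(words), [first], fn word, [prev | _] = acc ->
        col =
          Map.new(tags, fn t ->
            {score, back} =
              tags
              |> Enum.map(fn p -> {elem(prev[p], 0) + lp.(trans, {p, t}), p} end)
              |> Enum.max()

            {t, {score + lp.(emit, {word, t}), back}}
          end)

        [col | acc]
      end)

    # Backtracking: reconstruct the best path from the end.
    [last | rest] = cols
    {best, {_score, back}} = Enum.max_by(last, fn {_t, {s, _b}} -> s end)
    backtrack(rest, back, [best])
  end

  defp backtrack(_cols, nil, path), do: path

  defp backtrack([col | rest], tag, path) do
    {_score, back} = col[tag]
    backtrack(rest, back, [tag | path])
  end
end
```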
3. Neural Tagging (BiLSTM-CRF)
Algorithm: Bidirectional LSTM with Conditional Random Field layer
Module: Nasty.Statistics.POSTagging.NeuralTagger
Architecture
flowchart TD
A["Input: Word IDs [batch_size, seq_len]"]
B["Word Embeddings [batch_size, seq_len, embedding_dim]"]
C["BiLSTM Layers (×2) [batch_size, seq_len, hidden_size * 2]"]
D["Linear Projection [batch_size, seq_len, num_tags]"]
E["CRF Layer (optional) [batch_size, seq_len, num_tags]"]
F["Output: Tag IDs [batch_size, seq_len]"]
A --> B
B --> C
C --> D
D --> E
E --> F
Key Components
Word Embeddings: 300-dimensional learned representations
- Vocabulary built from training data (min frequency = 2)
- Unknown words mapped to special UNK token
Bidirectional LSTM: 2 layers, 256 hidden units each
- Forward LSTM: left-to-right context
- Backward LSTM: right-to-left context
- Concatenated outputs: 512 dimensions
CRF Layer: Learns tag transition constraints
- Enforces valid tag sequences (e.g., DET → NOUN more likely than DET → VERB)
- Joint decoding over entire sequence
Dropout: 0.3 rate for regularization
Training
tagger = NeuralTagger.new(vocab_size: 10000, num_tags: 17)
training_data = [{["The", "cat"], [:det, :noun]}, ...]
{:ok, trained} = NeuralTagger.train(tagger, training_data,
epochs: 10,
batch_size: 32,
learning_rate: 0.001,
validation_split: 0.1
)
Training features:
- Adam optimizer (adaptive learning rate)
- Cross-entropy loss (or CRF loss if using CRF layer)
- Early stopping with patience=3
- Validation set monitoring (10% split)
Inference
{:ok, tags} = NeuralTagger.predict(trained, ["The", "cat", "sat"], [])
# => {:ok, [:det, :noun, :verb]}
Steps:
- Convert words to IDs using vocabulary
- Pad sequences to batch size
- Run through BiLSTM-CRF model
- Argmax over tag dimension (or Viterbi if using CRF)
- Convert tag IDs back to atoms
Model Selection
Use :model option in POSTagger.tag_pos/2:
# Rule-based (fast, ~85% accuracy)
{:ok, tokens} = POSTagger.tag_pos(tokens, model: :rule_based)
# HMM (fast, ~95% accuracy)
{:ok, tokens} = POSTagger.tag_pos(tokens, model: :hmm)
# Neural (moderate, 97-98% accuracy)
{:ok, tokens} = POSTagger.tag_pos(tokens, model: :neural)
# Ensemble: HMM + rule-based fallback for punctuation/numbers
{:ok, tokens} = POSTagger.tag_pos(tokens, model: :ensemble)
# Neural ensemble: Neural + rule-based fallback
{:ok, tokens} = POSTagger.tag_pos(tokens, model: :neural_ensemble)
Morphological Analysis
Algorithm: Dictionary + Rule-Based Lemmatization
Module: Nasty.Language.English.Morphology
Approach: Two-tier lemmatization with irregular form lookup followed by rule-based suffix removal.
Lemmatization Process
1. Irregular Form Lookup
Check dictionaries for common irregular forms:
Verbs (80+ irregular verbs):
"went" → "go", "was" → "be", "ate" → "eat", "ran" → "run"
Nouns (12 irregular nouns):
"children" → "child", "men" → "man", "mice" → "mouse"
Adjectives (12 irregular comparatives/superlatives):
"better" → "good", "best" → "good", "worse" → "bad"
2. Rule-Based Suffix Removal
If no irregular form found, apply POS-specific rules:
Verbs:
-ing → stem (handling doubled consonants)
"running" → "run" (remove doubled 'n')
"making" → "make"
-ed → stem (handling doubled consonants, silent e)
"stopped" → "stop" (remove doubled 'p')
"liked" → "like" (restore silent 'e')
-s → base form (3rd person singular)
"walks" → "walk"
Nouns:
-ies → -y (flies → fly)
-es → base (if stem ends in s/x/z/ch/sh)
"boxes" → "box", "dishes" → "dish"
-s → base (cats → cat)
Adjectives:
-est → base (superlative)
"fastest" → "fast" (handle doubled consonants)
-er → base (comparative)
"faster" → "fast"
Morphological Feature Extraction
Verb Features
%{
tense: :present | :past,
aspect: :progressive, # for -ing forms
person: 3, # for 3rd person singular
number: :singular
}
Examples:
- "running" → %{tense: :present, aspect: :progressive}
- "walked" → %{tense: :past}
- "walks" → %{tense: :present, person: 3, number: :singular}
Noun Features
%{number: :singular | :plural}
Examples:
- "cat" → %{number: :singular}
- "cats" → %{number: :plural}
Adjective Features
%{degree: :positive | :comparative | :superlative}
Examples:
- "fast" → %{degree: :positive}
- "faster" → %{degree: :comparative}
- "fastest" → %{degree: :superlative}
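The two-tier lemmatization described above can be sketched as follows. The tables are tiny illustrations, and the silent-e restoration and the s/x/z/ch/sh -es rule are omitted for brevity; this is not Nasty's actual implementation:

```elixir
defmodule LemmaSketch do
  # Tier 1: irregular-form lookup (toy table).
  @irregular %{"went" => "go", "children" => "child", "better" => "good"}

  def lemmatize(word, pos) do
    Map.get(@irregular, word) || strip_suffix(word, pos)
  end

  # Tier 2: POS-specific suffix removal.
  defp strip_suffix(word, :verb) do
    cond do
      String.ends_with?(word, "ing") -> word |> chop(3) |> undouble()
      String.ends_with?(word, "ed") -> word |> chop(2) |> undouble()
      String.ends_with?(word, "s") -> chop(word, 1)
      true -> word
    end
  end

  defp strip_suffix(word, :noun) do
    cond do
      String.ends_with?(word, "ies") -> chop(word, 3) <> "y"
      String.ends_with?(word, "s") -> chop(word, 1)
      true -> word
    end
  end

  defp strip_suffix(word, _pos), do: word

  defp chop(word, n), do: String.slice(word, 0, String.length(word) - n)

  # "runn" → "run": drop a doubled final consonant.
  defp undouble(stem) do
    case stem |> String.to_charlist() |> Enum.reverse() do
      [c, c | _] -> chop(stem, 1)
      _ -> stem
    end
  end
end
```

For example, `LemmaSketch.lemmatize("running", :verb)` returns `"run"` and `LemmaSketch.lemmatize("flies", :noun)` returns `"fly"`.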
Example
{:ok, tokens} = Tokenizer.tokenize("running cats")
{:ok, tagged} = POSTagger.tag_pos(tokens)
{:ok, analyzed} = Morphology.analyze(tagged)
# => [
# %Token{text: "running", pos_tag: :verb, lemma: "run",
# morphology: %{tense: :present, aspect: :progressive}},
# %Token{text: "cats", pos_tag: :noun, lemma: "cat",
# morphology: %{number: :plural}}
# ]
Phrase Parsing
Algorithm: Bottom-Up Pattern Matching with Context-Free Grammar
Module: Nasty.Language.English.PhraseParser
Approach: Greedy longest-match, left-to-right phrase construction using simplified CFG rules.
Grammar Rules
NP → Det? Adj* (Noun | PropN | Pron) (PP | RelClause)*
VP → Aux* Verb (NP)? (PP | AdvP)*
PP → Prep NP
AdjP → Adv? Adj
AdvP → Adv
RC → RelPron/RelAdv Clause
Phrase Types
1. Noun Phrase (NP)
Components:
- Determiner (optional): the, a, my, some
- Modifiers (0+): adjectives, adjectival phrases
- Head (required): noun, proper noun, or pronoun
- Post-modifiers (0+): prepositional phrases, relative clauses
Examples:
"the cat" → [det: "the", head: "cat"]
"the big cat" → [det: "the", modifiers: ["big"], head: "cat"]
"the cat on the mat" → [det: "the", head: "cat",
post_modifiers: [PP("on", NP("the mat"))]]
Special Cases:
- Pronouns as NPs: "I", "he", "they" can stand alone
- Multi-word proper nouns: "New York" → consecutive PROPNs merged as modifiers
2. Verb Phrase (VP)
Components:
- Auxiliaries (0+): is, have, will, can
- Head (required): main verb
- Complements (0+): object NP, PPs, adverbs
Examples:
"sat" → [head: "sat"]
"is running" → [auxiliaries: ["is"], head: "running"]
"saw the cat" → [head: "saw", complements: [NP("the cat")]]
"sat on the mat" → [head: "sat", complements: [PP("on", NP("the mat"))]]
Special Case - Copula Construction: If only auxiliaries are found (no main verb), treat the last auxiliary as the main verb:
"is happy" → [head: "is", complements: [AdjP("happy")]]
"are engineers" → [head: "are", complements: [NP("engineers")]]
3. Prepositional Phrase (PP)
Structure: Prep + NP
Examples:
"on the mat" → [head: "on", object: NP("the mat")]
"in the house" → [head: "in", object: NP("the house")]
4. Adjectival Phrase (AdjP)
Structure: Adv? + Adj
Examples:
"very big" → [intensifier: "very", head: "big"]
"quite small" → [intensifier: "quite", head: "small"]
5. Adverbial Phrase (AdvP)
Structure: Adv (currently simple single-word adverbs)
Examples:
"quickly" → [head: "quickly"]
"often" → [head: "often"]
6. Relative Clause (RC)
Structure: RelPron/RelAdv + Clause
Relativizers:
- Pronouns: who, whom, whose, which, that
- Adverbs: where, when, why
Examples:
"that sits" → [relativizer: "that", clause: VP("sits")]
"who I know" → [relativizer: "who", clause: [subject: NP("I"), predicate: VP("know")]]
Two Patterns:
- Relativizer as subject: "that sits" → clause has only VP
- Relativizer as object: "that I see" → clause has NP subject + VP
Parsing Process
Each parse_*_phrase function:
- Checks current position in token list
- Attempts to consume tokens matching the pattern
- Recursively parses sub-phrases (e.g., NP within PP)
- Calculates span from first to last consumed token
- Returns {:ok, phrase, next_position} or :error
Greedy Matching: Consumes as many tokens as possible for each phrase (e.g., all consecutive adjectives as modifiers).
Position Tracking: Every phrase includes span covering all constituent tokens.
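The shape of such a parse function can be sketched as follows (module, helpers, and the map-based token shape are illustrative, not PhraseParser's actual code). It consumes an optional determiner, then consecutive adjectives greedily, then a required head:

```elixir
defmodule NPSketch do
  # Returns {:ok, phrase, next_position} or :error.
  def parse_np(tokens, pos) do
    {det, pos} = maybe_take(tokens, pos, :det)
    {adjs, pos} = take_while(tokens, pos, :adj)

    case Enum.at(tokens, pos) do
      %{pos_tag: tag} = head when tag in [:noun, :propn, :pron] ->
        {:ok, %{determiner: det, modifiers: adjs, head: head}, pos + 1}

      _ ->
        :error
    end
  end

  # Consume one token of the given tag, if present.
  defp maybe_take(tokens, pos, tag) do
    case Enum.at(tokens, pos) do
      %{pos_tag: ^tag} = tok -> {tok, pos + 1}
      _ -> {nil, pos}
    end
  end

  # Greedily consume consecutive tokens of the given tag.
  defp take_while(tokens, pos, tag) do
    case Enum.at(tokens, pos) do
      %{pos_tag: ^tag} = tok ->
        {rest, next} = take_while(tokens, pos + 1, tag)
        {[tok | rest], next}

      _ ->
        {[], pos}
    end
  end
end
```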
Example
tokens = [
%Token{text: "the", pos_tag: :det},
%Token{text: "big", pos_tag: :adj},
%Token{text: "cat", pos_tag: :noun},
%Token{text: "on", pos_tag: :adp},
%Token{text: "the", pos_tag: :det},
%Token{text: "mat", pos_tag: :noun}
]
{:ok, np, _pos} = PhraseParser.parse_noun_phrase(tokens, 0)
# => %NounPhrase{
# determiner: "the",
# modifiers: ["big"],
# head: "cat",
# post_modifiers: [
# %PrepositionalPhrase{
# head: "on",
# object: %NounPhrase{determiner: "the", head: "mat"}
# }
# ]
# }
Sentence Parsing
Algorithm: Clause Detection with Coordination and Subordination
Module: Nasty.Language.English.SentenceParser
Approach: Split on sentence boundaries, then parse each sentence into clauses with support for simple, compound, and complex structures.
Sentence Structures
Simple: Single independent clause
"The cat sat."
Compound: Multiple coordinated independent clauses
"The cat sat and the dog ran."
Complex: Independent clause with subordinate clause(s)
"The cat sat because it was tired."
Fragment: Incomplete sentence (e.g., subordinate clause alone)
Sentence Functions
Inferred from punctuation:
- . → :declarative (statement)
- ? → :interrogative (question)
- ! → :exclamative (exclamation)
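This mapping is a one-liner per punctuation mark; a sketch (the `:declarative` fallback for other punctuation is an assumption, not documented behavior):

```elixir
defmodule SentenceFunction do
  # Infer sentence function from the final punctuation token.
  def infer("."), do: :declarative
  def infer("?"), do: :interrogative
  def infer("!"), do: :exclamative
  def infer(_other), do: :declarative  # assumed default
end
```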
Parsing Process
1. Sentence Boundary Detection
Split on sentence-ending punctuation (., !, ?):
split_sentences(tokens)
# Groups tokens into sentence units
2. Clause Parsing
For each sentence group, parse into clause structure:
Grammar:
Sentence → Clause+
Clause → SubordConj? NP? VP
Three Clause Types:
- Independent: Can stand alone as complete sentence
- Subordinate: Begins with a subordinating conjunction (because, if, when, etc.)
- Relative: Part of a relative clause structure (handled in phrase parsing)
3. Coordination Detection
Look for coordinating conjunctions (:cconj):
and, or, but, nor, yet, so, for
If found, split and parse both sides:
"The cat sat and the dog ran"
# Split at "and"
# Parse: Clause1 ("The cat sat") + Clause2 ("the dog ran")
# Result: [Clause1, Clause2]
4. Subordination Detection
Check for subordinating conjunction (:sconj) at start:
after, although, because, before, if, since, when, while, etc.
If found, mark clause as subordinate:
"because it was tired"
# Parse: Clause with subordinator: "because"
# Type: :subordinate
Simple Clause Parsing
Algorithm: Find verb, split at verb to identify subject and predicate.
Steps:
- Find first verb/auxiliary in token sequence
- If verb at position 0: imperative sentence (no subject)
  - Parse VP starting at position 0
  - Subject = nil
- If verb at position > 0: declarative sentence
  - Try to parse NP before verb (subject)
  - Parse VP starting at end of subject (predicate)
- If no subject found: try VP alone (imperative or fragment)
Fallback: If parsing fails, create minimal clause with first verb found.
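The verb-split strategy can be sketched as follows (module and return shapes are illustrative assumptions): find the first verb or auxiliary; the tokens before it become the subject NP and the rest the predicate VP.

```elixir
defmodule ClauseSketch do
  # Split a token list at the first verb/auxiliary.
  def split_at_verb(tokens) do
    case Enum.find_index(tokens, &(&1.pos_tag in [:verb, :aux])) do
      nil -> :error                      # no verb: not a parseable clause
      0 -> {:imperative, [], tokens}     # verb first: no subject
      i -> {:declarative, Enum.take(tokens, i), Enum.drop(tokens, i)}
    end
  end
end
```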
Clause Structure
%Clause{
type: :independent | :subordinate | :relative,
subordinator: Token.t() | nil, # "because", "if", etc.
subject: NounPhrase.t() | nil,
predicate: VerbPhrase.t(),
language: :en,
span: span
}
Sentence Structure
%Sentence{
function: :declarative | :interrogative | :exclamative,
structure: :simple | :compound | :complex | :fragment,
main_clause: Clause.t(),
additional_clauses: [Clause.t()], # for compound sentences
language: :en,
span: span
}
Example
tokens = tokenize_and_tag("The cat sat and the dog ran.")
{:ok, [sentence]} = SentenceParser.parse_sentences(tokens)
# => %Sentence{
# function: :declarative,
# structure: :compound,
# main_clause: %Clause{
# type: :independent,
# subject: NP("The cat"),
# predicate: VP("sat")
# },
# additional_clauses: [
# %Clause{
# type: :independent,
# subject: NP("the dog"),
# predicate: VP("ran")
# }
# ]
# }
Dependency Extraction
Algorithm: Phrase Structure to Universal Dependencies Conversion
Module: Nasty.Language.English.DependencyExtractor
Approach: Traverse phrase structure AST and extract grammatical relations as Universal Dependencies (UD) relations.
Universal Dependencies Relations
Nasty uses the UD relation taxonomy:
Core Arguments:
- nsubj - nominal subject
- obj - direct object
- iobj - indirect object
Non-Core Dependents:
- obl - oblique nominal (prepositional complement to verb)
- advmod - adverbial modifier
- aux - auxiliary verb
Nominal Dependents:
- det - determiner
- amod - adjectival modifier
- nmod - nominal modifier (prepositional complement to noun)
- case - case marking (preposition)
Clausal Dependents:
- acl - adnominal clause (relative clause)
- mark - subordinating marker
Coordination:
- conj - conjunct
- cc - coordinating conjunction
Extraction Process
1. Sentence-Level Extraction
extract(sentence)
# Extracts from main_clause + additional_clauses
2. Clause-Level Extraction
For each clause:
Subject Dependency: nsubj(predicate_head, subject_head)
- Extract head token from subject NP
- Extract head token from predicate VP
- Create dependency relation
Predicate Dependencies: Extract from VP (see below)
Subordinator Dependency (if present):
mark(predicate_head, subordinator)
3. Noun Phrase Dependencies
From NP structure:
Determiner: det(head, determiner)
- "the cat" → det(cat, the)
Adjectival Modifiers: amod(head, modifier)
- "big cat" → amod(cat, big)
Post-modifiers:
- PP: case(pp_object_head, preposition) + nmod(np_head, pp_object_head)
  - "cat on mat" → case(mat, on) + nmod(cat, mat)
- Relative Clause: mark(clause_head, relativizer) + acl(np_head, clause_head)
  - "cat that sits" → mark(sits, that) + acl(cat, sits)
4. Verb Phrase Dependencies
From VP structure:
Auxiliaries: aux(main_verb, auxiliary)
- "is running" → aux(running, is)
Complements:
- Direct Object NP: obj(verb, np_head)
  - "saw cat" → obj(saw, cat)
- PP Complement: case(pp_object, preposition) + obl(verb, pp_object)
  - "sat on mat" → case(mat, on) + obl(sat, mat)
- Adverb: advmod(verb, adverb)
  - "ran quickly" → advmod(ran, quickly)
5. Prepositional Phrase Dependencies
From PP structure:
- Case Marking: case(pp_object_head, preposition)
- Oblique/Nominal Modifier:
  - If governor is verb: obl(governor, pp_object_head)
  - If governor is noun: nmod(governor, pp_object_head)
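The traversal pattern for these rules can be sketched as follows (the map-based clause/phrase shapes are simplified assumptions, not DependencyExtractor's actual structs); it emits det and amod relations from the subject NP plus the nsubj relation:

```elixir
defmodule DepSketch do
  # Extract det/amod/nsubj relations from a simplified clause.
  def extract(%{subject: np, predicate: vp}) do
    verb = vp.head
    np_deps(np) ++ [%{relation: :nsubj, head: verb, dependent: np.head}]
  end

  # NP-internal relations: determiner and adjectival modifiers.
  defp np_deps(np) do
    det =
      if np.determiner,
        do: [%{relation: :det, head: np.head, dependent: np.determiner}],
        else: []

    amods =
      Enum.map(np.modifiers, &%{relation: :amod, head: np.head, dependent: &1})

    det ++ amods
  end
end
```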
Dependency Structure
%Dependency{
relation: :nsubj | :obj | :det | ...,
head: Token.t(), # Governor token
dependent: Token.t(), # Dependent token
span: span
}
Example
# Input: "The cat sat on the mat."
sentence = parse("The cat sat on the mat.")
dependencies = DependencyExtractor.extract(sentence)
# => [
# %Dependency{relation: :det, head: "cat", dependent: "the"},
# %Dependency{relation: :nsubj, head: "sat", dependent: "cat"},
# %Dependency{relation: :case, head: "mat", dependent: "on"},
# %Dependency{relation: :det, head: "mat", dependent: "the"},
# %Dependency{relation: :obl, head: "sat", dependent: "mat"}
# ]
Visualization
Dependencies can be visualized as a directed graph:
graph TD
Root["sat (ROOT)"]
Cat[cat]
Mat[mat]
The1[the]
On[on]
The2[the]
Root -->|nsubj| Cat
Root -->|obl| Mat
Cat -->|det| The1
Mat -->|case| On
Mat -->|det| The2
Integration Example
Complete pipeline from text to dependencies:
alias Nasty.Language.English.{
Tokenizer, POSTagger, Morphology,
PhraseParser, SentenceParser, DependencyExtractor
}
# Input text
text = "The big cat sat on the mat."
# Step 1: Tokenization
{:ok, tokens} = Tokenizer.tokenize(text)
# => [Token("The"), Token("big"), Token("cat"), ...]
# Step 2: POS Tagging (choose model)
{:ok, tagged} = POSTagger.tag_pos(tokens, model: :neural)
# => [Token("The", :det), Token("big", :adj), Token("cat", :noun), ...]
# Step 3: Morphological Analysis
{:ok, analyzed} = Morphology.analyze(tagged)
# => [Token("The", :det, lemma: "the"), ...]
# Step 4: Sentence Parsing (includes phrase parsing internally)
{:ok, sentences} = SentenceParser.parse_sentences(analyzed)
# => [Sentence(...)]
# Step 5: Dependency Extraction
sentence = hd(sentences)
dependencies = DependencyExtractor.extract(sentence)
# => [Dependency(:det, "cat", "The"), ...]
# Result: Complete AST with dependencies
sentence
# => %Sentence{
# main_clause: %Clause{
# subject: %NounPhrase{
# determiner: Token("The"),
# modifiers: [Token("big")],
# head: Token("cat")
# },
# predicate: %VerbPhrase{
# head: Token("sat"),
# complements: [
# %PrepositionalPhrase{
# head: Token("on"),
# object: %NounPhrase{...}
# }
# ]
# }
# }
# }
Performance Considerations
Model Selection
For Production:
- Use neural models for highest accuracy
- Cache loaded models in memory
- Batch sentences for GPU acceleration (if available)
For Development/Testing:
- Use rule-based for fastest iteration
- HMM for good balance of speed and accuracy
Optimization Tips
- Batch Processing: Process multiple sentences together
- Model Caching: Load models once, reuse across requests
- Lazy Loading: Only load neural models when needed
- Parallel Processing: Use Task.async_stream for multiple sentences
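A sketch of parallel sentence processing with Task.async_stream; `process` here is a placeholder for the full Nasty pipeline (tokenize → tag → parse):

```elixir
# Placeholder for the real per-sentence pipeline function.
process = fn text -> String.upcase(text) end

results =
  ["The cat sat.", "The dog ran."]
  |> Task.async_stream(process, max_concurrency: System.schedulers_online())
  |> Enum.map(fn {:ok, result} -> result end)

# Task.async_stream preserves input order by default, so `results`
# lines up with the input sentences.
```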
Accuracy Benchmarks
Tested on Universal Dependencies English-EWT test set:
| Component | Accuracy |
|---|---|
| Tokenization | 99.9% |
| Rule-based POS | 85% |
| HMM POS | 95% |
| Neural POS | 97-98% |
| Phrase Parsing | 87% (F1) |
| Dependency Extraction | 82% (UAS) |
Further Reading
- Universal Dependencies - UD relations and guidelines
- Penn Treebank POS Tags
- NimbleParsec Documentation
- Axon Neural Networks
- See docs/ARCHITECTURE.md for overall system design
- See docs/NEURAL_MODELS.md for neural network details