Parsing Guide
This document provides a comprehensive technical guide to all parsing algorithms implemented in Nasty, including tokenization, POS tagging, morphological analysis, phrase parsing, sentence parsing, and dependency extraction.
Table of Contents
- Pipeline Overview
- Tokenization
- POS Tagging
- Morphological Analysis
- Phrase Parsing
- Sentence Parsing
- Dependency Extraction
- Integration Example
Pipeline Overview
The Nasty NLP pipeline processes text through the following stages:
flowchart TD
A[Input Text]
B["[1] Tokenization (NimbleParsec)"]
C["[2] POS Tagging (Rule-based / HMM / Neural)"]
D["[3] Morphological Analysis (Lemmatization + Features)"]
E["[4] Phrase Parsing (Bottom-up CFG)"]
F["[5] Sentence Parsing (Clause Detection)"]
G["[6] Dependency Extraction (UD Relations)"]
H[Complete AST]
A --> B
B --> C
C --> D
D --> E
E --> F
F --> G
G --> H
Each stage:
- Takes structured input from the previous stage
- Adds linguistic annotations
- Preserves position tracking (span information)
- Maintains language metadata
Tokenization
Algorithm: NimbleParsec Combinator Parsing
Module: Nasty.Language.English.Tokenizer
Approach: Bottom-up combinator-based parsing using NimbleParsec, processing text left-to-right with greedy longest-match.
Token Types
- Hyphenated words: well-known, twenty-one
- Contractions: don't, I'm, we've, it's
- Numbers: integers (123), decimals (3.14)
- Words: alphabetic sequences
- Punctuation: sentence-ending (., !, ?), commas, quotes, brackets, etc.
Parser Combinators
# Order matters - more specific patterns first
token = choice([
hyphenated, # "well-known"
contraction, # "don't"
number, # "123", "3.14"
word, # "cat"
punctuation # ".", ",", etc.
])
Position Tracking
Every token includes precise position information:
%Token{
text: "cat",
span: %{
start_pos: {1, 5}, # {line, column}
start_offset: 4, # byte offset
end_pos: {1, 8},
end_offset: 7
}
}
Position tracking handles:
- Multi-line text with newline counting
- Whitespace between tokens (ignored but tracked)
- UTF-8 byte offsets vs. character positions
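The byte-offset vs. character-position distinction is easy to see with plain Elixir string functions (standard library only, not Nasty code):

```elixir
text = "café sat"

# "é" occupies two bytes in UTF-8, so byte offsets and
# character positions diverge after it.
String.length(text)   # => 8 characters
byte_size(text)       # => 9 bytes

# "sat" is the 6th character (index 5) but starts at byte offset 6.
{byte_offset, _len} = :binary.match(text, "sat")
byte_offset           # => 6
```

This is why `%Token{}` spans carry both `{line, column}` positions and byte offsets.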
Edge Cases
- Empty text: Returns {:ok, []}
- Whitespace-only: Returns {:ok, []}
- Unparseable text: Returns {:error, {:parse_incomplete, ...}}
- Contractions: Parsed as single tokens, not split
Example
{:ok, tokens} = Tokenizer.tokenize("I don't know.")
# => [
# %Token{text: "I", pos_tag: :x, span: ...},
# %Token{text: "don't", pos_tag: :x, span: ...},
# %Token{text: "know", pos_tag: :x, span: ...},
# %Token{text: ".", pos_tag: :punct, span: ...}
# ]
POS Tagging
Three Tagging Models
Module: Nasty.Language.English.POSTagger
Nasty supports three POS tagging approaches with different accuracy/speed tradeoffs:
| Model | Accuracy | Speed | Method |
|---|---|---|---|
| Rule-based | ~85% | Very Fast | Lexical lookup + morphology + context |
| HMM (Trigram) | ~95% | Fast | Viterbi decoding with add-k smoothing |
| Neural (BiLSTM-CRF) | 97-98% | Moderate | Deep learning with contextual embeddings |
1. Rule-Based Tagging
Algorithm: Sequential pattern matching with three-tier lookup
Tagging Strategy
Lexical Lookup: Closed-class words (determiners, pronouns, prepositions, etc.)
- 450+ words in lookup tables
- Example: "the" → :det, "in" → :adp, "and" → :cconj
Morphological Analysis: Suffix-based tagging for open-class words
Nouns: -tion, -sion, -ment, -ness, -ity, -ism
Verbs: -ing, -ed, -s/-es (3rd person singular)
Adjectives: -ful, -less, -ous, -ive, -able, -ible
Adverbs: -ly
Contextual Disambiguation: Local context rules
- Word after determiner → likely noun
- Word after preposition → likely noun
- Word before noun → likely adjective
- Capitalized words → proper nouns
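A minimal sketch of such context rules (module and function names here are illustrative, not Nasty's internals):

```elixir
defmodule ContextRules do
  # Refine the tag of an unknown word using the previous tag
  # and surface form, mirroring the rules listed above.
  def refine(:unknown, prev_tag, word) do
    cond do
      prev_tag in [:det, :adp] -> :noun          # after determiner/preposition → noun
      String.match?(word, ~r/^[A-Z]/) -> :propn  # capitalized → proper noun
      true -> :noun                              # default open-class guess
    end
  end

  # Words already tagged by lexical or morphological lookup pass through.
  def refine(tag, _prev_tag, _word), do: tag
end
```

For example, `ContextRules.refine(:unknown, :det, "cat")` yields `:noun`.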
Third-Person Singular Verb Detection
Conservative approach to avoid mistagging plural nouns as verbs:
# "walks" → :verb (stem "walk" in common verb list)
# "books" → :noun (not a verb stem)
# "stations" → :noun (ends with -tions, noun suffix)
Checks:
- Exclude capitalized words (proper nouns)
- Exclude words with clear noun suffixes (-tions, -ments, etc.)
- Verify stem is in common verb list (140+ verbs)
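The conservative check can be sketched as follows (the verb list and suffix set are tiny illustrations; Nasty's actual tables are much larger):

```elixir
defmodule ThirdPersonCheck do
  # Toy stand-ins for Nasty's 140+ verb list and noun-suffix table.
  @common_verbs MapSet.new(~w(walk run eat see know sit make))
  @noun_suffixes ~w(tions ments nesses ities)

  # True only when the -s form is safely a 3rd-person-singular verb.
  def verb?(word) do
    not String.match?(word, ~r/^[A-Z]/) and
      not Enum.any?(@noun_suffixes, &String.ends_with?(word, &1)) and
      String.ends_with?(word, "s") and
      MapSet.member?(@common_verbs, String.trim_trailing(word, "s"))
  end
end
```

So `verb?("walks")` is true, while `verb?("books")` and `verb?("stations")` are false.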
2. HMM-Based Tagging
Algorithm: Viterbi decoding with trigram Hidden Markov Model
Model Components
Emission Probabilities: P(word|tag)
- Learned from tagged training data
- Smoothing for unknown words: add-k smoothing (k=0.001)
Transition Probabilities: P(tag₃|tag₁, tag₂)
- Trigram model for better context
- Special START markers for sentence boundaries
- Add-k smoothing for unseen trigrams
Initial Probabilities: P(tag) at sentence start
- Distribution of first tags in training sentences
Training Process
training_data = [
{["The", "cat", "sat"], [:det, :noun, :verb]},
...
]
model = HMMTagger.new()
{:ok, trained} = HMMTagger.train(model, training_data, [])
Counts:
- Emission counts: {word, tag} pairs
- Transition counts: {tag1, tag2} → tag3 trigrams
- Initial counts: first tag in each sequence
Normalization:
P(word|tag) = (count(word, tag) + k) / (count(tag) + k * vocab_size)
P(tag3|tag1,tag2) = (count(tag1,tag2,tag3) + k) / (count(tag1,tag2) + k * num_tags)
Viterbi Decoding
Dynamic programming algorithm to find most likely tag sequence:
score[t][tag] = max over prev_tags of:
score[t-1][prev_tag] +
log P(tag|prev_prev_tag, prev_tag) +
log P(word_t|tag)
Steps:
- Initialization: Score each tag for first word
- Forward Pass: Compute best score for each (position, tag) pair
- Backpointers: Track best previous tag for reconstruction
- Backtracking: Reconstruct best path from end to start
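The four steps above can be sketched in Elixir. This toy version uses a bigram model for brevity (the real tagger is trigram), works in log space, and backs unseen events off to a small constant; the data shapes are assumptions for illustration:

```elixir
defmodule ViterbiSketch do
  # words: ["the", "cat"]; tags: [:det, :noun]
  # emit: %{{word, tag} => prob}; trans: %{{prev_tag, tag} => prob}
  # init: %{tag => prob}; unseen events fall back to 1.0e-6.
  def decode(words, tags, emit, trans, init) do
    lp = fn map, key -> :math.log(Map.get(map, key, 1.0e-6)) end

    # Initialization: score each tag for the first word.
    first =
      Map.new(tags, fn t ->
        {t, {lp.(init, t) + lp.(emit, {hd(words), t}), nil}}
      end)

    # Forward pass: best score and backpointer per (position, tag).
    cols =
      Enum.reduce(tl(words), [first], fn word, [prev | _] = acc ->
        col =
          Map.new(tags, fn t ->
            {score, back} =
              tags
              |> Enum.map(fn p -> {elem(prev[p], 0) + lp.(trans, {p, t}), p} end)
              |> Enum.max()

            {t, {score + lp.(emit, {word, t}), back}}
          end)

        [col | acc]
      end)

    # Backtracking: reconstruct the best path from the end.
    [last | rest] = cols
    {best, {_score, back}} = Enum.max_by(last, fn {_t, {s, _b}} -> s end)
    backtrack(rest, back, [best])
  end

  defp backtrack(_cols, nil, path), do: path

  defp backtrack([col | rest], tag, path) do
    {_score, back} = col[tag]
    backtrack(rest, back, [tag | path])
  end
end
```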
3. Neural Tagging (BiLSTM-CRF)
Algorithm: Bidirectional LSTM with Conditional Random Field layer
Module: Nasty.Statistics.POSTagging.NeuralTagger
Architecture
flowchart TD
A["Input: Word IDs [batch_size, seq_len]"]
B["Word Embeddings [batch_size, seq_len, embedding_dim]"]
C["BiLSTM Layers (×2) [batch_size, seq_len, hidden_size * 2]"]
D["Linear Projection [batch_size, seq_len, num_tags]"]
E["CRF Layer (optional) [batch_size, seq_len, num_tags]"]
F["Output: Tag IDs [batch_size, seq_len]"]
A --> B
B --> C
C --> D
D --> E
E --> F
Key Components
Word Embeddings: 300-dimensional learned representations
- Vocabulary built from training data (min frequency = 2)
- Unknown words mapped to special UNK token
Bidirectional LSTM: 2 layers, 256 hidden units each
- Forward LSTM: left-to-right context
- Backward LSTM: right-to-left context
- Concatenated outputs: 512 dimensions
CRF Layer: Learns tag transition constraints
- Enforces valid tag sequences (e.g., DET → NOUN more likely than DET → VERB)
- Joint decoding over entire sequence
Dropout: 0.3 rate for regularization
Training
tagger = NeuralTagger.new(vocab_size: 10000, num_tags: 17)
training_data = [{["The", "cat"], [:det, :noun]}, ...]
{:ok, trained} = NeuralTagger.train(tagger, training_data,
epochs: 10,
batch_size: 32,
learning_rate: 0.001,
validation_split: 0.1
)
Training features:
- Adam optimizer (adaptive learning rate)
- Cross-entropy loss (or CRF loss if using CRF layer)
- Early stopping with patience=3
- Validation set monitoring (10% split)
Inference
{:ok, tags} = NeuralTagger.predict(trained, ["The", "cat", "sat"], [])
# => {:ok, [:det, :noun, :verb]}
Steps:
- Convert words to IDs using vocabulary
- Pad sequences to batch size
- Run through BiLSTM-CRF model
- Argmax over tag dimension (or Viterbi if using CRF)
- Convert tag IDs back to atoms
Model Selection
Use :model option in POSTagger.tag_pos/2:
# Rule-based (fast, ~85% accuracy)
{:ok, tokens} = POSTagger.tag_pos(tokens, model: :rule_based)
# HMM (fast, ~95% accuracy)
{:ok, tokens} = POSTagger.tag_pos(tokens, model: :hmm)
# Neural (moderate, 97-98% accuracy)
{:ok, tokens} = POSTagger.tag_pos(tokens, model: :neural)
# Ensemble: HMM + rule-based fallback for punctuation/numbers
{:ok, tokens} = POSTagger.tag_pos(tokens, model: :ensemble)
# Neural ensemble: Neural + rule-based fallback
{:ok, tokens} = POSTagger.tag_pos(tokens, model: :neural_ensemble)
Morphological Analysis
Algorithm: Dictionary + Rule-Based Lemmatization
Module: Nasty.Language.English.Morphology
Approach: Two-tier lemmatization with irregular form lookup followed by rule-based suffix removal.
Lemmatization Process
1. Irregular Form Lookup
Check dictionaries for common irregular forms:
Verbs (80+ irregular verbs):
"went" → "go", "was" → "be", "ate" → "eat", "ran" → "run"
Nouns (12 irregular nouns):
"children" → "child", "men" → "man", "mice" → "mouse"
Adjectives (12 irregular comparatives/superlatives):
"better" → "good", "best" → "good", "worse" → "bad"
2. Rule-Based Suffix Removal
If no irregular form found, apply POS-specific rules:
Verbs:
-ing → stem (handling doubled consonants)
"running" → "run" (remove doubled 'n')
"making" → "make"
-ed → stem (handling doubled consonants, silent e)
"stopped" → "stop" (remove doubled 'p')
"liked" → "like" (restore silent 'e')
-s → base form (3rd person singular)
"walks" → "walk"
Nouns:
-ies → -y (flies → fly)
-es → base (if stem ends in s/x/z/ch/sh)
"boxes" → "box", "dishes" → "dish"
-s → base (cats → cat)
Adjectives:
-est → base (superlative)
"fastest" → "fast" (handle doubled consonants)
-er → base (comparative)
"faster" → "fast"
Morphological Feature Extraction
Verb Features
%{
tense: :present | :past,
aspect: :progressive, # for -ing forms
person: 3, # for 3rd person singular
number: :singular
}
Examples:
- "running" → %{tense: :present, aspect: :progressive}
- "walked" → %{tense: :past}
- "walks" → %{tense: :present, person: 3, number: :singular}
Noun Features
%{number: :singular | :plural}
Examples:
- "cat" → %{number: :singular}
- "cats" → %{number: :plural}
Adjective Features
%{degree: :positive | :comparative | :superlative}
Examples:
- "fast" → %{degree: :positive}
- "faster" → %{degree: :comparative}
- "fastest" → %{degree: :superlative}
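The two-tier lemmatization described above can be sketched as follows. The tables are tiny illustrations, and the silent-e restoration and the s/x/z/ch/sh -es rule are omitted for brevity; this is not Nasty's actual implementation:

```elixir
defmodule LemmaSketch do
  # Tier 1: irregular-form lookup (toy table).
  @irregular %{"went" => "go", "children" => "child", "better" => "good"}

  def lemmatize(word, pos) do
    Map.get(@irregular, word) || strip_suffix(word, pos)
  end

  # Tier 2: POS-specific suffix removal.
  defp strip_suffix(word, :verb) do
    cond do
      String.ends_with?(word, "ing") -> word |> chop(3) |> undouble()
      String.ends_with?(word, "ed") -> word |> chop(2) |> undouble()
      String.ends_with?(word, "s") -> chop(word, 1)
      true -> word
    end
  end

  defp strip_suffix(word, :noun) do
    cond do
      String.ends_with?(word, "ies") -> chop(word, 3) <> "y"
      String.ends_with?(word, "s") -> chop(word, 1)
      true -> word
    end
  end

  defp strip_suffix(word, _pos), do: word

  defp chop(word, n), do: String.slice(word, 0, String.length(word) - n)

  # "runn" → "run": drop a doubled final consonant.
  defp undouble(stem) do
    case stem |> String.to_charlist() |> Enum.reverse() do
      [c, c | _] -> chop(stem, 1)
      _ -> stem
    end
  end
end
```

For example, `LemmaSketch.lemmatize("running", :verb)` returns `"run"` and `LemmaSketch.lemmatize("flies", :noun)` returns `"fly"`.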
Example
{:ok, tokens} = Tokenizer.tokenize("running cats")
{:ok, tagged} = POSTagger.tag_pos(tokens)
{:ok, analyzed} = Morphology.analyze(tagged)
# => [
# %Token{text: "running", pos_tag: :verb, lemma: "run",
# morphology: %{tense: :present, aspect: :progressive}},
# %Token{text: "cats", pos_tag: :noun, lemma: "cat",
# morphology: %{number: :plural}}
# ]
Phrase Parsing
Algorithm: Bottom-Up Pattern Matching with Context-Free Grammar
Module: Nasty.Language.English.PhraseParser
Approach: Greedy longest-match, left-to-right phrase construction using simplified CFG rules.
Grammar Rules
NP → Det? Adj* (Noun | PropN | Pron) (PP | RelClause)*
VP → Aux* Verb (NP)? (PP | AdvP)*
PP → Prep NP
AdjP → Adv? Adj
AdvP → Adv
RC → RelPron/RelAdv Clause
Phrase Types
1. Noun Phrase (NP)
Components:
- Determiner (optional): the, a, my, some
- Modifiers (0+): adjectives, adjectival phrases
- Head (required): noun, proper noun, or pronoun
- Post-modifiers (0+): prepositional phrases, relative clauses
Examples:
"the cat" → [det: "the", head: "cat"]
"the big cat" → [det: "the", modifiers: ["big"], head: "cat"]
"the cat on the mat" → [det: "the", head: "cat",
post_modifiers: [PP("on", NP("the mat"))]]
Special Cases:
- Pronouns as NPs: "I", "he", "they" can stand alone
- Multi-word proper nouns: "New York" → consecutive PROPNs merged as modifiers
2. Verb Phrase (VP)
Components:
- Auxiliaries (0+): is, have, will, can
- Head (required): main verb
- Complements (0+): object NP, PPs, adverbs
Examples:
"sat" → [head: "sat"]
"is running" → [auxiliaries: ["is"], head: "running"]
"saw the cat" → [head: "saw", complements: [NP("the cat")]]
"sat on the mat" → [head: "sat", complements: [PP("on", NP("the mat"))]]
Special Case - Copula Construction: If only auxiliaries are found (no main verb), treat the last auxiliary as the main verb:
"is happy" → [head: "is", complements: [AdjP("happy")]]
"are engineers" → [head: "are", complements: [NP("engineers")]]
3. Prepositional Phrase (PP)
Structure: Prep + NP
Examples:
"on the mat" → [head: "on", object: NP("the mat")]
"in the house" → [head: "in", object: NP("the house")]
4. Adjectival Phrase (AdjP)
Structure: Adv? + Adj
Examples:
"very big" → [intensifier: "very", head: "big"]
"quite small" → [intensifier: "quite", head: "small"]
5. Adverbial Phrase (AdvP)
Structure: Adv (currently simple single-word adverbs)
Examples:
"quickly" → [head: "quickly"]
"often" → [head: "often"]
6. Relative Clause (RC)
Structure: RelPron/RelAdv + Clause
Relativizers:
- Pronouns: who, whom, whose, which, that
- Adverbs: where, when, why
Examples:
"that sits" → [relativizer: "that", clause: VP("sits")]
"who I know" → [relativizer: "who", clause: [subject: NP("I"), predicate: VP("know")]]
Two Patterns:
- Relativizer as subject: "that sits" → clause has only VP
- Relativizer as object: "that I see" → clause has NP subject + VP
Parsing Process
Each parse_*_phrase function:
- Checks current position in token list
- Attempts to consume tokens matching the pattern
- Recursively parses sub-phrases (e.g., NP within PP)
- Calculates span from first to last consumed token
- Returns {:ok, phrase, next_position} or :error
Greedy Matching: Consumes as many tokens as possible for each phrase (e.g., all consecutive adjectives as modifiers).
Position Tracking: Every phrase includes span covering all constituent tokens.
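The shape of such a parse function can be sketched as follows (module, helpers, and the map-based token shape are illustrative, not PhraseParser's actual code). It consumes an optional determiner, then consecutive adjectives greedily, then a required head:

```elixir
defmodule NPSketch do
  # Returns {:ok, phrase, next_position} or :error.
  def parse_np(tokens, pos) do
    {det, pos} = maybe_take(tokens, pos, :det)
    {adjs, pos} = take_while(tokens, pos, :adj)

    case Enum.at(tokens, pos) do
      %{pos_tag: tag} = head when tag in [:noun, :propn, :pron] ->
        {:ok, %{determiner: det, modifiers: adjs, head: head}, pos + 1}

      _ ->
        :error
    end
  end

  # Consume one token of the given tag, if present.
  defp maybe_take(tokens, pos, tag) do
    case Enum.at(tokens, pos) do
      %{pos_tag: ^tag} = tok -> {tok, pos + 1}
      _ -> {nil, pos}
    end
  end

  # Greedily consume consecutive tokens of the given tag.
  defp take_while(tokens, pos, tag) do
    case Enum.at(tokens, pos) do
      %{pos_tag: ^tag} = tok ->
        {rest, next} = take_while(tokens, pos + 1, tag)
        {[tok | rest], next}

      _ ->
        {[], pos}
    end
  end
end
```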
Example
tokens = [
%Token{text: "the", pos_tag: :det},
%Token{text: "big", pos_tag: :adj},
%Token{text: "cat", pos_tag: :noun},
%Token{text: "on", pos_tag: :adp},
%Token{text: "the", pos_tag: :det},
%Token{text: "mat", pos_tag: :noun}
]
{:ok, np, _pos} = PhraseParser.parse_noun_phrase(tokens, 0)
# => %NounPhrase{
# determiner: "the",
# modifiers: ["big"],
# head: "cat",
# post_modifiers: [
# %PrepositionalPhrase{
# head: "on",
# object: %NounPhrase{determiner: "the", head: "mat"}
# }
# ]
# }
Sentence Parsing
Algorithm: Clause Detection with Coordination and Subordination
Module: Nasty.Language.English.SentenceParser
Approach: Split on sentence boundaries, then parse each sentence into clauses with support for simple, compound, and complex structures.
Sentence Structures
Simple: Single independent clause
"The cat sat."
Compound: Multiple coordinated independent clauses
"The cat sat and the dog ran."
Complex: Independent clause with subordinate clause(s)
"The cat sat because it was tired."
Fragment: Incomplete sentence (e.g., subordinate clause alone)
Sentence Functions
Inferred from punctuation:
- . → :declarative (statement)
- ? → :interrogative (question)
- ! → :exclamative (exclamation)
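This mapping is a one-liner per punctuation mark; a sketch (the `:declarative` fallback for other punctuation is an assumption, not documented behavior):

```elixir
defmodule SentenceFunction do
  # Infer sentence function from the final punctuation token.
  def infer("."), do: :declarative
  def infer("?"), do: :interrogative
  def infer("!"), do: :exclamative
  def infer(_other), do: :declarative  # assumed default
end
```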
Parsing Process
1. Sentence Boundary Detection
Split on sentence-ending punctuation (., !, ?):
split_sentences(tokens)
# Groups tokens into sentence units
2. Clause Parsing
For each sentence group, parse into clause structure:
Grammar:
Sentence → Clause+
Clause → SubordConj? NP? VP
Three Clause Types:
- Independent: Can stand alone as complete sentence
- Subordinate: Begins with a subordinating conjunction (because, if, when, etc.)
- Relative: Part of a relative clause structure (handled in phrase parsing)
3. Coordination Detection
Look for coordinating conjunctions (:cconj):
and, or, but, nor, yet, so, for
If found, split and parse both sides:
"The cat sat and the dog ran"
# Split at "and"
# Parse: Clause1 ("The cat sat") + Clause2 ("the dog ran")
# Result: [Clause1, Clause2]
4. Subordination Detection
Check for subordinating conjunction (:sconj) at start:
after, although, because, before, if, since, when, while, etc.
If found, mark clause as subordinate:
"because it was tired"
# Parse: Clause with subordinator: "because"
# Type: :subordinate
Simple Clause Parsing
Algorithm: Find verb, split at verb to identify subject and predicate.
Steps:
- Find first verb/auxiliary in token sequence
- If verb at position 0: imperative sentence (no subject)
  - Parse VP starting at position 0
  - Subject = nil
- If verb at position > 0: declarative sentence
  - Try to parse NP before verb (subject)
  - Parse VP starting at end of subject (predicate)
- If no subject found: try VP alone (imperative or fragment)
Fallback: If parsing fails, create minimal clause with first verb found.
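The verb-split strategy can be sketched as follows (module and return shapes are illustrative assumptions): find the first verb or auxiliary; the tokens before it become the subject NP and the rest the predicate VP.

```elixir
defmodule ClauseSketch do
  # Split a token list at the first verb/auxiliary.
  def split_at_verb(tokens) do
    case Enum.find_index(tokens, &(&1.pos_tag in [:verb, :aux])) do
      nil -> :error                      # no verb: not a parseable clause
      0 -> {:imperative, [], tokens}     # verb first: no subject
      i -> {:declarative, Enum.take(tokens, i), Enum.drop(tokens, i)}
    end
  end
end
```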
Clause Structure
%Clause{
type: :independent | :subordinate | :relative,
subordinator: Token.t() | nil, # "because", "if", etc.
subject: NounPhrase.t() | nil,
predicate: VerbPhrase.t(),
language: :en,
span: span
}
Sentence Structure
%Sentence{
function: :declarative | :interrogative | :exclamative,
structure: :simple | :compound | :complex | :fragment,
main_clause: Clause.t(),
additional_clauses: [Clause.t()], # for compound sentences
language: :en,
span: span
}
Example
tokens = tokenize_and_tag("The cat sat and the dog ran.")
{:ok, [sentence]} = SentenceParser.parse_sentences(tokens)
# => %Sentence{
# function: :declarative,
# structure: :compound,
# main_clause: %Clause{
# type: :independent,
# subject: NP("The cat"),
# predicate: VP("sat")
# },
# additional_clauses: [
# %Clause{
# type: :independent,
# subject: NP("the dog"),
# predicate: VP("ran")
# }
# ]
# }
Dependency Extraction
Algorithm: Phrase Structure to Universal Dependencies Conversion
Module: Nasty.Language.English.DependencyExtractor
Approach: Traverse phrase structure AST and extract grammatical relations as Universal Dependencies (UD) relations.
Universal Dependencies Relations
Nasty uses the UD relation taxonomy:
Core Arguments:
- nsubj - nominal subject
- obj - direct object
- iobj - indirect object
Non-Core Dependents:
- obl - oblique nominal (prepositional complement to verb)
- advmod - adverbial modifier
- aux - auxiliary verb
Nominal Dependents:
- det - determiner
- amod - adjectival modifier
- nmod - nominal modifier (prepositional complement to noun)
- case - case marking (preposition)
Clausal Dependents:
- acl - adnominal clause (relative clause)
- mark - subordinating marker
Coordination:
- conj - conjunct
- cc - coordinating conjunction
Extraction Process
1. Sentence-Level Extraction
extract(sentence)
# Extracts from main_clause + additional_clauses
2. Clause-Level Extraction
For each clause:
Subject Dependency: nsubj(predicate_head, subject_head)
- Extract head token from subject NP
- Extract head token from predicate VP
- Create dependency relation
Predicate Dependencies: Extract from VP (see below)
Subordinator Dependency (if present):
mark(predicate_head, subordinator)
3. Noun Phrase Dependencies
From NP structure:
Determiner: det(head, determiner)
- "the cat" → det(cat, the)
Adjectival Modifiers: amod(head, modifier)
- "big cat" → amod(cat, big)
Post-modifiers:
- PP: case(pp_object_head, preposition) + nmod(np_head, pp_object_head)
  - "cat on mat" → case(mat, on) + nmod(cat, mat)
- Relative Clause: mark(clause_head, relativizer) + acl(np_head, clause_head)
  - "cat that sits" → mark(sits, that) + acl(cat, sits)
4. Verb Phrase Dependencies
From VP structure:
Auxiliaries: aux(main_verb, auxiliary)
- "is running" → aux(running, is)
Complements:
- Direct Object NP: obj(verb, np_head)
  - "saw cat" → obj(saw, cat)
- PP Complement: case(pp_object, preposition) + obl(verb, pp_object)
  - "sat on mat" → case(mat, on) + obl(sat, mat)
- Adverb: advmod(verb, adverb)
  - "ran quickly" → advmod(ran, quickly)
5. Prepositional Phrase Dependencies
From PP structure:
- Case Marking: case(pp_object_head, preposition)
- Oblique/Nominal Modifier:
  - If governor is verb: obl(governor, pp_object_head)
  - If governor is noun: nmod(governor, pp_object_head)
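The traversal pattern for these rules can be sketched as follows (the map-based clause/phrase shapes are simplified assumptions, not DependencyExtractor's actual structs); it emits det and amod relations from the subject NP plus the nsubj relation:

```elixir
defmodule DepSketch do
  # Extract det/amod/nsubj relations from a simplified clause.
  def extract(%{subject: np, predicate: vp}) do
    verb = vp.head
    np_deps(np) ++ [%{relation: :nsubj, head: verb, dependent: np.head}]
  end

  # NP-internal relations: determiner and adjectival modifiers.
  defp np_deps(np) do
    det =
      if np.determiner,
        do: [%{relation: :det, head: np.head, dependent: np.determiner}],
        else: []

    amods =
      Enum.map(np.modifiers, &%{relation: :amod, head: np.head, dependent: &1})

    det ++ amods
  end
end
```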
Dependency Structure
%Dependency{
relation: :nsubj | :obj | :det | ...,
head: Token.t(), # Governor token
dependent: Token.t(), # Dependent token
span: span
}
Example
# Input: "The cat sat on the mat."
sentence = parse("The cat sat on the mat.")
dependencies = DependencyExtractor.extract(sentence)
# => [
# %Dependency{relation: :det, head: "cat", dependent: "the"},
# %Dependency{relation: :nsubj, head: "sat", dependent: "cat"},
# %Dependency{relation: :case, head: "mat", dependent: "on"},
# %Dependency{relation: :det, head: "mat", dependent: "the"},
# %Dependency{relation: :obl, head: "sat", dependent: "mat"}
# ]
Visualization
Dependencies can be visualized as a directed graph:
graph TD
Root["sat (ROOT)"]
Cat[cat]
Mat[mat]
The1[the]
On[on]
The2[the]
Root -->|nsubj| Cat
Root -->|obl| Mat
Cat -->|det| The1
Mat -->|case| On
Mat -->|det| The2
Integration Example
Complete pipeline from text to dependencies:
alias Nasty.Language.English.{
Tokenizer, POSTagger, Morphology,
PhraseParser, SentenceParser, DependencyExtractor
}
# Input text
text = "The big cat sat on the mat."
# Step 1: Tokenization
{:ok, tokens} = Tokenizer.tokenize(text)
# => [Token("The"), Token("big"), Token("cat"), ...]
# Step 2: POS Tagging (choose model)
{:ok, tagged} = POSTagger.tag_pos(tokens, model: :neural)
# => [Token("The", :det), Token("big", :adj), Token("cat", :noun), ...]
# Step 3: Morphological Analysis
{:ok, analyzed} = Morphology.analyze(tagged)
# => [Token("The", :det, lemma: "the"), ...]
# Step 4: Sentence Parsing (includes phrase parsing internally)
{:ok, sentences} = SentenceParser.parse_sentences(analyzed)
# => [Sentence(...)]
# Step 5: Dependency Extraction
sentence = hd(sentences)
dependencies = DependencyExtractor.extract(sentence)
# => [Dependency(:det, "cat", "The"), ...]
# Result: Complete AST with dependencies
sentence
# => %Sentence{
# main_clause: %Clause{
# subject: %NounPhrase{
# determiner: Token("The"),
# modifiers: [Token("big")],
# head: Token("cat")
# },
# predicate: %VerbPhrase{
# head: Token("sat"),
# complements: [
# %PrepositionalPhrase{
# head: Token("on"),
# object: %NounPhrase{...}
# }
# ]
# }
# }
# }
Performance Considerations
Model Selection
For Production:
- Use neural models for highest accuracy
- Cache loaded models in memory
- Batch sentences for GPU acceleration (if available)
For Development/Testing:
- Use rule-based for fastest iteration
- HMM for good balance of speed and accuracy
Optimization Tips
- Batch Processing: Process multiple sentences together
- Model Caching: Load models once, reuse across requests
- Lazy Loading: Only load neural models when needed
- Parallel Processing: Use Task.async_stream for multiple sentences
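A sketch of parallel sentence processing with Task.async_stream; `process` here is a placeholder for the full Nasty pipeline (tokenize → tag → parse):

```elixir
# Placeholder for the real per-sentence pipeline function.
process = fn text -> String.upcase(text) end

results =
  ["The cat sat.", "The dog ran."]
  |> Task.async_stream(process, max_concurrency: System.schedulers_online())
  |> Enum.map(fn {:ok, result} -> result end)

# Task.async_stream preserves input order by default, so `results`
# lines up with the input sentences.
```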
Accuracy Benchmarks
Tested on Universal Dependencies English-EWT test set:
| Component | Accuracy |
|---|---|
| Tokenization | 99.9% |
| Rule-based POS | 85% |
| HMM POS | 95% |
| Neural POS | 97-98% |
| Phrase Parsing | 87% (F1) |
| Dependency Extraction | 82% (UAS) |
Further Reading
- Universal Dependencies - UD relations and guidelines
- Penn Treebank POS Tags
- NimbleParsec Documentation
- Axon Neural Networks
- See docs/ARCHITECTURE.md for overall system design
- See docs/NEURAL_MODELS.md for neural network details