Nasty User Guide
A comprehensive guide to using the Nasty NLP library for natural language processing in Elixir.
Table of Contents
- Introduction
- Installation
- Quick Start
- Core Concepts
- Basic Text Processing
- Phrase and Sentence Parsing
- Semantic Analysis
- Advanced NLP Operations
- Code Interoperability
- Translation
- AST Manipulation
- Visualization and Debugging
- Statistical & Neural Models
- Performance Tips
- Troubleshooting
Introduction
Nasty (Natural Abstract Syntax Treey) is a comprehensive NLP library that treats natural language with the same rigor as programming languages. It provides a complete grammatical Abstract Syntax Tree (AST) for English, enabling sophisticated text analysis and manipulation.
Key Features
- Complete NLP Pipeline: From tokenization to summarization
- Grammar-First Design: Linguistically rigorous AST structure
- Statistical Models: HMM POS tagger with 95% accuracy
- Bidirectional Code Conversion: Natural language ↔ Elixir code
- AST Utilities: Traversal, querying, validation, and transformation
- Visualization: Export to DOT/Graphviz and JSON formats
Installation
Add nasty to your dependencies in mix.exs:
def deps do
  [
    {:nasty, "~> 0.1.0"}
  ]
end
Then run:
mix deps.get
Quick Start
Here's a simple example to get started:
alias Nasty.Language.English
# Parse a sentence
text = "The quick brown fox jumps over the lazy dog."
{:ok, tokens} = English.tokenize(text)
{:ok, tagged} = English.tag_pos(tokens)
{:ok, document} = English.parse(tagged)
# Extract information
alias Nasty.Utils.Query
# Count tokens
token_count = Query.count(document, :token)
# => 9
# Find all nouns
nouns = Query.find_by_pos(document, :noun)
# => [%Token{text: "fox", ...}, %Token{text: "dog", ...}]
# Render back to text
alias Nasty.Rendering.Text
{:ok, text} = Text.render(document)
# => "The quick brown fox jumps over the lazy dog."Core Concepts
AST Structure
Nasty represents text as a hierarchical tree structure:
Document
└── Paragraph
    └── Sentence
        └── Clause
            ├── Subject (NounPhrase)
            │   ├── Determiner (Token)
            │   ├── Modifiers (Tokens)
            │   └── Head (Token)
            └── Predicate (VerbPhrase)
                ├── Auxiliaries (Tokens)
                ├── Head (Token)
                └── Complements (NounPhrases, etc.)
Universal Dependencies
All POS tags and dependency relations follow the Universal Dependencies standard:
POS Tags: noun, verb, adj, adv, det, adp, aux, cconj, sconj, pron, propn, num, punct
Dependencies: nsubj, obj, iobj, amod, advmod, det, case, acl, advcl, conj, cc
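Because tags are plain atoms, you can use them directly in queries and pattern matches. A minimal sketch, assuming the parsed document from the Quick Start above:
alias Nasty.Utils.Query
# Collect content-bearing categories by their UD tag atoms
content =
  Enum.flat_map([:noun, :propn, :verb, :adj, :adv], fn tag ->
    Query.find_by_pos(document, tag)
  end)
Enum.each(content, fn token ->
  IO.puts("#{token.text}: #{token.pos_tag}")
end)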
Language Markers
Every AST node carries a language identifier (:en for English), enabling future multilingual support.
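For example (a sketch: the exact field name is an assumption here, so check the struct definitions):
# Assumed field access: each node struct is presumed to expose a `language` key
token = document |> Nasty.Utils.Query.find_by_pos(:noun) |> List.first()
IO.inspect(token.language)
# => :en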
Basic Text Processing
Tokenization
Split text into tokens (words and punctuation):
alias Nasty.Language.English
text = "Hello, world! How are you?"
{:ok, tokens} = English.tokenize(text)
# Tokens include position information
Enum.each(tokens, fn token ->
  IO.puts("#{token.text} at #{inspect(token.span)}")
end)
POS Tagging
Assign grammatical categories to tokens:
# Rule-based tagging (fast, ~85% accuracy)
{:ok, tagged} = English.tag_pos(tokens)
# Statistical tagging (higher accuracy, ~95%)
{:ok, tagged} = English.tag_pos(tokens, model: :hmm)
# Neural tagging (best accuracy, 97-98%)
{:ok, tagged} = English.tag_pos(tokens, model: :neural)
# Ensemble (combines all models)
{:ok, tagged} = English.tag_pos(tokens, model: :ensemble)
# Inspect tags
Enum.each(tagged, fn token ->
  IO.puts("#{token.text}: #{token.pos_tag}")
end)
Morphological Analysis
Extract lemmas and morphological features:
alias Nasty.Language.English.Morphology
tagged
|> Enum.map(fn token ->
  lemma = Morphology.lemmatize(token.text, token.pos_tag)
  features = Morphology.extract_features(token.text, token.pos_tag)
  {token.text, lemma, features}
end)
|> Enum.each(fn {text, lemma, features} ->
  IO.puts("#{text} -> #{lemma} (#{inspect(features)})")
end)
Phrase and Sentence Parsing
Building the AST
Parse tokens into a complete AST:
text = "The cat sat on the mat."
{:ok, tokens} = English.tokenize(text)
{:ok, tagged} = English.tag_pos(tokens)
{:ok, document} = English.parse(tagged)
# Access structure
paragraph = List.first(document.paragraphs)
sentence = List.first(paragraph.sentences)
IO.puts("Sentence type: #{sentence.function}, #{sentence.structure}")Phrase Structure
Extract and analyze phrases:
alias Nasty.Utils.Query
# Find all noun phrases
noun_phrases = Query.find_all(document, :noun_phrase)
Enum.each(noun_phrases, fn np ->
  det = if np.determiner, do: np.determiner.text, else: ""
  mods = np.modifiers |> Enum.map(& &1.text) |> Enum.join(" ")
  head = np.head.text
  IO.puts("NP: #{det} #{mods} #{head}")
end)
# Find verb phrases
verb_phrases = Query.find_all(document, :verb_phrase)
Enum.each(verb_phrases, fn vp ->
  aux = vp.auxiliaries |> Enum.map(& &1.text) |> Enum.join(" ")
  verb = vp.head.text
  IO.puts("VP: #{aux} #{verb}")
end)
Sentence Structure Analysis
Analyze sentence complexity:
document.paragraphs
|> Enum.flat_map(& &1.sentences)
|> Enum.each(fn sentence ->
  IO.puts("Function: #{sentence.function}")
  IO.puts("Structure: #{sentence.structure}")
  IO.puts("Clauses: #{1 + length(sentence.additional_clauses)}")
  IO.puts("")
end)
Dependency Relations
Extract grammatical dependencies:
alias Nasty.Language.English.DependencyExtractor
sentences = document.paragraphs |> Enum.flat_map(& &1.sentences)
Enum.each(sentences, fn sentence ->
  deps = DependencyExtractor.extract(sentence)
  Enum.each(deps, fn dep ->
    IO.puts("#{dep.head.text} --#{dep.relation}--> #{dep.dependent.text}")
  end)
end)
Semantic Analysis
Named Entity Recognition
Extract and classify named entities:
alias Nasty.Language.English.EntityRecognizer
text = "John Smith works at Google in New York."
{:ok, tokens} = English.tokenize(text)
{:ok, tagged} = English.tag_pos(tokens)
entities = EntityRecognizer.recognize(tagged)
Enum.each(entities, fn entity ->
  IO.puts("#{entity.text}: #{entity.type} (confidence: #{entity.confidence})")
end)
# => John Smith: PERSON (confidence: 0.8)
#    Google: ORG (confidence: 0.8)
#    New York: GPE (confidence: 0.7)
Semantic Role Labeling
Identify who did what to whom:
{:ok, document} = English.parse(tagged, semantic_roles: true)
document.semantic_frames
|> Enum.each(fn frame ->
  IO.puts("Predicate: #{frame.predicate}")
  Enum.each(frame.roles, fn role ->
    IO.puts("  #{role.type}: #{role.text}")
  end)
end)
# => Predicate: works
#    agent: John Smith
#    location: at Google
Coreference Resolution
Link mentions across sentences:
text = """
John Smith is a software engineer. He works at Google.
The company is based in Mountain View.
"""
{:ok, tokens} = English.tokenize(text)
{:ok, tagged} = English.tag_pos(tokens)
{:ok, document} = English.parse(tagged, coreference: true)
document.coref_chains
|> Enum.each(fn chain ->
  IO.puts("Representative: #{chain.representative.text}")
  IO.puts("Mentions: #{chain.mentions |> Enum.map(& &1.text) |> Enum.join(", ")}")
end)
# => Representative: John Smith
#    Mentions: John Smith, He
Advanced NLP Operations
Text Summarization
Extract key sentences from documents:
alias Nasty.Language.English
alias Nasty.Rendering.Text
long_text = """
[Your long document here...]
"""
{:ok, tokens} = English.tokenize(long_text)
{:ok, tagged} = English.tag_pos(tokens)
{:ok, document} = English.parse(tagged)
# Extractive summarization - returns list of Sentence structs
summary_sentences = English.summarize(document, ratio: 0.3)
IO.puts("30% summary (#{length(summary_sentences)} sentences):")
# Render summary sentences to text
Enum.each(summary_sentences, fn sentence ->
  {:ok, text} = Text.render(sentence)
  IO.puts(text)
end)
# Fixed sentence count
summary_sentences = English.summarize(document, max_sentences: 3)
# MMR for reduced redundancy
summary_sentences = English.summarize(document,
  max_sentences: 3,
  method: :mmr,
  mmr_lambda: 0.5
)
Question Answering
Answer questions from documents:
text = """
John Smith is a software engineer at Google.
He graduated from Stanford University in 2010.
Google is headquartered in Mountain View, California.
"""
{:ok, tokens} = English.tokenize(text)
{:ok, tagged} = English.tag_pos(tokens)
{:ok, document} = English.parse(tagged)
# Ask questions
questions = [
  "Who works at Google?",
  "Where is Google located?",
  "When did John Smith graduate?",
  "What is John Smith's profession?"
]
Enum.each(questions, fn question ->
  {:ok, answers} = English.answer_question(document, question)
  IO.puts("Q: #{question}")
  Enum.each(answers, fn answer ->
    IO.puts("A: #{answer.text} (confidence: #{answer.confidence})")
  end)
  IO.puts("")
end)
Text Classification
Train and apply classifiers:
alias Nasty.Language.English
# Prepare training data
positive_reviews = [
  "This product is amazing! Highly recommended.",
  "Excellent quality and fast shipping.",
  "Love it! Best purchase ever."
]
negative_reviews = [
  "Terrible product. Waste of money.",
  "Poor quality and slow delivery.",
  "Very disappointed with this purchase."
]
# Parse documents
training_data =
  Enum.map(positive_reviews, fn text ->
    {:ok, tokens} = English.tokenize(text)
    {:ok, tagged} = English.tag_pos(tokens)
    {:ok, doc} = English.parse(tagged)
    {doc, :positive}
  end) ++
    Enum.map(negative_reviews, fn text ->
      {:ok, tokens} = English.tokenize(text)
      {:ok, tagged} = English.tag_pos(tokens)
      {:ok, doc} = English.parse(tagged)
      {doc, :negative}
    end)
# Train classifier
model = English.train_classifier(training_data, features: [:bow, :lexical])
# Classify new text
test_text = "Great product, very satisfied!"
{:ok, tokens} = English.tokenize(test_text)
{:ok, tagged} = English.tag_pos(tokens)
{:ok, doc} = English.parse(tagged)
{:ok, predictions} = English.classify(doc, model)
IO.inspect(predictions)
Information Extraction
Extract structured information:
text = """
Apple Inc. acquired Beats Electronics for $3 billion in 2014.
The company is headquartered in Cupertino, California.
Tim Cook serves as CEO of Apple.
"""
{:ok, tokens} = English.tokenize(text)
{:ok, tagged} = English.tag_pos(tokens)
{:ok, document} = English.parse(tagged)
# Extract relations
{:ok, relations} = English.extract_relations(document)
Enum.each(relations, fn rel ->
  IO.puts("#{rel.subject.text} --#{rel.type}--> #{rel.object.text}")
end)
# Extract events
{:ok, events} = English.extract_events(document)
Enum.each(events, fn event ->
  IO.puts("Event: #{event.type}")
  IO.puts("Trigger: #{event.trigger}")
  IO.puts("Participants: #{inspect(event.participants)}")
end)
# Template-based extraction
alias Nasty.Language.English.TemplateExtractor
templates = [
  TemplateExtractor.employment_template(),
  TemplateExtractor.acquisition_template()
]
{:ok, results} = English.extract_templates(document, templates)
Enum.each(results, fn result ->
  IO.puts("Template: #{result.template}")
  IO.puts("Slots: #{inspect(result.slots)}")
end)
Code Interoperability
Natural Language to Code
Convert natural language commands to Elixir code:
alias Nasty.Language.English
# Simple operations
{:ok, code} = English.to_code("Sort the list")
IO.puts(code)
# => "Enum.sort(list)"
{:ok, code} = English.to_code("Filter users where age is greater than 18")
IO.puts(code)
# => "Enum.filter(users, fn item -> item > 18 end)"
{:ok, code} = English.to_code("Map the numbers to double each one")
IO.puts(code)
# => "Enum.map(numbers, fn item -> item * 2 end)"
# Get the AST
{:ok, ast} = English.to_code_ast("Sort the numbers")
IO.inspect(ast)
# Recognize intent without generating code
{:ok, intent} = English.recognize_intent("Filter the list")
IO.inspect(intent)
Code to Natural Language
Explain code in natural language:
alias Nasty.Language.English
# Explain code strings
{:ok, explanation} = English.explain_code("Enum.sort(numbers)")
IO.puts(explanation)
# => "sort numbers"
{:ok, explanation} = English.explain_code("""
list
|> Enum.map(&(&1 * 2))
|> Enum.filter(&(&1 > 10))
|> Enum.sum()
""")
IO.puts(explanation)
# => "map list to each element times 2, then filter list where item is greater than 10, then sum list"
# Explain from AST
code_ast = quote do: x = a + b
{:ok, doc} = English.explain_code_to_document(code_ast)
{:ok, text} = Nasty.Rendering.Text.render(doc)
IO.puts(text)
Translation
AST-Based Translation
Translate documents between languages while preserving grammatical structure:
alias Nasty.Language.{English, Spanish}
alias Nasty.Translation.Translator
# English to Spanish
text_en = "The quick cat runs in the garden."
{:ok, tokens_en} = English.tokenize(text_en)
{:ok, tagged_en} = English.tag_pos(tokens_en)
{:ok, doc_en} = English.parse(tagged_en)
# Translate document
{:ok, doc_es} = Translator.translate_document(doc_en, :es)
# Render Spanish text
alias Nasty.Rendering.Text
{:ok, text_es} = Text.render(doc_es)
IO.puts(text_es)
# => "El gato rápido corre en el jardín."
# Or translate text directly
{:ok, text_es} = Translator.translate("The quick cat runs.", :en, :es)
IO.puts(text_es)
# => "El gato rápido corre."
# Spanish to English
text_es = "La casa grande está en la ciudad."
{:ok, tokens_es} = Spanish.tokenize(text_es)
{:ok, tagged_es} = Spanish.tag_pos(tokens_es)
{:ok, doc_es} = Spanish.parse(tagged_es)
{:ok, doc_en} = Translator.translate_document(doc_es, :en)
{:ok, text_en} = Text.render(doc_en)
IO.puts(text_en)
# => "The big house is in the city."How Translation Works
The translation system operates on AST structures, not raw text:
- Parse source text to AST
- Transform AST nodes to target language structure
- Translate tokens using lemma-to-lemma mapping with POS tags
- Apply morphological agreement (gender, number, person)
- Apply word order rules (language-specific)
- Render target AST to text
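These steps map onto the public calls used throughout this section; a condensed sketch wiring them together (the middle steps all happen inside translate_document):
alias Nasty.Language.English
alias Nasty.Translation.Translator
alias Nasty.Rendering.Text
translate_en_to_es = fn text ->
  with {:ok, tokens} <- English.tokenize(text),
       {:ok, tagged} <- English.tag_pos(tokens),
       {:ok, doc_en} <- English.parse(tagged),
       # Node transforms, lexicon lookup, agreement, and word order:
       {:ok, doc_es} <- Translator.translate_document(doc_en, :es) do
    # Render the target AST back to text
    Text.render(doc_es)
  end
end
translate_en_to_es.("The quick cat runs.")
# => {:ok, "El gato rápido corre."}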
Morphological Agreement
The system automatically handles agreement:
alias Nasty.Translation.Translator
alias Nasty.Rendering.Text
# English: "the cats"
# Spanish: "los gatos" (masculine plural determiner + noun)
{:ok, doc_en} = Nasty.parse("The cats.", language: :en)
{:ok, doc_es} = Translator.translate_document(doc_en, :es)
{:ok, text_es} = Text.render(doc_es)
# => "Los gatos."
# English: "the big houses"
# Spanish: "las casas grandes" (feminine plural, adjective after noun)
{:ok, doc_en} = Nasty.parse("The big houses.", language: :en)
{:ok, doc_es} = Translator.translate_document(doc_en, :es)
{:ok, text_es} = Text.render(doc_es)
# => "Las casas grandes."Word Order Transformations
Language-specific word order is automatically applied:
alias Nasty.Translation.Translator
alias Nasty.Rendering.Text
# English: Adjective before noun
# Spanish: Most adjectives after noun
{:ok, doc_en} = Nasty.parse("The red car.", language: :en)
{:ok, doc_es} = Translator.translate_document(doc_en, :es)
{:ok, text_es} = Text.render(doc_es)
# => "El carro rojo." (car red)
# Some adjectives stay before noun
{:ok, doc_en} = Nasty.parse("The good book.", language: :en)
{:ok, doc_es} = Translator.translate_document(doc_en, :es)
{:ok, text_es} = Text.render(doc_es)
# => "El buen libro." (good stays before)Roundtrip Translation
Translations preserve grammatical structure for roundtrips:
alias Nasty.Translation.Translator
alias Nasty.Rendering.Text
original = "The cat runs quickly."
# English -> Spanish -> English
{:ok, doc_en} = Nasty.parse(original, language: :en)
{:ok, doc_es} = Translator.translate_document(doc_en, :es)
{:ok, doc_en2} = Translator.translate_document(doc_es, :en)
{:ok, result} = Text.render(doc_en2)
IO.puts(original)
IO.puts(result)
# Original: "The cat runs quickly."
# Result: "The cat runs quickly." (or close equivalent)
Supported Language Pairs
Currently supported:
- English ↔ Spanish
- English ↔ Catalan
- Spanish ↔ Catalan (via English; see the sketch below)
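Because the Spanish ↔ Catalan pair pivots through English, one explicit way to express it is two translate_document calls; a sketch (whether the library also performs this pivot automatically is not shown here):
alias Nasty.Language.Spanish
alias Nasty.Translation.Translator
alias Nasty.Rendering.Text
{:ok, tokens} = Spanish.tokenize("La casa grande.")
{:ok, tagged} = Spanish.tag_pos(tokens)
{:ok, doc_es} = Spanish.parse(tagged)
# Pivot: Spanish -> English -> Catalan
{:ok, doc_en} = Translator.translate_document(doc_es, :en)
{:ok, doc_ca} = Translator.translate_document(doc_en, :ca)
{:ok, text_ca} = Text.render(doc_ca)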
Custom Lexicons
Extend lexicons with domain-specific vocabulary:
# Lexicons are in priv/translation/lexicons/
# Format: en_es.exs, es_en.exs, etc.
# Add entries in priv/translation/lexicons/en_es.exs:
%{
  noun: %{
    "widget" => "dispositivo",
    "gadget" => "aparato"
  },
  verb: %{
    "deploy" => "desplegar",
    "compile" => "compilar"
  }
}
Translation Limitations
Current limitations:
- Idiomatic expressions may not translate well
- Complex verb tenses may need manual review
- Cultural context not preserved
- Ambiguous words use first lexicon entry
Best practices:
- Translate sentence by sentence for best results (see the sketch after this list)
- Review translations for idiomatic expressions
- Extend lexicons for domain-specific terms
- Use for technical/formal text rather than creative writing
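A minimal sentence-by-sentence loop, reusing the Translator.translate/3 call shown earlier (the regex splitter is a naive stand-in; a real pipeline would reuse the parser's sentence boundaries):
alias Nasty.Translation.Translator
"The cat sleeps. The dog runs."
|> String.split(~r/(?<=[.!?])\s+/, trim: true)
|> Enum.map(fn sentence ->
  {:ok, translated} = Translator.translate(sentence, :en, :es)
  translated
end)
|> Enum.join(" ")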
AST Manipulation
Traversal
Walk the AST tree:
alias Nasty.Utils.Traversal
# Count all tokens
token_count = Traversal.reduce(document, 0, fn
  %Nasty.AST.Token{}, acc -> acc + 1
  _, acc -> acc
end)
# Collect all verbs
verbs = Traversal.collect(document, fn
  %Nasty.AST.Token{pos_tag: :verb} -> true
  _ -> false
end)
# Find first question
question = Traversal.find(document, fn
  %Nasty.AST.Sentence{function: :interrogative} -> true
  _ -> false
end)
# Transform tree (lowercase all text)
lowercased = Traversal.map(document, fn
  %Nasty.AST.Token{} = token -> %{token | text: String.downcase(token.text)}
  node -> node
end)
# Breadth-first traversal
nodes = Traversal.walk_breadth(document, [], fn node, acc ->
  {:cont, [node | acc]}
end)
Queries
High-level querying API:
alias Nasty.Utils.Query
# Find by type
noun_phrases = Query.find_all(document, :noun_phrase)
sentences = Query.find_all(document, :sentence)
# Find by POS tag
nouns = Query.find_by_pos(document, :noun)
verbs = Query.find_by_pos(document, :verb)
# Find by text pattern
cats = Query.find_by_text(document, "cat")
words_starting_with_s = Query.find_by_text(document, ~r/^s/i)
# Find by lemma
runs = Query.find_by_lemma(document, "run") # Matches "run", "runs", "running"
# Extract entities
all_entities = Query.extract_entities(document)
people = Query.extract_entities(document, type: :PERSON)
organizations = Query.extract_entities(document, type: :ORG)
# Structural queries
subject = Query.find_subject(sentence)
verb = Query.find_main_verb(sentence)
objects = Query.find_objects(sentence)
# Count nodes
token_count = Query.count(document, :token)
sentence_count = Query.count(document, :sentence)
# Content vs function words
content_words = Query.content_words(document)
function_words = Query.function_words(document)
# Custom predicates
long_words = Query.filter(document, fn
  %Nasty.AST.Token{text: text} -> String.length(text) > 7
  _ -> false
end)
Transformations
Modify AST structures:
alias Nasty.Utils.Transform
# Case normalization
lowercased = Transform.normalize_case(document, :lower)
uppercased = Transform.normalize_case(document, :upper)
titled = Transform.normalize_case(document, :title)
# Remove punctuation
no_punct = Transform.remove_punctuation(document)
# Remove stop words
no_stops = Transform.remove_stop_words(document)
# Custom stop words
custom_stops = ["the", "a", "an"]
filtered = Transform.remove_stop_words(document, custom_stops)
# Lemmatize all tokens
lemmatized = Transform.lemmatize(document)
# Replace tokens
masked = Transform.replace_tokens(
  document,
  fn token -> token.pos_tag == :propn end,
  fn token -> %{token | text: "[MASK]"} end
)
# Transformation pipelines
processed = Transform.pipeline(document, [
  &Transform.normalize_case(&1, :lower),
  &Transform.remove_punctuation/1,
  &Transform.remove_stop_words/1,
  &Transform.lemmatize/1
])
# Round-trip testing
{:ok, transformed} = Transform.round_trip_test(document, fn doc ->
  Transform.normalize_case(doc, :lower)
end)
Validation
Ensure AST integrity:
alias Nasty.Utils.Validator
# Validate structure
case Validator.validate(document) do
  {:ok, _doc} -> IO.puts("Valid!")
  {:error, reason} -> IO.puts("Invalid: #{reason}")
end
# Check validity (boolean)
if Validator.valid?(document) do
  IO.puts("Document is valid")
end
# Validate spans
case Validator.validate_spans(document) do
  :ok -> IO.puts("Spans are consistent")
  {:error, reason} -> IO.puts("Span error: #{reason}")
end
# Validate language consistency
case Validator.validate_language(document) do
  :ok -> IO.puts("Language is consistent")
  {:error, reason} -> IO.puts("Language error: #{reason}")
end
# Validate and raise
Validator.validate!(document) # Raises on error
Visualization and Debugging
Pretty Printing
Debug AST structures:
alias Nasty.Rendering.PrettyPrint
# Indented output
IO.puts(PrettyPrint.print(document))
# With colors
IO.puts(PrettyPrint.print(document, color: true))
# Limit depth
IO.puts(PrettyPrint.print(document, max_depth: 3))
# Show spans
IO.puts(PrettyPrint.print(document, show_spans: true))
# Tree-style output
IO.puts(PrettyPrint.tree(document))
# Statistics
IO.puts(PrettyPrint.stats(document))
Graphviz Visualization
Export to DOT format for visual rendering:
alias Nasty.Rendering.Visualization
# Parse tree
dot = Visualization.to_dot(document, type: :parse_tree)
File.write("parse_tree.dot", dot)
# Then: dot -Tpng parse_tree.dot -o parse_tree.png
# Dependency graph
deps_dot = Visualization.to_dot(sentence,
  type: :dependencies,
  rankdir: "LR"
)
File.write("dependencies.dot", deps_dot)
# Entity graph
entity_dot = Visualization.to_dot(document, type: :entities)
File.write("entities.dot", entity_dot)
# Custom options
dot = Visualization.to_dot(document,
  type: :parse_tree,
  rankdir: "TB",
  show_pos_tags: true,
  show_spans: false
)
JSON Export
Export for web visualization:
alias Nasty.Rendering.Visualization
# Export to JSON (for d3.js, etc.)
json = Visualization.to_json(document)
File.write("document.json", json)
# Can be loaded in JavaScript:
#   fetch('document.json')
#     .then(r => r.json())
#     .then(data => visualize(data))
Text Rendering
Convert AST back to text:
alias Nasty.Rendering.Text
# Basic rendering
{:ok, text} = Text.render(document)
# Or use language-specific rendering
alias Nasty.Language.English
{:ok, text} = English.render(document)
# For specific languages
alias Nasty.Language.{Spanish, Catalan}
{:ok, text_es} = Spanish.render(document)
{:ok, text_ca} = Catalan.render(document)
Statistical & Neural Models
Using Pretrained Models
Load and use statistical and neural models:
alias Nasty.Language.English
# Automatic loading (looks in priv/models/)
{:ok, tokens} = English.tokenize(text)
# HMM statistical model (~95% accuracy)
{:ok, tagged} = English.tag_pos(tokens, model: :hmm)
# Neural model (97-98% accuracy)
{:ok, tagged} = English.tag_pos(tokens, model: :neural)
# Ensemble mode (combines neural + HMM + rule-based)
{:ok, tagged} = English.tag_pos(tokens, model: :ensemble)
# PCFG statistical parsing
{:ok, document} = English.parse(tagged, model: :pcfg)
# CRF-based named entity recognition
alias Nasty.Language.English.EntityRecognizer
entities = EntityRecognizer.recognize(tagged, model: :crf)
Training Custom Models
Train on your own data:
# Download Universal Dependencies data
wget https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-4611/ud-treebanks-v2.10.tgz
# Extract
tar -xzf ud-treebanks-v2.10.tgz
# Train HMM POS tagger (fast, 95% accuracy)
mix nasty.train.pos \
  --corpus ud-treebanks-v2.10/UD_English-EWT/en_ewt-ud-train.conllu \
  --test ud-treebanks-v2.10/UD_English-EWT/en_ewt-ud-test.conllu \
  --output priv/models/en/my_hmm_model.model
# Train neural POS tagger (slower, 97-98% accuracy)
mix nasty.train.neural_pos \
  --corpus ud-treebanks-v2.10/UD_English-EWT/en_ewt-ud-train.conllu \
  --output priv/models/en/my_neural_model.axon \
  --epochs 10 \
  --batch-size 32
# Train PCFG parser
mix nasty.train.pcfg \
  --corpus ud-treebanks-v2.10/UD_English-EWT/en_ewt-ud-train.conllu \
  --test ud-treebanks-v2.10/UD_English-EWT/en_ewt-ud-test.conllu \
  --output priv/models/en/my_pcfg.model \
  --smoothing 0.001
# Train CRF for named entity recognition
mix nasty.train.crf \
  --corpus data/ner_train.conllu \
  --test data/ner_test.conllu \
  --output priv/models/en/my_crf_ner.model \
  --task ner \
  --iterations 100
# Evaluate models
mix nasty.eval.pos \
  --model priv/models/en/my_hmm_model.model \
  --test ud-treebanks-v2.10/UD_English-EWT/en_ewt-ud-test.conllu
mix nasty.eval \
  --model priv/models/en/my_pcfg.model \
  --test ud-treebanks-v2.10/UD_English-EWT/en_ewt-ud-test.conllu \
  --type pcfg
mix nasty.eval \
  --model priv/models/en/my_crf_ner.model \
  --test data/ner_test.conllu \
  --type crf \
  --task ner
For detailed training instructions:
- Neural models: TRAINING_NEURAL.md
- PCFG and CRF models: STATISTICAL_MODELS.md
Model Management
# List models
mix nasty.models list
# Inspect model
mix nasty.models inspect priv/models/en/pos_hmm_v1.model
# Compare models
mix nasty.models compare model1.model model2.model
Performance Tips
Batch Processing
Process multiple texts efficiently:
alias Nasty.Language.English
texts = [
"First sentence.",
"Second sentence.",
"Third sentence."
]
# Process in parallel
results =
  texts
  |> Task.async_stream(fn text ->
    with {:ok, tokens} <- English.tokenize(text),
         {:ok, tagged} <- English.tag_pos(tokens),
         {:ok, doc} <- English.parse(tagged) do
      {:ok, doc}
    end
  end, max_concurrency: System.schedulers_online())
  |> Enum.map(fn {:ok, result} -> result end)
Selective Parsing
Skip expensive operations when not needed:
# Basic parsing (no semantic analysis)
{:ok, doc} = English.parse(tokens)
# With semantic roles
{:ok, doc} = English.parse(tokens, semantic_roles: true)
# With coreference
{:ok, doc} = English.parse(tokens, coreference: true)
# Full pipeline
{:ok, doc} = English.parse(tokens,
  semantic_roles: true,
  coreference: true
)
Caching
Cache parsed documents:
defmodule MyApp.DocumentCache do
  use Agent
  alias Nasty.Language.English
  def start_link(_) do
    Agent.start_link(fn -> %{} end, name: __MODULE__)
  end
  def get_or_parse(text) do
    Agent.get_and_update(__MODULE__, fn cache ->
      case Map.fetch(cache, text) do
        {:ok, doc} ->
          {doc, cache}
        :error ->
          {:ok, tokens} = English.tokenize(text)
          {:ok, tagged} = English.tag_pos(tokens)
          {:ok, doc} = English.parse(tagged)
          {doc, Map.put(cache, text, doc)}
      end
    end)
  end
end
Troubleshooting
Common Issues
Issue: Parsing fails with long sentences
Solution: Break into smaller sentences or increase timeout
# Split long text, then run the full pipeline per sentence
sentences = String.split(text, ~r/[.!?]+/, trim: true)
Enum.map(sentences, fn sentence ->
  {:ok, tokens} = English.tokenize(sentence)
  {:ok, tagged} = English.tag_pos(tokens)
  English.parse(tagged)
end)
Issue: Entity recognition misses entities
Solution: Train custom NER or add to dictionary
# Add custom entity patterns
alias Nasty.Language.English.EntityRecognizer
# This is conceptual - check actual API
EntityRecognizer.add_pattern(:ORG, ~r/\b[A-Z][a-z]+ Inc\.\b/)
Issue: POS tagging accuracy is low
Solution: Use statistical model or ensemble
# Use HMM model
{:ok, tagged} = English.tag_pos(tokens, model: :hmm)
# Or ensemble
{:ok, tagged} = English.tag_pos(tokens, model: :ensemble)
Debugging Tips
- Visualize the AST: Use pretty printing to understand structure
- Check spans: Ensure position tracking is correct
- Validate: Run validation to catch structural issues
- Incremental parsing: Test each pipeline stage separately
# Debug pipeline stage by stage
{:ok, tokens} = English.tokenize(text)
IO.inspect(tokens, label: "Tokens")
{:ok, tagged} = English.tag_pos(tokens)
IO.inspect(tagged, label: "Tagged")
{:ok, doc} = English.parse(tagged)
IO.puts(PrettyPrint.tree(doc))
Getting Help
- Check the API documentation
- Review PLAN.md for architecture details
- See examples/ for working code
- Report issues on GitHub
Next Steps
- Explore the examples/ directory for more demos
- Read STATISTICAL_MODELS.md for ML details
- Check TRAINING_GUIDE.md to train custom models
- See INTEROP_GUIDE.md for code conversion details
Happy parsing!