Getting Started with Nasty
A beginner-friendly guide to Natural Abstract Syntax Tree processing in Elixir.
Table of Contents
- Installation
- Your First Steps
- Core Concepts
- Common Patterns
- Language Support
- Troubleshooting
- Next Steps
- Quick Reference
- Summary
Installation
Prerequisites
- Elixir: Version 1.14 or later
- Erlang/OTP: Version 25 or later
Check your versions:
elixir --version
# Erlang/OTP 25 [erts-13.0] [source] [64-bit]
# Elixir 1.14.0 (compiled with Erlang/OTP 25)
Adding Nasty to Your Project
Add nasty to your mix.exs dependencies:
def deps do
  [
    {:nasty, "~> 0.1.0"}
  ]
end
Then run:
mix deps.get
mix compile
Verifying Installation
Test that everything works:
# In IEx
iex> alias Nasty.Language.English
iex> {:ok, tokens} = English.tokenize("Hello world!")
iex> IO.inspect(tokens)
Your First Steps
Example 1: Parse a Simple Sentence
alias Nasty.Language.English
# Step 1: Tokenize
text = "The cat runs."
{:ok, tokens} = English.tokenize(text)
# Step 2: POS Tag
{:ok, tagged} = English.tag_pos(tokens)
# Step 3: Parse
{:ok, document} = English.parse(tagged)
# Examine the result
IO.inspect(document)
What just happened?
- Tokenization: Split text into words and punctuation
- POS Tagging: Assigned grammatical categories (noun, verb, etc.)
- Parsing: Built an Abstract Syntax Tree (AST)
Example 2: Extract Information
alias Nasty.Language.English
text = "John Smith works at Google in New York."
{:ok, tokens} = English.tokenize(text)
{:ok, tagged} = English.tag_pos(tokens)
# Extract named entities
alias Nasty.Language.English.EntityRecognizer
entities = EntityRecognizer.recognize(tagged)
Enum.each(entities, fn entity ->
  IO.puts("#{entity.text} is a #{entity.type}")
end)
# Output:
# John Smith is a person
# Google is a org
# New York is a gpe
Example 3: Translate Between Languages
alias Nasty.Language.{English, Spanish}
alias Nasty.Translation.Translator
# Parse English
{:ok, tokens} = English.tokenize("The cat runs.")
{:ok, tagged} = English.tag_pos(tokens)
{:ok, doc} = English.parse(tagged)
# Translate to Spanish
{:ok, doc_es} = Translator.translate(doc, :es)
# Render Spanish text
{:ok, text_es} = Nasty.Rendering.Text.render(doc_es)
IO.puts(text_es)
# Output: El gato corre.
Core Concepts
The AST Structure
Nasty represents text as a tree:
Document
└── Paragraph
    └── Sentence
        └── Clause
            ├── Subject (NounPhrase)
            │   ├── Determiner: "The"
            │   └── Head: "cat"
            └── Predicate (VerbPhrase)
                └── Head: "runs"
Tokens
Every word is a Token with:
- text: The actual word ("runs")
- lemma: Dictionary form ("run")
- pos_tag: Part of speech (:verb)
- morphology: Features (%{tense: :present})
- language: Language code (:en)
- span: Position in text
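For example, after tagging a sentence you can read these fields directly. A small sketch using only the calls from Example 1; it assumes tag_pos/1 returns the same Token structs with the fields above filled in:
alias Nasty.Language.English
{:ok, tokens} = English.tokenize("The cat runs.")
{:ok, tagged} = English.tag_pos(tokens)
# Find the token for "runs" and read its fields
runs = Enum.find(tagged, &(&1.text == "runs"))
runs.lemma    # "run"
runs.pos_tag  # :verb
runs.language # :en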
Phrases
Phrases group related tokens:
- NounPhrase: "the big cat"
- VerbPhrase: "is running quickly"
- PrepositionalPhrase: "in the house"
The Processing Pipeline
Text → Tokenization → POS Tagging → Morphology → Parsing → AST
Each step enriches the data:
- Tokenization: Split into atomic units
- POS Tagging: Add grammatical categories
- Morphology: Add features (tense, number, etc.)
- Parsing: Build hierarchical structure
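The step-by-step calls from Example 1 walk these stages explicitly. The top-level Nasty.parse/2 used later in this guide (see Pattern 2 and the Quick Reference) wraps the pipeline in a single call; treating the two forms as equivalent is an assumption here, not a documented guarantee:
alias Nasty.Language.English
# Explicit stages
{:ok, tokens} = English.tokenize("The cat runs.")
{:ok, tagged} = English.tag_pos(tokens)
{:ok, doc} = English.parse(tagged)
# One-call entry point (assumed to run the same stages internally)
{:ok, doc} = Nasty.parse("The cat runs.", language: :en)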
Common Patterns
Pattern 1: Batch Processing
Process multiple texts efficiently:
alias Nasty.Language.English
texts = [
  "The first sentence.",
  "The second sentence.",
  "The third sentence."
]
results =
  texts
  |> Task.async_stream(fn text ->
    with {:ok, tokens} <- English.tokenize(text),
         {:ok, tagged} <- English.tag_pos(tokens),
         {:ok, doc} <- English.parse(tagged) do
      {:ok, doc}
    end
  end, max_concurrency: System.schedulers_online())
  |> Enum.to_list()
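Task.async_stream wraps each item's return value in its own {:ok, _} tuple, so the elements of results look like {:ok, {:ok, doc}} on success and {:ok, {:error, reason}} when a step fails. One way to collect only the successfully parsed documents:
docs =
  for {:ok, {:ok, doc}} <- results, do: doc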
Pattern 2: Extract Specific Information
Find all nouns in a document:
alias Nasty.Utils.Query
{:ok, doc} = Nasty.parse("The cat and dog play.", language: :en)
# Find all nouns
nouns = Query.find_by_pos(doc, :noun)
Enum.each(nouns, fn token ->
  IO.puts(token.text)
end)
# Output:
# cat
# dog
Pattern 3: Transform Text
Normalize and clean text:
alias Nasty.Utils.Transform
{:ok, doc} = Nasty.parse("The CAT runs QUICKLY!", language: :en)
# Lowercase everything
normalized = Transform.normalize_case(doc, :lower)
# Remove punctuation
no_punct = Transform.remove_punctuation(normalized)
# Render back to text
{:ok, clean_text} = Nasty.render(no_punct)
IO.puts(clean_text)
# Output: the cat runs quickly
Pattern 4: Error Handling
Always handle errors gracefully:
alias Nasty.Language.English
text = "Some text..."
case English.tokenize(text) do
  {:ok, tokens} ->
    case English.tag_pos(tokens) do
      {:ok, tagged} ->
        case English.parse(tagged) do
          {:ok, doc} ->
            # Success! Process doc
            process_document(doc)
          {:error, reason} ->
            IO.puts("Parse error: #{inspect(reason)}")
        end
      {:error, reason} ->
        IO.puts("Tagging error: #{inspect(reason)}")
    end
  {:error, reason} ->
    IO.puts("Tokenization error: #{inspect(reason)}")
end
Or use with:
with {:ok, tokens} <- English.tokenize(text),
     {:ok, tagged} <- English.tag_pos(tokens),
     {:ok, doc} <- English.parse(tagged) do
  process_document(doc)
else
  {:error, reason} ->
    IO.puts("Error: #{inspect(reason)}")
end
Language Support
Supported Languages
Nasty currently supports:
- English (:en) - Fully implemented
- Spanish (:es) - Fully implemented
- Catalan (:ca) - Fully implemented
Using Different Languages
Each language has its own module:
# English
alias Nasty.Language.English
{:ok, doc_en} = Nasty.parse("The cat runs.", language: :en)
# Spanish
alias Nasty.Language.Spanish
{:ok, doc_es} = Nasty.parse("El gato corre.", language: :es)
# Catalan
alias Nasty.Language.Catalan
{:ok, doc_ca} = Nasty.parse("El gat corre.", language: :ca)
Language Detection
Auto-detect the language:
{:ok, lang} = Nasty.Language.Registry.detect_language("Hola mundo")
# => {:ok, :es}
{:ok, lang} = Nasty.Language.Registry.detect_language("Hello world")
# => {:ok, :en}
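When you do not know the input language ahead of time, you can combine detection with parsing, using only the calls shown above (a small sketch):
text = "El gato corre."
with {:ok, lang} <- Nasty.Language.Registry.detect_language(text),
     {:ok, doc} <- Nasty.parse(text, language: lang) do
  IO.inspect(doc)
end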
Troubleshooting
Common Issues
Issue 1: Module Not Found
Error:
** (UndefinedFunctionError) function Nasty.Language.English.tokenize/1 is undefined
Solution: Make sure you've compiled the project:
mix deps.get
mix compile
Issue 2: Empty Token List
Problem:
{:ok, []} = English.tokenize("")
Solution: Empty strings return empty token lists. Check your input:
text = String.trim(user_input)
if text != "" do
English.tokenize(text)
else
{:error, :empty_input}
endIssue 3: Parse Errors with Long Sentences
Problem: Very long or complex sentences may fail to parse.
Solution: Split long sentences:
sentences =
  text
  |> String.split(~r/[.!?]+/)
  |> Enum.map(&String.trim/1)
  |> Enum.filter(&(&1 != ""))
Enum.each(sentences, fn sent ->
  {:ok, doc} = Nasty.parse(sent, language: :en)
  # Process doc
end)
Issue 4: Low Entity Recognition
Problem: Named entities not detected.
Solution: Entities depend on lexicons. For specialized domains, you may need to add custom entity patterns or use statistical models:
# Use rule-based (default)
{:ok, tagged} = English.tag_pos(tokens)
entities = EntityRecognizer.recognize(tagged)
# Or use CRF model (better accuracy)
entities = EntityRecognizer.recognize(tagged, model: :crf)
Performance Issues
Slow Processing
If processing is slow:
- Use parallel processing for multiple documents
- Cache parsed documents to avoid re-parsing (see the sketch below)
- Use simpler models for POS tagging (:rule instead of :neural)
# Fast rule-based tagging
{:ok, tagged} = English.tag_pos(tokens, model: :rule)
# Better accuracy but slower
{:ok, tagged} = English.tag_pos(tokens, model: :hmm)
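For the caching tip above, a plain ETS table keyed by text and language is usually enough. Nothing here is Nasty-specific; it is a sketch that simply avoids re-running the pipeline for texts you have already parsed:
defmodule ParseCache do
  # Minimal ETS-backed cache: parse each distinct {text, language} pair only once.
  def start do
    :ets.new(:parse_cache, [:named_table, :public, read_concurrency: true])
  end

  def parse(text, language) do
    case :ets.lookup(:parse_cache, {text, language}) do
      [{_key, doc}] ->
        {:ok, doc}
      [] ->
        with {:ok, doc} <- Nasty.parse(text, language: language) do
          :ets.insert(:parse_cache, {{text, language}, doc})
          {:ok, doc}
        end
    end
  end
end
# Usage:
# ParseCache.start()
# {:ok, doc} = ParseCache.parse("The cat runs.", :en)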
Getting Help
- Documentation: Check docs/ for detailed guides
- Examples: See examples/ for working code
- Issues: Report bugs on GitHub
Next Steps
Learn More
- Read the User Guide: USER_GUIDE.md for comprehensive examples
- Explore Examples: EXAMPLES.md for runnable scripts
- Understand Architecture: ARCHITECTURE.md for system design
- Try Translation: TRANSLATION.md for multilingual features
Try the Examples
Run the example scripts:
# Basic tokenization
elixir examples/tokenizer_example.exs
# Question answering
elixir examples/question_answering.exs
# Translation
elixir examples/translation_example.exs
# Multilingual comparison
elixir examples/multilingual_pipeline.exs
Build Something
Now that you understand the basics, try building:
- Text Analyzer: Extract keywords, entities, and sentiment (a keyword-counting sketch follows this list)
- Translation Tool: Translate documents between languages
- Chatbot: Parse user input and generate responses
- Content Categorizer: Classify documents by topic
- Grammar Checker: Analyze and correct grammatical errors
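As a starting point for the Text Analyzer idea above, here is a minimal keyword counter built only from calls already shown in this guide (Nasty.parse/2 and Query.find_by_pos/2); the module itself is just for illustration:
alias Nasty.Utils.Query

defmodule KeywordCounter do
  # Count the most frequent nouns in a text (illustrative sketch).
  def top_nouns(text, n \\ 5) do
    {:ok, doc} = Nasty.parse(text, language: :en)

    doc
    |> Query.find_by_pos(:noun)
    |> Enum.frequencies_by(&String.downcase(&1.text))
    |> Enum.sort_by(fn {_word, count} -> count end, :desc)
    |> Enum.take(n)
  end
end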
Advanced Topics
Once comfortable with basics, explore:
- Statistical Models: Train custom POS taggers
- Neural Networks: Use BiLSTM-CRF for better accuracy
- Information Extraction: Extract relations and events
- Question Answering: Build Q&A systems
- Custom Grammars: Define domain-specific grammar rules
Quick Reference
Essential Functions
# Parsing
Nasty.parse(text, language: :en)
# Rendering
Nasty.render(ast)
# Translation
Nasty.Translation.Translator.translate(ast, target_language)
# Querying
Nasty.Utils.Query.find_by_pos(doc, :noun)
Nasty.Utils.Query.extract_entities(doc)
# Transformation
Nasty.Utils.Transform.normalize_case(doc, :lower)
Nasty.Utils.Transform.remove_punctuation(doc)
Language Modules
Nasty.Language.English
Nasty.Language.Spanish
Nasty.Language.Catalan
Common Modules
alias Nasty.Language.English
alias Nasty.Translation.Translator
alias Nasty.Utils.{Query, Transform, Traversal}
alias Nasty.Rendering.Text
Summary
You now know how to:
- ✓ Install and set up Nasty
- ✓ Parse text into an AST
- ✓ Extract information from documents
- ✓ Translate between languages
- ✓ Handle common issues
- ✓ Use best practices
Happy parsing! 🚀