Getting Started with Nasty

A beginner-friendly guide to Nasty, a library for Natural Abstract Syntax Tree processing in Elixir.

Installation
Your First Steps
Core Concepts
Common Patterns
Language Support
Troubleshooting
Next Steps

Installation

Prerequisites

Elixir: Version 1.14 or later
Erlang/OTP: Version 25 or later

Check your versions:

elixir --version
# Erlang/OTP 25 [erts-13.0] [source] [64-bit]
# Elixir 1.14.0 (compiled with Erlang/OTP 25)
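
The same check can be done from inside Elixir, which is handy in a setup script. This is a minimal sketch that uses only the standard System and Version modules; the variable names are illustrative.

```elixir
# System.version/0 returns the running Elixir version as a string.
elixir_ok = Version.match?(System.version(), ">= 1.14.0")

# System.otp_release/0 returns the OTP major release as a string, e.g. "25".
otp_ok = String.to_integer(System.otp_release()) >= 25

IO.puts("Elixir OK: #{elixir_ok}, OTP OK: #{otp_ok}")
```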

Adding Nasty to Your Project

Add nasty to your mix.exs dependencies:

def deps do
  [
    {:nasty, "~> 0.1.0"}
  ]
end

Then run:

mix deps.get
mix compile
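
The "~> 0.1.0" requirement in the dependency tuple accepts any 0.1.x release but not 0.2.0. The sketch below shows that behaviour with the standard Version module; the candidate version numbers are made up for illustration and are not a list of actual Nasty releases.

```elixir
# "~> 0.1.0" is shorthand for ">= 0.1.0 and < 0.2.0".
requirement = "~> 0.1.0"

# Hypothetical release numbers, used only to show how the requirement behaves.
for candidate <- ["0.1.0", "0.1.7", "0.2.0"] do
  IO.puts("#{candidate}: #{Version.match?(candidate, requirement)}")
end
```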

Verifying Installation

Test that everything works:

# In IEx
iex> alias Nasty.Language.English
iex> {:ok, tokens} = English.tokenize("Hello world!")
iex> IO.inspect(tokens)

Your First Steps

Example 1: Parse a Simple Sentence

alias Nasty.Language.English

# Step 1: Tokenize
text = "The cat runs."
{:ok, tokens} = English.tokenize(text)

# Step 2: POS Tag
{:ok, tagged} = English.tag_pos(tokens)

# Step 3: Parse
{:ok, document} = English.parse(tagged)

# Examine the result
IO.inspect(document)

What just happened?

Tokenization: Split text into words and punctuation
POS Tagging: Assigned grammatical categories (noun, verb, etc.)
Parsing: Built an Abstract Syntax Tree (AST)
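
To make the first two steps concrete, here is a toy sketch of tokenization and POS tagging. It is not how Nasty implements them: the ToyPipeline module, its three-word lexicon, and the tag names are assumptions chosen to keep the example self-contained. Parsing is left out.

```elixir
defmodule ToyPipeline do
  # Hypothetical mini-lexicon; a real tagger uses far richer resources.
  @lexicon %{"the" => :det, "cat" => :noun, "runs" => :verb}

  # Tokenization: runs of word characters and single punctuation marks
  # become separate tokens.
  def tokenize(text) do
    ~r/\w+|[^\w\s]/u
    |> Regex.scan(text)
    |> List.flatten()
  end

  # POS tagging: punctuation gets :punct, everything else is looked up
  # in the lexicon case-insensitively, falling back to :unknown.
  def tag(tokens) do
    Enum.map(tokens, fn token ->
      if String.match?(token, ~r/^[^\w\s]$/u) do
        {token, :punct}
      else
        {token, Map.get(@lexicon, String.downcase(token), :unknown)}
      end
    end)
  end
end

"The cat runs."
|> ToyPipeline.tokenize()
|> ToyPipeline.tag()
|> IO.inspect()
```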

Example 2: Extract Information

alias Nasty.Language.English

text = "John Smith works at Google in New York."
{:ok, tokens} = English.tokenize(text)
{:ok, tagged} = English.tag_pos(tokens)

# Extract named entities
alias Nasty.Language.English.EntityRecognizer
entities = EntityRecognizer.recognize(tagged)

Enum.each(entities, fn entity ->
  IO.puts("#{entity.text} is a #{entity.type}")
end)
# Output:
# John Smith is a person
# Google is a org
# New York is a gpe

Example 3: Translate Between Languages

alias Nasty.Language.English
alias Nasty.Translation.Translator

# Parse English
{:ok, tokens} = English.tokenize("The cat runs.")
{:ok, tagged} = English.tag_pos(tokens)
{:ok, doc} = English.parse(tagged)

# Translate to Spanish
{:ok, doc_es} = Translator.translate(doc, :es)

# Render Spanish text
{:ok, text_es} = Nasty.Rendering.Text.render(doc_es)
IO.puts(text_es)
# Output: El gato corre.

Core Concepts

The AST Structure

Nasty represents text as a tree:

Document
└── Paragraph
    └── Sentence
        └── Clause
            ├── Subject (NounPhrase)
            │   ├── Determiner: "The"
            │   └── Head: "cat"
            └── Predicate (VerbPhrase)
                └── Head: "runs"
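
The tree above can be modelled with nested maps to get a feel for how such a structure is walked. The node shape used here (a :type key plus either :children or :text) is an assumption for illustration, not Nasty's actual struct layout.

```elixir
defmodule ToyTree do
  # Leaf nodes carry :text; inner nodes carry :children.
  def leaves(%{text: text}), do: [text]
  def leaves(%{children: children}), do: Enum.flat_map(children, &leaves/1)
end

noun_phrase = %{
  type: :noun_phrase,
  children: [%{type: :determiner, text: "The"}, %{type: :head, text: "cat"}]
}

verb_phrase = %{type: :verb_phrase, children: [%{type: :head, text: "runs"}]}

clause = %{type: :clause, children: [noun_phrase, verb_phrase]}
sentence = %{type: :sentence, children: [clause]}
paragraph = %{type: :paragraph, children: [sentence]}
tree = %{type: :document, children: [paragraph]}

# A depth-first walk visits the leaves in reading order.
tree |> ToyTree.leaves() |> Enum.join(" ") |> IO.puts()
# prints: The cat runs
```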

Tokens

Every word is a Token with:

text: The actual word ("runs")
lemma: Dictionary form ("run")
pos_tag: Part of speech (:verb)
morphology: Features (%{tense: :present})
language: Language code (:en)
span: Position in text
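
As a rough picture of such a token, the sketch below defines a stand-in struct with the fields listed above. ToyToken is hypothetical, and the {start, stop} tuple used for span is an assumption; check the library's own Token definition for the real shape.

```elixir
defmodule ToyToken do
  # Field names follow the list above; morphology defaults to an empty map.
  defstruct [:text, :lemma, :pos_tag, :language, :span, morphology: %{}]
end

# struct!/2 raises if a key is not a field of the struct.
token =
  struct!(ToyToken,
    text: "runs",
    lemma: "run",
    pos_tag: :verb,
    morphology: %{tense: :present},
    language: :en,
    span: {8, 12}
  )

IO.puts("#{token.text} -> #{token.lemma} (#{token.pos_tag})")
# prints: runs -> run (verb)
```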

Phrases

Phrases group related tokens:

NounPhrase: "the big cat"
VerbPhrase: "is running quickly"
PrepositionalPhrase: "in the house"

The Processing Pipeline

Text → Tokenization → POS Tagging → Morphology → Parsing → AST

Each step enriches the data:

Tokenization: Split into atomic units
POS Tagging: Add grammatical categories
Morphology: Add features (tense, number, etc.)
Parsing: Build hierarchical structure

Common Patterns

Pattern 1: Batch Processing

Process multiple texts concurrently with Task.async_stream:

alias Nasty.Language.English

texts = [
  "The first sentence.",
  "The second sentence.",
  "The third sentence."
]

results =
  texts
  |> Task.async_stream(
    fn text ->
      with {:ok, tokens} <- English.tokenize(text),
           {:ok, tagged} <- English.tag_pos(tokens),
           {:ok, doc} <- English.parse(tagged) do
        {:ok, doc}
      end
    end,
    max_concurrency: System.schedulers_online()
  )
  |> Enum.to_list()

# Task.async_stream wraps each return value, so every element of results
# is {:ok, {:ok, doc}} or {:ok, {:error, reason}}.
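
Because Task.async_stream wraps every return value in its own {:ok, value} tuple, the results usually need unwrapping before use. The sketch below separates successes from failures; the parse function is a stand-in for the tokenize/tag/parse chain so that the example runs without Nasty.

```elixir
# Stand-in for the real pipeline: fails on empty input, succeeds otherwise.
parse = fn
  "" -> {:error, :empty_input}
  text -> {:ok, String.upcase(text)}
end

texts = ["first", "", "third"]

# A successful parse arrives as {:ok, {:ok, doc}} and a failed one as
# {:ok, {:error, reason}}; results come back in input order by default.
{docs, errors} =
  texts
  |> Task.async_stream(parse, max_concurrency: System.schedulers_online())
  |> Enum.reduce({[], []}, fn
    {:ok, {:ok, doc}}, {docs, errors} -> {[doc | docs], errors}
    {:ok, {:error, reason}}, {docs, errors} -> {docs, [reason | errors]}
  end)

IO.inspect(Enum.reverse(docs), label: "docs")
IO.inspect(Enum.reverse(errors), label: "errors")
```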

Pattern 2: Extract Specific Information

Find all nouns in a document:

alias Nasty.Utils.Query

{:ok, doc} = Nasty.parse("The cat and dog play.", language: :en)

# Find all nouns
nouns = Query.find_by_pos(doc, :noun)

Enum.each(nouns, fn token ->
  IO.puts(token.text)
end)
# Output:
# cat
# dog

Pattern 3: Transform Text

Normalize and clean text:

alias Nasty.Utils.Transform

{:ok, doc} = Nasty.parse("The CAT runs QUICKLY!", language: :en)

# Lowercase everything
normalized = Transform.normalize_case(doc, :lower)

# Remove punctuation
no_punct = Transform.remove_punctuation(normalized)

# Render back to text
{:ok, clean_text} = Nasty.render(no_punct)
IO.puts(clean_text)
# Output: the cat runs quickly
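
Conceptually, these two transformations amount to a filter and a map over the tokens. The sketch below shows the idea on a flat list of {text, pos_tag} pairs; that shape and the tag names are assumptions for illustration, and Nasty's Transform functions operate on the full AST instead.

```elixir
# Hypothetical pre-tagged tokens for "The CAT runs QUICKLY!".
tokens = [
  {"The", :det},
  {"CAT", :noun},
  {"runs", :verb},
  {"QUICKLY", :adv},
  {"!", :punct}
]

clean_text =
  tokens
  # Drop punctuation tokens.
  |> Enum.reject(fn {_text, pos} -> pos == :punct end)
  # Lowercase what remains and join with single spaces.
  |> Enum.map_join(" ", fn {text, _pos} -> String.downcase(text) end)

IO.puts(clean_text)
# prints: the cat runs quickly
```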

Pattern 4: Error Handling

Each pipeline stage returns {:ok, result} or {:error, reason}, so match on both instead of assuming success:

alias Nasty.Language.English

text = "Some text..."

case English.tokenize(text) do
  {:ok, tokens} ->
    case English.tag_pos(tokens) do
      {:ok, tagged} ->
        case English.parse(tagged) do
          {:ok, doc} -> 
            # Success! Process doc
            process_document(doc)
          {:error, reason} ->
            IO.puts("Parse error: #{inspect(reason)}")
        end
      {:error, reason} ->
        IO.puts("Tagging error: #{inspect(reason)}")
    end
  {:error, reason} ->
    IO.puts("Tokenization error: #{inspect(reason)}")
end

Or use with to flatten the nesting; a single else clause then handles an error from any stage:

with {:ok, tokens} <- English.tokenize(text),
     {:ok, tagged} <- English.tag_pos(tokens),
     {:ok, doc} <- English.parse(tagged) do
  process_document(doc)
else
  {:error, reason} -> 
    IO.puts("Error: #{inspect(reason)}")
end
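
The with version reports the reason but, unlike the nested case version, no longer says which stage failed. One way to keep that information is to label each stage's error before it reaches with. This is a sketch under stated assumptions: ToyStages and its stand-in tokenize/tag functions are hypothetical and only mimic the {:ok, _} / {:error, _} convention.

```elixir
defmodule ToyStages do
  # Stand-in stages; each returns {:ok, value} or {:error, reason}.
  def tokenize(""), do: {:error, :empty_input}
  def tokenize(text), do: {:ok, String.split(text)}

  def tag(tokens), do: {:ok, Enum.map(tokens, &{&1, :unknown})}

  # Pass successes through; attach the stage name to failures.
  def label({:ok, value}, _stage), do: {:ok, value}
  def label({:error, reason}, stage), do: {:error, {stage, reason}}

  def run(text) do
    with {:ok, tokens} <- label(tokenize(text), :tokenize),
         {:ok, tagged} <- label(tag(tokens), :tag) do
      {:ok, tagged}
    end
  end
end

IO.inspect(ToyStages.run("The cat"))
IO.inspect(ToyStages.run(""))
```

A caller can then match on {:error, {stage, reason}} to log or recover per stage.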

Language Support

Supported Languages

Nasty currently supports:

English (:en) - Fully implemented
Spanish (:es) - Fully implemented
Catalan (:ca) - Fully implemented

Using Different Languages

Each language has its own module, and Nasty.parse selects one through the :language option:

# English
alias Nasty.Language.English
{:ok, doc_en} = Nasty.parse("The cat runs.", language: :en)

# Spanish
alias Nasty.Language.Spanish
{:ok, doc_es} = Nasty.parse("El gato corre.", language: :es)

# Catalan
alias Nasty.Language.Catalan
{:ok, doc_ca} = Nasty.parse("El gat corre.", language: :ca)

Language Detection

Auto-detect the language:

{:ok, lang} = Nasty.Language.Registry.detect_language("Hola mundo")
# => {:ok, :es}

{:ok, lang} = Nasty.Language.Registry.detect_language("Hello world")
# => {:ok, :en}

Troubleshooting

Common Issues

Issue 1: Module Not Found

Error:

** (UndefinedFunctionError) function Nasty.Language.English.tokenize/1 is undefined

Solution: Make sure you've compiled the project:

mix deps.get
mix compile

Issue 2: Empty Token List

Problem:

{:ok, []} = English.tokenize("")

Solution: Empty strings return empty token lists. Check your input:

text = String.trim(user_input)
if text != "" do
  English.tokenize(text)
else
  {:error, :empty_input}
end

Issue 3: Parse Errors with Long Sentences

Problem: Very long or complex sentences may fail to parse.

Solution: Split long sentences:

sentences =
  text
  |> String.split(~r/[.!?]+/)
  |> Enum.map(&String.trim/1)
  |> Enum.reject(&(&1 == ""))

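Enum.each(sentences, fn sent ->
  # Match on both outcomes so one bad sentence does not crash the loop
  case Nasty.parse(sent, language: :en) do
    {:ok, doc} -> process_document(doc)
    {:error, reason} -> IO.puts("Could not parse #{inspect(sent)}: #{inspect(reason)}")
  end
end)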
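
Splitting on the punctuation itself discards the sentence-final mark. If the parser should still see it, one alternative is to split on the whitespace that follows the mark, using a lookbehind. This sketch uses only the standard String module; the sample text is illustrative.

```elixir
text = "The cat runs. Does the dog play? Yes!"

# Split on whitespace preceded by ., ! or ? so each sentence keeps its
# terminal punctuation. Abbreviations such as "Dr." will still split wrongly.
sentences = String.split(text, ~r/(?<=[.!?])\s+/, trim: true)

IO.inspect(sentences)
```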

Issue 4: Low Entity Recognition

Problem: Named entities are not detected.

Solution: The default rule-based recognizer depends on lexicons. For specialized domains, you may need to add custom entity patterns or use a statistical model:

# Use rule-based (default)
{:ok, tagged} = English.tag_pos(tokens)
entities = EntityRecognizer.recognize(tagged)

# Or use CRF model (better accuracy)
entities = EntityRecognizer.recognize(tagged, model: :crf)

Performance Issues

Slow Processing

If processing is slow:

Use parallel processing for multiple documents
Cache parsed documents to avoid re-parsing
Use a simpler model for POS tagging (:rule instead of a statistical or neural model such as :hmm or :neural)

# Fast rule-based tagging
{:ok, tagged} = English.tag_pos(tokens, model: :rule)

# Better accuracy but slower
{:ok, tagged} = English.tag_pos(tokens, model: :hmm)
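
The caching suggestion above can be sketched with an Agent that maps input text to its parse result. ToyCache is a hypothetical helper, not part of Nasty, and the parse function is a stand-in so that the example runs on its own.

```elixir
defmodule ToyCache do
  # Starts an Agent holding a map from input text to parse result.
  def start_link, do: Agent.start_link(fn -> %{} end)

  # Returns the cached result for `text`, calling `parse_fun` only on a miss.
  def fetch(cache, text, parse_fun) do
    case Agent.get(cache, &Map.fetch(&1, text)) do
      {:ok, result} ->
        result

      :error ->
        result = parse_fun.(text)
        Agent.update(cache, &Map.put(&1, text, result))
        result
    end
  end
end

{:ok, cache} = ToyCache.start_link()

# Stand-in parser that reports each real invocation.
parse = fn text ->
  IO.puts("parsing #{inspect(text)}")
  {:ok, String.split(text)}
end

ToyCache.fetch(cache, "The cat runs.", parse)
ToyCache.fetch(cache, "The cat runs.", parse)
# "parsing ..." is printed once; the second call is served from the cache.
```

Two processes that miss at the same moment may both parse the same text; the result is the same either way, so the sketch accepts that duplicate work rather than serializing parses through the Agent.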

Getting Help

Documentation: Check docs/ for detailed guides
Examples: See examples/ for working code
Issues: Report bugs on GitHub

Next Steps

Learn More

Read the User Guide: USER_GUIDE.md for comprehensive examples
Explore Examples: EXAMPLES.md for runnable scripts
Understand Architecture: ARCHITECTURE.md for system design
Try Translation: TRANSLATION.md for multilingual features

Try the Examples

Run the example scripts:

# Basic tokenization
elixir examples/tokenizer_example.exs

# Question answering
elixir examples/question_answering.exs

# Translation
elixir examples/translation_example.exs

# Multilingual comparison
elixir examples/multilingual_pipeline.exs

Build Something

Now that you understand the basics, try building:

Text Analyzer: Extract keywords, entities, and sentiment
Translation Tool: Translate documents between languages
Chatbot: Parse user input and generate responses
Content Categorizer: Classify documents by topic
Grammar Checker: Analyze and correct grammatical errors

Advanced Topics

Once comfortable with basics, explore:

Statistical Models: Train custom POS taggers
Neural Networks: Use BiLSTM-CRF for better accuracy
Information Extraction: Extract relations and events
Question Answering: Build Q&A systems
Custom Grammars: Define domain-specific grammar rules

Quick Reference

Essential Functions

# Parsing
Nasty.parse(text, language: :en)

# Rendering
Nasty.render(ast)

# Translation
Nasty.Translation.Translator.translate(ast, target_language)

# Querying
Nasty.Utils.Query.find_by_pos(doc, :noun)
Nasty.Utils.Query.extract_entities(doc)

# Transformation
Nasty.Utils.Transform.normalize_case(doc, :lower)
Nasty.Utils.Transform.remove_punctuation(doc)

Language Modules

Nasty.Language.English
Nasty.Language.Spanish
Nasty.Language.Catalan

Common Modules

alias Nasty.Language.English
alias Nasty.Translation.Translator
alias Nasty.Utils.{Query, Transform, Traversal}
alias Nasty.Rendering.Text

Summary

You now know how to:

✓ Install and set up Nasty
✓ Parse text into an AST
✓ Extract information from documents
✓ Translate between languages
✓ Handle common issues
✓ Use best practices

Happy parsing! 🚀
