Getting Started with Nasty

View Source

A beginner-friendly guide to Natural Abstract Syntax Tree processing in Elixir.

Table of Contents

  1. Installation
  2. Your First Steps
  3. Core Concepts
  4. Common Patterns
  5. Language Support
  6. Troubleshooting
  7. Next Steps

Installation

Prerequisites

  • Elixir: Version 1.14 or later
  • Erlang/OTP: Version 25 or later

Check your versions:

elixir --version
# Erlang/OTP 25 [erts-13.0] [source] [64-bit]
# Elixir 1.14.0 (compiled with Erlang/OTP 25)

Adding Nasty to Your Project

Add nasty to your mix.exs dependencies:

def deps do
  [
    {:nasty, "~> 0.1.0"}
  ]
end

Then run:

mix deps.get
mix compile

Verifying Installation

Test that everything works:

# In IEx
iex> alias Nasty.Language.English
iex> {:ok, tokens} = English.tokenize("Hello world!")
iex> IO.inspect(tokens)

Your First Steps

Example 1: Parse a Simple Sentence

alias Nasty.Language.English

# Step 1: Tokenize
text = "The cat runs."
{:ok, tokens} = English.tokenize(text)

# Step 2: POS Tag
{:ok, tagged} = English.tag_pos(tokens)

# Step 3: Parse
{:ok, document} = English.parse(tagged)

# Examine the result
IO.inspect(document)

What just happened?

  1. Tokenization: Split text into words and punctuation
  2. POS Tagging: Assigned grammatical categories (noun, verb, etc.)
  3. Parsing: Built an Abstract Syntax Tree (AST)

Example 2: Extract Information

alias Nasty.Language.English

text = "John Smith works at Google in New York."
{:ok, tokens} = English.tokenize(text)
{:ok, tagged} = English.tag_pos(tokens)

# Extract named entities
alias Nasty.Language.English.EntityRecognizer
entities = EntityRecognizer.recognize(tagged)

Enum.each(entities, fn entity ->
  IO.puts("#{entity.text} is a #{entity.type}")
end)
# Output:
# John Smith is a person
# Google is a org
# New York is a gpe

Example 3: Translate Between Languages

alias Nasty.Language.{English, Spanish}
alias Nasty.Translation.Translator

# Parse English
{:ok, tokens} = English.tokenize("The cat runs.")
{:ok, tagged} = English.tag_pos(tokens)
{:ok, doc} = English.parse(tagged)

# Translate to Spanish
{:ok, doc_es} = Translator.translate(doc, :es)

# Render Spanish text
{:ok, text_es} = Nasty.Rendering.Text.render(doc_es)
IO.puts(text_es)
# Output: El gato corre.

Core Concepts

The AST Structure

Nasty represents text as a tree:

Document
 Paragraph
     Sentence
         Clause
             Subject (NounPhrase)
                Determiner: "The"
                Head: "cat"
             Predicate (VerbPhrase)
                 Head: "runs"

Tokens

Every word is a Token with:

  • text: The actual word ("runs")
  • lemma: Dictionary form ("run")
  • pos_tag: Part of speech (:verb)
  • morphology: Features (%{tense: :present})
  • language: Language code (:en)
  • span: Position in text

Phrases

Phrases group related tokens:

  • NounPhrase: "the big cat"
  • VerbPhrase: "is running quickly"
  • PrepositionalPhrase: "in the house"

The Processing Pipeline

Text  Tokenization  POS Tagging  Morphology  Parsing  AST

Each step enriches the data:

  1. Tokenization: Split into atomic units
  2. POS Tagging: Add grammatical categories
  3. Morphology: Add features (tense, number, etc.)
  4. Parsing: Build hierarchical structure

Common Patterns

Pattern 1: Batch Processing

Process multiple texts efficiently:

alias Nasty.Language.English

texts = [
  "The first sentence.",
  "The second sentence.",
  "The third sentence."
]

results = 
  texts
  |> Task.async_stream(fn text ->
    with {:ok, tokens} <- English.tokenize(text),
         {:ok, tagged} <- English.tag_pos(tokens),
         {:ok, doc} <- English.parse(tagged) do
      {:ok, doc}
    end
  end, max_concurrency: System.schedulers_online())
  |> Enum.to_list()

Pattern 2: Extract Specific Information

Find all nouns in a document:

alias Nasty.Utils.Query

{:ok, doc} = Nasty.parse("The cat and dog play.", language: :en)

# Find all nouns
nouns = Query.find_by_pos(doc, :noun)

Enum.each(nouns, fn token ->
  IO.puts(token.text)
end)
# Output:
# cat
# dog

Pattern 3: Transform Text

Normalize and clean text:

alias Nasty.Utils.Transform

{:ok, doc} = Nasty.parse("The CAT runs QUICKLY!", language: :en)

# Lowercase everything
normalized = Transform.normalize_case(doc, :lower)

# Remove punctuation
no_punct = Transform.remove_punctuation(normalized)

# Render back to text
{:ok, clean_text} = Nasty.render(no_punct)
IO.puts(clean_text)
# Output: the cat runs quickly

Pattern 4: Error Handling

Always handle errors gracefully:

alias Nasty.Language.English

text = "Some text..."

case English.tokenize(text) do
  {:ok, tokens} ->
    case English.tag_pos(tokens) do
      {:ok, tagged} ->
        case English.parse(tagged) do
          {:ok, doc} -> 
            # Success! Process doc
            process_document(doc)
          {:error, reason} ->
            IO.puts("Parse error: #{inspect(reason)}")
        end
      {:error, reason} ->
        IO.puts("Tagging error: #{inspect(reason)}")
    end
  {:error, reason} ->
    IO.puts("Tokenization error: #{inspect(reason)}")
end

Or use with:

with {:ok, tokens} <- English.tokenize(text),
     {:ok, tagged} <- English.tag_pos(tokens),
     {:ok, doc} <- English.parse(tagged) do
  process_document(doc)
else
  {:error, reason} -> 
    IO.puts("Error: #{inspect(reason)}")
end

Language Support

Supported Languages

Nasty currently supports:

  • English (:en) - Fully implemented
  • Spanish (:es) - Fully implemented
  • Catalan (:ca) - Fully implemented

Using Different Languages

Each language has its own module:

# English
alias Nasty.Language.English
{:ok, doc_en} = Nasty.parse("The cat runs.", language: :en)

# Spanish
alias Nasty.Language.Spanish
{:ok, doc_es} = Nasty.parse("El gato corre.", language: :es)

# Catalan
alias Nasty.Language.Catalan
{:ok, doc_ca} = Nasty.parse("El gat corre.", language: :ca)

Language Detection

Auto-detect the language:

{:ok, lang} = Nasty.Language.Registry.detect_language("Hola mundo")
# => {:ok, :es}

{:ok, lang} = Nasty.Language.Registry.detect_language("Hello world")
# => {:ok, :en}

Troubleshooting

Common Issues

Issue 1: Module Not Found

Error:

** (UndefinedFunctionError) function Nasty.Language.English.tokenize/1 is undefined

Solution: Make sure you've compiled the project:

mix deps.get
mix compile

Issue 2: Empty Token List

Problem:

{:ok, []} = English.tokenize("")

Solution: Empty strings return empty token lists. Check your input:

text = String.trim(user_input)
if text != "" do
  English.tokenize(text)
else
  {:error, :empty_input}
end

Issue 3: Parse Errors with Long Sentences

Problem: Very long or complex sentences may fail to parse.

Solution: Split long sentences:

sentences = String.split(text, ~r/[.!?]+/)
|> Enum.map(&String.trim/1)
|> Enum.filter(&(&1 != ""))

Enum.each(sentences, fn sent ->
  {:ok, doc} = Nasty.parse(sent, language: :en)
  # Process doc
end)

Issue 4: Low Entity Recognition

Problem: Named entities not detected.

Solution: Entities depend on lexicons. For specialized domains, you may need to add custom entity patterns or use statistical models:

# Use rule-based (default)
{:ok, tagged} = English.tag_pos(tokens)
entities = EntityRecognizer.recognize(tagged)

# Or use CRF model (better accuracy)
entities = EntityRecognizer.recognize(tagged, model: :crf)

Performance Issues

Slow Processing

If processing is slow:

  1. Use parallel processing for multiple documents
  2. Cache parsed documents to avoid re-parsing
  3. Use simpler models for POS tagging (:rule instead of :neural)
# Fast rule-based tagging
{:ok, tagged} = English.tag_pos(tokens, model: :rule)

# Better accuracy but slower
{:ok, tagged} = English.tag_pos(tokens, model: :hmm)

Getting Help

  • Documentation: Check docs/ for detailed guides
  • Examples: See examples/ for working code
  • Issues: Report bugs on GitHub

Next Steps

Learn More

  1. Read the User Guide: USER_GUIDE.md for comprehensive examples
  2. Explore Examples: EXAMPLES.md for runnable scripts
  3. Understand Architecture: ARCHITECTURE.md for system design
  4. Try Translation: TRANSLATION.md for multilingual features

Try the Examples

Run the example scripts:

# Basic tokenization
elixir examples/tokenizer_example.exs

# Question answering
elixir examples/question_answering.exs

# Translation
elixir examples/translation_example.exs

# Multilingual comparison
elixir examples/multilingual_pipeline.exs

Build Something

Now that you understand the basics, try building:

  1. Text Analyzer: Extract keywords, entities, and sentiment
  2. Translation Tool: Translate documents between languages
  3. Chatbot: Parse user input and generate responses
  4. Content Categorizer: Classify documents by topic
  5. Grammar Checker: Analyze and correct grammatical errors

Advanced Topics

Once comfortable with basics, explore:

  • Statistical Models: Train custom POS taggers
  • Neural Networks: Use BiLSTM-CRF for better accuracy
  • Information Extraction: Extract relations and events
  • Question Answering: Build Q&A systems
  • Custom Grammars: Define domain-specific grammar rules

Quick Reference

Essential Functions

# Parsing
Nasty.parse(text, language: :en)

# Rendering
Nasty.render(ast)

# Translation
Nasty.Translation.Translator.translate(ast, target_language)

# Querying
Nasty.Utils.Query.find_by_pos(doc, :noun)
Nasty.Utils.Query.extract_entities(doc)

# Transformation
Nasty.Utils.Transform.normalize_case(doc, :lower)
Nasty.Utils.Transform.remove_punctuation(doc)

Language Modules

Nasty.Language.English
Nasty.Language.Spanish
Nasty.Language.Catalan

Common Modules

alias Nasty.Language.English
alias Nasty.Translation.Translator
alias Nasty.Utils.{Query, Transform, Traversal}
alias Nasty.Rendering.Text

Summary

You now know how to:

  • ✓ Install and set up Nasty
  • ✓ Parse text into an AST
  • ✓ Extract information from documents
  • ✓ Translate between languages
  • ✓ Handle common issues
  • ✓ Use best practices

Happy parsing! 🚀