Strengths and Limitations
This document provides an honest assessment of Nasty's capabilities, helping you understand where it excels, where it falls short, and how to leverage it effectively in real-world applications.
Table of Contents
- Core Philosophy
- What Nasty Does Best
- Current Limitations
- When to Use Nasty
- When NOT to Use Nasty
- Pretrained Models and Fine-Tuning
- Integration with Ragex
- Practical Recommendations
Core Philosophy
Nasty treats natural language with the same rigor as programming languages, building a complete grammatical Abstract Syntax Tree (AST) for every sentence. This "grammar-first" approach provides:
- Structural Understanding: Deep syntactic analysis beyond surface patterns
- Explainability: Every decision traces back to grammatical rules
- Composability: Transform ASTs using standard tree operations
- Bidirectionality: Convert natural language ↔ code with structural awareness
This is fundamentally different from statistical/neural-only approaches that learn patterns without explicit grammar representation.
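The "transform ASTs using standard tree operations" claim can be illustrated with a toy sketch. Note this is NOT Nasty's actual AST schema; the node shapes (`%{label: ..., children: ...}` and `%{word: ...}`) are invented here purely to show that a parse tree represented as plain data can be rewritten with ordinary recursion:

```elixir
# Toy illustration only: a parse tree as plain Elixir data, rewritten with
# ordinary recursion. Node shapes are hypothetical, not Nasty's real schema.
defmodule AstDemo do
  # A leaf holds a word; a phrase node holds a label and child nodes.
  def map_leaves(%{word: w}, fun), do: %{word: fun.(w)}

  def map_leaves(%{label: label, children: children}, fun) do
    %{label: label, children: Enum.map(children, &map_leaves(&1, fun))}
  end
end

tree = %{label: :np, children: [%{word: "the"}, %{word: "cat"}]}
AstDemo.map_leaves(tree, &String.upcase/1)
# => %{label: :np, children: [%{word: "THE"}, %{word: "CAT"}]}
```

Because the tree is just data, any standard traversal (map, fold, search) composes the same way it would over a code AST.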
What Nasty Does Best
1. Grammatical Analysis and Structure
Strengths:
- Precise phrase structure parsing (NP, VP, PP)
- Accurate dependency extraction (Universal Dependencies)
- Clause detection (coordination, subordination)
- Complex sentence analysis
- Morphological feature extraction
Use Cases:
- Grammar checking and correction
- Language learning applications
- Syntactic pattern extraction
- Template-based text generation
- Controlled natural language interfaces
Example:
# Parse complex sentence structure
text = "The professor who teaches mathematics works at the university."
{:ok, tokens} = English.tokenize(text)
{:ok, tagged} = English.tag_pos(tokens)
{:ok, doc} = English.parse(tagged)
# Extract precise syntactic dependencies
deps = English.DependencyExtractor.extract(doc)
# => Identifies subject, verb, relative clause, prepositional attachment
2. Multi-Language Support with Shared Architecture
Strengths:
- Language-agnostic AST schema
- Consistent API across English, Spanish, Catalan
- Morphological agreement handling
- Language-specific word order rules
- Bidirectional translation preserving structure
Use Cases:
- Multi-language content management
- Cross-lingual information extraction
- Translation with grammatical fidelity
- Language comparison and analysis
Example:
# Parse in one language, render in another
{:ok, doc_en} = English.parse(tagged_en)
{:ok, doc_es} = Translator.translate_document(doc_en, :es)
{:ok, text_es} = Nasty.Rendering.Text.render(doc_es)
# Maintains grammatical structure with proper agreement
3. Code-Natural Language Interoperability
Strengths:
- Intent recognition from imperative sentences
- Natural language → Elixir code generation
- Code → Natural language explanation
- Constraint extraction from comparisons
Use Cases:
- Domain-specific language interfaces
- Query builders from natural language
- Code documentation generation
- Conversational programming interfaces
Example:
# Generate code from natural language
{:ok, code} = Nasty.to_code(
"Filter users where age greater than 21",
source_language: :en,
target_language: :elixir
)
# => "Enum.filter(users, fn u -> u.age > 21 end)"
# Explain code in natural language
{:ok, explanation} = Nasty.explain_code(
"Enum.sort(numbers)",
source_language: :elixir,
target_language: :en
)
# => "Sort numbers"
4. Explainable NLP Pipeline
Strengths:
- Rule-based components provide transparency
- AST structure reveals decision path
- No "black box" transformations
- Debuggable at every stage
Use Cases:
- Regulated industries requiring explainability
- Educational tools showing linguistic analysis
- Research requiring interpretable results
- Debugging NLP pipelines
5. Pure Elixir Implementation
Strengths:
- No Python interop overhead
- Runs entirely in BEAM VM
- Leverage OTP supervision
- Easy deployment with Elixir apps
- No external API dependencies
Use Cases:
- Embedded in Elixir/Phoenix applications
- Serverless/edge deployments
- Offline processing
- Low-latency requirements
Current Limitations
1. Lexical Coverage
Limitations:
- Limited vocabulary in translation lexicons (~300 words per language pair)
- No comprehensive dictionary lookup
- Unknown words pass through untranslated
- Domain-specific terminology requires manual addition
Impact:
- Poor translation quality for specialized text
- Incomplete entity recognition
- Missing lexical semantics
Mitigation:
- Expand lexicons incrementally for your domain
- Use fallback to original word
- Combine with external dictionary APIs
- Contribute domain-specific lexicons
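The "fallback to original word" mitigation can be sketched in a few lines. The lexicon entries and module name below are made up for illustration; Nasty's internal lexicon format may differ:

```elixir
# Minimal lexicon lookup with pass-through fallback. The entries and module
# name are hypothetical examples, not Nasty's internal lexicon format.
defmodule LexiconDemo do
  @en_es %{"house" => "casa", "dog" => "perro"}

  # Unknown words fall back to the original form instead of failing,
  # so out-of-vocabulary text degrades gracefully rather than breaking.
  def translate_word(word), do: Map.get(@en_es, word, word)
end

LexiconDemo.translate_word("dog")        # => "perro"
LexiconDemo.translate_word("serverless") # => "serverless" (pass-through)
```

Expanding coverage for your domain then amounts to merging additional entries into the map before lookup.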
2. Semantic Understanding
Limitations:
- No deep semantic analysis beyond SRL
- Limited world knowledge
- No common-sense reasoning
- Cannot resolve ambiguity requiring external knowledge
- Metaphors and idioms handled literally
Impact:
- Misses implied meanings
- Literal translations of idioms
- Cannot answer questions requiring inference
- Limited context understanding
Mitigation:
- Focus on factual, literal text
- Combine with knowledge bases
- Use for structure, not deep meaning
- Integrate pretrained models for semantics
3. Statistical Model Accuracy
Limitations:
- Rule-based POS tagging: ~85% accuracy
- HMM POS tagging: ~95% accuracy
- Neural POS tagging: 97-98% accuracy (but requires training)
- No pretrained models shipped by default
- Small training datasets limit performance
Impact:
- Errors compound through pipeline
- Complex sentences may parse incorrectly
- Needs domain-specific training for best results
Mitigation:
- Use neural models for critical applications
- Train on domain-specific data
- Implement error correction layers
- Ensemble multiple models
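One way to realize the "ensemble multiple models" mitigation is tag-level majority voting: each tagger emits one tag per token, and the most frequent tag wins at each position. The tag sequences below are invented for illustration:

```elixir
# Tag-level ensembling sketch: majority vote per token position across
# several taggers' outputs. The example sequences are made up.
defmodule EnsembleDemo do
  def vote(tag_sequences) do
    tag_sequences
    # Zip the sequences so each tuple holds all models' tags for one token.
    |> Enum.zip()
    |> Enum.map(fn tags ->
      tags
      |> Tuple.to_list()
      |> Enum.frequencies()
      |> Enum.max_by(fn {_tag, count} -> count end)
      |> elem(0)
    end)
  end
end

EnsembleDemo.vote([
  [:det, :noun, :verb],   # e.g. rule-based tagger
  [:det, :verb, :verb],   # e.g. HMM tagger
  [:det, :noun, :verb]    # e.g. neural tagger
])
# => [:det, :noun, :verb]
```

In practice you might weight votes by each model's measured accuracy rather than counting them equally.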
4. Parsing Incomplete Sentences
Limitations:
- Expects complete, well-formed sentences
- Fragments may fail to parse
- Informal text (tweets, chat) challenging
- Heavy reliance on punctuation
- Assumes standard grammar
Impact:
- Cannot process conversational text well
- Social media text needs preprocessing
- Bullet points and lists problematic
Mitigation:
- Preprocess text to add punctuation
- Use fallback parsing modes
- Detect fragments and handle separately
- Normalize text before parsing
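The normalization and fragment-detection mitigations can be as simple as the sketch below. The heuristics (terminal punctuation, a three-token threshold) are arbitrary placeholders you would tune for your data:

```elixir
# Naive preprocessing sketch for the mitigations above. Both heuristics are
# arbitrary placeholders, not part of Nasty's API.
defmodule Preprocess do
  # Ensure the text ends with terminal punctuation so the parser sees a
  # complete sentence.
  def normalize(text) do
    text = String.trim(text)
    if String.ends_with?(text, [".", "!", "?"]), do: text, else: text <> "."
  end

  # Crude fragment heuristic: fewer than three whitespace-separated tokens.
  def fragment?(text), do: length(String.split(text)) < 3
end

Preprocess.normalize("users love caching")  # => "users love caching."
Preprocess.fragment?("ok thanks")           # => true
```

Detected fragments can then be routed to tokenization and POS tagging only, skipping full parsing.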
5. Computational Cost
Limitations:
- Full parsing is CPU-intensive
- Neural models require significant memory
- Not optimized for real-time streaming
- Large documents can be slow
Impact:
- Latency for interactive applications
- Resource requirements for batch processing
- EXLA compilation overhead on first run
Mitigation:
- Cache parsed results
- Use rule-based models for speed
- Process in batches
- Profile and optimize hot paths
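Batch processing maps naturally onto `Task.async_stream`, which bounds concurrency to the number of schedulers. The `parse` function here is a stand-in for the real (CPU-heavy) parse call:

```elixir
# Batching sketch: process documents concurrently with bounded workers.
# `parse` is a placeholder for the real Nasty parse call.
parse = fn text -> String.upcase(text) end

docs = ["first doc", "second doc", "third doc"]

results =
  docs
  |> Task.async_stream(parse, max_concurrency: System.schedulers_online())
  |> Enum.map(fn {:ok, parsed} -> parsed end)
# => ["FIRST DOC", "SECOND DOC", "THIRD DOC"]
```

Results come back in input order by default, so downstream code can zip them against the original documents.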
6. Rendering Quality
Limitations:
- Generated text can be unnatural
- May lose stylistic nuances
- Sentence simplification during parsing
- Limited paraphrasing capability
Impact:
- Roundtrip translation shows drift
- Generated summaries feel mechanical
- Cannot match human fluency
Mitigation:
- Use for technical/formal text
- Post-process generated text
- Combine with template systems
- Set user expectations appropriately
When to Use Nasty
Ideal Use Cases
Structured Text Processing
- Technical documentation analysis
- Legal/medical text with formal grammar
- Controlled language systems
- Template-based generation
Multi-Language Applications
- Content management systems
- Documentation translation
- Language learning tools
- Cross-lingual search
Code-NL Integration
- DSL interfaces
- Query builders
- Documentation generation
- Programming assistants
Grammatical Analysis
- Grammar checking
- Style analysis
- Linguistic research
- Educational applications
Elixir-Native NLP
- Phoenix applications
- Embedded NLP in Elixir apps
- No Python interop needed
- Offline processing
Best Practices
- Start Simple: Begin with tokenization and POS tagging before full parsing
- Validate Results: Check parse quality on your specific domain
- Train Models: Fine-tune on domain-specific data for best accuracy
- Combine Approaches: Use rule-based for speed, neural for accuracy
- Handle Errors: Implement fallbacks for parsing failures
- Expand Lexicons: Add domain vocabulary incrementally
- Cache Results: Parse once, reuse AST multiple times
When NOT to Use Nasty
Poor Fit Scenarios
Semantic-Heavy Tasks
- Open-domain question answering
- Sentiment analysis (use dedicated models)
- Topic modeling
- Intent classification (without training)
Informal Text
- Social media analysis
- Chat/messaging data
- Fragmented text
- Heavy slang/abbreviations
Real-Time Streaming
- Low-latency chat bots
- Real-time speech processing
- High-throughput pipelines
- Sub-millisecond requirements
Domain Without Training Data
- Highly specialized jargon
- Mixed language text
- Code-switched text
- Non-standard dialects
Better Alternatives
- For Semantic Tasks: Use transformer models (BERT, RoBERTa) via Bumblebee
- For Sentiment: Hugging Face models or cloud APIs
- For Summarization: Neural abstractive models
- For Translation: Neural MT systems (DeepL, Google Translate)
- For Casual Text: spaCy, NLTK with robust tokenization
Pretrained Models and Fine-Tuning
Current State
Nasty provides infrastructure for neural models but does NOT ship with pretrained models:
# Neural POS tagging requires training
tagger = NeuralTagger.new(vocab_size: 10000, num_tags: 17)
{:ok, trained} = NeuralTagger.train(tagger, training_data, epochs: 10)
Integration Opportunities
1. Bumblebee for Embeddings and Transformers
Strategy: Use Bumblebee models for semantic tasks, Nasty for syntax
# Bumblebee for embeddings
{:ok, model} = Bumblebee.load_model({:hf, "sentence-transformers/all-MiniLM-L6-v2"})
embeddings = Bumblebee.apply(model, text)
# Nasty for structure
{:ok, doc} = Nasty.parse(text, language: :en)
deps = English.DependencyExtractor.extract(doc)
# Combine: embeddings for similarity, AST for structure
Benefits:
- Leverage pretrained semantic understanding
- Keep grammatical precision from Nasty
- Best of both worlds
2. Fine-Tune Neural POS Tagger
Strategy: Train Nasty's BiLSTM-CRF on domain data
# Download Universal Dependencies corpus
wget https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-3226/ud-treebanks-v2.6.tgz
# Extract English corpus
tar -xzf ud-treebanks-v2.6.tgz
cd UD_English-EWT
# Train neural tagger
mix nasty.train.neural_pos \
--corpus en_ewt-ud-train.conllu \
--test en_ewt-ud-test.conllu \
--output priv/models/en/pos_neural.axon \
--epochs 10
Expected Results:
- 97-98% accuracy on standard benchmarks
- Domain adaptation with ~1000 annotated sentences
- 5-10x slower than rule-based
3. Hybrid Architecture
Recommendation: Combine Nasty syntax with pretrained semantics
defmodule HybridNLP do
def analyze(text) do
# Nasty for syntax
{:ok, doc} = Nasty.parse(text, language: :en)
# Extract sentences
sentences = doc.paragraphs |> Enum.flat_map(& &1.sentences)
# Bumblebee for semantic embeddings
sentence_texts = Enum.map(sentences, &Nasty.Rendering.Text.render/1)
embeddings = embed_with_bumblebee(sentence_texts)
# Combine structural and semantic features
%{
syntax: doc,
dependencies: English.DependencyExtractor.extract(doc),
entities: English.EntityRecognizer.recognize(doc),
embeddings: embeddings
}
end
end
Integration with Ragex
Ragex is a hybrid retrieval-augmented generation system for code analysis. Nasty and Ragex complement each other well:
Architecture Alignment
| Component | Ragex | Nasty |
|---|---|---|
| Input | Programming Language Code | Natural Language Text |
| AST | Code AST (Elixir/Erlang/Python) | Grammar AST (sentences/phrases) |
| Analysis | Functions, modules, calls | Entities, relations, structure |
| Search | Semantic + symbolic code search | Linguistic pattern matching |
| Output | Code understanding, refactoring | NL generation, translation |
Complementary Strengths
Code Documentation Pipeline
# Ragex: Analyze code structure
{:ok, analysis} = Ragex.analyze_file("lib/my_module.ex")
# Nasty: Generate natural language documentation
function_docs = Enum.map(analysis.functions, fn func ->
  # Convert function AST to natural language
  Nasty.explain_code(func.ast, source_language: :elixir, target_language: :en)
end)
Natural Language Code Search
# User query in natural language
query = "function that validates email addresses"
# Nasty: Parse query to extract intent
{:ok, intent} = Nasty.Interop.IntentRecognizer.recognize_from_text(query, language: :en)
# => %Intent{action: "validate", target: "email addresses"}
# Ragex: Semantic search in codebase
{:ok, results} = Ragex.semantic_search(query)
# => [%{node_id: "MyModule.validate_email/1", similarity: 0.85}]
Bidirectional DSL
# Natural language → Code (Nasty)
{:ok, code} = Nasty.to_code("filter users by age", target_language: :elixir)
# Code → Natural language (Nasty)
{:ok, explanation} = Nasty.explain_code(code, target_language: :en)
# Code analysis and refactoring (Ragex)
{:ok, impact} = Ragex.find_function_impact("filter_users/1")
Enhanced Code Understanding
# Ragex: Extract code structure
{:ok, graph} = Ragex.analyze_directory("lib/")
# Nasty: Generate architectural summaries in natural language
summary =
  graph.modules
  |> Enum.map(&describe_module/1)
  |> Enum.join("\n\n")
# => "The MyApp.Users module manages user accounts. It depends on..."
Concrete Integration Patterns
Pattern 1: Intelligent Code Comments
defmodule CodeCommentGenerator do
def generate_comment(function_ast) do
# Ragex: Analyze function structure and dependencies
{:ok, analysis} = Ragex.analyze_function(function_ast)
# Extract: name, parameters, return type, calls
template = build_template(analysis)
# Nasty: Generate natural language from template
{:ok, comment} = Nasty.render_template(template, language: :en)
# => "@doc \"\"\"\n Validates email addresses using regex pattern.\n
# Returns {:ok, email} or {:error, reason}.\n \"\"\""
end
end
Pattern 2: Conversational Code Search
defmodule ConversationalSearch do
def search_codebase(user_query) do
# Nasty: Parse natural language query
{:ok, doc} = Nasty.parse(user_query, language: :en)
{:ok, intent} = Nasty.Interop.IntentRecognizer.recognize(doc)
# Extract search criteria from linguistic structure
criteria = %{
action: intent.action, # "parse", "validate", "transform"
target: intent.target, # "JSON", "email", "user input"
constraints: intent.constraints # "where error handling exists"
}
# Ragex: Hybrid search with semantic + symbolic
{:ok, results} = Ragex.hybrid_search(
query: user_query,
strategy: :fusion,
graph_filter: build_filter(criteria)
)
# Nasty: Explain results in natural language
explanations = Enum.map(results, fn result ->
Nasty.explain_code(result.code, target_language: :en)
end)
{results, explanations}
end
end
Pattern 3: Automated Documentation Generation
defmodule AutoDocumentation do
def document_module(module_path) do
# Ragex: Extract module structure
{:ok, analysis} = Ragex.analyze_file(module_path)
doc_sections = [
# Module overview
overview: generate_overview(analysis.module),
# Function descriptions (Nasty NL generation)
functions: Enum.map(analysis.functions, fn func ->
%{
signature: func.name,
description: explain_function(func),
examples: generate_examples(func),
see_also: find_related(func) # via Ragex graph
}
end),
# Dependencies (Ragex graph analysis)
dependencies: Ragex.get_dependencies(analysis.module),
# Usage patterns (combined)
patterns: extract_usage_patterns(analysis)
]
# Render as markdown
render_markdown(doc_sections)
end
end
When to Use Both Together
Use Ragex + Nasty when:
- Building code documentation systems
- Creating conversational code search
- Generating technical writing from code
- Building DSLs with natural language interfaces
- Developing AI coding assistants
- Analyzing and explaining codebases
Architecture Pattern:
User Query (Natural Language)
↓
[Nasty] Parse and understand intent
↓
[Ragex] Search codebase with semantic + symbolic
↓
[Nasty] Explain results in natural language
↓
User Response (Natural Language + Code)
Practical Recommendations
For Production Systems
Start with Rule-Based Models
- Fastest performance
- No training required
- Good for well-formed text
- Upgrade to neural when needed
Implement Robust Error Handling
case Nasty.parse(text, language: :en) do
  {:ok, doc} ->
    process_document(doc)

  {:error, {:parse_incomplete, _}} ->
    # Fallback: simpler analysis
    {:ok, tokens} = English.tokenize(text)
    {:ok, tagged} = English.tag_pos(tokens)
    process_tokens(tagged)

  {:error, reason} ->
    Logger.warn("Parse failed: #{inspect(reason)}")
    {:error, :parse_failed}
end
Cache Parsed Results
defmodule DocumentCache do
  use Agent

  # Start the cache (e.g. from your supervision tree) with an empty map.
  def start_link(_opts \\ []) do
    Agent.start_link(fn -> %{} end, name: __MODULE__)
  end

  def get_or_parse(text) do
    case Agent.get(__MODULE__, &Map.get(&1, cache_key(text))) do
      nil ->
        {:ok, doc} = Nasty.parse(text, language: :en)
        Agent.update(__MODULE__, &Map.put(&1, cache_key(text), doc))
        doc

      cached ->
        cached
    end
  end

  # Hash the text so large documents are not used directly as map keys.
  defp cache_key(text), do: :erlang.phash2(text)
end
Monitor Performance
- Track parse times
- Measure accuracy on test set
- A/B test rule vs. neural models
- Profile memory usage
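Tracking parse times needs nothing heavier than `:timer.tc`. The wrapper below is a sketch; `Instrument` and its `timed/2` function are names invented here, and the `Enum.sum` call stands in for a real pipeline stage:

```elixir
# Lightweight timing sketch: wrap a pipeline stage with :timer.tc and log
# the elapsed time. Module and function names are hypothetical.
defmodule Instrument do
  require Logger

  def timed(label, fun) do
    # :timer.tc returns {elapsed_microseconds, result}.
    {micros, result} = :timer.tc(fun)
    Logger.info("#{label} took #{micros / 1000} ms")
    result
  end
end

Instrument.timed("parse", fn -> Enum.sum(1..1000) end)
# => 500500 (after logging the elapsed time)
```

Emitting these measurements to your telemetry system instead of the log makes A/B comparisons between rule-based and neural models straightforward.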
Plan for Incremental Improvement
- Start with basic tokenization
- Add POS tagging when needed
- Full parsing for critical features
- Neural models for accuracy
For Development
Use Examples as Learning Tools
- Run all examples to understand capabilities
- Modify examples for your use case
- Check docs/ for detailed guides
Inspect AST Structure
{:ok, doc} = Nasty.parse(text, language: :en)
IO.inspect(doc, limit: :infinity, pretty: true)
Test on Your Domain
- Collect representative samples
- Measure accuracy
- Identify common failure modes
- Extend lexicons accordingly
Contribute Back
- Report bugs with examples
- Submit lexicon additions
- Share domain-specific improvements
- Document your use cases
Conclusion
Nasty excels at grammatical analysis and structural manipulation of natural language, making it ideal for applications requiring linguistic precision, multi-language support, and code interoperability. Its grammar-first approach provides explainability and composability that purely statistical systems lack.
However, Nasty is NOT a general-purpose NLP solution. It has limited lexical coverage, shallow semantic understanding, and requires careful domain adaptation. For best results:
- Use Nasty for structure, pretrained models for semantics
- Combine with Ragex for code-related tasks
- Train on domain-specific data
- Implement robust fallbacks
- Set realistic expectations
The future of Nasty lies in hybrid architectures that combine its grammatical rigor with the semantic power of pretrained transformers, creating systems that understand both the form and meaning of natural language.