Translation System Guide

A guide to Nasty's AST-based system for translating natural language between English, Spanish, and Catalan.

Table of Contents

Overview
Architecture
Quick Start
Core Components
Translation Pipeline
Morphological Agreement
Word Order Rules
Lexicon Management
Supported Language Pairs
Customization
Best Practices
Limitations

Overview

Nasty's translation system operates at the Abstract Syntax Tree (AST) level, providing grammatically-aware translation that preserves linguistic structure. Unlike token-by-token machine translation, this approach:

Preserves grammatical relationships
Applies morphological agreement rules
Handles language-specific word order
Supports bidirectional translation
Enables roundtrip translation with minimal loss

Architecture

System Diagram

flowchart TD
    A["Source Text<br/>(Language A)"]
    B["Parse to AST<br/>(Source Lang)"]
    C["AST Transform<br/>(Structural)"] -.-> C1[ASTTransformer]
    D["Token Translate<br/>(Lemma mapping)"] -.-> D1[TokenTranslator]
    E["Agreement<br/>(Morphology)"] -.-> E1[Agreement]
    F["Word Order<br/>(Reordering)"] -.-> F1[WordOrder]
    G["Render to Text<br/>(Target Lang)"] -.-> G1[AST.Renderer]
    H["Target Text<br/>(Language B)"]
    
    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
    F --> G
    G --> H
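
The stages in the diagram chain naturally when each one returns `{:ok, value}` or `{:error, reason}`. The following is a minimal sketch of that composition, not Nasty's implementation: `PipelineSketch`, its three-word lexicon, and its private stage functions are all hypothetical, and the agreement and word-order stages are left out to keep it short.

```elixir
defmodule PipelineSketch do
  # Hypothetical stand-in for the real pipeline. Every stage returns
  # {:ok, value} or {:error, reason}, so `with` stops at the first failure.
  @lexicon %{"the" => "el", "cat" => "gato", "runs" => "corre"}

  def translate(text) do
    with {:ok, tokens} <- parse(text),
         {:ok, translated} <- translate_tokens(tokens) do
      render(translated)
    end
  end

  # "Parse": strip the final period, lowercase, split on whitespace.
  defp parse(text) do
    words =
      text
      |> String.trim_trailing(".")
      |> String.downcase()
      |> String.split()

    if words == [], do: {:error, :empty_input}, else: {:ok, words}
  end

  # "Token translate": look up every word, halting on the first unknown one.
  defp translate_tokens(tokens) do
    tokens
    |> Enum.reduce_while({:ok, []}, fn token, {:ok, acc} ->
      case Map.fetch(@lexicon, token) do
        {:ok, target} -> {:cont, {:ok, [target | acc]}}
        :error -> {:halt, {:error, {:unknown_word, token}}}
      end
    end)
    |> case do
      {:ok, reversed} -> {:ok, Enum.reverse(reversed)}
      error -> error
    end
  end

  # "Render": join the words, capitalize the sentence, restore the period.
  defp render(tokens) do
    sentence = tokens |> Enum.join(" ") |> String.capitalize()
    {:ok, sentence <> "."}
  end
end
```

Calling `PipelineSketch.translate("The cat runs.")` returns `{:ok, "El gato corre."}`, while an unknown word such as "dog" yields `{:error, {:unknown_word, "dog"}}` without running the render stage.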

Module Structure

graph TD
    Root[lib/]
    Trans[translation/]
    AST[ast/]
    Priv[priv/]
    TransSub[translation/]
    Lex[lexicons/]
    
    Root --> Trans
    Root --> AST
    Root --> Priv
    
    Trans --> T1[translator.ex<br/>Main API]
    Trans --> T2[ast_transformer.ex<br/>AST node transformation]
    Trans --> T3[token_translator.ex<br/>Token-level translation]
    Trans --> T4[agreement.ex<br/>Morphological agreement]
    Trans --> T5[word_order.ex<br/>Word order rules]
    Trans --> T6[lexicon_loader.ex<br/>Lexicon management]
    
    AST --> A1[renderer.ex<br/>AST to text rendering]
    
    Priv --> TransSub
    TransSub --> Lex
    Lex --> L1[en_es.exs<br/>English → Spanish]
    Lex --> L2[es_en.exs<br/>Spanish → English]
    Lex --> L3[en_ca.exs<br/>English → Catalan]
    Lex --> L4[ca_en.exs<br/>Catalan → English]

Quick Start

Basic Translation

alias Nasty.Language.{English, Spanish}
alias Nasty.Translation.Translator

# English to Spanish
{:ok, doc_en} = Nasty.parse("The cat runs.", language: :en)
{:ok, doc_es} = Translator.translate(doc_en, :es)
{:ok, text_es} = Nasty.render(doc_es)
IO.puts(text_es)
# => "El gato corre."

# Spanish to English
{:ok, doc_es} = Nasty.parse("El perro grande.", language: :es)
{:ok, doc_en} = Translator.translate(doc_es, :en)
{:ok, text_en} = Nasty.render(doc_en)
IO.puts(text_en)
# => "The big dog."

Using the High-Level API

# Translate text directly
{:ok, text_es} = Nasty.translate_text("The quick cat.", :en, :es)
# => "El gato rápido."

# Or with explicit parsing
{:ok, ast} = Nasty.parse("The house is big.", language: :en)
{:ok, translated_ast} = Nasty.translate(ast, :es)
{:ok, text} = Nasty.render(translated_ast)

Core Components

1. ASTTransformer

Transforms AST nodes between language structures.

Module: Nasty.Translation.ASTTransformer

Functions:

transform_document/2 - Transform entire document
transform_sentence/2 - Transform sentence
transform_phrase/2 - Transform phrase structures
transform_clause/2 - Transform clause

Example:

alias Nasty.Translation.ASTTransformer

{:ok, spanish_doc} = ASTTransformer.transform_document(english_doc, :es)

2. TokenTranslator

Performs lemma-to-lemma translation with POS awareness.

Module: Nasty.Translation.TokenTranslator

Functions:

translate_token/3 - Translate single token
translate_with_morphology/3 - Translate preserving morphology
lookup_translation/3 - Lookup in lexicon

Example:

alias Nasty.Translation.TokenTranslator

# cat (noun) → gato (noun)
translated = TokenTranslator.translate_token(token, :en, :es)

# Preserves morphology
# cats (noun, plural) → gatos (noun, plural)
translated = TokenTranslator.translate_with_morphology(token, :en, :es)
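
To illustrate what POS-aware lemma translation with preserved morphology can look like, here is a small sketch. The token shape (`%{lemma: ..., pos: ..., number: ...}`), the module name `TokenSketch`, and the inline lexicon are assumptions for the example; Nasty's real token struct may differ.

```elixir
defmodule TokenSketch do
  # Hypothetical lexicon keyed by POS tag, then by source lemma.
  @lexicon %{
    noun: %{"cat" => "gato", "house" => "casa"},
    verb: %{"run" => "correr"}
  }

  # Replace only the lemma; every other field (number, etc.) is kept so
  # that later stages can inflect the target lemma.
  def translate(%{lemma: lemma, pos: pos} = token) do
    case get_in(@lexicon, [pos, lemma]) do
      nil -> {:error, {:no_translation, lemma, pos}}
      target -> {:ok, %{token | lemma: target}}
    end
  end
end
```

Because the lookup is keyed by POS first, "cat" tagged as a noun resolves to "gato", while the same string tagged as a verb reports a missing translation instead of silently reusing the noun entry.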

3. Agreement

Enforces morphological agreement rules (gender, number, person).

Module: Nasty.Translation.Agreement

Functions:

apply_agreement/2 - Apply all agreement rules
apply_determiner_noun/2 - Determiner-noun agreement
apply_noun_adjective/2 - Noun-adjective agreement
apply_subject_verb/2 - Subject-verb agreement

Example:

alias Nasty.Translation.Agreement

# Ensure "el gato" (masculine) not "la gato"
adjusted = Agreement.apply_agreement(tokens, :es)

# Ensure "los gatos grandes" (plural agreement throughout)
adjusted = Agreement.apply_agreement(tokens, :es)

4. WordOrder

Applies language-specific word order transformations.

Module: Nasty.Translation.WordOrder

Functions:

apply_order/2 - Apply all word order rules
apply_adjective_order/2 - Position adjectives correctly
apply_svo_order/2 - Subject-Verb-Object ordering
handle_clitics/2 - Clitic placement

Example:

alias Nasty.Translation.WordOrder

# "the big house" → "la casa grande" (adjective after noun)
ordered = WordOrder.apply_order(phrase, :es)

# "I eat it" → "Lo como" (clitic before verb in Spanish)
ordered = WordOrder.handle_clitics(phrase, :es)
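
Adjective reordering can be pictured as splitting the modifiers into those that stay in front of the noun and those that move behind it. The sketch below assumes a phrase represented as a plain map and a hypothetical `@prenominal` set; neither reflects Nasty's internal data structures.

```elixir
defmodule AdjectiveOrderSketch do
  # Hypothetical set of Spanish adjectives that usually precede the noun.
  @prenominal MapSet.new(["buen", "mal", "gran", "primer", "primera"])

  # Input: a phrase already translated word by word, still in English order.
  # Output: the words in Spanish order (pre-nominal adjectives, noun, rest).
  def reorder(%{det: det, adjectives: adjectives, noun: noun}) do
    {before_noun, after_noun} =
      Enum.split_with(adjectives, &MapSet.member?(@prenominal, &1))

    [det] ++ before_noun ++ [noun] ++ after_noun
  end
end
```

For example, `AdjectiveOrderSketch.reorder(%{det: "la", adjectives: ["grande"], noun: "casa"})` gives `["la", "casa", "grande"]`, whereas "buen" stays where it is and yields `["el", "buen", "libro"]`.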

5. LexiconLoader

Manages bidirectional lexicons with ETS caching for fast lookup.

Module: Nasty.Translation.LexiconLoader

Functions:

load/2 - Load lexicon for language pair
lookup/3 - Look up translation
reload/2 - Reload lexicon from file

Example:

alias Nasty.Translation.LexiconLoader

# Load lexicon (cached in ETS)
{:ok, lexicon} = LexiconLoader.load(:en, :es)

# Bidirectional lookup
"gato" = LexiconLoader.lookup(lexicon, "cat", :noun)
"cat" = LexiconLoader.lookup(lexicon, "gato", :noun)

# Reload after editing lexicon file
LexiconLoader.reload(:en, :es)
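
For readers unfamiliar with ETS, the following sketch shows one way a bidirectional lexicon cache could be built on it. The table name, the key shape `{direction, pos, word}`, and the module `LexiconCacheSketch` are assumptions made for illustration; the actual layout used by LexiconLoader is not documented here.

```elixir
defmodule LexiconCacheSketch do
  # Hypothetical table name for this example only.
  @table :lexicon_cache_sketch

  # entries: a list of {pos, source_word, target_word} tuples.
  def load(entries) do
    # Create the named table once; later calls reuse it.
    if :ets.whereis(@table) == :undefined do
      :ets.new(@table, [:named_table, :set, :public, read_concurrency: true])
    end

    # Store each pair in both directions so one table serves both lookups.
    Enum.each(entries, fn {pos, source, target} ->
      :ets.insert(@table, {{:forward, pos, source}, target})
      :ets.insert(@table, {{:reverse, pos, target}, source})
    end)

    :ok
  end

  def lookup(direction, pos, word) do
    case :ets.lookup(@table, {direction, pos, word}) do
      [{_key, translation}] -> {:ok, translation}
      [] -> :error
    end
  end
end
```

After `LexiconCacheSketch.load([{:noun, "cat", "gato"}])`, both `lookup(:forward, :noun, "cat")` and `lookup(:reverse, :noun, "gato")` are served from the same table without re-reading any file.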

6. AST.Renderer

Renders AST back to natural language text.

Module: Nasty.AST.Renderer

Functions:

render_document/1 - Render complete document
render_sentence/1 - Render single sentence
render_phrase/1 - Render phrase
render_tokens/1 - Render token sequence

Example:

alias Nasty.AST.Renderer

# Render with proper spacing and punctuation
{:ok, text} = Renderer.render_document(document)

# Render phrase
{:ok, text} = Renderer.render_phrase(noun_phrase)
# => "el gato grande"
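
"Proper spacing and punctuation" mostly means joining words with single spaces while attaching closing punctuation to the preceding word. A rough sketch of that rule follows; it works on a flat list of strings rather than on Nasty's AST, and `RenderSketch` is a hypothetical name.

```elixir
defmodule RenderSketch do
  @closing_punctuation [".", ",", "!", "?", ";", ":"]

  # Join tokens with single spaces, attach closing punctuation directly to
  # the preceding word, then uppercase the first character of the result.
  def render(tokens) do
    tokens
    |> Enum.reduce("", fn
      token, "" -> token
      token, acc when token in @closing_punctuation -> acc <> token
      token, acc -> acc <> " " <> token
    end)
    |> capitalize_first()
  end

  defp capitalize_first(""), do: ""

  defp capitalize_first(text) do
    {first, rest} = String.split_at(text, 1)
    String.upcase(first) <> rest
  end
end
```

The sketch uppercases only the first character instead of calling `String.capitalize/1`, because the latter would also lowercase the rest of the sentence, including proper nouns.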

Translation Pipeline

Step-by-Step Process

1. Parse Source Text

alias Nasty.Language.English

text = "The quick brown fox jumps."
{:ok, tokens} = English.tokenize(text)
{:ok, tagged} = English.tag_pos(tokens)
{:ok, doc} = English.parse(tagged)

AST Structure:

graph TD
    Doc["Document (language: :en)"]
    Para[Paragraph]
    Sent[Sentence]
    Clause[Clause]
    Subj["Subject: NounPhrase"]
    Det["Determiner: 'The'"]
    Mod["Modifiers: ['quick', 'brown']"]
    Head1["Head: 'fox'"]
    Pred["Predicate: VerbPhrase"]
    Head2["Head: 'jumps'"]
    
    Doc --> Para
    Para --> Sent
    Sent --> Clause
    Clause --> Subj
    Clause --> Pred
    Subj --> Det
    Subj --> Mod
    Subj --> Head1
    Pred --> Head2

2. Transform AST Structure

alias Nasty.Translation.ASTTransformer

{:ok, doc_es} = ASTTransformer.transform_document(doc, :es)

This step relabels the tree, changing language: :en to language: :es throughout.

3. Translate Tokens

alias Nasty.Translation.TokenTranslator

# For each token in AST:
# "fox" (noun) → "zorro" (noun)
# "jumps" (verb) → "salta" (verb)

4. Apply Agreement

alias Nasty.Translation.Agreement

# Ensure gender/number agreement:
# "el" (masculine singular) + "zorro" (masculine singular) ✓
# "los" (masculine plural) + "zorros" (masculine plural) ✓

5. Apply Word Order

alias Nasty.Translation.WordOrder

# "the quick brown fox" → "el zorro rápido pardo"
# (adjectives after noun in Spanish for most adjectives)

6. Render to Text

alias Nasty.AST.Renderer

{:ok, text} = Renderer.render_document(doc_es)
# => "El zorro rápido pardo salta."

Morphological Agreement

Gender Agreement

Spanish and Catalan have grammatical gender (masculine/feminine).

Determiner-Noun:

# English: "the cat"
# Spanish: "el gato" (masculine)

# English: "the house"
# Spanish: "la casa" (feminine)

Noun-Adjective:

# English: "the red car"
# Spanish: "el carro rojo" (masculine)

# English: "the red house"
# Spanish: "la casa roja" (feminine)

Number Agreement

Determiners, nouns, and adjectives must agree in number.

# English: "the cats"
# Spanish: "los gatos" (plural)

# English: "the big cats"
# Spanish: "los gatos grandes" (plural throughout)
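
The gender and number rules above can be combined in a few lines. This sketch covers only the definite article, regular plurals, and adjectives ending in -o; the module name `AgreementSketch` and the noun map shape are assumptions, and irregular forms are deliberately ignored.

```elixir
defmodule AgreementSketch do
  # Definite articles keyed by {gender, number}.
  @articles %{
    {:masculine, :singular} => "el",
    {:feminine, :singular} => "la",
    {:masculine, :plural} => "los",
    {:feminine, :plural} => "las"
  }

  # Build "article noun adjective" with all three agreeing with the noun.
  def noun_phrase(%{lemma: noun, gender: gender}, adjective, number) do
    [
      Map.fetch!(@articles, {gender, number}),
      pluralize(noun, number),
      adjective |> inflect_gender(gender) |> pluralize(number)
    ]
    |> Enum.join(" ")
  end

  # Adjectives ending in -o take -a in the feminine; others are unchanged.
  defp inflect_gender(adjective, :feminine) do
    if String.ends_with?(adjective, "o") do
      String.replace_suffix(adjective, "o", "a")
    else
      adjective
    end
  end

  defp inflect_gender(adjective, :masculine), do: adjective

  # Regular plurals only: -s after an unstressed vowel, -es otherwise.
  defp pluralize(word, :singular), do: word

  defp pluralize(word, :plural) do
    if String.match?(word, ~r/[aeiou]$/), do: word <> "s", else: word <> "es"
  end
end
```

With a masculine noun, `AgreementSketch.noun_phrase(%{lemma: "gato", gender: :masculine}, "grande", :plural)` produces "los gatos grandes"; with a feminine one, "rojo" becomes "roja" as in "la casa roja".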

Person Agreement

Subject-verb agreement by grammatical person.

# English: "I run"
# Spanish: "Yo corro" (first person singular)

# English: "They run"
# Spanish: "Ellos corren" (third person plural)
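
Person agreement amounts to choosing a verb ending from the subject's person and number. As a minimal illustration, the sketch below conjugates regular Spanish -er verbs in the present indicative; `ConjugationSketch` is a hypothetical module, and other conjugation classes and irregular verbs are out of scope.

```elixir
defmodule ConjugationSketch do
  # Present-indicative endings for regular -er verbs, keyed by {person, number}.
  @er_endings %{
    {1, :singular} => "o",
    {2, :singular} => "es",
    {3, :singular} => "e",
    {1, :plural} => "emos",
    {2, :plural} => "éis",
    {3, :plural} => "en"
  }

  def conjugate(infinitive, person, number) do
    if String.ends_with?(infinitive, "er") do
      stem = String.replace_suffix(infinitive, "er", "")
      {:ok, stem <> Map.fetch!(@er_endings, {person, number})}
    else
      # -ar and -ir verbs would need their own ending tables.
      {:error, :unsupported_conjugation}
    end
  end
end
```

`ConjugationSketch.conjugate("correr", 1, :singular)` returns `{:ok, "corro"}` and `conjugate("correr", 3, :plural)` returns `{:ok, "corren"}`, matching the two examples above.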

Word Order Rules

SVO vs. SOV

English, Spanish, and Catalan all use Subject-Verb-Object (SVO) as their basic order, so none of the supported pairs needs an SVO-to-SOV conversion at the clause level:

# English: "The cat eats fish."
# Spanish: "El gato come pescado."
# Catalan: "El gat menja peix."

Adjective Position

English: Adjectives before nouns

"the red car"
"the big house"

Spanish/Catalan: Most adjectives after nouns

"el carro rojo" (the car red)
"la casa grande" (the house big)

Exceptions: Some adjectives usually go before the noun

"el buen libro" (the good book) - "el libro bueno" is possible but marked
"la primera vez" (the first time) - "la vez primera" is possible but rare

Clitic Placement

Spanish clitic pronouns (lo, la, me, te, se) are placed relative to the verb: before a conjugated verb, attached to the end of an infinitive:

# English: "I see it"
# Spanish: "Lo veo" (clitic before conjugated verb)

# English: "I want to see it"
# Spanish: "Quiero verlo" (clitic after infinitive)
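
The two placements shown above reduce to one decision: is the last verb in the group finite or an infinitive? The sketch below encodes just that decision. The verb map shape (`%{form: ..., finite: ...}`) and the module `CliticSketch` are assumptions, and written accents that appear when clitics attach to longer forms are not handled.

```elixir
defmodule CliticSketch do
  # verbs: the verb group in order, e.g. a finite verb followed by an
  # infinitive. The clitic goes before a finite verb (proclitic) or is
  # attached to the end of a final infinitive (enclitic).
  def place(clitic, [_ | _] = verbs) do
    {leading, [last]} = Enum.split(verbs, -1)
    leading_forms = Enum.map(leading, & &1.form)

    if last.finite do
      Enum.join([clitic | leading_forms] ++ [last.form], " ")
    else
      Enum.join(leading_forms ++ [last.form <> clitic], " ")
    end
  end
end
```

Given `[%{form: "veo", finite: true}]` the result is "lo veo"; given "quiero" followed by the infinitive "ver" it is "quiero verlo".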

Lexicon Management

Lexicon Format

Lexicons are Elixir maps organized by POS tag:

# priv/translation/lexicons/en_es.exs
%{
  noun: %{
    "cat" => "gato",
    "house" => "casa",
    "book" => "libro"
  },
  verb: %{
    "run" => "correr",
    "eat" => "comer",
    "sleep" => "dormir"
  },
  adj: %{
    "big" => "grande",
    "red" => "rojo",
    "quick" => "rápido"
  },
  det: %{
    "the" => "el",
    "a" => "un",
    "some" => "algunos"
  }
}

Morphological Information

Include gender/number for target language:

%{
  noun: %{
    "cat" => %{lemma: "gato", gender: :masculine},
    "house" => %{lemma: "casa", gender: :feminine},
    "dog" => %{lemma: "perro", gender: :masculine}
  }
}
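
Since a lexicon may mix plain-string entries with map entries that carry gender, a lookup layer typically normalizes both into one shape. A possible sketch, with the hypothetical module `LexiconEntrySketch` and an assumed `:unknown` default for missing gender:

```elixir
defmodule LexiconEntrySketch do
  # Plain-string entry: only the lemma is known.
  def normalize(target) when is_binary(target) do
    %{lemma: target, gender: :unknown}
  end

  # Map entry: keep the lemma and fall back to :unknown if gender is absent.
  def normalize(%{lemma: lemma} = entry) do
    %{lemma: lemma, gender: Map.get(entry, :gender, :unknown)}
  end
end
```

Downstream agreement code can then pattern-match on a single map shape and decide how to treat `:unknown`, for example by defaulting to masculine or by flagging the entry for review.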

Idiomatic Expressions

Handle multi-word expressions:

%{
  idioms: %{
    "kick the bucket" => "estirar la pata",
    "break the ice" => "romper el hielo",
    "piece of cake" => "pan comido"
  }
}
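
Idioms have to be recognized before word-level translation, otherwise each word is translated on its own. One way to do this, sketched under the assumption of simple case-sensitive substring matching, is to split the text into idiom and non-idiom segments. `IdiomSketch` and its segment tuples are hypothetical.

```elixir
defmodule IdiomSketch do
  @idioms %{
    "kick the bucket" => "estirar la pata",
    "break the ice" => "romper el hielo",
    "piece of cake" => "pan comido"
  }

  # Split text into {:idiom, translation} and {:text, untranslated} segments.
  # Longer idioms are tried first so a short idiom never hides a longer one.
  def segment(""), do: []

  def segment(text) do
    case Enum.find(sorted_idioms(), fn {source, _} -> String.contains?(text, source) end) do
      nil ->
        [{:text, text}]

      {source, target} ->
        [before, rest] = String.split(text, source, parts: 2)
        segment(before) ++ [{:idiom, target}] ++ segment(rest)
    end
  end

  defp sorted_idioms do
    Enum.sort_by(@idioms, fn {source, _} -> String.length(source) end, :desc)
  end
end
```

The `{:text, ...}` segments would then go through the normal pipeline, while `{:idiom, ...}` segments are emitted as they are.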

Custom Lexicons

Add domain-specific vocabulary:

# priv/translation/lexicons/custom_tech_en_es.exs
%{
  noun: %{
    "widget" => "componente",
    "server" => "servidor",
    "database" => "base de datos"
  },
  verb: %{
    "deploy" => "desplegar",
    "compile" => "compilar",
    "debug" => "depurar"
  }
}

Load custom lexicons:

LexiconLoader.load(:en, :es, path: "priv/translation/lexicons/custom_tech_en_es.exs")
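
If a custom lexicon is meant to supplement the base lexicon rather than replace it, the two maps can be merged per POS tag. Whether LexiconLoader does this internally is not stated here, so treat the following as an illustration of the idea only; `LexiconMergeSketch` is a hypothetical helper.

```elixir
defmodule LexiconMergeSketch do
  # Merge two lexicons POS by POS. When both define the same word under
  # the same POS, the custom entry wins.
  def merge(base, custom) do
    Map.merge(base, custom, fn _pos, base_entries, custom_entries ->
      Map.merge(base_entries, custom_entries)
    end)
  end
end
```

With this approach a domain lexicon can override a general-purpose sense (for example "server" as "servidor" instead of "camarero") while every other base entry stays available.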

Supported Language Pairs

Direct Pairs

English ↔ Spanish - Full bidirectional support
English ↔ Catalan - Full bidirectional support

Transitive Pairs

Spanish ↔ Catalan - Via English (two-step translation)

# Spanish → Catalan (via English)
{:ok, doc_es} = Nasty.parse("El gato corre.", language: :es)
{:ok, doc_en} = Translator.translate(doc_es, :en)
{:ok, doc_ca} = Translator.translate(doc_en, :ca)
{:ok, text_ca} = Nasty.render(doc_ca)
# => "El gat corre."
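
The two-step pattern generalizes to any pair of direct translators. The sketch below composes two functions through a pivot and stops at the first failure; `PivotSketch.via/2` is a hypothetical helper, and the toy word-level translators in the usage lines stand in for the real `Translator.translate/2` calls.

```elixir
defmodule PivotSketch do
  # Build a translator that goes source -> pivot -> target. Each hop must
  # return {:ok, value}; anything else is passed back unchanged.
  def via(first_hop, second_hop) do
    fn input ->
      with {:ok, pivot} <- first_hop.(input), do: second_hop.(pivot)
    end
  end
end

# Toy single-word translators, used here only to exercise the composition.
es_en = fn word -> Map.fetch(%{"gato" => "cat"}, word) end
en_ca = fn word -> Map.fetch(%{"cat" => "gat"}, word) end

es_ca = PivotSketch.via(es_en, en_ca)
IO.inspect(es_ca.("gato"))
# => {:ok, "gat"}
```

Keep in mind that every hop can lose information, so errors and ambiguities from the first translation carry over into the second.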

Customization

Extending Lexicons

Edit lexicon files in priv/translation/lexicons/
Add new entries maintaining the POS structure
Reload lexicons: LexiconLoader.reload(:en, :es)

Custom Agreement Rules

Add rules alongside Nasty.Translation.Agreement in a module of your own:

defmodule MyApp.CustomAgreement do
  def apply_custom_rule(tokens, _language) do
    # Custom agreement logic goes here; return the adjusted tokens
    tokens
  end
end

Custom Word Order Rules

Add rules alongside Nasty.Translation.WordOrder in a module of your own:

defmodule MyApp.CustomWordOrder do
  def apply_custom_order(phrase, _language) do
    # Custom word order logic goes here; return the reordered phrase
    phrase
  end
end

Best Practices

1. Sentence-Level Translation

Translate sentence by sentence for best results:

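# Split after sentence-final punctuation, keeping the punctuation
# with its sentence and dropping empty fragments.
sentences = String.split(text, ~r/(?<=[.!?])\s+/, trim: true)

translated =
  sentences
  |> Enum.map(fn sentence ->
    {:ok, doc} = Nasty.parse(sentence, language: :en)
    {:ok, doc_es} = Translator.translate(doc, :es)
    {:ok, rendered} = Nasty.render(doc_es)
    rendered
  end)
  |> Enum.join(" ")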

2. Review Idiomatic Expressions

Idiomatic expressions may not translate literally:

# "It's raining cats and dogs"
# Literal: "Está lloviendo gatos y perros" ❌
# Idiomatic: "Está lloviendo a cántaros" ✓

3. Extend Lexicons for Domain Text

For technical/specialized text, add domain vocabulary:

# Add medical, legal, technical terms
# to custom lexicon files

4. Use for Formal/Technical Text

Best for:

Technical documentation
Formal correspondence
News articles
Academic text

Less suitable for:

Poetry
Idiomatic speech
Creative writing

5. Verify Grammatical Gender

Some nouns have unexpected gender:

# "problem" → "problema" (masculine in Spanish!)
# "hand" → "mano" (feminine)

Check lexicons and adjust if needed.

Limitations

Current Limitations

Idiomatic Expressions
- May translate literally rather than idiomatically
- Solution: Add idiom mappings to lexicons

Complex Verb Tenses
- Some tense combinations may not map perfectly
- Solution: Manual review for complex tenses

Cultural Context
- Cultural references not adapted
- Solution: Add context-aware transformations

Ambiguous Words
- First lexicon entry used for ambiguous words
- Solution: Add context-aware lexicon lookup

Limited Language Pairs
- Currently English, Spanish, Catalan only
- Solution: Add more language implementations

Workarounds

For idiomatic text:

# Pre-process idioms before translation
text = String.replace(text, "kick the bucket", "die")

For ambiguous words:

# Use context or manual disambiguation
# "bank" (financial) vs "bank" (river)
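
A simple form of context-based disambiguation counts how many cue words for each sense occur in the surrounding sentence. The sense inventory, cue lists, and module name `SenseSketch` below are invented for illustration and are not part of Nasty.

```elixir
defmodule SenseSketch do
  # Hypothetical sense inventory: each sense lists cue words that favour it.
  # The first sense is the default, mirroring "first lexicon entry wins".
  @senses %{
    "bank" => [
      %{target: "banco", cues: ["money", "account", "loan"]},
      %{target: "orilla", cues: ["river", "water", "shore"]}
    ]
  }

  def translate(word, context_words) do
    case Map.fetch(@senses, word) do
      {:ok, senses} ->
        # Enum.max_by/2 returns the first maximal element, so a tie
        # (including "no cues at all") falls back to the first sense.
        best =
          Enum.max_by(senses, fn sense ->
            Enum.count(context_words, &(&1 in sense.cues))
          end)

        {:ok, best.target}

      :error ->
        :error
    end
  end
end
```

`SenseSketch.translate("bank", ["the", "river", "bank"])` selects "orilla", while a context without any cue word keeps the default "banco".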

For complex grammar:

# Simplify sentence structure before translation
# "Having been running..." → "He ran..."

Future Enhancements

Neural translation integration
Context-aware lexicon selection
Multi-sentence context for pronouns
Statistical phrase translation
User feedback learning
More language pairs (French, German, etc.)
