Translation System Guide

A guide to Nasty's AST-based system for translating natural language between English, Spanish, and Catalan.

Table of Contents

  1. Overview
  2. Architecture
  3. Quick Start
  4. Core Components
  5. Translation Pipeline
  6. Morphological Agreement
  7. Word Order Rules
  8. Lexicon Management
  9. Supported Language Pairs
  10. Customization
  11. Best Practices
  12. Limitations

Overview

Nasty's translation system operates at the Abstract Syntax Tree (AST) level, providing grammatically aware translation that preserves linguistic structure. Unlike token-by-token machine translation, this approach:

  • Preserves grammatical relationships
  • Applies morphological agreement rules
  • Handles language-specific word order
  • Supports bidirectional translation
  • Enables roundtrip translation with minimal loss

Architecture

System Diagram

flowchart TD
    A["Source Text<br/>(Language A)"]
    B["Parse to AST<br/>(Source Lang)"]
    C["AST Transform<br/>(Structural)"] -.-> C1[ASTTransformer]
    D["Token Translate<br/>(Lemma mapping)"] -.-> D1[TokenTranslator]
    E["Agreement<br/>(Morphology)"] -.-> E1[Agreement]
    F["Word Order<br/>(Reordering)"] -.-> F1[WordOrder]
    G["Render to Text<br/>(Target Lang)"] -.-> G1[AST.Renderer]
    H["Target Text<br/>(Language B)"]
    
    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
    F --> G
    G --> H

Module Structure

graph TD
    Root[lib/]
    Trans[translation/]
    AST[ast/]
    Priv[priv/]
    TransSub[translation/]
    Lex[lexicons/]
    
    Root --> Trans
    Root --> AST
    Root --> Priv
    
    Trans --> T1[translator.ex<br/>Main API]
    Trans --> T2[ast_transformer.ex<br/>AST node transformation]
    Trans --> T3[token_translator.ex<br/>Token-level translation]
    Trans --> T4[agreement.ex<br/>Morphological agreement]
    Trans --> T5[word_order.ex<br/>Word order rules]
    Trans --> T6[lexicon_loader.ex<br/>Lexicon management]
    
    AST --> A1[renderer.ex<br/>AST to text rendering]
    
    Priv --> TransSub
    TransSub --> Lex
    Lex --> L1[en_es.exs<br/>English → Spanish]
    Lex --> L2[es_en.exs<br/>Spanish → English]
    Lex --> L3[en_ca.exs<br/>English → Catalan]
    Lex --> L4[ca_en.exs<br/>Catalan → English]

Quick Start

Basic Translation

alias Nasty.Language.{English, Spanish}
alias Nasty.Translation.Translator

# English to Spanish
{:ok, doc_en} = Nasty.parse("The cat runs.", language: :en)
{:ok, doc_es} = Translator.translate(doc_en, :es)
{:ok, text_es} = Nasty.render(doc_es)
IO.puts(text_es)
# => "El gato corre."

# Spanish to English
{:ok, doc_es} = Nasty.parse("El perro grande.", language: :es)
{:ok, doc_en} = Translator.translate(doc_es, :en)
{:ok, text_en} = Nasty.render(doc_en)
IO.puts(text_en)
# => "The big dog."

Using the High-Level API

# Translate text directly
{:ok, text_es} = Nasty.translate_text("The quick cat.", :en, :es)
# => "El gato rápido."

# Or with explicit parsing
{:ok, ast} = Nasty.parse("The house is big.", language: :en)
{:ok, translated_ast} = Nasty.translate(ast, :es)
{:ok, text} = Nasty.render(translated_ast)

Core Components

1. ASTTransformer

Transforms AST nodes between language structures.

Module: Nasty.Translation.ASTTransformer

Functions:

  • transform_document/2 - Transform entire document
  • transform_sentence/2 - Transform sentence
  • transform_phrase/2 - Transform phrase structures
  • transform_clause/2 - Transform clause

Example:

alias Nasty.Translation.ASTTransformer

{:ok, spanish_doc} = ASTTransformer.transform_document(english_doc, :es)

2. TokenTranslator

Performs lemma-to-lemma translation with POS awareness.

Module: Nasty.Translation.TokenTranslator

Functions:

  • translate_token/3 - Translate single token
  • translate_with_morphology/3 - Translate preserving morphology
  • lookup_translation/3 - Lookup in lexicon

Example:

alias Nasty.Translation.TokenTranslator

# cat (noun) → gato (noun)
translated = TokenTranslator.translate_token(token, :en, :es)

# Preserves morphology
# cats (noun, plural) → gatos (noun, plural)
translated = TokenTranslator.translate_with_morphology(token, :en, :es)

3. Agreement

Enforces morphological agreement rules (gender, number, person).

Module: Nasty.Translation.Agreement

Functions:

  • apply_agreement/2 - Apply all agreement rules
  • apply_determiner_noun/2 - Determiner-noun agreement
  • apply_noun_adjective/2 - Noun-adjective agreement
  • apply_subject_verb/2 - Subject-verb agreement

Example:

alias Nasty.Translation.Agreement

# Ensure "el gato" (masculine) not "la gato"
adjusted = Agreement.apply_agreement(tokens, :es)

# Ensure "los gatos grandes" (plural agreement throughout)
adjusted = Agreement.apply_agreement(tokens, :es)

4. WordOrder

Applies language-specific word order transformations.

Module: Nasty.Translation.WordOrder

Functions:

  • apply_order/2 - Apply all word order rules
  • apply_adjective_order/2 - Position adjectives correctly
  • apply_svo_order/2 - Subject-Verb-Object ordering
  • handle_clitics/2 - Clitic placement

Example:

alias Nasty.Translation.WordOrder

# "the big house" → "la casa grande" (adjective after noun)
ordered = WordOrder.apply_order(phrase, :es)

# "I eat it" → "Lo como" (clitic before verb in Spanish)
ordered = WordOrder.handle_clitics(phrase, :es)
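
The adjective rule can be sketched on plain data. This is a minimal illustration, not Nasty's implementation: the module name `AdjectiveOrderSketch`, the map shape, and the short list of adjectives assumed to stay before the noun are all hypothetical.

```elixir
defmodule AdjectiveOrderSketch do
  @moduledoc "Hypothetical noun-phrase reordering; the prenominal list is illustrative."

  # Adjective forms assumed (for this sketch only) to stay before the noun.
  @prenominal MapSet.new(["buen", "mal", "primer", "primera"])

  # Takes %{det: ..., adjs: [...], noun: ...} in English order and returns
  # the Spanish surface order as a list of words.
  def reorder(%{det: det, adjs: adjs, noun: noun}) do
    {pre, post} = Enum.split_with(adjs, &MapSet.member?(@prenominal, &1))
    [det] ++ pre ++ [noun] ++ post
  end
end

IO.inspect(AdjectiveOrderSketch.reorder(%{det: "la", adjs: ["grande"], noun: "casa"}))
# prints: ["la", "casa", "grande"]
```

Keeping the exceptions in a set makes the default (adjective after noun) explicit and the exceptions easy to extend.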

5. LexiconLoader

Manages bidirectional lexicons with ETS caching for fast lookup.

Module: Nasty.Translation.LexiconLoader

Functions:

  • load/2 - Load lexicon for language pair
  • lookup/3 - Look up translation
  • reload/2 - Reload lexicon from file

Example:

alias Nasty.Translation.LexiconLoader

# Load lexicon (cached in ETS)
{:ok, lexicon} = LexiconLoader.load(:en, :es)

# Bidirectional lookup
"gato" = LexiconLoader.lookup(lexicon, "cat", :noun)
"cat" = LexiconLoader.lookup(lexicon, "gato", :noun)

# Reload after editing lexicon file
LexiconLoader.reload(:en, :es)
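
To illustrate what ETS caching of a lexicon might look like, here is a minimal sketch. It is not the LexiconLoader implementation: the module `LexiconCacheSketch`, the table name, and the `loader` callback are assumptions made for the example; only the `:ets` calls are standard Erlang/OTP.

```elixir
defmodule LexiconCacheSketch do
  @moduledoc "Hypothetical ETS-backed cache; table and function names are assumptions."

  @table :lexicon_cache_sketch

  # Create the named table once; later calls are no-ops.
  def ensure_table do
    if :ets.whereis(@table) == :undefined do
      :ets.new(@table, [:set, :public, :named_table, read_concurrency: true])
    end

    :ok
  end

  # Return the cached lexicon for a pair, running `loader` only on a miss.
  def fetch(source, target, loader) when is_function(loader, 0) do
    ensure_table()
    key = {source, target}

    case :ets.lookup(@table, key) do
      [{^key, lexicon}] ->
        lexicon

      [] ->
        lexicon = loader.()
        :ets.insert(@table, {key, lexicon})
        lexicon
    end
  end

  # Drop the cached entry so the next fetch runs the loader again.
  def invalidate(source, target) do
    ensure_table()
    :ets.delete(@table, {source, target})
    :ok
  end
end

lexicon = LexiconCacheSketch.fetch(:en, :es, fn -> %{noun: %{"cat" => "gato"}} end)
IO.inspect(lexicon.noun["cat"])
# prints: "gato"
```

In this sketch the table belongs to whichever process first calls `ensure_table/0`; a long-running application would normally create it from a supervised process so the cache outlives individual callers.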

6. AST.Renderer

Renders AST back to natural language text.

Module: Nasty.AST.Renderer

Functions:

  • render_document/1 - Render complete document
  • render_sentence/1 - Render single sentence
  • render_phrase/1 - Render phrase
  • render_tokens/1 - Render token sequence

Example:

alias Nasty.AST.Renderer

# Render with proper spacing and punctuation
{:ok, text} = Renderer.render_document(document)

# Render phrase
{:ok, text} = Renderer.render_phrase(noun_phrase)
# => "el gato grande"
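
"Proper spacing and punctuation" can be made concrete with a small sketch. The module `RenderSketch` below is hypothetical and works on a flat list of strings rather than on Nasty's AST; it only shows the spacing idea.

```elixir
defmodule RenderSketch do
  @moduledoc "Hypothetical illustration; not Nasty.AST.Renderer."

  @no_space_before [".", ",", ";", ":", "!", "?"]

  # Join surface forms, attaching closing punctuation to the previous word
  # and capitalizing the first letter of the result.
  def render_tokens(tokens) do
    tokens
    |> Enum.reduce("", fn
      tok, "" -> tok
      tok, acc when tok in @no_space_before -> acc <> tok
      tok, acc -> acc <> " " <> tok
    end)
    |> capitalize_first()
  end

  defp capitalize_first(""), do: ""

  defp capitalize_first(text) do
    {first, rest} = String.split_at(text, 1)
    String.upcase(first) <> rest
  end
end

IO.puts(RenderSketch.render_tokens(["el", "gato", "grande", "corre", "."]))
# prints: El gato grande corre.
```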

Translation Pipeline

Step-by-Step Process

1. Parse Source Text

alias Nasty.Language.English

text = "The quick brown fox jumps."
{:ok, tokens} = English.tokenize(text)
{:ok, tagged} = English.tag_pos(tokens)
{:ok, doc} = English.parse(tagged)

AST Structure:

graph TD
    Doc["Document (language: :en)"]
    Para[Paragraph]
    Sent[Sentence]
    Clause[Clause]
    Subj["Subject: NounPhrase"]
    Det["Determiner: 'The'"]
    Mod["Modifiers: ['quick', 'brown']"]
    Head1["Head: 'fox'"]
    Pred["Predicate: VerbPhrase"]
    Head2["Head: 'jumps'"]
    
    Doc --> Para
    Para --> Sent
    Sent --> Clause
    Clause --> Subj
    Clause --> Pred
    Subj --> Det
    Subj --> Mod
    Subj --> Head1
    Pred --> Head2

2. Transform AST Structure

alias Nasty.Translation.ASTTransformer

{:ok, doc_es} = ASTTransformer.transform_document(doc, :es)

This step changes language: :en to language: :es on every node of the tree.
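
A recursive walk of this kind might look like the following sketch. Plain maps and lists stand in for Nasty's AST structs here, and the module `RetagSketch` is hypothetical; the real transformer may do considerably more.

```elixir
defmodule RetagSketch do
  @moduledoc "Hypothetical tree walk; plain maps stand in for Nasty's AST structs."

  # Rewrite :language on every map that carries it, descending into children.
  def retag(node, language) when is_map(node) do
    updated = Map.new(node, fn {key, value} -> {key, retag(value, language)} end)

    if Map.has_key?(updated, :language) do
      %{updated | language: language}
    else
      updated
    end
  end

  def retag(nodes, language) when is_list(nodes), do: Enum.map(nodes, &retag(&1, language))
  def retag(leaf, _language), do: leaf
end

doc = %{language: :en, sentences: [%{language: :en, tokens: [%{text: "fox", language: :en}]}]}
IO.inspect(RetagSketch.retag(doc, :es))
```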

3. Translate Tokens

alias Nasty.Translation.TokenTranslator

# For each token in AST:
# "fox" (noun) → "zorro" (noun)
# "jumps" (verb) → "salta" (verb)

4. Apply Agreement

alias Nasty.Translation.Agreement

# Ensure gender/number agreement:
# "el" (masculine singular) + "zorro" (masculine singular) ✓
# "los" (masculine plural) + "zorros" (masculine plural) ✓

5. Apply Word Order

alias Nasty.Translation.WordOrder

# "the quick brown fox" → "el zorro rápido pardo"
# (adjectives after noun in Spanish for most adjectives)

6. Render to Text

alias Nasty.AST.Renderer

{:ok, text} = Renderer.render_document(doc_es)
# => "El zorro rápido pardo salta."
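
The staged structure of the pipeline can be sketched as a list of functions that each return `{:ok, value}` or `{:error, reason}`. Everything below is a toy under stated assumptions: `PipelineSketch`, the three stages, and the four-word lexicon are hypothetical stand-ins for the real modules, and agreement is skipped entirely.

```elixir
defmodule PipelineSketch do
  @moduledoc "Hypothetical staged pipeline; stage functions are stand-ins."

  # Run stages in order, stopping at the first {:error, reason}.
  def run(input, stages) do
    Enum.reduce_while(stages, {:ok, input}, fn stage, {:ok, acc} ->
      case stage.(acc) do
        {:ok, _} = ok -> {:cont, ok}
        {:error, _} = error -> {:halt, error}
      end
    end)
  end
end

lexicon = %{"the" => "el", "fox" => "zorro", "quick" => "rápido", "jumps" => "salta"}

stages = [
  # Token translation: map each word through the toy lexicon.
  fn words ->
    Enum.reduce_while(words, {:ok, []}, fn word, {:ok, acc} ->
      case Map.fetch(lexicon, word) do
        {:ok, translated} -> {:cont, {:ok, acc ++ [translated]}}
        :error -> {:halt, {:error, {:unknown_word, word}}}
      end
    end)
  end,
  # Word order: move the adjective after the noun (det adj noun -> det noun adj).
  fn [det, adj, noun | rest] -> {:ok, [det, noun, adj | rest]} end,
  # Rendering: join the words and close the sentence.
  fn words -> {:ok, Enum.join(words, " ") <> "."} end
]

IO.inspect(PipelineSketch.run(["the", "quick", "fox", "jumps"], stages))
# prints: {:ok, "el zorro rápido salta."}
```

Because each stage has the same shape, a failure such as an unknown word stops the run early and is returned unchanged to the caller.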

Morphological Agreement

Gender Agreement

Spanish and Catalan have grammatical gender (masculine/feminine).

Determiner-Noun:

# English: "the cat"
# Spanish: "el gato" (masculine)

# English: "the house"
# Spanish: "la casa" (feminine)

Noun-Adjective:

# English: "the red car"
# Spanish: "el carro rojo" (masculine)

# English: "the red house"
# Spanish: "la casa roja" (feminine)

Number Agreement

Determiners, nouns, and adjectives must agree in number.

# English: "the cats"
# Spanish: "los gatos" (plural)

# English: "the big cats"
# Spanish: "los gatos grandes" (plural throughout)
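
Gender and number agreement for regular forms can be sketched as follows. `AgreementSketch` is a hypothetical module, not Nasty.Translation.Agreement, and it covers only the definite article and the most regular adjective patterns; irregular plurals such as "feliz" → "felices" are out of scope.

```elixir
defmodule AgreementSketch do
  @moduledoc "Hypothetical agreement rules for a few regular Spanish forms."

  @articles %{
    {:masculine, :singular} => "el",
    {:feminine, :singular} => "la",
    {:masculine, :plural} => "los",
    {:feminine, :plural} => "las"
  }

  # Pick the definite article matching the noun's gender and number.
  def definite_article(gender, number), do: Map.fetch!(@articles, {gender, number})

  # Inflect a masculine-singular adjective lemma for gender, then number.
  def inflect_adjective(lemma, gender, number) do
    lemma |> apply_gender(gender) |> apply_number(number)
  end

  # Only adjectives ending in -o change for gender ("rojo" -> "roja").
  defp apply_gender(lemma, :feminine) do
    if String.ends_with?(lemma, "o"), do: String.replace_suffix(lemma, "o", "a"), else: lemma
  end

  defp apply_gender(lemma, :masculine), do: lemma

  # Regular plural: -s after a vowel, -es after a consonant.
  defp apply_number(form, :singular), do: form

  defp apply_number(form, :plural) do
    if String.match?(form, ~r/[aeiouáéíóú]$/u), do: form <> "s", else: form <> "es"
  end
end

IO.puts(AgreementSketch.definite_article(:masculine, :plural))
# prints: los
IO.puts(AgreementSketch.inflect_adjective("grande", :masculine, :plural))
# prints: grandes
```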

Person Agreement

Subject-verb agreement by grammatical person.

# English: "I run"
# Spanish: "Yo corro" (first person singular)

# English: "They run"
# Spanish: "Ellos corren" (third person plural)
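
Person agreement amounts to choosing a verb ending from the subject's person and number. The sketch below handles only regular present-tense verbs; `ConjugationSketch` is a hypothetical module, and stem-changing or irregular verbs (for example "dormir" → "duerme") would come out wrong.

```elixir
defmodule ConjugationSketch do
  @moduledoc "Hypothetical present-tense conjugator for regular Spanish verbs only."

  @endings %{
    "ar" => %{
      {1, :singular} => "o", {2, :singular} => "as", {3, :singular} => "a",
      {1, :plural} => "amos", {2, :plural} => "áis", {3, :plural} => "an"
    },
    "er" => %{
      {1, :singular} => "o", {2, :singular} => "es", {3, :singular} => "e",
      {1, :plural} => "emos", {2, :plural} => "éis", {3, :plural} => "en"
    },
    "ir" => %{
      {1, :singular} => "o", {2, :singular} => "es", {3, :singular} => "e",
      {1, :plural} => "imos", {2, :plural} => "ís", {3, :plural} => "en"
    }
  }

  # Split the infinitive into stem and class (-ar, -er, -ir), then add the ending.
  def present(infinitive, person, number) do
    {stem, class} = String.split_at(infinitive, -2)

    case Map.fetch(@endings, class) do
      {:ok, table} -> {:ok, stem <> Map.fetch!(table, {person, number})}
      :error -> {:error, :unknown_verb_class}
    end
  end
end

IO.inspect(ConjugationSketch.present("correr", 1, :singular))
# prints: {:ok, "corro"}
IO.inspect(ConjugationSketch.present("correr", 3, :plural))
# prints: {:ok, "corren"}
```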

Word Order Rules

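Basic Clause Order (SVO)

English, Spanish, and Catalan all use Subject-Verb-Object (SVO) as their basic clause order, so a subject, verb, and full noun-phrase object keep their relative positions: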

# English: "The cat eats fish."
# Spanish: "El gato come pescado."
# Catalan: "El gat menja peix."

Adjective Position

English: Adjectives before nouns

"the red car"
"the big house"

Spanish/Catalan: Most adjectives after nouns

"el carro rojo" (the car red)
"la casa grande" (the house big)

Exceptions: Some adjectives usually precede the noun

"el buen libro" (the good book) - usually preferred over "el libro bueno"
"la primera vez" (the first time) - "la vez primera" is rare

Clitic Placement

Spanish clitic pronouns (lo, la, me, te, se) go before a conjugated verb or attach to the end of an infinitive:

# English: "I see it"
# Spanish: "Lo veo" (clitic before conjugated verb)

# English: "I want to see it"
# Spanish: "Quiero verlo" (clitic after infinitive)
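
The two placements above can be sketched as one rule over a verb chain. `CliticSketch` is hypothetical and handles a single object clitic. It always attaches to a final infinitive, which matches the example here but ignores the alternative "Lo quiero ver".

```elixir
defmodule CliticSketch do
  @moduledoc "Hypothetical clitic placement for a single object clitic."

  # `verbs` is a non-empty list of {form, :finite | :infinitive} tuples in surface order.
  def place(clitic, verbs) do
    case List.last(verbs) do
      {form, :infinitive} ->
        # Enclisis: attach the clitic to the final infinitive.
        forms(Enum.drop(verbs, -1)) ++ [form <> clitic]

      {_form, :finite} ->
        # Proclisis: the clitic precedes the conjugated verb.
        [clitic | forms(verbs)]
    end
  end

  defp forms(verbs), do: Enum.map(verbs, fn {form, _kind} -> form end)
end

IO.inspect(CliticSketch.place("lo", [{"veo", :finite}]))
# prints: ["lo", "veo"]
IO.inspect(CliticSketch.place("lo", [{"quiero", :finite}, {"ver", :infinitive}]))
# prints: ["quiero", "verlo"]
```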

Lexicon Management

Lexicon Format

Lexicons are Elixir maps organized by POS tag:

# priv/translation/lexicons/en_es.exs
%{
  noun: %{
    "cat" => "gato",
    "house" => "casa",
    "book" => "libro"
  },
  verb: %{
    "run" => "correr",
    "eat" => "comer",
    "sleep" => "dormir"
  },
  adj: %{
    "big" => "grande",
    "red" => "rojo",
    "quick" => "rápido"
  },
  det: %{
    "the" => "el",
    "a" => "un",
    "some" => "algunos"
  }
}
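
With this layout, a POS-scoped lookup is two nested map fetches, and a reverse lexicon can be derived by swapping keys and values. The sketch below assumes plain string values as in the example above; `LexiconInvertSketch` and both function names are hypothetical, not part of LexiconLoader.

```elixir
defmodule LexiconInvertSketch do
  @moduledoc "Hypothetical helpers over a POS-keyed lexicon map."

  # Swap source and target inside every POS table.
  def invert(lexicon) do
    Map.new(lexicon, fn {pos, entries} ->
      {pos, Map.new(entries, fn {source, target} -> {target, source} end)}
    end)
  end

  # POS-scoped lookup returning {:ok, lemma} or :error.
  def lookup(lexicon, lemma, pos) do
    with {:ok, entries} <- Map.fetch(lexicon, pos), do: Map.fetch(entries, lemma)
  end
end

en_es = %{noun: %{"cat" => "gato", "house" => "casa"}, verb: %{"run" => "correr"}}
es_en = LexiconInvertSketch.invert(en_es)

IO.inspect(LexiconInvertSketch.lookup(en_es, "cat", :noun))
# prints: {:ok, "gato"}
IO.inspect(LexiconInvertSketch.lookup(es_en, "gato", :noun))
# prints: {:ok, "cat"}
```

Inversion is lossy when two source lemmas share one translation (only one survives), which is one reason to keep separate files per direction.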

Morphological Information

Entries can carry morphological features for the target language, such as grammatical gender:

%{
  noun: %{
    "cat" => %{lemma: "gato", gender: :masculine},
    "house" => %{lemma: "casa", gender: :feminine},
    "dog" => %{lemma: "perro", gender: :masculine}
  }
}
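
If a lexicon mixes plain-string entries with map entries like those above, consumers need one uniform shape. The following sketch shows one way to normalize them; `EntrySketch` is hypothetical, and how Nasty itself treats mixed entries is not specified here.

```elixir
defmodule EntrySketch do
  @moduledoc "Hypothetical normalizer for plain-string and map lexicon entries."

  # Look up a lemma under a POS tag and return a uniform entry map.
  def fetch(lexicon, pos, lemma) do
    with {:ok, entries} <- Map.fetch(lexicon, pos),
         {:ok, raw} <- Map.fetch(entries, lemma) do
      {:ok, normalize(raw)}
    end
  end

  # A bare string carries no morphology, so gender is left as nil.
  def normalize(lemma) when is_binary(lemma), do: %{lemma: lemma, gender: nil}
  def normalize(%{lemma: _} = entry), do: Map.put_new(entry, :gender, nil)
end

lexicon = %{noun: %{"house" => %{lemma: "casa", gender: :feminine}, "book" => "libro"}}

IO.inspect(EntrySketch.fetch(lexicon, :noun, "house"))
IO.inspect(EntrySketch.fetch(lexicon, :noun, "book"))
```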

Idiomatic Expressions

Handle multi-word expressions:

%{
  idioms: %{
    "kick the bucket" => "estirar la pata",
    "break the ice" => "romper el hielo",
    "piece of cake" => "pan comido"
  }
}
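
Multi-word expressions have to be recognized before word-by-word translation splits them up. One possible pre-pass, sketched under assumptions, cuts the text into idiom and non-idiom segments. `IdiomSketch.segment/2` is a hypothetical helper; matching is case-sensitive and works on raw text rather than on the AST.

```elixir
defmodule IdiomSketch do
  @moduledoc "Hypothetical idiom pre-pass; not part of Nasty's API."

  # Split text into {:idiom, translation} and {:text, untouched} segments.
  def segment(text, idioms) when map_size(idioms) == 0, do: [{:text, text}]

  def segment(text, idioms) do
    pattern =
      idioms
      |> Map.keys()
      # Longest first, so a short idiom cannot pre-empt a longer one.
      |> Enum.sort_by(&String.length/1, :desc)
      |> Enum.map_join("|", &Regex.escape/1)
      |> Regex.compile!()

    pattern
    |> Regex.split(text, include_captures: true, trim: true)
    |> Enum.map(fn part ->
      case Map.fetch(idioms, part) do
        {:ok, translation} -> {:idiom, translation}
        :error -> {:text, part}
      end
    end)
  end
end

idioms = %{"piece of cake" => "pan comido", "break the ice" => "romper el hielo"}
IO.inspect(IdiomSketch.segment("That exam was a piece of cake.", idioms))
# prints: [text: "That exam was a ", idiom: "pan comido", text: "."]
```

Only the `:text` segments would then go through normal translation, while `:idiom` segments are passed through as already translated.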

Custom Lexicons

Add domain-specific vocabulary:

# priv/translation/lexicons/custom_tech_en_es.exs
%{
  noun: %{
    "widget" => "componente",
    "server" => "servidor",
    "database" => "base de datos"
  },
  verb: %{
    "deploy" => "desplegar",
    "compile" => "compilar",
    "debug" => "depurar"
  }
}

Load custom lexicons:

LexiconLoader.load(:en, :es, path: "priv/translation/lexicons/custom_tech_en_es.exs")
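
When a custom lexicon should extend the base one rather than replace it, the two maps can be merged per POS tag. This is a sketch of the idea only; `LexiconMergeSketch` is hypothetical, and whether `LexiconLoader.load/3` merges or replaces is not stated in this guide.

```elixir
defmodule LexiconMergeSketch do
  @moduledoc "Hypothetical per-POS merge of a base and a custom lexicon."

  # Merge POS tables; where both define a lemma, the custom entry wins.
  def merge(base, custom) do
    Map.merge(base, custom, fn _pos, base_entries, custom_entries ->
      Map.merge(base_entries, custom_entries)
    end)
  end
end

base = %{noun: %{"cat" => "gato", "server" => "camarero"}}
custom = %{noun: %{"server" => "servidor"}, verb: %{"deploy" => "desplegar"}}

IO.inspect(LexiconMergeSketch.merge(base, custom))
```

Here the general-purpose sense of "server" is overridden by the technical one, and the new `verb` table is added untouched.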

Supported Language Pairs

Direct Pairs

  • English ↔ Spanish - Full bidirectional support
  • English ↔ Catalan - Full bidirectional support

Transitive Pairs

  • Spanish ↔ Catalan - Via English (two-step translation)

# Spanish → Catalan (via English)
{:ok, doc_es} = Nasty.parse("El gato corre.", language: :es)
{:ok, doc_en} = Translator.translate(doc_es, :en)
{:ok, doc_ca} = Translator.translate(doc_en, :ca)
{:ok, text_ca} = Nasty.render(doc_ca)
# => "El gat corre."
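
The two-step pattern can be wrapped in a small helper that stops at the first failure. `PivotSketch.translate_via/4` is a hypothetical function. It accepts any two-argument translate function that returns `{:ok, doc}` or `{:error, reason}`, the shape the `Translator.translate/2` calls in this guide appear to have. The toy translator below exists only to make the example runnable.

```elixir
defmodule PivotSketch do
  @moduledoc "Hypothetical helper for translating through a pivot language."

  # Chain source -> pivot -> target, returning the first error unchanged.
  def translate_via(doc, pivot, target, translate_fun) when is_function(translate_fun, 2) do
    with {:ok, pivot_doc} <- translate_fun.(doc, pivot),
         {:ok, target_doc} <- translate_fun.(pivot_doc, target) do
      {:ok, target_doc}
    end
  end
end

# Toy stand-in for a real translator: word lists and tiny per-pair lexicons.
lexicons = %{
  {:es, :en} => %{"gato" => "cat"},
  {:en, :ca} => %{"cat" => "gat"}
}

toy_translate = fn %{language: from, words: words}, to ->
  case Map.fetch(lexicons, {from, to}) do
    {:ok, lexicon} -> {:ok, %{language: to, words: Enum.map(words, &Map.get(lexicon, &1, &1))}}
    :error -> {:error, {:unsupported_pair, from, to}}
  end
end

IO.inspect(PivotSketch.translate_via(%{language: :es, words: ["gato"]}, :en, :ca, toy_translate))
# prints: {:ok, %{language: :ca, words: ["gat"]}}
```

Each hop can lose information, so errors or ambiguities introduced in the English step carry over into the Catalan result.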

Customization

Extending Lexicons

  1. Edit lexicon files in priv/translation/lexicons/
  2. Add new entries maintaining the POS structure
  3. Reload lexicons: LexiconLoader.reload(:en, :es)

Custom Agreement Rules

Define additional agreement rules in your own module:

defmodule MyApp.CustomAgreement do
  # Takes and returns a token list; _language is unused in this stub
  def apply_custom_rule(tokens, _language) do
    # Custom agreement logic
    tokens
  end
end

Custom Word Order Rules

Define additional word order rules in your own module:

defmodule MyApp.CustomWordOrder do
  # Takes and returns a phrase; _language is unused in this stub
  def apply_custom_order(phrase, _language) do
    # Custom word order logic
    phrase
  end
end

Best Practices

1. Sentence-Level Translation

Translate sentence by sentence for best results:

# Split after sentence-final punctuation, keeping the punctuation with each sentence
sentences = String.split(text, ~r/(?<=[.!?])\s+/, trim: true)

translated =
  sentences
  |> Enum.map(fn sent ->
    {:ok, doc} = Nasty.parse(sent, language: :en)
    {:ok, translated_doc} = Translator.translate(doc, :es)
    {:ok, rendered} = Nasty.render(translated_doc)
    rendered
  end)
  |> Enum.join(" ")

2. Review Idiomatic Expressions

Idiomatic expressions may not translate literally:

# "It's raining cats and dogs"
# Literal: "Está lloviendo gatos y perros" ❌
# Idiomatic: "Está lloviendo a cántaros" ✓

3. Extend Lexicons for Domain Text

For technical/specialized text, add domain vocabulary:

# Add medical, legal, technical terms
# to custom lexicon files

4. Use for Formal/Technical Text

Best for:

  • Technical documentation
  • Formal correspondence
  • News articles
  • Academic text

Less suitable for:

  • Poetry
  • Idiomatic speech
  • Creative writing

5. Verify Grammatical Gender

Some nouns have unexpected gender:

# "problem" → "problema" (masculine in Spanish!)
# "hand" → "mano" (feminine)

Check lexicons and adjust if needed.

Limitations

Current Limitations

  1. Idiomatic Expressions

    • May translate literally rather than idiomatically
    • Solution: Add idiom mappings to lexicons
  2. Complex Verb Tenses

    • Some tense combinations may not map perfectly
    • Solution: Manual review for complex tenses
  3. Cultural Context

    • Cultural references not adapted
    • Solution: Add context-aware transformations
  4. Ambiguous Words

    • First lexicon entry used for ambiguous words
    • Solution: Add context-aware lexicon lookup
  5. Limited Language Pairs

    • Currently English, Spanish, Catalan only
    • Solution: Add more language implementations

Workarounds

For idiomatic text:

# Pre-process idioms before translation
text = String.replace(text, "kick the bucket", "die")

For ambiguous words:

# Use context or manual disambiguation
# "bank" (financial) vs "bank" (river)
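
One simple form of context-based disambiguation is to score each candidate sense by how many of its cue words appear in the sentence. The sketch below is an assumption-laden illustration: `SenseSketch`, the cue lists, and the two Spanish renderings of "bank" are made up for the example and are not taken from Nasty's lexicons.

```elixir
defmodule SenseSketch do
  @moduledoc "Hypothetical sense chooser based on cue-word overlap."

  # `senses` is a list of {translation, cue_words}. Ties go to the earliest
  # entry, which mirrors the "first lexicon entry" fallback.
  def choose(senses, context_words) do
    context = MapSet.new(context_words)

    {translation, _cues} =
      Enum.max_by(senses, fn {_translation, cues} ->
        cues |> MapSet.new() |> MapSet.intersection(context) |> MapSet.size()
      end)

    translation
  end
end

bank_senses = [
  {"banco", ["money", "account", "loan"]},
  {"orilla", ["river", "water", "shore"]}
]

IO.puts(SenseSketch.choose(bank_senses, ["the", "river", "bank", "was", "muddy"]))
# prints: orilla
```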

For complex grammar:

# Simplify sentence structure before translation
# "Having been running..." → "He ran..."

Future Enhancements

  • Neural translation integration
  • Context-aware lexicon selection
  • Multi-sentence context for pronouns
  • Statistical phrase translation
  • User feedback learning
  • More language pairs (French, German, etc.)

See Also