Translation System Guide
Comprehensive guide to Nasty's AST-based translation system for natural language translation between English, Spanish, and Catalan.
Table of Contents
- Overview
- Architecture
- Quick Start
- Core Components
- Translation Pipeline
- Morphological Agreement
- Word Order Rules
- Lexicon Management
- Supported Language Pairs
- Customization
- Best Practices
- Limitations
Overview
Nasty's translation system operates at the Abstract Syntax Tree (AST) level, providing grammatically-aware translation that preserves linguistic structure. Unlike token-by-token machine translation, this approach:
- Preserves grammatical relationships
- Applies morphological agreement rules
- Handles language-specific word order
- Supports bidirectional translation
- Enables roundtrip translation with minimal loss
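To make the contrast with token-by-token translation concrete, the following self-contained sketch translates one noun phrase twice: once word by word, and once using the phrase structure. It does not call Nasty's API. `ToyTranslation`, its three-word lexicon, and its functions are hypothetical names invented for illustration.

```elixir
defmodule ToyTranslation do
  # Hypothetical three-word lexicon; this is not Nasty's lexicon format.
  @lexicon %{
    "the" => %{lemma: "el", pos: :det},
    "big" => %{lemma: "grande", pos: :adj},
    "house" => %{lemma: "casa", pos: :noun, gender: :feminine}
  }

  # Token-by-token: look up each word and keep the source word order.
  def word_by_word(words) do
    words
    |> Enum.map(fn word -> Map.fetch!(@lexicon, word).lemma end)
    |> Enum.join(" ")
  end

  # Structure-aware: treat the input as determiner + adjective + noun,
  # make the determiner agree with the noun's gender, and place the
  # adjective after the noun.
  def structure_aware([det, adj, noun]) do
    noun_entry = Map.fetch!(@lexicon, noun)
    det_form = agree(Map.fetch!(@lexicon, det).lemma, noun_entry.gender)
    adj_form = Map.fetch!(@lexicon, adj).lemma
    Enum.join([det_form, noun_entry.lemma, adj_form], " ")
  end

  defp agree("el", :feminine), do: "la"
  defp agree(det, _gender), do: det
end

ToyTranslation.word_by_word(["the", "big", "house"])
# => "el grande casa"

ToyTranslation.structure_aware(["the", "big", "house"])
# => "la casa grande"
```

The word-by-word result has the wrong article and the wrong adjective position. The structure-aware version fixes both because it knows which word is the head noun.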
Architecture
System Diagram
flowchart TD
A["Source Text<br/>(Language A)"]
B["Parse to AST<br/>(Source Lang)"]
C["AST Transform<br/>(Structural)"] -.-> C1[ASTTransformer]
D["Token Translate<br/>(Lemma mapping)"] -.-> D1[TokenTranslator]
E["Agreement<br/>(Morphology)"] -.-> E1[Agreement]
F["Word Order<br/>(Reordering)"] -.-> F1[WordOrder]
G["Render to Text<br/>(Target Lang)"] -.-> G1[AST.Renderer]
H["Target Text<br/>(Language B)"]
A --> B
B --> C
C --> D
D --> E
E --> F
F --> G
G --> H
Module Structure
graph TD
Root[lib/]
Trans[translation/]
AST[ast/]
Priv[priv/]
TransSub[translation/]
Lex[lexicons/]
Root --> Trans
Root --> AST
Root --> Priv
Trans --> T1[translator.ex<br/>Main API]
Trans --> T2[ast_transformer.ex<br/>AST node transformation]
Trans --> T3[token_translator.ex<br/>Token-level translation]
Trans --> T4[agreement.ex<br/>Morphological agreement]
Trans --> T5[word_order.ex<br/>Word order rules]
Trans --> T6[lexicon_loader.ex<br/>Lexicon management]
AST --> A1[renderer.ex<br/>AST to text rendering]
Priv --> TransSub
TransSub --> Lex
Lex --> L1[en_es.exs<br/>English → Spanish]
Lex --> L2[es_en.exs<br/>Spanish → English]
Lex --> L3[en_ca.exs<br/>English → Catalan]
Lex --> L4[ca_en.exs<br/>Catalan → English]
Quick Start
Basic Translation
alias Nasty.Language.{English, Spanish}
alias Nasty.Translation.Translator
# English to Spanish
{:ok, doc_en} = Nasty.parse("The cat runs.", language: :en)
{:ok, doc_es} = Translator.translate(doc_en, :es)
{:ok, text_es} = Nasty.render(doc_es)
IO.puts(text_es)
# => "El gato corre."
# Spanish to English
{:ok, doc_es} = Nasty.parse("El perro grande.", language: :es)
{:ok, doc_en} = Translator.translate(doc_es, :en)
{:ok, text_en} = Nasty.render(doc_en)
IO.puts(text_en)
# => "The big dog."
Using the High-Level API
# Translate text directly
{:ok, text_es} = Nasty.translate_text("The quick cat.", :en, :es)
# => "El gato rápido."
# Or with explicit parsing
{:ok, ast} = Nasty.parse("The house is big.", language: :en)
{:ok, translated_ast} = Nasty.translate(ast, :es)
{:ok, text} = Nasty.render(translated_ast)
Core Components
1. ASTTransformer
Transforms AST nodes between language structures.
Module: Nasty.Translation.ASTTransformer
Functions:
- transform_document/2 - Transform entire document
- transform_sentence/2 - Transform sentence
- transform_phrase/2 - Transform phrase structures
- transform_clause/2 - Transform clause
Example:
alias Nasty.Translation.ASTTransformer
{:ok, spanish_doc} = ASTTransformer.transform_document(english_doc, :es)
2. TokenTranslator
Performs lemma-to-lemma translation with POS awareness.
Module: Nasty.Translation.TokenTranslator
Functions:
- translate_token/3 - Translate single token
- translate_with_morphology/3 - Translate preserving morphology
- lookup_translation/3 - Lookup in lexicon
Example:
alias Nasty.Translation.TokenTranslator
# cat (noun) → gato (noun)
translated = TokenTranslator.translate_token(token, :en, :es)
# Preserves morphology
# cats (noun, plural) → gatos (noun, plural)
translated = TokenTranslator.translate_with_morphology(token, :en, :es)
3. Agreement
Enforces morphological agreement rules (gender, number, person).
Module: Nasty.Translation.Agreement
Functions:
- apply_agreement/2 - Apply all agreement rules
- apply_determiner_noun/2 - Determiner-noun agreement
- apply_noun_adjective/2 - Noun-adjective agreement
- apply_subject_verb/2 - Subject-verb agreement
Example:
alias Nasty.Translation.Agreement
# Ensure "el gato" (masculine) not "la gato"
adjusted = Agreement.apply_agreement(tokens, :es)
# Ensure "los gatos grandes" (plural agreement throughout)
adjusted = Agreement.apply_agreement(tokens, :es)
4. WordOrder
Applies language-specific word order transformations.
Module: Nasty.Translation.WordOrder
Functions:
- apply_order/2 - Apply all word order rules
- apply_adjective_order/2 - Position adjectives correctly
- apply_svo_order/2 - Subject-Verb-Object ordering
- handle_clitics/2 - Clitic placement
Example:
alias Nasty.Translation.WordOrder
# "the big house" → "la casa grande" (adjective after noun)
ordered = WordOrder.apply_order(phrase, :es)
# "I eat it" → "Lo como" (clitic before verb in Spanish)
ordered = WordOrder.handle_clitics(phrase, :es)
5. LexiconLoader
Manages bidirectional lexicons with ETS caching for fast lookup.
Module: Nasty.Translation.LexiconLoader
Functions:
- load/2 - Load lexicon for language pair
- lookup/3 - Look up translation
- reload/2 - Reload lexicon from file
Example:
alias Nasty.Translation.LexiconLoader
# Load lexicon (cached in ETS)
{:ok, lexicon} = LexiconLoader.load(:en, :es)
# Bidirectional lookup
"gato" = LexiconLoader.lookup(lexicon, "cat", :noun)
"cat" = LexiconLoader.lookup(lexicon, "gato", :noun)
# Reload after editing lexicon file
LexiconLoader.reload(:en, :es)
6. AST.Renderer
Renders AST back to natural language text.
Module: Nasty.AST.Renderer
Functions:
- render_document/1 - Render complete document
- render_sentence/1 - Render single sentence
- render_phrase/1 - Render phrase
- render_tokens/1 - Render token sequence
Example:
alias Nasty.AST.Renderer
# Render with proper spacing and punctuation
{:ok, text} = Renderer.render_document(document)
# Render phrase
{:ok, text} = Renderer.render_phrase(noun_phrase)
# => "el gato grande"
Translation Pipeline
Step-by-Step Process
1. Parse Source Text
alias Nasty.Language.English
text = "The quick brown fox jumps."
{:ok, tokens} = English.tokenize(text)
{:ok, tagged} = English.tag_pos(tokens)
{:ok, doc} = English.parse(tagged)
AST Structure:
graph TD
Doc["Document (language: :en)"]
Para[Paragraph]
Sent[Sentence]
Clause[Clause]
Subj["Subject: NounPhrase"]
Det["Determiner: 'The'"]
Mod["Modifiers: ['quick', 'brown']"]
Head1["Head: 'fox'"]
Pred["Predicate: VerbPhrase"]
Head2["Head: 'jumps'"]
Doc --> Para
Para --> Sent
Sent --> Clause
Clause --> Subj
Clause --> Pred
Subj --> Det
Subj --> Mod
Subj --> Head1
Pred --> Head2
2. Transform AST Structure
alias Nasty.Translation.ASTTransformer
{:ok, doc_es} = ASTTransformer.transform_document(doc, :es)
Changes language: :en to language: :es throughout.
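The language retagging described here can be sketched on plain nested maps. This is only an illustration of the recursive walk. `LanguageRetag` is a hypothetical module, and plain maps stand in for AST nodes, which may be structured differently in Nasty.

```elixir
defmodule LanguageRetag do
  # Recursively set every :language key in a tree of plain maps and lists.
  # Plain maps stand in for AST nodes; structs would need their own clause.
  def retag(%{} = node, target) do
    Map.new(node, fn
      {:language, _old} -> {:language, target}
      {key, value} -> {key, retag(value, target)}
    end)
  end

  def retag(nodes, target) when is_list(nodes) do
    Enum.map(nodes, &retag(&1, target))
  end

  # Leaves (strings, atoms, numbers) are returned unchanged.
  def retag(leaf, _target), do: leaf
end

doc = %{
  language: :en,
  sentences: [
    %{language: :en, tokens: [%{text: "fox", language: :en}]}
  ]
}

LanguageRetag.retag(doc, :es)
# => %{language: :es, sentences: [%{language: :es, tokens: [%{language: :es, text: "fox"}]}]}
```

Note that the token text is still "fox" after this step. Only the language tags change here, and the lemmas are translated in the next step.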
3. Translate Tokens
alias Nasty.Translation.TokenTranslator
# For each token in AST:
# "fox" (noun) → "zorro" (noun)
# "jumps" (verb) → "salta" (verb)
4. Apply Agreement
alias Nasty.Translation.Agreement
# Ensure gender/number agreement:
# "el" (masculine singular) + "zorro" (masculine singular) ✓
# "los" (masculine plural) + "zorros" (masculine plural) ✓
5. Apply Word Order
alias Nasty.Translation.WordOrder
# "the quick brown fox" → "el zorro rápido pardo"
# (adjectives after noun in Spanish for most adjectives)
6. Render to Text
alias Nasty.AST.Renderer
{:ok, text} = Renderer.render_document(doc_es)
# => "El zorro rápido pardo salta."
Morphological Agreement
Gender Agreement
Spanish and Catalan have grammatical gender (masculine/feminine).
Determiner-Noun:
# English: "the cat"
# Spanish: "el gato" (masculine)
# English: "the house"
# Spanish: "la casa" (feminine)
Noun-Adjective:
# English: "the red car"
# Spanish: "el carro rojo" (masculine)
# English: "the red house"
# Spanish: "la casa roja" (feminine)
Number Agreement
Determiners, nouns, and adjectives must agree in number.
# English: "the cats"
# Spanish: "los gatos" (plural)
# English: "the big cats"
# Spanish: "los gatos grandes" (plural throughout)
Person Agreement
Subject-verb agreement by grammatical person.
# English: "I run"
# Spanish: "Yo corro" (first person singular)
# English: "They run"
# Spanish: "Ellos corren" (third person plural)
Word Order Rules
SVO vs. SOV
English, Spanish, and Catalan all use Subject-Verb-Object (SVO) order:
# English: "The cat eats fish."
# Spanish: "El gato come pescado."
# Catalan: "El gat menja peix."
Adjective Position
English: Adjectives before nouns
"the red car"
"the big house"
Spanish/Catalan: Most adjectives after nouns
"el carro rojo" (the car red)
"la casa grande" (the house big)
Exceptions: Some adjectives usually precede the noun
"el buen libro" (the good book) - preferred over "el libro bueno"
"la primera vez" (the first time) - preferred over "la vez primera"
Clitic Placement
Spanish clitic pronouns (lo, la, me, te, se) precede a conjugated verb and attach to the end of an infinitive:
# English: "I see it"
# Spanish: "Lo veo" (clitic before conjugated verb)
# English: "I want to see it"
# Spanish: "Quiero verlo" (clitic after infinitive)
Lexicon Management
Lexicon Format
Lexicons are Elixir maps organized by POS tag:
# priv/translation/lexicons/en_es.exs
%{
  noun: %{
    "cat" => "gato",
    "house" => "casa",
    "book" => "libro"
  },
  verb: %{
    "run" => "correr",
    "eat" => "comer",
    "sleep" => "dormir"
  },
  adj: %{
    "big" => "grande",
    "red" => "rojo",
    "quick" => "rápido"
  },
  det: %{
    "the" => "el",
    "a" => "un",
    "some" => "algunos"
  }
}
Morphological Information
Include gender/number for target language:
%{
  noun: %{
    "cat" => %{lemma: "gato", gender: :masculine},
    "house" => %{lemma: "casa", gender: :feminine},
    "dog" => %{lemma: "perro", gender: :masculine}
  }
}
Idiomatic Expressions
Handle multi-word expressions:
%{
  idioms: %{
    "kick the bucket" => "estirar la pata",
    "break the ice" => "romper el hielo",
    "piece of cake" => "pan comido"
  }
}
Custom Lexicons
Add domain-specific vocabulary:
# priv/translation/lexicons/custom_tech_en_es.exs
%{
  noun: %{
    "widget" => "componente",
    "server" => "servidor",
    "database" => "base de datos"
  },
  verb: %{
    "deploy" => "desplegar",
    "compile" => "compilar",
    "debug" => "depurar"
  }
}
Load custom lexicons:
LexiconLoader.load(:en, :es, path: "priv/translation/lexicons/custom_tech_en_es.exs")
Supported Language Pairs
Direct Pairs
- English ↔ Spanish - Full bidirectional support
- English ↔ Catalan - Full bidirectional support
Transitive Pairs
- Spanish ↔ Catalan - Via English (two-step translation)
# Spanish → Catalan (via English)
{:ok, doc_es} = Nasty.parse("El gato corre.", language: :es)
{:ok, doc_en} = Translator.translate(doc_es, :en)
{:ok, doc_ca} = Translator.translate(doc_en, :ca)
{:ok, text_ca} = Nasty.render(doc_ca)
# => "El gat corre."
Customization
Extending Lexicons
- Edit lexicon files in priv/translation/lexicons/
- Add new entries maintaining the POS structure
- Reload lexicons:
LexiconLoader.reload(:en, :es)
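When adding entries, it can help to check how they combine with what is already there. The sketch below deep-merges two POS-keyed maps with `Map.merge/3`, letting the newer entries win. `LexiconMerge` is a hypothetical helper for illustration. It does not describe how `LexiconLoader` combines files internally.

```elixir
defmodule LexiconMerge do
  # Merge two POS-keyed lexicon maps. Entries in `custom` win on conflict.
  def merge(base, custom) do
    Map.merge(base, custom, fn _pos, base_entries, custom_entries ->
      Map.merge(base_entries, custom_entries)
    end)
  end
end

base = %{
  noun: %{"cat" => "gato", "server" => "camarero"},
  verb: %{"run" => "correr"}
}

custom = %{
  noun: %{"server" => "servidor", "widget" => "componente"},
  adj: %{"quick" => "rápido"}
}

merged = LexiconMerge.merge(base, custom)

merged.noun["server"]
# => "servidor"

merged.noun["cat"]
# => "gato"

merged |> Map.keys() |> Enum.sort()
# => [:adj, :noun, :verb]
```

The conflict function only runs for POS keys present in both maps, so a POS section that exists in just one of them is copied over unchanged.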
Custom Agreement Rules
Extend Nasty.Translation.Agreement:
defmodule MyApp.CustomAgreement do
  def apply_custom_rule(tokens, _language) do
    # Custom agreement logic goes here; this stub returns the tokens unchanged
    tokens
  end
end
Custom Word Order Rules
Extend Nasty.Translation.WordOrder:
defmodule MyApp.CustomWordOrder do
  def apply_custom_order(phrase, _language) do
    # Custom word order logic goes here; this stub returns the phrase unchanged
    phrase
  end
end
Best Practices
1. Sentence-Level Translation
Translate sentence by sentence for best results:
# Split after sentence-ending punctuation, keeping the punctuation mark
sentences = String.split(text, ~r/(?<=[.!?])\s+/, trim: true)
translated =
  sentences
  |> Enum.map(fn sent ->
    {:ok, doc} = Nasty.parse(sent, language: :en)
    {:ok, doc_es} = Translator.translate(doc, :es)
    {:ok, text_es} = Nasty.render(doc_es)
    text_es
  end)
  |> Enum.join(" ")
2. Review Idiomatic Expressions
Idiomatic expressions may not translate literally:
# "It's raining cats and dogs"
# Literal: "Está lloviendo gatos y perros" ❌
# Idiomatic: "Está lloviendo a cántaros" ✓
3. Extend Lexicons for Domain Text
For technical/specialized text, add domain vocabulary:
# Add medical, legal, technical terms
# to custom lexicon files
4. Use for Formal/Technical Text
Best for:
- Technical documentation
- Formal correspondence
- News articles
- Academic text
Less suitable for:
- Poetry
- Idiomatic speech
- Creative writing
5. Verify Grammatical Gender
Some nouns have unexpected gender:
# "problem" → "problema" (masculine in Spanish!)
# "hand" → "mano" (feminine)
Check lexicons and adjust if needed.
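A small sketch shows why explicit gender entries matter. A guess based only on the noun's ending gets both nouns above wrong, while an explicit entry overrides the guess. `GenderCheck`, its `@explicit` map, and the ending rule are hypothetical and only illustrate the idea.

```elixir
defmodule GenderCheck do
  # Explicit gender entries, in the spirit of the lexicon's morphological maps.
  @explicit %{"problema" => :masculine, "mano" => :feminine}

  # Naive ending-based guess: only a heuristic, wrong for the nouns above.
  def guess(lemma) do
    cond do
      String.ends_with?(lemma, "a") -> :feminine
      String.ends_with?(lemma, "o") -> :masculine
      true -> :unknown
    end
  end

  # Prefer the explicit entry and fall back to the guess.
  def gender(lemma), do: Map.get(@explicit, lemma, guess(lemma))

  # Pick the singular definite article; unknown gender defaults to "el".
  def with_article(lemma) do
    case gender(lemma) do
      :feminine -> "la " <> lemma
      _other -> "el " <> lemma
    end
  end
end

GenderCheck.guess("problema")
# => :feminine (wrong)

GenderCheck.with_article("problema")
# => "el problema"

GenderCheck.with_article("mano")
# => "la mano"
```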
Limitations
Current Limitations
Idiomatic Expressions
- May translate literally rather than idiomatically
- Solution: Add idiom mappings to lexicons
Complex Verb Tenses
- Some tense combinations may not map perfectly
- Solution: Manual review for complex tenses
Cultural Context
- Cultural references not adapted
- Solution: Add context-aware transformations
Ambiguous Words
- First lexicon entry used for ambiguous words
- Solution: Add context-aware lexicon lookup
Limited Language Pairs
- Currently English, Spanish, Catalan only
- Solution: Add more language implementations
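As one possible shape for the context-aware lookup suggested for ambiguous words, the sketch below picks the candidate whose cue words overlap most with the surrounding sentence. With no overlap it falls back to the first entry, which mirrors the current behaviour described above. `SenseLookup`, the cue lists, and the scoring rule are assumptions made for illustration and are not part of Nasty.

```elixir
defmodule SenseLookup do
  # Hypothetical sense inventory: each candidate translation lists cue words.
  @senses %{
    "bank" => [
      %{translation: "banco", cues: ["money", "account", "loan"]},
      %{translation: "orilla", cues: ["river", "water", "shore"]}
    ]
  }

  # Choose the sense whose cues overlap most with the context words.
  # Enum.max_by/2 returns the first maximal element, so when nothing
  # overlaps, the first entry is used.
  def translate(word, context_words) do
    context = MapSet.new(context_words)
    senses = Map.fetch!(@senses, word)

    best =
      Enum.max_by(senses, fn sense ->
        Enum.count(sense.cues, &MapSet.member?(context, &1))
      end)

    best.translation
  end
end

SenseLookup.translate("bank", ["the", "river", "was", "muddy"])
# => "orilla"

SenseLookup.translate("bank", ["she", "opened", "an", "account"])
# => "banco"

SenseLookup.translate("bank", ["the", "bank"])
# => "banco"
```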
Workarounds
For idiomatic text:
# Pre-process idioms before translation
text = String.replace(text, "kick the bucket", "die")
For ambiguous words:
# Use context or manual disambiguation
# "bank" (financial) vs "bank" (river)
For complex grammar:
# Simplify sentence structure before translation
# "Having been running..." → "He ran..."
Future Enhancements
- Neural translation integration
- Context-aware lexicon selection
- Multi-sentence context for pronouns
- Statistical phrase translation
- User feedback learning
- More language pairs (French, German, etc.)
See Also
- API.md - Translation API reference
- ARCHITECTURE.md - System architecture
- USER_GUIDE.md - User guide with examples
- CROSS_LINGUAL.md - Cross-lingual transfer learning