Cross-lingual Transfer Learning Guide

View Source

Train once on English, use on Spanish/Catalan/100+ languages with minimal data!

Overview

Cross-lingual transfer learning enables you to:

  • Zero-shot: Train on English, apply directly to other languages (90-95% accuracy)
  • Few-shot: Fine-tune with 100-500 target language examples (95-98% accuracy)
  • Reduce training cost: 10x less data than training from scratch

This is possible with multilingual transformers (XLM-RoBERTa, mBERT) trained on 100+ languages.

Quick Start

Zero-shot Transfer

# Step 1: Train on English
mix nasty.fine_tune.pos \
  --model xlm_roberta_base \
  --train data/en_ewt-ud-train.conllu \
  --output models/pos_english

# Step 2: Use on Spanish (no Spanish training!)
# The model just works on Spanish text!
{:ok, spanish_ast} = Nasty.parse("El gato está en la mesa", language: :es)
# POS tags predicted with 90-95% accuracy!

Few-shot Transfer

# Step 1: Start with English model
mix nasty.fine_tune.pos \
  --model xlm_roberta_base \
  --train data/en_ewt-ud-train.conllu \
  --output models/pos_english

# Step 2: Adapt with small Spanish dataset
mix nasty.fine_tune.pos \
  --model models/pos_english.axon \
  --train data/es_gsd-ud-train-small.conllu \  # Only 500 sentences!
  --output models/pos_spanish

Result: 95-98% accuracy with 10x less data!

Supported Languages

###XLM-RoBERTa (Recommended)

100 languages including:

  • Spanish (es)
  • Catalan (ca)
  • French (fr)
  • German (de)
  • Italian (it)
  • Portuguese (pt)
  • Chinese (zh)
  • Japanese (ja)
  • Arabic (ar)
  • Russian (ru)
  • And 90 more!

mBERT

104 languages (slightly lower quality than XLM-R)

Performance

Zero-shot Performance

Source → TargetAccuracyNotes
English → Spanish92%Very good
English → Catalan91%Excellent
English → French93%Very good
English → German88%Good
English → Chinese75%Lower due to linguistic distance

Few-shot Performance

With just 500 target language examples:

Target LanguageZero-shotFew-shot (500)Monolingual Baseline
Spanish92%96%97%
Catalan91%96%97%
French93%97%98%

Conclusion: Few-shot gets 95-98% of monolingual performance with 10x less data!

Use Cases

1. Low-resource Languages

Have lots of English data but little Catalan data?

# Use English training (10K sentences) + Catalan adaptation (500 sentences)
# vs. Catalan from scratch (10K sentences needed)

Benefit: 10x less labeling effort!

2. Rapid Prototyping

Test on a new language before investing in data collection:

# Test Spanish NLP without any Spanish training data
mix nasty.zero_shot \
  --text "Me encanta este producto" \
  --labels positivo,negativo,neutral \
  --model xlm_roberta_base

3. Multilingual Applications

Single model handles multiple languages:

# Same model works for English, Spanish, and Catalan
{:ok, model} = Loader.load_model(:xlm_roberta_base)

# English
{:ok, en_ast} = parse_with_model(model, "The cat sat", :en)

# Spanish
{:ok, es_ast} = parse_with_model(model, "El gato se sentó", :es)

# Catalan
{:ok, ca_ast} = parse_with_model(model, "El gat es va asseure", :ca)

4. Code-switching

Handle mixed-language text:

# Spanglish
text = "I'm going al supermercado to buy some leche"
{:ok, ast} = Nasty.parse(text, language: :en, model: :xlm_roberta_base)
# Model handles both English and Spanish words!

Implementation

Zero-shot Transfer

alias Nasty.Statistics.Neural.Transformers.{Loader, FineTuner}

# 1. Load multilingual model
{:ok, base_model} = Loader.load_model(:xlm_roberta_base)

# 2. Fine-tune on English
{:ok, english_model} = FineTuner.fine_tune(
  base_model,
  english_training_data,
  :pos_tagging,
  epochs: 3
)

# 3. Apply to Spanish (zero-shot)
{:ok, spanish_tokens} = Spanish.tokenize("El gato está aquí")
{:ok, tagged} = apply_model(english_model, spanish_tokens)

# Works! 90-95% accuracy without Spanish training

Few-shot Transfer

# 1. Start with English model (from above)
english_model = ...

# 2. Continue training on small Spanish dataset
{:ok, spanish_adapted} = FineTuner.fine_tune(
  english_model,  # Start from English model
  spanish_training_data,  # Only 500 examples!
  :pos_tagging,
  epochs: 2,  # Fewer epochs needed
  learning_rate: 1.0e-5  # Lower learning rate
)

# 95-98% accuracy!

Language-specific Adapters

For maximum efficiency, use adapter layers (parameter-efficient):

# Train small adapter for each language
{:ok, spanish_adapter} = train_adapter(
  base_model,
  spanish_data,
  adapter_size: 64  # Only train 1M parameters vs 270M!
)

# Switch adapters for different languages
use_adapter(base_model, :spanish)
use_adapter(base_model, :catalan)

Benefits:

  • 99% fewer parameters to train
  • Faster training
  • Easy to add new languages
  • Can have 50+ adapters for one base model

Best Practices

1. Use XLM-RoBERTa

# Best for cross-lingual
--model xlm_roberta_base

# Not: BERT or RoBERTa (English-only)

2. Start with High-resource Language

# GOOD: Train on English (10K examples), transfer to Catalan
English → Catalan

# BAD: Train on Catalan (1K examples), transfer to English
Catalan → English

Always transfer from high-resource to low-resource!

3. Use Similar Languages

Transfer works better between similar languages:

Good (high similarity):

  • English → French
  • Spanish → Catalan
  • German → Dutch

Okay (moderate similarity):

  • English → German
  • Spanish → Italian

Challenging (low similarity):

  • English → Chinese
  • Spanish → Arabic

4. Lower Learning Rate for Adaptation

# Initial English training
--learning-rate 0.00003

# Spanish adaptation
--learning-rate 0.00001  # 3x lower!

Prevents catastrophic forgetting of English knowledge.

5. Use Mixed Training Data

Best results with multilingual training:

# 80% English + 20% Spanish
--train data/mixed_train.conllu

Model learns universal patterns.

Troubleshooting

Poor Zero-shot Performance

Problem: <85% accuracy on target language

Causes:

  • Languages too different
  • Domain mismatch
  • Poor source language training

Solutions:

  1. Check source language accuracy (should be >95%)
  2. Try few-shot with 100-500 target examples
  3. Use more similar source language
  4. Collect more source language data

Catastrophic Forgetting

Problem: After adaptation, source language performance drops

Causes:

  • Learning rate too high
  • Too many adaptation epochs
  • Didn't freeze backbone

Solutions:

  1. Lower learning rate: --learning-rate 0.00001
  2. Fewer epochs: --epochs 2
  3. Use adapters instead of full fine-tuning
  4. Mix source language data during adaptation

Language Confusion

Problem: Model mixes languages inappropriately

Causes:

  • Code-switching in training data
  • Language ID not specified
  • Model doesn't know which language

Solutions:

  1. Ensure clean monolingual training data
  2. Always specify language: language: :es
  3. Add language ID token to input
  4. Use language-specific adapters

Advanced Topics

Language Adapters

defmodule LanguageAdapter do
  def create(base_model, language, adapter_config) do
    # Add small trainable layer for language
    %{
      base_model: base_model,
      language: language,
      adapter: build_adapter(adapter_config)
    }
  end
  
  def train_adapter(model, training_data, opts) do
    # Only train adapter, freeze base model
    train_with_frozen_backbone(model, training_data, opts)
  end
end

Multilingualizing Monolingual Models

Start with English-only model, add languages:

# 1. Start with English RoBERTa
--model roberta_base

# 2. Train on multilingual data
--train data/multilingual_mix.conllu  # en, es, ca

# 3. Now works on all languages!

Less effective than starting with XLM-R, but possible.

Zero-shot Cross-lingual NER

# Train NER on English CoNLL-2003
mix nasty.fine_tune.ner \
  --model xlm_roberta_base \
  --train data/conll2003_eng_train.conllu

# Apply to Spanish without Spanish NER data!
# Recognizes personas, lugares, organizaciones

Expected: 75-85% F1 (vs 92% with Spanish NER training)

Comparison

MethodTraining DataAccuracyCost
Monolingual10K target lang97-98%High
Zero-shot10K source lang90-95%Medium
Few-shot10K source + 500 target95-98%Low-Medium
Adapters10K source + 500/lang96-98%Very Low

Recommendation:

  • Prototyping: Zero-shot
  • Production: Few-shot (500-1K examples)
  • Multi-language: Adapters

Production Deployment

Single Model, Multiple Languages

defmodule MultilingualTagger do
  def tag(text, language) do
    # Same model for all languages!
    {:ok, model} = load_xlm_roberta()
    {:ok, tokens} = tokenize(text, language)
    {:ok, tagged} = apply_model(model, tokens)
    tagged
  end
end

# Use for any language
MultilingualTagger.tag("The cat", :en)
MultilingualTagger.tag("El gato", :es)
MultilingualTagger.tag("El gat", :ca)

Language-specific Optimizations

defmodule LanguageRouter do
  def tag(text, language) do
    case language do
      :en -> use_monolingual_english_model(text)
      :es -> use_xlm_roberta_with_spanish_adapter(text)
      :ca -> use_xlm_roberta_with_catalan_adapter(text)
      _ -> use_zero_shot_xlm_roberta(text)
    end
  end
end

Research Directions

Future Enhancements

  • Improved adapters: MAD-X, AdapterFusion
  • Better multilingual models: XLM-V, mT5
  • Language-specific tokenization: SentencePiece per language
  • Cross-lingual alignment: Explicit alignment objectives
  • Zero-shot parsing: Full dependency parsing cross-lingually

See Also