Architecture Refactoring Guide

This document explains the ongoing refactoring to extract language-agnostic layers from language-specific implementations.

Overview

Before this refactoring, every NLP operation was embedded in a language implementation (e.g., Nasty.Language.English.Summarizer). The goal is to extract generic, behaviour-based layers that can be reused across languages.

Current Structure (Before Refactoring)

lib/
├── language/
│   ├── behaviour.ex          # Language interface
│   ├── registry.ex
│   └── english/
│       ├── summarizer.ex      # English-specific
│       ├── text_classifier.ex # English-specific
│       ├── entity_recognizer.ex # English-specific
│       ├── coreference_resolver.ex
│       └── ... (17 modules)

Target Structure (After Refactoring)

lib/
├── language/
│   ├── behaviour.ex          # Core language interface
│   ├── registry.ex
│   └── english/
│       ├── english.ex         # Main module
│       ├── tokenizer.ex
│       ├── pos_tagger.ex
│       ├── phrase_parser.ex
│       └── adapters/          # Adapters to generic layers
│           ├── summarizer_adapter.ex
│           ├── classifier_adapter.ex
│           └── ner_adapter.ex
├── operations/                # Generic NLP operations
│   ├── summarization.ex      # Behaviour
│   ├── classification.ex     # Behaviour
│   └── question_answering.ex # Behaviour
└── semantic/                  # Generic semantic analysis
    ├── entity_recognition.ex  # Behaviour
    ├── coreference_resolution.ex # Behaviour
    └── semantic_role_labeling.ex # Behaviour

New Behaviour Layers

1. Operations Layer (`lib/operations/`)

Language-agnostic NLP operations that produce results:

`Nasty.Operations.Summarization`

@callback summarize(Document.t(), options()) :: 
  {:ok, [Sentence.t()] | String.t()} | {:error, term()}
@callback methods() :: [method()]

Purpose: Extract or generate summaries from documents

Implementation: Nasty.Language.English.SummarizerAdapter
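The callbacks above refer to types such as options() and method() that this guide does not spell out. The following is a minimal sketch of how such a behaviour and one implementation could fit together. The module names (BehaviourSketch.*), the type definitions, and the simplification of a document to a list of sentence strings are all assumptions made for illustration, not the actual Nasty definitions.

```elixir
defmodule BehaviourSketch.Summarization do
  # Hypothetical stand-in for Nasty.Operations.Summarization.
  # A document is simplified to a list of sentence strings (assumption).
  @type document :: [String.t()]
  @type method :: atom()
  @type options :: keyword()

  @callback summarize(document(), options()) ::
              {:ok, [String.t()] | String.t()} | {:error, term()}
  @callback methods() :: [method()]
end

defmodule BehaviourSketch.LeadSummarizer do
  # Toy implementation: keeps the first N sentences of the document.
  @behaviour BehaviourSketch.Summarization

  @impl true
  def summarize(sentences, opts) when is_list(sentences) do
    count = Keyword.get(opts, :max_sentences, 1)
    {:ok, Enum.take(sentences, count)}
  end

  def summarize(_other, _opts), do: {:error, :invalid_document}

  @impl true
  def methods, do: [:lead]
end
```

Because the implementation declares @behaviour and marks each callback with @impl true, the compiler warns when a callback is missing or misnamed, which is the main practical benefit of defining the interface first.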

`Nasty.Operations.Classification`

@callback train(training_data(), options()) :: {:ok, model()} | {:error, term()}
@callback classify(model(), input(), options()) :: {:ok, Classification.t()} | {:error, term()}

Purpose: Train and use text classifiers

Implementation: Nasty.Language.English.ClassifierAdapter
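To illustrate the train/classify split in the callbacks above, here is a deliberately simple sketch. It is not the classifier used by Nasty: the module name ClassifierSketch, the keyword-overlap model, and the plain map returned in place of Classification.t() are assumptions chosen to keep the example self-contained.

```elixir
defmodule ClassifierSketch do
  # Toy classifier: the model is a map of label => set of words seen
  # in training examples for that label.

  def train(training_data, _opts \\ []) when is_list(training_data) do
    model =
      Enum.reduce(training_data, %{}, fn {text, label}, acc ->
        words = text |> tokenize() |> MapSet.new()
        Map.update(acc, label, words, &MapSet.union(&1, words))
      end)

    if map_size(model) == 0, do: {:error, :no_training_data}, else: {:ok, model}
  end

  # Picks the label whose vocabulary shares the most words with the input.
  def classify(model, input, _opts \\ []) do
    words = input |> tokenize() |> MapSet.new()

    {label, overlap} =
      model
      |> Enum.map(fn {label, vocab} ->
        {label, MapSet.size(MapSet.intersection(vocab, words))}
      end)
      |> Enum.max_by(fn {_label, overlap} -> overlap end)

    {:ok, %{label: label, score: overlap}}
  end

  # Lowercases and splits on anything that is not a letter.
  defp tokenize(text) do
    text |> String.downcase() |> String.split(~r/[^\p{L}]+/u, trim: true)
  end
end
```

The point of the sketch is the shape of the interface: train returns an opaque model, and classify takes that model back as its first argument, so callers never depend on how the model is represented.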

2. Semantic Layer (`lib/semantic/`)

Language-agnostic semantic analysis:

`Nasty.Semantic.EntityRecognition`

@callback recognize_document(Document.t(), options()) :: {:ok, [Entity.t()]} | {:error, term()}
@callback recognize(tokens(), options()) :: {:ok, [Entity.t()]} | {:error, term()}

Purpose: Named entity recognition across languages

Implementation: Nasty.Language.English.NERAdapter

`Nasty.Semantic.CoreferenceResolution`

@callback resolve(Document.t(), options()) :: {:ok, Document.t()} | {:error, term()}

Purpose: Resolve coreferences in text

Implementation: Nasty.Language.English.CoreferenceAdapter

Migration Strategy

Phase 1: Create Behaviour Definitions (COMPLETED)

✅ Status: Complete

Created lib/operations/ with base behaviours
Created lib/semantic/ with base behaviours
Defined clear interfaces for each operation

Phase 2: Create Adapter Pattern (IN PROGRESS)

Goal: Adapt existing English implementations to new behaviours without breaking changes

Approach:

Keep existing modules functioning as-is
Create adapter modules that implement new behaviours
Adapters delegate to existing implementations
Update top-level APIs to use adapters when available

Example Adapter:

defmodule Nasty.Language.English.SummarizerAdapter do
  @behaviour Nasty.Operations.Summarization
  
  alias Nasty.Language.English.Summarizer
  
  @impl true
  def summarize(document, opts) do
    # Delegate to existing implementation
    sentences = Summarizer.summarize(document, opts)
    {:ok, sentences}
  end
  
  @impl true
  def methods, do: [:extractive, :mmr]
end

Phase 3: Refactor Implementations (COMPLETED for 2 modules)

✅ Status: Complete for Summarization and Entity Recognition

Goal: Move language-agnostic logic out of language modules

Completed Work:

✅ Created Nasty.Operations.Summarization.Extractive - Generic extractive summarization
✅ Created Nasty.Semantic.EntityRecognition.RuleBased - Generic rule-based NER
✅ Refactored English.Summarizer to delegate to generic module (69% code reduction)
✅ Refactored English.EntityRecognizer to delegate to generic module (23% code reduction)
✅ All language-specific logic (lexicons, stop words, patterns) remains in English modules
✅ All 360 tests passing with no breaking changes
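The split described above, with a generic detector and language-specific lexicons, can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the code in Nasty.Semantic.EntityRecognition.RuleBased: tokens are plain strings, the lexicon is a map from surface text to entity type, and the module name NERSketch is hypothetical.

```elixir
defmodule NERSketch do
  # Generic part: find runs of consecutive capitalized tokens, then look
  # each run up in a lexicon supplied by the language module.
  def recognize(tokens, lexicon) when is_list(tokens) and is_map(lexicon) do
    tokens
    |> Enum.chunk_by(&capitalized?/1)
    |> Enum.filter(fn [first | _] -> capitalized?(first) end)
    |> Enum.map(fn run ->
      text = Enum.join(run, " ")
      %{text: text, type: Map.get(lexicon, text, :unknown)}
    end)
  end

  # A token counts as capitalized when its first character changes
  # under downcasing; punctuation and empty strings do not qualify.
  defp capitalized?(token) do
    first = String.first(token)
    first != nil and first != String.downcase(first)
  end
end
```

A language module would contribute only the lexicon, for example `NERSketch.recognize(tokens, %{"Madrid" => :place})`. Note that this naive detector also picks up sentence-initial words, which a real implementation would need to handle.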

Phase 4: Extract Generic Algorithms (COMPLETED for 2 modules)

✅ Status: Complete for Summarization and Entity Recognition

Extracted Algorithms:

✅ Nasty.Operations.Summarization.Extractive (440 lines)
- Position scoring, length scoring, TF-IDF keyword scoring
- Entity scoring, discourse marker scoring, coreference scoring
- Greedy and MMR selection algorithms
- Jaccard similarity for redundancy reduction
✅ Nasty.Semantic.EntityRecognition.RuleBased (237 lines)
- Sequence detection (finds capitalized token sequences)
- Configurable classification framework
- Lexicon matching, pattern matching, heuristic classification
- Generic entity creation with proper span calculation
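As a rough illustration of how Jaccard similarity and MMR (maximal marginal relevance) selection combine to reduce redundancy, consider the sketch below. It assumes candidates arrive as {tokens, relevance} tuples and uses a hypothetical module name (MMRSketch) and a lambda trade-off parameter; the actual Extractive module may structure this differently.

```elixir
defmodule MMRSketch do
  # Jaccard similarity between two token lists: |A ∩ B| / |A ∪ B|.
  def jaccard(a, b) do
    set_a = MapSet.new(a)
    set_b = MapSet.new(b)
    union = MapSet.size(MapSet.union(set_a, set_b))

    if union == 0 do
      0.0
    else
      MapSet.size(MapSet.intersection(set_a, set_b)) / union
    end
  end

  # candidates: list of {tokens, relevance} tuples.
  # lambda near 1.0 favours relevance; near 0.0 favours novelty.
  def select(candidates, count, lambda \\ 0.7) do
    do_select(candidates, count, lambda, [])
  end

  defp do_select([], _count, _lambda, picked), do: Enum.reverse(picked)
  defp do_select(_rest, 0, _lambda, picked), do: Enum.reverse(picked)

  defp do_select(rest, count, lambda, picked) do
    best =
      Enum.max_by(rest, fn {tokens, relevance} ->
        # Redundancy is the highest similarity to anything already picked.
        redundancy =
          Enum.reduce(picked, 0.0, fn {chosen, _}, acc ->
            max(acc, jaccard(tokens, chosen))
          end)

        lambda * relevance - (1 - lambda) * redundancy
      end)

    do_select(List.delete(rest, best), count - 1, lambda, [best | picked])
  end
end
```

With lambda set to 0.5, a highly relevant sentence that repeats most of an already selected sentence loses to a less relevant but novel one, which is the behaviour the redundancy reduction is meant to provide.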

Remaining modules for future phases:

[ ] Coreference Resolution
[ ] Semantic Role Labeling
[ ] Question Answering
[ ] Text Classification

Benefits of Refactoring

1. Code Reuse

Generic algorithms work across all languages
Less duplication when adding new languages
Easier to maintain and test

2. Clear Separation

Language-specific logic clearly separated
Generic operations have well-defined interfaces
Easier to understand system architecture

3. Easier Language Addition

# Before: implement all 17 modules for each new language
defmodule Nasty.Language.Spanish.Summarizer do
  # 200 lines of code
end

# After: implement an adapter that supplies language-specific configuration
defmodule Nasty.Language.Spanish.SummarizerAdapter do
  @behaviour Nasty.Operations.Summarization

  # Language-specific configuration lives here (241 lines in the real adapter);
  # the generic algorithm (440 lines) is reused through delegation.

  # Abbreviated sample; the real list has 100+ entries
  @spanish_stop_words ~w(el la los las de en y que)

  # Only language-specific parts are defined in the adapter
  def stop_words, do: @spanish_stop_words
end

4. Testing

Test generic algorithms once
Test language-specific adaptations separately
Mock behaviours easily in tests
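One way to read "mock behaviours easily" is to inject a stub module that implements the behaviour. The sketch below assumes hypothetical module names (TestingSketch.*) and a simplified single-callback behaviour; it does not rely on any mocking library.

```elixir
defmodule TestingSketch.Summarizer do
  # Hypothetical behaviour standing in for Nasty.Operations.Summarization.
  @callback summarize(term(), keyword()) :: {:ok, term()} | {:error, term()}
end

defmodule TestingSketch.StubSummarizer do
  # Test double: returns a canned summary without running any NLP.
  @behaviour TestingSketch.Summarizer

  @impl true
  def summarize(_document, _opts), do: {:ok, ["stub summary"]}
end

defmodule TestingSketch.Report do
  # Code under test receives the summarizer module as an argument,
  # so a test can pass the stub instead of a real adapter.
  def headline(document, summarizer) do
    case summarizer.summarize(document, []) do
      {:ok, [first | _]} -> first
      _ -> "(no summary)"
    end
  end
end
```

In a test, `TestingSketch.Report.headline(doc, TestingSketch.StubSummarizer)` exercises the calling code without depending on any language implementation.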

Backward Compatibility

Maintaining Existing APIs

All existing code continues to work:

# Still works
Nasty.Language.English.Summarizer.summarize(doc, [])

# Also works with new adapter
Nasty.Operations.Summarization.summarize(doc, language: :en)

Deprecation Strategy

Keep old modules functional
Add deprecation warnings after adapters are complete
Remove old modules in next major version
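The second step can use Elixir's built-in @deprecated attribute, which makes the compiler warn at call sites while the old function keeps working. The module names below (DeprecationSketch.*) and the toy summarization logic are hypothetical; only the @deprecated mechanism itself is standard Elixir.

```elixir
defmodule DeprecationSketch.NewSummarizer do
  # Stand-in for the new adapter-based API.
  def summarize(sentences, opts) do
    {:ok, Enum.take(sentences, Keyword.get(opts, :max, 1))}
  end
end

defmodule DeprecationSketch.OldSummarizer do
  # The old entry point keeps its original return shape (a bare list)
  # and delegates to the new API, so existing callers are unaffected.
  @deprecated "Use DeprecationSketch.NewSummarizer.summarize/2 instead"
  def summarize(sentences, opts) do
    {:ok, result} = DeprecationSketch.NewSummarizer.summarize(sentences, opts)
    result
  end
end
```

Callers of the old function see a compile-time warning that names the replacement, which gives them a migration path before the module is removed in the next major version.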

Implementation Checklist

Operations Layer

[x] Create lib/operations/summarization.ex behaviour
[x] Create lib/operations/classification.ex behaviour
[x] Create English adapters for operations
[x] Extract generic algorithms
- [x] Nasty.Operations.Summarization.Extractive
[ ] Create lib/operations/question_answering.ex behaviour
[ ] Extract remaining generic algorithms

Semantic Layer

[x] Create lib/semantic/entity_recognition.ex behaviour
[x] Create lib/semantic/coreference_resolution.ex behaviour
[x] Create English adapters for semantic operations
[x] Extract generic algorithms
- [x] Nasty.Semantic.EntityRecognition.RuleBased
[ ] Create lib/semantic/semantic_role_labeling.ex behaviour
[ ] Extract remaining generic algorithms

Documentation

[x] Create REFACTORING.md guide
[x] Update REFACTORING.md with Phase 3-4 completion
[x] Document adapter pattern with Spanish implementation example
[ ] Update ARCHITECTURE.md with new layers
[ ] Add migration examples

Language Implementations

[x] English adapters (3 total)
- [x] SummarizerAdapter
- [x] EntityRecognizerAdapter
- [x] CoreferenceResolverAdapter
[x] Spanish adapters (3 total, 843 lines)
- [x] SummarizerAdapter (241 lines)
- [x] EntityRecognizerAdapter (346 lines)
- [x] CoreferenceResolverAdapter (256 lines)
[x] Spanish implementation validates adapter pattern (45% code reduction)
[ ] Catalan adapters (future)

Example: Adapting Summarizer

Step 1: Current Implementation

defmodule Nasty.Language.English.Summarizer do
  def summarize(%Document{} = doc, opts) do
    # 200 lines of extractive summarization logic
  end
end

Step 2: Create Adapter

defmodule Nasty.Language.English.SummarizerAdapter do
  @behaviour Nasty.Operations.Summarization
  
  alias Nasty.Language.English.Summarizer
  
  @impl true
  def summarize(document, opts) do
    result = Summarizer.summarize(document, opts)
    {:ok, result}
  end
  
  @impl true
  def methods, do: [:extractive, :mmr]
end

Step 3: Update Top-Level API

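defmodule Nasty do
  def summarize(text_or_ast, opts) do
    # Parsing raw text into a Document is omitted here; `ast` is assumed
    # to already be a Document
    ast = text_or_ast

    # Use the adapter if one is available for the requested language
    case get_summarizer_adapter(opts[:language]) do
      {:ok, adapter} -> adapter.summarize(ast, opts)
      {:error, _} -> fallback_to_old_api(ast, opts)
    end
  end
end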
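The snippet above leaves get_summarizer_adapter undefined. One possible shape for it is a lookup in a language-to-adapter map with a fallback path, sketched below. The module names (DispatchSketch.*), the compile-time map, and the legacy_summarize stand-in are assumptions; the real project may resolve adapters through its language registry instead.

```elixir
defmodule DispatchSketch.EnglishAdapter do
  # Toy adapter: returns the first sentence as the summary.
  def summarize(sentences, _opts), do: {:ok, Enum.take(sentences, 1)}
end

defmodule DispatchSketch do
  # Hypothetical registry: language code => adapter module.
  @adapters %{en: DispatchSketch.EnglishAdapter}

  def get_summarizer_adapter(language) do
    case Map.fetch(@adapters, language) do
      {:ok, adapter} -> {:ok, adapter}
      :error -> {:error, {:no_adapter, language}}
    end
  end

  def summarize(document, opts) do
    case get_summarizer_adapter(Keyword.get(opts, :language, :en)) do
      {:ok, adapter} -> adapter.summarize(document, opts)
      {:error, _reason} -> legacy_summarize(document, opts)
    end
  end

  # Stand-in for the pre-refactoring code path.
  defp legacy_summarize(document, _opts), do: {:ok, document}
end
```

Keeping the fallback branch means languages without an adapter continue to work during the migration, which matches the "no breaking changes" goal of Phase 2.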

Step 4: Extract Generic Algorithm (Completed for Summarization)

defmodule Nasty.Operations.Summarization.Extractive do
  def summarize(sentences, scoring_fn, opts) do
    # Generic extractive summarization
    # Works for any language with custom scoring_fn
  end
end

defmodule Nasty.Language.English.SummarizerAdapter do
  use Nasty.Operations.Summarization.Extractive
  
  def score_sentence(sentence, context) do
    # English-specific scoring using stop words, etc.
  end
end

Contributing

When adding new NLP features:

Define behaviour first in lib/operations/ or lib/semantic/
Implement for English as an adapter
Extract generic algorithms where possible
Document the behaviour and implementation strategy

Success Story: Spanish Implementation

The Spanish language implementation (2026-01-08) validates the refactoring strategy:

Metrics

3 adapters: 843 total lines providing Spanish-specific configuration
Generic algorithms reused: 677+ lines (Summarization, NER, Coreference)
Code reduction: 45% through delegation to generic implementations
Time to implement: ~1 week for complete pipeline
Test coverage: 641 tests passing (9 Spanish-specific)

Adapter Implementation

Spanish Summarizer Adapter (241 lines):

5 categories of discourse markers (conclusion, emphasis, causal, contrast, addition)
100+ Spanish stop words
Punctuation patterns
Delegates all scoring and selection to Operations.Summarization.Extractive (440 lines)
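The division of labour listed above, with data in the adapter and scoring in the generic module, might look like the following sketch. The module names (ConfigSketch.*), the config keys, the scoring formula, and the tiny word lists are illustrative assumptions and do not reproduce the real adapter or the Extractive scoring functions.

```elixir
defmodule ConfigSketch.Scorer do
  # Generic part: scores a sentence using configuration supplied by a
  # language module. Score = content-word count + 1 if a marker appears.
  def score(sentence, %{stop_words: stop_words, discourse_markers: markers}) do
    words =
      sentence
      |> String.downcase()
      |> String.split(~r/[^\p{L}]+/u, trim: true)

    content_words = Enum.reject(words, &MapSet.member?(stop_words, &1))
    marker_bonus = if Enum.any?(words, &MapSet.member?(markers, &1)), do: 1, else: 0

    length(content_words) + marker_bonus
  end
end

defmodule ConfigSketch.Spanish do
  # Language-specific part: data only. Both lists are small samples.
  def config do
    %{
      stop_words: MapSet.new(~w(el la los las de en y que un una)),
      discourse_markers: MapSet.new(~w(finalmente pero porque))
    }
  end
end
```

Adding another language under this arrangement means writing a new config function; the scorer itself does not change.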

Spanish Entity Recognizer Adapter (346 lines):

40+ person names (male, female, surnames)
40+ place names (Spain, Latin America)
Organization patterns (S.A., S.L., government, companies)
Titles, date/time, money patterns
Delegates detection to Semantic.EntityRecognition.RuleBased (237 lines)

Spanish Coreference Resolver Adapter (256 lines):

Complete pronoun system (subject, object, reflexive, possessive, demonstrative)
Gender/number agreement rules
Spanish-specific pronoun features
Delegates resolution to generic coreference algorithms
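Gender and number agreement can be pictured as a filter over candidate antecedents. The sketch below is a simplified illustration with a hypothetical module name (AgreementSketch), a four-entry pronoun table, and mentions represented as plain maps; it ignores cases such as masculine plural pronouns referring to mixed groups.

```elixir
defmodule AgreementSketch do
  # Tiny illustrative pronoun table; a real adapter covers many more forms.
  @pronouns %{
    "él" => %{gender: :masculine, number: :singular},
    "ella" => %{gender: :feminine, number: :singular},
    "ellos" => %{gender: :masculine, number: :plural},
    "ellas" => %{gender: :feminine, number: :plural}
  }

  # Returns {:ok, features} for a known pronoun, :error otherwise.
  def pronoun_features(pronoun), do: Map.fetch(@pronouns, String.downcase(pronoun))

  # Keeps only the candidate mentions whose gender and number match
  # the pronoun's features.
  def compatible_antecedents(pronoun, mentions) do
    case pronoun_features(pronoun) do
      {:ok, features} ->
        Enum.filter(mentions, fn mention ->
          mention.gender == features.gender and mention.number == features.number
        end)

      :error ->
        []
    end
  end
end
```

Here the pronoun table is the language-specific part, while the filtering step is the kind of logic that could live in a generic coreference module.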

Key Learnings

Adapter pattern works: 45% code reduction demonstrates effective reuse
Configuration vs. implementation: Language-specific details separate from algorithms
Fast implementation: Complete pipeline in ~1 week vs. estimated 6-8 weeks
No breaking changes: All existing tests continue to pass
Maintainability: Bug fixes in generic code benefit all languages