Architecture Refactoring Guide

This document explains the ongoing refactoring to extract language-agnostic layers from language-specific implementations.

Overview

The current architecture has all NLP operations embedded within language implementations (e.g., Nasty.Language.English.Summarizer). The goal is to create generic, behaviour-based layers that can be reused across languages.

Current Structure (Before Refactoring)

lib/
    language/
        behaviour.ex              # Language interface
        registry.ex
        english/
            summarizer.ex         # English-specific
            text_classifier.ex    # English-specific
            entity_recognizer.ex  # English-specific
            coreference_resolver.ex
            ... (17 modules)

Target Structure (After Refactoring)

lib/
    language/
        behaviour.ex              # Core language interface
        registry.ex
        english/
            english.ex            # Main module
            tokenizer.ex
            pos_tagger.ex
            phrase_parser.ex
            adapters/             # Adapters to generic layers
                summarizer_adapter.ex
                classifier_adapter.ex
                ner_adapter.ex
    operations/                   # Generic NLP operations
        summarization.ex          # Behaviour
        classification.ex         # Behaviour
        question_answering.ex     # Behaviour
    semantic/                     # Generic semantic analysis
        entity_recognition.ex     # Behaviour
        coreference_resolution.ex # Behaviour
        semantic_role_labeling.ex # Behaviour

New Behaviour Layers

1. Operations Layer (lib/operations/)

Language-agnostic NLP operations that produce end results such as summaries, classifications, and answers:

Nasty.Operations.Summarization

@callback summarize(Document.t(), options()) :: 
  {:ok, [Sentence.t()] | String.t()} | {:error, term()}
@callback methods() :: [method()]

Purpose: Extract or generate summaries from documents

Implementation: Nasty.Language.English.SummarizerAdapter
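
As a rough illustration of how a behaviour with these two callbacks can be declared and implemented, here is a minimal, self-contained sketch. The `Sketch.*` module names, the map-based document shape, and the `:max_sentences` option are assumptions made for this example; the real behaviour operates on `Document.t()` and `Sentence.t()` structs.

```elixir
defmodule Sketch.Summarization do
  # Hypothetical stand-in for Nasty.Operations.Summarization.
  # A "document" here is just a map holding a list of sentence strings.
  @type document :: %{sentences: [String.t()]}
  @type options :: keyword()
  @type method :: atom()

  @callback summarize(document(), options()) ::
              {:ok, [String.t()] | String.t()} | {:error, term()}
  @callback methods() :: [method()]
end

defmodule Sketch.LeadSummarizer do
  # Toy implementation: keeps the first N sentences (a "lead" baseline).
  @behaviour Sketch.Summarization

  @impl true
  def summarize(%{sentences: sentences}, opts) when is_list(sentences) do
    limit = Keyword.get(opts, :max_sentences, 3)
    {:ok, Enum.take(sentences, limit)}
  end

  def summarize(_other, _opts), do: {:error, :invalid_document}

  @impl true
  def methods, do: [:lead]
end
```

Because callers depend only on the callback contract, any module that returns `{:ok, summary}` or `{:error, reason}` from `summarize/2` can be swapped in.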

Nasty.Operations.Classification

@callback train(training_data(), options()) :: {:ok, model()} | {:error, term()}
@callback classify(model(), input(), options()) :: {:ok, Classification.t()} | {:error, term()}

Purpose: Train and use text classifiers

Implementation: Nasty.Language.English.ClassifierAdapter
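
The train/classify split can be sketched with a toy keyword-count classifier. This is not the project's classifier: the module name, the `{text, label}` training tuples, and the map returned in place of `Classification.t()` are all assumptions for illustration.

```elixir
defmodule Sketch.KeywordClassifier do
  # Toy classifier that follows the train/2 and classify/3 callback shapes.
  # The model is a map of label => word frequency map.

  def train([], _opts), do: {:error, :no_training_data}

  def train(examples, _opts) do
    model =
      Enum.reduce(examples, %{}, fn {text, label}, acc ->
        counts = Enum.frequencies(tokenize(text))

        Map.update(acc, label, counts, fn existing ->
          Map.merge(existing, counts, fn _word, a, b -> a + b end)
        end)
      end)

    {:ok, model}
  end

  def classify(model, text, _opts) when map_size(model) > 0 do
    words = tokenize(text)

    # Score each label by how often the input words appeared in its training texts.
    {label, score} =
      model
      |> Enum.map(fn {label, counts} ->
        {label, Enum.sum(Enum.map(words, &Map.get(counts, &1, 0)))}
      end)
      |> Enum.max_by(fn {_label, score} -> score end)

    {:ok, %{label: label, score: score}}
  end

  def classify(_model, _text, _opts), do: {:error, :empty_model}

  defp tokenize(text) do
    text |> String.downcase() |> String.split(~r/\W+/u, trim: true)
  end
end
```

The model returned by `train/2` is opaque to callers, which is what lets each language or algorithm choose its own representation.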

2. Semantic Layer (lib/semantic/)

Language-agnostic semantic analysis:

Nasty.Semantic.EntityRecognition

@callback recognize_document(Document.t(), options()) :: {:ok, [Entity.t()]} | {:error, term()}
@callback recognize(tokens(), options()) :: {:ok, [Entity.t()]} | {:error, term()}

Purpose: Named entity recognition across languages

Implementation: Nasty.Language.English.NERAdapter

Nasty.Semantic.CoreferenceResolution

@callback resolve(Document.t(), options()) :: {:ok, Document.t()} | {:error, term()}

Purpose: Resolve coreferences in text

Implementation: Nasty.Language.English.CoreferenceAdapter

Migration Strategy

Phase 1: Create Behaviour Definitions (COMPLETED)

Status: Complete

  • Created lib/operations/ with base behaviours
  • Created lib/semantic/ with base behaviours
  • Defined clear interfaces for each operation

Phase 2: Create Adapter Pattern (IN PROGRESS)

Goal: Adapt existing English implementations to new behaviours without breaking changes

Approach:

  1. Keep existing modules functioning as-is
  2. Create adapter modules that implement new behaviours
  3. Adapters delegate to existing implementations
  4. Update top-level APIs to use adapters when available

Example Adapter:

defmodule Nasty.Language.English.SummarizerAdapter do
  @behaviour Nasty.Operations.Summarization
  
  alias Nasty.Language.English.Summarizer
  
  @impl true
  def summarize(document, opts) do
    # Delegate to existing implementation
    sentences = Summarizer.summarize(document, opts)
    {:ok, sentences}
  end
  
  @impl true
  def methods, do: [:extractive, :mmr]
end

Phase 3: Refactor Implementations (COMPLETED for 2 modules)

Status: Complete for Summarization and Entity Recognition

Goal: Move language-agnostic logic out of language modules

Completed Work:

  1. ✅ Created Nasty.Operations.Summarization.Extractive - Generic extractive summarization
  2. ✅ Created Nasty.Semantic.EntityRecognition.RuleBased - Generic rule-based NER
  3. ✅ Refactored English.Summarizer to delegate to generic module (69% code reduction)
  4. ✅ Refactored English.EntityRecognizer to delegate to generic module (23% code reduction)
  5. ✅ All language-specific logic (lexicons, stop words, patterns) remains in English modules
  6. ✅ All 360 tests passing with no breaking changes

Phase 4: Extract Generic Algorithms (COMPLETED for 2 modules)

Status: Complete for Summarization and Entity Recognition

Extracted Algorithms:

  • Nasty.Operations.Summarization.Extractive (440 lines)

    • Position scoring, length scoring, TF-IDF keyword scoring
    • Entity scoring, discourse marker scoring, coreference scoring
    • Greedy and MMR selection algorithms
    • Jaccard similarity for redundancy reduction
  • Nasty.Semantic.EntityRecognition.RuleBased (237 lines)

    • Sequence detection (finds capitalized token sequences)
    • Configurable classification framework
    • Lexicon matching, pattern matching, heuristic classification
    • Generic entity creation with proper span calculation
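
To make the Jaccard similarity and MMR selection items above concrete, here is a minimal sketch of both techniques. It is not the code from `Nasty.Operations.Summarization.Extractive`; the module name, the `{tokens, relevance}` input shape, and the default `lambda` of 0.7 are assumptions.

```elixir
defmodule Sketch.MMR do
  # Jaccard similarity between two token lists: |A intersect B| / |A union B|.
  def jaccard(tokens_a, tokens_b) do
    a = MapSet.new(tokens_a)
    b = MapSet.new(tokens_b)
    union_size = MapSet.size(MapSet.union(a, b))

    if union_size == 0 do
      0.0
    else
      MapSet.size(MapSet.intersection(a, b)) / union_size
    end
  end

  # Greedy MMR: repeatedly pick the candidate that maximises
  #   lambda * relevance - (1 - lambda) * (max similarity to anything already picked).
  # `scored` is a list of {tokens, relevance} tuples; at most `k` are returned.
  def select(scored, k, lambda \\ 0.7), do: do_select(scored, k, lambda, [])

  defp do_select([], _k, _lambda, selected), do: Enum.reverse(selected)
  defp do_select(_candidates, 0, _lambda, selected), do: Enum.reverse(selected)

  defp do_select(candidates, k, lambda, selected) do
    best = Enum.max_by(candidates, &mmr_score(&1, selected, lambda))
    do_select(List.delete(candidates, best), k - 1, lambda, [best | selected])
  end

  defp mmr_score({tokens, relevance}, selected, lambda) do
    redundancy =
      Enum.reduce(selected, 0.0, fn {sel_tokens, _rel}, acc ->
        max(acc, jaccard(tokens, sel_tokens))
      end)

    lambda * relevance - (1 - lambda) * redundancy
  end
end
```

With `lambda` at 0.5, a near-duplicate of an already selected sentence loses to a less relevant but novel one, which is the redundancy reduction the list refers to. Nothing in the sketch depends on the language of the tokens.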

Remaining modules for future phases:

  • [ ] Coreference Resolution
  • [ ] Semantic Role Labeling
  • [ ] Question Answering
  • [ ] Text Classification

Benefits of Refactoring

1. Code Reuse

  • Generic algorithms work across all languages
  • Less duplication when adding new languages
  • Easier to maintain and test

2. Clear Separation

  • Language-specific logic clearly separated
  • Generic operations have well-defined interfaces
  • Easier to understand system architecture

3. Easier Language Addition

# Before: Implement 17 modules for new language
defmodule Nasty.Language.Spanish.Summarizer do
  # 200 lines of code
end

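# After: Implement adapter + language-specific tweaks
defmodule Nasty.Language.Spanish.SummarizerAdapter do
  @behaviour Nasty.Operations.Summarization
  
  # Provide language-specific configuration (241 lines)
  # Generic algorithm (440 lines) is reused automatically
  
  # Language-specific data lives in module attributes
  # (illustrative, abbreviated list; the attribute must be defined before use)
  @spanish_stop_words ~w(el la los las de que y en)
  
  # Only override language-specific parts
  def stop_words, do: @spanish_stop_words  # 10 lines
end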

4. Testing

  • Test generic algorithms once
  • Test language-specific adaptations separately
  • Mock behaviours easily in tests
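
One way the last point can look in practice is a hand-written stub that implements the behaviour and is injected into the code under test. The module names and the `:summarizer` option below are hypothetical; the project may wire implementations differently.

```elixir
defmodule Sketch.SummarizerBehaviour do
  # Hypothetical, simplified stand-in for a summarization behaviour.
  @callback summarize(map(), keyword()) :: {:ok, [String.t()]} | {:error, term()}
end

defmodule Sketch.StubSummarizer do
  # Test double: always returns a canned result, no NLP involved.
  @behaviour Sketch.SummarizerBehaviour

  @impl true
  def summarize(_doc, _opts), do: {:ok, ["stubbed summary"]}
end

defmodule Sketch.Pipeline do
  # The implementation module is passed in as an option,
  # so a test can substitute the stub for the real summarizer.
  def headline(doc, opts) do
    impl = Keyword.fetch!(opts, :summarizer)

    with {:ok, [first | _rest]} <- impl.summarize(doc, opts) do
      {:ok, String.upcase(first)}
    end
  end
end
```

A test can then call `Sketch.Pipeline.headline(%{}, summarizer: Sketch.StubSummarizer)` and assert on the result without loading any language resources.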

Backward Compatibility

Maintaining Existing APIs

All existing code continues to work:

# Still works
Nasty.Language.English.Summarizer.summarize(doc, [])

# Also works with new adapter
Nasty.Operations.Summarization.summarize(doc, language: :en)

Deprecation Strategy

  1. Keep old modules functional
  2. Add deprecation warnings after adapters are complete
  3. Remove old modules in next major version
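
Step 2 could be done with Elixir's built-in `@deprecated` attribute, which makes the compiler warn at call sites while the function keeps working. The sketch below uses hypothetical module names and a simplified list-of-sentences input; it only illustrates the mechanism.

```elixir
defmodule Sketch.NewSummarizer do
  # Stand-in for the adapter-based API that returns tagged tuples.
  def summarize(sentences, opts) do
    {:ok, Enum.take(sentences, Keyword.get(opts, :max, 1))}
  end
end

defmodule Sketch.OldSummarizer do
  # Old entry point kept for backward compatibility.
  # Callers still get the old return shape (a bare list),
  # but compiling code that calls this function emits a deprecation warning.
  @deprecated "Use Sketch.NewSummarizer.summarize/2 instead"
  def summarize(sentences, opts) do
    {:ok, result} = Sketch.NewSummarizer.summarize(sentences, opts)
    result
  end
end
```

Keeping the old function as a thin wrapper over the new one means there is a single implementation to maintain until the old module is removed.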

Implementation Checklist

Operations Layer

  • [x] Create lib/operations/summarization.ex behaviour
  • [x] Create lib/operations/classification.ex behaviour
  • [x] Create English adapters for operations
  • [x] Extract generic algorithms
  • [ ] Create lib/operations/question_answering.ex behaviour
  • [ ] Extract remaining generic algorithms

Semantic Layer

  • [x] Create lib/semantic/entity_recognition.ex behaviour
  • [x] Create lib/semantic/coreference_resolution.ex behaviour
  • [x] Create English adapters for semantic operations
  • [x] Extract generic algorithms
  • [ ] Create lib/semantic/semantic_role_labeling.ex behaviour
  • [ ] Extract remaining generic algorithms

Documentation

  • [x] Create REFACTORING.md guide
  • [x] Update REFACTORING.md with Phase 3-4 completion
  • [x] Document adapter pattern with Spanish implementation example
  • [ ] Update ARCHITECTURE.md with new layers
  • [ ] Add migration examples

Language Implementations

  • [x] English adapters (3 total)
    • [x] SummarizerAdapter
    • [x] EntityRecognizerAdapter
    • [x] CoreferenceResolverAdapter
  • [x] Spanish adapters (3 total, 843 lines)
    • [x] SummarizerAdapter (241 lines)
    • [x] EntityRecognizerAdapter (346 lines)
    • [x] CoreferenceResolverAdapter (256 lines)
  • [x] Spanish implementation validates adapter pattern (45% code reduction)
  • [ ] Catalan adapters (future)

Example: Adapting Summarizer

Step 1: Current Implementation

defmodule Nasty.Language.English.Summarizer do
  def summarize(%Document{} = doc, opts) do
    # 200 lines of extractive summarization logic
  end
end

Step 2: Create Adapter

defmodule Nasty.Language.English.SummarizerAdapter do
  @behaviour Nasty.Operations.Summarization
  
  alias Nasty.Language.English.Summarizer
  
  @impl true
  def summarize(document, opts) do
    result = Summarizer.summarize(document, opts)
    {:ok, result}
  end
  
  @impl true
  def methods, do: [:extractive, :mmr]
end

Step 3: Update Top-Level API

defmodule Nasty do
  def summarize(ast, opts) do
    # Use adapter if available, otherwise fall back to the old API.
    # get_summarizer_adapter/1 and fallback_to_old_api/2 are helpers not shown here.
    case get_summarizer_adapter(opts[:language]) do
      {:ok, adapter} -> adapter.summarize(ast, opts)
      {:error, _} -> fallback_to_old_api(ast, opts)
    end
  end
end
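
The snippet above leaves the adapter lookup undefined. A minimal sketch of how such a lookup with a fallback might work is shown below; the static map, the `:en` default, and the `Sketch.*` modules are assumptions, and the real project may resolve adapters through its language registry instead.

```elixir
defmodule Sketch.AdapterLookup do
  # Hypothetical static table: language code => adapter module.
  @adapters %{en: Sketch.EnglishAdapter}

  def get_summarizer_adapter(language) do
    case Map.fetch(@adapters, language) do
      {:ok, adapter} -> {:ok, adapter}
      :error -> {:error, {:no_adapter, language}}
    end
  end

  def summarize(doc, opts) do
    case get_summarizer_adapter(Keyword.get(opts, :language, :en)) do
      {:ok, adapter} -> adapter.summarize(doc, opts)
      {:error, _reason} -> legacy_summarize(doc, opts)
    end
  end

  # Stand-in for the pre-refactoring code path.
  defp legacy_summarize(doc, _opts), do: {:ok, {:legacy, doc}}
end

defmodule Sketch.EnglishAdapter do
  # Stand-in adapter; tags the result so the chosen path is visible.
  def summarize(doc, _opts), do: {:ok, {:adapter, doc}}
end
```

Returning a tagged error from the lookup keeps the fallback decision in one place, so languages without an adapter continue to work during the migration.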

Step 4: Extract Generic Algorithm (Completed for Summarization)

defmodule Nasty.Operations.Summarization.Extractive do
  def summarize(sentences, scoring_fn, opts) do
    # Generic extractive summarization
    # Works for any language with custom scoring_fn
  end
end

defmodule Nasty.Language.English.SummarizerAdapter do
  use Nasty.Operations.Summarization.Extractive
  
  def score_sentence(sentence, context) do
    # English-specific scoring using stop words, etc.
  end
end
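
For `use` to work as in the snippet above, the generic module has to define a `__using__/1` macro. The following is one possible shape, not the project's actual implementation: the generic module injects a default `score_sentence/2`, marks it overridable, and drives selection through whichever module called `use`. All names and the `:max_sentences` option are assumptions.

```elixir
defmodule Sketch.Extractive do
  # Generic part: rank sentences by score, keep the top N, restore document order.
  @callback score_sentence(String.t(), map()) :: number()

  defmacro __using__(_opts) do
    quote do
      @behaviour Sketch.Extractive

      def summarize(sentences, opts \\ []) do
        Sketch.Extractive.run(__MODULE__, sentences, opts)
      end

      # Default scoring: longer sentences score higher. Languages may override.
      def score_sentence(sentence, _context), do: String.length(sentence)

      defoverridable score_sentence: 2
    end
  end

  def run(module, sentences, opts) do
    limit = Keyword.get(opts, :max_sentences, 1)
    context = %{total: length(sentences)}

    sentences
    |> Enum.with_index()
    |> Enum.sort_by(fn {s, _i} -> module.score_sentence(s, context) end, :desc)
    |> Enum.take(limit)
    |> Enum.sort_by(fn {_s, i} -> i end)
    |> Enum.map(fn {s, _i} -> s end)
  end
end

defmodule Sketch.EnglishExtractive do
  use Sketch.Extractive

  @stop_words MapSet.new(~w(the a an of and to in is))

  # English-specific override: count content (non-stop) words.
  def score_sentence(sentence, _context) do
    sentence
    |> String.downcase()
    |> String.split()
    |> Enum.count(&(not MapSet.member?(@stop_words, &1)))
  end
end
```

The language module contributes only data and a scoring function; ranking and selection live in one place and are shared.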

Contributing

When adding new NLP features:

  1. Define behaviour first in lib/operations/ or lib/semantic/
  2. Implement for English as an adapter
  3. Extract generic algorithms where possible
  4. Document the behaviour and implementation strategy

Success Story: Spanish Implementation

The Spanish language implementation (2026-01-08) validates the refactoring strategy:

Metrics

  • 3 adapters: 843 total lines providing Spanish-specific configuration
  • Generic algorithms reused: 677+ lines (Summarization, NER, Coreference)
  • Code reduction: 45% through delegation to generic implementations
  • Time to implement: ~1 week for complete pipeline
  • Test coverage: 641 tests passing (9 Spanish-specific)

Adapter Implementation

Spanish Summarizer Adapter (241 lines):

  • 5 categories of discourse markers (conclusion, emphasis, causal, contrast, addition)
  • 100+ Spanish stop words
  • Punctuation patterns
  • Delegates all scoring and selection to Operations.Summarization.Extractive (440 lines)

Spanish Entity Recognizer Adapter (346 lines):

  • 40+ person names (male, female, surnames)
  • 40+ place names (Spain, Latin America)
  • Organization patterns (S.A., S.L., government, companies)
  • Titles, date/time, money patterns
  • Delegates detection to Semantic.EntityRecognition.RuleBased (237 lines)
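
The split between generic detection and language-specific lexicons can be sketched as follows. This is a simplified illustration, not the code in `Semantic.EntityRecognition.RuleBased`: the module names, the config shape, and the tiny lexicons are assumptions, and sentence-initial capitalisation is deliberately not handled.

```elixir
defmodule Sketch.RuleBasedNER do
  # Generic part: find runs of capitalised tokens, then classify each run
  # using the lexicons supplied by the language-specific config.
  def recognize(tokens, config) do
    entities =
      tokens
      |> Enum.chunk_by(&capitalized?/1)
      |> Enum.filter(fn [first | _rest] -> capitalized?(first) end)
      |> Enum.map(fn run ->
        %{text: Enum.join(run, " "), type: classify(run, config)}
      end)

    {:ok, entities}
  end

  defp capitalized?(token), do: String.match?(token, ~r/^\p{Lu}/u)

  # The first lexicon containing any token of the run decides the type.
  defp classify(run, %{lexicons: lexicons}) do
    Enum.find_value(lexicons, :unknown, fn {type, words} ->
      if Enum.any?(run, &MapSet.member?(words, &1)), do: type
    end)
  end
end

defmodule Sketch.SpanishNERConfig do
  # Language-specific part: data only, no algorithm.
  def config do
    %{
      lexicons: [
        person: MapSet.new(~w(María José García)),
        place: MapSet.new(~w(Madrid Sevilla México))
      ]
    }
  end
end
```

For example, `Sketch.RuleBasedNER.recognize(~w(María García vive en Madrid), Sketch.SpanishNERConfig.config())` yields a person entity for "María García" and a place entity for "Madrid". Supporting another language in this sketch means supplying a different config map.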

Spanish Coreference Resolver Adapter (256 lines):

  • Complete pronoun system (subject, object, reflexive, possessive, demonstrative)
  • Gender/number agreement rules
  • Spanish-specific pronoun features
  • Delegates resolution to generic coreference algorithms
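
Gender/number agreement, as listed above, can be illustrated with a small sketch that picks the most recent preceding mention whose features match the pronoun. The pronoun table, the mention maps, and the "most recent match wins" heuristic are assumptions for this example and are much simpler than a full resolver.

```elixir
defmodule Sketch.Agreement do
  # Hypothetical pronoun table: surface form => grammatical features.
  @pronouns %{
    "él" => %{gender: :masc, number: :sing},
    "ella" => %{gender: :fem, number: :sing},
    "ellos" => %{gender: :masc, number: :plur},
    "ellas" => %{gender: :fem, number: :plur}
  }

  # Returns the closest preceding mention that agrees with the pronoun.
  # `mentions` must be in document order.
  def antecedent(pronoun, mentions) do
    case Map.fetch(@pronouns, String.downcase(pronoun)) do
      {:ok, features} ->
        mentions
        |> Enum.reverse()
        |> Enum.find(&agrees?(&1, features))
        |> case do
          nil -> {:error, :no_antecedent}
          mention -> {:ok, mention}
        end

      :error ->
        {:error, :unknown_pronoun}
    end
  end

  defp agrees?(mention, features) do
    mention.gender == features.gender and mention.number == features.number
  end
end
```

Only the pronoun table is Spanish-specific here; the search over candidate mentions would be the same for any language with comparable features.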

Key Learnings

  1. Adapter pattern works: 45% code reduction demonstrates effective reuse
  2. Configuration vs. implementation: Language-specific details separate from algorithms
  3. Fast implementation: Complete pipeline in ~1 week vs. estimated 6-8 weeks
  4. No breaking changes: All existing tests continue to pass
  5. Maintainability: Bug fixes in generic code benefit all languages

See Also