Grammar Customization Guide

This document explains how to customize and extend Nasty's grammar rules by creating external grammar resource files.

Overview

Starting with version 0.2.0, Nasty externalizes grammar rules from hardcoded Elixir modules into configurable .exs resource files. This allows you to:

  • Customize existing grammar rules without modifying source code
  • Create domain-specific grammar variants (e.g., legal, medical, technical)
  • Add support for new languages
  • A/B test different parsing strategies
  • Share grammar rule sets across projects

Architecture

Grammar rules are stored as Elixir term files (.exs) in:

priv/languages/{language_code}/grammars/{rule_type}.exs

For variants (e.g., formal, informal, technical):

priv/languages/{language_code}/variants/{variant_name}/{rule_type}.exs

Language Codes

  • English: en or english
  • Spanish: es or spanish
  • Catalan: ca or catalan (future)

Rule Types

Each language can have the following grammar rule files:

  1. phrase_rules.exs - Phrase structure patterns (NP, VP, PP, AdjP, AdvP)
  2. dependency_rules.exs - Universal Dependencies relations and extraction rules
  3. coordination_rules.exs - Coordinating conjunctions and coordination patterns
  4. subordination_rules.exs - Subordinating conjunctions and subordinate clause patterns
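
The directory layout above can be sketched as a small path builder. This is only an illustration of the convention, not Nasty's own resolution code: the module name `GrammarPathSketch` and both of its functions are hypothetical, and the alias table simply mirrors the language codes listed above.

```elixir
defmodule GrammarPathSketch do
  # Hypothetical helper, not part of Nasty: it mirrors the documented
  # directory layout using only Path.join/1.

  # Long-form language names map onto their two-letter codes.
  @language_aliases %{english: :en, spanish: :es, catalan: :ca}

  def normalize_language(lang), do: Map.get(@language_aliases, lang, lang)

  # Returns the expected file path for a rule type, optionally inside a variant.
  def path(lang, rule_type, variant \\ nil) do
    base = ["priv", "languages", Atom.to_string(normalize_language(lang))]

    dirs =
      case variant do
        nil -> ["grammars"]
        name -> ["variants", name]
      end

    Path.join(base ++ dirs ++ ["#{rule_type}.exs"])
  end
end

IO.puts(GrammarPathSketch.path(:english, :phrase_rules))
IO.puts(GrammarPathSketch.path(:es, :dependency_rules, "formal"))
```

Printing the two paths is a quick way to check where a new file should be placed before creating it.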

Grammar Loader API

Loading Grammar Rules

alias Nasty.Language.GrammarLoader

# Load default grammar rules
{:ok, rules} = GrammarLoader.load(:en, :phrase_rules)

# Load with variant
{:ok, rules} = GrammarLoader.load(:en, :phrase_rules, variant: "formal")

# Force reload (bypass cache)
{:ok, rules} = GrammarLoader.load(:en, :phrase_rules, force_reload: true)

Cache Management

# Clear all cached grammar
GrammarLoader.clear_cache()

# Clear specific cached rules
GrammarLoader.clear_cache(:en, :phrase_rules, :default)

Direct File Loading

# Load from custom path
{:ok, rules} = GrammarLoader.load_file("/path/to/custom_rules.exs")

Creating Grammar Files

File Structure

Grammar files are Elixir term files that evaluate to a map:

%{
  # Top-level keys define rule categories
  rule_category_1: [...],
  rule_category_2: %{...},
  
  # Metadata
  notes: %{
    key: "description"
  }
}
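
To make "evaluates to a map" concrete, here is a minimal sketch that evaluates an .exs file with the standard library's `Code.eval_file/1` and checks the result. It is not how `GrammarLoader` is implemented (the source does not show that); `TermFileSketch` and its `read/1` function are hypothetical names, and the file is written to a temporary directory so the example runs on its own.

```elixir
defmodule TermFileSketch do
  # Evaluates an .exs file and returns its last expression if it is a map.
  # Code.eval_file/1 returns {value, bindings}.
  def read(path) do
    {value, _bindings} = Code.eval_file(path)

    if is_map(value) do
      {:ok, value}
    else
      {:error, {:not_a_map, value}}
    end
  end
end

# Write a throwaway grammar file so the example is self-contained.
path = Path.join(System.tmp_dir!(), "sketch_rules_#{System.unique_integer([:positive])}.exs")
File.write!(path, ~s|%{rule_category_1: [:a, :b], notes: %{key: "description"}}|)

{:ok, rules} = TermFileSketch.read(path)
IO.inspect(Map.keys(rules))
File.rm!(path)
```

Because an .exs file is ordinary Elixir code, evaluating it runs whatever it contains, so only load grammar files from sources you trust.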

Example: Simple Phrase Rules

Create priv/languages/en/grammars/custom_phrase_rules.exs:

%{
  # Noun phrase patterns
  noun_phrases: [
    # Simple NP: Det + Noun
    {:np, [:det, :noun]},
    
    # NP with adjective: Det + Adj + Noun
    {:np, [:det, :adj, :noun]},
    
    # NP with PP: Det + Noun + PP
    {:np, [:det, :noun, :pp]}
  ],
  
  # Verb phrase patterns
  verb_phrases: [
    # Simple VP: just Verb
    {:vp, [:verb]},
    
    # VP with object: Verb + NP
    {:vp, [:verb, :np]},
    
    # VP with auxiliary: Aux + Verb
    {:vp, [:aux, :verb]}
  ],
  
  notes: %{
    version: "1.0.0",
    author: "Your Name",
    description: "Custom phrase rules for domain-specific parsing"
  }
}
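
One way to read a `{label, tags}` pattern is as "this tag sequence forms this phrase". The sketch below applies that reading with an exact-match lookup. It is an assumption about how such patterns could be consumed, not a description of Nasty's parser, and `PhraseMatchSketch.label/2` is a hypothetical function.

```elixir
defmodule PhraseMatchSketch do
  # Returns the label of the first rule whose tag list equals the input
  # tag sequence exactly, or :no_match. Purely illustrative.
  def label(rules, tags) do
    Enum.find_value(rules, :no_match, fn {label, pattern} ->
      if pattern == tags, do: label
    end)
  end
end

noun_phrases = [
  {:np, [:det, :noun]},
  {:np, [:det, :adj, :noun]}
]

IO.inspect(PhraseMatchSketch.label(noun_phrases, [:det, :adj, :noun]))
IO.inspect(PhraseMatchSketch.label(noun_phrases, [:verb]))
```

A real parser would also handle nested patterns such as `:pp` inside an NP; the exact match here only shows the shape of the data.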

English Grammar Reference

Phrase Rules (phrase_rules.exs)

See priv/languages/en/grammars/phrase_rules.exs for the complete reference.

Key sections:

%{
  noun_phrases: [
    # List of NP patterns
    {:np, [:det, :noun]},
    {:np, [:det, :adj, :noun]},
    # ...
  ],
  
  verb_phrases: [
    # List of VP patterns
    {:vp, [:verb]},
    {:vp, [:aux, :verb, :np]},
    # ...
  ],
  
  prepositional_phrases: [
    # PP patterns
    {:pp, [:prep, :np]},
    # ...
  ],
  
  adjectival_phrases: [
    # AdjP patterns
    {:adjp, [:adv, :adj]},
    # ...
  ],
  
  adverbial_phrases: [
    # AdvP patterns
    {:advp, [:adv]},
    # ...
  ],
  
  relative_clauses: [
    # Relative clause patterns
    {:relative_clause, [:relative_marker, :clause]},
    # ...
  ],
  
  special_rules: [
    # Special handling rules
    {:comparative_than, :pseudo_prep},
    # ...
  ]
}

Dependency Rules (dependency_rules.exs)

See priv/languages/en/grammars/dependency_rules.exs for the complete reference.

Key sections:

%{
  core_arguments: [
    # Subject, object, complements
    %{
      relation: :nsubj,
      description: "Nominal subject",
      head_pos: [:verb],
      dependent_pos: [:noun, :propn, :pron],
      example: "The cat sleeps → nsubj(sleeps, cat)"
    },
    # ...
  ],
  
  nominal_dependents: [
    # Determiners, modifiers
    %{relation: :det, ...},
    %{relation: :amod, ...},
    # ...
  ],
  
  function_words: [
    # Auxiliaries, copulas, markers
    %{relation: :aux, ...},
    # ...
  ],
  
  extraction_priorities: [
    # Order of dependency extraction
    :nsubj, :obj, :det, :amod, # ...
  ]
}
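
The fields of a dependency rule lend themselves to two simple checks: whether a head/dependent pair is allowed for a relation, and in which order relations should be tried. The sketch below shows both under the assumption that `head_pos` and `dependent_pos` are whitelists and that `extraction_priorities` is an ordered list; `DependencySketch` and its functions are hypothetical and do not come from Nasty.

```elixir
defmodule DependencySketch do
  # True when the rule allows this head/dependent part-of-speech pair.
  def licensed?(rule, head_pos, dependent_pos) do
    head_pos in rule.head_pos and dependent_pos in rule.dependent_pos
  end

  # Orders relations by their position in the priority list; relations
  # missing from the list sort last.
  def by_priority(relations, priorities) do
    Enum.sort_by(relations, fn rel ->
      Enum.find_index(priorities, &(&1 == rel)) || length(priorities)
    end)
  end
end

nsubj = %{relation: :nsubj, head_pos: [:verb], dependent_pos: [:noun, :propn, :pron]}

IO.inspect(DependencySketch.licensed?(nsubj, :verb, :noun))
IO.inspect(DependencySketch.by_priority([:amod, :nsubj, :det], [:nsubj, :obj, :det, :amod]))
```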

Coordination Rules (coordination_rules.exs)

Key sections:

%{
  coordinating_conjunctions: [
    %{
      conjunction: "and",
      type: :copulative,
      example: "cats and dogs"
    },
    # ...
  ],
  
  coordination_patterns: [
    %{
      pattern: :np_coordination,
      structure: "NP CCONJ NP",
      example: "cats and dogs"
    },
    # ...
  ],
  
  special_cases: [
    # Correlative conjunctions, etc.
    %{
      type: :correlative,
      patterns: [
        %{pair: ["both", "and"], example: "both cats and dogs"},
        # ...
      ]
    }
  ]
}
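
As a rough illustration of how these entries might be used, the sketch below turns `coordinating_conjunctions` into a lookup table and checks a token list for a correlative pair. Both functions and the module name `CoordinationSketch` are hypothetical; the source does not say how Nasty consumes these rules.

```elixir
defmodule CoordinationSketch do
  # Builds a conjunction => type lookup from coordinating_conjunctions entries.
  def type_index(conjunctions) do
    Map.new(conjunctions, fn %{conjunction: word, type: type} -> {word, type} end)
  end

  # True when both halves of a correlative pair occur, in order, in the tokens.
  def correlative?(tokens, [first, second]) do
    case Enum.find_index(tokens, &(&1 == first)) do
      nil -> false
      i -> second in Enum.drop(tokens, i + 1)
    end
  end
end

index = CoordinationSketch.type_index([%{conjunction: "and", type: :copulative, example: "cats and dogs"}])
IO.inspect(index)
IO.inspect(CoordinationSketch.correlative?(["both", "cats", "and", "dogs"], ["both", "and"]))
```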

Subordination Rules (subordination_rules.exs)

Key sections:

%{
  subordinating_conjunctions: [
    %{
      conjunction: "because",
      type: :causal,
      example: "I stayed because it rained"
    },
    # ...
  ],
  
  relative_markers: [
    %{
      marker: "who",
      type: :relative_pronoun,
      example: "the person who came"
    },
    # ...
  ],
  
  subordinate_clause_types: [
    %{
      type: :adverbial,
      dependency_relation: :advcl,
      subtypes: [:temporal, :causal, :conditional, ...]
    },
    # ...
  ]
}

Spanish Grammar Reference

Spanish grammar files follow the same structure but include Spanish-specific features:

  • Post-nominal adjectives: la casa roja (the red house)
  • Pro-drop: null subjects allowed
  • Flexible word order: SVO, VSO, VOS
  • Clitic pronouns: dámelo (give-me-it)
  • Personal 'a': Veo a Juan (I see Juan)
  • Two copulas: ser vs. estar
  • Phonetic variants: y → e before an /i/ sound, o → u before an /o/ sound

See the files in priv/languages/es/grammars/ for the complete Spanish grammar.

Creating Domain-Specific Variants

Example: Technical English

Create priv/languages/en/variants/technical/phrase_rules.exs:

%{
  # Inherit base rules and add technical-specific patterns
  noun_phrases: [
    # Standard patterns
    {:np, [:det, :noun]},
    
    # Technical compound nouns (e.g., "TCP/IP protocol")
    {:np, [:propn, :noun]},
    {:np, [:propn, :sym, :propn, :noun]},
    
    # Noun phrases with technical modifiers
    {:np, [:num, {:unit, [:noun]}, :noun]},  # "5 GB memory"
    
    # Multi-word technical terms
    {:np, [{:many, :noun}]}  # "machine learning model"
  ],
  
  verb_phrases: [
    # Standard patterns
    {:vp, [:verb, :np]},
    
    # Technical action verbs (instantiate, serialize, etc.)
    {:vp, [:tech_verb, :np, :pp]},
    
    # Passive constructions common in technical writing
    {:vp, [:aux, :verb, :pp]}
  ],
  
  notes: %{
    domain: "technical",
    use_case: "Software documentation, API specs, technical papers"
  }
}

Example: Legal English

A legal variant follows the same layout:

%{
  noun_phrases: [
    # Legal entities
    {:np, [:det, :legal_entity]},  # "the plaintiff", "the defendant"
    
    # Complex legal terms
    {:np, [:det, :adj, :legal_term, :pp]},  # "the aforementioned contractual obligation"
    
    # References (Section X, Article Y)
    {:np, [:legal_ref_type, :num]}  # "Section 5"
  ],
  
  subordination_patterns: [
    # Legal conditionals (provided that, in the event that)
    {:conditional, :multiword_legal_conj}
  ],
  
  notes: %{
    domain: "legal",
    use_case: "Contracts, legislation, court documents"
  }
}
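
The variant files above repeat the standard patterns they need. If you would rather keep a variant file small and combine it with the base rules yourself, a merge along these lines is one option. This is a sketch under the assumption that you do the merging in your own code; the source does not state whether Nasty merges variants with base rules, and `VariantMergeSketch.merge/2` is a hypothetical helper.

```elixir
defmodule VariantMergeSketch do
  # Concatenates rule lists that appear in both maps (variant rules first,
  # duplicates dropped) and lets the variant win for every other key.
  def merge(base, variant) do
    Map.merge(base, variant, fn
      _key, base_rules, variant_rules when is_list(base_rules) and is_list(variant_rules) ->
        Enum.uniq(variant_rules ++ base_rules)

      _key, _base_value, variant_value ->
        variant_value
    end)
  end
end

base = %{noun_phrases: [{:np, [:det, :noun]}], verb_phrases: [{:vp, [:verb]}]}
technical = %{noun_phrases: [{:np, [:propn, :noun]}], notes: %{domain: "technical"}}

IO.inspect(VariantMergeSketch.merge(base, technical))
```

Putting variant rules first means a first-match lookup prefers the domain-specific pattern over the generic one.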

Using Custom Grammar in Code

Option 1: Load and Use Directly

alias Nasty.Language.GrammarLoader

# Load custom grammar (priv/languages/en/grammars/custom_phrase_rules.exs)
{:ok, custom_phrase_rules} = GrammarLoader.load(:en, :custom_phrase_rules)

# Use in your parser
custom_np_patterns = custom_phrase_rules.noun_phrases
# Process with custom patterns...

Option 2: Extend Parser Module

defmodule MyApp.CustomParser do
  alias Nasty.Language.GrammarLoader
  
  def parse_technical_text(text) do
    # Load technical variant
    {:ok, rules} = GrammarLoader.load(:en, :phrase_rules, variant: "technical")
    
    # Parse using custom rules
    # ... your parsing logic using rules ...
  end
end

Option 3: Runtime Configuration

# In config/config.exs
config :nasty,
  default_grammar_variant: "technical"

# In your code
variant = Application.get_env(:nasty, :default_grammar_variant, :default)
{:ok, rules} = GrammarLoader.load(:en, :phrase_rules, variant: variant)

Grammar Validation

The grammar loader validates that all files return a map:

# Valid
%{
  rules: [...],
  notes: %{}
}

# Invalid: raises ArgumentError (see Troubleshooting)
[1, 2, 3]  # Not a map

For more complex validation, extend GrammarLoader.validate_rules/1.
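
As an idea of what stricter validation could look like, the sketch below checks the shape of phrase-rule entries in addition to the map check. It is a standalone function, not an extension of `GrammarLoader.validate_rules/1`, whose exact contract is not shown here; `RuleValidationSketch` and the default key list are assumptions.

```elixir
defmodule RuleValidationSketch do
  # Checks that the term is a map and that every entry under the given
  # phrase keys is a {label, tags} tuple with an atom label and a list of tags.
  def validate(rules, phrase_keys \\ [:noun_phrases, :verb_phrases])

  def validate(rules, phrase_keys) when is_map(rules) do
    bad =
      for key <- phrase_keys,
          entry <- Map.get(rules, key, []),
          not valid_entry?(entry),
          do: {key, entry}

    if bad == [], do: {:ok, rules}, else: {:error, {:invalid_entries, bad}}
  end

  def validate(other, _phrase_keys), do: {:error, {:not_a_map, other}}

  defp valid_entry?({label, tags}) when is_atom(label) and is_list(tags), do: true
  defp valid_entry?(_), do: false
end

IO.inspect(RuleValidationSketch.validate(%{noun_phrases: [{:np, [:det, :noun]}]}))
IO.inspect(RuleValidationSketch.validate(%{noun_phrases: [:np]}))
```

Returning every malformed entry at once, rather than failing on the first, makes it easier to fix a large grammar file in a single pass.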

Best Practices

1. Start with Base Grammar

Copy an existing grammar file and modify it rather than starting from scratch:

mkdir -p priv/languages/en/variants/custom
cp priv/languages/en/grammars/phrase_rules.exs \
   priv/languages/en/variants/custom/phrase_rules.exs

2. Document Your Rules

Include comprehensive notes in your grammar files:

%{
  rules: [...],
  
  notes: %{
    version: "1.0.0",
    author: "Team Name",
    created: "2026-01-08",
    description: "Custom grammar for medical text parsing",
    changes: [
      "Added medical entity patterns",
      "Extended VP patterns for medical procedures"
    ],
    examples: [
      "The patient underwent cardiac catheterization",
      "Diagnosis: Type 2 diabetes mellitus"
    ]
  }
}

3. Test Your Grammar

Create tests for custom grammar:

defmodule MyApp.CustomGrammarTest do
  use ExUnit.Case
  alias Nasty.Language.GrammarLoader
  
  test "custom grammar loads successfully" do
    assert {:ok, rules} = GrammarLoader.load(:en, :custom_rules)
    assert is_map(rules)
    assert Map.has_key?(rules, :noun_phrases)
  end
  
  test "custom grammar includes domain patterns" do
    {:ok, rules} = GrammarLoader.load(:en, :custom_rules, variant: "medical")
    # :medical_entity is a hypothetical tag; use one your grammar defines
    assert Enum.any?(rules.noun_phrases, fn {_label, tags} ->
      :medical_entity in tags
    end)
  end
end

4. Version Your Grammar

Track grammar versions for reproducibility:

%{
  metadata: %{
    version: "2.1.0",
    compatible_with: "nasty >= 0.2.0"
  },
  # ... rules ...
}
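
A `compatible_with` string in this form can be checked with the standard library's `Version.match?/2`. The sketch below assumes the `"nasty <requirement>"` format shown above and treats grammars without the metadata as compatible; `CompatSketch` is a hypothetical module, and the source does not say whether Nasty enforces this field itself.

```elixir
defmodule CompatSketch do
  # Strips the "nasty " prefix and checks the remaining requirement
  # against the given version string.
  def compatible?(%{metadata: %{compatible_with: "nasty " <> requirement}}, nasty_version) do
    Version.match?(nasty_version, requirement)
  end

  # Grammars without the metadata are treated as compatible in this sketch.
  def compatible?(_rules, _nasty_version), do: true
end

rules = %{metadata: %{version: "2.1.0", compatible_with: "nasty >= 0.2.0"}}

IO.inspect(CompatSketch.compatible?(rules, "0.2.0"))
IO.inspect(CompatSketch.compatible?(rules, "0.1.9"))
```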

5. Keep Grammar Files Focused

Separate concerns across different rule types:

  • Phrase structure → phrase_rules.exs
  • Dependencies → dependency_rules.exs
  • Coordination → coordination_rules.exs
  • Subordination → subordination_rules.exs

Don't mix all rules into one file.

Performance Considerations

Caching

Grammar files are cached in ETS after first load:

# First load: reads from disk
{:ok, rules} = GrammarLoader.load(:en, :phrase_rules)  # ~5ms

# Subsequent loads: from cache
{:ok, rules} = GrammarLoader.load(:en, :phrase_rules)  # ~0.1ms

Clear cache when updating grammar during development:

GrammarLoader.clear_cache()
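
The load-then-cache behaviour can be pictured with a plain ETS table. This is not Nasty's implementation: the table name, the `EtsCacheSketch` module, and the `fetch/2` loader callback are assumptions, with the key shaped like the language, rule type, and variant triple that `clear_cache/3` takes.

```elixir
defmodule EtsCacheSketch do
  @table :grammar_cache_sketch

  # Creates the named table once; call before the first fetch.
  def init do
    if :ets.whereis(@table) == :undefined do
      :ets.new(@table, [:named_table, :set, :public])
    end

    :ok
  end

  # Returns the cached value for key, or runs loader/0, stores the result,
  # and returns it.
  def fetch(key, loader) when is_function(loader, 0) do
    case :ets.lookup(@table, key) do
      [{^key, value}] ->
        value

      [] ->
        value = loader.()
        :ets.insert(@table, {key, value})
        value
    end
  end

  def clear, do: :ets.delete_all_objects(@table)
  def clear(key), do: :ets.delete(@table, key)
end

EtsCacheSketch.init()
key = {:en, :phrase_rules, :default}
rules = EtsCacheSketch.fetch(key, fn -> %{noun_phrases: []} end)
IO.inspect(rules)
```

Only the first `fetch/2` for a key runs the loader, which is why edits to a grammar file stay invisible until the entry is cleared.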

File Size

Keep grammar files under 1 MB for fast loading. If a file grows beyond that, split it by phrase type:

phrase_rules_np.exs  # Noun phrase patterns
phrase_rules_vp.exs  # Verb phrase patterns
phrase_rules_pp.exs  # Prepositional phrase patterns

Troubleshooting

Grammar File Not Found

Grammar file not found: .../en/grammars/missing_rules.exs, using empty rules

Solution: Check that the file exists and the path is correct. Default grammar files must be in priv/languages/{lang}/grammars/; variant files belong in priv/languages/{lang}/variants/{variant_name}/.

Invalid Grammar Format

** (ArgumentError) Grammar rules must be a map, got: [...]

Solution: Ensure file evaluates to a map:

# Correct
%{rules: [...]}

# Wrong
[...]

Compilation Errors

** (SyntaxError) invalid syntax

Solution: Grammar files must be valid Elixir. Test with:

elixir priv/languages/en/grammars/your_rules.exs
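
If you prefer to check syntax without evaluating the file, `Code.string_to_quoted/1` from the standard library parses source text and returns an error tuple instead of raising. The wrapper below is a small sketch; `SyntaxCheckSketch.check/1` is a hypothetical name and is not part of Nasty.

```elixir
defmodule SyntaxCheckSketch do
  # Parses the source without evaluating it. Code.string_to_quoted/1 returns
  # {:ok, ast} or {:error, details} instead of raising.
  def check(source) do
    case Code.string_to_quoted(source) do
      {:ok, _ast} -> :ok
      {:error, details} -> {:error, details}
    end
  end
end

IO.inspect(SyntaxCheckSketch.check("%{rules: []}"))
IO.inspect(SyntaxCheckSketch.check("%{rules: ["))
```

To check a file on disk, pass it the result of `File.read!/1`. Parsing only catches syntax errors, so a file that parses can still evaluate to something other than a map.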

Cache Issues

If changes to grammar files aren't reflected:

# Clear cache
Nasty.Language.GrammarLoader.clear_cache()

# Or force reload
{:ok, rules} = Nasty.Language.GrammarLoader.load(:en, :phrase_rules, force_reload: true)

Examples Repository

See working examples in the main repository:

  • English grammar: priv/languages/en/grammars/
  • Spanish grammar: priv/languages/es/grammars/
  • Test fixtures: test/fixtures/grammars/

Contributing Custom Grammars

To contribute grammar variants to the Nasty project:

  1. Create grammar files following the structure above
  2. Add tests demonstrating the grammar works
  3. Document the use case and domain
  4. Submit a pull request to the main repository

Further Reading