Grammar Resources

This document describes the grammar resource system in Nasty, which externalizes lexicons and grammar rules to separate files for easy modification and multilingual support.

Overview

Grammar resources are stored in priv/languages/{language_code}/ and include:

Lexicons: Word lists for closed-class words (determiners, pronouns, etc.)
Grammar rules: Context-Free Grammar (CFG) rules for phrase structure
Other resources: Irregular verb forms, stop words, etc.

Directory Structure

priv/languages/
├── en/                      # English resources
│   ├── lexicons/           # Word lists
│   │   ├── determiners.exs
│   │   ├── pronouns.exs
│   │   ├── prepositions.exs
│   │   ├── conjunctions_coord.exs
│   │   ├── conjunctions_sub.exs
│   │   ├── auxiliaries.exs
│   │   ├── adverbs.exs
│   │   ├── particles.exs
│   │   ├── interjections.exs
│   │   ├── common_verbs.exs
│   │   ├── common_adjectives.exs
│   │   ├── irregular_verbs.txt
│   │   ├── irregular_nouns.txt
│   │   └── stop_words.txt
│   └── grammars/           # Grammar rules
│       ├── phrase_rules.ex
│       └── dependency_rules.ex
├── es/                      # Spanish resources
│   └── ...
└── ca/                      # Catalan resources
    └── ...

Lexicon File Format

Lexicon files are Elixir scripts (.exs) that evaluate to a list of strings, typically written with the ~w() sigil.

Example: `determiners.exs`

# English Determiners
# Articles, demonstratives, possessives, quantifiers

~w(
  the a an
  this that these those
  my your his her its our their
  some any no every each either neither
  much many more most less least few several all both half
  whose
)
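
To make the "evaluates to a list of strings" point concrete, here is a minimal sketch of evaluating such a file with Elixir's standard Code.eval_file/1. The temporary path and the abridged word list are illustrative assumptions. How LexiconLoader reads files internally is not described in this document and may differ.

```elixir
# Write a tiny lexicon script to a temporary file (illustrative path,
# not the real priv/languages/en/lexicons/determiners.exs).
path = Path.join(System.tmp_dir!(), "determiners_example.exs")

File.write!(path, """
# English Determiners (abridged)
~w(
  the a an
  this that these those
)
""")

# Code.eval_file/1 returns {value_of_last_expression, bindings}.
# The ~w() sigil splits on any whitespace, including newlines.
{words, _bindings} = Code.eval_file(path)

IO.inspect(words)
# => ["the", "a", "an", "this", "that", "these", "those"]

File.rm!(path)
```

Comments inside the file are ordinary Elixir comments, so they do not affect the resulting list.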

Lexicon Categories

Closed-Class Words (Complete Lists)

Determiners (determiners.exs) - Articles, demonstratives, possessives, quantifiers
Pronouns (pronouns.exs) - Personal, possessive, reflexive, demonstrative, interrogative
Prepositions (prepositions.exs) - Spatial, temporal, logical relations
Coordinating Conjunctions (conjunctions_coord.exs) - FANBOYS (for, and, nor, but, or, yet, so)
Subordinating Conjunctions (conjunctions_sub.exs) - after, although, because, if, when, etc.
Auxiliaries (auxiliaries.exs) - be, have, do, modals (will, can, should, etc.)
Particles (particles.exs) - Phrasal verb particles (up, down, out, etc.)
Interjections (interjections.exs) - oh, wow, hey, etc.

Open-Class Words (Common Examples)

Common Verbs (common_verbs.exs) - Frequently used verbs with all inflections
Common Adjectives (common_adjectives.exs) - Frequently used qualitative and relational adjectives

Verb Inflections

The common_verbs.exs file includes all inflected forms:

~w(
  go went gone going goes
  come came coming comes
  see saw seen seeing sees
  ...
)

This ensures that verbs are recognized in all their forms during POS tagging.
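
As a rough illustration of why listing every inflected form helps, the sketch below tags words by plain lexicon membership. The module name LexiconTagSketch, the abridged word lists, and the tag/1 function are hypothetical and are not part of Nasty's POSTagger.

```elixir
defmodule LexiconTagSketch do
  # Abridged, hard-coded lexicons for illustration only.
  @determiners ~w(the a an this that these those)
  @verbs ~w(go went gone going goes come came coming comes see saw seen seeing sees)

  # Returns a UD-style tag atom based purely on lexicon membership.
  def tag(word) do
    w = String.downcase(word)

    cond do
      w in @determiners -> :det
      w in @verbs -> :verb
      true -> :unknown
    end
  end
end

IO.inspect(LexiconTagSketch.tag("went"))   # irregular past form is found directly
IO.inspect(LexiconTagSketch.tag("table"))  # not in either list
```

Because "went" appears in the list as its own entry, no morphological analysis is needed to recognize it as a verb.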

Loading Lexicons

In Code

Use the LexiconLoader module to load lexicons:

alias Nasty.Language.Resources.LexiconLoader

# Load a lexicon
determiners = LexiconLoader.load(:en, :determiners)

# Check if word is in lexicon
LexiconLoader.in_lexicon?(:en, :determiners, "the")  # => true

# List all available lexicons
LexiconLoader.list_lexicons(:en)

At Compile Time

For performance, load lexicons at compile time using module attributes:

defmodule MyModule do
  alias Nasty.Language.Resources.LexiconLoader

  @determiners LexiconLoader.load(:en, :determiners)
  @pronouns LexiconLoader.load(:en, :pronouns)

  defp determiners, do: @determiners
  defp pronouns, do: @pronouns
end

This is how the POSTagger module loads lexicons efficiently.

Grammar Rules

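Grammar rules are documented in grammars/phrase_rules.ex. They use Context-Free Grammar (CFG) notation extended with regular-expression-style operators: ? (optional), * (zero or more), + (one or more), and | (alternation).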

Phrase Structure Rules

# Noun Phrase
NP → Det? Adj* (Noun | PropN | Pron) PP* RC*

# Verb Phrase
VP → Aux* Verb NP? PP* AdvP*

# Prepositional Phrase
PP → Prep NP

# Adjectival Phrase
AdjP → Adv? Adj

# Adverbial Phrase
AdvP → Adv+
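
The notation above can be read procedurally. The following sketch checks a list of POS tags against a simplified form of the NP rule, namely Det? Adj* (Noun | PropN | Pron), with the trailing PP* and RC* left out for brevity. The module NPSketch and its function names are hypothetical. This is not the parser Nasty ships.

```elixir
defmodule NPSketch do
  # Simplified rule: NP → Det? Adj* (Noun | PropN | Pron)
  # (PP* and RC* from the full rule are intentionally omitted.)
  @heads [:noun, :propn, :pron]

  def np?(tags), do: tags |> skip_det() |> skip_adjs() |> head_only?()

  # Det? : consume at most one determiner.
  defp skip_det([:det | rest]), do: rest
  defp skip_det(tags), do: tags

  # Adj* : consume any number of adjectives.
  defp skip_adjs([:adj | rest]), do: skip_adjs(rest)
  defp skip_adjs(tags), do: tags

  # Exactly one head tag must remain.
  defp head_only?([tag]) when tag in @heads, do: true
  defp head_only?(_), do: false
end

IO.inspect(NPSketch.np?([:det, :adj, :adj, :noun]))  # => true
IO.inspect(NPSketch.np?([:det, :verb]))              # => false
```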

Rule File Format

Grammar rules are defined in Elixir modules whose rules/0 function returns a list of {category, alternatives} tuples, where each alternative is a list of symbols:

defmodule Nasty.Language.English.Grammar.PhraseRules do
  def rules do
    [
      {:np, [
        [:det, :adj, :noun],
        [:det, :noun],
        [:noun],
        [:propn],
        [:pron]
      ]},
      {:vp, [
        [:aux, :verb, :np],
        [:verb, :np],
        [:verb]
      ]},
      # ...
    ]
  end
end

Note: Currently, these rules are documentation only. The phrase parser uses procedural pattern matching rather than rule interpretation. Future versions may add a rule-based parser.
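
Since the rules are not interpreted today, the following is only one way a future rule interpreter might consume this tuple format. It is a naive top-down matcher, shown under the assumption that the rule set has no left recursion. The module RuleSketch and the function matches?/2 are hypothetical, and the rule list is an abridged copy of the example above.

```elixir
defmodule RuleSketch do
  # Abridged copy of the documented rule format: {category, alternatives}.
  def rules do
    [
      {:np, [[:det, :adj, :noun], [:det, :noun], [:noun], [:propn], [:pron]]},
      {:vp, [[:aux, :verb, :np], [:verb, :np], [:verb]]}
    ]
  end

  # True if `category` can derive exactly the given tag sequence.
  def matches?(category, tags), do: derive([category], tags)

  defp derive([], []), do: true
  defp derive([], _leftover), do: false

  defp derive([symbol | rest], tags) do
    case List.keyfind(rules(), symbol, 0) do
      # Non-terminal: try each alternative in place of the symbol.
      {^symbol, alternatives} ->
        Enum.any?(alternatives, fn alt -> derive(alt ++ rest, tags) end)

      # Terminal: it must equal the next tag in the input.
      nil ->
        match?([^symbol | _], tags) and derive(rest, tl(tags))
    end
  end
end

IO.inspect(RuleSketch.matches?(:vp, [:aux, :verb, :det, :noun]))  # => true
IO.inspect(RuleSketch.matches?(:np, [:det, :verb]))               # => false
```

A matcher like this terminates only for rule sets without left recursion; a production parser would more likely use a chart-based algorithm.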

Adding a New Language

To add support for a new language:

Create directory structure:

mkdir -p priv/languages/{code}/lexicons
mkdir -p priv/languages/{code}/grammars

Create lexicon files: Translate lexicons from English, adjusting for the language's grammar
Create grammar rules: Define CFG rules for the language's phrase structure
Implement language module: Create a module implementing Nasty.Language.Behaviour
Register language: Register in the Application

Example: Spanish Lexicons

# priv/languages/es/lexicons/determiners.exs
~w(
  el la los las
  un una unos unas
  este esta estos estas
  ese esa esos esas
  mi tu su nuestro nuestra vuestro vuestra
  algún alguna algunos algunas
)

Modifying Lexicons

To add or modify words:

Edit the appropriate .exs file in priv/languages/{code}/lexicons/
Recompile the project: mix compile --force
Run tests to verify: mix test

Because lexicons are loaded at compile time, changes take effect only after the project is recompiled; a running application keeps the old word lists until then.
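
Elixir's standard @external_resource module attribute is one way to let the compiler track a data file, so that Mix recompiles the dependent module when the file changes. Whether Nasty's modules already declare it is not stated here, so treat the sketch below as an assumption. The module name and the temporary path are illustrative.

```elixir
# Create a stand-in lexicon file (illustrative path; in the project this
# would be a file under priv/languages/{code}/lexicons/).
path = Path.join(System.tmp_dir!(), "external_resource_sketch.exs")
File.write!(path, "~w(the a an)\n")

defmodule ExternalResourceSketch do
  @path Path.join(System.tmp_dir!(), "external_resource_sketch.exs")

  # Declares the file as a compile-time dependency of this module.
  @external_resource @path

  # Evaluate the lexicon script while the module is being compiled.
  {words, _bindings} = Code.eval_file(@path)
  @words words

  def words, do: @words
end

IO.inspect(ExternalResourceSketch.words())
# => ["the", "a", "an"]
```

With such a declaration in place, a plain mix compile may be enough after editing a lexicon; mix compile --force remains the safe fallback.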

Testing

Lexicon loading is tested in test/language/resources/lexicon_loader_test.exs:

# Assumes `alias Nasty.Language.Resources.LexiconLoader` at the top of the test module.
test "loads determiners lexicon for English" do
  determiners = LexiconLoader.load(:en, :determiners)

  assert is_list(determiners)
  assert "the" in determiners
  assert "a" in determiners
end

Performance Considerations

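Compile-time loading: Lexicons are read once during compilation and embedded in the compiled module through module attributes
No file I/O at runtime: Lookups are in-memory list membership checks, which scan the list linearly; this is fast for small closed-class lists but grows with lexicon size
Memory usage: All lexicons are kept in memory (typically < 1MB per language)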
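
If list scans ever become a bottleneck for larger lexicons, one common option is to convert the list to a MapSet at compile time. This is a sketch of that alternative, not a description of how Nasty stores lexicons. The module and function names are hypothetical and the word list is abridged.

```elixir
defmodule MapSetLookupSketch do
  # Built once at compile time. In a real module the list could come
  # from a lexicon file instead of this inline literal.
  @determiners MapSet.new(~w(the a an this that these those))

  # Membership test that avoids scanning the whole list.
  def determiner?(word) do
    MapSet.member?(@determiners, String.downcase(word))
  end
end

IO.inspect(MapSetLookupSketch.determiner?("The"))   # => true
IO.inspect(MapSetLookupSketch.determiner?("table")) # => false
```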

Best Practices

Keep lexicons sorted: Sorting, or at least grouping related words, makes entries easier to find and duplicates easier to spot
Add comments: Document word categories and usage patterns
Test coverage: Add tests for new lexicons or grammar rules
Version control: Commit lexicon changes with descriptive messages
Language consistency: Follow the Universal Dependencies (UD) tag set
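
The duplicate and ordering checks suggested above are easy to automate. The helper below is a hypothetical sketch (LexiconLint is not part of Nasty) that could be called from a test on any loaded lexicon.

```elixir
defmodule LexiconLint do
  # Words that occur more than once, each reported a single time.
  def duplicates(words), do: Enum.uniq(words -- Enum.uniq(words))

  # True if the list is already in ascending term order.
  def sorted?(words), do: words == Enum.sort(words)
end

IO.inspect(LexiconLint.duplicates(~w(the a an the)))  # => ["the"]
IO.inspect(LexiconLint.sorted?(~w(a an the)))         # => true
```

Lexicons that are deliberately grouped by category, like the determiners example, would fail sorted?/1, so that check only suits files kept in alphabetical order.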

Future Work

Rule-based parser: Implement CFG rule interpreter for phrase parsing
Pattern rules: Add pattern matching rules for specific constructions
Morphological rules: Externalize morphological analysis patterns
Statistical models: Support for statistical grammar models

References

Universal Dependencies - POS tags and dependency relations
docs/languages/ENGLISH_GRAMMAR.md - Formal English grammar specification
docs/PARSING_GUIDE.md - Parsing algorithm documentation
