Grammar Resources
View SourceThis document describes the grammar resource system in Nasty, which externalizes lexicons and grammar rules to separate files for easy modification and multilingual support.
Overview
Grammar resources are stored in priv/languages/{language_code}/ and include:
- Lexicons: Word lists for closed-class words (determiners, pronouns, etc.)
- Grammar rules: Context-Free Grammar (CFG) rules for phrase structure
- Other resources: Irregular verb forms, stop words, etc.
Directory Structure
priv/languages/
├── en/ # English resources
│ ├── lexicons/ # Word lists
│ │ ├── determiners.exs
│ │ ├── pronouns.exs
│ │ ├── prepositions.exs
│ │ ├── conjunctions_coord.exs
│ │ ├── conjunctions_sub.exs
│ │ ├── auxiliaries.exs
│ │ ├── adverbs.exs
│ │ ├── particles.exs
│ │ ├── interjections.exs
│ │ ├── common_verbs.exs
│ │ ├── common_adjectives.exs
│ │ ├── irregular_verbs.txt
│ │ ├── irregular_nouns.txt
│ │ └── stop_words.txt
│ └── grammars/ # Grammar rules
│ ├── phrase_rules.ex
│ └── dependency_rules.ex
├── es/ # Spanish resources
│ └── ...
└── ca/ # Catalan resources
└── ...Lexicon File Format
Lexicon files use Elixir term format (.exs) and evaluate to a list of strings using the ~w() sigil.
Example: determiners.exs
# English Determiners
# Articles, demonstratives, possessives, quantifiers
~w(
the a an
this that these those
my your his her its our their
some any no every each either neither
much many more most less least few several all both half
whose
)Lexicon Categories
Closed-Class Words (Complete Lists)
- Determiners (
determiners.exs) - Articles, demonstratives, possessives, quantifiers - Pronouns (
pronouns.exs) - Personal, possessive, reflexive, demonstrative, interrogative - Prepositions (
prepositions.exs) - Spatial, temporal, logical relations - Coordinating Conjunctions (
conjunctions_coord.exs) - FANBOYS (for, and, nor, but, or, yet, so) - Subordinating Conjunctions (
conjunctions_sub.exs) - after, although, because, if, when, etc. - Auxiliaries (
auxiliaries.exs) - be, have, do, modals (will, can, should, etc.) - Particles (
particles.exs) - Phrasal verb particles (up, down, out, etc.) - Interjections (
interjections.exs) - oh, wow, hey, etc.
Open-Class Words (Common Examples)
- Common Verbs (
common_verbs.exs) - Frequently used verbs with all inflections - Common Adjectives (
common_adjectives.exs) - Frequently used qualitative and relational adjectives
Verb Inflections
The common_verbs.exs file includes all inflected forms:
~w(
go went gone going goes
come came coming comes
see saw seen seeing sees
...
)This ensures that verbs are recognized in all their forms during POS tagging.
Loading Lexicons
In Code
Use the LexiconLoader module to load lexicons:
alias Nasty.Language.Resources.LexiconLoader
# Load a lexicon
determiners = LexiconLoader.load(:en, :determiners)
# Check if word is in lexicon
LexiconLoader.in_lexicon?(:en, :determiners, "the") # => true
# List all available lexicons
LexiconLoader.list_lexicons(:en)At Compile Time
For performance, load lexicons at compile time using module attributes:
defmodule MyModule do
alias Nasty.Language.Resources.LexiconLoader
@determiners LexiconLoader.load(:en, :determiners)
@pronouns LexiconLoader.load(:en, :pronouns)
defp determiners, do: @determiners
defp pronouns, do: @pronouns
endThis is how the POSTagger module loads lexicons efficiently.
Grammar Rules
Grammar rules are documented in grammars/phrase_rules.ex and follow Context-Free Grammar (CFG) notation.
Phrase Structure Rules
# Noun Phrase
NP → Det? Adj* (Noun | PropN | Pron) PP* RC*
# Verb Phrase
VP → Aux* Verb NP? PP* AdvP*
# Prepositional Phrase
PP → Prep NP
# Adjectival Phrase
AdjP → Adv? Adj
# Adverbial Phrase
AdvP → Adv+Rule File Format
Grammar rules are defined as Elixir modules returning lists of tuples:
defmodule Nasty.Language.English.Grammar.PhraseRules do
def rules do
[
{:np, [
[:det, :adj, :noun],
[:det, :noun],
[:noun],
[:propn],
[:pron]
]},
{:vp, [
[:aux, :verb, :np],
[:verb, :np],
[:verb]
]},
# ...
]
end
endNote: Currently, these rules are documentation only. The phrase parser uses procedural pattern matching rather than rule interpretation. Future versions may add a rule-based parser.
Adding a New Language
To add support for a new language:
Create directory structure:
mkdir -p priv/languages/{code}/lexicons mkdir -p priv/languages/{code}/grammarsCreate lexicon files: Translate lexicons from English, adjusting for the language's grammar
Create grammar rules: Define CFG rules for the language's phrase structure
Implement language module: Create a module implementing
Nasty.Language.BehaviourRegister language: Register in the
Application
Example: Spanish Lexicons
# priv/languages/es/lexicons/determiners.exs
~w(
el la los las
un una unos unas
este esta estos estas
ese esa esos esas
mi tu su nuestro vuestra
algún alguna algunos algunas
)Modifying Lexicons
To add or modify words:
- Edit the appropriate
.exsfile inpriv/languages/{code}/lexicons/ - Recompile the project:
mix compile --force - Run tests to verify:
mix test
Changes take effect immediately after recompilation since lexicons are loaded at compile time.
Testing
Lexicon loading is tested in test/language/resources/lexicon_loader_test.exs:
test "loads determiners lexicon for English" do
determiners = LexiconLoader.load(:en, :determiners)
assert is_list(determiners)
assert "the" in determiners
assert "a" in determiners
endPerformance Considerations
- Compile-time loading: Lexicons are loaded once during compilation and cached as module attributes
- No runtime overhead: Lookups are fast list membership checks
- Memory usage: All lexicons are kept in memory (typically < 1MB per language)
Best Practices
- Keep lexicons sorted: Makes it easier to find and avoid duplicates
- Add comments: Document word categories and usage patterns
- Test coverage: Add tests for new lexicons or grammar rules
- Version control: Commit lexicon changes with descriptive messages
- Language consistency: Follow Universal Dependencies (UD) tag set
Future Work
- Rule-based parser: Implement CFG rule interpreter for phrase parsing
- Pattern rules: Add pattern matching rules for specific constructions
- Morphological rules: Externalize morphological analysis patterns
- Statistical models: Support for statistical grammar models
References
- Universal Dependencies - POS tags and dependency relations
- docs/languages/ENGLISH_GRAMMAR.md - Formal English grammar specification
- docs/PARSING_GUIDE.md - Parsing algorithm documentation