Nasty.Language.English.POSTagger (Nasty v0.3.0)

Part-of-Speech tagger for English using rule-based pattern matching.

Tags tokens with Universal Dependencies POS tags based on:

  • Lexical lookup (closed-class words)
  • Morphological patterns (suffixes)
  • Context-based disambiguation

The default tagger is simple and rule-based. For better accuracy, select a statistical (HMM), neural, or transformer model through the :model option of tag_pos/2.
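The three rule-based strategies listed above can be sketched in a few lines of plain Elixir. This is a minimal illustration under stated assumptions, not Nasty's implementation: the module name ToyRuleTagger, its tiny lexicon, its suffix table, and its single context rule are all hypothetical, and tokens are modelled as plain strings rather than Nasty.AST.Token structs.

```elixir
defmodule ToyRuleTagger do
  # Hypothetical illustration only; not Nasty's implementation.

  # 1. Lexical lookup: a tiny closed-class lexicon (assumed entries).
  @lexicon %{
    "the" => :det,
    "a" => :det,
    "an" => :det,
    "he" => :pron,
    "she" => :pron,
    "in" => :adp,
    "on" => :adp,
    "and" => :cconj
  }

  # 2. Morphological patterns: suffix -> tag, checked in order (assumed table).
  @suffixes [{"ing", :verb}, {"ed", :verb}, {"ly", :adv}, {"ous", :adj}, {"tion", :noun}]

  # Returns a list of {word, tag} pairs for a list of word strings.
  def tag(words) do
    words
    |> Enum.map(&{&1, guess(String.downcase(&1))})
    |> disambiguate()
  end

  # Lexicon first, then suffixes, then a default of :noun.
  defp guess(word) do
    Map.get(@lexicon, word) || suffix_tag(word) || :noun
  end

  defp suffix_tag(word) do
    Enum.find_value(@suffixes, fn {suffix, tag} ->
      if String.ends_with?(word, suffix), do: tag
    end)
  end

  # 3. Context rule: a verb guess directly after a determiner becomes a noun.
  defp disambiguate([{w1, :det}, {w2, :verb} | rest]) do
    [{w1, :det} | disambiguate([{w2, :noun} | rest])]
  end

  defp disambiguate([head | rest]), do: [head | disambiguate(rest)]
  defp disambiguate([]), do: []
end

ToyRuleTagger.tag(["The", "building", "collapsed", "quickly"])
```

In this sketch "building" is first guessed as a verb from its suffix and then retagged as a noun because it follows a determiner, which is the kind of correction context-based disambiguation is meant to provide.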

Examples

iex> alias Nasty.Language.English.{Tokenizer, POSTagger}
iex> {:ok, tokens} = Tokenizer.tokenize("the")
iex> {:ok, tagged} = POSTagger.tag_pos(tokens)
iex> hd(tagged).pos_tag
:det

Summary

Functions

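tag_pos(tokens, opts \\ [])
Tags a list of tokens with POS tags.

tag_pos_ensemble(tokens, opts)
Ensemble POS tagging combining rule-based and HMM.

tag_pos_hmm(tokens, opts)
HMM-based POS tagging.

tag_pos_neural(tokens, opts)
Neural POS tagging using BiLSTM-CRF.

tag_pos_neural_ensemble(tokens, opts)
Neural ensemble POS tagging combining neural, HMM, and rule-based models.

tag_pos_rule_based(tokens)
Rule-based POS tagging (original implementation).

tag_pos_transformer(tokens, opts)
Transformer-based POS tagging using pre-trained models.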

Functions

tag_pos(tokens, opts \\ [])

@spec tag_pos(
  [Nasty.AST.Token.t()],
  keyword()
) :: {:ok, [Nasty.AST.Token.t()]}

Tags a list of tokens with POS tags.

Depending on the selected :model, tagging draws on:

  1. Lexical lookup for known words (determiners, pronouns, etc.)
  2. Morphological patterns (suffixes for verbs, nouns, adjectives)
  3. Context rules (e.g., word after determiner is likely a noun)
  4. Statistical models (HMM)
  5. Neural models (BiLSTM-CRF)

Parameters

  • tokens - List of Token structs (from tokenizer)
  • opts - Options
    • :model - Model type: :rule_based (default), :hmm, :neural, :ensemble, :neural_ensemble, :transformer, or the name of a specific transformer model (e.g., :roberta_base)
    • :hmm_model - Trained HMM model (optional)
    • :neural_model - Trained neural model (optional)

Returns

  • {:ok, tokens} - Tokens with updated pos_tag field
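A short usage sketch for the :model option follows. It uses only the functions and options documented on this page; the sample sentence is an arbitrary choice, and the comment about the :hmm fallback restates the tag_pos_hmm/2 description on the assumption that tag_pos/2 dispatches to that function.

```elixir
alias Nasty.Language.English.{Tokenizer, POSTagger}

{:ok, tokens} = Tokenizer.tokenize("The cat sat on the mat.")

# Default model (:rule_based).
{:ok, rule_tagged} = POSTagger.tag_pos(tokens)

# Request the HMM tagger. Assumption: this dispatches to tag_pos_hmm/2,
# which is documented to fall back to rule-based tagging when no model is available.
{:ok, hmm_tagged} = POSTagger.tag_pos(tokens, model: :hmm)

# Inspect the tags assigned by each model.
Enum.map(rule_tagged, & &1.pos_tag)
Enum.map(hmm_tagged, & &1.pos_tag)
```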

tag_pos_ensemble(tokens, opts)

Ensemble POS tagging combining rule-based and HMM.

Uses HMM predictions but falls back to rule-based tagging for punctuation and other deterministic cases.
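One way to picture this combination is a per-token merge of two tag sequences. The sketch below is hypothetical and not taken from Nasty's source: the module name EnsembleSketch, the {word, tag} pair representation, and the choice of :punct and :num as the "deterministic" classes are all assumptions made for illustration.

```elixir
defmodule EnsembleSketch do
  # Hypothetical illustration only; not Nasty's implementation.
  # Assumed set of tags for which the rule-based answer is trusted.
  @deterministic [:punct, :num]

  # Both arguments are lists of {word, tag} pairs for the same token sequence.
  def combine(rule_tagged, hmm_tagged) do
    rule_tagged
    |> Enum.zip(hmm_tagged)
    |> Enum.map(fn {{word, rule_tag}, {_word, hmm_tag}} ->
      # Keep the rule-based tag for deterministic classes, otherwise use the HMM tag.
      if rule_tag in @deterministic, do: {word, rule_tag}, else: {word, hmm_tag}
    end)
  end
end

rule_tagged = [{"Run", :noun}, {"!", :punct}]
hmm_tagged = [{"Run", :verb}, {"!", :noun}]

EnsembleSketch.combine(rule_tagged, hmm_tagged)
```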

tag_pos_hmm(tokens, opts)

HMM-based POS tagging.

If no model is provided via the :hmm_model option, attempts to load the latest English POS tagging model from the registry. Falls back to rule-based tagging if no model is available.

tag_pos_neural(tokens, opts)

Neural POS tagging using BiLSTM-CRF.

If no model is provided via the :neural_model option, attempts to load the latest neural POS tagging model from the registry. Falls back to HMM or rule-based tagging if no model is available.

tag_pos_neural_ensemble(tokens, opts)

Neural ensemble POS tagging combining neural, HMM, and rule-based models.

Uses neural predictions as the primary source, with the fallback chain neural -> HMM -> rule-based.

Prefers rule-based tags for high-confidence cases such as punctuation and numbers.
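The fallback chain can be sketched as trying each tagger in order and keeping the first success. This is a minimal illustration, not Nasty's implementation: the module name FallbackSketch, the convention that a tagger returns {:ok, tagged} or {:error, reason}, and the stand-in tagger functions are all hypothetical.

```elixir
defmodule FallbackSketch do
  # Hypothetical illustration only; not Nasty's implementation.
  # Each tagger is a one-argument function returning {:ok, tagged} or {:error, reason}.
  def tag_with_fallback(tokens, taggers) do
    # Return the first {:ok, _} result, or an error if every tagger fails.
    Enum.find_value(taggers, {:error, :no_tagger_succeeded}, fn tagger ->
      case tagger.(tokens) do
        {:ok, tagged} -> {:ok, tagged}
        {:error, _reason} -> nil
      end
    end)
  end
end

# Stand-in taggers: the first two simulate a missing model.
neural = fn _tokens -> {:error, :model_not_found} end
hmm = fn _tokens -> {:error, :model_not_found} end
rule_based = fn tokens -> {:ok, Enum.map(tokens, &{&1, :noun})} end

FallbackSketch.tag_with_fallback(["cats"], [neural, hmm, rule_based])
```

Passing the taggers as an ordered list keeps the priority explicit, so reordering or shortening the chain does not require changing the function itself.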

tag_pos_rule_based(tokens)

Rule-based POS tagging (original implementation).

tag_pos_transformer(tokens, opts)

Transformer-based POS tagging using pre-trained models.

Uses BERT, RoBERTa, or other transformer models for state-of-the-art accuracy (98-99%). Falls back to neural tagging if the transformer model fails.