Nasty.Semantic.EntityRecognition.RuleBased behaviour (Nasty v0.3.0)

Language-agnostic rule-based Named Entity Recognition (NER).

Provides a generic framework for rule-based entity recognition that can be configured with language-specific lexicons and patterns. The algorithm:

1. Finds sequences of capitalized tokens (potential entities)
2. Classifies each sequence using configurable rules
3. Returns Entity structs with type, text, tokens, and span
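
The first step can be pictured with a minimal sketch. It is not Nasty's implementation: the token shape (`%{text: ..., pos_tag: ...}`), the module name `SequenceSketch`, and the capitalization test are all assumptions made for illustration, and the library's real token struct may differ.

```elixir
defmodule SequenceSketch do
  # Hypothetical token shape: %{text: String.t(), pos_tag: atom()}.
  # Nasty's actual token struct may differ.

  # A token counts as capitalized when its first grapheme is uppercase
  # and has a distinct lowercase form, so digits and punctuation fail.
  def capitalized?(%{text: text}) do
    case String.first(text) do
      nil -> false
      first -> first == String.upcase(first) and first != String.downcase(first)
    end
  end

  # Groups consecutive candidate tokens into runs; any token that is not
  # capitalized, or whose POS tag is excluded, ends the current run.
  def find_sequences(tokens, excluded_tags) do
    tokens
    |> Enum.chunk_by(&candidate?(&1, excluded_tags))
    |> Enum.filter(fn [first | _] -> candidate?(first, excluded_tags) end)
  end

  defp candidate?(token, excluded_tags) do
    capitalized?(token) and token.pos_tag not in excluded_tags
  end
end

tokens = [
  %{text: "The", pos_tag: :det},
  %{text: "New", pos_tag: :propn},
  %{text: "York", pos_tag: :propn},
  %{text: "office", pos_tag: :noun},
  %{text: "hired", pos_tag: :verb},
  %{text: "Alice", pos_tag: :propn}
]

sequences = SequenceSketch.find_sequences(tokens, [:punct, :det, :adp, :verb, :aux])
Enum.map(sequences, fn seq -> Enum.map_join(seq, " ", & &1.text) end)
# → ["New York", "Alice"]
```

Note that "The" is capitalized but dropped because `:det` is in the excluded tag list, which is the role `excluded_pos_tags/0` plays in the behaviour.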

Usage

defmodule MyLanguage.EntityRecognizer do
  @behaviour Nasty.Semantic.EntityRecognition.RuleBased

  @impl true
  def excluded_pos_tags, do: [:punct, :det, :adp, :verb, :aux]

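  # Rules are checked in order; each predicate receives {text, tokens}.
  # The has_*?/1 helpers must be defined in (or imported into) this module.
  @impl true
  def classification_rules do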
    [
      {:person, &has_person_title?/1},
      {:gpe, &has_location_suffix?/1},
      {:org, &has_org_suffix?/1}
    ]
  end

  # Optional callback; each matcher receives the entity text.
  @impl true
  def lexicon_matchers do
    %{
      person: &person_name?/1,
      gpe: &place_name?/1,
      org: &organization_name?/1
    }
  end
end
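
The usage module references helper predicates that it does not define. As a rough sketch of what such helpers could look like, the module below implements three of them. The module name `MyLanguage.EntityHelpers` and the word lists are hypothetical; a real recognizer would draw on language-specific lexicons.

```elixir
defmodule MyLanguage.EntityHelpers do
  # Hypothetical word lists, for illustration only.
  @person_titles ~w(Dr. Mr. Mrs. Ms. Prof.)
  @org_suffixes ~w(Inc. Corp. Ltd. LLC)
  @location_suffixes ~w(City County Island)

  # Each predicate receives {text, tokens}, matching the contract
  # documented for classification_rules/0.
  def has_person_title?({text, _tokens}), do: first_word(text) in @person_titles
  def has_org_suffix?({text, _tokens}), do: last_word(text) in @org_suffixes
  def has_location_suffix?({text, _tokens}), do: last_word(text) in @location_suffixes

  # Both return nil for empty text, so the `in` checks above are false.
  defp first_word(text), do: text |> String.split() |> List.first()
  defp last_word(text), do: text |> String.split() |> List.last()
end

MyLanguage.EntityHelpers.has_person_title?({"Dr. Smith", []})
# → true
MyLanguage.EntityHelpers.has_org_suffix?({"Acme Corp.", []})
# → true
```

If the helpers live in a separate module like this, the rule list would capture them as remote functions, for example `&MyLanguage.EntityHelpers.has_person_title?/1`.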

Summary

Callbacks

classification_rules()

Callback for ordered classification rules. Returns a list of {type, predicate_function} tuples. Each predicate receives {text, tokens} and returns a boolean.
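
Because the rules are ordered, their position in the list matters when more than one predicate accepts the same sequence. The sketch below assumes first-match-wins evaluation, which is the natural reading of "ordered" but is not confirmed by this page; `RuleOrderSketch` and the two sample rules are hypothetical.

```elixir
defmodule RuleOrderSketch do
  # Returns the type of the first rule whose predicate accepts
  # {text, tokens}, or nil when no rule matches.
  def first_match(rules, text, tokens) do
    Enum.find_value(rules, fn {type, predicate} ->
      if predicate.({text, tokens}), do: type
    end)
  end
end

rules = [
  {:person, fn {text, _tokens} -> String.starts_with?(text, "Dr. ") end},
  {:org, fn {text, _tokens} -> String.ends_with?(text, " Inc.") end}
]

RuleOrderSketch.first_match(rules, "Dr. Pepper Inc.", [])
# → :person (both rules match, but :person is listed first)
RuleOrderSketch.first_match(rules, "Acme Inc.", [])
# → :org
RuleOrderSketch.first_match(rules, "Paris", [])
# → nil
```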

default_classification(list)

Callback for default classification heuristics (optional). Receives tokens and returns an entity type or nil.

excluded_pos_tags()

Callback for POS tags to exclude when finding entity sequences.

lexicon_matchers()

Callback for lexicon matchers (optional). Returns a map of entity_type => matcher_function. Each matcher function receives text and returns a boolean.
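
One simple way to back a matcher is a set lookup. The following sketch assumes small in-memory lexicons held in `MapSet`s; the module name `LexiconSketch`, the entries, and the `lookup/2` helper are hypothetical and only show the shape of the map this callback returns.

```elixir
defmodule LexiconSketch do
  # Hypothetical in-memory lexicons, for illustration only.
  @people MapSet.new(["Ada Lovelace", "Alan Turing"])
  @places MapSet.new(["London", "Paris"])

  # Same shape as lexicon_matchers/0: entity_type => (text -> boolean).
  def matchers do
    %{
      person: fn text -> MapSet.member?(@people, text) end,
      gpe: fn text -> MapSet.member?(@places, text) end
    }
  end

  # Returns the type of a matcher that accepts the text, or nil.
  # Map iteration order is not guaranteed, so overlapping lexicons
  # would make the result ambiguous.
  def lookup(matchers, text) do
    Enum.find_value(matchers, fn {type, matcher} ->
      if matcher.(text), do: type
    end)
  end
end

LexiconSketch.lookup(LexiconSketch.matchers(), "Paris")
# → :gpe
LexiconSketch.lookup(LexiconSketch.matchers(), "Berlin")
# → nil
```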

Functions

all_capitalized?(tokens)

Checks if all tokens in a sequence are capitalized.

capitalized?(token)

Checks if a token is capitalized.

check_classification_rules(impl, text, tokens)

Checks classification rules in order.

check_default_classification(impl, tokens)

Checks default classification heuristics.

check_lexicons(impl, text)

Checks lexicon matchers for entity type.

classify_entity(impl, arg, confidence)

Classifies an entity sequence using configured rules.

determine_entity_type(impl, text, tokens)

Determines entity type using lexicons, patterns, and heuristics.
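
The three sources can be combined as a fallback chain. The sketch below assumes the stages run in the order the description lists them (lexicons, then rules, then the default heuristic) and that the first non-nil result wins; neither assumption is confirmed by this page. `TypeResolutionSketch`, the sample matchers, and the placeholder tokens are hypothetical.

```elixir
defmodule TypeResolutionSketch do
  # Tries each stage in turn and returns the first non-nil type.
  # The stage order is an assumption, not Nasty's documented behaviour.
  def resolve(text, tokens, lexicons, rules, default_fun) do
    check_lexicons(lexicons, text) ||
      check_rules(rules, text, tokens) ||
      default_fun.(tokens)
  end

  defp check_lexicons(lexicons, text) do
    Enum.find_value(lexicons, fn {type, matcher} ->
      if matcher.(text), do: type
    end)
  end

  defp check_rules(rules, text, tokens) do
    Enum.find_value(rules, fn {type, predicate} ->
      if predicate.({text, tokens}), do: type
    end)
  end
end

lexicons = %{gpe: fn text -> text in ["Paris", "London"] end}
rules = [{:org, fn {text, _tokens} -> String.ends_with?(text, " Inc.") end}]
# Hypothetical default heuristic: multi-token sequences are treated as people.
default = fn tokens -> if length(tokens) > 1, do: :person end

# Tokens are placeholder atoms here; only their count matters to `default`.
TypeResolutionSketch.resolve("Paris", [:t1], lexicons, rules, default)
# → :gpe
TypeResolutionSketch.resolve("Acme Inc.", [:t1, :t2], lexicons, rules, default)
# → :org
TypeResolutionSketch.resolve("Jane Doe", [:t1, :t2], lexicons, rules, default)
# → :person
TypeResolutionSketch.resolve("Zorp", [:t1], lexicons, rules, default)
# → nil
```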

find_proper_noun_sequences(tokens, impl)

Finds sequences of consecutive capitalized tokens.

recognize(impl, tokens, opts \\ [])

Recognizes named entities in a list of POS-tagged tokens.