Nasty.Semantic.EntityRecognition.RuleBased behaviour (Nasty v0.3.0)

Language-agnostic rule-based Named Entity Recognition (NER).

Provides a generic framework for rule-based entity recognition that can be configured with language-specific lexicons and patterns. The algorithm:

  1. Finds sequences of capitalized tokens (potential entities)
  2. Classifies each sequence using configurable rules
  3. Returns Entity structs with type, text, tokens, and span

Usage

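defmodule MyLanguage.EntityRecognizer do
  @behaviour Nasty.Semantic.EntityRecognition.RuleBased

  # POS tags that are skipped when collecting entity sequences.
  @impl true
  def excluded_pos_tags, do: [:punct, :det, :adp, :verb, :aux]

  # Ordered rules. Each predicate receives a {text, tokens} tuple.
  # has_person_title?/1 and the other helpers referenced here are not
  # shown; define them in this module.
  @impl true
  def classification_rules do
    [
      {:person, &has_person_title?/1},
      {:gpe, &has_location_suffix?/1},
      {:org, &has_org_suffix?/1}
    ]
  end

  # Optional. Each matcher receives the entity text only.
  @impl true
  def lexicon_matchers do
    %{
      person: &person_name?/1,
      gpe: &place_name?/1,
      org: &organization_name?/1
    }
  end
end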

Summary

Callbacks

classification_rules()
Callback for ordered classification rules. Returns a list of {type, predicate_function} tuples. Each predicate receives a {text, tokens} tuple and returns a boolean.

default_classification(list)
Callback for default classification heuristics (optional). Receives tokens and returns an entity type or nil.

excluded_pos_tags()
Callback for POS tags to exclude when finding entity sequences.

lexicon_matchers()
Callback for lexicon matchers (optional). Returns a map of entity_type => matcher_function. Each matcher function receives the entity text and returns a boolean.

Functions

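all_capitalized?(tokens)
Checks if all tokens in a sequence are capitalized.

capitalized?(token)
Checks if a token is capitalized.

check_classification_rules(impl, text, tokens)
Checks classification rules in order.

check_default_classification(impl, tokens)
Checks default classification heuristics.

check_lexicons(impl, text)
Checks lexicon matchers for entity type.

classify_entity(impl, arg, confidence)
Classifies an entity sequence using configured rules.

determine_entity_type(impl, text, tokens)
Determines entity type using lexicons, patterns, and heuristics.

find_proper_noun_sequences(tokens, impl)
Finds sequences of consecutive capitalized tokens.

recognize(impl, tokens, opts \\ [])
Recognizes named entities in a list of POS-tagged tokens.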

Callbacks

classification_rules()

@callback classification_rules() :: [{atom(), function()}]

Callback for ordered classification rules. Returns a list of {type, predicate_function} tuples. Each predicate receives a {text, tokens} tuple and returns a boolean.
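
To make the predicate shape concrete, here is a minimal sketch of two rule predicates and the list they would be returned in. The module name RuleSketch and the title and suffix lists are hypothetical and not part of Nasty; a real implementation would draw on language-specific resources.

```elixir
defmodule RuleSketch do
  # Hypothetical word lists, for illustration only.
  @person_titles ~w(Dr. Mr. Mrs. Ms. Prof.)
  @org_suffixes ~w(Inc. Corp. Ltd. GmbH)

  # Each predicate takes one {text, tokens} tuple and returns a boolean.
  def has_person_title?({text, _tokens}) do
    Enum.any?(@person_titles, &String.starts_with?(text, &1 <> " "))
  end

  def has_org_suffix?({text, _tokens}) do
    Enum.any?(@org_suffixes, &String.ends_with?(text, " " <> &1))
  end

  # Same shape as the classification_rules/0 callback: an ordered list
  # of {type, predicate} tuples.
  def classification_rules do
    [
      {:person, &has_person_title?/1},
      {:org, &has_org_suffix?/1}
    ]
  end
end
```

These predicates only inspect the text; the tokens element is available for rules that need POS tags or other token-level information.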

default_classification(list)

(optional)
@callback default_classification([Nasty.AST.Token.t()]) :: atom() | nil

Callback for default classification heuristics (optional). Receives the tokens of a candidate sequence and returns an entity type or nil.
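
As an illustration of what a fallback heuristic might look like, the sketch below guesses :person for two- or three-token sequences and gives up otherwise. The heuristic itself and the module name DefaultSketch are assumptions made for this example; tokens are modelled as plain maps rather than Nasty.AST.Token structs.

```elixir
defmodule DefaultSketch do
  # Hypothetical heuristic: short multi-token sequences are guessed to be
  # personal names. Anything else is left unclassified (nil).
  def default_classification(tokens) when length(tokens) in 2..3, do: :person
  def default_classification(_tokens), do: nil
end
```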

excluded_pos_tags()

@callback excluded_pos_tags() :: [atom()]

Callback for POS tags to exclude when finding entity sequences.

lexicon_matchers()

(optional)
@callback lexicon_matchers() :: %{required(atom()) => function()}

Callback for lexicon matchers (optional). Returns a map of entity_type => matcher_function. Each matcher function receives the entity text and returns a boolean.
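
A lexicon matcher can be as simple as a membership test against a word list. The sketch below assumes tiny in-module lists; the module name LexiconSketch and its entries are hypothetical.

```elixir
defmodule LexiconSketch do
  # Tiny illustrative lexicons. Real ones would likely be loaded from
  # language resources rather than hard-coded.
  @places ["Paris", "Berlin", "New York"]
  @orgs ["United Nations", "Red Cross"]

  # Matchers take the entity text and return a boolean.
  def place_name?(text), do: text in @places
  def organization_name?(text), do: text in @orgs

  # Same shape as the lexicon_matchers/0 callback.
  def lexicon_matchers do
    %{gpe: &place_name?/1, org: &organization_name?/1}
  end
end
```

Because Elixir maps do not guarantee iteration order, it seems safest to keep the lexicons mutually exclusive so that a text can match at most one entity type.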

Functions

all_capitalized?(tokens)

@spec all_capitalized?([Nasty.AST.Token.t()]) :: boolean()

Checks if all tokens in a sequence are capitalized.

capitalized?(token)

@spec capitalized?(Nasty.AST.Token.t()) :: boolean()

Checks if a token is capitalized.
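
The capitalization checks can be approximated as follows. This is a sketch, not Nasty's implementation: it assumes a token exposes its surface form under a :text key and treats a token as capitalized when lowercasing its first grapheme changes it.

```elixir
defmodule CapSketch do
  # Assumption: the token's surface form lives under :text.
  def capitalized?(%{text: text}) do
    case String.first(text) do
      # Empty text has no first grapheme.
      nil -> false
      # Digits and punctuation are unchanged by downcasing, so they
      # do not count as capitalized.
      first -> first != String.downcase(first)
    end
  end

  def all_capitalized?(tokens), do: Enum.all?(tokens, &capitalized?/1)
end
```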

check_classification_rules(impl, text, tokens)

@spec check_classification_rules(module(), String.t(), [Nasty.AST.Token.t()]) ::
  atom() | nil

Checks classification rules in order.
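
One plausible reading of "in order" is that the first rule whose predicate returns true decides the type. The sketch below implements that reading with Enum.find_value/2; the module RuleOrderSketch and the two sample rules are hypothetical.

```elixir
defmodule RuleOrderSketch do
  # `rules` stands in for the list returned by impl.classification_rules().
  # Returns the type of the first matching rule, or nil if none match.
  def first_match(rules, text, tokens) do
    Enum.find_value(rules, fn {type, predicate} ->
      if predicate.({text, tokens}), do: type
    end)
  end
end

rules = [
  {:person, fn {text, _tokens} -> String.starts_with?(text, "Dr. ") end},
  {:org, fn {text, _tokens} -> String.ends_with?(text, " Institute") end}
]

# Both rules match this text; the earlier one wins.
RuleOrderSketch.first_match(rules, "Dr. Salk Institute", [])
# => :person
```

Under this reading, more specific rules belong earlier in the list.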

check_default_classification(impl, tokens)

@spec check_default_classification(module(), [Nasty.AST.Token.t()]) :: atom() | nil

Checks default classification heuristics.

check_lexicons(impl, text)

@spec check_lexicons(module(), String.t()) :: atom() | nil

Checks lexicon matchers for entity type.

classify_entity(impl, arg, confidence)

@spec classify_entity(
  module(),
  {String.t(), [Nasty.AST.Token.t()], Nasty.AST.Node.span()},
  float()
) ::
  Nasty.AST.Semantic.Entity.t() | nil

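Classifies an entity sequence using configured rules. The second argument is a {text, tokens, span} tuple, the same shape that find_proper_noun_sequences/2 returns. Returns an Entity struct or nil.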

determine_entity_type(impl, text, tokens)

@spec determine_entity_type(module(), String.t(), [Nasty.AST.Token.t()]) ::
  atom() | nil

Determines entity type using lexicons, patterns, and heuristics.

Order of precedence:

  1. Lexicon matchers (if provided)
  2. Classification rules
  3. Default classification (if provided)
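
The precedence above can be sketched as a chain of lookups joined with `||`, where optional callbacks are skipped when the implementation does not export them. This is an illustrative reconstruction, not the library source; PrecedenceSketch and ToyImpl are hypothetical modules.

```elixir
defmodule PrecedenceSketch do
  def determine_entity_type(impl, text, tokens) do
    # function_exported?/3 only sees loaded modules.
    Code.ensure_loaded(impl)

    from_lexicons(impl, text) ||
      from_rules(impl, text, tokens) ||
      from_default(impl, tokens)
  end

  # 1. Lexicon matchers (optional callback).
  defp from_lexicons(impl, text) do
    if function_exported?(impl, :lexicon_matchers, 0) do
      Enum.find_value(impl.lexicon_matchers(), fn {type, matcher} ->
        if matcher.(text), do: type
      end)
    end
  end

  # 2. Classification rules.
  defp from_rules(impl, text, tokens) do
    Enum.find_value(impl.classification_rules(), fn {type, predicate} ->
      if predicate.({text, tokens}), do: type
    end)
  end

  # 3. Default classification (optional callback).
  defp from_default(impl, tokens) do
    if function_exported?(impl, :default_classification, 1) do
      impl.default_classification(tokens)
    end
  end
end

# A toy implementation with a lexicon and one rule, but no default.
defmodule ToyImpl do
  def lexicon_matchers, do: %{gpe: &(&1 == "Georgia")}

  def classification_rules do
    [{:person, fn {text, _tokens} -> String.starts_with?(text, "G") end}]
  end
end

# "Georgia" satisfies both the lexicon and the rule; the lexicon wins.
PrecedenceSketch.determine_entity_type(ToyImpl, "Georgia", [])
# => :gpe
```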

find_proper_noun_sequences(tokens, impl)

@spec find_proper_noun_sequences([Nasty.AST.Token.t()], module()) :: [
  {String.t(), [Nasty.AST.Token.t()], Nasty.AST.Node.span()}
]

Finds sequences of consecutive capitalized tokens.

Groups tokens that:

  • Are capitalized
  • Have a POS tag that is not listed in excluded_pos_tags/0
  • Are consecutive

Returns a list of {text, tokens, span} tuples.
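
The grouping step can be sketched with Enum.chunk_by/2. The example assumes tokens are plain maps with :text and :pos_tag keys (the real Nasty.AST.Token fields may differ) and omits the span element, since the span format is not described here. SequenceSketch is a hypothetical module.

```elixir
defmodule SequenceSketch do
  # Returns {text, tokens} pairs for each run of consecutive candidates.
  def find_sequences(tokens, excluded_tags) do
    tokens
    # Split into runs of candidates and runs of non-candidates.
    |> Enum.chunk_by(&candidate?(&1, excluded_tags))
    # Keep only the candidate runs.
    |> Enum.filter(fn [first | _] -> candidate?(first, excluded_tags) end)
    |> Enum.map(fn group -> {Enum.map_join(group, " ", & &1.text), group} end)
  end

  defp candidate?(%{text: text, pos_tag: tag}, excluded_tags) do
    tag not in excluded_tags and capitalized?(text)
  end

  defp capitalized?(text) do
    case String.first(text) do
      nil -> false
      first -> first != String.downcase(first)
    end
  end
end

tokens = [
  %{text: "The", pos_tag: :det},
  %{text: "New", pos_tag: :propn},
  %{text: "York", pos_tag: :propn},
  %{text: "office", pos_tag: :noun},
  %{text: "hired", pos_tag: :verb},
  %{text: "Ada", pos_tag: :propn}
]

sequences = SequenceSketch.find_sequences(tokens, [:punct, :det, :adp, :verb, :aux])
```

Here "The" is capitalized but excluded by its :det tag, so the first sequence starts at "New"; "New York" and "Ada" come out as two separate sequences.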

recognize(impl, tokens, opts \\ [])

@spec recognize(module(), [Nasty.AST.Token.t()], keyword()) :: [
  Nasty.AST.Semantic.Entity.t()
]

Recognizes named entities in a list of POS-tagged tokens, using impl (a module that implements this behaviour) for the language-specific rules.

Returns a list of Entity structs.