Nasty.Semantic.EntityRecognition.RuleBased behaviour (Nasty v0.3.0)
Language-agnostic rule-based Named Entity Recognition (NER).
Provides a generic framework for rule-based entity recognition that can be configured with language-specific lexicons and patterns. The algorithm:
- Finds sequences of capitalized tokens (potential entities)
- Classifies each sequence using configurable rules
- Returns Entity structs with type, text, tokens, and span
Usage
defmodule MyLanguage.EntityRecognizer do
  @behaviour Nasty.Semantic.EntityRecognition.RuleBased

  @impl true
  def excluded_pos_tags, do: [:punct, :det, :adp, :verb, :aux]

  @impl true
  def classification_rules do
    [
      {:person, &has_person_title?/1},
      {:gpe, &has_location_suffix?/1},
      {:org, &has_org_suffix?/1}
    ]
  end

  @impl true
  def lexicon_matchers do
    %{
      person: &person_name?/1,
      gpe: &place_name?/1,
      org: &organization_name?/1
    }
  end
end
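The helper functions referenced in the example above are not shown in the source. As a minimal sketch, two of them might look like the following, assuming (per the callback descriptions) that rule predicates receive a {text, tokens} tuple and lexicon matchers receive the text only. The module name MyLanguage.EntityHelpers and both word lists are hypothetical and only serve the illustration.

```elixir
defmodule MyLanguage.EntityHelpers do
  # Hypothetical helper module; the word lists are illustrative only.
  @person_titles ~w(Dr Mr Mrs Ms Prof)
  @known_places ["Paris", "New York", "Tokyo"]

  # Rule predicate: receives {text, tokens}. The tokens are ignored here.
  def has_person_title?({text, _tokens}) do
    first_word = text |> String.split() |> List.first("")
    String.trim_trailing(first_word, ".") in @person_titles
  end

  # Lexicon matcher: receives the entity text only.
  def place_name?(text), do: text in @known_places
end
```

In a real recognizer these functions would live in (or be imported into) the module that implements the behaviour, so that captures such as &has_person_title?/1 resolve.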
Callbacks
classification_rules/0: callback for ordered classification rules. Returns a list of {type, predicate_function} tuples. Each predicate receives {text, tokens} and returns a boolean.
@callback default_classification([Nasty.AST.Token.t()]) :: atom() | nil
Callback for default classification heuristics (optional). Receives tokens and returns entity type or nil.
@callback excluded_pos_tags() :: [atom()]
Callback for POS tags to exclude when finding entity sequences.
lexicon_matchers/0: callback for lexicon matchers (optional). Returns a map of entity_type => matcher_function. Each matcher function receives the entity text and returns a boolean.
Functions
@spec all_capitalized?([Nasty.AST.Token.t()]) :: boolean()
Checks if all tokens in a sequence are capitalized.
@spec capitalized?(Nasty.AST.Token.t()) :: boolean()
Checks if a token is capitalized.
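The source does not show how capitalization is tested. One plausible reading, sketched below, treats a token as capitalized when its first grapheme is an uppercase letter. The module name CapitalizationSketch is hypothetical, and the sketch assumes tokens expose their surface form under a :text key, so plain maps stand in for Nasty.AST.Token structs.

```elixir
defmodule CapitalizationSketch do
  # Assumption: each token exposes its surface form under :text.
  def capitalized?(%{text: text}) do
    case String.first(text) do
      # Empty text has no first grapheme.
      nil -> false
      # Uppercase letter: unchanged by upcase, changed by downcase.
      first -> first == String.upcase(first) and first != String.downcase(first)
    end
  end

  # True only when every token in the sequence is capitalized.
  def all_capitalized?(tokens), do: Enum.all?(tokens, &capitalized?/1)
end
```

Digits and punctuation are rejected because they have no distinct lowercase form. Whether the library treats such edge cases the same way is not stated in the source.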
@spec check_classification_rules(module(), String.t(), [Nasty.AST.Token.t()]) :: atom() | nil
Checks classification rules in order.
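Checking rules "in order" can be sketched with Enum.find_value/2, which returns the first truthy result and stops there. This is an illustration of the idea, not the library's implementation. RuleOrderSketch and first_matching_type/3 are hypothetical names, and the two inline rules are made up for the example.

```elixir
defmodule RuleOrderSketch do
  # rules: list of {type, predicate}; each predicate receives {text, tokens}.
  # Returns the type of the first matching rule, or nil if none match.
  def first_matching_type(rules, text, tokens) do
    Enum.find_value(rules, fn {type, predicate} ->
      if predicate.({text, tokens}), do: type
    end)
  end
end

rules = [
  {:person, fn {text, _tokens} -> String.starts_with?(text, "Dr. ") end},
  {:org, fn {text, _tokens} -> String.ends_with?(text, " Inc") end}
]

# Both rules match this text; the earlier rule wins.
RuleOrderSketch.first_matching_type(rules, "Dr. Smith Inc", [])
# => :person
```

Because the list is ordered, more specific rules should come before more general ones.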
@spec check_default_classification(module(), [Nasty.AST.Token.t()]) :: atom() | nil
Checks default classification heuristics.
Checks lexicon matchers for entity type.
@spec classify_entity(
        module(),
        {String.t(), [Nasty.AST.Token.t()], Nasty.AST.Node.span()},
        float()
      ) :: Nasty.AST.Semantic.Entity.t() | nil
Classifies an entity sequence using configured rules.
@spec determine_entity_type(module(), String.t(), [Nasty.AST.Token.t()]) :: atom() | nil
Determines entity type using lexicons, patterns, and heuristics.
Order of precedence:
1. Lexicon matchers (if provided)
2. Classification rules
3. Default classification (if provided)
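The precedence above amounts to a chain of fallbacks: each stage is consulted only if the previous one returned nil. A minimal sketch of that control flow, using zero-arity functions as stand-ins for the three stages (PrecedenceSketch and determine/3 are hypothetical names, not part of the library):

```elixir
defmodule PrecedenceSketch do
  # Each argument is a zero-arity function returning an entity type or nil.
  # `||` short-circuits, so later stages run only when earlier ones yield nil.
  def determine(lexicon_fun, rules_fun, default_fun) do
    lexicon_fun.() || rules_fun.() || default_fun.()
  end
end

PrecedenceSketch.determine(
  fn -> nil end,     # lexicon matchers: no hit
  fn -> :org end,    # classification rules: match
  fn -> :person end  # default classification: not evaluated
)
# => :org
```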
@spec find_proper_noun_sequences([Nasty.AST.Token.t()], module()) :: [
        {String.t(), [Nasty.AST.Token.t()], Nasty.AST.Node.span()}
      ]
Finds sequences of consecutive capitalized tokens.
Groups tokens that:
- Are capitalized
- Do not carry one of the excluded POS tags
- Are consecutive
Returns a list of {text, tokens, span} tuples.
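The grouping can be sketched with Enum.chunk_by/2: split the token list wherever the "capitalized and not excluded" test changes value, then keep only the runs that pass it. This is a simplified illustration under stated assumptions, not the library's code. Tokens are plain maps with :text and :pos_tag keys (an assumption), the span element is omitted, and SequenceSketch is a hypothetical name.

```elixir
defmodule SequenceSketch do
  # Assumption: tokens are maps with :text and :pos_tag keys.
  # Returns {text, tokens} pairs; the span element is left out of this sketch.
  def find_sequences(tokens, excluded_tags) do
    tokens
    # Split into runs where the candidate test gives the same answer.
    |> Enum.chunk_by(&candidate?(&1, excluded_tags))
    # Keep only the runs made of candidate tokens.
    |> Enum.filter(fn [first | _rest] -> candidate?(first, excluded_tags) end)
    |> Enum.map(fn run -> {Enum.map_join(run, " ", & &1.text), run} end)
  end

  # A candidate starts with an uppercase letter and has no excluded POS tag.
  defp candidate?(%{text: text, pos_tag: tag}, excluded_tags) do
    first = String.first(text) || ""
    first != String.downcase(first) and tag not in excluded_tags
  end
end

tokens = [
  %{text: "The", pos_tag: :det},
  %{text: "New", pos_tag: :propn},
  %{text: "York", pos_tag: :propn},
  %{text: "office", pos_tag: :noun},
  %{text: "of", pos_tag: :adp},
  %{text: "Acme", pos_tag: :propn}
]

SequenceSketch.find_sequences(tokens, [:det, :adp])
|> Enum.map(&elem(&1, 0))
# => ["New York", "Acme"]
```

Note how "The" is capitalized but dropped because its POS tag is excluded, which is the reason the excluded_pos_tags callback exists.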
@spec recognize(module(), [Nasty.AST.Token.t()], keyword()) :: [Nasty.AST.Semantic.Entity.t()]
Recognizes named entities in a list of POS-tagged tokens.
Returns a list of Entity structs.