Nasty.Statistics.SequenceLabeling.Features (Nasty v0.3.0)

Feature extraction for sequence labeling tasks (NER, POS tagging, etc.).

Extracts string-valued features from tokens, covering lexical, orthographic, POS, contextual, affix, gazetteer-based, and pattern features.

Feature Types

Lexical: word, lowercased, lemma
Orthographic: capitalization, shape, digits
POS: part-of-speech tags
Context: surrounding words and POS tags
Affixes: prefixes and suffixes
Gazetteers: matches in entity lists
Patterns: special character patterns
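
As an illustration of the orthographic category, here is a minimal sketch of how capitalization, digit, and shape features could be computed. The module `OrthographicSketch` and the feature names `has_digit` and `shape` are assumptions made for this example and are not taken from Nasty; only the `capitalized=true` format appears in the example output on this page.

```elixir
defmodule OrthographicSketch do
  # Hypothetical helper, not part of Nasty.
  # Maps each character to a class: X = uppercase letter, x = lowercase letter,
  # d = digit. Any other character is kept unchanged.
  def shape(word) do
    word
    |> String.graphemes()
    |> Enum.map_join(&char_class/1)
  end

  # Builds a small list of "name=value" feature strings for one word.
  def features(word) do
    [
      "capitalized=#{capitalized?(word)}",
      "has_digit=#{String.match?(word, ~r/\d/)}",
      "shape=#{shape(word)}"
    ]
  end

  defp char_class(g) do
    cond do
      g =~ ~r/^\p{Lu}$/u -> "X"
      g =~ ~r/^\p{Ll}$/u -> "x"
      g =~ ~r/^\d$/ -> "d"
      true -> g
    end
  end

  defp capitalized?(word), do: String.match?(word, ~r/^\p{Lu}/u)
end
```

Under these assumptions, `OrthographicSketch.shape("John")` yields `"Xxxx"`, so names and other capitalized words share a shape regardless of their spelling.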

Examples

iex> alias Nasty.AST.Token
iex> alias Nasty.Statistics.SequenceLabeling.Features
iex> token = %Token{text: "John", pos_tag: :propn, lemma: "John"}
iex> context = %{prev_word: "Mr.", next_word: "Smith", position: 1}
iex> features = Features.extract(token, context)
["word=john", "pos=PROPN", "capitalized=true", "prefix-2=Jo", ...]

Types

context()

@type context() :: %{
  optional(:prev_word) => String.t(),
  optional(:next_word) => String.t(),
  optional(:prev_pos) => atom(),
  optional(:next_pos) => atom(),
  optional(:prev_label) => atom(),
  optional(:position) => non_neg_integer(),
  optional(:sequence_length) => non_neg_integer()
}

feature()

@type feature() :: String.t()

feature_vector()

@type feature_vector() :: [feature()]

Functions

extract(token, context \\ %{}, opts \\ [])

@spec extract(Nasty.AST.Token.t(), context(), keyword()) :: feature_vector()

Extracts features from a token given its context.

Parameters

token - Token to extract features from
context - Map of contextual information; see context() for the supported keys (surrounding words and POS tags, previous label, position, sequence length)
opts - Options:
- :use_gazetteers - Enable gazetteer features (default: true)
- :max_affix_length - Maximum prefix/suffix length (default: 4)

Returns

List of feature strings
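
To make the `:max_affix_length` option concrete, the following is a minimal sketch of affix feature generation. `AffixSketch` is a hypothetical module, not Nasty's implementation. The `prefix-N=` format follows the example output on this page; the `suffix-N=` format, starting at length 1, and skipping affixes that would cover the whole word are assumptions.

```elixir
defmodule AffixSketch do
  # Hypothetical stand-in for the affix part of extract/3.
  def affixes(word, opts \\ []) do
    # Same default as documented for :max_affix_length.
    max_len = Keyword.get(opts, :max_affix_length, 4)
    len = String.length(word)

    # Assumption: only emit affixes strictly shorter than the word itself.
    for n <- 1..max_len//1, n < len, kind <- [:prefix, :suffix] do
      case kind do
        :prefix -> "prefix-#{n}=#{String.slice(word, 0, n)}"
        :suffix -> "suffix-#{n}=#{String.slice(word, -n, n)}"
      end
    end
  end
end
```

With this sketch, `AffixSketch.affixes("John", max_affix_length: 2)` produces `["prefix-1=J", "suffix-1=n", "prefix-2=Jo", "suffix-2=hn"]`. A lower limit keeps the feature vector small, while a higher one captures longer morphological cues.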

extract_sequence(tokens, opts \\ [])

@spec extract_sequence(
  [Nasty.AST.Token.t()],
  keyword()
) :: [feature_vector()]

Extracts features for an entire sequence of tokens.

Builds the context() for each token automatically from the surrounding tokens, so no context argument is needed.

Parameters

tokens - List of tokens
opts - Options passed to extract/3

Returns

List of feature vectors, one per token
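
The per-token context building described above might look roughly like the sketch below. It is not Nasty's implementation: `ContextSketch` is a hypothetical module, tokens are plain maps with `:text` and `:pos_tag` keys standing in for `Nasty.AST.Token` structs, and zero-based positions are an assumption.

```elixir
defmodule ContextSketch do
  # Hypothetical: returns one context map per token, in order.
  def build_contexts(tokens) do
    n = length(tokens)
    # Shift the sequence by one in each direction; nil marks a missing neighbor.
    prevs = [nil | tokens]
    nexts = Enum.drop(tokens, 1) ++ [nil]

    [prevs, tokens, nexts]
    |> Enum.zip()
    |> Enum.with_index()
    |> Enum.map(fn {{prev, _tok, next}, i} ->
      %{position: i, sequence_length: n}
      |> put_neighbor(:prev_word, :prev_pos, prev)
      |> put_neighbor(:next_word, :next_pos, next)
    end)
  end

  # The keys are optional in context(), so they are omitted at sequence edges.
  defp put_neighbor(ctx, _word_key, _pos_key, nil), do: ctx

  defp put_neighbor(ctx, word_key, pos_key, tok) do
    ctx |> Map.put(word_key, tok.text) |> Map.put(pos_key, tok.pos_tag)
  end
end
```

Each resulting map has the shape of context() and could be paired with its token in a call to extract/3. The first token's map has no `:prev_word`, and the last token's has no `:next_word`.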