Nasty.Statistics.SequenceLabeling.Features (Nasty v0.3.0)

Feature extraction for sequence labeling tasks (NER, POS tagging, etc.).

Extracts feature representations from tokens, covering lexical, orthographic, POS, contextual, affix, gazetteer, and pattern features.

Feature Types

  1. Lexical: word, lowercased, lemma
  2. Orthographic: capitalization, shape, digits
  3. POS: part-of-speech tags
  4. Context: surrounding words and POS tags
  5. Affixes: prefixes and suffixes
  6. Gazetteers: matches in entity lists
  7. Patterns: special character patterns
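
To make the orthographic category more concrete, here is a minimal sketch of how capitalization, digit, and shape features could be computed for a single word. The module `OrthoSketch`, its function names, and the feature names `has_digit` and `shape` are hypothetical and chosen for illustration; they are not taken from Nasty, whose actual feature names and rules may differ.

```elixir
defmodule OrthoSketch do
  # Hypothetical helper, not part of Nasty: orthographic features for one word.
  def features(word) when is_binary(word) do
    [
      "capitalized=#{capitalized?(word)}",
      "has_digit=#{String.match?(word, ~r/\d/)}",
      "shape=#{shape(word)}"
    ]
  end

  # Maps uppercase letters to "X", lowercase letters to "x", and digits to "d";
  # any other character is kept. Digits are replaced last so that the "d"
  # placeholder is not rewritten by the lowercase rule.
  def shape(word) do
    word
    |> String.replace(~r/\p{Lu}/u, "X")
    |> String.replace(~r/\p{Ll}/u, "x")
    |> String.replace(~r/\d/, "d")
  end

  # True when the first grapheme changes under downcasing, i.e. it is uppercase.
  defp capitalized?(word) do
    case String.first(word) do
      nil -> false
      first -> first != String.downcase(first)
    end
  end
end
```

Under these assumptions, `OrthoSketch.features("John")` yields a capitalization flag, a digit flag, and the shape `Xxxx`, each encoded as a `"name=value"` string like the features in the example below.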

Examples

iex> alias Nasty.AST.Token
iex> alias Nasty.Statistics.SequenceLabeling.Features
iex> token = %Token{text: "John", pos_tag: :propn, lemma: "John"}
iex> context = %{prev_word: "Mr.", next_word: "Smith", position: 1}
iex> features = Features.extract(token, context)
["word=john", "pos=PROPN", "capitalized=true", "prefix-2=Jo", ...]

Summary

Functions

extract(token, context \\ %{}, opts \\ [])

Extracts features from a token given its context.

extract_sequence(tokens, opts \\ [])

Extracts features for an entire sequence of tokens.

Types

context()

@type context() :: %{
  optional(:prev_word) => String.t(),
  optional(:next_word) => String.t(),
  optional(:prev_pos) => atom(),
  optional(:next_pos) => atom(),
  optional(:prev_label) => atom(),
  optional(:position) => non_neg_integer(),
  optional(:sequence_length) => non_neg_integer()
}

feature()

@type feature() :: String.t()

feature_vector()

@type feature_vector() :: [feature()]

Functions

extract(token, context \\ %{}, opts \\ [])

@spec extract(Nasty.AST.Token.t(), context(), keyword()) :: feature_vector()

Extracts features from a token given its context.

Parameters

  • token - Token to extract features from
  • context - Contextual information (surrounding words, position, etc.)
  • opts - Options:
    • :use_gazetteers - Enable gazetteer features (default: true)
    • :max_affix_length - Maximum prefix/suffix length (default: 4)

Returns

A feature_vector(): a list of feature strings such as "word=john".
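
As an illustration of what the :max_affix_length option controls, the sketch below generates prefix and suffix features up to a maximum length. `AffixSketch` is a hypothetical module, not Nasty's implementation. The `prefix-N=` format follows the example output above; the `suffix-N=` format and the choice to skip affixes that would cover the whole word are assumptions.

```elixir
defmodule AffixSketch do
  # Hypothetical helper, not Nasty's implementation: prefix/suffix features for
  # lengths 1..max_len. Lengths that would cover the whole word are skipped
  # (an assumption made for this sketch).
  def features(word, max_len \\ 4) when is_binary(word) do
    len = String.length(word)

    for n <- 1..max_len//1,
        n < len,
        feature <- [
          "prefix-#{n}=#{String.slice(word, 0, n)}",
          "suffix-#{n}=#{String.slice(word, len - n, n)}"
        ] do
      feature
    end
  end
end
```

With these assumptions, `AffixSketch.features("John", 2)` returns `["prefix-1=J", "suffix-1=n", "prefix-2=Jo", "suffix-2=hn"]`. Lowering the maximum length shrinks the feature vector, which is the trade-off the option exposes.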

extract_sequence(tokens, opts \\ [])

@spec extract_sequence(
  [Nasty.AST.Token.t()],
  keyword()
) :: [feature_vector()]

Extracts features for an entire sequence of tokens.

Automatically builds context for each token from surrounding tokens.

Parameters

  • tokens - List of tokens
  • opts - Options passed to extract/3

Returns

A list of feature_vector() values, one per input token.
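
The automatic context building described above can be sketched roughly as follows. `ContextSketch` is a hypothetical module, not Nasty's implementation. It assumes tokens are maps or structs with `:text` and `:pos_tag` fields and that positions are zero-based, which is consistent with `position: 1` for the second word in the module example. The sketch leaves `:prev_label` out, since a label is not available from the tokens alone.

```elixir
defmodule ContextSketch do
  # Hypothetical helper, not Nasty's implementation: builds one context map per
  # token from its neighbours. Tokens are assumed to expose :text and :pos_tag.
  def build(tokens) when is_list(tokens) do
    n = length(tokens)
    # Pad both ends with nil so every token has a (possibly missing) neighbour.
    padded = [nil | tokens] ++ [nil]

    padded
    |> Enum.chunk_every(3, 1, :discard)
    |> Enum.with_index()
    |> Enum.map(fn {[prev, _current, next], i} ->
      %{position: i, sequence_length: n}
      |> put_neighbour(:prev_word, :prev_pos, prev)
      |> put_neighbour(:next_word, :next_pos, next)
    end)
  end

  # Boundary tokens have no neighbour on one side, so those keys are left out,
  # in line with the optional(...) keys of context().
  defp put_neighbour(ctx, _word_key, _pos_key, nil), do: ctx

  defp put_neighbour(ctx, word_key, pos_key, token) do
    ctx
    |> Map.put(word_key, token.text)
    |> Map.put(pos_key, token.pos_tag)
  end
end
```

Each resulting map has the shape of context(), so zipping the tokens with these maps and calling extract/3 on each pair would approximate what extract_sequence/2 is described as doing.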