Nasty.Statistics.FeatureExtractor (Nasty v0.3.0)

Feature extraction utilities for statistical models.

Extracts rich feature representations from tokens for use in machine learning models (HMM, MaxEnt, CRF, etc.).

Feature Types

  • Lexical: Word form, lemma, lowercased form
  • Contextual: Words/POS tags in surrounding window
  • Morphological: Prefixes, suffixes, character n-grams
  • Orthographic: Capitalization patterns, digits, punctuation
  • Positional: Sentence/document position features

Examples

iex> token = %Token{text: "Running", pos_tag: :verb}
iex> FeatureExtractor.extract_lexical(token)
%{word: "Running", lowercase: "running", length: 7}

iex> tokens = [token1, token2, token3]
iex> FeatureExtractor.extract_context(tokens, 1, window: 1)
%{prev_word_1: "The", next_word_1: "cat"}

Summary

Functions

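extract_all(tokens, index, opts \\ [])

Extract all features for a token in context.

extract_context(tokens, index, opts \\ [])

Extract contextual features from surrounding tokens.

extract_lexical(token)

Extract lexical features from a token.

extract_morphological(token, opts \\ [])

Extract morphological features from a token.

extract_orthographic(token)

Extract orthographic features from a token.

extract_positional(tokens, index)

Extract positional features for a token.

extract_sequence(tokens, opts \\ [])

Extract features for an entire sequence of tokens.

to_binary_features(features)

Convert feature map to a list of binary feature indicators.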

Functions

extract_all(tokens, index, opts \\ [])

@spec extract_all([Nasty.AST.Token.t()], non_neg_integer(), keyword()) :: map()

Extract all features for a token in context.

Combines lexical, morphological, orthographic, and contextual features.

Parameters

  • tokens - List of all tokens in the sequence
  • index - Index of the target token
  • opts - Options
    • :window - Context window size (default: 2)
    • :ngram_size - Character n-gram size (default: 3)

Returns

  • Feature map for the token
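
The combination step can be pictured as a merge of the per-category feature maps. The following is a minimal sketch, not the library's implementation: `CombineSketch` and `combine/1` are hypothetical names, and the input maps are hand-written stand-ins for the output of the individual extractors.

```elixir
defmodule CombineSketch do
  # Hypothetical helper: folds a list of feature maps into a single map.
  # On duplicate keys, maps later in the list win.
  def combine(feature_maps) do
    Enum.reduce(feature_maps, %{}, &Map.merge(&2, &1))
  end
end

# Hand-written stand-ins for the per-category extractor results.
lexical = %{word: "cat", lowercase: "cat", length: 3}
orthographic = %{is_capitalized: false, is_all_lower: true}
context = %{prev_word_1: "The", next_word_1: "sat"}

features = CombineSketch.combine([lexical, orthographic, context])
IO.inspect(features)
```

Because the documented feature keys do not overlap between categories, the merge order should not matter for them.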

extract_context(tokens, index, opts \\ [])

@spec extract_context([Nasty.AST.Token.t()], non_neg_integer(), keyword()) :: map()

Extract contextual features from surrounding tokens.

Features

  • :prev_word_N - Word N positions before (for N in 1..window)
  • :next_word_N - Word N positions after (for N in 1..window)
  • :prev_pos_N - POS tag N positions before (if available)
  • :next_pos_N - POS tag N positions after (if available)

Options

  • :window - Context window size (default: 2)

Examples

iex> tokens = [token1, token2, token3]
iex> extract_context(tokens, 1, window: 1)
%{prev_word_1: "The", next_word_1: "cat"}
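
The windowing described above could be sketched as follows. This is an illustration under stated assumptions rather than the library's code: `ContextSketch` is a hypothetical module, tokens are plain maps with `:text` and optional `:pos_tag` keys standing in for `Nasty.AST.Token` structs, and positions that fall outside the sequence are assumed to be skipped rather than padded.

```elixir
defmodule ContextSketch do
  # Builds prev_*/next_* features for each offset n in 1..window.
  def context(tokens, index, opts \\ []) do
    window = Keyword.get(opts, :window, 2)

    Enum.reduce(1..window//1, %{}, fn n, acc ->
      acc
      |> put_neighbor("prev", n, tokens, index - n)
      |> put_neighbor("next", n, tokens, index + n)
    end)
  end

  # Enum.at/2 treats negative indices as offsets from the end,
  # so positions before the start are rejected explicitly.
  defp put_neighbor(acc, _dir, _n, _tokens, pos) when pos < 0, do: acc

  defp put_neighbor(acc, dir, n, tokens, pos) do
    case Enum.at(tokens, pos) do
      # Assumption: positions past the end contribute no feature.
      nil ->
        acc

      token ->
        # Atoms are created dynamically, but their number is bounded by the window size.
        acc = Map.put(acc, :"#{dir}_word_#{n}", token.text)

        case Map.get(token, :pos_tag) do
          nil -> acc
          tag -> Map.put(acc, :"#{dir}_pos_#{n}", tag)
        end
    end
  end
end

tokens = [
  %{text: "The", pos_tag: :det},
  %{text: "fat", pos_tag: :adj},
  %{text: "cat", pos_tag: :noun}
]

middle = ContextSketch.context(tokens, 1, window: 1)
first = ContextSketch.context(tokens, 0, window: 1)
```

In this sketch, `first` contains only `next_*` keys because there is no token before index 0.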

extract_lexical(token)

@spec extract_lexical(Nasty.AST.Token.t()) :: map()

Extract lexical features from a token.

Features

  • :word - Original word form
  • :lowercase - Lowercased form
  • :length - Word length

Examples

iex> token = %Token{text: "Running"}
iex> FeatureExtractor.extract_lexical(token)
%{word: "Running", lowercase: "running", length: 7}
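
For illustration, the three lexical features can be computed with standard `String` functions. The sketch below is an assumption about how such a map might be built, not the library's source; `LexicalSketch` is a hypothetical module that takes the raw text instead of a token struct.

```elixir
defmodule LexicalSketch do
  # Builds the three documented lexical features from raw text.
  def lexical(text) do
    %{
      word: text,
      lowercase: String.downcase(text),
      # String.length/1 counts graphemes, not bytes.
      length: String.length(text)
    }
  end
end

features = LexicalSketch.lexical("Running")
accented = LexicalSketch.lexical("naïve")
```

Whether the library measures `:length` in graphemes or bytes is not stated here; the two differ for non-ASCII text such as "naïve".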

extract_morphological(token, opts \\ [])

@spec extract_morphological(
  Nasty.AST.Token.t(),
  keyword()
) :: map()

Extract morphological features from a token.

Features

  • :prefix_N - First N characters (for N in 1..4)
  • :suffix_N - Last N characters (for N in 1..4)
  • :contains_hyphen - Boolean
  • :contains_digit - Boolean

Options

  • :ngram_size - Maximum n-gram size (default: 3)
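
The prefix and suffix features can be sketched with `String.slice/3`. This is a hedged illustration: `MorphSketch` and its `max_n` argument are hypothetical, the upper bound of 4 follows the feature list above, and affixes longer than the word are assumed to be omitted.

```elixir
defmodule MorphSketch do
  # Builds prefix_N / suffix_N for N in 1..max_n, plus two boolean flags.
  def morphological(text, max_n \\ 4) do
    len = String.length(text)

    affixes =
      for n <- 1..max_n//1, n <= len, reduce: %{} do
        acc ->
          acc
          |> Map.put(:"prefix_#{n}", String.slice(text, 0, n))
          |> Map.put(:"suffix_#{n}", String.slice(text, len - n, n))
      end

    Map.merge(affixes, %{
      contains_hyphen: String.contains?(text, "-"),
      contains_digit: String.match?(text, ~r/\d/)
    })
  end
end

features = MorphSketch.morphological("running")
short = MorphSketch.morphological("x-1")
```

For a three-character token such as "x-1", this sketch emits affixes only up to length 3.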

extract_orthographic(token)

@spec extract_orthographic(Nasty.AST.Token.t()) :: map()

Extract orthographic features from a token.

Features

  • :is_capitalized - First letter uppercase
  • :is_all_caps - All letters uppercase
  • :is_all_lower - All letters lowercase
  • :has_internal_caps - Mixed case (e.g., "iPhone")
  • :is_numeric - Contains only digits
  • :is_alphanumeric - Contains letters and digits
  • :has_punctuation - Contains punctuation characters

Examples

iex> token = %Token{text: "iPhone"}
iex> extract_orthographic(token)
%{is_capitalized: false, has_internal_caps: true, ...}
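
One possible reading of these flags, expressed with Unicode-aware regular expressions, is sketched below. `OrthoSketch` is a hypothetical module, and the exact definitions (for example, treating `:is_alphanumeric` as "at least one letter and at least one digit") are assumptions rather than the library's documented behavior.

```elixir
defmodule OrthoSketch do
  # Computes the seven documented orthographic flags from raw text.
  def orthographic(text) do
    # Keep only letters so digits and punctuation do not affect the case checks.
    letters = String.replace(text, ~r/[^\p{L}]/u, "")
    has_letters = letters != ""

    %{
      is_capitalized: String.match?(text, ~r/^\p{Lu}/u),
      is_all_caps: has_letters and letters == String.upcase(letters),
      is_all_lower: has_letters and letters == String.downcase(letters),
      # An uppercase letter after the first character, plus at least one lowercase letter.
      has_internal_caps:
        String.match?(text, ~r/.\p{Lu}/u) and String.match?(text, ~r/\p{Ll}/u),
      is_numeric: String.match?(text, ~r/^\d+$/),
      is_alphanumeric: String.match?(text, ~r/\p{L}/u) and String.match?(text, ~r/\d/),
      has_punctuation: String.match?(text, ~r/\p{P}/u)
    }
  end
end

iphone = OrthoSketch.orthographic("iPhone")
year = OrthoSketch.orthographic("2024")
```

Under these definitions, "iPhone" is not capitalized but has internal caps, which agrees with the example above.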

extract_positional(tokens, index)

@spec extract_positional([Nasty.AST.Token.t()], non_neg_integer()) :: map()

Extract positional features for a token.

Features

  • :position - Absolute position in sequence (0-indexed)
  • :relative_position - Position as fraction of sequence length
  • :is_first - Boolean, true if first token
  • :is_last - Boolean, true if last token
  • :distance_from_start - Distance from beginning
  • :distance_from_end - Distance from end
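
The arithmetic behind these features might look like the sketch below. It is an assumption-laden illustration: `PositionSketch` is hypothetical, `:relative_position` is taken to be `index / length`, and `:distance_from_end` is taken to be the number of tokens after the target.

```elixir
defmodule PositionSketch do
  # Computes the documented positional features for the token at `index`.
  def positional(tokens, index) do
    # Guard against division by zero for an empty sequence.
    len = max(length(tokens), 1)

    %{
      position: index,
      relative_position: index / len,
      is_first: index == 0,
      is_last: index == len - 1,
      distance_from_start: index,
      distance_from_end: len - 1 - index
    }
  end
end

tokens = ["The", "cat", "sat", "down"]
pos = PositionSketch.positional(tokens, 1)
```

With four tokens and index 1, this sketch gives a relative position of 0.25 and a distance of 2 from the end. A library could equally normalize by `length - 1` so that the last token maps to 1.0.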

extract_sequence(tokens, opts \\ [])

@spec extract_sequence(
  [Nasty.AST.Token.t()],
  keyword()
) :: [map()]

Extract features for an entire sequence of tokens.

Returns a list of feature maps, one per token.

Examples

iex> tokens = [token1, token2, token3]
iex> features = extract_sequence(tokens)
[%{word: "The", ...}, %{word: "cat", ...}, %{word: "sat", ...}]
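
Conceptually, sequence extraction applies a per-token extractor at every index. The sketch below shows that pattern with a hypothetical `SequenceSketch.sequence/2` that takes the extractor as a function; it assumes nothing about how the library itself iterates.

```elixir
defmodule SequenceSketch do
  # Applies `extract_fun.(tokens, index)` at every position and
  # returns the resulting feature maps in token order.
  def sequence(tokens, extract_fun) do
    tokens
    |> Enum.with_index()
    |> Enum.map(fn {_token, index} -> extract_fun.(tokens, index) end)
  end
end

# Plain maps stand in for Nasty.AST.Token structs.
tokens = [%{text: "The"}, %{text: "cat"}, %{text: "sat"}]

# A toy per-token extractor used only for this example.
toy_extractor = fn toks, i -> %{word: Enum.at(toks, i).text, position: i} end

features = SequenceSketch.sequence(tokens, toy_extractor)
```

Passing the whole token list along with the index lets the per-token extractor look at neighboring tokens, which contextual features require.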

to_binary_features(features)

@spec to_binary_features(map()) :: [String.t()]

Convert feature map to a list of binary feature indicators.

Useful for models that expect binary feature vectors.

Examples

iex> features = %{word: "cat", is_capitalized: true, length: 3}
iex> to_binary_features(features)
["word=cat", "is_capitalized=true", "length=3"]
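
A conversion of this kind can be sketched by formatting each key-value pair as a `"key=value"` string. `BinarySketch` is a hypothetical module; the sorting step is an assumption added to make the output order deterministic, since Elixir maps do not guarantee the insertion order shown in the example above.

```elixir
defmodule BinarySketch do
  # Turns each feature into a "key=value" indicator string.
  def to_binary(features) do
    features
    |> Enum.map(fn {key, value} -> "#{key}=#{value}" end)
    # Sorting is an assumption of this sketch, not documented behavior.
    |> Enum.sort()
  end
end

features = %{word: "cat", is_capitalized: true, length: 3}
indicators = BinarySketch.to_binary(features)
# indicators == ["is_capitalized=true", "length=3", "word=cat"]
```

Each resulting string can then be treated as a feature that is either present or absent for a given token.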