Nasty.Statistics.FeatureExtractor (Nasty v0.3.0)

Feature extraction utilities for statistical models.

Extracts rich feature representations from tokens for use in machine learning models (HMM, MaxEnt, CRF, etc.).

Feature Types

Lexical: Word form, lemma, lowercased form
Contextual: Words/POS tags in surrounding window
Morphological: Prefixes, suffixes, character n-grams
Orthographic: Capitalization patterns, digits, punctuation
Positional: Sentence/document position features

Examples

iex> token = %Token{text: "Running", pos_tag: :verb}
iex> FeatureExtractor.extract_lexical(token)
%{word: "Running", lowercase: "running", length: 7}

iex> tokens = [token1, token2, token3]
iex> FeatureExtractor.extract_context(tokens, 1, window: 1)
%{prev_word_1: "The", next_word_1: "cat"}

Summary

Functions

extract_all(tokens, index, opts \\ [])

Extract all features for a token in context.

extract_context(tokens, index, opts \\ [])

Extract contextual features from surrounding tokens.

extract_lexical(token)

Extract lexical features from a token.

extract_morphological(token, opts \\ [])

Extract morphological features from a token.

extract_orthographic(token)

Extract orthographic features from a token.

extract_positional(tokens, index)

Extract positional features for a token.

extract_sequence(tokens, opts \\ [])

Extract features for an entire sequence of tokens.

to_binary_features(features)

Convert a feature map to a list of binary feature indicators.

Functions

extract_all(tokens, index, opts \\ [])

@spec extract_all([Nasty.AST.Token.t()], non_neg_integer(), keyword()) :: map()

Extract all features for a token in context.

Combines lexical, morphological, orthographic, and contextual features.

Parameters

tokens - List of all tokens in the sequence
index - Index of the target token
opts - Options:
  - :window - Context window size (default: 2)
  - :ngram_size - Maximum character n-gram size (default: 3)

Returns

Feature map for the token
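
As a rough illustration of how several per-category feature maps can be combined into one, the following sketch merges a list of maps with Map.merge/2. The module name CombineSketch and the sample maps are hypothetical and are not part of Nasty; how extract_all/3 actually merges its results is not stated here.

```elixir
defmodule CombineSketch do
  # Hypothetical helper, not Nasty's implementation.
  # Merges a list of feature maps into one; on key collisions the later map wins.
  def combine(feature_maps) do
    Enum.reduce(feature_maps, %{}, fn features, acc -> Map.merge(acc, features) end)
  end
end

lexical = %{word: "cat", lowercase: "cat", length: 3}
orthographic = %{is_capitalized: false, is_numeric: false}
context = %{prev_word_1: "The", next_word_1: "sat"}

combined = CombineSketch.combine([lexical, orthographic, context])
# combined has 7 keys: the union of the three input maps
```

Because the feature names in each category are distinct, a plain merge loses nothing; if two extractors ever emitted the same key, the order of the list would decide which value survives.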

extract_context(tokens, index, opts \\ [])

@spec extract_context([Nasty.AST.Token.t()], non_neg_integer(), keyword()) :: map()

Extract contextual features from surrounding tokens.

Features

:prev_word_N - Word N positions before (for N in 1..window)
:next_word_N - Word N positions after (for N in 1..window)
:prev_pos_N - POS tag N positions before (if available)
:next_pos_N - POS tag N positions after (if available)

Options

:window - Context window size (default: 2)

Examples

iex> tokens = [token1, token2, token3]
iex> extract_context(tokens, 1, window: 1)
%{prev_word_1: "The", next_word_1: "cat"}
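
The windowed lookup described above can be sketched as follows. This is a minimal illustration, not Nasty's implementation: the module ContextSketch is hypothetical, and it works on plain maps with :text and optional :pos_tag keys instead of Nasty.AST.Token structs.

```elixir
defmodule ContextSketch do
  # Hypothetical stand-in for extract_context/3 over plain maps.
  def extract(tokens, index, opts \\ []) do
    window = Keyword.get(opts, :window, 2)
    count = length(tokens)

    for n <- 1..window//1, {dir, offset} <- [prev: -n, next: n], reduce: %{} do
      acc ->
        pos = index + offset

        # Enum.at/2 accepts negative indices (counting from the end), so the
        # bounds are checked explicitly before fetching the neighbour.
        if pos >= 0 and pos < count do
          token = Enum.at(tokens, pos)
          # Atoms are built at runtime here; their number is bounded by the window size.
          acc = Map.put(acc, :"#{dir}_word_#{n}", token.text)

          case Map.get(token, :pos_tag) do
            nil -> acc
            tag -> Map.put(acc, :"#{dir}_pos_#{n}", tag)
          end
        else
          acc
        end
    end
  end
end

tokens = [%{text: "The"}, %{text: "cat"}, %{text: "sat"}]
context = ContextSketch.extract(tokens, 1, window: 1)
# context == %{prev_word_1: "The", next_word_1: "sat"}
```

In this sketch, neighbours that fall outside the sequence simply contribute no keys; a real extractor might instead emit padding markers for sequence boundaries.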

extract_lexical(token)

@spec extract_lexical(Nasty.AST.Token.t()) :: map()

Extract lexical features from a token.

Features

:word - Original word form
:lowercase - Lowercased form
:length - Word length

Examples

iex> token = %Token{text: "Running"}
iex> FeatureExtractor.extract_lexical(token)
%{word: "Running", lowercase: "running", length: 7}
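
For readers who want to see how these three features can be derived, here is a minimal sketch. LexicalSketch is a hypothetical module operating on any map with a :text key; it is not the library's code.

```elixir
defmodule LexicalSketch do
  # Hypothetical stand-in for extract_lexical/1.
  def extract(%{text: text}) do
    %{
      word: text,
      lowercase: String.downcase(text),
      # String.length/1 counts Unicode graphemes, not bytes.
      length: String.length(text)
    }
  end
end

lexical = LexicalSketch.extract(%{text: "Running"})
# lexical == %{word: "Running", lowercase: "running", length: 7}
```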

extract_morphological(token, opts \\ [])

@spec extract_morphological(
  Nasty.AST.Token.t(),
  keyword()
) :: map()

Extract morphological features from a token.

Features

:prefix_N - First N characters (for N in 1..4)
:suffix_N - Last N characters (for N in 1..4)
:contains_hyphen - Boolean
:contains_digit - Boolean

Options

:ngram_size - Maximum n-gram size (default: 3)
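
The prefix and suffix features can be sketched as below. Note the assumptions: MorphSketch is a hypothetical module, it takes plain maps with a :text key, and it uses :ngram_size as the upper bound on affix length, which may differ from how Nasty treats that option.

```elixir
defmodule MorphSketch do
  # Hypothetical stand-in for extract_morphological/2.
  def extract(%{text: text}, opts \\ []) do
    # Assumption: :ngram_size caps the affix length.
    max_n = Keyword.get(opts, :ngram_size, 3)
    len = String.length(text)

    affixes =
      # Skip affix lengths longer than the word itself.
      for n <- 1..max_n//1, n <= len, reduce: %{} do
        acc ->
          acc
          |> Map.put(:"prefix_#{n}", String.slice(text, 0, n))
          |> Map.put(:"suffix_#{n}", String.slice(text, len - n, n))
      end

    Map.merge(affixes, %{
      contains_hyphen: String.contains?(text, "-"),
      contains_digit: String.match?(text, ~r/\d/)
    })
  end
end

morph = MorphSketch.extract(%{text: "Running"}, ngram_size: 2)
# morph.prefix_2 == "Ru" and morph.suffix_2 == "ng"
```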

extract_orthographic(token)

@spec extract_orthographic(Nasty.AST.Token.t()) :: map()

Extract orthographic features from a token.

Features

:is_capitalized - First letter uppercase
:is_all_caps - All letters uppercase
:is_all_lower - All letters lowercase
:has_internal_caps - Mixed case (e.g., "iPhone")
:is_numeric - Contains only digits
:is_alphanumeric - Contains letters and digits
:has_punctuation - Contains punctuation characters

Examples

iex> token = %Token{text: "iPhone"}
iex> extract_orthographic(token)
%{is_capitalized: false, has_internal_caps: true, ...}
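
One way to compute flags like these is with Unicode-aware regular expressions. The sketch below is illustrative only: OrthoSketch is a hypothetical module over plain maps, and the exact definitions (for example, treating "internal caps" as an uppercase letter after the first character in a word that also contains a lowercase letter) are assumptions, not Nasty's rules.

```elixir
defmodule OrthoSketch do
  # Hypothetical stand-in for extract_orthographic/1.
  def extract(%{text: text}) do
    # Everything after the first grapheme ("" for empty or one-character input).
    rest = String.slice(text, 1..-1//1)
    has_letter = String.match?(text, ~r/\p{L}/u)

    %{
      is_capitalized: String.match?(text, ~r/\A\p{Lu}/u),
      is_all_caps: has_letter and text == String.upcase(text),
      is_all_lower: has_letter and text == String.downcase(text),
      # Assumption: an uppercase letter past position 0 plus at least one lowercase letter.
      has_internal_caps:
        String.match?(rest, ~r/\p{Lu}/u) and String.match?(text, ~r/\p{Ll}/u),
      is_numeric: String.match?(text, ~r/\A\d+\z/),
      is_alphanumeric: has_letter and String.match?(text, ~r/\d/),
      has_punctuation: String.match?(text, ~r/\p{P}/u)
    }
  end
end

ortho = OrthoSketch.extract(%{text: "iPhone"})
# ortho.is_capitalized == false and ortho.has_internal_caps == true
```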

extract_positional(tokens, index)

@spec extract_positional([Nasty.AST.Token.t()], non_neg_integer()) :: map()

Extract positional features for a token.

Features

:position - Absolute position in sequence (0-indexed)
:relative_position - Position as fraction of sequence length
:is_first - Boolean, true if first token
:is_last - Boolean, true if last token
:distance_from_start - Distance from beginning
:distance_from_end - Distance from end
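
The positional features follow directly from the index and the sequence length. In the sketch below, PositionSketch is a hypothetical module, and :relative_position is assumed to be index divided by length; the library may normalise differently (for example, by length minus one).

```elixir
defmodule PositionSketch do
  # Hypothetical stand-in for extract_positional/2.
  def extract(tokens, index) when tokens != [] do
    count = length(tokens)

    %{
      position: index,
      # Assumption: fraction of the sequence length, in the range [0, 1).
      relative_position: index / count,
      is_first: index == 0,
      is_last: index == count - 1,
      distance_from_start: index,
      distance_from_end: count - 1 - index
    }
  end
end

positional = PositionSketch.extract(["The", "cat", "sat", "down"], 1)
# positional.relative_position == 0.25 and positional.distance_from_end == 2
```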

extract_sequence(tokens, opts \\ [])

@spec extract_sequence(
  [Nasty.AST.Token.t()],
  keyword()
) :: [map()]

Extract features for an entire sequence of tokens.

Returns a list of feature maps, one per token.

Examples

iex> tokens = [token1, token2, token3]
iex> features = extract_sequence(tokens)
[%{word: "The", ...}, %{word: "cat", ...}, %{word: "sat", ...}]
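
Conceptually, sequence extraction applies a per-token extractor at every index. The sketch below shows that pattern with a caller-supplied function; SequenceSketch and the extractor argument are hypothetical and only stand in for whatever extract_sequence/2 does internally.

```elixir
defmodule SequenceSketch do
  # Hypothetical helper: `extractor` is any function (tokens, index, opts) -> map.
  def extract(tokens, extractor, opts \\ []) do
    tokens
    |> Enum.with_index()
    |> Enum.map(fn {_token, index} -> extractor.(tokens, index, opts) end)
  end
end

tokens = [%{text: "The"}, %{text: "cat"}, %{text: "sat"}]

# A toy extractor that records only the word and its position.
toy_extractor = fn toks, index, _opts ->
  %{word: Enum.at(toks, index).text, position: index}
end

sequence_features = SequenceSketch.extract(tokens, toy_extractor)
# sequence_features is a list of three maps, one per token, in input order
```

Passing the whole token list to the extractor, not just the current token, is what lets contextual and positional features be computed for each position.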

to_binary_features(features)

@spec to_binary_features(map()) :: [String.t()]

Convert a feature map to a list of binary feature indicators.

Useful for models that expect binary feature vectors.

Examples

iex> features = %{word: "cat", is_capitalized: true, length: 3}
iex> to_binary_features(features)
["word=cat", "is_capitalized=true", "length=3"]
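
A conversion of this kind can be sketched as a map over the key/value pairs. BinarySketch is a hypothetical module, not Nasty's code. It sorts the result because map iteration order in Elixir is not guaranteed, so the order of indicators should not be relied on without sorting.

```elixir
defmodule BinarySketch do
  # Hypothetical stand-in for to_binary_features/1.
  def to_binary(features) do
    features
    |> Enum.map(fn {key, value} -> "#{key}=#{value}" end)
    # Sorting gives a deterministic order regardless of map internals.
    |> Enum.sort()
  end
end

indicators = BinarySketch.to_binary(%{word: "cat", is_capitalized: true, length: 3})
# indicators == ["is_capitalized=true", "length=3", "word=cat"]
```

Each string acts as the name of a feature that is "on" for this token, which is the representation MaxEnt- and CRF-style models typically index into a sparse binary vector.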