Nasty.Statistics.FeatureExtractor (Nasty v0.3.0)
Feature extraction utilities for statistical models.
Extracts rich feature representations from tokens for use in machine learning models (HMM, MaxEnt, CRF, etc.).
Feature Types
- Lexical: Word form, lemma, lowercased form
- Contextual: Words/POS tags in surrounding window
- Morphological: Prefixes, suffixes, character n-grams
- Orthographic: Capitalization patterns, digits, punctuation
- Positional: Sentence/document position features
Examples
iex> token = %Token{text: "Running", pos_tag: :verb}
iex> FeatureExtractor.extract_lexical(token)
%{word: "Running", lowercase: "running", length: 7}
iex> tokens = [token1, token2, token3]
iex> FeatureExtractor.extract_context(tokens, 1, window: 1)
%{prev_word_1: "The", next_word_1: "cat"}
Summary
Functions
- extract_all: Extract all features for a token in context.
- extract_context: Extract contextual features from surrounding tokens.
- extract_lexical: Extract lexical features from a token.
- extract_morphological: Extract morphological features from a token.
- extract_orthographic: Extract orthographic features from a token.
- extract_positional: Extract positional features for a token.
- extract_sequence: Extract features for an entire sequence of tokens.
- to_binary_features: Convert feature map to a list of binary feature indicators.
Functions
@spec extract_all([Nasty.AST.Token.t()], non_neg_integer(), keyword()) :: map()
Extract all features for a token in context.
Combines lexical, morphological, orthographic, and contextual features.
Parameters
- tokens - List of all tokens in the sequence
- index - Index of the target token
- opts - Options
  - :window - Context window size (default: 2)
  - :ngram_size - Character n-gram size (default: 3)
Returns
- Feature map for the token
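The combination step can be pictured as merging several per-group feature maps into one. The following is a minimal sketch under stated assumptions, not Nasty's implementation: `CombineSketch`, `combine/3`, and the two anonymous extractors are hypothetical names, and plain word strings stand in for `Nasty.AST.Token` structs.

```elixir
defmodule CombineSketch do
  # Hypothetical helper, not part of Nasty.
  # Each extractor is a function (words, index) -> map; the maps are merged
  # left to right, so a later extractor wins if two share a key.
  def combine(words, index, extractors) do
    Enum.reduce(extractors, %{}, fn extractor, acc ->
      Map.merge(acc, extractor.(words, index))
    end)
  end
end

# Stand-in for the lexical feature group.
lexical = fn words, i ->
  word = Enum.at(words, i)
  %{word: word, lowercase: String.downcase(word), length: String.length(word)}
end

# Stand-in for part of the positional feature group.
positional = fn words, i ->
  %{is_first: i == 0, is_last: i == length(words) - 1}
end

IO.inspect(CombineSketch.combine(["The", "cat"], 0, [lexical, positional]))
```

Keeping each feature group in its own function makes it possible to switch groups on or off per model without touching the merge step.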
@spec extract_context([Nasty.AST.Token.t()], non_neg_integer(), keyword()) :: map()
Extract contextual features from surrounding tokens.
Features
- :prev_word_N - Word N positions before (for N in 1..window)
- :next_word_N - Word N positions after (for N in 1..window)
- :prev_pos_N - POS tag N positions before (if available)
- :next_pos_N - POS tag N positions after (if available)
Options
- :window - Context window size (default: 2)
Examples
iex> tokens = [token1, token2, token3]
iex> extract_context(tokens, 1, window: 1)
%{prev_word_1: "The", next_word_1: "cat"}
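To make the key naming concrete, here is a minimal sketch of windowed context extraction over a plain list of word strings. `ContextSketch` and `context/3` are hypothetical and not part of Nasty. The sketch assumes that positions falling outside the sequence are simply omitted; the library may instead pad them with boundary markers. POS-tag keys are left out.

```elixir
defmodule ContextSketch do
  # Hypothetical helper, not part of Nasty.
  # Builds :prev_word_N / :next_word_N keys for N in 1..window.
  # Assumption: out-of-range positions are skipped rather than padded.
  def context(words, index, opts \\ []) do
    window = Keyword.get(opts, :window, 2)
    last = length(words) - 1

    for n <- 1..window//1,
        {label, pos} <- [{"prev", index - n}, {"next", index + n}],
        pos >= 0 and pos <= last,
        into: %{} do
      {String.to_atom("#{label}_word_#{n}"), Enum.at(words, pos)}
    end
  end
end

IO.inspect(ContextSketch.context(["The", "black", "cat"], 1, window: 1))
```

With the default window of 2 and the same three words, the result is unchanged, because positions -1 and 3 do not exist.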
@spec extract_lexical(Nasty.AST.Token.t()) :: map()
Extract lexical features from a token.
Features
- :word - Original word form
- :lowercase - Lowercased form
- :length - Word length
Examples
iex> token = %Token{text: "Running"}
iex> FeatureExtractor.extract_lexical(token)
%{word: "Running", lowercase: "running", length: 7}
@spec extract_morphological(Nasty.AST.Token.t(), keyword()) :: map()
Extract morphological features from a token.
Features
- :prefix_N - First N characters (for N in 1..4)
- :suffix_N - Last N characters (for N in 1..4)
- :contains_hyphen - Boolean
- :contains_digit - Boolean
Options
- :ngram_size - Maximum n-gram size (default: 3)
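As an illustration of the affix features, the sketch below derives prefix and suffix keys from a plain string. `AffixSketch` and `affixes/2` are hypothetical names, not Nasty's API. It assumes that affixes longer than the word are skipped, and it does not model how :ngram_size interacts with the 1..4 range.

```elixir
defmodule AffixSketch do
  # Hypothetical helper, not part of Nasty.
  # Assumption: an affix of length N is only emitted when the word has at
  # least N characters.
  def affixes(word, max_n \\ 4) do
    len = String.length(word)

    base =
      for n <- 1..max_n//1, n <= len, reduce: %{} do
        acc ->
          acc
          |> Map.put(String.to_atom("prefix_#{n}"), String.slice(word, 0, n))
          |> Map.put(String.to_atom("suffix_#{n}"), String.slice(word, len - n, n))
      end

    Map.merge(base, %{
      contains_hyphen: String.contains?(word, "-"),
      contains_digit: String.match?(word, ~r/\d/)
    })
  end
end

IO.inspect(AffixSketch.affixes("co-op2"))
```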
@spec extract_orthographic(Nasty.AST.Token.t()) :: map()
Extract orthographic features from a token.
Features
- :is_capitalized - First letter uppercase
- :is_all_caps - All letters uppercase
- :is_all_lower - All letters lowercase
- :has_internal_caps - Mixed case (e.g., "iPhone")
- :is_numeric - Contains only digits
- :is_alphanumeric - Contains letters and digits
- :has_punctuation - Contains punctuation characters
Examples
iex> token = %Token{text: "iPhone"}
iex> extract_orthographic(token)
%{is_capitalized: false, has_internal_caps: true, ...}
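One way to compute flags like these is with Unicode-aware regular expressions. The sketch below is an assumption about how each flag could be defined, not Nasty's implementation. `OrthoSketch` and `shape/1` are hypothetical, and the exact definitions (for instance, treating "mixed case" as an uppercase letter after the first character plus at least one lowercase letter) are choices made for this example.

```elixir
defmodule OrthoSketch do
  # Hypothetical helper, not part of Nasty. Works on a plain string.
  def shape(text) do
    has_letter = String.match?(text, ~r/\p{L}/u)

    %{
      is_capitalized: String.match?(text, ~r/^\p{Lu}/u),
      is_all_caps: has_letter and text == String.upcase(text),
      is_all_lower: has_letter and text == String.downcase(text),
      # Uppercase letter somewhere after the first character, plus at least
      # one lowercase letter, so "NASA" does not count as mixed case.
      has_internal_caps:
        String.match?(text, ~r/.\p{Lu}/u) and String.match?(text, ~r/\p{Ll}/u),
      is_numeric: String.match?(text, ~r/^\d+$/),
      is_alphanumeric: has_letter and String.match?(text, ~r/\d/),
      has_punctuation: String.match?(text, ~r/\p{P}/u)
    }
  end
end

IO.inspect(OrthoSketch.shape("iPhone"))
```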
@spec extract_positional([Nasty.AST.Token.t()], non_neg_integer()) :: map()
Extract positional features for a token.
Features
- :position - Absolute position in sequence (0-indexed)
- :relative_position - Position as fraction of sequence length
- :is_first - Boolean, true if first token
- :is_last - Boolean, true if last token
- :distance_from_start - Distance from beginning
- :distance_from_end - Distance from end
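The positional features depend only on the token's index and the sequence length, so they can be sketched without any token data. `PositionSketch` and `position/2` are hypothetical names. The formula for :relative_position (index divided by length) is an assumption; the library might divide by length minus one so that the last token maps to 1.0.

```elixir
defmodule PositionSketch do
  # Hypothetical helper, not part of Nasty.
  # count is the sequence length, index is the 0-based token position.
  def position(count, index) when count > 0 and index >= 0 and index < count do
    %{
      position: index,
      # Assumption: fraction of the sequence length, so the last token is < 1.0.
      relative_position: index / count,
      is_first: index == 0,
      is_last: index == count - 1,
      distance_from_start: index,
      distance_from_end: count - 1 - index
    }
  end
end

IO.inspect(PositionSketch.position(4, 1))
```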
@spec extract_sequence([Nasty.AST.Token.t()], keyword()) :: [map()]
Extract features for an entire sequence of tokens.
Returns a list of feature maps, one per token.
Examples
iex> tokens = [token1, token2, token3]
iex> features = extract_sequence(tokens)
[%{word: "The", ...}, %{word: "cat", ...}, %{word: "sat", ...}]
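Sequence-level extraction amounts to running a per-token extractor at every index while passing the whole sequence, so that context features stay available. A minimal sketch, assuming plain word strings and a hypothetical `SequenceSketch.map_sequence/2` that is not part of Nasty:

```elixir
defmodule SequenceSketch do
  # Hypothetical helper, not part of Nasty.
  # Calls extractor.(words, index) for every position and keeps input order.
  def map_sequence(words, extractor) do
    words
    |> Enum.with_index()
    |> Enum.map(fn {_word, index} -> extractor.(words, index) end)
  end
end

# Stand-in extractor that only records the word itself.
word_only = fn words, i -> %{word: Enum.at(words, i)} end

IO.inspect(SequenceSketch.map_sequence(["The", "cat", "sat"], word_only))
```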
Convert feature map to a list of binary feature indicators.
Useful for models that expect binary feature vectors.
Examples
iex> features = %{word: "cat", is_capitalized: true, length: 3}
iex> to_binary_features(features)
["word=cat", "is_capitalized=true", "length=3"]
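The conversion can be sketched as formatting each key-value pair as a "key=value" string. `BinarySketch` and `to_indicators/1` are hypothetical names, not Nasty's implementation. The sketch sorts its output as an assumption made here for determinism, because Elixir maps do not guarantee that iteration follows the order in which a literal was written.

```elixir
defmodule BinarySketch do
  # Hypothetical helper, not part of Nasty.
  # Values must implement String.Chars (strings, atoms, numbers, booleans);
  # nested maps or tuples would raise during interpolation.
  def to_indicators(features) do
    features
    |> Enum.map(fn {key, value} -> "#{key}=#{value}" end)
    |> Enum.sort()
  end
end

IO.inspect(BinarySketch.to_indicators(%{word: "cat", is_capitalized: true, length: 3}))
```

A downstream model can then treat every distinct string as one dimension of a sparse binary vector.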