Nasty.Language.English.FeatureExtractor (Nasty v0.3.0)
Extracts classification features from parsed documents.
Supports multiple feature types:
- Bag of Words (BoW): Lemmatized word frequencies
- N-grams: Word sequences (bigrams, trigrams)
- POS patterns: Part-of-speech tag sequences
- Syntactic features: Sentence structure statistics
- Entity features: Named entity type distributions
- Lexical features: Vocabulary richness, sentence length
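To make the first two feature types concrete, here is a minimal sketch of counting bag-of-words and bigram features over a plain token list. The `FeatureSketch` module, its function names, and its stop-word list are hypothetical and are not part of Nasty. The sketch only lowercases tokens (no lemmatization) and assumes n-grams are built after stop words are dropped.

```elixir
defmodule FeatureSketch do
  # Hypothetical stop-word list for illustration; not the library's list.
  @stop_words ~w(the a an on in of and)

  # Lowercases tokens and drops stop words. No lemmatization is done here.
  def content_words(tokens) do
    tokens
    |> Enum.map(&String.downcase/1)
    |> Enum.reject(&(&1 in @stop_words))
  end

  # Counts how often each word occurs.
  def bag_of_words(words), do: Enum.frequencies(words)

  # Counts sliding windows of `n` consecutive words, keyed by tuple.
  def ngrams(words, n \\ 2) do
    words
    |> Enum.chunk_every(n, 1, :discard)
    |> Enum.map(&List.to_tuple/1)
    |> Enum.frequencies()
  end
end

words = FeatureSketch.content_words(~w(The cat sat on the mat))
IO.inspect(FeatureSketch.bag_of_words(words))
IO.inspect(FeatureSketch.ngrams(words, 2))
```

Keying n-grams by tuple keeps the individual words recoverable, at the cost of needing a string encoding before the counts can be used as flat feature names.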
Summary
Functions
- extract/2: Extracts features from a document.
- to_vector/2: Converts a feature map to a sparse vector representation.
Functions
@spec extract( Nasty.AST.Document.t(), keyword() ) :: map()
Extracts features from a document.
Options
- :features - List of feature types to extract (default: [:bow, :ngrams])
  - :bow - Bag of words (lemmatized)
  - :ngrams - Word n-grams
  - :pos_patterns - POS tag sequences
  - :syntactic - Sentence structure features
  - :entities - Entity type features
  - :lexical - Lexical statistics
- :ngram_size - Size of n-grams (default: 2)
- :max_features - Maximum number of features to keep (default: 1000)
- :min_frequency - Minimum frequency threshold (default: 1)
- :include_stop_words - Include stop words in BoW (default: false)
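The following sketch shows one way the :min_frequency and :max_features options could interact when pruning a frequency map. The `PruneSketch` module is hypothetical. It assumes the frequency threshold is applied first and that the most frequent features are the ones kept; the order of operations and the tie-breaking rule are assumptions, since the documentation above does not specify them.

```elixir
defmodule PruneSketch do
  # Drops entries below :min_frequency, then keeps the :max_features
  # most frequent entries. Ties are broken by key order, which is an
  # assumption made for this sketch only.
  def prune(counts, opts \\ []) do
    min_frequency = Keyword.get(opts, :min_frequency, 1)
    max_features = Keyword.get(opts, :max_features, 1000)

    counts
    |> Enum.filter(fn {_feature, count} -> count >= min_frequency end)
    |> Enum.sort_by(fn {feature, count} -> {-count, feature} end)
    |> Enum.take(max_features)
    |> Map.new()
  end
end

counts = %{"cat" => 3, "dog" => 2, "emu" => 1}
IO.inspect(PruneSketch.prune(counts, min_frequency: 2))
IO.inspect(PruneSketch.prune(counts, max_features: 1))
```

With the defaults shown in the options list (a threshold of 1 and a cap of 1000), a small map like this one passes through unchanged.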
Examples
iex> document = parse("The cat sat on the mat.")
iex> features = FeatureExtractor.extract(document, features: [:bow, :ngrams])
%{
  bow: %{"cat" => 1, "sat" => 1, "mat" => 1},
  ngrams: %{{"cat", "sat"} => 1, {"sat", "mat"} => 1}
}
to_vector/2
Converts a feature map to a sparse vector representation.
Useful for machine learning algorithms that expect numeric vectors.
Examples
iex> features = %{bow: %{"cat" => 2, "dog" => 1}}
iex> FeatureExtractor.to_vector(features, [:bow])
%{"bow:cat" => 2, "bow:dog" => 1}
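The flattening in this example can be sketched as follows. `VectorSketch` is a hypothetical module that reproduces the "type:key" naming seen in the output above. It is not the library's implementation, and it handles only string keys, because the documentation does not say how tuple keys from :ngrams are encoded.

```elixir
defmodule VectorSketch do
  # Flattens the selected feature groups into a single map whose keys
  # are "type:key" strings. Groups missing from `features` are skipped.
  # Only string (or other String.Chars) keys are supported in this sketch.
  def to_vector(features, feature_types) do
    for type <- feature_types,
        {key, value} <- Map.get(features, type, %{}),
        into: %{} do
      {"#{type}:#{key}", value}
    end
  end
end

features = %{bow: %{"cat" => 2, "dog" => 1}}
IO.inspect(VectorSketch.to_vector(features, [:bow]))
```

Prefixing each key with its feature type keeps names from different groups from colliding, and storing only the nonzero counts is what makes the representation sparse.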