Nasty.Language.English.FeatureExtractor (Nasty v0.3.0)

Extracts classification features from parsed documents.

Supports multiple feature types:

- Bag of Words (BoW): Lemmatized word frequencies
- N-grams: Word sequences (bigrams, trigrams)
- POS patterns: Part-of-speech tag sequences
- Syntactic features: Sentence structure statistics
- Entity features: Named entity type distributions
- Lexical features: Vocabulary richness, sentence length

Functions

extract(document, opts \\ [])

@spec extract(
  Nasty.AST.Document.t(),
  keyword()
) :: map()

Extracts features from a document.

Options

- :features - List of feature types to extract (default: [:bow, :ngrams])
  - :bow - Bag of words (lemmatized)
  - :ngrams - Word n-grams
  - :pos_patterns - POS tag sequences
  - :syntactic - Sentence structure features
  - :entities - Entity type features
  - :lexical - Lexical statistics
- :ngram_size - Size of n-grams (default: 2)
- :max_features - Maximum number of features to keep (default: 1000)
- :min_frequency - Minimum frequency threshold (default: 1)
- :include_stop_words - Include stop words in BoW (default: false)
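
The interaction between :min_frequency and :max_features is not spelled out here. The following is a minimal sketch of one plausible reading, assuming that low-frequency features are dropped first and the most frequent remaining features are kept. PruneSketch and prune/2 are hypothetical names used only for illustration; they are not part of Nasty, and the library's actual ordering and tie-breaking may differ.

```elixir
defmodule PruneSketch do
  # Hypothetical helper, not part of Nasty.
  # Assumption: filter by minimum frequency first, then cap the feature count.
  def prune(counts, opts \\ []) do
    min_frequency = Keyword.get(opts, :min_frequency, 1)
    max_features = Keyword.get(opts, :max_features, 1000)

    counts
    |> Enum.filter(fn {_feature, count} -> count >= min_frequency end)
    # Highest counts first; ties are broken by feature name so the result is deterministic.
    |> Enum.sort_by(fn {feature, count} -> {-count, feature} end)
    |> Enum.take(max_features)
    |> Map.new()
  end
end

PruneSketch.prune(%{"cat" => 3, "dog" => 1, "mat" => 2}, min_frequency: 2, max_features: 1)
# => %{"cat" => 3}
```

With the defaults (min_frequency of 1, max_features of 1000), a small count map passes through this sketch unchanged.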

Examples

iex> document = parse("The cat sat on the mat.")
iex> features = FeatureExtractor.extract(document, features: [:bow, :ngrams])
%{
  bow: %{"cat" => 1, "sat" => 1, "mat" => 1},
  ngrams: %{{"cat", "sat"} => 1, {"sat", "mat"} => 1}
}
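
To make the shape of the :bow and :ngrams maps concrete, here is a rough sketch of how such counts could be computed from a list of tokens that has already been lemmatized and stripped of stop words. FeatureSketch is a hypothetical module, not part of Nasty, and the token list is supplied by hand rather than taken from a parsed document.

```elixir
defmodule FeatureSketch do
  # Hypothetical helper module, not part of Nasty.

  # Counts how often each token occurs.
  def bow(tokens), do: Enum.frequencies(tokens)

  # Counts contiguous windows of n tokens, each represented as a tuple.
  def ngrams(tokens, n \\ 2) do
    tokens
    |> Enum.chunk_every(n, 1, :discard)
    |> Enum.map(&List.to_tuple/1)
    |> Enum.frequencies()
  end
end

# Assumed input: tokens left after stop-word removal.
tokens = ["cat", "sat", "mat"]

FeatureSketch.bow(tokens)
# => %{"cat" => 1, "mat" => 1, "sat" => 1}

FeatureSketch.ngrams(tokens)
# => %{{"cat", "sat"} => 1, {"sat", "mat"} => 1}
```

Passing 3 as the second argument to ngrams/2 yields trigram counts in the same way.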

to_vector(features, feature_types)

@spec to_vector(map(), [atom()]) :: %{required(String.t()) => number()}

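Converts a feature map to a sparse vector representation: a flat map in which each key is the feature name prefixed with its feature type (for example "bow:cat") and each value is the corresponding number.

Useful for machine learning algorithms that expect numeric vectors.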

Examples

iex> features = %{bow: %{"cat" => 2, "dog" => 1}}
iex> FeatureExtractor.to_vector(features, [:bow])
%{"bow:cat" => 2, "bow:dog" => 1}
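
The flattening shown in this example can be sketched in a few lines. This is an illustration only, assuming that every feature key is a string and that types missing from the feature map are skipped. VectorSketch is a hypothetical module, not Nasty's implementation; in particular, how the library renders tuple keys such as n-grams is not covered here.

```elixir
defmodule VectorSketch do
  # Hypothetical helper, not part of Nasty.
  # Assumption: feature keys are strings, so they can be interpolated directly.
  def to_vector(features, feature_types) do
    for type <- feature_types,
        {key, value} <- Map.get(features, type, %{}),
        into: %{} do
      # Prefix each key with its feature type, e.g. "bow:cat".
      {"#{type}:#{key}", value}
    end
  end
end

features = %{bow: %{"cat" => 2, "dog" => 1}}
VectorSketch.to_vector(features, [:bow])
# => %{"bow:cat" => 2, "bow:dog" => 1}
```

Prefixing keys with the feature type keeps features from different types distinct when they share a name.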