Nasty.Language.English.FeatureExtractor (Nasty v0.3.0)

Extracts classification features from parsed documents.

Supports multiple feature types:

  • Bag of Words (BoW): Lemmatized word frequencies
  • N-grams: Word sequences (bigrams, trigrams)
  • POS patterns: Part-of-speech tag sequences
  • Syntactic features: Sentence structure statistics
  • Entity features: Named entity type distributions
  • Lexical features: Vocabulary richness, sentence length
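The first two feature types can be illustrated with a minimal sketch in plain Elixir. `FeatureSketch` and its functions are hypothetical names, not part of Nasty, and the input is assumed to be a list of already lemmatized tokens with stop words removed; the library's own extraction works on a parsed document and may differ in detail.

```elixir
defmodule FeatureSketch do
  # Hypothetical helper, not part of Nasty: counts tokens and n-grams
  # from an already tokenized, lemmatized list of strings.

  # Bag of words: map each distinct token to its frequency.
  def bow(tokens), do: Enum.frequencies(tokens)

  # N-grams: slide a window of size n over the tokens one step at a time,
  # discard incomplete windows, then count each window as a tuple.
  def ngrams(tokens, n \\ 2) do
    tokens
    |> Enum.chunk_every(n, 1, :discard)
    |> Enum.map(&List.to_tuple/1)
    |> Enum.frequencies()
  end
end

tokens = ["cat", "sit", "mat", "cat", "sit"]
IO.inspect(FeatureSketch.bow(tokens))
IO.inspect(FeatureSketch.ngrams(tokens, 2))
```

Representing each n-gram as a tuple keeps the individual words intact and matches the shape of the `ngrams` map shown in the `extract/2` example.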

Summary

Functions

extract(document, opts \\ [])
Extracts features from a document.

to_vector(features, feature_types)
Converts a feature map to a sparse vector representation.

Functions

extract(document, opts \\ [])

@spec extract(
  Nasty.AST.Document.t(),
  keyword()
) :: map()

Extracts features from a document.

Options

  • :features - List of feature types to extract (default: [:bow, :ngrams])
    • :bow - Bag of words (lemmatized)
    • :ngrams - Word n-grams
    • :pos_patterns - POS tag sequences
    • :syntactic - Sentence structure features
    • :entities - Entity type features
    • :lexical - Lexical statistics
  • :ngram_size - Size of n-grams (default: 2)
  • :max_features - Maximum number of features to keep (default: 1000)
  • :min_frequency - Minimum frequency threshold (default: 1)
  • :include_stop_words - Include stop words in BoW (default: false)
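One plausible reading of how `:min_frequency` and `:max_features` interact is sketched below. `PruneSketch.prune/2` is a hypothetical helper, not part of Nasty. It is an assumption that the frequency filter runs first, that features are kept in order of descending count, and that ties are broken by term order; the library may make other choices.

```elixir
defmodule PruneSketch do
  # Hypothetical helper, not part of Nasty: prunes a frequency map
  # using the same option names and defaults as documented above.
  def prune(counts, opts \\ []) do
    min_frequency = Keyword.get(opts, :min_frequency, 1)
    max_features = Keyword.get(opts, :max_features, 1000)

    counts
    # Drop features that occur fewer than min_frequency times.
    |> Enum.filter(fn {_feature, count} -> count >= min_frequency end)
    # Sort by descending count; ties fall back to term order (an assumption).
    |> Enum.sort_by(fn {feature, count} -> {-count, feature} end)
    # Keep at most max_features entries.
    |> Enum.take(max_features)
    |> Map.new()
  end
end

counts = %{"cat" => 3, "mat" => 1, "sit" => 2, "dog" => 2}
IO.inspect(PruneSketch.prune(counts, min_frequency: 2, max_features: 2))
```

With these inputs, "mat" is removed by the frequency threshold, and "cat" and "dog" survive the cap of two features.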

Examples

iex> # parse/1 stands in for whichever call produces a Nasty.AST.Document
iex> document = parse("The cat sat on the mat.")
iex> FeatureExtractor.extract(document, features: [:bow, :ngrams])
%{
  bow: %{"cat" => 1, "sat" => 1, "mat" => 1},
  ngrams: %{{"cat", "sat"} => 1, {"sat", "mat"} => 1}
}

to_vector(features, feature_types)

@spec to_vector(map(), [atom()]) :: %{required(String.t()) => number()}

Converts a feature map to a sparse vector representation.

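The result is a flat map whose keys combine the feature type and the feature name (for example "bow:cat"), which suits machine learning algorithms that expect numeric input.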

Examples

iex> features = %{bow: %{"cat" => 2, "dog" => 1}}
iex> FeatureExtractor.to_vector(features, [:bow])
%{"bow:cat" => 2, "bow:dog" => 1}
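Many learning algorithms need a dense, fixed-order vector rather than a sparse map. The sketch below shows one way to bridge that gap. `VectorSketch` is a hypothetical module, not part of Nasty: `flatten/2` only imitates the key format shown in the example above for string-keyed features, and the fixed `vocabulary` list is an assumption made for illustration.

```elixir
defmodule VectorSketch do
  # Hypothetical helpers, not part of Nasty.

  # Flattens %{type => %{feature => count}} into %{"type:feature" => count}.
  # Only string feature names are handled; other key shapes are skipped.
  def flatten(features, feature_types) do
    for type <- feature_types,
        {feature, count} <- Map.get(features, type, %{}),
        is_binary(feature),
        into: %{} do
      {"#{type}:#{feature}", count}
    end
  end

  # Turns a sparse map into a dense list ordered by a fixed vocabulary,
  # filling in 0 for features the document does not contain.
  def to_dense(sparse, vocabulary) do
    Enum.map(vocabulary, &Map.get(sparse, &1, 0))
  end
end

features = %{bow: %{"cat" => 2, "dog" => 1}}
sparse = VectorSketch.flatten(features, [:bow])
vocabulary = ["bow:cat", "bow:dog", "bow:mat"]
IO.inspect(VectorSketch.to_dense(sparse, vocabulary))
# prints [2, 1, 0]
```

Keeping the vocabulary in one fixed order across all documents ensures that the same position in every dense vector refers to the same feature.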