Nasty.Statistics.SequenceLabeling.Features (Nasty v0.3.0)
Feature extraction for sequence labeling tasks (NER, POS tagging, etc.).
Extracts feature representations from tokens, covering lexical, orthographic, POS, contextual, affix, gazetteer, and pattern features.
Feature Types
- Lexical: word, lowercased, lemma
- Orthographic: capitalization, shape, digits
- POS: part-of-speech tags
- Context: surrounding words and POS tags
- Affixes: prefixes and suffixes
- Gazetteers: matches in entity lists
- Patterns: special character patterns
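The orthographic features listed above can be illustrated with a minimal sketch. It is not Nasty's implementation: the module `OrthographicSketch`, its functions, and the feature names `has_digit` and `shape` are hypothetical. Only the "name=value" string format and the `word=` and `capitalized=` features are taken from the example output on this page.

```elixir
defmodule OrthographicSketch do
  # Hypothetical helper, not part of Nasty. Derives a few orthographic
  # features from a raw word as "name=value" strings.

  # Maps uppercase letters to "X", lowercase letters to "x", digits to "d",
  # and leaves every other character unchanged.
  def shape(word) do
    word
    |> String.graphemes()
    |> Enum.map_join(fn g ->
      cond do
        g =~ ~r/^\p{Lu}$/u -> "X"
        g =~ ~r/^\p{Ll}$/u -> "x"
        g =~ ~r/^\d$/ -> "d"
        true -> g
      end
    end)
  end

  # Returns the feature list for a single word.
  def features(word) do
    [
      "word=" <> String.downcase(word),
      "capitalized=#{capitalized?(word)}",
      "has_digit=#{String.match?(word, ~r/\d/)}",
      "shape=" <> shape(word)
    ]
  end

  # True when the first character is an uppercase letter.
  defp capitalized?(word), do: String.match?(word, ~r/^\p{Lu}/u)
end
```

Under these assumptions, `OrthographicSketch.shape("B2B")` collapses the word to `"XdX"`, which lets a model generalize across tokens that share a surface pattern.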
Examples
iex> token = %Token{text: "John", pos_tag: :propn, lemma: "John"}
iex> context = %{prev_word: "Mr.", next_word: "Smith", position: 1}
iex> features = Features.extract(token, context)
["word=john", "pos=PROPN", "capitalized=true", "prefix-2=Jo", ...]
Summary
Functions
extract/3 - Extracts features from a token given its context.
extract_sequence/2 - Extracts features for an entire sequence of tokens.
Types
@type context() :: %{
  optional(:prev_word) => String.t(),
  optional(:next_word) => String.t(),
  optional(:prev_pos) => atom(),
  optional(:next_pos) => atom(),
  optional(:prev_label) => atom(),
  optional(:position) => non_neg_integer(),
  optional(:sequence_length) => non_neg_integer()
}
@type feature() :: String.t()
@type feature_vector() :: [feature()]
Functions
@spec extract(Nasty.AST.Token.t(), context(), keyword()) :: feature_vector()
Extracts features from a token given its context.
Parameters
- token - Token to extract features from
- context - Contextual information (surrounding words, position, etc.)
- opts - Options:
  - :use_gazetteers - Enable gazetteer features (default: true)
  - :max_affix_length - Maximum prefix/suffix length (default: 4)
Returns
List of feature strings
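To show what a bound like `:max_affix_length` might control, here is a rough sketch of affix feature generation. The module `AffixSketch` is hypothetical and not part of Nasty. The `suffix-N` naming, the starting length of 1, and the choice to skip affixes that would cover the whole word are all assumptions. Only the `prefix-2=Jo` format and the default of 4 come from this page.

```elixir
defmodule AffixSketch do
  # Hypothetical helper, not part of Nasty. Emits prefix and suffix
  # features for lengths 1..max_len, skipping any length that would
  # cover the entire word (an assumption made for this sketch).
  def affixes(word, max_len \\ 4) do
    len = String.length(word)

    # The explicit step keeps the range empty when max_len is 0.
    lengths = Enum.filter(1..max_len//1, &(&1 < len))

    prefixes = for n <- lengths, do: "prefix-#{n}=" <> String.slice(word, 0, n)
    suffixes = for n <- lengths, do: "suffix-#{n}=" <> String.slice(word, len - n, n)

    prefixes ++ suffixes
  end
end
```

In this sketch, lowering the limit to 2 for "Smith" yields only the one- and two-character prefixes and suffixes, so the limit directly trades feature count against granularity.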
@spec extract_sequence([Nasty.AST.Token.t()], keyword()) :: [feature_vector()]
Extracts features for an entire sequence of tokens.
Automatically builds context for each token from surrounding tokens.
Parameters
- tokens - List of tokens
- opts - Options passed to extract/3
Returns
List of feature vectors, one per token
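The automatic context building described for this function could look roughly like the sketch below. It is an illustration, not Nasty's code: the module `ContextSketch` and its functions are hypothetical, and it works on plain word strings rather than `Nasty.AST.Token` structs. It fills only keys that appear in the `context()` type and assumes zero-based positions, which matches the `position: 1` given for "John" after "Mr." in the example on this page.

```elixir
defmodule ContextSketch do
  # Hypothetical helper, not part of Nasty. Builds one context map per
  # word, using only keys defined in the context() type.
  def build_contexts(words) do
    n = length(words)

    words
    |> Enum.with_index()
    |> Enum.map(fn {_word, i} ->
      %{position: i, sequence_length: n}
      # The first word has no predecessor, so :prev_word is omitted there.
      |> put_if(:prev_word, i > 0 && Enum.at(words, i - 1))
      # Enum.at/2 returns nil past the end, so :next_word is omitted for the last word.
      |> put_if(:next_word, Enum.at(words, i + 1))
    end)
  end

  # Adds the key only when a real value is present.
  defp put_if(map, _key, value) when value in [nil, false], do: map
  defp put_if(map, key, value), do: Map.put(map, key, value)
end
```

Leaving boundary keys out of the map, rather than setting them to `nil`, fits the `optional(...)` keys in `context()`. A production version would likely walk the list once instead of calling `Enum.at/2` for each neighbor.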