Nasty.Statistics.Neural.Preprocessing (Nasty v0.3.0)
Preprocessing utilities for neural models.
Provides text normalization, augmentation, and feature extraction for neural network training.
Features
- Text normalization (lowercase, punctuation, etc.)
- Character-level features
- Data augmentation
- Feature extraction (capitalization, word shape, etc.)
- Sequence padding and truncation
Example
alias Nasty.Statistics.Neural.Preprocessing
# Normalize text
normalized = Preprocessing.normalize_text(text, lowercase: true)
# Extract character sequences
char_ids = Preprocessing.extract_char_features(words, char_vocab)
# Augment training data
augmented = Preprocessing.augment(sentences, methods: [:synonym, :shuffle])
Functions
Augments training data with various techniques.
Parameters
- sentences - List of {words, tags} tuples
- opts - Augmentation options
Options
- :methods - List of augmentation methods (default: [:synonym])
- :probability - Probability of applying augmentation (default: 0.3)
Augmentation Methods
- :shuffle - Shuffle word order (for non-syntactic tasks)
- :dropout - Randomly drop words
- :synonym - Replace with synonyms (requires word embeddings)
Returns
Augmented list of sentences.
Note
This function is a placeholder for a future implementation. Full augmentation requires external resources such as synonym dictionaries.
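Because the function itself is a placeholder, the following is only a minimal sketch of what the :dropout and :shuffle methods could look like on a single {words, tags} pair. The module AugmentSketch and its functions are hypothetical and not part of Nasty; the choice to fall back to the original sentence when every token would be dropped is also an assumption.

```elixir
defmodule AugmentSketch do
  # Hypothetical helper, not part of Nasty: drops each token with
  # probability `p`, keeping words and tags aligned.
  def dropout({words, tags}, p) do
    kept =
      words
      |> Enum.zip(tags)
      |> Enum.reject(fn _pair -> :rand.uniform() < p end)

    case kept do
      # Assumption: never return an empty sentence; keep the original instead.
      [] -> {words, tags}
      _ -> Enum.unzip(kept)
    end
  end

  # Shuffles token order while keeping each word paired with its tag.
  # Only sensible for tasks that do not depend on word order.
  def shuffle({words, tags}) do
    words |> Enum.zip(tags) |> Enum.shuffle() |> Enum.unzip()
  end
end
```

Zipping words with tags before dropping or shuffling is what keeps the labels aligned with their tokens afterwards.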
Augments text with various techniques (placeholder).
Returns
{:error, :not_implemented}
Builds character vocabulary from words.
Parameters
- words - List of words
- opts - Vocabulary options
Options
- :special_tokens - Include special tokens (default: true)
- :min_freq - Minimum character frequency (default: 1)
Returns
- {:ok, char_vocab} - Character to ID mapping
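A minimal sketch of how such a vocabulary could be built is shown below. It is not Nasty's implementation: the module name CharVocabSketch, the special token names "<PAD>" and "<UNK>", their IDs (0 and 1), and the alphabetical ordering of characters are all assumptions made for illustration.

```elixir
defmodule CharVocabSketch do
  # Hypothetical sketch. Special token names and IDs are assumptions.
  def build(words, opts \\ []) do
    special? = Keyword.get(opts, :special_tokens, true)
    min_freq = Keyword.get(opts, :min_freq, 1)
    specials = if special?, do: ["<PAD>", "<UNK>"], else: []

    chars =
      words
      |> Enum.flat_map(&String.graphemes/1)
      |> Enum.frequencies()
      # Keep only characters seen at least `min_freq` times.
      |> Enum.filter(fn {_char, count} -> count >= min_freq end)
      |> Enum.map(fn {char, _count} -> char end)
      |> Enum.sort()

    # Assign consecutive IDs, special tokens first.
    vocab =
      (specials ++ chars)
      |> Enum.with_index()
      |> Map.new()

    {:ok, vocab}
  end
end
```

Sorting the characters before assigning IDs makes the mapping deterministic for a given word list.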
Creates an attention mask for a padded sequence.
Parameters
- sequence - Padded sequence
- opts - Mask options
Options
- :padding_value - Value used for padding (default: 0)
Returns
Mask list where 1 = real token, 0 = padding.
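The mask described above can be sketched in a few lines. MaskSketch is a hypothetical module, not Nasty's code; it only illustrates the comparison against the padding value.

```elixir
defmodule MaskSketch do
  # Hypothetical sketch: 1 marks a real token, 0 marks padding.
  def attention_mask(sequence, opts \\ []) do
    pad = Keyword.get(opts, :padding_value, 0)
    Enum.map(sequence, fn token -> if token == pad, do: 0, else: 1 end)
  end
end
```

With this approach, a real token whose ID equals the padding value would also be masked, so the padding ID should be reserved.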
Extracts character-level features from words.
Converts each word into a sequence of character IDs for use in character-level CNNs or embeddings.
Parameters
- words - List of words
- char_vocab - Character vocabulary %{char => id}
- opts - Extraction options
Options
- :max_word_length - Maximum characters per word (default: 20)
- :pad_value - Padding value for short words (default: 0)
Returns
Tensor of shape [num_words, max_word_length] with character IDs.
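The lookup, truncation, and padding steps can be sketched as follows. This is an illustration under assumptions, not Nasty's implementation: CharFeatureSketch is a hypothetical module, it returns a nested list of shape [num_words, max_word_length] rather than a tensor, and it maps unknown characters to a "<UNK>" entry if the vocabulary has one (otherwise to the pad value).

```elixir
defmodule CharFeatureSketch do
  # Hypothetical sketch: returns nested lists instead of a tensor.
  def encode(words, char_vocab, opts \\ []) do
    max_len = Keyword.get(opts, :max_word_length, 20)
    pad = Keyword.get(opts, :pad_value, 0)
    # Assumption: unknown characters use the "<UNK>" ID when available.
    unk = Map.get(char_vocab, "<UNK>", pad)

    Enum.map(words, fn word ->
      ids =
        word
        |> String.graphemes()
        # Truncate long words to max_len characters.
        |> Enum.take(max_len)
        |> Enum.map(&Map.get(char_vocab, &1, unk))

      # Right-pad short words so every row has the same length.
      ids ++ List.duplicate(pad, max_len - length(ids))
    end)
  end
end
```

Because every row has the same length, the nested list could then be handed to a tensor library for conversion; which library Nasty uses for that step is not stated here.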
Extracts handcrafted features from words.
Extracts linguistic features such as capitalization and word shape. These features can supplement the learned representations of a neural model.
Parameters
- words - List of words
Returns
List of feature maps, one per word.
Feature Types
- :is_capitalized - First letter uppercase
- :is_all_caps - All letters uppercase
- :is_numeric - Contains numbers
- :has_hyphen - Contains hyphen
- :word_shape - Pattern (e.g., "Xxxxx" for "Hello")
- :prefix - First 3 characters
- :suffix - Last 3 characters
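The feature types listed above could be computed roughly as in the sketch below. WordFeatureSketch is a hypothetical module, not Nasty's code. The only shape example given in the documentation is "Xxxxx" for "Hello", so mapping digits to "d" and leaving other characters unchanged are assumptions.

```elixir
defmodule WordFeatureSketch do
  # Hypothetical sketch of the feature map for a single word.
  def features(word) do
    %{
      is_capitalized: String.match?(word, ~r/^\p{Lu}/u),
      # At least one letter, and upcasing changes nothing.
      is_all_caps: String.match?(word, ~r/\p{L}/u) and word == String.upcase(word),
      is_numeric: String.match?(word, ~r/\d/),
      has_hyphen: String.contains?(word, "-"),
      word_shape: shape(word),
      prefix: String.slice(word, 0, 3),
      suffix: word |> String.graphemes() |> Enum.take(-3) |> Enum.join()
    }
  end

  # Assumption: uppercase -> "X", lowercase -> "x", digit -> "d",
  # anything else is kept as-is.
  defp shape(word) do
    word
    |> String.graphemes()
    |> Enum.map_join(fn g ->
      cond do
        String.match?(g, ~r/^\p{Lu}$/u) -> "X"
        String.match?(g, ~r/^\p{Ll}$/u) -> "x"
        String.match?(g, ~r/^\d$/) -> "d"
        true -> g
      end
    end)
  end
end
```

Mapping a list of words through WordFeatureSketch.features/1 yields one feature map per word, matching the return shape described above.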
Normalizes text for neural model input.
Parameters
- text - Text to normalize
- opts - Normalization options
Options
- :lowercase - Convert to lowercase (default: true)
- :remove_accents - Remove accents/diacritics (default: false)
- :remove_punct - Remove punctuation (default: false)
- :normalize_whitespace - Normalize whitespace (default: false)
- :normalize_digits - Replace digits with <NUM> (default: false)
- :normalize_urls - Replace URLs with <URL> (default: false)
- :normalize_emails - Replace emails with <EMAIL> (default: false)
Returns
Normalized text string.
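To illustrate how several of these options might compose, here is a minimal sketch covering a subset of them (:lowercase, :remove_accents, :normalize_urls, :normalize_digits, :normalize_whitespace). NormalizeSketch is hypothetical and not Nasty's implementation. The regular expressions, the order of the steps, and replacing each run of digits (rather than each digit) with one <NUM> are assumptions.

```elixir
defmodule NormalizeSketch do
  # Hypothetical sketch covering only a subset of the documented options.
  def normalize(text, opts \\ []) do
    text
    # Lowercase first so the uppercase placeholders inserted later survive.
    |> maybe(Keyword.get(opts, :lowercase, true), &String.downcase/1)
    |> maybe(opts[:remove_accents], &strip_accents/1)
    # URLs before digits, so digits inside a URL are not replaced separately.
    |> maybe(opts[:normalize_urls], &String.replace(&1, ~r{https?://\S+}, "<URL>"))
    |> maybe(opts[:normalize_digits], &String.replace(&1, ~r/\d+/, "<NUM>"))
    |> maybe(opts[:normalize_whitespace], &collapse_whitespace/1)
  end

  # Applies `fun` only when the option is exactly `true`.
  defp maybe(text, true, fun), do: fun.(text)
  defp maybe(text, _off, _fun), do: text

  # Decompose characters, then drop the combining marks.
  defp strip_accents(text) do
    text |> String.normalize(:nfd) |> String.replace(~r/\p{Mn}/u, "")
  end

  defp collapse_whitespace(text) do
    text |> String.replace(~r/\s+/, " ") |> String.trim()
  end
end
```

The ordering matters: lowercasing after placeholder insertion would turn <URL> into <url>, and stripping punctuation before URL detection would break the URL pattern.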
Pads all sequences in a batch to the same length.
Parameters
- batch - List of sequences
- opts - Padding options
Options
- :max_length - Target length (default: length of longest sequence)
- :padding_value - Value to use for padding (default: 0)
Returns
List of padded sequences.
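A minimal sketch of batch padding with the default target length is shown below. BatchPadSketch is a hypothetical module, not Nasty's code. Truncating sequences that exceed an explicit :max_length is an assumption; the documentation only describes padding.

```elixir
defmodule BatchPadSketch do
  # Hypothetical sketch: pads every sequence to a common length.
  def pad_batch(batch, opts \\ []) do
    pad = Keyword.get(opts, :padding_value, 0)
    # Default target is the length of the longest sequence (0 for an empty batch).
    longest = Enum.reduce(batch, 0, fn seq, acc -> max(length(seq), acc) end)
    target = Keyword.get(opts, :max_length, longest)

    Enum.map(batch, fn seq ->
      # Assumption: sequences longer than the target are cut at the end.
      seq = Enum.take(seq, target)
      seq ++ List.duplicate(pad, target - length(seq))
    end)
  end
end
```

Padding to the longest sequence in each batch, rather than to a global maximum, keeps batches as small as the data allows.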
@spec pad_sequence(list(), non_neg_integer(), keyword()) :: list()
Pads or truncates a single sequence to a fixed length.
Parameters
- sequence - Single sequence (list)
- max_length - Target length
- opts - Padding options
Options
- :padding_value - Value to use for padding (default: 0)
- :truncate - Truncation strategy: :pre or :post (default: :post)
Returns
Sequence of length max_length.
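The two truncation strategies can be sketched as follows. PadSequenceSketch is a hypothetical module that mirrors the documented signature but is not Nasty's implementation. It assumes that :post removes elements from the end, :pre removes them from the start, and padding is always appended at the end.

```elixir
defmodule PadSequenceSketch do
  # Hypothetical sketch mirroring the documented spec.
  @spec pad_sequence(list(), non_neg_integer(), keyword()) :: list()
  def pad_sequence(sequence, max_length, opts \\ []) do
    pad = Keyword.get(opts, :padding_value, 0)
    truncate = Keyword.get(opts, :truncate, :post)
    len = length(sequence)

    cond do
      len == max_length -> sequence
      # Too short: append padding at the end.
      len < max_length -> sequence ++ List.duplicate(pad, max_length - len)
      # Too long, :pre: drop elements from the start, keeping the tail.
      truncate == :pre -> Enum.drop(sequence, len - max_length)
      # Too long, :post (default): keep the head.
      true -> Enum.take(sequence, max_length)
    end
  end
end
```

Under these assumptions, :pre is the better choice when the end of a sequence carries the most relevant context.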
@spec pad_sequences([list()], non_neg_integer(), keyword()) :: [list()]
Pads or truncates sequences to a fixed length.
Parameters
- sequences - List of sequences (lists)
- max_length - Target length
- opts - Padding options
Options
- :pad_value - Value to use for padding (default: 0)
- :truncate - Truncation strategy: :pre or :post (default: :post)
Returns
List of sequences, all of length max_length.
Tokenizes text into subwords using BPE or similar (placeholder).
Returns
{:error, :not_implemented}