Nasty.Statistics.Neural.Preprocessing (Nasty v0.3.0)

Preprocessing utilities for neural models.

Provides text normalization, data augmentation, feature extraction, and sequence padding for neural network training.

Features

Text normalization (lowercase, punctuation, etc.)
Character-level features
Data augmentation
Feature extraction (capitalization, word shape, etc.)
Sequence padding and truncation

Example

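alias Nasty.Statistics.Neural.Preprocessing

# Normalize text
normalized = Preprocessing.normalize_text(text, lowercase: true)

# Build a character vocabulary, then extract character ID sequences
{:ok, char_vocab} = Preprocessing.build_char_vocabulary(words)
char_ids = Preprocessing.extract_char_features(words, char_vocab)

# Augment training data (sentences are {words, tags} tuples)
augmented = Preprocessing.augment(sentences, methods: [:synonym, :shuffle])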

Summary

Functions

augment(sentences, opts \\ [])

Augments training data with various techniques.

augment_text(text, opts)

Augments text with various techniques (placeholder).

build_char_vocabulary(words, opts \\ [])

Builds character vocabulary from words.

create_attention_mask(sequence, opts \\ [])

Creates an attention mask for a padded sequence.

extract_char_features(word, char_vocab, opts \\ [])

Extracts character-level features from words.

extract_word_features(word)

Extracts handcrafted features from words.

normalize_text(text, opts \\ [])

Normalizes text for neural model input.

pad_batch(batch, opts \\ [])

Pads all sequences in a batch to the same length.

pad_sequence(seq, max_length, opts \\ [])

Pads or truncates a single sequence to a fixed length.

pad_sequences(sequences, max_length, opts \\ [])

Pads or truncates sequences to a fixed length.

tokenize_subwords(text, model)

Tokenizes text into subwords using BPE or similar (placeholder).

Functions

augment(sentences, opts \\ [])

@spec augment(
  [{[String.t()], [atom()]}],
  keyword()
) :: [{[String.t()], [atom()]}]

Augments training data with various techniques.

Parameters

sentences - List of {words, tags} tuples
opts - Augmentation options

Options

:methods - List of augmentation methods (default: [:synonym])
:probability - Probability of applying augmentation (default: 0.3)

Augmentation Methods

:shuffle - Shuffle word order (for non-syntactic tasks)
:dropout - Randomly drop words
:synonym - Replace with synonyms (requires word embeddings)

Returns

Augmented list of sentences.

Note

This function is a placeholder for a future implementation. Full augmentation requires external resources (synonym dictionaries, etc.).
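
Because augment is documented as a placeholder, the following is only a minimal sketch of how the :dropout and :shuffle methods could work on a {words, tags} tuple. The module AugmentSketch and its functions are hypothetical and not part of Nasty. The sketch assumes the probability applies to each word independently, which the documentation above does not specify.

```elixir
defmodule AugmentSketch do
  # Hypothetical helper, not part of Nasty.
  # Drops each {word, tag} pair independently with probability `prob`,
  # so words and tags stay aligned.
  def dropout({words, tags}, prob) do
    kept =
      words
      |> Enum.zip(tags)
      |> Enum.reject(fn _pair -> :rand.uniform() < prob end)

    case kept do
      # Never return an empty sentence; fall back to the original.
      [] -> {words, tags}
      _ -> Enum.unzip(kept)
    end
  end

  # Hypothetical helper, not part of Nasty.
  # Shuffles word order while keeping every tag attached to its word.
  def shuffle({words, tags}) do
    words
    |> Enum.zip(tags)
    |> Enum.shuffle()
    |> Enum.unzip()
  end
end
```

Zipping words with tags before dropping or shuffling keeps the labels consistent, which matters for tagging tasks where each word carries its own tag.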

augment_text(text, opts)

Augments text with various techniques (placeholder).

Returns

{:error, :not_implemented}

build_char_vocabulary(words, opts \\ [])

@spec build_char_vocabulary(
  [String.t()],
  keyword()
) :: {:ok, map()}

Builds character vocabulary from words.

Parameters

words - List of words
opts - Vocabulary options

Options

:special_tokens - Include special tokens (default: true)
:min_freq - Minimum character frequency (default: 1)

Returns

{:ok, char_vocab} - Character to ID mapping
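
To illustrate the options above, here is a minimal sketch of a character vocabulary builder. CharVocabSketch is a hypothetical module, and the special tokens "<PAD>" and "<UNK>" with IDs 0 and 1 are assumptions; the real function may use different tokens and ID assignments.

```elixir
defmodule CharVocabSketch do
  # Assumed special tokens; the real module may use different ones.
  @special ["<PAD>", "<UNK>"]

  def build(words, opts \\ []) do
    special? = Keyword.get(opts, :special_tokens, true)
    min_freq = Keyword.get(opts, :min_freq, 1)

    # Count every character, keep those seen at least `min_freq` times,
    # and sort so that ID assignment is deterministic.
    chars =
      words
      |> Enum.flat_map(&String.graphemes/1)
      |> Enum.frequencies()
      |> Enum.filter(fn {_char, count} -> count >= min_freq end)
      |> Enum.map(fn {char, _count} -> char end)
      |> Enum.sort()

    tokens = if special?, do: @special ++ chars, else: chars

    # Assign consecutive IDs starting at 0.
    {:ok, tokens |> Enum.with_index() |> Map.new()}
  end
end
```

In this sketch, disabling the special tokens gives the first real character ID 0, which would collide with a padding value of 0 in later steps.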

create_attention_mask(sequence, opts \\ [])

@spec create_attention_mask(
  list(),
  keyword()
) :: list()

Creates an attention mask for a padded sequence.

Parameters

sequence - Padded sequence
opts - Mask options

Options

:padding_value - Value used for padding (default: 0)

Returns

Mask list where 1 = real token, 0 = padding.
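
The mask described above can be sketched in a few lines. MaskSketch is a hypothetical module written for illustration and is not the library's implementation.

```elixir
defmodule MaskSketch do
  # Returns 1 for every element that differs from the padding value, 0 otherwise.
  def mask(sequence, opts \\ []) do
    padding_value = Keyword.get(opts, :padding_value, 0)

    Enum.map(sequence, fn token ->
      if token == padding_value, do: 0, else: 1
    end)
  end
end
```

With this approach, a real token whose ID equals the padding value is masked out as well, so the padding value should not be shared with any vocabulary entry.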

extract_char_features(word, char_vocab, opts \\ [])

@spec extract_char_features(String.t() | [String.t()], map(), keyword()) ::
  list() | Nx.Tensor.t()

Extracts character-level features from words.

Converts each word into a sequence of character IDs for use in character-level CNNs or embeddings.

Parameters

word - A single word or a list of words
char_vocab - Character vocabulary %{char => id}
opts - Extraction options

Options

:max_word_length - Maximum characters per word (default: 20)
:pad_value - Padding value for short words (default: 0)

Returns

Character IDs, as a list or as an Nx tensor of shape [num_words, max_word_length].
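
The following sketch shows the character-ID encoding with plain lists, so it runs without Nx. CharFeaturesSketch is a hypothetical module. Mapping unknown characters to a "<UNK>" entry (falling back to the pad value if the vocabulary has none) is an assumption that the documentation above does not confirm.

```elixir
defmodule CharFeaturesSketch do
  def encode(words, char_vocab, opts \\ []) when is_list(words) do
    max_len = Keyword.get(opts, :max_word_length, 20)
    pad = Keyword.get(opts, :pad_value, 0)
    # Assumption: unknown characters use the "<UNK>" ID when present.
    unk = Map.get(char_vocab, "<UNK>", pad)

    Enum.map(words, fn word ->
      ids =
        word
        |> String.graphemes()
        # Truncate long words to max_len characters.
        |> Enum.take(max_len)
        |> Enum.map(&Map.get(char_vocab, &1, unk))

      # Right-pad short words so every row has exactly max_len entries.
      ids ++ List.duplicate(pad, max_len - length(ids))
    end)
  end
end
```

The result is a list of equal-length rows, one per word, which could be converted with Nx.tensor/1 to obtain the [num_words, max_word_length] shape.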

extract_word_features(word)

@spec extract_word_features(String.t() | [String.t()]) :: map() | [map()]

Extracts handcrafted features from words.

Extracts linguistic features like capitalization, word shape, etc. Useful for augmenting neural models.

Parameters

word - A single word or a list of words

Returns

A feature map for a single word, or a list of feature maps (one per word) for a list of words.

Feature Types

:is_capitalized - First letter uppercase
:is_all_caps - All letters uppercase
:is_numeric - Contains at least one digit
:has_hyphen - Contains hyphen
:word_shape - Pattern (e.g., "Xxxxx" for "Hello")
:prefix - First 3 characters
:suffix - Last 3 characters
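
As an illustration of the feature types listed above, here is a minimal sketch for a single word. WordFeaturesSketch is a hypothetical module. The exact rules are assumptions: digits are rendered as "d" in the word shape, and :is_numeric is treated as "contains at least one digit".

```elixir
defmodule WordFeaturesSketch do
  def features(word) when is_binary(word) do
    %{
      is_capitalized: String.match?(word, ~r/^\p{Lu}/u),
      # Requires at least one letter, so "123" is not treated as all caps.
      is_all_caps: String.match?(word, ~r/\p{L}/u) and word == String.upcase(word),
      is_numeric: String.match?(word, ~r/\d/),
      has_hyphen: String.contains?(word, "-"),
      word_shape: shape(word),
      prefix: String.slice(word, 0, 3),
      # Enum.take/2 with a negative count keeps the last elements.
      suffix: word |> String.graphemes() |> Enum.take(-3) |> Enum.join()
    }
  end

  # Uppercase letters become "X", lowercase "x", digits "d";
  # all other characters are kept unchanged.
  defp shape(word) do
    word
    |> String.replace(~r/\p{Lu}/u, "X")
    |> String.replace(~r/\p{Ll}/u, "x")
    |> String.replace(~r/\d/, "d")
  end
end
```

Digits are replaced last so that the inserted "d" characters are not rewritten by the lowercase rule.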

normalize_text(text, opts \\ [])

@spec normalize_text(
  String.t(),
  keyword()
) :: String.t()

Normalizes text for neural model input.

Parameters

text - Text to normalize
opts - Normalization options

Options

:lowercase - Convert to lowercase (default: true)
:remove_accents - Remove accents/diacritics (default: false)
:remove_punct - Remove punctuation (default: false)
:normalize_whitespace - Normalize whitespace (default: false)
:normalize_digits - Replace digits with <NUM> (default: false)
:normalize_urls - Replace URLs with <URL> (default: false)
:normalize_emails - Replace emails with <EMAIL> (default: false)

Returns

Normalized text string.
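
The sketch below shows one way several of these options could be combined. NormalizeSketch is a hypothetical module, the regular expressions are simplified assumptions, and :remove_accents and :remove_punct are omitted for brevity. It assumes each run of digits becomes a single <NUM> token.

```elixir
defmodule NormalizeSketch do
  def normalize(text, opts \\ []) do
    text
    # Lowercase first so the uppercase placeholders inserted below survive.
    |> apply_if(Keyword.get(opts, :lowercase, true), &String.downcase/1)
    # URLs before emails, emails before digits, so that digits inside
    # URLs and addresses are not replaced separately.
    |> apply_if(opts[:normalize_urls], &String.replace(&1, ~r{https?://\S+}, "<URL>"))
    |> apply_if(
      opts[:normalize_emails],
      &String.replace(&1, ~r/[\w.+-]+@[\w-]+(\.[\w-]+)+/, "<EMAIL>")
    )
    |> apply_if(opts[:normalize_digits], &String.replace(&1, ~r/\d+/, "<NUM>"))
    |> apply_if(opts[:normalize_whitespace], fn t ->
      t |> String.replace(~r/\s+/, " ") |> String.trim()
    end)
  end

  # Applies `fun` only when the option is truthy.
  defp apply_if(text, flag, fun), do: if(flag, do: fun.(text), else: text)
end
```

The order of the steps matters: running the digit replacement before the URL or email replacement would break up addresses that contain numbers.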

pad_batch(batch, opts \\ [])

@spec pad_batch(
  [list()],
  keyword()
) :: [list()]

Pads all sequences in a batch to the same length.

Parameters

batch - List of sequences
opts - Padding options

Options

:max_length - Target length (default: length of longest sequence)
:padding_value - Value to use for padding (default: 0)

Returns

List of padded sequences.
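
A minimal sketch of batch padding follows. PadBatchSketch is a hypothetical module. Truncating sequences that exceed an explicit :max_length is an assumption, since the documentation above only describes padding.

```elixir
defmodule PadBatchSketch do
  def pad_batch(batch, opts \\ []) do
    pad = Keyword.get(opts, :padding_value, 0)
    # Default target: the longest sequence in the batch (0 for an empty batch).
    longest = Enum.max([0 | Enum.map(batch, &length/1)])
    target = Keyword.get(opts, :max_length, longest)

    Enum.map(batch, fn seq ->
      # Assumption: sequences longer than the target are cut at the end.
      seq = Enum.take(seq, target)
      seq ++ List.duplicate(pad, target - length(seq))
    end)
  end
end
```

Padding to the longest sequence in each batch, rather than to a global maximum, keeps batches of short sentences small.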

pad_sequence(seq, max_length, opts \\ [])

@spec pad_sequence(list(), non_neg_integer(), keyword()) :: list()

Pads or truncates a single sequence to a fixed length.

Parameters

seq - Single sequence (list)
max_length - Target length
opts - Padding options

Options

:padding_value - Value to use for padding (default: 0)
:truncate - Truncation strategy: :pre or :post (default: :post)

Returns

Sequence of length max_length.
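
The two truncation strategies can be sketched as follows. PadSequenceSketch is a hypothetical module. It assumes :post removes elements from the end, :pre removes them from the start, and padding is always appended at the end; the documentation above does not spell these details out.

```elixir
defmodule PadSequenceSketch do
  def pad(seq, max_length, opts \\ []) do
    pad_value = Keyword.get(opts, :padding_value, 0)
    truncate = Keyword.get(opts, :truncate, :post)

    trimmed =
      case truncate do
        # :post keeps the first max_length elements.
        :post -> Enum.take(seq, max_length)
        # :pre keeps the last max_length elements.
        :pre -> Enum.take(seq, -max_length)
      end

    # Append padding until the sequence reaches max_length.
    trimmed ++ List.duplicate(pad_value, max_length - length(trimmed))
  end
end
```

Under these assumptions, :pre is the better choice when the end of a sequence carries the most useful signal, because it is the beginning that gets discarded.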

pad_sequences(sequences, max_length, opts \\ [])

@spec pad_sequences([list()], non_neg_integer(), keyword()) :: [list()]

Pads or truncates sequences to a fixed length.

Parameters

sequences - List of sequences (lists)
max_length - Target length
opts - Padding options

Options

:pad_value - Value to use for padding (default: 0)
:truncate - Truncation strategy: :pre or :post (default: :post)

Returns

List of sequences, all of length max_length.

tokenize_subwords(text, model)

Tokenizes text into subwords using BPE or similar (placeholder).

Returns

{:error, :not_implemented}