Nasty.Statistics.Neural.Preprocessing (Nasty v0.3.0)

Preprocessing utilities for neural models.

Provides text normalization, augmentation, and feature extraction for neural network training.

Features

  • Text normalization (lowercase, punctuation, etc.)
  • Character-level features
  • Data augmentation
  • Feature extraction (capitalization, word shape, etc.)
  • Sequence padding and truncation

Example

alias Nasty.Statistics.Neural.Preprocessing

# `text`, `words`, `char_vocab`, and `sentences` are assumed to be bound already.

# Normalize text
normalized = Preprocessing.normalize_text(text, lowercase: true)

# Extract character sequences
char_ids = Preprocessing.extract_char_features(words, char_vocab)

# Augment training data
augmented = Preprocessing.augment(sentences, methods: [:synonym, :shuffle])

Summary

Functions

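augment(sentences, opts \\ [])
Augments training data with various techniques.

augment_text(text, opts)
Augments text with various techniques (placeholder).

build_char_vocabulary(words, opts \\ [])
Builds character vocabulary from words.

create_attention_mask(sequence, opts \\ [])
Creates an attention mask for a padded sequence.

extract_char_features(word, char_vocab, opts \\ [])
Extracts character-level features from words.

extract_word_features(word)
Extracts handcrafted features from words.

normalize_text(text, opts \\ [])
Normalizes text for neural model input.

pad_batch(batch, opts \\ [])
Pads all sequences in a batch to the same length.

pad_sequence(seq, max_length, opts \\ [])
Pads or truncates a single sequence to a fixed length.

pad_sequences(sequences, max_length, opts \\ [])
Pads or truncates sequences to a fixed length.

tokenize_subwords(text, model)
Tokenizes text into subwords using BPE or similar (placeholder).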

Functions

augment(sentences, opts \\ [])

@spec augment(
  [{[String.t()], [atom()]}],
  keyword()
) :: [{[String.t()], [atom()]}]

Augments training data with various techniques.

Parameters

  • sentences - List of {words, tags} tuples
  • opts - Augmentation options

Options

  • :methods - List of augmentation methods (default: [:synonym])
  • :probability - Probability of applying augmentation (default: 0.3)

Augmentation Methods

  • :shuffle - Shuffle word order (for non-syntactic tasks)
  • :dropout - Randomly drop words
  • :synonym - Replace with synonyms (requires word embeddings)

Returns

Augmented list of sentences.

Note

This is a placeholder for a future implementation. Full augmentation requires external resources such as synonym dictionaries.
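Because the function is documented as a placeholder, the following is only a minimal sketch of what the :dropout method could look like, not Nasty's implementation. The module AugmentSketch and its dropout/2 function are hypothetical names. The sketch drops each word together with its tag so the two lists stay aligned.

```elixir
defmodule AugmentSketch do
  # Hypothetical helper, not part of Nasty: drops each word (and its tag)
  # with the given probability so words and tags stay aligned.
  def dropout(sentences, probability \\ 0.3) do
    Enum.map(sentences, fn {words, tags} ->
      kept =
        words
        |> Enum.zip(tags)
        |> Enum.reject(fn _pair -> :rand.uniform() < probability end)

      case kept do
        # Never emit an empty sentence; fall back to the original.
        [] -> {words, tags}
        _ -> Enum.unzip(kept)
      end
    end)
  end
end

sentences = [{["the", "cat", "sat"], [:det, :noun, :verb]}]
augmented = AugmentSketch.dropout(sentences, 0.3)
```

Falling back to the original sentence when every word is dropped is a design choice of this sketch; it avoids feeding empty sequences to a model.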

augment_text(text, opts)

Augments text with various techniques (placeholder).

Returns

{:error, :not_implemented}

build_char_vocabulary(words, opts \\ [])

@spec build_char_vocabulary(
  [String.t()],
  keyword()
) :: {:ok, map()}

Builds character vocabulary from words.

Parameters

  • words - List of words
  • opts - Vocabulary options

Options

  • :special_tokens - Include special tokens (default: true)
  • :min_freq - Minimum character frequency (default: 1)

Returns

  • {:ok, char_vocab} - Character to ID mapping
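
To illustrate how the two options could interact, here is a rough sketch written independently of the library. CharVocabSketch is a hypothetical module, and the special token names ("<PAD>", "<UNK>") and their IDs are assumptions; the source does not say which special tokens Nasty uses.

```elixir
defmodule CharVocabSketch do
  # Hypothetical sketch, not Nasty's implementation. The special token
  # names and their IDs (0 for padding, 1 for unknown) are assumptions.
  @special ["<PAD>", "<UNK>"]

  def build(words, opts \\ []) do
    special? = Keyword.get(opts, :special_tokens, true)
    min_freq = Keyword.get(opts, :min_freq, 1)

    # Count characters across all words and keep the frequent ones.
    chars =
      words
      |> Enum.flat_map(&String.graphemes/1)
      |> Enum.frequencies()
      |> Enum.filter(fn {_char, count} -> count >= min_freq end)
      |> Enum.map(fn {char, _count} -> char end)
      |> Enum.sort()

    tokens = if special?, do: @special ++ chars, else: chars

    # Assign consecutive IDs in token order.
    vocab = tokens |> Enum.with_index() |> Map.new()

    {:ok, vocab}
  end
end

{:ok, vocab} = CharVocabSketch.build(["cat", "car"], min_freq: 2)
# vocab => %{"<PAD>" => 0, "<UNK>" => 1, "a" => 2, "c" => 3}
```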

create_attention_mask(sequence, opts \\ [])

@spec create_attention_mask(
  list(),
  keyword()
) :: list()

Creates an attention mask for a padded sequence.

Parameters

  • sequence - Padded sequence
  • opts - Mask options

Options

  • :padding_value - Value used for padding (default: 0)

Returns

Mask list where 1 = real token, 0 = padding.
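
The mask rule described above can be sketched in a few lines. MaskSketch is a hypothetical module written for illustration; it assumes the mask is derived purely by comparing each element with the padding value.

```elixir
defmodule MaskSketch do
  # Hypothetical sketch: 1 where the element differs from the padding
  # value, 0 where it equals it.
  def mask(sequence, opts \\ []) do
    padding_value = Keyword.get(opts, :padding_value, 0)
    Enum.map(sequence, fn x -> if x == padding_value, do: 0, else: 1 end)
  end
end

MaskSketch.mask([7, 4, 9, 0, 0])
# => [1, 1, 1, 0, 0]
```

Under this assumption, a real token whose ID equals the padding value would also be masked out, so the padding ID should be reserved.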

extract_char_features(word, char_vocab, opts \\ [])

@spec extract_char_features(String.t() | [String.t()], map(), keyword()) ::
  list() | Nx.Tensor.t()

Extracts character-level features from words.

Converts each word into a sequence of character IDs for use in character-level CNNs or embeddings.

Parameters

  • word - A single word or a list of words
  • char_vocab - Character vocabulary %{char => id}
  • opts - Extraction options

Options

  • :max_word_length - Maximum characters per word (default: 20)
  • :pad_value - Padding value for short words (default: 0)

Returns

Character IDs as a list or an Nx tensor, per the spec above. A tensor result has shape [num_words, max_word_length].
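
As an illustration of the conversion, the sketch below maps words to fixed-length rows of character IDs. It is not the library's code: CharFeatureSketch is a hypothetical module, it returns nested lists rather than an Nx tensor, and mapping unknown characters to the pad value is an assumption.

```elixir
defmodule CharFeatureSketch do
  # Hypothetical sketch returning nested lists rather than an Nx tensor.
  # Mapping unknown characters to the pad value is an assumption.
  def encode(words, char_vocab, opts \\ []) do
    max_len = Keyword.get(opts, :max_word_length, 20)
    pad = Keyword.get(opts, :pad_value, 0)

    Enum.map(words, fn word ->
      # Truncate long words, then look up each character's ID.
      ids =
        word
        |> String.graphemes()
        |> Enum.take(max_len)
        |> Enum.map(&Map.get(char_vocab, &1, pad))

      # Right-pad short words up to max_len.
      ids ++ List.duplicate(pad, max_len - length(ids))
    end)
  end
end

char_vocab = %{"a" => 1, "c" => 2, "t" => 3}
CharFeatureSketch.encode(["cat", "at"], char_vocab, max_word_length: 4)
# => [[2, 1, 3, 0], [1, 3, 0, 0]]
```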

extract_word_features(word)

@spec extract_word_features(String.t() | [String.t()]) :: map() | [map()]

Extracts handcrafted features from words.

Extracts linguistic features like capitalization, word shape, etc. Useful for augmenting neural models.

Parameters

  • word - A single word or a list of words

Returns

A feature map for a single word, or a list of feature maps (one per word) for a list of words.

Feature Types

  • :is_capitalized - First letter uppercase
  • :is_all_caps - All letters uppercase
  • :is_numeric - Contains numbers
  • :has_hyphen - Contains hyphen
  • :word_shape - Pattern (e.g., "Xxxxx" for "Hello")
  • :prefix - First 3 characters
  • :suffix - Last 3 characters
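
The feature types listed above could be computed roughly as follows. WordFeatureSketch is a hypothetical module, and the exact definitions are assumptions: for example, it reads :is_numeric as "contains at least one digit" and maps digits to "d" in the word shape, neither of which the source specifies.

```elixir
defmodule WordFeatureSketch do
  # Hypothetical sketch; the exact feature definitions are assumptions.
  def features(word) do
    %{
      is_capitalized: String.match?(word, ~r/^\p{Lu}/u),
      # Requires at least one cased letter, so "123" is not all caps.
      is_all_caps: word == String.upcase(word) and word != String.downcase(word),
      is_numeric: String.match?(word, ~r/\d/),
      has_hyphen: String.contains?(word, "-"),
      word_shape: shape(word),
      prefix: String.slice(word, 0, 3),
      suffix: word |> String.graphemes() |> Enum.take(-3) |> Enum.join()
    }
  end

  # Uppercase letters become "X", lowercase "x", digits "d"; other
  # characters are kept as-is.
  defp shape(word) do
    word
    |> String.replace(~r/\p{Lu}/u, "X")
    |> String.replace(~r/\p{Ll}/u, "x")
    |> String.replace(~r/\d/, "d")
  end
end

WordFeatureSketch.features("Hello").word_shape
# => "Xxxxx"
```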

normalize_text(text, opts \\ [])

@spec normalize_text(
  String.t(),
  keyword()
) :: String.t()

Normalizes text for neural model input.

Parameters

  • text - Text to normalize
  • opts - Normalization options

Options

  • :lowercase - Convert to lowercase (default: true)
  • :remove_accents - Remove accents/diacritics (default: false)
  • :remove_punct - Remove punctuation (default: false)
  • :normalize_whitespace - Normalize whitespace (default: false)
  • :normalize_digits - Replace digits with <NUM> (default: false)
  • :normalize_urls - Replace URLs with <URL> (default: false)
  • :normalize_emails - Replace emails with <EMAIL> (default: false)

Returns

Normalized text string.
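
To show how such options might compose, here is a minimal sketch covering a subset of them (:remove_accents and :remove_punct are left out). NormalizeSketch is a hypothetical module; the regular expressions, the order of the steps, and replacing each run of digits (rather than each digit) with <NUM> are all assumptions.

```elixir
defmodule NormalizeSketch do
  # Hypothetical sketch covering a subset of the options. Lowercasing
  # runs first so the uppercase placeholders survive.
  def normalize(text, opts \\ []) do
    text
    |> step(Keyword.get(opts, :lowercase, true), &String.downcase/1)
    |> step(opts[:normalize_urls], &String.replace(&1, ~r{https?://\S+}, "<URL>"))
    |> step(opts[:normalize_emails], &String.replace(&1, ~r/\S+@\S+\.\S+/, "<EMAIL>"))
    |> step(opts[:normalize_digits], &String.replace(&1, ~r/\d+/, "<NUM>"))
    |> step(opts[:normalize_whitespace], &collapse_whitespace/1)
  end

  # Applies `fun` only when the option is enabled.
  defp step(text, enabled, fun), do: if(enabled, do: fun.(text), else: text)

  defp collapse_whitespace(text) do
    text |> String.replace(~r/\s+/, " ") |> String.trim()
  end
end

NormalizeSketch.normalize("Hello  World 42", normalize_digits: true, normalize_whitespace: true)
# => "hello world <NUM>"
```

URLs and emails are replaced before digits in this sketch so that digits inside them do not produce stray <NUM> tokens.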

pad_batch(batch, opts \\ [])

@spec pad_batch(
  [list()],
  keyword()
) :: [list()]

Pads all sequences in a batch to the same length.

Parameters

  • batch - List of sequences
  • opts - Padding options

Options

  • :max_length - Target length (default: length of longest sequence)
  • :padding_value - Value to use for padding (default: 0)

Returns

List of padded sequences.
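
The default for :max_length (the longest sequence in the batch) can be sketched as below. BatchPadSketch is a hypothetical module, not the library's code; padding on the right and leaving over-long sequences untouched are assumptions.

```elixir
defmodule BatchPadSketch do
  # Hypothetical sketch: right-pads every sequence up to :max_length,
  # which defaults to the longest sequence in the batch. Sequences
  # longer than :max_length are left untouched here.
  def pad(batch, opts \\ []) do
    longest = Enum.reduce(batch, 0, fn seq, acc -> max(length(seq), acc) end)
    max_length = Keyword.get(opts, :max_length, longest)
    padding_value = Keyword.get(opts, :padding_value, 0)

    Enum.map(batch, fn seq ->
      seq ++ List.duplicate(padding_value, max(max_length - length(seq), 0))
    end)
  end
end

BatchPadSketch.pad([[1, 2, 3], [4]])
# => [[1, 2, 3], [4, 0, 0]]
```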

pad_sequence(seq, max_length, opts \\ [])

@spec pad_sequence(list(), non_neg_integer(), keyword()) :: list()

Pads or truncates a single sequence to a fixed length.

Parameters

  • seq - Single sequence (list)
  • max_length - Target length
  • opts - Padding options

Options

  • :padding_value - Value to use for padding (default: 0)
  • :truncate - Truncation strategy: :pre or :post (default: :post)

Returns

Sequence of length max_length.
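
A short sketch may help show how padding and the two truncation strategies differ. PadSketch is a hypothetical module written for illustration. It assumes that :post drops elements from the end, :pre drops them from the front, and padding is always appended on the right; the source does not spell out these details.

```elixir
defmodule PadSketch do
  # Hypothetical sketch. Reading :pre as "drop from the front" and
  # :post as "drop from the end" is an assumption, as is right-padding.
  def pad_sequence(seq, max_length, opts \\ []) do
    padding_value = Keyword.get(opts, :padding_value, 0)
    truncate = Keyword.get(opts, :truncate, :post)

    truncated =
      case truncate do
        :post -> Enum.take(seq, max_length)
        :pre -> Enum.take(seq, -max_length)
      end

    truncated ++ List.duplicate(padding_value, max_length - length(truncated))
  end
end

PadSketch.pad_sequence([1, 2, 3], 2)
# => [1, 2]
PadSketch.pad_sequence([1, 2, 3], 2, truncate: :pre)
# => [2, 3]
PadSketch.pad_sequence([1], 3)
# => [1, 0, 0]
```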

pad_sequences(sequences, max_length, opts \\ [])

@spec pad_sequences([list()], non_neg_integer(), keyword()) :: [list()]

Pads or truncates sequences to a fixed length.

Parameters

  • sequences - List of sequences (lists)
  • max_length - Target length
  • opts - Padding options

Options

  • :pad_value - Value to use for padding (default: 0)
  • :truncate - Truncation strategy: :pre or :post (default: :post)

Returns

List of sequences, all of length max_length.

tokenize_subwords(text, model)

Tokenizes text into subwords using BPE or similar (placeholder).

Returns

{:error, :not_implemented}