Nasty.Statistics.Neural.Transformers.DataPreprocessor (Nasty v0.3.0)

Data preprocessing pipeline for fine-tuning transformer models.

Transforms Nasty tokens into transformer-compatible inputs with:

  • Subword tokenization alignment
  • Padding and truncation to max sequence length
  • Attention mask generation
  • Label alignment for subword tokens

Example

alias Nasty.AST.Token
alias Nasty.Statistics.Neural.Transformers.DataPreprocessor

tokens = [
  %Token{text: "The", pos: :det},
  %Token{text: "cat", pos: :noun}
]

label_map = %{det: 0, noun: 1}

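# `tokenizer` is assumed to be a Bumblebee tokenizer loaded beforehand.
# prepare_batch/4 takes a list of token sequences, so wrap the single sentence.
{:ok, batch} = DataPreprocessor.prepare_batch(
  [tokens],
  tokenizer,
  label_map,
  max_length: 512
)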

Summary

Functions

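align_labels(labels, word_ids, label_pad_id)

Aligns word-level labels with subword tokens.

create_label_map(labels)

Creates a label map from a list of unique labels.

extract_labels(token_sequences, key \\ :pos)

Extracts all unique labels from token sequences.

get_label(token, label_map, key \\ :pos)

Converts a Nasty token to a label ID using the label map.

pad_or_truncate(sequence, target_length, pad_value)

Pads or truncates a sequence to the target length.

prepare_batch(token_sequences, tokenizer, label_map, opts \\ [])

Prepares a batch of token sequences for transformer input.

process_sequence(tokens, tokenizer, label_map, max_length, label_pad_id)

Tokenizes a single sequence and aligns labels with subword tokens.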

Types

batch()

@type batch() :: %{
  input_ids: Nx.Tensor.t(),
  attention_mask: Nx.Tensor.t(),
  labels: Nx.Tensor.t()
}

label_map()

@type label_map() :: %{required(atom()) => integer()}

tokenizer()

@type tokenizer() :: map()

Functions

align_labels(labels, word_ids, label_pad_id)

@spec align_labels([integer()], [integer() | nil], integer()) :: [integer()]

Aligns word-level labels with subword tokens.

When a word is split into multiple subword tokens, the first subword gets the label and subsequent subwords get label_pad_id.

Strategy

  • First subword of word: original label
  • Subsequent subwords: label_pad_id (ignored in loss)
  • Special tokens (CLS, SEP): label_pad_id

Examples

labels = [1, 2, 3]
word_ids = [nil, 0, 0, 1, 2, 2, nil]  # nil = special token
align_labels(labels, word_ids, -100)
# => [-100, 1, -100, 2, 3, -100, -100]
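The strategy above can be sketched in plain Elixir. This is a minimal illustration, not Nasty's actual source: the module name AlignSketch is hypothetical, and falling back to label_pad_id when a word ID has no matching label is an assumption.

```elixir
defmodule AlignSketch do
  # Hypothetical re-implementation for illustration only.
  def align_labels(labels, word_ids, label_pad_id) do
    {aligned, _last_word_id} =
      Enum.map_reduce(word_ids, nil, fn
        # Special tokens (CLS, SEP, padding) carry no word ID.
        nil, _prev ->
          {label_pad_id, nil}

        # Same word ID as the previous subword: a continuation piece.
        word_id, word_id ->
          {label_pad_id, word_id}

        # First subword of a new word: take the word-level label.
        # Assumption: a missing label falls back to label_pad_id.
        word_id, _prev ->
          {Enum.at(labels, word_id, label_pad_id), word_id}
      end)

    aligned
  end
end

AlignSketch.align_labels([1, 2, 3], [nil, 0, 0, 1, 2, 2, nil], -100)
# => [-100, 1, -100, 2, 3, -100, -100]
```

Labelling only the first subword keeps one prediction per original word, so subword splitting does not change how much each word contributes to the loss.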

create_label_map(labels)

@spec create_label_map([atom()]) :: label_map()

Creates a label map from a list of unique labels. Each label is assigned its zero-based position in the list as its ID.

Examples

iex> create_label_map([:noun, :verb, :adj])
%{noun: 0, verb: 1, adj: 2}
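One way to get this behaviour is to pair each label with its index. The sketch below assumes the input list is already free of duplicates; LabelMapSketch is a hypothetical module name, not part of Nasty.

```elixir
defmodule LabelMapSketch do
  # Hypothetical re-implementation for illustration only.
  def create_label_map(labels) do
    labels
    # [{:noun, 0}, {:verb, 1}, {:adj, 2}]
    |> Enum.with_index()
    # A list of {key, value} tuples converts directly into a map.
    |> Map.new()
  end
end

label_map = LabelMapSketch.create_label_map([:noun, :verb, :adj])
label_map[:verb]
# => 1
```

Elixir maps are unordered, so the printed key order may differ from the input order while the assigned IDs stay the same.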

extract_labels(token_sequences, key \\ :pos)

@spec extract_labels([[Nasty.AST.Token.t()]], atom()) :: [atom()]

Extracts all unique labels from token sequences.

Examples

tokens = [
  [%Token{pos: :noun}, %Token{pos: :verb}],
  [%Token{pos: :adj}, %Token{pos: :noun}]
]

extract_labels(tokens)
# => [:noun, :verb, :adj]
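A possible implementation flattens the sequences, reads the requested field, and deduplicates. In this sketch the Token struct is a hypothetical stand-in for Nasty.AST.Token, and dropping tokens whose field is nil is an assumption rather than documented behaviour.

```elixir
defmodule ExtractSketch do
  defmodule Token do
    # Hypothetical stand-in for Nasty.AST.Token with only the fields used here.
    defstruct [:text, :pos, :entity_type]
  end

  def extract_labels(token_sequences, key \\ :pos) do
    token_sequences
    # Flatten one level: a list of sequences becomes one list of tokens.
    |> Enum.concat()
    |> Enum.map(&Map.get(&1, key))
    # Assumption: tokens without a value for `key` are skipped.
    |> Enum.reject(&is_nil/1)
    # Enum.uniq/1 keeps the first occurrence, so the order is stable.
    |> Enum.uniq()
  end
end
```

The result can be passed straight to create_label_map/1 to build the ID mapping for a training set.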

get_label(token, label_map, key \\ :pos)

@spec get_label(Nasty.AST.Token.t(), label_map(), atom()) :: integer()

Converts a Nasty token to a label ID using the label map.

Supports multiple label extraction strategies:

  • :pos - Part-of-speech tag
  • :entity_type - Named entity type
  • Custom key from token struct

Examples

iex> get_label(%Token{pos: :noun}, %{noun: 1})
1

iex> get_label(%Token{entity_type: :person}, %{person: 0}, :entity_type)
0
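The lookup is two steps: read the field named by key from the token, then map that value to its ID. The sketch below uses plain maps in place of token structs, and it raises on an unknown field or label; the documentation does not say how the real function handles those cases, so that choice is an assumption.

```elixir
defmodule GetLabelSketch do
  # Hypothetical re-implementation for illustration only.
  def get_label(token, label_map, key \\ :pos) do
    # Read the label from the token (structs are maps, so Map.fetch!/2 works).
    label = Map.fetch!(token, key)

    # Assumption: a label missing from the map is a hard error (KeyError).
    Map.fetch!(label_map, label)
  end
end

GetLabelSketch.get_label(%{pos: :noun}, %{noun: 1})
# => 1
```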

pad_or_truncate(sequence, target_length, pad_value)

@spec pad_or_truncate([integer()], integer(), integer()) :: [integer()]

Pads a sequence with pad_value up to target_length, or truncates it to target_length if it is longer.

Examples

iex> pad_or_truncate([1, 2, 3], 5, 0)
[1, 2, 3, 0, 0]

iex> pad_or_truncate([1, 2, 3, 4, 5], 3, 0)
[1, 2, 3]
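Both cases can be handled in one expression by truncating first and then appending however much padding is still missing. PadSketch is a hypothetical module name, and the guard against negative lengths is an assumption.

```elixir
defmodule PadSketch do
  # Hypothetical re-implementation for illustration only.
  def pad_or_truncate(sequence, target_length, pad_value) when target_length >= 0 do
    # Enum.take/2 is a no-op when the sequence is already short enough.
    truncated = Enum.take(sequence, target_length)

    # List.duplicate/2 returns [] when no padding is needed.
    truncated ++ List.duplicate(pad_value, target_length - length(truncated))
  end
end

PadSketch.pad_or_truncate([1, 2, 3], 5, 0)
# => [1, 2, 3, 0, 0]
```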

prepare_batch(token_sequences, tokenizer, label_map, opts \\ [])

@spec prepare_batch([[Nasty.AST.Token.t()]], tokenizer(), label_map(), keyword()) ::
  {:ok, batch()} | {:error, term()}

Prepares a batch of token sequences for transformer input.

Parameters

  • token_sequences - List of token sequences (each sequence is a list of Nasty tokens)
  • tokenizer - Bumblebee tokenizer
  • label_map - Map from POS tags/labels to integer IDs
  • opts - Keyword list of options (see Options)

Options

  • :max_length - Maximum sequence length (default: 512)
  • :padding - Padding strategy (:max_length or :longest, default: :max_length)
  • :truncation - Enable truncation (default: true)
  • :label_pad_id - ID to use for padded labels (default: -100)

Returns

  • {:ok, batch} - Preprocessed batch with tensors
  • {:error, reason} - Error during preprocessing
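To illustrate how the :padding and :max_length options might interact, the sketch below picks one common length for the whole batch and pads every sequence to it. It works on plain lists rather than Nx tensors, and it assumes that :longest is still capped at :max_length when truncation is enabled; BatchSketch and its functions are hypothetical names.

```elixir
defmodule BatchSketch do
  # Hypothetical helper: choose the length shared by all sequences in a batch.
  def target_length(sequences, opts \\ []) do
    max_length = Keyword.get(opts, :max_length, 512)

    case Keyword.get(opts, :padding, :max_length) do
      :max_length ->
        max_length

      :longest ->
        longest = Enum.reduce(sequences, 0, fn seq, acc -> max(length(seq), acc) end)
        # Assumption: with truncation on, never exceed max_length.
        min(longest, max_length)
    end
  end

  # Pads or truncates every sequence to the shared target length.
  def pad_batch(sequences, pad_value, opts \\ []) do
    target = target_length(sequences, opts)

    Enum.map(sequences, fn seq ->
      kept = Enum.take(seq, target)
      kept ++ List.duplicate(pad_value, target - length(kept))
    end)
  end
end

BatchSketch.pad_batch([[1, 2, 3], [4]], 0, padding: :longest)
# => [[1, 2, 3], [4, 0, 0]]
```

Padding to the longest sequence keeps batches of short sentences small, while padding to :max_length gives every batch the same shape.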

process_sequence(tokens, tokenizer, label_map, max_length, label_pad_id)

@spec process_sequence(
  [Nasty.AST.Token.t()],
  tokenizer(),
  label_map(),
  integer(),
  integer()
) ::
  {:ok, map()} | {:error, term()}

Tokenizes a single sequence and aligns labels with subword tokens.

Parameters

  • tokens - List of Nasty tokens
  • tokenizer - Bumblebee tokenizer
  • label_map - Label to ID mapping
  • max_length - Maximum sequence length
  • label_pad_id - Padding ID for labels

Returns

  • {:ok, %{input_ids: list, attention_mask: list, labels: list}}
  • {:error, reason}
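The shape of the result can be illustrated with a sketch that starts after subword tokenization. It assumes the tokenizer has already produced subword IDs and word IDs (the token IDs below are made up), uses 0 as the padding token ID, and truncates naively, which may drop a trailing SEP token. SequenceSketch and finalize/6 are hypothetical names.

```elixir
defmodule SequenceSketch do
  # Hypothetical helper: turns tokenizer output plus word-level label IDs
  # into fixed-length input_ids, attention_mask and labels lists.
  def finalize(subword_ids, word_ids, word_labels, max_length, label_pad_id, pad_token_id \\ 0) do
    kept_ids = Enum.take(subword_ids, max_length)
    kept_word_ids = Enum.take(word_ids, max_length)
    pad_count = max_length - length(kept_ids)

    %{
      input_ids: kept_ids ++ List.duplicate(pad_token_id, pad_count),
      # 1 marks a real token, 0 marks padding the model should not attend to.
      attention_mask: List.duplicate(1, length(kept_ids)) ++ List.duplicate(0, pad_count),
      labels: align(word_labels, kept_word_ids, label_pad_id) ++ List.duplicate(label_pad_id, pad_count)
    }
  end

  # First subword of each word gets the label; everything else gets the pad ID.
  defp align(word_labels, word_ids, label_pad_id) do
    Enum.zip_with(word_ids, [nil | word_ids], fn
      nil, _prev -> label_pad_id
      id, id -> label_pad_id
      id, _prev -> Enum.at(word_labels, id, label_pad_id)
    end)
  end
end

# "[CLS] the cat ##s [SEP]" with made-up token IDs.
SequenceSketch.finalize([101, 1996, 4937, 2015, 102], [nil, 0, 1, 1, nil], [0, 1], 8, -100)
# => %{
#      input_ids: [101, 1996, 4937, 2015, 102, 0, 0, 0],
#      attention_mask: [1, 1, 1, 1, 1, 0, 0, 0],
#      labels: [-100, 0, 1, -100, -100, -100, -100, -100]
#    }
```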