Nasty.Statistics.Neural.Transformers.DataPreprocessor (Nasty v0.3.0)
Data preprocessing pipeline for fine-tuning transformer models.
Transforms Nasty tokens into transformer-compatible inputs with:
- Subword tokenization alignment
- Padding and truncation to max sequence length
- Attention mask generation
- Label alignment for subword tokens
Example
alias Nasty.AST.Token
alias Nasty.Statistics.Neural.Transformers.DataPreprocessor
tokens = [
%Token{text: "The", pos: :det},
%Token{text: "cat", pos: :noun}
]
label_map = %{det: 0, noun: 1}
# tokenizer obtained via Bumblebee.load_tokenizer/1
{:ok, batch} = DataPreprocessor.prepare_batch(
  [tokens],
  tokenizer,
  label_map,
  max_length: 512
)
Summary
Functions
Aligns word-level labels with subword tokens.
Creates label map from list of unique labels.
Extracts all unique labels from token sequences.
Converts Nasty token to label ID using label map.
Pads or truncates a sequence to target length.
Prepares a batch of token sequences for transformer input.
Tokenizes a single sequence and aligns labels with subword tokens.
Types
@type batch() :: %{
  input_ids: Nx.Tensor.t(),
  attention_mask: Nx.Tensor.t(),
  labels: Nx.Tensor.t()
}
@type tokenizer() :: map()
Functions
Aligns word-level labels with subword tokens.
When a word is split into multiple subword tokens, the first subword gets the label and subsequent subwords get label_pad_id.
Strategy
- First subword of word: original label
- Subsequent subwords: label_pad_id (ignored in loss)
- Special tokens (CLS, SEP): label_pad_id
Examples
labels = [1, 2, 3]
word_ids = [nil, 0, 0, 1, 2, 2, nil] # nil = special token
align_labels(labels, word_ids, -100)
# => [-100, 1, -100, 2, 3, -100, -100]
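The alignment strategy above can be sketched as a single pass that remembers the previous word index; AlignSketch is a hypothetical module name for illustration, not part of Nasty:

```elixir
defmodule AlignSketch do
  # labels: word-level label IDs
  # word_ids: one word index per subword token (nil = special token)
  def align_labels(labels, word_ids, label_pad_id) do
    word_ids
    |> Enum.map_reduce(nil, fn word_id, prev ->
      cond do
        # Special tokens (CLS, SEP) are ignored in the loss
        is_nil(word_id) -> {label_pad_id, prev}
        # Subsequent subwords of the same word are also ignored
        word_id == prev -> {label_pad_id, prev}
        # First subword of a word carries the original label
        true -> {Enum.at(labels, word_id), word_id}
      end
    end)
    |> elem(0)
  end
end

AlignSketch.align_labels([1, 2, 3], [nil, 0, 0, 1, 2, 2, nil], -100)
# => [-100, 1, -100, 2, 3, -100, -100]
```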
Creates label map from list of unique labels.
Examples
iex> create_label_map([:noun, :verb, :adj])
%{noun: 0, verb: 1, adj: 2}
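A minimal sketch of this mapping, assuming IDs are assigned in order of appearance:

```elixir
# Pair each label with its index, then collect into a map
create_label_map = fn labels ->
  labels |> Enum.with_index() |> Map.new()
end

create_label_map.([:noun, :verb, :adj])
# => %{noun: 0, verb: 1, adj: 2}
```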
@spec extract_labels([[Nasty.AST.Token.t()]], atom()) :: [atom()]
Extracts all unique labels from token sequences.
Examples
tokens = [
[%Token{pos: :noun}, %Token{pos: :verb}],
[%Token{pos: :adj}, %Token{pos: :noun}]
]
extract_labels(tokens)
# => [:noun, :verb, :adj]
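A sketch of the extraction, using plain maps in place of %Token{} structs so the snippet is self-contained; the label key (here :pos) is passed explicitly:

```elixir
# Flatten all sequences, read the label key from each token,
# and keep unique labels in order of first appearance
extract_labels = fn sequences, key ->
  sequences
  |> Enum.flat_map(fn tokens -> Enum.map(tokens, &Map.get(&1, key)) end)
  |> Enum.uniq()
end

extract_labels.([[%{pos: :noun}, %{pos: :verb}], [%{pos: :adj}, %{pos: :noun}]], :pos)
# => [:noun, :verb, :adj]
```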
@spec get_label(Nasty.AST.Token.t(), label_map(), atom()) :: integer()
Converts Nasty token to label ID using label map.
Supports multiple label extraction strategies:
- :pos - Part-of-speech tag
- :entity_type - Named entity type
- Custom key from the token struct
Examples
iex> get_label(%Token{pos: :noun}, %{noun: 1})
1
iex> get_label(%Token{entity_type: :person}, %{person: 0}, :entity_type)
0
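The lookup amounts to two map reads, sketched here with plain maps standing in for %Token{} structs:

```elixir
# Read the label atom from the token under the given key,
# then map it to its integer ID
get_label = fn token, label_map, key ->
  Map.get(label_map, Map.get(token, key))
end

get_label.(%{pos: :noun}, %{noun: 1}, :pos)
# => 1
get_label.(%{entity_type: :person}, %{person: 0}, :entity_type)
# => 0
```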
Pads or truncates a sequence to target length.
Examples
iex> pad_or_truncate([1, 2, 3], 5, 0)
[1, 2, 3, 0, 0]
iex> pad_or_truncate([1, 2, 3, 4, 5], 3, 0)
[1, 2, 3]
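A minimal sketch of the pad-or-truncate behavior shown in the doctests above:

```elixir
pad_or_truncate = fn seq, target, pad ->
  case length(seq) do
    # Too long (or exact): truncate to the target length
    n when n >= target -> Enum.take(seq, target)
    # Too short: extend with the pad value
    n -> seq ++ List.duplicate(pad, target - n)
  end
end

pad_or_truncate.([1, 2, 3], 5, 0)
# => [1, 2, 3, 0, 0]
pad_or_truncate.([1, 2, 3, 4, 5], 3, 0)
# => [1, 2, 3]
```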
@spec prepare_batch([[Nasty.AST.Token.t()]], tokenizer(), label_map(), keyword()) :: {:ok, batch()} | {:error, term()}
Prepares a batch of token sequences for transformer input.
Parameters
- token_sequences - List of token lists
- tokenizer - Bumblebee tokenizer
- label_map - Map from POS tags/labels to integer IDs
- opts - Options
Options
- :max_length - Maximum sequence length (default: 512)
- :padding - Padding strategy (:max_length or :longest, default: :max_length)
- :truncation - Enable truncation (default: true)
- :label_pad_id - ID to use for padded labels (default: -100)
Returns
- {:ok, batch} - Preprocessed batch with tensors
- {:error, reason} - Error during preprocessing
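As a rough illustration of the padding and attention-mask steps inside batch preparation (plain lists stand in for the real tokenizer output, and the token IDs are made up; the real function would convert the results to Nx tensors at the end):

```elixir
# Pad a sequence with pad_id up to target length, then cap at target
pad = fn seq, target, pad_id ->
  Enum.take(seq ++ List.duplicate(pad_id, target), target)
end

max_length = 6
sequences = [[101, 7, 8, 102], [101, 9, 102]]

input_ids = Enum.map(sequences, &pad.(&1, max_length, 0))
# => [[101, 7, 8, 102, 0, 0], [101, 9, 102, 0, 0, 0]]

# Mask is 1 over real tokens, 0 over padding
attention_mask =
  Enum.map(sequences, fn seq ->
    pad.(List.duplicate(1, length(seq)), max_length, 0)
  end)
# => [[1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 0]]

# prepare_batch would then wrap these in tensors, e.g.
# %{input_ids: Nx.tensor(input_ids), attention_mask: Nx.tensor(attention_mask), ...}
```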
@spec process_sequence([Nasty.AST.Token.t()], tokenizer(), label_map(), integer(), integer()) :: {:ok, map()} | {:error, term()}
Tokenizes a single sequence and aligns labels with subword tokens.
Parameters
- tokens - List of Nasty tokens
- tokenizer - Bumblebee tokenizer
- label_map - Label to ID mapping
- max_length - Maximum sequence length
- label_pad_id - Padding ID for labels
Returns
- {:ok, %{input_ids: list, attention_mask: list, labels: list}}
- {:error, reason}