Nasty.Statistics.Neural.Transformers.TokenizerAdapter (Nasty v0.3.0)
Bridges between Nasty's word-level tokens and transformer subword tokenization.
Transformers use subword tokenization (e.g. BPE, WordPiece), which splits a word into one or more tokens. This module handles:
- Converting Nasty tokens to transformer input
- Aligning transformer predictions back to original tokens
- Managing special tokens ([CLS], [SEP], etc.)
Summary
Functions
Aligns transformer predictions back to original tokens.
Extracts only predictions for real tokens (ignoring special tokens).
Tokenizes Nasty tokens for transformer input.
Types
@type alignment_map() :: %{required(integer()) => subword_range()}
@type tokenizer_output() :: %{
        input_ids: Nx.Tensor.t(),
        attention_mask: Nx.Tensor.t(),
        alignment_map: alignment_map(),
        special_token_mask: [boolean()]
      }
Functions
@spec align_predictions(Nx.Tensor.t() | [map()], alignment_map(), keyword()) :: [map()] | {:error, term()}
Aligns transformer predictions back to original tokens.
Takes predictions for each subword token and aggregates them to produce one prediction per original token.
Strategies
- :first - Use prediction from first subword (default)
- :average - Average predictions across all subwords
- :max - Use maximum prediction across subwords
Examples
predictions = align_predictions(subword_preds, alignment_map, strategy: :first)
# => [%{label: "NOUN", score: 0.95}, ...]
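The aggregation behind the :first strategy can be sketched in plain Elixir. This is a hypothetical illustration with inline data, not the module's implementation: the subword predictions, labels, and alignment map values are assumptions, but the alignment-map shape (token index to inclusive {first, last} subword range) matches the example output of tokenize_for_transformer/3 below.

```elixir
# Hypothetical per-subword predictions; index 0 is a special token ([CLS]).
subword_preds = [
  %{label: "X", score: 0.0},
  %{label: "NOUN", score: 0.95},
  %{label: "NOUN", score: 0.90},
  %{label: "VERB", score: 0.80}
]

# Token 0 spans subwords 1..2, token 1 spans subword 3 only.
alignment_map = %{0 => {1, 2}, 1 => {3, 3}}

# :first strategy - keep the prediction of each token's first subword.
token_preds =
  alignment_map
  |> Enum.sort_by(fn {token_idx, _range} -> token_idx end)
  |> Enum.map(fn {_token_idx, {first, _last}} -> Enum.at(subword_preds, first) end)

# => [%{label: "NOUN", score: 0.95}, %{label: "VERB", score: 0.80}]
```

The :average and :max strategies would instead reduce over the whole first..last range before emitting one prediction per token.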
remove_special_tokens(predictions, special_token_mask)
Extracts only predictions for real tokens (ignoring special tokens).
Examples
real_predictions = remove_special_tokens(predictions, special_token_mask)
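A minimal sketch of what this filtering amounts to, assuming the special_token_mask is a boolean list aligned position-for-position with the predictions (true marks a special token such as [CLS] or [SEP]); the data here is hypothetical:

```elixir
predictions = [%{label: "X"}, %{label: "NOUN"}, %{label: "VERB"}, %{label: "X"}]
special_token_mask = [true, false, false, true]

# Pair each prediction with its mask entry and drop the special positions.
real_predictions =
  predictions
  |> Enum.zip(special_token_mask)
  |> Enum.reject(fn {_pred, special?} -> special? end)
  |> Enum.map(fn {pred, _special?} -> pred end)

# => [%{label: "NOUN"}, %{label: "VERB"}]
```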
@spec tokenize_for_transformer([Nasty.AST.Token.t()], map(), keyword()) :: {:ok, tokenizer_output()} | {:error, term()}
Tokenizes Nasty tokens for transformer input.
Returns input tensors and an alignment map that tracks which subword tokens correspond to which original tokens.
Options
- :max_length - Maximum sequence length (default: 512)
- :padding - Padding strategy: :max_length or :none (default: :max_length)
- :truncation - Whether to truncate long sequences (default: true)
Examples
{:ok, output} = TokenizerAdapter.tokenize_for_transformer(tokens, tokenizer)
# => %{
# input_ids: #Nx.Tensor<...>,
# attention_mask: #Nx.Tensor<...>,
# alignment_map: %{0 => {1, 2}, 1 => {3, 3}, ...},
# special_token_mask: [true, false, false, ...]
# }
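To make the alignment map concrete, here is a hypothetical reading of the ranges shown above: each original-token index maps to the inclusive {first, last} positions of its subwords in the transformer sequence. The subword strings are invented for illustration:

```elixir
alignment_map = %{0 => {1, 2}, 1 => {3, 3}}
subwords = ["[CLS]", "play", "##ing", "fields", "[SEP]"]

# Recover the subword pieces belonging to each original token.
grouped =
  alignment_map
  |> Enum.sort()
  |> Enum.map(fn {token_idx, {first, last}} ->
    {token_idx, Enum.slice(subwords, first..last)}
  end)

# => [{0, ["play", "##ing"]}, {1, ["fields"]}]
```

Position 0 ([CLS]) appears in no range, which is why the special_token_mask starts with true.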