Nasty.Statistics.Neural.Transformers.TokenizerAdapter (Nasty v0.3.0)


Bridges between Nasty's word-level tokens and transformer subword tokenization.

Transformers use subword tokenization (BPE, WordPiece), which can split a single word into several subword tokens. This module handles:

  • Converting Nasty tokens to transformer input
  • Aligning transformer predictions back to original tokens
  • Managing special tokens ([CLS], [SEP], etc.)
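To make the alignment idea concrete, here is a small illustration in plain Elixir. The subword splits and token values below are invented for the example (they are not the output of a real tokenizer): one word-level token can map to a range of subword positions, and the alignment map records that range.

```elixir
# Word-level tokens as Nasty might produce them:
words = ["unbelievable", "results"]

# WordPiece-style subwords (invented splits), with [CLS]/[SEP]
# wrapping the sequence:
subwords = ["[CLS]", "un", "##believ", "##able", "results", "[SEP]"]

# alignment_map: word index => {first, last} subword index (inclusive):
alignment_map = %{0 => {1, 3}, 1 => {4, 4}}

# special_token_mask marks the [CLS]/[SEP] positions:
special_token_mask = [true, false, false, false, false, true]

for {word, i} <- Enum.with_index(words) do
  {first, last} = alignment_map[i]
  IO.puts("#{word} -> #{inspect(Enum.slice(subwords, first..last))}")
end
```

The `{start, end}` ranges are what lets predictions made per subword be folded back into one prediction per original word.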

Summary

Functions

align_predictions(subword_predictions, alignment_map, opts \\ [])
Aligns transformer predictions back to original tokens.

remove_special_tokens(predictions, special_token_mask)
Extracts only predictions for real tokens (ignoring special tokens).

tokenize_for_transformer(tokens, tokenizer, opts \\ [])
Tokenizes Nasty tokens for transformer input.

Types

alignment_map()

@type alignment_map() :: %{required(integer()) => subword_range()}

subword_range()

@type subword_range() :: {start_index :: integer(), end_index :: integer()}

tokenizer_output()

@type tokenizer_output() :: %{
  input_ids: Nx.Tensor.t(),
  attention_mask: Nx.Tensor.t(),
  alignment_map: alignment_map(),
  special_token_mask: [boolean()]
}

Functions

align_predictions(subword_predictions, alignment_map, opts \\ [])

@spec align_predictions(Nx.Tensor.t() | [map()], alignment_map(), keyword()) ::
  [map()] | {:error, term()}

Aligns transformer predictions back to original tokens.

Takes predictions for each subword token and aggregates them to produce one prediction per original token.

Strategies

  • :first - Use prediction from first subword (default)
  • :average - Average predictions across all subwords
  • :max - Use maximum prediction across subwords

Examples

predictions = align_predictions(subword_preds, alignment_map, strategy: :first)
# => [%{label: "NOUN", score: 0.95}, ...]
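A minimal sketch of what the three strategies do, in plain Elixir. This is an illustration of the aggregation semantics, not the library's implementation; the prediction values and alignment map below are invented.

```elixir
# Given per-subword predictions and an alignment map, produce one
# prediction per original token.
aggregate = fn subword_preds, alignment_map, strategy ->
  alignment_map
  |> Enum.sort_by(fn {token_idx, _range} -> token_idx end)
  |> Enum.map(fn {_token_idx, {first, last}} ->
    preds = Enum.slice(subword_preds, first..last)

    case strategy do
      # :first - keep the prediction of the first subword
      :first ->
        hd(preds)

      # :max - keep the subword prediction with the highest score
      :max ->
        Enum.max_by(preds, & &1.score)

      # :average - average the scores, keep the first subword's label
      :average ->
        avg = Enum.sum(Enum.map(preds, & &1.score)) / length(preds)
        %{hd(preds) | score: avg}
    end
  end)
end

subword_preds = [
  %{label: "NOUN", score: 0.95},
  %{label: "NOUN", score: 0.85},
  %{label: "VERB", score: 0.60}
]

# Token 0 spans subwords 0..1, token 1 is subword 2:
alignment_map = %{0 => {0, 1}, 1 => {2, 2}}

aggregate.(subword_preds, alignment_map, :first)
# => [%{label: "NOUN", score: 0.95}, %{label: "VERB", score: 0.6}]
```

`:first` is a common default for token classification because the first subword usually carries the word's label; `:average` smooths noisy per-subword scores.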

remove_special_tokens(predictions, special_token_mask)

@spec remove_special_tokens([map()], [boolean()]) :: [map()]

Extracts only predictions for real tokens (ignoring special tokens).

Examples

real_predictions = remove_special_tokens(predictions, special_token_mask)
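The filtering semantics can be sketched in a few lines of plain Elixir (an assumed illustration of the behavior, not the library source): drop each prediction whose position is marked `true` in the special-token mask.

```elixir
# Zip predictions with the mask and keep only non-special positions.
remove_special = fn predictions, special_token_mask ->
  predictions
  |> Enum.zip(special_token_mask)
  |> Enum.reject(fn {_pred, special?} -> special? end)
  |> Enum.map(fn {pred, _special?} -> pred end)
end

predictions = [
  %{label: "X", score: 1.0},    # prediction at the [CLS] position
  %{label: "NOUN", score: 0.95},
  %{label: "X", score: 1.0}     # prediction at the [SEP] position
]

remove_special.(predictions, [true, false, true])
# => [%{label: "NOUN", score: 0.95}]
```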

tokenize_for_transformer(tokens, tokenizer, opts \\ [])

@spec tokenize_for_transformer([Nasty.AST.Token.t()], map(), keyword()) ::
  {:ok, tokenizer_output()} | {:error, term()}

Tokenizes Nasty tokens for transformer input.

Returns input tensors and an alignment map that tracks which subword tokens correspond to which original tokens.

Options

  • :max_length - Maximum sequence length (default: 512)
  • :padding - Padding strategy: :max_length or :none (default: :max_length)
  • :truncation - Whether to truncate long sequences (default: true)

Examples

{:ok, output} = TokenizerAdapter.tokenize_for_transformer(tokens, tokenizer)
# => %{
#   input_ids: #Nx.Tensor<...>,
#   attention_mask: #Nx.Tensor<...>,
#   alignment_map: %{0 => {1, 2}, 1 => {3, 3}, ...},
#   special_token_mask: [true, false, false, ...]
# }
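The shape of the alignment map in the output above can be understood with a short sketch. Assuming we already know how many subwords each word produced (the counts below are invented), and that a [CLS] token occupies position 0 with [SEP] after the last subword, the map and mask could be built like this in plain Elixir:

```elixir
# word 0 -> 3 subwords, word 1 -> 1 subword, word 2 -> 2 subwords
subword_counts = [3, 1, 2]

# Walk the words, assigning each an inclusive {first, last} subword
# range; position 0 is reserved for [CLS], so ranges start at 1.
{alignment_map, sep_pos} =
  Enum.reduce(Enum.with_index(subword_counts), {%{}, 1}, fn
    {count, word_idx}, {map, pos} ->
      {Map.put(map, word_idx, {pos, pos + count - 1}), pos + count}
  end)

# Only the [CLS] (0) and [SEP] (sep_pos) positions are special:
special_token_mask =
  Enum.map(0..sep_pos, fn pos -> pos == 0 or pos == sep_pos end)

alignment_map
# => %{0 => {1, 3}, 1 => {4, 4}, 2 => {5, 6}}
```

Note the ranges are contiguous and inclusive, matching the `{1, 2}`-style entries shown in the example output above.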