Tinkex.TrainingClient.DataProcessor (Tinkex v0.3.4)


Data chunking, numbering, and tensor operations for TrainingClient.

This module handles:

  • Chunking training data based on size limits
  • Estimating chunk sizes using byte heuristics
  • Building placeholder gradients for custom loss
  • Extracting target tokens from loss function inputs

Summary

Functions

allocate_request_ids(count, counter)

Allocate sequential request IDs for a batch of requests.

build_placeholder_gradients(data)

Build placeholder gradients (zeros) for custom loss computation.

chunk_data(data)

Chunk data into manageable pieces based on size and byte limits.

fetch_target_tokens_tensor(datum)

Extract target_tokens tensor from a datum's loss_fn_inputs.

Functions

allocate_request_ids(count, counter)

@spec allocate_request_ids(non_neg_integer(), pos_integer()) ::
  {[pos_integer()], pos_integer()}

Allocate sequential request IDs for a batch of requests.

Returns {[id1, id2, ...], new_counter} where the IDs are consecutive starting from the current counter.
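The allocation scheme can be sketched as a pure counter-threading function. This is an illustrative stand-in, not the library implementation; the module name `RequestIds` is hypothetical, but the shape matches the documented spec.

```elixir
defmodule RequestIds do
  # Sketch of sequential ID allocation: return `count` consecutive IDs
  # starting at `counter`, plus the counter to use for the next batch.
  @spec allocate(non_neg_integer(), pos_integer()) :: {[pos_integer()], pos_integer()}
  def allocate(count, counter) do
    # The `//1` step keeps the range empty (not descending) when count == 0.
    ids = Enum.to_list(counter..(counter + count - 1)//1)
    {ids, counter + count}
  end
end
```

Threading the returned counter into the next call is what keeps IDs unique across batches.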

build_placeholder_gradients(data)

@spec build_placeholder_gradients([Tinkex.Types.Datum.t()]) ::
  {:ok, [Nx.Tensor.t()]} | {:error, Tinkex.Error.t()}

Build placeholder gradients (zeros) for custom loss computation.

Creates zero-filled tensors matching the shape of target_tokens for each datum. These are used as placeholder gradients before the actual loss computation.
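A minimal sketch of the zeros-like behavior, with two loud assumptions: flat lists stand in for `Nx.Tensor` values, and each datum is assumed to carry `target_tokens` under `loss_fn_inputs` (the actual `Tinkex.Types.Datum` layout is not shown here). The real function also returns `{:error, Tinkex.Error.t()}` on failure, which this sketch omits.

```elixir
defmodule Gradients do
  # Sketch only: builds a zero-filled "gradient" matching the length of each
  # datum's target_tokens. Real code would produce Nx tensors of the same shape.
  def build_placeholder_gradients(data) do
    grads =
      Enum.map(data, fn %{loss_fn_inputs: %{target_tokens: tokens}} ->
        List.duplicate(0.0, length(tokens))
      end)

    {:ok, grads}
  end
end
```

With Nx available, the per-datum step would be a broadcast of `0.0` to the target tensor's shape rather than `List.duplicate/2`.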

chunk_data(data)

@spec chunk_data(list()) :: [list()]

Chunk data into manageable pieces based on size and byte limits.

Ensures no chunk exceeds:

  • 1024 items
  • 5,000,000 total estimated bytes (≈5 MB)
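One way to enforce both limits is greedy chunking: close the current chunk whenever adding the next item would exceed either bound. The sketch below assumes this strategy and uses the external term size as a hypothetical byte heuristic; neither detail is confirmed by the documentation above.

```elixir
defmodule Chunker do
  @max_items 1024
  @max_bytes 5_000_000

  # Greedy chunking sketch: start a new chunk when the next item would push
  # the current chunk past the item or byte limit.
  def chunk(data) do
    Enum.chunk_while(data, {[], 0, 0}, &step/2, &finish/1)
  end

  defp step(item, {acc, n, bytes}) do
    size = estimate_bytes(item)

    cond do
      # An empty chunk always accepts the item, even an oversized one.
      acc == [] ->
        {:cont, {[item], 1, size}}

      n + 1 > @max_items or bytes + size > @max_bytes ->
        {:cont, Enum.reverse(acc), {[item], 1, size}}

      true ->
        {:cont, {[item | acc], n + 1, bytes + size}}
    end
  end

  defp finish({[], _, _}), do: {:cont, []}
  defp finish({acc, _, _}), do: {:cont, Enum.reverse(acc), []}

  # Hypothetical byte heuristic: size of the serialized term.
  defp estimate_bytes(item), do: byte_size(:erlang.term_to_binary(item))
end
```

Greedy chunking preserves input order, which matters when chunk contents are paired back with sequentially allocated request IDs.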

fetch_target_tokens_tensor(datum)

@spec fetch_target_tokens_tensor(Tinkex.Types.Datum.t()) ::
  {:ok, Nx.Tensor.t()} | {:error, Tinkex.Error.t()}

Extract target_tokens tensor from a datum's loss_fn_inputs.

Supports both TensorData and Nx.Tensor formats.
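The two-format lookup can be sketched as a `Map.fetch/2` followed by a pattern match on the value's type. Everything structural here is an assumption: the locally defined `TensorData` struct is a stand-in for `Tinkex.Types.TensorData` (whose fields are not documented above), and the datum shape is hypothetical.

```elixir
defmodule TensorFetch do
  # Hypothetical stand-in for Tinkex.Types.TensorData (fields assumed).
  defmodule TensorData, do: defstruct([:data, :shape])

  # Sketch: look up :target_tokens in loss_fn_inputs and normalize it.
  # A TensorData value is unwrapped; anything else present is assumed to
  # already be a tensor and passed through.
  def fetch(%{loss_fn_inputs: inputs}) do
    case Map.fetch(inputs, :target_tokens) do
      {:ok, %TensorData{data: data}} -> {:ok, data}
      {:ok, tensor} when not is_nil(tensor) -> {:ok, tensor}
      _ -> {:error, :missing_target_tokens}
    end
  end
end
```

The real function wraps failures in `Tinkex.Error.t()` rather than a bare atom.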