Tiktokenex.BPE (Tiktokenex v0.1.0)

Core Byte-Pair Encoding merge algorithm.

Given a sequence of bytes and a rank map, repeatedly merges the adjacent pair whose concatenation has the lowest rank, stopping once no adjacent pair has an entry in the rank map.
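The merge loop described above can be sketched roughly as follows. This is a minimal illustration, not the Tiktokenex implementation: the module name `BPESketch` and its helper functions are hypothetical, the toy rank map is invented for the example, and the sketch assumes the rank map contains an entry for every single byte that occurs in the chunk. It rescans the whole list on each merge, which is simple but quadratic in the chunk length.

```elixir
defmodule BPESketch do
  @moduledoc false
  # Hypothetical illustration module; not part of Tiktokenex.

  # Split the chunk into single-byte binaries, merge them, then look up
  # the rank of each remaining part.
  # Assumption: every single byte in the chunk has an entry in `ranks`;
  # otherwise Map.fetch!/2 raises a KeyError.
  def encode(chunk, ranks) when is_binary(chunk) and is_map(ranks) do
    parts = for <<byte <- chunk>>, do: <<byte>>

    parts
    |> merge(ranks)
    |> Enum.map(&Map.fetch!(ranks, &1))
  end

  # Repeatedly merge the adjacent pair whose concatenation has the lowest
  # rank, until no adjacent pair is present in the rank map.
  defp merge(parts, ranks) do
    case best_pair(parts, ranks) do
      nil -> parts
      {_rank, index} -> parts |> merge_at(index) |> merge(ranks)
    end
  end

  # Returns {rank, index} for the lowest-ranked mergeable pair (leftmost on
  # ties), or nil when no adjacent pair is in the rank map.
  defp best_pair(parts, ranks) do
    parts
    |> Enum.chunk_every(2, 1, :discard)
    |> Enum.with_index()
    |> Enum.flat_map(fn {[left, right], index} ->
      case Map.fetch(ranks, left <> right) do
        {:ok, rank} -> [{rank, index}]
        :error -> []
      end
    end)
    |> Enum.min(fn -> nil end)
  end

  # Replace the parts at `index` and `index + 1` with their concatenation.
  defp merge_at(parts, index) do
    {head, [left, right | tail]} = Enum.split(parts, index)
    head ++ [left <> right | tail]
  end
end

# Toy rank map, invented for this example.
ranks = %{"a" => 0, "b" => 1, "c" => 2, "ab" => 3, "bc" => 4}

# "ab" (rank 3) is merged before "bc" (rank 4), leaving ["ab", "c"].
IO.inspect(BPESketch.encode("abc", ranks), charlists: :as_lists)
# prints [3, 2]
```

In the toy run, both "ab" and "bc" are candidate merges at the first step. Because "ab" has the lower rank it wins, and the remaining parts "ab" and "c" cannot be merged further since "abc" is not in the map.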

Functions

encode(chunk, ranks)

@spec encode(binary(), %{required(binary()) => non_neg_integer()}) :: [
  non_neg_integer()
]

Encodes a binary chunk into a list of token rank integers using BPE.

The input should be a single pre-tokenized chunk, as produced by Pretokenizer.split/2. Returns a list of integer token IDs, where each ID is the token's rank in the rank map.