Tiktokenex.Pretokenizer (Tiktokenex v0.1.0)

Regex-based pre-tokenization that splits text into chunks before BPE.

Each encoding uses a specific regex pattern defined by OpenAI's tiktoken. The regex is compiled once at module load time.

Summary

Splits text into pre-tokenized chunks using the encoding's regex pattern.

@spec split(binary(), atom()) :: [binary()]

Splits text into pre-tokenized chunks using the encoding's regex pattern.

Returns a list of binary strings, each of which will be independently BPE-encoded.