Tiktokenex.Pretokenizer (Tiktokenex v0.1.0)

Copy Markdown View Source

Regex-based pre-tokenization that splits text into chunks before BPE.

Each encoding uses a specific regex pattern defined by OpenAI's tiktoken. The regex is compiled once at module load time.

Summary

Functions

Splits text into pre-tokenized chunks using the encoding's regex pattern.

Functions

split(text, encoding \\ :cl100k_base)

@spec split(binary(), atom()) :: [binary()]

Splits text into pre-tokenized chunks using the encoding's regex pattern.

Returns a list of binary strings, each of which will be independently BPE-encoded.