Regex-based pre-tokenization that splits text into chunks before BPE.
Each encoding uses a specific regex pattern defined by OpenAI's tiktoken. The regex is compiled once at module load time.
Summary
Functions
Splits text into pre-tokenized chunks using the encoding's regex pattern.