# `Tiktokenex.Pretokenizer`
[🔗](https://github.com/phiat/tiktokenex/blob/v0.1.0/lib/tiktokenex/pretokenizer.ex#L1)

Regex-based pre-tokenization that splits text into chunks before BPE.

Each encoding uses a specific regex pattern defined by OpenAI's tiktoken.
Each pattern is compiled once, at module load time, rather than on every call.
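
As a rough illustration of why the pattern is tied to the encoding, the sketch below maps encoding names to regexes and shows that two patterns split the same text differently. The module `PatternSketch` and the atoms `:gpt2_style` and `:whitespace` are hypothetical and not part of Tiktokenex. The first pattern is the GPT-2 split pattern used by tiktoken, and the second is a deliberately naive one included only for contrast.

```elixir
defmodule PatternSketch do
  # Hypothetical lookup: one function clause per encoding name.
  # The pattern is written once as a literal instead of being
  # rebuilt from a string on each call.

  # GPT-2 style split pattern as used by tiktoken.
  def pattern(:gpt2_style) do
    ~r/'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+/u
  end

  # Naive pattern (an assumption for illustration only):
  # alternating runs of non-whitespace and whitespace.
  def pattern(:whitespace), do: ~r/\S+|\s+/u

  # Collect every full match, in order, as a list of binaries.
  def chunks(text, encoding) do
    encoding
    |> pattern()
    |> Regex.scan(text)
    |> Enum.map(&hd/1)
  end
end

IO.inspect(PatternSketch.chunks("a b", :gpt2_style))
# => ["a", " b"]
IO.inspect(PatternSketch.chunks("a b", :whitespace))
# => ["a", " ", "b"]
```

The GPT-2 style pattern attaches a leading space to the following word, so the chunk boundaries, and therefore the resulting tokens, depend on which pattern is selected.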

# `split`

```elixir
@spec split(binary(), atom()) :: [binary()]
```

Splits text into pre-tokenized chunks using the encoding's regex pattern.

Returns a list of binary strings, each of which will be independently
BPE-encoded.
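
The behaviour described above can be sketched without the library. The following is a minimal, self-contained approximation and not the actual implementation: `SplitSketch` is a hypothetical module, it hard-codes the GPT-2 style pattern from tiktoken instead of taking an encoding atom, and `byte_size/1` merely stands in for the per-chunk BPE step.

```elixir
defmodule SplitSketch do
  # GPT-2 style split pattern as used by tiktoken. The real module
  # selects a pattern per encoding; this sketch fixes one.
  defp pattern do
    ~r/'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+/u
  end

  # Return every full regex match, in order, as a list of binaries.
  def split(text) when is_binary(text) do
    pattern()
    |> Regex.scan(text)
    |> Enum.map(&hd/1)
  end
end

text = "Hello, world! It's 42."
chunks = SplitSketch.split(text)

IO.inspect(chunks)
# => ["Hello", ",", " world", "!", " It", "'s", " 42", "."]

# The chunks cover the input without gaps, so joining them restores it.
IO.inspect(Enum.join(chunks) == text)
# => true

# Each chunk is processed on its own; byte_size/1 is a placeholder
# for the BPE encoder, which would never merge across chunk boundaries.
IO.inspect(Enum.map(chunks, &byte_size/1))
# => [5, 1, 6, 1, 3, 2, 3, 1]
```

Because each chunk is encoded in isolation, a token can never span a chunk boundary, which is why the choice of pattern affects the final token sequence.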

---

*Consult [api-reference.md](api-reference.md) for the complete listing*