Tiktokenex v0.1.0

Pure Elixir BPE tokenizer compatible with OpenAI's tiktoken.

Supports the :cl100k_base and :o200k_base encodings. Every function takes the encoding as an optional last argument that defaults to :cl100k_base.

Examples

iex> tokens = Tiktokenex.encode("hello world")
iex> is_list(tokens) and Enum.all?(tokens, &is_integer/1)
true

iex> Tiktokenex.decode(Tiktokenex.encode("hello world"))
"hello world"

iex> Tiktokenex.count("hello world")
2

Summary

Functions

count(text, encoding \\ :cl100k_base)

Returns the number of tokens in the text.

decode(token_ids, encoding \\ :cl100k_base)

Decodes a list of token IDs back into a binary string.

encode(text, encoding \\ :cl100k_base)

Encodes text into a list of token IDs.

encode_to_chunks(text, encoding \\ :cl100k_base)

Encodes text and returns the token byte-string chunks.

vocab_size(encoding \\ :cl100k_base)

Returns the vocabulary size for the given encoding.

Functions

count(text, encoding \\ :cl100k_base)

@spec count(binary(), atom()) :: non_neg_integer()

Returns the number of tokens in the text.

More efficient than piping the result of encode/2 into length/1, because it avoids building the full token ID list.

Examples

iex> Tiktokenex.count("hello world")
2
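
A common use of count/2 is checking text against a token limit before sending it elsewhere. The sketch below assumes the tiktokenex package is available as a dependency. The TokenBudget module and its fits?/3 function are hypothetical helpers written for illustration and are not part of Tiktokenex.

```elixir
defmodule TokenBudget do
  @moduledoc "Hypothetical helper for illustration; not part of Tiktokenex."

  # Returns true when `text` encodes to at most `max_tokens` tokens
  # under the given encoding. Only Tiktokenex.count/2 is called, so
  # no token ID list is built.
  @spec fits?(binary(), non_neg_integer(), atom()) :: boolean()
  def fits?(text, max_tokens, encoding \\ :cl100k_base)
      when is_binary(text) and is_integer(max_tokens) and max_tokens >= 0 do
    Tiktokenex.count(text, encoding) <= max_tokens
  end
end

# "hello world" is documented above as 2 tokens under :cl100k_base.
TokenBudget.fits?("hello world", 2)
TokenBudget.fits?("hello world", 1)
```

With the documented count of 2, the first call should return true and the second false.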

decode(token_ids, encoding \\ :cl100k_base)

@spec decode([non_neg_integer()], atom()) :: binary()

Decodes a list of token IDs back into a binary string.

The encoding must match the one used to produce the token IDs. Raises ArgumentError if a token ID is not found in the encoding's vocabulary.

Examples

iex> Tiktokenex.decode([15339, 1917])
"hello world"
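
Because decode/2 raises on unknown token IDs, callers that handle untrusted ID lists may prefer a tagged tuple. The following is a minimal sketch under that assumption. SafeDecode is a hypothetical wrapper, not part of Tiktokenex, and it rescues only the ArgumentError described above.

```elixir
defmodule SafeDecode do
  @moduledoc "Hypothetical wrapper for illustration; not part of Tiktokenex."

  # Wraps Tiktokenex.decode/2 so that an unknown token ID yields
  # {:error, message} instead of raising ArgumentError.
  @spec decode([non_neg_integer()], atom()) :: {:ok, binary()} | {:error, String.t()}
  def decode(token_ids, encoding \\ :cl100k_base) when is_list(token_ids) do
    {:ok, Tiktokenex.decode(token_ids, encoding)}
  rescue
    e in ArgumentError -> {:error, Exception.message(e)}
  end
end

# Encode and decode with the same encoding, as the note above requires.
ids = Tiktokenex.encode("hello world", :o200k_base)
{:ok, text} = SafeDecode.decode(ids, :o200k_base)
```

Passing the same IDs with a different encoding is not expected to raise in general, since the IDs may still exist in the other vocabulary. It would most likely return different text, which is why the encoding has to be tracked alongside the IDs.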

encode(text, encoding \\ :cl100k_base)

@spec encode(binary(), atom()) :: [non_neg_integer()]

Encodes text into a list of token IDs.

Examples

iex> tokens = Tiktokenex.encode("hello")
iex> is_list(tokens)
true

encode_to_chunks(text, encoding \\ :cl100k_base)

@spec encode_to_chunks(binary(), atom()) :: [binary()]

Encodes text and returns the token byte-string chunks.

Each chunk corresponds to one BPE token. Useful for visualizing how text is tokenized.

Examples

iex> chunks = Tiktokenex.encode_to_chunks("hello world")
iex> is_list(chunks) and Enum.all?(chunks, &is_binary/1)
true
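
To visualize token boundaries, the chunks can be joined with a visible separator. This is a sketch that assumes the tiktokenex package is available. TokenView and render/2 are hypothetical names introduced for illustration. Each chunk is passed through inspect/1 because a chunk is a raw byte string and may not be valid UTF-8 on its own, for example when a multi-byte character is split across tokens.

```elixir
defmodule TokenView do
  @moduledoc "Hypothetical helper for illustration; not part of Tiktokenex."

  # Renders the BPE chunks of `text` as one line, separated by " | ".
  # inspect/1 keeps leading spaces visible and prints non-UTF-8
  # chunks in <<...>> form instead of emitting broken characters.
  @spec render(binary(), atom()) :: String.t()
  def render(text, encoding \\ :cl100k_base) when is_binary(text) do
    text
    |> Tiktokenex.encode_to_chunks(encoding)
    |> Enum.map_join(" | ", &inspect/1)
  end
end

IO.puts(TokenView.render("hello world"))
```

Since each chunk corresponds to one BPE token, the number of chunks should equal Tiktokenex.count/2 for the same text and encoding.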

vocab_size(encoding \\ :cl100k_base)

@spec vocab_size(atom()) :: non_neg_integer()

Returns the vocabulary size for the given encoding.

Examples

iex> Tiktokenex.vocab_size() > 100_000
true