Pure Elixir BPE tokenizer compatible with OpenAI's tiktoken.
Supports :cl100k_base and :o200k_base encodings.
Examples
iex> tokens = Tiktokenex.encode("hello world")
iex> is_list(tokens) and Enum.all?(tokens, &is_integer/1)
true
iex> Tiktokenex.decode(Tiktokenex.encode("hello world"))
"hello world"
iex> Tiktokenex.count("hello world")
2
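The examples above rely on the default encoding. As a minimal sketch, assuming the optional second argument shown in the function specs selects the encoding and that the default is :cl100k_base (which the token IDs in the decode example suggest), an explicit round trip through both supported encodings might look like this:

```elixir
text = "hello world"

for encoding <- [:cl100k_base, :o200k_base] do
  # Assumption: the second argument is the encoding name.
  tokens = Tiktokenex.encode(text, encoding)

  # Decode with the same encoding that produced the IDs;
  # the pin match fails loudly if the round trip does not hold.
  ^text = Tiktokenex.decode(tokens, encoding)

  IO.puts("#{encoding}: #{length(tokens)} tokens")
end
```

Token IDs are not portable between encodings, so keep the encoding name alongside any stored token list.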
Summary
Functions
count/2: Returns the number of tokens in the text.
decode/2: Decodes a list of token IDs back into a binary string.
encode/2: Encodes text into a list of token IDs.
encode_to_chunks: Encodes text and returns the token byte-string chunks.
vocab_size/1: Returns the vocabulary size for the given encoding.
Functions
@spec count(binary(), atom()) :: non_neg_integer()
Returns the number of tokens in the text.
More efficient than piping the result of encode/2 into length/1, because it
avoids building the full token ID list.
Examples
iex> Tiktokenex.count("hello world")
2
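A typical use of count is checking text against a token limit. The sketch below compares count with the encode-then-length path and applies a budget check; max_tokens is a hypothetical limit chosen for illustration, and passing :cl100k_base explicitly assumes the second argument names the encoding.

```elixir
text = "hello world"
encoding = :cl100k_base

# Counts tokens directly.
token_count = Tiktokenex.count(text, encoding)

# Same number, but this path materializes the full ID list first.
same_count = text |> Tiktokenex.encode(encoding) |> length()
true = token_count == same_count

# Hypothetical budget, for illustration only.
max_tokens = 8
within_budget? = token_count <= max_tokens
IO.puts("#{token_count} tokens, within budget: #{within_budget?}")
```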
@spec decode([non_neg_integer()], atom()) :: binary()
Decodes a list of token IDs back into a binary string.
The encoding must match the one used to produce the token IDs.
Raises ArgumentError if a token ID is not found in the encoding's vocabulary.
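When token IDs come from an untrusted source, the ArgumentError can be turned into a tagged tuple. The following is a minimal sketch; SafeDecode is a hypothetical helper module, not part of Tiktokenex, and the ID 999_999_999 is assumed to lie outside every supported vocabulary.

```elixir
defmodule SafeDecode do
  # Hypothetical wrapper, not part of Tiktokenex.
  # Returns {:ok, text} or {:error, message} instead of raising.
  def decode(token_ids, encoding \\ :cl100k_base) do
    {:ok, Tiktokenex.decode(token_ids, encoding)}
  rescue
    e in ArgumentError -> {:error, Exception.message(e)}
  end
end

# IDs taken from the documented decode example.
{:ok, "hello world"} = SafeDecode.decode([15339, 1917])

# Assumption: this ID does not exist in the vocabulary.
{:error, _message} = SafeDecode.decode([999_999_999])
```

An encoding mismatch does not necessarily raise: IDs that exist in both vocabularies would simply decode to different text, so the wrapper only catches unknown IDs.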
Examples
iex> Tiktokenex.decode([15339, 1917])
"hello world"
@spec encode(binary(), atom()) :: [non_neg_integer()]
Encodes text into a list of token IDs.
Examples
iex> tokens = Tiktokenex.encode("hello")
iex> is_list(tokens)
true
encode_to_chunks encodes text and returns the byte string of each token.
Each chunk corresponds to one BPE token, which makes the function useful for visualizing how text is tokenized.
Examples
iex> chunks = Tiktokenex.encode_to_chunks("hello world")
iex> is_list(chunks) and Enum.all?(chunks, &is_binary/1)
true
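To visualize a tokenization, the chunks can be printed with a separator. This sketch assumes that concatenating the chunks reproduces the input bytes, which is how BPE byte chunks normally behave but is not stated above. It uses inspect/1 so that leading spaces stay visible and so that chunks that are not valid UTF-8 on their own still print.

```elixir
text = "hello world"
chunks = Tiktokenex.encode_to_chunks(text)

# One chunk per token, so the chunk count matches count/2.
true = length(chunks) == Tiktokenex.count(text)

# Assumption: the chunks concatenate back to the original bytes.
true = IO.iodata_to_binary(chunks) == text

# Show token boundaries, quoting each chunk.
chunks
|> Enum.map(&inspect/1)
|> Enum.join(" | ")
|> IO.puts()
```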
@spec vocab_size(atom()) :: non_neg_integer()
Returns the vocabulary size for the given encoding.
Examples
iex> Tiktokenex.vocab_size() > 100_000
true
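The vocabulary size can also be queried per encoding. A brief sketch, assuming the argument accepts either of the supported encoding names:

```elixir
for encoding <- [:cl100k_base, :o200k_base] do
  # Assumption: the argument is the encoding name.
  size = Tiktokenex.vocab_size(encoding)
  IO.puts("#{encoding}: #{size} entries")
end
```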