# `Tiktokenex`
[🔗](https://github.com/phiat/tiktokenex/blob/v0.1.0/lib/tiktokenex.ex#L1)

Pure Elixir BPE tokenizer compatible with OpenAI's tiktoken.

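Supports the `:cl100k_base` and `:o200k_base` encodings. Every function takes
the encoding as an optional last argument; the examples below omit it and rely
on the default, and the token IDs they show match `:cl100k_base`.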

## Examples

    iex> tokens = Tiktokenex.encode("hello world")
    iex> is_list(tokens) and Enum.all?(tokens, &is_integer/1)
    true

    iex> Tiktokenex.decode(Tiktokenex.encode("hello world"))
    "hello world"

    iex> Tiktokenex.count("hello world")
    2
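
The examples above rely on the default encoding. The following is a minimal sketch of passing the encoding explicitly, assuming the second argument shown in the function specs accepts the two atoms listed above. The `RoundTrip` module is illustrative and not part of the library.

```elixir
defmodule RoundTrip do
  # Illustrative helper, not part of Tiktokenex.
  # Encodes `text` with `encoding`, decodes the result with the same
  # encoding, and returns the token count alongside the decoded text.
  def check(text, encoding) do
    tokens = Tiktokenex.encode(text, encoding)
    {length(tokens), Tiktokenex.decode(tokens, encoding)}
  end
end

for encoding <- [:cl100k_base, :o200k_base] do
  {n, decoded} = RoundTrip.check("hello world", encoding)
  IO.puts("#{encoding}: #{n} tokens, round trip ok? #{decoded == "hello world"}")
end
```

Token counts may differ between encodings, so the same encoding should be used on both sides of a round trip.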

# `count`

```elixir
@spec count(binary(), atom()) :: non_neg_integer()
```

Returns the number of tokens in the text.

More efficient than piping `encode/2` into `length/1`, because it avoids
building the full token ID list.

## Examples

    iex> Tiktokenex.count("hello world")
    2
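
A common use of a token count is checking text against a size limit before sending it to a model. The sketch below assumes a hypothetical `TokenBudget` module. The module, its `fits?/3` function, and its `:cl100k_base` default are illustrative choices and not part of Tiktokenex.

```elixir
defmodule TokenBudget do
  # Illustrative helper, not part of Tiktokenex.
  # Returns true when `text` needs at most `max_tokens` tokens
  # under `encoding`. The default encoding here is this helper's
  # own assumption.
  def fits?(text, max_tokens, encoding \\ :cl100k_base) do
    Tiktokenex.count(text, encoding) <= max_tokens
  end
end

# Check a prompt against a hypothetical 8,000-token limit.
TokenBudget.fits?("hello world", 8_000)
```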

# `decode`

```elixir
@spec decode([non_neg_integer()], atom()) :: binary()
```

Decodes a list of token IDs back into a binary string.

The encoding must match the one used to produce the token IDs.
Raises `ArgumentError` if a token ID is not found in the encoding's vocabulary.

## Examples

    iex> Tiktokenex.decode([15339, 1917])
    "hello world"
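
When token IDs come from an untrusted source, the documented `ArgumentError` can be turned into a tagged tuple. This is a minimal sketch. The `SafeDecode` wrapper is hypothetical and not part of the library, and the ID `999_999_999` is only assumed to lie outside the vocabulary.

```elixir
defmodule SafeDecode do
  # Illustrative wrapper, not part of Tiktokenex.
  # Returns {:ok, text} on success and {:error, message} when
  # decode/2 raises ArgumentError for an unknown token ID.
  def decode(token_ids, encoding) do
    {:ok, Tiktokenex.decode(token_ids, encoding)}
  rescue
    error in ArgumentError -> {:error, Exception.message(error)}
  end
end

SafeDecode.decode([15339, 1917], :cl100k_base)
# An ID assumed to be absent from the vocabulary:
SafeDecode.decode([999_999_999], :cl100k_base)
```

Only `ArgumentError` is rescued, so any other failure still surfaces as an exception.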

# `encode`

```elixir
@spec encode(binary(), atom()) :: [non_neg_integer()]
```

Encodes text into a list of token IDs.

## Examples

    iex> tokens = Tiktokenex.encode("hello")
    iex> is_list(tokens)
    true

# `encode_to_chunks`

```elixir
@spec encode_to_chunks(binary(), atom()) :: [binary()]
```

Encodes text and returns the byte string of each token instead of its ID.

Each chunk corresponds to one BPE token. Useful for visualizing
how text is tokenized.

## Examples

    iex> chunks = Tiktokenex.encode_to_chunks("hello world")
    iex> is_list(chunks) and Enum.all?(chunks, &is_binary/1)
    true
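
To see where the token boundaries fall, the chunks can be rendered on one line. This is a sketch under the assumption that a chunk may contain whitespace or bytes that are not valid UTF-8 on their own, which is why it uses `inspect/1`. The `ChunkView` module is illustrative and not part of the library.

```elixir
defmodule ChunkView do
  # Illustrative helper, not part of Tiktokenex.
  # Renders each token chunk with inspect/1 so leading spaces and
  # non-printable bytes stay visible, separated by " | ".
  def render(text, encoding) do
    text
    |> Tiktokenex.encode_to_chunks(encoding)
    |> Enum.map_join(" | ", &inspect/1)
  end
end

IO.puts(ChunkView.render("hello world", :cl100k_base))
```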

# `vocab_size`

```elixir
@spec vocab_size(atom()) :: non_neg_integer()
```

Returns the vocabulary size for the given encoding.

## Examples

    iex> Tiktokenex.vocab_size() > 100_000
    true

---

*Consult [api-reference.md](api-reference.md) for the complete listing.*