Mojentic.LLM.Gateways.TokenizerGateway (Mojentic v1.2.0)

Gateway for tokenizing and detokenizing text using Hugging Face tokenizers.

This gateway provides encoding and decoding functionality for text, which is useful for:

  • Counting tokens to manage context windows
  • Understanding token usage for cost estimation
  • Debugging token-related issues

The gateway uses the tokenizers library, which provides Rust-based tokenizers via Rustler NIF bindings for high performance.

Examples

iex> {:ok, tokenizer} = Mojentic.LLM.Gateways.TokenizerGateway.new()
iex> tokens = Mojentic.LLM.Gateways.TokenizerGateway.encode(tokenizer, "Hello, world!")
iex> text = Mojentic.LLM.Gateways.TokenizerGateway.decode(tokenizer, tokens)
iex> text
"Hello, world!"

Summary

Functions

count_tokens(gateway, text) - Counts the number of tokens in a text string.

decode(tokenizer_gateway, tokens) - Decodes tokens back into text.

encode(tokenizer_gateway, text) - Encodes text into tokens.

new(model \\ "gpt2") - Creates a new TokenizerGateway with the specified model.

new!(model \\ "gpt2") - Creates a new TokenizerGateway with the specified model, raising on error.

Types

t()

@type t() :: %Mojentic.LLM.Gateways.TokenizerGateway{
  tokenizer: Tokenizers.Tokenizer.t()
}

Functions

count_tokens(gateway, text)

@spec count_tokens(t(), String.t()) :: non_neg_integer()

Counts the number of tokens in a text string.

This is a convenience function that encodes the text and returns the token count.

Parameters

  • gateway - The TokenizerGateway instance
  • text - The text to count tokens for

Returns

  • count - The number of tokens

Examples

iex> {:ok, tokenizer} = Mojentic.LLM.Gateways.TokenizerGateway.new()
iex> count = Mojentic.LLM.Gateways.TokenizerGateway.count_tokens(tokenizer, "Hello, world!")
iex> count > 0
true
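Token counts also feed directly into cost estimation. A sketch, assuming a flat per-1k-token rate; the `price_per_1k` value below is a made-up placeholder, not a real rate:

```elixir
# Sketch: estimate request cost from a token count.
# price_per_1k is a placeholder value, not a real price.
alias Mojentic.LLM.Gateways.TokenizerGateway

{:ok, gateway} = TokenizerGateway.new()
token_count = TokenizerGateway.count_tokens(gateway, "Hello, world!")

price_per_1k = 0.002
estimated_cost = token_count / 1000 * price_per_1k
```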

decode(tokenizer_gateway, tokens)

@spec decode(t(), [integer()]) :: String.t()

Decodes tokens back into text.

Parameters

  • gateway - The TokenizerGateway instance
  • tokens - List of token IDs to decode

Returns

  • text - The decoded text

Examples

iex> {:ok, tokenizer} = Mojentic.LLM.Gateways.TokenizerGateway.new()
iex> tokens = Mojentic.LLM.Gateways.TokenizerGateway.encode(tokenizer, "Hello!")
iex> text = Mojentic.LLM.Gateways.TokenizerGateway.decode(tokenizer, tokens)
iex> text
"Hello!"

encode(tokenizer_gateway, text)

@spec encode(t(), String.t()) :: [integer()]

Encodes text into tokens.

Parameters

  • gateway - The TokenizerGateway instance
  • text - The text to encode

Returns

  • tokens - List of token IDs

Examples

iex> {:ok, tokenizer} = Mojentic.LLM.Gateways.TokenizerGateway.new()
iex> tokens = Mojentic.LLM.Gateways.TokenizerGateway.encode(tokenizer, "Hello, world!")
iex> is_list(tokens) and length(tokens) > 0
true
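Because encode/2 and decode/2 round-trip, they can be combined to truncate text at a token boundary rather than a character boundary. A sketch; the 5-token limit is arbitrary:

```elixir
# Sketch: truncate text to at most `limit` tokens, then decode back to a string.
alias Mojentic.LLM.Gateways.TokenizerGateway

{:ok, gateway} = TokenizerGateway.new()

limit = 5

truncated =
  gateway
  |> TokenizerGateway.encode("A long prompt that may exceed the limit")
  |> Enum.take(limit)
  |> then(&TokenizerGateway.decode(gateway, &1))
```

Note that cutting a BPE token stream mid-word can leave a partial-word artifact at the end of the decoded string.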

new(model \\ "gpt2")

@spec new(String.t()) :: {:ok, t()} | {:error, term()}

Creates a new TokenizerGateway with the specified model.

Parameters

  • model - The model name to load. Defaults to "gpt2", a BPE tokenizer of the kind used by GPT-family models. Any Hugging Face model identifier with a published tokenizer (such as "bert-base-uncased") may also be given.

Returns

  • {:ok, gateway} - Successfully created gateway
  • {:error, reason} - Failed to load tokenizer

Examples

iex> {:ok, tokenizer} = Mojentic.LLM.Gateways.TokenizerGateway.new()
iex> is_struct(tokenizer, Mojentic.LLM.Gateways.TokenizerGateway)
true

iex> {:ok, tokenizer} = Mojentic.LLM.Gateways.TokenizerGateway.new("bert-base-uncased")
iex> is_struct(tokenizer, Mojentic.LLM.Gateways.TokenizerGateway)
true

new!(model \\ "gpt2")

@spec new!(String.t()) :: t()

Creates a new TokenizerGateway with the specified model, raising on error.

Parameters

  • model - The model name to load. Defaults to "gpt2".

Returns

  • gateway - Successfully created gateway

Raises

Raises if the tokenizer for the given model cannot be loaded.

Examples

iex> tokenizer = Mojentic.LLM.Gateways.TokenizerGateway.new!()
iex> is_struct(tokenizer, Mojentic.LLM.Gateways.TokenizerGateway)
true