LlamaCppEx.Tokenizer (LlamaCppEx v0.8.13)

Text tokenization and detokenization.

Summary

Functions

bos_token(model)

Returns the BOS (beginning of sentence) token ID.

decode(model, tokens)

Decodes a list of token IDs back into text.

encode(model, text, opts \\ [])

Encodes text into a list of token IDs.

eog?(model, token)

Returns whether a token is an end-of-generation token.

eos_token(model)

Returns the EOS (end of sentence) token ID.

token_to_piece(model, token)

Converts a single token ID to its text representation.

vocab_size(model)

Returns the vocabulary size.

Functions

bos_token(model)

@spec bos_token(LlamaCppEx.Model.t()) :: integer()

Returns the BOS (beginning of sentence) token ID.

decode(model, tokens)

@spec decode(LlamaCppEx.Model.t(), [integer()]) ::
  {:ok, String.t()} | {:error, String.t()}

Decodes a list of token IDs back into text.
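A minimal sketch of an encode/decode round trip, assuming `model` is an already loaded LlamaCppEx.Model. The `RoundTrip` module and `run/2` are hypothetical names used only for illustration; they are not part of LlamaCppEx.

```elixir
defmodule RoundTrip do
  # Hypothetical helper, not part of LlamaCppEx.
  alias LlamaCppEx.Tokenizer

  # Encodes `text` without special tokens, then decodes the IDs back to text.
  # Returns {:ok, decoded} or the first {:error, reason} encountered.
  def run(model, text) do
    with {:ok, tokens} <- Tokenizer.encode(model, text, add_special: false),
         {:ok, decoded} <- Tokenizer.decode(model, tokens) do
      {:ok, decoded}
    end
  end
end
```

Depending on the model's tokenizer, the decoded text may differ slightly from the input (for example in leading whitespace), so it is safer not to rely on an exact match.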

encode(model, text, opts \\ [])

@spec encode(LlamaCppEx.Model.t(), String.t(), keyword()) ::
  {:ok, [integer()]} | {:error, String.t()}

Encodes text into a list of token IDs.

Options

:add_special - Add special tokens (BOS/EOS) when the model is configured to use them. Defaults to true.
:parse_special - Parse special-token text in the input (e.g., <|im_start|>) as special tokens instead of plain text. Defaults to true.
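As a sketch of how the options might be passed, the hypothetical helper below counts the tokens in a piece of text with BOS/EOS left out. `TokenCount` and `count/2` are illustrative names, not part of LlamaCppEx, and `model` is assumed to be an already loaded LlamaCppEx.Model.

```elixir
defmodule TokenCount do
  # Hypothetical helper, not part of LlamaCppEx.
  # `model` is assumed to be an already loaded LlamaCppEx.Model struct.
  def count(model, text) do
    # add_special: false leaves out BOS/EOS, so only the text's own tokens are counted.
    case LlamaCppEx.Tokenizer.encode(model, text, add_special: false) do
      {:ok, tokens} -> {:ok, length(tokens)}
      {:error, reason} -> {:error, reason}
    end
  end
end
```

A caller could match on the result, e.g. `{:ok, n} = TokenCount.count(model, "Hello, world!")`, to check whether a prompt fits a context budget.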

eog?(model, token)

@spec eog?(LlamaCppEx.Model.t(), integer()) :: boolean()

Returns whether a token is an end-of-generation token.
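One possible use is trimming a list of generated token IDs at the first end-of-generation token. The sketch below assumes `model` is an already loaded LlamaCppEx.Model; `GenerationTrim` and `take_until_eog/2` are hypothetical names, not part of LlamaCppEx.

```elixir
defmodule GenerationTrim do
  # Hypothetical helper, not part of LlamaCppEx.
  # Keeps only the tokens that come before the first end-of-generation token.
  def take_until_eog(model, tokens) do
    Enum.take_while(tokens, fn token ->
      not LlamaCppEx.Tokenizer.eog?(model, token)
    end)
  end
end
```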

eos_token(model)

@spec eos_token(LlamaCppEx.Model.t()) :: integer()

Returns the EOS (end of sentence) token ID.
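The BOS and EOS IDs can be compared against a token list, for instance to remove them before further processing. The following is only a sketch under the assumption that `model` is an already loaded LlamaCppEx.Model; `SpecialEnds` and `strip/2` are hypothetical names, not part of LlamaCppEx.

```elixir
defmodule SpecialEnds do
  # Hypothetical helper, not part of LlamaCppEx.
  alias LlamaCppEx.Tokenizer

  # Drops one leading BOS token and one trailing EOS token, if present.
  def strip(model, tokens) do
    bos = Tokenizer.bos_token(model)
    eos = Tokenizer.eos_token(model)

    tokens
    |> drop_leading(bos)
    |> Enum.reverse()
    |> drop_leading(eos)
    |> Enum.reverse()
  end

  # Matches only when the first element equals the given token.
  defp drop_leading([token | rest], token), do: rest
  defp drop_leading(tokens, _token), do: tokens
end
```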

token_to_piece(model, token)

@spec token_to_piece(LlamaCppEx.Model.t(), integer()) :: String.t()

Converts a single token ID to its text representation.
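To see how a tokenizer splits a string, each token ID can be paired with its piece. This is a minimal sketch assuming `model` is an already loaded LlamaCppEx.Model; `Pieces` and `inspect_text/2` are hypothetical names, not part of LlamaCppEx.

```elixir
defmodule Pieces do
  # Hypothetical helper, not part of LlamaCppEx.
  alias LlamaCppEx.Tokenizer

  # Returns {:ok, [{token_id, piece}, ...]} so each ID appears next to its text,
  # or passes through {:error, reason} from encode.
  def inspect_text(model, text) do
    with {:ok, tokens} <- Tokenizer.encode(model, text) do
      {:ok, Enum.map(tokens, fn token -> {token, Tokenizer.token_to_piece(model, token)} end)}
    end
  end
end
```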

vocab_size(model)

@spec vocab_size(LlamaCppEx.Model.t()) :: integer()

Returns the vocabulary size.
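The vocabulary size can serve as an upper bound when validating token IDs from an external source. The sketch below assumes that valid IDs are the integers from 0 up to vocab_size - 1, which this page does not state, and that `model` is an already loaded LlamaCppEx.Model. `TokenRange` and `valid?/2` are hypothetical names, not part of LlamaCppEx.

```elixir
defmodule TokenRange do
  # Hypothetical helper, not part of LlamaCppEx.
  # Assumption: valid token IDs are the integers 0..vocab_size - 1.
  def valid?(model, token) when is_integer(token) do
    token >= 0 and token < LlamaCppEx.Tokenizer.vocab_size(model)
  end

  # Anything that is not an integer cannot be a token ID.
  def valid?(_model, _token), do: false
end
```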