AI.PretendTokenizer (fnord v0.9.29)


OpenAI's tokenizer uses regexes that are not compatible with Erlang's regex engine. There are a couple of tokenizer modules available on Hex, but all of them require a working Python installation, access to rustc, a number of external dependencies, and environment flags set to allow them to compile.

Rather than impose that on end users, this module uses a deliberately conservative token estimator. It guesstimates token counts, padding them to account for token-dense inputs, so callers can choose chunk sizes with a buffer for the estimator's inaccuracy.
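
For illustration only, a conservative estimator along these lines can be sketched as a character-count heuristic. The module name, ratio, and rounding below are assumptions chosen to show the technique, not the actual implementation behind guesstimate_tokens/1.

    # Illustrative sketch only -- not AI.PretendTokenizer's actual logic.
    defmodule TokenEstimateSketch do
      # English prose averages roughly 4 characters per token; token-dense
      # inputs (code, non-Latin scripts) can be closer to 1-2. A low divisor
      # deliberately overestimates the count. (Assumed ratio.)
      @chars_per_token 3.0

      @spec guesstimate_tokens(String.t()) :: non_neg_integer()
      def guesstimate_tokens(input) do
        input
        |> String.length()
        |> Kernel./(@chars_per_token)
        |> ceil()
      end
    end

Overestimating in this way means a chunk sized against the estimate should stay under the real model limit even when the true tokenizer produces more tokens than a naive average would predict.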

Summary

Types

chunk_size()

@type chunk_size() :: non_neg_integer() | AI.Model.t()

chunked_input()

@type chunked_input() :: [String.t()]

input()

@type input() :: String.t()

reduction_factor()

@type reduction_factor() :: float()

Functions

chunk(input, chunk_size, reduction_factor)

@spec chunk(input(), chunk_size(), reduction_factor()) :: chunked_input()
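
A hedged usage sketch, assuming the reduction_factor scales the chunk size down to leave headroom for the estimator's inaccuracy (the file name and values are hypothetical):

    long_text = File.read!("notes.txt")

    # Split into chunks targeted at 8_000 tokens, reduced by a factor of
    # 0.75 as a safety margin (interpretation of the factor is assumed).
    chunks = AI.PretendTokenizer.chunk(long_text, 8_000, 0.75)

Per the chunk_size() type, an AI.Model struct may be passed in place of the integer, presumably so the limit can be derived from the model itself.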

guesstimate_tokens(input)

over_max_for_openai_embeddings?(input)
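
A hedged sketch of calling the two helpers above; the return shapes are assumptions based on the names (a numeric estimate and a boolean), not documented guarantees:

    input = "Some text destined for the embeddings endpoint."

    # Conservative token estimate for the input.
    estimate = AI.PretendTokenizer.guesstimate_tokens(input)

    # Assumed to return a boolean indicating whether the estimated token
    # count exceeds OpenAI's embedding input limit.
    too_big? = AI.PretendTokenizer.over_max_for_openai_embeddings?(input)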