AI.PretendTokenizer (fnord v0.7.16)


OpenAI's tokenizer uses regexes that are not compatible with Erlang's regex engine. A few tokenizer modules are available on Hex, but all of them require a working Python installation, access to rustc, a number of external dependencies, and environment flags set to allow them to compile.

Rather than impose that on end users, this module guesstimates token counts based on OpenAI's assertion that 1 token is approximately 4 characters. Callers must take that into account when selecting their chunk size, leaving some buffer to absorb the inaccuracy of this approximation.
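
The sketch below is illustrative only and is not the module's actual implementation; it shows the ~4 characters per token heuristic described above, rounding up so short inputs still count as at least one token.

# Illustrative sketch of the heuristic, not the real guesstimate_tokens/1.
estimate_tokens = fn input -> ceil(String.length(input) / 4) end

estimate_tokens.("Hello, world!")
#=> 4  (13 characters / 4, rounded up)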

Summary

Types

chunk_size()

@type chunk_size() :: non_neg_integer() | AI.Model.t()

chunked_input()

@type chunked_input() :: [String.t()]

input()

@type input() :: String.t()

reduction_factor()

@type reduction_factor() :: float()

Functions

chunk(input, chunk_size, reduction_factor)

@spec chunk(input(), chunk_size(), reduction_factor()) :: chunked_input()
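
A hypothetical usage sketch, assuming chunk_size is a token budget and reduction_factor shrinks the effective chunk size to leave headroom for estimation error (the file path and parameter values are illustrative, not taken from the module's docs):

# Split a long document into chunks sized for an 8_000-token budget,
# with a 0.75 reduction factor as a safety buffer.
long_text = File.read!("docs/long_article.md")
chunks = AI.PretendTokenizer.chunk(long_text, 8_000, 0.75)
IO.puts("Produced #{length(chunks)} chunks")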

guesstimate_tokens(input)

over_max_for_openai_embeddings?(input)
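
A hedged sketch of how these two functions might be used together, assuming guesstimate_tokens/1 returns a numeric estimate and over_max_for_openai_embeddings?/1 returns a boolean (neither return type is spec'd above):

text = "Some candidate text for embedding."

# Rough token count based on the ~4 characters per token heuristic.
estimate = AI.PretendTokenizer.guesstimate_tokens(text)

# Guard before sending the text to the embeddings endpoint.
if AI.PretendTokenizer.over_max_for_openai_embeddings?(text) do
  IO.puts("Too long for embeddings (~#{estimate} tokens); chunk it first.")
else
  IO.puts("OK to embed (~#{estimate} tokens).")
end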