AI.PretendTokenizer (fnord v0.7.16)
OpenAI's tokenizer uses regexes that are not compatible with Erlang's regex engine. There are a couple of modules available on Hex, but all of them require a working Python installation, access to rustc, a number of external dependencies, and some env flags set to allow them to compile.
Rather than impose that on end users, this module guesstimates token counts based on OpenAI's assertion that 1 token is approximately 4 characters. Callers must account for this when selecting their chunk size, leaving some buffer for the inaccuracy of the approximation.
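To illustrate the heuristic, here is a minimal sketch of the estimate described above. The module and function names below are hypothetical and not part of AI.PretendTokenizer's documented API.

defmodule TokenEstimateSketch do
  # ~4 characters per token, per OpenAI's rule of thumb.
  @chars_per_token 4

  # Roughly estimate the token count of a string by dividing its length
  # by the characters-per-token heuristic, rounding up.
  def guesstimate_tokens(input) when is_binary(input) do
    ceil(String.length(input) / @chars_per_token)
  end
end

# TokenEstimateSketch.guesstimate_tokens("hello world") #=> 3 (11 chars / 4, rounded up)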
Summary
Types
@type chunk_size() :: non_neg_integer() | AI.Model.t()
@type chunked_input() :: [String.t()]
@type input() :: String.t()
@type reduction_factor() :: float()
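The types above suggest a chunking workflow: a chunk_size() expressed in tokens (or derived from an AI.Model.t()), a reduction_factor() that reserves buffer for estimation error, and a chunked_input() result. Below is a hypothetical sketch of such a workflow under the 4-characters-per-token assumption; the names and the 0.9 default are illustrative, not the module's actual API.

defmodule ChunkingSketch do
  @chars_per_token 4

  # Split `input` into chunks sized against `max_tokens`, scaled down by
  # `reduction_factor` to leave headroom for the inaccuracy of the estimate.
  def chunk(input, max_tokens, reduction_factor \\ 0.9) do
    chunk_chars = trunc(max_tokens * reduction_factor * @chars_per_token)

    input
    |> String.graphemes()
    |> Enum.chunk_every(chunk_chars)
    |> Enum.map(&Enum.join/1)
  end
end

# ChunkingSketch.chunk(String.duplicate("a", 100), 10)
# #=> three chunks of at most 36 characters each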