Text chunking utilities for RLM preprocessing.
Splits text into chunks by lines, characters, or approximate tokens. Moving chunking logic out of LLM-generated code removes a common source of typos and allows proper tokenization.
Examples
iex> PtcRunner.Chunker.by_lines("a\nb\nc\nd", 2)
["a\nb", "c\nd"]
iex> PtcRunner.Chunker.by_chars("hello world", 5)
["hello", " worl", "d"]
iex> PtcRunner.Chunker.by_tokens("hello world test", 2)
["hello wo", "rld test"]
Options
All functions accept these options:
:overlap - sliding window overlap (default: 0)
:metadata - return maps with %{text, index, lines, chars, tokens} (default: false)
by_tokens/3 also accepts:
:tokenizer - :simple (4 chars/token) or a custom function (default: :simple)
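To make the shape of the `:metadata` maps concrete, here is a minimal sketch that builds maps with the same five keys from a list of plain chunks. `MetadataSketch` and `describe/1` are hypothetical names, not part of `PtcRunner.Chunker`, and the way `lines` and `tokens` are counted here (newline-separated segments, 4 characters per token rounded up) is an assumption rather than the library's documented behavior.

```elixir
defmodule MetadataSketch do
  @moduledoc "Hypothetical illustration; not the PtcRunner.Chunker source."

  # Wraps each chunk in a map carrying the keys listed for :metadata.
  def describe(chunks) when is_list(chunks) do
    chunks
    |> Enum.with_index()
    |> Enum.map(fn {text, index} ->
      chars = String.length(text)

      %{
        text: text,
        index: index,
        # Assumption: a chunk's line count is its number of "\n"-separated segments.
        lines: length(String.split(text, "\n")),
        chars: chars,
        # Assumption: 4 characters per token, rounded up.
        tokens: div(chars + 3, 4)
      }
    end)
  end
end

IO.inspect(MetadataSketch.describe(["a\nb", "c\nd"]))
```

Each map keeps the original chunk under `:text`, so callers that only need the strings can still recover them with `Enum.map(result, & &1.text)`.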
Summary
Functions
by_chars/3 - Splits text into chunks by character count.
by_lines/3 - Splits text into chunks by line count.
by_tokens/3 - Splits text into chunks by approximate token count.
Types
@type chars_opt() :: {:overlap, non_neg_integer()} | {:metadata, boolean()}
@type chunk() :: String.t()
@type chunk_with_metadata() :: %{ text: String.t(), index: non_neg_integer(), lines: non_neg_integer(), chars: non_neg_integer(), tokens: non_neg_integer() }
@type lines_opt() :: {:overlap, non_neg_integer()} | {:metadata, boolean()}
@type result() :: [chunk()] | [chunk_with_metadata()]
@type tokens_opt() :: {:overlap, non_neg_integer()} | {:metadata, boolean()} | {:tokenizer, :simple | :cl100k | (String.t() -> non_neg_integer())}
Functions
@spec by_chars(String.t() | nil, pos_integer(), [chars_opt()]) :: result()
Splits text into chunks by character count.
Uses String.graphemes/1 for unicode-safe splitting.
Examples
iex> PtcRunner.Chunker.by_chars("hello world", 5)
["hello", " worl", "d"]
iex> PtcRunner.Chunker.by_chars("abcdef", 3, overlap: 1)
["abc", "cde", "ef"]
iex> PtcRunner.Chunker.by_chars(nil, 10)
[]
iex> PtcRunner.Chunker.by_chars("", 10)
[]
iex> PtcRunner.Chunker.by_chars("hi", 100)
["hi"]
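The note about `String.graphemes/1` matters for text with combining characters. The short script below uses only standard `String` and `Enum` functions to show how graphemes, codepoints, and bytes differ for the same string. It illustrates the motivation and is not taken from the library's implementation.

```elixir
# "e" followed by U+0301 (combining acute accent) renders as a single "é".
text = "ae\u0301"

# Graphemes are user-perceived characters: "a" and "é".
graphemes = String.graphemes(text)

# Codepoints separate the accent from its base letter.
codepoints = String.codepoints(text)

IO.inspect(length(graphemes), label: "graphemes")
IO.inspect(length(codepoints), label: "codepoints")
IO.inspect(byte_size(text), label: "bytes")

# Chunking one grapheme at a time keeps the accented letter whole.
chunks =
  graphemes
  |> Enum.chunk_every(1)
  |> Enum.map(&Enum.join/1)

IO.inspect(chunks, label: "chunks")
```

Splitting the same string by codepoints or bytes could place a chunk boundary between the base letter and its accent, which is what grapheme-based splitting avoids.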
@spec by_lines(String.t() | nil, pos_integer(), [lines_opt()]) :: result()
Splits text into chunks by line count.
Examples
iex> PtcRunner.Chunker.by_lines("a\nb\nc\nd\ne", 2)
["a\nb", "c\nd", "e"]
iex> PtcRunner.Chunker.by_lines("a\nb\nc\nd", 2, overlap: 1)
["a\nb", "b\nc", "c\nd"]
iex> PtcRunner.Chunker.by_lines(nil, 10)
[]
iex> PtcRunner.Chunker.by_lines("", 10)
[]
iex> PtcRunner.Chunker.by_lines("short", 100)
["short"]
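One way to picture the `:overlap` option is as a window of `size` lines that advances by `size - overlap` lines and stops once a window reaches the end of the input. The sketch below reproduces the documented outputs under that assumption. `LineWindowSketch` is a hypothetical module, it takes the overlap as a positional argument rather than a keyword option, and it is not the library's actual code.

```elixir
defmodule LineWindowSketch do
  @moduledoc "Hypothetical sketch; not the PtcRunner.Chunker source."

  def by_lines(text, size, overlap \\ 0)
  def by_lines(nil, _size, _overlap), do: []
  def by_lines("", _size, _overlap), do: []

  def by_lines(text, size, overlap)
      when is_binary(text) and size > 0 and overlap >= 0 and overlap < size do
    text
    |> String.split("\n")
    |> windows(size, size - overlap)
    |> Enum.map(&Enum.join(&1, "\n"))
  end

  # Takes `size` items, then advances by `step`; stops when a window
  # consumes the rest of the list, so no chunk holds only overlap.
  defp windows(items, size, step) do
    {chunk, rest} = Enum.split(items, size)

    case rest do
      [] -> [chunk]
      _ -> [chunk | windows(Enum.drop(items, step), size, step)]
    end
  end
end

IO.inspect(LineWindowSketch.by_lines("a\nb\nc\nd", 2, 1))
```

The guard requires `overlap < size`, since a step of zero would never advance. How the library treats that case is not stated in this page.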
@spec by_tokens(String.t() | nil, pos_integer(), [tokens_opt()]) :: result()
Splits text into chunks by approximate token count.
Uses a tokenizer to estimate token count. The default :simple tokenizer
uses a heuristic of 4 characters per token.
Examples
iex> PtcRunner.Chunker.by_tokens("hello world test", 2)
["hello wo", "rld test"]
iex> PtcRunner.Chunker.by_tokens("abcdefghijklmnop", 2, overlap: 1)
["abcdefgh", "efghijkl", "ijklmnop"]
iex> PtcRunner.Chunker.by_tokens(nil, 10)
[]
iex> PtcRunner.Chunker.by_tokens("", 10)
[]
iex> PtcRunner.Chunker.by_tokens("hi", 100)
["hi"]
Custom tokenizer example:
iex> tokenizer = fn text -> div(String.length(text), 2) end
iex> PtcRunner.Chunker.by_tokens("abcdefgh", 2, tokenizer: tokenizer)
["abcd", "efgh"]
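The 4-characters-per-token heuristic means a budget of N tokens corresponds to roughly 4 × N characters. The sketch below applies that conversion without overlap and reproduces the first example above. `TokenSketch` is a hypothetical module, and rounding the estimate up is an assumption, since this page does not say how the library rounds.

```elixir
defmodule TokenSketch do
  @moduledoc "Hypothetical sketch of the :simple heuristic; not the library source."

  @chars_per_token 4

  # Assumption: round up, so any non-empty text counts as at least one token.
  def estimate_tokens(text) when is_binary(text) do
    div(String.length(text) + @chars_per_token - 1, @chars_per_token)
  end

  # Converts the token budget to a character budget and splits on graphemes.
  def by_tokens(text, max_tokens) when is_binary(text) and max_tokens > 0 do
    text
    |> String.graphemes()
    |> Enum.chunk_every(max_tokens * @chars_per_token)
    |> Enum.map(&Enum.join/1)
  end
end

IO.inspect(TokenSketch.by_tokens("hello world test", 2))
IO.inspect(TokenSketch.estimate_tokens("hello world test"))
```

Because the estimate is purely length-based, it can differ noticeably from a real tokenizer's count, which is the reason the `:tokenizer` option accepts a custom function.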