PtcRunner.Chunker (PtcRunner v0.7.0)


Text chunking utilities for RLM preprocessing.

Splits text into chunks by lines, characters, or approximate tokens. Moving chunking logic out of LLM-generated code eliminates a common source of typos and enables proper tokenization.

Examples

iex> PtcRunner.Chunker.by_lines("a\nb\nc\nd", 2)
["a\nb", "c\nd"]

iex> PtcRunner.Chunker.by_chars("hello world", 5)
["hello", " worl", "d"]

iex> PtcRunner.Chunker.by_tokens("hello world test", 2)
["hello wo", "rld test"]

Options

All functions accept these options:

  • :overlap - sliding window overlap (default: 0)
  • :metadata - return maps with %{text, index, lines, chars, tokens} (default: false)

by_tokens/3 also accepts:

  • :tokenizer - :simple (4 chars per token), :cl100k, or a custom (String.t() -> non_neg_integer()) function (default: :simple)

Summary

Functions

by_chars(text, n, opts \\ [])
Splits text into chunks by character count.

by_lines(text, n, opts \\ [])
Splits text into chunks by line count.

by_tokens(text, n, opts \\ [])
Splits text into chunks by approximate token count.

Types

chars_opt()

@type chars_opt() :: {:overlap, non_neg_integer()} | {:metadata, boolean()}

chunk()

@type chunk() :: String.t()

chunk_with_metadata()

@type chunk_with_metadata() :: %{
  text: String.t(),
  index: non_neg_integer(),
  lines: non_neg_integer(),
  chars: non_neg_integer(),
  tokens: non_neg_integer()
}
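With metadata: true, each chunk is returned as a map of this shape (the field values below are illustrative, not output of a real call):

```elixir
%{text: "a\nb", index: 0, lines: 2, chars: 3, tokens: 1}
```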

lines_opt()

@type lines_opt() :: {:overlap, non_neg_integer()} | {:metadata, boolean()}

result()

@type result() :: [chunk()] | [chunk_with_metadata()]

tokens_opt()

@type tokens_opt() ::
  {:overlap, non_neg_integer()}
  | {:metadata, boolean()}
  | {:tokenizer, :simple | :cl100k | (String.t() -> non_neg_integer())}

Functions

by_chars(text, n, opts \\ [])

@spec by_chars(String.t() | nil, pos_integer(), [chars_opt()]) :: result()

Splits text into chunks by character count.

Uses String.graphemes/1 for Unicode-safe splitting.

Examples

iex> PtcRunner.Chunker.by_chars("hello world", 5)
["hello", " worl", "d"]

iex> PtcRunner.Chunker.by_chars("abcdef", 3, overlap: 1)
["abc", "cde", "ef"]

iex> PtcRunner.Chunker.by_chars(nil, 10)
[]

iex> PtcRunner.Chunker.by_chars("", 10)
[]

iex> PtcRunner.Chunker.by_chars("hi", 100)
["hi"]
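The overlap behavior above can be approximated with Enum.chunk_every/3 by stepping the window by n - overlap. This is a simplified sketch, not necessarily the module's actual implementation:

```elixir
# Sketch: character chunks of size n sharing `overlap` graphemes,
# implemented by sliding the window in steps of n - overlap.
defmodule CharChunkSketch do
  def by_chars(text, n, overlap \\ 0) do
    text
    |> String.graphemes()                # Unicode-safe units
    |> Enum.chunk_every(n, n - overlap)  # window n, step n - overlap
    |> Enum.map(&Enum.join/1)
  end
end

CharChunkSketch.by_chars("abcdef", 3, 1)
# => ["abc", "cde", "ef"]
```

Enum.chunk_every/3 with the default leftover emits the trailing partial chunk as-is, which matches the "ef" tail in the doctest.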

by_lines(text, n, opts \\ [])

@spec by_lines(String.t() | nil, pos_integer(), [lines_opt()]) :: result()

Splits text into chunks by line count.

Examples

iex> PtcRunner.Chunker.by_lines("a\nb\nc\nd\ne", 2)
["a\nb", "c\nd", "e"]

iex> PtcRunner.Chunker.by_lines("a\nb\nc\nd", 2, overlap: 1)
["a\nb", "b\nc", "c\nd"]

iex> PtcRunner.Chunker.by_lines(nil, 10)
[]

iex> PtcRunner.Chunker.by_lines("", 10)
[]

iex> PtcRunner.Chunker.by_lines("short", 100)
["short"]

by_tokens(text, n, opts \\ [])

@spec by_tokens(String.t() | nil, pos_integer(), [tokens_opt()]) :: result()

Splits text into chunks by approximate token count.

Uses a tokenizer to estimate token counts. The default :simple tokenizer applies a heuristic of 4 characters per token.
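Under that heuristic, n tokens correspond to roughly n * 4 characters, so token chunking can be sketched as character chunking with a scaled window. A hypothetical sketch, not the module's actual implementation:

```elixir
# Sketch of the :simple strategy: n tokens ~= n * 4 characters,
# so token chunking reduces to character chunking.
defmodule SimpleTokenSketch do
  @chars_per_token 4

  def by_tokens(text, n) do
    text
    |> String.graphemes()
    |> Enum.chunk_every(n * @chars_per_token)
    |> Enum.map(&Enum.join/1)
  end
end

SimpleTokenSketch.by_tokens("hello world test", 2)
# => ["hello wo", "rld test"]
```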

Examples

iex> PtcRunner.Chunker.by_tokens("hello world test", 2)
["hello wo", "rld test"]

iex> PtcRunner.Chunker.by_tokens("abcdefghijklmnop", 2, overlap: 1)
["abcdefgh", "efghijkl", "ijklmnop"]

iex> PtcRunner.Chunker.by_tokens(nil, 10)
[]

iex> PtcRunner.Chunker.by_tokens("", 10)
[]

iex> PtcRunner.Chunker.by_tokens("hi", 100)
["hi"]

Custom tokenizer example:

iex> tokenizer = fn text -> div(String.length(text), 2) end
iex> PtcRunner.Chunker.by_tokens("abcdefgh", 2, tokenizer: tokenizer)
["abcd", "efgh"]