TextChunker.Strategies.RecursiveChunk (TextChunker v0.5.2)

Splits text recursively so that each chunk respects the configured size and overlap. It tries a prioritized list of separators to break the text into manageable chunks and generates metadata for each chunk it produces.

Key Features:

  • Size-Guided Chunking: Prioritizes splitting text into semantic blocks while respecting the maximum chunk_size.
  • Context Preservation: Maintains chunk_overlap to minimize information loss at chunk boundaries.
  • Separator Handling: Selects the most appropriate delimiter (e.g., line breaks, spaces) based on the text content.
  • Metadata Generation: Creates %TextChunker.Chunk{} structs containing the split text and its original byte range.
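
To illustrate why the metadata records byte ranges rather than character positions, here is a minimal sketch using only Elixir's standard library. The map keys `start_byte` and `end_byte` and the exclusive end offset are assumptions made for this example; they are not taken from the `%TextChunker.Chunk{}` definition.

```elixir
# "\u00EF" (ï) occupies two bytes in UTF-8, so byte offsets and
# character offsets diverge after it.
text = "na\u00EFve text\n\nsecond part"
chunk = "second part"

# :binary.match/2 returns the byte offset and byte length of the match.
{start, len} = :binary.match(text, chunk)

# Hypothetical metadata shape (field names are assumptions).
range = %{text: chunk, start_byte: start, end_byte: start + len}

# The byte range is enough to recover the chunk from the original text.
recovered = binary_part(text, range.start_byte, range.end_byte - range.start_byte)
```

Here `start` is 13, while the chunk begins at character position 12, which is why a byte range is the safer key for slicing the original binary.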

Algorithm Overview

  1. Separator Prioritization: Establishes a list of potential separators (e.g., line breaks, spaces), ordered by their expected relevance to the text structure.
  2. Recursive Splitting:
      • Iterates through the separator list.
      • Attempts to split the text using the current separator.
      • If a split is successful, recursively applies the algorithm to any resulting sub-chunks that still exceed the chunk_size.
  3. Chunk Assembly:
      • Combines smaller text segments into chunks, aiming to get as close to the chunk_size as possible.
      • Employs chunk_overlap to ensure smooth transitions between chunks.
  4. Metadata Generation: Tracks byte ranges for each chunk for potential reassembly of the original text.
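
The steps above can be sketched in a few lines of Elixir. This is a simplified illustration, not the module's actual implementation: `RecursiveSketch`, its separator list, and the use of `String.length/1` as the size measure are all assumptions, and overlap and byte-range tracking are left out. It also merges segments before recursing into oversized ones, which may differ from the order the library uses.

```elixir
defmodule RecursiveSketch do
  # Hypothetical module for illustration only.
  # Assumed separator priority: paragraphs, then lines, then words.
  @separators ["\n\n", "\n", " "]

  def split(text, chunk_size, separators \\ @separators)

  # No separators left: return the text as-is, even if it is too long.
  def split(text, _chunk_size, []), do: [text]

  def split(text, chunk_size, [sep | rest]) do
    if String.length(text) <= chunk_size do
      [text]
    else
      text
      |> String.split(sep, trim: true)
      # Assemble neighbouring segments into chunks close to chunk_size.
      |> merge(sep, chunk_size)
      # Anything still too large is retried with the next separator.
      |> Enum.flat_map(&split(&1, chunk_size, rest))
    end
  end

  # Greedily joins adjacent segments while the result fits in chunk_size.
  defp merge(pieces, sep, chunk_size) do
    pieces
    |> Enum.reduce([], fn
      piece, [] ->
        [piece]

      piece, [current | done] ->
        candidate = current <> sep <> piece

        if String.length(candidate) <= chunk_size do
          [candidate | done]
        else
          [piece, current | done]
        end
    end)
    |> Enum.reverse()
  end
end

chunks = RecursiveSketch.split("one two three\n\nfour five", 9)
# chunks == ["one two", "three", "four five"]
```

The first paragraph is too long for a 9-character limit, so it falls through to the space separator and is reassembled into two chunks; the second paragraph already fits and is returned whole.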

Functions

split(text, opts)

@spec split(
  binary(),
  keyword()
) :: [TextChunker.Chunk.t()]

Internal recursive chunking strategy. Use TextChunker.split/2 as the public API.

Splits text using prioritized separators, respecting chunk_size limits while maintaining chunk_overlap for context preservation.

Options

  • :chunk_size (integer) - Maximum size of a chunk, as measured by :get_chunk_size
  • :chunk_overlap (integer) - Size of the overlap between consecutive chunks
  • :format (atom) - Text format, used to select the separator list
  • :get_chunk_size (function) - Function that calculates the size of a piece of text (required)
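
As a rough illustration of how these options fit together, the sketch below builds an options keyword list and shows how the choice of size function changes what :chunk_size means. The numeric values, the `:plaintext` format atom, and the assumption that the size function takes a single binary argument are all illustrative guesses, not documented defaults.

```elixir
# Two candidate size functions (which one suits a caller is an assumption).
by_graphemes = &String.length/1
by_bytes = &byte_size/1

opts = [
  chunk_size: 1000,
  chunk_overlap: 200,
  # Assumed format atom, for illustration only.
  format: :plaintext,
  get_chunk_size: by_graphemes
]

# "\u00E9" (é) is one grapheme but two bytes in UTF-8.
sample = "h\u00E9llo"
grapheme_size = opts[:get_chunk_size].(sample)
byte_count = by_bytes.(sample)

# A call would then look like this (argument order assumed from split/2):
# TextChunker.split(sample, opts)
```

With `by_graphemes` the sample measures 5, with `by_bytes` it measures 6, so the same :chunk_size permits different amounts of text depending on the function supplied.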