TextChunker.Strategies.RecursiveChunk (TextChunker v0.5.2)
Handles recursive text splitting, aiming to adhere to configured size and overlap requirements. Employs a flexible separator-based approach to break down text into manageable chunks, while generating metadata for each produced chunk.
Key Features:
- Size-Guided Chunking: Prioritizes splitting text into semantic blocks while respecting the maximum chunk_size.
- Context Preservation: Maintains chunk_overlap to minimize information loss at chunk boundaries.
- Separator Handling: Selects the most appropriate delimiter (e.g., line breaks, spaces) based on the text content.
- Metadata Generation: Creates %TextChunker.Chunk{} structs containing the split text and its original byte range.
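The byte-range metadata can be illustrated with a minimal, self-contained sketch. It does not use the library: the ByteRangeSketch module, its nested Chunk struct, and the field names text, start_byte, and end_byte are assumptions made for this example, with end_byte treated as exclusive.

```elixir
defmodule ByteRangeSketch do
  # Hypothetical stand-in for a chunk struct; the field names are assumptions.
  defmodule Chunk do
    defstruct [:text, :start_byte, :end_byte]
  end

  # Annotates consecutive pieces of a text with their byte offsets.
  # `end_byte` is exclusive, so each chunk spans start_byte..end_byte-1.
  def annotate(pieces) do
    {chunks, _final_offset} =
      Enum.map_reduce(pieces, 0, fn piece, offset ->
        end_byte = offset + byte_size(piece)
        {%Chunk{text: piece, start_byte: offset, end_byte: end_byte}, end_byte}
      end)

    chunks
  end

  # Recovers a chunk's text from the original source using only its byte range.
  def slice(source, %Chunk{start_byte: s, end_byte: e}) do
    binary_part(source, s, e - s)
  end
end

# Escapes are used so the byte counts do not depend on file encoding.
source = "h\u00E9llo\nw\u00F6rld"
chunks = ByteRangeSketch.annotate(["h\u00E9llo\n", "w\u00F6rld"])
IO.inspect(Enum.map(chunks, &{&1.start_byte, &1.end_byte}))
```

Offsets are counted in bytes rather than characters, so multi-byte UTF-8 characters widen the range; slicing the source with a chunk's range returns exactly that chunk's text.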
Algorithm Overview
- Separator Prioritization: Establishes a list of potential separators (e.g., line breaks, spaces), ordered by their expected relevance to the text structure.
- Recursive Splitting:
  - Iterates through the separator list.
  - Attempts to split the text using the current separator.
  - If a split is successful, recursively applies the algorithm to any resulting sub-chunks that still exceed the chunk_size.
- Chunk Assembly:
  - Combines smaller text segments into chunks, aiming to get as close to the chunk_size as possible.
  - Employs chunk_overlap to ensure smooth transitions between chunks.
- Metadata Generation: Tracks byte ranges for each chunk for potential reassembly of the original text.
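The splitting and assembly steps above can be sketched in a few lines of Elixir. This is an illustration under stated assumptions, not the library's implementation: the RecursiveSplitSketch module, its separator list, and the use of String.length/1 as the size measure are all hypothetical choices made for the example.

```elixir
defmodule RecursiveSplitSketch do
  # Hypothetical separator priority; the real lists depend on the :format option.
  @separators ["\n\n", "\n", " "]

  # Recursively splits `text` until each piece fits in `max` characters,
  # trying separators in priority order. Pieces concatenate back to `text`.
  def split(text, max, separators \\ @separators)

  # No separators left: keep the piece even if it is still too large.
  def split(text, _max, []), do: [text]

  def split(text, max, [sep | rest]) do
    if String.length(text) <= max do
      [text]
    else
      text
      |> split_keeping(sep)
      |> Enum.flat_map(&split(&1, max, rest))
    end
  end

  # Greedily packs pieces into chunks of at most `max` characters. A new chunk
  # starts with trailing pieces of the previous one, up to `overlap` characters.
  def merge(pieces, max, overlap) do
    {done, current} =
      Enum.reduce(pieces, {[], []}, fn piece, {done, current} ->
        if current != [] and size(current) + String.length(piece) > max do
          carried = tail_within(current, overlap)
          # Drop the overlap when it would push the new chunk past `max`.
          carried = if size(carried) + String.length(piece) > max, do: [], else: carried
          {[current | done], carried ++ [piece]}
        else
          {done, current ++ [piece]}
        end
      end)

    done = if current == [], do: done, else: [current | done]
    done |> Enum.reverse() |> Enum.map(&Enum.join/1)
  end

  # Splits on `sep`, leaving the separator attached to the preceding piece.
  defp split_keeping(text, sep) do
    {init, [last]} = text |> String.split(sep) |> Enum.split(-1)
    Enum.map(init, &(&1 <> sep)) ++ if(last == "", do: [], else: [last])
  end

  # Longest run of trailing pieces whose total size fits within `overlap`.
  defp tail_within(pieces, overlap) do
    pieces
    |> Enum.reverse()
    |> Enum.reduce_while([], fn piece, acc ->
      if size([piece | acc]) <= overlap, do: {:cont, [piece | acc]}, else: {:halt, acc}
    end)
  end

  defp size(pieces), do: pieces |> Enum.map(&String.length/1) |> Enum.sum()
end

text = "one two three four five"
pieces = RecursiveSplitSketch.split(text, 10)
merged = RecursiveSplitSketch.merge(pieces, 10, 5)
IO.inspect(merged)
# → ["one two ", "two three ", "four five"]
```

In this run the second chunk repeats "two " from the first because it fits within the 5-character overlap, while "three " is too long to carry into the third chunk. The sketch keeps separators attached to the preceding piece so the unmerged pieces still concatenate to the original text.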
Functions
@spec split(binary(), keyword()) :: [TextChunker.Chunk.t()]
Internal recursive chunking strategy. Use TextChunker.split/2 as the public API.
Splits text using prioritized separators, respecting chunk_size limits while
maintaining chunk_overlap for context preservation.
Options
- :chunk_size (integer) - Maximum chunk size
- :chunk_overlap (integer) - Overlap between chunks
- :format (atom) - Text format for separator selection
- :get_chunk_size (function) - Size calculation function (required)
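As a rough illustration of how these options fit together, the sketch below builds an options keyword list and compares two possible size functions. The specific values, the :plaintext format atom, and the assumption that :get_chunk_size receives a string and returns an integer are illustrative guesses, not confirmed by this page.

```elixir
# Two candidate size functions: each takes a string and returns an integer.
by_characters = &String.length/1
by_bytes = &byte_size/1

# Illustrative values only; :plaintext is an assumed format atom.
opts = [
  chunk_size: 1000,
  chunk_overlap: 200,
  format: :plaintext,
  get_chunk_size: by_characters
]

# The chosen function decides what "size" means for a chunk.
sample = "na\u00EFve caf\u00E9"
IO.inspect(opts[:get_chunk_size].(sample))
IO.inspect(by_bytes.(sample))

# With the library available, the list would be passed to the public API,
# for example: TextChunker.split(text, opts)
```

The two functions disagree on text containing multi-byte characters, so the choice of :get_chunk_size changes where chunk boundaries fall for the same :chunk_size.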