TextChunker.Strategies.RecursiveChunk (TextChunker v0.3.1)
Handles recursive text splitting, aiming to adhere to configured size and overlap requirements. Employs a flexible separator-based approach to break down text into manageable chunks, while generating metadata for each produced chunk.
Key Features:
- Size-Guided Chunking: Prioritizes splitting text into semantic blocks while respecting the maximum `chunk_size`.
- Context Preservation: Maintains `chunk_overlap` to minimize information loss at chunk boundaries.
- Separator Handling: Selects the most appropriate delimiter (e.g., line breaks, spaces) based on the text content.
- Metadata Generation: Creates `%TextChunker.Chunk{}` structs containing the split text and its original byte range.
- Oversized Chunk Warnings: Provides feedback when chunks cannot be created due to misconfiguration or limitations of the input text.
Algorithm Overview
- Separator Prioritization: Establishes a list of potential separators (e.g., line breaks, spaces), ordered by their expected relevance to the text structure.
- Recursive Splitting:
  - Iterates through the separator list.
  - Attempts to split the text using the current separator.
  - If a split is successful, recursively applies the algorithm to any resulting sub-chunks that still exceed the `chunk_size`.
- Chunk Assembly:
  - Combines smaller text segments into chunks, aiming to get as close to the `chunk_size` as possible.
  - Employs `chunk_overlap` to ensure smooth transitions between chunks.
- Metadata Generation: Tracks byte ranges for each chunk for potential reassembly of the original text.
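The steps above can be sketched as a small, self-contained module. This is an illustration under stated assumptions, not the library's implementation: the module name `RecursiveSketch`, the separator list, the plain maps used in place of `%TextChunker.Chunk{}`, and the choice to keep each separator attached to the piece that follows it are all hypothetical.

```elixir
defmodule RecursiveSketch do
  # Hypothetical separator list, coarsest first (assumption, not the library's list).
  @separators ["\n\n", "\n", " "]

  # Split recursively, tag pieces with byte offsets, then assemble chunks.
  def chunk(text, chunk_size, chunk_overlap) do
    text
    |> split(chunk_size, @separators)
    |> with_offsets()
    |> merge(chunk_size, chunk_overlap)
  end

  # A piece that already fits needs no further splitting.
  def split(text, chunk_size, _seps) when byte_size(text) <= chunk_size, do: [text]
  # Out of separators: return the oversized piece unchanged.
  def split(text, _chunk_size, []), do: [text]

  def split(text, chunk_size, [sep | rest]) do
    case String.split(text, sep) do
      [_unsplit] ->
        # Separator not present: try the next, finer one.
        split(text, chunk_size, rest)

      [first | others] ->
        # Keep each separator attached to the following piece so that
        # concatenating all pieces reproduces the input exactly.
        [first | Enum.map(others, &(sep <> &1))]
        |> Enum.reject(&(&1 == ""))
        |> Enum.flat_map(&split(&1, chunk_size, rest))
    end
  end

  # Pair every piece with its starting byte offset in the original text.
  defp with_offsets(pieces) do
    {tagged, _end} =
      Enum.map_reduce(pieces, 0, fn piece, offset ->
        {{offset, piece}, offset + byte_size(piece)}
      end)

    tagged
  end

  defp merge([], _chunk_size, _chunk_overlap), do: []

  # Greedily pack pieces; when the next piece would overflow, emit the current
  # chunk and seed the next one with trailing pieces worth at most the overlap.
  defp merge(pieces, chunk_size, chunk_overlap) do
    {done, current} =
      Enum.reduce(pieces, {[], []}, fn {_, text} = piece, {done, current} ->
        if current != [] and size(current) + byte_size(text) > chunk_size do
          limit = min(chunk_overlap, chunk_size - byte_size(text))
          {[to_chunk(current) | done], overlap_tail(current, limit) ++ [piece]}
        else
          {done, current ++ [piece]}
        end
      end)

    Enum.reverse([to_chunk(current) | done])
  end

  defp size(pieces), do: pieces |> Enum.map(fn {_, t} -> byte_size(t) end) |> Enum.sum()

  # Longest run of trailing pieces whose total size stays within `limit`.
  defp overlap_tail(pieces, limit) do
    pieces
    |> Enum.reverse()
    |> Enum.reduce_while({[], 0}, fn {_, text} = piece, {kept, total} ->
      total = total + byte_size(text)
      if total <= limit, do: {:cont, {[piece | kept], total}}, else: {:halt, {kept, total}}
    end)
    |> elem(0)
  end

  defp to_chunk([{start, _} | _] = pieces) do
    text = Enum.map_join(pieces, fn {_, t} -> t end)
    %{start_byte: start, end_byte: start + byte_size(text), text: text}
  end
end

text = "This is a very long text that needs to be split into smaller pieces for easier handling."
# With a 50-byte limit and a 10-byte overlap this yields two chunks,
# covering bytes 0..47 and 38..88 of the input.
RecursiveSketch.chunk(text, 50, 10)
```

Carrying byte offsets alongside the pieces means each chunk's range falls out of its first and last piece, so no searching in the original text is needed after assembly.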
Functions
@spec split(binary(), keyword()) :: [TextChunker.Chunk.t()]
Splits the given text into chunks using a recursive strategy. Prioritizes compliance with the configured `chunk_size` as a maximum, while aiming to maintain `chunk_overlap` for context preservation. Tries a prioritized list of separators so that splits fall on natural boundaries where possible.
Options
- `:chunk_size` (integer) - Target size in bytes for each chunk, treated as a maximum.
- `:chunk_overlap` (integer) - Number of overlapping bytes between chunks.
- `:format` (atom) - The format of the input text (influences separator selection).
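Because `:chunk_overlap` is measured against `:chunk_size`, it can help to sanity-check the two together before splitting. The following is a minimal sketch of such a check; `OptionsSketch`, its error atoms, and the default overlap of 0 are hypothetical and do not describe the library's own validation or defaults.

```elixir
defmodule OptionsSketch do
  # Hypothetical pre-flight check; the library's own rules may differ.
  def validate(opts) do
    size = Keyword.fetch!(opts, :chunk_size)
    overlap = Keyword.get(opts, :chunk_overlap, 0)

    cond do
      not is_integer(size) or size <= 0 -> {:error, :invalid_chunk_size}
      not is_integer(overlap) or overlap < 0 -> {:error, :invalid_chunk_overlap}
      # An overlap as large as the chunk itself would leave no room for new text.
      overlap >= size -> {:error, :overlap_not_smaller_than_chunk_size}
      true -> {:ok, %{chunk_size: size, chunk_overlap: overlap}}
    end
  end
end

OptionsSketch.validate(chunk_size: 50, chunk_overlap: 10)
OptionsSketch.validate(chunk_size: 10, chunk_overlap: 10)
```

The first call passes the check; the second is rejected because the overlap is not smaller than the chunk size.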
Examples
iex> long_text = "This is a very long text that needs to be split into smaller pieces for easier handling."
iex> TextChunker.Strategies.RecursiveChunk.split(long_text, chunk_size: 50, chunk_overlap: 10)
[
  %TextChunker.Chunk{
    start_byte: 0,
    end_byte: 47,
    text: "This is a very long text that needs to be split"
  },
  %TextChunker.Chunk{
    start_byte: 38,
    end_byte: 88,
    text: " be split into smaller pieces for easier handling."
  }
]
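The byte ranges make it possible to rebuild the original text from overlapping chunks. The sketch below shows one way to do that, assuming the chunks are sorted by `start_byte`, leave no gaps, and use an exclusive `end_byte` (as the ranges above suggest). `ReassembleSketch` is a hypothetical helper, and plain maps stand in for `%TextChunker.Chunk{}` so the snippet runs without the library.

```elixir
defmodule ReassembleSketch do
  # Concatenates chunk texts, skipping the bytes each chunk shares
  # with the text that has already been emitted.
  def reassemble(chunks) do
    {parts, _covered} =
      Enum.map_reduce(chunks, 0, fn chunk, covered ->
        skip =
          (covered - chunk.start_byte)
          |> max(0)
          |> min(byte_size(chunk.text))

        fresh = binary_part(chunk.text, skip, byte_size(chunk.text) - skip)
        {fresh, max(covered, chunk.end_byte)}
      end)

    IO.iodata_to_binary(parts)
  end
end

chunks = [
  %{start_byte: 0, end_byte: 47, text: "This is a very long text that needs to be split"},
  %{start_byte: 38, end_byte: 88, text: " be split into smaller pieces for easier handling."}
]

# Returns the original 88-byte sentence with the 9 shared bytes emitted once.
ReassembleSketch.reassemble(chunks)
```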