TextChunker.Strategies.RecursiveChunk (TextChunker v0.3.1)

Handles recursive text splitting, aiming to adhere to configured size and overlap requirements. Employs a flexible separator-based approach to break down text into manageable chunks, while generating metadata for each produced chunk.

Key Features:

  • Size-Guided Chunking: Prioritizes splitting text into semantic blocks while respecting the maximum chunk_size.
  • Context Preservation: Maintains chunk_overlap to minimize information loss at chunk boundaries.
  • Separator Handling: Selects the most appropriate delimiter (e.g., line breaks, spaces) based on the text content.
  • Metadata Generation: Creates %TextChunker.Chunk{} structs containing the split text and its original byte range.
  • Oversized Chunk Warnings: Warns when a chunk cannot be kept within chunk_size because of misconfiguration or limitations of the input text.

Algorithm Overview

  1. Separator Prioritization: Establishes a list of potential separators (e.g., line breaks, spaces), ordered by their expected relevance to the text structure.
  2. Recursive Splitting:
     • Iterates through the separator list.
     • Attempts to split the text using the current separator.
     • If a split is successful, recursively applies the algorithm to any resulting sub-chunks that still exceed the chunk_size.
  3. Chunk Assembly:
     • Combines smaller text segments into chunks, aiming to get as close to the chunk_size as possible.
     • Employs chunk_overlap to ensure smooth transitions between chunks.
  4. Metadata Generation: Tracks byte ranges for each chunk for potential reassembly of the original text.
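
The splitting and assembly steps above can be sketched in a few lines of plain Elixir. This is a simplified illustration under stated assumptions, not the library's implementation: the module name RecursiveSplitSketch, the separator list, and the choice to re-attach each separator to the front of the following piece are all hypothetical, and both chunk_overlap and byte-range tracking are left out to keep the sketch short.

```elixir
defmodule RecursiveSplitSketch do
  # Hypothetical module for illustration only; not part of TextChunker.
  # Assumed separator priority: paragraph breaks, line breaks, then spaces.
  @default_separators ["\n\n", "\n", " "]

  # Returns a list of binaries, each at most `chunk_size` bytes where the
  # separators allow it. Joining the result restores the input exactly.
  def split(text, chunk_size, separators \\ @default_separators) do
    text
    |> pieces(chunk_size, separators)
    |> merge(chunk_size)
  end

  # Small enough already: keep the piece as it is.
  defp pieces(text, chunk_size, _separators) when byte_size(text) <= chunk_size, do: [text]

  # No separators left to try: the piece stays oversized.
  defp pieces(text, _chunk_size, []), do: [text]

  defp pieces(text, chunk_size, [separator | rest]) do
    case String.split(text, separator) do
      # Separator not present: fall through to the next one.
      [_unsplit] ->
        pieces(text, chunk_size, rest)

      [first | others] ->
        # Re-attach the separator to the front of each later part so that
        # no bytes are lost, then recurse into parts that are still too big.
        [first | Enum.map(others, &(separator <> &1))]
        |> Enum.reject(&(&1 == ""))
        |> Enum.flat_map(&pieces(&1, chunk_size, rest))
    end
  end

  # Greedily joins adjacent pieces while the result stays within `chunk_size`.
  defp merge(pieces, chunk_size) do
    pieces
    |> Enum.reduce([], fn
      piece, [current | done] when byte_size(current) + byte_size(piece) <= chunk_size ->
        [current <> piece | done]

      piece, acc ->
        [piece | acc]
    end)
    |> Enum.reverse()
  end
end

text = "This is a very long text that needs to be split into smaller pieces for easier handling."
chunks = RecursiveSplitSketch.split(text, 50)
IO.inspect(chunks)
# Prints:
# ["This is a very long text that needs to be split",
#  " into smaller pieces for easier handling."]
```

Keeping each separator attached to the piece that follows it means the pieces can be concatenated back into the original text, which is what makes byte-range metadata straightforward to derive in a fuller implementation.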

Functions

split(text, opts)

@spec split(
  binary(),
  keyword()
) :: [TextChunker.Chunk.t()]

Splits the given text into chunks using a recursive strategy. Treats the configured chunk_size as a maximum while aiming to maintain chunk_overlap for context preservation, and works through a prioritized list of separators (e.g., line breaks, spaces) to find split points.

Options

  • :chunk_size (integer) - Target maximum size in bytes for each chunk; chunks stay at or below it where the text allows.
  • :chunk_overlap (integer) - Number of overlapping bytes between chunks.
  • :format (atom) - The format of the input text (influences separator selection).
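
Since the option descriptions above give sizes in bytes, text containing non-ASCII characters may use up the size budget faster than its visible length suggests. The snippet below is a small illustration using only Elixir's standard library; the sample string is an arbitrary choice.

```elixir
# "naïve café", written with escapes so the encoding is unambiguous.
text = "na\u00EFve caf\u00E9"

# byte_size/1 counts bytes; String.length/1 counts graphemes.
bytes = byte_size(text)
graphemes = String.length(text)

IO.inspect({bytes, graphemes})
# Prints: {12, 10}
```

Each of the two accented letters occupies two bytes in UTF-8, so the ten visible characters take up twelve bytes.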

Examples

iex> long_text = "This is a very long text that needs to be split into smaller pieces for easier handling."

iex> TextChunker.Strategies.RecursiveChunk.split(long_text, chunk_size: 15, chunk_overlap: 5)
[
  %TextChunker.Chunk{
    start_byte: 0,
    end_byte: 47,
    text: "This is a very long text that needs to be split"
  },
  %TextChunker.Chunk{
    start_byte: 38,
    end_byte: 88,
    text: " be split into smaller pieces for easier handling."
  }
]
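
The byte ranges in the output above can be checked against the original text, and used to rebuild it, with binary_part/3 alone. The sketch below uses plain maps in place of %TextChunker.Chunk{} structs so that it runs without the library; the field values are copied from the example output, and treating end_byte as exclusive is an assumption that happens to match those values.

```elixir
long_text =
  "This is a very long text that needs to be split into smaller pieces for easier handling."

# Plain maps standing in for %TextChunker.Chunk{} structs, with values
# copied from the example output.
chunks = [
  %{start_byte: 0, end_byte: 47, text: "This is a very long text that needs to be split"},
  %{start_byte: 38, end_byte: 88, text: " be split into smaller pieces for easier handling."}
]

# Each chunk's text should equal the slice of the original at its byte range
# (end_byte is assumed to be exclusive).
ranges_match? =
  Enum.all?(chunks, fn chunk ->
    binary_part(long_text, chunk.start_byte, chunk.end_byte - chunk.start_byte) == chunk.text
  end)

# Rebuild the original by skipping the bytes each chunk shares with its predecessor.
{rebuilt, _covered} =
  Enum.reduce(chunks, {"", 0}, fn chunk, {acc, covered} ->
    skip = max(covered - chunk.start_byte, 0)
    fresh = binary_part(chunk.text, skip, byte_size(chunk.text) - skip)
    {acc <> fresh, chunk.end_byte}
  end)

IO.inspect({ranges_match?, rebuilt == long_text})
# Prints: {true, true}
```

Here the second chunk starts at byte 38 while the first ends at byte 47, so the nine bytes of " be split" appear in both chunks and are skipped once during reassembly.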