TextChunker (TextChunker v0.3.1)

Provides a high-level interface for text chunking, employing a configurable splitting strategy (defaults to recursive splitting). Manages options and coordinates the process, tracking chunk metadata.

Key Features

  • Customizable Splitting: Allows the splitting strategy to be customized via the :strategy option.
  • Size and Overlap Control: Provides options for :chunk_size and :chunk_overlap.
  • Metadata Tracking: Generates Chunk structs containing byte range information.

Supported Options

  • :chunk_size (positive integer, default: 2000) - Maximum length of each chunk, measured in code points.
  • :chunk_overlap (non-negative integer, default: 200) - Number of overlapping code points between consecutive chunks to preserve context.
  • :strategy (module, default: RecursiveChunk) - A module implementing the split function. Currently only RecursiveChunk is supported.
  • :format (atom, default: :plaintext) - The format of the input text. Used to determine where to split the text in some strategies.
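To make the interplay of :chunk_size and :chunk_overlap concrete, here is a minimal sketch of a fixed sliding window over code points. It is not the RecursiveChunk strategy: the NaiveWindow module and the :text and :byte_range map keys are hypothetical names chosen for illustration only. It also shows why a size counted in code points and a range counted in bytes can differ for non-ASCII text.

```elixir
defmodule NaiveWindow do
  # Hypothetical illustration only; this is NOT how RecursiveChunk splits text.
  # chunk_size and chunk_overlap are counted in code points, while the
  # reported range is counted in bytes.
  def split(text, chunk_size, chunk_overlap)
      when is_binary(text) and chunk_size > 0 and
             chunk_overlap >= 0 and chunk_overlap < chunk_size do
    step = chunk_size - chunk_overlap
    window(String.codepoints(text), chunk_size, step, 0, [])
  end

  # No code points left: return the chunks in their original order.
  defp window([], _size, _step, _offset, acc), do: Enum.reverse(acc)

  defp window(codepoints, size, step, offset, acc) do
    {head, tail} = Enum.split(codepoints, size)
    text = Enum.join(head)
    chunk = %{text: text, byte_range: {offset, offset + byte_size(text)}}

    case tail do
      # The window reached the end of the input, so this is the last chunk.
      [] ->
        Enum.reverse([chunk | acc])

      _ ->
        # Advance by `step` code points; the byte offset grows by the
        # byte size of the code points that were skipped.
        {skipped, rest} = Enum.split(codepoints, step)
        window(rest, size, step, offset + byte_size(Enum.join(skipped)), [chunk | acc])
    end
  end
end
```

With a chunk size of 4 and an overlap of 1, NaiveWindow.split("abcdefghij", 4, 1) produces the texts "abcd", "defg" and "ghij": each chunk begins with the last code point of the previous one. For "h\u00E9llo" with a chunk size of 2, the first chunk holds two code points but spans three bytes.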

Functions

@spec split(
  binary(),
  keyword()
) :: [Chunk.t()] | {:error, String.t()}

Splits the provided text into a list of %Chunk{} structs.
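Because the spec allows either a list of chunks or an {:error, String.t()} tuple, callers may want to branch on both shapes. The following is a small sketch of one way to do that; SplitResult and normalize/1 are hypothetical names and are not part of TextChunker.

```elixir
defmodule SplitResult do
  # Hypothetical helper, not part of TextChunker: wraps the two return
  # shapes from the @spec into a conventional {:ok, _} | {:error, _} tuple.
  def normalize({:error, message}) when is_binary(message), do: {:error, message}
  def normalize(chunks) when is_list(chunks), do: {:ok, chunks}
end

# Intended use (requires the TextChunker dependency):
#   text |> TextChunker.split(chunk_size: 500) |> SplitResult.normalize()
```

Normalizing the result this way lets the call sit inside a `with` or `case` expression alongside other functions that return tagged tuples.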

Examples

iex> long_text = "This is a very long text that needs to be split into smaller pieces for easier handling."
iex> TextChunker.split(long_text)
# => [%Chunk{}, %Chunk{}, ...]

iex> TextChunker.split(long_text, chunk_size: 10, chunk_overlap: 3)
# => Generates many smaller chunks with significant overlap