Implements a sentence-based chunking strategy.
Splits text into overlapping chunks at sentence boundaries while respecting token limits.
Summary
Functions
Splits text into overlapping chunks using sentence boundaries.
Types
@type chunk_opts() :: [
        chunk_size: pos_integer(),
        chunk_overlap: pos_integer(),
        min_sentences_per_chunk: pos_integer(),
        delimiters: [String.t()],
        short_sentence_threshold: pos_integer()
      ]
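A keyword list matching this type might look like the following sketch, shown here with the documented default values:

```elixir
# Example chunk_opts() keyword list using the documented defaults.
opts = [
  chunk_size: 512,
  chunk_overlap: 128,
  min_sentences_per_chunk: 1,
  delimiters: [".", "!", "?", "\n"],
  short_sentence_threshold: 6
]
```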
Functions
@spec chunk(binary(), Tokenizers.Tokenizer.t(), chunk_opts()) :: {:ok, [Chunx.Chunk.t()]} | {:error, term()}
Splits text into overlapping chunks using sentence boundaries.
Options
* `:chunk_size` - Maximum number of tokens per chunk (default: 512). The chunker fits as many complete sentences as possible while staying under this limit. If a single sentence exceeds the limit, it is still included as its own chunk.
* `:chunk_overlap` - Number of tokens that overlap between consecutive chunks (default: 128). This helps maintain context between chunks by repeating some sentences from the end of the previous chunk at the start of the next. Must be less than `:chunk_size`.
* `:min_sentences_per_chunk` - Minimum number of sentences that must be included in each chunk (default: 1). This ensures chunks contain complete thoughts, even if including multiple sentences exceeds `:chunk_size`.
* `:delimiters` - List of sentence delimiters at which the text is split (default: `[".", "!", "?", "\n"]`).
* `:short_sentence_threshold` - Below this byte size a sentence is considered too short and is concatenated with the next sentence (default: 6).
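The interaction of `:delimiters` and `:short_sentence_threshold` can be illustrated with a simplified, self-contained sketch — this is not the library's actual implementation, and the module name `SentenceSplitSketch` is invented for the example:

```elixir
defmodule SentenceSplitSketch do
  @default_delimiters [".", "!", "?", "\n"]

  # Split text into sentences, then merge sentences shorter than
  # `threshold` bytes into the sentence that follows them.
  def split(text, threshold \\ 6, delimiters \\ @default_delimiters) do
    text
    |> split_keep_delimiters(delimiters)
    |> merge_short(threshold)
  end

  # Split after each delimiter (lookbehind keeps the delimiter
  # attached to its sentence), drop empty fragments, trim whitespace.
  defp split_keep_delimiters(text, delimiters) do
    pattern = Enum.map_join(delimiters, "|", &Regex.escape/1)

    ~r/(?<=#{pattern})/
    |> Regex.split(text, trim: true)
    |> Enum.map(&String.trim/1)
    |> Enum.reject(&(&1 == ""))
  end

  # Concatenate any sentence below the byte-size threshold with the
  # next sentence, mirroring the :short_sentence_threshold behavior.
  defp merge_short(sentences, threshold) do
    sentences
    |> Enum.reduce([], fn sentence, acc ->
      case acc do
        [prev | rest] when byte_size(prev) < threshold ->
          [prev <> " " <> sentence | rest]

        _ ->
          [sentence | acc]
      end
    end)
    |> Enum.reverse()
  end
end

SentenceSplitSketch.split("Hi. This is a test. Ok? Done now.")
# → ["Hi. This is a test.", "Ok? Done now."]
```

Here `"Hi."` and `"Ok?"` are below the 6-byte threshold, so each is merged with the sentence that follows; the real chunker then packs the resulting sentences into token-bounded, overlapping chunks.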