Chunx.Chunker.Sentence (chunx v0.1.0)

Implements a sentence-based chunking strategy.

Splits text into overlapping chunks based on sentences while respecting token limits.

Summary

Types

chunk_opts()

Functions

chunk(text, tokenizer, opts \\ [])

Splits text into overlapping chunks using sentence boundaries.

Types

chunk_opts()

@type chunk_opts() :: [
  chunk_size: pos_integer(),
  chunk_overlap: pos_integer(),
  min_sentences_per_chunk: pos_integer(),
  delimiters: [String.t()],
  short_sentence_threshold: pos_integer()
]

Functions

chunk(text, tokenizer, opts \\ [])

@spec chunk(binary(), Tokenizers.Tokenizer.t(), chunk_opts()) ::
  {:ok, [Chunx.Chunk.t()]} | {:error, term()}

Splits text into overlapping chunks using sentence boundaries.
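A minimal usage sketch follows. It assumes the package is published on Hex as :chunx, that the :tokenizers package is available alongside it, and that the "bert-base-uncased" tokenizer can be downloaded from the Hugging Face Hub; the sample text and option values are illustrative only.

```elixir
# Assumption: both packages resolve from Hex under these names.
Mix.install([:chunx, :tokenizers])

alias Chunx.Chunker.Sentence

# Tokenizers.Tokenizer.from_pretrained/1 fetches a tokenizer definition
# from the Hugging Face Hub (network access required).
{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("bert-base-uncased")

# Illustrative input: the same three sentences repeated to get a longer text.
text =
  String.duplicate(
    "Sentence chunking keeps sentences whole. It packs them into token-limited chunks. ",
    20
  )

# Small limits so that the sample text yields more than one chunk.
result = Sentence.chunk(text, tokenizer, chunk_size: 64, chunk_overlap: 16)

case result do
  {:ok, chunks} -> IO.puts("Produced #{length(chunks)} chunk(s)")
  {:error, reason} -> IO.inspect(reason, label: "chunking failed")
end
```

Both return shapes from the spec are handled, because chunk/3 can return {:error, term()} as well as {:ok, chunks}.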

Options

:chunk_size - Maximum number of tokens per chunk (default: 512). The chunker fits as many complete sentences as possible without exceeding this limit. A single sentence that exceeds the limit is still included as its own chunk.
:chunk_overlap - Number of tokens that should overlap between consecutive chunks (default: 128). This helps maintain context between chunks by including some sentences from the end of the previous chunk at the start of the next chunk. Must be less than chunk_size.
:min_sentences_per_chunk - Minimum number of sentences that must be included in each chunk (default: 1). This ensures chunks contain complete thoughts, even if including multiple sentences would exceed chunk_size.
:delimiters - List of sentence delimiters. Sentences are split at these delimiters (default: [".", "!", "?", "\n"]).
:short_sentence_threshold - Minimum sentence size in bytes (default: 6). A sentence shorter than this is considered too short and is concatenated with the next sentence.
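To illustrate how :delimiters and :short_sentence_threshold interact, here is a rough sketch in plain Elixir. SentenceSplitSketch is a hypothetical module written for this example and is not the library's implementation. It assumes each delimiter stays attached to the sentence it ends, and that short fragments are accumulated until the combined text reaches the threshold.

```elixir
defmodule SentenceSplitSketch do
  # Hypothetical helper for illustration only; not part of Chunx.

  @default_delimiters [".", "!", "?", "\n"]

  def split(text, opts \\ []) do
    delimiters = Keyword.get(opts, :delimiters, @default_delimiters)
    threshold = Keyword.get(opts, :short_sentence_threshold, 6)

    # Build a regex that matches any one of the delimiters literally.
    pattern =
      delimiters
      |> Enum.map(&Regex.escape/1)
      |> Enum.join("|")
      |> Regex.compile!()

    pattern
    # include_captures: true keeps the matched delimiters in the result list.
    |> Regex.split(text, include_captures: true)
    # Pair each text piece with the delimiter that follows it.
    |> Enum.chunk_every(2)
    |> Enum.map(&Enum.join/1)
    |> Enum.reject(&(&1 == ""))
    |> merge_short(threshold)
  end

  # Concatenate fragments until the combined text has at least `threshold` bytes.
  defp merge_short(sentences, threshold) do
    {merged, pending} =
      Enum.reduce(sentences, {[], ""}, fn sentence, {acc, pending} ->
        candidate = pending <> sentence

        if byte_size(candidate) < threshold do
          {acc, candidate}
        else
          {[candidate | acc], ""}
        end
      end)

    # A trailing short fragment has no next sentence, so keep it as is.
    merged = if pending == "", do: merged, else: [pending | merged]
    Enum.reverse(merged)
  end
end

SentenceSplitSketch.split("Hi. How are you? I am fine.")
# => ["Hi. How are you?", " I am fine."]
```

With the default threshold of 6, "Hi." (3 bytes) is too short and is joined to the sentence that follows it. Passing short_sentence_threshold: 1 keeps all three sentences separate.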