Chunx.Chunker.Sentence (chunx v0.1.0)


Implements a sentence-based chunking strategy.

Splits text into overlapping chunks based on sentences while respecting token limits.

Summary

Functions

Splits text into overlapping chunks using sentence boundaries.

Types

chunk_opts()

@type chunk_opts() :: [
  chunk_size: pos_integer(),
  chunk_overlap: pos_integer(),
  min_sentences_per_chunk: pos_integer(),
  delimiters: [String.t()],
  short_sentence_threshold: pos_integer()
]

Functions

chunk(text, tokenizer, opts \\ [])

@spec chunk(binary(), Tokenizers.Tokenizer.t(), chunk_opts()) ::
  {:ok, [Chunx.Chunk.t()]} | {:error, term()}

Splits text into overlapping chunks using sentence boundaries.
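A minimal usage sketch follows. It is illustrative, not taken from the Chunx sources: the `ChunkingExample` module and the `"gpt2"` model name are hypothetical, and it assumes the `tokenizers` package is available so that `Tokenizers.Tokenizer.from_pretrained/1` can load a tokenizer. The option values simply spell out the documented defaults.

```elixir
defmodule ChunkingExample do
  # Hypothetical wrapper module, used only for illustration.
  # These values mirror the defaults documented for chunk/3.
  @opts [
    chunk_size: 512,
    chunk_overlap: 128,
    min_sentences_per_chunk: 1,
    delimiters: [".", "!", "?", "\n"],
    short_sentence_threshold: 6
  ]

  def opts, do: @opts

  # Assumption: "gpt2" is just an example identifier; any value that
  # yields a Tokenizers.Tokenizer.t() should work here.
  def run(text, model \\ "gpt2") do
    with {:ok, tokenizer} <- Tokenizers.Tokenizer.from_pretrained(model) do
      # Returns {:ok, chunks} or {:error, reason}, per the @spec above.
      Chunx.Chunker.Sentence.chunk(text, tokenizer, @opts)
    end
  end
end
```

Because `chunk/3` returns a tagged tuple, callers would typically pattern match on `{:ok, chunks}` and handle `{:error, reason}` explicitly rather than assume success.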

Options

  • :chunk_size - Maximum number of tokens per chunk (default: 512). The chunker will try to fit as many complete sentences as possible while staying under this limit. If a single sentence exceeds this limit, it will still be included as its own chunk.

  • :chunk_overlap - Number of tokens that should overlap between consecutive chunks (default: 128). This helps maintain context between chunks by including some sentences from the end of the previous chunk at the start of the next chunk. Must be less than chunk_size.

  • :min_sentences_per_chunk - Minimum number of sentences that must be included in each chunk (default: 1). This ensures chunks contain complete thoughts, even if including multiple sentences would exceed chunk_size.

  • :delimiters - List of sentence delimiters; sentences are split at these delimiters (default: [".", "!", "?", "\n"]).

  • :short_sentence_threshold - Sentences shorter than this many bytes are considered too short and are concatenated with the next sentence (default: 6).
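To make the interaction between :delimiters and :short_sentence_threshold concrete, here is a rough, standard-library-only sketch. It is not Chunx's actual implementation: the `SentenceSplitSketch` module is hypothetical, and it assumes that each delimiter stays attached to the sentence it ends and that a short trailing fragment with no following sentence is kept as-is.

```elixir
defmodule SentenceSplitSketch do
  # Hypothetical module; mirrors the documented default delimiters.
  @default_delimiters [".", "!", "?", "\n"]

  # Splits text at each delimiter, keeping the delimiter attached to
  # the sentence that precedes it (an assumption of this sketch).
  def split(text, delimiters \\ @default_delimiters) do
    pattern =
      delimiters
      |> Enum.map(&Regex.escape/1)
      |> Enum.join("|")
      |> Regex.compile!()

    pattern
    |> Regex.split(text, include_captures: true, trim: true)
    |> Enum.reduce([], fn
      piece, [prev | rest] = acc ->
        # A delimiter is glued onto the previous piece; anything else
        # starts a new sentence.
        if piece in delimiters, do: [prev <> piece | rest], else: [piece | acc]

      piece, [] ->
        [piece]
    end)
    |> Enum.reverse()
  end

  # Concatenates any sentence shorter than `threshold` bytes with the
  # sentence that follows it.
  def merge_short(sentences, threshold \\ 6) do
    {merged, pending} =
      Enum.reduce(sentences, {[], ""}, fn sentence, {acc, pending} ->
        candidate = pending <> sentence

        if byte_size(candidate) < threshold do
          {acc, candidate}
        else
          {[candidate | acc], ""}
        end
      end)

    # A short trailing fragment has no next sentence, so it is kept alone.
    merged = if pending == "", do: merged, else: [pending | merged]
    Enum.reverse(merged)
  end
end
```

With the defaults, "Hi." (3 bytes) and " Ok!" (4 bytes) fall below the 6-byte threshold, so each is merged into the sentence that follows it before any token-based packing would take place.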