Implements semantic chunking using sentence embeddings.
Splits text into semantically coherent chunks using embeddings while respecting token limits.
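To make the idea concrete, here is a minimal, self-contained sketch (not the library's actual implementation): given one embedding per sentence, adjacent sentences stay in the same chunk while their cosine similarity is at or above the threshold, and a new chunk starts where similarity drops below it. The module name and helpers are invented for illustration, and the token-limit handling the real chunker also performs is omitted.

```elixir
defmodule SemanticSplitSketch do
  # Illustrative only; names here are not part of the Chunx API.

  # Cosine similarity between two plain float lists.
  def cosine(a, b) do
    dot = Enum.zip_with(a, b, &(&1 * &2)) |> Enum.sum()
    norm = fn v -> v |> Enum.map(&(&1 * &1)) |> Enum.sum() |> :math.sqrt() end
    dot / (norm.(a) * norm.(b))
  end

  # Group consecutive sentence indices; start a new group whenever the
  # similarity to the previous sentence falls below `threshold`.
  def split(embeddings, threshold) do
    embeddings
    |> Enum.with_index()
    |> Enum.chunk_while(
      [],
      fn {emb, idx}, acc ->
        case acc do
          [] ->
            {:cont, [{emb, idx}]}

          [{prev, _} | _] ->
            if cosine(prev, emb) >= threshold do
              {:cont, [{emb, idx} | acc]}
            else
              {:cont, acc |> Enum.reverse() |> Enum.map(fn {_, i} -> i end),
               [{emb, idx}]}
            end
        end
      end,
      fn
        [] -> {:cont, []}
        acc -> {:cont, acc |> Enum.reverse() |> Enum.map(fn {_, i} -> i end), []}
      end
    )
  end
end
```

With four toy 2-dimensional embeddings where the first two sentences point one way and the last two another, `split/2` at threshold `0.8` yields two chunks of sentence indices.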
## Summary

### Functions

- `chunk/4` - Splits text into semantically coherent chunks using embeddings.
### Types

```elixir
@type chunk_opts() :: [
  chunk_size: pos_integer(),
  threshold: float() | :auto,
  min_sentences: pos_integer(),
  min_chunk_size: pos_integer(),
  threshold_step: float()
]
```
## Functions

### chunk/4

```elixir
@spec chunk(
        binary(),
        Tokenizers.Tokenizer.t(),
        ([String.t()] -> [Nx.Tensor.t()]),
        chunk_opts()
      ) :: {:ok, [Chunx.SentenceChunk.t()]} | {:error, term()}
```

Splits text into semantically coherent chunks using embeddings.
#### Options

- `:chunk_size` - Maximum number of tokens per chunk (default: `512`)
- `:threshold` - Threshold for semantic similarity (0-1) or `:auto` (default: `:auto`)
- `:min_sentences` - Minimum number of sentences per chunk (default: `1`)
- `:min_chunk_size` - Minimum number of tokens per chunk (default: `2`)
- `:threshold_step` - Step size for automatic threshold calculation (default: `0.01`)
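A hedged usage sketch follows. Only the arity and argument types of `chunk/4` come from the spec above; the module name `Chunx.Chunker.Semantic`, the pretrained tokenizer, and the embedding serving are assumptions for illustration — substitute whatever embedding model you actually run (for example, one hosted with Bumblebee and `Nx.Serving`).

```elixir
# Hypothetical usage; names beyond the chunk/4 spec are assumptions.
{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("bert-base-uncased")

# The third argument is a function from a list of sentences to a list of
# Nx tensors, one embedding per sentence. Here it delegates to an
# embedding serving assumed to be started elsewhere in your application.
embedding_fn = fn sentences ->
  Enum.map(sentences, fn sentence ->
    MyApp.Embeddings.embed(sentence)
  end)
end

{:ok, chunks} =
  Chunx.Chunker.Semantic.chunk(text, tokenizer, embedding_fn,
    chunk_size: 512,
    threshold: :auto,
    min_sentences: 1
  )
```

On success, `chunks` is a list of `Chunx.SentenceChunk.t()` structs per the spec; on failure, the call returns `{:error, term()}`.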