Chunx.Chunker.Semantic (chunx v0.1.0)

Implements semantic chunking using sentence embeddings.

Splits text into semantically coherent chunks by comparing sentence embeddings, while keeping each chunk within the configured token limit (:chunk_size).

Summary

Types

chunk_opts()

Functions

chunk(text, tokenizer, embedding_fun, opts \\ [])

Splits text into semantically coherent chunks using embeddings.

Types

chunk_opts()

@type chunk_opts() :: [
  chunk_size: pos_integer(),
  threshold: float() | :auto,
  min_sentences: pos_integer(),
  min_chunk_size: pos_integer(),
  threshold_step: float()
]
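
As an illustration of this type, the sketch below builds a chunk_opts() keyword list in plain Elixir. The defaults are copied from the Options list documented for chunk/4. The merge step is only there to show how overrides combine with those values; since every option has a documented default, passing just the overrides is presumably enough in practice.

```elixir
# Documented defaults, copied from the Options list for chunk/4.
defaults = [
  chunk_size: 512,
  threshold: :auto,
  min_sentences: 1,
  min_chunk_size: 2,
  threshold_step: 0.01
]

# Caller overrides: smaller chunks and a fixed similarity threshold.
overrides = [chunk_size: 256, threshold: 0.6]

# Keyword.merge/2 keeps the defaults and lets the overrides win.
opts = Keyword.merge(defaults, overrides)

IO.inspect(opts)
```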

Functions

chunk(text, tokenizer, embedding_fun, opts \\ [])

@spec chunk(
  binary(),
  Tokenizers.Tokenizer.t(),
  ([String.t()] -> [Nx.Tensor.t()]),
  chunk_opts()
) :: {:ok, [Chunx.SentenceChunk.t()]} | {:error, term()}
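
The spec requires embedding_fun to take a list of strings and return a list of Nx tensors, one per input string. The sketch below is a toy function with that shape, assuming Nx is available as a dependency. ToyEmbedding and its hashed bag-of-words scheme are hypothetical stand-ins for a real embedding model and are not part of Chunx.

```elixir
defmodule ToyEmbedding do
  # Hypothetical helper, not part of Chunx: a hashed bag-of-words "embedding".
  @dims 16

  # Pure part: turns one sentence into a list of @dims word counts.
  def vectorize(sentence) when is_binary(sentence) do
    sentence
    |> String.downcase()
    |> String.split(~r/\W+/u, trim: true)
    |> Enum.reduce(List.duplicate(0.0, @dims), fn word, acc ->
      # :erlang.phash2/2 maps the word to a bucket in 0..@dims-1.
      List.update_at(acc, :erlang.phash2(word, @dims), &(&1 + 1.0))
    end)
  end

  # Matches the ([String.t()] -> [Nx.Tensor.t()]) shape from the spec:
  # one tensor per input sentence, in the same order.
  def embed(sentences) when is_list(sentences) do
    Enum.map(sentences, fn sentence -> Nx.tensor(vectorize(sentence)) end)
  end
end
```

A function like this would be passed as &ToyEmbedding.embed/1. It only demonstrates the expected input and output shape; a real sentence embedding model is needed for meaningful chunk boundaries.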

Splits text into semantically coherent chunks using embeddings.

Options

:chunk_size - Maximum number of tokens per chunk (default: 512)
:threshold - Semantic similarity threshold, either a float between 0 and 1 or :auto (default: :auto)
:min_sentences - Minimum number of sentences per chunk (default: 1)
:min_chunk_size - Minimum number of tokens per chunk (default: 2)
:threshold_step - Step size used when calculating the threshold (default: 0.01)
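
A call to chunk/4 might look like the minimal sketch below, which assumes chunx, tokenizers, and nx are dependencies of the project. SemanticExample, placeholder_embeddings/1, and the "gpt2" tokenizer identifier are assumptions made for illustration; the placeholder embedding only satisfies the type in the spec and does not capture meaning.

```elixir
defmodule SemanticExample do
  # Hypothetical stand-in for a real embedding model: returns one small
  # tensor per sentence so the call matches the documented spec.
  def placeholder_embeddings(sentences) do
    Enum.map(sentences, fn sentence ->
      words = sentence |> String.split() |> length()
      Nx.tensor([String.length(sentence) * 1.0, words * 1.0])
    end)
  end

  def run(text) do
    # "gpt2" is an assumed tokenizer identifier; any tokenizer that the
    # tokenizers library can load should fit here.
    with {:ok, tokenizer} <- Tokenizers.Tokenizer.from_pretrained("gpt2"),
         {:ok, chunks} <-
           Chunx.Chunker.Semantic.chunk(
             text,
             tokenizer,
             &placeholder_embeddings/1,
             chunk_size: 256,
             threshold: 0.5
           ) do
      IO.puts("Produced #{length(chunks)} chunk(s)")
      {:ok, chunks}
    else
      {:error, reason} ->
        IO.puts("Chunking failed: #{inspect(reason)}")
        {:error, reason}
    end
  end
end
```

Matching on both {:ok, chunks} and {:error, reason} follows the return type in the spec, so a failure to load the tokenizer or to chunk the text surfaces as a tagged tuple rather than a crash.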