Chunx.Chunker.Semantic (chunx v0.1.0)

Implements semantic chunking: uses sentence embeddings to split text into semantically coherent chunks while respecting token limits.

Summary

Functions

chunk(text, tokenizer, embedding_fun, opts \\ [])

Splits text into semantically coherent chunks using embeddings.

Types

chunk_opts()

@type chunk_opts() :: [
  chunk_size: pos_integer(),
  threshold: float() | :auto,
  min_sentences: pos_integer(),
  min_chunk_size: pos_integer(),
  threshold_step: float()
]
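
As a small illustration of the type above, the following sketch builds a `chunk_opts()` keyword list in plain Elixir. The default values are copied from the Options section of `chunk/4`; the variable names `defaults` and `opts` are only illustrative. Since `opts` is optional, callers would normally pass only the keys they want to change, and the explicit merge here merely shows what the full list looks like.

```elixir
# Defaults as listed in the Options section of chunk/4.
defaults = [
  chunk_size: 512,
  threshold: :auto,
  min_sentences: 1,
  min_chunk_size: 2,
  threshold_step: 0.01
]

# Override a subset; Keyword.merge/2 keeps the remaining defaults.
opts = Keyword.merge(defaults, chunk_size: 256, threshold: 0.7)

IO.inspect(opts, label: "chunk_opts")
```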

Functions

chunk(text, tokenizer, embedding_fun, opts \\ [])

@spec chunk(
  binary(),
  Tokenizers.Tokenizer.t(),
  ([String.t()] -> [Nx.Tensor.t()]),
  chunk_opts()
) :: {:ok, [Chunx.SentenceChunk.t()]} | {:error, term()}

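Splits text into semantically coherent chunks using embeddings.

`embedding_fun` receives a list of strings and must return a list of `Nx.Tensor` embeddings. On success the function returns `{:ok, chunks}`, where `chunks` is a list of `Chunx.SentenceChunk` structs; otherwise it returns `{:error, reason}`.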

Options

  • :chunk_size - Maximum number of tokens per chunk (default: 512)
  • :threshold - Semantic similarity threshold, either a float between 0 and 1 or :auto (default: :auto)
  • :min_sentences - Minimum number of sentences per chunk (default: 1)
  • :min_chunk_size - Minimum number of tokens per chunk (default: 2)
  • :threshold_step - Step size for threshold calculation (default: 0.01)
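
A minimal usage sketch follows, assuming `:chunx`, `:tokenizers`, and `:nx` are available as dependencies. The model name `"bert-base-uncased"` is only an example, and `toy_vector` is a hypothetical letter-count embedding invented here so the snippet does not depend on a real embedding model; it is not part of Chunx and will not produce meaningful semantic boundaries. In practice `embedding_fun` would wrap an actual sentence-embedding model.

```elixir
# Assumes :chunx, :tokenizers and :nx are installed as dependencies.
alias Chunx.Chunker.Semantic

# Load a tokenizer; the model name is only an example.
{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("bert-base-uncased")

# Hypothetical toy embedding: a 26-dimensional letter-count vector.
toy_vector = fn sentence ->
  counts =
    sentence
    |> String.downcase()
    |> String.to_charlist()
    |> Enum.filter(&(&1 in ?a..?z))
    |> Enum.frequencies()

  # Add 1.0 to every entry so that no vector is all zeros.
  for letter <- ?a..?z, do: Map.get(counts, letter, 0) + 1.0
end

# Matches the ([String.t()] -> [Nx.Tensor.t()]) shape from the spec.
embedding_fun = fn sentences ->
  Enum.map(sentences, fn sentence -> Nx.tensor(toy_vector.(sentence)) end)
end

text = """
Elixir runs on the Erlang virtual machine. It is used for concurrent systems.
Bread needs flour, water, and yeast. Let the dough rest before baking.
"""

# Pass only the options that differ from the defaults.
case Semantic.chunk(text, tokenizer, embedding_fun, chunk_size: 128, threshold: 0.5) do
  {:ok, chunks} -> IO.inspect(chunks, label: "chunks")
  {:error, reason} -> IO.inspect(reason, label: "chunking failed")
end
```

Matching on both return shapes from the spec keeps the caller explicit about the `{:error, term()}` case rather than assuming chunking always succeeds.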