Stephen.Chunker (Stephen v1.0.0)
Passage chunking for long documents.
ColBERT has a maximum document length (typically 180 tokens). For longer documents, we split them into overlapping chunks and track the mapping back to original documents.
Stephen uses sentence-aware recursive chunking via text_chunker, which splits at semantic boundaries (sentences, paragraphs). Research shows ColBERT performs best with sentence-aware splitting.
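The sliding-window idea behind overlapping chunks can be sketched in plain Elixir. This is a naive character-based illustration only; as noted above, the real chunker splits at sentence and paragraph boundaries, and `NaiveChunker` is a made-up name, not part of Stephen:

```elixir
# Illustrative only: fixed-size windows with a fixed overlap.
# Stephen's actual chunker is sentence-aware via text_chunker.
defmodule NaiveChunker do
  def chunk(text, size, overlap) when size > overlap do
    # Each window starts (size - overlap) characters after the previous one,
    # so consecutive chunks share `overlap` characters.
    step = size - overlap

    text
    |> String.graphemes()
    |> Enum.chunk_every(size, step, :discard)
    |> Enum.map(&Enum.join/1)
  end
end

NaiveChunker.chunk("abcdefghij", 4, 2)
# => ["abcd", "cdef", "efgh", "ghij"]
```

The overlap ensures that a passage falling on a chunk boundary still appears intact in at least one chunk.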
Usage
# Split documents into chunks
{chunks, mapping} = Stephen.Chunker.chunk_documents(documents)
# With custom size
{chunks, mapping} = Stephen.Chunker.chunk_documents(documents,
  chunk_size: 500,
  chunk_overlap: 100
)

# For markdown documents
{chunks, mapping} = Stephen.Chunker.chunk_documents(documents,
  format: :markdown
)
# After retrieval, merge results back to document level
merged_results = Stephen.Chunker.merge_results(chunk_results, mapping)
Summary
Functions
Splits documents into overlapping chunks.
Chunks a single text into overlapping segments.
Calculates how many chunks a text will produce.
Gets all chunk IDs for a document.
Returns the original document ID for a chunk.
Merges chunk-level results back to document level.
Types
@type chunk_id() :: String.t()
@type chunk_mapping() :: %{required(chunk_id()) => %{doc_id: doc_id(), chunk_index: non_neg_integer()}}
@type doc_id() :: term()
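Concretely, a chunk_mapping() for a document split into two chunks might look like this (the chunk ID strings shown are illustrative; the actual ID scheme is internal to Stephen):

```elixir
# A hypothetical mapping for one document split into two chunks.
mapping = %{
  "doc1_0" => %{doc_id: "doc1", chunk_index: 0},
  "doc1_1" => %{doc_id: "doc1", chunk_index: 1}
}

# Each chunk ID points back to its source document and position.
%{doc_id: "doc1", chunk_index: 1} = mapping["doc1_1"]
```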
Functions
@spec chunk_documents([{doc_id(), String.t()}], keyword()) :: {[{chunk_id(), String.t()}], chunk_mapping()}
Splits documents into overlapping chunks.
Arguments
documents - List of {doc_id, text} tuples
opts - Chunking options
Options
:chunk_size - Target chunk size in characters (default: 500)
:chunk_overlap - Overlap between chunks in characters (default: 100)
:format - Text format for separator selection (:plaintext or :markdown, default: :plaintext)
Returns
Tuple of {chunks, mapping} where:
chunks is a list of {chunk_id, text} tuples
mapping is a map from chunk_id to original doc info
Chunks a single text into overlapping segments.
Arguments
text - Text to chunk
opts - Chunking options (same as chunk_documents/2)
Returns
List of text chunks (strings)
@spec estimate_chunks(String.t(), keyword()) :: non_neg_integer()
Calculates how many chunks a text will produce.
Useful for estimating index size before indexing.
Arguments
text - Text to analyze
opts - Same options as chunk_text/2
Returns
Number of chunks
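For a fixed-window chunker the estimate follows from the window arithmetic. The formula below is a back-of-the-envelope sketch under that assumption; Stephen's sentence-aware splitting can produce somewhat different counts:

```elixir
# Rough estimate for fixed windows: after the first chunk, each
# additional chunk advances by (chunk_size - chunk_overlap) characters.
estimate = fn text_len, size, overlap ->
  step = size - overlap
  max(1, ceil((text_len - overlap) / step))
end

estimate.(1000, 500, 100)
# => 3 (windows starting at characters 0, 400, and 800)
```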
@spec get_chunk_ids(doc_id(), chunk_mapping()) :: [chunk_id()]
Gets all chunk IDs for a document.
Arguments
doc_id - The original document ID
mapping - Chunk mapping from chunk_documents/2
Returns
List of chunk IDs belonging to the document
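Given the chunk_mapping() shape, this lookup amounts to filtering the map by document ID. A plausible sketch (not Stephen's actual source; the chunk IDs are illustrative):

```elixir
mapping = %{
  "doc1_0" => %{doc_id: "doc1", chunk_index: 0},
  "doc1_1" => %{doc_id: "doc1", chunk_index: 1},
  "doc2_0" => %{doc_id: "doc2", chunk_index: 0}
}

# Collect chunk IDs whose entry points at the requested document,
# ordered by chunk_index.
chunk_ids =
  mapping
  |> Enum.filter(fn {_id, info} -> info.doc_id == "doc1" end)
  |> Enum.sort_by(fn {_id, info} -> info.chunk_index end)
  |> Enum.map(fn {id, _info} -> id end)

# => ["doc1_0", "doc1_1"]
```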
@spec get_doc_id(chunk_id(), chunk_mapping()) :: doc_id() | nil
Returns the original document ID for a chunk.
Arguments
chunk_id - The chunk identifier
mapping - Chunk mapping from chunk_documents/2
Returns
The original document ID or nil if not found
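The lookup semantics, including the nil case for unknown chunks, mirror a nested map access. A sketch with an illustrative mapping:

```elixir
mapping = %{"doc1_0" => %{doc_id: "doc1", chunk_index: 0}}

get_in(mapping, ["doc1_0", :doc_id])
# => "doc1"

get_in(mapping, ["missing", :doc_id])
# => nil
```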
@spec merge_results([%{doc_id: chunk_id(), score: float()}], chunk_mapping(), keyword()) :: [%{doc_id: doc_id(), score: float()}]
Merges chunk-level results back to document level.
By default, takes the maximum score among all chunks of the same document.
Arguments
results - List of %{doc_id: chunk_id, score: float} from search
mapping - Chunk mapping from chunk_documents/2
Options
:aggregation - How to combine chunk scores (:max, :mean, or :sum; default: :max)
Returns
List of %{doc_id: original_doc_id, score: float} sorted by score descending.
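The default :max aggregation can be sketched as follows. This is an illustrative re-implementation of the described behavior, not Stephen's source, and the chunk IDs and scores are made up:

```elixir
mapping = %{
  "doc1_0" => %{doc_id: "doc1", chunk_index: 0},
  "doc1_1" => %{doc_id: "doc1", chunk_index: 1},
  "doc2_0" => %{doc_id: "doc2", chunk_index: 0}
}

results = [
  %{doc_id: "doc1_0", score: 0.61},
  %{doc_id: "doc2_0", score: 0.74},
  %{doc_id: "doc1_1", score: 0.83}
]

# Map each chunk hit back to its document, keep the best score
# per document, then sort by score descending.
merged =
  results
  |> Enum.group_by(fn %{doc_id: chunk_id} -> mapping[chunk_id].doc_id end)
  |> Enum.map(fn {doc_id, hits} ->
    %{doc_id: doc_id, score: hits |> Enum.map(& &1.score) |> Enum.max()}
  end)
  |> Enum.sort_by(& &1.score, :desc)

# => [%{doc_id: "doc1", score: 0.83}, %{doc_id: "doc2", score: 0.74}]
```

Taking the max rewards a document whose single best passage matches the query, which suits ColBERT's passage-level scoring; :mean or :sum instead favor documents with many moderately relevant chunks.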