Chunking Strategies
The Rag.Chunker behavior provides pluggable strategies for splitting text into chunks, each suited to a different use case.
Overview
alias Rag.Chunker
alias Rag.Chunker.{Character, Sentence, Paragraph, Recursive}
chunker = %Recursive{max_chars: 500}
chunks = Chunker.chunk(chunker, text)
Each chunk is a %Rag.Chunker.Chunk{} struct:
%Rag.Chunker.Chunk{
  content: String.t(),            # The chunk text
  start_byte: non_neg_integer(),
  end_byte: non_neg_integer(),
  index: non_neg_integer(),
  metadata: map()                 # Chunker-specific metadata
}
Chunkers
1. Character (Rag.Chunker.Character)
Fixed-size chunks with smart boundary detection.
chunker = %Character{max_chars: 500, overlap: 50}
Chunker.chunk(chunker, text)
Options:
- max_chars - Maximum characters per chunk (default: 500)
- overlap - Characters to overlap between chunks (default: 50)
Behavior:
- Splits at sentence boundaries (.!?) when possible
- Falls back to word boundaries
- Falls back to hard split at max_chars
- Creates overlap for context preservation
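The fallback order above can be pictured with a small, self-contained sketch. It is not the Rag.Chunker.Character implementation: the module BoundarySketch and the function split_point/2 are hypothetical names, and the rule used here (last sentence mark in the window, else last space, else a hard cut) is only an assumption about how such a chunker might pick a cut position.

```elixir
defmodule BoundarySketch do
  # Hypothetical helper: returns how many characters (graphemes) to take
  # from the front of `text` for the next chunk, given a `max_chars` limit.
  def split_point(text, max_chars) when max_chars > 0 do
    if String.length(text) <= max_chars do
      # The remaining text already fits in one chunk.
      String.length(text)
    else
      window = String.slice(text, 0, max_chars)

      # 1. last sentence mark, 2. last space, 3. hard cut at max_chars
      last_index(window, [".", "!", "?"]) ||
        last_index(window, [" "]) ||
        max_chars
    end
  end

  # 1-based position of the last grapheme in `window` found in `marks`,
  # or nil when none of the marks occurs.
  defp last_index(window, marks) do
    window
    |> String.graphemes()
    |> Enum.with_index(1)
    |> Enum.filter(fn {grapheme, _index} -> grapheme in marks end)
    |> List.last()
    |> case do
      nil -> nil
      {_grapheme, index} -> index
    end
  end
end

BoundarySketch.split_point("One. Two three four", 10)
# => 4 (cut after "One.")
BoundarySketch.split_point("alpha beta gamma", 8)
# => 6 (cut after "alpha ")
BoundarySketch.split_point("abcdefghij", 4)
# => 4 (hard cut)
```

The sketch only chooses one cut position; building the full chunk list and applying overlap are left out.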
Best for:
- Consistent embedding sizes
- Unstructured text
- Predictable chunk sizes
2. Sentence (Rag.Chunker.Sentence)
Preserves complete sentences within chunks.
chunker = %Sentence{max_chars: 500, min_chars: 100}
Chunker.chunk(chunker, text)
Options:
- max_chars - Maximum characters per chunk (default: 500)
- min_chars - Minimum characters before starting new chunk (optional)
Behavior:
- Splits on sentence boundaries
- Combines sentences up to max_chars
- If min_chars is specified, keeps adding sentences until the chunk reaches that minimum
- Falls back to character-based splitting if a single sentence exceeds max_chars
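As an illustration of "combines sentences up to max_chars", here is a minimal greedy packer in plain Elixir. SentencePackSketch is a hypothetical module, not part of Rag. It assumes a sentence ends at ".", "!" or "?" followed by whitespace. It does not model min_chars or the character-based fallback for oversized sentences.

```elixir
defmodule SentencePackSketch do
  # Simplified sentence splitter: breaks on whitespace that follows . ! or ?
  def sentences(text) do
    String.split(text, ~r/(?<=[.!?])\s+/, trim: true)
  end

  # Greedily appends each sentence to the current chunk while the result
  # stays within max_chars; otherwise starts a new chunk.
  def pack(text, max_chars) do
    text
    |> sentences()
    |> Enum.reduce([], fn
      sentence, [] ->
        [sentence]

      sentence, [current | rest] ->
        candidate = current <> " " <> sentence

        if String.length(candidate) <= max_chars do
          [candidate | rest]
        else
          [sentence, current | rest]
        end
    end)
    |> Enum.reverse()
  end
end

text = "First sentence. Second sentence. Third sentence. Fourth sentence."
SentencePackSketch.pack(text, 40)
# => ["First sentence. Second sentence.", "Third sentence. Fourth sentence."]
```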
Best for:
- Q&A systems
- Well-structured prose
- Semantic coherence
3. Paragraph (Rag.Chunker.Paragraph)
Preserves paragraph structure and topic boundaries.
chunker = %Paragraph{max_chars: 500, min_chars: 100}
Chunker.chunk(chunker, text)
Options:
- max_chars - Maximum characters per chunk (default: 500)
- min_chars - Minimum characters before starting new chunk (optional)
Behavior:
- Splits on paragraph boundaries (double newlines)
- Combines consecutive paragraphs that are shorter than min_chars
- Falls back to sentence-based splitting if a paragraph exceeds max_chars
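The first two steps can be sketched as follows. ParagraphSketch is a hypothetical module, and the merge rule (append the next paragraph while the current chunk is shorter than min_chars and the result still fits max_chars) is an assumption. The library may merge differently. The sentence-based fallback is not shown.

```elixir
defmodule ParagraphSketch do
  # Splits on blank lines (two newlines, optionally with whitespace between).
  def paragraphs(text) do
    text
    |> String.split(~r/\n\s*\n/, trim: true)
    |> Enum.map(&String.trim/1)
  end

  # Appends a paragraph to the current chunk while that chunk is still
  # shorter than min_chars and the merged text fits within max_chars.
  def merge_short(paragraphs, min_chars, max_chars) do
    paragraphs
    |> Enum.reduce([], fn
      paragraph, [] ->
        [paragraph]

      paragraph, [current | rest] ->
        merged = current <> "\n\n" <> paragraph

        if String.length(current) < min_chars and String.length(merged) <= max_chars do
          [merged | rest]
        else
          [paragraph, current | rest]
        end
    end)
    |> Enum.reverse()
  end
end

text = "Hi.\n\nShort one.\n\nThis paragraph is long enough to stand alone."

text
|> ParagraphSketch.paragraphs()
|> ParagraphSketch.merge_short(10, 100)
# => ["Hi.\n\nShort one.", "This paragraph is long enough to stand alone."]
```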
Best for:
- Articles and blog posts
- Documentation
- Topic-organized content
4. Recursive (Rag.Chunker.Recursive)
Hierarchical splitting from paragraph to sentence to character.
chunker = %Recursive{max_chars: 500, min_chars: 100}
Chunker.chunk(chunker, text)
Options:
- max_chars - Maximum characters per chunk (default: 500)
- min_chars - Minimum characters per chunk (optional)
Metadata:
%{chunker: :recursive, hierarchy: :paragraph | :sentence | :character}
Best for:
- Mixed content structures
- Varying document formats
- Smart hierarchy preservation
5. Semantic (Rag.Chunker.Semantic)
Groups sentences by semantic similarity using embeddings.
alias Rag.Router
alias Rag.Chunker.Semantic
{:ok, router} = Router.new(providers: [:gemini])
embedding_fn = fn text ->
  {:ok, [embedding], _} = Router.execute(router, :embeddings, [text], [])
  embedding
end
chunker = %Semantic{embedding_fn: embedding_fn, threshold: 0.8, max_chars: 500}
Chunker.chunk(chunker, text)
Options:
- embedding_fn - Required function to generate embeddings
- threshold - Similarity threshold for grouping (default: 0.8)
- max_chars - Maximum characters per chunk (default: 500)
Behavior:
- Splits text into sentences
- Generates embedding for each sentence
- Groups sentences by cosine similarity
- Continues adding while similarity >= threshold and under max_chars
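A rough sketch of the grouping loop described above, using only the standard library. SemanticGroupSketch is a hypothetical module. Comparing each sentence with the one immediately before it is an assumption, since the library might compare against a group average instead. The toy embedding function stands in for a real embedding provider.

```elixir
defmodule SemanticGroupSketch do
  # Cosine similarity between two equal-length numeric vectors.
  def cosine(a, b) do
    dot = a |> Enum.zip(b) |> Enum.reduce(0.0, fn {x, y}, acc -> acc + x * y end)
    norm = fn v -> :math.sqrt(Enum.reduce(v, 0.0, fn x, acc -> acc + x * x end)) end
    denom = norm.(a) * norm.(b)
    if denom == 0.0, do: 0.0, else: dot / denom
  end

  # Joins a sentence onto the current group while it is similar enough to the
  # previous sentence (assumption) and the joined text fits within max_chars.
  def group(sentences, embedding_fn, threshold, max_chars) do
    sentences
    |> Enum.map(fn sentence -> {sentence, embedding_fn.(sentence)} end)
    |> Enum.reduce([], fn
      {sentence, emb}, [] ->
        [{sentence, emb}]

      {sentence, emb}, [{text, prev_emb} | rest] ->
        joined = text <> " " <> sentence

        if cosine(prev_emb, emb) >= threshold and String.length(joined) <= max_chars do
          [{joined, emb} | rest]
        else
          [{sentence, emb}, {text, prev_emb} | rest]
        end
    end)
    |> Enum.reverse()
    |> Enum.map(fn {text, _emb} -> text end)
  end
end

# Toy embedding for illustration only: "cat" sentences point one way,
# everything else points the other way.
toy_embedding = fn sentence ->
  if String.contains?(sentence, "cat"), do: [1.0, 0.0], else: [0.0, 1.0]
end

SemanticGroupSketch.group(["A cat sat.", "The cat ran.", "Stocks fell."], toy_embedding, 0.8, 500)
# => ["A cat sat. The cat ran.", "Stocks fell."]
```

One embedding call per sentence is what makes this strategy cost API calls, which the other strategies avoid.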
Best for:
- Topic-focused chunks
- High-quality RAG systems
- When API cost is acceptable
6. Format-Aware (Rag.Chunker.FormatAware)
Format-aware chunking using TextChunker for code and markup formats.
alias Rag.Chunker.FormatAware
chunker = %FormatAware{format: :markdown, chunk_size: 500}
Chunker.chunk(chunker, markdown_text)
Options:
- format - Document format (default: :plaintext)
- chunk_size - Maximum size in code points (default: 2000)
- chunk_overlap - Overlap between chunks (default: 200)
- size_fn - Custom size function, (String.t() -> integer()) (optional)
Note: This chunker requires TextChunker:
{:text_chunker, "~> 0.5.2"}
Strategy Comparison
| Strategy | Chunk Size | Structure | API Calls | Best For |
|---|---|---|---|---|
| Character | Consistent | May split thoughts | None | Predictable sizing |
| Sentence | Variable | Complete thoughts | None | Q&A systems |
| Paragraph | Variable | Topic boundaries | None | Structured docs |
| Recursive | Variable | Smart hierarchy | None | Mixed content |
| Semantic | Variable | Semantic groups | Yes | Topic coherence |
| FormatAware | Variable | Format-aware | None | Code and markup |
Overlap Demonstration
text = "First sentence. Second sentence. Third sentence. Fourth sentence."
# No overlap
Chunker.chunk(%Character{max_chars: 40, overlap: 0}, text)
# With overlap
Chunker.chunk(%Character{max_chars: 40, overlap: 20}, text)
Overlap helps:
- Preserve context between chunks
- Improve retrieval for information at chunk boundaries
- Reduce information loss during splitting
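To make the effect of overlap concrete, the following sketch produces plain fixed-size windows where each window starts max_chars - overlap characters after the previous one. OverlapSketch is a hypothetical module. Unlike the Character chunker described earlier, it ignores sentence and word boundaries, so its output will differ from the library's.

```elixir
defmodule OverlapSketch do
  # Fixed windows of up to max_chars graphemes. Consecutive windows share
  # `overlap` graphemes because each one starts `max_chars - overlap`
  # graphemes after the previous one.
  def windows(text, max_chars, overlap) when overlap >= 0 and overlap < max_chars do
    do_windows(String.graphemes(text), max_chars, max_chars - overlap, [])
  end

  defp do_windows([], _max_chars, _step, acc), do: Enum.reverse(acc)

  defp do_windows(graphemes, max_chars, step, acc) do
    window = graphemes |> Enum.take(max_chars) |> Enum.join()
    acc = [window | acc]

    if length(graphemes) <= max_chars do
      # This window reached the end of the text.
      Enum.reverse(acc)
    else
      do_windows(Enum.drop(graphemes, step), max_chars, step, acc)
    end
  end
end

OverlapSketch.windows("abcdefghij", 4, 0)
# => ["abcd", "efgh", "ij"]
OverlapSketch.windows("abcdefghij", 4, 2)
# => ["abcd", "cdef", "efgh", "ghij"]
```

With overlap, text near a boundary (here "cd", "ef", "gh") appears in two chunks, at the cost of producing more chunks overall.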
Position Validation
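Chunk positions are stored as byte offsets (start_byte, end_byte), not character counts, and the two differ for non-ASCII text. The sketch below shows what a position check could look like. It assumes start_byte is inclusive and end_byte is exclusive, which the text here does not state. PositionSketch is a hypothetical module, not the implementation behind Rag.Chunker.Chunk.

```elixir
defmodule PositionSketch do
  # ASSUMPTION: start_byte is inclusive and end_byte is exclusive.
  # Returns true when the bytes at [start_byte, end_byte) equal the content.
  def valid?(%{content: content, start_byte: s, end_byte: e}, text)
      when is_binary(text) and is_integer(s) and is_integer(e) and
             s >= 0 and e >= s and e <= byte_size(text) do
    binary_part(text, s, e - s) == content
  end

  # Anything with out-of-range or malformed offsets is invalid.
  def valid?(_chunk, _text), do: false
end

text = "héllo world"

# "é" takes two bytes in UTF-8, so "world" starts at byte 7, not 6.
PositionSketch.valid?(%{content: "world", start_byte: 7, end_byte: 12}, text)
# => true
PositionSketch.valid?(%{content: "world", start_byte: 6, end_byte: 11}, text)
# => false (these are character offsets, not byte offsets)
```

The library exposes its own check as Chunk.valid?/2: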
alias Rag.Chunker.Chunk
chunker = %Character{max_chars: 100}
chunks = Chunker.chunk(chunker, text)
Enum.all?(chunks, fn chunk ->
  Chunk.valid?(chunk, text)
end)
Complete Example
alias Rag.Chunker
alias Rag.Chunker.{Character, Sentence, Paragraph, Recursive, Semantic}
# Load document
text = File.read!("document.md")
# Try different strategies
char_chunks = Chunker.chunk(%Character{max_chars: 500, overlap: 50}, text)
sent_chunks = Chunker.chunk(%Sentence{max_chars: 500}, text)
para_chunks = Chunker.chunk(%Paragraph{max_chars: 500}, text)
rec_chunks = Chunker.chunk(%Recursive{max_chars: 500}, text)
# Semantic chunking (requires a router and an embedding function)
{:ok, router} = Rag.Router.new(providers: [:gemini])
embedding_fn = fn text ->
  {:ok, [embedding], _} = Rag.Router.execute(router, :embeddings, [text], [])
  embedding
end
sem_chunks = Chunker.chunk(%Semantic{embedding_fn: embedding_fn, threshold: 0.75}, text)
# Compare results
for {name, chunks} <- [
      {"Character", char_chunks},
      {"Sentence", sent_chunks},
      {"Paragraph", para_chunks},
      {"Recursive", rec_chunks},
      {"Semantic", sem_chunks}
    ] do
  count = length(chunks)

  # Average chunk size in characters (0 when there are no chunks)
  avg_size =
    if count > 0 do
      total = Enum.reduce(chunks, 0, fn c, acc -> acc + String.length(c.content) end)
      round(total / count)
    else
      0
    end

  IO.puts("#{name}: #{count} chunks, avg #{avg_size} chars")
end