Chunking Strategies

The Rag.Chunker behavior provides pluggable strategies for splitting text into chunks optimized for different use cases.

Overview

alias Rag.Chunker
alias Rag.Chunker.{Character, Sentence, Paragraph, Recursive}

chunker = %Recursive{max_chars: 500}
chunks = Chunker.chunk(chunker, text)

Each chunk is a %Rag.Chunker.Chunk{} struct:

%Rag.Chunker.Chunk{
  content: String.t(),      # The chunk text
  start_byte: non_neg_integer(),
  end_byte: non_neg_integer(),
  index: non_neg_integer(),
  metadata: map()           # Chunker-specific metadata
}

Chunkers

1. Character (Rag.Chunker.Character)

Fixed-size chunks with smart boundary detection.

chunker = %Character{max_chars: 500, overlap: 50}
Chunker.chunk(chunker, text)

Options:

  • max_chars - Maximum characters per chunk (default: 500)
  • overlap - Characters to overlap between chunks (default: 50)

Behavior:

  1. Splits at sentence boundaries (.!?) when possible
  2. Falls back to word boundaries
  3. Falls back to hard split at max_chars
  4. Creates overlap for context preservation
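
The boundary fallback in steps 1 to 3 can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: BoundarySketch, split_point/2 and the regular expressions are hypothetical names and choices made for this example, and overlap (step 4) is left out.

```elixir
defmodule BoundarySketch do
  # Hypothetical helper, not part of Rag.Chunker.Character.
  # Returns how many characters the next chunk should take from `text`.
  def split_point(text, max_chars) when max_chars > 0 do
    if String.length(text) <= max_chars do
      String.length(text)
    else
      window = String.slice(text, 0, max_chars)

      # Try each boundary type in order; `||` moves on when one returns nil.
      last_match_end(window, ~r/[.!?](?=\s|$)/u) ||
        last_match_end(window, ~r/\s/u) ||
        max_chars
    end
  end

  defp last_match_end(window, regex) do
    case Regex.scan(regex, window, return: :index) do
      [] ->
        nil

      matches ->
        [{start, len} | _] = List.last(matches)
        # Regex indexes are byte offsets, so convert to a character count.
        window |> binary_part(0, start + len) |> String.length()
    end
  end
end

text = "First sentence. Second sentence. Third sentence. Fourth sentence."
{chunk, rest} = String.split_at(text, BoundarySketch.split_point(text, 40))
IO.inspect({chunk, rest})
```

Calling split_point/2 repeatedly on the remainder would yield the full chunk list; a real chunker would also track byte offsets for each chunk.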

Best for:

  • Consistent embedding sizes
  • Unstructured text
  • Predictable chunk sizes

2. Sentence (Rag.Chunker.Sentence)

Preserves complete sentences within chunks.

chunker = %Sentence{max_chars: 500, min_chars: 100}
Chunker.chunk(chunker, text)

Options:

  • max_chars - Maximum characters per chunk (default: 500)
  • min_chars - Minimum characters before starting new chunk (optional)

Behavior:

  1. Splits on sentence boundaries
  2. Combines sentences up to max_chars
  3. If min_chars is specified, keeps adding sentences until the chunk reaches that minimum
  4. Falls back to character-based splitting if a single sentence exceeds max_chars
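
As a rough sketch of steps 1 and 2, sentences can be packed greedily into chunks. SentencePackSketch and its sentence-splitting regex are assumptions made for illustration; the min_chars handling and the oversized-sentence fallback are omitted, and the library's own logic may differ.

```elixir
defmodule SentencePackSketch do
  # Hypothetical sketch, not the Rag.Chunker.Sentence source.
  def pack(text, max_chars) do
    text
    # Split after a sentence terminator that is followed by whitespace.
    |> String.split(~r/(?<=[.!?])\s+/u, trim: true)
    |> Enum.reduce([], fn sentence, acc ->
      case acc do
        [current | rest] ->
          candidate = current <> " " <> sentence

          if String.length(candidate) <= max_chars do
            # The sentence still fits: extend the current chunk.
            [candidate | rest]
          else
            # It does not fit: start a new chunk with this sentence.
            [sentence, current | rest]
          end

        [] ->
          [sentence]
      end
    end)
    |> Enum.reverse()
  end
end

text = "First sentence. Second sentence. Third sentence. Fourth sentence."
IO.inspect(SentencePackSketch.pack(text, 40))
```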

Best for:

  • Q&A systems
  • Well-structured prose
  • Semantic coherence

3. Paragraph (Rag.Chunker.Paragraph)

Preserves paragraph structure and topic boundaries.

chunker = %Paragraph{max_chars: 500, min_chars: 100}
Chunker.chunk(chunker, text)

Options:

  • max_chars - Maximum characters per chunk (default: 500)
  • min_chars - Minimum characters before starting new chunk (optional)

Behavior:

  1. Splits on paragraph boundaries (double newlines)
  2. Combines adjacent paragraphs that are shorter than min_chars
  3. Falls back to sentence-based splitting if a paragraph exceeds max_chars
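
Steps 1 and 2 might look something like the sketch below. ParagraphMergeSketch and merge/3 are hypothetical, and the merge rule (extend the current chunk while it is shorter than min_chars and the result stays within max_chars) is one plausible reading of the behavior, not a confirmed description of the library. The sentence-based fallback is not shown.

```elixir
defmodule ParagraphMergeSketch do
  # Hypothetical sketch, not the Rag.Chunker.Paragraph source.
  def merge(text, min_chars, max_chars) do
    text
    # A paragraph boundary is a blank line (two newlines, possibly with spaces between).
    |> String.split(~r/\n\s*\n/u, trim: true)
    |> Enum.map(&String.trim/1)
    |> Enum.reject(&(&1 == ""))
    |> Enum.reduce([], fn para, acc ->
      case acc do
        [current | rest] ->
          merged = current <> "\n\n" <> para

          if String.length(current) < min_chars and String.length(merged) <= max_chars do
            # The current chunk is still too short: absorb this paragraph.
            [merged | rest]
          else
            [para, current | rest]
          end

        [] ->
          [para]
      end
    end)
    |> Enum.reverse()
  end
end

text = "Short.\n\nAlso short.\n\nThis paragraph is long enough to stand alone."
IO.inspect(ParagraphMergeSketch.merge(text, 15, 100))
```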

Best for:

  • Articles and blog posts
  • Documentation
  • Topic-organized content

4. Recursive (Rag.Chunker.Recursive)

Hierarchical splitting from paragraph to sentence to character.

chunker = %Recursive{max_chars: 500, min_chars: 100}
Chunker.chunk(chunker, text)

Options:

  • max_chars - Maximum characters per chunk (default: 500)
  • min_chars - Minimum characters per chunk (optional)

Metadata:

%{chunker: :recursive, hierarchy: :paragraph | :sentence | :character}
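
The hierarchy value records which level produced a chunk. A simplified sketch of that descent is shown below, assuming each oversized piece is handed to the next finer level. RecursiveSketch is a hypothetical module that returns {text, level} tuples instead of Chunk structs, and it does not merge small pieces back together for min_chars.

```elixir
defmodule RecursiveSketch do
  # Hypothetical sketch, not the Rag.Chunker.Recursive source.
  def split(text, max_chars) when max_chars > 0 do
    do_split(text, max_chars, :paragraph)
  end

  # Finest level: hard split into fixed-size runs of characters.
  defp do_split(text, max_chars, :character) do
    text
    |> String.graphemes()
    |> Enum.chunk_every(max_chars)
    |> Enum.map(&{Enum.join(&1), :character})
  end

  # Paragraph and sentence levels: keep pieces that fit, descend on the rest.
  defp do_split(text, max_chars, level) do
    text
    |> String.split(separator(level), trim: true)
    |> Enum.flat_map(fn piece ->
      if String.length(piece) <= max_chars do
        [{piece, level}]
      else
        do_split(piece, max_chars, next_level(level))
      end
    end)
  end

  defp separator(:paragraph), do: ~r/\n\s*\n/u
  defp separator(:sentence), do: ~r/(?<=[.!?])\s+/u

  defp next_level(:paragraph), do: :sentence
  defp next_level(:sentence), do: :character
end

text =
  "Tiny paragraph.\n\n" <>
    "First long sentence here. Second long sentence here.\n\n" <>
    "abcdefghijklmnopqrstuvwxyzabcdefghij"

IO.inspect(RecursiveSketch.split(text, 30))
```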

Best for:

  • Mixed content structures
  • Varying document formats
  • Smart hierarchy preservation

5. Semantic (Rag.Chunker.Semantic)

Groups sentences by semantic similarity using embeddings.

alias Rag.Router
alias Rag.Chunker.Semantic

{:ok, router} = Router.new(providers: [:gemini])

embedding_fn = fn text ->
  {:ok, [embedding], _} = Router.execute(router, :embeddings, [text], [])
  embedding
end

chunker = %Semantic{embedding_fn: embedding_fn, threshold: 0.8, max_chars: 500}
Chunker.chunk(chunker, text)

Options:

  • embedding_fn - Required function to generate embeddings
  • threshold - Similarity threshold for grouping (default: 0.8)
  • max_chars - Maximum characters per chunk (default: 500)

Behavior:

  1. Splits text into sentences
  2. Generates embedding for each sentence
  3. Groups sentences by cosine similarity
  4. Keeps adding sentences to a chunk while similarity >= threshold and the chunk stays under max_chars
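
Steps 3 and 4 can be illustrated with a small grouping sketch. It assumes each sentence is compared with the previous sentence's embedding; the library may compare against something else, such as a running average. SemanticGroupSketch and the toy embedding function are hypothetical and stand in for a real embedding provider.

```elixir
defmodule SemanticGroupSketch do
  # Cosine similarity of two equal-length, non-zero vectors.
  def cosine(a, b) do
    dot = a |> Enum.zip(b) |> Enum.reduce(0.0, fn {x, y}, acc -> acc + x * y end)
    norm = fn v -> :math.sqrt(Enum.reduce(v, 0.0, fn x, acc -> acc + x * x end)) end
    dot / (norm.(a) * norm.(b))
  end

  # Hypothetical grouping: extend the current chunk while the new sentence is
  # similar enough to the previous one and the chunk stays within max_chars.
  def group(sentences, embedding_fn, threshold, max_chars) do
    sentences
    |> Enum.map(&{&1, embedding_fn.(&1)})
    |> Enum.reduce([], fn {sentence, emb}, acc ->
      case acc do
        [{text, last_emb} | rest] ->
          joined = text <> " " <> sentence

          if cosine(last_emb, emb) >= threshold and String.length(joined) <= max_chars do
            [{joined, emb} | rest]
          else
            [{sentence, emb}, {text, last_emb} | rest]
          end

        [] ->
          [{sentence, emb}]
      end
    end)
    |> Enum.reverse()
    |> Enum.map(&elem(&1, 0))
  end
end

# Toy embedding for demonstration only: one axis for "cat" sentences, one for the rest.
toy_embedding = fn s ->
  if String.contains?(s, "cat"), do: [1.0, 0.0], else: [0.0, 1.0]
end

sentences = ["The cat sleeps.", "The cat purrs.", "Stocks fell today."]
IO.inspect(SemanticGroupSketch.group(sentences, toy_embedding, 0.8, 500))
```

Because every sentence needs its own embedding, the number of provider calls grows with the number of sentences, which is where the API cost comes from.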

Best for:

  • Topic-focused chunks
  • High-quality RAG systems
  • When API cost is acceptable

6. Format-Aware (Rag.Chunker.FormatAware)

Format-aware chunking using TextChunker for code and markup formats.

alias Rag.Chunker.FormatAware

chunker = %FormatAware{format: :markdown, chunk_size: 500}
Chunker.chunk(chunker, markdown_text)

Options:

  • format - Document format (default: :plaintext)
  • chunk_size - Maximum size in code points (default: 2000)
  • chunk_overlap - Overlap between chunks (default: 200)
  • size_fn - Custom size function (String.t() -> integer()) (optional)
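
As an illustration of the size_fn option, a custom size function could approximate token counts instead of code points. The four-characters-per-token ratio below is a rough assumption for English text, not something the library prescribes, and approx_tokens is a hypothetical name.

```elixir
# Hypothetical size function with the String.t() -> integer() shape:
# estimates tokens as roughly one per four characters, rounded up.
approx_tokens = fn text when is_binary(text) ->
  div(String.length(text) + 3, 4)
end

IO.inspect(approx_tokens.("hello world"))

# It would then be passed as the size_fn option, for example
#   %FormatAware{format: :markdown, chunk_size: 500, size_fn: approx_tokens}
# in which case chunk_size is presumably measured in the units the function returns.
```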

Note: This chunker requires the TextChunker package. Add it to the deps in mix.exs:

{:text_chunker, "~> 0.5.2"}

Strategy Comparison

Strategy     Chunk Size  Structure           API Calls  Best For
Character    Consistent  May split thoughts  None       Predictable sizing
Sentence     Variable    Complete thoughts   None       Q&A systems
Paragraph    Variable    Topic boundaries    None       Structured docs
Recursive    Variable    Smart hierarchy     None       Mixed content
Semantic     Variable    Semantic groups     Yes        Topic coherence
FormatAware  Variable    Format-aware        None       Code and markup

Overlap Demonstration

text = "First sentence. Second sentence. Third sentence. Fourth sentence."

# No overlap
Chunker.chunk(%Character{max_chars: 40, overlap: 0}, text)

# With overlap
Chunker.chunk(%Character{max_chars: 40, overlap: 20}, text)

Overlap helps:

  • Preserve context between chunks
  • Improve retrieval for information at chunk boundaries
  • Reduce information loss during splitting
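
To make the effect of overlap concrete, the sketch below slides a fixed window over the text with a step of max_chars - overlap. It deliberately ignores sentence and word boundaries, so its output will not match the Character chunker exactly; OverlapSketch is a hypothetical module.

```elixir
defmodule OverlapSketch do
  # Hypothetical fixed-window splitter used only to illustrate overlap.
  def windows("", _max_chars, _overlap), do: []

  def windows(text, max_chars, overlap)
      when max_chars > 0 and overlap >= 0 and overlap < max_chars do
    do_windows(String.graphemes(text), max_chars, max_chars - overlap, [])
  end

  defp do_windows(graphemes, max_chars, step, acc) do
    chunk = graphemes |> Enum.take(max_chars) |> Enum.join()

    if length(graphemes) <= max_chars do
      # The remaining text fits in one window: emit it and stop.
      Enum.reverse([chunk | acc])
    else
      # Advance by `step`, so the last `overlap` characters are repeated.
      do_windows(Enum.drop(graphemes, step), max_chars, step, [chunk | acc])
    end
  end
end

IO.inspect(OverlapSketch.windows("abcdefghij", 4, 0))
IO.inspect(OverlapSketch.windows("abcdefghij", 4, 2))
```

With an overlap of 2, each window repeats the last two characters of the previous one, so text that straddles a boundary appears whole in at least one chunk, at the cost of more chunks overall.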

Position Validation

alias Rag.Chunker.Chunk

chunker = %Character{max_chars: 100}
chunks = Chunker.chunk(chunker, text)

Enum.all?(chunks, fn chunk ->
  Chunk.valid?(chunk, text)
end)
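
The source does not spell out what Chunk.valid?/2 checks. One plausible reading is that start_byte and end_byte must locate the chunk's content in the original text. The sketch below shows that check on plain maps, assuming start_byte is inclusive and end_byte is exclusive; PositionSketch is hypothetical and not the library's function.

```elixir
defmodule PositionSketch do
  # Hypothetical check: the bytes between start_byte (inclusive) and
  # end_byte (exclusive) must equal the chunk content.
  def valid?(%{content: content, start_byte: s, end_byte: e}, text)
      when is_integer(s) and is_integer(e) and s >= 0 and e >= s do
    e <= byte_size(text) and binary_part(text, s, e - s) == content
  end

  def valid?(_chunk, _text), do: false
end

# "é" occupies two bytes in UTF-8, so byte offsets differ from character offsets.
text = "héllo world"
good = %{content: "world", start_byte: 7, end_byte: 12}
bad = %{content: "world", start_byte: 6, end_byte: 11}

IO.inspect({PositionSketch.valid?(good, text), PositionSketch.valid?(bad, text)})
```

The offsets are bytes rather than characters, which matters for any text containing multi-byte UTF-8 characters.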

Complete Example

alias Rag.Chunker
alias Rag.Chunker.{Character, Sentence, Paragraph, Recursive, Semantic}

# Load document
text = File.read!("document.md")

# Try different strategies
char_chunks = Chunker.chunk(%Character{max_chars: 500, overlap: 50}, text)
sent_chunks = Chunker.chunk(%Sentence{max_chars: 500}, text)
para_chunks = Chunker.chunk(%Paragraph{max_chars: 500}, text)
rec_chunks = Chunker.chunk(%Recursive{max_chars: 500}, text)

# Semantic chunking (requires an embedding function backed by a router)
{:ok, router} = Rag.Router.new(providers: [:gemini])

embedding_fn = fn text ->
  {:ok, [embedding], _} = Rag.Router.execute(router, :embeddings, [text], [])
  embedding
end

sem_chunks = Chunker.chunk(%Semantic{embedding_fn: embedding_fn, threshold: 0.75}, text)

# Compare results
for {name, chunks} <- [
  {"Character", char_chunks},
  {"Sentence", sent_chunks},
  {"Paragraph", para_chunks},
  {"Recursive", rec_chunks},
  {"Semantic", sem_chunks}
] do
  avg_size = if length(chunks) > 0 do
    total = Enum.reduce(chunks, 0, fn c, acc -> acc + String.length(c.content) end)
    round(total / length(chunks))
  else
    0
  end
  IO.puts("#{name}: #{length(chunks)} chunks, avg #{avg_size} chars")
end