LeXtract.Chunking (lextract v0.1.2)

Integrates semantic text chunking with tokenization for document processing.

This module combines TextChunker's semantic splitting capabilities with LeXtract's tokenization system to produce chunks that maintain both character-level and token-level position information.

Key Features

  • Semantic boundary detection via TextChunker
  • Token-level position tracking for each chunk
  • Configurable chunk sizes and overlap
  • Unicode-aware processing (handles emojis and multi-byte characters; see the sketch after this list)
  • Byte-level accuracy for text alignment
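
For instance, multi-byte input survives chunking intact. A minimal sketch (exact chunk boundaries depend on TextChunker's splitting strategy):

iex> doc = LeXtract.Document.create("Café ☕ and 🚀 launch")
iex> [chunk] = LeXtract.Chunking.chunk_document(doc)
iex> chunk.text
"Café ☕ and 🚀 launch"
iex> byte_size(chunk.text) > String.length(chunk.text)
true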

Options

  • :max_char_buffer - Maximum chunk size in characters (default: 1000)
  • :chunk_overlap - Overlap between chunks in characters (default: 200)
  • :tokenizer - Custom tokenizer instance (default: uses LeXtract.Tokenizer.default_tokenizer/0; see the final example below)

Examples

iex> doc = LeXtract.Document.create("The patient has diabetes. The patient is 45 years old.")
iex> chunks = LeXtract.Chunking.chunk_document(doc)
iex> length(chunks) >= 1
true

iex> doc = LeXtract.Document.create("Short text")
iex> [chunk] = LeXtract.Chunking.chunk_document(doc, max_char_buffer: 100)
iex> chunk.text
"Short text"

iex> long_text = String.duplicate("word ", 500)
iex> doc = LeXtract.Document.create(long_text)
iex> chunks = LeXtract.Chunking.chunk_document(doc, max_char_buffer: 100, chunk_overlap: 20)
iex> length(chunks) > 1
true
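
The :tokenizer option is not exercised above. A brief sketch, assuming the option accepts the tokenizer reference returned by LeXtract.Tokenizer.default_tokenizer/0:

iex> {:ok, tokenizer} = LeXtract.Tokenizer.default_tokenizer()
iex> doc = LeXtract.Document.create("Hello world")
iex> [chunk] = LeXtract.Chunking.chunk_document(doc, tokenizer: tokenizer)
iex> chunk.text
"Hello world"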

Summary

Functions

calculate_overlap(chunk_size)

Calculates a recommended overlap of 20% of the chunk size.

chunk_document(document, opts \\ [])

Chunks a document using semantic splitting and tokenization.

chunk_with_tokenizer(text, tokenizer, document \\ nil, opts \\ [])

Chunks text with a specific tokenizer instance and optional document reference.

Functions

calculate_overlap(chunk_size)

@spec calculate_overlap(pos_integer()) :: pos_integer()

Calculates a recommended overlap of 20% of the chunk size.

Examples

iex> LeXtract.Chunking.calculate_overlap(1000)
200

iex> LeXtract.Chunking.calculate_overlap(500)
100

iex> LeXtract.Chunking.calculate_overlap(10)
2
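
A common pairing is to derive :chunk_overlap from the buffer size passed to chunk_document/2 (an illustrative sketch, not a required convention):

iex> size = 500
iex> doc = LeXtract.Document.create(String.duplicate("word ", 200))
iex> chunks = LeXtract.Chunking.chunk_document(doc, max_char_buffer: size, chunk_overlap: LeXtract.Chunking.calculate_overlap(size))
iex> length(chunks) >= 1
true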

chunk_document(document, opts \\ [])

@spec chunk_document(
  LeXtract.Document.t(),
  keyword()
) :: [LeXtract.TextChunk.t()]

Chunks a document using semantic splitting and tokenization.

Takes a Document and splits its text into smaller TextChunks, each containing:

  • The chunk text
  • Byte positions (start_byte, end_byte) from TextChunker (see the final example below)
  • Token information via Tokenizer encoding
  • Character and token intervals for alignment

Options

  • :max_char_buffer - Maximum chunk size in characters (default: 1000)
  • :chunk_overlap - Overlap between chunks in characters (default: 200)
  • :tokenizer - Custom tokenizer instance (default: uses LeXtract.Tokenizer.default_tokenizer/0)

Examples

iex> doc = LeXtract.Document.create("Hello world")
iex> chunks = LeXtract.Chunking.chunk_document(doc)
iex> [chunk] = chunks
iex> chunk.text
"Hello world"
iex> is_struct(chunk.char_interval, LeXtract.CharInterval)
true
iex> is_struct(chunk.token_interval, LeXtract.TokenInterval)
true

iex> doc = LeXtract.Document.create("")
iex> LeXtract.Chunking.chunk_document(doc)
[]
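
The byte positions listed above are what make alignment byte-accurate: for multi-byte text, byte offsets and character counts diverge. A sketch, assuming the TextChunk struct exposes start_byte directly and that a single-chunk document starts at byte 0:

iex> doc = LeXtract.Document.create("Héllo")
iex> [chunk] = LeXtract.Chunking.chunk_document(doc)
iex> {chunk.start_byte, byte_size(chunk.text), String.length(chunk.text)}
{0, 6, 5}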

chunk_with_tokenizer(text, tokenizer, document \\ nil, opts \\ [])

@spec chunk_with_tokenizer(
  String.t(),
  LeXtract.Tokenizer.tokenizer_ref(),
  LeXtract.Document.t() | nil,
  keyword()
) :: [LeXtract.TextChunk.t()]

Chunks text with a specific tokenizer instance and optional document reference.

This function performs the core chunking logic:

  1. Splits text using TextChunker for semantic boundaries (sketched after this list)
  2. Tokenizes each chunk to get token offsets
  3. Creates TextChunk structs with both character and token intervals
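
Step 1 in isolation looks roughly like this (a sketch calling the text_chunker package directly; :chunk_size and :chunk_overlap are TextChunker's option names, distinct from this module's :max_char_buffer):

iex> [chunk | _] = TextChunker.split("The patient has diabetes.", chunk_size: 100, chunk_overlap: 10)
iex> match?(%TextChunker.Chunk{}, chunk)
true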

Options

  • :max_char_buffer - Maximum chunk size in characters (default: 1000)
  • :chunk_overlap - Overlap between chunks in characters (default: 200)

Examples

iex> {:ok, tokenizer} = LeXtract.Tokenizer.default_tokenizer()
iex> chunks = LeXtract.Chunking.chunk_with_tokenizer("Hello world", tokenizer)
iex> [chunk] = chunks
iex> chunk.text
"Hello world"

iex> {:ok, tokenizer} = LeXtract.Tokenizer.default_tokenizer()
iex> LeXtract.Chunking.chunk_with_tokenizer("", tokenizer)
[]
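
Options behave as in chunk_document/2; passing an explicit nil document reaches the opts argument (a sketch):

iex> {:ok, tokenizer} = LeXtract.Tokenizer.default_tokenizer()
iex> text = String.duplicate("word ", 100)
iex> chunks = LeXtract.Chunking.chunk_with_tokenizer(text, tokenizer, nil, max_char_buffer: 100)
iex> length(chunks) > 1
true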