LeXtract.Chunking (lextract v0.1.2)
Integrates semantic text chunking with tokenization for document processing.
This module combines TextChunker's semantic splitting capabilities with LeXtract's tokenization system to produce chunks that maintain both character-level and token-level position information.
Key Features
- Semantic boundary detection via TextChunker
- Token-level position tracking for each chunk
- Configurable chunk sizes and overlap
- Unicode-aware processing (handles emojis and multi-byte characters)
- Byte-level accuracy for text alignment
Options
- :max_char_buffer - Maximum chunk size in characters (default: 1000)
- :chunk_overlap - Overlap between chunks in characters (default: 200)
- :tokenizer - Custom tokenizer instance (default: uses LeXtract.Tokenizer.default_tokenizer/0)
Examples
iex> doc = LeXtract.Document.create("The patient has diabetes. The patient is 45 years old.")
iex> chunks = LeXtract.Chunking.chunk_document(doc)
iex> length(chunks) >= 1
true
iex> doc = LeXtract.Document.create("Short text")
iex> [chunk] = LeXtract.Chunking.chunk_document(doc, max_char_buffer: 100)
iex> chunk.text
"Short text"
iex> long_text = String.duplicate("word ", 500)
iex> doc = LeXtract.Document.create(long_text)
iex> chunks = LeXtract.Chunking.chunk_document(doc, max_char_buffer: 100, chunk_overlap: 20)
iex> length(chunks) > 1
true
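Because processing is Unicode-aware, multi-byte characters such as emoji remain valid UTF-8 within each chunk. The snippet below is an illustrative sketch of that behavior; the exact number and boundaries of chunks depend on TextChunker's splitting strategy.
iex> doc = LeXtract.Document.create("Patient is stable 👍. Follow-up in two weeks 📅.")
iex> chunks = LeXtract.Chunking.chunk_document(doc, max_char_buffer: 30, chunk_overlap: 5)
iex> Enum.all?(chunks, &String.valid?(&1.text))
true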
Summary
Functions
Calculates optimal overlap as 20% of the chunk size.
Chunks a document using semantic splitting and tokenization.
Chunks text with a specific tokenizer instance and optional document reference.
Functions
@spec calculate_overlap(pos_integer()) :: pos_integer()
Calculates optimal overlap as 20% of the chunk size.
Examples
iex> LeXtract.Chunking.calculate_overlap(1000)
200
iex> LeXtract.Chunking.calculate_overlap(500)
100
iex> LeXtract.Chunking.calculate_overlap(10)
2
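In other words, the result is the chunk size divided by five using integer division, which matches all three examples above. A minimal sketch of that arithmetic follows; clamping very small sizes to 1 is an assumption here, implied only by the pos_integer() return type.
# Illustrative arithmetic only: 20% via integer division,
# clamped to at least 1 to keep the result a positive integer (assumption).
def calculate_overlap(max_char_buffer) do
  max(div(max_char_buffer, 5), 1)
end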
@spec chunk_document(LeXtract.Document.t(), keyword()) :: [LeXtract.TextChunk.t()]
Chunks a document using semantic splitting and tokenization.
Takes a Document and splits its text into smaller TextChunks, each containing:
- The chunk text
- Byte positions (start_byte, end_byte) from TextChunker
- Token information via Tokenizer encoding
- Character and token intervals for alignment
Options
- :max_char_buffer - Maximum chunk size in characters (default: 1000)
- :chunk_overlap - Overlap between chunks in characters (default: 200)
- :tokenizer - Custom tokenizer instance (default: uses LeXtract.Tokenizer.default_tokenizer/0)
Examples
iex> doc = LeXtract.Document.create("Hello world")
iex> chunks = LeXtract.Chunking.chunk_document(doc)
iex> [chunk] = chunks
iex> chunk.text
"Hello world"
iex> is_struct(chunk.char_interval, LeXtract.CharInterval)
true
iex> is_struct(chunk.token_interval, LeXtract.TokenInterval)
true
iex> doc = LeXtract.Document.create("")
iex> LeXtract.Chunking.chunk_document(doc)
[]
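When the same tokenizer should be reused across many documents, it can be passed explicitly via the :tokenizer option instead of being resolved on every call. A sketch of that usage, assuming the option accepts the tokenizer reference unwrapped from the {:ok, tokenizer} tuple:
iex> {:ok, tokenizer} = LeXtract.Tokenizer.default_tokenizer()
iex> doc = LeXtract.Document.create("The patient has diabetes.")
iex> [chunk] = LeXtract.Chunking.chunk_document(doc, tokenizer: tokenizer)
iex> chunk.text
"The patient has diabetes."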
@spec chunk_with_tokenizer(String.t(), LeXtract.Tokenizer.tokenizer_ref(), LeXtract.Document.t() | nil, keyword()) :: [LeXtract.TextChunk.t()]
Chunks text with a specific tokenizer instance and optional document reference.
This function performs the core chunking logic:
- Splits text using TextChunker for semantic boundaries
- Tokenizes each chunk to get token offsets
- Creates TextChunk structs with both character and token intervals
Options
- :max_char_buffer - Maximum chunk size in characters (default: 1000)
- :chunk_overlap - Overlap between chunks in characters (default: 200)
Examples
iex> {:ok, tokenizer} = LeXtract.Tokenizer.default_tokenizer()
iex> chunks = LeXtract.Chunking.chunk_with_tokenizer("Hello world", tokenizer)
iex> [chunk] = chunks
iex> chunk.text
"Hello world"
iex> {:ok, tokenizer} = LeXtract.Tokenizer.default_tokenizer()
iex> LeXtract.Chunking.chunk_with_tokenizer("", tokenizer)
[]
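For orientation, the three steps listed above correspond roughly to the pipeline sketched below. This is a simplified illustration rather than the module's actual source: TextChunker.split/2 with chunk_size and chunk_overlap options comes from the text_chunker package, while encode_with_offsets/2 and build_text_chunk/3 are hypothetical helpers standing in for the tokenization and struct-building steps.
# Simplified sketch of the chunking pipeline (not the actual implementation).
def chunk_with_tokenizer(text, tokenizer, document \\ nil, opts \\ []) do
  max_chars = Keyword.get(opts, :max_char_buffer, 1000)
  overlap = Keyword.get(opts, :chunk_overlap, 200)

  text
  # 1. Semantic split into chunks carrying start_byte/end_byte positions.
  |> TextChunker.split(chunk_size: max_chars, chunk_overlap: overlap)
  |> Enum.map(fn chunk ->
    # 2. Tokenize the chunk text; encode_with_offsets/2 is a hypothetical helper.
    encoding = encode_with_offsets(tokenizer, chunk.text)

    # 3. Build a TextChunk with both character and token intervals;
    #    build_text_chunk/3 is likewise a hypothetical helper.
    build_text_chunk(chunk, encoding, document)
  end)
end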