Implements a token-based chunking strategy.
Splits text into overlapping chunks based on token count using the given tokenizer.
Types
@type chunk_opts() :: [ chunk_size: pos_integer(), chunk_overlap: pos_integer() | float() ]
Functions
@spec chunk(binary(), Tokenizers.Tokenizer.t(), chunk_opts()) :: {:ok, [Chunk.t()]} | {:error, term()}
Splits text into overlapping chunks using the given tokenizer.
Options
:chunk_size - Maximum number of tokens per chunk (default: 512)
:chunk_overlap - Number of tokens (integer) or percentage (float between 0 and 1) to overlap between chunks (default: 0.25)
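The interaction between the two options can be sketched as a sliding window over a token list. The module below is a hypothetical illustration, not Chunx's implementation: `OverlapSketch`, `resolve/2`, and `windows/3` are invented names, and two behaviours are assumptions, namely that a float overlap is taken as a fraction of `:chunk_size` and that the result is truncated to an integer. A plain word list stands in for real tokenizer output.

```elixir
defmodule OverlapSketch do
  # Hypothetical helper, not part of Chunx.
  # Integer overlaps are used as a token count directly.
  def resolve(overlap, _chunk_size) when is_integer(overlap), do: overlap

  # Assumption: a float is a fraction of chunk_size, truncated toward zero.
  def resolve(overlap, chunk_size) when is_float(overlap), do: trunc(overlap * chunk_size)

  # Splits a list of tokens into windows of at most chunk_size tokens,
  # where each window starts (chunk_size - overlap) tokens after the previous one.
  def windows([], _chunk_size, _overlap), do: []

  def windows(tokens, chunk_size, overlap) do
    # Guard against a zero or negative step when overlap >= chunk_size.
    step = max(chunk_size - resolve(overlap, chunk_size), 1)
    total = length(tokens)

    0
    |> Stream.iterate(&(&1 + step))
    |> Enum.reduce_while([], fn start, acc ->
      window = Enum.slice(tokens, start, chunk_size)

      # Stop once a window reaches the end of the token list.
      if start + chunk_size >= total do
        {:halt, [window | acc]}
      else
        {:cont, [window | acc]}
      end
    end)
    |> Enum.reverse()
  end
end

IO.inspect(OverlapSketch.windows(~w(some text to split), 3, 1))
```

Under these assumptions, the default settings (`chunk_size: 512`, `chunk_overlap: 0.25`) would give an overlap of 128 tokens and a step of 384 tokens between consecutive chunk starts.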
Examples
iex> {:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("distilbert/distilbert-base-uncased")
iex> Chunx.Chunker.Token.chunk("Some text to split", tokenizer, chunk_size: 3, chunk_overlap: 1)
{
  :ok,
  [
    %Chunx.Chunk{end_byte: 12, start_byte: 0, text: "Some text to", token_count: 3},
    %Chunx.Chunk{end_byte: 18, start_byte: 10, text: "to split", token_count: 2}
  ]
}
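The byte offsets in this output appear to describe a half-open range into the original text, with `start_byte` inclusive and `end_byte` exclusive. That reading is inferred from the example values only. The short check below uses plain maps as stand-ins for `%Chunx.Chunk{}` structs, with the values copied from the output above, and confirms that slicing the input by those offsets reproduces each chunk's `text`.

```elixir
text = "Some text to split"

# Plain maps standing in for %Chunx.Chunk{} structs (values copied from the example).
chunks = [
  %{start_byte: 0, end_byte: 12, text: "Some text to", token_count: 3},
  %{start_byte: 10, end_byte: 18, text: "to split", token_count: 2}
]

# binary_part/3 takes a start offset and a length in bytes,
# so the length is end_byte - start_byte for a half-open range.
slices_match =
  Enum.all?(chunks, fn chunk ->
    binary_part(text, chunk.start_byte, chunk.end_byte - chunk.start_byte) == chunk.text
  end)

IO.inspect(slices_match)
```

The two ranges share bytes 10 and 11 (the word "to"), which is the one-token overlap requested by `chunk_overlap: 1`.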