Chunx.Chunker.Word (chunx v0.1.0)


Implements a word-based chunking strategy.

Splits text into overlapping chunks on word boundaries while respecting a token limit per chunk.

Summary

Functions

Splits text into overlapping chunks using word boundaries.

Types

chunk_opts()

@type chunk_opts() :: [
  chunk_size: pos_integer(),
  chunk_overlap: pos_integer() | float()
]

Functions

chunk(text, tokenizer, opts \\ [])

@spec chunk(binary(), Tokenizers.Tokenizer.t(), chunk_opts()) ::
  {:ok, [Chunx.Chunk.t()]} | {:error, term()}

Splits text into overlapping chunks using word boundaries.

Options

  • :chunk_size - Maximum number of tokens per chunk (default: 512)
  • :chunk_overlap - Number of tokens (integer) or fraction of :chunk_size (float between 0 and 1) to overlap between consecutive chunks (default: 0.25)

Examples

iex> {:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("gpt2")
iex> Chunx.Chunker.Word.chunk("Some text to split", tokenizer, chunk_size: 3, chunk_overlap: 1)
{
  :ok,
  [
    %Chunx.Chunk{end_byte: 12, start_byte: 0, text: "Some text to", token_count: 3},
    %Chunx.Chunk{end_byte: 18, start_byte: 9, text: " to split", token_count: 2}
  ]
}
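
The overlap can also be given as a float, in which case it is treated as a fraction of :chunk_size. A minimal sketch of chunking a longer document this way; the file name "article.txt" is illustrative, and the fractional-overlap interpretation (0.25 of a 128-token chunk, i.e. 32 tokens) is assumed from the option description above:

```elixir
# Fetch a pretrained tokenizer (downloads from the Hugging Face Hub on first use).
{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("gpt2")

# Illustrative input file; any binary string works.
text = File.read!("article.txt")

case Chunx.Chunker.Word.chunk(text, tokenizer, chunk_size: 128, chunk_overlap: 0.25) do
  {:ok, chunks} ->
    # Each chunk carries its text, byte offsets, and token count.
    Enum.each(chunks, fn %Chunx.Chunk{text: t, token_count: n} ->
      IO.puts("#{n} tokens: #{String.slice(t, 0, 40)}")
    end)

  {:error, reason} ->
    IO.inspect(reason, label: "chunking failed")
end
```

Because adjacent chunks share overlapping tokens, downstream consumers (e.g. embedding pipelines) retain context across chunk boundaries at the cost of some duplicated text.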