Chunx.Chunker.Word (chunx v0.1.0)


Implements a word-based chunking strategy.

Splits text into overlapping chunks on word boundaries while respecting a token limit per chunk.

Summary

Functions

Splits text into overlapping chunks using word boundaries.

Types

chunk_opts()

@type chunk_opts() :: [
  chunk_size: pos_integer(),
  chunk_overlap: pos_integer() | float()
]

Functions

chunk(text, tokenizer, opts \\ [])

@spec chunk(binary(), Tokenizers.Tokenizer.t(), chunk_opts()) ::
  {:ok, [Chunx.Chunk.t()]} | {:error, term()}

Splits text into overlapping chunks using word boundaries.

Options

  • :chunk_size - Maximum number of tokens per chunk (default: 512)
  • :chunk_overlap - Number of tokens (integer) or fraction of :chunk_size (float between 0 and 1) to overlap between consecutive chunks (default: 0.25)

Examples

iex> {:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("gpt2")
iex> Chunx.Chunker.Word.chunk("Some text to split", tokenizer, chunk_size: 3, chunk_overlap: 1)
{
  :ok,
  [
    %Chunx.Chunk{end_byte: 12, start_byte: 0, text: "Some text to", token_count: 3},
    %Chunx.Chunk{end_byte: 18, start_byte: 9, text: " to split", token_count: 2}
  ]
}
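
The overlap can also be given as a float, in which case it is treated as a fraction of :chunk_size. A minimal sketch of chunking a longer document this way; the file name "article.txt" is illustrative, and the fractional-overlap interpretation (0.25 of a 128-token chunk, i.e. 32 tokens) is assumed from the option description above:

```elixir
# Fetch a pretrained tokenizer (downloads from the Hugging Face Hub on first use).
{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("gpt2")

# Illustrative input file; any binary string works.
text = File.read!("article.txt")

case Chunx.Chunker.Word.chunk(text, tokenizer, chunk_size: 128, chunk_overlap: 0.25) do
  {:ok, chunks} ->
    # Each chunk carries its text, byte offsets, and token count.
    Enum.each(chunks, fn %Chunx.Chunk{text: t, token_count: n} ->
      IO.puts("#{n} tokens: #{String.slice(t, 0, 40)}")
    end)

  {:error, reason} ->
    IO.inspect(reason, label: "chunking failed")
end
```

Because adjacent chunks share overlapping tokens, downstream consumers (e.g. embedding pipelines) retain context across chunk boundaries at the cost of some duplicated text.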