Chunx.Chunker.Token (chunx v0.1.0)


Implements a token-based chunking strategy.

Splits text into overlapping chunks based on token count using the given tokenizer.

Summary

Functions

Splits text into overlapping chunks using the given tokenizer.

Types

chunk_opts()

@type chunk_opts() :: [
  chunk_size: pos_integer(),
  chunk_overlap: pos_integer() | float()
]

Functions

chunk(text, tokenizer, opts \\ [])

@spec chunk(binary(), Tokenizers.Tokenizer.t(), chunk_opts()) ::
  {:ok, [Chunk.t()]} | {:error, term()}

Splits text into overlapping chunks using the given tokenizer.

Options

  • :chunk_size - Maximum number of tokens per chunk (default: 512)
  • :chunk_overlap - Overlap between consecutive chunks, given either as a number of tokens (integer) or as a fraction (float between 0 and 1) (default: 0.25)
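
To make the interaction between the two options concrete, here is a minimal sketch of the sliding-window arithmetic for an integer overlap. It is an illustration only: `OverlapSketch` and `windows/3` are hypothetical names, not part of Chunx, and the library's real implementation may differ. Float overlaps are left out because the options above do not spell out how a float is converted to a token count.

```elixir
defmodule OverlapSketch do
  # Hypothetical helper, not part of Chunx. Returns {start, stop} token index
  # ranges (stop is exclusive) for a sequence of `token_count` tokens.
  def windows(token_count, chunk_size, overlap)
      when is_integer(overlap) and overlap >= 0 and overlap < chunk_size and
             token_count > 0 do
    # Each new window starts `step` tokens after the previous one, so
    # consecutive windows share `overlap` tokens.
    step = chunk_size - overlap

    0
    |> Stream.iterate(&(&1 + step))
    |> Enum.reduce_while([], fn start, acc ->
      # The last window is clipped to the end of the token sequence.
      stop = min(start + chunk_size, token_count)
      acc = [{start, stop} | acc]
      if stop >= token_count, do: {:halt, acc}, else: {:cont, acc}
    end)
    |> Enum.reverse()
  end
end

# Four tokens, chunk_size: 3, chunk_overlap: 1
OverlapSketch.windows(4, 3, 1)
# => [{0, 3}, {2, 4}]
```

With four tokens this gives one window of three tokens followed by one of two, the same shape as the chunk/3 example in this section.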

Examples

iex> {:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("distilbert/distilbert-base-uncased")
iex> Chunx.Chunker.Token.chunk("Some text to split", tokenizer, chunk_size: 3, chunk_overlap: 1)
{
  :ok,
  [
    %Chunx.Chunk{end_byte: 12, start_byte: 0, text: "Some text to", token_count: 3},
    %Chunx.Chunk{end_byte: 18, start_byte: 10, text: "to split", token_count: 2}
  ]
}
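
The start_byte and end_byte values in the output above appear to be byte offsets into the input, with end_byte exclusive: "Some text to" spans bytes 0 up to 12. The sketch below shows one way to consume the documented `{:ok, chunks} | {:error, reason}` result and to recover each chunk's text from the original string. `ChunkInspect` is a hypothetical helper, not part of Chunx, and it relies only on the four fields shown in the example output.

```elixir
defmodule ChunkInspect do
  # Hypothetical helper, not part of Chunx.

  # Recovers a chunk's text from the original input, treating start_byte as
  # inclusive and end_byte as exclusive (an assumption based on the example).
  def slice(text, chunk) do
    binary_part(text, chunk.start_byte, chunk.end_byte - chunk.start_byte)
  end

  # Turns the result tuple from the @spec into human-readable lines.
  def summarize({:ok, chunks}) do
    Enum.map(chunks, fn chunk ->
      "#{chunk.start_byte}..#{chunk.end_byte} (#{chunk.token_count} tokens): #{chunk.text}"
    end)
  end

  def summarize({:error, reason}), do: ["chunking failed: #{inspect(reason)}"]
end

# Plain maps stand in for %Chunx.Chunk{} structs so the sketch runs on its own.
text = "Some text to split"

chunks = [
  %{start_byte: 0, end_byte: 12, text: "Some text to", token_count: 3},
  %{start_byte: 10, end_byte: 18, text: "to split", token_count: 2}
]

Enum.map(chunks, &ChunkInspect.slice(text, &1))
# => ["Some text to", "to split"]
```

In real use, the result of Chunx.Chunker.Token.chunk/3 could be piped straight into `ChunkInspect.summarize/1`.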