Chunx.Chunker.Token (chunx v0.1.0)

Implements a token-based chunking strategy.

Splits text into overlapping chunks based on token count using the given tokenizer.

Summary

Types

chunk_opts()

Functions

chunk(text, tokenizer, opts \\ [])

Splits text into overlapping chunks using the given tokenizer.

Types

chunk_opts()

@type chunk_opts() :: [
  chunk_size: pos_integer(),
  chunk_overlap: pos_integer() | float()
]

Functions

chunk(text, tokenizer, opts \\ [])

@spec chunk(binary(), Tokenizers.Tokenizer.t(), chunk_opts()) ::
  {:ok, [Chunk.t()]} | {:error, term()}

Splits text into overlapping chunks using the given tokenizer.

Options

:chunk_size - Maximum number of tokens per chunk (default: 512)
:chunk_overlap - Overlap between consecutive chunks, given either as a number of tokens (integer) or as a fraction (float between 0 and 1) (default: 0.25)

Examples

iex> {:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("distilbert/distilbert-base-uncased")
iex> Chunx.Chunker.Token.chunk("Some text to split", tokenizer, chunk_size: 3, chunk_overlap: 1)
{
  :ok,
  [
    %Chunx.Chunk{end_byte: 12, start_byte: 0, text: "Some text to", token_count: 3},
    %Chunx.Chunk{end_byte: 18, start_byte: 10, text: "to split", token_count: 2}
  ]
}
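
The overlap in the result above (the token "to" appears in both chunks) is consistent with a sliding window that advances by chunk_size minus the overlap. The following is a minimal sketch of that idea in plain Elixir, not Chunx's actual implementation. The module OverlapSketch and its functions are hypothetical, the input is an already-tokenized list, and treating a float overlap as a fraction of chunk_size (truncated) is an assumption.

```elixir
defmodule OverlapSketch do
  @moduledoc false
  # Hypothetical helper module for illustration only; not part of Chunx.

  # Resolve the overlap option to a token count.
  # Assumption: a float is a fraction of chunk_size, truncated toward zero.
  def overlap_tokens(_chunk_size, overlap) when is_integer(overlap), do: overlap
  def overlap_tokens(chunk_size, overlap) when is_float(overlap), do: trunc(chunk_size * overlap)

  # Slide a window of chunk_size tokens, advancing by chunk_size - overlap.
  def windows(tokens, chunk_size, overlap) when is_list(tokens) and chunk_size > 0 do
    step = chunk_size - overlap_tokens(chunk_size, overlap)
    if step <= 0, do: raise(ArgumentError, "overlap must be smaller than chunk_size")
    slide(tokens, chunk_size, step, [])
  end

  defp slide([], _size, _step, acc), do: Enum.reverse(acc)

  defp slide(tokens, size, step, acc) do
    window = Enum.take(tokens, size)

    if length(tokens) <= size do
      # Last window: everything that remains fits in one chunk.
      Enum.reverse([window | acc])
    else
      slide(Enum.drop(tokens, step), size, step, [window | acc])
    end
  end
end

# Four one-word "tokens", chunk_size 3, overlap 1: the window advances by 2.
IO.inspect(OverlapSketch.windows(~w(Some text to split), 3, 1))
# prints [["Some", "text", "to"], ["to", "split"]]
```

Under the same assumption, the defaults (chunk_size: 512, chunk_overlap: 0.25) would resolve to an overlap of 128 tokens, so each window would start 384 tokens after the previous one.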