LangChain.TextSplitter.CharacterTextSplitter (LangChain v0.3.3)

The CharacterTextSplitter is a length-based text splitter that divides text at a specified separator. It aims to produce chunks of a consistent size and operates as follows:

  • It splits the text at specified separator characters.
  • It takes a chunk_size parameter that determines the maximum number of characters in each chunk.
  • If no separator is found within the chunk_size, it will create a chunk larger than the specified size.
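
The three rules above can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: the module ChunkSketch and its split/3 function are hypothetical names, the separator is treated as a literal string, and chunk overlap is ignored.

```elixir
defmodule ChunkSketch do
  # Hypothetical helper, not part of LangChain: split on a literal separator,
  # then greedily pack the pieces into chunks of at most `chunk_size` characters.
  def split(text, separator, chunk_size) do
    text
    |> String.split(separator, trim: true)
    |> Enum.reduce([], fn
      piece, [] ->
        [piece]

      piece, [current | rest] ->
        candidate = current <> separator <> piece

        if String.length(candidate) <= chunk_size do
          # The piece still fits: extend the current chunk.
          [candidate | rest]
        else
          # The piece does not fit: start a new chunk with it.
          [piece, current | rest]
        end
    end)
    |> Enum.reverse()
  end
end

ChunkSketch.split("foo bar baz", " ", 7)
# => ["foo bar", "baz"]

# "abcdefgh" contains no separator, so it becomes a chunk longer than chunk_size.
ChunkSketch.split("abcdefgh ij", " ", 4)
# => ["abcdefgh", "ij"]
```

The second call shows the last rule: a piece that cannot be split any further is emitted whole, even though it exceeds the requested size.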

The purpose is to prepare text for processing by large language models with limited context windows, or where a shorter context window is desired.

A CharacterTextSplitter is defined using a schema with the following fields:

  • separator - String that splits a given text.
  • chunk_size - Integer maximum number of characters that a chunk should have.
  • chunk_overlap - Integer number of characters that two consecutive chunks should share.
  • keep_separator - Either :discard_separator, :start or :end. If :discard_separator, the separator is discarded from the output chunks. :start and :end keep the separator at the start or end of the output chunks. Defaults to :discard_separator.
  • is_separator_regex - Boolean. If true, the separator string is used as a regex without being escaped. Defaults to false.

Types

@type t() :: %LangChain.TextSplitter.CharacterTextSplitter{
  chunk_overlap: term(),
  chunk_size: term(),
  is_separator_regex: term(),
  keep_separator: term(),
  separator: term()
}

Functions

Build a new CharacterTextSplitter and return an :ok/:error tuple with the result.

Build a new CharacterTextSplitter and return it or raise an error if invalid.
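
The two constructors differ only in how they report invalid attributes. The sketch below assumes the langchain dependency is available in your project and that both constructors accept a plain map, as the split_text examples do. The shape of the error value is not described in this section, so it is only inspected generically here.

```elixir
alias LangChain.TextSplitter.CharacterTextSplitter

attrs = %{separator: " ", chunk_size: 3, chunk_overlap: 0}

# The non-raising constructor reports the outcome as a tagged tuple.
case CharacterTextSplitter.new(attrs) do
  {:ok, splitter} ->
    CharacterTextSplitter.split_text(splitter, "foo bar baz")

  {:error, reason} ->
    # Assumption: `reason` is some inspectable term describing the problem.
    IO.inspect(reason, label: "invalid splitter")
end

# The bang variant returns the struct directly and raises if the attributes are invalid.
splitter = CharacterTextSplitter.new!(attrs)
```

Prefer the tuple-returning form when the attributes come from user input or configuration, and the raising form when they are fixed in code.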

split_text(text_splitter, text)

Splits text at the given separator. By default, the separator is discarded from the output chunks:

iex> text_splitter = CharacterTextSplitter.new!(%{separator: " ", chunk_size: 3, chunk_overlap: 0})
iex> text = "foo bar baz"
iex> CharacterTextSplitter.split_text(text_splitter, text)
["foo", "bar", "baz"]

To keep the separator at the end of a chunk, provide the keep_separator: :end option:

iex> text_splitter = CharacterTextSplitter.new!(%{separator: ".", chunk_size: 3, chunk_overlap: 0, keep_separator: :end})
iex> text = "foo.bar.baz"
iex> CharacterTextSplitter.split_text(text_splitter, text)
["foo.", "bar.", "baz"]

In order to keep the separator at the beginning of a chunk, provide the keep_separator: :start option:

iex> text_splitter = CharacterTextSplitter.new!(%{separator: ".", chunk_size: 3, chunk_overlap: 0, keep_separator: :start})
iex> text = "foo.bar.baz"
iex> CharacterTextSplitter.split_text(text_splitter, text)
["foo", ".bar", ".baz"]

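The last two examples used a regex special character (.) as the separator. By default, the separator string is escaped and then compiled as a regex before splitting, so special characters are matched literally. To use a regex as the separator, pass the is_separator_regex: true option. The separator is then not escaped, so any character that should match literally has to be escaped by the caller, as Regex.escape/1 does here: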

iex> text_splitter = CharacterTextSplitter.new!(%{separator: Regex.escape("."), chunk_size: 3, chunk_overlap: 0, keep_separator: :start, is_separator_regex: true})
iex> text = "foo.bar.baz"
iex> CharacterTextSplitter.split_text(text_splitter, text)
["foo", ".bar", ".baz"]
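
To see why escaping matters, the following standard-library sketch compares an escaped and an unescaped "." pattern. It only illustrates regex escaping with Regex.escape/1, Regex.compile!/1, and Regex.split/2. It does not reproduce the splitter's internals, and the variable names are illustrative.

```elixir
# Regex.escape/1 turns the metacharacter "." into a pattern for a literal dot.
escaped = Regex.escape(".")
# => "\\."

literal = Regex.compile!(escaped)
Regex.split(literal, "foo.bar.baz")
# => ["foo", "bar", "baz"]

# Unescaped, "." matches any character, so every character in the text
# is treated as a separator and none of the words survive the split.
wildcard = Regex.compile!(".")
Regex.split(wildcard, "foo.bar.baz")
```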

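You can control the overlap between consecutive chunks through the chunk_overlap parameter. In the following example, the 3-character piece "bar" ends the first chunk and begins the second: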

iex> text_splitter = CharacterTextSplitter.new!(%{separator: " ", chunk_size: 7, chunk_overlap: 3})
iex> text = "foo bar baz"
iex> CharacterTextSplitter.split_text(text_splitter, text)
["foo bar", "bar baz"]