LangChain.TextSplitter.CharacterTextSplitter (LangChain v0.3.3)
The CharacterTextSplitter is a length-based text splitter that divides text based on specified characters.
This splitter provides consistent chunk sizes.
It operates as follows:
- It splits the text at the specified separator characters.
- It takes a chunk_size parameter that determines the maximum number of characters in each chunk.
- If no separator is found within chunk_size characters, it creates a chunk larger than the specified size.
The purpose is to prepare text for processing by large language models with limited context windows, or where a shorter context window is desired.
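The behaviour listed above can be illustrated with a minimal, self-contained sketch. The module SplitSketch and its split/3 function are hypothetical names used only for illustration, and the greedy merge is an assumption about how such a splitter can work, not the library's actual implementation. It uses only the Elixir standard library.

```elixir
defmodule SplitSketch do
  # Hypothetical helper, not part of LangChain: split on a literal separator,
  # then greedily merge neighbouring pieces while the result fits in chunk_size.
  def split(text, separator, chunk_size) do
    text
    |> String.split(separator, trim: true)
    |> Enum.reduce([], fn piece, acc ->
      case acc do
        [] ->
          [piece]

        [current | rest] ->
          candidate = current <> separator <> piece

          if String.length(candidate) <= chunk_size do
            # The merged chunk still fits, so keep growing it.
            [candidate | rest]
          else
            # Start a new chunk; a single piece is never cut, even if it is too long.
            [piece, current | rest]
          end
      end
    end)
    |> Enum.reverse()
  end
end

IO.inspect(SplitSketch.split("foo bar baz", " ", 3))
# ["foo", "bar", "baz"]
IO.inspect(SplitSketch.split("foo bar baz", " ", 7))
# ["foo bar", "baz"]
IO.inspect(SplitSketch.split("foobarbaz qux", " ", 3))
# ["foobarbaz", "qux"]
```

The last call shows the third rule: "foobarbaz" contains no separator, so it is emitted as one chunk of nine characters even though chunk_size is 3.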
A CharacterTextSplitter is defined using a schema.
- separator - String that splits a given text.
- chunk_size - Integer number of characters that a chunk should have.
- chunk_overlap - Integer number of characters that two consecutive chunks should share.
- keep_separator - Either :discard_separator, :start or :end. If :discard_separator, the separator is discarded from the output chunks. :start and :end keep the separator at the start or end of the output chunks. Defaults to :discard_separator.
- is_separator_regex - Boolean defaulting to false. If true, the separator string is not escaped.
Functions
Build a new CharacterTextSplitter and return an :ok/:error tuple with the result.
Build a new CharacterTextSplitter and return it or raise an error if invalid.
Splits text based on a given character. By default, the separator character is discarded.
iex> alias LangChain.TextSplitter.CharacterTextSplitter
iex> text_splitter = CharacterTextSplitter.new!(%{separator: " ", chunk_size: 3, chunk_overlap: 0})
iex> text = "foo bar baz"
iex> CharacterTextSplitter.split_text(text_splitter, text)
["foo", "bar", "baz"]
We can keep the separator at the end of a chunk by providing the keep_separator: :end option:
iex> text_splitter = CharacterTextSplitter.new!(%{separator: ".", chunk_size: 3, chunk_overlap: 0, keep_separator: :end})
iex> text = "foo.bar.baz"
iex> CharacterTextSplitter.split_text(text_splitter, text)
["foo.", "bar.", "baz"]
To keep the separator at the beginning of a chunk, provide the keep_separator: :start option:
iex> text_splitter = CharacterTextSplitter.new!(%{separator: ".", chunk_size: 3, chunk_overlap: 0, keep_separator: :start})
iex> text = "foo.bar.baz"
iex> CharacterTextSplitter.split_text(text_splitter, text)
["foo", ".bar", ".baz"]
The last two examples used a regex special character as a separator.
Plain strings are escaped and parsed as regex before splitting.
If you want to use a complex regex as the separator you can, but make sure to pass the is_separator_regex: true option:
iex> text_splitter = CharacterTextSplitter.new!(%{separator: Regex.escape("."), chunk_size: 3, chunk_overlap: 0, keep_separator: :start, is_separator_regex: true})
iex> text = "foo.bar.baz"
iex> CharacterTextSplitter.split_text(text_splitter, text)
["foo", ".bar", ".baz"]
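The difference between an escaped and an unescaped separator can be seen with the standard Regex module alone. This is a sketch of the general idea, assuming the separator string is eventually compiled into a regex as the text describes; the variable names are illustrative.

```elixir
# Escaped: "." is turned into "\\." and matches only a literal dot.
escaped_source = Regex.escape(".")
IO.inspect(escaped_source)
# "\\."

literal_dot = Regex.compile!(escaped_source)
IO.inspect(Regex.split(literal_dot, "foo.bar.baz"))
# ["foo", "bar", "baz"]

# Unescaped: the string is used as a regex as-is, so "\\s+" matches any run of whitespace.
whitespace = Regex.compile!("\\s+")
IO.inspect(Regex.split(whitespace, "foo  bar\tbaz"))
# ["foo", "bar", "baz"]
```

Passing an unescaped pattern such as "\\s+" is the kind of case where is_separator_regex: true would be needed, because escaping it would make it match only the literal characters backslash, s and plus.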
You can control the overlap of chunks through the chunk_overlap parameter:
iex> text_splitter = CharacterTextSplitter.new!(%{separator: " ", chunk_size: 7, chunk_overlap: 3})
iex> text = "foo bar baz"
iex> CharacterTextSplitter.split_text(text_splitter, text)
["foo bar", "bar baz"]
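One way to picture how the overlap arises is a sliding window over the already split pieces. The sketch below is an assumption about the general technique, not the library's code; OverlapSketch and merge/4 are hypothetical names. When the next piece would push the window past chunk_size, the window is emitted as a chunk, and pieces are dropped from its front until the remainder is no longer than chunk_overlap and leaves room for the next piece.

```elixir
defmodule OverlapSketch do
  # Hypothetical helper, not LangChain's implementation.
  def merge(pieces, separator, chunk_size, chunk_overlap) do
    {chunks, window} =
      Enum.reduce(pieces, {[], []}, fn piece, {chunks, window} ->
        if window != [] and joined_length(window ++ [piece], separator) > chunk_size do
          # Emit the current window, then carry over its tail as the overlap.
          kept = shrink(window, piece, separator, chunk_size, chunk_overlap)
          {[Enum.join(window, separator) | chunks], kept ++ [piece]}
        else
          {chunks, window ++ [piece]}
        end
      end)

    [Enum.join(window, separator) | chunks]
    |> Enum.reverse()
    |> Enum.reject(&(&1 == ""))
  end

  # Drop pieces from the front until the rest fits in chunk_overlap
  # and still leaves room for the next piece within chunk_size.
  defp shrink([], _piece, _separator, _chunk_size, _chunk_overlap), do: []

  defp shrink([_ | rest] = window, piece, separator, chunk_size, chunk_overlap) do
    too_long? = joined_length(window, separator) > chunk_overlap
    no_room? = joined_length(window ++ [piece], separator) > chunk_size

    if too_long? or no_room? do
      shrink(rest, piece, separator, chunk_size, chunk_overlap)
    else
      window
    end
  end

  defp joined_length(pieces, separator) do
    pieces |> Enum.join(separator) |> String.length()
  end
end

IO.inspect(OverlapSketch.merge(["foo", "bar", "baz"], " ", 7, 3))
# ["foo bar", "bar baz"]
IO.inspect(OverlapSketch.merge(["foo", "bar", "baz"], " ", 7, 0))
# ["foo bar", "baz"]
```

With chunk_overlap set to 3, the piece "bar" fits in the overlap budget and is repeated at the start of the second chunk; with 0, nothing is carried over.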