LangChain.TextSplitter.RecursiveCharacterTextSplitter (LangChain v0.3.3)
The RecursiveCharacterTextSplitter is the recommended splitter for generic text. It splits the text based on a list of characters, trying each of them in order until the text is split into small enough chunks. The default list is ["\n\n", "\n", " ", ""].
The purpose is to prepare text for processing by large language models with limited context windows, or where a shorter context window is desired.
The main characteristic of this splitter is that it tries to keep paragraphs, sentences or code functions together as long as possible.
LangChain.TextSplitter.LanguageSeparators provides separator lists for some programming and markup languages. To use these separators, it's recommended to set the is_separator_regex option to true.
How it works:
- It splits the text at the first specified separator from the given separators list. It uses LangChain.TextSplitter.CharacterTextSplitter to do so.
- For each of the above splits, it calls itself recursively using the tail of the separators list.
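The two steps above can be sketched in a few lines of plain Elixir. This is a simplified illustration, not the library's implementation: the module name RecursiveSplitSketch is hypothetical, the separator is always discarded, and the merging of small pieces and the chunk_overlap handling are left out.

```elixir
defmodule RecursiveSplitSketch do
  # Hypothetical helper for illustration only; not part of LangChain.
  # Assumption: no merging of small pieces, no overlap, separators discarded.

  # Nothing to split.
  def split("", _separators, _chunk_size), do: []

  # No separators left: return the text as-is, even if it is too long.
  def split(text, [], _chunk_size), do: [text]

  def split(text, [separator | rest], chunk_size) do
    if String.length(text) <= chunk_size do
      # Small enough: stop recursing.
      [text]
    else
      # Split on the first separator, then recurse on each piece
      # with the tail of the separators list.
      text
      |> split_on(separator)
      |> Enum.flat_map(&split(&1, rest, chunk_size))
    end
  end

  # An empty separator means "split into single characters".
  defp split_on(text, ""), do: String.graphemes(text)
  defp split_on(text, separator), do: String.split(text, separator, trim: true)
end

RecursiveSplitSketch.split("aaa bbb\n\nccc ddd eee", ["\n\n", "\n", " ", ""], 7)
# => ["aaa bbb", "ccc", "ddd", "eee"]
```

The first paragraph fits in 7 characters and is kept whole. The second one does not, so it falls through to the next separators until the pieces are small enough.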
A RecursiveCharacterTextSplitter is defined using a schema.
- separators - List of strings that split a given text. The default list is ["\n\n", "\n", " ", ""].
- chunk_size - Integer number of characters that a chunk should have.
- chunk_overlap - Integer number of characters that two consecutive chunks should share.
- keep_separator - Either :discard_separator, :start or :end. If nil, the separator is discarded from the output chunks. :start and :end keep the separator at the start or end of the output chunks. Defaults to :start.
- is_separator_regex - Boolean. If true, the separator string is not escaped. Defaults to false.
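To make the keep_separator modes concrete, here is a minimal standalone sketch of what each mode does to a single separator. The module KeepSeparatorSketch is hypothetical and does not reflect how the library implements the option. It only reproduces the described behaviour for one literal separator, ignoring chunk_size.

```elixir
defmodule KeepSeparatorSketch do
  # Hypothetical helper for illustration only; not part of LangChain.
  def split(text, separator, mode) when mode in [:discard_separator, :start, :end] do
    text
    |> String.split(separator)
    |> attach(separator, mode)
    |> Enum.reject(&(&1 == ""))
  end

  # Drop the separator entirely.
  defp attach(parts, _separator, :discard_separator), do: parts

  # Put the separator back at the start of every piece except the first.
  defp attach([first | rest], separator, :start),
    do: [first | Enum.map(rest, &(separator <> &1))]

  # Put the separator back at the end of every piece except the last.
  defp attach(parts, separator, :end) do
    {init, [last]} = Enum.split(parts, -1)
    Enum.map(init, &(&1 <> separator)) ++ [last]
  end
end

KeepSeparatorSketch.split("Apple,banana,orange", ",", :discard_separator)
# => ["Apple", "banana", "orange"]
KeepSeparatorSketch.split("Apple,banana,orange", ",", :start)
# => ["Apple", ",banana", ",orange"]
KeepSeparatorSketch.split("Apple,banana,orange", ",", :end)
# => ["Apple,", "banana,", "orange"]
```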
Functions
new/1
Build a new RecursiveCharacterTextSplitter and return an :ok/:error tuple with the result.
new!/1
Build a new RecursiveCharacterTextSplitter and return it or raise an error if invalid.
split_text/2
Splits text recursively based on a list of characters.
By default, the separator characters are kept at the start of each chunk:
iex> alias LangChain.TextSplitter.RecursiveCharacterTextSplitter
iex> split_tags = [",", "."]
iex> base_params = %{chunk_size: 10, chunk_overlap: 0, separators: split_tags}
iex> query = "Apple,banana,orange and tomato."
iex> splitter = RecursiveCharacterTextSplitter.new!(base_params)
iex> splitter |> RecursiveCharacterTextSplitter.split_text(query)
["Apple", ",banana", ",orange and tomato", "."]
We can keep the separator at the end of a chunk by providing the keep_separator: :end option:
iex> split_tags = [",", "."]
iex> base_params = %{chunk_size: 10, chunk_overlap: 0, separators: split_tags, keep_separator: :end}
iex> query = "Apple,banana,orange and tomato."
iex> splitter = RecursiveCharacterTextSplitter.new!(base_params)
iex> splitter |> RecursiveCharacterTextSplitter.split_text(query)
["Apple,", "banana,", "orange and tomato."]
See LangChain.TextSplitter.CharacterTextSplitter for the usage of the different options.
LanguageSeparators provides separators for multiple programming and markup languages. To use these separators, it's recommended to set the is_separator_regex option to true.
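The effect of is_separator_regex can be illustrated with the standard Regex module. The sketch below assumes the option only controls whether a separator string is escaped before being compiled into a pattern. The module SeparatorRegexSketch is hypothetical and is not the library's code.

```elixir
defmodule SeparatorRegexSketch do
  # Hypothetical helper for illustration only; not part of LangChain.
  # When is_separator_regex is false, the separator is escaped and matched literally.
  # When it is true, the separator is compiled as a regular expression as-is.
  def to_regex(separator, is_separator_regex) do
    source = if is_separator_regex, do: separator, else: Regex.escape(separator)
    Regex.compile!(source)
  end
end

# Escaped: "." only matches a literal dot.
Regex.match?(SeparatorRegexSketch.to_regex(".", false), "abc")
# => false

# Not escaped: "." is the regex wildcard and matches any character.
Regex.match?(SeparatorRegexSketch.to_regex(".", true), "abc")
# => true
```

Separators written as patterns only behave as intended when they are not escaped, which is presumably why the option is recommended for the language-specific lists.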
To split Python code:
iex> alias LangChain.TextSplitter.LanguageSeparators
iex> python_code = "
...>def hello_world():
...>    print('Hello, World')
...>
...>
...># Call the function
...>hello_world()"
iex> splitter =
...>   RecursiveCharacterTextSplitter.new!(%{
...>     separators: LanguageSeparators.python(),
...>     keep_separator: :start,
...>     is_separator_regex: true,
...>     chunk_size: 16,
...>     chunk_overlap: 0})
iex> splitter |> RecursiveCharacterTextSplitter.split_text(python_code)
["def", "hello_world():", "print('Hello,", "World')", "# Call the", "function", "hello_world()"]