LangChain.TextSplitter.RecursiveCharacterTextSplitter (LangChain v0.4.0)
The RecursiveCharacterTextSplitter is the recommended splitter for generic text.
It splits the text based on a list of characters.
It uses each of these characters sequentially, until the text is split
into small enough chunks. The default list is ["\n\n", "\n", " ", ""].
The purpose is to prepare text for processing by large language models with limited context windows, or where a shorter context window is desired.
The main characteristic of this splitter is that it tries to keep paragraphs, sentences or code functions together as long as possible.
LangChain.TextSplitter.LanguageSeparators provides separator lists for some programming and markup languages.
To use these Separators, it's recommended to set the is_separator_regex option to true.
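The effect of escaping can be illustrated with plain Regex calls. This is a minimal sketch of the general idea, not the library's internal code: the sample text and the "\\s+" separator are made up for illustration, and it assumes that a non-regex separator is escaped (as with Regex.escape/1) before matching.

```elixir
# Hypothetical sample data, chosen only to show the difference.
text = "a  b\tc"
separator = "\\s+"

# Treated as a regex: "\s+" matches any run of whitespace.
as_regex = Regex.split(Regex.compile!(separator), text)
# as_regex is ["a", "b", "c"]

# Treated as a literal: the separator is escaped first, so only the
# literal characters backslash, "s", "+" would match. The text has none.
as_literal = Regex.split(Regex.compile!(Regex.escape(separator)), text)
# as_literal is ["a  b\tc"]

IO.inspect(as_regex)
IO.inspect(as_literal)
```

Separator lists that contain regex syntax only behave as intended when they are not escaped, which is presumably why the option is recommended for LanguageSeparators.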
How it works:
- It splits the text at the first separator from the given separators list, using LangChain.TextSplitter.CharacterTextSplitter to do so.
- For each of the resulting splits, it calls itself recursively using the tail of the separators list.
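The two steps above can be sketched in a few lines of plain Elixir. This is a simplified illustration under stated assumptions, not the library's implementation: the RecursiveSplitSketch module is hypothetical, it measures size with String.length/1, discards the separators, and does not merge small pieces back together or apply any overlap.

```elixir
defmodule RecursiveSplitSketch do
  # Hypothetical helper for illustration only.
  # Returns a list of pieces no longer than chunk_size where possible.
  def split(text, separators, chunk_size) do
    cond do
      # Small enough already: stop recursing.
      String.length(text) <= chunk_size ->
        [text]

      # No separators left: return the oversized piece unchanged.
      separators == [] ->
        [text]

      true ->
        [separator | rest] = separators

        text
        # Step 1: split at the first separator in the list.
        |> String.split(separator, trim: true)
        # Step 2: recurse on each piece with the tail of the list.
        |> Enum.flat_map(&split(&1, rest, chunk_size))
    end
  end
end

chunks = RecursiveSplitSketch.split("one two\n\nthree four five", ["\n\n", " "], 10)
IO.inspect(chunks)
# ["one two", "three", "four", "five"]
```

The first paragraph fits in 10 characters and stays whole, while the second is too long and is split again on the next separator. This shows how coarser separators are preferred and finer ones are used only when needed.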
A RecursiveCharacterTextSplitter is defined using a schema.
- separators - List of strings used to split a given text. The default list is ["\n\n", "\n", " ", ""].
- chunk_size - Integer number of tokens that a chunk should have.
- chunk_overlap - Integer number of tokens that two consecutive chunks should share.
- keep_separator - Either :discard_separator, :start or :end. If nil, the separator is discarded from the output chunks. :start and :end keep the separator at the start or end of the output chunks. Defaults to :start.
- is_separator_regex - Boolean. If true, the separator string is not escaped. Defaults to false.
- tokenizer - Function that takes a string and returns the number of tokens. Defaults to &String.length/1.
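As an illustration of how these options combine, the sketch below builds a splitter that counts whitespace-separated words instead of characters. It only uses the options listed above together with new!/1 and split_text/2. The word_count function, the sample text, and the option values are made up for this example, and the Mix.install line assumes the package is published on Hex as :langchain in a 0.4.x version.

```elixir
# Assumption: the library is available on Hex as :langchain, 0.4.x.
Mix.install([{:langchain, "~> 0.4.0"}])

alias LangChain.TextSplitter.RecursiveCharacterTextSplitter

# Hypothetical tokenizer: counts whitespace-separated words.
word_count = fn text -> text |> String.split() |> length() end

splitter =
  RecursiveCharacterTextSplitter.new!(%{
    separators: ["\n\n", "\n", " ", ""],
    chunk_size: 5,
    chunk_overlap: 0,
    keep_separator: :end,
    tokenizer: word_count
  })

# Made-up sample text with two paragraphs.
text = "The quick brown fox jumps over the lazy dog.\n\nA second paragraph follows here."

chunks = RecursiveCharacterTextSplitter.split_text(splitter, text)
IO.inspect(chunks)
```

With a custom tokenizer, chunk_size and chunk_overlap are expressed in whatever unit that function returns, here words rather than characters.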
Functions
Build a new RecursiveCharacterTextSplitter and return an :ok/:error tuple with the result.
Build a new RecursiveCharacterTextSplitter and return it or raise an error if invalid.
Splits text recursively based on a list of characters.
By default, the separator characters are kept at the start of each chunk:
iex> split_tags = [",", "."]
iex> base_params = %{chunk_size: 10, chunk_overlap: 0, separators: split_tags}
iex> query = "Apple,banana,orange and tomato."
iex> splitter = RecursiveCharacterTextSplitter.new!(base_params)
iex> splitter |> RecursiveCharacterTextSplitter.split_text(query)
["Apple", ",banana", ",orange and tomato", "."]
We can keep the separator at the end of a chunk by providing the
keep_separator: :end option:
iex> split_tags = [",", "."]
iex> base_params = %{chunk_size: 10, chunk_overlap: 0, separators: split_tags, keep_separator: :end}
iex> query = "Apple,banana,orange and tomato."
iex> splitter = RecursiveCharacterTextSplitter.new!(base_params)
iex> splitter |> RecursiveCharacterTextSplitter.split_text(query)
["Apple,", "banana,", "orange and tomato."]
See LangChain.TextSplitter.CharacterTextSplitter for the usage of the different options.
LanguageSeparators provides separators for multiple
programming and markup languages.
To use these Separators, it's recommended to set the is_separator_regex option to true.
To split Python code:
iex> python_code = "
...>def hello_world():
...>    print('Hello, World')
...>
...>
...># Call the function
...>hello_world()"
iex> splitter =
...>   RecursiveCharacterTextSplitter.new!(%{
...>     separators: LanguageSeparators.python(),
...>     keep_separator: :start,
...>     is_separator_regex: true,
...>     chunk_size: 16,
...>     chunk_overlap: 0
...>   })
iex> splitter |> RecursiveCharacterTextSplitter.split_text(python_code)
["def", "hello_world():", "print('Hello,", "World')", "# Call the", "function", "hello_world()"]