LangChain.TextSplitter.RecursiveCharacterTextSplitter (LangChain v0.3.3)

The RecursiveCharacterTextSplitter is the recommended splitter for generic text. It splits the text based on a list of separators, trying each one in order until the text is split into small enough chunks. The default list is ["\n\n", "\n", " ", ""].

The purpose is to prepare text for processing by large language models with limited context windows, or where a shorter context window is desired.

The main characteristic of this splitter is that it tries to keep paragraphs, sentences or code functions together as long as possible.

LangChain.TextSplitter.LanguageSeparators provides separator lists for some programming and markup languages. To use these separators, it's recommended to set the is_separator_regex option to true.
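As a rough illustration of the paragraph above, the sketch below passes a language-specific separator list to the splitter. It assumes that LanguageSeparators exposes a python/0 function returning the separators for Python source (an assumption, not confirmed by this page) and that the langchain dependency is available in the project. The chunk_size value is arbitrary.

```elixir
alias LangChain.TextSplitter.{LanguageSeparators, RecursiveCharacterTextSplitter}

splitter =
  RecursiveCharacterTextSplitter.new!(%{
    # Assumption: python/0 exists and returns a list of separator strings.
    separators: LanguageSeparators.python(),
    # Recommended above when using LanguageSeparators lists.
    is_separator_regex: true,
    chunk_size: 60,
    chunk_overlap: 0
  })

code = """
def add(a, b):
    return a + b

def sub(a, b):
    return a - b
"""

# Returns a list of strings, one per chunk.
chunks = RecursiveCharacterTextSplitter.split_text(splitter, code)
Enum.each(chunks, &IO.inspect/1)
```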

How it works:

It splits the text at the first separator in the given separators list, using LangChain.TextSplitter.CharacterTextSplitter to do so.
For each of the above splits, it calls itself recursively using the tail of the separators list.
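The two steps above can be sketched in a few lines of plain Elixir. This is a simplified, hypothetical illustration (the module name RecursiveSplitSketch is invented here), not the library's implementation: it discards separators, ignores chunk_overlap, and does not merge small neighbouring pieces back together, all of which a full splitter would have to handle.

```elixir
defmodule RecursiveSplitSketch do
  # Hypothetical helper for illustration only.
  def split(text, separators, chunk_size) do
    cond do
      # Small enough already: stop recursing.
      String.length(text) <= chunk_size ->
        [text]

      # No separators left: fall back to fixed-size slices.
      separators == [] ->
        text
        |> String.graphemes()
        |> Enum.chunk_every(chunk_size)
        |> Enum.map(&Enum.join/1)

      # Split on the head separator, then recurse with the tail of the list.
      true ->
        [separator | rest] = separators

        text
        |> String.split(separator, trim: true)
        |> Enum.flat_map(&split(&1, rest, chunk_size))
    end
  end
end

result = RecursiveSplitSketch.split("aaa bbb\n\nccc ddd eee", ["\n\n", "\n", " "], 7)
IO.inspect(result)
# → ["aaa bbb", "ccc", "ddd", "eee"]
```

The first paragraph fits within 7 characters and is kept whole, while the second is too long and falls through to the space separator. This is the sense in which larger units are kept together as long as possible.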

A RecursiveCharacterTextSplitter is defined using a schema.

separators - List of strings that split a given text. The default list is ["\n\n", "\n", " ", ""].
chunk_size - Integer number of characters that a chunk should have.
chunk_overlap - Integer number of characters that two consecutive chunks should share.
keep_separator - Either :discard_separator, :start or :end. If nil, the separator is discarded from the output chunks. :start and :end keep the separator at the start or end of the output chunks. Defaults to :start.
is_separator_regex - Boolean. If true, each separator string is used as a regular expression and is not escaped. Defaults to false.
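A minimal usage sketch with every option spelled out, assuming the langchain dependency is available in the project. The option values and the sample text are arbitrary choices for illustration; the exact chunks produced depend on the library's merging rules, so no output is shown.

```elixir
alias LangChain.TextSplitter.RecursiveCharacterTextSplitter

# new/1 returns an :ok/:error tuple; new!/1 returns the struct or raises.
{:ok, splitter} =
  RecursiveCharacterTextSplitter.new(%{
    separators: ["\n\n", "\n", " ", ""],
    chunk_size: 40,
    chunk_overlap: 0,
    keep_separator: :start,
    is_separator_regex: false
  })

text =
  "First paragraph, first line.\nFirst paragraph, second line.\n\nSecond paragraph."

# Returns a list of strings, one per chunk.
chunks = RecursiveCharacterTextSplitter.split_text(splitter, text)
Enum.each(chunks, &IO.inspect/1)
```

Omitting a key from the attrs map leaves that option at its documented default.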

Summary

Types

t()

Functions

new(attrs \\ %{})

Build a new RecursiveCharacterTextSplitter and return an :ok/:error tuple with the result.

new!(attrs \\ %{})

Build a new RecursiveCharacterTextSplitter and return it or raise an error if invalid.

split_text(text_splitter, text)

Splits text recursively based on a list of characters. By default, the separator characters are kept at the start