LangChain.TextSplitter.RecursiveCharacterTextSplitter (LangChain v0.3.3)

The RecursiveCharacterTextSplitter is the recommended splitter for generic text. It splits the text based on a list of separator characters, trying each separator in order until the text is split into small enough chunks. The default list is ["\n\n", "\n", " ", ""].

The purpose is to prepare text for processing by large language models with limited context windows, or where a shorter context window is desired.

The main characteristic of this splitter is that it tries to keep paragraphs, sentences or code functions together as long as possible.

LangChain.TextSplitter.LanguageSeparators provides separator lists for some programming and markup languages. To use these separators, it's recommended to set the is_separator_regex option to true.

How it works:

  • It splits the text at the first separator in the given separators list, using LangChain.TextSplitter.CharacterTextSplitter to do so.
  • For each of the resulting splits, it calls itself recursively using the tail of the separators list.
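
The two steps above can be sketched in plain Elixir. This is only an illustration of the recursive strategy, not the library's implementation: the module name RecursiveSplitSketch is hypothetical, separators are discarded rather than kept, small pieces are not merged back together, and recursing only into pieces that exceed chunk_size is an assumption of this sketch.

```elixir
defmodule RecursiveSplitSketch do
  # Hypothetical helper, not part of LangChain.

  # No separators left: return the text as a single chunk, whatever its size.
  def split(text, [], _chunk_size), do: [text]

  # Split on the head separator, then recurse with the tail of the list
  # for every piece that is still longer than chunk_size.
  def split(text, [separator | rest], chunk_size) do
    text
    |> String.split(separator, trim: true)
    |> Enum.flat_map(fn piece ->
      if String.length(piece) <= chunk_size do
        [piece]
      else
        split(piece, rest, chunk_size)
      end
    end)
  end
end

chunks = RecursiveSplitSketch.split("Apple,banana,orange and tomato.", [",", " "], 10)
IO.inspect(chunks)
# prints ["Apple", "banana", "orange", "and", "tomato."]
```

Only "orange and tomato." exceeds the 10-character limit after the first pass, so it is the only piece that is split again with the next separator.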

A RecursiveCharacterTextSplitter is defined using a schema.

  • separators - List of strings used to split a given text. The default list is ["\n\n", "\n", " ", ""].
  • chunk_size - Integer number of characters that a chunk should have.
  • chunk_overlap - Integer number of characters that two consecutive chunks should share.
  • keep_separator - Either :discard_separator, :start or :end. If nil, the separator is discarded from the output chunks. :start and :end keep the separator at the start or end of the output chunks. Defaults to :start.
  • is_separator_regex - Boolean. If true, the separator string is not escaped and is used as a regular expression. Defaults to false.
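
The difference that is_separator_regex makes can be illustrated with the standard library alone. The sketch below assumes that an escaped separator behaves like one passed through Regex.escape/1. That is an assumption about the splitter's internals, made only to show why regex-style separators need the option set to true.

```elixir
# A separator written as a regular expression: a comma with optional
# whitespace on either side.
separator = "\\s*,\\s*"
text = "x , y"

# Escaped (the assumed behaviour for is_separator_regex: false): the
# separator is matched literally, so nothing in the text matches.
literal = Regex.compile!(Regex.escape(separator))
literal_parts = String.split(text, literal)

# Not escaped (the assumed behaviour for is_separator_regex: true): the
# separator is interpreted as a pattern and matches " , ".
pattern = Regex.compile!(separator)
regex_parts = String.split(text, pattern)

IO.inspect(literal_parts)
# prints ["x , y"]
IO.inspect(regex_parts)
# prints ["x", "y"]
```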

Types

@type t() :: %LangChain.TextSplitter.RecursiveCharacterTextSplitter{
  chunk_overlap: term(),
  chunk_size: term(),
  is_separator_regex: term(),
  keep_separator: term(),
  separators: term()
}

Functions

Build a new RecursiveCharacterTextSplitter and return an :ok/:error tuple with the result.

Build a new RecursiveCharacterTextSplitter and return it or raise an error if invalid.

split_text(text_splitter, text)

Splits text recursively based on a list of characters. By default, the separator characters are kept at the start of each chunk:

iex> split_tags = [",", "."]
iex> base_params = %{chunk_size: 10, chunk_overlap: 0, separators: split_tags}
iex> query = "Apple,banana,orange and tomato."
iex> splitter = RecursiveCharacterTextSplitter.new!(base_params)
iex> splitter |> RecursiveCharacterTextSplitter.split_text(query)
["Apple", ",banana", ",orange and tomato", "."]

We can keep the separator at the end of each chunk by providing the keep_separator: :end option:

iex> split_tags = [",", "."]
iex> base_params = %{chunk_size: 10, chunk_overlap: 0, separators: split_tags, keep_separator: :end}
iex> query = "Apple,banana,orange and tomato."
iex> splitter = RecursiveCharacterTextSplitter.new!(base_params)
iex> splitter |> RecursiveCharacterTextSplitter.split_text(query)
["Apple,", "banana,", "orange and tomato."]

See LangChain.TextSplitter.CharacterTextSplitter for the usage of the different options.

LanguageSeparators provides separators for multiple programming and markup languages. To use these separators, it's recommended to set the is_separator_regex option to true. To split Python code:

iex> python_code = "
...>def hello_world():
...>  print('Hello, World')
...>
...>            
...># Call the function
...>hello_world()"
iex> splitter =
...>  RecursiveCharacterTextSplitter.new!(%{
...>    separators: LanguageSeparators.python(),
...>    keep_separator: :start,
...>    is_separator_regex: true,
...>    chunk_size: 16,
...>    chunk_overlap: 0})
iex> splitter |> RecursiveCharacterTextSplitter.split_text(python_code)
["def", "hello_world():", "print('Hello,", "World')", "# Call the", "function", "hello_world()"]