Converting Text into Explorer DataFrames

Mix.install(
  [
    {:instructor, path: Path.expand("../../", __DIR__)},
    {:explorer, "~> 0.7.2"},
    {:kino, "~> 0.12.0"},
    {:kino_explorer, "~> 0.1.13"}
  ],
  config: [
    instructor: [
      adapter: Instructor.Adapters.OpenAI,
      openai: [api_key: System.fetch_env!("LB_OPENAI_API_KEY")]
    ]
  ]
)

Motivation

Sometimes we come across text that can naturally be extracted into structured data, even though we don't know in advance how many tables there are or what schemas the data represents. Using Instructor, we can let the LLM determine the tables and schemas for us, pulling the results into data frames that we can then use for further analysis.

The Schema

Notice that in this example we want the resulting data structure to contain data frames. That's not a default Ecto type, so we're going to have to do things a little differently here.

Luckily, Ecto allows us to define our own custom types, in which we can override how values are cast. And Instructor provides a behaviour that tells us how to represent the type in JSONSchema so the LLM can understand it.

In our case, we're going to create a type for a table of data which, as far as the LLM is concerned, is represented as a CSV string. When the value is cast by Ecto, our custom type will parse that CSV string into an Explorer.DataFrame.

defmodule Ecto.CSVDataFrame do
  use Ecto.Type
  use Instructor.EctoType

  def type, do: :string

  def cast(csv_str) when is_binary(csv_str) do
    df = Explorer.DataFrame.load_csv!(csv_str)
    {:ok, df}
  end

  def cast(%Explorer.DataFrame{} = df), do: {:ok, df}
  def cast(_), do: :error

  def to_json_schema(),
    do: %{
      type: "string",
      description: "A CSV representation of a data table"
    }

  def dump(x), do: {:ok, x}
  def load(x), do: {:ok, x}
end
{:module, Ecto.CSVDataFrame, <<70, 79, 82, 49, 0, 0, 12, ...>>, {:load, 1}}
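
Before wiring the type into a schema, we can sanity-check it by casting a CSV string directly. This snippet is not part of the original notebook; the column names and rows are made up purely for illustration.

# Quick check that cast/1 parses a CSV string into an Explorer.DataFrame.
{:ok, df} = Ecto.CSVDataFrame.cast("name,age\nJohn,25\nMike,30")

# The cast value is a data frame, so its columns are available:
Explorer.DataFrame.names(df)
#=> ["name", "age"]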

Now that we have our data type, we can use it in our embedded Ecto schema, just as we would in any other example. If everything works correctly, we should get some data frames out.

defmodule Database do
  use Ecto.Schema
  use Instructor.Validator

  @doc """
  The extracted database will contain one or more tables with data as CSV formatted with ',' delimiters
  """
  @primary_key false
  embedded_schema do
    embeds_many :tables, DataFrame, primary_key: false do
      field(:name, :string)
      field(:data, Ecto.CSVDataFrame)
    end
  end
end
{:module, Database, <<70, 79, 82, 49, 0, 0, 16, ...>>,
 [__schema__: 1, __schema__: 1, __schema__: 1, __schema__: 1, __schema__: 2, __schema__: 2, ...]}
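
As a quick sanity check (not in the original notebook), we can build a Database struct by hand to see the shape Instructor will hand back. The table name and CSV contents below are invented for illustration.

# Construct the same shape manually: a Database with one named table whose
# data is an Explorer.DataFrame parsed from a CSV string.
df = Explorer.DataFrame.load_csv!("name,age,city\nJohn,25,New York")

%Database{tables: [%Database.DataFrame{name: "people", data: df}]}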

Perfect. And now let's test it out on some example text.

extract_df = fn text ->
  Instructor.chat_completion(
    model: "gpt-3.5-turbo",
    response_model: Database,
    messages: [
      %{
        role: "system",
        content: """
        Map this data into one or more dataframes and correctly define the columns and data
        """
      },
      %{
        role: "user",
        content: "#{text}"
      }
    ]
  )
end

{:ok, db} =
  extract_df.("""
    My name is John and I am 25 years old. I live in
    New York and I like to play basketball. His name is
    Mike and he is 30 years old. He lives in San Francisco
    and he likes to play baseball. Sarah is 20 years old
    and she lives in Los Angeles. She likes to play tennis.
    Her name is Mary and she is 35 years old.
    She lives in Chicago.
    
    On one team 'Tigers' the captain is John and there are 12 players.
    On the other team 'Lions' the captain is Mike and there are 10 players.
  """)

Kino.Layout.tabs(
  for table <- db.tables do
    {table.name, table.data}
  end
)
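
The extracted tables are ordinary Explorer data frames, so the usual transformations apply. As a hypothetical follow-up (assuming the LLM produced a table with an age column, which isn't guaranteed), we could filter one of them:

require Explorer.DataFrame, as: DF

# Take the first extracted table and filter it. The `age` column is an
# assumption about what the LLM returned; adjust to the actual columns.
[%{data: df} | _] = db.tables
DF.filter(df, age > 21)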