Text Classification

Mix.install(
  [
    {:instructor, path: Path.expand("../../", __DIR__)}
  ],
  config: [
    instructor: [
      adapter: Instructor.Adapters.OpenAI,
      openai: [api_key: System.fetch_env!("LB_OPENAI_API_KEY")]
    ]
  ]
)

Motivation

Text classification is a common NLP task with broad application across software, from spam detection to support ticket categorization. Historically, it required training custom, bespoke models on thousands of pre-labeled examples. With LLMs, a lot of this knowledge is already encoded into the model. With proper instructions, and by constraining the output to a known set of classes, you can be up and running with a GPT-based text classifier in no time.

Hell, you can even use Instructor to help generate the training set for your own, more efficient model. But let's not get ahead of ourselves; there's more on that later in the tutorials.

Binary Text Classification

Spam detection is a classic example of binary text classification. It comes down to deciding whether or not an example belongs to the class. This is pretty trivial to implement in Instructor. The schema below also asks the model for a short reason and a confidence score alongside the class.

defmodule SpamPrediction do
  use Ecto.Schema
  use Instructor.Validator

  @doc """
  ## Field Descriptions:
  - class: Whether or not the email is spam.
  - reason: A short, less than 10 word rationalization for the classification.
  - score: A confidence score between 0.0 and 1.0 for the classification.
  """
  @primary_key false
  embedded_schema do
    field(:class, Ecto.Enum, values: [:spam, :not_spam])
    field(:reason, :string)
    field(:score, :float)
  end

  @impl true
  def validate_changeset(changeset) do
    changeset
    |> Ecto.Changeset.validate_number(:score,
      greater_than_or_equal_to: 0.0,
      less_than_or_equal_to: 1.0
    )
  end
end

is_spam? = fn text ->
  Instructor.chat_completion(
    model: "gpt-3.5-turbo",
    response_model: SpamPrediction,
    max_retries: 3,
    messages: [
      %{
        role: "user",
        content: """
        Your purpose is to classify customer support emails as either spam or not.
        This is for a clothing retail business.
        They sell all types of clothing.

        Classify the following email: 
        ```
        #{text}
        ```
        """
      }
    ]
  )
end

is_spam?.("Hello I am a Nigerian prince and I would like to send you money")
{:ok, %SpamPrediction{class: :spam, reason: "Nigerian prince email", score: 0.95}}
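Because the result comes back as a tagged tuple, the caller can pattern match on it to decide what to do next. The following is a minimal sketch and not part of Instructor: the `SpamTriage` module, the 0.8 threshold, and the `:quarantine`, `:review`, and `:deliver` outcomes are all hypothetical names chosen for illustration. It matches on plain maps so that it runs on its own; a `%SpamPrediction{}` struct satisfies the same patterns.

```elixir
defmodule SpamTriage do
  # Hypothetical cutoff; tune it against your own labeled examples.
  @threshold 0.8

  # Spam predicted with high confidence is quarantined outright.
  def decide({:ok, %{class: :spam, score: score}}) when score >= @threshold, do: :quarantine

  # Spam predicted with lower confidence goes to a human reviewer.
  def decide({:ok, %{class: :spam}}), do: :review

  # Anything classified as not spam is delivered.
  def decide({:ok, %{class: :not_spam}}), do: :deliver

  # Errors are handled conservatively by asking for a manual review.
  def decide({:error, _reason}), do: :review
end

SpamTriage.decide({:ok, %{class: :spam, reason: "Nigerian prince email", score: 0.95}})
# => :quarantine
```

Keeping the threshold outside the schema means it can be adjusted without changing what the LLM is asked to produce.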

We don't have to stop at a binary decision. We can easily extend this idea to multiple categories or classes. In this example, let's consider classifying support emails. We want to know whether an email is a general_inquiry, a billing_issue, or a technical_issue, or whether it rightly fits in several of these classes at once. This can be useful if we want to cc specialized support agents when a customer's issues intersect.

We can leverage Ecto.Enum to define a schema that restricts the LLM output to a list of those values. We can also provide a @doc description to guide the LLM's understanding of what each classification ought to represent.

defmodule EmailClassifications do
  use Ecto.Schema

  @doc """
  A classification of a customer support email.

  technical_issue - whether the user is having trouble accessing their account.
  billing_issue - whether the customer is having trouble managing their billing or credit card
  general_inquiry - all other issues
  """
  @primary_key false
  embedded_schema do
    field(:tags, {:array, Ecto.Enum},
      values: [:general_inquiry, :billing_issue, :technical_issue]
    )
  end
end

classify_email = fn text ->
  {:ok, %{tags: result}} =
    Instructor.chat_completion(
      model: "gpt-3.5-turbo",
      response_model: EmailClassifications,
      messages: [
        %{
          role: "user",
          content: "Classify the following text: #{text}"
        }
      ]
    )

  result
end

classify_email.("My account is locked and I can't access my billing info.")
[:technical_issue, :billing_issue]
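The use case mentioned earlier, cc'ing specialized agents when issues intersect, then reduces to a lookup over the returned tags. Here is a rough sketch in plain Elixir. The `SupportRouting` module and the email addresses are made up for illustration and are not part of Instructor.

```elixir
defmodule SupportRouting do
  # Hypothetical mapping from each classification tag to a support mailbox.
  @agents %{
    general_inquiry: "support@example.com",
    billing_issue: "billing@example.com",
    technical_issue: "tech@example.com"
  }

  # Returns the unique mailboxes to cc for a list of tags, in tag order.
  # Map.fetch!/2 raises a KeyError on a tag that has no mailbox, which
  # surfaces a schema/mapping mismatch early instead of silently dropping it.
  def cc_list(tags) do
    tags
    |> Enum.map(&Map.fetch!(@agents, &1))
    |> Enum.uniq()
  end
end

SupportRouting.cc_list([:technical_issue, :billing_issue])
# => ["tech@example.com", "billing@example.com"]
```

Since the schema limits tags to the three enum values, the mapping only needs an entry for each of them.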