Iris Classification with Gradient Boosting

Mix.install([
  {:exgboost, "~> 0.5"},
  {:nx, "~> 0.5"},
  {:scidata, "~> 0.1"},
  {:scholar, "~> 0.1"}
])

Data

We'll be working with the Iris flower dataset, which consists of measurements of 3 different species of the Iris flower. Overall we have 150 examples, each with 4 features and a numeric label mapping to 1 of the 3 species. We can download this dataset using Scidata:

{x, y} = Scidata.Iris.download()
:ok
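
Scidata returns plain Elixir lists here: x should be a list of 150 four-element measurement lists and y a list of 150 integer labels. A quick, optional sanity check (the expected values in the comments are assumptions based on the dataset description above):

# x should hold 150 rows of 4 measurements; y should contain the labels 0, 1, and 2
{length(x), length(hd(x)), Enum.uniq(y) |> Enum.sort()}
# => {150, 4, [0, 1, 2]}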

Scidata doesn't provide train-test splits for Iris. Instead, we'll need to shuffle the original dataset and split manually. We'll save 20% of the dataset for testing:

data = Enum.zip(x, y) |> Enum.shuffle()
{train, test} = Enum.split(data, ceil(length(data) * 0.8))
:ok
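
Note that Enum.shuffle/1 draws from Erlang's default random number generator, so the split will differ on every run. If you want a reproducible split, you can seed the generator before shuffling (the seed values here are arbitrary):

# Optional: seed the Erlang RNG so Enum.shuffle/1 is deterministic across runs
:rand.seed(:exsss, {42, 42, 42})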

EXGBoost requires inputs to be Nx tensors. The conversion for this example is straightforward: we unzip the features and labels, then wrap each in a call to Nx.tensor/1:

{x_train, y_train} = Enum.unzip(train)
{x_test, y_test} = Enum.unzip(test)

x_train = Nx.tensor(x_train)
y_train = Nx.tensor(y_train)

x_test = Nx.tensor(x_test)
y_test = Nx.tensor(y_test)

x_train
y_train
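
Before training, it's worth confirming the shapes look right. With the 80/20 split above we expect 120 training examples and 30 test examples, each with 4 features:

# Expect {{120, 4}, {120}, {30, 4}, {30}} given the 80/20 split of 150 examples
{Nx.shape(x_train), Nx.shape(y_train), Nx.shape(x_test), Nx.shape(y_test)}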

We now have both train and test sets consisting of features and labels. Time to train a booster!

Training

The simplest way to train a booster is with the top-level EXGBoost.train/3 function. It expects input features and labels, along with optional training configuration parameters.

This example is a multi-class classification problem with 3 output classes. We need to configure EXGBoost to train this booster as a multi-class classifier by specifying a different training objective. We also need to specify the number of output classes:

booster =
  EXGBoost.train(x_train, y_train,
    num_class: 3,
    objective: :multi_softprob,
    num_boost_rounds: 10,
    evals: [{x_train, y_train, "training"}]
  )

And that's it! Now we can test our booster.
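
As an aside, XGBoost also provides a multi:softmax objective, which returns predicted class indices directly rather than per-class probabilities. Assuming EXGBoost exposes it as :multi_softmax (mirroring how :multi_softprob maps to multi:softprob), a sketch would look like:

# Hypothetical variant: multi:softmax predicts class indices directly,
# so no argmax over probabilities is needed afterwards
booster_softmax =
  EXGBoost.train(x_train, y_train,
    num_class: 3,
    objective: :multi_softmax,
    num_boost_rounds: 10
  )

We stick with :multi_softprob here because the per-class probabilities are useful to inspect, and taking an argmax over them (shown below) recovers the same discrete predictions.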

Testing

To get predictions from a trained booster, we can just call EXGBoost.predict/2. You'll notice that for this problem the booster outputs a tensor of shape {30, 3}: one row per test example, where the second dimension holds the predicted probability of each class. We can obtain a discrete prediction for use in our accuracy measurement by computing the argmax along the last dimension:

preds = EXGBoost.predict(booster, x_test) |> Nx.argmax(axis: -1)
Scholar.Metrics.Classification.accuracy(y_test, preds)
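
Accuracy condenses performance into a single number. To see which species are confused with which, Scholar can also produce a confusion matrix; the sketch below assumes Scholar's confusion_matrix/3 with its required :num_classes option:

# Rows are true classes, columns are predicted classes; off-diagonal entries
# count misclassified examples
Scholar.Metrics.Classification.confusion_matrix(y_test, preds, num_classes: 3)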

And that's it! We've successfully trained a booster on the Iris dataset with EXGBoost.