Cross-validation with gradient boosting trees

Mix.install([
  {:scholar, "~> 0.2.0"},
  {:explorer, "~> 0.6.1"},
  {:exla, "~> 0.6.0"},
  {:nx, "~> 0.6.0", override: true},
  {:exgboost, "~> 0.3"},
  {:req, "~> 0.3.9"},
  {:kino_vega_lite, "~> 0.1.9"},
  {:kino, "~> 0.10.0"},
  {:kino_explorer, "~> 0.1.7"}
])

Setup

We will use Explorer in this notebook, so let's define aliases for its main modules:

require Explorer.DataFrame, as: DF
require Explorer.Series, as: S

And let's configure EXLA as our default backend (where our tensors are stored) and compiler (which compiles Scholar code) across the notebook and all branched sections:

Nx.global_default_backend(EXLA.Backend)
Nx.Defn.global_default_options(compiler: EXLA)

We are going to work with the Medical Cost Personal Datasets to predict the medical charges billed to each person in the dataset. Let's download it:

data =
  Req.get!(
    "https://gist.githubusercontent.com/meperezcuello/82a9f1c1c473d6585e750ad2e3c05a41/raw/d42d226d0dd64e7f5395a0eec1b9190a10edbc03/Medical_Cost.csv"
  ).body

df = DF.load_csv!(data)

The dataset consists of 7 columns: age, sex, BMI (body mass index), children (number of children), smoker (yes/no), region (NE, NW, SE, SW), and charges, which is what we want to predict. Since the gradient boosting trees we use in our analysis accept only numerical data, we need to further process three columns, sex, smoker, and region, and encode them from categorical to numerical values.

y = DF.select(df, "charges") |> Nx.concatenate()

x =
  df
  |> DF.discard(["charges"])
  |> DF.mutate(
    sex: cast(sex, :category),
    smoker: cast(smoker, :category),
    region: cast(region, :category)
  )
  |> Nx.stack(axis: 1)

{x, y}

Before training our model, we split the data into training and test sets.

{x_train, x_test} = Nx.split(x, 0.8)
{y_train, y_test} = Nx.split(y, 0.8)
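A quick sanity check of the resulting shapes never hurts (the CSV has 1,338 rows, so an 80/20 split should leave roughly 1,070 examples for training and 268 for testing):

# Shapes of the train and test feature matrices after the 80/20 split
{Nx.shape(x_train), Nx.shape(x_test)}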

Training a gradient boosting tree

Gradient boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor. Let's go through a simple regression example, using decision trees as the base predictors; this is called gradient tree boosting, or gradient boosted regression trees (GBRT).
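Before reaching for EXGBoost, here is a minimal sketch of that idea on a made-up toy target (purely illustrative, not part of our dataset): start from a constant prediction and repeatedly add a "weak learner", which here is just the current residual damped by a learning rate. The point is only to watch the error shrink round after round.

# Toy illustration of boosting: each round adds a correction fitted to the
# residuals of the current ensemble, so the RMSE drops at every step.
y_toy = Nx.tensor([3.0, 8.0, 15.0, 4.0])
learning_rate = 0.3

Enum.reduce(1..5, Nx.broadcast(Nx.mean(y_toy), Nx.shape(y_toy)), fn round, pred ->
  residual = Nx.subtract(y_toy, pred)
  new_pred = Nx.add(pred, Nx.multiply(residual, learning_rate))
  diff = Nx.subtract(new_pred, y_toy)
  rmse = diff |> Nx.multiply(diff) |> Nx.mean() |> Nx.sqrt() |> Nx.to_number()
  IO.puts("round #{round}: rmse #{Float.round(rmse, 3)}")
  new_pred
end)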

EXGBoost provides an implementation of gradient boosting trees that accepts a wide range of hyperparameter configurations. For the full list of hyperparameters refer to the EXGBoost docs.

y_pred =
  EXGBoost.train(
    x_train,
    y_train,
    booster: :gbtree,
    tree_method: :auto,
    objective: :reg_squarederror,
    num_boost_rounds: 100,
    evals: [{x_train, y_train, "training"}],
    verbose_eval: true
  )
  |> EXGBoost.predict(x_test)

Having our predictions, we can measure performance by calculating the root mean squared error with respect to the target values.
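For reference, the metric is $\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$, where $y_i$ are the targets and $\hat{y}_i$ the predictions; Scholar gives us the mean squared error, and we take the square root ourselves.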

alias Scholar.Metrics.Regression, as: Metrics

Metrics.mean_square_error(y_test, y_pred)
|> Nx.sqrt()
|> Nx.to_number()

With very little preprocessing we get results similar to those of a linear regression model. However, we can improve our model evaluation process by using cross-validation.

Evaluating with cross-validation

k-fold cross-validation works by splitting the training set into $k$ smaller sets, so that the model is trained using $k - 1$ of the splits (folds) as training data and is validated on the remaining part of the data. When using this technique, the performance measure is the average of the values computed in each iteration.

Scholar provides tools for performing k-fold CV.

alias Scholar.ModelSelection

First, we need to define a folding function that will perform the k-folds, and also a scoring function that will train the model and evaluate performance with each split.

folding_fn = fn x -> ModelSelection.k_fold_split(x, 5) end
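As a quick check, the folding function yields an enumerable of {train, test} tensor pairs, which is exactly what the scoring function below destructures:

# Row counts of the train and test parts of each of the 5 folds
folding_fn.(x_train)
|> Enum.map(fn {train, test} -> {Nx.axis_size(train, 0), Nx.axis_size(test, 0)} end)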

scoring_fn = fn x, y ->
  {x_train, x_test} = x
  {y_train, y_test} = y

  y_pred =
    EXGBoost.train(
      x_train,
      y_train,
      booster: :gbtree,
      tree_method: :auto,
      objective: :reg_squarederror,
      num_boost_rounds: 100,
      evals: [{x_train, y_train, "training"}],
      verbose_eval: true
    )
    |> EXGBoost.predict(x_test)

  Metrics.mean_square_error(y_test, y_pred)
  |> Nx.sqrt()
end

Now let's run the cross-validation function and put the scores tensor in a series.

cv_score =
  ModelSelection.cross_validate(
    x_train,
    y_train,
    folding_fn,
    scoring_fn
  )
  |> Nx.squeeze()
  |> S.from_tensor()

Taking the average of these scores gives the performance reported by cross-validation.

S.mean(cv_score)
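The spread across folds is also worth a look; a large standard deviation would mean the score depends heavily on which rows land in the validation fold.

S.standard_deviation(cv_score)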

Finding the right configuration of hyperparameters is an important part of model selection. One could try different combinations of hyperparameter values manually, but this quickly gets tedious and time-consuming. Instead, we can use grid search, an iterative process that searches for an optimal configuration of hyperparameter values for a given model.

First, we need to provide a "grid" of hyperparameter values, so that the algorithm can train and evaluate our model with all possible combinations.

grid = [
  booster: [:gbtree],
  objective: [:reg_squarederror],
  evals: [[{x_train, y_train, "training"}]],
  verbose_eval: [true],
  tree_method: [:approx, :exact],
  max_depth: [2, 3, 4, 5, 6],
  num_boost_rounds: [20, 50, 90],
  subsample: [0.25, 0.5, 0.75, 1.0]
]
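Note that this grid already amounts to $2 \times 5 \times 3 \times 4 = 120$ hyperparameter combinations, and since each combination is evaluated with our 5-fold folding function, grid search will train and score the model 600 times.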

We also need to adapt our scoring function in order to use the hyperparameter values for each grid search iteration.

gs_scoring_fn = fn x, y, hyperparams ->
  {x_train, x_test} = x
  {y_train, y_test} = y

  y_pred =
    x_train
    |> EXGBoost.train(y_train, hyperparams)
    |> EXGBoost.predict(x_test)

  Metrics.mean_square_error(y_test, y_pred)
  |> Nx.sqrt()
end

Let's run the grid search and see the results. Remember that the more hyperparameter values you add to the grid, the longer the algorithm will take to finish.

gs_scores =
  ModelSelection.grid_search(
    x_train,
    y_train,
    folding_fn,
    gs_scoring_fn,
    grid
  )

The output is a list of maps, each corresponding to one iteration of the grid search. Every iteration yields a score calculated by our scoring function. Let's find the set of hyperparameters that minimizes the score (lower RMSE is better).
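Before doing so, a peek at the first entry shows the shape of each map, with its :hyperparameters and :score keys:

hd(gs_scores)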

best_config =
  Enum.min_by(gs_scores, fn %{score: score} ->
    score
    |> Nx.squeeze()
    |> Nx.to_number()
  end)

Finally, we train and evaluate a model using the best hyperparameter configuration found by grid search.

%{hyperparameters: opts} = best_config

model = EXGBoost.train(x_train, y_train, opts)
y_pred = EXGBoost.predict(model, x_test)

rmse =
  Metrics.mean_square_error(y_test, y_pred)
  |> Nx.sqrt()

"RMSE: #{Nx.to_number(rmse)}"