EXGBoost (EXGBoost v0.5.1)
Elixir bindings to the XGBoost C API using Native Implemented Functions (NIFs). EXGBoost provides an implementation of XGBoost that works with Nx tensors.
Xtreme Gradient Boosting (XGBoost) is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems in a fast and accurate way. The same code runs on major distributed environments (Hadoop, SGE, MPI) and can solve problems beyond billions of examples.
Installation
def deps do
[
{:exgboost, "~> 0.5"}
]
end
API Data Structures
EXGBoost's top-level EXGBoost API works directly and only with Nx tensors. However, under the hood, it leverages the structs defined in the EXGBoost.Booster and EXGBoost.DMatrix modules. These structs are wrappers around the structs defined in the XGBoost library. The two main structs are DMatrix, which represents the data matrix used to train the model, and Booster, which represents the model.
The top-level EXGBoost API does not expose these structs directly. Instead, they are exposed through the EXGBoost.Booster and EXGBoost.DMatrix modules. Power users might wish to use these modules directly. For example, if you wish to use the Booster struct directly, you can use the EXGBoost.Booster.booster/2 function to create a Booster struct from a DMatrix and a keyword list of options, as sketched below. See the EXGBoost.Booster and EXGBoost.DMatrix modules' source for more implementation details.
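A minimal sketch of that power-user path. Note that this page does not document how a DMatrix is constructed, so the constructor name below is hypothetical; only EXGBoost.Booster.booster/2 is named above:
# Hypothetical: EXGBoost.DMatrix.from_tensor/2 is an assumed constructor name,
# shown for illustration only. Check EXGBoost.DMatrix's source for the real API.
dmat = EXGBoost.DMatrix.from_tensor(x, y)
# booster/2 takes a DMatrix and a keyword list of options.
booster = EXGBoost.Booster.booster(dmat, [])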
Basic Usage
# Generate random training data
key = Nx.Random.key(42)
{x, key} = Nx.Random.normal(key, 0, 1, shape: {10, 5})
{y, _key} = Nx.Random.normal(key, 0, 1, shape: {10})
# Train a model and run prediction
model = EXGBoost.train(x, y)
EXGBoost.predict(model, x)
Training
EXGBoost is designed to feel familiar to users of the Python XGBoost library. EXGBoost.train/2 is the primary entry point for training a model. It accepts an Nx tensor for the features and an Nx tensor for the labels. EXGBoost.train/2 returns a trained Booster struct that can be used for prediction. EXGBoost.train/2 also accepts a keyword list of options that can be used to configure the training process. See the XGBoost documentation for the full list of options.
EXGBoost.train/2 uses the EXGBoost.Training.train/1 function to perform the actual training. EXGBoost.Training.train/1 can be used directly if you wish to work directly with the DMatrix and Booster structs.
One of the main features of EXGBoost.train/2 is the ability for the end user to provide a custom training function that will be used to train the model. This is done by passing a function to the :obj option. The function must accept a DMatrix and a Booster and return a Booster. The function will be called at each iteration of the training process, which allows the user to implement custom training logic, such as a custom loss function or a custom metric function. See the XGBoost documentation for more information on custom loss functions and custom metric functions.
Another feature of EXGBoost.train/2 is the ability to provide a validation set for early stopping. This is done by passing a list of 3-tuples to the :evals option. Each 3-tuple should contain an Nx tensor for the features, an Nx tensor for the labels, and a string label for the validation set name. The validation set will be used to calculate the validation error at each iteration of the training process. If the validation error does not improve for :early_stopping_rounds iterations, the training process will stop. See the XGBoost documentation for a more detailed explanation of early stopping.
Early stopping is achieved through the use of callbacks. EXGBoost.train/2 accepts a list of callbacks that will be called at each iteration of the training process. The callbacks can be used to implement custom logic, such as printing the validation error at each iteration of the training process or providing a custom setup function for training. See the EXGBoost.Training.Callback module for more information on callbacks.
Please note that callbacks are called in the order in which they are provided. If you provide multiple callbacks that modify the same parameter, the last callback takes precedence over the previous ones. For example, if you provide a callback that sets the :early_stopping_rounds parameter to 10 and then provide a callback that sets it to 20, the :early_stopping_rounds parameter will be set to 20. A minimal callback sketch follows below.
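For illustration, here is a sketch using the keyword-list callback form described in the :callbacks option of train later on this page; the printed message is illustrative only:
callbacks = [
  after_iteration: [
    # Receives the booster and the iteration index; must return the booster.
    fn booster, iteration ->
      IO.puts("finished boosting iteration #{iteration}")
      booster
    end
  ]
]

EXGBoost.train(x, y, callbacks: callbacks)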
You are also able to pass parameters to be applied to the Booster model using the :params option. These parameters will be applied to the Booster model before training begins. This allows you to set parameters that are not available as options to EXGBoost.train/2. See the XGBoost documentation for a full list of parameters.
EXGBoost.train(x, y,
  evals: [{x_test, y_test, "test"}],
  learning_rates: fn i -> i / 10 end,
  num_boost_rounds: 10,
  early_stopping_rounds: 3,
  max_depth: 3,
  eval_metric: [:rmse, :logloss]
)
Prediction
EXGBoost.predict/2 is the primary entry point for making predictions with a trained model. It accepts a Booster struct (which is the output of EXGBoost.train/2) and returns an Nx tensor containing the predictions. EXGBoost.predict/2 also accepts a keyword list of options that can be used to configure the prediction process.
preds = EXGBoost.train(x, y) |> EXGBoost.predict(x)
Serialization
A Booster can be serialized to a file using the EXGBoost.write_* functions and loaded from a file using the EXGBoost.read_* functions. The file format can be specified using the :format option, which can be either :json or :ubj. The default is :json. If the file already exists, it will NOT be overwritten by default. Boosters can be serialized either to a file or to a binary string.
Boosters can be serialized in three different ways: configuration only, configuration and model, or model only. The write functions serialize to a file, while the dump functions serialize the Booster to a binary string. Functions named with weights serialize the model's trained parameters only. This is best used when the model is already trained and only inferences/predictions are going to be performed. Functions named with config serialize the configuration only. Functions that specify model serialize both the model parameters and the configuration.
Output Formats
read
/write
- File.load
/dump
- Binary buffer.
Output Contents
- config - Save the configuration only.
- weights - Save the model parameters only. Use this when you want to save the model to a format that can be ingested by other XGBoost APIs.
- model - Save both the model parameters and the configuration.
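Putting the naming scheme together, a hedged round-trip sketch (exact signatures and return values are assumptions here; see the function summaries below):
booster = EXGBoost.train(x, y)

# File round trip: "model" saves both parameters and configuration.
EXGBoost.write_model(booster, "model.json", format: :json, overwrite: true)
booster = EXGBoost.read_model("model.json")

# Binary-buffer round trip via dump/load.
buffer = EXGBoost.dump_model(booster, format: :ubj)
booster = EXGBoost.load_model(buffer)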
Plotting
EXGBoost.plot_tree/2 is the primary entry point for plotting a tree from a trained model. It accepts an EXGBoost.Booster struct (which is the output of EXGBoost.train/2) and returns a VegaLite spec that can be rendered in a notebook or saved to a file. EXGBoost.plot_tree/2 also accepts a keyword list of options that can be used to configure the plotting process. See EXGBoost.Plotting for more detail on plotting.
You can see the available styles by running EXGBoost.Plotting.get_styles() or refer to the EXGBoost.Plotting.Styles documentation for a gallery of the styles.
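For example, using only the plot_tree/2 options documented at the end of this page (a sketch):
booster = EXGBoost.train(x, y)

# Without :path, a VegaLite spec is returned for rendering in a notebook.
spec = EXGBoost.plot_tree(booster)

# With :path, the graphic is written to disk in the requested format.
EXGBoost.plot_tree(booster, path: "tree.png", format: :png)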
Kino & Livebook Integration
EXGBoost integrates with Kino and Livebook to provide a rich interactive experience for data scientists.
EXGBoost implements the Kino.Render protocol for EXGBoost.Booster structs. This allows you to render a Booster in a Livebook notebook. Under the hood, EXGBoost uses Vega-Lite and Kino Vega-Lite to render the Booster.
See the Plotting in EXGBoost notebook for an example of how to use EXGBoost with Kino and Livebook.
Examples
See the example notebooks in the left sidebar (under the Pages tab) for more examples and tutorials on how to use EXGBoost.
Requirements
Precompiled Distribution
We currently offer the following precompiled packages for EXGBoost:
%{
"exgboost-nif-2.16-aarch64-apple-darwin-0.5.0.tar.gz" => "sha256:c659d086d07e9c209bdffbbf982951c6109b2097c4d3008ef9af59c3050663d2",
"exgboost-nif-2.16-x86_64-apple-darwin-0.5.0.tar.gz" => "sha256:05256238700456c57e279558765b54b5b5ed4147878c6861cd4c937472abbe52",
"exgboost-nif-2.16-x86_64-linux-gnu-0.5.0.tar.gz" => "sha256:ad3ba6aba8c3c2821dce4afc05b66a5e529764e0cea092c5a90e826446653d99",
"exgboost-nif-2.17-aarch64-apple-darwin-0.5.0.tar.gz" => "sha256:745e7e970316b569a10d76ceb711b9189360b3bf9ab5ee6133747f4355f45483",
"exgboost-nif-2.17-x86_64-apple-darwin-0.5.0.tar.gz" => "sha256:73948d6f2ef298e3ca3dceeca5d8a36a2d88d842827e1168c64589e4931af8d7",
"exgboost-nif-2.17-x86_64-linux-gnu-0.5.0.tar.gz" => "sha256:a0b5ff0b074a9726c69d632b2dc0214fc7b66dccb4f5879e01255eeb7b9d4282",
}
The correct package will be downloaded and installed (if supported) when you install the dependency through Mix (as shown above); otherwise, you will need to compile manually.
NOTE: On macOS, you still need to install libomp even when using the precompiled libraries:
brew install libomp
Dev Requirements
If you are contributing to the library and need to compile locally or choose to not use the precompiled libraries, you will need the following:
- Make
- CMake
- If on macOS: brew install libomp
When you run mix compile, the xgboost shared library will be compiled, so the first time you compile your project will take longer than subsequent compilations.
You also need to set CC_PRECOMPILER_PRECOMPILE_ONLY_LOCAL=true before the first local compilation; otherwise, you will get an error related to a missing checksum file.
Known Limitations
- The XGBoost C API uses C function pointers to implement streaming data types. The Python ctypes library is able to pass function pointers to the C API which are then executed by XGBoost. Erlang/Elixir NIFs do not have this capability, and as such, streaming data types are not supported in EXGBoost.
- Currently, EXGBoost only works with tensors from the Nx.BinaryBackend. If you are using any other backend, you will need to perform an Nx.backend_transfer or Nx.backend_copy before training an EXGBoost.Booster. This is because Nx tensors are JSON-encoded and serialized before being sent to XGBoost, and the binary backend is required for proper JSON-encoding of the underlying tensor. A minimal transfer sketch follows this list.
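A minimal sketch of the transfer described in the second limitation:
# Move tensors from another backend (e.g. EXLA) to the binary backend before
# training. Nx.backend_copy/2 behaves the same but leaves the original in place.
x = Nx.backend_transfer(x, Nx.BinaryBackend)
y = Nx.backend_transfer(y, Nx.BinaryBackend)
model = EXGBoost.train(x, y)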
Summary
System / Native Config
Get current values of the global configuration.
Set global configuration.
Check the build information of the xgboost library.
Check the version of the xgboost library.
Training & Prediction
Run prediction in-place. Unlike EXGBoost.predict/2, in-place prediction does not cache the prediction result.
Predict with a booster model against a tensor.
Train a new booster model given a data tensor and a label tensor.
Serialization
Dump a model config to a buffer as a JSON-encoded string.
Dump a model to a binary encoded in the desired format.
Dump a model's trained parameters to a buffer as a JSON-encoded binary.
Create a new Booster from a config buffer. The config buffer must be from the output of dump_config/2.
Read a model from a buffer and return the Booster.
Read a model's trained parameters from a buffer and return the Booster.
Create a new Booster from a config file. The config file must be from the output of write_config/2.
Read a model from a file and return the Booster.
Read a model's trained parameters from a file and return the Booster.
Write a model config to a file as a JSON-encoded string.
Write a model to a file.
Write a model's trained parameters to a file.
Plotting
Plot a tree from a Booster model and save it to a file.
System / Native Config
@spec get_config() :: map()
Get current values of the global configuration.
Global configuration consists of a collection of parameters that can be applied in the global scope. See Global Parameters in EXGBoost.Parameters for the full list of parameters supported in the global configuration.
Set global configuration.
Global configuration consists of a collection of parameters that can be applied in the global scope. See Global Parameters in EXGBoost.Parameters for the full list of parameters supported in the global configuration.
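A short sketch; :verbosity is assumed here as an example global parameter (see EXGBoost.Parameters for the real list):
# Read the current global configuration as a map.
config = EXGBoost.get_config()

# Set a global parameter (example key assumed for illustration).
EXGBoost.set_config(verbosity: 1)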
@spec xgboost_build_info() :: map()
Check the build information of the xgboost library.
Returns a map containing information about the build.
Check the version of the xgboost library.
Returns a 3-tuple in the form of {major, minor, patch}.
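For example (assuming the function is named EXGBoost.xgboost_version/0, matching xgboost_build_info/0 above):
# Pattern-match the version of the bundled xgboost library.
{major, minor, patch} = EXGBoost.xgboost_version()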
Training & Prediction
Run prediction in-place. Unlike EXGBoost.predict/2, in-place prediction does not cache the prediction result.
Options
- :base_margin - Base margin used for boosting from an existing model.
- :missing - Value used for missing values. If nil, defaults to Nx.Constants.nan().
- :predict_type - One of:
  - "value" - Output model prediction values.
  - "margin" - Output the raw untransformed margin value.
- :output_margin - Whether to output the raw untransformed margin value.
- :iteration_range - See EXGBoost.predict/2 for details.
- :strict_shape - See EXGBoost.predict/2 for details.
Returns an Nx.Tensor containing the predictions.
Predict with a booster model against a tensor.
The full model will be used unless iteration_range is specified, meaning the user has to either slice the model or use the best_iteration attribute to get a prediction from the best model returned from early stopping.
Options
- :output_margin - Whether to output the raw untransformed margin value.
- :pred_leaf - When this option is enabled, the output will be an Nx.Tensor of shape {nsamples, ntrees}, where each row indicates the predicted leaf index of each sample in each tree. Note that the leaf index of a tree is unique per tree, but not globally, so you may find leaf 1 in both tree 1 and tree 0.
- :pred_contribs - When this is true, the output will be a matrix of size {nsample, nfeats + 1} with each record indicating the feature contributions (SHAP values) for that prediction. The sum of all feature contributions is equal to the raw untransformed margin value of the prediction. Note the final column is the bias term.
- :approx_contribs - Approximate the contributions of each feature. Used when pred_contribs or pred_interactions is set to true. Changing the default of this parameter (false) is not recommended.
- :pred_interactions - When this is true, the output will be an Nx.Tensor of shape {nsamples, nfeats + 1} indicating the SHAP interaction values for each pair of features. The sum of each row (or column) of the interaction values equals the corresponding SHAP value (from pred_contribs), and the sum of the entire matrix equals the raw untransformed margin value of the prediction. Note the last row and column correspond to the bias term.
- :validate_features - When this is true, validate that the Booster's and the data's feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
- :training - Determines whether the prediction value is used for training. This can affect the dart booster, which performs dropouts during training iterations but uses all trees for inference. If you want to obtain a result with dropouts, set this option to true. The option is also set to true when obtaining predictions for a custom objective function.
- :iteration_range - Specifies which layer of trees is used in prediction. For example, if a random forest is trained with 100 rounds, specifying an iteration range of [10, 20) (a half-open set) means only the forests built during those rounds are used in this prediction.
- :strict_shape - When set to true, the output shape is invariant to whether classification is used. For both value and margin prediction, the output shape is {n_samples, n_groups}, with n_groups == 1 when multi-class is not used. Defaults to false, in which case the output shape can be {n_samples} if multi-class is not used.
Returns an Nx.Tensor containing the predictions.
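A hedged example combining these options (the {first, last} tuple form for :iteration_range is an assumption based on the option description):
# SHAP feature contributions: shape {nsample, nfeats + 1}; last column is bias.
contribs = EXGBoost.predict(booster, x, pred_contribs: true)

# Use only the trees built during rounds [0, 10).
preds = EXGBoost.predict(booster, x, iteration_range: {0, 10}, strict_shape: true)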
@spec train(Nx.Tensor.t(), Nx.Tensor.t(), Keyword.t()) :: EXGBoost.Booster.t()
Train a new booster model given a data tensor and a label tensor.
Options
- :obj - Specify the learning task and the corresponding learning objective. This function must accept two arguments: preds and dtrain, where preds is an array of predicted real-valued scores and dtrain is the training data set. The function returns the gradient and the second-order gradient.
- :num_boost_rounds - Number of boosting iterations.
- :evals - A list of 3-tuples {x, y, label} to use as a validation set for early stopping.
- :early_stopping_rounds - Activates early stopping. The target metric needs to increase/decrease (depending on the metric) at least every early_stopping_rounds round(s) for training to continue. Requires at least one item in :evals. If there is more than one, the last eval set will be used. If there is more than one metric in the eval_metric parameter given in the booster's params, the last metric will be used for early stopping. If early stopping occurs, the model will have two additional fields: bst.best_score and bst.best_iteration. If these values are nil, then no early stopping occurred.
- :verbose_eval - Requires at least one item in :evals. If verbose_eval is true, the evaluation metric on the validation set is printed at each boosting stage. If verbose_eval is an integer, the evaluation metric on the validation set is printed at every verbose_eval boosting stages. The last boosting stage, or the boosting stage found by using early_stopping_rounds, is also printed. Example: with verbose_eval=4 and at least one item in :evals, an evaluation metric is printed every 4 boosting stages, instead of every boosting stage.
- :learning_rates - Either an arity-1 function that accepts an integer epoch and returns the corresponding learning rate, or a list with the same length as num_boost_rounds (see the sketch after this list).
- :callbacks - A list of EXGBoost.Training.Callback that are called during a given event. It is possible to use predefined callbacks from the EXGBoost.Training.Callback module. Callbacks should be in the form of a keyword list where the only valid keys are :before_training, :after_training, :before_iteration, and :after_iteration. The value of each key should be a list of functions that accept a booster and an iteration and return a booster. Each function will be called at the appropriate time with the booster and the iteration as arguments and should return the booster. If the function returns a booster with a different memory address, the original booster will be replaced with the new booster. If the function returns the original booster, the original booster will be used. If the function returns a booster with the same memory address but different contents, the behavior is undefined.
- opts - Refer to EXGBoost.Parameters for the full list of options.
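A sketch of the :learning_rates and :early_stopping_rounds options together; the decay schedule is illustrative:
booster =
  EXGBoost.train(x, y,
    num_boost_rounds: 20,
    evals: [{x_test, y_test, "validation"}],
    early_stopping_rounds: 3,
    # Arity-1 schedule: epoch index in, learning rate out.
    learning_rates: fn epoch -> 0.3 * :math.pow(0.99, epoch) end
  )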
Serialization
Dump a model config to a buffer as a JSON-encoded string.
Options
- :format - The format to serialize to. Can be either :json or :ubj. The default value is :json.
Dump a model to a binary encoded in the desired format.
Options
- :format - The format to serialize to. Can be either :json or :ubj. The default value is :json.
Dump a model's trained parameters to a buffer as a JSON-encoded binary.
Options
- :format - The format to serialize to. Can be either :json or :ubj. The default value is :json.
Create a new Booster from a config buffer. The config buffer must be from the output of dump_config/2.
Options
- :booster (struct of type EXGBoost.Booster) - The Booster to load the model into. If a Booster is provided, the model will be loaded into that Booster; otherwise, a new Booster will be created. If a Booster is provided, model parameters will be merged with the existing Booster's parameters using Map.merge/2, where the parameters of the provided Booster take precedence.
@spec load_model(binary()) :: EXGBoost.Booster.t()
Read a model from a buffer and return the Booster.
@spec load_weights(binary()) :: EXGBoost.Booster.t()
Read a model's trained parameters from a buffer and return the Booster.
Create a new Booster from a config file. The config file must be from the output of write_config/2.
Options
- :booster (struct of type EXGBoost.Booster) - The Booster to load the model into. If a Booster is provided, the model will be loaded into that Booster; otherwise, a new Booster will be created. If a Booster is provided, model parameters will be merged with the existing Booster's parameters using Map.merge/2, where the parameters of the provided Booster take precedence.
@spec read_model(String.t()) :: EXGBoost.Booster.t()
Read a model from a file and return the Booster.
@spec read_weights(String.t()) :: EXGBoost.Booster.t()
Read a model's trained parameters from a file and return the Booster.
Write a model config to a file as a JSON-encoded string.
Options
- :format - The format to serialize to. Can be either :json or :ubj. The default value is :json.
- :overwrite (boolean/0) - Whether or not to overwrite the file if it already exists. The default value is false.
Write a model to a file.
Options
- :format - The format to serialize to. Can be either :json or :ubj. The default value is :json.
- :overwrite (boolean/0) - Whether or not to overwrite the file if it already exists. The default value is false.
Write a model's trained parameters to a file.
Options
- :format - The format to serialize to. Can be either :json or :ubj. The default value is :json.
- :overwrite (boolean/0) - Whether or not to overwrite the file if it already exists. The default value is false.
Plotting
Plot a tree from a Booster model and save it to a file.
Options
- :format - The format to export the graphic as. Must be one of :json, :html, :png, :svg, or :pdf. By default the format is inferred from the file extension.
- :local_npm_prefix - A relative path pointing to a local npm project directory where the necessary npm packages are installed. For instance, in Phoenix projects you may want to pass local_npm_prefix: "assets". By default the npm packages are searched for in the current directory and globally.
- :path - The path to save the graphic to. If not provided, the graphic is returned as a VegaLite spec.
- :opts - Additional options to pass to EXGBoost.Plotting.plot/2. See EXGBoost.Plotting for more information.