Scholar.Linear.BayesianRidgeRegression (Scholar v0.3.1)

Bayesian ridge regression: A fully probabilistic linear model with parameter regularization.

In order to obtain a fully probabilistic linear model, we introduce the precision parameter $\alpha$ into the model. This parameter describes the dispersion of the data around the mean.

$$ p(y | X, w, \alpha) = \mathcal{N}(y | Xw, \alpha^{-1}) $$

Where:

  • $X$ is the input data

  • $y$ is the target

  • $w$ is the model weights matrix

  • $\alpha$ is the precision parameter of the target, where $\alpha^{-1} = \sigma^{2}$ is the variance.

In order to obtain a fully probabilistic regularized linear model, we declare the distribution of the model weights matrix with its corresponding precision parameter:

$$ p(w | \lambda) = \mathcal{N}(w | 0, \lambda^{-1}\mathbf{I}) $$

Where $\lambda$ is the precision parameter of the weights matrix.

Both $\alpha$ and $\lambda$ are chosen to have gamma prior distributions, controlled through the hyperparameters $\alpha_1$, $\alpha_2$, $\lambda_1$, $\lambda_2$. These hyperparameters are set by default to the non-informative values $\alpha_1 = \alpha_2 = \lambda_1 = \lambda_2 = 10^{-6}$.
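Written in the shape/rate parameterization used by the hyperparameters in the Options section below, these priors can be expressed as:

$$ p(\alpha) = \mathrm{Gamma}(\alpha \mid \alpha_1, \alpha_2), \qquad p(\lambda) = \mathrm{Gamma}(\lambda \mid \lambda_1, \lambda_2) $$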

This model is similar to classical ridge regression. Confusingly, classical ridge regression's $\alpha$ parameter corresponds to the Bayesian ridge's $\lambda$ parameter.

Other than that, the differences between the algorithms are:

  • The weight regularization parameter ($\lambda$) is estimated from the data,
  • The precision of the target ($\alpha$) is also estimated.

As such, Bayesian ridge regression adapts more flexibly to the data at hand, but this flexibility comes at a higher computational cost; see the sketch below for a comparison with classical ridge.
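As a rough illustration (a sketch, assuming Scholar.Linear.RidgeRegression with its :alpha option is available alongside this module), classical ridge fixes its regularization strength up front, while the Bayesian model estimates the corresponding precision from the data:

x = Nx.tensor([[1], [2], [6], [8], [10]])
y = Nx.tensor([1, 2, 6, 8, 10])

# Classical ridge: the regularization strength (alpha) is fixed by the user.
ridge = Scholar.Linear.RidgeRegression.fit(x, y, alpha: 1.0)

# Bayesian ridge: the analogous precision is estimated and returned as :lambda.
bayesian = Scholar.Linear.BayesianRidgeRegression.fit(x, y)
bayesian.lambda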

This implementation is ported from Python's scikit-learn. It uses the algorithm described in (Tipping, 2001), and the regularization parameters are updated as in (MacKay, 1992).

References:

D. J. C. MacKay, Bayesian Interpolation, Computation and Neural Systems, Vol. 4, No. 3, 1992.

M. E. Tipping, Sparse Bayesian Learning and the Relevance Vector Machine, Journal of Machine Learning Research, Vol. 1, 2001.

Pedregosa et al., Scikit-learn: Machine Learning in Python, JMLR 12, pp. 2825-2830, 2011.

Summary

Functions

Fits a Bayesian ridge model for sample inputs x and sample targets y.

Makes predictions with the given model on input x.

Functions

Fits a Bayesian ridge model for sample inputs x and sample targets y.

Options

  • :iterations (pos_integer/0) - Maximum number of iterations before stopping the fitting algorithm. The number of iterations may be lower if the parameters converge. The default value is 300.

  • :sample_weights - The weights for each observation. If not provided, all observations are assigned equal weight.

  • :fit_intercept? (boolean/0) - If set to true, a model will fit the intercept. Otherwise, the intercept is set to 0.0. The intercept is an independent term in a linear model. Specifically, it is the expected mean value of targets for a zero-vector on input. The default value is true.

  • :compute_scores? (boolean/0) - If set to true, the log marginal likelihood will be computed at each iteration of the algorithm. The default value is false.

  • :alpha_init - The initial value for alpha. This parameter influences the precision of the noise. It must be a non-negative float, i.e. in [0, inf). Defaults to 1/Var(y).

  • :lambda_init - The initial value for lambda. This parameter influences the precision of the weights. It must be a non-negative float, i.e. in [0, inf). The default value is 1.0.

  • :alpha_1 - Hyperparameter: shape parameter for the Gamma distribution prior over the alpha parameter. The default value is 1.0e-6.

  • :alpha_2 - Hyperparameter: inverse scale (rate) parameter for the Gamma distribution prior over the alpha parameter. The default value is 1.0e-6.

  • :lambda_1 - Hyperparameter: shape parameter for the Gamma distribution prior over the lambda parameter. The default value is 1.0e-6.

  • :lambda_2 - Hyperparameter: inverse scale (rate) parameter for the Gamma distribution prior over the lambda parameter. The default value is 1.0e-6.

  • :eps (float/0) - The convergence tolerance. When Nx.sum(Nx.abs(coef - coef_new)) < :eps, the algorithm is considered to have converged. The default value is 1.0e-8.
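A sketch of passing several of these options together (the particular values below are illustrative, not recommendations):

x = Nx.tensor([[1], [2], [6], [8], [10]])
y = Nx.tensor([1, 2, 6, 8, 10])

model =
  Scholar.Linear.BayesianRidgeRegression.fit(x, y,
    iterations: 500,
    eps: 1.0e-8,
    compute_scores?: true,
    alpha_1: 1.0e-6,
    alpha_2: 1.0e-6,
    lambda_1: 1.0e-6,
    lambda_2: 1.0e-6
  )

# With compute_scores?: true, the log marginal likelihood computed at each
# iteration is available under model.scores.
model.scores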

Return Values

The function returns a struct with the following parameters:

  • :coefficients - Estimated coefficients for the linear regression problem.

  • :intercept - Independent term in the linear model.

  • :alpha - Estimated precision of the noise.

  • :lambda - Estimated precision of the weights.

  • :sigma - Estimated variance-covariance matrix of the weights with shape (n_features, n_features).

  • :iterations - Number of iterations performed by the optimization algorithm.

  • :has_converged - Whether the coefficients converged during the optimization algorithm.

  • :scores - Value of the log marginal likelihood at each iteration during the optimization.
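These fields can be read directly off the returned struct; a short sketch (the concrete values depend on the data):

x = Nx.tensor([[1], [2], [6], [8], [10]])
y = Nx.tensor([1, 2, 6, 8, 10])
model = Scholar.Linear.BayesianRidgeRegression.fit(x, y)

model.alpha          # estimated precision of the noise
model.lambda         # estimated precision of the weights
model.sigma          # (n_features, n_features) covariance matrix of the weights
model.has_converged  # whether the coefficient updates fell below :eps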

Examples

iex> x = Nx.tensor([[1], [2], [6], [8], [10]])
iex> y = Nx.tensor([1, 2, 6, 8, 10])
iex> model = Scholar.Linear.BayesianRidgeRegression.fit(x, y)
iex> model.coefficients
#Nx.Tensor<
  f32[1]
  [0.9932512044906616]
>
iex> model.intercept
#Nx.Tensor<
  f32
  0.03644371032714844
>

Makes predictions with the given model on input x.

Examples

iex> x = Nx.tensor([[1], [2], [6], [8], [10]])
iex> y = Nx.tensor([1, 2, 6, 8, 10])
iex> model = Scholar.Linear.BayesianRidgeRegression.fit(x, y)
iex> Scholar.Linear.BayesianRidgeRegression.predict(model, Nx.tensor([[1], [3], [4]]))
Nx.tensor(
  [1.02969491481781, 3.0161972045898438, 4.009448528289795]  
)