Linear regression in practice

Mix.install([
  {:scholar, "~> 0.2.0"},
  {:explorer, "~> 0.6.1"},
  {:exla, "~> 0.6.0"},
  {:nx, "~> 0.6.0", override: true},
  {:req, "~> 0.3.9"},
  {:kino_vega_lite, "~> 0.1.9"},
  {:kino, "~> 0.10.0"},
  {:kino_explorer, "~> 0.1.7"},
  {:tucan, "~> 0.3.0"}
])

Setup

In this livebook, we will cover typical use cases of linear regression through practical examples.

require Explorer.DataFrame, as: DF
require Explorer.Series, as: S
alias Scholar.Linear.LinearRegression, as: LR
alias Scholar.Linear.PolynomialRegression, as: PR
alias Scholar.Impute.SimpleImputer
alias Scholar.Metrics.Regression

And let's configure EXLA as our default backend (where our tensors are stored) and compiler (which compiles Scholar code) across the notebook and all branched sections:

Nx.global_default_backend(EXLA.Backend)
Nx.Defn.global_default_options(compiler: EXLA)
seed = 42
key = Nx.Random.key(seed)
#Nx.Tensor<
  u32[2]
  EXLA.Backend<host:0, 0.54336858.2205024268.194665>
  [0, 42]
>

Linear regression on synthetic data

Before we dive into real-life use cases of linear regression, we start with a simpler one. We will generate data with a linear pattern and then use Scholar.Linear.LinearRegression to compute regression.

Firstly, we generate the data, which simulates the function f(x) = 3x + 4 with added uniform, zero-mean noise. Nx.Random.uniform creates a tensor with a given shape and type.

defmodule LinearData do
  import Nx.Defn

  defn data do
    key = Nx.Random.key(42)
    size = 100
    {x, new_key} = Nx.Random.uniform(key, 0, 2, shape: {size, 1}, type: :f64)
    {noise, _} = Nx.Random.uniform(new_key, -0.5, 0.5, shape: {size, 1}, type: :f64)
    y = 3 * x + 4 + noise
    {x, y}
  end
end
{:module, LinearData, <<70, 79, 82, 49, 0, 0, 11, ...>>, true}

Now let's plot the generated points.

size = 100
{x, y} = LinearData.data()

df = DF.new(x: x, y: y)

Tucan.scatter(df, "x", "y", width: 630, height: 630, filled: true)
|> Tucan.Scale.set_x_domain(-0.05, 2.05)
|> Tucan.Scale.set_y_domain(2.5, 12)
|> Tucan.Grid.set_enabled(false)
|> Tucan.set_title("Scatterplot of Generated Data", offset: 20)

For a regression task, we will use the Scholar.Linear.LinearRegression module.

model = LR.fit(x, y)
%Scholar.Linear.LinearRegression{
  coefficients: #Nx.Tensor<
    f64[1][1]
    EXLA.Backend<host:0, 0.54336858.2205024268.196432>
    [
      [2.9578183089017]
    ]
  >,
  intercept: #Nx.Tensor<
    f64[1]
    EXLA.Backend<host:0, 0.54336858.2205024268.196436>
    [4.023067929858136]
  >
}

As we can see, the coefficient is almost 3.0, and the intercept is nearly 4.0. Those are decent estimates. They are not exactly 3.0 and 4.0 because we introduced noise into our samples.
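To quantify this, we can also compute the coefficient of determination of the fit. This is a quick sketch using Scholar.Metrics.Regression.r2_score/2 (aliased as Regression in the setup):

# An R² close to 1.0 means the fitted line explains almost all of the
# variance in the generated data.
Regression.r2_score(Nx.flatten(y), model |> LR.predict(x) |> Nx.flatten())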

Now, let's plot the result of linear regression.

[intercept] = Nx.to_flat_list(model.intercept)
[coefficients] = Nx.to_flat_list(model.coefficients)
x_1 = 0
x_2 = 2

line = %{
  x: [x_1, x_2],
  y: [x_1 * coefficients + intercept, x_2 * coefficients + intercept]
}

Tucan.layers([
  Tucan.scatter(df, "x", "y", filled: true),
  Tucan.lineplot(line, "x", "y", line_color: "green")
])
|> Tucan.Grid.set_enabled(false)
|> Tucan.Scale.set_x_domain(-0.05, 2.05)
|> Tucan.Scale.set_y_domain(2.5, 12)
|> Tucan.set_size(630, 630)
|> Tucan.set_title("Scatterplot of Generated Data", offset: 20)

Using Scholar.Linear.LinearRegression.predict, we can predict an expected value for a given input. However, we must remember that such a prediction is only meaningful when there is a linear relationship in the data. Fortunately, our data set is perfect for this kind of prediction.

Now we will predict one value and draw it on the previous graph in a different color.

x_prediction = Nx.tensor([[0.83]])

[y_prediction] =
  model
  |> LR.predict(x_prediction)
  |> Nx.to_flat_list()

[x_prediction] = Nx.to_flat_list(x_prediction)
{x_prediction, y_prediction}
{0.8299999833106995, 6.478057076882628}
prediction = %{
  x: [x_prediction],
  y: [y_prediction]
}

Tucan.layers([
  Tucan.scatter(df, "x", "y", filled: true),
  Tucan.lineplot(line, "x", "y", line_color: "green"),
  Tucan.scatter(prediction, "x", "y", point_color: "red", point_size: 80, filled: true)
])
|> Tucan.Grid.set_enabled(false)
|> Tucan.Scale.set_x_domain(-0.05, 2.05)
|> Tucan.Scale.set_y_domain(2.5, 12)
|> Tucan.set_size(630, 630)
|> Tucan.set_title("Scatterplot of Generated Data", offset: 20)

As we expected, the red dot lies on the regression line.

This implementation of linear regression is based on the so-called least squares method. In practice, the function computes X^+ y, where X^+ is the pseudo-inverse of X (more precisely, the Moore-Penrose inverse). You can calculate this result directly using Nx.LinAlg.pinv/2.

x_b = Nx.concatenate([Nx.broadcast(1.0, {size, 1}), x], axis: 1)
x_b |> Nx.LinAlg.pinv() |> Nx.dot(y)
#Nx.Tensor<
  f64[2][1]
  EXLA.Backend<host:0, 0.54336858.2205024268.197088>
  [
    [4.023067929858136],
    [2.957818308901698]
  ]
>
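As a quick sanity check (a sketch), we can lay out the fitted model's parameters in the same order as the pseudo-inverse result above and compare them directly:

# Intercept first, then the slope - the same layout as the pinv-based solution.
Nx.concatenate([model.intercept, Nx.flatten(model.coefficients)])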

Polynomial regression on synthetic data

Before moving on to a more complex example, this section will briefly show how to use another regression method. While not strictly linear, the approach and calculations that go into it are similar enough that it makes sense to explain it alongside linear regression.

Instead of the Scholar.Linear.LinearRegression module, the following example uses Scholar.Linear.PolynomialRegression. Polynomial and linear regression differ in one key way. Linear regression optimizes a function of the form f(x) = ax + b, whereas polynomial regression uses the function f(x) = b + a_1x^1 + a_2x^2 + … + a_nx^n, where n represents the degree. Notice that if the degree is 1, the function reduces to linear regression.
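To make that last point concrete, here is a minimal sketch with freshly generated inputs (assuming the installed version accepts degree: 1): a degree-1 polynomial fit should recover practically the same parameters as plain linear regression.

# Degree 1 reduces polynomial regression to plain linear regression, so both
# models should recover (almost) the same slope and intercept.
{xs, _} = Nx.Random.uniform(Nx.Random.key(0), 0, 2, shape: {50, 1})
ys = Nx.add(Nx.multiply(xs, 3.0), 4.0)

{LR.fit(xs, ys), PR.fit(xs, ys, degree: 1)}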

For this example we will start by generating data simulating the function f(x) = 2x^2 + 3x + 5 (degree 2) with some added noise.

defmodule PolynomialData do
  import Nx.Defn

  defn data do
    key = Nx.Random.key(42)
    size = 100
    {x, new_key} = Nx.Random.uniform(key, -2, 2, shape: {size, 1}, type: :f32)
    {noise, _} = Nx.Random.uniform(new_key, -0.5, 0.5, shape: {size, 1}, type: :f32)
    y = 2 * x ** 2 + 3 * x + 5 + noise
    {x, y}
  end
end
{:module, PolynomialData, <<70, 79, 82, 49, 0, 0, 11, ...>>, true}

Now let's plot the generated data:

{x, y} = PolynomialData.data()
df = DF.new(x: Nx.to_flat_list(x), y: Nx.to_flat_list(y))

Tucan.scatter(df, "x", "y", filled: true)
|> Tucan.Scale.set_x_domain(-2.05, 2.05)
|> Tucan.Scale.set_y_domain(2.5, 20)
|> Tucan.Grid.set_enabled(false)
|> Tucan.set_size(630, 630)
|> Tucan.set_title("Scatterplot of Generated Data", offset: 20)

As is clear from the picture, a straight line of the kind linear regression produces would most likely not estimate this data accurately, so polynomial regression will fit it better in this case. We do this with the Scholar.Linear.PolynomialRegression module. If you're familiar with the Scholar.Linear.LinearRegression module, the next steps will feel familiar.

To more clearly show the results, we will plot both methods.

x_start = -2
x_end = 2
precision = 1000
x_values = Enum.map((x_start * precision)..(x_end * precision), fn r -> r / precision end)

# Linear model
linear_model = LR.fit(x, y)

y_linear_values =
  LR.predict(
    linear_model,
    Nx.tensor(x_values) |> Nx.reshape({:auto, 1})
  )

df_linear_results =
  DF.new(
    x: x_values,
    y: y_linear_values |> Nx.to_flat_list()
  )

# Polynomial model
model = PR.fit(x, y, degree: 2)

y_values =
  PR.predict(
    model,
    Nx.tensor(x_values) |> Nx.reshape({:auto, 1})
  )

df_results =
  DF.new(
    x: x_values,
    y: y_values |> Nx.to_flat_list()
  )

Tucan.layers([
  Tucan.scatter(df, "x", "y", filled: true),
  Tucan.lineplot(df_results, "x", "y", line_color: "green"),
  Tucan.lineplot(df_linear_results, "x", "y", line_color: "red", clip: true)
])
|> Tucan.Grid.set_enabled(false)
|> Tucan.set_size(630, 630)
|> Tucan.set_title("Scatterplot of Generated Data", offset: 20)

Notice how the process of fitting a model and making a prediction is mostly the same. The one key difference lies in the coefficients and in how the input data is handled. The polynomial model returns a number of coefficients that depends on the number of variables and the degree; this is why there can be more coefficients than variables. In this example, with one variable and degree 2, we get two coefficients:

model.coefficients
#Nx.Tensor<
  f32[1][2]
  EXLA.Backend<host:0, 0.54336858.2205024268.199387>
  [
    [3.023973226547241, 2.013108730316162]
  ]
>

These coefficients are what the fit optimizes, and their number corresponds to the number of columns in the transformed input data.

Feel free to play around with the number of variables and the degree of the transformation in the cell below.

n_variables = 1
n_samples = 5

Nx.iota({n_samples, n_variables})
|> PR.transform(degree: 3, fit_intercept?: false)
|> dbg()
#Nx.Tensor<
  s32[5][3]
  EXLA.Backend<host:0, 0.54336858.2205024268.199422>
  [
    [0, 0, 0],
    [1, 1, 1],
    [2, 4, 8],
    [3, 9, 27],
    [4, 16, 64]
  ]
>

We can make simple predictions, just like in Scholar.Linear.LinearRegression.

x_prediction = Nx.tensor([[-0.83], [0.83]])

y_predictions =
  PR.predict(model, x_prediction)
  |> Nx.to_flat_list()

x_predictions = x_prediction |> Nx.to_flat_list()

{x_predictions, y_predictions}
{[-0.8299999833106995, 0.8299999833106995], [3.8831818103790283, 8.902976989746094]}

And plot these predictions with the training data.

df_prediction =
  DF.new(
    x: x_predictions,
    y: y_predictions
  )

x_scale = [domain: [-2.05, 2.05]]
y_scale = [domain: [2.5, 20]]

Tucan.layers([
  Tucan.scatter(df, "x", "y", filled: true),
  Tucan.lineplot(df_results, "x", "y", line_color: "green"),
  Tucan.scatter(df_prediction, "x", "y", point_color: "red", point_size: 80, filled: true)
])
|> Tucan.Grid.set_enabled(false)
|> Tucan.set_size(630, 630)
|> Tucan.set_title("Plot of Generated Data and Predictions", offset: 20)

Now we are ready to go into a more complex example!

California Housing

In this section we will play with the California Housing data set. The data pertains to the houses found in a given California district, along with some summary statistics about them based on the 1990 census data. Be warned: the data isn't cleaned, so some preprocessing steps are required! The columns are as follows (their names are pretty self-explanatory):

  • longitude
  • latitude
  • housing_median_age
  • total_rooms
  • total_bedrooms
  • population
  • households
  • median_income
  • median_house_value
  • ocean_proximity

The main task of this section is to predict the median_house_value. However, before we use linear regression for prediction, we need to learn more about the data.

data =
  Req.get!(
    "https://raw.githubusercontent.com/sonarsushant/California-House-Price-Prediction/master/housing.csv"
  ).body

df = DF.load_csv!(data)

Firstly, let's look at the distribution of houses based on the distance to the ocean.

S.frequencies(df["ocean_proximity"])

Now, we will plot univariate histograms for each feature of the data set.

# Increase the sample size (or use 1.0 to plot all data)
sample = DF.sample(df, 0.2, seed: seed)

histograms =
  for name <- List.delete(df.names, "ocean_proximity") do
    Tucan.histogram(sample, name, maxbins: 50, fill_opacity: 1.0, only: [name])
    |> Tucan.Axes.put_options(:x, ticks: false)
  end

Tucan.concat(histograms, columns: 3)
|> Tucan.set_title("Univariate Histograms of all features", anchor: :middle)
|> Tucan.set_size(500, 500)

From the histograms, we can spot that median_income and median_house_value have similar distributions. Both are heavy-tailed with high skewness. We might speculate that those two features are strongly correlated. We will check that later on.
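In fact, we can already peek at that relationship numerically before building the full correlation matrix later on. This is a sketch that reuses Scholar.Covariance.correlation_matrix/1 on just those two columns:

# The off-diagonal entry of this 2x2 matrix is the Pearson correlation between
# median_income and median_house_value (neither column has missing values).
income_and_value =
  Nx.stack(
    [Nx.concatenate(df[["median_income"]]), Nx.concatenate(df[["median_house_value"]])],
    axis: 1
  )

Scholar.Covariance.correlation_matrix(income_and_value)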

Now let's render the houses as a scatter plot using the latitude and longitude, overlaid on top of California's map. Let's also use color to encode the house prices and the circle size to indicate the population of districts.

Tucan.scatter(df, "longitude", "latitude",
  filled: true,
  tooltip: true,
  only: ~w(latitude longitude median_house_value population)
)
|> Tucan.color_by("median_house_value", type: :quantitative)
|> Tucan.size_by("population")
|> Tucan.Scale.set_x_domain(-124.55, -113.80)
|> Tucan.Scale.set_y_domain(32.45, 42.05)
|> Tucan.Grid.set_enabled(false)
|> Tucan.Scale.set_color_scheme(:viridis)
|> Tucan.background_image(
  "https://raw.githubusercontent.com/ageron/handson-ml2/master/images/end_to_end_project/california.png"
)
|> Tucan.set_size(630, 630)
|> Tucan.set_title("Scatterplot of California Housing Data", offset: 20)

From this plot, we can see that prices depend substantially on location and population. As for location, areas closer to the ocean are generally more expensive, but it's not a strict rule, since houses on the northern coast of California are much more affordable than in inland Central California. As for population, there are two dense areas with expensive housing: the Los Angeles area (in Southern California) and the San Francisco Bay Area (in Central California). They are metropolises with many tech companies as well as business and cultural institutions, so, logically, housing in those places is expensive.

Hint:

You can try to add another feature by computing a clustering on this data set. It might be the sum or a power mean of distances to the cluster centroids. We may predict that the centroids will be located in the San Francisco Bay Area and the Los Angeles area. You can also pass population as weights to k-means.
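A minimal sketch of that idea (hypothetical, and assuming the installed Scholar version exposes a :weights option on Scholar.Cluster.KMeans) might look like this:

# Cluster the districts by location, weighting each district by its population.
# We would expect the two centroids to land near the Los Angeles and
# San Francisco areas.
coords = Nx.stack(df[["latitude", "longitude"]], axis: 1)
population_weights = Nx.concatenate(df[["population"]])

kmeans = Scholar.Cluster.KMeans.fit(coords, num_clusters: 2, weights: population_weights, key: key)
kmeans.clusters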

Before we convert our data to tensor, we will add three more columns which might be informative:

  • rooms_per_family
  • bedrooms_per_rooms
  • population_per_family

The column names are self-describing. Now, let's add them to our data frame.

df =
  DF.mutate(df,
    rooms_per_family: total_rooms / households,
    bedrooms_per_rooms: total_bedrooms / total_rooms,
    population_per_family: population / households
  )

In the next step, we will find the correlation matrix. But to do this, we need to cast our data frame to an Nx tensor and split the data into train and test sets.

First, let's replace all missing values (nil) with :nan so the data can be converted to tensors, and then convert the "ocean_proximity" string column to a numerical one. Explorer supports a :category type, but in this case we will do a custom conversion, since we can treat ocean proximity as ordinal data: the values can be ordered by distance to the ocean. The bigger the value, the further from the ocean.

# Replace all nils with :nan so we are able to convert to tensor.
names =
  df
  |> DF.names()
  |> List.delete("ocean_proximity")

after_preprocessing = for name <- names, into: %{}, do: {name, S.fill_missing(df[name], :nan)}

preprocessed_data = DF.new(after_preprocessing)

mapping = %{
  "ISLAND" => 0.0,
  "<1H OCEAN" => 1.0,
  "NEAR OCEAN" => 2.0,
  "NEAR BAY" => 3.0,
  "INLAND" => 4.0
}

mapped_location = S.transform(df["ocean_proximity"], fn x -> Map.fetch!(mapping, x) end)

df = DF.put(preprocessed_data, :ocean_proximity, mapped_location)

Now we convert dataframes into tensors. We can do so by concatenating and stacking the columns accordingly:

# Shuffle the data to make splitting more reasonable
{num_rows, _num_cols} = DF.shape(df)

indices = Nx.iota({num_rows})
{permutation_indices, _} = Nx.Random.shuffle(key, indices, axis: 0)

y =
  df[["median_house_value"]]
  |> Nx.concatenate()
  |> Nx.take(permutation_indices)

x =
  df
  |> DF.discard("median_house_value")
  |> Nx.stack(axis: 1)
  |> Nx.take(permutation_indices)

{x, y}
{#Nx.Tensor<
   f64[20640][12]
   EXLA.Backend<host:0, 0.54336858.2205024268.200763>
   [
     [0.2707641196013289, 326.0, 15.0, 32.76, -117.02, 1.0278, 543.0, 1.665644171779141, 3.6932515337423313, 326.0, 1204.0, 1.0],
     [0.2598343685300207, 230.0, 27.0, 38.44, -122.71, 1.7, 462.0, 2.008695652173913, 4.2, 251.0, 966.0, 1.0],
     [0.2902033271719039, 793.0, 35.0, 34.09, -118.35, 3.0349, 1526.0, 1.9243379571248425, 3.41109709962169, 785.0, 2705.0, 1.0],
     [0.2940552016985138, 549.0, 25.0, 33.91, -118.35, 2.8512, 1337.0, 2.435336976320583, 3.431693989071038, 554.0, 1884.0, 1.0],
     [0.2067861321509945, ...],
     ...
   ]
 >,
 #Nx.Tensor<
   f64[20640]
   EXLA.Backend<host:0, 0.54336858.2205024268.200734>
   [1.542e5, 3.5e5, 2.667e5, 2.728e5, 1.163e5, 2.941e5, 4.178e5, 1.851e5, 5.68e4, 500001.0, 500001.0, 1.152e5, 2.343e5, 8.75e4, 8.37e4, 500001.0, 1.669e5, 3.003e5, 1.546e5, 6.69e4, 2.325e5, 1.367e5, 1.375e5, 1.27e5, 1.683e5, 1.441e5, 9.53e4, 1.834e5, 1.435e5, 6.85e4, 1.625e5, 1.661e5, 1.908e5, 2.431e5, 1.488e5, 3.036e5, 1.479e5, 1.5e5, 3.288e5, 7.08e4, 2.25e5, 1.375e5, 3.5e5, 3.742e5, 1.549e5, 4.5e5, 3.063e5, 2.051e5, ...]
 >}

Since we don't have a stratified split of data implemented (to learn more, see Stratified Sampling), we shuffle the data set and take advantage of the law of large numbers. It says that the average of the results obtained from a large number of trials should be close to the expected value, and tends to get closer as more trials are performed. Because we take a lot of samples from the shuffled data, the resulting sets will be approximately stratified. Now, we will split the data into training and test sets.

train_ratio = 0.8

{x_train, x_test} = Nx.split(x, train_ratio)
{y_train, y_test} = Nx.split(y, train_ratio)
{#Nx.Tensor<
   f64[16512]
   EXLA.Backend<host:0, 0.54336858.2205024268.200769>
   [1.542e5, 3.5e5, 2.667e5, 2.728e5, 1.163e5, 2.941e5, 4.178e5, 1.851e5, 5.68e4, 500001.0, 500001.0, 1.152e5, 2.343e5, 8.75e4, 8.37e4, 500001.0, 1.669e5, 3.003e5, 1.546e5, 6.69e4, 2.325e5, 1.367e5, 1.375e5, 1.27e5, 1.683e5, 1.441e5, 9.53e4, 1.834e5, 1.435e5, 6.85e4, 1.625e5, 1.661e5, 1.908e5, 2.431e5, 1.488e5, 3.036e5, 1.479e5, 1.5e5, 3.288e5, 7.08e4, 2.25e5, 1.375e5, 3.5e5, 3.742e5, 1.549e5, 4.5e5, 3.063e5, 2.051e5, 2.737e5, ...]
 >,
 #Nx.Tensor<
   f64[4128]
   EXLA.Backend<host:0, 0.54336858.2205024268.200771>
   [8.55e4, 1.743e5, 5.65e4, 1.308e5, 1.375e5, 2.472e5, 2.25e5, 1.164e5, 500001.0, 1.154e5, 1.269e5, 1.229e5, 7.04e4, 3.534e5, 7.14e4, 4.083e5, 3.0e5, 1.648e5, 1.125e5, 2.028e5, 1.139e5, 2.15e5, 500001.0, 1.48e5, 1.683e5, 2.819e5, 3.338e5, 1.078e5, 3.277e5, 3.412e5, 3.289e5, 2.879e5, 2.545e5, 2.667e5, 6.82e4, 1.406e5, 2.795e5, 2.25e5, 1.17e5, 1.375e5, 3.115e5, 2.423e5, 4.93e5, 3.458e5, 1.542e5, 3.728e5, 6.6e4, 1.281e5, ...]
 >}
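As an informal check of that argument (a quick sketch), the mean target value should come out roughly the same in both splits:

# If the shuffled split is approximately stratified, these means should be
# close to each other and to the overall mean of y.
{Nx.mean(y_train), Nx.mean(y_test)}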

Before we compute the correlation matrix, we will check if we have NaNs (Not a Number) in the data set.

y_nan_count = Nx.sum(Nx.is_nan(y))
x_nan_count = Nx.sum(Nx.is_nan(x))
{x_nan_count, y_nan_count}
{#Nx.Tensor<
   u64
   EXLA.Backend<host:0, 0.54336858.2205024268.200779>
   414
 >,
 #Nx.Tensor<
   u64
   EXLA.Backend<host:0, 0.54336858.2205024268.200775>
   0
 >}

Oops, we have some. Fortunately, y doesn't contain any NaNs. If we dig a little deeper, it turns out that the NaNs are in bedrooms_per_rooms (1st column) and total_bedrooms (10th column).

{bedrooms_per_rooms_idx, total_bedrooms_idx} = {0, 9}
bedrooms_per_rooms_nan_count = Nx.sum(Nx.is_nan(x[[.., bedrooms_per_rooms_idx]]))
total_bedrooms_nan_count = Nx.sum(Nx.is_nan(x[[.., total_bedrooms_idx]]))
Nx.equal(x_nan_count, Nx.add(bedrooms_per_rooms_nan_count, total_bedrooms_nan_count))
#Nx.Tensor<
  u8
  EXLA.Backend<host:0, 0.54336858.2205024268.200794>
  1
>

For these two columns, we use Scholar.Impute.SimpleImputer with the strategy set to :median. The fit function learns the median of each feature, and transform uses the trained imputer to replace all NaNs. It is important that we perform imputation after splitting the data, because otherwise information would leak from the test data.

x_train =
  x_train
  |> SimpleImputer.fit(strategy: :median)
  |> SimpleImputer.transform(x_train)

x_test =
  x_test
  |> SimpleImputer.fit(strategy: :median)
  |> SimpleImputer.transform(x_test)
#Nx.Tensor<
  f64[4128][12]
  EXLA.Backend<host:0, 0.54336858.2205024268.200878>
  [
    [0.20019126554032515, 548.0, 23.0, 36.32, -119.33, 2.5, 1446.0, 2.6386861313868613, 5.724452554744525, 628.0, 3137.0, 4.0],
    [0.18461538461538463, 83.0, 34.0, 34.26, -118.44, 5.5124, 433.0, 5.216867469879518, 3.9156626506024095, 60.0, 325.0, 1.0],
    [0.23362175525339926, 398.0, 35.0, 35.77, -119.25, 1.6786, 1449.0, 3.64070351758794, 4.065326633165829, 378.0, 1618.0, 4.0],
    [0.20259128386336867, 313.0, 24.0, 37.8, -121.2, 3.5625, 927.0, 2.961661341853035, 5.424920127795527, 344.0, 1698.0, 4.0],
    [0.42487046632124353, 155.0, ...],
    ...
  ]
>

Finally, we can compute the correlation matrix. We will use Scholar.Covariance to calculate it.

correlation =
  Nx.concatenate([x_train, Nx.new_axis(y_train, 1)], axis: 1)
  |> Scholar.Covariance.correlation_matrix(biased: true)
#Nx.Tensor<
  f64[13][13]
  EXLA.Backend<host:0, 0.54336858.2205024268.200912>
  [
    [1.0000000000000002, 0.06144282875654085, 0.13311283521287637, -0.12216156759996018, 0.09984860365269956, -0.612630933680883, 0.03517997111673592, 0.004697459810143742, -0.4078383723411937, 0.08095666109618091, -0.18612145325855575, -0.11835385450623682, -0.25385512001635363],
    [0.06144282875654085, 1.0, -0.305555948690877, -0.07566125267999342, 0.06170866872453172, 0.01398821983308101, 0.907460734474932, -0.02634672373043427, -0.07388109216765114, 0.9735036322842058, 0.9198029415934222, -0.04626792658479704, 0.0639675034251392],
    [0.13311283521287637, -0.305555948690877, 0.9999999999999998, 0.017212690983879612, -0.11382253728911094, -0.12557649464959478, -0.2961765721596695, 0.012442541870155687, -0.14834326174239934, -0.32094539339262423, -0.3602627671081426, -0.11940626101050672, 0.10251505008793772],
    [-0.12216156759996018, -0.07566125267999342, 0.017212690983879612, 0.9999999999999992, -0.92526150146814, -0.0709535112221358, -0.11317680203176562, 0.0032929374670018535, 0.10562463526360089, -0.07174931649052743, -0.038747214812891013, ...],
    ...
  ]
>

Maybe a visual representation would be nicer. 😅

{corr_size, _} = Nx.shape(correlation)
correlation_list = Nx.to_flat_list(correlation)

names = [
  "Bedrooms per rooms",
  "Households",
  "Housing median age",
  "Latitude",
  "Longitude",
  "Median income",
  "Population",
  "Population per family",
  "Rooms per family",
  "Total bedrooms",
  "Total rooms",
  "Ocean proximity",
  "Median house value"
]

corr_to_plot =
  DF.new(
    x: List.flatten(List.duplicate(names, corr_size)),
    y: List.flatten(for name <- names, do: List.duplicate(name, corr_size)),
    corr_val: Enum.map(correlation_list, fn x -> Float.round(x, 2) end)
  )

Tucan.heatmap(corr_to_plot, "x", "y", "corr_val",
  annotate: true,
  text_color: [{nil, 0, "white"}, {0, nil, "black"}]
)
|> Tucan.Scale.set_color_scheme(:viridis)
|> Tucan.set_size(630, 630)
|> Tucan.set_title("Correlation Matrix for California Housing", offset: 20)

We can spot that median_house_value is strongly correlated with median_income. That's pretty straightforward: the more money you have, the more expensive a house you can buy. Less obvious is the negative correlation with bedrooms_per_rooms, but it can also be explained. Bedrooms are the most crucial rooms in a house. First, you need to guarantee that the house has enough bedrooms; only once that condition is satisfied can you focus on "additional rooms" like a chill room, cabinets, and so on. So buying a house with more additional rooms decreases the bedrooms_per_rooms ratio, which is why more expensive houses tend to have a lower value for it.

Now we are ready to train a model for the median_house_value prediction. We will use linear regression. In the first step, we create the model by calling the fit function.

model = LR.fit(x_train, y_train)
%Scholar.Linear.LinearRegression{
  coefficients: #Nx.Tensor<
    f64[12]
    EXLA.Backend<host:0, 0.54336858.2205024268.201579>
    [292661.1879978386, 111.46237796967083, 1170.1666131366273, -33322.29721781791, -35394.90464853701, 41384.64324873476, -41.31598131072877, 50.507900596456054, 3040.877192253343, 7.253624678472778, 3.272614960520949, -8059.519389730254]
  >,
  intercept: #Nx.Tensor<
    f64
    EXLA.Backend<host:0, 0.54336858.2205024268.201583>
    -3101555.563125743
  >
}

Now we can predict the values for the test set and measure the error of our prediction. We will calculate root mean square error (RMSE) and mean absolute error (MAE).

predictions = LR.predict(model, x_test)
rmse = Regression.mean_square_error(y_test, predictions) |> Nx.sqrt()
mae = Regression.mean_absolute_error(y_test, predictions)
{rmse, mae}
{#Nx.Tensor<
   f64
   EXLA.Backend<host:0, 0.54336858.2205024268.201596>
   67648.9435367406
 >,
 #Nx.Tensor<
   f64
   EXLA.Backend<host:0, 0.54336858.2205024268.201601>
   48942.11908533544
 >}

Ok, but is this a good or a poor estimate? Let's check the mean value of the target and compare it to the errors.

Nx.mean(y)
#Nx.Tensor<
  f64
  EXLA.Backend<host:0, 0.54336858.2205024268.201604>
  206855.81690891474
>
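To make the comparison concrete, here is a sketch that expresses the RMSE relative to the mean target value on the test set and also computes R² (assuming Scholar.Metrics.Regression.r2_score/2 is available in the installed version):

# RMSE as a fraction of the mean house value, plus the coefficient of
# determination of the predictions on the test set.
{Nx.divide(rmse, Nx.mean(y_test)), Regression.r2_score(y_test, predictions)}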

For a model as simple as linear regression, this seems to be a pretty good result. But there is room to improve it. You can, for example, add some additional features to the data set. In the future, you will be able to try more complex models, such as random forests.