Linear regression in practice
Mix.install([
{:scholar, "~> 0.2.0"},
{:explorer, "~> 0.6.1"},
{:exla, "~> 0.6.0"},
{:nx, "~> 0.6.0", override: true},
{:req, "~> 0.3.9"},
{:kino_vega_lite, "~> 0.1.9"},
{:kino, "~> 0.10.0"},
{:kino_explorer, "~> 0.1.7"},
{:tucan, "~> 0.3.0"}
])
Setup
In this livebook, we will cover typical use cases of linear regression through practical examples.
require Explorer.DataFrame, as: DF
require Explorer.Series, as: S
alias Scholar.Linear.LinearRegression, as: LR
alias Scholar.Linear.PolynomialRegression, as: PR
alias Scholar.Impute.SimpleImputer
alias Scholar.Metrics.Regression
And let's configure EXLA
as our default backend (where our tensors are stored) and compiler (which compiles Scholar code) across the notebook and all branched sections:
Nx.global_default_backend(EXLA.Backend)
Nx.Defn.global_default_options(compiler: EXLA)
seed = 42
key = Nx.Random.key(42)
#Nx.Tensor<
u32[2]
EXLA.Backend<host:0, 0.54336858.2205024268.194665>
[0, 42]
>
Linear regression on synthetic data
Before we dive into real-life use cases of linear regression, we start with a simpler one. We will generate data with a linear pattern and then use Scholar.Linear.LinearRegression
to compute regression.
First, we generate data that simulates the function $y = 3x + 4$ with added uniform, zero-mean noise. Nx.Random.uniform
creates a tensor with a given shape and type.
defmodule LinearData do
import Nx.Defn
defn data do
key = Nx.Random.key(42)
size = 100
{x, new_key} = Nx.Random.uniform(key, 0, 2, shape: {size, 1}, type: :f64)
{noise, _} = Nx.Random.uniform(new_key, -0.5, 0.5, shape: {size, 1}, type: :f64)
y = 3 * x + 4 + noise
{x, y}
end
end
{:module, LinearData, <<70, 79, 82, 49, 0, 0, 11, ...>>, true}
Now let's plot the generated points.
size = 100
{x, y} = LinearData.data()
df = DF.new(x: x, y: y)
Tucan.scatter(df, "x", "y", width: 630, height: 630, filled: true)
|> Tucan.Scale.set_x_domain(-0.05, 2.05)
|> Tucan.Scale.set_y_domain(2.5, 12)
|> Tucan.Grid.set_enabled(false)
|> Tucan.set_title("Scatterplot of Generated Data", offset: 20)
For a regression task, we will use the Scholar.Linear.LinearRegression
module.
model = LR.fit(x, y)
%Scholar.Linear.LinearRegression{
coefficients: #Nx.Tensor<
f64[1][1]
EXLA.Backend<host:0, 0.54336858.2205024268.196432>
[
[2.9578183089017]
]
>,
intercept: #Nx.Tensor<
f64[1]
EXLA.Backend<host:0, 0.54336858.2205024268.196436>
[4.023067929858136]
>
}
As we can see, the coefficient is almost 3.0 and the intercept is nearly 4.0. Those are decent estimates. They are not exactly 3.0 and 4.0 because we introduced noise into our samples.
Now, let's plot the result of linear regression.
[intercept] = Nx.to_flat_list(model.intercept)
[coefficients] = Nx.to_flat_list(model.coefficients)
x_1 = 0
x_2 = 2
line = %{
x: [x_1, x_2],
y: [x_1 * coefficients + intercept, x_2 * coefficients + intercept]
}
Tucan.layers([
Tucan.scatter(df, "x", "y", filled: true),
Tucan.lineplot(line, "x", "y", line_color: "green")
])
|> Tucan.Grid.set_enabled(false)
|> Tucan.Scale.set_x_domain(-0.05, 2.05)
|> Tucan.Scale.set_y_domain(2.5, 12)
|> Tucan.set_size(630, 630)
|> Tucan.set_title("Scatterplot of Generated Data", offset: 20)
Using Scholar.Linear.LinearRegression.predict
, we can predict an expected value for a given input. However, we must remember that our prediction is only valid for data that follows a (roughly) linear pattern. Fortunately, our data set is perfect for this kind of prediction.
Now we will predict one value and draw it on the previous graph in a different color.
x_prediction = Nx.tensor([[0.83]])
[y_prediction] =
model
|> LR.predict(x_prediction)
|> Nx.to_flat_list()
[x_prediction] = Nx.to_flat_list(x_prediction)
{x_prediction, y_prediction}
{0.8299999833106995, 6.478057076882628}
prediction = %{
x: [x_prediction],
y: [y_prediction]
}
Tucan.layers([
Tucan.scatter(df, "x", "y", filled: true),
Tucan.lineplot(line, "x", "y", line_color: "green"),
Tucan.scatter(prediction, "x", "y", point_color: "red", point_size: 80, filled: true)
])
|> Tucan.Grid.set_enabled(false)
|> Tucan.Scale.set_x_domain(-0.05, 2.05)
|> Tucan.Scale.set_y_domain(2.5, 12)
|> Tucan.set_size(630, 630)
|> Tucan.set_title("Scatterplot of Generated Data", offset: 20)
As we expected, the red dot lies on the regression line.
This implementation of linear regression is based on the so-called least squares method. In practice, the function computes $X^{+}y$, where $X^{+}$ is the pseudo-inverse (more precisely, the Moore-Penrose inverse) of the input matrix $X$. You can calculate the result yourself using Nx.LinAlg.pinv/2.
x_b = Nx.concatenate([Nx.broadcast(1.0, {size, 1}), x], axis: 1)
x_b |> Nx.LinAlg.pinv() |> Nx.dot(y)
#Nx.Tensor<
f64[2][1]
EXLA.Backend<host:0, 0.54336858.2205024268.197088>
[
[4.023067929858136],
[2.957818308901698]
]
>
Polynomial regression on synthetic data
Before moving on to a more complex example, this section will briefly show how to use another regression method. While not strictly linear, the approach and calculations behind it are similar enough that it makes sense to explain it alongside linear regression.
Instead of the Scholar.Linear.LinearRegression
module, the following example uses Scholar.Linear.PolynomialRegression
. Polynomial and linear regression differ in one key way. Linear regression optimizes a function of the form $y = a_1x + a_0$, whereas polynomial regression
uses the function $y = a_nx^n + \ldots + a_2x^2 + a_1x + a_0$, where $n$ represents the degree. Notice how if the degree is $n = 1$, the function represents linear regression.
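The claim about degree 1 is easy to check. The cell below is a small addition for illustration: it reuses the x and y tensors from the linear data above and assumes the :degree option accepts 1, in which case the fitted coefficients should match plain linear regression up to numerical precision.
# Hedged sketch: polynomial regression with degree 1 should reduce to linear regression.
# x and y still hold the linear data generated in the previous section.
{PR.fit(x, y, degree: 1).coefficients, LR.fit(x, y).coefficients}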
For this example, we will start by generating data simulating the degree-2 function $y = 2x^2 + 3x + 5$ with some added noise.
defmodule PolynomialData do
import Nx.Defn
defn data do
key = Nx.Random.key(42)
size = 100
{x, new_key} = Nx.Random.uniform(key, -2, 2, shape: {size, 1}, type: :f32)
{noise, _} = Nx.Random.uniform(new_key, -0.5, 0.5, shape: {size, 1}, type: :f32)
y = 2 * x ** 2 + 3 * x + 5 + noise
{x, y}
end
end
{:module, PolynomialData, <<70, 79, 82, 49, 0, 0, 11, ...>>, true}
Now let's plot the generated data:
{x, y} = PolynomialData.data()
df = DF.new(x: Nx.to_flat_list(x), y: Nx.to_flat_list(y))
Tucan.scatter(df, "x", "y", filled: true)
|> Tucan.Scale.set_x_domain(-2.05, 2.05)
|> Tucan.Scale.set_y_domain(2.5, 20)
|> Tucan.Grid.set_enabled(false)
|> Tucan.set_size(630, 630)
|> Tucan.set_title("Scatterplot of Generated Data", offset: 20)
As is clear in the picture, the plotted data would not be well approximated by a straight line, so polynomial regression should fit it better in this case. We do this with the Scholar.Linear.PolynomialRegression
module. If you're familiar with the Scholar.Linear.LinearRegression
module, the next steps will feel familiar.
To more clearly show the results, we will plot both methods.
x_start = -2
x_end = 2
precision = 1000
x_values = Enum.map((x_start * precision)..(x_end * precision), fn r -> r / precision end)
# Linear model
linear_model = LR.fit(x, y)
y_linear_values =
LR.predict(
linear_model,
Nx.tensor(x_values) |> Nx.reshape({:auto, 1})
)
df_linear_results =
DF.new(
x: x_values,
y: y_linear_values |> Nx.to_flat_list()
)
# Polynomial model
model = PR.fit(x, y, degree: 2)
y_values =
PR.predict(
model,
Nx.tensor(x_values) |> Nx.reshape({:auto, 1})
)
df_results =
DF.new(
x: x_values,
y: y_values |> Nx.to_flat_list()
)
Tucan.layers([
Tucan.scatter(df, "x", "y", filled: true),
Tucan.lineplot(df_results, "x", "y", line_color: "green"),
Tucan.lineplot(df_linear_results, "x", "y", line_color: "red", clip: true)
])
|> Tucan.Grid.set_enabled(false)
|> Tucan.set_size(630, 630)
|> Tucan.set_title("Scatterplot of Generated Data", offset: 20)
Notice how the process of fitting a model and making a prediction is mostly the same. The key difference lies in the coefficients and in how the input data is handled: the polynomial model returns a number of coefficients that depends on both the number of variables and the degree. This is why, in this example with one variable and degree 2, we get two coefficients:
model.coefficients
#Nx.Tensor<
f32[1][2]
EXLA.Backend<host:0, 0.54336858.2205024268.199387>
[
[3.023973226547241, 2.013108730316162]
]
>
These coefficients are used to fit the function and correspond to the columns of the transformed input data.
Feel free, in the cell below, to play around with the number of variables and the degree of the transformation.
n_variables = 1
n_samples = 5
Nx.iota({n_samples, n_variables})
|> PR.transform(degree: 3, fit_intercept?: false)
|> dbg()
#Nx.Tensor<
s32[5][3]
EXLA.Backend<host:0, 0.54336858.2205024268.199422>
[
[0, 0, 0],
[1, 1, 1],
[2, 4, 8],
[3, 9, 27],
[4, 16, 64]
]
>
We can make simple predictions, just like in Scholar.Linear.LinearRegression
.
x_prediction = Nx.tensor([[-0.83], [0.83]])
y_predictions =
PR.predict(model, x_prediction)
|> Nx.to_flat_list()
x_predictions = x_prediction |> Nx.to_flat_list()
{x_predictions, y_predictions}
{[-0.8299999833106995, 0.8299999833106995], [3.8831818103790283, 8.902976989746094]}
And plot these predictions with the training data.
df_prediction =
DF.new(
x: x_predictions,
y: y_predictions
)
x_scale = [domain: [-2.05, 2.05]]
y_scale = [domain: [2.5, 20]]
Tucan.layers([
Tucan.scatter(df, "x", "y", filled: true),
Tucan.lineplot(df_results, "x", "y", line_color: "green"),
Tucan.scatter(df_prediction, "x", "y", point_color: "red", point_size: 80, filled: true)
])
|> Tucan.Grid.set_enabled(false)
|> Tucan.set_size(630, 630)
|> Tucan.set_title("Plot of Generated Data and Predictions", offset: 20)
Now we are ready to go into a more complex example!
California Housing
In this section we will play with the California Housing data set. The data pertains to the houses found in a given California district and some summary statistics about them, based on the 1990 census data. Be warned: the data isn't cleaned, so some preprocessing steps are required! The columns are as follows (their names are pretty self-explanatory):
longitude
latitude
housing_median_age
total_rooms
total_bedrooms
population
households
median_income
median_house_value
ocean_proximity
The main task of this section is to predict median_house_value. However, before we use linear regression for prediction, we need to learn more about the data.
data =
Req.get!(
"https://raw.githubusercontent.com/sonarsushant/California-House-Price-Prediction/master/housing.csv"
).body
df = DF.load_csv!(data)
Firstly, let's look at the distribution of houses based on the distance to the ocean.
S.frequencies(df["ocean_proximity"])
Now, we will plot univariate histograms for each feature of the data set.
# Increase the sample size (or use 1.0 to plot all data)
sample = DF.sample(df, 0.2, seed: seed)
histograms =
for name <- List.delete(df.names, "ocean_proximity") do
Tucan.histogram(sample, name, maxbins: 50, fill_opacity: 1.0, only: [name])
|> Tucan.Axes.put_options(:x, ticks: false)
end
Tucan.concat(histograms, columns: 3)
|> Tucan.set_title("Univariate Histograms of all features", anchor: :middle)
|> Tucan.set_size(500, 500)
From the histograms, we can spot that median_income and median_house_value have similar distributions. Both are heavy-tailed with high skewness. We might speculate that these two features are strongly correlated. We will check that later on.
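As a quick sanity check of that speculation (the full correlation matrix comes later), we can reuse Scholar.Covariance.correlation_matrix on just these two columns; the off-diagonal entry of the resulting 2x2 matrix is their correlation coefficient. This cell is an addition for illustration, relying on the same data-frame-to-tensor conversion used further below.
# Correlation between median_income and median_house_value only.
# Neither column contains missing values, so we can convert them directly.
df[["median_income", "median_house_value"]]
|> Nx.stack(axis: 1)
|> Scholar.Covariance.correlation_matrix(biased: true)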
Now let's render the houses as a scatter plot using the latitude and longitude, overlaid on top of California's map. Let's also use color to encode the house prices and circle size to indicate the population of the districts.
Tucan.scatter(df, "longitude", "latitude",
filled: true,
tooltip: true,
only: ~w(latitude longitude median_house_value population)
)
|> Tucan.color_by("median_house_value", type: :quantitative)
|> Tucan.size_by("population")
|> Tucan.Scale.set_x_domain(-124.55, -113.80)
|> Tucan.Scale.set_y_domain(32.45, 42.05)
|> Tucan.Grid.set_enabled(false)
|> Tucan.Scale.set_color_scheme(:viridis)
|> Tucan.background_image(
"https://raw.githubusercontent.com/ageron/handson-ml2/master/images/end_to_end_project/california.png"
)
|> Tucan.set_size(630, 630)
|> Tucan.set_title("Scatterplot of Generated Data", offset: 20)
From this plot, we can see that prices depend substantially on location and population. Regarding location, areas closer to the ocean tend to be more expensive, although it's not a strict rule: houses on the northern coast of California are much more affordable than in inland Central California. Regarding population, there are two dense areas with expensive housing: Los Angeles (in Southern California) and the San Francisco Bay Area (in Central California). They are metropolises with a lot of tech companies and business and cultural institutions, so, logically, housing in those places is expensive.
Hint: You can try to add another feature by computing clustering on this data set. It might be a sum or power mean of distances to the clusters. We may predict that the centroids will be located in the San Francisco Bay Area and around Los Angeles. You can also pass population as weights to k-means. A rough sketch is shown below.
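The cell below sketches that hint. It is only an illustration added here, not part of the original analysis, and it assumes Scholar.Cluster.KMeans accepts a :weights option with one weight per observation and exposes the fitted centroids under the :clusters field.
# Hedged sketch: weighted k-means on geolocation, then the Euclidean distance
# from every district to each centroid as candidate extra features.
coords = Nx.stack(df[["latitude", "longitude"]], axis: 1)
weights = S.to_list(df["population"])

kmeans = Scholar.Cluster.KMeans.fit(coords, num_clusters: 2, weights: weights)

coords
|> Nx.new_axis(1)
|> Nx.subtract(Nx.new_axis(kmeans.clusters, 0))
|> Nx.pow(2)
|> Nx.sum(axes: [2])
|> Nx.sqrt()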
Before we convert our data to tensor, we will add three more columns which might be informative:
rooms_per_family
bedrooms_per_rooms
population_per_family
The column names are self-describing. Now, let's add them to our data frame.
df =
DF.mutate(df,
rooms_per_family: total_rooms / households,
bedrooms_per_rooms: total_bedrooms / total_rooms,
population_per_family: population / households
)
In the next step, we will find the correlation matrix. But to do this, we need to cast our data frame to an Nx tensor and split the data into train and test sets.
First, let's replace all missing values (nils) with :nan
so the data can be converted to tensors, and then convert the "ocean_proximity" string column to a numerical one. Explorer supports a :category
type but, in this case, we will do a custom conversion, since ocean proximity can be treated as ordinal data: we can order the categories by distance to the ocean, with bigger values meaning further from the ocean.
# Replace all nils with :nan so we are able to convert to tensor.
names =
df
|> DF.names()
|> List.delete("ocean_proximity")
after_preprocessing = for name <- names, into: %{}, do: {name, S.fill_missing(df[name], :nan)}
preprocessed_data = DF.new(after_preprocessing)
mapping = %{
"ISLAND" => 0.0,
"<1H OCEAN" => 1.0,
"NEAR OCEAN" => 2.0,
"NEAR BAY" => 3.0,
"INLAND" => 4.0
}
mapped_location = S.transform(df["ocean_proximity"], fn x -> Map.fetch!(mapping, x) end)
df = DF.put(preprocessed_data, :ocean_proximity, mapped_location)
Now we convert the data frame into tensors. We can do so by concatenating and stacking the columns accordingly:
# Shuffle the data to make the split more reasonable
{num_rows, _num_cols} = DF.shape(df)
indices = Nx.iota({num_rows})
{permutation_indices, _} = Nx.Random.shuffle(key, indices, axis: 0)
y =
df[["median_house_value"]]
|> Nx.concatenate()
|> Nx.take(permutation_indices)
x =
df
|> DF.discard("median_house_value")
|> Nx.stack(axis: 1)
|> Nx.take(permutation_indices)
{x, y}
{#Nx.Tensor<
f64[20640][12]
EXLA.Backend<host:0, 0.54336858.2205024268.200763>
[
[0.2707641196013289, 326.0, 15.0, 32.76, -117.02, 1.0278, 543.0, 1.665644171779141, 3.6932515337423313, 326.0, 1204.0, 1.0],
[0.2598343685300207, 230.0, 27.0, 38.44, -122.71, 1.7, 462.0, 2.008695652173913, 4.2, 251.0, 966.0, 1.0],
[0.2902033271719039, 793.0, 35.0, 34.09, -118.35, 3.0349, 1526.0, 1.9243379571248425, 3.41109709962169, 785.0, 2705.0, 1.0],
[0.2940552016985138, 549.0, 25.0, 33.91, -118.35, 2.8512, 1337.0, 2.435336976320583, 3.431693989071038, 554.0, 1884.0, 1.0],
[0.2067861321509945, ...],
...
]
>,
#Nx.Tensor<
f64[20640]
EXLA.Backend<host:0, 0.54336858.2205024268.200734>
[1.542e5, 3.5e5, 2.667e5, 2.728e5, 1.163e5, 2.941e5, 4.178e5, 1.851e5, 5.68e4, 500001.0, 500001.0, 1.152e5, 2.343e5, 8.75e4, 8.37e4, 500001.0, 1.669e5, 3.003e5, 1.546e5, 6.69e4, 2.325e5, 1.367e5, 1.375e5, 1.27e5, 1.683e5, 1.441e5, 9.53e4, 1.834e5, 1.435e5, 6.85e4, 1.625e5, 1.661e5, 1.908e5, 2.431e5, 1.488e5, 3.036e5, 1.479e5, 1.5e5, 3.288e5, 7.08e4, 2.25e5, 1.375e5, 3.5e5, 3.742e5, 1.549e5, 4.5e5, 3.063e5, 2.051e5, ...]
>}
Since we don't have a stratified split of data implemented (to learn more see Stratified Sampling), we shuffle the data set and take advantage of the law of large numbers. It says that the average of the results obtained from a large number of trials should be close to the expected value, and it tends to get closer as more trials are performed. Because we take a large number of samples from the shuffled data, the sampled data sets should be approximately stratified. Now, we will split the data into training and test sets.
train_ratio = 0.8
{x_train, x_test} = Nx.split(x, train_ratio)
{y_train, y_test} = Nx.split(y, train_ratio)
{#Nx.Tensor<
f64[16512]
EXLA.Backend<host:0, 0.54336858.2205024268.200769>
[1.542e5, 3.5e5, 2.667e5, 2.728e5, 1.163e5, 2.941e5, 4.178e5, 1.851e5, 5.68e4, 500001.0, 500001.0, 1.152e5, 2.343e5, 8.75e4, 8.37e4, 500001.0, 1.669e5, 3.003e5, 1.546e5, 6.69e4, 2.325e5, 1.367e5, 1.375e5, 1.27e5, 1.683e5, 1.441e5, 9.53e4, 1.834e5, 1.435e5, 6.85e4, 1.625e5, 1.661e5, 1.908e5, 2.431e5, 1.488e5, 3.036e5, 1.479e5, 1.5e5, 3.288e5, 7.08e4, 2.25e5, 1.375e5, 3.5e5, 3.742e5, 1.549e5, 4.5e5, 3.063e5, 2.051e5, 2.737e5, ...]
>,
#Nx.Tensor<
f64[4128]
EXLA.Backend<host:0, 0.54336858.2205024268.200771>
[8.55e4, 1.743e5, 5.65e4, 1.308e5, 1.375e5, 2.472e5, 2.25e5, 1.164e5, 500001.0, 1.154e5, 1.269e5, 1.229e5, 7.04e4, 3.534e5, 7.14e4, 4.083e5, 3.0e5, 1.648e5, 1.125e5, 2.028e5, 1.139e5, 2.15e5, 500001.0, 1.48e5, 1.683e5, 2.819e5, 3.338e5, 1.078e5, 3.277e5, 3.412e5, 3.289e5, 2.879e5, 2.545e5, 2.667e5, 6.82e4, 1.406e5, 2.795e5, 2.25e5, 1.17e5, 1.375e5, 3.115e5, 2.423e5, 4.93e5, 3.458e5, 1.542e5, 3.728e5, 6.6e4, 1.281e5, ...]
>}
Before we compute the correlation matrix, we will check if we have NaNs (Not a Number) in the data set.
y_nan_count = Nx.sum(Nx.is_nan(y))
x_nan_count = Nx.sum(Nx.is_nan(x))
{x_nan_count, y_nan_count}
{#Nx.Tensor<
u64
EXLA.Backend<host:0, 0.54336858.2205024268.200779>
414
>,
#Nx.Tensor<
u64
EXLA.Backend<host:0, 0.54336858.2205024268.200775>
0
>}
Oops, we have some. Fortunately, we don't have any NaNs in y. If we dig a little deeper, it turns out that the NaNs are in bedrooms_per_rooms (1st column) and total_bedrooms (10th column).
{bedrooms_per_rooms_idx, total_bedrooms_idx} = {0, 9}
bedrooms_per_rooms_nan_count = Nx.sum(Nx.is_nan(x[[.., bedrooms_per_rooms_idx]]))
total_bedrooms_nan_count = Nx.sum(Nx.is_nan(x[[.., total_bedrooms_idx]]))
Nx.equal(x_nan_count, Nx.add(bedrooms_per_rooms_nan_count, total_bedrooms_nan_count))
#Nx.Tensor<
u8
EXLA.Backend<host:0, 0.54336858.2205024268.200794>
1
>
For these two, we use Scholar.Impute.SimpleImputer
with the strategy set to the median of values. The fit
function learns the median of each feature, and transform
uses the trained imputer to replace all NaNs accordingly. It is important that we perform imputation after splitting the data, because otherwise information from the test data would leak into training.
x_train =
x_train
|> SimpleImputer.fit(strategy: :median)
|> SimpleImputer.transform(x_train)
x_test =
x_test
|> SimpleImputer.fit(strategy: :median)
|> SimpleImputer.transform(x_test)
#Nx.Tensor<
f64[4128][12]
EXLA.Backend<host:0, 0.54336858.2205024268.200878>
[
[0.20019126554032515, 548.0, 23.0, 36.32, -119.33, 2.5, 1446.0, 2.6386861313868613, 5.724452554744525, 628.0, 3137.0, 4.0],
[0.18461538461538463, 83.0, 34.0, 34.26, -118.44, 5.5124, 433.0, 5.216867469879518, 3.9156626506024095, 60.0, 325.0, 1.0],
[0.23362175525339926, 398.0, 35.0, 35.77, -119.25, 1.6786, 1449.0, 3.64070351758794, 4.065326633165829, 378.0, 1618.0, 4.0],
[0.20259128386336867, 313.0, 24.0, 37.8, -121.2, 3.5625, 927.0, 2.961661341853035, 5.424920127795527, 344.0, 1698.0, 4.0],
[0.42487046632124353, 155.0, ...],
...
]
>
Finally, we can compute the correlation matrix. We will use Scholar.Covariance
to calculate it.
correlation =
Nx.concatenate([x_train, Nx.new_axis(y_train, 1)], axis: 1)
|> Scholar.Covariance.correlation_matrix(biased: true)
#Nx.Tensor<
f64[13][13]
EXLA.Backend<host:0, 0.54336858.2205024268.200912>
[
[1.0000000000000002, 0.06144282875654085, 0.13311283521287637, -0.12216156759996018, 0.09984860365269956, -0.612630933680883, 0.03517997111673592, 0.004697459810143742, -0.4078383723411937, 0.08095666109618091, -0.18612145325855575, -0.11835385450623682, -0.25385512001635363],
[0.06144282875654085, 1.0, -0.305555948690877, -0.07566125267999342, 0.06170866872453172, 0.01398821983308101, 0.907460734474932, -0.02634672373043427, -0.07388109216765114, 0.9735036322842058, 0.9198029415934222, -0.04626792658479704, 0.0639675034251392],
[0.13311283521287637, -0.305555948690877, 0.9999999999999998, 0.017212690983879612, -0.11382253728911094, -0.12557649464959478, -0.2961765721596695, 0.012442541870155687, -0.14834326174239934, -0.32094539339262423, -0.3602627671081426, -0.11940626101050672, 0.10251505008793772],
[-0.12216156759996018, -0.07566125267999342, 0.017212690983879612, 0.9999999999999992, -0.92526150146814, -0.0709535112221358, -0.11317680203176562, 0.0032929374670018535, 0.10562463526360089, -0.07174931649052743, -0.038747214812891013, ...],
...
]
>
Maybe a visual representation would be nicer. 😅
{corr_size, _} = Nx.shape(correlation)
correlation_list = Nx.to_flat_list(correlation)
names = [
"Bedrooms per rooms",
"Households",
"Housing median age",
"Latitude",
"Longitude",
"Median income",
"Population",
"Population per family",
"Rooms per family",
"Total bedrooms",
"Total rooms",
"Ocean proximity",
"Median house value"
]
corr_to_plot =
DF.new(
x: List.flatten(List.duplicate(names, corr_size)),
y: List.flatten(for name <- names, do: List.duplicate(name, corr_size)),
corr_val: Enum.map(correlation_list, fn x -> Float.round(x, 2) end)
)
Tucan.heatmap(corr_to_plot, "x", "y", "corr_val",
annotate: true,
text_color: [{nil, 0, "white"}, {0, nil, "black"}]
)
|> Tucan.Scale.set_color_scheme(:viridis)
|> Tucan.set_size(630, 630)
|> Tucan.set_title("Correlation Matrix for California Housing", offset: 20)
We can spot that median_house_value is strongly correlated with median_income. It's pretty straightforward: the more money you have, the more expensive a house you can buy. Less obvious is the negative correlation with bedrooms_per_rooms, but it can also be explained. Bedrooms are the most crucial rooms in a house. First, you need to guarantee that the house has enough bedrooms. Once that condition is satisfied, you can focus on "additional rooms" like a lounge, a study, and so on. So if you buy a house with more additional rooms, the bedrooms-to-rooms ratio decreases.
Now we are ready to train a model for the median_house_value prediction. We will use linear regression. In the first step, we create the model by calling the fit
function.
model = LR.fit(x_train, y_train)
%Scholar.Linear.LinearRegression{
coefficients: #Nx.Tensor<
f64[12]
EXLA.Backend<host:0, 0.54336858.2205024268.201579>
[292661.1879978386, 111.46237796967083, 1170.1666131366273, -33322.29721781791, -35394.90464853701, 41384.64324873476, -41.31598131072877, 50.507900596456054, 3040.877192253343, 7.253624678472778, 3.272614960520949, -8059.519389730254]
>,
intercept: #Nx.Tensor<
f64
EXLA.Backend<host:0, 0.54336858.2205024268.201583>
-3101555.563125743
>
}
Now we can predict the values for the test set and measure the error of our prediction. We will calculate root mean square error (RMSE) and mean absolute error (MAE).
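For reference, these metrics follow their standard definitions, with $y_i$ the true values and $\hat{y}_i$ the predictions:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert$$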
predictions = LR.predict(model, x_test)
rmse = Regression.mean_square_error(y_test, predictions) |> Nx.sqrt()
mae = Regression.mean_absolute_error(y_test, predictions)
{rmse, mae}
{#Nx.Tensor<
f64
EXLA.Backend<host:0, 0.54336858.2205024268.201596>
67648.9435367406
>,
#Nx.Tensor<
f64
EXLA.Backend<host:0, 0.54336858.2205024268.201601>
48942.11908533544
>}
Ok, but is that a good or a poor estimate? Let's check the mean value of the target and compare it to the errors.
Nx.mean(y)
#Nx.Tensor<
f64
EXLA.Backend<host:0, 0.54336858.2205024268.201604>
206855.81690891474
>
For such a simple model as linear regression, it seems to be a pretty good result. But there is room to improve it. You can, for example, add some additional features to the data set, as sketched below. In the future, you will be able to try more complicated models, such as random forests.
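As one hedged illustration (an addition to the original notebook), the cell below appends the square of median_income as an extra feature and refits the model. It assumes median_income sits at column index 5 of x_train, following the same alphabetical column order as the correlation-matrix labels above; whether the errors actually improve is left for you to check.
# Hedged sketch: add median_income squared as an extra feature and refit.
# Column index 5 assumes the alphabetical feature order shown earlier.
add_income_squared = fn t ->
  Nx.concatenate([t, Nx.pow(t[[.., 5..5]], 2)], axis: 1)
end

x_train_ext = add_income_squared.(x_train)
x_test_ext = add_income_squared.(x_test)

ext_model = LR.fit(x_train_ext, y_train)
ext_predictions = LR.predict(ext_model, x_test_ext)

{Regression.mean_square_error(y_test, ext_predictions) |> Nx.sqrt(),
 Regression.mean_absolute_error(y_test, ext_predictions)}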