Scholar.Cluster.KMeans (Scholar v0.3.1)
K-Means Algorithm.
K-Means is a simple clustering method that works iteratively [1]. In the first iteration, centroids are chosen randomly from the input data. It turns out that some initializations are especially effective: in 2007, David Arthur and Sergei Vassilvitskii proposed an initialization called k-means++, which speeds up convergence of the algorithm drastically [2]. After initialization, each point is assigned to its closest centroid. Then each centroid is replaced by the center of mass of its associated points. These two steps are repeated until the solution converges. Since some initializations are unfortunate and converge to sub-optimal results, the whole procedure is repeated a few times and the best result is kept.
Average time complexity is $O(CKNI)$, where $C$ is the number of clusters, $N$ is the number of samples, $I$ is the number of iterations until convergence, and $K$ is the number of features. Space complexity is $O(K(N+C))$.
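To make the two steps concrete, here is a minimal sketch of a single assignment/update iteration written directly with Nx (for illustration only, not Scholar's internal implementation; it assumes every cluster keeps at least one point):

# Four 2-dimensional samples and two current centroids.
x = Nx.tensor([[1.0, 2.0], [2.0, 4.0], [1.0, 3.0], [2.0, 5.0]])
centroids = Nx.tensor([[1.0, 2.0], [2.0, 5.0]])

# Squared Euclidean distance from every sample to every centroid,
# shape {num_samples, num_clusters}.
diff = Nx.subtract(Nx.new_axis(x, 1), Nx.new_axis(centroids, 0))
dist = Nx.sum(Nx.multiply(diff, diff), axes: [2])

# Assignment step: each sample goes to its nearest centroid.
labels = Nx.argmin(dist, axis: 1)

# Update step: each centroid moves to the mean of its assigned samples.
means =
  for k <- 0..1 do
    mask = Nx.new_axis(Nx.equal(labels, k), 1)
    Nx.divide(Nx.sum(Nx.multiply(x, mask), axes: [0]), Nx.sum(mask))
  end

new_centroids = Nx.stack(means)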
References:
- [1] - K-Means Algorithm
- [2] - K-Means++ Initialization
Summary

Functions

fit(x, opts) - Fits a K-Means model for sample inputs x.

predict(model, x) - Makes predictions with the given model on inputs x.

transform(model, x) - Calculates distances between each sample from x and the calculated centroids.
Functions

fit(x, opts)

Fits a K-Means model for sample inputs x.
Options

- :num_clusters (pos_integer/0) - Required. The number of clusters to form as well as the number of centroids to generate.
- :max_iterations (pos_integer/0) - Maximum number of iterations of the k-means algorithm for a single run. The default value is 300.
- :num_runs (pos_integer/0) - Number of times the k-means algorithm will be run with different centroid seeds. The final result will be the best output of num_runs runs in terms of inertia. The default value is 10.
- :tol - Relative tolerance with regard to the Frobenius norm of the difference in the cluster centers of two consecutive iterations to declare convergence. The default value is 0.0001.
- :weights - The weights for each observation in x. If equal to nil, all observations are assigned equal weight.
- :init - Method for centroid initialization, either of:
  - :k_means_plus_plus - selects initial cluster centroids using sampling based on an empirical probability distribution of the points' contribution to the overall inertia. This technique speeds up convergence and is theoretically proven to be O(log(k))-optimal.
  - :random - chooses :num_clusters observations (rows) at random from the data for the initial centroids.
  The default value is :k_means_plus_plus.
- :key - Determines random number generation for centroid initialization. If the key is not provided, it is set to Nx.Random.key(System.system_time()).
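As an illustration of how these options combine, a hypothetical call might look as follows (the fitted clusters depend on the data and on the provided :key):

key = Nx.Random.key(0)
x = Nx.tensor([[1, 2], [2, 4], [1, 3], [2, 5]])

model =
  Scholar.Cluster.KMeans.fit(x,
    num_clusters: 2,
    num_runs: 5,
    max_iterations: 100,
    init: :random,
    key: key
  )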
Return Values
The function returns a struct with the following parameters:
- :clusters - Coordinates of cluster centers.
- :num_iterations - Number of iterations run.
- :inertia - Sum of squared distances of samples to their closest cluster center.
- :labels - Labels of each point.
Examples
iex> key = Nx.Random.key(42)
iex> x = Nx.tensor([[1, 2], [2, 4], [1, 3], [2, 5]])
iex> Scholar.Cluster.KMeans.fit(x, num_clusters: 2, key: key)
%Scholar.Cluster.KMeans{
  num_iterations: Nx.tensor(
    2
  ),
  clusters: Nx.tensor(
    [
      [1.0, 2.5],
      [2.0, 4.5]
    ]
  ),
  inertia: Nx.tensor(
    1.0
  ),
  labels: Nx.tensor(
    [0, 1, 0, 1]
  )
}
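The fields of the returned struct can be read directly, for example (reusing the tensors from the example above):

iex> key = Nx.Random.key(42)
iex> x = Nx.tensor([[1, 2], [2, 4], [1, 3], [2, 5]])
iex> model = Scholar.Cluster.KMeans.fit(x, num_clusters: 2, key: key)
iex> model.labels
Nx.tensor(
  [0, 1, 0, 1]
)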
predict(model, x)

Makes predictions with the given model on inputs x.
Return Values
It returns a tensor with clusters corresponding to the input.
Examples
iex> key = Nx.Random.key(42)
iex> x = Nx.tensor([[1, 2], [2, 4], [1, 3], [2, 5]])
iex> model = Scholar.Cluster.KMeans.fit(x, num_clusters: 2, key: key)
iex> Scholar.Cluster.KMeans.predict(model, Nx.tensor([[1.9, 4.3], [1.1, 2.0]]))
Nx.tensor(
  [1, 0]
)
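These predictions follow from the model fitted above: with Nx.Random.key(42) the cluster centers are [1.0, 2.5] (cluster 0) and [2.0, 4.5] (cluster 1), so [1.9, 4.3] is closest to cluster 1 and [1.1, 2.0] is closest to cluster 0.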
transform(model, x)

Calculates distances between each sample from x and the calculated centroids.
Return Values
It returns a tensor with corresponding distances.
Examples
iex> key = Nx.Random.key(40)
iex> model =
...> Scholar.Cluster.KMeans.fit(Nx.tensor([[1, 2], [2, 4], [1, 3], [2, 5]]),
...> num_clusters: 2,
...> key: key
...> )
iex> Scholar.Cluster.KMeans.transform(model, Nx.tensor([[1.0, 2.5]]))
Nx.tensor(
  [
    [2.2360680103302, 0.0]
  ]
)
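The second distance is 0.0 because [1.0, 2.5] coincides with one of the fitted centroids; the first is the Euclidean distance to the other centroid at [2.0, 4.5], i.e. $\sqrt{(1.0 - 2.0)^2 + (2.5 - 4.5)^2} = \sqrt{5} \approx 2.2360680$.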