Scholar.Cluster.Hierarchical (Scholar v0.3.0)
Performs hierarchical, agglomerative clustering on a dataset.
Hierarchical clustering is useful when the number of clusters is not known ahead of time. It also allows for the creation of a dendrogram plot (regardless of the dimensionality of the dataset), which can be used to select the number of clusters in a post-processing step.
Limitations

Due to the requirements of the current implementation, only these options are supported:

  * `dissimilarity`: `:euclidean`
  * `linkage`: `:average` | `:complete` | `:single` | `:weighted`
Our current algorithm is $O(\frac{n^2}{p} \cdot \log(n))$ where $n$ is the number of data points and $p$ is the number of processors. This is better than the generic algorithm, which is $O(n^3)$. It is also parallel, so runtime decreases roughly in proportion to the number of processors.
However, the implementation requires certain theoretical properties of the dissimilarities and linkages. As such, we've restricted the options to only those combinations with the correct properties.
In the future, we plan to add additional algorithms which won't have the same restrictions.
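As a sketch of the supported combinations, a call restricted to the options above might look like the following (the data values here are illustrative, not from the source docs; output is omitted since it depends on the linkage chosen):

```elixir
alias Scholar.Cluster.Hierarchical

data = Nx.tensor([[2.0], [7.0], [9.0], [0.0], [3.0]])

# Only :euclidean dissimilarity and the four listed linkages are supported
# by the current implementation.
model = Hierarchical.fit(data, dissimilarity: :euclidean, linkage: :average)
```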
Summary

Functions

  * `fit/2` - Use hierarchical clustering to form the initial model to be clustered with `labels_list/2` or `labels_map/2`.
  * `labels_list/2` - Cluster a `Scholar.Cluster.Hierarchical` struct into a list of cluster labels.
  * `labels_map/2` - Cluster a `Scholar.Cluster.Hierarchical` struct into a map of cluster labels to member indices.

Functions

fit(data, opts \\ [])

Use hierarchical clustering to form the initial model to be clustered with `labels_list/2` or `labels_map/2`.
Options

  * `:dissimilarity` - Pairwise dissimilarity function: computes the "dissimilarity" between each pair of data points. Dissimilarity is analogous to distance, but without the expectation that the triangle inequality holds. See "Limitations" in the moduledoc for an explanation of the lack of choices. The default value is `:euclidean`.

    Choices:

      * `:euclidean` - L2 norm.

  * `:linkage` - Linkage function: how to compute the intra-clade dissimilarity of two clades if they were merged. The default value is `:single`.

    Choices:

      * `:average` - The unweighted average dissimilarity across all pairs of points.
      * `:complete` - (Historic name) The maximum dissimilarity across all pairs of points.
      * `:single` - (Historic name) The minimum dissimilarity across all pairs of points.
      * `:ward` - (Named for Ward's method) The minimum increase in sum of squares (MISSQ) of dissimilarities.
      * `:weighted` - The weighted average dissimilarity across all pairs of points.
Return values

Returns a `Scholar.Cluster.Hierarchical` struct with the following fields:

  * `clades` (`Nx.Tensor` with shape `{n - 1, 2}`) - Contains the indices of the pair of clades merged at each step of the agglomerative clustering process.

    Agglomerative clustering starts by considering each datum in `data` its own singleton group or "clade". It then picks two clades to merge into a new clade containing the data from both. It does this until there is a single clade remaining.

    Since each datum starts as its own clade, e.g. `data[0]` is clade `0`, indexing of new clades starts at `n` where `n` is the size of the original `data` tensor. If `clades[k] == [i, j]`, then clades `i` and `j` were merged to form `k + n`.

  * `dissimilarities` (`Nx.Tensor` with shape `{n - 1}`) - Contains a metric that measures the intra-clade closeness of each newly formed clade. Represented by the heights of each clade in a dendrogram plot. Determined by both the `:dissimilarity` and `:linkage` options.

  * `num_points` (`pos_integer/0`) - Number of points in the dataset. Must be $\geq 3$.

  * `sizes` (`Nx.Tensor` with shape `{n - 1}`) - `sizes[i]` is the size of clade `i`. If clade `k` was created by merging clades `i` and `j`, then `sizes[k] == sizes[i] + sizes[j]`.
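The clade-indexing scheme can be illustrated with plain Elixir. This sketch replays the merges recorded in a `clades` list (here written out as a literal; `Nx.to_list(model.clades)` would produce the same shape from the tensor) to recover each clade's member indices:

```elixir
# `clades` as a plain list, e.g. from Nx.to_list(model.clades).
clades = [[0, 4], [1, 2], [3, 5], [6, 7]]
n = 5

# Start with each datum as its own singleton clade, then replay the merges.
# The clade formed at step k (zero-based) gets index k + n and contains the
# members of clades i and j.
members =
  clades
  |> Enum.with_index(n)
  |> Enum.reduce(Map.new(0..(n - 1), &{&1, [&1]}), fn {[i, j], k}, acc ->
    Map.put(acc, k, Enum.sort(acc[i] ++ acc[j]))
  end)

members[7]
# => [0, 3, 4]
```

The final clade (`members[8]` here) always contains every data point, since agglomeration runs until a single clade remains.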
Examples

```elixir
iex> data = Nx.tensor([[2], [7], [9], [0], [3]])
iex> Hierarchical.fit(data)
%Scholar.Cluster.Hierarchical{
  clades: Nx.tensor([[0, 4], [1, 2], [3, 5], [6, 7]]),
  dissimilarities: Nx.tensor([1.0, 2.0, 2.0, 4.0]),
  num_points: 5,
  sizes: Nx.tensor([2, 2, 3, 5])
}
```
labels_list(model, opts)

Cluster a `Scholar.Cluster.Hierarchical` struct into a list of cluster labels.

Options

  * `:cluster_by` (non-empty `keyword/0`) - Required. How to select which clades from the dendrogram should form the final clusters. Must provide either a height or a number of clusters.

      * `:height` (`float/0`) - Height of the dendrogram to use as the split point for clusters.
      * `:num_clusters` (`pos_integer/0`) - Number of clusters to form.
Return values

Returns a list of length `n` and values `0..(k - 1)` where `n` is the number of data points and `k` is the number of clusters formed. The `i`th element of the result list is the label of the `i`th data point's cluster.

Cluster labels are arbitrary, but deterministic.
Examples

```elixir
iex> data = Nx.tensor([[5], [5], [5], [10], [10]])
iex> model = Hierarchical.fit(data)
iex> Hierarchical.labels_list(model, cluster_by: [num_clusters: 2])
[0, 0, 0, 1, 1]
```
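The `:height` form of `:cluster_by` can be sketched the same way. Cutting this dendrogram at a height between its merge dissimilarities should yield the same two clusters, though the cut height below is a hypothetical illustration (not from the source docs), so no expected output is shown:

```elixir
alias Scholar.Cluster.Hierarchical

model = Hierarchical.fit(Nx.tensor([[5], [5], [5], [10], [10]]))

# Cut the dendrogram at a hypothetical height; clades whose merge
# dissimilarity exceeds the cut remain separate clusters.
Hierarchical.labels_list(model, cluster_by: [height: 2.5])
```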
labels_map(model, opts)

Cluster a `Scholar.Cluster.Hierarchical` struct into a map of cluster labels to member indices.

Options

  * `:cluster_by` (non-empty `keyword/0`) - Required. How to select which clades from the dendrogram should form the final clusters. Must provide either a height or a number of clusters.

      * `:height` (`float/0`) - Height of the dendrogram to use as the split point for clusters.
      * `:num_clusters` (`pos_integer/0`) - Number of clusters to form.
Return values

Returns a map where the keys are integers from `0..(k - 1)` where `k` is the number of clusters. Each value is a cluster represented by a list of member indices. E.g. if the result map was `%{0 => [0, 1], 1 => [2]}`, then elements `[0, 1]` of the data would be in cluster `0` and the singleton element `[2]` would be in cluster `1`.

Cluster labels are arbitrary, but deterministic.
Examples

```elixir
iex> data = Nx.tensor([[5], [5], [5], [10], [10]])
iex> model = Hierarchical.fit(data)
iex> Hierarchical.labels_map(model, cluster_by: [num_clusters: 2])
%{0 => [0, 1, 2], 1 => [3, 4]}
```
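Since `labels_map/2` and `labels_list/2` describe the same clustering, either form can be derived from the other. A sketch of converting the map form back into a labels list with plain Elixir (the literal map below mirrors the example output above):

```elixir
labels_map = %{0 => [0, 1, 2], 1 => [3, 4]}

# Pair each member index with its cluster label, sort by index,
# then keep only the labels.
labels_list =
  labels_map
  |> Enum.flat_map(fn {label, indices} -> Enum.map(indices, &{&1, label}) end)
  |> Enum.sort()
  |> Enum.map(fn {_index, label} -> label end)

# => [0, 0, 0, 1, 1]
```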