View Source Scholar.Cluster.Hierarchical (Scholar v0.4.0)

Performs hierarchical, agglomerative clustering on a dataset.

Hierarchical clustering is good for when the number of clusters is not known ahead of time. It also allows for the creation of a dendrogram plot (regardless of the dimensionality of the dataset) which can be used to select the number of clusters in a post-processing step.

Limitations

Due to the requirements of the current implementation, only these options are supported:

dissimilarity: :euclidean
linkage: :average | :complete | :single | :weighted

Our current algorithm is $O(\frac{n^2}{p} \cdot \log(n))$ where $n$ is the number of data points and $p$ is the number of processors. This is better than the generic algorithm which is $O(n^3)$ . It is also parallel, which means that runtime decreases in direct proportion to the number of processors.

However, the implementation requires certain theoretical properties of the dissimilarities and linkages. As such, we've restricted the options to only those combinations with the correct properties.

In the future, we plan to add additional algorithms which won't have the same restrictions.

Summary

Functions

fit(data, opts \\ [])

Use hierarchical clustering to form the initial model to be clustered with labels_list/2 or labels_map/2.

labels_list(model, opts)

Cluster a Scholar.Cluster.Hierarchical struct into a list of cluster labels.

labels_map(model, opts)

Cluster a Scholar.Cluster.Hierarchical struct into a map of cluster labels to member indices.

Functions

fit(data, opts \\ [])

Use hierarchical clustering to form the initial model to be clustered with labels_list/2 or labels_map/2.

Options

:dissimilarity - Pairwise dissimilarity function: computes the 'dissimilarity' between each pair of data points. Dissimilarity is analogous to distance, but without the expectation that the triangle inequality holds.
Choices:
- :euclidean - L2 norm.
See "Limitations" in the moduledoc for an explanation of the lack of choices.
The default value is :euclidean.
:linkage - Linkage function: how to compute the intra-clade dissimilarity of two clades if they were merged.
Choices:
- :average - The unweighted average dissimilarity across all pairs of points.
- :complete - (Historic name) The maximum dissimilarity across all pairs of points.
- :single - (Historic name) The minimum dissimilarity across all pairs of points.
- :ward - (Named for Ward's method) The minimum increase in sum of squares (MISSQ) of dissimilarities.
- :weighted - The weighted average dissimilarity across all pairs of points.
The default value is :single.

Return values

Returns a Scholar.Cluster.Hierarchical struct with the following fields:

clades (Nx.Tensor with shape {n - 1, 2}) - Contains the indices of the pair of clades merged at each step of the agglomerative clustering process.
Agglomerative clustering starts by considering each datum in data its own singleton group or "clade". It then picks two clades to merge into a new clade containing the data from both. It does this until there is a single clade remaining.
Since each datum starts as its own clade, e.g. data[0] is clade 0, indexing of new clades starts at n where n is the size of the original data tensor. If clades[k] == [i, j], then clades i and j were merged to form k + n.
dissimilarities (Nx.Tensor with shape {n - 1}) - Contains a metric that measures the intra-clade closeness of each newly formed clade. Represented by the heights of each clade in a dendrogram plot. Determined by both the :dissimilarity and :linkage options.
num_points (pos_integer/0) - Number of points in the dataset. Must be $\geq 3$ .
sizes (Nx.Tensor with shape {n - 1}) - sizes[i] is the size of clade i. If clade k was created by merging clades i and j, then sizes[k] == sizes[i] + sizes[j].

Examples

iex> data = Nx.tensor([[2], [7], [9], [0], [3]])
iex> Hierarchical.fit(data)
%Scholar.Cluster.Hierarchical{
  clades: Nx.tensor([[0, 4], [1, 2], [3, 5], [6, 7]]),
  dissimilarities: Nx.tensor([1.0, 2.0, 2.0, 4.0]),
  num_points: 5,
  sizes: Nx.tensor([2, 2, 3, 5])
}

labels_list(model, opts)

Cluster a Scholar.Cluster.Hierarchical struct into a list of cluster labels.

Options

:cluster_by (non-empty keyword/0) - Required. How to select which clades from the dendrogram should form the final clusters. Must provide either a height or a number of clusters.
- :height (float/0) - Height of the dendrogram to use as the split point for clusters.
- :num_clusters (pos_integer/0) - Number of clusters to form.

Return values

Returns a list of length n and values 0..(k - 1) where n is the number of data points and k is the number of clusters formed. The ith element of the result list is the label of the ith data point's cluster.

Cluster labels are arbitrary, but deterministic.

Examples

iex> data = Nx.tensor([[5], [5], [5], [10], [10]])
iex> model = Hierarchical.fit(data)
iex> Hierarchical.labels_list(model, cluster_by: [num_clusters: 2])
[0, 0, 0, 1, 1]

labels_map(model, opts)

Cluster a Scholar.Cluster.Hierarchical struct into a map of cluster labels to member indices.

Options

:cluster_by (non-empty keyword/0) - Required. How to select which clades from the dendrogram should form the final clusters. Must provide either a height or a number of clusters.
- :height (float/0) - Height of the dendrogram to use as the split point for clusters.
- :num_clusters (pos_integer/0) - Number of clusters to form.

Return values

Returns a map where the keys are integers from 0..(k - 1) where k is the number of clusters. Each value is a cluster represented by a list of member indices. E.g. if the result map was %{0 => [0, 1], 1 => [2]}, then elements [0, 1] of the data would be in cluster 0 and the singleton element [2] would be in cluster 1.

Cluster labels are arbitrary, but deterministic.

Examples

iex> data = Nx.tensor([[5], [5], [5], [10], [10]])
iex> model = Hierarchical.fit(data)
iex> Hierarchical.labels_map(model, cluster_by: [num_clusters: 2])
%{0 => [0, 1, 2], 1 => [3, 4]}