t_digest v0.1.1 TDigest
An implementation of Ted Dunning’s “t-digest”, a structure for computing rank- based statistics on large datasets.
Key features:
- smaller estimation error in the tails of the distribution
- composable (multiple t-digests can be joined into one)
For more details, see the original paper:
https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf
Use:
iex> t = TDigest.new
...> t = TDigest.update(t, 1..1000)
#TDigest<[count: 1000, clusters: 1000, delta: 0.1, p1: 10.5, p5: 50.5,
p25: 250.5, p50: 500.5, p75: 750.5, p95: 950.5, p99: 990.5]>
Adding sequential data results in overlarge datastructures, as noted
by Dunning in his paper. A call to compress/1
after such an update corrects
this issue (note the reduction in number of clusters above and below 1000 -> 77):
...> TDigest.compress(t)
#TDigest<[count: 1000, clusters: 77, delta: 0.1, p1: 10.5,
p5: 50.50000000000001, p25: 250.89542483660134, p50: 500.25494225408204,
p75: 750.6740530303031, p95: 950.5, p99: 990.5]>
With a dataset loaded, this module exposes two query methods, percentile/2
and quantile/2
which are inverses of one another.
...> TDigest.percentile(t, 0.35)
350.2636503460651
...> TDigest.quantile(t, 350)
0.34973140759655286
The real benefit of the t-digest is the size reduction:
iex> data = for _ <- 1..1_000_000, do: :rand.normal()
...> t = TDigest.new(0.1) |> TDigest.update(data)
#TDigest<[count: 1000000, clusters: 163, delta: 0.1, p1: -2.327118447306301,
p5: -1.6476559128477877, p25: -0.6783902616363299, p50: -0.001392334178668811,
p75: 0.6739971826231269, p95: 1.644470629820828, p99: 2.324171666803378]>
As the inspected result shows, the t-digest is only holding 163 clusters to
represent one million observations, where each cluster is a tuple
{center, weight}
: center is a float, and weight an integer. Even using
Erlang’s naive term format, the datastructure is only 2.4k.
...> :erlang.term_to_binary(t) |> byte_size
2442
By changing the compression parameter delta, resolution and space can be traded:
...> t2 = TDigest.new(0.01) |> TDigest.update(data)
#TDigest<[count: 1000000, clusters: 1344, delta: 0.01, p1: -2.3257785556861186,
p5: -1.646212495824313, p25: -0.6766720390117584, p50: -0.0014221336729099022,
p75: 0.672790746292638, p95: 1.6437912633928424, p99: 2.3237613045306835]>
...> :erlang.term_to_binary(t2) |> byte_size
19181
Implementation Notes:
My requirements for this datastructure don’t specify performance, so the current implementation is backed by a list of 2-tuples, rather than the balanced tree suggested by Dunning. The list implies most operations will be O(n) for n clusters. For a high-throughput use case, you may wish to choose another datastructure, or submit a pull request updating this library with a higher performance backend.
Summary
Functions
Compresses a T-Digest by inserting each cluster into a new T-Digest in a randomized order. According to Dunning: “This final pass typically reduces the number of centroids by 20-40% with no apparent change in accuracy”
Creates an empty T-Digest datastructure with the specified delta compression parameter
Returns the value associated to a particular percentile within a distribution.
Percentiles are given as values in the interval [0, 1], and return values are
commensurate with the datapoints added using update/2
Returns the quantile of a particular value within a distribution. Return values fall in the interval [0, 1]
Adds one or more observed values to the datastructure
Functions
Compresses a T-Digest by inserting each cluster into a new T-Digest in a randomized order. According to Dunning: “This final pass typically reduces the number of centroids by 20-40% with no apparent change in accuracy”.
Highly recommended to run once after bulk-updating with sequential data. Repeated compression runs are discouraged.
Creates an empty T-Digest datastructure with the specified delta compression parameter.
iex> TDigest.new
#TDigest<[count: 0, clusters: 0, delta: 0.1, p1: nil, p5: nil, p25: nil,
p50: nil, p75: nil, p95: nil, p99: nil]>
Returns the value associated to a particular percentile within a distribution.
Percentiles are given as values in the interval [0, 1], and return values are
commensurate with the datapoints added using update/2
.
iex> data = for _ <- 1..1000, do: :rand.uniform(100)
...> t = TDigest.new |> TDigest.update(data)
...> TDigest.percentile(t, 0.9)
89.9831029185868
...> TDigest.percentile(t, 0.5)
48.20593708462561
Logical inverse of quantile/2
Returns the quantile of a particular value within a distribution. Return values fall in the interval [0, 1].
iex> data = for _ <- 1..1000, do: :rand.uniform(100)
...> t = TDigest.new |> TDigest.update(data)
...> TDigest.quantile(t, 10)
0.10722356495468277
...> TDigest.quantile(t, 50)
0.5207628183923255
Logical inverse of percentile/2
Adds one or more observed values to the datastructure.
data
can be a single value, a 2-tuple of {value, weight}
, another
T-Digest, or a list containing any of these.
Examples:
# adding a single value:
iex> t = TDigest.new
...> t1 = TDigest.update(t, 1)
#TDigest<[count: 1, clusters: 1, delta: 0.1, p1: 1, p5: 1, p25: 1, p50: 1,
p75: 1, p95: 1, p99: 1]>
# adding a list of values:
...> t2 = TDigest.update(t, 1..4)
#TDigest<[count: 4, clusters: 4, delta: 0.1, p1: 1, p5: 1, p25: 1.5, p50: 2.5,
p75: 3.5, p95: 4, p99: 4]>
# adding a list of T-Digests:
...> TDigest.update(t, [t1, t2])
#TDigest<[count: 5, clusters: 5, delta: 0.1, p1: 1, p5: 1, p25: 1.0, p50: 3,
p75: 3.25, p95: 4, p99: 4]>
Be sure to use compress/1
after calling this method with sequntial data