View Source hyper (hyper v1.0.1)
hyper
is the reference implementation of hyperloglog for erlang
Hyperloglog is an algorithm based on a probabilistic datastructure to solve the count-distinct problem. It allows an approximation of the distinct element of a large collection, with a limited memory usage and the ability to be distributed and merged losslessly.
This module is the interface you use to interact with the hyperloglog. hyper
accept multiple backend for the implementation of hyperloglog, each with a different set of tradeoff and performance.
For more details on how to use it, look at new/3
, insert/2
and card/1
how-does-it-work
How does it work
Without entering in all the details, the hyperloglog datastructure is based on a limited set of elements. Details of how these are implemented will vary between different backend and implementations, but the high level idea stay the same. - A hash function that output a 64bit hash - A set of numbered "bins", controlled by the precision - An estimator function
For every element to insert, the element is hashed into a binary N, then the leftmost P bits are used to define which bin the element belong. In this bin, we store the number of zeroes before the first 1 in the leftmost bit of the leftover, after truncating the leftmost N bit. If this number is bigger than the one currently in the bin, we update it. Otherwise we do nothing. Our input is inserted.
Later, when we want to get an estimation of the cardinality, we apply the estimator function to the set of bins. The details of the estimator are too complex for this documentation, but suffice to say that they are proven to reduce the estimation error to stay under a tight boundary.
references
References
For more details, here is a list of reference papers that have been used to implement this library.
In particular huge thanks to Omar Ertl for his clear work on this domain.Link to this section Summary
Functions
hyper
structure in memoryhyper
structurehyper
structure. Do not use, considered an implementation detail of the backend. Will be removed in the future.Deserialize the json friendly format from to_json/1
to a hyper
structure
Deserialize the json friendly format from to_json/1
to a hyper
structure, with Mod
as backend
Value
inside the Hyper structure passed.hyper
structure using the inclusion/exclusion principle. Use with caution as the precision of this methods is fairly limited.true
if the structure passed is a hyper
structure. false
otherwiseCreate a new Hyperloglog structure Equivalent to hyper:new(P, hyper_binary,
sha-v1')'
Create a new Hyperloglog structure with the specified backend Equivalent to hyper:new(P, Mod,
sha-v1')'
Create a new Hyperloglog structure with the specified backend and the specified version.
hyper
structure passed.hyper
structure to P. This is lossless in the sense that the old structure hold all the data needed to fill the reduced precision one. That said, reducing precision does grow the error in the estimator, and there is no way to get it back. Do this knowing you are losing precision.Serialize the hyper
structure to a json friendly format that can be decoded with from_json/2
Link to this section Types
-type filter() :: #hyper{}.
-type precision() :: 4..16.
-type registers() :: any().
-type value() :: binary().
-type version() :: atom().
Link to this section Functions
hyper
structure in memory
-spec card(filter()) -> float().
hyper
structure
hyper
structure. Do not use, considered an implementation detail of the backend. Will be removed in the future.
-spec from_json(any()) -> filter().
Deserialize the json friendly format from to_json/1
to a hyper
structure
Equivalent to from_json(Struct, hyper_binary)
-spec from_json(any(), module()) -> filter().
Deserialize the json friendly format from to_json/1
to a hyper
structure, with Mod
as backend
Value
inside the Hyper structure passed.
hyper
structure using the inclusion/exclusion principle. Use with caution as the precision of this methods is fairly limited.
-spec is_hyper(filter()) -> boolean().
true
if the structure passed is a hyper
structure. false
otherwise
Create a new Hyperloglog structure Equivalent to hyper:new(P, hyper_binary,
sha-v1')'
hyper:new/3
Create a new Hyperloglog structure with the specified backend Equivalent to hyper:new(P, Mod,
sha-v1')'
hyper:new/3
Create a new Hyperloglog structure with the specified backend and the specified version.
hyper
ship with one backend: - hyper_binary
For implementing your own backend, see hyper_register
%% hyper structure with different version will work transparently for now, but will generate deprecation warning in the future if we change some of the implementation details.
You should usenew/2
most of the time, so that the versionning is handled by hyper
. You should only use new/3
if you need to do your own versionning for custom backends or for custom versionning problems.
hyper
structure passed.
hyper
structure to P. This is lossless in the sense that the old structure hold all the data needed to fill the reduced precision one. That said, reducing precision does grow the error in the estimator, and there is no way to get it back. Do this knowing you are losing precision.
-spec to_json(filter()) -> any().
Serialize the hyper
structure to a json friendly format that can be decoded with from_json/2