View Source hyper (hyper v1.0.1)

hyper is the reference implementation of hyperloglog for erlang

Hyperloglog is an algorithm based on a probabilistic datastructure to solve the count-distinct problem. It allows an approximation of the distinct element of a large collection, with a limited memory usage and the ability to be distributed and merged losslessly.

This module is the interface you use to interact with the hyperloglog. hyper accept multiple backend for the implementation of hyperloglog, each with a different set of tradeoff and performance.

For more details on how to use it, look at new/3, insert/2 and card/1

how-does-it-work

How does it work

Without entering in all the details, the hyperloglog datastructure is based on a limited set of elements. Details of how these are implemented will vary between different backend and implementations, but the high level idea stay the same. - A hash function that output a 64bit hash - A set of numbered "bins", controlled by the precision - An estimator function

For every element to insert, the element is hashed into a binary N, then the leftmost P bits are used to define which bin the element belong. In this bin, we store the number of zeroes before the first 1 in the leftmost bit of the leftover, after truncating the leftmost N bit. If this number is bigger than the one currently in the bin, we update it. Otherwise we do nothing. Our input is inserted.

Later, when we want to get an estimation of the cardinality, we apply the estimator function to the set of bins. The details of the estimator are too complex for this documentation, but suffice to say that they are proven to reduce the estimation error to stay under a tight boundary.

references

References

For more details, here is a list of reference papers that have been used to implement this library.

In particular huge thanks to Omar Ertl for his clear work on this domain.

Link to this section Summary

Functions

return the size in bytes of the hyper structure in memory
Estimate the cardinality of a hyper structure
compact the underlying hyper structure. Do not use, considered an implementation detail of the backend. Will be removed in the future.
from_json(Struct) deprecated

Deserialize the json friendly format from to_json/1 to a hyper structure

from_json(_, Mod) deprecated

Deserialize the json friendly format from to_json/1 to a hyper structure, with Mod as backend

Insert Value inside the Hyper structure passed.
insert a list of values in the Hyper structure.
Provide the cardinality of the intersection of two hyper structure using the inclusion/exclusion principle. Use with caution as the precision of this methods is fairly limited.
return true if the structure passed is a hyper structure. false otherwise

Create a new Hyperloglog structure Equivalent to hyper:new(P, hyper_binary,sha-v1')'

Create a new Hyperloglog structure with the specified backend Equivalent to hyper:new(P, Mod,sha-v1')'

Create a new Hyperloglog structure with the specified backend and the specified version.

return the precision of the hyper structure passed.
Reduce the precision of the hyper structure to P. This is lossless in the sense that the old structure hold all the data needed to fill the reduced precision one. That said, reducing precision does grow the error in the estimator, and there is no way to get it back. Do this knowing you are losing precision.
to_json(Hyper) deprecated

Serialize the hyper structure to a json friendly format that can be decoded with from_json/2

Link to this section Types

-type filter() :: #hyper{}.
-type precision() :: 4..16.
-type registers() :: any().
-type value() :: binary().
-type version() :: atom().

Link to this section Functions

return the size in bytes of the hyper structure in memory
-spec card(filter()) -> float().
Estimate the cardinality of a hyper structure
compact the underlying hyper structure. Do not use, considered an implementation detail of the backend. Will be removed in the future.
This function is deprecated. 0.6.0.
-spec from_json(any()) -> filter().

Deserialize the json friendly format from to_json/1 to a hyper structure

Equivalent to from_json(Struct, hyper_binary)

Do not use, this is going to be replaced with better solution in the future
This function is deprecated. 0.6.0.
-spec from_json(any(), module()) -> filter().

Deserialize the json friendly format from to_json/1 to a hyper structure, with Mod as backend

Do not use, this is going to be replaced with better solution in the future
-spec insert(value(), filter()) -> filter().
Insert Value inside the Hyper structure passed.
-spec insert_many([value()], filter()) -> filter().
insert a list of values in the Hyper structure.
Link to this function

intersect_card(Left, Right)

View Source
-spec intersect_card(filter(), filter()) -> float().
Provide the cardinality of the intersection of two hyper structure using the inclusion/exclusion principle. Use with caution as the precision of this methods is fairly limited.
-spec is_hyper(filter()) -> boolean().
return true if the structure passed is a hyper structure. false otherwise
-spec new(precision()) -> filter().

Create a new Hyperloglog structure Equivalent to hyper:new(P, hyper_binary,sha-v1')'

see hyper:new/3
-spec new(precision(), module()) -> filter().

Create a new Hyperloglog structure with the specified backend Equivalent to hyper:new(P, Mod,sha-v1')'

see hyper:new/3
-spec new(precision(), module(), version()) -> filter().

Create a new Hyperloglog structure with the specified backend and the specified version.

hyper ship with one backend: - hyper_binary

For implementing your own backend, see hyper_register

%% hyper structure with different version will work transparently for now, but will generate deprecation warning in the future if we change some of the implementation details.

You should use new/2 most of the time, so that the versionning is handled by hyper. You should only use new/3 if you need to do your own versionning for custom backends or for custom versionning problems.
return the precision of the hyper structure passed.
Link to this function

reduce_precision(P, Hyper)

View Source
Reduce the precision of the hyper structure to P. This is lossless in the sense that the old structure hold all the data needed to fill the reduced precision one. That said, reducing precision does grow the error in the estimator, and there is no way to get it back. Do this knowing you are losing precision.
This function is deprecated. 0.6.0.
-spec to_json(filter()) -> any().

Serialize the hyper structure to a json friendly format that can be decoded with from_json/2

Do not use, this is going to be replaced with better solution in the future
-spec union([filter()]) -> filter().