ExDataSketch.HLL (ExDataSketch v0.7.1)

Copy Markdown View Source

HyperLogLog (HLL) sketch for cardinality estimation.

HLL provides approximate distinct-count estimates using sublinear memory. The precision parameter p controls the trade-off between memory usage and accuracy: higher p means more memory but better estimates.

Memory and Accuracy

  • Register count: m = 2^p
  • Memory: m bytes (one byte per register in v1 format)
  • Relative standard error: approximately 1.04 / sqrt(m)
pRegistersMemory~Error
101,0241 KiB3.25%
124,0964 KiB1.63%
1416,38416 KiB0.81%
1665,53664 KiB0.41%

Binary State Layout (v1)

All multi-byte fields are little-endian.

Offset  Size    Field
------  ------  -----
0       1       Version (u8, currently 1)
1       1       Precision p (u8, 4..16)
2       2       Reserved flags (u16 little-endian, must be 0)
4       m       Registers (m = 2^p bytes, one u8 per register)

Total: 4 + 2^p bytes.

Options

Merge Properties

HLL merge is associative and commutative (register-wise max). This means sketches can be merged in any order or grouping and produce the same result, making HLL safe for parallel and distributed aggregation.

Summary

Functions

Deserializes an EXSK binary into an HLL sketch.

Deserializes an Apache DataSketches HLL binary.

Estimates the number of distinct items in the sketch.

Creates a new HLL sketch from an enumerable of items.

Merges two HLL sketches.

Merges a non-empty enumerable of HLL sketches into one.

Returns a 2-arity merge function suitable for combining sketches.

Creates a new HLL sketch.

Returns a 2-arity reducer function suitable for Enum.reduce/3 and similar.

Serializes the sketch to the ExDataSketch-native EXSK binary format.

Serializes the sketch to Apache DataSketches HLL format.

Returns the size of the sketch state in bytes.

Updates the sketch with a single item.

Updates the sketch with multiple items in a single pass.

Types

t()

@type t() :: %ExDataSketch.HLL{backend: module(), opts: keyword(), state: binary()}

Functions

deserialize(binary)

@spec deserialize(binary()) :: {:ok, t()} | {:error, Exception.t()}

Deserializes an EXSK binary into an HLL sketch.

Returns {:ok, sketch} on success or {:error, reason} on failure.

Examples

iex> ExDataSketch.HLL.deserialize(<<"invalid">>)
{:error, %ExDataSketch.Errors.DeserializationError{message: "deserialization failed: invalid magic bytes, expected EXSK"}}

deserialize_datasketches(binary)

@spec deserialize_datasketches(binary()) :: {:ok, t()} | {:error, Exception.t()}

Deserializes an Apache DataSketches HLL binary.

Not implemented. See serialize_datasketches/1 for details.

Examples

iex> try do
...>   ExDataSketch.HLL.deserialize_datasketches(<<>>)
...> rescue
...>   e in ExDataSketch.Errors.NotImplementedError -> e.message
...> end
"ExDataSketch.HLL.deserialize_datasketches is not yet implemented"

estimate(hll)

@spec estimate(t()) :: float()

Estimates the number of distinct items in the sketch.

Returns a floating-point estimate. The accuracy depends on the precision parameter p.

Examples

iex> ExDataSketch.HLL.new(p: 10) |> ExDataSketch.HLL.estimate()
0.0

from_enumerable(enumerable, opts \\ [])

@spec from_enumerable(
  Enumerable.t(),
  keyword()
) :: t()

Creates a new HLL sketch from an enumerable of items.

Equivalent to new(opts) |> update_many(enumerable).

Options

Same as new/1.

Examples

iex> sketch = ExDataSketch.HLL.from_enumerable(["a", "b", "c"], p: 10)
iex> ExDataSketch.HLL.estimate(sketch) > 0.0
true

merge(sketch, hll)

@spec merge(t(), t()) :: t()

Merges two HLL sketches.

Both sketches must have the same precision p. The result contains the register-wise maximum, which corresponds to the union of the two input multisets.

Returns the merged sketch. Raises ExDataSketch.Errors.IncompatibleSketchesError if the sketches have different parameters.

Examples

iex> a = ExDataSketch.HLL.new(p: 10) |> ExDataSketch.HLL.update("x")
iex> b = ExDataSketch.HLL.new(p: 10) |> ExDataSketch.HLL.update("y")
iex> merged = ExDataSketch.HLL.merge(a, b)
iex> ExDataSketch.HLL.estimate(merged) >= ExDataSketch.HLL.estimate(a)
true

merge_many(sketches)

@spec merge_many(Enumerable.t()) :: t()

Merges a non-empty enumerable of HLL sketches into one.

Raises Enum.EmptyError if the enumerable is empty.

Examples

iex> a = ExDataSketch.HLL.new(p: 10) |> ExDataSketch.HLL.update("x")
iex> b = ExDataSketch.HLL.new(p: 10) |> ExDataSketch.HLL.update("y")
iex> merged = ExDataSketch.HLL.merge_many([a, b])
iex> ExDataSketch.HLL.estimate(merged) > 0.0
true

merger(opts \\ [])

@spec merger(keyword()) :: (t(), t() -> t())

Returns a 2-arity merge function suitable for combining sketches.

The returned function calls merge/2 on two sketches.

Examples

iex> is_function(ExDataSketch.HLL.merger(), 2)
true

new(opts \\ [])

@spec new(keyword()) :: t()

Creates a new HLL sketch.

Options

  • :p - precision parameter, integer 4..16 (default: 14). Higher values use more memory but give better accuracy.
  • :backend - backend module (default: ExDataSketch.Backend.Pure).
  • :hash_fn - custom hash function (term -> non_neg_integer).
  • :seed - hash seed (default: 0).

Examples

iex> sketch = ExDataSketch.HLL.new(p: 10)
iex> sketch.opts[:p]
10
iex> ExDataSketch.HLL.size_bytes(sketch)
1028

reducer()

@spec reducer() :: (term(), t() -> t())

Returns a 2-arity reducer function suitable for Enum.reduce/3 and similar.

The returned function calls update/2 on each item.

Examples

iex> is_function(ExDataSketch.HLL.reducer(), 2)
true

serialize(hll)

@spec serialize(t()) :: binary()

Serializes the sketch to the ExDataSketch-native EXSK binary format.

The serialized binary includes magic bytes, version, sketch type, parameters, and state. See ExDataSketch.Codec for format details.

Examples

iex> sketch = ExDataSketch.HLL.new(p: 10)
iex> binary = ExDataSketch.HLL.serialize(sketch)
iex> <<"EXSK", _rest::binary>> = binary
iex> byte_size(binary) > 0
true

serialize_datasketches(hll)

@spec serialize_datasketches(t()) :: binary()

Serializes the sketch to Apache DataSketches HLL format.

Not implemented. Apache DataSketches HLL interop is not planned for the current release series. Only Theta sketches support DataSketches interop via ExDataSketch.Theta.serialize_datasketches/1. For HLL serialization, use serialize/1 (ExDataSketch-native EXSK format).

Examples

iex> try do
...>   sketch = %ExDataSketch.HLL{state: <<>>, opts: [p: 14], backend: nil}
...>   ExDataSketch.HLL.serialize_datasketches(sketch)
...> rescue
...>   e in ExDataSketch.Errors.NotImplementedError -> e.message
...> end
"ExDataSketch.HLL.serialize_datasketches is not yet implemented"

size_bytes(hll)

@spec size_bytes(t()) :: non_neg_integer()

Returns the size of the sketch state in bytes.

Examples

iex> ExDataSketch.HLL.new(p: 10) |> ExDataSketch.HLL.size_bytes()
1028

update(sketch, item)

@spec update(t(), term()) :: t()

Updates the sketch with a single item.

The item is hashed using ExDataSketch.Hash.hash64/1 before being inserted into the sketch.

Examples

iex> sketch = ExDataSketch.HLL.new(p: 10) |> ExDataSketch.HLL.update("hello")
iex> ExDataSketch.HLL.estimate(sketch) > 0.0
true

update_many(sketch, items)

@spec update_many(t(), Enumerable.t()) :: t()

Updates the sketch with multiple items in a single pass.

More efficient than calling update/2 repeatedly because it minimizes intermediate binary allocations.

Examples

iex> sketch = ExDataSketch.HLL.new(p: 10) |> ExDataSketch.HLL.update_many(["a", "b", "c"])
iex> ExDataSketch.HLL.estimate(sketch) > 0.0
true