ExDataSketch.CMS (ExDataSketch v0.7.1)

Copy Markdown View Source

Count-Min Sketch (CMS) for frequency estimation.

CMS provides approximate frequency estimates for items in a data stream. It answers: "approximately how many times has this item appeared?"

The sketch uses depth independent hash functions and a width x depth counter matrix. Estimates are guaranteed to never undercount (for non-negative increments) but may overcount.

Memory and Accuracy

  • Counter matrix: width * depth * (counter_width / 8) bytes.
  • Error bound: estimates are within e * N / width of the true count with probability at least 1 - (1/2)^depth, where N is total count and e is Euler's number.
WidthDepthCounterMemory
2,048532-bit40 KiB
8,192732-bit224 KiB
2,048564-bit80 KiB

Binary State Layout (v1)

All multi-byte fields are little-endian.

Offset  Size      Field
------  --------  -----
0       1         Version (u8, currently 1)
1       4         Width (u32 little-endian)
5       2         Depth (u16 little-endian)
7       1         Counter width in bits (u8, 32 or 64)
8       1         Reserved (u8, must be 0)
9       W*D*C     Counters (row-major, little-endian per counter)

Where C = counter_width / 8 (4 or 8 bytes per counter). Total: 9 + width depth C bytes.

Options

  • :width - number of counters per row, pos_integer (default: 2048).
  • :depth - number of rows (hash functions), pos_integer (default: 5).
  • :counter_width - bits per counter, 32 or 64 (default: 32).
  • :backend - backend module (default: ExDataSketch.Backend.Pure).

Overflow Policy

Default: saturating. When a counter reaches its maximum value (2^32-1 or 2^64-1), further increments leave it at the maximum. This prevents wrap-around errors at the cost of potential undercounting at extreme values.

Merge Properties

CMS merge is associative and commutative (element-wise counter addition). This means sketches can be merged in any order or grouping and produce the same result, making CMS safe for parallel and distributed aggregation.

Summary

Functions

Deserializes an EXSK binary into a CMS sketch.

Deserializes an Apache DataSketches CMS binary.

Estimates the frequency of an item in the sketch.

Creates a new CMS sketch from an enumerable of items.

Merges two CMS sketches.

Merges a non-empty enumerable of CMS sketches into one.

Returns a 2-arity merge function suitable for combining sketches.

Creates a new CMS sketch.

Returns a 2-arity reducer function suitable for Enum.reduce/3 and similar.

Serializes the sketch to the ExDataSketch-native EXSK binary format.

Serializes the sketch to Apache DataSketches CMS format.

Returns the size of the sketch state in bytes.

Updates the sketch with a single item.

Updates the sketch with multiple items in a single pass.

Types

t()

@type t() :: %ExDataSketch.CMS{backend: module(), opts: keyword(), state: binary()}

Functions

deserialize(binary)

@spec deserialize(binary()) :: {:ok, t()} | {:error, Exception.t()}

Deserializes an EXSK binary into a CMS sketch.

Returns {:ok, sketch} on success or {:error, reason} on failure.

Examples

iex> ExDataSketch.CMS.deserialize(<<"invalid">>)
{:error, %ExDataSketch.Errors.DeserializationError{message: "deserialization failed: invalid magic bytes, expected EXSK"}}

deserialize_datasketches(binary)

@spec deserialize_datasketches(binary()) :: {:ok, t()} | {:error, Exception.t()}

Deserializes an Apache DataSketches CMS binary.

Not implemented. See serialize_datasketches/1 for details.

Examples

iex> try do
...>   ExDataSketch.CMS.deserialize_datasketches(<<>>)
...> rescue
...>   e in ExDataSketch.Errors.NotImplementedError -> e.message
...> end
"ExDataSketch.CMS.deserialize_datasketches is not yet implemented"

estimate(cms, item)

@spec estimate(t(), term()) :: non_neg_integer()

Estimates the frequency of an item in the sketch.

Returns a non-negative integer. The estimate is guaranteed to be at least the true count (no undercounting) but may overcount.

Examples

iex> sketch = ExDataSketch.CMS.new() |> ExDataSketch.CMS.update("hello", 5)
iex> ExDataSketch.CMS.estimate(sketch, "hello")
5

from_enumerable(enumerable, opts \\ [])

@spec from_enumerable(
  Enumerable.t(),
  keyword()
) :: t()

Creates a new CMS sketch from an enumerable of items.

Equivalent to new(opts) |> update_many(enumerable).

Options

Same as new/1.

Examples

iex> sketch = ExDataSketch.CMS.from_enumerable(["a", "b", "a"])
iex> ExDataSketch.CMS.estimate(sketch, "a")
2

merge(sketch, cms)

@spec merge(t(), t()) :: t()

Merges two CMS sketches.

Both sketches must have the same width, depth, and counter_width. Counters are added element-wise with saturating arithmetic.

Examples

iex> a = ExDataSketch.CMS.new() |> ExDataSketch.CMS.update("x", 3)
iex> b = ExDataSketch.CMS.new() |> ExDataSketch.CMS.update("x", 5)
iex> merged = ExDataSketch.CMS.merge(a, b)
iex> ExDataSketch.CMS.estimate(merged, "x")
8

merge_many(sketches)

@spec merge_many(Enumerable.t()) :: t()

Merges a non-empty enumerable of CMS sketches into one.

Raises Enum.EmptyError if the enumerable is empty.

Examples

iex> a = ExDataSketch.CMS.new() |> ExDataSketch.CMS.update("x")
iex> b = ExDataSketch.CMS.new() |> ExDataSketch.CMS.update("x")
iex> merged = ExDataSketch.CMS.merge_many([a, b])
iex> ExDataSketch.CMS.estimate(merged, "x")
2

merger(opts \\ [])

@spec merger(keyword()) :: (t(), t() -> t())

Returns a 2-arity merge function suitable for combining sketches.

The returned function calls merge/2 on two sketches.

Examples

iex> is_function(ExDataSketch.CMS.merger(), 2)
true

new(opts \\ [])

@spec new(keyword()) :: t()

Creates a new CMS sketch.

Options

  • :width - counters per row (default: 2048). Must be positive.
  • :depth - number of rows (default: 5). Must be positive.
  • :counter_width - bits per counter, 32 or 64 (default: 32).
  • :backend - backend module (default: ExDataSketch.Backend.Pure).
  • :hash_fn - custom hash function (term -> non_neg_integer).
  • :seed - hash seed (default: 0).

Examples

iex> sketch = ExDataSketch.CMS.new()
iex> ExDataSketch.CMS.estimate(sketch, "anything")
0

reducer()

@spec reducer() :: (term(), t() -> t())

Returns a 2-arity reducer function suitable for Enum.reduce/3 and similar.

The returned function calls update/2 on each item.

Examples

iex> is_function(ExDataSketch.CMS.reducer(), 2)
true

serialize(cms)

@spec serialize(t()) :: binary()

Serializes the sketch to the ExDataSketch-native EXSK binary format.

Examples

iex> sketch = ExDataSketch.CMS.new(width: 100, depth: 3, counter_width: 32)
iex> binary = ExDataSketch.CMS.serialize(sketch)
iex> <<"EXSK", _rest::binary>> = binary
iex> byte_size(binary) > 0
true

serialize_datasketches(cms)

@spec serialize_datasketches(t()) :: binary()

Serializes the sketch to Apache DataSketches CMS format.

Not implemented. Apache DataSketches does not define a standard CMS binary format. Only Theta sketches support DataSketches interop via ExDataSketch.Theta.serialize_datasketches/1. For CMS serialization, use serialize/1 (ExDataSketch-native EXSK format).

Examples

iex> try do
...>   sketch = %ExDataSketch.CMS{state: <<>>, opts: [], backend: nil}
...>   ExDataSketch.CMS.serialize_datasketches(sketch)
...> rescue
...>   e in ExDataSketch.Errors.NotImplementedError -> e.message
...> end
"ExDataSketch.CMS.serialize_datasketches is not yet implemented"

size_bytes(cms)

@spec size_bytes(t()) :: non_neg_integer()

Returns the size of the sketch state in bytes.

Examples

iex> sketch = ExDataSketch.CMS.new(width: 100, depth: 3, counter_width: 32)
iex> ExDataSketch.CMS.size_bytes(sketch)
1209

update(sketch, item, increment \\ 1)

@spec update(t(), term(), pos_integer()) :: t()

Updates the sketch with a single item.

The item is hashed using ExDataSketch.Hash.hash64/1 before being recorded in the sketch.

Parameters

  • sketch - the CMS sketch to update.
  • item - any Elixir term to count.
  • increment - positive integer to add (default: 1).

Examples

iex> sketch = ExDataSketch.CMS.new() |> ExDataSketch.CMS.update("hello")
iex> ExDataSketch.CMS.estimate(sketch, "hello")
1

update_many(sketch, items)

@spec update_many(t(), Enumerable.t()) :: t()

Updates the sketch with multiple items in a single pass.

Accepts an enumerable of items (each with implicit increment of 1) or an enumerable of {item, increment} tuples.

Examples

iex> sketch = ExDataSketch.CMS.new() |> ExDataSketch.CMS.update_many(["a", "b", "a"])
iex> ExDataSketch.CMS.estimate(sketch, "a")
2