Count-Min Sketch (CMS) for frequency estimation.
CMS provides approximate frequency estimates for items in a data stream. It answers: "approximately how many times has this item appeared?"
The sketch uses depth independent hash functions and a width x depth
counter matrix. Estimates are guaranteed to never undercount (for non-negative
increments) but may overcount.
Memory and Accuracy
- Counter matrix:
width * depth * (counter_width / 8)bytes. - Error bound: estimates are within
e * N / widthof the true count with probability at least1 - (1/2)^depth, whereNis total count andeis Euler's number.
| Width | Depth | Counter | Memory |
|---|---|---|---|
| 2,048 | 5 | 32-bit | 40 KiB |
| 8,192 | 7 | 32-bit | 224 KiB |
| 2,048 | 5 | 64-bit | 80 KiB |
Binary State Layout (v1)
All multi-byte fields are little-endian.
Offset Size Field
------ -------- -----
0 1 Version (u8, currently 1)
1 4 Width (u32 little-endian)
5 2 Depth (u16 little-endian)
7 1 Counter width in bits (u8, 32 or 64)
8 1 Reserved (u8, must be 0)
9 W*D*C Counters (row-major, little-endian per counter)Where C = counter_width / 8 (4 or 8 bytes per counter). Total: 9 + width depth C bytes.
Options
:width- number of counters per row, pos_integer (default: 2048).:depth- number of rows (hash functions), pos_integer (default: 5).:counter_width- bits per counter, 32 or 64 (default: 32).:backend- backend module (default:ExDataSketch.Backend.Pure).
Overflow Policy
Default: saturating. When a counter reaches its maximum value (2^32-1 or 2^64-1), further increments leave it at the maximum. This prevents wrap-around errors at the cost of potential undercounting at extreme values.
Merge Properties
CMS merge is associative and commutative (element-wise counter addition). This means sketches can be merged in any order or grouping and produce the same result, making CMS safe for parallel and distributed aggregation.
Summary
Functions
Deserializes an EXSK binary into a CMS sketch.
Deserializes an Apache DataSketches CMS binary.
Estimates the frequency of an item in the sketch.
Creates a new CMS sketch from an enumerable of items.
Merges two CMS sketches.
Merges a non-empty enumerable of CMS sketches into one.
Returns a 2-arity merge function suitable for combining sketches.
Creates a new CMS sketch.
Returns a 2-arity reducer function suitable for Enum.reduce/3 and similar.
Serializes the sketch to the ExDataSketch-native EXSK binary format.
Serializes the sketch to Apache DataSketches CMS format.
Returns the size of the sketch state in bytes.
Updates the sketch with a single item.
Updates the sketch with multiple items in a single pass.
Types
Functions
@spec deserialize(binary()) :: {:ok, t()} | {:error, Exception.t()}
Deserializes an EXSK binary into a CMS sketch.
Returns {:ok, sketch} on success or {:error, reason} on failure.
Examples
iex> ExDataSketch.CMS.deserialize(<<"invalid">>)
{:error, %ExDataSketch.Errors.DeserializationError{message: "deserialization failed: invalid magic bytes, expected EXSK"}}
@spec deserialize_datasketches(binary()) :: {:ok, t()} | {:error, Exception.t()}
Deserializes an Apache DataSketches CMS binary.
Not implemented. See serialize_datasketches/1 for details.
Examples
iex> try do
...> ExDataSketch.CMS.deserialize_datasketches(<<>>)
...> rescue
...> e in ExDataSketch.Errors.NotImplementedError -> e.message
...> end
"ExDataSketch.CMS.deserialize_datasketches is not yet implemented"
@spec estimate(t(), term()) :: non_neg_integer()
Estimates the frequency of an item in the sketch.
Returns a non-negative integer. The estimate is guaranteed to be at least the true count (no undercounting) but may overcount.
Examples
iex> sketch = ExDataSketch.CMS.new() |> ExDataSketch.CMS.update("hello", 5)
iex> ExDataSketch.CMS.estimate(sketch, "hello")
5
@spec from_enumerable( Enumerable.t(), keyword() ) :: t()
Creates a new CMS sketch from an enumerable of items.
Equivalent to new(opts) |> update_many(enumerable).
Options
Same as new/1.
Examples
iex> sketch = ExDataSketch.CMS.from_enumerable(["a", "b", "a"])
iex> ExDataSketch.CMS.estimate(sketch, "a")
2
Merges two CMS sketches.
Both sketches must have the same width, depth, and counter_width. Counters are added element-wise with saturating arithmetic.
Examples
iex> a = ExDataSketch.CMS.new() |> ExDataSketch.CMS.update("x", 3)
iex> b = ExDataSketch.CMS.new() |> ExDataSketch.CMS.update("x", 5)
iex> merged = ExDataSketch.CMS.merge(a, b)
iex> ExDataSketch.CMS.estimate(merged, "x")
8
@spec merge_many(Enumerable.t()) :: t()
Merges a non-empty enumerable of CMS sketches into one.
Raises Enum.EmptyError if the enumerable is empty.
Examples
iex> a = ExDataSketch.CMS.new() |> ExDataSketch.CMS.update("x")
iex> b = ExDataSketch.CMS.new() |> ExDataSketch.CMS.update("x")
iex> merged = ExDataSketch.CMS.merge_many([a, b])
iex> ExDataSketch.CMS.estimate(merged, "x")
2
Returns a 2-arity merge function suitable for combining sketches.
The returned function calls merge/2 on two sketches.
Examples
iex> is_function(ExDataSketch.CMS.merger(), 2)
true
Creates a new CMS sketch.
Options
:width- counters per row (default: 2048). Must be positive.:depth- number of rows (default: 5). Must be positive.:counter_width- bits per counter, 32 or 64 (default: 32).:backend- backend module (default:ExDataSketch.Backend.Pure).:hash_fn- custom hash function(term -> non_neg_integer).:seed- hash seed (default: 0).
Examples
iex> sketch = ExDataSketch.CMS.new()
iex> ExDataSketch.CMS.estimate(sketch, "anything")
0
Returns a 2-arity reducer function suitable for Enum.reduce/3 and similar.
The returned function calls update/2 on each item.
Examples
iex> is_function(ExDataSketch.CMS.reducer(), 2)
true
Serializes the sketch to the ExDataSketch-native EXSK binary format.
Examples
iex> sketch = ExDataSketch.CMS.new(width: 100, depth: 3, counter_width: 32)
iex> binary = ExDataSketch.CMS.serialize(sketch)
iex> <<"EXSK", _rest::binary>> = binary
iex> byte_size(binary) > 0
true
Serializes the sketch to Apache DataSketches CMS format.
Not implemented. Apache DataSketches does not define a standard CMS binary
format. Only Theta sketches support DataSketches interop via
ExDataSketch.Theta.serialize_datasketches/1. For CMS serialization,
use serialize/1 (ExDataSketch-native EXSK format).
Examples
iex> try do
...> sketch = %ExDataSketch.CMS{state: <<>>, opts: [], backend: nil}
...> ExDataSketch.CMS.serialize_datasketches(sketch)
...> rescue
...> e in ExDataSketch.Errors.NotImplementedError -> e.message
...> end
"ExDataSketch.CMS.serialize_datasketches is not yet implemented"
@spec size_bytes(t()) :: non_neg_integer()
Returns the size of the sketch state in bytes.
Examples
iex> sketch = ExDataSketch.CMS.new(width: 100, depth: 3, counter_width: 32)
iex> ExDataSketch.CMS.size_bytes(sketch)
1209
@spec update(t(), term(), pos_integer()) :: t()
Updates the sketch with a single item.
The item is hashed using ExDataSketch.Hash.hash64/1 before being
recorded in the sketch.
Parameters
sketch- the CMS sketch to update.item- any Elixir term to count.increment- positive integer to add (default: 1).
Examples
iex> sketch = ExDataSketch.CMS.new() |> ExDataSketch.CMS.update("hello")
iex> ExDataSketch.CMS.estimate(sketch, "hello")
1
@spec update_many(t(), Enumerable.t()) :: t()
Updates the sketch with multiple items in a single pass.
Accepts an enumerable of items (each with implicit increment of 1) or
an enumerable of {item, increment} tuples.
Examples
iex> sketch = ExDataSketch.CMS.new() |> ExDataSketch.CMS.update_many(["a", "b", "a"])
iex> ExDataSketch.CMS.estimate(sketch, "a")
2