ExDataSketch.FrequentItems (ExDataSketch v0.7.1)

Copy Markdown View Source

FrequentItems sketch for approximate heavy-hitter detection using the SpaceSaving algorithm.

Tracks the top-k most frequent items in a data stream using at most k counters. Each counter stores an item, its estimated count, and a maximum overcount error. The sketch provides approximate frequency estimates with bounded memory and deterministic tie-breaking.

Algorithm (SpaceSaving)

The SpaceSaving algorithm maintains a fixed set of at most k counters:

  • Update(x): If x is already tracked, increment its count. If there is room (fewer than k entries), insert x with count=1 and error=0. Otherwise, evict the entry with the minimum count (ties broken by lexicographically smallest item_bytes), and replace it with x, count=min_count+1, error=min_count.

  • Weighted update(x, w): Same logic but increment by w. Replacement count = min_count + w, error = min_count.

  • Batch optimization: Pre-aggregate incoming items into a frequency map, then apply weighted updates for each unique item in sorted key order. Single decode + single encode of the binary state.

Key Encoding

Items are encoded to binary at the public API boundary using a configurable key encoding policy:

EncodingDescription
:binary (default)Keys are raw binaries, passed through as-is
:intKeys are integers, encoded as signed 64-bit little-endian
{:term, :external}Keys are arbitrary Erlang terms, encoded via :erlang.term_to_binary/1

The backend always receives and returns raw item_bytes. Decoding happens at the public API boundary when returning results.

Merge Properties

FrequentItems merge is commutative. Both sketches must have the same k and key_encoding parameters. Merge combines counts and errors additively across the union of keys, then retains the top-k entries by count (ties broken by lexicographically smallest key) to enforce the capacity invariant. Count (n) is always exactly additive regardless of whether entries are dropped.

Associativity holds for count totals but not necessarily for the retained entry set, since intermediate merges may drop entries that a different grouping would retain. See docs/frequent_items_format.md for details.

Binary State Format (FI1)

See docs/frequent_items_format.md for the complete binary layout specification, canonicalization rules, and merge algebra proof sketch.

Options

  • :k - maximum number of counters (default: 10, must be >= 1).
  • :key_encoding - key encoding policy: :binary (default), :int, or {:term, :external}.
  • :backend - backend module (default: ExDataSketch.Backend.Pure).

Summary

Functions

Returns the total number of items observed by the sketch.

Deserializes an EXSK binary into a FrequentItems sketch.

Returns the number of distinct items currently tracked by the sketch.

Returns the frequency estimate for a given item.

Returns items whose estimated frequency exceeds the given threshold.

Creates a new FrequentItems sketch from an enumerable of items.

Merges two FrequentItems sketches.

Merges a non-empty enumerable of FrequentItems sketches into one.

Returns a 2-arity merge function suitable for combining sketches.

Creates a new FrequentItems sketch.

Returns a 2-arity reducer function suitable for Enum.reduce/3 and similar.

Serializes the sketch to the ExDataSketch-native EXSK binary format.

Returns the top-k most frequent items, sorted by estimated count descending.

Updates the sketch with a single item.

Updates the sketch with multiple items in a single pass.

Types

t()

@type t() :: %ExDataSketch.FrequentItems{
  backend: module(),
  opts: keyword(),
  state: binary()
}

Functions

count(frequent_items)

@spec count(t()) :: non_neg_integer()

Returns the total number of items observed by the sketch.

This is the sum of all weights (each update/2 call contributes 1).

Examples

iex> ExDataSketch.FrequentItems.new(k: 5) |> ExDataSketch.FrequentItems.count()
0

deserialize(binary)

@spec deserialize(binary()) :: {:ok, t()} | {:error, Exception.t()}

Deserializes an EXSK binary into a FrequentItems sketch.

Returns {:ok, sketch} on success or {:error, reason} on failure.

Examples

iex> ExDataSketch.FrequentItems.deserialize(<<"invalid">>)
{:error, %ExDataSketch.Errors.DeserializationError{message: "deserialization failed: invalid magic bytes, expected EXSK"}}

entry_count(frequent_items)

@spec entry_count(t()) :: non_neg_integer()

Returns the number of distinct items currently tracked by the sketch.

This is always <= k.

Examples

iex> ExDataSketch.FrequentItems.new(k: 5) |> ExDataSketch.FrequentItems.entry_count()
0

estimate(frequent_items, item)

@spec estimate(t(), term()) :: {:ok, map()} | {:error, :not_tracked}

Returns the frequency estimate for a given item.

Returns {:ok, estimate_map} if the item is tracked, where estimate_map contains :estimate, :error, :lower, and :upper fields. Returns {:error, :not_tracked} if the item is not in the sketch.

The estimate map fields:

  • :estimate - the estimated count (may overcount but never undercount)
  • :error - the maximum possible overcount
  • :lower - max(estimate - error, 0), guaranteed lower bound
  • :upper - same as estimate, guaranteed upper bound

Examples

iex> sketch = ExDataSketch.FrequentItems.new(k: 10) |> ExDataSketch.FrequentItems.update("x")
iex> {:ok, est} = ExDataSketch.FrequentItems.estimate(sketch, "x")
iex> est.estimate
1

frequent(sketch, threshold)

@spec frequent(t(), non_neg_integer()) :: [map()]

Returns items whose estimated frequency exceeds the given threshold.

The threshold is compared against the lower bound (estimate - error). Only items with lower >= threshold are returned.

Examples

iex> sketch = ExDataSketch.FrequentItems.new(k: 10)
iex> sketch = ExDataSketch.FrequentItems.update_many(sketch, List.duplicate("a", 100) ++ ["b"])
iex> frequent = ExDataSketch.FrequentItems.frequent(sketch, 50)
iex> Enum.any?(frequent, fn e -> e.item == "a" end)
true

from_enumerable(enumerable, opts \\ [])

@spec from_enumerable(
  Enumerable.t(),
  keyword()
) :: t()

Creates a new FrequentItems sketch from an enumerable of items.

Equivalent to new(opts) |> update_many(enumerable).

Examples

iex> sketch = ExDataSketch.FrequentItems.from_enumerable(["a", "b", "a"], k: 10)
iex> ExDataSketch.FrequentItems.count(sketch)
3

merge(sketch, frequent_items)

@spec merge(t(), t()) :: t()

Merges two FrequentItems sketches.

Both sketches must have the same k and key_encoding parameters. The merge combines counts and errors additively across the union of tracked items, then retains the top-k entries by count to enforce the capacity invariant.

Raises ExDataSketch.Errors.IncompatibleSketchesError if the sketches have different k or key_encoding values.

Examples

iex> a = ExDataSketch.FrequentItems.new(k: 5) |> ExDataSketch.FrequentItems.update_many(["x", "x"])
iex> b = ExDataSketch.FrequentItems.new(k: 5) |> ExDataSketch.FrequentItems.update_many(["y", "y"])
iex> merged = ExDataSketch.FrequentItems.merge(a, b)
iex> ExDataSketch.FrequentItems.count(merged)
4

merge_many(sketches)

@spec merge_many(Enumerable.t()) :: t()

Merges a non-empty enumerable of FrequentItems sketches into one.

Raises Enum.EmptyError if the enumerable is empty.

Examples

iex> sketches = Enum.map(1..3, fn i ->
...>   ExDataSketch.FrequentItems.new(k: 5) |> ExDataSketch.FrequentItems.update(to_string(i))
...> end)
iex> merged = ExDataSketch.FrequentItems.merge_many(sketches)
iex> ExDataSketch.FrequentItems.count(merged)
3

merger(opts \\ [])

@spec merger(keyword()) :: (t(), t() -> t())

Returns a 2-arity merge function suitable for combining sketches.

Examples

iex> is_function(ExDataSketch.FrequentItems.merger(), 2)
true

new(opts \\ [])

@spec new(keyword()) :: t()

Creates a new FrequentItems sketch.

Options

  • :k - maximum number of counters (default: 10, must be >= 1).
  • :key_encoding - key encoding policy: :binary (default), :int, or {:term, :external}.
  • :backend - backend module (default: ExDataSketch.Backend.Pure).

Examples

iex> sketch = ExDataSketch.FrequentItems.new(k: 10)
iex> sketch.opts[:k]
10
iex> ExDataSketch.FrequentItems.count(sketch)
0

reducer()

@spec reducer() :: (term(), t() -> t())

Returns a 2-arity reducer function suitable for Enum.reduce/3 and similar.

The returned function calls update/2 on each item.

Examples

iex> is_function(ExDataSketch.FrequentItems.reducer(), 2)
true

serialize(frequent_items)

@spec serialize(t()) :: binary()

Serializes the sketch to the ExDataSketch-native EXSK binary format.

Examples

iex> sketch = ExDataSketch.FrequentItems.new(k: 10)
iex> binary = ExDataSketch.FrequentItems.serialize(sketch)
iex> <<"EXSK", _rest::binary>> = binary
iex> byte_size(binary) > 0
true

top_k(frequent_items, query_opts \\ [])

@spec top_k(
  t(),
  keyword()
) :: [map()]

Returns the top-k most frequent items, sorted by estimated count descending.

Ties in count are broken by key ascending (lexicographic on raw bytes, then decoded via the key encoding policy).

Options

  • :limit - maximum number of items to return (default: all entries).

Returns a list of maps with :item, :estimate, :error, :lower, and :upper fields.

Examples

iex> sketch = ExDataSketch.FrequentItems.new(k: 10)
iex> sketch = ExDataSketch.FrequentItems.update_many(sketch, ["a", "a", "b"])
iex> [first | _] = ExDataSketch.FrequentItems.top_k(sketch)
iex> first.item
"a"

update(sketch, item)

@spec update(t(), term()) :: t()

Updates the sketch with a single item.

Delegates to update_many/2 with a single-element list.

Examples

iex> sketch = ExDataSketch.FrequentItems.new(k: 5)
iex> sketch = ExDataSketch.FrequentItems.update(sketch, "hello")
iex> ExDataSketch.FrequentItems.count(sketch)
1

update_many(sketch, items)

@spec update_many(t(), Enumerable.t()) :: t()

Updates the sketch with multiple items in a single pass.

More efficient than calling update/2 repeatedly because it pre-aggregates items into a frequency map, decodes and encodes the binary state only once, and applies weighted updates for each unique item.

Examples

iex> sketch = ExDataSketch.FrequentItems.new(k: 5)
iex> sketch = ExDataSketch.FrequentItems.update_many(sketch, ["a", "b", "a"])
iex> ExDataSketch.FrequentItems.count(sketch)
3