ExDataSketch.Bloom (ExDataSketch v0.7.1)

Copy Markdown View Source

Bloom filter for probabilistic membership testing.

A Bloom filter is a space-efficient probabilistic data structure that tests whether an element is a member of a set. False positives are possible, but false negatives are not: if member?/2 returns false, the item was definitely not inserted; if it returns true, the item was probably inserted.

Parameters

  • :capacity -- expected number of elements (default: 10,000). Used to derive the optimal bitset size.
  • :false_positive_rate -- target false positive rate (default: 0.01). Must be between 0 and 1 exclusive.
  • :seed -- hash seed (default: 0). Filters with different seeds are incompatible for merge.

The optimal bit_count and hash_count are derived automatically:

bit_count  = ceil(-capacity * ln(fpr) / ln(2)^2)
hash_count = max(1, round(bit_count / capacity * ln(2)))

Hash Strategy

Items are hashed via ExDataSketch.Hash.hash64/1 at the API boundary. The 64-bit hash is split into two 32-bit halves for double hashing (Kirsch-Mitzenmacher optimization):

h1 = hash >>> 32
h2 = hash &&& 0xFFFFFFFF
position_i = rem(h1 + i * h2, bit_count)

Binary State Layout (BLM1)

See plans/adr/ADR-001-bloom-binary-format.md for the full specification. 40-byte header followed by a packed bitset (LSB-first byte order).

Merge Properties

Bloom merge is commutative and associative (bitwise OR). Two filters can merge only if they have identical bit_count, hash_count, and seed.

Summary

Functions

Returns the set of capabilities supported by Bloom filter.

Returns the configured capacity (expected number of elements).

Returns true if two Bloom filters have compatible parameters.

Returns the number of set bits (popcount) in the bitset.

Deserializes an EXSK binary into a Bloom filter.

Returns the configured target false positive rate.

Creates a Bloom filter from an enumerable of items.

Tests whether an item may be a member of the set.

Merges two Bloom filters via bitwise OR.

Merges a non-empty enumerable of Bloom filters into one.

Returns a 2-arity merge function for combining filters.

Creates a new empty Bloom filter.

Inserts a single item into the filter.

Inserts multiple items in a single pass.

Returns a 2-arity reducer function for use with Enum.reduce/3.

Serializes the filter to the ExDataSketch-native EXSK binary format.

Returns the byte size of the state binary.

Types

t()

@type t() :: %ExDataSketch.Bloom{backend: module(), opts: keyword(), state: binary()}

Functions

capabilities()

Returns the set of capabilities supported by Bloom filter.

capacity(bloom)

@spec capacity(t()) :: pos_integer()

Returns the configured capacity (expected number of elements).

Examples

iex> ExDataSketch.Bloom.new(capacity: 5000) |> ExDataSketch.Bloom.capacity()
5000

compatible_with?(bloom1, bloom2)

@spec compatible_with?(t(), t()) :: boolean()

Returns true if two Bloom filters have compatible parameters.

Examples

iex> a = ExDataSketch.Bloom.new(capacity: 100)
iex> b = ExDataSketch.Bloom.new(capacity: 100)
iex> ExDataSketch.Bloom.compatible_with?(a, b)
true

count(bloom)

@spec count(t()) :: non_neg_integer()

Returns the number of set bits (popcount) in the bitset.

This is NOT the number of inserted elements. Useful for computing fill ratio.

Examples

iex> ExDataSketch.Bloom.new(capacity: 100) |> ExDataSketch.Bloom.count()
0

deserialize(binary)

@spec deserialize(binary()) :: {:ok, t()} | {:error, Exception.t()}

Deserializes an EXSK binary into a Bloom filter.

Returns {:ok, bloom} on success or {:error, reason} on failure.

Examples

iex> bloom = ExDataSketch.Bloom.new(capacity: 100) |> ExDataSketch.Bloom.put("test")
iex> {:ok, recovered} = ExDataSketch.Bloom.deserialize(ExDataSketch.Bloom.serialize(bloom))
iex> ExDataSketch.Bloom.member?(recovered, "test")
true

error_rate(bloom)

@spec error_rate(t()) :: float()

Returns the configured target false positive rate.

Examples

iex> ExDataSketch.Bloom.new(false_positive_rate: 0.05) |> ExDataSketch.Bloom.error_rate()
0.05

from_enumerable(enumerable, opts \\ [])

@spec from_enumerable(
  Enumerable.t(),
  keyword()
) :: t()

Creates a Bloom filter from an enumerable of items.

Equivalent to new(opts) |> put_many(enumerable).

Examples

iex> bloom = ExDataSketch.Bloom.from_enumerable(["a", "b", "c"], capacity: 100)
iex> ExDataSketch.Bloom.member?(bloom, "a")
true

member?(bloom, item)

@spec member?(t(), term()) :: boolean()

Tests whether an item may be a member of the set.

Returns true if the item is possibly in the set (may be a false positive), false if the item is definitely not in the set.

Examples

iex> bloom = ExDataSketch.Bloom.new() |> ExDataSketch.Bloom.put("hello")
iex> ExDataSketch.Bloom.member?(bloom, "hello")
true

iex> bloom = ExDataSketch.Bloom.new()
iex> ExDataSketch.Bloom.member?(bloom, "hello")
false

merge(bloom, bloom)

@spec merge(t(), t()) :: t()

Merges two Bloom filters via bitwise OR.

Both filters must have identical bit_count, hash_count, and seed. Raises ExDataSketch.Errors.IncompatibleSketchesError if parameters differ.

Examples

iex> a = ExDataSketch.Bloom.new(capacity: 100) |> ExDataSketch.Bloom.put("x")
iex> b = ExDataSketch.Bloom.new(capacity: 100) |> ExDataSketch.Bloom.put("y")
iex> merged = ExDataSketch.Bloom.merge(a, b)
iex> ExDataSketch.Bloom.member?(merged, "x") and ExDataSketch.Bloom.member?(merged, "y")
true

merge_many(blooms)

@spec merge_many(Enumerable.t()) :: t()

Merges a non-empty enumerable of Bloom filters into one.

Raises Enum.EmptyError if the enumerable is empty.

Examples

iex> filters = Enum.map(1..3, fn i ->
...>   ExDataSketch.Bloom.new(capacity: 100) |> ExDataSketch.Bloom.put("item_#{i}")
...> end)
iex> merged = ExDataSketch.Bloom.merge_many(filters)
iex> ExDataSketch.Bloom.member?(merged, "item_1")
true

merger(opts \\ [])

@spec merger(keyword()) :: (t(), t() -> t())

Returns a 2-arity merge function for combining filters.

Examples

iex> is_function(ExDataSketch.Bloom.merger(), 2)
true

new(opts \\ [])

@spec new(keyword()) :: t()

Creates a new empty Bloom filter.

Options

  • :capacity -- expected number of elements (default: 10000).
  • :false_positive_rate -- target FPR (default: 0.01).
  • :seed -- hash seed (default: 0).
  • :backend -- backend module (default: ExDataSketch.Backend.Pure).
  • :hash_fn -- custom hash function (term -> non_neg_integer).

Examples

iex> bloom = ExDataSketch.Bloom.new()
iex> bloom.opts[:capacity]
10000

iex> bloom = ExDataSketch.Bloom.new(capacity: 1000, false_positive_rate: 0.001)
iex> bloom.opts[:capacity]
1000

put(bloom, item)

@spec put(t(), term()) :: t()

Inserts a single item into the filter.

The item is hashed via ExDataSketch.Hash.hash64/1 before insertion.

Examples

iex> bloom = ExDataSketch.Bloom.new() |> ExDataSketch.Bloom.put("hello")
iex> ExDataSketch.Bloom.member?(bloom, "hello")
true

put_many(bloom, items)

@spec put_many(t(), Enumerable.t()) :: t()

Inserts multiple items in a single pass.

More efficient than calling put/2 repeatedly because it minimizes intermediate binary allocations.

Examples

iex> bloom = ExDataSketch.Bloom.new() |> ExDataSketch.Bloom.put_many(["a", "b", "c"])
iex> ExDataSketch.Bloom.member?(bloom, "a")
true

reducer()

@spec reducer() :: (term(), t() -> t())

Returns a 2-arity reducer function for use with Enum.reduce/3.

Examples

iex> is_function(ExDataSketch.Bloom.reducer(), 2)
true

serialize(bloom)

@spec serialize(t()) :: binary()

Serializes the filter to the ExDataSketch-native EXSK binary format.

Examples

iex> bloom = ExDataSketch.Bloom.new(capacity: 100)
iex> binary = ExDataSketch.Bloom.serialize(bloom)
iex> <<"EXSK", _rest::binary>> = binary
iex> byte_size(binary) > 0
true

size_bytes(bloom)

@spec size_bytes(t()) :: non_neg_integer()

Returns the byte size of the state binary.

Examples

iex> bloom = ExDataSketch.Bloom.new(capacity: 100)
iex> ExDataSketch.Bloom.size_bytes(bloom) > 0
true