This guide covers the full API, configuration options, backend system, serialization formats, and error handling for ExDataSketch.
Options
HLL Options
| Option | Type | Default | Description |
|---|---|---|---|
| :p | integer | 14 | Precision parameter. Valid range: 4..16. Higher values use more memory but give better accuracy. Register count = 2^p. |
| :backend | module | ExDataSketch.Backend.Pure | Backend module for computation. |
Memory usage: 2^p bytes for registers (e.g., p=14 uses 16 KiB).
Relative error: approximately 1.04 / sqrt(2^p).
ULL Options
UltraLogLog (Ertl, 2023) provides approximately 20% better accuracy than HLL at the same memory footprint. It uses the same register array layout but stores a different value per register and applies the FGRA estimator.
| Option | Type | Default | Description |
|---|---|---|---|
| :p | integer | 14 | Precision parameter. Valid range: 4..26. Higher values use more memory but give better accuracy. Register count = 2^p. |
| :backend | module | ExDataSketch.Backend.Pure | Backend module for computation. |
Memory usage: 8 + 2^p bytes (e.g., p=14 uses ~16 KiB).
Relative error: approximately 0.835 / sqrt(2^p).
HLL vs ULL Comparison
| p | Memory | HLL Error | ULL Error | ULL Improvement |
|---|---|---|---|---|
| 10 | ~1 KiB | 3.25% | 2.61% | ~20% |
| 12 | ~4 KiB | 1.63% | 1.30% | ~20% |
| 14 | ~16 KiB | 0.81% | 0.65% | ~20% |
| 16 | ~64 KiB | 0.41% | 0.33% | ~20% |
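The error columns follow directly from the two formulas above. A quick check (the helper module name is illustrative, not part of the library API):

```elixir
# Illustrative helper reproducing the error columns above.
defmodule SketchError do
  def hll(p), do: 1.04 / :math.sqrt(Integer.pow(2, p))
  def ull(p), do: 0.835 / :math.sqrt(Integer.pow(2, p))
end

SketchError.hll(14)  # ~0.0081 (0.81%)
SketchError.ull(14)  # ~0.0065 (0.65%)
```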
sketch = ExDataSketch.ULL.new(p: 14)
sketch = ExDataSketch.ULL.update_many(sketch, items)
ExDataSketch.ULL.estimate(sketch)
# Merge two sketches
merged = ExDataSketch.ULL.merge(sketch_a, sketch_b)
# Build from enumerable
sketch = ExDataSketch.ULL.from_enumerable(stream, p: 14)
CMS Options
| Option | Type | Default | Description |
|---|---|---|---|
| :width | pos_integer | 2048 | Number of counters per row. Higher values reduce error. |
| :depth | pos_integer | 5 | Number of hash functions (rows). Higher values reduce failure probability. |
| :counter_width | 32 or 64 | 32 | Bit width of each counter. |
| :backend | module | ExDataSketch.Backend.Pure | Backend module for computation. |
Memory usage: width * depth * (counter_width / 8) bytes.
Error bound: e * total_count / width with probability 1 - (1/2)^depth,
where e is Euler's number.
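Plugging the defaults into this bound makes it concrete (a throwaway calculation, not a library API):

```elixir
# Error bound and confidence for the CMS bound above, with width: 2048, depth: 5.
error_bound = fn width, total_count -> :math.exp(1) * total_count / width end
confidence = fn depth -> 1 - :math.pow(0.5, depth) end

error_bound.(2048, 1_000_000)  # ~1327 -- estimates are within ~1327 of the true count
confidence.(5)                 # 0.96875 -- with ~97% probability
```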
Theta Options
| Option | Type | Default | Description |
|---|---|---|---|
| :k | pos_integer | 4096 | Nominal number of entries. Controls accuracy. |
| :backend | module | ExDataSketch.Backend.Pure | Backend module for computation. |
KLL Options
| Option | Type | Default | Description |
|---|---|---|---|
| :k | integer | 200 | Accuracy parameter. Valid range: 8..65535. Higher values use more memory but give better accuracy. |
| :backend | module | ExDataSketch.Backend.Pure | Backend module for computation. |
Rank error: approximately 1.65 / k.
DDSketch Options
| Option | Type | Default | Description |
|---|---|---|---|
| :alpha | float | 0.01 | Relative accuracy parameter. Controls bucket width. Must be in (0.0, 1.0). |
| :backend | module | ExDataSketch.Backend.Pure | Backend module for computation. |
Value-relative error: the returned quantile value v satisfies
v_true * (1 - alpha) <= v <= v_true * (1 + alpha).
| alpha | Relative Error | Bucket Count (for 1ms..10s range) |
|---|---|---|
| 0.05 | 5% | ~185 |
| 0.01 | 1% | ~920 |
| 0.005 | 0.5% | ~1840 |
| 0.001 | 0.1% | ~9200 |
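For example, if the true p99 is 142 ms, alpha: 0.01 pins the reported value to a narrow interval (a worked check, not library code):

```elixir
# Value-relative guarantee interval for a hypothetical true p99 of 142 ms.
alpha = 0.01
v_true = 142.0
{v_true * (1 - alpha), v_true * (1 + alpha)}
# approximately {140.58, 143.42} -- any reported p99 falls in this interval
```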
FrequentItems Options
| Option | Type | Default | Description |
|---|---|---|---|
| :k | pos_integer | 10 | Maximum number of counters. Controls how many distinct items can be tracked. Higher values give more accurate frequency estimates. |
| :key_encoding | atom/tuple | :binary | Key encoding policy: :binary (raw binaries), :int (64-bit LE integers), or {:term, :external} (Erlang external term format). |
| :backend | module | ExDataSketch.Backend.Pure | Backend module for computation. |
FrequentItems uses the SpaceSaving algorithm to track the top-k most
frequent items in a data stream. It maintains at most k counters, each
storing an item, its estimated count, and a maximum overcount error.
When a new item arrives and all k slots are full, the item with the minimum count is evicted (ties broken by lexicographically smallest key). The new item inherits the evicted count plus one, and the evicted count is recorded as the error bound.
Frequency estimates are always upper bounds -- the true frequency may be
lower, but never higher. The lower field gives a guaranteed lower bound:
max(estimate - error, 0).
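The eviction rule described above fits in a few lines. This is an illustrative toy (module name and counter shape are assumptions, not the library's implementation):

```elixir
# Toy SpaceSaving update step: counters is %{item => {count, error}}.
defmodule MiniSpaceSaving do
  def update(counters, item, k) do
    case counters do
      # Already tracked: just increment.
      %{^item => {count, err}} ->
        Map.put(counters, item, {count + 1, err})

      # Free slot available: start tracking with zero error.
      _ when map_size(counters) < k ->
        Map.put(counters, item, {1, 0})

      # Full: evict the minimum-count item (ties: smallest key),
      # inherit its count + 1 and record the evicted count as error.
      _ ->
        {victim, {min_count, _}} =
          Enum.min_by(counters, fn {key, {count, _}} -> {count, key} end)

        counters
        |> Map.delete(victim)
        |> Map.put(item, {min_count + 1, min_count})
    end
  end
end

Enum.reduce(~w(a a b c), %{}, &MiniSpaceSaving.update(&2, &1, 2))
# => %{"a" => {2, 0}, "c" => {2, 1}}
```

Note how "c" evicted "b" and inherited its count of 1 as the overcount error, while "a" keeps an exact count.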
Bloom Options
| Option | Type | Default | Description |
|---|---|---|---|
| :capacity | pos_integer | 10,000 | Expected number of items to insert. Used to derive optimal bit_count and hash_count. |
| :false_positive_rate | float | 0.01 | Target false positive probability. Must be in (0.0, 1.0). |
| :seed | non_neg_integer | 0 | Hash seed. Filters with different seeds cannot be merged. |
| :backend | module | ExDataSketch.Backend.Pure | Backend module for computation. |
Bloom filters provide probabilistic membership testing: member?/2 returns
true if an item was probably inserted, or false if it was definitely not.
False positives are possible; false negatives are not.
Parameters are derived automatically:
bit_count = ceil(-capacity * ln(fpr) / ln(2)^2)
hash_count = max(1, round(bit_count / capacity * ln(2))), clamped to 1..30
For capacity=100,000 and fpr=0.01: bit_count=958,506, hash_count=7, ~117 KB.
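The derivation can be reproduced directly (hypothetical helper module mirroring the formulas above):

```elixir
# Reimplementation of the Bloom sizing formulas above, for illustration.
defmodule BloomSizing do
  def derive(capacity, fpr) do
    ln2 = :math.log(2)
    bit_count = ceil(-capacity * :math.log(fpr) / :math.pow(ln2, 2))
    hash_count = max(1, round(bit_count / capacity * ln2)) |> min(30)
    {bit_count, hash_count}
  end
end

BloomSizing.derive(100_000, 0.01)
# => {958_506, 7}
```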
Cuckoo Options
| Option | Type | Default | Description |
|---|---|---|---|
| :capacity | pos_integer | 10,000 | Expected number of items. Used to derive bucket count. |
| :fingerprint_size | integer | 8 | Fingerprint width in bits. Valid values: 8, 12, 16. Larger fingerprints reduce false positive rate. |
| :bucket_size | integer | 4 | Slots per bucket. Valid values: 2, 4. Bucket size 4 gives better space efficiency. |
| :max_kicks | integer | 500 | Maximum relocation attempts before reporting full. Valid range: 100..2000. |
| :seed | non_neg_integer | 0 | Hash seed. Filters with different seeds are not compatible. |
| :backend | module | ExDataSketch.Backend.Pure | Backend module for computation. |
Cuckoo filters provide membership testing with deletion support: unlike Bloom filters, they allow items to be removed after insertion without introducing false negatives. They use partial-key cuckoo hashing, storing each fingerprint in one of two candidate buckets.
put/2 returns {:ok, cuckoo} on success or {:error, :full} when the
filter cannot relocate fingerprints after max_kicks attempts. delete/2
returns {:ok, cuckoo} on success or {:error, :not_found} if the
fingerprint is not present.
False positive rate: approximately 2 * bucket_size / 2^fingerprint_size.
For fingerprint_size=8, bucket_size=4: ~3.1%. For fingerprint_size=16,
bucket_size=4: ~0.012%.
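The quoted rates follow from the formula (a one-line check, not a library API):

```elixir
# Approximate Cuckoo false positive rate: 2 * bucket_size / 2^fingerprint_size.
cuckoo_fpr = fn fingerprint_size, bucket_size ->
  2 * bucket_size / Integer.pow(2, fingerprint_size)
end

cuckoo_fpr.(8, 4)   # 0.03125 (~3.1%)
cuckoo_fpr.(16, 4)  # ~1.2e-4 (~0.012%)
```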
Quotient Options
| Option | Type | Default | Description |
|---|---|---|---|
| :q | integer | 16 | Quotient bits. Determines number of slots (2^q). Valid range: 1..28. |
| :r | integer | 8 | Remainder bits stored per slot. Valid range: 1..32. Constraint: q + r <= 64. |
| :seed | non_neg_integer | 0 | Hash seed. Filters with different seeds cannot be merged. |
| :backend | module | ExDataSketch.Backend.Pure | Backend module for computation. |
Quotient filters split each fingerprint into a quotient (slot index) and remainder (stored value). Three metadata bits per slot (is_occupied, is_continuation, is_shifted) enable linear probing with cluster tracking.
Supports safe deletion and merge without re-hashing. Merge combines two filters with matching parameters by iterating runs from both filters.
False positive rate: approximately 1 / 2^r. For r=8: ~0.4%. For r=16: ~0.0015%.
Memory usage: 2^q * (r + 3) / 8 bytes (3 metadata bits per slot).
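Plugging in the defaults (q: 16, r: 8) -- a throwaway sizing calculation:

```elixir
# Quotient filter sizing for the defaults above.
q = 16
r = 8
slots = Integer.pow(2, q)               # 65_536
memory_bytes = div(slots * (r + 3), 8)  # 90_112 bytes (~88 KiB); 3 metadata bits/slot
fpr = 1 / Integer.pow(2, r)             # 0.00390625 (~0.4%)
```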
CQF Options
| Option | Type | Default | Description |
|---|---|---|---|
| :q | integer | 16 | Quotient bits. Determines number of slots (2^q). Valid range: 1..28. |
| :r | integer | 8 | Remainder bits stored per slot. Valid range: 1..32. Constraint: q + r <= 64. |
| :seed | non_neg_integer | 0 | Hash seed. Filters with different seeds cannot be merged. |
| :backend | module | ExDataSketch.Backend.Pure | Backend module for computation. |
CQF extends the quotient filter with variable-length counter encoding to
track item multiplicities. Use estimate_count/2 to query how many times an
item has been inserted. Counts are approximate -- never underestimated but may
be overestimated.
Merge sums counts across filters. Deletion decrements the count for an item (safe no-op if the item is not present).
cqf = ExDataSketch.CQF.new(q: 16, r: 8)
cqf = cqf |> ExDataSketch.CQF.put("event") |> ExDataSketch.CQF.put("event")
ExDataSketch.CQF.estimate_count(cqf, "event") # >= 2
XorFilter Options
| Option | Type | Default | Description |
|---|---|---|---|
| :fingerprint_bits | integer | 8 | Fingerprint width. Valid values: 8, 16. |
| :seed | non_neg_integer | 0 | Hash seed. |
| :backend | module | ExDataSketch.Backend.Pure | Backend module for computation. |
XorFilter is a static, immutable membership filter. All items must be
provided at construction time via build/2. After construction, only
member?/2 queries are supported -- no insertion, deletion, or merge.
XorFilter offers the smallest memory footprint and fastest lookups of all membership filters in this library. Use it when the set is known ahead of time and will not change.
items = MapSet.new(["alice", "bob", "carol"])
{:ok, xor} = ExDataSketch.XorFilter.build(items, fingerprint_bits: 8)
ExDataSketch.XorFilter.member?(xor, "alice") # true
ExDataSketch.XorFilter.member?(xor, "dave") # false (probably)
False positive rate: approximately 1 / 2^fingerprint_bits. For 8-bit: ~0.4%.
For 16-bit: ~0.0015%.
IBLT Options
| Option | Type | Default | Description |
|---|---|---|---|
| :cell_count | pos_integer | 1,000 | Number of cells. Valid range: 1..16,777,216. More cells improve decode success rate. |
| :hash_count | pos_integer | 3 | Number of hash functions. Valid range: 1..10. |
| :seed | non_neg_integer | 0 | Hash seed. IBLTs with different seeds cannot be subtracted or merged. |
| :backend | module | ExDataSketch.Backend.Pure | Backend module for computation. |
IBLT (Invertible Bloom Lookup Table) extends Bloom filters with the ability to list entries and find set differences. Two parties build IBLTs from their respective sets, exchange them, and subtract to discover items present in one set but not the other.
Supports two modes:
- Set mode: put/2 and delete/2 operate on items (value_hash = 0).
- Key-value mode: put/3 and delete/3 operate on key-value pairs.
subtract/2 performs cell-wise subtraction. list_entries/1 peels the
resulting IBLT to recover entries, returning {:ok, %{positive: [...], negative: [...]}} or {:error, :decode_failed} if the difference is too
large for the cell count.
iblt_a = ExDataSketch.IBLT.new(cell_count: 1000) |> ExDataSketch.IBLT.put_many(set_a)
iblt_b = ExDataSketch.IBLT.new(cell_count: 1000) |> ExDataSketch.IBLT.put_many(set_b)
diff = ExDataSketch.IBLT.subtract(iblt_a, iblt_b)
{:ok, %{positive: only_in_a, negative: only_in_b}} = ExDataSketch.IBLT.list_entries(diff)
Rule of thumb: set cell_count to at least 3x the expected number of
differences for reliable decoding.
FilterChain Options
FilterChain composes multiple membership filters into a pipeline with capability-aware stage validation.
chain =
ExDataSketch.FilterChain.new()
|> ExDataSketch.FilterChain.add_stage(ExDataSketch.Cuckoo.new(capacity: 10_000))
|> ExDataSketch.FilterChain.add_stage(ExDataSketch.Bloom.new(capacity: 100_000))
Stage positions:
| Position | Allowed types | Notes |
|---|---|---|
| Front / Middle | Bloom, Cuckoo, Quotient, CQF | Dynamic filters supporting :put |
| Terminal | XorFilter | Static filter, must be last stage |
| Adjunct | IBLT | Not in query path, used for reconciliation |
Operations:
- put/2 -- Inserts into all stages supporting :put (skips XorFilter).
- member?/2 -- Queries stages in order, short-circuits on false.
- delete/2 -- Deletes from all stages. Raises UnsupportedOperationError if any stage does not support :delete.
Lifecycle-tier pattern example:
# Hot tier (absorbs writes) -> Cold tier (compacted snapshot)
chain =
ExDataSketch.FilterChain.new()
|> ExDataSketch.FilterChain.add_stage(ExDataSketch.Cuckoo.new(capacity: 50_000))
# Later, compact hot tier into a static XorFilter
{:ok, xor} = ExDataSketch.XorFilter.build(compacted_items, fingerprint_bits: 16)
chain = ExDataSketch.FilterChain.add_stage(chain, xor)
Membership Filter Comparison
| Property | Bloom | Cuckoo | Quotient | CQF | XorFilter | IBLT |
|---|---|---|---|---|---|---|
| Insert | yes | yes | yes | yes | build-only | yes |
| Delete | no | yes | yes | yes | no | yes |
| Merge | yes | no | yes | yes | no | yes |
| Count items | yes | yes | yes | yes (multiset) | yes | yes |
| Static | no | no | no | no | yes | no |
| Reconciliation | no | no | no | no | no | yes |
| Space efficiency | good | better at low FPR | moderate | moderate | best | depends on diff size |
REQ Options
The Relative Error Quantile (REQ) sketch provides relative-error rank accuracy. Unlike KLL (rank error) or DDSketch (value-relative error), REQ concentrates accuracy near the extremes of the distribution.
| Option | Type | Default | Description |
|---|---|---|---|
| :k | integer | 12 | Accuracy parameter. Higher values use more memory but give better accuracy. |
| :hra | boolean | true | High Rank Accuracy. When true, accuracy is better at high ranks (p99, p999). |
| :backend | module | configured | Backend module. |
sketch = ExDataSketch.REQ.new(k: 12, hra: true)
sketch = ExDataSketch.REQ.update_many(sketch, Enum.map(1..10_000, &(&1 * 1.0)))
ExDataSketch.REQ.quantile(sketch, 0.99)
ExDataSketch.REQ.rank(sketch, 9500.0)
ExDataSketch.REQ.cdf(sketch, [1000.0, 5000.0, 9000.0])
ExDataSketch.REQ.pmf(sketch, [1000.0, 5000.0, 9000.0])
MisraGries Options
The Misra-Gries algorithm provides deterministic heavy-hitter detection. Unlike FrequentItems (SpaceSaving), which always returns an upper-bound estimate, Misra-Gries returns 0 for items it is not tracking.
| Option | Type | Default | Description |
|---|---|---|---|
| :k | integer | 10 | Number of counters. Detects items with frequency > n/k. |
| :key_encoding | atom/tuple | :binary | Key encoding policy: :binary (raw binaries), :int (signed 64-bit LE integers), or {:term, :external} (Erlang external term format). |
| :backend | module | configured | Backend module. |
# Default binary keys
sketch = ExDataSketch.MisraGries.new(k: 10)
sketch = ExDataSketch.MisraGries.update_many(sketch, items)
ExDataSketch.MisraGries.estimate(sketch, "frequent_item")
ExDataSketch.MisraGries.top_k(sketch, 10)
# Integer keys
sketch = ExDataSketch.MisraGries.new(k: 10, key_encoding: :int)
sketch = ExDataSketch.MisraGries.update_many(sketch, [100, 200, 100, 300, 100])
ExDataSketch.MisraGries.top_k(sketch, 3)
# => [{100, 3}, {200, 1}, {300, 1}]
# Arbitrary Erlang terms
sketch = ExDataSketch.MisraGries.new(k: 10, key_encoding: {:term, :external})
sketch = ExDataSketch.MisraGries.update_many(sketch, [{:user, 1}, {:user, 2}, {:user, 1}])
ExDataSketch.MisraGries.estimate(sketch, {:user, 1})
# => 2
XXHash3
ExDataSketch includes a Rust NIF implementation of XXHash3 for high-quality 64-bit hashing. This is used internally by the membership filters and is also available directly.
ExDataSketch.Hash.xxhash3_64("some data")
ExDataSketch.Hash.xxhash3_64("some data", 42)
KLL vs DDSketch
Both are quantile sketches available through ExDataSketch.Quantiles, but they
provide different accuracy guarantees:
| Property | KLL | DDSketch |
|---|---|---|
| Error type | Rank error | Value-relative error |
| Best for | General rank queries | Latency percentiles, SLOs |
| Parameter | k (default 200) | alpha (default 0.01) |
| Guarantee | True rank within ~1.65/k | True value within factor (1 +/- alpha) |
| Negative values | Supported | Rejected |
| Memory | Fixed by k | Grows with log(max/min) |
Use KLL when you need rank accuracy (e.g., "what fraction of values are below X?"). Use DDSketch when you need value accuracy (e.g., "p99 latency is 142ms +/- 1%").
Quantiles Facade
The ExDataSketch.Quantiles module provides a type-dispatched facade:
# KLL (default)
sketch = ExDataSketch.Quantiles.new(type: :kll, k: 200)
sketch = ExDataSketch.Quantiles.update_many(sketch, 1..1000)
ExDataSketch.Quantiles.quantile(sketch, 0.5) # approximate median
# DDSketch
sketch = ExDataSketch.Quantiles.new(type: :ddsketch, alpha: 0.01)
sketch = ExDataSketch.Quantiles.update_many(sketch, latency_samples)
ExDataSketch.Quantiles.quantile(sketch, 0.99) # p99 with relative accuracy
Backend System
ExDataSketch uses a backend system to allow swapping computation engines without changing the public API.
Available Backends
- ExDataSketch.Backend.Pure -- Pure Elixir implementation. Always available. Default backend.
- ExDataSketch.Backend.Rust -- Rust NIF acceleration. Precompiled binaries are downloaded automatically on supported platforms. Falls back to Pure if the NIF is not available.
Precompiled NIF Binaries
The Rust NIF is distributed as precompiled binaries for the following platforms:
| Platform | Architecture |
|---|---|
| macOS | ARM64 (Apple Silicon), x86_64 |
| Linux (glibc) | x86_64, aarch64 |
| Linux (musl) | x86_64, aarch64 |
On these platforms, mix compile automatically downloads the correct binary.
No Rust toolchain is required.
To force compilation from source (requires Rust):
EX_DATA_SKETCH_BUILD=1 mix compile
On unsupported platforms, the library operates in pure Elixir mode automatically. No configuration is needed.
Check availability at runtime:
ExDataSketch.Backend.Rust.available?()
# => true or false
Selecting a Backend
Per-sketch:
sketch = ExDataSketch.HLL.new(backend: ExDataSketch.Backend.Rust)
Global default (in config):
config :ex_data_sketch, backend: ExDataSketch.Backend.Rust
The per-sketch option always takes precedence over the global config.
If Backend.Rust is configured but the NIF is not available, it
automatically falls back to Backend.Pure.
Rust Backend Details
The Rust backend accelerates batch and traversal operations via NIFs:
| Rust NIF | Pure fallback |
|---|---|
| hll_update_many, hll_merge, hll_estimate | hll_new, hll_update |
| cms_update_many, cms_merge | cms_new, cms_update, cms_estimate |
| theta_update_many, theta_merge | theta_new, theta_update, theta_compact, theta_estimate |
| kll_update_many, kll_merge | kll_new, kll_update, kll_quantile, kll_rank, kll_count, kll_min, kll_max |
Dirty Scheduler Thresholds
Batch operations automatically use dirty CPU schedulers when input size exceeds configurable thresholds:
| Operation | Default threshold |
|---|---|
| hll_update_many | 10,000 hashes |
| cms_update_many | 10,000 pairs |
| theta_update_many | 10,000 hashes |
| kll_update_many | 10,000 values |
| cms_merge | 100,000 total counters |
| theta_merge | 50,000 combined entries |
| kll_merge | 50,000 combined items |
Override globally:
config :ex_data_sketch, :dirty_thresholds, %{
hll_update_many: 5_000,
cms_update_many: 20_000
}
Or per-call via opts:
HLL.update_many(sketch, items, dirty_threshold: 5_000)
Backend Guarantees
- Both backends produce identical results for the same inputs.
- Serialized state is identical regardless of which backend produced it.
- The public API does not change between backends.
Serialization
ExDataSketch-Native Format (EXSK)
All sketches support the native binary format:
binary = ExDataSketch.HLL.serialize(sketch)
{:ok, sketch} = ExDataSketch.HLL.deserialize(binary)
The EXSK format structure:
| Field | Size | Description |
|---|---|---|
| Magic | 4 bytes | "EXSK" |
| Version | 1 byte | Format version (currently 1) |
| Sketch ID | 1 byte | Identifies sketch type (HLL=1, CMS=2, Theta=3, KLL=4, DDSketch=5, FrequentItems=6, Bloom=7) |
| Params length | 4 bytes | Little-endian u32, byte length of params |
| Params | variable | Sketch-specific parameters |
| State length | 4 bytes | Little-endian u32, byte length of state |
| State | variable | Raw sketch state |
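The envelope lends itself to binary pattern matching. A sketch of a reader for just this framing (field layout taken from the table above; module name is illustrative and the params/state payloads stay opaque):

```elixir
defmodule EXSKEnvelope do
  # Parses the EXSK framing described above; params/state remain opaque binaries.
  def parse(<<"EXSK", version, sketch_id,
              params_len::32-little, params::binary-size(params_len),
              state_len::32-little, state::binary-size(state_len)>>) do
    {:ok, %{version: version, sketch_id: sketch_id, params: params, state: state}}
  end

  def parse(_other), do: {:error, :invalid_format}
end

bin = <<"EXSK", 1, 1, 2::32-little, "ab", 3::32-little, "xyz">>
EXSKEnvelope.parse(bin)
# => {:ok, %{version: 1, sketch_id: 1, params: "ab", state: "xyz"}}
```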
DataSketches Interop
Selected sketch types support the Apache DataSketches binary format for cross-language interoperability:
# Theta sketch interop (priority target)
binary = ExDataSketch.Theta.serialize_datasketches(theta_sketch)
{:ok, sketch} = ExDataSketch.Theta.deserialize_datasketches(binary)
Interop priority order: Theta (CompactSketch), then HLL, then KLL. KLL DataSketches interop is stubbed but not yet implemented.
Hashing
ExDataSketch uses a deterministic 64-bit hash function for all sketch operations. The hash module provides:
# Hash any Elixir term
ExDataSketch.Hash.hash64("hello")
# Hash raw binary data
ExDataSketch.Hash.hash64_binary(<<1, 2, 3>>)
Auto-detection
hash64/2 automatically selects the best available hash implementation:
- XXHash3 (NIF): When the Rust NIF is loaded, hash64/2 uses XXHash3 via NIF, producing native 64-bit hashes with zero Elixir-side overhead. XXHash3 output is stable across platforms.
- phash2 + fixnum-safe mix64 (pure): When the NIF is not available, hash64/2 falls back to :erlang.phash2/2 with a 64-bit mixer that uses 16-bit partial products to avoid bigint heap allocations. Every intermediate value stays under 35 bits (well within the BEAM's 60-bit fixnum limit), eliminating the 3+ transient bigint allocations per hash call that a naive 64-bit multiply would incur.
The NIF availability check is performed once and cached in :persistent_term
for zero-cost subsequent lookups.
Why not 32-bit hashes
32-bit hashes degrade sketch accuracy above approximately 10M items due to birthday-paradox collisions. The fixnum-safe mixer preserves full 64-bit output quality at the cost of roughly 20 fixnum operations per hash -- still faster than 3+ bigint heap allocations from naive 64-bit multiplications.
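The birthday-paradox claim can be made concrete with the standard approximation E[collisions] ~ n * (n - 1) / 2^(b + 1) for n items hashed into b bits (illustrative arithmetic only):

```elixir
# Expected hash collisions among n items in a b-bit hash space.
expected_collisions = fn n, b -> n * (n - 1) / :math.pow(2, b + 1) end

expected_collisions.(10_000_000, 32)  # ~11_641 -- thousands of collisions at 10M items
expected_collisions.(10_000_000, 64)  # ~2.7e-6 -- effectively collision-free
```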
Hash strategy tagging
Sketches record which hash function was used at creation time (:xxhash3,
:phash2, or :custom). This tag is persisted through serialization in the
EXSK params section and controls runtime hash dispatch:
- :phash2 -- always uses the pure Elixir mix64 path, even when the NIF is available.
- :xxhash3 -- requires the Rust NIF; raises ArgumentError if unavailable.
- :custom -- cannot be deserialized (the original function is not recoverable from bytes); deserialize/1 returns {:error, %DeserializationError{}}.
Merge operations validate that both sketches share the same strategy and seed.
Pluggable hash
Users can override the default hash with a custom function:
HLL.new(p: 14, hash_fn: fn term -> my_hash(term) end)
When :hash_fn is provided, the sketch records hash_strategy: :custom.
Stability
:erlang.phash2/2 output is not guaranteed stable across OTP major versions.
XXHash3 output is stable across platforms. For cross-version or cross-system
stability, use the NIF build (XXHash3) or supply a custom :hash_fn.
Use Cases
Algorithm Selection Guide
| Use Case | Algorithm | Key Option | What You Get |
|---|---|---|---|
| Count distinct users / IPs / sessions | HLL | p: 14 | Cardinality estimate with ~0.8% error |
| Frequency of specific items (query counts, error codes) | CMS | width: 2048, depth: 5 | Per-item count estimates (always >= true count) |
| Set intersection / union cardinality | Theta | k: 4096 | Jaccard similarity, union/intersection sizes |
| Percentiles and rank queries (general) | KLL | k: 200 | Median, P95, P99 with rank-error guarantees |
| Latency percentiles and SLO monitoring | DDSketch | alpha: 0.01 | P99 with value-relative error (e.g., 142ms +/- 1%) |
| Top-k / heavy hitter detection | FrequentItems | k: 64 | Most frequent items with count and error bounds |
| Membership testing (seen before?) | Bloom | capacity: 100_000 | "Probably yes" or "definitely no" with tunable FPR |
| Membership with deletion | Cuckoo | capacity: 100_000 | Bloom-like but supports delete |
| Membership with deletion + merge | Quotient | q: 16, r: 8 | Mergeable filter with safe deletion |
| Multiset counting (how many times?) | CQF | q: 16, r: 8 | Approximate per-item multiplicity |
| Static set membership | XorFilter | fingerprint_bits: 8 | Smallest footprint, fastest lookup |
| Set reconciliation (what's different?) | IBLT | cell_count: 1000 | Find symmetric difference between sets |
| Improved cardinality estimation | ULL | p: 14 | ~20% better accuracy than HLL at the same memory |
HLL: Real-time unique visitor counting
A web analytics service needs to count unique visitors per page per day without storing every visitor ID. HLL provides bounded-memory cardinality estimation that can be merged across time windows and server instances.
defmodule Analytics.UniqueVisitors do
alias ExDataSketch.HLL
# Each page gets its own HLL sketch, stored in ETS or Redis
def record_visit(page_id, visitor_id) do
sketch = get_or_create_sketch(page_id)
updated = HLL.update(sketch, visitor_id)
store_sketch(page_id, updated)
end
def unique_count(page_id) do
page_id
|> get_or_create_sketch()
|> HLL.estimate()
|> round()
end
# Merge hourly sketches into a daily rollup
def daily_rollup(page_id, date) do
0..23
|> Enum.map(&get_hourly_sketch(page_id, date, &1))
|> Enum.reject(&is_nil/1)
|> HLL.merge_many()
end
# Compare unique visitors across two pages
def overlap_estimate(page_a, page_b) do
sketch_a = get_or_create_sketch(page_a)
sketch_b = get_or_create_sketch(page_b)
merged = HLL.merge(sketch_a, sketch_b)
count_a = HLL.estimate(sketch_a)
count_b = HLL.estimate(sketch_b)
count_union = HLL.estimate(merged)
# Inclusion-exclusion: |A intersect B| = |A| + |B| - |A union B|
overlap = max(count_a + count_b - count_union, 0)
%{page_a: round(count_a), page_b: round(count_b), overlap: round(overlap)}
end
defp get_or_create_sketch(page_id) do
case :ets.lookup(:hll_sketches, page_id) do
[{_, bin}] ->
{:ok, sketch} = HLL.deserialize(bin)
sketch
[] ->
HLL.new(p: 14)
end
end
defp store_sketch(page_id, sketch) do
:ets.insert(:hll_sketches, {page_id, HLL.serialize(sketch)})
end
defp get_hourly_sketch(page_id, date, hour) do
key = {page_id, date, hour}
case :ets.lookup(:hll_hourly, key) do
[{_, bin}] -> {:ok, s} = HLL.deserialize(bin); s
[] -> nil
end
end
end
CMS: API rate limiting with frequency tracking
An API gateway needs to track request counts per API key to enforce rate limits. CMS provides constant-memory frequency estimation -- counts are always at least the true count, so rate limits are never under-enforced.
defmodule Gateway.RateLimiter do
use GenServer
alias ExDataSketch.CMS
@rate_limit 1000 # max requests per window
@window_ms 60_000 # 1-minute windows
def start_link(opts) do
GenServer.start_link(__MODULE__, opts, name: __MODULE__)
end
def check_and_record(api_key) do
GenServer.call(__MODULE__, {:check, api_key})
end
def top_consumers do
GenServer.call(__MODULE__, :top_consumers)
end
@impl true
def init(_opts) do
schedule_rotation()
{:ok, %{sketch: CMS.new(width: 4096, depth: 5), keys_seen: MapSet.new()}}
end
@impl true
def handle_call({:check, api_key}, _from, state) do
current_count = CMS.estimate(state.sketch, api_key)
if current_count >= @rate_limit do
{:reply, {:error, :rate_limited, current_count}, state}
else
updated_sketch = CMS.update(state.sketch, api_key)
updated_keys = MapSet.put(state.keys_seen, api_key)
{:reply, {:ok, current_count + 1}, %{state | sketch: updated_sketch, keys_seen: updated_keys}}
end
end
@impl true
def handle_call(:top_consumers, _from, state) do
# Check counts for all observed keys
top =
state.keys_seen
|> Enum.map(fn key -> {key, CMS.estimate(state.sketch, key)} end)
|> Enum.sort_by(fn {_k, count} -> -count end)
|> Enum.take(10)
{:reply, top, state}
end
@impl true
def handle_info(:rotate, _state) do
schedule_rotation()
{:noreply, %{sketch: CMS.new(width: 4096, depth: 5), keys_seen: MapSet.new()}}
end
defp schedule_rotation, do: Process.send_after(self(), :rotate, @window_ms)
end
Theta: A/B test audience overlap analysis
A product team needs to measure how much overlap exists between users exposed to experiment A vs experiment B. Theta sketches support set operations (union, intersection) on cardinalities, giving overlap estimates without storing every user ID.
defmodule Experiments.OverlapAnalysis do
alias ExDataSketch.Theta
def analyze_overlap(experiment_a_users, experiment_b_users) do
sketch_a = Theta.from_enumerable(experiment_a_users, k: 4096)
sketch_b = Theta.from_enumerable(experiment_b_users, k: 4096)
union = Theta.merge(sketch_a, sketch_b)
count_a = Theta.estimate(sketch_a)
count_b = Theta.estimate(sketch_b)
count_union = Theta.estimate(union)
# Inclusion-exclusion principle
count_intersection = max(count_a + count_b - count_union, 0.0)
jaccard = if count_union > 0, do: count_intersection / count_union, else: 0.0
%{
group_a_size: round(count_a),
group_b_size: round(count_b),
union_size: round(count_union),
intersection_size: round(count_intersection),
jaccard_similarity: Float.round(jaccard, 4),
overlap_pct_of_a: Float.round(count_intersection / max(count_a, 1) * 100, 1),
overlap_pct_of_b: Float.round(count_intersection / max(count_b, 1) * 100, 1)
}
end
# Compare overlap across multiple experiment cohorts
def pairwise_overlap(cohorts) when is_map(cohorts) do
sketches =
Map.new(cohorts, fn {name, users} ->
{name, Theta.from_enumerable(users, k: 4096)}
end)
names = Map.keys(sketches)
for a <- names, b <- names, a < b do
union = Theta.merge(sketches[a], sketches[b])
est_a = Theta.estimate(sketches[a])
est_b = Theta.estimate(sketches[b])
est_union = Theta.estimate(union)
intersection = max(est_a + est_b - est_union, 0.0)
{a, b, round(intersection)}
end
end
end
KLL: Monitoring response size distributions
A CDN needs to track the distribution of response body sizes to plan cache shard capacity. KLL provides rank-accurate quantiles -- useful when you care about "what fraction of responses are below X bytes?"
defmodule CDN.ResponseSizeMonitor do
use GenServer
alias ExDataSketch.KLL
def start_link(opts) do
GenServer.start_link(__MODULE__, opts, name: __MODULE__)
end
def record_response(size_bytes) do
GenServer.cast(__MODULE__, {:record, size_bytes})
end
def record_batch(sizes) do
GenServer.cast(__MODULE__, {:record_batch, sizes})
end
def distribution_report do
GenServer.call(__MODULE__, :report)
end
@impl true
def init(_opts) do
{:ok, %{sketch: KLL.new(k: 200)}}
end
@impl true
def handle_cast({:record, size}, state) do
{:noreply, %{state | sketch: KLL.update(state.sketch, size)}}
end
@impl true
def handle_cast({:record_batch, sizes}, state) do
{:noreply, %{state | sketch: KLL.update_many(state.sketch, sizes)}}
end
@impl true
def handle_call(:report, _from, state) do
sketch = state.sketch
n = KLL.count(sketch)
report =
if n > 0 do
%{
count: n,
min: KLL.min(sketch),
p25: KLL.quantile(sketch, 0.25),
median: KLL.quantile(sketch, 0.5),
p75: KLL.quantile(sketch, 0.75),
p90: KLL.quantile(sketch, 0.90),
p99: KLL.quantile(sketch, 0.99),
max: KLL.max(sketch),
# What fraction of responses are under 1 MB?
pct_under_1mb: Float.round(KLL.rank(sketch, 1_048_576) * 100, 1)
}
else
%{count: 0}
end
{:reply, report, state}
end
end
# Usage:
# CDN.ResponseSizeMonitor.record_batch(response_sizes)
# CDN.ResponseSizeMonitor.distribution_report()
# => %{count: 1_500_000, median: 24576, p99: 4_194_304, pct_under_1mb: 92.3, ...}
DDSketch: SLO compliance monitoring
A platform team needs to track API latency percentiles for SLO reporting. DDSketch provides value-relative error -- when the SLO says "P99 < 200ms", DDSketch guarantees the reported P99 is within 1% of the true value, not just within a rank tolerance.
defmodule Platform.SLOMonitor do
use GenServer
alias ExDataSketch.DDSketch
@slo_rules [
{:p50, 0.5, 50}, # P50 < 50ms
{:p95, 0.95, 150}, # P95 < 150ms
{:p99, 0.99, 200} # P99 < 200ms
]
def start_link(opts) do
GenServer.start_link(__MODULE__, opts, name: __MODULE__)
end
def record_latency(endpoint, latency_ms) do
GenServer.cast(__MODULE__, {:record, endpoint, latency_ms})
end
def record_latencies(endpoint, latencies) do
GenServer.cast(__MODULE__, {:record_batch, endpoint, latencies})
end
def slo_report(endpoint) do
GenServer.call(__MODULE__, {:report, endpoint})
end
@impl true
def init(_opts) do
{:ok, %{sketches: %{}}}
end
@impl true
def handle_cast({:record, endpoint, latency_ms}, state) do
sketch = Map.get(state.sketches, endpoint, DDSketch.new(alpha: 0.01))
updated = DDSketch.update(sketch, latency_ms)
{:noreply, %{state | sketches: Map.put(state.sketches, endpoint, updated)}}
end
@impl true
def handle_cast({:record_batch, endpoint, latencies}, state) do
sketch = Map.get(state.sketches, endpoint, DDSketch.new(alpha: 0.01))
updated = DDSketch.update_many(sketch, latencies)
{:noreply, %{state | sketches: Map.put(state.sketches, endpoint, updated)}}
end
@impl true
def handle_call({:report, endpoint}, _from, state) do
case Map.fetch(state.sketches, endpoint) do
{:ok, sketch} ->
n = DDSketch.count(sketch)
checks =
Enum.map(@slo_rules, fn {label, quantile, threshold_ms} ->
actual = DDSketch.quantile(sketch, quantile)
%{
metric: label,
threshold_ms: threshold_ms,
actual_ms: Float.round(actual, 2),
passing: actual <= threshold_ms
}
end)
all_passing = Enum.all?(checks, & &1.passing)
report = %{
endpoint: endpoint,
sample_count: n,
min_ms: Float.round(DDSketch.min(sketch), 2),
max_ms: Float.round(DDSketch.max(sketch), 2),
slo_checks: checks,
compliant: all_passing
}
{:reply, {:ok, report}, state}
:error ->
{:reply, {:error, :no_data}, state}
end
end
end
# Usage:
# Platform.SLOMonitor.record_latencies("/api/search", latency_samples)
# Platform.SLOMonitor.slo_report("/api/search")
# => {:ok, %{compliant: true, slo_checks: [
# %{metric: :p99, threshold_ms: 200, actual_ms: 142.37, passing: true}, ...]}}

FrequentItems: Trending search queries
A search engine needs to identify the most popular queries in real time to populate autocomplete suggestions and trending lists. FrequentItems (SpaceSaving) tracks the top-k items with bounded memory, providing count estimates and error bounds.
defmodule Search.TrendingQueries do
use GenServer
alias ExDataSketch.FrequentItems
def start_link(opts) do
GenServer.start_link(__MODULE__, opts, name: __MODULE__)
end
def record_query(query) do
GenServer.cast(__MODULE__, {:query, query})
end
def record_queries(queries) do
GenServer.cast(__MODULE__, {:queries, queries})
end
def trending(limit \\ 10) do
GenServer.call(__MODULE__, {:trending, limit})
end
def query_estimate(query) do
GenServer.call(__MODULE__, {:estimate, query})
end
@impl true
def init(_opts) do
{:ok, %{sketch: FrequentItems.new(k: 128)}}
end
@impl true
def handle_cast({:query, query}, state) do
normalized = query |> String.downcase() |> String.trim()
{:noreply, %{state | sketch: FrequentItems.update(state.sketch, normalized)}}
end
@impl true
def handle_cast({:queries, queries}, state) do
normalized = Enum.map(queries, &(&1 |> String.downcase() |> String.trim()))
{:noreply, %{state | sketch: FrequentItems.update_many(state.sketch, normalized)}}
end
@impl true
def handle_call({:trending, limit}, _from, state) do
top =
state.sketch
|> FrequentItems.top_k(limit: limit)
|> Enum.map(fn entry ->
%{
query: entry.item,
estimated_count: entry.count,
min_count: entry.lower,
max_count: entry.upper
}
end)
{:reply, top, state}
end
@impl true
def handle_call({:estimate, query}, _from, state) do
normalized = query |> String.downcase() |> String.trim()
result = FrequentItems.estimate(state.sketch, normalized)
{:reply, result, state}
end
end
# Usage:
# Search.TrendingQueries.record_queries(batch_of_queries)
# Search.TrendingQueries.trending(5)
# => [%{query: "elixir genserver", estimated_count: 4821, min_count: 4750, max_count: 4821},
# %{query: "phoenix liveview", estimated_count: 3102, min_count: 3010, max_count: 3102}, ...]

ULL: Multi-node distributed cardinality with improved accuracy
An ad-tech platform counts unique ad impressions per campaign across a cluster. Each node maintains a local ULL sketch, serializes it, and sends to a central aggregator that merges and reports campaign reach. ULL is chosen over HLL for its ~20% better accuracy at the same memory -- significant when reporting to advertisers who pay per unique impression.
defmodule AdTech.CampaignReach do
@moduledoc """
Distributed unique impression counting per campaign using ULL sketches.
Each application node tracks impressions locally and periodically
flushes serialized sketches to a central aggregator for merge.
"""
alias ExDataSketch.ULL
# --- Per-node: local impression tracking ---
def new_tracker(precision \\ 14) do
%{sketches: %{}, precision: precision}
end
def record_impression(tracker, campaign_id, user_id) do
sketch =
Map.get_lazy(tracker.sketches, campaign_id, fn ->
ULL.new(p: tracker.precision)
end)
updated = ULL.update(sketch, user_id)
put_in(tracker.sketches[campaign_id], updated)
end
def record_impressions(tracker, campaign_id, user_ids) do
sketch =
Map.get_lazy(tracker.sketches, campaign_id, fn ->
ULL.new(p: tracker.precision)
end)
updated = ULL.update_many(sketch, user_ids)
put_in(tracker.sketches[campaign_id], updated)
end
@doc "Serialize all campaign sketches for network transport."
def flush(tracker) do
payloads =
Map.new(tracker.sketches, fn {campaign_id, sketch} ->
{campaign_id, ULL.serialize(sketch)}
end)
{payloads, %{tracker | sketches: %{}}}
end
# --- Central aggregator: merge from all nodes ---
def new_aggregator, do: %{sketches: %{}}
def ingest(aggregator, payloads) do
Enum.reduce(payloads, aggregator, fn {campaign_id, binary}, acc ->
{:ok, remote} = ULL.deserialize(binary)
merged =
case Map.fetch(acc.sketches, campaign_id) do
{:ok, existing} -> ULL.merge(existing, remote)
:error -> remote
end
put_in(acc.sketches[campaign_id], merged)
end)
end
@doc "Report unique reach per campaign."
def report(aggregator) do
Map.new(aggregator.sketches, fn {campaign_id, sketch} ->
{campaign_id, %{
unique_impressions: ULL.estimate(sketch),
sketch_bytes: ULL.size_bytes(sketch)
}}
end)
end
@doc "Merge hourly aggregators into a daily rollup."
def rollup(hourly_aggregators) do
Enum.reduce(hourly_aggregators, new_aggregator(), fn hourly, daily ->
Enum.reduce(hourly.sketches, daily, fn {campaign_id, sketch}, acc ->
merged =
case Map.fetch(acc.sketches, campaign_id) do
{:ok, existing} -> ULL.merge(existing, sketch)
:error -> sketch
end
put_in(acc.sketches[campaign_id], merged)
end)
end)
end
end
# Usage:
# Node A records impressions
# tracker = AdTech.CampaignReach.new_tracker(14)
# tracker = AdTech.CampaignReach.record_impressions(tracker, "camp_42", user_ids)
# {payloads, tracker} = AdTech.CampaignReach.flush(tracker)
# send(aggregator_node, {:impressions, payloads})
#
# Aggregator merges from all nodes
# agg = AdTech.CampaignReach.new_aggregator()
# agg = AdTech.CampaignReach.ingest(agg, payloads_from_node_a)
# agg = AdTech.CampaignReach.ingest(agg, payloads_from_node_b)
# AdTech.CampaignReach.report(agg)
# => %{"camp_42" => %{unique_impressions: 1_247_823, sketch_bytes: 16392}}
#
# ULL at p=14 gives ~0.65% relative error vs HLL's ~0.81%
# -- a meaningful improvement when billing per unique impression.

Bloom: Deduplication in event processing
An event pipeline receives millions of events per hour and needs to filter duplicates without storing every event ID. A Bloom filter provides space-efficient "have I seen this before?" checks -- because Bloom filters have no false negatives, every true duplicate is caught; the cost is that a false positive occasionally drops a genuinely new event as a duplicate.
defmodule Events.Deduplicator do
use GenServer
alias ExDataSketch.Bloom
@capacity 10_000_000 # expected events per window
@fpr 0.001 # 0.1% false positive rate
def start_link(opts) do
GenServer.start_link(__MODULE__, opts, name: __MODULE__)
end
# Returns :new if not seen before, :duplicate if probably seen
def check_and_insert(event_id) do
GenServer.call(__MODULE__, {:check, event_id})
end
# Batch check for Broadway/Flow pipelines
def filter_duplicates(event_ids) do
GenServer.call(__MODULE__, {:filter_batch, event_ids})
end
def stats do
GenServer.call(__MODULE__, :stats)
end
@impl true
def init(_opts) do
{:ok, %{
bloom: Bloom.new(capacity: @capacity, false_positive_rate: @fpr),
checked: 0,
duplicates: 0
}}
end
@impl true
def handle_call({:check, event_id}, _from, state) do
if Bloom.member?(state.bloom, event_id) do
{:reply, :duplicate, %{state | checked: state.checked + 1, duplicates: state.duplicates + 1}}
else
updated_bloom = Bloom.put(state.bloom, event_id)
{:reply, :new, %{state | bloom: updated_bloom, checked: state.checked + 1}}
end
end
@impl true
def handle_call({:filter_batch, event_ids}, _from, state) do
{new_ids, bloom, dup_count} =
Enum.reduce(event_ids, {[], state.bloom, 0}, fn id, {acc, bloom, dups} ->
if Bloom.member?(bloom, id) do
{acc, bloom, dups + 1}
else
{[id | acc], Bloom.put(bloom, id), dups}
end
end)
new_state = %{state |
bloom: bloom,
checked: state.checked + length(event_ids),
duplicates: state.duplicates + dup_count
}
{:reply, Enum.reverse(new_ids), new_state}
end
@impl true
def handle_call(:stats, _from, state) do
reply = %{
checked: state.checked,
duplicates: state.duplicates,
approximate_unique: Bloom.count(state.bloom),
filter_size_bytes: Bloom.size_bytes(state.bloom),
configured_fpr: Bloom.error_rate(state.bloom)
}
{:reply, reply, state}
end
end
# Usage in a Broadway pipeline:
defmodule MyPipeline do
use Broadway
@impl true
def handle_batch(_batcher, messages, _batch_info, _context) do
ids = Enum.map(messages, & &1.data.event_id)
new_ids = Events.Deduplicator.filter_duplicates(ids)
new_id_set = MapSet.new(new_ids)
messages
|> Enum.filter(fn msg -> MapSet.member?(new_id_set, msg.data.event_id) end)
|> Enum.each(&process_event/1)
messages
end
end

Cuckoo: Session revocation cache
A session management system needs to maintain a revoked-session cache that supports both addition and removal of session IDs. Cuckoo filters support deletion, making them suitable for caches where items expire or are reinstated.
defmodule Auth.RevocationCache do
alias ExDataSketch.Cuckoo
def new_cache(capacity \\ 100_000) do
Cuckoo.new(capacity: capacity, fingerprint_size: 16)
end
def revoke(cache, session_id) do
case Cuckoo.put(cache, session_id) do
{:ok, updated} -> {:ok, updated}
{:error, :full} -> {:error, :cache_full}
end
end
def reinstate(cache, session_id) do
case Cuckoo.delete(cache, session_id) do
{:ok, updated} -> {:ok, updated}
{:error, :not_found} -> {:ok, cache}
end
end
def revoked?(cache, session_id) do
Cuckoo.member?(cache, session_id)
end
end

IBLT: Replica set reconciliation
Two database replicas need to synchronize their key sets without exchanging full inventories. Each node builds an IBLT from its keys, they exchange the compact IBLTs, and subtract to find only the differing keys.
defmodule Sync.Reconciler do
alias ExDataSketch.IBLT
# Each node builds an IBLT from its local keys
def build_digest(keys, cell_count \\ 3000) do
iblt = IBLT.new(cell_count: cell_count, hash_count: 3)
IBLT.put_many(iblt, keys)
end
# Find keys that differ between two nodes
def find_differences(local_digest, remote_digest) do
diff = IBLT.subtract(local_digest, remote_digest)
case IBLT.list_entries(diff) do
{:ok, %{positive: only_local, negative: only_remote}} ->
{:ok, %{push: only_local, pull: only_remote}}
{:error, :decode_failed} ->
{:error, :too_many_differences}
end
end
end

FilterChain: Hot/cold lifecycle tiers
A membership service uses a hot Cuckoo filter for active writes and periodic compaction into a cold XorFilter for archived snapshots.
defmodule Membership.TieredFilter do
alias ExDataSketch.{FilterChain, Cuckoo, XorFilter}
def new_hot(capacity \\ 50_000) do
FilterChain.new()
|> FilterChain.add_stage(Cuckoo.new(capacity: capacity))
end
def ingest(chain, items) do
Enum.reduce(items, chain, fn item, acc ->
{:ok, updated} = FilterChain.put(acc, item)
updated
end)
end
def compact(chain, archived_items) do
{:ok, xor} = XorFilter.build(archived_items, fingerprint_bits: 16)
FilterChain.add_stage(chain, xor)
end
def seen?(chain, item) do
FilterChain.member?(chain, item)
end
end

Error Handling
ExDataSketch uses tagged tuples for recoverable errors:
{:error, %ExDataSketch.Errors.InvalidOptionError{message: "p must be between 4 and 16"}}
{:error, %ExDataSketch.Errors.DeserializationError{message: "invalid magic bytes"}}

Functions that cannot fail return values directly (no :ok tuple wrapping).
Functions that validate external input (deserialization, option parsing) return
{:ok, result} | {:error, error} tuples.
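A caller can pattern-match these shapes directly. A minimal sketch, assuming the HLL.deserialize/1 return values described above (the Logger call is illustrative):

```elixir
require Logger

case ExDataSketch.HLL.deserialize(binary) do
  {:ok, sketch} ->
    # Valid payload: use the sketch as normal.
    {:ok, ExDataSketch.HLL.estimate(sketch)}

  {:error, %ExDataSketch.Errors.DeserializationError{message: msg}} ->
    # Corrupt or truncated payload: log and drop rather than crash.
    Logger.warning("discarding corrupt sketch: #{msg}")
    {:error, :bad_payload}
end
```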
Merging in Distributed Systems
Sketches are designed for distributed aggregation. A common pattern:
- Each node maintains a local sketch.
- Periodically serialize and send sketches to an aggregator.
- The aggregator deserializes and merges all sketches.
- Query the merged sketch for global estimates.
Requirements for merging:
- Both sketches must be the same type (e.g., both HLL).
- Both sketches must have the same parameters (e.g., the same :p value for HLL).
- Attempting to merge incompatible sketches returns an error.
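The four steps above can be sketched end to end. This assumes HLL, but any mergeable sketch type follows the same shape:

```elixir
alias ExDataSketch.HLL

# 1-2. Each node serializes its local sketch and ships the binary.
payloads = Enum.map(node_sketches, &HLL.serialize/1)

# 3. The aggregator deserializes and merges. Merging requires that every
#    payload is the same sketch type with the same :p.
{:ok, first} = HLL.deserialize(hd(payloads))

global =
  payloads
  |> tl()
  |> Enum.reduce(first, fn binary, acc ->
    {:ok, remote} = HLL.deserialize(binary)
    HLL.merge(acc, remote)
  end)

# 4. Query the merged sketch for the global estimate.
HLL.estimate(global)
```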
Benchmarking
Run all benchmarks with mix bench, or individual benchmark files with
mix run bench/<name>.exs.
Interpreting Benchee Memory Numbers
Benchee's "Memory usage" metric measures total heap allocation during the benchmarked function call. This is the sum of every byte allocated on the process heap, including transient garbage that is immediately collectible (intermediate values, short-lived binary copies, list cons cells, etc.). It is not peak memory, resident memory, or the size of the final data structure.
This distinction matters when benchmarking probabilistic sketches against exact data structures like MapSet:
Processing allocation is O(n) for both approaches. Every item must be processed, so both MapSet and HLL/ULL allocate proportionally to the number of items inserted. The pure-Elixir hash mixer uses 16-bit partial products that stay within the BEAM's fixnum limit, avoiding the transient bigint allocations that earlier versions incurred. When the Rust NIF is available, hashing moves entirely into native code with zero Elixir heap allocation per item.
Result size is where sketches win. An HLL sketch at p=12 occupies a fixed ~4 KiB regardless of whether 1,000 or 100,000,000 items were inserted. A MapSet must store every unique element and grows linearly with cardinality.
To measure the actual memory advantage, compare the size of the finished data structures rather than Benchee's total allocation metric:
mapset = Enum.into(items, MapSet.new())
hll = HLL.new(p: 12) |> HLL.update_many(items)
:erlang.external_size(mapset) # grows with cardinality
:erlang.external_size(hll.state) # fixed at ~4 KiB for p=12

See bench/hhl_v_naive.exs for a complete example that includes both a
Benchee throughput/allocation comparison and a result-size comparison showing
the fixed sketch footprint.
Backend Considerations
The Rust backend (ExDataSketch.Backend.Rust) moves batch sketch operations
(update_many, merge, estimate) into a NIF for maximum throughput. When the NIF
is loaded, hash64/2 also automatically uses XXHash3 for hashing, so the full
pipeline runs in native code. When benchmarking, specify the backend explicitly
so results are reproducible:
HLL.new(p: 12, backend: ExDataSketch.Backend.Rust)

The Pure backend processes items one at a time in update_many, avoiding
intermediate list allocation but performing all arithmetic on the Elixir heap.
The fixnum-safe hash mixer eliminates bigint overhead that earlier versions
incurred, keeping per-item allocation to a single unavoidable 64-bit return
value. The pure backend is useful for environments where the Rust NIF is not
available.
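To compare the two backends head to head, a Benchee run can pin each one explicitly. This is a sketch, assuming Benchee is available as a dev dependency:

```elixir
items = Enum.map(1..100_000, &"item-#{&1}")

Benchee.run(%{
  "HLL update_many (Pure)" => fn ->
    ExDataSketch.HLL.new(p: 12, backend: ExDataSketch.Backend.Pure)
    |> ExDataSketch.HLL.update_many(items)
  end,
  "HLL update_many (Rust NIF)" => fn ->
    ExDataSketch.HLL.new(p: 12, backend: ExDataSketch.Backend.Rust)
    |> ExDataSketch.HLL.update_many(items)
  end
})
```

Pinning the backend in each scenario keeps the comparison reproducible even on machines where the NIF loads by default.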