Flower
This is an implementation of Bloom Filters for Elixir. It uses NIFs written in Rust for better performance, since Bloom Filters rely highly on mutability.
What are Bloom Filter?
TL;DR: Huge amount of data ➜ small Bloom Filter : Was X not in Huge amount of data?
For more about this topic consider, checking out:
- Bloom Filters by Jason Davis
- Bloom Filters, Mining of Massive Datasets, Stanford University on Youtube
Installation
The package can be installed by adding flower
to your list of dependencies in mix.exs
:
def deps do
[
{:flower, "~> 0.1.4"},
]
end
Also, you need to have Rust installed for development, since this uses NIFs.
Docs can be found at https://hexdocs.pm/flower.
Documentation can be generated with ExDoc.
Usage
There is also a small walk through in the Example Section
alias Flower.Bloom, as: Bloom
Create new Bloom Filter
expected_number_of_unique_elements = 1000
# With a size atom
iex> filter = Bloom.new(:"8 KB", expected_number_of_unique_elements)
...
# Or by a maximum byte size:
iex> filter = Bloom.new_by_byte_size(8 * 1024, expected_number_of_unique_elements)
...
# Or by a bit address length:
iex> filter = Bloom.new(16, expected_number_of_unique_elements)
...
Bit Address Length | Size Atom | Bit Address Length | Size Atom | Bit Address Length | Size Atom |
---|---|---|---|---|---|
13 | 1 KB | 23 | 1 MB | ||
14 | 2 KB | 24 | 2 MB | ||
15 | 4 KB | 25 | 4 MB | ||
6 | 8 Byte | 16 | 8 KB | 26 | 8 MB |
7 | 16 Byte | 17 | 16 KB | 27 | 16 MB |
8 | 32 Byte | 18 | 32 KB | 28 | 32 MB |
9 | 64 Byte | 19 | 64 KB | 29 | 64 MB |
10 | 128 Byte | 20 | 128 KB | 30 | 128 MB |
11 | 256 Byte | 21 | 256 KB | 31 | 256 MB |
12 | 512 Byte | 22 | 512 KB | 32 | 512 MB |
Insert Elements
iex> Bloom.insert(filter, 42)
:ok
iex> Bloom.insert(filter, [1,2,{:atom, <<1,2,3,4>>}])
:ok
iex> Bloom.insert(filter, "Hello!")
:ok
Check for Elements
iex> Bloom.has_not?(filter, "Hello!")
false
iex> Bloom.has?(filter, [1,2,{:atom, <<1,2,3,4>>}])
true
iex> Bloom.has?(filter, :atom)
false
# `has?` is always the opposite of `has_not?`.
iex> Bloom.has?(filter, 42) != Bloom.has_not?(filter, 42)
true
Was actually inserted? | has? | has_not? |
---|---|---|
yes | yes | no |
no | most of the times: no | most of the times: yes |
Write To Disk
filename = "prime_filter.bloom"
file = File.stream!(filename, [:delayed_write, :binary], 8096)
(Bloom.stream(filter) |> Stream.into(file) |> Stream.run)
Read From Disk
filename = "prime_filter.bloom"
file = File.stream!(filename, [:read_ahead, :binary], 8096)
{:ok, new_filter} = Bloom.from_stream(file)
Funky Stuff
iex> Bloom.estimate_count(filter)
3
iex> Bloom.false_positive_probability(filter)
3.2348227494719115e-28
# This number is only that small, because we have just 3 elements in 8 KBs.
Example
Prime Numbers
Checking if a number is not a prime and below 100_000
.
# helper module
defmodule Step do
def throught(from, to, step, func) when from < to do
func.(from)
throught(from+step, to, step, func)
end
def throught(_,_,_,_), do: :ok
end && :ok
# alias so we need to write less
alias Flower.Bloom, as: Bloom
# Select appropriate size.
# We will put about 100_000 elements in.
# 64 KB = 512 KBit
# 512 KBit / 100_000 ~= 5.25 Bits per element.
# Meeh. Let's see...
non_primes = Bloom.new(:"64 KB", 100_000)
# We can also write
non_primes = Bloom.new(19, 100_000)
# or (it will choose the next smaller size)
non_primes = Bloom.new_by_byte_size(64 * 1024, 100_000)
# Let's put some non primes in:
Bloom.insert(non_primes, 42)
Bloom.insert(non_primes, "one hundred")
Bloom.insert(non_primes, [:ok, {Anything, 1.0e9}])
# Let's double check:
Bloom.has?(non_primes, 42)
Bloom.has?(non_primes, "one hundred")
Bloom.has?(non_primes, 7)
Bloom.has?(non_primes, [:ok, {Anything, 1.0e9}])
Bloom.has?(non_primes, 52)
# Works!
# Now we can get a estimate of
# the number of items we
# put in.
Bloom.estimate_count(non_primes)
# Apply Sieve of Eratosthenes.
# This may take a few seconds.
# At least we can do it with
# constant memory.
for x <- 2..50_000 do
# Skip multiples of previous numbers.
# This is actually not safe to do
# with a bloom filter since they are
# approximations. But let's don't care for now.
if(Bloom.has_not?(non_primes, x)) do
Step.throught(x*2, 100_000, x, fn non_prime ->
# Much of the time is used to calculate
# hashes.
Bloom.insert(non_primes, non_prime)
end)
end
end && :ok
non_primes |> Bloom.has_not?(12) # not a prime
non_primes |> Bloom.has_not?(11) # prime
non_primes |> Bloom.has_not?(6719) # no prime
non_primes |> Bloom.has_not?(4245) # prime
non_primes |> Bloom.has_not?(9973) # no prime
non_primes |> Bloom.has_not?(3549) # prime
non_primes |> Bloom.has_not?(89591) # prime
non_primes |> Bloom.has_not?(84949) # no prime
# Let's write it to disk:
filename = "prime_filter.bloom"
file = File.stream!(filename, [:delayed_write, :read_ahead, :binary], 8096)
(Bloom.stream(non_primes)
|> Stream.into(file)
|> Stream.run)
# Let's inspect the size of the filter:
File.lstat!(filename).size
# We can also read from disk:
{:ok, new_non_primes} = Bloom.from_stream(file)
new_non_primes |> Bloom.has_not?(12) # not a prime
new_non_primes |> Bloom.has_not?(11) # prime
# Works!
# Maybe a bit high, 64 KB for 100_000 Numbers
# is not that much.
Bloom.false_positive_probability(non_primes)
# Let's reestimate .
# There are 9_592 primes below 100_000, so this
# should yield about 100_000 - 9_592 = 90_408.
Bloom.estimate_count(non_primes)
Side note: If you want to check primes, google
search for Miller–Rabin primality test.