Flower

Build Status

This is an implementation of Bloom Filters for Elixir. It uses NIFs written in Rust for better performance, since Bloom Filters rely highly on mutability.

What are Bloom Filter?

TL;DR: Huge amount of data small Bloom Filter : Was X not in Huge amount of data?

For more about this topic consider, checking out:

Installation

The package can be installed by adding flower to your list of dependencies in mix.exs:

def deps do
  [
    {:flower, "~> 0.1.4"},
  ]
end

Also, you need to have Rust installed for development, since this uses NIFs.

Docs can be found at https://hexdocs.pm/flower.

Documentation can be generated with ExDoc.

Usage

There is also a small walk through in the Example Section

alias Flower.Bloom, as: Bloom
Create new Bloom Filter
expected_number_of_unique_elements = 1000
# With a size atom
iex> filter = Bloom.new(:"8 KB", expected_number_of_unique_elements)
...
# Or by a maximum byte size:
iex> filter = Bloom.new_by_byte_size(8 * 1024, expected_number_of_unique_elements)
...
# Or by a bit address length:
iex> filter = Bloom.new(16, expected_number_of_unique_elements)
...
Bit Address LengthSize AtomBit Address LengthSize AtomBit Address LengthSize Atom
131 KB231 MB
142 KB242 MB
154 KB254 MB
68 Byte168 KB268 MB
716 Byte1716 KB2716 MB
832 Byte1832 KB2832 MB
964 Byte1964 KB2964 MB
10128 Byte20128 KB30128 MB
11256 Byte21256 KB31256 MB
12512 Byte22512 KB32512 MB
Insert Elements
iex> Bloom.insert(filter, 42)
:ok
iex> Bloom.insert(filter, [1,2,{:atom, <<1,2,3,4>>}])
:ok
iex> Bloom.insert(filter, "Hello!")
:ok
Check for Elements
iex> Bloom.has_not?(filter, "Hello!")
false
iex> Bloom.has?(filter, [1,2,{:atom, <<1,2,3,4>>}])
true
iex> Bloom.has?(filter, :atom)
false
# `has?` is always the opposite of `has_not?`.
iex> Bloom.has?(filter, 42) != Bloom.has_not?(filter, 42)
true
Was actually inserted?has?has_not?
yesyesno
nomost of the times: nomost of the times: yes
Write To Disk
filename = "prime_filter.bloom"
file = File.stream!(filename, [:delayed_write, :binary], 8096)

(Bloom.stream(filter) |> Stream.into(file) |> Stream.run)
Read From Disk
filename = "prime_filter.bloom"
file = File.stream!(filename, [:read_ahead, :binary], 8096)

{:ok, new_filter} = Bloom.from_stream(file)
Funky Stuff
iex> Bloom.estimate_count(filter)
3
iex> Bloom.false_positive_probability(filter)
3.2348227494719115e-28
# This number is only that small, because we have just 3 elements in 8 KBs.

Example

Prime Numbers

Checking if a number is not a prime and below 100_000.

# helper module
defmodule Step do
    def throught(from, to, step, func) when from < to do
        func.(from)
        throught(from+step, to, step, func)
    end
    def throught(_,_,_,_), do: :ok
end && :ok

# alias so we need to write less
alias Flower.Bloom, as: Bloom


# Select appropriate size.
# We will put about 100_000 elements in.
# 64 KB = 512 KBit
# 512 KBit / 100_000 ~= 5.25 Bits per element.
# Meeh. Let's see...
non_primes = Bloom.new(:"64 KB", 100_000)
# We can also write
non_primes = Bloom.new(19, 100_000)
# or (it will choose the next smaller size)
non_primes = Bloom.new_by_byte_size(64 * 1024, 100_000)

# Let's put some non primes in:
Bloom.insert(non_primes, 42)
Bloom.insert(non_primes, "one hundred")
Bloom.insert(non_primes, [:ok, {Anything, 1.0e9}])

# Let's double check:
Bloom.has?(non_primes, 42)
Bloom.has?(non_primes, "one hundred")
Bloom.has?(non_primes, 7)
Bloom.has?(non_primes, [:ok, {Anything, 1.0e9}])
Bloom.has?(non_primes, 52)
# Works!


# Now we can get a estimate of
# the number of items we
# put in.
Bloom.estimate_count(non_primes)


# Apply Sieve of Eratosthenes.
# This may take a few seconds.
# At least we can do it with
# constant memory.
for x <- 2..50_000 do
  # Skip multiples of previous numbers.
  # This is actually not safe to do
  # with a bloom filter since they are
  # approximations. But let's don't care for now.
  if(Bloom.has_not?(non_primes, x)) do
    Step.throught(x*2, 100_000, x, fn non_prime ->
        # Much of the time is used to calculate
        # hashes.
        Bloom.insert(non_primes, non_prime)
    end)
 end
end && :ok

non_primes |> Bloom.has_not?(12) # not a prime
non_primes |> Bloom.has_not?(11) # prime
non_primes |> Bloom.has_not?(6719) # no prime
non_primes |> Bloom.has_not?(4245) # prime
non_primes |> Bloom.has_not?(9973) # no prime
non_primes |> Bloom.has_not?(3549) # prime
non_primes |> Bloom.has_not?(89591) # prime
non_primes |> Bloom.has_not?(84949) # no prime

# Let's write it to disk:
filename = "prime_filter.bloom"
file = File.stream!(filename, [:delayed_write, :read_ahead, :binary], 8096)
(Bloom.stream(non_primes)
|> Stream.into(file)
|> Stream.run)
# Let's inspect the size of the filter:
File.lstat!(filename).size

# We can also read from disk:
{:ok, new_non_primes} = Bloom.from_stream(file)
new_non_primes |> Bloom.has_not?(12) # not a prime
new_non_primes |> Bloom.has_not?(11) # prime
# Works!

# Maybe a bit high, 64 KB for 100_000 Numbers
# is not that much.
Bloom.false_positive_probability(non_primes)

# Let's reestimate .
# There are 9_592 primes below 100_000, so this
# should yield about 100_000 - 9_592 = 90_408.
Bloom.estimate_count(non_primes)

Side note: If you want to check primes, google search for Miller–Rabin primality test.