Introduction to Bloom Filter by Examples

Vincenzo Palazzo
5 min readDec 8, 2021

--

These days I finished implementing my C++ version of the static bloom filter, and I found one missing in the post around the internet. That is the missing 101 bloom filter as a practical approach, and in my opinion, the theory is not funny enough without practice.

With this article, I want to cover how to implement a bloom filter with C++, or other OOP languages, and add some tips that I found useful in terms of error probability, time, and space complexity.

An introduction to Bloom Filter

Bloom Filters are space-efficient data structures that allow for membership queries over a given set of elements. The Bloom Filter uses k hash functions, h1, h2, . . . , hk to hash elements into an array of size m, and for each element s, the bits at positions h1(s), h2(s), . . . , hk(s) in the array are set to 1.

To know if an element is present in the set, the bloom filter checks the bits at positions h1(q), h2(q), . . . , hk(q), and if there are all 1, so the element q is inside the set.

However, there is a problem when the hash function is not good to make the job right, in fact, the hash function can have some errors (aka collisions), and we can have some false positives. This means that we can have some positive results for the query over the bloom filter for some element that is not contained in the set.

This behavior is unacceptable in a data structure, but if we think about why we need a bloom filter, we can accept in some sense this error, if the frequency is not too high.

Usage of Bloom Filter

The bloom filter is a data structure that is used in the world of caching or when we need to check the presence of some element in some sets without getting the value of the element and we need a space efficient data structure. In fact, the basic implementation of the bloom filter contains only a binary array, and it is also static, so it is not possible to add an element in the bloom filter.

However, this last requirement is a limitation of the first bloom filter proposal, nowadays there are other alternatives that admit the modified operation like insert and delete such as Spectral Bloom Filter.

In the real world, the bloom filter is used in very big projects, such as Bitcoin, leveldb, and many others.

Implementing a Bloom Filter

In this article we will discuss easy and readable C++ implementation of the Bloom Filter, and the implementation is contained inside the cpstl (copy and paste standard library)

The code is pretty easy, but it required a hash function, which in this case, we have implemented also in the cpstl a Universal Hash Function.

The code of the bloom filter (without the hash function file that can be found here) is the following one

#include <cmath>
#include <iostream>
#include <memory>
#include <vector>
#include "UniversalHash.hpp"namespace cpstl {/**
* It is used to calculate the number of
* hash function to use inside the bloom filter.
*
* This number are from this chart
* https://freecontent.manning.com/wp-content/uploads/bloom-filters_06.png
*/
enum class ErrProb { HIGHT = 6, MEDIUM = 10, LOW = 14 };
template <class T>
class BloomFilter {
private:
std::size_t size;
std::vector<uint8_t> filter;
std::vector<cpstl::UniversalHash<T>> hash_functions;
uint16_t from_enum_to_prob(ErrProb prob) {
switch (prob) {
case ErrProb::HIGHT:
return 6;
case ErrProb::MEDIUM:
return 10;
case ErrProb::LOW:
return 16;
}
assert(false && "Error probability unknown");
}
std::size_t calculate_number_of_hash_function(uint16_t filter_factor,
std::size_t size) {
return (std::log(2) * ((filter_factor * size / size)));
}
public:
BloomFilter(std::size_t size, ErrProb err_pro = ErrProb::MEDIUM) {
this->size = size;
auto factor = from_enum_to_prob(err_pro);
filter = std::vector<uint8_t>(factor * size, 0);
auto optimal_hash_size =
this->calculate_number_of_hash_function(factor, size);
this->hash_functions = std::vector<cpstl::UniversalHash<T>>(
optimal_hash_size, cpstl::UniversalHash<T>(factor * this->size));
}
void insert(T value) {
for (auto hash : hash_functions) {
auto pos = hash.universal_hashing(value);
filter[pos] = true;
}
}
bool contains(T value) {
for (auto hash : hash_functions) {
auto pos = hash.universal_hashing(value);
if (filter[pos] == false) return false;
}
return true;
}
};
}; // namespace cpstl

You can find the source code available here

In this case, we can noted only one important function that is the following one

std::size_t calculate_number_of_hash_function(uint16_t filter_factor, std::size_t size) {
return (std::log(2) * ((filter_factor * size / size)));
}

This function will return the optimal number of hash functions that are needed to have an acceptable error probability, and we will introduce the proof in the next section, but the only important thing to do now is to play with the code and see how to use the implementation.

In the cpstl repository, there is a test file that implements a test case, and you can find all the source code inside the file Main.

Proof the goodness of Bloom Filter

Bloom Filter, is a probabilistic data structure, and in this section, we try to make a step-by-step proof of what we need to prove the error probability, and what is the correct number of the hash functions to choose to have a good error probability.

We start to introduce the probability that a value in position “i” of the bloom filter is 0 with probability equal to Prob(Bloom[i] == 0) = 1 — (1/n) where n is the size of the bloom filter and in some cases, it is the number of bits used by the bloom filter. Usually, we need only 1 bit to store the presence of an element. But we maintain the explanation easy and n it is the size of the bloom filter.

Now, we repeat this probability for all the hash functions that we are using inside the bloom filter, so the probability to have a bit set to 0 with k hash functions it is the following one

Prob(Bloom[i] == 0) = (1–1/m)^(k * m) where m is the size of the set of elements. In addition, can recognize the formula with the Teyilor Exponential where (1 — x) it is approximable to (e ^ -x).

So in this case, we have (1–1/n)^(k*n) so we can write the formula like e^(-k*n/m).

Now the question is when do we have a false positive? As we described before, we can find a false positive when the bloom filter reports the presence of an element, but it is lying. We can define that this event can happen with the following probability

Prob(Bloom[hash(i)]) == 1 for each hash function in the set of hash functions, and also we know that the element “i” is inside the set of elements. We can define that the probability is defined as (1 — e ^(-k*n/m))

Now, we defined the optimal number of hash functions with the following definition

Hash Function set size = log in base 2 of (n / m).

Benchmarking

In the repository are developed a couple of benchmarks with google benchmark that report also the number of collisions that happens inside the bloom filter.

and the terminal result of the execution is the following one

Conclusion

If you like my work on this article please consider following me on Twitter and Github and if you want some other article please consider donating with the following method https://github.com/sponsors/vincenzopalazzo

--

--

Vincenzo Palazzo
Vincenzo Palazzo

Written by Vincenzo Palazzo

Professional Bodybuilder, and in my free time developing free software!

No responses yet