CUCKOO vs BLOOM filter, from a Gopher’s perspective

Written by hackernoon-archives | Published 2018/09/28
Tech Story Tags: algorithms | data-science | golang | go | data-structures


In this article, I implement a cuckoo filter and test its efficiency against a bloom filter. (Read the previous post on Chord DHT, implementing a distributed hash table in Golang.)

Introduction

Probabilistic data structures are very useful, especially when processing large data sets. Most of the time, while working on the data side of things, one wants to answer a simple “is the item not present” or “is the item already present” query while processing real-time data. Say you want to answer queries in real time, such as the number of unique IPs, the most frequent IPs, or whether an ad has already been served to a user: probabilistic data structures provide a space-efficient way to answer them. The typical approach to such queries would be to use a HashMap or a HashTable, or to store the data in some external cache (like Redis), but the problem is that with large datasets these simple data structures can’t fit into memory. This is where probabilistic data structures come into play, because of their space and time advantages.

Example Use cases

  • Google Bigtable, Apache HBase, Apache Cassandra, and PostgreSQL use Bloom filters to reduce disk lookups for non-existent rows or columns. Avoiding costly disk lookups considerably increases the performance of a database query operation.
  • Medium uses Bloom filters to check whether an article has already been recommended to a user.
  • Ethereum uses Bloom filters for quickly finding logs on the Ethereum blockchain.
  • The Google Chrome web browser used to use a Bloom filter to identify malicious URLs. Any URL was first checked against a local Bloom filter, and only if the Bloom filter returned a positive result was a full check of the URL performed (and the user warned, if that too returned a positive result).

What’s in a “Cuckoo”?

We’ve used bloom filters in many places for answering such queries on the data platform. Recently I came across the Cuckoo filter paper, which caught my interest. The title itself says “Cuckoo Filter: Practically Better Than Bloom”, so I decided to check it out.

Cuckoo filters improve upon the design of the bloom filter by offering deletion, limited counting, and a bounded false positive probability, while still maintaining a similar space complexity. They use cuckoo hashing to resolve collisions and are essentially a compact cuckoo hash table.

Cuckoo and bloom filters are both useful for set membership testing when the size of the original data is large. Both get by on just a few bits per entry (around 7 bits at a false positive rate of about 3%). They are also useful when an expensive operation can be avoided by running a set membership test first. For example, before querying a database, a membership test can check whether the desired object is even in the database, as sketched below.
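
A minimal Go sketch of that pattern (the CuckooFilter type and queryDatabase are hypothetical stand-ins for whatever filter and storage layer you use):

// get answers a read by consulting the filter before the expensive store.
// Membership filters have no false negatives, so a negative answer lets us
// skip the disk/network lookup entirely; a positive answer may rarely be a
// false positive, which the real query then resolves.
func get(filter *CuckooFilter, key []byte) ([]byte, bool) {
    if !filter.Lookup(key) {
        return nil, false // definitely not present
    }
    return queryDatabase(key) // possibly present: do the real lookup
}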

Algorithm

Parameters of the Filter:
1. Two Hash Functions: h1 and h2
2. An array B with n buckets. The i-th bucket will be called B[i]

Input: L, a list of elements to be inserted into the cuckoo filter.

Algorithm:
While L is not empty:
    Let x be the first item in the list L. Remove x from the list.
    If B[h1(x)] is empty:
        Place x in B[h1(x)]
    Else, if B[h2(x)] is empty:
        Place x in B[h2(x)]
    Else:
        Let y be the element in B[h2(x)].
        Prepend y to L.
        Place x in B[h2(x)].
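
A direct Go translation of this loop, as a sketch (h1 and h2 are placeholder hash functions, and an empty string marks an empty bucket; note that, like the pseudocode, this loops forever if the evictions form a cycle, which is why real implementations cap the number of kicks):

// insertAll inserts every element of list into the bucket array b using
// cuckoo hashing. h1 and h2 map an element to one of len(b) buckets.
func insertAll(b []string, list []string, h1, h2 func(string) int) {
    for len(list) > 0 {
        x := list[0]
        list = list[1:] // remove x from the list
        switch {
        case b[h1(x)] == "":
            b[h1(x)] = x
        case b[h2(x)] == "":
            b[h2(x)] = x
        default:
            y := b[h2(x)]                       // evict the current occupant
            list = append([]string{y}, list...) // prepend y for re-insertion
            b[h2(x)] = x
        }
    }
}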

Implementation

The implementation seems pretty straightforward, so I decided to have a go at it and compare how space/time efficient it is against a bloom filter.

The Cuckoo filter consists of a cuckoo hash table that stores the “fingerprints” of the items inserted. The fingerprint of an item is a short bit string derived from the hash of that item. A cuckoo hash table consists of an array of buckets, where an item to be inserted is mapped to two possible buckets by two hash functions. Each bucket can be configured to store a variable number of fingerprints. A cuckoo filter is typically identified by its number of candidate buckets per item and its bucket size: for example, in a (2,4) cuckoo filter each item maps to 2 candidate buckets, and each bucket in the cuckoo hash table can store up to 4 fingerprints.
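
Since the original code embeds are no longer available, here is a minimal sketch of the definitions the snippets below will build on. The 8-bit fingerprints, FNV hashing, zero-value-as-empty-slot convention, and power-of-two table size are my simplifications for the sketch, not requirements of the paper; the alternate bucket uses the partial-key trick i2 = i1 XOR hash(fingerprint).

package cuckoo

import "hash/fnv"

const (
    bucketSize = 4   // slots per bucket: the "4" in a (2,4) filter
    maxKicks   = 500 // evictions to attempt before declaring the table full
)

type fingerprint byte // 8-bit fingerprints, for simplicity

type bucket [bucketSize]fingerprint // the zero value marks an empty slot

type CuckooFilter struct {
    buckets []bucket
    count   uint
}

// NewCuckooFilter allocates a table of n buckets. n should be a power of
// two so that the XOR in altIndex always lands on a valid bucket.
func NewCuckooFilter(n uint) *CuckooFilter {
    return &CuckooFilter{buckets: make([]bucket, n)}
}

// hash64 is the single underlying hash; FNV-1a keeps the sketch dependency-free.
func hash64(data []byte) uint64 {
    h := fnv.New64a()
    h.Write(data)
    return h.Sum64()
}

// fingerprintOf derives a non-zero fingerprint from an item's hash.
func fingerprintOf(x []byte) fingerprint {
    fp := fingerprint(hash64(x) >> 56)
    if fp == 0 {
        fp = 1 // zero is reserved for "empty slot"
    }
    return fp
}

// indexes returns the item's fingerprint and its two candidate buckets,
// using partial-key cuckoo hashing: i2 = i1 XOR hash(fp).
func (cf *CuckooFilter) indexes(x []byte) (i1, i2 uint, fp fingerprint) {
    fp = fingerprintOf(x)
    i1 = uint(hash64(x)) % uint(len(cf.buckets))
    i2 = cf.altIndex(i1, fp)
    return
}

// altIndex computes the other candidate bucket for a fingerprint, from
// either of its two indexes (the XOR makes the mapping symmetric).
func (cf *CuckooFilter) altIndex(i uint, fp fingerprint) uint {
    return (i ^ uint(hash64([]byte{byte(fp)}))) % uint(len(cf.buckets))
}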

Insertion

Algorithm:

f = fingerprint(x);
i1 = hash(x);
i2 = i1 ⊕ hash(f);

if bucket[i1] or bucket[i2] has an empty entry then
   add f to that bucket;
   return Done;

// must relocate existing items;
i = randomly pick i1 or i2;
for n = 0; n < MaxNumKicks; n++ do
   randomly select an entry e from bucket[i];
   swap f and the fingerprint stored in entry e;
   i = i ⊕ hash(f);
   if bucket[i] has an empty entry then
      add f to bucket[i];
      return Done;

// Hashtable is considered full;
return Failure;

Code:

<a href="https://medium.com/media/fb67f33fdc8334fa7f9ec0af7bb85fce/href">https://medium.com/media/fb67f33fdc8334fa7f9ec0af7bb85fce/href</a>

Search

Algorithm:

f = fingerprint(x);
i1 = hash(x);
i2 = i1 ⊕ hash(f);

if bucket[i1] or bucket[i2] has f then
    return True;

return False;

Code:

<a href="https://medium.com/media/90d845abd04c0e11f41bbe113395645f/href">https://medium.com/media/90d845abd04c0e11f41bbe113395645f/href</a>

Delete

Algorithm:

f = fingerprint(x);
i1 = hash(x);
i2 = i1 ⊕ hash(f);

if bucket[i1] or bucket[i2] has f then
   remove a copy of f from this bucket;
   return True;

return False;

Code:

<a href="https://medium.com/media/31a0be9921f81e3f26d1eb425c147a3d/href">https://medium.com/media/31a0be9921f81e3f26d1eb425c147a3d/href</a>

Performance Test

I’ve used Will Fitzgerald’s library for the Bloom filter side of the test. The FPP (false positive probability) target used for the cuckoo filter is 0.001.

<a href="https://medium.com/media/eb93ffc65a9a4878e55c27b47d12597d/href">https://medium.com/media/eb93ffc65a9a4878e55c27b47d12597d/href</a>

Space Complexity

With regards to the cuckoo and bloom filters, they perform differently at different false positive probabilities. When the false positive probability of the filter is less than or equal to 3%, the cuckoo filter has fewer bits per entry. When it is higher, the bloom filter has fewer bits per entry.
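
To see where that ~3% crossover comes from (the constants are from the cuckoo filter paper; the arithmetic is mine): a space-optimal bloom filter needs about 1.44 * log2(1/e) bits per entry for a false positive rate e. A (2,4) cuckoo filter needs a fingerprint of about log2(1/e) + 3 bits (the 3 extra bits account for the 2 * 4 = 8 slots a fingerprint can collide with), divided by the ~95.5% load factor achievable with 4-way buckets; the paper's semi-sorting optimization saves one more bit. At e = 0.1%, that works out to about 14.4 bits per entry for the bloom filter versus roughly 12.5 to 13.6 for the cuckoo filter; with semi-sorting the two are nearly even around e = 3%, and above that the bloom filter is smaller.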

Time Complexity

In cuckoo hashing, inserting an element can be much worse than O(1) in the worst case: a collision may force a chain of relocations, each evicted value displacing another to make room for the current one, and if the evictions form a cycle the entire table must be rehashed (a cuckoo filter instead gives up after a bounded number of kicks and reports itself full). Expected insertion time is still O(1) amortized.

A time analysis of both filters yielded the following results:

<a href="https://medium.com/media/8991ffe2b9d120643326957ae8d09a50/href">https://medium.com/media/8991ffe2b9d120643326957ae8d09a50/href</a>

Throughout this experiment (keeping in mind that my code may not be fully optimized), the Bloom filter performed exceptionally well on space, occupying less space for a large number of items. The cuckoo filter was faster at inserting a large number of items, but a little slower on lookups (search times) due to its implementation.

Inference

I wouldn’t really take a side on which filter to recommend; I think they both have their own use cases. Bloom filters do not support deletions, because hashing is lossy and irreversible. Counting bloom filters solve that problem at extra space cost, but cuckoo filters are the natural fit when you require deletions. A cuckoo filter also returns an error when it is full, which has its own advantages; a bloom filter, by contrast, has no hard notion of capacity: you can keep setting bits in the existing bit array, and the false positive rate silently degrades.

Code

arriqaaq/cuckoo

References

Fan, B., Andersen, D. G., Kaminsky, M., and Mitzenmacher, M. D., “Cuckoo Filter: Practically Better Than Bloom”, ACM CoNEXT 2014.

P.S. If you find anything wrong with the tests/implementation, please feel free to leave your suggestions/comments.


Published by HackerNoon on 2018/09/28