A Bloom filter is a probabilistic data structure that approximately and efficiently answers the set membership question. Given a Bloom filter B constructed out of the elements of a set , returns or , indicating the membership of in . Bloom filters are approximate since they can return false positives. The false positive rate for a Bloom filter can be expressed by:
An interesting question is: Can Bloom filters be used to compute whether two sets are disjoint?
That is, given a set represented by a Bloom filter , and another set , can we use the Bloom filter to accurately determine whether ? An obvious procedure to compute disjointness is:
Procedure : Iterate through each element of the set . If for any element , then return , else return .
Before we can reason about the accuracy of this procedure, let’s define the following events:
- : the event that and are disjoint (and to be its complement), and
- : the event that is accurate (and to be its complement).
Now, we make the following claims about our procedure :
Claim 1: .
In other words, our procedure is guaranteed to be correct when and are not disjoint. This claim follows easily from the fact that a Bloom filter cannot have false negatives, i.e., if does exist in , then will return .
Claim 2: .
In other words, can be inaccurate when and are disjoint. Specifically, this procedure can be inaccurate if the Bloom filter returns for at least one element in , given that no element in belongs to . But how inaccurate? It is often easier to compute the probability of the complementary event , which is the event that the Bloom filter returns for every element of . The probability that the Bloom filter returns for a given element can be expressed by:
Assuming independence of Bloom filter output for the different elements of , and assuming that has elements, we can write:
Further, notice that the likelihood of the sets being disjoint has nothing whatsoever to do with the Bloom filter accuracy (I like to picture it as: ). In other words, and are independent! If we wanted to be explicit, we could apply Bayes’ Theorem:
from which it follows that:
Since , the inaccuracy grows with . This makes intuitive sense since the larger the set , the more the chances that we will run into a Bloom filter false positive. In plain English: Bloom filters are probably not a good idea for computing whether two sets are disjoint.