# Is there an anti-Bloom filter?

A Bloom filter makes it possible to keep track efficiently of whether various values have already been encountered during processing. When there are many data items, a Bloom filter can give a significant memory saving over a hash table. The main feature of a Bloom filter, which it shares with a hash table, is that it always says “not new” if an item is not new, but there is a non-zero probability that an item will be flagged as “not new” even when it is new.

Is there an “anti-Bloom filter”, which has the opposite behaviour?

In other words: is there an efficient data structure which says “new” if an item is new, but which might also say “new” for some items which are not new?

Keeping all the previously seen items (for instance, in a sorted linked list) satisfies the first requirement but may use a lot of memory. I am hoping it is also unnecessary, given the relaxed second requirement.

For those who prefer a more formal treatment, write $b(x) = 1$ if the Bloom filter thinks $x$ is new, $b(x) = 0$ otherwise, and write $n(x) = 1$ if $x$ really is new and $n(x) = 0$ otherwise.

Then $\Pr[b(x) = 0 \mid n(x) = 0] = 1$; $\Pr[b(x) = 0 \mid n(x) = 1] = \alpha$; $\Pr[b(x) = 1 \mid n(x) = 0] = 0$; $\Pr[b(x) = 1 \mid n(x) = 1] = 1 - \alpha$, for some $0 < \alpha < 1$.

I am asking: does an efficient data structure exist, implementing a function $b'$ with some $0 < \beta < 1$, such that $\Pr[b'(x) = 0 \mid n(x) = 0] = \beta$; $\Pr[b'(x) = 0 \mid n(x) = 1] = 0$; $\Pr[b'(x) = 1 \mid n(x) = 0] = 1 - \beta$; $\Pr[b'(x) = 1 \mid n(x) = 1] = 1$?

Edit: It seems this question has been asked before on Stack Exchange, at https://stackoverflow.com/questions/635728 and https://cstheory.stackexchange.com/questions/6596, with a range of answers from "can't be done" through "can be done, at some cost" to "it is trivial to do, by reversing the values of $b$". It is not yet clear to me what the "right" answer is. What is clear is that an LRU caching scheme of some sort (such as the one suggested by Ilmari Karonen) works rather well, is easy to implement, and resulted in a 50% reduction in the time taken to run my code.
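For concreteness, the LRU idea mentioned above can be sketched in a few lines of Python (class name and capacity are illustrative, not from any particular answer): keep a bounded cache of recently seen items; an item found in the cache is certainly not new, while an item absent from the cache is reported "new", which may be wrong if it was seen earlier but has since been evicted. That gives exactly the asymmetry asked for.

```python
from collections import OrderedDict

class LRUAntiFilter:
    """Approximate "anti-Bloom filter" via an LRU cache of seen items.

    "Not new" is always correct (the item is in the cache); "new" may be
    a false report when the item was seen but evicted in the meantime.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.seen = OrderedDict()

    def check_and_add(self, item):
        """Return True if the item looks new, then record it as seen."""
        if item in self.seen:
            self.seen.move_to_end(item)    # refresh recency
            return False                   # certainly not new
        if len(self.seen) >= self.capacity:
            self.seen.popitem(last=False)  # evict least recently used
        self.seen[item] = None
        return True                        # new, or evicted earlier
```

With capacity 2, checking "a", "a", "b", "c", "a" returns True, False, True, True, True: the final "a" is a false "new", because adding "c" evicted it.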

Going with Patrick87's hash idea, here's a practical construction that almost meets your requirements: the probability of falsely mistaking a new value for an old one is not quite zero, but it can easily be made negligibly small.

Choose the parameters $n$ and $k$; practical values might be, say, $n = 128$ and $k = 16$. Let $H$ be a secure cryptographic hash function producing (at least) $n+k$ bits of output.

Let $a$ be an array of $2^k$ $n$-bit bitstrings. This array stores the state of the filter, using a total of $n2^k$ bits. (It does not particularly matter how this array is initialized; we can just fill it with zeros, or with random bits.)

• To add a new value $x$ to the filter, calculate $i \,\|\, j = H(x)$, where $i$ denotes the first $k$ bits and $j$ denotes the following $n$ bits of $H(x)$. Set $a_{i} := j$.

• To test whether a value $x'$ has been added to the filter, calculate $i' \,\|\, j' = H(x')$, as above, and check whether $a_{i'} = j'$. If yes, return true; otherwise return false.
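The construction above is only a few lines of code. Here is a minimal Python sketch using SHA-256 as the hash $H$ (the class name is illustrative; any secure hash producing at least $n+k$ bits would do):

```python
import hashlib

class AntiBloomFilter:
    """Hash-slot filter: 2**k slots, each holding an n-bit fingerprint.

    Adding x writes the n-bit tail of H(x) into the slot indexed by the
    k-bit head; testing compares the stored tail. A "seen" answer for a
    genuinely new value requires a full (n+k)-bit hash collision, while
    old values may be forgotten when another value overwrites their slot.
    """

    def __init__(self, n=128, k=16):
        assert n + k <= 256            # SHA-256 gives 256 bits
        self.n, self.k = n, k
        self.slots = [0] * (1 << k)    # initialization does not matter

    def _split(self, x):
        # i || j = H(x): first k bits are the index, next n bits the fingerprint
        h = int.from_bytes(hashlib.sha256(x.encode()).digest(), "big")
        h >>= 256 - (self.n + self.k)          # keep the leading n+k bits
        return h >> self.n, h & ((1 << self.n) - 1)

    def add(self, x):
        i, j = self._split(x)
        self.slots[i] = j

    def seen(self, x):
        i, j = self._split(x)
        return self.slots[i] == j
```

With the default $n = 128$, $k = 16$, this uses $128 \cdot 2^{16}$ bits (1 MiB) of state regardless of how many values are added.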

Claim 1: The probability of a false positive (= new value falsely claimed to have been seen) is $1/2^{n+k}$ for any given previously added value, since the new value's full $(n+k)$-bit hash must collide with that value's. This can be made arbitrarily small, at a modest cost in storage space, by increasing $n$; in particular, for $n \ge 128$, this probability is essentially negligible, being, in practice, much smaller than the probability of a false positive due to a hardware malfunction.

In particular, after $N$ distinct values have been checked and added to the filter, the probability of at least one false positive having occurred is $(N^2-N) / 2^{n+k+1}$. For example, with $n=128$ and $k=16$, the number of distinct values needed to get a false positive with 50% probability is about $2^{(n+k)/2} = 2^{72}$.
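This birthday-style estimate is easy to check numerically; a small sketch (function name is illustrative):

```python
def false_positive_prob(N, n=128, k=16):
    """Birthday bound: probability of at least one false positive
    after N distinct values have been checked and added."""
    return (N * N - N) / 2 ** (n + k + 1)

# With n + k = 144 hash bits, roughly 2**72 distinct values are
# needed before a false positive becomes about 50% likely.
```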

Claim 2: The probability of a false negative (= previously added value falsely claimed to be new) is no greater than $1-(1-2^{-k})^N \approx 1-\exp(-N/2^k) < N/2^k$, where $N$ is the number of distinct values added to the filter (or, more specifically, the number of distinct values added after the specific value being tested was most recently added to the filter).
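Likewise, the false-negative bound can be evaluated directly; the helper below (hypothetical name) returns the exact expression together with its exponential approximation and the simple upper bound, in that order:

```python
from math import exp

def false_negative_bound(N, k=16):
    """Probability that a value added to the filter is later reported
    as new, after N further distinct values have been added."""
    exact = 1 - (1 - 2 ** -k) ** N   # some later value reused the slot
    approx = 1 - exp(-N / 2 ** k)
    upper = N / 2 ** k
    return exact, approx, upper
```

For example, with $k = 16$ and $N = 1000$, all three are about 1.5%: adding a thousand further values only rarely evicts any particular earlier one.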

PS. To put "negligibly small" into perspective, 128-bit encryption is generally considered unbreakable with currently known technology. Getting a false positive from this scheme with $n+k=128$ is as likely as someone correctly guessing your secret 128-bit encryption key on their first attempt. (With $n=128$ and $k=16$, it's actually about 65,000 times less likely than that.)

But if that still leaves you feeling irrationally nervous, you can always switch to $n=256$; it'll double your storage requirements, but I can safely bet you any sum you'd care to name that nobody will ever see a false positive with $n=256$ (assuming, that is, that the hash function isn't broken).