Sunday, April 13, 2008

When the problem is really hard...

... change the problem!

This sounds like a cop-out, but by that measure, most of engineering is a cop-out.

This sort of engineering happens in all fields, but since I am most familiar with CS, I'll go over an example from there.

The problem domain arises from the design of caches. It is also applicable to any general search / query optimization and I'll talk about that briefly at the end. On to caches - I'm talking about the web sort here not the processor sort. So on to the problem itself:
Imagine you're building a large cache of webpages. You have the webpage itself on disk, but ideally you don't want to hit disk to avoid the 8ms hit. You can reference each webpage by some id, which is, say, a hash on the URL. One easy thing to do is to store all the ids in a hashtable using some reasonable hashing function. Even with perfect hashing, your hashtable is going to be size O(n) if you have to store n elements. This is fine if the number of documents you're storing is small , but if you have a large number of documents (1 billion docs * 8 bytes = 8G memory), and realistically you don't have a perfect hashing function, this can become quite cumbersome. So what do we do?

Enter bloom filters. Bloom filters are a type of probabilistic data structures that use some fixed number of bits and provide the guarantee that if an element is not found in the bloom filter, it does not exist in the cache. If an element is found, then it is in the cache, with very high probability. What this realistically means is that your bloom filter can say that you have something, when in reality you don't. In most cases this is okay, because the probability of this happening is sufficiently small.

The simplest kind of bloom filter is conceptually pretty easy to explain too. Lets suppose that you're hashing a set of elements S = {x1, x2, .. xn}. Lets use a bitset of a fixed size - say 1024 bits. Now we hash x1 => k1 where {k1..kn} is a number in 2^1024. Inserting an element x1, we hash it to k1 and flip that bit to 1. Thus your bloom filter looks like this:

x1 x3 x5
| | |
v v v
10010101010100000001100000
^
|
x2

x1 => k1
x2, x3 => k12

When we search for x1, the bit is 1, its a hit, we look for it and its there.
Similarly x2 hashes to k12 and the bit is 1, we look for it and its there.
Lets assume x3 is not there in our cache. We search for x3 and it hashes to k12 and the bit is 1 and we try to look for it and its not - its a false positive.
x5 hashes to k17 (or something like that) and the bit is 0, so we know definitively that the element is not in our set.

This is a brilliant optimization that helps out in a surprising number of search problems where the time hit on a false positive is acceptable.

I'm sure I didn't do a good enough job explaining bloom filters, so here's a great talk on the same. The first 10 minutes are really good and then he gets into a lot of very complicated stuff that optimizes various aspects of bloom filters. If you're implementing your own, I'd recommend the whole talk, else just the first 10 minutes or so.



Thus in the end, if you look at the problem definition of hash - "tell me, with certainty, whether a particular element exists in a set or not", its difficult (for space reasons). But if you change the problem definition to "tell me, with certainty, whether a certain element is not in a set, but you can be wrong about it being in the set", suddenly the problem becomes solvable along the space axis.

No comments: