February 13th, 2009
I haven’t finished the code to implement this idea yet. I’m forced to show my hand since today is the last day of On Ruby’s CouchDB Contest.
A bloom filter is a data structure with some very cool properties. It’s very compact and performant. It represents an approximation to set membership. That is, it represents a collection of elements without storing the elements themselves. An oft-quoted example is that of a dictionary: every word in the dictionary (taking up a lot of space) can be encoded into a substantially smaller bloom filter. This comes with a cost: when a bloom filter says an element is not in the set, it’s not in the set. However, when a bloom filter says that an element is in the set, it could be wrong, and the expected probability of it lying can be calculated from the size of the bloom filter and how many hashes it utilizes.
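As a rough guide, the standard approximation for that false positive probability is p ≈ (1 − e^(−kn/m))^k, where m is the number of bits, n the number of elements added, and k the number of hashes. It’s easy to compute (a sketch; the function name is mine):

```ruby
# Expected false positive rate of a bloom filter, using the standard
# approximation p = (1 - e^(-k*n/m))^k.
#   m: bits in the filter
#   n: elements inserted
#   k: number of hashing functions
def false_positive_rate(m, n, k)
  (1 - Math.exp(-k.to_f * n / m)) ** k
end

# e.g. 10,000 dictionary words in a 65,536-bit filter with 4 hashes
# yields a false positive rate of a few percent.
false_positive_rate(65_536, 10_000, 4)
```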
Properties of Bloom Filters
- Space efficient.
- Speed efficient (constant time adding elements, constant time query).
- Uses k hashing functions (where k is chosen by the user and, together with the filter size, determines how accurate the filter is).
- Combinable. The union of two bloom filters of the same length and the same k hashing functions is just the bit-wise OR of their bit fields.
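The properties above fit in a few lines of Ruby. This is a toy illustration, not any particular library: the class and method names are mine, and the bit field is a plain integer for brevity.

```ruby
require 'digest/md5'

# A minimal bloom filter sketch. Stores the bit field as one big
# integer for brevity; a real implementation would pack bits into
# an array of fixed-size words.
class TinyBloom
  attr_reader :bits, :m, :k

  def initialize(m, k, bits = 0)
    @m, @k, @bits = m, k, bits
  end

  # Derive k bit positions for the value and set them.
  def add(value)
    indexes(value).each { |i| @bits |= (1 << i) }
  end

  # May return a false positive, but never a false negative.
  def include?(value)
    indexes(value).all? { |i| @bits[i] == 1 }
  end

  # Union of two filters with the same m and k: bit-wise OR.
  def |(other)
    raise ArgumentError, "filters must match" unless m == other.m && k == other.k
    TinyBloom.new(m, k, bits | other.bits)
  end

  private

  # k indexes sliced out of one MD5 digest (32 hex chars, so this
  # toy supports k up to 4).
  def indexes(value)
    digest = Digest::MD5.hexdigest(value.to_s)
    (0...k).map { |i| digest[i * 8, 8].to_i(16) % m }
  end
end
```

Adding an element is constant time (k hashes, k bit sets), and so is querying, which is where the speed claims come from.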
Possible use cases
Wiki links: every WikiWord could be encoded into a bloom filter, providing a very fast client-side test to see if a word should be linkable.
ISBNs in a library: I’ve been meaning to hack together an Amazon-powered better book search for my university, but how do you efficiently determine whether a particular book in a search of all books is at my local library without tons of API requests? I’d like to try a bloom filter encoding all the books that are locally available.
Avoiding a denial-of-service attack. Say you’ve generated some large key space to keep your users’ data private but also shareable (like the large random URLs Google Reader uses). Someone could start guessing, and they’ll be guessing for a long time. Yes, you should rate-limit them, but each request could also trigger a database query. Why not use a bloom filter to quickly determine whether that long random key is likely in your database, without the performance hit?
I need to look into the replication code in CouchDB; I don’t know how it’s implemented, but one CouchDB instance could furnish another with a bloom filter representing the set of all documents and their versions in the database.
Peter Cooper coded up a Ruby bloom filter and the Ruby bit field that the bloom filter depends on. The bit field is implemented as an array of 32-bit numbers. To index a specific bit, you first look up the associated number in the array, then the bit offset into that number.
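That two-step lookup amounts to a divide and a modulo. Here is a sketch of the idea (the class and method names are illustrative guesses, not Peter Cooper’s actual API):

```ruby
# Sketch of a bit field packed into an array of 32-bit integers.
# Bit `position` lives at word position / 32, offset position % 32.
class PackedBits
  def initialize(size)
    @field = Array.new((size + 31) / 32, 0)
  end

  def set(position)
    word, offset = position / 32, position % 32
    @field[word] |= (1 << offset)
  end

  def get(position)
    word, offset = position / 32, position % 32
    (@field[word] >> offset) & 1
  end
end
```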
Coming up with k hashing functions
If you’ve got one good hashing function, you can use it to create more hashing functions, as many as you need in fact. The way I’m currently planning on doing this is to take a hash of the first hash, then a hash of the second hash, and so on, until there is enough keying material to produce k indexes.
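That chaining scheme might look like this (a sketch assuming SHA-1 as the base hash; the function name and slicing scheme are mine):

```ruby
require 'digest/sha1'

# Stretch one hashing function into k indexes: repeatedly hash the
# previous digest until there is enough keying material, then slice
# it into k positions modulo the filter size m.
def k_indexes(value, k, m)
  material = ""
  digest = value.to_s
  while material.length < k * 8
    digest = Digest::SHA1.hexdigest(digest)  # hash of the previous hash
    material << digest
  end
  # 8 hex chars (32 bits) of material per index.
  (0...k).map { |i| material[i * 8, 8].to_i(16) % m }
end
```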
Where CouchDB comes in
CouchDB implements views or calculated columns using map/reduce.
The map function will take a document, specify the size of the bloom filter, and specify the number of hashes (k) to use in that bloom filter (or an estimate of how many elements you’ll add and the accuracy you’d like to attain). Then whichever property of the document is desired can be iterated over and its values added to the bloom filter.
The map function will emit a JSON document with an array of numbers (representing the bit field), a property specifying which hashing function is being used (Skein, MD5, SHA-1, etc.), and the number of hashes calculated for each item.
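Such an emitted document might look like this sketch (the field names are my guesses, not a fixed schema; I show the serialization in Ruby for consistency with the rest of the post):

```ruby
require 'json'

# Sketch of the JSON document a map function might emit. The bit
# field is serialized as an array of 32-bit words so two views can
# later be combined with a bit-wise OR.
def bloom_doc(bit_field_words, hash_name, k)
  {
    "bits" => bit_field_words, # array of 32-bit numbers
    "hash" => hash_name,       # e.g. "sha1", "md5", "skein"
    "k"    => k                # hashes calculated per item
  }.to_json
end
```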