CouchDB Bloom Filters

February 13th, 2009

I haven’t finished the code to implement this idea yet. I’m forced to show my hand since today is the last day of On Ruby’s CouchDB Contest.

A bloom filter is a data structure with some very cool properties. It’s very compact and performant. It represents an approximation of set membership; that is, it represents a collection of elements without storing the elements themselves. An oft-quoted example is that of a dictionary: one can encode every word in the dictionary (which takes up a lot of space) into a substantially smaller bloom filter. This comes with a cost: when a bloom filter says an element is not in the set, it’s not in the set. However, when a bloom filter says that an element is in the set, it could be wrong, but the expected probability of it lying can be ascertained from the size of the bloom filter and how many hashes it utilizes.

Properties of Bloom Filters

  1. Space efficient.
  2. Speed efficient (constant time adding elements, constant time query).
  3. Uses k hashing functions (where k is a user-selected number of hashing functions; together with the filter’s size, it determines how accurate the filter is).
  4. Combinable. The union of two bloom filters of the same length and same k hashing functions is just the bitwise OR of their bit fields (see the sketch below).
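
To make these properties concrete, here’s a minimal javascript sketch (the names and the simpleHash stand-in are placeholders, not the contest code; the actual plan, described below, is to derive the k indexes from Skein). For sizing, the standard approximation for the false-positive rate of an m-bit filter holding n elements with k hashes is (1 − e^(−kn/m))^k.

    // Minimal bloom filter sketch. The bit field is an array of 32-bit
    // numbers; simpleHash() is only a stand-in for a real hash, seeded
    // differently for each of the k probes.
    function simpleHash(str, seed) {
      var h = seed;
      for (var i = 0; i < str.length; i++) {
        h = ((h * 31) + str.charCodeAt(i)) >>> 0; // stay in 32-bit range
      }
      return h;
    }

    function BloomFilter(numBits, numHashes) {
      this.numBits = numBits;
      this.numHashes = numHashes;
      this.field = [];
      for (var i = 0; i < Math.ceil(numBits / 32); i++) this.field.push(0);
    }

    // Map an element to k bit positions.
    BloomFilter.prototype.indexesFor = function(element) {
      var indexes = [];
      for (var k = 0; k < this.numHashes; k++) {
        indexes.push(simpleHash(element, k + 1) % this.numBits);
      }
      return indexes;
    };

    BloomFilter.prototype.add = function(element) {
      var indexes = this.indexesFor(element);
      for (var i = 0; i < indexes.length; i++) {
        this.field[Math.floor(indexes[i] / 32)] |= (1 << (indexes[i] % 32));
      }
    };

    BloomFilter.prototype.query = function(element) {
      var indexes = this.indexesFor(element);
      for (var i = 0; i < indexes.length; i++) {
        if ((this.field[Math.floor(indexes[i] / 32)] & (1 << (indexes[i] % 32))) === 0) {
          return false; // definitely not in the set
        }
      }
      return true; // probably in the set (could be a false positive)
    };

    // Union of two filters with the same length and hashes: bitwise OR.
    BloomFilter.prototype.union = function(other) {
      for (var i = 0; i < this.field.length; i++) this.field[i] |= other.field[i];
    };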

Possible use cases

Besides the dictionary use case mentioned, bloom filters are used to great effect in distributed file systems and caches, as a quick ‘filter’ to determine if an element is in the cache or on disk, to avoid expensive computations. I’d like to be able to reconstitute a bloom filter as a javascript object. Then you could very quickly determine client-side set membership without querying the server.

Examples: you could encode every WikiWord into a bloom filter, providing a very fast client-side test to see if a word should be linkable.

ISBNs in a library: I’ve been meaning to hack together an Amazon-powered better book search for my university, but how do you efficiently determine whether a particular book from a search of all books is at my local library without tons of API requests? I’d like to try a bloom filter encoding all the locally available books.

Trying to avoid a denial-of-service attack. Say you’ve generated some large key space to keep your users’ data private but also shareable (like the large random URLs Google Reader uses). Someone could start guessing, and they’ll be guessing for a long time. Yes, you should rate-limit them, but each request could also result in a database query; why not use a bloom filter to quickly determine whether that long random key is likely in your database without the performance hit?

I need to look into the replication code in CouchDB; I don’t know how it’s implemented, but one CouchDB instance could furnish another instance with a bloom filter representing the set of all documents and their versions in the database.

Javascript and Ruby encodings of bloom filters and bit fields

Peter Cooper coded up a ruby bloom filter and the ruby bit field that the bloom filter depends on. The bit field is implemented as an array of 32-bit numbers. To index a specific bit, you first look up the associated number in the array, then the bit offset into that number.

An array of numbers is easily expressible as JSON. A javascript class could use the JSON numeric array directly as its bit field, as could Ruby or any other language.
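
For illustration, a javascript bit field along those lines might look something like this (a sketch only, not Peter Cooper’s code or my port; the class and method names are placeholders):

    // Bit field backed by a plain array of 32-bit numbers, so it can be
    // built straight from a JSON numeric array (or serialized back to one).
    function BitField(sizeOrArray) {
      if (sizeOrArray instanceof Array) {
        this.field = sizeOrArray; // reuse a JSON-decoded array as-is
      } else {
        this.field = [];
        for (var i = 0; i < Math.ceil(sizeOrArray / 32); i++) this.field.push(0);
      }
    }

    BitField.prototype.set = function(position, value) {
      var index = Math.floor(position / 32); // which 32-bit number
      var offset = position % 32;            // which bit inside that number
      if (value) {
        this.field[index] |= (1 << offset);
      } else {
        this.field[index] &= ~(1 << offset);
      }
    };

    BitField.prototype.get = function(position) {
      return (this.field[Math.floor(position / 32)] & (1 << (position % 32))) !== 0 ? 1 : 0;
    };

    // e.g. reconstitute a filter's bit field from a CouchDB view result:
    // var bits = new BitField(viewResult.rows[0].value.field);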

Coming up with k hashing functions

So we’ve got bit fields; next we need a hashing function. I’ve been looking for an excuse to use Skein, Bruce Schneier’s team’s submission to the NIST competition to determine the next standard hash (SHA-3). The H2 Database guys, in addition to a Java-based version of the hash, provided a javascript implementation of Skein as well.

If you’ve got a good hashing function, you can use it to create more hashing functions, as many as you need in fact. The way I’m currently planning on doing this is to take a hash of the first hash, then a hash of the second hash, and so on, until there is enough keying material to produce k indexes.
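
In code, the idea looks roughly like this (a sketch: hexDigest is a stand-in for whatever Skein binding ends up getting used, and the way the keying material is sliced into indexes may well change):

    // Derive k bit positions from a single hash function by re-hashing:
    // hash the element, then hash that hash, and so on, until there are
    // enough hex digits to carve k 32-bit indexes out of.
    function hashToIndexes(hexDigest, element, numBits, numHashes) {
      var digitsPerIndex = 8; // 8 hex digits -> one 32-bit number
      var material = hexDigest(element);
      while (material.length < numHashes * digitsPerIndex) {
        material += hexDigest(material); // hash of the previous hash
      }
      var indexes = [];
      for (var k = 0; k < numHashes; k++) {
        var chunk = material.substr(k * digitsPerIndex, digitsPerIndex);
        indexes.push(parseInt(chunk, 16) % numBits);
      }
      return indexes;
    }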

Where CouchDB comes in

CouchDB implements views or calculated columns using map/reduce.

Map

The map function will take a document, specify the size of the bloom filter, and specify the number of hashes (k) to use in that bloom filter (or, alternatively, an estimate of how many elements you’ll add and the accuracy you’d like to attain). Then whatever property you want out of the document can be iterated over and added to the bloom filter.

The map function will emit a JSON document with an array of numbers (representing the bit field), a property specifying which hashing function is being used (skein, md5, sha1, etc.), and the number of hashes calculated for each item.
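
For concreteness, a map function along those lines might look something like this (a sketch: the document property, the filter size, and k are placeholders, and it assumes the bloom filter code is available to the view, which is exactly the utility-function question under Status below):

    // Sketch of a CouchDB map function that folds one property of each
    // document into a per-document bloom filter and emits it as plain JSON.
    function(doc) {
      var numBits = 1024; // placeholder size
      var numHashes = 4;  // placeholder k
      var filter = new BloomFilter(numBits, numHashes);
      var words = (doc.body || "").split(/\s+/); // whatever property you're after
      for (var i = 0; i < words.length; i++) {
        filter.add(words[i]);
      }
      emit(doc._id, {
        field: filter.field, // the bit field, an array of 32-bit numbers
        hash: "skein",       // which hashing function produced the indexes
        hashes: numHashes    // k, the number of indexes per item
      });
    }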

Reduce

The reduce function will combine bloom filters, taking their union with a bitwise OR. After the reduce, a query to CouchDB will return a result containing a bloom filter for whatever content you’re after. That result can be used client-side, in javascript, or server-side, in ruby or something else.
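
A matching reduce could be as simple as OR-ing the bit fields together (again a sketch, assuming every emitted filter shares the same length, hash, and k):

    // Sketch of a CouchDB reduce function: union the bloom filters by
    // bitwise OR-ing their bit fields. The rereduce case works the same
    // way, since reduced values have the same shape as mapped values.
    function(keys, values, rereduce) {
      var combined = values[0];
      for (var v = 1; v < values.length; v++) {
        for (var i = 0; i < combined.field.length; i++) {
          combined.field[i] |= values[v].field[i];
        }
      }
      return combined;
    }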

Status

I’ve ported Peter Cooper’s ruby bit field to javascript. The bloom filter itself is partially implemented. I need to figure out how to add utility functions to CouchDB so that the bloom filter code can be easily used in any view. If anyone knows of a good javascript testing framework that works with SpiderMonkey from the command line, I am all ears.

1 Response to “CouchDB Bloom Filters”

  1. Brian Whitmer Says:
    Very nice stuff. I think we'll be using this.
