CouchDB Bloom Filters
February 13th, 2009
I haven’t finished the code to implement this idea yet. I’m forced to show my hand since today is the last day in On Ruby’s CouchDB Contest .
A bloom filter is data structure with some very cool properties. It’s very compact and performant. It represents an approximation to set membership. That is, it represents a collection of elements without storing the elements. A oft quoted example is that of a dictionary—one can encode every word in the dictionary (taking up a lot of space) into a substantially smaller bloom filter. This comes with a cost, when a bloom filter says an element is not in the set, it’s not in the set. However, when a bloom filter says that an element is in the set, it could be wrong, but the expected probability of it lying can be ascertained from the size of the bloom filter and how many hashes it utilizes.
Properties of Bloom Filters
- Space efficient.
- Speed efficient (constant time adding elements, constant time query).
- Uses k hashing functions (where k is the user selected number of hashing functions to use, determines how accurate the filter is).
- Combinable. The union of two bloom filters of the same length and same k hashing functions is just the bit-wise OR of its bitfield.
Possible use cases
Besides the dictionary use case mentioned, bloom filters are used to great effect in distributed file systems and caches, as a quick ‘filter’ to determine if an element is in the cache or on disk, to avoid expensive computations. I’d like to be able to reconstitute a bloom filter as a javascript object. Then you could very quickly determine client-side set membership without querying the server.
Examples: could encode every WikiWord into a bloomfilter providing a very fast client-side test to see if a word should be linkable.
ISBN’s in a library: I’ve been meaning to hack together a Amazon powered better book search for my university, but how to efficiently determine if a particular book in a search of all books is at my local library without tons of API requests? I’d like to try a bloom filter encoding all books locally available.
Trying to avoid a Denial-of-Service attack. Say you’ve generated some large key space to keep your users data private but also shareable, (like the large random URL’s Google Reader uses). Someone could start guessing, and they’ll be guessing for a long time. Yes you should rate-limit them, but each request could possibly result in a database query too—why not use a bloom filter to quickly determine if that long random key is likely in your database without the performance hit?
I need to look into the replication code in CouchDB, I don’t know how it’s being implemented—but one CouchDB instance could furnish another instance a bloom filter representing the set of all documents and their versions in the database.
Javascript and Ruby encodings of bloom filters and bit fields
Peter Cooper coded up a ruby bloom filter and the ruby bit field that the bloom filter depends on. The bit field is implement as an array of 32-bit numbers. To index a specific bit you first look up the associated number in the array, then the bit offset into that number.
An array of numbers, easily expressible as JSON. A javascript class could easily use the JSON numeric array as the bitfield, as could Ruby or any other language.
Coming up with k hashing functions
So we’ve got bit fields, then we just need a hashing function. I’ve been looking for an excuse to use Bruce Schneier’s team submission to the NIST competition to determine the next standard hash (SHA3) called Skein. The H2 Database guys, in addition to a Java based version of the hash, provided us with a javascript implementation of Skein as well.
If you’ve got a good hashing function, you can use that hashing function to create more hashing functions, as many as you need in fact. The way I’m currently planning on doing this is to take a hash of the first hash, then a hash of the second hash etc. until there is enough keying material to produce k indexes.
Where CouchDB comes in
CouchDB implements views or calculated columns using map/reduce.
Map
The map function will take a document, specify the size of the bloom filter, and specify the number of hashes (k) to use in that bloom filter (or an estimate of how many elements you’ll add and the accuracy you’d like to attain). Then whatever property is desired out of the document can be iterated over and added to the bloom filter.
The map function will emit a JSON document with an array of numbers (representing the bit field), a property specifying which hashing function is being used (skein, md5, sha1 etc.) and the number of hashes calculated for each item.
Reduce
The reduce function will combine bloom filters, like union with a bitwise OR. After the reduce a query to CouchDB will return a result containing a bloom filter for whatever content you’re after. This result can be used client-side, in javascript, or server side in ruby or something.
Status
I’ve ported Peter Cooper’s ruby bit field to javascript. The bloom filter itself is partially implemented. I need to figure out how to add utility functions to CouchDB so that the bloom filter code can be easily used in any view. If anyone knows of a good javascript testing framework that works with SpiderMonkey from the command line I am all ears.
MountainWest RubyCamp This Saturday, 17 Nov
November 14th, 2007
The summary:
MountainWest RubyCamp 2007
Saturday, November 17th 2007
1:00 PM – 4:00 PM
Salt Lake City Library
Level 4 Meeting Room
The organizers have asked that if you’re coming to put your name on the wiki page
I’ve never been to a “Camp” before but I hear they are like “unconferences”. I go to the Internet Identity Workshop which is an unconference, and the results have been fantastic. Those who come are actively involved in the discussion, it’s quite refreshing.
I don’t know if I will present or not. I could lead a discussion and get people up to speed with the digital identity landscape—OpenID, CAS, InfoCard and some secret sauce :)
If you’re interested but a little put off that it is specifically about “Ruby” you should come anyway. Ruby is a good excuse to get together and rub elbows, it’s not an excuse to exclude interesting people or ideas.
Painless Instant Local Version Control
October 8th, 2007
To learn about distributed version control systems (DVCS), I’ve been using bazaar on some small projects . I know these DVCSs are supposed to really shine in multi-developer environments spread out across the world but I have found that they offer an incredibly low barrier to entry for a single developer on his own box. This is the typical situation for a college student in computer science and many others as well.
An example could help here. I’m taking a networking class where a bit of code is provided in a framework and we need to use the code for our programs.cd lab1 bzr init bzr add bzr checkin -m "Initial import"
Done. This folder is now under version control.
When version control is this easy to setup and use, there’s no excuse for not using it.
Immersion is Mandatory
October 2nd, 2007
Some time ago I went on a church mission to Brazil for two years. I didn’t know anything about Brazil or Portuguese. We have a Missionary Training Center where I was inundated with non-stop Portuguese lessons for two months. It moved so fast it was hard to remember everything but it was a good preparation. I went from no knowledge of Portuguese to two months later being dropped off in the small town of Maracajú, Mato Grosso do Sul, Brazil. The other missionary working with me was Brazilian leaving me three hours away from anyone who could understand me. I thought I was learning a lot in the classroom but have since found that the pace of learning doesn’t even compare to complete immersion. I don’t like to go hungry. I learned how to speak.
I’ve been in web development for a long time. Since I’m a student in computer science, I have gravitated towards making the “back-end” of web systems. At the same time I think that user experience is the most important aspect of software design. I found that even though I read a lot about it, my javascript fu was not strong enough. I could design cool, useful apps that were fully functional on the backend, but in order to code up the front I would search for javascript scripts and widgets until I found one that was similar to what I was looking for and then just tweaked it. That works for many things, but other times you can’t fall back on those tactics and must flex your own fu.
After all is said and done, more is said than done. —Aesop
Reading and talking about it aren’t enough. You’ve got to do it yourself.
I gave myself a self-imposed immersion project. I wanted something small enough that I could have some quick success (to entice myself to stick with it) but still be big enough to be interesting.
When the term Ajax first started gaining momentum Jamis Buck blogged a piece about his first Ajax app , a simple word game. The back story was that there was a technical discussion at his work about the merits of asynchronous XMLHTTPRequest versus synchronous XMLHTTPRequest. His little word game was a demonstration to his coworkers showing that asynchronous was the way to go, so that it would not lock up the browser. It was a fun little game that both me and my wife enjoyed playing now and again.
For my javascript immersion project I decided that I would try and write the entire game in javascript with no back-end at all. It was just to force myself to get out of my comfort zone.
Here it is, Devlin’s Word Reorder Game.
Some technical notes:
This is implemented using the excellent jQuery library. The tasteful effects courtesy of the Interface library. The dictionary is just all the four letter words contained in the file /usr/share/dict/words on my Mac. I know, a lot of words you think should be in there aren’t (especially the short pluralized ones like tips, huts etc.). If you can get me a better dictionary I’ll glady switch.
I used jQuery and am glad I did. It takes away a lot of pain experienced in my previous forays into javascript and browser incompatibilities.
