Things I do when analyzing large datasets in CouchDB

Views should return gzip'd results


CouchDB is a good HTTP server; I usually don't need to proxy requests through Apache or Nginx. What it doesn't do is return views in a compressed format. Views are great candidates for compression because there is often a lot of redundant data in the results. I used to proxy through Apache in order to get gzip compression for all JSON data. That worked OK, but large replications would often fail for reasons I could never understand. When replicating directly to CouchDB, there was no problem. My current solution is to use this very simple, easy to understand node proxy that adds gzip: http://broken-by.me/tag/accept-encoding-gzip/
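
The whole idea boils down to something like the sketch below: forward every request to CouchDB and gzip the response when the client asks for it. This is my own minimal sketch, not the code from the linked post, and the ports are assumptions.

http = require 'http'
zlib = require 'zlib'

# Listen on 8080, forward everything to CouchDB on 5984 (both ports are assumptions).
server = http.createServer (req, res) ->
  options =
    hostname: 'localhost'
    port: 5984
    path: req.url
    method: req.method
    headers: req.headers

  backend = http.request options, (backendRes) ->
    headers = backendRes.headers
    if /gzip/.test(req.headers['accept-encoding'] or '') and not headers['content-encoding']
      # Client accepts gzip and the response isn't already compressed: gzip it ourselves.
      delete headers['content-length']
      headers['content-encoding'] = 'gzip'
      res.writeHead backendRes.statusCode, headers
      backendRes.pipe(zlib.createGzip()).pipe(res)
    else
      res.writeHead backendRes.statusCode, headers
      backendRes.pipe(res)

  req.pipe backend

server.listen 8080
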
I then use forever in a reboot cron job to make sure it stays up:
@reboot /usr/local/bin/forever start -c /usr/bin/node /var/www/gooseberry/scripts/proxy.js


When a view isn't possible, just make documents

For me, data is captured at different times and stored in separate documents that correspond to the type of data being captured. But analysis often requires grouping documents together. For instance, Coconut Surveillance tracks malaria cases. A single case consists of data captured at different times, by different users, in different places. One record holds the data captured at the facility, other data is captured at a household. It may even be captured on different tablets, so it isn't all available until the results are replicated to the cloud server. Views can emit one or more rows per document, but they cannot emit a row that includes data from multiple documents. In other words, I can't make a view where each emitted row represents the full data for one of my malaria cases.

What ends up happening is that I have to do one query to figure out all of the documents relevant to a case, then group all of the results on the client side into one object that I can analyze. When I need to analyze a lot of cases, this approach becomes very slow. Even creating a spreadsheet of case data is slow. After trying lots of different strategies, my current approach is to manually create a document for each case that contains all of the data pulled from the docs relevant to that case. I have a script that watches the _changes feed; when it detects a change relevant to a case, it opens the corresponding case document and updates it. Now, if I want to analyze case data I can simply load the relevant case documents (one document per case) or use views built on the case documents (one row per case), and things are fast. It feels dirty to manually keep and update redundant data in the database, but that is basically what a view is: just a manually managed one.
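
The watcher script itself is not complicated. Here's a rough sketch of the idea using the follow and nano modules; the database name, the caseid field, and the docs-by-caseid view are assumptions about the data model, not the actual Coconut Surveillance code.

follow = require 'follow'
nano = require('nano')('http://localhost:5984')
db = nano.use 'coconut'

follow {db: 'http://localhost:5984/coconut', include_docs: true, since: 'now'}, (error, change) ->
  return console.error error if error
  return if change.id.indexOf('case-') is 0      # skip the aggregate docs themselves
  caseid = change.doc?.caseid
  return unless caseid                           # not a case-related document

  # Gather every document that belongs to this case (via a view keyed on caseid),
  # then write one aggregate "case-<caseid>" document that reports can load directly.
  db.view 'cases', 'docs-by-caseid', {key: caseid, include_docs: true}, (err, body) ->
    return console.error err if err
    aggregate =
      _id: "case-#{caseid}"
      caseid: caseid
      documents: (row.doc for row in body.rows)
    db.get aggregate._id, (err, existing) ->
      aggregate._rev = existing._rev unless err  # update in place instead of conflicting
      db.insert aggregate, (err) ->
        console.error err if err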

Keep a class

Related to the approach above, I keep a class (in my case in CoffeeScript) that maps to this aggregated document of documents (one example being a malaria case). As I build up logic that does calculations on the document, I have it all in one nice place that can be reused across reports. This might be something simple that looks up the date the person was found positive for malaria (check the facility record; if it doesn't exist, use the creation time of the notification) or something more complex that looks across multiple pieces of data to calculate how long it took to follow up the case. When I grab the documents from CouchDB, I load them into this class and can easily do analysis in all sorts of contexts.
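
A stripped-down sketch of what such a class looks like; the document types, field names, and rules here are made up for illustration rather than taken from the real Coconut Surveillance logic.

class Case
  # documents: the docs pulled for one case, keyed by record type
  constructor: (@documents) ->

  # Date the person was found positive: prefer the facility record,
  # fall back to the notification's creation time.
  datePositive: ->
    @documents['Facility']?.datePositive or @documents['Notification']?.createdAt

  # Days from notification to household follow-up, if both records exist.
  daysToFollowup: ->
    notified = @documents['Notification']?.createdAt
    followup = @documents['Household']?.completedAt
    return null unless notified and followup
    msPerDay = 1000 * 60 * 60 * 24
    Math.round((new Date(followup) - new Date(notified)) / msPerDay)

# Usage: kase = new Case(docsKeyedByType); kase.daysToFollowup()
module.exports = Case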

One view per design document

The couchapp tool that I use forces you to have one design document and to put all of your views in that single design document. This is really terrible. Any change to any view forces the entire design document to be re-processed and blocks the entire application until that is done (unless you use stale requests, but then your data is not up to date). Thanks to this post by Nolan Lawson I realized that I should put each view in its own design document. I expect there may be some minor performance hits in CouchDB from doing this, but the overall result is so much better. Single views rebuild relatively quickly after changes and don't block other views. I ended up writing my own little tool that lets me keep all of my views (maps and reduces) in one directory, then uploads them and queries them (so the indexes get built and cached). Here's the script that creates one design doc for each coffee map file in the directory (and picks up reduces too):
pushViews.rb
And here's the one for executing views (done in node since async is so easy):
executeViews.coffee
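
If you don't want a whole tool for it, the core of the idea is small. Here's a sketch (not the actual pushViews.rb) that writes one design document per view using nano; the database, view names, and map functions are placeholders.

nano = require('nano')('http://localhost:5984')
db = nano.use 'coconut'

# One design document per view: rebuilding one view never blocks the others.
views =
  'cases-by-district':
    map: "function(doc) { if (doc.district) { emit(doc.district, null); } }"
  'results-by-date':
    map: "function(doc) { if (doc.createdAt) { emit(doc.createdAt, null); } }"

for name, view of views
  do (name, view) ->
    ddoc =
      _id: "_design/#{name}"
      views: {}
    ddoc.views[name] = view
    db.get ddoc._id, (err, existing) ->
      ddoc._rev = existing._rev unless err       # overwrite an existing design doc
      db.insert ddoc, (err) ->
        console.error err if err

Querying each of those views once afterwards (even with limit: 0) is enough to kick off the index builds, which is essentially the job executeViews.coffee handles.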

Don't be afraid to add another view

This is obvious, but at first I tried to create one view to rule them all, and I am often tempted to be clever about re-using views. Before the one-view-per-design-doc approach (above), adding a new view required the entire app to stop working while the design doc rebuilt itself. With that problem solved, it's easy to add and remove views. Sure, they use disk space, but who cares. Don't be afraid to make a really specific view with hardcoded values and magic numbers, like the one sketched below.
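
For example, something like this throwaway map (a made-up sketch, with the hardcoded district and year being exactly the kind of magic numbers I mean):

# Only households from one district in one specific year; that's fine.
(doc) ->
  if doc.type is 'Household' and doc.district is 'District 7' and doc.year is 2014
    emit doc.caseid, null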

Make use of arrays in view results

Despite years of working with CouchDB and a few forays into using arrays as the keys my views were emitting, I rarely used them to their potential. Once I understood that keys passed to a view can look like this:

startkey: ["People"]
endkey: ["People",{}]

Things became much more manageable and tidy (I was doing embarrassing things like string concatenation before). But much more amazing was the group_level option. This option still amazes me. I didn't know about it for years, simply because of how it is described in the documentation. Here's the relevant part from the table of options that can be passed to a view:
Option        Type      Default   Description
group_level   number    -         see below

"see below" - Are you kidding me? I don't have time to see below! So I never did, until I saw an example online that used group_level and was doing things I never thought possible. Now I do stuff like this:

emit [country, state, county, city, neighborhood], population

Then, when I want to know the population at any of those levels, I just use group_level (combined with a builtin reduce function like _sum for totals or _count for counts) and I can get the aggregated data for whatever level I need. I don't actually do this for population but for more complex data aggregation and disaggregation; the idea is the same, though. group_level is a game changer, but its explanation is hiding outside of the table that lazy people like me depend on.
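
Concretely, with the emit above and a builtin reduce on the view, the same index answers questions at every level of the key. A sketch (the database and view names are placeholders):

nano = require('nano')('http://localhost:5984')
db = nano.use 'places'

# Population per country: group on just the first element of the key.
db.view 'population', 'by-place', {reduce: true, group_level: 1}, (err, body) ->
  return console.error err if err
  console.log "#{row.key[0]}: #{row.value}" for row in body.rows

# Population per city: group on country/state/county/city.
db.view 'population', 'by-place', {reduce: true, group_level: 4}, (err, body) ->
  return console.error err if err
  console.log "#{row.key.join('/')}: #{row.value}" for row in body.rows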

If I need a reduce function, I am doing it wrong

I used to pound my head against the wall until it understood how reduce functions (and their mysterious options like rereduce) work. I would eventually get it, implement something that worked, then totally forget everything I had learned. So when my reduce function needed fixing I was back to banging my head against the wall. My brain just doesn't seem to be able to maintain the neural pathways needed for reduce. So I don't use them. I use the builtin ones and that's it.
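
The builtins (_sum, _count, _stats) cover almost everything I need anyway. Here's a sketch of what that looks like in a design document; the database, view, and field names are made up:

nano = require('nano')('http://localhost:5984')
db = nano.use 'coconut'

ddoc =
  _id: '_design/case-stats'
  views:
    'followup-days-by-district':
      map: "function(doc) { if (doc.daysToFollowup != null) { emit(doc.district, doc.daysToFollowup); } }"
      reduce: '_stats'    # builtin: sum, count, min, max, sumsqr per group key, and no rereduce to write

db.insert ddoc, (err) ->
  console.error err if err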