Things I do when analyzing large datasets in CouchDB

Views should return gzip'd results


CouchDB is a good HTTP server. I usually don't need to proxy requests through Apache or Nginx. What it doesn't do is return views in a compressed format. Views are great candidates for compression because the returned rows often contain a lot of redundant data. I used to proxy through Apache in order to get gzip compression for all JSON data. That worked ok, but large replications would often fail for no reason that I could ever understand. When replicating directly to couch, there was no problem. My current solution is to use this very simple, easy to understand node proxy that adds gzip: http://broken-by.me/tag/accept-encoding-gzip/
I then use forever in a reboot cron job to make sure it stays up:
@reboot /usr/local/bin/forever start -c /usr/bin/node /var/www/gooseberry/scripts/proxy.js
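
For readers who can't reach the linked post, here is a minimal sketch of what such a proxy might look like. It is not the script from that post; the ports, the `COUCH` target and the function names are assumptions made for illustration, and it ignores details such as `q=0` weights in Accept-Encoding.

```javascript
const http = require("http");
const zlib = require("zlib");

// Assumed addresses: CouchDB on its default port 5984 on the same machine.
const COUCH = { host: "127.0.0.1", port: 5984 };

// True when the client's Accept-Encoding header mentions gzip.
function acceptsGzip(headers) {
  return /\bgzip\b/.test(headers["accept-encoding"] || "");
}

// Forward every request to CouchDB and gzip the response on the way back
// when the client asked for it and the response is not already encoded.
function createGzipProxy(target = COUCH) {
  return http.createServer((clientReq, clientRes) => {
    const upstream = http.request(
      { ...target, method: clientReq.method, path: clientReq.url, headers: clientReq.headers },
      (couchRes) => {
        const headers = { ...couchRes.headers };
        if (acceptsGzip(clientReq.headers) && !headers["content-encoding"]) {
          delete headers["content-length"]; // the length changes after compression
          headers["content-encoding"] = "gzip";
          clientRes.writeHead(couchRes.statusCode, headers);
          couchRes.pipe(zlib.createGzip()).pipe(clientRes);
        } else {
          clientRes.writeHead(couchRes.statusCode, headers);
          couchRes.pipe(clientRes);
        }
      }
    );
    // Report a bad gateway if CouchDB cannot be reached.
    upstream.on("error", () => {
      clientRes.writeHead(502);
      clientRes.end();
    });
    clientReq.pipe(upstream);
  });
}

// To run it (5985 is an arbitrary port chosen for this sketch):
// createGzipProxy().listen(5985);
```

Clients that don't send Accept-Encoding: gzip get the response passed through untouched.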


When a view isn't possible, just make documents

For me, data is captured at different times and stored in separate documents that correspond with the type of data that is captured. But analyzing data often requires me to group related documents together and analyze them as a unit. For instance, Coconut Surveillance tracks malaria cases. A single case will consist of data captured at different times by different users in different places. One record will be for the data captured at the facility, other data will be captured at a household. It could even be captured on different tablets, so it isn't all available until the results are replicated to the cloud server. Views can emit one or more rows per document, but they cannot emit a row that includes data from multiple documents. In other words, I can't make a view where each emitted row represents the full data for one of my malaria cases. What ends up happening is I have to do one query to figure out all of the documents relevant to my case. Then on the client side I have to group all of the results together into one object that I can analyze. When I need to analyze a lot of cases, this approach becomes very slow. Even creating a spreadsheet of case data is very slow.

After lots of different strategies, my current approach is to manually create a document for each case that contains all of the data pulled from the docs relevant to that case. I have a script that watches the _changes feed. If it detects a change relevant to a case, it opens the case document and updates it. Now, if I want to analyze case data I can simply load the relevant case documents (one document per case) or use views that work on the case documents (one row per case) and things are fast. It feels dirty to manually keep and update redundant data in the database, but that is basically what a view is, just a manually managed one.
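
The merging step of such a script could be sketched roughly like this. The field names (`caseId`, `type`, `parts`) and the `case_` id prefix are hypothetical and not taken from Coconut Surveillance; the input is assumed to be the JSON returned by _changes with include_docs=true.

```javascript
// Hypothetical field names: each source doc carries a caseId and a type
// such as "facility" or "household".
function mergeIntoCase(caseDoc, sourceDoc) {
  const base = caseDoc || { _id: `case_${sourceDoc.caseId}`, type: "case", parts: {} };
  // Keep the latest version of each source record under its type.
  return { ...base, parts: { ...base.parts, [sourceDoc.type]: sourceDoc } };
}

// Apply one batch of _changes results to an in-memory map of case
// documents keyed by _id, and return the sequence to resume from.
function applyChanges(cases, changes) {
  for (const change of changes.results) {
    const doc = change.doc;
    // Skip deletions, docs without a case, and the case documents themselves
    // (otherwise every update would trigger another update).
    if (!doc || change.deleted || !doc.caseId || doc.type === "case") continue;
    const id = `case_${doc.caseId}`;
    cases[id] = mergeIntoCase(cases[id], doc);
  }
  return changes.last_seq;
}
```

A real script would write the merged case documents back to CouchDB and remember the returned sequence so it can pass it as `since` on the next request.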

Keep a class

Related to the approach above, I keep a class (in my case in CoffeeScript) that maps to this aggregated document of documents (one example being a malaria case). As I build up logic that does calculations on the document, I have it all in one nice place that can be reused across reports. This might be a simple thing that looks up the date the person was found positive for malaria (check the facility record; if it doesn't exist, use the creation time of the notification) or something more complex that looks across multiple pieces of data to calculate how long it took to follow up on the case. When I grab the documents from couch, I load them as this class and can easily do analysis in all sorts of contexts.
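
As an illustration only, such a class might look like the sketch below. It is written in JavaScript rather than CoffeeScript, and every field name (`parts`, `dateOfPositiveResults`, `createdAt`, `visitedAt`) is an assumption about the document shape, not the real schema.

```javascript
// Hypothetical shape: the aggregated case document keeps each source record
// under `parts`, and the records carry ISO date strings.
class Case {
  constructor(caseDoc) {
    this.parts = caseDoc.parts || {};
  }

  // Facility record date if present, otherwise the notification's creation time.
  dateOfPositiveResult() {
    const facility = this.parts.facility;
    if (facility && facility.dateOfPositiveResults) {
      return new Date(facility.dateOfPositiveResults);
    }
    const notification = this.parts.notification;
    return notification ? new Date(notification.createdAt) : null;
  }

  // Whole days between the positive result and the household visit.
  daysToFollowUp() {
    const start = this.dateOfPositiveResult();
    const household = this.parts.household;
    if (!start || !household) return null;
    const milliseconds = new Date(household.visitedAt) - start;
    return Math.floor(milliseconds / (24 * 60 * 60 * 1000));
  }
}
```

Reports can then wrap each case document they load (`new Case(doc)`) and call the same methods, so the fallback rules live in one place.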

One view per design document

The couchapp tool that I use forces you to have one design document and put all of your views in it. This is really terrible. Any change to any view forces the entire design document to be re-processed and blocks the entire application until that is done (unless you use stale requests, but then your data is not up to date). Thanks to this post by Nolan Lawson I realized that I should put each view in its own design document. I expect there may be some minor performance hits in CouchDB doing this, but the overall result is so much better. Single views build relatively quickly after changes, and don't block other views. I ended up writing my own little tool that lets me keep all of my views (maps and reduces) in one directory, uploads them, and then queries them so that the indexes are built and cached. Here's the script that creates one design doc for each coffee file in the directory (and looks for reduces too):
pushViews.rb
And here's the one for executing views (done in node since async is so easy):
executeViews.coffee
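
The two scripts themselves are not reproduced here, so the following is only a sketch of the idea, not of pushViews.rb or executeViews.coffee. The function names are made up for illustration; the design document layout (`_id`, `language`, `views`) is the standard CouchDB one.

```javascript
// Turn { viewName: { map, reduce? } } into one design document per view.
// Pass "coffeescript" as the language if the map sources are CoffeeScript.
function designDocsFor(views, language = "javascript") {
  return Object.entries(views).map(([name, { map, reduce }]) => {
    const view = reduce ? { map, reduce } : { map };
    return { _id: `_design/${name}`, language, views: { [name]: view } };
  });
}

// Path to request after uploading so CouchDB builds the index ahead of time;
// limit=0 asks for no rows back.
function warmUpPath(database, name) {
  return `/${database}/_design/${name}/_view/${name}?limit=0`;
}
```

Each returned document would then be PUT to the database, followed by a GET on its warm-up path.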

Don't be afraid to add another view

This is obvious, but at first I tried to create one view that ruled them all. I was often trying to be clever about re-using views. Before I had the one view per design doc approach (above), adding a new view required the entire app to stop working while the design doc rebuilt itself. With that problem solved, it's easy to add and remove views. Sure, they use disk space, but who cares? Don't be afraid to make a really specific view with hardcoded values and magic numbers.

Make use of arrays in view results

Despite years of working with couchdb and a few forays into using arrays as the keys my views were emitting, I rarely used them to their potential. Once I understood that keys passed to a view can look like this:

startkey: ["People"]
endkey: ["People",{}]

Things became much more manageable and tidy (I was doing embarrassing things like string concatenation before). But much more amazing was using the group_level option. This option still amazes me. I didn't know about it for years, simply because of how it is described in the documentation. Here's the relevant row from the table of options that can be passed to a view:

group_level | number | - | see below

"see below" - Are you kidding me? I don't have time to see below! So I never did, until I saw an example online that used group_level and was doing things I never thought possible. Now I do stuff like this:

emit [country, state, county, city, neighborhood], population

Then, when I want to know the population at any of those levels, I just use group_level (combined with the _sum builtin reduce function, which adds up the emitted values) and I can get the aggregated data for any level I need. I actually don't do this for population, but for more complex data aggregation and disaggregation; the idea is the same. group_level is a game changer, but it's hiding outside of the table that lazy people like me depend on.
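
To make the behaviour concrete, here is a small local imitation of what a summing reduce such as _sum does at a given group_level, plus a helper that JSON-encodes array keys for the query string. Both function names are invented for this sketch, and the real grouping happens inside CouchDB, not in client code.

```javascript
// Build a view query string; CouchDB expects key parameters as JSON.
function viewQuery(params) {
  return Object.entries(params)
    .map(([name, value]) => `${name}=${encodeURIComponent(JSON.stringify(value))}`)
    .join("&");
}

// Local imitation of _sum with group_level: truncate each array key to
// `level` elements and add up the values that share the truncated key.
function sumByGroupLevel(rows, level) {
  const totals = new Map();
  for (const { key, value } of rows) {
    const groupKey = JSON.stringify(key.slice(0, level));
    totals.set(groupKey, (totals.get(groupKey) || 0) + value);
  }
  return [...totals].map(([groupKey, value]) => ({ key: JSON.parse(groupKey), value }));
}
```

With rows keyed as [country, state, ...], group_level=1 gives one total per country and group_level=2 one per country and state. The same view can also be limited to a key range, for example with `viewQuery({ startkey: ["People"], endkey: ["People", {}], group_level: 2 })`.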

If I need a reduce function, I am doing it wrong

I used to pound my head against the wall until it understood how reduce functions (and their mysterious options like rereduce) work. I would eventually get it, implement something that worked, then totally forget everything I had learned. So when my reduce function needed fixing I was back to banging my head against the wall. My brain just doesn't seem to be able to maintain the neural pathways needed for reduce. So I don't use them. I use the builtin ones and that's it.

One gun fewer



I was upset too. I hugged my wife. Kissed my baby girl. Mourned for the children and teachers. I was sad about the lack of mental health resources and angry about the guns.

The more I thought about guns the more upset I became. I lay awake for hours on Saturday night in sadness and anger. I found myself contemplating my own gun experiences - mostly just shooting guns as a kid. I decided that I want fewer guns in my family's future. Then I thought about one person that I know who has a gun, someone I trust and who trusts me. I decided I would call him up and ask him to consider destroying his gun. This idea didn't help me to sleep.

It wasn't easy. My voice was shaky. I did not manage to coolly and clearly explain the data that describes why guns in homes are dangerous. I did not win a debate, but the conversation didn't devolve into abstract politics. I wish that I had been more sensitive ("having a gun to defend yourself is a stupid idea"), but conversations in the real world are like that. I'm not sure if I will succeed, but I'm trying to care for people I love and make the world a better place one relationship, and one gun fewer, at a time.

Read ebooks hands free


Every year for Christmas I try to make Claudia a Christmas present. Sometimes it's a big success (an old laptop I turned into a digital photoframe that shows local bus positions, bike share station status and more recently streaming video from our baby monitor) and sometimes less so (a fish scale modified to set off a buzzer when pulled too far that was supposed to teach our dog not to pull on the leash). This past Christmas, my wife was pregnant so I decided to build something that would help us get through the late nights of breastfeeding.

Claudia loves to read, and we now buy a lot of digital books, which Claudia reads on an old Kindle and I on my phone. I decided that I wanted to come up with a way that would enable Claudia to read hands free while breastfeeding our baby. Besides Kindles and phones, Amazon also lets you read books on their kindle cloud reader. You just go to http://read.amazon.com, sign in with your kindle account, and all of your books are there and ready to be read.

When you put the browser in full screen mode, it makes good use of the screen. You can adjust the text size and spacing so that it is quite readable, even at a distance. Of course you can click an icon to change the page, but you can also use the right and left arrow keys to flip pages. With all of this in hand, the rest of my solution was pretty obvious.

I really only had to buy two things:

A music stand: Musician's Gear Heavy-Duty Folding Music Stand Black

A set of foot pedals: Scythe Usb-2fs-2 USB 2 Foot Switch Version 2

The music stand is adjustable to hold the laptop at eye level, is sturdy and easy to move around. The foot pedals basically work like a keyboard with two keys. I used the included software to configure them so that the right pedal would send the right arrow key, and the left pedal would send the left arrow key.

Needless to say, I was pretty psyched when I plugged it all in, went to read.amazon.com and everything worked perfectly.

Now that Annika, our little baby girl, has arrived, my wife is using the system around the clock and it seems to be working flawlessly. With the background color set to black, it barely even lights up the room, which is helpful when you are trying to keep the baby in sleep mode.

Ways to improve it

It turns out that having a computer around really helps pass the time for breastfeeding mommies. Besides reading books, she has been watching a lot of movies and TV shows. It would be great if you could program the pedals to send "space" to pause a TV show. If the pedals could switch modes and send the right keys depending on which application had focus, that would be awesome. Or simply allowing long presses or combo taps (left, right, left, right) to send other keys or key combinations would allow you to also launch new applications. Then she could scroll on websites or browse iPhoto. Maybe I can use voice recognition to do this...

Twitter Strategy for Humans


Another from the Recent Emails I have Sent Department:


The most effective Twitter strategy is to use Twitter personally (as yourself, not as your organization) and engage in (and start new) online discussions about things that you feel strongly about. This includes education strategies, new products and, yes, sometimes even what you had for breakfast. The reason is that Twitter is about online community and conversation, sort of like Facebook, but with people (not products or organizations) that you often have never met personally. No one wants to talk to a press release or a corporate department; they want real people (who eat breakfast). It is useful to have a corporate Twitter identity, but mostly it’s just a mechanism for real people to share press releases – the real value add happens in public discussion that everyone can see. Often those online discussions turn into post-conference meetings or drinks when people pass through town, and that is usually when the most important opportunities and discussions happen. One more thing – using Twitter during a conference is a great way to establish thought leadership, get followers, and participate in a discussion that is often much more interesting than what is going on at the front of the room.

[An entirely new strategy will probably be necessary once @horse_ebooks begins reproducing.]

Ode to Coffeescript

(this started as email to a friend but I thought it might be useful to share)

I’ve been using coffeescript for about a year. Other than using it for my own projects, I have learned its syntax from the coffeescript website and from the random coffeescript snippets that I see here and there.

I started using it to try and write expressive code that reminded me of Ruby. I like my code readable, with very descriptive (and sometimes long) variable names (never abbreviated) and few comments. If I can’t understand the code by reading it, then I probably need to split up my one-liner into a few lines or make a new function or two.

Being able to abandon a lot of the extra braces and parens for indentation helped for readability (I actually agree with Python over Ruby on this one), especially relative to the javascript that I was writing before.

Initially I used all of the looping shortcuts that coffeescript comes with, but now I tend to use underscore.js whenever I am looping/mapping/etc. I am not sure if this is what @jashkenas had in mind (did this guy really write coffeescript, underscore and backbone??) but I think coffeescript + underscore results in a really nice compromise.

I have only recently really understood that everything is an expression in coffeescript. Using it really helps me to modularize my code.

Here’s an example I have just written that sort of sums up what I like about coffeescript:



Line 1. Takes advantage of "everything is an expression". Whatever the block indented below line 1 returns, will be set to formElement. This is so much better than initializing an empty string and then setting it.

Line 2. Note that I am using underscore to check if the value I am interested in is in an array. If so, then line 3 just returns the screen, which bubbles back to line 1. Same thing with lines 4-5.

Line 7. See how we can do ruby style string interpolation. That is huge - so huge. Javascript doesn't let you do sane multi-line strings, nor can you interpolate. But wait - check out the crazy interpolation of line 8 - I start a multi-line map (underscore again!). Coffeescript basically enables quick and dirty templating inside any string. It's a bit dangerous to mix too much logic and templating, but for small things it is awesome. (I use handlebars.js for the big jobs)

Anyways, it's not the greatest code in the world, but it's real code I wrote yesterday and it's helping me get the job done.

I am one of those people that think you should learn a new programming language every year or so – and indeed coffeescript has made me a better programmer. So if you are learning it, I recommend that you stick with it. You’ll get it and be better for it.

Fall blooms and dies over a few weeks


Skateistan: To Live and Skate Kabul