At the interface of science and computing

Data Flow

Russell Jurney has a great post on the Hortonworks blog entitled Pig as Hadoop Connector …. I’ve long been a fan of data flow-style approaches, and Pig fits my mental model better than something like Hive. The post does a great job explaining how you can move data through Hadoop, into MongoDB, and eventually turn the data into a web service (in this case via Node.js). Such a workflow is a particularly nice fit for modern bioinformatics, especially in a high-scale next-gen sequencing world. MongoDB, with its document-based model and rich query syntax, is quite popular with the next-gen sequencing crowd, and I’ve started to see a lot more Hadoop, especially in commercial services that need to scale cost-effectively.

Biological data is a great fit for Mongo-style document stores. In practice, I wonder how many people are using such pipelines, where they use something like Hadoop to aggregate a large number of “events”. In this case an event could be the output from a single experiment or pipeline run. Essentially, you could stream the output from all your pipeline runs into one or more Hadoop clusters that handle the aggregation and sorting, then feed the results into MongoDB or a similar document store. From there, publishing the data as a service is a relatively simple step, and you can even make it look pretty quickly with something like Bootstrap.
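To make the aggregation step concrete, here’s a minimal Python sketch of the reduce-by-key idea. The event fields (`sample`, `reads`, `status`) and the data itself are made up for illustration; in a real pipeline Hadoop (or Pig) would do this grouping at scale, and the resulting documents would be loaded into MongoDB with something like pymongo’s `insert_many`.

```python
from collections import defaultdict

# Hypothetical "events": one record per pipeline run.
# In practice these would stream out of your pipelines into Hadoop;
# here we aggregate in-process just to show the shape of the reduce step.
events = [
    {"sample": "S1", "reads": 1200, "status": "pass"},
    {"sample": "S1", "reads": 800,  "status": "pass"},
    {"sample": "S2", "reads": 500,  "status": "fail"},
]

def aggregate(events):
    """Group events by sample key and total them up (the 'reduce' step)."""
    totals = defaultdict(lambda: {"reads": 0, "runs": 0})
    for event in events:
        bucket = totals[event["sample"]]
        bucket["reads"] += event["reads"]
        bucket["runs"] += 1
    # Documents in this shape drop straight into a document store,
    # e.g. collection.insert_many(docs) with pymongo.
    return [{"_id": sample, **counts} for sample, counts in sorted(totals.items())]

docs = aggregate(events)
```

Once the per-sample documents are in MongoDB, the web-service layer (Node.js in Russell’s post) is mostly a thin query-and-serialize wrapper over the collection.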

The key message here is that we have unprecedented access to tools that let us work with data flexibly at various scales and, better still, make results available to a broader set of users and developers via web services. Sometimes it feels like there are too many tools to keep track of and learn, and to some extent that’s true (pretty much the story of my life), but it’s a fun time to be a developer.