business|bytes|genes|molecules

At the interface of science and computing

Open Data Begets Cool

I am a huge fan of Common Crawl. For those who don’t know, Common Crawl is a non-profit whose goal is to build and maintain an open crawl of the web. Their hope is that with the availability of an open, high-quality crawl, cool things will happen, like Michael Nielsen’s how-to on crawling 250 million web pages quickly and inexpensively. The thing that makes Common Crawl work is not just quality raw data. They also provide JSON crawl metadata in an S3 bucket, and an Amazon Machine Image to help users get up and running quickly. The image includes a copy of the Common Crawl User Library, examples, and launch scripts that show users how to analyze the Common Crawl corpus using their own Hadoop cluster or Amazon Elastic MapReduce.

It is this complete picture, data plus tools plus easily available infrastructure to run them on, that makes a project like Common Crawl so compelling. When the infrastructure is in place, the friction to do something interesting drops far enough that plenty of smart people start using the data, and interesting things become inevitable. With people like Michael and Pete Warden publishing great getting-started posts, the barrier to entry for Common Crawl is essentially the cost of running a small cluster for a few hours.

I can think of a few life science data sets that would benefit from such an approach, e.g. data sets related to disease outbreaks, expression profiles, etc. Data that can be analyzed and mashed up with other sources with minimal friction. That would be awesome.

Data Flow

Russell Jurney has a great blog post on the Hortonworks blog entitled Pig as Hadoop Connector …. I’ve long been a fan of data flow-style approaches and Pig fits my mental model better than something like Hive. The post does a great job explaining how you can move data through Hadoop, into MongoDB, and eventually turn the data into a web service (in this case via Node.js). Such a workflow is a particularly nice fit for modern bioinformatics, especially in a high-scale next-gen world. MongoDB, with its document-based model and rich query syntax, is quite popular with the next-gen sequencing crowd, and I’ve started to see a lot more Hadoop, especially in commercial services that need to scale cost-effectively.

Biological data is a great fit for Mongo-style key-value stores. In practice, I wonder how many people are using such pipelines, where they may use something like Hadoop to aggregate a large number of “events”. In this case an event could be the output from a single experiment or pipeline. Essentially you could just stream the output from all your pipeline runs into one or more Hadoop clusters that would do your aggregation and sorting, and then feed that into MongoDB or similar K-V store. From there, publishing the data as a service is a relatively simple step, and you can even make it look pretty quickly with something like Bootstrap.
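
To make the shape of that pipeline concrete, here is a minimal sketch in plain Python. It is only an in-memory stand-in for the Hadoop aggregation step, not a real Hadoop or MongoDB job: the event fields (sample, gene, expression), the sample values, and the function names are all hypothetical, made up for illustration. The output is a list of documents keyed by "_id", which is the shape a document store would typically expect, but the actual insert into a database is deliberately left out.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical pipeline "events": one record per experiment or pipeline run.
# Field names and values are assumptions for illustration only.
events = [
    {"sample": "S1", "gene": "BRCA1", "expression": 2.0},
    {"sample": "S2", "gene": "BRCA1", "expression": 4.0},
    {"sample": "S1", "gene": "TP53", "expression": 1.5},
]


def map_event(event):
    """Map step: emit a (key, value) pair for a single event."""
    return event["gene"], event["expression"]


def reduce_values(key, values):
    """Reduce step: collapse all values for one key into a single document."""
    return {"_id": key, "n_runs": len(values), "mean_expression": mean(values)}


def aggregate(events):
    """Group mapped events by key, then reduce each group, sorted by key."""
    grouped = defaultdict(list)
    for event in events:
        key, value = map_event(event)
        grouped[key].append(value)
    return [reduce_values(key, grouped[key]) for key in sorted(grouped)]


# One document per gene, ready to be handed to whatever store you use.
documents = aggregate(events)
```

In a real setup the map and reduce functions would run on the cluster (or be expressed as a Pig script), and the resulting documents would be written to the store by its own driver. The point is only that the per-key documents coming out of the aggregation are already close to what a web service would serve.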

The key message here is that we have unprecedented access to the kinds of tools that allow us to work with data flexibly at various scales, and, even better still, make results available to a broader set of users and developers via web services. Sometimes it feels like there are too many tools to keep track of and learn, and to some extent that is true (pretty much the story of my life), but it’s a fun time to be a developer.

Platforms for Citizen Science

Nice article in the NY Times on citizen science (or here). It presents a balanced view of citizen science, a topic I care about deeply.

In the end, citizen science is many things. It is a way to stimulate public interest and to collect data that would be difficult to gather without engaging the community, but perhaps most importantly, citizen science allows the broader public to be engaged in science. Of the many people participating in data collection, perhaps a few will actually do some analysis, and even fewer will end up pursuing science as more than a hobby. That’s OK and that’s how it should be.

The key in my mind is to make sure we are developing and nurturing the frameworks that enable participation. The Zooniverse is a great example of making participation easy and fun. Foldit is another model that makes participation fun and rewarding. The current reach of the web makes such platforms very viable and very powerful. Do all models and efforts need to work? No, that is very difficult. Is it OK to leave the hard science to the “experts”? To an extent, that is a good model, but you never know who the experts really are, and assuming they sit in some laboratory is both limiting and naive. Holding back in areas where the broader community can do the work in chunks, just because we worry about quality, will only hold science back. The key once again is to make sure that the underlying platforms make participation easy, and also allow quality to be managed and filtered. In the biological sciences, we haven’t quite seen a project like the Zooniverse, at least not to my knowledge. Initial success has come from efforts that involve the broader scientific community and some hobbyists. Over time, hopefully we will achieve the scale that the web enables and reach a wider set of people, not just scientists. I am pretty sure folks like Andrew Su are thinking of how to do exactly this.

The GATK License

One of the catalysts for restarting the blog was the new GATK license. GATK is a great tool for the genomics community and has historically had an open (MIT) license. However, with GATK 2.0, the license is moving to a hybrid licensing model. Per the announcement:

The complete GATK 2.0 suite will be distributed as a binary only, without source code for the newest tools. We plan to release the source code for these tools, but its unclear the timeframe for this. The GATK engine and programming libraries will remain open-sourced under the MIT license, as they currently are for GATK 1.0. The current GATK 1.0 tool chain, now called GATK-lite, will remain open-source under the MIT license and distributed as a companion binary to the full GATK binary. GATK-lite includes the original base quality score recalibrator (BQSR), indel realigner, unified genotyper v1, and VQSR v2.

GATK 2.0 is being released under a software license that permits non-commercial research use only. Until the beta ends and the full GATK 2.0 suite is officially launched, commercial activities should use the unrestricted GATK-lite version. In the fall we intend to release the full version of GATK 2.0. The full version will be free-to-use version for non-commercial entities, just like the beta. A commercial license will be required for commercial entities. This commercial version will include commercial-grade support for installation, configuration, and documentation, as well as long-term support for each commercial release.

This is the wrong direction. Mixed licensing has been the bane of chemistry codes for years, and seeing it in the genomics world, especially for something that started with a more permissive license, is a step backward. Others have commented on the potential reasons (commercialization, concern about use by dodgy DTC genomics sites), but all of those reasons are quite weak.

So why is this a mistake? First, it shuts out those who may not be academics, but want to (a) do good science, and (b) contribute to good science. What if I were a smart developer, perhaps at a small company or working for myself? Suddenly, not only is the code no longer available to me without a license, but my ability to contribute to improving the code is severely diminished. Second, it betrays a lack of understanding of what open source means. Yes, there are plenty of open core models, but GATK is not a company or commercial service. If it plans to be one, the team should say so more clearly and spin off a company that does the work of developing products around an open source core. This is neither here nor there, and all it does is get in the way of doing good science and writing good software.

In the end, this sets a terrible precedent. The world of open source has lots of good models for monetizing software. If that is the goal, it would be best to follow those models, or focus on providing quality services, but the non-commercial-entity-only model is a huge backward step.

The Return

It’s been a while. When the original bbgm went down, I thought it would take a few days to bring it back online. Days became weeks, weeks became months. It’s been over a year since I last wrote a post, and strangely enough, for a while there it felt good not to think about writing. Life has been incredibly busy, especially once I moved into my current role. The little spare time available has been spent with family and indulging hobbies old and new.

For now, I have given up any illusions of trying to resurrect the original bbgm, but loss brings new opportunities, and this blog is that opportunity. As always I will write about things I care about, especially science, which is a smaller part of my life than it has been in years. As always, there will be limited writing about my day job, but there’s enough to write about in the world of science, technology, and product development.

The original bbgm ran on Wordpress. For a long time, I’ve wanted to switch to more static sites. deepaksingh.net uses Jekyll and dualnatureofmatter.net uses nanoc. This site uses Octopress, which is a blogging system on top of Jekyll, and is hosted on Amazon S3. Oh and here’s the new bbgm RSS feed.

So yes, this is a reboot of bbgm. Whether it has any legs remains to be seen.