I am a huge fan of Common Crawl. For those who don’t know, Common Crawl is a non-profit whose goal is to build and maintain an open crawl of the web. Their hope is that with an open, high-quality crawl available, cool things will happen, such as Michael Nielsen’s how-to on crawling 250 million web pages quickly and inexpensively. What makes Common Crawl work is not just quality raw data. They also provide JSON crawl metadata in an S3 bucket and an Amazon Machine Image to help users get up and running quickly. The image includes a copy of the Common Crawl User Library, examples, and launch scripts that show users how to analyze the Common Crawl corpus using their own Hadoop cluster or Amazon Elastic MapReduce.
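To make the "data in an S3 bucket" part concrete, here is a minimal sketch of how one might turn relative file keys (the kind of listing the crawl publishes in its public `s3://commoncrawl/` bucket) into full S3 URIs that a Hadoop or Elastic MapReduce job could consume. The sample paths below are hypothetical placeholders, not real keys from the bucket.

```python
# Sketch: expand relative crawl-file keys into full S3 URIs for the
# public Common Crawl bucket. The sample keys are made up for illustration.

BUCKET = "s3://commoncrawl/"

def to_s3_uris(relative_keys):
    """Prefix each non-blank relative key with the public bucket name."""
    return [BUCKET + key.strip() for key in relative_keys if key.strip()]

# Hypothetical listing, e.g. lines read from a paths file in the bucket:
sample = [
    "crawl-data/example/segment-0/file-0.warc.gz",
    "crawl-data/example/segment-0/file-1.warc.gz",
]

for uri in to_s3_uris(sample):
    print(uri)
```

A list of URIs like this is exactly what you would hand to a Hadoop input path or an EMR step, which is why having the data in a public bucket lowers the friction so much.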
It is this complete picture, data plus tools plus readily available infrastructure to run them on, that makes a project like Common Crawl so compelling. With the infrastructure in place, the friction of doing something interesting drops far enough that plenty of smart people end up using the data, and interesting things become inevitable. With people like Michael and Pete Warden publishing great getting-started posts, the barriers to entry for Common Crawl amount to little more than the cost of running a small cluster for a few hours.
I can think of a few life science data sets that would benefit from such an approach, e.g. data sets related to disease outbreaks, expression profiles, etc. Data that could be analyzed and mashed up with other sources with minimal friction. That would be awesome.