There is an interesting discussion on Titus' blog on VMs and reproducibility, including some great comments. I’ve always considered VMs, especially those that can be deployed in the cloud, a convenience. In other words, they make it easy for people to try to reproduce your work because you give it to them in a turnkey way. However, I’ve never felt that VMs were the optimal solution for doing science. If you think about it, what do you need for good science?
- Access to the raw data and any other data sets associated with the science.
- A description of the methods used in the research. Ideally, you should be able to combine these methods with the data sets above and come up with the same results.
- The code used to implement the methods above.
- A list of dependencies and the execution environment (sketched right after this list).
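To make that list concrete, here is a minimal, hypothetical sketch of how those four pieces might be written down as a single machine-readable recipe. Every URL, path, and version pin is a placeholder of my own invention, not something from a real project.

```python
# Hypothetical recipe describing one piece of work: data, methods, code, environment.
# All URLs, paths, and version pins below are placeholders for illustration only.
experiment = {
    "data": [                                        # raw data and associated data sets
        "https://example.org/raw/reads.fastq.gz",
        "https://example.org/reference/genome.fa",
    ],
    "methods": "docs/methods.md",                    # written description of the methods
    "code": "https://example.org/lab/analysis.git",  # implementation of those methods
    "dependencies": [                                # pinned software dependencies
        "numpy==1.26.4",
        "scipy==1.13.0",
    ],
    "workdir": "runs/experiment-01",                 # root of the execution environment
}
```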
Is this a complete list? I am sure if I think about it again the list may evolve, but it seems about right to me. In the end you want to do three things: (1) see if you can replicate the work; (2) have enough information to reproduce it using your own code, in case you don’t like the original implementation; and (3) evolve the science using existing work as a starting point.
What enables all this? It’s open data, it’s open source, and it’s programmability. If you think of your infrastructure and your overall system programmatically, it’s a lot more elegant than a VM. It’s not easy, but if you can use recipes and configure a system on the fly, then you aren’t limited to a VM; you can dynamically generate the environment required, with the appropriate data sets and dependencies. I’ve always said that data is the royal garden while compute is a fungible commodity, and dynamic environments are super powerful tools that can enable really good science. Unfortunately, they also require a level of skill that many scientists don’t have.
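As a rough illustration of what "configure a system on the fly" could look like, here is a short sketch that consumes a recipe like the one above and builds a working environment: it fetches the data sets and installs the pinned dependencies. It is a toy under my own assumptions (the recipe keys are hypothetical); a real setup would lean on a configuration-management or provisioning tool rather than a hand-rolled script.

```python
import subprocess
import sys
import urllib.request
from pathlib import Path

def build_environment(recipe):
    """Dynamically generate a working environment from a recipe:
    fetch the data sets and install the pinned dependencies."""
    workdir = Path(recipe["workdir"])
    workdir.mkdir(parents=True, exist_ok=True)

    # Fetch the raw data and associated data sets named in the recipe.
    for url in recipe["data"]:
        target = workdir / Path(url).name
        if not target.exists():
            urllib.request.urlretrieve(url, target)

    # Install the pinned software dependencies into the current interpreter.
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", *recipe["dependencies"]]
    )

    return workdir

# Usage, with the hypothetical recipe above:
#   build_environment(experiment)
```

The point is not this particular script; it is that the whole environment is described in code and can be regenerated anywhere, instead of being baked into a VM image.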
These are topics that Matt Wood and I talk about a lot (see the two decks below for some ideas).
Yes, it’s a very cloud-centric view of the world, but there is a reason we work where we do.