I’ve now heard at least a couple of people managing large life science repositories talk about resiliency and durability and mention that they have durability because they use RAID. That’s a cringeworthy thing to hear. I would hope that people managing core repositories know better, and to the best of my knowledge they do, but it is troubling. I was reminded of this apparent lack of understanding in managing data by a tweet from Adam Kraut earlier, where he linked to a paper on the challenges of maintaining file integrity. In general, I recommend anyone in the world of informatics building large scale storage (or even small scale storage) check out James Hamilton’s blog post covering a talk by Jeff Dean on building large distributed systems (pdf). The key point is that failure happens. Between 1-5% of your disks are going to fail over the course of a year, along with 2-4% of your storage servers. There are any number of reasons these failures can happen, and they all occur at different rates. Google has published an analysis of disk failure rates (pdf), with further discussion on Storage Mojo.
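To make those failure rates a bit more concrete, here is a back-of-the-envelope sketch (my own illustration, not from any of the papers above) of how quickly the odds of getting through a year with zero disk failures shrink as an array grows. The 1% and 5% annual rates come from the figures above; the array sizes and the independence assumption are mine.

```python
def p_at_least_one_failure(n_disks: int, annual_failure_rate: float) -> float:
    """Probability that at least one of n_disks fails within a year,
    assuming independent failures at the given annual rate."""
    return 1.0 - (1.0 - annual_failure_rate) ** n_disks

# Rough numbers for a few hypothetical fleet sizes.
for afr in (0.01, 0.05):
    for n in (12, 48, 500):
        print(f"AFR {afr:.0%}, {n:4d} disks: "
              f"P(>=1 failure/yr) = {p_at_least_one_failure(n, afr):.1%}")
```

Even at the optimistic 1% rate, a 500-disk system is nearly certain to see at least one failure in a year, which is exactly why you plan for failure rather than hope to avoid it.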
Where am I going with this? As the size of our storage systems in informatics increases and we keep data around for longer, we need to take a deeper look at how we are managing our data, and not make naive assumptions. Think about the tradeoffs you need to make between performance, availability and durability (and think through what durability actually means). There are simple and creative ways of getting there (e.g. keeping a copy of a disk array in a friend’s lab in a different building), and a number of solutions (including some from my day job), but let’s not assume that RAID = durability. In the end, managing your data is less about the hardware and more about the operational processes and software sitting on top of the hardware.
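As one small example of what "software sitting on top of the hardware" can look like, here is a minimal sketch of a checksum manifest: record a digest for every file, then re-verify later to catch silent corruption that RAID alone will never surface. The file layout and manifest format are my own illustration, not any particular tool’s.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large files don't load into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(data_dir: Path, manifest: Path) -> None:
    """Record 'digest  relative/path' for every file under data_dir."""
    with manifest.open("w") as out:
        for p in sorted(data_dir.rglob("*")):
            if p.is_file():
                out.write(f"{sha256_of(p)}  {p.relative_to(data_dir)}\n")

def verify_manifest(data_dir: Path, manifest: Path) -> list[str]:
    """Return relative paths that are missing or no longer match their digest."""
    bad = []
    for line in manifest.read_text().splitlines():
        digest, rel = line.split("  ", 1)
        target = data_dir / rel
        if not target.is_file() or sha256_of(target) != digest:
            bad.append(rel)
    return bad
```

Running the verify step on a schedule, against copies in more than one place, is the kind of operational process that actually buys you durability.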