business|bytes|genes|molecules

At the interface of science and computing

Science and Software

I’ve been busy, so I am rather late jumping into the fray. My favorite source of post ideas, the Chuck Norris of Bioinformatics, has written a series of super interesting, and rather provocative, posts on this subject.

At the center of a lot of the debate is software quality, software maintainability, and in the end, the ability of scientists to write maintainable software. There is also an intense debate over whether throwaway software is a good thing or not. I come at this debate from an interesting place. I wrote (bad) code in grad school and most of it was throwaway code. I wrote (somewhat better) code as a scientific programmer at my first job. The quality of the code may not have been that much better, but others did look at it. The code was version controlled and it wasn’t meant just for me to use, so the bar was higher. Since then I haven’t written much code professionally, but I’ve worked with and led a number of engineering teams. So I think I’ve learnt a thing or two about what good software is all about.

I tend to put scientific software in three buckets. There is software you write for maintenance (clean up, munging, etc). If you are a software engineer/developer, you are very likely to throw that up on Github today, since the likelihood that others may find it useful is > 0. But if you’re not, I don’t see why you should. Chances are that this is a one-off anyway, and the quality bar is a little lower (famous last words).

Then there’s the software you use to do your science: the algorithms, the implementations of well known algorithms, and the analysis pipelines themselves. This is the area where I feel that without open source software, without software peer review, there is so much room for error that keeping the code closed is just plain bad science. That is especially true today, when resources like Github let you open up your code easily and make software peer review relatively easy. I remember the day when one of our scientific collaborators sent us some code. It declared a variable, then two lines later declared the same variable again with the opposite sign. Turns out the first one was left over from an experiment and never removed. There are many similar examples (FORTRAN allows for some really sloppy code), and it is that risk that makes me worry about how much published science is the result of badly written software. You may ask, what about commercial code? Good question. People who use commercial software tend to assume either that (a) the software is a skin on top of well known academic software, or (b) the company is applying a level of rigor that makes its offerings worth your trust.

The third category is what I call infrastructure; databases, repositories, pipeline frameworks, visualization software. That one is probably (hopefully) written by the best programmers in the field. My preference is that these be open source, since platforms can be extended, and too many people build their own for no good reason. The quality of software engineers in the sciences is only going up, and we have some robust platforms out there now. I know teams that do routine code reviews, hire very good software developers, and function like a proper engineering team. If you want to write sustainable software that’s used by lots of other people, i.e. your goal isn’t just to publish, you have no other choice.

I have one more anecdote. My wife wrote a piece of software called Lorikeet (and yes, it needs to move to Github). At the time she wrote it, she never expected anyone else outside her lab to use it. But it turns out that because it was out there, open source, and usable, a lot of other people started using it. It’s been forked, enhanced, etc etc. That only happens when your code is open source.

So, what does this add to the debate? I suppose what I am saying is that “it depends”. In general, open source is awesome, and having people publish their code is a good thing, since others can use it, improve on it, etc. I do think there is room for throwaway code, but only for a very narrow use case. When it comes to algorithms, I really struggle to see how you can get away without peer review, especially when your science depends on it.

OK, enough rambling for now.
