business|bytes|genes|molecules

At the interface of science and computing

Research Code

Iddo Friedberg has an interesting post on making research software accountable. While I have not been in academia since I left grad school, I have stayed close to it through friends and my wife, and I am not sure I completely agree with the post.

He writes

This practices of code writing for day-to-day lab research are therefore completely unlike anything software engineers are taught.

and

Research coding is not done with the purpose of being robust, or reusable, or long-lived in development and versioning repositories.

Reading these lines, and thinking of other issues I’ve seen with academic code, brings a few things to mind. (1) Scientific programmers are either poor programmers or lazy programmers. That means a lot of scientific code is not robust or maintainable simply because its authors don’t know how to write robust code. (2) There seems to be an assumption that all code inside a software company is written with lots of time on hand and is user facing. In reality, a lot of code is written to generate metrics or analyze data, and is done in “can I get that answer in the next 4 hours” mode.

Perhaps more than anything, what it brought to mind was “technical debt”. For anyone who’s been around software, technical debt is a reality, and there is always a tension between speed and debt. The fact remains that debt catches up with you, and then you are faced with all kinds of issues. In the scientific world, I’ll call out specific examples of the impact of technical debt:

  • You hack something together to get some preliminary data. You are short on time, so you hard code some parameters, and along the way you forget that you did. Guess what: that could result in scientific errors down the line, because you have bad parameters or because you fat-fingered a mistake into some algorithm in your hurry.
  • Your code is lying around and gets picked up by someone else. They make assumptions, and they make the wrong ones.
  • You essentially have to reinvent the wheel often because you don’t have quality reusable code, which also means that your research is going to take even longer.
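To make the first bullet concrete, here is a minimal sketch of one way to keep parameters from getting buried in a quick script: put them in one explicit object and save them alongside the result. Everything in it is hypothetical and for illustration only, including the `AnalysisParams` name, the `min_quality` and `window` fields, their defaults, and the toy analysis itself.

```python
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class AnalysisParams:
    # Hypothetical parameters; the names and defaults are illustrative only.
    min_quality: float = 20.0  # drop scores below this threshold
    window: int = 3            # width of the moving average


def moving_average(values, window):
    """Return the simple moving average over each full window."""
    if window < 1:
        raise ValueError("window must be >= 1")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def run_analysis(scores, params):
    """Filter, smooth, and return the result together with the parameters used."""
    kept = [s for s in scores if s >= params.min_quality]
    return {
        "params": asdict(params),  # recorded so nobody has to remember them
        "n_kept": len(kept),
        "smoothed": moving_average(kept, params.window),
    }


if __name__ == "__main__":
    result = run_analysis([12.0, 25.0, 31.0, 18.0, 40.0, 22.0], AnalysisParams())
    print(json.dumps(result, indent=2))
```

It takes only a few more minutes than the hard-coded version, and whoever picks the script up later can see exactly which values produced a given output instead of guessing.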

The fact is that every field has slice and dice code. The better the quality of your programming the better the slicing and dicing. The better your documentation, the more it goes from being something one person knows, to being part of the toolchest of an entire group. I wonder if people would take such shortcuts with their lab protocols?

In the end, no amount of enforcement or procedure is going to help. There will always be a need to hack something up, and often, but scientists still need to become better programmers and realize that code has an impact on the quality of the science. A few things that I do think will help. Make programming more of a first-class citizen. Right now, it’s still thought of as this other thing. The successful groups have proper software engineers doing the hard stuff, but the majority of scientists can barely script, never mind think through smart ways of building pipelines or even hacking. The concept of “publishing” scientific code also needs to change. It should be less about publishing papers and more about publishing code. Just throw it up on GitHub, even if no one else is ever going to use it. If you are using a version control system, and there is no excuse not to use one, then pushing that out to GitHub or similar is trivial.

Let’s just stop using the “we don’t have time” excuse. I don’t know too many graduate students who have more time pressure than an engineer or data scientist at a startup, where every minute counts and costs and people are wearing 10 hats.
