At the interface of science and computing

About Software and Bioinformatics

It seems I end up writing something about software and bioinformatics every so often. In fact, I pretty much wrote about this in my last post. This time it is Lior Pachter’s post on The Myths of Bioinformatics Software. I commented on it on Twitter, which led to quite the discussion and a follow-up post by Titus Brown.

Before I write more, I’ll try to set up my vantage point. I wrote shoddy computational chemistry software in graduate school (what, no version control?!). I wrote somewhat less shoddy software in my first job at a bioinformatics company (a very small startup), especially since we actually had real software developers. I was then exposed to a lot of in-licensing as a product manager at Accelrys and to industrial-strength commercial software at Rosetta Biosoftware, and I have spent the last seven years at Amazon, where I currently own a growing service (product and engineering). Suffice it to say I have strong opinions. The obvious caveat is that I haven’t been in bioinformatics for a long time, but thanks to marriage and friends I continue to have a pretty close-up view of things.

I started reading Titus’ post and was surprised by where it started, then I got to the point where he talked about being angry and frustrated. I am not sure I am angry, but I strongly believe that the biggest problem with bioinformatics is that we expect shoddy software in a field where quality should be critical, and the more I think about it the more obvious it is to me that this is about culture and incentives. So let me address the points one by one.

  1. Somebody will build on your code - This may be true in general, but there are also several counterpoints. I gave an example in my last post with my wife’s spectrum viewer. But I also think I understand why this is not the norm. My wife is a software developer and didn’t care about citations. What she did care about was making sure others knew where to send pull requests, and whether there were things she could improve. In other words, the code being used was way more important than being cited in a paper. Unfortunately, that is not how people are measured, and for many, writing new code is a better option than building upon the work of others. It’s also a reason the licensing system is so screwed up: people aren’t fundamentally thinking about the ecosystem, and licensing choices are often governed by how you think about your own software. If you are writing software to publish, then your incentives are all wrong. I sound like a broken record, but software systems are not about getting papers published, they are about getting people to do work more effectively (new algorithms are different and should be published).
  2. You should have assembled a team to build your software - This boils down to why the software exists. There are many examples of open source projects that began as one-person operations and stayed so for years, but the beauty of open source is that you have a broad community that contributes. But I digress. I don’t think good software needs large teams, but if you approach your software with the mentality that it’s meant to be a platform used by a set of people, then you do need to think about a team, potentially hiring real software developers. Otherwise, your code will atrophy, no one will want to support it, and the vicious cycle starts again. And a team doesn’t have to be too big either, especially if it’s a core team supporting an open source project. I don’t think software these days (especially in the world I live in) is at all about large teams. In fact it’s about being nimble and doing more with less by exposing the right interfaces.
  3. See #4
  4. Making your software free for commercial use shows you are not against companies. - Software that is free for academics and not for commercial use is my biggest bugbear with scientific software. That’s not how software should be. You choose a license (closed, GPL, Apache) based on how you think about your software. But this mixed mode thing is sheer arrogance, and prevents communities from being built around software. It makes sure that smart people who may want to contribute to improving it can’t, and it also makes terrible assumptions about business models. You are locking out the bootstrapped startup and the lone computational geek in a larger organization who may have limited funding, and in general breaking the rules of software licensing because for some reason academia seems to think it is somehow supposed to hold the rights to science. And remember, Apache licensed software can make money. It boils down to your incentives again. Maybe bioinformatics can learn a little from the technology community and its approach to licensing and commercialization. Maybe I am overreacting, but I felt like that when I was still a grad student, and feel so even more strongly now.
  5. You should maintain your software indefinitely. - I am not sure I understand this one. If you write software that you feel is solid and being used by others, you should think about sustainability, but no one expects software to be maintained indefinitely.
  6. Your “stable URL” can exist forever. - Probably not, but you could at least try. Most don’t even bother because that URL existed just to write a paper and get funding. There was never any intention of building something that could become a platform people used. Yep, I am cynical about why scientific software gets written.
  7. You should make your software “idiot proof”. - Well, nothing to say here. You shouldn’t. Your software should be maintainable, and work for its target audience. But as I read the details, I realized this: people writing bioinformatics software (scientific software in general) rarely think about their customers. And no, customers aren’t those who pay you. They are those who use your software. Your job is to make their lives easier, and yes, they may do things that you consider weird, but that’s a matter of education and of drawing the line at things you won’t try to work around. This is another fundamental problem that we need to change.
  8. You used the right programming language for the task. - Mostly agree here (although I don’t understand how Spark got put into the language category). The right programming language is the one you are effective in, that allows you to be more productive, and that helps foster a community.

Reading these points, observing projects, and having seen how innovative software companies work up close and personal, it’s clear to me that the problem is the culture around software development in bioinformatics. Open doesn’t come with caveats, software shouldn’t be about publishing, and there is nothing wrong with writing quality software, if only because you care. We just seem to accept the status quo and shrug our shoulders. Our customers (the people using the software) deserve better.