business|bytes|genes|molecules

At the interface of science and computing

About Software and Bioinformatics

It seems I end up writing something about software and bioinformatics every so often. In fact, I pretty much wrote about this in my last post. This time the trigger is Lior Pachter’s post on The Myths of Bioinformatics Software. I commented on it on Twitter, which led to quite the discussion and a follow-up post by Titus Brown.

Before I write more, I’ll try and set up my vantage point. I wrote shoddy computational chemistry software in graduate school (what, no version control?!!!) and somewhat less shoddy software in my first job at a bioinformatics company (a very small startup), helped by the fact that we actually had real software developers. I was then exposed to a lot of in-licensing as a product manager at Accelrys and to industrial-strength commercial software at Rosetta Biosoftware, and I have spent the last seven years at Amazon, where I currently own a growing service (product and engineering). Suffice to say I have strong opinions. The obvious caveat is that I haven’t been in bioinformatics for a long time, but by marriage and through friends I continue to have a pretty close-up view of things.

I started reading Titus’ post and was surprised by where it started, then I got to the point where he talked about being angry and frustrated. I am not sure I am angry, but I strongly believe that the biggest problem with bioinformatics is that we expect shoddy in a field where quality should be critical, and the more I think about it the more it’s obvious to me that it is about culture and incentives. So let me address the points one by one.

  1. Somebody will build on your code - This may be true in general, but there are also several counterpoints. I gave an example in my last post with my wife’s spectrum viewer. But I also think I understand why this is not the norm. My wife is a software developer and didn’t care about citations. What she did care about was making sure others knew where to send pull requests, and if there were things she could improve. In other words, the code being used was way more important than being cited in a paper. Unfortunately, that is not how people are measured, and writing new code is a better option for many than building upon the work of others. It’s also a reason the licensing system is so screwed up, because people aren’t fundamentally thinking about the ecosystem and your licensing choices are often governed by how you think about your own software. If you are writing software to publish, then your incentives are all wrong. I sound like a broken record, but software systems are not about getting papers published, they are about getting people to do work more effectively (new algorithms are different and should be published).
  2. You should have assembled a team to build your software - This boils down to why the software exists. There are many examples of open source projects that began as one person operations and stayed so for years, but the beauty of open source is that you have a broad community who contribute. But I digress. I don’t think good software needs large teams, but if you approach your software with the mentality that it’s meant to be a platform that’s going to be used by a set of people then you do need to make sure you are thinking about a team, potentially hiring real software developers. Otherwise, your code will atrophy, no one will want to support it, and the vicious cycle starts again. And a team doesn’t have to be too big either, especially if it’s a core team supporting an open source project. I don’t think software these days (especially in the world I live in) is at all about large teams. In fact it’s about being nimble and doing more with less by exposing the right interfaces.
  3. See #4
  4. Making your software free for commercial use shows you are not against companies. - Software that is free for academics and not for commercial use is my biggest bugbear with scientific software. That’s not how software should be. You choose a license (closed, GPL, Apache) based on how you think about your software. But this mixed mode thing is sheer arrogance, and prevents communities from being built around software. It makes sure that smart people who may want to contribute to improving it, can’t, and it also makes terrible assumptions about business models. You are locking out the bootstrapped startup, the lone computational geek in a larger organization who may have limited funding, and in general breaking the rules of software licensing because for some reason academia seems to think it is somehow supposed to hold the rights to science. And remember, Apache licensed software can make money. It boils down to your incentives again. Maybe bioinformatics can learn a little from the technology community and its approach to licensing and commercialization. Maybe I am overreacting, but I felt like that when I was still a grad student, and feel so even more strongly now.
  5. You should maintain your software indefinitely. - I am not sure I understand this one. If you write software that you feel is solid and being used by others, you should think about sustainability, but no one expects software to be maintained indefinitely.
  6. Your “stable URL” can exist forever. - Probably not but you could at least try. Most don’t even bother because that URL existed just to write a paper and get funding. There was never any intention of building something that could become a platform people used. Yep I am cynical about why scientific software gets written.
  7. You should make your software “idiot proof”. - Well, nothing to say here. You shouldn’t. Your software should be maintainable, and work for its target audience. But as I read the details, I realized this: people writing bioinformatics software (and scientific software in general) rarely think about their customers. And no, customers aren’t those that pay you. They are those that use your software. Your job is to make their lives easier and yes, they may do things that you may consider weird, but that’s a matter of education and drawing the line at things you won’t try and work around. Another fundamental problem that we need to change.
  8. You used the right programming language for the task. - Mostly agree here (although I don’t understand how Spark got put into the language category). The right programming language is the one that you are effective in, that allows you to be more productive, and that helps foster a community.

Reading these points, observing projects, and having seen how innovative software companies work up close and personal, it’s clear to me that the problem is the culture around software development in bioinformatics. Open doesn’t come with caveats, software shouldn’t be about publishing, and there is nothing wrong with writing quality software, if only because you care. We just seem to accept the status quo and shrug our shoulders. Our customers (the people using the software) deserve better.

Science and Software

I’ve been busy, so I am rather late jumping into the fray. My favorite source of post ideas, the Chuck Norris of Bioinformatics, has written a series of super interesting, and rather provocative, posts on this subject.

At the center of a lot of the debate is software quality, software maintainability, and in the end, the ability of scientists to write maintainable software. There is also an intense debate over whether throwaway software is a good thing or not. I come at this debate from an interesting place. I wrote (bad) code in grad school and most of it was throwaway code. I wrote (somewhat better) code as a scientific programmer at my first job. The quality of the code may not have been that much better, but others did look at it. The code was version controlled and it wasn’t meant just for me to use, so the bar was higher. Since then I haven’t written much code professionally but I’ve worked with and led a number of engineering teams. So I think I’ve learnt a thing or two about what good software is all about.

I tend to put scientific software in three buckets. There is software you write for maintenance (clean up, munging, etc). If you are a software engineer/developer, you are very likely to throw that up on Github today, since the likelihood that others may find it useful is > 0. But if you’re not, I don’t see why you should. Chances are that this is a one off anyway, and the quality bar is a little lower (famous last words).

Then there’s the software you use to do your science: the algorithms, the implementations of well-known algorithms, and the analysis pipelines themselves. This is the area where I feel that without open source software, without software peer review, there is so much room for error that keeping the code closed is just plain bad science. That is especially true today, when resources like Github make it easy to share code and make software peer review relatively easy. I remember the day when one of our scientific collaborators sent us some code. It declared a variable, then two lines later declared the same variable with the opposite sign. It turns out the first declaration was left over from an experiment and never removed. There are many similar examples (FORTRAN allows for some really sloppy code), and it is that risk that makes me worry about how much published science is the result of badly written software. You may ask, what about commercial code? Good question. With commercial software, people tend to assume either that (a) the software is a skin on top of well-known academic software, or (b) the company is applying a level of rigor that makes its offerings worth your trust.
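The leftover-variable anecdote is the kind of thing even a crude automated check can surface before a human reviewer looks at the code. Here is a minimal sketch in Python (not the FORTRAN of the original story); the function name and the example variables are made up for illustration, and the check is deliberately naive, so it will also flag perfectly legitimate reassignments.

```python
import ast
from collections import defaultdict


def repeated_assignments(source):
    """Return {name: [line numbers]} for simple names assigned more than once."""
    lines_by_name = defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        # Only plain "name = value" statements are considered in this sketch.
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    lines_by_name[target.id].append(node.lineno)
    return {
        name: sorted(lines)
        for name, lines in lines_by_name.items()
        if len(lines) > 1
    }


# Hypothetical snippet mimicking the anecdote: a value left over from an
# experiment, then the same name assigned again with the opposite sign.
example = """\
coupling = 0.35
temperature = 298.0
coupling = -0.35
"""

print(repeated_assignments(example))  # {'coupling': [1, 3]}
```

A report like this doesn’t tell you which value is right. It only tells a reviewer where to look, which is the point of putting the code where others can see it.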

The third category is what I call infrastructure; databases, repositories, pipeline frameworks, visualization software. That one is probably (hopefully) written by the best programmers in the field. My preference is that these be open source, since platforms can be extended, and too many people build their own for no good reason. The quality of software engineers in the sciences is only going up, and we have some robust platforms out there now. I know teams that do routine code reviews, hire very good software developers, and function like a proper engineering team. If you want to write sustainable software that’s used by lots of other people, i.e. your goal isn’t just to publish, you have no other choice.

I have one more anecdote. My wife wrote a piece of software called Lorikeet (and yes, it needs to move to Github). At the time she wrote it, she never expected anyone else outside her lab to use it. But it turns out that because it was out there, open source, and usable, a lot of other people started using it. It’s been forked, enhanced, etc etc. That only happens when your code is open source.

So, what does this add to the debate? I suppose what I am saying is that “it depends”. In general, open source is awesome, and having people publish their code is a good thing since others can use it, improve on it, etc. I do think there is room for throwaway code, but only for a very narrow use case. When it comes to algorithms, I really struggle to see how you can get away without peer review, especially when your science depends on it.

OK, enough rambling for now.

Adieu My Friend

Friendfeed is no more. It seems like many moons ago that I spent almost all my social time on Friendfeed. I started a room called The Life Scientists right around the time Friendfeed introduced rooms, with no idea if anyone would ever show up. They did and we ended up forming a community that I will cherish for a long long time. Many friendships were made. There were many discussions on a variety of topics near and dear to my heart. Then came the acquisition and slowly most of us left. This day was inevitable, and it has been ages since I visited the site, but I still feel a little down and definitely very nostalgic. My old blog is no more, so all the articles around Friendfeed are also lost. I suppose that blog and Friendfeed represent a period of innocence for a lot of us who cared about science, about open science, and the avenues that might make it work. We’ve all grown older and, hopefully, wiser.

Fittingly the best Friendfeed farewell comes from Neil, who to me is synonymous with those days. He has also archived a couple of the best rooms.

Python

One of the more frustrating parts of the last few years has been a lack of the kind of time required to learn new programming languages and re-learn stuff I had forgotten in my years on the road, away from analytics and programming. I picked up some Ruby along the way, partly because I liked the elegance of the language, and partly because it is really good at things I still do from time to time: launching and managing instances, and automating infrastructure. I still suck at it, but I can launch an EC2 instance or two and can use Ruby-based static website generators. Works for me for the most part until I get frustrated at not being able to do things I could do in my sleep 6-7 years ago.

A language I have resisted over the years is Python. I didn’t love the syntax, hated the whitespace, and given that I had no time to properly learn the language I was more interested in, there was no room for Python. But there was always one reason I kept an eye on Python: scientific computing and analytics. While Ruby seemed to rule the roost for the devops crowd, Python has always been a darling of the science types, and I watched SciPy and Numpy with more than a tinge of jealousy, and I’ve long been an admirer of iPython. Then a colleague told me about Pandas.

Pandas is like R, but it is native Python, so lacks all the ugliness of R. It’s not as powerful as R today, but it was the final straw. I am going to teach myself Python, even if it means I never really become the Ruby guru I’ve always wanted to be. In my day to day life, there is a lot of opportunity for number crunching, data structures and analysis, and the more numerically oriented Python tools provide a powerful toolkit. I’ll still use Ruby for all the infrastructure management I do and hopefully some day find time to get really good with both languages. Given recent developments, I am not sure when that might be (maybe in another 17-18 years).

Repo of the Week - Feb 9, 2013

So the Repo of the Week didn’t quite pan out weekly, but I am going to keep the category going.

This week’s repo (well, pair of repos) comes to you courtesy of the Dunn lab. The two repos are biolite and agalma. What are these?

BioLite is a bioinformatics framework written in Python/C++ that automates the collection and reporting of diagnostics, tracks provenance, and provides lightweight tools for building out customized analysis pipelines. It is distributed with Agalma, but can be used independently of Agalma.

Agalma is a de novo transcriptome assembly and annotation pipeline for Illumina data. Agalma is built on top of the BioLite framework. If you have downloaded Agalma+BioLite, the files that are specific to the Agalma pipeline are located in the agalma/ subdirectory.

The authors have also made an Amazon EC2 image available with Agalma and all its dependencies. There is a tutorial to get things working on EC2.

More on GATK

I returned to blogging because of a need to rant about newly announced GATK licensing. Well, this time I am going to let others rant since things have only taken a turn for the worse.

I noticed a tweet from Mick Watson, which led me to this discussion on GATK licensing.

You can read my original post, the discussion, or Mick Watson’s blog post. Having worked on the commercial side of scientific software for a good chunk of my career, I understand the commercial side and potential driving factors, but my complete distaste for academic/non-commercial use licensing is well known, and the GATK folks aren’t exactly handling this well.

I will add one thing. There are some whom I respect, who point out that commercial entities add pretty GUIs and don’t add much value. To that I say, that’s pretty much why commercial informatics software is hard. Any company that isn’t really adding value is not going to succeed in the long run. Let the market decide. Your job as GATK is to create high quality, open source software which benefits science. If companies create no value, or minimize the value of your work, the following applies in most cases:

  • In time the company will go under because no one else is deriving any value. This is the usual case and hardly something to get concerned about.
  • If the company is providing value then it’s a good thing. In most cases, this will happen only if GATK is part of a much more comprehensive package or service that makes it easier for people to get stuff done.
  • The onus is on the GATK devs and funders to figure out how to compete if they feel their work is being “trivialized”. Competition is a good thing, even in pure open source code. The problem seems to be that the Broad considers this their code as opposed to a community resource with a rich developer community. Get the latter behind you and any trivialization by people building pretty GUIs goes out of the window, because your community is going to do that for you if there is demand.

To cut a long story short, the Broad is not taking the right steps, but I don’t blame them per se. Scientific software funding needs to evolve and the idea of community and broad developer outreach needs to evolve. So as much as anything, I blame the system.

On Reproducibility

There is an interesting discussion on Titus’ blog on VMs and reproducibility, including some great comments. I’ve always considered VMs, especially those that can be deployed in the cloud, a convenience. In other words, they make it easy for people to try and reproduce your work because you give it to them in a turnkey-type way. However, I’ve never felt that VMs were the optimal solution for doing science. If you think about it, what do you need for good science?

  • Access to the raw data and any other data sets associated with the science.
  • A description of the methods used in the research. Ideally you should be able to use these methods and the data sets above to come up with the same results.
  • The code used to implement the methods above.
  • A list of dependencies and the execution environment.

Is this a complete list? I am sure if I think about it again the list may evolve, but it seems about right to me. In the end you want to do three things: (1) see if you can replicate the work; (2) have enough information to reproduce it using your own code, in case you don’t like the actual implementation; and (3) evolve the science using existing work as a starting point.

What enables all this? It’s open data, it’s open source and it’s programmability. If you think of your infrastructure and your overall system programmatically, it’s a lot more elegant than a VM. It’s not easy, but if you can use recipes and configure a system on the fly then you aren’t limited to a VM, but can dynamically generate the environment required, with the appropriate data sets and dependencies. I’ve always said that data is a royal garden, but compute is a fungible commodity, and dynamic environments are super powerful tools that can enable really good science. Unfortunately, they also require a level of skill that many scientists don’t have.
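To make the “list of dependencies and the execution environment” item a little more concrete, here is a minimal sketch of what capturing it programmatically could look like, using only the Python standard library. The function names and the manifest layout are my own assumptions for illustration, not an established format, and a real recipe system would go much further (system libraries, compilers, instance types).

```python
import hashlib
import json
import platform
import sys
from importlib import metadata


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_files=()):
    """Describe the interpreter, OS, installed packages, and input data files."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with broken or missing metadata
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
        "data": {str(path): sha256_of(path) for path in data_files},
    }


if __name__ == "__main__":
    # Usage (hypothetical script name): python manifest.py reads.fastq reference.fa
    print(json.dumps(build_manifest(sys.argv[1:]), indent=2))
```

A manifest like this is not the environment itself, but it is something a recipe can be checked against, and it travels a lot more easily than a VM image.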

These are topics that Matt Wood and I talk about a lot (see the two decks below for some ideas)

Yes it’s a very cloud-centric view of the world, but there is a reason we work where we do.

My Chem Coach Carnival

Susan Baxter blackmailed me into writing this post, but it is actually an interesting one to write, since I am probably not the most likely person to write one for the Chem Coach Carnival.

I am a chemist by training. Every degree (B.Sc., M.Sc. and Ph.D.) is in chemistry, but I am not a practicing chemist any more, and haven’t been for a very long time. However, I do not have any regrets about the path I have taken. In fact, I think my background in chemistry has helped me quite a bit.

Today, I am a Principal Product Manager at Amazon Web Services. There I work on Amazon EC2 instance platforms. In other words, I spend a lot of my time on the server platform that powers EC2. What does this have to do with chemistry? Not much. So why do I think Chemistry has a role to play in this?

After my B.Sc. in chemistry, I spent most of my Master’s and Ph.D. as a physical chemist/theoretical chemist. That pretty much means that you have to be analytical, learn to work with others (who are often doing bench chemistry), and learn your way around computers. A lot of what I have done in my professional career has been around software, computers and analytical thinking. Your training as a chemist allows you to think about the fundamentals of a problem, about how to break problems down into their constituent parts, and best of all teaches you how to set up experiments. I am not formally trained in software development, web services, data management or product management, so I definitely believe that my training as a chemist has helped me transition into all these non-chemistry roles over the years.

Moral of the story: Your career can take many paths, but your training as a chemist is going to stand you in good stead along those paths, and stories about lab explosions always come in handy at parties.

Oh, and happy chemistry week.

Titus Makes My Life Easy

This is the second post in the short existence of this blog that starts with “Titus”. Well, there is a good reason. In a wonderful blog post Titus pretty much nails my opinion on the matter of research software. He writes:

I think this notion that research software is something special and deserving of some accomodation is so wrong that it’s hard to even address it intelligently. What, you think people at Google aren’t doing exploratory programming where they don’t know the answer already? You think Amazon customers don’t behave in unexpected ways? You think Facebook social network data mining is easy? The difference there is that companies have a direct economic incentive to solve these problems, and you don’t.

And I completely agree with him on the excuses.