(Slides from this talk can be found here: https://speakerdeck.com/basho/automatically-scalable-computation-ricon-east-2013) As our computational infrastructure races gracefully forward into increasingly parallel multi-core and blade-based systems, our ability to easily produce software that can successfully exploit such systems continues to stumble. For years, we’ve fantasized about the world in which we’d write simple, sequential programs, add magic sauce, and suddenly have scalable, parallel executions. We’re not there. We’re not even close. I’ll present trajectory-based execution, a radical, potentially crazy, approach for achieving automatic scalability. To date, we’ve achieved surprisingly good speedup in limited domains, but the potential is tantalizingly enormous. About Dr. Seltzer Margo I. Seltzer is a Herchel Smith Professor of Computer Science in the Harvard School of Engineering and Applied Sciences. Her research interests include provenance, file systems, databases, transaction processing systems, and applying technology to problems in healthcare. She is the author of several widely-used software packages including database and transaction libraries and the 4.4BSD log-structured file system. Dr. Seltzer was a founder and CTO of Sleepycat Software, the makers of Berkeley DB, and is now an Architect at Oracle Corporation. She is currently the President of the USENIX Association and a member of the Computing Research Association’s Computing Community Consortium. She is a Sloan Foundation Fellow in Computer Science, an ACM Fellow, a Bunting Fellow, and was the recipient of the 1996 Radcliffe Junior Faculty Fellowship. She is recognized as an outstanding teacher and mentor, having received the Phi Beta Kappa teaching award in 1996, the Abrahmson Teaching Award in 1999, and the Capers and Marion McDonald Award for Excellence in Mentoring and Advising in 2010. Dr. Seltzer received an A.B. degree in Applied Mathematics from Harvard/Radcliffe College in 1983 and a Ph. D. in Computer Science from the University of California, Berkeley, in 1992.
One of the more frustrating parts of the last few years has been a lack of the kind of time required to learn new programming languages and re-learn stuff I had forgotten in my years on the road and not staying close to analytics and programming. I picked up some Ruby along the way, partly cause I liked the elegance of the language, and partly because it is really good at things I still do from time to time - launching and managing instances, and automating infrastructure. I still suck at it, but I can to launch an EC2 instance or two and can use Ruby-based static website generators. Works for me for the most part until I get frustrated at not being able to do things I could do in my sleep 6-7 years ago.
A language I have resisted over the years is Python. I didn’t love the syntax, hated the whitespace, and given that I had no time to properly learn the language I was more interested in, there was no room for Python. But there was always one reason I kept an eye on Python, scientific computing and analytics. While Ruby seemed to rule the roost for the devops crowd, Python has always been a darling of the science types, and I watched SciPy and Numpy with more than a tinge of jealousy, and I’ve long been an admirer of iPython.Then a colleague told me about Pandas.
Pandas is like R, but it is native Python, so lacks all the ugliness of R. It’s not as powerful as R today, but it was the final straw. I am going to teach myself Python, even if it means I never really become the Ruby guru I’ve always wanted to be. In my day to day life, there is a lot of opportunity for number crunching, data structures and analysis, and the more numerically oriented Python tools provide a powerful toolkit. I’ll still use Ruby for all the infrastructure management I do and hopefully some day find time to get really good with both languages. Given recent developments, not sure when that might be (maybe in another 17-18 years)
So the Repo of the Week didn’t quite pan out weekly, but I am going to keep the category going.
BioLite is a bioinformatics framework written in Python/C++ that automates the collection and reporting of diagnostics, tracks provenance, and provides lightweight tools for building out customized analysis pipelines. It is distributed with Agalma, but can be used independently of Agalma.
Agalma is a de novo transcriptome assembly and annotation pipeline for Illumina data. Agalma is built on top of the BioLite framework. If you have downloaded Agalma+BioLite, the files that are specific to the Agalma pipeline are located in the agalma/ subdirectory.
You can read my original post, the discussion, or Mick Watson’s blog post. Having worked on the commercial side of scientific software for a good chunk of my career, I understand the commercial side and potential driving factors, but my complete distaste for academic/non-commercial use licensing is well known, and the GATK folks aren’t exactly handling this well.
I will add one thing. There are some whom I respect, who point out that commercial entities add pretty GUIs and don’t add much value. To that I say, that’s pretty much why commercial informatics software is hard. Any company that isn’t really adding value is not going to succeed in the long run. Let the market decide. Your job as GATK is to create high quality, open source, software which benefits science. If companies create no value or minimize the value it means the following in most cases
- In time the company will go under cause no one else is deriving any value. This is the usual case and hardly something to get concerned about
- If the company is providing value then it’s a good thing. In most cases, this will happen only if GATK is part of a much more comprehensive package or service that makes it easier for people to get stuff done
- The onus is on the GATK devs and funders to figure out how to compete if they feel their work is being “trivialized”. Competition is a good thing, even in pure open source code. The problem seems to be, that the Broad considers this their code as opposed to a community resource with a rich developer community. Get the latter behind you and any trivialization by people building pretty GUIs goes out of the window cause your community is going to do that for you if there is demand
To cut a long story short, the Broad is not taking the right steps, but I don’t blame them per sé. Scientific software funding needs to evolve and the idea of community and broad developer outreach needs to evolve. So as much as anything, I blame the system.
There is an interesting discussion on Titus’ blog on VMs and reproducibility, including some great comments. I’ve always considered VMs, especially those that can be deployed in the cloud, a convenience. In other words, they make it easy for people to try and reproduce your work cause you give it to them in a turnkey-type way. However, I’ve never felt that VMs were the optimal solution for doing science. If you think about it, what do you need for good science
- Access to the raw data and any other data sets associated with the science.
- A description of the methods used that are used in the research. Ideally you should be able to use these methods and the data sets above to come up with the same results.
- The code used to implement the methods above.
- A list of dependencies and the execution environment.
Is this a complete list? I am sure if I think about it again the list may evolve, but it seems about right to me. In the end you want to do three things (1) See if you can replicate the work; (2) have enough information to reproduce it but using your own code, in case you don’t like the actual implementation and (3) evolve the science using existing work as a starting point.
What enables all this? It’s open data, it’s open source and it’s programmability. If you think of your infrastructure and your overall system programmatically, it’s a lot more elegant than a VM. It’s not easy, but if you can use recipes and configure a system on the fly then you aren’t limited to a VM, but can dynamically generate the environment required, with the appropriate data sets and dependencies. I’ve always said that data is royal garden, but compute is a fungible commodity, and dynamic environments are super powerful tools that can enable really good science. Unfortunately, they also require a level of skill that many scientists don’t have.
These are topics that Matt Wood and I talk about a lot (see the two decks below for some ideas)
Yes it’s a very cloud-centric view of the world, but there is a reason we work where we do.
I am a chemist by training. Every degree (B.Sc., M.Sc. and Ph.D.) is in chemistry, but I am not a practicing chemist any more, and haven’t been for a very long time. However, I do not have any regrets about the path I have taken. In fact, I think my background in chemistry has helped me quite a bit.
Today, I am a Principal Product Manager at Amazon Web Services. There I work on Amazon EC2 instance platforms. In other words, I spend a lot of my time on the server platform that powers EC2. What does this have to do with chemistry? Not much. So why do I think Chemistry has a role to play in this?
After my B.Sc. in chemistry, I spent most of my Master’s and Ph.D. as a physical chemist/theoretical chemist. That pretty much means that you have be analytical, learn to work with others (who are often doing bench chemistry), and have to learn your way around computers. A lot of what I have done in my professional career has been around software, computers and analytical thinking. Your traning as a chemisty allows you to think about the fundamentals of a problem, about how to break problems down into their consituent parts, and best of all teaches you how to set up experiments. I am not formally trained in software development, web services, data management or product management, so I definitely believe that my training as a chemist has helped me transition into all these non-chemistry roles over the years.
Moral of the story: Your career can take many paths, but your training as a chemist is going to come in good stead along those paths, and stories about lab explosions always come in handy at parties.
Oh, and happy chemistry week.
This is the second post in the short existence of this blog that starts with “Titus”. Well there is a good reason. In a wonderful blog post Titus pretty much nails my opinion on the matter of research software. He writes
I think this notion that research software is something special and deserving of some accomodation is so wrong that it’s hard to even address it intelligently. What, you think people at Google aren’t doing exploratory programming where they don’t know the answer already? You think Amazon customers don’t behave in unexpected ways? You think Facebook social network data mining is easy? The difference there is that companies have a direct economic incentive to solve these problems, and you don’t.
And I completely agree with him on the excuses.
A lot of scientific software, and this is especially true in bioinformatics, is “open source” in some way or another. That it seems the community doesn’t quite understand the value of open source is another matter and another post, so for the sake of this post, let’s assume it is. Perhaps more importantly, a good chunk of the software used is developed by academia. In my mind, this increases the bar on code quality and software stewardship. Most importantly, developers of academic software need to think about their applications differently and funding agencies need to think about how they fund software development differently.
Under the assumption that the majority of code used to do scientific discovery originates in academia, the question to ask is, what responsibiity does a scientific software developer have? Should they think of their potential users as customers from the beginning or is that something that becomes important later in the process. While in some open source academic projects, especially ones that gave been developed ground up, a customer-centric approach seems to exist, in general it appears that much code is developed to get published or to get something out there to solve a particular problem. Given the realities of scientific problems, I don’t believe you can assume on day 1 that your applications are going to find use in the broader community, but it is a safe assumption to make that for many applications that is the end goal. The reality is that you might be the only one that ever uses the code, especially if it is being developed to solve a specific problem, then it might be your team, then other labs and collaborators and ultimately a wider community. This means that not only should scientific software developers take a step back and think about the potential scope of their project as it evolves, it means that funding agencies need to rethink how they fund software.
First, publishing software as papers needs to go away. Algorithms should get published, novel architectures should get published. Software should only be published as a note to aid discovery. Funding agencies also need to recognize that funding new software projects for 3-5 years and expecting the developer to know the outcome at the beginning is short sighted. Software evolves, features and scope evolves along the way. Three years is an eternity for a software project, five .. I don’t have a word for how long that is. Funders also need to recognize that there is a greater need for funding as a piece of software grows and is recognized by the community. In a way that could be looked at as a return on investment. The broader the reach and impact to science the more successful the initial funding, but you need the concept of angel funding as well to get a project off the ground, see how it will evolve. We also need to raise the bar. Should new proposals be funded or should developers be encouraged to contribute to existing projects? Since there doesn’t seem to be much emphasis on the latter, you see new applications being developed as opposed to getting funding to contribute to existing applications.
bioinformatician still = PI mentality, not team-based or community
Software development is different, it works at different time scales and it requires a different approach. Note that I am not talking about research code, but code that’s meant to be used over a period of time, at the least by multiple generations in your research group. The change has to start within the community, but they aren’t going anywhere without funding agencies changing the incentives.
I have been on a soapbox lately around programming and bioinformatics. So I am going to try and point to find a random repo every week that I like and put it up here. They will mostly be from Github, but that’s not a requirement.
Today’s repo comes to you via the Faculty of Life Sciences at the University of Manchester. The repo consists of “Scripts, utilities and programs for genomic bioinformatics”, and contains scripts for a variety of genome informatics tasks.
This is the kind of repo that’s super useful. For now their seems to be one person pushing code, so hopefully there will be more. There are at least 2 forks, and reasonable activity.
I’ve now heard at least a couple of people managing large life science repositories talk about resiliency and durability and mention that they have durability cause they use RAID. That’s just a cringeworthy thing to hear. I would hope that people managing core repositories know better. To the best of my knowledge they do, but it is troubling. I was reminded of the apparently lack of understanding in managing data by a tweet from Adam Kraut earlier where he linked to a paper that talks about the challenges of maintaining file integrity. In general, I recommend anyone in the world of informatics building large scale storage (or even small scale storage) check out James Hamilton’s blog post covering a talk by Jeff Dean on building large distributed systems pdf. The key is failure happens. Between 1-5% of your disks are going to fail over the course of a year and 2-4% of your storage servers. There are any number of reasons these could happen and they all have different failure rates. Google has published work on their analysis of disk failure rates pdf analysis on Storage Mojo.
Where am I going with this? As the size of our storage systems in informatics increases, as we keep data around for longer, we need to take a deeper look at how we are managing our data, and not make naive assumptions. Think about the tradeoffs you need to make between performance, availiability and durability (and think through what durability means). There are simple and creative ways of getting there (e.g. keeping a copy of a disk array in a friends lab in a different building), and a number of solutions (including some from my day job), but let’s not assume that RAID = durability. In the end managing your data is less about the hardware and more about the operational processes and software sitting on top of the hardware.