Hello, I’m Daisie Huang. I’m an evolutionary biologist at the University of British Columbia, and I’m also a software engineer. I’ll be discussing some issues in implementing policy for sustainable scientific software.
To create sustainable software, we need to look at some key issues. First, we need to acknowledge that as a software package matures, we will face new types of problems, and we need to plan for these throughout a package’s life cycle. Therefore, making scientific software sustainable means that we define policies and guidelines that the scientific community can follow and implement. But the reality of science today is that we have limited resources and rewards to encourage people to follow these policies and guidelines. Implementing policy often takes specialized expertise in software engineering.
In light of these issues, I’ll be discussing several papers that were contributed to the workshop. Some of these papers focus on specific facets of software design that are not often addressed in scientific software development, such as API governance and software security, and some discuss strategies for implementing all of the different facets of sustainable software design within scientific software.
First, Krintz et al. from the Department of Computer Science at the University of California, Santa Barbara discuss one of these important issues: developing systems for API governance.
The authors make the point that scientific research is moving away from local hardware environments towards cloud computing. Therefore, instead of focusing on access to hardware, we will need to focus on access to the digital assets: the code and the data. APIs—programming interfaces—are the main link governing interactions between these assets. Because APIs are the main interface between different digital assets, they have to be maintained in a sustainable way.
The authors focus on understanding the portability and consistency of APIs used to connect these data archives, because changes in APIs affect accessibility of data. They define two different types of compatibility, semantic compatibility and syntactic compatibility. They demonstrate an algorithmic method for categorizing a particular API port as “hard” or “easy,” at least for semantic compatibility.
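To make the distinction between the two kinds of compatibility concrete, here is a minimal Python sketch. It is not the method from the Krintz et al. paper, which is not described in this talk. It assumes, purely for illustration, that syntactic compatibility means two versions of a call accept the same parameter names in the same order, and that semantic compatibility means they return equal results on a set of probe inputs. The functions `fetch_v1` and `fetch_v2` and the "easy"/"hard" rule are hypothetical.

```python
import inspect


def syntactically_compatible(old, new):
    """True if both callables expose the same parameter names in the same order."""
    old_params = list(inspect.signature(old).parameters)
    new_params = list(inspect.signature(new).parameters)
    return old_params == new_params


def semantically_compatible(old, new, probes):
    """True if both callables return equal results on every probe input."""
    return all(old(*args) == new(*args) for args in probes)


def classify_port(old, new, probes):
    """Hypothetical rule: a port is 'easy' only if both checks pass."""
    # The semantic check runs only when the signatures match, so the probes
    # are never applied to a callable that cannot accept them.
    if syntactically_compatible(old, new) and semantically_compatible(old, new, probes):
        return "easy"
    return "hard"


# Two hypothetical versions of a data-archive lookup call.
def fetch_v1(record_id, fmt):
    return f"{record_id}.{fmt}"


def fetch_v2(record_id, fmt):
    # Same signature as fetch_v1, but the format is now lower-cased.
    return f"{record_id}.{fmt.lower()}"
```

In this sketch the two versions look identical to a caller, yet they disagree whenever the format is passed in upper case. That is the kind of change a signature comparison alone would miss, and the result also depends entirely on which probe inputs are chosen.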
Next, we’ll look at issues of software security.
Heiland et al. from the Center for Trustworthy Scientific Cyberinfrastructure discuss issues related to implementing strong security measures for scientific software. They point out that cybersecurity is rarely addressed in scientific software design.
Security considerations for software vary depending on the maturity level of the software package. But when we initially develop scientific software, we generally don’t know what the final maturity level will be. Scientific software developers are probably not aware of best practices for cybersecurity. So the authors introduce the concept of Software Security Maturity Models, such as OpenSAMM and BSIMM. These are used in industry to identify and define security vulnerabilities at different stages of the software life cycle.
They suggest that a similar Software Security Maturity Model can formalize this process for scientific software: it classifies software security practices, it provides a path for tightening security practices as a package’s maturity level increases, and it emphasizes understandability over complexity.
Finally, we’ll look at some papers that discuss implementing sustainability in scientific software.
Blanton and Lenhardt from the Renaissance Computing Institute discuss these issues from a user perspective.
The authors focus on a point that has been brought up many times in this context: There is a tension between writing code that is good enough just to “get it done,” i.e. to publish a paper about scientific results obtained using software, and “getting it right,” that is, developing software that is comprehensible to future users and reviewers. Just because the elevator panel works like this doesn’t mean it’s sustainable for the long run. We don’t have a way to validate that the software used in a paper is actually done right. The best way to get software designed correctly is to make sure best practices are considered from the start.
The authors highlight two models for sustainable software, at opposite extremes. One is what they call “co-funding”: in these projects, usually large, multi-year collaborations, there is equal emphasis on the science and the software development, and both are planned into the project from inception. In the life sciences, the iPlant Collaborative, the Galaxy Project, and QIIME are good examples of these sorts of large, well-designed projects.
At the other extreme, they discuss “software carpentry”: in this model, it’s assumed that the scientists themselves will write and maintain their code. Groups like Software Carpentry and rOpenSci assume that scientists won’t have access to dedicated software engineers, so they try to give scientists the tools to apply best practices in their own software development.
There might be a middle ground here: a way to get the engineering expertise that large co-funded projects have to individual scientist-developers. Hilmar Lapp of NESCent and I discuss one such possibility in our paper, Software Engineering as Instrumentation for the Long Tail of Scientific Software.
What do we mean when we refer to the “long tail” of scientific software? Think of the distribution of resources in scientific software. Most are focused on big projects with lots of community buy-in and funding. But a lot of scientific software exists away from this model. For example, scientific software can be used long after the original developer has moved on or the funding runs out. Look at MacClade: it was originally released in 1986 and last updated in 2005, but it was still cited over 400 times in 2013! The scientists who developed it have a newer package, Mesquite, that was meant to replace MacClade, but they haven’t had sufficient time or resources to maintain either package fully, let alone both of them.
Another dimension of the long tail can also be found in my particular research domain. In the field of phylogenetics, we have a lot of programs that implement different computational methods in slightly different ways. Here, Joe Felsenstein has listed some (but not anywhere near all) phylogenetics packages available online. Most of these programs are developed by academic scientists, who generally have limited training in software engineering, limited time or career incentive to improve software, and limited funding.
So, to summarize a bit: Making sustainable software means we have to pay attention to many facets of software design, like APIs, security, user experience, testing, etc. A single project that requires one full-time software engineer may actually require fractions of different kinds of engineers. But long-tail projects can’t even fund one FTE, let alone one that can address all these facets.
Then we have to consider that the users of scientific software are scientists, so the developers need to understand the users and the science. This is the idea of a “t-skilled” person: one who is both well-versed in a scientific domain and deeply experienced in one or more facets of software engineering. These people are pretty rare in the first place and difficult to retain in academia, because the academic career structure doesn’t incentivize this.
We should look at software engineering as an expensive resource, but one that needs to be accessible to scientists at all levels. Think of it as analogous to DNA sequencing: sequencers used to be something that individual labs and institutions had to buy, maintain, and operate themselves, so only highly funded operations had them, and even those probably didn’t use them to their full capacity. But now, core facilities provide the instrumentation and service to labs of any size. Anyone can pay a core facility to sequence their samples for them and provide quality control and bioinformatics advice as additional services.
We propose that software engineering can be “instrumented” in a similar way. Let’s create a nonprofit center for scientific software engineering. This center can hire these t-skilled personnel and provide access to them for projects at contracted cost. Because the center is focused on providing development services to scientific projects, it is not tied to the long-term success or failure of any individual project. It would emphasize the centrality of doing good science by making functional software tools as envisioned by scientists.
So, to conclude… Implementing policies to encourage sustainability in scientific software requires that many facets of good software design are addressed throughout the lifecycle of these projects. But most of them aren’t addressed in the status quo. We’ve highlighted some of these facets today and suggested some possible solutions. Large projects can afford to hire software engineers with the expertise to implement these facets correctly. Grassroots developer groups can provide guidance to scientists about best practices in software development. We think there is a place for a software engineering center that can provide both engineering expertise and guidance with a contract-driven instrumentation model to the scientific software in the long tail.
Implementing policy @ WSSSPE
WSSSPE Workshop 2013
Biodiversity Research Centre
University of British Columbia
As software matures, new problems emerge.
Sustainability issues should be addressed throughout the life cycle.
How to implement sustainability when resources are limited?
Toward a Research Software Security
R Heiland, B Thomas, V Welch, C Jackson, Center for Trustworthy Scientific Cyberinfrastructure, Indiana University
A Security Maturity Model can formalize this
Provides classification of software security
Provides a path for tightening security practices as a package’s maturity level increases.
Emphasizes understandability over complexity.
A User Perspective on Sustainable
Brian Blanton and Chris Lenhardt, Renaissance Computing Institute
Tension between “getting it done” enough to publish scientific results and “getting it right” for future users.
Best suited for large, collaborative projects
Teach scientists to use software development best
Software Engineering as Instrumentation for the Long Tail of Scientific Software
Daisie Huang and Hilmar Lapp, UBC and NESCent
The Long Tail
The lifespan of scientific software can be
Lots of small programs implement different methods.
Facets of software design
User interface design
Software engineering as a resource
Analogous to DNA sequencing facilities
A scientific software engineering center can provide these resources to many projects.
Governed by long-term vision that is not tied to success or failure of any individual project.
Emphasis on executing good science by making
Many facets of software design not addressed in most scientific software projects.
Possible solutions include:
large projects can hire developers with software
providing scientists with software design guidance
A software engineering center can provide both expertise and guidance to the long tail.