Presentation at NSF Workshop on Software and Data Citation. Draws from our study of how software is visible in scientific publications (JASIST) and our CSCW paper on BLAST innovation integration.
Software Citation and a Proposal (NSF workshop at Harvard Medical School)
1. Software in the scientific literature:
Software mentions and a
provocative proposal
James Howison
Information School
University of Texas at Austin
This material is based upon work supported by the National
Science Foundation under Grant No. SMA-1064209.
@jameshowison
2. What does a citation do, anyway?
• Gives credit for contribution
– A key reward that drives activity in science
– Sits alongside publications, grants, promotions,
and prizes
– Rewards shape the types of artifacts produced and how scientists collaborate
• Explains the method used
– Citations assist in knowing what was done
– Provenance
– Replication and extension
3. How problematic are current
practices?
• How is software mentioned in papers?
• How accessible and reusable is the software
mentioned?
• How well do these mentions perform the
functions of citation?
github.com/jameshowison/softcite
DOI: 10.6084/m9.figshare.1146366
Howison, J., & Bullard, J. (2015). Software in the scientific literature: Problems with
seeing, finding, and using software mentioned in the biology literature. Journal of the
Association for Information Science and Technology (JASIST), doi: 10.1002/asi.23538
4. Sample and Method
• 90 randomly selected articles from the biology
literature, published between 2000 and 2010.
• Journals stratified across Journal Impact
Factor to balance coverage with influence
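The stratified design above can be sketched in Python. A minimal sketch under stated assumptions: the journal names and impact-factor values are invented placeholders, not the study's actual sample.

```python
import random

# Hypothetical (journal, impact factor) pairs -- placeholders only,
# not the study's actual journal list.
journals = [("Journal %d" % i, jif) for i, jif in
            enumerate([0.8, 1.2, 2.5, 3.1, 4.7, 6.0, 9.8, 14.1])]

def stratify_by_quartile(journals):
    """Split journals into four impact-factor strata, lowest to highest.

    Assumes the list length is divisible by 4, for simplicity.
    """
    ranked = sorted(journals, key=lambda j: j[1])
    k = len(ranked) // 4
    return [ranked[i * k:(i + 1) * k] for i in range(4)]

def sample_evenly(strata, per_stratum, rng):
    """Draw the same number of journals from each stratum, so low- and
    high-impact journals are equally represented."""
    return [rng.choice(stratum) for stratum in strata
            for _ in range(per_stratum)]

strata = stratify_by_quartile(journals)
picked = sample_evenly(strata, 2, random.Random(0))
```

Sampling evenly across strata is what "balance coverage with influence" means here: high-impact journals do not dominate the sample, but are guaranteed representation.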
5. Content analysis scheme
Manual content analysis (3 coders; agreement assessed with kappa)
1. Identifying mentions
– Read article, locate a mention of a piece of software
2. Identify in-text characteristics of mention
– Name of software? URL? Date? Version number? In
bibliography? Cite to paper/manual/webpage?
3. Functions of mention
– Identifiable? Findable? Accessible? Source? Match
preferred citation?
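The agreement check mentioned above can be illustrated with a minimal Cohen's kappa (pairwise, two coders) sketch; this is an assumption about the statistic used, not the study's actual computation, and the label names are invented.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa: agreement between two coders, corrected for chance."""
    assert len(coder_a) == len(coder_b) and coder_a
    n = len(coder_a)
    # Observed agreement: fraction of items both coders labelled the same.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected chance agreement, using each coder's own label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum((freq_a[lab] / n) * (freq_b[lab] / n)
                   for lab in set(freq_a) | set(freq_b))
    return (observed - expected) / (1 - expected)

# Invented example labels (mention types), not data from the study:
a = ["cite_pub", "instrument", "name_only", "cite_pub"]
b = ["cite_pub", "instrument", "cite_pub", "cite_pub"]
kappa = cohens_kappa(a, b)
```

Kappa of 1 means perfect agreement; 0 means agreement no better than chance, which is why it is preferred over raw percent agreement for coding schemes like this one.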
7. How many mentions?
• 59 articles mentioned software, 31 did not.
• There were 286 distinct mentions of software.
• Those mentions were to 146 distinct pieces of
software.
– This includes general purpose (e.g., Microsoft
Excel) and science-specific software (e.g., DENZO,
BLAST).
8. Types of mentions
• Cite to publication: "… was calculated using biosys (Swofford & Selander 1981)."
• Cite to project name or website: "… using the program Autodecay version 4.0.29 PPC (Eriksson 1998)." Reference list has: "ERIKSSON, T. 1998. Autodecay, vers. 4.0.29. Stockholm: Department of Botany."
• Like instrument: "… calculated by t-test using the Prism 3.0 software (GraphPad Software, San Diego, CA, USA)."
• URL in text: "… freely available from http://www.cibiv.at/software/pda/"
• In-text name mention only: "… were analyzed using MapQTL (4.0) software."
• Not even name mentioned: "… was carried out using software implemented in the Java programming language."
17. Other findings
• Only 24% of journals had policies that
mentioned software, declining across impact-factor strata.
– Policies rarely mention versions.
– It is not clear that they are followed.
• Only 13–30% of packages make a specific
request for a particular form of citation.
– 32% of mentions did not follow the requested citation.
18. Visible citation formats as “nudge”
• Some disagreement about how important the
text of a publication is:
– Should effort focus on machine readable “meta-data”
in publication repositories (not in paper)?
– Or focus on human readable formats in the paper?
• My position is that human-readable formats will
influence practice more quickly
• Formal, well-structured formats and policies act
as a “nudge” to shape how authors mention
software.
19. Software archiving
• Strong finding that many pieces of software were
not findable.
– 1 in 10 packages could not be found at all.
– The specific version used could be found for only
1 in 20 packages (a combination of missing version
info in the paper and missing versions online).
• Analogous to link-rot for URLs in publications
(Koehler, 1999)
• Need to influence how software is archived
– Is that a role for publishers? Escrow for non-open
software?
20. Part 2
But what are we working to incentivize anyway?
21. NCBI BLAST
[Diagram of BLAST variants and forks: WU-BLAST, BLAST+, GPU-BLAST, CUDA-BLAST, AB-BLAST, CS-BLAST, FSA-BLAST, Mac OS X port, Compaq mods, Apple (A/G) BLAST]
Howison, J., & Herbsleb, J. D. (2013). Incentives and
integration in scientific software production. In
Proceedings of the ACM Conference on Computer
Supported Cooperative Work (pp. 459–470). San
Antonio, TX.
22. Citation and collaboration
• What is the impact on collaboration of credit-
giving through citations?
• Can a citation (of any kind) incentivize an
ongoing collaboration able to do the work
needed to keep a piece of software
scientifically functional?
• Could a standard undermine collaboration
further?
23. Can citation incentivize maintenance?
• Software relies on other software
– Dependencies all the way down
– Software stacks change quickly (new opportunities,
new problems, new libraries)
• Scientists seek to extend the work of others, not
just re-execute it.
• Many re-implementations come from frustration
with poorly maintained software
– Software that wasn’t adjusted as its dependencies
changed
– Software that wasn’t updated with newer techniques
24. A modest proposal
1. Papers have their full workflow available.
2. Workflows have regression tests running on a
continuous integration system.
3. The integration system pulls all new versions of
dependencies and executes the regression tests.
4. On failure (build or tests), the paper is retracted.
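Steps 3 and 4 of the proposal can be sketched as a single CI job. A hedged sketch only: the file names, commands, and status labels below are illustrative assumptions, not anything prescribed by the talk.

```python
import subprocess
import sys

def run(cmd):
    """Run a command and return its exit code."""
    return subprocess.run(cmd).returncode

def check_workflow(runner=run):
    """Steps 3-4 of the proposal as one CI job.

    The file names ('requirements.txt', 'regression_tests/') are
    illustrative assumptions about how a paper's workflow might
    be packaged.
    """
    # Step 3: pull all new versions of the workflow's dependencies...
    if runner([sys.executable, "-m", "pip", "install", "--upgrade",
               "-r", "requirements.txt"]) != 0:
        return "provisionally non-extendable (build failed)"
    # ...then re-execute the paper's regression tests against them.
    if runner([sys.executable, "-m", "pytest", "regression_tests/"]) != 0:
        return "provisionally non-extendable (tests failed)"
    # Step 4 fires only on failure; otherwise the paper stays current.
    return "extendable"
```

Injecting the `runner` lets the status logic be exercised without actually touching pip or pytest; in a real CI system the default `run` would execute against the paper's live dependency stack on every dependency release.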
Howison, J. (2014). Retract bit-rotten publications: Aligning incentives for sustaining scientific software.
In Working towards Sustainable Software for Science: Practice and Experiences (SuperComputing 2014
Workshop). New Orleans.
25. Uh …
• Retraction too strong, you say?
Ok, let’s revisit step 4:
• On failure, the paper is marked "provisionally
non-extendable" and authors have a grace period
to fix it before it is marked as "retired".
26. Could others fix papers?
• Why must the original authors be the ones to
fix maintenance issues?
– Attract new resources, motivate integration.
• Re-write Step 4 again:
– On failure, the workflow is marked as "needing work"
– Anyone can contribute that work
• Those extending the work, grad students, citizen
scientists
– Anyone who succeeds is added as an Author
27. Added as an author??!?
• Just for fixing a bug?
Ok, fine. Let’s re-write the second half of step 4
again:
– Anyone maintaining a workflow and returning a
publication to fully extendable status is:
• Added to the paper in an acknowledgement
• Invited to a conference; given a prize
• Credited in a visible, public system (think GitHub
profile)
28. Takeaways
• Software citation is diverse and fails citation's functions:
– "Like instrument" and "cite to publication" mentions
give credit but fail to provide version information
– Other, informal mentions are better at versions but
often fail to give credit
• Software is frequently inaccessible
• Collaboration is counter-motivated by publication
• Bit-rotten papers should create opportunities to
earn reputation for scientific contribution.
There were differences across the sample, with a tendency toward informal mentions in lower-impact-factor journals, but confidence intervals overlap across strata.
Combining our codes to identify different kinds of software (at least for those that we could find). Note that "non-commercial" largely means "written for scientists but not released as open source"; "open source" combines code written for scientists and general-purpose open-source tools.
Wide diversity, but "like instrument" mentions were much more common for proprietary (commercial) software, while "cite to publication" was much more likely for non-commercial and open-source software.
Overall the practices of mentioning code are useful for identifying and mostly for finding software, but useless for anything requiring a version (important in replication and extension). Happily, around 80% of mentions make some effort to credit those responsible for providing the software.
Neither "cite to publication" nor "like instrument" mentions do a good job on versioning: the templates for these types of citations simply don't include this information; authors can include it but aren't driven to do so. Note that the other mention types (informal mentions, like a name in text or a URL in a footnote) work hard to identify the software but are significantly worse at crediting the relevant authors.
"Social proof" and "demonstration effects": there is some evidence for this in the way that different kinds of mentions performed the functions of citation better or worse.