Presentation at NSF Workshop on Software and Data Citation. Draws from our study of how software is visible in scientific publications (JASIST) and our CSCW paper on BLAST innovation integration.
Software Citation and a Proposal (NSF workshop at Harvard Medical School)
1. Software in the scientific literature:
Software mentions and a
provocative proposal
James Howison
Information School
University of Texas at Austin
This material is based upon work supported by the National
Science Foundation under Grant No. SMA-1064209.
@jameshowison
2. What does a citation do, anyway?
• Gives credit for contribution
– A key reward that drives activity in science
– Sits alongside publications, grants, promotions,
and prizes
– Rewards shape the types of artifacts produced and how scientists collaborate
• Explains the method used
– Citations assist in knowing what was done
– Provenance
– Replication and extension
3. How problematic are current
practices?
• How is software mentioned in papers?
• How accessible and reusable is the software
mentioned?
• How well do these mentions perform the
functions of citation?
github.com/jameshowison/softcite
DOI: 10.6084/m9.figshare.1146366
Howison, J., & Bullard, J. (2015). Software in the scientific literature: Problems with
seeing, finding, and using software mentioned in the biology literature. Journal of the
Association for Information Science and Technology (JASIST), doi: 10.1002/asi.23538
4. Sample and Method
• 90 randomly selected articles from the biology
literature, published between 2000 and 2010.
• Journals stratified across Journal Impact
Factor to balance coverage with influence
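The stratified design above can be sketched in Python. A minimal sketch under stated assumptions: the journal names and impact-factor values are invented placeholders, not the study's actual sample.

```python
import random

# Hypothetical (journal, impact factor) pairs -- placeholders only,
# not the study's actual journal list.
journals = [("Journal %d" % i, jif) for i, jif in
            enumerate([0.8, 1.2, 2.5, 3.1, 4.7, 6.0, 9.8, 14.1])]

def stratify_by_quartile(journals):
    """Split journals into four impact-factor strata, lowest to highest.

    Assumes the list length is divisible by 4, for simplicity.
    """
    ranked = sorted(journals, key=lambda j: j[1])
    k = len(ranked) // 4
    return [ranked[i * k:(i + 1) * k] for i in range(4)]

def sample_evenly(strata, per_stratum, rng):
    """Draw the same number of journals from each stratum, so low- and
    high-impact journals are equally represented."""
    return [rng.choice(stratum) for stratum in strata
            for _ in range(per_stratum)]

strata = stratify_by_quartile(journals)
picked = sample_evenly(strata, 2, random.Random(0))
```

Sampling evenly across strata is what "balance coverage with influence" means here: high-impact journals do not dominate the sample, but are guaranteed representation.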
5. Content analysis scheme
Manual content analysis (3 coders; agreement assessed with kappa)
1. Identifying mentions
– Read article, locate a mention of a piece of software
2. Identify in-text characteristics of mention
– Name of software? URL? Date? Version number? In
bibliography? Cite to paper/manual/webpage?
3. Functions of mention
– Identifiable? Findable? Accessible? Source? Match
preferred citation?
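The agreement check mentioned above can be illustrated with a minimal Cohen's kappa (pairwise, two coders) sketch; this is an assumption about the statistic used, not the study's actual computation, and the label names are invented.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa: agreement between two coders, corrected for chance."""
    assert len(coder_a) == len(coder_b) and coder_a
    n = len(coder_a)
    # Observed agreement: fraction of items both coders labelled the same.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected chance agreement, using each coder's own label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum((freq_a[lab] / n) * (freq_b[lab] / n)
                   for lab in set(freq_a) | set(freq_b))
    return (observed - expected) / (1 - expected)

# Invented example labels (mention types), not data from the study:
a = ["cite_pub", "instrument", "name_only", "cite_pub"]
b = ["cite_pub", "instrument", "cite_pub", "cite_pub"]
kappa = cohens_kappa(a, b)
```

Kappa of 1 means perfect agreement; 0 means agreement no better than chance, which is why it is preferred over raw percent agreement for coding schemes like this one.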
7. How many mentions?
• 59 articles mentioned software, 31 did not.
• There were 286 distinct mentions of software.
• Those mentions were to 146 distinct pieces of
software.
– This includes general purpose (e.g., Microsoft
Excel) and science-specific software (e.g., DENZO,
BLAST).
8. Types of mentions
• Cite to publication: "… was calculated using biosys (Swofford & Selander 1981)."
• Cite to project name or website: "… using the program Autodecay version 4.0.29 PPC (Eriksson 1998)." Reference list has: "ERIKSSON, T. 1998. Autodecay, vers. 4.0.29. Stockholm: Department of Botany."
• Like instrument: "… calculated by t-test using the Prism 3.0 software (GraphPad Software, San Diego, CA, USA)."
• URL in text: "… freely available from http://www.cibiv.at/software/pda/"
• In-text name mention only: "… were analyzed using MapQTL (4.0) software."
• Not even name mentioned: "… was carried out using software implemented in the Java programming language."
17. Other findings
• Only 24% of journals had policies that
mentioned software, declining across impact-factor strata.
– Policies rarely mention versions.
– It is not clear that they are followed.
• Only 13–30% of packages make a specific
request for a particular form of citation.
– 32% of mentions did not follow the requested citation.
18. Visible citation formats as “nudge”
• Some disagreement about how important the
text of a publication is:
– Should effort focus on machine readable “meta-data”
in publication repositories (not in paper)?
– Or focus on human readable formats in the paper?
• My position is that human-readable formats will
influence practice more quickly
• Formal, well-structured formats and policies act
as a “nudge” to shape how authors mention
software.
19. Software archiving
• Strong finding that many pieces of software were
not findable.
– 1 in 10 packages could not be found at all.
– The specific version used could be found for only
1 in 20 packages (a combination of missing version
info in the paper and missing versions online).
• Analogous to link-rot for URLs in publications
(Koehler, 1999)
• Need to influence how software is archived
– Is that a role for publishers? Escrow for non-open
software?
20. Part 2
But what are we working to incentivize anyway?
21. NCBI BLAST
[Diagram of BLAST variants and forks: WU-BLAST, BLAST+, GPU-BLAST, CUDA-BLAST, AB-BLAST, CS-BLAST, FSA-BLAST, Mac OS X port, Compaq mods, Apple (A/G) BLAST]
Howison, J., & Herbsleb, J. D. (2013). Incentives and
integration in scientific software production. In
Proceedings of the ACM Conference on Computer
Supported Cooperative Work (pp. 459–470). San
Antonio, TX.
22. Citation and collaboration
• What is the impact on collaboration of credit-
giving through citations?
• Can a citation (of any kind) incentivize an
ongoing collaboration able to do the work
needed to keep a piece of software
scientifically functional?
• Could a standard undermine collaboration
further?
23. Can citation incentivize maintenance?
• Software relies on other software
– Dependencies all the way down
– Software stacks change quickly (new opportunities,
new problems, new libraries)
• Scientists seek to extend the work of others, not
just re-execute it.
• Many re-implementations come from frustration
with poorly maintained software
– Software that wasn’t adjusted as its dependencies
changed
– Software that wasn’t updated with newer techniques
24. A modest proposal
1. Papers have their full workflow available.
2. Workflows have regression tests running on a
continuous integration system.
3. The integration system pulls all new versions of
dependencies and executes the regression tests.
4. On failure (build or tests), the paper is retracted.
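Steps 3 and 4 of the proposal can be sketched as a single CI job. A hedged sketch only: the file names, commands, and status labels below are illustrative assumptions, not anything prescribed by the talk.

```python
import subprocess
import sys

def run(cmd):
    """Run a command and return its exit code."""
    return subprocess.run(cmd).returncode

def check_workflow(runner=run):
    """Steps 3-4 of the proposal as one CI job.

    The file names ('requirements.txt', 'regression_tests/') are
    illustrative assumptions about how a paper's workflow might
    be packaged.
    """
    # Step 3: pull all new versions of the workflow's dependencies...
    if runner([sys.executable, "-m", "pip", "install", "--upgrade",
               "-r", "requirements.txt"]) != 0:
        return "provisionally non-extendable (build failed)"
    # ...then re-execute the paper's regression tests against them.
    if runner([sys.executable, "-m", "pytest", "regression_tests/"]) != 0:
        return "provisionally non-extendable (tests failed)"
    # Step 4 fires only on failure; otherwise the paper stays current.
    return "extendable"
```

Injecting the `runner` lets the status logic be exercised without actually touching pip or pytest; in a real CI system the default `run` would execute against the paper's live dependency stack on every dependency release.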
Howison, J. (2014). Retract bit-rotten publications: Aligning incentives for sustaining scientific software.
In Working towards Sustainable Software for Science: Practice and Experiences (SuperComputing 2014
Workshop). New Orleans.
25. Uh …
• Retraction too strong, you say?
Ok, let’s revisit step 4:
• On failure, the paper is marked "provisionally
non-extendable" and authors have a grace period
to fix it before it is marked as "retired".
26. Could others fix papers?
• Why must the original authors be the ones to
fix maintenance issues?
– Attract new resources, motivate integration.
• Re-write Step 4 again:
– On failure, the workflow is marked as "needing work"
– Anyone can contribute that work
• Those extending the work, grad students, citizen
scientists
– Anyone who succeeds is added as an Author
27. Added as an author??!?
• Just for fixing a bug?
Ok, fine. Let’s re-write the second half of step 4
again:
– Anyone maintaining a workflow and returning a
publication to fully extendable status is:
• Added to the paper in an acknowledgement
• Invited to a conference; given a prize
• Credited in a visible, public system (think GitHub
profile)
28. Takeaways
• Software citation is diverse and fails citation's functions:
– "Like instrument" and "cite to publication" mentions
give credit but fail to provide version information
– Other, informal mentions are better at versions but
often fail to give credit
• Software is frequently inaccessible
• Collaboration is counter-motivated by publication
• Bit-rotten papers should create opportunities to
earn reputation for scientific contribution.
There were differences across the sample, with a tendency toward informal mentions in lower-impact-factor journals, but confidence intervals overlap across strata.
Combining our codes to identify different kinds of software (at least for those that we could find). Note that "non-commercial" largely means "written for scientists but not released as open source"; "open source" combines code written for scientists and general-purpose open-source tools.
Wide diversity, but "like instrument" mentions were much more common for proprietary (commercial) software, while "cite to publication" was much more likely for non-commercial and open-source software.
Overall the practices of mentioning code are useful for identifying and mostly for finding software, but useless for anything requiring a version (important in replication and extension). Happily, around 80% of mentions make some effort to credit those responsible for providing the software.
Neither "cite to publication" nor "like instrument" mentions do a good job on versioning: the templates for these types of citations simply don't include this information; authors can include it but aren't driven to do so. Note that the other mention types (informal mentions, like a name in text or a URL in a footnote) work hard to identify the software but are significantly worse at crediting the relevant authors.
"Social proof" and "demonstration effects": there is some evidence for this in the way that different kinds of mentions performed the functions of citation better or worse.