Software Repositories for Research
-- An environmental scan
Micah Altman
MIT Libraries
Prepared for the Software Preservation Network Forum
Atlanta
August 2016
Disclaimer
These opinions are my own; they are not the opinions of MIT, any of the project funders, nor (with the exception of co-authored, previously published work) my collaborators.
Secondary disclaimer:
“It’s tough to make predictions, especially about the future!”
-- Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill, Confucius, Disreali [sic], Freeman Dyson, Cecil B. DeMille, Albert Einstein, Enrico Fermi, Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx, Dan Quayle, George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr L. White, etc.
Related Publications
• Altman, M., and S. Jackman. “Nineteen Ways of Looking at Statistical Software.” Journal of Statistical Software 42 (2011).
• Altman, Micah, and Gary King. “A Proposed Standard for the Scholarly Citation of Quantitative Data.” D-Lib Magazine 13, no. 3 (2007).
• Altman, M., J. Gill, and M.P. McDonald. 2004. Numerical Issues in Statistical Computing for the Social Scientist. John Wiley & Sons.
Reprints available from: informatics.mit.edu
Today’s Perspectives
* Methods *
* Measures *
* Merit *
Methods
Literature Review
Data Curation, Publication and Citation
Software significant properties, use cases
Software repositories
Software & scientific reproducibility
Web Research - Practice
Review of research repositories
Sources: OpenDOAR, re3data, SHERPA/JULIET
Goals: Estimate prevalence of repositories that accept research software; identify exemplar repositories; characterize feature sets by repository category
Methods: term-based queries; descriptive statistics; stratified content case studies (a query sketch follows below)
Review of Software Directories
Sources: OpenHub, OSDir, DMOZ
Goals: Identify additional software repositories used in research
Methods: qualitative text analysis; descriptive statistics
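For concreteness, the kind of term-based query used to estimate prevalence can be sketched in a few lines of Python. This is a hypothetical sketch only: it assumes re3data's public API endpoints (/api/v1/repositories and /api/v1/repository/{id}) and that software-accepting repositories declare a content type containing the word "software"; the element names and the sampling cutoff are illustrative, not the scan actually performed for this study.

# Hypothetical sketch: estimate how many re3data-registered repositories accept software.
# Element and attribute names are illustrative and should be checked against the live API.
import urllib.request
import xml.etree.ElementTree as ET

BASE = "https://www.re3data.org/api/v1"

def fetch_xml(url):
    with urllib.request.urlopen(url) as resp:
        return ET.fromstring(resp.read())

# 1. List all registered repositories (one <repository> entry each, with an <id>).
listing = fetch_xml(BASE + "/repositories")
repo_ids = [node.findtext("id") for node in listing.findall(".//repository") if node.findtext("id")]

# 2. For each repository, fetch its detail record and inspect the declared content types.
software_repos = []
for repo_id in repo_ids[:50]:  # sample for illustration; drop the slice for a full scan
    detail = fetch_xml(f"{BASE}/repository/{repo_id}")
    content_types = [el.text for el in detail.iter() if el.tag.endswith("contentType")]
    if any(ct and "software" in ct.lower() for ct in content_types):
        software_repos.append(repo_id)

# 3. Descriptive statistic: share of sampled repositories that list software content.
print(f"{len(software_repos)} of {len(repo_ids[:50])} sampled repositories list software content")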
Web Research - Policies
Review of funder policies
Sources: ROARMAP; US Federal Agency Websites
Goals: Estimate prevalence of funder policies on software curation; identify exemplar policies; identify recommended repositories
Methods: qualitative text analysis; descriptive statistics (a term-scan sketch follows below)
Review of Journal Policies
Sources: Google Scholar, WoS, DOAJ, Software Sustainability Institute Index
Goals: Estimate prevalence of journals that publish research software; estimate prevalence of software policies at journals; identify exemplar policies; identify recommended repositories
Methods: qualitative text analysis; descriptive statistics
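A first pass of the term-based part of this analysis might look like the sketch below — a minimal, hypothetical example assuming the policy documents have been saved locally as plain-text files in a policies/ directory. The directory layout and the term list are illustrative assumptions, not the study's actual coding protocol.

# Hypothetical sketch of the term-scanning step behind the descriptive statistics:
# count how many locally saved policy documents mention software-curation terms.
import pathlib
import re
from collections import Counter

TERMS = ["software", "source code", "code sharing", "software citation", "software preservation"]

def terms_present(text, terms=TERMS):
    """Return the subset of terms that appear (case-insensitively) in a document."""
    return {t for t in terms if re.search(re.escape(t), text, re.IGNORECASE)}

counts = Counter()
policy_files = sorted(pathlib.Path("policies").glob("*.txt"))  # one plain-text file per policy
for path in policy_files:
    for term in terms_present(path.read_text(errors="ignore")):
        counts[term] += 1

print(f"Documents scanned: {len(policy_files)}")
for term, n in counts.most_common():
    print(f"  {term!r}: mentioned in {n} documents")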
Measures
Typical Prevalence of Software Repositories
Some Exemplars and Promising Initiatives
• Citation and publisher policies:
- FORCE11 Software Citation Principles: www.force11.org/software-citation-principles
- ACM New Publication Policies on Software Reproducibility and Contributorship: www.acm.org/publications/policies
- PLOS materials and software sharing: http://journals.plos.org/plosone/s/materials-and-software-sharing
• Long Term Access:
- www.softwareheritage.org
- guides.github.com/activities/citable-code/ (DOI resolution sketched below)
- archive.org/details/softwarelibrary
• Software Journals:
- www.journals.elsevier.com/softwarex/
- www.jstatsoft.org/
- http://openresearchsoftware.metajnl.com/
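One reason the citable-code workflow linked above matters is that an archived software release receives a DOI that resolves to machine-readable citation metadata. The sketch below is a hypothetical illustration using standard DOI content negotiation (requesting CSL JSON from doi.org); the DOI shown is a placeholder, not a real software deposit.

# Hypothetical sketch: resolve a software DOI to citation metadata via content negotiation.
import json
import urllib.request

def fetch_citation_metadata(doi):
    """Resolve a DOI to CSL-JSON citation metadata via doi.org content negotiation."""
    req = urllib.request.Request(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

meta = fetch_citation_metadata("10.5281/zenodo.0000000")  # placeholder DOI, for illustration only
print(meta.get("title"), meta.get("issued"), meta.get("DOI"))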
Use Cases and Motivating Value
Historic / cultural:
- historical scholarship
- “intrinsic value”
Replication and reproducibility:
- check claims made in research
- reduced deliberate research fraud
- check reliability (robustness) of results
- check validity (accuracy)
Reuse:
- efficiency
- increase speed of development
- standards compliance
- apply methodology to a different corpus
- increased quality and dependability
Render other digital objects:
- renders other objects meaningful
- see digital preservation use cases
Legal:
- record of licensing, ownership, copyright
- manage legal risks/accountability
- compliance with laws/funding mandates
- reduce barriers to long-term access for other historic use, replication, reuse, rendering
Citation and attribution:
- track individual academic careers
- track software development/history
- track institutional outputs
- track funder outputs
Repository Affordances
(Rows: stakeholder roles. Columns: Authoring/Development, Discovery/Access, Collection, Preservation, Legal.)
creator
- Authoring/Development: language-specific authoring tools; build environment integration; versioning; documentation; project management; collaboration
- Discovery/Access: attribution
- Preservation: backups; commitment to long-term access
- Legal: access control; license templating
curator
- Authoring/Development: project management; license templates; monitoring; collaboration
- Discovery/Access: browsing; searching; persistent identifiers; version IDs
- Collection: collection policy; peer review; selection; annotation; metadata
- Preservation: preservation policy; documentation; format management
- Legal: access control; license standardization; legal guidance
institution
- Authoring/Development: author and funder identifiers; metrics
- Discovery/Access: author and funder identifiers; metrics
- Collection: author and funder identifiers; metrics; compliance; attribution
- Preservation: preservation policy; preservation replication; auditability; certification
- Legal: license standardization; privacy management
end-user
- Discovery/Access: browsing; searching; search engine integration; persistent identifiers
- Collection: selection criteria; annotation; quality measures
- Preservation: documentation
- Legal: open licensing; license discoverability
Merit
State of Software Curation
1. No comprehensive indices of software archives.
2. Orders of magnitude fewer software archives than data archives.
(Corollary: institutional repositories offer little functionality for software archiving, even when nominally supported.)
3. A very small proportion of funders have policies addressing software curation.
4. There is little available advice for researchers who wish to curate, cite, and preserve software.
5. Substantial reproducibility failures related to software continue to be reported.
Contrast with Data Curation -- Lack of Progress
• Compliance
– Funders: data management plans, open data
– Publishers: data access/archiving/citation
• Norms & practices
– Joint data citation principles
– Recognition of data in funder biosketches
– Increased recognition of reproducibility gaps
– Increased recognition of open data/open science
• Technical infrastructure
– Data repository directories
– Data citation indices
– ORCID researcher identifier and registry
• Recognition
– Data citation indices
– Virtual branded archives
– High-profile data publications
Summing it all up…
Software curation looks a lot like data curation a decade ago…
“How much slower would scientific progress be if the near-universal standards for scholarly citation of articles and books had never been developed? Suppose shortly after publication only some printed works could be reliably found by other scholars; or if researchers were only permitted to read an article if they first committed not to criticize it, or were required to coauthor with the original author any work that built on the original. How many discoveries would never have been made if the titles of books and articles in libraries changed unpredictably, with no link back to the old title; if printed works existed in different libraries under different titles; if researchers routinely redistributed modified versions of other authors' works without changing the title or author listed; or if publishing new editions of books meant that earlier editions were destroyed? …
“Unfortunately, no such universal standards exist for citing quantitative data [software], and so all the problems listed above exist now. Practices vary from field to field, archive to archive, and often from article to article. The data [software] cited may no longer exist, may not be available publicly, or may have never been held by anyone but the investigator. Data [software] listed as available from the author are unlikely to be available for long and will not be available after the author retires or dies. … Data [software] are sometimes listed in the bibliography, sometimes in the text, sometimes not at all, and rarely with enough information to guarantee future access to the identical data set. Replicating published tables and figures, even without having to rerun the original experiment, is often difficult or impossible.”
-- Altman & King 2007 (substituting “software” for “data”)
Questions?
Web: informatics.mit.edu
Email: escience@mit.edu
