Scientific reproducibility is most often viewed through a methodological or statistical lens and, increasingly, through a computational lens. Over the last several years, I've taken part in collaborations that approach reproducibility from the perspective of informatics: as a flow of information across a lifecycle that spans collection, analysis, publication, and reuse.
These slides sketch this approach. They were presented at a recent workshop on reproducibility at the National Academy of Sciences and at one of our Program on Information Science brown bag talks. See: informatics.mit.edu
1. Prepared for
Program on Information Science – Brown Bag Talks
MIT
March 2015
Modeling Reproducibility from an
Informatics Perspective
Dr. Micah Altman
<escience@mit.edu>
Director of Research, MIT Libraries
Head/Scientist, Program on Information Science
<informatics.mit.edu>
2. DISCLAIMER
These opinions are my own; they are not the opinions
of MIT, Brookings, any of the project funders, or (with
the exception of co-authored previously published
work) my collaborators.
Secondary disclaimer:
“It’s tough to make predictions, especially about the
future!”
-- Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill,
Confucius, Disreali [sic], Freeman Dyson, Cecil B. Demille, Albert Einstein, Enrico
Fermi, Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx, Dan
Quayle, George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr
L. White, etc.
3. Collaborators & Co-Conspirators
• Kobbi Nissim, Michael Bar-Sinai, Salil Vadhan
& the Privacy Tools for Research Data Project
<http://privacytools.seas.harvard.edu/>
• Jeff Gill
• Michael P. McDonald
Research Support
Sloan Foundation
National Science Foundation (Award #1237235)
4. Related Work
• Allen, Liz, et al. "Credit where credit is due." Nature 508.7496
(2014): 312-313.
• Altman, M., & Crosas, M. (2013). The evolution of data
citation: From principles to implementation. IASSIST
Quarterly, 37.
• Garnett, A., Altman, M., Andreev, L., Barbarosa, S., Castro, E.,
Crosas, M., ... & Yang, X. (2013, May). Linking OJS and
Dataverse. In PKP Scholarly Publishing Conference 2013.
• Altman, M., Fox, J., Jackman, S., & Zeileis, A. (2011). An
Introduction to the Special Volume on "Political Methodology". Journal of
Statistical Software, 42(i01).
• Altman, M. (2008). A fingerprint method for scientific data
verification. In Advances in Computer and Information
Sciences and Engineering (pp. 311-316). Springer
Netherlands.
• Altman, M., & King, G. (2007). A proposed standard for the
scholarly citation of quantitative data. D-lib Magazine,
13(3/4).
• Altman, Micah, Jeff Gill, and Michael P. McDonald. (2004).
Numerical issues in statistical computing for the social
scientist. John Wiley & Sons.
• Altman, M., & McDonald, M. P. (2003). Replication with
attention to numerical accuracy. Political Analysis, 11(3),
302-307.
• Altman, Micah. "A review of JMP 4.03 with special attention
to its numerical accuracy." The American Statistician 56.1
(2002): 72-75.
• Altman, M., & McDonald, M. P. (2001). Choosing reliable
statistical software. Political Science & Politics, 34(03), 681-
687.
• Altman, M., Andreev, L., Diggory, M., King, G., Kolster, E.,
Sone, A., ... & Krot, M. (2001, January). Overview of the
virtual data center project and software. In Proceedings of
the JCDL 2001 (pp. 203-204). ACM.
5. Roadmap for this Talk
Reproducibility Concerns…
Modeling Reproducible Research from an
Information Perspective
How can informatics improve reproducibility?
8. What Goes in the File Drawer?
Maximizing the Impact of Research through
Research Data Management
[Image: Daniel Shechtman's lab notebook, providing initial evidence of quasicrystals]
• Null results are less likely to be published, so published results as a whole are biased toward positive findings
• Outliers are routinely discarded, so unexpected patterns of evidence across studies remain hidden
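The first bullet can be illustrated with a small simulation. This is a minimal sketch, not part of the original talk: the study design (30 observations per study, a true effect of zero, a crude normal-approximation test at about the 5% level) and the function names `simulate_study` and `file_drawer` are assumptions chosen for illustration.

```python
import random
import statistics


def simulate_study(rng, true_effect=0.0, n=30):
    """Return (estimated effect, is_significant) for one simulated study."""
    sample = [rng.gauss(true_effect, 1.0) for _ in range(n)]
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / n ** 0.5
    # Crude two-sided test at roughly the 5% level (normal approximation).
    return mean, abs(mean / std_err) > 1.96


def file_drawer(num_studies=2000, seed=0):
    """Compare effect sizes across all studies with those that get 'published'."""
    rng = random.Random(seed)
    studies = [simulate_study(rng) for _ in range(num_studies)]
    all_sizes = [abs(effect) for effect, _ in studies]
    # Hypothetical publication rule: only significant results leave the drawer.
    published = [abs(effect) for effect, significant in studies if significant]
    return statistics.fmean(all_sizes), statistics.fmean(published), len(published)


all_mean, published_mean, num_published = file_drawer()
print(f"mean |effect|, all studies:    {all_mean:.3f}")
print(f"mean |effect|, published only: {published_mean:.3f}")
print(f"published {num_published} of 2000 studies")
```

Although the true effect is zero in every simulated study, the studies that pass the significance filter report larger effects on average than the full set, which is the bias the slide describes.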
9. Replicability of Published Results
• Many journals have no replication policy
• Even in journals with a clear policy, the success rate is low
10. Many Initiatives to Improve Scientific Reliability
• Retraction monitoring
• Data citation
• Clinical trial preregistration
• Registered replication
• Open data
• Badges
12. Framing Reproducibility from an
Informatics Perspective
Reproducibility claims are not formulated as direct claims about the world…
1. What claims about information are implied by reproducibility claims/issues?*
2. What properties of information and information flow are related to those claims?
3. What would possible changes to information processing and flow yield? (And how much would they cost?)
13. Some Types of Reproducibility Issues/Use Cases
Common labels for reproducibility problems, with example interventions:
• Common labels: Misconduct, Bit Rot, Author Responsibility
  Example interventions: Discipline/community data archives; NIH genomic data sharing policy; RetractionWatch; Collaborative Data Collection Projects
• Common labels: Misconduct, Negligence, Confusion, Typo, Proofreader error*, Dynamic Data Problem, Versioning problem
  Example interventions: Dat, DataHub, DataVerse (versioning)
• Common labels: Misconduct, Negligence, Harmless Error
  Example interventions: S/Weave; Compendia; Vistrails
• Common labels: Reproducibility [NSF; Donoho 1995]; Replicability [King 1995, many journals]
  Example interventions: Journal replication data & code archives; Virtual Machine archiving
• Common labels: Replication [NSF]; Reproducibility [King 1995]; Independent Replication
  Example interventions: Protocol Archive, Journal of Visualized Experiments
• Common labels: Result Validation, Fact Checking
  Example interventions: Data Citation Standards
• Common labels: Calibration, Extension, Reuse
  Example interventions: Data Archives
• Common labels: File-Drawer Problem
  Example interventions: APS Preregistration Badge, Journal of Null Results
• Common labels: Underreporting (Adverse Events); Data Dredging (Multiple Comparisons)
  Example interventions: Clinical Trial Preregistration
• Common labels: Data Dredging: Multiple Comparisons; P-Hacking
  Example interventions: Holdout Data Escrow
• Common labels: Sensitivity, Robustness
  Example interventions: Sensitivity Analysis
• Common labels: Reliability
  Example interventions: Meta-analysis; Cochrane Review; Data Integration
• Common labels: Generalizability
  Example interventions: Cochrane Review
14. More Operational Reproducibility Claims
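Common labels, with example interventions:
• Common labels: File-Drawer Problem
  Example interventions: APS Preregistration Badge, Journal of Null Results
• Common labels: Underreporting (Adverse Events); Data Dredging (Multiple Comparisons)
  Example interventions: Clinical Trial Preregistration
• Common labels: Data Dredging: Multiple Comparisons; P-Hacking
  Example interventions: Holdout Data Escrow
• Common labels: Sensitivity, Robustness
  Example interventions: Sensitivity Analysis
• Common labels: Reliability
  Example interventions: Meta-analysis; Cochrane Review
• Common labels: Generalizability
  Example interventions: Cochrane Review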
15. My Model of The World,
ca. Early Grad School
Scholarly Communications in the age of Big
Data
[Diagram labels: Parameters λ, β]
16. My Model of The World,
as a PostDoc in quantitative social science
[Diagram labels: Laws (structures); (generates); Parameters λ, β; Super Population; Target Population; Frame; Selection]
20. Documents*
(compendia, fairy tales)
Example statements:
• ‘We applied a general linear model’
• ‘We conjecture kids will choose candy’
• ‘δ = 2.3 * √Ω’
• ‘Chewing gum tastes great’ (Altman, et al. 2013)
Types of content:
• Assertions about other entities
• Logical Claims: Theorem 1 …. Lemma 1.1
• Speculations, Commentary: Thanks to my dog for his support…
• References, Citation: U49845.1 GI:1293613; doi:10.1002/0470841559.ch1; orcid:0000-0001-7382-6960
• Internal Meta-Information: Title: XXXX
21. People (their Relationships & Action)
• Identity: Who is the actor?
• Relationship (or action): What did the actor do, or how are they related?
24. Some Types of Reproducibility Issues/Use Cases
Common labels, the reproducibility-related issue behind each, and example interventions:
• Common labels: Misconduct, Bit Rot, Author Responsibility
  Reproducibility-related issue: Data was fabricated, corrupted, or radically misinterpreted prior to analysis
  Example interventions: Discipline/community data archives; NIH genomic data sharing policy; RetractionWatch; Collaborative Data Collection Projects
• Common labels: Misconduct, Negligence, Confusion, Typo, Proofreader error*, Dynamic Data Problem, Versioning problem
  Reproducibility-related issue: Data {referenced by identifier | provided as an instance | described by method} has a nontrivial set of semantic differences from that used as input to the publication
  Example interventions: Dat, DataHub, DataVerse (versioning)
• Common labels: Misconduct, Negligence, Harmless Error
  Reproducibility-related issue: Published analysis algorithm does not correspond to implemented analysis
  Example interventions: S/Weave; Compendia; Vistrails
• Common labels: Reproducibility [NSF; Donoho 1995]; Replicability [King 1995, many journals]
  Reproducibility-related issue: Variance of estimates given data instance & analysis implementation
  Example interventions: Journal replication data & code archives; Virtual Machine archiving
• Common labels: Replication [NSF]; Reproducibility [King 1995]; Independent Replication
  Reproducibility-related issue: Variance of estimates given method algorithm and analysis algorithm
  Example interventions: Protocol Archive, Journal of Visualized Experiments
• Common labels: Result Validation, Fact Checking
  Reproducibility-related issue: Variance of estimates given data identifier & analysis algorithm
  Example interventions: Data Citation Standards
• Common labels: Calibration, Extension, Reuse
  Reproducibility-related issue: Produce new analysis given data identifier
  Example interventions: Data Archives
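The issue labeled "variance of estimates given data instance & analysis implementation" can be made concrete with a numerical example. The sketch below is illustrative only and is not taken from the talk: it computes the same sample variance with two implementations of the same algorithm, on a small made-up data set with a large offset, assuming ordinary IEEE-754 double-precision floats. The function names are hypothetical.

```python
def variance_one_pass(values):
    """Textbook 'sum of squares' formula; prone to catastrophic cancellation."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in values:
        n += 1
        total += x
        total_sq += x * x
    return (total_sq - total * total / n) / (n - 1)


def variance_two_pass(values):
    """Compute the mean first, then sum squared deviations from it."""
    n = len(values)
    total = 0.0
    for x in values:
        total += x
    mean = total / n
    squared_deviations = 0.0
    for x in values:
        squared_deviations += (x - mean) ** 2
    return squared_deviations / (n - 1)


# Illustrative data: the values 4, 7, 13, 16 shifted by one billion.
# The exact sample variance is 30, regardless of the shift.
data = [1e9 + 4.0, 1e9 + 7.0, 1e9 + 13.0, 1e9 + 16.0]

print("one-pass implementation:", variance_one_pass(data))
print("two-pass implementation:", variance_two_pass(data))
```

Both functions implement the same published formula for sample variance, yet only the two-pass version recovers the correct value on this input; the one-pass version loses the answer to rounding. Archiving the data and a description of the method is therefore not always enough to reproduce an estimate; the implementation matters as well.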
25. More Operational Reproducibility Claims
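Common labels, the reproducibility-related issue behind each, and example interventions:
• Common labels: File-Drawer Problem
  Reproducibility-related issue: Publisher bias toward significant (or expected) results
  Example interventions: APS Preregistration Badge, Journal of Null Results
• Common labels: Underreporting (Adverse Events); Data Dredging (Multiple Comparisons)
  Reproducibility-related issue: Author bias toward publishing favored outcomes
  Example interventions: Clinical Trial Preregistration
• Common labels: Data Dredging: Multiple Comparisons; P-Hacking
  Reproducibility-related issue: Author bias toward creating significant results, resulting in a difference between stated method/analysis and actual (complete) method/analysis
  Example interventions: Holdout Data Escrow
• Common labels: Sensitivity, Robustness
  Reproducibility-related issue: Variance of support for claims across specification change
  Example interventions: Sensitivity Analysis
• Common labels: Reliability
  Reproducibility-related issue: Variance of support for claims across repeated measures, samples
  Example interventions: Meta-analysis; Cochrane Review; Data Integration
• Common labels: Generalizability
  Reproducibility-related issue: Variance of support for claims across different frames
  Example interventions: Cochrane Review
• Common labels: Laws, Truth
  Reproducibility-related issue: Variance of support for claims to other populations
  Example interventions: Grand Challenge ?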
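As one way to read "variance of support for claims across specification change", the sketch below re-estimates a quantity under several analysis specifications and checks whether a claim survives each one. Everything in it is an assumption made for illustration: the data values, the set of specifications, the claim threshold of 2.5, and the helper names `trimmed_mean` and `sensitivity`.

```python
import statistics


def trimmed_mean(values, fraction):
    """Mean after dropping the given fraction of values from each tail."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return statistics.fmean(kept)


# Hypothetical alternative specifications of "the typical value".
SPECIFICATIONS = {
    "mean": statistics.fmean,
    "median": statistics.median,
    "trimmed mean (10%)": lambda v: trimmed_mean(v, 0.10),
    "trimmed mean (20%)": lambda v: trimmed_mean(v, 0.20),
}


def sensitivity(values, threshold):
    """Estimate under each specification and test the claim 'estimate > threshold'."""
    estimates = {name: spec(values) for name, spec in SPECIFICATIONS.items()}
    support = {name: estimate > threshold for name, estimate in estimates.items()}
    spread = max(estimates.values()) - min(estimates.values())
    return estimates, support, spread


# Made-up measurements containing a single outlier.
data = [2.1, 1.9, 2.4, 2.0, 2.2, 1.8, 2.3, 2.0, 2.1, 9.5]
estimates, support, spread = sensitivity(data, threshold=2.5)

for name, estimate in estimates.items():
    print(f"{name:20s} {estimate:.3f}  claim supported: {support[name]}")
print(f"spread across specifications: {spread:.3f}")
```

In this toy case the claim holds only under the untrimmed mean, so support for it depends on a specification choice that a reader of the published result might never see.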
26. Operational Reproducibility Claims
Reproducibility-related issue and related informatics claims:
Label: Validation, Fact Checking
Reproducibility issue: Variance of estimates given data identifier & analysis algorithm
Reproducibility claim: Variance of estimates given data identifier & analysis algorithm is known & correctly represented.
Use case: Post-publication reviewer wants to establish that published claims correspond to analysis method performed…
Potential supporting informational claims:
1. Instance of data retrieved via identifier is semantically equivalent to instance of data used to support published claim
2. Analysis algorithm is robust to choice of reasonable alternative implementation
3. Implementation of algorithm is robust to reasonable choice of execution details and context
4. Published direct claims about data are semantically equivalent to subset of claims produced by authors' previous application of analysis
5. …
Potential information systems properties supporting claims:
1a. Detailed provenance history for data from collection through analysis and deposition
1b. Automatic replication of direct data claims from deposited source
1c. Cryptographic evidence (e.g. cryptographically signed {analysis output including cryptographic hash of data} & {cryptographic hash of data retrieved via identifier})
…
2a. Standard implementation, subject to community review
2b. Report of results of application of implementation on standard testbed
2c. Availability of implementation for inspection
….
3. …
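Property 1c can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method used in any of the cited systems: the normalization in `canonical_bytes` (floats rendered to a fixed number of significant digits) is a simplified stand-in for a real semantic fingerprint, an HMAC with a shared key stands in for a proper public-key signature, and all function names and sample values are hypothetical.

```python
import hashlib
import hmac


def canonical_bytes(rows, digits=7):
    """Serialize rows in a normalized form so formatting differences vanish."""
    lines = []
    for row in rows:
        cells = []
        for value in row:
            if isinstance(value, float):
                # Fixed significant digits: 0.1 + 0.2 and 0.3 serialize alike.
                cells.append(format(value, f".{digits}e"))
            else:
                cells.append(str(value).strip())
        lines.append("\t".join(cells))
    return "\n".join(lines).encode("utf-8")


def fingerprint(rows):
    """SHA-256 hash of the normalized data."""
    return hashlib.sha256(canonical_bytes(rows)).hexdigest()


def sign(payload, key):
    """HMAC-SHA256 tag; a stand-in for a real digital signature."""
    return hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()


def make_record(analysis_output, rows, key):
    """Bind the analysis output to the hash of the data it was computed from."""
    data_hash = fingerprint(rows)
    payload = f"{analysis_output}\n{data_hash}"
    return {"output": analysis_output, "data_hash": data_hash,
            "signature": sign(payload, key)}


def check_record(record, retrieved_rows, key):
    """True if the record is untampered and matches the retrieved data."""
    payload = f"{record['output']}\n{record['data_hash']}"
    signature_ok = hmac.compare_digest(record["signature"], sign(payload, key))
    data_ok = hmac.compare_digest(record["data_hash"], fingerprint(retrieved_rows))
    return signature_ok and data_ok


key = b"illustrative-shared-key"
analysis_rows = [("a", 0.1 + 0.2), ("b", 2.5)]
record = make_record("estimate = 1.4", analysis_rows, key)

retrieved_rows = [("a", 0.3), ("b", 2.50)]   # same data, different representation
tampered_rows = [("a", 0.31), ("b", 2.5)]    # semantically different data

print(check_record(record, retrieved_rows, key))
print(check_record(record, tampered_rows, key))
```

Hashing a normalized representation rather than the raw bytes is what lets the check speak to semantic equivalence (claim 1 above) instead of mere byte identity; how much normalization is appropriate is a design decision this sketch does not settle.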
27. Conjectures: How Could Informatics Improve Reproducibility
Formal Properties
(Some formal properties on information flow and management tend to support reproducibility related inferences…)
• Transparency
• Auditability
• Provenance
• Fixity
• Identification
• Durability
• Integrity
• Repeatability
• Self-documentation
• Non-repudiation
Properties applied to different stages, entities, and to components of the information system itself
Systems Property*
(How does the system interact with users, and what incentives and culture does it engender?)
• Barriers to entry
• Ease of use
• Support for intellectual communities
• Speed and performance
• Security
• Access control
• Personalization
• Credit and attribution
• Incent well-founded trust among actors
• Disincent “glamour & deceit”
Systems Oriented
(How does the system integrate into the research ecosystem?)
• Sustainability
• Cost
• Incent well-founded trust in system and outputs
28. Discussion
– How can we better support reproducibility with information infrastructure?*
• How can we better identify the inferential claims implied by a specific set of (non)reproducibility claims/issues?
• Which information flows and systems are most closely associated with these inferential claims?
• Which properties of information systems support generating these inferential claims?
29. Additional References
• de Waard, A. (2010). The story of science: a syntagmatic/paradigmatic
analysis of scientific text. In Proceedings of the AMICUS Workshop (pp. 36-41).
• Gentleman, R., & Lang, D. T. (2007). Statistical analyses and reproducible
research. Journal of Computational and Graphical Statistics, 16(1).
• Freire, Juliana. "Making computations and publications reproducible with
vistrails." Computing in Science & Engineering 14.4 (2012): 18-25.
• Kevles, Daniel J. The Baltimore case: A trial of politics, science, and character.
WW Norton & Company, 2000.
• King, G. (1995). Replication, replication. PS: Political Science & Politics,
28(03), 444-452.
• McCullough, B. D. (2009). Open access economics journals and the market
for reproducible economic research. Economic Analysis and Policy, 39(1),
117-126.