Prepared for
Program on Information Science – Brown Bag Talks
MIT
March 2015
Modeling Reproducibility from an
Informatics Perspective
Dr. Micah Altman
<escience@mit.edu>
Director of Research, MIT Libraries
Head/Scientist, Program on Information Sciences
<informatics.mit.edu>
DISCLAIMER
These opinions are my own; they are not the opinions
of MIT, Brookings, any of the project funders, nor (with
the exception of co-authored previously published
work) my collaborators.
Secondary disclaimer:
“It’s tough to make predictions, especially about the
future!”
-- Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill,
Confucius, Disreali [sic], Freeman Dyson, Cecil B. Demille, Albert Einstein, Enrico
Fermi, Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx, Dan
Quayle, George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr
L. White, etc.
Collaborators & Co-Conspirators
• Kobbi Nissim, Michael Bar-Sinai, Salil Vadhan
& the Privacy Tools for Research Data Project
<http://privacytools.seas.harvard.edu/>
• Jeff Gill
• Michael P. McDonald
Research Support
Sloan Foundation
National Science Foundation (Award #1237235)
Related Work
• Allen, Liz, et al. "Credit where credit is due." Nature 508.7496
(2014): 312-313.
• Altman, M., & Crosas, M. (2013). The evolution of data
citation: From principles to implementation. IASSIST
Quarterly, 37.
• Garnett, A., Altman, M., Andreev, L., Barbarosa, S., Castro, E.,
Crosas, M., ... & Yang, X. (2013, May). Linking OJS and
Dataverse. In PKP Scholarly Publishing Conference 2013.
• Altman, M., Fox, J., Jackman, S., & Zeileis, A. (2011). An
Introduction to the Special Volume on "Political Methodology".
Journal of Statistical Software, 42(i01).
• Altman, M. (2008). A fingerprint method for scientific data
verification. In Advances in Computer and Information
Sciences and Engineering (pp. 311-316). Springer
Netherlands.
• Altman, M., & King, G. (2007). A proposed standard for the
scholarly citation of quantitative data. D-lib Magazine,
13(3/4).
• Altman, Micah, Jeff Gill, and Michael P. McDonald. (2004).
Numerical issues in statistical computing for the social
scientist. John Wiley & Sons.
• Altman, M., & McDonald, M. P. (2003). Replication with
attention to numerical accuracy. Political Analysis, 11(3),
302-307.
• Altman, Micah. "A review of JMP 4.03 with special attention
to its numerical accuracy." The American Statistician 56.1
(2002): 72-75.
• Altman, M., & McDonald, M. P. (2001). Choosing reliable
statistical software. Political Science & Politics, 34(03), 681-
687.
• Altman, M., Andreev, L., Diggory, M., King, G., Kolster, E.,
Sone, A., ... & Krot, M. (2001, January). Overview of the
virtual data center project and software. In Proceedings of
the JCDL 2001 (pp. 203-204). ACM.
Roadmap for this Talk
Reproducibility Concerns…
Modeling Reproducible Research from an
Information Perspective
How can informatics improve reproducibility?
Information
Perspective
Increased Retractions, Allegations of
Fraud
Maximizing the Impact of Research through
Research Data Management
What Goes in the File Drawer?
Daniel Shechtman’s lab notebook providing initial evidence of quasicrystals
• Null results are less likely to be published → published results as a whole are biased toward positive findings
• Outliers are routinely discarded → unexpected patterns of evidence across studies remain hidden
Replicability of Published Results
• Many journals have no replication policy
• Even in journals with clear policy, success rate is low
Many Initiatives to Improve Scientific Reliability
• Retraction monitoring
• Data citation
• Clinical trial preregistration
• Registered replication
• Open data
• Badges
Reproducibility
Concerns
Framing Reproducibility from an
Informatics Perspective
Reproducibility claims are not formulated as
direct claims about the world…
1. What claims about information are implied by
reproducibility claims/issues?*
2. What properties of information and information
flow are related to those claims?
3. What would possible changes to information
processing and flow yield?
(And how much would they cost?)
Some Types of Reproducibility Issues/Use Cases
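Common labels for reproducibility problems: Misconduct, Bit Rot, Author Responsibility
Example interventions: Discipline/community data archives; NIH genomic data sharing policy; RetractionWatch; Collaborative Data Collection Projects

Common labels for reproducibility problems: Misconduct, Negligence, Confusion, Typo, Proofreader error*, Dynamic Data Problem, Versioning problem
Example interventions: Dat, DataHub, DataVerse (versioning)

Common labels for reproducibility problems: Misconduct, Negligence, Harmless Error
Example interventions: S/Weave; Compendia; Vistrails

Common labels for reproducibility problems: Reproducibility [NSF; Donoho 1995]; Replicability [King 1995, many journals]
Example interventions: Journal replication data & code archives; Virtual Machine archiving

Common labels for reproducibility problems: Replication [NSF]; Reproducibility [King 1995]; Independent Replication
Example interventions: Protocol Archive, Journal of Visualized Experiments

Common labels for reproducibility problems: Result Validation, Fact Checking
Example interventions: Data Citation Standards

Common labels for reproducibility problems: Calibration, Extension, Reuse
Example interventions: Data Archives

Common labels for reproducibility problems: File-Drawer Problem
Example interventions: APS Preregistration Badge, Journal of Null Results

Common labels for reproducibility problems: Underreporting (Adverse Events); Data Dredging (Multiple Comparisons)
Example interventions: Clinical Trial Preregistration

Common labels for reproducibility problems: Data Dredging: Multiple Comparisons; P-Hacking
Example interventions: Holdout Data Escrow

Common labels for reproducibility problems: Sensitivity, Robustness
Example interventions: Sensitivity Analysis

Common labels for reproducibility problems: Reliability
Example interventions: Meta-analysis; Cochrane Review; Data Integration

Common labels for reproducibility problems: Generalizability
Example interventions: Cochrane Review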
More Operational Reproducibility Claims
Common labels: File-Drawer Problem
Example interventions: APS Preregistration Badge, Journal of Null Results

Common labels: Underreporting (Adverse Events); Data Dredging (Multiple Comparisons)
Example interventions: Clinical Trial Preregistration

Common labels: Data Dredging: Multiple Comparisons; P-Hacking
Example interventions: Holdout Data Escrow

Common labels: Sensitivity, Robustness
Example interventions: Sensitivity Analysis

Common labels: Reliability
Example interventions: Meta-analysis; Cochrane Review

Common labels: Generalizability
Example interventions: Cochrane Review
My Model of The World,
ca. Early Grad School
Scholarly Communications in the age of Big
Data
[Diagram: Parameters λ, β]
My Model of The World,
as a PostDoc in quantitative social science
[Diagram: Laws, Super Population, Target Population, Frame, Selection; edge labels (structures), (generates); Parameters λ, β]
Domain Theoretic and Statistical Models are Not Enough
Entities, and Relationships, and Straw Models
(oh my!)
‘Actors’ (people)
‘Theory’ (ideas)
‘Documents’
‘Methods’
‘Data’
(affect decisions of)
(interact/intervene/simulate)
(select and apply)
(select, design, perform)
(create and apply)
Analysis
(output)
(apply over)
(observe, edit)
Creation/Collection
Storage/Ingest
Processing
Internal Sharing
Analysis
External dissemination/publication
Re-use
Long-term access
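The entities, relationship labels, and lifecycle stages listed above could be written down as a small data model. The following is a minimal sketch, not the author's model: the class names (`Entity`, `Stage`, `Relationship`) are hypothetical, and the pairing of each relationship label with a source and target entity is an assumption, since the slide text does not say which entities each label connects.

```python
from dataclasses import dataclass
from enum import Enum


class Entity(Enum):
    """Entity types named on the slide."""
    ACTORS = "Actors (people)"
    THEORY = "Theory (ideas)"
    DOCUMENTS = "Documents"
    METHODS = "Methods"
    DATA = "Data"


class Stage(Enum):
    """Lifecycle stages, numbered in the order they are listed on the slide."""
    CREATION_COLLECTION = 1
    STORAGE_INGEST = 2
    PROCESSING = 3
    INTERNAL_SHARING = 4
    ANALYSIS = 5
    EXTERNAL_DISSEMINATION = 6
    REUSE = 7
    LONG_TERM_ACCESS = 8


@dataclass(frozen=True)
class Relationship:
    source: Entity
    label: str
    target: Entity


# ASSUMPTION: these source/target pairings are illustrative guesses only.
MODEL = [
    Relationship(Entity.ACTORS, "select and apply", Entity.METHODS),
    Relationship(Entity.ACTORS, "observe, edit", Entity.DATA),
    Relationship(Entity.METHODS, "apply over", Entity.DATA),
]


def outgoing(entity):
    """Return every relationship that starts at the given entity."""
    return [r for r in MODEL if r.source is entity]


def stages_after(stage):
    """Return the lifecycle stages that follow the given stage."""
    return [s for s in Stage if s.value > stage.value]
```

Writing the model as explicit triples makes it possible to ask, for any intervention, which entities and which lifecycle stages it touches.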
Where to Intervene: Consider Actors
• Scholarly Publishers
• Researchers
• Data Archives/Publisher
• Research Sponsors
• Data Sources/Subjects
• Consumers
• Service/Infrastructure Providers
• Research Organizations
Documents*
(compendia, fairy tales)
‘We applied a general linear model’
‘We conjecture kids will choose candy’
‘δ = 2.3 * √Ω’
‘Chewing gum tastes great’
(Altman, et al. 2013)
Assertions about other entities
Logical Claims
Theorem 1 …. Lemma 1.1
Speculations, Commentary
Thanks to my dog for his support…
References, Citation
U49845.1 GI:1293613
doi:10.1002/0470841559.ch1
orcid:0000-0001-7382-6960
Internal Meta-Information
Title: XXXX
People (their Relationships & Action)
Identity: Who is the actor?
Relationship (or action): What did the actor do, or how are they related?
Modeling Methods, Analysis & Data…
Theory
(Rules, Entities, Concepts)
Algorithm
(Protocol, Operationalization)
Theory
(Rules, Entities, Concepts)
Theory
(Rules, Entities, Concepts)
Implementation
(Software, Coding Rules, Instrumentation)
Execution
(Deployment, House Survey Style, Equipment Setting)
Algorithms
(Protocol, Operationalization)
Implementations
(Software, Coding Rules, Instrumentation Design)
Executions
(Deployment, House Survey Style, Operating System,
Instrument, Computer, Starting Values, PRNG seeds)
Structure
Formats
Versions/Revisions
Selections
Integrations
Instantiations
(copies)
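Execution details such as operating system, starting values, and PRNG seeds are only useful for reproducibility if they are recorded alongside the result. The sketch below shows one way to do that with the Python standard library. The function names (`execution_record`, `run_analysis`) and the toy "analysis" (a mean of simulated draws) are assumptions made for illustration and do not come from the talk.

```python
import platform
import random


def execution_record(seed, starting_values):
    """Capture execution-level details that can affect a computed result."""
    return {
        "python_version": platform.python_version(),
        "operating_system": platform.system(),
        "machine": platform.machine(),
        "prng_seed": seed,
        "starting_values": list(starting_values),
    }


def run_analysis(seed, starting_values):
    """Toy analysis: starting value plus the mean of seeded normal draws.

    A private random.Random instance is used so the result depends only on
    the recorded seed, not on global PRNG state.
    """
    rng = random.Random(seed)
    draws = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    estimate = starting_values[0] + sum(draws) / len(draws)
    return estimate, execution_record(seed, starting_values)
```

Running `run_analysis` twice with the same seed and starting values in the same environment should give the same estimate; a difference across environments points at the execution layer rather than the algorithm.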
Improving
Reproducibility
Some Types of Reproducibility Issues/Use Cases
Common labels: Misconduct, Bit Rot, Author Responsibility
Reproducibility-related issue: Data was fabricated, corrupted, or radically misinterpreted prior to analysis
Example interventions: Discipline/community data archives; NIH genomic data sharing policy; RetractionWatch; Collaborative Data Collection Projects

Common labels: Misconduct, Negligence, Confusion, Typo, Proofreader error*, Dynamic Data Problem, Versioning problem
Reproducibility-related issue: Data {referenced by identifier | provided as an instance | described by method} has nontrivial set of semantic differences from that used as input to the publication
Example interventions: Dat, DataHub, DataVerse (versioning)

Common labels: Misconduct, Negligence, Harmless Error
Reproducibility-related issue: Published analysis algorithm does not correspond to implemented analysis
Example interventions: S/Weave; Compendia; Vistrails

Common labels: Reproducibility [NSF; Donoho 1995]; Replicability [King 1995, many journals]
Reproducibility-related issue: Variance of estimates given data instance & analysis implementation
Example interventions: Journal replication data & code archives; Virtual Machine archiving

Common labels: Replication [NSF]; Reproducibility [King 1995]; Independent Replication
Reproducibility-related issue: Variance of estimates given method algorithm and analysis algorithm
Example interventions: Protocol Archive, Journal of Visualized Experiments

Common labels: Result Validation, Fact Checking
Reproducibility-related issue: Variance of estimates given data identifier & analysis algorithm
Example interventions: Data Citation Standards

Common labels: Calibration, Extension, Reuse
Reproducibility-related issue: Produce new analysis given data identifier
Example interventions: Data Archives
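Checking whether two data instances have "semantic differences" needs a comparison that ignores differences in representation. The following is a rough sketch of a normalized fingerprint. It is not the fingerprint method cited under Related Work, and it rests on stated assumptions: row order, surrounding whitespace in strings, and numeric precision beyond 7 significant digits are treated as not semantically meaningful. The function names are hypothetical.

```python
import hashlib


def canonical_cell(value, digits=7):
    """Render one cell in a representation-independent form."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "b:" + str(value)
    if isinstance(value, (int, float)):
        # Scientific notation with a fixed number of significant digits,
        # so that 3, 3.0 and 3.00 all normalize to the same text.
        return "n:" + format(float(value), f".{digits - 1}e")
    return "s:" + str(value).strip()


def fingerprint(rows, digits=7):
    """SHA-256 over the sorted, normalized rows of a table."""
    lines = sorted(
        "\x1f".join(canonical_cell(cell, digits) for cell in row) for row in rows
    )
    digest = hashlib.sha256()
    for line in lines:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()
```

Two instances that differ only in row order or number formatting get the same fingerprint, while a changed value produces a different one. Whether row order really is meaningless depends on the dataset, so that choice would need to be revisited case by case.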
More Operational Reproducibility Claims
Common labels: File-Drawer Problem
Reproducibility-related issue: Publisher bias toward significant (or expected) results
Example interventions: APS Preregistration Badge, Journal of Null Results

Common labels: Underreporting (Adverse Events); Data Dredging (Multiple Comparisons)
Reproducibility-related issue: Author bias toward publishing favored outcomes
Example interventions: Clinical Trial Preregistration

Common labels: Data Dredging: Multiple Comparisons; P-Hacking
Reproducibility-related issue: Author bias toward creating significant results, resulting in a difference between stated method/analysis and actual (complete) method/analysis
Example interventions: Holdout Data Escrow

Common labels: Sensitivity, Robustness
Reproducibility-related issue: Variance of support for claims across specification change
Example interventions: Sensitivity Analysis

Common labels: Reliability
Reproducibility-related issue: Variance of support for claims across repeated measures, samples
Example interventions: Meta-analysis; Cochrane Review; Data Integration

Common labels: Generalizability
Reproducibility-related issue: Variance of support for claims across different frames
Example interventions: Cochrane Review

Common labels: Laws, Truth
Reproducibility-related issue: Variance of support for claims to other populations
Example interventions: Grand Challenge?
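"Variance of support for claims across specification change" can be illustrated with a very small sensitivity analysis. In this sketch the three "specifications" (mean, median, 10% trimmed mean) and the names `SPECIFICATIONS`, `trimmed_mean`, and `sensitivity` are assumptions chosen for illustration; a real analysis would vary model specifications rather than summary statistics.

```python
import statistics


def trimmed_mean(values, proportion=0.1):
    """Mean after dropping the given proportion of values from each tail."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return statistics.fmean(kept)


# Hypothetical alternative specifications of "the typical value".
SPECIFICATIONS = {
    "mean": statistics.fmean,
    "median": statistics.median,
    "trimmed_mean_10pct": trimmed_mean,
}


def sensitivity(values, claim):
    """Estimate under each specification and summarize support for a claim.

    `claim` is a function from an estimate to True/False.
    """
    estimates = {name: spec(values) for name, spec in SPECIFICATIONS.items()}
    support = [claim(estimate) for estimate in estimates.values()]
    return {
        "estimates": estimates,
        "variance": statistics.pvariance(list(estimates.values())),
        "share_supporting": sum(support) / len(support),
    }
```

For example, with the values 1 through 9 plus a single outlier of 100, the claim "the typical value exceeds 10" holds under the mean but not under the median or the trimmed mean, so its support is sensitive to the specification.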
Operational Reproducibility Claims
Reproducibility-related issue and related informatics claims:

Label: Validation, Fact Checking
Reproducibility issue: Variance of estimates given data identifier & analysis algorithm
Reproducibility claim: Variance of estimates given data identifier & analysis algorithm is known & correctly represented.
Use case: Post-publication reviewer wants to establish that published claims correspond to analysis method performed…

Potential supporting informational claims:
1. Instance of data retrieved via identifier is semantically equivalent to instance of data used to support published claim
2. Analysis algorithm is robust to choice of reasonable alternative implementation
3. Implementation of algorithm is robust to reasonable choice of execution details and context
4. Published direct claims about data are semantically equivalent to subset of claims produced by authors’ previous application of analysis
5. …

Potential information systems properties supporting claims:
1a. Detailed provenance history for data from collection through analysis and deposition
1b. Automatic replication of direct data claims from deposited source
1c. Cryptographic evidence (e.g., cryptographically signed {analysis output, including cryptographic hash of data} & {cryptographic hash of data retrieved via identifier})
…
2a. Standard implementation, subject to community review
2b. Report of results of application of implementation on standard testbed
2c. Availability of implementation for inspection
….
3. …
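Property 1c above pairs a signed analysis output that embeds a hash of the data with a hash of the data retrieved via its identifier. A minimal sketch of that check follows. It makes two loud assumptions: an HMAC with a shared key stands in for a real digital signature (a deployed system would more likely use public-key signatures), and the function names (`sign_output`, `verify`) and record layout are hypothetical.

```python
import hashlib
import hmac
import json


def data_hash(data: bytes) -> str:
    """SHA-256 hex digest of a data instance."""
    return hashlib.sha256(data).hexdigest()


def sign_output(analysis_output: dict, data: bytes, key: bytes) -> dict:
    """Bind an analysis output to the hash of the data it was computed from."""
    record = {"output": analysis_output, "data_sha256": data_hash(data)}
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    # ASSUMPTION: HMAC is a stand-in for a signature scheme.
    tag = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {"record": record, "tag": tag}


def verify(signed: dict, retrieved_data: bytes, key: bytes) -> bool:
    """Check the record is untampered and matches the retrieved data."""
    payload = json.dumps(signed["record"], sort_keys=True).encode("utf-8")
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    untampered = hmac.compare_digest(expected, signed["tag"])
    same_data = hmac.compare_digest(
        signed["record"]["data_sha256"], data_hash(retrieved_data)
    )
    return untampered and same_data
```

A byte-level hash only establishes identity of the exact bytes; supporting claim 1 (semantic equivalence) would need a normalized representation of the data before hashing.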
Conjectures: How Could Informatics Improve Reproducibility
Formal Properties
(Some formal properties on information
flow and management tend to support
reproducibility related inferences…)
• Transparency
• Auditability
• Provenance
• Fixity
• Identification
• Durability
• Integrity
• Repeatability
• Self-documentation
• Non-repudiation
Properties applied to different stages,
entities, and to components of the
information system itself
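Several of the formal properties listed above (provenance, fixity, auditability) can be combined in an append-only log in which each entry commits to the one before it. The sketch below is one possible illustration, not a design from the talk; the entry fields and the function names `append_event` and `audit` are assumptions.

```python
import hashlib
import json

FIELDS = ("stage", "description", "previous_hash")


def _entry_hash(body):
    """Hash the canonical JSON form of an entry body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_event(log, stage, description):
    """Return a new log with one more entry chained to the previous entry."""
    previous = log[-1]["entry_hash"] if log else ""
    body = {"stage": stage, "description": description, "previous_hash": previous}
    return log + [dict(body, entry_hash=_entry_hash(body))]


def audit(log):
    """Return True if every entry is intact and correctly chained."""
    previous = ""
    for entry in log:
        body = {name: entry[name] for name in FIELDS}
        if entry["previous_hash"] != previous or entry["entry_hash"] != _entry_hash(body):
            return False
        previous = entry["entry_hash"]
    return True
```

Editing or dropping an earlier entry breaks the chain, so an auditor can detect the change without trusting whoever holds the log. This gives tamper evidence only; it does not by itself say who made an entry.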
Systems Property*
(How does the system interact with users, and
what incentives and culture does it engender?)
• Barriers to entry
• Ease of use
• Support for intellectual communities
• Speed and performance
• Security
• Access control
• Personalization
• Credit and attribution
• Incent well-founded trust among actors
• Disincent “glamour & deceit”
(How does the system integrate into research
ecosystem?)
Systems Oriented
• Sustainability
• Cost
• Incent well-founded trust in system and outputs
Discussion
– How can we better support reproducibility
with information infrastructure?*
• How can we better identify the inferential claims implied
by a specific set of (non)reproducibility claims/issues?
• Which information flows and systems are most closely
associated with these inferential claims?
• Which properties of information systems support
generating these inferential claims?
Additional References
• de Waard, A. (2010). The story of science: a syntagmatic/paradigmatic
analysis of scientific text. In Proceedings of the AMICUS Workshop (pp. 36-
41).
• Gentleman, R., & Lang, D. T. (2007). Statistical analyses and reproducible
research. Journal of Computational and Graphical Statistics, 16(1).
• Freire, Juliana. "Making computations and publications reproducible with
vistrails." Computing in Science & Engineering 14.4 (2012): 18-25.
• Kevles, Daniel J. The Baltimore case: A trial of politics, science, and character.
WW Norton & Company, 2000.
• King, G. (1995). Replication, replication. PS: Political Science & Politics,
28(03), 444-452.
• McCullough, B. D. (2009). Open access economics journals and the market
for reproducible economic research. Economic Analysis and Policy, 39(1),
117-126.
Questions?
E-mail: escience@mit.edu
Web: informatics.mit.edu
Modeling Reproducibility from an Informatics
Perspective

More Related Content

What's hot

Making Decisions in a World Awash in Data: We’re going to need a different bo...
Making Decisions in a World Awash in Data: We’re going to need a different bo...Making Decisions in a World Awash in Data: We’re going to need a different bo...
Making Decisions in a World Awash in Data: We’re going to need a different bo...
Micah Altman
 
Data Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionData Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data Interaction
University of Washington
 

What's hot (20)

Managing confidential data
Managing confidential dataManaging confidential data
Managing confidential data
 
MIT Program on Information Science Talk -- Julia Flanders on Jobs, Roles, Ski...
MIT Program on Information Science Talk -- Julia Flanders on Jobs, Roles, Ski...MIT Program on Information Science Talk -- Julia Flanders on Jobs, Roles, Ski...
MIT Program on Information Science Talk -- Julia Flanders on Jobs, Roles, Ski...
 
BROWN BAG TALK WITH MICAH ALTMAN, SOURCES OF BIG DATA FOR SOCIAL SCIENCES
BROWN BAG TALK WITH MICAH ALTMAN, SOURCES OF BIG DATA FOR SOCIAL SCIENCESBROWN BAG TALK WITH MICAH ALTMAN, SOURCES OF BIG DATA FOR SOCIAL SCIENCES
BROWN BAG TALK WITH MICAH ALTMAN, SOURCES OF BIG DATA FOR SOCIAL SCIENCES
 
Making Decisions in a World Awash in Data: We’re going to need a different bo...
Making Decisions in a World Awash in Data: We’re going to need a different bo...Making Decisions in a World Awash in Data: We’re going to need a different bo...
Making Decisions in a World Awash in Data: We’re going to need a different bo...
 
BROWN BAG TALK WITH MICAH ALTMAN INTEGRATING OPEN DATA INTO OPEN ACCESS JOURNALS
BROWN BAG TALK WITH MICAH ALTMAN INTEGRATING OPEN DATA INTO OPEN ACCESS JOURNALSBROWN BAG TALK WITH MICAH ALTMAN INTEGRATING OPEN DATA INTO OPEN ACCESS JOURNALS
BROWN BAG TALK WITH MICAH ALTMAN INTEGRATING OPEN DATA INTO OPEN ACCESS JOURNALS
 
July IAP: Confidential Information - Storage, Sharing, & Publication - with M...
July IAP: Confidential Information - Storage, Sharing, & Publication - with M...July IAP: Confidential Information - Storage, Sharing, & Publication - with M...
July IAP: Confidential Information - Storage, Sharing, & Publication - with M...
 
Data Citation Rewards and Incentives
 Data Citation Rewards and Incentives Data Citation Rewards and Incentives
Data Citation Rewards and Incentives
 
Privacy tool osha comments
Privacy tool osha commentsPrivacy tool osha comments
Privacy tool osha comments
 
Data Sharing & Data Citation
Data Sharing & Data CitationData Sharing & Data Citation
Data Sharing & Data Citation
 
Wilbanks Can We Simultaneously Support Both Privacy & Research?
Wilbanks Can We Simultaneously Support Both Privacy & Research?Wilbanks Can We Simultaneously Support Both Privacy & Research?
Wilbanks Can We Simultaneously Support Both Privacy & Research?
 
Best Practices for Sharing Economics Data
Best Practices for Sharing Economics DataBest Practices for Sharing Economics Data
Best Practices for Sharing Economics Data
 
Brown Bag: New Models of Scholarly Communication for Digital Scholarship, by ...
Brown Bag: New Models of Scholarly Communication for Digital Scholarship, by ...Brown Bag: New Models of Scholarly Communication for Digital Scholarship, by ...
Brown Bag: New Models of Scholarly Communication for Digital Scholarship, by ...
 
Matching Uses and Protections for Government Data Releases: Presentation at t...
Matching Uses and Protections for Government Data Releases: Presentation at t...Matching Uses and Protections for Government Data Releases: Presentation at t...
Matching Uses and Protections for Government Data Releases: Presentation at t...
 
Privacy Gaps in Mediated Library Services: Presentation at NERCOMP2019
Privacy Gaps in Mediated Library Services: Presentation at NERCOMP2019Privacy Gaps in Mediated Library Services: Presentation at NERCOMP2019
Privacy Gaps in Mediated Library Services: Presentation at NERCOMP2019
 
Science Data, Responsibly
Science Data, ResponsiblyScience Data, Responsibly
Science Data, Responsibly
 
La ricerca scientifica nell'era dei Big Data - Sabina Leonelli
La ricerca scientifica nell'era dei Big Data - Sabina LeonelliLa ricerca scientifica nell'era dei Big Data - Sabina Leonelli
La ricerca scientifica nell'era dei Big Data - Sabina Leonelli
 
Emerging Data Citation Infrastructure
Emerging Data Citation InfrastructureEmerging Data Citation Infrastructure
Emerging Data Citation Infrastructure
 
Accessing and Using Big Data to Advance Social Science Knowledge
Accessing and Using Big Data to Advance Social Science KnowledgeAccessing and Using Big Data to Advance Social Science Knowledge
Accessing and Using Big Data to Advance Social Science Knowledge
 
Data, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceData, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data Science
 
Data Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionData Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data Interaction
 

Viewers also liked

BROWN BAG TALK WITH CHAOQUN NI- TRANSFORMATIVE INTERACTIONS IN THE SCIENTIFIC...
BROWN BAG TALK WITH CHAOQUN NI- TRANSFORMATIVE INTERACTIONS IN THE SCIENTIFIC...BROWN BAG TALK WITH CHAOQUN NI- TRANSFORMATIVE INTERACTIONS IN THE SCIENTIFIC...
BROWN BAG TALK WITH CHAOQUN NI- TRANSFORMATIVE INTERACTIONS IN THE SCIENTIFIC...
Micah Altman
 
BROWN BAG: THE VISUAL COMPONENT: MORE THAN PRETTY PICTURES - WITH FELICE FRANKEL
BROWN BAG: THE VISUAL COMPONENT: MORE THAN PRETTY PICTURES - WITH FELICE FRANKELBROWN BAG: THE VISUAL COMPONENT: MORE THAN PRETTY PICTURES - WITH FELICE FRANKEL
BROWN BAG: THE VISUAL COMPONENT: MORE THAN PRETTY PICTURES - WITH FELICE FRANKEL
Micah Altman
 

Viewers also liked (10)

Dulin PermaCC Talk for MIT PIS
Dulin PermaCC Talk for MIT PISDulin PermaCC Talk for MIT PIS
Dulin PermaCC Talk for MIT PIS
 
Program on Information Science Brown Bag:David Weinberger on Libraries as Pla...
Program on Information Science Brown Bag:David Weinberger on Libraries as Pla...Program on Information Science Brown Bag:David Weinberger on Libraries as Pla...
Program on Information Science Brown Bag:David Weinberger on Libraries as Pla...
 
BROWN BAG TALK WITH CHAOQUN NI- TRANSFORMATIVE INTERACTIONS IN THE SCIENTIFIC...
BROWN BAG TALK WITH CHAOQUN NI- TRANSFORMATIVE INTERACTIONS IN THE SCIENTIFIC...BROWN BAG TALK WITH CHAOQUN NI- TRANSFORMATIVE INTERACTIONS IN THE SCIENTIFIC...
BROWN BAG TALK WITH CHAOQUN NI- TRANSFORMATIVE INTERACTIONS IN THE SCIENTIFIC...
 
Can computers be feminist? Program on Information Science Talk by Gillian Smith
Can computers be feminist? Program on Information Science Talk by Gillian SmithCan computers be feminist? Program on Information Science Talk by Gillian Smith
Can computers be feminist? Program on Information Science Talk by Gillian Smith
 
BROWN BAG: THE VISUAL COMPONENT: MORE THAN PRETTY PICTURES - WITH FELICE FRANKEL
BROWN BAG: THE VISUAL COMPONENT: MORE THAN PRETTY PICTURES - WITH FELICE FRANKELBROWN BAG: THE VISUAL COMPONENT: MORE THAN PRETTY PICTURES - WITH FELICE FRANKEL
BROWN BAG: THE VISUAL COMPONENT: MORE THAN PRETTY PICTURES - WITH FELICE FRANKEL
 
Gary Price, MIT Program on Information Science
Gary Price, MIT Program on Information ScienceGary Price, MIT Program on Information Science
Gary Price, MIT Program on Information Science
 
The Open Access Network: Rebecca Kennison’s Talk for the MIT Prorgam on Infor...
The Open Access Network: Rebecca Kennison’s Talk for the MIT Prorgam on Infor...The Open Access Network: Rebecca Kennison’s Talk for the MIT Prorgam on Infor...
The Open Access Network: Rebecca Kennison’s Talk for the MIT Prorgam on Infor...
 
Con3036 soaring-through-the-clouds-oow2016-160920214845
Con3036 soaring-through-the-clouds-oow2016-160920214845Con3036 soaring-through-the-clouds-oow2016-160920214845
Con3036 soaring-through-the-clouds-oow2016-160920214845
 
Test driven cloud development using Oracle SOA CS and Oracle Developer CS
Test driven cloud development using Oracle SOA CS and Oracle Developer CSTest driven cloud development using Oracle SOA CS and Oracle Developer CS
Test driven cloud development using Oracle SOA CS and Oracle Developer CS
 
Brown Bag: DMCA §1201 and Video Game Preservation Institutions: A Case Study ...
Brown Bag: DMCA §1201 and Video Game Preservation Institutions: A Case Study ...Brown Bag: DMCA §1201 and Video Game Preservation Institutions: A Case Study ...
Brown Bag: DMCA §1201 and Video Game Preservation Institutions: A Case Study ...
 

Similar to Reproducibility from an infomatics perspective

Ralph schroeder and eric meyer
Ralph schroeder and eric meyerRalph schroeder and eric meyer
Ralph schroeder and eric meyer
oiisdp
 

Similar to Reproducibility from an infomatics perspective (20)

AAPOR - comparing found data from social media and made data from surveys
AAPOR - comparing found data from social media and made data from surveysAAPOR - comparing found data from social media and made data from surveys
AAPOR - comparing found data from social media and made data from surveys
 
State of the Art Informatics for Research Reproducibility, Reliability, and...
 State of the Art  Informatics for Research Reproducibility, Reliability, and... State of the Art  Informatics for Research Reproducibility, Reliability, and...
State of the Art Informatics for Research Reproducibility, Reliability, and...
 
Research Using Behavioral Big Data: A Tour and Why Mechanical Engineers Shoul...
Research Using Behavioral Big Data: A Tour and Why Mechanical Engineers Shoul...Research Using Behavioral Big Data: A Tour and Why Mechanical Engineers Shoul...
Research Using Behavioral Big Data: A Tour and Why Mechanical Engineers Shoul...
 
Data Science-1 (1).ppt
Data Science-1 (1).pptData Science-1 (1).ppt
Data Science-1 (1).ppt
 
Characterizing Data and Software for Social Science Research
Characterizing Data and Software for Social Science ResearchCharacterizing Data and Software for Social Science Research
Characterizing Data and Software for Social Science Research
 
Privacy in Research Data Managemnt - Use Cases
Privacy in Research Data Managemnt - Use CasesPrivacy in Research Data Managemnt - Use Cases
Privacy in Research Data Managemnt - Use Cases
 
Making our mark: the important role of social scientists in the ‘era of big d...
Making our mark: the important role of social scientists in the ‘era of big d...Making our mark: the important role of social scientists in the ‘era of big d...
Making our mark: the important role of social scientists in the ‘era of big d...
 
1. Data Science overview - part1.pptx
1. Data Science overview - part1.pptx1. Data Science overview - part1.pptx
1. Data Science overview - part1.pptx
 
“Big data” in human services organisations: Practical problems and ethical di...
“Big data” in human services organisations: Practical problems and ethical di...“Big data” in human services organisations: Practical problems and ethical di...
“Big data” in human services organisations: Practical problems and ethical di...
 
Fundamentals of Data science Introduction Unit 1
Fundamentals of Data science Introduction Unit 1Fundamentals of Data science Introduction Unit 1
Fundamentals of Data science Introduction Unit 1
 
Borner - Modelling science technology and innovation
Borner - Modelling science technology and innovationBorner - Modelling science technology and innovation
Borner - Modelling science technology and innovation
 
Methods and Tools for Facilitating Social Participation
Methods and Tools for Facilitating Social ParticipationMethods and Tools for Facilitating Social Participation
Methods and Tools for Facilitating Social Participation
 
Lecture #01
Lecture #01Lecture #01
Lecture #01
 
Sensors1(1)
Sensors1(1)Sensors1(1)
Sensors1(1)
 
From Text to Data to the World: The Future of Knowledge Graphs
From Text to Data to the World: The Future of Knowledge GraphsFrom Text to Data to the World: The Future of Knowledge Graphs
From Text to Data to the World: The Future of Knowledge Graphs
 
Researching Social Media – Big Data and Social Media Analysis
Researching Social Media – Big Data and Social Media AnalysisResearching Social Media – Big Data and Social Media Analysis
Researching Social Media – Big Data and Social Media Analysis
 
"Melting Pot" of the Sciences in interdisciplinary research
"Melting Pot" of the Sciences in interdisciplinary research"Melting Pot" of the Sciences in interdisciplinary research
"Melting Pot" of the Sciences in interdisciplinary research
 
20220103 jim spohrer hicss v9
20220103 jim spohrer hicss v920220103 jim spohrer hicss v9
20220103 jim spohrer hicss v9
 
Computational Models in Systemic Design
Computational Models in Systemic DesignComputational Models in Systemic Design
Computational Models in Systemic Design
 
Ralph schroeder and eric meyer
Ralph schroeder and eric meyerRalph schroeder and eric meyer
Ralph schroeder and eric meyer
 

More from Micah Altman

SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
Micah Altman
 
Creative Data Literacy: Bridging the Gap Between Data-Haves and Have-Nots
Creative Data Literacy: Bridging the Gap Between Data-Haves and Have-NotsCreative Data Literacy: Bridging the Gap Between Data-Haves and Have-Nots
Creative Data Literacy: Bridging the Gap Between Data-Haves and Have-Nots
Micah Altman
 
SOLARSPELL: THE SOLAR POWERED EDUCATIONAL LEARNING LIBRARY - EXPERIENTIAL LEA...
SOLARSPELL: THE SOLAR POWERED EDUCATIONAL LEARNING LIBRARY - EXPERIENTIAL LEA...SOLARSPELL: THE SOLAR POWERED EDUCATIONAL LEARNING LIBRARY - EXPERIENTIAL LEA...
SOLARSPELL: THE SOLAR POWERED EDUCATIONAL LEARNING LIBRARY - EXPERIENTIAL LEA...
Micah Altman
 

More from Micah Altman (19)

Selecting efficient and reliable preservation strategies
Selecting efficient and reliable preservation strategiesSelecting efficient and reliable preservation strategies
Selecting efficient and reliable preservation strategies
 
Well-Being - A Sunset Conversation
Well-Being - A Sunset ConversationWell-Being - A Sunset Conversation

Reproducibility from an informatics perspective

  • 1. Prepared for Program on Information Science – Brown Bag Talks, MIT, March 2015
      Modeling Reproducibility from an Informatics Perspective
      Dr. Micah Altman <escience@mit.edu>
      Director of Research, MIT Libraries
      Head/Scientist, Program on Information Sciences <informatics.mit.edu>
  • 2. DISCLAIMER
      These opinions are my own; they are not the opinions of MIT, Brookings, any of the project funders, nor (with the exception of co-authored previously published work) my collaborators.
      Secondary disclaimer: “It’s tough to make predictions, especially about the future!”
      -- Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill, Confucius, Disreali [sic], Freeman Dyson, Cecil B. Demille, Albert Einstein, Enrico Fermi, Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx, Dan Quayle, George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr L. White, etc.
  • 3. Collaborators & Co-Conspirators
      • Kobbi Nissim, Michael Bar-Sinai, Salil Vadhan & the Privacy Tools for Research Data Project <http://privacytools.seas.harvard.edu/>
      • Jeff Gill
      • Michael P. McDonald
      Research Support
      • Sloan Foundation
      • National Science Foundation (Award #1237235)
  • 4. Related Work
      • Allen, L., et al. (2014). Credit where credit is due. Nature, 508(7496), 312-313.
      • Altman, M., & Crosas, M. (2013). The evolution of data citation: From principles to implementation. IASSIST Quarterly, 37.
      • Garnett, A., Altman, M., Andreev, L., Barbarosa, S., Castro, E., Crosas, M., ... & Yang, X. (2013, May). Linking OJS and Dataverse. In PKP Scholarly Publishing Conference 2013.
      • Altman, M., Fox, J., Jackman, S., & Zeileis, A. (2011). An introduction to the special volume on "Political Methodology". Journal of Statistical Software, 42(i01).
      • Altman, M. (2008). A fingerprint method for scientific data verification. In Advances in Computer and Information Sciences and Engineering (pp. 311-316). Springer Netherlands.
      • Altman, M., & King, G. (2007). A proposed standard for the scholarly citation of quantitative data. D-lib Magazine, 13(3/4).
      • Altman, M., Gill, J., & McDonald, M. P. (2004). Numerical issues in statistical computing for the social scientist. John Wiley & Sons.
      • Altman, M., & McDonald, M. P. (2003). Replication with attention to numerical accuracy. Political Analysis, 11(3), 302-307.
      • Altman, M. (2002). A review of JMP 4.03 with special attention to its numerical accuracy. The American Statistician, 56(1), 72-75.
      • Altman, M., & McDonald, M. P. (2001). Choosing reliable statistical software. Political Science & Politics, 34(03), 681-687.
      • Altman, M., Andreev, L., Diggory, M., King, G., Kolster, E., Sone, A., ... & Krot, M. (2001, January). Overview of the virtual data center project and software. In Proceedings of the JCDL 2001 (pp. 203-204). ACM.
  • 5. Roadmap for this Talk
      • Reproducibility Concerns…
      • Modeling Reproducible Research from an Information Perspective
      • How can informatics improve reproducibility?
  • 6. Information Perspective
  • 7. Increased Retractions, Allegations of Fraud
      Maximizing the Impact of Research through Research Data Management
  • 8. What Goes in the File Drawer?
      Daniel Shechtman’s Lab Notebook Providing Initial Evidence of Quasicrystals
      • Null results are less likely to be published → published results as a whole are biased toward positive findings
      • Outliers are routinely discarded → unexpected patterns of evidence across studies remain hidden
  • 9. Replicability of Published Results
      • Many journals have no replication policy
      • Even in journals with a clear policy, the success rate is low
  • 10. Many Initiatives to Improve Scientific Reliability
      • Retraction monitoring
      • Data citation
      • Clinical trial preregistration
      • Registered replication
      • Open data
      • Badges
  • 11. Reproducibility Concerns
  • 12. Framing Reproducibility from an Informatics Perspective
      Reproducibility claims are not formulated as direct claims about the world…
      1. What claims about information are implied by reproducibility claims/issues?*
      2. What properties of information and information flow are related to those claims?
      3. What would possible changes to information processing and flow yield? (And how much would they cost?)
  • 13. Some Types of Reproducibility Issues/Use Cases
      (Columns: Common Labels For Reproducibility Problems; Example Interventions)
      • Labels: Misconduct, Bit Rot, Author Responsibility
        Interventions: Discipline/community data archives; NIH genomic data sharing policy; Retraction Watch; Collaborative Data Collection Projects
      • Labels: Misconduct, Negligence, Confusion, Typo, Proofreader error*, Dynamic Data Problem, Versioning problem
        Interventions: Dat, DataHub, Dataverse (versioning)
      • Labels: Misconduct, Negligence, Harmless Error
        Interventions: S/Weave; Compendia; VisTrails
      • Labels: Reproducibility [NSF; Donoho 1995]; Replicability [King 1995, many journals]
        Interventions: Journal replication data & code archives; Virtual Machine archiving
      • Labels: Replication [NSF]; Reproducibility [King 1995]; Independent Replication
        Interventions: Protocol Archive, Journal of Visualized Experiments
      • Labels: Result Validation, Fact Checking
        Interventions: Data Citation Standards
      • Labels: Calibration, Extension, Reuse
        Interventions: Data Archives
      • Labels: File-Drawer Problem
        Interventions: APS Preregistration Badge, Journal of Null Results
      • Labels: Underreporting (Adverse Events); Data Dredging (Multiple Comparisons)
        Interventions: Clinical Trial Preregistration
      • Labels: Data Dredging: Multiple Comparisons; P-Hacking
        Interventions: Holdout Data Escrow
      • Labels: Sensitivity, Robustness
        Interventions: Sensitivity Analysis
      • Labels: Reliability
        Interventions: Meta-analysis; Cochrane Review; Data Integration
      • Labels: Generalizability
        Interventions: Cochrane Review
  • 14. More Operational Reproducibility Claims
      (Columns: Common Labels; Example Interventions)
      • Labels: File-Drawer Problem
        Interventions: APS Preregistration Badge, Journal of Null Results
      • Labels: Underreporting (Adverse Events); Data Dredging (Multiple Comparisons)
        Interventions: Clinical Trial Preregistration
      • Labels: Data Dredging: Multiple Comparisons; P-Hacking
        Interventions: Holdout Data Escrow
      • Labels: Sensitivity, Robustness
        Interventions: Sensitivity Analysis
      • Labels: Reliability
        Interventions: Meta-analysis; Cochrane Review
      • Labels: Generalizability
        Interventions: Cochrane Review
  • 15. My Model of The World, ca. Early Grad School
      Scholarly Communications in the age of Big Data
      [Diagram labels: λ β Parameters]
  • 16. My Model of The World, as a PostDoc in quantitative social science
      [Diagram labels: Target Population Frame Selection Super Population Laws (structures) λ β (generates) Parameters]
  • 17. Domain Theoretic and Statistical Models are Not Enough
  • 18. Entities, and Relationships, and Straw Models (oh my!)
      [Diagram entities: ‘Actors’ (people); ‘Theory’ (ideas); ‘Documents’; ‘Methods’; ‘Data’; Analysis (output)]
      [Diagram relationships: (affect decisions of); (interact/intervene/simulate); (select and apply); (select, design, perform); (create and apply); (apply over); (observe, edit)]
  • 19. Where to Intervene: Consider Actors
      [Diagram stages: Creation/Collection; Storage/Ingest; Processing; Internal Sharing; Analysis; External dissemination/publication; Re-use; Long-term access]
      [Diagram actors: Scholarly Publishers; Researchers; Data Archives/Publisher; Research Sponsors; Data Sources/Subjects; Consumers; Service/Infrastructure Providers; Research Organizations]
  • 20. Documents* (compendia, fairy tales)
      • Assertions about other entities: ‘We applied a general linear model’; ‘We conjecture kids will choose candy’; ‘δ = 2.3 * √Ω’; ‘Chewing gum tastes great’ (Altman, et al. 2013)
      • Logical Claims: Theorem 1 …. Lemma 1.1
      • Speculations, Commentary: Thanks to my dog for his support…
      • References, Citation: U49845.1; GI:1293613; doi:10.1002/0470841559.ch1; orcid:0000-0001-7382-6960
      • Internal Meta-Information: Title: XXXX
  • 21. People (their Relationships & Action)
      • Identity: Who is the actor?
      • Relationship (or action): What did the actor do, or how are they related?
  • 22. Modeling Methods, Analysis & Data…
      [Diagram labels: Theory (Rules, Entities, Concepts); Algorithm (Protocol, Operationalization); Implementation (Software, Coding Rules, Instrumentation); Execution (Deployment, House Survey Style, Equipment Setting)]
      [Diagram labels: Theory (Rules, Entities, Concepts); Algorithms (Protocol, Operationalization); Implementations (Software, Coding Rules, Instrumentation Design); Executions (Deployment, House Survey Style, Operating System, Instrument, Computer, Starting Values, PRNG seeds)]
      [Diagram labels: Structure; Formats; Versions/Revisions; Selections; Integrations; Instantiations (copies)]
  • 23. Improving Reproducibility
  • 24. Some Types of Reproducibility Issues/Use Cases
      (Columns: Common Labels; Reproducibility Related Issue; Example Interventions)
      • Labels: Misconduct, Bit Rot, Author Responsibility
        Issue: Data was fabricated, corrupted, or radically misinterpreted prior to analysis
        Interventions: Discipline/community data archives; NIH genomic data sharing policy; Retraction Watch; Collaborative Data Collection Projects
      • Labels: Misconduct, Negligence, Confusion, Typo, Proofreader error*, Dynamic Data Problem, Versioning problem
        Issue: Data {referenced by identifier | provided as an instance | described by method} has nontrivial set of semantic differences from that used as input to the publication
        Interventions: Dat, DataHub, Dataverse (versioning)
      • Labels: Misconduct, Negligence, Harmless Error
        Issue: Published analysis algorithm does not correspond to implemented analysis
        Interventions: S/Weave; Compendia; VisTrails
      • Labels: Reproducibility [NSF; Donoho 1995]; Replicability [King 1995, many journals]
        Issue: Variance of estimates given data instance & analysis implementation
        Interventions: Journal replication data & code archives; Virtual Machine archiving
      • Labels: Replication [NSF]; Reproducibility [King 1995]; Independent Replication
        Issue: Variance of estimates given method algorithm and analysis algorithm
        Interventions: Protocol Archive, Journal of Visualized Experiments
      • Labels: Result Validation, Fact Checking
        Issue: Variance of estimates given data identifier & analysis algorithm
        Interventions: Data Citation Standards
      • Labels: Calibration, Extension, Reuse
        Issue: Produce new analysis given data identifier
        Interventions: Data Archives
  • 25. More Operational Reproducibility Claims
      (Columns: Common Labels; Reproducibility Related Issue; Example Interventions)
      • Labels: File-Drawer Problem
        Issue: Publisher bias toward significant (or expected) results
        Interventions: APS Preregistration Badge, Journal of Null Results
      • Labels: Underreporting (Adverse Events); Data Dredging (Multiple Comparisons)
        Issue: Author bias toward publishing favored outcomes
        Interventions: Clinical Trial Preregistration
      • Labels: Data Dredging: Multiple Comparisons; P-Hacking
        Issue: Author bias toward creating significant results, resulting in a difference between stated method/analysis and actual (complete) method/analysis
        Interventions: Holdout Data Escrow
      • Labels: Sensitivity, Robustness
        Issue: Variance of support for claims across specification change
        Interventions: Sensitivity Analysis
      • Labels: Reliability
        Issue: Variance of support for claims across repeated measures, samples
        Interventions: Meta-analysis; Cochrane Review; Data Integration
      • Labels: Generalizability
        Issue: Variance of support for claims across different frames
        Interventions: Cochrane Review
      • Labels: Laws, Truth
        Issue: Variance of support for claims to other populations
        Interventions: Grand Challenge ?
  • 26. Operational Reproducibility Claims
      (Columns: Reproducibility Related Issue; Related informatics claims)
      Label: Validation, Fact Checking
      Reproducibility Issue: Variance of estimates given data identifier & analysis algorithm
      Reproducibility Claim: Variance of estimates given data identifier & analysis algorithm is known & correctly represented.
      Use Case: Post-publication reviewer wants to establish that published claims correspond to analysis method performed…
      Potential supporting informational claims
        1. Instance of data retrieved via identifier is semantically equivalent to instance of data used to support published claim
        2. Analysis algorithm is robust to choice of reasonable alternative implementation
        3. Implementation of algorithm is robust to reasonable choice of execution details and context
        4. Published direct claims about data are semantically equivalent to subset of claims produced by authors' previous application of analysis
        5. …
      Potential information systems properties supporting claims
        1a. Detailed provenance history for data from collection through analysis and deposition
        1b. Automatic replication of direct data claims from deposited source
        1c. Cryptographic evidence (e.g. cryptographically signed {analysis output, including cryptographic hash of data} & {cryptographic hash of data retrieved via identifier}) …
        2a. Standard implementation, subject to community review
        2b. Report of results of application of implementation on standard testbed
        2c. Availability of implementation for inspection ….
        3. …
  • 27. Conjectures: How Could Informatics Improve Reproducibility
      Formal Properties (Some formal properties on information flow and management tend to support reproducibility-related inferences…)
        • Transparency
        • Auditability
        • Provenance
        • Fixity
        • Identification
        • Durability
        • Integrity
        • Repeatability
        • Self-documentation
        • Non-repudiation
      Properties applied to different stages, entities, and to components of the information system itself
      Systems Property* (How does the system interact with users, and what incentives and culture does it engender?)
        • Barriers to entry
        • Ease of use
        • Support for intellectual communities
        • Speed and performance
        • Security
        • Access control
        • Personalization
        • Credit and attribution
        • Incent well-founded trust among actors
        • Disincent “glamour & deceit”
      Systems Oriented (How does the system integrate into the research ecosystem?)
        • Sustainability
        • Cost
        • Incent well-founded trust in system and outputs
  • 28. Discussion – How can we better support reproducibility with information infrastructure?*
      • How can we better identify the inferential claims implied by a specific set of (non)reproducibility claims/issues?
      • Which information flows and systems are most closely associated with these inferential claims?
      • Which properties of information systems support generating these inferential claims?
  • 29. Additional References
      • de Waard, A. (2010). The story of science: a syntagmatic/paradigmatic analysis of scientific text. In Proceedings of the AMICUS Workshop (pp. 36-41).
      • Gentleman, R., & Lang, D. T. (2007). Statistical analyses and reproducible research. Journal of Computational and Graphical Statistics, 16(1).
      • Freire, J. (2012). Making computations and publications reproducible with VisTrails. Computing in Science & Engineering, 14(4), 18-25.
      • Kevles, D. J. (2000). The Baltimore case: A trial of politics, science, and character. WW Norton & Company.
      • King, G. (1995). Replication, replication. PS: Political Science & Politics, 28(03), 444-452.
      • McCullough, B. D. (2009). Open access economics journals and the market for reproducible economic research. Economic Analysis and Policy, 39(1), 117-126.
  • 30. Questions?
      E-mail: escience@mit.edu
      Web: informatics.mit.edu
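
Slide 26, item 1c, proposes cryptographic evidence that the data retrieved via an identifier is the same data that fed the published analysis. The following is a minimal sketch of that fixity check, assuming the author recorded a SHA-256 digest of the input data alongside the analysis output. The function names `sha256_digest` and `matches_recorded_digest` and the sample bytes are illustrative and do not come from the talk; signing the output is left out.

```python
import hashlib


def sha256_digest(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def matches_recorded_digest(retrieved: bytes, recorded_digest: str) -> bool:
    """Check that retrieved data hashes to the digest recorded at analysis time."""
    return sha256_digest(retrieved) == recorded_digest.lower()


if __name__ == "__main__":
    # Hypothetical data instance used as input to the published analysis.
    analysis_input = b"id,score\n1,0.52\n2,0.47\n"
    recorded = sha256_digest(analysis_input)

    # Hypothetical copies later retrieved via the data identifier.
    retrieved_copy = b"id,score\n1,0.52\n2,0.47\n"
    altered_copy = b"id,score\n1,0.52\n2,0.74\n"

    print(matches_recorded_digest(retrieved_copy, recorded))
    print(matches_recorded_digest(altered_copy, recorded))
```

A byte-level digest like this flags any change at all, including harmless reformatting, so it is stricter than the semantic equivalence named in supporting claim 1 on the same slide; it illustrates the fixity property rather than a full verification scheme.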