Reproducibility: A Funder and Data Science Perspective
Philip E. Bourne, PhD, FACMI
University of Virginia
Thanks to Valerie Florence, NIH, for some slides
http://www.slideshare.net/pebourne
peb6a@virginia.edu
NetSci Preworkshop 2017
June 19, 2017
Who Am I Representing And What Is My Bias?
• I am presenting my views, not necessarily those of NIH
• Now leading an institutional data science initiative
• Total data parasite
• Unnatural interest in scholarly communication
• Co-founder and founding EIC of PLOS Computational Biology; open access (OA) advocate
• Previously co-Director of the Protein Data Bank
• Amateur student researcher in scholarly communication
Reproducibility is the responsibility of all stakeholders …
Let's start with researchers …
Reproducibility - Examples From My Own Work
It took several months to replicate this work
… And recently …
Phew…
http://www.sdsc.edu/pb/kinases/
Beyond its value to me (and even then the emphasis is not enough), there is too little incentive to make my work reproducible by others …
6/19/17 7
Tools Fix This Problem, Right?
• Extracted all PMC papers with associated Jupyter notebooks available (see the sketch below)
• Approx. 100
• Took a random sample of 25
• Only 1 ran out of the box
• Several ran with minor modification
• Others lacked libraries, sufficient detail to run, etc.
It takes more than tools. It takes incentives …
Daniel Mietchen 2017, personal communication
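Mietchen's exact method is not detailed here, but a minimal sketch of how such a survey could be scripted is below, assuming the public NCBI E-utilities search endpoint; the query string is illustrative, not the one actually used.

import random
import requests

# NCBI E-utilities search endpoint (public); db=pmc searches PubMed Central.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pmc_ids_matching(term, retmax=500):
    """Return PMC IDs whose records match the query term."""
    params = {"db": "pmc", "term": term, "retmax": retmax, "retmode": "json"}
    reply = requests.get(ESEARCH, params=params, timeout=30)
    reply.raise_for_status()
    return reply.json()["esearchresult"]["idlist"]

# Illustrative query; in 2017 it returned on the order of 100 hits.
ids = pmc_ids_matching('"jupyter notebook"')
sample = random.sample(ids, min(25, len(ids)))  # random sample of 25
print(sample)  # each sampled paper's notebook is then re-executed by hand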
Funders and publishers are the major levers …
What are funders doing? Consider the NIH …
NIH Special Focus Area
https://www.nih.gov/research-training/rigor-reproducibility
Outcomes – General …
Enhancing Reproducibility through Rigor and Transparency (NOT-OD-15-103)
• Clarifies NIH expectations in 4 areas:
  • Scientific premise: describe strengths and weaknesses of prior research
  • Rigorous experimental design: how to achieve robust and unbiased outcomes
  • Consideration of sex and other relevant biological variables
  • Authentication of key biological and/or chemical resources, e.g., cell lines
Outcomes – network-based …
Experiment in Moving from Pipes to Platforms
Sangeet Paul Choudary, https://www.slideshare.net/sanguit
Commons & the FAIR Principles
• The Commons is a virtual platform physically located predominantly on public clouds
• Digital assets (objects) within that system are data, software, narrative, course materials, etc.
• Assets are FAIR – Findable, Accessible, Interoperable, and Reusable (see the example below)
Image: https://www.workitdaily.com/job-search-solution/
Bonazzi and Bourne 2017, PLoS Biol 15(4): e2001818
FAIR: https://www.nature.com/articles/sdata201618
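To make the FAIR bullet concrete, here is a minimal sketch of a FAIR-style descriptor for a digital object, using the schema.org Dataset vocabulary; every identifier and URL in it is an invented placeholder, not an actual Commons record.

import json

# Hypothetical digital object descriptor; all values below are placeholders.
digital_object = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "identifier": "doi:10.0000/example",          # Findable: persistent identifier
    "name": "Example kinase reference data set",  # Findable: searchable metadata
    "url": "https://example.org/data/kinases",    # Accessible: standard protocol (HTTPS)
    "encodingFormat": "text/csv",                 # Interoperable: open, common format
    "license": "https://creativecommons.org/licenses/by/4.0/",  # Reusable: explicit terms
}

print(json.dumps(digital_object, indent=2))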
Just announced …
https://commonfund.nih.gov/sites/default/files/RM-17-026_CommonsPilotPhase.pdf
Current Data Commons Pilots
• Commons Platform Pilots: explore the feasibility of the Commons Platform; facilitate collaboration and interoperability
• Cloud Credit Model: provide access to the cloud via credits to populate the Commons, connecting credits to NIH grants
• Reference Data Sets: make large and/or high-value NIH-funded data sets and tools accessible in the cloud
• Resource Search & Index: develop data & software indexing methods, leveraging BD2K efforts (bioCADDIE et al.) and collaborating with external groups
Commons - Platform Stack
https://datascience.nih.gov/commons
• App store / User Interface
• Software: Services & Tools (scientific analysis tools/workflows)
• Services: APIs, Containers, Indexing
• Data: "Reference" Data Sets; user-defined data
• Compute Platform: Cloud or HPC
Digital Object Compliance spans the stack. A sketch of how these layers might compose follows.
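The snippet below resolves a digital object through an indexing service and hands its bytes to analysis code; the endpoint, response fields, and object ID are hypothetical, invented for illustration rather than taken from an actual Commons API.

import requests

# Hypothetical Services-layer endpoint; not a real Commons API.
INDEX_API = "https://commons.example.org/v1/objects"

def fetch_object(object_id):
    """Resolve an object ID to metadata, then retrieve the bytes it points to."""
    meta = requests.get(f"{INDEX_API}/{object_id}", timeout=30).json()
    data = requests.get(meta["access_url"], timeout=30)  # Data layer: e.g., a cloud bucket
    data.raise_for_status()
    return meta, data.content

# Software layer: analysis code depends only on the object ID, so the same
# workflow runs unchanged whether the Compute Platform is cloud or HPC.
meta, raw_bytes = fetch_object("example-object-id")  # hypothetical ID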
Mapping BD2K Activities to the Commons Platform
https://datascience.nih.gov/commons
The same platform stack as above, annotated with the BD2K activities that map onto it:
• NIH + community-defined data sets (possible FOAs and the CCM)
• BD2K Centers, MODS, HMP & Interoperability Supplements
• Cloud credits model (CCM)
• bioCADDIE/other indexing
• NCI & NIAID cloud pilots
Overarching Questions
• Is the Commons a step towards improved reproducibility?
• Is the Commons approach at odds with other approaches? If not, how best to coordinate?
• Do the pilots enable a full evaluation for a larger-scale implementation?
• How best to evaluate the success of the pilots?
Other Questions
• Is a mix of cloud vendors appropriate?
• How to balance the overall metrics of success?
  • Reproducibility
  • Cost savings
  • Efficiency: centralized vs. distributed data
  • New science
  • User satisfaction
• Data integration and reuse – how to measure?
• Data security
• What are the weaknesses?
Thank You
Acknowledgements
• Vivien Bonazzi, Jennie Larkin, Michelle Dunn, Mark Guyer, Allen Dearry, Sonynka Ngosso, Tonya Scott, Lisa Dunneback, Vivek Navale (CIT/ADDS)
• NLM/NCBI: Patricia Brennan, Mike Huerta, George Komatsoulis
• NHGRI: Eric Green, Valentina di Francesco
• NIGMS: Jon Lorsch, Susan Gregurick, Peter Lyster
• CIT: Andrea Norris
• NIH Common Fund: Jim Anderson, Betsy Wilder, Leslie Derr
• NCI Cloud Pilots/GDC: Warren Kibbe, Tony Kerlavage, Tanja Davidsen
• Commons Reference Data Set Working Group: Weiniu Gan (HL), Ajay Pillai (HG), Elaine Ayres (BITRIS), Sean Davis (NCI), Vinay Pai (NIBIB), Maria Giovanni (AI), Leslie Derr (CF), Claire Schulkey (AI)
• RIWG Core Team: Ron Margolis (DK), Ian Fore (NCI), Alison Yao (AI), Claire Schulkey (AI), Eric Choi (AI)
• OSP: Dina Paltoo
