ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do open science and who gets the credit?


Published on

Keynote given by Carole Goble on 23rd July 2013 at ISMB/ECCB 2013
http://www.iscb.org/ismbeccb2013
How could we evaluate research and researchers? Reproducibility underpins the scientific method: at least in principle, if not practice. The willing exchange of results and the transparent conduct of research can only be expected up to a point in a competitive environment. Contributions to science are acknowledged, but not if the credit is for data curation or software. From a bioinformatics viewpoint, how far could our results be reproducible before the pain is just too high? Is open science a dangerous, utopian vision or a legitimate, feasible expectation? How do we move bioinformatics from a practice where results are post-hoc "made reproducible" to one where they are pre-hoc "born reproducible"? And why, in our computational information age, do we communicate results through fragmented, fixed documents rather than cohesive, versioned releases? I will explore these questions drawing on 20 years of experience in both the development of technical infrastructure for Life Science and the social infrastructure in which Life Science operates.

Published in: Education, Technology


  1. 1. results may vary reproducibility, open science and all that jazz Professor Carole Goble The University of Manchester, UK carole.goble@manchester.ac.uk @caroleannegoble Keynote ISMB/ECCB 2013 Berlin, Germany, 23 July 2013
  2. 2. “knowledge turning” New Insight • life sciences • systems biology • translational medicine • biodiversity • chemistry • heliophysics • astronomy • social science • digital libraries • language analysis [Josh Sommer, Chordoma Foundation] Goble et al Communications in Computer and Information Science 348, 2013
  3. 3. automate: workflows, pipeline & service integrative frameworks scientific software engineering CS SE pool, share & collaborate web systems semantics & ontologies machine readable documentation nanopub
  4. 4. coordinated execution of services, codes, resources transparent, step-wise methods auto documentation, logging reuse variants
  5. 5. http://www.seek4science.org store/organise/link data, models, sops, experiments, publications explore/annotate data, models, sops yellow pages, find peers and experts open and controlled curation & data pooling & credit mgt support catalogue and gateway to local and public resources APIs simulate models governance & policies
  6. 6. • PALS
  7. 7. reproducibility a principle of the scientific method separates scientists from other researchers and normal people http://xkcd.com/242/
  8. 8. datasets data collections algorithms configurations tools and apps codes workflows scripts code libraries services, system software infrastructure, compilers hardware “An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment, [the complete data] and the complete set of instructions which generated the figures.” David Donoho, “Wavelab and Reproducible Research,” 1995 Morin et al Shining Light into Black Boxes Science 13 April 2012: 336(6078) 159-160 Ince et al The case for open computer programs, Nature 482, 2012
  9. 9. • Workshop Track (WK03) What Bioinformaticians need to know about digital publishing beyond the PDF • Workshop Track (WK02): Bioinformatics Cores Workshop, • ICSB Public Policy Statement on Access to Data
  10. 10. hope over experience “an experiment is reproducible until another laboratory tries to repeat it.” Alexander Kohn even computational ones
  11. 11. hand-wringing, weeping, wailing, gnashing of teeth. Nature checklist. Science requirements for data and code availability. attacks on authors, editors, reviewers, publishers, funders, and just about everyone. http://www.nature.com/nature/focus/reproducibility/index.html
  12. 12. 47/53 “landmark” publications could not be replicated [Begley, Ellis Nature, 483, 2012]
  13. 13. Nekrutenko & Taylor, Next-generation sequencing data interpretation: enhancing, reproducibility and accessibility, Nature Genetics 13 (2012) 59% of papers in the 50 highest-IF journals comply with (often weak) data sharing rules. Alsheikh-Ali et al Public Availability of Published Research Data in High-Impact Journals. PLoS ONE 6(9) 2011
  14. 14. 170 journals, 2011-2012. Policy categories (charted separately for data and for code): Required as condition of publication; Required but may not affect decisions; Explicitly encouraged, may be reviewed and/or hosted; Implied; No mention. Stodden V, Guo P, Ma Z (2013) Toward Reproducible Computational Research: An Empirical Analysis of Data and Code Policy Adoption by Journals. PLoS ONE 8(6): e67111. doi:10.1371/journal.pone.0067111
  15. 15. replication gap Out of 18 microarray papers, results from 10 could not be reproduced. More retractions: >15X increase in the last decade; at the current rate, by 2045 as many papers will be retracted as are published. 1. Ioannidis et al., 2009. Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14 2. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html 3. Bjorn Brembs: Open Access and the looming crisis in science https://theconversation.com/open-access-and-the-looming-crisis-in-science-14950
  16. 16. “When I use a word," Humpty Dumpty said in rather a scornful tone, "it means just what I choose it to mean - neither more nor less.” [Lewis Carroll] conceptual replication “show A is true by doing B rather than doing A again” verify but not falsify [Yong, Nature 485, 2012] regenerate the figure replicate rerun repeat re-compute recreate revise regenerate redo restore recycle reuse re-examine reconstruct review repurpose
  17. 17. repeat: same experiment, same lab. replicate (test): same experiment, different set up. reproduce: same experiment, different lab. reuse: different experiment, some of same. Drummond C Replicability is not Reproducibility: Nor is it Good Science, online. Peng RD, Reproducible Research in Computational Science, Science 2 Dec 2011: 1226-1227.
  18. 18. validation: assurance that something meets the needs of a stakeholder, e.g. error measurement, documentation. verification: complies with a regulation, requirement, specification, or imposed condition, e.g. a model. science review: articles, algorithms, methods. technical review: code, data, systems. V. Stodden, “Trust Your Science? Open Your Data and Code!” Amstat News, 1 July 2011
  19. 19. defend repeat Sound Design Collection review1/certify replicate Peer Review Prediction Peer Reuse Execution Publish Result Analysis make&run&document report&review&support review2/compare reproduce transfer reuse * Adapted from Mesirov, J. Accessible Reproducible Research Science 327(5964), 415-416 (2010)
  20. 20. disorganisation “I can’t immediately reproduce the research in my own laboratory. It took an estimated 280 hours for an average user to approximately reproduce the paper. Data/software versions. Workflows are maturing and becoming helpful” Phil Bourne Garijo et al. 2013 Quantifying Reproducibility in Computational Biology: The Case of the Tuberculosis Drugome PLOS ONE under review. fraud Corbyn, Nature Oct 2012 inherent
  21. 21. rigour: reporting & experimental design cherry picking data misapplication of black-box software* software misconfigurations, random seed reporting non-independent bias, poor positive and negative controls dodgy normalisation, arbitrary cut-offs, premature data triage un-validated materials, improper statistical analysis, poor statistical power, stop when “get to the right answer” *8% validation Joppa, et al, Troubling Trends in Scientific Software Use SCIENCE 340 May 2013
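On the “random seed reporting” failure mode: a minimal sketch, assuming a Python/NumPy analysis (not from the talk), of fixing and reporting the seeds so a stochastic result can be re-run exactly:

```python
# Fix and report the seeds that drive a stochastic analysis, so a rerun
# can regenerate the same numbers. The seed value is arbitrary/illustrative.
import random

import numpy as np

SEED = 20130723  # report this constant in the methods section
random.seed(SEED)
np.random.seed(SEED)

sample = np.random.normal(size=5)  # same five numbers on every run
print(f"seed={SEED} sample={sample}")
```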
  22. 22. http://www.nature.com/authors/policies/checklist.pdf
  23. 23. • anyone anything anytime • publication access, data, models, source codes, resources, transparent methods, standards, formats, identifiers, apis, licenses, education, policies • “accessible, intelligible, assessable, reusable” http://royalsociety.org/policy/projects/science-public-enterprise/report/
  24. 24. G8 open data charter http://opensource.com/government/13/7/open-data-charter-g8
  25. 25. regulation of science institution cores public services libraries republic of science* *Merton’s four norms of scientific behaviour (1942)
  26. 26. a meta-manifesto (I) • all X should be available and assessable forever • the copyright of X should be clear • X should have citable, versioned identifiers • researchers using X should visibly credit X’s creators • credit should be assessable and count in all assessments • X should be curated, available, linked to all necessary materials, and intelligible What’s the real issue?
  27. 27. we do pretty well • major public data repositories • multiple declarations for depositing data • thriving open source community • plethora of data standardisation efforts • core facilities • heroic data campaigns • international and national bioinformatics coordination • diy biology movement • great stories: Shiga-Toxin strain of E. coli, Hamburg, May 2011, China BGI open data crowd-sourcing effort • Oh, wait… University of Münster/University of Göttingen squabble http://www.nature.com/news/2011/110721/full/news.2011.430.html
  28. 28. hard: patient data (inter)national complications bleeding heart paternalism defensive research informed consent fortresses [John Wilbanks] http://www.broadinstitute.org/files/news/pdfs/GAWhitePaperJune3.pdf Kotz, J. SciBX 5(25) 2012
  29. 29. massive centralisation – clouds, curated core facilities long tail massive decentralisation – investigator held datasets fragmentation & fragility a data scarcity at point of delivery RIP data quality/trust/utility Acta Crystallographica section B or C data/code as first class citizen
  30. 30. we are not bad people we make progress there was never a golden age there never is
  31. 31. a reproducibility paradox big, fast, complicated, multi-step, multi-type multi-field expectations of reproducibility diy publishing greater access
  32. 32. pretty stories shiny results feedback loop announce a result, convince us it’s correct novel, attention grabbing neat, only positive review: the direction of science, the next paper, how I would do it. reject papers purely based on public data obfuscate to avoid scrutiny PLoS and F1000 counter
  33. 33. the scientific sweatshop no resources, time, accountability getting it published not getting it right game changing benefit to justify disruption
  34. 34. citation distortion Micropublications arXiv reference Clark et al Micropublications 2013 arXiv:1305.3506 [Tim Clark] Greenberg How citation distortions create unfounded authority: analysis of a citation network. British Medical Journal 2009, 339:b2680. Simkin, Roychowdhury Stochastic modeling of citation slips. Scientometrics 2005, 62(3):367-384.
  35. 35. independent replication studies self-correcting science “blue collar • hostility • hard • resource intensive • no funding, time, recognition, place to publish • invisible to science” originators John Quackenbush
  36. 36. independent review self-correcting science “blue collar • hostility • hard • resource intensive • no funding, time, recognition, place to publish • invisible to science” originators John Quackenbush
  37. 37. what is the point: “no one will want it” “the questions don’t change but the answers do”* • in two years’ time when the paper is written • reviewers want additional work • statistician wants more runs • analysis may need to be repeated • post-doc leaves, student arrives • new data, revised data • updated versions of algorithms/codes quid pro quo citizenship • trickle down theory: more open more use more credit* others might • meta-analysis • novel discovery • other methods * Dan Reed
  38. 38. emerging reproducible system ecosystem App Store needed! instrumented desktop tools hosted services packaging and archiving repositories, catalogues online sharing platforms integrated authoring integrative frameworks XworX ReproZip Sweave
  39. 39. integrated database and journal http://www.gigasciencejournal.com copy editing computational workflows: from 10 scripts + 4 modules + >20 parameters (2-3 months) to Galaxy workflows (2-3 weeks) made reproducible galaxy.cbiit.cuhk.edu.hk [Peter Li]
  40. 40. supporting data reproducibility Open-Paper linked to DOI Open-Data: data sets, 78GB CC0 data, DOI:10.1186/2047-217X-1-18, >11000 accesses to DOI:10.5524/100038 Analyses: Open-Pipelines, Open-Workflows, DOI:10.5524/100044 Open-Review: 8 reviewers tested data in ftp server & named reports published Open-Code: enabled code to be picked apart by bloggers in wiki http://homolog.us/wiki/index.php?title=SOAPdenovo2 Code in sourceforge under GPLv3: >5000 downloads http://soapdenovo2.sourceforge.net/ [Scott Edmunds]
  41. 41. Here is What I Want – The Paper As Experiment [Phil Bourne] 0. Full text of PLoS papers stored in a database 1. User clicks on thumbnail; a link brings up figures from the paper 2. Clicking the paper figure retrieves data from the PDB, which is analyzed; metadata and a webservices call provide a renderable image that can be annotated 3. A composite view of journal and database content results; selecting a feature provides a database/literature mashup 4. The composite view has links to pertinent blocks of literature text and back to the PDB; that leads to new papers. PLoS Comp. Biol. 2005 1(3) e34
  42. 42. "A single pass approach to reducing sampling variation, removing errors, and scaling de novo assembly of shotgun sequences" http://arxiv.org/abs/1203.4802 born reproducible http://ged.msu.edu/papers/2012-diginorm/ http://ivory.idyll.org/blog/replication-i.html [C. Titus Brown]
  43. 43. made reproducible http://getutopia.com [Pettifer, Attwood]
  44. 44. The Research Lifecycle Authoring Tools Lab Notebooks Data Capture Software Repositories Analysis Tools Scholarly Communication Visualization IDEAS – HYPOTHESES – EXPERIMENTS – DATA – ANALYSIS – COMPREHENSION – DISSEMINATION Commercial & Public Tools Discipline-Based Metadata Standards Git-like Resources By Discipline Community Portals Data Journals New Reward Systems Training Institutional Repositories Commercial Repositories [Phil Bourne]
  45. 45. message #1: lower friction born reproducible Process = (Interest / Friction) × number of people reached (the neylon equation) Cameron Neylon, BOSC 2013, http://cameronneylon.net/
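Rendering the slide’s equation in LaTeX; treating Interest over Friction as a fraction multiplied by reach is an assumption from the slide layout, not spelled out in the source:

```latex
\[
\text{Process} \;=\; \frac{\text{Interest}}{\text{Friction}} \times N_{\text{people reached}}
\]
```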
  46. 46. 4+1 architecture of reproducibility “development” view “logical” view social scenarios “process” view “physical” view
  47. 47. “logical view” rigour reporting reassembly recognition review reuse resources responsibility reskilling
  48. 48. reporting documentation availability
  49. 49. observations • the strict letter of the law • (methods) modeller/ workflow makers vs (data) experimentalists • young researchers, support from PIs • buddy reproducibility testing, curation help • just enough just in time • staff leaving and project ends • public scrutiny, competition • decaying local systems • long term safe haven commitment • funder commitment from the start
  50. 50. (Lusch, Vargo 2008) (Harris and Miller 2011) (Nowak 2006) (Clutton-Brock 2009) (Tenopir et al 2011) (Borgman 2012) (Malone 2010) (Benkler 2011) [Kristian Garza] (Thomson, Perry, and Miller 2009) (Wood and Gray 1991) (Roberts and Bradley 1991) (Shrum and Chompalov 2007)
  51. 51. scientific ego-system trust, reciprocity, collaboration to compete blame scooped uncredited misinterpretation scrutiny cost loss distraction left behind Merton’s four norms of scientific behaviour (1942) dependency fame competitive advantage productivity credit adoption kudos for love Fröhlich’s principles of scientific communication (1998) Malone, Laubacher & Dellarocas The Collective Intelligence Genome, Sloan Management Review,(2010)
  52. 52. local asset economies economics of scarce prized commodities • local investment – protective • collective purchasing trade – share • sole provider – broadcast [Nielson] [Roffel] (Lusch, Vargo 2008) (Harris and Miller 2011)
  53. 53. asymmetrical reciprocity • hugging • flirting • voyeurism • inertia • sharing creep • credit drift • local control • code throwaway family friends acquaintances strangers rivals ex-friends Tenopir, et al. Data Sharing by Scientists: Practices and Perceptions. PLoS ONE 6(6) 2012 Borgman The conundrum of sharing research data, JASIST 2012
  54. 54. 10 January 2013 | Vol 493 | Nature | 159 recognition “all research products and all scholarly labour are equally valued except by promotion and review committees”
  55. 55. message #2 visible reciprocity contract citation is like ♥ not $ large data providers infrastructure codes “click and run” instrument platforms make credit count Rung, Brazma Reuse of public genome-wide gene expression data Nature Reviews Genetics 2012 Duck et al bioNerDS: exploring bioinformatics' database and software use through literature mining. BMC Bioinformatics. 2013 Piwowar et al Sharing Detailed Research Data Is Associated with Increased Citation Rate PLoS ONE 2007
  56. 56. Victoria Stodden, AMP 2011 http://www.stodden.net/AMP2011/, Workshop: Reproducible Research: Tools and Strategies for Scientific Computing Special Issue Reproducible Research Computing in Science and Engineering July/August 2012, 14(4)
  57. 57. in perpetuity “it’s not ready yet”, “I need another publication” shame “it’s too ugly”, “I didn’t work out the details” effort “we don’t have the skills/resources”, “the reviewers don’t need it” loss “the student left”, “we can’t find it” insecurity “you wouldn’t understand it”, “I made it so no one could understand it”. Randall J. LeVeque, Top Ten Reasons To Not Share Your Code (and why you should anyway) April 2013 SIAM News
  58. 58. the goldilocks paradox “the description needed to make an experiment reproducible is too much for the author and too little for the reader” just enough just in time Galaxy Luminosity Profiling José Enrique Ruiz (IAA-CSIC)
  59. 59. http://www.rightfield.org.uk 1. Enrich Spreadsheet Template 2. Use in Excel or OpenOffice 3. Extract and Process RDF Graph reducing the friction of curation
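To picture step 3: extraction walks the ontology-annotated cells and emits RDF triples. A minimal sketch with Python and rdflib; the annotation pairs, namespace, and the annotatedWith property are illustrative stand-ins, not RightField’s actual vocabulary:

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDFS

# Hypothetical (cell value, ontology term URI, label) annotations; in practice
# these would be read out of the RightField-enriched spreadsheet.
annotations = [
    ("sample_1", "http://purl.obolibrary.org/obo/CHEBI_17234", "glucose"),
    ("sample_2", "http://purl.obolibrary.org/obo/CHEBI_16236", "ethanol"),
]

EX = Namespace("http://example.org/experiment/")  # illustrative namespace

g = Graph()
g.bind("ex", EX)
for cell_value, term_uri, label in annotations:
    g.add((EX[cell_value], EX.annotatedWith, URIRef(term_uri)))  # cell -> term
    g.add((URIRef(term_uri), RDFS.label, Literal(label)))        # human label

print(g.serialize(format="turtle"))  # the processable RDF graph
```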
  60. 60. anonymous reuse is hard nearly always negotiated
  61. 61. reskilling: software making practices Zeeya Merali , Nature 467, 775-777 (2010) | doi:10.1038/467775a Computational science: ...Error…why scientific programming does not compute. “As a general rule, researchers do not test or document their programs rigorously, and they rarely release their codes, making it almost impossible to reproduce and verify published results generated by scientific software”
  62. 62. http://matt.might.net/articles/crapl/ http://sciencecodemanifesto.org/
  63. 63. Greg Wilson better software C Titus Brown better research data carpentry
  64. 64. a word on reinventing Sean Eddy author of the HMMER and Infernal software suites for sequence analysis innovation is algorithms and methodology. rediscovery of profile stochastic context-free grammars. (re)coding is reproducing. reinvent what is innovative. reuse what is utility. Goble, seven deadly sins of bioinformatics, 35.5K views http://www.slideshare.net/dullhunk/the-seven-deadly-sins-of-bioinformatics
  65. 65. message #3 placing value on reproducibility take action Organisation Culture Execution Metrics Process [Daron Green]
  66. 66. (re)assembly Gather the bits together Find and get the bits Bits broken/changed/lost Have other bits Understand the bits and how to put them together Bits won’t work together What bit is critical? Can I use a different tool? Can’t operate the tool Whose job is this?
  67. 67. specialist codes gateways libraries, platforms, tools data collections catalogues commodity platforms my data my process my codes integrative frameworks service-based software repositories (cloud) hosted services
  68. 68. a snapshot spectrum, from original (Orig) to different (Diff): repeat (re-run), replicate (regenerate), reproduce (recreate), reuse (repurpose/extend); across Actors, Results, and the Experiment: Materials (datasets, parameters, seeds), Methods (techniques, algorithms, spec of the steps), and Setup: Instruments (codes, services, scripts, underlying libraries), Laboratory (sw and hw infrastructure, systems software, integrative platforms)
  69. 69. materials use workflows capture the steps method instruments and laboratory standardised pipelines auto record of experiment and set-up report & variant reuse buffered infrastructure BioSTIF interactive local & 3rd party independent resources shielded heterogeneous infrastructures
  70. 70. use provenance the link between computation and results static verifiable record track changes repair partially repeat/reproduce carry citation calc data quality/trust select data to keep/release compare diffs/discrepancies [figure: two provenance traces, (i) Trace A and (ii) Trace B, compared node by node] W3C PROV standard. PDIFF: comparing provenance traces to diagnose divergence across experimental results [Woodman et al, 2011]
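As a concrete picture of that static, verifiable record: a minimal sketch of one computational step in the W3C PROV model, using the Python prov package (pip install prov); the entity and activity names are hypothetical:

```python
# A static, verifiable record of one computational step in the W3C PROV model.
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ex", "http://example.org/")

raw = doc.entity("ex:raw-reads")          # input dataset
clean = doc.entity("ex:filtered-reads")   # derived dataset
step = doc.activity("ex:quality-filter")  # the computation that links them

doc.used(step, raw)                       # the step consumed the input
doc.wasGeneratedBy(clean, step)           # the step produced the output
doc.wasDerivedFrom(clean, raw)            # direct data lineage

print(doc.get_provn())                    # PROV-N serialization of the record
```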
  71. 71. “an experiment is as transparent as the visibility of its steps” black boxes closed codes & services, proprietary licences, magic cloud services, manual manipulations, poor provenance/version reporting, unknown peer review, mis-use, platform calculation dependencies Joppa et al SCIENCE 340 May 2013; Morin et al Science 336 2012
  72. 72. dependencies & change degree of self-contained preservation open world, distributed, alien hosted data/software versions and accessibility hamper replication spin-rate of versions [Zhao et al. e-Science 2012] “all you need to do is copy the box that the internet is in”
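One modest defence against the spin-rate of versions: archive, next to each result, a record of the environment that produced it. A minimal sketch in Python; the package names are illustrative assumptions:

```python
# Snapshot the software environment behind a result so a later reader can see
# exactly which versions were used. Package names here are just examples.
import importlib.metadata as md  # Python 3.8+
import json
import platform
import sys

packages = ["numpy", "rdflib", "prov"]
record = {
    "python": sys.version.split()[0],
    "platform": platform.platform(),
    "packages": {pkg: md.version(pkg) for pkg in packages},
}
print(json.dumps(record, indent=2))  # store this file alongside the results
```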
  73. 73. preservation & distribution portability / packaging (VM) availability (open) gather dependencies capture steps variability sameness description intelligibility Reproducibility framework [Adapted Freire, 2013]
  74. 74. packaging bickering byte execution (virtual machine, black box, repeat) vs description (archived record, white box, reproduce) data+compute co-location cloud packaging ELIXIR Embassy Cloud “in-nerd-tia”
  75. 75. big data big compute community facilities cloud host costs and confidence data scales dump and file capability
  76. 76. message #4: “the reproducible window” all experiments become less reproducible over time icanhascheezburger.com how, why and what matters benchmarks for codes plan to preserve repair on demand description persists use frameworks results may vary partial replication approximate reproduction verification Sandve, Nekrutenko, Taylor, Hovig Ten simple rules for reproducible in silico research, PLoS Comp Bio submitted
  77. 77. message #5: puppies aren’t free long term reliability of hosts multiple stewardship fragmented business models reproducibility service industry 24% NAR services unmaintained after three years Schultheiss et al. (2010) PLoS Comp
  78. 78. the meta-manifesto • all X should be available and assessable forever • the copyright of X should be clear • X should have citable, versioned identifiers • researchers using X should visibly credit X’s creators • credit should be assessable and count in all assessments • X should be curated, available, linked to all necessary materials, and intelligible • making X reproducible/open should be from cradle to grave, continuous, routine, and easier • tools/repositories should be made to help, be maintained and be incorporated into working practices • researchers should be able to adapt their working practices, use resources, and be trained to reproduce • cost and responsibility should be transparent, planned for, accounted and borne collectively • we all should start small, be imperfect but take action. Today. http://www.force11.org
  79. 79. • evolution of a body • fork, pull, merge • subpart different cycles, stewardship, authors • refactored granularity • software release practices for workflows, scripts, services, data and articles • thread the salami across parts, repositories and journals • chop up and microattribute research is like software Faculty1000 Jennifer Schopf, Treating Data Like Software: A Case for Production Quality Data, JCDL 2012
  80. 80. http://www.researchobject.org/ http://www.w3.org/community/rosc/ bundles and relates digital resources of a scientific experiment or investigation using standard mechanisms
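A toy illustration of “bundles and relates”: a manifest aggregating a study’s resources with typed links between them. The field names below are invented for this sketch and are not the Research Object specification:

```python
# A made-up manifest illustrating the bundle-and-relate idea behind a
# Research Object: aggregate resources, then annotate how they relate.
import json

manifest = {
    "id": "ro:example-study",
    "aggregates": [
        {"uri": "data/reads.fastq", "type": "Data"},
        {"uri": "workflow/assembly.t2flow", "type": "Workflow"},
        {"uri": "paper/preprint.pdf", "type": "Paper"},
    ],
    "annotations": [
        {"about": "workflow/assembly.t2flow",
         "content": "Generates Figure 2 from data/reads.fastq"},
    ],
}
print(json.dumps(manifest, indent=2))
```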
  81. 81. towards a release app store • checklists for descriptive reproducibility • packaging for multihosted research (executable) components • exchange between tools and researchers • framework for research release and threaded publishing using core standards TT43 Lounge 81
  82. 82. those messages again • lower friction, born reproducible • credit is like love • take action, use (workflow) frameworks • prepare for the reproducible window • puppies aren’t free
  83. 83. final message The revolution is not an apple that falls when it is ripe. You have to make it drop.
  84. 84. acknowledgements • David De Roure • Tim Clark • Sean Bechhofer • Robert Stevens • Christine Borgman • Victoria Stodden • Marco Roos • Jose Enrique Ruiz del Mazo • Oscar Corcho • Ian Cottam • Steve Pettifer • Magnus Rattray • Chris Evelo • Katy Wolstencroft • Robin Williams • Pinar Alper • C. Titus Brown • Greg Wilson • Kristian Garza • Wf4ever, SysMO, BioVel, UTOPIA and myGrid teams • Juliana Freire • Jill Mesirov • Simon Cockell • Paolo Missier • Paul Watson • Gerhard Klimeck • Matthias Obst • Jun Zhao • Pinar Alper • Daniel Garijo • Yolanda Gil • James Taylor • Alex Pico • Sean Eddy • Cameron Neylon • Barend Mons • Kristina Hettne • Stian Soiland-Reyes • Rebecca Lawrence
  85. 85. Mr Cottam 10th anniversary today!
  86. 86. summary [Jenny Cham] https://twitter.com/csmcr/status/361835508994813954
  87. 87. Further Information • myGrid – http://www.mygrid.org.uk • Taverna – http://www.taverna.org.uk • myExperiment – http://www.myexperiment.org • BioCatalogue – http://www.biocatalogue.org • SysMO-SEEK – http://www.sysmo-db.org • RightField – http://www.rightfield.org.uk • UTOPIA Documents – http://www.getutopia.com • Wf4ever – http://www.wf4ever-project.org • Software Sustainability Institute – http://www.software.ac.uk • BioVeL – http://www.biovel.eu • Force11 – http://www.force11.org • http://reproducibleresearch.net • http://reproduciblescience.org
