Keynote given by Carole Goble on 23rd July 2013 at ISMB/ECCB 2013
http://www.iscb.org/ismbeccb2013
How could we evaluate research and researchers? Reproducibility underpins the scientific method: at least in principle, if not in practice. The willing exchange of results and the transparent conduct of research can only be expected up to a point in a competitive environment. Contributions to science are acknowledged, but not if the credit is for data curation or software. From a bioinformatics viewpoint, how far could our results be reproducible before the pain is just too high? Is open science a dangerous, utopian vision or a legitimate, feasible expectation? How do we move bioinformatics from a field where results are post hoc "made reproducible" to one where they are "born reproducible"? And why, in our computational information age, do we communicate results through fragmented, fixed documents rather than cohesive, versioned releases? I will explore these questions, drawing on 20 years of experience in both the development of technical infrastructure for Life Science and the social infrastructure in which Life Science operates.
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do open science and who gets the credit?
1. results may vary
reproducibility, open science
and all that jazz
Professor Carole Goble
The University of Manchester, UK
carole.goble@manchester.ac.uk
@caroleannegoble
Keynote ISMB/ECCB 2013 Berlin, Germany, 23 July 2013
2. “knowledge turning”
New Insight
• life sciences
• systems biology
• translational
medicine
• biodiversity
• chemistry
• heliophysics
• astronomy
• social science
• digital libraries
• language analysis
[Josh Sommer, Chordoma Foundation]
Goble et al Communications in Computer and Information Science 348, 2013
3. automate: workflows,
pipeline & service
integrative frameworks
scientific
software
engineering
CS
SE
pool, share &
collaborate
web systems
semantics & ontologies
machine readable documentation
nanopub
4. coordinated execution of services,
codes, resources
transparent, step-wise methods
auto documentation, logging
reuse variants
7. reproducibility
a principle of the
scientific method
separates scientists
from other researchers
and normal people
http://xkcd.com/242/
8. datasets
data collections
algorithms
configurations
tools and apps
codes
workflows
scripts
code libraries
services,
system software
infrastructure,
compilers
hardware
“An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment, [the complete data] and the complete set of instructions which generated the figures.”
David Donoho, “Wavelab and Reproducible Research,” 1995
Morin et al Shining Light into Black Boxes
Science 13 April 2012: 336(6078) 159-160
Ince et al The case for open computer
programs, Nature 482, 2012
9. • Workshop Track (WK03) What
Bioinformaticians need to know
about digital publishing beyond the
PDF
• Workshop Track (WK02):
Bioinformatics Cores Workshop,
• ISCB Public Policy Statement on Access to Data
10. hope over experience
“an experiment is reproducible until
another laboratory tries to repeat it.”
Alexander Kohn
even computational ones
11. hand-wringing,
weeping, wailing,
gnashing of teeth.
Nature checklist.
Science
requirements for
data and code
availability.
attacks on authors,
editors, reviewers,
publishers, funders,
and just about
everyone.
http://www.nature.com/nature/focus/reproducibility/index.html
13. Nekrutenko & Taylor, Next-generation sequencing data interpretation: enhancing reproducibility and accessibility, Nature Reviews Genetics 13 (2012)
59% of papers in the 50 highest-IF journals comply with
(often weak) data sharing rules.
Alsheikh-Ali et al Public Availability of Published Research Data in High-Impact Journals.
PLoS ONE 6(9) 2011
14. 170 journals, 2011-2012
Required as condition of publication
Required but may not affect decisions
Explicitly encouraged may be reviewed
and/or hosted
Implied
No mention
Stodden V, Guo P, Ma Z (2013) Toward Reproducible Computational
Research: An Empirical Analysis of Data and Code Policy Adoption by
Journals. PLoS ONE 8(6): e67111. doi:10.1371/journal.pone.0067111
15. replication gap
Out of 18 microarray papers, results
from 10 could not be reproduced
More retractions:
>15x increase in the last decade
At the current rate of increase, by 2045 as many
papers will be retracted as are published
1. Ioannidis et al., 2009. Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14
2. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
3. Bjorn Brembs: Open Access and the looming crisis in science https://theconversation.com/open-access-and-the-looming-crisis-in-science-14950
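The 2045 crossover claim is a simple exponential extrapolation, and easy to sanity-check. A toy calculation, assuming illustrative round numbers (the baseline counts and per-decade growth factors below are invented for this sketch, not the actual bibliometric data behind Brembs's projection):

```python
# Toy extrapolation: if retractions grow ~15x per decade while total
# publications grow ~2x per decade, the retraction curve eventually
# overtakes the publication curve. All numbers here are illustrative.
def years_until_crossover(papers_per_year, retractions_per_year,
                          paper_growth_per_decade=2.0,
                          retraction_growth_per_decade=15.0):
    """Whole years until annual retractions match annual publications."""
    years = 0
    while retractions_per_year < papers_per_year:
        # apply the per-year equivalent of each per-decade growth factor
        papers_per_year *= paper_growth_per_decade ** 0.1
        retractions_per_year *= retraction_growth_per_decade ** 0.1
        years += 1
    return years

# e.g. 1,000,000 papers/year and 500 retractions/year today
print(years_until_crossover(1_000_000, 500))
```

With these made-up inputs the curves cross after a few decades, which is the shape of the argument regardless of the exact year.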
16. “When I use a word," Humpty Dumpty
said in rather a scornful tone, "it means
just what I choose it to mean - neither
more nor less.”
[Lewis Carroll]
conceptual replication
“show A is true by doing B
rather than doing A again”
verify but not falsify
[Yong, Nature 485, 2012]
regenerate
the figure
replicate
rerun
repeat
re-compute
recreate
revise
regenerate
redo
restore
recycle
reuse
re-examine
reconstruct
review
repurpose
17. repeat: same experiment, same lab
replicate (test): same experiment, different set-up
reproduce: same experiment, different lab
reuse: different experiment, some of the same
Drummond C Replicability is not Reproducibility: Nor is it Good Science, online
Peng RD, Reproducible Research in Computational Science Science 2 Dec 2011: 1226-1227.
18. validation: assurance it meets the needs of a stakeholder, e.g. error measurement, documentation
verification: complies with a regulation, requirement, specification, or imposed condition, e.g. a model
science review: articles, algorithms, methods
technical review: code, data, systems
V. Stodden, “Trust Your Science? Open Your Data and Code!”
Amstat News, 1 July 2011
20. disorganisation
“I can’t immediately reproduce the
research in my own laboratory. It
took an estimated 280 hours for
an average user to approximately
reproduce the paper.
Data/software versions. Workflows
are maturing and becoming
helpful”
Phil Bourne
Garijo et al. 2013 Quantifying Reproducibility in Computational Biology:
The Case of the Tuberculosis Drugome PLOS ONE under review.
fraud Corbyn, Nature Oct 2012
inherent
21. rigour: reporting & experimental design
cherry-picking data
misapplication of black-box software*
software misconfigurations, unreported random seeds
non-independent bias, poor positive and negative controls
dodgy normalisation, arbitrary cut-offs, premature data triage
un-validated materials, improper statistical analysis, poor
statistical power, stopping when you “get to the right answer”
*8% validation: Joppa et al, Troubling Trends in Scientific Software Use, Science 340, May 2013
24. G8 open data charter
http://opensource.com/government/13/7/open-data-charter-g8
25. regulation of science
institution cores public services
libraries
republic of science*
*Merton’s four norms of scientific behaviour (1942)
26. a meta-manifesto (I)
• all X should be available and assessable
forever
• the copyright of X should be clear
• X should have citable, versioned identifiers
• researchers using X should visibly credit X’s
creators
• credit should be assessable and count in all
assessments
• X should be curated, available, linked to all
necessary materials, and intelligible
What’s the real issue?
27. we do pretty well
• major public data repositories
• multiple declarations for depositing data
• thriving open source community
• plethora of data standardisation efforts
• core facilities
• heroic data campaigns
• international and national bioinformatics coordination
• diy biology movement
• great stories: Shiga-toxin strain of E. coli, Hamburg, May 2011; China BGI open data crowd-sourcing effort.
• Oh, wait… University of Münster/University of Göttingen squabble http://www.nature.com/news/2011/110721/full/news.2011.430.html
28. hard: patient data
(inter)national complications
bleeding heart paternalism
defensive research
informed consent
fortresses
[John Wilbanks]
http://www.broadinstitute.org/files/news/pdfs/GAWhitePaperJune3.pdf
Kotz, J. SciBX 5(25) 2012
29. massive centralisation – clouds, curated core facilities
long tail: massive decentralisation – investigator-held datasets
fragmentation & fragility
data scarcity at the point of delivery
RIP data
quality/trust/utility
Acta Crystallographica section B or C
data/code as first class citizen
30. we are not bad people
we make progress
there was never a golden age
there never is
31. a reproducibility paradox
big, fast,
complicated,
multi-step,
multi-type
multi-field
expectations
of
reproducibility
diy publishing
greater access
32. pretty stories, shiny results: a feedback loop
announce a result, convince us it's correct
novel, attention grabbing
neat, only positive
review: the direction of
science, the next paper,
how I would do it.
reject papers purely based
on public data
obfuscate to avoid scrutiny
PLoS and F1000 counter
33. the scientific sweatshop
no resources, time, accountability
getting it published not getting it right
game changing benefit to justify disruption
34. citation distortion
Clark et al, Micropublications, 2013, arXiv:1305.3506 [Tim Clark]
Greenberg, How citation distortions create unfounded authority: analysis of a citation network, British Medical Journal 2009, 339:b2680.
Simkin, Roychowdhury, Stochastic modeling of citation slips, Scientometrics 2005, 62(3):367-384.
35. independent replication studies
self-correcting science
“blue collar science”:
• hostility
• hard
• resource intensive
• no funding, time, recognition, place to publish
• invisible to originators
John Quackenbush
36. independent review
self-correcting science
“blue collar science”:
• hostility
• hard
• resource intensive
• no funding, time, recognition, place to publish
• invisible to originators
John Quackenbush
37. what is the point: “no one will want it”
“the questions don’t change but the answers do”*
• two years time when the paper is written
• reviewers want additional work
• statistician wants more runs
• analysis may need to be repeated
• post-doc leaves, student arrives
• new data, revised data
• updated versions of algorithms/codes
quid pro quo citizenship
• trickle down theory: more open more use more credit*
others might
• meta-analysis
• novel discovery
• other methods
* Dan Reed
38. emerging reproducible system ecosystem
App Store needed!
instrumented desktop tools
hosted services
packaging and archiving
repositories, catalogues
online sharing platforms
integrated authoring
integrative frameworks
XworX
ReproZip
Sweave
39.
40.
41. integrated database and journal
http://www.gigasciencejournal.com
copy-editing computational workflows, made reproducible:
from 10 scripts + 4 modules + >20 parameters (2-3 months)
to Galaxy workflows (2-3 weeks)
galaxy.cbiit.cuhk.edu.hk
[Peter Li]
42. supporting data reproducibility
Open-Paper, linked by DOI to Open-Data
DOI:10.1186/2047-217X-1-18, >11000 accesses
Open-Data: data sets, 78GB CC0 data, DOI:10.5524/100038
Analyses: Open-Pipelines, Open-Workflows, DOI:10.5524/100044
Open-Review: 8 reviewers tested data on the ftp server & named reports published
Open-Code: code picked apart by bloggers in a wiki
http://homolog.us/wiki/index.php?title=SOAPdenovo2
Code in sourceforge under GPLv3: >5000 downloads http://soapdenovo2.sourceforge.net/
[Scott Edmunds]
43. Here is What I Want – The Paper As Experiment
0. Full text of PLoS papers stored in a database
1. A link brings up figures from the paper
2. Clicking the paper figure retrieves data from the PDB, which is analyzed
3. A composite view of journal and database content results
4. The composite view has links to pertinent blocks of literature text and back to the PDB
Within the composite view:
1. User clicks on a thumbnail
2. Metadata and a web-services call provide a renderable image that can be annotated
3. Selecting a feature provides a database/literature mashup
4. That leads to new papers
[Phil Bourne]
PLoS Comp. Biol. 2005 1(3) e34
44. "A single pass approach to reducing sampling variation, removing errors, and
scaling de novo assembly of shotgun sequences"
http://arxiv.org/abs/1203.4802
born reproducible
http://ged.msu.edu/papers/2012-diginorm/
http://ivory.idyll.org/blog/replication-i.html
[C. Titus Brown]
48. message #1: lower friction
born reproducible
the neylon equation:
reach = (interest × number of people) / friction
Cameron Neylon, BOSC 2013, http://cameronneylon.net/
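The equation on this slide can be sketched numerically. Reading it as reach = (interest × number of people) / friction (this functional form and the variable names are my reconstruction of the slide's fragmented layout, not Neylon's exact notation):

```python
def reach(interest, n_people, friction):
    """Back-of-envelope reach of a shared research output.
    'interest' is the fraction of the audience who care; 'friction'
    scales the effort needed to actually obtain and reuse the output.
    (A reconstruction of the slide's equation, not an official formula.)"""
    return interest * n_people / friction

# Halving the friction doubles the effective reach:
assert reach(0.5, 10_000, 2.0) == 2500.0
assert reach(0.5, 10_000, 1.0) == 5000.0
```

The point of message #1 falls out of the arithmetic: you rarely control interest or audience size, but you do control friction.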
49. 4+1 architecture of reproducibility
“development” view
“logical” view
social scenarios
“process” view
“physical” view
52. observations
• the strict letter of the law
• (methods) modeller/ workflow makers vs (data)
experimentalists
• young researchers, support from PIs
• buddy reproducibility testing, curation help
• just enough just in time
• staff leaving and project ends
• public scrutiny, competition
• decaying local systems
• long term safe haven commitment
• funder commitment from the start
53. (Lusch, Vargo 2008)
(Harris and Miller 2011)
(Nowak 2006)
(Clutton-Brock 2009)
(Tenopir et al 2011)
(Borgman 2012)
(Malone 2010)
(Benkler 2011)
[Kristian Garza]
(Thomson, Perry, and Miller 2009)
(Wood and Gray 1991)
(Roberts and Bradley 1991)
(Shrum and Chompalov 2007)
54. scientific ego-system
trust, reciprocity, collaboration to compete
blame
scooped
uncredited
misinterpretation
scrutiny
cost
loss
distraction
left behind
Merton’s four norms of scientific behaviour (1942)
dependency
fame
competitive
advantage
productivity
credit
adoption
kudos
for love
Fröhlich’s principles of scientific communication (1998)
Malone, Laubacher & Dellarocas The Collective Intelligence Genome, Sloan Management Review,(2010)
55. local asset economies
economics of scarce prized
commodities
• local investment
– protective
• collective purchasing
trade
– share
• sole provider
– broadcast
[Nielson] [Roffel]
(Lusch, Vargo 2008)
(Harris and Miller 2011)
57. 10 January 2013 | Vol 493 | Nature | 159
recognition
“all research products and all scholarly labour
are equally valued except by promotion and
review committees”
58. message #2
visible reciprocity contract
citation is like ♥ not $
large data providers
infrastructure codes
“click and run”
instrument platforms
make credit count
Rung, Brazma, Reuse of public genome-wide gene expression data, Nature Reviews Genetics 2012
Duck et al bioNerDS: exploring bioinformatics' database and software use through literature mining.
BMC Bioinformatics. 2013
Piwowar et al Sharing Detailed Research Data Is Associated with Increased Citation Rate PLoS
ONE 2007
59. Victoria Stodden, AMP 2011 http://www.stodden.net/AMP2011/,
Workshop: Reproducible Research: Tools and Strategies for Scientific Computing
Special Issue Reproducible Research Computing in Science and Engineering July/August 2012, 14(4)
60. in perpetuity
“it's not ready yet”, “I need another publication”
shame
“it's too ugly”, “I didn't work out the details”
effort
“we don’t have the skills/resources”, “the reviewers
don’t need it”
loss
“the student left”, “we can’t find it”
insecurity
“you wouldn’t understand it”, “I made it so no one
could understand it”.
Randall J. LeVeque, Top Ten Reasons To Not Share Your Code (and why you should anyway), SIAM News, April 2013
61. the goldilocks paradox
“the description needed
to make an experiment
reproducible is too much
for the author and too
little for the reader”
just enough just in time
Galaxy Luminosity Profiling
José Enrique Ruiz (IAA-CSIC)
64. reskilling: software making practices
Zeeya Merali , Nature 467, 775-777 (2010) | doi:10.1038/467775a
Computational science: ...Error…why scientific programming does not compute.
“As a general rule, researchers do not test or document their programs rigorously, and they rarely release their codes, making it almost impossible to reproduce and verify published results generated by scientific software”
67. a word on reinventing
Sean Eddy
author HMMER and Infernal software
suites for sequence analysis
innovation is algorithms and methodology.
rediscovery of profile stochastic context-free grammars
(re)coding is reproducing.
reinvent what is innovative.
reuse what is utility.
Goble, seven deadly sins of bioinformatics, 35.5K views
http://www.slideshare.net/dullhunk/the-seven-deadly-sins-of-bioinformatics
68. message #3
placing value on reproducibility
take action
Organisation
Culture
Execution
Metrics
Process
[Daron Green]
69. (re)assembly
Gather the bits together
Find and get the bits
Bits broken/changed/lost
Have other bits
Understand the bits and
how to put together
Bits won’t work together
What bit is critical?
Can I use a different
tool?
Can’t operate the tool
Whose job is this?
70. specialist codes
gateways
libraries, platforms, tools
data collections
catalogues
commodity
platforms
my data
my process
my codes
integrative
frameworks
service based
software
repositories
(cloud)
hosted
services
72. materials
use workflows
capture the steps
method
instruments and laboratory
standardised pipelines
auto record of
experiment and set-up
report & variant reuse
buffered infrastructure
BioSTIF
interactive
local & 3rd party independent resources
shielded heterogeneous infrastructures
73. use provenance
the link between computation and results
static verifiable record
track changes
repair
partially repeat/reproduce
carry citation
calc data quality/trust
select data to keep/release
compare diffs/discrepancies
[Figure: two provenance traces of the same workflow, (i) Trace A and (ii) Trace B (steps S0–S4 over data d1, d2 … df), compared to locate where results diverge]
W3C PROV standard
PDIFF: comparing provenance traces to
diagnose divergence across experimental
results [Woodman et al, 2011]
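A provenance diff of the kind PDIFF performs can be sketched in a few lines. This toy version (the trace representation and digest values are invented for illustration, and far simpler than W3C PROV) reports the first step at which two traces of the same workflow disagree:

```python
# Each trace is an ordered list of (step_name, output_digest) pairs,
# e.g. produced by hashing each step's output when the workflow ran.
def first_divergence(trace_a, trace_b):
    """Return a description of the first disagreement, or None."""
    for (step_a, out_a), (step_b, out_b) in zip(trace_a, trace_b):
        if step_a != step_b:
            return f"structure differs: {step_a!r} vs {step_b!r}"
        if out_a != out_b:
            return f"outputs diverge at step {step_a!r}"
    return None  # the compared prefixes agree

trace_a = [("S0", "d1"), ("S1", "d2"), ("S2", "y"), ("S4", "df")]
trace_b = [("S0", "d1'"), ("S1", "d2"), ("S2", "y'"), ("S4", "df'")]
print(first_divergence(trace_a, trace_b))
```

Pinpointing the first divergent step is what makes partial repair and partial re-execution possible: everything upstream of that step can be trusted and reused.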
74. “an experiment is as transparent
as the visibility of its steps”
black boxes
closed codes &
services, proprietary
licences, magic cloud
services, manual
manipulations, poor
provenance/version
reporting, unknown
peer review, mis-use,
platform calculation
dependencies
Joppa et al SCIENCE 340 May 2013; Morin et al Science 336 2012
75. dependencies & change
degree of self-contained preservation
open world, distributed, alien hosted
data/software versions and accessibility hamper replication
spin-rate of versions
[Zhao et al. e-Science 2012]
“all you need to do is copy the box that the internet is in”
76. preservation & distribution
portability / packaging
VM
availability
open
[Adapted Freire, 2013]
gather
dependencies
capture
steps
variability
sameness
description
intelligibility
Reproducibility
framework
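One low-friction way to start on the "gather dependencies" step is to emit a machine-readable environment manifest next to every result. A minimal standard-library sketch (the function name and manifest fields are my own, not from the talk):

```python
import json
import platform
import sys
from importlib.metadata import version, PackageNotFoundError

def environment_manifest(packages=()):
    """Snapshot interpreter, OS and optional package versions into a
    JSON-serialisable manifest (illustrative field names)."""
    manifest = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": {},
    }
    for name in packages:
        try:
            manifest["packages"][name] = version(name)
        except PackageNotFoundError:
            manifest["packages"][name] = "not installed"
    return manifest

print(json.dumps(environment_manifest(), indent=2))
```

Archived alongside a result, such a manifest records exactly the "data/software versions" whose drift the next slides blame for failed replication.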
77. packaging bickering
byte execution
virtual machine
black box
description
archived record
white box
data+compute
co-location cloud
packaging
ELIXIR Embassy Cloud
reproduce
repeat
“in-nerd-tia”
78. big data big compute
community facilities
cloud host costs and confidence
data scales
dump and file
capability
79. message #4:
“the reproducible window”
all experiments become less reproducible over time
icanhascheezburger.com
how, why and what matters
benchmarks for codes
plan to preserve
repair on demand
description persists
use frameworks
results may vary
partial replication
approximate reproduction
verification
Sandve, Nekrutenko, Taylor, Hovig Ten simple rules for reproducible in silico research,
PLoS Comp Bio submitted
80. message #5: puppies aren’t free
long term reliability of hosts
multiple stewardship
fragmented
business models
reproducibility service industry
24% of NAR services unmaintained after three years: Schultheiss et al. (2010) PLoS Comput Biol
81. •
•
•
•
•
•
the meta-manifesto
all X should be available and assessable forever
the copyright of X should be clear
X should have citable, versioned identifiers
researchers using X should visibly credit X’s creators
credit should be assessable and count in all assessments
X should be curated, available, linked to all necessary materials, and
intelligible
• making X reproducible/open should be from
cradle to grave, continuous, routine, and
easier
• tools/repositories should be made to help, be
maintained and be incorporated into working
practices
• researchers should be able to adapt their
working practices, use resources, and be
trained to reproduce
• cost and responsibility should be transparent,
planned for, accounted and borne collectively
• we all should start small, be imperfect but
take action. Today.
http://www.force11.org
82. • evolution of a body
• fork, pull, merge
• subpart different cycles,
stewardship, authors
• refactored granularity
• software release
practices for
workflows, scripts,
services, data and
articles
• thread the salami across
parts, repositories and
journals
• chop up and microattribute
research is like
software
Faculty1000
Jennifer Schopf, Treating Data Like Software: A Case for Production Quality Data, JCDL 2012
84. towards a release app store
• checklists for
descriptive
reproducibility
• packaging for multi-hosted research
(executable)
components
• exchange between
tools and researchers
• framework for research
release and threaded
publishing using core
standards
TT43 Lounge 81
85. those messages again
• lower friction, born reproducible
• credit is like love
• take action, use (workflow) frameworks
• prepare for the reproducible window
• puppies aren't free
87. acknowledgements
• David De Roure
• Tim Clark
• Sean Bechhofer
• Robert Stevens
• Christine Borgman
• Victoria Stodden
• Marco Roos
• Jose Enrique Ruiz del Mazo
• Oscar Corcho
• Ian Cottam
• Steve Pettifer
• Magnus Rattray
• Chris Evelo
• Katy Wolstencroft
• Robin Williams
• Pinar Alper
• C. Titus Brown
• Greg Wilson
• Kristian Garza
• Wf4ever, SysMO, BioVel, UTOPIA and myGrid teams
• Juliana Freire
• Jill Mesirov
• Simon Cockell
• Paolo Missier
• Paul Watson
• Gerhard Klimeck
• Matthias Obst
• Jun Zhao
• Daniel Garijo
• Yolanda Gil
• James Taylor
• Alex Pico
• Sean Eddy
• Cameron Neylon
• Barend Mons
• Kristina Hettne
• Stian Soiland-Reyes
• Rebecca Lawrence
{"93":"Added afterwards\n","82":"capacity to do it\ndump and file\nback to negotiated reuse\n","71":"Sean Eddy Howard Hughes Medical Institute's Janelia Farm \nseveral computational tools for sequence analysis \n","60":"Created and shared large, valuable dataset which is highly regarded by peers or\nPublication in J. Big Useful Datasets, impact factor X \n","49":"So we had better be sure we know why we are doing it and make it easier\nCost benefit inbalance….\n","38":"So if you don’t want to do it for “them” do it for you\n","27":"More open more citations\n100,000 genome\n100 genome\nEncode, ENA etc\nBetter than chemists and social scientists\nhttp://www.nature.com/news/2011/110721/full/news.2011.430.html\nPublished online 21 July 2011 | Nature | doi:10.1038/news.2011.430 \nNews\nE. coli outbreak strain in genome race\nSequence data reveal pathogen's deadly origins.\nMarian Turner \nThe collaborative atmosphere that surrounded the public release of genome sequences in the early weeks of this year's European Escherichia coli outbreak has turned into a race for peer-reviewed publication.\nA paper published in PLoS One today, by Dag Harmsen from the University of Münster, Germany, and his colleagues, contains the first comparative analysis of the sequence of this year's E. coli outbreak strain (called LB226692 in the publication) and a German isolate from 2001 (called 01-09591), which was held by the E. coli reference laboratory at the University of Münster, headed by Helge Karch. The scientists also compared the two strains with the publicly available genome of strain 55989, isolated in central Africa in the 1990s.\nThe LB226692 and 01-09591 genomes were sequenced using an Ion Torrent PGM sequencer from Life Technologies of Carlsbad, California (see 'Chip chips away at the cost of a genome'). The authors say that their publication is the first example of next-generation, whole-genome sequencing being used for real-time outbreak analysis. 
"This represents the birth of a new discipline — prospective genomics epidemiology," says Harmsen. He predicts that this method will rapidly become routine public-health practice for outbreak surveillance.\nBut Harmsen's group was pipped to the publishing post by Rolf Daniel and his colleagues at the University of Göttingen in Germany, who published a comparison of the sequence of two isolates from the outbreak with the 55989 strain in Archives of Microbiology on 28 June. Harmsen says that this competition is why his group did not release the 2001 strain sequence before today's PLoS One publication.\nBoth groups say that their genomic sequencing and analysis were conducted independently. But their findings don't really differ from sequence analyses that other scientists were simultaneously documenting in the public domain, following the release, on 2 June, by China's BGI (formerly known as the Beijing Genomics Institute) of a full genome sequence of the outbreak strain — also generated using Ion Torrent sequencing. These scientists say that there is very little information in either publication that was not previously available on their website. "The crowd-sourcing efforts arrived at almost all of the scientific conclusions about the strain comparisons first," says Mark Pallen from the University of Birmingham, UK, "so we're surprised and disappointed that these findings are not referred to in these papers."\nEveryone agrees that the Münster laboratory released information on defining genetic features of the 2011 outbreak strain that allowed accurate patient diagnosis and strain tracking as soon as they had the information. The current squabbling revolves around genomic details that point to how the unusual strain evolved.\nSo what have the combined analyses revealed so far? All of the strains have a similar enteroaggregative E. 
coli (EAEC) genetic background, but the 2011 outbreak strain contains plasmid- and chromosome-encoded genes that differ both from the 2001 German and from the earlier African strain. The 2011 and 2001 strains, but not the African strain, carry the important stx gene for Shiga-toxin production — the cause of so many people's sickness — although the African strain carries an intact stx integration site, suggesting it may have evolved from a strain that did once carry it. The African strain also does not contain a tellurite-resistance gene that the other two strains do. The 2011 and 2001 also have different genes for fimbriae — the cell protrusions that make EAEC bacteria particularly sticky.\nThe authors of the paper in PLoS One hypothesize that the strains all derive from a common Shiga-toxin producing EAEC progenitor. They say the genetic steps between the three strains are suggestive of a 'common ancestor model'. It is evolutionarily more likely that bacteria lose genetic elements than gain them, and Harmsen cites the large ter genetic island as an example of a genetic element more likely to have been lost from a common progenitor than gained by subsequently appearing strains.\n"All of these analyses are an example of bacterial evolution being in constant flux," says Pallen, reiterating that this outbreak highlighted the importance of establishing more flexible diagnostic frameworks for E. coli strains.\nHarmsen says he expects at least two further papers on analysis of the genome sequences to be published by independent groups in the next few weeks. \n","16":"The letter or the spirit of the experiment\nindirect and direct reproducibility\nReproduce the same affect? Or same result?\nConcept drift towards bottom.\nAs an old ontologist I wanted an ontology or a framework or some sort of property based classification. 
\n","88":"ENCODE threads\n","77":"Simplify\nTrack\nVersions and retractions\nError propagation\nContributions and credits\nFix\nWorkflow repair, alternate component discovery, Black box annotation\nRerun and Replay\nPartial reproducibility: Replay some of the workflow\nA verifiable, reviewable trace in people terms\nAnalyse \nCalculate data quality & trust, \nDecide what data to keep or release\nCompare to find differences and discrepancies\nS. Woodman, H. Hiden, P. Watson, P. Missier Achieving Reproducibility by Combining Provenance with Service and Workflow Versioning. In: The 6th Workshop on Workflows in Support of Large-Scale Science. 2011, Seattle\n","55":"Of the SEEK project\n","33":"Less true of specialist journals and things that aren’t nature.\nSloan Digital Sky Survey generated more papers by independents based on its data than the data collectors\nhttp://www.scribedevil.com/dedicated-digital-research-and-development-budget/\n“When we review papers we’re often making authors prove that their findings are novel or interesting. We’re not often making them prove that their findings are true”. Joseph Simmons, in Nature 485, Bad Copy\nPeer review references.\n","83":"Partial – over proprietary steps or difficult-to-reproduce subparts, or through examining the log\nThe law of decline\nall set-ups need to refresh or they stagnate\nthe world changes, \nnew results not the same ones.\nall set-ups need to refresh\nhow long is enough?\npartial – over proprietary steps or difficult-to-reproduce subparts, or through examining the log\nThe lab is not fixed\nPredictive models\nUpdated resources\nNew versions\nDeal with uncertainty\nReproducibility is not fossellization.\nStability – people want to use the same set up at the end of their project as they did at the beginning. 
Same for paper reviews.
When does that matter and when doesn't it?
The change of an API won't matter to the result, but it will to the workflow machinery.
The change of an algorithm won't impact the workflow, but it will impact what the experiment means.
Same API? Same code? Same version of code? Same dataset? Same version of dataset? Same method?
"the questions don't change but the answers do" [Dan Reed]

Slide 72: "faster/easier/intelligible/rewarded to reinvent"
Sociologically: an end to "build it and they will come".
Alternative metrics accepted by the community.
Alternative reward systems that recognize the realities of today's scholarship, namely: open data availability, software availability, collaborative research.

Slide 61: "faster/easier/intelligible/rewarded to reinvent"
We have seen the enemy and he is us.
Duck G, Nenadic G, Brass A, Robertson DL, Stevens R. bioNerDS: exploring bioinformatics' database and software use through literature mining. BMC Bioinformatics. 2013 Jun 15;14:194. doi:10.1186/1471-2105-14-194.
1. There's a lot of stuff out there, and the world is quite dynamic in some respects.
2. The top ten are interesting in themselves.
3. It appears that a lot of tools/databases have little reported use (note "reported").
4. I would tentatively say "bio types are more conservative".
The paper just reports on a survey of Genome Biology and BMC Bioinformatics; Geraint has a figure for a survey of all of PMC (2013 version, half a million articles).
Credit is like love.
A reproducibility contract needs to be backed by a reciprocity contract.

Slide 39: The HOW. Bolt on, built in.
http://www.zenodo.org/
https://olivearchive.org
make reproducible -> born reproducible
Computational experiments have three advantages: explicit, auto-tracking, portability.

Slide 28: ELIXIR http://www.out-law.com/en/articles/2013/june/international-biomedical-research-data-sharing-standards-to-be-created/
http://www.broadinstitute.org/files/news/pdfs/GAWhitePaperJune3.pdf
http://blog.ted.com/2012/06/29/unreasonable-people-unite-john-wilbanks-at-tedglobal-2012/
http://www.nature.com/scibx/journal/v5/n25/fig_tab/scibx.2012.644_F1.html

Slide 6: social worker

Slide 89: Standards are the key to reproducibility. Most of the time people don't care.

Slide 78: Science 13 April 2012: 336(6078) 159-160

Slide 67: Access to expertise: competency, capacity, resources.
Authors to make in silico experiments reproducible; reviewers and readers to be able to reproduce or reuse them.

Slide 45: Galaxy pages (30K users, 1K new users/month)

Slide 34: I bought the rights to this image.

Slide 23: "if it isn't open it isn't science" [Mike Ashburner]

Slide 12: Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8): e124. doi:10.1371/journal.pmed.0020124
http://www.reuters.com/article/2012/03/28/us-science-cancer-idUSBRE82R12P20120328

Slide 1: reproducibility, reuse and reinvention, openness: how we undertake and reward computational science.

Slide 84: research costs.
Self-promotion: I can publish every new monolithic thing, and I can't publish if I reuse someone else's thing.
Novelty vs Standards: standards are boring "blue collar" science (Quackenbush).
Research vs Production confusion: how do you get funding for production software other than claiming to be researching stuff?
How do you get a publication out of a bit of research software without claiming a potential user base?
I don't want to be a long-term service provider!
Lifeboats, puppies and the republic of openness. How long is forever?
Small: short funding cycles; 58% built by students, 24% unmaintained after three years*.
Large: sustain through reinvention; funding policies and business models; sustainability of suppliers & hosts.
The cost of sustaining your home-mades: preparing for reproducibility is not a negligible cost.
Transparently account for the cost; drive down & spread the cost.
Research management and scholarship vendors and service providers. The workflow fixer collective!
Mendeley, FigShare …
Research management systems (PURE, Symplectic)
Lab management systems (Labguru)
Libraries, communities….

Slide 73: Documentation to reassemble; governance.
Anatomy of an experiment.
Subtle sometimes, but it's pretty important when it comes to tractability, intent and practicality.
VMs enable you to reproduce a lab by making the experiment portable. That's the point of portability. It's the instrument aspect.
Aeronautical engineering codes (proprietary): multiple codes in business, without being inspected.
Doesn't help if running over a supercomputer or the cloud.

Slide 51: The only equation I have in the talk.

Slide 40: Sharecropping. Why research objects are external.

Slide 7: I

Slide 79: A virtual machine (VM) is a software implementation of a machine (i.e. a computer) that executes programs like a physical machine. Virtual machines are separated into two major classifications, based on their use and degree of correspondence to any real machine: system virtual machines and process virtual machines.
Zhao, Gomez-Perez, Belhajjame, Klyne, Garcia-Cuesta, Garrido, Hettne, Roos, De Roure and Goble. Why workflows break: understanding and combating decay in Taverna workflows. 8th Intl Conf on e-Science 2012.
"Reproducibility success is proportional to the number of dependent components and your control over them."
Many reasons why: change and availability.
Updates to public datasets; changes to services and codes.
Availability of, and access to, components and the execution environment.
Platform differences on simulations, code ports.
Volatile third-party resources (50%): not available, available but inaccessible, changed.
Prevent, Detect, Repair.

Slide 68: Time. Money. No papers. (Skipped in talk.)

Slide 57: Barriers.
Perceived norms: me, you, them.
Temporal construal: we hold values of getting it right, but the concrete overcomes the abstract goal.
Motivated reasoning.
Minimal accountability.
I am busy: you can't append things to the workflow, you must integrate into the workflow. The benefit must be game-changing to justify disruption.
Work with existing incentives and nudge them.
Top down: fast but narrow. Bottom up: slow but comprehensive.
Leverage norms.
"with transparency comes accountability" [Mark Borkum]

Slide 35: Citation distortion.
Greenberg's [3, 4] analysis of the distortion and fabrication of claims in the biomedical literature demonstrates why citable claims are necessary. In his analysis, it is straightforward to see how citation distortions may contribute to non-reproducible results in a pharmaceutical context, as reported by [8].
3. Greenberg SA: How citation distortions create unfounded authority: analysis of a citation network. British Medical Journal 2009, 339:b2680.
4. Greenberg SA: Understanding belief using citation networks. Journal of Evaluation in Clinical Practice 2011, 17(2):389-393.
8. Begley CG, Ellis LM: Drug development: Raise standards for preclinical cancer research. Nature 2012, 483(7391):531-533.
Simkin and Roychowdhury showed that, in the sample of publications they studied, a majority of scientific citations were merely copied from the reference lists in other publications [31, 32]. The increasing interest in direct data citation of datasets, deposited in robust repositories, is another result of this growing concern with the evidence behind assertions in the literature [33].
31. Simkin MV, Roychowdhury VP: Stochastic modeling of citation slips. Scientometrics 2005, 62(3):367-384.
32. Simkin MV, Roychowdhury VP: A mathematical theory of citing. Journal of the American Society for Information Science and Technology 2007, 58(11):1661-1673.
33. Alsheikh-Ali AA, Qureshi W, Al-Mallah MH, Ioannidis JP: Public availability of published research data in high-impact journals. PLoS ONE 2011, 6(9):e24357.

Slide 2: When I was 17 years old I took a career capability test: failed computer science, recommended social work.
Knowledge Discovery, Knowledge Engineering and Knowledge Management.
Goble C, De Roure D, Bechhofer S. Accelerating Scientists' Knowledge Turns. Communications in Computer and Information Science, Volume 348, 2013, pp 3-25.
Abstract: A "knowledge turn" is a cycle of a process by a professional, including the learning generated by the experience, deriving more good and leading to advance. The majority of scientific advances in the public domain result from collective efforts that depend on rapid exchange and effective reuse of results. We have powerful computational instruments, such as scientific workflows, coupled with widespread online information dissemination to accelerate knowledge cycles. However, turns between researchers continue to lag. In particular, method obfuscation obstructs reproducibility. The exchange of "Research Objects" rather than articles proposes a technical solution; however the obstacles are mainly social ones that require the scientific community to rethink its current value systems for scholarship, data, methods and software.

Slide 85: Spreading the cost.
Cradle-to-grave reproducibility: tools, processes, standards.
Combine making & reporting; just enough, imperfect.
Cost it in. Train up and support. Planning.
We cannot sacrifice the youth. Protect them…. a new generation.
An ecosystem of support tools to navigate.

Slide 74: The reproducibility ecosystem.
For peer and author: complicated and scattered. Super-fragmentation: supplementary materials, multi-hosted, multi-stewarded.
We must use the right platforms for the right tools.
The trials and tribulations of review: it's complicated.
www.biostars.org/
Apache
Service-based Science / Science as a Service.

Slide 63: Randall J. LeVeque, Top Ten Reasons To Not Share Your Code (and why you should anyway), April 2013, SIAM News.
1. Too ugly to show anyone else.
2. I didn't work out all the details.
3. My ex-student wrote the code.
4. My competitors would be unfair.
5. It's valuable intellectual property.
6. It would make papers longer.
7. Referees won't check the code.
8. The code is too sophisticated for you.
9. My code invokes other unpublished (proprietary) code.
10. Readers who have access to my code will want user support.

Slide 80: Preservation: lots of copies keeps stuff safe.
Stability dimension. Add two more dimensions to our classification of themes.
Overlap of course. Static vs dynamic.
Granularity.
This model is for audit, and a target for your systems.
Overcoming data-type silos; public integrative datasets; transparency matters; cloud.
Recomputation.org
Reproducibility by Execution: Run It. Reproducibility by Inspection: Read It.
Availability and coverage. Gathered: scattered across resources, across the paper and supplementary materials.
Availability of dependencies: know and have all necessary elements.
Change management: Data? Services? Methods? Prevent, Detect, Repair.
Execution and making environments: skills/infrastructure to run it; portability and the execution platform (which can be people…); skills/infrastructure for authoring and reading.
Description: Explicit (How, Why, What, Where, Who, When), Comprehensive (just enough), Comprehensible (independent understanding).
Documentation vs Bits (VMs) reproducibility.
Learn/understand (reproduce and validate, reproduce using different codes) vs Run (reuse, validate, repeat, reproduce under different configs/settings).

Slide 58: And economics changes.
Local asset economies: scarcity of a prized commodity (e.g. instrument / data / model / knowledge).
Equipment, data, method, analysis.
Trade. Reward. Penalty. Cost.
Love or Money.

Slide 36: Not true in other disciplines, like physics. Reluctance and invisibility.

Slide 25: Pressure from top, pressure from below: squeeze.

Slide 14: Added afterwards.
1. Required as a condition of publication; certain exceptions permitted (e.g. preserving confidentiality of human subjects).
2. Required, but may not affect editorial/publication decisions.
3. Explicitly encouraged/addressed; may be reviewed and/or hosted.
4. Implied.
5. No mention.

Slide 3: In May, myExperiment had 14,660 page views and 3,076 unique visitors: 67% new visitors, 33% returning visitors.

Slide 86: The last model: the PDF is not the sole focus.

Slide 75: Reproducibility is like pornography: hard to define, but you know it when you see it.
Static vs Dynamic.
Reproduce the method and result. Reuse the method (reusing the result is just using the result).
(Techniques, algorithms, spec of the steps; preferably executable.)

Slide 64: Preparing for reproducibility is not a negligible cost. Transparently account for the cost; drive down & spread the cost.
When necessary: a bewildering range of standards for formats, terminologies and checklists.
The economics of curation: curate incrementally, early and when worthwhile.
Ramps: automation & integrated tools.
Copy-editing methods: 35 different kinds of annotations.
5 main workflows, 14 nested workflows, 25 scripts, 11 configuration files, 10 software dependencies, 1 web service; dataset: 90 galaxies observed in 3 bands.
Difficult and time-consuming. Intrinsic worth. Poor reward economy.
Capability vs Capacity. Sustaining the commons.

Slide 53: Red dominated by social; black dominated by technical (?).

Slide 92: Special thanks: 10th anniversary today! And yes, that is Mike Bada.

Slide 81: In-nerd-tia: the tendency of people, especially techy people, to get bogged down in over-thinking, over-engineering and trivia to the point that we can do nothing. [Julie McMurray]

Slide 70: http://software-carpentry.org/
Prlić A, Procter JB (2012) Ten Simple Rules for the Open Development of Scientific Software. PLoS Comput Biol 8(12): e1002802. doi:10.1371/journal.pcbi.1002802. Installing and running someone else's code, understanding it….
Best Practices for Scientific Computing http://arxiv.org/abs/1210.0530
Workshop on Maintainable Software Practices in e-Science, e-Science Conferences.
Stodden, Reproducible Research Standard, Intl J Comm Law & Policy, 13, 2009.

Slide 59: Less likely to have personal sharing; less likely to get credit.

Slide 48: This article by Phil Bourne et al doesn't have any datasets deposited in repositories, but it does include data in tables in the PDF, which are also available in the XML provided by PLoS. Here, Utopia has spotted that there's a table of data (notice the little blue table icon to the left of the table). Clicking on the icon opens a window with a simple 'spreadsheet' of the data extracted from the paper, which you can then export in CSV to a proper spreadsheet of your choice. You can also scatter-plot the data to get a quick-and-dirty overview of what's in the table.

Slide 37: Not true in other disciplines, like physics. Reluctance and invisibility.

Slide 4: myExperiment currently has 9183 members, 335 groups, 2869 workflows, 772 files and 341 packs. 21 different systems.

Slide 87: To share your research materials (the Research Object as a social object).
To facilitate reproducibility and reuse of methods.
To be recognized and cited (even for constituent resources).
To preserve results and prevent decay (curation of the workflow definition; using provenance for partial rerun).

Slide 76: Platform (libraries, plugins); infrastructure (components, services).

Slide 65: JERM spreadsheets for data integration and data exchange, including data and their metadata; standards-conformant.

Slide 43: Took 6 months in total.
The SOAPdenovo2 paper uses 10 executables/scripts (see attachment). SOAPdenovo2 itself contains 4 modules, and each can be called individually. Complexity in reproducing a result also comes from the number of parameters that can be configured in a tool; for example, the SOAPdenovo2 tool allows you to configure over 20 parameters.
The paper was SOAPdenovo2: http://www.gigasciencejournal.com/content/1/1/18. It took about 3 months of part-time work, as I first had to learn how to use and deploy the Galaxy workflow system here at BGI. This was followed by wrapping SOAPdenovo2 and its supporting tools as Galaxy tools. A lot of effort was required to understand how the analyses in the paper were implemented as bash and perl scripts before a Galaxy workflow could be developed that replicated some of the analyses in the paper. This was because most of the executable tools had their own configuration file, and then there was a global configuration file on top of this. Now that our Galaxy system is deployed and the editor understands how to use it, re-implementing the paper's analyses would probably take 2-3 weeks. This included some extra work the editor did to show how to visualise the genome assembly results of a SOAPdenovo2 run.