Scott Edmunds talk at AIST: Overcoming the Reproducibility Crisis: and why I stopped worrying and learned to love open data (& methods)

516 views

Published on

Scott Edmunds talk at the AIST Computational Biology Research Center in Tokyo: Overcoming the Reproducibility Crisis: and why I stopped worrying and learned to love open data (& methods), July 1st 2014

Published in: Science, Technology
  1. 1. Overcoming the Reproducibility Crisis: and why I stopped worrying and learned to love open data (& methods) Scott Edmunds 1st July 2013
  2. 2. Harnessing Data-Driven Intelligence: using the networking power of the internet to tackle problems. Enables: asking new questions & finding hidden patterns & connections; building on each other's efforts more quickly & efficiently; more collaborations across more disciplines; harnessing the wisdom of the crowds (crowdsourcing, citizen science, crowdfunding). Enabled by: removing silos, open licenses, transparency, immediacy.
  3. 3. Dead trees not fit for purpose (1665, 1812, 1869)
  4. 4. The problems with publishing • Scholarly articles are merely advertising of scholarship. The actual scholarly artefacts, i.e. the data and computational methods that support the scholarship, remain largely inaccessible (Jon B. Buckheit and David L. Donoho, WaveLab and Reproducible Research, 1995) • Lack of transparency; lack of credit for anything other than "regular" dead-tree publication • Where there is interest in data, it is only to monetise & re-silo it • Traditional publishing policies and practices are a hindrance
  5. 5. The consequences: a growing replication gap. Out of 18 microarray papers, results from 10 could not be reproduced. 1. Ioannidis et al. (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14 2. Ioannidis JPA (2005). Why Most Published Research Findings Are False. PLoS Med 2(8)
  6. 6. Consequences: increasing number of retractions. >15X increase in the last decade; strong correlation of "retraction index" with higher impact factor. 1. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html 2. Retracted Science and the Retraction Index http://iai.asm.org/content/79/10/3855.abstract
  7. 7. Consequences: growing replication gap. More retractions: >15X increase in the last decade; at current rates of increase, by 2045 as many papers will be published as retracted. Insufficient methods. 1. Ioannidis et al. (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14 2. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html 3. Bjorn Brembs: Open Access and the looming crisis in science https://theconversation.com/open-access-and-the-looming-crisis-in-science-14950
  8. 8. "Faked research is endemic in China": global perceptions of Asian research. Million RMB rewards for high-IF publications = ? New Scientist, 17th Nov 2012: http://www.newscientist.com/article/mg21628910.300-fraud-fighter-faked-research-is-endemic-in-china.html Nature, 29th September 2010: http://www.nature.com/news/2010/100929/full/467511a.html Science, 29th November 2013: http://www.sciencemag.org/content/342/6162/1035.full Nature 475, 267 (20th July 2011): http://www.nature.com/news/2011/110720/full/475267a.html
  9. 9. "Faked research is endemic in China": global perceptions of Asian research. "Wide distribution of information is key to scientific progress, yet traditionally, Chinese scientists have not systematically released data or research findings, even after publication." "There have been widespread complaints from scientists inside and outside China about this lack of transparency." "Usually incomplete and unsystematic, [what little supporting data released] are of little value to researchers and there is evidence that this drives down a paper's citation numbers."
  10. 10. STAP paper demonstrates problems. Need: …to publish protocols BEFORE analysis …better access to supporting data …more transparent & accountable review …to publish replication studies
  11. 11. Credit where credit is overdue: new incentives/credit for • Data • Software • Review • Re-use… = Credit. "One option would be to provide researchers who release data to public repositories with a means of accreditation." "An ability to search the literature for all online papers that used a particular data set would enable appropriate attribution for those who share." Nature Biotechnology 27, 579 (2009)
  12. 12. GigaSolution: deconstructing the paper www.gigadb.org www.gigasciencejournal.com Utilizes big-data infrastructure and expertise from: Combines and integrates: Open-access journal Data Publishing Platform Data Analysis Platform
  13. 13. Rewarding open data
  14. 14. Submission workflow: 1) Submitter logs in to the GigaDB website and uploads an Excel submission file. 2) Validation checks: on fail, the submitter is provided an error report; on pass, the dataset is uploaded to GigaDB. 3) Curator review: the curator contacts the submitter with the DOI citation and to arrange file transfer (and to resolve any other questions/issues); the submitter provides files by FTP or Aspera. 4) A GigaDB DOI is assigned; a DataCite XML file is generated and registered with DataCite. 5) The curator makes the dataset public (can be set to a future date if required). Example: DOI 10.5524/100003, Genomic data from the crab-eating macaque/cynomolgus monkey (Macaca fascicularis) (2011), public GigaDB dataset. See: http://database.oxfordjournals.org/content/2014/bau018.abstract
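The pass/fail branch of the submission workflow above can be sketched in a few lines of Python. Everything here (function names, the particular file-type and size checks, the accession number) is illustrative, not GigaDB's actual implementation; only the 10.5524 DOI prefix comes from the slide.

```python
def validate_submission(files):
    """Run basic checks on {filename: size} pairs; empty list means pass."""
    errors = []
    for name, size in files.items():
        if size == 0:
            errors.append(f"{name}: file is empty")
        if not name.lower().endswith((".fastq.gz", ".bam", ".vcf.gz", ".xlsx")):
            errors.append(f"{name}: unsupported file type")
    return errors

def submit_dataset(files, next_accession=100003):
    """Mimic the workflow branch: fail -> error report; pass -> DOI assigned."""
    errors = validate_submission(files)
    if errors:
        return {"status": "fail", "error_report": errors}
    # GigaDB DOIs use the 10.5524 prefix, per the example on the slide.
    doi = f"10.5524/{next_accession}"
    return {"status": "pass", "doi": doi, "files": sorted(files)}

result = submit_dataset({"macaque.fastq.gz": 1_000_000, "phenotypes.xlsx": 2_048})
```

A curator-review step and DataCite registration would follow the "pass" branch in the real pipeline.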
  15. 15. • Aspera = 10-100x faster up/download than FTP • Multi-omics/large scale biomedical data focus • Provide (ISA) curation & integration with other DBs (e.g. INSDC, MetaboLights, PRIDE, etc.) For more see: http://database.oxfordjournals.org/content/2014/bau018.abstract
  16. 16. IRRI GALAXY Beneficiaries/users of our work
  17. 17. IRRI GALAXY Rice 3K project: 3,000 rice genomes, 13.4TB public data Beneficiaries/users of our work
  18. 18. Diverse data types: cyber-centipedes & virtual worms
  19. 19. More transparency: open peer review BMC Series Medical Journals
  20. 20. More transparency (and speed): pre-prints
  21. 21. Real-time open-review = paper in arXiv + blogged reviews. Reward open & transparent review. http://tmblr.co/ZzXdssfOMJfy www.gigasciencejournal.com/content/2/1/10
  22. 22. Real-time open-review = paper in arXiv + blogged reviews Reward open & transparent review
  23. 23. GigaScience + Publons = further credit for reviewers' efforts. Reward open & transparent review
  24. 24. Readers are interested in open review Next step to link to ORCID
  25. 25. Cloud solutions? Reward better handling of metadata… Novel tools/formats for data interoperability/handling.
  26. 26. Implement workflows in a community-accepted format http://galaxyproject.org Over 36,000 main Galaxy server users Over 1,000 papers citing Galaxy use Over 55 Galaxy servers deployed Open source Rewarding and aiding reproducibility
  27. 27. galaxy.cbiit.cuhk.edu.hk
  28. 28. Visualizations & DOIs for workflows
  29. 29. How are we supporting data reproducibility? Data sets and analyses linked to DOIs. Open-Paper: DOI:10.1186/2047-217X-1-18, >26,000 accesses. Open-Review: 7 reviewers tested the data on the FTP server & named reports were published. Open-Pipelines & Open-Workflows: DOI:10.5524/100044. Open-Data: DOI:10.5524/100038, 78GB of CC0 data. Open-Code: code on SourceForge under GPLv3: http://soapdenovo2.sourceforge.net/ (>20,000 downloads); enabled the code to be picked apart by bloggers in a wiki: http://homolog.us/wiki/index.php?title=SOAPdenovo2
  30. 30. 7 referees downloaded & tested data, then signed reports Reward open & transparent review
  31. 31. Post publication: bloggers pull apart code/reviews in blogs + wiki: SOAPdenovo2 wiki: http://homolog.us/wiki1/index.php?title=SOAPdenovo2 Homologus blogs: http://www.homolog.us/blogs/category/soapdenovo/ Reward open & transparent review
  32. 32. SOAPdenovo2 workflows implemented in galaxy.cbiit.cuhk.edu.hk
  33. 33. SOAPdenovo2 workflows implemented in galaxy.cbiit.cuhk.edu.hk Implemented the entire workflow in our Galaxy server, inc.: • 3 pre-processing steps • 4 SOAPdenovo modules • 1 post-processing step • Evaluation and visualization tools. Will also be available to download by the >36K Galaxy main-server users
  34. 34. Can we reproduce results? SOAPdenovo2 S. aureus pipeline
  35. 35. Taking a microscope to peer review
  36. 36. The SOAPdenovo2 case study. Subject to and test with 3 models. Types of resources in a Research Object: Data; Method/Experimental protocol; Findings. Models to describe each resource type: Wfdesc / ISA-TAB / ISA2OWL
  37. 37. 1. While there are huge improvements to the quality of the resulting assemblies, it was not stressed in the text (other than in the tables) that the speed of SOAPdenovo2 can be slightly slower than SOAPdenovo v1. 2. In the testing and assessment section (page 3), based on the correct results in table 2, where we say the scaffold N50 metric is an order of magnitude longer for SOAPdenovo2 versus SOAPdenovo1, it was actually 45 times longer. 3. Also in the testing and assessment section, based on the correct results in table 2, where we say SOAPdenovo2 produced a contig N50 1.53 times longer than ALL-PATHS, this should be 2.18 times longer. 4. Finally, in this section, where we say the correct assembly length produced by SOAPdenovo2 was 3-80 fold longer than SOAPdenovo1, this should be 3-64 fold longer.
  38. 38. Case Study: Lessons Learned • Most published research findings are false. Or at least have errors. • It is possible to push button(s) & recreate a result from a paper • At the semantic level (via nanopublications) there can still be minor errors in the text (interpretation, not data) • Reproducibility is COSTLY. How much are you willing to spend? • Much easier to do this before rather than after publication
  39. 39. Aiding reproducibility of imaging studies OMERO: providing access to imaging data View, filter, measure raw images with direct links from journal article. See all image data, not just cherry picked examples. Download and reprocess.
  40. 40. OMERO: Aiding reproducibility, adding value
  41. 41. The alternative... ...look but don't touch
  42. 42. Step 1: get your code out there • Put everything in a code repository, even if it is ugly (see the CRAPL) • Version control: make sure you document the exact version in the paper (a big problem with many of our papers) • If system environments are important, consider VMs http://matt.might.net/articles/crapl/
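The advice to document the exact version can be made concrete with a small helper that builds a one-line provenance stamp for a methods section. This is a generic sketch, not a GigaScience requirement; the tool name, version, commit and parameters below are placeholders.

```python
import hashlib
import json
import platform

def provenance_stamp(tool, version, commit, params):
    """Return a one-line citation string with a short digest of the exact
    parameter settings, suitable for pasting into a methods section."""
    # Sorting keys makes the digest independent of parameter order.
    digest = hashlib.sha256(
        json.dumps(params, sort_keys=True).encode()
    ).hexdigest()[:12]
    return (f"{tool} v{version} (commit {commit}; "
            f"Python {platform.python_version()}; params sha256:{digest})")

stamp = provenance_stamp("SOAPdenovo2", "2.04", "abc1234",
                         {"kmer": 63, "avg_ins": 300})
```

In practice the commit would come from `git rev-parse HEAD` rather than a hand-typed string, so the paper always cites the code that actually ran.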
  43. 43. Beyond Commenting Code: Step 2: Open lab books, dynamic documents • Facilitate reuse and sharing with tools like Knitr, Sweave and IPython Notebook • Working towards executable papers…
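A minimal sketch of the dynamic-document idea behind these tools: compute the numbers quoted in the prose directly from the data, so the text cannot drift from the tables (exactly the class of error listed in the slide-37 errata, where "1.53 times" should have read "2.18 times"). The N50 values here are invented placeholders, not results from the SOAPdenovo2 paper.

```python
# Hypothetical contig N50 values (bp); in Knitr/IPython these would be
# computed from the assembly files within the document itself.
contig_n50 = {"SOAPdenovo2": 96_000, "ALLPATHS-LG": 44_000}

# The sentence in the paper is generated, not hand-typed, so re-running
# the analysis automatically updates the prose.
ratio = contig_n50["SOAPdenovo2"] / contig_n50["ALLPATHS-LG"]
sentence = (f"SOAPdenovo2 produced a contig N50 "
            f"{ratio:.2f} times longer than ALLPATHS-LG.")
```

The same pattern scales up: every figure, table and in-text statistic is regenerated from raw data on each build of the document.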
  44. 44. E.g.
  45. 45. E.g.
  46. 46. Some testimonials for Knitr. Authors (Wolfgang Huber): "I do all my projects in Knitr. Having the textual explanation, the associated code and the results all in one place really increases productivity, and helps explaining my analyses to colleagues, or even just to my future self." Reviewers (Christophe Pouzat): "It took me a couple of hours to get the data, the few custom developed routines, the "vignette" and to REPRODUCE EXACTLY the analysis presented in the manuscript. With a few more hours, I was able to modify the authors' code to change their Fig. 4. In addition to making the presented research trustworthy, the reproducible research paradigm definitely makes the reviewer's job much more fun!"
  47. 47. Levels of reproducibility, from basic to full: availability of basic code/data; rich metadata & documentation; usability (e.g. Galaxy Toolshed); dynamic results; full reproducibility.
  48. 48. Give us data, papers & pipelines* What else can you do? Contact us: scott@gigasciencejournal.com editorial@gigasciencejournal.com database@gigasciencejournal.com * APCs currently generously covered by BGI until 2015 www.gigasciencejournal.com
  49. 49. Thanks to. Team: Peter Li, Huayan Gao, Chris Hunter, Jesse Si Zhe, Nicole Nogoy, Laurie Goodman. Case study: Ruibang Luo (BGI/HKU), Shaoguang Liang (BGI-SZ), Tin-Lap Lee (CUHK), Qiong Luo (HKUST), Senghong Wang (HKUST), Yan Zhou (HKUST). Our collaborators: Amye Kenall (BMC), Marco Roos (LUMC), Mark Thompson (LUMC), Jun Zhao (Lancaster), Susanna Sansone (Oxford), Philippe Rocca-Serra (Oxford), Alejandra Gonzalez-Beltran (Oxford). Funding from: CBIIT. @gigascience facebook.com/GigaScience blogs.biomedcentral.com/gigablog/ www.gigadb.org galaxy.cbiit.cuhk.edu.hk www.gigasciencejournal.com
