Peter Li at GCC2014: A journal’s experiences of reproducing published data analyses

Peter Li at the 2014 Galaxy Community Conference: A journal’s experiences of reproducing published data analyses, 1st July 2014

Published in: Technology, Health & Medicine

  • Publication = Selective reporting?
    For research involving computation
  • DOIs
    Provide example of a GigaScience paper
    Mention DOI for the paper itself
    Highlight data set generated and its DOI
  • Peter Li at GCC2014: A journal’s experiences of reproducing published data analyses

    1. A journal’s experiences of reproducing published data analyses. Peter Li, peter@gigasciencejournal.com
    2. Journal and database for large-scale data studies. Editor-in-Chief: Laurie Goodman; Executive Editor: Scott Edmunds; Commissioning Editor: Nicole Nogoy; GigaDB: Chris Hunter, Jesse Xiao; GigaGalaxy: Peter Li; in conjunction with
    3. www.gigasciencejournal.com
    4. reproducibility, trust, understanding
    5. Reproducibility spectrum: Publication only (not reproducible); Publication + data, code and data, or linked and executable code and data; Full replication (gold standard). Adapted from Roger Peng (2011) Reproducible research in computational science. Science 334: 1226-1227.
    6. gigadb.org
    7. Paper DOI; Data set DOI. Linking of papers and data by citation of DOIs
    8. Reproducibility spectrum: Publication only (not reproducible); Publication + data, code and data, or linked and executable code and data; Full replication (gold standard). Adapted from Roger Peng (2011) Reproducible research in computational science. Science 334: 1226-1227.
    9. Can the results in a GigaScience paper be replicated using Galaxy?
    10. Pilot project
    11. Replicate
    12. Tools: http://gigadb.org/dataset/100044
    13. Tools and data: http://gage.cbcb.umd.edu/data/index.html
    14. Data in GigaGalaxy
    15. Integration of SOAPdenovo2 into GigaGalaxy
    16. Downloaded pipeline: Short reads -> KmerFreq_AR -> Corrector_AR -> SOAPdenovo2 -> GapCloser -> Scaffold seqs. Required pipeline: Short reads -> KmerFreq_AR -> Corrector_AR -> SOAPdenovo2 -> GapCloser -> ExtractACGT -> GAGE eval -> Table 2 N50 & corrected N50 scores. Downloaded pipeline is missing two tools for reproducibility.
    17. Required pipeline: Short reads -> KmerFreq_AR -> Corrector_AR -> SOAPdenovo2 -> GapCloser -> ExtractACGT -> GAGE eval -> Table 2 N50 & corrected N50 scores. Need to add two extra tools into GigaGalaxy.
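    18. SOAPdenovo2 S. aureus pipeline
    19. Published and Galaxy-reproduced statistics of genome assemblies of S. aureus and R. sphaeroides. The slide labels its two tables "Published" and "Reproduced"; they are given here in the order they appear in the transcript.

        First table:
        Species        | Tool         | Contigs: Number | N50 (kb) | Errors | N50 corrected (kb) | Scaffolds: Number | N50 (kb) | Errors | N50 corrected (kb)
        S. aureus      | SOAPdenovo1  | 79   | 148.6 | 156 | 23    | 49  | 342  | 0  | 342
        S. aureus      | SOAPdenovo2  | 80   | 98.6  | 25  | 71.5  | 38  | 1086 | 2  | 1078
        S. aureus      | ALL-PATHS-LG | 37   | 149.7 | 13  | 119.0 | 11  | 1477 | 1  | 1093
        R. sphaeroides | SOAPdenovo1  | 2241 | 3.5   | 400 | 2.8   | 956 | 106  | 24 | 68
        R. sphaeroides | SOAPdenovo2  | 721  | 18    | 106 | 14.1  | 333 | 2549 | 4  | 2540
        R. sphaeroides | ALL-PATHS-LG | 190  | 41.9  | 30  | 36.7  | 32  | 3191 | 0  | 0

        Second table:
        Species        | Tool         | Contigs: Number | N50 (kb) | Errors | N50 corrected (kb) | Scaffolds: Number | N50 (kb) | Errors | N50 corrected (kb)
        S. aureus      | SOAPdenovo1  | 79   | 148.6 | 156 | 23    | 49  | 342  | 0  | 342
        S. aureus      | SOAPdenovo2  | 80   | 98.6  | 25  | 71.5  | 38  | 1086 | 2  | 1078
        S. aureus      | ALL-PATHS-LG | 37   | 149.7 | 13  | 117.6 | 10  | 1477 | 1  | 1093
        R. sphaeroides | SOAPdenovo1  | 2242 | 3.5   | 392 | 2.8   | 956 | 105  | 18 | 70
        R. sphaeroides | SOAPdenovo2  | 721  | 18    | 106 | 14.1  | 333 | 2549 | 4  | 2540
        R. sphaeroides | ALL-PATHS-LG | 190  | 41.9  | 31  | 36.7  | 32  | 3191 | 0  | 3310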
    20. http://galaxy.cbiit.cuhk.edu.hk/u/gigascience/p/soapdenovo2-s-aureus
    21. Observations: complete scientific reproduction is difficult (time and effort required); requires help from authors; do we need education and training in scientific reproducibility?
    22. http://www.cf.ac.uk/socsi/contactsandpeople/harrycollins/image-36548-web.gif
    23. Thanks to: Ruibang Luo (BGI/HKU), Shaoguang Liang (BGI-SZ), Tin-Lap Lee (CUHK), Qiong Luo (HKUST), Senghong Wang (HKUST), Yan Zhou (HKUST), Peter Li, Huayan Gao, Chris Hunter, Jesse Si Zhe, Nicole Nogoy, Laurie Goodman, Amye Kenall (BMC), Marco Roos (LUMC), Mark Thompson (LUMC), Jun Zhao (Lancaster), Susanna Sansone (Oxford), Philippe Rocca-Serra (Oxford), Alejandra Gonzalez-Beltran (Oxford). Links: @gigascience, facebook.com/GigaScience, blogs.biomedcentral.com/gigablog/, www.gigadb.org, galaxy.cbiit.cuhk.edu.hk, www.gigasciencejournal.com. Group labels on the slide: Funding from; Our collaborators; team; Case study.
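
The two checks behind slides 16 to 19 (which tools does the downloaded pipeline lack, and where do the two result tables disagree) can be sketched in a few lines of Python. This is only an illustrative sketch, not the GigaGalaxy implementation. The tool lists assume that slide 16 shows a downloaded pipeline ending at GapCloser and a required pipeline that continues through ExtractACGT and GAGE eval. The names `missing_tools`, `differing_cells`, `FIRST_TABLE` and `SECOND_TABLE` are hypothetical. The tables are keyed in the order they appear in the transcript, because the transcript does not preserve which one the slide labels "Published" and which "Reproduced".

```python
# Tool lists as read from slide 16 (an assumed reading of the flattened slide text).
DOWNLOADED_PIPELINE = ["KmerFreq_AR", "Corrector_AR", "SOAPdenovo2", "GapCloser"]
REQUIRED_PIPELINE = DOWNLOADED_PIPELINE + ["ExtractACGT", "GAGE eval"]

# Column order follows the slide 19 header: four contig columns, then four scaffold columns.
COLUMNS = (
    "contig_number", "contig_n50_kb", "contig_errors", "contig_n50_corrected_kb",
    "scaffold_number", "scaffold_n50_kb", "scaffold_errors", "scaffold_n50_corrected_kb",
)

# Values copied from the slide 19 transcript, first table then second table.
FIRST_TABLE = {
    ("S. aureus", "SOAPdenovo1"): (79, 148.6, 156, 23, 49, 342, 0, 342),
    ("S. aureus", "SOAPdenovo2"): (80, 98.6, 25, 71.5, 38, 1086, 2, 1078),
    ("S. aureus", "ALL-PATHS-LG"): (37, 149.7, 13, 119.0, 11, 1477, 1, 1093),
    ("R. sphaeroides", "SOAPdenovo1"): (2241, 3.5, 400, 2.8, 956, 106, 24, 68),
    ("R. sphaeroides", "SOAPdenovo2"): (721, 18, 106, 14.1, 333, 2549, 4, 2540),
    ("R. sphaeroides", "ALL-PATHS-LG"): (190, 41.9, 30, 36.7, 32, 3191, 0, 0),
}
SECOND_TABLE = {
    ("S. aureus", "SOAPdenovo1"): (79, 148.6, 156, 23, 49, 342, 0, 342),
    ("S. aureus", "SOAPdenovo2"): (80, 98.6, 25, 71.5, 38, 1086, 2, 1078),
    ("S. aureus", "ALL-PATHS-LG"): (37, 149.7, 13, 117.6, 10, 1477, 1, 1093),
    ("R. sphaeroides", "SOAPdenovo1"): (2242, 3.5, 392, 2.8, 956, 105, 18, 70),
    ("R. sphaeroides", "SOAPdenovo2"): (721, 18, 106, 14.1, 333, 2549, 4, 2540),
    ("R. sphaeroides", "ALL-PATHS-LG"): (190, 41.9, 31, 36.7, 32, 3191, 0, 3310),
}


def missing_tools(required, available):
    """Return the required tools that are absent from `available`, in pipeline order."""
    have = set(available)
    return [tool for tool in required if tool not in have]


def differing_cells(first, second):
    """Return (row_key, column, first_value, second_value) for every cell that differs."""
    diffs = []
    for key, first_row in first.items():
        for column, a, b in zip(COLUMNS, first_row, second[key]):
            if a != b:
                diffs.append((key, column, a, b))
    return diffs


if __name__ == "__main__":
    print("Missing tools:", missing_tools(REQUIRED_PIPELINE, DOWNLOADED_PIPELINE))
    for (species, tool), column, a, b in differing_cells(FIRST_TABLE, SECOND_TABLE):
        print(f"{species} / {tool} / {column}: {a} vs {b}")
```

With the values above, the three SOAPdenovo1 and SOAPdenovo2 rows other than R. sphaeroides SOAPdenovo1 match exactly, and the differences fall in the two ALL-PATHS-LG rows and the R. sphaeroides SOAPdenovo1 row.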
