Nicole Nogoy
eResearch NZ, 2 July 2014
Open-Data
Open-Source
Open-Access
: Improving data sharing,
integration and reprodu...
Open-Review Open-Access
Open-DataOpen-Source
What can be achieved?
Its all about the re-use
To do this everything needs to be free
and accessible to be read by humans &
machines*
* See: htt...
Challenges/Opportunities in the Data-Driven Era
Quick response to climate change, food security & disease outbreaks
Using ...
Not enabled by: paywalls, silos, dead trees
18121665 1869
• Scholarly articles are merely advertisement of scholarship .
T...
Problem: growing replication gap
1. Ioannidis et al., (2009). Repeatability of published microarray gene expression analys...
Growing Issue: increasing number of retractions
>15X increase in last decade
Strong correlation of “retraction index” with...
How
GigaSolution: Deconstructing the paper
www.gigadb.org
www.gigasciencejournal.com
Utilizes big-data infrastructure and expe...
• Data
• Software
• Review
• Re-use…
= Credit
}
Credit where credit is overdue:
“One option would be to provide researcher...
Anatomy of a Publication
Data
Idea
Study
Analysis
Answer
Metadata
Anatomy of a Data Publication
Data
Idea
Study
Analysis
Answer
Metadata
Fail – submitter is
provided error report
Pass – dataset is
uploaded to GigaDB.
Submission Workflow
Curator makes dataset ...
GigaScience Data Publishing
Platform
Currently 120 datasets & 50TB
data
• ~50 TBs of data from: BGI, ACRG, G10K, Bird10K, 3K Rice Genomes
• Provide curation & integration with other DBs (INSDC d...
Many data types…
BGI Datasets Get DOIs
Plants
Chinese cabbage
Cucumber
Foxtail millet
Pigeonpea
Potato
Sorghum
Wheat A+B
Rice
Microbe/metag...
Cloud
solutions?
Reward better handling of metadata…
Novel tools/formats for data interoperability/handling.
Examples
To maximize its utility to the research community and aid those fighting
the current epidemic, genomic data is released he...
IRRI GALAXY
Beneficiaries of the genomics revolution?
Rice 3K project: 3,000 rice genomes, 13.4TB public data
NO
Collaborations with Pensoft & PLOS
Cyber-centipedes & virtual worms
SOURCE
USE/REUSE
PUBLISH
INTEGRATION WITH
DOMAIN-SPECIFIC
DATABASES VIA ISA-TOOLS
NARRATIVE DATA
(SOCIAL)
MEDIA
DATA PRODU...
Disseminating new types of data
How are we supporting data
reproducibility?
Data sets
Analyses
Open-Paper
Open-Review
DOI:10.1186/2047-217X-1-18
~21,000 a...
New & more transparent peer-review:
The GigaScience way:
8 referees downloaded & tested data, then signed reports
New & more transparent peer-review:
The GigaScience way:
Real-time open-review = paper in arXiv + blogged reviews
Implement workflows in a community-accepted format
http://galaxyproject.org
Over 36,000 main
Galaxy server users
Over 1000...
Visualizations
& DOIs for workflows
SOAPdenovo2 workflows implemented in
galaxy.cbiit.cuhk.edu.hk
Implemented entire workflow in our Galaxy server, inc.:
• 3 ...
Tool list Tool parameterisation Results panelResults panel
GigaGalaxy & Metabolomics
Rewarding and aiding reproducibility
OMERO: providing access to
imaging data…
Changing the way we publish:
“Deconstructed”
Journal
“Regular”
Journal
“Conscientious”
Online Journal
Ruibang Luo (BGI/HKU)
Shaoguang Liang (BGI-SZ)
Tin-Lap Lee (CUHK)
Qiong Luo (HKUST)
Senghong Wang (HKUST)
Yan Zhou (HKUST)...
Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility
Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility
Upcoming SlideShare
Loading in...5
×

Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility

314

Published on

Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility, July 2nd 2014

Published in: Science, Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
314
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
5
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide
  • ** these are examples of datasets we have in GigaDB
    Quite a few of them were released pre-publication
  • We want to push – better quality metadata
    Working with ISA (investigator study assay) commons to enable this
    Good to be a leader in this field – NPG are following in our footsteps!
  • Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration and reproducibility

    1. 1. Nicole Nogoy eResearch NZ, 2 July 2014 Open-Data Open-Source Open-Access : Improving data sharing, integration and reproducibility
    2. 2. Open-Review Open-Access Open-DataOpen-Source What can be achieved?
    3. 3. Its all about the re-use To do this everything needs to be free and accessible to be read by humans & machines* * See: http://www.biomedcentral.com/about/datamining Take home message:
    4. 4. Challenges/Opportunities in the Data-Driven Era Quick response to climate change, food security & disease outbreaks Using networking power of the internet to tackle problems Can ask new questions & find hidden patterns & connections Build on each others efforts quicker & more efficiently More collaborations across more disciplines Harness wisdom of the crowds: crowdsourcing, citizen science, crowdfunding Enables: Enabled by: Removing silos, standards/formats, open-access/data Challenges:
    5. 5. Not enabled by: paywalls, silos, dead trees 18121665 1869 • Scholarly articles are merely advertisement of scholarship . The actual scholarly artefacts, i.e. the data and computational methods, which support the scholarship, remain largely inaccessible --- Jon B. Buckheit and David L. Donoho, WaveLab and reproducible research, 1995 • Lack of transparency, lack of credit for anything other than “regular” dead tree publication • If there is interest in data, only to monetise & repackage
    6. 6. Problem: growing replication gap 1. Ioannidis et al., (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14 2. Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8) Out of 18 microarray papers, results from 10 could not be reproduced
    7. 7. Growing Issue: increasing number of retractions >15X increase in last decade Strong correlation of “retraction index” with higher impact factor 1. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html 2. Retracted Science and the Retraction Index ▿ http://iai.asm.org/content/79/10/3855.abstract? At current % increase by 2045 as many papers published as retracted!
    8. 8. How
    9. 9. GigaSolution: Deconstructing the paper www.gigadb.org www.gigasciencejournal.com Utilizes big-data infrastructure and expertise from: Combines and integrates: Open-access journal Data Publishing Platform Data Analysis Platform
    10. 10. • Data • Software • Review • Re-use… = Credit } Credit where credit is overdue: “One option would be to provide researchers who release data to public repositories with a means of accreditation.” “An ability to search the literature for all online papers that used a particular data set would enable appropriate attribution for those who share. “ Nature Biotechnology 27, 579 (2009) New incentives/credit
    11. 11. Anatomy of a Publication Data Idea Study Analysis Answer Metadata
    12. 12. Anatomy of a Data Publication Data Idea Study Analysis Answer Metadata
    13. 13. Fail – submitter is provided error report Pass – dataset is uploaded to GigaDB. Submission Workflow Curator makes dataset public (can be set as future date if required)DataCite XML file Excel submission file Submitter logs in to GigaDB website and uploads Excel submission GigaDB DOI assigned Files Submitter provides files by ftp or Aspera XML is generated and registered with DataCite Curator Review Curator contacts submitter with DOI citation and to arrange file transfer (and resolve any other questions/issues). DOI 10.5524/100003 Genomic data from the crab- eating macaque/cynomolgus monkey (Macaca fascicularis) (2011) Public GigaDB dataset See: http://database.oxfordjournals.org/content/2014/bau018.abstract
    14. 14. GigaScience Data Publishing Platform Currently 120 datasets & 50TB data
    15. 15. • ~50 TBs of data from: BGI, ACRG, G10K, Bird10K, 3K Rice Genomes • Provide curation & integration with other DBs (INSDC databases)
    16. 16. Many data types…
    17. 17. BGI Datasets Get DOIs Plants Chinese cabbage Cucumber Foxtail millet Pigeonpea Potato Sorghum Wheat A+B Rice Microbe/metagenomics E. Coli O104:H4 TY-2482 T2D gut metagenome Bulk pooled insects T. Tengcongensis proteome Cell-Lines Chinese Hamster Ovary Mouse methylomes Cancer quantitative protemicsHuman Asian individual (YH) - DNA Methylome - Genome Assembly v1+2 - Transcriptome Cancer (14TB) Single cell bladder cancer HBV infected exomes Ancient DNA - Saqqaq Eskimo - Aboriginal Australian Vertebrates Darwin’s Finch Giant panda Macaque -Chinese rhesus -Crab-eating Mini-Pig Naked mole rat Parrot, Puerto Rican Penguin - Emperor penguin - Adelie penguin Pigeon, domestic Polar bear DA and F344 rats Sheep Tibetan antelope Other fMRI & Retinal waves Invertebrate Ant - Florida carpenter ant - Jerdon’s jumping ant - Leaf-cutter ant Roundworm Schistosoma Silkworm Parasitic nematode Pacific oyster Released pre-publication Paper Published in GigaScience
    18. 18. Cloud solutions? Reward better handling of metadata… Novel tools/formats for data interoperability/handling.
    19. 19. Examples
    20. 20. To maximize its utility to the research community and aid those fighting the current epidemic, genomic data is released here into the public domain under a CC0 license. Until the publication of research papers on the assembly and whole-genome analysis of this isolate we would ask you to cite this dataset as: Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun, Y; Chen,Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song,Y ; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011) Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen. doi:10.5524/100001 http://dx.doi.org/10.5524/100001 Our first DOI: To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
    21. 21. IRRI GALAXY Beneficiaries of the genomics revolution? Rice 3K project: 3,000 rice genomes, 13.4TB public data
    22. 22. NO Collaborations with Pensoft & PLOS Cyber-centipedes & virtual worms
    23. 23. SOURCE USE/REUSE PUBLISH INTEGRATION WITH DOMAIN-SPECIFIC DATABASES VIA ISA-TOOLS NARRATIVE DATA (SOCIAL) MEDIA DATA PRODUCTION Sneddon,T.P., Zhe,X.S., Edmunds,S.C., et al. GigaDB: promoting data dissemination and reproducibility. Database (2014) Vol. 2014: article ID bau018; doi:10.1093/database/bau018
    24. 24. Disseminating new types of data
    25. 25. How are we supporting data reproducibility? Data sets Analyses Open-Paper Open-Review DOI:10.1186/2047-217X-1-18 ~21,000 accesses Open-Code 8 reviewers tested data in ftp server & named reports published DOI:10.5524/100044 Open-Pipelines Open-Workflows DOI:10.5524/100038 Open-Data 78GB CC0 data Code in sourceforge under GPLv3: http://soapdenovo2.sourceforge.net/~21,000 downloads Enabled code to being picked apart by bloggers in wiki http://homolog.us/wiki/index.php?title=SOAPdenovo2
    26. 26. New & more transparent peer-review: The GigaScience way: 8 referees downloaded & tested data, then signed reports
    27. 27. New & more transparent peer-review: The GigaScience way: Real-time open-review = paper in arXiv + blogged reviews
    28. 28. Implement workflows in a community-accepted format http://galaxyproject.org Over 36,000 main Galaxy server users Over 1000 papers citing Galaxy use Over 55 Galaxy servers deployed Open source
    29. 29. Visualizations & DOIs for workflows
    30. 30. SOAPdenovo2 workflows implemented in galaxy.cbiit.cuhk.edu.hk Implemented entire workflow in our Galaxy server, inc.: • 3 pre-processing steps • 4 SOAPdenovo modules • 1 post processing steps • Evaluation and visualization tools Also will be available to download by >36K Galaxy users in
    31. 31. Tool list Tool parameterisation Results panelResults panel GigaGalaxy & Metabolomics
    32. 32. Rewarding and aiding reproducibility OMERO: providing access to imaging data…
    33. 33. Changing the way we publish:
    34. 34. “Deconstructed” Journal “Regular” Journal “Conscientious” Online Journal
    35. 35. Ruibang Luo (BGI/HKU) Shaoguang Liang (BGI-SZ) Tin-Lap Lee (CUHK) Qiong Luo (HKUST) Senghong Wang (HKUST) Yan Zhou (HKUST) Thanks to: @gigascience facebook.com/GigaScience blogs.biomedcentral.com/gigablog/ Peter Li Chris Hunter Jesse Si Zhe Nicole Nogoy Laurie Goodman Rob Davidson Amye Kenall (BMC) Marco Roos (LUMC) Mark Thompson (LUMC) Jun Zhao (Lancaster) Susanna Sansone (Oxford) Philippe Rocca-Serra (Oxford) Alejandra Gonzalez-Beltran (Oxford) www.gigadb.org galaxy.cbiit.cuhk.edu.hk www.gigasciencejournal.com CBIITFunding from: Our collaborators:team: Case study:
    1. A particular slide catching your eye?

      Clipping is a handy way to collect important slides you want to go back to later.

    ×