Open (and Big) Data – the next challenge
Beyond dead trees: are publishers the problem or solution?
Scott Edmunds
OASPA As...
Scott Edmunds at OASPA Asia: Open (and Big) Data – the next challenge

Scott Edmunds' talk at OASPA Asia in Bangkok: Open (and Big) Data – the next challenge. June 2nd 2014

Published in: Data & Analytics, Science
  • Over 20,000 users on the main server
    Over 500 papers citing the use of Galaxy
    Over 55 servers deployed on the Web
  • That just leaves me to thank the GigaScience team (Laurie, Scott, Alexandra, Peter and Jesse); BGI for their support, specifically Shaoguang for IT and bioinformatics support; our collaborators on the database, website and tools (Tin-Lap, Qiong, Senhong, Yan); the Cogini web design team; DataCite for providing the DOI service; and the ISA Commons team for their support and advocacy for best-practice metadata reporting and sharing.

    Thank you for listening.
  • Scott Edmunds at OASP Asia: Open (and Big) Data – the next challenge

    1. 1. Open (and Big) Data – the next challenge. Beyond dead trees: are publishers the problem or solution? Scott Edmunds, OASPA Asia, 2nd June 2014. @gigascience
    2. 2. Harnessing Data-Driven Intelligence. Using networking power of the internet to tackle problems. Enables: • Can ask new questions & find hidden patterns & connections • Build on each other's efforts quicker & more efficiently • More collaborations across more disciplines • Harness wisdom of the crowds: crowdsourcing, citizen science, crowdfunding. Enabled by: removing silos, open licenses, transparency, immediacy
    3. 3. Dead trees not fit for purpose: 1812, 1665, 1869
    4. 4. The problems with publishing • Scholarly articles are merely advertisement of scholarship. The actual scholarly artefacts, i.e. the data and computational methods which support the scholarship, remain largely inaccessible (Jon B. Buckheit and David L. Donoho, WaveLab and Reproducible Research, 1995) • Lack of transparency, lack of credit for anything other than “regular” dead tree publication • If there is interest in data, only to monetise & re-silo • Traditional publishing policies and practices a hindrance
    5. 5. Things holding us back: • Disincentives to share or communicate: – Ingelfinger*! Embargoes, anti preprint & early data release policies – Page/method/citation limits • Disincentives to remix – Open source approaches = plagiarism? • Disincentives to release more quickly/more granularly – “Salami Slicing” • First 2 years of citation data the only currency – “Faddism” v long term use or reproducibility. Publication bias. * T-Shirts available from Graham Steel / http://www.zazzle.co.uk/steelgraham
    6. 6. The consequences: growing replication gap 1. Ioannidis et al., (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14 2. Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8) Out of 18 microarray papers, results from 10 could not be reproduced
    7. 7. Consequences: increasing number of retractions. >15X increase in last decade. Strong correlation of “retraction index” with higher impact factor. 1. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html 2. Retracted Science and the Retraction Index http://iai.asm.org/content/79/10/3855.abstract?
    8. 8. Consequences: growing replication gap 1. Ioannidis et al., 2009. Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14 2. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html 3. Bjorn Brembs: Open Access and the looming crisis in science https://theconversation.com/open-access-and-the-looming-crisis-in-science-14950 More retractions: >15X increase in last decade. At the current rate of increase, by 2045 as many papers will be retracted as published. Insufficient methods
    9. 9. “Faked research is endemic in China” Global perceptions of Chinese Research Million RMB rewards for high IF publications = ? 475, 267 (2011) New Scientist, 17th Nov 2012: http://www.newscientist.com/article/mg21628910.300-fraud-fighter-faked-research-is-endemic-in-china.html Nature, 29th September 2010: http://www.nature.com/news/2010/100929/full/467511a.html Science, 29th November 2013: http://www.sciencemag.org/content/342/6162/1035.full Nature 20th July 2011: http://www.nature.com/news/2011/110720/full/475267a.html
    10. 10. “Faked research is endemic in China” Global perceptions of Chinese Research Million RMB rewards for high IF publications = ? 475, 267 (2011) New Scientist, 17th Nov 2012: http://www.newscientist.com/article/mg21628910.300-fraud-fighter-faked-research-is-endemic-in-china.html Nature, 29th September 2010: http://www.nature.com/news/2010/100929/full/467511a.html Science, 29th November 2013: http://www.sciencemag.org/content/342/6162/1035.full Nature 20th July 2011: http://www.nature.com/news/2011/110720/full/475267a.html “Wide distribution of information is key to scientific progress, yet traditionally, Chinese scientists have not systematically released data or research findings, even after publication.” “There have been widespread complaints from scientists inside and outside China about this lack of transparency.” “Usually incomplete and unsystematic, [what little supporting data released] are of little value to researchers and there is evidence that this drives down a paper's citation numbers.”
    11. 11. Issues not just in China… Need: …to publish protocols BEFORE analysis …better access to supporting data …more transparent & accountable review …to publish replication studies
    12. 12. New incentives/credit: • Data • Software • Review • Re-use… } = Credit. Credit where credit is overdue: “One option would be to provide researchers who release data to public repositories with a means of accreditation.” “An ability to search the literature for all online papers that used a particular data set would enable appropriate attribution for those who share.” Nature Biotechnology 27, 579 (2009)
    13. 13. GigaSolution: deconstructing the paper. www.gigadb.org www.gigasciencejournal.com Combines and integrates: open-access journal, data publishing platform, data analysis platform. Utilizes big-data infrastructure and expertise from:
    14. 14. Rewarding open data
    15. 15. Submission Workflow: (1) Submitter logs in to GigaDB website and uploads Excel submission file. (2) Curator review: Fail – submitter is provided error report; Pass – dataset is uploaded to GigaDB. (3) GigaDB DOI assigned; DataCite XML file is generated and registered with DataCite. (4) Curator contacts submitter with DOI citation and to arrange file transfer (and resolve any other questions/issues). (5) Submitter provides files by ftp or Aspera. (6) Curator makes dataset public (can be set as future date if required). Example of a public GigaDB dataset: DOI 10.5524/100003, Genomic data from the crab-eating macaque/cynomolgus monkey (Macaca fascicularis) (2011). See: http://database.oxfordjournals.org/content/2014/bau018.abstract
    16. 16. • 10-100x faster download than FTP • Provide curation & integration with other DBs
    17. 17. IRRI GALAXY Beneficiaries of this open data?
    18. 18. IRRI GALAXY Beneficiaries of this open data? Rice 3K project: 3,000 rice genomes, 13.4TB public data
    19. 19. NO
    20. 20. NO Collaborations with Pensoft & PLOS Cyber-centipedes & virtual worms
    21. 21. [Diagram labels] Source; User; Narrative; Data; Publisher; External databases (ArrayExpress, Morphbank); Data production; Curation/integration; • Genomics • Barcoding • Imaging • microCT • Video; (Social) media
    22. 22. NO
    23. 23. New & more transparent peer review: open review. BMC Series Medical Journals
    24. 24. Reward open & transparent review End reviewer 3 Downfall parody videos, now!
    25. 25. New & more transparent peer review: pre-prints
    26. 26. Real-time open-review = paper in arXiv + blogged reviews. Reward open & transparent review. http://tmblr.co/ZzXdssfOMJfy www.gigasciencejournal.com/content/2/1/10
    27. 27. Real-time open-review = paper in arXiv + blogged reviews Reward open & transparent review
    28. 28. Readers are interested in open review Next step to link to ORCID
    29. 29. Cloud solutions? Reward better handling of metadata… Novel tools/formats for data interoperability/handling.
    30. 30. Rewarding and aiding reproducibility OMERO: providing access to imaging data…
    31. 31. Implement workflows in a community-accepted format http://galaxyproject.org Over 36,000 main Galaxy server users Over 1,000 papers citing Galaxy use Over 55 Galaxy servers deployed Open source Rewarding and aiding reproducibility
    32. 32. galaxy.cbiit.cuhk.edu.hk
    33. 33. Visualizations & DOIs for workflows
    34. 34. How are we supporting data reproducibility? Data sets; Analyses; Open-Paper; Open-Review; DOI:10.1186/2047-217X-1-18 (>23,000 accesses); Open-Code; 7 reviewers tested data in ftp server & named reports published; DOI:10.5524/100044; Open-Pipelines; Open-Workflows; DOI:10.5524/100038; Open-Data; 78GB CC0 data; Code in sourceforge under GPLv3: http://soapdenovo2.sourceforge.net/ (>20,000 downloads); Enabled code to be picked apart by bloggers in wiki: http://homolog.us/wiki/index.php?title=SOAPdenovo2
    35. 35. 7 referees downloaded & tested data, then signed reports Reward open & transparent review
    36. 36. Post publication: bloggers pull apart code/reviews in blogs + wiki. SOAPdenovo2 wiki: http://homolog.us/wiki1/index.php?title=SOAPdenovo2 Homologus blogs: http://www.homolog.us/blogs/category/soapdenovo/ Reward open & transparent review
    37. 37. SOAPdenovo2 workflows implemented in galaxy.cbiit.cuhk.edu.hk
    38. 38. SOAPdenovo2 workflows implemented in galaxy.cbiit.cuhk.edu.hk. Implemented entire workflow in our Galaxy server, inc.: • 3 pre-processing steps • 4 SOAPdenovo modules • 1 post-processing step • Evaluation and visualization tools. Also will be available to download by >36K Galaxy users in
    39. 39. SOAPdenovo2 S. aureus pipeline
    40. 40. Taking a microscope to peer review
    41. 41. The SOAPdenovo2 Case study. Subject to and test with 3 models. Types of resources in an RO: Data; Method/Experimental protocol; Findings. Models to describe each resource type: ISA-TAB/ISA2OWL; Nanopublication; Wfdesc/ISA-TAB/ISA2OWL
    42. 42. Lessons learned: • Most published research findings are false. Or at least have errors. • On a semantic level (via nanopublications) discovered 4 minor errors in text (interpretation not data) • Is possible to push button(s) & recreate a result from a paper • Reproducibility is COSTLY. How much are you willing to spend? • Much easier to do this before rather than after publication
    43. 43. “Deconstructed” Journal “Regular” Journal “Conscientious” Online Journal
    44. 44. “Deconstructed” Journal “Regular” Journal “Conscientious” Online Journal
    45. 45. “Deconstructed” Journal “Regular” Journal “Conscientious” Online Journal
    46. 46. Image Source: http://commons.wikimedia.org/wiki/File:System-Mechanic-California.jpg “Deconstructed” Journal “Regular” Journal “Conscientious” Online Journal
    47. 47. Give us data, papers & pipelines* Help us make it happen! Contact us: scott@gigasciencejournal.com editorial@gigasciencejournal.com database@gigasciencejournal.com www.gigasciencejournal.com * APCs currently generously covered by BGI until 2015
    48. 48. Thanks to: Our collaborators: Ruibang Luo (BGI/HKU), Shaoguang Liang (BGI-SZ), Tin-Lap Lee (CUHK), Qiong Luo (HKUST), Senghong Wang (HKUST), Yan Zhou (HKUST). Team: Peter Li, Huayan Gao, Chris Hunter, Jesse Si Zhe, Nicole Nogoy, Laurie Goodman, Amye Kenall (BMC). Case study: Marco Roos (LUMC), Mark Thompson (LUMC), Jun Zhao (Lancaster), Susanna Sansone (Oxford), Philippe Rocca-Serra (Oxford), Alejandra Gonzalez-Beltran (Oxford). @gigascience facebook.com/GigaScience blogs.biomedcentral.com/gigablog/ www.gigadb.org galaxy.cbiit.cuhk.edu.hk www.gigasciencejournal.com CBIIT. Funding from:
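
The GigaDB submission workflow on slide 15 (curator review with a pass/fail outcome, DOI assignment, and a DataCite XML record) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not GigaDB's actual code: the `Submission` class, the `review`, `datacite_xml` and `process` functions, the validation rules, the publisher string, and the example creator name are all hypothetical. The XML holds only a handful of DataCite-style fields and is not validated against the DataCite schema. The DOI prefix 10.5524 is taken from the example DOI on the slide.

```python
from dataclasses import dataclass
from typing import List
import xml.etree.ElementTree as ET

# Prefix taken from the slide's example DOI (10.5524/100003).
DOI_PREFIX = "10.5524"


@dataclass
class Submission:
    """Hypothetical stand-in for the contents of the Excel submission file."""
    title: str
    creators: List[str]
    year: int


def review(sub: Submission) -> List[str]:
    """Stand-in for curator review: return a list of problems (empty = pass)."""
    errors = []
    if not sub.title.strip():
        errors.append("missing title")
    if not sub.creators:
        errors.append("no creators listed")
    return errors


def datacite_xml(sub: Submission, doi: str, publisher: str = "GigaDB") -> str:
    """Build a simplified DataCite-style record (assumed subset of fields)."""
    root = ET.Element("resource")
    ident = ET.SubElement(root, "identifier", {"identifierType": "DOI"})
    ident.text = doi
    creators = ET.SubElement(root, "creators")
    for name in sub.creators:
        creator = ET.SubElement(creators, "creator")
        ET.SubElement(creator, "creatorName").text = name
    titles = ET.SubElement(root, "titles")
    ET.SubElement(titles, "title").text = sub.title
    ET.SubElement(root, "publisher").text = publisher
    ET.SubElement(root, "publicationYear").text = str(sub.year)
    return ET.tostring(root, encoding="unicode")


def process(sub: Submission, dataset_id: int) -> dict:
    """Fail: return an error report. Pass: assign a DOI and build the XML."""
    errors = review(sub)
    if errors:
        return {"status": "fail", "error_report": errors}
    doi = f"{DOI_PREFIX}/{dataset_id}"
    return {"status": "pass", "doi": doi, "datacite_xml": datacite_xml(sub, doi)}


if __name__ == "__main__":
    example = Submission(
        title="Genomic data from the crab-eating macaque",
        creators=["Example Author"],  # placeholder name, not from the slides
        year=2011,
    )
    print(process(example, 100003))
```

Registering the record with DataCite, transferring files by ftp or Aspera, and setting a future release date are left out; in this sketch they would be further steps applied only to submissions whose status is "pass".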