Now that you all want to submit to GigaDB, how do you do that, how will people search for and find your data, and, other than citing your DOI, what will they be able to do with the data? We have redesigned the underlying Giga database and are working on the front end, which we hope to make public early next month, so the following slides are a mix of screenshots from the development site overlaid with tweaks made in PowerPoint to illustrate features you can hope to see when we go live. These include:
• a home page image slider for browsing datasets
• a text box search, which I will demonstrate shortly
That just leaves me to thank the GigaScience team: Laurie, Scott, Alexandra, Peter and Jesse; BGI for their support, specifically Shaoguang for IT and bioinformatics support; our collaborators on the database, website and tools: Tin-Lap, Qiong, Senhong, Yan and the Cogini web design team; DataCite for providing the DOI service; and the ISA Commons team for their support and advocacy for best-practice metadata reporting and sharing. Thank you for listening.
Scott Edmunds A*STAR open access workshop: how licensing can change the way we do research
...how licensing can change the way we do research
Scott Edmunds, A*STAR, 18th April 2013
Open-Review, Open-Access, Open-Source, Open-Data
Journal, data-platform and database for large-scale data in conjunction with
Editor-in-Chief: Laurie Goodman
Executive Editor: Scott Edmunds
Commissioning Editor: Nicole Nogoy
Lead Curator: Chris Hunter
Data Platform: Peter Li
www.gigasciencejournal.com
“Information is the currency of the future world” William Gibson
Era of Data-Driven Science
Enables:
• Using the networking power of the internet to tackle problems
• Asking new questions & finding patterns & connections hidden in others' data
• Building on each other's efforts more quickly & efficiently
• More collaborations across more disciplines
• Harnessing the wisdom of the crowds: crowdsourcing, citizen science, crowdfunding
Enabled by: removing silos, standards/formats, open-access/data
Good for a field: Genomics/Bioinformatics
• Long-term sharing infrastructure
• Strong use of standards/policies
• Plummeting cost/explosion in volumes
Sharing aids specific communities…
Rice v Wheat: consequences of publicly available genome data.
[Figure: number of papers, rice vs wheat (y-axis: Papers, 0 to 700)]
Sharing aids individuals…
Sharing Detailed Research Data Is Associated with Increased Citation Rate. Piwowar HA, Day RS, Fridsma DB (2007) PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308
Every 10 datasets collected contribute to at least 4 papers in the following 3 years. Piwowar, HA, Vision, TJ, & Whitlock, MC (2011). Data archiving is a good investment. Nature, 473 (7347), 285-285. DOI: 10.1038/473285a
Growing Issue: unrepeatability of scientific results
Out of 18 microarray papers, results from 10 could not be reproduced.
Ioannidis et al., 2009. Repeatability of published microarray gene expression analyses. Nature Genetics 41: 149-155.
Growing Issue: increasing number of retractions
• >15X increase in last decade
• Strong correlation of “retraction index” with higher impact factor
1. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
2. Retracted Science and the Retraction Index http://iai.asm.org/content/79/10/3855.abstract?
GigaSolution: deconstructing the paper
Provide infrastructure and mechanisms of reward for:
• Data availability
• Metadata/curation
• Interoperability
• Availability of workflows
• Transparent analyses
[Diagram labels: Metadata, Analyses, Methods, Data]
GigaSolution: deconstructing the paper
Combines and integrates:
• Open-access journal
• Data Publishing Platform
• Data Analysis Platform
Utilizes big-data infrastructure and expertise from the world's largest genomics organisation, with: 17PB storage, 20.5K cores, 212TFlops, >1000 bioinformaticians
www.gigadb.org www.gigasciencejournal.com
Open-Access
Why/what/how?
Where does licensing fit?
Importance of licensing: ability to mine & reuse content
Budapest Open Access Initiative: “By ‘open access’ to [peer-reviewed research literature], we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.”
Needs to be: [licence logo]
• SA, NC and ND put unnecessary restrictions on content and are not counted as “true OA”
• CC0 is better than CC-BY for datasets, to prevent “attribution stacking”
Importance of licensing: ability to mine & reuse content
• Gives authors control over the integrity of their work and the right to be properly acknowledged and cited.
• Does not grant publicity rights, and attribution can be used to clearly disclaim endorsement.
• Restrictions rarely benefit the author, but do inhibit reuse.
Restrictions prevent translations, create incompatibility issues when mixing with other licenses (some combinations are illegal, e.g. CC-NC-SA & CC-BY-SA), hinder non-profits and mixed collaborations, and are practically unenforceable; dealing with requests is more trouble than it's worth.
Use of non CC-BY by publishers = “double dipping” (selling content, reprints, etc.)
Further reading:
http://www.nature.com/nature/journal/v495/n7442/full/495440a.html
http://blogs.ch.cam.ac.uk/pmr/2011/11/29/scientists-should-never-use-cc-nc-this-explains-why/
New incentives/credit
Credit where credit is overdue:
“One option would be to provide researchers who release data to public repositories with a means of accreditation.”
“An ability to search the literature for all online papers that used a particular data set would enable appropriate attribution for those who share.”
Nature Biotechnology 27, 579 (2009)
Prepublication data sharing (Toronto International Data Release Workshop):
“Data producers benefit from creating a citable reference, as it can later be used to reflect impact of the data sets.”
Nature 461, 168-170 (2009)
New incentives/credit = Data Citation?
“increase acceptance of research data as legitimate, citable contributions to the scholarly record”.
“data generated in the course of research are just as valuable to the ongoing academic discourse as papers and monographs”.
Anatomy of a Publication
[Diagram labels: Idea, Study, Metadata, Data, Analysis, Answer]
Anatomy of a Data Publication
[Diagram labels: Idea, Study, Metadata, Data, Analysis, Answer]
• Data availability
• Content re-use
• …
} = Credit
GigaDB is a new database integrated with the GigaScience journal to meet the needs of a new generation of biological and biomedical research as it enters the era of “big-data”…
BGI Datasets Get DOI®s
Key: Released pre-publication / Paper published in GigaScience

Invertebrate
• Ant: Florida carpenter ant, Jerdon’s jumping ant, Leaf-cutter ant
• Roundworm
• Schistosoma
• Silkworm
• Parasitic nematode
• Pacific oyster

Human
• Asian individual (YH): DNA Methylome, Genome Assembly v1+2, Transcriptome
• Cancer (14TB)
• Single cell bladder cancer
• HBV infected exomes
• Ancient DNA: Saqqaq Eskimo, Aboriginal Australian

Vertebrates
• Darwin’s Finch
• Giant panda
• Macaque: Chinese rhesus, Crab-eating
• Mini-Pig
• Naked mole rat
• Parrot, Puerto Rican
• Penguin: Emperor penguin, Adelie penguin
• Pigeon, domestic
• Polar bear
• Sheep
• Tibetan antelope

Microbe/metagenomics
• E. coli O104:H4 TY-2482
• T2D gut metagenome
• Bulk pooled insects

Cell-Lines
• Chinese Hamster Ovary
• Mouse methylomes

Plants
• Chinese cabbage
• Cucumber
• Foxtail millet
• Pigeonpea
• Potato
• Sorghum
• Wheat A+B
Open-Source Why/what/how? The new way of doing science?
Open-Source: the source of it all
Software community understands benefits
• Transparent, fast, collaborative
• Long history, large community
• Many licenses
• Many repositories
• Many users/platforms
The People's Parrot: Amazona vittata
Puerto Rican Parrot Genome Project
• Rarest parrot, national bird of Puerto Rico
• Community funded from artworks, fashion shows, crowdfunding…
• Genome annotated by students in community college as part of bioinformatics education
• Paper and data published in GigaScience and GigaDB
Taras K Oleksyk, et al. (2012): A Locally Funded Puerto Rican Parrot (Amazona vittata) Genome Sequencing Project Increases Avian Data and Advances Young Researcher Education. GigaScience 2012, 1:14
Steven J. O’Brien (2012): Genome empowerment for the Puerto Rican parrot – Amazona vittata. GigaScience 2012, 1:13
Oleksyk et al. (2012): Genomic data of the Puerto Rican Parrot (Amazona vittata) from a locally funded project. GigaScience. http://dx.doi.org/10.5524/100039
How are we supporting data reproducibility?
• Open-Data: data sets, DOI:10.5524/100038, 78GB CC0 data
• Open-Paper: DOI:10.1186/2047-217X-1-18, ~8000 accesses
• Open-Pipelines / Open-Workflows: analyses, DOI:10.5524/100044
• Open-Review: 8 reviewers tested data on the FTP server & named reports were published
• Open-Code: code in SourceForge under GPLv3: http://soapdenovo2.sourceforge.net/ (~4000 downloads)
Enabled code to be picked apart by bloggers in a wiki: http://homolog.us/wiki/index.php?title=SOAPdenovo2
SOAPdenovo2 workflows implemented in galaxy.cbiit.cuhk.edu.hk
SOAPdenovo2 workflows implemented in Galaxy
Implemented entire workflow in our Galaxy server, including:
• 3 pre-processing steps
• 4 SOAPdenovo modules
• 1 post-processing step
• Evaluation and visualization tools
Also available to download by >25K Galaxy users
galaxy.cbiit.cuhk.edu.hk
New & more transparent peer-review: The GigaScience way
8 referees downloaded & tested data, then signed reports
New & more transparent peer-review: The GigaScience way
Post publication: bloggers pull apart code/reviews in blogs + wiki:
SOAPdenovo2 wiki: http://homolog.us/wiki1/index.php?title=SOAPdenovo2
Homologus blogs: http://www.homolog.us/blogs/category/soapdenovo/
New & more transparent peer-review: The GigaScience way
Real-time open-review = paper in arXiv + blogged reviews
Our first DOI:
To maximize its utility to the research community and aid those fighting the current epidemic, genomic data is released here into the public domain under a CC0 license. Until the publication of research papers on the assembly and whole-genome analysis of this isolate we would ask you to cite this dataset as:
Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011) Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen. doi:10.5524/100001 http://dx.doi.org/10.5524/100001
To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
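As a rough illustration of how a dataset citation like the one on this slide is assembled, here is a minimal Python sketch. The `DatasetRecord` class and the `bare_doi`, `doi_url` and `format_citation` helpers are hypothetical names made up for this example, not part of GigaDB or DataCite; the only things taken from the slide are the field order (authors, year, title, publisher, DOI) and the `http://dx.doi.org/` resolver prefix. The author list is shortened for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DatasetRecord:
    """Hypothetical container for the fields that appear in the slide's citation."""
    authors: List[str]
    year: int
    title: str
    publisher: str
    doi: str


def bare_doi(doi: str) -> str:
    """Strip whitespace and an optional leading 'doi:' label."""
    text = doi.strip()
    if text.lower().startswith("doi:"):
        text = text[4:].strip()
    return text


def doi_url(doi: str, resolver: str = "http://dx.doi.org/") -> str:
    """Build a resolver link; the default prefix is the one shown on the slide."""
    return resolver + bare_doi(doi)


def format_citation(rec: DatasetRecord) -> str:
    """Join the fields in the order used on the slide (an assumed template)."""
    authors = "; ".join(rec.authors)
    doi = bare_doi(rec.doi)
    return (
        f"{authors} ({rec.year}) {rec.title}. "
        f"{rec.publisher}. doi:{doi} {doi_url(doi)}"
    )


if __name__ == "__main__":
    # Author list shortened for illustration; the slide lists the full consortium.
    record = DatasetRecord(
        authors=["Li, D", "Xi, F"],
        year=2011,
        title="Genomic data from Escherichia coli O104:H4 isolate TY-2482",
        publisher="BGI Shenzhen",
        doi="doi:10.5524/100001",
    )
    print(format_citation(record))
```

Keeping the DOI as a bare identifier and deriving the link from it means the same record can be pointed at a different resolver without touching the citation text.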
Downstream consequences:
1. Citations (~140)
2. Therapeutics (primers, antimicrobials)
3. Platform comparisons
4. Example for faster & more open science
“Last summer, biologist Andrew Kasarskis was eager to help decipher the genetic origin of the Escherichia coli strain that infected roughly 4,000 people in Germany between May and July. But he knew that it might take days for the lawyers at his company (Pacific Biosciences) to parse the agreements governing how his team could use data collected on the strain. Luckily, one team had released its data under a Creative Commons licence that allowed free use of the data, allowing Kasarskis and his colleagues to join the international research effort and publish their work without wasting time on legal wrangling.”
1.3 The power of intelligently open data
The benefits of intelligently open data were powerfully illustrated by events following an outbreak of a severe gastro-intestinal infection in Hamburg in Germany in May 2011. This spread through several European countries and the US, affecting about 4000 people and resulting in over 50 deaths. All tested positive for an unusual and little-known Shiga-toxin–producing E. coli bacterium. The strain was initially analysed by scientists at BGI-Shenzhen in China, working together with those in Hamburg, and three days later a draft genome was released under an open data licence. This generated interest from bioinformaticians on four continents. 24 hours after the release of the genome it had been assembled. Within a week two dozen reports had been filed on an open-source site dedicated to the analysis of the strain. These analyses provided crucial information about the strain’s virulence and resistance genes – how it spreads and which antibiotics are effective against it. They produced results in time to help contain the outbreak. By July 2011, scientists published papers based on this work. By opening up their early sequencing results to international collaboration, researchers in Hamburg produced results that were quickly tested by a wide range of experts, used to produce new knowledge and ultimately to control a public health emergency.
Ultimate Goal: Executable papers
[Diagram labels: Data Papers, Executable Papers, (Methods) Papers, Analysis Papers]
Help us make it happen!
Give us your data, papers & pipelines*
Contact us: email@example.com firstname.lastname@example.org email@example.com
* APCs currently generously covered by BGI
www.gigasciencejournal.com
Thanks to:
The team: Peter Li, Chris Hunter, Jesse Si Zhe, Nicole Nogoy, Tam Sneddon, Alexandra Basford, Laurie Goodman
Our collaborators: Ruibang Luo (BGI/HKU), Shaoguang Liang (BGI-SZ), Tin-Lap Lee (CUHK), Huayen Gao (CUHK), Qiong Luo (HKUST), Senghong Wang (HKUST), Yan Zhou (HKUST)
Funding from: CBIIT
Follow us: @gigascience, facebook.com/GigaScience, blogs.openaccesscentral.com/blogs/gigablog/
www.gigadb.org galaxy.cbiit.cuhk.edu.hk www.gigasciencejournal.com