Scott Edmunds A*STAR open access workshop: how licensing can change the way we do research
1. ...how licensing can change
the way we do research
Scott Edmunds
A*STAR, 18th April 2013
Open-Review Open-Access
Open-Source Open-Data
2. Journal, data-platform and
database for large-scale data
in conjunction with
Editor-in-Chief: Laurie Goodman
Executive Editor: Scott Edmunds
Commissioning Editor: Nicole Nogoy
Lead Curator: Chris Hunter
Data Platform: Peter Li
www.gigasciencejournal.com
5. Take home message:
It's all about the re-use
To do this everything needs to be free
and accessible to be read by humans &
machines*
* See: http://www.biomedcentral.com/about/datamining
8. Era of Data-Driven Science
Enables:
Using networking power of the internet to tackle problems
Can ask new questions & find patterns & connections hidden in
others' data
Build on each other's efforts quicker & more efficiently
More collaborations across more disciplines
Harness wisdom of the crowds: crowdsourcing, citizen science,
crowdfunding
Enabled by:
Removing silos, standards/formats, open-access/data
9. Good for a field:
Genomics/Bioinformatics
Long term sharing infrastructure:
Strong use of standards/policies:
Plummeting cost/explosion in volumes:
10. Sharing aids specific communities…
Rice v Wheat: consequences of publicly available
genome data.
[Chart: papers for rice vs wheat; y-axis: Papers, 0 to 700]
11. Sharing aids individuals…
Sharing Detailed Research
Data Is Associated with
Increased Citation Rate.
Piwowar HA, Day RS, Fridsma DB (2007)
PLoS ONE 2(3): e308.
doi:10.1371/journal.pone.0000308
Every 10 datasets collected contribute to at least 4 papers in the
following 3 years.
Piwowar HA, Vision TJ, Whitlock MC (2011) Data archiving is a good investment.
Nature 473(7347): 285. doi:10.1038/473285a
12. Growing Issue: unrepeatability of scientific results
Out of 18 microarray papers, results
from 10 could not be reproduced
Ioannidis et al., 2009. Repeatability of published microarray gene expression analyses.
Nature Genetics 41: 149-155.
13. Growing Issue: increasing number of retractions
>15X increase in last decade
Strong correlation of “retraction index” with
higher impact factor
1. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
2. Retracted Science and the Retraction Index http://iai.asm.org/content/79/10/3855.abstract?
15. GigaSolution: deconstructing the paper
Provide infrastructure and mechanisms of reward for:
• Data availability
• Metadata/curation
• Interoperability
• Availability of workflows
• Transparent analyses
[Diagram labels: Metadata, Analyses, Methods, Data]
16. GigaSolution: deconstructing the paper
Combines and integrates:
Open-access journal
Data Publishing Platform
Data Analysis Platform
Utilizes big-data infrastructure and expertise from:
World's largest genomics organisation with:
17PB storage, 20.5K cores, 212TFlops,
>1000 bioinformaticians
www.gigadb.org
www.gigasciencejournal.com
18. Importance of licensing: ability to mine & reuse content
Budapest Open Access Initiative:
“By “open access” to [peer-reviewed research literature], we mean its
free availability on the public internet, permitting any users to
read, download, copy, distribute, print, search, or link to the full texts
of these articles, crawl them for indexing, pass them as data to
software, or use them for any other lawful purpose, without
financial, legal, or technical barriers other than those inseparable from
gaining access to the internet itself. The only constraint on
reproduction and distribution, and the only role for copyright in this
domain, should be to give authors control over the integrity of their
work and the right to be properly acknowledged and cited.”
Needs to be: [logo images not preserved]
SA, NC, ND impose unnecessary restrictions and are not counted as “true OA”
CC0 is better than CC-BY for datasets to prevent “attribution stacking”
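The rule on this slide (licences carrying SA, NC or ND elements are not counted as “true OA”) can be sketched as a small check. This is a minimal illustration only: the function names and the identifier format (e.g. "CC-BY-NC-SA") are assumptions made for the example, not an official Creative Commons tool.

```python
# Hypothetical helper illustrating the slide's rule; the names and the
# "CC-BY-NC-SA" style of identifier are assumptions for this sketch.
RESTRICTIVE_ELEMENTS = {"SA", "NC", "ND"}


def licence_elements(licence):
    """Split an identifier such as 'CC-BY-NC-SA' into its elements."""
    parts = licence.strip().upper().split("-")
    if not parts[0].startswith("CC"):
        raise ValueError("expected a Creative Commons identifier: " + licence)
    # Drop the leading 'CC' marker; 'CC0' is kept as a single element.
    return {p for p in parts if p and p != "CC"}


def is_true_oa(licence):
    """True when the licence carries none of the SA, NC or ND restrictions."""
    return not (licence_elements(licence) & RESTRICTIVE_ELEMENTS)


if __name__ == "__main__":
    for name in ["CC-BY", "CC0", "CC-BY-NC-SA", "CC-BY-ND"]:
        print(name, is_true_oa(name))
```

Under this sketch, CC-BY and CC0 pass, while any identifier containing SA, NC or ND does not.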
19. Importance of licensing: ability to mine & reuse content
[Logo image not preserved] =
• Gives authors control over the integrity of their work and the right
to be properly acknowledged and cited.
• Does not grant publicity rights, and attribution can be used to
clearly disclaim endorsement
• Restrictions rarely benefit author, but do inhibit reuse
Prevents translations, incompatibility issues mixing other
licenses, some combinations illegal (e.g. CC-NC-SA & CC-BY-SA),
hinders non-profits and mixed-collaborations, practically
unenforceable, dealing with requests more trouble than it's worth.
Use of non CC-BY by publishers = “double dipping” (selling content, reprints, etc.)
Further reading:
http://www.nature.com/nature/journal/v495/n7442/full/495440a.html
http://blogs.ch.cam.ac.uk/pmr/2011/11/29/scientists-should-never-use-cc-nc-this-explains-why/
21. New incentives/credit
Credit where credit is overdue:
“One option would be to provide researchers who release data to
public repositories with a means of accreditation.”
“An ability to search the literature for all online papers that used a
particular data set would enable appropriate attribution for those
who share.”
Nature Biotechnology 27, 579 (2009)
Prepublication data sharing
(Toronto International Data Release Workshop)
“Data producers benefit from creating a citable reference, as it can
later be used to reflect impact of the data sets.”
Nature 461, 168-170 (2009)
22. New incentives/credit
= Data Citation?
“increase acceptance of research data as
legitimate, citable contributions to the
scholarly record”.
“data generated in the course of research
are just as valuable to the ongoing
academic discourse as papers and
monographs”.
23. Anatomy of a Publication
Idea
Study
Metadata
Data
Analysis
Answer
24. Anatomy of a Data Publication
Idea
Study
Metadata
Data
Analysis
Answer
26. GigaDB is a new database integrated with the GigaScience journal to meet the needs of a new generation of biological
and biomedical research as it enters the era of “big-data”… (see more)
27. BGI Datasets Get DOI®s
Legend: Released pre-publication; Paper Published in GigaScience
Invertebrate:
Ant
- Florida carpenter ant
- Jerdon’s jumping ant
- Leaf-cutter ant
Roundworm
Schistosoma
Silkworm
Parasitic nematode
Pacific oyster
Human:
Asian individual (YH)
- DNA Methylome
- Genome Assembly v1+2
- Transcriptome
Cancer (14TB)
Single cell bladder cancer
HBV infected exomes
Ancient DNA
- Saqqaq Eskimo
- Aboriginal Australian
Vertebrates:
Darwin’s Finch
Giant panda
Macaque
- Chinese rhesus
- Crab-eating
Mini-Pig
Naked mole rat
Parrot, Puerto Rican
Penguin
- Emperor penguin
- Adelie penguin
Pigeon, domestic
Polar bear
Sheep
Tibetan antelope
Microbe/metagenomics:
E. coli O104:H4 TY-2482
T2D gut metagenome
Bulk pooled insects
Cell-Lines:
Chinese Hamster Ovary
Mouse methylomes
Plants:
Chinese cabbage
Cucumber
Foxtail millet
Pigeonpea
Potato
Sorghum
Wheat A+B
28. Open-Source
Why/what/how?
The new way of doing science?
29. Open-Source: the source of it all
Software community understands benefits
• Transparent, fast, collaborative
• Long history, large community
• Many licenses
• Many repositories
• Many users/platforms
31. New & more transparent peer-review:
Pre-publication: pre-prints
32. New & more transparent peer-review:
During-publication: open-review
BMC Series
Medical Journals
33. New & more transparent peer-review:
Post-publication review
Open content lets you do interesting things post-publication:
New pub models:
Comments, blogs, online journal clubs
Altmetrics:
35. The People's Parrot: Amazona vittata
Puerto Rican Parrot Genome Project
Rarest parrot, national bird of Puerto Rico
Community funded from artworks, fashion shows, crowdfunding…
Genome annotated by students in community college as part of bioinformatics education
Paper and Data published in GigaScience and GigaDB
Taras K Oleksyk, et al., (2012) A Locally Funded Puerto Rican Parrot (Amazona vittata) Genome Sequencing Project Increases Avian Data and Advances Young
Researcher Education. GigaScience 2012, 1:14
Steven J. O’Brien. (2012): Genome empowerment for the Puerto Rican parrot – Amazona vittata. GigaScience 2012, 1:13
Oleksyk et al., (2012): Genomic data of the Puerto Rican Parrot (Amazona vittata) from a locally funded project. GigaScience.
http://dx.doi.org/10.5524/100039
37. How are we supporting data
reproducibility?
Open-Data
Data sets DOI:10.5524/100038
78GB CC0 data
Open-Paper
DOI:10.1186/2047-217X-1-18
~8000 accesses
Open-Pipelines
Open-Workflows
Analyses DOI:10.5524/100044
Open-Review
8 reviewers tested data on the FTP server & named reports published
Enabled code to be picked apart by bloggers in wiki
http://homolog.us/wiki/index.php?title=SOAPdenovo2
Open-Code
Code in sourceforge under GPLv3: http://soapdenovo2.sourceforge.net/
~4000 downloads
39. SOAPdenovo2 workflows implemented in
Implemented entire workflow in our Galaxy server, inc.:
• 3 pre-processing steps
• 4 SOAPdenovo modules
• 1 post-processing step
• Evaluation and visualization tools
Also available to download by >25K Galaxy users in
galaxy.cbiit.cuhk.edu.hk
40. New & more transparent peer-review:
The GigaScience way:
8 referees downloaded & tested data, then signed reports
41. New & more transparent peer-review:
The GigaScience way:
Post publication: bloggers pull apart code/reviews in blogs + wiki:
SOAPdenovo2 wiki: http://homolog.us/wiki1/index.php?title=SOAPdenovo2
Homologus blogs: http://www.homolog.us/blogs/category/soapdenovo/
42. New & more transparent peer-review:
The GigaScience way:
Real-time open-review = paper in arXiv + blogged reviews
43. Our first DOI:
To maximize its utility to the research community and aid those fighting
the current epidemic, genomic data is released here into the public domain
under a CC0 license. Until the publication of research papers on the
assembly and whole-genome analysis of this isolate we would ask you to
cite this dataset as:
Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G;
Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S;
Li, J; Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z;
Zhao, X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and
the Escherichia coli O104:H4 TY-2482 isolate genome sequencing
consortium (2011)
Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI
Shenzhen. doi:10.5524/100001
http://dx.doi.org/10.5524/100001
To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to
Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
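The citation pattern on this slide (authors, year, title, publisher, DOI, plus a resolvable link) can be assembled mechanically from dataset metadata. The sketch below is only an illustration of that pattern: the function names are hypothetical, the author list is shortened for brevity, and the resolver prefix is the one shown on the slide.

```python
# Hypothetical helpers; not part of GigaDB or DataCite tooling.
DOI_RESOLVER = "http://dx.doi.org/"  # resolver prefix as shown on the slide


def doi_url(doi):
    """Turn a bare DOI such as '10.5524/100001' into a resolvable link."""
    return DOI_RESOLVER + doi


def cite_dataset(authors, year, title, publisher, doi):
    """Format a dataset citation in the style used on the slide."""
    author_text = "; ".join(authors)
    return "{} ({}) {}. {}. doi:{}".format(author_text, year, title, publisher, doi)


if __name__ == "__main__":
    # Author list shortened for illustration only.
    citation = cite_dataset(
        ["Li, D", "Xi, F"],
        2011,
        "Genomic data from Escherichia coli O104:H4 isolate TY-2482",
        "BGI Shenzhen",
        "10.5524/100001",
    )
    print(citation)
    print(doi_url("10.5524/100001"))
```

Keeping the DOI as a separate field, rather than only embedded in free text, is what lets both people and machines follow the citation back to the data.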
46. Downstream consequences:
1. Citations (~140)
2. Therapeutics (primers, antimicrobials)
3. Platform Comparisons
4. Example for faster & more open science
“Last summer, biologist Andrew Kasarskis was eager to help decipher the genetic origin of the Escherichia coli
strain that infected roughly 4,000 people in Germany between May and July. But he knew that it might take days
for the lawyers at his company — Pacific Biosciences — to parse the agreements governing how his team could
use data collected on the strain. Luckily, one team had released its data under a Creative Commons licence that
allowed free use of the data, allowing Kasarskis and his colleagues to join the international research effort and
publish their work without wasting time on legal wrangling.”
49. 1.3 The power of intelligently open data
The benefits of intelligently open data were powerfully illustrated by
events following an outbreak of a severe gastro-intestinal infection in
Hamburg in Germany in May 2011. This spread through several
European countries and the US, affecting about 4000 people and
resulting in over 50 deaths. All tested positive for an unusual and
little-known Shiga-toxin–producing E. coli bacterium. The strain was
initially analysed by scientists at BGI-Shenzhen in China, working
together with those in Hamburg, and three days later a draft
genome was released under an open data licence. This generated
interest from bioinformaticians on four continents. 24 hours after
the release of the genome it had been assembled. Within a week
two dozen reports had been filed on an open-source site dedicated
to the analysis of the strain. These analyses provided crucial
information about the strain’s virulence and resistance genes – how
it spreads and which antibiotics are effective against it. They
produced results in time to help contain the outbreak. By July
2011, scientists published papers based on this work. By opening up
their early sequencing results to international
collaboration, researchers in Hamburg produced results that were
quickly tested by a wide range of experts, used to produce new
knowledge and ultimately to control a public health emergency.
56. Help us make it
happen!
Give us your data, papers
& pipelines*
Contact us:
scott@gigasciencejournal.com
editorial@gigasciencejournal.com
database@gigasciencejournal.com
* APCs currently generously covered by BGI
www.gigasciencejournal.com
57. Thanks to:
Team: Peter Li, Chris Hunter, Jesse Si Zhe, Nicole Nogoy, Tam Sneddon, Alexandra Basford, Laurie Goodman
Our collaborators: Ruibang Luo (BGI/HKU), Shaoguang Liang (BGI-SZ), Tin-Lap Lee (CUHK), Huayen Gao (CUHK), Qiong Luo (HKUST), Senghong Wang (HKUST), Yan Zhou (HKUST)
Funding from: CBIIT
Follow us: @gigascience
facebook.com/GigaScience
blogs.openaccesscentral.com/blogs/gigablog/
www.gigadb.org
galaxy.cbiit.cuhk.edu.hk
www.gigasciencejournal.com
Editor's Notes
And now that you all want to submit to GigaDB, how do you do that and how will people search and find your data and, other than citing your DOI, what will they be able to do with the data? We have redesigned the underlying Giga database and we’re working on the front end, which we hope to be public early next month, so the following slides are a mix of screenshots from the development site overlaid with tweaks made in PowerPoint to illustrate features you can hope to see when we go live. These include:
- a home page image slider for browsing datasets
- a text box search which I will demonstrate shortly
That just leaves me to thank the GigaScience team: Laurie, Scott, Alexandra, Peter and Jesse, BGI for their support - specifically Shaoguang for IT and bioinformatics support – our collaborators on the database, website and tools: Tin-Lap, Qiong, Senhong, Yan, the Cogini web design team, Datacite for providing the DOI service and the isacommons team for their support and advocacy for best practice use of metadata reporting and sharing. Thank you for listening.