@openaire_eu
The case of Open
Research Data
Emilie Hermans
Ghent University
Based on Mounce, R. (2014), “The State of Open Research Data”, talk for OpenCon 2014 (Washington D.C.).
https://www.slideshare.net/rossmounce/open-con-mouncedata?qid=d8f441d1-c968-4c4a-ab4d-eb2d04d7fc3a&v=&b=&from_search=24
Side note….
Whenever I talk about data in this talk,
assume I’m talking about non-sensitive data e.g.
NOT sensitive medical data
NOT bio-weapons research data
etc. etc….
Challenge
Adapted original source: The University of California, Santa Cruz, Data Management LibGuide, Research Data Management Lifecycle, diagram, viewed 5th May 2018 http://guides.library.ucsc.edu/datamanagement
From linear
process to
research data
lifecycle!
What is open data?
Open Data: open means anyone can freely access, use, modify, and share for any purpose.
Data sharing: restricted access for a limited number of people under certain conditions.
Where did
we come from?
Another side note….
Summarizing the state of Open Data is hard
Data sharing (upon request)
e.g. “The full profile listings are on floppy disks
which are available upon request”*
* Fernholz et al. (1989) A survey of measurements and measuring techniques in rapidly distorted compressible turbulent boundary layers.
Data sharing in databanks
Data sharing in certain disciplines
Community agreements
The Bermuda Principles for sharing DNA sequence data
• Automatic release of sequence
assemblies larger than 1 kb
(preferably within 24 hours).
• Immediate publication of finished
annotated sequences.
• Aim to make the entire sequence
freely available in the public domain
Data online as supplementary material
Data by default and data papers
Data papers
• A searchable metadata document, describing a
particular dataset or a group of datasets, published
as a peer-reviewed article
• Primary purpose: to describe the data and how they were collected,
rather than to report hypotheses and conclusions.
Journal policy
• Journals (PLOS, Springer, Nature, BMC, BMJ, ...) increasingly ask for
associated data to be deposited, and funders (EC, FWO) require it
as well
WHY?
It’s 2018!
unfortunately….
Research integrity
“It was a mistake in a spreadsheet that could
have been easily overlooked: a few rows left out
of an equation to average the values in a
column. The spreadsheet was used to draw the
conclusion of an influential 2010 economics
paper: that public debt of more than 90% of GDP
slows down growth. This conclusion was later
cited by the International Monetary Fund and
the UK Treasury to justify programmes of
austerity that have arguably led to riots, poverty
and lost jobs.”
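To make the quoted spreadsheet mistake concrete, here is a minimal sketch with made-up numbers. These are not the figures from the 2010 paper, and the country labels and function name are hypothetical. It only shows how an average changes silently when a formula range stops a few rows early.

```python
from statistics import mean

# Hypothetical growth rates (%) for seven countries; NOT the real dataset.
growth_by_country = {
    "A": 2.0, "B": 3.0, "C": -1.0, "D": 4.0,
    "E": 2.5, "F": 3.5, "G": 0.0,
}


def average_growth(rows):
    """Average the growth values of the given rows."""
    return mean(list(rows.values()))


# Intended calculation: average over every row.
full_average = average_growth(growth_by_country)

# Spreadsheet-style slip: the formula range ends two rows too early,
# so countries F and G are left out without any warning.
truncated_rows = dict(list(growth_by_country.items())[:5])
truncated_average = average_growth(truncated_rows)

print(f"all rows:       {full_average:.2f}")
print(f"rows left out:  {truncated_average:.2f}")
```

Nobody can spot a slip like this from the published average alone. With the underlying data available, anyone can rerun the calculation and compare.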
Research integrity
1. e.g. Piwowar HA, Vision TJ (2013) Data reuse and the open data citation advantage. PeerJ 1:e175. https://doi.org/10.7717/peerj.175; Piwowar HA, Day RS, Fridsma DB (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308
Prevents data loss
Maximize usefulness
Write a data paper
Credit & longer shelf life (1)
Increases transparency
Promote integrity
Influence society
DATA MANAGEMENT AND OPEN DATA
HOW?
What about our project page?
Sustainable?
Services?
Legal aspects?
Technical standards?
Metadata standards?
Findable?
Many options for sharing data
Where to deposit data?
• Disciplinary/Institutional data repository
Best practice: Research data repository
• Zenodo: a cost-free data repository
• Matches data needs
• Directory of data repositories:
www.Re3data.org
FAIR data principles
• How to discover your data?
• How to understand your data?
• Where to find your data?
• Can people access your data?
• Metadata
• Persistent identifier
• Naming convention
• Keywords
• Versioning
• Software, documentation
• Data repository
• Open Standards
• Vocabulary
• Methodologies
• Licensing
FAIR principles: Findable, Accessible, Interoperable, Reusable
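The checklist above can be turned into a simple self-check before depositing a dataset. The sketch below is only an illustration: the field names, the example record and the placeholder identifier are assumptions, not an official FAIR metadata schema.

```python
# Hypothetical checklist derived from the slide's bullets; the field names
# are illustrative and do not follow any official metadata standard.
CHECKLIST = (
    "persistent_identifier",
    "keywords",
    "version",
    "repository",
    "file_format",
    "license",
)


def missing_items(record):
    """Return the checklist items that are absent or empty in the record."""
    return [item for item in CHECKLIST if not record.get(item)]


# Made-up example record; the identifier is a placeholder, not a real DOI.
example_record = {
    "title": "Example survey dataset",
    "persistent_identifier": "doi:10.0000/placeholder",
    "keywords": ["open data", "example"],
    "version": "1.0.0",
    "repository": "Zenodo",
    "file_format": "CSV",
    "license": "",  # no licence chosen yet
}

print(missing_items(example_record))
```

For this record only the licence is flagged, because every other item on the checklist is filled in.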
FAIR is best practice
• Can be time-consuming, especially when not
incorporated in the research process.
Recommended:
• (Open) licenses for data, e.g. Creative Commons, can help you
greatly
• Commonly used standards,
open file formats and metadata
Aim for the (near?) future
• 1 star: It’s somewhere in some form
• 2 stars: It’s somewhere in a structured form
• 3 stars: It’s somewhere in an open format
• 4 stars: And you can POINT at it!
• 5 stars: It can even TALK (to other data)
5-star deployment scheme for Open Data: 5stardata.info
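The scheme is cumulative: each star only counts if all the earlier ones are met. A rough sketch of that logic follows; the function and parameter names are hypothetical and simply paraphrase the five steps of the scheme.

```python
def star_rating(on_web_open_licence, structured, open_format,
                has_uri, linked_to_other_data):
    """Count stars in order, stopping at the first criterion that is not met."""
    criteria = [
        on_web_open_licence,   # 1 star: it's somewhere in some form
        structured,            # 2 stars: it's in a structured form
        open_format,           # 3 stars: it's in an open format
        has_uri,               # 4 stars: you can point at it
        linked_to_other_data,  # 5 stars: it talks to other data
    ]
    stars = 0
    for met in criteria:
        if not met:
            break
        stars += 1
    return stars


# A CSV file published under an open licence, but without its own URI.
print(star_rating(True, True, True, False, False))
```

Because a missing step stops the count, a scanned table with a persistent link still gets only one star: it is not in a structured form.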
Hope for the (near) future?
• Research institutions will significantly improve research data
management training for ALL staff & students, old and new alike
• Research funding bodies will tighten up their rules to ensure
immediate post-publication data sharing. No embargoes, no
bullshit.
• If no published data comes from your funded research, it will negatively
affect your future chances of funding
• Good journals will strictly enforce mandatory data sharing.
Journals that don't will get a bad reputation for irreproducible
research
Alternative…
Imagine a world where no-one shared
their data (post-publication)
How would we know what was truth & what was lies / fraud / error?
Imagine the waste of time & resources
if everyone had to re-generate data de novo every time
How would we make progress?
We would be in the dark….
Thank you!
Emilie.herlans@ugent.be
Questions?

Introduction to open-data


Editor's Notes

  • #4 Idea → experiment → data analysis and writing the paper → finally time for some pizza while the paper gets reviewed → paper published: yay, and all your hard work disappears
  • #5 FROM DATA IN A SCIENTIFIC PIPELINE TO RESEARCH DATA LIFECYCLE Managing data in a research project is a process that runs throughout the project. Good data management is one of the foundations for reproducible research. Good management is essential to ensure that data can be preserved and remain accessible in the long-term, so it can be re-used and understood by future researchers. Begin thinking about how you’ll manage your data before you start collecting it.
  • #6 Open data is data that is free to access, reuse, repurpose, and redistribute. The Open Research Data Pilot aims to make the research data generated by selected Horizon 2020 projects accessible with as few restrictions as possible, while at the same time protecting sensitive data from inappropriate access. Data sharing: restricted data is made available to restricted organisations or individuals. Access to this data is usually restricted because it is sensitive in some way, either because it is personal or because its general release might cause security problems.
  • #9 Expiration date of media and data
  • #10 GenBank is a sequence database released in 1982, one of the earliest bioinformatics community projects on the Internet.
  • #11 The Bermuda Principles set out rules for the rapid and public release of DNA sequence data. The Human Genome Project, a multinational effort to sequence the human genome, generated vast quantities of data about the genetic make-up of humans and other organisms. But, in some respects, even more remarkable than the impressive quantity of data generated by the Human Genome Project is the speed at which that data has been released to the public. At a 1996 summit in Bermuda, leaders of the scientific community agreed on a groundbreaking set of principles requiring that all DNA sequence data be released in publicly accessible databases within twenty-four hours after generation. These “Bermuda Principles” (also known as the "Bermuda Accord") contravened the typical practice in the sciences of making experimental data available only after publication. These principles represent a significant achievement of private ordering in shaping the practices of an entire industry and have established rapid pre-publication data release as the norm in genomics and other fields. The three principles retained originally were: (1) automatic release of sequence assemblies larger than 1 kb (preferably within 24 hours); (2) immediate publication of finished annotated sequences; (3) aim to make the entire sequence freely available in the public domain for both research and development in order to maximise benefits to society.
  • #18 Innovation and progress. Example: a collaborative effort to find the biological markers that show the progression of Alzheimer’s disease in the human brain. “But we all realized that we would never get biomarkers unless all of us parked our egos and intellectual-property noses outside the door and agreed that all of our data would be public immediately.” At first, the collaboration struck many scientists as worrisome: they would be giving up ownership of data, and anyone could use it, publish papers, maybe even misinterpret it and publish information that was wrong.
  • #19 1. Prevents data loss: 80% of data is lost after 10 years. Data is fragile, and reproducibility is very difficult without data. 2. Maximize usefulness and build much more efficiently on previous work: organize, make understandable and reusable, and avoid duplication. Preserve data for further research by organizing it. Stop drowning in irrelevant stuff. Reproducibility crisis. 3. Fosters creativity, interdisciplinary use of data and meta-analysis. 4. Public participation in scientific research. 5. Promote integrity and increase transparency: managing data is part of good research; avoid accusations of sloppy science. 6. Data tend to have a (much!) longer shelf life than interpretation. After accounting for other factors affecting citation rate, we find a robust citation benefit from open data. (1)
  • #24 Interoperability: how can my data be combined with other datasets and used in other fields? Licensing: who can access my data, and for what purpose can it be used?
  • #26 3 stars: you can manipulate the data in any way you like. 4 stars: link to it, bookmark it, reuse parts of the data, combine it with other data. 5 stars: discover more related data.