This document summarizes a presentation about publishing research data with Scientific Data. It discusses the benefits of sharing research data, including generating more analyses and reuse. It outlines Scientific Data's process for publishing Data Descriptors, which include both human-readable articles and machine-readable metadata. Data Descriptors can be published at any point in the research process. The presentation notes that Data Descriptors provide credit for data generators, enable discovery and reuse of data, and have resulted in data being cited and reused in different fields and by the public.
1. Varsha Khodiyar, PhD
Data Curation Editor, Scientific Data
Nature Publishing Group
@varsha_khodiyar
@scientificdata
Tweet with #SDJPN16
Gaining credit for sharing research data
Data publishing with Scientific Data
RIKEN Center for Life Science Technologies 4th March 2016
2. My background
• Joined Scientific Data in October 2014
• Professional data curator since 2003
• PhD in Molecular Biology from the University of
Leicester
• Contributed to the Human Genome Project as
member of the Human Gene Nomenclature
Committee (HGNC)
• Gene Ontology curator for 8 years, at University
College London, UK
• 3 years of open data publishing experience
4. Generating research data is expensive
Just 18.1% of NIH grant applications were funded in 2014*
• Hours spent writing grants?
• Hours spent reviewing grants?
Resources are finite/expensive
• Modified animals
• Specialized reagents
Time and effort taken in the laboratory to generate
good, valid data
* report.nih.gov/success_rates/Success_ByIC.cfm
5. Irreproducibility of published science
Figure 1 - Ioannidis JPA. et al. Repeatability of published microarray gene
expression analyses. Nature Genetics 41, 149–55 (2009) doi:10.1038/ng.295
6. Withholding data impacts on human health
Clinical study reports, detailed data and software code available at Dryad
Digital Repository doi:10.5061/dryad.bv8j6 and www.Study329.org
7. Sharing data promotes
• Diversity of analyses and opinion
• New research
  • testing of new hypotheses
  • new analysis methods
  • meta-analyses to create new datasets
  • studies on data collection methods
• Education of new researchers
• Increased return on investment in research
Vickers AJ: Whose data set is it anyway? Sharing raw data from randomized trials. Trials 2006, 7:15
Hrynaszkiewicz I, Altman DG: Towards agreement on best practice for publishing raw clinical trial data. Trials 2009, 10:17
8. Researchers already share data
• Most researchers are sharing
data, and using the data of
others
• Direct contact between
researchers (on request) is a
common way of sharing data
• Repositories are second most
common method of sharing
Kratz and Strasser (2015) doi: 10.1371/journal.pone.0117619
9. Some problems…
• Sharing upon request relies heavily on trust
• Informally stored data associated with published works disappears at a
rate of ~17% per year (Vines et al. 2014; doi: 10.1016/j.cub.2013.11.014)
• Datasets not referenced in a manuscript are essentially invisible (a.k.a
“Dark data”)
• If data are available, they are often not interpretable or reusable
because sufficient detail is not included
• Data producers do not get appropriate credit for their work
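As a rough illustration of the ~17% per year figure above, the sketch below assumes that the loss compounds at a constant annual rate, so that a fraction of 0.83 of informally stored datasets survives each year. The constant-rate model and the function names are assumptions made for illustration; they are not taken from Vines et al.

```python
import math

# Approximate yearly loss rate quoted on the slide (assumed constant here).
ANNUAL_LOSS = 0.17


def fraction_remaining(years, annual_loss=ANNUAL_LOSS):
    """Fraction of datasets still obtainable after `years`,
    assuming the same proportional loss every year."""
    if years < 0:
        raise ValueError("years must be non-negative")
    return (1.0 - annual_loss) ** years


def years_until_fraction(target, annual_loss=ANNUAL_LOSS):
    """Years until only `target` (between 0 and 1) of the datasets remain,
    under the same constant-rate assumption."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    return math.log(target) / math.log(1.0 - annual_loss)


if __name__ == "__main__":
    for years in (1, 5, 10, 20):
        print(f"after {years:2d} years: {fraction_remaining(years):.1%} remaining")
    print(f"half lost after about {years_until_fraction(0.5):.1f} years")
```

Under this assumption, roughly half of such datasets would be unobtainable after a little under four years.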
11. Credit – Scholarly credit for publishing data; all publications are indexed
and citeable.
Reuse – Standardized and detailed descriptions enable easier reuse of
published research data.
Quality – Rigorous peer-review on technical quality and reusability.
Editorial Board of experts in their field maintain community standards.
Discovery – Curated, machine-readable metadata for dataset discovery.
Validated links to published data in each article.
Open – Use of CC-BY licence for articles and CC0 for metadata. Promote
use of open licences for published data.
Service – Commitment to excellent service for authors and readers.
13. Data Descriptors have human and machine readable components
Human readable representation of study, i.e. article (HTML & PDF)
Machine readable representation of study, i.e. metadata
14. Comparison of Data Descriptor to traditional article
Synthesis
Analysis
Conclusions
What did I do to generate the data?
How was the data processed?
Where is the data?
Who did what and when?
Methods and technical analyses supporting the quality of the measurements.
Data Descriptors do not contain tests of new scientific hypotheses
15. What types of data can be published?
• Decades old dataset
• Standalone dataset
• Data that has been used in an analysis article
• Large consortium dataset
• Data from a single experiment
• Data that the researcher finds valuable and that others might find useful too
• Data associated with a high impact analysis article
16. When can a Data Descriptor be published?
• After data analysis has been published
• Before the analysis has been published
• Publication alongside analysis article
• Authors not intending to analyse data
Data Descriptors can be submitted and published at any point in the research workflow, i.e. whenever it makes most sense for your data
19. Scientific Data’s Repository List
Browse our recommended data repositories online.
• We currently list almost 80 repositories, across biological, medical,
physical and social sciences
• When required, we provide guidance to authors on the best place to
store their data
www.nature.com/sdata/data-policies/repositories
21. Metadata at Scientific Data
• We want to capture metadata about the dataset being described in each Data Descriptor
• The manuscript captures human readable metadata needed for data reuse
• The curated metadata records capture machine readable metadata needed for machine based data discovery
22. ISA-Tab format for machine readable metadata
• Study workflow
• Key sample characteristics
needed for data discovery
• Relates samples to data files
• Shows location of dataset
• Uses controlled vocabularies
and ontologies (where
possible)
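The way such a table relates samples to data files can be sketched with a small tab-delimited example. The column headings follow ISA-Tab conventions (`Sample Name`, `Characteristics[...]`, `Raw Data File`), but the single flattened table, the sample values and the file names are hypothetical; a real ISA-Tab archive spreads this information across separate investigation, study and assay files.

```python
import csv
import io
from collections import defaultdict

# Hypothetical, simplified ISA-Tab-like table (not a complete ISA-Tab file).
ISA_LIKE_TABLE = (
    "Sample Name\tCharacteristics[organism]\t"
    "Characteristics[organism part]\tRaw Data File\n"
    "sample_1\tHomo sapiens\tliver\tsample_1_run1.fastq.gz\n"
    "sample_1\tHomo sapiens\tliver\tsample_1_run2.fastq.gz\n"
    "sample_2\tHomo sapiens\tkidney\tsample_2_run1.fastq.gz\n"
    "sample_3\tMus musculus\tliver\tsample_3_run1.fastq.gz\n"
)


def read_rows(text):
    """Parse the tab-delimited text into one dict per row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def files_by_sample(rows):
    """Map each sample name to the data files derived from it."""
    mapping = defaultdict(list)
    for row in rows:
        mapping[row["Sample Name"]].append(row["Raw Data File"])
    return dict(mapping)


def samples_matching(rows, characteristic, value):
    """Find samples by a characteristic, e.g. ('organism part', 'liver')."""
    column = f"Characteristics[{characteristic}]"
    return sorted({row["Sample Name"] for row in rows if row.get(column) == value})


if __name__ == "__main__":
    rows = read_rows(ISA_LIKE_TABLE)
    print(files_by_sample(rows))
    print(samples_matching(rows, "organism part", "liver"))
```

Because every row links a sample and its characteristics to a file, a machine can answer "which files come from liver samples?" without reading the article text.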
23. Use of community endorsed ontologies and controlled
vocabularies
Controlled vocabulary = list of standardized phrases of scientific concepts
Ontology = controlled vocabulary with defined relationships between terms
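The distinction can be sketched in a few lines of Python. A controlled vocabulary is only a fixed set of allowed terms, whereas an ontology also records how terms relate, which lets a search for a broad term find data annotated with a narrower one. The terms and `is_a` links below are hypothetical toy examples, not taken from any published ontology.

```python
# Controlled vocabulary: a fixed list of allowed terms (toy example).
CONTROLLED_VOCABULARY = {"anatomical entity", "organ", "liver", "kidney"}

# Ontology: the same terms plus defined relationships between them.
# Each entry reads "<term> is_a <parent term>" (hypothetical links).
IS_A = {
    "liver": "organ",
    "kidney": "organ",
    "organ": "anatomical entity",
}


def is_valid_term(term):
    """A vocabulary can only say whether a term is allowed."""
    return term in CONTROLLED_VOCABULARY


def ancestors(term):
    """Follow is_a links upward, nearest parent first."""
    result = []
    while term in IS_A:
        term = IS_A[term]
        result.append(term)
    return result


def matches_query(annotation, query):
    """True if data annotated with `annotation` should be returned
    for a search on `query` (the same term or a broader one)."""
    return annotation == query or query in ancestors(annotation)


if __name__ == "__main__":
    print(is_valid_term("liver"))            # vocabulary check
    print(ancestors("liver"))                # relationships from the ontology
    print(matches_query("kidney", "organ"))  # broader search finds narrower term
```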
24. Structured Summary table from curated metadata
Investigation file
Study file
Sample characteristics reported in Structured Summary table:
Organism
Organism part
Cell line
Geographical location
Environment type
28. Citing my own data
1. In the article text
2. In the Data Citation section
29. Citing data I’ve reused
1. In the article text
2. In the References section
30. Clinical researchers support sharing, but…
Rathi V, Dzara K, Gross CP, Hrynaszkiewicz I, Joffe S, Krumholz HM, Strait KM, Ross JS:
Sharing of clinical trial data among trialists: a cross sectional survey. BMJ 2012;345:e7570
• Sharing de-identified data via repositories should be
required (236 respondents, 74%)
• Investigators should share de-identified data on request
(229 respondents, 72%)
31. …clinical data producers have specific concerns
Rathi V, Dzara K, Gross CP, Hrynaszkiewicz I, Joffe S, Krumholz HM, Strait KM, Ross JS: Sharing of
clinical trial data among trialists: a cross sectional survey. BMJ 2012;345:e7570
32. Example initiatives for sharing clinical data
Yale Open Data Access (YODA) & Clinical Study Data
Request (CSDR) projects:
• Data Use Agreements (DUAs)
• Controlled access environment
• Scientific validity of reanalysis checked
• Independent governance
• Data anonymisation checks
http://yoda.yale.edu/
https://www.clinicalstudydatarequest.com/
33. Clinical data publication at Scientific Data
• Identify repositories able to archive clinical data
• Work with identified repositories to establish workflows for
peer review and publication, whilst maintaining patient
privacy
• Facilitate specialist peer review process for clinical data, for
example ensure peer reviewers have agreed to terms of data
use agreement
Hrynaszkiewicz, I., Khodiyar, V., Hufton, A. & Sansone, S. A. Publishing descriptions of non-
public clinical datasets: guidance for researchers, repositories, editors and funding
organisations. BioRxiv http://dx.doi.org/10.1101/021667 (2015).
39. Data reuse by other researchers in the same field
“The Data Descriptor made it easier
to use the data, for me it was critical
that everything was there…all the
technical details like voxel size.”
Professor Daniele Marinazzo
42. Data reuse by the non-research community
http://www.nytimes.com/interactive/2014/12/30/science/history-of-ebola-in-24-outbreaks.html
43. Data Descriptors…
• …enable you to gain scholarly credit for your data gathering
efforts.
• …are human AND machine readable.
• …can be published with, or independently of, an analysis article.
• …can be published at any point in the research workflow.
• …allow the publication and discovery of clinical data, whilst
maintaining your patients' privacy.
• …result in greater reuse and citation by fellow members of your
research community.
• …extend the impact of your research data by enabling access to
and reuse by the non-research community.