Managing and sharing data 
Sarah Jones 
DCC, University of Glasgow 
sarah.jones@glasgow.ac.uk 
Twitter: @sjDCC 
ERC Workshop on Research Data Management and Sharing 
18-19 September 2014, Brussels
Funded by:
European Research Council policy 
Commitment to open science from the start: 
"it is the firm intention of the ERC Scientific Council to issue 
specific guidelines for the mandatory deposit in open access 
repositories of research results – that is, publications, data 
and primary materials – obtained thanks to ERC grants, as 
soon as pertinent repositories become operational." 
Statement on Open Access, December 2006 
Image CC BY-SA 3.0 by Greg Emmerich 
www.flickr.com/photos/gemmerich/6365692655
Why make data available?
Sharing leads to breakthroughs 
www.nytimes.com/2010/08/13/health/research/13alzheimer.html?pagewanted=all&_r=0
“It was unbelievable. It’s not science
the way most of us have practiced in 
our careers. But we all realised that 
we would never get biomarkers 
unless all of us parked our egos and 
intellectual property noses outside 
the door and agreed that all of our 
data would be public immediately.” 
Dr John Trojanowski, University of Pennsylvania 
... increases the speed of discovery
Returns for institutions 
“If an institution spent A$10 million on data, 
what would be the return? The answer is: more 
publications; an increased citation count; more 
grants; greater profile; and more collaboration.” 
Dr Ross Wilkinson, ANDS 
www.ariadne.ac.uk/issue72/oar-2013-rpt
Researchers get a citation boost 
“Publicly available data was significantly 
(p = 0.006) associated with a 69% increase in 
citations, independently of journal impact 
factor, date of publication, and author 
country of origin using linear regression.” 
Piwowar, H., Day, R. and Fridsma, D. (2007) Sharing detailed research data
is associated with increased citation rate. DOI: 10.1371/journal.pone.0000308
But, there are also barriers... 
Who owns the data? 
• Researchers? 
• University? 
• Commercial partners? 
• Funders? 
• … 
People are often misinformed about 
who owns the data. It is particularly 
hard to determine in international 
projects or ones with industry. 
Restrictions on sharing 
• Patentable data 
• Commercial sensitivities 
• Personal, identifiable data 
• Lack of consent 
• … 
There are legitimate reasons to agree 
embargo periods, impose conditions, 
or to share only some of the data. 
However, these are often given as 
reasons not to share data at all. 
www.dcc.ac.uk/sites/default/files/documents/events/workshops/IHW-2013/UKDA-barriers-to-data-sharing.pdf
And opportunity costs 
By Emilio Bruna 
http://brunalab.org/blog/2014/09/04/the-opportunity-cost-of-my-openscience-was-35-hours-690
For his most recent paper: 
1. Double checking the main dataset and 
reformatting to submit to Dryad: 5 hours 
2. Creating complementary file and preparing 
metadata: 3 hours 
3. Submission of these two files and the 
metadata to Dryad: 45 minutes 
4. Preparing a map of the locations: 1 hour 
5. Submission of map to Figshare: 15 minutes 
6. Cleaning up and documenting the code, 
uploading it to GitHub: 25 hours 
7. Cost of archiving in Dryad: US$90 
8. Page Charges: $600
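The items above can be totalled to check the headline figures. This is a minimal sketch in Python; the variable names are illustrative, and it assumes both charges are in US dollars (the source gives only "$" for the page charges).

```python
# Time spent on each task from the list above, in minutes.
task_minutes = {
    "check and reformat main dataset for Dryad": 5 * 60,
    "create complementary file and metadata": 3 * 60,
    "submit files and metadata to Dryad": 45,
    "prepare map of locations": 60,
    "submit map to Figshare": 15,
    "clean, document and upload code to GitHub": 25 * 60,
}

# Direct charges. Assumption: both amounts are in US dollars.
direct_costs_usd = {
    "Dryad archiving": 90,
    "page charges": 600,
}

total_hours = sum(task_minutes.values()) / 60
total_cost_usd = sum(direct_costs_usd.values())

print(f"{total_hours:g} hours, ${total_cost_usd}")
```

The totals come to 35 hours and $690, which matches the figures in the blog post's URL. Cleaning up and documenting the code accounts for 25 of the 35 hours.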
What needs to change? 
Conclusions from Emilio Bruna: 
• Develop a better system of incentives from the 
community for archiving data and code 
• Teach our students how to do this NOW - it’s much easier 
if you develop good habits early 
• Minimise the actual and opportunity costs 
We need to stop telling people “You should” and get 
better at telling people “Here’s how”
What is involved in data curation 
• Data Management Planning 
• Data creation 
• Annotating / documenting data 
• Analysis, use, versioning 
• Storage and backup 
• Publishing papers and data 
• Preparing for deposit 
• Archiving and sharing 
• Licensing 
• Citing… 
Plan 
Create 
Document 
Use 
Share 
Publish
Data Management Plans 
Brief plans to determine how data will be created, managed and 
shared. DMPs usually cover: 
1. Description of data to be collected / created 
2. Standards and methodologies for data collection & management 
3. Any issues or restrictions due to ethics and Intellectual Property 
4. Plans for data sharing and access 
5. Strategy for long-term preservation 
DMPs are often submitted as part of grant applications, but are 
useful whenever you’re creating data.
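One way to picture the five elements above is as a simple checklist that flags which parts of a draft plan are still empty. The sketch below is illustrative only: the class, field names and sample answers are hypothetical and do not come from any funder template or from DMPonline.

```python
from dataclasses import dataclass, fields


@dataclass
class DataManagementPlan:
    """Hypothetical container for the five elements a DMP usually covers."""
    data_description: str = ""        # 1. data to be collected / created
    standards_and_methods: str = ""   # 2. standards and methodologies
    ethics_and_ip: str = ""           # 3. ethics and Intellectual Property issues
    sharing_and_access: str = ""      # 4. plans for data sharing and access
    long_term_preservation: str = ""  # 5. strategy for long-term preservation


def missing_sections(plan: DataManagementPlan) -> list[str]:
    """Return the names of sections that have not been filled in yet."""
    return [f.name for f in fields(plan) if not getattr(plan, f.name).strip()]


# Invented sample answers, for illustration only.
draft = DataManagementPlan(
    data_description="Survey responses stored as CSV files",
    sharing_and_access="Deposit in a data repository after an embargo period",
)

print(missing_sections(draft))
```

For this draft the function reports the standards, ethics/IP and preservation sections as still to be written.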
Help with DMPs 
A web-based tool to help researchers 
write data management plans 
https://dmponline.dcc.ac.uk 
Framework for creating a DMP 
A list of common elements explaining why they 
are important and giving example answers 
www.icpsr.umich.edu/icpsrweb/content/datamanagement/dmp/framework.html
www.dcc.ac.uk/sites/default/files/documents/resource/DMP_Checklist_2013.pdf
Example plans
www.dcc.ac.uk/resources/data-management-plans/guidance-examples
Managing and sharing data: 
a best practice guide 
http://data-archive.ac.uk/media/2894/managingsharing.pdf
Training materials
FOSTER project
• Open science training
• Courses across EU
• Portal to OA materials
• Guidance on Horizon 2020
www.fosteropenscience.eu
MANTRA
• Free online training course
• Aimed at PhD students
• Case studies, quizzes etc
• Data handling tutorials
– R
– SPSS
– ArcGIS
– NVivo
http://datalib.edina.ac.uk/mantra
DCC tools catalogue 
A catalogue of RDM tools for different audiences. 
Tools for researchers focus on data handling, managing 
workflows, citation and impact. 
www.dcc.ac.uk/resources/external/tools-services
Tools to help with RDM activities
• Citation & impact
• Documentation & metadata
• Workflow management
• Storage & collaboration
Tools shown:
impactstory.org
owncloud.org
www.datacite.org
thedata.org
www.taverna.org.uk
www.myexperiment.org
www.labtrove.org
dataup.cdlib.org
Metadata standards catalogue 
Use standards wherever possible for interoperability 
www.dcc.ac.uk/resources/metadata-standards
Data repositories 
http://databib.org 
http://service.re3data.org/search
1. How do you foster open science? 
• Make it feasible to comply 
– provide tools and infrastructure 
• Train people early in their careers 
• Incentivise openness 
• Listen to researchers and learn from their 
experience about what doesn’t work 
• Follow up on any demands made in policies
2. Who is responsible for providing infrastructure and support?
• Discipline
• Funders
• Institution
• Third-party services
• National provider
Data centres e.g. via NERC
Institutional support for discipline-specific tools e.g. Monash MeRC partnership on tools like OMERO
National brokerage of deals with third-party providers e.g. Jisc Janet deals with Arkivum
And what about co-ordination?
3. Who should pay? 
Funding Research Data Management 
“A conversation with the funders”
The DCC held a special 
event on this topic in 
the UK, but there’s still a 
long way to go 
www.dcc.ac.uk/events/research-data-management-forum-rdmf/rdmf-special-event-funding-research-data-management
Thanks – any questions? 
DCC guidance, tools and case studies: 
www.dcc.ac.uk/resources 
Follow us on twitter: 
@digitalcuration and #ukdcc

Editor's Notes

  • #3 Quite forward-thinking for such an early OA policy to be framed in terms of data and primary materials too, not just publications.
  • #6 He was making a comparison with the Hubble telescope, on which A$1.5 billion is spent each year. The cost of the Hubble archive (A$1 million per annum) is just a fraction of this, but given the OA mandate, they’ve seen the research publications produced by Hubble discoveries double.
  • #7 Many studies in this area since then have shown a demonstrable citation boost, though not as high as 69%. This figure was for microarray data from cancer trials, and it seems that the early datasets had a particularly strong impact and came from authors who were well-cited. A more realistic figure across the board is probably a 10-30% increase.
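To make the gap between these figures concrete, the sketch below applies each percentage to a hypothetical paper with 100 baseline citations. The baseline and the function name are assumptions chosen for illustration and do not come from any of the studies.

```python
# Hypothetical baseline, not a figure from any study.
baseline_citations = 100


def with_boost(citations: int, percent: int) -> int:
    """Return the citation count after a percentage increase (integer arithmetic)."""
    return citations * (100 + percent) // 100


scenarios = [
    ("Piwowar et al. (2007), microarray data", 69),
    ("across the board, lower estimate", 10),
    ("across the board, upper estimate", 30),
]

for label, percent in scenarios:
    print(f"{label}: {with_boost(baseline_citations, percent)} citations")
```

On this baseline the 2007 figure implies 169 citations, against 110 to 130 for the more cautious range.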