Research Data, or: How I Learned to Stop Worrying and Love the Policy

  1. Research Data, or: How I Learned to Stop Worrying and Love the Policy RDMF14: Research Data (and) Systems York, 9th November 2015 Dr Torsten Reimer Scholarly Communications Officer Imperial College London / @torstenreimer
  2. Why are we here?
  3. Why we fight – Compliance! Really? “Well compliance is really important, yes that's the whole reason we are doing it really. I mean to comply with Research Council guidelines yes. I am not saying the whole reason but that's the main driver, yes.” 10.1371/journal.pone.0114734 There are issues with RCUK/EPSRC policy: • cost-benefit analysis, anyone? • expensive/issues around funding • enough support/incentive for culture change? • fine in theory, but is it workable in practice? But…
  4. Blame funders, or blame ourselves (hedgehog and hare)? It seems wherever we go, the funders have already been there: HEFCE open access policy; EPSRC data policy… Are the funders too fast? Or are we too slow? Imagine the sector had agreed on best practice years ago – and implemented it in a sensible way!
  5. So, why are we here again? No really, why?
  6. Data Science hub and KPMG Data Observatory
  7. Data Science hub and KPMG Data Observatory launch (04 Nov) "At a research intensive university like Imperial it is hard to do anything that doesn't involve data." James Stirling, Provost "Data is at the heart of the human condition." Joanna Shields, UK Minister for Internet Safety and Security Considering these statements you’d think everyone, especially Imperial, would have RDM all sorted, wouldn’t you?
  8. … and yet we are losing research data “In their parents' attic, in boxes in the garage, or stored on now-defunct floppy disks — these are just some of the inaccessible places in which scientists have admitted to keeping their old research data.”
  9. Isn’t research meant to be reproducible? The results of only 6 out of 53 ‘landmark’ studies were found reproducible. Drug development: Raise standards for preclinical cancer research. DOI: 10.1038/483531a “Several recent publications suggested that the seminal findings from academic laboratories could only be reproduced 11–50% of the time. The lack of data reproducibility likely contributes to the difficulty in rapidly developing new drugs and biomarkers that significantly impact the lives of patients with cancer and other diseases.” A Survey on Data Reproducibility in Cancer Research Provides Insights into Our Limited Ability to Translate Findings from the Laboratory to the Clinic. DOI: 10.1371/journal.pone.0063221
  10. Shouldn’t the public too be allowed to play with data?
  11. (This is in our own interest!)
  12. RDM systems landscape
  13. Case for a national infrastructure? Currently, ~100 UK institutions spend effort to define and implement an RDM infrastructure (storage, workflows, interfaces, metadata, compliance, monitoring, business model etc.). Some aspects have to be local, but… …imagine a national research data infrastructure (say for data publishing and preservation), run by RCUK: • Economies of scale • No issues with funding • Just one system to interface with • Increased visibility/discoverability • Solution would by default be compliant • No commercial “ownership” of public data
  14. However, past experience suggests…
  15. However, past experience suggests…
  16. One RDM system to rule them all? • Is community track record actually better than funders’? • Jisc offers components, but have we found the right model for collaboration (supplier? leader? partner?)? • Commercial solutions exist – trust? Should they define our infrastructure? => Funders set policy; 3rd parties infrastructure – we’ve been too slow again! => However, is one system actually suitable (redundancy, competition, disciplines etc.)? Until the one solution emerges (if ever), we should: • consider defining minimum requirements (metadata, identifiers, embargoes) for 3rd party solutions? • use a flexible approach that enables us to learn and change
  17. Imperial College London (From funder policy to) institutional strategy
  18. Imperial College London • Seven London campuses • Four Faculties: Engineering, Medicine, Natural Sciences and Business School • Ranked 3rd in Europe / 8th in the world (THE 2015-16 rankings) • Net income (2014): £855m, incl. £351m research grants and contracts • ~15,000 students, ~7,400 staff, incl. ~3,900 academic & research staff • Staff publish 10-12,000 scholarly articles per year • Largest data traffic into Janet network of all UK universities
  19. Process of policy development • 2014: Draft policy: “Statement of Strategic Aims” • Lack of reliable data (on data storage needs (scale) in particular) • Concerns about cost of maintaining infrastructure • Concerns about uncertainties and changing market / policy landscape • Decision: re-think approach – more cost-effective, based on better data • Approach: RDM Green Shoots and RDM Investigation • Funded by Vice-Provost (Research) • Green Shoots: 6 bottom-up, academic projects (2nd half of 2014) • RDM investigation (Oct 2014-Jan 2015) • Online survey (academics; 390 responses) • ~40 interviews (academics) • Workshops (academics & data managers)
  20. RDM Green Shoots • Haystack – a computational molecular data notebook (Dr Mike Bearpark, Chemistry) • Imperial College Healthcare Tissue Bank (Prof. Gerry Thomas, Surgery & Cancer) • Integrated Rule-based Data Management System for Genome Sequencing Data (Dr Michael Mueller, Medicine) • RDM in Computational and Experimental Molecular Sciences (Prof. Henry Rzepa, Chemistry) • RDM: Where software meets data (Dr Gerard Gorman & Dr Matthew Piggott, Earth Science & Engineering) • Time Series (Dr Nick Jones, Mathematics)
  21. Example project: Time Series Idea • Provide a platform and technology which automatically connects researchers through their time-series data, models and analysis methods Achievements • Online interdisciplinary collection of time-series data and time-series analysis code • Functionality to automatically profile time series • Functionality to automatically profile time series algorithms • Functionality to use these profiles to place a user’s work in the context of others RDM Benefits • Incentivises data sharing by allowing data comparison – increases discoverability of an academic’s data plus increases likelihood of finding other relevant data • Resource also available to general public
  22. Online survey – where does active data live? [Bar chart: use of different types of storage in %, axis 0–80. Categories: College computer; External/portable storage; Cloud storage; Personal computer; Departmental/group storage; College H drive; ICT central storage]
  23. Online survey – growth of data volume [Bar chart: research group data storage needs in %, axis 0–30, series “Now” and “In 2 years”. Categories: < 10 GB; 10 GB – 100 GB; 100 GB – 1 TB; 1 TB – 10 TB; 10 TB – 100 TB; 100 TB – 1 PB; > 1 PB]
  24. Findings (best practice) • RDM principles are considered to be sound but not fully practised • Sharing publicly-funded data accepted in principle but some question value and cost • Concerns about (metadata) effort to make shared data discoverable • Metadata schemas are not yet widely available across disciplines • Auto-generate metadata where possible • Consensus that RDM training for PhDs is vital (also to avoid data loss when they leave)
  25. Findings (data) • 60-100% of grant required to re-generate data used in publications • % of data that needs retaining to support publications: ~60% • Data storage capacity will have to grow significantly • Concerns around back-up and archiving, esp. considering data volume • Popularity of cloud services (as opposed to College storage) => Researchers want self-administered, secure, responsive solution for data sharing, storing and archiving; open APIs preferred (“Yes [storage] is really important. Basically, whenever we have been out to talk to researchers, that's the thing they have latched on to and want to talk about the most.” 10.1371/journal.pone.0114734)
  26. Conclusions / policy implementation principles • Provide platform-independent, flexible data storage • Embed RDM training into PhD progression • Where available, use existing workflows: • Symplectic Elements: metadata management • Spiral (DSpace): public (metadata) catalogue • Additional infrastructure: • use external resources • no long-term commitment • as flexible as possible • cost-effective
  27. Result: Imperial College RDM Policy “Imperial College London is committed to promoting the highest standards of academic research, including excellence in research data management. This includes a robust digital curation infrastructure that supports open data access and protects confidential data. The College acknowledges legal, ethical and commercial constraints on data sharing and the need to preserve the academic entitlement to publication.” “Principal Investigators have overall responsibility for the effective management of research data generated within or obtained for their research, including by their research groups. The Library and ICT will provide training, guidance and services to support PIs.”
  28. Building a flexible RDM infrastructure
  29. [Flowchart: RDM workflow. Nodes: Research project; Creates data/software; Data: Box; Software: GitHub; Project ends; Data/software still needed? (yes / no); Delete; Can it be published or embargoed externally? (yes / no); External repository; Internal storage; Metadata, manual or automatic; Elements; Can metadata be published? (yes); Library reviews; Spiral]
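One possible reading of the flowchart on slide 29 can be sketched as a small decision function. This is an illustration only: the order of the decisions is inferred from the node labels, and the class `ProjectOutput`, its fields and the function `route` are hypothetical names, not part of any Imperial system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProjectOutput:
    """A dataset or piece of software at the end of a project (hypothetical model)."""
    name: str
    still_needed: bool
    can_publish_externally: bool  # can be published or embargoed externally
    metadata_publishable: bool


def route(output: ProjectOutput) -> List[str]:
    """Return the steps for one output, following an assumed reading of the flowchart."""
    # Outputs that are no longer needed are deleted.
    if not output.still_needed:
        return ["delete"]

    steps = []
    # Publishable (or embargoable) outputs go to an external repository,
    # everything else stays in internal storage.
    if output.can_publish_externally:
        steps.append("external repository")
    else:
        steps.append("internal storage")

    # Metadata is recorded either way (manually or automatically).
    steps.append("metadata in Elements")

    # Only publishable metadata is reviewed by the Library and listed in Spiral.
    if output.metadata_publishable:
        steps.append("library review")
        steps.append("catalogue in Spiral")
    return steps
```

For example, `route(ProjectOutput("survey data", True, False, True))` keeps the data in internal storage but still records and publishes its metadata.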
  30. Summarising RDM in 6 steps 1. Make a data management plan: use DMPOnline 2. Store your data management plan centrally: use InfoEd 3. Store your live data securely and safely: use Box 4. Store your final data (and/or code) for 10+ years, making it publicly available: use Zenodo 5. Tell the College where your data (and/or code) is published or stored: use Symplectic 6. Reference your funding and your data in the publications it underpins: tell your publisher
  31. Box – Data storage, sharing and syncing Roll-out across College: • unlimited data storage • online access, easy sharing, data syncing • file viewers included • backup, data remains even when staff leave • machine learning tools to describe data • API
  32. Infrastructure summary • Flexible, can react to market / policy changes • Components can be exchanged, no additional in-house infrastructure • Make a start, collect data, learn – change as required • Preservation infrastructure needs further work (discussions with Arkivum about ‘framework’ for costing into grants) – how much do we need to retain beyond published data? • It isn’t perfect, but we can make a start
  33. “In, through … and beyond”
  34. RDM policy with research software requirements “3.6.7 Cost Effectiveness – where computer-generated data may be reliably recreated at a cost less than that of storing raw output data, then the inputs and human-readable outputs of the relevant programme may be stored instead along with a reference to or copy of the software version used.” “3.7 If software is developed as part of a research project, Principal Investigators must archive the particular version of the software used to generate or analyse the data in a repository and inform the Library of its location, taking account of the points raised in 3.5 above. Principal Investigators are encouraged to follow the Sustainability and Preservation Framework of the Software Sustainability Institute.”
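The cost-effectiveness rule in 3.6.7 is a simple comparison, which a short sketch may make concrete. The function name `what_to_store` and its parameters are hypothetical; the policy text does not say how costs are estimated, so both costs are assumed here to be given in the same currency and for the same retention period.

```python
from typing import List


def what_to_store(recreation_cost: float, storage_cost: float,
                  reliably_recreatable: bool) -> List[str]:
    """Apply the 3.6.7 rule as quoted above (illustrative sketch, not official tooling)."""
    # Raw output may be dropped only if it can be reliably recreated
    # at a cost lower than storing it.
    if reliably_recreatable and recreation_cost < storage_cost:
        return [
            "inputs",
            "human-readable outputs",
            "reference to or copy of the software version used",
        ]
    # Otherwise the raw output data itself is kept.
    return ["raw output data"]
```

A simulation that costs 200 to rerun but 5,000 to store would, under these assumptions, be kept as inputs, human-readable outputs and a software reference rather than as raw output.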
  35. Treat software as valuable research output PyRDM Green Shoots project Zenodo integrates with GitHub College survey on distributed version control Software Sustainability Institute – I am a fellow
  36. ORCID – Open Researcher and Contributor ID • Emerging global standard for identifying authors of academic outputs • The College created ORCID iDs for academic staff in late 2014 (now 2,088 of 3,200 iDs claimed, ~1,500 linked in Elements) • Imperial hosted launch of Jisc ORCID consortium with 50 UK universities in September 2015
  37. Towards automating RDM reporting with ORCID Author links ORCID with CRIS …shares ORCID iD with repository …publishes dataset DataCite DOI linked to ORCID iD CRIS pulls metadata from ORCID / DataCite / Repository But: is the external metadata likely to be complete “enough”?
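The reporting idea on slide 37 can be sketched with in-memory records: a CRIS matches dataset metadata to staff by ORCID iD and sets aside records that are not complete “enough”. Everything here is an assumption for illustration: the field names, the list of required fields and the function names are hypothetical, and no real ORCID, DataCite or CRIS API is called.

```python
from typing import Dict, List, Set

# Assumed minimum metadata for a dataset record to count as complete "enough".
REQUIRED_FIELDS = ("doi", "title", "creator_orcids", "publication_year")


def missing_fields(record: dict) -> List[str]:
    """List the required fields that are absent or empty in a record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


def link_datasets(staff_orcids: Set[str], records: List[dict]) -> Dict[str, object]:
    """Group dataset DOIs by staff ORCID iD and collect incomplete records."""
    linked: Dict[str, List[str]] = {}
    incomplete: List[dict] = []
    for record in records:
        # Incomplete records are reported rather than silently linked.
        if missing_fields(record):
            incomplete.append(record)
            continue
        for orcid in record["creator_orcids"]:
            # Only creators known to the institution are linked.
            if orcid in staff_orcids:
                linked.setdefault(orcid, []).append(record["doi"])
    return {"linked": linked, "incomplete": incomplete}
```

Keeping the incomplete records visible, instead of dropping them, is one way to measure how often external metadata falls short of what institutional reporting needs.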
  38. Useful infrastructure makes compliance a by-product • One workflow for data generation, publishing, reporting and curation • Link data generation directly to storage (log into facility, data “at your desk” before you are out of the “lab”) • (HSS colleagues – “facility” can also be a book scanner) • Automate reporting and generating / sharing of metadata Facilities write (meta)data into Box; Data processed / analysed from Box; Machine-learning adds metadata; Publish to repository from Box, with reference; Metadata directly or indirectly (ORCID) to CRIS
  39. Make data useful for us, not just for external re-use Now that we get data, shouldn’t we analyse it? Add value by: • connecting researchers who have similar data interests • connecting researchers to relevant data • presenting data in a way that’s suitable for public reuse • developing a data analytics and knowledge transfer service • collecting impact information on data
  40. • Let’s make a start and learn from doing, from actual data • Think about where we can coordinate (3rd party requirements) • It is early stages, take a flexible approach • Don’t wait for funders, interpret policies in a useful way and lead => If we lead instead of following there will be fewer unpleasant surprises to deal with! Research Data, or: How I Learned to Stop Worrying and Love the Policy
  41. Image Credit (note NC licence!) 1. _Group_Captain_Lionel_Mandrake.png public domain 2. _title.jpg public domain 3. public domain 4. C-3PO vs. Data (137/365), by JD Hancock, CC BY 2.0 5. public domain 6. OXO tools, by Didriks, CC BY 2.0 7. How I Learned To Stop Worrying..., by hjhipster, CC BY NC 2.0