Curation of Research Data


Supplementary presentation slides from a lecture on digital preservation given as part of the MSc in Library and Library Management, University of the West of England (UWE), Frenchay Campus, Bristol, March 10, 2010

Published in: Technology, Education

  1. 1. The curation of research data. Michael Day, Research and Development Team Leader, UKOLN, University of Bath. MSc Lecture, UWE, Bristol, 10 March 2010.
  2. 2. Presentation outline <ul><li>Why curate research data </li></ul><ul><li>Repositories and the curation of research data </li></ul><ul><ul><li>Roles and responsibilities </li></ul></ul><ul><ul><li>Infrastructures </li></ul></ul><ul><ul><li>Curation challenges </li></ul></ul>
  3. 3. Dealing with research data <ul><li>An extremely broad category of material: </li></ul><ul><ul><li>“... any information that can be stored in digital form, including text, numbers, images, video or movies, audio, software, algorithms, equations, animations, models, simulations, etc.” (National Science Board, Long-lived digital data collections, 2005) </li></ul></ul><ul><ul><li>In practice, it can mean almost anything </li></ul></ul>
  4. 4. Why curate research data? (1) <ul><li>Part of the normal research process: </li></ul><ul><ul><li>The need for others to validate and replicate research </li></ul></ul><ul><ul><li>In some disciplines, supporting data is routinely made available to reviewers and linked from journal papers </li></ul></ul><ul><ul><li>Principles of sharing and openness are firmly embedded in some disciplines </li></ul></ul>
  5. 5. Why curate research data? (2) <ul><li>Extrinsic and intrinsic value: </li></ul><ul><ul><li>High investment in research </li></ul></ul><ul><ul><li>Data can be very expensive to capture and analyse </li></ul></ul><ul><ul><li>Much data is impossible to recreate once lost </li></ul></ul><ul><ul><li>Observational data (by definition) is irreplaceable </li></ul></ul><ul><ul><li>Current generations of instruments can gather more data than can be analysed </li></ul></ul>
  6. 6. Why curate research data? (3) <ul><li>The potential for creating 'new' knowledge from existing data: </li></ul><ul><ul><li>Re-use, re-analysis, data mining </li></ul></ul><ul><ul><li>Annotation, e.g. in molecular biology and astronomy </li></ul></ul><ul><ul><li>Combining datasets in innovative ways, e.g. mapping biodiversity data onto ecological GIS </li></ul></ul><ul><ul><li>“Science 2.0” – Open Science </li></ul></ul>
  7. 7. Why curate research data? (4) <ul><li>It is increasingly a requirement of some research funding bodies </li></ul><ul><ul><li>Some have quite mature data retention policies (not necessarily for permanent retention) </li></ul></ul><ul><ul><li>Increasing expectation of access to data from publicly-funded research </li></ul></ul><ul><ul><li>OECD Principles and guidelines for access to research data from public funding (2007) </li></ul></ul>
  8. 8. Why curate research data? (5) <ul><li>Institutional asset management: </li></ul><ul><ul><li>Universities and other research organisations invest very large sums of money into research activities </li></ul></ul><ul><ul><li>Research data is a key output of this activity </li></ul></ul><ul><ul><li>It is, therefore, an institutional asset that needs stewardship </li></ul></ul>
  9. 9. Why curate research data? (6) <ul><li>Promoting the institution, research group or individual: </li></ul><ul><ul><li>Re-use helps promote visibility and 'impact' </li></ul></ul><ul><ul><li>Institutions become acknowledged 'centres of competence' </li></ul></ul>
  10. 10. Who undertakes preservation? <ul><li>Researchers </li></ul><ul><ul><li>Indirectly - they have the most direct contact with the creation stage, and understand how data can be used </li></ul></ul><ul><ul><li>Directly - sometimes responsible for maintaining community data collections </li></ul></ul><ul><li>Information professionals </li></ul><ul><ul><li>Sometimes, but it depends on the context </li></ul></ul><ul><li>IT professionals </li></ul><ul><ul><li>Primarily informaticians working with scientists </li></ul></ul>
  11. 11. Roles and responsibilities (1) <ul><li>Long-lived data collections (NSB) </li></ul><ul><ul><li>Data authors </li></ul></ul><ul><ul><li>Data managers </li></ul></ul><ul><ul><li>Data scientists </li></ul></ul><ul><ul><li>Data users </li></ul></ul><ul><ul><li>Funding agencies </li></ul></ul><ul><li>Dealing with data (JISC) </li></ul><ul><ul><li>Scientist </li></ul></ul><ul><ul><li>Institution </li></ul></ul><ul><ul><li>Data centre </li></ul></ul><ul><ul><li>User </li></ul></ul><ul><ul><li>Funder </li></ul></ul><ul><ul><li>Publisher </li></ul></ul>
  12. 12. Roles and responsibilities (2) <ul><li>Scientists </li></ul><ul><ul><li>Initial creation and use of data </li></ul></ul><ul><ul><li>Expectation of first use and of gaining appropriate credit and recognition </li></ul></ul><ul><ul><li>Responsible for: </li></ul></ul><ul><ul><ul><li>Managing data for the life of the project </li></ul></ul></ul><ul><ul><ul><li>Using standards (where possible) </li></ul></ul></ul><ul><ul><ul><li>Complying with data policies </li></ul></ul></ul><ul><ul><ul><li>Making the data available in a form that can (easily?) be used by others </li></ul></ul></ul>
  13. 13. Roles and responsibilities (3) <ul><li>Institutions: </li></ul><ul><ul><li>Role less clear </li></ul></ul><ul><ul><li>Institutional policies may require short-term management of data </li></ul></ul><ul><ul><ul><li>Advocacy and training </li></ul></ul></ul><ul><ul><li>Some institutions are developing repository services </li></ul></ul><ul><ul><ul><li>These are currently rarely used for research data </li></ul></ul></ul><ul><ul><ul><li>Federated approaches maintain disciplinary involvement </li></ul></ul></ul>
  14. 14. Roles and responsibilities (4) <ul><li>Data centres </li></ul><ul><ul><li>Undertake curation and provide access </li></ul></ul><ul><ul><li>Responsible for: </li></ul></ul><ul><ul><ul><li>Selection and ingest </li></ul></ul></ul><ul><ul><ul><li>Participating in the development of standards </li></ul></ul></ul><ul><ul><ul><li>Protecting the rights of data creators </li></ul></ul></ul><ul><ul><ul><li>Supporting ingest and metadata capture </li></ul></ul></ul><ul><ul><ul><li>Supporting re-use (tools and services) </li></ul></ul></ul><ul><ul><ul><li>Training </li></ul></ul></ul>
  15. 15. Roles and responsibilities (5) <ul><li>Users: </li></ul><ul><ul><li>Users of third-party data </li></ul></ul><ul><ul><li>Responsible for: </li></ul></ul><ul><ul><ul><li>Adhering to any licenses and restrictions on use </li></ul></ul></ul><ul><ul><ul><li>Acknowledging data creators and curators </li></ul></ul></ul><ul><ul><ul><li>Managing any derived data </li></ul></ul></ul><ul><ul><ul><li>Providing feedback to scientists and data centres </li></ul></ul></ul>
  16. 16. Roles and responsibilities (6) <ul><li>Funding bodies: </li></ul><ul><ul><li>Acting at policy level </li></ul></ul><ul><ul><li>Responsible for: </li></ul></ul><ul><ul><ul><li>Considering wider policy perspectives </li></ul></ul></ul><ul><ul><ul><li>Developing policies in co-operation with other stakeholders </li></ul></ul></ul><ul><ul><ul><li>Monitoring and enforcing data policies </li></ul></ul></ul><ul><ul><ul><li>Supporting long-term data management </li></ul></ul></ul><ul><ul><ul><li>Supporting data curation </li></ul></ul></ul>
  17. 17. Research data collections (1) <ul><li>A typology (1): </li></ul><ul><ul><li>From National Science Board report Long-lived digital data collections (2005) </li></ul></ul><ul><ul><ul><li>Research data collections – the products of one or more focused research projects </li></ul></ul></ul><ul><ul><ul><li>Resource or community data collections – collections that emerge to serve particular subject sub-disciplines </li></ul></ul></ul><ul><ul><ul><li>Reference data collections – serve a broader and more diverse set of user communities </li></ul></ul></ul>
  18. 18. Research data collections (2) <ul><li>Data in “research data collections” is most at risk </li></ul><ul><ul><li>A modern version of the “file-drawer problem” </li></ul></ul><ul><ul><li>Data stored on personal hard-drives or on media; largely undocumented </li></ul></ul><ul><ul><li>Particular challenge when the data creator has retired or moved to another institution </li></ul></ul><ul><ul><li>Data creators not always aware of its potential value </li></ul></ul><ul><ul><li>The reward structure of science is not always helpful </li></ul></ul>
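
One small, practical response to the undocumented data on personal hard drives described in slide 18 is a file manifest. The following is a minimal sketch only, not a method from the lecture: the function names `sha256_of` and `build_manifest` and the choice of SHA-256 are assumptions made for illustration. It records each file's relative path, size, and checksum so that a later custodian can at least see what exists and verify that it is unchanged.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root) -> list:
    """List every file under `root` with its relative path, size and checksum.

    Hypothetical helper: the record layout is an illustrative assumption,
    not a standard metadata schema.
    """
    root = Path(root)
    entries = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            entries.append(
                {
                    "path": path.relative_to(root).as_posix(),
                    "bytes": path.stat().st_size,
                    "sha256": sha256_of(path),
                }
            )
    return entries
```

A manifest like this does not replace descriptive metadata; it only captures the minimum needed to detect loss or corruption when data changes hands.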
  19. 19. Curation infrastructures (1) <ul><li>Focus on the generic: </li></ul><ul><ul><li>Need for a balance between: </li></ul></ul><ul><ul><ul><li>The 'bottom-up' discipline-based drivers that promote the generation of research data </li></ul></ul></ul><ul><ul><ul><li>The policy level, looking to make cost-effective investment in curation </li></ul></ul></ul><ul><ul><li>When building infrastructures, focus on the generic </li></ul></ul><ul><ul><ul><li>Storage systems and middleware </li></ul></ul></ul><ul><ul><ul><li>Preservation services </li></ul></ul></ul><ul><ul><ul><li>Identifying the needs of the wider community </li></ul></ul></ul>
  20. 20. Curation infrastructures (2) <ul><li>The need for collaboration: </li></ul><ul><ul><li>Need for 'deep-infrastructure' recognised as far back as 1996 by the Task Force on Archiving of Digital Information </li></ul></ul><ul><ul><li>Digital preservation involves the &quot;grander problem of organizing ourselves over time and as a society ... [to manoeuvre] effectively in a digital landscape&quot; (p. 7) </li></ul></ul>
  21. 21. Curation challenges: Costs <ul><li>NSF Task Force looking at this subject: </li></ul><ul><ul><li>Sustainable Economics for a Digital Planet (2010) </li></ul></ul><ul><li>JISC-funded LIFE (Life Cycle Information for E-Literature) project is developing a predictive costing tool </li></ul><ul><li>JISC-funded study (Keeping research data safe, 2008) focused on research data curation at the institution level </li></ul><ul><ul><li>The complex service requirements for curating research data mean that institutions are setting up federated approaches to repository development </li></ul></ul><ul><ul><li>Currently ingest costs are much higher than long-term storage and preservation costs </li></ul></ul>
  22. 22. Curation challenges: Scale (1) <ul><li>The “digital deluge” in e-Science </li></ul><ul><ul><li>New generations of instruments </li></ul></ul><ul><ul><li>Computer simulations </li></ul></ul><ul><ul><li>Many terabytes generated per day, petabyte scale computing (and growing) </li></ul></ul><ul><ul><li>Cory Doctorow, “Welcome to the petacentre.” Nature, 455, pp 17-21, 4 Sep 2008 </li></ul></ul><ul><ul><li>Are Institutional Repositories ready for this? </li></ul></ul><ul><li>Digitised content: </li></ul><ul><ul><li>Google Book Search (~10 million items) </li></ul></ul><ul><ul><li>A role for research libraries? </li></ul></ul>
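
To make the scale on slide 22 concrete, the sketch below converts a steady daily data rate into an annual volume and estimates how long a store of a given capacity would last. It is illustrative arithmetic only: the function names are hypothetical, decimal (SI) units are assumed, and the 5 TB/day and 2 PB inputs in the usage note are made-up figures, not numbers from the lecture.

```python
TB_PER_PB = 1000  # assumption: decimal (SI) units, 1 PB = 1000 TB


def annual_volume_pb(tb_per_day: float, days: int = 365) -> float:
    """Petabytes accumulated over `days` at a steady rate of `tb_per_day`."""
    return tb_per_day * days / TB_PER_PB


def days_to_fill(capacity_pb: float, tb_per_day: float) -> float:
    """Days until a store of `capacity_pb` petabytes is full at `tb_per_day`."""
    if tb_per_day <= 0:
        raise ValueError("tb_per_day must be positive")
    return capacity_pb * TB_PER_PB / tb_per_day


if __name__ == "__main__":
    # Hypothetical inputs chosen only to show the order of magnitude.
    print(annual_volume_pb(5))   # petabytes per year at 5 TB/day
    print(days_to_fill(2, 5))    # days for 5 TB/day to fill 2 PB
```

At a hypothetical 5 TB/day, an instrument produces nearly 2 PB in a year, which is the kind of volume the slide asks whether institutional repositories are ready for.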
  23. 23. Curation challenges: Scale (2) <ul><li>Problems of scale are particularly acute in traditional 'big-science' disciplines: </li></ul><ul><ul><li>Particle physics (e.g., the Large Hadron Collider) </li></ul></ul><ul><ul><li>Astronomy (sky surveys, etc.) </li></ul></ul><ul><li>But “smaller experiments will grow the fastest” (Szalay & Gray, Nature, 440, 413-4, 23 Mar 2006) </li></ul><ul><ul><li>Bioinformatics, crystallography, engineering design, and many others </li></ul></ul><ul><li>In some cases it may be cheaper just to generate the data again, e.g. for computer simulations </li></ul>
  24. 24. Curation challenges: Complexity (2) <ul><li>Research data is extremely diverse - not really a single category of material </li></ul><ul><ul><li>tabular data, images, GIS, etc. </li></ul></ul><ul><ul><li>raw machine output vs. derived data </li></ul></ul><ul><ul><li>varying levels of structure (XML, legacy formats, etc.) </li></ul></ul><ul><ul><li>many different standards </li></ul></ul><ul><li>Research data is not homogeneous </li></ul><ul><li>No one-size-fits-all approach possible </li></ul>
  25. 25. Curation challenges: Cultures <ul><li>Diverse research cultures </li></ul><ul><ul><li>Data practices vary widely, even within a single discipline </li></ul></ul><ul><ul><ul><li>Gene sequence data is typically deposited in public databases </li></ul></ul></ul><ul><ul><ul><li>In proteomics, sharing is not so widespread; this is partly driven by a lack of standards, but there is also concern about who has exploitation rights </li></ul></ul></ul><ul><ul><li>Role of commercial interests </li></ul></ul><ul><ul><ul><li>Pharmaceuticals, architecture and engineering, geological prospecting </li></ul></ul></ul>
  26. 26. The Future ... <ul><li>“It is always a mistake for a historian to try and predict the future. Life, unlike science, is simply too full of surprises” - Richard J. Evans, In defence of history (1997, p. 62) </li></ul>
  27. 27. Further reading <ul><li>National Science Board, Long-lived digital data collections: enabling research and education in the 21st century (NSF, 2005): http// </li></ul><ul><li>Liz Lyon, Dealing with data: roles, rights, responsibilities and relationships (JISC, 2007): dealingwithdata.aspx </li></ul><ul><li>Neil Beagrie, Julia Chruszcz, and Brian Lavoie, Keeping research data safe: a cost model and guidance for UK universities (JISC, 2008): </li></ul><ul><li>Liz Lyon, Open science at Web-scale: optimising participation and predictive potential (JISC, 2009): </li></ul><ul><li>Blue Ribbon Task Force on Sustainable Digital Preservation and Access, Sustainable economics for a digital planet (2010): </li></ul><ul><li>Digital Curation Centre: </li></ul>
  28. 28. Questions?
  29. 29. Acknowledgments <ul><li>UKOLN is funded by the Joint Information Systems Committee (JISC) of the UK higher and further education funding councils, the Museums, Libraries and Archives Council (MLA), as well as by project funding from the JISC, the European Union, and other sources. UKOLN also receives support from the University of Bath, where it is based. </li></ul><ul><li>More information: </li></ul>
  30. 30. Thank You!