Current and emerging scientific data curation practices


Slides from a presentation given at the 4th DELOS Summer School on Digital Preservation in Digital Libraries, Tirrenia, Pisa, Italy, 8-13 June 2008

Published in: Technology, Education
  1. Current and Emerging Scientific Data Curation Practices
      • Michael Day, Digital Curation Centre, UKOLN, University of Bath [email_address]
      • 4th Summer School on preservation in digital libraries, Tirrenia, Italy, 12 June 2008
  2. Presentation outline:
      • Some definitions
      • Reasons for curating research data
      • Some specific issues:
        • Scale and complexity, diversity of social contexts, costs
      • Types of research data collection
      • Roles and responsibilities
      • Potential for collaboration
      • Some open questions
  3. Definitions – research data (1)
      • What is research data?
        • An extremely broad category of material
          • “... any information that can be stored in digital form, including text, numbers, images, video or movies, audio, software, algorithms, equations, animations, models, simulations, etc.” (National Science Board, Long-lived digital data collections, 2005)
          • In practice, it can mean almost anything
  4. Definitions – digital curation (1)
      • DCC definition:
        • “... maintaining and adding value to a trusted body of digital information for current and future use; specifically, we mean the active management and appraisal of data over the life-cycle of scholarly and scientific materials” (
  5. Definitions – digital curation (2)
      • Main themes:
        • Curation is seen as an ongoing process, e.g. the active management of data over time
        • It is also about adding value through things like community annotation
        • Life-cycles are important; long-term stewardship is not always necessary
        • Not identical to digital preservation
  6. Why curate research data? (1)
      • Part of the normal research process:
        • The need for others to validate and replicate research
        • In some disciplines, supporting data is routinely made available to reviewers and linked from journal papers
        • Principles of sharing and openness are firmly embedded in some disciplines
  7. Why curate research data? (2)
      • Extrinsic and intrinsic value:
        • High investment in research
        • Data can be very expensive to capture and analyse
        • Data is impossible to recreate once lost
        • Observational data (by definition) is irreplaceable
        • Current generations of instruments can gather more data than can be analysed
  8. Why curate research data? (3)
      • The potential for creating 'new' knowledge from existing data:
        • Re-use, re-analysis, data mining
        • Annotation, e.g. in molecular biology, astronomy
        • Combining datasets in innovative ways, e.g. mapping biodiversity data onto ecological GIS
        • “Science 2.0”
  9. Why curate research data? (4)
      • It is increasingly a requirement of some research funding bodies
        • Some have quite mature data retention policies (not necessarily for permanent retention)
        • Increasing expectation of access to data from publicly-funded research
        • OECD Principles and guidelines for access to research data from public funding (2007)
  10. Why curate research data? (5)
      • Institutional asset management:
        • Universities and other research organisations invest very large sums of money into research activities
        • Research data is a key output of this activity
        • It is, therefore, an institutional asset that needs stewardship
  11. Why curate research data? (6)
      • Promoting the institution, research group or individual:
        • Re-use helps promote visibility and 'impact'
        • Institutions become acknowledged 'centres of competence'
  12. Scale and complexity (1)
      • Scale (1):
        • The “digital deluge”
          • e-Science
          • New generations of instruments
          • Computer simulation
          • Many terabytes generated per day, petabyte-scale computing (and growing)
  13. Scale and complexity (2)
      • Scale (2):
        • Problems of scale are particularly acute in traditional 'big-science' disciplines:
          • Particle physics (e.g., Large Hadron Collider)
          • Astronomy (sky surveys, etc.)
        • Also increasingly important in:
          • Bioinformatics, crystallography, engineering design, and many others
        • May be cheaper just to generate the data again, e.g. for gene sequencing
  14. Scale and complexity (3)
      • Complexity (1):
        • Research data is extremely diverse - not really a single category of material
          • Tabular data, images, GIS, etc.
          • Raw machine output vs. derived data
          • Varying levels of structure (XML, legacy formats, etc.)
          • Many different standards
        • Research data is not homogeneous
        • No one-size-fits-all approach possible
  15. Scale and complexity (4)
      • Complexity (2):
        • Even wider range of social contexts in which data is used (and shared)
        • DCC SCARP project has been exploring disciplinary factors in curation practice
          • Practice even within single disciplines is very fragmented
          • Case studies ongoing
            • Big-science archives, medical and social sciences, architecture and engineering, biological images
  16. Diversity of contexts
      • Research cultures
        • Data practices vary widely, even within a single discipline
          • Gene sequence data is typically deposited in public databases
          • In proteomics, sharing is not so widespread; this is partly due to a lack of standards, but also to questions about who has exploitation rights
        • Role of commercial interests
          • Pharmaceuticals, architecture and engineering, geological prospecting
  17. Costs
      • Recent JISC study (2008):
        • Focused on the institution level
        • Some findings:
          • The complex service requirements for curating research data mean that institutions are setting up federated approaches to repository development
          • Currently, ingest costs are much higher than long-term storage and preservation costs
          • Start-up (and R&D) costs are high, but there can be economies of scale
  18. Research data collections (1)
      • A typology (1):
        • From National Science Board report Long-lived digital data collections (2005)
          • Research data collections – the products of one or more focused research projects
          • Resource or community data collections – collections that emerge to serve particular subject sub-disciplines
          • Reference data collections – serve a broader and more diverse set of user communities
  19. Research data collections (2)
      • A typology (2):
        • Research data collections – the products of one or more focused research projects
          • Extremely diverse
          • Have small user communities
          • Inconsistent standardisation
          • Typically no funding available to support the collection beyond the project funding cycle
  20. Research data collections (3)
      • A typology (3):
        • Resource or community data collections – collections that emerge to serve particular subject sub-disciplines
          • Often establish community-level standards
          • In many cases supported by funding bodies or particular research institutions
  21. Research data collections (4)
      • A typology (4):
        • Reference data collections – collections that serve a broader and more diverse set of user communities
          • Conformance to robust, well-established standards is essential
          • Expensive (time and money)
          • Budget typically comes from multiple sources; expectation that collections will persist
  22. Research data collections (5)
      • Data at risk
        • Data in “research data collections” is most at risk
          • A modern version of the “file-drawer problem”
          • Data stored on personal hard-drives or on media; largely undocumented (cf. DAF)
          • Particular challenge when the data creator has retired or moved to another institution
          • Data creators are not always aware of its value
          • The reward structure of science is not helpful
  23. Research data collections (6)
      • Collections can evolve:
        • For example, Protein Data Bank (PDB)
          • Launched 1971, small-scale, focused on a limited set of biological structures
          • Now the main source of experimental structural information on biological macromolecules
        • How do we recognise research data collections that have the potential to evolve into reference collections?
  24. Roles and responsibilities (1)
      • Long-lived data collections (NSB)
        • Data authors
        • Data managers
        • Data scientists
        • Data users
        • Funding agencies
      • Dealing with data (JISC)
        • Scientist
        • Institution
        • Data centre
        • User
        • Funder
        • Publisher
  25. Roles and responsibilities (2)
      • Scientists:
        • Initial creation and use of data
        • Expectation of first use and of gaining appropriate credit and recognition
        • Responsible for:
          • Managing data for the life of the project
          • Using standards (where possible)
          • Complying with data policies
          • Making the data available in a form that can (easily?) be used by others
  26. Roles and responsibilities (3)
      • Institutions:
        • Role less clear
        • Institutional policies may require short-term management of data
          • Advocacy and training
        • Some institutions are developing repository services
          • These are currently rarely used for research data
          • Federated approaches maintain disciplinary involvement
  27. Roles and responsibilities (3)
      • Data centres:
        • Undertake curation and provide access
        • Responsible for:
          • Selection and ingest
          • Participating in the development of standards
          • Protecting the rights of data creators
          • Supporting ingest and metadata capture
          • Supporting re-use (tools and services)
          • Training
  28. Roles and responsibilities (4)
      • Users:
        • Users of third-party data
        • Responsible for:
          • Adhering to any licenses and restrictions on use
          • Acknowledging data creators and curators
          • Managing any derived data
          • Providing feedback to scientists and data centres
  29. Roles and responsibilities (5)
      • Funding bodies:
        • Acting at policy level
        • Responsible for:
          • Considering wider policy perspectives
          • Developing policies in co-operation with other stakeholders
          • Monitoring and enforcing data policies
          • Support for long-term data management
          • Support for data curation
  30. Curation infrastructures (1)
      • Focus on the generic:
        • Need for a balance between:
          • The 'bottom-up' discipline-based drivers that promote the generation of research data
          • The policy level, looking to make cost-effective investment in curation
        • When building infrastructures, focus on the generic
          • Storage systems and middleware
          • Identifying the needs of the wider community
  31. Curation infrastructures (2)
      • The need for collaboration:
        • Need for 'deep-infrastructure' recognised as far back as 1996 by the Task Force on Archiving of Digital Information
        • Digital preservation involves the "grander problem of organizing ourselves over time and as a society ... [to manoeuvre] effectively in a digital landscape" (p. 7)
  32. Collaboration on curation (1)
      • Collaboration in science:
        • Collaboration is deeply embedded in some (but not all) research cultures
        • Research collaboration is a well-established phenomenon that has been studied by sociologists of science (and others)
        • The nature of collaboration differs markedly between academic disciplines
  33. Collaboration on curation (2)
      • Scientific collaboration types:
        • Informal social networks
          • Help to define disciplinary norms and interpretational paradigms
        • Formalised, semi-permanent organisations
          • Traditionally most common in "big-science" domains
          • The growth of e-science has emphasised the collaborative nature of research
  34. Collaboration on curation (4)
      • Implications for curation:
        • Collaborative data curation facilities might emerge first in sub-disciplines that have a more participatory collaboration pattern or otherwise have a strong emphasis on data sharing
        • Need for more systematic research into this across all research domains
          • Building on DCC SCARP
  35. Collaboration on curation (5)
      • Collaboration is:
        • Currently focused at disciplinary or sub-disciplinary levels
        • Embedded within the workflows of particular research communities (e.g., genomics, crystallography, astronomy)
        • Able to take advantage of the specialised knowledge available within particular "designated communities"
  36. Collaboration on curation (6)
      • Collaboration and standards:
        • Common standards emerge where there is a recognised need for data sharing
        • The existence of common standards makes data centres and repositories viable
  37. Collaboration on curation (7)
      • Interdisciplinary collaboration:
        • Previously little demand for collaboration on data curation across disciplinary borders
        • But the fundamentally collaborative nature of e-research should make us challenge this:
          • A need to pool resources and expertise
          • A need for supporting infrastructures
  38. Collaboration on curation (8)
      • Need for strategic alliances
        • National initiatives, e.g. DPC, NDIIPP, nestor
        • European Alliance for Permanent Access
  39. Open questions (1)
      • The role of institutions
        • Universities are setting up repositories
        • Rhetoric suggests that they aim to manage all research outputs (i.e. including data)
        • In practice, they currently mostly deal with research papers
        • What is the role of the institution with regard to research data?
        • Do they have the trust of researchers?
  40. Open questions (2)
      • Are generic approaches possible?
        • There is a tension between the diversity and complexity of research data and the need for generic solutions
          • Promoting data sharing between disciplines
          • Interoperability
  41. Open questions (3)
      • Data can only exist as part of wider research contexts
        • They are referenced in papers and other forms of research communication, in project documentation and archives
        • Linked from project Web pages, etc.
        • How do we ensure that curated data remains integrated within this scholarly web?
        • How do we make the links persistent?
  42. Summing-up
      • Size and diversity:
        • Research data is extremely diverse
        • No one-size-fits-all solution
        • Scale is a growing problem
      • Infrastructures:
        • Many data curation services already exist – good practice
        • Need to integrate these (and institutional initiatives) at the policy level
  43. Further reading
      • Neil Beagrie, Julia Chruszcz, and Brian Lavoie, Keeping research data safe: a cost model and guidance for UK universities (JISC, 2008)
      • Liz Lyon, Dealing with data: roles, rights, responsibilities and relationships (JISC, 2007)
      • National Science Board, Long-lived digital data collections: enabling research and education in the 21st century (NSF, 2005)
  44. Exercise
  45. Exercise (1)
      • 4 Scenarios:
        • A research team in 2028 is evaluating a particular set of content for use in a particular project (Web content, multimedia, images, dataset)
        • Ask questions about what they would need to know to interpret the content correctly
        • Evaluate the relative importance of: content, context, appearance, structure, behaviour
  46. Exercise (2)
      • Scenario 1: A research project in 2028 is trying to explore how the first-generation Internet was used by European political parties in the 1990s to promote citizen participation in policy formation. The investigators know that a large amount of Web material from this period is held by an organisation called the Internet Archive, and they have begun to use data mining tools to explore the extent of its holdings. What will they need to know about the collection in order to be able to do their work properly?
  47. Exercise (3)
      • Scenario 2: Art curators in 2028 are trying to put together an exhibition of digital art in a public gallery. They have found that a university art department retains a collection of digital art resources (chiefly multimedia) produced by its undergraduate students between 2000 and 2005, some of whom have gone on to become extremely important figures in the art establishment. When evaluating the collection for use in the exhibition, what would they consider to be the most important object characteristics?
  48. Exercise (4)
      • Scenario 3: Healthcare researchers in 2028 are trying to trace the historical incidence of certain lung abnormalities and have access to a massive database of medical images (X-rays) that they intend to submit to the most up-to-date content-based image retrieval techniques. The database is made up of imaging output from more than one hospital, and the researchers are worried that certain parameters essential to their research (e.g., the age and sex of the patient, imaging dates, etc.) may be missing. What else do they need to know about the database before they can start running their search algorithms?
  49. Exercise (5)
      • Scenario 4: A research project in 2028 is trying to find links between climate records and biological species diversity in south-west England. The principal investigator has found a promising dataset of geographically relevant biodiversity information in a local history museum. What more does she need to know about this dataset before she can get her team to try to integrate it (and others like it) with historical climate models?
  50. Acknowledgements
      The Digital Curation Centre is funded by the JISC and the UK Research Councils' e-Science Core Programme. UKOLN is funded by the Museums, Libraries and Archives Council, the Joint Information Systems Committee (JISC) of the UK higher and further education funding councils, as well as by project funding from the JISC, the European Union, and other sources. UKOLN also receives support from the University of Bath, where it is based.