Current and Emerging Scientific Data Curation Practices  Michael Day, Digital Curation Centre UKOLN, University of Bath [email_address] 4th Summer School on preservation in digital libraries, Tirrenia, Italy, 12 June 2008
Presentation outline: Some definitions Reasons for curating research data Some specific issues: Scale and complexity, diversity of social contexts, costs Types of research data collection Roles and responsibilities Potential for collaboration Some open questions
Definitions – research data (1) What is research data? An extremely broad category of material “ ... any information that can be stored in digital form, including text, numbers, images, video or movies, audio, software, algorithms, equations, animations, models, simulations, etc.” (National Science Board,  Long-lived digital data collections , 2005) In practice, it can mean almost anything
Definitions – digital curation (1) DCC definition: “ ... maintaining and adding value to a trusted body of digital information for current and future use; specifically, we mean the active management and appraisal of data over the life-cycle of scholarly and scientific materials” (http://www.dcc.ac.uk/)
Definitions – digital curation (2) Main themes: Curation is seen as an ongoing process, e.g. the active management of data over time It is also about adding value through things like community annotation Life-cycles are important, long-term stewardship not always necessary Not identical to digital preservation
Why curate research data? (1) Part of the normal research process: The need for others to validate and replicate research In some disciplines, supporting data is routinely made available to reviewers and linked from journal papers Principles of sharing and openness are firmly embedded in some disciplines
Why curate research data? (2) Extrinsic and intrinsic value: High investment in research Data can be very expensive to capture and analyse Some data is impossible to recreate once lost Observational data (by definition) is irreplaceable Current generations of instruments can gather more data than can be analysed
Why curate research data? (3) The potential for creating 'new' knowledge from existing data: Re-use, re-analysis, data mining Annotation, e.g. in molecular biology, astronomy Combining datasets in innovative ways, e.g. mapping biodiversity data onto ecological GIS “Science 2.0”
Why curate research data? (4) It is increasingly a requirement of some research funding bodies Some have quite mature data retention policies (not necessarily for permanent retention) Increasing expectation of access to data from publicly-funded research OECD Principles and guidelines for access to research data from public funding (2007)
Why curate research data? (5) Institutional asset management: Universities and other research organisations invest very large sums of money into research activities Research data is a key output of this activity It is, therefore, an institutional asset that needs stewardship
Why curate research data? (6) Promoting the institution, research group or individual: Re-use helps promote visibility and 'impact' Institutions become acknowledged 'centres of competence'
Scale and complexity (1) Scale (1): The “digital deluge” e-Science New generations of instruments Computer simulation Many terabytes generated per day, petabyte-scale computing (and growing)
Scale and complexity (2) Scale (2): Problems of scale are particularly acute in traditional 'big-science' disciplines: Particle physics (e.g., Large Hadron Collider) Astronomy (sky surveys, etc) Also increasingly important in: Bioinformatics, crystallography, engineering design, and many others May be cheaper just to generate the data again, e.g. for gene sequencing
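A rough sense of what "many terabytes per day" implies for storage can be had with simple arithmetic. The sketch below is illustrative only: the daily rates are assumed figures, not measurements from any instrument named here, and decimal units (1 PB = 1000 TB) are assumed.

```python
# Back-of-envelope storage growth, assuming decimal units (1 PB = 1000 TB).
TB_PER_PB = 1000


def petabytes_per_year(terabytes_per_day: float, days: int = 365) -> float:
    """Convert a sustained daily data rate (TB/day) into PB per year."""
    return terabytes_per_day * days / TB_PER_PB


def years_to_fill(capacity_pb: float, terabytes_per_day: float) -> float:
    """Years until a store of capacity_pb petabytes is full at a constant rate."""
    return capacity_pb / petabytes_per_year(terabytes_per_day)


if __name__ == "__main__":
    # Hypothetical daily rates chosen only to show the order of magnitude.
    for rate in (1, 10, 50):
        print(f"{rate} TB/day is about {petabytes_per_year(rate):.2f} PB/year")
```

Even a sustained rate in the low tens of terabytes per day reaches petabyte scale within a year, which is why the problem is described as growing.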
Scale and complexity (3) Complexity (1) Research data is extremely diverse - not really a single category of material tabular data, images, GIS, etc. raw machine output vs. derived data varying levels of structure (XML, legacy formats, etc.) many different standards Research data is not homogeneous No one-size-fits-all approach possible
Scale and complexity (4) Complexity (2): Even wider range of social contexts in which data is used (and shared) DCC SCARP project has been exploring disciplinary factors in curation practice Practice even within single disciplines is very fragmented Case studies ongoing Big-science archives, medical and social sciences, architecture and engineering, biological images
Diversity of contexts Research cultures Data practices vary widely, even within a single discipline Gene sequence data is typically deposited in public databases In proteomics sharing is not so widespread; partly driven by lack of standards, but also about who has exploitation rights Role of commercial interests Pharmaceuticals, architecture and engineering, geological prospecting
Costs Recent JISC study (2008): Focused on the institution level Some findings: The complex service requirements for curating research data mean that institutions are setting up federated approaches to repository development Currently ingest costs are much higher than long-term storage and preservation costs Start-up (and R&D) costs are high, but there can be economies of scale
Research data collections (1) A typology (1): From National Science Board report  Long-lived digital data collections  (2005) Research data collections – the products of one or more focused research projects Resource or community data collections – collections that emerge to serve particular subject sub-disciplines Reference data collections – serve a broader and more diverse set of user communities
Research data collections (2) A typology (2) Research data collections – the products of one or more focused research projects Extremely diverse Have small user communities Inconsistent standardisation Typically no funding available to support the collection beyond the project funding cycle
Research data collections (3) A typology (3) Resource or community data collections – collections that emerge to serve particular subject sub-disciplines Often establish community-level standards In many cases supported by funding bodies or particular research institutions
Research data collections (4) A typology (4) Reference data collections – collections that serve a broader and more diverse set of user communities conformance to robust, well-established standards essential Expensive (time and money) Budget typically comes from multiple sources, expectation that collections will persist
Research data collections (5) Data at risk Data in “research data collections” is most at risk A modern version of the “file-drawer problem” Data stored on personal hard-drives or on media; largely undocumented (cf. DAF) Particular challenge when the data creator has retired or moved to another institution Data creators not always aware of its value The reward structure of science is not helpful
Research data collections (6) Collections can evolve: For example, Protein Data Bank (PDB) Launched 1971, small-scale, focused on a limited set of biological structures Now is the main source of experimental structural information on biological macromolecules How do we recognise research data collections that have the potential to evolve into reference collections?
Roles and responsibilities (1) Long-lived data collections (NSB) Data authors Data managers Data scientists Data users Funding agencies Dealing with data (JISC) Scientist Institution Data centre User Funder Publisher
Roles and responsibilities (2) Scientists Initial creation and use of data Expectation of first use and of gaining appropriate credit and recognition Responsible for: Managing data for the life of the project Using standards (where possible) Complying with data policies Making the data available in a form that can (easily?) be used by others
Roles and responsibilities (3) Institutions: Role less clear Institutional policies may require short-term management of data Advocacy and training Some institutions are developing repository services These are currently rarely used for research data Federated approaches maintain disciplinary involvement
Roles and responsibilities (3) Data centres Undertake curation and provide access Responsible for: Selection and ingest Participating in the development of standards Protecting the rights of data creators Supporting ingest and metadata capture Supporting re-use (tools and services) Training
Roles and responsibilities (4) Users: Users of third-party data Responsible for: Adhering to any licenses and restrictions on use Acknowledging data creators and curators Managing any derived data Providing feedback to scientists and data centres
Roles and responsibilities (5) Funding bodies: Acting at policy level Responsible for: Considering wider policy perspectives Developing policies in co-operation with other stakeholders Monitoring and enforcing data policies Support for long-term data management Support for data curation
Curation infrastructures (1) Focus on the generic: Need for a balance between: The 'bottom-up' discipline-based drivers that promote the generation of research data The policy level, looking to make cost effective investment in curation When building Infrastructures, focus on the generic Storage systems and middleware Identifying the needs of the wider community
Curation infrastructures (2) The need for collaboration: Need for 'deep-infrastructure' recognised as far back as 1996 by the Task Force on Archiving of Digital Information Digital preservation involves the "grander problem of organizing ourselves over time and as a society ... [to manoeuvre] effectively in a digital landscape" (p. 7)
Collaboration on curation (1) Collaboration in science: Collaboration is deeply embedded in some (but not all) research cultures Research collaboration is a well-established phenomenon that has been studied by sociologists of science (and others) The nature of collaboration differs markedly between academic disciplines
Collaboration on curation (2) Scientific collaboration types: Informal social networks Helps to define disciplinary norms and interpretational paradigms Formalised, semi-permanent organisations Traditionally most common in "big-science" domains The growth of e-science has emphasised the collaborative nature of research
Collaboration on curation (4) Implications for curation: Collaborative data curation facilities might emerge first in sub-disciplines that have a more participatory collaboration pattern or otherwise have a strong emphasis on data sharing Need for more systematic research into this across all research domains Building on DCC SCARP
Collaboration on curation (5) Collaboration is: Currently focused at disciplinary or sub-disciplinary levels It is embedded within the workflows of particular research communities (e.g., genomics, crystallography, astronomy) Takes advantage of the specialised knowledge available within particular "designated communities"
Collaboration on curation (6) Collaboration and standards: Common standards emerge where there is a recognised need for data sharing The existence of common standards makes data centres and repositories viable
Collaboration on curation (7) Interdisciplinary collaboration: Previously little demand for collaboration on data curation across disciplinary borders But the fundamentally collaborative nature of e-research should make us challenge this: A need to pool resources and expertise A need for supporting infrastructures
Collaboration on curation (8) Need for strategic alliances National initiatives, e.g. DPC, NDIIPP, nestor European Alliance for Permanent Access
Open questions (1) The role of institutions Universities are setting up repositories Rhetoric suggests that they aim to manage all research outputs  (i.e. including data) In practice, they currently mostly deal with research papers What is the role of the institution with regard to research data? Do they have the trust of researchers?
Open questions (2) Are generic approaches possible? There is a tension between the diversity and complexity of research data and the need for generic solutions Promoting data sharing between disciplines Interoperability
Open questions (3) Data can only exist as part of wider research contexts They are referenced in papers and other forms of research communication, in project documentation and archives Linked from project Web pages, etc. How do we ensure that curated data remains integrated within this scholarly web? How do we make the links persistent?
Summing-up Size and diversity Research data is extremely diverse No one-size-fits-all solution Scale is a growing problem Infrastructures: Many data curation services already exist – good practice Need to integrate these (and institutional initiatives) at the policy level
Further reading Neil Beagrie, Julia Chruszcz, and Brian Lavoie, Keeping research data safe: a cost model and guidance for UK universities (JISC, 2008) Liz Lyon, Dealing with data: roles, rights, responsibilities and relationships (JISC, 2007) National Science Board, Long-lived digital data collections: enabling research and education in the 21st century (NSF, 2005)
Exercise
Exercise (1) 4 Scenarios: A research team in 2028 is evaluating a particular set of content for use in a particular project (Web content, multimedia, images, dataset) Ask questions about what they would need to know to interpret the content correctly Evaluate the relative importance of: content, context, appearance, structure, behaviour
Exercise (2) Scenario 1: A research project in 2028 is trying to explore how the first generation Internet was used by European political parties in the 1990s to promote citizen participation in policy formation. The investigators know that a large amount of Web material from this period is held by an organisation called the Internet Archive, and they have begun to use data mining tools to explore the extent of their holdings. What will they need to know about the collection in order to be able to do their work properly?
Exercise (3) Scenario 2: Art curators in 2028 are trying to put together an exhibition of digital art in a public gallery. They have found that a university art department retains a collection of digital art resources (chiefly multimedia) produced by its undergraduate students between 2000 and 2005, some of whom have gone on to become extremely important figures in the art establishment. When evaluating the collection for use in the exhibition, what would they consider to be the most important object characteristics?
Exercise (4) Scenario 3: Healthcare researchers in 2028 are trying to trace the historical incidence of certain lung abnormalities and have access to a massive database of medical images (X-rays) that they intend to submit to the most up-to-date content-based image retrieval techniques. The database is made up of imaging output from more than one hospital and the researchers are worried that certain parameters essential to their research (e.g., the age and sex of patient, imaging dates, etc.) may be missing. What else need they know about the database before they can start running their search algorithms?
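As a starting point for discussing Scenario 3, the researchers' stated worry about missing parameters could be checked with a simple audit before any retrieval is attempted. This is a minimal sketch under stated assumptions: the field names (patient_age, patient_sex, imaging_date) and the dictionary-per-image record layout are hypothetical and do not describe any real hospital database.

```python
# Hypothetical field names; a real image database would define its own schema.
REQUIRED_FIELDS = ("patient_age", "patient_sex", "imaging_date")


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in one record."""
    return [name for name in required if record.get(name) in (None, "")]


def count_missing(records, required=REQUIRED_FIELDS):
    """Count, per required field, how many records lack a usable value."""
    counts = {name: 0 for name in required}
    for record in records:
        for name in missing_fields(record, required):
            counts[name] += 1
    return counts


if __name__ == "__main__":
    # Two made-up records standing in for metadata attached to X-ray images.
    sample = [
        {"patient_age": 54, "patient_sex": "F", "imaging_date": "2008-06-12"},
        {"patient_age": None, "patient_sex": "M"},
    ]
    print(count_missing(sample))
```

A count like this only answers the question the researchers already asked; the exercise is about what else they would need to know.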
Exercise (5) Scenario 4: A research project in 2028 is trying to find links between climate records and biological species diversity in south-west England. The principal investigator has found a promising dataset of geographically-relevant biodiversity information in a local history museum. What more does she need to know about this dataset before she can get her team to try to integrate this dataset (and others like it) with historical climate models?
Acknowledgements The  Digital Curation Centre  is funded by the JISC and the UK Research Councils' e-Science Core Programme. http://www.dcc.ac.uk/ UKOLN  is funded by the Museums, Libraries and Archives Council, the Joint Information Systems Committee (JISC) of the UK higher and further education funding councils, as well as by project funding from the JISC, the European Union, and other sources. UKOLN also receives support from the University of Bath, where it is based. http://www.ukoln.ac.uk/
