a centre of expertise in data curation and preservation




Curation of Scientific Data:
                 Challenges for Repositories

                  Chris Rusbridge
           JISC Repositories Conference
              5 June 2007, Manchester
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5
UK: Scotland License, excluding content that is the property of others. To view a copy of this
license, (a) visit http://creativecommons.org/licenses/by-nc-sa/2.5/scotland/ or (b) send a letter to
Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.




                         Contents
     •   Audience?
     •   Science and digital curation
     •   Why are data important?
     •   What kinds of data?
     •   What to do with data?
     •   Repository options
     •   Changing practice


JISC Repositories 2007




                         Audience
     • I assume you are either…
         • A Repository Manager concerned about adding
           data to your collections of ePrints (most likely), or
         • A research data manager or other researcher,
           concerned about finding an appropriate repository
           to curate your data (possibly), or
         • Neither of the above, in the wrong room, just come
           in to get out of the sun…








  Digital Curation Centre Mission
        “The over-riding purpose of the DCC is to
        support and promote continuing improvement
        in the quality of data curation, and of
        associated digital preservation”








       “The Records of Science”
     • Data increasingly important as evidence
         • Key part of the scholarly record (public good)
             • Unrepeatable observations & experiments
             • Value for public money (eg OECD)
     • Experimental verifiability (the basis of science)
         • Would the Chang retractions have been fewer if his first data
           had been available?
                 CHANG, G., ROTH, C. B., REYES, C. L., PORNILLOS, O., CHEN, Y.-J. & CHEN, A. P. (2006)
                 Retraction of Pornillos et al., Science 310 (5756) 1950-1953. Retraction of Reyes and Chang,
                 Science 308 (5724) 1028-1031. Retraction of Chang and Roth, Science 293 (5536) 1793-1800.
                 Science Magazine, 314. http://www.sciencemag.org/cgi/content/full/314/5807/1875b

     • Allows additional interpretations
     • Legal and compliance (eg emerging RC mandates)





                OECD declaration
     • “…Work towards the establishment of access regimes
       for digital research data from public funding in
       accordance with the following objectives and principles:
         •   Openness
         •   Transparency
         •   Legal conformity
         •   Formal responsibility
         •   Professionalism
         •   Protection of intellectual property
         •   Interoperability
         •   Quality and security
         •   Efficiency
         •   Accountability”




Retaining research data means…
     • Data secure against loss (within group)
     • Communal repository (secure data store)
     • Re-usable, sharable information
     • As above, plus active curation (eg bioinformatics)
     • Long term preservation of information

     • Be clear what you are trying to do!





     … or the data trajectory is…
     • Hard drive → lost (crash)
     • Hard drive → DVD → Cardboard box → Loft → Skip/dumpster → lost




     • Sometimes this is a very bad thing
     • Sometimes these are the right options!


© Marita Bushell




         Long term bit storage…
     • A solved problem? Just requires well-understood good data
       management practices?
     • Wrong! For very large datasets kept over very long periods,
       there are significant problems…




                   BAKER, M., SHAH, M., ROSENTHAL, D. S. H., ROUSSOPOULOS, M., MANIATIS, P., GIULI, T.
                   J. & BUNGALE, P. (2006) A Fresh Look at the Reliability of Long-term Digital Storage. EuroSys
                   '06. Leuven, Belgium, ACM.





   How Well Must We Preserve?
   Keep a petabyte for a century
   – With a 50% chance of remaining completely undamaged

   Consider each bit decaying independently
   – Analogy with radioactive decay

   That's a bit half-life of 10^18 years
   – One hundred million times the age of the universe

   That's a very demanding requirement
   – Hard to measure
   – Even very unlikely faults will matter a lot
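
The half-life figure on this slide can be checked with a few lines of arithmetic. The sketch below is an illustration only: it assumes a petabyte is 8 × 10^15 bits and takes the age of the universe as roughly 1.38 × 10^10 years. Those constants and the function name are assumptions, not taken from the slide.

```python
import math

PETABYTE_BITS = 8 * 10**15        # assumption: 10**15 bytes of 8 bits each
AGE_OF_UNIVERSE_YEARS = 1.38e10   # assumption: rough current estimate


def required_bit_half_life(n_bits, years, survival_probability=0.5):
    """Half-life each bit needs so that all n_bits survive `years`
    with the given probability, if bits decay independently.

    One bit survives t years with probability 2 ** (-t / T), so all
    n_bits survive with probability 2 ** (-n_bits * t / T). Solving
    for T gives T = n_bits * t / -log2(survival_probability).
    """
    return n_bits * years / -math.log2(survival_probability)


half_life = required_bit_half_life(PETABYTE_BITS, years=100)
ratio = half_life / AGE_OF_UNIVERSE_YEARS
print(f"bit half-life: {half_life:.1e} years")
print(f"multiples of the age of the universe: {ratio:.1e}")
```

Under these assumptions the result is of the order of 10^18 years, in line with the slide's rounded figures.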

Slide from David Rosenthal, LOCKSS




       What to do about curation
     • Build curation/reusability into science workflow
         • Curation begins before creation
         • What’s easy at first becomes (impossibly) hard later
         • Describe data (metadata schemas, “representation info”,
           etc)
         • Keep experimental parameters (technical, who, what, when,
           where)
         • Keep ability to process
         • Keep data!
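
One way to keep the description and experimental parameters with the data from the moment of creation is to write a small sidecar record next to each data file. The sketch below is a minimal illustration: the field names, the `.meta.json` suffix and the example values are all hypothetical, not a DCC schema.

```python
import json
import tempfile
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass
class CaptureRecord:
    """Hypothetical minimal description of one data file."""
    who: str          # person or group that created the data
    what: str         # what was measured or observed
    when: str         # ISO 8601 timestamp of capture
    where: str        # site, instrument or lab
    technical: dict   # instrument settings, calibration, units
    data_format: str  # format needed to process the file later


def write_sidecar(data_path: Path, record: CaptureRecord) -> Path:
    """Write the record as JSON next to the data file and return its path."""
    sidecar = data_path.with_name(data_path.name + ".meta.json")
    sidecar.write_text(json.dumps(asdict(record), indent=2))
    return sidecar


# Usage: describe the file at the moment it is created.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "run42.csv"
    data_file.write_text("t,value\n0,1.5\n")
    record = CaptureRecord(
        who="example group",
        what="example measurement",
        when="2007-06-05T10:00:00Z",
        where="example lab",
        technical={"units": "arbitrary"},
        data_format="CSV, comma separated, header row",
    )
    sidecar = write_sidecar(data_file, record)
    loaded = json.loads(sidecar.read_text())
```

Writing the record at capture time is the point of the design: the same details are much harder to reconstruct later.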








    What to do about curation - 2
     • Use standard/agreed formats for data
     • Make ownership & restrictions clear, &
       explain how to cite data
     • Offer for deposit in institutional or discipline
       repository
         • Appraisal and selection essential
         • Possible time-limited embargoes
     • “Publish” data in support of articles






 Internet Archaeology: publication with data








            Database as book…
     • Buneman (early pilot) work on IUPHAR database
     • MySQL to XML database
         • Historic to logical schema
     • XML via XSLT to LaTeX
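
The pipeline on this slide can be sketched in miniature. This is not the IUPHAR pilot's code. An in-memory SQLite table stands in for MySQL, and a plain Python function stands in for the XSLT stylesheet, because the Python standard library has no XSLT processor. The table, column names and rows are invented for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Stand-in relational source (assumption: SQLite instead of MySQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE receptor (name TEXT, family TEXT)")
conn.executemany(
    "INSERT INTO receptor VALUES (?, ?)",
    [("R1", "Family A"), ("R2", "Family B & C")],
)


def table_to_xml(conn, table):
    """Dump one table to an XML tree: <table><record><col>value</col>...</record></table>.

    `table` is interpolated into the SQL, so it must be a trusted name.
    """
    cur = conn.execute(f"SELECT * FROM {table} ORDER BY rowid")
    columns = [d[0] for d in cur.description]
    root = ET.Element(table)
    for row in cur:
        rec = ET.SubElement(root, "record")
        for col, val in zip(columns, row):
            ET.SubElement(rec, col).text = str(val)
    return root


LATEX_SPECIALS = {"&": r"\&", "%": r"\%", "_": r"\_", "#": r"\#", "$": r"\$"}


def escape_latex(text):
    """Escape the most common LaTeX special characters."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)


def xml_to_latex(root):
    """Render the XML records as a LaTeX tabular (stand-in for the XSLT step)."""
    records = root.findall("record")
    if not records:
        return ""
    columns = [child.tag for child in records[0]]
    lines = [
        r"\begin{tabular}{" + "l" * len(columns) + "}",
        " & ".join(escape_latex(c) for c in columns) + r" \\",
    ]
    for rec in records:
        lines.append(" & ".join(escape_latex(c.text or "") for c in rec) + r" \\")
    lines.append(r"\end{tabular}")
    return "\n".join(lines)


xml_root = table_to_xml(conn, "receptor")
latex = xml_to_latex(xml_root)
print(latex)
```

Keeping XML as the intermediate form means the typeset output can be regenerated whenever the database changes.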








                  The StORe vision
     • Seamless transport from research data to research
       publications and vice versa
     • Bi-directional links proven in social science e-research
       but capable of export to other disciplines

     [Diagram: Source, Middleware, Output]

     http://jiscstore.jot.com/WikiHome/
Slide from Graham Pryor




  StORe survey: linkage value?
   The value of direct links from source to output data

                           University      University          PG       Contract    Independent  Other  Totals
   Response                academic staff  research assistant  student  researcher  researcher
   Significant advantage   85              18                  33       11          2            26     175
   Useful                  78              9                   41       5           4            9      146
   Interesting             24              4                   5        3           0            5      41
   Of no interest          9               0                   0        0           0            1      10
   Not sure                7               0                   7        0           1            2      17
   Other                   1               1                   0        0           0            1      3
   Totals                  204             32                  86       19          7            44     392

   But: “researchers’ attitudes to enabling access depend to a large
   extent on whether they are behaving as producers or users of data”
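
To read the survey totals as proportions, the counts in the Totals column can be converted to percentages. The sketch below does only that arithmetic. The variable names are illustrative, and treating "Significant advantage" plus "Useful" as one "positive" group is an assumption made for this example, not a StORe category.

```python
# Totals column of the StORe linkage-value table.
responses = {
    "Significant advantage": 175,
    "Useful": 146,
    "Interesting": 41,
    "Of no interest": 10,
    "Not sure": 17,
    "Other": 3,
}

total = sum(responses.values())
shares = {answer: 100 * count / total for answer, count in responses.items()}

# Assumption: the two most favourable answers are grouped as "positive".
positive = shares["Significant advantage"] + shares["Useful"]

for answer, share in shares.items():
    print(f"{answer:<22} {share:5.1f}%")
print(f"{'Positive (grouped)':<22} {positive:5.1f}%")
```

On these figures about 82% of the 392 respondents rated direct links as a significant advantage or as useful.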

Slide from StORe project




        What to do about data (3)
     • Institutional repository managers
         • Make contact with emerging institutional data services
         • Start raising awareness of the need to curate rather than just
           dump data
         • Start thinking about the relationship of data to publications
           (especially e-theses)
         • Start thinking about the metadata needed to find and re-use
           data
         • Make contact with key researchers
         • Start thinking about their data…








             What kinds of data?
     • Observations
         • eg UARS (Upper Atmosphere) Level 0: telemetry
         • UARS Level 1: measured physical parameters (post
           calibration?)
     • Derived data
         • UARS Level 2: calculated geophysical? profiles
         • UARS Level 3: gridded, interpolated?
     • Combined data
     • Crafted data
         • Eg annotated gene/protein databases
     • Descriptive (meta)data






StORe: Source data formats
     CAD/GIS:                                     39
     Extensible markup language (XML):            35
     Database files (e.g. Access, MySQL):         117
     Flat files (e.g. FITS):                      66
     Hypertext markup language (HTML):            60
     Image files (e.g. .jpg, .tif, .bmp, .gif):   228
     Plain text (.txt):                           179
     Portable document format (.pdf):             156
     Rich text files (.rtf):                      53
     Spreadsheets (e.g. Excel/.xls):              220
     Statistical software:                        75
     Tables/catalogues:                           102
     Word processed files (e.g. Word/.doc):       220
     Other (please specify):                      76




Slide from StORe project




StORe: the other data formats?
     They said the 76 other formats included:
       +latex+.cc source code, .cif (crystallographic data),
       .pdb, .mtz, .pool, .root, .raw, .swf, .fla, .raw, .mpg,
       binary files, chemdraw cdx, xwin nmr files, .ps files,
       .fla, .swf, masslynx files, derived data in PAw-format
       ntuples, raw mass spectrometry data, X-ray
       diffraction data, kaleidagraphs, Atlas/ti hermeneutic
       unit files, C++/shell scripts, Fourier induction decay
       files, etc., etc., etc., etc………..



Slide from StORe project




StORe: the other data formats - more
  They also said such things as:
    “It is stored in a database, but nothing so simple as an
    Access file! It's one of the largest databases in the world!
    The format is Kanga/Root and previously was
    Objectivity. I think it's of the order of Picobytes in size.”
  And:
    “God preserve us from idiots who archive data in
    proprietary commercial formats (Excel spreadsheets and
    MS-word documents)!”




Slide from StORe project




  What are the reusability issues?
     • Data not neutral; highly contextual!
     • Hard to know the risks & pitfalls of a particular
       dataset
     • Data not self-describing: hard to find
       appropriate data (but see Murray-Rust on
       Googling InChI etc)
     • Hard to “understand” data once found
         • Really need information, not data!
     • Hard to use data once understood





                         Context
     • Data meaningless without context
         • Metadata of many kinds
         • Representation information… from data to
           information
         • Linkage and connection between datasets
     • Provenance
         • Authenticity/integrity
         • Computational lineage
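
The integrity part of provenance is commonly handled by recording a checksum at deposit and recomputing it later. A minimal sketch follows, assuming SHA-256 and a digest stored by the repository. The function names and the file are illustrative, and a matching digest shows only that the bits are unchanged, not who produced them.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, recorded_digest):
    """True if the file still matches the digest recorded at deposit."""
    return sha256_of(path) == recorded_digest


with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "observations.dat"
    data_file.write_bytes(b"example payload")

    recorded = sha256_of(data_file)        # stored alongside the metadata
    intact = verify(data_file, recorded)   # later audit, file unchanged

    data_file.write_bytes(b"example pay1oad")  # simulate silent corruption
    damaged = verify(data_file, recorded)
```

Authenticity needs more than this, for example a record of who deposited the file and how it was derived.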







              Access and re-use
     • Ethics and rights control access
         • Weak in expressing this long-term
     • Collaboration tools
         • Annotation, discussion, review (see DART…)
         • Re-use leading to change and development
     • “Publication”
         • Not just in “print”
         • Underlying data should be “published”, too






           Data citation issues…
     • Citation for human readers and machine use cases
     • Granularity: database, record, item
     • Citation of changing objects
         • Version change (eg W3C practice: no version = latest, vs bibliographic:
           no version = first)
         • Need an efficient way to reference and access “archived” past states of
           more rapidly changing datasets, eg genomics: datasets that result
           from the combined work of curators, or contain opinions or facts likely
           to change (work in progress, Buneman et al)
     • Standards are conflicting and immature (NLM best?)

     • Citation ESSENTIAL for motivating quality academic work on data
       management and curation
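
The two conventions for a citation that names no version can be made concrete with a small resolver. This is a sketch under stated assumptions: the function, the `convention` argument and the version labels are hypothetical, and no citation standard is implemented here.

```python
def resolve_version(versions, requested=None, convention="w3c"):
    """Pick the dataset version a citation refers to.

    versions:   list of version labels, oldest first.
    requested:  version named in the citation, or None if none is given.
    convention: "w3c" (no version means latest) or
                "bibliographic" (no version means first).
    """
    if not versions:
        raise ValueError("no archived versions")
    if requested is not None:
        if requested not in versions:
            raise KeyError(requested)
        return requested
    if convention == "w3c":
        return versions[-1]
    if convention == "bibliographic":
        return versions[0]
    raise ValueError(f"unknown convention: {convention}")


archived = ["2005-01", "2006-01", "2007-01"]
print(resolve_version(archived))                              # latest state
print(resolve_version(archived, convention="bibliographic"))  # first state
print(resolve_version(archived, requested="2006-01"))         # explicit state
```

The same unversioned citation resolves to different data under the two conventions, which is why a citation to a changing dataset needs an explicit version or date.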






             Repository challenges
     • Data are different: you’ll need access to some domain
       knowledge
     • Appraisal/selection harder
     • Broader range of formats
         • Appropriate “standards” for longevity? XML-based?
     • What metadata are needed?
         •   Descriptive, to find the dataset
         •   Context and background
         •   Provenance
         •   “Representation information” to connect data to information
             (whatever gives meaning to data for the “designated
             community”)





        Repository challenges - 2
     • May distort your repository
         •   Size
         •   Number of objects
         •   Rate of deposit
         •   Nature of use
     • Databases may be dynamic
     • Databases may need to be accessed in situ
     • Rights and ethical limitations hard to describe and
       enforce
     • Need to build links to publications (cf StORe)
     • Need to build discipline links across repositories…





        Repository challenges - 3
     • Is your platform suitable?
     • Most successful (ie older) data repositories
       are DIY
     • Data also held in repositories built on DSpace,
       EPrints and Fedora








Data from MIT DSpace Political Science








         Who does data curation?
     •   Individuals
     •   Departments or groups
     •   Institutions, often through libraries
     •   Communities
     •   Disciplines
     •   Publishers
     •   National services
     •   Other 3rd parties…





   Who are the curation players?








       Disciplinary repositories…
     • >900 Nucleic Acids datasets!
     • ESDS/UKDA and NERC data centres, but…
     • “AHRC Council has decided to cease funding the Arts
       and Humanities Data Service (AHDS) from March
       2008. […] Grant holders must make materials they
       had planned to deposit with the AHDS available in an
       accessible depository for at least three years after the
       end of their grant”
             • AHRC Press Release 14/05/2007
             • (Note petition at http://petitions.pm.gov.uk/AHDSfunding/)
         • Does not apply to Archaeology: ADS still funded?







        Institutional Repositories
     • OpenDOAR: only 5 Institutional Repositories claim to
       include datasets
         •   Bristol
         •   Cambridge
         •   Edinburgh
         •   Leicester
         •   Southampton
     • …and some of these seem doubtful on inspection!
         • … of course not all research data are “datasets”








                 Cultural change
     • If we build it, will they come? NO!!
     • Outreach important: communication with
       scientists and researchers is hard graft
     • Cultural change to new approach requires more:
         • Incentives, rewards and mandates
         • Successful exemplars (well publicised)
         • Discipline-oriented approach (one size does not fit all)








Need for advocacy?
       What functionality is missing from source repositories?
       (“-” marks a cell left blank on the original slide)

       Response            Academic staff  Research assistants  Postgraduates  Independent researchers
       None                9               2                    7              -
       Don’t use           7               -                    10             1
       Lack of knowledge   3               -                    4              2
       Don’t know          5               3                    13             1
       No reply            129             20                   45             13

Slide from StORe project




Need for advocacy?
       What functionality is missing from output repositories?
       (“-” marks a cell left blank on the original slide)

       Response            Academic staff  Research assistants  Postgraduates  Independent researchers
       None                3               2                    5              1
       Don’t use           1               1                    -              -
       Lack of knowledge   -               -                    2              1
       Don’t know          -               2                    6              1
       No reply            123             15                   48             15

Slide from StORe project




Need for advocacy?
        “The majority of academics do not know
        what repositories are nor are they
        familiar with the issues around new
        means of dissemination”
        – UKOLN/Eduserv Foundation: Digital
        Repositories Roadmap: looking forward, April
        2006



Slide from StORe project




                               Thank you
                         c.rusbridge@ed.ac.uk




JISC Repositories 2007

More Related Content

What's hot

Ontologies For the Modern Age - McGuinness' Keynote at ISWC 2017
Ontologies For the Modern Age - McGuinness' Keynote at ISWC 2017Ontologies For the Modern Age - McGuinness' Keynote at ISWC 2017
Ontologies For the Modern Age - McGuinness' Keynote at ISWC 2017
Deborah McGuinness
 
Public data archiving: Who does? Who doesn't? What can we do about it?
Public data archiving: Who does?  Who doesn't?  What can we do about it?Public data archiving: Who does?  Who doesn't?  What can we do about it?
Public data archiving: Who does? Who doesn't? What can we do about it?
Heather Piwowar
 
Elag workshop sessie 1 en 2 v10
Elag workshop sessie 1 en 2 v10Elag workshop sessie 1 en 2 v10
Elag workshop sessie 1 en 2 v10
Jeroen Rombouts
 
NISO Forum, Denver, Sept. 24, 2012: DataCite and Campus Data Services
NISO Forum, Denver, Sept. 24, 2012: DataCite and Campus Data ServicesNISO Forum, Denver, Sept. 24, 2012: DataCite and Campus Data Services
NISO Forum, Denver, Sept. 24, 2012: DataCite and Campus Data Services
National Information Standards Organization (NISO)
 
Working with Global Infrastructure at a National Level
Working with Global Infrastructure at a National LevelWorking with Global Infrastructure at a National Level
Working with Global Infrastructure at a National Level
National Institute of Informatics (NII)
 
SCAR Data Management and Policy
SCAR Data Management and PolicySCAR Data Management and Policy
SCAR Data Management and Policy
Anton Van de Putte
 
Ala cspace aspace rep services demo 2015
Ala cspace aspace rep services demo 2015Ala cspace aspace rep services demo 2015
Ala cspace aspace rep services demo 2015
LYRASIS
 

What's hot (7)

Ontologies For the Modern Age - McGuinness' Keynote at ISWC 2017
Ontologies For the Modern Age - McGuinness' Keynote at ISWC 2017Ontologies For the Modern Age - McGuinness' Keynote at ISWC 2017
Ontologies For the Modern Age - McGuinness' Keynote at ISWC 2017
 
Public data archiving: Who does? Who doesn't? What can we do about it?
Public data archiving: Who does?  Who doesn't?  What can we do about it?Public data archiving: Who does?  Who doesn't?  What can we do about it?
Public data archiving: Who does? Who doesn't? What can we do about it?
 
Elag workshop sessie 1 en 2 v10
Elag workshop sessie 1 en 2 v10Elag workshop sessie 1 en 2 v10
Elag workshop sessie 1 en 2 v10
 
NISO Forum, Denver, Sept. 24, 2012: DataCite and Campus Data Services
NISO Forum, Denver, Sept. 24, 2012: DataCite and Campus Data ServicesNISO Forum, Denver, Sept. 24, 2012: DataCite and Campus Data Services
NISO Forum, Denver, Sept. 24, 2012: DataCite and Campus Data Services
 
Working with Global Infrastructure at a National Level
Working with Global Infrastructure at a National LevelWorking with Global Infrastructure at a National Level
Working with Global Infrastructure at a National Level
 
SCAR Data Management and Policy
SCAR Data Management and PolicySCAR Data Management and Policy
SCAR Data Management and Policy
 
Ala cspace aspace rep services demo 2015
Ala cspace aspace rep services demo 2015Ala cspace aspace rep services demo 2015
Ala cspace aspace rep services demo 2015
 

Similar to Curation of scientifica data: Challenges for repositories

The future of the DCC
The future of the DCCThe future of the DCC
The future of the DCC
Chris Rusbridge
 
The Data Management Ecosystem
The Data Management EcosystemThe Data Management Ecosystem
The Data Management EcosystemJohn Kunze
 
Research data lifecycle diagram
Research data lifecycle diagramResearch data lifecycle diagram
Research data lifecycle diagramSteven Cracknell
 
RDAP13 John Kunze: The Data Management Ecosystem
RDAP13 John Kunze: The Data Management EcosystemRDAP13 John Kunze: The Data Management Ecosystem
RDAP13 John Kunze: The Data Management Ecosystem
ASIS&T
 
Curating data for integrated science
Curating data for integrated scienceCurating data for integrated science
Curating data for integrated science
Chris Rusbridge
 
Curating data for integrated science
Curating data for integrated scienceCurating data for integrated science
Curating data for integrated science
Chris Rusbridge
 
Moving the repository upstream
Moving the repository upstreamMoving the repository upstream
Moving the repository upstream
Chris Rusbridge
 
Data Literacy: Creating and Managing Reserach Data
Data Literacy: Creating and Managing Reserach DataData Literacy: Creating and Managing Reserach Data
Data Literacy: Creating and Managing Reserach Data
cunera
 
Scally The Library's Role in Research Data Management. OCLC partnership meeti...
Scally The Library's Role in Research Data Management. OCLC partnership meeti...Scally The Library's Role in Research Data Management. OCLC partnership meeti...
Scally The Library's Role in Research Data Management. OCLC partnership meeti...
John Scally
 
Guy avoiding-dat apocalypse
Guy avoiding-dat apocalypseGuy avoiding-dat apocalypse
Guy avoiding-dat apocalypse
ENUG
 
Data curator: who is s / he?
Findings of the IFLA Library Theory and Research...
Data curator: who is s / he?
Findings of the IFLA Library Theory and Research...Data curator: who is s / he?
Findings of the IFLA Library Theory and Research...
Data curator: who is s / he?
Findings of the IFLA Library Theory and Research...
Anna Maria Tammaro
 
Data presentation and transfer
Data presentation and transferData presentation and transfer
Data presentation and transferIyad Abou Rabii
 
Presentation to the UM Library Emergent Research Series
Presentation to the UM Library Emergent Research SeriesPresentation to the UM Library Emergent Research Series
Presentation to the UM Library Emergent Research Series
SEAD
 
Improving RDM through closer integration of electronic lab notebooks and data...
Improving RDM through closer integration of electronic lab notebooks and data...Improving RDM through closer integration of electronic lab notebooks and data...
Improving RDM through closer integration of electronic lab notebooks and data...
rmacneil88
 
e-Science, Research Data and Libaries
e-Science, Research Data and Libariese-Science, Research Data and Libaries
e-Science, Research Data and Libaries
Rob Grim
 
Curation of Research Data
Curation of Research DataCuration of Research Data
Curation of Research Data
Michael Day
 
2013 DataCite Summer Meeting - Purdue University Research Repository (PURR) (...
2013 DataCite Summer Meeting - Purdue University Research Repository (PURR) (...2013 DataCite Summer Meeting - Purdue University Research Repository (PURR) (...
2013 DataCite Summer Meeting - Purdue University Research Repository (PURR) (...
datacite
 
Presentation to EASE, Tallinn, June 2012
Presentation to EASE, Tallinn, June 2012Presentation to EASE, Tallinn, June 2012
Presentation to EASE, Tallinn, June 2012
Sarah Callaghan
 
PIDs, Data and Software: How Libraries Can Support Researchers in an Evolving...
PIDs, Data and Software: How Libraries Can Support Researchers in an Evolving...PIDs, Data and Software: How Libraries Can Support Researchers in an Evolving...
PIDs, Data and Software: How Libraries Can Support Researchers in an Evolving...
Sarah Anna Stewart
 
DataCite: the Perfect Complement to CrossRef
DataCite: the Perfect Complement to CrossRefDataCite: the Perfect Complement to CrossRef
DataCite: the Perfect Complement to CrossRefCrossref
 

Similar to Curation of scientifica data: Challenges for repositories (20)

The future of the DCC
The future of the DCCThe future of the DCC
The future of the DCC
 
The Data Management Ecosystem
The Data Management EcosystemThe Data Management Ecosystem
The Data Management Ecosystem
 
Research data lifecycle diagram
Research data lifecycle diagramResearch data lifecycle diagram
Research data lifecycle diagram
 
RDAP13 John Kunze: The Data Management Ecosystem
RDAP13 John Kunze: The Data Management EcosystemRDAP13 John Kunze: The Data Management Ecosystem
RDAP13 John Kunze: The Data Management Ecosystem
 
Curating data for integrated science
Curating data for integrated scienceCurating data for integrated science
Curating data for integrated science
 
Curating data for integrated science
Curating data for integrated scienceCurating data for integrated science
Curating data for integrated science
 
Moving the repository upstream
Moving the repository upstreamMoving the repository upstream
Moving the repository upstream
 
Data Literacy: Creating and Managing Reserach Data
Data Literacy: Creating and Managing Reserach DataData Literacy: Creating and Managing Reserach Data
Data Literacy: Creating and Managing Reserach Data
 
Scally The Library's Role in Research Data Management. OCLC partnership meeti...
Scally The Library's Role in Research Data Management. OCLC partnership meeti...Scally The Library's Role in Research Data Management. OCLC partnership meeti...
Scally The Library's Role in Research Data Management. OCLC partnership meeti...
 
Guy avoiding-dat apocalypse
Guy avoiding-dat apocalypseGuy avoiding-dat apocalypse
Guy avoiding-dat apocalypse
 
Data curator: who is s / he?
Findings of the IFLA Library Theory and Research...
Data curator: who is s / he?
Curation of scientific data: Challenges for repositories

  • 1. a centre of expertise in data curation and preservation Curation of Scientific Data: Challenges for Repositories Chris Rusbridge JISC Repositories Conference 5 June 2007, Manchester Funded by: This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 UK: Scotland License, excluding content property of others. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/2.5/scotland/ ; or, (b) send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.
  • 2. Contents • Audience? • Science and digital curation • Why are data important? • What kinds of data? • What to do with data? • Repository options • Changing practice
  • 3. Audience • I assume you are either… • A Repository Manager concerned about adding data to your collections of ePrints (most likely), or • A research data manager or other researcher, concerned about finding an appropriate repository to curate your data (possibly), or • Neither of the above, in the wrong room, just come in to get out of the sun…
  • 4. Digital Curation Centre Mission “The over-riding purpose of the DCC is to support and promote continuing improvement in the quality of data curation, and of associated digital preservation”
  • 6. “The Records of Science” • Data increasingly important as evidence • Key part of the scholarly record (public good) • Unrepeatable observations & experiments • Value for public money (eg OECD) • Experimental verifiability (the basis of science) • Would Chang retractions have been reduced if his first data were available? CHANG, G., ROTH, C. B., REYES, C. L., PORNILLOS, O., CHEN, Y.-J. & CHEN, A. P. (2006) Retraction of Pornillos et al., Science 310 (5756) 1950-1953. Retraction of Reyes and Chang, Science 308 (5724) 1028-1031. Retraction of Chang and Roth, Science 293 (5536) 1793-1800. Science Magazine, 314. http://www.sciencemag.org/cgi/content/full/314/5807/1875b • Allows additional interpretations • Legal and compliance (eg emerging RC mandates)
  • 7. OECD declaration • “…Work towards the establishment of access regimes for digital research data from public funding in accordance with the following objectives and principles: • Openness • Transparency • Legal conformity • Formal responsibility • Professionalism • Protection of intellectual property • Interoperability • Quality and security • Efficiency • Accountability”
  • 8. Retaining research data means… • Data secure against loss (within group) • Communal repository (secure data store) • Re-usable, sharable information • As above, plus active curation (eg bioinformatics) • Long term preservation of information • Be clear what you are trying to do!
  • 9. … or the data trajectory is… • Hard drive → lost (crash) • Hard drive → DVD → Cardboard box → Loft → Skip/dumpster → lost • Sometimes this is a very bad thing • Sometimes these are the right options! (© Marita Bushell)
  • 10. Long term bit storage… • A solved problem? Just requires well-understood good data management practices? • Wrong! For very large datasets over a very long time, there are significant problems… BAKER, M., SHAH, M., ROSENTHAL, D. S. H., ROUSSOPOULOS, M., MANIATIS, P., GIULI, T. J. & BUNGALE, P. (2006) A Fresh Look at the Reliability of Long-term Digital Storage. EuroSys '06. Leuven, Belgium, ACM.
  • 11. How Well Must We Preserve? • Keep a petabyte for a century – With 50% chance of remaining completely undamaged • Consider each bit decaying independently – Analogy with radioactive decay • That's a bit half-life of 10**18 years – One hundred million times the age of the universe • That's a very demanding requirement – Hard to measure – Even very unlikely faults will matter a lot (Slide from David Rosenthal, LOCKSS)
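The arithmetic behind the 10**18 figure on slide 11 can be reproduced with a short calculation. This is only a sketch under stated assumptions: a decimal petabyte (8 × 10**15 bits), each bit decaying independently and exponentially as in the slide's radioactive-decay analogy, and an age of the universe of roughly 1.38 × 10**10 years. The function name and constants are illustrative and do not come from the slides.

```python
import math


def required_bit_half_life(n_bits: float, years: float, p_intact: float = 0.5) -> float:
    """Half-life each bit needs so that ALL n_bits survive `years` with probability p_intact.

    Assumed model: a single bit survives t years with probability 2 ** (-t / half_life),
    and bits fail independently, so every bit survives with probability
    2 ** (-n_bits * t / half_life).
    """
    # Solve 2 ** (-n_bits * years / half_life) == p_intact for half_life.
    return n_bits * years / -math.log2(p_intact)


PETABYTE_BITS = 8e15              # assumption: 10**15 bytes of 8 bits each
AGE_OF_UNIVERSE_YEARS = 1.38e10   # assumption: approximate value

half_life = required_bit_half_life(PETABYTE_BITS, years=100)
ratio = half_life / AGE_OF_UNIVERSE_YEARS

print(f"required bit half-life: {half_life:.1e} years")
print(f"multiples of the age of the universe: {ratio:.1e}")
```

Under these assumptions the required half-life comes out at about 8 × 10**17 years, which rounds to the slide's 10**18, and the ratio to the age of the universe is of the order of 10**8, in line with "one hundred million times".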
  • 12. What to do about curation • Build curation/reusability into science workflow • Curation begins before creation • What’s easy at first becomes (impossibly) hard later • Describe data (metadata schemas, “representation info”, etc) • Keep experimental parameters (technical, who, what, when, where) • Keep ability to process • Keep data!
  • 13. What to do about curation - 2 • Use standard/agreed formats for data • Make ownership & restrictions clear, & explain how to cite data • Offer for deposit in institutional or discipline repository • Appraisal and selection essential • Possible time-limited embargos • “Publish” data in support of articles
  • 14. Internet Archaeology: publication with data
  • 15. Database as book… • Buneman (early pilot) work on IUPHAR database • MySQL to XML database • Historic to logical schema • XML via XSLT to LaTeX
  • 16. The StORe vision • Seamless transport from research data to research publications and vice versa • Bi-directional links proven in social science e-research but capable of export to other disciplines • Diagram labels: Source, Middleware, Output • http://jiscstore.jot.com/WikiHome/ (Slide from Graham Pryor)
  • 17. StORe survey: linkage value? The value of direct links from source to output data, by respondent group:
      Response               University      University          PG       Contract    Independent  Other  Totals
                             academic staff  research assistant  student  researcher  researcher
      Significant advantage  85              18                  33       11          2            26     175
      Useful                 78              9                   41       5           4            9      146
      Interesting            24              4                   5        3           0            5      41
      Of no interest         9               0                   0        0           0            1      10
      Not sure               7               0                   7        0           1            2      17
      Other                  1               1                   0        0           0            1      3
      Totals                 204             32                  86       19          7            44     392
    • But: “researchers’ attitudes to enabling access depend to a large extent on whether they are behaving as producers or users of data” (Slide from StORe project)
  • 18. What to do about data (3) • Institutional repository managers • Make contact with emerging institutional data services • Start raising awareness of the need to curate rather than just dump data • Start thinking about the relationship of data to publications (especially e-theses) • Start thinking about the metadata needed to find and re-use data • Make contact with key researchers • Start thinking about their data…
  • 19. What kinds of data? • Observations • eg UARS (Upper Atmosphere) Level 0: telemetry • UARS Level 1: measured physical parameters (post calibration?) • Derived data • UARS Level 2: calculated geophysical? profiles • UARS level 3: gridded, interpolated? • Combined data • Crafted data • Eg annotated gene/protein databases • Descriptive (meta)data
  • 20. StORe: Source data formats • CAD/GIS: 39 • Extensible mark-up language (XML): 35 • Database files (e.g. Access, MySQL): 117 • Flat files (e.g. FITS): 66 • Hypertext mark-up language (HTML): 60 • Image files (e.g. .jpg, .tif, .bmp, .gif): 228 • Plain text (.txt): 179 • Portable document format (.pdf): 156 • Rich text files (.rtf): 53 • Spreadsheets (e.g. Excel/.xls): 220 • Statistical software: 75 • Tables/catalogues: 102 • Word processed files (e.g. Word/.doc): 220 • Other (please specify): 76 (Slide from StORe project)
  • 21. StORe: the other data formats? They said the 76 other formats included: +latex+.cc source code, .cif (crystallographic data), .pdb, .mtz, .pool, .root, .raw, .swf, .fla, .raw, .mpg, binary files, chemdraw cdx, xwin nmr files, .ps files, .fla, .swf, masslynx files, derived data in PAw-format ntuples, raw mass spectrometry data, X-ray diffraction data, kaleidagraphs, Atlas/ti hermeneutic unit files, C++/shell scripts, Fourier induction decay files, etc., etc., etc., etc……….. (Slide from StORe project)
  • 22. StORe: the other data formats - more They also said such things as: “It is stored in a database, but nothing so simple as an Access file! It's one of the largest databases in the world! The format is Kanga/Root and previously was Objectivity. I think it's of the order of Picobytes in size.” And: “God preserve us from idiots who archive data in proprietary commercial formats (Excel spreadsheets and MS-word documents)!” (Slide from StORe project)
  • 23. What are the reusability issues? • Data not neutral; highly contextual! • Hard to know the risks & pitfalls of a particular dataset • Data not self-describing: hard to find appropriate data (but see Murray-Rust on Googling InChI etc) • Hard to “understand” data once found • Really need information, not data! • Hard to use data once understood
  • 24. Context • Data meaningless without context • Metadata of many kinds • Representation information… from data to information • Linkage and connection between datasets • Provenance • Authenticity/integrity • Computational lineage
  • 25. Access and re-use • Ethics and rights control access • Weak in expressing this long-term • Collaboration tools • Annotation, discussion, review (see DART…) • Re-use leading to change and development • “Publication” • Not just in “print” • Underlying data should be “published”, too
  • 26. Data citation issues… • Citation for human readers and machine use cases • Granularity: database, record, item • Citation of changing objects • Version change (eg W3C practice: no version = latest, vs bibliographic: no version = first) • An efficient way to reference and access “archived” past states of more rapidly changing dataset, eg Genomics… datasets that result from the combined work of curators, or contain opinions or facts likely to change (work in progress, Buneman et al) • Standards conflict and immature (NLM best?) • Citation ESSENTIAL for motivating quality academic work on data management and curation
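The version ambiguity on slide 26 (an unversioned citation meaning "latest" under W3C practice but "first" under bibliographic practice) can be illustrated with a small sketch. Everything here is hypothetical: the class, the function, the convention labels and the example identifier are invented for illustration and do not describe any real citation standard or resolver.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class DatasetCitation:
    identifier: str                 # hypothetical dataset identifier
    version: Optional[int] = None   # None: the citation names no version


def resolve_version(citation: DatasetCitation,
                    available: Sequence[int],
                    convention: str) -> int:
    """Return the archived version a citation points to.

    convention == "latest": an unversioned citation means the newest version.
    convention == "first":  an unversioned citation means the original version.
    """
    if not available:
        raise ValueError("no archived versions to resolve against")
    if citation.version is not None:
        # An explicit version is unambiguous under either convention.
        if citation.version not in available:
            raise KeyError(f"version {citation.version} is not archived")
        return citation.version
    if convention == "latest":
        return max(available)
    if convention == "first":
        return min(available)
    raise ValueError(f"unknown convention: {convention!r}")


archived = [1, 2, 3]
unversioned = DatasetCitation("example-dataset")
pinned = DatasetCitation("example-dataset", version=2)

print(resolve_version(unversioned, archived, "latest"))  # newest archived version
print(resolve_version(unversioned, archived, "first"))   # original archived version
print(resolve_version(pinned, archived, "latest"))       # the version named in the citation
```

The same unversioned citation resolves to different states of the dataset depending on the convention, which is why a citation to a changing dataset needs either an explicit version or an agreed default.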
  • 27. Repository challenges • Data are different: you’ll need access to some domain knowledge • Appraisal/selection harder • Broader range of formats • Appropriate “standards” for longevity? XML-based? • What metadata are needed? • Descriptive, to find the dataset • Context and background • Provenance • “Representation information” to connect data to information (whatever gives meaning to data for the “designated community”)
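The four kinds of metadata listed on slide 27 can be pictured as fields of a deposit record that a repository checks before accepting a dataset. This is a minimal sketch, not a metadata schema: the class, field names and example values are assumptions made for illustration only.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class DatasetMetadata:
    descriptive: str = ""          # what is needed to find the dataset
    context: str = ""              # context and background
    provenance: str = ""           # where the data came from and how they were processed
    representation_info: str = ""  # what gives the data meaning for the designated community


def missing_categories(record: DatasetMetadata) -> List[str]:
    """List the metadata categories that are still empty in a record."""
    return [f.name for f in fields(record) if not getattr(record, f.name).strip()]


# Hypothetical deposit with only two of the four categories filled in.
record = DatasetMetadata(
    descriptive="Example survey responses, 2007",
    provenance="Exported from the project database by the depositor",
)

print(missing_categories(record))
```

A check like this makes the gap visible at deposit time, when the missing context is still easy to supply.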
  • 28. Repository challenges - 2 • May distort your repository • Size • Number of objects • Rate of deposit • Nature of use • Databases may be dynamic • Databases may need to be accessed in situ • Rights and ethical limitations hard to describe and enforce • Need to build links to publications (cf StORe) • Need to build discipline links across repositories…
  • 29. Repository challenges - 3 • Is your platform suitable? • Most successful (ie older) data repositories are DIY • Data also held in repositories built on DSpace, EPrints and Fedora
  • 30. (Data from MIT DSpace Political Science)
  • 33. Who does data curation? • Individuals • Departments or groups • Institutions, often through libraries • Communities • Disciplines • Publishers • National services • Other 3rd parties…
  • 34. Who are the curation players?
  • 35. Disciplinary repositories… • >900 Nucleic Acids datasets! • ESDS/UKDA and NERC data centres, but… • “AHRC Council has decided to cease funding the Arts and Humanities Data Service (AHDS) from March 2008. […] Grant holders must make materials they had planned to deposit with the AHDS available in an accessible depository for at least three years after the end of their grant” • AHRC Press Release 14/05/2007 • (Note petition at http://petitions.pm.gov.uk/AHDSfunding/) • Does not apply to Archaeology: ADS still funded?
  • 36. Institutional Repositories • OpenDOAR: only 5 Institutional Repositories claim to include datasets • Bristol • Cambridge • Edinburgh • Leicester • Southampton • …and some of these seem doubtful on inspection! • … of course not all research data are “datasets”
  • 37. Cultural change • If we build it, will they come? NO!! • Outreach important: communication with scientists and researchers is hard graft • Cultural change to a new approach requires more: • Incentives, rewards and mandates • Successful exemplars (well publicised) • Discipline-oriented approach (one size does not fit all)
  • 38. Need for advocacy? What functionality is missing from source repositories? Columns: Academic staff | Research assistants | Post-graduates | Independent researchers • None: 9, 2, 7 • Don’t use: 7, 10, 1 • Lack of knowledge: 3, 4, 2 • Don’t know: 5, 3, 13, 1 • No reply: 129, 20, 45, 13 • Slide from StORe project
  • 39. Need for advocacy? What functionality is missing from output repositories? Columns: Academic staff | Research assistants | Post-graduates | Independent researchers • None: 3, 2, 5, 1 • Don’t use: 1, 1 • Lack of knowledge: 2, 1 • Don’t know: 2, 6, 1 • No reply: 123, 15, 48, 15 • Slide from StORe project
  • 40. Need for advocacy? “The majority of academics do not know what repositories are nor are they familiar with the issues around new means of dissemination” – UKOLN/Eduserv Foundation: Digital Repositories Roadmap: looking forward, April 2006 • Slide from StORe project
  • 41. Thank you c.rusbridge@ed.ac.uk