"Tomorrow, and tomorrow, and tomorrow": the players on the curation stage

Invited presentation to OCLC, 2006, discussing roles and responsibilities in data curation, and in particular the roles of libraries


Usage Rights

CC Attribution License

  • Although this is a speech of despair in the play, we can read into it metaphors for long-term digital curation and preservation!
  • Another key point in the play: remember the prophecy?
  • So how would Macbeth have got on with modern tools? BTW, this might be an argument for migration rather than emulation; a much better interface than a map scratched on calf skin…
  • Whoops! Perhaps emulation (calf skin) was better after all; at least it would not be troubled by spelling mistakes. Anyway, Google never claimed to be a battlefield management system! However, we move on…
  • Initially we have concentrated on data extracted from relational databases, mainly because this is where the IUPHAR data is. 1) Extract to XML (a friendly hierarchical format). 2) Next we want to merge with the archive containing the previous versions. 3) Process and merge. 4) New archive with the latest version added. Demo…

Presentation Transcript

  • a centre of expertise in data curation and preservation
    “Tomorrow, and tomorrow, and tomorrow”: the players on the curation stage
    Chris Rusbridge
    Presentation at OCLC
    Funded by:
    This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 UK: Scotland License, excluding content property of others. To view a copy of this license, (a) visit http://creativecommons.org/licenses/by-nc-sa/2.5/scotland/; or (b) send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.
  • "To-morrow, and to-morrow, and to-morrow,
    Creeps in this petty pace from day to day,
    To the last syllable of recorded time;
    And all our yesterdays have lighted fools
    The way to dusty death.
    Out, out, brief candle!
    Life's but a walking shadow; a poor player,
    That struts and frets his hour upon the stage,
    And then is heard no more: it is a tale
    Told by an idiot, full of sound and fury,
    Signifying nothing."
    Shakespeare: Macbeth
    OCLC October 2006
  • Dunsinane Hill (photo by Fabrice)
  • Contents
    • Curation and the Digital Curation Centre
    • Science and Data Citations
    • The “poor players” of data curation
    • Sustainability of curated data
    • Macbeth again…
  • Curation
    • Data increasingly important as evidence
    • Experimental verifiability (the basis of science)
    • Unrepeatable observations & experiments (particularly environmental in broadest sense)
    • Legal, compliance & transactions
    • Cultural resources
    • “Preservation” view vs “Publishing” view
  • Lynch remarks
    • Closing the Curation Conference
    • 3 views of digital curation
    • Finite process, handover to preservation
    • Whole life process, evolving object(s)
    • Collection as a living thing
  • Digital curation?
    [Diagram labels: For later use; Static; Digital preservation]
  • Digital curation?
    [Diagram labels: In use now (and the future); For later use; Dynamic; Static; Long-term; Digital curation; Digital preservation]
  • Digital curation
    [Diagram labels: In use now (and the future); For later use; Dynamic; Static; Long-term; Digital curation & preservation]
    “maintaining and adding value to a trusted body of digital information for current and future use”
  • Mission
    “The over-riding purpose of the DCC is to support and promote continuing improvement in the quality of data curation, and of associated digital preservation”
  • Aims
    • Establish vibrant research
    • Build strong community relations
    • Development activity leading to service
    • Achieve the “virtuous circle”
    • NOT a repository (funder mandate)!
  • Organisation to Engage & Collaborate
    [Diagram labels, inner: community support & outreach; service definition & delivery; management & admin support; research; development co-ordination]
    [Diagram labels, outer: communities of curation practice: users; organisations eg DPC; Associates Network; research collaborators; testbeds & tools; Industry; standards bodies]
  • Organisation to Engage & Collaborate: Leads
    [Diagram labels, inner: Bath; Glasgow; Edinburgh; Edinburgh; CCLRC]
    [Diagram labels, outer: communities of curation practice: users; organisations eg DPC; Associates Network; research collaborators; testbeds & tools; Industry; standards bodies]
  • Achievements: Research
    • Edinburgh Database Group
    • Annotation, archiving, citation, lineage, provenance, “publishing”
    • CCLRC: metadata curation
    • Glasgow: genre extraction
    • Bath: repository interactions
  • Achievements: Development
    • RLG/NARA Certification checklist
    • Representation Information Registry/Repository
    • Concept from Open Archival Information System standard (OAIS) on preserving information
  • Achievements: Services
    • Help Desk
    • Workshops and events
    • Curation manual
    • Audit and certification
    • From checklists to service?
    • Standards and tools
    • Representation information
    • From tool to service?
  • Achievements: Outreach
    • Developing web site content
    • Conferences
    • Associates Network and Forum
  • Achievements: Management
    • Developing international impact
    • Developing policy impact
  • Associated work
    • DCC LOCKSS Technical Support Service (Lots of Copies Keep Stuff Safe)
    • DCC SCARP Project
    • Disciplinary approaches to sharing, curation, re-use and preservation
    • EU projects associated
    • CASPAR
    • Digital Preservation Europe
    • PLANETS
  • Phase 2
    • Externally-moderated, reflective self-evaluation completed
    • Phase 2 proposal (2007/10) to JISC
    • Accepted: focus on science data, reduced scale
    • EPSRC-funded Research continues until 2007/8
  • 2nd International Digital Curation Conference
    • Research & invited presentations
    • Glasgow, 21/22 November, 2006
    • Please register at: http://www.dcc.ac.uk/events/dcc-2006/
  • Data resource stages
    • Curated data is created…
    • Observations? Fixed!
    • Or Acquired…
    • Data brought/bought from outside
    • Ingest
    • Development
    • Derived, refined, combined, processed data
    • Potentially many stages
  • [Images: TWOMASS (Infrared); SDSS (Visual)]
    Slide from Rajendra Bose
  • [Image]
    Slide from Rajendra Bose
  • New discovery…
    • National Virtual Observatory
    • Johns Hopkins press release: “Scientists working to create the NVO, an online portal for astronomical research unifying dozens of large astronomical databases, confirmed discovery of [a] new brown dwarf recently. The star emerged from a computerized search of information on millions of astronomical objects in two separate astronomical databases. Thanks to an NVO prototype, that search, formerly an endeavor requiring weeks or months of human attention, took approximately two minutes.”
  • Context
    • Data meaningless without context
    • Linkage
    • Metadata of many kinds
    • Workflow!
    • Provenance
    • Computational lineage
    • Authenticity
  • [Diagram: data lineage from a NASA Csat 8-day composite and subscene, through a PAR subscene and derived calculations (SST and Pbopt calc, Ctot calc, Zeu calc, PPeu calc), across University research groups 1, 2 and 3 to a local decision-making body]
    Slide from Rajendra Bose
  • Access and re-use
    • Ethics and rights control access
    • Weak in expressing this long-term
    • Collaboration tools
    • Annotation, discussion, review
    • Re-use leading to change and development
    • “Publication”
    • Not just in “print”
    • Underlying data should be “published”, too
    • Citation…
  • CLADDIER citation investigation
    “My last example was an MST data set held at the BADC, and I was suggesting something like this (for a citation):
    <Citation>
      <Author> Natural Environment Research Council </Author>
      <Title> Mesosphere-Stratosphere-Troposphere Radar at Aberystwyth </Title>
      <Medium> Internet </Medium>
      <Publisher> British Atmospheric Data Centre (BADC) </Publisher>
      <PublicationDate status="ongoing"> 1990</PublicationDate>
      <Identifier> badc.nerc.ac.uk/data/mst/v3/upd15032006</Identifier>
      <Feature>
        <FeatureType>http://featuretype.registry/verticalProfile</FeatureType>
        <LocalID>200409031205</LocalID>
      </Feature>
      <AccessDate> Sep 21 2006 </AccessDate>
      <AvailableAt><url>http://badc.nerc.ac.uk/data/mst/v3/</url></AvailableAt>
    </Citation>
    (Made up tags!)”
    Bryan Lawrence Weblog
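The citation structure quoted on the slide above can be assembled programmatically. The following is a minimal sketch in Python using only the standard library; the function name `build_citation` and its parameters are hypothetical, and the tag names are the "made up tags" from the quoted weblog entry, not a published schema.

```python
import xml.etree.ElementTree as ET


def build_citation(author, title, publisher, start_year, identifier,
                   feature_type, local_id, access_date, url):
    """Assemble a data citation using the made-up tags quoted on the slide.

    This is an illustration only: neither the tags nor this helper
    belong to any real citation standard.
    """
    citation = ET.Element("Citation")
    ET.SubElement(citation, "Author").text = author
    ET.SubElement(citation, "Title").text = title
    ET.SubElement(citation, "Medium").text = "Internet"
    ET.SubElement(citation, "Publisher").text = publisher
    # status="ongoing" marks a dataset that is still being added to
    pub_date = ET.SubElement(citation, "PublicationDate", status="ongoing")
    pub_date.text = str(start_year)
    ET.SubElement(citation, "Identifier").text = identifier
    # The Feature element pins down which part of the dataset is cited
    feature = ET.SubElement(citation, "Feature")
    ET.SubElement(feature, "FeatureType").text = feature_type
    ET.SubElement(feature, "LocalID").text = local_id
    ET.SubElement(citation, "AccessDate").text = access_date
    available_at = ET.SubElement(citation, "AvailableAt")
    ET.SubElement(available_at, "url").text = url
    return citation


citation = build_citation(
    author="Natural Environment Research Council",
    title="Mesosphere-Stratosphere-Troposphere Radar at Aberystwyth",
    publisher="British Atmospheric Data Centre (BADC)",
    start_year=1990,
    identifier="badc.nerc.ac.uk/data/mst/v3/upd15032006",
    feature_type="http://featuretype.registry/verticalProfile",
    local_id="200409031205",
    access_date="Sep 21 2006",
    url="http://badc.nerc.ac.uk/data/mst/v3/",
)
xml_text = ET.tostring(citation, encoding="unicode")
print(xml_text)
```

Keeping the identifier, the feature and the access date as separate elements is what lets a reader tell which dataset, which part of it, and which moment in its life the citation refers to.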
  • CLADDIER 2: “Version of record”
    • Role of Publisher: add value
    • provision of catalogue metadata
    • some commitment to maintenance of the resource at the AvailableAt url
    • some commitment to the resource being conformant to the description of the Feature
    • some commitment to the maintenance of the mapping between the identifier [LocalID] and the resource.
    Bryan Lawrence Weblog
  • CLADDIER 3: persistence
    • Wayback Machine
    • Only snapshots (eg only 2004 version of Bryan’s home page!)
    • WebCite
    • allows the creator of content to submit URLs for [archiving], thus ensuring when one writes an academic document, the material will be archived, and the citation will be persistent
    • But no real help for data…
    • “… only allow [data citation] when we believe in the persistence of the organisation making the data available…”
    Bryan Lawrence Weblog
  • Citation
    • Needs a stable resource to cite…
      OWL Web Ontology Language Reference
      W3C Proposed Recommendation 15 December 2003
      This version: http://www.w3.org/TR/2003/PR-owl-ref-20031215/
      Latest version: http://www.w3.org/TR/owl-ref/
      Previous version: http://www.w3.org/TR/2003/CR-owl-ref-2003081
    • (FRBR works & expressions?)
  • Citation…
    • The date alone (as in common web citation approaches) is not enough!
      [6] The CIA World Factbook. www.cia.gov/cia/publications/factbook/. Retrieved on 8 Jan 2006.
    • Cited object likely to have changed…
    • Citation should link to the cited object as it was!
  • Citation needs…
    • An efficient way to reference and access “archived” past states of a changing dataset (work in progress, Buneman et al)
    • Not important for original observations
    • Don’t mess with those data
    • Less important for incremental datasets
    • Later stuff should not invalidate earlier
    • Very important for revisable datasets
    • Eg Genomics… datasets that result from the combined work of curators, or contain opinions or facts likely to change
    • Eg Mapping… OS maps represent a huge database that changes on a daily basis
  • XMLArch: System Architecture
    [Diagram: Relational Database → Data Extractor → XML Snapshot at time t → XML Archiver (Pre-processor, Version Merger), which also reads the XML Archive at time t - 1 → XML Archive at time t]
    Carwyn Edwards
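The merge step in the architecture above, and the need to cite a dataset "as it was", can be illustrated with a small sketch. This is not the XMLArch implementation: it swaps the XML documents for plain Python dictionaries and assumes every record has a stable key. The class name `Archive`, its methods, and the sample records are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Archive:
    """Toy versioned archive: one merged store holding every past state.

    entries maps a record key to a history list of [value, versions]
    pairs, where versions is the set of snapshots containing that value.
    """
    entries: dict = field(default_factory=dict)
    versions: list = field(default_factory=list)

    def merge(self, version, snapshot):
        """Merge a snapshot (key -> value) into the archive as `version`."""
        if version in self.versions:
            raise ValueError(f"version {version!r} is already archived")
        self.versions.append(version)
        for key, value in snapshot.items():
            history = self.entries.setdefault(key, [])
            if history and history[-1][0] == value:
                # Unchanged since the latest stored state: extend its
                # version set instead of storing the value again
                history[-1][1].add(version)
            else:
                # New record or changed value: start a new history entry
                history.append([value, {version}])

    def snapshot(self, version):
        """Reconstruct the dataset exactly as it was at `version`."""
        if version not in self.versions:
            raise KeyError(f"unknown version {version!r}")
        state = {}
        for key, history in self.entries.items():
            for value, seen_in in history:
                if version in seen_in:
                    state[key] = value
        return state


# Hypothetical records, for illustration only
archive = Archive()
archive.merge("2006-01", {"rec1": "value A", "rec2": "value B"})
archive.merge("2006-06", {"rec1": "value A", "rec2": "value B revised",
                          "rec3": "value C"})

print(archive.snapshot("2006-01"))
print(archive.snapshot("2006-06"))
```

A citation that records both the record key and the version label can then be resolved with `archive.snapshot(version)[key]`, which returns the cited value even after the live dataset has been revised. Unchanged records are stored once however many versions contain them, which is the space saving that merging offers over keeping every snapshot whole.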
  • Consider the Record
    • Business records
    • Evidence of business decisions
    • Formal: minutes, letters etc
    • Informal: emails, notes
    • Databases
    • Records of Science
    • Lab notebooks
    • Proposals, reports, papers
    • Data
    • Informal parts generally poorly managed
  • Preservation & curation
    • Use preserves
    • Money preserves
    • Redundancy good, monoculture bad?
    • LOCKSS-type & other approaches…
    • Bits are fragile and robust
    • Don’t rely on portable media
    • Look after them well
    • Technology changes…
    • How fast? What impact?
    • Metadata matters! (Know what you’ve got)
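The point that redundancy protects fragile bits can be sketched as a fixity check across several copies. The code below is an assumption-laden illustration in Python, not the LOCKSS protocol: it simply hashes each replica with SHA-256 and treats the majority digest as the intact one. The function names and the sample holders are hypothetical.

```python
import hashlib
from collections import Counter


def digest(data):
    """Return the SHA-256 hex digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def audit_replicas(replicas):
    """Compare replica digests and report which copies disagree.

    replicas maps a holder name to the bytes that holder stores.
    Returns (agreed_digest, damaged_holders). Raises ValueError when
    there are no replicas or no strict majority agrees.
    """
    if not replicas:
        raise ValueError("no replicas to audit")
    digests = {holder: digest(data) for holder, data in replicas.items()}
    agreed, votes = Counter(digests.values()).most_common(1)[0]
    if votes <= len(replicas) / 2:
        raise ValueError("no majority: cannot tell which copy is intact")
    damaged = sorted(h for h, d in digests.items() if d != agreed)
    return agreed, damaged


# Hypothetical holders; the third copy has suffered a flipped character
copies = {
    "library_a": b"observation,value\n1,42\n",
    "library_b": b"observation,value\n1,42\n",
    "library_c": b"observation,value\n1,43\n",
}
agreed, damaged = audit_replicas(copies)
print(damaged)
```

A single copy can only tell you that its checksum changed; several independently held copies let the intact majority identify and repair the damaged one, which is the argument for redundancy over monoculture.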
  • Who are the curation players?
  • Curation: Individual
    • “Small science” 2-3 times more data than “Big science”, but much more at risk
    • PhD student? RA? PI? Administrator? IT support?
    • Data potentially on local hard drives, or at best shared network drives
    • May be inadequately protected
    • Liable for policy-led deletion on resignation
    • Individual “knows” too much
    • Documentation/metadata unlikely to be adequate
    • Tomorrow: gone!
  • Department: eCrystals
    • Specialist department archive (& national service)
    • Workflow recording of lab parameters (R4L)
    • Public & private elements
    • Trying to build eCrystals federation (eBank 3)
    • But… ReciprocalNet? French COD efforts? Fragmented discipline!
    • Tomorrow: likely to continue
  • Institution: Cambridge Chemistry
    • 175,000 small molecule structures in CML
    • Alongside Archaeology, Manuscripts, Learning Materials, etc
    • No library curation skills; dependent on research group enthusiast
    • Collection isolated from other Chemistry
    • Tomorrow: assured…
  • Community: CDL
    • Shared effort from group of institutions
    • Comparison OhioLink?
    • Document tradition, not data
    • Passive role re collections
    • Rely on departmental & domain expertise
    • Tomorrow: assured…
  • Community: SDSC?
    • Data specialists
    • Multiple disciplines
    • Distinct from domains; curation dependent on external expertise
    • Research ethos
    • Tomorrow: dependent on grant/contract income & research priorities
  • Community: LOCKSS?
    • Self-selected group of collectors: closest to genuine open activity (despite Alliance)?
    • Traditionally libraries collecting eJournals
    • Model respects IPR
    • No domain expertise; rely on origins
    • Data limitations…
    • Tomorrow: potentially very persistent (low cost, high reliability, attack resistance, distributed)
  • Discipline: Archaeology
    • Staffed by archaeologist curators
    • Understand special legal issues
    • Strong relationship with community & peers
    • Internationally still fragmented?
    • Tomorrow: dependent on research council grants + deposit funding
  • Discipline: Astronomy
    • Part of major international effort
    • Expensive shared facilities, global reach
    • Well integrated into community
    • Enable new science
    • Tomorrow: assured by community (another large facility)
  • Discipline: Atmosphere
    • Strong believer in need for domain scientists as curators
    • Significant participant in “community proxy” agenda-setting activities
    • Internationally fragmented resources
    • Tomorrow: mostly dependent on grant funding (but strong commitment)
  • Discipline: Pharmacology
    • International Scientific Union
    • Attempting to build credit for data contributions
    • DB ownership rotates
    • Tomorrow: extremely limited funding
  • Discipline: Pharmacology
  • Discipline: Social Sciences
    • Mature!
    • Staffed by Social Science curators
    • Alert to opportunities
    • Able to appraise material offered
    • Strong relationship to discipline
    • Tomorrow: assured through broad mix of funding streams
  • Publisher: Crystallography
    • Publisher and Scientific Union
    • Created key domain crystallographic standard (CIF)
    • Strong motivator for deposit of structure data
    • Consistent quality checks
    • DOIs used for structure data
    • Tomorrow: publishing business model
    Slide from IUCr
  • National bodies: British Library
    • Serious and robust approach
    • Legal deposit powers & responsibilities as driver
    • Oriented primarily towards “cultural heritage” (broadly interpreted)
    • Little data, no science domain experience
    • Tomorrow: strong future commitment
  • National bodies: TNA/NDAD
    • Specialist archive for government datasets
    • Understand government regulations, dynamics & requirements
    • Subject generalists; disconnected from associated science
    • Technology specialists (understand databases)
    • Tomorrow: likely to pass eventually to The National Archives
  • National bodies: NOAA (etc)
    • Government body making serious data available
    • Domain scientists curate data
    • Operates in current political context (!)
    • Tomorrow: reasonably assured but some unfunded mandates?
  • 3rd parties: OCLC?
    • Should this be community?
    • Demand driven
    • No domain science expertise: rely on origins
    • Tomorrow: business case
  • 3rd parties: Portico
    • Specific area: eJournals
    • Depends on publisher agreements
    • No data or domain science expertise
    • Tomorrow: commitment from Mellon + publishers + subscriptions, good funding mix
  • 3rd parties: Iron Mountain
    • Records management IS a curation problem
    • Organisations like this very likely to branch out
    • No domain science expertise
    • Tomorrow: business case, viability, stock market…
  • Institutions & the network
    • Institutions have some fundamental sustainability
    • Disciplines live in the network; sustainability is an issue
    • Can we get the best of both?
  • Intersections…
    [Table: columns Institution 1, Institution 2, Institution 3, etc; rows Discipline 1, Discipline 2, Discipline 3, etc; each discipline row carries two X marks showing the institutions it intersects]
  • Who are the curation players again?
  • Project StORe findings
    • Discipline commonality from survey (Miller, UKDA, 2006):
    • 2-way links between data & publication useful
    • Barriers to actual deposit of data/outputs
    • Sharing data important, likely between colleagues
    • Perceived inconsistency across repositories
    • Most common searching: Google type
    • Researchers favour self-reliance rather than library support
    • Recognise need for common minimum metadata
    • Aim for pilot linking middleware demonstrator
    • “Creating small scale ‘silos’ of information with institutional repositories is not … a compelling information management strategy in the ‘Google age’” (Heery & Anderson for JISC, 2005)
  • Sustainability: tomorrow is the emerging worry
    • Sustainability work package in DCC (new grant!)
    • JISC/NDIIPP meeting addressed it
    • AHRC report draft soon
    • Research Information Network report draft
    • JISC study on sustainable IT systems for HE
    • Recent ARL/NSF workshop, NSF strategy
  • Sustainability of what?
    • Repository as an organisation
    • Repository as a service
    • Repository as a system
    • Repositories as a network (federation?)
    • Collections and objects supported by repositories
    • Commit to collection: contract the manager!
  • Sustainability of what?
    • Culture of deposit & re-use!
    • One of the most important social dimensions, but out of scope here…
    • Curation service
    • Separate service from collection
    • Funding always finite: 5 + 5 then re-compete?
    • Relay approach: hand on in good order
    • Succession! Start with the plan for your own end…
    • Data
    • Digital object access when required (for long future time)
  • Sustainability for what?
    • Variety of curation approaches
    • Developing resource
    • Preserving resource
    • Significant properties have a big impact
    • Produce bit stream as ingested?
    • All the work for the consumer
    • Produce full look and feel as ingested? Expensive!
    • May also be unfamiliar for future consumer
    • Somewhere between?
    • Depends on goals…
  • Social factors
    • Commitment essential… much more than anything else (cf persistent identifiers)
    • Funder requirements express social determination
    • Policy & grant application forms, selection criteria
    • Monitoring essential
    • Legal, ethical, IPR impacts all significant
    • Public good questions
    • Academic credit (citations?)
    • Free-loaders (embargos?)
    • Disciplines are different!
    • Workforce skills: researcher, data librarian/scientist
  • Sustainability a function of...
    • Commitment
    • Goals
    • Value and cost
    • Business model
    • Time
    • Environment
    • Domain knowledge and information
    • Dimensions (how much stuff)
    • Technical approaches
    • Usage
  • Financial sustainability 2: projects
    • Traditional research project approach:
    • Produces unsustainable resources
    • PIs focus on next project proposal
    • RAs focus on next job application
    • Result: no metadata, orphan data
  • Financial sustainability 3: investment
    • How you justify a long-term spend: persuasive? No! Return on investment = value - cost
    • Intangible asset: hard to value; situated, multi-scaled
    • Aggregate rather than individual
    • Academic value is key
    • Citations support this: needs work
    • Reputation is the target currency
    • But dollars pay the bills!
  • Value
    • “… the by-products of our research may be more significant than our soon dated theoretical insights." (Seeger 2004, quoted by Kevin Bradley, APSR)
    • “I think I would be safe in saying that worldwide hundreds of millions of dollars’ worth of crystallographic data is lost each year. For spectra and synthetic chemistry it will be at least 10 times greater. Many synthetic chemists say they are interested in failed reactions - and these are almost never published!” (Murray-Rust blog)
    • Value of curation service can grow from re-use promotion & community proxy activities (eg BADC & CF conventions, ICPSR & DDI)
    • (But the value of data is easily negated, at creation and after)
  • Financial sustainability: the 8 pillars of wisdom?
    • Someone has to pay…
    • Consumer pays: subscription or usage?
    • Depositor pays (ie grant or institution)?
    • Institution pays (IR, cf library/archive/museum)
    • Community (discipline repository?) pays
    • Government, or science funder
    • Learned society?
    • Volunteers (cf open source, social computing, LOCKSS)?
    • Side effect (advertiser) pays (unlikely for much data?)
    • Endowment or donor pays…
    • Diversity?
  • Role of libraries
    • 2-4% of university budgets (“There’s plenty of money… there’s just not plenty of money for everything!” Courant)?
    • Traditional role in sustaining the raw material of scholarship
    • Looking for new roles in the digital world?
    • Many unsaid assumptions from publishing paradigm?
    • Domain knowledge: wide but not deep
    • Involvement in data creation low
  • So, tomorrow…
    • Digital data repositories already sustained > 30 years
    • How?
    • Vision, leadership, commitment
    • Libraries, archives, museums sustained 100s of years
    • How?
    • Aggregate value proposition
    • Perception now under threat!
    • Collectively we need to identify the next steps toward digital data sustainability, for tomorrow, and tomorrow, and tomorrow!
  • Macbeth again…
    "To-morrow, and to-morrow, and to-morrow,
    Creeps in this petty pace from day to day,
    To the last syllable of recorded time;
    …it is a tale
    Told by an idiot, full of sound and fury,
    Signifying nothing."
  • Mission (impossible?)
    • To that last syllable of recorded time
    • Keep our tales forever full of significance!
    Thank you