Comet project


Overview of COMET project for OCLC

Published in: Education, Technology
  • Respond to academic / national demand for Open Data (we had previously given some data to the Open Bibliography project). Get our data to non-librarians and provide tax-payer value for money. CUL already provides public APIs. Gain in-house experience of RDF. Move library services forward.
  • This is my colleague Katie's write-up of a talk led by Owen Stephens; it really sums it all up …
  • See if there were any express contractual clauses saying we could not redistribute.
  • Where does a record come from? Practically quite hard to determine … Several places in Marc21 where this data could be held … Logic for examination. Attempt at scripted analysis: list bib_ids by record vendor.
  • Most vendors are happy with a permissive license for ‘non-marc21’ formats. The non-Marc thing is not an issue in this context: no one outside of library land cares about a load of binary encoded numbers … we are re-purposing Marc-originated data for a wider audience. RLUK / BL BNB: PDDL. OCLC: ODC-By Attribution license. No good reason not to re-publish, but we need the right license!
  • RDF allows you to freely mix vocabularies, so you choose the fields that describe your data. There is an emerging consensus on bibliographic description: thankfully no-one is attempting to recreate Marc, and most use Qualified Dublin Core, FOAF and other relevant bibliographic-focused vocabularies. There may never emerge such a … Our conversion script is CSV customisable. BL and others are leading the way on vocab choice; they did some great data modelling, which we stayed clear of. It's my personal hope that we never see a heavyweight approach in the style of Marc again. As we move forward with new container formats, pragmatism needs to rule over completionism if we are to successfully share our valuable data with a wider user base.
  • PHP script to match text against LOC subject headings and enrich with the LOC GUID. FAST / VIAF enrichment courtesy of OCLC. FAST: next-generation subject headings, very exciting. VIAF: Virtual International Authority File. OCLC want to develop these as linked services and are keen to help.
  • Marc / AACR2 cannot translate well to semantically rich formats. Need better container / transfer standards (not necessarily RDF).
  • So despite the change, it's my worry that those in charge of Marc21 and RDA developments are not thinking widely enough about the new open ecosystem which our data must inhabit.
  • Two projects, focused less on data release and license and more about exploiting its value in an open environment
  • If we don’t try and shift … It becomes easier to go to Amazon, who have awesome APIs, or even Google Books (theirs are rubbish). Our status as authoritative data providers will be further eroded. No-one will want to play with us if we do not share.
  • Comet project

    1. 1. Ed Chamberlain, Cambridge University Library
    2. 2. Cambridge Open Metadata. Funded by the JISC Infrastructure for Resource Discovery Project
    3. 3. Cambridge, back in 2010 … OKFN Open Bibliography project (2010-2011); debate around re-use of catalogue records from vendors (not just OCLC); CUL already provides public APIs; increasing interest in linked data; FAST / VIAF; Lorcan
    4. 4. “The initial aim of this project will be to identify and release a substantial record set to an external platform under an open license” … “For OCLC-derived bibliographic records data will be released in a fashion compliant with their WorldCat Rights and Responsibilities for the OCLC Cooperative” … “The project aims to then deploy and test a number of technologies and methodologies for releasing open bibliographic data including XML, RDF, SPARQL, and JSON” …
    5. 5. Cambridge University Library: metadata conversion, development, project management. CARET: infrastructure support. OCLC: licensing consultancy, FAST / VIAF enrichment.
    6. 6. Value for money for taxpayers. Open data = affiliate marketing for our collections. Drive innovation: vital buy-in from non-library developer communities. One of many open data projects at the time.
    7. 7.  “Library catalogues have imposed on them librarian or supplier-made decisions about what can/can’t be searched and in what way. Some of these decisions are limited by current cataloguing rules, but not all; often the data is recorded, but not in a usable way, or is there but isn’t tapped by the interface. For example, in most catalogues you can limit by publication type to newspapers, but you can’t limit by frequency of the issues.” “Releasing data means that people can start to use it in the way they want to.”
    8. 8. Most of the catalogue (3 million +); bulk downloads of RDF triples; query-able ‘endpoints’; FAST / VIAF enriched; snapshot. RDF conversion tools; working model and code to decide on Marc21 record origin; codebase for ‘library centric’ RDF publishing website; SPARQL tutorial; verbose blog. Data and code at
    9. 9. Examine contracts with major vendors. Contact them and decide on re-use conditions. Deduce record origin from Marc21 fields.
    10. 10. Several places in Marc21 where this data could be held (015, 035, 038, 994 …). Logic and hierarchy for examination. Attempt at scripted analysis. Marc21 fails at ‘IPR’. Potential down the line for the problem to persist if attribution is not handled correctly in future formats.
    11. 11. Need the right license! Most vendors happy with permissive license for ‘non-marc21’ formats. RLUK / BL B.N.B.: Public Domain Data License. OCLC: ODC-By Attribution license with community norms.
    12. 12. RDF allows you to freely mix vocabularies. Emerging consensus on bibliographic description. BL and others leading the way. Victory for pragmatism?
    13. 13. Punctuation as a function; binary encoding; numbers for field names; bad characters; replication of data in fields
    14. 14. PHP script to match text against LOC subject headings and enrich with the LOC GUID. FAST / VIAF enrichment courtesy of OCLC.
    15. 15. No. of records: 3,658,384; no. of records with LCSH headings: 2,709,878; percentage with LCSH headings: 74%; no. of subject headings found: 5,889,048; no. of subject headings skipped: 45; valid FAST subjects: 8,134,230
    16. 16. Marc / AACR2 cannot translate easily to semantically rich formats. Libraries need to better utilise modern container / transfer standards (not necessarily RDF). No ‘one size fits all’ approach for the future.
    17. 17. Karen Coyle criticises the Marc21 Bibliographic Framework Transition Initiative for not including museums, publishing, and IT professionals … She argues that our data is not just for us to consume alone … “The next data carrier for libraries needs to be developed as a truly open effort. It should be led by a neutral organization (possibly ad hoc) that can bring together the wide range of interested parties and make sure that all voices are heard. Technical development should be done by computer professionals with expertise in metadata design. The resulting system should be rigorous yet flexible enough to allow growth and specialization.” Steering for RDA and Marc replacement needs non-librarian input or ownership.
    18. 18. Open Bibliography 2: lightweight approach to sharing bibliography now it's open … Bottom-up, community-led software called Bibserver; Wikimedia for bib data; JSON as a container format (flexible, able to cope with different structures, vocabularies etc.); engagement with UK PubMed Central. C.L.O.C.K. (Cambridge / Lincoln open cataloguing knowledgebase): new approaches to traditional library workflows (copy cataloguing) using open data; using rich open data to enrich bare-bones data; NoSQL database technology; APIs as key deliverables.
    19. 19. [Diagram of linked data entities. Labels: FAST subject headings; Language; Place of publication; LCSH subject headings; Special collections; Archives; Bibliographic; Creator / entity; Holdings; Libraries; Librarians; Course lists; Transactions]
    20. 20. Anonymous usage data from circulation systems; aggregated from several University Libraries; API feed; available openly (CC-BY)
    21. 21. It becomes (even) easier to go to Amazon. Our status as authoritative data providers will be (further) eroded. Assume we can. Assume we should (where we can).
    22. 22. Discovery; NGC4Lib mailing list; Open Knowledge Foundation
    23. 23.  Ed Chamberlain  @edchamberlain  
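
Slides 9 and 10 describe deducing record origin from several Marc21 fields using a "logic and hierarchy for examination", and the notes mention listing bib_ids by record vendor. The slides do not give the actual rules, so the following is only a minimal Python sketch of that idea. The precedence order, the matching strings, the vendor labels and the dictionary-based record shape are all assumptions made for illustration, not the project's real logic.

```python
# Hypothetical precedence list: rules are tried in order and the first match wins.
# Each rule is (Marc21 tag, function mapping a field value to a vendor label or None).
ORIGIN_RULES = [
    ("994", lambda value: "OCLC"),                                  # assumed: any 994 implies OCLC
    ("035", lambda value: "OCLC" if "(OCoLC)" in value else None),  # system control number prefix
    ("015", lambda value: "BNB" if "bnb" in value.lower() else None),
    ("038", lambda value: value.strip() or None),                   # assumed: use licensor code as-is
]


def guess_origin(record):
    """Return a vendor label for one record, or 'unknown'.

    `record` is assumed to be a dict mapping a Marc21 tag to a list of
    flattened field values (strings).
    """
    for tag, rule in ORIGIN_RULES:
        for value in record.get(tag, []):
            origin = rule(value)
            if origin:
                return origin
    return "unknown"


def bib_ids_by_vendor(records):
    """Group bib_ids by guessed vendor; `records` maps bib_id -> record dict."""
    grouped = {}
    for bib_id, record in records.items():
        grouped.setdefault(guess_origin(record), []).append(bib_id)
    return grouped
```

Keeping the rules in an ordered list makes the hierarchy explicit and easy to rearrange once contracts with each vendor have been examined.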
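
Slide 12 notes that RDF lets you freely mix vocabularies, and the speaker notes mention Qualified Dublin Core and FOAF. As a rough illustration only, the Python sketch below serialises one record as N-Triples using Dublin Core terms for the work and FOAF for its creator. The `example.org` URIs, the function names and the choice of properties are assumptions; this is not the project's conversion script, and the literal escaping covers only the common cases.

```python
DCT = "http://purl.org/dc/terms/"
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def uri(value):
    """Wrap an absolute URI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(text):
    """Quote a plain string literal, escaping backslashes, quotes and newlines."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def record_to_ntriples(record_uri, title, creator_uri, creator_name):
    """Describe a work with Dublin Core terms and its creator with FOAF."""
    triples = [
        (uri(record_uri), uri(DCT + "title"), literal(title)),
        (uri(record_uri), uri(DCT + "creator"), uri(creator_uri)),
        (uri(creator_uri), uri(RDF_TYPE), uri(FOAF + "Person")),
        (uri(creator_uri), uri(FOAF + "name"), literal(creator_name)),
    ]
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    print(record_to_ntriples(
        "http://example.org/record/1", "An example title",
        "http://example.org/person/1", "Jane Doe",
    ))
```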
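
Slides 14 and 15 describe matching catalogue text against LOC subject headings and report counts of records, matched headings and skipped headings. The project used a PHP script; the Python sketch below only shows the general shape of such a lookup with summary statistics. The normalisation rule, the authority table format and the identifiers used in any example call are assumptions.

```python
def normalise(heading):
    """Lowercase, collapse whitespace and drop a trailing full stop."""
    return " ".join(heading.lower().split()).rstrip(".")


def enrich(records, authority):
    """Match each record's subject headings against an authority table.

    `records` maps bib_id -> list of heading strings.
    `authority` maps an authorised heading label -> identifier (hypothetical shape).
    Returns (enriched, stats), where enriched maps bib_id -> list of
    (original heading, identifier) pairs.
    """
    index = {normalise(label): ident for label, ident in authority.items()}
    enriched = {}
    records_with_headings = found = skipped = 0
    for bib_id, headings in records.items():
        hits = []
        for heading in headings:
            ident = index.get(normalise(heading))
            if ident is None:
                skipped += 1
            else:
                hits.append((heading, ident))
                found += 1
        if hits:
            records_with_headings += 1
        enriched[bib_id] = hits
    total = len(records)
    stats = {
        "records": total,
        "records_with_headings": records_with_headings,
        "percent_with_headings": round(100 * records_with_headings / total, 1) if total else 0.0,
        "headings_found": found,
        "headings_skipped": skipped,
    }
    return enriched, stats
```

The percentage on slide 15 follows the same arithmetic: 2,709,878 of 3,658,384 records is roughly 74%.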