Linked data and voyager

  • Thank you. Hello, and as an introduction: I’m very much a Systems Librarian, management of LMS is my bread and butter, but this was a great chance to do something very different …
  • Just a Systems Librarian – this time last year my understanding of linked data was limited to Talis demonstrations … Been to a lot of conferences and seen a lot of guys in black polo necks telling me that this is the future … Apologies if you see the semantic web as up there with quantum mechanics … Will contain some techy stuff. Not actually that much on Voyager, although we will talk about what a next (current?) generation LMS could do in terms of RDF publishing
  • Semantic = meaning explained (to machines) – so we would see a 245 and know to display it as a title. A computer would need to be programmed specifically to work that out; with Marc21 it has no real way to discover it itself. That is the purpose of semantic or ‘self-describing’ data. Hyperlinked = meaning contextualised elsewhere – use a common set of descriptions. For machines as much as people – that is the grand theory: web pages for people, semantic data for machines.
  • So after about 10 years of various approaches, An approach referred to as Linked Data looks like an emerging framework for the semantic web with some heavyweights behind it… Use URIs as names for things Use HTTP URIs so that people can look up those names When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL) Include links to other URIs, so that they can discover more things
  • Conversely, here is a description of RDF – Resource Description Framework. It’s a set of metadata encapsulation standards. It does not need to be linked, but often is. (Go back) Note that RDF is not mentioned except in brackets. Other data can be linked, as long as it follows these conventions. We can have unlinked RDF data and linked non-RDF data … So, RDF is not the be-all and end-all of linked data, but it’s the most commonly used mechanism right now, so I’ll probably use both terms interchangeably, which might annoy some types. But we did linked RDF (or attempted to …) and for the purposes of this presentation, the two terms are somewhat interchangeable …
  • So let’s take a look at some. The N-Triples notation format is the simplest means of expressing RDF triples. Each triple is described on one line. RDF in its purest form is data described as triples, each a statement in three parts: A subject (the person or thing being described) A predicate (affirming what feature of the subject is being described) An object (the descriptive term itself)
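The one-line, three-part shape of N-Triples can be sketched with a minimal parser. This is an illustration only, not one of the project scripts; the regex handles just the two cases on these slides (URI object or plain quoted literal), not the full N-Triples grammar.

```python
import re

# Minimal N-Triples line parser: subject and predicate are <URI> references,
# the object is either another <URI> or a quoted literal, and the line ends
# with a full stop.
TRIPLE_RE = re.compile(
    r'^<(?P<subject>[^>]+)>\s+'
    r'<(?P<predicate>[^>]+)>\s+'
    r'(?:<(?P<obj_uri>[^>]+)>|"(?P<obj_literal>[^"]*)")\s*\.\s*$'
)

def parse_triple(line):
    """Split one N-Triples line into its (subject, predicate, object) parts."""
    m = TRIPLE_RE.match(line)
    if not m:
        raise ValueError("not a simple N-Triples line: " + line)
    obj = m.group("obj_uri") or m.group("obj_literal")
    return m.group("subject"), m.group("predicate"), obj

s, p, o = parse_triple(
    '<http://data.lib.cam.ac.uk/id/entry/cambrdgedb_1000346> '
    '<http://purl.org/dc/terms/issued> "1981" .'
)
print(s, p, o)
```

Each parsed line yields exactly one statement, which is what makes the format so easy to generate and stream in bulk.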
  • Combine several triples and we can see what looks like a bib record But the data extends beyond the record as a composite entity through links to other sets of triples
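Grouping flat triples by subject is how something record-shaped re-emerges. A small Python sketch, using shortened illustrative URIs modelled on the slide examples rather than real project data:

```python
from collections import defaultdict

# Toy triples modelled on the slides: a bib entry, plus a separate
# set of triples describing its creator.
triples = [
    ("entry/1000346", "dc:title", "Early medieval history of Kashmir"),
    ("entry/1000346", "dc:issued", "1981"),
    ("entry/1000346", "dc:creator", "entity/a5a6"),
    ("entity/a5a6", "foaf:name", "Mohan, Krishna"),
]

def group_by_subject(triples):
    """Fold flat triples into record-like dicts keyed by subject."""
    records = defaultdict(dict)
    for s, p, o in triples:
        records[s].setdefault(p, []).append(o)
    return dict(records)

records = group_by_subject(triples)

# Following the dc:creator link out of the bib 'record' reaches the
# entity's own triples - the data extends beyond the record.
creator = records["entry/1000346"]["dc:creator"][0]
print(records[creator]["foaf:name"][0])
```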
  • Big growth in cultural heritage
  • No talk on linked data is complete without this image …
  • Respond to academic / national demand for Open Data – University of Southampton. Bibliography is a key stepping stone into teaching, learning and research. Reading lists, personal bibliographies, group research. Getting this data openly available in standardised formats is a real-world win. Tax-payer value-for-money – they pay for this data to be created, let’s give it back in some other form than an OPAC. CUL already provides public APIs, this seems like natural progression. Gain in-house experience of RDF – see how hard it can be. A lot of RDF projects tend to outsource to people like Talis, I really wanted to chart the in-house learning process. Move library services forward – we need to be in the world of linked data however it ends up – we also need greater flexibility in record re-use as we go forward with new formats. Have been well argued …
  • RDF works best with a permissive license – it’s now generally accepted that permissive licenses and bib data are a good thing. Paraphrasing Paul Ayris: "open bibliographic data offers chance for anyone to re-use the data to build innovative services". CC0 or Public Domain Data License. Non-commercial licenses not suitable – look at University and research funding now – what is non-commercial exactly? No NC license defines commercial activity; it actually creates more doubt, puts people off. Could building a free website based on our data and running ads alongside it be seen as commercial? Probably. Would it deprive us of revenue or users? Probably not. Consensus is not to go there with data … Permissive approach creates potential conflict with record vendors – not outright conflict. RLUK and OCLC are valid partners; the UL’s cataloguing team of 100+ could not work if either organisation folded, it’s in our interests to keep them alive. For context, RLUK were behind the JISC Discovery programme, and OCLC acted as an invaluable partner with us on the project
  • See if there were any express contractual clauses saying we could not redistribute
  • Where does a record come from? – practically quite hard to determine … Several places in Marc21 where this data could be held … Logic for examination. Attempt at scripted analysis – list bib_ids by record vendor
  • All on the project blog along with some comprehensive explanation of methodology …
  • Most vendors happy with permissive license for ‘non-marc21’ formats - Non marc thing is not an issue in this context, no one outside of library land cares about a load of binary encoded numbers … we are re-purposing Marc originated data for a wider audience RLUK / BL BNB – PDDL OCLC – ODC-By Attribution license No good reason not to re-publish – need the right license!
  • What did we learn? Marc actually made it really difficult, hence the diagram. Better container formats could have sewn this up. With a national / international mandate to open up data, we need a better container format than Marc to go forward. No good reason not to re-publish – need the right license!
  • Several attempts at conversion – settled on SQL extracts based on lists of bib_ids. Use Perl scripting to ‘munge’ the data – quite dirty, nasty coding around Marc files. You can try this at home! Scripts available for Voyager SQL extraction and standalone batch file conversion.
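The COMET conversion scripts were Perl; the shape of the ‘munging’ step can be sketched in Python. The field-to-predicate mapping and the sample record below are hypothetical stand-ins for the project’s CSV-driven configuration, not the real mapping:

```python
# Hypothetical Marc21 field -> RDF predicate mapping, standing in for
# the CSV-customisable mapping the real conversion script used.
FIELD_MAP = {
    "245a": "http://purl.org/dc/terms/title",
    "260c": "http://purl.org/dc/terms/issued",
}

BASE = "http://data.lib.cam.ac.uk/id/entry/cambrdgedb_"

def record_to_ntriples(bib_id, fields):
    """Emit one N-Triples line per mapped Marc field; skip the rest."""
    subject = "<%s%s>" % (BASE, bib_id)
    lines = []
    for tag, value in fields.items():
        predicate = FIELD_MAP.get(tag)
        if predicate is None:
            continue  # unmapped fields are simply dropped
        literal = value.replace('"', '\\"')
        lines.append('%s <%s> "%s" .' % (subject, predicate, literal))
    return lines

out = record_to_ntriples(
    "1000346",
    {"245a": "Early medieval history of Kashmir",
     "260c": "1981",
     "999z": "local housekeeping data"},
)
for line in out:
    print(line)
```

Driving the mapping from a table rather than hard-coding it is what let the project swap vocabularies without rewriting the script.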
  • Marc21 – data rich, semantically poor. Designed to print out cards for display; never really got past that. Hard to generate granular items of data for linking (i.e. triples) – a lot is lost. Data is binary encoded, hard to transfer via modern web services – needs specialised code libraries to crack; XML and JSON are the way forward here. Numbers as field names – why do we need this in 2011? It’s a dark art that bears no relation to the rest of the real world – makes it very hard for external developers to come in and do this kind of work – needs specialised knowledge and is developer-unfriendly. Bad characters – bane of any software developer – XML encoding and validation would deal with this problem. Replication – as we’ve seen above, four fields serving effectively the same purpose. Over one hundred notes fields? Come on …
  • RDF allows you to freely mix vocabularies – choices of fields to describe your data. Emerging consensus on bibliographic description – thankfully no one is attempting to recreate Marc; mainly a use of Qualified Dublin Core and FOAF. Our conversion script is CSV customisable. BL and others leading the way on vocab choice – they did some great data modelling, which we steered clear of
  • PHP script to match text against LOC subject headings – enrich with LOC GUID FAST / VIAF enrichment courtesy of OCLC FAST – next generation subject headings – very exciting VIAF – Virtual International Authority File OCLC want to develop these as linked services, keen to help.
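The enrichment step was a PHP script; a minimal Python sketch of the same idea follows, with a tiny hard-coded lookup standing in for the full Library of Congress subject headings dataset the real script matched against. The sh85078149 URI is taken from the slides; everything else is illustrative.

```python
# Tiny stand-in for the LCSH dataset: normalised heading -> LOC concept URI.
# The real script matched against the full Library of Congress headings.
LCSH = {
    "lohars -- history": "http://id.loc.gov/authorities/sh85078149#concept",
}

def enrich_heading(label):
    """Return the LOC concept URI for a subject heading, or None if unmatched.

    Matching is done on a normalised (trimmed, lower-cased) form so that
    minor cataloguing variations still hit the same authority record.
    """
    return LCSH.get(label.strip().lower())

print(enrich_heading("Lohars -- History"))
print(enrich_heading("Something unknown"))
```

The payoff is that a local text string becomes a link into a shared authority, so anyone else using the same LOC URI is demonstrably talking about the same concept.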
  • Marc / AACR2 cannot translate well to semantically rich formats. Need better container / transfer standards (not necessarily RDF)
  • Scaling issues
  • Triplestores are cumbersome SPARQL alone does not do the trick – need faster, easier indexes covering data. One of the advantages of the Talis platform is that they can do this … High entry barrier to RDF is partly a result of these accompanying technologies – as much as the confusion and complexity around the data
  • Building whole systems around RDF is not really a good idea – thankfully they are not doing this. Need the flexibility to do this by dropping Marc21 as an internal storage format – thankfully they are doing this; plenty of other ways to get at data. RDF works best on the side, as a separate machine-friendly view of data. Ensure any RDF publishing capacity is flexible (as ours is). RDF capability for Primo?
  • Standalone RDF is just fiddly Dublin Core, so … Create HTTP URIs for things so they have a permanent name on the web – really exciting. Link it to something useful (LOC, FAST, VIAF). Don’t limit to the bibliographic – if records describe music or film, link to IMDB, Wikipedia or some domain-specific authority … We have a chance to break out of the library bubble here …
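Minting permanent names can be as simple as hashing a stable key into a URI. The entity ids on the earlier slides have the shape of MD5 digests, so this sketch assumes an MD5-of-name scheme; that scheme, and the claim that it matches the project’s actual id derivation, are assumptions for illustration only.

```python
import hashlib

BASE = "http://data.lib.cam.ac.uk/id/entity/cambrdgedb_"

def mint_uri(name):
    """Derive a stable HTTP URI for an entity from its name string.

    Hashing means the same entity gets the same URI on every conversion
    run, so links between datasets do not break when data is regenerated.
    (Illustrative scheme - not necessarily how COMET derived its ids.)
    """
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return BASE + digest

print(mint_uri("Mohan, Krishna"))
```

Because the URI is an HTTP URI, it doubles as a lookup address: put an RDF description behind it and Tim Berners-Lee’s first three principles are satisfied in one move.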
  • Use URIs as names for things. Use HTTP URIs so that people can look up those names. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL). Include links to other URIs, so that they can discover more things. Away from RDF, triplestores, Marc21 035s and licensing issues, these four points are conceptually the right approach for linked data, or any data that exists on the web

    1. 1. Ed Chamberlain Systems Development Librarian Cambridge University Library
    2. 2. Disclaimers … <ul><li>Apologies if you see the semantic web as up there with quantum mechanics … </li></ul><ul><ul><li>Will contain some techy stuff </li></ul></ul><ul><ul><li>Not that much on Voyager … </li></ul></ul>
    3. 3. Overview <ul><li>Linked data in theory </li></ul><ul><li>What we learnt </li></ul><ul><ul><li>IPR </li></ul></ul><ul><ul><li>Data </li></ul></ul><ul><ul><li>Supporting technology </li></ul></ul><ul><li>How could it be used by Ex Libris? </li></ul>
    4. 4. What is the semantic web? <ul><li>“ The Semantic Web is a &quot;man-made woven web of data&quot; that facilitates machines to understand the semantics , or meaning, of information on the World Wide Web [1] [2 ] .” </li></ul><ul><li>“ The concept of Semantic Web applies methods beyond linear presentation of information ( Web 1.0 ) and multi-linear presentation of information ( Web 2.0 ) to make use of hyper-structures leading to entities of hypertext.” </li></ul><ul><li>http://en.wikipedia.org/wiki/Semantic_Web </li></ul>
    5. 5. Eh? <ul><li>Semantic = its meaning is explained - self-describing data! </li></ul><ul><li>Hyperlinked = meaning contextualised elsewhere </li></ul><ul><li>Focus on machines rather than people </li></ul>
    6. 6. What is Linked Data … <ul><li>After several iterations of semantic web development … </li></ul><ul><li>Tim Berners-Lee has advocated four underlying design principles for linked data: </li></ul><ul><ul><li>Use URIs as names for things </li></ul></ul><ul><ul><li>Use HTTP URIs so that people can look up those names </li></ul></ul><ul><ul><li>When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL) </li></ul></ul><ul><ul><li>Include links to other URIs, so that they can discover more things </li></ul></ul><ul><ul><ul><ul><ul><li> http://www.w3.org/DesignIssues/LinkedData.html </li></ul></ul></ul></ul></ul>
    7. 7. And RDF ? <ul><li>The Resource Description Framework ( RDF ) is a family of World Wide Web Consortium (W3C) specifications originally designed as a metadata data model . It has come to be used as a general method for conceptual description or modeling of information that is implemented in web resources, using a variety of syntax formats. </li></ul><ul><li>http://en.wikipedia.org/wiki/Resource_Description_Framework </li></ul>
    8. 8. What does this mean in practice … <ul><li>RDF Data is expressed as triples: </li></ul><ul><li>DC XML … </li></ul><ul><li><dc:identifer>1000346</dc:identifer> </li></ul><ul><li><dc:title>Early medieval history of Kashmir : [with special reference to the Loharas] A.D. 1003-1171</dc:title> </li></ul><ul><li>Marc21 … </li></ul><ul><li>001 1000346 </li></ul><ul><li>245$aEarly medieval history of Kashmir : $b[with special reference to the Loharas] A.D. 1003-1171 / </li></ul><ul><li>RDF triples … </li></ul><ul><li><http://data.lib.cam.ac.uk/id/entry/cambrdgedb_1000346> </li></ul><ul><li><http://purl.org/dc/terms/title> </li></ul><ul><li>&quot;Early medieval history of Kashmir : [with special reference to the Loharas] A.D. 1003-1171&quot; . </li></ul>
    9. 9. Most of a record … <ul><li>1. <http://data.lib.cam.ac.uk/id/entry/cambrdgedb_1000346> <http://purl.org/dc/terms/title> &quot;Early medieval history of Kashmir : [with special reference to the Loharas] A.D. 1003-1171&quot; . </li></ul><ul><li>2. <http://data.lib.cam.ac.uk/id/entry/cambrdgedb_1000346> <http://purl.org/dc/terms/type> <http://data.lib.cam.ac.uk/id/type/1cb251ec0d568de6a929b520c4aed8d1> . </li></ul><ul><li>3. <http://data.lib.cam.ac.uk/id/entry/cambrdgedb_1000346> <http://purl.org/dc/terms/type> <http://data.lib.cam.ac.uk/id/type/46657eb180382684090fda2b5670335d> . </li></ul><ul><li>4. <http://data.lib.cam.ac.uk/id/entry/cambrdgedb_1000346> <http://purl.org/dc/terms/identifier> &quot;UkCU1000346&quot; . </li></ul><ul><li>5. <http://data.lib.cam.ac.uk/id/entry/cambrdgedb_1000346> <http://purl.org/dc/terms/issued> &quot;1981&quot; . </li></ul><ul><li>6. <http://data.lib.cam.ac.uk/id/entry/cambrdgedb_1000346> <http://purl.org/dc/terms/creator> <http://data.lib.cam.ac.uk/id/entity/cambrdgedb_a5a6f7a184ff02e08b1befedc1b3a4d0> . </li></ul><ul><li>7. <http://data.lib.cam.ac.uk/id/entry/cambrdgedb_1000346> <http://purl.org/dc/terms/language> <http://id.loc.gov/vocabulary/iso639-2/eng> . </li></ul><ul><li>8. <http://data.lib.cam.ac.uk/id/entry/cambrdgedb_1000346> <http://RDVocab.info/ElementsplaceOfPublication> <http://id.loc.gov/vocabulary/countries/ii> </li></ul>
    10. 10. Where is the linking exactly? <ul><li><http://data.lib.cam.ac.uk/id/entry/cambrdgedb_1000346> <http://purl.org/dc/terms/creator> <http://data.lib.cam.ac.uk/id/entity/cambrdgedb_a5a6f7a184ff02e08b1befedc1b3a4d0 > </li></ul><ul><li><http://data.lib.cam.ac.uk/id/entity/cambrdgedb_a5a6f7a184ff02e08b1befedc1b3a4d0> <http://www.w3.org/2000/01/rdf-schema#label> &quot;Mohan, Krishna&quot; . <http://data.lib.cam.ac.uk/id/entity/cambrdgedb_a5a6f7a184ff02e08b1befedc1b3a4d0> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> . <http://data.lib.cam.ac.uk/id/entity/cambrdgedb_a5a6f7a184ff02e08b1befedc1b3a4d0> <http://xmlns.com/foaf/0.1#name> &quot;Mohan, Krishna&quot; . </li></ul>
    11. 11. External linking <ul><li><http://data.lib.cam.ac.uk/id/entry/cambrdgedb_1000346> <http://purl.org/dc/terms/subject> </li></ul><ul><li><http://data.lib.cam.ac.uk/id/entry/cambrdgedb_43e3fa1b4404410454c90d8022578852> . </li></ul><ul><li><http://data.lib.cam.ac.uk/id/entry/cambrdgedb_43e3fa1b4404410454c90d8022578852> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept> . <http://data.lib.cam.ac.uk/id/entry/cambrdgedb_43e3fa1b4404410454c90d8022578852> <http://www.w3.org/2004/02/skos/core#inScheme> <http://id.loc.gov/authorities#conceptScheme> . <http://data.lib.cam.ac.uk/id/entry/cambrdgedb_43e3fa1b4404410454c90d8022578852> <http://www.w3.org/2004/02/skos/core#prefLabel> &quot;Lohars -- History&quot; . <http://data.lib.cam.ac.uk/id/entry/cambrdgedb_43e3fa1b4404410454c90d8022578852> <http://purl.org/dc/terms/hasPart> <http://id.loc.gov/authorities/sh85078149#concept> . </li></ul>
    12. 12. Live demo … <ul><li>http://data.lib.cam.ac.uk/id/entry/cambrdgedb_1000346 </li></ul>
    13. 13. Meanwhile … <ul><li>BNB </li></ul><ul><li>British Museum </li></ul><ul><li>Library of Congress </li></ul><ul><li>BBC Nature </li></ul>
    14. 14. The Linking Open Data cloud diagram - http://richard.cyganiak.de/2007/10/lod/
    15. 15. What was COMET? <ul><li>Cambridge Open Metadata </li></ul><ul><li>Cambridge University Library / CARET / OCLC </li></ul><ul><li>Funded by the JISC Infrastructure for Resource Discovery Project </li></ul><ul><li>February to July 2011 </li></ul><ul><ul><ul><li>http://discovery.ac.uk </li></ul></ul></ul>
    16. 16. What did COMET do … <ul><li>Experimentally convert as much of the Cambridge University Library catalogue as it could from Marc21 to RDF triples </li></ul><ul><li>Investigate IPR issues around Open License publishing and Marc21 </li></ul><ul><li>Construct an RDF publishing platform to sit behind those URIs … </li></ul><ul><li>Release tools for others to do the same </li></ul><ul><li>Blog and documentation </li></ul>
    17. 17. Why? <ul><li>Respond to academic / national demand for Open Data </li></ul><ul><li>Get our data to non-librarians! </li></ul><ul><li>Tax-payer value-for-money </li></ul><ul><li>CUL already provides public APIs </li></ul><ul><li>Gain in-house experience of RDF </li></ul><ul><li>Move library services forward </li></ul>
    18. 18. Why - IPR <ul><li>Linked data works best with a permissive license </li></ul><ul><li>CC0 or Public Domain Data License </li></ul><ul><li>Non-commercial licenses not suitable </li></ul><ul><li>Conflict with record vendors </li></ul>
    19. 19. How – IPR <ul><li>Examine contracts with major vendors </li></ul><ul><li>Decide on re-use conditions and contact them </li></ul><ul><li>Decode record ownership from Marc21 fields (Could not use Voyager SQL) </li></ul>
    20. 20. How – IPR <ul><li>Where does a record come from ? </li></ul><ul><li>Several places in Marc21 where this data could be held (015,035,038,994 …) </li></ul><ul><li>Logic and hierarchy for examination </li></ul><ul><li>Attempt at scripted analysis – list bib_ids by record vendor </li></ul>
    21. 22. What - IPR <ul><li>Most vendors happy with permissive license for ‘non-marc21’ formats </li></ul><ul><li>RLUK / BL B.N.B. – PDDL </li></ul><ul><li>OCLC – ODC-By Attribution license </li></ul><ul><li>No good reason not to re-publish – need the right license! </li></ul>
    22. 23. IPR - What did we learn? <ul><li>Marc21 not fit for purpose here, no ‘authoritative code’ for license </li></ul><ul><li>National / international mandate to release open data </li></ul><ul><li>No good reason not to re-publish – need the right license! </li></ul>
    23. 24. How - data <ul><li>Several attempts – settled on SQL extracts based on lists of bib_ids </li></ul><ul><li>Use Perl scripting to ‘munge’ the data </li></ul><ul><li>You can try this at home ! (work) </li></ul>
    24. 25. How - marc problems <ul><li>Punctuation as a function </li></ul><ul><li>Binary encoding </li></ul><ul><li>Numbers for field names </li></ul><ul><li>Bad characters </li></ul><ul><li>Replication of data in fields </li></ul>
    25. 26. How – data vocab <ul><li>RDF allows you to freely mix vocabularies </li></ul><ul><li>Emerging consensus on bibliographic description </li></ul><ul><li>Our conversion script is CSV customisable </li></ul><ul><li>BL and others leading the way </li></ul>
    26. 27. How - data publishing <ul><li>Bulk downloads </li></ul><ul><li>Queryable ‘endpoints’ </li></ul><ul><li>Data and code at http://data.lib.cam.ac.uk </li></ul>
    27. 28. How – linking <ul><li>PHP script to match text against LOC subject headings – enrich with LOC GUID </li></ul><ul><li>FAST / VIAF enrichment courtesy of OCLC </li></ul>
    28. 29. Data - What did we learn ? <ul><li>Marc / AACR2 cannot translate well to semantically rich formats </li></ul><ul><li>Need better container / transfer standards (not necessarily RDF) </li></ul>
    29. 30. What else?
    30. 31. RDF friendly database <ul><li>Called RDF stores, triplestores or Quadstores </li></ul><ul><li>Vary in size scale and scope </li></ul><ul><li>None are particularly admin / dev friendly right now … </li></ul>
    31. 32. How - SPARQL <ul><li>Query language for RDF stores </li></ul><ul><li>Still a work in progress </li></ul><ul><li>Some similarities with SQL </li></ul><ul><li>Bibliographic-centric tutorial </li></ul>
    32. 33. How – storage and access <ul><li>ARC2 - Lightweight MySQL / PHP solution </li></ul><ul><ul><li>Good fit for a six month project </li></ul></ul><ul><ul><li>Great for around 300–500k records </li></ul></ul><ul><ul><li>Not so good for 1 million plus </li></ul></ul><ul><ul><li>20 million + ? </li></ul></ul>
    33. 34. Supporting tech -What did we learn? <ul><li>Triplestores are cumbersome </li></ul><ul><li>SPARQL alone does not do the trick </li></ul><ul><li>High entry barrier to RDF is partly a result of these accompanying technologies </li></ul>
    34. 35. What does this mean for Ex Libris <ul><li>Building whole systems around RDF is not really a good idea </li></ul><ul><li>Need the flexibility to do this by dropping Marc21 </li></ul><ul><li>GUIDS for records (or allow us to have our own) – resolvable ? </li></ul><ul><li>Ensure any RDF publishing capacity is flexible (as ours is) </li></ul><ul><li>RDF capability for Primo ? </li></ul>
    35. 36. Always add value to RDF … <ul><li>Standalone RDF is just fiddly Dublin Core, so … </li></ul><ul><ul><li>Create HTTP URIs for entities </li></ul></ul><ul><ul><li>Link it to something useful (LOC, FAST, VIAF) </li></ul></ul><ul><ul><li>Endpoint (SPARQL?) </li></ul></ul><ul><ul><li>Don’t limit to the bibliographic </li></ul></ul>
    36. 37. Beyond bibliographic Bibliographic Holdings FAST subject headings Libraries Transactions Special collections Archives Creator / entity Place of publication LCSH subject headings Course lists Language Librarians
    37. 38. Do what Tim said … <ul><ul><li>Use URIs as names for things </li></ul></ul><ul><ul><li>Use HTTP URIs so that people can look up those names </li></ul></ul><ul><ul><li>When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL) </li></ul></ul><ul><ul><li>Include links to other URIs, so that they can discover more things </li></ul></ul><ul><ul><ul><ul><ul><li> http://www.w3.org/DesignIssues/LinkedData.html </li></ul></ul></ul></ul></ul>
    38. 39. Questions? <ul><li>@edchamberlain / [email_address] </li></ul><ul><li>http://data.lib.cam.ac.uk </li></ul><ul><li>http://cul-comet.blogspot.com/ </li></ul>