Linked Library Data in the wild

How we use Linked Open Data to drive our next generation discovery interface, and how we've gone about it.

Published in: Technology, Education

  1. 1. Linked Library Data in the wild
  2. 2. Technical Lead for Prism | Phil John | Introductions...
  3. 3. So, what’s Prism then? | Introductions...
  4. 4.
  5. 5.
  6. 6.
  7. 7. a next generation discovery interface | Prism | Introductions
  8. 8. (yes…even configuration settings) | Built entirely on Linked Data | Prism
  9. 9. Discovery of library catalogue resources | Prism | but grander plans afoot...
  10. 10. ...some future sources... | Prism
      - journal metadata
      - archives/records (e.g. DS Calm)
      - thesis repositories
      - rare items/special collections
      - and more!
  14. 14. SaaS/Cloud Based | Prism
  15. 15. MARC 21 → RDF | Performs data conversion | Prism
  16. 16. this ensures it keeps in sync with the LMS | Initial “bulk” conversion then periodic “delta” files | Prism
  17. 17. provided by a suite of RESTful web services | Borrower/Availability data pulled from LMS “live” | Prism
  18. 18. just add .rss to collections or .rdf/.nt/.ttl/.json to items | Linked Data API | Prism
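
The suffix convention on the Linked Data API slide can be sketched in a few lines of Python. This is only an illustration of the idea: the host `prism.example.org`, the path layout and the function name `representation_url` are assumptions, not taken from the slides.

```python
from urllib.parse import urlsplit, urlunsplit

# Suffixes named on the slide: .rss for collections, the rest for items.
ITEM_FORMATS = ("rdf", "nt", "ttl", "json")
COLLECTION_FORMATS = ("rss",)


def representation_url(resource_url: str, fmt: str, kind: str = "item") -> str:
    """Append a format suffix to the path, leaving any query string intact."""
    allowed = ITEM_FORMATS if kind == "item" else COLLECTION_FORMATS
    if fmt not in allowed:
        raise ValueError(f"{fmt!r} is not offered for a {kind}")
    parts = urlsplit(resource_url)
    path = parts.path.rstrip("/") + "." + fmt
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


# Hypothetical URLs, for illustration only.
print(representation_url("http://prism.example.org/items/750785", "ttl"))
print(representation_url("http://prism.example.org/search?q=goodfellas", "rss", kind="collection"))
```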
  19. 19.
  20. 20.
  21. 21.
  22. 22. The Challenges | Prism
  23. 23. Extracting data from MARC 21 | The Challenges
  24. 24. Some quotes... | Extracting Data from MARC 21 | ...cataloguers may want to look away now
  25. 25.
  26. 26. ...and even if it does, there are millions of existing records that we’ll want to convert | MARC 21 is not going away anytime soon... | Extracting Data from MARC 21
  27. 27.
  28. 28. How are we approaching it? | Extracting Data from MARC 21
  29. 29. By tackling it in small chunks! | Extracting Data from MARC 21
  30. 30. We’ve created a solution that... | Extracting Data from MARC 21
      - allows us to build the model iteratively
      - compartmentalises code for different sections
      - provides robustness
      - is performant
      - allows us to experiment
  34. 34. Parser | Observer | Handlers | Our conversion pipeline | Extracting Data from MARC 21
  35. 35. fires events when it encounters a MARC 21 data structure; very strict with syntax | MARC 21 Parser | Extracting Data from MARC 21
  36. 36. listens for MARC 21 data structures and hands control over to one or more handlers | Event Observer | Extracting Data from MARC 21
  37. 37. know how to convert MARC 21 structures and fields into linked data | Bibliographic Handlers | Extracting Data from MARC 21
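
A minimal sketch of how a parser, observer and handlers might fit together, assuming fields arrive as simple `(tag, subfields)` pairs. The class and function names, the `dct:title` predicate and the `_:record` subject are all hypothetical; the slides do not show the real code.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Field = Tuple[str, Dict[str, str]]   # (tag, {subfield code: value}), simplified
Triple = Tuple[str, str, str]        # (subject, predicate, object)
Handler = Callable[[str, Dict[str, str]], List[Triple]]


class Observer:
    """Listens for field events and hands them to every registered handler."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = defaultdict(list)

    def register(self, tag: str, handler: Handler) -> None:
        self._handlers[tag].append(handler)

    def notify(self, tag: str, subfields: Dict[str, str]) -> List[Triple]:
        triples: List[Triple] = []
        for handler in self._handlers.get(tag, []):
            triples.extend(handler(tag, subfields))
        return triples


def parse(fields: Iterable[Field], observer: Observer) -> List[Triple]:
    """Fire one event per field; reject malformed tags rather than guess."""
    triples: List[Triple] = []
    for tag, subfields in fields:
        if not (len(tag) == 3 and tag.isdigit()):
            raise ValueError(f"malformed tag: {tag!r}")
        triples.extend(observer.notify(tag, subfields))
    return triples


def title_handler(tag: str, subfields: Dict[str, str]) -> List[Triple]:
    """Hypothetical handler: turn 245 $a into a title triple."""
    if "a" not in subfields:
        return []
    return [("_:record", "dct:title", subfields["a"])]


observer = Observer()
observer.register("245", title_handler)
record = [("245", {"a": "Goodfellas"}), ("538", {"a": "Blu-Ray."})]
print(parse(record, observer))
```

Fields with no registered handler (the 538 here) are simply skipped, which is one way to build the model iteratively: new handlers can be added per field without touching the parser.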
  38. 38. So, where are we up to? | Extracting Data from MARC 21
  39. 39. we tackled this one first as it allows us to reason more fully about the record | Format (and duration) | Extracting Data from MARC 21
  40. 40. In theory quite easy... | Format
  41. 41. ...in practice not so much... | Format
      - no code for CD (12cm sound disk, 1.4m/s)
      - DVD and LaserDisc share(d) a code
      - LC slow(ish) to support new formats in M21
      - limited use of control field (007) codings...
      - ...so need to parse text from 3xx, 5xx fields
  45. 45. Teasing format from a MARC 21 Record
      LDR: 01425ngm a22005058 4504
      001: 750785
      003: xxxxxxx
      005: 20090824164118.0
      007: vd||s||||
      008: 080623s2007 enk||| e v|eng d
      020: , | $c Retail (S24.99) |
      024: 3, | $a 7321900108089 |
      028: 4, 0 | $a BDY10808 | $b Warner Home Video |
      029: , | $a 7321900108089 |
      082: , | $a 812
      245: 0, 0 | $a Goodfellas | $h [videorecording] / | $c directed by Martin Scorsese ; music by Christopher Brooks
      260: , | $b Warner Home Video, | $c 2007. |
      300: , | $a 1 Blu-Ray (139 min.) : | $b col. |
      306: , | $a 021900 |
      366: , | $b 20070611 |
      511: , | $a Starring Robert De Niro, Ray Liotta and Joe Pesci
      521: 8, | $a BBFC code: 18. |
      538: , | $a Blu-Ray. |
      700: 1, | $a Scorsese, Martin |
      700: 1, | $a Brooks, Christopher |
      852: , | $b John Harvard | $c BLU-RAY DISC | $m 18 | $z , $z Blu Ray Disc. 18Cert
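
One way to tease format and duration out of a record like the one above is to try the 007 coding first and fall back to free text from the 3xx/5xx fields. The sketch below is a guess at the approach, not the Prism code: the code table reflects my reading of 007 position 04 for videorecordings and should be checked against the MARC 21 specification, and the text patterns are purely illustrative.

```python
import re
from typing import List, Optional

# Assumed subset of 007/04 (videorecording format) codes; verify before relying on it.
VIDEO_FORMAT_CODES = {"b": "VHS", "g": "LaserDisc", "s": "Blu-ray", "v": "DVD"}

# Illustrative fallback patterns for text found in 300 $a, 538 $a and similar.
TEXT_PATTERNS = [
    (re.compile(r"blu-?\s?ray", re.IGNORECASE), "Blu-ray"),
    (re.compile(r"\bdvd\b", re.IGNORECASE), "DVD"),
    (re.compile(r"\bvhs\b", re.IGNORECASE), "VHS"),
]


def detect_format(field_007: Optional[str], text_fields: List[str]) -> Optional[str]:
    """Prefer the coded value; otherwise scan descriptive text."""
    if field_007 and len(field_007) > 4 and field_007[0] == "v":
        coded = VIDEO_FORMAT_CODES.get(field_007[4])
        if coded:
            return coded
    for text in text_fields:
        for pattern, label in TEXT_PATTERNS:
            if pattern.search(text):
                return label
    return None


def duration_minutes(field_306a: str) -> int:
    """306 $a holds playing time as hhmmss; return whole minutes."""
    hours, minutes, seconds = (int(field_306a[i:i + 2]) for i in (0, 2, 4))
    return (hours * 3600 + minutes * 60 + seconds) // 60


print(detect_format("vd||s||||", []))
print(detect_format(None, ["1 Blu-Ray (139 min.) :", "Blu-Ray."]))
print(duration_minutes("021900"))
```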
  46. 46. Which gives us...
  47. 47. an important part of the record to model, or so I’ve been told | Title | Extracting Data from MARC 21
  48. 48. Quite tricky because... | Title
      - don’t want to duplicate data that appears elsewhere (e.g. in 100/700)
      - ‡c must be last subfield in a 245...
      - ...so sometimes data from ‡n / ‡p is in ‡c instead...
      - ...which means we can’t just drop the ‡c
  51. 51. http://journal.code4lib.org/articles/3832 | Got a helping hand from Code4Lib Journal (thanks!) | Title
  52. 52. Now with more title
  53. 53. sounds easy...acronyms from EAN to UPC describing 13 digit codes...right? | Identifier | Extracting Data from MARC 21
  54. 54. what are all those other things doing in the ‡a? | ...STOP! | Identifier
  55. 55. Identifier | “For a hardbound resource, there is no attempt to use a consistent term other than to use one that conveys the condition intelligibly.” | Library of Congress Rule Interpretation 1.8
  56. 56.
  57. 57. (and then validate whatever’s left) | So we need to parse them out | Identifier
  58. 58. Phew, this one’s easy, no (pbk), (hbk) or even (pbk. , alk. paper) to contend with
      LDR: 01425ngm a22005058 4504
      001: 750785
      003: xxxxxxx
      005: 20090824164118.0
      007: vd||s||||
      008: 080623s2007 enk||| e v|eng d
      020: , | $c Retail (S24.99) |
      024: 3, | $a 7321900108089 |
      028: 4, 0 | $a BDY10808 | $b Warner Home Video |
      029: , | $a 7321900108089 |
      082: , | $a 812
      245: 0, 0 | $a Goodfellas | $h [videorecording] / | $c directed by Martin Scorsese ; music by Christopher Brooks
      260: , | $b Warner Home Video, | $c 2007. |
      300: , | $a 1 Blu-Ray (139 min.) : | $b col. |
      306: , | $a 021900 |
      366: , | $b 20070611 |
      511: , | $a Starring Robert De Niro, Ray Liotta and Joe Pesci
      521: 8, | $a BBFC code: 18. |
      538: , | $a Blu-Ray. |
      700: 1, | $a Scorsese, Martin |
      700: 1, | $a Brooks, Christopher |
      852: , | $b John Harvard | $c BLU-RAY DISC | $m 18 | $z , $z Blu Ray Disc. 18Cert
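
"Parse them out, then validate whatever's left" could look roughly like this in Python. The function names are hypothetical and the qualifier handling is deliberately naive; the checksum is the standard EAN-13 calculation, which also covers 13-digit ISBNs.

```python
import re
from typing import Optional, Tuple


def split_qualifier(raw: str) -> Tuple[str, Optional[str]]:
    """Separate the leading identifier from any trailing qualifier text."""
    match = re.match(r"\s*([0-9Xx-]+)\s*(.*)$", raw)
    if not match:
        return "", raw.strip() or None
    candidate = match.group(1).replace("-", "").upper()
    qualifier = match.group(2).strip(" ()") or None
    return candidate, qualifier


def is_valid_ean13(code: str) -> bool:
    """Check length, digits and the EAN-13 check digit (weights 1,3,1,3...)."""
    if len(code) != 13 or not code.isdigit():
        return False
    digits = [int(ch) for ch in code]
    total = sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits[:12]))
    return (10 - total % 10) % 10 == digits[12]


# 024 $a from the record above, then a made-up value with a qualifier.
print(split_qualifier("7321900108089"), is_valid_ean13("7321900108089"))
print(split_qualifier("9780140283297 (pbk. : alk. paper)"))
```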
  59. 59. Now we can start performing lookups against other sources!
  60. 60. hardest of the lot... | Author | Extracting Data from MARC 21
  61. 61. ...why? | Author
      - Newt Scamander
      - Rowling, J.K. vs Rowling, Joanne K.
      - Few records with relator term in 100/700 ‡e...
      - ...so we have to parse that from the 245 ‡c...
      - ...and we don’t just deal with English records.
  66. 66. we’ve licensed the names/subjects authority files, and created RDF from them | Library of Congress to the rescue! | Author
  67. 67. A contrived example (sorry!) with and without relator terms
      LDR: 01425ngm a22005058 4504
      001: 750785
      003: xxxxxxx
      005: 20090824164118.0
      007: vd||s||||
      008: 080623s2007 enk||| e v|eng d
      020: , | $c Retail (S24.99) |
      024: 3, | $a 7321900108089 |
      028: 4, 0 | $a BDY10808 | $b Warner Home Video |
      029: , | $a 7321900108089 |
      082: , | $a 812
      245: 0, 0 | $a Goodfellas | $h [videorecording] / | $c directed by Martin Scorsese ; music by Christopher Brooks
      260: , | $b Warner Home Video, | $c 2007. |
      300: , | $a 1 Blu-Ray (139 min.) : | $b col. |
      306: , | $a 021900 |
      366: , | $b 20070611 |
      511: , | $a Starring Robert De Niro, Ray Liotta and Joe Pesci
      521: 8, | $a BBFC code: 18. |
      538: , | $a Blu-Ray. |
      700: 1, | $a Scorsese, Martin |
      700: 1, | $a Brooks, Christopher | $e music
      852: , | $b John Harvard | $c BLU-RAY DISC | $m 18 | $z , $z Blu Ray Disc. 18Cert
  68. 68. Hope you can all read this at the back!
  69. 69. A closer look at Authority Matching | Author
  70. 70. Some requirements: | Author
      - needs to be fast...
      - ...(able to process 2M records in several hours)
      - requires accuracy
      - must handle pseudonyms and variant spellings
  73. 73. which means that for bulk conversions we aren’t incurring HTTP overhead millions of times | So we store as RDF, but index in SOLR | Author
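
The idea of matching names against a local index rather than over HTTP can be illustrated with a plain dictionary. To be clear, this swaps Solr for an in-memory dict purely for illustration; the normalisation rules, the function names and the `example.org` URI are assumptions, and real authority matching would need far more care over accuracy.

```python
import re
import unicodedata
from typing import Dict, Iterable, Optional, Tuple


def normalise(name: str) -> str:
    """Fold case, accents, punctuation and spacing so variant forms collide."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def build_index(authorities: Iterable[Tuple[str, Iterable[str]]]) -> Dict[str, str]:
    """Map every preferred and variant label to its authority URI."""
    index: Dict[str, str] = {}
    for uri, labels in authorities:
        for label in labels:
            index.setdefault(normalise(label), uri)
    return index


def match(index: Dict[str, str], name: str) -> Optional[str]:
    """Local lookup: no network round trip per record."""
    return index.get(normalise(name))


# Hypothetical authority entry with variant and pseudonymous forms.
index = build_index([
    ("http://example.org/authority/rowling",
     ["Rowling, J. K.", "Rowling, Joanne K.", "Scamander, Newt"]),
])
print(match(index, "Rowling, J.K."))
print(match(index, "Scorsese, Martin"))
```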
  74. 74. You can tell J.K. Rowling is successful, she’s been translated lots
  75. 75. Language/Alternate Graphical Representation | Extracting Data from MARC 21
  76. 76. Nice “high impact” feature | Language
      - allows switching between representations
      - both forms can be searched for
      - uses RDF content language feature, so useful for people using machine readable RDF
  78. 78. Dealing with language in MARC 21
      001: | 3013197
      008: | 080624s2007cca0000chid
      041: , | $a chi
      043: , | $a a-cc--- |
      050: , 4 | $a NE1300.8.C6 | $b S48 2007 |
      100: 1, | $6 880-01 | $a Shu, Huifang. |
      245: 1, 0 | $6 880-02 | $a Fan chensuzi : | $b Min jiannianhuazhong de wenqingfengsu / | $c Shu Huifang, Shen Hong zhu. |
      246: 3, 1 | $6 880-03 | $a Min jiannianhuazhong de wenqingfengsu |
      250: , | $6 880-04 | $a Di 1 ban. |
      260: , | $6 880-05 | $a Beijing : | $b Zhongguo gong renchu ban she, | $c 2007. |
      300: , | $a 3, 3, 229 p. : | $b col. ill. ; | $c 24 cm. |
      440: , 0 | $6 880-06 | $a Zhongguo min suwenhua cong shu |
      700: 1, | $6 880-07 | $a Shen, Hong. |
      880: 1, | $6 100-01/$1 | $a 舒惠芳. |
      880: 1, 0 | $6 245-02/$1 | $a 凡尘俗子 : | $b 民间年画中的温情风俗 / | $c 舒惠芳, 沈泓著.
      880: 3, 1 | $6 246-03/$1 | $a 民间年画中的温情风俗 |
      880: , | $6 250-04/$1 | $a 第1版. |
      880: , | $6 260-05/$1 | $a 北京 : | $b 中国工人出版社 | $c 2007. |
      880: , 0 | $6 440-06/$1 | $a 中国民俗文化丛书 |
      880: 1, | $6 700-07/$1 | $a 沈泓. |
      852: , | $b Main Library | $c East Asian Coll.,Purple 2 | $h 398.351 | $m S4 |
  79. 79. tagged with an ISO-639-2 language and masquerading as the field listed in ‡6 | Passes 880s back into Observer | Language
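
A rough sketch of the 880 trick: read the linked tag out of ‡6, attach a language to each value, and return the field under the tag it stands in for so it can be fed back through the same handlers. The names `parse_linkage` and `redispatch_880` are hypothetical, and taking the language from 041 ‡a (as in the example call) is an assumption about where it comes from.

```python
import re
from typing import Dict, NamedTuple, Optional, Tuple


class Linkage(NamedTuple):
    tag: str                # the field this 880 stands in for, e.g. "245"
    occurrence: str         # pairing number shared with the romanised field
    script: Optional[str]   # script code as written in the record, e.g. "$1"


LINKAGE = re.compile(r"^(\d{3})-(\d{2})(?:/([^/]+))?")


def parse_linkage(subfield_6: str) -> Linkage:
    """Split a linkage subfield such as '245-02/$1' into its parts."""
    match = LINKAGE.match(subfield_6.strip())
    if not match:
        raise ValueError(f"unrecognised linkage: {subfield_6!r}")
    return Linkage(match.group(1), match.group(2), match.group(3))


def redispatch_880(
    subfields: Dict[str, str], language: str
) -> Tuple[str, Dict[str, Tuple[str, str]]]:
    """Return the masqueraded tag plus language-tagged subfield values."""
    link = parse_linkage(subfields["6"])
    tagged = {
        code: (value, language)
        for code, value in subfields.items()
        if code != "6"
    }
    return link.tag, tagged


print(parse_linkage("245-02/$1"))
print(redispatch_880({"6": "100-01/$1", "a": "舒惠芳."}, "chi"))
```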
  80. 80. Which gives us...
  81. 81.
  82. 82.
  83. 83.
  84. 84. it’s part of the reason we use Linked Data...but it’s got some challenges at the moment | Using/Linking to External Datasets | The Challenges
  85. 85. Pitfalls: | Language
      - what if a datasource suffers downtime...
      - ...or worse, is taken offline permanently?
      - can we trust this data?
      - can we display it, or is it susceptible to vandalism?
  88. 88. Potential solutions (YMMV): | Language
      - harvest datasets and keep them close to the app...
      - ...or, if that’s not practical, proxy requests using a caching proxy such as Squid
      - if using Wikipedia and worried about vandalism...
      - ...check for lots of rapid edits, consider caching (or turning off temporarily)
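
The caching idea can be shown in-process with a small wrapper that serves a stale copy when the external source is down. This is a stand-in for illustration only, not a Squid configuration; the class name, the TTL default and the injected `fetch` callable are all assumptions.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class CachedSource:
    """Cache responses and fall back to the last good copy on failure."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[float, str]] = {}

    def get(self, url: str) -> Optional[str]:
        now = self._clock()
        cached = self._cache.get(url)
        if cached and now - cached[0] < self._ttl:
            return cached[1]                      # fresh enough: no request made
        try:
            body = self._fetch(url)
        except Exception:
            return cached[1] if cached else None  # source is down: serve stale or nothing
        self._cache[url] = (now, body)
        return body


def flaky_fetch(url: str) -> str:
    """Stand-in for an HTTP call to an external dataset that is offline."""
    raise ConnectionError("datasource offline")


source = CachedSource(flaky_fetch)
print(source.get("http://example.org/resource"))
```

Returning `None` rather than raising lets the interface simply omit the enrichment when an external dataset is unavailable.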
  92. 92. ...or – what we’d like to see happen to Linked Library Data | The Future...
  93. 93. especially on the peripheries – authority data, author information, links to other resources | More library data as LOD | The Future
  94. 94. seriously – this would make our lives so much simpler | LMS vendors adopting LOD | The Future
  95. 95. LOD replacing MARC 21 as the standard representation of bibliographic records | The Future
  96. 96.
  97. 97. Photo Credits
      Slide 15 - http://www.flickr.com/photos/gammaman/5241860326/
      Slide 21 - http://www.flickr.com/photos/agizienski/3778965891/
      Slide 40 - http://www.flickr.com/photos/54409200@N04/5070012761/
      Slide 42 - http://www.flickr.com/photos/proimos/4199675334/
      Slide 48 - http://www.flickr.com/photos/maveric2003/91198458/
      Slide 63 - http://richard.cyganiak.de/2007/10/lod/
      Slide 67 - http://www.flickr.com/photos/markchapmanphoto/5139429152/
      Slide 72 - http://www.flickr.com/photos/-bast-/349497988/
