Linked Library Data in the wild
Technical Lead for Prism Phil John Introductions...
So, what’s Prism then? Introductions...
a next generation discovery interface Prism Introductions
(yes…even configuration settings) Built entirely on Linked Data Prism
Discovery of library  catalogue resources Prism but grander plans afoot...
...some future sources... Prism
 archives/records (e.g. DS Calm)
 thesis repositories
 rare items/special collections
 and more!
MARC 21 → RDF Performs data conversion Prism
this ensures it keeps in sync with the LMS Initial “bulk” conversion then periodic “delta” files Prism
provided by a suite of RESTful web services Borrower/Availability data pulled from LMS “live” Prism
just add .rss to collections or .rdf/.nt/.ttl/.json to items Linked Data API Prism
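A minimal sketch of the suffix convention described on the Linked Data API slide. The base URL, the path layout and the function names are assumptions made for illustration; only the suffixes themselves (.rss, .rdf, .nt, .ttl, .json) come from the slide, and the media type shown for .json is a guess.

```python
# Suffixes named on the slide, mapped to the media type a client might expect.
# The .json media type is an assumption; the slide does not say which JSON flavour.
ITEM_FORMATS = {
    "rdf": "application/rdf+xml",
    "nt": "application/n-triples",
    "ttl": "text/turtle",
    "json": "application/json",
}


def item_url(base, item_path, fmt):
    """Build the URL for one serialisation of an item (hypothetical layout)."""
    if fmt not in ITEM_FORMATS:
        raise ValueError(f"unsupported item format: {fmt}")
    return f"{base.rstrip('/')}/{item_path.strip('/')}.{fmt}"


def collection_feed_url(base, collection_path):
    """Build the RSS feed URL for a collection (hypothetical layout)."""
    return f"{base.rstrip('/')}/{collection_path.strip('/')}.rss"
```

For example, `item_url("http://example.org/", "items/123", "ttl")` yields `http://example.org/items/123.ttl`.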
The Challenges Prism
Extracting data from MARC 21 The Challenges
Some quotes... Extracting Data from MARC 21 ...cataloguers may want to look away now
...and even if it does, there are millions of existing records that we’ll want to convert MARC 21 is not going away anytime soon... Extracting Data from MARC 21
How are we approaching it? Extracting Data from MARC 21
By tackling it in small chunks! Extracting Data from MARC 21
We’ve created a solution that... Extracting Data from MARC 21
 compartmentalises code for different sections
 provides robustness
 is performant
 allows us to experiment
fires events when it encounters a MARC 21 data structure; very strict with syntax MARC 21 Parser Extracting Data from MARC 21
listens for MARC 21 data structures and hands control over to one or more handlers Event Observer Extracting Data from MARC 21
know how to convert MARC 21 structures and fields into linked data Bibliographic Handlers Extracting Data from MARC 21
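The parser, observer and handler split on the last three slides can be sketched roughly as follows. This is not Prism's code: the class and function names, the triple shape and the `dcterms:title` predicate are all assumptions, and the "parser" here simply walks fields that have already been decoded rather than reading real MARC 21.

```python
class EventObserver:
    """Routes each MARC field event to the handlers registered for its tag."""

    def __init__(self):
        self._handlers = {}

    def register(self, tag, handler):
        # More than one handler may listen for the same tag.
        self._handlers.setdefault(tag, []).append(handler)

    def notify(self, tag, subfields):
        triples = []
        for handler in self._handlers.get(tag, []):
            triples.extend(handler(subfields))
        return triples


def parse_record(fields, observer):
    """Stand-in parser: fires one event per (tag, subfields) pair."""
    triples = []
    for tag, subfields in fields:
        triples.extend(observer.notify(tag, subfields))
    return triples


def title_handler(subfields):
    """Hypothetical bibliographic handler turning 245 $a into one triple."""
    return [("_:record", "dcterms:title", subfields["a"].strip())]
```

Registering `title_handler` for tag `245` and parsing a record leaves every other field untouched until a handler for it exists, which is what allows the conversion to be tackled in small chunks.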
So, where are we up to? Extracting Data from MARC 21
we tackled this one first as it allows us to reason more fully about the record Format (and duration) Extracting Data from MARC 21
In theory quite easy... Format
...in practice not so much... Format
 DVD and LaserDisc share(d) a code
 LC slow(ish) to support new formats in M21
 limited use of control field (007) codings...
 ...so need to parse text from 3xx, 5xx fields
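As a rough illustration of parsing format and duration out of free text, the sketch below reads a 306 ‡a playing time (six digits, hhmmss) and falls back to a "(139 min.)" style statement in a 300 ‡a. The keyword table and function names are assumptions made for this example, not the rules Prism uses.

```python
import re

# Hypothetical keyword table; a real one would be much longer.
FORMAT_KEYWORDS = [
    ("blu-ray", "Blu-ray Disc"),
    ("laserdisc", "LaserDisc"),
    ("dvd", "DVD"),
]


def seconds_from_306(value):
    """Convert a 306 $a playing time in hhmmss form to seconds, or None."""
    if not re.fullmatch(r"\d{6}", value):
        return None
    hours, minutes, seconds = int(value[:2]), int(value[2:4]), int(value[4:])
    return hours * 3600 + minutes * 60 + seconds


def seconds_from_300(text):
    """Pull a '(139 min.)' style duration out of a 300 $a, or None."""
    match = re.search(r"\((\d+)\s*min\.?\)", text)
    return int(match.group(1)) * 60 if match else None


def format_from_text(text):
    """Guess a format label from free text in a 3xx or 5xx field."""
    lowered = text.casefold()
    for keyword, label in FORMAT_KEYWORDS:
        if keyword in lowered:
            return label
    return None
```

On the sample Goodfellas record that appears later in the deck, the 306 value `021900` and the 300 text `1 Blu-Ray (139 min.) :` both come out at 8340 seconds.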
Which gives us...
an important part of the record to model, or so I’ve been told Title Extracting Data from MARC 21
Quite tricky because... Title
 ‡c must be last subfield in a 245...
 ...so sometimes data from ‡n / ‡p is in ‡c instead...
 ...which means we can’t just drop the ‡c
Now with more title
sounds easy...acronyms from EAN to UPC describing 13 digit codes...right? Identifier Extracting Data from MARC 21
what are all those other things doing in the ‡a? ...STOP! Identifier
Identifier “For a hardbound resource, there is no attempt to use a consistent term other than to use one that conveys the condition intelligibly.” Library of Congress Rule Interpretation 1.8
(and then validate whatever’s left) So we need to parse them out Identifier
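A small sketch of "parse them out, then validate whatever's left", assuming the stray text arrives as a parenthesised qualifier after the number. The function names are invented for this example, and only the 13-digit EAN check digit is validated here.

```python
import re


def split_identifier(raw):
    """Separate the number from a trailing qualifier such as '(pbk.)'."""
    number, _, rest = raw.partition("(")
    qualifier = rest.rsplit(")", 1)[0].strip() if rest else None
    # Drop spaces and hyphens so only the digits (and a possible X) remain.
    number = re.sub(r"[\s-]", "", number).upper()
    return number, qualifier or None


def is_valid_ean13(number):
    """Check the EAN-13 check digit (weights 1,3,1,3... over the first 12)."""
    if not re.fullmatch(r"\d{13}", number):
        return False
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(number[:12]))
    return (10 - total % 10) % 10 == int(number[12])
```

The 024 ‡a value `7321900108089` from the sample record in this deck passes this check; the qualifier in `split_identifier("7321900108089 (pbk.)")` is made up purely to show the split.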
LDR: 01425ngm a22005058  4504
001: 750785
003: xxxxxxx
005: 20090824164118.0
007: vd||s||||
008: 080623s2007    enk||| e          v|eng d
020:  ,   | $c Retail (S24.99) |
024: 3,   | $a 7321900108089 |
028: 4, 0 | $a BDY10808 | $b Warner Home Video |
029:  ,   | $a 7321900108089 |
082:  ,   | $a 812
245: 0, 0 | $a Goodfellas | $h [videorecording] / | $c directed by Martin Scorsese ; music by Christopher Brooks
260:  ,   | $b Warner Home Video, | $c 2007. |
300:  ,   | $a 1 Blu-Ray (139 min.) : | $b col. |
306:  ,   | $a 021900 |
366:  ,   | $b 20070611 |
511:  ,   | $a Starring Robert De Niro, Ray Liotta and Joe Pesci
521: 8,   | $a BBFC code: 18. |
538:  ,   | $a Blu-Ray. |
700: 1,   | $a Scorsese, Martin |
700: 1,   | $a Brooks, Christopher |
852:  ,   | $b John Harvard | $c BLU-RAY DISC | $m 18 | $z , $z Blu Ray Disc. 18Cert
Phew, this one’s easy, no (pbk), (hbk) or even (pbk. , alk. paper) to contend with
Now we can start performing lookups against other sources!
hardest of the lot... Author Extracting Data from MARC 21
...why? Author
 Rowling, J.K. vs Rowling, Joanne K.
 Few records with relator term in 100/700 ‡e...
 ...so we have to parse that from the 245 ‡c...
 ...and we don’t just deal with English records.
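To make the 245 ‡c problem concrete, here is a deliberately naive sketch that splits a statement of responsibility on " ; " and looks for an English "<role> by <name>" pattern. The function name and the pattern are assumptions; as the last bullet warns, anything this simple fails on non-English records and on many English ones too.

```python
import re

# English-only pattern: "<role> by <name>". Purely illustrative.
ROLE_BY_NAME = re.compile(r"(?P<role>.+?)\s+by\s+(?P<name>.+)$")


def parse_responsibility(statement):
    """Return (role, name) pairs from a 245 $c; role is None when not found."""
    results = []
    for part in statement.split(" ; "):
        part = part.strip()
        if not part:
            continue
        match = ROLE_BY_NAME.match(part)
        if match:
            results.append((match.group("role"), match.group("name")))
        else:
            results.append((None, part))
    return results
```

Applied to the sample record's ‡c, `directed by Martin Scorsese ; music by Christopher Brooks`, this gives the roles `directed` and `music`, which could then be compared with any relator term present in a 700 ‡e.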
we’ve licensed the names/subjects authority files, and created RDF from them Library of Congress to the rescue! Author
LDR: 01425ngm a22005058  4504
001: 750785
003: xxxxxxx
005: 20090824164118.0
007: vd||s||||
008: 080623s2007    enk||| e          v|eng d
020:  ,   | $c Retail (S24.99) |
024: 3,   | $a 7321900108089 |
028: 4, 0 | $a BDY10808 | $b Warner Home Video |
029:  ,   | $a 7321900108089 |
082:  ,   | $a 812
245: 0, 0 | $a Goodfellas | $h [videorecording] / | $c directed by Martin Scorsese ; music by Christopher Brooks
260:  ,   | $b Warner Home Video, | $c 2007. |
300:  ,   | $a 1 Blu-Ray (139 min.) : | $b col. |
306:  ,   | $a 021900 |
366:  ,   | $b 20070611 |
511:  ,   | $a Starring Robert De Niro, Ray Liotta and Joe Pesci
521: 8,   | $a BBFC code: 18. |
538:  ,   | $a Blu-Ray. |
700: 1,   | $a Scorsese, Martin |
700: 1,   | $a Brooks, Christopher | $e music
852:  ,   | $b John Harvard | $c BLU-RAY DISC | $m 18 | $z , $z Blu Ray Disc. 18Cert
A contrived example (sorry!) with and without relator terms
Hope you can all read this at the back!
A closer look at Authority Matching Author
Some requirements: Author
 ...(able to process 2M records in several hours)
 requires accuracy
 must handle pseudonyms and variant spellings
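One way to picture matching against variant spellings is a lookup table keyed on a normalised form of every preferred and variant label. This is only a sketch under assumptions: the authority identifier `auth:1`, the function names and the normalisation rules are invented here, and exact-key lookup is a stand-in for whatever matching Prism actually performs.

```python
import re
import unicodedata


def name_key(name):
    """Normalise a heading: strip accents and punctuation, fold case and spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(re.sub(r"[^\w\s]", " ", without_marks.casefold()).split())


def build_index(authorities):
    """Map every normalised label (preferred or variant) to its authority id."""
    index = {}
    for authority_id, labels in authorities.items():
        for label in labels:
            index[name_key(label)] = authority_id
    return index


def match_name(index, heading):
    """Return the authority id for a heading, or None when nothing matches."""
    return index.get(name_key(heading))
```

A dictionary lookup per heading keeps the cost per record small, which matters for the throughput requirement above; accuracy then depends on how many variant labels the authority file supplies.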
You can tell J.K. Rowling is successful, she’s been translated lots
Language/Alternate Graphical Representation Extracting Data from MARC 21
Nice “high impact” feature Language

Linked Library Data in the wild

  • 2. Technical Lead for Prism Phil John Introductions...
  • 3. So, what’s Prism then? Introductions...
  • 4.
  • 5.
  • 6.
  • 7. a next generation discovery interface Prism Introductions
  • 8. (yes…even configuration settings) Built entirely on Linked Data Prism
  • 9. Discovery of library catalogue resources Prism but grander plans afoot...
  • 10.
  • 13. rare items/special collections
  • 14.
  • 15. MARC 21 RDF Performs data conversion Prism
  • 16. this ensures it keeps in sync with the LMS Initial “bulk” conversion then periodic “delta” files Prism
  • 17. provided by a suite of RESTful web services Borrower/Availability data pulled from LMS “live” Prism
  • 18. just add .rss to collections or .rdf/.nt/.ttl/.json to items Linked Data API Prism
  • 19.
  • 20.
  • 21.
  • 23. Extracting data from MARC 21 The Challenges
  • 24. Some quotes... Extracting Data from MARC 21 ...cataloguers may want to look away now
  • 25.
  • 26. ...and even if it does, there are millions of existing records that we’ll want to convert MARC 21 is not going away anytime soon... Extracting Data from MARC 21
  • 27.
  • 28. How are we approaching it? Extracting Data from MARC 21
  • 29. By tackling it in small chunks! Extracting Data from MARC 21
  • 30.
  • 31. compartmentalises code for different sections
  • 34.
  • 35. fires events when it encounters a MARC 21 data structure; very strict with syntax MARC 21 Parser Extracting Data from MARC 21
  • 36. listens for MARC 21 data structures and hands control over to one or more handlers Event Observer Extracting Data from MARC 21
  • 37. know how to convert MARC 21 structures and fields into linked data Bibliographic Handlers Extracting Data from MARC 21
  • 38. So, where are we up to? Extracting Data from MARC 21
  • 39. we tackled this one first as it allows us to reason more fully about the record Format (and duration) Extracting Data from MARC 21
  • 40. In theory quite easy... Format
  • 41.
  • 42. DVD and LaserDisc share(d) a code
  • 43. LC slow(ish) to support new formats in M21
  • 44. limited use of control field (007) codings...
  • 45.
  • 47. an important part of the record to model, or so I’ve been told Title Extracting Data from MARC 21
  • 48.
  • 49. ‡c must be last subfield in a 245...
  • 50. ...so sometimes data from ‡n / ‡p is in ‡c instead...
  • 51.
  • 52. Now with more title
  • 53. sounds easy...acronyms from EAN to UPC describing 13 digit codes...right? Identifier Extracting Data from MARC 21
  • 54. what are all those other things doing in the ‡a? ...STOP! Identifier
  • 55. Identifier “For a hardbound resource, there is no attempt to use a consistent term other than to use one that conveys the condition intelligibly.” Library of Congress Rule Interpretation 1.8
  • 56.
  • 57. (and then validate whatever’s left) So we need to parse them out Identifier
  • 58. LDR: 01425ngm a22005058 4504 001: 750785 003: xxxxxxx 005: 20090824164118.0 007: vd||s|||| 008: 080623s2007 enk||| e v|eng d 020: , | $c Retail (S24.99) | 024: 3, | $a 7321900108089 | 028: 4, 0 | $a BDY10808 | $b Warner Home Video | 029: , | $a 7321900108089 | 082: , | $a 812 245: 0, 0 | $a Goodfellas | $h [videorecording] / | $c directed by Martin Scorsese ; music by Christopher Brooks 260: , | $b Warner Home Video, | $c 2007. | 300: , | $a 1 Blu-Ray (139 min.) : | $b col. | 306: , | $a 021900 | 366: , | $b 20070611 | 511: , | $a Starring Robert De Niro, Ray Liotta and Joe Pesci 521: 8, | $a BBFC code: 18. | 538: , | $a Blu-Ray. | 700: 1, | $a Scorsese, Martin | 700: 1, | $a Brooks, Christopher | 852: , | $b John Harvard | $c BLU-RAY DISC | $m 18 | $z , $z Blu Ray Disc. 18Cert Phew, this one’s easy, no (pbk), (hbk) or even (pbk. , alk. paper) to contend with
  • 59. Now we can start performing lookups against other sources!
  • 60. hardest of the lot... Author Extracting Data from MARC 21
  • 61.
  • 62. Rowling, J.K. vs Rowling, Joanne K.
  • 63. Few records with relator term in 100/700 ‡e...
  • 64. ...so we have to parse that from the 245 ‡c...
  • 65.
  • 66. we’ve licensed the names/subjects authority files, and created RDF from them Library of Congress to the rescue! Author
  • 67. LDR: 01425ngm a22005058 4504 001: 750785 003: xxxxxxx 005: 20090824164118.0 007: vd||s|||| 008: 080623s2007 enk||| e v|eng d 020: , | $c Retail (S24.99) | 024: 3, | $a 7321900108089 | 028: 4, 0 | $a BDY10808 | $b Warner Home Video | 029: , | $a 7321900108089 | 082: , | $a 812 245: 0, 0 | $a Goodfellas | $h [videorecording] / | $c directed by Martin Scorsese ; music by Christopher Brooks 260: , | $b Warner Home Video, | $c 2007. | 300: , | $a 1 Blu-Ray (139 min.) : | $b col. | 306: , | $a 021900 | 366: , | $b 20070611 | 511: , | $a Starring Robert De Niro, Ray Liotta and Joe Pesci 521: 8, | $a BBFC code: 18. | 538: , | $a Blu-Ray. | 700: 1, | $a Scorsese, Martin | 700: 1, | $a Brooks, Christopher | $e music 852: , | $b John Harvard | $c BLU-RAY DISC | $m 18 | $z , $z Blu Ray Disc. 18Cert A contrived example (sorry!) with and without relator terms
  • 68. Hope you can all read this at the back!
  • 69. A closer look at Authority Matching Author
  • 70.
  • 71. ...(able to process 2M records in several hours)
  • 73.
  • 74. You can tell J.K. Rowling is successful, she’s been translated lots
  • 75. Language/Alternate Graphical Representation Extracting Data from MARC 21
  • 76.
  • 77. both forms can be searched for
  • 78.
  • 79. tagged with an ISO-639-2 language and masquerading as the field listed in ‡6 Passes 880s back into Observer Language
  • 81.
  • 82.
  • 83.
  • 84. it’s part of the reason we use Linked Data...but it’s got some challenges at the moment Using/Linking to External Datasets The Challenges
  • 85.
  • 86. ...or worse, is taken offline permanently?
  • 87. can we trust this data?
  • 88.
  • 89. ...or, if that’s not practical, proxy requests using a caching proxy such as Squid
  • 90. if using Wikipedia and worried about vandalism...
  • 91.
  • 92. ...or – what we’d like to see happen to Linked Library Data The Future...
  • 93. especially on the peripheries – authority data, author information, links to other resources More library data as LOD The Future
  • 94. seriously – this would make our lives so much simpler LMS vendors adopting LOD The Future
  • 95. LOD replacing MARC 21 as the standard representation of bibliographic records The Future
  • 96.
  • 97. Photo Credits Slide 15 - http://www.flickr.com/photos/gammaman/5241860326/ Slide 21 - http://www.flickr.com/photos/agizienski/3778965891/ Slide 40 - http://www.flickr.com/photos/54409200@N04/5070012761/ Slide 42 - http://www.flickr.com/photos/proimos/4199675334/ Slide 48 - http://www.flickr.com/photos/maveric2003/91198458/ Slide 63 - http://richard.cyganiak.de/2007/10/lod/ Slide 67 - http://www.flickr.com/photos/markchapmanphoto/5139429152/ Slide 72 - http://www.flickr.com/photos/-bast-/349497988/