Unleashing the Power of XSLT: Catalog Records in Batch
  1. Catalog Records in Batch
     Lucas Mak, Michigan State University Libraries
     Texas Library Association Annual Conference, April 24-27, 2013
  2. Agenda
     • Overview of XSLT & AutoIt
     • Case studies
       • Cataloging
         ○ Digitized monograph workflow
         ○ Spoken word recordings cataloging
       • Catalog maintenance
         ○ Multi-format records clean-up project
         ○ NAR and bib record update
     • Reflections
  3. Overview of XSLT
     • XSLT (Extensible Stylesheet Language Transformations)
     • Within the family of XML
       ○ Case sensitive
       ○ Current version: 2.0 (3.0 in draft)
       ○ Unicode compliant
       ○ File extension: .xsl
       ○ Requires matching start and end tags
     • “Transformation” means:
       ○ Manipulation of documents by creating a new document based on the original document
       ○ Output can be in XML, HTML, XHTML, text, etc.
     • Data-driven execution
       ○ Code is executed when a certain piece of data is encountered → unexpected outcomes are possible
  4. <template>
       ○ Contains a set of instructions to be executed when a template is called explicitly or invoked based on a matching node
       ○ Modularity & reusability
         - Multiple templates in an XSLT
         - Multiple XSLTs in a pipeline
     • Multiple inputs and outputs
       ○ Comparing data from multiple inputs
         - document()
         - key()
     • Common usages in library context
       ○ Web display, e.g. converting EAD into HTML for display
       ○ Metadata crosswalking
         - Data selection and manipulation
  5. Overview of AutoIt
     • http://www.autoitscript.com/site/autoit/
     • Freeware
     • Scripting language designed for automating the Windows GUI and general scripting
     • Simulated keystrokes, mouse movement, and window/control manipulation
     • Simple text manipulation (supports regular expressions), reading/writing text files
     • Sending HTTP requests
     • Opening/running/closing applications
  6. Automation
     • Execute multiple XSLTs in sequence with the Saxon-HE command-line XSLT processor
     • Link XSLT processes with other applications, e.g. MarcEdit
     • Read and compile TXT, XML, and XSLT files
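The sequencing idea on this slide can be sketched in Python rather than AutoIt. This is only an illustration under stated assumptions: the jar name `saxon-he.jar`, the file names, and the `stepN.xml` naming scheme are hypothetical, and the deck does not show the actual AutoIt script. The `-s:`, `-xsl:`, and `-o:` options are Saxon's command-line switches for source, stylesheet, and output.

```python
from pathlib import Path


def saxon_command(jar, source, stylesheet, output):
    """Build one Saxon command line as an argument list.

    -s: names the source document, -xsl: the stylesheet, -o: the output file.
    """
    return [
        "java", "-jar", str(jar),
        f"-s:{source}",
        f"-xsl:{stylesheet}",
        f"-o:{output}",
    ]


def pipeline_commands(jar, first_input, stylesheets, workdir):
    """Chain stylesheets so that each step reads the previous step's output."""
    commands = []
    current = Path(first_input)
    for step, stylesheet in enumerate(stylesheets, start=1):
        # Hypothetical naming scheme for intermediate files.
        output = Path(workdir) / f"step{step}.xml"
        commands.append(saxon_command(jar, current, stylesheet, output))
        current = output
    return commands
```

Each returned list could be handed to `subprocess.run(command, check=True)` in order, which would stop the chain at the first failing transformation.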
  7. Case study 1
  8. Background
     • Digital & Multimedia Center (DMC)
       ○ Media library & digitization service
     • Chiefly text digitization
     • Flatbed, overhead, planetary, slide, & sheetfed scanners
     • 2.5 FTE staff, 21 students
     • 100k – 150k pages scanned & processed per year
  9. Old Cataloging Workflow for Digitized Monographs
     • DMC: Mount digital files on web server
     • DMC: Send spreadsheet of titles and URLs to cataloging (DataCat)
     • DataCat:
       ○ Search title in catalog
       ○ Manually derive from print
       ○ Insert URL from spreadsheet
  10. Task Automation
     • Extraction of print version MARC records
       ○ List of bib record no.
       ○ Using AutoIt to build XML queries against the catalog XML server
         - Turning III XML into MarcXML by XSLT
     • Insertion of URLs into appropriate records
       ○ Finding a match point
       ○ Instead of recording a “title” & “URL” pair, record a “unique identifier” & “URL” pair
         - Unique identifier → bib record no.
  11. New Cataloging Workflow
     • DMC: Mount digital files on web server
     • DMC: Prepare a TXT file with bib no. & URL
     • Metadata Librarian:
       ○ Run AutoIt script to extract MARC records from Millennium/Sierra
       ○ Batch derive from print and insert URL by matching bib no.
  12. Design of XSLT
     • Processing logic
       ○ Derive electronic record from print record
       ○ Insert URL into electronic record by matching bib no. of the print record against an XML document with bib no./URL pairs
  13. Conversion Table
     • Structure of bib no./URL pair XML
         <pair>
           <bibNumber>b55612367</bibNumber>
           <url>http://archive.lib.msu.edu/DMC/AREP/AREP1.pdf</url>
         </pair>
         <pair>……
  14. Derive Template
     • Provider-neutral record
     • Copy data from print records without manipulation
       ○ e.g. 100, subject headings
       ○ Element: <xsl:copy-of>
     • Hard-code new data
       ○ 006, 007, 040, 049, 588
       ○ Element: <xsl:text>
     • Combine existing and new data to create new element
       ○ e.g. 008 (“o” into “form” byte), 090 (“Online” at the end of call #), 776 (main entry, title, imprint, 020, 010)
       ○ Element: <xsl:value-of>, <xsl:text>
       ○ Function: substring()
     • Copy URL from the 2nd XML file
       ○ Key bib number into 2nd XML file
       ○ Element: <xsl:key>, <xsl:variable>
       ○ Function: document()
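The keyed lookup into the second XML file (done in the deck with `<xsl:key>` and `document()`) can be illustrated outside XSLT. The following is a minimal Python sketch, not the presenter's stylesheet; the wrapping `<pairs>` root element and the function names are assumptions, since the slide shows only the `<pair>` elements.

```python
import xml.etree.ElementTree as ET

# Sample input. The <pairs> root element is assumed; the slide shows only <pair>.
PAIRS_XML = """<pairs>
  <pair>
    <bibNumber>b55612367</bibNumber>
    <url>http://archive.lib.msu.edu/DMC/AREP/AREP1.pdf</url>
  </pair>
</pairs>"""


def load_url_table(xml_text):
    """Index every <pair> by bib number, much like an xsl:key over the pair file."""
    root = ET.fromstring(xml_text)
    table = {}
    for pair in root.iter("pair"):
        bib_number = pair.findtext("bibNumber", default="").strip()
        url = pair.findtext("url", default="").strip()
        if bib_number:
            table[bib_number] = url
    return table


def url_for(bib_number, table):
    """Return the URL for a print record's bib number, or None if unmatched."""
    return table.get(bib_number.strip())
```

Returning `None` for an unmatched bib number makes it easy to report print records that have no URL instead of silently producing an electronic record without one.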
  15. Workflow (diagram)
     • Catalog → extraction → print records (III XML) → format conversion by XSLT → print records (MarcXML) → XSLT → PN e-monograph records
     • Bib no. & URLs → converted into XML by AutoIt → second input to the same XSLT
  16. Implementation Issues
     • Legacy data
       ○ 440 vs 490 & 830 pair
       ○ Descriptive rules & ISBD punctuation
       ○ Obsolete 1st/2nd indicators
  17. Case study 2
  18. Background
     • Vincent Voice Library
       ○ Over 40,000 hours of spoken word recordings
       ○ Open-reel tapes, cassettes, DAT tapes, digital files
       ○ Over 2TB of born-digital and digitized recordings (WAV format)
         - Provide MP3 for public domain/copyright-cleared items (e.g. pre-1923, presidential speeches, MSU provenance, etc.)
  19. Voice Library Cataloging
     • MySQL database
       ○ Digitization & inventory tracking
       ○ Students create a database record for each digital file, which includes summary, date of utterance, speaker names, recording source, format(s) available, etc.
     • Library catalog
       ○ Item-level cataloging
         - Cataloging of analog items stopped in the 1990s; analog item records suppressed since then
         - Cataloging of digital items started in the 2000s, based on database records
  20. Objective
     • Automate cataloging of digital files
     • Reformatted items
       ○ Derive from suppressed analog records if available
       ○ Create brief electronic records, from SQL database records, for items with no suppressed records in catalog
     • Born-digital items
       ○ Create brief electronic records from SQL database records
  21. Tasks
     • Extract XML records of digital items from MySQL database
     • Match XML records against existing records of analog items to create records for digital items
     • Convert into .mrc file using MarcEdit for loading into cataloging client
     • Automate and link all of the above steps with an AutoIt script
  22. Task #1: XML Records Extraction
     • PHP script written by a programmer
     • Extract XML records for digital items ready for cataloging (status = cataloging)
     • Each XML record includes:
       ○ Summary written by students doing digitization
       ○ Date of utterance
       ○ Recording source
       ○ Names of speakers
       ○ Database (DB) no. assigned to each digital file
         - If reformatted, both DB no. and analog no. (M no. for open-reel tapes, C no. for cassettes)
       ○ Running time (in seconds)
  23. Sample database XML record
         <vvl:Record>
           <vvl:id>2159</vvl:id>
           <vvl:vvl_number>01-0350-113</vvl:vvl_number>
           <vvl:copyright>Broadcast News</vvl:copyright>
           <vvl:main_speaker>Farrakhan, Louis</vvl:main_speaker>
           <vvl:additional_speakers/>
           <vvl:recording_source>CNN</vvl:recording_source>
           <vvl:summary>Minister Louis Farrakhan, Head of the Nation of Islam and organizer of the The Million Man March, concludes the gathering of social activists with a 2-1/2 hour speech. Held on and around the National Mall in Washington, D.C.</vvl:summary>
           <vvl:date_day>16</vvl:date_day>
           <vvl:date_month>October</vvl:date_month>
           <vvl:date_year>1995</vvl:date_year>
           <vvl:running_time>9067</vvl:running_time>
           <vvl:open-reel>
             <vvl:formatid>M5420 - M5421</vvl:formatid>
             <vvl:type>open-reel</vvl:type>
             <vvl:size>0</vvl:size>
           </vvl:open-reel>
           <vvl:wav>
             <vvl:formatid>DB2159</vvl:formatid>
             <vvl:type>wav</vvl:type>
             <vvl:size>870444032</vvl:size>
           </vvl:wav>
         </vvl:Record>
  24. Task #2: Records Matching
     • Match point
       ○ Analog item no.
         - <formatid> in database records
         - Call no. in analog version MARC records
     • Matching & deriving done by XSLT
       ○ Matched: derive from analog version MARC record
       ○ No match: create brief MARC record from database XML record
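The match/no-match decision on this slide can be sketched as a dictionary lookup. This is an illustrative Python sketch, not the XSLT used in the project; the normalization rule (drop whitespace, upper-case) and all function names are assumptions, and a range such as `M5420 - M5421` would need its own handling.

```python
import re


def normalize_item_no(value):
    """Drop all whitespace and upper-case, so 'm 5420' equals 'M5420'."""
    return re.sub(r"\s+", "", value).upper()


def index_analog_records(records):
    """records: iterable of (call_no, marc_record) pairs from the analog export."""
    return {normalize_item_no(call_no): record for call_no, record in records}


def choose_template(format_id, analog_index):
    """Pick the 'derive' path on a match, otherwise the 'create new' path."""
    record = analog_index.get(normalize_item_no(format_id))
    if record is None:
        return "create new", None
    return "derive", record
```

Normalizing both sides before comparing is one way to reduce the false no-matches caused by spacing or case differences between the database and the catalog.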
  25. “Derive” Template
     • Copy as is
       ○ e.g. 1XX, 7XX, 6XX, 518, etc.
     • Hard-code constant data
       ○ e.g. 006, 007, 588, etc.
     • Insert variable data
       ○ e.g. 776 (info from analog record), 099 (info from database record), 033
     • Copy with adjustments
       ○ e.g. 008 (form byte), 245 (GMD), 300
  26. “Create new” Template (i.e. no match)
     • 1XX, 7XX from <main_speaker>, <additional_speakers>
     • 245 from first sentence of <summary>
     • 520 from <summary>
     • 033 from <date_day>, <date_month>, & <date_year>
     • 518 from <recording_source> & date info
     • 099 from <formatid>
     • Hard-code MARC leader, 006, 007, 008 (except date(s)), etc.
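Two of the mappings above, the 245 from the first sentence of `<summary>` and the 033 from the three date elements, can be sketched as follows. This is a hedged Python illustration, not the presenter's XSLT; the function names are hypothetical, the sentence splitter is deliberately naive, and the 033 value is assumed to be the plain `yyyymmdd` form of subfield $a.

```python
import re

MONTHS = {
    name: number
    for number, name in enumerate(
        ["January", "February", "March", "April", "May", "June", "July",
         "August", "September", "October", "November", "December"],
        start=1,
    )
}


def first_sentence(summary):
    """Return the text up to the first '.', '!' or '?' followed by a space or the end."""
    text = " ".join(summary.split())  # collapse line breaks and repeated spaces
    match = re.match(r".*?[.!?](?=\s|$)", text)
    return match.group(0) if match else text


def date_033(day, month_name, year):
    """Format day, month name, and year as yyyymmdd."""
    return f"{int(year):04d}{MONTHS[month_name]:02d}{int(day):02d}"
```

A splitter this simple will cut early at abbreviations such as "Dr. King", so titles generated this way would still need a review pass.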
  27. Workflow (diagram)
     • SQL → extraction by PHP → digital records (VVL XML) → XSLT
     • Catalog → one-time extraction → analog records (MarcXML) → XSLT
     • XSLT → digital records (MarcXML) → format conversion by MarcEdit → digital records (.mrc)
  28. Benefits
     • Time saving
       ○ Eliminates manual searching for cataloged analog items in local catalog
       ○ Eliminates manual copy and paste from database to SkyRiver cataloging client
  29. Limitations
     • False no-match
       ○ Occasional discrepancy in analog item no. between database and catalog records
         - Typo
         - Digitization vs. cataloging practice
     • Heading updates
       ○ Headings updated after analog records were exported from catalog
       ○ Suppressed analog records → can’t do real-time lookup through XML server
  30. Case study 3
  31. Background
     • Thesis and dissertation cataloging at Michigan State University Libraries
     • Current practice: separate-record approach
     • Pre-2007 practice:
       ○ Mulvered records: print & microform on the same record
       ○ Either:
         - Cataloged print on OCLC and added microform info locally → no record for microform in OCLC
         - Cataloged print and microform separately on OCLC and merged two records into one locally
       ○ 7387 titles with mulvered records
  32. Summary of Main Characteristics
     MARC field   Characteristics
     001          1 (print) or 2 (print, microform)
     007          For microform
     008          Form of item (byte 23): blank
     099          Call no. for print
     245 $h       [paper, microform]
     533          Reproduction note for microform
     952          $a: Item record no. for print & microform; $b: Barcode
  33. Objective
     • Un-mulvering
       ○ One record for print, one record for microform
       ○ Record for print:
         - Turn mulvered record into print record by removing info pertaining only to microform (e.g. 533)
         - Overlay the original record
         - Delete item records pertaining to microform after overlay
       ○ Record for microform:
         - Create new microform bib record by transferring data pertaining only to microform to the new record (microform 001, 007, 533)
         - Copy, with or without modification, common data into the new record (008, 245, etc.)
         - Create new item record by copying info from original item record for microform
     • Process records by XSLT
  34. Workflow (diagram)
     • ILS → extract → mulvered records → XSLT
     • XSLT → fix up → print records → overlay existing records in ILS
     • XSLT → create → microform records → export as new records to ILS
  35. Roadblocks
     • Loss of items over the years
       ○ 46 titles with print only
       ○ 11 titles with microform only
       ○ Need to determine which title has what format(s)
     • Multiple microform formats
       ○ Some titles have both microfiche and microfilm formats → need to be split into 3 records (1 print, 1 microfiche, 1 microfilm)
  36. Loss of items
     • Determine available format by location code in item record
       ○ “mc” = Microfiche/Microfilm
       ○ “th” or other branch locations = Print
       ○ Most reliable, since 049 (location in bib) did not get updated when a format was lost/added
     • Bib records extracted from ILS do not contain item location
         =952 $a.i69490405$b31293027362833
         =952 $a.i69490417
  37. Export item info separately from ILS
  38. Merge item info into extracted bib by matching up item record no.
  39. Use item location info in MARC 952 to determine which record(s) to generate
     • “mc” & print location(s):
       ○ Print record (overlay)
       ○ Microform record (export as new)
     • “mc” & NO print locations:
       ○ Microform record (overlay)
     • Print location(s) & NO “mc”:
       ○ Print record (overlay)
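The three-way decision on this slide can be written as a small function. This is a Python sketch of the logic only, under the assumption that any location code other than “mc” counts as a print location; the function name and the tuple output are hypothetical.

```python
def records_to_generate(location_codes, microform_code="mc"):
    """Map a title's item location codes to (record type, load action) pairs."""
    codes = {code.strip().lower() for code in location_codes if code.strip()}
    has_microform = microform_code in codes
    # Assumption: every location other than "mc" ("th" or a branch) holds print.
    has_print = bool(codes - {microform_code})

    if has_microform and has_print:
        return [("print", "overlay"), ("microform", "export as new")]
    if has_microform:
        return [("microform", "overlay")]
    if has_print:
        return [("print", "overlay")]
    return []  # no item locations at all: nothing to generate
```

The empty-list case is not on the slide; it is included so that a title with no surviving items is reported rather than guessed at.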
  40. Revised Workflow (diagram)
     • ILS → extract → mulvered records; ILS → extract → item info
     • XSLT 1: merge item info into mulvered records
     • XSLT 2 → fix up → print records → overlay existing records in ILS
     • XSLT 2 → create → microform records → export as new records to ILS
  41. Multiple microform formats
     • Some titles have microfiche and microfilm on the same record
     • Similarity
       ○ Both microfiche and microfilm use “mc” as item location
     • Differences
       ○ Call# system
         - Microfiche: Goetz, S - 3 fiche
         - Microfilm: 24354 THS Microfilm
       ○ Two MARC 533 (reproduction note)
         - One for microfiche, one for microfilm
  42. Multiple microform formats (cont’d)
     • Sample 533 fields
       ○ Microfiche
           =533 $aMicrofiche.$bAnn Arbor, Mich.:$cUniversity Microfilms,$d1979.$e4 microfiche ; 11X 15cm.
       ○ Microfilm
           =533 $aMicrofilm.$bAnn Arbor, Mich. :$cUniversity Microfilms,$d 1973.$e1 microfilm reel ; 35 mm.
     • Solutions
       ○ Use the number of 533 fields to determine how many microform record(s) to generate
       ○ Use MARC 533$a to pull the appropriate call# from MARC 952
  43. Design of XSLT
     • Processing logic
       ○ Both print and microform available: go through the record twice*
         - 1st pass for print record
         - 2nd pass for microform record
         * When both microfiche & microfilm are available → 3 passes (print, fiche, film)
       ○ One format available: go through the record once (print/microform*)
         * When both microfiche and microfilm are available → 2 passes (one for microfiche, one for microfilm)
  44. 5 templates
       ○ Format determination template
       ○ Print data template
       ○ Microform templates
         - Microform only (949 overlay command)
         - Microform with print (949 item generation command)
       ○ Common data template
         - Data common to both formats
         - Reusable
  45. Format determination template
     1. Parse item location info in MARC 952 as a variable
     2. Determine which location(s), i.e. format(s), are available
     3. Invoke different combinations of the “print data”, “microform only”, and “microform (with print)” templates accordingly
  46. Print data template
     • Copy print 001, 049, 245 with adjustment
     • Copy 008, 099 as is
     • Generate 949 overlay command
     • Invoke “common data” template
  47. Microform data templates
     • Copy microform 001, 008, 245 with adjustment
       ○ 008: Add “a” (microfilm) or “b” (microfiche) in form byte (byte 23) based on 533$a
       ○ 245$h: Replace “[paper, microform]” with “[microform]”
     • Copy 007, 533 as is
     • Generate 049, 099
       ○ 099: Copy call# from item info stored in 952
     • 949 command
       ○ Overlay command if microform is the only available format
       ○ Item record generation if print is also available
     • Invoke “common data” template
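The two adjustments named on this slide, setting the 008 form byte from 533$a and rewriting 245$h, can be sketched in Python. This is an illustration, not the project's XSLT (which would use `substring()` and similar functions); the function names are hypothetical, and the 40-character check assumes a bibliographic 008.

```python
# MARC 008/23 (form of item) codes used on the slide.
FORM_OF_ITEM = {"microfilm": "a", "microfiche": "b"}


def form_code_from_533a(subfield_a):
    """Map a 533 $a value such as 'Microfiche.' to its form-of-item code."""
    key = subfield_a.strip().rstrip(".").strip().lower()
    return FORM_OF_ITEM[key]  # raises KeyError for an unexpected reproduction type


def microform_008(field_008, subfield_533a, position=23):
    """Return the 008 with the form byte replaced; all other bytes are untouched."""
    if len(field_008) != 40:
        raise ValueError("a bibliographic 008 must be 40 characters long")
    code = form_code_from_533a(subfield_533a)
    return field_008[:position] + code + field_008[position + 1:]


def microform_245h(subfield_h):
    """Replace the mulvered GMD with the microform-only GMD."""
    return subfield_h.replace("[paper, microform]", "[microform]")
```

Letting an unrecognized 533$a raise an error, rather than defaulting to a code, surfaces legacy data problems before the records are loaded.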
  48. Common data template
     • Copy fields not touched by other templates, mostly without adjustment
       ○ e.g. leader, subject headings (6XX), thesis note (502), imprint date (260$c), physical dimension (300), etc.
  49. Results
     • Total mulvered: 7387 records
     • Total un-mulvered: 14722 records
       ○ Microform: 7346 records
       ○ Print: 7376 records
  50. Case study 4
  51. Acronyms
     • NAR (Name Authority Record)
     • LC/NACO NAF (Name Authority File)
     • BFM (Bibliographic File Maintenance)
       ○ Heading/authorized access point updates in bib records
     • SRU (Search/Retrieval via URL)
       ○ HTTP request
  52. LC/NACO NAF
     • Dynamic file
       ○ Contributions from over 700 NACO participants
       ○ Updated every day with new and changed NARs from NACO nodes
     • Full nodes
       ○ British Library, OCLC, SkyRiver
     • Contribution-only node
       ○ National Library of Medicine
  53. LC/NACO NAF Maintenance (diagram)
     • LC database → distribution to BL, OCLC, SkyRiver
  54. Authority Control at MSU
     • In-house
       ○ NACO institution
       ○ Database maintenance
     • Post-cataloging authority control
       ○ New Headings Report
         - Download NARs from SkyRiver
     • Updates to NARs not necessarily caught
       ○ 1XX (no item cataloged under the changed 1XX, so not in New Headings Report)
       ○ Elements other than 1XX (e.g. 4XX, 670)
  55. LC/NACO NAF RDA Transition
     • PCC Day 1 for RDA NAR: Mar. 31, 2013
     • PCC Task Group on AACR2 & RDA Acceptable Heading Categories (Aug 2011)
       ○ 225,000 NARs with 1XX not usable in RDA bib records
       ○ 172,000 NARs with 1XX usable in RDA bib records after batch manipulation by software
       ○ 7,631,00 NARs with 1XX usable in RDA bib records as they are and can be recoded as RDA
  56. Phased reissuance of NARs
     • Phase 1
       ○ Scope
         - NARs with characteristics known to be at variance with RDA practice
         - Not candidates for any of the mechanical changes to be made during phase 2
       ○ Adding a 667 note “THIS 1XX FIELD CANNOT BE USED UNDER RDA UNTIL THIS RECORD HAS BEEN REVIEWED AND/OR UPDATED”
         - Completed Aug. 20, 2012 (436,943 records processed)
     • Phase 2
       ○ Programmatic changes to 1XX headings that are not acceptable under RDA (e.g., changes to Bible headings, spelling out Dept. and months, etc., abbreviations in the subfield $d for personal names)
       ○ Completed March 27, 2013 (371,942 records changed)
  57. Updates of NARs by NACO institutions
     • Reviewing, upgrading, and recoding Phase 1 records to RDA
     • Upgrading and recoding non-Phase 1 records to RDA
     • Adding any of the 17 new MARC fields (e.g. 046, 372, etc.)
     • Routine NAR maintenance
       ○ PCC post-RDA test guidelines “strongly encourage” evaluating and recoding the “RDA-acceptable AACR2 NARs” to RDA whenever possible
  58. Objectives
     • Catch changes to NARs
       ○ Changes in 1XX
       ○ Addition, deletion, or updates of elements other than 1XX
     • Perform related BFM if 1XX in a NAR is changed
  59. Tasks
     • Download NARs one-by-one/in bulk
     • Detect updates to NARs already existing in Millennium
     • Overlay existing NARs with updated ones
     • Update headings in bib records if 1XX in NAR is updated
     • Automate and link up the above tasks
  60. Task #1: Download NARs
     • OCLC LCNAF SRU Service
     • Pros
       ○ Multiple indexes (LCCN, names, dates, etc.)
       ○ Available in multiple schemas, including MARCXML
       ○ One-by-one or bulk download*
       ○ SRU-based service (HTTP request)
       ○ FREE!!
     • Cons
       ○ Updated every Monday night
       ○ OAI-PMH service is not available, though there is an index for OAI identifier
       ○ * Bulk download: by search term (e.g. after a certain date)
  61. Task #1: Download NARs (cont’d)
     • Implementation
       ○ Search LCCNs one-by-one by AutoIt script
         - Around 10 records/sec. retrieved
       ○ Download XML files into one folder (files named by LCCN)
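Compiling one SRU request per LCCN can be sketched as plain URL construction. The query parameters below (`version`, `operation`, `query`, `maximumRecords`, `recordSchema`) are the standard SRU searchRetrieve parameters; the base URL, index name, and schema name in the usage note are placeholders, not the real OCLC LCNAF SRU values, and the file-naming helper is an assumption based on "files named by LCCN".

```python
from urllib.parse import urlencode


def sru_search_url(base_url, index, term, schema, max_records=1):
    """Build an SRU searchRetrieve URL for an exact match on one index."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": f'{index}="{term}"',  # CQL: index = "term"
        "maximumRecords": str(max_records),
        "recordSchema": schema,
    }
    return base_url + "?" + urlencode(params)


def output_name(lccn):
    """Name the downloaded file after the LCCN, with internal spaces removed."""
    return "".join(lccn.split()) + ".xml"
```

For example, `sru_search_url("http://example.org/sru", "example.lccn", "n79021164", "marcxml")` yields a URL that an HTTP client could fetch once per LCCN in the list, saving each response under `output_name(lccn)`.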
  62. Task #2: NAR Update Detection
     • Compare NARs from Millennium and NARs from LC/NACO NAF by XSLT
       ○ MARC 005 (timestamp)
       ○ If the timestamp is more current on the NAR from NAF → overlay the NAR in Millennium
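The timestamp test can be sketched as a string comparison, since MARC 005 values have the fixed form `yyyymmddhhmmss.f` and therefore sort chronologically. This Python sketch is not the XSLT used in the project; the function name is hypothetical, and treating a missing local 005 as "overlay" is an assumption drawn from the later note that many Millennium NARs lack field 005.

```python
def needs_overlay(local_005, naf_005):
    """Decide whether the NAF copy of a NAR should overlay the local copy.

    Both arguments are MARC 005 strings (yyyymmddhhmmss.f) or None.
    """
    if not local_005:
        return True   # assumption: no local timestamp, so take the NAF copy
    if not naf_005:
        return False  # nothing to compare against; leave the local NAR alone
    # Fixed-width timestamps compare correctly as plain strings.
    return naf_005.strip() > local_005.strip()
```

Equal timestamps return `False`, so an unchanged NAR is never re-exported.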
  63. Task #3: Export/Overlay of NARs
     • MarcEdit
       ○ Export updated NARs into Millennium
       ○ Through TCP/IP (host address, port, .mrc file)
         - Same as export from OCLC Connexion or SkyRiver
       ○ One-by-one (though a .mrc file can contain multiple NARs)
  64. Task #4: Updates of Bib Headings
     • XSLT
       ○ Detect changes in 1XX between old and new NARs
       ○ Build heading conversion table (a TXT file) when 1XX is changed
     • AutoIt
       ○ Automate bib heading updates with the “Global Update” module in Millennium
         - Read old and new headings from the TXT file and fill out info required in the “Global Update” process
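Building the heading conversion table can be sketched as a comparison of 1XX strings keyed by LCCN. This is an illustrative Python sketch rather than the XSLT described above; the tab-delimited layout, the function names, and the sample headings used in testing are assumptions, since the deck does not show the TXT file's format.

```python
def heading_changes(old_1xx, new_1xx):
    """Compare 1XX headings keyed by LCCN and list the ones that changed.

    Both arguments map LCCN -> 1XX heading string.
    """
    changes = []
    for lccn in sorted(old_1xx):
        new_heading = new_1xx.get(lccn)
        if new_heading is not None and new_heading != old_1xx[lccn]:
            changes.append((lccn, old_1xx[lccn], new_heading))
    return changes


def conversion_table(changes):
    """Render one 'old<TAB>new' line per changed heading (assumed layout)."""
    return "\n".join(f"{old}\t{new}" for _lccn, old, new in changes)
```

The comparison here is an exact string match, so differences in diacritics encoding or spacing would register as changes unless the headings are normalized first.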
  65. Task #5: Automation
     • Use AutoIt to:
       ○ Link up various steps in the workflow
       ○ Automate searching against OCLC LCNAF SRU Service by compiling and sending HTTP requests
       ○ Execute various XSLTs in a predetermined sequence
         - e.g. NAR comparison → heading comparison
       ○ Read TXT files (LCCN list, heading conversion table) created by XSLT processes
       ○ Run MarcEdit to overlay obsolete NARs
       ○ Execute “Global Update” process
  66. Basic Workflow (diagram)
     • Millennium → extract by Create Lists → Millennium NARs → extract by XSLT → LCCNs
     • LCCNs → search by AutoIt → retrieve LC/NACO NARs
     • Millennium NARs + LC/NACO NARs → compare by XSLT → updated NARs → overlay by MarcEdit → Millennium
     • Compare by XSLT → updated headings → Global Update → Millennium
  67. Test Results
     • 82,398 NARs tested
     • 81,362 NARs needed to be overlaid*
     • 4,584 headings became obsolete
     • 10,900 bib records had at least one heading flipped
     * Many NARs exported from Millennium do not contain field 005 → overlaying those will save comparison time down the road
  68. Limitations
     • Identities broken out from undifferentiated NARs can’t be detected
       ○ Partially taken care of by “New Headings Report”
     • Headings with diacritics
       ○ Code points & exact match in Global Update
     • Headings in field 880
     • Slow export using MarcEdit Data Exchange module
     • Slow “Global Update” process
     • Wrong indicators put in by AutoIt during Global Update (though correct in conversion table)
     • “Java heap space” out-of-memory error
  69. Reflections
     • Low tolerance to differences between specified pattern and target
       ○ Case-sensitive
         - Normalization needed
         - Implication on processing time
       ○ RegEx helps specify a range of patterns
       ○ Exceptions: matches(), replace(), tokenize()
     • Data encoding
       ○ Diacritics, non-ASCII characters
     • Data consistency
       ○ Extra conditions/steps needed to account for exceptions
       ○ Legacy data, incorrect data
       ○ Pre-processing clean-up vs. on-the-fly clean-up
       ○ Familiarity with source data → normal pattern & exceptions
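The normalization point above can be made concrete with a short Python sketch. It is one possible normalization, not the one used in the case studies: Unicode NFC composition, whitespace collapsing, and case folding are the assumed steps, and the function names are hypothetical.

```python
import re
import unicodedata


def normalize_heading(text):
    """Normalize a heading so that encoding, spacing, and case differences vanish."""
    text = unicodedata.normalize("NFC", text)  # compose base letter + combining mark
    text = re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace
    return text.casefold()                     # case-insensitive comparison key


def same_heading(left, right):
    """Compare two headings after normalization."""
    return normalize_heading(left) == normalize_heading(right)
```

Running every comparison through a step like this adds processing time, which is the trade-off the slide notes, so it may be cheaper to normalize once in a pre-processing pass than on the fly.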
  70. Reflections (cont’d)
     • Unique identifiers vital for matching
     • Full automation vs. semi-automation
       ○ Integration with other scripting languages for full automation of ongoing workflow
     • Demanding on computing power
       ○ Multi-step matching
       ○ Processing multiple documents
       ○ Large XML files
         - <xsl:stream> in XSLT 3.0
     • Can’t create records out of thin air
  71. Lucas Mak
     Metadata & Catalog Librarian
     Michigan State University Libraries
     makw@mail.lib.msu.edu
