Data Wrangling: MSCS View from the trenches

Presentation slides from MSCS Systems Librarian Sara Amato's ALA ALCTS Pre-Conference presentation in Chicago, IL on June 27, 2013.

Usage Rights

© All Rights Reserved

  • It is easy to say “we want detailed subject analysis and title lists,” but if you don't have the staff time to review them, does it really matter? Try to have a clear picture BEFORE starting the project. (Data can go stale; interesting vs. actionable data.)
  • Can you get what you want in a way that is meaningful to the vendor / programmer?
  • Do you have enough for it to have value? (Some question whether it has value at all.) Did it get checked out to processing? Another example is getting lists of barcodes into a review file: we ran into this where odd internal-use data sat in different fields. Do you really want to rely on it? That 1980s WordPerfect manual vs. Portuguese poetry.
  • You've decided what you want and you've pulled all your data … and ?? Do you know how it's going to be processed?
  • Variations in cataloging practices over time and space. Lots of oddities: no 260, no 001, multiple 001s …
  • Internal-use circ in a different field; different catalog
  • Sent data to three different places (again, document what went where!)
  • Data is messy. Nothing is ever perfect. Please do not despair.

Data Wrangling: MSCS View from the trenches: Presentation Transcript

  • Data Wrangling: MSCS View from the trenches. What we've learned; where we failed; how we succeeded
  • You do what?
      • Liaising between tech services, project team, and vendors on data manipulation and display
      • Skills:
          − MARC and ILS data migration/manipulation
          − Nitty-gritty details: hows and whys
          − Knowledge sharing between partners
          − Investigations and implementations
          − Project management
          − Meeting management
  • Data driven? Start at the end!
      • What do you really want to know?
      • Do you have the data to answer that?
      • What are you going to do with the data?
      • What is interesting vs. what is actionable?
      • Test out your theories!!
  • We Needed Data
  • Data driven? Start at the end!
      • Comparisons across institutions: match points
      • Started with an OCLC reclamation project

      Institution   Records Sent   Returned Unresolved   Updated OCLC #
      Ursus            2,100,299                13,232          171,474
      Colby              474,438                   373           26,334
      Bowdoin            624,164                                 37,848
      Bates              656,926                                 25,101
      TOTALS           3,855,827                13,605          260,757
  • Start at the end... if you're ordering out
      • Think about what you want to get back; make sure it goes out.
      • HOW will you deal with returned data?
      • Can all the partners do the same things in terms of processing?
  • Lists, lists, lists! What will you in/exclude if you are extracting:
      • types: gov docs, serials, media, e-resources
      • locations: ref, off-site, reserve, special collections
      • status: billed, missing, suppressed, withdrawn (!)
      • use: circ, internal use, reserves
    What constitutes a circulating copy? How are the above encoded? Can you get what you want?
  • Circ Data
      • How long has it been retained?
      • Any tech processing that included circing?
      • Has it ever been cleared?
      • (… and what does it really tell you ...)
  • Know your vendor / programmer
      • What exactly is going to happen to the data, and what will be in(ex)cluded?
      • Leader bib level m, s
      • Gov doc? (008/28)?
      • Printed material? Media?
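The Leader and 008 checks named on this slide can be tried on your own records before the vendor does it for you. The following is a minimal sketch, not the project's or any vendor's actual logic: the function names are hypothetical, records are assumed to be available as plain Leader and 008 strings, and treating blank, `|`, and `u` in 008/28 as "not confirmed as a gov doc" is an assumption that applies to formats (such as books and continuing resources) where 008/28 is the government publication position.

```python
def bib_level(leader: str) -> str:
    """Return Leader/07 (bibliographic level), e.g. 'm' or 's'; '' if too short."""
    return leader[7] if len(leader) > 7 else ""


def gov_pub_code(field_008: str) -> str:
    """Return 008/28 (government publication code); '' if the field is too short."""
    return field_008[28] if len(field_008) > 28 else ""


def classify(leader: str, field_008: str) -> dict:
    """Summarize the checks a vendor might apply to in/exclude a record."""
    level = bib_level(leader)
    code = gov_pub_code(field_008)
    return {
        "monograph": level == "m",
        "serial": level == "s",
        # Assumption: blank, fill character, and 'u' (unknown) are not
        # treated as confirmed government publications.
        "gov_doc": code not in ("", " ", "|", "u"),
    }
```

Running a check like this over a sample of each partner's records, and comparing the counts with what the vendor reports, is one way to confirm that everyone means the same thing by "monograph" or "gov doc".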
  • So, you think you know your data...
  • Can you get it out? Export Tables
      • What exactly is exported?
      • What do they do with weird data? (b b, b 930)
      • Do they add any data? v.v.29, OCLC prefix
      • Formats of dates
  • Your data may vary: 35109002285482 vs. 3510900228549
  • Document!!! REALLY!!!
      • Export tables and field mappings
      • Locations
      • List creation criteria
      • Record ranges exported and dates
      • Files
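The two barcodes on this slide differ only in length (14 digits vs. 13), which is easy to miss by eye and easy to catch in code. A minimal sketch, assuming that 14 digits is the expected barcode format locally; the pattern and the function name are illustrative and would need adjusting to each institution's scheme:

```python
import re

# Assumption: a well-formed barcode is exactly 14 digits.
EXPECTED_BARCODE = re.compile(r"\d{14}")


def suspect_barcodes(barcodes):
    """Return the barcodes that do not match the expected pattern."""
    return [b for b in barcodes if not EXPECTED_BARCODE.fullmatch(b.strip())]
```

For example, `suspect_barcodes(["35109002285482", "3510900228549"])` flags only the second value, which is one digit short.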
  • … a few of the ugly things we saw...
      • Multiple fields used for internal use (INTL USE, COPY USE, and IUSE3)
      • Records with multiple 001s
      • Records with multiple barcodes, duplicate barcodes, bound-with items
      • Barcodes in 949 not 'b'
      • Records with no 260
      • 3 0000003 ocm3 3_
  • Your data through different lenses. Points of departure:
      • Merged 001s
      • FRBR
      • Volume vs. title counts
      • Unique vs. holdings counts
      • Date of data used
      • Definition of public domain
  • When things go wrong: MarcEdit is your friend!
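Several of these oddities (multiple 001s, no 260, duplicate barcodes) can be counted before the data goes out. This is a rough sketch under stated assumptions: each record is represented as a list of `(tag, value)` tuples, which is a hypothetical simplification rather than any particular library's record object, and the function names are illustrative.

```python
from collections import Counter


def audit_record(fields):
    """Return a list of structural problems found in one record.

    `fields` is assumed to be a list of (tag, value) tuples.
    """
    tags = Counter(tag for tag, _ in fields)
    problems = []
    if tags["001"] == 0:
        problems.append("no 001")
    elif tags["001"] > 1:
        problems.append("multiple 001s")
    if tags["260"] == 0:
        problems.append("no 260")
    return problems


def duplicate_barcodes(barcodes):
    """Return, in sorted order, every barcode that appears more than once."""
    counts = Counter(barcodes)
    return sorted(b for b, n in counts.items() if n > 1)
```

A report built from these two functions gives the partners a shared list of what to fix, or at least what to document, before records are sent anywhere.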
  • One more reason to thank Terry Reese:
        SELECT T0xx.field_data
        FROM T0xx, T9xx
        WHERE T9xx.field = '945'
          AND T9xx.subfield = 'f'
          AND T9xx.field_data > 0
          AND T0xx.cid = T9xx.cid
          AND T0xx.field = '001'
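To see what the query on this slide does, it can be run against a toy database. The sketch below is an illustration only: the two-table layout is inferred from the column names in the query and is not a documented schema, the sample rows are invented, and the query is adapted slightly (an explicit JOIN, and a CAST so that the `> 0` comparison is numeric in SQLite).

```python
import sqlite3

# Toy tables whose columns are inferred from the slide's query (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE T0xx (cid INTEGER, field TEXT, field_data TEXT);
CREATE TABLE T9xx (cid INTEGER, field TEXT, subfield TEXT, field_data TEXT);
-- Invented sample rows: record 1 has a nonzero 945 $f, record 2 has zero.
INSERT INTO T0xx VALUES (1, '001', 'rec-one'), (2, '001', 'rec-two');
INSERT INTO T9xx VALUES (1, '945', 'f', '3'), (2, '945', 'f', '0');
""")

# Adapted from the slide: return the 001 of every record whose 945 $f is > 0.
QUERY = """
SELECT T0xx.field_data
FROM T0xx
JOIN T9xx ON T0xx.cid = T9xx.cid
WHERE T0xx.field = '001'
  AND T9xx.field = '945'
  AND T9xx.subfield = 'f'
  AND CAST(T9xx.field_data AS INTEGER) > 0
"""

matching_ids = [row[0] for row in conn.execute(QUERY)]
```

The join on `cid` is what ties a control number in one table to a subfield value in the other, so the result is a list of 001 values for records that meet the 945 $f condition.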
  • Data Wrangling: MSCS Side. Closing Haiku:
        Data is messy
        While it can be normalized
        Nothing is perfect