SCLENDS dedupping project

  1. SCLENDs Dedupping Project
     Rogan Hamby
     2010-07-28
  2. Obligatory Obligations*
     This is a new version of the slides I used at a Cataloging Workgroup meeting on 2010-07-21. I've tried to clean it up to read better without me talking over it, but it still carries most of the faults of being meant as a speaking aide rather than being self-explanatory, and, paradoxically, it is very text heavy as well. A few tweaks have been contributed for clarity. All faults are purely mine.
     * You'll find scattered footnotes in here. I apologize in advance. Really, you can skip them if you want to.
  3. On Made-Up Words
     When I say 'dedupping' I mean 'MARC de-duplication'.
  4. Schrödinger's MARC
     MARC records are simultaneously perfect and horrible, and they only acquire one state once we start using them.
     'Bad' or 'idiosyncratic' records often exist because of valid decisions made in the past that are no longer viable under a strictly MARC-centric ILS and consortial cohabitation in the catalog.
  5. It's Dead, Jim*
     'Idiosyncratic' records and the natural variety among MARC records hampered the deduplication process during the original migrations and database merges.
     * The slide title is a reference to Schrödinger's cat as a MARC record: it has attained a single state now that it's in use. If you don't get it, that's OK. I'm a geek and should get out more.
  6. The Problem
     The result is a messy database that is reflected in the catalog. Searching the OPAC felt more like an obscure, and maybe arcane, process than we were comfortable with.
  7. Time for the Cleaning Gloves
     In March 2009 we began discussing the issue with ESI. The low merging rate was due to the very precise and conservative fingerprinting of the dedupping process. In true open source spirit we decided to roll our own solution and start cleaning up the database.
  8. A Disclaimer
     The dedupping as it was originally performed was not incorrect or wrong in any way. It put a strong emphasis on avoiding wrong or imprecise (edition) matches, which are almost inevitable with looser fingerprinting. We decided that we had different priorities and were willing to make compromises.
  9. Project Goals
     Improve Searching
     Faster Holds Filling
     (maybe) Reduce ICL costs
  10. Scope of Dedupping
      2,048,936 bib records
      Shasta & Lynn worked with the CatWoG.
      Rogan joined to look at doing some modeling and translating the project into production.
  11. On Changes
      "I watch the ripples change their size / But never leave the stream"
      – David Bowie, Changes
      The practical challenges meant that a lot changed from the early discussion to development. We weighted decisions heavily on the side of needing to have a significant and practical impact.
  12. Two Types of Match Points*
      Limiting match points – these create a basis for matches and exclude potential matches.
      Additive match points – these are not required but create additional matches.
      * These are terms I use to differentiate between two kinds of logistical match points you have to make decisions about. I have no idea if anyone else uses similar terms for the same principles.
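
A minimal sketch of the distinction, with hypothetical field names (this is not the SCLENDS production code): limiting match points are required for a record to participate at all, while additive match points only widen the set of matches it can make.

```python
# Minimal sketch of limiting vs. additive match points, using hypothetical
# field names; this is not the SCLENDS production code.

def fingerprints(record, limiting_fields, additive_fields):
    """Return the match keys one record contributes, or [] if it is excluded."""
    required = tuple(record.get(f) for f in limiting_fields)
    if not all(required):
        return []                         # missing a limiting match point: excluded
    keys = [required]                     # the base fingerprint (limiting fields only)
    for f in additive_fields:             # optional extras that create more matches
        if record.get(f):
            keys.append(required + (f, record[f]))
    return keys
```
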
  13. Modeling the Data part 1
      The match points you choose determine the scope of the record set you can create merges from.
      Because of the lack of uniformity in the records, match point selection became extremely important. Adding a single extra limiting match point caused large percentage drops in possible matches, reducing the effectiveness of the project.
  14. Tilting at Windmills
      We refused to believe that dedupping is something that must be done to minimal effect, where minimizing bad merges is the highest priority.
      Many said we were a bit mad. Fortunately, we took it as a compliment.*
      * Cervantes was actually reacting against what he saw as prevailing custom when he wrote Don Quixote and ended up with brilliant literature. He was also bitter and jealous, but we'll gloss over that part. We were hoping to be more like the first part.
  15. Modeling the Data part 2
      We agreed upon only two match points: title and ISBN.
      This excluded a large number of records by requiring both a valid title and a valid ISBN entry.
      Records with ISBNs and titles accounted for ~1,200,000 of the over 2 million bib records in the system.
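
As a rough sketch of that grouping step, assuming records have already been normalized down to simple title and ISBN values (hypothetical keys, not the actual Equinox implementation):

```python
from collections import defaultdict

def group_candidates(records):
    """Group bibs by (normalized title, normalized ISBN).

    Records lacking either limiting match point never get a fingerprint,
    which is how the candidate pool drops to the ~1.2 million bibs that
    carry both a valid title and a valid ISBN.
    """
    groups = defaultdict(list)
    for rec in records:
        title, isbn = rec.get("title"), rec.get("isbn")
        if not title or not isbn:
            continue                      # excluded: missing a limiting match point
        groups[(title, isbn)].append(rec)
    # Only groups with more than one member are merge candidates.
    return {key: recs for key, recs in groups.items() if len(recs) > 1}
```
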
  16. What Was Left Behind
      The excluded records include many that do not have valid ISBNs, including those that have SuDoc numbers, ISSNs, pre-cats, etc.
      Also excluded were a significant number of potential matches that might have been made using additive match points.
  17. The Importance of Being Earnest
      We were absolutely confident that we could not achieve a high level of matching with extra limiting match points.
      We chose not to include additional merging (additive) match points because we could easily overreach.
      Based on modeling, we estimated a conservative ~300,000 merges, or about 15% of our ISBNs.
  18. The Wisdom of Crowds
      Conventional wisdom said that MARC records could not be generalized, despite the presence of supposedly unique information in the records.
      We were taking risks and were very aware of it, but the need to make a large impact on our database drove us to disregard the friendly warnings.
  19. An Imperfect World
      We knew that we would miss things that could potentially be merged.
      We knew that we would create some bad merges when there were bad records.*
      10% wrong to get it 90% done.
      * GIGO = Garbage In, Garbage Out
  20. Next Step … Normalization
      With matching decided, we needed to normalize the data. This was done to copies of the production MARC records, and those copies were used to make the match lists.
      Normalization is needed because of variability in how data was entered. It allows us to get the most possible matches out of the data.
  21. Normalization Details
      We normalized case, punctuation, numbers, non-Roman characters, and trailing and leading spaces; stripped some GMDs entered as parts of titles; redacted fields; expressed 10-digit ISBNs as 13-digit; and lots, lots more.
      This was not done to the permanent records but to copies used to make the lists.
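
For illustration only, a couple of those normalization steps could look like the sketch below; the field handling is simplified and the helpers are hypothetical stand-ins, not the code that was actually run.

```python
import re
import unicodedata

# Hypothetical sketch of two normalization steps: title cleanup and
# expressing 10-digit ISBNs as 13-digit. Simplified for illustration.

GMD = re.compile(r"\[(videorecording|sound recording|electronic resource)\]")

def normalize_title(title):
    """Lowercase a title and strip GMDs, punctuation, and extra spaces."""
    title = unicodedata.normalize("NFKD", title)   # decompose accented characters
    title = GMD.sub("", title.lower())             # drop GMDs embedded in the title
    title = re.sub(r"[^a-z0-9\s]", "", title)      # drop punctuation and diacritics
    return re.sub(r"\s+", " ", title).strip()      # collapse whitespace

def normalize_isbn(isbn):
    """Strip qualifiers and hyphens; convert ISBN-10 to ISBN-13."""
    isbn = re.sub(r"[^0-9Xx]", "", isbn.split("(")[0])
    if len(isbn) == 10:
        core = "978" + isbn[:9]                    # drop the old check digit
        total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
        return core + str((10 - total % 10) % 10)  # new ISBN-13 check digit
    return isbn.upper()
```
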
  22. Weighting
      Finally, we had to weight the records that had been matched to determine which one should be the record to keep. To do this, each bib record is given a score to show its quality.
  23. The Weighting Criteria
      We looked at the presence, length, and number of entries in the 003, 02X, 24X, 300, 260$b, 100, 010, 500s, 440, 490, 830s, 7XX, 9XX and 59X to manipulate, add to, subtract from, bludgeon, poke and eventually determine a 24-digit number that would represent the quality of a bib record.*
      * While not complete, this is mostly accurate.
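
A hypothetical miniature of that idea, where each criterion fills a fixed-width slot in one long number (the real criteria, their order, and the digit layout belong to the production code):

```python
# Hypothetical miniature of the quality score: each criterion contributes
# a fixed-width, zero-padded segment, so concatenating the segments ranks
# records by the most significant criterion first.

def score_record(rec):
    """Build an illustrative quality score from a few MARC-ish criteria."""
    notes = rec.get("500", [])                        # 5XX note fields
    added_entries = rec.get("7XX", [])                # added entries
    physical_desc = (rec.get("300") or [""])[0]       # physical description
    segments = [
        f"{(1 if rec.get('010') else 0):01d}",        # LCCN present?
        f"{min(len(added_entries), 99):02d}",         # how many added entries
        f"{min(len(notes), 99):02d}",                 # how many notes
        f"{min(len(physical_desc), 999):03d}",        # length of the 300 field
    ]
    return int("".join(segments))
```
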
  24. The Merging
      Once the weighting is done, the highest-scored record in each group (which should all describe the same item) is made the master record, the copies from the other bibs are moved to it, and those bibs are marked deleted. Holds move with the copies, and holds can then be retargeted, allowing backlogged holds to fill.
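
A pure-Python outline of that merge logic, assuming each bib is a simple dict carrying its copies; in production this all happens in the database, including the hold retargeting.

```python
# Outline only: in production the merge is done against the database,
# not on in-memory dicts.

def merge_group(group, score):
    """Keep the highest-scoring bib in a match group and fold the rest into it."""
    master = max(group, key=score)
    for bib in group:
        if bib is master:
            continue
        master.setdefault("copies", []).extend(bib.pop("copies", []))  # copies follow the master
        bib["deleted"] = True                                          # losing bib marked deleted
    # Holds travel with their copies; retargeting afterwards lets
    # backlogged holds start filling against the surviving record.
    return master
```
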
  25. The Coding
      We contracted with Equinox to develop the code and run it against our test environment (and eventually production). Galen Charlton was our primary contact; aside from excellent work, he also provided wonderful feedback about additional criteria to include in the weighting and normalization.
  26. Test Server
      Once the process had been run on the test server, we took the new batches of records and broke them into 50,000-record chunks. We then gave those chunks to member libraries and had them do random samples for five days.
  27. Fixed As We Went
      Lynn quickly found a problem with 13-digit ISBNs normalizing as 10-digit ISBNs. We quickly identified many parts of DVD sets and some shared-title publications that would be issues. Galen was responsive and helped us compensate for these issues as they were discovered.
  28. In Conclusion
      We don't know exactly how many bad matches were formed, but it was below our threshold, perhaps a few hundred. We are still gathering that feedback.
      We were able to purge 326,098 bib records, or about 27% of our ISBN-based collection.
  29. Evaluation
      The catalog is visibly cleaner.
      The cost per bib record was 1.5 cents.
      Absolutely successful.
  30. Future
      This dedupping system will improve further.
      There are still problems that need to be cleaned up – some manually and some by automation.
      New libraries that join SCLENDS will use our dedupping algorithm, not the old one.
  31. Challenges
      One, how do we go forward with more cleanup? Treat AV materials separately? We need to look more at repackaging standards.
      Two, how do we prevent adding new errors to the system (which is happening)?
  32. Questions?