
HathiTrust: Sharing the Care and Feeding of the Elephant: Digital Library Federation Forum 2012


Published in: Technology

  1. HATHITRUST: SHARING THE CARE AND FEEDING OF THE ELEPHANT
     John Weise, Chris Powell, and Kat Hagedorn
     University of Michigan Libraries
  2. Introduction
     HathiTrust ingests and integrates digital content produced by a variety of systems, processes, practices, and workflows at partner institutions.
     • Google
     • Internet Archive
     • Locally scanned (e.g., Yale, Michigan, and several others)
  3. Some of Michigan's Hats
     • Google partner
     • HathiTrust administrator
       o Specifications and guidelines
       o Ingest manager/gatekeeper
     • HathiTrust partner
     • Michigan as Michigan
       o MDP scans to HT (i.e., Google scans)
       o Local scans to HT
       o Legacy migration to HT
       o Investigate and fix problems
  4. Making Decisions
     Try as we might, to do what is right, there may be more than one right answer.
  5. The aggregation of content in HathiTrust has revealed outcroppings in the data landscape that were not as apparent when the content was segregated.
  6. We won't talk about...
     • HathiTrust governance, the many benefits of partnership, or the lawsuit.
     • Users, data mining, or preservation per se, but they are inherent throughout.
     • Google's scanning processes, except to illustrate a point.
  7. In a nutshell
     We're contemplating the impact of independent decisions made in the past on preservation and access today.
  8. To do this, we'll talk about...
     • Michigan's digital library heritage.
     • The impact of local decisions on global preservation and access.
     • Meaningful vs. meaningless variations in practice.
     • Variations in quality.
     • The benefits of aggregation for preservation.
     • Where we can go from here.
  9. Our mass digitization heritage
  10. Large scale, but sharp focus
      • Collaborative, but separate
      • Curated
        o Condition
        o Completeness
        o Metadata availability
        o Restricted scope
        o Meaningfulness within the context of the collection
      • Separate systems obscured variation in application of agreed-upon standards
  11. Now these texts are moving into an environment where the sharp focus that defined their previous online existence is less meaningful, and some shortcomings are now exposed.
  12. Michigan's Local Legacy
      • 5K-10K volumes/year back to the 1990s
      • 24K volumes migrated to HathiTrust
      • Relatively painstaking process
        o Why?
  13. Reasons volumes don't make the automated move
      • A record for the item cannot be located in the catalog
      • Non-standard naming conventions
      • Skips in file sequence
      • Bitonal TIFF images aren't 600 dpi
      • Various TIFF header anomalies
      • JPEG2000 images that don't contain resolution information
  14. Successful volumes sharing the larger repository aren't all the same
      • Different libraries (even within the same institution)
      • Different materials (books, journals, photos)
      • Different physical formats
      • Different languages and scripts
      • Different application of standards (including MARC)
      • Different decisions made along the way
  15. Meaningful vs. meaningless variation
      • Variation you want to maintain vs. variation you want to obscure
      • Need for consensus
      • Need for certainty that solutions are truly global
      • Why is this variation occurring?
      • How can you spot variation in such a large pool?
      • How are truly meaningful variants identified and preserved?
  16. Digitization Decisions: Page Features/Book Structures
  17. Digitization Decisions: Omissions
      • It's impossible to illustrate what you have omitted
      • It's also impossible to find where omissions occurred
  18. Digitization Decisions: Inserts
  19. Cataloging differences
  20. Even among brief descriptions
  21. And among expanded descriptions
  22. The combined repository gives you a fresh and broader look at your collections and your practices.
  23. Content quality problems
      • Issues we see with quality can be found in any collection
      • Some are unavoidable or were based on a particular decision due to resource issues
      • Some can be given special treatment if they occur frequently or are anticipated
      • There's a trade-off, naturally
        o A decision between a pristine corpus and a massively useful corpus
  24. Focus on potential physical volume errors, NOT volume scan errors
      These are volume scan errors...
      • Skew
      • Warp
  25. RTL and upside-down (e.g., Japanese)
  26. Unfolded foldouts
  27. Pagetagging gone awry
  28. Faint text
  29. Pages misnumbered and duplicated in physical volume
      • Page 135
      • Page 139, which should be page 136
  30. Pages missing in the physical volumes
      • Page 96
      • Page 99
      • Pages 97 and 98 are not in volume
  31. Benefits of corpus
      • Preservation
      • Noting provenance and process of creating these digitized volumes
      • Aggregation
      • Ability to compare volumes
      • Reveal potential solutions to problems
      • Certification of particular volumes
  32. More hands make lighter work
      • Working with institutions on a collective level as opposed to singularly
      • Working together to find common models and workflows
      • Share experience and develop policies to mitigate newly discovered issues and maintain the corpus
  33. Lessons we're learning as we go
      • You do NOT have to solve everything at once
      • Don't let potential problems prevent you from moving forward
      • Decide what is most important and where to use your resources, and do it at the beginning of your project, if at all possible
  34. Contact info
      • John:
      • Chris:
      • Kat:
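
Two of the reasons listed on slide 13 for volumes failing the automated move, non-standard naming conventions and skips in file sequence, lend themselves to a simple pre-ingest check. The following is a minimal sketch, not the process actually used at Michigan or HathiTrust. The file-naming pattern (an 8-digit, zero-padded sequence number with a .tif or .jp2 extension) and the function name check_volume are assumptions made for illustration only.

```python
import re

# ASSUMPTION: page files are named with an 8-digit, zero-padded sequence
# number and a .tif or .jp2 extension. This pattern is illustrative only.
PAGE_FILE = re.compile(r"^(\d{8})\.(tif|jp2)$")


def check_volume(filenames):
    """Return (nonstandard_names, missing_sequence_numbers) for one volume.

    nonstandard_names: files that do not match the assumed pattern.
    missing_sequence_numbers: numbers absent between 1 and the highest
    sequence number found among the conforming files.
    """
    nonstandard = []
    seen = set()
    for name in filenames:
        match = PAGE_FILE.match(name)
        if match:
            seen.add(int(match.group(1)))
        else:
            nonstandard.append(name)

    if not seen:
        return nonstandard, []

    # Any number between 1 and the highest one seen should be present.
    missing = [n for n in range(1, max(seen) + 1) if n not in seen]
    return nonstandard, missing


# Hypothetical file listing for a single volume.
files = ["00000001.tif", "00000002.tif", "00000004.jp2", "page5.tif"]
nonstandard, missing = check_volume(files)
print("non-standard names:", nonstandard)
print("missing sequence numbers:", missing)
```

For the hypothetical listing above, the check flags page5.tif as a non-standard name and reports sequence number 3 as missing. The resolution and TIFF header problems on the same slide would need an image library to inspect and are left out of this sketch.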