HathiTrust: Sharing the Care and Feeding of the Elephant: Digital Library Federation Forum 2012
John Weise, Chris Powell, and Kat Hagedorn
University of Michigan Libraries
HathiTrust ingests and integrates digital content
produced by a variety of systems, processes,
practices, and workflows at partner
• Internet Archive
• Locally scanned
e.g., Yale, Michigan, and several others.
3. Some of Michigan's Hats
• Google partner
• HathiTrust administrator
o Specifications and guidelines
o Ingest manager/gatekeeper
• HathiTrust partner
• Michigan as Michigan
o MDP scans to HT (i.e., Google scans)
o Local scans to HT
o Legacy migration to HT
o Investigate and fix problems
5. The aggregation of content in HathiTrust has
revealed outcroppings in the data landscape
that were not as apparent when the content was segregated.
6. We won't talk about...
• HathiTrust governance, the many benefits of
partnership, or the lawsuit.
• Users, data mining, or preservation per se,
but they are inherent throughout.
• Google's scanning processes except to
illustrate a point.
7. In a nutshell
We're contemplating the impact of independent
decisions made in the past on preservation
and access today.
8. To do this, we'll talk about...
• Michigan's digital library heritage.
• The impact of local decisions on global
preservation and access.
• Meaningful vs. meaningless variations in
• Variations in quality.
• The benefits of aggregation for preservation.
• Where we can go from here.
10. Large scale, but sharp focus
• Collaborative, but separate
o Metadata availability
o Restricted scope
o Meaningfulness within the context of the collection
• Separate systems obscured variation in
application of agreed-upon standards
11. Now these texts are moving into an
environment where the sharp focus that
defined their previous online existence is less
meaningful, and some shortcomings are now
12. Michigan's Local Legacy
• 5K-10K volumes/year back to the 1990s
• 24K volumes migrated to HathiTrust.
• Relatively painstaking process.
13. Reasons volumes don't make
the automated move
• A record for the item cannot be located in
• Non-standard naming conventions
• Skips in file sequence
• Bitonal TIFF images aren't 600 dpi
• Various TIFF header anomalies
• JPEG2000 images that don't contain
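Two of the reasons above, non-standard naming conventions and skips in file sequence, lend themselves to a simple pre-ingest check. The following is a minimal sketch, not the validation HathiTrust actually runs. It assumes page files are named with an 8-digit, zero-padded sequence number starting at 1 (e.g., 00000001.tif); that convention, the function name check_page_files, and the sample file list are all hypothetical and chosen only for illustration.

```python
import re

# Assumed convention (hypothetical, not taken from the slides): page files are
# named with an 8-digit, zero-padded sequence number, e.g. 00000001.tif.
PAGE_NAME = re.compile(r"^(\d{8})\.(tif|jp2)$")


def check_page_files(filenames):
    """Return (nonstandard_names, missing_sequence_numbers) for one volume."""
    nonstandard = []
    seen = set()
    for name in filenames:
        match = PAGE_NAME.match(name)
        if match:
            seen.add(int(match.group(1)))
        else:
            nonstandard.append(name)
    # A skip is any number from 1 up to the highest number seen with no file.
    expected = range(1, max(seen) + 1) if seen else range(0)
    missing = [n for n in expected if n not in seen]
    return sorted(nonstandard), missing


# Hypothetical volume: page 3 is missing and one file is misnamed.
files = ["00000001.tif", "00000002.tif", "00000004.jp2", "page5.tif"]
nonstandard, missing = check_page_files(files)
print(nonstandard)  # ['page5.tif']
print(missing)      # [3]
```

A check like this only flags candidates for review. Whether a gap is a real skip or a deliberate choice made when the volume was scanned still has to be decided by a person.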
14. Successful volumes sharing the
larger repository aren't all the same
• Different libraries (even within the same
• Different materials (books, journals, photos)
• Different physical formats
• Different languages and scripts
• Different application of standards (including
• Different decisions made along the way
15. Meaningful vs. meaningless
• Variation you want to maintain vs. variation
you want to obscure
• Need for consensus
• Need for certainty that solutions are truly
• Why is this variation occurring?
• How can you spot variation in such a large
• How are truly meaningful variants identified
23. Content quality problems
• Issues we see with quality can be found in
• Some are unavoidable or were based on a
particular decision due to resource issues
• Some can be given special treatment if they
occur frequently or are anticipated
• There's a trade-off, naturally
o decision between a pristine corpus and a massively
24. Focus on potential physical volume
errors NOT volume scan errors
These are volume scan errors...
30. Pages missing in the physical
31. Benefits of corpus
• Noting provenance and process of creating
these digitized volumes
• Ability to compare volumes
• Reveal potential solutions to problems
• Certification of particular volumes
32. More hands make lighter work
• Working with institutions collectively
rather than one at a time
• Working together to find common models
• Share experience and develop policies to
mitigate newly discovered issues and
maintain the corpus
33. Lessons we're learning as we go
• You do NOT have to solve everything at once
• Don't let potential problems prevent you from
• Decide what is most important and where
to spend your resources, and do so at the
beginning of your project, if at all possible