HATHITRUST:
     SHARING THE CARE AND
    FEEDING OF THE ELEPHANT

John Weise, Chris Powell, and Kat Hagedorn
       University of Michigan Libraries
Introduction
HathiTrust ingests and integrates digital content
 produced by a variety of systems, processes,
 practices, and workflows at partner
 institutions.
   • Google
   • Internet Archive
   • Locally scanned
     (e.g., Yale, Michigan, and several others)
Some of Michigan's Hats
•  Google partner
•  HathiTrust administrator
  o    Specifications and guidelines
  o    Ingest manager/gatekeeper
•  HathiTrust partner
•  Michigan as Michigan
  o    MDP scans to HT (i.e., Google scans)
  o    Local scans to HT
  o    Legacy migration to HT
  o    Investigate and fix problems
Making Decisions
Try as we might, to do what is right, there may
  be more than one right answer.
The aggregation of content in HathiTrust has
  revealed outcroppings in the data landscape
  that were not as apparent when segregated.
We won't talk about...
•  HathiTrust governance, the many benefits of
     partnership, or the lawsuit.
•    Users, data mining, or preservation per se,
     but they are inherent throughout.
•    Google's scanning processes except to
     illustrate a point.
In a nutshell
We're contemplating the impact of independent
 decisions made in the past on preservation
 and access today.
To do this, we'll talk about...
•  Michigan's digital library heritage.
•  The impact of local decisions on global
     preservation and access.
•    Meaningful vs. meaningless variations in
     practice.
•    Variations in quality.
•    The benefits of aggregation for preservation.
•    Where we can go from here.
Our mass digitization heritage
Large scale, but sharp focus
•  Collaborative, but separate
•  Curated
  o    Condition
  o    Completeness
  o    Metadata availability
  o    Restricted scope
  o    Meaningfulness within the context of the collection
•  Separate systems obscured variation in
  application of agreed-upon standards
Now these texts are moving into an
 environment where the sharp focus that
 defined their previous online existence is less
 meaningful, and some shortcomings are now
 exposed.
Michigan's Local Legacy
•  5K-10K volumes/year, going back to the 1990s
•  24K volumes migrated to HathiTrust
•  Relatively painstaking process
  o    Why?
Reasons volumes don't make
the automated move
•  A record for the item cannot be located in
     the catalog
•    Non-standard naming conventions
•    Skips in file sequence
•    Bitonal TIFF images aren't 600 dpi
•    Various TIFF header anomalies
•    JPEG2000 images that don't contain
     resolution information
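
Two of the reasons above, non-standard naming and skips in file sequence, lend themselves to an automated pre-check. What follows is only a minimal sketch, not HathiTrust's actual ingest validation: the naming pattern (eight digits plus an image extension) and the function name `check_page_files` are assumptions made for illustration.

```python
import re

# Hypothetical convention for this sketch: eight digits plus an image
# extension, e.g. 00000001.tif. A real ingest specification may differ.
PAGE_FILE = re.compile(r"^(\d{8})\.(tif|jp2)$")


def check_page_files(filenames):
    """Return (nonstandard_names, missing_numbers) for one volume's page files."""
    nonstandard = []
    seen = set()
    for name in filenames:
        match = PAGE_FILE.match(name)
        if match:
            seen.add(int(match.group(1)))
        else:
            # The name does not follow the assumed convention.
            nonstandard.append(name)
    if not seen:
        return nonstandard, []
    # Every number from 1 up to the highest one seen should be present.
    missing = [n for n in range(1, max(seen) + 1) if n not in seen]
    return nonstandard, missing


nonstandard, missing = check_page_files(
    ["00000001.tif", "00000002.jp2", "00000004.tif", "page5.tif"]
)
print(nonstandard)  # names that break the assumed convention
print(missing)      # sequence numbers with no file
```

A check like this only flags candidates. Deciding whether a skipped number is a lost file or a deliberate omission still takes a person.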
Successful volumes sharing the
larger repository aren't all the same
•  Different libraries (even within the same
     institution)
•    Different materials (books, journals, photos)
•    Different physical formats
•    Different languages and scripts
•    Different application of standards (including
     MARC)
•    Different decisions made along the way
Meaningful vs. meaningless
variation
•  Variation you want to maintain vs. variation
     you want to obscure
•    Need for consensus
•    Need for certainty that solutions are truly
     global
•    Why is this variation occurring?
•    How can you spot variation in such a large
     pool?
•    How are truly meaningful variants identified
     and preserved?
Digitization Decisions: Page
Features/Book Structures
Digitization Decisions: Omissions
•  It's impossible to illustrate what you have
     omitted
•    It's also impossible to find where omissions
     occurred
Digitization Decisions: Inserts
Cataloging differences
Even among brief descriptions
And among expanded descriptions
The combined repository gives you a fresh
  and broader look at your collections and your
  practices.
Content quality problems
•  Issues we see with quality can be found in
     any collection
•    Some are unavoidable or were based on a
     particular decision due to resource issues
•    Some can be given special treatment if they
     occur frequently or are anticipated

•  There's a trade-off, naturally
     o    decision between a pristine corpus and a massively
          useful corpus
Focus on potential physical volume
errors NOT volume scan errors
These are volume scan errors...
•  Skew
•  Warp
RTL and upside-down (e.g.,
Japanese)
Unfolded foldouts
Pagetagging gone awry
Faint text
Pages misnumbered and
  duplicated in physical volume


[Images: page 135, followed by a page numbered 139, which should be page 136]
Pages missing in the physical
  volumes


[Images: page 96, followed by page 99; pages 97 and 98 are not in the volume]
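
Misnumbered and missing pages like these can be surfaced by comparing each printed page number with the one before it. The sketch below assumes the printed numbers are already available as a list of integers in scan order (for example from page tagging). The function name `page_number_anomalies` is hypothetical.

```python
def page_number_anomalies(printed_pages):
    """Return (previous, current) pairs where the printed number does not advance by 1."""
    anomalies = []
    for prev, cur in zip(printed_pages, printed_pages[1:]):
        if cur != prev + 1:
            anomalies.append((prev, cur))
    return anomalies


# Pages 97 and 98 absent: one jump, from 96 to 99.
print(page_number_anomalies([95, 96, 99, 100]))

# Page 136 printed as 139: a jump forward, then a jump back.
print(page_number_anomalies([134, 135, 139, 137, 138]))
```

The check cannot tell a physical volume error from a volume scan error. It only points to where someone should look at the page images.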
Benefits of corpus
•  Preservation
•  Noting provenance and process of creating
     these digitized volumes
•    Aggregation
•    Ability to compare volumes
•    Reveal potential solutions to problems
•    Certification of particular volumes
More hands make lighter work
•  Working with institutions collectively
     rather than one at a time
•    Working together to find common models
     and workflows
•    Sharing experience and developing policies to
     mitigate newly discovered issues and
     maintain the corpus
Lessons we're learning as we go

•  You do NOT have to solve everything at once
•  Don't let potential problems prevent you from
     moving forward
•    Decide what is most important and where
     to spend your resources, and do so at the
     beginning of your project, if at all possible
Contact info
•  www.hathitrust.org
•  John: jweise@umich.edu
•  Chris: sooty@umich.edu
•  Kat: khage@umich.edu
