SlideShare a Scribd company logo
1 of 26
Download to read offline
Kat Hagedorn
University of Michigan Library
Migrating 1st-Generation Digital Texts from
Local Collections to HathiTrust
SAA 2015 | August 21 10am | Session 308
#saa15
Overview
We’ve been migrating our older collections to
HathiTrust for 6 years now (since 2009).
Recently, we’ve been taking a closer
look at correctly preserving this content.
We’ve been doing that for the past 3 years
(since early 2012).
#saa15
Overview
DLXS: (Digital Library eXtension Service)
● Includes content from 1995 to present
● Text, images, bibliographies, finding aids
HathiTrust:
● Consortium of over 100 research institutions & libraries
● Content includes material created through Google
digitization project, as well as locally scanned material
#saa15
Two repositories
Some things are similar:
● File formats (TIFF, JPEG2000), structure of
repository, repository management staff
● Content to be migrated (printed text volumes)
However:
● HathiTrust has stricter technical requirements
● HathiTrust has a more formal ingest process
#saa15
Our team
John Weise - Head of the Digital
Library Production Service (DLPS)
Aaron Elkiss - Systems Programmer
in Core Services
Chris Powell - Coordinator for Text
Collections in DLPS
Matt LaChance - Data Processing
Automation Programmer in DLPS
Lance Stuchell - Digital
Preservation Librarian
Kat Hagedorn - Project Manager for
Digital Projects in DLPS
Our team at U-M. Four of us work on this semi-regularly.
#saa15
Way back in 1995...
Content was created with long term
preservation in mind, but when we
started, we were balancing pragmatism
with best practices and a thoughtful
approach.
#saa15
The Weasley House effect
We didn’t have good validation tools, because
they didn’t exist.
We relied on vendors creating correct content.
Metadata that was to become engine of
management at scale was inconsistent or
incorrect.
We got smarter as we went along, and we got
better tools to help us, but it means we don’t
have consistency across all our collections.
#saa15
Shift from “hand-crafted” to mass digitization
led to unprecedented scale.
Further reliance on tools facilitating automated
processes like ingest (JHOVE, etc.).
Increased use of standards (METS, PREMIS,
etc.).
Effort to move away from anecdotal knowledge
of problems with content.
#saa15
The Weasley House effect
The pain point
95% of our materials can be ingested easily -
they pass current validation with little or no
problems.
However, that remaining 5%...
#saa15
It’s not really broken
Problem Details Solution
Bitonal image resolution Resolution value is incorrect or zero Manually fix the images
Invalid but well-formed
bitonal images
Images are viewable, but the image
metadata is incorrect
Automation for those can batch fix,
manual intervention for the rest
Unexpected contone
image dimensions
Some images in a volume are suspiciously
not a reasonable size, in relation to the other
pages
Automate discovery mechanism, but in
the end manually inspecting and
annotating
“Funny” filenames Filenames that met a previous spec don’t
meet current spec
Automate discovery, and automatically
fixed (pattern matching)
Bad/ambiguous dates e.g., 4-5-98, 040506 Automate discovery and fix of patterns,
but some will need manual fixing
Legitimate sequence
skips
Skips in the numbering or nomenclature of a
sequence (not the page)
Automate discovery and most fixes, some
manually rename
It’s kind of wrong to call these
broken or unbroken - they
don’t meet our standards of
preservation as-is.
It’s really broken
Problem Details Solution
Blank contone images Contone is a blank image Need to reload/rescan the volume
Contone images don’t
match bitonals
Misalignment (of sequences) of contones
and bitonals within the volume
Manual work to discover the correct
matches, then make those changes
Skipped pages Illegitimate skips Need to reload/rescan the volume
OCR but no images OCR for certain sequences, but no matching
images
After discovery, may need to rescan
some images, but some images may not
be needed
Character errors e.g., em-dashes, control characters not
recognizable by XML
After discovery, remove/replace those
characters
Non-viewable,
malformed images
Cannot open, view or discover the problems
with certain images
“Cut bait” (rescan the entire volume)
Mismatches
We put off fixing and ingesting all of “Michigan County
Histories and Atlases” because it involved manual labor.
● Continuous tone images (we call them ‘contones’) did not match
bitonals for maps and atlases in volumes.
● Color images were therefore often inappropriately placed in the
volume.
● Required manual examination and renaming of page files in
volumes.
#saa15
Misnamed contones/bitonals
You should have seen the TOC on the TOC page. Instead you saw the contone Ingham County map.
Misnamed contones/bitonals
You should have seen the contone Ingham
County map on page 7.
Instead you saw the bitonal Ingham County map.
Fixing these problems - investigating, identifying, analyzing
and renaming - took our Text Coordinator about 50 hours of
time.
We could train people to do this work, but it has been
invaluable to have her institutional knowledge on the
problem.
#saa15
Mismatches
Sequence skips
How did the sequence skips occur? What seemed to be
missing?
● Are they legitimate?
(Yes, the page really does NOT exist.)
● Are they illegitimate?
(No, that page should have been scanned but wasn’t.)
#saa15
Sequence skips
Involve writing custom scripts (we call this “craft work”) to
investigate the skips.
Requires running and tweaking and re-running … lather,
rinse, repeat.
Again, needed Text Coordinator to determine the fix.
#saa15
The “note from mom” tool
We needed a way to
indicate which
volumes we were
“allowing” through. A
thoughtful analysis
didn’t require a fix,
but we needed to
note that we’d done
that analysis.
#saa15http://funnyjunk.com/funny_pictures/3859033/Note+from+mom/ No individual or site license specified.
We built an ingest utility - aka the “note from mom” tool - to
preserve that analysis, and allow us to ingest these
volumes. We use this for:
● legitimate printed volume errors (missing pages in the printed
volume)
● unexpected contone image dimensions (if they looked
reasonable on manual inspection, even if not an
“appropriate” size)
The “note from mom” tool
#saa15
“Note from mom” PREMIS event
<PREMIS:event>
<PREMIS:eventIdentifier>
<PREMIS:eventIdentifierType>UUID</PREMIS:eventIdentifierType>
<PREMIS:eventIdentifierValue>9e11033c-f2a5-38d5-a8a4-5adad8eb677b</PREMIS:eventIdentifierValue>
</PREMIS:eventIdentifier>
<PREMIS:eventType>manual inspection</PREMIS:eventType>
<PREMIS:eventDateTime>2014-11-05T17:06:21Z</PREMIS:eventDateTime>
<PREMIS:eventDetail>Manually inspect item for completeness and legibility</PREMIS:eventDetail>
#saa15
<PREMIS:eventOutcomeInformation>
<PREMIS:eventOutcome>validation exception granted</PREMIS:eventOutcome>
<PREMIS:eventOutcomeDetail>
<PREMIS:eventOutcomeDetailExtension>
<HT:exceptionsAllowed category="tiff_resolution"/>
</PREMIS:eventOutcomeDetailExtension>
</PREMIS:eventOutcomeDetail>
</PREMIS:eventOutcomeInformation>
Touch these last
● Some volumes we ingested earlier, but they eventually need a
“note from mom” and stricter validation.
● Collections with permission or ownership questions.
● Collections that contain contones we couldn’t include in our text
collections, at the time. (Yes, they’re separate. Yes, it’s really user-
friendly.)
● Collections containing more highly-encoded volumes (level 4 TEI).
● Collections that do not have unique materials from a preservation
standpoint.
● Collections without page images.
● Obviously, collections that are licensed.
#saa15
“Final” result?
Of the 167,102 volumes in our text collections,
we have validated and migrated 27,308 of them
to HathiTrust (16%).
● 20 collections (mostly) ingested (22%), 29 next on the docket
● 41 cannot be currently migrated
It took us most of the past 3 years to validate
and ingest these volumes. And the last year to
go from 15% to 16%. #saa15
Hard, but interesting
We have been:
● intrigued - by the extent of certain problems
● annoyed - by lack of standards or documentation that
would allow us to understand the problem
● surprised - at the differences between latter years and
today in terms of preservation tools and expertise
#saa15
How did we do?
Small percentage of files can’t be assessed,
or require rescanning.
Most errors happened in digitization and/or
package creation… but we’re not sure.
● Not caused by degrading in the
repository.
● Not caused by anything inherent in the
file formats.
Errors discovered because of HT validation
requirements.
#saa15
Are we better now?
Migration project offers an opportunity to evaluate how
we did, where we have gone, and where we are now.
Now we have more consistency -
● Processes allow for consistent metadata creation
and content validation at scale.
● Content is much more consistent.
● Standards (changes to, application of local notes).
#saa15
Documentation
● Is imperative, and it is harder at scale.
Evolution
● Metadata and supporting elements will continue to evolve,
eg, METS and PREMIS uplifts.
Decision-making
● Always a need for constant decision-making to balance real-
world constraints with preservation ideals.
● Not a lot of time for second-guessing!
#saa15
Preservation is iterative… not static

More Related Content

Viewers also liked

Geriatric Population Presentation
Geriatric Population PresentationGeriatric Population Presentation
Geriatric Population Presentationkrystals13
 
Fotos blog_2010
Fotos blog_2010Fotos blog_2010
Fotos blog_2010rosanarvf
 
Ley4632
Ley4632Ley4632
Ley4632EPRE
 
Seminário navegação em hipermídia
Seminário navegação em hipermídiaSeminário navegação em hipermídia
Seminário navegação em hipermídiaLuiz Agner
 
Author study louise_erdrich
Author study louise_erdrichAuthor study louise_erdrich
Author study louise_erdrichwylkdaat
 
Slam bolt scrappers post mortem
Slam bolt scrappers post mortemSlam bolt scrappers post mortem
Slam bolt scrappers post mortemfirehosegames
 
Capitulo 6 Livro Nielsen
Capitulo 6 Livro NielsenCapitulo 6 Livro Nielsen
Capitulo 6 Livro NielsenLuiz Agner
 
Imaginem i construim
Imaginem i construimImaginem i construim
Imaginem i construimMercè Gimeno
 
несклоняемые имена существительные
несклоняемые имена существительныенесклоняемые имена существительные
несклоняемые имена существительныеNatalya Dyrda
 
Про автотесты, фреймворки и железки. Андрей Баюн. Debug time#2 2014
Про автотесты, фреймворки и железки. Андрей Баюн. Debug time#2 2014Про автотесты, фреймворки и железки. Андрей Баюн. Debug time#2 2014
Про автотесты, фреймворки и железки. Андрей Баюн. Debug time#2 2014Unigine Corp.
 
каникулы 10 а класса
каникулы 10 а классаканикулы 10 а класса
каникулы 10 а классаoznob
 
Teoriya literatury v_tablitsah.
Teoriya literatury v_tablitsah.Teoriya literatury v_tablitsah.
Teoriya literatury v_tablitsah.Natalya Dyrda
 

Viewers also liked (15)

Geriatric Population Presentation
Geriatric Population PresentationGeriatric Population Presentation
Geriatric Population Presentation
 
Fotos blog_2010
Fotos blog_2010Fotos blog_2010
Fotos blog_2010
 
Ley4632
Ley4632Ley4632
Ley4632
 
Seminário navegação em hipermídia
Seminário navegação em hipermídiaSeminário navegação em hipermídia
Seminário navegação em hipermídia
 
Author study louise_erdrich
Author study louise_erdrichAuthor study louise_erdrich
Author study louise_erdrich
 
Slam bolt scrappers post mortem
Slam bolt scrappers post mortemSlam bolt scrappers post mortem
Slam bolt scrappers post mortem
 
Capitulo 6 Livro Nielsen
Capitulo 6 Livro NielsenCapitulo 6 Livro Nielsen
Capitulo 6 Livro Nielsen
 
Wiki
WikiWiki
Wiki
 
tagxedo
tagxedotagxedo
tagxedo
 
Imaginem i construim
Imaginem i construimImaginem i construim
Imaginem i construim
 
несклоняемые имена существительные
несклоняемые имена существительныенесклоняемые имена существительные
несклоняемые имена существительные
 
Про автотесты, фреймворки и железки. Андрей Баюн. Debug time#2 2014
Про автотесты, фреймворки и железки. Андрей Баюн. Debug time#2 2014Про автотесты, фреймворки и железки. Андрей Баюн. Debug time#2 2014
Про автотесты, фреймворки и железки. Андрей Баюн. Debug time#2 2014
 
каникулы 10 а класса
каникулы 10 а классаканикулы 10 а класса
каникулы 10 а класса
 
Ershad(17)
Ershad(17)Ershad(17)
Ershad(17)
 
Teoriya literatury v_tablitsah.
Teoriya literatury v_tablitsah.Teoriya literatury v_tablitsah.
Teoriya literatury v_tablitsah.
 

Similar to Migrating 1st-Generation Digital Texts from Local Collections to HathiTrust

The Great Migration: Moving First Generation Digital Texts to HathiTrust: Dig...
The Great Migration: Moving First Generation Digital Texts to HathiTrust: Dig...The Great Migration: Moving First Generation Digital Texts to HathiTrust: Dig...
The Great Migration: Moving First Generation Digital Texts to HathiTrust: Dig...khage1
 
Data skills for Agile Teams- Killing story points
Data skills for Agile Teams- Killing story pointsData skills for Agile Teams- Killing story points
Data skills for Agile Teams- Killing story pointsyasinnathani
 
Process Mapping, System Insight
Process Mapping, System InsightProcess Mapping, System Insight
Process Mapping, System InsightCole Whiteman
 
Putting Buyers and Sellers in the Best Light, How Etsy Leverages Big Data for...
Putting Buyers and Sellers in the Best Light, How Etsy Leverages Big Data for...Putting Buyers and Sellers in the Best Light, How Etsy Leverages Big Data for...
Putting Buyers and Sellers in the Best Light, How Etsy Leverages Big Data for...Dana Gardner
 
Responsive Web Design Process
Responsive Web Design ProcessResponsive Web Design Process
Responsive Web Design ProcessSteve Fisher
 
How (fr)agile we are. ALE2011
How (fr)agile we are. ALE2011How (fr)agile we are. ALE2011
How (fr)agile we are. ALE2011Gaetano Mazzanti
 
Alexis max-Creating a bot experience as good as your user experience - Alexis...
Alexis max-Creating a bot experience as good as your user experience - Alexis...Alexis max-Creating a bot experience as good as your user experience - Alexis...
Alexis max-Creating a bot experience as good as your user experience - Alexis...WeLoveSEO
 
RedisConf17 - Observability and the Glorious Future
RedisConf17 - Observability and the Glorious FutureRedisConf17 - Observability and the Glorious Future
RedisConf17 - Observability and the Glorious FutureRedis Labs
 
Semanticmerge
SemanticmergeSemanticmerge
Semanticmergepsluaces
 
Three baseline metrics & what they can tell you about your team.
Three baseline metrics & what they can tell you about your team.Three baseline metrics & what they can tell you about your team.
Three baseline metrics & what they can tell you about your team.Mike Burns
 
Crawling & Indexing for JavaScript Heavy Sites brightonSEO 2021
Crawling & Indexing for JavaScript Heavy Sites brightonSEO 2021Crawling & Indexing for JavaScript Heavy Sites brightonSEO 2021
Crawling & Indexing for JavaScript Heavy Sites brightonSEO 2021DavidSmart53
 
How Gousto is moving to just-in-time personalization with Snowplow
How Gousto is moving to just-in-time personalization with SnowplowHow Gousto is moving to just-in-time personalization with Snowplow
How Gousto is moving to just-in-time personalization with SnowplowGiuseppe Gaviani
 
Estimation is dead - long live sizing, by John Coleman 13June2023.pdf
Estimation is dead - long live sizing, by John Coleman 13June2023.pdfEstimation is dead - long live sizing, by John Coleman 13June2023.pdf
Estimation is dead - long live sizing, by John Coleman 13June2023.pdfOrderly Disruption
 
Nondeterministic Software for the Rest of Us
Nondeterministic Software for the Rest of UsNondeterministic Software for the Rest of Us
Nondeterministic Software for the Rest of UsTomer Gabel
 
Data Scientists Are Analysts Are Also Software Engineers
Data Scientists Are Analysts Are Also Software EngineersData Scientists Are Analysts Are Also Software Engineers
Data Scientists Are Analysts Are Also Software EngineersDomino Data Lab
 
Paytm labs soyouwanttodatascience
Paytm labs soyouwanttodatasciencePaytm labs soyouwanttodatascience
Paytm labs soyouwanttodatascienceAdam Muise
 

Similar to Migrating 1st-Generation Digital Texts from Local Collections to HathiTrust (20)

The Great Migration: Moving First Generation Digital Texts to HathiTrust: Dig...
The Great Migration: Moving First Generation Digital Texts to HathiTrust: Dig...The Great Migration: Moving First Generation Digital Texts to HathiTrust: Dig...
The Great Migration: Moving First Generation Digital Texts to HathiTrust: Dig...
 
Data skills for Agile Teams- Killing story points
Data skills for Agile Teams- Killing story pointsData skills for Agile Teams- Killing story points
Data skills for Agile Teams- Killing story points
 
Process Mapping, System Insight
Process Mapping, System InsightProcess Mapping, System Insight
Process Mapping, System Insight
 
Putting Buyers and Sellers in the Best Light, How Etsy Leverages Big Data for...
Putting Buyers and Sellers in the Best Light, How Etsy Leverages Big Data for...Putting Buyers and Sellers in the Best Light, How Etsy Leverages Big Data for...
Putting Buyers and Sellers in the Best Light, How Etsy Leverages Big Data for...
 
Responsive Web Design Process
Responsive Web Design ProcessResponsive Web Design Process
Responsive Web Design Process
 
How (fr)agile we are. ALE2011
How (fr)agile we are. ALE2011How (fr)agile we are. ALE2011
How (fr)agile we are. ALE2011
 
Alexis max-Creating a bot experience as good as your user experience - Alexis...
Alexis max-Creating a bot experience as good as your user experience - Alexis...Alexis max-Creating a bot experience as good as your user experience - Alexis...
Alexis max-Creating a bot experience as good as your user experience - Alexis...
 
RedisConf17 - Observability and the Glorious Future
RedisConf17 - Observability and the Glorious FutureRedisConf17 - Observability and the Glorious Future
RedisConf17 - Observability and the Glorious Future
 
Gaps in the algorithm
Gaps in the algorithmGaps in the algorithm
Gaps in the algorithm
 
Semanticmerge
SemanticmergeSemanticmerge
Semanticmerge
 
Three baseline metrics & what they can tell you about your team.
Three baseline metrics & what they can tell you about your team.Three baseline metrics & what they can tell you about your team.
Three baseline metrics & what they can tell you about your team.
 
Crawling & Indexing for JavaScript Heavy Sites brightonSEO 2021
Crawling & Indexing for JavaScript Heavy Sites brightonSEO 2021Crawling & Indexing for JavaScript Heavy Sites brightonSEO 2021
Crawling & Indexing for JavaScript Heavy Sites brightonSEO 2021
 
How Gousto is moving to just-in-time personalization with Snowplow
How Gousto is moving to just-in-time personalization with SnowplowHow Gousto is moving to just-in-time personalization with Snowplow
How Gousto is moving to just-in-time personalization with Snowplow
 
Estimation is dead - long live sizing, by John Coleman 13June2023.pdf
Estimation is dead - long live sizing, by John Coleman 13June2023.pdfEstimation is dead - long live sizing, by John Coleman 13June2023.pdf
Estimation is dead - long live sizing, by John Coleman 13June2023.pdf
 
Quality software management
Quality software managementQuality software management
Quality software management
 
Nondeterministic Software for the Rest of Us
Nondeterministic Software for the Rest of UsNondeterministic Software for the Rest of Us
Nondeterministic Software for the Rest of Us
 
Data Scientists Are Analysts Are Also Software Engineers
Data Scientists Are Analysts Are Also Software EngineersData Scientists Are Analysts Are Also Software Engineers
Data Scientists Are Analysts Are Also Software Engineers
 
BD-ACA Week6
BD-ACA Week6BD-ACA Week6
BD-ACA Week6
 
AgileCamp Silicon Valley 2015: User Story Mapping
AgileCamp Silicon Valley 2015: User Story MappingAgileCamp Silicon Valley 2015: User Story Mapping
AgileCamp Silicon Valley 2015: User Story Mapping
 
Paytm labs soyouwanttodatascience
Paytm labs soyouwanttodatasciencePaytm labs soyouwanttodatascience
Paytm labs soyouwanttodatascience
 

Recently uploaded

Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKJago de Vreede
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 

Recently uploaded (20)

Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 

Migrating 1st-Generation Digital Texts from Local Collections to HathiTrust

  • 1. Kat Hagedorn University of Michigan Library Migrating 1st-Generation Digital Texts from Local Collections to HathiTrust SAA 2015 | August 21 10am | Session 308 #saa15
  • 2. Overview We’ve been migrating our older collections to HathiTrust for 6 years now (since 2009). Recently, we’ve been taking a closer look at correctly preserving this content. We’ve been doing that for the past 3 years (since early 2012). #saa15
  • 3. Overview DLXS: (Digital Library eXtension Service) ● Includes content from 1995 to present ● Text, images, bibliographies, finding aids HathiTrust: ● Consortium of over 100 research institutions & libraries ● Content includes material created through Google digitization project, as well as locally scanned material #saa15
  • 4. Two repositories Some things are similar: ● File formats (TIFF, JPEG2000), structure of repository, repository management staff ● Content to be migrated (printed text volumes) However: ● HathiTrust has stricter technical requirements ● HathiTrust has a more formal ingest process #saa15
  • 5. Our team John Weise - Head of the Digital Library Production Service (DLPS) Aaron Elkiss - Systems Programmer in Core Services Chris Powell - Coordinator for Text Collections in DLPS Matt LaChance - Data Processing Automation Programmer in DLPS Lance Stuchell - Digital Preservation Librarian Kat Hagedorn - Project Manager for Digital Projects in DLPS Our team at U-M. Four of us work on this semi-regularly. #saa15
  • 6. Way back in 1995... Content was created with long term preservation in mind, but when we started, we were balancing pragmatism with best practices and a thoughtful approach. #saa15
  • 7. The Weasley House effect We didn’t have good validation tools, because they didn’t exist. We relied on vendors creating correct content. Metadata that was to become engine of management at scale was inconsistent or incorrect. We got smarter as we went along, and we got better tools to help us, but it means we don’t have consistency across all our collections. #saa15
  • 8. Shift from “hand-crafted” to mass digitization led to unprecedented scale. Further reliance on tools facilitating automated processes like ingest (JHOVE, etc.). Increased use of standards (METS, PREMIS, etc.). Effort to move away from anecdotal knowledge of problems with content. #saa15 The Weasley House effect
  • 9. The pain point 95% of our materials can be ingested easily - they pass current validation with little or no problems. However, that remaining 5%... #saa15
  • 10. It’s not really broken Problem Details Solution Bitonal image resolution Resolution value is incorrect or zero Manually fix the images Invalid but well-formed bitonal images Images are viewable, but the image metadata is incorrect Automation for those can batch fix, manual intervention for the rest Unexpected contone image dimensions Some images in a volume are suspiciously not a reasonable size, in relation to the other pages Automate discovery mechanism, but in the end manually inspecting and annotating “Funny” filenames Filenames that met a previous spec don’t meet current spec Automate discovery, and automatically fixed (pattern matching) Bad/ambiguous dates e.g., 4-5-98, 040506 Automate discovery and fix of patterns, but some will need manual fixing Legitimate sequence skips Skips in the numbering or nomenclature of a sequence (not the page) Automate discovery and most fixes, some manually rename It’s kind of wrong to call these broken or unbroken - they don’t meet our standards of preservation as-is.
  • 11. It’s really broken Problem Details Solution Blank contone images Contone is a blank image Need to reload/rescan the volume Contone images don’t match bitonals Misalignment (of sequences) of contones and bitonals within the volume Manual work to discover the correct matches, then make those changes Skipped pages Illegitimate skips Need to reload/rescan the volume OCR but no images OCR for certain sequences, but no matching images After discovery, may need to rescan some images, but some images may not be needed Character errors e.g., em-dashes, control characters not recognizable by XML After discovery, remove/replace those characters Non-viewable, malformed images Cannot open, view or discover the problems with certain images “Cut bait” (rescan the entire volume)
  • 12. Mismatches We put off fixing and ingesting all of “Michigan County Histories and Atlases” because it involved manual labor. ● Continuous tone images (we call them ‘contones’) did not match bitonals for maps and atlases in volumes. ● Color images were therefore often inappropriately placed in the volume. ● Required manual examination and renaming of page files in volumes. #saa15
  • 13. Misnamed contones/bitonals You should have seen the TOC on the TOC page. Instead you saw the contone Ingham County map.
  • 14. Misnamed contones/bitonals You should have seen the contone Ingham County map on page 7. Instead you saw the bitonal Ingham County map.
  • 15. Fixing these problems - investigating, identifying, analyzing and renaming - took our Text Coordinator about 50 hours of time. We could train people to do this work, but it has been invaluable to have her institutional knowledge on the problem. #saa15 Mismatches
  • 16. Sequence skips How did the sequence skips occur? What seemed to be missing? ● Are they legitimate? (Yes, the page really does NOT exist.) ● Are they illegitimate? (No, that page should have been scanned but wasn’t.) #saa15
  • 17. Sequence skips Involve writing custom scripts (we call this “craft work”) to investigate the skips. Requires running and tweaking and re-running … lather, rinse, repeat. Again, needed Text Coordinator to determine the fix. #saa15
  • 18. The “note from mom” tool We needed a way to indicate which volumes we were “allowing” through. A thoughtful analysis didn’t require a fix, but we needed to note that we’d done that analysis. #saa15http://funnyjunk.com/funny_pictures/3859033/Note+from+mom/ No individual or site license specified.
  • 19. We built an ingest utility - aka the “note from mom” tool - to preserve that analysis, and allow us to ingest these volumes. We use this for: ● legitimate printed volume errors (missing pages in the printed volume) ● unexpected contone image dimensions (if they looked reasonable on manual inspection, even if not an “appropriate” size) The “note from mom” tool #saa15
  • 20. “Note from mom” PREMIS event <PREMIS:event> <PREMIS:eventIdentifier> <PREMIS:eventIdentifierType>UUID</PREMIS:eventIdentifierType> <PREMIS:eventIdentifierValue>9e11033c-f2a5-38d5-a8a4-5adad8eb677b</PREMIS:eventIdentifierValue> </PREMIS:eventIdentifier> <PREMIS:eventType>manual inspection</PREMIS:eventType> <PREMIS:eventDateTime>2014-11-05T17:06:21Z</PREMIS:eventDateTime> <PREMIS:eventDetail>Manually inspect item for completeness and legibility</PREMIS:eventDetail> #saa15 <PREMIS:eventOutcomeInformation> <PREMIS:eventOutcome>validation exception granted</PREMIS:eventOutcome> <PREMIS:eventOutcomeDetail> <PREMIS:eventOutcomeDetailExtension> <HT:exceptionsAllowed category="tiff_resolution"/> </PREMIS:eventOutcomeDetailExtension> </PREMIS:eventOutcomeDetail> </PREMIS:eventOutcomeInformation>
  • 21. Touch these last ● Some volumes we ingested earlier, but they eventually need a “note from mom” and stricter validation. ● Collections with permission or ownership questions. ● Collections that contain contones we couldn’t include in our text collections, at the time. (Yes, they’re separate. Yes, it’s really user- friendly.) ● Collections containing more highly-encoded volumes (level 4 TEI). ● Collections that do not have unique materials from a preservation standpoint. ● Collections without page images. ● Obviously, collections that are licensed. #saa15
  • 22. “Final” result? Of the 167,102 volumes in our text collections, we have validated and migrated 27,308 of them to HathiTrust (16%). ● 20 collections (mostly) ingested (22%), 29 next on the docket ● 41 cannot be currently migrated It took us most of the past 3 years to validate and ingest these volumes. And the last year to go from 15% to 16%. #saa15
  • 23. Hard, but interesting We have been: ● intrigued - by the extent of certain problems ● annoyed - by lack of standards or documentation that would allow us to understand the problem ● surprised - at the differences between latter years and today in terms of preservation tools and expertise #saa15
  • 24. How did we do? Small percentage of files can’t be assessed, or require rescanning. Most errors happened in digitization and/or package creation… but we’re not sure. ● Not caused by degrading in the repository. ● Not caused by anything inherent in the file formats. Errors discovered because of HT validation requirements. #saa15
  • 25. Are we better now? Migration project offers an opportunity to evaluate how we did, where we have gone, and where we are now. Now we have more consistency - ● Processes allow for consistent metadata creation and content validation at scale. ● Content is much more consistent. ● Standards (changes to, application of local notes). #saa15
  • 26. Documentation ● Is imperative, and it is harder at scale. Evolution ● Metadata and supporting elements will continue to evolve, eg, METS and PREMIS uplifts. Decision-making ● Always a need for constant decision-making to balance real- world constraints with preservation ideals. ● Not a lot of time for second-guessing! #saa15 Preservation is iterative… not static