Digitisation at Scale: Automating the mass acquisition of digitised content
1. Digitisation at Scale:
Automating the mass acquisition of
digitised content
IS&T Archiving Conference, Washington, April 2016
Dave Thompson
Digital Curator, Wellcome Library
2. The Wellcome Library
• Part of Wellcome Collection, an astonishing public
venue in London developed by the Wellcome
Trust, where people can learn more about
medicine through the ages & across cultures
• Five-year plan for transforming the Wellcome
Library.
3. Driver for digitisation
• To make our collections available to anyone,
anywhere, we are digitising as much of our
physical collection as we can, for both our website
and the websites of other organisations. We are
also digitising and hosting collections from
partners that complement our holdings
Transforming the Wellcome Library: 2009-2014.
http://wellcomelibrary.org/what-we-do/library-strategy-and-policy/transforming-the-wellcome-library/
4. The problem
• How to scale systems & processes to deliver on
our ambition
• How to design & build new high-volume systems &
processes for acquisition, storage, processing &
access
• How to manage volumes of data during
creation/acquisition
5. Process design – sources of content
[Diagram: content flows from its sources into Goobi (METS/OCR) and on to Preservica]
• Sources: in-house, institutions, contractors, harvesting
• Formats & delivery: TIFF or JP2, delivered on HD & by ftp; grey literature as PDF
• Processing: normalises TIFF to JP2; Jpylyzer validates JP2; auto harvesting of JP2 & DMD
• Ingest: manual or automatic
• Ingest Officer / Digital Curator: snagging
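The routing in the diagram can be sketched in a few lines. This is a minimal illustration, not the Goobi workflow itself: it identifies a file from its leading bytes (the published TIFF, JP2 and PDF signatures) and names the next step. The step names and the `route` function are hypothetical, and the byte check is only a cheap pre-check, not a substitute for Jpylyzer validation.

```python
from typing import Literal

Format = Literal["tiff", "jp2", "pdf", "unknown"]

# Fixed leading bytes defined by each file format.
JP2_SIGNATURE = bytes.fromhex("0000000C6A5020200D0A870A")  # JP2 signature box
TIFF_MAGICS = (b"II*\x00", b"MM\x00*")  # little-endian and big-endian TIFF
PDF_MAGIC = b"%PDF-"


def sniff_format(header: bytes) -> Format:
    """Guess the format from the first bytes of a file."""
    if header.startswith(JP2_SIGNATURE):
        return "jp2"
    if header.startswith(TIFF_MAGICS):
        return "tiff"
    if header.startswith(PDF_MAGIC):
        return "pdf"
    return "unknown"


# Hypothetical step names. The slide only states that TIFF is normalised
# to JP2 and that JP2 is validated with Jpylyzer.
NEXT_STEP = {
    "tiff": "normalise-to-jp2",
    "jp2": "validate-with-jpylyzer",
    "pdf": "ingest-as-pdf",
    "unknown": "snagging",
}


def route(header: bytes) -> str:
    """Return the name of the next workflow step for a file header."""
    return NEXT_STEP[sniff_format(header)]
```

Anything the check cannot recognise goes to snagging, so a person looks at it rather than the workflow guessing.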
6. The approach
• (Re)use/develop existing systems where possible,
e.g. bibliographic system Sierra, Preservica EE
repository
• Identify where new systems would be required,
e.g. workflow middleware
• Take a practical approach & accept that it would
be iterative, learning as we go
8. Why Goobi?
• Dedicated to digitisation
• Flexibility & process control
• Adaptable & scalable
• Vendor expertise/support
9. Role of Goobi
• Role of Goobi is overall management & tracking of
processes
• Initiate ingest into our DAM, Preservica
• Reporting & statistics
10. Role of humans
• Working at volume did not imply more staff, it
implied efficiency
• Also implied automation
• Human work was focussed on tasks machines
couldn't do
11. System & process design
• High volume doesn’t imply use of many systems
• Requires design to be as simple as possible, with
as few moving parts as possible
• Processes need to be efficient & scalable, human
as well as system
12. Partnership for scalable digitisation
• Relationship with Internet Archive, which is
digitising our Library content
• High-volume, long-term project
• Content harvested from Internet Archive website &
processed automatically
• Dedicated Goobi process for fully automated
harvesting
13. Harvesting from Internet Archive
• Goobi has a ‘repository’ of IA identifiers for
searching/harvesting.
• Goobi harvests data from Internet Archive
website.
• Content processed automatically, including
creation of METS & ALTO.
• Content stored in Preservica.
• DDS creates JSON for the player & pre-caches
some content.
• Content available in the player.
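The sequence on this slide can be sketched as a small pipeline runner. This is an illustration under stated assumptions, not how Goobi is implemented: the stage names, the `Item` record and `run_pipeline` are all hypothetical, and the functions that do the real work are passed in as placeholders. Each identifier moves through the stages in order and stops at its first failure, which is recorded so that a person can follow it up.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical stage names that follow the order of the slide.
STAGES = ["harvest", "process", "store", "publish"]


@dataclass
class Item:
    identifier: str
    done: List[str] = field(default_factory=list)  # stages completed so far
    error: Optional[str] = None  # first failure, if any


def run_pipeline(
    identifiers: Iterable[str],
    steps: Dict[str, Callable[[str], None]],
) -> List[Item]:
    """Run each identifier through the stages, stopping at its first failure."""
    items = []
    for identifier in identifiers:
        item = Item(identifier)
        for stage in STAGES:
            try:
                steps[stage](identifier)
            except Exception as exc:
                # Record where it failed and leave the item for a human.
                item.error = f"{stage}: {exc}"
                break
            item.done.append(stage)
        items.append(item)
    return items
```

A failure in one item does not stop the others, which matters when a harvest runs unattended over a long list of identifiers.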
14. Challenges - M&Ms
• Multi-volume works
• No metadata to support their union
• Have to construct them manually, but process can
be simplified
• Time-consuming, still to be fully automated
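One way the manual step could be simplified is sketched below. It assumes a person supplies a short list of rows saying which volume belongs to which work and in what position, and that volumes are numbered from 1; the row layout and the `assemble_works` function are hypothetical and not part of Goobi. The code only orders the volumes and reports duplicates and gaps for a person to resolve.

```python
from collections import defaultdict


def assemble_works(rows):
    """Group volumes by work and order them.

    rows: iterable of (parent_id, volume_number, volume_id) tuples
    supplied by a person. Returns (ordered, problems).
    """
    works = defaultdict(dict)
    problems = []
    for parent_id, number, volume_id in rows:
        if number in works[parent_id]:
            # Keep the first entry and report the clash.
            problems.append(f"{parent_id}: volume {number} listed twice")
            continue
        works[parent_id][number] = volume_id

    ordered = {}
    for parent_id, volumes in works.items():
        numbers = sorted(volumes)
        # Assumes numbering starts at 1, so any unused number is a gap.
        missing = sorted(set(range(1, numbers[-1] + 1)) - set(numbers))
        if missing:
            problems.append(f"{parent_id}: missing volume(s) {missing}")
        ordered[parent_id] = [volumes[n] for n in numbers]
    return ordered, problems
```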
15. Challenges – Working with partners
• Changes to Internet Archive website broke our
harvesting
• For automated ftp to work 3rd parties need to
follow instructions
• Creation of JPEG2000 images/video
• Incorrect identifiers trip up processes
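A check on identifiers before processing starts is one way to keep a bad delivery from tripping up later steps. The sketch below is an assumption-laden illustration: the identifier shape ("b", seven digits, then a digit or "x"), the catalogue lookup set and the `check_delivery` function are hypothetical and are not taken from the slides.

```python
import re

# Hypothetical identifier shape: "b", seven digits, then a digit or "x".
ID_PATTERN = re.compile(r"b\d{7}[\dx]")


def check_delivery(names, known_ids):
    """Split delivered names into accepted identifiers and snags.

    names: identifiers as delivered by a third party.
    known_ids: set of identifiers already known to the catalogue.
    """
    accepted, snags = [], []
    for name in names:
        candidate = name.strip().lower()
        if not ID_PATTERN.fullmatch(candidate):
            snags.append((name, "malformed identifier"))
        elif candidate not in known_ids:
            snags.append((name, "not in catalogue"))
        else:
            accepted.append(candidate)
    return accepted, snags
```

Snags keep the name exactly as delivered, so the partner can be told what they sent.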
16. Opportunities
• Working with IT, flexibility of virtualised
environment
• Working with Intranda, brings in vendor expertise
• Distributed system brings in feedback from many
users
• Small team simplifies decision making
• Success leads to success
17. Life cycle management
• We are in a good place with regard to life cycle
management
• Consistent processes based on common
workflows
• Goobi outputs are consistent & predictable
• A unified data set is easier to manage in the future
18. Has automation been successful?
• Yes, with a ‘but’
• Automation can be complex, easy to make
mistakes
• Automation requires metadata to be available
• Automated processes still require a human minder
21. Lessons learned
• Complexity vs simplicity
• Iterative approaches work but are time consuming
• Vendor support/input crucial when starting from
scratch
• Process design essential