1. Digitisation;
Nuts & bolts at the Wellcome
Library
In the picture: getting the most out of images inside & outside your collection.
CILIP, September 2014
Dave Thompson
Digital Curator, Wellcome Library
2. The Wellcome Library
• Part of Wellcome Collection, an astonishing public
venue in London developed by the Wellcome
Trust, where people can learn more about
medicine through the ages & across cultures.
• More than 10,000 readers visit us each year,
including historians, academics, students, health
professionals & consumers, journalists, artists &
members of the general public.
3. Digitisation in the Wellcome Library
• Strategic approach, conscious planned decisions.
• Library transformation strategy, physical to digital.
• From ‘project’ to ‘production’.
• Digitisation as a sustainable end-to-end process.
• Sustainable activity delivering access to content.
4. Overview - three IT systems…
1. Workflow management system – ‘Goobi’ =
PRODUCTION.
2. Digital object repository – ‘Preservica’ =
STORAGE.
3. Front end - ‘the player’ = ACCESS.
Remember, this doesn’t include cataloguing or bibliographic systems. Here
we’re just talking about the process of creating, storing & delivering digital
content. You have to assume that those other systems are also in place.
5. Goobi is our core digitisation system
• Goobi can be used to normalise image formats,
e.g. TIFFs into JPEG2000.
• Used for reporting, volumes, numbers, etc.
• Web based, used by all staff involved in
digitisation.
• Produces METS files, flexible & standards based.
Goobi is the primary interface for most staff involved in digitisation. It’s the only
software that many use, which simplifies training & delivery.
6. Goobi workflow tracking & management
• Manages & tracks the production of content.
• Workflow driven. Already highly automated.
• Allows us to set very granular access conditions.
• Scalable & highly adaptable to different projects.
Goobi has been in production for about 3 years now & has already processed
some 2.5 million images, content which is publicly available in our player.
8. Digitisation – enter the humans
Digitised images are imported into Goobi &
automatically associated with that metadata
We use cameras not scanners for better resolution & quicker imaging.
10. Digitisation – enter the humans
Goobi initiates ingest of the JPEG2000
images & metadata into Preservica
11. Digitisation – enter the humans
Player pulls images from
Preservica using metadata in the
METS file
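As a rough illustration of how a viewer could use a METS file to find page images in reading order, here is a minimal sketch. It uses only standard METS elements (fileSec, FLocat, structMap). The sample document, file names & page labels are invented for the example, & this is not the Wellcome Library's actual METS profile or player code.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Invented sample: two page images listed in fileSec, ordered in structMap.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                            xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="OBJECTS">
      <mets:file ID="FILE_0001" MIMETYPE="image/jp2">
        <mets:FLocat LOCTYPE="URL" xlink:href="objects/example_0001.jp2"/>
      </mets:file>
      <mets:file ID="FILE_0002" MIMETYPE="image/jp2">
        <mets:FLocat LOCTYPE="URL" xlink:href="objects/example_0002.jp2"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ORDER="2" ORDERLABEL="1">
        <mets:fptr FILEID="FILE_0002"/>
      </mets:div>
      <mets:div TYPE="page" ORDER="1" ORDERLABEL="i">
        <mets:fptr FILEID="FILE_0001"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def pages_in_order(mets_xml):
    """Return (order, label, href) for each page, sorted by ORDER."""
    root = ET.fromstring(mets_xml)

    # Map each file ID in the fileSec to the location of the image.
    locations = {}
    for file_el in root.iterfind(".//mets:fileSec//mets:file", NS):
        flocat = file_el.find("mets:FLocat", NS)
        if flocat is not None:
            locations[file_el.get("ID")] = flocat.get(XLINK_HREF)

    # Walk the physical structMap & resolve each page's file pointer.
    pages = []
    query = ".//mets:structMap[@TYPE='PHYSICAL']//mets:div[@TYPE='page']"
    for div in root.iterfind(query, NS):
        fptr = div.find("mets:fptr", NS)
        if fptr is None:
            continue
        href = locations.get(fptr.get("FILEID"))
        pages.append((int(div.get("ORDER")), div.get("ORDERLABEL"), href))
    return sorted(pages, key=lambda page: page[0])


for order, label, href in pages_in_order(SAMPLE_METS):
    print(order, label, href)
```

The point of the sketch is that the page sequence lives in the metadata, not in the file names, so a front end can present pages correctly whatever the images are called in storage.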
12. Goobi – exit the humans
• In Goobi, key steps are still performed by humans.
• There are high levels of automation, but not
everything is automated.
• Ambition is to build fully automated workflows.
• Scalable & highly adaptable to different projects.
Remember, humans are still an important part of digitisation. There are some
decisions that only a human can make, & there will always be a need for
human driven processes.
13. Working with digitised content
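[Diagram: routes by which digitised content reaches Goobi & Preservica. Labels recovered from the slide:]
• Sources: in-house, institutions, contractors, harvesting.
• Formats delivered: TIFF or JP2; delivery by HD & ftp.
• Goobi normalises TIFF to JP2; Jpylyzer validates JP2.
• Manual & automatic routes; auto harvesting of JP2 & DMD.
• Grey literature: PDF.
• Ingest Officer / Digital Curator; snagging.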
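The slide names Jpylyzer as the tool that validates JP2 files. Jpylyzer performs a full check of the JP2 box structure & codestream. The sketch below is far smaller & is not a substitute: it only sniffs the first bytes of a file to tell a JP2 from a TIFF before the file enters a workflow. The function names are hypothetical; the byte signatures come from the JPEG 2000 & TIFF file format specifications.

```python
# 12-byte JP2 signature box: length 0x0000000C, type 'jP  ', content 0D 0A 87 0A.
JP2_SIGNATURE = bytes.fromhex("0000000C6A5020200D0A870A")
# A bare JPEG 2000 codestream starts with the SOC & SIZ markers.
J2K_CODESTREAM = b"\xff\x4f\xff\x51"
# Classic TIFF headers: little-endian ('II') or big-endian ('MM'), magic 42.
TIFF_HEADERS = (b"II*\x00", b"MM\x00*")


def sniff_image_format(header: bytes) -> str:
    """Guess the format from the leading bytes. This is not validation."""
    if header.startswith(JP2_SIGNATURE):
        return "jp2"
    if header.startswith(J2K_CODESTREAM):
        return "j2k-codestream"
    if header[:4] in TIFF_HEADERS:
        return "tiff"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read just enough of a file to sniff its format."""
    with open(path, "rb") as handle:
        return sniff_image_format(handle.read(len(JP2_SIGNATURE)))


print(sniff_image_format(JP2_SIGNATURE + b"\x00\x00\x00\x14ftypjp2 "))
print(sniff_image_format(b"II*\x00\x08\x00\x00\x00"))
```

A check like this can route files (normalise TIFFs, pass JP2s on to validation) but says nothing about whether a JP2 is well formed, which is why a dedicated validator remains in the workflow.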
14. Goobi – 19th century book project
• Internet Archive (IA) is digitising our 19th century
books.
• Content is uploaded by them to the IA website.
• IA run Optical Character Recognition (OCR) on the books &
create structure.
• Goobi harvests the files that the IA create to
automatically process content.
http://www.kuka-robotics.com/l
15. Looking at the IA website
https://archive.org/details/wellcomelibrary
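To give a feel for what harvesting from the IA could involve, the sketch below builds the URLs a harvester might request for one item. The archive.org metadata & download URL patterns are the IA's public ones. The file suffixes are assumptions based on common IA naming for scanned books, the identifier is a placeholder, & none of this is taken from Goobi's actual harvesting code.

```python
from urllib.parse import quote

# Assumed file suffixes for a scanned book on the IA; real items may differ.
ASSUMED_SUFFIXES = {
    "marc": "_marc.xml",
    "scandata": "_scandata.xml",
    "abbyy": "_abbyy.gz",
    "images": "_jp2.zip",
}


def harvest_urls(identifier: str) -> dict:
    """Build the URLs a harvester might fetch for one IA item."""
    safe_id = quote(identifier, safe="")
    # The metadata endpoint returns JSON describing the item & its files.
    urls = {"metadata": f"https://archive.org/metadata/{safe_id}"}
    for role, suffix in ASSUMED_SUFFIXES.items():
        filename = quote(identifier + suffix)
        urls[role] = f"https://archive.org/download/{safe_id}/{filename}"
    return urls


# "exampleitem" is a placeholder identifier, not a real Wellcome item.
for role, url in harvest_urls("exampleitem").items():
    print(role, url)
```

In practice a harvester would read the file list from the metadata response rather than trust fixed suffixes, then download only the files it needs.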
17. How the automation works
• Goobi builds a process using the MARC record.
• Against this process it imports the images.
• Uses the scandata file to create a METS file with
pagination & structure.
• Uses the raw Abbyy file to create ALTO files that
allow us to search for words & highlight search
term hits.
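The reason ALTO makes hit highlighting possible is that each recognised word carries its position on the page. The sketch below shows the idea: it finds a search term in an ALTO document & returns the box to highlight for each hit. The sample page & its coordinates are invented, & the matching is deliberately naive (whole words, case-insensitive); this is not the player's search implementation.

```python
import xml.etree.ElementTree as ET

# Invented sample page. String elements carry the word & its position.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Cholera" HPOS="210" VPOS="340" WIDTH="180" HEIGHT="42"/>
            <SP/>
            <String CONTENT="in" HPOS="400" VPOS="340" WIDTH="40" HEIGHT="42"/>
            <SP/>
            <String CONTENT="London," HPOS="450" VPOS="340" WIDTH="170" HEIGHT="42"/>
          </TextLine>
          <TextLine>
            <String CONTENT="cholera" HPOS="210" VPOS="400" WIDTH="175" HEIGHT="42"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""

PUNCTUATION = ".,;:!?()[]\"'"


def local_name(tag: str) -> str:
    """Strip the namespace so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def find_hits(alto_xml: str, term: str) -> list:
    """Return a highlight box for every word matching the search term."""
    needle = term.casefold()
    hits = []
    for element in ET.fromstring(alto_xml).iter():
        if local_name(element.tag) != "String":
            continue
        word = element.get("CONTENT", "")
        if word.strip(PUNCTUATION).casefold() != needle:
            continue
        hits.append({
            "word": word,
            "x": float(element.get("HPOS", "0")),
            "y": float(element.get("VPOS", "0")),
            "width": float(element.get("WIDTH", "0")),
            "height": float(element.get("HEIGHT", "0")),
        })
    return hits


for hit in find_hits(SAMPLE_ALTO, "cholera"):
    print(hit)
```

A viewer would scale these boxes from OCR coordinates to the displayed image size before drawing the highlights.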
http://www.impactautomation.com.au/automation
21. So, to wrap up…
• Digitisation is a strategic activity.
• We have built an end-to-end process from
selection to access.
• Working at scale so efficiency is important.
• Integrated in our OPAC. No silos.
• Well articulated architecture.
22. Thank you
Questions now, questions later…?
Dave Thompson, Digital Curator
Wellcome Library
d.thompson@wellcome.ac.uk - @d_n_t
http://wellcomelibrary.org/