Mechanical curator - Technical notes

"Mechanical Curator"
(The technical story)

It began with dogfood...
• "Given access to a filesystem of media
with an easily learned layout
convention, can a researcher use their
own tools?"
• So we contrived a research question:

"Can we find the faces in the
19th C scanned book
collection?"

Outcome:
• Majority of tools and libraries expect local
filesystem or in-memory access; no
network/API knowledge needed by
researcher.
• While lookup by layout is awkward, it is a
pragmatic approach when distributing
content by sneakernet. It could be paired
with a lightweight online search engine and
a documentation wiki for best practices.

'Project' success?
• Computer Vision algorithms are
predominantly based on photographic
input. Room for improvement.
• Catch-22 with respect to training sets.
• But... applying Haar cascade profiles,
based on a photo training set, had some
reasonable success!

19C depictions of faces
• Likelihood of detection:
• Female faces > Male
• Why women?
• Drawn more symmetrically - male faces were
more likely to be exaggerated.
• Depiction is typically 'clean' and posed
• Fashion: beards, spectacles and hats - very
different to the training sets

An Interesting By-product emerged
• The ALTO XML, created by MS as part of
the digitisation process, was found to have
'GraphicalIllustration' elements.
– polygonal boundaries for areas where it
detected contiguous content but where OCR
didn't work.
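
A minimal sketch of how such regions might be listed from one ALTO page, assuming each illustration block carries the standard ALTO `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` attributes. The sample XML and the function name are hypothetical, the element name is a parameter because real files may use a different one, and the polygon outlines are not handled here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed ALTO page used only for illustration.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="T1" HPOS="10" VPOS="10" WIDTH="500" HEIGHT="80"/>
        <GraphicalIllustration ID="G1" HPOS="120" VPOS="200" WIDTH="640" HEIGHT="480"/>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def illustration_boxes(alto_xml, tag="GraphicalIllustration"):
    """Return (x, y, width, height) for every element whose local name is `tag`."""
    root = ET.fromstring(alto_xml)
    boxes = []
    for elem in root.iter():
        local_name = elem.tag.rsplit("}", 1)[-1]  # drop any XML namespace prefix
        if local_name != tag:
            continue
        # Missing attributes fall back to 0 (an assumption of this sketch).
        boxes.append(tuple(int(float(elem.get(key, 0)))
                           for key in ("HPOS", "VPOS", "WIDTH", "HEIGHT")))
    return boxes


print(illustration_boxes(SAMPLE_ALTO))
```

Each tuple can then be used to crop the matching region out of the page image.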

A map to all* the images?
* Unlikely to be comprehensive
The 'Mechanical Curator' found:
– Maps
– Portraits
– Marginalia
– Covers
– Charts and diagrams
– Decorations

Microsoft Books
• Context:
– 47k 'works' digitised, 68k volumes
– 15.3TB images, 1.3TB ALTO XML
– circa 22+ million JPEG 2000 images, 150-200 DPI
(unconfirmed), a zipfile ('store') per volume
– 360 pages per volume on average
– No explicit subjects in metadata, but heavy on
travel, geography, ethnology, (English)
literature and plenty of 'misc'

Accessible?
• In theory, the books were accessible
online.
• In practice, it was a real challenge to find
anything viewable.

Image extraction process
• Worker-based, using a message queue to
coordinate.
• Thread-unsafe (due to zips) so limited to
one worker per zip.
– Local network storage was nearly full
– Limited by hardware too (4 months to get
RAM upgrade)
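
A minimal sketch of the one-worker-per-zip pattern. It uses the standard library's `queue.Queue` and threads as stand-ins for the Redis-backed message queue and separate worker processes, so it shows the shape of the approach and not the actual setup. The function names and the job format (zip path plus wanted member names) are assumptions.

```python
import queue
import threading
import zipfile


def extract_worker(jobs, results):
    """Take whole-zip jobs off the queue until it is empty."""
    while True:
        try:
            zip_path, members = jobs.get_nowait()
        except queue.Empty:
            return
        # One job covers a whole zip, so only this worker ever has it open.
        with zipfile.ZipFile(zip_path) as store:
            for name in members:
                data = store.read(name)
                results.append((zip_path, name, len(data)))


def run_workers(zip_jobs, n_workers=4):
    """zip_jobs maps a zip path to the list of member names wanted from it."""
    jobs = queue.Queue()
    results = []  # list.append is safe to call from several threads in CPython
    for zip_path, members in zip_jobs.items():
        jobs.put((zip_path, members))
    threads = [threading.Thread(target=extract_worker, args=(jobs, results))
               for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sorted(results)
```

Because a job is a whole zip, parallelism is capped by the number of zips in flight, not by the number of images.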

Tech used:
• VirtualBox
• Redis (msg queue, semaphore, metadata
cache)
• Python
– OpenCV was the main library used:
• Opens JPEG 2000 with colour profiles
• Quick to work with image regions
• Also saved region as JPG (92%) for reuse

Filter first!
• ALTO with Illustration element is only
concern.
• Grep - quickly discerned the 1 million XML
files of interest (only 4-5% of total)
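
The same filter can be sketched in Python for cases where grep is not to hand. This is an illustrative equivalent, not the command that was run. The function name, the `*.xml` pattern and the byte string searched for are assumptions.

```python
from pathlib import Path


def files_mentioning(root, needle=b"Illustration", pattern="*.xml"):
    """Yield paths under `root` whose raw bytes contain `needle`."""
    for path in sorted(Path(root).rglob(pattern)):
        with open(path, "rb") as handle:
            # Scan line by line and stop at the first hit, much as grep -l does.
            if any(needle in line for line in handle):
                yield path
```

Searching raw bytes avoids parsing every XML file just to discard most of them.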

Resilience
• Never trust a process
– Did it fail?
– Did it fail silently?
– Does the expected JPG exist on disc? Is it
non-zero in length?
– Did IT services hard reboot your desktop
machine hosting the VMs you use in a given
night?
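
The output check above can be sketched as follows. The function names are hypothetical. The idea is to judge a job by what is on disk and not by what the worker reported.

```python
import os


def output_ok(jpg_path):
    """True only if the expected JPG exists as a file and is non-empty."""
    return os.path.isfile(jpg_path) and os.path.getsize(jpg_path) > 0


def jobs_to_retry(expected_paths):
    """Return the outputs that are missing or zero-length."""
    return [path for path in expected_paths if not output_ok(path)]
```

Running `jobs_to_retry` over the full list of expected files after a batch also covers the case where the whole machine went down mid-run.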

Overview:
• Started with one desktop VM, and a
connection to a local NAS
• Ended having used multiple VMs on Azure
as well, after piping content to their store.
– Redis replicated natively, with an SSH tunnel to
the write node

Identifiers...
• Little help available from overstretched IT
architecture team.
• Naive filename syntax to begin with:
– SYSNUM_VOL_PG_IMGIDX_humantxt.jpg
– Stored by publication year.
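
A minimal sketch of recovering metadata from such a path. It assumes the fields are separated by underscores, that only the trailing human-readable text may itself contain underscores, and that the parent directory is the publication year. The example path and its values are made up.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class ImageName(NamedTuple):
    year: str
    sysnum: str
    vol: str
    page: str
    img_idx: str
    human_text: str


def parse_image_path(path):
    """Split YEAR/SYSNUM_VOL_PG_IMGIDX_humantxt.jpg into its parts."""
    p = PurePosixPath(path)
    # Split at most four times so underscores in the human text survive.
    sysnum, vol, page, img_idx, human_text = p.stem.split("_", 4)
    return ImageName(p.parent.name, sysnum, vol, page, img_idx, human_text)


print(parse_image_path("1850/000000123_01_000045_1_a_hypothetical_title.jpg"))
```

Keeping every field in the filename means a job can be described by the path alone.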

We have images!
• 580GB JPGs
• From dogfooding, hybrid approach
seemed necessary:
• Online, sharable, linkable, easy to find
presence, with a unique ID per image.
• Easy mapping between local image and
online image.

Images already available
• ... in theory.
• We needed something else in the short-
term.

Options
• Wikimedia Commons: we know about the
books, but have no idea about the actual
content! Wikimedia Commons wouldn't be able to
handle 1 million images in one go.
• Er... Flickr?

Upload by worker
• Again, similar structure - job was simply a
filepath (metadata deducible)
• Ran approximately 16-18 workers for 9
days to upload images.
• High 90s upload success rate (time of day
dependent)
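
A success rate in the high 90s still leaves some uploads to repeat. One way to handle that is sketched below. The retry-with-backoff policy is an assumption and not something stated in these notes, and `upload` is a hypothetical callable standing in for whatever actually sends the file.

```python
import time


def upload_with_retry(upload, filepath, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call upload(filepath); on an exception wait and retry, up to `attempts` times."""
    last_error = None
    for attempt in range(attempts):
        try:
            return upload(filepath)
        except Exception as exc:  # a real worker would catch narrower errors
            last_error = exc
            if attempt < attempts - 1:
                # Wait 1x, 2x, 4x ... the base delay between tries.
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError("gave up on %s" % filepath) from last_error
```

A job that still fails after the last attempt raises, so it can be pushed back onto the queue for a quieter time of day.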

Outcome
• Launched 13 December on Flickr
Commons
• Spike: 55 million image views in 5 days
• By March 2014, 70k+ tags added by
community -
map, portrait, cover, childrensbook, and so
on.

Keeping track
• Many bad/misleading API calls
• (people.photos.)recentlyUpdated seems to
mostly work

Current scheme
• Every morning, call recentlyUpdated for
list of images that had some change
• Re-scan images and deduce changes in
tags, comments, views and favourites.
– (Same pattern, rescan jobs taken by
get_activity workers. Running 4 is enough
outside of spike times)
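
The "deduce changes" step can be sketched as a comparison between the cached snapshot of an image and a fresh one. The snapshot layout (a dict with `tags`, `comments`, `views` and `favourites`) is an assumption made for this example and not the actual API response.

```python
def deduce_changes(cached, fresh):
    """Compare two snapshots of one image and report only what changed."""
    changes = {}
    added = set(fresh["tags"]) - set(cached["tags"])
    removed = set(cached["tags"]) - set(fresh["tags"])
    if added:
        changes["tags_added"] = sorted(added)
    if removed:
        changes["tags_removed"] = sorted(removed)
    for counter in ("comments", "views", "favourites"):
        delta = fresh[counter] - cached[counter]
        if delta:
            changes[counter] = delta  # difference since the last scan
    return changes


cached = {"tags": ["map"], "comments": 1, "views": 100, "favourites": 2}
fresh = {"tags": ["map", "portrait"], "comments": 1, "views": 150, "favourites": 3}
print(deduce_changes(cached, fresh))
```

An empty result means the image was listed as updated but nothing tracked here actually changed.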

Caching
• Redis sets:
– PeopleID links to set of FlickrID+tagadded
– FlickrID links to set of user tags
– Sorted sets for 'high score' lists:
contributors, favourites, tags
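
A minimal in-memory sketch of that cache layout, using Python sets and counters as stand-ins for Redis. In Redis the set updates would roughly correspond to `SADD` and the score updates to `ZINCRBY`. The class and method names are hypothetical.

```python
from collections import Counter, defaultdict


class ActivityCache:
    """In-memory stand-in for the Redis sets and sorted sets."""

    def __init__(self):
        self.person_tags = defaultdict(set)  # PeopleID -> {(FlickrID, tag)}
        self.image_tags = defaultdict(set)   # FlickrID -> {tag}
        self.contributors = Counter()        # PeopleID -> number of tags added
        self.tag_counts = Counter()          # tag -> number of times added

    def record_tag(self, people_id, flickr_id, tag):
        """Store one tagging event; return False if it was already recorded."""
        if (flickr_id, tag) in self.person_tags[people_id]:
            return False
        self.person_tags[people_id].add((flickr_id, tag))
        self.image_tags[flickr_id].add(tag)
        self.contributors[people_id] += 1
        self.tag_counts[tag] += 1
        return True

    def top_contributors(self, n=10):
        """'High score' list of (PeopleID, tags added), highest first."""
        return self.contributors.most_common(n)
```

Because membership is checked before scoring, re-scanning the same image twice does not inflate the high score lists.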

Summary
• Workers to spin up when required
• Variety of workers, variety of queues
• Never trust a worker or process
• Never trust an API
• Sample where you can't test.
