BL Labs 2014 Symposium: The Mechanical Curator

Building Bridges
(and rapid depreciation)

David Foster Wallace, on Ambition:
“You know, the whole thing about perfectionism. The
perfectionism is very dangerous, because of course if your
fidelity to perfectionism is too high, you never do
anything.
Because doing anything results in— It’s actually kind of
tragic because it means you sacrifice how gorgeous and
perfect it is in your head for what it really is.”
- As told to Leonard Lopate on WNYC on March 4, 1996.
(emphasis my own)
http://blankonblank.org/interviews/david-foster-wallace-on-ambition/

The unifying theme to (pretty much)
all the requests:
Give me
EVERYTHING!
(that might be important to my work)

Why?
“Can’t they just find the things they want
through the catalogue?”

1. If they knew which bits
of data were necessary,
they would already know
the answers.

“I am
interested in
travel
accounts in
Europe during
the 19th
Century”

2. If a conventional
search interface worked,
they wouldn’t be asking.

How does conventional search work
anyway? Under what assumptions?
Starts with the Text:
“I quickly explained that many big jobs involve
a few hazards.”

Then it is Tokenised (with some assumptions
on how this is possible):
“I”, “quickly”, “explained”, “that”, “many”, “big”,
“jobs”, “involve”, “a”, “few”, “hazards”

Then, the most common words are removed
as these are assumed to be unimportant.
(Stopwords)
“quickly”, “explained”, “many”, “big”, “jobs”,
“involve”, “few”, “hazards”
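
A minimal sketch of these first two steps in Python. The regular expression, the lower-casing and the tiny STOPWORDS set are assumptions made for the example, not what any particular search engine does:

```python
import re

# Illustrative stopword list (an assumption); real engines ship much longer ones.
STOPWORDS = {"i", "that", "a", "the", "of", "and"}

def tokenise(text):
    """Lower-case the text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens):
    """Drop every token that appears in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

tokens = tokenise("I quickly explained that many big jobs involve a few hazards.")
kept = remove_stopwords(tokens)
```

Note how much is decided before a single query arrives: what counts as a word, and which words are thrown away.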

Many fulltext search services will also perform
language-specific Stemming, that is, to reduce
each word to a root:
“quick”, “explain”, “many”, “big”, “job”,
“involve”, “few”, “hazard”
(Lookup ‘porter’ and ‘snowball’ stemmers for more.)
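
To give a feel for stemming, here is a deliberately naive suffix-stripper. It is not the Porter or Snowball algorithm: the SUFFIXES tuple and the min_root threshold are hypothetical choices that happen to reproduce the roots shown above, and real stemmers apply many more rules and may return different roots.

```python
# Suffixes tried in order; a crude stand-in for a real stemmer (assumption).
SUFFIXES = ("ly", "ed", "s")

def naive_stem(word, min_root=3):
    """Strip the first matching suffix, keeping at least min_root letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_root:
            return word[: -len(suffix)]
    return word

words = ["quickly", "explained", "many", "big", "jobs", "involve", "few", "hazards"]
stems = [naive_stem(w) for w in words]
```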

Finally, an inverted index is created* and
arranged with the assumption that you want to
find the most Relevant results to future
queries.
Search terms are passed through the same
workflow.
(*Contemporary search engines are more complex of course, but the basics
are still there.)
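
The whole workflow can be sketched as a toy inverted index. Everything here is an assumption for illustration: the DOCS records are made up, the stopword list and suffix rules are crude stand-ins, and "relevance" is nothing more than counting matching terms.

```python
import re
from collections import Counter, defaultdict

STOPWORDS = {"i", "that", "a", "the", "of", "and"}  # illustrative only
SUFFIXES = ("ly", "ed", "s")                        # crude stand-in for a stemmer

def analyse(text):
    """Tokenise, drop stopwords, then strip one suffix from each token."""
    terms = []
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in STOPWORDS:
            continue
        for suffix in SUFFIXES:
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                token = token[: -len(suffix)]
                break
        terms.append(token)
    return terms

def build_index(docs):
    """Map each term to a Counter of {doc_id: occurrences}."""
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for term in analyse(text):
            index[term][doc_id] += 1
    return index

def search(index, query):
    """Run the query through the same workflow and rank by term counts."""
    scores = Counter()
    for term in analyse(query):
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    return [doc_id for doc_id, _ in scores.most_common()]

# Hypothetical documents, invented for this example.
DOCS = {
    "a": "I quickly explained that many big jobs involve a few hazards.",
    "b": "Travel accounts of Europe explained the hazards of the road.",
    "c": "A catalogue of maps.",
}
INDEX = build_index(DOCS)
results = search(INDEX, "jobs hazards")
```

Here `results` ranks document "a" ahead of "b" because it matches both query terms, and a query made only of stopwords matches nothing at all.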

Why on earth did I teach you about
search?
All services are made with compromises and
assumptions, and it is good to examine these
from time to time.
The key assumption is that people will search
for the most Relevant record that matches the
text they entered.

The most Relevant record
that matches the text they
entered.

Why not:
All the works that likely
cover a specific topic I
define or fit an arbitrary
algorithm I can supply.
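
As a sketch of what "an arbitrary algorithm I can supply" might look like: the researcher writes a function, and it is run over every work rather than over a ranked top ten. The Work record, its fields, the sample WORKS and the crude european_travel_19c test are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Work:
    # Hypothetical record layout; a real collection would carry far more.
    title: str
    year: int
    text: str

def select_all(works, predicate):
    """Return every work the researcher's own function accepts."""
    return [w for w in works if predicate(w)]

def european_travel_19c(work):
    """A researcher-defined (and deliberately crude) notion of the topic."""
    text = work.text.lower()
    places = ("europe", "paris", "rome", "alps")
    is_19c = 1801 <= work.year <= 1900
    mentions_travel = "travel" in text or "journey" in text
    mentions_place = any(place in text for place in places)
    return is_19c and mentions_travel and mentions_place

# Invented sample records, for illustration only.
WORKS = [
    Work("A Journey Through the Alps", 1847, "An account of a journey across the Alps."),
    Work("Colonial Notes", 1802, "Customs and manners of the colony."),
    Work("Modern Travel", 1925, "Travel in Europe by motor car."),
]
matches = select_all(WORKS, european_travel_19c)
```

The point is not that this predicate is good, but that the researcher, not the catalogue, decides what it is.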

“That’s great and all but
it’s all subjective; you
can’t teach a computer
that…”

http://www.robertelliottsmith.com/?p=530

2013 Competition winners
http://labs.bl.uk/Ideas+for+Labs
Pieter Francois

Dan Norton - “Mixing the Library. Information
Interaction and the DJ”
Can a researcher record a session drawing
from digital objects, in the same way a DJ does
with music tracks?

The other unifying themes to the
requests:
“I need tools to help me interpret the vast
amount of content you hold. You don’t provide
any but make it impossible for others to do
so.”
“I want to work on broad sweeps of content,
rather than book-by-book. It would take too
much time to get each one.”
“API? what’s that? I don’t care. Just give me the
files.”

So, a challenge was born…
If a researcher is given direct file access to a
large amount of data, can it be useful?
What internal conventions would need to be
removed? What external conventions added?
One way to try it out was to pretend to be a
researcher and to ‘eat our own dogfood’.

How has the depiction of
faces changed in books
over the 19th Century?
aka how well do modern photographic
face detection routines work on 19th C
illustrations?

Success? Not really.
Many more female faces were found than
male.
This did not mean that there are more
images of women in the books than men!

19C depictions of faces
• Often drawn more symmetrically - male faces
were more likely to be exaggerated.
• Depiction is typically 'clean' and posed
• Fashion: beards, spectacles and hats - different
to the modern photographic training data

There was something else though...
People on their way past would occasionally
pause and look over my shoulder.
Every day it dug up illustrations that
surprised me and the team around me.
So… I wondered if anyone else might be
surprised and intrigued by them too?
http://mechanicalcurator.tumblr.com/archive

How does machine learning work?
First, turn the raw data into numbers,
something the computer can deal with:
eg when analysing text, assign a number to
each word and form a ‘dictionary’
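
A minimal sketch of such a ‘dictionary’, assuming numbers are simply handed out in order of first appearance (the function names are hypothetical):

```python
def build_dictionary(tokens):
    """Assign each distinct word the next unused integer."""
    dictionary = {}
    for token in tokens:
        if token not in dictionary:
            dictionary[token] = len(dictionary)
    return dictionary

def encode(tokens, dictionary):
    """Replace each word with its number."""
    return [dictionary[t] for t in tokens]

tokens = ["big", "jobs", "involve", "big", "hazards"]
dictionary = build_dictionary(tokens)
encoded = encode(tokens, dictionary)
```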

Process the numeric data in an effort to
better expose the “important” information
- removing noise and tone variation from an
image
- turning a grid of pixels into independent
trackable ‘points of interest’
- hue, saturation, levels
- produce metrics
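
For instance, a couple of simple colour metrics can be produced from raw pixels with Python's standard colorsys module. This is only a sketch: the pixel list is invented, and which metrics are actually worth computing is an assumption that depends on the task.

```python
import colorsys

def colour_metrics(pixels):
    """Summarise a list of 8-bit (r, g, b) pixels as mean saturation and value."""
    hsv = [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255) for r, g, b in pixels]
    n = len(hsv)
    return {
        "mean_saturation": sum(s for _, s, _ in hsv) / n,
        "mean_value": sum(v for _, _, v in hsv) / n,
    }

# Invented pixels: pure red, mid grey, white, black.
pixels = [(255, 0, 0), (128, 128, 128), (255, 255, 255), (0, 0, 0)]
metrics = colour_metrics(pixels)
```

A page of black-and-white engraving would score close to zero on mean saturation, which is already a usable signal.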

Annotate - manually or automatically - what is
useful and what is not in a portion of the data:
- Characteristics:
- Spam or not?
- Face at x,y,w,h
- Positive, neutral and negative sentiment
- Scalar qualities

Pass most of the ‘known’ data through one of
many machine learning algorithms, such as a
Support Vector Machine (SVM) as
implemented in libsvm.
Which one depends entirely on what you want the
computer to be able to do once trained.

Test your trained machine with half of the rest
of the data to see how it does.
eg if characterising email, does it correctly spot
Spam?
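
The train-then-test loop can be sketched without any libraries. Note the swap: this uses a nearest-centroid classifier as a simple stand-in, not an SVM and not libsvm, and the two features per email (counts of exclamation marks and of links) and all the labelled examples are invented for illustration.

```python
def centroid(points):
    """Average a list of equal-length feature tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def train(examples):
    """Learn one centroid per label from (features, label) pairs."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(points) for label, points in by_label.items()}

def predict(model, features):
    """Pick the label whose centroid is closest (squared distance)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(features, c))
    return min(model, key=lambda label: sq_dist(model[label]))

def accuracy(model, examples):
    """Fraction of held-back examples the model labels correctly."""
    correct = sum(predict(model, f) == label for f, label in examples)
    return correct / len(examples)

# Invented annotated data: (exclamation marks, links) -> label.
KNOWN = [
    ((5, 3), "spam"), ((4, 4), "spam"), ((6, 2), "spam"),
    ((0, 0), "ham"), ((1, 0), "ham"), ((0, 1), "ham"),
]
HELD_BACK = [((5, 5), "spam"), ((1, 1), "ham")]

model = train(KNOWN)
score = accuracy(model, HELD_BACK)
```

The held-back examples never touch training, so `score` says something about how the model copes with data it has not seen.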

Now, use the trained profile on real data!
Sometimes, these profiles are shared, for
example, Haar cascades trained on
photographic datasets (face, body, etc) are
freely available

Why the second lesson?
Analysis starts with a bulk set of data, and a
set of assumptions and ideas.
The usefulness of a stemming/tokenising
search service is unquestioned and Libraries
support metadata-level search.
No-one can support all assumptions and
ideas!

Surprising? It was an experiment,
after all...

Accessible?
• In theory, the books were accessible.
• In practice, it was a real challenge to find
anything viewable.
The chasm between digital and print:
http://samplegenerator.cloudapp.net

As this is all in the public domain
anyway...
What’s the harm in making it a bit more
accessible?
The Mechanical Curator twitter account has
only got a handful of people following it
after all. Maybe there isn’t much appetite for
it?

Impact?
Hard to measure:
- 20 million hits on average every month,
over 200 million in 10 months*.
- Over 100,000 tags added.
- Hundreds of contributors.
- Iterative crowdsourcing is ongoing.
- Peter Balman’s aforementioned project
* Are image view stats really a good measure?

Research and Technology
• Mario Klingemann Pattern Recognition Software
• Collaborative PhD ‘A History of the Printed Image 1750-1850: Applying
Data Science Techniques to Printed Book Illustration’
• TSB Digital Innovation Contest New tech for tracking Public Domain in
the Wild

Crowdsourcing & Apps
• Metadata Games
• Wikipedia Synoptic Index
• BL Georeferencer - 3221 maps referenced in a few weeks!

[Tangent warning]
Scott Nicholson’s RECIPE

Creative Uses
• David Normal installation at Burning Man Festival
• “Moments” by Joe Bell
• Colouring-in Pages for Children

Tutorials
• Using Photoshop to Up-res images
• Converting images to vector graphics

Collaborations with Colleagues
• Inspired by Flickr, a Sound Archive series
• Maps will be fed into the next phase of the Georeferencer

Education
• Images included in Wikipedia Articles
• University of Minnesota English Literature Course Exercise on Tagging
• Art Therapy Courses

The ‘British Library Big Data
Experiment’
http://britishlibrary.typepad.co.uk/digital-scholarship/2014/06/the-british-library-big-data-experiment.html
“What can a group of UCL Big Data CS
students do when given access to cloud
computing, all of the book data and a focus
group of digital humanists?”

The ‘British Library Big Data
Experiment’
Next phase will work with an undergraduate
team with experience at image analysis.
We are hosting an event on the 18th of
December 2014, on “Pattern Recognition”.

In summary, “Clarity”
It is clear that we can:
fail and fail quickly
build experiments that
won’t last
open content
build bridges

My contact details for later technical
questions:
ben.osteen@bl.uk
@benosteen
Links:
http://labs.bl.uk
http://mechanicalcurator.tumblr.com
https://flickr.com/photos/britishlibrary
https://github.com/bl-labs
http://britishlibrary.typepad.co.uk/digital-scholarship/2013/12/a-million-first-steps.html

Image credits:
Title image: from https://www.flickr.com/photos/britishlibrary/11223645575
Title: "The Book of The Grand Junction Railway, being a history and description of the line from Birmingham to Liverpool and
Manchester ... By T. Roscoe, assisted by the resident engineers of the line"
Author: Roscoe, Thomas.
Shelfmark: "British Library HMNTS 796.f.3."
https://www.flickr.com/photos/britishlibrary/11209677645 - Foot Bridge, Dartmoor
https://www.flickr.com/photos/britishlibrary/11208502325 - The Suspension Bridge
https://www.flickr.com/photos/britishlibrary/11234482436 - Wensleydale & Swaledale
Image taken from page 97 of 'The Mineral Baths of Bath. The Bathes of Bathe's Ayde in the reign of Charles 2nd as
illustrated by a drawing of the King's and Queen's Bath, signed 1675. Whereunto is annexed a Visit to Bath in the year
1675 by “A Person of Q" by The British Library (More from this book here: https://www.flickr.com/search/?tags=sysnum000878624)
Image taken from page 467 of '[The History of New South Wales, including Botany Bay, Port Jackson, Pamaratta [sic],
Sydney, and all its dependancies ... with the customs and manners of the natives, and an account of the English colony,
from its foundation https://www.flickr.com/photos/britishlibrary/11001417405
http://britishlibrary.typepad.co.uk/digital-scholarship/2013/10/peeking-behind-the-curtain-of-the-mechanical-curator.html
