"Mechanical Curator"
(The technical story)
It began with dogfood...
• "Given access to a filesystem of media
with an easily learned layout
convention, can a researcher use their
own tools?"
• So we contrived a research question:
"Can we find the faces in the
19th C scanned book
collection?"
Outcome:
• Majority of tools and libraries expect local
filesystem or in-memory access; no
network/API knowledge needed by
researcher.
• While lookup by layout is awkward, it is a
pragmatic approach when distributing
content by sneakernet. It could be paired
with a lightweight online search engine and
a documentation wiki for best practices.
'Project' success?
• Computer Vision algorithms are
predominantly based on photographic
input. Room for improvement.
• Catch-22 with respect to training sets.
• But... applying Haar cascade profiles,
based on a photo training set, had some
reasonable success!
19C depictions of faces
• Likelihood of detection:
• Female faces > Male
• Why women?
• Drawn more symmetrically - male faces were
more likely to be exaggerated.
• Depiction is typically 'clean' and posed
• Fashion: beards, spectacles and hats - very
different to the training sets
An Interesting By-product emerged
• The ALTO XML, created by Microsoft (MS) as
part of the digitisation process, was found to
have 'GraphicalIllustration' elements.
– polygonal boundaries for areas where it
detected contiguous content but where OCR
didn't work.
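
A minimal sketch of reading those regions out of one page of ALTO XML, using only the Python standard library. The sample document, its namespace URI and the IDs are hypothetical; the element name is the one quoted on the slide. HPOS, VPOS, WIDTH and HEIGHT are the usual ALTO block attributes, but whether this collection carried them on the illustration elements (rather than only polygon outlines) is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified ALTO page. Real files are larger and,
# per the slide, describe polygonal outlines rather than plain rectangles.
SAMPLE_ALTO = """<alto xmlns="http://example.org/alto-ns-placeholder">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1" HPOS="100" VPOS="80" WIDTH="900" HEIGHT="300"/>
        <GraphicalIllustration ID="GI1" HPOS="120" VPOS="400" WIDTH="640" HEIGHT="480"/>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on a tag."""
    return tag.rsplit("}", 1)[-1]


def illustration_boxes(alto_xml, element_name="GraphicalIllustration"):
    """Return (id, x, y, width, height) for each illustration element."""
    root = ET.fromstring(alto_xml)
    boxes = []
    for elem in root.iter():
        if local_name(elem.tag) != element_name:
            continue
        # Coordinates may be written as decimals, so go via float first.
        x, y, w, h = (int(float(elem.get(attr, 0)))
                      for attr in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
        boxes.append((elem.get("ID"), x, y, w, h))
    return boxes


print(illustration_boxes(SAMPLE_ALTO))
```

Each tuple is enough to crop the matching region out of the page image.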
A map to all* the images?
* Unlikely to be comprehensive
The 'Mechanical Curator' found:
– Maps
– Portraits
– Marginalia
– Covers
– Charts and diagrams
– Decorations
Microsoft Books
• Context:
– 47k 'works' digitised, 68k volumes
– 15.3 TB images, 1.3 TB ALTO XML
– circa 22+ million JPEG 2000 images, 150-200 DPI
(unconfirmed), a zipfile ('store') per volume
– 360 pages per volume on average
– No explicit subjects in metadata, but heavy on
travel, geography, ethnology, (English)
literature and plenty of 'misc'
Accessible?
• In theory, the books were accessible
online.
• In practice, it was a real challenge to find
anything viewable.
Image extraction process
• Worker-based, using a message queue to
coordinate.
• Thread-unsafe (due to zips) so limited to
one worker per zip.
– Local network storage was nearly full
– Limited by hardware too (4 months to get
RAM upgrade)
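
A rough sketch of the "one worker per zip" rule. The slides say Redis served as the semaphore; the class below is an in-memory stand-in so the idea can be run without a server, and its name and methods are hypothetical. With Redis, `claim` would roughly correspond to a `SET key value NX` call.

```python
import threading


class ZipClaims:
    """In-memory stand-in for a shared semaphore: one owner per zip file."""

    def __init__(self):
        self._owners = {}
        self._guard = threading.Lock()

    def claim(self, zip_path, worker_id):
        """Return True if this worker now owns the zip, False if it is taken."""
        with self._guard:
            if zip_path in self._owners:
                return False
            self._owners[zip_path] = worker_id
            return True

    def release(self, zip_path, worker_id):
        """Give the zip back, but only if this worker is the current owner."""
        with self._guard:
            if self._owners.get(zip_path) == worker_id:
                del self._owners[zip_path]


claims = ZipClaims()
print(claims.claim("vol_0001.zip", "worker-1"))  # first claim succeeds
print(claims.claim("vol_0001.zip", "worker-2"))  # second worker is refused
```

A worker that is refused simply moves on to another volume and tries this one again later.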
Tech used:
• VirtualBox
• Redis (msg queue, semaphore, metadata
cache)
• Python
– OpenCV main library used:
• Opens JPEG 2000 with colour profiles
• Quick to work with image regions
• Also saved region as JPG (92%) for reuse
Filter first!
• Only ALTO files containing an Illustration
element are of concern.
• Grep quickly identified the 1 million XML
files of interest (only 4-5% of total)
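
The slide credits grep for this step; the following is only an equivalent sketch in Python for readers who want to script the same filter. It treats the XML as raw bytes, as grep would, and does not parse it. The function name, the `.xml` suffix check and the search string are assumptions. Reading each file whole is reasonable here because 1.3 TB spread over 22 million pages works out at roughly 60 KB per file.

```python
import os


def files_of_interest(root_dir, needle=b"Illustration"):
    """Return paths of .xml files under root_dir whose bytes contain needle."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in sorted(filenames):
            if not name.endswith(".xml"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                # Plain substring test: no XML parsing, just like grep.
                if needle in handle.read():
                    hits.append(path)
    return hits
```

Only the files returned here need to be queued for the slower parse-and-crop step.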
Resilience
• Never trust a process
– Did it fail?
– Did it fail silently?
– Does the expected JPG exist on disk? Is it
non-zero in length?
– Did IT services hard-reboot the desktop
machine hosting your VMs during the
night?
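
The "exists and is non-zero" check can be sketched in a few lines of standard-library Python. The function names are hypothetical; the idea is to verify the output on disk after the fact, without relying on the worker's own report.

```python
import os


def output_ok(path):
    """True only if path is a regular file with at least one byte in it."""
    try:
        return os.path.isfile(path) and os.path.getsize(path) > 0
    except OSError:
        # Unreadable or vanished mid-check: treat as a failed job.
        return False


def jobs_to_retry(expected_paths):
    """Return the expected outputs that are missing or empty, in order."""
    return [path for path in expected_paths if not output_ok(path)]
```

Running `jobs_to_retry` over the full list of expected JPGs after a batch (or after an unplanned reboot) gives the jobs to push back onto the queue.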
Overview:
• Started with one desktop VM, and a
connection to a local NAS
• Ended up also using multiple VMs on Azure,
after piping content to Azure storage.
– Redis replicated natively, with an SSH tunnel
to the write node
Identifiers...
• Little help available from overstretched IT
architecture team.
• Naive filename syntax to begin with:
– SYSNUM_VOL_PG_IMGIDX_humantxt.jpg
– Stored by publication year.
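
A sketch of building and parsing names in the SYSNUM_VOL_PG_IMGIDX_humantxt.jpg pattern. The field values, the zero padding and the way the human-readable text is turned into a slug are all assumptions made for illustration; the slides give only the overall pattern.

```python
import re


def build_name(sysnum, vol, page, img_idx, human_text):
    """Join the five fields with underscores; slug the free-text part."""
    # Replace anything outside A-Z, a-z, 0-9 with '-', so the slug can never
    # contain the '_' separator.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", human_text).strip("-").lower()
    return f"{sysnum}_{vol}_{page}_{img_idx}_{slug}.jpg"


def parse_name(filename):
    """Split a filename back into its five fields."""
    stem = filename[:-len(".jpg")] if filename.endswith(".jpg") else filename
    # maxsplit=4 keeps any stray underscores inside the human-text field.
    sysnum, vol, page, img_idx, human = stem.split("_", 4)
    return {"sysnum": sysnum, "vol": vol, "page": page,
            "img_idx": img_idx, "human": human}


name = build_name("000123456", "01", "000045", "2", "A Tour in Wales")
print(name)
print(parse_name(name))
```

Because the metadata is recoverable from the name alone, a job can be passed around as nothing more than a file path.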
We have images!
• 580 GB of JPGs
• From dogfooding, hybrid approach
seemed necessary:
• Online, sharable, linkable, easy to find
presence, with a unique ID per image.
• Easy mapping between local image and
online image.
Images already available
• ... in theory.
• We needed something else in the
short-term.
Options
• Wikimedia Commons (WC): we know about the
books, but have no idea about the actual
content! WC wouldn't be able to handle
1 million images in one go.
• Er... Flickr?
Upload by worker
• Again, similar structure - job was simply a
filepath (metadata deducible)
• Ran approximately 16-18 workers for 9
days to upload images.
• Upload success rate in the high 90s percent
(time-of-day dependent)
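
The slides do not say how the failed uploads were handled. One plausible approach, sketched here as an assumption, is to push a failed job back onto the queue and give up after a few attempts. The function name, the `upload` callable and the attempt limit are hypothetical, and a `deque` stands in for the message queue.

```python
from collections import deque


def drain_with_retries(jobs, upload, max_attempts=3):
    """Run upload(job) for each job, re-queueing failures up to a limit."""
    queue = deque((job, 1) for job in jobs)
    done, failed = [], []
    while queue:
        job, attempt = queue.popleft()
        try:
            upload(job)
        except Exception:
            if attempt < max_attempts:
                # Send the job to the back of the queue for another try.
                queue.append((job, attempt + 1))
            else:
                failed.append(job)
        else:
            done.append(job)
    return done, failed
```

Jobs that end up in `failed` can be logged and retried at a quieter time of day.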
Outcome
• Launched 13 December 2013 on Flickr
Commons
• Spike: 55 million image views in 5 days
• By March 2014, 70k+ tags added by the
community: map, portrait, cover,
childrensbook, and so on.
Keeping track
• Many bad/misleading API calls
• (people.photos.)recentlyUpdated seems to
mostly work
Current scheme
• Every morning, call recentlyUpdated for
list of images that had some change
• Re-scan images and deduce changes in
tags, comments, views and favourites.
– (Same worker pattern: rescan jobs are taken
by get_activity workers. Running 4 is enough
outside of spike times.)
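
The "deduce changes" step amounts to comparing the previous snapshot of an image with a fresh one. A minimal sketch follows; the snapshot layout (a dict with a tag list and three counters) is an assumption and not the structure returned by the Flickr API.

```python
def diff_snapshot(previous, current):
    """Report tag additions/removals and counter deltas between two scans."""
    changes = {}
    added = set(current["tags"]) - set(previous["tags"])
    removed = set(previous["tags"]) - set(current["tags"])
    if added:
        changes["tags_added"] = sorted(added)
    if removed:
        changes["tags_removed"] = sorted(removed)
    for counter in ("comments", "views", "favourites"):
        delta = current[counter] - previous[counter]
        if delta:
            changes[counter + "_delta"] = delta
    return changes


before = {"tags": ["map"], "comments": 0, "views": 10, "favourites": 1}
after = {"tags": ["map", "portrait"], "comments": 1, "views": 25, "favourites": 1}
print(diff_snapshot(before, after))
```

An empty result means the image was reported as updated but nothing tracked here actually changed.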
Caching
• Redis sets:
– PeopleID links to set of FlickrID+tagadded
– FlickrID links to set of user tags
– Sorted sets for 'high score' lists:
contributors, favourites, tags
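
A sketch of the same bookkeeping with in-memory Python structures standing in for Redis, so it runs without a server. The class and method names are hypothetical. In Redis the two mappings would be plain sets (SADD) and the score table a sorted set (ZINCRBY to update, ZREVRANGE to read the top entries).

```python
from collections import Counter, defaultdict


class TagCache:
    """In-memory stand-in for the Redis sets and sorted set on the slide."""

    def __init__(self):
        self.tags_by_person = defaultdict(set)  # PeopleID -> {(FlickrID, tag)}
        self.tags_by_photo = defaultdict(set)   # FlickrID -> {tag}
        self.contributor_scores = Counter()     # PeopleID -> tags added

    def record_tag(self, person_id, photo_id, tag):
        """Store one tagging event; return False if it was already known."""
        entry = (photo_id, tag)
        if entry in self.tags_by_person[person_id]:
            return False
        self.tags_by_person[person_id].add(entry)
        self.tags_by_photo[photo_id].add(tag)
        self.contributor_scores[person_id] += 1
        return True

    def top_contributors(self, n=10):
        """Return up to n (PeopleID, score) pairs, highest score first."""
        return self.contributor_scores.most_common(n)
```

Ignoring already-seen events keeps the scores stable when the same image is rescanned on several mornings.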
Summary
• Workers to spin up when required
• Variety of workers, variety of queues
• Never trust a worker or process
• Never trust an API
• Sample where you can't test.
