Hybrid Image Retrieval in Digital Libraries by Jean-Philippe Moreux & Guillaume Chiron - EuropeanaTech Conference 2018

HYBRID IMAGE RETRIEVAL
IN DIGITAL LIBRARIES
ALarge Scale Multicollection
Experimentation of Deep Learning Techniques
Jean-Philippe MOREUX
Guillaume CHIRON
EuropeanaTech
Conference 2018

Outline
• Introduction
• ETL (Extract, Transform, Load) approach on Great War theme:
the Gallica.pix PoC
• Deep Learning experimentation:
• Image Genres Classification
• Visual Recognition
• Use cases
• Conclusion
« L’Auto » magazine, photo lab (1914)
Hybrid Image Retrieval — A Large Scale Multicollection
Experimentation of Deep Learning Techniques

People are using image retrieval with Google (2001, 2011), iPhoto
(2009), Flickr (2017)
…
They would like to do the same with
our heritage collections!
But the Gallica images collection (as a test bed)
only contains 1.2 M items: silence or limited
number of results
 140 documents for "Georges
Clemenceau" (1914-1918)
Our Users are Looking for
Iconographic Resources
Number of image documents in Gallica for the
first Top 100 queries on a named entity of type Person
3Image Search in DLs

Hopefully, our DLs are full of Images!
1.2M pages manually indexed and tagged as
"image" (picture, engraving, work of art, map…)
Large reservoir of
potential illustrations in
• manuscripts
• printed materials
• digital born content
…
To valorize these assets, we need
automation:
 automatic recognition of illustrations
 automatic description of illustrations

… is to let users express these kind of queries:
“I want caricatures/cartoons of George Clemenceau
from all the digitized collections”
Our mission

Several challenges:
• They are not always identified (within the scanned page)
• They are sometimes stored in data silos (images/prints/manuscripts…)
• They are highly variable (in time, artistic/printing techniques,
scanning practices...)
Where are the illustrations?

More challenges…
• In terms of semantic indexing of images, we still have to consider
scientific barriers
• Catalogues and digital libraries have not been designed to handle
the granularity of illustrations, neither with the adequate metadata
(size, dominant colors, genre...)
How to describe them?

For what purposes?
• Different use cases must be considered:
• Similarity search based on the selection of a source image
• Content-based indexing (semantic labels)
• Hybrid search on metadata+OCR+image content
• Various users needs, from posting pictures
on social media to scientific purposes

For what purposes?
• Working on animal iconography? But which one?
EPFL segmentation on Gallica manuscripts Gallica.pix (WW1)

• Extract-Transform-Load approach
• On Great War Gallica collections: still images, newspapers,
magazines, monographs, posters, maps… (1910-1920)
• Enriched with deep learning techniques
Proof of Concept: Gallica.pix
From catalogs
and OCRs
Transform & enrich
the illustrations
metadata
Image retrieval
(web app)
Extract Transform Load
10ETL approach

The Tool Bag
• Standard protocols and APIs
• Machine Learning: Sofware as a Service (IBM Watson, Google
Cloud Vision), deep learning frameworks (TensorFlow, OpenCV/dnn)
Extract
• Gallica APIs
• Gallica OAI-
PMH
• Gallica SRU
Transform
• IBM Watson,
Google CV
• TensorFlow
• OpenCV/dnn
• IIIF
Load
• BaseX
• XQuery
• IIIF
• Mansory.js
 The glue: Perl and Python scripts
11ETL approach

1. Extract
• Data mining all the available data from
all the data sources we have: catalog records,
images, OCR, tables of content
Image MD: size, color…
Catalog records OCRed text around image
(when exists), ToC
E
12ETL approach: Extract

For Printed Content, Data Mining the OCR
PagesdeGloire,fév.1917
Le Miroir, nov. 1918
La Science et la Vie, déc. 1917
… can be used to identify illustrations

Extract: preliminary remark
• This first step is worth the pain: it gives a direct access to
“invisible” illustrations (deeply hidden into the digitized content)
within a GUI designed for this purpose.
airplane
Classic results list GUI Image retrieval GUI

Extract: remarks
Challenges:
• Heterogeneity of cataloguing and digitization
formats and practices
• Lack of essential metadata (e.g. types of illustrations,
topics, color modes, dominant colors…)
• Segmentation of illustrations on heterogeneous documents
• Data mining raw OCR documents may produce a lot of noise
• Computer-intensive treatments
 A lot of engineering
 Some scientific barriers
 Data-centric issues

One catalog record  256 pages without any caption, multiple illustrations per
page, multi-genre illustrations (picture, map, drawing)
Extract: remarks on metadata

Illustrations data mining on raw OCR newspapers outputs a lot of noise…
Extract: remarks on OCR

48%
52%
Extract: remarks on OCR
• For newspapers, OCR noise can be massive!
 Heuristics and deep learning filtering
 Redoing the segmentation?
18Classification
99%
0.2%
0.1%
0.6%
Newspapers
Origin of noise per collections
Illustrations / noise ratio
Noise
OLR (high end OCR)
20%
80%
OCR
91%
9%

Extract: the pipeline
• Linear pipeline but with some variants
• Multiples sources of data and formats
Charger
Catalogs
Images
OCR, OLR
Images DB
65,000
documents
600,000
illustrations
Process
Selection
Extract
• Simple but massive treatments
• Needs some monitoring
12 M
metadata
475,000
pages
OAI,
SRU
API
Gallica
BaseX

2. Transform & Enrich
• Topic modeling: semantic network, LDA (Latent Dirichlet Allocation)
• OCR enrichment: Google Cloud Vision
• Image genres classification: Convolutional Neural Network (CNN) model
(TensorFlow, Inception-v3)
• Image content recognition: IBM Watson Visual Recognition, Google Cloud
Vision, ResNet OpenCV/dnn
Visual recognition
Image genres
classification
Topic Modeling
T
20ETL approach: Transform & Enrich
OCR

2.a Genres Classification with a ConvNet
• Deep learning classification with a convolutional neural network
(Google Inception-V3 TensorFlow model, 1,000 classes, Top 5 error rate
= 3,46 %) and "transfer learning » approach
• We want to classify illustrations genres (pictures, drawings, maps,
comics, charts…)

Genres Classification with a ConvNet
• Transfer learning: only the last layer of the network is retrained on
a ground thruth dataset of 12 classes, 12k images
• Training/evaluation: 80/20 (training ≈ 2 hours on a MacBook Pro)
• 4 noisy classes: Cover,
Blank Page, Ornament,
Text

Image Genres Classification: Results
• Recall: 0.90
• Accuracy: 0.90
• Better performances can be obtained on less generic models (e.g.
monographs only: recall≈95%)
or with full trained models (which
imply more computing power)

Image Genres Classification: Remarks
• Neural nets have the ability to generalize
These kind of maps were not included
in the training dataset

Image Genres Classification: Remarks
• If a new genre occurs in the data, the training dataset must be updated
& the network must be retrained
Graphs,
charts,
scientific &
technical
illustrations…

Image Genres Classification: the ads problem
A lot of illustrated ads are not visually distinguishable from
editorial contents! (type of communication  graphical form)
 Rules-based system, deep learning approach
on text+image
2
8
%
28%

2.a Visual Recognition with IBM Watson
• Visual Recognition Service API
• Outputs pairs of class/confidence score
• Detects object, person, face, color…
"images": [
{
"classifiers": [
{
"classes": [
{
"class": "armored personnel carrier",
"score": 0.568,
"type_hierarchy": "/vehicle/wheeled vehicle/armored
armored personnel carrier"
}, {
"class": "armored vehicle",
"score": 0.576 },
{
"class": "wheeled vehicle",
"score": 0.705
}, {
"class": "vehicle",
"score": 0.706
}, {
"class": "personnel carrier",
"score": 0.541,
"type_hierarchy": "/vehicle/wheeled vehicle/personn
}, {
"class": "fire engine",
"score": 0.526,
"type_hierarchy": "/vehicle/wheeled vehicle/truck/fire
}, {
"class": "truck",
"score": 0.526
}, {
"class": "structure",
"score": 0.516
}, {
"class": "Army Base",
"score": 0.511,
"type_hierarchy": "/defensive structure/Army Base"
}, {
"class": "defensive structure",
"score": 0.512
}, {
"class": "gas pump",
"score": 0.5,
"type_hierarchy": "/mechanical device/pump/gas pum
}, {
"class": "pump",
"score": 0.5
}, {
"class": "mechanical device",
"score": 0.501
}, {
"class": "black color",
"score": 0.905
}, {
"class": "coal black color",
"score": 0.691
} …
black color - 0.90
vehicle - 0.70
coal black color - 0.69
armored vehicle - 0.57
truck – 0.52
…
« Les tanks de la bataille de Cambrai, la reine d'Angleterre
écoute les explications données par un officiers anglais », 1917

Experimentation on Person Detection
• Ground truth of 4,000 images for Person detection.
• “Person”  recall=55%, accuracy=98%
• With a WW1 custom classifier: recall=60%
• “Soldier”  recall=50%, accuracy=80%
• Modest rates but we’ve to keep in mind that Person or Soldier
metadata are not available in catalog records and are difficult
to express with keywords!
• Keyword Search on WW1 Soldiers:
 recall=21%
“soldier” OR “military officer” OR “gunner”
OR “aviator” OR “poilus”…
Soldiers moving a
sculpture, 1918

Experimentation on Soldier Detection
0% 20% 40% 60% 80% 100%
Text MD only
Visual Reco.
Custom classifier
Hybrid
recall
70%
20%
50%

A generic service like Watson works on heritage documents,
even on "difficult" ones
Visual Recognition: remarks

But we are also facing some limitations:
• Generalization from contemporary training
datasets  anachronisms (even on WW1)
• Generalization from a limited training corpus:
 classification errors
• Complex scene are difficult to handle
Segway
armored vehicule
bourgogne wine label
car bombing
3,000 classes are enough to
satisfy generalist requests
for modern or contemporary
content, but not for the wide
spectrum of cultural objects
in a heritage library...

Large unsegmented images result in generic classes: "frame",
"document", "written document"…

Experimentation on Face Detection
• The Watson API also performs Face and Gender detection:
• “Face”: recall=43%, accuracy=99.9%
• The combined use of the two recognition APIs (Person and Face Detection) results
in an improvement of the overall recall for Person detection from 55% to 65%

Face detection: OpenCV/dnn
• dnn (deep neural networks) module within OpenCV 3.3, ResNet model
• “Single Shot Multibox Detector” method (SSD)
• “Face” detection:
• recall=58%, accuracy=92% (Confidence Score=20%)
 Frameworks are more flexible than SaS (Watson seems to be tuned
to favour accuracy: recall=43%, accuracy=99.9%)
gallica.pix

SaS VS Deep Learning Frameworks
SaS (IBM, Google, Amazon…) Deep Learning Frameworks
(TF, Keras, Caffe…)
Almost everything in a tool box, from
Content Indexing to Layout Analysis and
OCR
You have to pick up the right tools,
implement them and run them
(but it’s often a 30 lines Python script)
REST APIs (client library may be
available)
Local
Constrained by the API design Very flexible
Trained on contemporary materials.
(sometimes the API allows you to
develop custom classifier)
You can train models on your materials
Licenced on volumes
(Google Vision: 150$ for 100k images)
Free (but you need computing power)
You need a developer You need a developer + some deep
learning expertise (but not a PhD!)

Transform & enrich: the pipeline
Besoins :
• Linear complex pipeline
• Using multiple tools
• Complex & heavy computing
36Transformer et enrichir
ChargerImages DB
& Ads DBTopic Modelling
• Needs monitoring
• Needs training
• Results need manual correction
Images DB
600,000
illustrations
265,000
illustrations
Visual
RecognitionClassification
Filtering
noise, ads
BaseX
(XQuery)
TensorFlow
API Watson,
OpenCV/dnn

3. Load (& Search)
• In a XML database (baseX.org)
• Search with XQuery
(REST API)
• Display with IIIF
Image metadata
Catalog metadata
Full text
37Image Retrieval
http://demo14-18.bnf.fr:8984/rest?run=findIllustrations-form.xq
http://gallicastudio.bnf.fr
WW1 database:
200k illustrations
65k illustrated ads
Extracted from
470k pages

Image Retrieval: the Data Deluge
• The complexity of the search form and the large number of results it
often leads to reveal that searching and browsing in image
databases carries specific issues of usability and remains a
research topic in its own right…
38Image Retrieval

Encyclopedic Query on a Named Entity
• Textual descriptors (metadata and OCR) are used.
“George Clemenceau” query: 140 ill. in Gallica/Images, >900 in Gallica.pix
Caricatures can
be found with
the “Drawing” facet
39Image Retrieval
gallica.pix

Encyclopedic Query on a Concept
• Interested in “airplanes? A keyword query on “avion” returns lot of
noise… Aviator portraits, aerial pictures, maps
40Image Retrieval
gallica.pix

Encyclopedic Query on a Concept
• If we use the conceptual classes extracted by the Watson API
(airplane), we can filter the noise (and get some false positives!)
41Image Retrieval
Concepts overcome silent metadata or
silent OCR, multilanguage barrier, lexical
evolution (from “aéronef” to “avion”)
gallica.pix
Portraits of aviators
can be found with
the Person facet
gallica.pix

Hybrid Query
• Conceptual classes, text and image MD are used
Search for visuals relating to the urban destruction following the Battle
of Verdun: class=(”street” OR ”house” OR ”ruin”) AND
keyword=”Verdun”
42Image Retrieval
gallica.pix

{
"@id": "http://wellcomelibrary.org/iiif/b28047345/annos/contentAsText/a31i0",
"@type": "oa:Annotation",
"motivation": "oa:classifying",
"resource": {
"@id": "dctypes:Image",
"label": "Picture"
},
"on": "http://mylibrary.org/iiif/b28047345/canvas/c31#xywh=
201,1768,2081,725"
}
Sharing the images but also the CBIR tags
• The CBIR classification and tags can be exposed thanks
to IIIF Presention/Open Annotation
• Open Annotations are attached to a layer (Canvas) in the IIIF manifest
• These annotations can be handled by a IIIF compliant viewer or
harvested to be then operated by machine at large scale
What next?

Exposing the CBIR tags?
• Which data models for the image content metadata?
• What about the interoperability? And the life cycle of these "metadata"?
What next?
IBM Watson Visual Recog. Google Cloud Vision Your CBIR model
3,000 classes vocabulary list* 1,500 classes* ?
Hierarchical classes Flat ?
orange color
olive color
…
soldier
soldier wearing beret
woman soldier
trooper
orangered
darkolivegreen
…
soldier
troop
abbess
abbey
academic certificate
action figure
advertising
aeolian landform
aerial photography
…
abacus
abattoir
abbey (monastry like)
Aberdeen Angus cattle
abutment (support of arch or …)
abutment arch
…
?
* Found in the WW1 dataset

Open Libraries
• Central open data repositories are used as source datasets
• New repositories/apps/datasets are developped
using a decentralized approach (on your laptop,
within a research lab or an institution)
• These new digital resources become
in turn sources of data
What next?
Library of Congress Labs
beyondwords.labs.loc.gov/ https://www.europeana.eu/portal/fr/collections/world-war-I
Gallica.pix WW1
Europeana 14-18
Your app!

Drawings: 25k
Contributing to DH: ready-to-use datasets & models
• Topic-based datasets: Sports, Ads, etc.
• Document-based datasets: Maps, Drawings, Engraving, etc.
• Time periods, Events, People…
• Pre-trained deep learning models
What next?
Illustrated ads: 65k
Maps: 13k
Very soon on api.bnf.fr !

Conclusion
• Unified access to all illustrations in an encyclopedic
digital collection is an innovative service that meets a real
need.
• It will foster the illustrations reuse
• The maturity of AI techniques in image content indexing
makes possible their integration into our toolbox.
• Their results, even imperfect, help to make visible and
searchable the large quantities of illustrations in our
collections.
• There is no universal solution for CBIR,
but many applications are just waiting
to be implemented!
47Conclusion

Digital Humanities focus
• Today, the image is a new playground for DH researchers
• Tomorrow, image datasets will be the daily life
of researchers
• AI tools will be free and trivialized
• Heritage libraries will be solicited for their iconographic
collections (web archive, photo collections, newspapers
and magazines, etc.) for visual data mining
48Conclusion

49Portraits Galery
Thanks for your attention!
jean-philippe.moreux@bnf.fr
Datasets, trained model and scripts very soon on:
• api.bnf.fr
• github.com/altomator/Image_Retrieval
Gallica.pix demonstrator:
• gallicastudio.bnf.fr
• http://demo14-18.bnf.fr:8984/rest?run=findIllustrations-form.xq

Hybrid Image Retrieval in Digital Libraries by Jean-Philippe Moreux & Guillaume Chiron - EuropeanaTech Conference 2018

Recommended

Recommended

More Related Content

Similar to Hybrid Image Retrieval in Digital Libraries by Jean-Philippe Moreux & Guillaume Chiron - EuropeanaTech Conference 2018

Similar to Hybrid Image Retrieval in Digital Libraries by Jean-Philippe Moreux & Guillaume Chiron - EuropeanaTech Conference 2018 (20)

More from Europeana

More from Europeana (20)

Recently uploaded

Recently uploaded (19)

Hybrid Image Retrieval in Digital Libraries by Jean-Philippe Moreux & Guillaume Chiron - EuropeanaTech Conference 2018