Exploring Machine Learning for Libraries and Archives: Present and Future, by Bohyun Kim
A conference presentation given by Bohyun Kim, Chief Technology Officer & Professor, University of Rhode Island Libraries, USA for the Bite-sized Internet Librarian International 2021 on September 22, 2021.
Bibliotheca Digitalis. Reconstitution of Early Modern Cultural Networks. From Primary Source to Data.
DARIAH / Biblissima Summer School, 4-8 July 2017, Le Mans, France.
1st day, July 4th – Digital sources: theoretical fundamentals.
From pixels to content.
Jean-Yves Ramel – Professor of Computer Science, Computer Laboratory, University of Tours.
Abstract: https://bvh.hypotheses.org/3294#conf-JYRamel
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z..., by Maurice Nsabimana
Volunteers around the world increasingly act as human sensors to collect millions of data points. A team from the World Bank trained deep learning models, using Apache Spark and BigDL, to confirm that photos gathered through a crowdsourced data collection pilot matched the goods for which observations were submitted.
In this talk, Maurice Nsabimana, a statistician at the World Bank, and Jiao Wang, a software engineer on the Big Data Technology team at Intel, demonstrate a collaborative project to design and train large-scale deep learning models using crowdsourced images from around the world. BigDL is a distributed deep learning library designed from the ground up to run natively on Apache Spark. It enables data engineers and scientists to write deep learning applications in Scala or Python as standard Spark programs, without having to explicitly manage distributed computations. Attendees of this session will learn how to get started with BigDL, which runs in any Apache Spark environment, whether on-premise or in the Cloud.
Artificial Intelligence, Machine Learning and Deep Learning, by Sujit Pal
Slides for talk Abhishek Sharma and I gave at the Gennovation tech talks (https://gennovationtalks.com/) at Genesis. The talk was part of outreach for the Deep Learning Enthusiasts meetup group at San Francisco. My part of the talk is covered from slides 19-34.
Deep Learning is an area of machine learning and one of the most talked-about trends in business and computer science today.
In this talk, I will give a review of Deep Learning explaining what it is, what kinds of tasks it can do today, and what it probably could do in the future.
Microsoft COCO: Common Objects in Context, by KhalidKhan412
Datasets are available for facial recognition, action recognition, object detection and recognition, etc. Image datasets are helpful for scene understanding and for providing semantic descriptions.
3D-ICONS: Interactive storytelling through innovative interfaces, Carlotta C..., by 3D ICONS Project
This presentation by Carlotta Capurro and Daniel Pletinckx (Visual Dimension bvba) gives an introduction to the 3D-ICONS guidelines for creating 3D models of cultural objects. It introduces 3D capture techniques, post-processing of 3D content, 3D publishing methodology, metadata, licensing and IPR considerations, and includes a case study of the digitisation of Ename, Belgium. A 4D visualisation of the Ename abbey site has been created, providing a framework for interactive storytelling about the evolution of the abbey through time.
Introducing TensorFlow: The game changer in building "intelligent" applications, by Rokesh Jankie
This is the slide deck used for the presentation at the Amsterdam Pipeline of Data Science, held in December 2016. TensorFlow is the open-source library from Google for implementing deep learning and neural networks. This is an introduction to TensorFlow.
Note: Videos are not included (which were shown during the presentation)
Learning a Joint Embedding Representation for Image Search using Self-supervi..., by Sujit Pal
Image search interfaces either prompt the searcher to provide a search image (image-to-image search) or a text description of the image (text-to-image search). Image to Image search is generally implemented as a nearest neighbor search in a dense image embedding space, where the embedding is derived from Neural Networks pre-trained on a large image corpus such as ImageNet. Text to image search can be implemented via traditional (TF/IDF or BM25 based) text search against image captions or image tags.
In this presentation, we describe how we fine-tuned the OpenAI CLIP model (available from Hugging Face) to learn a joint image/text embedding representation from naturally occurring image-caption pairs in literature, using contrastive learning. We then show this model in action against a dataset of medical image-caption pairs, using the Vespa search engine to support text based (BM25), vector based (ANN) and hybrid text-to-image and image-to-image search.
At this online web conference, the Europeana Aggregators’ Forum will open their virtual doors to cultural heritage professionals and anyone with an interest in high quality, open cultural heritage content.
Slides 2 - 39: Europeana Network Association General Assembly by Marco de Niet, Georgia Angelaki, Erwin Verbruggen, Fred Truyen and Sara Di Giorgio
Slide 40: Keynote Frédéric Kaplan
Slide 41: State Secretary Angela Ferreira
Slide 42: Wrap up day one by Marco de Niet
Slide 45: Welcome by Marco de Niet
Slide 46: Welcome by Maria Ines Cordeiro
Slide 47: Europeana Strategy 2020+ by Rehana Schwinninger-Ladak
Slides 48 - 142: Developments at Europeana by Harry Verwayen
Slides 143 - 147: Welcome & Introduction to the conference programme by Marco de Niet
Slides 149 - 191: The Europeana Innovation Agenda highlights by Ina Blümel, Johan Oomen, Sara Di Giorgio, Lorna Hughes, Pedro Santos and Andy Neale
Slides 193 - 194: Introduction of the afternoon programme by Fred Truyen
Slides 195 - 231: We transform the world with culture by Harry Verwayen, Elisabeth Niggemann, Rehana Schwinninger-Ladak, Katherine Heid and Merete Sanderhoff
Slides 232 - : The Europeana Innovation Agenda highlights by Gregory Markus, Chris Dijkshoorn, Maarten Dammers and Harald Sack
Slide 285: Pitch your project (See pitch your project presentation slides)
Slides 286 - 290: Unsung Heroes by Marco de Niet
Slides 291 - 292: Wrap up and closure of day two by Sara Di Giorgio
Slides 2 - 6: Introduction to the programme by Georgia Angelaki
Slides 7 - 9: Keynote Michael Edson
Slides 10 - 40: Europeana Aggregators Forum by Marco Rendina
Slides 42 - 75: Promoting Cultural Heritage with digital invasion by Altheo Valentini-Egina and Marianna Marcucci
Slides 77 - 97: Opportunities for digital cultural heritage and the public domain, under the EU Copyright Rules by Paul Keller, Steven Stegers, Jurga Gradauskaite, Antje Schmidt, Sebastiaan ter Burg and Harry Verwayen
Slides 98 - 101: Climate Call for Action: Outcomes by Barbara Fischer
Slides 102 - 114: Wrap up and closure by Marco de Niet
Europeana 2019 - Connect Communities - Pitch your projectEuropeana
Slides 3 - 10: The GIFT Box: Helping museums make richer digital experiences for their visitors by Anders Sundnes Lovlie
Slides 11 - 18: Between people and things - Transfer of knowledge at SHMH by Elisabeth Böhm
Slides 19 - 30: Automated recognition of historical image content by Tino Mager
Slides 31 - 51: 50s in Europe: Kaleidoscope by Sofie Taes
Slides 52 - 63: CrowdHeritage: Crowdsourcing Platform for Enriching Europeana Metadata by Vassilis Tzouvaras
Slides 64 - 73: One by One: developing digital literacy in museums by Anra Kennedy
Slides 74 - 85: HeritageMaps.ie - Ireland's One-Stop Heritage Portal by Patrick Reid
Slides 86 - 90: Open GLAM now! - Sharing knowledge openly online by Larissa Borck
Slides 91 - 103: Endangered Archives Programme the world's most diverse online archive by Tristan Roddis
Slides 104 - 109: We transform the world with culture - Our impact on climate change by Barbara Fischer, Killian Downing and Peter Soemers
Slides 2 - 66: Shaping innovation in education with cultural heritage by Fred Truyen, Steven Stegers, Evita Tasiopoulou and Marco Neves
Slides 67 - 152: Multilingual access and machine translation by Andy Neale, Antoine Isaac, Pavel Kats, Alex Raginsky and Sergiu Gordea
Slides 155 - 164: How to implement the FAIR principles in digital culture by Sara Di Giorgio, Saskia Scheltjens and Makx Dekkers, Seamus Ross, Franco Niccolucci and Erzsébet Tóth-Czifra
Slide 166: EuropeanaTech Unconference by Clemens Neudecker
Slides 2 - 35: Introduction to Impact Workshop by Dafydd Tudur, Maja Drabczyk, Julia Fallon and Simon Tanner
Slides 36 - 68: Music to my ears: Making rights understandable by Juozas Markauskas and Jurga Gradauskaite
Slides 70 - 92: Achieving inclusivity & diversity in the Europeana Network by Killian Downing, Larissa Borck and Tola Dabiri
Slides 94 - 123: Communicating the value of digital culture to stakeholders by Susan Hazan, Eleanor Kenny and Katherine Heid
Collapsing Narratives: Exploring Non-Linearity • a micro report by Rosie Wells
Insight: In a landscape where traditional narrative structures are giving way to fragmented and non-linear forms of storytelling, there lies immense potential for creativity and exploration.
'Collapsing Narratives: Exploring Non-Linearity' is a micro report from Rosie Wells.
Rosie Wells is an Arts & Cultural Strategist uniquely positioned at the intersection of grassroots and mainstream storytelling.
Their work is focused on developing meaningful and lasting connections that can drive social change.
Please download this presentation to enjoy the hyperlinks!
This presentation, created by Syed Faiz ul Hassan, explores the profound influence of media on public perception and behavior. It delves into the evolution of media from oral traditions to modern digital and social media platforms. Key topics include the role of media in information propagation, socialization, crisis awareness, globalization, and education. The presentation also examines media influence through agenda setting, propaganda, and manipulative techniques used by advertisers and marketers. Furthermore, it highlights the impact of surveillance enabled by media technologies on personal behavior and preferences. Through this comprehensive overview, the presentation aims to shed light on how media shapes collective consciousness and public opinion.
Hybrid Image Retrieval in Digital Libraries by Jean-Philippe Moreux & Guillaume Chiron - EuropeanaTech Conference 2018
1. Hybrid Image Retrieval in Digital Libraries
A Large Scale Multicollection Experimentation of Deep Learning Techniques
Jean-Philippe Moreux, Guillaume Chiron
EuropeanaTech Conference 2018
2. Outline
• Introduction
• ETL (Extract, Transform, Load) approach on the Great War theme: the Gallica.pix PoC
• Deep Learning experimentation: Image Genre Classification, Visual Recognition
• Use cases
• Conclusion
[Image: « L’Auto » magazine, photo lab (1914)]
3. People are using image retrieval with Google (2001, 2011), iPhoto (2009), Flickr (2017)…
They would like to do the same with our heritage collections!
But the Gallica images collection (used as a test bed) only contains 1.2M items: silence, or a limited number of results.
Example: 140 documents for "Georges Clemenceau" (1914-1918)
Our users are looking for iconographic resources.
[Chart: number of image documents in Gallica for the top 100 queries on a named entity of type Person]
4. Hopefully, our DLs are full of images!
1.2M pages manually indexed and tagged as "image" (picture, engraving, work of art, map…)
A large reservoir of potential illustrations in:
• manuscripts
• printed materials
• born-digital content
To valorize these assets, we need automation:
• automatic recognition of illustrations
• automatic description of illustrations
5. Our mission… is to let users express this kind of query:
"I want caricatures/cartoons of Georges Clemenceau from all the digitized collections"
6. Where are the illustrations? Several challenges:
• They are not always identified (within the scanned page)
• They are sometimes stored in data silos (images/prints/manuscripts…)
• They are highly variable (in time, artistic/printing techniques, scanning practices…)
7. How to describe them? More challenges…
• In terms of semantic indexing of images, we still face scientific barriers
• Catalogues and digital libraries were not designed to handle the granularity of illustrations, nor the adequate metadata (size, dominant colors, genre…)
8. For what purposes?
• Different use cases must be considered:
  • Similarity search based on the selection of a source image
  • Content-based indexing (semantic labels)
  • Hybrid search on metadata + OCR + image content
• Various user needs, from posting pictures on social media to scientific purposes
9. For what purposes?
• Working on animal iconography? But which one?
[Examples: EPFL segmentation on Gallica manuscripts; Gallica.pix (WW1)]
10. Proof of Concept: Gallica.pix
• Extract-Transform-Load approach
• On the Great War Gallica collections: still images, newspapers, magazines, monographs, posters, maps… (1910-1920)
• Enriched with deep learning techniques
[Pipeline: Extract (from catalogs and OCR) → Transform & enrich (the illustrations metadata) → Load (image retrieval web app)]
11. The Tool Bag
• Standard protocols and APIs
• Machine Learning: Software as a Service (IBM Watson, Google Cloud Vision), deep learning frameworks (TensorFlow, OpenCV/dnn)
• Extract: Gallica APIs, Gallica OAI-PMH, Gallica SRU
• Transform: IBM Watson, Google CV, TensorFlow, OpenCV/dnn, IIIF
• Load: BaseX, XQuery, IIIF, Masonry.js
• The glue: Perl and Python scripts
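The Extract step glues these protocols together with scripts (the deck names Perl and Python as the glue). As a minimal sketch, two helpers could build the harvesting URLs; the endpoint paths and parameter names below follow Gallica's publicly documented SRU and IIIF services as I understand them and should be treated as assumptions, not as the authors' actual code:

```python
from urllib.parse import urlencode

# Assumed public endpoints (not taken from the slides).
SRU_ENDPOINT = "https://gallica.bnf.fr/SRU"
IIIF_BASE = "https://gallica.bnf.fr/iiif"


def sru_query_url(query: str, start: int = 1, rows: int = 50) -> str:
    """Build an SRU searchRetrieve URL for harvesting catalog records."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": query,
        "startRecord": start,
        "maximumRecords": rows,
    }
    return SRU_ENDPOINT + "?" + urlencode(params)


def iiif_crop_url(ark: str, page: int, xywh: str) -> str:
    """Build a IIIF Image API URL cropping one illustration out of a page."""
    return f"{IIIF_BASE}/{ark}/f{page}/{xywh}/full/0/native.jpg"


url = sru_query_url('gallica all "Clemenceau"')
```

Paging through `startRecord` then fetching each illustration region over IIIF is enough to feed the downstream Transform step.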
12. 1. Extract
• Data mining all the available data from all the data sources we have: catalog records, images, OCR, tables of content
• Image metadata: size, color…
• Catalog records; OCRed text around the image (when it exists); ToC
13. For printed content, data mining the OCR can be used to identify illustrations
[Examples: Pages de Gloire, Feb. 1917; Le Miroir, Nov. 1918; La Science et la Vie, Dec. 1917]
14. Extract: preliminary remark
• This first step is worth the pain: it gives direct access to "invisible" illustrations (deeply hidden in the digitized content) within a GUI designed for this purpose.
[Example: an "airplane" query in a classic results-list GUI vs. an image retrieval GUI]
15. Extract: remarks
Challenges:
• Heterogeneity of cataloguing and digitization formats and practices
• Lack of essential metadata (e.g. types of illustrations, topics, color modes, dominant colors…)
• Segmentation of illustrations in heterogeneous documents
• Data mining raw OCR documents may produce a lot of noise
• Computer-intensive treatments
In short: a lot of engineering, some scientific barriers, data-centric issues.
16. Extract: remarks on metadata
• One catalog record, 256 pages without any caption, multiple illustrations per page, multi-genre illustrations (picture, map, drawing)
17. Extract: remarks on OCR
• Illustrations data mining on raw OCR newspapers outputs a lot of noise…
18. Extract: remarks on OCR
• For newspapers, OCR noise can be massive!
→ Heuristics and deep learning filtering; redoing the segmentation?
[Charts: origin of noise per collection (newspapers dominate: 99% vs. 0.2% / 0.1% / 0.6%); illustrations/noise ratios: 48% / 52% for newspapers, 80% / 20% for OLR (high-end OCR), 91% / 9% for OCR]
19. Extract: the pipeline
• Linear pipeline, but with some variants
• Multiple sources of data and formats
• Simple but massive treatments
• Needs some monitoring
[Diagram: Gallica (OAI, SRU, API) → selection: 65,000 documents → extraction: 12M metadata records, 475,000 pages → processing → images DB (BaseX): 600,000 illustrations]
21. 2.a Genre Classification with a ConvNet
• Deep learning classification with a convolutional neural network (Google Inception-V3 TensorFlow model, 1,000 classes, top-5 error rate = 3.46%) and a "transfer learning" approach
• We want to classify illustration genres (pictures, drawings, maps, comics, charts…)
22. Genre Classification with a ConvNet
• Transfer learning: only the last layer of the network is retrained, on a ground truth dataset of 12 classes, 12k images
• Training/evaluation split: 80/20 (training ≈ 2 hours on a MacBook Pro)
• 4 noisy classes: Cover, Blank Page, Ornament, Text
23. Image Genre Classification: Results
• Recall: 0.90
• Accuracy: 0.90
• Better performance can be obtained with less generic models (e.g. monographs only: recall ≈ 95%) or with fully trained models (which imply more computing power)
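For context, figures like these can be computed directly from the held-out 20% evaluation split. A minimal sketch, with illustrative genre labels that are not from the actual ground truth:

```python
def evaluate(true_labels, predicted_labels, target):
    """Overall accuracy, plus recall for one genre class.

    Recall for `target` = fraction of items truly of that genre
    that the classifier labeled correctly.
    """
    pairs = list(zip(true_labels, predicted_labels))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    predicted_for_target = [p for t, p in pairs if t == target]
    recall = (
        predicted_for_target.count(target) / len(predicted_for_target)
        if predicted_for_target else 0.0
    )
    return accuracy, recall


# Toy evaluation set (invented for illustration).
truth = ["map", "drawing", "map", "photo", "map"]
preds = ["map", "drawing", "photo", "photo", "map"]
acc, rec = evaluate(truth, preds, "map")  # acc = 0.8, recall for "map" = 2/3
```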
24. Image Genre Classification: Remarks
• Neural nets have the ability to generalize
[Example: these kinds of maps were not included in the training dataset]
25. Image Genre Classification: Remarks
• If a new genre occurs in the data, the training dataset must be updated and the network must be retrained
[Example: graphs, charts, scientific & technical illustrations…]
26. Image Genre Classification: the ads problem
• A lot of illustrated ads are not visually distinguishable from editorial content (the type of communication drives the graphical form)!
→ Rules-based system; deep learning approach on text + image
[Chart figure: 28%]
28. Experimentation on Person Detection
• Ground truth of 4,000 images for person detection
• "Person": recall = 55%, accuracy = 98%
• With a WW1 custom classifier: recall = 60%
• "Soldier": recall = 50%, accuracy = 80%
• Modest rates, but keep in mind that Person or Soldier metadata is not available in catalog records and is difficult to express with keywords!
• Keyword search on WW1 soldiers ("soldier" OR "military officer" OR "gunner" OR "aviator" OR "poilus"…): recall = 21%
[Image: soldiers moving a sculpture, 1918]
29. Experimentation on Soldier Detection
[Bar chart: recall by method (text metadata only, visual recognition, custom classifier, hybrid); values shown include 20%, 50% and 70%]
30. Visual Recognition: remarks
• A generic service like Watson works on heritage documents, even on "difficult" ones
31. Visual Recognition: remarks
But we are also facing some limitations:
• Generalization from contemporary training datasets → anachronisms, even on WW1 (e.g. "Segway", "armored vehicle", "car bombing")
• Generalization from a limited training corpus → classification errors (e.g. "Bourgogne wine label")
• Complex scenes are difficult to handle
• 3,000 classes are enough to satisfy generalist requests for modern or contemporary content, but not for the wide spectrum of cultural objects in a heritage library…
32. Visual Recognition: remarks
• Large unsegmented images result in generic classes: "frame", "document", "written document"…
33. Experimentation on Face Detection
• The Watson API also performs face and gender detection: "Face": recall = 43%, accuracy = 99.9%
• The combined use of the two recognition APIs (person and face detection) improves the overall recall for person detection from 55% to 65%
34. Face Detection: OpenCV/dnn
• dnn (deep neural networks) module within OpenCV 3.3, ResNet model
• "Single Shot MultiBox Detector" (SSD) method
• "Face" detection:
  • recall = 58%, accuracy = 92% (confidence score = 20%)
  • recall = 53%, accuracy = 94% (confidence score = 25%)
  • recall = 42%, accuracy = 98% (confidence score = 50%)
• Frameworks are more flexible than SaaS (Watson seems to be tuned to favour accuracy: recall = 43%, accuracy = 99.9%)
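The recall/accuracy trade-off across the three confidence scores comes from a simple post-processing step on the detector's output: raising the threshold discards low-confidence candidates, trading recall for precision. A sketch of that step on hypothetical SSD detections (the detection records are invented for illustration):

```python
def filter_detections(detections, min_confidence):
    """Keep only face candidates at or above the confidence threshold."""
    return [d for d in detections if d["confidence"] >= min_confidence]


# Hypothetical raw SSD output: bounding box + confidence in [0, 1].
raw = [
    {"box": (201, 1768, 2081, 725), "confidence": 0.97},
    {"box": (10, 20, 80, 80), "confidence": 0.32},
    {"box": (500, 40, 60, 70), "confidence": 0.22},
]

faces_20 = filter_detections(raw, 0.20)  # 3 candidates: higher recall
faces_50 = filter_detections(raw, 0.50)  # 1 candidate: higher precision
```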
35. SaaS vs. Deep Learning Frameworks
SaaS (IBM, Google, Amazon…):
• Almost everything in a toolbox, from content indexing to layout analysis and OCR
• REST APIs (a client library may be available)
• Constrained by the API design
• Trained on contemporary materials (sometimes the API allows you to develop a custom classifier)
• Licensed on volumes (Google Vision: $150 for 100k images)
• You need a developer
Deep learning frameworks (TF, Keras, Caffe…):
• You have to pick the right tools, implement them and run them (but it's often a 30-line Python script)
• Local
• Very flexible
• You can train models on your own materials
• Free (but you need computing power)
• You need a developer + some deep learning expertise (but not a PhD!)
36. Transform & enrich: the pipeline
• Linear but complex pipeline, using multiple tools
• Complex & heavy computing
• Needs monitoring and training
• Results need manual correction
[Diagram: images DB (600,000 illustrations) → filtering of noise and ads (BaseX/XQuery) → classification (TensorFlow) → visual recognition (Watson API, OpenCV/dnn) → topic modelling → images DB & ads DB (265,000 illustrations)]
37. 3. Load (& Search)
• In an XML database (baseX.org)
• Search with XQuery (REST API)
• Display with IIIF
• Indexed: image metadata, catalog metadata, full text
WW1 database: 200k illustrations and 65k illustrated ads, extracted from 470k pages
http://demo14-18.bnf.fr:8984/rest?run=findIllustrations-form.xq
http://gallicastudio.bnf.fr
38. Image Retrieval: the Data Deluge
• The complexity of the search form, and the large number of results it often leads to, reveal that searching and browsing image databases raise specific usability issues and remain a research topic in their own right…
39. Encyclopedic Query on a Named Entity
• Textual descriptors (metadata and OCR) are used.
• "Georges Clemenceau" query: 140 illustrations in Gallica/Images, >900 in Gallica.pix
• Caricatures can be found with the "Drawing" facet
40. Encyclopedic Query on a Concept
• Interested in airplanes? A keyword query on "avion" returns a lot of noise: aviator portraits, aerial pictures, maps…
41. Encyclopedic Query on a Concept
• If we use the conceptual classes extracted by the Watson API ("airplane"), we can filter the noise (and get some false positives!)
• Concepts overcome silent metadata or silent OCR, the multilanguage barrier, and lexical evolution (from "aéronef" to "avion")
• Portraits of aviators can be found with the Person facet
42. Hybrid Query
• Conceptual classes, text and image metadata are used
• Search for visuals relating to the urban destruction following the Battle of Verdun: class=("street" OR "house" OR "ruin") AND keyword="Verdun"
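The hybrid query logic can be sketched in a few lines: an illustration matches if at least one of its CBIR concept classes is in the requested set AND the keyword occurs in its textual metadata or OCR. The record structure and field names below are invented for illustration (the real system runs XQuery against BaseX):

```python
def hybrid_search(records, classes, keyword):
    """Combine concept-class filtering with keyword matching."""
    keyword = keyword.lower()
    return [
        r for r in records
        if set(r["classes"]) & classes and keyword in r["text"].lower()
    ]


# Toy illustration database (invented records).
db = [
    {"id": 1, "classes": ["street", "ruin"], "text": "Verdun after the battle"},
    {"id": 2, "classes": ["portrait"], "text": "General staff at Verdun"},
    {"id": 3, "classes": ["house"], "text": "Reims cathedral district"},
]

hits = hybrid_search(db, {"street", "house", "ruin"}, "Verdun")  # record 1 only
```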
43. Sharing the images, but also the CBIR tags
• The CBIR classification and tags can be exposed thanks to IIIF Presentation / Open Annotation
• Open Annotations are attached to a layer (canvas) in the IIIF manifest
• These annotations can be handled by a IIIF-compliant viewer, or harvested to then be operated on by machines at large scale
Example annotation:
{
  "@id": "http://wellcomelibrary.org/iiif/b28047345/annos/contentAsText/a31i0",
  "@type": "oa:Annotation",
  "motivation": "oa:classifying",
  "resource": {
    "@type": "dctypes:Image",
    "label": "Picture"
  },
  "on": "http://mylibrary.org/iiif/b28047345/canvas/c31#xywh=201,1768,2081,725"
}
What next?
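Producing such an annotation for every extracted illustration is mechanical. A sketch mirroring the example above; the identifiers and region are placeholders, and the field layout simply reproduces the sample annotation rather than a full Open Annotation implementation:

```python
import json


def make_classification_annotation(anno_id, canvas_uri, xywh, label):
    """Build an oa:Annotation dict carrying one CBIR tag for a region."""
    return {
        "@id": anno_id,
        "@type": "oa:Annotation",
        "motivation": "oa:classifying",
        "resource": {"@type": "dctypes:Image", "label": label},
        "on": f"{canvas_uri}#xywh={xywh}",
    }


anno = make_classification_annotation(
    "http://mylibrary.org/iiif/b001/annos/a1",   # placeholder identifiers
    "http://mylibrary.org/iiif/b001/canvas/c31",
    "201,1768,2081,725",
    "Picture",
)
payload = json.dumps(anno, indent=2)
```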
44. Exposing the CBIR tags?
• Which data models for the image content metadata?
• What about interoperability? And the life cycle of these "metadata"?
[Comparison table, from classes found in the WW1 dataset:
• IBM Watson Visual Recognition: 3,000-class vocabulary list; hierarchical classes (e.g. orange color, olive color, …; soldier, soldier wearing beret, woman soldier, trooper; abbess, abbey, academic certificate, action figure, advertising, aeolian landform, aerial photography, …)
• Google Cloud Vision: 1,500 classes; flat (e.g. orangered, darkolivegreen, …; soldier, troop; abacus, abattoir, abbey (monastry like), Aberdeen Angus cattle, abutment (support of arch or …), abutment arch, …)
• Your CBIR model: ?]
What next?
45. Open Libraries
• Central open data repositories are used as source datasets
• New repositories/apps/datasets are developed using a decentralized approach (on your laptop, within a research lab or an institution)
• These new digital resources become in turn sources of data
What next?
[Examples: Library of Congress Labs (beyondwords.labs.loc.gov), Europeana 14-18 (https://www.europeana.eu/portal/fr/collections/world-war-I), Gallica.pix WW1, your app!]
46. Contributing to DH: ready-to-use datasets & models
• Topic-based datasets: Sports, Ads, etc.
• Document-based datasets: Maps, Drawings, Engravings, etc.
• Time periods, Events, People…
• Pre-trained deep learning models
What next?
[Dataset sizes: Drawings: 25k; Illustrated ads: 65k; Maps: 13k. Very soon on api.bnf.fr!]
47. Conclusion
• Unified access to all the illustrations in an encyclopedic digital collection is an innovative service that meets a real need. It will foster the reuse of illustrations.
• The maturity of AI techniques in image content indexing makes their integration into our toolbox possible. Their results, even imperfect, help make the large quantities of illustrations in our collections visible and searchable.
• There is no universal solution for CBIR, but many applications are just waiting to be implemented!
48. Digital Humanities focus
• Today, the image is a new playground for DH researchers
• Tomorrow, image datasets will be part of researchers' daily life
• AI tools will be free and commonplace
• Heritage libraries will be solicited for their iconographic collections (web archives, photo collections, newspapers and magazines, etc.) for visual data mining
49. Portraits Gallery
Thanks for your attention!
jean-philippe.moreux@bnf.fr
Datasets, trained models and scripts very soon on:
• api.bnf.fr
• github.com/altomator/Image_Retrieval
Gallica.pix demonstrator:
• gallicastudio.bnf.fr
• http://demo14-18.bnf.fr:8984/rest?run=findIllustrations-form.xq