Hybrid Image Retrieval in Digital Libraries by Jean-Philippe Moreux & Guillaume Chiron - EuropeanaTech Conference 2018

  1. HYBRID IMAGE RETRIEVAL IN DIGITAL LIBRARIES: A Large Scale Multicollection Experimentation of Deep Learning Techniques. Jean-Philippe MOREUX, Guillaume CHIRON. EuropeanaTech Conference 2018
  2. Outline • Introduction • ETL (Extract, Transform, Load) approach on the Great War theme: the Gallica.pix PoC • Deep learning experiments: image genre classification, visual recognition • Use cases • Conclusion ["L'Auto" magazine, photo lab (1914)]
  3. Our Users Are Looking for Iconographic Resources • People are using image retrieval with Google (2001, 2011), iPhoto (2009), Flickr (2017)... They would like to do the same with our heritage collections! But the Gallica images collection (as a test bed) contains only 1.2M items: silence, or a limited number of results → 140 documents for "Georges Clemenceau" (1914-1918) [Chart: number of image documents in Gallica for the top 100 queries on a named entity of type Person] (Section: Image Search in DLs)
  4. Fortunately, Our DLs Are Full of Images! • 1.2M pages manually indexed and tagged as "image" (picture, engraving, work of art, map...) • A large reservoir of potential illustrations in manuscripts, printed materials and born-digital content... • To valorize these assets, we need automation: → automatic recognition of illustrations → automatic description of illustrations
  5. Our Mission • ... is to let users express this kind of query: "I want caricatures/cartoons of Georges Clemenceau from all the digitized collections"
  6. Where Are the Illustrations? • Several challenges: • They are not always identified (within the scanned page) • They are sometimes stored in data silos (images/prints/manuscripts...) • They are highly variable (in time, artistic/printing techniques, scanning practices...)
  7. How to Describe Them? • More challenges... • In terms of semantic indexing of images, we still face scientific barriers • Catalogues and digital libraries have not been designed to handle the granularity of illustrations, nor to carry the adequate metadata (size, dominant colors, genre...)
  8. For What Purposes? • Different use cases must be considered: • Similarity search based on the selection of a source image • Content-based indexing (semantic labels) • Hybrid search on metadata + OCR + image content • Various user needs, from posting pictures on social media to scientific purposes
  9. For What Purposes? • Working on animal iconography? But which one? [EPFL segmentation on Gallica manuscripts; Gallica.pix (WW1)]
  10. Proof of Concept: Gallica.pix • An Extract-Transform-Load approach • On the Great War Gallica collections: still images, newspapers, magazines, monographs, posters, maps... (1910-1920) • Enriched with deep learning techniques • Extract: from catalogs and OCR • Transform: enrich the illustration metadata • Load: image retrieval (web app) (Section: ETL approach)
  11. The Tool Bag • Standard protocols and APIs • Machine learning: Software as a Service (IBM Watson, Google Cloud Vision), deep learning frameworks (TensorFlow, OpenCV/dnn) • Extract: Gallica APIs, Gallica OAI-PMH, Gallica SRU • Transform: IBM Watson, Google CV, TensorFlow, OpenCV/dnn, IIIF • Load: BaseX, XQuery, IIIF, Masonry.js • The glue: Perl and Python scripts
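As a sketch of how the Extract step talks to these APIs, a Gallica SRU request can be built in a few lines of Python. The endpoint is the public Gallica SRU service; the CQL query below is illustrative, not the exact query used by the PoC:

```python
from urllib.parse import urlencode

SRU_ENDPOINT = "https://gallica.bnf.fr/SRU"  # public Gallica SRU service

def build_sru_url(query: str, start: int = 1, rows: int = 50) -> str:
    """Build a searchRetrieve URL for the Gallica SRU API."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": query,              # CQL query string
        "startRecord": start,
        "maximumRecords": rows,
    }
    return SRU_ENDPOINT + "?" + urlencode(params)

# Illustrative CQL: the field names are assumptions to adapt
url = build_sru_url('gallica all "Clemenceau" and dc.type all "image"')
print(url)
```

Harvesting a whole corpus is then a loop over `startRecord`, which is how the "simple but massive" extraction described later proceeds.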
  12. 1. Extract • Data mining all the available data from all the data sources we have: catalog records, images, OCR, tables of contents • Image MD: size, color... • Catalog records • OCRed text around the image (when it exists), ToC (Section: ETL approach: Extract)
  13. For Printed Content, Data Mining the OCR • [Examples: Pages de Gloire, Feb. 1917; Le Miroir, Nov. 1918; La Science et la Vie, Dec. 1917] • ... the OCR can be used to identify illustrations
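OCR output in digital libraries (typically ALTO XML) already localizes non-text zones, so mining it for illustrations is mostly a parsing job. A minimal sketch, assuming ALTO's standard `Illustration` element; the sample page is made up:

```python
import xml.etree.ElementTree as ET

def extract_illustrations(alto_xml: str):
    """Return (x, y, w, h) boxes for every <Illustration> block of an ALTO page."""
    root = ET.fromstring(alto_xml)
    boxes = []
    for elem in root.iter():
        if elem.tag.split("}")[-1] == "Illustration":  # ignore any XML namespace
            boxes.append(tuple(int(elem.attrib[k])
                               for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT")))
    return boxes

sample = """<alto><Layout><Page><PrintSpace>
  <Illustration HPOS="201" VPOS="1768" WIDTH="2081" HEIGHT="725"/>
  <TextBlock HPOS="0" VPOS="0" WIDTH="100" HEIGHT="50"/>
</PrintSpace></Page></Layout></alto>"""
print(extract_illustrations(sample))  # [(201, 1768, 2081, 725)]
```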
  14. Extract: Preliminary Remark • This first step is worth the pain: it gives direct access to "invisible" illustrations (deeply hidden in the digitized content), within a GUI designed for this purpose [Query "airplane": classic results-list GUI vs. image retrieval GUI]
  15. Extract: Remarks • Challenges: • Heterogeneity of cataloguing and digitization formats and practices • Lack of essential metadata (e.g. types of illustrations, topics, color modes, dominant colors...) • Segmentation of illustrations in heterogeneous documents • Data mining raw OCR documents may produce a lot of noise • Computer-intensive treatments → A lot of engineering → Some scientific barriers → Data-centric issues
  16. Extract: Remarks on Metadata • One catalog record → 256 pages without any caption, multiple illustrations per page, multi-genre illustrations (picture, map, drawing)
  17. Extract: Remarks on OCR • Illustration data mining on raw OCR newspapers outputs a lot of noise...
  18. Extract: Remarks on OCR • For newspapers, OCR noise can be massive! → Heuristics and deep learning filtering → Redoing the segmentation? [Charts: origin of noise per collection (newspapers ≈ 99%); illustration/noise ratio: 48%/52% overall, OLR (high-end OCR) 80%/20% vs. raw OCR 9%/91%]
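The heuristic filtering mentioned here can be as simple as size and aspect-ratio rules applied to the OCR illustration boxes. A hypothetical sketch; the thresholds are illustrative, not the values used in Gallica.pix:

```python
def is_probable_noise(w: int, h: int,
                      min_side: int = 150,
                      max_aspect: float = 8.0) -> bool:
    """Flag boxes that are likely OCR segmentation noise: tiny fragments
    (ornaments, printer's rules) or extreme aspect ratios (column
    separators, headlines misread as images)."""
    if min(w, h) < min_side:
        return True
    aspect = max(w, h) / max(1, min(w, h))
    return aspect > max_aspect

boxes = [(2081, 725), (40, 12), (1800, 90)]
kept = [b for b in boxes if not is_probable_noise(*b)]
print(kept)  # [(2081, 725)]
```

Rules like these cut the bulk of the noise cheaply; the remaining ambiguous cases are what the deep learning filter (the Cover/Blank Page/Ornament/Text classes of the genre classifier below) is for.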
  19. Extract: the Pipeline • A linear pipeline, but with some variants • Multiple sources of data and formats • Simple but massive treatments • Needs some monitoring • [Pipeline: Gallica (OAI, SRU APIs) → selection (65,000 documents) → extraction of catalogs, images, OCR/OLR (12M metadata, 475,000 pages) → processing → load into the images DB (BaseX): 600,000 illustrations]
  20. 2. Transform & Enrich • Topic modeling: semantic network, LDA (Latent Dirichlet Allocation) • OCR enrichment: Google Cloud Vision • Image genre classification: convolutional neural network (CNN) model (TensorFlow, Inception-v3) • Image content recognition: IBM Watson Visual Recognition, Google Cloud Vision, OpenCV/dnn (ResNet) (Section: ETL approach: Transform & Enrich)
  21. 2.a Genre Classification with a ConvNet • Deep learning classification with a convolutional neural network (Google Inception-v3 TensorFlow model, 1,000 classes, top-5 error rate = 3.46%) and a "transfer learning" approach • We want to classify illustration genres (pictures, drawings, maps, comics, charts...)
  22. Genre Classification with a ConvNet • Transfer learning: only the last layer of the network is retrained, on a ground truth dataset of 12 classes and 12k images • Training/evaluation split: 80/20 (training ≈ 2 hours on a MacBook Pro) • 4 noisy classes: Cover, Blank Page, Ornament, Text
  23. Image Genre Classification: Results • Recall: 0.90 • Accuracy: 0.90 • Better performance can be obtained with less generic models (e.g. monographs only: recall ≈ 95%) or with fully trained models (which imply more computing power)
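Figures like these come from the confusion counts of the held-out 20% split. A minimal sketch of the computation; the counts below are made up for illustration, not the PoC's actual numbers:

```python
def recall(tp: int, fn: int) -> float:
    """Fraction of true illustrations of a genre that were retrieved."""
    return tp / (tp + fn)

def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all evaluation images classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical counts for one genre class on the evaluation split
tp, fp, fn, tn = 450, 50, 50, 1850
print(round(recall(tp, fn), 2))            # 0.9
print(round(accuracy(tp, tn, fp, fn), 2))  # 0.96
```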
  24. Image Genre Classification: Remarks • Neural nets have the ability to generalize [These kinds of maps were not included in the training dataset]
  25. Image Genre Classification: Remarks • If a new genre occurs in the data, the training dataset must be updated and the network must be retrained [Graphs, charts, scientific & technical illustrations...]
  26. Image Genre Classification: the Ads Problem • A lot of illustrated ads are not visually distinguishable from editorial content! (type of communication → graphical form) → Rules-based system, deep learning approach on text + image [28%]
  27. 2.b Visual Recognition with IBM Watson • Visual Recognition service API • Outputs pairs of class / confidence score • Detects objects, persons, faces, colors... • ["Les tanks de la bataille de Cambrai, la reine d'Angleterre écoute les explications données par un officier anglais" ("The tanks of the Battle of Cambrai: the Queen of England listens to the explanations given by an English officer"), 1917. Excerpt of the JSON response: black color 0.90, vehicle 0.70, coal black color 0.69, armored vehicle 0.57, armored personnel carrier 0.56 (type hierarchy /vehicle/wheeled vehicle/...), truck 0.52, Army Base 0.51, gas pump 0.50...]
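Whatever the service, the response boils down to class/score pairs to be filtered by confidence before loading. A sketch over a Watson-style JSON payload; the structure mirrors the excerpt above, and the threshold is an assumption:

```python
import json

payload = json.loads("""{
  "images": [{"classifiers": [{"classes": [
      {"class": "vehicle", "score": 0.706},
      {"class": "armored vehicle", "score": 0.576},
      {"class": "black color", "score": 0.905},
      {"class": "gas pump", "score": 0.5}
  ]}]}]
}""")

def top_classes(resp: dict, threshold: float = 0.55):
    """Keep class labels whose confidence clears the threshold, best first."""
    classes = resp["images"][0]["classifiers"][0]["classes"]
    kept = [(c["class"], c["score"]) for c in classes if c["score"] >= threshold]
    return sorted(kept, key=lambda t: -t[1])

print(top_classes(payload))
# [('black color', 0.905), ('vehicle', 0.706), ('armored vehicle', 0.576)]
```

The threshold choice is the recall/accuracy trade-off discussed on the next slides: lowering it admits "gas pump"-style false positives, raising it silences true detections.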
  28. Experimentation on Person Detection • Ground truth of 4,000 images for person detection • "Person" → recall = 55%, accuracy = 98% • With a WW1 custom classifier: recall = 60% • "Soldier" → recall = 50%, accuracy = 80% • Modest rates, but we have to keep in mind that Person or Soldier metadata is not available in catalog records and is difficult to express with keywords! • Keyword search on WW1 soldiers ("soldier" OR "military officer" OR "gunner" OR "aviator" OR "poilus"...) → recall = 21% [Soldiers moving a sculpture, 1918]
  29. Experimentation on Soldier Detection • [Bar chart of recall per approach: text MD only 20%, visual recognition 50%, custom classifier / hybrid 70%]
  30. Visual Recognition: Remarks • A generic service like Watson works on heritage documents, even on "difficult" ones
  31. Visual Recognition: Remarks • But we are also facing some limitations: • Generalization from contemporary training datasets → anachronisms, even on WW1 material (Segway, armored vehicle, Burgundy wine label, car bombing) • Generalization from a limited training corpus → classification errors • Complex scenes are difficult to handle • 3,000 classes are enough to satisfy generalist requests on modern or contemporary content, but not the wide spectrum of cultural objects in a heritage library...
  32. Visual Recognition: Remarks • Large unsegmented images result in generic classes: "frame", "document", "written document"...
  33. Experimentation on Face Detection • The Watson API also performs face and gender detection: "Face" → recall = 43%, accuracy = 99.9% • The combined use of the two recognition APIs (person and face detection) improves the overall recall for person detection from 55% to 65%
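The recall gain from combining the two APIs comes from a simple union: an image counts as containing a person if either detector fires. A hypothetical sketch over per-image boolean outputs (the toy data is made up; it only illustrates the mechanism):

```python
def union_recall(truth, det_a, det_b):
    """Recall of the OR-combination of two detectors.
    truth / det_a / det_b: per-image booleans (person present / detected)."""
    positives = [i for i, t in enumerate(truth) if t]
    hits = sum(1 for i in positives if det_a[i] or det_b[i])
    return hits / len(positives)

# Ten images with a person; each detector misses a different subset
truth      = [True] * 10
person_api = [True, True, True, True, True, False, False, False, False, False]
face_api   = [True, False, False, False, False, True, True, False, False, False]
print(union_recall(truth, person_api, face_api))  # 0.7
```

Recall can only go up under a union; the price is paid in accuracy, since either detector's false positives also accumulate.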
  34. Face Detection: OpenCV/dnn • dnn (deep neural networks) module in OpenCV 3.3, ResNet model • "Single Shot MultiBox Detector" (SSD) method • "Face" detection: • recall = 58%, accuracy = 92% (confidence threshold = 20%) • recall = 53%, accuracy = 94% (confidence threshold = 25%) • recall = 42%, accuracy = 98% (confidence threshold = 50%) → Frameworks are more flexible than SaaS (Watson seems to be tuned to favour accuracy: recall = 43%, accuracy = 99.9%) [gallica.pix]
  35. SaaS vs. Deep Learning Frameworks
     SaaS (IBM, Google, Amazon...) | Deep learning frameworks (TF, Keras, Caffe...)
     Almost everything in a toolbox, from content indexing to layout analysis and OCR | You have to pick the right tools, implement them and run them (but it is often a 30-line Python script)
     REST APIs (a client library may be available) | Local
     Constrained by the API design | Very flexible
     Trained on contemporary materials (sometimes the API allows you to develop a custom classifier) | You can train models on your own materials
     Licensed by volume (Google Vision: $150 for 100k images) | Free (but you need computing power)
     You need a developer | You need a developer + some deep learning expertise (but not a PhD!)
  36. Transform & Enrich: the Pipeline • A linear but complex pipeline • Using multiple tools • Complex & heavy computing • Needs monitoring • Needs training • Results need manual correction • [Pipeline: images DB (600,000 illustrations) → filtering of noise and ads (BaseX, XQuery) → classification (TensorFlow) → visual recognition (Watson API, OpenCV/dnn) → topic modeling → load into the images DB & ads DB (265,000 illustrations)]
  37. 3. Load (& Search) • In an XML database (BaseX) • Search with XQuery (REST API) • Display with IIIF • Image metadata + catalog metadata + full text • WW1 database: 200k illustrations and 65k illustrated ads, extracted from 470k pages (Section: Image Retrieval)
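Displaying with IIIF means each illustration can be served as a region crop of its page image, straight from the stored coordinates. A sketch of building such a URL; the base identifier is a placeholder, while the `{region}/{size}/{rotation}/{quality}.{format}` pattern follows the IIIF Image API:

```python
def iiif_crop_url(base: str, x: int, y: int, w: int, h: int,
                  size: str = "full", fmt: str = "jpg") -> str:
    """IIIF Image API request for a rectangular region of a scanned page."""
    return f"{base}/{x},{y},{w},{h}/{size}/0/default.{fmt}"

# Hypothetical page identifier; coordinates come from the illustration metadata
url = iiif_crop_url("https://example.org/iiif/ark:/12148/bpt6k123/f1",
                    201, 1768, 2081, 725)
print(url)
# https://example.org/iiif/ark:/12148/bpt6k123/f1/201,1768,2081,725/full/0/default.jpg
```

This is what lets the web app show only the illustration, not the whole digitized page, without storing any cropped image itself.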
  38. Image Retrieval: the Data Deluge • The complexity of the search form, and the large number of results it often leads to, reveal that searching and browsing image databases carries specific usability issues and remains a research topic in its own right...
  39. Encyclopedic Query on a Named Entity • Textual descriptors (metadata and OCR) are used • "Georges Clemenceau" query: 140 illustrations in Gallica/Images, >900 in Gallica.pix • Caricatures can be found with the "Drawing" facet [gallica.pix]
  40. Encyclopedic Query on a Concept • Interested in "airplanes"? A keyword query on "avion" returns a lot of noise... [Aviator portraits, aerial pictures, maps] [gallica.pix]
  41. Encyclopedic Query on a Concept • If we use the conceptual classes extracted by the Watson API (airplane), we can filter the noise (and get some false positives!) • Concepts overcome silent metadata or silent OCR, the multilanguage barrier and lexical evolution (from "aéronef" to "avion") • Portraits of aviators can be found with the Person facet [gallica.pix]
  42. Hybrid Query • Conceptual classes, text and image MD are used • Search for visuals relating to the urban destruction following the Battle of Verdun: class = ("street" OR "house" OR "ruin") AND keyword = "Verdun" [gallica.pix]
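In the PoC this query runs as XQuery over the BaseX database; the same logic can be sketched in Python over illustration records. The record layout below is a simplification of the actual metadata, made up for illustration:

```python
def hybrid_search(records, classes, keyword):
    """Illustrations whose CBIR classes intersect `classes`
    AND whose text (catalog metadata + OCR) contains `keyword`."""
    wanted = {c.lower() for c in classes}
    kw = keyword.lower()
    return [r for r in records
            if wanted & {c.lower() for c in r["classes"]}
            and kw in r["text"].lower()]

records = [
    {"id": 1, "classes": ["street", "ruin"], "text": "Verdun after the shelling"},
    {"id": 2, "classes": ["portrait"],       "text": "General at Verdun"},
    {"id": 3, "classes": ["house"],          "text": "A farm in Normandy"},
]
hits = hybrid_search(records, {"street", "house", "ruin"}, "Verdun")
print([r["id"] for r in hits])  # [1]
```

The point of the hybrid form is that neither signal suffices alone: text alone returns the portrait (id 2), classes alone return the farm (id 3).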
  43. Sharing the Images, but Also the CBIR Tags • The CBIR classification and tags can be exposed thanks to IIIF Presentation / Open Annotation • Open Annotations are attached to a layer (canvas) in the IIIF manifest • These annotations can be handled by a IIIF-compliant viewer, or harvested to be operated on by machines at large scale • [Sample annotation: { "@id": "", "@type": "oa:Annotation", "motivation": "oa:classifying", "resource": { "@id": "dctypes:Image", "label": "Picture" }, "on": "... 201,1768,2081,725" }] (Section: What next?)
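A sketch of generating such an annotation from an illustration record; the canvas URI is a placeholder, and the region syntax assumes the usual `#xywh=` media-fragment form used with Open Annotation on IIIF canvases:

```python
import json

def classification_annotation(canvas_uri: str, label: str,
                              x: int, y: int, w: int, h: int) -> dict:
    """Open Annotation (oa:classifying) targeting a region of a IIIF canvas."""
    return {
        "@type": "oa:Annotation",
        "motivation": "oa:classifying",
        "resource": {"@id": "dctypes:Image", "label": label},
        "on": f"{canvas_uri}#xywh={x},{y},{w},{h}",
    }

anno = classification_annotation(
    "https://example.org/iiif/doc1/canvas/f12",  # hypothetical canvas URI
    "Picture", 201, 1768, 2081, 725)
print(json.dumps(anno, indent=2))
```

Serialized this way, the CBIR tags travel with the manifest and can be rendered by any IIIF-compliant viewer or harvested in bulk.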
  44. Exposing the CBIR Tags? • Which data models for the image-content metadata? • What about interoperability? And the life cycle of these "metadata"?
     IBM Watson Visual Recog. | Google Cloud Vision | Your CBIR model
     3,000 classes* | 1,500 classes* (vocabulary list) | ?
     Hierarchical classes | Flat | ?
     soldier, soldier wearing beret, woman soldier, trooper; orange color, olive color... | soldier, troop; orangered, darkolivegreen... | ?
     abacus, abattoir, abbey (monastery-like), Aberdeen Angus cattle, abutment (support of arch or ...)... | abbess, abbey, academic certificate, action figure, advertising, aeolian landform, aerial photography... | ?
     * Found in the WW1 dataset
  45. Open Libraries • Central open data repositories are used as source datasets • New repositories/apps/datasets are developed with a decentralized approach (on your laptop, within a research lab or an institution) • These new digital resources become, in turn, sources of data • [Library of Congress Labs, Gallica.pix WW1, Europeana 14-18, your app!]
  46. Contributing to DH: Ready-to-Use Datasets & Models • Topic-based datasets: sports, ads, etc. • Document-based datasets: maps, drawings, engravings, etc. • Time periods, events, people... • Pre-trained deep learning models • [Illustrated ads: 65k; Maps: 13k; Drawings: 25k] • Very soon on ...!
  47. Conclusion • Unified access to all the illustrations of an encyclopedic digital collection is an innovative service that meets a real need • It will foster the reuse of illustrations • The maturity of AI techniques in image content indexing makes their integration into our toolbox possible • Their results, even imperfect, help make the large quantities of illustrations in our collections visible and searchable • There is no universal solution for CBIR, but many applications are just waiting to be implemented!
  48. Digital Humanities Focus • Today, the image is a new playground for DH researchers • Tomorrow, image datasets will be part of researchers' daily life • AI tools will be free and commonplace • Heritage libraries will be solicited for their iconographic collections (web archives, photo collections, newspapers and magazines, etc.) for visual data mining
  49. Thanks for your attention! [Portraits gallery] • Datasets, trained models and scripts very soon on: ... • Gallica.pix demonstrator: ...