Datech2014 - Automatic Article Extraction in Old Newspapers Digitized Collections

Slides of the presentation of the paper Automatic Article Extraction in Old Newspapers Digitized Collections by David Hebert, Thomas Palfray, Pierrick Tranouez, Stéphane Nicolas and Thierry Paquet. #digidays

Transcript

  • 1. Automatic Article Extraction in Old Newspapers Digitized Collections. David Hébert, May 19th 2014. Authors: David Hébert, Thomas Palfray, Pierrick Tranouez, Stéphane Nicolas, Thierry Paquet.
  • 2. Document digitization. David Hébert - Datech - May 19th 2014. Sample text shown on the slide (translated from French): "The white of the limestone calanques, flecked with the green sprung from a rainy spring. The blue of the harbour from which, in the distance, the Planier lighthouse emerges. All around, the city of concrete and tiles as far as the eye can see. As far as other hills... The roof terrace of Le Corbusier's building offers a panoramic view unique in Marseille. To this vantage point one must add the shouts of the children of the nursery school, for whom the terrace of the Cité radieuse, 56 metres above the ground, makes an incredible playground."
  • 3. 180 years of diversity. PlaIR: Regional Indexation Platform. Enrichment of the « Journal de Rouen »: 1762 – 1947; approximately 300 000 images; various layouts.
  • 4. Plan: 1. Proposed Approach; 2. Logical labeling at pixel level; 3. Logical structure extraction; 4. Results; 5. Conclusion and future work.
  • 5. Overview of our method. Step 1, physico-logical entities extraction (logical labeling at pixel level): labelling at the pixel level; contextualisation; a graphical, discriminative model: the CRF. Step 2, article reconstruction (logical structure extraction): a higher level of analysis; block identification; taking advantage of the hierarchical organisation of information; finding a reading order.
  • 6. Plan: 1. Proposed Approach; 2. Logical labeling at pixel level; 3. Logical structure extraction; 4. Results; 5. Conclusion and future work.
  • 7. Conditional Random Fields. Proposed by Lafferty, McCallum and Pereira in 2001 [Lafferty 01] and applied there to part-of-speech tagging. Given a sequence of observations X, find the best label sequence Y. Example: given a sequence of words, find the role of each word in the sentence; the observations are the words (discrete observations) and the labels describe each word's role in the sentence. (Figure: a chain in which each label yt is tied to its observation xt and to the neighbouring labels yt-1 and yt+1; potentials are combined locally, then globally over the sequence.) [Lafferty 01] John Lafferty, Andrew McCallum & Fernando Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proc. 18th International Conf. on Machine Learning, pages 282-289, 2001.
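The idea of combining local potentials into one global decision over the sequence can be illustrated with a small Viterbi decoder. This is a minimal sketch, not the authors' implementation: the function name `viterbi`, the toy score tables and the two-label setup are all assumptions made for illustration, and the scores stand in for already-weighted feature sums.

```python
def viterbi(obs_scores, trans_scores):
    """Return the highest-scoring label sequence of a linear chain.

    obs_scores[t][j]   : score of label j at position t (hypothetical values)
    trans_scores[i][j] : score of moving from label i to label j
    """
    n_labels = len(trans_scores)
    best = [list(obs_scores[0])]   # best[t][j]: best score of a path ending in j at t
    back = []                      # back-pointers, one list per position t >= 1
    for t in range(1, len(obs_scores)):
        row, ptr = [], []
        for j in range(n_labels):
            cands = [best[-1][i] + trans_scores[i][j] for i in range(n_labels)]
            i_best = max(range(n_labels), key=cands.__getitem__)
            row.append(cands[i_best] + obs_scores[t][j])
            ptr.append(i_best)
        best.append(row)
        back.append(ptr)
    # Follow the back-pointers from the best final label.
    path = [max(range(n_labels), key=best[-1].__getitem__)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


# Toy example: the middle observation slightly prefers label 1,
# but the transition scores reward staying in the same label.
obs = [[2.0, 0.0], [0.4, 0.6], [2.0, 0.0]]
trans = [[0.5, 0.0], [0.0, 0.5]]

local_choice = [max(range(2), key=row.__getitem__) for row in obs]
global_choice = viterbi(obs, trans)
print(local_choice, global_choice)
```

Labelling each position on its own follows the middle observation, while the chain-level decoding lets the neighbouring labels override it; that is the contextualisation the slide refers to.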
  • 8. Feature functions. A generic feature-function notation covers two kinds of functions: observation functions and transition functions. Each feature function is linked to a parameter λk. (Figure: a chain over the observations x1, x2, ..., xT with consecutive labels yt-1 and yt.) Parameter estimation: maximise the conditional log-likelihood on N labelled examples. Inference: given X, find the most probable label sequence Y*.
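The formulas on this slide did not survive in the transcript. For reference, the standard linear-chain CRF of [Lafferty 01] can be written as below. The symbols f_k, Z(X) and the superscript (n) are the usual textbook notation and are assumptions here; only X, Y, Y*, λk and N appear in the transcript.

```latex
% Conditional probability of a label sequence Y given observations X
P(Y \mid X) = \frac{1}{Z(X)} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, X, t) \Big)

% Normalisation over all candidate label sequences Y'
Z(X) = \sum_{Y'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, X, t) \Big)

% Parameter estimation: conditional log-likelihood on N labelled examples
\mathcal{L}(\lambda) = \sum_{n=1}^{N} \log P\big(Y^{(n)} \mid X^{(n)}\big)

% Inference
Y^{*} = \arg\max_{Y} P(Y \mid X)
```

In this notation an observation function is a feature f_k that depends only on y_t and X, while a transition function depends on the pair (y_{t-1}, y_t).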
  • 9. Which physico-logical entities? Pixels are described with numerical values, so some data adaptation is required to feed the CRF: multi-scale quantization of the numerical descriptors. (Figure: numerical descriptors feeding the observations x1, x2, ..., xT, with labels y1, y2, ..., yT.) D. Hébert, T. Paquet, S. Nicolas, Continuous CRF with Multi-scale Quantization Feature Functions: Application to Structure Extraction in Old Newspaper, ICDAR 2011.
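One way to picture multi-scale quantization is to map a single numerical descriptor to one discrete bin per scale, from coarse to fine. The sketch below is only an illustration under assumptions: uniform bins, the scales (2, 4, 8) and the function name `multiscale_quantize` are hypothetical, and the actual scheme is the one described in the ICDAR 2011 paper cited on the slide.

```python
def multiscale_quantize(value, lo, hi, levels=(2, 4, 8)):
    """Map a numeric descriptor to one bin index per scale.

    lo, hi : assumed value range of the descriptor
    levels : assumed number of uniform bins at each scale (coarse to fine)
    """
    clipped = min(max(value, lo), hi)          # keep the value inside [lo, hi]
    frac = (clipped - lo) / (hi - lo)          # position in the range, 0.0 to 1.0
    # The top of the range falls in the last bin rather than one past it.
    return tuple(min(int(frac * n), n - 1) for n in levels)


mid_value_bins = multiscale_quantize(37, 0, 100)
top_value_bins = multiscale_quantize(100, 0, 100)
print(mid_value_bins, top_value_bins)
```

Each bin index can then act as a discrete observation, so a model built for discrete inputs sees the same value at several resolutions.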
  • 10. Experimentations. Entities identified: text lines, titles, horizontal separators, vertical separators, noisy areas, characters, inter-character white spaces, inter-word white spaces. Observations are horizontal run lengths. Each observation is described by its length and by the median length of the vertical runs.
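The run-length observations can be sketched on a tiny binary image. This is an illustrative reading, not the authors' code: it assumes that "the vertical runs" are the vertical runs of same-valued pixels passing through each pixel of the horizontal run, and all function names are hypothetical.

```python
from itertools import groupby
from statistics import median


def horizontal_runs(row):
    """Split a row of binary pixels into (value, start, length) runs."""
    runs, start = [], 0
    for value, group in groupby(row):
        length = len(list(group))
        runs.append((value, start, length))
        start += length
    return runs


def vertical_run_length(image, r, c):
    """Length of the vertical run of same-valued pixels through (r, c)."""
    value = image[r][c]
    top = r
    while top > 0 and image[top - 1][c] == value:
        top -= 1
    bottom = r
    while bottom < len(image) - 1 and image[bottom + 1][c] == value:
        bottom += 1
    return bottom - top + 1


def describe_runs(image, r):
    """Describe each horizontal run of row r by (length, median vertical run length)."""
    features = []
    for _value, start, length in horizontal_runs(image[r]):
        heights = [vertical_run_length(image, r, c) for c in range(start, start + length)]
        features.append((length, median(heights)))
    return features


image = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
]
row0_runs = horizontal_runs(image[0])
row0_features = describe_runs(image, 0)
print(row0_runs, row0_features)
```

A long, thin run (large length, small median height) would be typical of a horizontal separator, while short runs of moderate height look more like character strokes.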
  • 11. A generic model of data. Not a complete document model, but a model of columns of information and of entity sequences: a model generic enough for various layouts.
  • 12. Approach recall. Physico-logical entities extraction (pixel-level analysis): done. Article reconstruction: a higher level of analysis to identify articles.
  • 13. Plan: 1. Proposed Approach; 2. Logical labeling at pixel level; 3. Logical structure extraction; 4. Results; 5. Conclusion and future work.
  • 14. Article reconstruction
  • 15. Article reconstruction
  • 16. Article reconstruction. (Figure: page blocks labelled D, O, R, B, F, S, Z, A, P and W.)
  • 17. Article reconstruction: reading order. (Figure: the labelled blocks D, O, R, B, F, S, Z, A, P and W, annotated with a reading order.)
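The transcript does not preserve the rules used to find the reading order, so the following is only a hypothetical sketch of what a column-based ordering could look like for a Manhattan layout. The block dictionaries, the fixed `column_width` and the rule "a title starts a new article" are all assumptions made for illustration, not the authors' method.

```python
def reading_order(blocks, column_width):
    """Order blocks column by column (left to right), then top to bottom."""
    return sorted(blocks, key=lambda b: (b["x"] // column_width, b["y"]))


def group_into_articles(ordered_blocks):
    """Start a new article at every title and attach the following blocks to it."""
    articles = []
    for block in ordered_blocks:
        if block["kind"] == "title" or not articles:
            articles.append([])
        articles[-1].append(block["name"])
    return articles


# Hypothetical blocks: x and y are the top-left corner in pixels.
blocks = [
    {"name": "T2", "kind": "title", "x": 210, "y": 300},
    {"name": "P1", "kind": "text", "x": 10, "y": 60},
    {"name": "T1", "kind": "title", "x": 12, "y": 10},
    {"name": "P2", "kind": "text", "x": 205, "y": 20},
    {"name": "P3", "kind": "text", "x": 208, "y": 340},
]

ordered_names = [b["name"] for b in reading_order(blocks, column_width=200)]
articles = group_into_articles(reading_order(blocks, column_width=200))
print(ordered_names, articles)
```

In this toy page the first article runs from the bottom of the left column into the top of the right column, which is why a reading order is needed before blocks can be grouped into articles.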
  • 18. Plan: 1. Proposed Approach; 2. Logical labeling at pixel level; 3. Logical structure extraction; 4. Results; 5. Conclusion and future work.
  • 19. Results. Quantitative evaluation on 42 manually evaluated images: 226 true articles; 245 articles detected; 194 correct detections (85.84%); over-segmentation rate of 8.41%. On the platform (http://plair.univ-rouen.fr): 21550 documents made of 4 pages on average (101978 images); 550 000 articles; approximately 20 days of computation (8 cores).
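The reported figures can be checked with a few lines of arithmetic. The detection rate reproduces the 85.84% on the slide; the precision and the per-image time are derived here from the slide's numbers and are not values reported by the authors. The per-image time also assumes that "20 days" is wall-clock time on the 8-core machine.

```python
true_articles = 226
detected_articles = 245
correct_detections = 194

# Share of true articles that were correctly detected (the slide's 85.84%).
detection_rate = round(100 * correct_detections / true_articles, 2)
# Share of detected articles that are correct (derived, not on the slide).
precision = round(100 * correct_detections / detected_articles, 2)

images = 101_978
documents = 21_550
wall_clock_seconds = 20 * 24 * 3600          # "approximately 20 days"

pages_per_document = round(images / documents, 2)
seconds_per_image = round(wall_clock_seconds / images, 1)

print(detection_rate, precision, pages_per_document, seconds_per_image)
```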
  • 20. Results on other layouts
  • 21. Conclusion and future work. Presentation of a two-step logical segmentation method: physico-logical entities segmentation with a CRF, then article identification with a generic layout model. Suitable for complex Manhattan layouts with a small set of rules. Average article detection rate of 85%. Future work: improve the CRF model (the descriptors and/or the label descriptions); add variability in the description of an entity (typically the definition of a separator).
  • 22. The end. Thanks for your attention. Questions?