IMPACT Final Conference - Research Parallel Sessions - 03 typewritten ocr



  1. OCR for Typewritten Documents. Stefan Pletschacher. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
  2. Overview (Stefan Pletschacher - OCR for Typewritten Documents, IMPACT Conference, London, 25.10.2011)
     - Introduction to Typewritten OCR
     - Document Types and Challenges
     - Specific Approaches
     - Results
     Image: Hansen Writing Ball (Source: Wikipedia)
  3. (The) Short History
     - 1870: first commercially manufactured typewriter
     - 1970s-80s: first PCs and desktop printers
     Images: Sholes and Glidden typewriter, 1873 (Source: Wikipedia); IBM 5150 PC, 1981 (Source: Wikipedia)
  4. Typewritten Documents
     Millions of pages of significant typewritten documents exist in archives and libraries:
     - Practically most administrative and individually-produced documents of the 20th Century
     Typewritten documents pose unique challenges to recognition:
     - Each character is produced independently of the rest: glyphs can appear with different intensity/weight even within the same word
     - Carbon copies are common: glyphs are blurred and connected to each other, and the background is textured
     - Content: administrative documents with names, abbreviations, numbers etc., which render lexicon-based recognition approaches less useful
     In addition, the usual degradations of historical documents are present due to ageing and use.
  5. Document Types and Challenges
     Document types:
     - Manuscripts
     - Scientific publications
     - Index cards
     - Administrative documents
     - Letters
     - …
     Challenges:
     - Annotations
     - Abbreviations and names
     - Carbon copies (low contrast)
     - Punch holes, staples etc.
     - Damage from regular handling (folds, tears, stains)
     - Discoloured paper (often unevenly)
  6. Some Examples
  7. Specific Approaches
     Incorporate background knowledge about typewritten documents.
     Pre-processing:
     - Improved glyph segmentation
     - Enhancement of individual glyph images
     Recognition:
     - Perform language-independent character recognition using specifically trained classifiers
     - Voting engine
  8. Typewritten OCR (TOCR) system developed in IMPACT
     Diagram, processing steps:
     - Binarisation
     - Region Segmentation
     - Text Line Segmentation
     - Glyph Segmentation
     - Glyph Enhancement
     - Composite Character Recognition: Template Matching, Feature-based Classifier, Weights, Voting Engine
     - PAGE Exporter (includes word composition)
     Diagram, data:
     - Document Image (greyscale)
     - Document Image (black-and-white)
     - PAGE XML (with text regions)
     - PAGE XML (with text lines)
     - Glyph Elements
     - Enhanced Glyphs
     - Glyph Elements (with text)
     - PAGE XML (completely filled)
  9. Some Results
     - Top: Commercial OCR
     - Bottom: IMPACT Typewritten OCR prototype
     More complete results and thus higher overall accuracy.
  10. For more information visit:
     - PRImA: http://www.primaresearch.org
     - IMPACT: http://www.impact-project.eu
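
The slides name a Voting Engine that sits alongside Template Matching, a Feature-based Classifier and Weights, but give no detail on how the votes are combined. The following is a minimal sketch of one common scheme, weighted score voting, and should not be read as the IMPACT implementation. The function name `vote`, the classifier keys `"template"` and `"feature"`, and all scores and weights are hypothetical values chosen for illustration.

```python
from collections import defaultdict


def vote(candidates, weights):
    """Pick one character label from several classifiers' scored guesses.

    candidates: dict mapping classifier name -> {label: score}
    weights:    dict mapping classifier name -> weight (assumed, not from the slides)
    Returns the label with the highest weighted score sum, or None if there
    are no candidates. Ties are broken by label order so the result is stable.
    """
    totals = defaultdict(float)
    for name, scores in candidates.items():
        weight = weights.get(name, 0.0)  # unknown classifiers contribute nothing
        for label, score in scores.items():
            totals[label] += weight * score
    if not totals:
        return None
    # max() returns the first maximum; sorting first makes tie-breaking deterministic
    return max(sorted(totals), key=lambda label: totals[label])


# Hypothetical glyph that could be the letter "O" or the digit "0".
glyph_candidates = {
    "template": {"O": 0.6, "0": 0.4},
    "feature": {"0": 0.7, "O": 0.3},
}
print(vote(glyph_candidates, {"template": 0.4, "feature": 0.6}))
```

With these made-up numbers the feature-based guess dominates because it carries the larger weight; shifting the weights towards the template matcher flips the decision. Because such a scheme scores individual characters rather than words, it does not depend on a lexicon, which fits the slides' point that names, abbreviations and numbers make lexicon-based approaches less useful.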
