Celi @Clic2014: OCR Errors & Named Entity Recognition in La Stampa Historical Archive

Here is the first of our presentations at CLIC 2014: OCR Errors and Named Entity Recognition in the Historical Archive of La Stampa

  1. OCR ERRORS & NAMED ENTITY RECOGNITION IN LA STAMPA’S HISTORICAL ARCHIVE. Andrea Bolioli, Eleonora Marchioni, Raffaella Ventaglio (abolioli@celi.it, marchioni@celi.it, ventaglio@celi.it)
  2. Processing pipeline (stages numbered 0 to 3 on the slide): Microfilm; Scan and OCR; Full text indexing; NER and infographics. The project was realized in 2010-2011. Public web site: www.archiviolastampa.it. 5 million newspaper articles, 1910-2005.
  3. OCR Errors in the Historical Archive. We annotated about 16,000 errors and corrections in a sample of 894 front page articles. A few examples, with the correct form in parentheses: dustin hoflman, hoftman, holfman, hollman, hotfman, hotlman (dustin hoffmann); pohtica (politica); de (dc); pei (pci); doc um e nto (documento); re- latore (relatore); ima (una); gh (gli). OCR errors are classified into types: Segmentation, Hyphenation, Character misrecognition, Punctuation, Graphics, etc. [Chart: type and percentage of errors (errors / number of tokens)]
  4. Named Entity Recognition. [Chart: people recognized in the front pages of the newspaper (number of articles)] Measures over a test corpus of 500 documents. NEs that occur more than 10 times, extracted from 4.8M documents.
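
The error measure named on slide 3 (errors / number of tokens, reported per error type) and the surname variants listed there can be illustrated with a minimal Python sketch. It is not the authors' implementation: the slides do not say how variants were matched to corrections, so edit distance is used here only as one plausible technique. The function names, the `max_distance` threshold, the error-type labels attached to the sample annotations, and the token count are all assumptions made for illustration.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def error_rates(annotations, num_tokens):
    """Per-type error rate, computed as errors / number of tokens.

    `annotations` is an iterable of (error_type, ocr_form, corrected_form).
    """
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    counts = Counter(error_type for error_type, _ocr, _fix in annotations)
    return {error_type: n / num_tokens for error_type, n in counts.items()}


def closest_name(ocr_form, known_names, max_distance=3):
    """Return the known name nearest to ocr_form, or None if it is too far away."""
    best = min(known_names, key=lambda name: levenshtein(ocr_form, name), default=None)
    if best is None or levenshtein(ocr_form, best) > max_distance:
        return None
    return best


if __name__ == "__main__":
    # OCR forms and corrections come from slide 3; the type labels are assumed.
    sample = [
        ("Segmentation", "doc um e nto", "documento"),
        ("Hyphenation", "re- latore", "relatore"),
        ("Character misrecognition", "pohtica", "politica"),
        ("Character misrecognition", "ima", "una"),
    ]
    # Hypothetical token count, chosen only to make the arithmetic visible.
    print(error_rates(sample, num_tokens=100))

    for variant in ["hoflman", "hoftman", "holfman", "hollman", "hotfman", "hotlman"]:
        print(variant, "->", closest_name(variant, ["hoffmann"]))
```

A fixed distance threshold is a blunt instrument: short words such as "de" (dc) or "gh" (gli) are within a small edit distance of many unrelated words, so a real system would likely need a lexicon and context in addition to string similarity.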
