Celi @Clic2014: OCR Errors & Named Entity Recognition in La Stampa Historical Archive
OCR ERRORS & NAMED ENTITY RECOGNITION IN LA STAMPA’S HISTORICAL ARCHIVE
Andrea Bolioli, Eleonora Marchioni, Raffaella Ventaglio
email@example.com, firstname.lastname@example.org, email@example.com
Scan and OCR
Full text indexing
NER and infographics
The project was carried out in 2010-2011.
Public web site: www.archiviolastampa.it
5 million newspaper articles, 1910-2005
We annotated about 16,000 errors and corrections in a sample of 894 front page articles.
A few examples, with the correction in parentheses: dustin hoflman, hoftman, holfman, hollman, hotfman, hotlman (dustin hoffmann), pohtica (politica), de (dc), pei (pci), doc um e nto (documento), re- latore (relatore), ima (una), gh (gli).
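One way to quantify how far an OCR variant is from its correction is the character-level edit distance. The sketch below is only an illustration and not the measure used in the project: the function name levenshtein and the choice of pairs are assumptions, and the pairs are taken from the examples above.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# (OCR output, annotated correction) pairs from the examples above.
pairs = [("pohtica", "politica"), ("ima", "una"), ("gh", "gli"), ("pei", "pci")]
distances = {ocr: levenshtein(ocr, gold) for ocr, gold in pairs}
```

Most of these variants sit one or two edits away from the correct form, which is why short words such as "gh" or "ima" are hard to repair by distance alone: many valid words are equally close.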
OCR errors classified in types:
Segmentation, Hyphenation, Character misrecognition, Punctuation, Graphics, etc.
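As a rough illustration of how some of these types differ, the sketch below assigns a type to an (OCR output, correction) pair with simple string heuristics. It is a hypothetical simplification, not the annotation scheme used in the project: the function classify_error and its rules are assumptions, and it covers only three of the listed types.

```python
def classify_error(ocr: str, gold: str) -> str:
    """Guess an error type from an OCR string and its correction.

    Heuristic only: punctuation and graphics errors are not handled
    and fall through to the last category.
    """
    # A line-break hyphen left in the text, e.g. "re- latore".
    if "- " in ocr and ocr.replace("- ", "") == gold:
        return "hyphenation"
    # A word split by spurious spaces, e.g. "doc um e nto".
    if " " in ocr and ocr.replace(" ", "") == gold:
        return "segmentation"
    # Anything else is treated as misrecognized characters, e.g. "pohtica".
    return "character misrecognition"


labels = [classify_error(o, g) for o, g in [
    ("re- latore", "relatore"),
    ("doc um e nto", "documento"),
    ("pohtica", "politica"),
]]
```

The hyphenation rule is checked first because a hyphenated form also contains a space and would otherwise be tested against the segmentation rule.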
Type and percentage of errors (errors / number of tokens)
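The percentage is the number of errors of a given type divided by the number of tokens. A minimal sketch of that computation follows; the function name error_rates and the toy input are assumptions made for illustration, not figures from the archive.

```python
from collections import Counter


def error_rates(error_types, num_tokens):
    """Return, for each error type, errors / num_tokens as a percentage."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    counts = Counter(error_types)
    return {etype: 100.0 * n / num_tokens for etype, n in counts.items()}


# Toy input (made up): four annotated errors in a 200-token sample.
toy_errors = [
    "segmentation",
    "hyphenation",
    "character misrecognition",
    "character misrecognition",
]
rates = error_rates(toy_errors, num_tokens=200)
```

Normalizing by tokens rather than by articles keeps the figure comparable across articles of different lengths.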
OCR Errors in the Historical Archive
People recognized in the front pages of the newspaper (number of articles).
Measures over a test corpus of 500 documents.
NEs that occur more than 10 times, extracted from 4.8M documents.
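The frequency cut-off can be sketched as a simple count-and-filter step. The function name frequent_entities and the toy mentions are assumptions for illustration; how the project counted and normalized entities is not described here.

```python
from collections import Counter


def frequent_entities(mentions, min_count=10):
    """Keep entities that occur strictly more than min_count times."""
    counts = Counter(mentions)
    return {ne: n for ne, n in counts.items() if n > min_count}


# Toy input (made up): one flat list of recognized entity strings.
toy_mentions = ["Torino"] * 12 + ["Roma"] * 10 + ["Fiat"] * 3
frequent = frequent_entities(toy_mentions)
```

With this threshold an entity seen exactly 10 times is dropped, matching "more than 10 times". A cut-off of this kind also discards many rare OCR-corrupted variants of a name, at the cost of losing real but infrequent entities.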
Named Entity Recognition