Europeana Newspapers LFT Infoday Muehlberger

Text and Structure Recognition for Historical Newspapers
Günter Mühlberger
University of Innsbruck – Digitisation and Digital Preservation

This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp
Who we are
•Digitisation and Digital Preservation group @ University of Innsbruck
•Involved in digitisation and Optical Character Recognition (OCR) since the mid-1990s
•Research projects: LAURIN, METADATA ENGINE, books2u!, reUSE, Digitisation on Demand, eBooks on Demand, IMPACT, PrestoPRIME, ARROW+, Europeana Newspapers, tranScriptorium, …
•Our mission: “Digitisation of humanities” = Digital Humanities
•Selected digitisation projects
•Austrian Literature Online (since 2002)
•Digitisation of the Innsbrucker Newspaper Archive (2004-2006)
•Digitisation of the Tiroler Tageszeitung, 1945-2003 (2012-2014)
•Text recognition of 8 million newspaper pages within Europeana Newspapers
•Commercial services via the technology transfer platform of the University

Digitisation
IMAGE CAPTURING
TEXT & STRUCTURE RECOGNITION
NATURAL LANGUAGE PROCESSING
CONTENT REPRESENTATION

Example – Index card: Capturing

OCR Interface

Raw OCR Text
“Â.”- ikonogr.
religiös
V oragine , Jacob a ; LEGENDA AUREA Dresdae ÄLipsiae 1846

Structure Recognition
“Â.”- ikonogr.
religiös
V oragine , Jacob a ; LEGENDA AUREA Dresdae ÄLipsiae 1846

Natural Language Processing
Voragine, Jacob
LEGENDA AUREA
1846
→ Matching with a reference database, e.g. WorldCat
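
The step from raw OCR text to the three fields on this slide can be sketched as follows. This is a minimal illustration, not the method used in the project: the parsing heuristics, the function names `parse_card` and `best_match`, and the two-entry toy catalogue are all assumptions, and the lookup uses plain string similarity instead of a real WorldCat query.

```python
import re
from difflib import SequenceMatcher

# Raw OCR line from the index card example.
RAW = "V oragine , Jacob a ; LEGENDA AUREA Dresdae ÄLipsiae 1846"

def parse_card(line):
    """Heuristically extract author, title and year from one OCR line."""
    author_part, _, rest = line.partition(";")
    surname, _, forename = author_part.partition(",")
    surname = surname.replace(" ", "")           # "V oragine" -> "Voragine"
    tokens = forename.split()
    forename = tokens[0] if tokens else ""       # keep the first given name only
    # Title: the leading run of fully upper-case words after the semicolon.
    title_words = []
    for word in rest.split():
        if not word.isupper():
            break
        title_words.append(word)
    year = re.search(r"\b(1[5-9]\d\d)\b", rest)  # a year between 1500 and 1999
    return {
        "author": f"{surname}, {forename}",
        "title": " ".join(title_words),
        "year": year.group(1) if year else None,
    }

def best_match(record, catalogue):
    """Return the catalogue entry most similar to the parsed record."""
    query = f"{record['author']} {record['title']} {record['year']}".lower()
    return max(
        catalogue,
        key=lambda entry: SequenceMatcher(None, query, entry.lower()).ratio(),
    )

# Hypothetical stand-in for a reference database.
CATALOGUE = [
    "Voragine, Jacobus de: Legenda aurea. Dresden, Leipzig 1846",
    "Grimm, Jacob: Deutsche Grammatik. Göttingen 1819",
]

record = parse_card(RAW)
print(record)   # {'author': 'Voragine, Jacob', 'title': 'LEGENDA AUREA', 'year': '1846'}
print(best_match(record, CATALOGUE))
```

Fuzzy matching is used here because the OCR output rarely agrees character for character with the catalogue entry.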

Matching with Reference (knowledge) data

The actual book

Content Representation
Instead of a scanned index card, we can access, link to and work with a full-featured catalogue entry and the digitised work itself.
Instead of digitised newspapers, we want to access, link to and work with the content, information and knowledge contained in these newspapers!
OCR is one important step towards this overall objective!

OCR – Some Facts
•Optical Character Recognition
•“Old” technology: “pattern recognition”
•Greatest progress in the late 1990s
•Market situation
•Two large companies: ABBYY, Nuance
•Cheap technology
•Open Source tools: Tesseract, Ocropus, Gamera,…
•Google: worked with ABBYY, has used Tesseract since 2012
•ABBYY
•Took part in two EU projects
•Gothic letter and long “s” supported out of the box; “Old Italian” available as a language
•Direct export of Analysed Layout and Text Object (ALTO)

Output
•Processing
•University of Innsbruck: 32 ABBYY licences on 4 servers
•Throughput per day: 10,000 large newspaper pages, 40,000 medium-size pages or 150,000 book-size pages
•PDF
•Text above the image vs. text behind the image
•PDF/A Standard
•Tagged PDF
•XML - ALTO
•Keeps all the information: Blocks, type of blocks, languages, lines, words, characters, confidence of words, etc.
•ALTO: de facto standard, maintained at the Library of Congress
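
As an illustration of what ALTO keeps, the sketch below reads words and their confidence values from a small ALTO fragment. The fragment is hand-written for this example (its words and `WC` values are made up), the ALTO v2 namespace is assumed, and real engine output carries many more attributes, such as coordinates and styles.

```python
import xml.etree.ElementTree as ET

# Hand-written ALTO v2 fragment; the WC (word confidence) values are invented.
ALTO_SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page ID="P1"><PrintSpace>
    <TextBlock ID="B1">
      <TextLine>
        <String CONTENT="LEGENDA" WC="0.97"/><SP/>
        <String CONTENT="AUREA" WC="0.58"/>
      </TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""

NS = {"a": "http://www.loc.gov/standards/alto/ns-v2#"}

def words_with_confidence(alto_xml):
    """Return (word, confidence) pairs in document order."""
    root = ET.fromstring(alto_xml)
    return [
        (s.get("CONTENT"), float(s.get("WC", "nan")))
        for s in root.iterfind(".//a:String", NS)
    ]

def low_confidence(words, threshold=0.8):
    """Words whose confidence falls below the (arbitrary) threshold."""
    return [word for word, wc in words if wc < threshold]

words = words_with_confidence(ALTO_SAMPLE)
print(words)                  # [('LEGENDA', 0.97), ('AUREA', 0.58)]
print(low_confidence(words))  # ['AUREA']
```

A list of low-confidence words like this is one possible starting point for OCR post-processing.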

Accuracy rates
•What do we expect?
•Researchers: critical edition of Shakespeare's works: no error accepted
•eBooks: fewer than 1 error per 1,000 characters (about half a page)
•Users offered full-text search as an additional feature?
•Academic staff working (copy & paste) with a text?
•Natural language processing?
•Knowledge extraction?
•Word Error Rate (WER) vs. Character Error Rate (CER)
•WER more meaningful to users
•WER easier to measure
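
The two error rates can be computed from an edit distance between a ground-truth text and the OCR output. The following is a minimal sketch using the standard Levenshtein distance; the sample strings are invented, and real evaluations usually also normalise whitespace and punctuation first.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]

def cer(reference, ocr):
    """Character Error Rate: character edits per reference character."""
    return edit_distance(reference, ocr) / len(reference)

def wer(reference, ocr):
    """Word Error Rate: word edits per reference word."""
    ref_words = reference.split()
    return edit_distance(ref_words, ocr.split()) / len(ref_words)

truth = "the quick brown fox"
ocr = "the qu1ck brown f0x"
print(round(cer(truth, ocr), 3))  # 0.105 (2 wrong characters out of 19)
print(wer(truth, ocr))            # 0.5   (2 wrong words out of 4)
```

The example shows why the two figures differ so much: two wrong characters spoil half of the words. The eBook expectation of fewer than 1 error per 1,000 characters corresponds to a CER below 0.001.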

IMPACT EVA/MINERVA 12th Nov. 2008

Outlook OCR
•ABBYY
•For small and medium amounts, up to some tens of millions of pages
•Tesseract
•Growing community
•Can be parallelised on high-performance computing systems (e.g. several hundred or several thousand nodes)
•More experiments can be done for very large volumes, e.g. hundreds of millions of pages
•Handwritten Text Recognition
•Next generation of engines for handwritten material
•Speech and face recognition as technological background
•Transcription and Recognition Platform
•Virtual Research Environment
•Will be released by University of Innsbruck in 2015
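
Parallelising OCR over many nodes works because each page is an independent unit of work. The sketch below shows one simple way to split a page list across nodes and to estimate the run time. The function names are illustrative, and the per-node rate of 2,500 pages per day is a hypothetical figure, not a measured Tesseract throughput.

```python
from math import ceil

def shard(pages, n_nodes):
    """Round-robin split so every node gets a near-equal share of pages."""
    return [pages[i::n_nodes] for i in range(n_nodes)]

def days_needed(total_pages, pages_per_node_per_day, n_nodes):
    """Rough run time in days, assuming perfect scaling across nodes."""
    return ceil(total_pages / (pages_per_node_per_day * n_nodes))

batches = shard(list(range(10)), 3)
print(batches)  # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]

# Hypothetical: 100 million pages, 1,000 nodes, 2,500 pages per node per day.
print(days_needed(100_000_000, 2_500, 1_000))  # 40
```

Perfect scaling is an optimistic assumption; in practice, storage and data transfer tend to limit throughput at this scale.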

Structural Metadata
•Layout analysis
•Noise reduction (redundant text)
•A newspaper contains much more than edited articles
Content units
•One possible separation: edited articles, advertisements, entertainment
•Document Understanding
•Newspaper consists of repeated sections (“templates”)
•Unique vs. common content
E.g. local news, local advertisements, etc. vs. “world news”
•Common content may be found elsewhere in more detail
E.g. book announcement
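
One simple way to find repeated "template" text is to count how many issues each line occurs in. The sketch below is an assumption-laden illustration: the three toy issues are invented, lines are compared after lower-casing only, and real OCR output would need fuzzy comparison because the same masthead is rarely recognised identically twice.

```python
from collections import Counter

def repeated_lines(issues, min_share=0.8):
    """Return lines that occur in at least min_share of all issues."""
    counts = Counter()
    for issue in issues:
        # Count each distinct line once per issue.
        counts.update({line.strip().lower() for line in issue if line.strip()})
    needed = min_share * len(issues)
    return {line for line, n in counts.items() if n >= needed}

# Toy data: each issue is a list of text lines.
issues = [
    ["Innsbrucker Nachrichten", "Fire in the old town"],
    ["Innsbrucker Nachrichten", "New railway timetable"],
    ["Innsbrucker Nachrichten", "Apartment to rent"],
]
print(repeated_lines(issues))  # {'innsbrucker nachrichten'}
```

Lines flagged this way are candidates for common or redundant content; everything else is more likely to be unique to the issue.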

Austrian Newspapers Online – ANNO - 1916

…more than edited articles

Edited articles vs. advertisements vs. entertainment
Innsbrucker Nachrichten, 4 June 1870

Innsbrucker Nachrichten 1870

Content units
•Types
•List of recently deceased persons
•Announcements by local associations
•Apartments to rent
•Obituaries
•Serialised novels
•…

Technical approaches
•Layout analysis
•Specific tools
•XML Output of OCR engine (cheap, easy to handle)
•Approaches
•Rule based approaches (experts needed)
•Machine learning approaches (large amounts of training samples needed)
•Functional Extension Parser (IMPACT project)
•Rule based approach for historical books (pre 1900)
•More than 80% accuracy is hard to reach for non-trivial features
•E.g. separating edited text, advertisements and entertainment; running titles; section headings
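
A rule-based approach on top of the OCR engine's XML output can be sketched as a few hand-written rules over block position and font size. This is not the Functional Extension Parser: the `Block` fields, the thresholds and the sample blocks are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class Block:
    text: str
    vpos: float       # top edge as a fraction of page height (0 = top of page)
    font_size: float  # in points

def classify(block, body_size=9.0):
    """Assign a structural label using simple, hand-tuned rules."""
    n_words = len(block.text.split())
    if block.vpos < 0.05 and n_words <= 8:
        return "running_title"  # short text at the very top of the page
    if block.font_size >= 1.3 * body_size and n_words <= 12:
        return "heading"        # short text set clearly larger than body text
    return "body"

def accuracy(blocks, labels):
    """Share of blocks whose predicted label equals the expected label."""
    hits = sum(classify(b) == label for b, label in zip(blocks, labels))
    return hits / len(labels)

# Invented sample page with manually assigned labels.
blocks = [
    Block("Innsbrucker Nachrichten", 0.02, 10.0),
    Block("Local News", 0.30, 14.0),
    Block("The town council met on Monday to discuss the new bridge", 0.35, 9.0),
    Block("TO LET", 0.60, 9.0),  # heading printed in body-size capitals
]
labels = ["running_title", "heading", "body", "heading"]
print(accuracy(blocks, labels))  # 0.75
```

The last block shows the typical weakness: a heading set in body-size type slips past the rules, and every such exception needs another rule written by an expert.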

Summary
•In many countries and regions, newspaper digitisation is still at an early stage
•OCR, though error-prone, is a must and is cheap compared to scanning
•Post-processing of OCR is promising
•Structural metadata are a must as well; new approaches are needed (beyond article separation)
•Natural Language Processing and more advanced operations will benefit
•Final goal of “document understanding” by machines

Thank you for your attention!
Günter Mühlberger <guenter.muehlberger@uibk.ac.at>
