Europeana Newspapers in a nutshell

Copyright: Olmsted County Historical Society
Europeana Newspapers
…in a nutshell
Newspapers in Europe and the Digital
Agenda for Europe - Final Workshop
29 September 2014, London, British Library
Clemens Neudecker, State Library Berlin
@cneudecker

Facts & Figures
• Europeana Newspapers – EU ICT-PSP Best Practice Network
• Started in February 2012 and will run until January 2015
• 18 partners, 11 associated partners, 22 networking partners
(28 countries involved)
• Total budget: €5.16M – EC contribution: €4.12M
• Project coordination: State Library Berlin / Preußischer Kulturbesitz
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the
Competitiveness and Innovation Framework Programme by the European Community
http://ec.europa.eu/ict_psp

Europeana Newspapers is all over Europe…and beyond
Red = Project Partners
Blue = Associated Partners
Green = Networking Partners

Refinement - we're scaling it up!
• 8 million pages refined with Optical Character Recognition (OCR)
• 2 million pages refined with Optical Layout Recognition (OLR)
• Technical resources for Named Entity Recognition (NER) in
three languages (Dutch, German, French)
• Metadata for >18 million pages ingested to Europeana
 In comparison: currently provides access to
8,056,532 pages

Quality & Performance
[Charts not reproduced; panel titles, axis names and values are listed in the order they were extracted from the slide.]
• Bag of Words OCR Evaluation, per language (success rate by language setting; language labels not recoverable): 82.4%, 85.3%, 80.9%, 75.9%, 67.5%, 83.4%, 84.1%, 68.1%, 93.1%, 57.6%, 87.0%, 68.3%, 76.1%, 82.6%, 54.1%, 32.7%
• Index based rate vs. count based rate (success rate; index based, count based): 71.9%, 74.3%
• Layout Analysis Performance, per evaluation profile (success rate, harmonic, area based; keyword search, phrase search, access via content structure, print/ebook on demand, content based image retrieval): 79.1%, 62.2%, 55.9%, 58.8%, 94.7%
• Per font (success rate; Gothic, Normal, Mixed): 67.3%, 81.4%, 64.0%
• FineReader vs. Tesseract (success rate, count based, by OCR engine; FineReader, Tesseract): 75.3%, 53.78%
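As a rough illustration of how an index based and a count based bag-of-words rate can differ, here is a minimal sketch. The definitions used are assumptions, not the project's evaluation tool: "index based" is taken to mean the share of distinct ground-truth words found at least once in the OCR output, and "count based" the share of ground-truth word occurrences matched. The function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def bag_of_words_rates(ground_truth, ocr_text):
    """Return (index_based, count_based) success rates.

    Assumed definitions, for illustration only:
    - index based: distinct ground-truth words that occur in the OCR text
    - count based: ground-truth word occurrences matched, with multiplicity
    """
    gt = Counter(tokenize(ground_truth))
    ocr = Counter(tokenize(ocr_text))
    if not gt:
        raise ValueError("ground truth contains no words")

    # A word counts once, no matter how often it appears.
    index_based = sum(1 for word in gt if word in ocr) / len(gt)

    # Each occurrence counts; a word repeated in the ground truth must be
    # recognised just as often to be fully credited.
    matched = sum(min(count, ocr[word]) for word, count in gt.items())
    count_based = matched / sum(gt.values())
    return index_based, count_based


if __name__ == "__main__":
    idx, cnt = bag_of_words_rates("the cat and the dog", "the cat and tbe dog")
    print(f"index based: {idx:.1%}, count based: {cnt:.1%}")
```

In this toy case one of the two occurrences of "the" is misread, so every distinct word is still found while only four of five occurrences are matched. Word order is ignored in both rates, which is why a bag-of-words measure is insensitive to reading-order errors.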

Access via TEL & Europeana
• Full text search in TEL Historic Newspapers Browser:
http://www.theeuropeanlibrary.org/tel4/newspapers
(recently updated following usability testing)
• Metadata search in Europeana:
http://www.europeana.eu/portal
(now with embedded object presentation via TEL)

Full-text search

Browse by date

Explore on a map

Title list

Embedded TEL Viewer in Europeana!

Metadata Best Practices
• Europeana Newspapers METS/ALTO Profile (ENMAP)
• Contributions to ALTO standard v2.x, v3.0
• Structural metadata with tool support - Structify
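To give a feel for what ALTO full text looks like to a reuser, the following sketch reads the words of each text line from a small ALTO fragment. The sample document is invented for illustration and is not taken from the project; it only uses the standard ALTO elements `TextLine` and `String` and the `CONTENT` attribute, and it does not cover the ENMAP profile itself. The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented ALTO v2 fragment, for illustration only.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Berliner" WC="0.95"/>
            <SP/>
            <String CONTENT="Tageblatt" WC="0.80"/>
          </TextLine>
          <TextLine>
            <String CONTENT="Morgenausgabe" WC="0.60"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix so any ALTO version can be read."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return the recognised text of an ALTO document, one string per TextLine."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) == "TextLine":
            # Each String child carries one recognised word in CONTENT.
            words = [
                child.get("CONTENT", "")
                for child in elem
                if local_name(child.tag) == "String"
            ]
            lines.append(" ".join(words))
    return lines


if __name__ == "__main__":
    for line in alto_lines(SAMPLE_ALTO):
        print(line)
```

Matching on the local element name rather than a fixed namespace is a deliberate choice here, since the namespace URI differs between ALTO v2.x and v3.0.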

Media, News, Events

Lots of opportunities for research & reuse
• Metadata for >18M pages licensed CC0
• Images & full-text for 10M pages licensed public domain
• See also:
http://www.europeana-newspapers.eu/category/interviews-with-researchers/
yet another way to reuse newspapers…

Thank you for your attention!
@eurnews
http://www.europeana-newspapers.eu
http://www.theeuropeanlibrary.org/tel4/newspapers
http://www.europeana.eu/
