Europeana Newspapers Workshop: 
Refinement 
WP2 – Introduction to Refinement 
Munich, 26 June 2013 
Clemens Neudecker (@cneudecker)
Overview 
• Objectives & Challenges 
• Overview of Refinement Dataset 
• Introduction to Refinement: Workflow & Technologies 
• Questions & Answers 
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the 
Competitiveness and Innovation Framework Programme by the European Community 
http://ec.europa.eu/ict_psp 
Objectives 
- Analysis of available digital newspaper collections of project partners 
and identification of subsets suitable for refinement 
- Definition of requirements and minimum quality of digitised newspapers 
for refinement to enable advanced services in Europeana 
- Coordination of the scalable processing of 10 million digitised newspaper 
pages with several refinement technologies 
- Providing recommendations on best practices for the refinement of 
digitised newspaper collections with full-text (and ingest to Europeana) 
Challenges 
• Processing quality vs. speed/throughput 
• Volume of data requires focus on simple & 
standardised workflow with clear checkpoints 
• Diverse partners supplying content with different 
digitisation & access policies 
• Large variety of content in terms of file formats, 
fonts, languages, etc. 
The data 
Europeana Newspapers Dataset (1) 
Europeana Newspapers Dataset (2) 
Europeana Newspapers Dataset (3) 
Europeana Newspapers Dataset (4) 
Refinement Workflow steps 
Tools (BCT) 
• BCT = Binarisation and Colour Reduction Tool 
• Purpose: Convert grey/colour 
scans to bitonal using highly 
optimised GPP method 
• Background: Reduce total file 
size of master images to 
guarantee feasibility and 
timing of data transfers 
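To illustrate what converting a grey scan to bitonal involves, here is a minimal sketch in Python. It does not implement the GPP method named above; it swaps in Otsu's global threshold, a much simpler standard technique, purely as an illustration. The function names `otsu_threshold` and `binarise` are hypothetical and not part of the BCT.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that maximises between-class variance (Otsu)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # pixels at or below the candidate threshold
    sum_bg = 0  # sum of their grey levels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t


def binarise(pixels, threshold):
    """Map each grey pixel to 0 (dark, ink) or 1 (light, paper)."""
    return [0 if p <= threshold else 1 for p in pixels]


# Toy "scan": three dark ink pixels and three light paper pixels.
scan = [10, 12, 11, 200, 210, 205]
t = otsu_threshold(scan)
bitonal = binarise(scan, t)
```

A bitonal image needs one bit per pixel instead of 8 (grey) or 24 (colour), which is where the reduction in file size comes from.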
Tools (FRT) 
• FRT = File Rename Tool 
• Purpose: Support content 
holders in preparing their 
data in the correct format 
• Background: Ensure folder 
structure and file naming 
requirements for automated 
processing are met 
Tools (FAT) 
• FAT = File Analyzer Tool 
• Purpose: Final quality check 
of data before refinement 
• Background: Assure content 
and refinement partners that 
all preparation steps have 
been executed successfully 
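The kind of check that the FRT and FAT automate can be sketched in Python as follows. The slides do not state the project's actual folder structure or naming rules, so the convention below (title folder / issue-date folder / four-digit page file) is a hypothetical stand-in, and `check_paths` is an illustrative name rather than a function of either tool.

```python
import re
from pathlib import PurePosixPath

# Hypothetical convention, NOT the project's actual requirements:
#   <title-id>/<YYYY-MM-DD>/<4-digit page number>.tif
ISSUE_DIR = re.compile(r"^\d{4}-\d{2}-\d{2}$")
PAGE_FILE = re.compile(r"^\d{4}\.tif$")


def check_paths(paths):
    """Return (path, problem) tuples; an empty list means every path passed."""
    problems = []
    for raw in paths:
        parts = PurePosixPath(raw).parts
        if len(parts) != 3:
            problems.append((raw, "expected title/issue-date/page file"))
            continue
        _title, issue, page = parts
        if not ISSUE_DIR.match(issue):
            problems.append((raw, "issue folder is not YYYY-MM-DD"))
        if not PAGE_FILE.match(page):
            problems.append((raw, "page file is not NNNN.tif"))
    return problems


report = check_paths([
    "times/1913-06-26/0001.tif",   # passes
    "times/1913_06_26/0002.tif",   # wrong date separator
])
```

Running such a check before transfer gives a clear checkpoint: data only moves on to refinement when the report is empty.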
Refinement: OCR@UIBK 
• OCR = Optical Character Recognition 
• Number of pages to be refined: 8 million 
• Technologies: ABBYY FineReader SDK 
• State-of-the-art OCR software, fully supports Fraktur/Latin/Cyrillic fonts 
• Result: METS/ALTO package containing images, metadata & full text 
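As a rough illustration of how full text can be read back out of an ALTO file, the Python sketch below collects the `CONTENT` attribute of each `String` element, line by line. The sample document is a hand-written fragment, not project output, and the helper names are hypothetical. Hyphenation and METS handling are left out.

```python
import xml.etree.ElementTree as ET

# Hand-written ALTO-like fragment for illustration only.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine><String CONTENT="Europeana"/><SP/><String CONTENT="Newspapers"/></TextLine>
    <TextLine><String CONTENT="Munich"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local(tag):
    """Strip the XML namespace so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return the text of each TextLine as a single string."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local(element.tag) == "TextLine":
            words = [child.get("CONTENT", "")
                     for child in element
                     if local(child.tag) == "String"]
            lines.append(" ".join(words))
    return lines


full_text = alto_lines(SAMPLE)
```

ALTO also stores the position of every word on the page, which is what allows search hits to be highlighted on the image.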
OCR → Full text search 
http://www.europeana-newspapers.eu/building-a-content-browser-for-digital-newspapers/ 
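Why OCR makes search possible can be shown with a toy inverted index in Python. This is a generic sketch, not the implementation of the content browser linked above; page identifiers, texts and function names are invented for the example.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lower-cased word to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the page ids that contain every word of the query."""
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return set()
    hits = set(index.get(terms[0], set()))
    for term in terms[1:]:
        hits &= index.get(term, set())
    return hits


# Invented OCR output for two pages.
pages = {
    "p1": "Munich workshop on newspapers",
    "p2": "Newspapers from Vienna",
}
index = build_index(pages)
```

Here `search(index, "munich newspapers")` narrows the result to the one page that contains both words. OCR errors break exact matching of this kind, which is why recognition quality matters for retrieval.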
Refinement: OLR@CCS 
• OLR = Optical Layout Recognition 
• Number of pages to be refined: 2 million 
• Technologies: docWorks 
• Separation of columns, articles, headlines, page classes 
• Result: METS/ALTO package containing images, metadata & full text 
OLR → Article separation 
Refinement: NER@KB 
• NER = Named Entity Recognition 
• Number of pages to be refined: 2 million 
• Technologies: Stanford CRF-NER 
• Languages supported: German, Dutch, English (+ French, Latvian) 
• Open source: https://github.com/KBNLresearch/europeananp-ner 
• Detection of Named entities: Person, Location, Organization 
• Feedback cycle with manual training step → better results 
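A tagger of this kind labels each token; a post-processing step then merges neighbouring tokens into entities. The Python sketch below assumes slash-tagged output of the form `token/LABEL`, with `O` for tokens outside any entity. The sample sentence and the function name `group_entities` are made up for illustration and are not taken from the europeananp-ner code.

```python
def group_entities(tagged):
    """Merge runs of consecutive tokens with the same non-O label into entities."""
    entities = []
    current_label, current_tokens = "O", []
    for item in tagged.split():
        # Split on the last slash so tokens containing "/" stay intact.
        token, _, label = item.rpartition("/")
        if label == current_label and label != "O":
            current_tokens.append(token)
            continue
        if current_tokens:
            entities.append((current_label, " ".join(current_tokens)))
        current_label = label
        current_tokens = [token] if label != "O" else []
    if current_tokens:
        entities.append((current_label, " ".join(current_tokens)))
    return entities


# Invented example of tagger output.
tagged = "Clemens/PERSON Neudecker/PERSON spoke/O in/O Munich/LOCATION ./O"
entities = group_entities(tagged)
```

Note that two different entities of the same type standing directly next to each other would be merged by this simple rule.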
NER → Browse by names or places 
Thank you for your attention! 
clemens.neudecker@kb.nl

Refinement of Digitised Newspapers
