ENP Belgrade WS refinement introduction

Europeana Newspapers -
Refinement Workshop
WP2 – Introduction to Refinement
Belgrade, 13 June 2013
Clemens Neudecker (@cneudecker)
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the
Competitiveness and Innovation Framework Programme by the European Community
http://ec.europa.eu/ict_psp
Overview
• Objectives & Challenges
• Overview of Refinement Dataset
• Introduction to Refinement: Workflow & Technologies
• Questions & Answers
Objectives
- Analysis of available digital newspaper collections of project partners
and identification of subsets suitable for refinement
- Definition of requirements and minimum quality of digitised newspapers
for refinement to enable advanced services in Europeana
- Coordination of the scalable processing of 10 million digitised newspaper
pages with several refinement technologies
- Providing recommendations on best practices for the refinement of
digitised newspaper collections with full-text (and ingest to Europeana)
Challenges
• Processing quality vs. speed/throughput
• Volume of data requires focus on simple &
standardised workflow with clear checkpoints
• Diverse partners supplying content with different
digitisation & access policies
• Large variety of content in terms of file formats,
fonts, languages, etc.
The data
Europeana Newspapers Dataset (1)
Europeana Newspapers Dataset (2)
Europeana Newspapers Dataset (3)
Europeana Newspapers Dataset (4)
Refinement Workflow steps
Master List
Tools (BCT)
• BCT = Binarisation and Colour Reduction Tool
• Purpose: Convert grey/colour
scans to bitonal using highly
optimised GPP method
• Background: Reduce total file
size of master images to
guarantee feasibility and
timing of data transfers
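The slides do not describe the GPP method itself, so the sketch below swaps in a different, well-known technique (Otsu's global threshold) purely to illustrate what "grey to bitonal" means. It is not the BCT's algorithm; the function names and the tiny sample image are hypothetical.

```python
# Illustration only: Otsu's global threshold as a stand-in for the
# (undescribed) GPP method used by the BCT.

def otsu_threshold(pixels):
    """Return the grey level (0-255) that best separates dark from light."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        w_bg += hist[t]            # pixels with value <= t (dark class)
        sum_bg += t * hist[t]
        w_fg = total - w_bg        # pixels with value > t (light class)
        if w_bg == 0:
            continue
        if w_fg == 0:
            break
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance; the best threshold maximises it.
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarise(rows):
    """Map a grey image (list of rows of 0-255 ints) to 1 = ink, 0 = paper."""
    flat = [p for row in rows for p in row]
    t = otsu_threshold(flat)
    return [[1 if p <= t else 0 for p in row] for row in rows]


# Hypothetical 2x3 grey image: dark text pixels on light paper.
SAMPLE_IMAGE = [[20, 30, 220],
                [25, 210, 230]]
BITONAL = binarise(SAMPLE_IMAGE)
```

A bitonal page needs one bit per pixel instead of 8 or 24, which is the file-size reduction the slide refers to.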
Tools (FRT)
• FRT = File Rename Tool
• Purpose: Support content
holders in preparing their
data in the correct format
• Background: Ensure folder
structure and file naming
requirements for automated
processing are met
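As a rough sketch of the kind of work a renaming tool does, the snippet below plans new names for a folder of page scans. The target convention (`<prefix>_<4-digit page>.<lowercase extension>`) is an assumption made for illustration; the slides do not state the actual ENP naming requirements, and `plan_renames` is a hypothetical helper, not part of the FRT.

```python
import os
import re


def natural_key(name):
    """Sort key that orders 'scan2' before 'scan10'."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def plan_renames(filenames, prefix):
    """Return {old_name: new_name} without touching the file system.

    Assumed (hypothetical) target pattern: <prefix>_<NNNN>.<ext>,
    with pages numbered from 1 in natural sort order.
    """
    plan = {}
    for page, old in enumerate(sorted(filenames, key=natural_key), start=1):
        _, ext = os.path.splitext(old)
        plan[old] = f"{prefix}_{page:04d}{ext.lower()}"
    return plan


RENAME_PLAN = plan_renames(["scan10.TIF", "scan2.TIF", "scan1.TIF"], "issue")
```

Returning a plan rather than renaming directly lets a content holder review the mapping before any file is moved.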
Tools (FAT)
• FAT = File Analyzer Tool
• Purpose: Final quality check
of data before refinement
• Background: Assure content
and refinement partners that
all preparation steps have
been executed successfully
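A final pre-refinement check could look roughly like the sketch below: it validates file names against a pattern and reports gaps in the page sequence. The pattern and the two checks are assumptions chosen for illustration; the slides do not list what the FAT actually verifies.

```python
import re

# Hypothetical naming rule: <prefix>_<4-digit page>.tif
NAME_PATTERN = re.compile(r"^(?P<prefix>[A-Za-z0-9]+)_(?P<page>\d{4})\.tif$")


def check_batch(filenames):
    """Return a list of problems; an empty list means the batch passes."""
    problems = []
    pages = []
    for name in sorted(filenames):
        match = NAME_PATTERN.match(name)
        if match is None:
            problems.append(f"bad file name: {name}")
        else:
            pages.append(int(match.group("page")))

    # Every page number from 1 up to the highest one seen should be present.
    missing = sorted(set(range(1, max(pages, default=0) + 1)) - set(pages))
    if missing:
        problems.append(f"missing page numbers: {missing}")
    return problems


GOOD_REPORT = check_batch(["issue_0001.tif", "issue_0002.tif"])
BAD_REPORT = check_batch(["issue_0001.tif", "issue_0003.tif", "Scan 4.TIF"])
```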
Refinement: OCR@UIBK
• OCR = Optical Character Recognition
• Number of pages to be refined: 8 million
• Technologies: ABBYY FineReader SDK
• State-of-the-art OCR software, fully supports Fraktur/Latin/Cyrillic fonts
• Result: METS/ALTO package containing images, metadata & full text
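To give a feel for the full text inside such a package, here is a minimal sketch that pulls the text lines out of an ALTO file with Python's standard library. In ALTO, each recognised word is a `String` element whose `CONTENT` attribute holds the text, grouped into `TextLine` elements. The sample document is a stripped-down, hypothetical fragment (real files also carry coordinates, styles and more), and the code ignores namespaces so that it does not depend on a particular ALTO version.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily reduced ALTO fragment for illustration.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine><String CONTENT="Belgrade"/><SP/><String CONTENT="1913"/></TextLine>
    <TextLine><String CONTENT="Novine"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(alto_xml):
    """Return the recognised text, one string per TextLine."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            words = [child.get("CONTENT", "")
                     for child in element
                     if local_name(child.tag) == "String"]
            lines.append(" ".join(words))
    return lines


LINES = alto_lines(SAMPLE_ALTO)
```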
Refinement: OLR@CCS
• OLR = Optical Layout Recognition
• Number of pages to be refined: 2 million
• Technologies: docWorks
• Separation of columns, articles, headlines, page classes
• Result: METS/ALTO package containing images, metadata & full text
Refinement: NER@KB
• NER = Named Entity Recognition
• Number of pages to be refined: 2 million
• Technologies: Stanford CRF-NER
• Languages supported: German, Dutch, English (+ French, Latvian)
• Open source: https://github.com/KBNLresearch/europeananp-ner
• Detection of named entities: Person, Location, Organization
• Feedback cycle with manual training step → better results
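A CRF tagger labels text token by token, so a common post-processing step is to merge neighbouring tokens with the same label into one entity. The sketch below shows that step only; it does not call Stanford NER or the europeananp-ner code. The input format (a list of `(token, label)` pairs with `"O"` for non-entities) and the label spellings are assumptions for illustration.

```python
def group_entities(tagged_tokens):
    """Merge consecutive tokens that share a non-'O' label into entities."""
    entities = []
    current_words, current_label = [], "O"
    for token, label in tagged_tokens:
        if label != current_label:
            # Label changed: close the entity in progress, if any.
            if current_label != "O":
                entities.append((" ".join(current_words), current_label))
            current_words, current_label = [], label
        if label != "O":
            current_words.append(token)
    # Close an entity that runs to the end of the input.
    if current_label != "O":
        entities.append((" ".join(current_words), current_label))
    return entities


# Hypothetical tagger output for one sentence.
TAGGED = [("Clemens", "PERSON"), ("Neudecker", "PERSON"), ("spoke", "O"),
          ("in", "O"), ("Belgrade", "LOCATION"), (".", "O")]
ENTITIES = group_entities(TAGGED)
```

Note that this simple merge cannot separate two different entities of the same type that directly follow each other; a tagging scheme with explicit begin markers would be needed for that.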
Thank you for your attention -
Now come ask me almost
anything!
clemens.neudecker@kb.nl