Large scale refinement of digital historical newspapers with named entities recognition

C
Large-scale refinement of digital historical 
newspapers with named entity recognition 
IFLA Newspaper Pre-Conference 
14 August 2014, Geneva 
Clemens Neudecker, SBB, @cneudecker
Overview 
• Background 
• NER Introduction 
• Approach 
• Challenges 
• Scalability 
• First results 
• Outlook 
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the 
Competitiveness and Innovation Framework Programme by the European Community 
http://ec.europa.eu/ict_psp
Background 
• Europeana Newspapers 
EU Best Practice Network 
• 10 million newspaper pages 
with full-text from 12 libraries 
• 36 million newspaper pages 
with metadata for Europeana 
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the 
Competitiveness and Innovation Framework Programme by the European Community 
http://ec.europa.eu/ict_psp
Named entity recognition (I) 
1. Detect names of 
persons, places, 
organisations 
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the 
Competitiveness and Innovation Framework Programme by the European Community 
http://ec.europa.eu/ict_psp
Named entity recognition (II) 
2. Disambiguate entities 
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the 
Competitiveness and Innovation Framework Programme by the European Community 
http://ec.europa.eu/ict_psp
Named entity recognition (III) 
3. Link to online resources 
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the 
Competitiveness and Innovation Framework Programme by the European Community 
http://ec.europa.eu/ict_psp
Approach (I) 
• Tackle content in 
Dutch, German, French 
(about 50% of the 10m pages) 
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the 
Competitiveness and Innovation Framework Programme by the European Community 
http://ec.europa.eu/ict_psp
Approach (II) 
• Use a machine learning tool (open source) 
developed by Stanford University, adapted 
for Europeana Newspapers by KBNL 
https://github.com/KBNLresearch/europeananp-ner 
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the 
Competitiveness and Innovation Framework Programme by the European Community 
http://ec.europa.eu/ict_psp
Approach (III) 
• Create (and release) training 
material by manually annotating 
named entities on OCR‘d 
newspaper pages 
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the 
Competitiveness and Innovation Framework Programme by the European Community 
http://ec.europa.eu/ict_psp
Challenges 
• OCR quality 
• Multiple (mixed) 
languages 
• Historical spelling 
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the 
Competitiveness and Innovation Framework Programme by the European Community 
http://ec.europa.eu/ict_psp
Scalability 
• Stanford NER software is multi-threaded 
e.g. 4 CPU cores – 4x throughput 
• Optimise the NER classifier by filtering 
noise and sentences without NE‘s marked 
• Robust proven Java technology 
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the 
Competitiveness and Innovation Framework Programme by the European Community 
http://ec.europa.eu/ict_psp
First results (Dutch) 
Persons Locations Organizations 
Precision 0.940 0.950 0.942 
Recall 0.588 0.760 0.559 
F-measure 0.689 0.838 0.671 
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the 
Competitiveness and Innovation Framework Programme by the European Community 
http://ec.europa.eu/ict_psp
First results (French) 
Persons Locations 
Precision 0.529 0.548 
Recall 0.834 0.216 
F-measure 0.622 0.310 
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the 
Competitiveness and Innovation Framework Programme by the European Community 
http://ec.europa.eu/ict_psp 
* Score for 
organisations 
omitted since 
not enough 
present in the 
source material
Outlook 
• Q3: Release of training data for Named Entity Recognition 
in Dutch, German, French 
• Q3: First results for German (Austrian, Italian/South Tirol), 
final results for Dutch, French 
• Q4: Release of software (open source) for disambiguating 
and linking of NER results to DBPedia 
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the 
Competitiveness and Innovation Framework Programme by the European Community 
http://ec.europa.eu/ict_psp
www.europeana-newspapers.eu/ 
www.theeuropeanlibrary.org/tel4/newspapers 
https://github.com/KBNLresearch/europeananp-ner 
Thank you for your attention! 
IFLA Newspaper Pre-Conference 
14 August 2014, Geneva 
Clemens Neudecker, SBB, @cneudecker
1 of 15

Recommended

Presentation of Hans-Jörg Lieder, BnF Information Day by
Presentation of Hans-Jörg Lieder, BnF Information DayPresentation of Hans-Jörg Lieder, BnF Information Day
Presentation of Hans-Jörg Lieder, BnF Information DayEuropeana Newspapers
1.2K views15 slides
Presentation of Clemens Neudecker, BnF Information Day by
Presentation of Clemens Neudecker, BnF Information DayPresentation of Clemens Neudecker, BnF Information Day
Presentation of Clemens Neudecker, BnF Information DayEuropeana Newspapers
1.4K views15 slides
ENP Belgrade WS Introduction by
ENP Belgrade WS IntroductionENP Belgrade WS Introduction
ENP Belgrade WS IntroductionEuropeana Newspapers
1.4K views12 slides
Europeana Newspapers - the Gateway to European Newspapers Online by
Europeana Newspapers - the Gateway to European Newspapers OnlineEuropeana Newspapers - the Gateway to European Newspapers Online
Europeana Newspapers - the Gateway to European Newspapers Onlinecneudecker
434 views17 slides
ENP Belgrade WS refinement introduction by
ENP Belgrade WS refinement introductionENP Belgrade WS refinement introduction
ENP Belgrade WS refinement introductionEuropeana Newspapers
1.6K views18 slides
ENP Belgrade Workshop Project Overview by
ENP Belgrade Workshop Project OverviewENP Belgrade Workshop Project Overview
ENP Belgrade Workshop Project OverviewEuropeana Newspapers
1.4K views14 slides

More Related Content

What's hot

Refinement by
RefinementRefinement
RefinementEuropeana Newspapers
1.5K views34 slides
Europeana Newspapers Project by
Europeana Newspapers ProjectEuropeana Newspapers Project
Europeana Newspapers ProjectEuropeana Newspapers
1.2K views10 slides
Metadata by
MetadataMetadata
MetadataEuropeana Newspapers
1.3K views32 slides
Realising the value of Europe's newspaper heritage by
Realising the value of Europe's newspaper heritage Realising the value of Europe's newspaper heritage
Realising the value of Europe's newspaper heritage Europeana Newspapers
1.2K views14 slides
EurnewsLDN_Clemens_Neudecker by
EurnewsLDN_Clemens_NeudeckerEurnewsLDN_Clemens_Neudecker
EurnewsLDN_Clemens_NeudeckerEuropeana Newspapers
639 views15 slides
Europeana Newspapers Amsterdam workshop introduction by
Europeana Newspapers Amsterdam workshop introductionEuropeana Newspapers Amsterdam workshop introduction
Europeana Newspapers Amsterdam workshop introductionEuropeana Newspapers
1.3K views14 slides

What's hot(20)

Realising the value of Europe's newspaper heritage by Europeana Newspapers
Realising the value of Europe's newspaper heritage Realising the value of Europe's newspaper heritage
Realising the value of Europe's newspaper heritage
Europeana Newspapers Amsterdam workshop introduction by Europeana Newspapers
Europeana Newspapers Amsterdam workshop introductionEuropeana Newspapers Amsterdam workshop introduction
Europeana Newspapers Amsterdam workshop introduction
The challenges of making Europe's newspapers available online by LIBER Europe
The challenges of making Europe's newspapers available onlineThe challenges of making Europe's newspapers available online
The challenges of making Europe's newspapers available online
LIBER Europe2.2K views
The EPO and Tecnology Transfer: a brief overview 4T-Tech Transfer Think Tank by Viola Zazzera
The EPO and Tecnology Transfer: a brief overview 4T-Tech Transfer Think TankThe EPO and Tecnology Transfer: a brief overview 4T-Tech Transfer Think Tank
The EPO and Tecnology Transfer: a brief overview 4T-Tech Transfer Think Tank
Viola Zazzera78 views
SLOPE Final Conference - general presentation by SLOPE Project
SLOPE Final Conference - general presentationSLOPE Final Conference - general presentation
SLOPE Final Conference - general presentation
SLOPE Project615 views
SLOPE Final Conference - online purchase of timber and biomass by SLOPE Project
SLOPE Final Conference - online purchase of timber and biomassSLOPE Final Conference - online purchase of timber and biomass
SLOPE Final Conference - online purchase of timber and biomass
SLOPE Project655 views
SLOPE Final Conference - intelligent truck by SLOPE Project
SLOPE Final Conference - intelligent truckSLOPE Final Conference - intelligent truck
SLOPE Final Conference - intelligent truck
SLOPE Project616 views
04 europeana newspapers by Europeana
04 europeana newspapers04 europeana newspapers
04 europeana newspapers
Europeana409 views
SLOPE Final Conference - innovative cable yarder by SLOPE Project
SLOPE Final Conference - innovative cable yarderSLOPE Final Conference - innovative cable yarder
SLOPE Final Conference - innovative cable yarder
SLOPE Project593 views
SLOPE Final Conference - electronic marking of trees by SLOPE Project
SLOPE Final Conference - electronic marking of treesSLOPE Final Conference - electronic marking of trees
SLOPE Final Conference - electronic marking of trees
SLOPE Project578 views
SLOPE Final Conference - 3D harvesting planner by SLOPE Project
SLOPE Final Conference - 3D harvesting plannerSLOPE Final Conference - 3D harvesting planner
SLOPE Final Conference - 3D harvesting planner
SLOPE Project619 views
EMPOWER - EMPOWERING a reduction in use of conventionally fuelled vehicles us... by ERTRAC
EMPOWER - EMPOWERING a reduction in use of conventionally fuelled vehicles us...EMPOWER - EMPOWERING a reduction in use of conventionally fuelled vehicles us...
EMPOWER - EMPOWERING a reduction in use of conventionally fuelled vehicles us...
ERTRAC195 views

Similar to Large scale refinement of digital historical newspapers with named entities recognition

Europeana Newspapers LFT Infoday Muehlberger by
Europeana Newspapers LFT Infoday MuehlbergerEuropeana Newspapers LFT Infoday Muehlberger
Europeana Newspapers LFT Infoday MuehlbergerEuropeana Newspapers
729 views27 slides
Europeana Newspapers in a nutshell by
Europeana Newspapers in a nutshellEuropeana Newspapers in a nutshell
Europeana Newspapers in a nutshellcneudecker
373 views15 slides
Europeana Newspapers Estonian Infoday Krista Kiisa by
Europeana Newspapers Estonian Infoday Krista KiisaEuropeana Newspapers Estonian Infoday Krista Kiisa
Europeana Newspapers Estonian Infoday Krista KiisaEuropeana Newspapers
785 views38 slides
Overview of the Europeana Newspapers Project by
Overview of the Europeana Newspapers ProjectOverview of the Europeana Newspapers Project
Overview of the Europeana Newspapers ProjectEuropeana Newspapers
1.2K views36 slides
Europeana Newspapers LIBER2013 Workshop intro by
Europeana Newspapers LIBER2013 Workshop introEuropeana Newspapers LIBER2013 Workshop intro
Europeana Newspapers LIBER2013 Workshop introEuropeana Newspapers
1.5K views17 slides
FIWARE Global Summit - Role of Digital Innovation Hubs in the Digitization of... by
FIWARE Global Summit - Role of Digital Innovation Hubs in the Digitization of...FIWARE Global Summit - Role of Digital Innovation Hubs in the Digitization of...
FIWARE Global Summit - Role of Digital Innovation Hubs in the Digitization of...FIWARE
159 views18 slides

Similar to Large scale refinement of digital historical newspapers with named entities recognition(19)

Europeana Newspapers in a nutshell by cneudecker
Europeana Newspapers in a nutshellEuropeana Newspapers in a nutshell
Europeana Newspapers in a nutshell
cneudecker373 views
FIWARE Global Summit - Role of Digital Innovation Hubs in the Digitization of... by FIWARE
FIWARE Global Summit - Role of Digital Innovation Hubs in the Digitization of...FIWARE Global Summit - Role of Digital Innovation Hubs in the Digitization of...
FIWARE Global Summit - Role of Digital Innovation Hubs in the Digitization of...
FIWARE159 views
Max Lemke, Head of Unit, Components and Systems, European Commission by I4MS_eu
Max Lemke, Head of Unit, Components and Systems, European CommissionMax Lemke, Head of Unit, Components and Systems, European Commission
Max Lemke, Head of Unit, Components and Systems, European Commission
I4MS_eu1K views
Darko Fercej: Central European Living Lab for Territorial Innovation - Open d... by Apulian ICT Living Labs
Darko Fercej: Central European Living Lab for Territorial Innovation - Open d...Darko Fercej: Central European Living Lab for Territorial Innovation - Open d...
Darko Fercej: Central European Living Lab for Territorial Innovation - Open d...
Presentation of H2020 ICT-32-2017 Startup Europe for Growth & Innovation Rada... by Nathalie Danse
Presentation of H2020 ICT-32-2017 Startup Europe for Growth & Innovation Rada...Presentation of H2020 ICT-32-2017 Startup Europe for Growth & Innovation Rada...
Presentation of H2020 ICT-32-2017 Startup Europe for Growth & Innovation Rada...
Nathalie Danse2.4K views
DIGITISING EUROPEAN INDUSTRY: THE ROLE OF DIGITAL INNOVATION HUBS by I4MS_eu
DIGITISING EUROPEAN INDUSTRY: THE ROLE OF DIGITAL INNOVATION HUBSDIGITISING EUROPEAN INDUSTRY: THE ROLE OF DIGITAL INNOVATION HUBS
DIGITISING EUROPEAN INDUSTRY: THE ROLE OF DIGITAL INNOVATION HUBS
I4MS_eu1.6K views
UK FP7 National Contact Point ICT, Peter Walters, FP7UK National Contact Poin... by Invest Northern Ireland
UK FP7 National Contact Point ICT, Peter Walters, FP7UK National Contact Poin...UK FP7 National Contact Point ICT, Peter Walters, FP7UK National Contact Poin...
UK FP7 National Contact Point ICT, Peter Walters, FP7UK National Contact Poin...
Digitising European Industry - 12/10/2017 by Sandro D'Elia
Digitising European Industry - 12/10/2017Digitising European Industry - 12/10/2017
Digitising European Industry - 12/10/2017
Sandro D'Elia127 views
EOSC-DIH: Bringing industry into the EOSC by EOSC-hub project
EOSC-DIH: Bringing industry into the EOSCEOSC-DIH: Bringing industry into the EOSC
EOSC-DIH: Bringing industry into the EOSC
EOSC-hub project141 views
Max Lemke | Innovation actions in Horizon 2020 Fostering collaboration with M... by I4MS_eu
Max Lemke | Innovation actions in Horizon 2020 Fostering collaboration with M...Max Lemke | Innovation actions in Horizon 2020 Fostering collaboration with M...
Max Lemke | Innovation actions in Horizon 2020 Fostering collaboration with M...
I4MS_eu1.3K views

More from cneudecker

EuropeanaTech x AI: Qurator.ai @ Berlin State Library by
EuropeanaTech x AI: Qurator.ai @ Berlin State LibraryEuropeanaTech x AI: Qurator.ai @ Berlin State Library
EuropeanaTech x AI: Qurator.ai @ Berlin State Librarycneudecker
142 views13 slides
ALTO, PAGE & Co. Formate für Volltexte by
ALTO, PAGE & Co. Formate für VolltexteALTO, PAGE & Co. Formate für Volltexte
ALTO, PAGE & Co. Formate für Volltextecneudecker
82 views22 slides
OCR und Strukturerkennung für Zeitungen by
OCR und Strukturerkennung für ZeitungenOCR und Strukturerkennung für Zeitungen
OCR und Strukturerkennung für Zeitungencneudecker
99 views21 slides
Digitisation and Digital Humanities - what is the role of Libraries? by
Digitisation and Digital Humanities - what is the role of Libraries?Digitisation and Digital Humanities - what is the role of Libraries?
Digitisation and Digital Humanities - what is the role of Libraries?cneudecker
214 views26 slides
Multimodal Perspectives for Digitised Historical Newspapers by
Multimodal Perspectives for Digitised Historical NewspapersMultimodal Perspectives for Digitised Historical Newspapers
Multimodal Perspectives for Digitised Historical Newspaperscneudecker
344 views15 slides
OCR und Strukturerkennung: Herausforderungen und Ansätze für die Zeitungsdigi... by
OCR und Strukturerkennung: Herausforderungen und Ansätze für die Zeitungsdigi...OCR und Strukturerkennung: Herausforderungen und Ansätze für die Zeitungsdigi...
OCR und Strukturerkennung: Herausforderungen und Ansätze für die Zeitungsdigi...cneudecker
95 views18 slides

More from cneudecker(20)

EuropeanaTech x AI: Qurator.ai @ Berlin State Library by cneudecker
EuropeanaTech x AI: Qurator.ai @ Berlin State LibraryEuropeanaTech x AI: Qurator.ai @ Berlin State Library
EuropeanaTech x AI: Qurator.ai @ Berlin State Library
cneudecker142 views
ALTO, PAGE & Co. Formate für Volltexte by cneudecker
ALTO, PAGE & Co. Formate für VolltexteALTO, PAGE & Co. Formate für Volltexte
ALTO, PAGE & Co. Formate für Volltexte
cneudecker82 views
OCR und Strukturerkennung für Zeitungen by cneudecker
OCR und Strukturerkennung für ZeitungenOCR und Strukturerkennung für Zeitungen
OCR und Strukturerkennung für Zeitungen
cneudecker99 views
Digitisation and Digital Humanities - what is the role of Libraries? by cneudecker
Digitisation and Digital Humanities - what is the role of Libraries?Digitisation and Digital Humanities - what is the role of Libraries?
Digitisation and Digital Humanities - what is the role of Libraries?
cneudecker214 views
Multimodal Perspectives for Digitised Historical Newspapers by cneudecker
Multimodal Perspectives for Digitised Historical NewspapersMultimodal Perspectives for Digitised Historical Newspapers
Multimodal Perspectives for Digitised Historical Newspapers
cneudecker344 views
OCR und Strukturerkennung: Herausforderungen und Ansätze für die Zeitungsdigi... by cneudecker
OCR und Strukturerkennung: Herausforderungen und Ansätze für die Zeitungsdigi...OCR und Strukturerkennung: Herausforderungen und Ansätze für die Zeitungsdigi...
OCR und Strukturerkennung: Herausforderungen und Ansätze für die Zeitungsdigi...
cneudecker95 views
AI for digitized cultural heritage by cneudecker
AI for digitized cultural heritageAI for digitized cultural heritage
AI for digitized cultural heritage
cneudecker196 views
Kuratieren mit künstlicher Intelligenz by cneudecker
Kuratieren mit künstlicher IntelligenzKuratieren mit künstlicher Intelligenz
Kuratieren mit künstlicher Intelligenz
cneudecker1.2K views
Überblick zum DFG-Projekt OCR-D by cneudecker
Überblick zum DFG-Projekt OCR-DÜberblick zum DFG-Projekt OCR-D
Überblick zum DFG-Projekt OCR-D
cneudecker370 views
The many uses of digitized newspapers by cneudecker
The many uses of digitized newspapersThe many uses of digitized newspapers
The many uses of digitized newspapers
cneudecker302 views
Digitalisate kuratieren mit KI - von unstrukturierten Daten zu strukturierten... by cneudecker
Digitalisate kuratieren mit KI - von unstrukturierten Daten zu strukturierten...Digitalisate kuratieren mit KI - von unstrukturierten Daten zu strukturierten...
Digitalisate kuratieren mit KI - von unstrukturierten Daten zu strukturierten...
cneudecker539 views
Von der Zeitungsdigitalisierung zu historischen Netzwerken - Methoden und Her... by cneudecker
Von der Zeitungsdigitalisierung zu historischen Netzwerken - Methoden und Her...Von der Zeitungsdigitalisierung zu historischen Netzwerken - Methoden und Her...
Von der Zeitungsdigitalisierung zu historischen Netzwerken - Methoden und Her...
cneudecker286 views
OCR-D: An end-to-end open source OCR framework for historical printed documents by cneudecker
OCR-D: An end-to-end open source OCR framework for historical printed documentsOCR-D: An end-to-end open source OCR framework for historical printed documents
OCR-D: An end-to-end open source OCR framework for historical printed documents
cneudecker2K views
Text and Data Mining by cneudecker
Text and Data MiningText and Data Mining
Text and Data Mining
cneudecker698 views
Formate für Volltexte by cneudecker
Formate für VolltexteFormate für Volltexte
Formate für Volltexte
cneudecker172 views
Extrablatt: The Latest News on Newspaper Digitisation in Europe by cneudecker
Extrablatt: The Latest News on Newspaper Digitisation in EuropeExtrablatt: The Latest News on Newspaper Digitisation in Europe
Extrablatt: The Latest News on Newspaper Digitisation in Europe
cneudecker375 views
Reise durch Europeana Collections in 11 Minuten by cneudecker
Reise durch Europeana Collections in 11 MinutenReise durch Europeana Collections in 11 Minuten
Reise durch Europeana Collections in 11 Minuten
cneudecker306 views
Europeana Newspapers in a Nutshell by cneudecker
Europeana Newspapers in a NutshellEuropeana Newspapers in a Nutshell
Europeana Newspapers in a Nutshell
cneudecker507 views
lab.sbb.berlin by cneudecker
lab.sbb.berlinlab.sbb.berlin
lab.sbb.berlin
cneudecker349 views
Named Entity Recognition for Europeana Newspapers by cneudecker
Named Entity Recognition for Europeana NewspapersNamed Entity Recognition for Europeana Newspapers
Named Entity Recognition for Europeana Newspapers
cneudecker644 views

Recently uploaded

2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue by
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlueShapeBlue
147 views23 slides
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue by
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlueCloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlueShapeBlue
135 views13 slides
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue by
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlueWhat’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlueShapeBlue
263 views23 slides
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ... by
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...ShapeBlue
119 views17 slides
Future of AR - Facebook Presentation by
Future of AR - Facebook PresentationFuture of AR - Facebook Presentation
Future of AR - Facebook PresentationRob McCarty
64 views27 slides
DRBD Deep Dive - Philipp Reisner - LINBIT by
DRBD Deep Dive - Philipp Reisner - LINBITDRBD Deep Dive - Philipp Reisner - LINBIT
DRBD Deep Dive - Philipp Reisner - LINBITShapeBlue
180 views21 slides

Recently uploaded(20)

2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue by ShapeBlue
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue
ShapeBlue147 views
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue by ShapeBlue
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlueCloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue
ShapeBlue135 views
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue by ShapeBlue
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlueWhat’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue
ShapeBlue263 views
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ... by ShapeBlue
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...
ShapeBlue119 views
Future of AR - Facebook Presentation by Rob McCarty
Future of AR - Facebook PresentationFuture of AR - Facebook Presentation
Future of AR - Facebook Presentation
Rob McCarty64 views
DRBD Deep Dive - Philipp Reisner - LINBIT by ShapeBlue
DRBD Deep Dive - Philipp Reisner - LINBITDRBD Deep Dive - Philipp Reisner - LINBIT
DRBD Deep Dive - Philipp Reisner - LINBIT
ShapeBlue180 views
Keynote Talk: Open Source is Not Dead - Charles Schulz - Vates by ShapeBlue
Keynote Talk: Open Source is Not Dead - Charles Schulz - VatesKeynote Talk: Open Source is Not Dead - Charles Schulz - Vates
Keynote Talk: Open Source is Not Dead - Charles Schulz - Vates
ShapeBlue252 views
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N... by James Anderson
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
James Anderson160 views
Extending KVM Host HA for Non-NFS Storage - Alex Ivanov - StorPool by ShapeBlue
Extending KVM Host HA for Non-NFS Storage -  Alex Ivanov - StorPoolExtending KVM Host HA for Non-NFS Storage -  Alex Ivanov - StorPool
Extending KVM Host HA for Non-NFS Storage - Alex Ivanov - StorPool
ShapeBlue123 views
Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O... by ShapeBlue
Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O...Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O...
Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O...
ShapeBlue132 views
"Surviving highload with Node.js", Andrii Shumada by Fwdays
"Surviving highload with Node.js", Andrii Shumada "Surviving highload with Node.js", Andrii Shumada
"Surviving highload with Node.js", Andrii Shumada
Fwdays56 views
Initiating and Advancing Your Strategic GIS Governance Strategy by Safe Software
Initiating and Advancing Your Strategic GIS Governance StrategyInitiating and Advancing Your Strategic GIS Governance Strategy
Initiating and Advancing Your Strategic GIS Governance Strategy
Safe Software176 views
Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha... by ShapeBlue
Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha...Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha...
Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha...
ShapeBlue180 views
The Power of Heat Decarbonisation Plans in the Built Environment by IES VE
The Power of Heat Decarbonisation Plans in the Built EnvironmentThe Power of Heat Decarbonisation Plans in the Built Environment
The Power of Heat Decarbonisation Plans in the Built Environment
IES VE79 views
NTGapps NTG LowCode Platform by Mustafa Kuğu
NTGapps NTG LowCode Platform NTGapps NTG LowCode Platform
NTGapps NTG LowCode Platform
Mustafa Kuğu423 views
Business Analyst Series 2023 - Week 4 Session 7 by DianaGray10
Business Analyst Series 2023 -  Week 4 Session 7Business Analyst Series 2023 -  Week 4 Session 7
Business Analyst Series 2023 - Week 4 Session 7
DianaGray10139 views
Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava... by ShapeBlue
Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava...Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava...
Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava...
ShapeBlue145 views

Large scale refinement of digital historical newspapers with named entities recognition

  • 1. Large-scale refinement of digital historical newspapers with named entity recognition IFLA Newspaper Pre-Conference 14 August 2014, Geneva Clemens Neudecker, SBB, @cneudecker
  • 2. Overview • Background • NER Introduction • Approach • Challenges • Scalability • First results • Outlook This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp
  • 3. Background • Europeana Newspapers EU Best Practice Network • 10 million newspaper pages with full-text from 12 libraries • 36 million newspaper pages with metadata for Europeana This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp
  • 4. Named entity recognition (I) 1. Detect names of persons, places, organisations This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp
  • 5. Named entity recognition (II) 2. Disambiguate entities This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp
  • 6. Named entity recognition (III) 3. Link to online resources This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp
  • 7. Approach (I) • Tackle content in Dutch, German, French (about 50% of the 10m pages) This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp
  • 8. Approach (II) • Use a machine learning tool (open source) developed by Stanford University, adapted for Europeana Newspapers by KBNL https://github.com/KBNLresearch/europeananp-ner This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp
  • 9. Approach (III) • Create (and release) training material by manually annotating named entities on OCR‘d newspaper pages This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp
  • 10. Challenges • OCR quality • Multiple (mixed) languages • Historical spelling This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp
  • 11. Scalability • Stanford NER software is multi-threaded e.g. 4 CPU cores – 4x throughput • Optimise the NER classifier by filtering noise and sentences without NE‘s marked • Robust proven Java technology This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp
  • 12. First results (Dutch) Persons Locations Organizations Precision 0.940 0.950 0.942 Recall 0.588 0.760 0.559 F-measure 0.689 0.838 0.671 This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp
  • 13. First results (French) Persons Locations Precision 0.529 0.548 Recall 0.834 0.216 F-measure 0.622 0.310 This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp * Score for organisations omitted since not enough present in the source material
  • 14. Outlook • Q3: Release of training data for Named Entity Recognition in Dutch, German, French • Q3: First results for German (Austrian, Italian/South Tirol), final results for Dutch, French • Q4: Release of software (open source) for disambiguating and linking of NER results to DBPedia This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp
  • 15. www.europeana-newspapers.eu/ www.theeuropeanlibrary.org/tel4/newspapers https://github.com/KBNLresearch/europeananp-ner Thank you for your attention! IFLA Newspaper Pre-Conference 14 August 2014, Geneva Clemens Neudecker, SBB, @cneudecker