An Open Corpus for Named Entity Recognition in Historic Newspapers

Clemens Neudecker
Berlin State Library
@cneudecker
LREC 2016, 23-28 May 2016, Portorož, Slovenia
Background
• Europeana Newspapers EU-project:
www.europeana-newspapers.eu
• OCRed 12m pages of historic newspapers
from Europe (an estimated 25 billion words!)
• Newspaper content from 23 libraries, in 40
languages, covering 4 centuries (1618-1990)
• Public domain full-text available for download
per language/content provider
Formats & Standards
• Full-text produced in ALTO
• Metadata (structural) in METS
• Metadata (bibliographic) in EDM
• Not a fan of XML?
Good ol' plain text (UTF-8) is also available…
research.europeana.eu/itemtype/newspapers
• Currently working on:
– API for text/search
– API for images (IIIF)
Approach
• 3 languages selected for NER:
Dutch, German, French – in collab. with
• Content in these languages constitutes about
50% of the overall full-text in the collection
Methodology
• Select 100 representative pages per language
– If a classifier already exists for the given language,
run it on the selected 100 pages
– Ingest tagged/untagged pages to annotation tool
– Manually add/correct annotations
(>=2 librarians per language)
– Export and convert tagged data to BIO format
– Train classifier from BIO & gazetteers (if available)
– Evaluate derived classifier using 4-fold cross-eval
– Repeat until classification performance converges
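The export step above turns annotated entity spans into BIO tags (B- for the first token of an entity, I- for the following tokens, O for everything else). A minimal sketch of that conversion, assuming spans are given as (start, end, label) token indices with an exclusive end; the function name `spans_to_bio` and the sample sentence are illustrative and not taken from the project's code:

```python
def spans_to_bio(tokens, spans):
    """Convert token-level entity spans to one BIO tag per token.

    spans: iterable of (start, end, label); end is exclusive (assumed convention).
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # continuation tokens
    return tags


# Hypothetical sentence built around the names used in the ALTO example.
tokens = ["Mr.", "John", "Reynolds", "left", "Baltimore", "."]
spans = [(1, 3, "PER"), (4, 5, "LOC")]
bio_tags = spans_to_bio(tokens, spans)

# One token and its tag per line, tab-separated.
for token, tag in zip(tokens, bio_tags):
    print(f"{token}\t{tag}")
```

A two-column token/tag layout like this is a common input format for CRF-based taggers; the exact column layout expected by the project's training setup is not stated on the slide.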
NER software
• Tested Stanford NER, OpenNLP, NLTK, GATE
• Adaptation of Stanford NER package (CRF)
– Mature, well-documented, widely used
– Open source (GPL)
– Thread-safe & platform-independent (JVM)
– Machine learning scales out more easily
to multiple languages
– Prior experience working with CRF
NER encoding in ALTO
• In ALTO versions >2.1, this is possible:
<String STYLEREFS="ID7" HEIGHT="132.0" WIDTH="570.0" HPOS="5937.0"
VPOS="3279.0" CONTENT="Reynolds" WC="0.95238096" TAGREFS="Tag5">
</String>
<String STYLEREFS="ID7" HEIGHT="102.0" WIDTH="540.0" HPOS="18438.0"
VPOS="22008.0" CONTENT="Baltimore" WC="0.82539684" TAGREFS="Tag10">
</String>
…
<Tags>
<NamedEntityTag ID="Tag5" TYPE="Person" LABEL="Reynolds"/>
<NamedEntityTag ID="Tag10" TYPE="Location" LABEL="Baltimore"/>
</Tags>
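To read such annotations back, each String's TAGREFS value can be resolved against the NamedEntityTag elements. A minimal sketch using Python's standard library, assuming a namespace-free fragment wrapped in a hypothetical root element; real ALTO files declare an XML namespace that the lookups would have to include, and the untagged word "left" is invented for illustration:

```python
import xml.etree.ElementTree as ET

# Simplified fragment modelled on the slide. The <alto>/<TextLine> wrappers
# and the untagged word "left" are assumptions added for this example.
ALTO_FRAGMENT = """
<alto>
  <Tags>
    <NamedEntityTag ID="Tag5" TYPE="Person" LABEL="Reynolds"/>
    <NamedEntityTag ID="Tag10" TYPE="Location" LABEL="Baltimore"/>
  </Tags>
  <TextLine>
    <String CONTENT="Reynolds" WC="0.95238096" TAGREFS="Tag5"/>
    <String CONTENT="left" WC="0.9"/>
    <String CONTENT="Baltimore" WC="0.82539684" TAGREFS="Tag10"/>
  </TextLine>
</alto>
"""


def tagged_strings(xml_text):
    """Return (word, entity_type_or_None) pairs in document order."""
    root = ET.fromstring(xml_text)
    # Map tag IDs to entity types, e.g. "Tag5" -> "Person".
    tag_types = {t.get("ID"): t.get("TYPE") for t in root.iter("NamedEntityTag")}
    result = []
    for string in root.iter("String"):
        refs = string.get("TAGREFS", "").split()
        types = [tag_types[r] for r in refs if r in tag_types]
        result.append((string.get("CONTENT"), types[0] if types else None))
    return result


entities = tagged_strings(ALTO_FRAGMENT)
print(entities)
```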
Annotation
• Evaluated BRAT, WebAnno, INL Attestation
• Reasons for selection of INL Attestation:
– Speed
– Support of ALTO format
– Support from INL available
Annotation stats
Language   # tokens   # PER   # LOC   # ORG
French      207,000   5,672   5,614   2,574
Dutch       182,483   4,492   4,448   1,160
German       96,735   7,914   6,143   2,784

Language   % tokens   % PER   % LOC   % ORG
French     100%       2.75%   2.71%   1.24%
Dutch      100%       2.46%   2.44%   0.64%
German     100%       8.18%   6.35%   2.88%

Language   Word-Error-Rate (Bag of Words)   Reading Order Success Rate
French     16.6%                            19.9%
Dutch      17.6%                            23.2%
German     15.9% / 21.9%                    13.6%
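The percentage rows can be recomputed from the raw counts as entity mentions divided by tokens. A short sketch of that arithmetic, using only the figures from the table above; the variable and function names are illustrative:

```python
# Counts copied from the annotation statistics table.
counts = {
    "French": {"tokens": 207_000, "PER": 5_672, "LOC": 5_614, "ORG": 2_574},
    "Dutch": {"tokens": 182_483, "PER": 4_492, "LOC": 4_448, "ORG": 1_160},
    "German": {"tokens": 96_735, "PER": 7_914, "LOC": 6_143, "ORG": 2_784},
}


def entity_shares(row):
    """Share of each entity type as a percentage of all tokens, 2 decimals."""
    return {
        label: round(100 * row[label] / row["tokens"], 2)
        for label in ("PER", "LOC", "ORG")
    }


shares = {language: entity_shares(row) for language, row in counts.items()}
for language, share in shares.items():
    print(language, share)
```

The recomputed values agree with the table except for French PER, which comes out at 2.74% rather than 2.75%; the French token count looks rounded, which may account for the difference. The German pages carry roughly three times the entity density of the French and Dutch pages.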
Challenges
• Clear, comprehensive & common guidelines
for manual annotation
• OCR quality – on average 80% word accuracy
• Wide variation in historical spelling
• Mix of languages on a single page
• Lack/loss of metadata on page/word level
• Some data corruption occurred when ingesting
pre-tagged data into the annotation tool
Attempted workarounds
• Introduce OCR error patterns into training data
→ actually yields lower precision/recall
• Introduce a spelling variation module in the NER classifier
→ rewrite rules (e.g. „frorn“ → „from“)
→ high integration effort
→ requires reasonable amount of rules
→ abandoned due to high complexity
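To illustrate what a rewrite rule of this kind does, here is a minimal sketch, not the project's abandoned module. It assumes a small lexicon of known word forms and only accepts a rewrite when the result is in that lexicon, so that valid words such as "corn" are left alone. Only the „frorn“ example comes from the slide; the second rule and the lexicon are hypothetical:

```python
# (misread sequence, intended sequence); only "rn" -> "m" reflects the
# slide's example, the "vv" -> "w" rule is an illustrative assumption.
RULES = [("rn", "m"), ("vv", "w")]

# Hypothetical lexicon of accepted word forms.
LEXICON = {"from", "corn", "were", "the"}


def normalise(token, rules=RULES, lexicon=LEXICON):
    """Return a lexicon-backed rewrite of token, or token unchanged."""
    if token.lower() in lexicon:
        return token  # already a known form, do not touch it
    for wrong, right in rules:
        candidate = token.replace(wrong, right)
        if candidate != token and candidate.lower() in lexicon:
            return candidate
    return token  # no rule produced a known form


for word in ["frorn", "corn", "vvere", "Baltimore"]:
    print(word, "->", normalise(word))
```

Even in this toy form the approach depends on having both a usable rule set and a lexicon for each language and period, which hints at the integration effort the slide mentions.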
Evaluation NL
Derived via 4-fold cross-evaluation (25 out of 100 annotated pages)
Evaluation FR
Derived via 4-fold cross-evaluation (25 out of 100 annotated pages)
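Both evaluation slides report 4-fold cross-evaluation with 25 of the 100 annotated pages involved per fold. One reading of that setup, sketched under the assumption that each fold holds out 25 pages for testing and trains on the remaining 75; the function names are illustrative, and the training and tagging steps themselves are left out:

```python
def four_fold_splits(page_ids, k=4):
    """Yield (train_pages, test_pages); each page is held out exactly once."""
    folds = [page_ids[i::k] for i in range(k)]  # k interleaved folds
    for i, test in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, test


def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Standard precision, recall and F1 from entity-level counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


pages = list(range(100))  # stand-ins for the 100 annotated pages
splits = list(four_fold_splits(pages))
for train, test in splits:
    print(len(train), "training pages,", len(test), "held-out pages")
```

Splitting by page rather than by sentence keeps all text from one page on the same side of the split, so a fold's test pages share no sentences with its training pages.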
Use cases
• Improving search, information retrieval
– Within digital newspapers, a vast majority of
user queries are person and place names
• Linking of named entities to authority files
to create linked data
– The classification and disambiguation of named
entities allows the assignment of unique
identifiers from authoritative sources – thus
enabling cross-language/cross-collection linking
Next steps
• Volunteers wanted!
Help correct corpus and collaboratively create a
free dataset – instructions on GitHub wiki:
– github.com/EuropeanaNewspapers/ner-corpora/wiki/Corpus-cleanup
• Plans to improve performance:
– Add distributional similarity as feature (Clark 2003)
– Semantic generalisation (Faruqui & Padó 2010)
– Specialised gazetteers (e.g. list of historic place names)
– Data, data, data
Open resources
• Europeana Newspapers NER dataset (CC0):
– github.com/EuropeanaNewspapers/ner-corpora
• Europeana Newspapers NER software (EUPL):
– github.com/EuropeanaNewspapers/europeananp-ner
– github.com/EuropeanaNewspapers/europeananp-dbpedia-disambiguation
• Annotated ALTO files:
– lab.kbresearch.nl/static/html/eunews.html
References
• C. Neudecker, W.J. Faber, L. Wilms, T. van Veen:
Large scale refinement of digital historical
newspapers with named entity recognition
Proceedings of the IFLA Newspaper Section
Satellite Meeting, 2014, Geneva, Switzerland.
• Y. Mossalam, A. Abi-Haidar, J.G. Ganascia:
Unsupervised named entity recognition and
disambiguation: An application to old French
journals
Advances in Data Mining. Applications and
Theoretical Aspects, Springer LNCS, 2014.
Thank you for your attention!
Questions?
Clemens Neudecker
Berlin State Library
@cneudecker