Europeana Newspapers German infoday - Processing Digital Newspapers

Digital Newspapers:
Processing in Europeana Newspapers
Information Day SBB
Berlin, 27 February 2014
Clemens Neudecker, KB, Twitter: @cneudecker
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the
Competitiveness and Innovation Framework Programme by the European Community
http://ec.europa.eu/ict_psp
Overview
• Goals & challenges
• Newspapers in the project
• Workflow & technologies
• Questions & answers
Goals
• Process 8 million newspaper pages with OCR (UIBK)
• Process 2 million newspaper pages with OLR (CCS)
• Create software for NER in 3 languages (KB)
• Develop tools that automate the workflow
• Produce guidelines and recommendations (“best practices”)
Challenges
• Quality vs. throughput
• Complexity of newspaper layouts (columns, advertisements, illustrations)
• Widely varying quality of the digitised images (microfilm, bitonal)
• Different file formats, languages, alphabets
• Historical spelling variants
• Clearly structured and largely automated workflow
The Newspapers
Europeana Newspapers Dataset (1)
Europeana Newspapers Dataset (2)
Europeana Newspapers Dataset (3)
Europeana Newspapers Dataset (4)
Workflow
OCR @ UIBK
• OCR = Optical Character Recognition
• Technology: ABBYY FineReader SDK
• State-of-the-art OCR software; supports Fraktur/Latin/Cyrillic out of the box
• Export as a METS/ALTO package consisting of images, metadata & full text
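To make the full-text part of a METS/ALTO package more concrete, here is a minimal sketch of reading recognised text back out of an ALTO file. In ALTO, words are stored as String elements with a CONTENT attribute inside TextLine elements. The sample fragment is hand-written for illustration and is not project data; the helper names `local_name` and `alto_lines` are assumptions of this sketch.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written ALTO fragment for illustration only (not project data).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Berliner"/>
            <SP/>
            <String CONTENT="Zeitung"/>
          </TextLine>
          <TextLine>
            <String CONTENT="27."/>
            <SP/>
            <String CONTENT="Februar"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix so the code is not tied to one ALTO version."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return one string per TextLine, joining the String/@CONTENT values with spaces."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            words = [
                child.attrib.get("CONTENT", "")
                for child in element
                if local_name(child.tag) == "String"
            ]
            lines.append(" ".join(words))
    return lines


print(alto_lines(SAMPLE_ALTO))
```

Matching on the local tag name rather than a fixed namespace is a design choice of this sketch, since the namespace URI differs between ALTO versions.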
Tools (BCT)
• BCT = Binarisation and Colour Reduction Tool
• Goal: convert colour/greyscale scans to 1-bit using a method optimised for OCR (GPP) + JP2k
• Background: reduce the file size of the images to make the data volume manageable (hundreds of TBs)
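As a rough illustration of what converting greyscale to 1-bit means, the sketch below applies a single global threshold chosen with Otsu's method. This is a stand-in technique, not the OCR-optimised GPP method the BCT uses, which the slides do not describe. The function names and the flat list of 8-bit pixel values are assumptions of this sketch.

```python
def otsu_threshold(pixels):
    """Return the 8-bit grey level that maximises between-class variance (Otsu)."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    total_sum = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    dark_count = 0
    dark_sum = 0
    for level in range(256):
        dark_count += histogram[level]
        if dark_count == 0:
            continue  # no pixels at or below this level yet
        light_count = total - dark_count
        if light_count == 0:
            break  # every pixel is already in the dark class
        dark_sum += level * histogram[level]
        mean_dark = dark_sum / dark_count
        mean_light = (total_sum - dark_sum) / light_count
        variance = dark_count * light_count * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarise(pixels, threshold):
    """Map grey values to 1-bit: 0 = ink (dark), 1 = paper (light)."""
    return [0 if value <= threshold else 1 for value in pixels]


# Hypothetical pixel row: dark text strokes on light paper.
row = [20, 30, 25, 200, 210, 220, 35, 215]
print(binarise(row, otsu_threshold(row)))
```

A 1-bit image needs one eighth of the storage of an 8-bit greyscale image before compression, which is the kind of reduction the slide's background point is after.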
Tools (FRT)
• FRT = File Rename Tool
• Goal: support the libraries with data delivery by renaming files and folders
• Background: prepare the data in the structure required for automated processing
Tools (FAT)
• FAT = File Analyzer Tool
• Goal: check and validate the data structure before delivery for processing
• Background: assurance for everyone involved that the data is in a suitable form for further processing
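The kind of pre-delivery check described above could look roughly like the following sketch. The slides do not specify the required folder structure, so the conventions here (one folder per issue named YYYY-MM-DD, page images as .tif, .tiff or .jp2) are purely hypothetical, as is the function name `validate_delivery`. This is not the FAT itself.

```python
import re
from pathlib import Path

# Hypothetical conventions, chosen only for this sketch.
ISSUE_DIR_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # one folder per issue date
IMAGE_SUFFIXES = {".tif", ".tiff", ".jp2"}


def validate_delivery(root):
    """Return a list of human-readable problems found directly under `root`."""
    problems = []
    for entry in sorted(Path(root).iterdir()):
        if not entry.is_dir():
            problems.append(f"unexpected file at top level: {entry.name}")
            continue
        if not ISSUE_DIR_PATTERN.match(entry.name):
            problems.append(f"folder name is not YYYY-MM-DD: {entry.name}")
        images = [p for p in entry.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES]
        if not images:
            problems.append(f"no page images in: {entry.name}")
    return problems
```

An empty list means the delivery passed every check; returning all problems at once, rather than stopping at the first, lets a library fix a whole batch in one pass.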
OLR @ CCS
• OLR = Optical Layout Recognition
• Technology: docWorks
• Segmentation of the page into columns, articles, headings, “page types” (advertisements)
• Export as a METS/ALTO package consisting of images, metadata & full text
OLR article recognition
NER @ KB
• NER = Named Entity Recognition
• Technology: Stanford CRF-NER
• 3 languages: German, Dutch, French
• Open source: https://github.com/KBNLresearch/europeananp-ner
• Recognition of 3 classes: person, location, organisation
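A sequence tagger such as a CRF assigns one label to each token, so a small post-processing step is needed to turn labelled tokens into entities. The sketch below groups consecutive tokens that share a label. The label names PER, LOC, ORG and O, the sample sentence, and the function `group_entities` are assumptions made for illustration and are not taken from the europeananp-ner code.

```python
from itertools import groupby


def group_entities(tagged_tokens, outside="O"):
    """Merge runs of consecutive tokens with the same non-outside label."""
    entities = []
    for label, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label != outside:
            text = " ".join(token for token, _ in run)
            entities.append((text, label))
    return entities


# Hypothetical tagger output: (token, label) pairs.
tagged = [
    ("Clemens", "PER"), ("Neudecker", "PER"), ("sprach", "O"), ("in", "O"),
    ("Berlin", "LOC"), ("für", "O"), ("die", "O"),
    ("Koninklijke", "ORG"), ("Bibliotheek", "ORG"),
]
print(group_entities(tagged))
```

One limitation of this simple grouping: two different entities of the same class that are directly adjacent would be merged into one.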
Results for NL
Model trained on manually tagged newspaper pages from 1618 - 1900.
100 pages with a total of 183,421 tokens (“words”)

            Persons   Locations   Organisations
Precision   0.940     0.950       0.942
Recall      0.588     0.760       0.559
F-measure   0.689     0.838       0.671

* K-fold cross-validation: 1/4 of the training data is used only for evaluation
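The evaluation setup can be sketched as follows, assuming the footnote means 4-fold cross-validation (each quarter of the data is held out once) and that the F-measure is the balanced F1, the harmonic mean of precision and recall. Both are assumptions, since the slides give neither the value of k nor the exact formula; the function names are hypothetical.

```python
def k_fold_splits(items, k=4):
    """Yield (train, test) pairs; each item appears in exactly one test fold."""
    folds = [items[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [
            item
            for index, fold in enumerate(folds)
            if index != held_out
            for item in fold
        ]
        yield train, test


def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


pages = list(range(100))  # stand-ins for the 100 tagged pages
for train, test in k_fold_splits(pages):
    print(len(train), len(test))

print(round(f_measure(0.940, 0.588), 3))
```

Recomputing F1 from the rounded precision and recall in the table gives about 0.723, 0.844 and 0.702, slightly above the reported 0.689, 0.838 and 0.671. One possible explanation is that the reported figures are averages over the individual folds, but the slides do not say.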
NER vs. OCR
[Figure: chart with the labels NER and OCR; axis ticks run from 0.25 to 0.95]
Thank you for your attention!
Any questions?
clemens.neudecker@kb.nl