Europeana Newspapers German Info Day - Processing Digital Newspapers



  1. Digital Newspapers: Processing in Europeana Newspapers. Information Day, SBB Berlin, 27 February 2014. Clemens Neudecker, KB, Twitter: @cneudecker
  2. Overview
     • Goals & challenges
     • Newspapers in the project
     • Workflow & technologies
     • Questions & answers
     This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp
  3. Goals
     • Process 8 million newspaper pages with OCR (UIBK)
     • Process 2 million newspaper pages with OLR (CCS)
     • Build NER software for 3 languages (KB)
     • Develop tools that automate the workflow
     • Produce guidelines and recommendations ("best practices")
  4. Challenges
     • Quality vs. throughput
     • Complexity of newspaper layouts (columns, advertisements, illustrations)
     • Widely varying quality of the digitised material (microfilm, bitonal)
     • Different file formats, languages and alphabets
     • Historical spelling variants
     • A clearly structured and largely automated workflow
  5. The newspapers
  6. Europeana Newspapers Dataset (1)
  7. Europeana Newspapers Dataset (2)
  8. Europeana Newspapers Dataset (3)
  9. Europeana Newspapers Dataset (4)
  10. Workflow
  11. OCR @ UIBK
      • OCR = Optical Character Recognition
      • Technology: ABBYY FineReader SDK
      • State-of-the-art OCR software; supports Fraktur, Latin and Cyrillic scripts out of the box
      • Export as a METS/ALTO package consisting of images, metadata and full text
  12. Tools (BCT)
      • BCT = Binarisation and Colour Reduction Tool
      • Goal: convert colour and greyscale scans to 1-bit using a method optimised for OCR (GPP) + JP2k
      • Background: reduce the file size of the images to keep the data volume manageable (hundreds of TB)
  13. Tools (FRT)
      • FRT = File Rename Tool
      • Goal: support the libraries during data delivery by renaming files and folders
      • Background: prepare the data in the structure required for automated processing
  14. Tools (FAT)
      • FAT = File Analyzer Tool
      • Goal: check and validate the data structure before delivery for processing
      • Background: assurance for everyone involved that the data is in a suitable form for further processing
  15. OLR @ CCS
      • OLR = Optical Layout Recognition
      • Technology: docWorks
      • Segmentation of the page into columns, articles, headings and "page types" (advertisements)
      • Export as a METS/ALTO package consisting of images, metadata and full text
  16. OLR article recognition
  17. NER @ KB
      • NER = Named Entity Recognition
      • Technology: Stanford CRF-NER
      • 3 languages: German, Dutch, French
      • Open source: https://github.com/KBNLresearch/europeananp-ner
      • Recognition of 3 classes: person, location, organisation
  18. Results for NL
      Model trained on manually tagged newspaper pages from 1618 to 1900: 100 pages with a total of 183,421 tokens ("words") *

                  Persons   Locations   Organisations
      Precision   0.940     0.950       0.942
      Recall      0.588     0.760       0.559
      F-measure   0.689     0.838       0.671

      * K-fold cross validation = 1/4 of the training data used only for evaluation
  19. NER vs. OCR [chart: labels NER and OCR, scale from 0.25 to 0.95]
  20. Thank you for your attention! Any questions? clemens.neudecker@kb.nl
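
Slides 11 and 15 both name METS/ALTO packages as the export format. As a minimal sketch of how the full text could be read back from a single ALTO page: ALTO stores each recognised word in a `String` element with a `CONTENT` attribute, grouped into `TextLine` elements. The sample XML below is hand-written for illustration (it is not project data), and the function names `local_name` and `alto_lines` are hypothetical. Real files carry a versioned ALTO namespace, which the sketch sidesteps by comparing local tag names only.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written ALTO fragment, for illustration only (not project data).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Neue"/><SP/><String CONTENT="Zeitung"/>
          </TextLine>
          <TextLine>
            <String CONTENT="27."/><SP/><String CONTENT="Februar"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return one string per TextLine, joining the CONTENT of its String children."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        words = [
            child.attrib.get("CONTENT", "")
            for child in element
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(words))
    return lines


print(alto_lines(SAMPLE_ALTO))
```

The sketch ignores hyphenation (`HYP`) elements and word coordinates, both of which a real consumer of ALTO files would probably want to handle.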
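
Slide 12 describes converting colour and greyscale scans to 1-bit images. The slide does not explain the method the BCT uses (GPP), so the sketch below swaps in Otsu's global threshold purely as a well-known stand-in; it is not the project's algorithm. The pixel values and the function names `otsu_threshold` and `binarise` are assumptions made for illustration.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that maximises the between-class variance."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    weight_bg = 0      # number of pixels at or below the candidate threshold
    sum_bg = 0         # sum of their grey levels
    best_t, best_var = 0, -1.0
    for t in range(256):
        weight_bg += histogram[t]
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break
        sum_bg += t * histogram[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        var_between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarise(pixels):
    """Map each grey value to 0 (dark, at or below threshold) or 1 (light)."""
    t = otsu_threshold(pixels)
    return [0 if value <= t else 1 for value in pixels]


# Made-up 8-pixel greyscale strip: dark ink values and light paper values.
strip = [30, 35, 40, 200, 210, 220, 215, 32]
print(otsu_threshold(strip), binarise(strip))
```

A global threshold like this tends to struggle with uneven illumination and microfilm noise, which may be one reason a method tuned for OCR was used in the project.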
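
The footnote on slide 18 describes k-fold cross validation with a quarter of the data held out for evaluation, which suggests k = 4 (an assumption; the slide does not state k). A minimal sketch of such a split over the 100 annotated pages, using plain page indices as stand-ins for the real pages; the function name `k_fold_splits` is hypothetical.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    if not 1 < k <= n_items:
        raise ValueError("k must be between 2 and n_items")
    indices = list(range(n_items))
    # Spread items as evenly as possible: the first n_items % k folds get one extra.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Assumed setup: 100 pages, k = 4, so each fold evaluates on 25 pages.
for fold, (train, test) in enumerate(k_fold_splits(100, 4), start=1):
    print(f"fold {fold}: {len(train)} training pages, {len(test)} evaluation pages")
```

In practice the pages would normally be shuffled before splitting, so that each fold covers a mix of periods and titles.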
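
Slide 18 reports precision, recall and F-measure per entity class. As a reminder of how the three relate, the balanced F-measure (F1) is the harmonic mean of precision and recall. The sketch below recomputes F1 from the precision and recall printed on the slide; the function name `f1` is an assumption. The recomputed values come out somewhat higher than the slide's F-measure row. One possible explanation is that the slide averages per-fold scores, but the slide does not say how the row was computed.

```python
def f1(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, F-measure as printed on slide 18)
SLIDE_SCORES = {
    "Persons": (0.940, 0.588, 0.689),
    "Locations": (0.950, 0.760, 0.838),
    "Organisations": (0.942, 0.559, 0.671),
}

for label, (p, r, reported) in SLIDE_SCORES.items():
    print(f"{label}: recomputed F1 = {f1(p, r):.3f}, slide reports {reported:.3f}")
```

Across all three classes precision is high and recall much lower, so the model rarely mislabels an entity but misses a fair share of them.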
