Europeana Newspapers German infoday - Processing Digital Newspapers
Published in: Technology
Transcript

  • 1. Digital Newspapers: Processing in Europeana Newspapers. Information Day, SBB Berlin, 27 February 2014. Clemens Neudecker, KB, Twitter: @cneudecker
  • 2. Overview
      • Goals & challenges
      • Newspapers in the project
      • Workflow & technologies
      • Questions & answers
    Funding notice: This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community, http://ec.europa.eu/ict_psp
  • 3. Goals
      • Process 8 million newspaper pages with OCR (UIBK)
      • Process 2 million newspaper pages with OLR (CCS)
      • Build NER software for 3 languages (KB)
      • Develop tools that automate the workflow
      • Produce guidelines and recommendations (“best practices”)
  • 4. Challenges
      • Quality vs. throughput
      • Complexity of newspaper layouts (columns, advertisements, illustrations)
      • Widely varying quality of the digitised images (microfilm, bitonal)
      • Different file formats, languages and alphabets
      • Historical spelling variants
      • A clearly structured and largely automated workflow
  • 5. The Newspapers
  • 6. Europeana Newspapers Dataset (1)
  • 7. Europeana Newspapers Dataset (2)
  • 8. Europeana Newspapers Dataset (3)
  • 9. Europeana Newspapers Dataset (4)
  • 10. Workflow
  • 11. OCR @ UIBK
      • OCR = Optical Character Recognition
      • Technology: ABBYY FineReader SDK
      • State-of-the-art OCR software; supports Fraktur, Latin and Cyrillic scripts out of the box
      • Export as a METS/ALTO package consisting of images, metadata and full text
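
To make the METS/ALTO export more concrete, here is a minimal sketch of reading the full text back out of an ALTO file with Python's standard library. The sample XML is hand-written for illustration and is not project data; the function names `local_name` and `alto_lines` are hypothetical. The sketch relies only on the ALTO convention that recognised words are stored in the `CONTENT` attribute of `String` elements inside `TextLine` elements.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written ALTO sample (illustrative only, not project data).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Berliner" WC="0.98"/>
            <SP/>
            <String CONTENT="Tageblatt" WC="0.91"/>
          </TextLine>
          <TextLine>
            <String CONTENT="Morgen-Ausgabe" WC="0.87"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return one string per TextLine, joining the CONTENT of its String elements."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            words = [
                child.get("CONTENT", "")
                for child in element
                if local_name(child.tag) == "String"
            ]
            lines.append(" ".join(words))
    return lines


if __name__ == "__main__":
    for line in alto_lines(SAMPLE_ALTO):
        print(line)
```

Matching on the local tag name rather than a fixed namespace is a design choice that keeps the sketch usable with different ALTO schema versions.
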
  • 12. Tools (BCT)
      • BCT = Binarisation and Colour Reduction Tool
      • Goal: convert colour and greyscale scans to 1-bit images using a method optimised for OCR (GPP), plus JP2k (JPEG 2000)
      • Background: reduce image file sizes to keep the data volume manageable (hundreds of TB)
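
The slide does not describe the OCR-optimised method (GPP) used by the BCT, so the following is only a stand-in: a minimal sketch of Otsu's global thresholding, a common baseline for turning greyscale pixels into a 1-bit image. The function names and the sample pixel values are assumptions made for illustration.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that maximises between-class variance.

    This is Otsu's method, used here as a stand-in; it is NOT the BCT's method.
    """
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(256):
        weight_bg += histogram[level]
        sum_bg += level * histogram[level]
        if weight_bg == 0:
            continue  # no pixels at or below this level yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # all pixels are already in the dark class
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarise(pixels, threshold):
    """Map each grey value to 0 (ink) if at or below the threshold, else 1 (paper)."""
    return [0 if value <= threshold else 1 for value in pixels]


if __name__ == "__main__":
    grey = [20, 25, 30, 200, 210, 220]  # hypothetical dark ink and light paper
    print(binarise(grey, otsu_threshold(grey)))
```

A single global threshold tends to struggle with uneven illumination or stained paper, which is one reason a dedicated OCR-optimised method may be preferred in practice.
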
  • 13. Tools (FRT)
      • FRT = File Rename Tool
      • Goal: support libraries in delivering their data by renaming files and folders
      • Background: arrange the data in the structure required for automated processing
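
As an illustration of the kind of renaming described here, the sketch below computes a rename plan for the scans of one newspaper issue. The target layout `<title_id>/<issue_date>/<page>.<ext>` and the function name `plan_renames` are hypothetical; the slide does not specify the structure the FRT actually produces.

```python
from pathlib import PurePosixPath


def plan_renames(source_files, title_id, issue_date):
    """Map scan files of one issue to '<title_id>/<issue_date>/<page>.<ext>'.

    Pages are numbered in the sorted order of the original file names, so
    this assumes the names already sort in page order (e.g. zero-padded).
    The target layout is hypothetical, not the one the FRT enforces.
    """
    plan = []
    for page_number, source in enumerate(sorted(source_files), start=1):
        extension = PurePosixPath(source).suffix.lower()
        target = PurePosixPath(title_id) / issue_date / f"{page_number:04d}{extension}"
        plan.append((source, str(target)))
    return plan


if __name__ == "__main__":
    for source, target in plan_renames(
        ["scan_b.TIF", "scan_a.TIF"], "berliner_tageblatt", "1914-08-01"
    ):
        print(source, "->", target)
```

Computing the plan separately from applying it lets a library review the mapping before any file is touched.
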
  • 14. Tools (FAT)
      • FAT = File Analyzer Tool
      • Goal: check and validate the data structure before delivery for processing
      • Background: assure all parties that the data is in a suitable form for further processing
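
A pre-delivery check of this kind could look roughly like the sketch below. The rules it enforces (a `<title_id>/<YYYY-MM-DD>/<4-digit page>.<tif|jp2>` layout and contiguous page numbers starting at 1) are assumptions chosen for illustration; the slide does not list the checks the FAT performs.

```python
import re
from collections import defaultdict

# Hypothetical delivery layout: <title_id>/<YYYY-MM-DD>/<4-digit page>.<tif|jp2>
PATH_PATTERN = re.compile(
    r"^(?P<title>[a-z0-9_]+)/(?P<date>\d{4}-\d{2}-\d{2})/(?P<page>\d{4})\.(?P<ext>tif|jp2)$"
)


def validate_delivery(relative_paths):
    """Return a list of human-readable problems; an empty list means the check passed."""
    problems = []
    pages_per_issue = defaultdict(list)

    # First pass: every path must match the expected layout.
    for path in relative_paths:
        match = PATH_PATTERN.match(path)
        if match is None:
            problems.append(f"unexpected path: {path}")
            continue
        issue = (match["title"], match["date"])
        pages_per_issue[issue].append(int(match["page"]))

    # Second pass: page numbers of each issue must run 1, 2, 3, ... without gaps.
    for (title, date), pages in sorted(pages_per_issue.items()):
        expected = list(range(1, len(pages) + 1))
        if sorted(pages) != expected:
            problems.append(f"page numbers not contiguous from 1 in {title}/{date}")

    return problems


if __name__ == "__main__":
    delivery = ["bt/1914-08-01/0001.tif", "bt/1914-08-01/0003.tif", "bt/notes.txt"]
    for problem in validate_delivery(delivery):
        print(problem)
```

Returning a list of problems instead of failing on the first one gives the delivering library a complete report in a single run.
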
  • 15. OLR @ CCS
      • OLR = Optical Layout Recognition
      • Technology: docWorks
      • Segmentation of the page into columns, articles, headings and “page types” (advertisements)
      • Export as a METS/ALTO package consisting of images, metadata and full text
  • 16. OLR article recognition
  • 17. NER @ KB
      • NER = Named Entity Recognition
      • Technology: Stanford CRF-NER
      • 3 languages: German, Dutch, French
      • Open source: https://github.com/KBNLresearch/europeananp-ner
      • Recognises 3 classes: person, location, organisation
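
A sequence tagger such as a CRF labels each token individually, so a post-processing step is usually needed to turn token labels into entities. The sketch below shows one simple way to do that. The label names `PER`, `LOC`, `ORG` and `O` and the function name `group_entities` are assumptions for illustration and are not taken from the europeananp-ner code.

```python
def group_entities(tagged_tokens, outside_label="O"):
    """Merge runs of consecutive tokens that share a non-outside label.

    tagged_tokens: list of (token, label) pairs as a sequence tagger might emit.
    Returns a list of (entity_text, label) pairs.
    Limitation: two adjacent entities with the same label are merged into one.
    """
    entities = []
    current_tokens, current_label = [], None
    for token, label in tagged_tokens:
        if label == current_label and label != outside_label:
            current_tokens.append(token)  # continue the running entity
            continue
        if current_tokens:
            entities.append((" ".join(current_tokens), current_label))
        if label == outside_label:
            current_tokens, current_label = [], None
        else:
            current_tokens, current_label = [token], label
    if current_tokens:
        entities.append((" ".join(current_tokens), current_label))
    return entities


if __name__ == "__main__":
    tagged = [
        ("Clemens", "PER"), ("Neudecker", "PER"),
        ("spoke", "O"), ("in", "O"), ("Berlin", "LOC"),
    ]
    print(group_entities(tagged))
```
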
  • 18. Results for NL
    Model trained on manually tagged newspaper pages from 1618 to 1900: 100 pages with a total of 183,421 tokens (“words”). *

                 Persons   Locations   Organisations
    Precision    0.940     0.950       0.942
    Recall       0.588     0.760       0.559
    F-measure    0.689     0.838       0.671

    * K-fold cross-validation: 1/4 of the training data is used for evaluation only.
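
For readers unfamiliar with the metrics, the sketch below shows the standard F-measure formula (the harmonic mean of precision and recall when beta is 1) and a simple way to hold out one quarter of the data per fold. The function names and the interleaved split are assumptions; the slide does not say how the folds were built. Applied directly to the precision and recall shown for persons, the formula gives about 0.72 rather than the reported 0.689; one possible explanation, which is only an assumption here, is that the reported F-measure was averaged per fold.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


def four_fold_splits(items):
    """Yield (train, held_out) pairs where each interleaved quarter is held out once."""
    k = 4
    for fold in range(k):
        held_out = items[fold::k]
        train = [item for index, item in enumerate(items) if index % k != fold]
        yield train, held_out


if __name__ == "__main__":
    # Precision and recall for persons as shown on the slide.
    print(round(f_measure(0.940, 0.588), 3))
    for train, held_out in four_fold_splits(list(range(8))):
        print(train, held_out)
```
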
  • 19. NER vs. OCR (chart: axis from 0.25 to 0.95 in steps of 0.10; series: NER, OCR)
  • 20. Thank you for your attention! Any questions? clemens.neudecker@kb.nl
