Europeana Newspapers German Infoday Quality Assessment
Transcript

  • 1. Europeana Newspapers: Evaluation and Quality Control. Information Day, SBB Berlin, 28 February 2014. Clemens Neudecker, KB, Twitter: @cneudecker
  • 2. This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community (http://ec.europa.eu/ict_psp). Overview • Quality control in digitisation projects • Particular challenges in digitising newspapers • Digitisation workflows and quality control • The PAGE evaluation framework • Ground truth • Tools • Layout analysis • Reading order • Text accuracy • What to do with the results? • Summary and outlook
  • 3. Quality control in digitisation projects • Planning: feasibility; priorities; costs, time required, manual steps; services, file formats • Implementation: setting up the workflow; tracking down bottlenecks; optimising the process steps • Control: quality of the OCR • Performance analysis: a thorough analysis of all process steps. What contributes to quality, and how?
  • 4. Challenges for newspapers • Very high number of characters per page • Multiple columns • Widely varying types of regions • Reading order • Complex layouts • Illustrations • Tables • Advertisements • Poor paper quality • Often scanned from microfilm • … Source: NLF
  • 5. Digitisation workflows and quality control ① Scanning ② (Image) preprocessing: splitting double pages; border removal/cropping; deskewing; removal of artefacts (noise); binarisation ③ Layout analysis: segmentation into regions, lines, words and characters; classification of regions; analysis of the logical structure ④ Character recognition (OCR) ⑤ Post-processing • Individual process steps vs. the entire workflow • Direct vs. indirect • Based on real usage scenarios
  • 6. The PAGE evaluation framework • Evaluation Tools • Image Repository • Evaluation Results • Compatibility through one common format (PAGE)
  • 7. Ground truth
  • 8. Tools for creating ground truth • Aletheia: page border, print space; regions (incl. type); lines, words and glyphs; Unicode text; reading order, layers, etc. • FineReader Engine Exporter (preproduction) • GT Validator • GT Converter/Normaliser • http://www.primaresearch.org/tools
  • 9. Layout analysis • Error categories: Miss / Partial Miss; Split; Misclassification; Merge; False Detection • Figure labels: Ground truth, OCR
  • 10. Reading order • Figure labels: Ground truth, OCR
  • 11. Text accuracy • Comparison of ground truth and OCR-recognised text, taking the text encoding into account (ASCII, Unicode) • Normalisation • Character accuracy: distance measure, i.e. the minimum number of edits (insertions, deletions, substitutions); for all classes of characters (lower case, upper case, whitespace characters, numbers, symbols) • Word accuracy: correctly recognised words vs. total number of words • Bag of words (index, ranking) • Stop words and non-stop words (e.g. German "und", "in") • Rejected and suspicious characters/words • Substitution errors (weighted higher) • OCR confidence ≠ accuracy • Example: "OCR is cool" → "OOR is cod"
  • 12. What to do with the results? • Criteria: minimum requirements met?; number and classes of errors • Scenarios: application / context; weighting of errors • Figure labels: Miss, Misclass., Merge, Split, False detect.; Merge Rate (M1, M2, M3); Split Rate (S1, S2, ...); Error Rate • Overall result / aggregation: weighted individual results; type and extent of the incorrect regions; allowed vs. disallowed errors
  • 13. Summary and outlook • Good, thorough evaluation costs time and money… • Defining the quality requirements (depending on usage scenarios) • Creating ground truth (high manual effort) • Carrying out the evaluation • Interpreting the results • …but only in this way can truly reliable statements be made about the quality of layout and text accuracy! • The IMPACT Centre of Competence can help you with this: www.digitisation.eu
  • 14. Further information • PRImA: www.primaresearch.org • Europeana Newspapers: www.europeana-newspapers.eu
  • 15. Thank you for your attention! Any questions? clemens.neudecker@kb.nl
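
Slide 11 describes character accuracy as an edit-distance measure and word accuracy as correctly recognised words over total words. The following is a minimal Python sketch of both, not the code of the PAGE evaluation tools. The function names, the NFC-plus-whitespace normalisation, and the choice to divide by the ground-truth length are assumptions made for illustration.

```python
import unicodedata
from collections import Counter


def normalise(text: str) -> str:
    """Assumed normalisation: Unicode NFC, then collapse whitespace runs."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def edit_distance(reference: str, hypothesis: str) -> int:
    """Minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete ref_char
                current[j - 1] + 1,      # insert hyp_char
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(ground_truth: str, ocr_text: str) -> float:
    """1 - edits / ground-truth length, clamped at 0 (assumed definition)."""
    gt, ocr = normalise(ground_truth), normalise(ocr_text)
    if not gt:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1.0 - edit_distance(gt, ocr) / len(gt))


def bag_of_words_accuracy(ground_truth: str, ocr_text: str) -> float:
    """Share of ground-truth words found in the OCR text, ignoring order."""
    gt_words = Counter(normalise(ground_truth).split())
    ocr_words = Counter(normalise(ocr_text).split())
    total = sum(gt_words.values())
    if total == 0:
        return 1.0 if not ocr_words else 0.0
    # Counter intersection keeps the minimum count of each shared word.
    return sum((gt_words & ocr_words).values()) / total


if __name__ == "__main__":
    gt, ocr = "OCR is cool", "OOR is cod"   # example from slide 11
    print(edit_distance(gt, ocr))                    # 3
    print(round(character_accuracy(gt, ocr), 3))     # 0.727
    print(round(bag_of_words_accuracy(gt, ocr), 3))  # 0.333
```

On the slide's own example, three character edits out of eleven leave a character accuracy of about 73%, yet only one word in three survives intact. This is one reason to report character and word accuracy separately.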

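Slide 12 mentions weighting errors by scenario and aggregating the weighted individual results into an overall result. The sketch below shows one possible aggregation, a weighted mean of per-category error rates. It is an illustration under stated assumptions, not the metric used by the PAGE evaluation framework: the function name, the category keys, the sample rates and the weights are all hypothetical.

```python
# Error categories named on slides 9 and 12; the key spellings are assumptions.
ERROR_TYPES = ("miss", "misclassification", "merge", "split", "false_detection")


def weighted_error_rate(error_rates: dict, weights: dict) -> float:
    """Weighted mean of per-category error rates, each rate in [0, 1].

    Categories missing from either mapping count as 0.
    """
    if any(weights.get(name, 0.0) < 0 for name in ERROR_TYPES):
        raise ValueError("weights must be non-negative")
    total_weight = sum(weights.get(name, 0.0) for name in ERROR_TYPES)
    if total_weight <= 0:
        raise ValueError("at least one weight must be positive")
    weighted_sum = sum(
        weights.get(name, 0.0) * error_rates.get(name, 0.0)
        for name in ERROR_TYPES
    )
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Hypothetical per-category error rates for one page.
    rates = {"miss": 0.10, "misclassification": 0.20, "merge": 0.30,
             "split": 0.40, "false_detection": 0.0}

    # Equal weights: every error category matters the same.
    equal = {name: 1.0 for name in ERROR_TYPES}
    print(round(weighted_error_rate(rates, equal), 3))  # 0.2

    # Hypothetical scenario in which only merged regions are considered harmful.
    merge_only = {"merge": 1.0}
    print(round(weighted_error_rate(rates, merge_only), 3))  # 0.3
```

Setting a category's weight to zero is one way to express an "allowed" error for a given scenario: the error is still counted in the per-category results but no longer affects the overall score.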