
What's up, Europeana Newspapers?


Status update and outlook of Europeana Newspapers for Oceanic Exchanges Workshop Stuttgart, Germany, 8-9 May 2018

Published in: Technology


  1. What's up, Europeana Newspapers? Clemens Neudecker (@cneudecker), Staatsbibliothek zu Berlin – Preußischer Kulturbesitz
  2. A little bit of history • 2012–2015: Europeana Newspapers ICT-PSP project • 31 Dec 2016: The European Library (TEL) closed • 2017: DSI-2/3: migration; Newspapers Collection Plan • July 2018: planned relaunch of Europeana Newspapers as a thematic collection
  3. Main outcomes – TEL Historic Newspapers Portal: http://www.theeuropeanlibrary.org/tel4/newspapers – Deliverables: http://www.europeana-newspapers.eu/public-materials/deliverables/ – Tools: http://www.europeana-newspapers.eu/public-materials/tools/ – Final Report: http://europeananewspapers.github.io/
  4. Data • 1618–2016 • 12 countries • 40 languages • 120 TB • Ca. 1,000 titles • 3.3 million issues
  5. Data • Metadata for more than 20 million pages • 12 million pages processed with OCR • 2 million pages processed with OLR (optical layout recognition) • Most content licensed as Public Domain • All metadata licensed under CC0 • Copyright cut-off date ("copyright cliff of death")
  6. Data • JPEG 2000 images for use with IIPImage server • METS container with embedded MODS for structural and bibliographic metadata • ALTO for OCRed text • EDM for Europeana → Europeana Newspapers METS/ALTO Profile (ENMAP)
  7. OCR/OLR • OCR: ABBYY FineReader Engine 11 – Gothic license per page (A4!) – 4 servers with 8 cores each = 32 processing cores – Average processing time of 5 s per newspaper page • OLR: CCS docWorks – Article separation and page classification – Possibility for post-correction/validation of results
  8. Evaluation • Scenario-based performance evaluation of OCR/OLR using PAGE ground truth • Ground truth dataset: http://primaresearch.org/datasets/ENP • Performance Evaluation Report: http://www.europeana-newspapers.eu/wp-content/uploads/2015/05/D3.5_Performance_Evaluation_Report_1.0.pdf
  9. Evaluation (charts; success rates in the order shown on the slide) • Bag of Words OCR Evaluation Per Language (success rate by language setting; language labels not preserved): 82.4%, 85.3%, 80.9%, 75.9%, 67.5%, 83.4%, 84.1%, 68.1%, 93.1%, 57.6%, 87.0%, 68.3%, 76.1%, 82.6%, 54.1%, 32.7% • Bag of Words OCR Evaluation Per Font: Gothic 67.3%, Normal 81.4%, Mixed 64.0% • Layout Analysis Performance Per Evaluation Profile (success rate, harmonic, area-based): Keyword search 79.1%, Phrase search 62.2%, Access via content structure 55.9%, Print/ebook on demand 58.8%, Content-based image retrieval 94.7% • Bag of Words OCR Evaluation, Binarised Image vs. Original Image: NCSR binarisation 74.35%, original image 75.31% • Bag of Words OCR Evaluation, FineReader vs. Tesseract (success rate, count-based): FineReader 75.3%, Tesseract 53.78%
  10. Use in Research
  11. Use in Research • Oceanic Exchanges (Digging Into Data, 2017–2019) • impresso (Swiss National Fund, 2017–2020) • NewsEye (EU H2020, 2018–2020) • CLARIN (EU ERIC) • Europeana Research, Interviews with Researchers • At Scientific Conferences – DAS, ICDAR: Europeana Newspapers Ground Truth – LREC, ACL: Europeana Newspapers NER Corpora
  12. Oceanic Exchanges (Digging Into Data, 2017–2019)
  13. impresso (Swiss National Fund, 2017–2020)
  14. Use in Research • Digital Humanities – DHd AG Newspapers initiated at DHd 2018 – #HacktheNews workshop at DHNord 2018 – Roundtable on newspapers at DHBenelux 2018 • At the Berlin State Library: – University of Regensburg – Technical University of Dortmund – Berlin-Brandenburg Academy of Sciences
  15. Other Activities • Rise of Literacy Generic Services Project • IIIF Newspaper Interest Group – http://iiif.io/community/groups/newspapers/ – https://github.com/IIIF/awesome-iiif#newspapers • TEI SIG Newspapers & Periodicals – https://wiki.tei-c.org/index.php/SIG:Newspapers%26Periodicals
  16. Creative Reuse
  17. Berliner Schlagzeilen • Created as part of Coding da Vinci Berlin 2017 • Twitter bot that tweets daily about the news from 100 years ago • Source code available: https://github.com/shoutrlabs/berliner-schlagzeilen
  18. Altpapier App • Created as part of Coding da Vinci Berlin 2017 • Android (and soon also iOS) app that shows the user newspaper articles and lets them correct errors • Source code: https://github.com/mariabecker/OldNews • Play Store: https://play.google.com/store/apps/details?id=oldnews.de.oldnews
  19. Visualizing European Newspapers • Visualization prototype with a large touch interface composed of multiple screens, made by Sven Charleer of KU Leuven
  20. Future Plans
  21. Europeana Newspapers Thematic Collection
  22. The Situation in Germany • 2012–2015: DFG pilot project "Digitisation of historical newspapers" (master plan, guidelines, etc.) • 2017: Relaunch of the ZDB union catalogue of serials, http://zdb-katalog.de/ • 2017: DFG proposal (SBB, DDB involved), "A national portal for digitised historical newspapers at the German Digital Library" • 2018: DFG call for proposals "Digitisation of historical newspapers"
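
The Data slide lists ALTO as the format for OCRed text. As a rough illustration of what reading such a file can look like, here is a minimal Python sketch that collects the words of each text line. It assumes only the standard ALTO element names `TextLine` and `String` and the `CONTENT` attribute. The sample document, the function names `local_name` and `alto_lines`, and the titles in the sample are made up for this example and are not taken from the Europeana Newspapers data or tools.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily reduced ALTO fragment for illustration only.
# Real files carry coordinates, confidences, styles and much more.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Berliner"/>
            <SP/>
            <String CONTENT="Tageblatt"/>
          </TextLine>
          <TextLine>
            <String CONTENT="Morgen-Ausgabe"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag):
    """Drop the '{namespace}' prefix so the code does not depend on the ALTO version."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return one string per TextLine, joining the CONTENT of its String children."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        words = [
            child.attrib.get("CONTENT", "")
            for child in element
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(words))
    return lines


if __name__ == "__main__":
    for line in alto_lines(SAMPLE_ALTO):
        print(line)
```

Matching on the local element name rather than a fixed namespace is a deliberate simplification, since ALTO namespaces differ between schema versions.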

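The Evaluation slides report "Bag of Words" success rates for OCR. The sketch below shows one simplified way such a measure can be computed: the share of ground-truth word occurrences that also appear in the OCR output, ignoring word order. This is an assumption made for illustration, not the definition used in the Performance Evaluation Report, and the function name `bag_of_words_success_rate` is hypothetical.

```python
from collections import Counter


def bag_of_words_success_rate(ground_truth, ocr_text):
    """Fraction of ground-truth word occurrences found in the OCR text.

    Simplified, order-independent measure (an assumption for this sketch):
    words are split on whitespace and compared as multisets.
    """
    gt_words = Counter(ground_truth.split())
    ocr_words = Counter(ocr_text.split())
    total = sum(gt_words.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each shared word.
    matched = sum((gt_words & ocr_words).values())
    return matched / total


if __name__ == "__main__":
    truth = "the quick brown fox jumps"
    ocr = "the quiok brown fox fox jumps"
    print(bag_of_words_success_rate(truth, ocr))
```

In this example four of the five ground-truth words are recovered, and the duplicated "fox" in the OCR output does not earn extra credit.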