Europeana Newspapers - Data, Tools & Future Plans

Europeana Newspapers Presentation at impresso Project Kick-off & 1st Workshop

  1. 1. Europeana Newspapers Data, Tools & Future Plans Clemens Neudecker Staatsbibliothek zu Berlin @cneudecker
  2. 2. Europeana Newspapers • EU FP7 ICT-PSP Project 2012 – 2015 • www.europeana-newspapers.eu • Main outcomes – TEL Historic Newspapers Portal: http://www.theeuropeanlibrary.org/tel4/newspapers – Deliverables: http://www.europeana-newspapers.eu/public-materials/deliverables/ – Tools: http://www.europeana-newspapers.eu/public-materials/tools/ – Final Report: http://europeananewspapers.github.io/
  3. 3. Data
  4. 4. Data • 1618 – 2016 • 12 countries • 40 languages • 120 TB • Ca. 1,000 titles • 3.3M issues
  5. 5. Data • Metadata for more than 20 million pages • 12 million pages processed with OCR • 2 million pages processed with OLR • Most content licensed as Public Domain • Metadata licensed CC0 • Copyright cut-off date
  6. 6. Data • JPEG 2000 images for use with IIPsrv • METS container with embedded MODS for structural and bibliographic metadata • ALTO for OCRed text • EDM for Europeana • Profile: Europeana Newspapers METS/ALTO Profile (ENMAP)
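
As an illustration of the ALTO layer named on this slide, here is a minimal Python sketch that pulls plain text lines out of an ALTO file. The sample document is hand-written for this example and is not schema-valid or taken from the project data; the function names `local_name` and `alto_lines` are hypothetical. It relies only on the ALTO convention that each word is a `String` element with a `CONTENT` attribute inside a `TextLine`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written ALTO fragment for illustration only.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Berliner"/>
            <SP/>
            <String CONTENT="Tageblatt"/>
          </TextLine>
          <TextLine>
            <String CONTENT="Morgen-Ausgabe"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text: str) -> list:
    """Return one string per TextLine, joining the CONTENT of its String children."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            words = [
                child.get("CONTENT", "")
                for child in element
                if local_name(child.tag) == "String"
            ]
            lines.append(" ".join(words))
    return lines


if __name__ == "__main__":
    for line in alto_lines(SAMPLE_ALTO):
        print(line)
```

Matching on the local element name rather than a fixed namespace is a design choice for this sketch, so the same code can read files that declare different ALTO schema versions.
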
  7. 7. Data • Portals – http://www.theeuropeanlibrary.org/tel4/newspapers – http://europeana.eu/portal/search.html?query=europeana_collectionName%3A92*ewspapers*&rows=24&qt=false • Downloads – https://pro.europeana.eu/itemtype/newspapers – http://test-solr-mongo.eanadev.org/europeana-research-newspapers-dump/
  8. 8. Tools & Technologies
  9. 9. Preprocessing • Adaptive binarization to reduce overall image file size and processing time (a >90% reduction in data volume at the cost of <1% lower OCR accuracy) • Creation of tiled JPEG 2000 files for zooming, using GraphicsMagick and Kakadu • An easy-to-use set of preprocessing tools that also validate and harmonize input data for efficient OCR/OLR processing
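
The slide does not say which adaptive binarization method the project used. As a rough sketch of the general idea only, the Python below implements local-mean thresholding: each pixel is compared with the average of its neighbourhood instead of one global threshold, which helps with unevenly lit or stained pages. The function name and the `radius` and `offset` parameters are assumptions made for this example.

```python
def adaptive_binarize(gray, radius=3, offset=10):
    """Binarize a grayscale image given as a list of rows with values 0-255.

    A pixel becomes black (0) when it is more than `offset` darker than the
    mean of its (2*radius+1) x (2*radius+1) neighbourhood, otherwise white (255).
    Neighbourhoods are clipped at the image border.
    """
    height, width = len(gray), len(gray[0])

    # Integral image with a zero border:
    # integral[y][x] holds the sum of all pixels in gray[0:y][0:x].
    integral = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += gray[y][x]
            integral[y + 1][x + 1] = integral[y][x + 1] + row_sum

    result = []
    for y in range(height):
        y0, y1 = max(0, y - radius), min(height, y + radius + 1)
        out_row = []
        for x in range(width):
            x0, x1 = max(0, x - radius), min(width, x + radius + 1)
            area = (y1 - y0) * (x1 - x0)
            # Sum of the neighbourhood from four integral-image lookups.
            total = (integral[y1][x1] - integral[y0][x1]
                     - integral[y1][x0] + integral[y0][x0])
            local_mean = total / area
            out_row.append(0 if gray[y][x] < local_mean - offset else 255)
        result.append(out_row)
    return result


if __name__ == "__main__":
    # A bright 5x5 "page" with one dark "ink" pixel in the centre.
    page = [[200] * 5 for _ in range(5)]
    page[2][2] = 50
    for row in adaptive_binarize(page, radius=1):
        print(row)
```

A one-bit image of this kind compresses far better than the grayscale original, which is the effect behind the data-volume reduction reported on the slide.
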
  10. 10. OCR/OLR • OCR: ABBYY FineReader Engine 11 – Gothic license per page (A4!) – 4 servers with 8 cores = 32 processing cores – Average processing time of 5s per newspaper page • OLR: CCS docWorks – Article separation & page classification – Possibility for post-correction/validation of results
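
To put the throughput figures in perspective, the following Python sketch does the back-of-the-envelope arithmetic. The slide does not state whether 5 s is the time one core spends on a page or the aggregate rate of the whole cluster, so both readings are computed; perfect parallel scaling is assumed, and the 12 million page count is taken from the earlier Data slide. The function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def ocr_wall_clock_days(pages, seconds_per_page, cores):
    """Estimate wall-clock days, assuming pages are spread evenly over `cores`."""
    total_core_seconds = pages * seconds_per_page
    return total_core_seconds / cores / SECONDS_PER_DAY


if __name__ == "__main__":
    pages = 12_000_000  # pages processed with OCR, per the Data slide

    # Reading 1 (assumption): 5 s is per page on a single core, 32 cores in parallel.
    print(round(ocr_wall_clock_days(pages, 5, 32), 1))  # 21.7

    # Reading 2 (assumption): 5 s is the cluster-wide average per page.
    print(round(ocr_wall_clock_days(pages, 5, 1), 1))   # 694.4
```
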
  11. 11. NER • Stanford CoreNLP Named Entity Recognition (Conditional Random Fields) • Adapted for METS/ALTO processing • Added ALTO v3 (tags) output • https://github.com/EuropeanaNewspapers/ner-app • Annotated training & evaluation data • 100 pages each for (historical) German, French, Dutch • https://github.com/EuropeanaNewspapers/ner-corpora
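
As a sketch of how tag-based named entity output might be consumed, the Python below reads entities from an ALTO document that uses the `Tags` section, where `NamedEntityTag` elements are referenced from words through the `TAGREFS` attribute. The sample XML is hand-written, not schema-valid, and only assumed to resemble what the ner-app writes; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written ALTO v3 fragment for illustration only.
SAMPLE_TAGGED_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Tags>
    <NamedEntityTag ID="NE1" TYPE="Person" LABEL="Otto von Bismarck"/>
    <NamedEntityTag ID="NE2" TYPE="Location" LABEL="Berlin"/>
  </Tags>
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Otto" TAGREFS="NE1"/>
    <String CONTENT="von" TAGREFS="NE1"/>
    <String CONTENT="Bismarck" TAGREFS="NE1"/>
    <String CONTENT="in"/>
    <String CONTENT="Berlin" TAGREFS="NE2"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""


def strip_namespace(tag):
    """Return the element name without its '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def named_entities(xml_text):
    """Return (entity type, surface text) pairs in order of first appearance."""
    root = ET.fromstring(xml_text)
    tag_types = {}   # tag ID -> entity type declared in the Tags section
    tokens = {}      # tag ID -> words that reference it, in document order
    for element in root.iter():
        name = strip_namespace(element.tag)
        if name == "NamedEntityTag":
            tag_types[element.get("ID")] = element.get("TYPE")
        elif name == "String" and element.get("TAGREFS"):
            # TAGREFS may list several space-separated tag IDs.
            for ref in element.get("TAGREFS").split():
                tokens.setdefault(ref, []).append(element.get("CONTENT", ""))
    return [
        (tag_types[ref], " ".join(words))
        for ref, words in tokens.items()
        if ref in tag_types
    ]


if __name__ == "__main__":
    for entity_type, text in named_entities(SAMPLE_TAGGED_ALTO):
        print(entity_type, text)
```
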
  12. 12. Evaluation • Scenario-based performance evaluation of OCR/OLR using PAGE ground truth • Ground truth dataset: http://primaresearch.org/datasets/ENP • Performance Evaluation Report: http://www.europeana-newspapers.eu/wp-content/uploads/2015/05/D3.5_Performance_Evaluation_Report_1.0.pdf
  13. 13. Evaluation
  14. 14. IIIF • International Image Interoperability Framework (iiif.io) for online presentation and aggregation • Implementing Image API and Presentation API • Europeana IIIF Task Force: https://pro.europeana.eu/post/iiif-adoption-by-europeana-future-perspectives-for-the-network-1
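
For readers unfamiliar with the IIIF Image API, a request is a URL of the form `{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}`. The Python sketch below builds such URLs. The server address and identifier are placeholders, not real Europeana endpoints, and the default `size="max"` follows Image API 3.0 (older servers may expect `full` instead).

```python
from urllib.parse import quote


def iiif_image_url(base, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Build a IIIF Image API request URL.

    The identifier is percent-encoded (including any '/') as the
    specification requires; the other path segments are passed through as given.
    """
    encoded_id = quote(identifier, safe="")
    return f"{base.rstrip('/')}/{encoded_id}/{region}/{size}/{rotation}/{quality}.{fmt}"


def iiif_info_url(base, identifier):
    """Build the URL of the image's info.json description document."""
    return f"{base.rstrip('/')}/{quote(identifier, safe='')}/info.json"


if __name__ == "__main__":
    base = "https://iiif.example.org/image"  # placeholder server
    page = "newspaper/page 1"                # placeholder identifier

    # Whole page at the largest size the server offers.
    print(iiif_image_url(base, page))
    # A 500x300 pixel crop starting at (100, 200), scaled to 250 pixels wide.
    print(iiif_image_url(base, page, region="100,200,500,300", size="250,"))
    print(iiif_info_url(base, page))
```

Because every crop and zoom level is addressable by URL, a newspaper viewer can fetch only the tiles it needs from the image server.
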
  15. 15. Future plans
  16. 16. Future plans • Migration of data from TEL (closed 12/2016) to new Europeana Thematic Collections http://europeana.eu/portal/ • Re-develop Newspapers API • Re-develop search & browse interface • Add new newspaper content • Create virtual exhibitions & browse entry points
  17. 17. Future Plans https://acceptance-npc.eanadev.org/portal/de/collections/newspapers
  18. 18. Future plans • Automatic OCR error correction • Improved newspaper layout analysis • Named Entity Recognition, Disambiguation and Linking (Wikidata) • Extraction and classification of image content • Deep semantic structuring of newspapers • User corrections and annotations
  19. 19. Collaboration with Researchers • Interviews with researchers • Europeana Research • CLARIN • Viral Texts • Oceanic Exchanges • DDB • impresso
  20. 20. Coding da Vinci https://codingdavinci.de/
  21. 21. Thank you for your attention! Questions? Clemens Neudecker Staatsbibliothek zu Berlin @cneudecker