Europeana Newspapers - the Gateway to European Newspapers Online

IFLA 2013 Satellite Meeting on Newspaper & Genloc Sections, Science Centre Singapore, 14-15 August 2013, Singapore.

  1. Europeana Newspapers: The Gateway to European Newspapers Online • IFLA 2013 Satellite Meeting on Newspaper & Genloc Sections, Singapore, 14 August 2013 • Clemens Neudecker, @cneudecker
  2. Overview • Objectives • Overview of Dataset • Workflows & Technologies • Questions & Answers. This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp. Image: Nationaal Archief The Netherlands
  3. Objectives • Refinement of 10 million pages with OCR, OLR, NER • Ingestion of metadata for 18 million pages in Europeana • Create a full-text content browser for newspapers • Create a unified METS/ALTO profile (ENMAP) • Produce tools to ease creation of ENMAP objects • Share best practices and provide recommendations
  4. Who • 12 content providers • 2 networking partners • 4 technology providers • 1 aggregator
  5. Recently associated
  6. The data
  7. Europeana Newspapers Dataset (1)
  8. Europeana Newspapers Dataset (2)
  9. Europeana Newspapers Dataset (3)
  10. Europeana Newspapers Dataset (4)
  11. The workflow
  12. OCR @ UIBK • OCR = Optical Character Recognition • Technologies: ABBYY FineReader SDK • State-of-the-art OCR software, fully supports Fraktur/Latin/Cyrillic fonts • METS/ALTO package containing images, metadata & full text
  13. OLR @ CCS • OLR = Optical Layout Recognition • Technologies: docWorks • Separation of columns, articles, headlines, page classes • METS/ALTO package containing images, metadata & full text
  14. NER @ KB • NER = Named Entity Recognition • Technologies: Stanford CRF-NER • Open source: https://github.com/KBNLresearch/europeananp-ner • Detection of named entities: Person, Location, Organization
  15. QA @ PRImA • Layout and OCR evaluation • Technologies: Ground truth + Evaluation Tools (IMPACT) • In-depth, scenario-driven evaluation using profiles with more than 600 metrics
  16. Full-text search @ TEL • Blog: www.europeana-newspapers.eu • Workshop: 16 Sept. 2013 (Amsterdam)
  17. Thank you for your attention! clemens.neudecker@kb.nl
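
The OCR and OLR slides (12 and 13) both describe their output as a METS/ALTO package containing full text. As a rough illustration of what reading such full text might look like, here is a minimal Python sketch that pulls the text lines out of an ALTO file. It is not the project's tooling: the sample fragment is hand-written for this example, the ALTO v2 namespace is an assumption about which schema version is used, and the names local_name and alto_lines are hypothetical helpers.

```python
import xml.etree.ElementTree as ET

# Hand-written ALTO fragment for illustration only (not project data).
# The v2 namespace below is an assumption; real files may use another version.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Europeana"/>
            <SP/>
            <String CONTENT="Newspapers"/>
          </TextLine>
          <TextLine>
            <String CONTENT="Singapore"/>
            <SP/>
            <String CONTENT="2013"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return one string per ALTO TextLine, joining its String elements."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        # Each recognised word is a String element with a CONTENT attribute.
        words = [
            child.get("CONTENT", "")
            for child in elem
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(words))
    return lines


if __name__ == "__main__":
    for line in alto_lines(SAMPLE_ALTO):
        print(line)
```

Matching on the local element name rather than a fixed namespace is a design choice made here so the sketch does not depend on a specific ALTO schema version; a production pipeline would more likely validate against the agreed profile (ENMAP in this project).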
