ENP Belgrade WS refinement introduction


  1. Europeana Newspapers - Refinement Workshop
     WP2 – Introduction to Refinement
     Belgrade, 13 June 2013
     Clemens Neudecker (@cneudecker)
  2. Overview
       • Objectives & Challenges
       • Overview of Refinement Dataset
       • Introduction to Refinement: Workflow & Technologies
       • Questions & Answers
     This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community. http://ec.europa.eu/ict_psp
  3. Objectives
       - Analysis of available digital newspaper collections of project partners and identification of subsets suitable for refinement
       - Definition of requirements and minimum quality of digitised newspapers for refinement to enable advanced services in Europeana
       - Coordination of the scalable processing of 10 million digitised newspaper pages with several refinement technologies
       - Providing recommendations on best practices for the refinement of digitised newspaper collections with full-text (and ingest to Europeana)
  4. Challenges
       • Processing quality vs. speed/throughput
       • Volume of data requires focus on simple & standardised workflow with clear checkpoints
       • Diverse partners supplying content with different digitisation & access policies
       • Large variety of content in terms of file formats, fonts, languages, etc.
  5. The data
  6. Europeana Newspapers Dataset (1)
  7. Europeana Newspapers Dataset (2)
  8. Europeana Newspapers Dataset (3)
  9. Europeana Newspapers Dataset (4)
  10. Refinement Workflow steps
  11. Master List
  12. Tools (BCT)
       • BCT = Binarisation and Colour Reduction Tool
       • Purpose: Convert grey/colour scans to bitonal using highly optimised GPP method
       • Background: Reduce total file size of master images to guarantee feasibility and timing of data transfers
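
To illustrate what converting a grey scan to bitonal involves, here is a minimal sketch in Python. It does not reproduce the GPP method named on the slide, whose details are not given here; it swaps in Otsu's global threshold, a standard textbook technique. The function names `otsu_threshold` and `binarise` are hypothetical, and the image is assumed to be a plain list of rows of 0-255 grey values.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that maximises between-class variance.

    This is Otsu's method, used here as a stand-in; it is NOT the GPP
    method mentioned on the slide.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    sum_bg = 0
    weight_bg = 0
    best_t, best_var = 0, -1.0
    for t in range(256):
        weight_bg += hist[t]
        if weight_bg == 0:
            continue  # no pixels at or below this level yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is already in the dark class
        sum_bg += t * hist[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        var = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarise(image):
    """Map a greyscale image (list of rows of 0-255 ints) to 0/1, 1 = ink."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[1 if p <= t else 0 for p in row] for row in image]


if __name__ == "__main__":
    # Tiny made-up scan: dark values are ink, bright values are paper.
    scan = [[20, 30, 200], [210, 25, 220]]
    print(binarise(scan))
```

A bitonal image needs one bit per pixel instead of eight (grey) or twenty-four (colour), which is where the reduction in transfer volume comes from, before any compression is applied.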
  13. Tools (FRT)
       • FRT = File Rename Tool
       • Purpose: Support content holders in preparing their data in the correct format
       • Background: Ensure folder structure and file naming requirements for automated processing are met
  14. Tools (FAT)
       • FAT = File Analyzer Tool
       • Purpose: Final quality check of data before refinement
       • Background: Assure content and refinement partners that all preparation steps have been executed successfully
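
The kind of pre-refinement check described for FRT and FAT can be sketched as follows. The project's actual folder structure and naming rules are not stated in these slides, so the convention below (pages named `0001.tif`, `0002.tif`, ... inside one issue folder) is purely an assumption for illustration, as is the function name `check_issue_folder`.

```python
import re

# ASSUMPTION: hypothetical naming rule, four-digit page number plus ".tif".
# The real project requirements are not given in the slides.
PAGE_NAME = re.compile(r"^(\d{4})\.tif$")


def check_issue_folder(filenames):
    """Return a list of human-readable problems for one issue folder."""
    problems = []
    numbers = []
    for name in sorted(filenames):
        match = PAGE_NAME.match(name)
        if match is None:
            problems.append(f"unexpected file name: {name}")
        else:
            numbers.append(int(match.group(1)))
    if not numbers:
        problems.append("no page images found")
        return problems
    # Report gaps in the page sequence 1..max.
    present = set(numbers)
    for n in range(1, max(numbers) + 1):
        if n not in present:
            problems.append(f"missing page: {n:04d}.tif")
    return problems


if __name__ == "__main__":
    for problem in check_issue_folder(["0001.tif", "0002.tif", "0004.tif", "scan.TIF"]):
        print(problem)
```

Working on a list of names rather than on the file system keeps the check easy to test; a real run would feed it the directory listing of each issue folder.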
  15. Refinement: OCR@UIBK
       • OCR = Optical Character Recognition
       • Number of pages to be refined: 8 million
       • Technologies: ABBYY FineReader SDK
       • State-of-the-art OCR software, fully supports Fraktur/Latin/Cyrillic fonts
       • Result: METS/ALTO package containing images, metadata & full text
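
As a rough illustration of what the full text in an ALTO file looks like to a consumer, the sketch below reads the words of each `TextLine` from the `CONTENT` attribute of its `String` children. The sample document is hand-written for this example and is not project output; the helper names `local` and `alto_lines` are hypothetical. Tags are matched by local name because the ALTO namespace URI differs between schema versions.

```python
import xml.etree.ElementTree as ET

# Hand-made ALTO fragment for illustration only (not real project data).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page ID="P1"><PrintSpace>
    <TextBlock ID="TB1">
      <TextLine>
        <String CONTENT="Europeana" WC="0.98"/><SP/>
        <String CONTENT="Newspapers" WC="0.91"/>
      </TextLine>
      <TextLine><String CONTENT="Belgrade" WC="0.87"/></TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return the recognised text, one string per ALTO TextLine."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local(element.tag) == "TextLine":
            words = [
                child.get("CONTENT", "")
                for child in element
                if local(child.tag) == "String"
            ]
            lines.append(" ".join(words))
    return lines


if __name__ == "__main__":
    for line in alto_lines(SAMPLE_ALTO):
        print(line)
```

The `WC` attribute in the sample is ALTO's per-word confidence value; a quality check could average it per page, though no such threshold is given in the slides.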
  16. Refinement: OLR@CCS
       • OLR = Optical Layout Recognition
       • Number of pages to be refined: 2 million
       • Technologies: docWorks
       • Separation of columns, articles, headlines, page classes
       • Result: METS/ALTO package containing images, metadata & full text
  17. Refinement: NER@KB
       • NER = Named Entity Recognition
       • Number of pages to be refined: 2 million
       • Technologies: Stanford CRF-NER
       • Languages supported: German, Dutch, English (+ French, Latvian)
       • Open source: https://github.com/KBNLresearch/europeananp-ner
       • Detection of named entities: Person, Location, Organization
       • Feedback cycle with manual training step → better results
  18. Thank you for your attention - Now come ask me almost anything!
      clemens.neudecker@kb.nl
