Refinement


  1. Europeana Newspapers - Turkish Information Day. WP2 - Refinement. Ankara, 3 May 2013. Clemens Neudecker (@cneudecker)
  2. This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community. http://ec.europa.eu/ict_psp
     Overview
     • Objectives & Challenges
     • Introduction to Refinement Dataset
     • Overview of Refinement Workflow & Tools
     • Refinement with OCR
     • Refinement with OLR
     • Refinement with NER
     • Short summary
     • Questions & Answers
  3. Objectives & Challenges
  4. Objectives
     - Analysis of available digital newspaper collections at project partners and selection of subsets suitable for refinement
     - Definition of requirements and minimum quality of digitized newspapers for refinement and advanced services in Europeana
     - Coordinate timely processing of 10 million newspaper pages provided by libraries with several refinement technologies
     - Provide recommendations on best practices for refinement of digitized newspaper collections for full-text ingest to Europeana
  5. Challenges
     • Processing quality vs. speed/throughput
     • Volume of data requires focus on simple & strictly followed workflows with checkpoints on progress
     • Large number of partners supplying content with different digitisation & access policies
     • Large variety of content in terms of file formats, fonts, languages
  6. Refinement Dataset
  7. Initial dataset
  8. Master List: https://sp.uibk.ac.at/sites/eu-news/Refinement/Lists/MasterList/AllItems.aspx
  9. Europeana Newspapers Dataset (1)
  10. Europeana Newspapers Dataset (2)
  11. Europeana Newspapers Dataset (3)
  12. Europeana Newspapers Dataset (4)
  13. Workflow & Tools
  14. Refinement Workflow steps
  15. Tools (BCT.1)
      • BCT = Binarisation and Colour Reduction Tool
      • Produced by UIBK as a Windows EXE tool with GUI
      • Purpose: Convert grey/colour scans to bitonal using a special method from Gatos/Pratikakis/Perantonis (GPP)
      • Background: Need to reduce total file size of master images to guarantee feasibility and timing of data transfers
  16. Tools (BCT.2)
  17. Tools (BCT.3)
      • Internally wraps the GraphicsMagick tool to create lower-resolution images for viewing in the content browser
      • Integration of Kakadu for JPEG 2000 support is being discussed
      • Using the GPP method, next to no decrease in OCR accuracy was observed when using bitonal images for OCR rather than grey/colour (in a small test, accuracy even went up from 72% to 83%)
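To illustrate what converting a grey scan to bitonal involves, here is a minimal sketch in Python. It uses a plain local-mean threshold, not the GPP method used by the BCT; the function name, window size and offset are assumptions chosen for illustration only.

```python
def binarise_local_mean(grey, window=1, offset=10):
    """Return a bitonal copy of `grey` (rows of 0-255 ints): 0 = ink, 1 = paper.

    A pixel counts as ink if it is darker than the mean of its
    neighbourhood minus `offset`. Simple stand-in, NOT the GPP algorithm.
    """
    height, width = len(grey), len(grey[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the neighbourhood, clipped at the image border.
            ys = range(max(0, y - window), min(height, y + window + 1))
            xs = range(max(0, x - window), min(width, x + window + 1))
            values = [grey[j][i] for j in ys for i in xs]
            local_mean = sum(values) / len(values)
            row.append(0 if grey[y][x] < local_mean - offset else 1)
        out.append(row)
    return out


# Tiny invented "scan": one dark pixel on a light background.
page = [
    [200, 200, 200, 200],
    [200,  40, 200, 200],
    [200, 200, 200, 200],
]
bitonal = binarise_local_mean(page)
```

Before compression, each output pixel needs one bit instead of eight (grey) or 24 (colour), which is the kind of file-size reduction that motivates the tool.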
  18. Tools (FRT.1)
      • FRT = File Rename Tool
      • Produced by UIBK as a Windows EXE tool with GUI
      • Purpose: Support content holders in preparing their data in the correct structure required for large-scale processing by refinement partners
  19. Tools (FRT.2)
  20. Tools (FRT.3)
      • Simplifies batch renaming of files and folders according to the project delivery specification
      • Visual checks in the tool interface help spot issues that still have to be corrected
      • Highlights possible errors and conflicts to the user
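Batch renaming with highlighted problems, as the FRT slides describe it, can be sketched as follows. The target pattern (`<title-id>_<date>_<page>.<ext>`) and the list of allowed extensions are hypothetical and are not the project delivery specification; `plan_renames` is an illustrative name, and the sketch only plans renames rather than touching the file system.

```python
ALLOWED_EXTENSIONS = {"tif", "tiff", "jp2"}  # assumption for this sketch


def plan_renames(files, title_id, issue_date):
    """Return (plan, issues) for the files of one newspaper issue.

    `plan` maps old file names to new ones in sorted page order;
    `issues` lists files that do not look like page images.
    """
    plan, issues = {}, []
    page_no = 0
    for old_name in sorted(files):
        stem, dot, ext = old_name.rpartition(".")
        if not dot or ext.lower() not in ALLOWED_EXTENSIONS:
            # Flag the file for the user instead of renaming it.
            issues.append(f"unexpected file type: {old_name}")
            continue
        page_no += 1
        plan[old_name] = f"{title_id}_{issue_date}_{page_no:04d}.{ext.lower()}"
    return plan, issues


plan, issues = plan_renames(
    ["scan_002.TIF", "scan_001.TIF", "Thumbs.db"],
    title_id="np001",
    issue_date="1913-05-03",
)
```

Separating the planning step from the actual rename lets a user review the proposed names and the flagged files before anything on disk changes.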
  21. Tools (FAT.1)
      • FAT = File Analyzer Tool
      • Produced by UIBK as a Windows EXE tool with GUI
      • Purpose: Final quality check of data preparation
      • FAT analyses the final data (images & metadata) prepared by the content holder for refinement and checks whether all necessary data preparation steps have been successfully completed
  22. Tools (FAT.2)
  23. Tools (FAT.3)
      • Verifies metadata against data available in the Master List
      • Verifies file & folder structure against the project specification
      • Produces a log and XML information about the data, plus provenance information about the processing
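A structure check of this kind, with an XML report as output, can be sketched in a few lines. The expected layout (one folder per issue, each holding at least one image and a `metadata.xml`) and the report element names are assumptions for illustration; they are neither the project specification nor the FAT's actual output format, and the comparison against the Master List is left out.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

IMAGE_SUFFIXES = {".tif", ".tiff", ".jp2"}  # assumption for this sketch


def analyse_delivery(root):
    """Check every issue folder under `root`; return an XML report element."""
    report = ET.Element("report", root=str(root))
    for issue_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        images = [f for f in issue_dir.iterdir()
                  if f.suffix.lower() in IMAGE_SUFFIXES]
        problems = []
        if not images:
            problems.append("no images found")
        if not (issue_dir / "metadata.xml").is_file():
            problems.append("metadata.xml missing")
        issue = ET.SubElement(
            report, "issue",
            name=issue_dir.name,
            images=str(len(images)),
            status="ok" if not problems else "failed",
        )
        for text in problems:
            ET.SubElement(issue, "problem").text = text
    return report


# Demo on a throwaway directory: one complete issue, one empty folder.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp, "1913-05-03")
    good.mkdir()
    (good / "0001.tif").touch()
    (good / "metadata.xml").touch()
    Path(tmp, "1913-05-04").mkdir()
    report = analyse_delivery(tmp)
    statuses = {issue.get("name"): issue.get("status") for issue in report}
```

`ET.tostring(report, encoding="unicode")` turns the result into a string that could be written next to a plain-text log.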
  24. Refinement: OCR
  25. Refinement: OCR
      • OCR = Optical Character Recognition
      • Executing organisation: University of Innsbruck (UIBK)
      • Number of pages to be refined: 8 million
      • Technologies: ABBYY FineReader SDK
      • State-of-the-art OCR software, fully supports Fraktur/Arabic/Cyrillic fonts
  26. OCR processing at UIBK
  27. Refinement: OLR
  28. Refinement: OLR
      • OLR = Optical Layout Recognition
      • Executing organisation: Content Conversion Specialists (CCS)
      • Number of pages to be refined: 2 million
      • Technologies: docWorks
      • Columns, articles, headlines, page classification
  29. OLR processing at CCS
      Three ways are offered to libraries for doing the OLR process with CCS:
      1. Fully on-site at the library (requires local installation of docWorks)
      2. Conversion off-shore, QA at the library via internet connection
      3. Conversion off-shore, QA at the library via backup shipment
  30. Refinement: NER
  31. Refinement: NER
      • NER = Named Entity Recognition
      • Executing organisation: Koninklijke Bibliotheek
      • Number of pages to be refined: > 2 million
      • Technologies: Stanford CRF-NER
      • Languages: German, Dutch, English, (French)
      • Open source available: https://github.com/KBNLresearch/europeananp-ner
      • Named entities: Person, Location, Organization
  32. NER processing at KB
      1. UIBK/CCS complete refinement with OCR/OLR
      2. Data (OCR, images, metadata) sent via hard disk to Europeana/TEL and KB in the ENMAP package format
      3. KB NER-Tool extracts references to the OCR files from the ENMAP package
      4. OCR files (ALTO) are processed with the Stanford CRF-NER algorithm
      5. Detected named entities can be exported in a variety of output formats
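Steps 4 and 5 depend on reading words from ALTO together with their positions on the page. The sketch below, which is not the KB NER-Tool's code, reads `String` elements from an ALTO document and keeps each word's `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` next to an entity label. The `tag_tokens` function is a hypothetical stand-in for the Stanford CRF-NER classifier (it uses a tiny lookup table), and the sample XML is invented.

```python
import xml.etree.ElementTree as ET

# Invented two-word ALTO fragment, for illustration only.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Ankara" HPOS="100" VPOS="50" WIDTH="80" HEIGHT="20"/>
    <String CONTENT="today" HPOS="190" VPOS="50" WIDTH="60" HEIGHT="20"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""


def read_tokens(alto_xml):
    """Yield one dict per ALTO String element, keeping its coordinates."""
    for elem in ET.fromstring(alto_xml).iter():
        # Compare local names so any ALTO namespace version is accepted.
        if elem.tag.rsplit("}", 1)[-1] == "String":
            yield {
                "text": elem.get("CONTENT"),
                "box": tuple(
                    float(elem.get(key))
                    for key in ("HPOS", "VPOS", "WIDTH", "HEIGHT")
                ),
            }


def tag_tokens(tokens, gazetteer):
    """Hypothetical stand-in for the classifier: label tokens by lookup."""
    return [dict(tok, label=gazetteer.get(tok["text"], "O")) for tok in tokens]


tagged = tag_tokens(read_tokens(SAMPLE_ALTO), {"Ankara": "LOCATION"})
entities = [tok for tok in tagged if tok["label"] != "O"]
```

Because the label travels with the bounding box, a detected entity can later be written to any output format without losing its position on the page.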
  33. Issues encountered
      1. Issue: Amount of data to be transferred from libraries to refinement partners
         What will be done to address the problem? Reduction of file size by applying optimized GPP binarization
      2. Issue: Storage format for named entities needs to preserve coordinates, but ALTO-XML cannot store semantic information
         What will be done to address the problem? Several alternative storage formats have been implemented; the NER-Tool ensures the word coordinates are retained after processing
      3. Issue: Ottoman language/script currently not supported in OCR software
         What will be done to address the problem? Select only newspapers in Latin alphabet for refinement of NLT content
  34. Thank you for your attention! Questions? clemens.neudecker@kb.nl
