Refinement

  1. Europeana Newspapers - Turkish Information Day. WP2 - Refinement. Ankara, 3 May 2013. Clemens Neudecker (@cneudecker)
  2. Overview
     • Objectives & Challenges
     • Introduction to Refinement Dataset
     • Overview of Refinement Workflow & Tools
     • Refinement with OCR
     • Refinement with OLR
     • Refinement with NER
     • Short summary
     • Questions & Answers
     This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp
  3. Objectives & Challenges
  4. Objectives
     • Analyse the digital newspaper collections available at project partners and select subsets suitable for refinement
     • Define requirements and minimum quality of digitised newspapers for refinement and advanced services in Europeana
     • Coordinate timely processing of 10 million newspaper pages provided by libraries with several refinement technologies
     • Provide recommendations on best practices for refinement of digitised newspaper collections for full-text ingest to Europeana
  5. Challenges
     • Processing quality vs. speed/throughput
     • The volume of data requires a focus on simple, strictly followed workflows with checkpoints on progress
     • Large number of partners supplying content with different digitisation & access policies
     • Large variety of content in terms of file formats, fonts and languages
  6. Refinement Dataset
  7. Initial dataset
  8. Master List: https://sp.uibk.ac.at/sites/eu-news/Refinement/Lists/MasterList/AllItems.aspx
  9. Europeana Newspapers Dataset (1)
  10. Europeana Newspapers Dataset (2)
  11. Europeana Newspapers Dataset (3)
  12. Europeana Newspapers Dataset (4)
  13. Workflow & Tools
  14. Refinement Workflow steps
  15. Tools (BCT.1)
      • BCT = Binarisation and Colour Reduction Tool
      • Produced by UIBK as a Windows EXE tool with GUI
      • Purpose: Convert grey/colour scans to bitonal using a special method by Gatos/Pratikakis/Perantonis (GPP)
      • Background: The total file size of the master images must be reduced to keep data transfers feasible and on schedule
  16. Tools (BCT.2)
  17. Tools (BCT.3)
      • Internally wraps the GraphicsMagick tool to create lower-resolution images for viewing in the content browser
      • Integration of Kakadu for JPEG 2000 support is being discussed
      • With the GPP method, next to no decrease in OCR accuracy was observed when bitonal rather than grey/colour images were used for OCR (in one small test, accuracy even rose from 72% to 83%)
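The BCT slides above describe converting grey/colour scans to bitonal images in order to shrink the master files. The slides do not spell out the GPP method, so the sketch below swaps in a much simpler global technique, Otsu thresholding, purely to illustrate what "bitonal" means and why it saves space. It is not what BCT does internally; the function names and the tiny sample "image" are assumptions made for this illustration.

```python
def otsu_threshold(pixels):
    """Return the grey level t (0..255) that maximises between-class variance.

    Pixels <= t are treated as dark (ink), pixels > t as light (paper).
    NOTE: this is plain Otsu thresholding, NOT the GPP method used by BCT.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    n_dark = 0      # pixels at or below the candidate threshold
    sum_dark = 0    # sum of their grey levels
    for t in range(256):
        n_dark += hist[t]
        sum_dark += t * hist[t]
        n_light = total - n_dark
        if n_dark == 0 or n_light == 0:
            continue  # one class is empty, variance undefined
        mean_dark = sum_dark / n_dark
        mean_light = (sum_all - sum_dark) / n_light
        between = n_dark * n_light * (mean_dark - mean_light) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


def binarise(pixels, threshold):
    """Map each 8-bit grey value to one bit: 0 = black, 1 = white."""
    return [0 if p <= threshold else 1 for p in pixels]


def pack_bits(bits):
    """Pack bits into bytes, 8 pixels per byte, zero-padding the last byte."""
    out = bytearray()
    for i in range(0, len(bits), 8):
        chunk = bits[i:i + 8]
        byte = 0
        for b in chunk:
            byte = (byte << 1) | b
        byte <<= 8 - len(chunk)
        out.append(byte)
    return bytes(out)


# Hypothetical 16-pixel greyscale strip: dark ink next to light paper.
grey = [20, 25, 30, 22, 200, 210, 220, 205] * 2
t = otsu_threshold(grey)
packed = pack_bits(binarise(grey, t))
print("threshold:", t)
print("bytes at 8 bits/pixel:", len(grey), "-> bytes at 1 bit/pixel:", len(packed))
```

Before any compression is applied, one bit per pixel instead of eight already cuts the raw size by a factor of eight, which is the kind of saving that makes shipping master images between partners more practical.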
  18. Tools (FRT.1)
      • FRT = File Rename Tool
      • Produced by UIBK as a Windows EXE tool with GUI
      • Purpose: Support content holders in preparing their data in the structure required for large-scale processing by refinement partners
  19. Tools (FRT.2)
  20. Tools (FRT.3)
      • Simplifies batch renaming of files and folders according to the project delivery specification
      • Visual checks in the tool interface help spot issues that still have to be corrected
      • Highlights possible errors and conflicts to the user
  21. Tools (FAT.1)
      • FAT = File Analyzer Tool
      • Produced by UIBK as a Windows EXE tool with GUI
      • Purpose: Final quality check of data preparation
      • FAT analyses the final data (images & metadata) prepared by the content holder for refinement and checks whether all necessary data preparation steps have been completed successfully
  22. Tools (FAT.2)
  23. Tools (FAT.3)
      • Verifies metadata against the data available in the Master List
      • Verifies file & folder structure against the project specification
      • Produces log and XML information about the data, plus provenance information about the processing
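The kind of check FAT performs (file and folder structure against a specification, title identifiers against the Master List, results written as XML) can be sketched in a few lines. The slides do not give the actual project delivery specification or FAT's report format, so everything concrete below is hypothetical: the `<title-id>/<YYYY-MM-DD>/<4-digit page>.tif` layout, the `check_delivery` function and the element names of the report are invented for this illustration.

```python
import re
import xml.etree.ElementTree as ET

# HYPOTHETICAL layout, invented for this sketch (not the project specification):
#   <title-id>/<YYYY-MM-DD>/<4-digit page number>.tif
PAGE_PATTERN = re.compile(r"^[A-Za-z0-9_]+/\d{4}-\d{2}-\d{2}/\d{4}\.tif$")


def check_delivery(paths, known_titles):
    """Check relative file paths and return an XML report element.

    paths        -- iterable of relative paths as delivered by a content holder
    known_titles -- set of title ids, standing in for the Master List
    """
    report = ET.Element("report")
    for path in sorted(paths):
        problems = []
        if not PAGE_PATTERN.match(path):
            problems.append("path does not match the expected layout")
        elif path.split("/")[0] not in known_titles:
            problems.append("title id not listed in the master list")
        item = ET.SubElement(
            report, "file", path=path, status="error" if problems else "ok"
        )
        for text in problems:
            ET.SubElement(item, "problem").text = text
    return report


# Hypothetical delivery with one good file and two problem cases.
delivery = [
    "title_a/1913-05-03/0001.tif",
    "title_a/1913-05-03/page2.tif",   # wrong file name
    "title_b/1913-05-03/0001.tif",    # title not in the master list
]
report = check_delivery(delivery, known_titles={"title_a"})
print(ET.tostring(report, encoding="unicode"))
```

Keeping the rules in one pattern and one lookup set makes it easy to see what is being checked; a real tool would need many more rules and would read the file tree from disk.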
  24. Refinement: OCR
  25. Refinement: OCR
      • OCR = Optical Character Recognition
      • Executing organisation: University of Innsbruck (UIBK)
      • Number of pages to be refined: 8 million
      • Technologies: ABBYY FineReader SDK
      • State-of-the-art OCR software, fully supports Fraktur typefaces and Arabic and Cyrillic scripts
  26. OCR processing at UIBK
  27. Refinement: OLR
  28. Refinement: OLR
      • OLR = Optical Layout Recognition
      • Executing organisation: Content Conversion Specialists (CCS)
      • Number of pages to be refined: 2 million
      • Technologies: docWorks
      • Recognises columns, articles and headlines; classifies pages
  29. OLR processing at CCS
      CCS offers libraries three ways of carrying out the OLR process:
      1. Fully on-site at the library (requires local installation of docWorks)
      2. Conversion off-shore, QA at the library via internet connection
      3. Conversion off-shore, QA at the library via backup shipment
  30. Refinement: NER
  31. Refinement: NER
      • NER = Named Entity Recognition
      • Executing organisation: Koninklijke Bibliotheek (KB)
      • Number of pages to be refined: > 2 million
      • Technologies: Stanford CRF-NER
      • Languages: German, Dutch, English, (French)
      • Open source available: https://github.com/KBNLresearch/europeananp-ner
      • Named entities: Person, Location, Organization
  32. NER processing at KB
      1. UIBK/CCS complete refinement with OCR/OLR
      2. Data (OCR, images, metadata) is sent on hard disk to Europeana/TEL and KB in the ENMAP package format
      3. The KB NER-Tool extracts references to the OCR files from the ENMAP package
      4. OCR files (ALTO) are processed with the Stanford CRF-NER algorithm
      5. Detected named entities can be exported in a variety of output formats
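Steps 4 and 5 of the NER workflow above can be illustrated with a small sketch: read the words and their coordinates from an ALTO file, label some of them as named entities, and keep the coordinates attached to each hit. ALTO stores each word as a `String` element with `CONTENT`, `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` attributes. The sample document, the function names and the output structure are assumptions made for this illustration, and a toy lookup table stands in for the Stanford CRF-NER model; this is not the KB NER-Tool.

```python
import xml.etree.ElementTree as ET

# Minimal hand-written ALTO fragment (invented sample data).
ALTO_SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Ankara" HPOS="100" VPOS="50" WIDTH="120" HEIGHT="30"/>
    <SP/>
    <String CONTENT="welcomes" HPOS="230" VPOS="50" WIDTH="160" HEIGHT="30"/>
    <SP/>
    <String CONTENT="Europeana" HPOS="400" VPOS="50" WIDTH="180" HEIGHT="30"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""

# Toy stand-in for a trained NER model (assumption for this sketch only).
TOY_GAZETTEER = {"Ankara": "LOCATION", "Europeana": "ORGANIZATION"}


def read_words(alto_xml):
    """Return every ALTO String element as a dict with text and bounding box."""
    root = ET.fromstring(alto_xml)
    words = []
    for el in root.iter():
        # Compare the local tag name so any ALTO namespace version is accepted.
        if el.tag.rsplit("}", 1)[-1] == "String":
            box = tuple(
                float(el.get(key)) for key in ("HPOS", "VPOS", "WIDTH", "HEIGHT")
            )
            words.append({"text": el.get("CONTENT"), "box": box})
    return words


def tag_entities(words, gazetteer):
    """Label words found in the gazetteer, keeping their coordinates."""
    return [
        {"text": w["text"], "label": gazetteer[w["text"]], "box": w["box"]}
        for w in words
        if w["text"] in gazetteer
    ]


for entity in tag_entities(read_words(ALTO_SAMPLE), TOY_GAZETTEER):
    print(entity["label"], entity["text"], entity["box"])
```

Because each entity carries its own bounding box, the result can be written to whichever output format is needed without going back to the ALTO file to recover the word positions.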
  33. Issues encountered
      1. Issue: Amount of data to be transferred from libraries to refinement partners
         → What will be done to address the problem? Reduce file size by applying optimised GPP binarisation
      2. Issue: The storage format for named entities needs to preserve coordinates, but ALTO-XML cannot store semantic information
         → What will be done to address the problem? Several alternative storage formats have been implemented; the NER-Tool ensures that word coordinates are retained after processing
      3. Issue: Ottoman language/script is currently not supported by the OCR software
         → What will be done to address the problem? Select only newspapers in the Latin alphabet for refinement of NLT content
  34. Thank you for your attention! Questions? clemens.neudecker@kb.nl
