Refinement of Digitised Newspapers

Refinement
Europeana Newspapers Workshop: A Gateway to European Newspapers Online. Research Information Infrastructures and the Future Role of Libraries.
LIBER 2013 Annual Conference, Bavarian State Library, 26-29 June 2013, Munich, Germany.

Published in: Technology

Refinement of Digitised Newspapers

  1. Europeana Newspapers Workshop: Refinement
     WP2 – Introduction to Refinement
     Munich, 26 June 2013
     Clemens Neudecker (@cneudecker)
  2. Overview
     • Objectives & Challenges
     • Overview of Refinement Dataset
     • Introduction to Refinement: Workflow & Technologies
     • Questions & Answers
     This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp
  3. Objectives
     - Analysis of available digital newspaper collections of project partners and identification of subsets suitable for refinement
     - Definition of requirements and minimum quality of digitised newspapers for refinement to enable advanced services in Europeana
     - Coordination of the scalable processing of 10 million digitised newspaper pages with several refinement technologies
     - Providing recommendations on best practices for the refinement of digitised newspaper collections with full text (and ingest to Europeana)
  4. Challenges
     • Processing quality vs. speed/throughput
     • Volume of data requires focus on simple & standardised workflow with clear checkpoints
     • Diverse partners supplying content with different digitisation & access policies
     • Large variety of content in terms of file formats, fonts, languages, etc.
  5. The data
  6. Europeana Newspaper Dataset (1)
  7. Europeana Newspaper Dataset (2)
  8. Europeana Newspapers Dataset (3)
  9. Europeana Newspapers Dataset (4)
  10. Refinement Workflow steps
  11. Tools (BCT)
     • BCT = Binarisation and Colour Reduction Tool
     • Purpose: Convert grey/colour scans to bitonal using highly optimised GPP method
     • Background: Reduce total file size of master images to guarantee feasibility and timing of data transfers
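
To make the idea of converting grey scans to bitonal concrete, here is a minimal sketch in Python. It uses Otsu's global threshold, a generic textbook technique that stands in for the GPP method named on the slide; it is not the BCT implementation. The function names and the tiny 2x2 test image are illustrative assumptions.

```python
from typing import List


def otsu_threshold(pixels: List[int]) -> int:
    """Return the grey level (0-255) that maximises between-class variance.

    This is Otsu's method, used here only as a stand-in for the
    GPP method mentioned on the slide (assumption).
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        w_bg += hist[t]           # pixels at or below t (background class)
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg       # pixels above t (foreground class)
        if w_fg == 0:
            break
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarise(image: List[List[int]]) -> List[List[int]]:
    """Map a greyscale image (rows of 0-255 values) to 0 (black) / 1 (white)."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[1 if p > t else 0 for p in row] for row in image]


if __name__ == "__main__":
    page = [[20, 200], [220, 30]]   # hypothetical 2x2 greyscale scan
    print(binarise(page))
```

A bitonal image needs one bit per pixel instead of eight (grey) or twenty-four (colour), which is where the file-size reduction mentioned above comes from.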
  12. Tools (FRT)
     • FRT = File Rename Tool
     • Purpose: Support content holders in preparing their data in the correct format
     • Background: Ensure folder structure and file naming requirements for automated processing are met
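
As an illustration of what preparing file names for automated processing can involve, the following Python sketch computes a rename plan without touching the file system. The target convention (zero-padded sequential page numbers such as `0001.tif`) and the function names are hypothetical; the slides do not state the project's actual naming requirements.

```python
import os
import re
from typing import List, Tuple


def natural_key(name: str) -> list:
    """Sort key that orders embedded numbers numerically (page2 before page10)."""
    parts = re.split(r"(\d+)", name)
    # re.split with a capture group alternates text, digits, text, ...
    return [int(p) if p.isdigit() else p.lower() for p in parts]


def plan_renames(filenames: List[str], width: int = 4) -> List[Tuple[str, str]]:
    """Return (old_name, new_name) pairs using a hypothetical convention:
    sequential, zero-padded page numbers with the original extension kept."""
    plan = []
    for index, old in enumerate(sorted(filenames, key=natural_key), start=1):
        _, ext = os.path.splitext(old)
        plan.append((old, f"{index:0{width}d}{ext}"))
    return plan


if __name__ == "__main__":
    for old, new in plan_renames(["page10.tif", "page2.tif", "page1.tif"]):
        print(old, "->", new)
```

Returning a plan rather than renaming directly lets a content holder review the mapping before any file is changed.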
  13. Tools (FAT)
     • FAT = File Analyzer Tool
     • Purpose: Final quality check of data before refinement
     • Background: Assure content and refinement partners that all preparation steps have been executed successfully
  14. Refinement: OCR@UIBK
     • OCR = Optical Character Recognition
     • Number of pages to be refined: 8 million
     • Technologies: ABBYY FineReader SDK
     • State-of-the-art OCR software, fully supports Fraktur/Latin/Cyrillic fonts
     • Result: METS/ALTO package containing images, metadata & full text
  15. OCR → Full text search
     http://www.europeana-newspapers.eu/building-a-content-browser-for-digital-newspapers/
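
Full-text search builds on the text stored in the ALTO part of a METS/ALTO package. The Python sketch below shows one way to pull plain text lines out of an ALTO file using only the standard library. In ALTO, recognised words are `String` elements with a `CONTENT` attribute, grouped under `TextLine`. The sample document and function names are illustrative assumptions, and the code ignores the namespace so that it is not tied to one ALTO version.

```python
import xml.etree.ElementTree as ET
from typing import List

# Tiny hand-written ALTO fragment (illustrative, not project data).
SAMPLE_ALTO = """
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page>
      <PrintSpace>
        <TextBlock>
          <TextLine>
            <String CONTENT="Europeana"/><SP/><String CONTENT="Newspapers"/>
          </TextLine>
          <TextLine>
            <String CONTENT="Workshop"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def alto_text_lines(xml_text: str) -> List[str]:
    """Return one string per ALTO TextLine, words joined by single spaces."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        words = [
            child.get("CONTENT", "")
            for child in element
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(words))
    return lines


if __name__ == "__main__":
    print(alto_text_lines(SAMPLE_ALTO))
```

The resulting lines could then be passed to whatever search index is in use; that step is outside the scope of this sketch.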
  16. Refinement: OLR@CCS
     • OLR = Optical Layout Recognition
     • Number of pages to be refined: 2 million
     • Technologies: docWorks
     • Separation of columns, articles, headlines, page classes
     • Result: METS/ALTO package containing images, metadata & full text
  17. OLR → Article separation
  18. Refinement: NER@KB
     • NER = Named Entity Recognition
     • Number of pages to be refined: 2 million
     • Technologies: Stanford CRF-NER
     • Languages supported: German, Dutch, English (+ French, Latvian)
     • Open source: https://github.com/KBNLresearch/europeananp-ner
     • Detection of named entities: Person, Location, Organization
     • Feedback cycle with manual training step → better results
  19. NER → Browse by names or places
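
Browsing by names or places requires turning per-token NER labels into whole entities. The Python sketch below groups consecutive tokens that share a label, assuming input in a `word/TAG` form with `PERSON`, `LOCATION`, `ORGANIZATION` and `O` (outside) labels. The function name and the sample sentence are illustrative; this is not code from the europeananp-ner repository.

```python
from typing import List, Tuple


def group_entities(tagged: str) -> List[Tuple[str, str]]:
    """Merge consecutive tokens with the same non-'O' tag into one entity.

    Input is assumed to look like 'Munich/LOCATION at/O ...'.
    Limitation: two distinct adjacent entities with the same tag are merged.
    """
    entities: List[Tuple[str, str]] = []
    words: List[str] = []
    current = "O"

    def flush() -> None:
        if current != "O" and words:
            entities.append((" ".join(words), current))

    for token in tagged.split():
        word, sep, tag = token.rpartition("/")
        if not sep:                 # token without a tag: treat as outside
            word, tag = token, "O"
        if tag != current:
            flush()
            words = []
            current = tag
        if tag != "O":
            words.append(word)
    flush()
    return entities


if __name__ == "__main__":
    sample = (
        "Clemens/PERSON Neudecker/PERSON spoke/O in/O Munich/LOCATION "
        "at/O the/O Bavarian/ORGANIZATION State/ORGANIZATION "
        "Library/ORGANIZATION ./O"
    )
    for text, label in group_entities(sample):
        print(label, text)
```

Grouping the output by label gives the lists of persons and locations that a browse interface would present.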
  20. Thank you for your attention!
     clemens.neudecker@kb.nl