Europeana Newspapers wp2 liber2013

Usage Rights

© All Rights Reserved

Europeana Newspapers wp2 liber2013 Presentation Transcript

  • 1. Europeana Newspapers Workshop: Refinement WP2 – Introduction to Refinement Munich, 26 June 2013 Clemens Neudecker (@cneudecker)
  • 2. Overview • Objectives & Challenges • Overview of Refinement Dataset • Introduction to Refinement: Workflow & Technologies • Questions & Answers (Footer shown on every slide: This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community, http://ec.europa.eu/ict_psp)
  • 3. Objectives • Analysis of the project partners' available digital newspaper collections and identification of subsets suitable for refinement • Definition of requirements and minimum quality of digitised newspapers for refinement, to enable advanced services in Europeana • Coordination of the scalable processing of 10 million digitised newspaper pages with several refinement technologies • Recommendations on best practices for refining digitised newspaper collections with full text (and for ingest to Europeana)
  • 4. Challenges • Processing quality vs. speed/throughput • Volume of data requires focus on simple & standardised workflow with clear checkpoints • Diverse partners supplying content with different digitisation & access policies • Large variety of content in terms of file formats, fonts, languages, etc.
  • 5. The data
  • 6. Europeana Newspapers Dataset (1)
  • 7. Europeana Newspapers Dataset (2)
  • 8. Europeana Newspapers Dataset (3)
  • 9. Europeana Newspapers Dataset (4)
  • 10. Refinement Workflow steps
  • 11. Tools (BCT) • BCT = Binarisation and Colour Reduction Tool • Purpose: Convert grey/colour scans to bitonal using highly optimised GPP method • Background: Reduce total file size of master images to guarantee feasibility and timing of data transfers
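
The slide does not describe the GPP method, so the following is only a rough sketch of what binarisation does in general. It uses Otsu's global threshold, which is a different and much simpler technique than the one BCT applies; the function names `otsu_threshold` and `binarise` are illustrative and not part of any project tool.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that maximises between-class variance."""
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_level, best_variance = 0, -1.0
    count_dark = 0   # pixels at or below the candidate threshold
    sum_dark = 0     # sum of their grey values
    for level in range(256):
        count_dark += hist[level]
        sum_dark += level * hist[level]
        if count_dark == 0:
            continue
        count_light = total - count_dark
        if count_light == 0:
            break
        mean_dark = sum_dark / count_dark
        mean_light = (sum_all - sum_dark) / count_light
        variance = count_dark * count_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance, best_level = variance, level
    return best_level


def binarise(image):
    """Map a greyscale image (list of rows of 0-255 ints) to 0 (ink) / 1 (paper)."""
    flat = [value for row in image for value in row]
    threshold = otsu_threshold(flat)
    return [[0 if value <= threshold else 1 for value in row] for row in image]
```

Storing one bit per pixel instead of eight (grey) or twenty-four (colour) is what produces the file size reduction the slide refers to.
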
  • 12. Tools (FRT) • FRT = File Rename Tool • Purpose: Support content holders in preparing their data in the correct format • Background: Ensure folder structure and file naming requirements for automated processing are met
  • 13. Tools (FAT) • FAT = File Analyzer Tool • Purpose: Final quality check of data before refinement • Background: Assure content and refinement partners that all preparation steps have been executed successfully
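
The slides do not spell out the actual folder structure or file naming requirements, so the check below is a sketch against a purely hypothetical convention (`<title-id>/<YYYY-MM-DD>/<4-digit page>.tif`). It only illustrates the kind of pre-flight validation that tools such as FRT and FAT perform; `PAGE_PATTERN` and `find_invalid_paths` are invented names.

```python
import re

# Hypothetical convention, NOT the project's real one:
#   <title-id>/<YYYY-MM-DD>/<4-digit page number>.tif
PAGE_PATTERN = re.compile(r"^[a-z0-9_]+/\d{4}-\d{2}-\d{2}/\d{4}\.tif$")


def find_invalid_paths(paths):
    """Return the relative paths that do not follow the assumed convention."""
    return [path for path in paths if not PAGE_PATTERN.match(path)]


if __name__ == "__main__":
    candidates = [
        "daily_news/1913-06-26/0001.tif",
        "daily_news/1913-06-26/page1.tif",
        "Daily News/1913/0001.tif",
    ]
    for bad in find_invalid_paths(candidates):
        print("does not match naming convention:", bad)
```

Rejecting non-conforming names before transfer is cheaper than discovering them halfway through an automated batch run.
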
  • 14. Refinement: OCR@UIBK • OCR = Optical Character Recognition • Number of pages to be refined: 8 million • Technologies: ABBYY FineReader SDK • State-of-the-art OCR software, fully supports Fraktur/Latin/Cyrillic fonts • Result: METS/ALTO package containing images, metadata & full text
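
As an illustration of how the full text can be read back out of such a package, here is a minimal sketch that pulls the words from an ALTO file using only the Python standard library. It relies on the ALTO convention that each word is a `String` element with a `CONTENT` attribute inside a `TextLine`; the sample document and the function name `alto_to_text` are made up for this example, and hyphenation (`HYP`) elements are ignored.

```python
import xml.etree.ElementTree as ET

# Tiny made-up ALTO fragment; real files carry coordinates, styles and more.
SAMPLE_ALTO = """
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Europeana"/><SP/><String CONTENT="Newspapers"/>
    </TextLine>
    <TextLine>
      <String CONTENT="full"/><SP/><String CONTENT="text"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix so any ALTO schema version matches."""
    return tag.rsplit("}", 1)[-1]


def alto_to_text(alto_xml):
    """Return the page text, one output line per ALTO TextLine."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        words = [
            child.get("CONTENT", "")
            for child in element
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(words))
    return "\n".join(lines)


if __name__ == "__main__":
    print(alto_to_text(SAMPLE_ALTO))
```

Matching on the local tag name avoids hard-coding one namespace, since different ALTO schema versions use different namespace URIs.
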
  • 15. OCR → Full text search http://www.europeana-newspapers.eu/building-a-content-browser-for-digital-newspapers/
  • 16. Refinement: OLR@CCS • OLR = Optical Layout Recognition • Number of pages to be refined: 2 million • Technologies: docWorks • Separation of columns, articles, headlines, page classes • Result: METS/ALTO package containing images, metadata & full text
  • 17. OLR → Article separation
  • 18. Refinement: NER@KB • NER = Named Entity Recognition • Number of pages to be refined: 2 million • Technologies: Stanford CRF-NER • Languages supported: German, Dutch, English (+ French, Latvian) • Open source: https://github.com/KBNLresearch/europeananp-ner • Detection of named entities: Person, Location, Organization • Feedback cycle with manual training step → better results
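
A token-level tagger labels each word separately, so multi-word names have to be reassembled before they can be used for browsing. The sketch below assumes the tagger output has already been parsed into `(token, label)` pairs with `O` marking non-entities; that input shape, the label spellings and the function name `group_entities` are assumptions for illustration and are not taken from the europeananp-ner code.

```python
def group_entities(tagged_tokens, outside="O"):
    """Merge runs of consecutive tokens that share an entity label.

    tagged_tokens: iterable of (token, label) pairs, e.g. ("Munich", "LOCATION").
    Returns a list of (entity_text, label) tuples in document order.
    """
    entities = []
    current_words, current_label = [], None
    for word, label in tagged_tokens:
        if label != outside and label == current_label:
            current_words.append(word)      # same entity continues
            continue
        if current_words:                   # close the entity in progress
            entities.append((" ".join(current_words), current_label))
        if label != outside:
            current_words, current_label = [word], label
        else:
            current_words, current_label = [], None
    if current_words:                       # entity that ends the text
        entities.append((" ".join(current_words), current_label))
    return entities


if __name__ == "__main__":
    tagged = [
        ("Clemens", "PERSON"), ("Neudecker", "PERSON"),
        ("spoke", "O"), ("in", "O"), ("Munich", "LOCATION"),
    ]
    for text, label in group_entities(tagged):
        print(label, text)
```

One limitation of this simple grouping: two distinct entities of the same type that are directly adjacent are merged into one.
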
  • 19. NER → Browse by names or places
  • 20. Thank you for your attention! clemens.neudecker@kb.nl