
Bratislava WS - Schlarb - ONB - technical tools_pdf

Transcript

  • 1. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. The challenges of historical materials and an overview of the technical solutions in IMPACT. Sven Schlarb, Austrian National Library. 7 May 2010, Bratislava
  • 2. Overview
      - Challenges
      - Technical solutions
      - Integration, Interoperability, Modularisation
    7 May 2010, Bratislava
  • 3. Challenges of historical materials
      - Warped book pages (caused by thick spines)
      - Skewed and distorted scans
      - Curved text lines (caused by creased paper or paper warped by humidity)
      - Colour blots and varying print intensities
      - Shine-through and bleed-through
      - Gothic font
      - Handwritten annotations
      - Complex layouts (e.g. newspaper pages and the reading order of articles)
      - Historical languages and period-specific words in the documents
  • 4. Challenges
      - Tables: curved cell borders
  • 5. Challenges
      - Extreme warping
      - Gothic font
      - Annotations
      - Chapter numbers
  • 6. Challenges
      - Warpage due to humidity
      - Distortion
      - Crinkles
      - Dots and blots
      - Page/Chapter numbers
  • 7. Challenges
      - Complex layout
      - Reading order
  • 8. Challenges
      - Skewed image
      - Gothic font
      - Bleed-through
      - Page number
  • 9. Challenges
      - Gothic font
      - Warping
      - Page borders
      - Curved text lines
      - Page/Chapter numbers
  • 10. Consortium (including new partners)
      - 13 Universities and Research Centres
          - Instituut voor Nederlandse Lexicologie, Leiden (Netherlands)
          - National Research Centre Demokritos, Athens
          - University of Salford, Great Britain
          - University Munich, Centrum für Informations- und Sprachverarbeitung (CIS), Germany
          - University Innsbruck (InfMath), Austria
          - University Bath, Great Britain
          - Institute for Parallel Processing, Bulgarian Academy of Sciences
          - Jožef Stefan Institute, Ljubljana (Slovenia)
          - Institute of the Czech National Corpus, Charles University Prague (Czech Republic)
          - Analyse et Traitement Informatique de la Langue Française (ATILF), Nancy (France)
          - Foundation Virtual Library Miguel de Cervantes, Alicante (Spain)
          - Poznan Supercomputing and Networking Center, Poznan (Poland)
          - University of Warsaw, Department of Formal Linguistics, Warsaw (Poland)
      - 11 Libraries
          - Koninklijke Bibliotheek (Netherlands)
          - British Library
          - Bibliothèque nationale de France
          - Deutsche Nationalbibliothek
          - Bayerische Staatsbibliothek
          - Niedersächsische Staats- und Universitätsbibliothek Göttingen
          - Österreichische Nationalbibliothek
          - Universitätsbibliothek der Universität Innsbruck
          - “St. Cyril and Methodius” National Library, Sofia
          - National Library of the Czech Republic, Prague
          - National Library, Madrid (Spain)
      - 2 Industry partners
          - IBM (Research Centre Haifa, Israel)
          - ABBYY (Moscow)
  • 11. [No text transcribed for this slide]
  • 12. [No text transcribed for this slide]
  • 13. Tools: Border detection/removal
  • 14. Tools: Geometric Dewarping
  • 15. Tools: Geometric Deskewing
  • 16. Tools: Binarisation
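The binarisation slide carries only a title, so the method used in IMPACT is not stated here. As a rough illustration of what binarisation does, the following is a minimal sketch of Otsu's global thresholding, a standard textbook technique that is assumed here purely for illustration. The function names and the tiny sample image are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that maximises the between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of the grey levels of those pixels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no pixels in the dark class yet
        w_fg = total - w_bg
        if w_fg == 0:
            break  # every pixel is already in the dark class
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between_var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarise(image):
    """Map a greyscale image (list of rows) to 0 (ink) and 255 (paper)."""
    threshold = otsu_threshold([p for row in image for p in row])
    return [[0 if p <= threshold else 255 for p in row] for row in image]


# Hypothetical 2x3 greyscale patch: dark ink on light paper.
SAMPLE_IMAGE = [[30, 40, 200],
                [35, 210, 220]]
print(binarise(SAMPLE_IMAGE))
```

A single global threshold tends to struggle with the uneven print intensities and bleed-through listed among the challenges, which is one reason locally adaptive methods are often preferred for historical scans.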
  • 17. Tools: Historical Lexicons
      - Lexicons for German, Dutch, English, French, Spanish, Polish, Bulgarian and Czech are available.
      - Tools for building historical lexicons
      - Interface to ABBYY FRE to integrate external lexicons
      - ABBYY provides the information on how the weighting parameters of word lists with word frequencies have to be created.
      - Procedural information disclosed, but results can be evaluated against each other, e.g. by comparing the results obtained with, without, or with different dictionaries.
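The comparison mentioned on this slide (OCR runs with, without, or with different dictionaries) can be sketched as a character error rate measured against a ground-truth transcription. This is only an illustrative sketch: the metric choice, the function names and the sample strings are assumptions, not the evaluation procedure used in IMPACT.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def character_error_rate(ocr_text, ground_truth):
    """Edit distance normalised by the length of the ground truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ocr_text, ground_truth) / len(ground_truth)


def rank_runs(runs, ground_truth):
    """Return (run name, error rate) pairs, best run first."""
    scores = {name: character_error_rate(text, ground_truth)
              for name, text in runs.items()}
    return sorted(scores.items(), key=lambda item: item[1])


# Made-up sample data for illustration only, not real OCR output.
GROUND_TRUTH = "vnd die Statt ward belagert"
RUNS = {
    "no_lexicon": "vnd die Siatt ward belagcrt",
    "modern_lexicon": "und die Statt ward belagert",
    "historical_lexicon": "vnd die Statt ward belagert",
}

for name, rate in rank_runs(RUNS, GROUND_TRUTH):
    print(f"{name}: {rate:.3f}")
```

Because only the outputs are compared, this kind of evaluation works even when the recognition engine is treated as a black box.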
  • 18. Tools: Named Entities Registry
      - Named entities (= person names, geographic locations, organizations) and general
      - Collaborative Named Entities Registry
      - Named entities to be integrated into ABBYY FR as word lists
  • 19. Tools: Linguistic Post-Correction
      - OCR (ABBYY) and OCR analysis (CIS group, LMU)
      - The colours indicate different types of analysis results, such as a word being found in the historical or hypothetical dictionary, or a supposed OCR error, etc.
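The colour coding described on this slide amounts to assigning each OCR token an analysis category. The sketch below shows that idea with plain set lookups. It is a simplified assumption for illustration, not the CIS group's actual analysis, and the category names, function names and word sets are hypothetical.

```python
from enum import Enum


class Category(Enum):
    HISTORICAL = "found in historical dictionary"
    HYPOTHETICAL = "found in hypothetical dictionary"
    SUSPECTED_ERROR = "supposed OCR error"


def classify_token(token, historical, hypothetical):
    """Assign one analysis category to a single OCR token."""
    word = token.lower()
    if word in historical:
        return Category.HISTORICAL
    if word in hypothetical:
        return Category.HYPOTHETICAL
    return Category.SUSPECTED_ERROR


def classify_line(line, historical, hypothetical):
    """Classify every whitespace-separated token of an OCR line."""
    return [(token, classify_token(token, historical, hypothetical))
            for token in line.split()]


# Hypothetical word sets; how the real dictionaries are built is not described here.
HISTORICAL = {"vnd", "statt"}
HYPOTHETICAL = {"belagert", "ward"}

for token, category in classify_line("vnd Statt ward belagcrt", HISTORICAL, HYPOTHETICAL):
    print(f"{token}: {category.value}")
```

A viewer could then map each category to a display colour, which matches the kind of output the slide describes.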
  • 20. Tools: Collaborative Correction
      - Integrated web-based system for collaborative post-correction of OCR results
      - Character/Word/Page modes
      - Main purpose: collaboratively correct OCR errors and use the results for improving OCR
  • 21. Tools: Functional Extension Parser
      - Recognition of the structure of book pages
          - Print space
          - Standard font of the main text
          - Page numbers
      - Enrichment of OCR results with structural information
  • 22. Tools: Word Spotting
      - Alternative technique for indexing historical documents
      - After word segmentation, relevant words are detected and highlighted
      - Key words can be person and location names (e.g. taken from the Named Entities Registry)
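Word spotting matches word images directly instead of recognising characters. The slide does not say which matching technique is used, so the following sketch assumes one classic approach for illustration: each segmented word image is reduced to a column-wise ink profile, and profiles are compared with dynamic time warping (DTW). All names, the distance threshold and the tiny binary images are hypothetical.

```python
def column_profile(word_image):
    """Count ink pixels (value 1) in each column of a binary word image."""
    return [sum(column) for column in zip(*word_image)]


def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)
    for x in a:
        cur = [inf]
        for j, y in enumerate(b, start=1):
            cur.append(abs(x - y) + min(prev[j], cur[j - 1], prev[j - 1]))
        prev = cur
    return prev[-1]


def spot(query_image, word_images, max_distance):
    """Return the indices of word images whose profile is close to the query."""
    query = column_profile(query_image)
    return [index for index, image in enumerate(word_images)
            if dtw_distance(query, column_profile(image)) <= max_distance]


# Hypothetical binary images (1 = ink), two pixel rows each.
QUERY = [[1, 0, 1],
         [1, 1, 1]]
CANDIDATES = [
    [[1, 1, 0, 1],   # same shape as the query, stretched horizontally
     [1, 1, 1, 1]],
    [[0, 0, 0],      # a different shape
     [1, 0, 0]],
]

print(spot(QUERY, CANDIDATES, max_distance=1))
```

DTW tolerates horizontal stretching, which is why the first candidate matches even though it is one column wider than the query.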
  • 23. Interoperability
      - ABBYY XML
      - METS/ALTO
      - AltoEx (IBM)
      - PAGE XML
      - (TEI)
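To make the interoperability formats more concrete, here is a minimal sketch that reads the recognised text out of an ALTO file, where words are stored as `String` elements with a `CONTENT` attribute inside `TextLine` elements. The function names are hypothetical, and the sample document is a stripped-down fragment for illustration, not a schema-valid ALTO file.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def alto_text_lines(alto_xml):
    """Return the text of every TextLine as a list of strings."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            words = [child.get("CONTENT", "")
                     for child in element
                     if local_name(child.tag) == "String"]
            lines.append(" ".join(words))
    return lines


# Simplified sample for illustration; real ALTO files carry IDs, coordinates, etc.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page>
      <PrintSpace>
        <TextBlock>
          <TextLine>
            <String CONTENT="Wiener"/><SP/><String CONTENT="Zeitung"/>
          </TextLine>
          <TextLine>
            <String CONTENT="7"/><SP/><String CONTENT="May"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""

print(alto_text_lines(SAMPLE_ALTO))
```

Matching on local tag names rather than a fixed namespace is a deliberate simplification, since the namespace URI differs between ALTO versions.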
  • 24. Modularisation
  • 25. http://www.impact-project.eu
