Bratislava WS - Schlarb - ONB - technical tools_pdf

  1. The challenges of historical materials and an overview of the technical solutions in IMPACT
       Sven Schlarb, Austrian National Library
       7 May 2010, Bratislava
       IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
  2. Overview
       • Challenges
       • Technical solutions
       • Integration – Interoperability – Modularisation
  3. Challenges of historical materials
       • Warped book pages (caused by thick spines)
       • Skewed and distorted scans
       • Curved text lines (caused by creased paper or paper warped by humidity)
       • Colour blots, different print intensities
       • Shine-through and bleed-through
       • Gothic font
       • Handwritten annotations
       • Complex layouts (e.g. newspaper pages and the reading order of articles)
       • Historical languages and period-specific words in the documents
  4. Challenges
       • Tables
           – Curved cell borders
  5. Challenges
       • Extreme warping
       • Gothic font
       • Annotations
       • Chapter numbers
  6. Challenges
       • Warpage due to humidity
       • Distortion
       • Crinkles
       • Dots and blots
       • Page/Chapter numbers
  7. Challenges
       • Complex layout
       • Reading order
  8. Challenges
       • Skewed image
       • Gothic font
       • Bleed-through
       • Page number
  9. Challenges
       • Gothic font
       • Warping
       • Page borders
       • Curved text lines
       • Page/Chapter numbers
  10. Consortium (including new partners)
       • 13 Universities and Research Centres
           – Instituut voor Nederlandse Lexicologie, Leiden (Netherlands)
           – National Research Centre Demokritos, Athens
           – University of Salford, Great Britain
           – University Munich, Centrum für Informations- und Sprachverarbeitung (CIS), Germany
           – University Innsbruck (InfMath), Austria
           – University Bath, Great Britain
           – Institute for Parallel Processing, Bulgarian Academy of Sciences
           – Jožef Stefan Institute, Ljubljana (Slovenia)
           – Institute of the Czech National Corpus, Charles University Prague (Czech Republic)
           – Analyse et Traitement Informatique de la Langue Française (ATILF), Nancy (France)
           – Foundation Virtual Library Miguel de Cervantes, Alicante (Spain)
           – Poznan Supercomputing and Networking Center, Poznan (Poland)
           – University of Warsaw, Department of Formal Linguistics, Warsaw (Poland)
       • 11 Libraries
           – Koninklijke Bibliotheek (Netherlands)
           – British Library
           – Bibliothèque nationale de France
           – Deutsche Nationalbibliothek
           – Bayerische Staatsbibliothek
           – Niedersächsische Staats- und Universitätsbibliothek Göttingen
           – Österreichische Nationalbibliothek
           – Universitätsbibliothek der Universität Innsbruck
           – “St. Cyril and Methodius” National Library, Sofia
           – National Library of the Czech Republic, Prague
           – National Library, Madrid (Spain)
       • 2 Industry partners
           – IBM (Research Centre Haifa, Israel)
           – ABBYY (Moscow)
  13. Tools: Border detection/removal
  14. Tools: Geometric Dewarping
  15. Tools: Geometric Deskewing
  16. Tools: Binarisation
  17. Tools: Historical Lexicons
       • Lexicons for German, Dutch, English, French, Spanish, Polish, Bulgarian and Czech are available.
       • Tools for building historical lexicons
       • Interface to ABBYY FRE to integrate external lexicons
       • ABBYY provides the information on how the weighting parameters of word lists with word frequencies have to be created.
       • Procedural information disclosed, but results can be evaluated against each other, e.g. by comparing OCR runs with a dictionary, without one, or with different dictionaries.
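As an illustration of what a word list with word frequencies might look like before it is handed to an OCR engine, the following Python sketch counts word forms in transcribed text and scales the counts to integer weights. The function name, the logarithmic scaling and the weight range of 1 to 100 are assumptions made for this example; the weighting parameters actually expected by ABBYY FRE are not described in the slides.

```python
import math
import re
from collections import Counter


def build_weighted_word_list(texts, max_weight=100):
    """Count word forms and scale log-frequencies to weights 1..max_weight.

    Hypothetical helper: the scaling is an illustrative choice, not the
    format defined by any OCR vendor.
    """
    counts = Counter()
    for text in texts:
        # [^\W\d_]+ matches runs of letters, including non-ASCII ones.
        counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    if not counts:
        return {}
    top = math.log1p(max(counts.values()))
    return {
        word: max(1, round(max_weight * math.log1p(n) / top))
        for word, n in counts.items()
    }


# Invented sample lines in historical German spelling.
sample = ["Vnd der Herr sprach", "der Herr vnd der Knecht"]
weights = build_weighted_word_list(sample)
for word, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(word, weight)
```

Running OCR once with and once without such a list, and comparing the outputs against a ground-truth transcription, is one way to carry out the evaluation mentioned on the slide.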
  18. Tools: Named Entities Registry
       • Named entities (= person names, geographic locations, organizations) and general
       • Collaborative Named Entities Registry
       • Named entities to be integrated into ABBYY FR as word lists
  19. Tools: Linguistic Post-Correction
       • OCR (ABBYY) and OCR analysis (CIS group, LMU)
       • The colours indicate different types of analysis results, such as a word being found in the historical or hypothetical dictionary, or a supposed OCR error.
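The kind of per-token analysis result described on this slide can be sketched in Python as a simple lookup that assigns each OCR token one of three labels. This is only a toy illustration: the label names, the example lexicon contents and the idea that the hypothetical dictionary is a plain set of generated spelling variants are assumptions, and the CIS analysis itself is not documented in the slides.

```python
def classify_tokens(tokens, historical, hypothetical):
    """Label each OCR token by the first dictionary that contains it.

    `historical` and `hypothetical` are sets of lower-cased word forms.
    Tokens found in neither set are flagged as suspected OCR errors.
    """
    labels = []
    for token in tokens:
        word = token.lower()
        if word in historical:
            labels.append("historical")
        elif word in hypothetical:
            labels.append("hypothetical")
        else:
            labels.append("suspected_ocr_error")
    return labels


# Invented example data.
historical = {"vnd", "herr"}
hypothetical = {"theil"}  # assumed to be a generated spelling variant
tokens = ["Vnd", "Theil", "Tbeil"]
labels = classify_tokens(tokens, historical, hypothetical)
for token, label in zip(tokens, labels):
    print(token, label)
```

A user interface could then map each label to a colour, which is how the screenshot on the slide presents the results.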
  20. Tools: Collaborative Correction
       • Integrated web-based system for collaborative post-correction of OCR results
       • Character/Word/Page modes
       • Main purpose: collaboratively correct OCR errors and use the results for improving OCR
  21. Tools: Functional Extension Parser
       • Recognition of the structure of book pages
           – Print space
           – Standard font of the main text
           – Page numbers
       • Enrichment of OCR results with structural information
  22. Tools: Word Spotting
       • Alternative technique for indexing historical documents
       • After word segmentation, relevant words are detected and highlighted
       • Key words can be person and location names (e.g. taken from the Named Entities Registry)
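To make the idea of detecting relevant words after segmentation more concrete, here is a minimal Python sketch of one technique known from the word-spotting literature: each segmented word image is reduced to a column projection profile, and profiles are compared with dynamic time warping (DTW). This is not necessarily the method used in IMPACT; the function names, the tiny binary images and the threshold are all assumptions made for the example.

```python
def column_profile(word_image):
    """Ink pixels per column of a binary word image (list of rows of 0/1)."""
    return [sum(column) for column in zip(*word_image)]


def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)
    for x in a:
        cur = [inf]
        for j, y in enumerate(b, start=1):
            cost = abs(x - y)
            cur.append(cost + min(prev[j], prev[j - 1], cur[j - 1]))
        prev = cur
    return prev[-1]


def spot(query_image, word_images, threshold):
    """Return indices of word images whose profile is close to the query."""
    query = column_profile(query_image)
    hits = []
    for index, image in enumerate(word_images):
        profile = column_profile(image)
        # Normalise by the combined length so widths are comparable.
        distance = dtw_distance(query, profile) / (len(query) + len(profile))
        if distance <= threshold:
            hits.append(index)
    return hits


# Invented 2-row images: the first candidate is a stretched copy of the query.
query = [[1, 0, 1],
         [1, 1, 1]]
stretched = [[1, 1, 0, 1],
             [1, 1, 1, 1]]
flat = [[0, 0, 0],
        [1, 1, 1]]
hits = spot(query, [stretched, flat], threshold=0.1)
print(hits)
```

In practice the query images would be examples of the key words of interest, such as names from the Named Entities Registry, and the hits would be highlighted on the page.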
  23. Interoperability
       • ABBYY XML
       • METS/ALTO
       • AltoEx (IBM)
       • PAGE XML
       • (TEI)
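As a small illustration of why a shared result format helps interoperability, the Python sketch below reads recognised words and their bounding boxes from an ALTO document using only the standard library. The sample document is invented for this example; the element and attribute names (String, CONTENT, HPOS, VPOS, WIDTH, HEIGHT) come from the ALTO schema, and the helper name alto_words is hypothetical. Matching on the local tag name is a simplifying assumption so that the sketch works regardless of the ALTO namespace version.

```python
import xml.etree.ElementTree as ET

# Invented minimal ALTO document with two recognised words.
ALTO_SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page ID="P1"><PrintSpace>
    <TextBlock ID="B1"><TextLine>
      <String CONTENT="Vnd" HPOS="100" VPOS="200" WIDTH="60" HEIGHT="30"/>
      <SP/>
      <String CONTENT="der" HPOS="170" VPOS="200" WIDTH="50" HEIGHT="30"/>
    </TextLine></TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def alto_words(xml_text):
    """Return (text, (hpos, vpos, width, height)) for every String element."""
    root = ET.fromstring(xml_text)
    words = []
    for element in root.iter():
        # Strip the "{namespace}" prefix and compare the local name only.
        if element.tag.rsplit("}", 1)[-1] == "String":
            box = tuple(
                float(element.get(name, 0))
                for name in ("HPOS", "VPOS", "WIDTH", "HEIGHT")
            )
            words.append((element.get("CONTENT", ""), box))
    return words


words = alto_words(ALTO_SAMPLE)
for text, box in words:
    print(text, box)
```

Any tool in a processing chain that can read and write such a format can be exchanged for another without changing its neighbours.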
  24. Modularisation
  25. http://www.impact-project.eu
