IMPACT Final Event 26-06-2012 - Library experiences in IMPACT: National and University Library of Slovenia by Alenka Kavčič-Čolić (NUK)
Presentation Transcript

  • 1. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
    Library experiences in IMPACT: The National and University Library of Slovenia
    Alenka KAVČIČ-ČOLIĆ, Ines VODOPIVEC, Library Research Centre, National and University Library, Ljubljana, Slovenia
    in cooperation with Tomaž ERJAVEC, Dept. of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia
  • 2. Outline
    • Introduction
    • Cooperation in IMPACT project:
      • OCR improvement
      • Lexicon building
      • Improvement of information retrieval on historical document collections
    • Benefits overview
    IMPACT Outcomes, 26 June 2012, KB, The Hague
  • 3. The National and University Library of Slovenia (Narodna in univerzitetna knjižnica - NUK)
    Digital Library of Slovenia (2005 - )
    NUK's entire online digital collection comprises more than 4 million scans and digital objects, including:
      – 19,000 pages of scientific journals
      – 402,714 pages of newspapers
      – 9,540 photographs
      – 100 music records
      – 15 3D objects
      – 3 virtual exhibitions, etc.
  • 4. NUK statistics
    • More than 200,000 visitors per year
    • More than 1 million remote users
    • Approx. 2.8 million visits to the digital library portal dLib.si (2011)
    [Pie chart: Library members by category (school children, students, university employees, general public, foreign citizens); the shares shown are 2.30%, 1.80%, 21.50%, 3.10% and 71.30%, and the pairing of shares to categories is not preserved in the transcript]
  • 5. Slovene historical documents: PDF file and HTML preview
  • 6. Documents published before 1850: example of bad OCR in an HTML preview
  • 7. Example of a historical text by Linhart from the 18th century
  • 8. Cooperation in IMPACT
    NUK and JSI joined the IMPACT project in its 2nd extension (1 Apr. 2010 – 31 Dec. 2011)
    Goals to achieve in the project:
      1. Lexicon building (JSI)
      2. Improving OCR on historical documents by using special lexica for historical language (NUK & JSI)
      3. Improving information retrieval on historical document collections (NUK)
  • 9. The 3 goals were interdependent
      1. OCR improvement (NUK)
      2. Lexicon building (JSI)
      3. Improved IR in old texts (NUK & JSI)
  • 10. Corpus selection (JSI & NUK)
    • Selection of typical documents from the 19th and the second half of the 18th century
    • Materials from several sources:
      – dLib (www.dlib.si)
      – AHLib (http://nl.ijs.si/ahlib/)
      – books from 1848-1918, translated from German originals
    • Dataset: 41,313 digitized pages of historical newspapers & books from the 18th-19th century
    • Subset of approx. 5,000 scans for Groundtruth (GT)* production
    (* a dataset of high-quality transcriptions of historical texts which also serves as a basis for the production of the Lexicon for historical Slovene)
  • 11. Characteristics of the dataset
    • Errors originating from language characteristics
      – Two basic historical alphabets (Bohoričica / Gajica)
      – Historical language / vocabulary, leading to poor vocabulary recognition
      – Special characters, digraphs and ligatures
    • Errors originating from print properties
      – Latin and Gothic types
      – Complex page structure or segmentation
      – Irregular spacing between letters, words and columns
      – Irregular / changing font sizes
      – Poor paper quality
      – Inconsistent inking
    • Errors originating from digitisation procedures
      – Specific characteristics of the originals: staining, foxing, paper warping caused by humidity
  • 12. Groundtruth production (NUK)
    • Digitisation and pre-processing of scans
    • OCR procedure
    • Post-processing of the text, leading to GT production
      – Error correction: manual
      – Page segmentation and reading order: Aletheia
      – OCR outcomes encoded in PAGE XML
    • Evaluation
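The slide says the corrected OCR outcomes were encoded in PAGE XML. As a rough illustration of what reading such a file can look like, the sketch below pulls the text of each region out of a small PAGE-style document. The sample document, the file name in it, the function name `region_texts` and the choice of the 2010-03-19 schema namespace are all assumptions made for this example; real files from the project may use a different schema version and carry an explicit reading order, which this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Assumption: one published version of the PAGE content namespace.
# Real ground-truth files may declare a different version.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2010-03-19"}

# Hypothetical, heavily simplified sample (coordinates and metadata omitted).
SAMPLE = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2010-03-19">
  <Page imageFilename="hypothetical_page.tif" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <TextEquiv><Unicode>first region</Unicode></TextEquiv>
    </TextRegion>
    <TextRegion id="r2">
      <TextEquiv><Unicode>second region</Unicode></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>
"""


def region_texts(page_xml: str) -> list[tuple[str, str]]:
    """Return (region id, text) pairs in document order."""
    root = ET.fromstring(page_xml.strip())
    result = []
    for region in root.findall("pc:Page/pc:TextRegion", NS):
        node = region.find("pc:TextEquiv/pc:Unicode", NS)
        text = node.text if node is not None and node.text else ""
        result.append((region.get("id", ""), text))
    return result


if __name__ == "__main__":
    for region_id, text in region_texts(SAMPLE):
        print(region_id, text)
```

Returning plain `(id, text)` pairs keeps the transcription easy to compare against raw OCR output in a later evaluation step.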
  • 13. Lexicon building for historical Slovene
    • Developed by the Jožef Stefan Institute (JSI)
    • Built to be incorporated in OCR & IR
    • But also as a human-readable reference and as a training & testing set for Human Language Technologies
    • Two-stage development:
      – Reference corpus of historical Slovene
      – Lexicon of historical Slovene
  • 14. Reference corpus goo300k
    • Page-sampled
    • Each word annotated with:
      – Contemporary equivalent
      – Modern lemma
      – Part-of-speech tag
      – Gloss for archaic words (lemmas)
    • First automatically, then manually corrected:
      – Institute for Dutch Lexicology (INL) CoBaLT Lexicon Tool
      – A team of annotators
      – Also correcting errors in transcription
    • Available via a concordancer + download (CC-BY licence)
    • Development supported by Google humanities research award (JSI + ZRC SAZU)

    Period      Units  Pages  Tokens
    1584            1      8    6000
    1695            1     27   10000
    1751-1800       8    155   27000
    1801-1850      12    206   74000
    1851-1875      36    380  126000
    1876-1900      23    224   51000
    ∑              81   1000  296000
  • 15. Historical lexicon
    • Lexicon dump from goo300k + additional lexicon from full-text collection
    • First automatically, then manually corrected:
      – INL CoBaLT Lexicon Tool
      – A team of annotators
    • Dual role:
      – As a human-readable lexicon of historical Slovene
      – For HLT applications (e.g. IR)
    • Available via a web browser + download (CC-BY)
    • JSI also developed ToTrTaLe, a tool for processing historical (Slovene) text, which annotates words in a TEI-encoded corpus with their modern-day equivalents, PoS tags and lemmas
      – Used by Vaam finite-state library (Centrum für Informations- und Sprachverarbeitung, University of Munich)

    Lexicon            n
    Lex. entries  70,000
    Word-forms    68,000
    Modernised    50,000
    Lemmas        24,000
    Glosses        1,900
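To make the idea of a "lexicon dump" from an annotated corpus concrete, here is a minimal sketch that groups annotated tokens into lexicon entries. It is not the JSI pipeline: the `Token` fields only mirror the annotation layers named on the slides (contemporary equivalent, modern lemma, part-of-speech tag), and the sample word-forms are illustrative Bohoričica-style spellings invented for this example, not entries taken from the actual lexicon.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    word: str    # historical word-form as transcribed
    modern: str  # contemporary equivalent
    lemma: str   # modern lemma
    pos: str     # part-of-speech tag


# Hypothetical annotated tokens (illustrative spellings only).
CORPUS = [
    Token("zhlovek", "človek", "človek", "noun"),
    Token("zhloveka", "človeka", "človek", "noun"),
    Token("zhlovek", "človek", "človek", "noun"),
    Token("shena", "žena", "žena", "noun"),
]


def dump_lexicon(tokens):
    """Group attested word-forms under (lemma, pos), keyed by modernised form."""
    lexicon = defaultdict(lambda: defaultdict(set))
    for token in tokens:
        lexicon[(token.lemma, token.pos)][token.modern].add(token.word)
    # Convert to plain dicts with sorted lists for stable output.
    return {
        entry: {modern: sorted(forms) for modern, forms in by_modern.items()}
        for entry, by_modern in lexicon.items()
    }


if __name__ == "__main__":
    for entry, forms in dump_lexicon(CORPUS).items():
        print(entry, forms)
```

Keying each entry by lemma and tag, with the modernised form as an intermediate level, keeps the same structure usable both for human lookup and for applications such as search.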
  • 16. Example lexical entry (converted from TEI to HTML)
  • 17. JSI resources available from http://nl.ijs.si/imp/
  • 18. Improved IR in old texts (NUK & JSI)
    Integration of the historical lexicon developed by JSI into the full-text search engine of dLib.si
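One common way to let users search old texts with modern words is query expansion: each modern query term is expanded with the historical word-forms that the lexicon links to it. The sketch below shows that idea only; the transcript does not describe how dLib.si actually integrates the lexicon, and the word pairs, `build_index` and `expand_query` are hypothetical names and data made up for the example.

```python
from collections import defaultdict

# Hypothetical (historical word-form, modernised form) pairs;
# illustrative spellings, not taken from the actual lexicon.
PAIRS = [
    ("zhlovek", "človek"),
    ("zhloveka", "človeka"),
    ("shena", "žena"),
]


def build_index(pairs):
    """Map each modernised form to the set of historical forms attested for it."""
    index = defaultdict(set)
    for historical, modern in pairs:
        index[modern.lower()].add(historical.lower())
    return index


def expand_query(query, index):
    """Return one OR-group per query term: the term itself plus its historical variants."""
    groups = []
    for term in query.lower().split():
        variants = sorted(index.get(term, set()) - {term})
        groups.append([term] + variants)
    return groups


if __name__ == "__main__":
    index = build_index(PAIRS)
    print(expand_query("človek žena", index))
```

A search engine would typically treat each inner list as alternatives (OR) and combine the groups with AND, so a query in modern spelling also matches pages that only contain the older forms.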
  • 22. IMPACT project benefits
    - OCR quality improvement:
      - For texts in Bohoričica and Gajica, OCR results with ABBYY FineReader improved from 58% to 70% for Bohoričica and 85% for Gajica
    - Full-text search with modern words in more than 200,000 digitised historical documents
    - Historical Slovene Lexicon that can be integrated in other Digital Library tools
    - Processing tools for large-scale digitisation (e.g. Aletheia, FEP …)
    - High-quality datasets for R&D of language technologies
    - Other "invisible" benefits:
      - Cooperation and integration in national and international networks
      - Additional experience in large-scale digitisation
      - New knowledge (on OCR, language processing, and other tools used in large-scale digitisation)
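The transcript does not say how the OCR percentages above were measured. One widely used measure is character accuracy against a ground-truth transcription, computed from the edit distance between the two texts. The sketch below shows that calculation under this assumption; the function names are chosen for the example and the metric may differ from the one used in the project.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(ground_truth: str, ocr_text: str) -> float:
    """Share of ground-truth characters recovered, clipped at 0."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    errors = edit_distance(ground_truth, ocr_text)
    return max(0.0, 1.0 - errors / len(ground_truth))


if __name__ == "__main__":
    # One wrong character out of four.
    print(character_accuracy("abcd", "abxd"))
```

Measured this way, a page whose OCR text needs one correction per four characters scores 0.75, so a figure such as 58% would correspond to roughly four errors in every ten characters.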
  • 24. Thank you for your attention!
    Alenka Kavčič-Čolić, Alenka.kavcic@nuk.uni-lj.si