
University Library of KU Leuven - Sam Alloing and Demmy Verbeke



University library of KU Leuven presentation at "Succeed in Digitisation. Spreading Excellence" Conference. Validation and take-up of text digitisation tools.

Published in: Technology

  1. University Library of KU Leuven. Sam Alloing and Demmy Verbeke
  2. University Library of KU Leuven: divisions involved
     Arts Faculty Library
     • Collections and services focused on ongoing research and teaching in the Faculty of Arts
     • Some special collections (e.g. Gulden Librije)
     LIBIS
     • Provides services for libraries, museums and archives (inside and outside the university)
     Digitisation Unit
     • Among others, the Digital Lab: a high-tech digital photography centre
  3. Why did we get involved?
     We already had digitisation infrastructure and experience, but it was focused on visualisation.
     Now: digitisation of textual material with a view to creating digital text corpora for research.
  4. Corpus
     13 books from the pretiosa collection of the Gulden Librije:
     - translations from Latin
     - books that had not been digitised yet
     Titles:
     • Augustinus, Stad Gods (1876-8)
     • Augustinus, Belydenis (1741)
     • Boëthius, Vertroostinge der wysgeerte (1703)
     • Horatius, Over de dichtkunst (1866)
     • Horatius, Hekeldichten en brieven (1728)
     • Nepos, Leevens van doorlugtige mannen (1796)
     • Nepos, Leeven der doorluchtige veld-ooversten (1726)
     • Ovidius, Treur-digten (1814-5)
     • Ovidius, Treur-gesangen (1692)
     • Seneca, Christelycke Seneca (1705)
     • Tacitus, Vande ghedenkwaerdige geschiedenissen der Romeinen (1645)
     • Vergilius, Wercken (1737)
     • Vergilius, Aeneis (1662)
  5. Assumptions
     • As automated as possible
     • Try as soon as possible, to fail early
     • Use the ALTO format throughout the workflow
  6. Workflow OCR
     Diagram steps: Digitisation, Attestation, Evaluation set, Executing OCR, Improving
     Improving
     • User pattern training
     • Use dictionary
     • Improve images
     ocrevalUAtion
     • Lesson learnt: a high error rate is not necessarily bad
     Aletheia
     • Create ground truth
     • User friendly
     Lessons learnt
     • B&W images
     • Remove border
     • Biggest problem: letters from other pages showing through
     ABBYY FineReader engine
     • Useful sample applications
     • Windows
  7. Workflow NER
     Diagram steps: Input, Attestation, Training set, Test set, Model, Execute NER, Improving
     Europeana Newspaper NER
     • ALTO input from OCR
     • Lesson learnt: a lot of resources (RAM) needed
     INL Attestation tool
     • Lesson learnt: a lot more ground truth needed than for OCR
     NERT of INL
     • 80/20 split training/test
     NERT of INL (improving)
     • Different split of training and test set
     • Create variants from old spelling
  8. Results NER

                       Precision   Recall    F1
     Overall           0.6257      0.5130    0.5638
     Location          0.675       0.2903    0.40601
     Organization      1.0         0.1666    0.2857
     Person            0.6207      0.5571    0.5871
     Segmentation      0.6634      0.5438    0.5977

     Classification accuracy: 0.9433
     > 60% recognised correctly; ≈ 50% of the entities found
  9. Results NER, an experiment
     Diagram steps: Input, Corrected file, Split, Training file, Test file, Combine

                       Precision   Recall    F1
     Overall           0.8398      0.7954    0.8170
     Location          0.8741      0.6720    0.7599
     Organization      1.0         0.5       0.6666
     Person            0.8320      0.8320    0.8320
     Segmentation      0.8920      0.8448    0.8677

     Classification accuracy: 0.9415
     80% recognised correctly; ≈ 80% of the entities found
  10. Next steps
     • Create an OCR and NER platform for the university and as part of the LIBIS services
     • New project about OCR and (early modern) Latin texts
     • Looking into other tools:
       - Lexicon building
       - Border detection
       - Automatically remove ‘noise’ from a page
     • NER:
       - Learning to use Latin (and Greek)
  11. Thanks! Questions?
     • Sam Alloing
     • Demmy Verbeke (@viroviacum)
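
The F1 columns on the two NER results slides (8 and 9) can be cross-checked against the precision and recall columns. The sketch below assumes the slides report the standard balanced F1, the harmonic mean of precision and recall; the slides do not state the formula, and the names f1_score, FIRST_RUN, EXPERIMENT and max_deviation are illustrative only. Small differences in the last digits are expected, since the slide values are themselves rounded or truncated.

```python
def f1_score(precision: float, recall: float) -> float:
    """Balanced F1: harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1), copied from the slides.
# Assumption: the reported F1 is the balanced F1 of the two other columns.
FIRST_RUN = {
    "Overall": (0.6257, 0.5130, 0.5638),
    "Location": (0.675, 0.2903, 0.40601),
    "Organization": (1.0, 0.1666, 0.2857),
    "Person": (0.6207, 0.5571, 0.5871),
    "Segmentation": (0.6634, 0.5438, 0.5977),
}
EXPERIMENT = {
    "Overall": (0.8398, 0.7954, 0.8170),
    "Location": (0.8741, 0.6720, 0.7599),
    "Organization": (1.0, 0.5, 0.6666),
    "Person": (0.8320, 0.8320, 0.8320),
    "Segmentation": (0.8920, 0.8448, 0.8677),
}


def max_deviation(rows: dict) -> float:
    """Largest absolute gap between the recomputed and the reported F1."""
    return max(abs(f1_score(p, r) - reported) for p, r, reported in rows.values())


if __name__ == "__main__":
    for name, rows in (("First run", FIRST_RUN), ("Experiment", EXPERIMENT)):
        print(name)
        for label, (p, r, reported) in rows.items():
            print(f"  {label:<13} recomputed={f1_score(p, r):.4f} reported={reported}")
        print(f"  largest deviation: {max_deviation(rows):.5f}")
```

Under that assumption, every reported F1 value agrees with the recomputed one to within about 0.001, which suggests the tables are internally consistent.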