University Library of KU Leuven 
Sam Alloing and Demmy Verbeke
University Library of KU Leuven 
Divisions involved: 
Arts Faculty Library 
•Collections and services focused on ongoing research and teaching in the Faculty of Arts 
•Some special collections (e.g. Gulden Librije) 
LIBIS 
•Provides services for libraries, museums and archives (inside and outside the university) 
Digitisation Unit 
•Includes, among others, the Digital Lab: high-tech digital photography centre
Why did we get involved? 
Digitization infrastructure and experience were already in place, but focused on visualization => now: digitization of textual material with a view to creating digital text corpora for research 
http://www.arts.kuleuven.be/ono/meso/projects/digitalisatie 
http://www.illuminare.be/rich_project 
http://www.europeana-photography.eu
Corpus 
13 books from the pretiosa collection of the Gulden Librije: 
-translations from Latin 
-books that had not been digitized yet 
Augustinus, Stad Gods (1876-8) 
Augustinus, Belydenis (1741) 
Boëthius, Vertroostinge der wysgeerte (1703) 
Horatius, Over de dichtkunst (1866) 
Horatius, Hekeldichten en brieven (1728) 
Nepos, Leevens van doorlugtige mannen (1796) 
Nepos, Leeven der doorluchtige veld-ooversten (1726) 
Ovidius, Treur-digten (1814-5) 
Ovidius, Treur-gesangen (1692) 
Seneca, Christelycke Seneca (1705) 
Tacitus, Vande ghedenkwaerdige geschiedenissen der Romeinen (1645) 
Vergilius, Wercken (1737) 
Vergilius, Aeneis (1662)
Assumptions 
•As automated as possible 
•Try as soon as possible, to fail early 
•Use ALTO format throughout the workflow
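Keeping ALTO as the exchange format means every later step can read the OCR text from the same XML structure. A minimal sketch of pulling the text lines out of an ALTO file is given here, assuming only the standard `TextLine` and `String` elements with a `CONTENT` attribute; the function name `alto_lines` and the sample document are illustrative and not part of the project's tooling.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace so any ALTO schema version is accepted."""
    return tag.rsplit('}', 1)[-1]


def alto_lines(xml_text):
    """Return the text of each TextLine in an ALTO document as a list of strings."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) == 'TextLine':
            # Each recognised word is a String element; its text is in CONTENT.
            words = [child.attrib.get('CONTENT', '')
                     for child in elem
                     if local_name(child.tag) == 'String']
            lines.append(' '.join(words))
    return lines


# Hypothetical one-line sample, not taken from the corpus.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine><String CONTENT="Stad"/><SP/><String CONTENT="Gods"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

print(alto_lines(SAMPLE))
```

Matching on local element names rather than a fixed namespace is a design choice for this sketch, since ALTO files from different tools may declare different schema versions.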
Workflow OCR 
Attestation 
Improving 
•User pattern training 
•Use dictionary 
•Improve images 
Executing OCR 
Digitisation 
Evaluation set 
ocrevalUAtion 
Lesson learnt: 
high error rate is not necessarily bad 
Aletheia 
•Create ground truth 
•User friendly 
Lessons learnt: 
•B&W images 
•Remove border 
•Biggest problem: letters from other pages coming through 
ABBYY FineReader engine 
•Useful sample applications 
•Windows
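The evaluation step compares OCR output against ground truth. One common measure for such a comparison is the character error rate: the edit distance between the two texts divided by the length of the ground truth. The sketch below illustrates that idea only; it is not the implementation used by ocrevalUAtion, and the function names are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_error_rate(ground_truth, ocr_output):
    """Edit distance normalised by the length of the ground truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ground_truth, ocr_output) / len(ground_truth)


# Hypothetical example: one wrong character in a nine-character word.
print(char_error_rate("wysgeerte", "wysgeerre"))
```

A raw rate like this counts every difference equally, which is one reason a high error rate is not necessarily bad: errors in borders or bleed-through from other pages may inflate it without affecting the text that matters.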
Workflow NER 
Attestation 
Training set 
Test set 
Execute NER 
Model 
Input 
Europeana Newspaper NER 
•ALTO input from OCR 
•Lesson learnt: a lot of resources (RAM) needed 
INL Attestation tool 
Lesson learnt: 
a lot more ground truth needed than for OCR 
NERT of INL 
80/20 split training/test 
NERT of INL 
•Different split training and test set 
•Create variants from old spelling 
Improving
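The 80/20 split of the annotated material into a training set and a test set can be sketched as follows. This is a generic illustration, not the procedure built into NERT; the function name, the fixed random seed and the choice of shuffling before splitting are assumptions.

```python
import random


def split_train_test(items, train_fraction=0.8, seed=0):
    """Shuffle the annotated items and split them into training and test sets."""
    shuffled = list(items)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical input: ten annotated sentences identified by number.
train, test = split_train_test(range(10))
print(len(train), len(test))
```

Changing `train_fraction` or `seed` gives a different split of training and test set, which is one of the variations listed under "Improving".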
Results NER 

               Precision  Recall  F1 
Overall        0.6257     0.5130  0.5638 
Location       0.675      0.2903  0.40601 
Organization   1.0        0.1666  0.2857 
Person         0.6207     0.5571  0.5871 
Segmentation   0.6634     0.5438  0.5977 

Classification accuracy: 0.9433 

> 60% recognised correctly 
≈ 50% of the entities found
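The F1 column is the harmonic mean of precision and recall, F1 = 2PR / (P + R). A short sketch for recomputing it from the reported precision and recall values follows; the function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Overall row of the results: precision 0.6257, recall 0.5130.
print(round(f1_score(0.6257, 0.5130), 4))
```

Because the harmonic mean is pulled towards the lower value, the Organization row ends up with a low F1 despite a precision of 1.0: its recall is only 0.1666.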
Results NER, an experiment 
Input 
Corrected file 
Training file 
Test file 
Split 
Combine 

               Precision  Recall  F1 
Overall        0.8398     0.7954  0.8170 
Location       0.8741     0.6720  0.7599 
Organization   1.0        0.5     0.6666 
Person         0.8320     0.8320  0.8320 
Segmentation   0.8920     0.8448  0.8677 

Classification accuracy: 0.9415 

80% recognised correctly 
≈ 80% entities found
Next steps 
•Create an OCR and NER platform for the university and as part of the LIBIS services 
•New project about OCR and (early modern) Latin texts 
•Looking into other tools: 
•Lexicon building 
•Border detection 
•Automatically remove ‘noise’ from a page 
•NER: 
•Learning to use Latin (and Greek)
Thanks! 
Questions? 
•Sam Alloing (Sam.Alloing@libis.kuleuven.be) 
•Demmy Verbeke (Demmy.Verbeke@arts.kuleuven.be; @viroviacum) 
•http://bib.kuleuven.be/english/ub
