Slides from my talk at the symposium “Digitale historische kranten als ‘big data’”, which took place on March 24, 2015 at the Koninklijke Bibliotheek (KB), The Hague.
https://www.kb.nl/nieuws/2015/symposium-digitale-historische-kranten-als-big-data
Estimating the Impact of OCR Quality on Research Tasks in the Digital Humanities
1. Estimating the Impact of OCR Quality on Research Tasks in the Digital Humanities
Myriam C. Traub, Jacco van Ossenbruggen
Centrum Wiskunde & Informatica
24/03/2015
2. Use of digital archives for…
✤ selection for close reading
✤ first occurrence of specific words
✤ word frequency patterns over
  ✤ time
  ✤ different sources
It all depends on OCR quality!
3. Two different perspectives on OCR quality
We care about average performance on representative subsets for generic cases.
I care about actual performance on my non-representative subset for my specific query.
6. ✤ American English 2009, 1700 - 1850
“amsterdam” and “amfterdam”
How do I know?
What else?
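The “amsterdam” / “amfterdam” pair above presumably reflects OCR reading the long s (ſ) as an f. A minimal sketch of how a researcher might check for this in their own data: generate every s/f spelling of a query term and count each one. The token list and the function names `long_s_variants` and `variant_counts` are hypothetical and not part of the talk.

```python
from collections import Counter
from itertools import product


def long_s_variants(word):
    """Return the word plus every spelling in which some 's' is replaced by 'f'."""
    options = [("s", "f") if ch == "s" else (ch,) for ch in word]
    return {"".join(chars) for chars in product(*options)}


def variant_counts(tokens, word):
    """Count how often each s/f variant of `word` occurs in a list of tokens."""
    variants = long_s_variants(word)
    counts = Counter(t.lower() for t in tokens)
    # Report only the variants that actually occur, in alphabetical order.
    return {v: counts[v] for v in sorted(variants) if counts[v]}


# Hypothetical OCR output, invented for illustration only.
tokens = ["Amfterdam", "Amsterdam", "amfterdam", "Rotterdam"]
print(variant_counts(tokens, "amsterdam"))
```

This over-generates for words with several s characters and covers only one confusion type, so it addresses “How do I know?” for a single known error, not “What else?”.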
7. ✤ Current American English, 1700 - 1850
Different versions / pipelines
How do I know that they fixed it?
What else?
8. Research questions
✤ How can we give humanities researchers the information they need to understand how the OCR limitations influence their research tasks?
✤ What is a good way of estimating uncertainty for a specific case?
✤ How to get (the data for) better estimates?
13. OCR confidence values useful?
✤ available for all items in the collection: page, word, character
✤ calibration based on limited ground truth?
✤ only for highest ranked words / characters, other candidates missing
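The calibration question above can be made concrete. Given a small ground-truth sample, one could bin words by their reported confidence and compare each bin's mean confidence with the observed accuracy. The sketch below assumes hypothetical `(confidence, is_correct)` pairs with confidences scaled to [0, 1]; real OCR engines report confidence in different formats and scales, and the sample values are invented.

```python
def calibration_table(samples, n_bins=5):
    """Group (confidence, is_correct) pairs into equal-width bins over [0, 1].

    Returns one row per non-empty bin:
    (bin_low, bin_high, mean_confidence, observed_accuracy, count).
    """
    bins = [[] for _ in range(n_bins)]
    for conf, correct in samples:
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, correct))

    rows = []
    for i, items in enumerate(bins):
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        rows.append((i / n_bins, (i + 1) / n_bins, mean_conf, accuracy, len(items)))
    return rows


# Hypothetical ground-truth sample, invented for illustration only.
samples = [(0.95, True), (0.90, True), (0.85, False),
           (0.55, True), (0.50, False), (0.15, False)]

for low, high, mean_conf, acc, n in calibration_table(samples):
    print(f"[{low:.1f}, {high:.1f})  mean conf {mean_conf:.2f}  accuracy {acc:.2f}  n={n}")
```

With limited ground truth the per-bin counts stay small, so the observed accuracies are themselves uncertain. That is the concern the second bullet raises.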
14. Future work
✤ How to estimate impact of OCR errors on use cases outside the ground truth?
✤ Can we crowdsource part of this estimation problem?
✤ How to convey the estimated impact to researchers?
15. How does OCR quality affect your research?
Myriam.Traub@cwi.nl
Jacco.van.Ossenbruggen@cwi.nl