
06 traub

KB symposium "Historical newspapers as big data" (historische kranten als big data),
The Hague, 24 March 2015

Published in: Government & Nonprofit

  1. Estimating the Impact of OCR Quality on Research Tasks in the Digital Humanities. Myriam C. Traub, Jacco van Ossenbruggen. Centrum Wiskunde & Informatica, 24/03/2015
  2. Use of digital archives for… ✤ selection for close reading ✤ first occurrence of specific words ✤ word frequency patterns over time and over different sources
  3. Use of digital archives for… ✤ selection for close reading ✤ first occurrence of specific words ✤ word frequency patterns over time and over different sources. It all depends on OCR quality!
  4. Two different perspectives on OCR quality: "We care about average performance on representative subsets for generic cases."
  5. Two different perspectives on OCR quality: "We care about average performance on representative subsets for generic cases." versus "I care about actual performance on my non-representative subset for my specific query."
  6. Specific case: “amsterdam” ✤ American English 2009, 1700-1850
  7. Specific case: “amsterdam” ✤ American English 2009, 1700-1850. How reliable are these numbers?
  9. Specific case: “amsterdam” ✤ American English 2009, 1700-1850. Tool maker: recall (average) = 90%
  10. Specific case: “amsterdam” ✤ American English 2009, 1700-1850. Tool maker: recall (average) = 90%. Is “amsterdam” in the 10% or in the 90%?
  11. “amsterdam” and “amfterdam” ✤ American English 2009, 1700-1850
  13. “amsterdam” and “amfterdam” ✤ American English 2009, 1700-1850. How do I know? What else?
  14. Different versions / pipelines ✤ Current American English, 1700-1850
  15. Different versions / pipelines ✤ Current American English, 1700-1850. How do I know that they fixed it? What else?
  16. Research questions ✤ How can we give humanities researchers the information they need to understand how the OCR limitations influence their research tasks? ✤ What is a good way of estimating uncertainty for a specific case? ✤ How can we get (the data for) better estimates?
  17. Understanding potential sources of bias. Pipeline stages shown: scanning, pre-processing, OCR, post-processing, ingestion ✤ some details difficult to reconstruct ✤ essential to understand overall impact
  18. Potential starting point. Pipeline stages shown: scanning, pre-processing, OCR, post-processing, ingestion ✤ Confidence in the pipeline?
  19. Example for OCR confidence values
  22. Example for OCR confidence values. What does 0.732 mean?
  23. Example for post-processing
  25. OCR confidence values useful? ✤ available for all items in the collection: page, word, character ✤ calibration based on limited ground truth? ✤ only for highest ranked words / characters, other candidates missing
  26. Future work ✤ How to estimate impact of OCR errors on use cases outside the ground truth? ✤ Can we crowdsource part of this estimation problem? ✤ How to convey the estimated impact to researchers using the corpus?
  27. How does OCR quality affect your research? Myriam.Traub@cwi.nl, Jacco.van.Ossenbruggen@cwi.nl
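
The slides on “amsterdam” and “amfterdam” contrast a tool maker's average recall with the recall a researcher gets for one specific query. The following is a minimal sketch of that contrast, not the authors' method. It assumes a small, hand-aligned ground-truth sample exists; the token pairs in `ALIGNED_TOKENS` are invented for illustration, and the function names are hypothetical. It also assumes the only error of interest is a long s read as "f" in non-final positions.

```python
from collections import defaultdict
from itertools import product

# Invented sample of (ground-truth token, OCR token) pairs. Not real corpus data.
ALIGNED_TOKENS = [
    ("amsterdam", "amfterdam"),
    ("amsterdam", "amsterdam"),
    ("the", "the"),
    ("the", "the"),
    ("ship", "ship"),
    ("harbour", "harbour"),
    ("of", "of"),
    ("of", "of"),
    ("city", "city"),
    ("trade", "trade"),
]


def average_recall(pairs):
    """Fraction of all ground-truth tokens that the OCR reproduced exactly."""
    correct = sum(1 for truth, ocr in pairs if truth == ocr)
    return correct / len(pairs)


def recall_per_term(pairs):
    """Exact-match recall computed separately for every ground-truth term."""
    counts = defaultdict(lambda: [0, 0])  # term -> [correct, total]
    for truth, ocr in pairs:
        counts[truth][1] += 1
        if truth == ocr:
            counts[truth][0] += 1
    return {term: correct / total for term, (correct, total) in counts.items()}


def long_s_variants(term):
    """All spellings obtained by replacing any non-final 's' with 'f'.

    Simplifying assumption: the long s never occurs at the end of a word.
    """
    options = []
    for position, char in enumerate(term):
        is_final = position == len(term) - 1
        options.append(("s", "f") if char == "s" and not is_final else (char,))
    return {"".join(combo) for combo in product(*options)}


def recall_with_variants(pairs, term):
    """Recall for one term when the query also matches its long-s variants."""
    variants = long_s_variants(term)
    total = sum(1 for truth, _ in pairs if truth == term)
    found = sum(1 for truth, ocr in pairs if truth == term and ocr in variants)
    return found / total if total else None


print("average recall:", average_recall(ALIGNED_TOKENS))
print("recall for 'amsterdam':", recall_per_term(ALIGNED_TOKENS)["amsterdam"])
print("with long-s variants:", recall_with_variants(ALIGNED_TOKENS, "amsterdam"))
```

In this invented sample the average looks healthy while the single query term fares much worse, which is the gap between the two perspectives on OCR quality. Expanding the query only helps for error types the researcher already knows about, so the "What else?" question on the slides remains open.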
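
The question "What does 0.732 mean?" and the point about calibration against limited ground truth can be probed with a simple reliability table: group words by reported confidence and compare each group's mean confidence with the share of words that were actually correct. The sketch below is one possible approach under stated assumptions, not something shown in the talk. It assumes a sample of words with both an engine-reported confidence in [0, 1] and a manual correctness judgment; the values in `SAMPLE` are invented, and `calibration_table` is a hypothetical helper.

```python
def calibration_table(samples, n_bins=5):
    """Group (confidence, is_correct) pairs into equal-width confidence bins.

    Returns one row per non-empty bin with the bin range, the number of
    words, the mean reported confidence, and the observed accuracy.
    """
    bins = [[] for _ in range(n_bins)]
    for confidence, is_correct in samples:
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, is_correct))

    table = []
    for index, members in enumerate(bins):
        if not members:
            continue
        mean_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        table.append({
            "range": (index / n_bins, (index + 1) / n_bins),
            "n": len(members),
            "mean_confidence": mean_confidence,
            "observed_accuracy": accuracy,
        })
    return table


# Invented (confidence, manually verified as correct?) pairs. Not real OCR output.
SAMPLE = [
    (0.95, True), (0.91, True), (0.88, True), (0.85, False),
    (0.732, True), (0.71, False), (0.70, True), (0.65, False),
    (0.45, False), (0.41, True),
]

for row in calibration_table(SAMPLE):
    low, high = row["range"]
    print(f"{low:.1f}-{high:.1f}  n={row['n']}  "
          f"mean confidence={row['mean_confidence']:.2f}  "
          f"observed accuracy={row['observed_accuracy']:.2f}")
```

If mean confidence and observed accuracy diverge within a bin, the raw confidence value cannot be read as a probability of correctness for that range. With limited ground truth the bins contain few words, so such a table gives only a rough indication.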
