
Impact of Crowdsourcing OCR Improvements on Retrievability Bias

Digitized document collections often suffer from OCR errors that may impact a document’s readability and retrievability. We studied the effects of correcting OCR errors on the retrievability of documents in a historic newspaper corpus of a digital library. We computed retrievability scores for the uncorrected documents using queries from the library’s search log, and found that the document OCR character error rate and retrievability score are strongly correlated. We computed retrievability scores for manually corrected versions of the same documents, and report on differences in their total sum, the overall retrievability bias, and the distribution of these changes over the documents, queries and query terms. For large collections, often only a fraction of the corpus is manually corrected. Using a mixed corpus, we assess how this mix affects the retrievability of the corrected and uncorrected documents. The correction of OCR errors increased the number of documents retrieved in all conditions. The increase contributed to a less biased retrieval, even when taking the potential lower ranking of uncorrected documents into account.



  1. 1. Impact of Crowdsourcing OCR Improvements on Retrievability Bias Myriam C. Traub, Thaer Samar, Jacco van Ossenbruggen, Lynda Hardman Centrum Wiskunde & Informatica, Amsterdam, NL 1
  2. 2. Motivation: Retrievability (Bias) • Introduced by Azzopardi et al. in 2008 [1] • Retrievability score counts how often a document is retrieved as one of the top K documents by a given set of queries • Gini coefficient quantifies inequality in the distribution of scores (see the sketch below) 2
      [1] L. Azzopardi and V. Vinay. Retrievability: An evaluation measure for higher order information access tasks. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM ’08, pages 561–570, New York, NY, USA, 2008. ACM.
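To make the equality measure concrete: a minimal Python sketch of the Gini coefficient over a collection's r(d) scores, using the standard sorted-score formulation. The function and the example scores are illustrative, not code from the study.

```python
import numpy as np

def gini(scores):
    """Gini coefficient of non-negative r(d) scores.

    0 means all documents are equally retrievable; values near 1 mean
    retrievability is concentrated in very few documents.
    """
    x = np.sort(np.asarray(scores, dtype=float))  # ascending order
    n = x.size
    if n == 0 or x.sum() == 0.0:
        return 0.0
    # G = (2 * sum_i i*x_i) / (n * sum_i x_i) - (n + 1) / n, with i = 1..n
    i = np.arange(1, n + 1)
    return (2.0 * np.sum(i * x)) / (n * x.sum()) - (n + 1.0) / n

# A highly skewed score distribution yields a high Gini coefficient.
print(gini([0, 0, 0, 1, 100]))  # ~0.80
```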
  3. 3. Study on Retrievability Bias (JCDL 2016) • Follow-up study of "Querylog-based Assessment of Retrievability Bias in a Large Newspaper Corpus" • Large-scale study based on 102 million newspaper items, 4 million simulated queries and 957,239 real user queries • Findings: • Large inequalities among the documents, indicating retrievability bias • Document length impacts retrieval; no evidence for other technical bias found • Simulated queries yield very different results from real queries; experiments should take operators and facets into account 3
  4. 4. Potential Causes for Retrievability Bias • Skills and interest of users • Collection bias • Ranking algorithm • UI design • (OCR) quality 4 Courante uyt Italien, Duytslandt, &c (14-06-1618)
  5. 5.–6. Research Questions • RQ1: Relation between OCR quality and retrievability • RQ2: Direct impact of correction on retrievability bias of corrected documents • RQ3: Indirect impact of correction of a fraction of documents on non-corrected ones • Guiding question: How does bias caused by OCR quality impact my (re-)search results? 5
  7. 7. Experimental Setup 6
  8. 8. Documents & Queries • Subset of the historic newspaper archive maintained by the National Library of the Netherlands (public, KB) • Ground truth set of 100 manually corrected newspaper issues (822 articles) published in the 17th century and the WWII period (public, KB) • Character error rates (CER) computed with [1] • User queries collected from delpher.nl (confidential, KB, same as in the previous study); stopwords and short terms removed, queries deduplicated (see the sketch below) 7 [1] https://www.digitisation.eu/ De geus onder studenten (14-10-1940)
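As referenced above, a minimal sketch of this style of query-log preprocessing. The stopword list, the minimum term length, and the example queries are hypothetical; the study's actual settings are not given on the slide.

```python
def preprocess_queries(raw_queries, stopwords, min_len=3):
    """Remove stopwords and short terms, then deduplicate queries."""
    seen, cleaned = set(), []
    for q in raw_queries:
        terms = [t for t in q.lower().split()
                 if t not in stopwords and len(t) >= min_len]
        key = " ".join(terms)
        if terms and key not in seen:  # drop empty and duplicate queries
            seen.add(key)
            cleaned.append(terms)
    return cleaned

# Hypothetical example with a tiny Dutch stopword list.
queries = ["de oude kerk", "oude kerk", "de"]
print(preprocess_queries(queries, stopwords={"de", "het", "een"}))
# [['oude', 'kerk']]
```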
  9. 9. 4 Corpora • Ground truth set (822 documents): • uncorrected • corrected • Ground truth + mixed in (1644 documents): • uncorrected • partially corrected 8
  10. 10. Setup – Retrievability • Indri search engine [1] • [Diagram: documents and queries are fed into the search engine] • We report on c=1, c=10, c=100 and c=infinite • Carried out on each of the four corpora 9 [1] http://www.lemurproject.org/
  11. 11.–16. Retrievability Scores r(d) • Azzopardi et al. introduced a way to measure how retrieval systems influence the accessibility of documents in a collection [1]. The retrievability score r(d) of a document d measures how accessible the document is. It is determined by several factors, including the matching function of the retrieval system and the number of documents a user is willing to evaluate. It is the result of a cumulative scoring function, defined as
      r(d) = \sum_{q \in Q} o_q \cdot f(k_{dq}, c)
      where • r(d) is the retrievability score for a document d • the sum runs over all queries q in the query set Q • k_{dq} is the rank of document d in the result list of a query q • c is the cutoff value, the number of documents a user is willing to examine in a ranked list • o_q weights the importance of a query, offering the possibility to give more weight to certain queries; we assign equal weights, o_q = 1 • f(k_{dq}, c) is a generalized utility/cost function 10
      [1] L. Azzopardi and V. Vinay. Retrievability: An evaluation measure for higher order information access tasks. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM ’08, pages 561–570, New York, NY, USA, 2008. ACM.
  17. 17. Impact Assessment • Wealth: How many documents were retrieved in total? • Sum of all r(d) scores • Equality: How are r(d) scores distributed among documents? • Gini coefficient • Retrieval per document/query: • Changes due to correction • Impact of individual (query) terms 11
  18. 18. OCR Quality & Retrievability RQ1: What is the relation between a document’s OCR character error rate and its retrievability score? 12
  19. 19. RQ1: OCR Quality & Retrievability • CER in the 17cent collection is significantly higher • r(d) scores are higher in the WWII collection • Correlation between r(d) and CER: -0.57 (Pearson) and -0.61 (Spearman), with p<0.001 • [Scatter plot: r(d) score vs. character error rate (CER); point size indicates document length (1000–3000), color indicates subset (17cent, WWII)] 13
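For illustration, such correlations can be computed from per-document (CER, r(d)) pairs with scipy; the values below are made up and merely mimic the reported negative relation.

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-document values: higher CER, lower retrievability.
cer = [0.02, 0.05, 0.10, 0.25, 0.40, 0.60]
r_d = [420,  380,  300,  150,   60,   10]

pr, pp = pearsonr(cer, r_d)    # linear correlation
sr, sp = spearmanr(cer, r_d)   # rank correlation
print(f"Pearson  r   = {pr:.2f} (p = {pp:.3g})")
print(f"Spearman rho = {sr:.2f} (p = {sp:.3g})")
```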
  20. 20. Direct Impact of OCR Quality RQ2: How does the correction of OCR errors impact the retrievability bias of the corrected documents? 14
  21. 21. Direct Impact • [Diagram: the complete ground truth corpus in two conditions, uncorrected vs. corrected] 15
  22. 22. Impact of Correction on Wealth • More documents retrieved from the corrected documents • Number of queries with results increased by 8% • Impact is largest for users willing to look at the entire result list • Sum of all r(d) scores (wealth), error-prone → corrected: c=1: 338,139 → 365,855 (+8%); c=10: 1,750,340 → 2,023,283 (+16%); c=100: 4,341,536 → 5,477,566 (+26%); c=infinite: 4,521,030 → 6,033,099 (+34%) 16
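For reference, the relative gains can be recomputed directly from the wealth sums on this slide:

```python
# Sums of r(d) scores (wealth) per cutoff, taken from the slide.
wealth = {
    "c=1":        (338_139,   365_855),   # (error-prone, corrected)
    "c=10":       (1_750_340, 2_023_283),
    "c=100":      (4_341_536, 5_477_566),
    "c=infinite": (4_521_030, 6_033_099),
}
for c, (err, cor) in wealth.items():
    print(f"{c}: +{(cor - err) / err:.1%}")
# c=1: +8.2%, c=10: +15.6%, c=100: +26.2%, c=infinite: +33.4%
```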
  23. 23. Impact on Equality • Correction lowers inequality among documents • In contrast to earlier findings, Gini coefficients do not decrease with larger c's • Correction fixes more false negatives (FN) than false positives (FP) (c=infinite): it increases both wealth and equality • [Bar chart: Direct Impact: Gini coefficients for conditions 822GTerr and 822GTcor at c = 1, 10, 100, infinite] 17
  24. 24. Retrieval per Document • Few documents lose r(d) scores after correction: good, these are former FP caused by OCR errors that are no longer retrieved • Most documents, however, gain, with the 17cent corpus improving to a larger extent but still remaining at a lower level (a per-document delta sketch follows below) 18
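As mentioned above, a tiny sketch of deriving per-document gains and losses from the r(d) scores of the two conditions; the scores here are hypothetical.

```python
def r_deltas(r_err, r_cor):
    """Change in r(d) per document between the two conditions."""
    docs = set(r_err) | set(r_cor)
    return {d: r_cor.get(d, 0) - r_err.get(d, 0) for d in docs}

# Hypothetical scores: d2 loses (a former false positive), others gain.
deltas = r_deltas({"d1": 3, "d2": 7}, {"d1": 9, "d2": 2, "d3": 4})
gainers = [d for d, v in deltas.items() if v > 0]
losers  = [d for d, v in deltas.items() if v < 0]
print(sorted(gainers), sorted(losers))  # ['d1', 'd3'] ['d2']
```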
  25. 25. Retrieval per Query • Only 44% of the queries retrieved at least one document • Despite the small collection size, we see large gains • Some queries lose because they retrieved FP from the uncorrected document set 19
  26. 26. Retrieval per Query • Top 10 terms cause 35% of the wealth increase. These terms: 1. appear very frequently in user queries, and 2. are highly susceptible to OCR errors in the documents • Conclusion: real user queries are also a source of bias 20
      [Figure 4: Queries ordered by their gain/loss in number of retrieved documents. The position on the y-axis represents the number of documents retrieved from 822GTcor.]
      The distributions of the differences in r(d) scores show that for all cutoff values the median of the differences is positive, increasing from 8 (c = 1) to 912 (c = infinite). The maximum loss and the maximum gain in r(d) scores increase for larger cutoff values c, the latter to a much larger extent. Note that for c = 1 and c = 10 the entire first quartile consists of documents that scored worse in the corrected version: the competition in the top results makes the gain of some documents the loss of others.
      [Figure 5: Cumulative r(d) difference (%) over query terms ordered by difference in impact (descending); very few query terms contribute a large fraction of the overall wealth.]
      Top ten single-term queries (* English glosses from the slide: new, Amsterdam, end, Mister, died/dead, grand/large, Willem (name), two, three, old):
      Term        Freq. in queries   822GTerr   822GTcor   Cum. impact
      nieuwe      1,903              99         166         7.36%
      amsterdam   7,885              41          57        14.65%
      ende          185             103         480        18.69%
      heer          826              20          89        21.99%
      overleden   3,698               5          18        24.78%
      groot       1,573             125         153        27.33%
      willem      5,375               5          13        29.81%
      twee          319              64         175        31.83%
      drie          401              34         120        33.81%
      oude          991              50          78        35.41%
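A sketch of ranking query terms by their cumulative share of the wealth gain, given per-term differences in summed r(d) between the two conditions. The first three deltas are derived from the table above (166−99, 57−41, 480−103); the "rest" bucket is a placeholder for all remaining terms.

```python
def cumulative_impact(term_deltas):
    """Order terms by wealth gain and report the cumulative share.

    term_deltas: dict mapping query term -> difference in summed r(d)
    between the corrected and the error-prone condition.
    Assumes an overall positive gain.
    """
    total = sum(term_deltas.values())
    ranked = sorted(term_deltas.items(), key=lambda kv: kv[1], reverse=True)
    running = 0.0
    for term, delta in ranked:
        running += delta
        yield term, delta, running / total  # cumulative share of the gain

# Deltas for three real terms from the table; "rest" is hypothetical.
deltas = {"nieuwe": 67, "amsterdam": 16, "ende": 377, "rest": 540}
for term, d, share in cumulative_impact(deltas):
    print(f"{term:10s} {d:5d}  {share:6.1%}")
```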
  27. 27. Indirect Impact of OCR Quality RQ3: How does the correction of a fraction of error-prone documents influence the retrievability of non-corrected ones? 21
  28. 28.–31. Indirect Impact • [Diagram: mixed condition, in which half of the corpus was corrected] • 50% are the same documents as in the previous RQ (corrected), 50% are newly mixed-in, uncorrected documents • We're mainly interested in these uncorrected documents 22
  32. 32. Equality still increases! • Equality in r(d) scores is higher in the corrected document collection • Again, correction has decreased retrievability bias • [Bar chart: Indirect Impact: Gini coefficients for conditions 1644err and 1644mix at c = 1, 10, 100, infinite] 23
  33. 33.–35. Impact on Wealth (1644_err → 1644_mix) • Complete document collection: correction increases wealth: c=1: 353,613 → 376,139; c=10: 2,099,816 → 2,307,996; c=100: 6,698,945 → 7,676,830; c=infinity: 8,008,574 → 9,520,643 • Ground truth documents only: increase in wealth: c=1: 180,079 → 225,809 (+20%); c=10: 1,112,705 → 1,420,322 (+22%); c=100: 3,783,514 → 4,898,694 (+23%); c=infinite: 4,521,030 → 6,033,099 (+25%) • Mixed-in documents only: decrease in wealth: c=1: 173,534 → 150,330 (-13%); c=10: 987,111 → 887,674 (-10%); c=100: 2,915,431 → 2,778,136 (-5%); c=infinity: 3,487,544 → 3,487,544 (±0%) 24
  36. 36. Retrieval per Document (mixed-in only, c=10) • Most documents' scores change very little; if they change, they tend to lose r(d) scores • 171 documents gain r(d) scores • These benefit from competing FP matches that disappeared 25
  37. 37. Conclusions 26
  38. 38. Conclusions • In our study, OCR correction • increases overall retrievability • reduces retrievability bias, even in a partially corrected corpus • Higher scores are caused by a small set of terms that are • frequent in queries and • susceptible to OCR errors • Using real user queries is essential to understand the actual bias caused by OCR errors. 27
  39. 39. Impact of Crowdsourcing OCR Improvements on Retrievability Bias • We would like to thank the National Library of the Netherlands (KB) for making the newspaper corpus and the (sensitive) user data available to us for research. 28 This research is partly funded by the Dutch COMMIT/ program, the WebART project and the VRE4EIC project, a project that has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement No 676247.
