Impact of Crowdsourcing OCR Improvements
on Retrievability Bias
Myriam C. Traub, Thaer Samar, Jacco van Ossenbruggen, Lynda Hardman

Centrum Wiskunde & Informatica, Amsterdam, NL
Motivation: Retrievability (Bias)
• Introduced by Azzopardi et al. in 2008 [1]
• Retrievability score counts how often a document is retrieved as one of
the top K documents by a given set of queries
• Gini coefficient quantifies inequality in the distribution of
scores
[1]  L. Azzopardi and V. Vinay. Retrievability: An evaluation measure for higher order information access tasks. In Proceedings of the 17th
ACM Conference on Information and Knowledge Management, CIKM ’08, pages 561–570, New York, NY, USA, 2008. ACM.
Study on Retrievability Bias (JCDL2016)
• Follow-up study of Querylog-based Assessment of Retrievability Bias in
a Large Newspaper Corpus
• Large-scale study based on 102 million newspaper items, 4 million
simulated queries and 957,239 real user queries
• Findings:
• Large inequalities among the documents indicating retrievability bias
• Document length impacts retrieval, no evidence for other technical
bias found
• Simulated queries yield very different results from real queries;
experiments should take operators and facets into account
Potential Causes for
Retrievability Bias
• Skills and interest of users
• Collection bias
• Ranking algorithm
• UI design
• (OCR) quality
Courante uyt Italien, Duytslandt, &c (14-06-1618)
Research Questions
• RQ1: Relation between OCR quality and
retrievability
• RQ2: Direct impact of correction on
retrievability bias of corrected documents
• RQ3: Indirect impact of correction of a
fraction of documents on non-corrected
ones
How does bias caused by
OCR quality impact my (re-)search
results?
Experimental Setup
Documents & Queries
• Subset of the historic newspaper
archive maintained by the National
Library of the Netherlands (public,
KB)

• Ground truth set of 100 manually
corrected newspaper issues (822
articles) published in the 17th century
and WWII period (public, KB)

• Character error rates (CER)
computed with [1]

• User queries collected from
delpher.nl (confidential, KB, same as
in previous study); stopwords and short
terms removed, deduplicated
[1] https://www.digitisation.eu/
De geus onder studenten (14-10-1940)
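The slides do not say how the tool behind [1] computes the character error rate. A common definition is the character-level edit distance between the OCR output and the ground truth, divided by the length of the ground truth. The following is a minimal sketch under that assumption; `levenshtein` and `cer` are hypothetical helper names, not part of the tool.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def cer(ocr_text: str, ground_truth: str) -> float:
    """Assumed definition: edit distance divided by ground-truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ocr_text, ground_truth) / len(ground_truth)
```

With this definition, an OCR reading such as "amfterdam" for "amsterdam" has one substitution in nine characters.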
4 Corpora
• Ground truth set (822 documents):
• uncorrected
• corrected
• Ground truth + mixed in (1644 documents):
• uncorrected
• partially corrected
Setup - Retrievability
• Documents and queries are processed with the Indri search engine [1]
• We report on c=1, c=10, c=100 and c=infinite
• Carried out on each of the four corpora
[1] http://www.lemurproject.org/
Retrievability Scores r(d)
Azzopardi et al. introduced a way to measure how retrieval systems
influence the accessibility of documents in a collection [1].
The retrievability score of a document d, r(d), measures how
accessible a document is. It is determined by several factors,
including the matching function of the retrieval system and
the number of documents a user is willing to evaluate. The
retrievability score is the result of a cumulative scoring function,
defined as:
$$r(d) = \sum_{q \in Q} o_q \cdot f(k_{dq}, c),$$
where c defines the number of documents a user is willing
to examine in a ranked list. We use cutoff values c = 1[…],
c = 100, and c = 1000. The coefficient o_q weights the importance
of a query. We assign equal weights, with o_q = 1.
The function f(k_{dq}, c) is a generalized utility/cost function […]
• r(d): retrievability score for a document d
• c: cutoff value
• k_{dq}: rank of document d in the result list of a query q
• Sum: for all queries q in a query set Q
• o_q: possibility to give more weight to certain queries; we use o_q = 1
[1]  L. Azzopardi and V. Vinay. Retrievability: An evaluation measure for higher order information access tasks. In Proceedings of the 17th
ACM Conference on Information and Knowledge Management, CIKM ’08, pages 561–570, New York, NY, USA, 2008. ACM.
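The scoring function above can be sketched in a few lines of Python. This is an illustration only, not the code used in the study: it assumes the cumulative variant of f (f = 1 if the rank k_dq is within the cutoff c, else 0), and the function name `retrievability` and its input layout (one ranked result list per query) are hypothetical.

```python
from collections import defaultdict


def retrievability(rankings, c=None, weights=None):
    """Compute r(d) for every retrieved document.

    rankings: dict mapping a query to its ranked list of document ids (best first)
    c:        cutoff; None stands for c = infinite (the entire result list)
    weights:  optional dict mapping a query to o_q; defaults to o_q = 1
    """
    scores = defaultdict(float)
    for query, ranked_docs in rankings.items():
        o_q = 1.0 if weights is None else weights.get(query, 1.0)
        top = ranked_docs if c is None else ranked_docs[:c]
        for doc in top:
            # Assumed cumulative f: counts 1 when rank k_dq <= c, else 0.
            scores[doc] += o_q
    return dict(scores)


# Hypothetical toy result lists for three queries.
toy_rankings = {
    "amsterdam": ["d1", "d2", "d3"],
    "willem": ["d2", "d1"],
    "oude": ["d2"],
}
print(retrievability(toy_rankings, c=1))
print(retrievability(toy_rankings))
```

Documents that are never retrieved do not appear in the returned dictionary; they need to be added with a score of 0 before any inequality measure is computed over the whole collection.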
Impact Assessment
• Wealth: How many documents were retrieved in total?
• Sum of all r(d) scores
• Equality: How are r(d) scores distributed among documents?
• Gini coefficient
• Retrieval per document/query:
• Changes due to correction
• Impact of individual (query) terms
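Both summary measures can be computed directly from a list of r(d) scores. The sketch below uses the standard closed form of the Gini coefficient over sorted values; the function names `wealth` and `gini` are hypothetical, and the slides do not show which implementation the authors used.

```python
def wealth(scores):
    """Wealth: the sum of all r(d) scores."""
    return sum(scores)


def gini(scores):
    """Gini coefficient: 0 means perfect equality, values near 1 mean
    that a few documents hold almost all of the wealth."""
    values = sorted(scores)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # G = 2 * sum(i * x_i) / (n * sum(x_i)) - (n + 1) / n, with x sorted ascending
    weighted = sum(i * v for i, v in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n
```

For example, four documents with equal scores give a Gini coefficient of 0, while one document holding all the wealth among four gives 0.75, the maximum for n = 4.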
OCR Quality & Retrievability
RQ1: What is the relation between a document’s OCR character error rate and its
retrievability score?
RQ1: OCR Quality & Retrievability
• CER is significantly higher in the 17cent collection
• r(d) scores are higher in the WWII collection
• Correlation between r(d) and CER: -0.57 (Pearson) and -0.61 (Spearman) with p<0.001
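For readers who want to reproduce this kind of analysis on their own data, the two correlation coefficients can be computed as sketched below. The data points are hypothetical, the helper names are illustrative, and the sketch does not compute p-values; it is not the authors' analysis code.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical per-document values: CER and r(d) score.
toy_cer = [0.05, 0.10, 0.30, 0.60]
toy_rd = [900, 700, 400, 50]
print(pearson(toy_cer, toy_rd), spearman(toy_cer, toy_rd))
```

A negative coefficient, as reported on the slide, means that documents with a higher CER tend to have lower r(d) scores.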
[Figure: r(d) score plotted against character error rate (CER, 0% to 80%); legend: document length (1000, 2000, 3000) and subset (17cent, WWII)]
Direct Impact of OCR Quality
RQ2: How does the correction of OCR errors impact the retrievability bias of the
corrected documents?
Direct Impact
• Uncorrected → Corrected: the complete corpus was corrected
Impact of Correction on Wealth
• More documents are
retrieved from the corrected
collection
• Number of queries with
results increased by 8%
• Impact is largest for
users willing to look at
the entire result list
Sum of all r(d) scores (wealth), by condition:
Cutoff       error-prone   corrected   Change
c=1          338,139       365,855     +8%
c=10         1,750,340     2,023,283   +16%
c=100        4,341,536     5,477,566   +26%
c=infinite   4,521,030     6,033,099   +34%
Impact on Equality
• Correction lowers inequality among
documents
• In contrast to earlier findings, Gini
coefficients do not decrease with
larger c’s
• Correction fixes more false negatives (FN) than
false positives (FP) at c=infinite:
• Increases both wealth and
equality
[Figure: Direct Impact: Gini Coefficients. Gini coefficient per cutoff (1, 10, 100, infinite) for the conditions 822GTcor and 822GTerr]
Retrieval per Document
• Few documents lose r(d) scores after correction. This is good: they are
former FP caused by OCR errors and are no longer retrieved
• Most documents, however, gain; the 17cent corpus improves to a larger
extent but still remains at a lower level
Retrieval per Query
• Only 44% of the queries retrieved at least one document
• Despite small collection size, we see large gains
• Some queries lose because they retrieved FP from the uncorrected document set
Retrieval per Query
Top 10 terms cause 35% of the
wealth increase. These terms:
1. Appear very frequently in
user queries and
2. Are highly susceptible to
OCR errors in the
documents
Conclusion: Real queries are
also a source of bias
* English translations of the top ten terms: new, Amsterdam, end, Mister,
died/dead, grand/large, Willem (name), two, three, old
Figure 4: Queries ordered by their gain/loss in number of
retrieved documents. The position on the y-axis represents
the number of documents retrieved from 822GTcor.
[…] histograms. The distributions of the differences in r(d) scores in
Table 2 show that for all cutoff values, the median of the differences is
positive, and increases from 8 (c = 1) to 912 (c = ∞). The maximum
loss and the maximum gain in r(d) scores increase for larger cutoff
values c, the latter to a much larger extent. Note that for c = 1 and
c = 10 the entire first quartile is filled with documents that scored
worse in the corrected version. This shows that the competition
in the top results makes the gain of some documents the loss of
others.
Increased retrieval per query. In a final step, we investigated […]
[Figure: cumulative r(d) difference (%) over query terms ordered by difference in impact (descending)]
Query term   Frequency in queries   Frequency in 822GTerr   Frequency in 822GTcor   Cum. impact
nieuwe       1,903                  99                      166                     7.36%
amsterdam    7,885                  41                      57                      14.65%
ende         185                    103                     480                     18.69%
heer         826                    20                      89                      21.99%
overleden    3,698                  5                       18                      24.78%
groot        1,573                  125                     153                     27.33%
willem       5,375                  5                       13                      29.81%
twee         319                    64                      175                     31.83%
drie         401                    34                      120                     33.81%
oude         991                    50                      78                      35.41%
Figure 5: The accumulated impact scores of single-term
queries show that very few query terms contribute a large
fraction of the overall wealth. The top ten query terms ac[…]
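The last column of the table is cumulative, so the share contributed by each individual term is the difference between consecutive rows. A minimal sketch of that arithmetic, using the cumulative values from the table (the names `cumulative_impact` and `per_term_impact` are hypothetical):

```python
# Cumulative impact (%) per query term, copied from the table above.
cumulative_impact = [
    ("nieuwe", 7.36), ("amsterdam", 14.65), ("ende", 18.69), ("heer", 21.99),
    ("overleden", 24.78), ("groot", 27.33), ("willem", 29.81),
    ("twee", 31.83), ("drie", 33.81), ("oude", 35.41),
]


def per_term_impact(cumulative):
    """Recover each term's own share by differencing the cumulative column."""
    result = {}
    previous = 0.0
    for term, cum in cumulative:
        result[term] = round(cum - previous, 2)
        previous = cum
    return result


for term, share in per_term_impact(cumulative_impact).items():
    print(f"{term}: {share:.2f}%")
```

Read this way, "nieuwe" and "amsterdam" each contribute a little over 7 percentage points, and the remaining eight terms contribute between roughly 1.6 and 4 points each.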
Indirect Impact of OCR Quality
RQ3: How does the correction of a fraction of error-prone documents influence the
retrievability of non-corrected ones?
Indirect Impact
• Uncorrected → Mixed: half of the corpus was corrected
• 50% same documents as in previous RQ
• 50% new documents (we're mainly interested in these documents)
Equality still increases!
• Equality in r(d) scores is higher in
the corrected document collection
• Again, correction has decreased
retrievability bias
[Figure: Indirect Impact: Gini Coefficients. Gini coefficient per cutoff (1, 10, 100, infinite) for the conditions 1644err and 1644mix]
Impact on Wealth
• Complete: Correction increases wealth
• GT only:
• Increase in wealth
• c=1: +20%
• c=10: +22%
• c=100: +23%
• c=infinite: +25%
• Mixed-in only:
• Decrease in wealth:
• c=1: -13%
• c=10: -10%
• c=100: -5%
Wealth per cutoff and condition:
Complete Document Collection
Cutoff       1644_err    1644_mix
c=1          353,613     376,139
c=10         2,099,816   2,307,996
c=100        6,698,945   7,676,830
c=infinity   8,008,574   9,520,643
Ground Truth Document Collection
Cutoff       1644_err    1644_mix
c=1          180,079     225,809
c=10         1,112,705   1,420,322
c=100        3,783,514   4,898,694
c=infinity   4,521,030   6,033,099
Mixed-in Document Collection
Cutoff       1644_err    1644_mix
c=1          173,534     150,330
c=10         987,111     887,674
c=100        2,915,431   2,778,136
c=infinity   3,487,544   3,487,544
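The percentages for the mixed-in documents can be checked against the wealth values shown in the chart for the Mixed-in Document Collection. The sketch below assumes that the first value of each pair in the chart belongs to 1644_mix and the second to 1644_err, an ordering inferred from the stated decrease rather than given explicitly; `relative_change` is a hypothetical helper.

```python
# Wealth of the mixed-in documents as (1644_err, 1644_mix) per cutoff.
# The assignment of values to conditions is an assumption (see lead-in).
mixed_in_wealth = {
    "c=1": (173_534, 150_330),
    "c=10": (987_111, 887_674),
    "c=100": (2_915_431, 2_778_136),
    "c=infinity": (3_487_544, 3_487_544),
}


def relative_change(before, after):
    """Change from `before` to `after`, in percent of `before`."""
    return (after - before) / before * 100


for cutoff, (err, mix) in mixed_in_wealth.items():
    print(f"{cutoff}: {relative_change(err, mix):+.1f}%")
```

At c=infinity the wealth of the mixed-in documents is identical in both conditions, so the change is zero; the losses occur only where documents compete for a limited number of top ranks.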
Retrieval per Document (mixed-in only, c=10)
• Most documents’ scores change very little; where they do change, they lose r(d) scores
• 171 documents gain r(d) scores
• Benefit from FP matches that disappeared
Conclusions
• In our study, OCR correction
• Increases overall retrievability
• Reduces retrievability bias, even in a partially corrected corpus
• Higher scores caused by small set of terms that are
• frequent in queries and
• susceptible to OCR errors
• Using real user queries is essential to understand actual bias caused
by OCR errors.
We would like to thank the National Library of the Netherlands (KB)
for making the newspaper corpus and the
(sensitive) user data available to us for
research.
This research is partly funded by the Dutch COMMIT/ program, the
WebART project and the VRE4EIC project, a project that has received
funding from the European Union’s Horizon 2020 research and innovation
program under grant agreement No 676247.

More Related Content

Similar to Impact of Crowdsourcing OCR Improvements on Retrievability Bias

IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsi...
IRS-Lecture-Notes irsirs    IRS-Lecture-Notes irsirs   IRS-Lecture-Notes irsi...IRS-Lecture-Notes irsirs    IRS-Lecture-Notes irsirs   IRS-Lecture-Notes irsi...
IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsi...onlmcq
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligencevini89
 
Document ranking using qprp with concept of multi dimensional subspace
Document ranking using qprp with concept of multi dimensional subspaceDocument ranking using qprp with concept of multi dimensional subspace
Document ranking using qprp with concept of multi dimensional subspacePrakash Dubey
 
Query expansion_group42_ire
Query expansion_group42_ireQuery expansion_group42_ire
Query expansion_group42_ireKovidaN
 
An introduction to system-oriented evaluation in Information Retrieval
An introduction to system-oriented evaluation in Information RetrievalAn introduction to system-oriented evaluation in Information Retrieval
An introduction to system-oriented evaluation in Information RetrievalMounia Lalmas-Roelleke
 
Boolean,vector space retrieval Models
Boolean,vector space retrieval Models Boolean,vector space retrieval Models
Boolean,vector space retrieval Models Primya Tamil
 
Testing Different Log Bases for Vector Model Weighting Technique
Testing Different Log Bases for Vector Model Weighting TechniqueTesting Different Log Bases for Vector Model Weighting Technique
Testing Different Log Bases for Vector Model Weighting Techniquekevig
 
Testing Different Log Bases for Vector Model Weighting Technique
Testing Different Log Bases for Vector Model Weighting TechniqueTesting Different Log Bases for Vector Model Weighting Technique
Testing Different Log Bases for Vector Model Weighting Techniquekevig
 
DIRA : A FRAMEWORK OF DATA INTEGRATION USING DATA QUALITY
DIRA : A FRAMEWORK OF DATA INTEGRATION USING DATA QUALITYDIRA : A FRAMEWORK OF DATA INTEGRATION USING DATA QUALITY
DIRA : A FRAMEWORK OF DATA INTEGRATION USING DATA QUALITYIJDKP
 
Analysis of data quality and information quality problems in digital manufact...
Analysis of data quality and information quality problems in digital manufact...Analysis of data quality and information quality problems in digital manufact...
Analysis of data quality and information quality problems in digital manufact...Mary Montoya
 
Informatio retrival evaluation
Informatio retrival evaluationInformatio retrival evaluation
Informatio retrival evaluationNidhirBiswas
 
ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...
ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...
ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...ijnlc
 
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCESFINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCESkevig
 
Workflow Provenance: From Modelling to Reporting
Workflow Provenance: From Modelling to ReportingWorkflow Provenance: From Modelling to Reporting
Workflow Provenance: From Modelling to ReportingRayhan Ferdous
 
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494Sean Golliher
 
ROMAN URDU OPINION MINING SYSTEM (RUOMIS)
ROMAN URDU OPINION MINING SYSTEM (RUOMIS) ROMAN URDU OPINION MINING SYSTEM (RUOMIS)
ROMAN URDU OPINION MINING SYSTEM (RUOMIS) cseij
 
Hybrid geo textual index structure
Hybrid geo textual index structureHybrid geo textual index structure
Hybrid geo textual index structurecseij
 
Scalable and efficient cluster based framework for multidimensional indexing
Scalable and efficient cluster based framework for multidimensional indexingScalable and efficient cluster based framework for multidimensional indexing
Scalable and efficient cluster based framework for multidimensional indexingeSAT Journals
 
Scalable and efficient cluster based framework for
Scalable and efficient cluster based framework forScalable and efficient cluster based framework for
Scalable and efficient cluster based framework foreSAT Publishing House
 

Similar to Impact of Crowdsourcing OCR Improvements on Retrievability Bias (20)

IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsi...
IRS-Lecture-Notes irsirs    IRS-Lecture-Notes irsirs   IRS-Lecture-Notes irsi...IRS-Lecture-Notes irsirs    IRS-Lecture-Notes irsirs   IRS-Lecture-Notes irsi...
IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsi...
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligence
 
Document ranking using qprp with concept of multi dimensional subspace
Document ranking using qprp with concept of multi dimensional subspaceDocument ranking using qprp with concept of multi dimensional subspace
Document ranking using qprp with concept of multi dimensional subspace
 
Query expansion_group42_ire
Query expansion_group42_ireQuery expansion_group42_ire
Query expansion_group42_ire
 
An introduction to system-oriented evaluation in Information Retrieval
An introduction to system-oriented evaluation in Information RetrievalAn introduction to system-oriented evaluation in Information Retrieval
An introduction to system-oriented evaluation in Information Retrieval
 
Boolean,vector space retrieval Models
Boolean,vector space retrieval Models Boolean,vector space retrieval Models
Boolean,vector space retrieval Models
 
Testing Different Log Bases for Vector Model Weighting Technique
Testing Different Log Bases for Vector Model Weighting TechniqueTesting Different Log Bases for Vector Model Weighting Technique
Testing Different Log Bases for Vector Model Weighting Technique
 
Testing Different Log Bases for Vector Model Weighting Technique
Testing Different Log Bases for Vector Model Weighting TechniqueTesting Different Log Bases for Vector Model Weighting Technique
Testing Different Log Bases for Vector Model Weighting Technique
 
ME Synopsis
ME SynopsisME Synopsis
ME Synopsis
 
DIRA : A FRAMEWORK OF DATA INTEGRATION USING DATA QUALITY
DIRA : A FRAMEWORK OF DATA INTEGRATION USING DATA QUALITYDIRA : A FRAMEWORK OF DATA INTEGRATION USING DATA QUALITY
DIRA : A FRAMEWORK OF DATA INTEGRATION USING DATA QUALITY
 
Analysis of data quality and information quality problems in digital manufact...
Analysis of data quality and information quality problems in digital manufact...Analysis of data quality and information quality problems in digital manufact...
Analysis of data quality and information quality problems in digital manufact...
 
Informatio retrival evaluation
Informatio retrival evaluationInformatio retrival evaluation
Informatio retrival evaluation
 
ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...
ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...
ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...
 
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCESFINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
 
Workflow Provenance: From Modelling to Reporting
Workflow Provenance: From Modelling to ReportingWorkflow Provenance: From Modelling to Reporting
Workflow Provenance: From Modelling to Reporting
 
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
 
ROMAN URDU OPINION MINING SYSTEM (RUOMIS)
ROMAN URDU OPINION MINING SYSTEM (RUOMIS) ROMAN URDU OPINION MINING SYSTEM (RUOMIS)
ROMAN URDU OPINION MINING SYSTEM (RUOMIS)
 
Hybrid geo textual index structure
Hybrid geo textual index structureHybrid geo textual index structure
Hybrid geo textual index structure
 
Scalable and efficient cluster based framework for multidimensional indexing
Scalable and efficient cluster based framework for multidimensional indexingScalable and efficient cluster based framework for multidimensional indexing
Scalable and efficient cluster based framework for multidimensional indexing
 
Scalable and efficient cluster based framework for
Scalable and efficient cluster based framework forScalable and efficient cluster based framework for
Scalable and efficient cluster based framework for
 

More from Myriam Traub

Querylog-based Assessment of Retrievability Bias in a Large Newspaper Corpus
Querylog-based Assessment of Retrievability Bias in a  Large Newspaper CorpusQuerylog-based Assessment of Retrievability Bias in a  Large Newspaper Corpus
Querylog-based Assessment of Retrievability Bias in a Large Newspaper CorpusMyriam Traub
 
Effectiveness of Gamesourcing Expert Painting Annotations
Effectiveness of Gamesourcing Expert Painting AnnotationsEffectiveness of Gamesourcing Expert Painting Annotations
Effectiveness of Gamesourcing Expert Painting AnnotationsMyriam Traub
 
The Nature Of Digitally-Produced Data: Towards Social-Scientific Tool Criticism
The Nature Of Digitally-Produced Data: Towards Social-Scientific Tool CriticismThe Nature Of Digitally-Produced Data: Towards Social-Scientific Tool Criticism
The Nature Of Digitally-Produced Data: Towards Social-Scientific Tool CriticismMyriam Traub
 
Querylog-based Assessment of Retrievability Bias in Delpher
Querylog-based Assessment of Retrievability Bias in DelpherQuerylog-based Assessment of Retrievability Bias in Delpher
Querylog-based Assessment of Retrievability Bias in DelpherMyriam Traub
 
Impact Analysis of OCR Quality on Research Tasks in Digital Archives
Impact Analysis of OCR Quality on Research Tasks in Digital ArchivesImpact Analysis of OCR Quality on Research Tasks in Digital Archives
Impact Analysis of OCR Quality on Research Tasks in Digital ArchivesMyriam Traub
 
Estimating the Impact of OCR Quality on Research Tasks in the Digital Humanities
Estimating the Impact of OCR Quality on Research Tasks in the Digital HumanitiesEstimating the Impact of OCR Quality on Research Tasks in the Digital Humanities
Estimating the Impact of OCR Quality on Research Tasks in the Digital HumanitiesMyriam Traub
 
Measuring the Effectiveness of Gamesourcing Expert Oil Painting Annotations
Measuring the Effectiveness of Gamesourcing Expert Oil Painting AnnotationsMeasuring the Effectiveness of Gamesourcing Expert Oil Painting Annotations
Measuring the Effectiveness of Gamesourcing Expert Oil Painting AnnotationsMyriam Traub
 

More from Myriam Traub (8)

Querylog-based Assessment of Retrievability Bias in a Large Newspaper Corpus
Querylog-based Assessment of Retrievability Bias in a  Large Newspaper CorpusQuerylog-based Assessment of Retrievability Bias in a  Large Newspaper Corpus
Querylog-based Assessment of Retrievability Bias in a Large Newspaper Corpus
 
Effectiveness of Gamesourcing Expert Painting Annotations
Effectiveness of Gamesourcing Expert Painting AnnotationsEffectiveness of Gamesourcing Expert Painting Annotations
Effectiveness of Gamesourcing Expert Painting Annotations
 
The Nature Of Digitally-Produced Data: Towards Social-Scientific Tool Criticism
The Nature Of Digitally-Produced Data: Towards Social-Scientific Tool CriticismThe Nature Of Digitally-Produced Data: Towards Social-Scientific Tool Criticism
The Nature Of Digitally-Produced Data: Towards Social-Scientific Tool Criticism
 
Querylog-based Assessment of Retrievability Bias in Delpher
Querylog-based Assessment of Retrievability Bias in DelpherQuerylog-based Assessment of Retrievability Bias in Delpher
Querylog-based Assessment of Retrievability Bias in Delpher
 
Impact Analysis of OCR Quality on Research Tasks in Digital Archives
Impact Analysis of OCR Quality on Research Tasks in Digital ArchivesImpact Analysis of OCR Quality on Research Tasks in Digital Archives
Impact Analysis of OCR Quality on Research Tasks in Digital Archives
 
Tool Criticism
Tool CriticismTool Criticism
Tool Criticism
 
Estimating the Impact of OCR Quality on Research Tasks in the Digital Humanities
Estimating the Impact of OCR Quality on Research Tasks in the Digital HumanitiesEstimating the Impact of OCR Quality on Research Tasks in the Digital Humanities
Estimating the Impact of OCR Quality on Research Tasks in the Digital Humanities
 
Measuring the Effectiveness of Gamesourcing Expert Oil Painting Annotations
Measuring the Effectiveness of Gamesourcing Expert Oil Painting AnnotationsMeasuring the Effectiveness of Gamesourcing Expert Oil Painting Annotations
Measuring the Effectiveness of Gamesourcing Expert Oil Painting Annotations
 

Impact of Crowdsourcing OCR Improvements on Retrievability Bias

  • 1. Impact of Crowdsourcing OCR Improvements on Retrievability Bias. Myriam C. Traub, Thaer Samar, Jacco van Ossenbruggen, Lynda Hardman. Centrum Wiskunde & Informatica, Amsterdam, NL
  • 2. Motivation: Retrievability (Bias) • Introduced by Azzopardi et al. in 2008 [1] • Retrievability score counts how often a document is retrieved as one of the top K documents by a given set of queries • Gini coefficient quantifies inequality in the distribution of scores [1] L. Azzopardi and V. Vinay. Retrievability: An evaluation measure for higher order information access tasks. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM ’08, pages 561–570, New York, NY, USA, 2008. ACM.
  • 3. Study on Retrievability Bias (JCDL2016) • Follow-up study of Querylog-based Assessment of Retrievability Bias in a Large Newspaper Corpus • Large-scale study based on 102 million newspaper items, 4 million simulated queries and 957,239 real user queries • Findings: • Large inequalities among the documents, indicating retrievability bias • Document length impacts retrieval; no evidence for other technical bias found • Simulated queries yield very different results than real queries; experiments should take operators and facets into account
  • 4. Potential Causes for Retrievability Bias • Skills and interest of users • Collection bias • Ranking algorithm • UI design • (OCR) quality [Image: Courante uyt Italien, Duytslandt, &c (14-06-1618)]
  • 5. Research Questions • RQ1: Relation between OCR quality and retrievability • RQ2: Direct impact of correction on retrievability bias of corrected documents • RQ3: Indirect impact of correction of a fraction of documents on non-corrected ones
  • 6. Research Questions • How does bias caused by OCR quality impact my (re-)search results?
  • 8. Documents & Queries • Subset of the historic newspaper archive maintained by the National Library of the Netherlands (public, KB) • Ground truth set of 100 manually corrected newspaper issues (822 articles) published in the 17th century and the WWII period (public, KB) • Character error rates (CER) computed with [1] • User queries collected from delpher.nl (confidential, KB, same as in the previous study); stopwords and short terms removed, deduplicated [1] https://www.digitisation.eu/ [Image: De geus onder studenten (14-10-1940)]
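    To illustrate the character error rate used above, here is a minimal sketch. It assumes CER is the Levenshtein edit distance between the OCR output and the ground truth, divided by the length of the ground truth. The tool referenced in [1] may normalise or tokenise differently, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(ocr_text: str, ground_truth: str) -> float:
    """Assumed definition: edit distance divided by ground-truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ocr_text, ground_truth) / len(ground_truth)
```

    With this definition, an OCR reading "eude" for the ground-truth word "ende" has one substitution in four characters, so its CER is 0.25.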
  • 9. Four Corpora • Ground truth set (822 documents): • uncorrected • corrected • Ground truth + mixed in (1644 documents): • uncorrected • partially corrected
  • 10. Setup - Retrievability • Documents and queries are fed into the Indri search engine [1] • We report on c=1, c=10, c=100 and c=infinite • Carried out on each of the four corpora [1] http://www.lemurproject.org/
  • 11-16. Retrievability Scores r(d) • Excerpt from the paper (cropped in the slide): Azzopardi et al. introduced a way to measure how retrieval systems influence the accessibility of documents in a collection [1]. The retrievability score of a document d, r(d), measures how accessible a document is. It is determined by several factors, including the matching function of the retrieval system and the number of documents a user is willing to evaluate. The retrievability score is the result of a cumulative scoring function, defined as:
    $r(d) = \sum_{q \in Q} o_q \cdot f(k_{dq}, c)$
    where c defines the number of documents a user is willing to examine in a ranked list. We use cutoff values c = 1[…], c = 100, and c = 1000. The coefficient o_q weights the importance of a query. We assign equal weights, with o_q = 1. The function f(k_{dq}, c) is a generalized utility/cost function […]
    Annotations: • r(d): retrievability score for a document d • c: cutoff value • k_{dq}: rank of document d in the result list of a query q • The sum runs over all queries q in a query set Q • o_q: possibility to give more weight to certain queries, we use o_q = 1
    [1] L. Azzopardi and V. Vinay. Retrievability: An evaluation measure for higher order information access tasks. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM ’08, pages 561–570, New York, NY, USA, 2008. ACM.
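    The retrievability score can be sketched in a few lines. The sketch assumes the cumulative variant of f, where f(k_dq, c) = 1 if document d is ranked within the top c results for query q and 0 otherwise, and it assumes the rankings have already been produced by a search engine. The function name, the input layout and the sample data are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def retrievability_scores(ranked_results, c=None, weights=None):
    """Compute r(d) = sum over queries q of o_q * f(k_dq, c).

    ranked_results: dict mapping a query to its ranked list of document ids
                    (best first, each document at most once per list).
    c:              cutoff; None stands for c = infinite (whole result list).
    weights:        optional dict of query weights o_q; defaults to o_q = 1.
    """
    scores = defaultdict(float)
    for query, ranking in ranked_results.items():
        o_q = 1.0 if weights is None else weights.get(query, 1.0)
        top = ranking if c is None else ranking[:c]
        for doc_id in top:
            # Assumed cumulative f: a document inside the cutoff counts once.
            scores[doc_id] += o_q
    return dict(scores)


# Hypothetical toy rankings, for illustration only.
toy_results = {
    "nieuwe": ["d1", "d2", "d3"],
    "amsterdam": ["d2", "d1"],
    "ende": ["d3"],
}
scores_c2 = retrievability_scores(toy_results, c=2)
```

    Documents that are never retrieved do not appear in the returned dictionary. They have r(d) = 0 and need to be added back before computing inequality measures over the whole collection.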
  • 17. Impact Assessment • Wealth: How many documents were retrieved in total? (Sum of all r(d) scores) • Equality: How are r(d) scores distributed among documents? (Gini coefficient) • Retrieval per document/query: changes due to correction; impact of individual (query) terms
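    Both aggregate measures are simple to compute from a list of r(d) scores. The following is a minimal sketch using the standard rank-based formula for the Gini coefficient; the function names are hypothetical, and the list is assumed to contain one score per document in the collection, including zeros for documents that were never retrieved.

```python
def wealth(scores):
    """Total retrievability: the sum of all r(d) scores."""
    return sum(scores)


def gini(scores):
    """Gini coefficient of the r(d) distribution.

    0 means all documents are equally retrievable; values near 1 mean that
    a few documents collect almost all of the retrievability.
    """
    values = sorted(scores)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over ascending scores (ranks start at 1).
    weighted = sum(rank * value for rank, value in enumerate(values, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n
```

    For four documents with equal scores the coefficient is 0; if one of the four documents collects the entire wealth it is 0.75, the maximum for a collection of that size.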
  • 18. OCR Quality & Retrievability RQ1: What is the relation between a document’s OCR character error rate and its retrievability score?
  • 19. RQ1: OCR Quality & Retrievability • CER in the 17cent collection is significantly higher • r(d) scores are higher in the WWII collection • Correlation between r(d) and CER: -0.57 (Pearson) and -0.61 (Spearman) with p<0.001 [Figure: r(d) score plotted against character error rate (CER), with document length and subset (17cent, WWII) as legends]
  • 20. Direct Impact of OCR Quality RQ2: How does the correction of OCR errors impact the retrievability bias of the corrected documents?
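    The two correlation coefficients reported above can be computed as in the following sketch. It is a plain-Python illustration with hypothetical function names, not the analysis code of the study, and it does not compute p-values. Spearman's coefficient is obtained as Pearson's coefficient on ranks, with tied values receiving their average rank.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks in ascending order; ties share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

    A negative coefficient, as reported on the slide, means that documents with a higher CER tend to have lower r(d) scores.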
  • 22. Impact of Correction on Wealth • More documents retrieved from corrected documents • Number of queries with results increased by 8% • Impact is largest for users willing to look at the entire result list
    Sum of all r(d) scores (wealth), error-prone vs. corrected: c=1: 338,139 vs. 365,855 (+8%); c=10: 1,750,340 vs. 2,023,283 (+16%); c=100: 4,341,536 vs. 5,477,566 (+26%); c=infinite: 4,521,030 vs. 6,033,099 (+34%)
  • 23. Impact on Equality • Correction lowers inequality among documents • In contrast to earlier findings, Gini coefficients do not decrease with larger c’s • Correction fixes more false negatives (FN) than false positives (FP) (c=infinite): • Increases both wealth and equality [Figure: Direct Impact: Gini Coefficients for c = 1, 10, 100, infinite; conditions 822GTcor and 822GTerr]
  • 24. Retrieval per Document • Few documents lose r(d) scores after correction: good, these are former FP caused by OCR errors that are no longer retrieved • Most documents, however, gain, with the 17cent corpus improving to a larger extent but still remaining at a lower level
  • 25. Retrieval per Query • Only 44% of the queries retrieved at least one document • Despite the small collection size, we see large gains • Some queries lose because they retrieved FP from the uncorrected document set
  • 26. Retrieval per Query • Top 10 terms cause 35% of the wealth increase. These terms: 1. appear very frequently in user queries and 2. are highly susceptible to OCR errors in the documents • Conclusion: Real queries are also a source of bias
    [Figure 4: Queries ordered by their gain/loss in number of retrieved documents. The position on the y-axis represents the number of documents retrieved from 822GTcor.]
    [Figure 5: Cumulative r(d) difference (%) against query terms ordered by difference in impact (descending). The accumulated impact scores of single-term queries show that very few query terms contribute a large fraction of the overall wealth.]
    Query term (translation) | Frequency in queries | 822GTerr | 822GTcor | Cum. impact
    nieuwe (new) | 1,903 | 99 | 166 | 7.36%
    amsterdam (Amsterdam) | 7,885 | 41 | 57 | 14.65%
    ende (end) | 185 | 103 | 480 | 18.69%
    heer (Mister) | 826 | 20 | 89 | 21.99%
    overleden (died/dead) | 3,698 | 5 | 18 | 24.78%
    groot (grand/large) | 1,573 | 125 | 153 | 27.33%
    willem (Willem, name) | 5,375 | 5 | 13 | 29.81%
    twee (two) | 319 | 64 | 175 | 31.83%
    drie (three) | 401 | 34 | 120 | 33.81%
    oude (old) | 991 | 50 | 78 | 35.41%
    Excerpt from the paper (cropped in the slide): The distributions of the differences in r(d) scores in Table 2 show that for all cutoff values, the median of the differences is positive, and increases from 8 (c = 1) to 912 (c = infinite). The maximum loss and the maximum gain in r(d) scores increase for larger cutoff values c, the latter to a much larger extent. Note that for c = 1 and c = 10 the entire first quartile is filled with documents that scored worse in the corrected version. This shows that the competition in the top results makes the gain of some documents the loss of others.
  • 27. Indirect Impact of OCR Quality RQ3: How does the correction of a fraction of error-prone documents influence the retrievability of non-corrected ones?
  • 28-31. Indirect Impact [Diagram labels: Uncorrected; Mixed (half of the corpus was corrected); 50% same documents as in previous RQ; 50% new documents; callout: "We’re mainly interested in these documents"]
  • 32. Equality still increases! • Equality in r(d) scores is higher in the corrected document collection • Again, correction has decreased retrievability bias [Figure: Indirect Impact: Gini Coefficients for c = 1, 10, 100, infinite; conditions 1644err and 1644mix]
  • 33-35. Impact on Wealth • Complete: Correction increases wealth • GT only: Increase in wealth: c=1: +20%, c=10: +22%, c=100: +23%, c=infinite: +25% • Mixed-in only: Decrease in wealth: c=1: -13%, c=10: -10%, c=100: -5%
    Wealth per collection and cutoff, 1644_err vs. 1644_mix:
    Collection | c=1 | c=10 | c=100 | c=infinity
    Complete Document Collection | 353,613 vs. 376,139 | 2,099,816 vs. 2,307,996 | 6,698,945 vs. 7,676,830 | 8,008,574 vs. 9,520,643
    Ground Truth Document Collection | 180,079 vs. 225,809 | 1,112,705 vs. 1,420,322 | 3,783,514 vs. 4,898,694 | 4,521,030 vs. 6,033,099
    Mixed-in Document Collection | 173,534 vs. 150,330 | 987,111 vs. 887,674 | 2,915,431 vs. 2,778,136 | 3,487,544 vs. 3,487,544
  • 36. Retrieval per Document (mixed-in only, c=10) • Most documents’ scores change very little, and if they change, they lose r(d) score • 171 documents gain r(d) score • These benefit from FP matches that disappeared
  • 38. Conclusions • In our study, OCR correction • Increases overall retrievability • Reduces retrievability bias, even in a partially corrected corpus • Higher scores are caused by a small set of terms that are • frequent in queries and • susceptible to OCR errors • Using real user queries is essential to understand the actual bias caused by OCR errors.
  • 39. Impact of Crowdsourcing OCR Improvements on Retrievability Bias • We would like to thank the National Library of the Netherlands (KB) for making the newspaper corpus and the (sensitive) user data available to us for research. • This research is partly funded by the Dutch COMMIT/ program, the WebART project and the VRE4EIC project, a project that has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No 676247.