#kbdata: Exploring potential impact of technology limitations on DH research

#kbdata: Exploring potential impact of
technology limitations on DH research
Myriam C. Traub, Jacco van Ossenbruggen
Centrum Wiskunde & Informatica, Amsterdam

Translate the established tradition of source
criticism to the digital world and create a new
tradition of tool criticism to systematically
identify and explain technology-induced bias. 
http://event.cwi.nl/toolcriticism/ #toolcrit
2

Context
✤ SealincMedia project, original
goals:
✤ crowdsourcing enrichment
✤ measure effect on scholarly
tasks
✤ Who are the scholars?
✤ What are their tasks?
3

Interviews
✤ Aim:
✤ Find out what types of
research tasks scholars
perform on digital archives
✤ Which quantitative / distant
reading tasks are not
(sufﬁciently) supported
✤ Scholars with experience in
performing historical research
on digital archives
4
(seeTPDL 2015 paper for details)

5
I mostly use digital archives for
exploration of a topic, selecting
material for close reading (T1, T2) or
external processing (T4).
OCR quality in digital archives /
libraries is partly very bad.
I cannot quantify its impact on my
research tasks.
I would not trust quantitative
analyses (T3a, T3b) based on this data
sufﬁciently to use it in publications.

Categorisation of research tasks
T1 find the first mention of a concept
T2 find a subset with relevant documents
T3 investigate quantitative results over time
T3.a compare quantitative results for two terms
T3.b compare quantitative results from two corpora
T4 tasks using external tools on archive data

Literature
✤ OCR quality is addressed from
the perspective of the collection
owner/OCR software developer
✤ Usability studies for digital
libraries
✤ Robustness of search engines
towards OCR errors
✤ Error removal in post-
processing either systematically
or intellectually
7

We care
about average
performance on
representative subsets
for generic cases.
I care about
actual performance
on my non-
representative subset
for my speciﬁc
query.
8
Two different perspectives of quality evaluation

Use case
✤ Aims:
✤ To study the impact on
research tasks in detail
✤ Identify starting points for
workarounds and/or further
research
✤ Tasks T1 - T3
9

T1: Finding the
first mention
✤ Key requirement: recall
✤ 100% recall is unrealistic
✤ Aim: Find out how a scholar
can assess the reliability of
results
10

“Amsterdam”
1642
11
First mention of …
… in the OCRed newspaper archive of the KB?
1618
earliest
document

O
C
R
pre-processing
post-processing
ingestion
scanning
12
Understanding potential sources
of bias and errors
✤ many details difﬁcult to reconstruct
✤ essential to understand overall
impact

“Amsterdam”
1642
13
1618
earliest
document
“Amfterdam”
1624

01
OCR confidence
values useful?
✤ Available for all items in the
collection: page, word,
character
✤ Only for highest ranked
words / characters, other
candidates missing
✤ This information would be
required to estimate recall.
14

Confusion table
✤ Applied frequent OCR
confusions to query
✤ 23 alternative spellings, but
none of them yielded an
earlier mention
✤ Problem: long tail
Amstcrdam 16-01-1743
Amstordam 01-08-1772
Amsttrdam 04-08-1705
Amslerdam 12-12-1673
Amslcrdam 20-06-1797
Amslordam 29-06-1813
Amsltrdam 13-04-1810
Amscerdam 17-10-1753
Amsccrdam 16-02-1816
Amscordam 01-11-1813
Amsctrdam 16-06-1823
Amfterdam already found
Amftcrdam 17-08-1644
Amftordam 31-01-1749
Amfttrdam 26-11-1675
Amflerdam 03-03-1629
Amflcrdam 01-03-1663
Amflordam 05-03-1723
Amfltrdam 01-09-1672
Amfcerdam 22-04-1700
Amfccrdam 27-11-1742
Amfcordam -
Amfctrdam 09-10-1880
correct confused
s f
n u
e c
n a
t l
t c
h b
l i
e o
e t
full table available online:
http://dx.doi.org/10.6084/m9.figshare.1448810

“Amsterdam”
1642
“Amfterdam”
1624
“Amsterstam”
1618
16
1618
earliest
document

“Amsterdam”
1642
“Amfterdam”
1624
“Amsterstam”
1618
17
Update!
1618
Corrections for 17th century newspapers were crowdsourced!
earliest
document
“Amsterdam”
1620

Confusion Matrix OCR Confidence
Values
Alternative
Confidence
Values
available: sample only full corpus not available
T1 find all queries for x,
impractical
estimated precision, not
helpful
improve recall
T2 as above estimated precision,
requires improved UI
improve recall
T3 pattern summarized over
set of alternative queries
estimates of corrected
precision
estimates of
corrected recall
T3.a warn for different
susceptibility to errors
as above, warn for
different distribution of
confidence values
as above
T3.b as above as above as above
19

Conclusions
Problems
✤ Scholars see OCR
quality as a serious
problem, but cannot
assess its impact
✤ OCR technology is
unlikely to be perfect
✤ OCR errors are
reported in terms of
averages measured
over representative
samples
✤ Impact on a speciﬁc
research task cannot
be assessed based on
average error metrics
Start of solutions
✤ Impact of OCR is
different for different
research tasks, so
these tasks need to
made be explicit
✤ OCR errors often
assumed to be
random but are often
partly systematic
✤ Tool pipelines and
their limitations need
to be transparent &
better documented

No silver bullet
✤ we propose novel strategies that solve
part of the problem:
✤ critical attitude
(awareness and better support)
✤ transparency
(provenance, open source,
documentation, …)
✤ alternative quality metrics
(taking research context into account)
21

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
● ● ● ●
●
● ● ● ● ●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
● ● ● ●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
0
5000000
10000000
15000000
1700 1800 1900 2000
decades
numberofdocuments
Viewed documents (blue) compared to overall corpus size (red)
RQ: Is this tiny fragment biased by technology?

User logs
✤ 5 months on 8 servers
✤ March - July 2015
✤ 100 M requests
✤ 4 M queries
✤ 1 M unique queries  
(dominated by named entities)
✤ 2.7 M unique documents
viewed

http://resolver.kb.nl/resolve?urn=ddd:011010313
March - July 2015.
24
Top viewed documents
1. views: 700 2. views: 243 3. views: 189
http://resolver.kb.nl/resolve?urn=ddd:010775269http://resolver.kb.nl/resolve?urn=ddd:011148923

Top 25 queries (# IP hashes)
493 armeense
283 telegraaf
200 doodvonnis batavia
176 ajax
168 voetbal
166 nieuwsblad van het noorden
149 suriname
142 oorlog
132 hitler
132 vvd PROX complot
131 amsterdam
129 volkskrant
126 algemeen handelsblad
122 armeensche
119 limburgs dagblad
119 de telegraaf
114 zoetemelk
114 rotterdam
114 20e eeuw
113 het vrije volk
112 staatscourant
112 brand
108 de waarheid
103 soekaboemi
97 overleden

Can we measure bias  
in all queries?

Candidate metric to
measure search bias
✤ Retrievability  
(IR, Azzopardi, CIKM 2008)
✤ measures how often documents
are retrieved for a given set Q
✤ compares popular documents
against non-popular
✤ Inequality expressed with Gini
coefﬁcient and Lorenz curve
✤ Inequality correlated with user
interest is ﬁne…

Experimental setup
✤ Repeat original experiment with synthesised queries
✤ Run experiment with real queries from log
✤ note the ratio: 1M queries vs 100M documents
✤ To do: test known item search for different quality
OCR, different media, different titles, …

Lorenz curves
c=10,
Gini=0.97
c=100,
Gini=0.90
c=1000,
Gini=0.78

0
1
5
10
50
100
500
1000
5000
10000
50000
1 2 3 4 5 6 7 8 9 10
ret_score
counts_16_log
0
1
5
10
50
100
500
1000
5000
10000
50000
100000
500000
1000000
1 2 3 4 5 6 7 8 9 10
ret_score
counts_17_log
0
1
5
10
50
100
500
1000
5000
10000
50000
100000
500000
1000000
5000000
1 2 3 4 5 6 7 8 9 10
ret_score
counts_18_log
0
1
5
10
50
100
500
1000
5000
10000
50000
100000
500000
1000000
5000000
1 2 3 4 5 6 7 8 9 10
ret_score
counts_18_log

For documents that were viewed at least once.
OCR page conﬁdence values (x) and number of views by users (y)
33

0.00
0.25
0.50
0.75
1.00
1700 1800 1900 2000
decades
percentagesofr(d)
r(d)
0
1
2
3
4
0.0
0.2
0.4
0.6
0.8
1700 1800 1900 2000
decades
percentagesofr(d)
r(d)
0
1
2
3
4
0.1
0.2
0.3
0.4
1700 1800 1900 2000
decades
percentagesofr(d)
r
c=10
c=100 c=1000

0.00
0.05
0.10
0.15
1700 1800 1900 2000
decades
percentagesofr(d)
r(d)
0
1
2
3
4

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
● ● ● ●
●
● ● ● ● ●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
● ● ● ●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
0
5000000
10000000
15000000
1700 1800 1900 2000
decades
numberofdocuments
Viewed documents compared to overall corpus size (per decade)
RQ: Is this tiny fragment biased by technology?

●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
● ●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
1024
16384
262144
4194304
1700 1800 1900 2000
decades
numberofdocuments
# do
# do

Conclusions
✤ Only small fragment of newspaper corpus is viewed or even
retrieved in top #10, 100, 100
✤ No clear evidence retrieval bias is correlated with OCR errors.
Why?
✤ there is no relation
✤ we look for patterns at a too generic level
✤ back to the speciﬁcity of the use cases?
✤ Other forms of bias that are measurable/quantiﬁable?

#kbdata: Exploring potential impact of technology limitations on DH research

Recommended

Recommended

More Related Content

Similar to #kbdata: Exploring potential impact of technology limitations on DH research

Similar to #kbdata: Exploring potential impact of technology limitations on DH research (20)

Recently uploaded

Recently uploaded (20)

#kbdata: Exploring potential impact of technology limitations on DH research