#kbdata: Exploring potential impact of
technology limitations on DH research
Myriam C. Traub, Jacco van Ossenbruggen
Centrum Wiskunde & Informatica, Amsterdam
Translate the established tradition of source
criticism to the digital world and create a new
tradition of tool criticism to systematically
identify and explain technology-induced bias.

http://event.cwi.nl/toolcriticism/ #toolcrit
Context
✤ SealincMedia project, original
goals:
✤ crowdsourcing enrichment
✤ measure effect on scholarly
tasks
✤ Who are the scholars?
✤ What are their tasks?
Interviews
✤ Aim:
✤ Find out what types of
research tasks scholars
perform on digital archives
✤ Which quantitative / distant
reading tasks are not
(sufficiently) supported
✤ Scholars with experience in
performing historical research
on digital archives
(see TPDL 2015 paper for details)
I mostly use digital archives for
exploration of a topic, selecting
material for close reading (T1, T2) or
external processing (T4).
OCR quality in digital archives /
libraries is partly very bad.
I cannot quantify its impact on my
research tasks.
I would not trust quantitative
analyses (T3a, T3b) based on this data
sufficiently to use it in publications.
Categorisation of research tasks
T1 find the first mention of a concept
T2 find a subset with relevant documents
T3 investigate quantitative results over time
T3.a compare quantitative results for two terms
T3.b compare quantitative results from two corpora
T4 tasks using external tools on archive data
Literature
✤ OCR quality is addressed from
the perspective of the collection
owner/OCR software developer
✤ Usability studies for digital
libraries
✤ Robustness of search engines
towards OCR errors
✤ Error removal in post-processing,
either systematically or intellectually
Two different perspectives of quality evaluation
“We care about average performance on representative subsets for generic cases.”
“I care about actual performance on my non-representative subset for my specific query.”
Use case
✤ Aims:
✤ To study the impact on
research tasks in detail
✤ Identify starting points for
workarounds and/or further
research
✤ Tasks T1 - T3
T1: Finding the
first mention
✤ Key requirement: recall
✤ 100% recall is unrealistic
✤ Aim: Find out how a scholar
can assess the reliability of
results
First mention of … in the OCRed newspaper archive of the KB?
“Amsterdam”: 1642
earliest document: 1618
Understanding potential sources of bias and errors
[Figure: pipeline diagram with stages labelled scanning, ingestion, pre-processing, OCR, post-processing]
✤ many details difficult to reconstruct
✤ essential to understand overall
impact
First mention of … in the OCRed newspaper archive of the KB?
“Amsterdam”: 1642
“Amfterdam”: 1624
earliest document: 1618
OCR confidence
values useful?
✤ Available for all items in the
collection: page, word,
character
✤ Only for highest ranked
words / characters, other
candidates missing
✤ This information would be
required to estimate recall.
Confusion table
✤ Applied frequent OCR
confusions to query
✤ 23 alternative spellings, but
none of them yielded an
earlier mention
✤ Problem: long tail
Amstcrdam 16-01-1743
Amstordam 01-08-1772
Amsttrdam 04-08-1705
Amslerdam 12-12-1673
Amslcrdam 20-06-1797
Amslordam 29-06-1813
Amsltrdam 13-04-1810
Amscerdam 17-10-1753
Amsccrdam 16-02-1816
Amscordam 01-11-1813
Amsctrdam 16-06-1823
Amfterdam already found
Amftcrdam 17-08-1644
Amftordam 31-01-1749
Amfttrdam 26-11-1675
Amflerdam 03-03-1629
Amflcrdam 01-03-1663
Amflordam 05-03-1723
Amfltrdam 01-09-1672
Amfcerdam 22-04-1700
Amfccrdam 27-11-1742
Amfcordam -
Amfctrdam 09-10-1880
correct confused
s f
n u
e c
n a
t l
t c
h b
l i
e o
e t
full table available online:
http://dx.doi.org/10.6084/m9.figshare.1448810
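The query expansion described on this slide can be sketched as follows. This is a minimal illustration, not the authors' actual tooling: the function name `spelling_variants` is hypothetical, and the confusion pairs are only the ten shown above (the full table is at the figshare link). Every character of the query that appears in the "correct" column is optionally replaced by each of its "confused" counterparts.

```python
from itertools import product

# (correct, confused) pairs copied from the excerpt of the confusion table above.
CONFUSIONS = [
    ("s", "f"), ("n", "u"), ("e", "c"), ("n", "a"), ("t", "l"),
    ("t", "c"), ("h", "b"), ("l", "i"), ("e", "o"), ("e", "t"),
]


def spelling_variants(word, confusions=CONFUSIONS):
    """Return all alternative spellings obtained by applying the confusions.

    Hypothetical helper: each character may stay as it is or be replaced by
    any character it is frequently confused with.
    """
    alternatives = {}
    for correct, confused in confusions:
        alternatives.setdefault(correct, []).append(confused)

    # For every position: the original character plus its possible confusions.
    options = [[ch] + alternatives.get(ch, []) for ch in word]
    variants = {"".join(combo) for combo in product(*options)}
    variants.discard(word)  # the unchanged query is not an alternative
    return sorted(variants)


variants = spelling_variants("Amsterdam")
print(len(variants))  # 23
print(variants[:3])
```

With only these ten pairs, "Amsterdam" expands to the 23 alternative spellings listed above. The number of variants grows multiplicatively with every additional confusion pair, which is one way to read the "long tail" problem.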
First mention of … in the OCRed newspaper archive of the KB?
“Amsterdam”: 1642
“Amfterdam”: 1624
“Amsterstam”: 1618
earliest document: 1618
“Amsterdam”: 1642
“Amfterdam”: 1624
“Amsterstam”: 1618
earliest document: 1618
Update! Corrections for 17th century newspapers were crowdsourced!
“Amsterdam”: 1620
… but why not 1618?
|            | Confusion Matrix | OCR Confidence Values | Alternative Confidence Values |
| available: | sample only | full corpus | not available |
| T1 | find all queries for x, impractical | estimated precision, not helpful | improve recall |
| T2 | as above | estimated precision, requires improved UI | improve recall |
| T3 | pattern summarized over set of alternative queries | estimates of corrected precision | estimates of corrected recall |
| T3.a | warn for different susceptibility to errors | as above, warn for different distribution of confidence values | as above |
| T3.b | as above | as above | as above |
Conclusions
Problems
✤ Scholars see OCR
quality as a serious
problem, but cannot
assess its impact
✤ OCR technology is
unlikely to be perfect
✤ OCR errors are
reported in terms of
averages measured
over representative
samples
✤ Impact on a specific
research task cannot
be assessed based on
average error metrics
Start of solutions
✤ Impact of OCR is
different for different
research tasks, so
these tasks need to
be made explicit
✤ OCR errors often
assumed to be
random but are often
partly systematic
✤ Tool pipelines and
their limitations need
to be transparent &
better documented
No silver bullet
✤ we propose novel strategies that solve
part of the problem:
✤ critical attitude
(awareness and better support)
✤ transparency
(provenance, open source,
documentation, …)
✤ alternative quality metrics
(taking research context into account)
[Figure: number of documents per decade, 1700 to 2000, y-axis from 0 to 15,000,000]
Viewed documents (blue) compared to overall corpus size (red)
RQ: Is this tiny fragment biased by technology?
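A comparison of viewed documents against corpus size per decade could be computed along the following lines. This is only a sketch under stated assumptions: the function name `viewed_share_per_decade` and the input format (one publication year per document) are hypothetical and not taken from the slides.

```python
from collections import Counter


def viewed_share_per_decade(corpus_years, viewed_years):
    """Share of documents viewed at least once, grouped by decade.

    corpus_years: publication year of every document in the corpus (assumed input).
    viewed_years: publication year of every distinct viewed document (assumed input).
    """
    corpus = Counter(year // 10 * 10 for year in corpus_years)
    viewed = Counter(year // 10 * 10 for year in viewed_years)
    # A Counter returns 0 for decades in which no document was viewed.
    return {decade: viewed[decade] / corpus[decade] for decade in sorted(corpus)}


shares = viewed_share_per_decade(
    corpus_years=[1618, 1619, 1624, 1705],
    viewed_years=[1618, 1705],
)
print(shares)
```

Reporting a share per decade rather than raw counts keeps decades with few documents comparable to decades with millions of documents.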
User logs
✤ 5 months on 8 servers
✤ March - July 2015
✤ 100 M requests
✤ 4 M queries
✤ 1 M unique queries
(dominated by named entities)
✤ 2.7 M unique documents
viewed
Top viewed documents, March - July 2015
1. views: 700
2. views: 243
3. views: 189
http://resolver.kb.nl/resolve?urn=ddd:011010313
http://resolver.kb.nl/resolve?urn=ddd:010775269
http://resolver.kb.nl/resolve?urn=ddd:011148923
Top 25 queries (# IP hashes)
493 armeense
283 telegraaf
200 doodvonnis batavia
176 ajax
168 voetbal
166 nieuwsblad van het noorden
149 suriname
142 oorlog
132 hitler
132 vvd PROX complot
131 amsterdam
129 volkskrant
126 algemeen handelsblad
122 armeensche
119 limburgs dagblad
119 de telegraaf
114 zoetemelk
114 rotterdam
114 20e eeuw
113 het vrije volk
112 staatscourant
112 brand
108 de waarheid
103 soekaboemi
97 overleden
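A ranking like the one above, where each query is counted once per IP hash, might be derived from the logs roughly as follows. The record format `(ip_hash, query)`, the function name `top_queries` and the lowercasing step are assumptions made for illustration. The slides do not describe how the logs were parsed or normalised.

```python
from collections import defaultdict


def top_queries(log_records, n=25):
    """Rank queries by the number of distinct IP hashes that issued them.

    log_records: iterable of (ip_hash, query) pairs (assumed log format).
    """
    users_per_query = defaultdict(set)
    for ip_hash, query in log_records:
        # Assumed normalisation: trim whitespace and ignore case.
        users_per_query[query.strip().lower()].add(ip_hash)

    counts = {query: len(users) for query, users in users_per_query.items()}
    # Sort by descending count, then alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


records = [
    ("h1", "Amsterdam"),
    ("h2", "amsterdam"),
    ("h1", "amsterdam "),
    ("h3", "ajax"),
]
print(top_queries(records))
```

Counting distinct IP hashes instead of raw requests prevents a single heavy user or crawler from dominating the ranking.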
Can we measure bias in all queries?
Candidate metric to
measure search bias
✤ Retrievability
(IR, Azzopardi, CIKM 2008)
✤ measures how often documents
are retrieved for a given set Q
✤ compares popular documents
against non-popular
✤ Inequality expressed with Gini
coefficient and Lorenz curve
✤ Inequality correlated with user
interest is fine…
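The metric above can be sketched in a few lines. The code assumes the cumulative form of retrievability: a document scores one point for every query in Q that returns it within the top c results, with all queries weighted equally. The function names, the `rankings` input format and the normalisation of the Gini coefficient (the common form with N in the denominator) are assumptions. The slides do not state which variant was used.

```python
from collections import Counter


def retrievability(rankings, cutoff):
    """Count, per document, how many queries retrieve it in the top `cutoff`.

    rankings: dict mapping each query to its ranked list of document ids
    (assumed input format).
    """
    scores = Counter()
    for ranked_docs in rankings.values():
        for doc in ranked_docs[:cutoff]:
            scores[doc] += 1
    return scores


def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


rankings = {"q1": ["a", "b", "c"], "q2": ["a", "c"], "q3": ["b"]}
all_docs = ["a", "b", "c"]
r = retrievability(rankings, cutoff=1)
# Documents that are never retrieved must be included with a score of 0.
print(gini([r[doc] for doc in all_docs]))
```

Including never-retrieved documents with a score of zero matters: leaving them out would make the collection look more evenly retrievable than it is.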
Experimental setup
✤ Repeat original experiment with synthesised queries
✤ Run experiment with real queries from log
✤ note the ratio: 1M queries vs 100M documents
✤ To do: test known item search for different quality
OCR, different media, different titles, …
Lorenz curves
c=10, Gini=0.97
c=100, Gini=0.90
c=1000, Gini=0.78
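For readers unfamiliar with Lorenz curves, the points of such a curve can be computed from per-document retrievability scores as sketched below. The function name `lorenz_points` is hypothetical, and the example scores are made up for illustration. They are not data from the study.

```python
def lorenz_points(values):
    """Return (share of documents, cumulative share of retrievability) pairs.

    Documents are sorted from least to most retrievable. A perfectly equal
    collection yields the diagonal. The further the curve sags below the
    diagonal, the higher the Gini coefficient.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    points = [(0.0, 0.0)]
    cumulative = 0
    for i, x in enumerate(xs, start=1):
        cumulative += x
        points.append((i / n, cumulative / total if total else 0.0))
    return points


# Made-up scores: three documents never retrieved, one retrieved four times.
print(lorenz_points([0, 0, 0, 4]))
```

A lower Gini value at a larger cutoff c, as in the three values reported above, corresponds to a curve that lies closer to the diagonal.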
[Figure: four panels with axis labels ret_score (ticks 1 to 10) and counts_16_log, counts_17_log, counts_18_log, counts_18_log (log-scaled ticks from 0 up to 5000000)]
For documents that were viewed at least once.
OCR page confidence values (x) and number of views by users (y)
[Figure: panels for c=10, c=100 and c=1000 plus one further panel; axis labels decades (1700 to 2000) and percentagesofr(d); legend r(d) with values 0 to 4]
[Figure: number of documents per decade, 1700 to 2000, y-axis from 0 to 15,000,000]
Viewed documents compared to overall corpus size (per decade)
RQ: Is this tiny fragment biased by technology?
[Figure: number of documents per decade, 1700 to 2000, y-axis ticks 1024, 16384, 262144, 4194304; legend entries truncated in extraction ("# do")]
Conclusions
✤ Only a small fragment of the newspaper corpus is viewed or even
retrieved in the top 10, 100, 1000
✤ No clear evidence that retrieval bias is correlated with OCR errors.
Why?
✤ there is no relation
✤ we look for patterns at too generic a level
✤ back to the specificity of the use cases?
✤ Other forms of bias that are measurable/quantifiable?
