Presentation of the paper "User-driven correction of OCR errors: Combining crowdsourcing and information retrieval technology" by Günter Mühlberger, Johannes Zelger, David Sagmeister and Albert Greinöcker in DATeCH 2014. #digidays
Presentation of the paper "PoCoTo - An Open Source System For Efficient Interactive Postcorrection of OCRed Historical Text" by Thorsten Vobl, Annette Gotscharek, Ulrich Reffle, Christoph Ringlstetter and Klaus Schulz in DATeCH 2014. #digidays
Presentation of a paper from the Infobazy 2014 conference describing work carried out in the MARKOS project. The goal of the MARKOS project is to design and develop a web service that lets users search the global space of Open Source projects for components that optimally meet user-specified criteria. With this system, creators and users of Open Source Software (OSS) will be able to easily and automatically analyse dependencies between the OSS components they use, taking into account the functional, structural and licensing aspects of the source code.
The result of the project will be a prototype service deployed on the Internet by the project partners and made available through a set of interactive applications, both via a graphical user interface and via a semantic data access point following the linked data model. The service will be implemented by a set of internal MARKOS components responsible for multi-context analysis of information available on the web, and for processing and storing it in the system's internal semantic repository.
The MARKOS system will offer users semantic search and browsing of components and libraries, and navigation of code structure at a high level of abstraction. This will make it easier, particularly for architects and analysts, to find a component that meets a system's functional, technical and legal requirements, and it will help developers better understand the available interfaces and internal dependencies of the software. MARKOS will also take code-integration aspects into account, exposing and exploiting dependencies and relationships between software components from different projects, thereby providing an integrated, global view of existing Open Source software. It will further use inter-component dependencies for more effective and accurate license-compatibility analysis, providing grounds for legal argumentation and conflict resolution. To ease collaboration between projects, MARKOS will also provide tools for notifying dependent projects of significant component changes. With this functionality in a global context, MARKOS is expected to facilitate software development based on the Open Source paradigm and to contribute to the global community.
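The license-compatibility analysis described above can be illustrated as a check over a dependency graph. The sketch below is a minimal, hypothetical Python version: the compatibility table, component names and function are invented for illustration and are not MARKOS's actual rules or data model.

```python
# Minimal sketch of dependency-graph license checking, in the spirit of the
# MARKOS analysis described above. The compatibility table is illustrative
# only; real license compatibility is far more nuanced.

# Which licenses a component may depend on (simplified, hypothetical rules).
COMPATIBLE_WITH = {
    "GPL-3.0": {"GPL-3.0", "LGPL-3.0", "Apache-2.0", "MIT"},
    "Apache-2.0": {"Apache-2.0", "MIT"},
    "MIT": {"MIT"},
}

def find_conflicts(components, dependencies):
    """components: name -> license; dependencies: list of (user, used) pairs.
    Returns the dependency pairs whose licenses clash under the table above."""
    conflicts = []
    for user, used in dependencies:
        allowed = COMPATIBLE_WITH.get(components[user], set())
        if components[used] not in allowed:
            conflicts.append((user, used))
    return conflicts

components = {"app": "GPL-3.0", "libA": "MIT", "libB": "Apache-2.0"}
deps = [("app", "libA"), ("app", "libB"), ("libB", "libA")]
print(find_conflicts(components, deps))  # [] — no conflicts in this example
```

A real system would of course resolve transitive dependencies and handle dual licensing; this only shows the shape of the per-edge check.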
The document describes the activities and philosophy of Tecnilógica, an innovation company. Tecnilógica works on mobile development, web development, digital signage and innovative projects. Its philosophy rests on a passionate team, a sustainable model, and knowledge. The company aims to be a technology partner for its clients.
Purposeful Gaming, OCR Correction and Seed & Nursery Catalog Digitization (Marty Schlabach)
An online game will be developed to crowd-source the correction of OCRed content in the Biodiversity Heritage Library (BHL). Several additional content types will be digitized and added to BHL, namely seed lists, seed & nursery catalogs, and handwritten field notebooks.
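One common way to crowd-source such corrections is to accept a transcription only when enough independent players agree. The snippet below is a hedged sketch of that idea; the threshold, function name and examples are assumptions, not BHL's actual game logic.

```python
# Hedged sketch: aggregate crowd answers for one OCR-suspect word by
# majority vote, accepting the answer only above an agreement threshold.
from collections import Counter

def aggregate(transcriptions, min_agreement=0.5):
    """Return the majority transcription if enough players agree, else None."""
    counts = Counter(transcriptions)
    word, n = counts.most_common(1)[0]
    return word if n / len(transcriptions) > min_agreement else None

print(aggregate(["heritage", "heritage", "herilage"]))  # heritage
print(aggregate(["a", "b"]))  # None — no majority, needs more players
```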
The document provides information about IT innovation in Austria through the case study of Softwarepark Hagenberg. It discusses how Softwarepark Hagenberg, located near Linz, Austria, functions as an innovation spiral through its research, education, and business activities. It was founded in 1987 as a spin-off of Johannes Kepler University Linz to foster software development. It now hosts over 2,500 R&D coworkers and students from its 120 company tenants. It also discusses the university's academic programs in information technology and role in supporting the regional economy.
Shaping Collaboration at the University of Zurich (Roberto Mazzoni)
The document discusses the challenges faced by IT Services at the University of Zurich in implementing a collaboration platform. It describes the University, which has over 26,000 students and 35,000 total users across decentralized faculties and institutes. The University required a solution providing high availability, scalability, and disaster recovery across the variety of operating systems and devices used by its autonomous and mobile user base. After an evaluation process, IT Services selected IBM Notes for its ability to support the many operating systems and meet the required high standards without dictating client or system choices.
A discussion of Text and Data Mining in science and at Springer Nature in particular. As presented at the Frankfurt Book Fair 2018 by Markus Kaindl, Senior Manager Semantic Data, Springer Nature.
This document discusses metadata considerations for the Europeana Newspapers project. It begins with an introduction to the speaker and his background in digital library projects. It then covers general concepts of metadata, how metadata is important for digitized newspapers, and the Europeana Newspaper METS ALTO Profile (ENMAP) that is being developed to provide robust metadata for the project. The goal of ENMAP is to create a standardized format for metadata that can be used for preservation, access, and delivery of newspaper data to Europeana.
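Since ENMAP builds on METS/ALTO, a small example of reading OCR text out of an ALTO file may make the format concrete. The sketch below assumes only the standard ALTO convention of `String` elements carrying a `CONTENT` attribute; the sample document is invented, and the parsing is deliberately namespace-agnostic because ALTO has several schema versions.

```python
# Extract the plain text from an ALTO document (the OCR layout format used
# in the METS/ALTO profile discussed above). Sample data is a minimal stand-in.
import xml.etree.ElementTree as ET

def alto_words(alto_xml):
    """Yield the CONTENT of every ALTO <String> element, in document order."""
    root = ET.fromstring(alto_xml)
    for el in root.iter():
        if el.tag.rsplit("}", 1)[-1] == "String":  # strip any XML namespace
            yield el.attrib.get("CONTENT", "")

sample = """<alto><Layout><Page><TextBlock>
  <TextLine><String CONTENT="Europeana"/><String CONTENT="Newspapers"/></TextLine>
</TextBlock></Page></Layout></alto>"""
print(" ".join(alto_words(sample)))  # Europeana Newspapers
```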
Automatic Classification of Springer Nature Proceedings with Smart Topic Miner (Francesco Osborne)
The document summarizes research on automatically classifying Springer Nature proceedings using the Smart Topic Miner (STM). STM extracts topics from publications, maps them to a computer science ontology, selects the relevant topics using a greedy algorithm, and infers tags. It was evaluated with 8 Springer Nature editors, who found that STM accurately classified 75-90% of proceedings and improved their workflow. However, STM is currently limited to computer science, and occasional noisy results were found in books with few chapters. Future work aims to expand STM to characterize topic evolution over time and to directly support author tagging.
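The greedy topic-selection step described above can be sketched as greedy set cover: repeatedly pick the topic that explains the most not-yet-covered keywords. The topic-to-keyword mapping below is invented for illustration and is not STM's actual ontology or algorithm.

```python
# Greedy set-cover sketch of topic selection: at each step, choose the topic
# covering the most keywords that no already-chosen topic covers.

def greedy_topics(keyword_sets, max_topics=3):
    """keyword_sets: topic -> set of keywords it covers. Returns chosen topics."""
    uncovered = set().union(*keyword_sets.values())
    chosen = []
    while uncovered and len(chosen) < max_topics:
        best = max(keyword_sets, key=lambda t: len(keyword_sets[t] & uncovered))
        if not keyword_sets[best] & uncovered:
            break  # remaining topics add nothing new
        chosen.append(best)
        uncovered -= keyword_sets[best]
    return chosen

topics = {
    "machine learning": {"neural networks", "classification", "svm"},
    "databases": {"sql", "indexing"},
    "neural networks": {"neural networks"},
}
print(greedy_topics(topics))  # ['machine learning', 'databases']
```

Note how "neural networks" is never selected: its single keyword is already covered by the broader topic, which is exactly the pruning behaviour an editor wants.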
This is a copy of the presentation given by Ellen Fleurbaay and Marc Holtman of the Amsterdam City Archives at the MARAC Plenary Session in Jersey City on Friday, October 30, 2009.
The document summarizes the JISC HIKE Project at the University of Huddersfield which evaluated the Intota library management system from Serials Solutions and the JISC Knowledge Base+. The project aimed to understand current workflows, identify pain points, evaluate the new systems, provide guidance on integration, and assess the impact on workflows. Intota promises improved integrated workflows from discovery to acquisition and more automated processing. The project found opportunities to reduce duplication and break down silos through new interoperable systems.
MongoDB.local Atlanta: MongoDB @ Sensus: Xylem IoT and MongoDB (MongoDB)
Grant Muller is the Vice President of Application Software and Architecture at Xylem, a water technology company. He has over 15 years of experience developing software for utilities and has been using MongoDB for around 10 years.
Xylem is a global water technology company with over 17,000 employees operating in over 50 countries. They have been using MongoDB since 2009 when they acquired Verdeeco, an analytics startup that was using MongoDB. Since then, they have continued adopting MongoDB and scaling their usage of it as their data and applications have grown significantly through acquisitions.
Xylem is now developing an IoT platform called Xylem IoT Cloud to connect their various water devices. They are storing the sensor
This document discusses optical character recognition (OCR) of historical newspapers. It describes the digitization process, which includes image capturing, text and structure recognition, natural language processing, and content representation. OCR accuracy can be improved through layout analysis, structural metadata extraction, and identifying different content units like articles, advertisements, and entertainment sections. The goal is to make the content and knowledge within digitized newspapers accessible beyond the scanned text.
Linda Treude, Sabine Wolf: Features for the Future Library #bcs2015 (KISK FF MU)
Talk given at the BOBCATSSS 2015 conference - http://www.bobcatsss2015.com/.
The contribution "Features for the Future Library" introduces the German project "mylibrARy", a cooperation between the University of Applied Sciences in Potsdam, a public library in Berlin, and metaio GmbH from Munich, one of the leading AR software companies. The conceptual process behind a library AR app is presented, along with the results of a user study that may answer the question of which app features library users actually want.
Furthermore, the possibilities of AR technology for libraries in general are discussed and placed in the context of a modern, user-friendly library.
Supporting Springer Nature Editors by means of Semantic Technologies (Francesco Osborne)
The Open University and Springer Nature have been collaborating since 2015 in the development of an array of semantically-enhanced solutions supporting editors in i) classifying proceedings and other editorial products with respect to the relevant research areas and ii) taking informed decisions about their marketing strategy. These solutions include i) the Smart Topic API, which automatically maps keywords associated with published papers to semantically characterized topics, which are drawn from a very large and automatically-generated ontology of Computer Science topics; ii) the Smart Topic Miner, which helps editors to associate scholarly metadata to books; and iii) the Smart Book Recommender, which assists editors in deciding which editorial products should be marketed in a specific venue.
Publishing conference proceedings internationally: how does it work (Aliaksandr Birukou)
In this presentation we look into the main elements one has to consider when organizing an international conference. First, we describe the role of conference proceedings in computer science and beyond. Second, we focus on the tasks of conference organizers. Third, we cover peer-review aspects and announce the new group that CrossRef and DataCite are starting in this area. We then cover indexing and dissemination, present several tips and guidelines for organizers of international conferences, and close with a word of warning about predatory publishers.
Grant presents a case study of the 19th Century Pamphlets digitisation project, covering the decisions made in planning the project, the challenges encountered, and key lessons learned.
With approximately 1.x years of delay relative to the US, the term "Data Science" is also gaining momentum in Europe. Every month we see more job openings for, and business cards of, data scientists, new events dedicated to the topic, and increased demand for related education. In response to this trend, Zurich University of Applied Sciences founded the ZHAW Data Science Laboratory (Datalab) last year.
This talk gives an updated overview of Data Science in Europe, using the Datalab's activities in Switzerland as an example. After a definition and classification of the field, a presentation of real technical projects sets the stage for what Data Science looks like here, away from internet behemoths and big-data clichés. Conclusions on the state of the art, at least in Switzerland, are then drawn from evaluating the recent "1st Swiss Workshop on Data Science" and ZHAW's professional education programme "DAS in Data Science".
With the help of the audience during the subsequent discussion, these results can then be extrapolated to the wider European community.
2nd International Conference on Machine Learning, NLP and Data Mining (MLDA 2023) (gerogepatton)
The 2nd International Conference on Machine Learning, NLP and Data Mining (MLDA 2023) will provide an excellent international forum for sharing knowledge and results in the theory, methodology and applications of Machine Learning, Natural Language Computing and Data Mining. Authors are solicited to contribute to the conference by submitting articles that illustrate research results, projects, survey works and industrial experiences describing significant advances in, but not limited to, these areas.
A 4 hour hands on linked data workshop held at ELAG 2013 - http://elag2013.org/ws2-very-gentle-linked-data/. Resources at http://data.archiveshub.ac.uk/workshops/elag2013/
British Library Labs 21st Century Curatorship Talk (labsbl)
The document discusses the British Library Labs program and lessons learned. It provides an overview of how Labs works with stakeholders like researchers and developers. Labs runs competitions to fund projects that experiment with the Library's digital collections. Winners complete residencies to develop tools and services. Lessons include the need to filter large collections, engage curators, address metadata and system issues, and provide flexible access to support digital research.
Acquisition policy and business models of research libraries in a digital era... (dduin)
This document discusses how research libraries are adapting to the digital era. It notes that libraries have changed more in the last decade than the last century as they shift resources from print to digital. Libraries are expected to support teaching, research, and scholarly communication. The document recommends that natural history institutions go fully digital with their publications for increased visibility, accessibility, and cost savings. It encourages the use of open access models and infrastructure like the Directory of Open Access Journals to make publications more discoverable.
Training workshop on "Designing and conducting user studies"
Module 1 - Methods and Techniques (Kristien Ooms)
@ ICC&GIS
June 15th, 2016
Albena, Bulgaria
Innovation and project management at ETH Library (ETH-Bibliothek)
The document provides information about ETH Zurich Library and its efforts in innovation and project management. It discusses ETH Zurich as an institute of technology and science with over 18,500 students. It then describes ETH Library, which has main and special libraries containing over 7 million holdings. The library has undertaken various innovation initiatives like introducing an ideas management process, project management standards, and launching projects like refreshing the Knowledge Portal and developing the ETHorama tool to enhance access to electronic holdings. It also discusses piloting e-lending of e-books to external users, which started with 26,000 e-books and saw increasing uptake over time.
Part 1 of the printed publication "3D-ICONS Guidelines and Case Studies" First published in November 2014.
Public fascination with architectural and archaeological heritage is well known; according to the UN World Tourism Organisation, it is one of the main reasons for tourism. Historic buildings and archaeological monuments form a significant component of Europe's cultural heritage; they are the physical testimonies of European history and of the different events that led to the creation of the European landscape as we know it today.
The documentation of built heritage increasingly makes use of 3D scanning and other remote sensing technologies, which produce digital replicas accurately and quickly. Such digital models have a wide range of uses, from the conservation and preservation of monuments to the communication of their cultural value to the public. They may also support in-depth analysis of their architectural and artistic features, as well as allow the production of interpretive reconstructions of their past appearance.
The goal of the 3D-ICONS project, funded under the European Commission's ICT Policy Support Programme and building on the results of CARARE (www.carare.eu) and 3D-COFORM (www.3d-coform.eu), is to provide Europeana with 3D models of architectural and archaeological monuments of remarkable cultural importance. The project brings together 16 partners (see appendix 2) from 11 countries across Europe with relevant expertise in 3D modelling and digitization. Its main purpose is to produce around 4000 accurate 3D models, processed into a simplified form so that they can be visualized on low-end personal computers and on the web.
Slides of the paper Deep Learning-Based Morphological Taggers and Lemmatizers for Annotating Historical Texts by Helmut Schmid at the 3rd Edition of the DATeCH2019 International Conference
This document discusses using text models to improve the accuracy of optical character recognition (OCR) on Chinese rare books. Experiments were conducted with n-gram, backward/forward n-gram, and LSTM models on OCR data from ancient medicine books. The backward/forward 4-gram model achieved the highest correction rate at 97.57%. Mixing the LSTM 6-gram model with the OCR's top 5 candidates and the probability of the top candidate further improved accuracy to 97.71%, demonstrating that combining text models with OCR probabilities corrects OCR errors better than text models alone. In conclusion, text models are effective for increasing OCR accuracy on rare books, with the backward/forward 4-gram and LSTM 6-gram models performing best.
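The mixing step, combining a text model's score with the OCR engine's candidate probabilities, can be sketched as log-linear rescoring of the candidate list. The toy word-frequency "language model", the words, and the weight below are placeholders for illustration, not the paper's actual models or parameters.

```python
# Sketch of the mixing idea: rescore the OCR engine's top-k candidates with
# a text-model score, instead of trusting either signal alone.
import math

LM = {"medicine": 0.01, "rnedicine": 1e-9, "ancient": 0.02}  # toy unigram LM

def rescore(candidates, lam=0.5, floor=1e-12):
    """candidates: list of (word, ocr_probability). Returns the word that
    maximizes a log-linear mix of OCR confidence and language-model score."""
    def score(word, p_ocr):
        return lam * math.log(max(p_ocr, floor)) + \
               (1 - lam) * math.log(LM.get(word, floor))
    return max(candidates, key=lambda c: score(*c))[0]

# The OCR engine slightly prefers the garbled "rnedicine"; the LM overrules it.
print(rescore([("rnedicine", 0.6), ("medicine", 0.4)]))  # medicine
```

In the paper's setup the text model would be an n-gram or LSTM model over context, not a unigram table, but the candidate-rescoring shape is the same.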
More Related Content
Similar to Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology
Automatic Classification of Springer Nature Proceedings with Smart Topic MinerFrancesco Osborne
The document summarizes research on automatically classifying Springer Nature proceedings using the Smart Topic Miner (STM). STM extracts topics from publications, maps them to a computer science ontology, selects relevant topics using a greedy algorithm, and infers tags. It was tested on 8 Springer Nature editors who found STM accurately classified 75-90% of proceedings and improved their work. However, STM is currently limited to computer science and occasional noisy results were found in books with few chapters. Future work aims to expand STM to characterize topic evolution over time and directly support author tagging.
This is a copy of the presentation given by Ellen Fleurbaay and Marc Holtman of the Amsterdam City Archives at the the MARAC Plenary Session in Jersey City on Friday October 30, 2009.
The document summarizes the JISC HIKE Project at the University of Huddersfield which evaluated the Intota library management system from Serials Solutions and the JISC Knowledge Base+. The project aimed to understand current workflows, identify pain points, evaluate the new systems, provide guidance on integration, and assess the impact on workflows. Intota promises improved integrated workflows from discovery to acquisition and more automated processing. The project found opportunities to reduce duplication and break down silos through new interoperable systems.
MongoDB.local Atlanta: MongoDB @ Sensus: Xylem IoT and MongoDBMongoDB
Grant Muller is the Vice President of Application Software and Architecture at Xylem, a water technology company. He has over 15 years of experience developing software for utilities and has been using MongoDB for around 10 years.
Xylem is a global water technology company with over 17,000 employees operating in over 50 countries. They have been using MongoDB since 2009 when they acquired Verdeeco, an analytics startup that was using MongoDB. Since then, they have continued adopting MongoDB and scaling their usage of it as their data and applications have grown significantly through acquisitions.
Xylem is now developing an IoT platform called Xylem IoT Cloud to connect their various water devices. They are storing the sensor
This document discusses optical character recognition (OCR) of historical newspapers. It describes the digitization process, which includes image capturing, text and structure recognition, natural language processing, and content representation. OCR accuracy can be improved through layout analysis, structural metadata extraction, and identifying different content units like articles, advertisements, and entertainment sections. The goal is to make the content and knowledge within digitized newspapers accessible beyond the scanned text.
Linda Treude, Sabine Wolf: Features for the Future Library #bcs2015KISK FF MU
Talk given at the BOBCATSSS 2015 conference - http://www.bobcatsss2015.com/.
In the contribution “Features for the Future Library” the German project “mylibrARy” will be introduced, which is a cooperation project between the University of Applied Sciences in Potsdam, a public library in Berlin and one of the leading AR-software companies metaio GmbH from Munich. The conceptual process of a library AR-app will be presented as well as the results of a user study, which might give an answer to the question, what features of an app the library users want.
Furthermore the possibilities of AR-technology for libraries in general will be discussed and contextualized within the concept of a modern user-friendly library.
Supporting Springer Nature Editors by means of Semantic TechnologiesFrancesco Osborne
The Open University and Springer Nature have been collaborating since 2015 in the development of an array of semantically-enhanced solutions supporting editors in i) classifying proceedings and other editorial products with respect to the relevant research areas and ii) taking informed decisions about their marketing strategy. These solutions include i) the Smart Topic API, which automatically maps keywords associated with published papers to semantically characterized topics, which are drawn from a very large and automatically-generated ontology of Computer Science topics; ii) the Smart Topic Miner, which helps editors to associate scholarly metadata to books; and iii) the Smart Book Recommender, which assists editors in deciding which editorial products should be marketed in a specific venue.
Publishing conference proceedings internationally: how does it workAliaksandr Birukou
In this presentation we look into main elements one has to consider when organizing an international conference. First, we describe the role of conference proceedings in CS and beyond. Second, we focus on the tasks of conference organizers. Third, we cover the peer review aspects and announce the new group CrossRef and DataCite start with this respect. We then cover indexing and dissemination as well as present several tips and guidelines for organizers of international conferences as well as the word of warning regarding predatory publishers.
В этой презентации мы рассмотрим основные элементы, которые необходимо учитывать при организации международной конференции. Во-первых, мы описываем роль материалов конференций в компьютерных науках и других областях. Во-вторых, мы концентрируемся на задачах организаторов конференции. В-третьих, мы рассмотрим аспекты рецензирования и расскажем о работе группы CrossRef и DataCite. Затем мы расскажем об индексировании и распространении, а также представим несколько советов и рекомендаций для организаторов международных конференций, а также предостережём о феномене хищнических издателей и конференций.
Grant presents a case study of the 19th Century Pamphlets digitisation project, covering the decisions made in planning the project, the challenges encountered, and key lessons learned.
With approximately 1.x years of delay to the US, the term "Data Science" is also gaining speed in Europe. We see more and more job openings for- and business cards of data scientists, new events dedicated to the topic and an increased demand in related education literally every month. In response to this trend, Zurich University of Applied Sciences founded the ZHAW Data Science Laboratory (Datalab) last year.
This talk is to give an updated overview of Data Science in Europe by the example of the Datalab's activities in Switzerland. After a definition and classification of the field, a presentation of real technical projects sets the stage for what Data Science looks like here, offside of internet behemoths and big data clichés. Then, conclusions on the state of the art at least in Switzerland are drawn from evaluating the recent "1st Swiss Workshop on Data Science" event and ZHAW's professional education programme "DAS in Data Science".
With the help of the audience during the subsequent discussion, these results can eventually be extrapolated to the wider European community.
2nd International Conference on Machine Learning, NLP and Data Mining (MLDA 2023)
The 2nd International Conference on Machine Learning, NLP and Data Mining (MLDA 2023) will provide an excellent international forum for sharing knowledge and results in theory, methodology and applications of Machine Learning, Natural Language Computing and Data Mining. Authors are solicited to contribute to the conference by submitting articles that illustrate research results, projects, surveying works and industrial experiences that describe significant advances in the following areas, but are not limited to these topics only.
A 4 hour hands on linked data workshop held at ELAG 2013 - http://elag2013.org/ws2-very-gentle-linked-data/. Resources at http://data.archiveshub.ac.uk/workshops/elag2013/
British Library Labs 21st Century Curatorship Talk
The document discusses the British Library Labs program and lessons learned. It provides an overview of how Labs works with stakeholders like researchers and developers. Labs runs competitions to fund projects that experiment with the Library's digital collections. Winners complete residencies to develop tools and services. Lessons include the need to filter large collections, engage curators, address metadata and system issues, and provide flexible access to support digital research.
Acquisition policy and business models of research libraries in a digital era
This document discusses how research libraries are adapting to the digital era. It notes that libraries have changed more in the last decade than the last century as they shift resources from print to digital. Libraries are expected to support teaching, research, and scholarly communication. The document recommends that natural history institutions go fully digital with their publications for increased visibility, accessibility, and cost savings. It encourages the use of open access models and infrastructure like the Directory of Open Access Journals to make publications more discoverable.
Training workshop on "Designing and conducting user studies"
Module 1 - Methods and Techniques (Kristien Ooms)
@ ICC&GIS
June 15th, 2016
Albena, Bulgaria
Innovation and project management at ETH Library
The document provides information about ETH Zurich Library and its efforts in innovation and project management. It discusses ETH Zurich as an institute of technology and science with over 18,500 students. It then describes ETH Library, which has main and special libraries containing over 7 million holdings. The library has undertaken various innovation initiatives like introducing an ideas management process, project management standards, and launching projects like refreshing the Knowledge Portal and developing the ETHorama tool to enhance access to electronic holdings. It also discusses piloting e-lending of e-books to external users, which started with 26,000 e-books and saw increasing uptake over time.
Part 1 of the printed publication "3D-ICONS Guidelines and Case Studies" First published in November 2014.
Public fascination with the architectural and archaeological heritage is well known; according to the UN World Tourism Organisation, it is proven to be one of the main reasons for tourism. Historic buildings and archaeological monuments form a significant component of Europe's cultural heritage; they are the physical testimonies of European history and of the different events that led to the creation of the European landscape as we know it today.
The documentation of built heritage increasingly avails of 3D scanning and other remote sensing technologies, which produce digital replicas in an accurate and fast way. Such digital models have a large range of uses, from the conservation and preservation of monuments to the communication of their cultural value to the public. They may also support in-depth analysis of their architectural and artistic features as well as allow the production of interpretive reconstructions of their past appearance.
The goal of the 3D-ICONS project, funded under the European Commission’s ICT Policy Support Programme which builds on the results of CARARE (www.carare.eu) and 3D-COFORM (www.3d-coform.eu), is to provide Europeana with 3D models of architectural and archaeological monuments of remarkable cultural importance. The project brings together 16 partners (see appendix 2) from across Europe (11 countries) with relevant expertise in 3D modelling and digitization. The main purpose of this project is to produce around 4000 accurate 3D models which have to be processed into a simplified form in order to be visualized on low end personal computers and on the web.
Similar to Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology
Slides of the paper Deep Learning-Based Morphological Taggers and Lemmatizers for Annotating Historical Texts by Helmut Schmid at the 3rd Edition of the DATeCH2019 International Conference
This document discusses using text models to improve the accuracy of optical character recognition (OCR) on Chinese rare books. It conducted experiments using n-gram, backward/forward n-gram, and LSTM models on OCR data from ancient medicine books. The backward and forward 4-gram model achieved the highest correction rate at 97.57%. Mixing the LSTM 6-gram model with the OCR's top 5 candidates and the probability of the top candidate further improved accuracy to 97.71%, demonstrating that combining text models with OCR probabilities can better correct OCR errors than text models alone. In conclusion, text models are effective for increasing OCR accuracy on rare books, with the backward/forward 4-gram and LSTM 6-gram models performing best.
Slides of the paper Turning Digitised Material into a Diachronic Corpus: Metadata Challenges in the Nederlab Project by Katrien Depuydt and Hennie Brugman at the 3rd Edition of the DATeCH2019 International Conference
Slides of the paper Standoff Annotation for the Ancient Greek and Latin Dependency Treebank by Giuseppe Celano at the 3rd Edition of the DATeCH2019 International Conference
Slides of the paper Using lexicography to characterise relations between species mentions in the biodiversity literature by Sandra Young at the 3rd Edition of the DATeCH2019 International Conference
Slides of the paper Implementation of a Databaseless Web REST API for the Unstructured Texts of Migne's Patrologia Graeca with Searching capabilities and additional Semantic and Syntactic expandability by Evagelos Varthis, Marios Poulos, Ilias Yarenis and Sozon Papavlasopoulos at the 3rd Edition of the DATeCH2019 International Conference
Slides of the paper Curation Technologies for a Cultural Heritage Archive: Analysing and transforming a heterogeneous data set into an interactive curation workbench by Georg Rehm, Martin Lee, Julián Moreno Schneider and Peter Bourgonje at the 3rd Edition of the DATeCH2019 International Conference
Slides of the paper Cross-disciplinary collaborations to enrich access to non-Western language material in the Cultural Heritage sector by Tom Derrick and Nora McGregor at the 3rd Edition of the DATeCH2019 International Conference
Slides of the paper Tribunal Archives as Digital Research Facility (TRIADO): new ways to make archives accessible and useable by Anne Gorter, Edwin Klijn, Rutger Van Koert, Marielle Scherer and Ismee Tames at the 3rd Edition of the DATeCH2019 International Conference
Slides of the paper Improving OCR of historical newspapers and journals published in Finland by Senka Drobac, Pekka Kauppinen and Krister Lindén at the 3rd Edition of the DATeCH2019 International Conference
Slides of the paper Towards a generic unsupervised method for transcription of encoded manuscripts by Arnau Baró, Jialuo Chen, Alicia Fornés and Beáta Megyesi at the 3rd Edition of the DATeCH2019 International Conference
Slides of the paper Towards the Extraction of Statistical Information from Digitised Numerical Tables - The Medical Officer of Health Reports Scoping Study by Christian Clausner, Apostolos Antonacopoulos, Christy Henshaw and Justin Hayes at the 3rd Edition of the DATeCH2019 International Conference
Slides of the paper Detecting Articles in a Digitized Finnish Historical Newspaper Collection 1771–1929: Early Results Using the PIVAJ Software by Kimmo Kettunen, Teemu Ruokolainen, Erno Liukkonen, Pierrick Tranouez, Daniel Antelme and Thierry Paquet at the 3rd Edition of the DATeCH2019 International Conference
Slides of the paper OCR-D: An end-to-end open-source OCR framework for historical documents by Clemens Neudecker, Konstantin Baierer, Maria Federbusch, Kay-Michael Würzner, Matthias Boenig, Elisa Hermann and Volker Hartmann at the 3rd Edition of the DATeCH2019 International Conference
- The document describes a project to fill gaps in knowledge about diamond mining, trading, and polishing in Borneo by developing a workflow using various CLARIAH tools and resources.
- The workflow involved digitizing a diamond encyclopedia, extracting concepts and place names, linking the data to external sources to create linked open data, and querying newspaper archives to build a corpus of relevant articles.
- Promising results showed mining, trading, and polishing continued in Borneo for Southeast Asian customers, and described previously unknown diamond fields and polishing locations in Borneo. The project aims to apply the workflow to other commodities like sugar.
Slides of the paper Automatic Reconstruction of Emperor Itineraries from the Regesta Imperii by Juri Opitz, Leo Born, Vivi Nastase and Yannick Pultar at the 3rd Edition of the DATeCH2019 International Conference
Slides of the paper Automatic Semantic Text Tagging on Historical Lexica by Combining OCR and Typography Classification by Christian Reul, Sebastian Göttel, Uwe Springmann, Christoph Wick, Kay-Michael Würzner and Frank Puppe at the 3rd Edition of the DATeCH2019 International Conference
This document describes the SOS system for segmenting, stemming, and standardizing Arabic text. It presents the challenges of processing Arabic cultural heritage texts which contain orthographic variations. The system uses gradient boosting machines and achieves state-of-the-art performance on segmentation and derives stemming as a byproduct. It also standardizes orthography with high accuracy, which further improves segmentation. The system addresses issues like hamza forms and letter confusions that previous systems did not handle well.
What is an RPA CoE? Session 1 – CoE Vision
In the first session, we will review the organization's vision and how this has an impact on the COE Structure.
Topics covered:
• The role of a steering committee
• How do the organization’s priorities determine CoE Structure?
Speaker:
Chris Bolin, Senior Intelligent Automation Architect Anika Systems
How to Interpret Trends in the Kalyan Rajdhani Mix Chart
A Mix Chart displays historical data of numbers in a graphical or tabular form. The Kalyan Rajdhani Mix Chart specifically shows the results of a sequence of numbers over different periods.
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
How information systems are built or acquired puts information, which is what they should be about, in a secondary place. Our language adapted accordingly, and we no longer talk about information systems but applications. Applications evolved in a way that breaks data into diverse fragments, tightly coupled with applications and expensive to integrate. The result is technical debt, which is repaid by taking even bigger "loans", resulting in ever-increasing technical debt. Software engineering and procurement practices work in sync with market forces to maintain this trend. This talk demonstrates how natural this situation is. The question is: can something be done to reverse the trend?
In the realm of cybersecurity, offensive security practices act as a critical shield. By simulating real-world attacks in a controlled environment, these techniques expose vulnerabilities before malicious actors can exploit them. This proactive approach allows manufacturers to identify and fix weaknesses, significantly enhancing system security.
This presentation delves into the development of a system designed to mimic Galileo's Open Service signal using software-defined radio (SDR) technology. We'll begin with a foundational overview of both Global Navigation Satellite Systems (GNSS) and the intricacies of digital signal processing.
The presentation culminates in a live demonstration. We'll showcase the manipulation of Galileo's Open Service pilot signal, simulating an attack on various software and hardware systems. This practical demonstration serves to highlight the potential consequences of unaddressed vulnerabilities, emphasizing the importance of offensive security practices in safeguarding critical infrastructure.
HCL Notes and Domino licence cost reduction in the world of DLAU
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU and the licences under the CCB and CCX models have been a hot topic for many in the HCL community since last year. As a Notes or Domino customer, you may be struggling with unexpectedly high user counts and licence fees. You may be wondering how this new kind of licensing works and what benefit it brings you. Above all, you surely want to stay within budget and save costs wherever possible. We understand that, and we want to help!
We explain how to solve common configuration problems that can cause more users to be counted than necessary, and how to identify and remove superfluous or unused accounts in order to save money. There are also approaches that can lead to unnecessary expense, for example when a person document is used instead of a mail-in for shared mailboxes. We show you such cases and their solutions. And of course we explain the new licence model.
Join this webinar, in which HCL Ambassador Marc Thomas and guest speaker Franz Walder introduce you to this new world. It gives you the tools and know-how to keep an overview. You will be able to reduce your costs through an optimised Domino configuration and keep them low in the future.
These topics are covered:
- Reducing licence costs by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licences really work?
- Understanding the DLAU tool and how best to use it
- Tips for common problem areas, such as team mailboxes, functional/test users, etc.
- Real-world examples and best practices to implement immediately
The Microsoft 365 Migration Tutorial for Beginners
This presentation will help you understand the power of Microsoft 365. We cover every productivity app included in Office 365. Additionally, we outline common Office 365 migration scenarios and how we can help you.
You can also read: https://www.systoolsgroup.com/updates/office-365-tenant-to-tenant-migration-step-by-step-complete-guide/
Main news related to the CCS TSI 2023 (2023/1695) – Jakub Marek
An English 🇬🇧 translation of the presentation accompanying the speech I gave about the main changes brought by CCS TSI 2023 at the biggest Czech conference on communications and signalling systems on railways, held in the Clarion Hotel Olomouc from 7th to 9th November 2023 (konferenceszt.cz). Attended by around 500 participants and 200 online followers.
The original Czech 🇨🇿 version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092 .
The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/how-axelera-ai-uses-digital-compute-in-memory-to-deliver-fast-and-energy-efficient-computer-vision-a-presentation-from-axelera-ai/
Bram Verhoef, Head of Machine Learning at Axelera AI, presents the “How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-efficient Computer Vision” tutorial at the May 2024 Embedded Vision Summit.
As artificial intelligence inference transitions from cloud environments to edge locations, computer vision applications achieve heightened responsiveness, reliability and privacy. This migration, however, introduces the challenge of operating within the stringent confines of resource constraints typical at the edge, including small form factors, low energy budgets and diminished memory and computational capacities. Axelera AI addresses these challenges through an innovative approach of performing digital computations within memory itself. This technique facilitates the realization of high-performance, energy-efficient and cost-effective computer vision capabilities at the thin and thick edge, extending the frontier of what is achievable with current technologies.
In this presentation, Verhoef unveils his company’s pioneering chip technology and demonstrates its capacity to deliver exceptional frames-per-second performance across a range of standard computer vision networks typical of applications in security, surveillance and the industrial sector. This shows that advanced computer vision can be accessible and efficient, even at the very edge of our technological ecosystem.
Dandelion Hashtable: beyond a billion requests per second on a commodity server
This slide deck presents DLHT, a concurrent in-memory hashtable. Despite efforts to optimize hashtables, which go as far as sacrificing core functionality, state-of-the-art designs still incur multiple memory accesses per request and block request processing in three cases. First, most hashtables block while waiting for data to be retrieved from memory. Second, open-addressing designs, which represent the current state of the art, either cannot free index slots on deletes or must block all requests to do so. Third, index resizes block every request until all objects are copied to the new index. Defying folklore wisdom, DLHT forgoes open-addressing and adopts a fully-featured and memory-aware closed-addressing design based on bounded cache-line-chaining. This design (1) offers lock-free index operations and deletes that free slots instantly, (2) completes most requests with a single memory access, (3) utilizes software prefetching to hide memory latencies, and (4) employs a novel non-blocking and parallel resizing. On a commodity server with a memory-resident workload, DLHT surpasses 1.6B requests per second and provides 3.5x (12x) the throughput of the state-of-the-art closed-addressing (open-addressing) resizable hashtable on Gets (Deletes).
Building Production Ready Search Pipelines with Spark and Milvus
Spark is a widely used ETL tool for processing, indexing and ingesting data into a serving stack for search. Milvus is a production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data, extract vector representations, and push the vectors to the Milvus vector database for search serving.
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks creates AI-boosted business software that helps employees work more efficiently and effectively. Managing data across multiple RDBMS and NoSQL databases was already a challenge at their current scale. To prepare for 10X growth, they knew it was time to rethink their database strategy. Learn how they architected a solution that would simplify scaling while keeping costs under control.
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
Digital Marketing Trends in 2024 | Guide for Staying Ahead
https://www.wask.co/ebooks/digital-marketing-trends-in-2024
Feeling lost in the digital marketing whirlwind of 2024? Technology is changing, consumer habits are evolving, and staying ahead of the curve feels like a never-ending pursuit. This e-book is your compass. Dive into actionable insights to handle the complexities of modern marketing. From hyper-personalization to the power of user-generated content, learn how to build long-term relationships with your audience and unlock the secrets to success in the ever-shifting digital landscape.
"Choosing the proper type of scaling", Olena Syrota
Imagine an IoT processing system that is already quite mature and production-ready, whose client coverage is growing, and for which scaling and performance are life-and-death questions. The system has Redis, MongoDB, and stream processing based on ksqlDB. In this talk we will first analyze scaling approaches and then select the proper ones for our system.
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
TrustArc Webinar - 2024 Global Privacy Survey
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology
1. Universität Innsbruck
Christoph-Probst-Platz, Innrain 52
6020 Innsbruck
http://info.uibk.ac.at
User-driven correction of OCR errors.
Combining crowdsourcing and information retrieval technology
Günter Mühlberger, Johannes Zelger
David Sagmeister, Albert Greinöcker
Universität Innsbruck / Höhere Technische Bundeslehranstalt Anichstraße - Innsbruck
2. Agenda
Günter Mühlberger | Universitätsbibliothek Innsbruck | Abt. f. Digitalisierung & elektr. Archivierung | DATech 2014 - Madrid
• Introduction
• Crowdsourcing approaches for OCR correction
• Our approach
• Evaluation
• Future work
3. Introduction
4. Digitisation and OCR quality
• Digitisation of historical printed material
– Google: billions of files; libraries: millions of files
– Still hard to get access to these files
• OCR quality
– There is only little reliable data on the accuracy of OCR on large-scale datasets
– E.g. we do not know "how good the Google collection" is as a whole, or per language, per century, decade or year, per text type, etc.
• Tanner (2009)
– Evaluated OCR accuracy on British newspapers
– Differences per newspaper are stronger than per publishing date
– Overall we are speaking about 10% to 40% Word Error Rate, with an average of 22% WER for standard words and 31% for significant words
– Evaluation done within the IMPACT project has shown similar figures
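Word Error Rate, as used in the figures above, is the word-level edit distance between OCR output and ground truth, divided by the number of ground-truth words. A minimal sketch (an illustration, not the evaluation code used by Tanner or IMPACT):

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("neue" misread as "nelle") in 3 reference words ≈ 0.33 WER.
print(word_error_rate("die neue Zeitung", "die nelle Zeitung"))
```

A 22% average WER, in these terms, means roughly one in five words of the ground truth is missing or wrong in the OCR text.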
5. End-users and OCR quality
• What does this mean for the end-user?
– End-users are either searching a collection or reading an interesting item (which they may have found by searching).
– But for reading a page/book they have the original image – so the full-text is much less important to them
• If we take the figures from above:
– End-users will miss e.g. 20% or 30% of all occurrences of a search term that would be interesting to them, simply because the OCR is wrong.
• Maybe acceptable to occasional users, but surely not to humanities researchers or family historians: they want to get "all relevant occurrences"
– What is "relevant" is decided by the user; some may be interested only within a specific time period, or periodical, or collection of documents
– Note: not all words are frequent in all collections ("London" is seldom in a Tyrolean newspaper collection, whereas it is frequent in a British newspaper collection)
6. Crowdsourcing for OCR
7. Approaches
• OCR as an "ideal" field for crowdsourcing
– Simple to realize: provide a link between image and text and let the user correct it
• Three (and a half) main approaches
– reCAPTCHA
– Australian National Library (Newspaper Digitization Project)
– National Library of Finland (gamification)
– IBM: CONCERT (Collaborative Correction Platform)
8. reCAPTCHA
9. Australian National Library
10. Australian National Library
11. National Library of Finland: Digitalkoot
12. IBM CONCERT (COoperative eNgine for Correction of ExtRacted Text)
13. Conclusion
• OCR correction with the support of the crowd does work (but not always)!
• In the case of reCAPTCHA and Digitalkoot, users have no influence on what they correct (de-motivating)
– reCAPTCHA is successful due to the sheer size of interactions
• User-specific benefit is provided mainly by the approach of the Australian National Library
– User reads the text carefully when editing
– Finds corrected words immediately after submitting the correct text
– Can decide what to correct
• Power users vs. crowd users
– A very small segment of all users carries out the actual work
– Australia: top 6 users corrected about 25% of the texts
– Transcribe Bentham project: top 7 users produced 70% of all transcripts
14. Proposed approach
15. Searching AND correcting
• Let's combine searching and crowd-based correction!
• Provide users with a powerful instrument to correct exactly those words they are interested in (searching for)
• Relieve users from actually editing words; let them just approve or reject the results of the OCR engine
16. Search interface
17. Search interface: Features
• User has the chance to
– select the Edit Distance (ED): 0-2
– display already approved words
– search only within the index (without showing word snippets)
• In this way users can play around and
– influence the recall of the system
– see the index (which is very helpful to get an impression of the OCR errors)
– see what has already been done
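The ED 0-2 search can be pictured as a Levenshtein-distance filter over the index vocabulary. The sketch below is only an illustration of the idea; the actual system presumably relies on a search engine's fuzzy matching rather than a linear scan of the index:

```python
def levenshtein(a, b):
    """Iterative two-row edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def fuzzy_search(index, term, max_ed):
    """Return index tokens within the chosen edit distance of the search term."""
    return sorted(t for t in index if levenshtein(term, t) <= max_ed)

# Made-up index vocabulary for illustration.
index = {"neue", "nelle", "nette", "London", "Lontion"}
print(fuzzy_search(index, "nelle", 2))  # → ['nelle', 'nette', 'neue']
```

With ED 0 only exact index matches are returned; raising the ED pulls in likely OCR variants of the query, which is exactly how the user controls recall.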
18. Result page: Features
19. Günter Mühlberger | Universitätsbibliothek Innsbruck | Abt. f. Digitalisierung & elektr. ArchivierungDATech 2014 - Madrid
• Users see the word snippets matching their search
• Buttons
– Select all as "false" or "correct"
• Red: the word snippet does not represent the correct text
• Green: the word snippet represents the correct text (match between search term and word snippet)
– Deselect all
– Reverse selection
– Save
• Save
– Green word snippets: the text is either approved (if it is the same as in the OCR text) or the wrong OCR text is corrected to the search term
– Red word snippets: nothing is changed in the OCR text
Features
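The save rule described above can be sketched as a small decision function (the names are ours, not the authors'): green with matching OCR text means approve, green with differing text means correct, red means leave untouched.

```java
public class SaveRule {
    enum Action { APPROVE, CORRECT, LEAVE }

    // Decide what saving does for one word snippet.
    static Action onSave(boolean markedGreen, String ocrText, String searchTerm) {
        if (!markedGreen) return Action.LEAVE;                  // red: nothing changes
        if (ocrText.equals(searchTerm)) return Action.APPROVE;  // OCR text already correct
        return Action.CORRECT;                                  // replace OCR text by search term
    }

    public static void main(String[] args) {
        System.out.println(onSave(true, "Feuerwehr", "Feuerwehr"));  // prints APPROVE
        System.out.println(onSave(true, "Feuenvehr", "Feuerwehr"));  // prints CORRECT
        System.out.println(onSave(false, "Feuerwerk", "Feuerwehr")); // prints LEAVE
    }
}
```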
Result page (2)
• Result set (on the left-hand side)
– 150 word snippets are currently shown in the standard view
– Can be parameterized
– Currently ordered by file path (another criterion could be word confidence)
• Index (on the right-hand side)
– All index terms "behind" a fuzzy search are listed
– The number of occurrences in this result set is shown
– Users get an overview of which tokens are behind these snippets
– Users are able to decide quickly which tokens are "real" words
Additional features
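The index panel on the right-hand side can be derived from the fuzzy result set by counting how often each distinct token occurs among the retrieved snippets. A sketch of that aggregation (our own illustration, not the paper's code):

```java
import java.util.*;

public class IndexPanel {
    // Count occurrences of each distinct token in the result set;
    // a TreeMap keeps the index terms alphabetically sorted for display.
    static Map<String, Integer> termCounts(List<String> snippetTokens) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String t : snippetTokens) counts.merge(t, 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        List<String> hits = Arrays.asList("Feuerwehr", "Feuerwerk", "Feuerwehr", "Feuenvehr");
        // prints {Feuenvehr=1, Feuerwehr=2, Feuerwerk=1}
        System.out.println(termCounts(hits));
    }
}
```

Such a histogram is what lets the user decide at a glance which tokens are "real" words and which are OCR errors.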
• Improve precision
– Search with ED0
– All word snippets should display the search term
– Those which do not are classic OCR errors
– If snippets are selected they get the status "approved"
– Those which are errors are currently just deselected (and not marked as false)
• Approvals are directly written into the ALTO file
– Correction status: true, "approved"
Correction strategies (1)
Example 1: Search for "nelle"
OCR errors
[Word-snippet images showing OCR confusions between "neue" and "nelle"]
Select correct word images = green = approved
• Search for a word with ED1 or ED2
– The number of hits (and word snippets) increases significantly
– Sometimes more, sometimes less, depending very much on the search string and its length
• Strategy
– One may go through all word snippets and deselect wrong ones or select correct ones, but this takes some time and is boring
• But: due to ED2, many other correct words are included in the result set
• Therefore another correction strategy may be more interesting
Correction strategy (2): Improve recall
• Recommended method
– Go through all tokens representing "real words" which appear in the index on the right-hand side
– Clicking on a word in the index triggers an ED0 search
– In many cases ED0 searches retrieve good results with just a few OCR errors, so approval is very simple and fast
• Once the "real words" are done, only those word snippets with "real" OCR errors of the search term remain, which is our real objective to correct
Correction strategies (3)
Example: Search for "Feuerwehr", ED2
"Feuerwehr" (fire brigade)
Index terms retrieved: "Feuenvehr", "Fenerwehr", "Feuerwehr,", "Feuermeh", "Feuerwehr-", "Feuerweh,", "Feuerwehr.", "Feuerwerk", "Feuerweh", "Feuerwehren", "Feuerwehr-,", "Feuerwehr^", "Feuerweh?", "Feuerwehr", "Feuermehr", "Feueràhr", "Feuerwert", "Feuerweihe", "Fenerwchr"
• Examples of erroneous words in red
• These words are the "rest" which appears after the "real" words (green) have been approved
• They will finally be replaced by the correct word:
• In ALTO: correction status true; substitute: Feuerwehr
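In ALTO terms, the replacement could look like the following fragment. This is a hypothetical sketch: CONTENT and WC are standard ALTO attributes for a word and its confidence, but the correction-status and substitute markup is our illustration of what the slides describe, not the authors' exact schema.

```xml
<!-- Hypothetical illustration, not the authors' exact markup -->
<String CONTENT="Feuenvehr" WC="0.43">
  <!-- correction status: true; substitute: Feuerwehr -->
  <ALTERNATIVE>Feuerwehr</ALTERNATIVE>
</String>
```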
Validating "real" words from the index
• Snippets which were approved in the previous steps are hidden from the user.
– But users are able to see them if interested or if they want to do a final check
– Overwriting is possible; the status has to be changed
• Therefore the final correction screen now shows, instead of 324 word snippets for "Feuerwehr" ED2, only those which were not approved before.
Repeated search for "Feuerwehr", ED2
Finally, the "real" OCR errors are replaced by the correct word
• Test set
– From the Europeana Newspapers Project
– 16,000 pages from the Tessmann Library; several million are waiting to be indexed
– METS/ALTO files
• Standard technology
– Java, JavaScript (Ajax), Lucene
• Images are cropped on the fly
– "Hardest" task: takes some seconds on a 4-core machine
– The first batch of 150 snippets is done immediately; the second batch is preprocessed in the background
• A test set is available online
– http://dbis-faxe.uibk.ac.at/Website%202.0/CorrectionServlet
– Attention: not a stable link!
Implementation
• Our method provides the chance to improve precision and recall of search terms in a rather quick and straightforward way.
• Fuzzy search makes it possible to increase the recall of search terms significantly and to "correct" erroneous terms quickly
• No need to edit text – only typing a search term once and then clicking on the index terms for new searches
• Snowball effect: approved words are stored permanently and are reused in subsequent correction sessions as well
Conclusion
Evaluation
• Currently there is not enough data to provide good figures on the evaluation of the tool – implementation in a real-world scenario will be necessary
• But: Doan, A. et al. 2011. Crowdsourcing systems on the World-Wide Web. Communications of the ACM.
• Four main criteria for crowdsourcing projects:
(1) How to recruit and retain users?
(2) What contributions can users make?
(3) How to combine user contributions to solve the target problem?
(4) How to evaluate users and their contributions?
Evaluation
• Users are searching anyway!
• Those who are searching have a specific interest!
• Satisfaction will be higher if precision and especially recall are higher for noisy OCR text
→ motivation should be there
• Power users of the archive may be willing to contribute a good deal of their time to improve the full-text search
→ working power should be there
• Our tool piggybacks on the search interface – it can be integrated in a simple way (e.g. as an extra tab next to the search users perform anyway, so they may try out what is behind it)
• Searching the index provides useful insights to the user
→ learning curve (get to know your full-text archive!)
(1) How to recruit and retain users?
• Contributions of users:
– Improve precision
– Improve recall by correcting OCR errors of search terms
– All these words are significant and meaningful to a user
• Only a small portion of words is interesting!
– Text contains a lot of words which are not meaningful or are very seldom part of a search
– Austrian Newspapers Online: 50% of all full-text searches are for person names, 20% for geo-names, only a small portion for keywords
– This means that the corrections/approvals done by the user with our method are more valuable than corrections of running text
– The total number of corrected words may not be very high, but these should be significant and relevant words
(2) What contributions can users make?
• Storage of contributions
– All contributions are stored in two ways:
• The Lucene index is immediately updated so that the next search already benefits from approvals/corrections
• Approvals/corrections are directly stored in the OCR XML files (in this case ALTO): words are either marked as correction status true, "approved", or the new alternative of the word is included as well.
• Main benefit for the next user
– The next user will see which word snippets are already approved (shown in blue and gray) – in other words, the contributions are visible to everyone even though they are distributed among large amounts of text
– This should give users the feeling that someone has already worked in this field as well
(3) How to combine user contributions to solve the target problem?
• We have not tackled this field so far
• A strategy could be:
– Randomly select approved or corrected words and provide them to other users for review
– If specific users produce too many errors, a log file could be utilized to reset the correction status within the ALTO files
(4) How to evaluate users and their contributions?
Future work
• Improve the user interface
– Allow word snippets to also be marked as "false"
• Release as an Open Source package
– Will be done during 2014
– Java, Ajax, Lucene – only open-source components
• Implementation of the tool in a real-world scenario
• Include an edit distance that is more meaningful for OCR errors than the fuzzy search of Lucene
– E.g. an ED larger than 2, but based on typical OCR confusions (c-e, etc.)
• Use the data for machine learning
– For all word snippets, metadata such as title of the publication, size of the print, language, date of printing, etc. is available
– Use it to discriminate "hard" cases by asking users to go for specific sets (which are selected automatically)
Further work and improvements
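An OCR-aware edit distance could be a weighted Levenshtein in which typical OCR confusions cost less than an arbitrary substitution. The sketch below is our assumption of what such a measure could look like; the confusion set and the 0.25 weight are invented for illustration, not taken from the paper.

```java
import java.util.*;

public class OcrDistance {
    // Typical single-character OCR confusions (ordered pairs), cheap to substitute.
    static final Set<String> CONFUSIONS =
            new HashSet<>(Arrays.asList("ce", "ec", "un", "nu", "il", "li"));

    static double subCost(char x, char y) {
        if (x == y) return 0.0;
        return CONFUSIONS.contains("" + x + y) ? 0.25 : 1.0;
    }

    // Weighted Levenshtein: insertions/deletions cost 1, substitutions
    // cost 0.25 for known OCR confusions and 1 otherwise.
    static double distance(String a, String b) {
        double[] prev = new double[b.length() + 1];
        double[] curr = new double[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + subCost(a.charAt(i - 1), b.charAt(j - 1)));
            }
            double[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // "Fenerwehr" (u misread as n) is one cheap confusion away.
        System.out.println(distance("Feuerwehr", "Fenerwehr")); // prints 0.25
    }
}
```

Under such a measure, errors like "Fenerwehr" rank much closer to the search term than unrelated real words at the same plain edit distance.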
Thank you for your attention!