The slides present a text-recovery method based on probabilistic post-recognition processing of the output of an Optical Character Recognition (OCR) system. The proposed method tries to fill in gaps of missing text that result from the recognition of degraded documents. For this task, a corpus of n-grams of up to 5-grams, provided by Google, is used. After presenting the general problem and alternative solutions, several heuristics for applying this corpus to the task are described. These heuristics have been validated in a set of experiments, which are discussed together with the results obtained.
3. The Problem
• Many projects are attempting to digitize the content of publications:
– Gutenberg Project (http://www.gutenberg.org/wiki/Main_Page);
– The Million Book Project (http://www.rr.cs.cmu.edu/mbdl.htm);
– The Runeberg Project (http://runeberg.org/);
– Google Book Search (http://books.google.com/);
– Many others.
• Problems:
– Very old documents;
– Partially damaged paper;
– Cheap (poor quality) paper.
• Result: OCR systems are unable to fully recognize the content of some documents!
23.07.2010 ICSOFT 2010 2
4. Our Solution
• A probabilistic method for text recovery that tries to identify the words missing from the digital form of the document.
• Based on:
– “Web 1T 5-gram Version 1” corpus – n-gram
corpus provided by Google (used to generate
candidates)
5. Gaps
• We focus on the reconstruction of damaged documents based on the prediction of the most plausible word sets for filling the missing areas that result after conversion to digital form – we call these areas gaps.
• A gap's most important property is its dimension – the number of characters or words that can be placed inside it.
6. Assumptions
• Our method is based on two assumptions:
– Intra-document similarity. The document model
has 2 components:
• The style model – the structure of the text;
• The language model – the vocabulary used by the
author (n-grams and their frequencies).
– The Google corpus is large enough to subsume most of the language models of the documents posted on the Internet:
• Any word that does not appear in this corpus should not be considered as a candidate to fill in the gaps.
7. Methodology (1)
• The style model of the document
dimension of the gap.
• 2 heuristics:
• Estimated character count ([min_chars, max_chars]) – derived from the document format: margins and indentation;
• Estimated word count ([min_words, max_words]) – uses the previous heuristic together with the distribution of word lengths (in characters) and of the number of words per phrase.
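The second heuristic can be sketched as a small function. This is an illustrative reading of the slide, not the authors' implementation: it assumes the character range is already known and that the document's shortest and longest typical word lengths (`min_word_len`, `max_word_len`) have been measured from the text, with words separated by single spaces.

```python
def estimate_word_count(min_chars, max_chars, min_word_len, max_word_len):
    """Derive a [min_words, max_words] range for a gap from its
    estimated character range and the document's word lengths."""
    # Fewest words: the gap is filled with the longest words
    # (each word costs its length plus one separating space).
    min_words = max(1, (min_chars + 1) // (max_word_len + 1))
    # Most words: the gap is filled with the shortest words.
    max_words = max(1, (max_chars + 1) // (min_word_len + 1))
    return min_words, max_words
```

For a gap estimated at 10–20 characters in a document with word lengths between 2 and 10, this yields a range of 1 to 7 words.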
8. Methodology (2)
• The language model of the document is used to detect the missing words.
1. Start from the partial words at the beginning or at the end of the gaps.
2. Use both the n-gram corpus and the words that have been correctly identified before and after the gap in order to identify the first and last word of the gap:
Use the last 4 words before the gap and the first 4 after it to detect the most probable first and last word of the gap using the n-grams from the corpus (since the maximum order of n-grams in the corpus is 5);
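The two symmetric lookups in step 2 could look like the toy sketch below; `ngram_counts`, a dictionary mapping word tuples to frequencies, is a hypothetical stand-in for the Web 1T corpus, which a real implementation would query from its distribution files.

```python
def first_word_candidates(words_before, ngram_counts, order=5):
    """Candidates for the FIRST gap word: n-grams whose first
    order-1 elements equal the last order-1 words before the gap."""
    ctx = tuple(words_before[-(order - 1):])
    return {g[-1]: c for g, c in ngram_counts.items()
            if len(g) == order and g[:-1] == ctx}

def last_word_candidates(words_after, ngram_counts, order=5):
    """Candidates for the LAST gap word: n-grams whose last
    order-1 elements equal the first order-1 words after the gap."""
    ctx = tuple(words_after[:order - 1])
    return {g[0]: c for g, c in ngram_counts.items()
            if len(g) == order and g[1:] == ctx}
```

With trigrams (order=3), the left context "more narrow" would retrieve candidates such as "interpretation", matching the worked example later in the slides.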
9. Methodology (3)
If there is no such 5-gram, then the n-gram order is decreased repeatedly, down to bigrams, where we consider only the word immediately before the gap and the one immediately after it;
The same happens when the gap is near the start or the end of a phrase.
3. The possible candidates are stored and the process is restarted for each of these candidates in order to find the rest of the words of the gap.
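The order-decreasing backoff for the left side of a gap can be sketched as follows, again assuming a hypothetical `ngram_counts` dictionary standing in for the corpus (the right side is symmetric):

```python
def candidates_with_backoff(words_before, ngram_counts, max_order=5):
    """Try the highest n-gram order first; on failure, decrease the
    order repeatedly, stopping at bigrams (one context word)."""
    for order in range(max_order, 1, -1):
        ctx = tuple(words_before[-(order - 1):])
        if len(ctx) != order - 1:
            continue  # not enough recognized words for this order
        found = {g[-1]: c for g, c in ngram_counts.items()
                 if len(g) == order and g[:-1] == ctx}
        if found:
            return order, found
    return 1, {}  # even bigrams failed: this branch is discarded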
10. Methodology (4)
4. The process ends when one of the following
situations is reached:
The number of words or characters exceeds the estimated word or character count – the branches are too long to be valid and can be discarded;
A left-side branch matches at some point a right-side
branch, identifying a valid candidate for the missing words.
The left-side branch has reached an end-of-sentence mark-up (</S>) AND the right-side one has reached a beginning-of-sentence mark-up (<S>). At this point a “partial match” has been obtained, which contains a possibly unrecoverable gap inside it.
– If the added size of the branches fits within the estimated character and word counts, this yields a valid candidate.
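The stopping conditions above could be read as a small classifier. All names here are illustrative, and the "branches meet on a shared word" test is a simplification of the slide's matching rule:

```python
def branch_status(left, right, max_words, max_chars):
    """Classify a left/right branch pair per the termination rules."""
    words = left + right
    n_chars = sum(len(w) for w in words) + max(0, len(words) - 1)
    if len(words) > max_words or n_chars > max_chars:
        return "discard"        # too long to be a valid filling
    if left and right and left[-1] == "</S>" and right[0] == "<S>":
        return "partial-match"  # possibly unrecoverable gap inside
    if left and right and left[-1] == right[0]:
        return "match"          # the branches meet on a shared word
    return "continue"
```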
11. Encountered Problems
• A branch has no continuation possibility:
– Decrease the order of n-grams;
– If already at bigram order, the branch is discarded.
• A very large number of candidates is generated for each possible word (n^min candidates, where n is the number of candidates per word and min the minimum word count)
– The candidates have to be filtered out!
12. Candidates Filtering Heuristics (1)
• POS-based: a heuristic that predicts the POS of the missing words and discards the words that do not have the predicted POS (TreeTagger).
• Semantics-based: discard the branches that do not contain words related to the rest of the document (based on lexical chains built using WordNet).
• Frequency-based: prefer the branches with higher scores for the n-grams in the corpus.
• Based on these heuristics, scores are computed for every word added to a branch.
13. Candidates Filtering Heuristics (2)
• These values are then combined in order to provide an overall score for the branch.
• A further heuristic is used: the distance to the nearest end of the gap determines the weight of each word's scores (the error propagates from the ends toward the middle of the gap).
• Finally, the obtained scores are normalized with respect to the number of words in the branch, and the results are ordered according to this final score.
• The branch with the highest score is used.
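One way to read the two scoring slides as code is sketched below. The distance weighting and the way the three per-word scores are combined are our illustrative choices, since the slides do not give exact formulas:

```python
def branch_score(word_scores):
    """word_scores: per-word (pos, sem, freq) score tuples, ordered
    from the nearest gap end toward the middle. Words closer to an
    end weigh more, since the error propagates from the ends toward
    the middle of the gap."""
    total = 0.0
    for dist, (pos, sem, freq) in enumerate(word_scores):
        weight = 1.0 / (1 + dist)   # illustrative distance weighting
        total += weight * (pos + sem + freq)
    # Normalize with respect to the number of words in the branch.
    return total / len(word_scores)
```

Branches are then sorted by this normalized score and the highest-scoring one is kept.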
14. Experiments (1)
• Starting from full documents, some parts are removed in order to simulate gaps (http://en.wikipedia.org/wiki/Literature).
• ”An even more narrow interpretation is that
(<gap>) text have a physical form, ...”
• TreeTagger (word, POS, lemma):
– “An DT an even RB even more RBR more narrow JJ
narrow <gap> NN <unknown> text NN text have VBP
have a DT a physical JJ physical form NN form , , ,”
15. Experiments (2)
• The estimated word count was established to be 3.
• The 5-grams starting with “an even more narrow” are investigated
– none found. 4-grams and then trigrams are investigated
considering “more narrow” – 168 hits are found.
• The results containing symbols, punctuation marks or words with fewer than 256 appearances in the corpus are filtered out – 22 results remain. The top 6 are:
– [3] and [4816] [ CC : 0.527744] [-1]
– [3] approach [399] [ NN : 0.885605] [5]
– [3] as [372] [ IN : 0.829617] [-1]
– [3] definition [1934] [ NN : 1.221063] [1]
– [3] focus [2276] [ NN : 1.057171] [11]
– [3] interpretation [583] [ NN : 1.221063] [4]
Legend: [number of remaining words] candidate [n-gram frequency] [POS : probability of the POS n-gram] [semantic relevance; -1 = not filtered out by this criterion]
16. Experiments (3)
• Thresholds: frequency: 308, POS score: 0.883849 and semantic
relevance: 4.
• Remaining candidates: “approach”, “focus”, “interpretation”,
“range”, “sense”, and “view”.
• The process continues with each of them until either no n-grams are found to continue, the maximum depth is reached, or a possible solution is encountered.
17. Results (1)
• For the presented gap (interpretation is that), the results were:
• An even more narrow <gap> is that text have a physical form
– Missing word(s): interpretation.
– Results: approach [399][NN], view [754][NN], focus [2276][NN], interpretation [583][NN] and sense [1346][NN].
18. Results (2)
• “for scientific instruction, yet <gap> remain too technical to sit well
in most programmes”
– Missing word(s): they.
– Results: still [210782][RB] and they [418129][PP].
• “and often have a primarily utilitarian purpose: <gap> data or
convey immediate information.”
– Missing word(s): to record.
– Results: over 50 results, the closest results being: to [62786][TO] -
present [6934][JJ], to [62786][TO] - share [5828][NN], to [62786][TO] -
gain [7704][NN], to [62786][TO] - study [5423][NN], to [62786][TO] -
test [3854][NN], to [62786][TO] - order [4641][NN], to [62786][TO] -
move [8527][NN], to [62786][TO] - process [3899][NN], to [62786]
[TO] - control [4081][NN] and to [62786][TO] - access [3631][NN].
19. Conclusions
• The application didn’t achieve the expected results.
• N-grams: not very helpful – coverage rates: 5-grams:
15%, 4-grams: 30%, trigrams: 60%, bigrams: 90% (our
assumption regarding the corpus was wrong).
• Small variations of the thresholds of each of the
considered heuristics can lead to massive filtering.
• The best heuristic seems to be the one based on POS.