Towards a Higher Accuracy of Optical Character
Recognition of Chinese Rare Books by
Making Use of Text Models
Hsiang-An Wang
Academia Sinica
Center for Digital Cultures
Ink Bleed and Poor Quality
Limitations (Missing and Extra Words)
(Figure: side-by-side examples of OCR output and the corresponding original page image)
Experiment: Data Collection
• Training dataset: 187 ancient medicine books from
the Scripta Sinica Database (about 40 million words)
• Testing dataset: 1 relevant ancient medicine book,
titled “ ”, with a total of 185,000 words
• The OCR results contain about 180,000 correct words
and about 5,000 incorrect words, i.e. an accuracy of
about 97.3% (180,000 / 185,000 ≈ 0.973)
Experiment: Building an N-gram Model
• Relies on the sequence of words in the training
dataset: given the preceding words, the model
outputs the candidate with the highest frequency
(see the sketch after this list).
• " "
– 2-gram: input to predict " "
– 3-gram: input to predict " "
– 4-gram: input to predict " "
– ...
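The slides elide the Chinese examples, so the following is only a minimal character-level sketch of how such a frequency-based N-gram predictor could be built; the function names and data layout are assumptions, not the authors' code.

```python
from collections import Counter, defaultdict

def build_ngram_model(corpus, n):
    """For every (n-1)-character context, count how often each
    character follows it in the training text."""
    model = defaultdict(Counter)
    for line in corpus:
        for i in range(len(line) - n + 1):
            context, nxt = line[i:i + n - 1], line[i + n - 1]
            model[context][nxt] += 1
    return model

def predict(model, context):
    """Return the highest-frequency character seen after `context`,
    or None if the context never occurred in training."""
    counts = model.get(context)
    return counts.most_common(1)[0][0] if counts else None

# Usage: a 4-gram model predicts the 4th character from the 3 before it.
model = build_ngram_model(["the cat sat on the mat"], n=4)
print(predict(model, " ca"))  # -> 't'
```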
Experiment: Building a
Backward and Forward N-gram Model
• Relies on the sequences of backward and forward
words in the training dataset, again picking the
highest-frequency output.
• The backward and forward N-grams are kept as two
separate sets; a correction is adopted only when both
directions predict the same word (see the sketch
after this list).
• " "
– Backward 4-gram: input to predict " "
– Forward 4-gram: input to predict " "
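Under our reading of the agreement rule above, the two directions might be combined as in this minimal sketch, reusing the hypothetical `build_ngram_model` and `predict` helpers from the previous sketch:

```python
def build_bf_models(corpus, n=4):
    """Forward model: predict a character from the n-1 characters
    before it. Backward model: the same construction on the reversed
    text, i.e. predict a character from the n-1 characters after it."""
    forward = build_ngram_model(corpus, n)
    backward = build_ngram_model([line[::-1] for line in corpus], n)
    return forward, backward

def bf_predict(forward, backward, before, after):
    """Adopt a prediction only when both directions agree: our
    reading of the agreement rule on this slide."""
    f = predict(forward, before)        # e.g. the 3 characters before the slot
    b = predict(backward, after[::-1])  # the 3 characters after, reversed
    return f if f is not None and f == b else None
```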
Experiment: Building an LSTM Model
• Used Word2vec to project the text into a
200-dimensional vector space
• Used an LSTM network with three layers
• Picked the word with the highest softmax score
as the prediction (see the sketch after this list)
• " "
– LSTM 2-gram: input to predict " "
– LSTM 3-gram: input to predict " "
– LSTM 4-gram: input to predict " "
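The slides specify Word2vec embeddings (200 dimensions), a three-layer LSTM, and a softmax output, but no code; below is a minimal PyTorch sketch under those assumptions (the hidden size, vocabulary size, and embedding initialization are guesses):

```python
import torch
import torch.nn as nn

class LstmPredictor(nn.Module):
    """Three-layer LSTM over 200-dim embeddings; the softmax over
    the vocabulary scores each candidate next character."""
    def __init__(self, vocab_size, embed_dim=200, hidden_dim=256):
        super().__init__()
        # In practice the embedding weights would be initialized from a
        # pretrained Word2vec model (e.g. gensim Word2Vec, vector_size=200).
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=3,
                            batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):                     # x: (batch, seq_len) char ids
        h, _ = self.lstm(self.embed(x))
        logits = self.out(h[:, -1, :])        # last time step only
        return torch.softmax(logits, dim=-1)

# Usage: an "LSTM 4-gram" feeds the 3 preceding characters and takes
# the highest-probability character from the softmax output.
model = LstmPredictor(vocab_size=8000)
probs = model(torch.tensor([[17, 4, 952]]))   # hypothetical character ids
pred = probs.argmax(dim=-1)
```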
The Modification of the Correctness Rate
by the N-gram Model
• The 7-gram achieves the best correction rate
The Modification of the Correctness Rate by the
Backward and Forward N-gram Model
• The backward and forward 4-gram achieves the
best correction rate
The Modification of the Correctness Rate
by the LSTM Model
• The LSTM 6-gram achieves the best correction rate
Comparison of the 7-gram, LSTM 6-gram,
and BF 4-gram Text Models

Model        | Correct OCR → wrong | Incorrect OCR → right | Accuracy
OCR          | X                   | X                     | 97.30%
7-gram       | 0.35%               | 13.06%                | 97.49%
LSTM 6-gram  | 0.1%                | 7.33%                 | 97.5%
BF 4-gram    | 0.08%               | 9.54%                 | 97.57%

(Columns: the rate at which correct OCR results are changed to wrong,
the rate at which incorrect OCR results are changed to right, and the
overall accuracy of OCR combined with the text model.)

• The backward and forward 4-gram has the best
performance, with the lowest modification error rate
and the highest resulting accuracy
Three Text Models with
OCR Top-5 Candidate Words
• The OCR software we use is a convolutional neural
network model that computes classification
probabilities through a softmax function
• When the probability of the OCR Top-1 candidate is
below 95%, the word is judged as possibly wrong and
the mixed model is applied
• Among the OCR Top-5 candidate words, pick the one
with the highest text-model score (see the sketch
below)
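A minimal sketch of this decision rule as we read it from the slide: keep the OCR Top-1 word when its probability reaches 95%, otherwise rescore the Top-5 candidates with the text model. The data structures and the scoring callback are assumptions:

```python
def correct_char(ocr_top5, text_model_score):
    """ocr_top5: list of (char, prob) sorted by OCR probability.
    text_model_score(char): the text model's score for a candidate,
    e.g. an N-gram frequency or an LSTM softmax probability."""
    top1_char, top1_prob = ocr_top5[0]
    if top1_prob >= 0.95:                 # OCR is confident: keep Top 1
        return top1_char
    # Otherwise pick the Top-5 candidate the text model scores highest.
    return max((c for c, _ in ocr_top5), key=text_model_score)

# Usage with hypothetical candidates and a stub scoring function:
candidates = [("A", 0.62), ("B", 0.21), ("C", 0.09), ("D", 0.05), ("E", 0.03)]
print(correct_char(candidates, lambda c: {"B": 120, "A": 40}.get(c, 0)))  # -> "B"
```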
Comparison of the Three Text Models
Mixed with the OCR Probability

Model        | Correct OCR → wrong | Incorrect OCR → right | Accuracy
OCR          | X                   | X                     | 97.30%
7-gram       | 0.012%              | 9%                    | 97.63%
LSTM 6-gram  | 0.13%               | 16%                   | 97.71%
BF 4-gram    | 0.009%              | 5.92%                 | 97.55%

(Columns as in the previous table.)

• The LSTM 6-gram mixed with the OCR probability
has the best performance
Conclusion: Using a Text Model
• The N-gram, backward and forward N-gram, and
LSTM N-gram text models can all increase the
accuracy of OCR
• The backward and forward 4-gram model has the
lowest modification error rate and the highest
resulting accuracy
Conclusion: Mixing Text Models with
the OCR Probability
• Combining the OCR Top-5 candidate words and the
Top-1 probability with a text model achieves better
results than using a text model alone
• Mixing the LSTM 6-gram with the OCR probability
gives the highest accuracy
Thank you for listening
