Towards a Higher Accuracy of Optical
Character Recognition of Chinese Rare
Books in Making Use of Text Model
Hsiang-An Wang
Academia Sinica
Center for Digital Cultures
Ink Bleed and Poor Quality
Limitation (Missing and Extra Word)
[Figure: two side-by-side pairs of OCR output and original text]
Experiment: Data Collection
• Training dataset: 187 ancient medicine books
from the Scripta Sinica Database (about 40
million words)
• Testing dataset: 1 related ancient medicine
book named “ ”, with a total of
185,000 words
• The OCR output contains about 180,000 correct
words and about 5,000 incorrect words, so the
accuracy is about 97.3%
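The accuracy figure follows directly from the approximate word counts quoted on this slide. A minimal check, using only those rounded counts (the exact counts are not given in the slides):

```python
# Approximate counts quoted on the slide (rounded, not exact figures)
total_words = 185_000
incorrect_words = 5_000
correct_words = total_words - incorrect_words  # about 180,000

accuracy = correct_words / total_words
print(f"OCR accuracy: {accuracy:.1%}")
```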
Experiment: Building an N-gram Model
• The model relies on the sequences of words in the
training dataset: for a given input, we pick the
output with the highest frequency.
• " "
– 2-gram: input to predict " "
– 3-gram: input to predict " "
– 4-gram: input to predict " "
– ...
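The frequency-based prediction described on this slide can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the function names `train_ngram` and `predict_next` are hypothetical, and the toy corpus uses placeholder tokens rather than data from the experiment.

```python
from collections import Counter, defaultdict

def train_ngram(tokens, n):
    """Count which token follows each (n-1)-token context."""
    table = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        table[context][tokens[i + n - 1]] += 1
    return table

def predict_next(table, context):
    """Return the most frequent follower of the context, or None if unseen."""
    followers = table.get(tuple(context))
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Toy corpus of placeholder tokens (hypothetical, for illustration only)
corpus = list("ABCABCABD")
model = train_ngram(corpus, 3)
print(predict_next(model, ["A", "B"]))
```

In this toy corpus "C" follows the context ("A", "B") twice and "D" once, so the 3-gram lookup picks "C". A larger n makes the context more specific but also more likely to be unseen in the training data.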
Experiment: Building a
Backward and Forward N-gram Model
• The model relies on the sequences of backward and forward
words in the training dataset: for a given input, we pick
the output with the highest frequency.
• Because the backward and forward N-grams are kept as two
separate sets of N-grams, the model can be used when the
same word is found afterwards.
• " "
– Backward 4-gram: input to predict " "
– Forward 4-gram: input to predict " "
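One way to read the two-table idea is sketched below. Several things here are assumptions rather than details from the slides: the helper names are hypothetical, one table is keyed by the words before a position and the other by the words after it (the slides do not say which of these is called "backward" and which "forward"), and a correction is accepted only when both tables propose the same word.

```python
from collections import Counter, defaultdict

def train_bidirectional(tokens, n):
    """Build two tables: one keyed by the n-1 tokens before a position,
    one keyed by the n-1 tokens after it."""
    k = n - 1
    from_left = defaultdict(Counter)
    from_right = defaultdict(Counter)
    for i, target in enumerate(tokens):
        if i >= k:
            from_left[tuple(tokens[i - k:i])][target] += 1
        if i + k < len(tokens):
            from_right[tuple(tokens[i + 1:i + 1 + k])][target] += 1
    return from_left, from_right

def best(table, context):
    """Most frequent target for the context, or None if the context is unseen."""
    counts = table.get(tuple(context))
    return counts.most_common(1)[0][0] if counts else None

def correct(tokens, i, from_left, from_right, n):
    """Assumed rule: replace tokens[i] only when both directions agree."""
    k = n - 1
    left = best(from_left, tokens[max(0, i - k):i])
    right = best(from_right, tokens[i + 1:i + 1 + k])
    if left is not None and left == right:
        return left
    return tokens[i]

# Toy corpus of placeholder tokens (hypothetical, for illustration only)
corpus = list("ABCDEABCDEABCDE")
from_left, from_right = train_bidirectional(corpus, 3)
print(correct(list("ABXDE"), 2, from_left, from_right, 3))
```

Requiring agreement between the two directions is a conservative choice: it makes fewer corrections, but it is less likely to change a word that the OCR already recognized correctly.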
Experiment: Building an LSTM Model
• Used Word2vec to project text into a 200-dimensional
vector space
• Used an LSTM neural network with three layers
• Picked the word with the highest score in the softmax
layer as the prediction
• " "
– LSTM 2-gram: input to predict " "
– LSTM 3-gram: input to predict " "
– LSTM 4-gram: input to predict " "
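The sketch below does not implement Word2vec or the LSTM itself. It only illustrates, in plain Python, the two steps around the network: how (input, target) pairs for an "LSTM n-gram" might be cut from a token sequence, and how the final word is chosen from softmax scores. The function names, vocabulary, and raw scores are hypothetical.

```python
import math

def make_windows(tokens, n):
    """Split a token sequence into (context, target) training pairs."""
    return [(tokens[i:i + n - 1], tokens[i + n - 1])
            for i in range(len(tokens) - n + 1)]

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pick_word(vocab, scores):
    """Return the vocabulary word with the highest softmax probability."""
    probs = softmax(scores)
    idx = max(range(len(vocab)), key=probs.__getitem__)
    return vocab[idx], probs[idx]

# Placeholder tokens and made-up output scores (not from the experiment)
pairs = make_windows(list("ABCD"), 3)
word, prob = pick_word(["A", "B", "C"], [0.1, 2.0, 0.5])
print(pairs, word)
```

In a real setup the raw scores would come from the last layer of the trained network, with one score per vocabulary word.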
The Modification of Correctness Rate
in N-gram Model
• The 7-gram model achieves the best correction rate
The Modification of Correctness Rate in
Backward and Forward N-gram Model
• The Backward and Forward 4-gram model achieves
the best correction rate
The Modification of Correctness Rate
in LSTM Model
• The LSTM 6-gram model achieves the best correction
rate
Comparison of 7-gram, LSTM 6-gram
and BF 4-gram Text Models

Model        | Correct OCR results changed to wrong | Incorrect OCR results changed to right | Accuracy of OCR plus text model
OCR          | X      | X      | 97.30%
7-gram       | 0.35%  | 13.06% | 97.49%
LSTM 6-gram  | 0.1%   | 7.33%  | 97.5%
BF 4-gram    | 0.08%  | 9.54%  | 97.57%

• Backward and Forward (BF) 4-gram has the best
performance, with the lowest rate of changing correct
OCR results to wrong ones and the highest overall accuracy
Three Text models with
OCR Top 5 Candidate Words
• The OCR software we use is a convolutional neural
network model that calculates the classification
probability through a softmax function
• When the probability of the OCR Top 1 candidate is lower
than 95%, the word is considered possibly wrong and the
mixed model is used
• Pick the word with the highest text-model score that
also appears among the OCR Top 5 candidate words
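The mixing rule above can be sketched as a small function. The name `mixed_correction`, the data layout, and the fallback to the OCR Top 1 word when no candidate has a text-model score are assumptions made for illustration; only the 95% threshold and the Top 5 restriction come from the slide.

```python
def mixed_correction(ocr_candidates, text_model_scores, threshold=0.95):
    """ocr_candidates: list of (word, probability) pairs, best first.
    text_model_scores: dict mapping word -> text-model score."""
    top_word, top_prob = ocr_candidates[0]
    if top_prob >= threshold:
        return top_word  # OCR is confident enough; keep its answer
    # Only consider words that are both in the OCR Top 5 and scored by the text model
    scored = [(text_model_scores[w], w)
              for w, _ in ocr_candidates[:5] if w in text_model_scores]
    if not scored:
        return top_word  # assumed fallback: keep the OCR Top 1 word
    return max(scored, key=lambda pair: pair[0])[1]

# Placeholder candidates and scores (hypothetical, not from the experiment)
candidates = [("A", 0.60), ("B", 0.30), ("C", 0.10)]
scores = {"B": 0.7, "C": 0.2, "D": 0.9}
print(mixed_correction(candidates, scores))
```

Here "D" has the highest text-model score but is not among the OCR candidates, so the rule picks "B". Restricting the choice to the OCR candidates keeps the text model from proposing words that look nothing like the scanned glyph.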
Comparison of Three Text Models
Mixed with the Probability of OCR

Model        | Correct OCR results changed to wrong | Incorrect OCR results changed to right | Accuracy of OCR plus text model
OCR          | X      | X      | 97.30%
7-gram       | 0.012% | 9%     | 97.63%
LSTM 6-gram  | 0.13%  | 16%    | 97.71%
BF 4-gram    | 0.009% | 5.92%  | 97.55%

• LSTM 6-gram mixed with the probability of OCR
has the best performance
Conclusion: Using Text Model
• N-gram, backward and forward N-gram, and LSTM N-gram
text models can each increase the accuracy of OCR
• The Backward and Forward 4-gram model has the lowest
rate of changing correct results to wrong ones and the
highest overall accuracy
Conclusion: Mixing Text Models with
the Probability of OCR
• By combining the OCR Top 5 candidate words and the
probability of the Top 1 candidate with a text model, we
can achieve better results than with a text model only
• Mixing the LSTM 6-gram with the probability of the
OCR model gives the highest accuracy
Thank you for listening

