Towards a Higher Accuracy of Optical
Character Recognition of Chinese Rare
Books in Making Use of Text Model
Hsiang-An Wang
Academia Sinica
Center for Digital Cultures
Ink Bleed and Poor Quality
Limitation (Missing and Extra Words)
[Figure: two side-by-side comparisons of the OCR output and the original text]
Experiment: Data Collection
• Training dataset: 187 ancient medical books from the Scripta Sinica database (about 40 million words)
• Testing dataset: one related ancient medical book named “ ”, with a total of about 185,000 words
• The OCR output contains about 180,000 correct words and about 5,000 incorrect words, which gives a correct rate of about 97.3%
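The baseline figure can be checked with a few lines of arithmetic. This is only a sanity check on the rounded word counts quoted above; the variable names are illustrative and not taken from the original work.

```python
# Approximate word counts quoted on the slide.
correct_words = 180_000
incorrect_words = 5_000

total_words = correct_words + incorrect_words
accuracy = correct_words / total_words

# Report the baseline OCR accuracy as a percentage with one decimal place.
print(f"total = {total_words}, accuracy = {accuracy:.1%}")
```

With these rounded counts the ratio comes out at roughly 97.3%, which matches the baseline used in the later comparisons.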
Experiment: Building an N-gram Model
• The model relies on the word sequences in the training dataset: given the preceding words, we pick the output with the highest frequency.
• " "
– 2-gram: input to predict " "
– 3-gram: input to predict " "
– 4-gram: input to predict " "
– ...
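A frequency-based N-gram predictor of this kind can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `train_ngram` and `predict_next` are hypothetical, and the toy corpus uses Latin letters as stand-ins for the characters of the training text.

```python
from collections import Counter, defaultdict

def train_ngram(tokens, n):
    """Count which token follows each (n-1)-token context."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        target = tokens[i + n - 1]
        model[context][target] += 1
    return model

def predict_next(model, context):
    """Return the most frequent continuation, or None for an unseen context."""
    counts = model.get(tuple(context))
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Toy corpus (assumption): each letter stands in for one character.
corpus = list("abcabcabd")
trigram = train_ngram(corpus, 3)

# After "a", "b" the corpus continues with "c" twice and "d" once.
prediction = predict_next(trigram, ["a", "b"])
print(prediction)
```

Larger values of N use a longer context, which makes predictions more specific but leaves more contexts unseen in the training data.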
Experiment: Building a
Backward and Forward N-gram Model
• The model relies on the sequences of words before and after the target position in the training dataset; in each direction we pick the output with the highest frequency.
• The backward and forward N-grams are kept as two separate sets, and the model is applied when both then yield the same word.
• " "
– Backward 4-gram: input to predict " "
– Forward 4-gram: input to predict " "
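One way to read the two-directional model is sketched below. Several points are assumptions rather than facts from the slides: the names `train_directional`, `best` and `correct` are hypothetical, the slides do not say which direction is called "backward", and the rule "replace only when both directions agree" is an interpretation of the second bullet above.

```python
from collections import Counter, defaultdict

def train_directional(tokens, n):
    """Build two separate count tables: one keyed by the n-1 tokens
    before a position, one keyed by the n-1 tokens after it."""
    k = n - 1
    from_left, from_right = defaultdict(Counter), defaultdict(Counter)
    for i, target in enumerate(tokens):
        if i >= k:
            from_left[tuple(tokens[i - k:i])][target] += 1
        if i + k < len(tokens):
            from_right[tuple(tokens[i + 1:i + 1 + k])][target] += 1
    return from_left, from_right

def best(table, context):
    """Most frequent token for a context, or None if the context is unseen."""
    counts = table.get(tuple(context))
    return counts.most_common(1)[0][0] if counts else None

def correct(tokens, i, from_left, from_right, n):
    """Replace tokens[i] only when both directions agree (assumed rule)."""
    k = n - 1
    if i < k or i + k >= len(tokens):
        return tokens[i]  # not enough context on one side
    left_guess = best(from_left, tokens[i - k:i])
    right_guess = best(from_right, tokens[i + 1:i + 1 + k])
    if left_guess is not None and left_guess == right_guess:
        return left_guess
    return tokens[i]

# Toy data (assumption): letters stand in for characters; "x" is an OCR error.
training = list("abcdeabcdeabcde")
from_left, from_right = train_directional(training, 3)
fixed = correct(list("abxde"), 2, from_left, from_right, 3)
print(fixed)
```

Requiring agreement makes the model conservative: it changes fewer words, which is consistent with a low rate of correct words being turned into wrong ones.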
Experiment: Building an LSTM Model
• Used Word2vec to project the text into a 200-dimensional vector space
• Used a three-layer LSTM network
• Picked the word with the highest score in the softmax layer as the prediction
• " "
– LSTM 2-gram: input to predict " "
– LSTM 3-gram: input to predict " "
– LSTM 4-gram: input to predict " "
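The last step, picking the word with the highest softmax score, can be illustrated without a deep-learning library. This sketch covers only that final selection: the vocabulary entries and raw scores are made up, and in the real system the scores would come from the LSTM output layer.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_word(vocab, scores):
    """Return the vocabulary entry with the highest softmax probability."""
    probs = softmax(scores)
    best_index = max(range(len(probs)), key=probs.__getitem__)
    return vocab[best_index], probs[best_index]

# Hypothetical vocabulary and output-layer scores.
vocab = ["w1", "w2", "w3"]
scores = [1.0, 3.0, 0.5]
word, prob = predict_word(vocab, scores)
print(word, round(prob, 3))
```

Because softmax is monotonic, the selected word is the same one that has the highest raw score; the probabilities matter mainly when they are compared against a threshold.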
Correction Rate of the N-gram Model
• The 7-gram achieves the best correction rate
Correction Rate of the Backward and Forward N-gram Model
• The backward and forward 4-gram achieves the best correction rate
Correction Rate of the LSTM Model
• The LSTM 6-gram achieves the best correction rate
Comparison of 7-gram, LSTM 6-gram and BF 4-gram Text Models
(BF = backward and forward)

Model        Correct OCR results   Incorrect OCR results   Accuracy of OCR
             changed to wrong      changed to right        with the text model
OCR          X                     X                       97.30%
7-gram       0.35%                 13.06%                  97.49%
LSTM 6-gram  0.1%                  7.33%                   97.5%
BF 4-gram    0.08%                 9.54%                   97.57%

• The backward and forward 4-gram performs best, with the lowest rate of correct results changed to wrong (0.08%) and the highest overall accuracy (97.57%)
Three Text Models with
OCR Top 5 Candidate Words
• The OCR software we use is a convolutional neural network model that calculates the classification probability with a softmax function
• When the probability of the OCR Top 1 candidate is lower than 95%, the word is treated as possibly wrong and the mixed model is used
• The mixed model picks the word with the highest text-model score that also appears among the OCR Top 5 candidate words
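The mixing rule above can be sketched as a small decision function. The function name `mixed_correction`, the data layout and the fallback to the OCR Top 1 word when no Top 5 candidate has a text-model score are assumptions made for illustration; the slides only state the 95% threshold and the Top 5 restriction.

```python
def mixed_correction(ocr_candidates, text_scores, threshold=0.95):
    """Choose a word from OCR candidates and text-model scores.

    ocr_candidates: list of (word, probability), sorted by descending probability.
    text_scores: dict mapping a word to its text-model score.
    """
    top_word, top_prob = ocr_candidates[0]
    if top_prob >= threshold:
        return top_word  # OCR is confident enough, keep its answer
    top5 = [word for word, _ in ocr_candidates[:5]]
    scored = [word for word in top5 if word in text_scores]
    if not scored:
        return top_word  # assumed fallback: no candidate known to the text model
    return max(scored, key=text_scores.__getitem__)

# Hypothetical OCR output for one uncertain character.
candidates = [("A", 0.60), ("B", 0.30), ("C", 0.05), ("D", 0.03), ("E", 0.02)]
# Hypothetical text-model scores; "Z" scores highest but is not an OCR candidate.
scores = {"A": 0.2, "B": 0.7, "Z": 0.9}

chosen = mixed_correction(candidates, scores)
print(chosen)
```

Restricting the choice to the OCR Top 5 keeps the text model from substituting a word that is plausible in context but visually unrelated to the scanned character.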
Comparison of Three Text Models
Mixed with the Probability of OCR
(BF = backward and forward)

Model        Correct OCR results   Incorrect OCR results   Accuracy of OCR
             changed to wrong      changed to right        with the text model
OCR          X                     X                       97.30%
7-gram       0.012%                9%                      97.63%
LSTM 6-gram  0.13%                 16%                     97.71%
BF 4-gram    0.009%                5.92%                   97.55%

• The LSTM 6-gram mixed with the OCR probability performs best, with the highest overall accuracy (97.71%)
Conclusion: Using Text Model
• An N-gram, backward and forward N-gram, or LSTM N-gram text model can increase the accuracy of OCR
• The backward and forward 4-gram model has the lowest rate of modification errors and the highest accuracy
Conclusion: Mixing Text Models with
the Probability of OCR
• By combining the OCR Top 5 candidate words and the Top 1 probability with the text model, we can achieve better results than with the text model alone
• Mixing the LSTM 6-gram with the OCR probability gives the highest accuracy
Thank you for listening

