Construction of a Text Digitization System
for Nôm Historical Documents
Truyen Van PHAN and Masaki NAKAGAWA
Tokyo University of Agriculture & Technology (TUAT), Japan
Construction of a Text Digitization System for Nôm Historical Documents, May 20th, 2014
Outline
Introduction
What is Nôm?
Current situation and our motivation
What do we aim at?
Page Layout Analysis
Offline Recognition System
Generating Artificial Character Patterns
Building and Improving Large Set Character Recognition
Experiments and Results
GUI of Digitization System
Conclusion
Future Work
What is Nôm?
Nôm characters
• 10th century ~ 20th century
• Based on Chinese characters
Vietnamese alphabet (Quốc Ngữ)
• 20th century ~ present
• Based on the Roman alphabet
2 categories of Nôm
• Borrowed characters: Hán (classical Chinese)
• Invented characters: native Nôm
[Figure: the sentence "My mother eats vegetarian food at the temple every Sunday" written in Nôm and in Quốc Ngữ; src: wikipedia]
Current situation and our motivation
• Current situation of Nôm
  - Completely replaced by Quốc Ngữ.
  - Fewer than 100 scholars worldwide can read Nôm.
  - More than 90% of Nôm documents have not been translated into Quốc Ngữ.
• Digitization Project of the Hán Nôm Special Collection
  - Has scanned ~5,200 documents.
  - Provides online access to 1,907 documents with 133,495 pages.
http://nom.nlv.gov.vn/
What do we aim at?
• Construct a digitization system that enables people who are not proficient in Nôm to build a digital text library of Nôm documents.
  - Provide a set of document image processing methods: preprocessing, binarization, character segmentation.
  - Provide a character recognition system.
  - Provide a user interface that enables an operator to verify the processed results.
• Lay a foundation for a digitization system for future research and development.
Overview of Our System
[Figure: block diagram of the system.
 Blocks: Document Images, Preprocessing, Segmentation, Pattern, Normalization, Normalized Pattern, Feature Extraction, Classification, Dictionary, Training, Artificial Pattern, Clustering, Labeling, OCR, Document Texts.
 Groups: Page Layout Analysis, Document Digitization, Pattern Collection, Character Recognition, Grouping.]
Page Layout Analysis (1/2)
• Preprocessing
  - Red Comment Removal
  - Black Margin Removal
  - Line and Noise Removal
• Binarization
  - 1 local thresholding method (Su's)
  - 16 global thresholding methods (Otsu's, SIS,…)
• Character Segmentation
  - Top-down method: recursive XY cut (RXY cut)
  - Bottom-up method: Voronoi
  - Combined method: RXY cut + Voronoi
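As an illustration of one of the global thresholding methods named above, here is a minimal sketch of Otsu's method in plain Python. It is not the authors' implementation: the function names `otsu_threshold` and `binarize`, the 8-bit gray-level range, and the convention that dark pixels are ink are all assumptions made for this example.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0..255) that maximizes between-class variance.

    Pixels with value <= t form the dark class, pixels > t the bright class.
    `pixels` is a flat list of integer gray levels (assumed 8-bit).
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class so far
    sum0 = 0  # sum of gray levels in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # no dark pixels yet
        w1 = total - w0
        if w1 == 0:
            break  # every pixel is already in the dark class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor 1 / total**2.
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image, t):
    """Map a gray image (list of rows) to 1 = ink (dark), 0 = background."""
    return [[1 if p <= t else 0 for p in row] for row in image]


if __name__ == "__main__":
    page = [[10, 10, 200], [200, 10, 200]]
    t = otsu_threshold([p for row in page for p in row])
    print(t, binarize(page, t))
```

A global method like this uses one threshold for the whole page; a local method such as the one attributed to Su above adapts the threshold to each neighborhood, which matters for unevenly degraded pages.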
Page Layout Analysis (2/2)
[Figure: processing pipeline from Document Image to Character Images through Red Comment Removal, Black Margin Removal, Line and Noise Removal, Binarization, and Segmentation]
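The segmentation stage at the end of this pipeline can be illustrated with a minimal sketch of a top-down recursive XY cut on a binarized image. This is a generic textbook-style version, not the authors' code: the names `ink_spans` and `xy_cut`, the `min_gap` parameter, and the 1 = ink convention are assumptions for the example, and the Voronoi-based bottom-up step is not shown.

```python
def ink_spans(profile, min_gap):
    """Split a projection profile into (start, end) spans that contain ink,
    separated by at least `min_gap` consecutive empty positions."""
    spans, start, end, gap = [], None, 0, 0
    for i, value in enumerate(profile):
        if value > 0:
            if start is None:
                start = i
            end = i + 1
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                spans.append((start, end))
                start = None
    if start is not None:
        spans.append((start, end))
    return spans


def xy_cut(img, top, bottom, left, right, min_gap=1, vertical=True):
    """Recursively cut region [top:bottom, left:right] of a binary image.

    Returns leaf boxes as (top, bottom, left, right). `vertical=True` means
    the first cut is attempted between columns, otherwise between rows.
    """
    box = (top, bottom, left, right)
    for vert in (vertical, not vertical):
        if vert:  # project onto the x axis: ink count per column
            profile = [sum(img[r][c] for r in range(top, bottom))
                       for c in range(left, right)]
            parts = [(top, bottom, left + s, left + e)
                     for s, e in ink_spans(profile, min_gap)]
        else:     # project onto the y axis: ink count per row
            profile = [sum(img[r][c] for c in range(left, right))
                       for r in range(top, bottom)]
            parts = [(top + s, top + e, left, right)
                     for s, e in ink_spans(profile, min_gap)]
        if not parts:
            return []  # region contains no ink
        if parts != [box]:
            leaves = []
            for t, b, l, r in parts:
                leaves.extend(xy_cut(img, t, b, l, r, min_gap, not vert))
            return leaves
    return [box]  # no cut possible in either direction


if __name__ == "__main__":
    img = [
        [1, 1, 0, 0, 1],
        [1, 1, 0, 0, 1],
        [0, 0, 0, 0, 0],
        [1, 0, 0, 0, 0],
    ]
    print(xy_cut(img, 0, 4, 0, 5))
```

Starting with column cuts suits vertically written pages; a larger `min_gap` keeps strokes of one character together at the cost of merging tightly spaced characters.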
Offline Recognition System
• Generate a database of artificial character patterns.
  - No dataset of Nôm characters with ground truth is available.
• Build an offline recognition engine.
  - Use the MQDF2 recognition method.
• Improve large-scale character recognition.
  - Use GLVQ and kd-tree in coarse classification.
Generating Artificial Patterns
• From 27 CJKV fonts of Nôm, Japanese, Chinese.
• Use distortion models (Linear: Rotation, Shear, Shrink,…; and Non-linear).
• Generate 2 datasets:
  - Common 7,601 characters for segmented character recognition.
  - All 32,733 characters in Nôm fonts for recognized result verification.
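To make the linear distortion models concrete, here is a minimal sketch that applies rotation, shear, and shrink to a small binary character image by inverse mapping with nearest-neighbor sampling. The function names, the parameter ranges in `random_variants`, and the composition order (shear, then rotation, then scaling) are assumptions for illustration; the slides do not specify them, and the non-linear models are not shown.

```python
import math
import random


def affine_distort(img, angle_deg=0.0, shear=0.0, scale=1.0):
    """Return a distorted copy of a binary image (list of rows).

    The forward transform about the image center is
    M = scale * Rotation(angle) * Shear(shear); each output pixel is
    filled by looking up its source pixel through the inverse of M.
    """
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    m00 = scale * cos_a
    m01 = scale * (cos_a * shear - sin_a)
    m10 = scale * sin_a
    m11 = scale * (sin_a * shear + cos_a)
    det = m00 * m11 - m01 * m10  # equals scale ** 2, so nonzero for scale != 0

    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = x - cx, y - cy
            sx = (m11 * dx - m01 * dy) / det + cx
            sy = (-m10 * dx + m00 * dy) / det + cy
            ix, iy = int(round(sx)), int(round(sy))
            if 0 <= ix < w and 0 <= iy < h:
                out[y][x] = img[iy][ix]
    return out


def random_variants(img, n, seed=0):
    """Generate n distorted variants with small random parameters
    (the ranges below are hypothetical, chosen only for the example)."""
    rng = random.Random(seed)
    return [
        affine_distort(
            img,
            angle_deg=rng.uniform(-5.0, 5.0),
            shear=rng.uniform(-0.1, 0.1),
            scale=rng.uniform(0.9, 1.0),
        )
        for _ in range(n)
    ]


if __name__ == "__main__":
    glyph = [[1] * 8 for _ in range(8)]
    shrunk = affine_distort(glyph, scale=0.5)
    print(sum(map(sum, shrunk)), "ink pixels after shrinking")
```

In a setup like the one described, each font glyph would be rendered once and then passed through many such parameter draws to build the training patterns.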
Building Offline Recognition Engine
• Normalization: Line Density Projection Interpolation (LDPI) → 64 × 64 image
• Feature Extraction: Normalization-Cooperated Gradient Feature (NCGF) → 512 features
• Feature Reduction: Fisher Linear Discriminant Analysis (FLDA) → 100 features
• Coarse-to-fine Classification: k-NN (k candidates) → MQDF2
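The coarse-to-fine step can be sketched as follows: a cheap Euclidean search over one prototype per class selects k candidates, and a more expensive scorer is evaluated only on those candidates. This is a sketch under stated assumptions: the fine scorer below is a simple diagonal-covariance quadratic distance used as a stand-in, not MQDF2, and all function names are hypothetical.

```python
import math


def coarse_candidates(x, prototypes, k):
    """Return the k class labels whose prototypes are nearest to x.

    `prototypes` maps class label -> prototype vector (e.g. a class mean).
    """
    ranked = sorted(prototypes.items(), key=lambda item: math.dist(x, item[1]))
    return [label for label, _ in ranked[:k]]


def make_diag_quadratic(means, variances):
    """Build a stand-in fine scorer (lower is better).

    It uses per-dimension variances only; MQDF2 itself works with the
    principal eigenvectors of each class covariance and is not shown here.
    """
    def score(x, label):
        m, v = means[label], variances[label]
        return sum((xi - mi) ** 2 / vi + math.log(vi)
                   for xi, mi, vi in zip(x, m, v))
    return score


def classify(x, prototypes, fine_score, k=10):
    """Coarse-to-fine classification: k candidates, then the best fine score."""
    candidates = coarse_candidates(x, prototypes, k)
    return min(candidates, key=lambda label: fine_score(x, label))


if __name__ == "__main__":
    means = {"a": (0.0, 0.0), "b": (5.0, 5.0), "c": (10.0, 0.0)}
    variances = {label: (1.0, 1.0) for label in means}
    scorer = make_diag_quadratic(means, variances)
    print(classify((4.5, 4.0), means, scorer, k=2))
```

The design point is that the fine scorer runs on k classes instead of all of them, so recognition time depends mostly on how fast the coarse search is, which is what the GLVQ and kd-tree improvements target.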
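Improving Large Scale Character Recognition
• Improving coarse classification
  - Mean vector → learned prototype by GLVQ: accuracy
  - Ordered structure → space-partitioning structure of kd-tree: speed
Coarse classification over the class prototypes $w_1, w_2, \dots, w_C$:
$g(x) = \min_{1 \le i \le C} \lVert x - w_i \rVert$, where $\lVert x - w_i \rVert$ is the Euclidean distance.
Mean vector of the $n_i$ training patterns $x_{ik}$ of class $i$:
$w_i = \frac{1}{n_i} \sum_{k=1}^{n_i} x_{ik}$
Prototype update with learning rate $\alpha(t)$:
$w_i \leftarrow w_i \pm \alpha(t)\,(x - w_i)$
Ordered candidate list $c_1, c_2, \dots, c_k$: a prototype $w_j$ is inserted at the position where $d(x, c_i) < d(x, w_j) < d(x, c_{i+1})$.
[Figure: Generalized Learning Vector Quantization; src: wikipedia]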
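For readers unfamiliar with GLVQ, here is a minimal sketch of one stochastic update step in the commonly published form (Sato and Yamada), using squared Euclidean distance and a sigmoid loss. It is not taken from the slides: the function names, the learning rate `lr`, and the prototype layout are assumptions for the example, and constant factors are absorbed into `lr`.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def glvq_step(x, label, prototypes, lr=0.05):
    """Apply one GLVQ update for training sample (x, label), in place.

    `prototypes` is a list of [vector, class_label] pairs and must contain
    at least one prototype of the sample's class and one of another class.
    Returns the relative distance mu in [-1, 1]; mu < 0 means x is closer
    to its own class than to any other class.
    """
    same = [p for p in prototypes if p[1] == label]
    other = [p for p in prototypes if p[1] != label]
    p1 = min(same, key=lambda p: sq_dist(x, p[0]))   # nearest correct prototype
    p2 = min(other, key=lambda p: sq_dist(x, p[0]))  # nearest wrong prototype
    d1, d2 = sq_dist(x, p1[0]), sq_dist(x, p2[0])
    denom = d1 + d2
    if denom == 0.0:
        return 0.0
    mu = (d1 - d2) / denom
    f = 1.0 / (1.0 + math.exp(-mu))
    df = f * (1.0 - f)  # derivative of the sigmoid loss at mu
    g1 = lr * df * d2 / denom ** 2
    g2 = lr * df * d1 / denom ** 2
    # Pull the correct prototype toward x and push the wrong one away.
    p1[0] = [w + g1 * (xi - w) for w, xi in zip(p1[0], x)]
    p2[0] = [w - g2 * (xi - w) for w, xi in zip(p2[0], x)]
    return mu


if __name__ == "__main__":
    protos = [[[0.0, 0.0], "a"], [[1.0, 0.0], "b"]]
    print(glvq_step([0.4, 0.0], "a", protos, lr=0.5), protos)
```

Starting from class mean vectors and repeating this step over the training set gives learned prototypes, which is the role GLVQ plays in the coarse classifier described above.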
Experiments
• Datasets
  - TUAT HANDS Japanese character pattern databases (Nakayosi and Kuchibue)
    J1_d: 2,965 JIS level-1 Kanji characters
    J1&2_d: 6,355 JIS level-1 and level-2 Kanji characters
  - Artificial Nôm character pattern databases
    NomS_d: 7,601 characters
    NomL_d: 32,733 characters
• Evaluation
  - Effects of GLVQ and/or kd-tree in large-scale character recognition.
Experimental Results (1/3)
• Comparison of accuracy with and without prototype learning by GLVQ on J1_d and J1&2_d datasets.
[Figure: recognition rate (%) versus candidate number k]

Candidate number k    10     20     30     40     50     60     70     80     90     100
J1_d                  97.20  97.29  97.32  97.34  97.35  97.35  97.35  97.36  97.36  97.36
J1_d_GLVQ             97.36  97.36  97.37  97.37  97.37  97.37  97.37  97.37  97.37  97.37
J1&2_d                96.63  96.77  96.82  96.84  96.85  96.86  96.86  96.87  96.87  96.87
J1&2_d_GLVQ           96.86  96.88  96.88  96.88  96.88  96.88  96.88  96.88  96.88  96.88

k-NN rate (top 1): J1_d 93.97%, J1_d_GLVQ 95.96%, J1&2_d 93.11%, J1&2_d_GLVQ 95.46%
Experimental Results (2/3)
• Comparison of accuracy and speed with and without kd-tree on J1&2_d dataset.
[Figure: recognition rate (%) and speed (ms/char) versus bound error ε (axis from 0.75 to 3.00), for k=10 and k=50]
Data labels (left to right):
  Speed10 (ms/char): 0.190  0.153  0.124  0.101  0.079  0.068  0.058
  Speed50 (ms/char): 0.284  0.238  0.188  0.154  0.130  0.113  0.097
  Rate10 (%):        93.11  93.09  93.05  92.95  92.79  92.54  92.18
  Rate50 (%):        93.11  93.11  93.09  93.05  92.98  92.86  92.69
Without kd-tree: 0.229 ms/char (k=10), 0.308 ms/char (k=50)
Annotated points: k=10: rate (-0.06), speed (-0.105, 54%); k=50: rate (-0.06), speed (-0.154, 50%)
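The speed versus accuracy trade-off controlled by a bound error ε can be illustrated with a minimal kd-tree sketch. It assumes the usual (1 + ε) approximate nearest-neighbor pruning rule, which may differ from the exact definition of ε used in the experiments above; the function names and the dictionary-based node layout are hypothetical.

```python
import math


def build_kdtree(points, depth=0):
    """Build a kd-tree from a list of (vector, label) pairs."""
    if not points:
        return None
    axis = depth % len(points[0][0])
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def nearest(node, x, eps=0.0, best=None):
    """Return (distance, label) of an approximate nearest neighbor of x.

    With eps = 0 the search is exact. With eps > 0 a far subtree is skipped
    unless it could hold a point closer than best / (1 + eps), so the
    returned distance is at most (1 + eps) times the true minimum.
    """
    if node is None:
        return best
    vec, label = node["point"]
    d = math.dist(x, vec)
    if best is None or d < best[0]:
        best = (d, label)
    diff = x[node["axis"]] - vec[node["axis"]]
    if diff < 0:
        near, far = node["left"], node["right"]
    else:
        near, far = node["right"], node["left"]
    best = nearest(near, x, eps, best)
    # |diff| is a lower bound on the distance to anything in the far subtree.
    if abs(diff) * (1.0 + eps) < best[0]:
        best = nearest(far, x, eps, best)
    return best


if __name__ == "__main__":
    protos = [((0.0, 0.0), "a"), ((5.0, 5.0), "b"), ((10.0, 0.0), "c")]
    tree = build_kdtree(protos)
    print(nearest(tree, (4.5, 4.0), eps=2.25))
```

Larger ε prunes more subtrees, so fewer prototypes are visited per query and some true nearest neighbors are missed, which matches the direction of the trend in the data labels above.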
Experimental Results (3/3)
• Summary (k=10, ε=2.25)

Dataset  Categories  Dictionary  Evaluation        Original  With    With     With GLVQ    Change vs.
                     size (MB)                     engine    GLVQ    kd-tree  and kd-tree  original
J1_d     2,965       6.5         Accuracy (%)      97.20     97.36   97.08    97.25        +0.05
                                 Speed (ms/char)   0.114     0.126   0.074    0.085        -25%
J1&2_d   6,355       13.9        Accuracy (%)      96.63     96.86   96.52    96.75        +0.12
                                 Speed (ms/char)   0.233     0.258   0.132    0.154        -34%
NomS_d   7,601       16.7        Accuracy (%)      98.58     98.61   98.58    98.61        +0.03
                                 Speed (ms/char)   0.258     0.275   0.134    0.137        -47%
NomL_d   32,733      71.7        Accuracy (%)      96.09     96.05   96.07    96.04        -0.05
                                 Speed (ms/char)   1.212     1.257   0.808    0.666        -45%

With GLVQ and kd-tree, the computational time is reduced by 25% to 47% while the recognition rate stays within about 0.1 points of the original engine.
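The change figures in the summary can be recomputed from the "Original engine" and "With GLVQ and kd-tree" values reported for each dataset. The short sketch below does this arithmetic; the dictionary layout and the function name `summarize` are only for illustration.

```python
# (original, with GLVQ and kd-tree) pairs as reported in the summary
RESULTS = {
    "J1_d":   {"accuracy": (97.20, 97.25), "ms_per_char": (0.114, 0.085)},
    "J1&2_d": {"accuracy": (96.63, 96.75), "ms_per_char": (0.233, 0.154)},
    "NomS_d": {"accuracy": (98.58, 98.61), "ms_per_char": (0.258, 0.137)},
    "NomL_d": {"accuracy": (96.09, 96.04), "ms_per_char": (1.212, 0.666)},
}


def summarize(row):
    """Return (accuracy change in points, time change in percent)."""
    acc_old, acc_new = row["accuracy"]
    ms_old, ms_new = row["ms_per_char"]
    acc_delta = round(acc_new - acc_old, 2)
    time_change = round(100 * (ms_new - ms_old) / ms_old)
    return acc_delta, time_change


if __name__ == "__main__":
    for name, row in RESULTS.items():
        acc_delta, time_change = summarize(row)
        print(f"{name}: accuracy {acc_delta:+.2f} points, time {time_change:+d}%")
```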
GUI of Digitization System
Conclusion
• Implemented a set of image processing methods (preprocessing, binarization, character segmentation).
• Built a high-accuracy character recognition engine.
  - Obtained a recognition rate of ~97%.
  - Reduced computational time by ~1/3 while keeping the same rate.
• Developed a GUI for Nôm document digitization so that an operator can verify the processed results of binarization, segmentation and recognition.
Future Work
• Improve page layout analysis to handle the many layouts of Nôm documents.
• Improve segmentation
  - Line segmentation
  - Recognition-based character segmentation
• Improve character recognition
  - Constrain the output with a word lexicon (using a Nôm dictionary).
• Introduce the work and call attention to it.
  - Call for collaborative research.
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 

Datech2014 - Session 4 - Construction of Text Digitization System for Nôm Historical Texts

  • 1. Construction of a Text Digitization System for Nôm Historical Documents Truyen Van PHAN and Masaki NAKAGAWA Tokyo University of Agriculture & Technology (TUAT), Japan
  • 2. Outline (Construction of a Text Digitization System for Nôm Historical Documents, May 20th, 2014)
      • Introduction
          • What is Nôm?
          • How is it now? Our motivation
          • What do we aim at?
      • Page Layout Analysis
      • Offline Recognition System
          • Generating Artificial Character Patterns
          • Building and Improving Large Set Character Recognition
      • Experiments and Results
      • GUI of Digitization System
      • Conclusion
      • Future Work
  • 3. What is Nôm?
      • Nôm characters: 10th century ~ 20th century; based on Chinese characters.
      • Vietnamese alphabet: 20th century ~ present; based on the Roman alphabet.
      • 2 categories of Nôm: borrowed characters and invented characters.
      • Figure (src: Wikipedia): the sentence "My mother eats vegetarian food at the temple every Sunday", with the labels Quốc Ngữ, Hán (classical Chinese) and native Nôm.
  • 4. How is it now? Our motivation
      • Current situation of Nôm
          • It has been completely replaced by Quốc Ngữ.
          • Fewer than 100 scholars worldwide can read Nôm.
          • More than 90% of Nôm documents have not been translated into Quốc Ngữ.
      • Digitization Project of the Hán Nôm Special Collection (http://nom.nlv.gov.vn/)
          • About 5,200 documents have been scanned.
          • Online access is provided to 1,907 documents with 133,495 pages.
  • 5. What do we aim at?
      • Construct a digitization system that enables even people who are not proficient in Nôm to build a digital text library of Nôm documents.
          • Provide a set of document image processing methods: preprocessing, binarization, character segmentation.
          • Provide a character recognition system.
          • Provide a user interface that enables an operator to verify the results.
      • Lay the foundation of a digitization system for future research and development.
  • 6. Overview of Our System
      • Figure: system overview diagram. Diagram labels: Document Images, Preprocessing, Segmentation, Page Layout Analysis, Labeling, Clustering, Grouping, Pattern, Artificial Pattern, Pattern Collection, Normalized Pattern, Normalization, Feature Extraction, Training, Dictionary, Classification, Character Recognition, OCR, Document Digitization, Document Texts.
  • 7. Page Layout Analysis (1/2)
      • Preprocessing
          • Red Comment Removal
          • Black Margin Removal
          • Line and Noise Removal
      • Binarization
          • 1 local thresholding method (Su's)
          • 16 global thresholding methods (Otsu's, SIS, …)
      • Character Segmentation
          • Top-down method: RXY cut
          • Bottom-up method: Voronoi
          • Combined method: RXY cut + Voronoi
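Of the global thresholding methods named on the Page Layout Analysis slide, Otsu's is the easiest to show compactly. The following is a minimal sketch, not the authors' implementation: it assumes an 8-bit grayscale page given as a list of pixel rows with dark ink on a light background, and the function names `otsu_threshold` and `binarize` are illustrative.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold t for 8-bit gray levels (0..255).

    Pixels with value <= t form the dark class, the rest the light class;
    t is the first level that maximizes the between-class variance.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the dark class so far
    sum0 = 0    # sum of gray levels in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, so the variance is undefined
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance up to a constant factor of 1 / total**2.
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


def binarize(rows):
    """Binarize an image given as a list of rows; 1 marks dark (ink) pixels."""
    t = otsu_threshold([p for row in rows for p in row])
    return [[1 if p <= t else 0 for p in row] for row in rows]
```

A single global threshold like this tends to struggle with uneven illumination or stains, which is one reason a system might also offer a local method such as Su's alongside it.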
  • 8. Page Layout Analysis (2/2)
      • Figure: processing flow from a Document Image to Character Images, passing through Red Comment Removal, Black Margin Removal, Line and Noise Removal, Binarization and Segmentation.
  • 9. Offline Recognition System
      • Generate a database of artificial character patterns.
          • There is no dataset of Nôm characters with ground truth.
      • Build an offline recognition engine.
          • Use the MQDF2 recognition method.
      • Improve large-scale character recognition.
          • Use GLVQ and a kd-tree in coarse classification.
  • 10. Generating Artificial Patterns
      • Generated from 27 CJKV fonts of Nôm, Japanese and Chinese.
      • Use distortion models (linear: rotation, shear, shrink, …; and non-linear).
      • Generate 2 datasets:
          • The 7,601 common characters, for recognition of segmented characters.
          • All 32,733 characters in the Nôm fonts, for verification of recognition results.
  • 11. Building Offline Recognition Engine
      • Normalization: Line Density Projection Interpolation (LDPI) → 64 x 64 image
      • Feature Extraction: Normalization-Cooperated Gradient Feature (NCGF) → 512 features
      • Feature Reduction: Fisher Linear Discriminant Analysis (FLDA) → 100 features
      • Coarse-to-fine Classification: k-NN (k candidates) → MQDF2
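The coarse-to-fine idea on the recognition engine slide can be sketched as follows. This is an illustration under stated assumptions, not the authors' engine: the coarse step ranks classes by Euclidean distance to one mean vector per class and keeps `k` candidates, and the fine step is a simple diagonal-Gaussian quadratic score used here as a stand-in for MQDF2, which additionally uses truncated eigenvector expansions of the class covariances. All function names are hypothetical.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def coarse_candidates(x, means, k):
    """Coarse step: indices of the k classes whose mean is nearest to x."""
    order = sorted(range(len(means)), key=lambda c: sq_dist(x, means[c]))
    return order[:k]


def fine_score(x, mean, var):
    """Fine step stand-in: diagonal quadratic discriminant (lower is better)."""
    return sum((xi - mi) ** 2 / vi + math.log(vi)
               for xi, mi, vi in zip(x, mean, var))


def recognize(x, means, variances, k=10):
    """Run the cheap coarse step on all classes, the costly fine step on k."""
    candidates = coarse_candidates(x, means, k)
    return min(candidates, key=lambda c: fine_score(x, means[c], variances[c]))
```

The point of the split is cost: the quadratic score is evaluated for only `k` classes instead of thousands, so recognition speed depends mostly on how quickly the coarse step finds good candidates.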
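  • 12. Improving Large Scale Character Recognition
      • Improvements in coarse classification
          • Mean vector → prototype learned by GLVQ: accuracy
          • Ordered structure → space-partitioning structure of kd-tree: speed
      • Coarse classification: $g(x) = \min_{i = 1, \dots, C} \lVert x - w_i \rVert$, where $\lVert x - w_i \rVert$ is the Euclidean distance between $x$ and the prototypes $w_1, w_2, \dots, w_C$.
      • Mean vector: $w_i = \frac{1}{n_{C_i}} \sum_{k} x_{ik}$, summed over the $n_{C_i}$ patterns $x_{ik}$ of class $C_i$.
      • Prototype learning: $w_i \leftarrow w_i + \alpha(t)\,(x - w_i)$ (figure: Generalized Learning Vector Quantization, src: Wikipedia).
      • Ordered candidate list $c_1, c_2, \dots, c_i, c_{i+1}, \dots, c_k$: prototype $w_j$ takes the position where $d(x, c_i) < d(x, w_j) < d(x, c_{i+1})$.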
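To make the prototype learning step concrete, here is a minimal sketch of one GLVQ update in the commonly published form (relative distance measure with a sigmoid loss). It is not taken from the authors' code: the function name `glvq_step`, the learning rate `lr` and the use of squared Euclidean distance are assumptions, and constant factors are absorbed into `lr`.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def glvq_step(prototypes, labels, x, y, lr=0.1):
    """Apply one GLVQ update in place for training sample x with label y.

    Returns the relative distance mu, which is negative when x is
    currently classified correctly. Assumes d1 + d2 > 0.
    """
    dists = [sq_dist(x, w) for w in prototypes]
    same = [i for i, lab in enumerate(labels) if lab == y]
    other = [i for i, lab in enumerate(labels) if lab != y]
    i1 = min(same, key=lambda i: dists[i])    # nearest correct prototype
    i2 = min(other, key=lambda i: dists[i])   # nearest wrong prototype
    d1, d2 = dists[i1], dists[i2]

    mu = (d1 - d2) / (d1 + d2)
    f = 1.0 / (1.0 + math.exp(-mu))           # sigmoid loss
    slope = f * (1.0 - f)                     # its derivative at mu
    scale = lr * slope / (d1 + d2) ** 2

    w1, w2 = prototypes[i1], prototypes[i2]
    # Pull the correct prototype toward x, push the wrong one away.
    prototypes[i1] = [w + scale * d2 * (xi - w) for w, xi in zip(w1, x)]
    prototypes[i2] = [w - scale * d1 * (xi - w) for w, xi in zip(w2, x)]
    return mu
```

Starting from the class mean vectors and repeating this step over the training patterns moves the prototypes toward the class boundaries, which is the usual explanation for why learned prototypes give a better coarse ranking than plain means.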
  • 13. Experiments
      • Datasets
          • TUAT HANDS Japanese character pattern databases (Nakayosi and Kuchibue)
              • J1_d: 2,965 JIS level-1 Kanji characters
              • J1&2_d: 6,355 JIS level-1 and level-2 Kanji characters
          • Artificial Nôm character pattern databases
              • NomS_d: 7,601 characters
              • NomL_d: 32,733 characters
      • Evaluation
          • Effects of GLVQ and/or kd-tree in large-scale character recognition.
  • 14. Experimental Results (1/3)
      • Comparison of accuracy with and without prototype learning by GLVQ on the J1_d and J1&2_d datasets.
      • Figure: recognition rate (%) against candidate number k.

        k             10     20     30     40     50     60     70     80     90     100
        J1_d          97.20  97.29  97.32  97.34  97.35  97.35  97.35  97.36  97.36  97.36
        J1_d_GLVQ     97.36  97.36  97.37  97.37  97.37  97.37  97.37  97.37  97.37  97.37
        J1&2_d        96.63  96.77  96.82  96.84  96.85  96.86  96.86  96.87  96.87  96.87
        J1&2_d_GLVQ   96.86  96.88  96.88  96.88  96.88  96.88  96.88  96.88  96.88  96.88

      • k-NN rate (top 1): J1_d 93.97%, J1_d_GLVQ 95.96%, J1&2_d 93.11%, J1&2_d_GLVQ 95.46%.
  • 15. Experimental Results (2/3)
      • Comparison of accuracy and speed with and without kd-tree on the J1&2_d dataset.
      • Figure: speed (ms/char) and recognition rate (%) against bound error ε (axis from 0.75 to 3.00), for k=10 and k=50. Data labels in order of increasing ε:

        Speed10 (ms/char)   0.190  0.153  0.124  0.101  0.079  0.068  0.058
        Speed50 (ms/char)   0.284  0.238  0.188  0.154  0.130  0.113  0.097
        Rate10 (%)          93.11  93.09  93.05  92.95  92.79  92.54  92.18
        Rate50 (%)          93.11  93.11  93.09  93.05  92.98  92.86  92.69

      • Speed without kd-tree: 0.229 ms/char (k=10) and 0.308 ms/char (k=50).
      • Annotated points: for k=10, rate -0.06 and speed -0.105 ms/char (54% of the original); for k=50, rate -0.06 and speed -0.154 ms/char (50% of the original).
  • 16. Experimental Results (3/3)
      • Summary (k=10, ε=2.25)

        Dataset  No. of      Dictionary  Evaluation       Original  With   With     With GLVQ    Change vs.
                 categories  size (MB)                    engine    GLVQ   kd-tree  and kd-tree  original
        J1_d     2,965       6.5         Accuracy (%)     97.20     97.36  97.08    97.25        +0.05
                                         Speed (ms/char)  0.114     0.126  0.074    0.085        -25%
        J1&2_d   6,355       13.9        Accuracy (%)     96.63     96.86  96.52    96.75        +0.12
                                         Speed (ms/char)  0.233     0.258  0.132    0.154        -34%
        NomS_d   7,601       16.7        Accuracy (%)     98.58     98.61  98.58    98.61        +0.03
                                         Speed (ms/char)  0.258     0.275  0.134    0.137        -47%
        NomL_d   32,733      71.7        Accuracy (%)     96.09     96.05  96.07    96.04        -0.05
                                         Speed (ms/char)  1.212     1.257  0.808    0.666        -45%

      • With GLVQ and kd-tree, the computational time is reduced while the recognition rate stays nearly the same.
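The accuracy differences and speed percentages in the summary can be rechecked from the reported figures for the original engine and the engine with GLVQ and kd-tree. The short sketch below does that arithmetic; the dictionary layout and the name `summarize` are illustrative only.

```python
# (original engine, with GLVQ and kd-tree) as reported in the summary, k=10, eps=2.25
RESULTS = {
    "J1_d":   {"accuracy": (97.20, 97.25), "speed": (0.114, 0.085)},
    "J1&2_d": {"accuracy": (96.63, 96.75), "speed": (0.233, 0.154)},
    "NomS_d": {"accuracy": (98.58, 98.61), "speed": (0.258, 0.137)},
    "NomL_d": {"accuracy": (96.09, 96.04), "speed": (1.212, 0.666)},
}


def summarize(results):
    """Return {dataset: (accuracy change in points, speed change in percent)}."""
    summary = {}
    for name, row in results.items():
        acc_before, acc_after = row["accuracy"]
        ms_before, ms_after = row["speed"]
        acc_delta = round(acc_after - acc_before, 2)
        speed_pct = round((ms_after - ms_before) / ms_before * 100)
        summary[name] = (acc_delta, speed_pct)
    return summary


for name, (acc_delta, speed_pct) in summarize(RESULTS).items():
    print(f"{name}: accuracy {acc_delta:+.2f} points, time per character {speed_pct:+d}%")
```

The time per character drops by 25% to 47% depending on the dataset, while accuracy moves by at most 0.12 points in either direction.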
  • 17. GUI of Digitization System
  • 18. Conclusion
      • Implemented a set of image processing methods (preprocessing, binarization, character segmentation).
      • Built a high-accuracy character recognition engine.
          • Obtained a recognition rate of about 97%.
          • Reduced computational time by about 1/3 while keeping the same rate.
      • Developed a GUI for Nôm document digitization that enables an operator to verify the results of binarization, segmentation and recognition.
  • 19. Future Work
      • Improve page layout analysis to handle the many layouts of Nôm documents.
      • Improve segmentation
          • Line segmentation
          • Recognition-based character segmentation
      • Improve character recognition
          • Constrain the output with a word lexicon (use a Nôm dictionary).
      • Introduce the work and call attention to it.
      • Call for collaborative research.