Datech2014 - Session 4 - Construction of Text Digitization System for Nôm Historical Texts

Presentation of the paper Construction of Text Digitization System for Nôm Historical Texts by Truyen Van Phan and Masaki Nakagawa in DATeCH 2014. #digidays

Presentation Transcript

  • Construction of a Text Digitization System for Nôm Historical Documents. Truyen Van PHAN and Masaki NAKAGAWA, Tokyo University of Agriculture & Technology (TUAT), Japan
  • Construction of a Text Digitization System for Nôm Historical Documents, May 20th, 2014
    Outline
      • Introduction: What is Nôm? What is its current situation? What is our motivation? What do we aim at?
      • Page Layout Analysis
      • Offline Recognition System
        • Generating Artificial Character Patterns
        • Building and Improving Large Set Character Recognition
      • Experiments and Results
      • GUI of Digitization System
      • Conclusion
      • Future Work
  • What is Nôm?
      • Nôm characters: 10th century ~ 20th century; based on Chinese characters.
      • Vietnamese alphabet (Quốc Ngữ): 20th century ~ present; based on the Roman alphabet.
      • 2 categories of Nôm: borrowed characters (Hán, classical Chinese) and invented characters (native Nôm).
      • Example sentence written in both scripts: "My mother eats vegetarian food at the temple every Sunday" (src: wikipedia)
  • What is its current situation? What is our motivation?
      • Current situation of Nôm
        • Completely replaced by Quốc Ngữ.
        • Fewer than 100 scholars worldwide can read Nôm.
        • More than 90% of Nôm documents have not been translated into Quốc Ngữ.
      • Digitization Project of the Hán Nôm Special Collection (http://nom.nlv.gov.vn/)
        • About 5,200 documents have been scanned.
        • Online access is provided to 1,907 documents with 133,495 pages.
  • What do we aim at?
      • Construct a digitization system that enables even people who are not proficient in Nôm to build a digital text library of Nôm documents.
        • Provide a set of document image processing methods: preprocessing, binarization, character segmentation.
        • Provide a character recognition system.
        • Provide a user interface that enables an operator to verify the results.
      • Lay a foundation for future research and development of digitization systems.
  • Overview of Our System
    [Figure: system block diagram.
     Stage labels: Page Layout Analysis, Pattern Collection, Grouping, Character Recognition, Document Digitization.
     Blocks: Document Images, Preprocessing, Segmentation, Pattern, Clustering, Labeling, Artificial Pattern, Normalized Pattern, Normalization, Feature Extraction, Training, Dictionary, Classification, OCR, Document Texts.]
  • Page Layout Analysis (1/2)
      • Preprocessing
        • Red comment removal
        • Black margin removal
        • Line and noise removal
      • Binarization
        • 1 local thresholding method (Su's)
        • 16 global thresholding methods (Otsu's, SIS, …)
      • Character segmentation
        • Top-down method: RXY cut (recursive X-Y cut)
        • Bottom-up method: Voronoi
        • Combined method: RXY cut + Voronoi
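As an illustration of the global thresholding family named on this slide, here is a minimal sketch of Otsu's method in plain Python. It is not the authors' implementation: the function names `otsu_threshold` and `binarize`, the 8-bit grayscale input and the convention that dark pixels are ink are all assumptions made for the example. Su's local method and the other global methods are not shown.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0..255) that maximises between-class variance.

    Pixels with value <= t form the dark class, the rest the bright class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the dark class so far
    sum0 = 0    # sum of gray levels in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance is undefined
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        between = w0 * w1 * (m0 - m1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(image, threshold):
    """Map a grayscale image (list of rows) to 1 = ink, 0 = background."""
    return [[1 if p <= threshold else 0 for p in row] for row in image]


# Tiny hypothetical page: bright paper with a few dark strokes.
page = [[200, 210, 30],
        [205, 25, 20],
        [198, 202, 207]]
t = otsu_threshold([p for row in page for p in row])
binary = binarize(page, t)
```

A global threshold like this is cheap but assumes roughly uniform illumination, which may be one reason the slide lists a local method alongside the global ones.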
  • Page Layout Analysis (2/2)
    [Figure: processing flow from Document Image through Red Comment Removal, Black Margin Removal, Line and Noise Removal, Binarization and Segmentation to Character Images.]
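The last stage of this flow, segmentation, can be sketched with a plain recursive X-Y cut on projection profiles. This is a simplified illustration only: it does not include the Voronoi step or the combined method from the slides, and the names `split_runs` and `xy_cut`, the `min_gap` parameter and the choice to cut columns before rows are assumptions made for the example.

```python
def split_runs(profile, min_gap):
    """Split a projection profile into (start, end) runs of non-empty bins.

    Runs separated by fewer than min_gap empty bins are merged. Ends are exclusive.
    """
    runs = []
    for i, value in enumerate(profile):
        if value == 0:
            continue
        if runs and i - runs[-1][1] < min_gap:
            runs[-1][1] = i + 1          # gap too small: extend the previous run
        else:
            runs.append([i, i + 1])
    return [(a, b) for a, b in runs]


def xy_cut(img, top, bottom, left, right, min_gap=1):
    """Recursively cut a binary image (1 = ink) at empty columns, then empty rows.

    Returns bounding boxes as (top, bottom, left, right) with exclusive ends.
    """
    cols = [sum(img[y][x] for y in range(top, bottom)) for x in range(left, right)]
    rows = [sum(img[y][x] for x in range(left, right)) for y in range(top, bottom)]
    col_runs = split_runs(cols, min_gap)
    row_runs = split_runs(rows, min_gap)
    if not col_runs:                     # region contains no ink
        return []
    if len(col_runs) > 1:                # vertical gaps found: split into columns
        boxes = []
        for a, b in col_runs:
            boxes += xy_cut(img, top, bottom, left + a, left + b, min_gap)
        return boxes
    if len(row_runs) > 1:                # horizontal gaps found: split into rows
        boxes = []
        for a, b in row_runs:
            boxes += xy_cut(img, top + a, top + b, left, right, min_gap)
        return boxes
    (a, b), (c, d) = col_runs[0], row_runs[0]
    return [(top + c, top + d, left + a, left + b)]   # tight box, no further cut


# Hypothetical binarized page: two columns, the left one holding two characters.
img = [[1, 1, 0, 0, 1, 1, 0],
       [1, 1, 0, 0, 1, 1, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [1, 1, 0, 0, 0, 0, 0],
       [1, 1, 0, 0, 0, 0, 0]]
boxes = xy_cut(img, 0, len(img), 0, len(img[0]))
```

A pure projection-based cut fails when neighbouring characters touch or overlap in projection, which is presumably where a bottom-up method such as Voronoi helps.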
  • Offline Recognition System
      • Generate a database of artificial character patterns.
        • No Nôm character dataset with ground truth exists.
      • Build an offline recognition engine.
        • Use the MQDF2 recognition method.
      • Improve large-scale character recognition.
        • Use GLVQ and a kd-tree in coarse classification.
  • Generating Artificial Patterns
      • Patterns are generated from 27 CJKV fonts (Nôm, Japanese, Chinese).
      • Distortion models are applied: linear (rotation, shear, shrink, …) and non-linear.
      • 2 datasets are generated:
        • 7,601 common characters, for segmented character recognition.
        • All 32,733 characters in the Nôm fonts, for verification of recognized results.
    [Figure label: Nôm character "Human"]
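The linear distortion models mentioned above (rotation, shear, shrink) can all be expressed as one affine transform of the glyph image. The sketch below is an assumption-laden illustration, not the authors' generator: `affine_distort` and `random_variants`, the parameter ranges and the nearest-neighbour sampling are all chosen for the example, and non-linear distortion is not covered.

```python
import math
import random


def affine_distort(img, angle_deg=0.0, shear=0.0, scale_x=1.0, scale_y=1.0):
    """Rotate, shear and scale a binary glyph about its centre.

    Uses inverse mapping with nearest-neighbour sampling so the output has no holes.
    """
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    # Forward matrix M = Rotation * Shear * Scale.
    m00 = cos_a * scale_x
    m01 = (cos_a * shear - sin_a) * scale_y
    m10 = sin_a * scale_x
    m11 = (sin_a * shear + cos_a) * scale_y
    det = m00 * m11 - m01 * m10
    if det == 0:
        raise ValueError("degenerate transform")
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = x - cx, y - cy
            # Apply the inverse of M to find the source pixel.
            sx = (m11 * dx - m01 * dy) / det + cx
            sy = (-m10 * dx + m00 * dy) / det + cy
            ix, iy = int(round(sx)), int(round(sy))
            if 0 <= ix < w and 0 <= iy < h:
                out[y][x] = img[iy][ix]
    return out


def random_variants(img, n, seed=0):
    """Generate n distorted copies with small random parameters (ranges are illustrative)."""
    rng = random.Random(seed)
    return [affine_distort(img,
                           angle_deg=rng.uniform(-10, 10),
                           shear=rng.uniform(-0.2, 0.2),
                           scale_x=rng.uniform(0.8, 1.0),
                           scale_y=rng.uniform(0.8, 1.0))
            for _ in range(n)]


glyph = [[1, 0, 0],
         [0, 1, 0],
         [0, 0, 0]]
variants = random_variants(glyph, 5)
```

In a real generator the glyph would first be rendered from each font at a much larger size than this 3 x 3 toy.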
  • Building Offline Recognition Engine
      • Normalization: Line Density Projection Interpolation (LDPI) → 64 x 64 image
      • Feature extraction: Normalization-Cooperated Gradient Feature (NCGF) → 512 features
      • Feature reduction: Fisher Linear Discriminant Analysis (FLDA) → 100 features
      • Coarse-to-fine classification: k-NN (k candidates) → MQDF2
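The coarse-to-fine step can be sketched as follows: a cheap nearest-prototype search keeps k candidate classes, and a more expensive discriminant function is evaluated only on those. In this sketch the fine stage is a diagonal-covariance Gaussian score used as a stand-in, not MQDF2 itself, and all names (`coarse_candidates`, `fine_score`, `classify`) and the toy data are assumptions made for the example.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def coarse_candidates(x, prototypes, k):
    """Return the labels of the k class prototypes nearest to x."""
    ranked = sorted(prototypes, key=lambda label: sq_dist(x, prototypes[label]))
    return ranked[:k]


def fine_score(x, mean, var):
    """Stand-in for MQDF2: negative log-likelihood of a diagonal Gaussian (lower is better)."""
    return sum((xi - mi) ** 2 / vi + math.log(vi)
               for xi, mi, vi in zip(x, mean, var))


def classify(x, prototypes, variances, k=10):
    """Coarse k-NN over prototypes, then fine scoring over the k candidates only."""
    candidates = coarse_candidates(x, prototypes, k)
    return min(candidates,
               key=lambda label: fine_score(x, prototypes[label], variances[label]))


# Hypothetical 2-D features: class "A" is tight, class "B" is spread out.
prototypes = {"A": [0.0, 0.0], "B": [4.0, 0.0], "C": [0.0, 50.0]}
variances = {"A": [0.01, 0.01], "B": [25.0, 25.0], "C": [1.0, 1.0]}
x = [1.8, 0.0]
coarse_only = classify(x, prototypes, variances, k=1)
coarse_to_fine = classify(x, prototypes, variances, k=2)
```

The toy data is chosen so that the nearest prototype is not the best class under the fine score, which is the situation where keeping more than one candidate pays off.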
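  • Improving Large Scale Character Recognition
      • Improvements in coarse classification
        • Mean vector → prototype learned by GLVQ (Generalized Learning Vector Quantization): accuracy
        • Ordered structure → space-partitioning structure of a kd-tree: speed
      • Coarse classification over the class prototypes $w_1, w_2, \ldots, w_C$: $g(x) = \min_{1 \le i \le C} \{\|x - w_i\|\}$, where $\|x - w_i\|$ is the Euclidean distance.
      • Mean vector: $w_i = \frac{1}{n_i} \sum_{k} x_{ik}$, taken over the $n_i$ training patterns $x_{ik}$ of class $C_i$.
      • Prototype update: $w_i \leftarrow w_i \pm \alpha(t)(x - w_i)$ (toward $x$ for the correct class, away from it otherwise).
      • Ordered structure: candidates $c_1, c_2, \ldots, c_i, c_{i+1}, \ldots, c_k$ are kept sorted by distance; $w_j$ falls between $c_i$ and $c_{i+1}$ when $d(x, c_i) < d(x, w_j) < d(x, c_{i+1})$.
    [Figure src: wikipedia]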
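For readers unfamiliar with GLVQ, the sketch below shows one training step in its commonly published form: with squared distances d1 (nearest correct prototype) and d2 (nearest wrong prototype), the relative distance is mu = (d1 - d2) / (d1 + d2), and the two prototypes are moved with weights proportional to d2 / (d1 + d2)^2 and d1 / (d1 + d2)^2. The sigmoid loss, the learning rate `lr`, the list-based prototype layout and the name `glvq_step` are assumptions for the example, and constant factors are absorbed into `lr`; the slides do not state which variant the authors used.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def glvq_step(x, label, prototypes, lr=0.05):
    """Apply one GLVQ update in place and return the relative distance mu.

    prototypes is a list of [class_label, vector] pairs containing at least one
    prototype of the correct class and one of another class.
    """
    same = min((p for p in prototypes if p[0] == label),
               key=lambda p: sq_dist(x, p[1]))
    other = min((p for p in prototypes if p[0] != label),
                key=lambda p: sq_dist(x, p[1]))
    d1, d2 = sq_dist(x, same[1]), sq_dist(x, other[1])
    denom = d1 + d2
    if denom == 0:
        return 0.0
    mu = (d1 - d2) / denom                # negative when x is classified correctly
    f = 1.0 / (1.0 + math.exp(-mu))       # sigmoid loss
    df = f * (1.0 - f)                    # its derivative with respect to mu
    g1 = lr * df * d2 / denom ** 2
    g2 = lr * df * d1 / denom ** 2
    same[1] = [w + g1 * (xi - w) for w, xi in zip(same[1], x)]     # attract
    other[1] = [w - g2 * (xi - w) for w, xi in zip(other[1], x)]   # repel
    return mu


# Hypothetical prototypes initialised at class means, then one training sample.
protos = [["A", [0.0, 0.0]], ["B", [2.0, 0.0]]]
mu = glvq_step([0.5, 0.0], "A", protos, lr=0.5)
```

Initialising the prototypes with the class mean vectors and then running such updates over the training set matches the "mean vector → learned prototype" wording on the slide, although the exact schedule is not given there.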
  • Experiments
      • Datasets
        • TUAT HANDS Japanese character pattern databases (Nakayosi and Kuchibue)
          • J1_d: 2,965 JIS level-1 Kanji characters
          • J1&2_d: 6,355 JIS level-1 and level-2 Kanji characters
        • Artificial Nôm character pattern databases
          • NomS_d: 7,601 characters
          • NomL_d: 32,733 characters
      • Evaluation
        • Effects of GLVQ and/or kd-tree in large-scale character recognition.
  • Experimental Results (1/3)
      • Comparison of accuracy with and without prototype learning by GLVQ on the J1_d and J1&2_d datasets.
    [Figure: recognition rate (%) versus candidate number k. Data labels:]
      k              10     20     30     40     50     60     70     80     90     100
      J1_d           97.20  97.29  97.32  97.34  97.35  97.35  97.35  97.36  97.36  97.36
      J1_d_GLVQ      97.36  97.36  97.37  97.37  97.37  97.37  97.37  97.37  97.37  97.37
      J1&2_d         96.63  96.77  96.82  96.84  96.85  96.86  96.86  96.87  96.87  96.87
      J1&2_d_GLVQ    96.86  96.88  96.88  96.88  96.88  96.88  96.88  96.88  96.88  96.88
      • k-NN rate (top 1): J1_d 93.97%, J1_d_GLVQ 95.96%, J1&2_d 93.11%, J1&2_d_GLVQ 95.46%
  • Experimental Results (2/3)
      • Comparison of accuracy and speed with and without kd-tree on the J1&2_d dataset.
    [Figure: recognition rate (%) and speed (ms/char) versus bound error ε (axis from 0.75 to 3.00) for k=10 and k=50. Data labels, in plotted order:]
      Speed10 (ms/char)   0.190  0.153  0.124  0.101  0.079  0.068  0.058
      Speed50 (ms/char)   0.284  0.238  0.188  0.154  0.130  0.113  0.097
      Rate10 (%)          93.11  93.09  93.05  92.95  92.79  92.54  92.18
      Rate50 (%)          93.11  93.11  93.09  93.05  92.98  92.86  92.69
      • Without kd-tree: 0.229 ms/char (k=10) and 0.308 ms/char (k=50).
      • Annotated points: k=10: rate -0.06, speed -0.105 ms/char (54%); k=50: rate -0.06, speed -0.154 ms/char (50%).
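The trade-off between speed and accuracy controlled by the bound error ε can be illustrated with an approximate k-nearest-neighbour search on a kd-tree. The sketch below uses the common (1 + ε) pruning rule: a far subtree is skipped when the distance to its splitting plane, multiplied by (1 + ε), is not smaller than the current k-th best distance. Whether the slides define ε the same way is not stated, so treat the rule, the function names and the data layout as assumptions.

```python
import heapq


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def build_kdtree(items, depth=0):
    """Build a kd-tree from (vector, label) pairs; nodes are (item, axis, left, right)."""
    if not items:
        return None
    axis = depth % len(items[0][0])
    items = sorted(items, key=lambda it: it[0][axis])
    mid = len(items) // 2
    return (items[mid], axis,
            build_kdtree(items[:mid], depth + 1),
            build_kdtree(items[mid + 1:], depth + 1))


def _search(node, query, k, eps, heap):
    """Depth-first search; heap holds (-squared_distance, label) of the best k so far."""
    if node is None:
        return
    (vec, label), axis, left, right = node
    d = sq_dist(vec, query)
    if len(heap) < k:
        heapq.heappush(heap, (-d, label))
    elif d < -heap[0][0]:
        heapq.heapreplace(heap, (-d, label))
    diff = query[axis] - vec[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    _search(near, query, k, eps, heap)
    # Visit the far side only if it could still hold a sufficiently closer point.
    if len(heap) < k or (diff * (1.0 + eps)) ** 2 < -heap[0][0]:
        _search(far, query, k, eps, heap)


def knn(tree, query, k, eps=0.0):
    """Return up to k (squared_distance, label) pairs, nearest first; eps=0 is exact."""
    heap = []
    _search(tree, query, k, eps, heap)
    return sorted((-neg, label) for neg, label in heap)


# Hypothetical prototypes on a small grid, labelled by index.
grid = [([float(x), float(y)], x * 4 + y) for x in range(4) for y in range(4)]
tree = build_kdtree(grid)
nearest = knn(tree, [1.1, 2.2], k=3, eps=0.0)
```

Larger values of `eps` prune more subtrees, so the search gets faster but may miss some true neighbours, which is the pattern the speed and rate series above show.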
  • Experimental Results (3/3)
      • Summary (k=10, ε=2.25)
      Dataset   Categories   Dictionary size (MB)   Evaluation        Original engine   With GLVQ   With kd-tree   With GLVQ and kd-tree   Change
      J1_d      2,965        6.5                    Accuracy (%)      97.20             97.36       97.08          97.25                   +0.05
                                                    Speed (ms/char)   0.114             0.126       0.074          0.085                   -25%
      J1&2_d    6,355        13.9                   Accuracy (%)      96.63             96.86       96.52          96.75                   +0.12
                                                    Speed (ms/char)   0.233             0.258       0.132          0.154                   -34%
      NomS_d    7,601        16.7                   Accuracy (%)      98.58             98.61       98.58          98.61                   +0.03
                                                    Speed (ms/char)   0.258             0.275       0.134          0.137                   -47%
      NomL_d    32,733       71.7                   Accuracy (%)      96.09             96.05       96.07          96.04                   -0.05
                                                    Speed (ms/char)   1.212             1.257       0.808          0.666                   -45%
      • With GLVQ and kd-tree, the computational time is reduced while the recognition rate is kept the same.
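The last column of the summary can be checked by hand: the accuracy change is the difference in percentage points between the original engine and the engine with GLVQ and kd-tree, and the speed change is the relative change in time per character. A short sketch of that arithmetic, using only the figures from the summary (the dictionary layout and the name `relative_change` are just for the example):

```python
# (original engine, with GLVQ and kd-tree) pairs copied from the summary.
summary = {
    "J1_d":   {"accuracy": (97.20, 97.25), "ms_per_char": (0.114, 0.085)},
    "J1&2_d": {"accuracy": (96.63, 96.75), "ms_per_char": (0.233, 0.154)},
    "NomS_d": {"accuracy": (98.58, 98.61), "ms_per_char": (0.258, 0.137)},
    "NomL_d": {"accuracy": (96.09, 96.04), "ms_per_char": (1.212, 0.666)},
}


def relative_change(before, after):
    """Relative change in percent; negative means the value went down."""
    return (after - before) / before * 100.0


for name, row in summary.items():
    acc_before, acc_after = row["accuracy"]
    acc_delta = acc_after - acc_before                 # percentage points
    time_delta = relative_change(*row["ms_per_char"])  # percent
    print(f"{name}: accuracy {acc_delta:+.2f} points, time per character {time_delta:+.0f}%")
```

For example, for J1_d the time per character goes from 0.114 ms to 0.085 ms, a change of (0.085 - 0.114) / 0.114, which rounds to -25%.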
  • GUI of Digitization System
  • Conclusion
      • Implemented a set of image processing methods (preprocessing, binarization, character segmentation).
      • Built a high-accuracy character recognition engine.
        • Obtained a recognition rate of about 97%.
        • Reduced computational time by about 1/3 while keeping the same recognition rate.
      • Developed a GUI for Nôm document digitization that enables an operator to verify the results of binarization, segmentation and recognition.
  • Future Work
      • Improve page layout analysis to handle the many layouts of Nôm documents.
      • Improve segmentation
        • Line segmentation
        • Recognition-based character segmentation
      • Improve character recognition
        • Constrain the output with a word lexicon (use a Nôm dictionary).
      • Introduce the work and call attention to it.
      • Call for collaborative research.