Search space reduction for holistic ligature recognition in Urdu Nastaliq script (Bachelor thesis presentation)

The thesis addresses the problem of holistic recognition of printed text in the Nastalique writing style of the Urdu language. The main difficulty of the recognition process lies in the large number of classes (17,000 different possible classes in our Urdu text data). This large number of classes not only limits the efficiency (run-time) of many recognition algorithms, but also makes it harder to use some state-of-the-art classifiers, such as random forests, that assume a much smaller number of classes. We investigate different strategies for improving the efficiency (reducing the search space) of nearest-neighbor-based classification of Urdu ligatures.
Experiments using spectral hashing show that the search space of nearest neighbor comparison can be reduced by about 50% without loss in recognition accuracy.
Further experiments demonstrate that a Random Forest classifier can reliably distinguish one-character ligatures from multiple-character ligatures.

Published in: Technology, Education

  1. Holistic Recognition of Printed Arabic Script Ligatures
     Akram El-Korashy
     Supervised by: Dr. Faisal Shafait
     Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI), Kaiserslautern, Germany
  2. Outline
     ● Introduction
        ○ Segmentation-free OCR for Arabic scripts
     ● Approaches Used
        ○ Features Extraction, and the Shape Context method
        ○ Machine Learning Techniques (Hierarchical classification, Spectral Hashing, Random Forests)
     ● Improvements and Methodology
        ○ Shape Context weaknesses
        ○ New Features (dots, sizes, pixel-level matching)
        ○ Classification Methodology
     ● Experiments and Results
     ● Conclusion and Summary
     Akram El-Korashy, Segmentation-free OCR, 14.08.12
  4. Segmentation-free OCR for Arabic scripts
     ● Nastalique writing: Classify ligatures instead of individual characters.
     ● Over 20,000 valid ligatures in the Urdu language.
     ● Ease in the preprocessing, with difficulty in feature extraction & classification.
  6. Features Extraction, Shape Context method
     ● Distribution of Points, Transformation methods, Structural Analysis.
     ● Nabocr: Shape Context features vector.
     ● Contour Extraction.
     ● Shape Context is a shape descriptor proposed by Belongie et al.
  7. Features Extraction, Shape Context method
     ● 4 histograms from 4 quadrants.
     ● Each histogram is a sum of point histograms.
     ● Distance, Orientation
     ● Histogram: bins of ranges.
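
The quadrant histograms described on this slide can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Nabocr implementation: the uniform distance bins, the bin counts, the split into quadrants around the centroid, and the function names `point_histogram` and `quadrant_features` are all hypothetical choices.

```python
import math

def point_histogram(points, ref, n_dist, n_angle, max_dist):
    """Count the other points around `ref`, binned by distance and orientation."""
    hist = [0] * (n_dist * n_angle)
    for p in points:
        if p == ref:
            continue  # skip the reference point itself (and exact duplicates)
        dx, dy = p[0] - ref[0], p[1] - ref[1]
        dist = math.hypot(dx, dy)
        angle = math.atan2(dy, dx) % (2 * math.pi)
        # Assumption: uniform bins; clamp so the largest values fall in the last bin.
        d_bin = min(int(n_dist * dist / max_dist), n_dist - 1)
        a_bin = min(int(n_angle * angle / (2 * math.pi)), n_angle - 1)
        hist[d_bin * n_angle + a_bin] += 1
    return hist

def quadrant_features(points, n_dist=3, n_angle=4):
    """Return 4 histograms, each the sum of the point histograms in one quadrant."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    # Normalise distances by the largest pairwise distance in the shape.
    max_dist = max(math.hypot(a[0] - b[0], a[1] - b[1])
                   for a in points for b in points) or 1.0
    quads = [[0] * (n_dist * n_angle) for _ in range(4)]
    for p in points:
        # Assumption: quadrants are defined relative to the centroid.
        q = (0 if p[0] >= cx else 1) + (0 if p[1] >= cy else 2)
        h = point_histogram(points, p, n_dist, n_angle, max_dist)
        quads[q] = [a + b for a, b in zip(quads[q], h)]
    return quads

corners = [(0, 0), (2, 0), (0, 2), (2, 2)]
features = quadrant_features(corners)
```

Concatenating the four returned histograms would give one fixed-length feature vector per ligature.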
  9. Hierarchical Classification
     ● Decomposing a classification problem into a set of smaller problems.
     ● Useful with large numbers of categories.
     ● Efficiency of recognition.
     ● Can help improve accuracy.
     ● Independent set of features for each branch.
  11. Spectral Hashing
     ● Fast NN technique
     ● Feature vector into a binary code:
        ○ easily computed
        ○ small no. of bits
        ○ similarity mapping
     ● Calculating binary code:
        ○ maximum variance direction (PCA)
        ○ sinusoidal eigenfunctions
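
A minimal sketch of the binary-code step, assuming the feature vector has already been projected onto its principal (PCA) directions and that the projected data is treated as uniformly distributed on a box `[mins, maxs]`. The function names and the toy bounds are hypothetical; this is not the code used in the thesis.

```python
import math

def choose_modes(mins, maxs, n_bits):
    """Pick the n_bits (dimension, frequency) pairs with the smallest eigenvalues.

    Under the uniform-box assumption the eigenvalue grows with (k / span)^2,
    so sorting by that quantity is enough for this sketch.
    """
    candidates = []
    for dim, (lo, hi) in enumerate(zip(mins, maxs)):
        for k in range(1, n_bits + 1):
            candidates.append(((k / (hi - lo)) ** 2, dim, k))
    candidates.sort()
    return [(dim, k) for _, dim, k in candidates[:n_bits]]

def spectral_bits(x, mins, maxs, modes):
    """Threshold one sinusoidal eigenfunction per bit at zero."""
    bits = []
    for dim, k in modes:
        span = maxs[dim] - mins[dim]
        phase = math.pi / 2 + k * math.pi / span * (x[dim] - mins[dim])
        bits.append(1 if math.sin(phase) > 0 else 0)
    return bits

mins, maxs = [0.0, 0.0], [4.0, 1.0]   # toy bounds of the PCA-projected data
modes = choose_modes(mins, maxs, n_bits=3)
code_a = spectral_bits([0.5, 0.5], mins, maxs, modes)
code_b = spectral_bits([0.6, 0.5], mins, maxs, modes)
code_c = spectral_bits([3.5, 0.5], mins, maxs, modes)
```

In this toy setup the two nearby points `code_a` and `code_b` receive the same code while the distant point `code_c` does not, which is the similarity-mapping property the slide refers to.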
  13. Random Forests
     ● Ensemble Classifier
     Ensemble learning combines the predictions of different classifiers (decision trees) by collecting independent votes from each tree and calculating the majority vote to give a prediction.
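
The voting step can be illustrated with a few lines of Python. The "trees" below are hypothetical stand-in functions, not trained decision trees, and `majority_vote` is an illustrative name.

```python
from collections import Counter

def majority_vote(trees, sample):
    """Collect one vote per tree and return the most common label."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]

# Hypothetical stand-ins for trained trees: each looks at one toy feature.
toy_trees = [
    lambda s: "1-char" if s["components"] == 1 else "multi-char",
    lambda s: "1-char" if s["aspect_ratio"] < 1.5 else "multi-char",
    lambda s: "1-char" if s["fill_ratio"] > 0.4 else "multi-char",
]

prediction = majority_vote(toy_trees, {"components": 1, "aspect_ratio": 2.0, "fill_ratio": 0.5})
```

Using an odd number of trees avoids ties when there are only two labels.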
  15. Shape context weaknesses
     ● Scale invariance
     ● Missing representation of dots
     ● Confusion between ligatures that vary only in dots.
  17. New Features
     ● Sizes of connected components
     ● Locations of connected components
        ○ above, below, or interleaving
        ○ Grid location
  18. New Features
     ● Pixel-level properties:
        ○ weights of regions
        ○ fill ratio
     ● Length, Width, Aspect Ratio
        ○ Invariance to scanning resolution
        ○ Setting reference size
        ○ Histogram of widths and heights
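
Two of these features, fill ratio and aspect ratio, could be computed along the following lines. The definitions used here (ink pixels divided by bounding-box area, and bounding-box width divided by height) are assumptions for illustration; the slides do not spell out the exact formulas.

```python
def bounding_box(image):
    """Return (top, bottom, left, right) of the ink pixels in a 0/1 image."""
    rows = [r for r, row in enumerate(image) if any(row)]
    cols = [c for c in range(len(image[0])) if any(row[c] for row in image)]
    return rows[0], rows[-1], cols[0], cols[-1]

def shape_features(image):
    """Compute width, height, aspect ratio and fill ratio of the ink region."""
    top, bottom, left, right = bounding_box(image)
    height = bottom - top + 1
    width = right - left + 1
    ink = sum(sum(row) for row in image)
    return {
        "width": width,
        "height": height,
        "aspect_ratio": width / height,      # assumed definition: width / height
        "fill_ratio": ink / (width * height),  # assumed definition: ink / box area
    }

glyph = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
feats = shape_features(glyph)
```

Both ratios are unchanged when the image is uniformly rescaled, which is one way to obtain the invariance to scanning resolution mentioned above.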
  20. Classification Methodology
     ● Experiment set "1"
        ○ Spectral Hashing, reduction of number of comparisons
     ● Experiment set "2"
        ○ Random Forests, hierarchy by recognizing the no. of characters
     ● Experiment "3"
        ○ Random Forests, classification of alphabet symbols
  21. Classification Methodology
     ● Spectral Hashing (sunvid project):
        ○ Training Dataset (~80,000 samples)
        ○ Test Dataset (~20,000 samples)
        ○ Different combinations of number of bits, number of tables, tolerance bits (training different hash structures in parallel)
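
One plausible reading of how bits, tables, and tolerance bits interact is sketched below: each table maps binary codes to training-sample indices, and a query collects every sample whose code lies within `tolerance` bit flips of the query code in at least one table. This is an assumption made for illustration; the slides do not describe how the sunvid project combines its tables, and all function names are hypothetical.

```python
from itertools import combinations

def codes_within(code, tolerance):
    """Yield every bit tuple within Hamming distance `tolerance` of `code`."""
    for r in range(tolerance + 1):
        for flips in combinations(range(len(code)), r):
            yield tuple(b ^ 1 if i in flips else b for i, b in enumerate(code))

def build_table(codes):
    """Map each binary code to the indices of the training samples that have it."""
    table = {}
    for idx, code in enumerate(codes):
        table.setdefault(tuple(code), []).append(idx)
    return table

def candidates(tables, query_codes, tolerance):
    """Union of the nearby buckets over all tables (one query code per table)."""
    found = set()
    for table, code in zip(tables, query_codes):
        for near in codes_within(tuple(code), tolerance):
            found.update(table.get(near, ()))
    return found

table = build_table([(0, 0, 0), (0, 0, 1), (1, 1, 1)])
exact = candidates([table], [(0, 0, 0)], tolerance=0)
loose = candidates([table], [(0, 0, 0)], tolerance=1)
probes = len(list(codes_within((0,) * 7, 2)))  # buckets probed for a 7-bit code
```

The 1-NN comparison then runs only over the returned candidates, so the ratio of candidates to training samples is the search-space reduction; more tables or tolerance bits enlarge the candidate set.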
  22. Classification Methodology
     ● Random Forests (python milk):
        ○ Number of decision trees: 101
        ○ 70% of the attributes
        ○ 70% of the training samples
        ○ Reduced training dataset (~20,000 samples)
        ○ Test dataset of ~18,000 samples
  23. Classification Methodology
     ● New features vector
     ● Classifying based on no. of characters
     ● Classifying the Alphabet Symbols
     [Diagram: input, Random Forest classifier, then 1-character classifier, 2-character classifier, 3+ character classifier]
  25. Experiments and Results
     ● Spectral Hashing Results "1"
        ○ Effect of changing the number of tables
        ○ 7-bit binary code, 2 tolerance bits
  26. Experiments and Results
     ● Spectral Hashing Results "1"

     Accuracy    Best Reduction    Hash (bits, tables, tolerance)
     81.5%       37538 (47.2%)     7, 9, 1
     81%         31553 (39.7%)     7, 7, 1
     80.5%       23975 (30.1%)     8, 9, 1
     79.5%       20736 (26.1%)     7, 4, 1
     78%         18737 (23.6%)     8, 7, 1
     76%         15392 (19.4%)     7, 3, 1
  27. Experiments and Results
     ● Spectral Hashing Results "1"
     ● Significant reduction rates
        ○ Reduction down to 19% for a difference of 6% in accuracy.
        ○ Reduction down to 24% for a difference of 4% in accuracy.
        ○ Reduction down to 47.2% for no accuracy loss.
        ○ Observation: Accuracy slightly higher than 1-NN for reduction down to 57.6%
  28. Experiments and Results
     ● Random Forest Results "2"
     ● Accuracy of 78.7% for 1, 2, 3, 4+ labels
     ● Accuracy of 45.4% for 1, 2, 3, 4, 5+ labels
     ● Accuracy of 20.7% for 1, 2, 3, 4, 5, 6+ labels
     ● Even worse with more partitioning
  29. Experiments and Results
     ● Random Forest Results "2"
     ● Confusion matrix for 1, 2, 3+: alphabet symbols can be separately classified.

     test label / result    1       2      3+       Recall
     1                      1131    88     14       91.9%
     2                      16      94     531      17.2%
     3+                     7       2      16627    99.9%
     % true positives       98%     51%    96.8%
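
For readers who want to recompute the per-class rates, here is a small sketch. It assumes that rows are test labels and columns are classifier results, which is a guess at the original table layout. Under that assumption the column-wise values agree with the "% true positives" row; the row-wise recall values may not match the slide exactly, since the layout was flattened in this transcript.

```python
labels = ["1", "2", "3+"]
# Assumption: rows are test labels, columns are classifier results.
matrix = [
    [1131, 88, 14],
    [16, 94, 531],
    [7, 2, 16627],
]

def recall_per_class(m):
    """Diagonal count divided by the row total."""
    return [m[i][i] / sum(m[i]) for i in range(len(m))]

def precision_per_class(m):
    """Diagonal count divided by the column total."""
    return [m[j][j] / sum(row[j] for row in m) for j in range(len(m))]

recall = recall_per_class(matrix)
precision = precision_per_class(matrix)
total = sum(sum(row) for row in matrix)

for name, r, p in zip(labels, recall, precision):
    print(f"class {name}: recall {r:.1%}, precision {p:.1%}")
```

The total number of entries in the matrix is consistent with the test dataset of ~18,000 samples stated for the Random Forest experiments.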
  30. Experiments and Results
     ● Alphabet symbols
        ○ Accuracy of 80.34% for Random Forests "3"
        ○ Accuracy of 98.74% for 1-NN classifier
        ○ 1-NN classifier can be used for recognition under class 1.
        ○ Over 30% of ligatures are individual characters.
  32. Conclusion and Summary
     ● Features vector can be improved.
     ● 1-NN improved efficiency by Spectral Hashing: significant reduction
     ● Random Forests: can be used to separate the 1-character alphabet symbols.
     ● Useful for overall performance improvement on real text data.
  33. Questions?
     Future Work
     Thank You
