Khmer OCR
BarCamp, 22nd September 2012
LONG Seangmeng, Lecturer and researcher, GIC - ITC
firstname.lastname@example.org
Khmer OCR
• What is OCR?
• Khmer OCR Project
• State of the Art
• Khmer OCR System
• Project status
• Perspectives
Optical Character Recognition (OCR)
Text Image → OCR → Editable Text
Khmer OCR Project
• 2011
• Team
 – Dr. SENG Sopheap, ITC
 – Mr. LONG Seangmeng, ITC
 – Mr. EN Sovann (pursuing a master's degree)
 – Ms. PRUM Sophea (pursuing a PhD)
 – Mr. HAO Jeudi (5th year)
• Develop a Khmer OCR system
 – Font independent
 – Size independent
State of the Art
Author | Limitation | Result
CHEY Chanoeurn, KOSIN Chamnongthai and PINIT Kumhom | 10 characters (បបបបប បបបប ប) | 92%
CHEY Chanoeurn, KOSIN Chamnongthai and PINIT Kumhom | 20 fonts | 92.85% (size 22), 91.66% (size 18), 89.27% (size 12)
ING Leng Ieng and MUAZ Ahmed | Limon R1 22 | 98.88%
KRUY Vanna | Font and size independent (manual preparation for new fonts) | 97%
EN Sovann | Font and size independent (manual preparation for new fonts) | 96%
Khmer OCR System (cont.)
• Preprocessing
 – Binarization
 – Noise removal
 – Skew detection and correction
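The slides do not say which binarization method the project uses. As an illustration only, the sketch below assumes Otsu's global threshold, a common choice for scanned text, and the function names `otsu_threshold` and `binarize` are hypothetical. It takes a greyscale page as a list of rows of 0-255 values and returns 1 for ink and 0 for background.

```python
def otsu_threshold(gray):
    """Return the Otsu threshold for a 2-D list of 0-255 grey levels."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # sum of the grey levels of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # no dark pixels yet
        w1 = total - w0
        if w1 == 0:
            break             # every pixel is already in the dark class
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        # Between-class variance (up to a constant factor).
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray):
    """Map dark pixels (at or below the Otsu threshold) to 1, others to 0."""
    t = otsu_threshold(gray)
    return [[1 if v <= t else 0 for v in row] for row in gray]
```

A global threshold is only a reasonable starting point for clean scans; unevenly lit pages usually need a local (adaptive) threshold instead.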
Khmer OCR System (cont.)
• Segmentation
 – Page → Line (Line 1, Line 2) → Vertical Symbol → Blob
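The first two segmentation levels (page to lines, line to vertical slices) can be sketched with projection profiles. This is an assumption about the technique, not something the slides state, and `split_lines` and `split_columns` are hypothetical names. The input is a binarized page with 1 for ink and 0 for background. Within each vertical slice, blobs could then be found with connected-component labelling, which is not shown here.

```python
def split_lines(binary):
    """Return (top, bottom) row ranges of text lines, bottom exclusive."""
    lines, start = [], None
    for y, row in enumerate(binary):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                      # a text line begins
        elif not has_ink and start is not None:
            lines.append((start, y))       # a blank row ends the line
            start = None
    if start is not None:
        lines.append((start, len(binary)))
    return lines


def split_columns(binary, top, bottom):
    """Return (left, right) column ranges with ink inside rows top..bottom."""
    width = len(binary[0]) if binary else 0
    spans, start = [], None
    for x in range(width):
        has_ink = any(binary[y][x] for y in range(top, bottom))
        if has_ink and start is None:
            start = x
        elif not has_ink and start is not None:
            spans.append((start, x))
            start = None
    if start is not None:
        spans.append((start, width))
    return spans
```

Projection profiles assume the page has already been deskewed; on a skewed page the blank rows between lines disappear and the split fails.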
Khmer OCR System (cont.)
• Recognition
 – Training images (sample images) with label
 – Blob to be recognized → Search for closest match → Closest match (Image; Label: ស) …
Khmer OCR System (cont.)
• Recognition (cont.)
 – How to find closest match?
 – How to represent the blob image?
 • Fourier transform: any function $f(t)$ with period $T$ can be written as a Fourier series, $f(t) = \sum_{n=-\infty}^{\infty} c_n e^{2\pi i n t / T}$
 • Blob image => 2-D Fourier transform
 • The blob image (B) is represented by its Fourier coefficients: $B_1, B_2, B_3, \dots$
 • City block distance between two blobs B and B': $\text{Distance} = |B_1 - B'_1| + |B_2 - B'_2| + |B_3 - B'_3| + \dots$
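The recognition step can be sketched as follows. Several details are assumptions rather than facts from the slides: the features are the magnitudes of the `k` x `k` lowest-frequency 2-D DFT coefficients, they are normalized by the blob area, and the names `fourier_features`, `city_block` and `closest_label` are hypothetical. The transform is written out in pure Python to keep the example dependency-free; in practice `numpy.fft.fft2` would be the usual choice.

```python
import cmath


def fourier_features(blob, k=4):
    """Magnitudes of the k x k lowest-frequency 2-D DFT coefficients."""
    h, w = len(blob), len(blob[0])
    feats = []
    for u in range(k):
        for v in range(k):
            acc = 0j
            for y in range(h):
                for x in range(w):
                    if blob[y][x]:
                        angle = -2j * cmath.pi * (u * y / h + v * x / w)
                        acc += blob[y][x] * cmath.exp(angle)
            # Dividing by the area keeps blobs of different sizes comparable.
            feats.append(abs(acc) / (h * w))
    return feats


def city_block(a, b):
    """City block (L1) distance: |a1 - b1| + |a2 - b2| + ..."""
    return sum(abs(p - q) for p, q in zip(a, b))


def closest_label(blob, training):
    """Return the label of the training blob nearest to `blob`.

    `training` is a list of (label, blob) pairs.
    """
    target = fourier_features(blob)
    best = min(training,
               key=lambda item: city_block(target, fourier_features(item[1])))
    return best[0]
```

Using magnitudes discards the phase, so a blob shifted (circularly) inside its bounding box yields the same features. A real system would also precompute the training features once instead of recomputing them for every query.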
Project status
• Preprocessing
 – Binarization and noise removal √
 – Skew detection and correction X
• Segmentation √
• Recognition
 – Feature extraction √
 – Automatic generation of training data for new fonts √
• Post-processing
 – Assembling and reordering rules
 • Manual √
 • Automatic X
 – Spell checking X
• Performance evaluation X
Perspectives
• Joined characters
• Text layout
• Low-quality text images
• Curved lines
Thanks for your attention!
Demo & Questions?