Optical Character Recognition (OCR) based Retrieval

Presentation Outline
• Introduction
• Global Research Works
1) English OCR
2) Arabic OCR
3) Indian (Devanagari) OCR
• Local Research Works
1) Worku (1997)
2) Ermias (1998)
3) Dereje (1999)
4) Million (2000)
5) Nigussie (2000)
6) Yaregal (2002)
7) Million and Jawahar (2007)
8) Teshome (2009)
9) Abay (2010)
10) Yaregal and Bigun (2011)
• Conclusion
• Recommendation

Definition
“Optical character recognition (OCR) systems
take scanned images of paper documents as
input, and automatically convert them into
digital format for computer-aided data
processing”.

Architecture of an OCR system
• Preprocessing: binarization, filling, thinning, normalization, skew correction
• Recognition:
1. Template-matching and correlation
2. Feature based
• Post-processing:
I. Grouping
II. Error-detection & correction

Benefits of using OCR
• Improve accuracy of data entry
• Increase efficiency in data storage, retrieval
and processing
• Identify the specific form within a particular
application
• Read texts and produce a synthesized voice
translation for visually impaired users.

Applications of OCR
• Data entry
• Text entry
• Process automation
• Other applications
– Aid for visually impaired people
– Automatic plate number readers
– Automatic cartography
– Signature verification and identification

Issues in Developing OCR
• Noise in Input Image
• Skew in Input Image
• Images embedded with Text in Input Image

Introduction to Global OCR Systems
• Modern OCR technology started in 1951 with M. Sheppard's invention of GISMO, a robot reader-writer.
• Today, OCR systems are less expensive, faster, and more reliable.
• Current research areas of OCR:
– Handwriting recognition
– Form reading
• Research is particularly intensive on the reliable recognition of handwritten cursive script.

… Introduction cont’d
• Hundreds of OCR systems have been developed since the 1950s and many are commercially available today.
• Commercial OCR systems fall into two groups:
1. Task-specific readers handle only specific document types, such as bank checks, letter mail, or credit card slips. Examples: address readers, form readers, check readers, bill processing systems, airline ticket readers, passport readers.
2. General-purpose page readers are designed to handle a broader range of documents, such as business letters, technical writings and newspapers.

Handwritten English Character Recognition Using Neural Network
By Anita Pal & Dayashankar Singh (2010)
Identified Problem
Handwritten characters differ from one person to another, which makes them difficult to recognize.
Methods, Techniques and Algorithms
• Character scanning: the sample handwritten character is acquired by scanning and converted into 1024 (32 x 32) binary pixels.
• Skeletonization: applied to the binary pixel image to remove extra pixels and reduce broad strokes to thin lines.
• Normalization: a 30 x 30 standard size was used.
• Recognition algorithms:
– Feature extraction: boundary detection feature extraction technique.
– Classification of characters: a neural network is used.

… Handwritten English Character Recognition Using Neural Network
By Anita Pal & Dayashankar Singh (2010)
Performance Registered
• The system was tested using Fourier descriptors with the back-propagation technique and achieved a recognition accuracy of 94%.
• A sample of 250 was used for training and testing.
Further Research Directions
• Improve the recognition accuracy by implementing new feature extraction techniques or algorithms.

Recognition of on-line Arabic handwritten characters using structural features
Ahmad T. Al-Taani and Saeed Al-Haj (2010)
Identified Problem
Instances of the same character written by different persons are not identical; they vary in both size and shape.
Scope and Limitation of the Study
The proposed system works only on isolated Arabic letters.
Methods, Techniques and Algorithms
• Digitization: hand-held and digital tablet used as input devices.
• Feature extraction: structural features are extracted from each character.
• Recognition: decision trees were used to classify the characters based on the features extracted from the input character.

… Recognition of on-line Arabic handwritten characters using structural features
Ahmad T. Al-Taani and Saeed Al-Haj (2010)
Performance Registered
According to the experiment, the following performance was achieved:
• A recognition rate of about 75.3% for all letters (including letters containing sharp edges).
• An average recognition rate of up to 85.3% when letters containing sharp edges are excluded.
Further Research Directions
• Letters containing sharp edges are not recognized by the current system, so they remain open for further research.

Segmentation of Printed Text in Devanagari Script and Gurmukhi Script
By Vijay Kumar and Pankaj K. Sengar (2010)
Identified Problem
Touching characters cause segmentation errors that the classifier cannot properly handle.
Methods, Techniques and Algorithms
• Image categories and preprocessing: the researchers categorized scanned images as binary-level, pseudo-color or true-color images in order to standardize the input.
• Segmentation: stage-by-stage segmentation methods were used.

… Segmentation of Printed Text in Devanagari Script and Gurmukhi Script
By Vijay Kumar and Pankaj K. Sengar (2010)
Performance Registered
The proposed system achieved the following performance at different levels of text:
Level            Devanagari script   Gurmukhi script
Line             100%                100%
Word             ~100%               99%
Character        99%
Top character    97%                 ------
Further
Research
Directions

Introduction to local research
Our review of local research from various sources shows that most of the effort went into applying OCR to the Amharic language. The main focus areas of these studies are:
• the recognition of machine-printed, typewritten and handwritten Amharic documents, and
• the recognition of computer printouts and real-life documents such as books, magazines and newspapers.
The main objective of this presentation is therefore to review those works against specific criteria.

Application of OCR techniques to the Amharic text
Worku Alemu (1997)
Identified Problem
A huge amount of printed information is available in Amharic. However, some of these resources are irreplaceable, and accessing and modifying them is difficult.
Methods, Techniques and Algorithms
Preprocessing tasks
• Digitization: the printed text (document image) was digitized with a flatbed scanner (HP ScanJet IIc) at the standard resolution of 300 dpi.
• Segmentation: the stage-by-stage segmentation algorithm was used to segment lines and characters.
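The slides do not spell out the stage-by-stage segmentation algorithm. A minimal sketch of one common way to do it, assuming a binary image (1 = ink) and plain horizontal and vertical projection profiles, is shown here. The function names `runs` and `segment` and the sample `page` are hypothetical, not taken from Worku's thesis.

```python
def runs(profile):
    """Return (start, end) pairs of consecutive non-zero entries (end exclusive)."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def segment(image):
    """Split a binary image into text lines, then each line into characters.

    Returns one list per line; each entry is a (top, bottom, left, right)
    box with exclusive bottom/right indices.
    """
    width = len(image[0])
    # Stage 1: consecutive rows that contain ink form a text line.
    row_profile = [sum(row) for row in image]
    lines = []
    for top, bottom in runs(row_profile):
        # Stage 2: inside a line, consecutive inked columns form a character.
        col_profile = [sum(image[r][c] for r in range(top, bottom))
                       for c in range(width)]
        lines.append([(top, bottom, left, right)
                      for left, right in runs(col_profile)])
    return lines


# Hypothetical 6 x 7 page: two characters on the first line, one on the second.
page = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
boxes = segment(page)
```

A profile-based split like this fails on touching or overlapping characters, which is one reason later studies in this review modified the algorithm.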

… Application of OCR techniques to the Amharic text
Worku Alemu (1997)
Methods, Techniques and Algorithms
• Recognition algorithms: recognition is a pattern classification task that maps each character image onto its symbolic identity. Two algorithms were used to recognize characters:
– polygonal approximation and relaxation
– topological features, with a tree classification scheme built from the topological features of a character
• Implementation: the selected algorithms were coded in Turbo C++ for Windows.
Performance Registered
Of four test cases, one main test case was selected (a laser printout of Amharic text in the WashRa font, normal typestyle, 12-point size) and achieved a recognition accuracy of 98.87%.

… Application of OCR techniques to the Amharic text
Worku Alemu (1997)
Further Research Directions
• Application of pre-processing and post-processing techniques to detect and correct errors
• Segmentation of picture and text regions
• Recognition of text in forms, tables and pictures
• Recognition of characters printed in any color on paper of any color
• Recognition of handwritten Amharic text
• Recognition of formatted Amharic text

Recognition of formatted Amharic text using OCR techniques
Ermias Abebe (1998)
Identified Problem
The researcher set out to improve the algorithms adopted by Worku, which were developed for a single font with a fixed size, so that they cope with variations in font size and typestyle. The aim was to enable the algorithms to recognize characters printed in different sizes and formats.
Scope and Limitation of the Study
• The researcher's attempt was limited to the problems of size, underline and italics.
• The method was tested with only 231 basic Amharic characters.
• Only thinning, normalization and underline-removal techniques were used.
Methods, Techniques and Algorithms
1. Preprocessing tasks
• Digitization: a flatbed scanner at 300 dots per inch resolution was used.
• Thinning: the algorithm suggested by Zhang and Suen was used to convert the digital image into a unit-width image (skeleton).
• Underline detection and removal: removes underlines from segmented text lines.
• Normalization: the process of making all characters in the image the same size.
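For readers unfamiliar with the thinning step, here is a minimal pure-Python sketch of the published Zhang-Suen algorithm. It assumes a binary image stored as a list of lists with 1 = ink and treats off-image pixels as background. The function name `zhang_suen` and the sample `bar` are illustrative, and this is not the code used in the thesis.

```python
def zhang_suen(image):
    """Thin a binary image (1 = ink) towards a one-pixel-wide skeleton."""
    img = [row[:] for row in image]  # work on a copy
    rows, cols = len(img), len(img[0])
    # Neighbours P2..P9, clockwise, starting from the pixel directly above.
    offsets = [(-1, 0), (-1, 1), (0, 1), (1, 1),
               (1, 0), (1, -1), (0, -1), (-1, -1)]

    def neighbours(r, c):
        return [img[r + dr][c + dc]
                if 0 <= r + dr < rows and 0 <= c + dc < cols else 0
                for dr, dc in offsets]

    changed = True
    while changed:
        changed = False
        for step in (0, 1):  # the two sub-iterations of each pass
            to_clear = []
            for r in range(rows):
                for c in range(cols):
                    if img[r][c] != 1:
                        continue
                    n = neighbours(r, c)
                    p2, p4, p6, p8 = n[0], n[2], n[4], n[6]
                    b = sum(n)  # number of ink neighbours
                    # a = number of 0 -> 1 transitions around the pixel
                    a = sum(1 for x, y in zip(n, n[1:] + n[:1])
                            if x == 0 and y == 1)
                    if step == 0:
                        ok = p2 * p4 * p6 == 0 and p4 * p6 * p8 == 0
                    else:
                        ok = p2 * p4 * p8 == 0 and p2 * p6 * p8 == 0
                    if 2 <= b <= 6 and a == 1 and ok:
                        to_clear.append((r, c))
            # Deletions are applied together, after the whole scan.
            for r, c in to_clear:
                img[r][c] = 0
            changed = changed or bool(to_clear)
    return img


# Hypothetical test stroke: a bar three pixels thick.
bar = [[1 if 1 <= r <= 3 and 1 <= c <= 7 else 0 for c in range(9)]
       for r in range(5)]
skeleton = zhang_suen(bar)
```

Marking pixels first and clearing them afterwards is what makes the algorithm parallel: every decision in a sub-iteration is based on the same image state.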

… Recognition of formatted Amharic text using OCR techniques
Ermias Abebe (1998)
Methods, Techniques and Algorithms
Recognition algorithms
The researcher used two recognition algorithms to work with formatted text, selected for their accessibility and their supposed relevance to the problem at hand:
1. A generalized character recognition algorithm: a graphical approach
2. Symbol recognition without prior segmentation
Performance Registered
Adding preprocessing algorithms to Worku's work yields better performance.
Further Research Directions
• Skew detection and correction
• Recognition of typewritten texts and texts written on poor-quality paper
• Other algorithms that improve the segmentation and recognition of characters irrespective of size and style should be implemented and tested on Amharic script

Optical Character Recognition of Typewritten Amharic Character
Dereje Teferi (1999)
Identified Problem
The conventional way of converting text into electronic format is typing on a keyboard, which is tedious and, given the volume of documents, practically impossible. Moreover, typing an Amharic character on a computer needs two keystrokes on average, or worse. To tackle this problem, an OCR application capable of recognizing poor-quality typewritten Amharic characters is needed.
Scope and Limitation of the Study
• The study focuses on developing a segmentation algorithm that segments Amharic characters from a document and forwards the result for recognition.
• It was confined to the 231 characters of a mechanical typewriter.
Methods, Techniques and Algorithms
Preprocessing stages
• Feature extraction/detection: the features used for identifying and recognizing each character are extracted from contour/line analysis.

… Optical Character Recognition of Typewritten Amharic Character
Methods, Techniques and Algorithms
• Segmentation
– Recursive segmentation: an approach that merges segmentation and recognition recursively.
– Stage-by-stage segmentation: the researcher applied a modified stage-by-stage segmentation algorithm. A threshold value for the character width was set through experiment.
• Image restoration: the process by which a degraded image is repaired so that better recognition performance is achieved.
– Mathematical morphology: used by the researcher to remove salt-and-pepper noise.
– Binary morphological filter: used to remove subtractive and additive noise.
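The slides do not give the structuring element or the exact filter that was used. As an illustration only, the sketch below assumes a 3 x 3 square structuring element and shows how a morphological opening removes additive noise (isolated specks) and a closing removes subtractive noise (small holes). All names and the sample images are hypothetical.

```python
def _morph(img, rule):
    """Apply `rule` (all or any) to every 3 x 3 window; off-image pixels count as 0."""
    rows, cols = len(img), len(img[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [img[rr][cc] if 0 <= rr < rows and 0 <= cc < cols else 0
                      for rr in range(r - 1, r + 2)
                      for cc in range(c - 1, c + 2)]
            out[r][c] = 1 if rule(window) else 0
    return out


def erode(img):
    # A pixel survives only if its whole neighbourhood is ink.
    return _morph(img, all)


def dilate(img):
    # A pixel becomes ink if any pixel in its neighbourhood is ink.
    return _morph(img, any)


def opening(img):
    # Erosion then dilation: deletes specks smaller than the element.
    return dilate(erode(img))


def closing(img):
    # Dilation then erosion: fills holes smaller than the element.
    return erode(dilate(img))


# Hypothetical 9 x 9 test image holding a clean 4 x 4 blob of ink.
clean = [[1 if 3 <= r <= 6 and 3 <= c <= 6 else 0 for c in range(9)]
         for r in range(9)]
speckled = [row[:] for row in clean]
speckled[1][1] = 1      # additive ("salt") noise
holed = [row[:] for row in clean]
holed[4][4] = 0         # subtractive ("pepper") noise

despeckled = opening(speckled)
filled = closing(holed)
```

A 3 x 3 opening also erases strokes thinner than three pixels, so on real typewritten text the element size would have to be chosen relative to the stroke width.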
Performance Registered
The researcher reported that the OCR system produced a recognition accuracy of 53.47% for documents written with a mechanical typewriter.

… Optical Character Recognition of Typewritten Amharic Character
Further Research Directions
• A normalization technique should be integrated into the present Amharic OCR so that the system becomes size independent (a size-independent Amharic typewritten recognition system).
• Skew detection and correction algorithms should be developed.
• A form detection and removal algorithm that detects and extracts text from tables, forms, etc. should be developed.
• Algorithms that recognize text written in any color on any background should be developed.
• Recognition algorithms that are not very sensitive to the features of the characters should be developed.
• Algorithms for detecting formats such as indentation and bulleting, and restoring them after recognition, should be developed.

Handwritten Amharic text Recognition applied to The processing of Bank checks
By Nigussie Tadesse (2000)
This was the first reported attempt to apply handwritten Amharic text recognition to the processing of bank checks.
Identified Problem
The majority of CBE's clients are Ethiopians, so most requests are made in Amharic, and a large share of checks is filled in by hand using Amharic characters. The bank used a semi-automated process in which each handwritten check is keyed into the database by someone else using a keyboard. This activity is slow, costly and error-prone.
Scope and Limitation of the Study
• The research was limited to the recognition of handwritten legal amounts.
• The research does not cover the cents (fraction); it is confined to the birr part of the legal amounts.

… Handwritten Amharic text Recognition applied to The processing of Bank checks
Methods, Techniques and Algorithms
Preprocessing task
The researcher used the following algorithms:
1. Underline removal: stage-by-stage segmentation (adopted from Ermias (1998)) and connected component analysis were used to remove underlines.
2. Slant normalization: the chain code method was implemented to normalize the slant.
3. Recognition
• Size normalization: each character image is normalized to a fixed size.
• Training: the EasyNN neural network development tool was used to construct and experiment with different neural network architectures. It helps to create, control, train, validate and query different multilayer networks with the back-propagation algorithm.

… Handwritten Amharic text Recognition applied to The processing of Bank checks
Performance Registered
The classification accuracy of the networks was tested using 38 characters from the test data set. The first neural network, with 256 input, 7 hidden and 8 output nodes and trained with 135 sample characters, correctly classified 2 of the 38 test characters. The second network, with 256 input, 20 hidden and 8 output nodes and trained with 498 sample characters, correctly classified 16 of the 38.
Future Research Directions
• Apply further preprocessing so that the intra-class variability between characters is minimized.
• Train a neural network with sufficient data and consider other feature sets for training.
• This was the first attempt to apply OCR to Amharic bank checks and it did not validate all check-reading activities, so the area is open for future research.
Case   Network architecture (MLNN)   Training samples   Test samples   Correctly classified
1      256-7-8                       135                38             2
2      256-20-8                      498                38             16
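The slide reports raw counts only. Converting them to accuracy rates makes them comparable with the percentages quoted for the other studies: roughly 5.3% for the first network and 42.1% for the second. The short sketch below does that arithmetic; the dictionary keys are illustrative names, while the numbers come from the table above.

```python
experiments = [
    {"architecture": "256-7-8", "trained_on": 135, "tested_on": 38, "correct": 2},
    {"architecture": "256-20-8", "trained_on": 498, "tested_on": 38, "correct": 16},
]


def accuracy(correct, total):
    """Recognition accuracy as a percentage."""
    return 100.0 * correct / total


for exp in experiments:
    rate = accuracy(exp["correct"], exp["tested_on"])
    print(f'{exp["architecture"]}: {rate:.1f}% of {exp["tested_on"]} test characters')
```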

A Generalized Approach to Optical Character
Recognition (OCR) of Amharic texts
Million Meshesha (2000)
Identified Problem
To generalize the previously adopted recognition algorithm so that it is insensitive to different font types.
Scope and Limitations
• Only three commonly used font types are covered: Agafari, WashRa and Visual Geez.

… A Generalized Approach to Optical Character Recognition (OCR) of
Amharic texts (Million Meshesha)
Methods, Techniques and Algorithms
• Digitization: flatbed HP ScanJet scanner at 300 dpi resolution
• Binarization: threshold value of 112 intensity level
• Thinning: hybrid of parallel and Zhang-Suen thinning algorithms
• Segmentation: step-by-step segmentation with some modification
• Feature extraction and detection:
1. Topological features were extracted
2. A database was developed using a binary tree
… A Generalized Approach to Optical Character Recognition (OCR) of
Amharic texts (Million Meshesha)
Performance Registered
• 49.38% for WashRa
• 26.04% for Agafari_Addis Zemen
• 15.75% for Visual Geez
Future Research Directions
• An algorithm for form detection and removal
• Recognition of characters written in any color on paper of any color
• Development of post-processing techniques such as a spell checker, thesaurus, grammar, etc.
• Normalization of input image patterns
• Standardization of font types and their representation
• A mechanism to flag unrecognized and suspicious characters

Optical Character Recognition of Amharic Text: An
Integrated Approach
Yaregal Assabie (2002)
Identified Problem
To come up with a versatile algorithm that is independent of the font size and other quantitative parameters of Amharic characters.
Scope and Limitations
There is no previously well-formed algorithm.

… Optical Character Recognition of Amharic Text: An Integrated Approach
(Yaregal Assabie)
Methods, Techniques and Algorithms
• Digitization: Scanjet Pro scanner
• Segmentation: step-by-step segmentation with some modification
• Training patterns with neural network: BrainMaker and NetMaker (to convert text files)
• Feature extraction and detection: an improved primitive extraction algorithm was developed

… Optical Character Recognition of Amharic Text: An Integrated Approach
(Yaregal Assabie)
Performance Registered
Font size   Included in the training set   Not included
8           65.02%                         62.87%
12          74.68%                         81.07%
14          73.18%                         70.04%
Future Research Directions
• Detailed image preprocessing techniques need to be developed.
• An algorithm to detect forms, graphs and tables.
• Recognition of documents written in any color on any background color.
• Development of algorithms for primitive extraction, identification, and relationship/connection handling.
• Character image databases that represent different font types, styles and sizes should be developed for research purposes.
• Post-processing techniques such as spell checking, grammar and semantic analysis need to be incorporated.
• Standardization of the representation of Ethiopic characters.

Optical Character Recognition of Amharic Documents
Million Meshesha and C. V. Jawahar (2007)
Identified Problem
Challenges in building an OCR for African scripts:
• Degradation of documents
• Printing variations
• Large number of characters in the script
• Visual similarity of most characters in the script
• Language-related issues

… Optical Character Recognition of Amharic Documents
(Million Meshesha and C. V. Jawahar)
Methods, Techniques and Algorithms
• Digitization: flatbed HP7670 Scanjet scanner (300 dpi)
• Binarization: was done
• Skew: corrected within a range of ± 20%
• Segmentation: using different horizontal and vertical projections
• Normalization: standard size of 20 x 20
• Feature extraction and detection:
1. Principal Component Analysis (PCA)
2. Linear Discriminant Analysis (LDA)
• Classification: Support Vector Machine (SVM) based decision directed acyclic graph (DDAG) classifier
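A DDAG combines pairwise (one-against-one) classifiers: each node compares two candidate classes and eliminates the loser, so N classes need only N - 1 evaluations per sample. The sketch below shows that evaluation order only. The pairwise decision function is a toy nearest-centre rule standing in for trained SVMs, and the class names, centre values and function names are all hypothetical.

```python
def ddag_classify(sample, classes, pairwise):
    """Walk a decision DAG over pairwise classifiers.

    pairwise(a, b, sample) must return the preferred class, either a or b.
    Each step tests the first remaining candidate against the last one
    and drops the loser.
    """
    candidates = list(classes)
    while len(candidates) > 1:
        first, last = candidates[0], candidates[-1]
        if pairwise(first, last, sample) == first:
            candidates.pop()      # 'last' is eliminated
        else:
            candidates.pop(0)     # 'first' is eliminated
    return candidates[0]


# Toy stand-in for trained pairwise SVMs: nearest 1-D class centre.
centres = {"ha": 0.0, "le": 1.0, "me": 2.0, "se": 3.0}
calls = []


def nearest_of_two(a, b, x):
    calls.append((a, b))
    return a if abs(x - centres[a]) <= abs(x - centres[b]) else b


label = ddag_classify(1.9, list(centres), nearest_of_two)
```

With the large Amharic character set noted in the identified problem, the saving over evaluating all N(N - 1)/2 pairwise classifiers is substantial.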

… Optical Character Recognition of Amharic Documents
(Million Meshesha and C. V. Jawahar)
Performance Registered
• On average, 96.95% accuracy was obtained where paper and printing quality were reasonably good.
• On average, around 90% for degraded documents.
Future Research Directions
• The feasibility of designing a data-driven OCR
• Indexing and retrieval from degraded document images using only image properties (without explicit recognition)
• Recognition of other indigenous African scripts
• An approach to an intelligent OCR that can learn from its mistakes and improve its performance over time

Recognition of Amharic Braille
Teshome Alemu (2009)
Identified Problem
The large number of visually impaired people need a recognizer that converts Amharic Braille documents to print.
Scope and Limitations
The system works only on single-sided Braille documents.

… Recognition of Amharic Braille
(Teshome Alemu)
Methods, Techniques and Algorithms
• Digitization: flatbed scanner at 200 dpi
• Segmentation: mesh-grid
• Feature extraction and detection:
1. Modified region-based approach
2. Content analysis based on defined rules
• Classification: neural network classifier
Performance Registered
With all training and test sets, 92.5% accuracy.

Amharic CR System for Printed Real-Life Documents
Abay Teshager (2010)
Identified Problem
Applying robust preprocessing techniques to detect and remove various noise types and to simplify feature extraction.
Scope and Limitations
• Slant correction for Amharic characters was not considered.

… Amharic CR System for Printed Real-Life Documents
Abay Teshager
Methods, Techniques and Algorithms
• Digitization: flatbed BENQ scanner (300 dpi)
• Noise detection and removal: adaptive filtering method, MATLAB Image Processing Toolbox
• Binarization: Otsu global image thresholding
• Normalization: linear interpolation technique (20 x 20 size)
• Segmentation: stage-by-stage segmentation algorithm
• Thinning: hit-and-miss morphological analysis using the bwmorph() function (in MATLAB)
• Skew and slant correction: using Microsoft Paint Rotation Toolbox
• Representation: the binary representation of characters is fed to the neural network procedure (coded in MATLAB)
• Recognition: an Artificial Neural Network (ANN) is created, trained and refined
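Otsu's method picks the global threshold that maximizes the between-class variance of the grey-level histogram. The study used MATLAB. The pure-Python sketch below is only an illustration of the same idea, assuming 8-bit grey levels and a flat list of pixel values, and the function name and sample data are hypothetical.

```python
def otsu_threshold(pixels, levels=256):
    """Return the grey level t that maximises between-class variance.

    Pixels with value <= t form the dark class, the rest the light class.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0       # number of pixels in the dark class so far
    sum0 = 0     # sum of grey levels in the dark class so far
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue          # dark class still empty
        if w1 == 0:
            break             # light class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


# Hypothetical bimodal scan: dark ink around 20-30, light paper around 200-210.
pixels = [20] * 50 + [30] * 50 + [200] * 50 + [210] * 50
t = otsu_threshold(pixels)
ink = [1 if p <= t else 0 for p in pixels]
```

Unlike a fixed threshold, the value adapts to each scan, which matters for real-life documents of varying paper and print quality.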

… Amharic CR System for Printed Real-Life Documents
Abay Teshager
Performance Registered
• 96.87% recognition rate for test sets drawn from the training sets
• 11.40% recognition rate for new test sets
Future Research Directions
• Invariant shape feature extraction techniques should be developed.
• Apply advanced noise detection and removal algorithms for highly degraded Amharic document images.
• A better segmentation algorithm that tolerates space between connected characters.
• Implementation of word segmentation.
• Additional processing algorithms for skew detection and slant correction should be considered.

Offline handwritten Amharic word recognition
Yaregal Assabie, Josef Bigun (2011)
Identified Problem
The evaluation, decoding and training problems of the Hidden Markov Model (HMM) in compact notation.
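Two of the three HMM problems named above have compact textbook solutions: the forward algorithm for evaluation and the Viterbi algorithm for decoding. The sketch below shows both on a toy two-state model with made-up probabilities. It is not the model from the paper. Training, usually done with Baum-Welch, is omitted.

```python
def forward(obs, start, trans, emit):
    """Evaluation: probability of the observation sequence under the model."""
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][o]
                 for s in range(n)]
    return sum(alpha)


def viterbi(obs, start, trans, emit):
    """Decoding: the single most likely hidden state sequence."""
    n = len(start)
    delta = [start[s] * emit[s][obs[0]] for s in range(n)]
    back = []
    for o in obs[1:]:
        # Best predecessor of each state, judged on the previous delta.
        prev = [max(range(n), key=lambda p: delta[p] * trans[p][s])
                for s in range(n)]
        delta = [delta[prev[s]] * trans[prev[s]][s] * emit[s][o]
                 for s in range(n)]
        back.append(prev)
    state = max(range(n), key=lambda s: delta[s])
    path = [state]
    for prev in reversed(back):
        state = prev[state]
        path.append(state)
    return path[::-1]


# Toy model (all numbers are illustrative assumptions).
start = [0.6, 0.4]                    # initial state probabilities
trans = [[0.7, 0.3], [0.4, 0.6]]      # trans[i][j] = P(next = j | current = i)
emit = [[0.9, 0.1], [0.2, 0.8]]       # emit[i][k] = P(symbol k | state i)
observations = [0, 0, 1]

likelihood = forward(observations, start, trans, emit)
best_path = viterbi(observations, start, trans, emit)
```

In word recognition, evaluation scores how well a word model explains a feature sequence, and decoding recovers the most likely character-state alignment.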

… Offline handwritten Amharic word recognition
Yaregal Assabie, Josef Bigun
Methods, Techniques and Algorithms
• Digitization: a total of 307 pages were collected and scanned at a resolution of 300 dpi
• Image processing and feature extraction: Gaussian filters and derivatives of Gaussians
• Text line detection and word segmentation: direction field image
• Normalization: a symmetric Gaussian window of 5 x 5 pixels for noisy characters and 3 x 3 pixels for texts with small characters
• Recognition of unconstrained handwritten Amharic words: feature-level and HMM-level concatenation of characters

… … Offline handwritten Amharic word recognition
Yaregal Assabie, Josef Bigun
Performance Registered
[Recognition result for feature-level concatenation method]
[Recognition result for HMM-level concatenation method]
Future Research Directions
The recognition result can be further improved by employing language models in HMMs.

Conclusion
• Abroad, hundreds of OCR systems have been
developed since the 1950s and many are
commercially available today.
– Address Readers, Form Readers, Check Readers,
Airline Ticket Readers, Passport Readers, …
• Ethiopic handwriting recognition in general and
Amharic word recognition in particular, is one of
the least investigated problems and no
commercial OCR system is available.
• No OCR system has been developed for other local languages in Ethiopia

Recommendations
• Perform character recognition directly on grey-level images.
• A combined character recognition model is the only solution to practical problems.
• Promising techniques in this area deal with the recognition of entire words instead of individual characters.
• Future research should focus on linguistic and contextual information for further improvements.
• Recognize cursive script, that is, handwritten connected or calligraphic characters.

… Recommendations
1) OCR system for other local languages
2) Bi-lingual (Multi-lingual) OCR system
3) Automatic plate number recognition system
4) Integration of OCR and Speech Synthesizer, especially for visually impaired persons
5) Commercial OCR systems for
Passport reader
Bill Processing System
Airline ticket reader
Address readers, …

Can machines read human writing with the same fluency as humans?
Not yet!
