Offline Omni Font Arabic Optical Text Recognition System using Prolog Classification Technique

This thesis proposes a complete system that classifies and recognizes machine-printed
Arabic text. The input to the system is a clean, high-resolution Tagged Image File Format
(TIFF) image that contains the Arabic text to be recognized; the output is the recognized
Arabic text saved in a Microsoft Word document (.DOC) file. The technique is based on
describing the text in terms of shape primitives derived from Freeman chain codes. A
rule-based data enhancement technique is used to improve the recognized features as much
as possible. The recognized features are processed by a Prolog feature-matching engine
that classifies character classes and diacritic information into three separate streams
(a character class stream, a diacritic stream and a corner information stream). In
addition to these three streams, the estimated font size is provided as a fourth input.
Characters are finally determined by processing a permutation of the three streams using
a Definite Clause Grammar (DCG).

  • Omnifont is a term commonly used in connection with OCR software. Omnifont recognition refers to the ability of computer software, usually OCR software, to read (or recognize) virtually any font that maintains fairly standard character shapes.
  • The field of Optical Character Recognition (OCR) is a branch of technology that deals with the automatic reading of text. The ultimate goal of OCR is to simulate the human ability to read both machine-printed and handwritten text. Currently, a machine can read faster than a human, but it cannot reliably read such a wide variety of texts. Humans are also better at reading highly distorted text or noisy (unclear) media. Therefore, a great deal of intensive research is still needed to narrow the gap between human and machine reading capabilities. OCR has been used in many practical areas that are independent of the language to which it is applied. One of the earliest and most successful applications was sorting checks in banks, where the volume of checks circulating daily proved too large to be handled by manual entry. Other examples include reading handwritten and printed postal codes, archiving and retrieving text, reading customers’ handwritten forms, and helping visually impaired people to read. Most of the work on OCR has been on Latin and Chinese characters. Work on Arabic character recognition started only recently and has advanced relatively slowly due to the complexity of recognizing Arabic text, whose characters are cursive in nature. Arabic character recognition is still an open and challenging field of research.
  • The extant literature includes a considerable number of studies on the recognition of writing in scripts and languages such as Latin, Chinese and Hebrew. Unfortunately, limited research has been done on the recognition of Arabic writing. This might be attributed to the peculiar aspects of Arabic script. For example, Arabic is written from right to left. This may not seem a major technical point, but it becomes an issue when an Arabic text contains foreign text in Latin script, such as French, which is written from left to right, and vice versa. Arabic writing is cursive: characters are generally joined to each other. This requires segmenting Arabic words and characters before any recognition step can be taken. In fact, segmentation and character isolation is the most difficult problem in Arabic character recognition schemes. An Arabic word may comprise both joined and separated characters. Twenty-two of the 29 sets of Arabic characters assume different shapes and sizes depending on their positions within the word. Some characters, such as “ع”, take four shapes, which are all different from each other: “ع, ع, ع, ع”. The character “ج” takes three shapes: “ج, ج, ج”. There are only seven characters that have a single shape regardless of their position within the word: “و, ذ, د, ظ, ط, ز, ر”. In general, characters, even those of the same group, differ in size and require different enclosing rectangular boxes.
  • Having conducted a thorough literature review, we started designing our system by experimenting with prior researchers’ techniques, adopting or modifying some of them if they met our requirements, but otherwise developing our own techniques. Consequently, the components of our system are either due to the work of others, the result of our improvement of others’ work, or our own completely new techniques. During the development phase of the system, many of the investigated techniques were rejected due to their ineffectiveness. This ineffectiveness may be due to inherent deficiency or due to incomplete description specified by their text sources.
  • Preprocessing Phase
    This phase is implemented in C and performs image processing with the help of open source image processing libraries such as LibTiff and OpenCV. In our proposed system, Arabic text images are obtained by optically scanning text printed on plain paper. The input data obtained by scanning printed text is almost always contaminated with noise and contains redundant information. Preprocessing includes digitalization, scaling, word-level segmentation, noise removal and elimination of redundant information as far as possible.
    Image information retrieval
    In this process, we load and read the input Tagged Image File Format (TIFF) image file as binary; retrieve the image properties (size, width, height, pixel resolution, image channels and image alignment); and create memory storage for intermediate processing.
    Image digitalization
    The image digitalization process applies fixed-level thresholding to convert the gray-scale or bitmapped TIFF image to a binary image (a representation of 0s and 1s). It computes the vertical and horizontal histograms to retrieve the number of lines per page and the number of components (words) per line. In this phase, we also calculate the font baseline and size. The font baseline is found by locating the maximum of the horizontal histogram of each line. This enables dots and other special marks such as Shadda, Madda and Tanween to be classified as upper or lower components relative to this baseline.
    Text line detection
    Text line detection is performed by scanning the input page image horizontally. The number of black pixels in each row is counted in order to construct the row histogram. A position between two consecutive lines where the number of black pixels in a row is zero denotes a boundary between the lines. It is assumed here that the text block contains only a single column of text.
    Word segmentation
    After a line has been detected, it is scanned vertically. To build the column histogram, the number of black pixels in each column is counted. If n consecutive columns contain no black pixels, we take this as a marker between two words. The value of n is determined experimentally. Figure 5 and Figure 6 show the preprocessing phase.
  • Our practical experimentation suggested applying a set of rules to the retrieved Freeman chain contours. We introduced an algorithm that removes noisy pixels falling within any straight line and converts Arabic character contours to approximately straight lines. These enhancement rules, which are derived from testing Arabic characters multiple times, reduce the time required for character recognition. The result of the algorithm is illustrated in Figure 10.

Presentation Transcript

  • Offline Omni Font Arabic Optical Text Recognition System using Prolog Classification Technique
    Rami Al-Sahhar
    Ideas for today and tomorrow
  • Agenda
    OCR Overview
    The Arabic OCR Problem
    OCR Challenges
    Proposed Solution
    Detailed system stages
    Sample Run
    Future Work
    Demo
  • OCR Overview
    Optical Character Recognition (OCR) is the process of converting an image of text, such as a scanned paper document, into computer-editable text
    The ultimate goal of OCR is to simulate the human ability to read both machine-printed and hand-written texts
    Most of the work on OCR has been on Latin and Chinese characters
    Arabic character recognition started recently and advanced relatively slowly due to the complexity of recognizing Arabic text, which has characters that are cursive in nature.
    Arabic character recognition is still an open and challenging field of research
  • The Arabic OCR Problem
    To propose a complete system that classifies and recognizes machine-printed Arabic text
    The input to the system is a TIFF image file
    The Arabic font size varies from 8 to 36
    The font type is Simplified Arabic or Traditional Arabic
    The image is scanned at a resolution of 300 dpi
    The output is editable text in a word processor (MS Word)
  • OCR Challenges
    Understanding TIFF image format and pixel representation
    Reading the TIFF image programmatically, pixel by pixel, from right to left
    Feature extraction
    Segmentation-free recognition
    Isolation of spaces, words, letters and lines
    Noise reduction
    Dots and holes
    Overlapped characters
  • OCR Challenges Arabic Character Characteristics
    Right to left
    Always cursive
    Change of character shape according to its location in the word
    Four different shapes
    28 basic characters: 15 with dots, 13 without
    No fixed character width and no fixed size
  • OCR Challenges Arabic Character Characteristics
    Group of Arabic character shapes
    A sample of written Arabic showing some of its characteristics
  • The Proposed Solution
    The proposed system starts from the document image acquisition stage and ends with recognized Arabic text in standard Simplified true type font format in MS Word 2007
    We started designing our system by experimenting with prior researchers’ techniques, adopting or modifying some of them if they met our requirements, but otherwise developing our own techniques
    Consequently, the components of our system are either due to the work of others, the result of our improvement of others’ work, or our own completely new techniques.
  • The Proposed Solution
    ATR (Arabic Text Recognition) system model
    C-based stages: TIFF image file → preprocessing → feature extraction → post enhancement
    Prolog-based stages: classification and recognition → recognized text
  • The Proposed Solution
    Preprocessing Phase
    Digitalization, scaling, word-level segmentation, noise removal and elimination of redundant information as far as possible
    Image information retrieval
    Load and read the input TIFF image file as binary; retrieve the image properties (size, width, height, pixel resolution, image channels and image alignment); and create memory storage for intermediate processing
    Image digitalization
    Digitizes the TIFF image in order to apply fixed-level thresholding and to convert the gray-scale and bitmapped image to a binary (0’s and 1’s representation) scale image
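The digitalization step can be illustrated with a minimal Python sketch of fixed-level thresholding. The original phase is written in C; the function name `binarize`, the threshold value 128 and the convention that 1 marks a black (ink) pixel are assumptions made for illustration only.

```python
def binarize(gray, threshold=128):
    """Fixed-level thresholding of a gray-scale image given as rows of 0..255 values.

    Returns a binary image where 1 marks a dark (ink) pixel and 0 the background.
    The threshold of 128 is an assumed value, not one taken from the thesis.
    """
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]


gray_image = [
    [255, 250, 12, 240],
    [30, 20, 15, 200],
    [255, 140, 127, 128],
]
binary_image = binarize(gray_image)
for row in binary_image:
    print(row)
```

A fixed threshold is the simplest choice and fits the clean, high-resolution scans the system expects; noisier input would probably need an adaptive threshold instead.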
  • The Proposed Solution
    It computes the vertical and horizontal histograms to retrieve the number of lines per page and the number of components (words) per line
    We calculate the font baseline and size by finding the maximum horizontal histogram of each line per page
    This enables the dots or other special characters such as Shadda, Madda, and Tanween to be classified as upper or lower components related to this baseline
    Text line detection
    Word segmentation
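The histogram-based line detection, baseline estimation and word segmentation described on this slide can be sketched as follows. This is a minimal Python illustration, not the thesis code (which is in C); the helper names and the tiny test page are hypothetical, and the gap width `n` is passed in because the source only says it is chosen experimentally.

```python
def row_histogram(binary):
    """Number of black pixels (1s) in each row."""
    return [sum(row) for row in binary]


def column_histogram(binary, top, bottom):
    """Number of black pixels in each column, restricted to rows top..bottom-1."""
    width = len(binary[0])
    return [sum(binary[y][x] for y in range(top, bottom)) for x in range(width)]


def find_runs(hist, min_gap=1):
    """Return (start, end) ranges (end exclusive) of non-empty stretches.

    A stretch is closed only after min_gap consecutive empty entries,
    so shorter gaps stay inside the same run.
    """
    runs, start, zeros = [], None, 0
    for i, value in enumerate(hist):
        if value > 0:
            if start is None:
                start = i
            zeros = 0
        elif start is not None:
            zeros += 1
            if zeros >= min_gap:
                runs.append((start, i - zeros + 1))
                start, zeros = None, 0
    if start is not None:
        runs.append((start, len(hist) - zeros))
    return runs


def detect_lines(binary):
    """Text lines are separated by rows that contain no black pixels."""
    return find_runs(row_histogram(binary), min_gap=1)


def baseline_row(binary, top, bottom):
    """The baseline is taken as the row with the most black pixels in the line."""
    hist = row_histogram(binary[top:bottom])
    return top + hist.index(max(hist))


def segment_words(binary, top, bottom, n):
    """Words are separated by at least n consecutive empty columns."""
    return find_runs(column_histogram(binary, top, bottom), min_gap=n)


page = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 0, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 0, 0, 1, 0],
]
for top, bottom in detect_lines(page):
    print("line", (top, bottom),
          "baseline", baseline_row(page, top, bottom),
          "words", segment_words(page, top, bottom, n=2))
```

The column ranges come out in left-to-right order; since Arabic is read from right to left, a caller would reverse each line's word list to get reading order.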
  • The Proposed Solution
    B&W image is found in file name: [ test1.tif]
    Processing a [1615x2160] image with [1] channel(s)
    Image Origin : [Top-left Origin] , Align : [4-]
    Data Order :[Interleaved Color Channels]
    Number of Lines(s) found: [6]
    Line #0 , Y = 78 , Height = [67]
    Line #1 , Y = 185 , Height =[ 67]
    Line #2 , Y = 292 , Height = [67]
    Line #3 , Y = 399 , Height = [67]
    Line #4 , Y = 506 , Height = [67]
    Line #5 , Y = 613 , Height = [67]
    Font Baseline =[ 38 pixels]
    Number of Components found at Image Line #0 : [9]
    Number of Components found at Image Line #1 : [14]
    Number of Components found at Image Line #2 : [16]
    Number of Components found at Image Line #3 : [18]
    Number of Components found at Image Line #4 : [10]
    Number of Components found at Image Line #5 : [6]
    Preprocessing phase
    Text Line Detection
  • The Proposed Solution
    Preprocessing phase
    Word Segmentation
    Number of retrieved contours : [2]
    ************* Bounding Rectangle (1,22)-(72,65) ************
    Component [1] Origin Y = 22 , Height = 43 Area = 429.000000
    Component [2] Origin Y = 47, Height = 7 Area = 19.000000
    Max Component Area = 429.000000 , Y = 22 , H = 43
  • The Proposed Solution
    Feature Extraction Phase
    Feature extraction is the most challenging part of character or text recognition
    The choice of good features significantly improves the recognition rate and minimizes the error in case of noise
    The main selected features are :
    Outer contours described in Freeman chain codes
    Contours’ corners
    Dot information
    Estimated font size
    All of these features are extracted for all detected components during the page scanning
  • The Proposed Solution
    Freeman chain code
    The chain code was introduced by Freeman as a means of representing lines or boundaries of shapes by a connected sequence of straight-line segments of specified length and direction
    An example of the 8-connectivity chain code
    Chain code numbering schemes
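As a small illustration of 8-connectivity chain coding, the Python sketch below converts a list of boundary points into Freeman codes. The numbering used here (0 = east, increasing counter-clockwise, with image y growing downward) is one common scheme and is an assumption; the slides mention several numbering schemes without stating which one the system uses.

```python
# Assumed numbering: 0 = east, then counter-clockwise in 45-degree steps.
# Image coordinates are used, so dy = -1 means one pixel up.
DIRECTION_TO_CODE = {
    (1, 0): 0, (1, -1): 1, (0, -1): 2, (-1, -1): 3,
    (-1, 0): 4, (-1, 1): 5, (0, 1): 6, (1, 1): 7,
}


def chain_code(points):
    """Encode a path of 8-connected (x, y) points as a list of Freeman codes.

    Raises KeyError if two consecutive points are not 8-neighbours.
    """
    return [
        DIRECTION_TO_CODE[(x1 - x0, y1 - y0)]
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    ]


# Boundary of a small square, traced from its top-left corner.
square = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1), (0, 0)]
print(chain_code(square))
```

Each code depends only on the step between two neighbouring boundary pixels, so the encoding does not change when the shape is moved within the image.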
  • The Proposed Solution
    Contour extraction process
    This is the core process to extract the main word-level features of the Arabic text in Freeman Chain code format
    After extracting the Freeman codes, we aggregate those codes into pairs (X, Y), where X is the direction (i.e. from 0 to 7) and Y is the length in pixels
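The aggregation into (direction, length) pairs is a run-length encoding of the chain code. A minimal Python sketch follows; the function name `to_pairs` is hypothetical, and the sample input is the start of the chain code printed on the following slide.

```python
from itertools import groupby


def to_pairs(codes):
    """Run-length encode Freeman codes into (direction, length) pairs."""
    return [(direction, len(list(group))) for direction, group in groupby(codes)]


# First 27 codes of the contour shown on the next slide.
prefix = [2] * 14 + [3, 2, 3, 2, 3, 6, 5, 6, 6, 5, 7, 7, 7]
print(to_pairs(prefix))
```

For this prefix the result agrees with the first eleven pairs listed on the slide, starting with (2,14).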
  • The Proposed Solution
    Contours Freeman Chain Codes
    [2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,2,3,2,3,6,5,6,6,5,7,7,7,6,7,6,6,6,6,5,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,3,4,3,1,1,1,1,2,2,2,3,4,4,3,4,4,4,4,4,4,4,4,5,4,6,5,6,6,6,1,0,0,7,7,7,5,4,5,4,4,5,4,4,4,4,4,3,2,2,2,2,2,2,2,2,2,3,2,3,2,3,6,5,6,5,7,6,7,7,6,7,6,6,6,6,5,4,4,4,4,4,4,4,4,4,4,4,4,4,3,2,2,2,2,2,2,2,3,2,2,2,3,2,3,2,3,3,3,3,4,3,6,6,6,6,7,0,7,7,7,7,6,7,6,6,6,7,6,6,6,6,6,5,4,4,4,4,4,4,4,4,4,4,4,4,4,4,3,3,6,6,6,6,7,7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,0,1,7,7,7,0,7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
    Contours Freeman Chain Code Pairs - Aligned
    Total Pairs : [95] ==> [(2,14),(3,1),(2,1),(3,1),(2,1),(3,1),(6,1),(5,1),(6,2),(5,1),(7,3),(6,1),(7,1),(6,4),(5,1),(4,19),(3,1),(4,1),(3,1),(1,4),(2,3),(3,1),(4,2),(3,1),(4,8),(5,1),(4,1),(6,1),(5,1),(6,3),(1,1),(0,2),(7,3),(5,1),(4,1),(5,1),(4,2),(5,1),(4,5),(3,1),(2,9),(3,1),(2,1),(3,1),(2,1),(3,1),(6,1),(5,1),(6,1),(5,1),(7,1),(6,1),(7,2),(6,1),(7,1),(6,4),(5,1),(4,13),(3,1),(2,7),(3,1),(2,3),(3,1),(2,1),(3,1),(2,1),(3,4),(4,1),(3,1),(6,4),(7,1),(0,1),(7,4),(6,1),(7,1),(6,3),(7,1),(6,5),(5,1),(4,14),(3,2),(6,4),(7,2),(0,38),(1,1),(0,2),(1,1),(0,1),(1,1),(0,1),(1,1),(7,3),(0,1),(7,1),(0,22)]
    Contour Corners Positions:
    [5,18,23,33,37,60,67,81,87,90,92,103,113,118,132,146,160,162,168,172,178,180,189,202,206,210,258,263]
    Feature extraction: Freeman chain codes, pairs and corner positions
  • The Proposed Solution
    Corner Detection
    This phase detects and extracts the component’s contour corners of the text under processing
    It is based on an implementation of contour detection and curve representation by a circular local histogram of the contour chain code, presented by Arrebola, Camacho, Bandera, & Sandoval (1999)
    The corner detection phase is very important for the next classification and recognition phase
    It helps our Prolog engine to determine the unique shape of the character’s feature regardless of the character orientation
    The output of this phase is a stream of corner information to be input for the next phase
  • The Proposed Solution
    • Contour enhancement
    • We introduced an algorithm to remove noisy pixels which come within any straight line, and to convert Arabic characters to approximately straight lines.
    • These enhancement rules, which are derived from testing Arabic characters multiple times, reduce the time required for character recognition
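The slides do not list the actual enhancement rules, so the following Python sketch shows only one plausible rule of this kind, stated as an assumption: a very short run that sits between two runs of the same direction is treated as noise and absorbed into the surrounding straight line. The function name `smooth_pairs` and the `max_noise` parameter are hypothetical.

```python
def smooth_pairs(pairs, max_noise=1):
    """Absorb short deviating runs that interrupt a straight line.

    pairs is a list of (direction, length) tuples. A run of at most max_noise
    pixels whose neighbours on both sides share one direction is merged into
    them. This is an assumed rule, not one documented in the source.
    """
    result = []
    i = 0
    while i < len(pairs):
        direction, length = pairs[i]
        is_noise = (
            result
            and i + 1 < len(pairs)
            and length <= max_noise
            and direction != result[-1][0]
            and result[-1][0] == pairs[i + 1][0]
        )
        if is_noise:
            # Merge the noisy run and the run after it into the previous run.
            prev_dir, prev_len = result[-1]
            result[-1] = (prev_dir, prev_len + length + pairs[i + 1][1])
            i += 2
        elif result and result[-1][0] == direction:
            # Adjacent runs with the same direction are joined.
            result[-1] = (direction, result[-1][1] + length)
            i += 1
        else:
            result.append((direction, length))
            i += 1
    return result


# A nearly horizontal stretch taken from the pair list on the earlier slide.
noisy = [(0, 38), (1, 1), (0, 2), (1, 1), (0, 1), (1, 1), (0, 1), (1, 1), (7, 3)]
print(smooth_pairs(noisy))
```

The total run length is preserved, but replacing a diagonal step with a straight one shifts the contour slightly, so this is an approximation that trades geometric accuracy for a shorter feature sequence.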
  • The Proposed Solution
    Dot Detection and Font Size Estimation
    Dot detection is another challenging and important task
    It helps the Prolog classification engine to recognize words that include dots and to decide which characters are undotted by nature
    Font size estimation is a critical task in omni-size character recognition systems
    Font size estimation is usually used to find the pen width in online recognition systems
    Approaches for estimating the font size from dots are presented in Shirali-Shahreza & Shirali-Shahreza (2006)
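One way to picture dot detection is to separate the largest connected component (the word body) from the smaller ones and to compare each small component's vertical centre with the font baseline. The Python sketch below does this under several assumptions: every non-body component is treated as a dot, the baseline and the component coordinates are measured from the top of the same text line, and the names `Component` and `classify_components` are hypothetical. The sample values are taken from the word segmentation log and the baseline reported on the earlier slides.

```python
from dataclasses import dataclass


@dataclass
class Component:
    y: int        # top edge in pixels, measured downward from the top of the line
    height: int   # height in pixels
    area: float   # contour area


def classify_components(components, baseline_y):
    """Split components into the word body and dots above or below the baseline.

    The component with the largest area is assumed to be the body; all other
    components are assumed to be dots (a simplification, since marks such as
    Shadda would need their own handling).
    """
    body = max(components, key=lambda c: c.area)
    dots = []
    for component in components:
        if component is body:
            continue
        centre = component.y + component.height / 2
        dots.append("above" if centre < baseline_y else "below")
    return body, dots


body, dots = classify_components(
    [Component(y=22, height=43, area=429.0), Component(y=47, height=7, area=19.0)],
    baseline_y=38,
)
print(body, dots)
```

Because the image origin is at the top left, a larger y value means a lower position on the page, which is why the centre is compared with `<` for dots above the baseline.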
  • The Proposed Solution
    Font size against calculated component’s height (in pixels)
  • The Proposed Solution
    Definite Clause Grammar (DCG):
    Provides a mechanism for defining the grammar rules of a language
    These rules are automatically translated to a Prolog program which defines a parser for the language being defined
    Grammar rules are a feature only in some Prolog systems, and are designed to facilitate the parsing of natural language
    Using this notation, a grammar is represented as a set of logical rules
    When the DCG rules are consulted (or optimized), they are translated into Prolog clauses
  • The Proposed Solution
    Word-level Classification and Recognition Phase
    This is the most critical phase in our proposed ATR system
    It is written in Prolog language using Prolog matching, backtracking and DCG techniques
    The input for this phase is data on two features :
    The first input stream is the corner sequence of the word-level outer contours for each component that represents the elevation information of the input stream (the upper part that holds most of the features)
    The second input stream is the dot information found in the same component
  • The Proposed Solution
    The Prolog matching and backtracking techniques also use the corner sequence stream to classify the unknown inputs into character classes, while the Prolog DCG technique uses the dot information stream to recognize the actual Arabic letters of a particular character class
  • The Proposed Solution
    DCG implementation: The DCG grammar structure and some of the character classes are described below :
    % DCG part for Arabic text recognition based on two input streams
    % usage: phrase(s(R),[m,h_c,d1,m,dc]).
    s([H|T]) --> cc(H), subs(T). % every string is a character class followed by a sub-string
    s(R) --> cc(R). % or a string can be simply a character class
    subs(R) --> s(R). % a substring is nothing but a string (recursively)
    cc(R) --> ch(R). % a character class can be a simple character
    % or character classes can belong to any of the following classes
    cc(R) --> bc(R). % Ba class (ba, ta, tha, ya_md)
    cc(R) --> h_c(R). % H_ class (h_, jeem, kha)
    cc(R) --> dc(R). % Dal Class (dal, thal)
    cc(R) --> rc(R). % Ra' Class (ra, zay)
    cc(R) --> sc(R). % Seen Class (seen, sheen)
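The idea of resolving a character class to a concrete letter from its dot information can be sketched outside Prolog as a simple lookup. The Python example below is not the DCG used by the system; the tuple layout (class, dot count, dot position), the function names and the use of `None` for an unresolved combination are assumptions for illustration. The class names follow the comments in the grammar above, and the dot patterns are those of the standard Arabic alphabet.

```python
# (character class, number of dots, dot position) -> letter name
LETTERS = {
    ("bc", 1, "below"): "ba",
    ("bc", 2, "above"): "ta",
    ("bc", 3, "above"): "tha",
    ("bc", 2, "below"): "ya_md",
    ("h_c", 0, None): "h_",
    ("h_c", 1, "below"): "jeem",
    ("h_c", 1, "above"): "kha",
    ("dc", 0, None): "dal",
    ("dc", 1, "above"): "thal",
    ("rc", 0, None): "ra",
    ("rc", 1, "above"): "zay",
    ("sc", 0, None): "seen",
    ("sc", 3, "above"): "sheen",
}


def resolve(char_class, dot_count=0, position=None):
    """Return the letter for a class and its dot information, or None if unknown."""
    return LETTERS.get((char_class, dot_count, position))


def recognize(stream):
    """Resolve a sequence of (class, dot_count, position) items to letter names."""
    return [resolve(*item) for item in stream]


word = [("h_c", 1, "below"), ("dc", 0, None), ("sc", 3, "above")]
print(recognize(word))
```

A table like this cannot backtrack over ambiguous class assignments, which is one reason a grammar-based approach with Prolog backtracking is a natural fit for the original system.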
  • The Proposed Solution
    Microsoft Word Document Integration
    This is the final phase of our optical text recognition system
    It is written in Prolog and interfaces with Microsoft Word
    It uses the Microsoft Word document API to write the recognized characters into a new Word document
    It writes the output text in the same recognized font size in a predefined font type
    It also writes the white spaces and new lines to maintain the same original text alignment and format
  • The Proposed Solution
    Detailed system stages (diagram)
    Document image (TIFF file)
    Image information retrieval: height/width, pixel resolution
    Image digitalization: font baseline, font size
    Word-level segmentation: lines per page, words per line, component coordinates (X, Y)
    Word-level contour extraction: Freeman chain codes; component area, height and width
    Contour enhancement
    Corner detection and dots detection: character shape stream, dot information stream
    Character-shape Prolog matching with the character reference database
    DCG engine
    Word-level recognition
    MS Word document integration
    Recognized text (Word document)
  • Sample Run
    Original TIFF image with Arabic text
    The recognized Arabic text in MS Word 2007
  • Future Work
    Support more Arabic font types
    Support more image types (GIF, BMP, JPEG, etc.)
    Support different font sizes on the same page
    Support Arabic and English fonts together, plus numeric and special characters
    Support Spellchecker and word suggestions
    Implement the system as Arabic Business Card reader
    Capture and Recognize feature for iPhone
  • Demo
  • OCR Applications
    Industries and Institutions in which control of large amounts of paper work is critical
    Banking, Credit cards, Insurance industries
    The medical community
    To capture, store and transmit radiology images
    Libraries and archives
    For conservation and preservation of vulnerable documents and for the provision of access to source documents
  • Thank You
    rami.sahhar@gmail.com