International Journal on Computational Sciences & Applications (IJCSA) Vol.4, No.1, February 2014
DOI:10.5121/ijcsa.2014.4103
PERSIAN/ARABIC DOCUMENT SEGMENTATION
BASED ON HYBRID APPROACH
Seyyed Yasser Hashemi¹ and Parisa Sheykhi Hesarlo²
¹,² Department of Computer Engineering, Miyandoab Branch, Islamic Azad University, Miyandoab, Iran
ABSTRACT
Document segmentation is an essential step in the automatic transformation of paper documents into
electronic documents. However, variations in character font size, differing text line spacing, and
non-uniform document layout structures make designing a general-purpose document layout analysis
algorithm difficult. Most previously reported methods therefore depend on parameters tied to these
properties. The problem is more acute for Persian/Arabic documents: because the Persian/Arabic
scripts differ considerably from the English script, most methods proposed for English do not give
good results for Persian/Arabic. In this paper, we present a novel parameter-free hybrid method for
segmenting Persian/Arabic document images, which also works well for English documents. The proposed
method segments documents without regard to character font size, text line spacing, document skew,
or document layout structure. The algorithm was tested on 150 Persian/Arabic and English documents,
and segmentation succeeded for 97.3 percent of them in the worst case.
KEYWORDS
Persian/Arabic document, document segmentation, Pyramidal Image Structure.
1. INTRODUCTION
Document segmentation is an important step in Optical Character Recognition (OCR) systems: the
document image is divided into homogeneous zones, each consisting of only one physical layout
structure, such as text, graphics, or pictures. The performance of an OCR system therefore depends
heavily on the document segmentation algorithm it uses. Several document segmentation algorithms
have been proposed during the last three decades [1-10].
The various approaches to document segmentation are typically categorized as "bottom-up",
"top-down", and "textural analysis" methods. The "bottom-up" methods (F. Legourgiois et al.,
1992; D. Drivas et al., 1995; A. Simon et al., 1997) start from pixels or connected components,
determine the words, merge the words into text lines, and finally merge the text lines into
paragraphs. The main disadvantage of these approaches is that identifying, analyzing, and grouping
connected components are, in general, time-consuming processes, especially when there are many
components in the image. The "top-down" approaches (J. Ha et al., 1995; J. Ha et al., 1995;
Yi Xiao et al., 2003; Jie Xi et al., 2002; Rafi Cohen et al., 2013) look for global information,
e.g. black and white stripes on the page, and use it to split the page into columns, the columns
into blocks, the blocks into text lines, and finally the text lines into words. The most important
advantages of these methods are their low time complexity compared with the "bottom-up" approaches
and their natural coarse-to-fine view, which is the view preferred by the human eye. On the other
hand, in "top-down" techniques
it is difficult to segment complex document layouts that include nonrectangular images and various
character font sizes. Some other proposed document segmentation methods (A. Jain and Y. Zhong,
1996; A. Jain and S. Bhattacharjee, 1992; M. Acharyya and M. K. Kundu, 2002) treat the homogeneous
regions of the document image, such as text, image, or graphics, as textured regions. Document
segmentation is then based on the textured regions found in gray-scale images. Junxi Sun et al.
(2008) propose a texture-based Bayesian document segmentation method, in which a Bayesian approach
fuses the texture likelihood with prior contextual knowledge. The texture likelihood is based on a
complex wavelet domain hidden Markov tree (HMT) model, and the prior contextual knowledge is based
on a hybrid tree model. Very high time complexity is the main problem of these texture-based
approaches, since many masks are used to extract local features and different tuned filters are
used to capture the desired local spatial frequency and orientation characteristics of a textured
region. Because Persian documents have some special characteristics that do not exist in English
documents, the aforementioned methods cannot be used directly for Persian document segmentation.
The special characteristics of Persian documents are as follows:
• Persian script is cursive, and each connected component may include more than one character.
The arrangement and size of the components may also vary greatly.
• There are 32 basic characters in the Persian alphabet. These characters may change shape
according to their position in the word (beginning, middle, end, or isolated). Since each character
can take up to four different shapes, there are 114 different shapes across the whole Persian
alphabet.
• Dots are another characteristic of Persian script. Most Persian characters have one, two, or
three dots, which may be placed above, inside, or below the character.
From the script identification point of view, these characteristics mean that word sizes in Persian
script are non-uniform: the size of a word varies with the number of cursive characters and dots it
contains.
In this paper, we propose a novel method for Persian/Arabic document segmentation using
pyramidal image structure. This paper is organized as follows: In Section 2, the proposed
algorithm is described in detail. Experimental results of the proposed algorithm are presented in
Section 3. Finally, the paper is concluded in Section 4.
2. MATERIALS AND METHODS
Many document segmentation algorithms have been presented for English documents, but most of them
do not give good results for Persian/Arabic documents because of the differences described above.
To make these methods suitable for Persian script, some of their parameters must be specially
tuned. We propose a parameter-free segmentation method for Persian/Arabic documents that removes
this restriction and performs well in the experiments reported in Section 3. The proposed method
can segment Persian documents composed of different font sizes, different line spacings, and
different layout structures. The low-resolution version of the document is processed first, and
the high-resolution version is then analyzed in detail; this reflects the pyramidal nature of the
method. The original image is analyzed and a pyramidal tree structure (images at different
resolutions) is created. In the next phase, the bounding boxes are extracted from images
at the lowest possible resolution, and candidate regions are selected by removing the overlapping
areas of the bounding boxes. The regions are then classified by horizontal and vertical analysis
and further investigated by textual and statistical analysis of the high-resolution sample, in
order to recognize the different regions of the document, such as text, image, and drawing/table.
Fig. 1 shows the approach followed in this paper.
2.1. Document skew detection and correction (SDC)
In this step, the skew angle of the document (θ) must be estimated. The proposed method uses a
document SDC based on the Centre of Gravity (COG). The first step in determining the skew angle is
Baseline Identification (BI), which is therefore the most important step in this process. The
baseline of the document is a line that passes through the COG along the horizontal axis, and the
angle between the baseline and the horizontal line determines the skew angle. In this algorithm, we
detect the skew angle by finding the Actual Region of Document (ARD) using connected component
analysis, identifying its COG, and identifying the baseline of the document. The algorithm steps
are as follows:
• Connected Component (CC) identification.
• Identification of the ARD. For this purpose, the four CCs with the greatest distance from the
corners C0, C1, C2 and C3 (shown in Fig. 2(c)) are selected.
• Finding the COG. The COG is calculated using (1).
$$
\begin{aligned}
COG_x &= \frac{1}{6A}\sum_{i=0}^{N-1}(x_i + x_{i+1})(x_i y_{i+1} - x_{i+1} y_i) \\
COG_y &= \frac{1}{6A}\sum_{i=0}^{N-1}(y_i + y_{i+1})(x_i y_{i+1} - x_{i+1} y_i) \\
A &= \frac{1}{2}\sum_{i=0}^{N-1}(x_i y_{i+1} - x_{i+1} y_i)
\end{aligned}
\tag{1}
$$
where 'A' is the area of the polygon.
• Baseline identification. The baseline is the line from the COG to the midpoint of the line that
connects the upper-left and lower-left corners of the ARD.
• Calculation of the document skew angle, which is the angle between the baseline and the
horizontal line that passes through the midpoint.
• Rotation of the document (see Fig. 2).
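As an illustration of the COG and skew-angle steps, the following is a minimal Python sketch. It assumes the ARD is given as an ordered list of polygon vertices and reads (1) as the standard polygon centroid (shoelace) formula. The function names and the use of `atan2` for the angle are assumptions made for illustration, not the authors' implementation, and the sign of the angle depends on whether the y axis points up or down.

```python
import math

def polygon_centroid(vertices):
    """Centroid (COG) and signed area of a simple polygon, following (1)."""
    n = len(vertices)
    twice_area = 0.0
    cx = cy = 0.0
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]  # wrap around to close the polygon
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    area = twice_area / 2.0
    if area == 0:
        raise ValueError("degenerate polygon")
    return cx / (6.0 * area), cy / (6.0 * area), area

def skew_angle_degrees(cog, midpoint):
    """Angle between the baseline (midpoint -> COG) and the horizontal, in degrees."""
    dx = cog[0] - midpoint[0]
    dy = cog[1] - midpoint[1]
    return math.degrees(math.atan2(dy, dx))

# Hypothetical ARD: an axis-aligned 4x4 square, so no skew is expected.
cog_x, cog_y, area = polygon_centroid([(0, 0), (4, 0), (4, 4), (0, 4)])
angle = skew_angle_degrees((cog_x, cog_y), (0, 2))
```

For the square above the COG lies at its center, so the baseline to the midpoint of the left edge is horizontal and the estimated skew is zero.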
2.2. Pyramidal image structure
The pyramidal image structure is a simple and robust technique for providing several resolutions
of an image. An image pyramid is a collection of images of decreasing resolution arranged in the
shape of a pyramid, so that the base of the pyramid contains the high-resolution image while the
apex contains a low-resolution approximation of it. The number of pixels in $I^{L+1}$ is one
quarter of the number of pixels in $I^{L}$. This process is repeated N times, where N, the number
of levels of the pyramidal image structure, is given by (2).
$$
N = \left\lceil \log \frac{l}{100} \right\rceil, \qquad l = \min\{\, I_0.\mathrm{Width},\ I_0.\mathrm{Height} \,\} \tag{2}
$$
Figure 1. Overall diagram of the proposed method
The intensity of every pixel of the image at level 'L+1' is calculated from the pixel intensities
at level 'L' using (3):
$$
I^{L+1}_{i,j} = \frac{1}{4}\sum_{m=0}^{1}\sum_{n=0}^{1} I^{L}_{2i+m,\,2j+n} \tag{3}
$$
where $I^{L+1}_{i,j}$ is the intensity of pixel (i,j) at level 'L+1'. The pyramidal image
structure for the I0 image is shown in Fig. 3.
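The pyramid construction can be sketched in a few lines of Python. This is an illustrative sketch only: it reads (3) as a 2x2 block average, takes the logarithm in (2) to be base 2 (an assumption, since each level halves the image side), clamps the level count at zero for small images (also an assumption), and represents an image as a list of rows. All function names are hypothetical.

```python
import math

def num_levels(width, height):
    """Number of pyramid levels in the spirit of (2); base-2 logarithm is assumed."""
    l = min(width, height)
    return max(0, math.ceil(math.log2(l / 100)))

def downsample(image):
    """One pyramid step as in (3): each output pixel averages a 2x2 input block."""
    rows, cols = len(image) // 2, len(image[0]) // 2
    return [
        [
            (image[2 * i][2 * j] + image[2 * i][2 * j + 1]
             + image[2 * i + 1][2 * j] + image[2 * i + 1][2 * j + 1]) / 4
            for j in range(cols)
        ]
        for i in range(rows)
    ]

def build_pyramid(image, levels):
    """Return [level 0, level 1, ...], where level 0 is the input image."""
    pyramid = [image]
    for _ in range(levels):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid

# Hypothetical 4x4 all-ones image reduced twice: 4x4 -> 2x2 -> 1x1.
pyramid = build_pyramid([[1] * 4 for _ in range(4)], 2)
```

Each level has one quarter of the pixels of the level below it, which matches the description of the pyramid given above.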
Figure 2. (a) Skewed document, (b) Document segmented to connected component, (c) The ARD in
skewed document, (d) calculate the amount of document skew angle, (e) non-skewed document, (f) final
result
2.3. Bounding box extraction
This algorithm uses a bottom-up approach to extract connected components (CCs) from the document
image. The first step in CC extraction is to locate rectangular regions called Rects (M.
Viswanathan, 1992). A Rect may be thought of as a rectangular region of loosely connected black
pixels; more specifically, a Rect has at least one black pixel in every 9-pixel square. Rects are
defined in this way so that black regions can be found without scanning every pixel. The second
step is to merge adjacent Rects to form CCs. In terms of implementation, Rects are stored in a
linked list. This structure is suitable because the Rects need to be traversed quickly and random
access is not required. Locating Rects involves scanning the 9-pixel squares of each paragraph in
raster order. The top-left corner of a Rect is defined by the first 9-pixel square found that
contains a black pixel. The right edge of the Rect is located by searching right until a white
9-pixel square is found; similarly, the bottom edge is located by scanning down until a white
9-pixel square is found. CCs are defined as collections of adjacent Rects (Mitchel P. E and
Yana H., 2004). CCs are also stored in a linked list, which allows them to be traversed quickly
and easily for classification. To construct the CCs, the (initially empty) list of CCs is
traversed for each Rect in the Rect list. Each Rect is then either added to an existing CC or used
to define a new CC. If a Rect is found to be adjacent to more than one CC, those CCs are merged
into a single CC. Note that since the CCs are defined directly from the Rects, there is no need to
refer to the actual image. After extraction, the CCs are merged with each other in the vertical
and horizontal directions to create the bounding boxes.
In multi-resolution analysis, the lowest levels of the pyramid can be used for overall analysis of
the image. In this step, the bounding boxes of low resolution image are used to extract image
regions. The bounding boxes extraction in lowest resolution of the I0 image is shown in Fig. 4.
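The idea of grouping 9-pixel squares into components can be illustrated with a simplified Python sketch. It is not the linked-list Rect procedure described above: it swaps in a plain grid of 3x3 cells and a breadth-first flood fill over adjacent black cells, and it does visit every pixel when marking cells. The value 1 for a black pixel, the 8-neighbour adjacency, and all names are assumptions.

```python
from collections import deque

def black_cells(image, cell=3):
    """Mark each cell x cell square that contains at least one black pixel (value 1)."""
    rows = (len(image) + cell - 1) // cell
    cols = (len(image[0]) + cell - 1) // cell
    grid = [[False] * cols for _ in range(rows)]
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            if value:
                grid[y // cell][x // cell] = True
    return grid

def bounding_boxes(image, cell=3):
    """Group adjacent black cells; return one (left, top, right, bottom) box per group."""
    grid = black_cells(image, cell)
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not grid[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            r0 = r1 = r
            c0 = c1 = c
            while queue:
                cr, cc = queue.popleft()
                r0, r1 = min(r0, cr), max(r1, cr)
                c0, c1 = min(c0, cc), max(c1, cc)
                for nr in range(max(cr - 1, 0), min(cr + 2, rows)):
                    for nc in range(max(cc - 1, 0), min(cc + 2, cols)):
                        if grid[nr][nc] and not seen[nr][nc]:
                            seen[nr][nc] = True
                            queue.append((nr, nc))
            # Convert cell indices back to inclusive pixel coordinates.
            boxes.append((c0 * cell, r0 * cell,
                          min((c1 + 1) * cell, len(image[0])) - 1,
                          min((r1 + 1) * cell, len(image)) - 1))
    return boxes

# Hypothetical 9x12 page with two isolated black pixels far apart.
page = [[0] * 12 for _ in range(9)]
page[1][1] = 1
page[7][10] = 1
boxes = bounding_boxes(page)
```

Working on cells rather than pixels keeps the number of units to group small, which is the same motivation given above for defining Rects.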
Figure. 3. The Pyramidal image structure for I0 image, (a) original image, (b) binary image, (c) level 0, (d)
level 1, (e) level 2, and (f) level 3.
Figure. 4. The bounding boxes extraction in lowest resolution of the I0 image.
2.4. Removing overlapping connected components
In the low-resolution document segmentation, the document image $I_0^0$ is divided into a set of
regions $R_1, R_2, \ldots, R_n$. Each region $R_i$ is defined by the coordinates of its bounding
box. Some of the regions $R_i$ overlap one another. For the next stage, high-resolution page
segmentation, we must remove the overlapping components from each region $R_i$. For each region,
the overlapping components are removed as follows:
First, we obtain the connected components of region $R_i$ in the low-resolution image $P_L$. If
$R_i$ has only one connected component, it has no overlapping components. If $R_i$ has more than
one connected component, we keep the largest connected component and remove the other connected
components at the high resolution.
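The rule of keeping a single component per region can be sketched as follows. The sketch assumes that "maximum" means the component with the most black pixels, that black pixels have value 1, and that 8-neighbour connectivity is used; none of these choices is stated in the text, and the function names are hypothetical.

```python
from collections import deque

def connected_components(region):
    """List the 8-connected groups of black pixels (value 1) as lists of (row, col)."""
    rows, cols = len(region), len(region[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if region[r][c] != 1 or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            pixels = []
            while queue:
                cr, cc = queue.popleft()
                pixels.append((cr, cc))
                for nr in range(max(cr - 1, 0), min(cr + 2, rows)):
                    for nc in range(max(cc - 1, 0), min(cc + 2, cols)):
                        if region[nr][nc] == 1 and not seen[nr][nc]:
                            seen[nr][nc] = True
                            queue.append((nr, nc))
            components.append(pixels)
    return components

def keep_largest_component(region):
    """Return a copy of the region in which only the largest component stays black."""
    components = connected_components(region)
    cleaned = [[0] * len(region[0]) for _ in region]
    if components:
        for r, c in max(components, key=len):
            cleaned[r][c] = 1
    return cleaned

# Hypothetical region: a 3-pixel component plus a stray pixel from a neighbouring box.
region = [[1, 1, 0, 0],
          [1, 0, 0, 1]]
cleaned = keep_largest_component(region)
```

The input region is left unchanged, so the same data can still be used when the neighbouring region is processed.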
2.5. Horizontal analysis
Persian text regions can be easily distinguished from other regions by horizontal analysis. To
speed up the process, the horizontal analysis is performed recursively. In the first step, the
horizontal projection profile of each region is calculated by (4):
$$
P_H(n) = \frac{1}{W}\sum_{x=1}^{W} I_B^L(x,n), \qquad 1 \le n \le H \tag{4}
$$
where $I_B^L(x,n)$ is the intensity value in the W×H image of the Lth level, and the values of
$P_H(n)$ are normalized between zero and one. Fig. 5 shows $P_H(n)$ for one region of the document.
Figure. 5. (a) one region of the original document, (b) the horizontal projection profile, (c) the normalized
horizontal projection profile.
In the second step, the normalized projection profile is transformed into a binary signal as
described in (5):
$$
tP_H(n) = \begin{cases} 1.0 & P_H(n) > 0.05 \\ 0.0 & \text{otherwise} \end{cases} \tag{5}
$$
In the third step, the difference of the signal is calculated using (6):
$$
dP_H(n) = \mathrm{diff}(tP_H(n)) \tag{6}
$$
In the fourth step, the ascending and descending edges are calculated using (7) and (8):
$$
UHE(n) = \begin{cases} 1 & dP_H(n) > 0 \\ 0 & \text{otherwise} \end{cases} \tag{7}
$$
$$
DHE(n) = \begin{cases} 1 & dP_H(n) < 0 \\ 0 & \text{otherwise} \end{cases} \tag{8}
$$
where UHE(n) and DHE(n) mark the ascending and descending edges of the signal, respectively.
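Steps one to four can be traced on a small example with the Python sketch below. It assumes a binary region stored as a list of rows, and it takes diff in (6) to be the first difference x[n+1] - x[n], which makes the output one element shorter than the input. These conventions and the function names are assumptions made for illustration.

```python
def horizontal_profile(image):
    """P_H(n) as in (4): mean intensity of each row, so values lie in [0, 1]."""
    width = len(image[0])
    return [sum(row) / width for row in image]

def binarize(profile, threshold=0.05):
    """tP_H(n) as in (5): 1 where the profile exceeds the threshold, else 0."""
    return [1 if p > threshold else 0 for p in profile]

def difference(signal):
    """dP_H(n) as in (6), read here as the first difference of the binary signal."""
    return [b - a for a, b in zip(signal, signal[1:])]

def edges(diff):
    """UHE(n) and DHE(n) as in (7) and (8): ascending and descending edge markers."""
    uhe = [1 if d > 0 else 0 for d in diff]
    dhe = [1 if d < 0 else 0 for d in diff]
    return uhe, dhe

# Hypothetical 4-row region: empty row, two occupied rows, empty row.
region = [[0, 0, 0, 0],
          [1, 1, 0, 0],
          [1, 1, 1, 1],
          [0, 0, 0, 0]]
profile = horizontal_profile(region)
binary = binarize(profile)
diff = difference(binary)
uhe, dhe = edges(diff)
```

In this example the single ascending edge and the single descending edge bracket the two occupied rows.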
In the fifth step, the distance between the black and white areas of the signal is calculated
using (9) and (10):
$$
PW(n) = DHE(n) - UHE(n) \tag{9}
$$
$$
PB(n) = UHE(n+1) - DHE(n) \tag{10}
$$
PW(n) and PB(n) are the distance between white and black tPH(n) signals, respectively. Sum of
PW(n) and PB(n)
D(p), the decision value for document segmentation, is derived from $P_i(n)$ by (11), (12), (13)
and (14):
$$
m = \frac{1}{N}\sum_{n=1}^{N} P_i(n) \tag{11}
$$
$$
V = \frac{1}{N}\sum_{n=1}^{N} \left(P_i(n) - m\right)^2 \tag{12}
$$
$$
p = \frac{2}{1 + e^{-V}} - 1 \tag{13}
$$
$$
D(p) = \begin{cases} 1 & p > TH \\ 0 & \text{otherwise} \end{cases} \tag{14}
$$
Using the decision value D(p), we can estimate whether a region is homogeneous or not. Here 'N' is
the number of grooves, and we set the threshold value TH = 0.5, which corresponds to 'V' in (12)
being approximately equal to 1.099. This value of 'V' is independent of the character font sizes,
text line spacing, and document layout structure, and is applied equally to every region to
decide whether it is homogeneous or not.
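The decision computation can be checked numerically with a short Python sketch. It reads (13) as p = 2/(1 + e^(-V)) - 1, which is an interpretation of the printed formula; under this reading p equals 0.5 exactly when V = ln 3 ≈ 1.0986, in agreement with the value of about 1.099 quoted above. The function names and the input list of values standing for $P_i(n)$ are hypothetical.

```python
import math

def score(variance):
    """p as read from (13): maps a variance in [0, inf) to a score in [0, 1)."""
    return 2.0 / (1.0 + math.exp(-variance)) - 1.0

def decision_value(values, th=0.5):
    """Return (m, V, p, D) following (11) to (14) for a non-empty list of values."""
    n = len(values)
    m = sum(values) / n                          # (11) mean
    v = sum((x - m) ** 2 for x in values) / n    # (12) variance
    p = score(v)                                 # (13) squashed variance
    d = 1 if p > th else 0                       # (14) thresholded decision
    return m, v, p, d

# Hypothetical inputs: identical values versus widely spread values.
uniform = decision_value([5, 5, 5])
spread = decision_value([0, 4])
```

Because the score depends only on the variance, the fixed threshold TH does not have to be retuned for each document.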
There are three cases in the horizontal analysis using D(p), tPH(n) and Pi(n):
• Constant signal tPH(n) equal to 1: if tPH(n) belongs to an upper level (L = 1, 2, ..., N) of
the region of the document image, the horizontal analysis is repeated for the lower levels; but
if tPH(n) belongs to the original document image, the region is a non-text region (graphics,
pictures, ...) and is investigated in detail in the next step by textural and statistical
analysis.
• Symmetric signal tPH(n): if the decision value D(p) is 1, the variance of the Pi(n) values is
low, the region is considered a text region, and it requires no further splitting.
• Non-symmetric signal tPH(n): if the decision value D(p) is 0, the variance of the Pi(n) values
is high and the region is not homogeneous, so further splitting is required. The region is then
subdivided using the projection profile information.
The segmentation process is repeated until every region at every level has either a constant
signal tPH(n) equal to 1 or a symmetric tPH(n) signal.
2.6. Determination of Splitting Position
When a region is found to require further splitting because it is not homogeneous, there are two
cases in the horizontal (vertical) direction, as shown in Fig. 6. The case in Fig. 6a needs
further splitting because one white area is larger than the other white areas, and the case in
Fig. 6b needs further splitting because one black area is larger than the other black areas. In
both cases, we find a suitable position for splitting, using the method given below, and split
the region into two. The processes described in Sections 2.5 and 2.6 are repeated for each region
until no further splitting is required.
Let W denote the set of the white areas of a region, wi.
Let B denote the set of the black areas of a region, bi.
Sort the set W (or B) in increasing order of the magnitude of wi (or bi).
Let wmed be the median element of W.
Let bmed be the median element of B.
Let wmax be the last element of W.
Let bmax be the last element of B.
if wi > wmed and wi . wmax, split wi
if bi > bmed and bi . bmax, split wi-1.
Fig. 6: Two cases requiring further horizontal splitting :(a) A distinct white area exists. (b) A distinct black
area exists.
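The splitting rule above can be sketched in Python under explicit assumptions. The relation between wi and wmax is not legible in the rule as printed, so the sketch reads it as equality, that is, the area to split is the widest one provided it exceeds the median. The median of an even-sized set is taken as the lower middle element, and white area w(i-1) is assumed to precede black area b(i). All names are hypothetical.

```python
from statistics import median_low

def split_index(widths):
    """Index of the area that stands out (widest and above the median), else None."""
    if not widths:
        return None
    w_med = median_low(widths)
    w_max = max(widths)
    if w_max > w_med:
        return widths.index(w_max)
    return None

def white_index_for_black(black_widths):
    """For the black-area case, return the index i-1 of the white area to split."""
    i = split_index(black_widths)
    if i is None or i == 0:
        return None
    return i - 1

# Hypothetical white-gap widths in scan order: one gap is clearly wider than the rest.
white_gap = split_index([3, 3, 12, 3, 4])
```

In the example the third gap (index 2) is selected, which corresponds to the situation of Fig. 6a; a region whose gaps are all equal yields no split position.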
2.7. Non-homogeneous regions analysis
In this step, the non-homogeneous regions are analyzed and segmented into homogeneous
sub-regions. For this purpose, each region is first segmented into initial sub-regions in the
horizontal direction, based on the median character height in the document (obtained from the
previous steps). Then, by repeating the processes described in Sections 2.5 and 2.6, each
non-homogeneous region is segmented into maximal homogeneous sub-regions, which are labeled as
text or non-text. Finally, sub-regions with the same label are merged to give the final result of
this step.
2.8. Image and drawing/table recognition
The non-text regions labeled in the previous steps are labeled as image or drawing/table in this
section using statistical analysis. To do this, the number of black pixels of each region is
applied to equation (15).
$$
\text{the region } R_i \text{ is }
\begin{cases}
\text{image} & \text{if } \dfrac{A(R_i)}{W(R_i) \times H(R_i)} \ge 0.4 \\[2ex]
\text{drawing/table} & \text{if } \dfrac{A(R_i)}{W(R_i) \times H(R_i)} < 0.4
\end{cases}
\tag{15}
$$
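Rule (15) amounts to thresholding the fraction of black pixels in a region, as the following Python sketch shows. It assumes that A(Ri) is the black-pixel count, that W(Ri) and H(Ri) are the width and height of the region, and that black pixels have value 1. The function name is hypothetical.

```python
def classify_non_text(region, threshold=0.4):
    """Label a non-text binary region by its black-pixel density, as in (15)."""
    height = len(region)
    width = len(region[0])
    black = sum(sum(row) for row in region)   # A(Ri): number of black pixels
    density = black / (width * height)
    return "image" if density >= threshold else "drawing/table"

# Hypothetical regions: a dense block and a sparse line drawing.
dense_label = classify_non_text([[1, 1], [1, 0]])
sparse_label = classify_non_text([[1, 0, 0], [0, 0, 0]])
```

A photograph tends to fill its bounding box with dark pixels, while a table or line drawing consists mostly of background, which is the intuition behind using a single density threshold.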
3. EXPERIMENTAL RESULTS
The proposed method was implemented on a dual-core 2.66 GHz machine in 2013. We considered
documents from different sources, such as journals, textbooks, and newspapers, as well as
handwritten documents. For the experiments, 150 documents were used, 50 of which are handwritten.
The results obtained are reported in Table 1. The proposed method is compared with other methods
in Fig. 8 and Fig. 9, and the final result for a sample image is shown in Fig. 7.
Fig. 7: The final result for a sample image
4. CONCLUSION
We have introduced a document segmentation method that segments the Persian/Arabic
document into homogeneous regions. The proposed efficient and comprehensive method works
based on pyramidal image structure and textual analysis. For the original image, a pyramidal tree
structure (images at different resolutions) is created. In the next phase, the bounding boxes are
extracted from the image at the lowest possible resolution, and candidate regions are selected by
removing the overlapping areas of the bounding boxes. The regions are then classified by
horizontal analysis and further investigated by textual and statistical analysis of the
high-resolution sample, in order to recognize the different regions of the document, such as
text, image, and drawing/table.
The proposed method was examined on color and gray-scale documents obtained from different
sources. The experiments show more accurate results than the previously reported methods used for
comparison. The method focuses on Persian/Arabic document segmentation but also gives good
results for other scripts, such as English. This work can also be extended to special
applications such as license plate recognition, postal services, and noisy documents. The method
was applied to 150 different documents (90 Persian/Arabic, 40 English, and 20 mixed English and
Persian/Arabic), and the accuracy is 97.3% in the worst case.
[Bar chart. Categories: Text identification, Image identification, Drawing/Table identification. Series: Jain, Pietikainen, proposed approach. Vertical axis: 0 to 100. Data labels as extracted: 85, 90, 92.5; 79.5, 82, 81.5; 98.8, 96.3, 97.]
Figure. 8. Comparison of the proposed method with other methods.
Figure. 9. Comparison of the proposed method with other methods in terms of processing time.
Table 1. The results achieved for the proposed method.
Region type      Total    Found   Found   Correct  Correct  Unfound  Unfound  Incorrect  Incorrect  Max. overall
                 regions          (%)     found    (%)               (%)      found      (%)        error rate (%)
Total region     448      440     98.2    435      97.1     13       2.9      5          1.1        2.9
Text region      382      379     99.2    376      98.4     6        1        3          0.8        1
Image region     43       42      97.7    41       95.4     2        4.6      1          2.3        4.6
Drawing/table    23       23      100     22       95.6     1        4.3      1          4.3        4.3
REFERENCES
[1] F. Legourgiois, Z. Bublinski, & H. Emptoz, (1992), "A Fast and Efficient Method for Extracting Text
Paragraphs and Graphics from Unconstrained Documents", Proc. 11th Int'l Conf. Pattern Recognition,
pp. 272-276.
[2] D. Drivas & A. Amin, (1995), "Document segmentation and Classification Utilizing Bottom-Up
Approach", Proc. Third Int'l Conf. Document Analysis and Recognition, pp. 610-614.
[3] A. Simon, J. Pret, & A. Johnson, (1997), "A Fast Algorithm for Bottom-Up Document Layout
Analysis", IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, pp. 273-276.
[4] J. Ha, R. Haralick, & I. Phillips, (1995), "Recursive X-Y Cut Using Bounding Boxes of Connected
Components", Proc. Third Int'l Conf. Document Analysis and Recognition, pp. 952-955.
[5] J. Ha, R. Haralick, & I. Phillips, (1995), "Document Page Decomposition by the Bounding-Box
Projection Technique", Proc. Third Int'l Conf. Document Analysis and Recognition, pp. 1119-1122.
[6] Yi Xiao & Hong Yan, (2003), "Text region extraction in a document image based on the Delaunay
tessellation", Pattern Recognition, pp. 799-809.
[7] Jie Xi, Jianming Hu, & Lide Wu, (2002), "Document segmentation of Chinese newspapers", Pattern
Recognition, pp. 2695-2704.
[8] A. Jain & Y. Zhong, (1996), "Document segmentation Using Texture Analysis", Pattern Recognition,
vol. 29, pp. 743-770.
[9] A. Jain & S. Bhattacharjee, (1992), "Text Segmentation Using Gabor Filters for Automatic Document
Processing", Machine Vision and Applications, vol. 5, pp. 169-184.
[10] M. Acharyya & M. K. Kundu, (2002), "Document Image Segmentation Using Wavelet Scale-Space
Features", IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, no. 12.
[11] Junxi Sun, Dongbing Gu, Hua Cai, Guangwen Liu, & Guangqiu Chen, (2008), "Bayesian Document
Segmentation Based on Complex Wavelet Domain Hidden Markov Tree Models", Proc. 2008 IEEE
International Conference on Information and Automation, June 20-23, 2008, Zhangjiajie, China,
pp. 493-498.
[12] Rafi Cohen, Abedelkadir Asi, Klara Kedem, Jihad El-Sana, & Itshak Dinstein, (2013), "Robust text
and drawing segmentation algorithm for historical documents", Proc. 2nd International Workshop on
Historical Document Imaging and Processing, pp. 110-117.
[13] M. Viswanathan, (1992), "Analysis of scanned documents - a syntactic approach", in Structured
Document Image Analysis, Springer-Verlag, pp. 115-136.
[14] Mitchel P. E & Yana H., (2004), "Newspaper layout analysis incorporating connected component
separation", Image and Vision Computing, vol. 22, pp. 307-317.
Authors
Seyyed Yasser Hashemi was born in Miyandoab, Azarbayjane Gharbi, Iran, in 1985. He received the
B.Sc. and M.Sc. degrees in Computer Engineering from Islamic Azad University, South Tehran
Branch. He has been with the Computer Department of Islamic Azad University, Miyandoab Branch,
since 2008. He is the author or coauthor of more than ten national and international papers and
has collaborated in several research projects. His current research interests include voice and
image processing, pattern recognition, spam detection, optical character recognition, cloud
computing, and parallel genetic algorithms.
Parisa Sheykhi Hesarlo was born in Shahindezh, Azarbayjane Gharbi, Iran, in 1992. She is a B.Sc.
student in Computer Engineering at PNU University.

More Related Content

What's hot

Hh2513151319
Hh2513151319Hh2513151319
Hh2513151319IJERA Editor
Ā 
PubLayNet: Largest Dataset ever for Document Layout Analysis
PubLayNet: Largest Dataset ever for Document Layout AnalysisPubLayNet: Largest Dataset ever for Document Layout Analysis
PubLayNet: Largest Dataset ever for Document Layout AnalysisShivam Sood
Ā 
A MODEL TO CONVERT WAVEā€“FORM-TEXT TO LINEAR-FORM-TEXT FOR BETTER READABILITY ...
A MODEL TO CONVERT WAVEā€“FORM-TEXT TO LINEAR-FORM-TEXT FOR BETTER READABILITY ...A MODEL TO CONVERT WAVEā€“FORM-TEXT TO LINEAR-FORM-TEXT FOR BETTER READABILITY ...
A MODEL TO CONVERT WAVEā€“FORM-TEXT TO LINEAR-FORM-TEXT FOR BETTER READABILITY ...ijma
Ā 
Header Based Classification of Journals Using Document Image Segmentation and...
Header Based Classification of Journals Using Document Image Segmentation and...Header Based Classification of Journals Using Document Image Segmentation and...
Header Based Classification of Journals Using Document Image Segmentation and...CSCJournals
Ā 
Rhetorical Sentence Classification for Automatic Title Generation in Scientif...
Rhetorical Sentence Classification for Automatic Title Generation in Scientif...Rhetorical Sentence Classification for Automatic Title Generation in Scientif...
Rhetorical Sentence Classification for Automatic Title Generation in Scientif...TELKOMNIKA JOURNAL
Ā 
Arabic text categorization algorithm using vector evaluation method
Arabic text categorization algorithm using vector evaluation methodArabic text categorization algorithm using vector evaluation method
Arabic text categorization algorithm using vector evaluation methodijcsit
Ā 
Improved method for pattern discovery in text mining
Improved method for pattern discovery in text miningImproved method for pattern discovery in text mining
Improved method for pattern discovery in text miningeSAT Journals
Ā 
Improved method for pattern discovery in text mining
Improved method for pattern discovery in text miningImproved method for pattern discovery in text mining
Improved method for pattern discovery in text miningeSAT Publishing House
Ā 
NAMED ENTITY RECOGNITION IN TURKISH USING ASSOCIATION MEASURES
NAMED ENTITY RECOGNITION IN TURKISH USING ASSOCIATION MEASURESNAMED ENTITY RECOGNITION IN TURKISH USING ASSOCIATION MEASURES
NAMED ENTITY RECOGNITION IN TURKISH USING ASSOCIATION MEASURESacijjournal
Ā 
A simple text ine segmentation method for handwritten documents1
A simple text ine segmentation method for handwritten documents1A simple text ine segmentation method for handwritten documents1
A simple text ine segmentation method for handwritten documents1Prasad Babu
Ā 
Scene text recognition in mobile applications by character descriptor and str...
Scene text recognition in mobile applications by character descriptor and str...Scene text recognition in mobile applications by character descriptor and str...
Scene text recognition in mobile applications by character descriptor and str...eSAT Journals
Ā 
A COMPARATIVE STUDY OF ROOT-BASED AND STEM-BASED APPROACHES FOR MEASURING THE...
A COMPARATIVE STUDY OF ROOT-BASED AND STEM-BASED APPROACHES FOR MEASURING THE...A COMPARATIVE STUDY OF ROOT-BASED AND STEM-BASED APPROACHES FOR MEASURING THE...
A COMPARATIVE STUDY OF ROOT-BASED AND STEM-BASED APPROACHES FOR MEASURING THE...acijjournal
Ā 
Anatomical Survey Based Feature Vector for Text Pattern Detection
Anatomical Survey Based Feature Vector for Text Pattern DetectionAnatomical Survey Based Feature Vector for Text Pattern Detection
Anatomical Survey Based Feature Vector for Text Pattern DetectionIJEACS
Ā 
NLP Based Text Summarization Using Semantic Analysis
NLP Based Text Summarization Using Semantic AnalysisNLP Based Text Summarization Using Semantic Analysis
NLP Based Text Summarization Using Semantic AnalysisINFOGAIN PUBLICATION
Ā 
Segmentation of Handwritten Chinese Character Strings Based on improved Algor...
Segmentation of Handwritten Chinese Character Strings Based on improved Algor...Segmentation of Handwritten Chinese Character Strings Based on improved Algor...
Segmentation of Handwritten Chinese Character Strings Based on improved Algor...ijeei-iaes
Ā 

Recently uploaded (20)

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
Ā 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
Ā 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...

Persian/Arabic document segmentation using hybrid approach

  • 1. International Journal on Computational Sciences & Applications (IJCSA) Vol.4, No.1, February 2014 DOI:10.5121/ijcsa.2014.4103

PERSIAN/ARABIC DOCUMENT SEGMENTATION BASED ON HYBRID APPROACH

Seyyed Yasser Hashemi (1), Parisa Sheykhi Hesarlo (2)
(1, 2) Department of Computer Engineering, Miyandoab Branch, Islamic Azad University, Miyandoab, Iran

ABSTRACT
Document segmentation is essential for the automatic transformation of paper documents into electronic documents. However, variations in character font size, text line spacing, and document layout structure make it difficult to design a general-purpose document layout analysis algorithm, so most previously reported methods inevitably rely on parameters that depend on these properties. The problem is especially acute for Persian/Arabic documents: because Persian/Arabic scripts differ considerably from English scripts, most methods proposed for English scripts do not give good results on them. In this paper, we present a novel parameter-free hybrid method for segmenting Persian/Arabic document images that also works well for English scripts. The proposed method segments documents without relying on character font size, text line spacing, document skew, or document layout structure. The algorithm was evaluated on 150 Persian/Arabic and English documents, and segmentation succeeded for 97.3 percent of them in the worst case.

KEYWORDS
Persian/Arabic document, document segmentation, Pyramidal Image Structure.

1. INTRODUCTION
Document segmentation is an important step in Optical Character Recognition (OCR) systems: the document image is divided into homogeneous zones, each consisting of only one physical layout structure, such as text, graphics, or pictures.
Therefore, the performance of OCR systems depends heavily on the document segmentation algorithm used. Several document segmentation algorithms have been proposed during the last three decades [1-10]. Approaches to document segmentation are typically categorized as “bottom-up”, “top-down”, and “textural analysis” methods.

The “bottom-up” methods (F. Legourgiois et al., 1992; D. Drivas et al., 1995; A. Simon et al., 1997) start from pixels or connected components, determine the words, merge the words into text lines, and finally merge the text lines into paragraphs. Their main disadvantage is that identifying, analyzing, and grouping connected components is generally time-consuming, especially when the image contains many components.

The “top-down” approaches (J. Ha et al., 1995; J. Ha et al., 1995; Yi Xiaoa et al., 2003; Jie Xi et al., 2002; Rafi Cohen et al., 2013) look for global information, e.g. black and white stripes on the page, and use it to split the page into columns, the columns into blocks, the blocks into text lines, and finally the text lines into words. The most important advantages of these methods are their low time complexity compared with the “bottom-up” approaches and their natural coarse-to-fine view of the page, which suits human vision. On the other hand, in “top-down” techniques
  • 2. it is unfortunately difficult to segment complex document layouts that include nonrectangular images and various character font sizes.

Some other recently proposed document segmentation methods (A. Jain and Y. Zhong et al., 1996; A. Jain and S. Bhattacharjee et al., 1992; M. Acharyya and M. K. Kundu et al., 2002) treat the homogeneous regions of the document image, such as text, image, or graphic regions, as textured regions. Document segmentation is then performed on the textured regions found in gray-scale images. Junxi Sun et al. (2008) propose a texture-based Bayesian document segmentation method, in which a Bayesian method fuses texture likelihood with prior contextual knowledge. The texture likelihood is based on a complex wavelet domain hidden Markov tree (HMT) model, and the prior contextual knowledge is based on a hybrid tree model. The main problem of these texture-based approaches is their very high time complexity: many masks are used to extract local features, and different tuned filters are used to capture the desired local spatial frequency and orientation characteristics of a textured region.

Because Persian documents have some special characteristics that do not exist in English documents, the aforementioned methods cannot be applied directly to Persian document segmentation. These characteristics are as follows. Persian script is cursive, and a connected component can include more than one character; the arrangement and size of the components may also vary tremendously. There are 32 basic characters in the Persian alphabet. These characters may change shape according to their position in the word (beginning, middle, end, or isolated).
Since each character can take up to four different shapes, the Persian alphabet has 114 different shapes in total. Special marks called dots are another characteristic of Persian script. Most Persian characters have one, two, or three dots, which may be located above, inside, or below the character. From the script identification point of view, these characteristics imply that word sizes in Persian script are non-uniform: the word size varies with the number of cursive characters and dots in the word.

In this paper, we propose a novel method for Persian/Arabic document segmentation using a pyramidal image structure. The paper is organized as follows: Section 2 describes the proposed algorithm in detail, Section 3 presents experimental results, and Section 4 concludes the paper.

2. MATERIALS AND METHODS
Many document segmentation algorithms have been presented for English documents, but most of them do not give good results for Persian/Arabic documents because of the differences described above. To make these methods suitable for Persian script, some of their parameters must be tuned specifically. We propose a parameter-free segmentation method for Persian/Arabic documents that eliminates these restrictions and provides excellent results. The proposed method can segment Persian documents composed of different font sizes, different line spacings, and different layout structures.
In the proposed method, a low-resolution version of the document is processed first, and then the high-resolution version is analyzed in detail. This reflects the pyramidal nature of the proposed method. The original image is analyzed and a pyramidal tree structure (images at different resolutions) is created. In the next phase, bounding boxes are extracted from the images
  • 3. at the lowest possible resolution, and candidate regions are selected by removing the overlapping areas of the bounding boxes. The regions are then classified by horizontal and vertical analysis and further investigated by textual and statistical analysis of the high-resolution sample, in order to recognize the different regions of the document such as text, image, and drawing/table. Fig. 1 shows the proposed approach.

2.1. Document skew detection and correction (SDC)
In this step, the skew angle of the document ($\theta$) is estimated. The proposed method uses a document SDC based on the Centre of Gravity (COG). The first step in determining the skew angle is Baseline Identification (BI): the angle between the baseline and the horizontal determines the skew angle, so identifying the baseline is the most important step in this process. The baseline of the document is a line that passes through the COG along the horizontal axis of the document. The algorithm detects the skew angle by finding the Actual Region of Document (ARD) using connected component analysis, identifying its COG, and identifying the baseline of the document. The algorithm steps are as follows:

    • Connected component (CC) identification.
    • Identification of the ARD. For this purpose, the four CCs that have the greatest distance from the corners C0, C1, C2, and C3 (shown in Fig. 2(c)) are selected.
    • Finding the COG. The COG is calculated using (1):

\[
COG_x = \frac{1}{6A} \sum_{i=0}^{N-1} (x_i + x_{i+1})(x_i y_{i+1} - x_{i+1} y_i), \quad
COG_y = \frac{1}{6A} \sum_{i=0}^{N-1} (y_i + y_{i+1})(x_i y_{i+1} - x_{i+1} y_i), \quad
A = \frac{1}{2} \sum_{i=0}^{N-1} (x_i y_{i+1} - x_{i+1} y_i) \qquad (1)
\]

      where $A$ is the area of the polygon and $(x_i, y_i)$ are its vertices.
    • Baseline identification. The baseline is the line from the COG to the midpoint of the line that connects the upper-left and lower-left corners of the ARD.
    • Calculation of the document skew angle, which is the angle between the baseline and the horizontal line that passes through the midpoint.
    • Rotation of the document (see Fig. 2).

2.2. Pyramidal image structure
The pyramidal image structure is a simple and robust technique for providing several resolutions of an image. An image pyramid is a collection of images of decreasing resolution arranged in the shape of a pyramid: the base contains the high-resolution image, while the apex contains a low-resolution approximation of it. The number of pixels in $I^{L+1}$ is one quarter of the number of pixels in $I^{L}$. This process is repeated $N$ times, where $N$, the number of levels of the pyramidal image structure, is given by (2):

\[
N = \left\lceil \log \frac{l}{100} \right\rceil, \qquad l = \min\{ I_0.\mathit{Width},\ I_0.\mathit{Height} \} \qquad (2)
\]
  • 4. Figure 1. Overall diagram of the proposed method.

The intensity of every pixel at level $L+1$ is calculated from the pixel intensities at level $L$ using (3):

\[
I^{L+1}_{i,j} = \frac{1}{4} \sum_{m=0}^{1} \sum_{n=0}^{1} I^{L}_{2i+m,\,2j+n} \qquad (3)
\]

where $I^{L+1}_{i,j}$ is the intensity of pixel $(i,j)$ at level $L+1$. The pyramidal image structure for the image $I_0$ is shown in Fig. 3.
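As a rough illustration of (2) and (3), the pyramid can be built by repeatedly averaging 2×2 blocks. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, the image is a plain list of rows, the logarithm in (2) is assumed to be base 2 (the text does not state the base), the level count is clamped at zero for small images, and a trailing odd row or column is simply dropped.

```python
import math


def next_level(img):
    """Halve the resolution: each output pixel is the mean of a 2x2 block, as in (3)."""
    rows, cols = len(img) // 2, len(img[0]) // 2  # a trailing odd row/column is dropped
    return [
        [
            (img[2 * i][2 * j] + img[2 * i][2 * j + 1]
             + img[2 * i + 1][2 * j] + img[2 * i + 1][2 * j + 1]) / 4.0
            for j in range(cols)
        ]
        for i in range(rows)
    ]


def num_levels(width, height):
    """N = ceil(log(l / 100)) with l = min(width, height); base 2 is an assumption."""
    shortest = min(width, height)
    return max(0, math.ceil(math.log2(shortest / 100.0)))  # clamp is an assumption


def build_pyramid(img):
    """Return [I^0, I^1, ..., I^N], from highest to lowest resolution."""
    levels = [img]
    for _ in range(num_levels(len(img[0]), len(img))):
        levels.append(next_level(levels[-1]))
    return levels
```

Under these assumptions, an 800×1200 page gives three reduction steps, so the coarsest image is 100×150 pixels.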
  • 5. Figure 2. (a) Skewed document, (b) document segmented into connected components, (c) the ARD in the skewed document, (d) calculation of the document skew angle, (e) deskewed document, (f) final result.

2.3. Bounding box extraction
This algorithm uses a bottom-up approach to extract connected components (CCs) from the document image. The first step in CC extraction is to locate rectangular regions called Rects (M. Viswanathan, 1992). A Rect may be thought of as a rectangular region of loosely connected black pixels; more specifically, a Rect has at least one black pixel in every 9-pixel square. Rects are defined in this way so that black regions can be found without scanning every pixel. The second step is to merge adjacent Rects to form CCs.

In terms of implementation, Rects are stored in a linked list. This structure is ideal because the Rects need to be traversed quickly and random access is not required. Locating Rects involves scanning the 9-pixel squares of each paragraph in raster order. The top-left corner of a Rect is defined by the first 9-pixel square found that contains a black pixel. The right edge of the Rect is located by searching right until a white 9-pixel square is found. Similarly, the bottom edge is located by scanning down until a white 9-pixel square is found.

CCs are defined as collections of adjacent Rects (Mitchel P. E and Yana H., 2004). CCs are also stored in a linked list, which allows them to be traversed quickly and easily for classification. To construct the CCs, the (initially empty) list of CCs is traversed for each Rect in the Rects list. Each Rect is either added to a single CC or used to define a new CC. If a Rect is adjacent to more than one CC, those CCs are merged into a single CC.
Note that since the CCs are defined directly from the Rects, there is no need to refer to the actual image. After extraction, the CCs are merged with each other in the vertical and horizontal directions to create the bounding boxes. In multi-resolution analysis, the lowest-resolution levels of the pyramid can be used for overall analysis of the image. In this step, the bounding boxes of the low-resolution image are used to extract image regions. The bounding boxes extracted at the lowest resolution of the image $I_0$ are shown in Fig. 4.
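The idea of grouping 9-pixel squares into connected components can be sketched as follows. This is a simplified stand-in rather than the Rects procedure itself: it marks every 3×3 block that contains a black pixel (scanning all pixels, which the original avoids) and groups adjacent marked blocks with a breadth-first search instead of linked lists of Rects. The function names, the convention that black pixels have value 1, and the use of 8-adjacency between blocks are all assumptions.

```python
from collections import deque


def block_grid(img, size=3):
    """Mark each size x size block that contains at least one black pixel (value 1)."""
    rows = (len(img) + size - 1) // size
    cols = (len(img[0]) + size - 1) // size
    grid = [[False] * cols for _ in range(rows)]
    for y, row in enumerate(img):
        for x, pixel in enumerate(row):
            if pixel:
                grid[y // size][x // size] = True
    return grid


def bounding_boxes(img, size=3):
    """Return (left, top, right, bottom) pixel boxes, one per group of adjacent black blocks."""
    grid = block_grid(img, size)
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not grid[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            r0, c0, r1, c1 = r, c, r, c
            while queue:
                cr, cc = queue.popleft()
                r0, c0 = min(r0, cr), min(c0, cc)
                r1, c1 = max(r1, cr), max(c1, cc)
                # Visit the 8 neighbouring blocks (assumed adjacency rule).
                for nr in range(max(cr - 1, 0), min(cr + 2, rows)):
                    for nc in range(max(cc - 1, 0), min(cc + 2, cols)):
                        if grid[nr][nc] and not seen[nr][nc]:
                            seen[nr][nc] = True
                            queue.append((nr, nc))
            # Convert block indices back to pixel coordinates, clipped to the image.
            boxes.append((c0 * size, r0 * size,
                          min((c1 + 1) * size, len(img[0])) - 1,
                          min((r1 + 1) * size, len(img)) - 1))
    return boxes
```

Because the boxes are aligned to the block grid, they are slightly looser than pixel-exact bounding boxes, which is usually acceptable at the lowest pyramid level.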
  • 6. Figure 3. The pyramidal image structure for the image $I_0$: (a) original image, (b) binary image, (c) level 0, (d) level 1, (e) level 2, and (f) level 3.

Figure 4. The bounding boxes extracted at the lowest resolution of the image $I_0$.

2.4. Removing overlapping connected components
In the low-resolution document segmentation, the document image $I_0^0$ is divided into a set of regions $R_1, R_2, \dots, R_n$. Each region $R_i$ is defined by the coordinates of its bounding box. Some regions $R_i$ overlap others. For the next stage of high-resolution page
  • 7. segmentation, the overlapping components must be removed from each region $R_i$. For each region, this is done as follows. First, the connected components of region $R_i$ are obtained in the low-resolution image $P_L$. If $R_i$ has only one connected component, it has no overlapping components. If $R_i$ has more than one connected component, the largest connected component is kept and the other connected components are removed at high resolution.

2.5. Horizontal analysis
A Persian text region can easily be distinguished from other regions by horizontal analysis. To speed up the process, the horizontal analysis is performed recursively. In the first step, the horizontal projection profile of each region is calculated by (4):

\[
P_H(n) = \frac{1}{W} \sum_{x=1}^{W} I_B^L(x, n), \qquad 1 \le n \le H \qquad (4)
\]

where $I_B^L(x, n)$ is the intensity value in the $W \times H$ image at level $L$, and the values $P_H(n)$ are normalized between zero and one. Fig. 5 shows $P_H(n)$ for one region of the document.

Figure 5. (a) One region of the original document, (b) the horizontal projection profile, (c) the normalized horizontal projection profile.

In the second step, the normalized projection profile is transformed into a binary signal as described in (5):

\[
tP_H(x) = \begin{cases} 1.0 & P_H(x) > 0.05 \\ 0.0 & \text{otherwise} \end{cases} \qquad (5)
\]

In the third step, the difference of the signal is calculated using (6):

\[
dP_H(n) = \mathrm{diff}(tP_H(n)) \qquad (6)
\]
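The first three steps, (4) to (6), might look like the sketch below. The function names are hypothetical, the region is assumed to be a binary image stored as a list of rows with black pixels equal to 1, and `diff` is taken to be the first difference between consecutive samples.

```python
def horizontal_profile(region):
    """P_H(n) as in (4): mean pixel value of each row n of a binary region."""
    width = len(region[0])
    return [sum(row) / width for row in region]


def binarize_profile(profile, level=0.05):
    """tP_H(n) as in (5): 1.0 where the profile exceeds the level, else 0.0."""
    return [1.0 if p > level else 0.0 for p in profile]


def diff(signal):
    """dP_H(n) as in (6): first difference between consecutive samples (assumed)."""
    return [b - a for a, b in zip(signal, signal[1:])]


# A 4-row region: one blank row, two rows with ink, one blank row.
region = [
    [0, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
profile = horizontal_profile(region)
binary = binarize_profile(profile)
steps = diff(binary)
```

With 0/1 pixels, dividing each row sum by the width already keeps the profile between zero and one, so no separate normalization pass is needed in this sketch.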
  • 8. In the fourth step, the ascending and descending edges are calculated using (7) and (8):

\[
UHE(n) = \begin{cases} 1 & dP_H(n) > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (7)
\]

\[
DHE(n) = \begin{cases} 1 & dP_H(n) < 0 \\ 0 & \text{otherwise} \end{cases} \qquad (8)
\]

where $UHE(n)$ and $DHE(n)$ mark the ascending and descending edges of the signal, respectively. In the fifth step, the distances between the black and white areas of the signal are calculated using (9) and (10):

\[
PW(n) = DHE(n) - UHE(n) \qquad (9)
\]

\[
PB(n) = UHE(n+1) - DHE(n) \qquad (10)
\]

$PW(n)$ and $PB(n)$ are the widths of the white and black areas of the signal $tP_H(n)$, respectively, and $P_i(n)$ denotes the sum of $PW(n)$ and $PB(n)$. The decision value for document segmentation, $D(p)$, is derived from $P_i(n)$ by (11), (12), (13), and (14):

\[
m = \frac{1}{N} \sum_{n=1}^{N} P_i(n) \qquad (11)
\]

\[
V = \frac{1}{N} \sum_{n=1}^{N} \left( P_i(n) - m \right)^2 \qquad (12)
\]

\[
p = 2 - \frac{2}{1 + e^{-V}} \qquad (13)
\]

\[
D(p) = \begin{cases} 1 & p > TH \\ 0 & \text{otherwise} \end{cases} \qquad (14)
\]

The decision value $D(p)$ indicates whether a region is homogeneous or not. Here $N$ is the number of grooves, and the threshold is set to $TH = 0.5$, which corresponds to $V \approx 1.099$ in (12), since $p = 0.5$ exactly when $e^{-V} = 1/3$. This value of $V$ is independent of character font size, text line spacing, and document layout structure, and is applied equally to every region to decide whether it is homogeneous.

There are three types of horizontal analysis using $D(p)$, $tP_H(n)$, and $P_i(n)$:

    ▪ Constant signal $tP_H(n)$ equal to 1: if $tP_H(n)$ belongs to an upper level ($L = 1, 2, \dots, N$) of the region, the horizontal analysis is repeated for the lower levels; if $tP_H(n)$ belongs to the original document image, the region is a non-text region (graphics, pictures, …) and is investigated in detail in the next step by textural and statistical analysis.
    ▪ Symmetric signal $tP_H(n)$: if the decision value $D(p)$ is 1, the variance of the $P_i(n)$ values is low, the region is considered a text region, and no further splitting is required.
    ▪ Non-symmetric signal $tP_H(n)$: if the decision value $D(p)$ is 0, the variance of the $P_i(n)$ values is high and the region is not homogeneous, so further splitting is inevitable. The region is then subdivided using the projection profile information.

The segmentation process is repeated until every region at every level has either a constant signal $tP_H(n)$ equal to 1 or a symmetric signal $tP_H(n)$.

2.6. Determination of Splitting Position
When a region requires further splitting because it is not homogeneous, there are two cases in the horizontal (or vertical) direction, as shown in Fig. 6. The case in Fig. 6a needs to be split because one white area is larger than the other white areas, and the case in Fig. 6b needs to be split because one black area is larger than the other black areas. In both cases, a suitable splitting position is found using the method given below, and the region is split into two regions. The processes described in Sections 2.5 and 2.6 are repeated for each region until no further splitting is required.

    Let W denote the set of the white areas of a region, w_i.
    Let B denote the set of the black areas of a region, b_i.
    Sort the set W (or B) in increasing order of the magnitude of w_i (or b_i).
    w_med: the median element of W
    b_med: the median element of B
    w_max: the last element of W
    b_max: the last element of B
    if w_i > w_med and w_i = w_max, split w_i
    if b_i > b_med and b_i = b_max, split w_{i-1}

Figure 6. Two cases requiring further horizontal splitting: (a) a distinct white area exists; (b) a distinct black area exists.
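The decision value of Section 2.5 and the white-area case of the splitting rule above can be sketched together. Several points here are assumptions rather than statements from the text: $P_i(n)$ is taken to be $PW(n) + PB(n)$, which equals the distance between consecutive ascending edges; runs of zeros in the binary signal are treated as the white areas; the upper median is used for even counts; the largest white run is the one chosen for splitting; and the black-area case is omitted. All function names are hypothetical.

```python
import math


def rising_edges(signal):
    """Indices n where the 0/1 signal steps up, i.e. where UHE is 1 in (7)."""
    return [n for n in range(1, len(signal)) if signal[n] > signal[n - 1]]


def periods(signal):
    """P_i(n), assumed to be PW(n) + PB(n): gaps between consecutive rising edges."""
    ups = rising_edges(signal)
    return [b - a for a, b in zip(ups, ups[1:])]


def decision(values, threshold=0.5):
    """D(p) from (11)-(14): 1 for a homogeneous (text-like) region, else 0."""
    if not values:
        raise ValueError("at least one period is required")
    count = len(values)
    mean = sum(values) / count                                # (11)
    variance = sum((v - mean) ** 2 for v in values) / count   # (12)
    p = 2.0 - 2.0 / (1.0 + math.exp(-variance))               # (13)
    return 1 if p > threshold else 0                          # (14)


def zero_runs(signal):
    """(start, length) of every run of zeros in the signal."""
    runs, start = [], None
    for n, value in enumerate(signal):
        if value == 0 and start is None:
            start = n
        elif value != 0 and start is not None:
            runs.append((start, n - start))
            start = None
    if start is not None:
        runs.append((start, len(signal) - start))
    return runs


def white_split(signal):
    """Return the (start, length) of a distinctly large white run, or None."""
    runs = zero_runs(signal)
    if not runs:
        return None
    lengths = sorted(length for _, length in runs)
    median = lengths[len(lengths) // 2]  # upper median: an assumption
    start, length = max(runs, key=lambda run: run[1])
    return (start, length) if length > median else None


text_like = [0, 1, 1, 0, 1, 1, 0, 1, 1, 0]              # evenly spaced lines
mixed = [0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0]   # one large gap
```

For the evenly spaced signal the variance is zero, so $p = 1$ and the region counts as homogeneous; for the signal with the large gap the variance is high, the decision is 0, and the large zero run is returned as the place to split.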
  • 10.

2.7. Non-homogeneous regions analysis
In this step, the non-homogeneous regions are analyzed and segmented into homogeneous sub-regions. For this purpose, each region is first segmented into initial sub-regions in the horizontal direction, based on the median character height in the document (obtained from the previous steps). Then, by repeating the processes described in Sections 2.5 and 2.6, a non-homogeneous region is segmented into maximal homogeneous sub-regions, each labeled as a text or non-text sub-region. Finally, sub-regions with the same label are merged to give the final result of this step.

2.8. Image and drawing/table recognition
The non-text regions labeled in the previous steps are classified as image or drawing/table in this section using statistical analysis. To do this, the number of black pixels of each region is applied to (15):

\[
\text{region } R_i \text{ is }
\begin{cases}
\text{image} & \text{if } \dfrac{A(R_i)}{W(R_i) \times H(R_i)} \ge 0.4 \\[2ex]
\text{drawing/table} & \text{if } \dfrac{A(R_i)}{W(R_i) \times H(R_i)} < 0.4
\end{cases} \qquad (15)
\]

Here $A(R_i)$ is taken to be the number of black pixels of region $R_i$, and $W(R_i)$ and $H(R_i)$ its width and height.

3. EXPERIMENTAL RESULTS
The proposed method was implemented in 2013 on a dual-core 2.66 GHz machine. We considered documents from different sources such as journals, textbooks, and newspapers, as well as handwritten documents. For the experiments, 150 documents were used, 50 of which are handwritten. The results obtained are reported in Table 1. The proposed method is compared with other works in Fig. 8 and Fig. 9, and the final result for a sample image is shown in Fig. 7.

Figure 7. The final result for a sample image.

4. CONCLUSION
We have introduced a document segmentation method that segments a Persian/Arabic document into homogeneous regions. The proposed efficient and comprehensive method works
  • 11. based on a pyramidal image structure and textual analysis. For the original image, a pyramidal tree structure (images at different resolutions) is created. In the next phase, bounding boxes are extracted from the image at the lowest possible resolution, and candidate regions are selected by removing the overlapping areas of the bounding boxes. The regions are then classified by horizontal analysis and further investigated by textual and statistical analysis of the high-resolution sample, in order to recognize the different regions of the document such as text, image, and drawing/table. The proposed method was evaluated on color and grayscale documents obtained from different sources. Experiments show more accurate results than the previously reported methods. The method focuses on Persian/Arabic document segmentation but also gives good results for other scripts such as English. This work can also be extended to applications such as license plate recognition, postal services, and noisy documents. The method was applied to 150 different documents (90 Persian/Arabic, 40 English, and 20 mixed English and Persian/Arabic), and the accuracy rate is 97.3% in the worst case.

[Figure 8: bar chart, vertical axis 0 to 100, comparing Jain, Pietikainen, and the proposed approach on text identification, image identification, and drawing/table identification. Bar labels in extraction order: 85, 90, 92.5, 79.5, 82, 81.5, 98.8, 96.3, 97.]

Figure 8. Comparison of the proposed method with other works.

Figure 9. Comparison of the proposed method with other works based on processing time.

Table 1. The results achieved for the proposed method.
Measure                               Total   Text    Image   Drawing/table
Total number of regions               448     382     43      23
Number of regions found               440     379     42      23
Rate of regions found (%)             98.2    99.2    97.7    100
Number of correct regions found       435     376     41      22
Rate of correct regions found (%)     97.1    98.4    95.4    95.6
Number of unfound regions             13      6       2       1
Rate of unfound regions (%)           2.9     1       4.6     4.3
Number of incorrect regions found     5       3       1       1
Rate of incorrect regions found (%)   1.1     0.8     2.3     4.3
Max. overall error rate (%)           2.9     1       4.6     4.3
  • 12. REFERENCES
[1] F. Legourgiois, Z. Bublinski, & H. Emptoz (1992), “A Fast and Efficient Method for Extracting Text Paragraphs and Graphics from Unconstrained Documents”, Proc. 11th Int'l Conf. Pattern Recognition, pp. 272-276.
[2] D. Drivas & A. Amin (1995), “Document segmentation and Classification Utilizing Bottom-Up Approach”, Proc. Third Int'l Conf. Document Analysis and Recognition, pp. 610-614.
[3] A. Simon, J. Pret, & A. Johnson (1997), “A Fast Algorithm for Bottom-Up Document Layout Analysis”, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, pp. 273-276.
[4] J. Ha, R. Haralick, & I. Phillips (1995), “Recursive X-Y Cut Using Bounding Boxes of Connected Components”, Proc. Third Int'l Conf. Document Analysis and Recognition, pp. 952-955.
[5] J. Ha, R. Haralick, & I. Phillips (1995), “Document Page Decomposition by the Bounding-Box Projection Technique”, Proc. Third Int'l Conf. Document Analysis and Recognition, pp. 1119-1122.
[6] Yi Xiaoa, Hong Yana (2003), “Text region extraction in a document image based on the Delaunay tessellation”, Pattern Recognition, pp. 799-809.
[7] Jie Xi, Jianming Hu, Lide Wu (2002), “Document segmentation of Chinese newspapers”, Pattern Recognition, pp. 2695-2704.
[8] A. Jain & Y. Zhong (1996), “Document segmentation Using Texture Analysis”, Pattern Recognition, vol. 29, pp. 743-770.
[9] A. Jain & S. Bhattacharjee (1992), “Text Segmentation Using Gabor Filters for Automatic Document Processing”, Machine Vision and Applications, vol. 5, pp. 169-184.
[10] M. Acharyya & M. K. Kundu (2002), “Document Image Segmentation Using Wavelet Scale-Space Features”, IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, no. 12.
[11] Junxi Sun, Dongbing Gu, Hua Cai, Guangwen Liu, and Guangqiu Chen (2008), “Bayesian Document Segmentation Based on Complex Wavelet Domain Hidden Markov Tree Models”, Proceedings of the 2008 IEEE International Conference on Information and Automation, June 20-23, 2008, Zhangjiajie, China, pp. 493-498.
[12] Rafi Cohen, Abedelkadir Asi, Klara Kedem, Jihad El-Sana, Itshak Dinstein (2013), “Robust text and drawing segmentation algorithm for historical documents”, Proceedings of the 2nd International Workshop on Historical Document Imaging and Processing, pp. 110-117.
[13] M. Viswanathan (1992), “Analysis of scanned documents - a syntactic approach”, in Structured Document Image Analysis, Springer-Verlag, pp. 115-136.
[14] Mitchel P. E & Yana H. (2004), “Newspaper layout analysis incorporating connected component separation”, Image and Vision Computing, vol. 22, pp. 307-317.

Authors
Seyyed Yasser Hashemi was born in Miyandoab, Azarbayjane Gharbi, Iran, in 1985. He received the B.Sc. and M.Sc. degrees in Computer Engineering from Islamic Azad University, South Tehran Branch. He has been with the Computer Department of Islamic Azad University, Miyandoab Branch, since 2008. He is the author or coauthor of more than ten national and international papers and has collaborated in several research projects. His current research interests include voice and image processing, pattern recognition, spam detection, optical character recognition, cloud computing, and parallel genetic algorithms.

Parisa Sheykhi Hesarlo was born in Shahindezh, Azarbayjane Gharbi, Iran, in 1992. She is a B.Sc. student in Computer Engineering at PNU University.