This document describes a novel hybrid approach for segmenting Persian/Arabic documents. The proposed method uses a pyramidal image structure to analyze documents at multiple resolutions without requiring parameters for font size, line spacing, or layout structure. Bounding boxes are extracted from low-resolution images to identify candidate regions, which are then classified through horizontal and vertical analysis and textual/statistical analysis of high-resolution samples to segment text, images, and other elements. The method was tested on 150 documents and successfully segmented 97.3% of them even under worst-case conditions.
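The multiresolution idea behind the pyramid can be illustrated with a minimal sketch. This is not the paper's implementation; the function names and the 2x2-averaging reduction are illustrative assumptions:

```python
def pyramid_level(image):
    """Halve resolution by averaging each 2x2 block of pixels.

    For odd dimensions the last row/column is dropped.
    """
    h, w = len(image), len(image[0])
    return [
        [
            (image[y][x] + image[y][x + 1]
             + image[y + 1][x] + image[y + 1][x + 1]) // 4
            for x in range(0, w - 1, 2)
        ]
        for y in range(0, h - 1, 2)
    ]

def build_pyramid(image, levels):
    """Return [full-res, half-res, quarter-res, ...]."""
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(pyramid_level(pyramid[-1]))
    return pyramid
```

Candidate regions would then be located on the coarsest level, where text lines merge into compact blobs, and verified on the finer levels.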
Query Answering Approach Based on Document Summarization (IJMER)
The growth of online information has made thorough research in automatic text summarization a necessity within the Natural Language Processing (NLP) community. The aim of this paper is to propose a novel language-independent automatic summarization approach that combines three main components: Rhetorical Structure Theory (RST), a query processing approach, and a Network Representation Approach (NRA). RST, as a theory of the structure of natural text, is used to extract the semantic relations behind the text. The query processing approach classifies the question type and finds the answer in a way that suits the user's needs. The NRA is used to create a graph representing the extracted semantic relations. The output is an answer that not only responds to the question but also gives the user an opportunity to find additional information related to it. We implemented the proposed approach and, as a case study, applied it to Arabic text in the agriculture field. The implemented approach succeeded in summarizing extension documents according to the user's query. The results have been evaluated using Recall, Precision, and F-score measures.
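The NRA idea of answering from a relation graph can be sketched as an adjacency structure queried for direct answers plus neighbouring context. This is a hypothetical illustration only; the relation triples and function names are invented and do not come from the authors' system:

```python
# Hypothetical relation graph: each edge carries an RST-style relation label.
relations = [
    ("wheat rust", "elaboration", "fungal disease of wheat"),
    ("wheat rust", "solution", "apply fungicide early in the season"),
    ("apply fungicide early in the season", "condition", "when humidity is high"),
]

def build_graph(triples):
    """Adjacency map: source text span -> [(relation, target span), ...]."""
    graph = {}
    for src, rel, dst in triples:
        graph.setdefault(src, []).append((rel, dst))
    return graph

def answer(graph, topic):
    """Return direct answers plus neighbouring facts as extra context."""
    direct = graph.get(topic, [])
    extra = []
    for _, dst in direct:
        extra.extend(graph.get(dst, []))
    return direct, extra
```

Walking one hop beyond the direct answer is what lets the user discover related information, as the abstract describes.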
Recognition of Words in Tamil Script Using Neural Network (IJERA Editor)
In this paper, word recognition using a neural network is proposed. The recognition process starts by partitioning the document image into lines, words, and characters, and then capturing the local features of the segmented characters. After the characters are classified, the word image is transformed into a unique code based on the character codes. This code can describe any form of word, including words with mixed styles and different sizes. The sequence of character codes of a word forms the input pattern, and the word code is the target value of the pattern. A neural network is used to train the word patterns. The trained network is then tested with word patterns, which are recognized or rejected based on the network error value. Experiments conducted on a local database to evaluate the performance of the word recognition system yielded good accuracy. This method can be applied to word recognition in any language, as training is based only on the unique codes of the characters and words belonging to that language.
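The word-to-code step can be sketched as follows. The character-code table here is entirely hypothetical (the paper does not give its actual code assignments), and the join character is an illustrative choice:

```python
# Hypothetical character-code table; the paper's actual codes are not given.
CHAR_CODES = {"அ": "001", "ம": "002", "மா": "003", "த": "004"}

def encode_word(characters):
    """Concatenate per-character codes into one unique word code.

    The sequence of codes is the network's input pattern; the joined
    word code would serve as the target value for that pattern.
    """
    return "-".join(CHAR_CODES[c] for c in characters)
```

Because the code depends only on character identity, not on font style or size, the same word code is produced for mixed-style renderings, which is the property the abstract relies on.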
International Journal of Engineering Research and Development (IJERD Editor)
Electrical, Electronics and Computer Engineering,
Information Engineering and Technology,
Mechanical, Industrial and Manufacturing Engineering,
Automation and Mechatronics Engineering,
Material and Chemical Engineering,
Civil and Architecture Engineering,
Biotechnology and Bio Engineering,
Environmental Engineering,
Petroleum and Mining Engineering,
Marine and Agriculture engineering,
Aerospace Engineering.
Off-Line Arabic Handwritten Words Segmentation using Morphological Operators (sipij)
The main aim of this study is the assessment and discussion of a model for handwritten Arabic word segmentation. The proposed framework consists of three steps: pre-processing, segmentation, and evaluation. In the pre-processing step, morphological operators are applied to connect gaps (CGs) in written words.
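Connecting gaps with morphological operators typically means a closing (dilation followed by erosion). A minimal sketch on a binary grid follows; the 1x3 horizontal structuring element is an assumption for illustration, not the paper's stated choice:

```python
def dilate(grid, se_width=3):
    """Binary dilation with a horizontal 1 x se_width structuring element."""
    r = se_width // 2
    h, w = len(grid), len(grid[0])
    return [
        [1 if any(grid[y][max(0, x - r):x + r + 1]) else 0
         for x in range(w)]
        for y in range(h)
    ]

def erode(grid, se_width=3):
    """Binary erosion with the same element (out-of-bounds treated as 0)."""
    r = se_width // 2
    h, w = len(grid), len(grid[0])
    return [
        [1 if x - r >= 0 and x + r < w and all(grid[y][x - r:x + r + 1])
         else 0
         for x in range(w)]
        for y in range(h)
    ]

def close_gaps(grid):
    """Morphological closing: dilation followed by erosion bridges small gaps."""
    return erode(dilate(grid))
```

A one-pixel gap inside a horizontal stroke is filled by the dilation and survives the subsequent erosion, while isolated noise of the same width does not grow.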
Devnagari document segmentation using histogram approach (Vikas Dongre)
Document segmentation is one of the critical phases in machine recognition of any language. Correct segmentation of individual symbols determines the accuracy of the character recognition technique. It is used to decompose an image of a sequence of characters into sub-images of individual symbols by segmenting lines and words. Devnagari is the most popular script in India; it is used for writing the Hindi, Marathi, Sanskrit, and Nepali languages. Moreover, Hindi is the third most popular language in the world. Devnagari documents consist of vowels, consonants, and various modifiers, so proper segmentation of Devnagari words is challenging. A simple histogram-based approach to segmenting Devnagari documents is proposed in this paper, and various challenges in the segmentation of Devnagari script are also discussed.
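The histogram approach usually means projection profiles: counting foreground pixels per row (or column) and cutting at the zero valleys. A minimal sketch, with names chosen here for illustration:

```python
def horizontal_profile(binary_image):
    """Count foreground (1) pixels in each row of a binary image."""
    return [sum(row) for row in binary_image]

def segment_lines(binary_image):
    """Return (start, end) row ranges where the profile is non-zero.

    Each maximal run of non-empty rows is treated as one text line.
    """
    profile = horizontal_profile(binary_image)
    lines, start = [], None
    for y, count in enumerate(profile):
        if count and start is None:
            start = y
        elif not count and start is not None:
            lines.append((start, y))
            start = None
    if start is not None:
        lines.append((start, len(profile)))
    return lines
```

The same routine applied to columns inside a detected line yields word and character boundaries; the Devnagari headline (shirorekha) complicates the column profile, which is one of the challenges the paper discusses.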
IJERA (International Journal of Engineering Research and Applications) is an international online, ... peer-reviewed journal. For more detail or to submit your article, please visit www.ijera.com
Header Based Classification of Journals Using Document Image Segmentation and... (CSCJournals)
Document image segmentation plays an important role in the classification of journals, magazines, newspapers, etc. It is the process of splitting a document into distinct regions. Document layout analysis is a key process for identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and their arrangement in correct reading order. Detected and labelled text zones play different logical roles inside the document, such as titles, captions, and footnotes. This research work proposes a new approach to segment documents and classify journals based on the header block. Documents collected from different journals are used as input images. Each image is segmented into blocks such as heading, header, author name, and footer using the Particle Swarm Optimization algorithm, and features are extracted from the header block using the Gray Level Co-occurrence Matrix. An Extreme Learning Machine was used for classification based on the header blocks and obtained 82.3% accuracy.
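The texture-feature step can be sketched with a small gray-level co-occurrence matrix and one Haralick statistic. This is a generic illustration of GLCM features, not the paper's exact configuration (offset, quantization levels, and the feature set used are assumptions here):

```python
def glcm(image, levels, dx=1, dy=0):
    """Co-occurrence counts of grey-level pairs at offset (dx, dy)."""
    m = [[0] * levels for _ in range(levels)]
    for y in range(len(image)):
        for x in range(len(image[0])):
            ny, nx = y + dy, x + dx
            if 0 <= ny < len(image) and 0 <= nx < len(image[0]):
                m[image[y][x]][image[ny][nx]] += 1
    return m

def contrast(m):
    """Haralick contrast: (i - j)^2 weighted by the normalized counts."""
    total = sum(sum(row) for row in m)
    return sum(
        (i - j) ** 2 * m[i][j] / total
        for i in range(len(m)) for j in range(len(m))
    )
```

Statistics such as contrast, energy, and homogeneity computed from this matrix form the feature vector that the classifier (here, an Extreme Learning Machine) consumes.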
Rhetorical Sentence Classification for Automatic Title Generation in Scientif... (TELKOMNIKA JOURNAL)
In this paper, we propose a rhetorical corpus construction and sentence classification model experiment that could be incorporated into the automatic title generation task for scientific articles. Rhetorical classification is treated as sequence labeling. A rhetorical sentence classification model is useful in tasks that consider a document's discourse structure. We performed experiments using datasets from two domains: computer science (CS dataset) and chemistry (GaN dataset). We evaluated the models using 10-fold cross-validation (0.70-0.79 weighted average F-measure) as well as on-the-run (0.30-0.36 error rate at best). Our models performed best when the imbalanced data were handled using a SMOTE filter.
Arabic text categorization algorithm using vector evaluation method (ijcsit)
Text categorization is the process of grouping documents into categories based on their contents. This process makes information retrieval easier, and it has become more important due to the huge amount of textual information available online. The main problem in text categorization is how to improve classification accuracy. Although Arabic text categorization is a promising new field, little research has been done in it. This paper proposes a new method for Arabic text categorization using vector evaluation. The proposed method uses a categorized corpus of Arabic documents; the weights of the tested document's words are then calculated to determine the document's keywords, which are compared with the keywords of the corpus categories to determine the tested document's best category.
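The keyword-comparison step can be sketched as follows. This is a minimal illustration assuming plain frequency weights and set overlap; the paper's actual weighting scheme is not specified here, and all names are invented:

```python
from collections import Counter

def keywords(text, top_n=3):
    """Weight words by frequency and keep the top_n as document keywords."""
    counts = Counter(text.lower().split())
    return {w for w, _ in counts.most_common(top_n)}

def best_category(doc, category_keywords):
    """Pick the category whose keyword set overlaps the document's most."""
    doc_kw = keywords(doc)
    return max(category_keywords,
               key=lambda c: len(doc_kw & category_keywords[c]))
```

In practice the category keyword sets would themselves be derived from the categorized corpus, and TF-IDF-style weights would replace raw counts.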
Improved method for pattern discovery in text mining (eSAT Journals)
Abstract: Digital data in the form of text documents is rapidly growing, and analyzing such data manually is a tedious task. Data mining techniques exist to analyze such data and surface interesting patterns. Many existing methods are based on term-based approaches that cannot deal with synonymy and polysemy; moreover, they lack the ability to use and update the discovered patterns. Zhong et al. proposed an effective pattern discovery technique: it discovers patterns and then computes pattern specificities for evaluating term weights according to their distribution in the discovered patterns. It also updates patterns that exhibit ambiguity, a feature known as pattern evolution. In this paper we implemented that technique and built a prototype application to test its efficiency. The empirical results revealed that the solution is very useful in the text mining domain. Keywords: text mining, pattern discovery, text classification, pattern evolving
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching, and research in the fields of Engineering and Technology. We bring together scientists, academicians, field engineers, scholars, and students of related fields of Engineering and Technology.
NAMED ENTITY RECOGNITION IN TURKISH USING ASSOCIATION MEASURES (acijjournal)
Named Entity Recognition, an important subject in Natural Language Processing, is a key technology for information extraction, information retrieval, question answering, and other text processing applications. In this study, we evaluate previously well-established association measures as an initial attempt to extract two-word named entities from a Turkish corpus. Furthermore, we propose a new association measure and compare it with the other methods. The evaluation of these methods is performed using precision and recall measures.
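One classic association measure for two-word units is pointwise mutual information (PMI), which scores how much more often a pair co-occurs than chance predicts. A minimal sketch (PMI is a standard baseline measure; whether it is among the paper's chosen measures is an assumption):

```python
import math

def pmi(bigram, tokens):
    """Pointwise mutual information of an adjacent word pair.

    pmi(x, y) = log( P(x, y) / (P(x) * P(y)) ); a high positive value
    suggests the pair behaves like one unit, e.g. a two-word name.
    """
    n = len(tokens)
    pairs = list(zip(tokens, tokens[1:]))
    p_xy = pairs.count(bigram) / len(pairs)
    p_x = tokens.count(bigram[0]) / n
    p_y = tokens.count(bigram[1]) / n
    return math.log(p_xy / (p_x * p_y))
```

Candidate pairs scoring above a threshold would be proposed as named entities; PMI's known bias toward rare words is one reason alternative measures get proposed.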
Scene text recognition in mobile applications by character descriptor and str... (eSAT Journals)
Abstract
Camera-based scene images usually have complex backgrounds filled with non-text objects in multiple shapes and colors, and existing systems are sensitive to font scale changes and background interference. The main focus of this system is on two character recognition methods. In text detection, previously proposed algorithms are used to search for regions of text strings. The proposed system uses a character descriptor that is effective at extracting representative and discriminative text features for both recognition schemes; the local feature descriptor HOG is compatible with all the above keypoint detectors. Our method of scene text recognition from detected text regions is suitable for mobile device applications. The proposed system accurately extracts text from natural scene images in the presence of background interference. The demo system illustrates the algorithm design and the performance improvements of scene text extraction; it is able to detect regions of text strings in cluttered scenes and recognize the characters within them.
Keywords: Scene text detection, scene text recognition, character descriptor, stroke configuration, text understanding, text retrieval.
A COMPARATIVE STUDY OF ROOT-BASED AND STEM-BASED APPROACHES FOR MEASURING THE... (acijjournal)
Representation of the semantic information contained in words is needed for any Arabic Text Mining application. More precisely, the purpose is to better take into account the semantic dependencies between words, as expressed by their co-occurrence frequencies. There have been many proposals for computing similarities between words based on their distributions in contexts. In this paper, we compare and contrast the effect of two preprocessing techniques applied to an Arabic corpus, the Root-based (Stemming) and Stem-based (Light Stemming) approaches, for measuring the similarity between Arabic words with the well-known abstractive model Latent Semantic Analysis (LSA), using a wide variety of distance functions and similarity measures, such as the Euclidean Distance, Cosine Similarity, Jaccard Coefficient, and Pearson Correlation Coefficient. The obtained results show that, on the one hand, the variety of the corpus produces more accurate results; on the other hand, the Stem-based approach outperformed the Root-based one, because the latter distorts word meanings.
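The four measures named above are standard and can be sketched directly on word vectors (e.g. rows of an LSA-reduced term matrix). The Jaccard variant shown is the weighted (min/max) generalization, which is one common choice for real-valued vectors:

```python
import math

def euclidean(u, v):
    """Euclidean distance: straight-line distance between vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    """Cosine similarity: 1 for parallel vectors, 0 for orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u))
                  * math.sqrt(sum(b * b for b in v)))

def jaccard(u, v):
    """Weighted Jaccard coefficient on non-negative vectors."""
    inter = sum(min(a, b) for a, b in zip(u, v))
    union = sum(max(a, b) for a, b in zip(u, v))
    return inter / union

def pearson(u, v):
    """Pearson correlation: cosine of the mean-centred vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = (math.sqrt(sum((a - mu) ** 2 for a in u))
           * math.sqrt(sum((b - mv) ** 2 for b in v)))
    return num / den
```

Note that Euclidean is a distance (lower means more similar) while the other three are similarities (higher means more similar), which matters when comparing their rankings.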
Anatomical Survey Based Feature Vector for Text Pattern Detection (IJEACS)
A vital objective of artificial intelligence is to discover and understand human competences, one of which is the capability to distinguish text objects within one or more images exhibited on any canvas, including prints, videos, and electronic displays. Multimedia data has increased rapidly in past years, and the textual information present in multimedia carries important information about the image or video content. However, the human ability to detect and differentiate text within an image still needs to be replicated technologically for computers. Hence, in this paper a feature set based on an anatomical study of the human text detection system is proposed.
Due to the exponential growth in the generation of textual data, the need for tools and mechanisms for automatic summarization of documents has become critical. Text documents are vital to any organization's day-to-day work, and long documents often hamper routine tasks; an automatic summarizer is therefore vital to reducing human effort. Text summarization is an important activity in the analysis of high-volume text documents and is currently a major research topic in Natural Language Processing. It is the process of generating a summary of an input text by extracting its representative sentences. In this project, we present a novel technique for summarizing domain-specific text using Semantic Analysis, a subset of Natural Language Processing.
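Extracting representative sentences can be sketched with the simplest scoring scheme: rank sentences by the average corpus frequency of their words. This is a generic extractive baseline, not the project's semantic-analysis technique, and the names are illustrative:

```python
from collections import Counter

def summarize(sentences, n=1):
    """Score each sentence by average word frequency; keep the top n.

    Sentences whose words recur across the document are treated as
    more representative of its content.
    """
    words = [w for s in sentences for w in s.lower().split()]
    freq = Counter(words)
    scored = sorted(
        sentences,
        key=lambda s: sum(freq[w] for w in s.lower().split())
                      / max(len(s.split()), 1),
        reverse=True,
    )
    return scored[:n]
```

A semantic-analysis approach would replace the raw frequency signal with similarity in a semantic space, but the extract-and-rank skeleton stays the same.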
Segmentation of Handwritten Chinese Character Strings Based on improved Algor... (ijeei-iaes)
Algorithm Liu has attracted great attention because of its high accuracy in segmenting Japanese postal addresses, but its disadvantages, such as its complexity and the difficulty of implementing it, have hindered its popularization and application. In this paper, based on a deep study of Algorithm Liu, the author applies its principles to handwritten Chinese character segmentation according to the characteristics of handwritten Chinese characters. The author also puts forward judgment criteria for classifying segmentation blocks and for the adhering (touching) modes of handwritten Chinese characters. During segmentation, text images are viewed as sequences of Connected Components (CCs), where each connected component consists of several horizontal runs of black pixels in the image. The author determines whether these parts should be merged into a segment by analyzing the connected components, then segments touching characters based on an analysis of outline edges, and finally cuts the text image into character segments. Experimental results show that the improved Algorithm Liu obtains high segmentation accuracy and produces satisfactory segmentation results.
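The connected-component analysis at the heart of this pipeline can be sketched with a standard flood-fill labelling pass (this illustrates generic CC labelling, not Algorithm Liu's run-based representation):

```python
def connected_components(grid):
    """Label 4-connected foreground regions of a binary grid via flood fill.

    Returns (number_of_components, label_grid) where background cells
    keep label 0 and each component gets labels 1, 2, ...
    """
    h, w = len(grid), len(grid[0])
    labels = [[0] * w for _ in range(h)]
    current = 0
    for sy in range(h):
        for sx in range(w):
            if grid[sy][sx] and not labels[sy][sx]:
                current += 1
                stack = [(sy, sx)]
                labels[sy][sx] = current
                while stack:
                    y, x = stack.pop()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and grid[ny][nx] and not labels[ny][nx]):
                            labels[ny][nx] = current
                            stack.append((ny, nx))
    return current, labels
```

Each labelled component (or a merged group of them) becomes a candidate character block, which the subsequent merge/split criteria then classify.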
Alzheimer's disease (AD) is a neurological disease that affects memory and the livelihood of people diagnosed with it. In this paper, we discuss various imaging modalities, feature selection and extraction, segmentation, and classification techniques.
Vulnerability scanners a proactive approach to assess web application security (ijcsa)
With increasing concern for security in networks, many approaches have been laid out that try to protect the network from unauthorised access, and new methods have been adopted to find potential discrepancies that may damage it. The most commonly used approach is vulnerability assessment. By vulnerability we mean the potential flaws in a system that make it prone to attack. Assessment of these system vulnerabilities provides a means to identify and develop new strategies to protect the system from the risk of being damaged. This paper focuses on the use of various vulnerability scanners and their related methodology to detect vulnerabilities in web applications or remote hosts across the network, and tries to identify new mechanisms that can be deployed to secure the network.
A survey on cloud security issues and techniques (ijcsa)
Today, cloud computing is an emerging model of computing in computer science: a set of resources and services offered over the network or internet. Cloud computing extends various computing techniques such as grid computing and distributed computing, and is used today in both industry and academia. The cloud facilitates its users by providing virtual resources via the internet. As the field of cloud computing spreads, new techniques are developing, and this growth of the cloud computing environment also increases security challenges for cloud developers. Users of the cloud save their data in the cloud, so a lack of security in the cloud can cost it the users' trust.
In this paper we discuss some cloud security issues in various aspects, such as multi-tenancy, elasticity, and availability. The paper also discusses existing security techniques and approaches for a secure cloud. This paper will enable researchers and professionals to learn about different security threats and the models and tools proposed.
Some alternative ways to find m ambiguous binary words corresponding to a par... (ijcsa)
The Parikh matrix of a word gives numerical information about the word in terms of its subwords. In this paper an algorithm for finding the Parikh matrix of a binary word is introduced; with its help, the Parikh matrix of a binary word, however large, can be found. M-ambiguous words are a known problem for Parikh matrices, and this paper also presents an algorithm to find the M-ambiguous words of an ordered binary word instantly. We introduce a system for representing binary words in a two-dimensional field and observe relations among the representations of M-ambiguous words in this field. We also introduce a set of equations that help to calculate the M-ambiguous words.
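For a binary word over the ordered alphabet {a, b}, the Parikh matrix is a 3x3 upper-triangular matrix recording the counts of a, of b, and of ab as a scattered subword; it can be computed in a single left-to-right scan. A minimal sketch (the paper's own algorithm and representation may differ):

```python
def parikh_matrix(word):
    """Parikh matrix of a word over the ordered alphabet {a, b}.

    M[0][1] = number of a's, M[1][2] = number of b's,
    M[0][2] = occurrences of ab as a scattered (not necessarily
    contiguous) subword.
    """
    a, b, ab = 0, 0, 0
    for ch in word:
        if ch == "a":
            a += 1
        elif ch == "b":
            b += 1
            ab += a  # every a seen so far pairs with this b
    return [[1, a, ab], [0, 1, b], [0, 0, 1]]
```

M-ambiguity is exactly when distinct words share this matrix: the classic pair "abba" and "baab" both yield counts (2, 2, 2), so the matrix cannot tell them apart.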
Vulnerabilities and attacks targeting social networks and industrial control ... (ijcsa)
A vulnerability is a weakness, shortcoming, or flaw in a system or network infrastructure that can be used by an attacker to harm the system, disrupt its normal operation, and exploit it for financial, competitive, or other motives, or just for cyber escapades.
In this paper, we re-examine the various types of attacks on industrial control systems as well as on social networking users. We list which vulnerabilities were exploited to execute these attacks and their effects on these systems and social networks. The focus is mainly on vulnerabilities in online social networks (OSNs) that turn a social network into an antisocial one: such networks can be further used to attack users associated with a victim user, creating a consecutive chain of attacks on an increasing number of social networking users. Another attack discussed here is the Stuxnet attack, originally designed to strike Iran's nuclear facilities, which harms the system it controls by changing the code in the target system. The Stuxnet worm is a very treacherous and hazardous means of attack and is the first of its kind, as it allows the attacker to manipulate real-time equipment.
Impact of HeartBleed Bug in Android and Counter Measures (ijcsa)
Nowadays smartphones are in use around the globe, and the number of Android users is increasing day by day; here the main problem arises. Devices based on the Android operating system are more advanced but also more prone to bugs than devices running other operating systems. Android ships with many apps in order to provide services to the user, and app developers are often in a hurry to release apps to match market strategy, which causes vulnerabilities; some even intentionally create apps in order to hack the device. Compared to other operating systems, Android is open source, so anyone can reverse-engineer APKs, make modifications, and release the modified APKs into the market. We believe that our study will alert developers and researchers.
IJERA (International journal of Engineering Research and Applications) is International online, ... peer reviewed journal. For more detail or submit your article, please visit www.ijera.com
Header Based Classification of Journals Using Document Image Segmentation and...CSCJournals
Document image segmentation plays an important role in classification of journals, magazines, newspaper, etc., It is a process of splitting the document into distinct regions. Document layout analysis is a key process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non- textual ones and the arrangement in their correct reading order. Detection and labelling of text zones play different logical roles inside the document such as titles, captions, footnotes, etc. This research work proposes a new approach to segment the document and classify the journals based on the header block. Documents are collected from different journals and used as input image. The image is segmented into blocks like heading, header, author name and footer using Particle Swarm optimization algorithm and features are extracted from header block using Gray Level Co-occurrences Matrix. Extreme Learning Machine has been used for classification based on the header blocks and obtained 82.3% accuracy.
Rhetorical Sentence Classification for Automatic Title Generation in Scientif...TELKOMNIKA JOURNAL
In this paper, we proposed a work on rhetorical corpus construction and sentence classification
model experiment that specifically could be incorporated in automatic paper title generation task for
scientific article. Rhetorical classification is treated as sequence labeling. Rhetorical sentence classification
model is useful in task which considers document’s discourse structure. We performed experiments using
two domains of datasets: computer science (CS dataset), and chemistry (GaN dataset). We evaluated the
models using 10-fold-cross validation (0.70-0.79 weighted average F-measure) as well as on-the-run
(0.30-0.36 error rate at best). We argued that our models performed best when handled using SMOTE
filter for imbalanced data.
Arabic text categorization algorithm using vector evaluation methodijcsit
Text categorization is the process of grouping documents into categories based on their contents. This
process is important to make information retrieval easier, and it became more important due to the huge
textual information available online. The main problem in text categorization is how to improve the
classification accuracy. Although Arabic text categorization is a new promising field, there are a few
researches in this field. This paper proposes a new method for Arabic text categorization using vector
evaluation. The proposed method uses a categorized Arabic documents corpus, and then the weights of the
tested document's words are calculated to determine the document keywords which will be compared with
the keywords of the corpus categorizes to determine the tested document's best category.
Improved method for pattern discovery in text miningeSAT Journals
Abstract Digital data in the form of text documents is rapidly growing. Analyzing such data manually is a tedious task. Data mining techniques have been around to analyze such data and bring about interesting patterns. Many existing methods are based on term-based approaches that can’t deal with synonymy and polysemy. Moreover they lack the ability in using and updating the discovered patterns. Zhong et al. proposed an effective pattern discovery technique. It discovers patterns and then computes specificities of patterns for evaluating term weights as per their distribution in the discovered patterns. It also takes care of updating patterns that exhibit ambiguity which is a feature known as pattern evolution. In this paper we implemented that technique and also built a prototype application to test the efficiency of the technique. The empirical results revealed that the solution is very useful in text mining domain. Keywords – Text mining, pattern discovery, text classification, pattern evolving
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
NAMED ENTITY RECOGNITION IN TURKISH USING ASSOCIATION MEASURESacijjournal
Named Entity Recognition which is an important subject of Natural Language Processing is a key technology of information extraction, information retrieval, question answering and other text processing applications. In this study, we evaluate previously well-established association measures as an initial
attempt to extract two-worded named entities in a Turkish corpus. Furthermore we propose a new association measure, and compare it with the other methods. The evaluation of these methods is performed by precision and recall measures.
Scene text recognition in mobile applications by character descriptor and str...eSAT Journals
Abstract
Camera-based scene images usually have complex background filled with non-text objects in multiple shapes and colors. The existing system is sensitive to font scale changes and background interference. The main focusof this system is on two character recognition methods. In text detection, previously proposed algorithms are used to search for regions of text strings. Proposed system uses character descriptor which is effective to extract representative and discriminative text features for both recognition schemes. The local features descriptor HOG is compatible with all above key point detectors. Our method of scene text recognition from detected text regions is compatible with the application of mobile devices. Proposedsystem accurately extracts text from natural scene image in presence of background interference.The demo system gives us details of algorithm design and performance improvements of scene text extraction. It is ableto detect text region of text strings from cluttered and recognize characters in the text regions.
Keywords: Scene text detection, scene text recognition, character descriptor, stroke configuration, text understanding, text retrieval.
A COMPARATIVE STUDY OF ROOT-BASED AND STEM-BASED APPROACHES FOR MEASURING THE...acijjournal
Representation of the semantic information contained in words is needed for any Arabic Text Mining application. More precisely, the purpose is to better take into account the semantic dependencies between words expressed by the co-occurrence frequencies of these words. There have been many proposals to compute similarities between words based on their distributions in contexts. In this paper, we compare and contrast the effect of two preprocessing techniques applied to an Arabic corpus: the Root-based (Stemming) and Stem-based (Light Stemming) approaches for measuring the similarity between Arabic words with the well-known abstractive model, Latent Semantic Analysis (LSA), using a wide variety of distance functions and similarity measures, such as the Euclidean Distance, Cosine Similarity, Jaccard Coefficient, and the Pearson Correlation Coefficient. The obtained results show that, on the one hand, the variety of the corpus produces more accurate results; on the other hand, the Stem-based approach outperformed the Root-based one, because the latter affects the words' meanings.
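The distance functions and similarity measures named in this abstract are standard; a minimal sketch of each, applied to hypothetical LSA-style term vectors (the vectors and setup are illustrative, not taken from the paper's corpus):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def jaccard(u, v):
    # Generalized (Tanimoto) form of the Jaccard coefficient for real-valued vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sum(a * a for a in u) + sum(b * b for b in v) - dot)

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pearson(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

# Hypothetical LSA-reduced vectors for two words with proportional contexts
w1, w2 = [1.0, 2.0, 0.0, 1.0], [2.0, 4.0, 0.0, 2.0]
print(round(cosine(w1, w2), 3))   # parallel vectors -> 1.0
```

Note that cosine and Pearson both report maximal similarity for proportional vectors, while Euclidean distance still separates them; that difference is exactly why studies like this one compare several measures.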
Anatomical Survey Based Feature Vector for Text Pattern Detection (IJEACS)
The vital objective of artificial intelligence is to discover and understand human competences, one of which is the capability to distinguish text objects within one or more images exhibited on any canvas, including prints, videos or electronic displays. Multimedia data has increased rapidly in past years. Textual information present in multimedia carries important information about the image/video content. However, the commonly used human intelligence of detecting and differentiating text within an image needs to be technologically replicated for computers. Hence, in this paper a feature set based on an anatomical study of the human text detection system is proposed.
Due to an exponential growth in the generation of textual data, the need for tools and mechanisms for automatic summarization of documents has become very critical. Text documents are vital to any organization's day-to-day working and as such, long documents often hamper trivial work. Therefore, an automatic summarizer is vital to reducing human effort. Text summarization is an important activity in the analysis of a high volume of text documents and is currently a major research topic in Natural Language Processing. It is the process of generating a summary of input text by extracting the representative sentences from it. In this project, we present a novel technique for generating a summary of domain-specific text using Semantic Analysis, a subset of Natural Language Processing.
Segmentation of Handwritten Chinese Character Strings Based on improved Algor... (ijeei-iaes)
Algorithm Liu attracts high attention because of its high accuracy in segmenting Japanese postal addresses. But its disadvantages, such as complexity and difficulty of implementation, have an adverse effect on its popularization and application. In this paper, after an in-depth study of algorithm Liu, the author applies its principles to handwritten Chinese character segmentation according to the characteristics of handwritten Chinese characters. At the same time, the author puts forward judgment criteria for segmentation block classification and for the adhering modes of handwritten Chinese characters. In the process of segmentation, text images are seen as a sequence of Connected Components (CCs), where each connected component is made up of several horizontal runs of black pixels in the image. The author determines whether these parts should be merged into a segment by analyzing the connected components, then performs image segmentation by adhering mode based on an analysis of outline edges, and finally cuts the text images into character segments. Experimental results show that the improved algorithm Liu obtains high segmentation accuracy and produces satisfactory segmentation results.
Alzheimer's disease (AD) is a neurological disease. It affects memory and the livelihood of the people diagnosed with it. In this paper, we discuss various imaging modalities, feature selection and extraction, segmentation and classification techniques.
Vulnerability scanners: a proactive approach to assess web application security (ijcsa)
With the increasing concern for security in the network, many approaches have been laid out that try to protect the network from unauthorised access. New methods have been adopted in order to find the potential discrepancies that may damage the network. The most commonly used approach is vulnerability assessment. By vulnerability, we mean the potential flaws in the system that make it prone to attack. Assessment of these system vulnerabilities provides a means to identify and develop new strategies to protect the system from the risk of being damaged. This paper focuses on the usage of various vulnerability scanners and their related methodology to detect the vulnerabilities present in web applications or in remote hosts across the network, and tries to identify new mechanisms that can be deployed to secure the network.
A survey on cloud security issues and techniques (ijcsa)
Today, cloud computing is an emerging way of computing in computer science. Cloud computing is a set of resources and services offered over the network or internet. It extends various computing techniques like grid computing and distributed computing, and is used today in both industry and academia. The cloud facilitates its users by providing virtual resources via the internet. As the field of cloud computing spreads, new techniques are developing. This growth of the cloud computing environment also increases security challenges for cloud developers. Users of the cloud save their data in the cloud, hence a lack of cloud security can lose users' trust.
In this paper we discuss some cloud security issues in various aspects like multi-tenancy, elasticity and availability. The paper also discusses existing security techniques and approaches for a secure cloud, enabling researchers and professionals to learn about the different security threats, models and tools that have been proposed.
Some alternative ways to find m ambiguous binary words corresponding to a par... (ijcsa)
The Parikh matrix of a word gives numerical information about the word in terms of its subwords. In this paper, an algorithm for finding the Parikh matrix of a binary word is introduced. With the help of this algorithm, the Parikh matrix of a binary word, however large, can be found. M-ambiguous words, which are distinct words sharing the same Parikh matrix, are the chief problem with this representation. In this paper an algorithm is shown to find the M-ambiguous words of a binary ordered word instantly. We have introduced a system to represent binary words in a two-dimensional field, and we observe that there are relations among the representations of M-ambiguous words in this field. We have also introduced a set of equations which help to calculate the M-ambiguous words.
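For a binary word over the ordered alphabet {a, b}, the Parikh matrix is the product of one elementary upper-triangular matrix per letter. A small sketch of the standard construction (not the paper's own algorithm) also shows M-ambiguity concretely:

```python
def parikh_matrix(word):
    """Parikh matrix of a binary word over the ordered alphabet {a, b}.

    In the resulting 3x3 upper-triangular matrix, M[0][1] counts a's,
    M[1][2] counts b's, and M[0][2] counts occurrences of 'ab' as a
    scattered subword.
    """
    m = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # 3x3 identity
    for ch in word:
        if ch == 'a':
            # right-multiply by the elementary matrix for 'a' (adds col 1 to col 2)
            m[0][1] += m[0][0]
        elif ch == 'b':
            # right-multiply by the elementary matrix for 'b' (adds col 2 to col 3)
            m[0][2] += m[0][1]
            m[1][2] += m[1][1]
    return m

print(parikh_matrix("abab"))  # [[1, 2, 3], [0, 1, 2], [0, 0, 1]]
# "abba" and "baab" are M-ambiguous: distinct words, same Parikh matrix.
print(parikh_matrix("abba") == parikh_matrix("baab"))  # True
```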
Vulnerabilities and attacks targeting social networks and industrial control ... (ijcsa)
Vulnerability is a weakness, shortcoming or flaw in the system or network infrastructure which can be used by an attacker to harm the system, disrupt its normal operation and use it for his financial, competitive or other motives, or just for cyber escapades.
In this paper, we re-examine the various types of attacks on industrial control systems as well as on social networking users. We list which vulnerabilities were exploited in executing these attacks and their effects on these systems and social networks. The focus is mainly on the vulnerabilities used in OSNs as convertors that turn a social network into an antisocial network; these networks can then be used for attacks on the users associated with the victim user, creating a consecutive chain of attacks on an increasing number of social networking users. Another type of attack, the Stuxnet attack, which was originally designed to attack Iran's nuclear facilities, is also discussed here; it harms the system it controls by changing the code in the target system. The Stuxnet worm is a very treacherous and hazardous means of attack and is the first of its kind, as it allows the attacker to manipulate real-time equipment.
Impact of HeartBleed Bug in Android and Counter Measures (ijcsa)
Nowadays smart phones are spreading around the globe, and the number of Android users is increasing day by day; this is where the main problem arises. Android-based devices are more advanced, but also more prone to bugs, compared to devices running other operating systems. Android comes with a lot of apps in order to provide services to the user, and app developers are often in a hurry to release apps to match market strategy, which causes vulnerabilities. Some developers even create apps intentionally in order to hack devices. Compared to other operating systems, Android is open source, so anyone can try to reverse-engineer APKs, make modifications, and release the modified APKs into the market. We believe that our study will alert developers and researchers.
The study evaluates three background subtraction techniques, ranging from very basic algorithms to state-of-the-art published techniques, categorized by speed, memory requirements and accuracy. Such a review can effectively guide the designer to select the most suitable method for a given application in a principled way. The algorithms used in the study span varying levels of accuracy and computational complexity. A few of them can also deal with real-time challenges like rain, snow, hail, swaying branches, overlapping objects, varying light intensity or slow-moving objects.
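As a concrete example of the "very basic" end of that spectrum, a running-average background subtractor can be sketched as follows (a generic illustration of the technique family, not one of the three methods the study evaluates; the frames and thresholds are synthetic):

```python
import numpy as np

def running_average_subtractor(frames, alpha=0.05, threshold=25):
    """Basic background subtraction via an exponential running average.

    The background model B is updated as B = (1 - alpha)*B + alpha*F,
    and pixels with |F - B| > threshold are flagged as foreground.
    """
    background = frames[0].astype(np.float64)
    masks = []
    for frame in frames[1:]:
        f = frame.astype(np.float64)
        masks.append(np.abs(f - background) > threshold)
        background = (1 - alpha) * background + alpha * f
    return masks

# Synthetic 4x4 grayscale frames: a static scene, then one bright moving pixel
static = np.full((4, 4), 100, dtype=np.uint8)
moving = static.copy()
moving[1, 1] = 200
masks = running_average_subtractor([static, static, moving])
print(masks[1].sum())  # exactly one foreground pixel detected
```

The slow update rate (alpha) is what lets such models absorb gradual illumination changes while still flagging fast-moving objects, which is the trade-off the surveyed methods refine.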
NoCs performance improvement using parallel transmission through wireless links (ijcsa)
The Network-on-Chip (NoC) is a solution for integrating high numbers of cores on a single chip. Integrating a high number of cores, especially on a mesh topology, causes a long diameter, which in turn degrades network performance due to an increase in average hop count. Hence, solutions like long-range links have been proposed to decrease the average hop count. These links can be implemented with new technologies such as high-bandwidth on-chip wireless connections to decrease latency. On-chip wireless links provide high-bandwidth interconnections using carbon nanotube antennas. The bandwidth of wireless links is higher than that of wired links; hence a new transmission rule is needed to handle the bandwidth incompatibility. In this paper, a method for transmitting/receiving flits through the wireless links in a parallel manner is first presented. Then, a parallel buffer structure to store flits from wireless links is introduced. Finally, we demonstrate the advantages of the proposed method using energy and latency analysis. Simulation results show that energy is saved by around 30% on all-to-all traffic and 15% on transpose traffic. Network latency as a function of the packet injection rate can be improved on the all-to-all and transpose traffics by around 71% and 19%, respectively.
Modeling of manufacturing of a field effect transistor to determine condition... (ijcsa)
In this paper we introduce an approach to model the technological process of manufacturing a field-effect heterotransistor. The modeling makes it possible to optimize the technological process to decrease the channel length by using mechanical stress. As accompanying results of this decrease, one obtains a decrease in the thickness of the heterotransistors and an increase in their density within integrated circuits.
This paper reviews Ant Colony Optimization (ACO) and the Genetic Algorithm (GA), two powerful meta-heuristics. The paper first explains some major defects of these two algorithms, then proposes a new model for ACO in which artificial ants use a quick genetic operator to accelerate the selection of the next state.
Experimental results show that the proposed hybrid algorithm is effective, and that its performance, including speed and accuracy, beats other versions.
IMPACT OF DIFFERENT SELECTION STRATEGIES ON PERFORMANCE OF GA BASED INFORMATI... (ijcsa)
As information proliferates, searching for relevant information has become a primary task. Searching, or Information Retrieval (IR), aims to help users organise and retrieve those documents from the documentary collection which are most likely to satisfy their information needs. An optimal Information Retrieval System (IRS) is one which retrieves from the document database only those documents pertinent to the user's information needs, while excluding documents that are not relevant. The Genetic Algorithm is characterized by a higher likelihood of finding good solutions to large and complex IR optimisation problems. The performance of a Genetic Algorithm depends upon the choice of underlying operators, namely selection, crossover and mutation. A GA-based algorithm, IRIGA (Information Retrieval Improvement using Genetic Algorithm), is developed to improve the performance of an Information Retrieval System. This paper presents a comparison of the performance of IRIGA when different selection methods are used. The results are analysed by conducting experiments keeping the rest of the GA parameters constant and varying only the selection strategy.
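Two selection strategies commonly compared in such studies can be sketched as below. This is a generic illustration, not the IRIGA implementation; the population of candidate rankings and the fitness scores are hypothetical:

```python
import random

def roulette_selection(population, fitness, rng):
    """Fitness-proportionate (roulette-wheel) selection."""
    total = sum(fitness)
    pick = rng.uniform(0, total)
    acc = 0.0
    for individual, f in zip(population, fitness):
        acc += f
        if acc >= pick:
            return individual
    return population[-1]

def tournament_selection(population, fitness, rng, k=3):
    """Pick k random contenders and return the fittest of them."""
    contenders = rng.sample(range(len(population)), k)
    return population[max(contenders, key=lambda i: fitness[i])]

rng = random.Random(42)
pop = ["doc-ranking-A", "doc-ranking-B", "doc-ranking-C", "doc-ranking-D"]
fit = [0.1, 0.2, 0.3, 0.9]  # hypothetical retrieval-quality scores
print(tournament_selection(pop, fit, rng))
```

The two strategies apply different selection pressure: roulette selection keeps weak individuals alive in proportion to their fitness, while tournament selection is harsher, which is exactly the kind of behavioural difference an experiment varying only the selection operator would expose.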
The World Wide Web is a huge repository of information, and there is a tremendous increase in the volume of information daily. The number of users is also increasing day by day. A lot of research has taken place to reduce users' browsing time. Web Usage Mining is a type of web mining in which mining techniques are applied to log data to extract the behaviour of users. Clustering plays an important role in a broad range of applications like web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. Clustering is the grouping of similar instances or objects. The key factor for clustering is some sort of measure that can determine whether two objects are similar or dissimilar. In this paper a novel clustering method to partition user sessions into accurate clusters is discussed. The accuracy and various performance measures of the proposed algorithm show that the proposed method is a better method for web log mining.
LabVIEW with DWT for denoising the blurred biometric images (ijcsa)
In this paper, denoising of a blurred biometric image (a fingerprint) is presented and investigated using LabVIEW applications; the image is blurred and corrupted with Gaussian noise. This work proposes an algorithm that uses the discrete wavelet transform (DWT) to divide the image into two parts, which increases the manipulation speed for large biometric images. The work includes two tasks: the first designs the LabVIEW system to calculate and present the approximation coefficients, by which the image's blur factor is reduced to a minimum value according to the proposed algorithm. The second task removes the image's noise by calculating the regression coefficients according to the Bayesian-Shrinkage estimation method.
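The DWT-plus-shrinkage pipeline described above can be sketched in plain NumPy with a one-level Haar transform and soft thresholding. This is a simplified stand-in for the paper's LabVIEW system and its Bayesian-Shrinkage estimator; the test image, noise level and threshold are all illustrative:

```python
import numpy as np

def haar_dwt2(img):
    """One-level 2-D Haar DWT: approximation (LL) plus three detail bands."""
    a = (img[:, 0::2] + img[:, 1::2]) / 2.0   # horizontal averages
    d = (img[:, 0::2] - img[:, 1::2]) / 2.0   # horizontal differences
    return ((a[0::2] + a[1::2]) / 2.0, (a[0::2] - a[1::2]) / 2.0,
            (d[0::2] + d[1::2]) / 2.0, (d[0::2] - d[1::2]) / 2.0)

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2 (perfect reconstruction)."""
    rows, cols = 2 * ll.shape[0], ll.shape[1]
    a, d = np.empty((rows, cols)), np.empty((rows, cols))
    a[0::2], a[1::2] = ll + lh, ll - lh
    d[0::2], d[1::2] = hl + hh, hl - hh
    img = np.empty((rows, 2 * cols))
    img[:, 0::2], img[:, 1::2] = a + d, a - d
    return img

def soft_threshold(band, t):
    """Shrink detail coefficients toward zero, as shrinkage estimators do."""
    return np.sign(band) * np.maximum(np.abs(band) - t, 0.0)

rng = np.random.default_rng(0)
clean = np.full((8, 8), 128.0)                    # smooth test image
noisy = clean + rng.normal(0.0, 5.0, clean.shape)
ll, lh, hl, hh = haar_dwt2(noisy)
denoised = haar_idwt2(ll, soft_threshold(lh, 3.0),
                      soft_threshold(hl, 3.0), soft_threshold(hh, 3.0))
```

Because noise spreads across all sub-bands while smooth image content concentrates in the approximation band, shrinking only the detail coefficients removes noise with little damage to the image, which is the premise of the paper's Bayesian-Shrinkage step.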
Augmented reality (AR) is a technology which provides real-time integration of digital content with information available in the real world. Augmented reality enables direct access to implicit information attached to context in real time, and enhances our perception of the real world by enriching what we see, feel, and hear in the real environment. This paper gives a comparative study of various augmented reality software development kits (SDKs) available for creating augmented reality apps. The paper describes how augmented reality differs from virtual reality, the working of an augmented reality system, and the different types of tracking used in AR.
Robust face recognition by applying partitioning around medoids over eigen fa... (ijcsa)
An unsupervised learning methodology for robust face recognition is proposed for enhancing invariance to various changes in the face. The area of face recognition, in spite of being the most unobtrusive biometric modality of all, has encountered challenges to high performance in uncontrolled environments owing to frequently occurring, unavoidable variations in the face. These changes may be due to noise, outliers, changing expressions, emotions, pose, illumination, or facial distractions like makeup, spectacles, hair growth etc. Methods for dealing with these variations have been developed in the past with varying success. However, cost and time efficiency play a crucial role in implementing any methodology in the real world. This paper presents a method that integrates the technique of Partitioning Around Medoids with Eigen Faces and Fisher Faces to improve the efficiency of face recognition considerably. The system so designed has higher resistance to the impact of various changes in the face and performs well in terms of success rate, cost involved and time complexity. The methodology can therefore be used in developing highly robust face recognition systems for real-time environments.
An AHP (Analytic Hierarchy Process)-FCE (Fuzzy Comprehensive Evaluation) based... (ijcsa)
In this paper the AHP (Analytic Hierarchy Process) and the FCE (Fuzzy Comprehensive Evaluation) are applied to find the best coaches from different sports and to rank these great coaches.
First, we screen coaches' information using three screening criteria and rank the screened coaches preliminarily by means of the Analytic Hierarchy Process (AHP). Second, we rank them by the Fuzzy Comprehensive Evaluation method (FCE), and determine the top 5 coaches in basketball, football and hockey. Third, we use the TOPSIS method to test the accuracy and reasonableness of the model, modify the model, and then reorder the original results to inspect the consistency of the results of the two models. Finally, we take some other factors into account to optimize our model, including the influence of the time horizon and gender.
Automatic speech emotion and speaker recognition based on hybrid GMM and FFBNN (ijcsa)
In this paper we present text-dependent speaker recognition, with the enhancement of detecting the emotion of the speaker beforehand, using hybrid FFBNN and GMM methods. The emotional state of the speaker influences the recognition system. The Mel-Frequency Cepstral Coefficient (MFCC) feature set is used for experimentation. To recognize the emotional state of a speaker, a Gaussian Mixture Model (GMM) is used in the training phase, and a Feed Forward Back Propagation Neural Network (FFBNN) in the testing phase. A speech database consisting of 25 speakers recorded in five different emotional states (happy, angry, sad, surprised and neutral) is used for experimentation. The results reveal that the emotional state of the speaker has a significant impact on the accuracy of speaker recognition.
Grid computing is concerned with the sharing and use of resources in dynamic distributed virtual organizations. The dynamic nature of Grid environments introduces challenging security concerns that demand new technical approaches. In this brief overview we review key Grid security issues and outline the technologies being developed to address them. We focus on work done with the Globus Toolkit to provide security, and also discuss cyber security in the Grid.
Energy efficient MAC protocols for wireless sensor network (ijcsa)
Wireless sensor networks are collections of individual nodes that can interact with the physical environment statically or dynamically by sensing or controlling physical parameters. Wireless sensor networks have become a leading solution in many important applications such as intrusion detection, target tracking and industrial automation. A major problem with WSNs is determining the most efficient protocol for conserving the energy of the power source. The design of an energy-efficient Medium Access Control (MAC) protocol is one of the major issues in wireless sensor networks (WSNs). In this paper we study some characteristics of WSNs that are important for the design of MAC layer protocols and give a brief introduction to some recent MAC protocols with reference to energy efficiency in WSNs. In accordance with channel access policies, MAC protocols are classified into four types, namely cross-layer, TDMA-based, contention-based and hybrid protocols, which are discussed in this paper.
Enhancement and Segmentation of Historical Records (csandit)
Document Analysis and Recognition (DAR) aims to extract automatically the information in a document and also supports human comprehension. The automatic processing of degraded historical documents is an application of the document image analysis field that is confronted with many difficulties due to storage conditions and the complexity of the script. The main interest of enhancement of historical documents is to remove undesirable artifacts that appear in the background and to highlight the foreground, so as to enable automatic recognition of documents with high accuracy. This paper addresses pre-processing and segmentation of ancient scripts, as an initial step towards automating the task of an epigraphist in reading and deciphering inscriptions. Pre-processing involves enhancement of degraded ancient document images, achieved through four different spatial filtering methods for smoothing or sharpening, namely Median, Gaussian blur, Mean and Bilateral filters, with different mask sizes. This is followed by binarization of the enhanced image to highlight the foreground information, using the Otsu thresholding algorithm. In the second phase, segmentation is carried out using the Drop Fall and Water Reservoir approaches to obtain sampled characters, which can be used in later stages of OCR. The system showed good results when tested on nearly 150 samples of variously degraded epigraphic images, giving the best enhanced output with a 4x4 mask for the Median filter, a 2x2 mask for Gaussian blur, and 4x4 masks for the Mean and Bilateral filters. The system can effectively sample characters from enhanced images, giving segmentation rates of 85%-90% for both the Drop Fall and Water Reservoir techniques.
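The Otsu step named in the pipeline above picks the gray-level threshold that maximizes between-class variance of the histogram. A compact NumPy sketch of the standard algorithm (a generic implementation, not the paper's code; the bimodal test image is synthetic):

```python
import numpy as np

def otsu_threshold(gray):
    """Otsu's method: pick the threshold maximizing between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    best_t, best_var = 0, 0.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()   # class weights
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * prob[:t]).sum() / w0        # class means
        mu1 = (np.arange(t, 256) * prob[t:]).sum() / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Bimodal test image: dark background (40) and bright foreground (200)
img = np.concatenate([np.full(500, 40), np.full(300, 200)]).astype(np.uint8).reshape(20, 40)
t = otsu_threshold(img)
binary = img >= t   # foreground mask
```

Otsu's method needs no parameters, which is why it suits degraded historical documents where manual threshold tuning per image would be impractical.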
A Novel Method for De-warping in Persian document images captured by cameras (CSCJournals)
In this paper, we propose a novel algorithm for de-warping Persian document images captured by cameras. The aim of de-warping is to remove page distortions and to straighten document images captured by cameras, so that the documents are readable by an OCR system. Recently, the industrial use of images captured by digital cameras has expanded significantly. Most of the studies carried out so far in this regard have focused on documents written in Latin script, and little research has been conducted on Persian documents. The original idea of the proposed algorithm is based on segmentation of the text components. In this algorithm, an effective technique is offered for detecting the upper and lower baselines, which is used to estimate the slope of the words. Moreover, vertical shifting of the warped words is done by fitting a quadratic curve to the centers of the words in a line relative to the horizontal. The suggested algorithm is examined by qualitative and quantitative measures, and the results of its implementation on various documents indicate a 92% accuracy of the proposed technique in correcting the location and angle of the words.
COHESIVE MULTI-ORIENTED TEXT DETECTION AND RECOGNITION STRUCTURE IN NATURAL S... (ijdpsjournal)
Scene text recognition has brought various new challenges in recent years. Detecting and recognizing text in scenes entails some of the same problems as document processing, but there are also numerous novel problems to face when recognizing text in natural scene images. Recent research in these areas has shown promise, but much work remains to be done. Most existing techniques have focused on detecting horizontal or near-horizontal texts. In this paper, we propose a new scheme which detects texts of arbitrary orientations in natural scene images. Our algorithm is equipped with two sets of features specially designed for capturing the natural characteristics of texts, using MSER regions together with the Otsu method. To better evaluate our algorithm and compare it with other existing algorithms, we use the existing MSRA and ICDAR datasets, and our new dataset, which includes various texts in various real-world situations. Experimental results on these standard datasets and the proposed dataset show that our algorithm compares favorably with modern algorithms on horizontal texts and achieves significantly improved performance on texts of arbitrary orientations in complex natural scene images.
A Run Length Smoothing-Based Algorithm for Non-Manhattan Document Segmentation (University of Bari, Italy)
Layout analysis is a fundamental step in automatic document processing, because its outcome affects all subsequent processing steps. Many different techniques have been proposed to perform this task. In this work, we propose a general bottom-up strategy to tackle the layout analysis of (possibly) non-Manhattan documents, and two specializations of it to handle both bitmap and PS/PDF sources. A famous approach proposed in the literature for layout analysis is the RLSA. Here we consider a variant of RLSA, called RLSO (short for "Run Length Smoothing with OR"), that exploits the OR logical operator instead of the AND and is particularly suited to the identification of frames in non-Manhattan layouts. Like RLSA, RLSO is based on thresholds, but on different criteria than those that work for RLSA. Since setting such thresholds is a hard and unnatural task for (even expert) users, and no single threshold can fit all documents, we developed a technique to automatically define the thresholds for each specific document, based on the distribution of spacing therein. Application to selected sample documents, covering a significant landscape of real cases, revealed that the approach is satisfactory for documents characterized by a uniform text font size.
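The RLSA/RLSO idea can be sketched as follows: fill white runs shorter than a threshold along each direction, then combine the two smoothed images with OR (classic RLSA would use AND). A toy illustration with made-up fixed thresholds, rather than the per-document automatic thresholds the paper derives:

```python
import numpy as np

def rls_rows(binary, threshold):
    """Run-length smoothing along rows: fill white gaps shorter than threshold."""
    out = binary.copy()
    for r in range(binary.shape[0]):
        black = np.flatnonzero(binary[r])
        for i in range(len(black) - 1):
            gap = black[i + 1] - black[i] - 1
            if 0 < gap <= threshold:
                out[r, black[i] + 1:black[i + 1]] = 1
    return out

def rlso(binary, h_threshold, v_threshold):
    """RLSO: smooth horizontally and vertically, then combine with OR
    (RLSA would AND the two smoothed images instead)."""
    horiz = rls_rows(binary, h_threshold)
    vert = rls_rows(binary.T, v_threshold).T
    return horiz | vert

page = np.zeros((5, 10), dtype=np.uint8)
page[2, [1, 3, 7]] = 1          # three "characters" on one text line
smoothed = rlso(page, h_threshold=2, v_threshold=2)
print(smoothed[2])               # white gaps of <= 2 pixels are bridged
```

Connected components of the smoothed image then correspond to candidate frames; using OR rather than AND keeps components connected even when only one direction bridges them, which is what makes the variant suitable for non-Manhattan layouts.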
AN APPROACH FOR SCRIPT IDENTIFICATION IN PRINTED TRILINGUAL DOCUMENTS USING T... (ijaia)
In this work, we review the performance of texture features for script classification. A Rectangular White Space analysis algorithm is used to analyze and identify heterogeneous layouts of document images. The texture features, namely the color texture moments, Local Binary Patterns (LBP) and the responses of Gabor, LM-filter, S-filter and R-filter banks, are extracted, and combinations of these are considered in the classification. In this work, a probabilistic neural network and a Nearest Neighbor classifier are used for classification. To corroborate the adequacy of the proposed strategy, an experiment was run on our own data set. To study the effect on classification accuracy, we vary the database sizes, and the results show that the combination of multiple features vastly improves the performance.
An Efficient Segmentation Technique for Machine Printed Devanagari Script: Bo... (iosrjce)
Segmentation techniques play a major role in processing script documents for extraction of various features. Many researchers are carrying out research in this field to make the segmentation process simple as well as efficient. In this paper a simple segmentation technique for both line and word segmentation of a script document is proposed. The main objective of this technique is to recognize the spaces that separate two text lines; for word segmentation a similar procedure is followed. In this work, three different scanned documents were taken as input images for both the line and word segmentation techniques. The results were outstanding, with 100% average accuracy for both line segmentation and word segmentation. Evaluation results show that our method outperforms several competing methods.
A review on signature detection and signature based document image retrieval (eSAT Journals)
Abstract
Transforming a paper document to its electronic version in a form suitable for efficient storage, retrieval, and interpretation continues to be a challenging problem. A signature is an individualistic identification of a person; it is an authentic identification because a signature cannot easily be copied by others. Signatures are a special case of handwriting, subject to intra-personal variation and inter-personal differences. To counter fraud and forgery of handwritten signatures, signature extraction from a printed text background and signature-based document retrieval from a large dataset are necessary. Many techniques have been implemented successfully for both signature extraction and signature-based document retrieval. This paper presents the techniques and methods that have evolved for signature extraction and signature-based document retrieval.
Keywords: signature detection, signature extraction, Document image Retrieval, Query image retrieval
Abstract
Gelcoats are widely used to provide exterior protection for the finished part of a fiber-reinforced composite material. Achieving the proper gelcoat film thickness is a primary focus because it is a critical control point for crack prevention, increasing mechanical strength and the ability to withstand harsh environments. The interface between the gelcoat and the laminate composite is similarly important in determining the mechanical performance of the composite, by controlling the reintroduction of stress into the component. There is no specified standard for how much gelcoat thickness is required to produce a given product, since most research focuses only on the enhancement of composite orientation and fiber combination. The aim of this review is to gain an in-depth understanding of the effect of gelcoat thickness on laminate composite structure and strength.
Keywords: Gelcoat, Thickness, Protection, Laminated, Composite.
Wavelet Packet Based Features for Automatic Script IdentificationCSCJournals
In a multi script environment, an archive of documents having the text regions printed in different scripts is in practice. For automatic processing of such documents through Optical Character Recognition (OCR), it is necessary to identify different script regions of the document. In this paper, a novel texture-based approach is presented to identify the script type of the collection of documents printed in seven scripts, to categorize them for further processing. The South Indian documents printed in the seven scripts - Kannada, Tamil, Telugu, Malayalam, Urdu, Hindi and English are considered here. The document images are decomposed through the Wavelet Packet Decomposition using the Haar basis function up to level two. The texture features are extracted from the sub bands of the wavelet packet decomposition. The Shannon entropy value is computed for the set of sub bands and these entropy values are combined to use as the texture features. Experimentation conducted involved 2100 text images for learning and 1400 text images for testing. Script classification performance is analyzed using the K-nearest neighbor classifier. The average success rate is found to be 99.68%.
TEXT EXTRACTION FROM RASTER MAPS USING COLOR SPACE QUANTIZATIONcsandit
Maps convey valuable information by relating names to their positions. In this paper we present a new method for text extraction from raster maps using color space quantization. Previously, most research in this field focused on Latin texts, and the results for Persian or Arabic texts were poor. In our proposed method we use a Mean-Shift algorithm with proper parameter adjustment and consequently apply a color transformation to make the maps ready for the K-Means algorithm, which quantizes the colors in maps to six levels. By comparing against a threshold, the text layer candidates are then limited to three, and the best layer can afterwards be chosen by the user. This method is independent of font size, direction, and the color of the text, and can find both Latin and Persian/Arabic texts in maps. Experimental results show a significant improvement in Persian text extraction.
Separation of mixed Document Images in Farsi Scanned Documents Using Blind So...CSCJournals
In the field of mixed scanned document separation, various studies have been carried out to reduce one (or more) unwanted artifacts from the document. Most of the approaches are based on a comparison of the front and back sides of the documents. In some cases, it has been suggested to analyze the colored images; however, because of the computational complexity of these approaches, they are not very applicable in practice. Furthermore, none of them have been tested on Farsi/Arabic documents. In this paper, an approach applicable to large images is presented which is based on image block segmentation (mosaicing). The advantages of this approach are lower memory usage, a combination of simultaneous and ordinal blind source separation methods to increase the algorithm's efficiency, a reduction of the algorithm's computational complexity to about twenty percent of the basic algorithm, and high stability on noisy images. In noiseless conditions, the average signal-to-noise ratio of the output images reaches up to 28.75 dB. Furthermore, all of these cases have been tested on Farsi official documents. By applying the suggested ideas, considerable accuracy is achieved in the results, at minimum time. In addition, various parameters of the proposed algorithm (e.g. the size of each block, the appropriate initial point, and the number of iterations) were optimized.
Dimension Reduction for Script Classification - Printed Indian Documentsijait
Automatic identification of the script in a given document image facilitates many important applications, such as automatic archiving of multilingual documents, searching online archives of document images, and the selection of a script-specific OCR in a multilingual environment. This paper provides a comparison of three dimension reduction techniques, namely partial least squares (PLS), sliced inverse regression (SIR), and principal component analysis (PCA), and evaluates the relative performance of classification procedures incorporating those methods. For a given script we extracted different features, such as Gray Level Co-occurrence Matrix (GLCM) and Scale Invariant Feature Transform (SIFT) features. The features are extracted globally from a given text block, which does not require any complex and reliable segmentation of the document image into lines and characters. The extracted features are reduced using various dimension reduction techniques, and the reduced features are fed into a Nearest Neighbor classifier. Thus the proposed scheme is efficient and can be used for many practical applications which require processing large volumes of data. The scheme has been tested on 10 Indian scripts and found to be robust to the scanning process and relatively insensitive to changes in font size. The proposed system achieves good classification accuracy on a large testing data set.
Dimensionality Reduction and Feature Selection Methods for Script Identificat...ITIIIndustries
The goal of this research is to explore the effects of dimensionality reduction and feature selection on the problem of script identification from images of printed documents. The k-adjacent segment is ideal for this use due to its ability to capture visual patterns. We have used principal component analysis to reduce the size of our feature matrix to a handier size that can be trained easily, and experimented by including varying combinations of dimensions of the super feature set. A modular
approach in neural network was used to classify 7 languages – Arabic, Chinese, English, Japanese, Tamil, Thai and Korean.
A NOVEL APPROACH FOR WORD RETRIEVAL FROM DEVANAGARI DOCUMENT IMAGESijnlc
Large amount of information is lying dormant in historical documents and manuscripts. This information would go futile if not stored in digital form. Searching some relevant information from these scanned images would ideally require converting these document images to text form by doing optical character
recognition (OCR). For indigenous scripts of India, there are very few OCRs that can successfully recognize printed text images of varying quality, size, style and font. An alternate approach using word spotting can be effective to access large collections of document images. We propose a word spotting
technique based on codes for matching the word images of Devanagari script. The shape information is utilised for generating integer codes for words in the document image and these codes are matched for final retrieval of relevant documents. The technique is illustrated using Marathi document images.
Text detection and recognition in scene images or natural images has applications in computer vision systems such as registration number plate detection, automatic traffic sign detection, image retrieval, and aids for visually impaired people. Scene text, however, involves complicated backgrounds, blurred images, partly occluded text, variations in font styles, image noise, and varying illumination. Hence scene text recognition is a difficult computer vision problem. In this paper a connected component method is used to extract the text from the background. In this work, horizontal and vertical projection profiles, geometric properties of text, image binarization, and a gap-filling method are used to extract the text from scene images. Then a histogram-based threshold is applied to separate the text from the background of the images. Finally, the text is extracted from the images.
DEVNAGARI DOCUMENT SEGMENTATION USING HISTOGRAM APPROACHijcseit
Document segmentation is one of the critical phases in machine recognition of any language. Correct
segmentation of individual symbols decides the accuracy of character recognition technique. It is used to
decompose image of a sequence of characters into sub images of individual symbols by segmenting lines and
words. Devnagari is the most popular script in India. It is used for writing Hindi, Marathi, Sanskrit and
Nepali languages. Moreover, Hindi is the third most popular language in the world. Devnagari documents
consist of vowels, consonants and various modifiers. Hence proper segmentation of a Devnagari word is
challenging. A simple histogram based approach to segment Devnagari documents is proposed in this paper.
Various challenges in segmentation of Devnagari script are also discussed.
International Journal on Computational Sciences & Applications (IJCSA), Vol. 4, No. 1, February 2014
DOI: 10.5121/ijcsa.2014.4103
PERSIAN/ARABIC DOCUMENT SEGMENTATION
BASED ON HYBRID APPROACH
Seyyed Yasser Hashemi1, Parisa Sheykhi Hesarlo2
1, 2 Department of Computer Engineering, Miyandoab Branch, Islamic Azad University, Miyandoab, Iran
ABSTRACT
Document segmentation is an essential requirement for the automatic transformation of paper documents into electronic documents. However, restrictions such as variations in character font sizes, different text line spacings, and non-uniform document layout structures together make designing a general-purpose document layout analysis algorithm considerably more sophisticated. Thus, most previously reported methods inevitably include these parameters. This issue becomes even more acute in Persian/Arabic documents. Since Persian/Arabic scripts differ considerably from English scripts, most of the methods proposed for English scripts do not render good results for Persian/Arabic scripts. In this paper, we present a novel parameter-free hybrid method for segmenting Persian/Arabic document images that also works well for English scripts. The proposed method is capable of document segmentation without considering character font sizes, text line spacing, document skew, or document layout structures. The algorithm was examined on 150 Persian/Arabic and English documents, and the document segmentation process succeeded for 97.3 percent of them under the worst conditions.
KEYWORDS
Persian/Arabic document, document segmentation, Pyramidal Image Structure.
1. INTRODUCTION
In order to segment a document, which is an important step in Optical Character Recognition (OCR) systems, the document image is divided into homogeneous zones, each consisting of only one physical layout structure, such as text, graphics, or pictures. Therefore, the performance of OCR systems depends heavily on the implemented document segmentation algorithm. Several document segmentation algorithms have been proposed during the last three decades [1-10].
The various approaches toward document segmentation are typically categorized as “bottom-up”,
“top-down”, and “textural analysis” methods. The “bottom-up” methods (F. Legourgiois et al.,
1992; D. Drivas et al., 1995; A. Simon et al., 1997) start from pixels or the connected
components, determine the words, merge the words into text lines, and finally merge the text lines
into paragraphs. The main disadvantage of these approaches is that the identification, analysis,
and grouping of connected components are, in general, time-consuming processes, especially
when there are many components in the image. The “top-down” approaches (J. Ha et al., 1995; J.
Ha et al., 1995; Yi Xiaoa et al., 2003; Jie Xi et al., 2002, Rafi Cohen et al 2013) look for global
information e.g. black and white stripes on the page and use them to split the page into columns,
the columns into blocks, the blocks into text lines, and finally the text lines into words. Their low time complexity in comparison to the "bottom-up" approaches, and their natural top-down progression from coarse to fine resolution, which matches the way human eyes perceive a page, are the most important advantages of these methods. On the other hand, in "top-down" techniques
it is unfortunately difficult to segment complex document layouts that include non-rectangular images and various character font sizes. Some other recently proposed document segmentation methods (A. Jain and Y. Zhong et al., 1996; A. Jain and S. Bhattacharjee et al., 1992; M. Acharyya and M. K. Kundu et al., 2002) consider the homogeneous regions of the document image, such as text, image, or graphics, as textured regions. Thus, document segmentation is implemented based on the textured regions found in gray-scale images. Junxi Sun et al. (2008) propose a texture-based Bayesian document segmentation method, in which a Bayesian method is used to fuse texture likelihood and prior contextual knowledge to achieve document segmentation. The texture likelihood is based on a complex wavelet domain hidden Markov tree (HMT) model, and the prior contextual knowledge is based on a hybrid tree model. Very high time complexity is the main problem associated with these texture-based approaches, since many masks are used for extracting local features, and different tuning filters are used to capture a desired local spatial frequency and the orientation characteristics of a textured region. Since Persian documents have some special characters that do not exist in English documents, the aforementioned methods cannot be directly used for Persian document segmentation.
The special characteristics of Persian documents are as follows:
• Persian scripts are cursive, and each connected component includes more than one character. Moreover, the arrangement and size of the components may vary tremendously.
• There are 32 basic characters in the Persian alphabet. These characters may change their shapes according to their positions (beginning, middle, end, or isolated) in the word. Since each character can take four different shapes, there are 114 different shapes considering all Persian alphabets.
• Special stress marks called dots are another characteristic of Persian scripts. Most Persian characters have one, two, or three dots, which may be situated at the top, inside, or bottom of the characters.
From the script identification point of view, it follows from these special characteristics that the word sizes of these scripts are non-uniform: the word size may vary according to the number of cursive characters and dots in the word.
In this paper, we propose a novel method for Persian/Arabic document segmentation using a pyramidal image structure. This paper is organized as follows: in Section 2, the proposed algorithm is described in detail. Experimental results of the proposed algorithm are presented in Section 3. Finally, the paper is concluded in Section 4.
2. MATERIALS AND METHODS
Many document segmentation algorithms have been presented for English documents, most of which do not provide good results for Persian/Arabic documents due to the aforementioned differences. Thus, in order to make these methods suitable for Persian scripts, some of their parameters must be specialized. We have proposed a parameter-free segmentation method for Persian/Arabic documents which eliminates these restrictions and provides excellent results. The proposed method is capable of segmenting Persian documents composed of different font sizes, different line spacings, and different layout structures. In the proposed method, the low-resolution version of the document is processed first, and then the document's high-resolution version is analyzed in detail; this manifests the pyramidal nature of the proposed method. The original image is analyzed and a pyramidal tree structure (images at different resolutions) is created. In the next phase, the bounding boxes are extracted from the images
at the lowest possible resolution, and candidate regions are selected by removing the bounding boxes' overlap areas. The regions are then classified with horizontal and vertical analysis and are further investigated with textual and statistical analysis of the high-resolution sample to recognize the different regions of the document, such as text, image, and drawing/table. Fig. 1 shows the proposed approach followed in this paper.
2.1. Document skew detection and correction (SDC)
In this step, the skew angle of the document (θ) must be estimated. The proposed method uses a document SDC based on the Centre of Gravity (COG). To determine the skew angle, the first step is Baseline Identification (BI): the angle between the baseline and a direct horizontal line determines the skew angle. Therefore, the most important step in this process is to identify the baseline. The baseline of the document is a line that passes through the COG along the horizontal axis. In this algorithm, we detect the skew angle by finding the Actual Region of the Document (ARD) using connected component analysis, identifying its COG, and identifying the baseline of the document. The angle between the baseline and the horizontal specifies the skew angle. The algorithm steps are as follows:
• Connected Component identification (CC).
• Identification of the ARD. For this purpose, the four CCs that have the greatest distance from the corners C0, C1, C2, and C3 (shown in Fig. 2(c)) will be selected.
• Finding the COG. COG is calculated using (1).
COG_x = \frac{1}{6A} \sum_{i=0}^{N-1} (x_i + x_{i+1})\,(x_i\, y_{i+1} - x_{i+1}\, y_i)

COG_y = \frac{1}{6A} \sum_{i=0}^{N-1} (y_i + y_{i+1})\,(x_i\, y_{i+1} - x_{i+1}\, y_i)        (1)

A = \frac{1}{2} \sum_{i=0}^{N-1} (x_i\, y_{i+1} - x_{i+1}\, y_i)

where A is the area of the polygon.
• Baseline identification. A line (the baseline) is drawn from the COG to the midpoint of the line that connects the upper-left and lower-left corners of the ARD.
• Calculation of the document skew angle, which is the angle between the baseline and the horizontal line that passes through the midpoint.
• Rotation of the document (see Fig. 2).
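The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the helper names are hypothetical, the centroid follows the polygon formulas of Eq. (1), and the ARD corner ordering (clockwise from the top-left) is an assumption.

```python
# Sketch of the COG-based skew estimate (hypothetical helper names).
# The ARD is approximated by the polygon through four extreme points;
# the centroid uses the shoelace-based formulas of Eq. (1).
import math

def polygon_area_and_cog(pts):
    """Shoelace area and centroid of a closed polygon given as (x, y) tuples."""
    a = cx = cy = 0.0
    n = len(pts)
    for i in range(n):
        x0, y0 = pts[i]
        x1, y1 = pts[(i + 1) % n]
        cross = x0 * y1 - x1 * y0          # x_i*y_{i+1} - x_{i+1}*y_i
        a += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    a *= 0.5
    return a, (cx / (6.0 * a), cy / (6.0 * a))

def skew_angle(ard_corners):
    """Angle (degrees) between the baseline (COG -> midpoint of the left
    edge of the ARD) and the horizontal. ard_corners is assumed to be
    [C0, C1, C2, C3], clockwise starting at the top-left corner."""
    _, (gx, gy) = polygon_area_and_cog(ard_corners)
    top_left, bottom_left = ard_corners[0], ard_corners[3]
    mx = (top_left[0] + bottom_left[0]) / 2.0
    my = (top_left[1] + bottom_left[1]) / 2.0
    return math.degrees(math.atan2(gy - my, gx - mx))
```

For an unskewed rectangular ARD the COG and the left-edge midpoint share the same vertical coordinate, so the returned angle is zero, as expected.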
2.2. Pyramidal image structure
The pyramidal image structure is a simple and robust technique for providing several resolutions of an image. An image pyramid is a collection of decreasing-resolution images arranged in the shape of a pyramid, such that the base of the pyramid contains a high-resolution approximation of the image while the apex contains a low-resolution one. The number of pixels in I_{L+1} is one quarter of the number of pixels in I_L. This process is repeated N times (the number of levels of the pyramidal image structure), where N is given by (2).
N = \left\lfloor \log_2 \frac{\min(I_0.\mathrm{Width},\ I_0.\mathrm{Height})}{100} \right\rfloor        (2)
Figure 1. Overall diagram of the proposed method
The intensity of every pixel of the image at level L+1 is calculated from the pixel intensities of level L using equation (3):

I^{L+1}_{i,j} = \frac{1}{4} \sum_{m=0}^{1} \sum_{n=0}^{1} I^{L}_{2i+m,\,2j+n}        (3)

where I^{L+1}_{i,j} is the intensity of pixel (i, j) at level L+1. The pyramidal image structure for the I0 image is shown in Fig. 3.
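The pyramid construction can be sketched as follows. This is an illustrative reading, not reference code: the 100-pixel floor and base-2 logarithm for the level count follow our reading of Eq. (2), and images are stored as plain row-major lists of lists.

```python
# Sketch of Eqs. (2) and (3): the level count is driven by the smaller
# image dimension, and each level averages disjoint 2x2 blocks of the
# level below (so each level has one quarter of the pixels).
import math

def num_levels(width, height, floor_px=100):
    """N from Eq. (2): halve until the smaller dimension reaches ~floor_px."""
    return max(0, int(math.log2(min(width, height) / floor_px)))

def next_level(img):
    """I^{L+1} from Eq. (3): each output pixel is the mean of a 2x2 block."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2*i][2*j] + img[2*i][2*j+1] +
              img[2*i+1][2*j] + img[2*i+1][2*j+1]) / 4.0
             for j in range(w)] for i in range(h)]

def build_pyramid(img):
    """Return [I_0, I_1, ..., I_N], coarsest level last."""
    levels = [img]
    for _ in range(num_levels(len(img[0]), len(img))):
        levels.append(next_level(levels[-1]))
    return levels
```

For example, an 800x400 document yields two extra levels (400 -> 200 -> 100), while a tiny image yields none.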
Figure 2. (a) Skewed document, (b) document segmented into connected components, (c) the ARD in the skewed document, (d) calculation of the document skew angle, (e) non-skewed document, (f) final result.
2.3. Bounding box extraction
This algorithm uses a bottom-up approach to extract connected components (CCs) from the document image. The first step in CC extraction is to locate rectangular regions called Rects (M. Viswanathan, 2002). A Rects may be thought of as a rectangular region of loosely connected black pixels; more specifically, a Rects has at least one black pixel in every 9-pixel square. A Rects is defined in this way so that black regions may be found without scanning every pixel. The second step is to merge adjacent Rects to form CCs. In terms of implementation, Rects are stored in a linked list. This type of structure is ideal because the Rects need to be traversed quickly and random access is not required. The process of locating Rects involves scanning 9-pixel squares of each paragraph in raster order. The top-left corner of a Rects is defined by the first 9-pixel square found containing a black pixel. The right edge of the Rects is located by searching right until a white 9-pixel square is found; similarly, the bottom edge is located by scanning down until a white 9-pixel square is found. CCs are defined as collections of adjacent Rects (Mitchel P. E. and Yana H., 2004). CCs are also stored in a linked list, which enables them to be quickly and easily traversed for classification. The process used to construct CCs involves traversing the (initially empty) list of CCs for each Rects in the Rects list. Each Rects is subsequently added to a single CC or used to define a new CC. If a Rects is found to be adjacent to more than one CC, those CCs are merged into a single CC. Note that since the CCs are defined directly from the Rects, there is no need to refer to the actual image. After extracting the CCs, they are merged with each other in the vertical and horizontal directions to create the bounding boxes.
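The Rects-location step can be sketched as below. This is a simplified, hypothetical implementation for illustration only: it scans 3x3 ("9-pixel") squares in raster order and grows each seed rightward and downward until an all-white square is met; the merging of adjacent Rects into CCs is omitted for brevity.

```python
# Sketch of the Rects idea: black regions are found by inspecting 3x3
# squares rather than every pixel. Images are lists of lists, 1 = black.
def square_has_black(img, r, c):
    """True if the 3x3 square whose top-left pixel is (r, c) contains a black pixel."""
    return any(img[r + dr][c + dc]
               for dr in range(3) for dc in range(3)
               if r + dr < len(img) and c + dc < len(img[0]))

def locate_rects(img):
    """Return Rects as (top, left, bottom, right) in 3x3-square units,
    scanning squares in raster order and skipping already-covered squares."""
    h, w = len(img) // 3, len(img[0]) // 3
    covered = [[False] * w for _ in range(h)]
    rects = []
    for i in range(h):
        for j in range(w):
            if covered[i][j] or not square_has_black(img, 3 * i, 3 * j):
                continue
            # Grow right until a white square, then down until a white square.
            right = j
            while right + 1 < w and square_has_black(img, 3 * i, 3 * (right + 1)):
                right += 1
            bottom = i
            while bottom + 1 < h and square_has_black(img, 3 * (bottom + 1), 3 * j):
                bottom += 1
            for r in range(i, bottom + 1):
                for c in range(j, right + 1):
                    covered[r][c] = True
            rects.append((i, j, bottom, right))
    return rects
```

Two well-separated black marks thus yield two Rects, which a later pass would merge into CCs only if they were adjacent.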
In multi-resolution analysis, the lowest levels of the pyramid can be used for an overall analysis of the image. In this step, the bounding boxes of the low-resolution image are used to extract image regions. The bounding box extraction at the lowest resolution of the I0 image is shown in Fig. 4.
Figure 3. The pyramidal image structure for the I0 image: (a) original image, (b) binary image, (c) level 0, (d) level 1, (e) level 2, and (f) level 3.
Figure 4. The bounding box extraction at the lowest resolution of the I0 image.
2.4. Removing overlapped connected components
In the low-resolution document segmentation, the document image I_0 is divided into a set of regions called R_1, R_2, ..., R_n. Each region R_i is defined by the coordinates of its bounding box. Some of the regions R_i overlap with others. For the next stage of high-resolution page
7. International Journal on Computational Sciences & Applications (IJCSA) Vol.1, No.4, February 2014
29
segmentation, we should remove overlapped components from each region iR . For each region,
we remove overlapping components as follow:
First, we obtained connected components of region iR in low-resolution image LP . If iR have
only one connected component, then iR do not have any overlapping components. If in region iR
number of connected components is more than one, we leave the maximum of connected
component and remove other connected components at the high-resolution.
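The keep-the-largest rule above can be sketched as a small helper; the tuple box representation and bounding-box area measure are illustrative assumptions.

```python
# Sketch of the overlap-removal rule of Section 2.4: if a region holds more
# than one connected component, keep only the largest one (by bounding-box
# area, an assumed measure) and drop the rest.

def area(box):
    x0, y0, x1, y1 = box
    return (x1 - x0) * (y1 - y0)

def remove_overlapping(components):
    """Return the surviving component list for one region."""
    if len(components) <= 1:
        return components          # nothing overlaps
    return [max(components, key=area)]

region_ccs = [(0, 0, 10, 10), (2, 2, 30, 40), (5, 5, 8, 8)]
print(remove_overlapping(region_ccs))  # [(2, 2, 30, 40)]
```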
2.5. Horizontal analysis
Persian text regions can be easily distinguished from other regions by horizontal analysis. To
speed up the process, the horizontal analysis is performed recursively. In the first step, the
horizontal projection profile is calculated for each region by (4).
P_H(n) = \frac{1}{W} \sum_{x=1}^{W} I_B^L(x, n), \quad 1 \le n \le H \qquad (4)
where I_B^L(x, n) is the intensity value in the W×H image of the Lth level, and P_H(n) is normalized
between zero and one. Fig. 5 shows P_H(n) for one region of the document.
Figure. 5. (a) one region of the original document, (b) the horizontal projection profile, (c) the normalized
horizontal projection profile.
In the second step, the normalized projection profile is transformed into a binary signal as described
in (5).
tP_H(x) = \begin{cases} 1.0, & P_H(x) > 0.05 \\ 0.0, & \text{otherwise} \end{cases} \qquad (5)
In the third step, the difference of the binarized signal is calculated using equation (6).

dP_H(n) = \operatorname{diff}(tP_H(n)) \qquad (6)
In the fourth step, the ascending and descending edges are calculated using (7) and (8).

UHE(n) = \begin{cases} 1, & dP_H(n) > 0 \\ 0, & \text{otherwise} \end{cases} \qquad (7)

DHE(n) = \begin{cases} 1, & dP_H(n) < 0 \\ 0, & \text{otherwise} \end{cases} \qquad (8)
where UHE(n) and DHE(n) determine the ascending and descending edges of the signal,
respectively.
In the fifth step, the distances between the black and white areas of the signal are calculated using (9)
and (10).
PW(n) = DHE(n) - UHE(n) \qquad (9)

PB(n) = UHE(n+1) - DHE(n) \qquad (10)
PW(n) and PB(n) are the lengths of the white and black runs of the tP_H(n) signal, respectively.
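The five steps above can be sketched on a toy binary region. This is a minimal illustration: the helper computes the white and black run lengths directly, which is equivalent in effect to the edge-based formulation of (6)-(10).

```python
# Sketch of the horizontal analysis on a toy binary region (1 = black pixel):
# the profile is averaged per row (step 1), binarized at 0.05 (step 2), and
# the run lengths PW/PB are read off the binary signal (steps 3-5, done
# directly rather than via the diff/edge intermediate signals).

def horizontal_profile(img):
    """Normalized horizontal projection profile, one value per row."""
    w = len(img[0])
    return [sum(row) / w for row in img]

def binarize(profile, th=0.05):
    """Threshold the profile into a 0/1 signal."""
    return [1.0 if p > th else 0.0 for p in profile]

def runs(tp):
    """Return (white_run_lengths, black_run_lengths) of the binary signal."""
    white, black, i = [], [], 0
    while i < len(tp):
        j = i
        while j < len(tp) and tp[j] == tp[i]:
            j += 1
        (black if tp[i] == 1.0 else white).append(j - i)
        i = j
    return white, black

region = [[1, 1, 0, 0],   # text line
          [1, 1, 1, 0],
          [0, 0, 0, 0],   # inter-line gap
          [1, 1, 0, 1]]   # next text line
tp = binarize(horizontal_profile(region))
print(runs(tp))  # ([1], [2, 1]) -> one white gap between two black runs
```

For a text region these run lengths repeat with near-constant spacing, which is what the decision value in the next step measures.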
D(p), the decision value for document segmentation, is derived from P_i(n) by (11), (12), (13)
and (14) as:

m = \frac{1}{N} \sum_{n=1}^{N} P_i(n) \qquad (11)

V = \frac{1}{N} \sum_{n=1}^{N} (P_i(n) - m)^2 \qquad (12)

p = \frac{2}{1 + e^{V}} \qquad (13)

D(p) = \begin{cases} 1, & p > TH \\ 0, & \text{otherwise} \end{cases} \qquad (14)
Using the decision value D(p), we can estimate whether a region is homogeneous. Here N is the
number of grooves, and we set the threshold value TH = 0.5, which is achieved for V
approximately equal to 1.099 in (12). This value of V is independent of the character font sizes,
text-line spacing, and document layout structure, and is applied equally to each region to
decide whether it is a homogeneous region or not.
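The homogeneity decision can be sketched as follows, under the stated behaviour that p equals 0.5 at V approximately 1.099 with TH = 0.5, and that D = 1 marks a low-variance (homogeneous) region; the exact squashing function is an assumption consistent with those values.

```python
# Sketch of the decision value (11)-(14): the mean m and variance V of the
# projection values feed a squashing function with p = 0.5 at V ~ 1.099
# (assumed form p = 2 / (1 + e^V)), compared against TH = 0.5.

import math

def decision(profile, th=0.5):
    """Return 1 for a homogeneous (low-variance) profile, else 0."""
    n = len(profile)
    m = sum(profile) / n                           # eq. (11): mean
    v = sum((x - m) ** 2 for x in profile) / n     # eq. (12): variance
    p = 2.0 / (1.0 + math.exp(v))                  # eq. (13): squashing
    return 1 if p > th else 0                      # eq. (14): threshold

print(decision([0.5, 0.6, 0.5, 0.55]))  # 1: low variance, homogeneous
print(decision([0.0, 3.0, 0.0, 3.0]))   # 0: high variance, split further
```

The fixed crossover at V close to 1.099 is what makes the decision independent of font size and line spacing, as the text notes.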
There are three types of horizontal analysis using D(p), tP_H(n) and P_i(n):

For a constant signal tP_H(n) equal to 1: if tP_H(n) relates to an upper level (L = 1,
2, ..., N) of the region of the document image, the horizontal analysis is repeated
for the lower levels; but if tP_H(n) relates to the original document image, the region is a
non-text region (graphics, pictures, ...) and is investigated in detail in the next step with
textural and statistical analysis.

For a symmetric signal tP_H(n): if the decision value D(p) is 1, the variance of the P_i(n)
values is low and the region is considered a text region, so it requires no further
splitting.

For a non-symmetric signal tP_H(n): if the decision value D(p) is 0, the variance of the P_i(n)
values is high and the region is not homogeneous; hence further splitting is
inevitable, and the region is subclassified using the projection-profile information.

The segmentation process is repeated until all regions at all levels have a constant signal tP_H(n)
equal to 1 or a symmetric tP_H(n) signal.
2.6. Determination of Splitting Position
When a region is determined to require further splitting because it is not homogeneous,
there are two cases in the horizontal (vertical) direction, as shown in Fig. 6. The case in Fig. 6a
needs further splitting because one white area is larger than the other white areas, and the case in
Fig. 6b needs further splitting because one black area is larger than the other black areas. In these
two cases, we find a suitable position for splitting, using the method given below, and split the
region into two regions. The processes described in Sections 2.5 and 2.6 are repeated for each
region until no further splitting is required.
Let W denote the set of the white areas of a region, w_i.
Let B denote the set of the black areas of a region, b_i.
Sort the set W (or B) in increasing order of the magnitude of w_i (or b_i).
Let w_med be the median element of W and b_med the median element of B.
Let w_max be the last element of W and b_max the last element of B.
If w_i > w_med and w_i = w_max, split w_i.
If b_i > b_med and b_i = b_max, split w_{i-1}.
Fig. 6: Two cases requiring further horizontal splitting: (a) A distinct white area exists. (b) A distinct black
area exists.
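The split-position rule can be written as a small runnable helper. The run-length lists are assumed inputs, and the condition "exceeds the median and equals the maximum" is one reading of the rule above.

```python
# Sketch of the split-position rule: sort the white (or black) run lengths,
# and split at a run that exceeds the median while being the maximum of its
# set, i.e. the single "distinct" area of Fig. 6.

def split_index(runs):
    """Index of the run to split at, or None if no distinct run exists."""
    if len(runs) < 2:
        return None
    ordered = sorted(runs)
    med, largest = ordered[len(ordered) // 2], ordered[-1]
    for i, r in enumerate(runs):
        if r > med and r == largest:
            return i
    return None

whites = [3, 2, 14, 3]          # one distinct white area (case of Fig. 6a)
print(split_index(whites))      # 2
print(split_index([3, 3, 3]))   # None: no distinct area, no split needed
```

For the black-area case of Fig. 6b, the same index would be used to split at the preceding white area, as the rule states.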
2.7. Non-homogeneous regions analysis
In this step, the non-homogeneous regions are analyzed and segmented into homogeneous sub-regions.
For this purpose, each region is first segmented into initial sub-regions in the horizontal direction,
based on the median character height in the document (obtained from the previous steps). Then, by
repeating the processes described in Sections 2.5 and 2.6, each non-homogeneous
region is segmented into maximal homogeneous sub-regions labeled as text or non-text.
Finally, the sub-regions with the same label are merged and the final result of this step is obtained.
2.8. Image and drawing/table recognition
The non-text regions labeled previously are classified as image or drawing/table in this section
using statistical analysis. To do this, the number of black pixels of each region is applied to
equation (15):

R_i = \begin{cases} \text{image}, & \dfrac{A(R_i)}{W(R_i) \times H(R_i)} \ge 0.4 \\ \text{drawing/table}, & \dfrac{A(R_i)}{W(R_i) \times H(R_i)} < 0.4 \end{cases} \qquad (15)

where A(R_i) is the number of black pixels of region R_i, and W(R_i) and H(R_i) are its width and height.
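Rule (15) reduces to a single density threshold; a minimal sketch follows, with the black-pixel count, width, and height assumed as precomputed inputs.

```python
# Sketch of rule (15): classify a non-text region by its black-pixel density
# A(R) / (W(R) * H(R)), using the 0.4 threshold from the text. Dense regions
# (photographs) are images; sparse ones are drawings or tables.

def classify_region(black_pixels, width, height, th=0.4):
    """Return 'image' or 'drawing/table' for a non-text region."""
    density = black_pixels / (width * height)
    return "image" if density >= th else "drawing/table"

print(classify_region(black_pixels=5000, width=100, height=100))  # image
print(classify_region(black_pixels=900, width=100, height=100))   # drawing/table
```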
3. EXPERIMENTAL RESULTS
The proposed method was implemented on a 2.66 GHz dual-core system in 2013. We considered
different documents from different sources such as journals, textbooks, and newspapers, as well as
handwritten documents. For experimentation, 150 documents were considered, 50 of which are
handwritten documents. The obtained results are reported in Table 1. The comparative results of
the proposed method against other works are given in Fig. 8 and Fig. 9, and the final result for a
sample image is shown in Fig. 7.
Fig. 7: The final result for a sample image
4. CONCLUSION
We have introduced a document segmentation method that segments a Persian/Arabic
document into homogeneous regions. The proposed method is efficient and comprehensive, and
works based on a pyramidal image structure and textual analysis. For the original image, a
pyramidal tree structure (images at different resolutions) is created. In the next phase, the
bounding boxes are extracted from the image at the lowest resolution, and candidate regions are
selected by removing the bounding boxes' overlap areas. The regions are then classified with
horizontal analysis and further investigated with textual and statistical analysis of the
high-resolution sample to recognize the different regions of the document, such as text, image,
and drawing/table.
The proposed method was examined on different color/gray documents obtained from
different sources. Experiments show more accurate results in comparison to previously
reported methods. The method focuses on Persian/Arabic document segmentation, but also
exhibits good results for other scripts such as English. This work can also be extended to
special applications such as license plate recognition, postal services, and noisy documents. The
method was tested on 150 different documents (90 Persian/Arabic, 40 English, and 20 hybrids
of English and Persian/Arabic) and the accuracy rate is 97.3% in the worst circumstances.
Figure 8. The comparative results of the proposed method against other works (identification accuracy, %):

Method            | Text identification | Image identification | Drawing/Table identification
Jain              | 85                  | 90                   | 92.5
Pietikainen       | 79.5                | 82                   | 81.5
Proposed approach | 98.8                | 96.3                 | 97
Figure 9. The comparative results of the proposed method against other works, based on processing time.
Table 1. The results achieved for the proposed method.

Region type   | Total regions | Regions found | Rate found (%) | Correct found | Rate correct (%) | Unfound | Rate unfound (%) | Incorrect found | Rate incorrect (%) | Max. overall error (%)
Total region  | 448           | 440           | 98.2           | 435           | 97.1             | 13      | 2.9              | 5               | 1.1                | 2.9
Text region   | 382           | 379           | 99.2           | 376           | 98.4             | 6       | 1                | 3               | 0.8                | 1
Image region  | 43            | 42            | 97.7           | 41            | 95.4             | 2       | 4.6              | 1               | 2.3                | 4.6
Drawing/table | 23            | 23            | 100            | 22            | 95.6             | 1       | 4.3              | 1               | 4.3                | 4.3
REFERENCES
[1] F. Legourgiois, Z. Bublinski, & H. Emptoz, (1992), "A Fast and Efficient Method for Extracting Text
Paragraphs and Graphics from Unconstrained Documents", Proc. 11th Int'l Conf. Pattern Recognition,
pp. 272-276.
[2] D. Drivas & A. Amin, (1995), "Document Segmentation and Classification Utilizing a Bottom-Up
Approach", Proc. Third Int'l Conf. Document Analysis and Recognition, pp. 610-614.
[3] A. Simon, J. Pret, & A. Johnson, (1997), "A Fast Algorithm for Bottom-Up Document Layout
Analysis", IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, pp. 273-276.
[4] J. Ha, R. Haralick, & I. Phillips, (1995), "Recursive X-Y Cut Using Bounding Boxes of Connected
Components", Proc. Third Int'l Conf. Document Analysis and Recognition, pp. 952-955.
[5] J. Ha, R. Haralick, & I. Phillips, (1995), "Document Page Decomposition by the Bounding-Box
Projection Technique", Proc. Third Int'l Conf. Document Analysis and Recognition, pp. 1119-1122.
[6] Yi Xiao & Hong Yan, (2003), "Text Region Extraction in a Document Image Based on the Delaunay
Tessellation", Pattern Recognition, pp. 799-809.
[7] Jie Xi, Jianming Hu, & Lide Wu, (2002), "Document Segmentation of Chinese Newspapers", Pattern
Recognition, pp. 2695-2704.
[8] A. Jain & Y. Zhong, (1996), "Document Segmentation Using Texture Analysis", Pattern Recognition,
vol. 29, pp. 743-770.
[9] A. Jain & S. Bhattacharjee, (1992), "Text Segmentation Using Gabor Filters for Automatic Document
Processing", Machine Vision and Applications, vol. 5, pp. 169-184.
[10] M. Acharyya & M. K. Kundu, (2002), "Document Image Segmentation Using Wavelet Scale-Space
Features", IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, no. 12.
[11] Junxi Sun, Dongbing Gu, Hua Cai, Guangwen Liu, & Guangqiu Chen, (2008), "Bayesian Document
Segmentation Based on Complex Wavelet Domain Hidden Markov Tree Models", Proc. 2008 IEEE
Int'l Conf. Information and Automation, June 20-23, Zhangjiajie, China, pp. 493-498.
[12] Rafi Cohen, Abedelkadir Asi, Klara Kedem, Jihad El-Sana, & Itshak Dinstein, (2013), "Robust Text
and Drawing Segmentation Algorithm for Historical Documents", Proc. 2nd Int'l Workshop on
Historical Document Imaging and Processing, pp. 110-117.
[13] M. Viswanathan, (1992), "Analysis of Scanned Documents - A Syntactic Approach", in Structured
Document Image Analysis, Springer-Verlag, pp. 115-136.
[14] Mitchel P. E. & Yana H., (2004), "Newspaper Layout Analysis Incorporating Connected Component
Separation", Image and Vision Computing, vol. 22, pp. 307-317.
Authors
Seyyed Yasser Hashemi was born in Miyandoab, Azarbayjane Gharbi, Iran, in
1985. He received the B.Sc. and M.Sc. degrees in Computer Engineering from the
Islamic Azad University, South Tehran Branch. He has been with the Computer
Department of the Islamic Azad University, Miyandoab Branch, since 2008. He is the
author or coauthor of more than ten national and international papers and has also
collaborated in several research projects. His current research interests include voice
and image processing, pattern recognition, spam detection, optical character
recognition, cloud computing, and parallel genetic algorithms.
Parisa Sheykhi Hesarlo was born in Shahindezh, Azarbayjane Gharbi, Iran, in 1992. She is a B.Sc. student
in Computer Engineering at PNU University.