Regression models and their statistical analyses are among the most important tools used by scientists and practitioners. The aim of a regression model is to fit parametric functions to data. Since the true regression function is unknown, specific methods are devised for each problem. For pioneering work on procedures for fitting functions, we refer to the work on the methods of least absolute deviations, least squares deviations and minimax absolute deviations. Today's widely celebrated procedure of the method of least squares for function fitting is credited to the published works of Legendre and Gauss. However, least-squares-based models may in practice fail to provide optimal results in non-Gaussian situations, especially when the errors follow fat-tailed distributions. In this paper an unorthodox method of estimating linear regression coefficients by minimising the GMSE (geometric mean of squared errors) is explored. Although the GMSE is often used to compare models, it is rarely used to obtain the coefficients themselves. Such a method is tedious to handle because of the large number of roots obtained when minimising the loss function; this paper offers a way to tackle that problem. Application is illustrated with the 'Advertising' dataset from ISLR, and the obtained results are compared with those of the method of least squares for a single-index linear regression model.
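The paper's own estimator is not reproduced above, but the idea can be sketched: for a simple linear model, minimise the geometric mean of squared errors directly, here via a coarse-to-fine grid search that sidesteps solving for the many roots of the first-order conditions. All function names, search ranges, and the epsilon guard below are our own illustrative assumptions, not the paper's procedure.

```python
import math

def gmse(a, b, xs, ys, eps=1e-9):
    """Geometric mean of squared errors of the line y = a + b*x,
    computed as exp(mean(log(r^2 + eps))) for numerical safety."""
    n = len(xs)
    return math.exp(sum(math.log((y - a - b * x) ** 2 + eps)
                        for x, y in zip(xs, ys)) / n)

def fit_gmse(xs, ys, a_rng=(-5.0, 5.0), b_rng=(-5.0, 5.0),
             grid=41, zooms=4):
    """Coarse-to-fine grid search for (a, b) minimising the GMSE.
    Avoids root-finding on the first-order conditions, whose many
    roots are the practical difficulty the abstract mentions."""
    best, best_loss = (0.0, 0.0), float("inf")
    for _ in range(zooms):
        a_lo, a_hi = a_rng
        b_lo, b_hi = b_rng
        da = (a_hi - a_lo) / (grid - 1)
        db = (b_hi - b_lo) / (grid - 1)
        for i in range(grid):
            a = a_lo + i * da
            for j in range(grid):
                b = b_lo + j * db
                loss = gmse(a, b, xs, ys)
                if loss < best_loss:
                    best_loss, best = loss, (a, b)
        a, b = best
        # zoom in around the current best point
        a_rng = (a - 2 * da, a + 2 * da)
        b_rng = (b - 2 * db, b + 2 * db)
    return best
```

Note that the product of squared errors collapses toward zero for any line passing exactly through a data point, which is precisely why naive minimisation yields so many candidate solutions; the epsilon term keeps the sketch numerically stable.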
INTELLIGENT INFORMATION RETRIEVAL WITHIN DIGITAL LIBRARY USING DOMAIN ONTOLOGY (cscpconf)
A digital library is a type of information retrieval (IR) system. Existing information retrieval methodologies generally have problems with keyword searching. We propose a model to solve this problem using a concept-based approach (ontology) and a metadata case base. The model consists of identifying domain concepts in the user's query and applying expansion to them. The system aims to improve the relevance of results retrieved from digital libraries by proposing conceptual query expansion for intelligent concept-based retrieval. We import the concept of ontology, making use of its abundant semantics and standard concepts. A domain-specific ontology can be used to raise information retrieval from the traditional keyword level to the knowledge (or concept) level, and to change the retrieval process from traditional keyword matching to semantic matching. One approach is query expansion using a domain ontology; the other is introducing a case-based similarity measure for metadata information retrieval using the Case-Based Reasoning (CBR) approach. Results show improvements over the classic method, query expansion using a general-purpose ontology, and a number of other approaches.
Semantics-based clustering approach for similar research area detection (TELKOMNIKA JOURNAL)
The manual process of searching out individuals in an already existing
research field is cumbersome and time-consuming. Prominent and rookie
researchers alike are predisposed to seek existing research publications in
a research field of interest before coming up with a thesis. From
extant literature, automated similar research area detection systems have
been developed to solve this problem. However, most of them use
keyword-matching techniques, which do not sufficiently capture the implicit
semantics of keywords thereby leaving out some research articles. In this
study, we propose the use of ontology-based pre-processing, Latent Semantic
Indexing and K-Means Clustering to develop a prototype similar research area
detection system that can be used to determine similar research domain
publications. Our proposed system solves the challenge of high dimensionality
and data sparsity faced by the traditional document clustering technique. Our
system is evaluated with randomly selected publications from faculties
in Nigerian universities and results show that the integration of ontologies
in preprocessing provides more accurate clustering results.
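The ontology-based pre-processing and Latent Semantic Indexing stages are specific to the paper's pipeline; as a rough illustration of the final clustering stage only, the sketch below clusters toy TF-IDF document vectors with a spherical k-means. All documents, seed choices, and helper names are our own assumptions, and no ontology or LSI step is included.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF vectors (as dicts) for a list of tokenised documents."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    vecs = []
    for d in docs:
        tf = Counter(d)
        vecs.append({t: (c / len(d)) * math.log(n / df[t])
                     for t, c in tf.items()})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def kmeans_cosine(vecs, seeds, iters=10):
    """Spherical k-means: centroids are averaged member vectors,
    assignment is by highest cosine similarity."""
    centroids = [dict(vecs[i]) for i in seeds]
    labels = [0] * len(vecs)
    for _ in range(iters):
        labels = [max(range(len(centroids)),
                      key=lambda k: cosine(v, centroids[k]))
                  for v in vecs]
        for k in range(len(centroids)):
            members = [v for v, l in zip(vecs, labels) if l == k]
            if members:
                terms = {t for v in members for t in v}
                centroids[k] = {t: sum(v.get(t, 0.0) for v in members)
                                   / len(members) for t in terms}
    return labels
```

A real system would replace the raw TF-IDF vectors with their low-rank LSI projection before clustering, which is what addresses the dimensionality and sparsity issues the abstract raises.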
Dynamic & Attribute Weighted KNN for Document Classification Using Bootstrap ... (IJERA Editor)
Although publicly accessible databases containing speech documents exist, a great deal of time and effort is required to keep them up to date, which is often burdensome. To help identify the speaker of a speech when its text is available, text-mining tools from the machine learning discipline can be applied to assist in this process. Here, we describe and evaluate document classification algorithms, i.e. a combination of text mining and classification. The task asked participants to design classifiers for identifying documents containing speech-related information in the main literature, and evaluated them against one another. The proposed system uses a k-nearest neighbour classification approach and compares its performance for different values of k.
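As an illustrative sketch of the k-nearest-neighbour step described above (not the paper's dynamic, attribute-weighted variant), the following snippet classifies bag-of-words documents by majority vote and can be rerun with different values of k. The toy data and helper names are our own assumptions.

```python
from collections import Counter

def bag(text):
    """Bag-of-words representation of a document."""
    return Counter(text.lower().split())

def distance(a, b):
    """Euclidean distance between two bag-of-words Counters."""
    terms = set(a) | set(b)
    return sum((a[t] - b[t]) ** 2 for t in terms) ** 0.5

def knn_predict(train, doc, k):
    """Majority vote among the k nearest training documents.
    `train` is a list of (Counter, label) pairs."""
    neighbours = sorted(train, key=lambda tl: distance(tl[0], doc))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```

Comparing accuracy over a held-out set while varying k is then a simple loop over `knn_predict(train, doc, k)` for k = 1, 3, 5, ...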
A SEMANTIC METADATA ENRICHMENT SOFTWARE ECOSYSTEM BASED ON TOPIC METADATA ENR... (IJDKP)
As existing computer search engines struggle to understand the meaning of natural language, semantically enriched metadata may improve interest-based search engine capabilities and user satisfaction. This paper presents an enhanced version of the ecosystem focusing on semantic topic metadata detection and enrichment. It builds on a previous paper describing a semantic metadata enrichment software ecosystem (SMESE). Through text analysis approaches for topic detection and metadata enrichment, this paper proposes an algorithm to enhance search engine capabilities and consequently help users find content according to their interests. It presents the design, implementation and evaluation of the SATD (Scalable Annotation-based Topic Detection) model and algorithm using metadata from the web, linked open data, concordance rules, and bibliographic record authorities. It includes a prototype of a semantic engine that uses keyword extraction, classification and concept extraction to generate semantic topics from text and multimedia document analysis using the proposed SATD model and algorithm. The performance of the proposed ecosystem is evaluated in a number of prototype simulations by comparing it to existing metadata enrichment techniques (e.g., AlchemyAPI, DBpedia, Wikimeta, Bitext, AIDA, TextRazor). The SATD algorithm supports more attributes than the other algorithms. The results show that the enhanced platform and its algorithm enable a greater understanding of documents related to user interests.
Data mining, or knowledge discovery, is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But how do we decide what constitutes a good clustering? It can be shown that there is no absolute "best" criterion that is independent of the final aim of the clustering. Consequently, it is the user who must supply this criterion, in such a way that the result of the clustering will suit their needs.
For instance, we could be interested in finding representatives for homogeneous groups (data reduction), in finding "natural clusters" and describing their unknown properties ("natural" data types), in finding useful and suitable groupings ("useful" data classes), or in finding unusual data objects (outlier detection). Of late, clustering techniques have been applied in areas that involve browsing the gathered data or categorizing the outcomes provided by search engines in reply to user queries. In this paper, we provide a comprehensive survey of document clustering.
Data mining is knowledge discovery in databases; the goal is to extract patterns and knowledge from large amounts of data. An important term in data mining is text mining. Text mining extracts high-quality information from text. Statistical pattern learning is used to obtain this high-quality information. High quality in text mining refers to a combination of relevance, novelty and interestingness. Tasks in text mining include text categorization, text clustering, entity extraction and sentiment analysis. Applications of natural language processing and analytical methods are highly preferred to turn
GReAT model: a model for the automatic generation of semantic relations betwee... (ijcsity)
The large available amount of non-structured texts that belong to different domains such as healthcare (e.g. medical records), justice (e.g. laws, declarations), insurance (e.g. declarations), etc. increases the effort required for the analysis of information in a decision-making process. Different projects and tools have proposed strategies to reduce this complexity by classifying, summarizing or annotating the texts. Particularly, text summary strategies have proven to be very useful to provide a compact view of an original text. However, the available strategies to generate these summaries do not fit very well within domains that require taking into consideration the temporal dimension of the text (e.g. a recent piece of text in a medical record is more important than a previous one) and the profile of the person who requires the summary (e.g. the medical specialization). To cope with these limitations this paper presents "GReAT", a model for automatic summary generation that relies on natural language processing and text mining techniques to extract the most relevant information from narrative texts and discover new information from the detection of related information. The GReAT model was implemented in software to be validated in a health institution, where it has shown to be very useful to display a preview of the information about medical health records and discover new facts and hypotheses within the information. Several tests were executed regarding the implemented software, covering Functionality, Usability and Performance. In addition, precision and recall measures were applied to the results obtained through the implemented tool, as well as to the loss of information incurred by providing a text shorter than the original.
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH (IJDKP)
Text mining is an emerging research field evolving from the information retrieval area. Clustering and classification are the two approaches in data mining that may also be used to perform text classification and text clustering. The former is supervised while the latter is unsupervised. In this paper, our objective is to perform text clustering by defining an improved distance metric to compute the similarity between two text files. We use incremental frequent pattern mining to find frequent items and reduce dimensionality. The improved distance metric may also be used to perform text classification. The distance metric is validated for the worst, average and best case situations [15]. The results show the proposed distance metric outperforms the existing measures.
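The paper's improved distance metric and incremental mining algorithm are not spelled out in the abstract; as a hypothetical stand-in, the sketch below mines frequent single terms to reduce dimensionality and then measures a Jaccard distance restricted to that vocabulary.

```python
from collections import Counter

def frequent_terms(docs, min_support=2):
    """Terms (1-itemsets) appearing in at least `min_support` documents;
    a stand-in for the frequent-pattern mining step, used here purely
    to reduce dimensionality before computing distances."""
    df = Counter(t for d in docs for t in set(d))
    return {t for t, c in df.items() if c >= min_support}

def distance(d1, d2, vocab):
    """Jaccard distance restricted to the frequent vocabulary.
    This is an illustrative metric, not the paper's improved one."""
    a = set(d1) & vocab
    b = set(d2) & vocab
    if not a and not b:
        return 1.0
    return 1.0 - len(a & b) / len(a | b)
```

An incremental variant would update the document-frequency counts `df` as new documents arrive instead of recomputing them from scratch.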
Text preprocessing is a vital stage in text classification (TC) in particular and text mining in general. Text preprocessing tools aim to reduce multiple forms of a word to one form. In addition, text preprocessing techniques have been given a lot of significance and are widely studied in machine learning. The basic phase in text classification involves preprocessing features and extracting relevant features against the features in a database. Preprocessing tools have a great impact on reducing the time and resources needed. The effect of preprocessing tools on English text classification is an active area of research. This paper provides an evaluation study of several preprocessing tools for English text classification. The study covers using the raw text, tokenization, stop-word removal, and stemming. Two different methods, chi-square and TF-IDF with a cosine similarity score, are used for feature extraction, based on the BBC English dataset. The experimental results show that text preprocessing affects the feature extraction methods and enhances the performance of English text classification, especially for small threshold values.
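As a minimal sketch of the preprocessing steps compared in such studies (tokenization, stop-word removal, stemming, and TF-IDF weighting), the snippet below uses a crude suffix stripper in place of a real stemmer such as Porter's and a toy stop-word list; none of it reproduces the study's actual tools.

```python
import math
import re
from collections import Counter

# Toy stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "in", "of", "and", "to", "for"}

def preprocess(text):
    """Tokenise, drop stop words, and apply a crude suffix stripper
    (a rough stand-in for a real stemmer)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    stemmed = []
    for t in tokens:
        for suffix in ("ing", "ers", "er", "ed", "s"):
            if t.endswith(suffix) and len(t) - len(suffix) >= 3:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed

def tfidf_weights(docs):
    """TF-IDF weight dictionaries for a list of preprocessed docs."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return [{t: (c / len(d)) * math.log(n / df[t])
             for t, c in Counter(d).items()} for d in docs]
```

Swapping each step in and out of this pipeline and measuring classification accuracy is the kind of comparison the evaluation study describes.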
Evaluating the efficiency of rule techniques for file classification (eSAT Journals)
Abstract: Text mining refers to the process of deriving high-quality information from text. Also known as knowledge discovery from text (KDT), it deals with the machine-supported analysis of text. It is used in various areas such as information retrieval, marketing, information extraction, natural language processing, document similarity, and so on. Document similarity is one of the important techniques in text mining. In document similarity, the first and foremost step is to classify files based on their category. In this research work, various classification rule techniques are used to classify computer files based on their extensions. For example, the extension of a computer file may be pdf, doc, ppt, xls, and so on. There are several rule-classifier algorithms, such as decision table, JRip, Ridor, DTNB, NNge, PART, OneR and ZeroR. In this research work, three classification algorithms, namely decision table, DTNB and OneR, are used to classify computer files based on their extension. The results produced by these algorithms are analyzed using the performance factors classification accuracy and error rate. From the experimental results, DTNB proves to be more efficient than the other two techniques. Index Terms: Data mining, Text mining, Classification, Decision table, DTNB, OneR
A Semantic Retrieval System for Extracting Relationships from Biological Corpus (ijcsit)
The World Wide Web holds a large amount of diverse information. While searching the World Wide Web, users do not always obtain the type of information they expect. In the field of information extraction, extracting semantic relationships between terms in documents has become a challenge. This paper proposes a system that helps retrieve documents based on query expansion and tackles the extraction of semantic relationships from biological documents. The system retrieves documents that are relevant to the input terms and then detects the existence of a relationship. In this system, we use the Boolean model and pattern recognition, which help determine the relevant documents and the place of the relationship in the biological document. The system constructs a term-relation table that accelerates the relation-extraction step. The proposed method offers another usage of the system: researchers can use it to figure out the relationship between two biological terms through the information available in biological documents. For the retrieved documents, the system also measures precision and recall.
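The Boolean-model retrieval step can be sketched with an inverted index and an AND query; the toy corpus and function names below are our own assumptions, not the system's implementation.

```python
from collections import defaultdict

def build_index(docs):
    """Inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_and(index, terms):
    """Documents containing ALL query terms (Boolean AND),
    as in the Boolean retrieval model."""
    sets = [index.get(t.lower(), set()) for t in terms]
    if not sets:
        return set()
    result = sets[0]
    for s in sets[1:]:
        result = result & s
    return result
```

Query expansion would simply add related terms (e.g. synonyms of a biological term) before intersecting the posting sets.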
In recent years the growth of digital data has increased dramatically, and knowledge discovery and data mining have attracted immense attention given the need to turn such data into useful information and knowledge. Keyword extraction is considered an essential task in natural language processing (NLP) that facilitates mapping documents to a concise set of representative single- and multi-word phrases. This paper investigates the use of Word2Vec and a Decision Tree for keyword extraction from textual documents. The Sem-Eval (2010) dataset is used as the main input for the proposed study. After applying pre-processing operations to the dataset, the words are represented by vectors with the Word2Vec technique. The method is based on word similarity between candidate keywords, drawing both on the collected keywords for each label and on one sample from the same label. An appropriate threshold is determined, and the percentages that exceed it are passed to the Decision Tree in order to choose an appropriate classification for the text document.
Some similarity measurements were used for the classification process. The efficiency and accuracy of the algorithm were measured in the classification process using precision, recall and F-score rates. The obtained results indicated that using a vector representation for each keyword is an effective way to identify the most similar words, so that the chance of recognizing the correct classification of the document increases. When using Word2Vec CBOW, the F-score was 64% with the Gini method and the WordNet Lemmatizer. Meanwhile, when using Word2Vec SG, the F-score was 82% with the Gini index and English Porter stemming, which was the highest ratio across all our experiments.
A Domain Based Approach to Information Retrieval in Digital Libraries - Rotel... (University of Bari, Italy)
The current abundance of electronic documents requires automatic techniques that support users in understanding their content and extracting useful information. To this aim, improving retrieval performance must go beyond simple lexical interpretation of the user's queries and pass through an understanding of their semantic content and aims. It goes without saying that any digital library would benefit enormously from the availability of effective Information Retrieval techniques to provide to its users. This paper proposes an approach to Information Retrieval based on a correspondence between the domain of discourse of the query and that of the documents in the repository. The association is based on standard general-purpose linguistic resources (WordNet and WordNet Domains) and on a novel similarity assessment technique. Although the work is at a preliminary stage, interesting initial results suggest continuing to extend and improve the approach.
Concept integration using edit distance and n-gram match (ijdms)
Information on the World Wide Web (WWW) is growing so rapidly that it has become necessary to make all this information available not only to people but also to machines. Ontologies and tokens are widely used to add semantics to data or information processing. A concept formally refers to the meaning of a specification encoded in a logic-based language; explicit means that the concepts and properties of the specification are machine readable; and a conceptualization models how people think about things in a particular subject area. In the modern scenario, more ontologies have been developed on various topics, resulting in increased heterogeneity of entities among the ontologies. Concept integration has become vital over the last decade as a tool to minimize heterogeneity and empower data processing. There are various techniques to integrate concepts from different input sources, based on semantic or syntactic match values. In this paper, an approach is proposed to integrate concepts (ontologies or tokens) using edit distance or n-gram match values between pairs of concepts, with concept frequency used to drive the integration process. The proposed technique's performance is compared with semantic-similarity-based integration techniques on quality parameters such as recall, precision, F-measure and integration efficiency over different sizes of concepts. The analysis indicates that edit-distance-based integration outperformed n-gram integration and semantic similarity techniques.
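The two match values the approach is built on can be sketched directly; the snippet below implements the standard Levenshtein edit distance and a Dice-style character n-gram overlap. The paper's thresholds and concept-frequency weighting are not reproduced here.

```python
def edit_distance(s, t):
    """Levenshtein distance via dynamic programming (two rows)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def ngram_match(s, t, n=2):
    """Dice-style overlap of character n-gram sets, in [0, 1]."""
    grams = lambda x: {x[i:i + n] for i in range(len(x) - n + 1)}
    gs, gt = grams(s), grams(t)
    if not gs or not gt:
        return 0.0
    return 2 * len(gs & gt) / (len(gs) + len(gt))
```

Concept pairs whose edit distance is small (or whose n-gram match is high) would be merged, with the more frequent concept's label kept, along the lines the abstract describes.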
Efficient multi-document summary generation using neural network (INFOGAIN PUBLICATION)
Over the last few years, online information has been growing tremendously on the World Wide Web and on users' desktops, and it has thus gained much more attention in the field of automatic text summarization. Text mining has become a significant research field as it produces valuable data from unstructured and large amounts of text. Summarization systems provide the possibility of finding the important keywords of a text, so the consumer spends less time reading the whole document. The main objective of a summarization system is to generate a new form that expresses the key meaning of the contained text. This paper studies various existing techniques against the need for novel multi-document summarization schemes. It is motivated by the rising need to provide a high-quality summary in a very short period of time. In the proposed system, users can quickly and easily access correctly developed summaries that express the key meaning of the contained text. The primary focus of this paper lies with the f_β-optimal merge function, a recently presented function that uses the weighted harmonic mean to find a balance between precision and recall. The proposed system utilizes bisecting k-means clustering to improve the time and neural networks to improve the accuracy of the summary generated by the NEWSUM algorithm.
Automatic Text Classification using Supervised Learning (dbpublications)
Over time, the digitization of text has increased remarkably, and the need to organize, categorize and classify text has become indispensable. Disorganization and little categorization or classification of text may result in gradually lower response times for text or information retrieval. It is therefore very important to organize, categorize and classify texts and digitized documents according to the descriptions proposed by text mining experts and computer scientists. Automated text classification has been considered an imperative method to manage and process the large number of documents in digital form that are widespread and continuously increasing. In general, text classification plays a substantial role in information extraction, text retrieval, and question answering. This paper emphasizes the text classification process using machine learning techniques.
Performance Evaluation of Query Processing Techniques in Information Retrieval (idescitation)
The first element of the search process is the query. The user query, being on average restricted to two or three keywords, is ambiguous to the search engine. Given the user query, the goal of an Information Retrieval [IR] system is to retrieve information which might be useful or relevant to the information need of the user. Hence, query processing plays an important role in an IR system. Query processing can be divided into four categories, i.e. query expansion, query optimization, query classification and query parsing. In this paper an attempt is made to evaluate the performance of query processing algorithms in each category. The evaluation was based on the dataset specified by the Forum for Information Retrieval [FIRE15]. The criteria used for evaluation are precision and relative recall. The analysis is based on the importance of each step in query processing. The experimental results show the significance of each step in query processing, as well as the relevance of web semantics and spelling correction in the user query.
Data mining is the knowledge discovery in databases and the gaol is to extract patterns and knowledge from
large amounts of data. The important term in data mining is text mining. Text mining extracts the quality
information highly from text. Statistical pattern learning is used to high quality information. High –quality in
text mining defines the combinations of relevance, novelty and interestingness. Tasks in text mining are text
categorization, text clustering, entity extraction and sentiment analysis. Applications of natural language
processing and analytical methods are highly preferred to turn
Great model a model for the automatic generation of semantic relations betwee...ijcsity
The large available amount of non-structured texts that belong to different domains such as healthcare (e.g. medical records), justice (e.g. laws, declarations), insurance (e.g. declarations), etc. increases the effort required for the analysis of information in a decision-making process. Different projects and tools have proposed strategies to reduce this complexity by classifying, summarizing or annotating the texts. Particularly, text summary strategies have proven to be very useful to provide a compact view of an original text. However, the available strategies to generate these summaries do not fit very well within the domains that require taking into consideration the temporal dimension of the text (e.g. a recent piece of text in a medical record is more important than a previous one) and the profile of the person who requires the summary (e.g. the medical specialization). To cope with these limitations this paper presents "GReAT", a model for automatic summary generation that relies on natural language processing and text mining techniques to extract the most relevant information from narrative texts and discover new information from the detection of related information. The GReAT model was implemented in software to be validated in a health institution, where it has shown to be very useful to display a preview of the information in medical health records and to discover new facts and hypotheses within the information. Several tests were executed, such as functionality, usability and performance tests on the implemented software. In addition, precision and recall measures were applied to the results obtained through the implemented tool, as well as to the loss of information incurred by providing a text shorter than the original.
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACHIJDKP
Text mining is an emerging research field evolving from the information retrieval area. Clustering and classification are two data mining approaches that may also be used to perform text clustering and text classification; classification is supervised while clustering is unsupervised. In this paper, our objective is to perform text clustering by defining an improved distance metric to compute the similarity between two text files. We use incremental frequent pattern mining to find frequent items and reduce dimensionality. The improved distance metric may also be used to perform text classification. The distance metric is validated for the worst, average and best case situations [15]. The results show the proposed distance metric outperforms the existing measures.
Text preprocessing is a vital stage in text classification (TC) in particular and in text mining in general. The purpose of text preprocessing tools is to reduce the multiple forms of a word to one form; accordingly, preprocessing techniques have been given a lot of attention and are widely studied in machine learning. The basic phase in text classification involves preprocessing features and extracting relevant features against the features in a database, which has a great impact on reducing the time and computing resources required. The effect of preprocessing tools on English text classification is an active area of research. This paper provides an evaluation study of several preprocessing tools for English text classification. The study includes using the raw text, tokenization, stop-word removal, and stemming. Two feature extraction methods, chi-square and TF-IDF with a cosine similarity score, are used on the BBC English dataset. The experimental results show that text preprocessing affects the feature extraction methods and enhances the performance of English text classification, especially for small threshold values.
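The TF-IDF-with-cosine-similarity step referenced in several of these abstracts can be sketched compactly. This is a stdlib-only illustration, not the evaluated pipeline: the toy documents stand in for preprocessed (tokenized, stop-word-filtered, stemmed) BBC articles.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build TF-IDF vectors for a list of pre-tokenized documents."""
    n = len(docs)
    df = Counter()                       # document frequency per term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf}
        vectors.append(vec)
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy corpus after tokenization, stop-word removal and stemming
docs = [["market", "share", "rise"],
        ["market", "fall", "bank"],
        ["match", "goal", "score"]]
vecs = tf_idf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # small but non-zero: one shared term
print(cosine(vecs[0], vecs[2]))  # 0.0: no shared terms
```

Terms appearing in every document get an IDF of log(1) = 0, which is exactly why stop-word removal and stemming upstream matter: they concentrate weight on the discriminative terms.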
Evaluating the efficiency of rule techniques for file classificationeSAT Journals
Abstract: Text mining refers to the process of deriving high-quality information from text. Also known as knowledge discovery from text (KDT), it deals with the machine-supported analysis of text and is used in areas such as information retrieval, marketing, information extraction, natural language processing, and document similarity. Document similarity is one of the important techniques in text mining; its first step is to classify files based on their category. In this research work, classification rule techniques are used to classify computer files based on their extensions (for example, pdf, doc, ppt, or xls). There are several rule-classifier algorithms, such as decision table, JRip, Ridor, DTNB, NNge, PART, OneR and ZeroR. In this work, three classification algorithms, namely the decision table, DTNB and OneR classifiers, are used to classify computer files by extension. The results produced by these algorithms are analyzed using classification accuracy and error rate as performance factors. The experimental results show DTNB to be more efficient than the other two techniques. Index Terms: Data mining, Text mining, Classification, Decision table, DTNB, OneR
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
A Semantic Retrieval System for Extracting Relationships from Biological Corpusijcsit
The World Wide Web holds a large amount of diverse information. While searching the World Wide Web, users do not always obtain the type of information they expect. In the field of information extraction, extracting semantic relationships between terms from documents has become a challenge. This paper proposes a system that helps in retrieving documents based on query expansion and tackles the extraction of semantic relationships from biological documents. The system retrieves documents that are relevant to the input terms and then extracts any relationship present. We use the Boolean model and pattern recognition, which help in determining the relevant documents and locating the relationship within a biological document. The system constructs a term-relation table that accelerates the relation extraction step. The proposed method offers another use of the system: researchers can apply it to figure out the relationship between two biological terms through the information available in the biological documents. For the retrieved documents, the system also measures precision and recall.
In recent years the growth of digital data has been increasing dramatically, and knowledge discovery and data mining have attracted immense attention with the growing need to turn such data into useful information and knowledge. Keyword extraction is an essential task in natural language processing (NLP) that facilitates mapping documents to a concise set of representative single- and multi-word phrases. This paper investigates the use of Word2Vec and a Decision Tree for keyword extraction from textual documents. The SemEval (2010) dataset is used as the main input for the proposed study. After applying pre-processing operations to the dataset, the words are represented as vectors with the Word2Vec technique. The method is based on word similarity between candidate keywords, computed both from the collected keywords for each label and from one sample of the same label. An appropriate threshold is determined, and the percentages that exceed this threshold are passed to the Decision Tree in order to assign an appropriate classification to the text document. Several similarity measures were used for the classification process. The efficiency and accuracy of the algorithm were measured using precision, recall and F-score. The obtained results indicate that using a vector representation for each keyword is an effective way to identify the most similar words, so that the opportunity to recognize the correct classification of the document increases. When using Word2Vec CBOW, the F-score was 64% with the Gini method and the WordNet Lemmatizer. Meanwhile, when using Word2Vec SG, the F-score was 82% with the Gini index and English Porter stemming, the highest ratio across all our experiments.
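The thresholding step described above, keeping candidate keywords whose embedding similarity exceeds a cutoff, can be sketched as follows. The 3-d vectors here are hand-made stand-ins for trained Word2Vec embeddings, and the words and threshold are illustrative assumptions, not values from the paper.

```python
import math

def cos_sim(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy 3-d embeddings standing in for trained Word2Vec vectors
embeddings = {
    "network": [0.9, 0.1, 0.0],
    "graph":   [0.8, 0.3, 0.1],
    "banana":  [0.0, 0.2, 0.9],
}

def similar_keywords(candidate, label_keywords, threshold=0.8):
    """Keep label keywords whose similarity to the candidate
    exceeds the chosen threshold."""
    v = embeddings[candidate]
    return [w for w in label_keywords
            if cos_sim(v, embeddings[w]) >= threshold]

print(similar_keywords("network", ["graph", "banana"]))  # ['graph']
```

In the full pipeline, the fraction of candidates surviving this threshold per label would then be the feature fed to the Decision Tree.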
A Domain Based Approach to Information Retrieval in Digital Libraries - Rotel...University of Bari (Italy)
The current abundance of electronic documents requires automatic techniques that support users in understanding their content and extracting useful information. To this aim, improving retrieval performance must go beyond simple lexical interpretation of user queries and pass through an understanding of their semantic content and aims. It goes without saying that any digital library would take enormous advantage of the availability of effective Information Retrieval techniques to provide to its users. This paper proposes an approach to Information Retrieval based on a correspondence of the domain of discourse between the query and the documents in the repository. Such an association is based on standard general-purpose linguistic resources (WordNet and WordNet Domains) and on a novel similarity assessment technique. Although the work is at a preliminary stage, interesting initial results suggest continuing to extend and improve the approach.
Concept integration using edit distance and n gram match ijdms
The rapid growth of information on the World Wide Web (WWW) has made it necessary to make all this information available not only to people but also to machines. Ontologies and tokens are widely used to add semantics in data or information processing. A concept formally refers to the meaning of a specification encoded in a logic-based language; explicit means that the concepts and properties of the specification are machine readable; and a conceptualization models how people think about things in a particular subject area. In the modern scenario, more ontologies have been developed on various topics, resulting in increased heterogeneity of entities among the ontologies. Concept integration has become vital over the last decade as a tool to minimize heterogeneity and empower data processing. There are various techniques to integrate concepts from different input sources, based on semantic or syntactic match values. In this paper, an approach is proposed to integrate concepts (ontologies or tokens) using edit distance or n-gram match values between pairs of concepts, with concept frequency used to guide the integration process. The proposed technique's performance is compared with semantic-similarity-based integration techniques on quality parameters like recall, precision, F-measure and integration efficiency over different sizes of concept sets. The analysis indicates that edit-distance-based integration outperformed n-gram integration and semantic similarity techniques.
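The two syntactic match values compared above are standard string measures. A minimal sketch, assuming the classic Levenshtein definition of edit distance and a Dice-style character n-gram overlap (the paper's exact normalization may differ):

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def ngram_match(a, b, n=2):
    """Dice-style overlap of character n-grams of two concept names."""
    grams = lambda s: {s[i:i + n] for i in range(len(s) - n + 1)}
    ga, gb = grams(a.lower()), grams(b.lower())
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))

print(edit_distance("colour", "color"))          # 1
print(round(ngram_match("colour", "color"), 2))  # 0.67
```

A concept pair would be merged when the chosen match value crosses a threshold, with the more frequent concept label retained, as the abstract's frequency-guided integration suggests.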
8 efficient multi-document summary generation using neural networkINFOGAIN PUBLICATION
Over the last few years, online information has been growing tremendously on the World Wide Web and on users' desktops, and thus has gained much attention in the field of automatic text summarization. Text mining has become a significant research field as it produces valuable data from large amounts of unstructured text. Summarization systems make it possible to search the important keywords of the texts, so the consumer spends less time reading the whole document. The main objective of a summarization system is to generate a new form which expresses the key meaning of the contained text. This paper studies various existing techniques along with the need for novel multi-document summarization schemes, motivated by the arising need to provide a high-quality summary in a very short period of time. In the proposed system, users can quickly and easily access correctly developed summaries which express the key meaning of the contained text. A primary focus of this paper lies with the f_β-optimal merge function, a recently presented function that uses the weighted harmonic mean to find a balance between precision and recall. The proposed system utilizes bisecting K-means clustering to improve the running time and neural networks to improve the accuracy of the summary generated by the NEWSUM algorithm.
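The weighted harmonic mean underlying the f_β measure mentioned above is standard and easy to state: β > 1 weights recall more heavily, β < 1 favours precision, and β = 1 recovers the familiar F1. A small sketch (the merge function itself is the paper's contribution and is not reproduced here):

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall:
    (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_beta(0.5, 1.0))          # ≈ 0.667, the balanced F1
print(f_beta(0.5, 1.0, beta=2))  # ≈ 0.833, recall-weighted
```

Because the harmonic mean is dominated by the smaller of its arguments, a merge that maximizes f_β cannot trade one of precision or recall entirely away for the other.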
Automatic Text Classification using Supervised Learningdbpublications
As time goes on, the digitization of text has been increasing remarkably, and the need to organize, categorize and classify text has become indispensable. Disorganization and very little categorization and classification of text may result in gradually slower response times for text or information retrieval. It is therefore very important to organize, categorize and classify texts and digitized documents according to the descriptions proposed by text mining experts and computer scientists. Automated text classification has been considered an imperative method to manage and process the large number of documents in digital form that are widespread and continuously increasing. In general, text classification plays a substantial role in information extraction, text retrieval, and question answering. This paper emphasizes the text classification process using machine learning techniques.
ΓΓΔΕ: Provision of clarifications regarding the common tax regime applicable to parent companies and subsidiaries of different Member States
AN INNOVATIVE RESEARCH FRAMEWORK ON INTELLIGENT TEXT DATA CLASSIFICATION SYST...ijaia
Recent years have witnessed an astronomical growth in the amount of textual information available both on the web and in institutional document repositories. As a result, text mining has become extremely prevalent, and processing textual information from such repositories is the focus of current researchers. Indeed, numerous cutting-edge applications are available for text mining, and classification-oriented text mining in particular has been gaining attention as it concentrates on measures like coverage and accuracy. Along with the huge volume of data, the aspirations of users are growing far beyond human capacity; thus, automated and competitive intelligent systems are essential for reliable text analysis. Towards this, the authors of the present paper propose an Intelligent Text Data Classification System (ITDCS), which is designed in the light of the biological nature of the genetic approach and is able to acquire computational intelligence accurately. Initially, ITDCS focuses on preparing structured data from the huge volume of unstructured data using its procedural steps and filter methods. Subsequently, it classifies the text data into labelled classes using KNN classification based on the best features selected by a genetic algorithm. In this process, it concentrates on adding the power of intelligence to the classifier using the biological parts of the genetic algorithm, namely the encoding strategy, fitness function and operators. The integration of all biological components of the genetic algorithm in ITDCS significantly improves the accuracy and reduces the misclassification rate in classifying text data.
Data mining is knowledge discovery in databases; the goal is to extract patterns and knowledge from large amounts of data. An important branch of data mining is text mining, which extracts high-quality information from text. Statistical pattern learning is used to obtain this high-quality information. High quality in text mining refers to a combination of relevance, novelty and interestingness. Tasks in text mining include text categorization, text clustering, entity extraction and sentiment analysis. Natural language processing and analytical methods are widely applied to turn text into data for analysis. This survey covers the various techniques and algorithms used in text mining.
A web content mining application for detecting relevant pages using Jaccard ...IJECEIAES
The tremendous growth in the availability of enormous text data from a variety of sources creates a slew of concerns and obstacles to discovering meaningful information. This advancement of technology in the digital realm has resulted in the dispersion of texts over millions of web sites. Unstructured texts are densely packed with textual information. The discovery of valuable and intriguing relationships in unstructured texts demands more computer processing. So, text mining has developed into an attractive area of study for obtaining organized and useful data. One of the purposes of this research is to discuss text pre-processing of automobile marketing domains in order to create a structured database. Regular expressions were used to extract data from unstructured vehicle advertisements, resulting in a well-organized database. We manually develop unique rule-based ways of extracting structured data from unstructured web pages. As a result of the information retrieved from these advertisements, a systematic search for certain noteworthy qualities is performed. There are numerous approaches for query recommendation, and it is vital to understand which one should be employed. Additionally, this research attempts to determine the optimal value similarity for query suggestions based on user-supplied parameters by comparing MySQL pattern matching and Jaccard similarity.
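The Jaccard similarity the abstract compares against MySQL pattern matching has a one-line definition: the size of the intersection of two token sets divided by the size of their union. A sketch with made-up vehicle-advertisement tokens (not from the paper's data):

```python
def jaccard(a, b):
    """Jaccard similarity between two token collections:
    |intersection| / |union| of their sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical user query and two ad descriptions
query = "red toyota corolla 2015".split()
ad1 = "toyota corolla 2015 automatic red".split()
ad2 = "honda civic 2018 manual".split()

print(jaccard(query, ad1))  # 0.8
print(jaccard(query, ad2))  # 0.0
```

Unlike SQL LIKE pattern matching, this score is order-independent and degrades gracefully with partial overlap, which is why it suits ranking query suggestions.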
Sensing complicated meanings from unstructured data: a novel hybrid approachIJECEIAES
The majority of data on computers nowadays is in the form of unstructured data and unstructured text. The inherent ambiguity of natural language makes it incredibly difficult, but also highly profitable, to find hidden information or comprehend complex semantics in unstructured text. In this paper, we present a hybrid architecture combining natural language processing (NLP) and a convolutional neural network (CNN), called automated analysis of unstructured text using machine learning (AAUT-ML), for the detection of complex semantics from unstructured data, which enables different users to understand formal semantic knowledge extracted from an unstructured text corpus. AAUT-ML has been evaluated using three datasets, data mining (DM), operating system (OS), and database (DB), and compared with existing models, i.e., YAKE, term frequency-inverse document frequency (TF-IDF) and text-R. The results show better outcomes in terms of precision, recall, and macro-averaged F1-score. This work presents a novel method for identifying complex semantics using unstructured data.
Semantic Search of E-Learning Documents Using Ontology Based Systemijcnes
The keyword-searching mechanism is traditionally used for information retrieval from Web-based systems. However, this approach fails to meet the requirements of Web searching over expert knowledge bases built on the popular semantic systems. Semantic search of e-learning documents based on ontology is increasingly adopted in information retrieval systems. An ontology-based system simplifies the task of finding correct information on the Web by building a search system based on the meaning of a keyword instead of the keyword itself. The major function of the ontology-based system is the development of a specification of conceptualization, which enhances the connection between the information present in Web pages and the background knowledge. The semantic gap existing between the keywords found in documents and those in a query can be bridged suitably using an ontology-based system. This paper provides a detailed account of the semantic search of e-learning documents using ontology-based systems by comparing various ontology systems. Based on this comparison, the survey attempts to identify possible directions for future research.
Post 1What is text analytics How does it differ from text mini.docxstilliegeorgiana
Post 1:
What is text analytics? How does it differ from text mining?
Text analytics is the application of statistical and machine learning techniques to predict, prescribe or infer information from text-mined data. Text mining is a tool that helps in getting the data cleaned up.
Differences between Text Mining and Text Analytics:
• Text Mining and Text Analytics solve the same problems, but use different techniques and are complementary ways to automatically extract meaning from text.
• Text analytics was developed within the field of computational linguistics. Its strength is the ability to encode human understanding into a series of linguistic rules. Rules generated by humans are high in precision, but they do not automatically adapt and are usually fragile when applied in new situations.
• Text mining is a newer discipline arising out of the fields of statistics, data mining, and machine learning. Its strength is the ability to inductively create models from collections of historical data. Because statistical models are learned from training data they are adaptive and can identify “unknown unknowns”, leading to the better recall. Still, they can be prone to missing something that would seem obvious to a human.
• Text analytics and text mining approaches have essentially equivalent performance. Text analytics requires an expert linguist to produce complex rule sets, whereas text mining requires the analyst to hand-label cases with outcomes or classes to create training data.
• Due to their different perspectives and strengths, combining text analytics with text mining often leads to better performance than either approach alone.
2. What technologies were used in building Watson (both hardware and software)?
Watson is an extraordinary computer system (a novel combination of advanced hardware and software) designed to answer questions posed in natural human language. It is an artificially intelligent computer system developed in IBM's DeepQA project by a research team led by principal investigator David Ferrucci, and was named after IBM's first CEO, the industrialist Thomas J. Watson. The system was specifically developed to answer questions on the quiz show Jeopardy! In 2011, Watson competed on Jeopardy! against former winners Brad Rutter and Ken Jennings and received the first prize of $1 million. The goal was to advance computer science by exploring new ways for computer technology to affect science, business, and society. IBM undertook a challenge to build a computer system that could compete at the human champion level in real time on the American TV quiz show Jeopardy! The extent of the challenge in ...
Text mining is a technique that helps users find useful information in large numbers of text documents on the web or in databases. Most popular text mining and classification methods have adopted term-based approaches, while pattern-based methods describe user preferences. This review paper analyses how text mining works at three levels, i.e., the sentence level, the document level and the feature level. We review previously published related work and demonstrate the problems that arise when text mining is done at the feature level. The paper also presents a technique for text mining on compound sentences.
A simplified classification computational model of opinion mining using deep ...IJECEIAES
Opinion mining attempts to develop an automated system to determine people's viewpoints towards various entities such as events, topics, products, services, organizations, individuals, and issues. Opinion analysis from natural text can be regarded as a text and sequence classification problem that poses a high-dimensional feature space due to the involvement of dynamic information, which needs to be addressed precisely. This paper introduces effective modelling of human opinion analysis from social media data subject to complex and dynamic content. Firstly, a customized preprocessing operation based on natural language processing mechanisms is applied as an effective data treatment process towards building quality-aware input data. Then, a suitable deep learning technique, bidirectional long short-term memory (Bi-LSTM), is implemented for opinion classification, followed by a data modelling process where truncating and padding are performed manually to achieve better data generalization in the training phase. The design and development of the model are carried out with the MATLAB tool. The performance analysis shows that the proposed system offers a significant advantage in terms of classification accuracy and less training time, due to the reduction of the feature space by the data treatment operation.
Machine learning for text document classification-efficient classification ap...IAESIJAI
Numerous alternative methods for text classification have been created because of the increase in the amount of online text information available. The cosine similarity classifier is the most extensively utilized simple and efficient approach, and it improves text classification performance when combined with estimated values provided by conventional classifiers such as Multinomial Naive Bayes (MNB). Combining the similarity between a test document and a category with the estimated value for the category enhances the performance of the classifier. This approach provides a text document categorization method that is both efficient and effective. In addition, methods for determining the proper relationship between a set of words in a document and its document category are also obtained.
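One plausible reading of "combining the similarity with the estimated value" is a weighted sum of a classifier's category estimate and the cosine similarity between the document and a category term profile. The sketch below assumes that combination; the term profiles, probabilities, and weight alpha are illustrative, not the paper's, and a real MNB would supply the estimates.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse term-count dicts."""
    dot = sum(u[t] * v.get(t, 0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def classify(doc_tokens, categories, nb_estimates, alpha=0.5):
    """Score each category as a weighted sum of the conventional
    classifier's estimate and the document-category similarity."""
    doc = Counter(doc_tokens)
    scores = {cat: alpha * nb_estimates[cat] + (1 - alpha) * cosine(doc, profile)
              for cat, profile in categories.items()}
    return max(scores, key=scores.get)

# Toy category term profiles and hypothetical MNB probabilities
categories = {"sport": Counter(["goal", "match", "team"]),
              "finance": Counter(["market", "stock", "bank"])}
nb = {"sport": 0.55, "finance": 0.45}

print(classify(["match", "goal", "bank"], categories, nb))  # 'sport'
```

The similarity term acts as a corrective: a category the probabilistic model only narrowly prefers can be confirmed or overruled by how much vocabulary the document actually shares with it.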
Final project report on grocery store management system..pdfKamal Acharya
In today’s fast-changing business environment, it is extremely important to be able to respond to client needs in the most effective and timely manner, especially when customers wish to see your business online and have instant access to your products or services.
Online Grocery Store is an e-commerce website which retails various grocery products. The project allows visitors to view the available products, enables registered users to purchase desired products instantly using the Paytm and UPI payment processors (Instant Pay), and also lets them place orders using the Cash on Delivery (Pay Later) option. It provides Administrators and Managers easy access to view orders placed using the Pay Later and Instant Pay options.
In order to develop an e-commerce website, a number of Technologies must be studied and understood. These include multi-tiered architecture, server and client-side scripting techniques, implementation technologies, programming language (such as PHP, HTML, CSS, JavaScript) and MySQL relational databases. This is a project with the objective to develop a basic website where a consumer is provided with a shopping cart website and also to know about the technologies used to develop such a website.
This document will discuss each of the underlying technologies to create and implement an e- commerce website.
Hierarchical Digital Twin of a Naval Power System - Kerry Sado
A hierarchical digital twin of a Naval DC power system has been developed and experimentally verified. Similar to other state-of-the-art digital twins, this technology creates a digital replica of the physical system executed in real-time or faster, which can modify hardware controls. However, its advantage stems from distributing computational efforts by utilizing a hierarchical structure composed of lower-level digital twin blocks and a higher-level system digital twin. Each digital twin block is associated with a physical subsystem of the hardware and communicates with a singular system digital twin, which creates a system-level response. By extracting information from each level of the hierarchy, power system controls of the hardware were reconfigured autonomously. This hierarchical digital twin development offers several advantages over other digital twins, particularly in the field of naval power systems. The hierarchical structure allows for greater computational efficiency and scalability while the ability to autonomously reconfigure hardware controls offers increased flexibility and responsiveness. The hierarchical decomposition and models utilized were well aligned with the physical twin, as indicated by the maximum deviations between the developed digital twin hierarchy and the hardware.
Immunizing Image Classifiers Against Localized Adversary Attacks - gerogepatton
This paper addresses the vulnerability of deep learning models, particularly convolutional neural networks
(CNNs), to adversarial attacks and presents a proactive training technique designed to counter them. We
introduce a novel volumization algorithm, which transforms 2D images into 3D volumetric representations.
When combined with 3D convolution and deep curriculum learning optimization (CLO), it significantly improves
the immunity of models against localized universal attacks by up to 40%. We evaluate our proposed approach
using contemporary CNN architectures and the modified Canadian Institute for Advanced Research (CIFAR-10
and CIFAR-100) and ImageNet Large Scale Visual Recognition Challenge (ILSVRC12) datasets, showcasing
accuracy improvements over previous techniques. The results indicate that the combination of the volumetric
input and curriculum learning holds significant promise for mitigating adversarial attacks without necessitating
adversary training.
Water scarcity is the lack of fresh water resources to meet the standard water demand. There are two types of water scarcity: physical water scarcity and economic water scarcity.
About
Indigenized remote control interface card suitable for MAFI system CCR equipment. Compatible for IDM8000 CCR. Backplane mounted serial and TCP/Ethernet communication module for CCR remote access. IDM 8000 CCR remote control on serial and TCP protocol.
• Remote control: Parallel or serial interface.
• Compatible with MAFI CCR system.
• Compatible with IDM8000 CCR.
• Compatible with Backplane mount serial communication.
• Compatible with commercial and Defence aviation CCR system.
• Remote control system for accessing CCR and allied system over serial or TCP.
• Indigenized local Support/presence in India.
• Easy in configuration using DIP switches.
Cosmetic shop management system project report.pdf - Kamal Acharya
Buying new cosmetic products is difficult. It can even be scary for those who have sensitive skin and are prone to skin trouble. The information needed to alleviate this problem is on the back of each product, but it's tough to interpret those ingredient lists unless you have a background in chemistry.
Instead of buying and hoping for the best, we can use data science to help us predict which products may be good fits for us. The project includes various function programs to perform the tasks mentioned above.
Data file handling has been used effectively in the program.
The automated cosmetic shop management system should deal with the automation of the general workflow and administration process of the shop. The main processes of the system focus on the customer's request, where the system is able to search for the most appropriate products and deliver them to the customer. It should help employees quickly identify the list of cosmetic products that have reached the minimum quantity and also keep track of the expiry date of each cosmetic product. It should help employees find the rack number in which a product is placed. It also makes operations faster and more efficient.
Sachpazis: Terzaghi Bearing Capacity Estimation in simple terms with Calculati... - Dr. Costas Sachpazis
Terzaghi's soil bearing capacity theory, developed by Karl Terzaghi, is a fundamental principle in geotechnical engineering used to determine the bearing capacity of shallow foundations. This theory provides a method to calculate the ultimate bearing capacity of soil, which is the maximum load per unit area that the soil can support without undergoing shear failure. The calculation HTML code is included.
ESTIMATION OF REGRESSION COEFFICIENTS USING GEOMETRIC MEAN OF SQUARED ERROR FOR SINGLE INDEX LINEAR REGRESSION MODEL
International Journal of Artificial Intelligence and Applications (IJAIA), Vol. 7, No. 6, November 2016
DOI: 10.5121/ijaia.2016.7605
AN INNOVATIVE RESEARCH FRAMEWORK ON INTELLIGENT TEXT DATA CLASSIFICATION SYSTEM USING GENETIC ALGORITHM
Dr. V V R Maheswara Rao1, N Silpa2 and Dr. Gadiraju Mahesh3
1,2 Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India
3 Department of CSE, S.R.K.R. Engineering College, Andhra Pradesh, India
ABSTRACT
Recent years have witnessed astronomical growth in the amount of textual information available both on the web and in institutional document repositories. As a result, text mining has become extremely prevalent, and the processing of textual information from such repositories has become a focus for current researchers. Indeed, numerous cutting-edge applications for text mining are available. More specifically, classification-oriented text mining has been gaining attention, as it concentrates on measures like coverage and accuracy. Along with the huge volume of data, the aspirations of users are growing far beyond human capacity; thus, automated and competitive intelligent systems are essential for reliable text analysis.
Towards this, the authors in the present paper propose an Intelligent Text Data Classification System (ITDCS), which is designed in the light of the biological nature of the genetic approach and is able to acquire computational intelligence accurately. Initially, ITDCS focuses on preparing structured data from the huge volume of unstructured data using its procedural steps and filter methods. Subsequently, it classifies the text data into labelled classes using KNN classification based on the best features selected by a genetic algorithm. In this process, it specially concentrates on adding the power of intelligence to the classifier using the biological components of the genetic algorithm, namely the encoding strategy, fitness function and operators. The integration of all biological components of the genetic algorithm in ITDCS significantly improves the accuracy and reduces the misclassification rate in classifying the text data.
KEYWORDS
Text Data, Classification, Genetic Approach, Learning Algorithm, Text Mining, KNN Classification.
1. INTRODUCTION
In the light of rapid advances in data storage, the presence of structured, semi-structured, unstructured, non-scalable and complex data has created a great text data repository. This in turn has motivated advances in text mining along with the soft computing paradigm, which is endorsed as a potential research area for present researchers. Nowadays, text mining has attracted much attention from academia and industry, and a great amount of progress has been achieved in many applications. Although much work has been done in text mining, many open research problems remain in this area, owing to the complexity of text data, the diversity of applications and the increased demands of users in terms of efficiency and effectiveness.
Text mining, also called Knowledge Discovery from Text Data (KDTD) [27], is the process of extracting potentially useful, previously hidden knowledge from large amounts of text documents. Text pre-processing, text transformation, attribute selection, text mining, pattern evolution and knowledge representation are the key phases of text mining, as depicted in figure 1.
Figure 1. Phases of Text Mining
Text mining refers to gaining knowledge from virtually infinite volumes of textual data available in digital format, such as transaction data in e-commerce applications or genetic expressions in bioinformatics research domains. At a high level, text mining techniques are broadly categorised into statistical methods and linguistic methods. Statistical methods generally build on a statistical or probabilistic framework and rely on a mathematical representation of text, the so-called bag-of-words matrix. Linguistic methods are basically developed on meaning semantics and natural language processing techniques. There are several methods used to structure text documents. Principally, these methods are classified into supervised methods, which perform the text mining task by assigning a given keyword to the documents, and unsupervised methods, which automatically group similar text documents.
Text classification is a supervised technique used to build a classifier to classify new text documents. In this approach, pre-defined class labels are assigned to the text documents. The aim is to train the classifier on the basis of known text documents so that new text documents are then classified automatically. Classification is often used in the fields of information retrieval and information extraction. In the overall process of classification, the performance and training of any text classification technique depend heavily on the selection of the best features from the possible set of features. The simple use of the feature selection process in conventional classification techniques has proved inefficient, as the dimensionality of the original feature set drawn from the data preparation stage is high. Thus, there is a need to deploy intelligent learning feature selection techniques in order to enhance the performance of classification in the era of text data mining.
Computational intelligence models rely on heuristic algorithms such as evolutionary computation, fuzzy systems and artificial neural networks. Basically, these models acquire intelligence by applying a combination of learning, adaptation and evaluation techniques. The intelligent models enhance the performance of existing conventional techniques. In addition, these models are implemented by closely considering biological inspiration from nature. In a nutshell, neural networks, fuzzy systems, evolutionary computation, swarm intelligence, probabilistic reasoning, multi-agent systems and combinations of these techniques created the era of the soft computing paradigm. Among all of these, to attain global optimization of learning algorithms, genetic algorithms, a biologically imitating technology, are most suitable for the feature selection problem in the classification of the proposed work.
The remainder of the paper is organized as follows: in section 2, brief related work is given. Then, in section 3, the proposed Intelligent Text Data Classification System is presented. Subsequently, the experimental analysis is showcased in section 4. Finally, the conclusions are made.
2. RELATED WORK
Text mining, or knowledge discovery from text (KDTD), was first mentioned by Feldman et al. [27] and deals with the machine-supported analysis of text data. Text mining is an interdisciplinary field at the confluence of a set of disciplines, including information extraction, information retrieval, data mining, soft computing and statistics. Over the last decade, the research contributions of many authors [21] have made it evident that current text mining research mainly tackles the problems of information retrieval, text document categorization and information extraction. Among these, text document categorization is a current and promising research area [19] of text mining, and its future paths have triggered the present research in the field of text document categorization. Various authors [7,9,12] have found that document categorization can be done in two ways: either by assigning keywords to documents based on a given keyword set (classification) or by automatically structuring document collections to find groups of similar documents (clustering). The proposed work leans towards classification, and brief related work from 2010 to the current year is presented.
In 2010, the authors of [21] emphasized that text mining has high commercial potential value, as most information in real-world applications is stored as text. They mainly concentrated on presenting various issues such as information extraction, information retrieval, document clustering and classification with respect to text mining techniques. Among these issues, the authors of [20] took up the problem of document clustering based on the semantic contents of documents. They mostly paid their effort to the preparation of feature selection by considering the meaning behind the text, and then adapted the hierarchical agglomerative clustering algorithm to perform the clustering process. Another set of authors [19] showed interest in another issue, namely text document classification. They provided a review and comparative study of approaches and document representation techniques in document classification. In conclusion, they endorsed the KNN algorithm as most appropriate for hybrid approaches to automatic classification.
After that, in the year 2011, the authors of [17] reported on promising research results of data and text mining methods for real-world deception detection. Their discussion revealed that the combination of text and data mining techniques can be successfully applied to real-world data to produce better results. Meanwhile, they felt that, due to the increasing demand for text mining, extra effort is needed to achieve high accuracy in the document segmentation process. With that objective, the authors of [16, 18] analysed the problems of classification and clustering, respectively, within the framework of genetic algorithms. Empirical testing was performed, and the results justified the need for and relevance of genetic algorithms for both classification and clustering processes.
Across 2012 and 2013, the authors of [10, 13, 14, 15] discussed the significance of applying pre-processing techniques and presented the sequential stages of pre-processing in text data categorization. From the literature, they recognized that KNN is one of the most accepted classification techniques for segmenting data efficiently. However, they noticed that traditional KNN has some limitations, such as high calculation complexity, dependence on the training set and no weight difference between samples. To overcome such limitations, the authors proposed numerous approaches combining clustering techniques with traditional KNN or implementing a weighting strategy on top of it. In addition, they expressed that the combination of KNN with machine learning techniques can definitely yield optimal solutions.
Between 2014 and 2015, some authors [4,5,8] made use of the text mining process in various applications, such as identifying plagiaristic behaviour in online assignments as well as in research papers, and finding software vulnerabilities via bugs. Their results indicated that text mining analysis requires too much processing power and time, and thus it is regarded as a
great challenge in large-scale applications. They prominently conclude that further studies are required to improve the performance of text mining algorithms within the soft computing paradigm for detecting text similarities and getting faster results. On this path, the authors of [6] used the genetic algorithm for efficient text clustering and text classification. Their research proved that the incorporation of the genetic approach in the text mining process can definitely yield the best results, and it has become one of the promising areas of research.
The recent research review in 2016, carried out by many authors [1,2,3], makes it evident that text classification is a fully potential area of research in text mining. In addition, the use of KNN alone could not produce satisfactory results. To improve the performance of KNN in terms of efficiency and accuracy, it is necessary to employ machine learning techniques, after which KNN is appropriate for dealing with large amounts of data. With this motivation, the authors in the present paper take up the issue of text data classification with the combination of the well-known KNN and a genetic algorithm.
3. PROPOSED INTELLIGENT TEXT DATA CLASSIFICATION SYSTEM-ITDCS
In order to overcome the challenges involved in earlier classifiers and to improve the accuracy and coverage of the text data classifier, the authors in the present paper propose an Intelligent Text Data Classification System (ITDCS), as shown in figure 2.
Figure 2. Architecture of ITDCS
The proposed ITDCS is designed and developed in the research framework of the genetic approach with a focus on improving overall classification accuracy. At first, ITDCS focuses on data preparation, which includes document collection, tokenization, stop word removal, stemming and attribute selection. Later, it classifies the text documents using the standard KNN classification algorithm. In this classification process, the authors integrate a genetic algorithm for feature selection so as to enhance the accuracy of the classifier and improve the coverage of all unclassified documents through its biological components. As a result, the ITDCS classifier is comprehensively able to classify the text documents accurately into optimal labelled classes.
3.1. ITDCS - DATA PREPARATION
As classification algorithms are unable to understand documents directly, one has to represent the raw documents in a suitable structure. In addition, the success and accuracy of any text mining classifier depend highly and directly on the effective preparation of the input raw data. In text mining, it is important to identify the significant keywords that carry the meaning and contribute to the quality of the later stages of the classifier. The data preparation stage converts the raw original textual documents into a classifier-ready structure. With this aim, ITDCS initially concentrates on data preparation with all possible procedural methods. The ITDCS data preparation covers all the procedural steps, including text document collection, tokenization, stop word removal, stemming and attribute selection.
3.1.1 DOCUMENT COLLECTION
The ITDCS process starts with a collection of text data, which includes a set of documents with extensions such as .pdf or .txt, or even flat files. They are normally collected from different sources such as the web, individual data stores, real-world data, natural language text documents, online chat, emails, message boards, news groups, blogs and web pages. A text dataset may also be created by processing spontaneous speech, printed text and handwritten text that contain processing noise. However, the data collection is purely dependent on the application domain of the text mining task.
3.1.2 TOKENIZATION
In order to get all the words in a given text, tokenization is required; it splits the sentences into a stream of words by removing all punctuation marks. The present system, ITDCS, discovers the words as tokens by splitting the text into pieces and then removing all punctuation marks using its parser. Token boundaries are normalised by placing a single white space between the words, and the tokens are then used for further execution. ITDCS creates a dictionary with the set of tokens obtained by combining all text documents. The ITDCS parser finds significantly meaningful keywords and transforms abbreviations and acronyms into a standard form. In addition, tokenization helps in maintaining consistency across all the collected documents. ITDCS successfully performs the tokenization phase, allowing for an accurate classification of text documents.
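The tokenization and dictionary-building steps above can be sketched as follows. The paper does not publish the ITDCS parser, so the regular expression and function names here are illustrative assumptions:

```python
import re

def tokenize(text):
    # Keep only alphanumeric runs; punctuation and whitespace act as
    # separators, approximating "removing all punctuation marks".
    return re.findall(r"[A-Za-z0-9]+", text.lower())

def build_dictionary(documents):
    # Dictionary of distinct tokens obtained by combining all documents,
    # as the ITDCS data preparation stage describes.
    vocab = set()
    for doc in documents:
        vocab.update(tokenize(doc))
    return sorted(vocab)
```

For example, tokenize("Text mining, today!") yields ["text", "mining", "today"].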
3.1.3 STOP WORD REMOVAL
In the next step, the data preparation of ITDCS pays attention to removing stop words, because these words may misguide the entire text mining process. Stop words are language-specific words with very little or no semantic value, as they are used to provide structure in the language rather than content. Specifically, in English, the stop word list may consist of pronouns, prepositions, conjunctions and other frequent words that carry no meaningful information. Additionally, plain stop words such as names of days, names of months, one- or two-character terms and non-alphabetical characters are sensibly discarded during this step. Terms printed on every page of a document are treated as template words; they are also deleted, as they negatively impact the text mining results. In particular, the elimination of these unwanted words reduces the dimensionality of the input data for classification. A sample list of English stop words is shown in table 1.
Table 1: List of Stop Words in English
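A minimal sketch of this filtering step; the stop word list below is a tiny illustrative subset, not the list the authors used:

```python
# Illustrative subset of an English stop word list (table 1 would supply
# the full list used in practice).
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "in", "on", "to", "is"}

def remove_stop_words(tokens, template_words=frozenset()):
    # Drop stop words, one- or two-character terms, non-alphabetic tokens
    # and any page-template words, as described above.
    drop = STOP_WORDS | set(template_words)
    return [t for t in tokens
            if t.isalpha() and len(t) > 2 and t.lower() not in drop]
```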
3.1.4 STEMMING
Here the data preparation of the proposed ITDCS concentrates further on reducing the size of the input data, which helps in enhancing the intelligence of the proposed system. Stemming is an important process, as it eliminates words having grammatically the same meaning that are connected to the same root word. This process works with the objective of reducing words derived in different forms, such as nouns, adjectives, verbs, adverbs and plurals, to their root word. The procedure incorporates a great deal of language-dependent linguistic knowledge but no domain knowledge. In general, the stemming process considers two important points: first, morphological forms of a word that have the same base meaning should be mapped to the same stem; second, words that do not have the same meaning should be kept separate. Towards this, the ITDCS data preparation stage uses a standard statistical n-gram stemmer. It follows a string similarity approach to convert word inflections to their stems and is language independent. The main idea behind this approach is that similar words have a high proportion of n-grams in common. In this method, an n-gram is a string of n, generally adjacent, characters extracted from a section of a sentence. For n equal to 2 or 3, the extracted substrings are called di-grams or tri-grams, respectively.
For example, the word ‘INTRODUCTIONS’ results in the generation of:
Di-grams:
*I, IN, NT, TR, RO, OD, DU, UC, CT, TI, IO, ON, NS, S*
Tri-grams:
**I, *IN, INT, NTR, TRO, ROD, ODU, DUC, UCT, CTI, TIO, ION, ONS, NS*, S**
where '*' denotes a padding space. There are n+1 such di-grams and n+2 such tri-grams in a word containing n characters. Generally a value of 4 or 5 is selected for n. A well-known statistical analysis based on the Inverse Document Frequency (IDF) is used to identify them.
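The di-gram/tri-gram generation above can be reproduced in a few lines. The function name and padding scheme follow the worked example, not any code published by the authors:

```python
def char_ngrams(word, n):
    # Pad with n-1 '*' characters on each side, then slide a window of
    # width n across the padded word.
    padded = "*" * (n - 1) + word + "*" * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

digrams = char_ngrams("INTRODUCTIONS", 2)   # 14 di-grams: *I, IN, ..., S*
trigrams = char_ngrams("INTRODUCTIONS", 3)  # 15 tri-grams: **I, *IN, ..., S**
```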
3.1.5 ATTRIBUTE SELECTION
Attribute selection is a key factor in building any classifier in the process of text mining. Though the previous stages of ITDCS data preparation reduce the size of the input data by eliminating unwanted words, it is still necessary to identify the attributes that contribute most to the problem domain. With this aim, the authors in the present paper use the following standard measures to identify the required attributes and improve the overall performance of the proposed system.
Term Contribution: It is calculated on the basis of how much a specific term contributes to the similarity among all collected text data documents:

TC(t_k) = Σ_{i,j; i≠j} f(t_k, D_i) × f(t_k, D_j)        (1)

where f(t_k, D_i) is the TF-IDF of the k-th term in the i-th document.
Term Variance: It calculates the variance of each term across all the text data documents, assigning the maximum scores to terms that have more document frequency and a common distribution value:

v(t_i) = Σ_j [f(t_i, d_j) − f̄(t_i)]²        (2)

where f(t_i, d_j) is the frequency of the i-th term in the j-th document and f̄(t_i) is the mean frequency of the i-th term over the document collection.
Term Variance Quality: This measure calculates the quality of a term by using the total variance:

q(t_i) = Σ_j f(t_i, d_j)² − (1/n) [Σ_j f(t_i, d_j)]²        (3)

where f(t_i, d_j) is the frequency of the i-th term in the j-th document and n is the number of documents.
Finally, the filters assign ranks to all the attributes, and the top-ranked attributes are treated as features and given as input to the next stages of ITDCS.
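Measures (2) and (3) can be computed directly from a document-term frequency matrix. This sketch assumes a plain list-of-lists matrix with one row per document; it is an illustration, not the authors' implementation:

```python
def term_variance(freq):
    # freq: list of documents, each a list of term frequencies f(t_i, d_j).
    # v(t_i) = sum_j (f(t_i, d_j) - mean_j f(t_i, d_j))^2
    n = len(freq)
    scores = []
    for i in range(len(freq[0])):
        col = [doc[i] for doc in freq]
        mean = sum(col) / n
        scores.append(sum((f - mean) ** 2 for f in col))
    return scores

def term_variance_quality(freq):
    # q(t_i) = sum_j f(t_i, d_j)^2 - (1/n) * (sum_j f(t_i, d_j))^2
    n = len(freq)
    return [sum(doc[i] ** 2 for doc in freq)
            - sum(doc[i] for doc in freq) ** 2 / n
            for i in range(len(freq[0]))]
```

Terms are then ranked by these scores, and the top-ranked ones are retained as features.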
3.1.6 TEXT DATA ENCODING AND STRUCTURED REPRESENTATION USING VECTOR SPACE MODEL
In order to classify a large number of text documents, it is equally important to encode the words into numerical values using a standard approach in the preparation of the text data. With this objective, the proposed work employs the proven vector space model, which enables efficient classification. This representation model usually improves the performance of the
classifier, as it is designed around a term-weighting scheme. Larger weights are assigned to the words that are frequently identified in the text documents; thus it is able to encode the input data appropriately.
The document weight w(d, t) for a term t in a document d is computed as the product of the term frequency TF(d, t) and the inverse document frequency IDF(t), i.e. w(d, t) = TF(d, t) × IDF(t). Based on the document weights, ITDCS prepares a vector space model that represents each document as a vector in m-dimensional space. Here each document d is described by a numerical feature vector Wd = {x(d, t1), x(d, t2), ..., x(d, tm)}. The documents can then be compared by performing simple vector operations in the text mining process.
A sample dimensional vector space model is shown in table 2. Here, each term is a component of the vector, and the value of each component is the number of times the corresponding term occurs in the document. The documents are the rows of this matrix, while the features are the columns.
Table 2: Sample Dimensional Vector Space Model
The structured text data generated by the vector space model is given as the basic input for the intelligent text data classification system.
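The weighting described above can be sketched as follows, assuming the common definition IDF(t) = log(N / df(t)); the paper does not fix the exact IDF variant, so that choice is an assumption:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # docs: list of token lists. Returns one {term: weight} vector per
    # document, with w(d, t) = TF(d, t) * IDF(t).
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                 # document frequency per term
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
```

A term that appears in every document gets IDF 0 and hence weight 0, so only discriminating terms carry weight into the classifier.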
3.2. ITDCS - CLASSIFICATION
The accuracy of any classifier in the era of text mining refers to its ability to predict the known and unknown class labels of the data. The standard KNN classifier is unable to reach the expected level of accuracy in predicting class labels. The uneven class distribution of the conventional KNN technique poses many challenges. In addition, conventional classification techniques are computationally expensive and lead to static solutions. To attain a globally optimal solution, one has to use an intelligent classification technique that can predict class labels accurately and reach the global solution. Accuracy and coverage are the two measures for building the intelligent classifier. Moreover, relevance analysis in attribute selection is also a key parameter in classification. Among the many intelligent techniques proposed in the literature, the genetic approach is rightly suitable for developing an accurate supervised classifier. Towards this, the authors in the present paper propose an intelligent classification algorithm designed on the biological inspiration of the genetic approach.
The proposed intelligent text data classification system concentrates on modelling the classifier from the training data in the light of the genetic algorithm with all its appropriate biological
components. The encoding strategy of the genetic approach models the global solution to the present text data classification problem. The continuous evaluation of the fitness function of the genetic algorithm builds the ITDCS classifier to overcome the challenges of susceptibility to overfitting and misclassification. The biologically inspired operators of the genetic algorithm add intelligence to the ITDCS classifier, making the boundaries of the classes more flexible. As a whole, the comprehensive interplay of all the genetic components enables the classifier to produce the right predictions, finding the class labels accurately and in time. This classifier is then applied to the new testing data, fed in through the vector space model of the data preparation stage, to predict the right class labels.
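One plausible reading of this GA-KNN coupling is that each chromosome is a binary feature mask whose fitness is the KNN accuracy on held-out documents using only the selected features. The paper does not publish code, so the function names and this fitness definition are illustrative assumptions:

```python
from collections import Counter

def knn_predict(train, x, mask, k=3):
    # Majority vote among the k nearest training vectors, with distance
    # computed only over the features the chromosome selects (mask[i] == 1).
    def dist(v):
        return sum((a - b) ** 2 for a, b, m in zip(v, x, mask) if m)
    nearest = sorted(train, key=lambda pair: dist(pair[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def fitness(mask, train, validation, k=3):
    # Chromosome fitness = classification accuracy on validation documents.
    if not any(mask):
        return 0.0
    hits = sum(knn_predict(train, x, mask, k) == y for x, y in validation)
    return hits / len(validation)
```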
3.2.1 ITDCS - BASIC PRINCIPLES
A genetic algorithm is an efficient, robust, adaptive and self-learning search process that is generally applied to high-volume, complex and multi-dimensional data. It is designed on the principles of natural genetic systems: each individual text data document is encoded as a chromosome. The learning algorithm uses application-domain-dependent knowledge to compute the fitness function and create more promising solutions. According to the theory of evolution, each initial individual chromosome is associated with a fitness value used to acquire the expected accuracy. Various biologically inspired genetic operators, such as selection, crossover and mutation, are applied to these chromosomes to obtain a potentially better global solution.
Genetic Algorithms (GAs) differ from most normal learning algorithms in the following ways:
• GAs work with the coding of the parameter set, not with the parameters themselves
• GAs work simultaneously with multiple points, not with a single point.
• GAs optimize via sampling using only the payoff information
• GAs optimize using stochastic operators, not deterministic rules
3.3.2 ITDCS - ENCODING STRATEGY
The encoding strategy is the process of representing the data fed by the data preparation stage of
the ITDCS system in a form suitable for the genetic algorithm. It is an important step in the
genetic approach, as it plays a key role in achieving the best global performance of the learning
algorithm. Many encoding techniques, such as tree encoding, permutation encoding and binary
encoding, are used by genetic algorithms.
In the present paper, for ITDCS, the binary encoding technique is adopted to create the
initial population. Binary encoding encodes the chromosome using the two bits 0 and 1.
Consider the following example: a document with features {F1, F4, F5, F8} is encoded as a
binary chromosome of length 10, as represented in figure 3. The presence of a feature in a
document is encoded as 1; its absence is encoded as 0.
Figure 3. Example of Binary Encoding Chromosome
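The encoding step above can be sketched in Python; the ten-feature vocabulary below is an illustrative assumption (the paper names only F1, F4, F5 and F8), not the paper's actual feature set:

```python
# Hypothetical feature vocabulary for illustration; the paper's real
# vocabulary comes from the vector space model of the corpus.
FEATURES = ["F1", "F2", "F3", "F4", "F5", "F6", "F7", "F8", "F9", "F10"]

def encode_document(doc_features, vocabulary=FEATURES):
    """Binary encoding: position i is 1 if feature i is present in the
    document, 0 otherwise (as in figure 3)."""
    present = set(doc_features)
    return [1 if f in present else 0 for f in vocabulary]

# The document {F1, F4, F5, F8} becomes a length-10 chromosome.
chromosome = encode_document({"F1", "F4", "F5", "F8"})
```

The chromosome length is fixed by the vocabulary size, so every document in the population shares the same representation.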
3.3.3 ITDCS – FITNESS FUNCTION
The authors employ a KNN classifier to assign text documents to predicted class labels,
working on the best features selected by the genetic process. The genetic approach is
intrinsically designed to train the ITDCS classifier to attain enough intelligence for accurate
classification. Generally, feature selection plays an important role in attaining classification
accuracy. Specifically, the accuracy of the classification depends not only on the
number of neighbours but also on the weight of each individual neighbour.
Towards this, the fitness function of the proposed ITDCS is designed with the intention of
computing the weight of each individual neighbour through a set of optimal parameter weights in
the continuous evaluation process. These parameters counter the susceptibility to overfitting
and reduce the misclassification rate. In addition, the fitness function is responsible for
achieving the expected accuracy in predicting the exact class label of a text document.
The ITDCS uses a robust fitness function formulated from the total number of documents to be
classified, the number of correctly classified documents and the number of nearest neighbours:
Fitness Function FFC = α ( (TotDocs − CCDocs) / TotDocs ) + β ( (NNN / K) / TotDocs )
Where,
FFC : Fitness Function of the Classifier
TotDocs : Total number of text Documents
CCDocs : Correctly Classified Documents
NNN : Number of Nearest Neighbours having less impact on classification
K : Number of Nearest Neighbours having high impact on classification
α, β are the constants to tune the learning algorithm
The goal of the genetic algorithm is expressed by its fitness function, which evaluates the
accuracy of the assigned class labels. The performance of the ITDCS fitness function is highly
dependent on how often an individual set of text document features is chosen for the initial
population. Therefore, the best-fit individual features are able to generate a better population in
the next generation, while low-fitness individuals are likely to disappear.
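Under one plausible reading of the reconstructed formula, the fitness function can be sketched as below; the function and parameter names and the default α = β = 1 are assumptions made for illustration, not the authors' code:

```python
def fitness(tot_docs, cc_docs, nnn, k, alpha=1.0, beta=1.0):
    """Fitness Function of the Classifier (FFC), assumed form:

        FFC = alpha * (TotDocs - CCDocs) / TotDocs
            + beta  * (NNN / K) / TotDocs

    Lower values mean more correctly classified documents and fewer
    low-impact neighbours relative to high-impact ones."""
    misclassification_rate = (tot_docs - cc_docs) / tot_docs
    neighbour_penalty = (nnn / k) / tot_docs
    return alpha * misclassification_rate + beta * neighbour_penalty
```

A perfect classification (all documents correct, no low-impact neighbours) drives FFC to zero, so the GA seeks to minimise this value.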
3.3.4 ITDCS – OPERATORS
In addition to the best individual feature population chosen by the ITDCS fitness function, the
genetically inspired operators selection, crossover and mutation create the best-fit solution. These
operators are applied to the selected initial population to generate a potentially better new
population. The selection, crossover and mutation operators are designed in tune with ITDCS and
comprehensively transform the individual features of a text document stochastically. Each feature
chromosome has an associated fitness value that contributes to the generation of a potential new
population through the operators. At each transformation, the ITDCS uses the fitness function
value to evaluate the survival capacity of a newly generated feature chromosome. As a whole, the
operators of ITDCS create a new population of chromosomes to improve the fitness function value.
SELECTION:
The selection operator reproduces the mating pool by choosing particular individual feature
chromosomes, mimicking natural selection. The number of times an individual chromosome is
chosen is also decided by the selection operation. Only the chromosomes selected into the mating
pool are able to take part in the subsequent genetic operations.
Among the several available selection methods, ITDCS deploys the roulette wheel parent selection
technique. The roulette wheel has as many slots as the population size, where the area of each slot
is proportional to the relative fitness of the corresponding feature chromosome in the population.
An individual feature chromosome of the text document is selected by spinning the roulette wheel
and recording the position at which the wheel stops. Therefore, the number of times a feature
chromosome is selected is proportional to its fitness value in the given population. The procedure
is illustrated in figure 4.
Figure 4. Example of Roulette Wheel Parent Selection
The selection operator of ITDCS allots a probability of selection Pj to each individual feature j
based on its fitness function value.
A series of N random numbers is generated and compared against the cumulative probability
C_j = Σ_{k=1..j} P_k of the feature population.
The individual i is chosen for the new population if C_{i−1} < r(0, 1) ≤ C_i, where r(0, 1) is a
uniform random number.
The probability Pi for each individual chromosome is defined by:
P(individual i is chosen) = F_i / Σ_{j=1..PopSize} F_j
where F_i is the fitness of individual i.
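The roulette-wheel procedure above can be sketched as follows; this is a generic implementation of cumulative-probability selection with assumed names, not the authors' implementation:

```python
import random

def roulette_select(population, fitnesses, rng=random):
    """Roulette-wheel parent selection: draw r in [0, 1) and return the
    individual i whose cumulative probability slot contains r, where
    P_i = F_i / sum(F_j) and C_i = P_1 + ... + P_i."""
    total = sum(fitnesses)
    r = rng.random()
    cumulative = 0.0
    for individual, f in zip(population, fitnesses):
        cumulative += f / total
        if r < cumulative:
            return individual
    return population[-1]  # guard against floating-point round-off
```

Individuals with larger fitness occupy larger slots of the wheel, so they are picked proportionally more often across repeated spins.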
CROSSOVER:
The genetic algorithm uses the crossover operation, which mates two randomly selected parent
chromosomes to generate two new child chromosomes. The objective of the crossover operator is
to produce new chromosomes that potentially offer better features than the parents by acquiring
the best characteristics of each. Single-point, two-point and uniform crossover are the
well-known approaches, applied according to a crossover probability factor defined by the user.
In this paper, to produce optimal feature chromosomes, ITDCS elects the single-point crossover
function and defines the crossover probability as Cp. The ITDCS crossover operator identifies a
single crossover point and then exchanges the portions of the two parent chromosomes lying to
the right of that point to produce two child chromosomes. For chromosomes of length L, a
random integer, called the crossover point, is generated in the range [1, L−1]. An example
before and after crossover is shown in figures 5 and 6 respectively.
Figure 5. Single Point Crossover Operation: Before Crossover
Figure 6. Single Point Crossover Operation: After Crossover
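A minimal sketch of the single-point crossover just described; the `rng` parameter is an assumption added to make the example deterministic and testable:

```python
import random

def single_point_crossover(parent1, parent2, rng=random):
    """Single-point crossover: pick a crossover point in [1, L-1] and
    exchange the tails of the two parent chromosomes, producing two
    children (as in figures 5 and 6)."""
    assert len(parent1) == len(parent2), "parents must be the same length"
    point = rng.randint(1, len(parent1) - 1)
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2
```

Whether a given pair actually mates is decided separately by comparing a random draw against the crossover probability Cp.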
MUTATION:
Mutation aims to introduce genetic diversity into the newly produced population. In this process,
a random alteration takes place in the genetic structure of a chromosome. With these altered gene
values, the genetic algorithm is able to arrive at a better solution than was previously possible.
Mutation thus helps the genetic algorithm prevent the population from stagnating at a local
solution. Here, ITDCS uses bit-flip mutation, where each bit of the chromosome is subject to
mutation with a mutation probability Mp. For a binary representation of chromosomes, a bit
position is mutated by simply flipping its value. In figure 7, a random position is selected in the
chromosome and replaced by the negation of its bit.
Figure 7. Process of bit-by-bit Mutation
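Bit-flip mutation as described can be sketched as below; the helper name is hypothetical, and Mp is passed as the `mp` argument:

```python
import random

def bit_flip_mutation(chromosome, mp, rng=random):
    """Bit-flip mutation: each bit is flipped (0 -> 1, 1 -> 0)
    independently with mutation probability mp (the paper's Mp)."""
    return [1 - bit if rng.random() < mp else bit for bit in chromosome]
```

With Mp = 0 the chromosome is unchanged, and with Mp = 1 every bit is negated; practical values are small so that only occasional genes are altered.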
3.3.5 ITDCS – ALGORITHM
The ITDCS is designed with the theories and techniques of the genetic approach to identify the
optimal features for classifying text documents effectively. In this process, the encoding
technique together with the fitness function tunes the length of the feature chromosome and the
size of the population. The biological evolutionary operators calculate and then adjust the
probabilities of selection, crossover and mutation so that ITDCS acquires a self-learning
capability. The ITDCS algorithm is designed and implemented in the framework of the genetic
approach as follows:
ALGORITHM :: ITDCS
Step 01. Start
Step 02. Apply pre-processing techniques on the target text data documents.
Step 03. Load the pre-processed documents into the database D.
Step 04. Build the classifier by applying the KNN classification algorithm based on
         the identified features of the training data set TD, where S is the set of
         identified features distributed over the total documents.
Step 05. Set Q = ∅, where Q is the output set, which contains all labelled classes
         with optimal classification accuracy.
Step 06. Set the termination condition of the genetic algorithm.
Step 07. Represent each document of a labelled class of S using binary encoding.
Step 08. Select two members from the labelled class.
Step 09. Repeatedly apply the GA selection, crossover and mutation operators on the
         selected members to generate optimal labelled classes.
Step 10. Compute the fitness function value.
Step 11. If the fitness function value meets the target classification accuracy then
Step 12.     Set Q = Q ∪ Q1, where Q1 is the set of accurately labelled classes.
Step 13. If the desired number of generations is not reached, go to Step 04.
Step 14. Stop
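The loop of Steps 06–13 can be condensed into a compact GA sketch. This is a simplified illustration under stated assumptions, not the authors' implementation: truncation selection stands in for roulette-wheel selection to keep it short, and all names and defaults are assumptions; lower fitness is better, matching the error-based FFC above:

```python
import random

def run_ga(population, fitness_fn, crossover_p=0.5, mutation_p=0.01,
           generations=50, rng=None):
    """Minimal GA loop: repeatedly select parents, apply single-point
    crossover and bit-flip mutation, and keep the fittest individuals.
    Simplified sketch: truncation selection replaces roulette wheel."""
    rng = rng or random.Random()
    pop = [list(c) for c in population]
    half = len(pop) // 2
    for _ in range(generations):            # Step 06: termination condition
        scored = sorted(pop, key=fitness_fn)
        next_pop = scored[:half]            # keep the fittest half (elitist)
        while len(next_pop) < len(pop):
            p1, p2 = rng.sample(scored[:half], 2)   # Step 08: pick two parents
            if rng.random() < crossover_p:          # Step 09: crossover with Cp
                point = rng.randint(1, len(p1) - 1)
                p1 = p1[:point] + p2[point:]
            # Step 09 continued: bit-flip mutation with Mp
            child = [1 - b if rng.random() < mutation_p else b for b in p1]
            next_pop.append(child)
        pop = next_pop                      # Step 13: next generation
    return min(pop, key=fitness_fn)         # best chromosome found
```

Because the fittest half survives unchanged each generation, the best fitness found can never get worse as the loop runs.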
4. EXPERIMENTAL ANALYSIS
The authors implemented the ITDCS on a large number of text data documents in a standard
execution environment. The integrated KNN-with-genetic-algorithm methodology takes its input
from the pre-processed structured representation of the vector space model produced by the data
preparation stages. To assess the performance of the proposed ITDCS, the authors conducted
many experiments using standard datasets available on the World Wide Web. The dataset
consists of 52 categories and a total of 6532 documents for training and testing the proposed
work. Among the 52 categories, the authors chose the 36 that are distributed across the total
number of documents.
A series of experiments was conducted, and the results are presented with respect to accuracy
rate, crossover probability and execution time. The results are as follows:
a) Various crossover probabilities (Cp = 0, 0.25, 0.5, 0.75, 1) were applied over a number
of iterations to find the improvements in the accuracy of the proposed genetic-based
classifier. It is clearly evident that ITDCS performs best at Cp = 0.5, as shown in
figure 8.
Figure 8. Mean quality of ITDCS
b) The experimental results show that the intelligent text data classification system reduces
the human intervention needed to classify text data documents. The error rate between the
testing data and training data is almost minimised in ITDCS and is found to be 0.2 on
average. The relationship between testing and training text data was studied and shown to
be continuous and linear, as shown in figure 9.
Figure 9. Error Rate between Training and Testing Data
c) From the results shown in figure 10, the performance of standard KNN and the proposed
ITDCS does not vary significantly for small values of K. However, as K increases, the
genetic approach of ITDCS noticeably improves its performance over the standard KNN
classification technique.
Figure 10. Classification Accuracy with Respect to K Value
d) The standard KNN classification technique and the intelligent ITDCS classifier were
tested in the same standard environment on common test data. Comparing the results, the
trained ITDCS classifier takes less execution time than the conventional KNN classifier,
as presented in figure 11.
Figure 11. Execution Performance of ITDCS
e) The proposed ITDCS was also compared with other existing classification techniques
such as J48, Naïve Bayes (NB) and standard KNN on various data sets. The experimental
results show that the proposed method performs better than the other techniques even
when the dataset is large.
Figure 12. Comparison of Classification Accuracy
5. CONCLUSIONS
The authors proposed an intelligent text data classification system designed around a genetic
feature selection method to address the challenges of text data classification. The proposed
system prepares the data in a vector space representation well suited to the KNN classification
technique. The chosen encoding strategy of the proposed ITDCS represents each feature of a
document as a chromosome and populates the initial population accordingly. The measures taken
in the fitness function optimally evaluate the survival fitness of the generated chromosomes. The
biologically imitative ITDCS operators efficiently find a minimal feature subset and drive it
towards the optimal solution. The integration of the genetic algorithm selects the optimal feature
set dynamically. The experimental results of ITDCS demonstrated the high learning performance
of the classifier and its ability to converge quickly. The proposed system has shown the value of
a genetic algorithm in selecting optimal features for classification. Moreover, the stochastic
process of ITDCS significantly reduces the misclassification rate and improves performance in
terms of accuracy and efficiency in classifying text documents.
Applying optimisation techniques to text data classification and clustering is still at an early
stage and has open challenges, creating a solid base for promising future research. Designing a
genetic algorithm using parallel processing to improve the performance of the genetic iterations
is another future path.
AUTHORS
Dr. V V R Maheswara Rao is working as a Professor in the Department of Computer Science
and Engineering, Shri Vishnu Engineering College for Women (Autonomous), Bhimavaram. He
received his M.Tech. in CSE from JNTU Kakinada and his Ph.D. from Acharya Nagarjuna
University, Andhra Pradesh, India. He has 6 years of IT industry experience and 12 years of
teaching and research experience. He has more than 25 papers to his credit in international and
national journals and conferences. His research interests are Data Mining, Web Mining, Text
Mining, Artificial Intelligence, Machine Learning and Big Data Analytics. He has been
associated with two projects funded by DST.
N Silpa is working as an Assistant Professor in the Department of Computer Science and
Engineering, Shri Vishnu Engineering College for Women (Autonomous), Bhimavaram. She
received her M.Tech. in CSE from JNTU Kakinada, Andhra Pradesh, India. She has 7 years of
teaching and research experience. Her research interests are Data Mining, Text Mining, Machine
Learning and Big Data Analytics. She has been associated with one project funded by DST.
Dr. Gadiraju Mahesh has been working as an Associate Professor in the Department of
Computer Science and Engineering, S.R.K.R. Engineering College, Bhimavaram, India. He
worked at Col. D.S. Raju Polytechnic for 17 years and has a total teaching experience of 27
years. He completed his M.Tech. (C.S.E.) at J.N.T.U. Kakinada and his Ph.D. in Computer
Science and Engineering at Acharya Nagarjuna University in 2016 under the guidance of
Prof. V. Valli Kumari, Andhra University. He has guided many M.Tech. and M.C.A. students in
their thesis work.