In this study, a comprehensive evaluation of two supervised feature selection methods for dimensionality reduction, Latent Semantic Indexing (LSI) and Principal Component Analysis (PCA), is performed. These are gauged against unsupervised techniques such as fuzzy feature clustering using hard fuzzy C-means (FCM). The main objective of the study is to estimate the relative efficiency of the two supervised techniques against the unsupervised fuzzy technique in reducing the feature space. It is found that clustering using FCM leads to better accuracy in classifying documents than LSI and PCA. The results show that clustering of features improves the accuracy of document classification.
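As a rough illustration of the unsupervised side of this comparison, the sketch below implements a minimal fuzzy C-means over word-feature vectors in Python. The random data, the cluster count, and the final projection of documents onto fuzzy word clusters are invented for illustration; this is not the study's actual setup.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Minimal fuzzy C-means: returns cluster centers and membership matrix."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), n_clusters))
    U /= U.sum(axis=1, keepdims=True)          # memberships sum to 1 per point
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U_new = 1.0 / (dist ** (2 / (m - 1)))  # standard FCM membership update
        U_new /= U_new.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:
            break
        U = U_new
    return centers, U

# Toy example: rows are word features (e.g., per-class term statistics);
# grouping 1000 word features into 50 fuzzy clusters shrinks a document
# vector from 1000 to 50 dimensions.
X = np.random.rand(1000, 20)
centers, U = fuzzy_c_means(X, n_clusters=50)
reduced_docs = np.random.rand(5, 1000) @ U     # project documents onto clusters
print(reduced_docs.shape)                      # (5, 50)
```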
Context Driven Technique for Document Classification (IDES Editor)
In this paper we present an innovative hybrid Text Classification (TC) system that bridges the gap between statistical and context-based techniques. Our algorithm harnesses contextual information at two stages. First, it extracts a cohesive set of keywords for each category using lexical references, implicit context derived from LSA, and word-vicinity-driven semantics. Second, each document is represented by a set of context-rich features whose values are derived by considering both lexical cohesion and the extent of coverage of salient concepts via lexical chaining. After keywords are extracted, a subset of the input documents is apportioned as a training set; its members are assigned categories based on their keyword representation. These labeled documents are used to train binary SVM classifiers, one for each category. The remaining documents are supplied to the trained classifiers in the form of their context-enhanced feature vectors, and each document is finally ascribed its appropriate category by an SVM classifier.
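A minimal sketch of the final stage, one binary SVM per category, using scikit-learn's one-vs-rest wrapper. The toy corpus, labels, and plain TF-IDF features are stand-ins for the paper's auto-labelled training subset and context-enhanced feature vectors.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# Stand-in corpus; in the paper the training subset is auto-labelled
# from keyword representations rather than hand-annotated.
train_docs = ["stock markets fell sharply", "the team won the final",
              "new vaccine trial results", "central bank raises rates"]
train_labels = ["finance", "sports", "health", "finance"]

# One binary LinearSVC per category, mirroring the one-classifier-per-category design.
clf = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LinearSVC()))
clf.fit(train_docs, train_labels)
print(clf.predict(["the central bank raises interest rates"]))  # expect ['finance']
```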
Classification-based Retrieval Methods to Enhance Information Discovery on th... (IJMIT JOURNAL)
The widespread adoption of the World-Wide Web (the Web) has created challenges both for society as a whole and for the technology used to build and maintain the Web. The ongoing struggle of information retrieval systems is to wade through this vast pile of data and satisfy users by presenting them with information that most adequately fits their needs. On a societal level, the Web is expanding faster than we can comprehend its implications or develop rules for its use. The ubiquitous use of the Web has raised important social concerns in the areas of privacy, censorship, and access to information. On a technical level, the novelty of the Web and the pace of its growth have created challenges not only in the development of new applications that realize the power of the Web, but also in the technology needed to scale applications to accommodate the resulting large data sets and heavy loads. This thesis presents searching algorithms and hierarchical classification techniques for increasing a search service's understanding of web queries. Existing search services rely solely on a query's occurrence in the document collection to locate relevant documents. They typically do not perform any task- or topic-based analysis of queries using other available resources, and do not leverage changes in user query patterns over time. Provided within are a set of techniques and metrics for performing temporal analysis on query logs. Our log analyses are shown to be reasonable and informative, and can be used to detect changing trends and patterns in the query stream, thus providing valuable data to a search service.
Performance Evaluation of Query Processing Techniques in Information Retrieval (idescitation)
The first element of the search process is the query. Because the user query is, on average, restricted to two or three keywords, it is often ambiguous to the search engine. Given the user query, the goal of an Information Retrieval (IR) system is to retrieve information which might be useful or relevant to the information need of the user. Hence, query processing plays an important role in an IR system. Query processing can be divided into four categories: query expansion, query optimization, query classification, and query parsing. In this paper an attempt is made to evaluate the performance of query processing algorithms in each category. The evaluation was based on the dataset specified by the Forum for Information Retrieval Evaluation [FIRE15]. The criteria used for evaluation are precision and relative recall. The analysis is based on the importance of each step in query processing. The experimental results show the significance of each step in query processing, as well as the relevance of web semantics and spelling correction in the user query.
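As an illustration of the query expansion step, here is a minimal WordNet-based expander in Python. The NLTK WordNet resource (which must be downloaded separately) and the cut-off of three synonyms per term are assumptions for the sketch, not the algorithms evaluated in the paper.

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def expand_query(query, max_syns=3):
    """Append a few WordNet synonyms to each query term."""
    expanded = []
    for term in query.lower().split():
        expanded.append(term)
        syns = {l.name().replace('_', ' ')
                for s in wn.synsets(term) for l in s.lemmas()}
        syns.discard(term)
        expanded.extend(sorted(syns)[:max_syns])
    return ' '.join(expanded)

# e.g. "cheap" picks up synonyms like "brassy", "bum"; "car" picks up "auto", etc.
print(expand_query("cheap car"))
```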
Classifying web users in a personalised search setup is cumbersome due to the dynamic nature of user browsing history. This fluctuating nature of user behaviour and user interest can be well interpreted within a fuzzy setting. Prior to analysing user behaviour, the nature of user interests has to be collected. This work proposes a fuzzy-based user classification model to suit a personalised web search environment. The user browsing data is collected using an established customised browser designed to suit personalisation. The data are fuzzified and fuzzy rules are generated by applying decision trees. Using the fuzzy rules, the search pages are labelled to aid grouping of user search interests. Evaluation shows the proposed approach performs better than a Bayesian classifier.
An effective search on web log from most popular downloaded content (ijdpsjournal)
A Web page recommender system effectively predicts the best related web page to search. While searching for a word, a search engine may display unnecessary links and unrelated data to the user. To avoid this problem, the conceptual prediction model combines both web usage and domain knowledge. The proposed conceptual prediction model automatically generates a semantic network of semantic Web usage knowledge, which is the integration of domain knowledge and web usage information. Web usage mining aims to discover interesting and frequent user access patterns from web browsing data. The discovered knowledge can then be used for many practical web applications such as web recommendations, adaptive web sites, and personalized web search and surfing.
QUERY SENSITIVE COMPARATIVE SUMMARIZATION OF SEARCH RESULTS USING CONCEPT BAS... (cseij)
Query-sensitive summarization aims at providing the users with a summary of the contents of single or multiple web pages based on the search query. This paper proposes a novel idea of generating a comparative summary from a set of URLs in the search result. The user selects a set of web page links from the search result produced by a search engine, and a comparative summary of these selected web sites is generated. This method makes use of the HTML DOM tree structure of these web pages. HTML documents are segmented into sets of concept blocks. The sentence score of each concept block is computed with respect to the query and feature keywords. The important sentences from the concept blocks of different web pages are extracted to compose the comparative summary on the fly. This system reduces the time and effort required for the user to browse various web sites to compare the information, and the comparative summary of the contents helps users in quick decision making.
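A minimal sketch of query-sensitive sentence scoring: each sentence in a concept block is scored by its overlap with the query terms. The plain term-overlap score and the toy block text are simplifications of the paper's scoring over HTML DOM concept blocks.

```python
import re

def sentence_scores(block_text, query):
    """Score each sentence by overlap with query terms (a crude stand-in
    for concept-block sentence scoring)."""
    q_terms = set(query.lower().split())
    sentences = re.split(r'(?<=[.!?])\s+', block_text)
    scored = []
    for s in sentences:
        terms = set(re.findall(r'\w+', s.lower()))
        score = len(terms & q_terms) / (len(q_terms) or 1)
        scored.append((score, s))
    return sorted(scored, reverse=True)

block = "Battery life is ten hours. The screen is bright. Battery charges fast."
for score, sent in sentence_scores(block, "battery life")[:2]:
    print(round(score, 2), sent)   # top sentences enter the comparative summary
```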
INTELLIGENT INFORMATION RETRIEVAL WITHIN DIGITAL LIBRARY USING DOMAIN ONTOLOGY (cscpconf)
A digital library is a type of information retrieval (IR) system. Existing information retrieval methodologies generally have problems with keyword searching. We propose a model to solve this problem by using a concept-based approach (ontology) and a metadata case base. The model consists of identifying domain concepts in the user's query and applying expansion to them. The system aims at contributing to an improved relevance of results retrieved from digital libraries by proposing a conceptual query expansion for intelligent concept-based retrieval. We import the concept of ontology, making use of its advantages of abundant semantics and standard concepts. Domain-specific ontology can be used to improve information retrieval from the traditional keyword-based level to the knowledge (or concept) level, and to change the process of retrieval from traditional keyword matching to semantic matching. One approach is query expansion using domain ontology; the other is introducing a case-based similarity measure for metadata information retrieval using the Case Based Reasoning (CBR) approach. Results show improvements over the classic method, over query expansion using a general-purpose ontology, and over a number of other approaches.
ONTOLOGY-DRIVEN INFORMATION RETRIEVAL FOR HEALTHCARE INFORMATION SYSTEM : ... (IJNSA Journal)
In health research, one of the major tasks is to retrieve and analyze heterogeneous databases containing a single patient's information gathered from a large volume of data over a long period of time. The main objective of this paper is to present our ontology-based information retrieval approach for a clinical information system. We performed a case study in real-life hospital settings. The results obtained illustrate the feasibility of the proposed approach, which significantly improved the information retrieval process on a large volume of data over a long period, from August 2011 until January 2012.
Vertical intent prediction approach based on Doc2vec and convolutional neural... (IJECEIAES)
Vertical selection is the task of selecting the most relevant verticals for a given query in order to improve the diversity and quality of web search results. This task requires not only predicting relevant verticals but also ensuring these verticals are the ones the user expects to be relevant for his particular information need. Most existing works have focused on using traditional machine learning techniques to combine multiple types of features for selecting several relevant verticals. Although these techniques are very efficient, handling vertical selection with high accuracy is still a challenging research task. In this paper, we propose an approach for improving vertical selection in order to satisfy the user's vertical intent and reduce browsing time and effort. First, it generates query embedding vectors using the doc2vec algorithm, which preserves syntactic and semantic information within each query. Second, this vector is used as input to a convolutional neural network model to enrich the representation of the query with multiple levels of abstraction, including rich semantic information, and to create a global summarization of the query features. We demonstrate the effectiveness of our approach through comprehensive experimentation on various datasets. Our experimental findings show that our system achieves significant accuracy and makes accurate predictions on new, unseen data.
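A minimal sketch of the first stage, generating query embeddings with gensim's Doc2Vec. The toy query log, vector size, and training epochs are illustrative choices, and the downstream CNN stage is omitted.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy query log; real training would use a large query corpus.
queries = ["cheap flights to paris", "python list comprehension",
           "weather berlin tomorrow", "java stream map example"]
corpus = [TaggedDocument(q.split(), [i]) for i, q in enumerate(queries)]

model = Doc2Vec(corpus, vector_size=32, window=2, min_count=1, epochs=50)

# The inferred vector is what would feed the CNN vertical classifier.
vec = model.infer_vector("golang channel example".split())
print(vec.shape)  # (32,)
```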
Content-based and collaborative filtering methods are the most successful solutions in recommender systems. The content-based method is based on an item's attributes: it examines the features of a user's favourite items and then proposes the items which have the most similar characteristics. The collaborative filtering method is based on the determination of similar items or similar users, called item-based and user-based collaborative filtering, respectively. In this paper we propose a hybrid method that integrates collaborative filtering and content-based methods. The proposed method can be viewed as a user-based collaborative filtering technique; however, to find users with tastes similar to the active user's, we use content features of the item under investigation to put more emphasis on users' ratings for similar items. In other words, two users are similar if their ratings are similar on items that have similar content. This is achieved by assigning a weight to each rating when calculating the similarity of two users. We used the MovieLens data set to assess the performance of the proposed method in comparison with basic user-based collaborative filtering and other popular methods.
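A minimal sketch of the core idea: weighting each co-rated item by its content similarity to the target item when computing user-user similarity. The cosine form and the example weights are assumptions rather than the paper's exact formula.

```python
import numpy as np

def weighted_user_similarity(ratings_u, ratings_v, item_sim_to_target):
    """Cosine similarity between two users' rating vectors, with each
    co-rated item weighted by its content similarity to the target item."""
    w = item_sim_to_target
    num = np.sum(w * ratings_u * ratings_v)
    den = np.sqrt(np.sum(w * ratings_u**2)) * np.sqrt(np.sum(w * ratings_v**2))
    return num / den if den else 0.0

# Ratings of two users on 4 co-rated items; weights favour items whose
# content overlaps the item being predicted.
u = np.array([5, 3, 4, 1]); v = np.array([4, 3, 5, 2])
weights = np.array([0.9, 0.2, 0.8, 0.1])
print(round(weighted_user_similarity(u, v, weights), 3))
```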
An effective pre-processing algorithm for information retrieval systems (ijdms)
The Internet is probably the most successful distributed computing system ever. However, our capabilities for data querying and manipulation on the Internet are primitive at best. User expectations have grown over time, along with the amount of operational data accumulated over the past few decades; the data user expects deeper, more exact, and more detailed results. Result retrieval for a user query is always relative to the pattern of data storage and indexing. In information retrieval systems, tokenization is an integral part whose prime objective is to identify the tokens and their counts. In this paper, we propose an effective tokenization approach based on a training vector, and the results show the efficiency and effectiveness of the proposed algorithm. Tokenization of documents helps satisfy the user's information need more precisely and sharply reduces the search space. Pre-processing of the input documents is an integral part of tokenization: it generates their respective tokens, on the basis of which probabilistic IR generates its scoring and yields a reduced search space. The comparative analysis is based on two parameters: the number of tokens generated and the pre-processing time.
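A minimal tokenization sketch along these lines: lowercase, strip punctuation, drop stopwords, and count the tokens. The regex and the tiny stopword list are placeholders, not the proposed training-vector-based approach itself.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "to", "is", "an"}

def tokenize(text):
    """Lowercase, strip punctuation, drop stopwords, count tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    tokens = [t for t in tokens if t not in STOPWORDS]
    return Counter(tokens)

doc = "Tokenization of the input document is an integral part of retrieval."
counts = tokenize(doc)
print(counts.most_common(3), "| distinct tokens:", len(counts))
```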
UML MODELING AND SYSTEM ARCHITECTURE FOR AGENT BASED INFORMATION RETRIEVAL (ijcsit)
In the current technological era, there is an enormous increase in the information available on the web and in online databases. This information abundance increases the complexity of finding relevant information. To address this challenge, there is a need for improved and intelligent systems for efficient search and retrieval. Intelligent agents can be used for better search and information retrieval in a document collection, since the information required by a user is scattered across a large number of databases. In this paper, the object-oriented modeling for an agent-based information retrieval system is presented. The paper also discusses the framework of the agent architecture for obtaining the best combination of terms that serve as an input query to the information retrieval system. The communication and cooperation among the agents are also explained; each agent has a task to perform in information retrieval.
Information retrieval system and PageRank algorithm (Rupali Bhatnagar)
We discuss the various models for information retrieval systems present in the literature and describe them mathematically. We also study the PageRank algorithm, which is used for relevance-based search.
FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION (cscpconf)
Feature clustering is a powerful method to reduce the dimensionality of feature vectors for text classification. In this paper, fast fuzzy feature clustering for text classification is proposed. It is based on the framework proposed by Jung-Yi Jiang, Ren-Jia Liou and Shie-Jue Lee in 2011. The words in the feature vector of the document are grouped into clusters in fewer iterations: the number of iterations required to obtain cluster centers is reduced by transforming the cluster-center dimensionality from n dimensions to 2 dimensions. Principal Component Analysis, with a slight change, is used for the dimension reduction. Experimental results show that this method improves performance by significantly reducing the number of iterations required to obtain the cluster centers, and the same is verified on three benchmark datasets.
A Novel Multi-Viewpoint based Similarity Measure for Document Clustering (IJMER)
Machine learning for text document classification-efficient classification ap... (IAESIJAI)
Numerous alternative methods for text classification have been created because of the increase in the amount of online text information available. The cosine similarity classifier is the most extensively utilized simple and efficient approach, and it improves text classification performance when combined with the estimated values provided by conventional classifiers such as Multinomial Naive Bayes (MNB). Combining the similarity between a test document and a category with the estimated value for the category thus enhances the performance of the classifier. This approach provides a text document categorization method that is both efficient and effective. In addition, methods for determining the proper relationship between a set of words in a document and its document category are also presented.
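A minimal sketch of the combination: averaging MNB class probabilities with cosine similarity to per-class centroids. The equal 0.5 weights and the centroid construction are assumptions, not the paper's exact scheme.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics.pairwise import cosine_similarity

docs = ["stocks rise on earnings", "team wins championship game",
        "market falls on rate fears", "player scores winning goal"]
labels = np.array([0, 1, 0, 1])

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
mnb = MultinomialNB().fit(X, labels)

# Per-class centroids provide the cosine-similarity term.
centroids = np.asarray(np.vstack([X[labels == c].mean(axis=0) for c in (0, 1)]))

x_test = vec.transform(["goal in the final game"])
score = 0.5 * mnb.predict_proba(x_test) + 0.5 * cosine_similarity(x_test, centroids)
print(score.argmax(axis=1))  # expect [1] (the sports-like class)
```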
Feature selection, optimization and clustering strategies of text documents (IJECEIAES)
Clustering is one of the most researched areas of data mining applications in the contemporary literature. The need for efficient clustering is observed across wide sectors including consumer segmentation, categorization, shared filtering, document management, and indexing. Research on the clustering task must be performed prior to its adaptation to the text environment. Conventional approaches typically emphasized quantitative information, where the selected features are numeric; efforts have also been put forward for achieving efficient clustering in the context of categorical information, where the selected features can assume nominal values. This manuscript presents an in-depth analysis of the challenges of clustering in the text environment. Further, this paper also details prominent models proposed for clustering, along with the pros and cons of each model. In addition, it focuses on various recent developments in the clustering task in social network and associated environments.
Filter Based Approach for Genomic Feature Set Selection (FBA-GFS) (IJCSEA Journal)
Feature selection is an effective method used in text categorization for sorting a set of documents into a certain number of predefined categories. It is an important method for improving the efficiency and accuracy of text categorization algorithms by removing redundant terms from the corpus. A genome contains the total amount of genetic information in the chromosomes of an organism, including its genes and DNA sequences. In this paper, a hierarchical clustering technique is used to categorize the features from the genome documents, and a framework is proposed for genomic feature set selection. Filter-based feature selection methods, namely the χ² statistic and the CHIR statistic, are used to select the feature set. The selected feature set is verified using the F-measure and validated for biological relevance using the BLAST tool.
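A minimal sketch of filter-based χ² feature selection with scikit-learn; the toy corpus and k=4 are illustrative, and the CHIR variant is not implemented here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["gene expression in yeast", "protein folding pathways",
        "stock market crash", "interest rate hike"]
labels = [0, 0, 1, 1]  # 0 = biology, 1 = finance

vec = CountVectorizer()
X = vec.fit_transform(docs)

# Keep the 4 terms with the highest chi-square scores w.r.t. the labels.
selector = SelectKBest(chi2, k=4).fit(X, labels)
print(vec.get_feature_names_out()[selector.get_support()])
```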
Construction of Keyword Extraction using Statistical Approaches and Document ... (IJERA Editor)
Organizing the continuing growth of dynamic unstructured documents is a major challenge for field experts, and handling such unorganized documents is expensive. Clustering of such dynamic documents helps to reduce the cost. Document clustering by analysing the keywords of the documents is one of the best methods to organize unstructured dynamic documents, and statistical analysis is a well-suited adaptive method to extract the keywords from the documents. In this paper an algorithm is proposed to cluster the documents. It has two parts: the first part extracts the keywords using a statistical method, and the second part constructs the clusters over the keywords using an agglomerative method. The proposed algorithm achieves more than 90% accuracy.
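A minimal sketch of the two parts, TF-IDF keyword extraction followed by agglomerative clustering, using scikit-learn; the toy documents and the choice of two clusters are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

docs = ["neural networks for image recognition",
        "deep learning image classification",
        "cricket world cup final",
        "football league match report"]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)

# Part 1: top-scoring terms per document serve as its keywords.
terms = vec.get_feature_names_out()
for row in X.toarray():
    print([terms[i] for i in row.argsort()[-2:][::-1]])

# Part 2: agglomerative clustering over the keyword (tf-idf) vectors.
labels = AgglomerativeClustering(n_clusters=2).fit_predict(X.toarray())
print(labels)  # e.g. [0 0 1 1]
```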
Similar to Scaling Down Dimensions and Feature Extraction in Document Repository Classification
A review on data mining techniques for Digital Mammographic Analysis (ijdmtaiir)
Medical data mining is the search for relationships and patterns within medical data that could provide useful knowledge for effective medical diagnosis. Using computational applications, the prediction of disease becomes more effective, and early detection of disease aids in increased exposure to required patient care and improved cure rates. The review shows that the reasons for feature selection include improvement in prediction performance, reduction in computational requirements, reduction in data storage requirements, reduction in the cost of future measurements, and improvement in data or model understanding.
Comparison on PCA ICA and LDA in Face Recognition (ijdmtaiir)
Face recognition is used in a wide range of applications. In recent years, face recognition has become one of the most successful applications of image analysis and understanding. Different statistical methods and research groups have reported contradictory results when comparing the principal component analysis (PCA), independent component analysis (ICA), and linear discriminant analysis (LDA) algorithms proposed in recent years. The goal of this paper is to compare and analyze the three algorithms and conclude which is best. The FERET dataset is used for consistency.
A Novel Approach to Mathematical Concepts in Data Mining (ijdmtaiir)
This paper describes three fundamental mathematical programming approaches that are relevant to data mining: feature selection, clustering, and robust representation. The paper covers two clustering algorithms, the k-means algorithm and the k-median algorithm. Clustering is illustrated by the unsupervised learning of patterns and clusters that may exist in a given database and is a useful tool for Knowledge Discovery in Databases (KDD). The results of the k-median algorithm are used to identify blood cancer patients in a medical database. K-means clustering is a data mining/machine learning algorithm used to cluster observations into groups of related observations without any prior knowledge of those relationships. The k-means algorithm is one of the simplest clustering techniques and is commonly used in medical imaging, biometrics, and related fields.
Analysis of Classification Algorithm in Data Mining (ijdmtaiir)
Data mining is the extraction of hidden predictive information from large databases. Classification is the process of finding a model that describes and distinguishes data classes or concepts. This paper studies the prediction of class labels using the C4.5 and Naïve Bayesian algorithms. C4.5 generates classifiers expressed as decision trees from a fixed set of examples, and the resulting tree is used to classify future samples. The leaf nodes of the decision tree contain the class name, whereas a non-leaf node is a decision node: an attribute test with each branch (to another decision tree) being a possible value of the attribute. C4.5 uses information gain to help it decide which attribute goes into a decision node. A Naïve Bayesian classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions: it assumes that the effect of an attribute value on a given class is independent of the values of the other attributes, an assumption called class conditional independence. The results indicate that predicting the class label using the Naïve Bayesian classifier is very effective and simple compared to the C4.5 classifier.
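A minimal sketch contrasting the two classifiers with scikit-learn: a decision tree with entropy (information-gain) splits as a stand-in for C4.5, and Gaussian Naive Bayes; the Iris data is used only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# criterion="entropy" gives information-gain splits, as in C4.5.
tree = DecisionTreeClassifier(criterion="entropy").fit(X_tr, y_tr)
nb = GaussianNB().fit(X_tr, y_tr)  # class-conditional independence assumption

print("tree:", tree.score(X_te, y_te), "naive bayes:", nb.score(X_te, y_te))
```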
Performance Analysis of Selected Classifiers in User Profiling (ijdmtaiir)
User profiles can serve as indicators of personal preferences which can be effectively used while providing personalized services. Building user profiles which capture accurate information about individuals has been a daunting task, and several attempts have been made by researchers to extract information from different data sources to build user profiles in different application domains. Towards this end, in this paper we employ different classification algorithms to create accurate user profiles based on information gathered from demographic data. The aim of this work is to analyze the performance of five of the most effective classification methods, namely Bayesian Network (BN), Naïve Bayes (NB), Naïve Bayes Updateable (NBU), J48, and Decision Table (DT). Our simulation results show that, in general, J48 has the highest classification accuracy with the lowest error rate. On the other hand, the Naïve Bayes and Naïve Bayes Updateable classifiers require the least time to build the classification model.
Analysis of Sales and Distribution of an IT Industry Using Data Mining Techni... (ijdmtaiir)
The goal of this work is to allow a corporation to improve its marketing, sales, and customer support operations through a better understanding of its customers. Keep in mind, however, that the data mining techniques and tools described here are equally applicable in fields ranging from law enforcement to radio astronomy, medicine, and industrial process control. Businesses in today's environment increasingly focus on gaining competitive advantages. Organizations have recognized that the effective use of data is the key element for the next generation: predicting sales values and emerging trends in the technology market. Data is becoming an important resource for companies to analyze existing sales values against current technology trends, which helps them identify future sales values. There are a variety of data analysis and modeling techniques to discover patterns and relationships in data that are used to understand what your customers want and predict what they will do. The main focus of this work is to help companies select the right prospects on whom to focus, offer the right additional products to existing customers, and identify good customers who may be about to leave. This results in improved revenue because of a greatly improved ability to respond to each individual contact in the best way, and reduced costs due to properly allocated resources.
Keywords: sales, customer, technology, profit.
Analysis of Influences of memory on Cognitive load Using Neural Network Back ... (ijdmtaiir)
Educational mining is used to evaluate the learner's performance and the learning environment. The learning process involves, and is influenced by, different components, and memory plays a vital role in it. Long-term, short-term, working, instant, responsive, process, recollect, reference, instruction and action memory are all involved in the process of learning. The factors influencing these memories are identified through the construction and analysis of a neural network back-propagation algorithm. The observed data are represented in a cubical dataset format for the mining approach. The mining process is carried out using a neural-network-based back-propagation model to decide the influencing cognitive load for the different learning challenges. The learners' difficulties are identified through the experimental results.
An Analysis of Data Mining Applications for Fraud Detection in Securities Market (ijdmtaiir)
Securities fraud broadly refers to deceptive practices in connection with the offering and sale of securities. There are many challenges involved in developing data mining applications for fraud detection in the securities market, including massive datasets, accuracy, privacy, performance measures, and complexity. The impacts on the market and the training of regulators are other issues that need to be addressed. In this paper we present the results of a comprehensive systematic literature review on data mining techniques for detecting fraudulent activities and market manipulation in the securities market. We identify best practices based on data mining methods for detecting known fraudulent patterns and discovering new predatory strategies. Furthermore, we highlight the challenges faced in the development and implementation of data mining systems for detecting market manipulation in the securities market, and we provide recommendations for future research.
An Ill-identified Classification to Predict Cardiac Disease Using Data Cluste... (ijdmtaiir)
The health care industry contains a large amount of health care data with hidden information that is useful for making effective decisions. Computer-based data mining techniques are used to obtain appropriate results from this hidden information. Previously, neural networks (NN) were widely used for predicting cardiac disease. In this paper, a Cardiac Disease Prediction System (CDPS) is developed using data clustering. The CDPS system uses 15 parameters to predict the disease, for example blood pressure, obesity, and cholesterol; these 15 attributes, such as sex, age, and weight, are given as the input. Using the patient's medical record, an ill-defined classification is applied at an early stage to diagnose cardiac disease. Based on the result, the patients are advised to use the sensor for prediction.
Music Promotes Gross National Happiness Using Neutrosophic Fuzzy Cognitive Map... (ijdmtaiir)
This paper investigates how to promote gross national happiness through music using a fuzzy logic model. Music influences the rate of learning and has been the subject of study for many years. Researchers have confirmed that loud background noise impedes learning, concentration, and information acquisition. An interesting phenomenon occurs frequently when students listen to new music: it creates a sense of anxiety even without a proper understanding of the music. Happiness is the emotion that expresses various degrees of positive and negative feelings, ranging from satisfaction to extreme joy, and it is the goal most people strive to achieve: happy people are satisfied with their lives. The goal of this work is to find the particular component of music which will ultimately promote the happiness of people, given the indeterminacy in the components of music.
A Study on Youth Violence and Aggression using DEMATEL with FCM Methods (ijdmtaiir)
The DEMATEL method is a good technique for making decisions. In this paper we analyze the risk factors of youth violence and what makes youth more aggressive. Since there are many risk factors of youth violence, and relating them to each other is complex, we construct an FCM to analyze them. Moreover, the data are unsupervised, obtained from surveys as well as interviews; hence fuzzy methods alone have the capacity to analyse these concepts.
Certain Investigation on Dynamic Clustering in Dynamic Datamining (ijdmtaiir)
Clustering is the process of grouping a set of objects into classes of similar objects. Dynamic clustering is a new research area concerned with datasets that have dynamic aspects: it requires updates of the clusters whenever new data records are added to the dataset, and may result in a change of clustering over time. When there are continuous updates and huge amounts of dynamic data, rescanning the database is not possible in static data mining, but it is possible in the dynamic data mining process. Dynamic data mining occurs when the derived information is present for the purpose of analysis and the environment is dynamic, i.e. many updates occur. Now that this has been established by most researchers, work is moving towards solving some of these problems, and research is concentrating on mining dynamic databases. This paper investigates existing work related to dynamic clustering and incremental data clustering.
Analyzing the Role of a Family in Constructing Gender Roles Using Combined Ov... (ijdmtaiir)
The family, as a social institution and as the fundamental unit of society, plays a vital role in forming persons. We become full-fledged members of society through the process of socialization, which starts in the family; it provides the first foundational formation of personhood. Especially in traditional societies like India, the role of the family assumes even greater importance. We learn to differentiate and to discriminate between man and woman from the roles our parents play. In this paper we analyze the role of the family in constructing gender roles using Combined Overlap Block Fuzzy Cognitive Maps.
An Interval Based Fuzzy Multiple Expert System to Analyze the Impacts of Clim... (ijdmtaiir)
Indian agriculture is completely dependent on the environment, and any undesirable change in the environment has an adverse impact on agriculture. Climate change and pollution in India have caused great damage to the environment. In this paper we analyse the impact of climate change on Indian agriculture. The first section gives an introduction to the problem. In section two we introduce a new fuzzy tool called the interval-based fuzzy multiple expert system. Section three adapts the new fuzzy tool to analyse the problem of the agricultural impacts of climate change, and in section four we give the results and suggestions based on our analysis.
An Approach for the Detection of Vascular Abnormalities in Diabetic Retinopathy (ijdmtaiir)
Diabetic retinopathy is a common complication of diabetes that is caused by changes in the blood vessels of the retina: the blood vessels get altered, exudates are secreted, and micro-aneurysms and hemorrhages occur in the retina. The appearance of these features represents the degree of severity of the disease. The proposed approach detects the presence of abnormalities in the retina by applying morphological image processing techniques to fundus images to extract features such as blood vessels, micro-aneurysms, and exudates; these features are used to determine the severity of diabetic retinopathy. It can quickly process the large number of fundus images obtained from mass screening, helping to reduce cost and increase productivity and efficiency for ophthalmologists.
Improve the Performance of Clustering Using Combination of Multiple Clusterin... (ijdmtaiir)
The ever-increasing availability of textual documents has led to a growing challenge for information systems to effectively manage and retrieve the information contained in large collections of texts according to the user's information needs. There is no clustering method that can adequately handle all sorts of cluster structures and properties (e.g., shape, size, overlap, and density). Combining multiple clustering methods is an approach to overcome the deficiencies of single algorithms and further enhance their performance. A disadvantage of the cluster ensemble is the high computational load of combining the clustering results, especially for large and high-dimensional datasets. In this paper we propose a multi-clustering algorithm: a combination of a cooperative hard-fuzzy clustering model, based on intermediate cooperation between hard k-means (KM) and fuzzy c-means (FCM) to produce better intermediate clusters, and an ant colony algorithm. The proposed method gives better results than the individual clustering algorithms.
The Study of Symptoms of Tuberculosis Using Induced Fuzzy Cognitive Maps (IF... (ijdmtaiir)
Tuberculosis (TB) is a common infectious disease caused by various strains of mycobacteria, usually Mycobacterium tuberculosis. TB attacks the lungs but can also affect other parts of the body. Most infections are asymptomatic and latent, but about one in ten latent infections eventually progresses to active disease which, if left untreated, kills more than 50% of those infected. This paper analyses the symptoms of tuberculosis using Induced Fuzzy Cognitive Maps (IFCMs). IFCMs are a fuzzy-graph modeling approach based on experts' opinion; it is a non-statistical approach to studying problems with imprecise information.
A Study on Finding the Key Motive of Happiness Using Fuzzy Cognitive Maps (FCMs) (ijdmtaiir)
Happiness is subjective: it is difficult to compare one person's happiness with another's, and it can be especially difficult to compare happiness across cultures. The function of man is to live a certain kind of life, and this activity implies a rational principle; the function of a good man is good and noble performance, and if any action is well performed, it is performed in accord with the appropriate excellence. If this is the case, then happiness turns out to be an activity of the soul in accordance with virtue, that is, happiness as the exercise of virtue. Every human being thinks of happiness in his own perspective; everyone wants to be happy and searches for happiness in all their activities. Happiness is typically measured using subjective measures. Because happiness cannot be defined in terms of rigid boundaries and is a vague term, it is appropriate to use fuzzy logic. In particular, we use Fuzzy Cognitive Mapping to find the key motive of happiness. This paper consists of four sections: the first section is an introduction to happiness; section two introduces the concept of Fuzzy Cognitive Maps (FCMs); in section three Fuzzy Cognitive Mapping is applied to the concept of happiness; and section four gives the conclusions and suggestions.
Study of sustainable development using Fuzzy Cognitive Relational Maps (FCM)ijdmtaiir
Sustainable development provides a framework under
which communities can use resources efficiently, create efficient
infrastructures, protect and enhance quality of life, and create new
businesses to strengthen their economies. It can help us create
healthy communities that can sustain our generation, as well as
those that follow ours. Sustainable development is not a new
concept. Rather, it is the latest expression of a long-standing ethic
involving peoples' relationships with the environment and the
current generation's responsibilities to future generations. For a
community to be truly sustainable, it must adopt a three-pronged
approach that considers economic, environmental and cultural
resources. Communities must consider these needs in the short
term as well as the long term. Sustainable Development also can
be defined simply as a better quality of life for everyone, now and
for generations to come. It is a vision of progress that links
economic development, protection of the environment and social
justice, and its values are recognised by democratic governments
and political movements the world over. Sustainable
Development is therefore closely linked to Governance, Better
Regulation and Impact Assessment. Indicators to measure
progress are also vital.This paper has four sections. In the first
section we introduce the notion of fuzzy cognitive maps and
Combined Fuzzy Cognitive Maps (CFCMs). In section two we
describe the problem and justification for the use of FCMs. In
section three we give the adaptation of FCM to the problem. In
the final section we give conclusions based on our analysis of the
problem using FCM
A Study of Personality Influence in Building Work Life Balance Using Fuzzy Re...ijdmtaiir
Personality plays an important role in work life
balance irrespective of the organizational setups and other
factors. It has become a subject of concern in terms of
technological, market and organizational changes associated
with an individual’s personality. Here in this study an attempt is
made to study about the holistic picture of personality influence
in work-life balance on the basis of experts’ opinion. The
influence of personality is studied from the big five factors of
personality traits. The data were analyzed using Fuzzy
Relational mapping (FRM) model and conclusions arrived for
which personality has more influence in building work life
balance and which one is more vulnerable for work life
imbalance
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Dr.Costas Sachpazis
Terzaghi's soil bearing capacity theory, developed by Karl Terzaghi, is a fundamental principle in geotechnical engineering used to determine the bearing capacity of shallow foundations. This theory provides a method to calculate the ultimate bearing capacity of soil, which is the maximum load per unit area that the soil can support without undergoing shear failure. The Calculation HTML Code included.
Explore the innovative world of trenchless pipe repair with our comprehensive guide, "The Benefits and Techniques of Trenchless Pipe Repair." This document delves into the modern methods of repairing underground pipes without the need for extensive excavation, highlighting the numerous advantages and the latest techniques used in the industry.
Learn about the cost savings, reduced environmental impact, and minimal disruption associated with trenchless technology. Discover detailed explanations of popular techniques such as pipe bursting, cured-in-place pipe (CIPP) lining, and directional drilling. Understand how these methods can be applied to various types of infrastructure, from residential plumbing to large-scale municipal systems.
Ideal for homeowners, contractors, engineers, and anyone interested in modern plumbing solutions, this guide provides valuable insights into why trenchless pipe repair is becoming the preferred choice for pipe rehabilitation. Stay informed about the latest advancements and best practices in the field.
About
Indigenized remote control interface card suitable for MAFI system CCR equipment. Compatible for IDM8000 CCR. Backplane mounted serial and TCP/Ethernet communication module for CCR remote access. IDM 8000 CCR remote control on serial and TCP protocol.
• Remote control: Parallel or serial interface.
• Compatible with MAFI CCR system.
• Compatible with IDM8000 CCR.
• Compatible with Backplane mount serial communication.
• Compatible with commercial and Defence aviation CCR system.
• Remote control system for accessing CCR and allied system over serial or TCP.
• Indigenized local Support/presence in India.
• Easy in configuration using DIP switches.
Technical Specifications
Indigenized remote control interface card suitable for MAFI system CCR equipment. Compatible for IDM8000 CCR. Backplane mounted serial and TCP/Ethernet communication module for CCR remote access. IDM 8000 CCR remote control on serial and TCP protocol.
Key Features
Indigenized remote control interface card suitable for MAFI system CCR equipment. Compatible for IDM8000 CCR. Backplane mounted serial and TCP/Ethernet communication module for CCR remote access. IDM 8000 CCR remote control on serial and TCP protocol.
• Remote control: Parallel or serial interface
• Compatible with MAFI CCR system
• Copatiable with IDM8000 CCR
• Compatible with Backplane mount serial communication.
• Compatible with commercial and Defence aviation CCR system.
• Remote control system for accessing CCR and allied system over serial or TCP.
• Indigenized local Support/presence in India.
Application
• Remote control: Parallel or serial interface.
• Compatible with MAFI CCR system.
• Compatible with IDM8000 CCR.
• Compatible with Backplane mount serial communication.
• Compatible with commercial and Defence aviation CCR system.
• Remote control system for accessing CCR and allied system over serial or TCP.
• Indigenized local Support/presence in India.
• Easy in configuration using DIP switches.
Immunizing Image Classifiers Against Localized Adversary Attacksgerogepatton
This paper addresses the vulnerability of deep learning models, particularly convolutional neural networks
(CNN)s, to adversarial attacks and presents a proactive training technique designed to counter them. We
introduce a novel volumization algorithm, which transforms 2D images into 3D volumetric representations.
When combined with 3D convolution and deep curriculum learning optimization (CLO), itsignificantly improves
the immunity of models against localized universal attacks by up to 40%. We evaluate our proposed approach
using contemporary CNN architectures and the modified Canadian Institute for Advanced Research (CIFAR-10
and CIFAR-100) and ImageNet Large Scale Visual Recognition Challenge (ILSVRC12) datasets, showcasing
accuracy improvements over previous techniques. The results indicate that the combination of the volumetric
input and curriculum learning holds significant promise for mitigating adversarial attacks without necessitating
adversary training.
Welcome to WIPAC Monthly the magazine brought to you by the LinkedIn Group Water Industry Process Automation & Control.
In this month's edition, along with this month's industry news to celebrate the 13 years since the group was created we have articles including
A case study of the used of Advanced Process Control at the Wastewater Treatment works at Lleida in Spain
A look back on an article on smart wastewater networks in order to see how the industry has measured up in the interim around the adoption of Digital Transformation in the Water Industry.
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxR&R Consult
CFD analysis is incredibly effective at solving mysteries and improving the performance of complex systems!
Here's a great example: At a large natural gas-fired power plant, where they use waste heat to generate steam and energy, they were puzzled that their boiler wasn't producing as much steam as expected.
R&R and Tetra Engineering Group Inc. were asked to solve the issue with reduced steam production.
An inspection had shown that a significant amount of hot flue gas was bypassing the boiler tubes, where the heat was supposed to be transferred.
R&R Consult conducted a CFD analysis, which revealed that 6.3% of the flue gas was bypassing the boiler tubes without transferring heat. The analysis also showed that the flue gas was instead being directed along the sides of the boiler and between the modules that were supposed to capture the heat. This was the cause of the reduced performance.
Based on our results, Tetra Engineering installed covering plates to reduce the bypass flow. This improved the boiler's performance and increased electricity production.
It is always satisfying when we can help solve complex challenges like this. Do your systems also need a check-up or optimization? Give us a call!
Work done in cooperation with James Malloy and David Moelling from Tetra Engineering.
More examples of our work https://www.r-r-consult.dk/en/cases-en/
Hierarchical Digital Twin of a Naval Power SystemKerry Sado
A hierarchical digital twin of a Naval DC power system has been developed and experimentally verified. Similar to other state-of-the-art digital twins, this technology creates a digital replica of the physical system executed in real-time or faster, which can modify hardware controls. However, its advantage stems from distributing computational efforts by utilizing a hierarchical structure composed of lower-level digital twin blocks and a higher-level system digital twin. Each digital twin block is associated with a physical subsystem of the hardware and communicates with a singular system digital twin, which creates a system-level response. By extracting information from each level of the hierarchy, power system controls of the hardware were reconfigured autonomously. This hierarchical digital twin development offers several advantages over other digital twins, particularly in the field of naval power systems. The hierarchical structure allows for greater computational efficiency and scalability while the ability to autonomously reconfigure hardware controls offers increased flexibility and responsiveness. The hierarchical decomposition and models utilized were well aligned with the physical twin, as indicated by the maximum deviations between the developed digital twin hierarchy and the hardware.
Final project report on grocery store management system..pdfKamal Acharya
In today’s fast-changing business environment, it’s extremely important to be able to respond to client needs in the most effective and timely manner. If your customers wish to see your business online and have instant access to your products or services.
Online Grocery Store is an e-commerce website, which retails various grocery products. This project allows viewing various products available enables registered users to purchase desired products instantly using Paytm, UPI payment processor (Instant Pay) and also can place order by using Cash on Delivery (Pay Later) option. This project provides an easy access to Administrators and Managers to view orders placed using Pay Later and Instant Pay options.
In order to develop an e-commerce website, a number of Technologies must be studied and understood. These include multi-tiered architecture, server and client-side scripting techniques, implementation technologies, programming language (such as PHP, HTML, CSS, JavaScript) and MySQL relational databases. This is a project with the objective to develop a basic website where a consumer is provided with a shopping cart website and also to know about the technologies used to develop such a website.
This document will discuss each of the underlying technologies to create and implement an e- commerce website.
Final project report on grocery store management system..pdf
Scaling Down Dimensions and Feature Extraction in Document Repository Classification
Integrated Intelligent Research (IIR) International Journal of Data Mining Techniques and Applications
Volume: 03, Issue: 01, June 2014, Page No. 1-4
ISSN: 2278-2419
Scaling Down Dimensions and Feature Extraction
in Document Repository Classification
Asha Kurian¹, M.S. Josephine², V. Jeyabalaraja³
¹Research Scholar, Department of Computer Applications, Dr. M.G.R. Educational and Research Institute University, Chennai
²Professor, Department of Computer Applications, Dr. M.G.R. Educational and Research Institute University, Chennai
³Professor, Department of Computer Science Engineering, Velammal Engineering College, Chennai
E-mail: ashk47@yahoo.com, josejbr@yahoo.com, jeyabalaraja@gmail.com
Abstract - In this study a comprehensive evaluation of two supervised feature selection methods for dimensionality reduction, Latent Semantic Indexing (LSI) and Principal Component Analysis (PCA), is performed. These are gauged against unsupervised techniques such as fuzzy feature clustering using hard fuzzy C-means (FCM). The main objective of the study is to estimate the relative efficiency of the two supervised techniques against unsupervised fuzzy techniques in reducing the feature space. It is found that clustering using FCM leads to better accuracy in classifying documents than established techniques like LSI and PCA. Results show that the clustering of features improves the accuracy of document classification.
Keywords: Feature extraction, text classification,
categorization, dimensionality reduction
I. INTRODUCTION
Text categorization or text classification attempts to sort documents in a repository into different class labels. A classifier learns from a training set of documents that are already classified and labeled; a general model is devised that correctly labels further incoming documents. A repository typically consists of thousands of documents, and retrieving a selection becomes a laborious task unless the documents are indexed or categorized in some particular order. Document categorization is modeled along the lines of Information Retrieval [1] and Natural Language Processing [5], where a user query elicits the documents of maximal significance in relation to the query. The sorting is done by grouping the document terms and phrases and identifying some association or correlation between them. Establishing relations among words is complicated by polysemy and the presence of synonyms. Every document contains thousands of unique terms, resulting in a high-dimensional feature space.
To reduce information retrieval time, the dimensionality of the document collection can be reduced by selecting only those terms which best describe the document. Dimensionality reduction techniques try to capture the contextual meaning of words, disregarding those which are inconsequential. Feature selection algorithms reduce the feature space by selecting appropriate vectors, whereas feature extraction algorithms transform the vectors into a sub-space of scaled-down dimension. Feature selection can be followed by supervised or unsupervised learning. Classification becomes supervised when a collection of labeled documents guides the learning through a train-test split.
II. REVIEW OF LITERATURE
1. Text Mining and Organization in Large Corpus, December 2005. Two dimensionality reduction methods, Singular Value Decomposition (SVD) and Random Projection (RP), are compared, along with three selected clustering algorithms: K-means, Non-negative Matrix Factorization (NMF) and Frequent Itemset. These methods and algorithms are compared based on their performance and time consumption.
2. Improving Methods for Single-label Text Categorization, July 2007. An evaluation of established feature reduction algorithms is carried out. The paper presents a comprehensive comparison of the performance of a number of text categorization methods on two different data sets. In particular, the Vector and Latent Semantic Analysis (LSA) methods, a classifier based on Support Vector Machines (SVM), and the k-Nearest Neighbor variations of the Vector and LSA models are evaluated.
3. A Fuzzy Based Approach to Text Mining and Document Clustering, October 2013. This paper shows how to apply fuzzy logic in text mining in order to perform document clustering; the fuzzy c-means (FCM) algorithm is used to group the documents into clusters.
4. Feature Clustering Algorithms for Text Classification - Novel Techniques and Reviews, August 2010. Some of the important techniques for text classification are reviewed and novel parameters using a fuzzy set approach are discussed in detail.
5. Classification of Text Using Fuzzy Based Incremental Feature Clustering Algorithm, International Journal of Advanced Research in Computer Engineering and Technology, Volume 1, Issue 5, July 2012. A fuzzy based incremental feature clustering algorithm is proposed. Based on a similarity test, the feature vectors of a document set are grouped into clusters, and each cluster is characterized by a membership function with statistical mean and deviation.
III. FEATURE SELECTION METHODS FOR
DIMENSIONALITY REDUCTION
Feature selection can either be supervised or unsupervised. A
brief summary of the different feature extraction methods used
in this study follows. As a first step, document pre-processing removes stopwords, short words, numbers and alphanumeric characters. With the noise removed, the text is transformed into a term-weighted matrix whose rows represent terms and whose columns represent the documents in which they appear. Each cell entry holds the frequency of occurrence of the word in the document (also called its weight).
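The pre-processing step can be pictured with a short sketch. The snippet below is a hypothetical illustration, not the authors' code: it assumes plain Python and a small sample stopword list, and keeps only alphabetic tokens of three or more letters, as described above.

```python
# Hypothetical pre-processing pass (illustrative only): lowercases the text,
# keeps alphabetic runs only (digits and punctuation are dropped, which also
# strips numbers and the numeric parts of alphanumeric tokens), then removes
# stopwords and words shorter than three letters.
import re

STOPWORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "is"}  # sample set

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) >= 3 and t not in STOPWORDS]
```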
The term weighting factors are:
Local term frequency (tf) - the frequency of a term within a document.
Inverse document frequency (idf) - weighs a term by how rare it is across the whole collection, complementing its frequency within a single document.
The normalized tf-idf factor is the general weighting method used. The matrix thus obtained consists mostly of sparse elements.
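As a minimal sketch of this weighting scheme (plain Python, with the usual log-scaled idf; the helper name and the cosine normalization choice are illustrative, not taken from the paper):

```python
# Illustrative tf-idf builder: rows are terms, columns are documents,
# matching the term-weighted matrix described in Section III.
import math
from collections import Counter

def tfidf_matrix(docs):
    """docs: list of token lists. Returns (vocabulary, matrix) where
    matrix[i][j] is the normalized tf-idf weight of term i in document j."""
    vocab = sorted({t for d in docs for t in d})
    index = {t: i for i, t in enumerate(vocab)}
    n_docs = len(docs)
    # document frequency: number of documents containing each term
    df = Counter(t for d in docs for t in set(d))
    matrix = [[0.0] * n_docs for _ in vocab]
    for j, d in enumerate(docs):
        tf = Counter(d)
        for t, f in tf.items():
            idf = math.log(n_docs / df[t])
            matrix[index[t]][j] = f * idf
        # length-normalize each document column (cosine normalization)
        norm = math.sqrt(sum(matrix[i][j] ** 2 for i in range(len(vocab))))
        if norm > 0:
            for i in range(len(vocab)):
                matrix[i][j] /= norm
    return vocab, matrix
```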
A. Latent Semantic Indexing (LSI)
Every document has an underlying semantic structure relating to a particular abstraction. Latent Semantic Indexing taps this theory to identify the relation between words and the context in which they are used; mathematical and statistical techniques are used for this inference [2]. Dimensions that are of no consequence to the text are to be eliminated, but their removal should not result in a loss of interpretability. LSI starts by pre-processing the document (stopword removal, stemming, etc.). This is followed by converting the text document to a term-weighted matrix consisting mostly of sparse vectors, whose cell entries are incremented according to the frequency of the corresponding words. For feature selection, LSI relies on the powerful Singular Value Decomposition (SVD). SVD takes the sparse matrix of term weights and factors it into a product of three matrices: a left singular matrix built over the original row elements, a right singular matrix built over the original column elements, both orthogonal, and a diagonal matrix of singular values containing the scaling factors [4]. The diagonal matrix is the component used for dimensionality reduction: the smallest values in it indicate dimensions that are inconsequential and can be removed. LSI along with SVD thus identifies features that do not contribute to the semantic structure of the document.
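A hedged sketch of this reduction step, assuming NumPy is available; the function name and the choice of k are illustrative only:

```python
# Illustrative LSI reduction via truncated SVD: keep only the k largest
# singular values/vectors and discard the rest, as described above.
# `A` is a (terms x documents) tf-idf matrix.
import numpy as np

def lsi_reduce(A, k):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # retain the k largest singular values and their vectors
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    # documents expressed in the reduced k-dimensional latent space
    docs_k = np.diag(s_k) @ Vt_k
    return U_k, s_k, docs_k
```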
B. Principal Component Analysis (PCA)
Principal Component Analysis uses the notion of eigenvalues and eigenvectors for its feature reduction procedure. Given data of high dimensionality, PCA starts by subtracting the mean from each data point so that the mean of each dimension is zero, and this mean-centered data is used in place of the original data. The covariance matrix C is a square n x n matrix with C(x, y) = cov(dimension_x, dimension_y), so each cell (x, y) holds the covariance between two different dimensions x and y; the covariance matrix thus represents the dependence of the dimensions on each other. Positive values indicate that as one dimension increases, the dependent dimension scales up as well. The eigenvalue and eigenvector representation of the covariance matrix is used to plot the principal component axis, the best line along which the data can be laid out. Arranging the eigenvalues and eigenvectors in decreasing order reveals the components of higher relevance, corresponding to the larger eigenvalues [8]; the vectors with lower values can be disregarded as insignificant. The second principal component is perpendicular to the first, and indeed all eigenvectors are orthogonal to each other irrespective of how many dimensions are present. Each principal component is analyzed for the extent to which it contributes to the variance of the data points [9]. As a final step, the transpose of the chosen principal components is multiplied with the mean-adjusted data to derive the final data set representing the dimensions to be retained.
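The PCA steps above can be summarized in a short illustrative sketch (assuming NumPy; the names are hypothetical):

```python
# Illustrative PCA: mean-center, build the covariance matrix, take its
# eigendecomposition, sort by decreasing eigenvalue, and project onto
# the top-k principal axes.
import numpy as np

def pca_reduce(X, k):
    """X: (n_samples x n_dims) data matrix. Returns the data projected
    onto the k principal components with the largest eigenvalues."""
    X_centered = X - X.mean(axis=0)            # zero-mean each dimension
    C = np.cov(X_centered, rowvar=False)       # n_dims x n_dims covariance
    eigvals, eigvecs = np.linalg.eigh(C)       # solver for symmetric matrices
    order = np.argsort(eigvals)[::-1]          # sort by decreasing eigenvalue
    components = eigvecs[:, order[:k]]         # top-k principal axes
    return X_centered @ components             # final reduced data set
```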
IV. FUZZY CLUSTERING
The fuzzy C-means algorithm is an iterative algorithm that groups the data into different clusters [10]. Every cluster has a cluster center, and data points are assigned to clusters based on the degree to which they belong to each group. Points closer to the center are more closely integrated into the group than those further away from the cluster center. The decision as to which cluster a data point falls into is based on a membership function parameterized by mean and standard deviation; the membership function determines how well a data point fits into a particular cluster. On every iteration, the FCM algorithm updates the cluster centers and the membership values while also minimizing the objective function.
The objective function of FCM is given by

U_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^{m} \lVert x_i - c_j \rVert^2

where m > 1 is the fuzziness exponent, u_ij denotes the membership of x_i in the j-th cluster, x_i are the data points and c_j are the cluster centers.
The iteration stops when the algorithm has pinpointed the cluster centers and no further change is seen in the minimization. The algorithm consists of four steps; a sketch in code follows the steps.
Step 1. Choose the initial cluster centers and a membership matrix U such that each element u_ij of U takes a value in [0,1] denoting the extent to which data point i belongs to cluster j.
Step 2. During each iteration, update the membership values in [0,1] from the current cluster centers.
Step 3. Calculate the objective function for that iteration.
Step 4. If the value has decreased from the previous iteration, continue iterating; otherwise the procedure halts [11], [12]. Similar features are now clustered, with features closer to the center bearing a stronger resemblance.
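A compact sketch of these four steps, assuming NumPy and the standard FCM update formulas; the parameter names and defaults are illustrative, not taken from the paper:

```python
# Illustrative fuzzy C-means: random initial memberships, alternating
# center/membership updates, and a stop test on the objective function.
import numpy as np

def fuzzy_c_means(X, n_clusters, m=2.0, max_iter=100, tol=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Step 1: random memberships in [0,1], each row summing to 1
    U = rng.random((n, n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    prev_obj = np.inf
    for _ in range(max_iter):
        Um = U ** m
        # cluster centers as membership-weighted means of the data
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Step 2: memberships from distances to the updated centers
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)            # guard against division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
        # Steps 3-4: evaluate the objective; halt once it stops decreasing
        obj = ((U ** m) * dist ** 2).sum()
        if prev_obj - obj < tol:
            break
        prev_obj = obj
    return centers, U
```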
V. METHODOLOGY
The feature extraction methods try to interpret the underlying semantics of the text and identify the words that can be eliminated. The accuracy of classification depends on this optimal set. To assess the efficiency and performance of feature reduction, we first find the accuracy of classification given the training and test datasets [7]. LSI and PCA were chosen mainly because they exhibit good accuracy when reducing the dimensions of a document; this accuracy is used as the baseline against which FCM clustering is measured. Supervised feature reduction works on
the principle that the number of clusters is determined beforehand. The fuzzy C-means clustering requires no training or test samples to work on. The primary step in feature reduction is to preprocess a document. Pre-processing is an essential procedure that further reduces the complexity of dimensionality reduction and the subsequent classification or clustering process. The whole mass of text has to be transformed algebraically: the text data in the document is mapped to its vector representation for the purpose of document indexing. The most common form of representation is a matrix format where words that appear frequently are weighted [2]. The resultant matrix is sparse, with most of its elements being zeros. From the nature of text data, it can be seen that certain words do not contribute to the meaning of the text; the parser removes these words by stopword elimination. This is followed by lemmatization or stemming. Preprocessing also eliminates words shorter than three letters, alphanumeric tokens and numbers. With the weighted information, documents are ranked based on their similarity to the query, using the cosine of the angle formed by the document and query vectors as the similarity measure.
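A small illustrative helper for this ranking step (assuming NumPy and the terms-by-documents layout used earlier; the names are hypothetical):

```python
# Illustrative cosine-similarity ranking of documents against a query.
import numpy as np

def rank_by_cosine(A, q):
    """A: (terms x docs) weighted matrix, q: (terms,) query vector.
    Returns document indices sorted by decreasing cosine similarity."""
    doc_norms = np.linalg.norm(A, axis=0)
    q_norm = np.linalg.norm(q)
    sims = (A.T @ q) / (doc_norms * q_norm + 1e-12)  # guard zero norms
    return np.argsort(sims)[::-1]
```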
Two common datasets have been used in this study: 20 Newsgroups (20NG) and R8, the eight most frequent classes from the Reuters-21578 collection. Each dataset is split into training and test documents. Of the 18821 documents in 20NG, 60% are taken as the training set and the rest for testing; from the R8 group, 70% of the documents form the training set. Using the training set for learning, the supervised techniques attempt to assign new, unknown documents to their correct class labels [8].
VI. SUPERVISED VS UNSUPERVISED FEATURE
SELECTION
All tests have been performed in the MATLAB computing environment. Before any analysis is undertaken, it is worthwhile to study the characteristics of the data collections used. R8, the eight most frequent classes taken from the Reuters collection, is smaller in size than the 20Newsgroups collection. The general procedure is to train the classifier using the training data; accuracy is judged by the number of test documents that are labeled correctly after the learning phase. Since FCM learning evolves over the iterations, no such bifurcation of the data is necessary. In this study the techniques are assessed based on accuracy, macro-averaged precision and recall, and training and testing times [6]. The datasets are divided in 70:30 and 60:40 ratios for R8 and 20NG respectively, into training and test documents. With the class labeling information from the training documents, the test documents can be classified. Initially the datasets are labeled by k-means after feature selection; this is used as a point of reference when clustering with the fuzzy C-means algorithm is applied. A detailed comparison of the results shows that when features are clustered using FCM, execution is faster and classification of the datasets is more accurate. The savings are mainly due to forgoing the training and testing times.
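For concreteness, here is a plain-Python sketch of the macro-averaged measures referred to above (illustrative, not the authors' evaluation code):

```python
# Illustrative macro-averaged precision and recall: per-class precision
# and recall are computed first, then averaged with equal class weight.
def macro_precision_recall(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    # macro-average: unweighted mean over classes
    return sum(precisions) / len(labels), sum(recalls) / len(labels)
```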
Dataset   Collection      Classes   Train Docs   Test Docs   Total Docs
20NG      20Newsgroups    20        11293        7528        18821
R8        Reuters-21578   8         5485         2189        7674

Figure 1. The datasets used in the study, showing the number of classes in each collection and the division into training and test documents.
VII. IMPLEMENTATION
All the statistics shown have been derived after data pre-processing followed by dimensionality reduction. The resultant dataset is then classified using the k-means classification method. The performance of the dimensionality reduction algorithms is graded by comparing the evaluation measures of recall, classification accuracy and precision. Figures 2 and 3 depict the results of classification using both the supervised and unsupervised techniques on the datasets under study.
        Recall   Accuracy   Precision
PCA     76.27    81.76      88.35
LSI     83.50    89.70      87.64
FCM     85.12    92.37      91.75

Figure 2. Collation of the performance measures for the R8 dataset.
FCM performs best in terms of accuracy, precision and recall while classifying the R8 dataset, with an accuracy gain of roughly 3% over LSI and 11% over PCA. On 20Newsgroups, FCM improves accuracy by about 4% over LSI and close to 19% over PCA.
        Recall   Accuracy   Precision
PCA     75.37    73.29      77.64
LSI     81.44    88.87      84.79
FCM     84.25    92.63      87.60

Figure 3. Collation of the performance measures for the 20Newsgroups data collection.
Figure 4. Bar chart comparing the performance measures of the algorithms on the R8 dataset.
Figure 5. Bar chart comparing the performance measures for the 20Newsgroups dataset.
Figure 6. Clustering of the R8 dataset using the FCM algorithm.
Figure 7. Clustering of the 20NG dataset using the FCM algorithm.
Figures 6 and 7 show the clustering of the data points using hard fuzzy c-means: every point is a member of exactly one cluster, and dual membership is not allowed. FCM has clustered the datasets into four clusters, with cluster centers marked by slightly larger, darker circles. In the R8 collection the distribution of terms is not very strongly bound to the cluster centers; features belonging to one cluster could share membership with another cluster as well. In the 20NG dataset, the members are closely tied to their cluster centers.
VIII. CONCLUSION
This study evaluated the effectiveness of feature selection techniques for dimensionality reduction in text categorization. Both supervised and unsupervised techniques were examined. Of the established techniques, Latent Semantic Indexing exhibits superior performance over reduction using Principal Component Analysis in terms of precision, accuracy and recall. Unsupervised feature clustering using FCM improves on both LSI and PCA in accuracy, and clustering with FCM is also faster since the training and testing times are eliminated. With FCM, at least 80% of terms can be removed without degrading the resulting classification. The datasets under study show some fundamental differences: classification results vary depending upon the features that are eliminated, the relations within the data, and the factors used to assess similarity in combination with the classification method employed. Given a choice of optimally proven feature extraction methods, it is possible that accuracy will improve when two feature selection algorithms are used in conjunction with each other. Since the characteristics of every document collection are different, devising a classification algorithm that adapts to the type of document content could also be explored. The performance of the feature extraction algorithms can also be estimated using other efficacy measures.
REFERENCES
[1] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. Addison-Wesley, Reading, Massachusetts, USA, 1999.
[2] Wei Xu, Xin Liu and Yihong Gong, "Document Clustering Based On Non-negative Matrix Factorization," in Proc. ACM SIGIR, Toronto, Canada, 2003.
[3] Ian Soboroff, "IR Models: The Vector Space Model," Information Retrieval, Lecture 7.
[4] http://www.csee.umbc.edu/_ian/irF02/lectures/07Models-VSM.pdf
[5] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer and R. Harshman, "Indexing by latent semantic analysis," Journal of the American Society for Information Science, 41:391-407, 1990.
[6] Marko Grobelnik and Dunja Mladenic, J. Stefan Institute, Slovenia, "Text-Mining Tutorial."
[7] Clara Yu, John Cuadrado, Maciej Ceglowski and J. Scott Payne, "Patterns in Unstructured Data: Discovery, Aggregation, and Visualization," presentation to the Andrew W. Mellon Foundation.
[8] Y. Yang and J. Pedersen, "A Comparative Study on Feature Selection in Text Categorization," ICML 1997, pp. 412-420.
[9] Lindsay I. Smith, "A Tutorial on Principal Components Analysis."
[10] Ana Cardoso-Cachopo, "Improving Methods for Single-label Text Categorization," PhD Thesis, October 2007.
[11] K. Sathiyakumari, V. Preamsudha and G. Manimekalai, "Unsupervised Approach for Document Clustering Using Modified Fuzzy C-means Algorithm," International Journal of Computer & Organization Trends, Volume 11, Issue 3, 2011.
[12] R. Rajendra Prasath and Sudeshna Sarkar, "Unsupervised Feature Generation using Knowledge Repositories for Effective Text Categorization," ECAI 2010, pp. 1101-1102.
[13] T. M. Nogueira, "On The Use of Fuzzy Rules to Text Document Classification," 10th International Conference on Hybrid Intelligent Systems (HIS), 23-25 Aug. 2010, Atlanta, US.
Author Profile
Asha Kurian is a research scholar in the Department of Computer Applications, Dr. M.G.R. University, Chennai. She completed her post-graduation in Computer Applications at Coimbatore Institute of Technology, Bharathiar University, in 2003. Her areas of interest include Data Mining and Artificial Intelligence.
M.S. Josephine works in the Department of Computer Applications, Dr. M.G.R. University, Chennai. She received her Master's degree (MCA) from St. Joseph's College, Bharathidasan University, her M.Phil. (Computer Science) from Periyar University, and her Doctorate in Computer Applications from Mother Teresa University. Her research interests include Software Engineering, Expert Systems, Networks and Data Mining.