FEATURE SELECTION AND CLASSIFICATION APPROACH FOR SENTIMENT ANALYSIS mlaij
Sentiment analysis and opinion mining have emerged as popular and efficient techniques for information retrieval and web data analysis. The exponential growth of user-generated content has opened new horizons for research in the field of sentiment analysis. This paper proposes a model for sentiment analysis of movie reviews using a combination of natural language processing and machine learning approaches. Firstly, different data pre-processing schemes are applied to the dataset. Secondly, the behaviour of two classifiers, Naive Bayes and SVM, is investigated in combination with different feature selection schemes to obtain the results for sentiment analysis. Thirdly, the proposed model for sentiment analysis is extended to obtain the results for higher-order n-grams.
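The abstract does not describe how its n-gram features are built, so the following is only a minimal sketch of one common approach: counting all contiguous word n-grams up to a chosen order. The function names `ngrams` and `ngram_features`, the whitespace tokenisation and the sample sentence are assumptions made for illustration, not details from the paper.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams (as tuples) found in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(text, max_n=2):
    """Count every 1..max_n gram in a lower-cased, whitespace-tokenised text.

    Hypothetical helper: real pipelines usually add stop-word handling,
    stemming or negation marking before counting.
    """
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


# Unigrams alone lose the negation; the bigram ("not", "good") keeps it.
features = ngram_features("This movie is not good", max_n=2)
```

Feature counts of this kind could then be passed to a classifier such as Naive Bayes or an SVM, with a feature selection step in between to keep the higher-order vocabulary manageable.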
Data mining and machine learning have become a vital part of crime detection and prevention. In this research, we use WEKA, an open-source data mining tool, to conduct a comparative study between the violent crime patterns from the Communities and Crime Unnormalized Dataset provided by the University of California-Irvine repository and actual crime statistics for the state of Mississippi provided by neighborhoodscout.com. We implemented the Linear Regression, Additive Regression, and Decision Stump algorithms using the same finite set of features on the Communities and Crime Dataset. Overall, the linear regression algorithm performed the best among the three selected algorithms. The aim of this project is to show how effective and accurate the machine learning algorithms used in data mining analysis can be at predicting violent crime patterns.
In the early days, the main emphasis was on the cognitive aspects of learning and on traditional classroom instruction using outdated, conventional techniques. Today, in a world of constant innovation and discovery, scientists and gadget experts are continuously coming up with one or two new technological devices a day. No doubt technology has made our lives much easier and better in many respects. In developed countries, technology helps students and teachers learn more effectively. But in a country like India, the development of technology is not yet up to that mark; we are still moving along the path of progress. Thus, this paper describes the conceptual framework for futuristic studies of emerging technologies such as M-Learning, E-Learning, iPod, iPad self-efficacy learning, Virtual Learning Environments (VLE), etc. The investigator highlights some of the studies related to trends in futurology and innovations that could prove important for educational technology.
DEVELOPING PREDICTION MODEL OF LOAN RISK IN BANKS USING DATA MINING mlaij
Nowadays, there are many risks related to bank loans, both for the bank and for those who receive the loans. Analysing risk in bank loans requires understanding what risk means. In addition, the number of transactions in the banking sector is growing rapidly, and huge volumes of data are available that represent customers' behaviour, while the risks around loans have increased. Data mining is one of the most motivating and vital areas of research, with the aim of extracting information from tremendous amounts of accumulated data. In this paper, a new model for classifying loan risk in the banking sector using data mining is proposed. The model has been built using data from the banking sector to predict the status of loans. Three algorithms have been used to build the proposed model: J48, BayesNet and NaiveBayes. The model has been implemented and tested using the Weka application. The results have been discussed and a full comparison between the algorithms was conducted. J48 was selected as the best algorithm based on accuracy.
Opposition Based Firefly Algorithm Optimized Feature Subset Selection Approac...mlaij
Recently, a huge amount of data has become available in the field of medicine; when analysed, it helps doctors diagnose diseases. Data mining techniques can be applied to this medical data to extract knowledge so that disease prediction becomes more accurate and easier. In this work, cardiotocogram (CTG) data is analysed using a Support Vector Machine (SVM) to predict fetal risk. An opposition-based firefly algorithm (OBFA) is proposed to extract the relevant features that maximise the classification performance of the SVM. The obtained results show that the opposition-based firefly algorithm outperforms the standard firefly algorithm (FA).
This paper presents a review and a comparative evaluation of a few well-known machine learning algorithms in terms of their suitability and code performance on a given dataset of any size. We describe our Machine Learning ToolBox, which we have built using the Python programming language. The toolbox consists of supervised classification algorithms such as Naïve Bayes, Decision Trees, SVM, K-nearest Neighbors and Neural Network (Backpropagation). The algorithms are tested on the iris and diabetes datasets and are compared on the basis of their accuracy under different conditions. However, using our tool one can apply any of the implemented ML algorithms to any dataset of any size. The main goal of building the toolbox is to provide users with a platform to test their datasets on different machine learning algorithms and use the accuracy results to determine which algorithm fits the data best. The toolbox allows the user to choose a dataset, in either structured or unstructured form, and then choose the features to use for training. We give our concluding remarks on the performance of the implemented algorithms based on experimental analysis.
EVALUATION OF A NEW INCREMENTAL CLASSIFICATION TREE ALGORITHM FOR MINING HIGH...mlaij
A new model for online machine learning on high-speed data streams is proposed to minimize the severe restrictions associated with existing learning algorithms. Most existing models have three principal steps. In the first step, the system creates a model incrementally. In the second step, the time taken by the examples to complete a prescribed procedure, given their arrival speed, is computed. In the third and final step, the amount of memory required for computation is predicted in advance. To overcome these restrictions, we propose a new data stream classification algorithm in which the data can be partitioned into a stream of trees. In this algorithm, the existing tree can be updated with new data. This algorithm, called the incremental classification tree algorithm, proves to be an excellent solution for processing larger data streams. In this paper, we present the experimental results of our new algorithm and show that our method overcomes the problems of the existing methods.
Classification of Enzymes Using Machine Learning Based Approaches: A Review mlaij
Enzymes play an important role in metabolism by catalyzing biochemical reactions. A computational method is required to predict the function of enzymes. Many feature selection techniques from previous research papers are examined in this paper. This paper presents a supervised machine learning approach to predict the functional classes and subclasses of enzymes based on a set of 857 sequence-derived features. It uses seven sequence-derived properties: amino acid composition, dipeptide composition, correlation features, composition, transition, distribution and pseudo amino acid composition. Support vector machine recursive feature elimination (SVM-RFE) is used to select the optimal number of features. Random Forest has been used to construct a three-level model with the optimal number of features selected by SVM-RFE, where the top level distinguishes a query protein as an enzyme or non-enzyme, the second level predicts the enzyme functional class and the third level predicts the sub-functional class. The proposed model reported an overall accuracy of 100%, precision of 100% and MCC of 1.00 for the first level, an accuracy of 90.1%, precision of 90.5% and MCC of 0.88 for the second level, and an accuracy of 88.0%, precision of 88.7% and MCC of 0.87 for the third level.
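The MCC (Matthews correlation coefficient) values quoted above can be read more easily with the binary definition in hand. The sketch below computes it from the four confusion-matrix counts; the function name and the convention of returning 0.0 for an empty row or column are assumptions, and the multi-class variant that the second and third levels would need is not shown.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Binary Matthews correlation coefficient from confusion-matrix counts.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
    Returns 0.0 when the denominator is zero (an assumed convention).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# A perfect enzyme / non-enzyme split gives 1.0, matching the first-level figure.
perfect = matthews_corrcoef(tp=50, tn=50, fp=0, fn=0)
```

Unlike accuracy, the coefficient stays near zero for a classifier that always predicts the majority class, which is why it is often reported alongside accuracy and precision.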
A SURVEY ON SIMILARITY MEASURES IN TEXT MINING mlaij
The volume of text resources in digital libraries and on the internet has been increasing. Organizing these text documents has become a practical need. Clustering techniques are used to automatically organize a large number of objects into a small number of coherent groups. These documents are widely used for information retrieval and natural language processing tasks. Clustering algorithms require a metric for quantifying how dissimilar two given documents are. This difference is often measured by a similarity measure such as Euclidean distance or cosine similarity. The similarity measure process in text mining can be used to identify a suitable clustering algorithm for a specific problem. This survey discusses the existing work on text similarity by partitioning it into three significant approaches: string-based, knowledge-based and corpus-based similarities.
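To make the two measures named above concrete, here is a minimal sketch that computes cosine similarity and Euclidean distance over raw term-frequency vectors. The helper names, the whitespace tokenisation and the two sample documents are assumptions for illustration; the survey itself does not prescribe a representation.

```python
import math
from collections import Counter


def tf_vector(text):
    """Represent a document as a sparse term-frequency vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two sparse vectors over their joint vocabulary."""
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in a.keys() | b.keys()))


doc1 = tf_vector("data mining text")
doc2 = tf_vector("text mining survey")
similarity = cosine_similarity(doc1, doc2)
distance = euclidean_distance(doc1, doc2)
```

Cosine similarity ignores document length, whereas Euclidean distance grows with it, so the two can rank the same pair of documents differently.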
An Ensemble of Filters and Wrappers for Microarray Data Classification mlaij
The development of microarray technology has supplied a large volume of data to many fields. Gene microarray analysis and classification have proved an effective approach to the diagnosis of diseases and cancers. Inasmuch as the data obtained from microarray technology is very noisy and has thousands of features, feature selection plays an important role in removing irrelevant and redundant features and in reducing computational complexity. There are two important approaches to gene selection in microarray data analysis: filters and wrappers. To select a concise subset of informative genes, we introduce a hybrid feature selection method that combines the two approaches. Candidate features are first selected from the original set via several effective filters. The candidate feature set is then further refined by more accurate wrappers. Thus, we can take advantage of both filters and wrappers. Experimental results based on 11 microarray datasets show that our mechanism is effective with a smaller feature set. Moreover, these feature subsets can be obtained in a reasonable time.
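The filter-then-wrapper idea can be sketched on toy data as follows. This is not the authors' ensemble: the filter score (gap between class means), the wrapper (greedy forward selection scored by leave-one-out accuracy of a nearest-centroid classifier), all function names and the data are stand-ins chosen for illustration.

```python
def class_mean_gap(X, y, j):
    """Filter score: absolute gap between the two class means of feature j."""
    pos = [row[j] for row, label in zip(X, y) if label == 1]
    neg = [row[j] for row, label in zip(X, y) if label == 0]
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg))


def loo_accuracy(X, y, features):
    """Wrapper score: leave-one-out accuracy of a nearest-centroid classifier
    restricted to the given feature indices (needs >= 2 samples per class)."""
    correct = 0
    for i in range(len(X)):
        centroids = {}
        for label in set(y):
            rows = [X[r] for r in range(len(X)) if r != i and y[r] == label]
            centroids[label] = [sum(row[j] for row in rows) / len(rows) for j in features]
        point = [X[i][j] for j in features]
        predicted = min(
            centroids,
            key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
        )
        correct += predicted == y[i]
    return correct / len(X)


def hybrid_select(X, y, n_candidates):
    """Filter down to n_candidates features, then refine them with a wrapper."""
    ranked = sorted(range(len(X[0])), key=lambda j: class_mean_gap(X, y, j), reverse=True)
    candidates = ranked[:n_candidates]
    selected, best = [], 0.0
    improved = True
    while improved:  # greedy forward selection: add a feature only if accuracy rises
        improved = False
        for j in candidates:
            if j in selected:
                continue
            score = loo_accuracy(X, y, selected + [j])
            if score > best:
                best, best_j, improved = score, j, True
        if improved:
            selected.append(best_j)
    return selected, best


# Toy data: feature 0 separates the classes, features 1 and 2 are noise.
X = [[1.0, 5.0, 0.3], [1.2, 1.0, 0.1], [0.9, 3.0, 0.2],
     [3.0, 4.0, 0.2], [3.1, 2.0, 0.3], [2.9, 1.5, 0.1]]
y = [0, 0, 0, 1, 1, 1]
selected, accuracy = hybrid_select(X, y, n_candidates=2)
```

The cheap filter keeps the expensive wrapper from having to evaluate every feature, which matters when a dataset has thousands of genes.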
Machine Learning Based Approaches for Prediction of Parkinson's Disease mlaij
The prediction of Parkinson’s disease is an important and challenging problem for biomedical engineering researchers and doctors. The symptoms of the disease are observed in middle and late middle age. In this paper, the minimum redundancy maximum relevance feature selection algorithm is used to select the most important features for predicting Parkinson’s disease. It is observed that a random forest with 20 features selected by the minimum redundancy maximum relevance algorithm provides an overall accuracy of 90.3%, precision of 90.2%, a Matthews correlation coefficient of 0.73 and an ROC value of 0.96, which is better than all the other machine learning approaches considered, such as bagging, boosting, random forest, rotation forest, random subspace, support vector machine, multilayer perceptron, and decision tree based methods.
Analysis of Opinionated Text for Opinion Mining mlaij
In sentiment analysis, the polarities of the opinions expressed on an object or feature are determined to assess whether the sentiment of a sentence or document is positive, negative or neutral. Typically, the object/feature is a noun that refers to a product or a component of a product, for example the "lens" of a camera, and the opinions expressed about it are captured in adjectives, verbs, adverbs and nouns themselves. Apart from such words, other meta-information and diverse effective features also play an important role in influencing the sentiment polarity and contribute significantly to the performance of the system. In this paper, some of the associated information/meta-data in sentiment text is explored and investigated. Based on the analysis results presented here, there is scope for further assessment and utilization of the meta-information as features in text categorization, text document ranking, spam document identification and polarity classification problems.
An New Attractive Mage Technique Using L-Diversity mlaij
Data that is published or shared between organizations contains private information about individuals. Privacy preservation aims to protect this sensitive information from threats that violate an individual's privacy. Analysis of this private information could reveal information that attackers can use for malicious purposes. Anonymization is a privacy preservation approach suitable for mixed data that contains both numerical and categorical attributes. In this paper, a novel method called Micro-aggregation Generalization (MAGE) is used for anonymization of microdata that can retain more of the semantics of the original data. Micro-aggregation is applied to the numerical data and generalization is applied to the categorical data. Even though the MAGE approach preserves privacy, it fails to address homogeneity and background knowledge attacks. The l-diversity approach is therefore applied to deal with the homogeneity attack. In l-diversity, the anonymized records are reordered to satisfy a new privacy principle that removes homogeneity of sensitive information. The results show that the MAGE approach suffers from the homogeneity attack, and that applying l-diversity over MAGE prevents the homogeneity attack and also provides better privacy and data utility.
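The homogeneity problem described above can be illustrated with a small check for distinct l-diversity: every group of records sharing the same quasi-identifier values must contain at least l different sensitive values. This sketch only tests the property; it does not reorder records as the paper's method does, and the function name, column names and sample records are hypothetical.

```python
from collections import defaultdict


def is_distinct_l_diverse(records, quasi_identifiers, sensitive, l):
    """Return True if every equivalence class has >= l distinct sensitive values."""
    groups = defaultdict(set)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        groups[key].add(record[sensitive])
    return all(len(values) >= l for values in groups.values())


# Hypothetical anonymized table: the second group is homogeneous ("flu" only),
# so anyone known to be in it has their diagnosis revealed.
records = [
    {"age": "20-30", "zip": "130**", "disease": "flu"},
    {"age": "20-30", "zip": "130**", "disease": "cancer"},
    {"age": "30-40", "zip": "148**", "disease": "flu"},
    {"age": "30-40", "zip": "148**", "disease": "flu"},
]
two_diverse = is_distinct_l_diverse(records, ["age", "zip"], "disease", l=2)
```

Here `two_diverse` is `False` because of the second group, even though both groups contain two records.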
IMAGE BASED RECOGNITION - RECENT CHALLENGES AND SOLUTIONS ILLUSTRATED ON APPL...mlaij
In this paper, problems and solutions for the automatic recognition of miscellaneous materials, especially
bulk materials are discussed. The fact that many materials, especially natural materials, have a strong
phenotypic variability resulting in high intra-class and low inter-class variability of the calculated features
poses a complex recognition problem. The recognition of components of a wheat sample or the
classification of mineral aggregates serves as an example to demonstrate different aspects in segmentation,
feature extraction, classifier design and complexity assessment. We present a technique for the
segmentation of highly overlapping and touching objects into single object images, a proposal for feature
selection and classifier parameter optimization, as well as a method to visualise the complexity of a
high-dimensional recognition problem in a three-dimensional space. Every step of the pattern recognition
process needs to be optimized carefully with special attention to the risk of overfitting. Modern processors
and the application of field-programmable gate arrays as well as the outsourcing of processing steps to the
graphic processing unit speed up the calculation and make real-time computation possible also for highly
complex recognition problems such as the quality assurance of bulk materials.
A Multi-Level Security for Preventing DDOS Attacks in Cloud Environments mlaij
Remarkable growth in the field of extranets, the internet, intranets and their users has ushered in a new period of intense global competition. A denial-of-service attack launched from several computers is capable of disrupting the services of competitors' servers. The attack can be carried out for various reasons, so it is a key threat to the cloud environment. Distributed Denial of Service (DDoS) is a key threat to network and cloud computing security. A cloud computing network is a group of nodes that interact with each other to exchange information, so security is a major issue. There are several security attacks in cloud computing, and one of the major threats to internet services is the DDoS attack. It is a malicious attempt to suspend or interrupt services to a destination node. A DDoS or DoS attack is an attempt to make a network resource or machine unavailable to its intended users. Numerous approaches have been developed to prevent DDoS or DoS attacks. DDoS conditions occur in two different ways: they may arise naturally, or they may be caused by attackers.