This document summarizes an article that proposes a novel cost-free learning (CFL) approach called ABC-SVM to address the class imbalance problem. The approach aims to maximize the normalized mutual information of the predicted and actual classes to balance errors and rejects without requiring cost information. It optimizes misclassification costs, SVM parameters, and feature selection simultaneously using an artificial bee colony algorithm. Experimental results on several datasets show the method performs effectively compared to sampling techniques for class imbalance.
ADABOOST ENSEMBLE WITH SIMPLE GENETIC ALGORITHM FOR STUDENT PREDICTION MODEL (ijcsit)
Predicting student performance is a great concern to higher education management. This
prediction helps to identify and improve students' performance. Several factors may improve this
performance. In the present study, we employ data mining processes, particularly classification, to
enhance the quality of the higher educational system. Recently, a new direction has been taken to improve
classification accuracy by combining classifiers. In this paper, we design and evaluate a fast learning
algorithm using an AdaBoost ensemble with a simple genetic algorithm, called "Ada-GA", where the genetic
algorithm is demonstrated to successfully improve the accuracy of the combined classifier.
The Ada-GA algorithm proved to be of considerable usefulness in identifying at-risk students early,
especially in very large classes. This early prediction allows the instructor to provide appropriate advising
to those students. The Ada-GA algorithm was implemented and tested on the ASSISTments dataset; the results
showed that this algorithm successfully improved detection accuracy while reducing
computational complexity.
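The abstract does not give implementation details for Ada-GA, so as a hedged illustration only, here is a minimal sketch of the generic ingredient it names: a simple genetic algorithm searching over binary feature masks. The fitness function here is a hypothetical stand-in (in the paper's setting it would be the accuracy of the AdaBoost ensemble trained on the masked features); the `target` pattern and all parameter values are illustrative assumptions, not taken from the paper.

```python
import random

def genetic_feature_search(n_features, fitness, pop_size=20, generations=30,
                           crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """Simple GA over binary feature masks; `fitness` scores a mask (higher is better)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    best = max(pop, key=fitness)                       # track the best mask seen so far
    for _ in range(generations):
        def pick():                                    # binary tournament selection
            a, b = rng.sample(pop, 2)
            return a if fitness(a) >= fitness(b) else b
        children = []
        while len(children) < pop_size:
            p1, p2 = pick(), pick()
            if rng.random() < crossover_rate:          # one-point crossover
                cut = rng.randrange(1, n_features)
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            for i in range(n_features):                # bit-flip mutation
                if rng.random() < mutation_rate:
                    child[i] = 1 - child[i]
            children.append(child)
        pop = children
        gen_best = max(pop, key=fitness)
        if fitness(gen_best) > fitness(best):
            best = gen_best
    return best

# Hypothetical fitness: reward masks that match a 'useful' feature pattern.
target = [1, 0, 1, 1, 0, 0, 1, 0]
score = lambda mask: sum(m == t for m, t in zip(mask, target))
best_mask = genetic_feature_search(len(target), score)
```

In the real Ada-GA setting, `fitness` would train and cross-validate the boosted ensemble per mask, which is why the paper emphasizes keeping the GA simple and fast.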
A Formal Machine Learning or Multi Objective Decision Making System for Deter... (Editor IJCATR)
Decision-making typically needs mechanisms to compromise among opposing criteria. When multiple objectives are involved in machine learning, a vital step is to relate the weights of individual objectives to system-level performance. Determining the weights of multiple objectives is an analysis process, and it has typically been treated as an optimization problem. However, our preliminary investigation has shown that existing methodologies for managing the weights of multiple objectives have some obvious limitations: when the determination of weights is treated as a single optimization problem, the result supported by such an optimization is limited, and it can even be unreliable when knowledge about the multiple objectives is incomplete, for example due to integrity problems caused by poor data. The constraints on weights are also discussed. Variable weights are natural in decision-making processes. Here, we develop a systematic methodology for determining variable weights of multiple objectives. The roles of weights in multi-objective decision-making or machine learning are analyzed, and the weights are determined with the help of a standard neural network.
Engineering Research Publication
Best International Journals, High Impact Journals,
International Journal of Engineering & Technical Research
ISSN : 2321-0869 (O) 2454-4698 (P)
www.erpublication.org
Classification problems in high-dimensional data with a small number of observations are becoming common, particularly in microarray data. During the last two decades, many efficient classification models and Feature Selection (FS) algorithms have been proposed to achieve higher prediction accuracy. However, in high-dimensional data, the output of an FS algorithm with respect to prediction accuracy can be unstable over variations in the training set. In this paper we present a new evaluation measure, the Q-statistic, that incorporates the stability of the selected feature subset in addition to prediction accuracy. We then propose Booster, applied on top of an FS algorithm, which boosts the value of the Q-statistic of the algorithm it wraps. A study on synthetic data and 14 microarray datasets shows that Booster improves not only the value of the Q-statistic but also the prediction accuracy of the algorithm applied.
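The paper's Q-statistic combines selection stability with accuracy; its exact definition is not reproduced in this abstract. As a hedged illustration of the stability half of that idea, the sketch below scores a set of feature subsets (one per training resample) by their average pairwise Jaccard similarity, a common stability measure, not necessarily the one used in the paper.

```python
def jaccard(a, b):
    """Jaccard similarity of two feature subsets (1.0 = identical)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def selection_stability(subsets):
    """Average pairwise Jaccard similarity of the feature subsets selected on
    different training resamples; 1.0 means perfectly stable selection."""
    pairs = [(i, j) for i in range(len(subsets)) for j in range(i + 1, len(subsets))]
    return sum(jaccard(subsets[i], subsets[j]) for i, j in pairs) / len(pairs)

# Hypothetical subsets selected on three bootstrap resamples of the training set
runs = [{0, 3, 7}, {0, 3, 9}, {0, 3, 7}]
stab = selection_stability(runs)   # pairwise similarities 0.5, 1.0, 0.5 -> 2/3
```

An FS algorithm whose chosen genes barely change across resamples scores near 1.0; in high-dimensional microarray data the score is often much lower, which is the instability the paper targets.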
The Analytical Hierarchy Process (AHP) has been a useful methodology for multi-criteria decision-making environments, with substantial applications in recent years. However, the weakness of the traditional AHP method lies in its use of subjective, judgement-based assessment and a standardized scale for creating the pairwise comparison matrix. This paper proposes a Condorcet voting theory based AHP method to solve multi-criteria decision-making problems, in which AHP is combined with a Condorcet-theory-based preferential voting technique followed by a quantitative ratio method for framing the comparison matrix, instead of the standard importance scale of the traditional AHP approach. The consistency ratio (CR) is calculated for both approaches to determine and compare their consistency. The results reveal the Condorcet-AHP method to be superior, generating a lower consistency ratio and a more accurate ranking of the criteria for solving MCDM problems.
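The consistency ratio the abstract compares is a standard AHP quantity: CR = CI / RI, where CI = (lambda_max - n) / (n - 1), lambda_max is the principal eigenvalue of the pairwise comparison matrix, and RI is Saaty's random index for size n. A minimal sketch, using power iteration to estimate lambda_max (the example matrix is illustrative, not from the paper):

```python
def consistency_ratio(matrix):
    """AHP consistency ratio CR = CI / RI, with CI = (lambda_max - n) / (n - 1).
    lambda_max is estimated by power iteration on the comparison matrix."""
    n = len(matrix)
    # Saaty's random consistency indices, tabulated here up to n = 7
    RI = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24, 7: 1.32}[n]
    w = [1.0] * n
    for _ in range(100):                       # power iteration toward the principal eigenvector
        w2 = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
        s = sum(w2)
        w = [x / s for x in w2]
    # Rayleigh-style estimate of the principal eigenvalue
    lam = sum(sum(matrix[i][j] * w[j] for j in range(n)) / w[i] for i in range(n)) / n
    ci = (lam - n) / (n - 1)
    return ci / RI if RI else 0.0

# A perfectly consistent 3x3 matrix (weights 1:2:4) should give CR ~ 0;
# judgments are usually accepted when CR < 0.1.
A = [[1, 2, 4],
     [1/2, 1, 2],
     [1/4, 1/2, 1]]
cr = consistency_ratio(A)
```

The paper's claim is that building the comparison matrix from Condorcet-style preferential votes rather than the subjective 1-9 scale yields matrices with lower CR under this same formula.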
Feature selection is one of the most fundamental steps in machine learning. It is closely related to
dimensionality reduction. A commonly used approach in feature selection is to rank the individual
features according to some criteria and then search for an optimal feature subset based on an
evaluation criterion. The objective of this work is to predict more accurately the presence of
Learning Disability (LD) in school-aged children using a reduced number of symptoms. For this purpose, a
novel hybrid feature selection approach is proposed by integrating a popular rough-set-based feature
ranking process with a modified backward feature elimination algorithm. The feature ranking process
calculates the significance, or priority, of each symptom of LD according to its
contribution in representing the knowledge contained in the dataset. Each symptom's significance or
priority value reflects its relative importance for predicting LD among the various cases. Then, by eliminating
the least significant features one by one and evaluating the feature subset at each stage of the process, an
optimal feature subset is generated. For comparative analysis, and to establish the importance of rough set
theory in feature selection, the backward feature elimination algorithm is combined with two state-of-the-art
filter-based feature ranking techniques, viz. information gain and gain ratio. The experimental results
show the proposed feature selection approach outperforms the other two in terms of data reduction.
Also, the proposed method efficiently eliminates all the redundant attributes from the LD dataset without
sacrificing classification performance.
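The ranking-plus-backward-elimination loop described above can be sketched generically. The significance scores and the evaluator below are hypothetical placeholders (in the paper they come from rough set theory and from a classifier's accuracy, respectively); this shows only the control flow of dropping the least significant feature while the subset still evaluates at least as well.

```python
def backward_elimination(features, rank, evaluate):
    """Drop the least significant feature (per `rank`) as long as the reduced
    subset evaluates at least as well as the current one."""
    current = sorted(features, key=rank, reverse=True)  # most significant first
    best_score = evaluate(current)
    while len(current) > 1:
        candidate = current[:-1]                        # remove the least significant feature
        score = evaluate(candidate)
        if score >= best_score:
            current, best_score = candidate, score
        else:
            break                                       # removal hurt performance: stop
    return current, best_score

# Hypothetical significance scores and evaluator: only symptoms 'a' and 'b' matter.
sig = {"a": 0.9, "b": 0.7, "c": 0.2, "d": 0.1}.get
acc = lambda subset: 0.95 if {"a", "b"} <= set(subset) else 0.6
subset, score = backward_elimination(["a", "b", "c", "d"], sig, acc)
```

With these toy scores the loop discards the redundant symptoms 'c' and 'd' and stops as soon as removing 'b' would cost accuracy, mirroring the "eliminate least significant, evaluate at each stage" procedure in the abstract.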
Correlation based feature selection (cfs) technique to predict student perfro... (IJCNCJournal)
Education data mining is an emerging stream which helps in mining academic data for solving various
types of problems. One of the problems is the selection of a proper academic track. The admission of a
student in an engineering college depends on many factors. In this paper we have tried to implement a
classification technique to assist students in predicting their success in admission to an engineering
stream. We have analyzed a data set containing information about students' academic as well as
socio-demographic variables, with attributes such as family pressure, interest, gender, XII marks and
CET rank in entrance examinations, and historical data of previous batches of students. Feature selection
is a process for removing irrelevant and redundant features which will help improve the predictive
accuracy of classifiers. In this paper we have first used the feature selection attribute algorithms
Chi-square, InfoGain, and GainRatio to select the relevant features. Then we have applied a fast
correlation-based filter on the given features. Later, classification is done using NBTree,
MultilayerPerceptron, NaiveBayes and Instance-based K-nearest neighbor. Results showed a reduction in
computational cost and time and an increase in predictive accuracy for the student model.
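Of the three ranking criteria this abstract names, the Chi-square score is the simplest to make concrete: it measures how far the observed joint counts of a discrete feature and the class deviate from the counts expected under independence. A hedged sketch on made-up admission data (the attribute values below are illustrative, echoing the abstract's 'interest' and 'gender' attributes, not the paper's actual dataset):

```python
from collections import Counter

def chi_square(xs, ys):
    """Chi-square statistic between a discrete feature and the class labels:
    sum over cells of (observed - expected)^2 / expected."""
    n = len(xs)
    fx, fy = Counter(xs), Counter(ys)
    joint = Counter(zip(xs, ys))
    stat = 0.0
    for vx in fx:
        for vy in fy:
            expected = fx[vx] * fy[vy] / n     # count expected under independence
            observed = joint[(vx, vy)]
            stat += (observed - expected) ** 2 / expected
    return stat

# Toy data: 'interest' perfectly predicts admission; 'gender' is unrelated to it.
admit    = ["y", "y", "n", "n", "y", "n"]
interest = ["hi", "hi", "lo", "lo", "hi", "lo"]
gender   = ["m", "f", "m", "f", "f", "m"]
```

Ranking features by this statistic and keeping the top scorers is exactly the filter step the paper applies before handing the survivors to the fast correlation-based filter.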
Multi Label Spatial Semi Supervised Classification using Spatial Associative ... (cscpconf)
Multi-label spatial classification based on association rules with multi-objective genetic
algorithms (MOGA), enriched by semi-supervised learning, is proposed in this paper to deal
with the multiple class labels problem. In this paper we adopt problem transformation for
multi-label classification. We use a hybrid evolutionary algorithm for optimization in the
generation of spatial association rules, which addresses single labels. MOGA is used to combine
the single labels into multi-labels under the conflicting objectives of predictive accuracy and
comprehensibility. Semi-supervised learning is done through the process of rule cover
clustering. Finally, an associative classifier is built with a sorting mechanism. The algorithm
is simulated and the results are compared with a MOGA-based associative classifier; the proposed
method outperforms the existing one.
Adaptive Classification of Imbalanced Data using ANN with Particle of Swarm O... (ijtsrd)
Conventional classification algorithms can be limited in their performance on highly imbalanced data sets. A popular stream of work for countering class imbalance has been the application of a variety of sampling strategies. In this correspondence, we focus on designing modifications to neural networks to appropriately handle the problem of class imbalance. We incorporate different rebalancing heuristics in ANN modeling, including cost-sensitive learning and over- and under-sampling. These ANN-based methods are compared with various state-of-the-art approaches on a variety of data sets using multiple metrics, including G-mean, area under the receiver operating characteristic curve, F-measure, and area under the precision-recall curve. Many common strategies, which can be categorized into sampling, cost-sensitive, or ensemble methods, incorporate heuristic and task-dependent procedures. In order to achieve better classification performance with a formulation free of heuristics and task dependence, we propose an RBF-based network (RBF-NN). Its objective function is the harmonic mean of various evaluation criteria derived from a confusion matrix, such criteria as sensitivity, positive predictive value, and their counterparts for negatives. This objective function and its optimization are consistently formulated in the framework of CM-KLOGR, based on minimum classification error and generalized probabilistic descent (MCE/GPD) learning. Owing to the benefits of the harmonic mean, CM-KLOGR, and MCE/GPD, RBF-NN improves these multifaceted performance measures in a well-balanced way. We present the formulation of RBF-NN and demonstrate its effectiveness through experiments that comparatively evaluate RBF-NN on benchmark imbalanced datasets. Nitesh Kumar | Dr.
Shailja Sharma, "Adaptive Classification of Imbalanced Data using ANN with Particle of Swarm Optimization", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3, Issue-5, August 2019. URL: https://www.ijtsrd.com/papers/ijtsrd25255.pdf. Paper URL: https://www.ijtsrd.com/computer-science/other/25255/adaptive-classification-of-imbalanced-data-using-ann-with-particle-of-swarm-optimization/nitesh-kumar
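One of the rebalancing heuristics the abstract lists, oversampling, is easy to show concretely. The sketch below is a generic random-oversampling routine (not the paper's specific pipeline): it duplicates minority-class examples until every class matches the majority count, after which any classifier, ANN or otherwise, sees a balanced training set.

```python
import random

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority-class examples until every class
    has as many examples as the majority class."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(v) for v in by_class.values())    # majority-class count
    X_out, y_out = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for xi in items + extra:
            X_out.append(xi)
            y_out.append(label)
    return X_out, y_out

# Toy 5:1 imbalanced data; after oversampling both classes have 5 examples.
X = [[0], [1], [2], [3], [4], [5]]
y = [0, 0, 0, 0, 0, 1]
Xb, yb = random_oversample(X, y)
```

Cost-sensitive learning, the other heuristic named above, achieves a similar effect without duplicating data, by weighting minority-class errors more heavily in the loss.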
STUDENTS’ PERFORMANCE PREDICTION SYSTEM USING MULTI AGENT DATA MINING TECHNIQUE (IJDKP)
High prediction accuracy of students' performance is helpful for identifying low-performing students at the beginning of the learning process. Data mining is used to attain this objective. Data mining techniques discover models or patterns in data and are very helpful in decision-making. Boosting is one of the most popular techniques for constructing ensembles of classifiers to improve classification accuracy. Adaptive Boosting (AdaBoost) is a generation of the boosting algorithm; it is used for
binary classification and is not directly applicable to multiclass classification. The SAMME boosting
technique extends AdaBoost to multiclass classification without reducing it to a set of binary sub-classifications. In this paper, a students' performance prediction system using multi-agent data mining is proposed to predict the performance of students from their data with high prediction accuracy, and to provide help to weak students through optimization rules. The proposed system has been implemented and evaluated by investigating the prediction accuracy of the AdaBoost.M1 and LogitBoost ensemble classifier methods and the C4.5 single-classifier method. The results show that the SAMME boosting technique improves prediction accuracy and outperforms the
C4.5 single classifier and LogitBoost.
Using ID3 Decision Tree Algorithm to the Student Grade Analysis and Prediction (ijtsrd)
Data mining techniques play an important role in data analysis. For the construction of a classification model that can predict the performance of students, particularly for engineering branches, a decision tree algorithm associated with data mining techniques has been used in this research. A number of factors may affect the performance of students, and data mining technology can relate them to student grades well. In this paper, we used educational data mining to predict students' final grades based on their performance, and we proposed student data classification using the ID3 (Iterative Dichotomiser 3) decision tree algorithm. Khin Khin Lay | San San Nwe, "Using ID3 Decision Tree Algorithm to the Student Grade Analysis and Prediction", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3, Issue-5, August 2019. URL: https://www.ijtsrd.com/papers/ijtsrd26545.pdf. Paper URL: https://www.ijtsrd.com/computer-science/data-miining/26545/using-id3-decision-tree-algorithm-to-the-student-grade-analysis-and-prediction/khin-khin-lay
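ID3's split criterion is information gain: the reduction in class entropy obtained by partitioning the examples on an attribute. A minimal sketch on invented grade data (the attribute names 'attendance' and 'hostel' are hypothetical, not from the paper's dataset):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a class-label list, in bits."""
    counts = Counter(labels)
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def information_gain(values, labels):
    """ID3 split criterion: entropy reduction from partitioning on an attribute."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical data: 'attendance' splits the grades cleanly, 'hostel' does not.
grade      = ["A", "A", "B", "B"]
attendance = ["high", "high", "low", "low"]
hostel     = ["yes", "no", "yes", "no"]
```

ID3 picks the attribute with the highest gain as the root split (here 'attendance', with gain 1 bit versus 0 for 'hostel') and recurses on each partition until the leaves are pure.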
Prediction of student performance has become an essential issue for improving the educational system. However, this has turned out to be a challenging task due to the huge quantity of data in the educational environment. Educational data mining is an emerging field that aims to develop techniques to manipulate and explore this sizable educational data. Classification is one of the primary educational data mining approaches and is the most widely used for predicting student performance and characteristics. In this work, three linear classification techniques (logistic regression, support vector machines (SVM), and stochastic gradient descent (SGD)) and three nonlinear classification methods (decision tree, random forest, and adaptive boosting (AdaBoost)) are explored and evaluated on a dataset from the ASSISTments system. A k-fold cross-validation method is used to evaluate the implemented techniques. The results demonstrate that the decision tree algorithm outperforms the other techniques, with an average accuracy of 0.7254, an average sensitivity of 0.8036 and an average specificity of 0.901. Furthermore, the importance of the utilized features is obtained and the system performance is computed using the most significant features. The results reveal that the best performance is reached using the first 80 important features, with accuracy, sensitivity and specificity of 0.7252, 0.8042 and 0.9016, respectively.
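The k-fold cross-validation used for evaluation above partitions the data into k disjoint folds and, in turn, holds each fold out for testing while training on the rest. A minimal index-level sketch (generic, not the paper's code):

```python
def kfold_indices(n, k):
    """Split range(n) into k disjoint folds; yield (train, test) index lists,
    one pair per fold, so every example is tested exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]   # round-robin assignment
    for i in range(k):
        test = folds[i]
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        yield train, test

# 10 examples, 5 folds: five train/test splits, each test fold of size 2.
splits = list(kfold_indices(10, 5))
```

Averaging a metric (accuracy, sensitivity, specificity) over the k held-out folds gives the "average" figures the abstract reports. In practice one would shuffle, and stratify by class, before assigning folds.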
A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni... (theijes)
Feature selection is considered a problem of global combinatorial optimization in machine learning, which reduces the number of features and removes irrelevant, noisy and redundant data. However, identification of useful features from hundreds or even thousands of related features is not an easy task. Selecting relevant genes from microarray data becomes even more challenging owing to the high dimensionality of the features, the multiple class categories involved and the usually small sample size. In order to improve prediction accuracy and to avoid incomprehensibility due to the number of features, different feature selection techniques can be implemented. This survey classifies and analyzes different approaches, aiming not only to provide a comprehensive presentation but also to discuss challenges and various performance parameters. The techniques are generally classified into three categories: filter, wrapper and hybrid.
Effective Feature Selection for Feature Possessing Group Structure (rahulmonikasharma)
Feature selection has become an interesting research topic in recent years. It is an effective method for tackling data with high dimensionality. Previous feature selection methods have ignored the underlying group structure and evaluated features individually. Considering this, we focus on the problem where features possess some group structure. To solve this problem, we present a group feature selection method that performs selection at the group level. Its objective is to perform feature selection both within groups and between groups of features, selecting discriminative features and removing redundant ones to obtain an optimal subset. We demonstrate our method on several data sets and evaluate the classification accuracy it achieves.
Research on multi-class imbalance by a number of researchers faces
obstacles in the form of poor data diversity and a large number of classifiers.
The Hybrid Approach Redefinition-Multiclass Imbalance (HAR-MI) method
is a hybrid ensemble method developed from the Hybrid
Approach Redefinition (HAR) method. This study compares the results
obtained with the Dynamic Ensemble Selection-Multiclass Imbalance
(DES-MI) method in handling multiclass imbalance. In the HAR-MI
method, the preprocessing stage is carried out using the random balance
ensembles method and dynamic ensemble selection to produce a candidate
ensemble, and the processing stage is carried out using different
contribution sampling and dynamic ensemble selection to produce
the final ensemble. This research was conducted using multi-class
imbalance datasets sourced from the KEEL repository. The results show that
the HAR-MI method can overcome multi-class imbalance with better data
diversity, a smaller number of classifiers, and better classifier performance
compared to the DES-MI method. These results were tested with a Wilcoxon
signed-rank statistical test, which confirmed the superiority of the HAR-MI
method with respect to the DES-MI method.
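The Wilcoxon signed-rank test used above compares two methods' paired per-dataset scores without assuming normality: it ranks the absolute score differences and checks whether the positive and negative signed-rank sums are too lopsided to be chance. A sketch of the test statistic W itself (the per-dataset scores below are invented for illustration; a real comparison would also convert W to a p-value):

```python
def wilcoxon_w(scores_a, scores_b):
    """Wilcoxon signed-rank statistic W: the smaller of the positive and
    negative signed-rank sums (zero differences dropped, tied |diffs| get
    their average rank)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    ranked = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(ranked):
        j = i
        while j + 1 < len(ranked) and abs(diffs[ranked[j + 1]]) == abs(diffs[ranked[i]]):
            j += 1
        avg = (i + j) / 2 + 1              # average rank for tied magnitudes
        for k in range(i, j + 1):
            ranks[ranked[k]] = avg
        i = j + 1
    w_pos = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_neg = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(w_pos, w_neg)

# Hypothetical per-dataset scores for the two methods being compared.
method_a = [0.82, 0.76, 0.91, 0.70, 0.88]
method_b = [0.78, 0.74, 0.85, 0.71, 0.80]
w = wilcoxon_w(method_a, method_b)
```

A small W (here nearly all rank mass sits on the positive side) is what lets a study claim one method's per-dataset wins are statistically systematic rather than noise.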
UTILIZING IMBALANCED DATA AND CLASSIFICATION COST MATRIX TO PREDICT MOVIE PRE... (ijaia)
In this paper, we propose a movie genre recommendation system based on imbalanced survey data and unequal classification costs for small and medium-sized enterprises (SMEs) who need a data-based, analytical approach to stocking favored movies and targeting marketing to young people. The dataset maintains a detailed personal profile as predictors, including demographic, behavioral and preference information for each user, as well as imbalanced genre preferences. These predictors do not include movie information such as actors or directors. The paper applies GentleBoost, AdaBoost and bagged tree ensembles as well as SVM machine learning algorithms to learn classification from one thousand observations and predict movie genre preferences with adjusted classification costs. The proposed recommendation system also selects important predictors to avoid overfitting and to shorten training time. This paper compares the test error among the above-mentioned algorithms when used to recommend different movie genres. The prediction power is also indicated by a comparison of precision and recall with other state-of-the-art recommendation systems. The proposed movie genre recommendation system addresses problems such as a small dataset, imbalanced responses, and unequal classification costs.
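The "adjusted classification costs" idea above has a standard concrete form: given a cost matrix, predict the class that minimizes expected misclassification cost instead of the most probable class. A hedged sketch (the probabilities and cost values are illustrative, not from the paper):

```python
def min_cost_class(probs, cost):
    """Pick the class with the lowest expected misclassification cost.
    cost[i][j] = cost of predicting class j when the true class is i;
    probs[i] = the model's posterior probability of class i."""
    n = len(cost[0])
    expected = [sum(probs[i] * cost[i][j] for i in range(len(probs)))
                for j in range(n)]
    return min(range(n), key=lambda j: expected[j])

# Two classes; class 1 (the rare, favored genre) is expensive to miss.
probs = [0.7, 0.3]                 # posterior for classes 0 and 1
equal_costs  = [[0, 1], [1, 0]]    # symmetric costs -> plain argmax
skewed_costs = [[0, 1], [5, 0]]    # missing class 1 costs 5x more
pred_equal = min_cost_class(probs, equal_costs)
pred_skewed = min_cost_class(probs, skewed_costs)
```

With symmetric costs the rule reduces to predicting the most probable class (here class 0); with the skewed matrix the same posterior flips the decision to the rare class, which is exactly how unequal costs counteract class imbalance.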
Heuristics for the Maximal Diversity Selection Problem (IJMER)
The problem of selecting k items from among a given set of N items such that the 'diversity'
among the k items is maximum is a classical problem with applications in many diverse areas such as
forming committees, jury selection, product testing, surveys, plant breeding, ecological preservation,
capital investment, etc. A suitably defined distance metric is used to determine the diversity. However,
this is a hard problem, and the optimal solution is computationally intractable. In this paper we present
an experimental evaluation of two approximation algorithms (heuristics) for the maximal diversity
selection problem.
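The abstract does not specify which two heuristics are evaluated; as a hedged illustration of the general approach, here is one classic greedy heuristic for the problem: seed the selection with the farthest pair, then repeatedly add the item whose minimum distance to the already-chosen set is largest.

```python
def greedy_max_diversity(items, dist, k):
    """Greedy max-min diversity heuristic: start from the farthest pair,
    then repeatedly add the item maximizing its minimum distance
    to the chosen set."""
    pairs = [(dist(a, b), a, b) for i, a in enumerate(items) for b in items[i + 1:]]
    _, a, b = max(pairs)                   # seed with the farthest pair
    chosen = [a, b]
    while len(chosen) < k:
        rest = [x for x in items if x not in chosen]
        nxt = max(rest, key=lambda x: min(dist(x, c) for c in chosen))
        chosen.append(nxt)
    return chosen

# Points on a line in two clusters; k = 3 picks the extremes plus a spread point.
points = [0, 1, 2, 8, 9, 10]
picked = greedy_max_diversity(points, lambda a, b: abs(a - b), 3)
```

Like all such heuristics it gives no optimality guarantee on the exact objective, which is why papers in this area evaluate them experimentally rather than analytically.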
An overview on data mining designed for imbalanced datasets (eSAT Journals)
Abstract: In imbalanced datasets, the classification categories are not approximately equally represented. The imbalanced dataset problem occurs in classification when the number of instances of one class is much smaller than the number of instances of the other classes. Recent years have brought increased awareness through the application of machine learning methods to complex real-world tasks, many of which are characterized by imbalanced data. Imbalanced datasets have become a critical problem in machine learning and are commonly found in many applications such as fraudulent call detection, biomedicine, engineering, remote sensing, computer security and manufacturing industries. Several approaches have been proposed to overcome these problems. This paper studies the imbalanced dataset problem, examines various sampling methods used in the evaluation of such datasets, and discusses which interpretation methods are more suitable for imbalanced dataset mining. Keywords: imbalance problems, imbalanced datasets, sampling strategies, machine learning.
An approach for improved students’ performance prediction using homogeneous ... (IJECEIAES)
Web-based learning technologies of educational institutions store a massive amount of interaction data which can be helpful for predicting students' performance with the aid of machine learning algorithms. Accordingly, various researchers have focused on studying ensemble learning methods, as they are known to improve the predictive accuracy of traditional classification algorithms. This study proposed an approach for enhancing the performance prediction of different single classification algorithms by using them as base classifiers of homogeneous ensembles (bagging and boosting) and heterogeneous ensembles (voting and stacking). The model utilized various single classifiers such as multilayer perceptron or neural networks (NN), random forest (RF), naïve Bayes (NB), J48, JRip, OneR, logistic regression (LR), k-nearest neighbor (KNN), and support vector machine (SVM) to determine the base classifiers of the ensembles. In addition, the study made use of the University of California Irvine (UCI) open-access student dataset to predict students' performance. The comparative analysis of the models' accuracy showed that the best-performing single classifier's accuracy increased further from 93.10% to 93.68% when used as a base classifier of a voting ensemble method. Moreover, results in this study showed that the voting heterogeneous ensemble performed slightly better than the bagging and boosting homogeneous ensemble methods.
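The voting ensemble that performed best above has a very simple core in its hard-voting form: for each sample, take the most common label among the base classifiers' predictions. A minimal sketch (the base-classifier outputs below are invented for illustration):

```python
from collections import Counter

def majority_vote(predictions):
    """Hard-voting ensemble: per sample, return the most common label
    among the base classifiers' predictions."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*predictions)]

# Hypothetical predictions from three base classifiers on four students.
nn  = ["pass", "fail", "pass", "pass"]
rf  = ["pass", "fail", "fail", "pass"]
svm = ["fail", "fail", "pass", "pass"]
ensemble = majority_vote([nn, rf, svm])
```

Soft voting instead averages the base classifiers' class probabilities before taking the argmax; either way the ensemble can only beat its best member when the base classifiers make reasonably uncorrelated errors, which is why diverse classifier families (NN, RF, NB, SVM, ...) are chosen as bases.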
Analysis of Imbalanced Classification Algorithms: A Perspective View (ijtsrd)
Classification of data has become an important research area; it concerns the process of classifying documents into predefined categories. Unbalanced data sets, a problem often found in real-world applications, can have a seriously negative effect on the classification performance of machine learning algorithms. There have been many attempts at dealing with the classification of unbalanced data sets. In this paper we present a brief review of existing solutions to the class-imbalance problem proposed at both the data and algorithmic levels. Even though a common practice for handling imbalanced data is to rebalance it artificially by oversampling and/or under-sampling, some researchers have shown that modified support vector machines, rough-set-based minority-class-oriented rule learning methods, and cost-sensitive classifiers perform well on imbalanced data sets. We observe that current research on the imbalanced data problem is moving toward hybrid algorithms. Priyanka Singh | Prof. Avinash Sharma, "Analysis of Imbalanced Classification Algorithms: A Perspective View", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3, Issue-2, February 2019. URL: https://www.ijtsrd.com/papers/ijtsrd21574.pdf
Paper URL: https://www.ijtsrd.com/engineering/computer-engineering/21574/analysis-of-imbalanced-classification-algorithms-a-perspective-view/priyanka-singh
Analytical Hierarchical Process has been used as a useful methodology for multi-criteria decision making environments with substantial applications in recent years. But the weakness of the traditional AHP method lies in the use of subjective judgement based assessment and standardized scale for pairwise comparison matrix creation. The paper proposes a Condorcet Voting Theory based AHP method to solve multi criteria decision making problems where Analytical Hierarchy Process (AHP) is combined with Condorcet theory based preferential voting technique followed by a quantitative ratio method for framing the comparison matrix instead of the standard importance scale in traditional AHP approach. The consistency ratio (CR) is calculated for both the approaches to determine and compare the consistency of both the methods. The results reveal Condorcet- AHP method to be superior generating lower consistency ratio and more accurate ranking of the criterion for solving MCDM problems.
Feature selection is one of the most fundamental steps in machine learning. It is closely related to
dimensionality reduction. A commonly used approach in feature selection is ranking the individual
features according to some criteria and then search for an optimal feature subset based on an evaluation
criterion to test the optimality. The objective of this work is to predict more accurately the presence of
Learning Disability (LD) in school-aged children with reduced number of symptoms. For this purpose, a
novel hybrid feature selection approach is proposed by integrating a popular Rough Set based feature
ranking process with a modified backward feature elimination algorithm. The process of feature ranking
follows a method of calculating the significance or priority of each symptoms of LD as per their
contribution in representing the knowledge contained in the dataset. Each symptoms significance or
priority values reflect its relative importance to predict LD among the various cases. Then by eliminating
least significant features one by one and evaluating the feature subset at each stage of the process, an
optimal feature subset is generated. For comparative analysis and to establish the importance of rough set
theory in feature selection, the backward feature elimination algorithm is combined with two state-of-theart
filter based feature ranking techniques viz. information gain and gain ratio. The experimental results
show the proposed feature selection approach outperforms the other two in terms of the data reduction.
Also, the proposed method eliminates all the redundant attributes efficiently from the LD dataset without
sacrificing the classification performance.
Correlation based feature selection (cfs) technique to predict student perfro...IJCNCJournal
Education data mining is an emerging stream which helps in mining academic data for solving various
types of problems. One of these problems is the selection of a proper academic track. The admission of a
student to an engineering college depends on many factors. In this paper we have tried to implement a
classification technique to assist students in predicting their success in admission to an engineering
stream. We have analyzed a data set containing information about students' academic as well as
socio-demographic variables, with attributes such as family pressure, interest, gender, XII marks, and CET rank
in entrance examinations, together with historical data of previous batches of students. Feature selection is a
process for removing irrelevant and redundant features which will help improve the predictive accuracy of
classifiers. In this paper we have first used the feature selection attribute algorithms Chi-square, InfoGain, and
GainRatio to find the relevant features. Then we have applied a fast correlation-based filter on the given
features. Later, classification is done using NBTree, MultilayerPerceptron, NaiveBayes, and instance-based
K-nearest neighbor. Results showed a reduction in computational cost and time and an increase in predictive
accuracy for the student model.
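One of the ranking measures named above, the chi-square statistic between a categorical feature and the class, can be sketched as follows (an illustrative implementation, not the paper's code):

```python
from collections import Counter

def chi_square(feature_vals, labels):
    """Chi-square statistic between a categorical feature and class labels:
    sum over contingency-table cells of (observed - expected)^2 / expected,
    where expected counts assume feature and class are independent."""
    n = len(labels)
    obs = Counter(zip(feature_vals, labels))
    f_marg = Counter(feature_vals)
    c_marg = Counter(labels)
    stat = 0.0
    for f in f_marg:
        for c in c_marg:
            expected = f_marg[f] * c_marg[c] / n
            stat += (obs.get((f, c), 0) - expected) ** 2 / expected
    return stat
```

A feature independent of the class scores near 0; a feature perfectly predictive of the class scores high, so ranking by this statistic keeps the relevant attributes.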
Multi Label Spatial Semi Supervised Classification using Spatial Associative ...cscpconf
Multi-label spatial classification based on association rules with multi-objective genetic
algorithms (MOGA), enriched by semi-supervised learning, is proposed in this paper to deal
with the multiple-class-labels problem. We adapt problem transformation for multi-label
classification. We use a hybrid evolutionary algorithm for the optimization in the generation
of spatial association rules, which addresses single labels. MOGA is used to combine the single
labels into multi-labels under the conflicting objectives of predictive accuracy and
comprehensibility. Semi-supervised learning is done through the process of rule cover
clustering. Finally, an associative classifier is built with a sorting mechanism. The algorithm is
simulated and the results are compared with a MOGA-based associative classifier; the proposed
method outperforms the existing one.
Adaptive Classification of Imbalanced Data using ANN with Particle of Swarm O...ijtsrd
Conventional classification algorithms can be limited in their performance on highly imbalanced data sets. A popular stream of work for countering the effects of class imbalance has been the use of a variety of sampling strategies. In this work, we focus on modifying neural networks to appropriately handle the problem of class imbalance. We incorporate different rebalancing heuristics into ANN modeling, including cost-sensitive learning and over- and under-sampling. These ANN-based methods are compared with various state-of-the-art approaches on an assortment of data sets using several metrics, including G-mean, area under the receiver operating characteristic curve, F-measure, and area under the precision-recall curve. Many common strategies, which can be categorized into sampling, cost-sensitive, or ensemble approaches, incorporate heuristic and task-dependent procedures. In order to achieve better classification performance with a formulation free of heuristics and task dependence, an RBF-based network (RBF-NN) is proposed. Its objective function is the harmonic mean of various evaluation criteria derived from a confusion matrix, such criteria as sensitivity, positive predictive value, and the corresponding criteria for negatives. This objective function and its optimization are consistently formulated in the framework of CM-KLOGR, based on minimum classification error and generalized probabilistic descent (MCE/GPD) learning. Owing to the benefits of the harmonic mean, CM-KLOGR, and MCE/GPD, RBF-NN improves the multifaceted performance in a well-balanced way. The formulation of RBF-NN and its effectiveness are shown through experiments that comparatively evaluated RBF-NN using benchmark imbalanced datasets. Nitesh Kumar | Dr.
Shailja Sharma "Adaptive Classification of Imbalanced Data using ANN with Particle of Swarm Optimization" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3 | Issue-5, August 2019, URL: https://www.ijtsrd.com/papers/ijtsrd25255.pdf Paper URL: https://www.ijtsrd.com/computer-science/other/25255/adaptive-classification-of-imbalanced-data-using-ann-with-particle-of-swarm-optimization/nitesh-kumar
STUDENTS’ PERFORMANCE PREDICTION SYSTEM USING MULTI AGENT DATA MINING TECHNIQUEIJDKP
High prediction accuracy of students' performance is helpful for identifying low-performing students at the beginning of the learning process, and data mining is used to attain this objective. Data mining techniques are used to discover models or patterns in data, which is very helpful in decision-making. Boosting is among the most popular techniques for constructing ensembles of classifiers to improve classification accuracy. Adaptive Boosting (AdaBoost) is a well-known boosting algorithm; it is used for
binary classification and is not directly applicable to multiclass classification. The SAMME boosting
technique extends AdaBoost to multiclass classification without reducing it to a set of binary sub-problems. In this paper, a students' performance prediction system using Multi Agent Data Mining is proposed to predict the performance of students based on their data with high prediction accuracy and to provide help to low-performing students through optimized rules. The proposed system has been implemented and evaluated by investigating the prediction accuracy of the AdaBoost.M1 and LogitBoost ensemble classifier methods and the C4.5 single classifier method. The results show that using the SAMME boosting technique improves the prediction accuracy and outperforms the
C4.5 single classifier and LogitBoost.
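The key change SAMME makes to AdaBoost's estimator weight is an extra log(K−1) term, which relaxes the weak-learner requirement from better-than-1/2 accuracy to better-than-random (1/K) accuracy for K classes. A minimal sketch of that weight:

```python
import math

def samme_alpha(err, n_classes):
    """SAMME estimator weight: AdaBoost's log((1 - err) / err) plus an extra
    log(K - 1) term. The weight is positive whenever the weak learner beats
    random guessing (error < 1 - 1/K), which is what makes boosting usable
    for K > 2 classes without reducing to binary sub-problems."""
    return math.log((1.0 - err) / err) + math.log(n_classes - 1)
```

For K = 2 the extra term vanishes and the weight reduces to classic AdaBoost; for K = 3 a 60% error rate (still better than the 2/3 random baseline) still earns a positive weight.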
Using ID3 Decision Tree Algorithm to the Student Grade Analysis and Predictionijtsrd
Data mining techniques play an important role in data analysis. For the construction of a classification model that can predict the performance of students, particularly in engineering branches, a decision tree algorithm from the data mining toolbox has been used in this research. A number of factors may affect the performance of students, and data mining technology can relate these factors to student grades; we also used classification algorithms for prediction. In this paper, we used educational data mining to predict students' final grades based on their performance. We proposed student data classification using the ID3 (Iterative Dichotomiser 3) decision tree algorithm. Khin Khin Lay | San San Nwe "Using ID3 Decision Tree Algorithm to the Student Grade Analysis and Prediction" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3 | Issue-5, August 2019, URL: https://www.ijtsrd.com/papers/ijtsrd26545.pdf Paper URL: https://www.ijtsrd.com/computer-science/data-miining/26545/using-id3-decision-tree-algorithm-to-the-student-grade-analysis-and-prediction/khin-khin-lay
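ID3 chooses splits by information gain: the reduction in class entropy after partitioning on a feature. A small illustrative sketch (not the paper's code):

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    counts = {}
    for l in labels:
        counts[l] = counts.get(l, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def information_gain(feature_vals, labels):
    """ID3 split criterion: entropy of the labels minus the weighted
    entropy of the label partitions induced by the feature's values."""
    n = len(labels)
    parts = {}
    for v, l in zip(feature_vals, labels):
        parts.setdefault(v, []).append(l)
    remainder = sum(len(p) / n * entropy(p) for p in parts.values())
    return entropy(labels) - remainder
```

ID3 recursively picks the feature with the highest gain at each node; a feature that perfectly separates the grades has gain equal to the full class entropy, while an irrelevant one has gain zero.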
Prediction of student performance has become an essential issue for improving the educational system. However, this has turned out to be a challenging task due to the huge quantity of data in the educational environment. Educational data mining is an emerging field that aims to develop techniques to manipulate and explore sizable educational data. Classification is one of the primary educational data mining methods and the most widely used for predicting student performance and characteristics. In this work, three linear classification techniques (logistic regression, support vector machines (SVM), and stochastic gradient descent (SGD)) and three nonlinear classification methods (decision tree, random forest, and adaptive boosting (AdaBoost)) are explored and evaluated on a dataset from the ASSISTment system. A k-fold cross-validation method is used to evaluate the implemented techniques. The results demonstrate that the decision tree algorithm outperforms the other techniques, with an average accuracy of 0.7254, an average sensitivity of 0.8036, and an average specificity of 0.901. Furthermore, the importance of the utilized features is obtained and the system performance is computed using the most significant features. The results reveal that the best performance is reached using the first 80 important features, with accuracy, sensitivity, and specificity of 0.7252, 0.8042, and 0.9016, respectively.
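The k-fold cross-validation mentioned above partitions the samples into k folds, training on k−1 folds and testing on the held-out one. A minimal index-splitting sketch (interleaved assignment; real implementations usually shuffle or stratify first):

```python
def kfold_indices(n_samples, k):
    """Yield (train, test) index lists for k-fold cross-validation.
    Folds are assigned by interleaving indices; each sample appears in
    exactly one test fold."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test
```

Averaging a metric over the k test folds gives the "average accuracy / sensitivity / specificity" figures quoted in the abstract.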
A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni...theijes
Feature selection is considered as a problem of global combinatorial optimization in machine learning, which reduces the number of features, removes irrelevant, noisy and redundant data. However, identification of useful features from hundreds or even thousands of related features is not an easy task. Selecting relevant genes from microarray data becomes even more challenging owing to the high dimensionality of features, multiclass categories involved and the usually small sample size. In order to improve the prediction accuracy and to avoid incomprehensibility due to the number of features different feature selection techniques can be implemented. This survey classifies and analyzes different approaches, aiming to not only provide a comprehensive presentation but also discuss challenges and various performance parameters. The techniques are generally classified into three; filter, wrapper and hybrid.
Effective Feature Selection for Feature Possessing Group Structurerahulmonikasharma
Feature selection has become an interesting research topic in recent years. It is an effective method to tackle the data with high dimension. The underlying structure has been ignored by the previous feature selection method and it determines the feature individually. Considering this we focus on the problem where feature possess some group structure. To solve this problem we present group feature selection method at group level to execute feature selection. Its objective is to execute the feature selection in within the group and between the group of features that select discriminative features and remove redundant features to obtain optimal subset. We demonstrate our method on data sets and perform the task to achieve classification accuracy.
Research on multi-class imbalance from a number of researchers faces
obstacles in the form of poor data diversity and a large number of classifiers.
The Hybrid Approach Redefinition-Multiclass Imbalance (HAR-MI) method
is a hybrid ensemble method which is a development of the Hybrid
Approach Redefinition (HAR) method. This study has compared the results
obtained with the Dynamic Ensemble Selection-Multiclass Imbalance
(DES-MI) method in handling multiclass imbalance. In the HAR-MI
method, the preprocessing stage was carried out using the random balance
ensembles method and dynamic ensemble selection to produce a candidate
ensemble, and the processing stage was carried out using different
contribution sampling and dynamic ensemble selection. This research has
been conducted using multi-class imbalance datasets sourced from the
KEEL Repository. The results show that the HAR-MI method can overcome
multi-class imbalance with better data diversity, a smaller number of
classifiers, and better classifier performance compared to the DES-MI
method. These results were tested with a Wilcoxon signed-rank statistical
test, which showed the superiority of the HAR-MI method with respect to
the DES-MI method.
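The Wilcoxon signed-rank test used above compares two methods' paired scores across datasets. A sketch of the test statistic W (the smaller of the signed-rank sums; zero differences are dropped and tied magnitudes get average ranks — a simplified illustration, not the paper's code):

```python
def wilcoxon_signed_rank(x, y):
    """Wilcoxon signed-rank statistic W for paired samples x, y.
    Ranks |x_i - y_i| ascending (average ranks for ties), sums ranks of
    positive and negative differences, and returns the smaller sum."""
    diffs = [a - b for a, b in zip(x, y) if a != b]   # drop zero differences
    ranked = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(ranked):
        j = i
        while j + 1 < len(ranked) and abs(diffs[ranked[j + 1]]) == abs(diffs[ranked[i]]):
            j += 1                                    # extend tie group
        avg_rank = (i + j) / 2 + 1                    # 1-based average rank
        for t in range(i, j + 1):
            ranks[ranked[t]] = avg_rank
        i = j + 1
    w_pos = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_neg = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return min(w_pos, w_neg)
```

A small W (one method wins on almost every dataset) is what licenses the "superiority" conclusion after comparison with the critical value for the given sample size.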
UTILIZING IMBALANCED DATA AND CLASSIFICATION COST MATRIX TO PREDICT MOVIE PRE...ijaia
In this paper, we propose a movie genre recommendation system based on imbalanced survey data and unequal classification costs for small and medium-sized enterprises (SMEs) that need a data-based, analytical approach to stock favored movies and target marketing to young people. The dataset maintains a detailed personal profile as predictors, including demographic, behavioral, and preference information for each user, as well as imbalanced genre preferences. These predictors do not include movie information such as actors or directors. The paper applies GentleBoost, AdaBoost, and bagged tree ensembles as well as SVM machine learning algorithms to learn classification from one thousand observations and predict movie genre preferences with adjusted classification costs. The proposed recommendation system also selects important predictors to avoid overfitting and to shorten training time. This paper compares the test error among the above-mentioned algorithms used to recommend different movie genres. The prediction power is also indicated in a comparison of precision and recall with other state-of-the-art recommendation systems. The proposed movie genre recommendation system addresses problems such as a small dataset, imbalanced response, and unequal classification costs.
Heuristics for the Maximal Diversity Selection ProblemIJMER
The problem of selecting k items from among a given set of N items such that the ‘diversity’
among the k items is maximized is a classical problem with applications in many diverse areas such as
forming committees, jury selection, product testing, surveys, plant breeding, ecological preservation,
capital investment, etc. A suitably defined distance metric is used to determine the diversity. However,
this is a hard problem, and finding the optimal solution is computationally intractable. In this paper we present
an experimental evaluation of two approximation algorithms (heuristics) for the maximal diversity
selection problem.
An overview on data mining designed for imbalanced datasetseSAT Journals
Abstract: In imbalanced datasets, the classification categories are not approximately equally represented. The imbalanced dataset problem occurs in classification when the number of examples of one class is much smaller than the number of examples of the other classes. Recent years have brought increased interest in applying machine learning methods to complex real-world problems, many of which are characterized by imbalanced data. Imbalanced datasets have become a critical problem in machine learning and are commonly found in many applications such as fraudulent call detection, bio-medicine, engineering, remote sensing, computer security, and manufacturing industries. Several approaches have been proposed to overcome these problems. This paper presents a study of the imbalanced dataset problem, examines various sampling methods used in evaluating such datasets, and discusses which interpretation methods are most suitable for imbalanced dataset mining. Keywords: Imbalance Problems, Imbalanced datasets, sampling strategies, Machine Learning.
An approach for improved students’ performance prediction using homogeneous ...IJECEIAES
Web-based learning technologies of educational institutions store a massive amount of interaction data which can be helpful to predict students’ performance through the aid of machine learning algorithms. With this, various researchers focused on studying ensemble learning methods as it is known to improve the predictive accuracy of traditional classification algorithms. This study proposed an approach for enhancing the performance prediction of different single classification algorithms by using them as base classifiers of homogeneous ensembles (bagging and boosting) and heterogeneous ensembles (voting and stacking). The model utilized various single classifiers such as multilayer perceptron or neural networks (NN), random forest (RF), naïve Bayes (NB), J48, JRip, OneR, logistic regression (LR), k-nearest neighbor (KNN), and support vector machine (SVM) to determine the base classifiers of the ensembles. In addition, the study made use of the University of California Irvine (UCI) open-access student dataset to predict students’ performance. The comparative analysis of the model’s accuracy showed that the best-performing single classifier’s accuracy increased further from 93.10% to 93.68% when used as a base classifier of a voting ensemble method. Moreover, results in this study showed that voting heterogeneous ensemble performed slightly better than bagging and boosting homogeneous ensemble methods.
Analysis of Imbalanced Classification Algorithms A Perspective Viewijtsrd
Classification of data has become an important research area. The process of classifying documents into predefined categories often faces unbalanced data sets, a problem common in real-world applications that can have a seriously negative effect on the classification performance of machine learning algorithms. There have been many attempts at dealing with the classification of unbalanced data sets. In this paper we present a brief review of existing solutions to the class-imbalance problem, proposed at both the data and algorithmic levels. Even though a common practice to handle imbalanced data is to rebalance it artificially by oversampling and/or under-sampling, some researchers have shown that modified support vector machines, rough-set-based minority-class-oriented rule learning methods, and cost-sensitive classifiers perform well on imbalanced data sets. We observe that current research on the imbalanced data problem is moving toward hybrid algorithms. Priyanka Singh | Prof. Avinash Sharma "Analysis of Imbalanced Classification Algorithms: A Perspective View" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3 | Issue-2, February 2019, URL: https://www.ijtsrd.com/papers/ijtsrd21574.pdf
Paper URL: https://www.ijtsrd.com/engineering/computer-engineering/21574/analysis-of-imbalanced-classification-algorithms-a-perspective-view/priyanka-singh
Multi Criteria Decision Making Methodology on Selection of a Student for All ...ijtsrd
Selecting a student for an all-round excellence award is based on a complex, elaborate combination of abilities and skills. A multi-criteria decision-making method, AHP, is used to help make decisions consistently by performing a pairwise comparison matrix process between criteria based on selected alternatives and determining the priority order of the criteria and alternatives used. The results of these calculations are used to determine the outstanding student receiving a scholarship based on the final results of the AHP method calculation. The results demonstrated that the student ranking is most strongly influenced by the relative importance of management, leadership, and motivation, followed by the sub-criteria education, cooperation, innovation, discipline, attendance, knowledge, sports activity, social activity, and awards. Kyi Kyi Mynit "Multi Criteria Decision Making Methodology on Selection of a Student for All Round Excellent Award" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3 | Issue-5, August 2019, URL: https://www.ijtsrd.com/papers/ijtsrd26428.pdf Paper URL: https://www.ijtsrd.com/management/research-method/26428/multi-criteria-decision-making-methodology-on-selection-of-a-student-for-all-round-excellent-award/kyi-kyi-mynit
Evaluation metric plays a critical role in achieving the optimal classifier during the classification training.
Thus, a selection of suitable evaluation metric is an important key for discriminating and obtaining the
optimal classifier. This paper systematically reviewed the related evaluation metrics that are specifically
designed as a discriminator for optimizing generative classifier. Generally, many generative classifiers
employ accuracy as a measure to discriminate the optimal solution during the classification training.
However, accuracy has several weaknesses: low distinctiveness, low discriminability, low
informativeness, and bias toward majority-class data. This paper also briefly discusses other metrics that are
specifically designed for discriminating the optimal solution. The shortcomings of these alternative metrics
are also discussed. Finally, this paper suggests five important aspects that must be taken into consideration
in constructing a new discriminator metric.
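The majority-class bias of accuracy criticized above is easy to demonstrate with confusion-matrix metrics such as sensitivity, specificity, and their geometric mean (G-mean):

```python
import math

def imbalance_metrics(tp, fn, fp, tn):
    """Confusion-matrix metrics that, unlike plain accuracy, are not
    dominated by the majority class."""
    sensitivity = tp / (tp + fn)        # recall on the positive (rare) class
    specificity = tn / (tn + fp)        # recall on the negative class
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "g_mean": math.sqrt(sensitivity * specificity),
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }
```

On a 95:5 imbalanced set, a degenerate classifier that predicts only the majority class scores 95% accuracy yet a G-mean of zero, which is exactly the discriminability failure the review describes.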
PROBABILITY BASED CLUSTER EXPANSION OVERSAMPLING TECHNIQUE FOR IMBALANCED DATAcscpconf
In many applications of data mining, class imbalance is noticed when examples in one class are
overrepresented. Traditional classifiers result in poor accuracy of the minority class due to the
class imbalance. Further, the presence of within class imbalance where classes are composed of
multiple sub-concepts with different number of examples also affect the performance of
classifier. In this paper, we propose an oversampling technique that handles between class and
within class imbalance simultaneously and also takes into consideration the generalization
ability in data space. The proposed method is based on two steps: performing model-based
clustering with respect to the classes to identify the sub-concepts, and then computing the
separating hyperplane based on equal posterior probability between the classes. The proposed
method is tested on 10 publicly available data sets and the result shows that the proposed
method is statistically superior to other existing oversampling methods.
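The second step, a boundary of equal posterior probability between classes, can be sketched in a deliberately simplified setting: one-dimensional Gaussian class models with equal priors (an illustration of the idea, not the paper's multivariate implementation):

```python
import math

def equal_posterior_boundary(mu1, var1, mu2, var2):
    """Point x where two 1-D Gaussian class posteriors are equal (equal
    priors), i.e. N(x; mu1, var1) = N(x; mu2, var2). With equal variances
    this reduces to the midpoint of the two means; otherwise it solves the
    quadratic from equating the log-densities."""
    if var1 == var2:
        return (mu1 + mu2) / 2.0
    a = 1.0 / var2 - 1.0 / var1
    b = 2.0 * (mu1 / var1 - mu2 / var2)
    c = mu2 ** 2 / var2 - mu1 ** 2 / var1 + math.log(var2 / var1)
    disc = math.sqrt(b * b - 4 * a * c)
    roots = [(-b + disc) / (2 * a), (-b - disc) / (2 * a)]
    lo, hi = min(mu1, mu2), max(mu1, mu2)
    for r in roots:                      # prefer the boundary between the means
        if lo <= r <= hi:
            return r
    return roots[0]
```

Oversampling up to such a boundary, rather than up to the nearest minority example, is what gives the method its generalization ability in data space.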
Incremental learning from unbalanced data with concept class, concept drift a...IJDKP
Recently, stream data mining applications have drawn vital attention from several research communities.
Stream data is a continuous form of data distinguished by its online nature. Traditionally, the machine
learning field has developed learning algorithms that make certain assumptions about the underlying
distribution of the data, such as the data following a predetermined distribution. Such constraints on the problem
domain led the way for the development of learning algorithms whose performance is theoretically verifiable.
Real-world situations differ from this restricted model. Applications usually suffer from problems
such as unbalanced data distributions. Additionally, data drawn from non-stationary environments are also
usual in real-world applications, resulting in the "concept drift" associated with data stream
examples. These issues have been addressed separately by researchers, and it is observed that the joint
problem of class imbalance and concept drift has received relatively little research. If the final objective of
intelligent machine learning techniques is to address a broad spectrum of real-world applications,
then the necessity of a universal framework for learning from, and adapting to, environments
where concept drift may occur and unbalanced data distributions are present can hardly be exaggerated.
In this paper, we first present an overview of the issues observed in stream data mining scenarios,
followed by a complete review of recent research on dealing with each of these issues.
A class skew-insensitive ACO-based decision tree algorithm for imbalanced dat...nooriasukmaningtyas
Ant-tree-miner (ATM) has an advantage over conventional decision tree algorithms in terms of feature selection. However, real-world applications commonly involve the imbalanced class problem, where the classes have different importance. This condition impedes the entropy-based heuristic of the existing ATM algorithm from developing effective decision boundaries due to its bias towards the dominant class. Consequently, the induced decision trees are dominated by the majority class and lack predictive ability on the rare class. This study proposed an enhanced algorithm called hellinger-ant-tree-miner (HATM), inspired by the ant colony optimization (ACO) metaheuristic, for imbalanced learning using a decision tree classification algorithm. The proposed algorithm was compared to the existing algorithm, ATM, on nine (9) publicly available imbalanced data sets. The simulation study reveals the superiority of HATM when the sample size increases with skewed classes (imbalance ratio < 50%). Experimental results demonstrate that the performance of the existing algorithm measured by BACC has been improved due to the class skew-insensitivity of the Hellinger distance. The statistical significance test shows that HATM has a higher mean BACC score than ATM.
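The Hellinger distance underlying HATM compares within-class-normalized distributions of a feature's values, so the score is unaffected by how many examples each class has — the skew-insensitivity property the abstract refers to. An illustrative sketch (not the paper's implementation):

```python
import math

def hellinger_distance(pos_counts, neg_counts):
    """Hellinger distance between the positive- and negative-class
    distributions of a feature's values, given as {value: count} dicts.
    Each class is normalized by its own total, so the class ratio
    cancels out of the score."""
    p_total = sum(pos_counts.values())
    n_total = sum(neg_counts.values())
    values = set(pos_counts) | set(neg_counts)
    s = 0.0
    for v in values:
        s += (math.sqrt(pos_counts.get(v, 0) / p_total)
              - math.sqrt(neg_counts.get(v, 0) / n_total)) ** 2
    return math.sqrt(s)
```

A split whose value distributions are identical across classes scores 0 regardless of a 50:1 class ratio, while fully separating distributions score the maximum sqrt(2); entropy-based gain, by contrast, shrinks as the minority class shrinks.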
Relationships between diversity of classification ensembles and single classIEEEFINALYEARPROJECTS
A SURVEY OF METHODS FOR HANDLING DISK DATA IMBALANCEIJCI JOURNAL
Class imbalance exists in many classification problems, and since classifiers are typically optimized for
accuracy, imbalance among data classes can lead to classification challenges, with a few classes having higher
misclassification costs. The Backblaze dataset, a widely used dataset related to hard disks, has a small
amount of failure data and a large amount of healthy-drive data, which exhibits a serious class imbalance. This
paper provides a comprehensive overview of research in the field of imbalanced data classification. The
discussion is organized into three main aspects: data-level methods, algorithm-level methods, and hybrid
methods. For each type of method, we summarize and analyze the existing problems, algorithmic ideas,
strengths, and weaknesses. Additionally, the challenges of imbalanced data classification are discussed,
along with strategies to address them. This makes it convenient for researchers to choose the appropriate
method according to their needs.
Association rule discovery for student performance prediction using metaheuri...csandit
Following the increasing use of data mining techniques for improving educational system
operations, Educational Data Mining has been introduced as a new and fast-growing research
area. Educational Data Mining aims to analyze data in educational environments in order to
solve educational research problems. In this paper a new associative classification technique
has been proposed to predict students' final performance. In contrast to several machine learning
approaches such as ANNs, SVMs, etc., associative classifiers maintain interpretability along
with high accuracy. In this research work, we have employed Honeybee Colony Optimization
and Particle Swarm Optimization to extract association rules for student performance prediction
as a multi-objective classification problem. Results indicate that the proposed swarm-based
algorithm outperforms well-known classification techniques on the student performance prediction
classification problem.
ASSOCIATION RULE DISCOVERY FOR STUDENT PERFORMANCE PREDICTION USING METAHEURI...cscpconf
Following the increasing use of data mining techniques for improving educational system
operations, Educational Data Mining has been introduced as a new and fast-growing research
area. Educational Data Mining aims to analyze data in educational environments in order to
solve educational research problems. In this paper a new associative classification technique
has been proposed to predict students' final performance. In contrast to several machine learning
approaches such as ANNs, SVMs, etc., associative classifiers maintain interpretability along
with high accuracy. In this research work, we have employed Honeybee Colony Optimization
and Particle Swarm Optimization to extract association rules for student performance prediction
as a multi-objective classification problem. Results indicate that the proposed swarm-based
algorithm outperforms well-known classification techniques on the student performance prediction
classification problem.
A New Active Learning Technique Using Furthest Nearest Neighbour Criterion fo...ijcsa
Active learning is a supervised learning method based on the idea that a machine learning algorithm can achieve greater accuracy with fewer labeled training images if it is allowed to choose the images from which it learns. Facial age classification is a technique to classify face images into one of several predefined age groups. The proposed study applies an active learning approach to facial age classification which allows a classifier to select the data from which it learns. The classifier is initially trained using a small pool of labeled training images; this is achieved by using bilateral two-dimensional linear discriminant analysis. Then the most informative unlabeled image is selected from the unlabeled pool using the furthest nearest neighbour criterion, labeled by the user, and added to the
appropriate class in the training set. Incremental learning is performed using an incremental version of bilateral two-dimensional linear discriminant analysis. This active learning paradigm is proposed to be applied to the k-nearest neighbour classifier and the support vector machine classifier, and the performance of these two classifiers is compared.
INCREMENTAL SEMI-SUPERVISED CLUSTERING METHOD USING NEIGHBOURHOOD ASSIGNMENTIJCSEA Journal
Semi-supervised clustering aims to enhance clustering performance by incorporating user supervision in the
form of pairwise constraints. In this paper, we study the active learning problem of selecting pairwise
must-link and cannot-link constraints for semi-supervised clustering. We consider active learning
in an iterative manner where, in each iteration, queries are selected based on the current clustering
solution and the existing constraint set. We apply a general framework that builds on the concept of
neighbourhoods, where neighbourhoods contain "labeled examples" of different clusters according to
the pairwise constraints. Our active learning method expands the neighbourhoods by selecting informative
points and querying their membership in the neighbourhoods. Under this framework, we build on the classic
uncertainty-based principle and present a novel approach for computing the uncertainty associated with each
data point. We further present a selection criterion that trades off the amount of
uncertainty of each data point against the expected number of queries (the cost) needed to
resolve this uncertainty. This allows us to select queries that have the highest information rate. We
evaluate the proposed method on benchmark data sets, and the results show consistent and
substantial improvements over the current state of the art.
Heart failure is a common clinical syndrome with typical symptoms (e.g. shortness of breath, ankle swelling and fatigue) and signs (e.g. high venous pressure, pulmonary edema and peripheral edema) caused by structural or functional abnormalities of the heart. In recent years, the significant role of B-type natriuretic peptide in the pathogenesis of heart disease has been revealed, and use of the drug sacubitril/valsartan, which has a positive effect on the regulation of the level of B-type natriuretic peptide in the body, has begun. It is evident from the world literature that natriuretic peptides play an important role in the pathophysiology of heart failure, and many studies therefore emphasize their importance in its diagnosis and treatment. For this reason, we investigated the effects of a comprehensive medication therapy including sacubitril/valsartan in patients with chronic heart failure.
Parallel generators of pseudo random numbers with control of calculation errors
International Journal of Science and Research (IJSR)
ISSN (Online): 2319-7064
Impact Factor (2012): 3.358
An Optimized Cost-Free Learning Using ABC-SVM
Approach in the Class Imbalance Problem
K. Sasikala, Dr. V. Anuratha
Tirupur (Dt), Tamil Nadu (St), India
Abstract: In this work, cost-free learning (CFL) is formally defined in comparison with cost-sensitive learning (CSL). The primary difference between them is that, even in the class imbalance problem, a CFL approach provides optimal classification results without requiring any cost information. In fact, several CFL approaches exist in the related studies, such as sampling and some criteria-based approaches. Yet, to the best of our knowledge, none of the existing CFL and CSL approaches is able to process abstaining classifications properly when no information is given about errors and rejects. Hence, based on information theory, we propose a novel CFL approach which seeks to maximize the normalized mutual information between the targets and the decision outputs of classifiers. With this strategy, we can handle binary or multi-class classifications, with or without abstaining. Important features of the new strategy are observed: when the degree of class imbalance changes, the proposed strategy is able to balance the errors and rejects accordingly and automatically. A wrapper paradigm of the proposed ABC-SVM (Artificial Bee Colony-SVM) is oriented by an evaluation measure for imbalanced datasets as the objective function, with respect to the feature subset, the misclassification cost and the intrinsic parameters of the SVM. The main goal of cost-free ABC-SVM is to directly improve classification performance by simultaneously optimizing the best combination of intrinsic parameters, feature subset and misclassification cost parameters. Experimental results on various standard benchmark datasets and real-world data with different imbalance ratios show that the proposed method is effective in comparison with commonly used sampling techniques.
Keywords: Classification, Class Imbalance, Cost-Free Learning, Cost-Sensitive Learning, Artificial Bee Colony Algorithm, Support Vector Machine
1. Introduction
Recently, the class imbalance problem has been recognized as a crucial problem in machine learning and data mining [1]. It occurs when the training data are not evenly distributed among classes, and it is especially critical in many real applications, such as credit card fraud detection, where fraudulent cases are rare, or medical diagnosis, where normal cases are the majority. In these cases, standard classifiers generally perform poorly: they tend to be overwhelmed by the majority class and to ignore the minority class examples, since most classifiers assume an even distribution of examples among classes and an equal misclassification cost. Moreover, classifiers are typically designed to maximize accuracy, which is not a good metric for evaluating effectiveness on imbalanced training data. Hence, we need to improve traditional algorithms so as to handle imbalanced data, and to choose metrics other than accuracy to measure performance. Here we focus our study on imbalanced datasets with binary classes.
Much work has been done in addressing the class
imbalance problem. These methods can be grouped in two
categories: the data perspective and the algorithm
perspective [2]. The methods with the data perspective re-balance
the class distribution by re-sampling the data
space either randomly or deterministically. The main disadvantage of re-sampling techniques is that they may cause loss of important information or model overfitting, since they change the original data distribution.
The imbalanced learning problem in data mining has
attracted a significant amount of interest from the research
community and practitioners because real-world datasets
are frequently imbalanced containing a minority class with
relatively few instances when compared to the other
classes in the dataset. These standard classification
algorithms of literature which were used in supervised
learning have difficulties in correctly classifying the
minority class. Most of these algorithms assume a
balanced distribution of classes and equal misclassification
costs for each class. In addition, these algorithms are
designed to generalize from sample data and output the
simplest hypothesis that best fits the data.
This methodology is embedded as principle in the
inductive bias of many machine learning algorithms
including nearest neighbor, Decision Tree and Support
Vector Machine (SVM). Therefore, when they are used on
complex imbalanced data sets, these algorithms are
inclined to be overwhelmed by the majority class and
ignore the minority class causing errors in classification
for the minority class. Put differently, standard classification algorithms try to minimize the overall classification error rate by producing a biased hypothesis that regards almost all instances as belonging to the majority class.
Recent research on the class imbalance problem has
included studies on datasets from a wide variety of
contexts such as, information retrieval and filtering [3],
diagnosis of rare thyroid disease [4], text classification [5],
credit card fraud detection [6] and detection of oil spills
from satellite images [7]. Imbalance degree varies
depending on the context.
A cost-sensitive classifier tries to learn more
characteristics of samples with the minority class by
setting a high cost to the misclassification of a minority
class sample. It does not modify the data distribution.
Weiss [8] left open the questions "why doesn't the cost-sensitive learning algorithm perform better, given the known drawbacks of sampling, and are there ways to improve the cost-sensitive learning algorithms'
Volume 3 Issue 10, October 2014
www.ijsr.net
Paper ID: SEP14694 184
Licensed Under Creative Commons Attribution CC BY
effectiveness?" He pointed to improving the effectiveness of cost-sensitive learning algorithms by optimizing the factors which influence their performance.
There are two challenges with respect to the training of a cost-sensitive classifier. The misclassification costs play a crucial role in the construction of a cost-sensitive learning model for achieving the expected classification results.
Nevertheless, in many contexts of imbalanced dataset, the
costs of misclassification cannot be determined. Apart
from cost, intrinsic parameters and the feature set of some
sophisticated classifiers also influence the classification
performance. Moreover, these factors influence each other.
This is the first challenge. The other is the gap between the
measure of evaluation and the objective of training on the
imbalanced data [9]. Indeed, for evaluating the
performance of a cost-sensitive classifier on a skewed data
set, the overall accuracy is inapplicable. It is common to
employ other evaluation measures to monitor the balanced
classification ability, such as G-mean [10] and AUC [11].
However, although these cost-sensitive classifiers are measured by imbalanced evaluation metrics, they are not trained and updated with the imbalanced evaluation as the objective. In order to achieve good prediction performance, learning algorithms should train classifiers by optimizing the performance measures of interest [12].
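As an illustration of the two evaluation measures named above (a sketch, not the authors' code; the function names are ours), G-mean and the Wilcoxon form of AUC can be computed directly:

```python
import numpy as np

def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity for binary labels {0, 1}."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return float(np.sqrt(sens * spec))

def auc(y_true, y_score):
    """Wilcoxon-Mann-Whitney AUC: probability that a random positive
    is scored above a random negative (ties count one half)."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    gt = (pos[:, None] > neg[None, :]).sum()
    eq = (pos[:, None] == neg[None, :]).sum()
    return (gt + 0.5 * eq) / (len(pos) * len(neg))

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1])      # 6 majority, 2 minority
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 0])
y_score = np.array([.1, .2, .1, .3, .2, .6, .9, .4])
print(round(g_mean(y_true, y_pred), 4))   # 0.6455
print(auc(y_true, y_score))               # 11/12 ≈ 0.9167
```

Unlike overall accuracy, G-mean collapses to zero whenever one class is ignored entirely, which is why it is preferred on skewed data.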
In order to solve the challenges above, we propose a CFL
strategy in the class imbalance problem. Using normalized
mutual information (NI) as the learning target, we conduct
the learning from cost-insensitive classifiers. Therefore,
we are able to adopt conventional classifiers for simple
and direct implementations. The most advantage of this
strategy is its unique feature in classification scenarios
where one has no knowledge of costs. Simultaneously for
improving the performance of cost-free ABC-SVM
optimize the factors such as cost of misclassification,
intrinsic parameters and feature set of classifier.
The remainder of this paper is organized as follows. Section 2 describes related work. Section 3 presents the proposed CF-ABC-SVM for the class imbalance problem. Section 4 discusses the experimental results, and Section 5 concludes.
2. Related Work
When costs are unequal and unknown, Maloof [13] uses
ROC curve to show the performance of binary
classifications under different cost settings. To make fair
comparisons, an alternative to AUC is proposed to
evaluate different classifiers under the same cost ratio
distribution [14]. These studies can be viewed as
comparing classifiers rather than finding optimal operating
points. Cost curve [15] can be used to visualize optimal
expected costs over a range of cost settings, but it does not
suit multi-class problem. Zadrozny and Elkan [16] apply
least-squares multiple linear regression to estimate the
costs. The method requires cost information of the training
sets to be known. Cross validation [17] is proposed to
choose from a limited set of cost values, and the final
decisions are made by users.
There exist some CFL approaches in the class imbalance
problem. Various sampling strategies [18] try to modify
the imbalanced class distributions. Active learning [19] is
also investigated to select desired instances and the feature
selection techniques are applied to combat the class
imbalance problem for high-dimensional data sets.
Besides, ensemble learning methods are used to improve
the generalization of predicting the minority class. To
reduce the influence of imbalance, Hellinger distance is
applied to decision trees as a splitting criterion for its skew
insensitivity. And the recognition-based methods that train
on a single class are proposed as alternatives to the
discrimination-based methods.
However, none of the CFL methods above takes abstaining into consideration, and they may fail to process abstaining classifications properly. Regarding abstaining classification, some strategies have been proposed for defining optimal reject rules. Pietraszek [20] proposes a bounded-abstention model with ROC analysis, and Fumera et al.
[21] seek to maximize accuracy while keeping the reject
rate below a given value. However, the bound information
and the targeted reject rate are required to be specified
respectively. When there is no prior knowledge of these
settings, it is hard to determine the values. Li and Sethi
[22] restrict the maximum error rate of each class, but the
rates may conflict when they are arbitrarily given.
Zhang et al [23] proposed a new strategy of CFL to deal
with the class imbalance problem. Based on the specific
property of mutual information that can distinguish
different error types and reject types, we seek to maximize
it as a general rule for dealing with binary/multiclass
classifications with/without abstaining. The challenge will
be a need of specific optimization algorithms for different
learning machines. Many sampling techniques have been
introduced including heuristic or non-heuristic
oversampling, undersampling and data cleaning rules such
as removing "noise" and "borderline" examples. These works focus on data-level techniques, whereas other researchers concentrate on changing the classifier internally, for example modifying SVM itself to deal with class imbalance; still others use ensemble learning to deal with class imbalance, combine undersampling with ensemble methods, incorporate different re-balancing heuristics into SVM, or incorporate SVM into a boosting method.
These literature studies introduced a test strategy which determines how unknown attributes are selected for testing, in order to minimize the sum of the misclassification costs and the test costs. Furthermore, some works applied the synthetic minority oversampling technique (SMOTE) to balance the dataset first and then built the model using SVM with different costs, while others applied common classifiers (e.g. C4.5, logistic regression and Naive Bayes) together with sampling techniques such as random oversampling, random undersampling, the condensed nearest neighbour rule, Wilson's edited nearest neighbour rule, Tomek's links and SMOTE. In contrast to the literature, rather than focusing only on data sampling or CSL, we propose using both techniques. In addition, we do not assume a fixed cost ratio, and we do not set the cost ratio
by inverting the ratio of prior distributions between the minority and majority classes; instead, we optimize the cost ratio locally.
3. Dealing with Class Imbalance with ABC-SVM
On dealing with imbalanced datasets, literature researchers
focused on the data level and the classifier level. At the
data level, the common task is the modification of the class
distribution whereas at the classifier level many techniques
were introduced such as manipulating classifiers
internally, ensemble learning, one-class learning and CFL.
3.1 Cost-Free SVM
The Support Vector Machine (SVM) has strong mathematical foundations in statistical learning theory and has been successfully adopted in various classification applications. SVM aims at maximizing the margin of a hyperplane separating the classes. However, it is overwhelmed by the majority class instances in the case of imbalanced datasets, because the objective of the regular SVM is to maximize accuracy. In order to assign different costs to the two different kinds of errors, the cost-sensitive SVM (CS-SVM) [15] is a good solution. CS-SVM is formulated as follows:
$$\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C^{+}\sum_{i:\,y_i=+1}\xi_i + C^{-}\sum_{j:\,y_j=-1}\xi_j$$

$$\text{s.t.}\quad y_i\big[(w^{T}x_i)+b\big] \ge 1-\xi_i,\qquad \xi_i \ge 0,\qquad i=1,\dots,n$$

where C+ is the higher misclassification cost of the positive class, which is the primary interest, and C- is the lower misclassification cost of the negative class. By using different error costs for the positive and negative classes, the SVM hyperplane can be pushed away from the positive instances. In this paper, we fix C- = C and C+ = C × Crf, where C and Crf are respectively the regularization parameter and the ratio of misclassification costs. In the construction of the cost-sensitive SVM, this misclassification cost parameter plays an indispensable role. In general, the Radial Basis Function (RBF) kernel is a reasonable first choice for the classification of nonlinear datasets, as it has few parameters (γ).
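With an off-the-shelf library, the C- = C, C+ = C × Crf scheme can be sketched as follows. This uses scikit-learn (an assumption of ours; the paper does not name an implementation), whose per-class `class_weight` entries multiply the regularization parameter C for that class:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy imbalanced data: 60 majority (-1) points vs 6 minority (+1) points.
X_neg = rng.normal(0.0, 1.0, size=(60, 2))
X_pos = rng.normal(2.0, 1.0, size=(6, 2))
X = np.vstack([X_neg, X_pos])
y = np.array([-1] * 60 + [1] * 6)

C, Crf, gamma = 1.0, 10.0, 0.5   # Crf: ratio misclassification cost factor
clf = SVC(C=C, kernel="rbf", gamma=gamma,
          class_weight={-1: 1.0, 1: Crf})  # effective C+ = C*Crf, C- = C
clf.fit(X, y)
print("minority recall:", (clf.predict(X_pos) == 1).mean())
```

Raising Crf pushes the decision boundary away from the minority class, at the price of more false positives on the majority class.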
3.2 Optimized Cost-Free SVM by Measures of Imbalanced Data
SVM tries to minimize the regularized hinge loss; it is driven by an error-based objective function. Yet overall accuracy is not an appropriate evaluation measure for imbalanced data classification. Consequently, there is an inevitable gap between the evaluation measure by which the classifier is evaluated and the objective function on which the classifier is trained. A classifier for imbalanced data learning should be driven by more appropriate measures, so we inject these measures into the objective function of the classifier during training with ABC. The common evaluation measures for imbalanced data classification are G-mean and AUC; however, for many classifiers the learning process is still driven by error-based objective functions. In this paper we explicitly treat the measure itself as the objective function when training the cost-sensitive learner, and we design a measure-based training framework for dealing with imbalanced data classification. A wrapper paradigm is proposed that discovers the amount of re-sampling for a dataset by optimizing evaluation functions such as the F-measure and AUC. To date, there is no research about training the cost-sensitive classifier with measure-based objective functions, and this is one important issue that hinders the performance of cost-sensitive learning.
Another important issue in applying cost-sensitive learning algorithms to imbalanced data is that the cost matrix is often unavailable for a problem domain. The misclassification costs, especially the ratio of misclassification costs, play a crucial role in the construction of a cost-sensitive approach; knowledge of the misclassification costs is required for achieving the expected classification results. However, the values of costs are commonly given by domain experts, and they remain unknown in many domains where it is in fact difficult to specify the cost ratio information precisely. Also, it is not correct to simply set the cost ratio to the inverse of the imbalance ratio (the number of majority instances divided by the number of minority instances); in particular, this is inaccurate for classifiers such as SVM. Heuristic approaches, such as genetic algorithms or grid search, have been used in some cost-sensitive learning methods to find the optimal cost setup. Apart from the ratio of misclassification costs, feature subset selection and the intrinsic parameters of the classifier also have a significant bearing on performance. These two factors are not only important for the classification of imbalanced data, but are applicable to any kind of classification. Feature selection is the technique of selecting a subset of discriminative features for building robust learning models by removing the most irrelevant and redundant features from the data. Furthermore, such optimal feature selection can concurrently achieve good accuracy and dimensionality reduction.
Unfortunately, the imbalanced data distributions are often
accompanied by high dimensionality in real-world datasets
such as bio-informatics, text classification and CAD
(Computer Aided Detection). It is important to select
features that can capture the high skew in the class
distribution. Furthermore, proper intrinsic parameter settings of the classifier, such as the regularization cost parameter and the kernel parameter of the SVM, can improve the classification performance; a grid search can be used to optimize the kernel and regularization parameters. Moreover, these three factors influence each other. Thus, optimization of the ratio of misclassification costs, the feature subset and the intrinsic parameters must occur simultaneously.
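One way to optimize all three factors simultaneously is to pack them into a single candidate solution vector. The sketch below is our illustration of such a mixed encoding, not the paper's implementation; the layout and names are assumptions:

```python
import numpy as np

def decode_solution(vec, n_features):
    """Split one candidate solution into SVM hyper-parameters and a feature mask.
    Assumed layout (illustrative): [log10 C, log10 gamma, Crf, feature bits...]."""
    C     = 10.0 ** vec[0]        # continuous, searched on a log scale
    gamma = 10.0 ** vec[1]
    crf   = max(vec[2], 1.0)      # cost ratio C+/C-, kept >= 1
    mask  = np.asarray(vec[3:3 + n_features]) >= 0.5   # discrete feature bits
    return C, gamma, crf, mask

vec = [0.0, -1.0, 5.0, 1, 0, 1, 1, 0]
C, gamma, crf, mask = decode_solution(vec, n_features=5)
print(C, gamma, crf, int(mask.sum()))   # 1.0 0.1 5.0 3
```

A fitness evaluation then trains the SVM with (C, γ, Crf) on the columns selected by the mask, which is exactly what makes the three factors interact.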
Based on the reasons above, our specific goal is to devise a strategy to automatically determine the optimal factors during the training of the cost-sensitive classifier, guided by the imbalanced evaluation criteria (G-mean and AUC). For binary classification, there is only one cost parameter, namely the relative cost information,
known as the ratio misclassification cost factor Crf. Since the RBF kernel is selected for the cost-sensitive SVM, γ and C are the parameters to be optimized. We need to combine discrete and continuous values in the solution representation, since the costs and parameters we intend to optimize are continuous while the feature subset is discrete: each feature is represented by a 1 or a 0 according to whether it is selected or not. In our proposed method, the food sources (solutions) are randomly generated in the initialization step. Each solution is then evaluated through the SVM classifier. Solving an SVM is in general a quadratic programming (QP) problem; sequential minimal optimization (SMO) is applied in this study to reduce the training time of the SVM. SMO divides the large QP problem into a series of smallest-possible QP problems, avoiding the intensive time and memory otherwise required. In this study, the radial basis function (RBF) is used as the kernel of the non-linear SVM classifier to learn and recognize patterns of the input data from the training set. The RBF kernel is defined as

$$K(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2} \qquad (1)$$

where γ is the kernel parameter of the RBF function. Then, a
testing set is used to determine the classification accuracy
for the input dataset. The fitness value is given by the classification accuracy on this testing dataset. For small- and medium-sized data, an accurate estimate is obtained with 10-fold cross-validation: the data are randomly separated into 10 subsets; one subset is used as a testing set while the remaining nine are used for training, and the process is repeated 10 times so that each subset serves once as the validation (testing) set. The classification accuracy is the average of the correctness values over the 10 rounds. For large-sized data, the holdout method is applied, which divides the data into two parts for constructing the training and testing models. Each food source is then modified through the employed bees' solution-update process, as expressed in equation (2), where Φ is a random number in the range [-1, 1].
Equation (2) returns the numerical value of a new candidate solution v_ij from the current food source x_ij and a neighbouring food source x_kj:

$$v_{ij} = x_{ij} + \Phi\,(x_{ij} - x_{kj}) \qquad (2)$$

However, for dimension reduction these solution components must be binary. Thus, each numeric value is converted into either 0 or 1 by using equations (3) and (4):

$$S(v_{ij}) = \frac{1}{1 + e^{-v_{ij}}} \qquad (3)$$

$$\text{if } rand < S(v_{ij}) \text{ then } v_{ij} = 1;\ \text{else } v_{ij} = 0 \qquad (4)$$
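The employed-bee update (2) followed by the sigmoid binarization (3)-(4) can be sketched for the feature-mask part of a food source (variable names are ours, for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

def update_bits(x_i, x_k):
    """Employed-bee update (eq. 2) plus sigmoid binarization (eqs. 3-4)."""
    phi = rng.uniform(-1.0, 1.0, size=np.shape(x_i))   # Φ drawn from [-1, 1]
    v = x_i + phi * (x_i - x_k)                        # eq. (2): move toward/away from neighbour
    s = 1.0 / (1.0 + np.exp(-v))                       # eq. (3): squash into (0, 1)
    return (rng.random(size=np.shape(v)) < s).astype(int)  # eq. (4): stochastic 0/1

x_i = np.array([1.0, 0.0, 1.0, 1.0])   # current feature mask
x_k = np.array([0.0, 1.0, 1.0, 0.0])   # neighbouring food source
print(update_bits(x_i, x_k))           # a new 0/1 feature mask
```

The stochastic threshold in (4) keeps the search from collapsing: even a component with sigmoid value near 0.5 can flip either way.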
Equation (3) is a sigmoid function limiting the value to the interval [0.0, 1.0]. A random number rand in the range [0.0, 1.0] is then used to determine the binary value of the solution in equation (4). Each new candidate solution is evaluated with the SVM classifier. If the new fitness value is better than the current one, the employed bee replaces its solution with the new candidate; otherwise, the new candidate is discarded. After the employed bees share information about their solutions, the onlooker bees select a food source according to the probability of each food source, calculated by equation (5):
$$P_i = \frac{fit_i}{\sum_{n=1}^{N} fit_n} \qquad (5)$$
where P_i is the probability value of solution i, N is the number of all solutions, and fit_i is the fitness value of solution i. Thus, solutions with higher fitness values have a greater chance of being selected by the onlooker bees. After the onlooker bees select their desired food sources, they perform the same solution-update process as the employed bees. The whole process is repeated until the termination criterion is reached. The dimension-reduction process of ABC-SVM is summarized in Figure 1.
Figure 1: The flowchart of ABC-SVM method
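The onlooker-bee selection step of equation (5) is fitness-proportional (roulette-wheel) selection, which can be sketched as follows (a sketch of ours, with illustrative fitness values):

```python
import numpy as np

def onlooker_probabilities(fitness):
    """Eq. (5): selection probability proportional to fitness."""
    fitness = np.asarray(fitness, dtype=float)
    return fitness / fitness.sum()

fit = [0.9, 0.6, 0.3]             # fitness of three food sources
p = onlooker_probabilities(fit)
print(p)                          # [0.5, 1/3, 1/6]

# Fitter food sources are visited more often by onlooker bees.
rng = np.random.default_rng(0)
chosen = rng.choice(len(fit), size=1000, p=p)
print(np.bincount(chosen) / 1000)  # empirical frequencies near p
```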
4. Experimental Results and Discussion
4.1 Dataset Description
To evaluate the classification performance of the proposed method on different classification tasks, and to compare it with other methods specifically devised for imbalanced data, we tried several datasets from the UCI database. We used all available datasets from the combined sets, which also ensures that we did not choose only the datasets on which our method performs better. The minority class label (+) is indicated in Table 1, and the chosen datasets are diverse in their number of attributes and imbalance ratios; they contain both continuous and categorical attributes. All the experiments are conducted with 10-fold cross-validation.
Table 1: The datasets used for experimentation; each dataset name is appended with the label of the minority class (+)
Dataset Instances Features Class Imbalance
Hepatitis (1) 155 19 1:4
Glass(7) 214 9 1:6
Segment(1) 2310 19 1:6
Anneal(5) 898 38 1:12
Soybean(12) 683 35 1:15
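The 10-fold protocol used for all experiments might be sketched as follows. This is an assumption of ours using scikit-learn with a synthetic dataset standing in for one of the UCI sets above, not the authors' code or parameters:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic imbalanced data (~85% majority), in place of a UCI dataset.
X, y = make_classification(n_samples=200, n_features=10,
                           weights=[0.85], random_state=0)
clf = SVC(C=1.0, kernel="rbf", gamma="scale", class_weight={1: 5.0})
scores = cross_val_score(clf, X, y, cv=10)   # one fold held out per round
print(len(scores), round(float(scores.mean()), 3))  # fitness = mean over 10 folds
```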
The comparison is conducted between our method and other state-of-the-art imbalanced-data classifiers, such as random under-sampling (RUS), SMOTE, SMOTEBoost, and SMOTE combined with an asymmetric cost classifier. The re-sampling rates of algorithms such as SMOTE and SMOTEBoost are unknown. For a fair comparison, in our experiments, whether the method is under-sampling or over-sampling, we used the evaluation measure as the optimization objective of the re-sampling method to search for the optimal re-sampling level. The steps of both increment and decrement are set at 10%. This is a greedy search that repeats the process until no performance gains are observed.
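A minimal random under-sampling baseline of the kind compared against can be sketched as follows (our illustration; the function name and `rate` parameter are not from the paper):

```python
import numpy as np

def random_undersample(X, y, minority_label=1, rate=1.0, seed=0):
    """Keep all minority instances and a random majority subset,
    sized rate * (number of minority instances)."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)
    keep = rng.choice(majority, size=int(rate * len(minority)), replace=False)
    idx = np.concatenate([minority, keep])
    return X[idx], y[idx]

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)        # 8 majority vs 2 minority
Xb, yb = random_undersample(X, y, rate=1.0)
print(np.bincount(yb))                 # balanced: [2 2]
```

Sweeping `rate` in 10% steps and keeping the value with the best G-mean or AUC mirrors the greedy search described above.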
The optimal resampling rate is then decided in an iterative fashion according to the evaluation metrics. Hence, in each fold, the training set is separated into a training subset and a validation subset for searching for the appropriate rate parameters. G-mean and AUC are again used as the evaluation metrics. For CS-SVM with SMOTE, at each re-sampling search step the optimal misclassification cost ratio is determined by a search guided by the evaluation measure under the current over-sampling level of SMOTE.
4.2 Execution Time Comparison
This graph compares the execution times of the proposed and existing systems. The execution time of the existing method is very high compared with the proposed system. Because the CF-ABC-SVM approach employs a screening method, it automatically tackles the class imbalance problem. This is shown in Fig. 2. From the results it is observed that the proposed work takes less computational time to solve the class imbalance problem.
Figure 2: Execution Time Comparison
4.3 Accuracy
This graph compares the accuracy of the proposed and existing systems. The accuracy of the proposed system is considerably higher than that of the existing system. Because the CF-ABC-SVM approach employs a screening method, it automatically tackles the class imbalance problem. This is shown in Fig. 3. From the results it is observed that the proposed work achieves higher accuracy in solving the class imbalance problem.
Figure 3: Accuracy Comparison
5. Conclusion
Learning with class imbalance is a challenging task. We propose a wrapper paradigm oriented by an evaluation measure for imbalanced datasets as the objective function, with respect to the misclassification cost, the selected feature subset and the intrinsic parameters of the SVM. Our measure-oriented framework can wrap around an existing cost-sensitive classifier. The proposed method has been validated on benchmark imbalanced data and a real-world application. The experimental results of this study demonstrate that the proposed framework provides a very competitive solution compared with existing state-of-the-art methods in terms of optimizing G-mean and AUC for imbalanced classification problems. These experimental results confirm the advantages of our approach, which shows a promising
perspective and a new understanding of cost-sensitive learning. In future research, we will extend the framework to imbalanced multiclass data classification.
References
[1] Chawla, N.V., Japkowicz, N.& Kolcz, A. (2004):
Editorial: special issue on learning from imbalanced
data sets. SIGKDD Explorations Special Issue on
Learning from Imbalanced Datasets 6 (1):1-6.
[2] Kotsiantis, S., Kanellopoulos, D. & Pintelas, P.
(2006): Handling imbalanced datasets: A review.
GESTS International Transactions on Computer
Science and Engineering:25-36.
[3] Lewis, D & Catlett, J.(1994). Heterogeneous
uncertainty sampling for supervised learning. In
Cohen, W. W. & Hirsh, H. (Eds.), Proceedings of
ICML-94, 11th International Conference on Machine
Learning, (pp. 148-156)., San Francisco. Morgan
Kaufmann
[4] Murphy, P. & Aha, D.(1994). UCI repository of
machine learning databases. University of California-
Irvine, Department of Information and Computer
Science.
[5] Chawla, N.V., Bowyer, K.W., Hall, L.O., &
Kegelmeyer, W.P.(2002). SMOTE: Synthetic
Minority Over-sampling Technique. Journal of
Artificial Intelligence and Research 16, 321-357.
[6] Wu, G. & Chang, E. (2003). Class-Boundary
Alignment for Imbalanced Dataset Learning. In ICML
2003 Workshop on Learning from Imbalanced Data
Sets ΙΙ, Washington, DC.
[7] Kubat, M., Holte, R., & Matwin, S.(1998). Machine
learning for the detection of oil spills in satellite radar
images. Machine Learning, 30, 195-215
[8] Weiss G., McCarthy K., Zabar B. (2007): Cost-sensitive
learning vs. sampling: Which is Best for
Handling Unbalanced Classes with Unequal Error
Costs? IEEE ICDM, pp. 35–41.
[9] Yuan, B. & Liu, W.H. (2011): A Measure Oriented Training Scheme for Imbalanced Classification Problems. Pacific-Asia Conference on Knowledge Discovery and Data Mining Workshop on Biologically Inspired Techniques for Data Mining, pp. 293-303.
[10] Akbani, R., Kwek, S. & Japkowicz, N. (2004): Applying support vector machines to imbalanced datasets. European Conference on Machine Learning.
[11] Chawla, N.V., Cieslak, D.A., Hall, L.O. & Joshi, A. (2008): Automatically countering imbalance and its empirical relationship to cost. Utility-Based Data Mining: a special issue of the international journal Data Mining and Knowledge Discovery.
[12] Li, N., Tsang, I. & Zhou, Z. (2012): Efficient Optimization of Performance Measures by Classifier Adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[13] M.A. Maloof, “Learning When Data Sets are
Imbalanced and When Costs are Unequal and
Unknown,” Proc. Workshop on Learning from
Imbalanced Data Sets II. Int’l Conf. Machine
Learning, 2003
[14] D.J. Hand, “Measuring Classifier Performance: A
Coherent Alternative to the Area under the ROC
Curve,” Machine Learning, vol. 77, no. 1, pp. 103-
123, Oct. 2009
[15] C.C. Friedel, U. Rückert, and S. Kramer, “Cost
Curves for Abstaining Classifiers,” Proc. Workshop
on ROC Analysis in Machine Learning. Int’l Conf.
Machine Learning, 2006.
[16] B. Zadrozny and C. Elkan, “Learning and Making
Decisions When Costs and Probabilities are Both
Unknown,” Proc. ACM SIGKDD Int’l Conf.
Knowledge Discovery and Data Mining, pp. 204-213,
2001.
[17] Y. Zhang and Z.-H. Zhou, “Cost-Sensitive Face
Recognition,” IEEE Trans. Pattern Analysis and
Machine Intelligence, vol. 32, no. 10, pp. 1758-1769,
Oct. 2010.
[18] M. Wasikowski and X.-W. Chen, “Combating the
Small Sample Class Imbalance Problem Using
Feature Selection,” IEEE Trans. Knowledge and Data
Eng., vol. 22, no. 10, pp. 1388-1400, Oct. 2010.
[19] S. Wang and X. Yao, “Relationships between
Diversity of Classification Ensembles and Single-
Class Performance Measures,” IEEE Trans.
Knowledge and Data Eng., vol. 25, no. 1, pp. 206-
219, Jan. 2013.
[20] T. Pietraszek, “Optimizing Abstaining Classifiers
using ROC Analysis,” Proc. Int’l Conf. Machine
Learning, pp. 665-672, 2005.
[21] G. Fumera, F. Roli, and G. Giacinto, “Reject Option
with Multiple Thresholds,” Pattern Recognition, vol.
33, no. 12, pp. 2099-2101, 2000.
[22] M. Li and I.K. Sethi, “Confidence-Based Classifier
Design,” Pattern Recognition, vol. 39, no. 7, pp. 1230-
1240, Jul. 2006.
[23] Zhang, X. & Hu, B.-G. (2013): A New Strategy of Cost-Free Learning in the Class Imbalance Problem. IEEE Transactions on Knowledge and Data Engineering.