Diabetes is among the most common diseases in India. It affects patients' health and also leads to other chronic diseases, so the prediction of diabetes plays a significant role in saving lives and cost. Predicting diabetes in the human body is a challenging task because it depends on several factors. A few studies have reported the performance of classification algorithms in terms of accuracy alone. The results in these studies are difficult for medical practitioners to understand, and they lack visual aids because they are presented in pure text format. This survey uses the ROC and PRC graphical measures to improve the understanding of results, and it presents a detailed parameter-wise comparison that other reported surveys lack. Execution time, accuracy, TP rate, FP rate, precision, recall, and F-measure are used for the comparative analysis, and a confusion matrix is prepared for a quick review of each algorithm. Ten-fold cross-validation is used to estimate the prediction models. Different sets of classification algorithms are analyzed on the diabetes dataset acquired from the UCI repository.
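To make the evaluation protocol concrete, here is a minimal scikit-learn sketch of the survey's setup: 10-fold cross-validation of one classifier on the UCI Pima diabetes data, reporting the same parameters the survey compares (TP rate, FP rate, precision, F-measure) plus the areas under the ROC and PRC curves. The file name "diabetes.csv" and its "Outcome" column are assumptions about the local data layout, and Gaussian Naive Bayes merely stands in for any of the compared classifiers.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import (confusion_matrix, precision_score, f1_score,
                             roc_auc_score, average_precision_score)

df = pd.read_csv("diabetes.csv")                  # assumed local copy of the UCI data
X, y = df.drop(columns="Outcome"), df["Outcome"]

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
pred = cross_val_predict(GaussianNB(), X, y, cv=cv)
proba = cross_val_predict(GaussianNB(), X, y, cv=cv,
                          method="predict_proba")[:, 1]

tn, fp, fn, tp = confusion_matrix(y, pred).ravel()
print("TP rate  :", tp / (tp + fn))               # also the recall
print("FP rate  :", fp / (fp + tn))
print("Precision:", precision_score(y, pred))
print("F-measure:", f1_score(y, pred))
print("ROC area :", roc_auc_score(y, proba))      # summary of the ROC curve
print("PRC area :", average_precision_score(y, proba))
```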
A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE (aciijournal)
Today, an enormous amount of data is collected in medical databases. These databases may contain valuable information encapsulated in nontrivial relationships among symptoms and diagnoses. Extracting such dependencies from historical data is much easier to do using medical systems, and the resulting knowledge can be used in future medical decision making. In this paper, a new algorithm based on C4.5 to mine data for medical applications is proposed; it is then evaluated against two datasets and the C4.5 algorithm in terms of accuracy.
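As background for this C4.5 variant, here is a small sketch of the gain-ratio split criterion that C4.5 itself is built on: information gain normalized by split information. The toy symptom/diagnosis arrays are invented for illustration; the paper's actual algorithm and data are not reproduced here.

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def gain_ratio(attr, labels):
    base = entropy(labels)
    values, counts = np.unique(attr, return_counts=True)
    weights = counts / counts.sum()
    cond = sum(w * entropy(labels[attr == v]) for v, w in zip(values, weights))
    split_info = -(weights * np.log2(weights)).sum()
    return (base - cond) / split_info if split_info > 0 else 0.0

# toy data: one categorical symptom and binary diagnoses
symptom = np.array(["high", "high", "low", "low", "low"])
diagnosis = np.array([1, 1, 0, 0, 1])
print(gain_ratio(symptom, diagnosis))
```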
ON FEATURE SELECTION ALGORITHMS AND FEATURE SELECTION STABILITY MEASURES: A C... (ijcsit)
Data mining is indispensable for business organizations: it extracts useful information from the huge volume of stored data, which can be used in managerial decision making to survive in the competition. Due to day-to-day advancements in information and communication technology, the data collected from e-commerce and e-governance are mostly high dimensional, and data mining prefers small datasets over high-dimensional ones. Feature selection is an important dimensionality reduction technique. The subsets selected in subsequent iterations of feature selection should be the same or similar even under small perturbations of the dataset; this property is called selection stability, and it has recently become an important topic for the research community. Selection stability has been measured by various measures. This paper analyses the selection of a suitable search method and stability measure for feature selection algorithms, and also the influence of the characteristics of the dataset, as the choice of the best approach is highly problem dependent.
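One way to make "selection stability" operational is the average pairwise Jaccard index between the subsets selected on perturbed copies of the data, sketched below. The top-k mutual-information filter and the bootstrap perturbation are illustrative assumptions; the paper surveys several such measures rather than prescribing this one.

```python
from itertools import combinations
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
rng = np.random.default_rng(0)

subsets = []
for _ in range(10):
    idx = rng.integers(0, len(y), len(y))               # bootstrap perturbation
    scores = mutual_info_classif(X[idx], y[idx], random_state=0)
    subsets.append(frozenset(np.argsort(scores)[-5:]))  # top-5 features

# stability = mean Jaccard similarity over all pairs of selected subsets
jaccard = [len(a & b) / len(a | b) for a, b in combinations(subsets, 2)]
print("stability:", np.mean(jaccard))
```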
Preprocessing and Classification in WEKA Using Different Classifiers (IJERA)
Data mining is a process of extracting information from a dataset and transforming it into an understandable structure for further use; it also discovers patterns in large data sets [1]. Data mining has a number of important techniques, such as preprocessing and classification. Classification is one such technique, based on supervised learning, used for predicting group membership for data instances. In this paper we apply preprocessing and classification to a diabetes database: classifiers are applied to the database and the results are compared on certain parameters using WEKA. An estimated 77.2 million people in India suffer from pre-diabetes, and the ICMR estimates that around 65.1 million are diabetes patients. Globally in 2010, 227 to 285 million people had diabetes, of which 90% of cases were type 2; this is equal to 3.3% of the population, with equal rates in both women and men. In 2011 diabetes resulted in 1.4 million deaths worldwide, making it the eighth leading cause of death.
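A rough scikit-learn equivalent of this WEKA workflow is sketched below, assuming the OpenML mirror of the Pima diabetes data is reachable. The three classifiers are stand-ins for whichever ones the paper compares: an entropy-based tree approximates WEKA's J48 and k-NN approximates IBk.

```python
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import fetch_openml

# OpenML's "diabetes" (Pima) dataset; requires network access
X, y = fetch_openml("diabetes", version=1, return_X_y=True, as_frame=False)

for name, clf in [("NaiveBayes", GaussianNB()),
                  ("J48-like tree", DecisionTreeClassifier(criterion="entropy")),
                  ("IBk-like k-NN", KNeighborsClassifier(n_neighbors=5))]:
    acc = cross_val_score(clf, X, y, cv=10).mean()   # 10-fold accuracy
    print(f"{name}: {acc:.3f}")
```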
SURVEY ON CLASSIFICATION ALGORITHMS USING BIG DATASET (IJMTER)
The data mining environment produces a large amount of data that needs to be analyzed. Using traditional databases and architectures, it has become difficult to process, manage, and analyze such patterns; to gain knowledge about Big Data, a proper architecture must be understood. Classification is an important data mining technique with broad applications: it classifies items according to their features with respect to a predefined set of classes, and it is used in nearly every field of our life. This paper sheds light on various classification algorithms, including J48, C4.5, and Naive Bayes, using a large dataset.
An Influence of Measurement Scale of Predictor Variable on Logistic Regressio... (IJECEIAES)
Much real-world decision making is based on binary categories of information: agree or disagree, accept or reject, succeed or fail, and so on. Information of this kind is the output of a classification method, which is the domain of both statistics (e.g., the logistic regression method) and machine learning (e.g., Learning Vector Quantization, LVQ). The input of a classification method has a crucial influence on the resulting output. This paper investigates the influence of various types of input measurement scale (interval, ratio, and nominal) on the performance of the logistic regression method and LVQ in classifying an object. Logistic regression modeling is carried out in several stages until a model that passes the goodness-of-fit test is obtained; LVQ modeling is tested on several codebook sizes and the most optimal LVQ model is selected. The best model of each method is then compared on object classification performance using the hit-ratio indicator. Logistic regression yields two models that pass the goodness-of-fit test, with predictor variables on the interval and nominal scales, while LVQ modeling yields three optimal models with different codebooks. On data with interval-scale predictor variables, the performance of both methods is the same; both models perform equally poorly when the predictor variables are on the nominal scale. On data with ratio-scale predictor variables, the LVQ method produces moderate performance, while no logistic regression model passes the goodness-of-fit test. Thus, if the input dataset has interval- or ratio-scale predictor variables, the LVQ method is preferable for modeling object classification.
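For readers unfamiliar with LVQ, here is a minimal sketch of the LVQ1 codebook update the paper's second method relies on: the nearest prototype is attracted to a training point of the same class and repelled otherwise. The toy 2-D data, codebook size, learning rate, and epoch count are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)        # toy binary labels

# two prototypes per class, initialized from class members
proto_idx = np.concatenate([rng.choice(np.where(y == c)[0], 2) for c in (0, 1)])
W, Wy = X[proto_idx].copy(), y[proto_idx]

alpha = 0.05
for epoch in range(20):
    for xi, yi in zip(X, y):
        j = np.argmin(((W - xi) ** 2).sum(axis=1))  # nearest prototype
        step = alpha if Wy[j] == yi else -alpha     # attract or repel
        W[j] += step * (xi - W[j])

# hit ratio: fraction classified correctly by the nearest prototype's label
pred = Wy[((W[None, :, :] - X[:, None, :]) ** 2).sum(-1).argmin(1)]
print("hit ratio:", (pred == y).mean())
```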
Data mining is an analytic process designed to explore data in search of consistent patterns and systematic relationships between variables, and then to validate the results by applying the patterns found to a new subset of data. Data mining is often described as the process of discovering patterns, correlations, trends, or relationships by searching through a large amount of data stored in repositories, databases, and data warehouses. Diabetes, often referred to by doctors as diabetes mellitus, describes a group of metabolic diseases in which the person has high blood glucose (blood sugar) [3], either because insulin production is insufficient, or because the body's cells do not respond properly to insulin, or both. This project helps in identifying whether a person has diabetes or not: if a person is predicted diabetic [4], the project suggests measures for maintaining normal health, and if not diabetic, it predicts the risk of becoming diabetic. In this project a classification algorithm was used to classify the Pima Indian diabetes dataset. Results have been obtained using an Android application.
Multimodal authentication is one of the prime concepts in current real-world applications, and various approaches have been proposed in this area. In this paper, an intuitive strategy is proposed as a framework for providing a more secure key in the biometric security setting. Initially, features are extracted through PCA by SVD from the chosen biometric patterns; key components are then extracted using the LU factorization technique, selected with different key sizes, and combined using a convolution kernel method (Exponential Kronecker Product, eKP) in a Context-Sensitive Exponent Associative Memory model (CSEAM). The verification process is carried out in a similar way and verified with the MSE measure. This model gives a better outcome when compared with SVD factorization [1] as feature selection. The process is computed for different key sizes and the results are presented.
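Below is a sketch of the pipeline's linear-algebra stages under stated assumptions: PCA via SVD for feature extraction, LU factorization for key components, and a plain Kronecker product standing in for the exponential Kronecker combination step (the eKP/CSEAM details are not reproduced). Array and key sizes are illustrative, and the random matrix stands in for a biometric image.

```python
import numpy as np
from scipy.linalg import lu

pattern = np.random.default_rng(0).random((16, 16))   # stand-in biometric pattern

# PCA via SVD: leading principal directions of the centered pattern
U, s, Vt = np.linalg.svd(pattern - pattern.mean(axis=0))
features = Vt[:4]                                     # top-4 components

# LU factorization of the feature block yields candidate key components
P, L, Uf = lu(features @ features.T)

k1, k2 = L[:2, :2], Uf[:2, :2]                        # selected key size 2x2
key = np.kron(k1, k2)                                 # combined key component
print(key.shape)                                      # (4, 4)
```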
Comparative study of various supervised classification methods for analysing def... (eSAT Publishing House)
Privacy preserving data mining in four group randomized response technique us... (eSAT Journals)
Data mining is a process in which data collected from different sources is analyzed for useful information; it is also known as knowledge discovery in databases (KDD). Privacy and accuracy are the important issues in data mining when data is shared. Most methods use random permutation techniques to mask the data in order to preserve the privacy of sensitive data. Randomized response (RR) techniques were developed for the purpose of protecting survey privacy and avoiding biased answers. The proposed work enhances the privacy level of the RR technique using a four-group scheme. First, random attributes a, b, c, and d were considered according to the algorithm; then randomization was performed on every dataset according to the value of theta, and the ID3 and CART algorithms were applied to the randomized data. The results show that by increasing the number of groups, the privacy level increases: compared with the three-group scheme, the four-group scheme decreases accuracy by 6% but increases privacy by 65%.
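The basic one-coin randomized-response mechanism that such group schemes generalize can be sketched in a few lines: each respondent keeps the true binary answer with probability theta, and the true proportion is recovered from the perturbed aggregate. The theta value and sample size are assumptions; the paper's four-group scheme is more elaborate.

```python
import numpy as np

rng = np.random.default_rng(0)
true_answers = rng.integers(0, 2, size=1000)   # sensitive binary attribute
theta = 0.7                                    # probability of answering truthfully

keep = rng.random(1000) < theta
reported = np.where(keep, true_answers, 1 - true_answers)

# E[reported] = (2*theta - 1) * p + (1 - theta), so invert for an unbiased estimate
p_reported = reported.mean()
p_estimated = (p_reported - (1 - theta)) / (2 * theta - 1)
print(p_estimated, true_answers.mean())
```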
The pertinent single-attribute-based classifier for small datasets classific... (IJECEIAES)
Classifying a dataset using machine learning algorithms can be a big challenge when the target is a small dataset. The OneR classifier can be used for such cases due to its simplicity and efficiency. In this paper, we reveal the power of a single attribute by introducing the pertinent single-attribute-based heterogeneity-ratio classifier (SAB-HR), which uses a pertinent attribute to classify small datasets. The SAB-HR uses a feature selection method based on the Heterogeneity-Ratio (H-Ratio) measure to identify the most homogeneous attribute among the other attributes in the set. Our empirical results on 12 benchmark datasets from the UCI machine learning repository show that the SAB-HR classifier significantly outperforms the classical OneR classifier on small datasets. In addition, using the H-Ratio as the criterion for selecting the single attribute is more effectual than other traditional criteria, such as Information Gain (IG) and Gain Ratio (GR).
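For reference, here is a minimal implementation of the classical OneR baseline the SAB-HR is compared against: each attribute gets a one-level rule mapping attribute values to majority classes, and the attribute with the fewest training errors wins. The tiny categorical dataset is invented for illustration.

```python
import numpy as np

def one_r(X, y):
    best = None
    for j in range(X.shape[1]):
        # majority class per value of attribute j
        rule = {v: np.bincount(y[X[:, j] == v]).argmax()
                for v in np.unique(X[:, j])}
        errors = sum(rule[v] != c for v, c in zip(X[:, j], y))
        if best is None or errors < best[0]:
            best = (errors, j, rule)
    return best[1], best[2]            # chosen attribute and its rule

X = np.array([[0, 1], [0, 0], [1, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 1, 1, 1])
attr, rule = one_r(X, y)
print(attr, rule)                      # attribute 0 classifies this toy set perfectly
```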
Analysis On Classification Techniques In Mammographic Mass Data Set (IJERA)
Data mining, the extraction of hidden information from large databases, is used to predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. Data-mining classification techniques deal with determining which group each data instance is associated with, and they can handle a wide variety of data so that a large amount of data can be involved in processing. This paper analyzes various data mining classification techniques, such as Decision Tree Induction, Naïve Bayes, and k-Nearest Neighbour (KNN) classifiers, on the mammographic mass dataset.
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm (aciijournal)
In this paper we attempt to solve the automatic clustering problem by concurrently optimizing multiple objectives: automatic determination of k and a set of cluster validity indices (CVIs). The proposed automatic clustering technique uses the recent Jaya optimization algorithm as its underlying optimization strategy. This evolutionary technique aims to attain a global best solution rather than a local best solution on larger datasets. The exploration and exploitation imposed on the proposed work detect the number of clusters automatically, find the appropriate partitioning present in the data sets, and reach near-optimal values on the CVI frontiers. Twelve datasets of different intricacy are used to endorse the performance of the proposed algorithm. The experiments show that the theoretical advantages of multi-objective clustering optimized with evolutionary approaches translate into realistic and scalable performance benefits.
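The Jaya update underlying the approach is parameter-free: each candidate moves toward the current best solution and away from the worst. A minimal sketch on a toy objective follows; the paper's clustering objective and CVIs are not reproduced, and the bounds and population size are assumptions.

```python
import numpy as np

def sphere(x):                          # toy objective to minimize
    return (x ** 2).sum(axis=1)

rng = np.random.default_rng(0)
pop = rng.uniform(-5, 5, size=(20, 3))  # 20 candidates in 3 dimensions

for _ in range(200):
    f = sphere(pop)
    best, worst = pop[f.argmin()], pop[f.argmax()]
    r1, r2 = rng.random(pop.shape), rng.random(pop.shape)
    # Jaya rule: X' = X + r1*(best - |X|) - r2*(worst - |X|)
    cand = np.clip(pop + r1 * (best - np.abs(pop)) - r2 * (worst - np.abs(pop)), -5, 5)
    improved = sphere(cand) < f
    pop[improved] = cand[improved]      # greedy acceptance

print(sphere(pop).min())
```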
Control chart pattern recognition using k-MICA clustering and neural networks (ISA Interchange)
Automatic recognition of abnormal patterns in control charts has seen increasing demand nowadays in manufacturing processes. This paper presents a novel hybrid intelligent method (HIM) for recognition of the common types of control chart patterns (CCPs). The proposed method includes two main modules: a clustering module and a classifier module. In the clustering module, the input data are first clustered by a new technique, a suitable combination of the modified imperialist competitive algorithm (MICA) and the K-means algorithm; the Euclidean distance of each pattern from the determined clusters is then computed. The classifier module determines the membership of the patterns using the computed distances. In this module, several neural networks, such as the multilayer perceptron, probabilistic neural networks, and radial basis function neural networks, are investigated, and through the experimental study we choose the best classifier for recognizing the CCPs. Simulation results show that a high recognition accuracy, about 99.65%, is achieved.
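A structural sketch of the two modules follows, with ordinary k-means standing in for the k-MICA clustering stage and an MLP for the classifier module: distances to the learned cluster centers become the classifier's input features. Synthetic data replaces the simulated control-chart patterns.

```python
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=20, n_classes=3,
                           n_informative=6, random_state=0)

# clustering module (k-means as a stand-in for k-MICA)
km = KMeans(n_clusters=6, n_init=10, random_state=0).fit(X)
D = km.transform(X)                     # Euclidean distance to each center

# classifier module: a small neural net on the distance features
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000,
                    random_state=0).fit(D, y)
print("train accuracy:", clf.score(D, y))
```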
Multi-Dimensional Features Reduction of Consistency Subset Evaluator on Unsup... (CSCJournals)
This paper presents the application of multi-dimensional feature reduction using the Consistency Subset Evaluator (CSE) and Principal Component Analysis (PCA), with an Unsupervised Expectation Maximization (UEM) classifier, for an imaging surveillance system. Recently, research in image processing has raised much interest in the security surveillance systems community, and weapon detection is one of the greatest challenges facing that community. To address this, the UEM classifier is applied with a focus on detecting dangerous weapons, while CSE and PCA are used to explore the usefulness of each feature and reduce the multi-dimensional features to simplified features with no underlying hidden structure. In this paper, we take advantage of the simplified features and classifier to categorize image objects with the aim of detecting dangerous weapons effectively. To validate the effectiveness of the UEM classifier, several classifiers are used to compare the overall accuracy of the system, complemented by the feature reduction of CSE and PCA; these unsupervised classifiers include the Farthest First, Density-based Clustering, and k-Means methods. The final outcome of this research clearly indicates that UEM is able to improve classification accuracy using the features extracted from the multi-dimensional feature reduction of CSE. It is also shown that PCA is able to speed up the computational time through the reduced dimensionality of the features, at the cost of a slight decrease in accuracy.
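The PCA-plus-EM portion of the pipeline can be sketched as follows, with a Gaussian mixture fitted by EM playing the unsupervised classifier and the digits data standing in for surveillance image features (CSE has no direct scikit-learn equivalent and is omitted). Component counts are assumptions.

```python
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.datasets import load_digits

X, _ = load_digits(return_X_y=True)

# PCA reduces the 64 pixel features to 10 components
X_red = PCA(n_components=10, random_state=0).fit_transform(X)

# EM fits a Gaussian mixture; its components act as unsupervised classes
gm = GaussianMixture(n_components=10, random_state=0).fit(X_red)
clusters = gm.predict(X_red)
print(clusters[:20])
```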
Feature selection is one of the most fundamental steps in machine learning and is closely related to dimensionality reduction. A commonly used approach in feature selection is to rank the individual features according to some criterion and then search for an optimal feature subset based on an evaluation criterion that tests optimality. The objective of this work is to predict more accurately the presence of Learning Disability (LD) in school-aged children using a reduced number of symptoms. For this purpose, a novel hybrid feature selection approach is proposed by integrating a popular rough-set-based feature ranking process with a modified backward feature elimination algorithm. The feature ranking calculates the significance or priority of each symptom of LD according to its contribution in representing the knowledge contained in the dataset; each symptom's significance value reflects its relative importance for predicting LD among the various cases. Then, by eliminating the least significant features one by one and evaluating the feature subset at each stage of the process, an optimal feature subset is generated. For comparative analysis, and to establish the importance of rough set theory in feature selection, the backward feature elimination algorithm is also combined with two state-of-the-art filter-based feature ranking techniques, viz. information gain and gain ratio. The experimental results show that the proposed feature selection approach outperforms the other two in terms of data reduction. The proposed method also eliminates all the redundant attributes efficiently from the LD dataset without sacrificing classification performance.
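A sketch of the ranking-driven backward elimination loop described above, using mutual information (an information-gain-style criterion, one of the two filters the paper compares against) as the ranking. The stopping rule, the base classifier, and the synthetic data are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=15, random_state=0)
features = list(range(X.shape[1]))
score = cross_val_score(DecisionTreeClassifier(random_state=0),
                        X[:, features], y, cv=5).mean()

while len(features) > 1:
    # rank the remaining features, then try dropping the least significant one
    rank = mutual_info_classif(X[:, features], y, random_state=0)
    trial = [f for i, f in enumerate(features) if i != rank.argmin()]
    new = cross_val_score(DecisionTreeClassifier(random_state=0),
                          X[:, trial], y, cv=5).mean()
    if new < score:          # stop when removing a feature hurts accuracy
        break
    features, score = trial, new

print(len(features), "features kept, accuracy", round(score, 3))
```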
ROLE OF CERTAINTY FACTOR IN GENERATING ROUGH-FUZZY RULE (IJCSEA Journal)
The generation of effective feature-based rules is essential to the development of any intelligent system. This paper presents an approach that integrates a powerful fuzzy rule generation algorithm with a rough-set-assisted feature reduction method to generate diagnostic rules with a certainty factor. The certainty factor of each rule is calculated by considering both the membership value of each linguistic term introduced during the fuzzification of the data and the possibility values, due to inconsistent data, generated by rough set theory at rule generation time. During knowledge inference in an intelligent system, the certainty factor of each rule plays an important role in finding the appropriate rule to select. Experimental results demonstrate the superiority of our approach.
KNOWLEDGE BASED ANALYSIS OF VARIOUS STATISTICAL TOOLS IN DETECTING BREAST CANCER (cscpconf)
In this paper, we study the performance of machine learning tools in classifying breast cancer. We compare data mining tools such as Naïve Bayes, support vector machines, radial basis neural networks, decision trees (J48), and simple CART. We used both binary and multi-class data sets, namely WBC, WDBC, and Breast Tissue from the UCI machine learning repository. The experiments are conducted in WEKA. The aim of this research is to find the best classifier with respect to accuracy, precision, sensitivity, and specificity in detecting breast cancer.
In this paper, the concept of data mining is summarized and its significance is illustrated along with its methodologies. Data mining based on neural networks and genetic algorithms is researched in detail, and the key technologies and ways to achieve data mining with neural networks and genetic algorithms are surveyed. This paper also conducts a formal review of the area of rule extraction from ANNs and GAs. Keywords: Data Mining, Neural Network, Genetic Algorithm, Rule Extraction.
Classification of medical datasets using back propagation neural network powe... (IJECEIAES)
Classification is one of the most indispensable domains in data mining and machine learning. The classification process has a good reputation in the area of disease diagnosis by computer systems, where progress in smart computer technologies can be invested in diagnosing various diseases based on data of real patients documented in databases. This paper introduces a methodology for diagnosing a set of diseases including two types of cancer (breast and lung cancer), two diabetes datasets, and heart attack. A back-propagation neural network plays the role of classifier, and its performance is enhanced by using a genetic algorithm, which provides the classifier with the optimal features to raise the classification rate as high as possible. The system showed high efficiency in dealing with databases that differ from each other in size, number of features, and nature of the data, as the results illustrate: the classification rate reached 100% on most datasets.
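A compact sketch of this GA-wrapped-around-a-neural-net idea: binary chromosomes encode feature masks, and fitness is the network's cross-validated accuracy on the masked features. The population size, generation count, crossover, mutation rate, and the breast cancer stand-in dataset are all illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)

def fitness(mask):
    if not mask.any():
        return 0.0
    clf = MLPClassifier(hidden_layer_sizes=(10,), max_iter=500, random_state=0)
    return cross_val_score(clf, X[:, mask], y, cv=3).mean()

pop = rng.integers(0, 2, size=(8, X.shape[1])).astype(bool)
for gen in range(5):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[scores.argsort()[-4:]]              # keep the best half
    cut = rng.integers(1, X.shape[1], 4)
    children = np.array([np.concatenate([parents[i % 4][:c],
                                         parents[(i + 1) % 4][c:]])
                         for i, c in enumerate(cut)]) # one-point crossover
    flip = rng.random(children.shape) < 0.05          # bit-flip mutation
    pop = np.vstack([parents, children ^ flip])

best = pop[np.argmax([fitness(m) for m in pop])]
print(best.sum(), "features selected")
```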
Hybrid Method HVS-MRMR for Variable Selection in Multilayer Artificial Neural... (IJECEIAES)
Variable selection is an important technique for reducing the dimensionality of data, frequently used in preprocessing for data mining. This paper presents a new variable selection algorithm that uses heuristic variable selection (HVS) and Minimum Redundancy Maximum Relevance (MRMR): we enhance the HVS method for variable selection by incorporating the MRMR filter. Our algorithm is based on a wrapper approach using a multi-layer perceptron; we call it the HVS-MRMR wrapper for variable selection. The relevance of a set of variables is measured by a convex combination of the relevance given by the HVS criterion and the MRMR criterion. This approach selects new relevant variables. We evaluate the performance of HVS-MRMR on eight benchmark classification problems. The experimental results show that HVS-MRMR selects fewer variables with high classification accuracy compared to MRMR, HVS, and no variable selection on most datasets. HVS-MRMR can be applied to various classification problems that require high classification accuracy.
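The MRMR half of the hybrid can be sketched as a greedy selection that maximizes relevance to the class minus mean redundancy with the already-chosen variables, both estimated by mutual information. The dataset and the target subset size are assumptions; the HVS term and the convex combination are omitted.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

X, y = load_breast_cancer(return_X_y=True)
relevance = mutual_info_classif(X, y, random_state=0)

selected = [int(relevance.argmax())]          # start with the most relevant
while len(selected) < 5:
    best_j, best_score = None, -np.inf
    for j in range(X.shape[1]):
        if j in selected:
            continue
        # mean mutual information with the already-selected variables
        redundancy = np.mean([mutual_info_regression(
            X[:, [j]], X[:, s], random_state=0)[0] for s in selected])
        score = relevance[j] - redundancy     # mRMR difference criterion
        if score > best_score:
            best_j, best_score = j, score
    selected.append(best_j)

print("selected features:", selected)
```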
The healthcare industry is considered one of the largest industries in the world, and, like the medical industry, it holds a very large amount of health- and medicine-related data. This data helps to discover useful trends and patterns that can be used in diagnosis and decision making. Clustering techniques like K-means, D-Stream, COBWEB, and EM have been used for healthcare purposes such as heart disease diagnosis and cancer detection. This paper focuses on the use of the K-means and D-Stream algorithms in healthcare. These algorithms were used to determine whether a person is fit or unfit, a decision taken based on his/her historical and current data. Both clustering algorithms were analyzed by applying them to patients' current and historical biomedical databases; the analysis depends on attributes such as peripheral blood oxygenation, diastolic arterial blood pressure, systolic arterial blood pressure, heart rate, heredity, obesity, and cigarette smoking. By analyzing both algorithms it was found that the density-based clustering algorithm, i.e. the D-Stream algorithm, gives more accurate results than K-means when used to form clusters of historical biomedical data, and it overcomes drawbacks of the K-means algorithm.
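To illustrate the contrast, here is a sketch comparing k-means with a density-based clusterer on synthetic patient attributes. DBSCAN stands in for D-Stream, which is its streaming, grid-based relative; the attribute distributions and clustering parameters are invented.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# columns: SpO2, diastolic BP, systolic BP, heart rate (synthetic values)
patients = np.column_stack([rng.normal(97, 2, 300), rng.normal(80, 10, 300),
                            rng.normal(120, 15, 300), rng.normal(75, 12, 300)])
Z = StandardScaler().fit_transform(patients)

print("k-means sizes:", np.bincount(
    KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)))

# density-based clustering; label -1 marks noise points
db = DBSCAN(eps=0.9, min_samples=10).fit_predict(Z)
print("density-based sizes:", {int(c): int((db == c).sum()) for c in set(db)})
```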
CLASSIFIER SELECTION MODELS FOR INTRUSION DETECTION SYSTEM (IDS) (ieijjournal)
Any abnormal activity can be assumed to be an anomalous intrusion. In the literature, several techniques and algorithms have been discussed for anomaly detection, and in most cases the true positive and false positive parameters have been used to compare their performance. However, depending upon the application, a wrong true positive or wrong false positive may have severe detrimental effects; this necessitates the inclusion of cost-sensitive parameters in the performance evaluation. Moreover, the most common testing dataset, KDD-CUP-99, contains a huge amount of data, which in turn requires a certain amount of preprocessing. Our work in this paper starts by enumerating the necessity of cost-sensitive analysis with some real-life examples. After discussing KDD-CUP-99, an approach is proposed for feature elimination and then feature selection, to directly reduce the number of relevant features and indirectly the size of KDD-CUP-99. From the reported literature, general methods for anomaly detection are selected which perform best for different types of attacks; these different classifiers are clubbed to form an ensemble. A cost-opportunistic technique is suggested to allocate relative weights to the classifier ensemble for generating the final result. A cost-sensitivity analysis of true positive and false positive results is performed, and a method is proposed to select the elements of the cost-sensitivity matrix to further improve the results and achieve better overall performance. The impact on the performance trade-off due to incorporating cost sensitivity is discussed.
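The cost-opportunistic combination step can be sketched as a weighted vote whose decision threshold comes from the cost matrix: for binary costs, predicting "attack" minimizes expected cost when the combined score exceeds C_FP / (C_FP + C_FN). All scores, weights, and costs below are illustrative assumptions, not the paper's learned values.

```python
import numpy as np

# hypothetical per-classifier scores (probability of "attack") for 5 flows
scores = np.array([[0.9, 0.2, 0.6, 0.1, 0.8],    # classifier tuned for DoS
                   [0.7, 0.1, 0.8, 0.3, 0.6],    # tuned for probe attacks
                   [0.8, 0.4, 0.5, 0.2, 0.9]])   # tuned for R2L/U2R

cost_fp, cost_fn = 1.0, 5.0            # missing an attack costs more
weights = np.array([0.5, 0.2, 0.3])    # relative ensemble weights

combined = weights @ scores            # weighted vote per flow
# predicting positive is cheaper when (1-p)*C_FP < p*C_FN
threshold = cost_fp / (cost_fp + cost_fn)
print(combined > threshold)
```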
PREDICTION OF MALIGNANCY IN SUSPECTED THYROID TUMOUR PATIENTS BY THREE DIFFER... (cscpconf)
In the present study, the abilities of three data mining classification methods, namely artificial neural networks with the feed-forward back-propagation algorithm, the J48 decision tree method, and logistic regression analysis, are compared on a real medical dataset. The prediction of malignancy in suspected thyroid tumour patients is the objective of the study. The accuracy of the predictions (the minimum error rate), the time consumed in the modelling process, and the interpretability and simplicity of the results for clinical experts are the factors considered in choosing the best method.
Data mining techniques play a very important role in the health care industry. Liver disease is one of the growing diseases these days due to people's changed lifestyle. Various authors have worked in the field of data classification using techniques such as Decision Tree, Support Vector Machine, Naïve Bayes, and Artificial Neural Network (ANN). These techniques can be very useful for the timely and accurate classification and prediction of diseases and for better care of patients. The main focus of this work is to analyze the use of data mining techniques by different authors for the prediction and classification of liver patients.
Comparative Study of Data Mining Classification Algorithms in Heart Disease P... (paperpublications3)
This research paper intends to identify the most accurate of the classification algorithms. The main objective of this research work is to predict more accurately the presence of heart disease in patients. The dataset used here is the Hungarian heart disease dataset (14 attributes, 294 instances) from the UCI Machine Learning Repository. We use the WEKA tool with five classifiers, Naïve Bayes, the Logistic function, the RBF Network function, the Decision Table rule, and the SMO function, and compare their performance on the diagnosis. Of these, Naïve Bayes provides 86% accuracy in 0 seconds and the RBF Network provides 86% accuracy in 0.17 seconds; since both reach the same accuracy, Naïve Bayes outperforms on model build time (0 seconds).
Hypothesis on Different Data Mining Algorithms (IJERA)
In this paper, different classification algorithms for data mining are discussed. Data mining is about explaining the past and predicting the future by means of data analysis. Classification is a data mining task that categorizes data based on numerical or categorical variables. Many algorithms have been proposed to classify data; of them, five are comparatively studied here. There are four different classification approaches, namely frequency table, covariance matrix, similarity functions, and others. As research work on classification methods, the Naive Bayesian, K Nearest Neighbors, Decision Tree, Artificial Neural Network, and Support Vector Machine algorithms are studied and examined using benchmark datasets such as Iris and Lung Cancer.
Prognosticating Autism Spectrum Disorder Using Artificial Neural Network: Lev... (Avishek Choudhury)
Autism spectrum condition (ASC) or autism spectrum disorder (ASD) is primarily identified with the help of behavioral indications encompassing social, sensory, and motor characteristics. Although categorized, recurring motor actions are measured during diagnosis, quantifiable measures that ascertain kinematic physiognomies in the movement configurations of autistic persons are not adequately studied, hindering advances in understanding the etiology of motor impairment. Subject aspects such as behavioral characters that influence ASD need further exploration. Presently, limited autism datasets associated with screening for ASD are available, and a majority of them are genetic. Hence, in this study, we used a dataset related to autism screening covering ten behavioral and ten personal attributes that have been effective in distinguishing ASD cases from controls in behavioral science. ASD diagnosis is time-exhaustive and uneconomical, and the burgeoning ASD cases worldwide mandate a fast and economical screening tool. Our study aimed to implement an artificial neural network with the Levenberg-Marquardt algorithm to detect ASD and examine its predictive accuracy, and consecutively to develop a clinical decision support system for early ASD identification.
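For context, the Levenberg-Marquardt step such a network is trained with solves (J^T J + mu I) delta = J^T e at each iteration, blending Gauss-Newton with gradient descent via the damping term mu. A single-parameter toy fit keeps the sketch self-contained; the paper's network and ASD data are not reproduced, and mu is held fixed rather than adapted.

```python
import numpy as np

x = np.linspace(0, 1, 50)
target = np.exp(-3 * x)                      # data to fit
w = np.array([-1.0])                         # single-parameter model exp(w*x)

mu = 0.01                                    # fixed damping term
for _ in range(25):
    pred = np.exp(w[0] * x)
    e = target - pred                        # residuals
    J = (x * pred).reshape(-1, 1)            # Jacobian d pred / d w
    delta = np.linalg.solve(J.T @ J + mu * np.eye(1), J.T @ e)
    w += delta                               # LM weight update

print(w)                                     # approaches -3
```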
Disease prediction in big data healthcare using extended convolutional neural...IJAAS Team
Diabetes Mellitus is one of the growing fatal diseases all over the world. It leads to complications that include heart disease, stroke, and nerve disease, kidney damage. So, Medical Professionals want a reliable prediction system to diagnose Diabetes. To predict the diabetes at earlier stage, different machine learning techniques are useful for examining the data from different sources and valuable knowledge is synopsized. So, mining the diabetes data in an efficient way is a crucial concern. In this project, a medical dataset has been accomplished to predict the diabetes. The R-Studio and Pypark software was employed as a statistical computing tool for diagnosing diabetes. The PIMA Indian database was acquired from UCI repository will be used for analysis. The dataset was studied and analyzed to build an effective model that predicts and diagnoses the diabetes disease earlier.
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CAREijistjournal
Dichotomous data is a type of categorical data, which is binary with categories zero and one. Health care data is one of the heavily used categorical data. Binary data are the simplest form of data used for heath care databases in which close ended questions can be used; it is very efficient based on computational efficiency and memory capacity to represent categorical type data. Clustering health care or medical data is very tedious due to its complex data representation models, high dimensionality and data sparsity. In this paper, clustering is performed after transforming the dichotomous data into real by wiener transformation. The proposed algorithm can be usable for determining the correlation of the health disorders and symptoms observed in large medical and health binary databases. Computational results show that the clustering based on Wiener transformation is very efficient in terms of objectivity and subjectivity.
Similar to MULTI-PARAMETER BASED PERFORMANCE EVALUATION OF CLASSIFICATION ALGORITHMS (20)
Student information management system project report ii.pdfKamal Acharya
Our project explains about the student management. This project mainly explains the various actions related to student details. This project shows some ease in adding, editing and deleting the student details. It also provides a less time consuming process for viewing, adding, editing and deleting the marks of the students.
Explore the innovative world of trenchless pipe repair with our comprehensive guide, "The Benefits and Techniques of Trenchless Pipe Repair." This document delves into the modern methods of repairing underground pipes without the need for extensive excavation, highlighting the numerous advantages and the latest techniques used in the industry.
Learn about the cost savings, reduced environmental impact, and minimal disruption associated with trenchless technology. Discover detailed explanations of popular techniques such as pipe bursting, cured-in-place pipe (CIPP) lining, and directional drilling. Understand how these methods can be applied to various types of infrastructure, from residential plumbing to large-scale municipal systems.
Ideal for homeowners, contractors, engineers, and anyone interested in modern plumbing solutions, this guide provides valuable insights into why trenchless pipe repair is becoming the preferred choice for pipe rehabilitation. Stay informed about the latest advancements and best practices in the field.
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Dr.Costas Sachpazis
Terzaghi's soil bearing capacity theory, developed by Karl Terzaghi, is a fundamental principle in geotechnical engineering used to determine the bearing capacity of shallow foundations. This theory provides a method to calculate the ultimate bearing capacity of soil, which is the maximum load per unit area that the soil can support without undergoing shear failure. The Calculation HTML Code included.
Saudi Arabia stands as a titan in the global energy landscape, renowned for its abundant oil and gas resources. It's the largest exporter of petroleum and holds some of the world's most significant reserves. Let's delve into the top 10 oil and gas projects shaping Saudi Arabia's energy future in 2024.
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
International Journal of Computer Science & Information Technology (IJCSIT) Vol 7, No 3, June 2015
DOI: 10.5121/ijcsit.2015.7310
MULTI-PARAMETER BASED PERFORMANCE
EVALUATION OF CLASSIFICATION ALGORITHMS
Saurabh Kr. Srivastava¹ and Sandeep Kr. Singh²
¹Research Scholar, Department of Computer Sc. & Engineering, JIIT University, Noida, India
²Assistant Professor, Department of Computer Sc. & Engineering, JIIT University, Noida, India
ABSTRACT
Diabetes is among the most common diseases in India. It affects patients' health and also leads to
other chronic diseases. Prediction of diabetes plays a significant role in saving lives and cost.
Predicting diabetes in the human body is a challenging task because it depends on several factors.
Few studies have reported the performance of classification algorithms in terms of accuracy. The
results in these studies are difficult and complex for medical practitioners to understand, and they
also lack visual aids, as they are presented in pure text format. This survey uses ROC and PRC
graphical measures to improve the understanding of results. A detailed parameter-wise discussion of
the comparison, which is lacking in other reported surveys, is also presented. Execution time,
Accuracy, TP Rate, FP Rate, Precision, Recall, and F Measure parameters are used for comparative
analysis, and a Confusion Matrix is prepared for a quick review of each algorithm. Ten-fold
cross-validation is used for estimation of the prediction model. Different sets of classification
algorithms are analyzed on the diabetes dataset acquired from the UCI repository.
KEYWORDS
Data mining, Diabetes, UCI repository, Machine Learning.
1. INTRODUCTION AND LITERATURE REVIEW
Diabetes invites other chronic diseases. It is primarily associated with an increase in blood
glucose. The two common categories of diabetes are Type-1 and Type-2. Type-1 diabetes occurs
when the pancreas fails to produce sufficient insulin, while Type-2 occurs when the body cannot
effectively use the insulin produced. Confirming diabetes requires extensive examination, which
costs both time and money. Prior studies show that classification algorithms have been used
effectively in the prediction of diabetes.
Several studies and results have been reported using data mining techniques for classification in
medical databases. In that context, J.W. Smith et al. [1] introduced ADAP, an adaptive learning
routine that executes digital analogs of perceptron-like devices. The algorithm uses 576 training
instances and achieves a classification accuracy of 76% on the remaining 192 instances.
K. Srinivas et al. [4] studied the applications of data mining techniques in healthcare and
reported prediction of heart attacks using a clinical database. Elma Kolce et al. [5] gave a
glimpse of how different data mining techniques are utilized. Asha Rajkumar et al. [2] and
A. Khemphila et al. [3] discussed various data mining techniques for the diagnosis of certain
life-threatening diseases. Huy Nguyen A.P. et al. [6] proposed a new Homogeneity-Based Algorithm
that balances the overfitting and overgeneralization behavior of classification. Recently,
K.R. Lakshmi et al. [10] and Karthikeyini V. et al. [8,9] discussed the performance of data mining
algorithms based on their computing time and precision values.
Data mining techniques aim to extract hidden information from large datasets and transform it into
knowledge patterns, for further use, that can be easily understood by humans. A model is an example
of a structure based on predicted data. This paper uses ROC and PRC graphical measures to visualize
the results, making them easy for medical practitioners to understand. Execution time, Accuracy,
TP Rate, FP Rate, Precision, Recall, and F Measure parameters are used for comparative analysis.
We describe the results of experiments in which categorical (Tree, Function, Rule & Bayesian)
algorithms are used for performance evaluation. The paper is organized as follows: in Section 2 we
describe the algorithms used for performance evaluation; in Section 3 we discuss the measures on
which the results are obtained; in Section 4 we explain the experimental setup, dataset and result
evaluation process; in Sections 5 & 6 we describe and discuss the evaluated results; finally, in
Section 7 we present our conclusion.
2. ALGORITHMS USED FOR RESULT EVALUATION
2.1. Decision Tree Based Classification
A decision tree maps observations to conclusions in the form of target values. It separates the
input space of a dataset into mutually exclusive regions, each of which is assigned a label, a
value or an action to characterize its data points. Two variants, J48 and RandomTree, are
discussed in the following subsections.
2.1.1. J48: A decision tree is a graphical representation consisting of internal and external
nodes connected by branches. An internal node implements a decision function that determines which
node to visit next. An external node has no child nodes and is associated with a value that
characterizes the data leading to its being visited.
Decision tree construction algorithms involve a two-step process: first growing, then pruning.
The growing process sets up a very large decision tree, and the pruning process reduces its size
and the overfitting of the data. The pruned decision tree that is used for classification is
called a classification tree [19]. A J48 decision tree is a predictive machine-learning model that
decides the target value (dependent variable) of a new sample based on various attribute values of
the available data. The nodes of a J48 decision tree denote the different attributes; the branches
between the nodes tell us the possible values that these attributes can take in the observed
samples. To build a decision tree, we need to calculate the entropy and information gain [18]:
Entropy: E(S) = Σ (i = 1 to c) −p_i log2(p_i)
Information Gain: Gain(T, X) = Entropy(T) − Entropy(T, X)
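As a concrete illustration of these two formulas, the following sketch (our own minimal Java
example, not code from the paper; the class counts and the split are hypothetical) computes the
entropy of a two-class set and the information gain of a binary split:

// Sketch: entropy and information gain for class-labelled counts.
public class EntropyDemo {

    // E(S) = sum over classes of -p_i * log2(p_i)
    static double entropy(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) total += c;
        double e = 0.0;
        for (int c : classCounts) {
            if (c == 0) continue;                      // treat 0 * log(0) as 0
            double p = (double) c / total;
            e -= p * Math.log(p) / Math.log(2);
        }
        return e;
    }

    // Gain(T, X) = Entropy(T) - weighted entropy of the partitions of T by X
    static double informationGain(int[] parent, int[][] partitions) {
        int total = 0;
        for (int c : parent) total += c;
        double weighted = 0.0;
        for (int[] part : partitions) {
            int n = 0;
            for (int c : part) n += c;
            weighted += (double) n / total * entropy(part);
        }
        return entropy(parent) - weighted;
    }

    public static void main(String[] args) {
        int[] parent = {268, 500};                     // class counts, e.g. True/False
        int[][] split = {{200, 100}, {68, 400}};       // a hypothetical binary split
        System.out.println("Entropy = " + entropy(parent));
        System.out.println("Gain    = " + informationGain(parent, split));
    }
}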
2.1.2. Random Tree: It is the simplest tree algorithm and results from a stochastic process. A
random binary tree refers to a system model that selects random values over time. Two different
distributions are commonly used for random tree formation: 1) inserting nodes one at a time
according to a random permutation; 2) choosing from a uniform discrete distribution. Repeated
splitting is another distribution used to form a random tree. Adding and removing nodes directly
in a random binary tree will generally disrupt its random structure. To keep the nodes of random
trees balanced, random permutations are used for dynamic insertion and deletion of nodes.
2.2. Function Based Classification
It is a mathematical (or statistical) procedure for classifying an instance into a particular class.
2.2.1. Logistic Regression: Logistic regression predicts the probability of an outcome that can
take only two values. It performs a least-squares fit of a parameter vector β to a numeric target
variable. Logistic regression uses the equation F(X) = βᵀX to formulate the prediction model,
where X is the input vector (including a constant term to accommodate the intercept) and β is the
parameter vector. It is possible to use this technique for classification by directly fitting
logistic regression models to class indicator variables. The prediction is based on the use of one
or several predictors (numerical and categorical). A linear regression is not appropriate for
predicting the value of a binary variable for two reasons:
1. A linear regression will predict values outside the acceptable range (e.g. predicting
probabilities outside the range 0 to 1).
2. Since a two-value experiment can take only one of two possible values in each trial, the
residuals will not be normally distributed about the predicted line.
Figure 1. Logistic Regression Function
Simply put, the logistic regression equation can be written in terms of an odds ratio:
p / (1 − p) = exp(b0 + b1x)
Here, b0 is the constant responsible for moving the curve left and right, while b1 defines the
steepness of the curve.
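Solving the odds-ratio form for p gives the familiar sigmoid p = 1 / (1 + exp(−(b0 + b1x))). A
minimal sketch follows, assuming made-up coefficients (b0 and b1 are illustrative, not fitted
values from this study):

// Sketch: evaluating the logistic (sigmoid) function for a single predictor.
public class LogisticDemo {

    // p / (1 - p) = exp(b0 + b1 * x)  =>  p = 1 / (1 + exp(-(b0 + b1 * x)))
    static double sigmoid(double b0, double b1, double x) {
        return 1.0 / (1.0 + Math.exp(-(b0 + b1 * x)));
    }

    public static void main(String[] args) {
        double b0 = -5.0, b1 = 0.04;        // hypothetical intercept and slope
        double plasma = 150.0;              // e.g. a plasma glucose reading
        System.out.printf("Predicted probability = %.3f%n", sigmoid(b0, b1, plasma));
    }
}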
2.2.2. Multi-Layer Perceptron: The simplest form of neural network can classify only linearly
separable patterns, while for non-linear patterns the multi-layer perceptron (MLP) neural network
model performs well. It maps sets of input data onto a set of appropriate outputs. An MLP consists
of multiple layers of nodes in a directed graph, with each layer fully connected to the next one.
Except for the input nodes, each node is a neuron (or processing element) with a nonlinear
activation function. MLP uses the back-propagation learning algorithm for training and is widely
used in pattern classification and recognition. The simplest form of MLP is shown below.
Figure 2. Multi-Layer Perceptron
2.3. Rule Based Classification
Rule based classification is based on rules of the form (if-else). The rules are mutually
exclusive and exhaustive. Two variants are discussed in the following subsections.
2.3.1. ZeroR: ZeroR is the simplest method; it focuses only on the class. It is the least accurate
classifier, since it only predicts the majority category (class). It is useful for establishing a
baseline as a benchmark for other classification methods. The ZeroR algorithm ignores the
predictors: it simply constructs a frequency table for the target and selects its most frequent
value.
2.3.2. OneR: OneR, short for "One Rule", is a simple yet accurate rule based classifier that uses
a single predictor. It generates one rule for each predictor in the data and constructs a
frequency table for each predictor against the target. For each value of a predictor, OneR builds
its rule as follows (a sketch of this procedure is given below):
1. Count how often each value of the target (class) appears.
2. Find the most frequent class.
3. Make the rule assign that class to this value of the predictor.
4. Calculate the total error of the rules of each predictor.
5. Choose the predictor with the smallest total error.
OneR produces rules only slightly less accurate than state-of-the-art classification algorithms.
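The following sketch (our own illustration with a made-up toy dataset, not code from the paper)
implements the core of this procedure for a single categorical predictor, returning the total
error that OneR would use to rank predictors:

import java.util.HashMap;
import java.util.Map;

// Sketch: OneR's rule error for one categorical predictor.
public class OneRuleDemo {

    // Build the frequency table predictor-value -> class counts, assign each
    // value its majority class, and count the instances that disagree.
    static int ruleError(String[] predictor, String[] target) {
        Map<String, Map<String, Integer>> freq = new HashMap<>();
        for (int i = 0; i < predictor.length; i++) {
            freq.computeIfAbsent(predictor[i], k -> new HashMap<>())
                .merge(target[i], 1, Integer::sum);
        }
        int errors = 0;
        for (Map<String, Integer> counts : freq.values()) {
            int total = 0, best = 0;
            for (int c : counts.values()) { total += c; best = Math.max(best, c); }
            errors += total - best;    // instances not covered by the majority class
        }
        return errors;                 // OneR keeps the predictor with the smallest error
    }

    public static void main(String[] args) {
        String[] age = {"young", "young", "old", "old", "old"};    // toy predictor
        String[] cls = {"false", "false", "true", "true", "false"};
        System.out.println("Total error for the 'age' rule: " + ruleError(age, cls));
    }
}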
2.4. Bayesian Based Classification
Bayesian classification is based on Bayes' probability rules and depends on likelihood functions.
Two variants are discussed in the following subsections.
2.4.1. Simple Bayes Classification: Bayesian networks are a powerful probabilistic representation
for classification. Bayes' theorem provides a rule for calculating the posterior probability. If
the probability of item A is to be calculated given item B, then according to Bayes' theorem:
P(A|B) = P(B|A) * P(A) / P(B)
where
P(A|B) is the probability of A given that B is true (the posterior probability),
P(B|A) is the probability of B given that A is true (the likelihood),
P(A) is the prior probability of the class, and
P(B) is the prior probability of the predictor.
2.4.2. Naive Bayes: The Naive Bayes classifier is based on Bayes' theorem with independence
assumptions between predictors. A Naive Bayesian model is easy to build and requires no
complicated iterative parameter estimation. It treats all the attributes in the data individually:
the value of a predictor (X) in a given class (C) is assumed to be independent of the values of
the other predictors. This assumption is called class conditional independence. The working steps
of the Naive Bayes classifier are as follows (a numerical sketch follows the definitions below):
1. Construct the frequency table of each attribute against the target.
2. Transform the frequency table into a likelihood table and use the Naive Bayes equation to
calculate the posterior probability for each class.
3. The class with the highest posterior probability is the outcome of the prediction.
P(C|X) = P(X|C) * P(C) / P(X)
where
P(C|X) is the posterior probability of the class (target) given the predictor (attribute),
P(X|C) is the likelihood, i.e. the probability of the predictor given the class,
P(C) is the prior probability of the class, and
P(X) is the prior probability of the predictor.
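As a numerical sketch of these steps (our own illustration), the fragment below scores one
instance against two classes. The class priors follow the dataset's 268/500 class split (see
Table 2); the per-attribute likelihoods are hypothetical values standing in for a learned
likelihood table:

// Sketch: Naive Bayes posterior comparison for one instance.
public class NaiveBayesDemo {
    public static void main(String[] args) {
        // Priors P(C) from the dataset's class distribution (268 True, 500 False)
        double pTrue = 268.0 / 768.0, pFalse = 500.0 / 768.0;

        // Hypothetical likelihoods P(x_i | C) for two observed attribute values
        double[] likeTrue  = {0.70, 0.55};
        double[] likeFalse = {0.30, 0.40};

        // Class conditional independence: multiply the per-attribute likelihoods
        double scoreTrue = pTrue, scoreFalse = pFalse;
        for (int i = 0; i < likeTrue.length; i++) {
            scoreTrue  *= likeTrue[i];
            scoreFalse *= likeFalse[i];
        }

        // P(X) is a common denominator, so comparing the numerators suffices
        System.out.println("Predicted class: " + (scoreTrue > scoreFalse ? "True" : "False"));
    }
}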
3. MEASURES ON WHICH RESULT IS OBTAINED
3.1. Execution Time: The time taken by the tool to execute and evaluate an algorithm.
3.2. Accuracy: Accuracy is the percentage of test-set instances correctly classified by the
classifier.
3.3. TP Rate: The True Positive rate is the proportion of examples classified as class X among all
examples that truly have class X. It shows how much of the class is truly captured.
3.4. FP Rate: The False Positive rate is the proportion of examples incorrectly classified as
class X although they belong to another class.
3.5. Precision: Precision is the proportion of examples that truly have class X among all those
classified as X. It is a measure of exactness, and is also called the positive predictive value
(PPV):
PPV = TP / (TP + FP)
3.6. Recall: Recall is the proportion of examples classified as class X among all examples that
truly have class X. It is a measure of completeness, and is also known as sensitivity or the true
positive rate:
Recall = TP / (TP + FN)
3.7. F Measure: The F measure is the harmonic mean of precision and recall, defined as:
F = 2 * Precision * Recall / (Precision + Recall)
3.8. ROC Area: ROC, the receiver operating characteristic, compares two operating characteristics,
the TP rate and the FP rate. A receiver operating characteristic curve is a graphical measure that
interprets the performance of a classifier as its discrimination threshold is varied. It is the
outcome of plotting the true positive rate against the false positive rate at various threshold
settings. The true positive rate is the fraction of true positives out of the total actual
positives, while the false positive rate is the fraction of false positives out of the total
actual negatives. The points on the threshold curve record statistics such as true positives,
false positives, etc. The curves are generated by sorting the predictions produced by the
classifier in descending order of the probability assigned to the positive class. The defining
formulas (as percentages) are:
TP Rate = TP / (TP + FN) * 100, FP Rate = FP / (FP + TN) * 100
3.9. PRC Area: The PRC is the precision recall characteristic curve. It compares two operating
characteristics (PPV and sensitivity) as the criterion changes. A precision recall (PRC) curve is
a graphical plot that illustrates the performance of a binary classifier as its discrimination
threshold is varied. PPV is the fraction of true positives out of all positive test outcomes,
while sensitivity is the fraction of true positives out of all actual positives.
3.10. Confusion Matrix: The confusion matrix is also called a contingency table. In our case the
result has two classes, so it is a 2 x 2 confusion matrix. The correctly classified instances lie
on the diagonal of the matrix; all other entries are incorrectly classified instances.
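To make the measures in Sections 3.2-3.7 concrete, the following sketch (our own illustration; the
counts are hypothetical, not results reported in this paper) derives them from a 2 x 2 confusion
matrix:

// Sketch: scalar measures from a 2 x 2 confusion matrix.
public class MetricsDemo {
    public static void main(String[] args) {
        // Hypothetical counts: actual positive/negative vs. predicted positive/negative
        int tp = 155, fn = 113, fp = 77, tn = 423;
        int total = tp + fn + fp + tn;

        double accuracy  = 100.0 * (tp + tn) / total;                     // Section 3.2
        double tpRate    = (double) tp / (tp + fn);                       // Section 3.3
        double fpRate    = (double) fp / (fp + tn);                       // Section 3.4
        double precision = (double) tp / (tp + fp);                       // Section 3.5
        double recall    = tpRate;                                        // Section 3.6
        double fMeasure  = 2 * precision * recall / (precision + recall); // Section 3.7

        System.out.printf("Acc=%.2f%% TPR=%.3f FPR=%.3f Pre=%.3f Rec=%.3f F=%.3f%n",
                accuracy, tpRate, fpRate, precision, recall, fMeasure);
    }
}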
4. EXPERIMENTAL SETUP
The Weka environment [20] is a popular machine learning suite written in Java, developed at the
University of Waikato, New Zealand. It provides a GUI to its users. Weka offers a wide range of
univariate and multivariate, parametric and non-parametric tests, along with a list of feature
selection techniques. Weka supports data preprocessing, clustering, classification, regression,
visualization, and feature selection.
4.1 Data Source
To evaluate and analyze the data mining classification algorithms, the UCI Diabetes dataset is
used. This dataset has nine attributes and 768 instances.
Table 1. Data Description

S.No.  Name       Description
1.     Pregnancy  Number of times pregnant
2.     Plasma     Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3.     Pres       Diastolic blood pressure (mm Hg)
4.     Skin       Triceps skin fold thickness (mm)
5.     Insulin    2-hour serum insulin (mu U/ml)
6.     Mass       Body mass index (weight in kg / (height in m)²)
7.     Pedi       Diabetes pedigree function
8.     Age        Age (in years)
9.     Class      Class variable (False or True)
4.2. Result Evaluation
Sets of algorithms (Tree, Function, Rule & Bayesian) are used for performance evaluation in terms
of their execution time, accuracy, precision, recall, ROC and PRC. The following steps were taken
for their evaluation.
Step 1
4.2.1. Pre-processing
During pre-processing, the dataset used for evaluation is processed using filters. Data
pre-processing helps in removing out-of-range values, missing values, impossible data
combinations, etc. Data quality is the first and foremost requirement before analysis. The result
of data preprocessing is the final training set. Preprocessing of the data is done using the
supervised discretize filter [12], as sketched below.
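A minimal sketch of this step with Weka's Java API follows (the ARFF file name is an assumption
for illustration; the paper performed the equivalent step through the Weka GUI):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.Discretize;

// Sketch: applying the supervised discretize filter [12] to the diabetes data.
public class PreprocessDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("diabetes.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);   // class is the last attribute

        Discretize filter = new Discretize();
        filter.setInputFormat(data);                    // learn cut points from the data
        Instances discretized = Filter.useFilter(data, filter);

        System.out.println("Attributes after filtering: " + discretized.numAttributes());
    }
}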
Step 2
4.2.2. Training
During training, the algorithms are applied to the data so that the model learns well. A training
set is used along with a supervised learning method to train a model that can then be used by an
artificial intelligence machine. Cross-validation tests the model in the training phase in order
to limit the problem of data overfitting, and gives an insight into how the model will generalize
to an independent dataset. 10-fold cross-validation is used for classification of the dataset.
Step 3
4.2.3. Testing
The purpose of testing is to validate whether the trained model correctly classifies the given
dataset. 10-fold cross-validation means that the data is divided into 10 chunks and the model is
trained 10 times, treating a different chunk as the holdout set each time. Testing can be done on
new data as well as on the part of the dataset that was not used for training. The aim of
cross-validation is to ensure that every example from the original dataset has the same chance of
appearing in the training and testing sets [11]. The full pipeline is sketched below.
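The three steps can be reproduced programmatically; the sketch below runs 10-fold cross-validation
of one of the evaluated classifiers (J48) with Weka's Java API. The file name and random seed are
assumptions, and the paper obtained its numbers through the Weka GUI rather than this code:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: 10-fold cross-validation of J48 on the diabetes dataset.
public class CrossValidationDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("diabetes.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));

        System.out.println(eval.toSummaryString());                 // accuracy, etc.
        System.out.println(eval.toMatrixString());                  // confusion matrix
        System.out.printf("ROC area (class 0): %.3f%n", eval.areaUnderROC(0));
    }
}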
5. RESULT
Execution Time, TP Rate, FP Rate, Precision, Recall, F Measure, Accuracy, ROC Area, PRC Area, and
Confusion Matrix are the measures used for performance evaluation. All the algorithms discussed
are compared to identify those with higher precision and higher classification accuracy. The
overall performance evaluation table is given below.
Table 2. Performance evaluation of classifiers
(Confusion matrix rows: actual A = True, actual B = False; columns: predicted A, predicted B)

Classifier         Acc.   TP    FP    Pre.  Recall  F Mea.  ROC   PRC   Confusion Matrix
Tree based:
  J48              78.26  0.73  0.40  0.73  0.73    0.71    0.75  0.73  [115 113; 52 448]
  Random Tree      74.61  0.66  0.39  0.67  0.66    0.66    0.66  0.65  [147 121; 140 360]
Function based:
  Logistics        78.65  0.75  0.33  0.75  0.75    0.75    0.81  0.80  [155 113; 77 423]
  MLP              75.91  0.71  0.34  0.71  0.71    0.71    0.77  0.78  [158 110; 110 390]
Rule based:
  OneR             74.74  0.73  0.39  0.72  0.73    0.72    0.67  0.66  [125 143; 62 438]
  ZeroR            65.10  0.65  0.65  0.42  0.65    0.51    0.50  0.54  [0 268; 0 500]
Bayes:
  Bayes            77.87  0.75  0.30  0.75  0.75    0.75    0.83  0.82  [175 93; 99 401]
  Naive Bayes      77.87  0.75  0.30  0.75  0.75    0.75    0.83  0.82  [172 96; 93 407]
6. RESULT DISCUSSION
6.1. Computation Time
This measure depicts the time taken by each algorithm for result evaluation. Here the ZeroR and
Naive Bayes classifiers perform best among all.
Table 3. Computation time of classifiers

Classifier    Com. Time
J48           130 ms
Random Tree   50 ms
Logistics     200 ms
MLP           3950 ms
OneR          20 ms
ZeroR         less than 10 ms
Bayes         50 ms
Naive Bayes   less than 10 ms
6.2. Accuracy, Precision, Recall, TP & FP
Precision and accuracy are related terms. In statistics, accuracy measures the degree of closeness
to the true value, while precision is reproducibility (the degree to which repeated measurements
under unchanged conditions show the same result). When a system contains systematic error,
increasing the sample size increases precision but does not increase accuracy; eliminating the
systematic error improves accuracy but does not change precision. The evaluated results are as
follows.
Table 4. Accuracy, Precision, Recall, TP & FP values

Measure    J48    Random Tree  Logistics  MLP    OneR   ZeroR  Bayes  Naive Bayes
Accuracy   78.26  74.61        78.65      75.91  74.74  65.10  77.87  77.87
Precision  0.726  0.666        0.747      0.714  0.724  0.424  0.751  0.753
Recall     0.733  0.66         0.753      0.714  0.733  0.651  0.75   0.754
TP         0.733  0.66         0.753      0.714  0.733  0.651  0.75   0.754
FP         0.408  0.392        0.328      0.344  0.391  0.651  0.295  0.298
Figure 3. Simulation result of TP and FP
In the plotted diagram (Figure 3), Series 1 represents the true positive value while Series 2
represents the false positive value. From the plotted data it is clearly visible that the ZeroR
classifier provides no discrimination on this dataset: its TP and FP rates are identical.
Figure 4. Simulation result of Precision and Recall
In the plotted diagram (Figure 4), Series 1 represents the recall value while Series 2 represents
the precision value.
Based on the above results, we can observe that for the diabetes dataset the highest accuracy is
78.26% (J48) and the lowest is 65.10% (ZeroR), while in the case of precision Naive Bayes (75.3%)
performs best among all. A measurement system is considered valid if it is both accurate and
precise. In fact, when both measures are considered, the Naive Bayes classifier lies at the top
and OneR lies at the bottom of the chart. Recall is the fraction of relevant instances that are
retrieved. Precision and recall are inversely related to each other, and both are based on a
measurement of relevance, so in most cases the performance of a classifier is measured in terms of
both. Considering both precision and recall in the table above, the Bayesian classifiers perform
best among all, followed by the Function based and Tree based classifiers, while the Rule based
classifiers lie at the bottom.
6.3 F-Measure & ROC
The F-Measure aggregates precision and recall, so a classifier's performance can be measured by
considering precision and recall together.
Table 8. F-Measure and ROC values of classifiers

Measure    J48    Random Tree  Logistics  MLP    OneR   ZeroR  Bayes  Naive Bayes
F-Measure  0.714  0.663        0.748      0.714  0.719  0.513  0.751  0.754
ROC        0.745  0.658        0.811      0.766  0.671  0.497  0.825  0.825
Based on Table 8, it is clearly visible that the performance order of the classifier categories
is: Bayesian, Function based, Tree based, and Rule based, while the Naive Bayes classifier
(F Mea. = 0.754) performs best among all.
7. CONCLUSION
The aim of comparing data mining algorithms is to identify the best among those available. In this
paper, different sets (Tree, Function, Rule & Bayesian) of classification algorithms are used for
performance evaluation. These algorithms are compared on the basis of execution time, TP Rate,
FP Rate, Precision, Recall, Accuracy, ROC and PRC. ROC and PRC are graphical representations used
for ease of understanding. The Weka tool is used to evaluate and investigate the performance of
the four different categories of classifiers. The raw numbers are shown in the confusion matrices
(Table 2), with A and B representing the class labels. The experiment uses 768 instances, so for
each classifier the matrix entries sum to 768: the correctly classified instances are the AA and
BB entries, and the misclassified instances are the AB and BA entries. The percentage of correctly
classified instances is often called accuracy. Comparing all the parameters, we found that the
Naive Bayes classifier performs best among all, while the ZeroR classifier only provides a
benchmark for classification. ROC and PRC graphical measures are used for a quick review of all
classifiers. According to the results, the performance order (highest to lowest) of the algorithm
categories is: Bayesian, Function, Tree and Rule.
8. FUTURE WORK
Validation of the results using other datasets is required before generalizing these findings.
REFERENCES
[1] Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C. and Johannes, R.S., "Using the ADAP
learning algorithm to forecast the onset of diabetes mellitus", in Proceedings of the Symposium on
Computer Applications and Medical Care, IEEE Computer Society Press, 1988, pp. 261-265.
[2] Asha Rajkumar, Sophia Reena G., "Diagnosis of Heart Disease Using Data Mining Algorithm",
Global Journal of Computer Science and Technology, Vol. 10, 2010, pp. 38-46.
[3] A. Khemphila, V. Boonjing, "Comparing Performance of Logistic Regression, Decision Tree and
Neural Network for Classifying Heart Disease Patients", Proceedings of the International
Conference on Computer Information Systems and Industrial Management Applications, 2010,
pp. 193-198.
[4] K. Srinivas, B. Kavitha Rani, A. Govrdhan, "Applications of Data Mining Techniques in
Healthcare and Prediction of Heart Attacks", International Journal on Computer Science and
Engineering (IJCSE), Vol. II, 2010, pp. 250-255.
[5] Elma Kolce (Cela), Neki Frasheri, "A Literature Review of Data Mining Techniques Used in
Healthcare Databases", ICT Innovations 2012 Web Proceedings - Poster Session.
[6] Huy Nguyen Anh Pham and Evangelos Triantaphyllou “Prediction of Diabetes by Employing a New
Data Mining Approach Which Balances Fitting and Generalization” Department of Computer
Science, 298 Coates Hall, Louisiana State University, Baton Rouge, LA 70803.
[7] Ms. S. Sapna, Dr. A. Tamilarasi, "Data Mining - Fuzzy Neural Genetic Algorithm in Predicting
Diabetes", Department of Computer Applications (MCA), K.S.R College of Engineering, "BOOM 2K8",
Research Journal on Computer Engineering, March 2008.
[8] Karthikeyini, V., Pervin Begum, I., "Comparison a Performance of Data Mining Algorithms
(CPDMA) in Prediction of Diabetes Disease", International Journal of Computer Science and
Engineering, Vol. 5, No. 03, March 2013, pp. 205-210.
[9] Karthikeyini, V., Pervin Begum, I., Tajuddin, K., Shahina Begum, "Comparative of Data Mining
Classification Algorithm (CDMCA) in Diabetes Disease Prediction", International Journal of
Computer Applications, Vol. 60, No. 12, Dec. 2012, pp. 26-31.
[10] K.R. Lakshmi and S. Prem Kumar, Utilization of Data Mining Techniques for Prediction of Diabetes
Disease Survivability. IJSER, Vol. 4, Issue 6, June 2013.
[11] http://www.sussex.ac.uk/Users/christ/crs/ml/lec03a.html
[12] http://weka.sourceforge.net/doc.dev/weka/filters/supervised/
[13] http://chem-eng.utoronto.ca/~datamining/dmc/zeror.htm
[14] http://chem-eng.utoronto.ca/~datamining/dmc/oner.htm
[15] http://chem-eng.utoronto.ca/~datamining/dmc/naive_bayesian.htm
[16] http://chem-eng.utoronto.ca/~datamining/dmc/logistic_regression.htm
[17] http://neuroph.sourceforge.net/tutorials/MultiLayerPerceptron.html
[18] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Elsevier Inc.,
2nd Edition, 2006, ISBN 978-81-312-0535-8.
[19] Decision Tree at http://chem-eng.utoronto.ca/~datamining/dmc/decision_tree.htm
[20] WEKA Tool at http://www.cs.waikato.ac.nz/ml/weka/
Authors
Mr. Saurabh Kumar Srivastava is a research scholar in the Department of Computer Science &
Engineering at JIIT Noida, India. He obtained his M-Tech (Computer Engineering) from Shobhit
University and B-Tech (C.S.) from Uttar Pradesh Technical University, Lucknow (U.P.). He has been
teaching for 7 years, during which he has coordinated several technical fests and
international/national conferences at the institute level, and has attended several seminars,
workshops and conferences at various levels. His areas of research include Data Mining, Machine
Learning, Artificial Intelligence, and Web Technology.
Dr. Sandeep Kumar Singh is an Assistant Professor (Senior Grade) at JIIT Noida, India. He
completed his Ph.D. in Computer Science and Engineering. He has more than 13 years' experience,
which includes corporate training and teaching. His areas of interest are Software Engineering,
Requirements Engineering, Software Testing, Web Application Testing, Internet and Web Technology,
Object Oriented Technology, Programming Languages, Information Retrieval and Data Mining, Model
Based Testing, and Applications of Soft Computing in Software Testing and Databases. He is
currently supervising four Ph.D. students in Computer Science. He has around 28 published papers
to his credit in international journals and conferences.