Today, an enormous amount of data is collected in medical databases. These databases may contain valuable information encapsulated in nontrivial relationships among symptoms and diagnoses. Extracting such dependencies from historical data is much easier to do using medical decision-support systems, and the resulting knowledge can inform future medical decision making. In this paper, a new algorithm based on C4.5 for mining medical data is proposed and then evaluated against two datasets and the standard C4.5 algorithm in terms of accuracy.
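The evaluation described, comparing a new algorithm against C4.5 on datasets in terms of accuracy, could be sketched as a simple holdout harness. The function names and the majority-class baseline below are illustrative assumptions, not the paper's code:

```python
import random
from collections import Counter

def holdout_accuracy(examples, labels, train_fn, predict_fn, test_frac=0.3, seed=0):
    """Shuffle the data, hold out a test fraction, train on the rest,
    and return accuracy on the held-out part."""
    rng = random.Random(seed)
    idx = list(range(len(examples)))
    rng.shuffle(idx)
    cut = int(len(idx) * (1 - test_frac))
    train, test = idx[:cut], idx[cut:]
    model = train_fn([examples[i] for i in train], [labels[i] for i in train])
    hits = sum(predict_fn(model, examples[i]) == labels[i] for i in test)
    return hits / len(test)

# A trivial majority-class baseline, for illustration only.
def train_majority(xs, ys):
    return Counter(ys).most_common(1)[0][0]

def predict_majority(model, x):
    return model
```

Any real comparison would plug the proposed classifier and C4.5 into `train_fn`/`predict_fn` and run the same split for both.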
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY (IJMTER)
A data mining environment produces a large amount of data that needs to be analyzed, and patterns have to be extracted from it to gain knowledge. In this new period, with a deluge of both structured and unstructured data, it has become difficult to process, manage, and analyze patterns using traditional databases and architectures. To gain knowledge from Big Data, a proper architecture must be understood. Classification is an important data mining technique with broad applications; it is used to classify the various kinds of data encountered in nearly every field of our lives. Classification assigns an item to one of a predefined set of classes according to the item's features. This paper provides an inclusive survey of classification algorithms, shedding light on J48, C4.5, the k-nearest neighbor classifier, Naive Bayes, SVM, and others, using the random concept.
A decision tree (DT) is considered good when it is small and classifies newly introduced data accurately. Pre-processing the input data is one good approach to generating such a tree: combining different data pre-processing methods with a DT classifier can yield high performance. This paper examines the accuracy variation of the ID3 classifier when combined with different data pre-processing and feature selection methods. DT performance is compared between the original and pre-processed input data, and experimental results are shown using the standard decision tree algorithm ID3 on a dataset.
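One common pre-processing step before running ID3 is discretizing numeric attributes, since ID3 splits on categorical values. A minimal equal-width binning sketch (the function name and bin count are assumptions, not the paper's method):

```python
def equal_width_bins(values, k=3):
    """Map numeric values to k equal-width bin indices 0..k-1,
    so a categorical learner like ID3 can split on them."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # avoid division by zero when all values are equal
    return [min(int((v - lo) / width), k - 1) for v in values]
```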
SURVEY ON CLASSIFICATION ALGORITHMS USING BIG DATASET (IJMTER)
A data mining environment produces a large amount of data that needs to be analyzed. Using traditional databases and architectures, it has become difficult to process, manage, and analyze patterns. To gain knowledge from Big Data, a proper architecture must be understood. Classification is an important data mining technique with broad applications; it is used to classify the various kinds of data encountered in nearly every field of our lives. Classification assigns an item to one of a predefined set of classes according to the item's features. This paper sheds light on various classification algorithms, including J48, C4.5, and Naive Bayes, using a large dataset.
A HYBRID MODEL FOR MINING MULTI-DIMENSIONAL DATA SETS (IJCATR)
This paper presents a hybrid data mining approach based on supervised and unsupervised learning to identify the closest data patterns in a database. The technique achieves a high accuracy rate with minimal complexity. The proposed algorithm is compared with traditional clustering and classification algorithms and is also applied to multidimensional datasets. The implementation results show better prediction accuracy and reliability.
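One plausible reading of such a hybrid is: group the data (unsupervised), then attach a class label to each group (supervised). A minimal sketch under that assumption, using fixed, pre-computed centroids for simplicity (the function names are illustrative):

```python
from collections import Counter

def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))

def fit_hybrid(points, labels, centroids):
    """Unsupervised step: assign points to centroids.
    Supervised step: remember the majority label inside each cluster."""
    votes = {i: Counter() for i in range(len(centroids))}
    for p, y in zip(points, labels):
        votes[nearest(p, centroids)][y] += 1
    return {i: c.most_common(1)[0][0] for i, c in votes.items() if c}

def predict_hybrid(model, centroids, point):
    """Classify a new point by the majority label of its nearest cluster."""
    return model[nearest(point, centroids)]
```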
Efficient classification of big data using VFDT (Very Fast Decision Tree) (eSAT Journals)
Abstract
Decision tree learning algorithms have been able to capture knowledge successfully. Decision trees are best suited when instances are described by attribute-value pairs and the target function takes discrete values. Their main task is to apply inductive methods to the attribute values of an unknown object and determine an appropriate classification by applying decision tree rules. Decision trees are very effective ways to represent and evaluate algorithms because of their robustness, simplicity, capability of handling numerical and categorical data, ability to work with large datasets, and comprehensibility, to name a few. Various decision tree algorithms are available, such as ID3, CART, C4.5, VFDT, QUEST, CTREE, GUIDE, CHAID, and CRUISE. This paper presents a comparative study of three popular decision tree algorithms: ID3 (Iterative Dichotomizer 3), C4.5 (an evolution of ID3), and VFDT (Very Fast Decision Tree). An empirical study has been conducted to compare C4.5 and VFDT in terms of accuracy and execution time, and various conclusions have been drawn.
Key Words: Decision tree, ID3, C4.5, VFDT, Information Gain, Gain Ratio, Gini Index, Overfitting.
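VFDT's speed comes from deciding, via the Hoeffding bound, when enough examples have been seen to commit to a split. A sketch of that bound in its standard form (generic, not specific to this paper):

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Return epsilon such that, with probability 1 - delta, the observed mean
    of n i.i.d. samples of a variable with range `value_range` lies within
    epsilon of the true mean. VFDT splits when the gap between the two best
    attributes' gains exceeds this epsilon."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))
```

As n grows the bound shrinks, so the stream learner becomes willing to split on ever smaller gain differences.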
Predicting students' performance using ID3 and C4.5 classification algorithms (IJDKP)
An educational institution needs to have an approximate prior knowledge of enrolled students to predict
their performance in future academics. This helps them to identify promising students and also provides
them an opportunity to pay attention to and improve those who would probably get lower grades. As a
solution, we have developed a system which can predict the performance of students from their previous
performances using concepts of data mining techniques under Classification. We have analyzed the data
set containing information about students, such as gender, marks scored in the board examinations of
classes X and XII, marks and rank in entrance examinations and results in first year of the previous batch
of students. By applying the ID3 (Iterative Dichotomiser 3) and C4.5 classification algorithms on this data,
we have predicted the general and individual performance of freshly admitted students in future
examinations.
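The C4.5 split criterion these predictions rest on, gain ratio, can be sketched in a few lines (a generic implementation, not the authors' code):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, attr, labels):
    """C4.5 criterion: information gain of splitting `rows` on attribute
    index `attr`, normalised by the split information."""
    n = len(labels)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    gain = entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy([row[attr] for row in rows])
    return gain / split_info if split_info else 0.0
```

ID3 uses the unnormalised gain; C4.5's division by split information is what tempers the preference for many-valued attributes such as marks or ranks.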
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE (IJIST)
Dichotomous data is a type of categorical data that is binary, with categories zero and one. Health care data is among the most heavily used categorical data. Binary data are the simplest form of data used in health care databases, where close-ended questions can be asked; the representation is very efficient in terms of computation and memory for categorical data. Clustering health care or medical data is very tedious due to its complex data representation models, high dimensionality, and data sparsity. In this paper, clustering is performed after transforming the dichotomous data into real values by the Wiener transformation. The proposed algorithm can be used to determine correlations between health disorders and symptoms observed in large binary medical and health databases. Computational results show that clustering based on the Wiener transformation is very efficient in terms of objectivity and subjectivity.
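The paper's Wiener transformation step is not reproduced here. As a stand-in, a common dissimilarity for dichotomous records is the Jaccard distance, which any clustering routine over binary symptom vectors could consume directly:

```python
def jaccard_distance(a, b):
    """1 - |intersection| / |union| over 0/1 vectors; a standard dissimilarity
    for dichotomous records (e.g. presence/absence of symptoms)."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return 1.0 - inter / union if union else 0.0
```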
Performance Comparison of Decision Tree Algorithms to Find Out the Reason for St... (IJCNES)
Educational data mining studies the data available in the educational field and brings out the hidden knowledge in it. Classification methods such as decision trees and rule mining can be applied to educational data to predict student behavior. This paper focuses on finding the suitable algorithm that yields the best result in finding the reason behind students' absenteeism in an academic year. The first step in this process is to gather student data using a questionnaire; the data is collected from 123 undergraduate students at a private college situated in a semi-rural area. The second step is to clean the data so it is appropriate for mining and to choose the relevant attributes. In the final step, three decision tree induction algorithms, namely ID3 (Iterative Dichotomiser), C4.5, and CART (Classification and Regression Tree), were applied to the same data sample for comparison. The results were compared to find the algorithm that best predicts the reason for students' absenteeism.
Distributed Digital Artifacts on the Semantic Web (IJCATR)
Distributed digital artifacts incorporate cryptographic hash values into URIs, called trusty URIs, in a distributed environment, building high-quality, verifiable, and immutable web resources to prevent the rising man-in-the-middle attack (MIMA). The greatest challenge of a centralized system is that it gives users no way to check whether data has been modified, and communication is limited to a single server. The solution is a distributed digital artifact system, where resources are distributed among different domains to enable inter-domain communication. Due to emerging developments on the web, attacks have increased rapidly, among which MIMA is a serious issue that threatens user security. This work tries to prevent MIMA to an extent by providing self-referencing trusty URIs even in a distributed environment. Any manipulation of the data is efficiently identified, and any further access to that data is blocked by informing the user that the uniform location has changed. The system uses self-reference to embed a trusty URI in each resource, a lineage algorithm for generating the seed, and the SHA-512 hash algorithm to ensure security. It is implemented on the semantic web, an extension of the World Wide Web, using RDF (Resource Description Framework) to identify resources. The framework thus overcomes existing challenges by distributing digital artifacts on the semantic web, enabling secure communication between different domains across the network and thereby preventing MIMA.
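The trusty-URI idea, a hash of the content embedded in the identifier so that any tampering is detectable, can be sketched with SHA-512. This is a simplified illustration; real trusty URIs define a specific module and encoding scheme not reproduced here:

```python
import base64
import hashlib

def trusty_suffix(content: bytes) -> str:
    """SHA-512 digest of the resource content, base64url-encoded without
    padding. If the content changes, the suffix changes, so a URI carrying
    this suffix self-verifies."""
    digest = hashlib.sha512(content).digest()
    return base64.urlsafe_b64encode(digest).decode().rstrip("=")

def verify(uri_suffix: str, content: bytes) -> bool:
    """Recompute the suffix from the fetched content and compare."""
    return trusty_suffix(content) == uri_suffix
```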
CLASSIFIER SELECTION MODELS FOR INTRUSION DETECTION SYSTEM (IDS) (IEIJ)
Any abnormal activity can be assumed to be an anomalous intrusion. Several techniques and algorithms for anomaly detection have been discussed in the literature, and in most cases true-positive and false-positive rates have been used to compare their performance. However, depending upon the application, a wrong true positive or a wrong false positive may have severe detrimental effects, which necessitates including cost-sensitive parameters in the evaluation. Moreover, the most common testing dataset, KDD-CUP-99, contains a huge amount of data, which in turn requires a certain amount of pre-processing. Our work in this paper starts by enumerating the necessity of cost-sensitive analysis with some real-life examples. After discussing KDD-CUP-99, an approach is proposed for feature elimination followed by feature selection, reducing the features to the more relevant ones directly and the size of KDD-CUP-99 indirectly. From the reported literature, general anomaly detection methods that perform best for different types of attacks are selected. These different classifiers are combined to form an ensemble, and a cost-opportunistic technique is suggested to allocate relative weights to the classifiers in the ensemble when generating the final result. A cost-sensitivity analysis of true-positive and false-positive results is performed, and a method is proposed to select the elements of the cost-sensitivity matrix to further improve the results and achieve better overall performance. The impact on the performance trade-off of incorporating cost sensitivity is discussed.
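The weighted ensemble step could be sketched as a weighted vote, where in the paper the weights would come from the cost-opportunistic technique (here they are just given numbers for illustration):

```python
def weighted_vote(predictions, weights):
    """Combine per-classifier labels by summing each classifier's weight into
    its predicted label's score; return the label with the largest total."""
    scores = {}
    for label, w in zip(predictions, weights):
        scores[label] = scores.get(label, 0.0) + w
    return max(scores, key=scores.get)
```

With cost-derived weights, a classifier that is cheap to trust for a given attack type can outvote two less reliable ones.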
Implementation of an Improved ID3 Algorithm to Obtain a More Optimal Decision Tree (IJERD)
Decision tree learning is the discipline of creating a predictive model that maps the items in a set to their respective target values and associates them in a way that holds for every element. The concept is used in statistics, data mining, and machine learning due to its simplicity and effectiveness.
Among the various strategies available for constructing decision trees, ID3 is one of the simplest and most widely used, but the ID3 algorithm gives more importance to attributes with many values when selecting a node. This major shortcoming affects the accuracy of the decision tree. In this paper we propose an improvement to ID3 using an association function (AF). The experimental results show that the improved ID3 algorithm can overcome this shortcoming of ID3 and also improve its accuracy.
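ID3's bias toward many-valued attributes is easy to demonstrate: an attribute that is unique per record (like an ID number) achieves maximal information gain even when it carries no signal, while a genuinely uninformative two-valued attribute scores zero. A sketch (generic code, not the paper's AF-based correction):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """ID3 criterion: entropy reduction from splitting on `values`."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())
```

Splitting on a per-record ID produces singleton groups with zero entropy, so the gain equals the full label entropy, which is exactly the pathology the proposed association function is meant to counter.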
Proposal of an Enhanced Framework for Prediction of Heart Disease (IJERA)
Heart disease diagnosis is a complex task that requires considerable experience. Traditionally, heart disease is predicted from a number of medical tests prescribed by the doctor, such as a heart MRI, ECG, and stress test. Today, the health care industry holds a huge amount of health care data containing hidden information, and effective decisions can be made by means of it. For appropriate results, advanced computer-based data mining techniques are used. Artificial neural networks (ANNs) are mathematical techniques used in the empirical sciences for inference and categorisation, and also for modelling real neural networks; mental phenomena such as acting, wanting, knowing, remembering, perceiving, thinking, and inferring can be understood using the theory of ANNs. ANNs are powerful instruments for inference and classification, where problems of probability and induction arise. In this paper, classification techniques, namely the Naive Bayes algorithm and artificial neural networks, are used to classify the attributes in a given dataset. Attribute filtering techniques, namely PCA (Principal Component Analysis) and Information Gain Attribute Subset Evaluation, are used for feature selection in the dataset to predict heart disease symptoms. A new framework based on these techniques is proposed: the input dataset is fed into a feature selection block, which chooses whichever technique yields the fewest attributes; classification is then performed with the two algorithms, and the attributes selected by both classification tasks are used for the prediction of heart disease.
This framework reduces the time needed to predict heart disease symptoms and lets the user know the important attributes according to the proposed framework.
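A minimal categorical Naive Bayes, one of the two classifiers named above, could look like this. It is a generic implementation with Laplace-style smoothing; the attribute values in the usage example are made up:

```python
import math
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count class priors and per-attribute value frequencies per class."""
    prior = Counter(labels)
    counts = defaultdict(Counter)  # (attr_index, class) -> value frequencies
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            counts[(i, y)][v] += 1
    return prior, counts, len(labels)

def predict_nb(model, row):
    """Pick the class with the highest log-posterior, with add-one smoothing
    so unseen attribute values never zero out a class."""
    prior, counts, n = model
    best, best_lp = None, float("-inf")
    for y, py in prior.items():
        lp = math.log(py / n)
        for i, v in enumerate(row):
            c = counts[(i, y)]
            lp += math.log((c[v] + 1) / (sum(c.values()) + len(c) + 1))
        if lp > best_lp:
            best, best_lp = y, lp
    return best
```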
Analysis of Different Data Mining Techniques and Algorithms Used in IoT (IJERA)
In this paper, we discuss five data mining functionalities in IoT that affect performance: data anomaly detection, data clustering, data classification, feature selection, and time series prediction. Some important algorithms for each functionality are reviewed, showing their advantages and limitations, along with some newer algorithms that are current research directions. We present a knowledge view of data mining in IoT.
A study and survey on various progressive duplicate detection mechanisms (eSAT Journals)
Abstract: Duplicate detection is one of the serious problems faced in several applications involving personal data management, customer relationship management, data mining, etc. This survey covers duplicate record detection techniques for both small and large datasets. To detect duplicates with less execution time and without disturbing dataset quality, methods such as Progressive Blocking and the Progressive Sorted Neighborhood Method are used. The Progressive Sorted Neighborhood Method (PSNM) finds duplicates in a parallel approach, while the Progressive Blocking algorithm works on large datasets where finding duplicates requires immense time. These algorithms enhance a duplicate detection system; using them, efficiency can be doubled over conventional duplicate detection methods.
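The core of the sorted neighborhood idea, sorting by a blocking key and comparing only records inside a sliding window, can be sketched as follows (without the progressive scheduling that PSNM adds on top):

```python
def sorted_neighborhood(records, key, window=3):
    """Sort record indices by a blocking key, then emit as candidate duplicate
    pairs only those records that fall within `window` positions of each other
    in the sorted order. This avoids the quadratic all-pairs comparison."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = []
    for pos, i in enumerate(order):
        for j in order[pos + 1:pos + window]:
            pairs.append((min(i, j), max(i, j)))
    return pairs
```

Candidate pairs would then go to an expensive similarity check; the window size trades recall against runtime.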
Analysis of Common Supervised Learning Algorithms Through Application (ACII)
Supervised learning is a branch of machine learning wherein the machine is equipped with labelled data, which it uses to create sophisticated models that can predict the labels of related unlabelled data. The literature in the field offers a wide spectrum of algorithms and applications; however, there is limited research comparing the algorithms, making it difficult for beginners to choose the most efficient algorithm and tune it for their application.
This research aims to analyse the performance of common supervised learning algorithms when applied to sample datasets, along with the effect of hyper-parameter tuning. Each algorithm is applied to the datasets, and the validation curves (for the hyper-parameters) and learning curves are analysed to understand the sensitivity and performance of the algorithms. The research can guide new researchers aiming to apply supervised learning algorithms to better understand, compare, and select the appropriate algorithm for their application. Additionally, they can tune the hyper-parameters for improved efficiency and create ensembles of algorithms to enhance accuracy.
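A validation curve is simply model accuracy on a held-out set as one hyper-parameter varies. A tiny sketch using 1-D k-nearest-neighbours as the model (the data and the choice of model are illustrative, not from the paper):

```python
from collections import Counter

def knn_predict(train_x, train_y, x, k):
    """Majority label among the k training points nearest to x (1-D)."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def validation_curve(train, val, ks):
    """Held-out accuracy for each candidate k: the data behind a
    validation-curve plot."""
    tx, ty = train
    vx, vy = val
    return {k: sum(knn_predict(tx, ty, x, k) == y for x, y in zip(vx, vy)) / len(vx)
            for k in ks}
```

Plotting the returned accuracies against k shows where the model under- or over-fits, which is the sensitivity analysis the abstract describes.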
MULTI-PARAMETER BASED PERFORMANCE EVALUATION OF CLASSIFICATION ALGORITHMS (IJCSIT)
Diabetes is among the most common diseases in India. It affects patients' health and also leads to other chronic diseases. Prediction of diabetes plays a significant role in saving lives and cost, yet predicting diabetes in the human body is challenging because it depends on several factors. Few studies have reported the performance of classification algorithms in terms of accuracy; the results in those studies are difficult and complex for medical practitioners to understand and lack visual aids, as they are presented in pure text format. This survey uses the ROC and PRC graphical measures to improve the understanding of results, and a detailed parameter-wise comparison, lacking in other reported surveys, is also presented. Execution time, accuracy, TP rate, FP rate, precision, recall, and F-measure are used for the comparative analysis, and a confusion matrix is prepared for a quick review of each algorithm. Ten-fold cross-validation is used to estimate the prediction model. Different sets of classification algorithms are analyzed on a diabetes dataset acquired from the UCI repository.
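The listed parameters all derive from the four cells of the confusion matrix; a sketch of the arithmetic:

```python
def metrics_from_confusion(tp, fp, fn, tn):
    """Standard derived metrics from confusion-matrix counts.
    TP rate equals recall; FP rate is false alarms over actual negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    fp_rate = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall, "tp_rate": recall,
            "fp_rate": fp_rate, "f_measure": f_measure, "accuracy": accuracy}
```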
Perfomance Comparison of Decsion Tree Algorithms to Findout the Reason for St...ijcnes
Educational data mining is used to study the data available in the educational field and bring out the hidden knowledge from it. Classification methods like decision trees, rule mining can be applied on the educational data for predicting the students behavior. This paper focuses on finding thesuitablealgorithm which yields the best result to find out the reason behind students absenteeism in an academic year. The first step in this processis to gather students data by using questionnaire.The datais collected from 123 under graduate students from a private college which is situated in a semirural area. The second step is to clean the data which is appropriate for mining purpose and choose the relevant attributes. In the final step, three different Decision tree induction algorithms namely, ID3(Iterative Dichotomiser), C4.5 and CART(Classification and Regression Tree)were applied for comparison of results for the same data sample collected using questionnaire. The results were compared to find the algorithm which yields the best result in predicting the reason for student s absenteeism.
Distributed Digital Artifacts on the Semantic WebEditor IJCATR
Distributed digital artifacts incorporate cryptographic hash values to URI called trusty URIs in a distributed environment
building good in quality, verifiable and unchangeable web resources to prevent the rising man in the middle attack. The greatest
challenge of a centralized system is that it gives users no possibility to check whether data have been modified and the communication
is limited to a single server. As a solution for this, is the distributed digital artifact system, where resources are distributed among
different domains to enable inter-domain communication. Due to the emerging developments in web, attacks have increased rapidly,
among which the man-in-the-middle attack (MIMA) is a serious issue where user security is under threat. This work tries to prevent MIMA
to an extent by providing self-reference and trusty URIs even in a distributed environment. Any manipulation of the
data is efficiently identified, and any further access to that data is blocked by informing the user that the uniform location has
changed. The system uses self-reference to contain the trusty URI for each resource, a lineage algorithm for generating the seed, and the SHA-512 hash
generation algorithm to ensure security. It is implemented on the semantic web, an extension of the World Wide Web, using
RDF (Resource Description Framework) to identify resources. Hence the framework was developed to overcome existing
challenges by making the digital artifacts on the semantic web distributed, enabling communication between different domains across
the network securely and thereby preventing MIMA.
CLASSIFIER SELECTION MODELS FOR INTRUSION DETECTION SYSTEM (IDS)
Any abnormal activity can be assumed to be an anomalous intrusion. In the literature, several techniques and
algorithms have been discussed for anomaly detection. In most cases, true positive and false positive
parameters have been used to compare their performance. However, depending upon the application, a
wrong true positive or wrong false positive may have severe detrimental effects. This necessitates the inclusion
of cost-sensitive parameters in the performance. Moreover, the most common testing dataset, KDD-CUP-99,
has a huge volume of data, which in turn requires a certain amount of pre-processing. Our work in this paper starts
with enumerating the necessity of cost-sensitive analysis with some real-life examples. After discussing
KDD-CUP-99, an approach is proposed for feature elimination and then feature selection to reduce the
number of relevant features directly and the size of KDD-CUP-99 indirectly. From the reported
literature, general methods for anomaly detection are selected which perform best for different types of
attacks. These different classifiers are clubbed to form an ensemble. A cost-opportunistic technique is
suggested to allocate the relative weights to the classifier ensemble for generating the final result. The cost
sensitivity of true positive and false positive results is analyzed, and a method is proposed to select the elements
of the cost-sensitivity metrics for further improving the results to achieve overall better performance. The
impact on the performance trade-off due to incorporating cost sensitivity is discussed.
Implementation of Improved ID3 Algorithm to Obtain a More Optimal Decision Tree (IJERD)
Decision tree learning is the discipline of creating a predictive model that maps the different items in the
set to their respective target values and associates them in a way that is true for every element. This concept is used in
statistics, data mining and machine learning due to its simplicity and effectiveness.
Among the various strategies available to construct decision trees, ID3 is one of the simplest and
most widely used decision tree algorithms, but the ID3 algorithm gives more importance to attributes having
multiple values while selecting a node. This major shortcoming affects the accuracy of the decision tree. In this paper
we propose an improvement to the ID3 algorithm using an association function (AF). Experimental results
show the improved ID3 algorithm can overcome the shortcomings of ID3, which also improves the accuracy of the ID3
algorithm.
Proposal of an Enhanced Framework for Prediction of Heart Disease (IJERA)
Heart disease diagnosis is a complex task that requires substantial experience. Traditionally, heart disease is examined through a number of medical tests prescribed by the doctor, such as heart MRI, ECG and stress tests. Today, the health care industry holds hidden information within a huge amount of health care data, and effective decisions can be made by means of this hidden information. For appropriate results, advanced data mining techniques are used together with computer-based information. Artificial neural networks (ANNs) are mathematical techniques used in the empirical sciences for inference and categorisation, and also for modelling real neural networks. Mental phenomena such as acting, wanting, knowing, remembering, perceiving, thinking and inferring can be understood using the theory of ANNs, which are powerful instruments for inference and classification, where problems of probability and induction can arise. In this paper, classification techniques such as the Naive Bayes classification algorithm and artificial neural networks are used to classify the attributes in the given dataset. Attribute filtering techniques such as PCA (Principal Component Analysis) and Information Gain Attribute Subset Evaluation are used for feature selection in the dataset to predict heart disease symptoms. A new framework is proposed based on the above techniques: it feeds the input dataset into a feature selection block, which selects whichever technique gives the least number of attributes; classification is then done using the two algorithms, and the attributes selected by both classification tasks are taken for the prediction of heart disease.
This framework reduces the time for predicting the symptoms of heart disease and lets the user know the important attributes based on the proposed framework.
Analysis of Different Data Mining Techniques and Algorithms Used in IoT (IJERA)
In this paper, we discuss five functionalities of data mining in IoT that affect performance:
data anomaly detection, data clustering, data classification, feature selection, and time series prediction. Some
important algorithms for each functionality have also been reviewed here, showing their advantages and limitations, as
well as some new algorithms that are in the research direction. We present a knowledge view of data
mining in IoT.
A study and survey on various progressive duplicate detection mechanisms (eSAT Journals)
Abstract: One of the serious problems faced in several applications with personal details management, customer affiliation management, data mining, etc., is duplicate detection. This survey deals with the various duplicate record detection techniques in both small and large datasets. To detect duplicates with less execution time and without disturbing dataset quality, methods like Progressive Blocking and Progressive Sorted Neighborhood are used. The Progressive Sorted Neighborhood Method (PSNM) is used in this model for detecting duplicates in a parallel approach. The Progressive Blocking algorithm works on large datasets where finding duplication requires immense time. These algorithms are used to enhance the duplicate detection system; efficiency can be doubled over the conventional duplicate detection method using them.
Analysis of Common Supervised Learning Algorithms Through Application (ACII)
Supervised learning is a branch of machine learning wherein the machine is equipped with labelled data which it uses to create sophisticated models that can predict the labels of related unlabelled data. The literature on the field offers a wide spectrum of algorithms and applications. However, there is limited research available comparing the algorithms, making it difficult for beginners to choose the most efficient algorithm and tune it for their application.
This research aims to analyse the performance of common supervised learning algorithms when applied to sample datasets, along with the effect of hyper-parameter tuning. For the research, each algorithm is applied to the datasets, and the validation curves (for the hyper-parameters) and learning curves are analysed to understand the sensitivity and performance of the algorithms. The research can guide new researchers aiming to apply supervised learning algorithms to better understand, compare and select the appropriate algorithm for their application. Additionally, they can also tune the hyper-parameters for improved efficiency and create ensembles of algorithms for enhanced accuracy.
MULTI-PARAMETER BASED PERFORMANCE EVALUATION OF CLASSIFICATION ALGORITHMS (IJCSIT)
Diabetes is amongst the most common diseases in India. It affects patients' health and also leads to
other chronic diseases. Prediction of diabetes plays a significant role in saving lives and cost. Predicting
diabetes in the human body is a challenging task because it depends on several factors. Few studies have reported the performance of classification algorithms in terms of accuracy. Results in these studies are difficult and complex for medical practitioners to understand and also lack visual aids, as they are presented in pure text format. This reported survey uses ROC and PRC graphical measures to improve understanding of the results. A detailed parameter-wise discussion of the comparison is also presented, which is lacking in other reported surveys. Execution time, accuracy, TP rate, FP rate, precision, recall, and F-measure parameters are used for comparative analysis, and a confusion matrix is prepared for a quick review of each
algorithm. Ten-fold cross validation is used for estimation of the prediction model. Different sets of
classification algorithms are analyzed on a diabetes dataset acquired from the UCI repository.
Advanced Computational Intelligence: An International Journal (ACII)
Advanced Computational Intelligence: An International Journal (ACII), Vol.2, No.3, July 2015
DOI: 10.5121/acii.2015.2304
A NEW DECISION TREE METHOD FOR DATA MINING
IN MEDICINE
Kasra Madadipouya
Department of Computing and Science, Asia Pacific University of Technology & Innovation
ABSTRACT
Today, an enormous amount of data is collected in medical databases. These databases may contain valuable
information encapsulated in nontrivial relationships among symptoms and diagnoses. Extracting such
dependencies from historical data is much easier when medical data mining systems are used. Such knowledge can be
used in future medical decision making. In this paper, a new algorithm based on C4.5 to mine data for
medicine applications is proposed, and it is then evaluated against two datasets and the C4.5 algorithm in terms of
accuracy.
KEYWORDS
Data mining, Medicine, Classification, Decision Tree, ID3, C4.5
1. INTRODUCTION
Health care institutions all over the world have been gathering medical data over the years of their
operation. A huge amount of this data is stored in databases and data warehouses. Such databases
and their applications are different from each other. The basic ones store only some information
about patients such as name, age, address, blood type, etc. The more advanced ones are able to
record patients' visits and store detailed information related to their health condition. Some
systems are also applied to patients' registration and units' finances, and recently a new type of
medical system has emerged which originates in business intelligence and facilitates medical
decisions [1]: the medical decision support system. This data may contain valuable information that
awaits extraction. The knowledge may be encapsulated in various patterns and regularities that
may be hidden in the data. Such knowledge may prove to be priceless in future medical decision
making. The mentioned situation is the reason for a close collaboration between medical staff and
computer scientists[2][3]. Its purpose is generating the most suitable method of data processing,
which is able to discover dependencies and nontrivial rules in data. The results may reduce the
time of a diagnosis delivery or risk of a medical mistake as well as improve the process of
treatment and diagnosing. The research area, which investigates the methods of knowledge
extraction from data is called data mining or knowledge discovery [4]. It applies various data
mining algorithms to analyze databases. The purpose of this research is to review the most
common data mining techniques implemented in medicine. A number of research papers have
evaluated various data mining methods but they focus on a small number of medical
datasets [5][6], the algorithms used are not calibrated (tested on only one parameter setting) [6],
or the algorithms compared are not common in the medical decision support systems[7]. Also,
even though a large number of methods have been studied they were not evaluated with the use of
different metrics on different datasets [5][7][8]. This makes the collective evaluation of the
algorithms impossible. This paper reviews the most common data mining algorithms (determined
after an in-depth literature study), which are implemented in modern MDSSs. Algorithms are
analyzed under the same conditions.
The rest of the paper is organized as follows. In the next section, a review of data mining
and various data mining algorithms is provided. Section three is dedicated to the new proposed
algorithm in detail, which relies on the C4.5 algorithm to create the decision tree. In section four, the
results of the experiment are presented and analyzed. Finally, in the last section, conclusions and
future improvements are discussed.
2. RELATED WORK
Decision trees are an easy to understand and accepted classification technique in knowledge
discovery because of their clarity and flexibility in presenting the classification procedure
[9]. Amongst the techniques to build decision trees, the Iterative Dichotomiser (ID3) offered by
Quinlan and C4.5 (its enhanced edition) are two of the most accepted methods [10]. Since C4.5
is able to deal with noise and missing values and to avoid over-fitting, it is superior to ID3 in
constructing the decision trees commonly used. However, both of them choose one attribute as the
splitting criterion for constructing a node [4]. To reduce the depth of the tree and improve its
accuracy, this paper demonstrates a unique method to extend C4.5. This algorithm is based on
choosing one or two attributes as the splitting criterion, according to which yields the greater
information gain ratio. The usual techniques to choose the splitting criterion are the information
gain of ID3 and the information gain ratio of C4.5. Both derive from information entropy theory.
In constructing decision trees, an attribute's impurity is captured by the information gain and gain
ratio, which denote its probability of being chosen as the best splitting criterion.
In the following section the ID3 and C4.5 algorithms are discussed in detail.
2.1 ID3 Algorithm
ID3 is a greedy algorithm which builds decision trees in a top-down approach. The
input and output data in ID3 are categorical. All categories of attributes can be applied in
generating decision trees by ID3, which thus creates wide and shallow trees. It builds trees in three
phases [11]:
1. Creating splits in a multi-way manner: for each attribute a split is made, and the
subdivisions of the proposed split are the attribute's categories.
2. Estimation of the best split for tree branching according to the information gain metric.
3. Testing the stop criterion, and repeating the steps recursively for new subdivisions. These
three steps are done iteratively for all of the nodes of the tree. The formula below
represents the entropy underlying the information gain measure:

Entropy(S) = − Σ_{i=1..K} p_i log2(p_i) (1)
S denotes the dataset, K denotes the number of output variable classes, and p_i the probability of
class i. In this algorithm the quality of the split is represented by the information gain:

Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v) (2)
Values(A) represents the probable values of attribute A, and S_v represents the subdivision of dataset S
which contains value v. Entropy(S) calculates the entropy of the dataset, Entropy(S_v) is the entropy of
an attribute's category with respect to the output attribute, and |S_v| / |S| is the probability of the v-th
category of the attribute [12]. The difference between the entropy of the node and that of an attribute is
the information gain of the attribute. Information gain shows the information an attribute conveys for
disambiguation of the class [13].
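The three phases above rest on the entropy and information gain computations. The paper's experiments used C++, but as an illustrative sketch (function names are our own, not the authors'), the two measures can be written as:

```python
import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = -sum over the K classes of p_i * log2(p_i)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    # Gain(S, A) = Entropy(S) - sum_v (|S_v| / |S|) * Entropy(S_v),
    # where S_v is the subset of rows taking value v on attribute A.
    n = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder
```

An attribute that separates the classes perfectly yields a gain equal to the dataset entropy, while an uninformative attribute yields zero; ID3 branches on the attribute with the largest gain.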
2.2 C4.5
The enhanced edition of ID3 is C4.5. It additionally supports numeric input attributes [14]. It builds trees in
three phases [15]:
1. Creating splits for categorical attributes as in the ID3 algorithm. The algorithm considers all
probable binary splits for numerical attributes; splits of numeric attributes are always binary.
2. Evaluation of the best split according to the gain ratio metric.
3. Testing the stop criterion, and repeating the steps in a recursive manner for new subdivisions.
These three steps are done iteratively for all of the nodes.
C4.5 presents a new metric for split evaluation which is less biased. The algorithm is able to
handle missing values, prune the tree, group attribute values, and generate rules. The gain
ratio is less biased towards choosing attributes with more categories.
Attribute A has k different values which divide S into subsets S_1, …, S_k. Equation (1) computes the
information entropy, equation (2) computes the information gain of attribute A in dataset S,
equation (3) computes the split information, and finally equation (4) computes the information gain
ratio of each attribute:

SplitInfo(S, A) = − Σ_{v=1..k} (|S_v| / |S|) log2(|S_v| / |S|) (3)

GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A) (4)

The gain ratio divides the information gain of the attribute by the split information, described via
equation (3); this metric depends on the number of different values of an attribute (k). The
algorithm is able to deal with numerical and categorical attributes: numeric attributes create
binary splits and categorical attributes multi-way splits. Also, C4.5 has three pruning
techniques: reduced error pruning, error-based pruning and pessimistic error pruning. Finally,
C4.5 has the following qualities [16][13]:
1. Handling discrete and continuous values;
2. Handling missing values;
3. Coping with different costs of attributes;
4. Pruning the decision tree after its generation.
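The gain ratio of equations (3) and (4) can be sketched in Python (an illustrative reimplementation, not the paper's code). Note how an attribute that splinters the data into many categories is penalized by the split information:

```python
import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = -sum over the K classes of p_i * log2(p_i)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr_index):
    # GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A)
    n = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr_index], []).append(label)
    gain = entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets.values())
    # SplitInfo(S, A) = -sum_v (|S_v| / |S|) * log2(|S_v| / |S|)
    split_info = -sum((len(s) / n) * math.log2(len(s) / n) for s in subsets.values())
    return gain / split_info if split_info > 0 else 0.0
```

For example, a binary attribute that separates two classes perfectly has gain 1.0 and split information 1.0 (ratio 1.0), while a four-valued attribute with the same gain has split information 2.0, halving its ratio; this is the bias correction relative to raw information gain.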
3. THE PROPOSED ALGORITHM
Classification aims at categorizing data into predefined classes based on particular attributes
[13]. Decision trees are a popular type of classification method which is similar to a flow chart and
has qualities such as fast classification speed, great accuracy, suitability for inductive knowledge
discovery, and the ability to work with high-dimensional data [17]. The ID3 algorithm chooses an
attribute by utilizing information gain as the splitting criterion; however, its tendency is to select
attributes with many values [18]. There are two criteria in evaluating decision tree
quality: (1) fewer leaf nodes, shallow trees and slight redundancy; (2) high classification
accuracy. C4.5 applies the information gain ratio to choose the splitting attribute [19]. It inherits the
whole set of qualities of ID3 and adds features such as discretizing continuous attributes,
handling attributes with missing values [20], pruning the decision tree, etc. In C4.5 the construction
of the tree is top-down and follows a one-step greedy search: the algorithm finds a locally optimal
solution for each decision node [21]. To escape from local optima and find a globally optimal
solution, we propose a novel method that selects two attributes simultaneously, not only one. In
choosing attributes, this new method considers the information gain of choosing a pair of attributes
concurrently instead of choosing only one attribute. To enhance the probability of finding the
globally optimal solution, taking two optimal attributes into account is better than one optimal
attribute [22]. The method is able to choose splitting attributes more precisely, improving the
classification outcomes and building decision trees of smaller depth. However, in cases where the
distribution of attributes is imbalanced, the offered algorithm has a number of weaknesses, leading
to a condition where many samples are concentrated in a few branches, which is worse than C4.5
in terms of performance. The proposed algorithm is not able to escape from local optima, but it
enhances the probability of finding globally optimal solutions. We therefore propose a method of
building the decision tree with the following steps:
1. Checking all of the records of the training set in advance for missing values. To
handle missing values, this algorithm generates one probable value for attributes with
missing values by calculating the weight of all classes in the result from the multiple
paths to the leaves.
2. Calculating the information gain ratio for each attribute and each pair of attributes.
3. Comparing the information gain ratio obtained by considering a single attribute as the
splitting criterion with that of two attributes, and choosing the larger as the considered
node for building a tree node.
4. For each subdivision of the considered node, choosing the optimal attribute or pair of
attributes from the remaining attributes by step 2 as the consecutive node, and
considering it as the current node.
5. Repeating step 4 until all attributes are chosen. If only one attribute remains, consider
that attribute as the substitute of the current node immediately.
6. The pruning technique applied in this algorithm is the same as in C4.5. When the
training data is fitted as well as possible and overfitting occurs, transform the decision
tree into a set of equivalent rules and then modify them by removing every precondition
whose removal improves the evaluation accuracy; sort the pruned rules by estimated
accuracy and consider them in this sequence when classifying subsequent instances.
Assume S is the dataset and A is the set of attributes. The equation below calculates the
information gain for the pair of attributes {Ai, Aj} in A:

Gain(S, Ai, Aj) = Entropy(S) − Σ_{V ∈ Values(Ai), U ∈ Values(Aj)} (|S_{V,U}| / |S|) Entropy(S_{V,U}) (5)

Values(Ak) (k = i, j) represents all probable values of attribute Ak, and S_{V,U} denotes the
subdivision of dataset S that contains value V of attribute Ai and value U of attribute Aj. Assume
attribute Ai has n values and attribute Aj has m values; the split information is then described as

SplitData(S, Ai, Aj) = − Σ_{k=1..m×n} (|S_k| / |S|) log2(|S_k| / |S|) (6)
Here, S_1, …, S_{m×n} are the m×n sample subdivisions of dataset S split by the combination of
every probable value of attributes Ai and Aj. The equation below calculates the information gain
ratio of the attribute set {Ai, Aj} in A:

GainRatio(S, Ai, Aj) = Gain(S, Ai, Aj) / SplitData(S, Ai, Aj) (7)

where Gain(S, Ai, Aj) and SplitData(S, Ai, Aj) are computed by (5) and (6).
4. EXPERIMENTAL RESULTS
To test the performance of the improved algorithm, the algorithm was implemented in the C++
programming language. The Breast Cancer and Dermatology datasets were used as the
experimental objects, and the entire training set was used for the testing phase. Figure 1 presents
some parts of the code which calculates the entropy of an input attribute.
Figure 1. Calculating the entropy of an input attribute
Table 1 compares the results with those of C4.5, which were obtained by applying Weka in the
modelling phase. The evaluation is based on the correctly classified metric.
Name          | Correctly classified instances in C4.5 | Correctly classified instances in improved algorithm
Breast cancer | 97.9%                                  | 97.9%
Dermatology   | 96.9%                                  | 100%
Table 1. Results of implementing the improved algorithm on the datasets using the entire training set
According to Table 1, on the Dermatology dataset the improved algorithm gets more accurate
results than C4.5. On the Breast Cancer dataset the improved algorithm and C4.5 have the same
accuracy. The experimental results in Table 1 show that, besides merits such as improving the
probability of finding globally optimal solutions and reducing the depth of the tree, the proposed
algorithm achieves a classification accuracy equal to or better than that of C4.5.
5. IMPLICATIONS FOR MEDICINE DOMAIN
The results achieved suggest that the proposed algorithm is applicable to medical datasets.
However, before generalizing the results to all medical datasets we should be careful, and it is
essential to perform more complex experiments before applying them in a real medical system.
The results of such experiments are valuable during the creation of new Medical Decision
Support Systems. This paper may be the beginning of more complex experiments. The algorithm
may be beneficial for medicine: with the use of accurate algorithms, the precision of medical
diagnostic decisions would increase. This improvement of health care is extremely important in
cases where physicians are in doubt about a diagnosis. It would also make doctors consider rare
diseases. That is why improving data mining algorithms in terms of accuracy is so important from
the medical point of view.
6. CONCLUSION AND FUTURE WORK
In this paper a new algorithm was proposed in order to improve the classification accuracy and
deepness of tree in constructing decision trees, in comparison with C4.5. During choosing the
splitting criteria, the proposed algorithm selects two attributes simultaneously, not only one which
enables algorithm to discover the greater information gain ratio of the criterion in the role of the
splitting node of decision tree.
The test results for the proposed algorithm show that the classification accuracy of the generated
decision tree is improved and that the depth of the tree is reduced. Nevertheless, the proposed
algorithm has some weaknesses: it requires more computation time, and it is still not able to
escape being trapped in a local optimum.
Future work includes improving the algorithm in terms of computation and extending the
multi-criterion splitting approach to complex situations, such as feature spaces with many
attributes or very high dimensionality. In addition, the proposed algorithm can be evaluated
against other medical databases. Such studies, carried out over an extensive range of medical
datasets, would make the evaluation more reliable with respect to various aspects such as
performance.
REFERENCES
[1] Nolte, E. and M. McKee (2008). Caring for people with chronic conditions: a health system
perspective. McGraw-Hill Education (UK).
[2] Teach R. and Shortliffe E. (1981). An analysis of physician attitudes regarding computer-based
clinical consultation systems. Computers and Biomedical Research, Vol. 14, 542-558.
[3] Turkoglu I., Arslan A., Ilkay E. (2002). An expert system for diagnosis of the heart valve diseases.
Expert Systems with Applications, Vol. 23, No.3, 229–236.
[4] Witten, I. H., Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques,
2nd Edition. Elsevier.
[5] Herron P. (2004). Machine Learning for Medical Decision Support: Evaluating Diagnostic
Performance of Machine Learning Classification Algorithms, INLS 110, Data Mining.
[6] Li, L., et al. (2004). Data mining techniques for cancer detection using serum proteomic
profiling. Artificial Intelligence in Medicine, Vol. 32, 71-83.
[7] Comak E., Arslan A., Turkoglu I. (2007). A decision support system based on support vector
machines for diagnosis of the heart valve diseases. Elsevier, vol. 37, 21-27.
[8] Rojas, R. (1996). Neural Networks: a systematic introduction, Springer-Verlag.
[9] Jiang, L. X., Li, C. Q. (2009). Learning decision trees for ranking. Knowledge and Information
Systems, Vol. 20, 123-135.
[10] Ruggieri, S. (2002). Efficient C4.5. IEEE Transactions on Knowledge and Data Engineering,
Vol. 14, No. 2, 438-444.
[11] Cios, K. J., Liu, N. (1992). A machine learning method for generation of a neural network
architecture: A continuous ID3 algorithm. IEEE Transactions on Neural Networks, Vol. 3, No. 3,
280-291.
[12] Gladwin, C. H. (1989). Ethnographic Decision Tree Modeling, Vol. 19. Sage.
[13] Kamber, M., Winstone, L., Gong, W., Cheng, S., & Han, J. (1997). Generalization and decision tree
induction: efficient classification in data mining. In Research Issues in Data Engineering, 1997.
Proceedings. Seventh International Workshop on (pp. 111-120). IEEE.
[14] Han, J. (2006). Data Mining: Concepts and Techniques. Morgan Kaufmann.
[15] Quinlan, J. R. (2014). C4.5: Programs for Machine Learning. Elsevier.
[16] Karthikeyan, T., Thangaraju, P. (2013). Analysis of classification algorithms applied to hepatitis
patients. International Journal of Computer Applications (0975-8887), Vol. 62, No. 15.
[17] Suknovic, M., Delibasic, B., et al. (2012). Reusable components in decision tree induction
algorithms. Computational Statistics, Vol. 27, 127-148.
[18] Chang, R. L., & Pavlidis, T. (1977). Fuzzy decision tree algorithms. IEEE Transactions on
Systems, Man, and Cybernetics, Vol. SMC-7, No. 1, 28-35.
[19] Wang, Y., & Witten, I. H. (1996). Induction of model trees for predicting continuous classes.
[20] Zhang, S., et al. (2005). "Missing is useful": Missing values in cost-sensitive decision trees.
IEEE Transactions on Knowledge and Data Engineering, Vol. 17, No. 12, 1689-1693.
[21] Mingers, J. (1989). An empirical comparison of pruning methods for decision tree induction. Machine
learning, Vol. 4, No. 2, 227-243.
[22] Lin, S. W., Chen, S. C. (2012). Parameter determination and feature selection for C4.5 algorithm
using scatter search approach. Soft Computing, Vol. 16, 63-75.