This document presents a new algorithm called UDT-CDF for building decision trees to classify uncertain numerical data. It improves on previous algorithms like UDT that were based on probability density functions (PDFs). The key aspects of the new algorithm are:
1. It uses cumulative distribution functions (CDFs) rather than PDFs to represent uncertain numerical attributes, since CDFs provide more complete probability information.
2. It splits data at decision tree nodes based on the CDF, placing data with values covering the split point into both branches weighted by the CDF.
3. Experimental results show the new CDF-based algorithm achieves more accurate classifications and is more computationally efficient than the PDF-based UDT algorithm.
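The CDF-based split in point 2 can be sketched as fractional instance weighting. This is a hypothetical representation: the paper's exact data structures are not given here, and an uncertain value is modeled simply as a callable CDF.

```python
def split_uncertain(instances, split_point):
    """Split uncertain instances at split_point, sending fractional
    weight to each branch according to the attribute's CDF.

    Each instance is (cdf, weight, label), where cdf is a callable
    F(x) = P(value <= x).  (Illustrative representation only.)
    """
    left, right = [], []
    for cdf, weight, label in instances:
        p_left = cdf(split_point)   # probability mass at or below the split
        if p_left > 0:
            left.append((cdf, weight * p_left, label))
        if p_left < 1:
            right.append((cdf, weight * (1 - p_left), label))
    return left, right

# Example: a value uniform on [0, 10], split at x = 4 -> weights 0.4 / 0.6
uniform_cdf = lambda x: min(max(x / 10.0, 0.0), 1.0)
left, right = split_uncertain([(uniform_cdf, 1.0, "A")], 4.0)
```

An instance whose distribution straddles the split point thus appears in both branches, each copy carrying the corresponding CDF mass as its weight.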
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE (ijistjournal)
Dichotomous data is categorical data that is binary, with categories zero and one. Health care data is one of the most heavily used kinds of categorical data. Binary data are the simplest form of data used in health care databases, where close-ended questions can be used; they are very efficient in computation and memory for representing categorical data. Clustering health care or medical data is very tedious because of its complex data representation models, high dimensionality and data sparsity. In this paper, clustering is performed after transforming the dichotomous data into real values by the Wiener transformation. The proposed algorithm can be used to determine the correlation of health disorders and symptoms observed in large medical and health binary databases. Computational results show that clustering based on the Wiener transformation is very efficient in terms of objectivity and subjectivity.
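A minimal sketch of the pipeline described above, using SciPy's Wiener filter as a stand-in for the paper's Wiener transformation; the toy symptom data, filter window and choice of k-means as the clustering step are all illustrative assumptions.

```python
import numpy as np
from scipy.signal import wiener
from sklearn.cluster import KMeans

# Toy dichotomous (0/1) health-care records: rows = patients,
# columns = presence/absence of symptoms.  (Illustrative data only.)
rng = np.random.default_rng(0)
group_a = (rng.random((20, 8)) < 0.8).astype(float)   # mostly-1 symptom pattern
group_b = (rng.random((20, 8)) < 0.2).astype(float)   # mostly-0 symptom pattern
binary = np.vstack([group_a, group_b])

# Wiener-filter the 0/1 matrix to map it onto real values,
# then cluster the transformed data.
real_valued = wiener(binary, (3, 3))
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(real_valued)
```

The filtering turns hard 0/1 entries into locally smoothed real values, which gives a distance-based clusterer more gradation to work with than raw binary vectors.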
KNOWLEDGE BASED ANALYSIS OF VARIOUS STATISTICAL TOOLS IN DETECTING BREAST CANCER (cscpconf)
In this paper, we study the performance of machine learning tools in classifying breast cancer. We compare data mining tools such as Naïve Bayes, support vector machines, radial basis neural networks, the J48 decision tree and simple CART. We used both binary and multi-class data sets, namely WBC, WDBC and Breast Tissue from the UCI machine learning repository. The experiments are conducted in WEKA. The aim of this research is to find the best classifier with respect to accuracy, precision, sensitivity and specificity in detecting breast cancer.
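The WEKA comparison can be approximated in scikit-learn. The sketch below scores three of the listed classifiers on the WDBC data shipped with scikit-learn and computes the four metrics the paper names; the radial-basis network and CART variants are omitted, and the train/test split is an assumption.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

X, y = load_breast_cancer(return_X_y=True)   # WDBC, as bundled with scikit-learn
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

scores = {}
for name, clf in [("NaiveBayes", GaussianNB()),
                  ("SVM", SVC()),
                  ("DecisionTree", DecisionTreeClassifier(random_state=0))]:
    y_pred = clf.fit(X_tr, y_tr).predict(X_te)
    tn, fp, fn, tp = confusion_matrix(y_te, y_pred).ravel()
    scores[name] = {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "precision":   tp / (tp + fp),
        "sensitivity": tp / (tp + fn),   # recall / true-positive rate
        "specificity": tn / (tn + fp),
    }
```

All four measures fall out of the same confusion matrix, which is why papers of this kind typically report them together.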
Preprocessing and Classification in WEKA Using Different Classifiers (IJERA Editor)
Data mining is the process of extracting information from a dataset and transforming it into an understandable structure for further use; it also discovers patterns in large data sets [1]. Data mining comprises a number of important techniques, such as preprocessing and classification. Classification is a technique based on supervised learning, used for predicting group membership for data instances. In this paper we apply preprocessing and classification to a diabetes database, run classifiers on this database, and compare the results on certain parameters using WEKA. In India, 77.2 million people are suffering from pre-diabetes, and the ICMR estimates that around 65.1 million are diabetes patients. Globally, in 2010, 227 to 285 million people had diabetes; 90% of those cases were type 2, equal to 3.3% of the population, with equal rates in women and men. In 2011 diabetes resulted in 1.4 million deaths worldwide, making it a leading cause of death.
Hypothesis on Different Data Mining Algorithms (IJERA Editor)
In this paper, different classification algorithms for data mining are discussed. Data mining is about explaining the past and predicting the future by means of data analysis. Classification is a data mining task that categorizes data based on numerical or categorical variables. Many algorithms have been proposed to classify data; of these, five are comparatively studied here. There are four broad classification approaches, namely frequency table, covariance matrix, similarity functions and others. As part of this research on classification methods, the Naive Bayes, k-nearest neighbors, decision tree, artificial neural network and support vector machine algorithms are studied and examined using benchmark datasets such as Iris and Lung Cancer.
Regularized Weighted Ensemble of Deep Classifiers (ijcsa)
An ensemble of classifiers increases classification performance, since the decisions of many experts are fused together to generate the resulting prediction. Deep learning is a classification approach in which, along with the basic learning technique, fine-tuning is done to improve the precision of learning; deep classifier ensemble learning therefore has good scope for research. Feature subset selection is another way of creating the individual classifiers to be fused in ensemble learning. All these ensemble techniques face the ill-posed problem of overfitting. The regularized weighted ensemble of deep support vector machines performs prediction analysis on three UCI repository problems (Iris, Ionosphere and Seeds), thereby increasing the generalization of the decision boundary between the classes of the data set. The singular value decomposition reduced norm-2 regularization with the two-level deep classifier ensemble gives the best result in our experiments.
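A hedged sketch of a weighted classifier ensemble on the Iris data: validation-accuracy weighting stands in for the paper's SVD-based norm-2 regularized weights, and the three base learners are illustrative choices, not the paper's deep SVMs.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0, stratify=y)

# Each member votes with a weight derived from its validation accuracy.
members = [SVC(probability=True, random_state=0),
           LogisticRegression(max_iter=1000),
           DecisionTreeClassifier(random_state=0)]
weights = np.array([m.fit(X_tr, y_tr).score(X_val, y_val) for m in members])
weights /= weights.sum()                      # normalized ensemble weights

# Fuse the experts: weighted average of predicted class probabilities.
proba = sum(w * m.predict_proba(X_val) for w, m in zip(weights, members))
ensemble_pred = proba.argmax(axis=1)
```

The fusion step is the point of the abstract: the final decision is a weighted combination of the members' probability estimates rather than any single expert's output.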
Multimodal authentication is one of the prime concepts in current real-world applications, and various approaches have been proposed for it. In this paper, an intuitive strategy is proposed as a framework for providing a more secure key in biometric security. First, features are extracted through PCA by SVD from the chosen biometric patterns; then key components are extracted using the LU factorization technique, selected with different key sizes, and combined using a convolution kernel method (the Exponential Kronecker Product, eKP) as a Context-Sensitive Exponent Associative Memory model (CSEAM). Verification proceeds in a similar way and is checked with the MSE measure. This model gives a better outcome when compared with SVD factorization [1] as feature selection. The process is computed for different key sizes and the results are presented.
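The key-combination and MSE-verification steps can be illustrated with NumPy's plain Kronecker product. This is a toy sketch: the key vectors and sizes are assumptions, and the exponential Kronecker product and full CSEAM construction of the paper are not reproduced here.

```python
import numpy as np

def combine_keys(k1, k2):
    """Fuse two key-component vectors with a Kronecker product
    (a toy stand-in for the paper's exponential Kronecker product eKP)."""
    return np.kron(k1, k2)

def mse(a, b):
    """Mean squared error, the verification measure named in the abstract."""
    return float(np.mean((a - b) ** 2))

rng = np.random.default_rng(1)
key_a = rng.random(8)    # e.g. PCA-by-SVD feature components (illustrative)
key_b = rng.random(8)    # e.g. LU-factorization key components (illustrative)
stored = combine_keys(key_a, key_b)          # 64-element combined key

# Verification: an honest re-derivation matches (MSE 0);
# a perturbed biometric feature does not.
honest = mse(stored, combine_keys(key_a, key_b))
tampered = mse(stored, combine_keys(key_a + 0.1, key_b))
```

The Kronecker product makes the combined key much longer than either input, which is one reason such constructions are attractive for key generation.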
MULTI-PARAMETER BASED PERFORMANCE EVALUATION OF CLASSIFICATION ALGORITHMS (ijcsit)
Diabetes is among the most common diseases in India. It affects the patient's health and also leads to other chronic diseases. Prediction of diabetes plays a significant role in saving lives and cost. Predicting diabetes in the human body is a challenging task because it depends on several factors. Few studies have reported the performance of classification algorithms in terms of accuracy; the results in these studies are difficult for medical practitioners to understand and also lack visual aids, as they are presented in pure text format. This survey uses the ROC and PRC graphical measures to improve the understanding of results. A detailed parameter-wise discussion of the comparison, lacking in other reported surveys, is also presented. Execution time, accuracy, TP rate, FP rate, precision, recall and F-measure are used for comparative analysis, and a confusion matrix is prepared for a quick review of each algorithm. Ten-fold cross-validation is used to estimate each prediction model. Different sets of classification algorithms are analyzed on a diabetes dataset acquired from the UCI repository.
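Ten-fold cross-validation with a confusion matrix and the listed parameters can be sketched as follows; synthetic data stands in for the UCI diabetes dataset, which is not bundled with scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Stand-in data (the survey uses a UCI diabetes dataset).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Ten-fold cross-validation: every instance is predicted exactly once
# by a model that never saw it during training.
y_pred = cross_val_predict(DecisionTreeClassifier(random_state=0), X, y, cv=10)

cm = confusion_matrix(y, y_pred)             # quick-review confusion matrix
tn, fp, fn, tp = cm.ravel()
metrics = {
    "TP Rate":   tp / (tp + fn),
    "FP Rate":   fp / (fp + tn),
    "Precision": precision_score(y, y_pred),
    "Recall":    recall_score(y, y_pred),
    "F-Measure": f1_score(y, y_pred),
}
```

Because `cross_val_predict` yields one out-of-fold prediction per instance, a single confusion matrix summarizes all ten folds at once, which matches the "quick review" role the abstract describes.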
Comparison of methods for combination of multiple classifiers that predict b... (IJERA Editor)
Predictive analysis includes techniques from data mining that analyze current and historical data and make predictions about the future. Predictive analytics is used in actuarial science, financial services, retail, travel, healthcare, insurance, pharmaceuticals, marketing, telecommunications and other fields. Predicting patterns can be considered a classification problem, and combining different classifiers gives better results. We study and compare three methods used to combine multiple classifiers. Naïve Bayesian networks perform classification based on conditional probability; they are efficient and easy to interpret, as they assume that the predictors are independent. Tree-augmented naïve Bayes (TAN) constructs a maximum weighted spanning tree that maximizes the likelihood of the training data to perform classification; this tree structure eliminates the independent-attribute assumption of naïve Bayesian networks. The behavior-knowledge space method works in two phases and can provide very good performance if large and representative data sets are available.
Analysis On Classification Techniques In Mammographic Mass Data Set (IJERA Editor)
Data mining, the extraction of hidden information from large databases, aims to predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. Data mining classification techniques determine which group each data instance is associated with, and they can handle a wide variety of data, so large amounts of data can be involved in processing. This paper analyzes various data mining classification techniques, such as decision tree induction, Naïve Bayes and k-nearest neighbour (KNN) classifiers, on the mammographic mass dataset.
Research scholars evaluation based on guides view using ID3 (eSAT Journals)
Abstract: Research scholars face many problems in their research and development activities while completing their research work at universities. This paper gives an efficient way of analyzing the performance of a research scholar based on guide and expert feedback. A dataset is formed from this information, with the guides' view of each scholar as the outcome class attribute. We apply the ID3 decision tree algorithm to this dataset to construct a decision tree. A scholar can then enter testing data comprising attribute values to obtain the guides' predicted view for that testing data. Guidelines for improving a scholar's outcomes can be provided by consulting the constructed tree.
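ID3 chooses each split by information gain. A minimal sketch of that criterion follows; the attribute name and feedback values are hypothetical stand-ins for the guide-feedback dataset described above.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, labels):
    """ID3's splitting criterion: entropy reduction from splitting on
    `attr` (rows are dicts of attribute -> value)."""
    n = len(rows)
    by_value = {}
    for row, lab in zip(rows, labels):
        by_value.setdefault(row[attr], []).append(lab)
    remainder = sum(len(subset) / n * entropy(subset)
                    for subset in by_value.values())
    return entropy(labels) - remainder

# Hypothetical guide-feedback records with a binary outcome class.
rows = [{"progress": "good"}, {"progress": "good"},
        {"progress": "poor"}, {"progress": "poor"}]
labels = ["positive", "positive", "negative", "negative"]
gain = information_gain(rows, "progress", labels)   # perfect split -> gain 1.0
```

ID3 evaluates this gain for every remaining attribute at a node, splits on the best one, and recurses on each branch until the leaves are pure.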
INFLUENCE OF DATA GEOMETRY IN RANDOM SUBSET FEATURE SELECTION (IJDKP)
The geometry of data, that is, its probability distribution, is an important consideration for the accurate computation of data mining tasks such as pre-processing, classification and interpretation. The data geometry influences the outcome and accuracy of statistical analysis to a large extent. The current paper focuses on understanding the influence of data geometry on the feature subset selection process using the random forest algorithm. In practice it is assumed that the data follow a normal distribution, and most of the time this may not be true. The dimensionality reduction varies with changes in the distribution of the data. A comparison is made using three standard distributions, namely the triangular, uniform and normal distributions, and the results are discussed in this paper.
Privacy preserving data mining in four group randomized response technique us... (eSAT Journals)
Abstract: Data mining is a process in which data collected from different sources is analyzed for useful information; it is also known as knowledge discovery in databases (KDD). Privacy and accuracy are important issues in data mining when data is shared. Most methods use random permutation techniques to mask the data in order to preserve the privacy of sensitive data. Randomized response (RR) techniques were developed to protect survey privacy and avoid biased answers. The proposed work enhances the privacy level of the RR technique using a four-group scheme. First, following the algorithm, random attributes a, b, c and d are considered; then randomization is performed on every dataset according to the values of theta; finally, the ID3 and CART algorithms are applied to the randomized data. The results show that increasing the number of groups increases the privacy level: compared with the three-group scheme, the four-group scheme decreases accuracy by 6% but increases privacy by 65%.
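The randomization step can be sketched with the classic two-answer (Warner) scheme; the paper's four-group joint randomization is reduced here to a single attribute for clarity, and theta, the data and the seed are illustrative.

```python
import random

def randomize(answers, theta, seed=42):
    """Warner-style randomized response: each respondent reports the
    true 0/1 answer with probability theta, the opposite otherwise."""
    rng = random.Random(seed)
    return [a if rng.random() < theta else 1 - a for a in answers]

def estimate_true_proportion(reported, theta):
    """Unbias the observed proportion using
    p_obs = theta*p + (1 - theta)*(1 - p)."""
    p_obs = sum(reported) / len(reported)
    return (p_obs - (1 - theta)) / (2 * theta - 1)

truth = [1] * 300 + [0] * 700        # true sensitive attribute, p = 0.3
theta = 0.8
masked = randomize(truth, theta)     # what the miner actually sees
estimate = estimate_true_proportion(masked, theta)   # close to 0.3
```

No individual masked answer reveals the respondent's true value, yet the aggregate proportion is recoverable, which is exactly the privacy/accuracy trade-off the abstract quantifies as theta varies and groups are added.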
POSTERIOR RESOLUTION AND STRUCTURAL MODIFICATION FOR PARAMETER DETERMINATION ... (IJCI JOURNAL)
When only a few lower-mode data are available to evaluate a large number of unknown parameters, it is difficult to acquire information about all of them. The challenge in this kind of updating problem is first to gain confidence about the parameters that are evaluated correctly using the available data, and second to obtain information about the remaining parameters. In this work, the first issue is resolved using the sensitivity of the modal data used for updating. Once it is determined which parameters are evaluated satisfactorily using the available modal data, the remaining parameters are evaluated using the modal data of a virtual structure. This virtual structure is created by adding or removing some known stiffness to or from some of the stories of the original structure. A 12-story shear building is considered for the numerical illustration of the approach. The results show that the present approach is an effective tool for system identification problems when only a few data are available for updating.
Improved probabilistic distance based locality preserving projections method ... (IJECEIAES)
In this paper, dimensionality reduction in large datasets is achieved using the proposed distance-based Non-negative Matrix Factorization (NMF) technique, which is intended to solve the data dimensionality problem. Here, NMF and distance measurement aim to resolve the non-orthogonality problem caused by increased dataset dimensionality. The method initially partitions the datasets and organizes them into a defined geometric structure, capturing the dataset structure through a distance-based similarity measurement. The proposed method is designed to fit dynamic datasets and includes the intrinsic structure using data geometry. The complexity of the data is further reduced using an Improved Distance-based Locality Preserving Projection. The proposed method is evaluated against existing methods in terms of accuracy, average accuracy, mutual information and average mutual information.
A decision tree (DT) is considered good when it is small and can classify newly introduced data accurately. Pre-processing the input data is one good approach for generating such a tree: when different data pre-processing methods are combined with a DT classifier, the evaluation yields high performance. This paper examines the accuracy variation of the ID3 classifier when used in combination with different data pre-processing and feature selection methods. The performance of the DTs is measured by comparing original and pre-processed input data, and experimental results are shown using the standard decision tree algorithm ID3 on a dataset.
Classification of Breast Cancer Diseases using Data Mining Techniques (inventionjournals)
Medical data mining has great potential for exploring new knowledge in large amounts of data. Classification is one of the important data mining techniques. In this research work, we use various data-mining-based classification techniques to classify whether or not a patient has a cancer disease. We applied the Breast Cancer Wisconsin (Original) data set to different data mining techniques and compared the accuracy of the models under two different data partitions. BayesNet achieved the highest accuracy, 97.13%, in the case of 10-fold data partitions. We also applied the info-gain feature selection technique to BayesNet and the Support Vector Machine (SVM) and achieved the best accuracy, 97.28%, with BayesNet on a 6-feature subset.
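The InfoGain-plus-Bayesian-classifier setup can be approximated in scikit-learn. In this sketch GaussianNB stands in for WEKA's BayesNet and mutual information for WEKA's InfoGain ranker; the dataset is scikit-learn's copy of the Wisconsin diagnostic data, so the numbers will not match the paper's.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Info-gain-style ranking: keep the 6 most informative features,
# mirroring the paper's 6-feature subset.
selector = SelectKBest(mutual_info_classif, k=6)
X_sel = selector.fit_transform(X, y)

# 10-fold accuracy with all features vs. the selected subset.
acc_all = cross_val_score(GaussianNB(), X, y, cv=10).mean()
acc_sel = cross_val_score(GaussianNB(), X_sel, y, cv=10).mean()
```

The point of the experiment is that a small, well-chosen feature subset can match or slightly beat the full feature set, as the paper reports for BayesNet.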
Efficient Classification of Big Data Using VFDT (Very Fast Decision Tree) (eSAT Journals)
Abstract
Decision tree learning algorithms have been able to capture knowledge successfully. Decision trees are best considered when instances are described by attribute-value pairs and when the target function has a discrete value. The main task of these decision trees is to apply inductive methods to the given attribute values of an unknown object and determine an appropriate classification by applying decision tree rules. Decision trees are very effective for evaluating performance and representing algorithms because of their robustness, simplicity, capability of handling numerical and categorical data, ability to work with large datasets and comprehensibility, to name a few. There are various decision tree algorithms available, such as ID3, CART, C4.5, VFDT, QUEST, CTREE, GUIDE, CHAID and CRUISE. In this paper, a comparative study of three of these popular decision tree algorithms has been made: ID3 (Iterative Dichotomizer 3), C4.5, which is an evolution of ID3, and VFDT (Very Fast Decision Tree). An empirical study has been conducted to compare C4.5 and VFDT in terms of accuracy and execution time, and various conclusions have been drawn.
Key Words: Decision tree, ID3, C4.5, VFDT, Information Gain, Gain Ratio, Gini Index, Over-fitting.
Scalable Decision Tree Based on Fuzzy Partitioning and an Incremental Approach (IJECE IAES)
Classification, as a data mining task, is the process of assigning entities to an already defined class by examining their features. The most significant property of a decision tree as a classification method is its ability to recursively partition the data. To choose the best attributes for partitioning, the value range of each continuous attribute should be divided into two or more intervals. Fuzzy partitioning can be used to reduce noise sensitivity and increase the stability of trees. Also, decision trees constructed with existing approaches tend to be complex and are consequently difficult to use in practical applications. In this article, a fuzzy decision tree is introduced that tackles the problems of tree complexity and memory limitation by incrementally inserting data sets into the tree. Membership functions are generated automatically. Fuzzy information gain is then used as a fast splitting-attribute selection criterion, and the expansion of a leaf is done attending only to the instances stored in it. The efficiency of this algorithm is examined in terms of accuracy and tree complexity. The results show that, by reducing the complexity of the tree, the proposed algorithm can overcome the memory limitation and strike a balance between accuracy and complexity.
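The fuzzy partitioning idea described above, automatically generated membership functions over a continuous attribute's range, can be sketched in a few lines of plain Python. This is an illustrative sketch, not the article's implementation; the triangular shape and the choice of three evenly spaced fuzzy sets are assumptions for the example.

```python
def triangular(x, a, b, c):
    """Triangular membership function: rises on [a, b], falls on [b, c]."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

def fuzzy_partition(values, n_sets=3):
    """Evenly spaced triangular fuzzy sets covering the attribute's value range."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (n_sets - 1)
    centers = [lo + i * step for i in range(n_sets)]
    # each set is the triple (left foot, peak, right foot)
    return [(c - step, c, c + step) for c in centers]
```

With this spacing, any value inside the attribute's range gets membership degrees that sum to 1 across the fuzzy sets, so a tuple is shared between at most two intervals instead of being assigned crisply to one.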
SURVEY ON CLASSIFICATION ALGORITHMS USING BIG DATASET (Editor IJMTER)
The data mining environment produces a large amount of data that needs to be analyzed. Using traditional databases and architectures, it has become difficult to process, manage and analyze patterns. To gain knowledge about Big Data, a proper architecture should be understood. Classification is an important data mining technique with broad applications, used to classify the various kinds of data found in nearly every field of our life. Classification assigns an item to one of a predefined set of classes according to the item's features. This paper puts a light on various classification algorithms, including J48, C4.5 and Naive Bayes, using a large dataset.
A HYBRID MODEL FOR MINING MULTI-DIMENSIONAL DATA SETS (Editor IJCATR)
This paper presents a hybrid data mining approach based on supervised and unsupervised learning to identify the closest data patterns in the database. The technique achieves a maximum accuracy rate with minimal complexity. The proposed algorithm is compared with traditional clustering and classification algorithms and is also evaluated on multidimensional datasets. The implementation results show better prediction accuracy and reliability.
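One common reading of such a hybrid (unsupervised stage followed by a supervised stage) is to cluster the data first and then attach a class label to each cluster. The sketch below, plain Python on 1-D toy data, is an illustrative assumption, not the paper's actual algorithm; the deterministic initialization and majority-label step are choices made for the example.

```python
from collections import Counter

def kmeans_1d(xs, k, iters=10):
    """Unsupervised stage: tiny 1-D k-means with deterministic initialization."""
    s = sorted(xs)
    centers = [s[i * (len(s) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[min(range(k), key=lambda j: abs(x - centers[j]))].append(x)
        centers = [sum(g) / len(g) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers

def fit_hybrid(xs, ys, k=2):
    """Supervised stage: label each cluster with the majority class inside it."""
    centers = kmeans_1d(xs, k)
    votes = [Counter() for _ in range(k)]
    for x, y in zip(xs, ys):
        votes[min(range(k), key=lambda j: abs(x - centers[j]))][y] += 1
    return centers, [v.most_common(1)[0][0] for v in votes]

def predict_hybrid(model, x):
    """Classify a new point by the label of its nearest cluster center."""
    centers, labels = model
    return labels[min(range(len(centers)), key=lambda j: abs(x - centers[j]))]
```

The clustering step reduces the search to a handful of prototypes, which is where the "minimal complexity" claim of such hybrids usually comes from.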
Distributed Digital Artifacts on the Semantic Web (Editor IJCATR)
Distributed digital artifacts incorporate cryptographic hash values into URIs, called trusty URIs, in a distributed environment, building verifiable and immutable web resources of good quality to prevent the rising man-in-the-middle attack. The greatest challenge of a centralized system is that it gives users no possibility to check whether data has been modified, and communication is limited to a single server. The solution is a distributed digital artifact system, where resources are distributed among different domains to enable inter-domain communication. Due to emerging developments in the web, attacks have increased rapidly, among which the man-in-the-middle attack (MIMA) is a serious issue that threatens user security. This work tries to prevent MIMA to an extent by providing self-reference and trusty URIs even in a distributed environment. Any manipulation of the data is efficiently identified, and any further access to that data is blocked by informing the user that the uniform location has been changed. The system uses self-reference to contain a trusty URI for each resource, a lineage algorithm for generating the seed, and the SHA-512 hash generation algorithm to ensure security. It is implemented on the semantic web, an extension of the world wide web, using RDF (Resource Description Framework) to identify resources. The framework was thus developed to overcome existing challenges by making the digital artifacts on the semantic web distributed, enabling secure communication between different domains across the network and thereby preventing MIMA.
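The tamper-detection idea behind trusty URIs can be sketched with Python's standard hashlib. Note that this is a simplified illustration only: appending a truncated hex SHA-512 digest after "#" is an assumption made for the example, whereas the actual trusty URI scheme defines its own module identifiers and Base64 encoding.

```python
import hashlib

def trusty_suffix(resource_bytes):
    """SHA-512 digest of the resource content (truncated here for readability)."""
    return hashlib.sha512(resource_bytes).hexdigest()[:32]

def make_trusty_uri(base, resource_bytes):
    """Append the content hash to the URI so the reference is self-verifying."""
    return base + "#" + trusty_suffix(resource_bytes)

def verify(uri, resource_bytes):
    """Recompute the hash; a mismatch means the artifact was manipulated."""
    return uri.rsplit("#", 1)[1] == trusty_suffix(resource_bytes)
```

Because the hash is part of the reference itself, a man-in-the-middle who alters the resource cannot also alter every copy of the URI that points to it, so the manipulation is detected on verification.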
DCOM (Distributed Component Object Model) and CORBA (Common Object Request Broker Architecture) are two popular distributed object models. In this paper, we make an architectural comparison of DCOM and CORBA at three different layers: the basic programming architecture, the remoting architecture, and the wire protocol architecture.
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY (Editor IJMTER)
The data mining environment produces a large amount of data that needs to be analyzed, and patterns have to be extracted from it to gain knowledge. In this new period, with the explosion of data both ordered and unordered, it has become difficult to process, manage and analyze patterns using traditional databases and architectures. To gain knowledge about Big Data, a proper architecture should be understood. Classification is an important data mining technique with broad applications, used to classify the various kinds of data found in nearly every field of our life. Classification assigns an item to one of a predefined set of classes according to the item's features. This paper provides an inclusive survey of different classification algorithms, including J48, C4.5, the k-nearest neighbor classifier, Naive Bayes, SVM, etc., using the random concept.
A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE (aciijournal)
Today, an enormous amount of data is collected in medical databases. These databases may contain valuable information encapsulated in nontrivial relationships among symptoms and diagnoses. Extracting such dependencies from historical data is much easier to do using medical systems, and such knowledge can be used in future medical decision making. In this paper, a new algorithm based on C4.5 to mine data for medicine applications is proposed, and it is then evaluated against two datasets and the C4.5 algorithm in terms of accuracy.
AN EFFICIENT FEATURE SELECTION IN CLASSIFICATION OF AUDIO FILES (cscpconf)
In this paper we focus on an efficient feature selection method for the classification of audio files. The main objective is feature selection and extraction. We select a set of features for further analysis, which represents the elements of the feature vector. By the extraction method we compute a numerical representation that can be used to characterize the audio using an existing toolbox. In this study, Gain Ratio (GR) is used as the feature selection measure. GR is used to select the splitting attribute that will separate the tuples into different classes. Pulse clarity is considered as a subjective measure and is used to calculate the gain of the features of audio files. The splitting criterion is employed in the application to identify the class, or music genre, of a specific audio file from the testing database. Experimental results indicate that by using GR the application can produce satisfactory results for music genre classification. After dimensionality reduction, the best three features are selected out of the various features of an audio file, and with this technique we obtain a more than 90% successful classification result.
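The Gain Ratio measure used above, information gain normalized by split information, as in C4.5, can be sketched in plain Python. The toy rows below stand in for audio feature vectors and are assumptions for illustration, not the paper's toolbox features.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr):
    """Information gain of splitting on attribute index `attr`,
    normalized by the split information of that attribute."""
    n = len(labels)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    cond = sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = -sum(len(g) / n * math.log2(len(g) / n) for g in groups.values())
    gain = entropy(labels) - cond
    return gain / split_info if split_info > 0 else 0.0
```

An attribute that perfectly separates the genres scores 1.0, while an attribute with a single constant value scores 0.0, so ranking features by this value gives the "best three features" selection described above.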
IMAGE CLASSIFICATION USING DIFFERENT CLASSICAL APPROACHES (Vikash Kumar)
Image classification using the KNN, Random Forest and SVM algorithms on glaucoma datasets, explaining the accuracy, sensitivity and specificity of each algorithm.
A Survey on Decision Tree Learning Algorithms for Knowledge Discovery (IJERA Editor)
Immense volumes of data are populated into repositories from various applications. Data mining techniques are very helpful for finding desired information and knowledge in large datasets. Classification is one of the knowledge discovery techniques. Within classification, decision trees are very popular in the research community due to their simplicity and easy comprehensibility. This paper presents an updated review of recent developments in the field of decision trees.
Abstract: In this paper, the concept of data mining is summarized and the significance of its methodologies is illustrated. Data mining based on Neural Networks and Genetic Algorithms is researched in detail, and the key technologies and ways to achieve data mining with Neural Networks and Genetic Algorithms are also surveyed. This paper also conducts a formal review of the area of rule extraction from ANNs and GAs. Keywords: Data Mining, Neural Network, Genetic Algorithm, Rule Extraction.
Diagnosis of health condition is a very challenging task for every human being, because life is directly related to health. Data-mining-based classification is one of the important applications for the classification of data. In this research work, we used various classification techniques for the classification of thyroid data. CART gives the highest accuracy, 99.47%, as the best model. Feature selection plays a very important role in making a model computationally efficient and increasing its performance. This research work focuses on the Info Gain and Gain Ratio feature selection techniques to remove irrelevant features from the original data set and computationally increase the performance of the model. We applied both feature selection techniques to the best model, i.e., CART. Our proposed CART-Info Gain and CART-Gain Ratio give 99.47% and 99.20% accuracy with 25 and 3 features respectively.
Predicting Students' Performance Using ID3 and C4.5 Classification Algorithms (IJDKP)
An educational institution needs to have approximate prior knowledge of enrolled students to predict their performance in future academics. This helps it identify promising students and also provides an opportunity to pay attention to and improve those who would probably get lower grades. As a solution, we have developed a system which can predict the performance of students from their previous performances using data mining techniques under classification. We have analyzed a data set containing information about students, such as gender, marks scored in the board examinations of classes X and XII, marks and rank in entrance examinations, and first-year results for the previous batch of students. By applying the ID3 (Iterative Dichotomiser 3) and C4.5 classification algorithms to this data, we have predicted the general and individual performance of freshly admitted students in future examinations.
Varsha Choudhary, Pranita Jain / International Journal of Engineering Research and Applications (IJERA), ISSN: 2248-9622, www.ijera.com, Vol. 3, Issue 1, January-February 2013, pp. 1501-1506

Classification: A Decision Tree For Uncertain Data Using CDF

Varsha Choudhary (1), Pranita Jain (2)
(1) Department of Information Technology, SATI College, RGPV University, Vidisha (M.P.), India
(2) Assistant Professor, Department of Information Technology, SATI College, RGPV University, Vidisha (M.P.), India
ABSTRACT
Decision trees are suitable for, and widely used in, describing classification phenomena. This paper presents a decision-tree-based classification system for uncertain data. Uncertain data means a lack of certainty: data uncertainty arises from different sources, including sensor error, network latency, limited measurement precision and multiple repeated measurements. We find that a decision tree classifier gives more accurate results if we take the "complete information" of the data set. In this paper we improve the traditional decision tree algorithm, which works with known and precise data, by using the gini index method for determining the goodness of a split and by considering the cumulative distribution function. The experimental study shows that the proposed CDF-distribution-based algorithm gives accurate results for uncertain numerical data sets and is computationally efficient in terms of memory, time and accuracy.

Keywords: Classification, CDF, Data mining, Decision tree, Uncertain data.

I. INTRODUCTION
Data mining refers to extracting or mining knowledge from large amounts of data. The classification of large data sets is an important problem in data mining. The classification problem can be defined as follows: for a database with a number of records and a set of classes such that each record belongs to one of the given classes, the problem of classification is to decide the class to which a given record belongs. The classification problem is also concerned with generating a description, or model, for each class from the given data set. Classification is one of the most important data mining techniques; it is used to predict group/class membership for data instances. Different models have been proposed for classification, such as decision trees, neural networks, Bayesian belief networks, fuzzy sets and genetic models. Decision tree classifiers are the most widely used among these models. They are popular because they are practical and easy to understand, and rules can easily be extracted from them. Many algorithms, such as ID3[7], C4.5 and CART, have been devised for decision tree construction. All of these algorithms are used in various areas such as image recognition, medical diagnosis[5], credit rating of loan applicants, scientific tests, fraud detection and target marketing.

The decision tree is a supervised classification approach. A decision tree is a flow-chart-like structure, where each internal node denotes a test on an attribute, each branch shows an outcome of the test and each leaf node holds a class label. The top node in a tree is defined as the root node. A decision tree works with two different subsets: a training set and a test set. The training set is used for deriving the classifier, and the test set is used to measure the accuracy of the classifier. The accuracy of the classifier is determined by the percentage of the test data set that is correctly classified.

A decision tree works on two different kinds of attributes, namely numerical and categorical. Attributes over numeric data are known as numerical attributes, and attributes whose domain is not numeric are called categorical attributes. The aim of classification is to design a concise model that can be used to predict the class of data records whose class label is unknown.

A simple way to handle uncertainty is to abstract the probability distribution by summary statistics such as mean and variance; this approach is known as Averaging. Another method works on the complete information carried by the probability distributions to design a decision tree; this method is known as distribution-based[1]. In this paper we work on a distribution-based method with the "cumulative distribution function" (cdf) for constructing a decision tree classifier on uncertain numerical data sets.

Uncertainty arises in many applications for different reasons. We briefly describe the different kinds of uncertainty here:
1.1 Parameter uncertainty: parameter uncertainty comes from the model parameters that are inputs to the computer model (mathematical model) but whose exact values are unknown to experimentalists and cannot be controlled in physical experiments.
1.2 Structural uncertainty: this type of uncertainty comes from the lack of knowledge of the underlying true physics. It depends on how accurately a mathematical model describes the true system for a
real-life situation, considering the fact that models are almost always only approximations to reality.
1.3 Experimental uncertainty: this type of uncertainty comes from the variability of experimental measurements. Experimental uncertainty is inevitable and can be noticed by repeating a measurement many times using exactly the same settings for all inputs/variables.

In this paper our contributions include:
1. A basic algorithm for building decision trees for uncertain numerical data sets.
2. An experimental study comparing the classification accuracy achieved by the UDT based on the pdf and the UDT based on the cdf.
3. A performance analysis based on CPU time and memory for both algorithms.
In the rest of this paper, section 2 describes related work, section 3 describes the problem definition, section 4 presents the proposed algorithm UDT-CDF and section 5 presents the experimental study of both algorithms. The last section concludes the paper.

II. Related Work
Many uncertain data classification algorithms have been proposed in the literature in recent years. Qin et al. (2009b) proposed a rule-based classification algorithm for uncertain data[4]. Ren et al. (2009) proposed to apply the Naïve Bayes approach to the uncertain data classification problem. The decision tree is a widely used classification model because of its advantages (Tsang et al. 2009[1], Quinlan 1993[2]). Various decision-tree-based classifiers for uncertain data have been proposed by researchers. The C4.5 classification algorithm was extended to the DTU (Qin et al. 2009a)[2] and the UDT (Tsang et al. 2009)[1] for classifying uncertain data. (Qin et al. 2009a) and (Qin et al. 2009b) used a probability vector and a probability density function (pdf) respectively to represent uncertain numerical attributes (Cheng et al. 2003), and constructed a well-performing decision tree for uncertain data (DTU). C. Liang and Y. Zhang (2010) proposed an algorithm, UCVFDT, which works on dynamic and uncertain data streams[3]. Tsang et al. (2009)[1] used the "complete information" of the pdf to construct an uncertain decision tree (UDT) and proposed a series of pruning techniques to improve its efficiency.
In our algorithm we use a cumulative distribution function to construct an uncertain numerical decision tree, and it gives more accurate results compared to UDT, which works on the pdf[1]. A cumulative distribution function is basically a probability-based distribution function.
Another related topic is the fuzzy decision tree[13]. Fuzzy information models data uncertainty arising from human perception and understanding. The uncertainty reflects the vagueness and ambiguity of concepts, e.g., how cool is "cool". In a fuzzy decision tree, both attributes and class labels can be fuzzy and are represented in fuzzy terms[1]. Given a fuzzy attribute of a data tuple, a degree (called membership) is assigned to each possible value, showing the extent to which the data tuple belongs to a particular value. Our work instead gives classification results as a distribution: for each test tuple, we give a distribution telling how likely it belongs to each class. There are many variations of fuzzy decision trees, e.g., the fuzzy extension of ID3[13] and the Soft Decision Tree[14]. In these models, a node of the decision tree does not give a crisp test which decides deterministically which branch down the tree a training or testing tuple is sent. Rather, it gives a "soft test" or fuzzy test on the point-valued tuple. Based on the fuzzy truth value of the test, the tuple is split into weighted tuples (akin to fractional tuples) and these are sent down the tree in parallel[1]. This differs from the approach taken in this paper, in which the probabilistic part stems from the uncertainty embedded in the data tuples, while the test represented by each node of our decision tree remains crisp and deterministic. The advantage of our approach is that the tuple splitting is based on probability values, giving a natural interpretation to the splitting as well as to the result of classification.

III. Problem Definition
This section focuses on the problem of decision-tree classification on uncertain data. We briefly describe traditional decision trees; then we discuss how data tuples with uncertainty are handled.

3.1 Traditional Decision Trees:
In our model, a dataset consists of d training tuples, {t1, t2, ..., td}, and k numerical (real-valued) feature attributes, A1, ..., Ak. The domain of attribute Aj is dom(Aj). Each tuple ti is associated with a feature vector Vi = (v_i,1, v_i,2, ..., v_i,k) and a class label ci, where v_i,j ∈ dom(Aj) and ci ∈ C, the set of all class labels. The classification problem is to construct a model M that maps each feature vector (v_x,1, ..., v_x,k) to a probability distribution Px on C such that, given a test tuple t0 = (v_0,1, ..., v_0,k, c0), P0 = M(v_0,1, ..., v_0,k) predicts the class label c0 with high accuracy. We say that P0 predicts c0 if c0 = arg max_{c ∈ C} P0(c).[1]
In this paper we study binary decision trees with tests on numerical attributes. Each internal node n of a decision tree is associated with an attribute and a split point. An internal node has exactly 2 child nodes, which are labeled "left" and "right" respectively. Each leaf node in a binary tree is associated with a class label.
To determine the class label of a given test tuple t0, we traverse the tree starting from the root node until a leaf node is reached. When we visit an internal node n, we execute the test and proceed to
the left node or the right node accordingly. Eventually, we reach a leaf node m. The probability distribution Pm associated with m gives the probabilities that t0 belongs to each class label c ∈ C. For a single result, we return the class label c ∈ C that maximizes Pm(c).

3.2 Handling uncertainty:
In our algorithm, a feature value is represented not by a single value but by a cdf. The cumulative distribution function (CDF) describes the probability of a random variable falling in the interval (−∞, x]. The CDF of the standard normal distribution is denoted with the capital Greek letter Φ (phi), and can be computed as an integral of the probability density function:

Φ(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−t²/2} dt    (1)

For a generic normal random variable with mean μ and variance σ² > 0, the CDF is equal to

F(x) = Φ((x − μ)/σ)    (2)

3.3 Properties of the cdf:
3.3.1 The standard normal CDF is 2-fold rotationally symmetric around the point (0, 1/2): Φ(−x) = 1 − Φ(x).
3.3.2 The derivative of Φ(x) is equal to the standard normal pdf ϕ(x): Φ′(x) = ϕ(x).
3.3.3 The antiderivative of Φ(x) is: ∫ Φ(x) dx = x Φ(x) + ϕ(x).

IV. Algorithm for UDT-CDF
Input: the training dataset DS (Japanese vowel); the set of candidate attributes att-list
Output: an uncertain numerical tree
Begin
4.1 create a node N;
4.2 if (DS are all of the same class, C) then
4.3 return N as a leaf node labeled with the class C;
4.4 else if (attribute-list is empty) then
4.5 return N as a leaf node labeled with the highest weight class in DS;
4.6 endif;
4.7 select a test-attribute with the highest probabilistic information gini index to label node N;
4.8 if (test-attribute is uncertain numeric) then
4.9 binary split the data from the selected position p;
4.10 for (each instance i) do
4.11 if (test-attribute <= p) then
4.12 put it into DSl (left side) with weight i.w;
4.13 else if (test-attribute > p) then
4.14 put it into DSr (right side) with weight i.w;
4.15 else
4.16 put it into DSl with weight i.w(1) = i.w × F(p), using the cdf of eq. (2);
4.17 put it into DSr with weight i.w(2) = i.w × (1 − F(p));
4.18 endif;
4.19 end for;

The basic concept of this algorithm is as follows:
(1) The tree starts as a single node representing the training samples (step 1).
(2) If the samples are all of the same class, then the node becomes a leaf and is labeled with that class (steps 2 and 3).
(3) Otherwise, the algorithm uses a probabilistic measure, known as the probabilistic information gini index, as the criterion for selecting the attribute that will best separate the samples into an individual class (step 7). This attribute becomes the "test" attribute at the node.
(4) If the test attribute is uncertain numerical, we split the data at the selected position p (steps 8 and 9).
(5) A branch is created for test-attribute ≤ p and test-attribute > p respectively. If an instance's test attribute value is less than or equal to p, it is put into the left branch with the instance's weight i.w. If an instance's test attribute value is larger than p, it is put into the right branch with the instance's weight i.w. If an attribute's value covers the split point p (−∞ ≤ p < ∞), it is put into the left branch with weight i.w(1), the fraction given by the cdf.
(6) The rest goes to the right branch with weight i.w(2). Then the dataset is divided into DSl and DSr (steps 10-19).
(7) The algorithm recursively applies the same process to generate a decision tree for the samples.
(8) The recursive partitioning process stops only when either of the following conditions becomes true:
(8.1) All samples for a given node belong to the same class (steps 2 and 3), or
(8.2) There are no remaining attributes on which the samples may be further partitioned (step 4). In this case, the highest-weight class is employed (step 5). This involves converting the given node into a leaf and labeling it with the class having the highest weight among the samples. Alternatively, the class distribution of the node samples may be stored.

V. Experiments
In this section, we present the experimental results of the proposed decision tree algorithm. We have implemented UDT based on the pdf [1] and UDT based on the cdf, and applied them to a real data set, the Japanese Vowel data set, taken from the UCI Machine Learning Repository [11]. This data set was chosen because it contains mostly numerical attributes obtained from measurements.
The Japanese Vowel data set has 640 tuples, of which 270 are training tuples and 370 are test tuples. Each tuple represents an utterance of Japanese vowels by one of the 9 participating male speakers. Each tuple contains 12 numerical attributes, which are LPC (Linear Predictive Coding) coefficients. These coefficients reflect important features of speech sound. Each attribute value consists of 7-29 samples of LPC coefficients collected over time. These samples represent uncertain information and are used to model the cdf of the attribute for the tuple. The class label of each tuple is the speaker id. The classification task is to identify the speaker when given a test tuple.
We implemented UDT-CDF in MATLAB 7.8.0 (R2009a); the experiments were executed on a PC with an Intel(R) Pentium(R) 2.30 GHz CPU and 2.00 GB of main memory.

5.1 Accuracy:
The overall accuracy is shown in this graph. (The accuracy of the classifier is determined by the percentage of the test data set that is correctly classified.) We first examine the accuracy of the algorithms, which is shown in figure 1.

fig: 1 UDT-CDF accuracy on uncertain numerical data sets
In the figure, the x-axis defines time and the y-axis defines memory; the dotted line indicates the accuracy achieved by UDT-CDF.

5.2 Execution time:
The execution time or CPU time of an algorithm is defined as the time spent by the system executing that particular algorithm, including the time spent executing run-time or system services on its behalf. We examine the execution time of the algorithms, which is shown in figure 2.

fig: 2 UDT-CDF cputime on uncertain numerical data sets
The diagram shows the CPU time (in seconds), i.e. the execution time, of UDT-PDF and UDT-CDF, where the x-axis defines time and the y-axis defines the pdf and cdf algorithms. Our proposed algorithm takes less time to execute on uncertain numerical data sets. The first bar defines the execution time of UDT-PDF and the second bar defines the execution time taken by UDT-CDF.

5.3 Memory
We examine the memory space used by the algorithms, which is shown in figure 3.

fig: 3 Memory graph
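As noted above, each attribute value's 7-29 LPC samples are used to model the cdf of the attribute for a tuple. A minimal sketch of building such an empirical cdf from a sample list is given below; the function names are illustrative assumptions, not the paper's MATLAB implementation.

```python
import numpy as np

def empirical_cdf(samples):
    """Build an empirical CDF F(x) from the observed samples of an
    uncertain attribute (e.g. the 7-29 LPC coefficient samples per tuple)."""
    xs = np.sort(np.asarray(samples, dtype=float))
    n = xs.size

    def F(x):
        # F(x) = fraction of samples <= x
        return np.searchsorted(xs, x, side="right") / n

    return F

# Illustrative samples for one uncertain attribute value
F = empirical_cdf([0.2, 0.5, 0.9, 1.1])
print(F(0.5))  # 0.5: half of the samples lie at or below 0.5
```

When a node splits on this attribute at point x, a tuple whose samples straddle x can be sent to both branches, weighted by F(x) for the left branch and 1 - F(x) for the right, which is how UDT-CDF distributes uncertain tuples across a split.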
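The flat rule lists printed for the decision trees in Section 5.4 ("1 if x1<0.117465 then node 2 else node 3", ...) can be walked with a small interpreter. The sketch below encodes the UDT-PDF rules as a dictionary; this encoding is an assumed illustration, not the paper's code, and it shows only the crisp traversal (UDT-CDF would additionally weight a tuple across both branches by the cdf at the split point).

```python
# UDT-PDF tree (fig. 4), transcribed from the printed rule list.
# Internal nodes: (attribute, threshold, left child, right child); leaves: class label.
TREE = {
    1: ("x1", 0.117465, 2, 3),
    2: ("x9", 0.39492, 4, 5),
    3: ("x1", 0.120901, 6, 7),
    4: "Smith",
    5: "Marry",
    6: ("x2", 0.369726, 8, 9),
    7: ("x1", 0.130256, 10, 11),
    8: "Gangnam",
    9: "Woody",
    10: "coll",
    11: "Gangnam",
}

def classify(tree, tuple_values, node=1):
    """Follow 'if attr < threshold then left else right' rules to a leaf."""
    while not isinstance(tree[node], str):
        attr, threshold, left, right = tree[node]
        node = left if tuple_values[attr] < threshold else right
    return tree[node]

# x1 < 0.117465 -> node 2; x9 >= 0.39492 -> node 5 -> class "Marry"
print(classify(TREE, {"x1": 0.10, "x2": 0.0, "x9": 0.50}))
```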
When we use complete information to plot an uncertain decision tree, we need more memory space for storing that information. The figure 3 memory graph shows the memory space used by UDT-PDF and UDT-CDF, where the first bar is for UDT-PDF and the second bar is for UDT-CDF.

5.4 Classification diagram:
Figs. 4 and 5 show the decision trees produced by UDT-PDF and UDT-CDF. The classification rules describe how the tuples are classified. Table 1.1 shows the result analysis of both algorithms.

fig: 4 UDT-PDF decision tree
Rules for classification of UDT-PDF:
1 if x1<0.117465 then node 2 else node 3
2 if x9<0.39492 then node 4 else node 5
3 if x1<0.120901 then node 6 else node 7
4 class = Smith
5 class = Marry
6 if x2<0.369726 then node 8 else node 9
7 if x1<0.130256 then node 10 else node 11
8 class = Gangnam
9 class = Woody
10 class = coll
11 class = Gangnam

fig: 5 UDT-CDF decision tree
Rules for classification of UDT-CDF:
1 if x1<0.941063 then node 2 else node 3
2 if x1<0.938852 then node 4 else node 5
3 if x9<0.443404 then node 6 else node 7
4 if x1<0.932693 then node 8 else node 9
5 if x1<0.940004 then node 10 else node 11
6 class = Smith
7 class = Marry
8 class = Gangnam
9 class = coll
10 class = Gangnam
11 class = Woody

From table 1.1 we see that UDT-CDF builds more accurate decision trees than UDT [1]. The Japanese Vowel dataset used for the experiment contains two sets: a training dataset (ae.train.dat) and a testing dataset (ae.test.dat). The training set has 270 blocks and the test set has 370 blocks. The total number of speakers is 9. The total memory used by UDT-CDF is 1.275 KB, while UDT-PDF takes 1.876 KB for the same training set. We see that UDT-CDF takes less memory not only for the training set but also for the test set. The total execution time of both algorithms is given in the table, where UDT-CDF takes less CPU time.

VI. Conclusion
We have implemented a new algorithm, UDT-CDF, for the classification of uncertain numerical data sets. Our algorithm gives a novel result in terms of memory, execution time and accuracy. It is able to work on data sets with both discrete and continuous random variables.

References
[1] S. Tsang, B. Kao, K. Y. Yip, W-S. Ho and S-D. Lee; (2009), Decision Trees for Uncertain Data, in IEEE.
[2] B. Qin, Y. Xia and F. Li; (2009), DTU: A decision tree for classifying uncertain data, in PAKDD, pp. 4-15.
[3] C. Liang, Y. Zhang and Q. Song; (2010), Decision Trees for Dynamic and Uncertain Data Streams, JMLR: Workshop and Conference Proceedings 13:209-224.
[4] B. Qin, Y. Xia, S. Prabhakar and Y. Tu; (2009), A rule-based classification algorithm for uncertain data, in Proc. IEEE Workshop on Management and Mining of Uncertain Data (MOUND).
[5] C. L. Tsien, I. S. Kohane and N. McIntosh, Multiple signal integration by decision tree induction to detect artifacts in the neonatal intensive care unit, Artificial Intelligence in Medicine, vol. 19, no. 3, pp. 189-202, 2000.
[6] J. Gama, P. Medas and P. Rodrigues, Learning decision trees from dynamic data streams, Journal of Universal Computer Science, 11(8):1353-1366, 2005.
[7] J. Gama, R. Fernandes and R. Rocha, Decision trees for mining data streams, Intell. Data Anal., 1:23-45, 2006.
[8] J. R. Quinlan, Induction of decision trees, Machine Learning, vol. 1, no. 1, pp. 81-106, 1986.
[9] R. Agrawal, T. Imielinski and A. N. Swami, Database mining: A performance perspective, IEEE Trans. Knowl. Data Eng., vol. 5, no. 6, pp. 914-925, 1993.
[10] M. Chau, R. Cheng, B. Kao and J. Ng, Uncertain data mining: An example in clustering location data, in PAKDD, ser. Lecture Notes in Computer Science, vol. 3918. Singapore: Springer, 9-12 Apr. 2006, pp. 199-204.
[11] A. Asuncion and D. Newman, UCI Machine Learning Repository, 2007. [Online].
[12] C. K. Chui, B. Kao and E. Hung, Mining frequent itemsets from uncertain data, in PAKDD, ser. Lecture Notes in Computer Science, vol. 4426. Nanjing, China: Springer, 22-25 May 2007, pp. 47-58.
[13] C. Z. Janikow, Fuzzy decision trees: Issues and methods, IEEE Transactions on Systems, Man, and Cybernetics, Part B, vol. 28, no. 1, pp. 1-14, 1998.
[14] C. Olaru and L. Wehenkel, A complete fuzzy decision tree technique, Fuzzy Sets and Systems, vol. 138, no. 2, pp. 221-254, 2003.