This document compares two approaches to handling incomplete data and generating decision rules: (1) rough set theory, which fills in missing values and performs attribute reduction, and (2) random tree classification in data mining, which simply ignores missing values. A heart disease dataset containing missing values is used to test the two approaches in ROSE2 and WEKA. The results show that random tree classification, ignoring missing values, produces more accurate decision rules than rough set theory with the missing values filled in.
Engineering Research Publication
Best International Journals, High Impact Journals,
International Journal of Engineering & Technical Research
ISSN : 2321-0869 (O) 2454-4698 (P)
www.erpublication.org
COMPARISION OF PERCENTAGE ERROR BY USING IMPUTATION METHOD ON MID TERM EXAMIN... (ijiert bestjournal)
The issue of incomplete data exists across the entire field of data mining. In this paper, mean imputation, median imputation and standard deviation imputation are used to deal with the challenges that incomplete data pose for classification problems. Applying the different imputation methods converts an incomplete dataset into a complete one. The percentage error of each imputation method is then computed on the completed dataset and the results are compared.
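The three imputation strategies named above can be sketched as follows. This is an illustrative reconstruction, not the paper's code; in particular, "standard deviation imputation" has several variants, and filling with the mean plus one standard deviation is assumed here.

```python
# Minimal sketch of mean, median and standard-deviation imputation
# for one numeric attribute, with missing values encoded as None.
import statistics

def impute(values, method="mean"):
    """Replace None entries using the chosen imputation method."""
    observed = [v for v in values if v is not None]
    if method == "mean":
        fill = statistics.mean(observed)
    elif method == "median":
        fill = statistics.median(observed)
    elif method == "stdev":
        # assumed variant: mean shifted by one standard deviation
        fill = statistics.mean(observed) + statistics.stdev(observed)
    else:
        raise ValueError(f"unknown method: {method}")
    return [fill if v is None else v for v in values]

marks = [35, None, 42, 50, None, 38]
print(impute(marks, "mean"))    # missing entries become 41.25
```

Running each method on the same incomplete attribute and comparing the resulting errors against known values is the kind of comparison the abstract describes.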
Comparative study of various supervised classification methods for analysing def... (eSAT Publishing House)
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together scientists, academicians, field engineers, scholars and students of related fields of Engineering and Technology.
J48 and JRIP Rules for E-Governance Data (CSCJournals)
Data are any facts, numbers, or text that can be processed by a computer. Data mining is an analytic process designed to explore data, usually large amounts of it, and is often described as a blend of statistics and related disciplines. In this paper we use two data mining techniques for discovering classification rules and generating a decision tree: J48 and JRIP. The data mining tool WEKA is used throughout.
Influence over the Dimensionality Reduction and Clustering for Air Quality Me... (IJAEMSJORNAL)
The current trend in industry is to analyze large data sets and apply data mining and machine learning techniques to identify patterns. The challenge with huge data sets is their high dimensionality: in some data analytics applications, large amounts of data produce worse performance, and because most data mining algorithms operate column-wise, too many columns slow them down. Dimensionality reduction is therefore an important step in data analysis. It converts high-dimensional data into a much lower-dimensional representation such that maximum variance is explained within the first few dimensions. This paper focuses on multivariate statistical and artificial neural network techniques for data reduction; each method has a different rationale for preserving the relationships between input parameters during analysis. Principal Component Analysis, a multivariate technique, and the Self-Organising Map, a neural network technique, are presented. A hierarchical clustering approach is then applied to the reduced data set. A case study of air quality measurement is used to evaluate the performance of the proposed techniques.
ON FEATURE SELECTION ALGORITHMS AND FEATURE SELECTION STABILITY MEASURES: A C... (ijcsit)
Data mining is indispensable for business organizations: it extracts useful information from the huge volume of stored data, which can be used in managerial decision making to survive the competition. Due to day-to-day advancements in information and communication technology, the data collected from e-commerce and e-governance are mostly high dimensional, and data mining algorithms generally handle small datasets better than high-dimensional ones. Feature selection is an important dimensionality reduction technique. The subsets selected in subsequent iterations of feature selection should be the same, or similar, even under small perturbations of the dataset; this property is called selection stability, and it has recently become an important topic in the research community. Selection stability has been quantified by various measures. This paper analyses the choice of a suitable search method and stability measure for feature selection algorithms, and the influence of dataset characteristics, since the best approach is highly problem dependent.
Automatic Feature Subset Selection using Genetic Algorithm for Clustering (idescitation)
Feature subset selection is the process of selecting a minimal subset of relevant features, and is a preprocessing technique for a wide variety of applications. High-dimensional data clustering is a challenging task in data mining. A reduced set of features helps make the patterns easier to understand, and is more significant when it is application specific. Almost all existing feature subset selection algorithms are neither automatic nor application specific. This paper attempts to find the feature subset that yields optimal clusters during clustering. The proposed Automatic Feature Subset Selection using Genetic Algorithm (AFSGA) identifies the required features automatically and reduces the computational cost of determining good clusters. The performance of AFSGA is tested using public and synthetic datasets of varying dimensionality. Experimental results show the improved efficacy of the algorithm in terms of both cluster quality and computational cost.
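A genetic algorithm over feature subsets, in the spirit of (but not reproducing) AFSGA, can be sketched as follows. The fitness function here is a hypothetical stand-in that rewards high-variance features and penalises subset size; a real implementation would score a cluster validity index instead, and the data are synthetic.

```python
# Toy GA for feature subset selection: individuals are bitmasks over
# the features; selection, one-point crossover and bit-flip mutation.
import random
import statistics

random.seed(0)

# synthetic data: features 0 and 2 carry most of the variance
DATA = [[random.gauss(0, s) for s in (5, 0.1, 3, 0.1)] for _ in range(30)]
N_FEATURES = 4

def fitness(mask):
    """Stand-in objective: total variance of chosen features minus a size penalty."""
    chosen = [i for i, bit in enumerate(mask) if bit]
    if not chosen:
        return 0.0
    var = sum(statistics.pvariance([row[i] for row in DATA]) for i in chosen)
    return var - 0.5 * len(chosen)

def evolve(pop_size=20, generations=30):
    pop = [[random.randint(0, 1) for _ in range(N_FEATURES)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]            # keep the fitter half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, N_FEATURES)
            child = a[:cut] + b[cut:]             # one-point crossover
            if random.random() < 0.1:             # bit-flip mutation
                j = random.randrange(N_FEATURES)
                child[j] ^= 1
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
print(best)   # best bitmask found; high-variance features tend to be kept
```

The size penalty is what makes the selection "automatic": the GA trades off subset quality against subset size without a user-specified feature count.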
Analysis of health condition is a very challenging task for every human being, because life is directly related to health. Data mining based classification is one of the important applications for classifying such data. In this research work, we use various classification techniques for classifying thyroid data; CART gives the highest accuracy, 99.47%, and is taken as the best model. Feature selection plays a very important role in making a model computationally efficient and in increasing its performance. This research work focuses on the Info Gain and Gain Ratio feature selection techniques to remove irrelevant features from the original data set and improve the model's computational performance. We apply both feature selection techniques to the best model, i.e. CART. Our proposed CART-Info Gain and CART-Gain Ratio give 99.47% and 99.20% accuracy with 25 and 3 features, respectively.
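The two ranking criteria named above can be illustrated on a toy table. This is a generic sketch of the definitions, not the paper's code; the attribute names (`tsh`, `ill`) are invented for the example.

```python
# Information gain and gain ratio for one candidate attribute.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, target):
    """Entropy of the target minus its expected entropy after splitting on attr."""
    base = entropy([r[target] for r in rows])
    n = len(rows)
    split = 0.0
    for value in {r[attr] for r in rows}:
        part = [r[target] for r in rows if r[attr] == value]
        split += len(part) / n * entropy(part)
    return base - split

def gain_ratio(rows, attr, target):
    iv = entropy([r[attr] for r in rows])   # intrinsic value of the split
    return info_gain(rows, attr, target) / iv if iv else 0.0

rows = [
    {"tsh": "high", "ill": "yes"},
    {"tsh": "high", "ill": "yes"},
    {"tsh": "low",  "ill": "no"},
    {"tsh": "low",  "ill": "no"},
]
print(info_gain(rows, "tsh", "ill"))    # 1.0: the split is perfectly informative
print(gain_ratio(rows, "tsh", "ill"))   # 1.0
```

Ranking attributes by either score and keeping the top ones is the filtering step the abstract applies before training CART.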
In the present day, a huge amount of data is generated every minute and transferred frequently. Although the data is sometimes static, most commonly it is dynamic and transactional, with newly generated data constantly added to the existing data. To discover knowledge from such incremental data, one approach is to rerun the algorithm on the modified data sets each time, which is time consuming. To analyze the datasets properly, an efficient classifier model must also be constructed, whose objective is to classify unlabeled data into appropriate classes. This paper proposes a dimension reduction algorithm that can be applied in a dynamic environment to generate a reduced attribute set as a dynamic reduct, together with an optimization algorithm that uses the reduct to build the corresponding classification system. The method analyzes new data as it becomes available and modifies the reduct to fit the entire dataset, from which interesting optimal classification rule sets are generated. The concepts of discernibility relation, attribute dependency and attribute significance from Rough Set Theory are integrated to generate the dynamic reduct set, and optimal classification rules are selected using the PSO method, which not only reduces complexity but also helps achieve higher accuracy of the decision system. The proposed method has been applied to benchmark datasets from the UCI repository; dynamic reducts are computed, and optimal classification rules are generated from them. Experimental results show the efficiency of the proposed method.
Feature selection is one of the most fundamental steps in machine learning and is closely related to dimensionality reduction. A commonly used approach is to rank the individual features according to some criterion and then search for an optimal feature subset based on an evaluation criterion that tests optimality. The objective of this work is to predict more accurately the presence of Learning Disability (LD) in school-aged children using a reduced number of symptoms. For this purpose, a novel hybrid feature selection approach is proposed that integrates a popular rough-set-based feature ranking process with a modified backward feature elimination algorithm. The ranking process calculates the significance, or priority, of each symptom of LD according to its contribution in representing the knowledge contained in the dataset; each symptom's significance value reflects its relative importance for predicting LD across the various cases. Then, by eliminating the least significant features one by one and evaluating the feature subset at each stage, an optimal feature subset is generated. For comparative analysis, and to establish the importance of rough set theory in feature selection, the backward feature elimination algorithm is also combined with two state-of-the-art filter-based feature ranking techniques, viz. information gain and gain ratio. The experimental results show that the proposed feature selection approach outperforms the other two in terms of data reduction. The proposed method also eliminates all redundant attributes from the LD dataset efficiently, without sacrificing classification performance.
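The elimination loop described above can be sketched with a rough-set-style attribute dependency measure. This is an illustrative reconstruction under assumed details, not the authors' implementation; the toy decision table and attribute names are invented.

```python
# Backward feature elimination driven by rough-set attribute dependency:
# drop the least significant attribute while the decision stays
# fully determined by the remaining attributes.
def dependency(rows, attrs, decision):
    """Fraction of rows whose values on `attrs` determine the decision uniquely."""
    groups = {}
    for r in rows:
        groups.setdefault(tuple(r[a] for a in attrs), set()).add(r[decision])
    consistent = sum(1 for r in rows
                     if len(groups[tuple(r[a] for a in attrs)]) == 1)
    return consistent / len(rows)

def backward_eliminate(rows, attrs, decision):
    attrs = list(attrs)
    changed = True
    while changed and len(attrs) > 1:
        changed = False
        # significance = drop in dependency when the attribute is removed
        sig = {a: dependency(rows, attrs, decision)
                  - dependency(rows, [x for x in attrs if x != a], decision)
               for a in attrs}
        weakest = min(sig, key=sig.get)
        if sig[weakest] == 0:      # removable without losing consistency
            attrs.remove(weakest)
            changed = True
    return attrs

rows = [
    {"a": 0, "b": 0, "c": 0, "d": "no"},
    {"a": 0, "b": 1, "c": 1, "d": "yes"},
    {"a": 1, "b": 0, "c": 1, "d": "yes"},
    {"a": 1, "b": 1, "c": 0, "d": "no"},
]
print(backward_eliminate(rows, ["a", "b", "c"], "d"))   # ['c'] suffices here
```

In the paper's setting the evaluation at each stage is a classifier score rather than raw dependency, but the stopping logic is the same: eliminate only while the criterion does not degrade.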
Re-Mining Association Mining Results Through Visualization, Data Envelopment ... (ertekg)
Download link > https://ertekprojects.com/gurdal-ertek-publications/blog/re-mining-association-mining-results-through-visualization-data-envelopment-analysis-and-decision-trees/
Re-mining is a general framework which suggests the execution of additional data mining steps based on the results of an original data mining process. This study investigates the multi-faceted re-mining of association mining results, develops and presents a practical methodology, and shows the applicability of the developed methodology through real world data. The methodology suggests re-mining using data visualization, data envelopment analysis, and decision trees. Six hypotheses, regarding how re-mining can be carried out on association mining results, are answered in the case study through empirical analysis.
A Survey on Decision Tree Learning Algorithms for Knowledge Discovery (IJERA Editor)
Immense volumes of data are populated into repositories from various applications. Data mining techniques are very helpful for finding the desired information and knowledge in such large datasets. Classification is one of the knowledge discovery techniques, and within classification, decision trees are very popular in the research community due to their simplicity and easy comprehensibility. This paper presents an updated review of recent developments in the field of decision trees.
The D-basis Algorithm for Association Rules of High Confidence (ITIIIndustries)
We develop a new approach for distributed computing of the association rules of high confidence on the attributes/columns of a binary table. It is derived from the D-basis algorithm developed by K. Adaricheva and J. B. Nation (Theoretical Computer Science, 2017), and runs multiple times on sub-tables of a given binary table obtained by removing one or more rows. The sets of rules retrieved in these runs are then aggregated. This allows us to obtain a basis of association rules of high confidence, which can be used for ranking all attributes of the table with respect to a given fixed attribute. This paper focuses on some algorithmic details and the technical implementation of the new algorithm. Results are given for tests performed on random, synthetic and real data.
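The quantities the ranking above is built on, support and confidence of a rule over a binary table, can be illustrated as follows. This is a minimal sketch of the standard definitions, not the D-basis algorithm itself, and the table is invented.

```python
# Support and confidence of an association rule X -> y on a binary table,
# where rows are transactions and columns are binary attributes.
def support(table, cols):
    """Fraction of rows in which every attribute in `cols` is 1."""
    return sum(all(row[c] for c in cols) for row in table) / len(table)

def confidence(table, antecedent, consequent):
    """P(consequent = 1 | all antecedent attributes = 1)."""
    num = support(table, antecedent + [consequent])
    den = support(table, antecedent)
    return num / den if den else 0.0

table = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
]
print(confidence(table, [0], 1))    # attr0 -> attr1 holds in 2 of 3 rows
```

Removing rows and re-running, as the abstract describes, changes these fractions, which is why the per-sub-table rule sets must be aggregated afterwards.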
K-Medoids Clustering Using Partitioning Around Medoids for Performing Face Re... (ijscmcj)
Face recognition is one of the most unobtrusive biometric techniques that can be used for access control as well as surveillance purposes. Various methods for implementing face recognition have been proposed with varying degrees of performance in different scenarios. The most common issue with effective facial biometric systems is high susceptibility of variations in the face owing to different factors like changes in pose, varying illumination, different expression, presence of outliers, noise etc. This paper explores a novel technique for face recognition by performing classification of the face images using unsupervised learning approach through K-Medoids clustering. Partitioning Around Medoids algorithm (PAM) has been used for performing K-Medoids clustering of the data. The results are suggestive of increased robustness to noise and outliers in comparison to other clustering methods. Therefore the technique can also be used to increase the overall robustness of a face recognition system and thereby increase its invariance and make it a reliably usable biometric modality.
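K-Medoids clustering can be sketched as below. For clarity this uses an exhaustive search over candidate medoid sets on small 2-D points rather than the BUILD/SWAP phases of full PAM or actual face-image features, so it is illustrative only.

```python
# K-Medoids clustering, exhaustive variant: pick the k data points
# that minimise the total distance of all points to their nearest medoid.
import itertools

def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def total_cost(points, medoids):
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam(points, k):
    best = min(itertools.combinations(points, k),
               key=lambda ms: total_cost(points, ms))
    labels = [min(range(k), key=lambda i: dist(p, best[i])) for p in points]
    return list(best), labels

# two dense groups plus one outlier
points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9), (30, 30)]
medoids, labels = pam(points, 2)
print(medoids)   # both medoids come from the dense groups
```

Because medoids must be actual data points, a single outlier cannot drag a cluster centre away the way it drags a K-Means centroid; this is the robustness property the abstract appeals to.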
The International Journal of Engineering and Science (The IJES) (theijes)
The International Journal of Engineering & Science is aimed at providing a platform for researchers, engineers, scientists, or educators to publish their original research results, to exchange new ideas, to disseminate information in innovative designs, engineering experiences and technological skills. It is also the Journal's objective to promote engineering and technology education. All papers submitted to the Journal will be blind peer-reviewed. Only original articles will be published.
We conducted a comparative analysis of different supervised dimension reduction techniques by integrating a set of data splitting algorithms, and demonstrate how the relative efficacy of the learning algorithms depends on sample complexity. The issue of sample complexity is discussed in terms of its dependence on the data splitting algorithms. In line with expectations, every supervised learning classifier demonstrated different capability under different data splitting algorithms, and no direct way to compute an overall ranking of the techniques was available. We therefore focused on how classifier ranking depends on the data splitting algorithm, and devised a model built on weighted average rank, the Weighted Mean Rank Risk Adjusted Model (WMRRAM), for consensus ranking of the learning classifier algorithms.
Some Imputation Methods to Treat Missing Values in Knowledge Discovery in Dat... (Waqas Tariq)
One major problem in the data cleaning and data reduction step of the KDD process is the presence of missing values in attributes. Many analysis tasks have to deal with missing values, and several treatments have been developed to estimate them. One of the most common methods of replacing missing values is mean imputation. In this paper we suggest a new imputation method that combines factor-type and compromised imputation using a two-phase sampling scheme, and we use it to impute the missing values of a target attribute in a data warehouse. Our simulation study shows that the estimator of the mean from this method is more efficient than the others.
Gabe and Devon present Regression Analysis using the Google Prediction API and NumPy. These slides are the Google Prediction API portion.
For the rest see https://docs.google.com/presentation/d/1Wtivp7IfUOBxr3wWN0lcw97SQiFkWMLBqgQf_bXgJ0c/edit#slide=id.p10
So you want to predict the future? Oh, just some sentiment analysis, spam detection, stock market predictions? In that case the Google Prediction API is for you. It handles both classification and regression problems. This API is a great tool for any software developer and is easily accessible to anyone who is good with spreadsheets.
These slides discuss extending the concept of correlation and show how it can be used in prediction. The statistical technique used is called regression: the process of using one variable to predict another when the two are correlated.
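The step from correlation to prediction can be shown with a small worked example: the least-squares line for predicting y from x has slope r·(s_y/s_x) and passes through the point of means. The data values below are invented for illustration.

```python
# Simple linear regression derived from the Pearson correlation coefficient.
import statistics

def pearson(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (len(xs) - 1)
    return cov / (statistics.stdev(xs) * statistics.stdev(ys))

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson(x, y)
slope = r * statistics.stdev(y) / statistics.stdev(x)   # regression slope
intercept = statistics.mean(y) - slope * statistics.mean(x)

def predict(v):
    return intercept + slope * v

print(round(predict(6), 2))   # about 11.99
```

When r is near ±1, as here, the prediction is tight; as r shrinks toward 0, the same formula still gives a line, but the predictions regress toward the mean of y.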
IOSR Journal of Electronics and Communication Engineering(IOSR-JECE) is an open access international journal that provides rapid publication (within a month) of articles in all areas of electronics and communication engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in electronics and communication engineering. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
IOSR Journal of Computer Engineering (IOSR-JCE)
e-ISSN: 2278-0661, p-ISSN: 2278-8727, Volume 9, Issue 3 (Mar.-Apr. 2013), PP 06-10
www.iosrjournals.org
A Comparative Study on Decision Rule Induction for incomplete
data using Rough Set and Random Tree Approaches
M. Sandhya 1, Dr. A. Kangaiammal 2, Dr. C. Senthamarai 3
1 (M.Phil. Scholar of Computer Science, Govt. Arts College (Autonomous), Salem-7, Periyar University, INDIA)
2,3 (Assistant Professor of Computer Applications, Govt. Arts College (Autonomous), Salem-7, Periyar University, INDIA)
Abstract: Handling missing attribute values is one of the most challenging tasks in data analysis, and many approaches can be adopted to deal with them. In this paper, a comparative analysis of an incomplete dataset for future prediction is made using the rough set approach and random tree generation in data mining. The result of a simple classification technique (using the random tree classifier) is compared with the result of rough set attribute reduction based on rule induction and decision trees. WEKA (Waikato Environment for Knowledge Analysis), a data mining tool, and ROSE2 (Rough Set Data Explorer), a rough set tool, have been used for the experiment. The results show that the random tree classification algorithm gives promising results with utmost accuracy and produces the best decision rules, via its decision tree, for the original incomplete data, i.e. with the missing attribute values simply ignored; in the rough set approach, by contrast, the missing attribute values are filled with the most common value of the attribute domain. This paper concludes that simply ignoring missing data yields better decisions than filling in values in place of the missing attribute values.
Keywords: Random Tree, WEKA, ROSE2, Missing attribute, Incomplete dataset, Classification, Rule Induction, Decision Tree.
I. Introduction
Data mining is the process of extracting useful information and discovering knowledge patterns that may be used for decision making [7]. Data mining techniques include association rules, clustering, classification and prediction, neural networks, decision trees, etc. Applications of data mining techniques are concerned with developing methods that discover knowledge from data, which are then used to uncover hidden or unknown information that is not apparent but potentially useful [5]. Classification and clustering are important techniques in data mining: classification groups data based on a classifier model, while clustering groups data based on distance or similarity.
Rough set theory was introduced by Zdzisław Pawlak in the early 1980s. The main aim of rough set analysis is to find approximations of concepts from the existing data. In order to deal with vagueness in the data, rough set theory replaces every vague concept with two crisp concepts called the lower and upper approximations. The lower approximation consists of those objects which surely belong to the set, and the upper approximation consists of those objects which possibly belong to the set [13]. Rough set theory is basically used for finding:
a) Hidden patterns in data
b) Significance of attributes
c) Reduced subsets of data
d) Dependency of attributes, and so on.
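As a concrete illustration of the two approximations, they can be computed directly from the indiscernibility classes of a decision table. The following is a minimal Python sketch; the four-object table, the attribute names and the target set are made up for illustration and are not taken from the paper's experiment:

```python
def partition(universe, table, attrs):
    """Group objects that are indiscernible on the chosen attributes."""
    blocks = {}
    for x in universe:
        key = tuple(table[x][a] for a in attrs)
        blocks.setdefault(key, set()).add(x)
    return list(blocks.values())

def approximations(universe, table, attrs, target):
    """Return the (lower, upper) approximations of the target set."""
    lower, upper = set(), set()
    for block in partition(universe, table, attrs):
        if block <= target:   # every object in the block is surely in the concept
            lower |= block
        if block & target:    # some object in the block may be in the concept
            upper |= block
    return lower, upper

# Hypothetical four-object table with two condition attributes.
table = {
    1: {"bp": "high",   "chol": "high"},
    2: {"bp": "high",   "chol": "high"},
    3: {"bp": "normal", "chol": "low"},
    4: {"bp": "high",   "chol": "low"},
}
target = {1, 4}  # objects known to exhibit the concept (say, heart problem = yes)
lower, upper = approximations({1, 2, 3, 4}, table, ["bp", "chol"], target)
print(lower, upper)  # {4} {1, 2, 4}
```

Objects 1 and 2 are indiscernible on the chosen attributes but only object 1 is in the target, so neither enters the lower approximation while both fall inside the upper one.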
There are various reasons why datasets are affected by missing attribute values. Sometimes values considered irrelevant are simply not recorded in the data, as noted in [4]. Another reason is that values were forgotten when filling in the table, or were mistakenly erased from it. There are several approaches to handling missing attribute values. The authors of [4] found that filling a missing attribute with its most common value is the worst method among all the approaches; of the nine approaches they discussed, the two best methods are C4.5 and ignoring the missing attribute values.
The paper is organized as follows: Section 2 describes the related research work. Section 3 states the problem statement, and Section 4 describes the proposed method, covering both filling the incomplete dataset with the most common attribute value using the rough set approach to generate rules by induction, and generating rules from the unfilled incomplete dataset using the Random Tree classification algorithm in data mining. Experimental results and performance evaluation are presented in Section 5. Finally, Section 6 concludes the work and points out some prospective future work.
II. Related Work
Nine different approaches for handling missing attribute values are discussed in [4]. Ten input data files were used to apply and test the nine approaches and investigate their performance in handling missing attribute values. The quality criterion was the error rate under ten-fold cross validation, averaged over the runs. Based on the Wilcoxon matched-pairs signed rank test, the authors concluded that the two best methods for handling missing attribute values are C4.5 and ignoring the missing attribute values [4].
The scholars of [6] describe an ISOM-DH (Independent Self-Organizing Maps) model that has been proposed for handling incomplete data in data mining. Compared with the Mixture of Principal Component Analyzers (MPCA), the mean method, and the standard SOM-based fuzzy map model, the ISOM-DH model can be applied to more cases.
The work in [9] uses attribute-value pair blocks. These blocks are used to construct characteristic sets, characteristic relations, and lower and upper approximations for decision tables with missing attribute values. The authors of [9] conclude that the classification error rate is smaller when missing attribute values are treated as lost.
In [5], characteristic relations are introduced to describe incompletely specified decision tables. For completely specified decision tables, any characteristic relation reduces to an indiscernibility relation. The basic rough set idea of lower and upper approximations for incompletely specified decision tables may be defined in a variety of different ways.
The work in [16] makes a comparative analysis of a data mining classification technique and an integration of clustering and classification that helps in analysing large datasets. The integrated clustering-and-classification technique gives more accurate results than the simple classification technique, and is also useful for developing rules when the dataset contains missing values. This integrated technique gives promising classification results with an utmost accuracy rate.
III. Problem Statement
The problem here is to identify the best method of dealing with missing attributes when decision making is important. This has been accomplished by comparing the rules generated after filling the incomplete dataset with the most common attribute value using the rough set approach against the rules induced, without filling the incomplete dataset, by Random Tree classification in data mining. A heart-problem dataset with 3 condition attributes and 1 decision attribute, containing missing values, has been considered for the comparative study.
IV. Proposed Method
Rough sets deal with vagueness and uncertainty in data. In rough set theory, incomplete datasets are described by characteristic relations, while complete decision tables are described by indiscernibility relations. Classification is an important technique in data mining; it groups data based on a classifier model. Using the Random Tree classification algorithm, decisions are taken directly on the incomplete dataset, and taking the decision on the original table is the best method. Fig. 1 shows the general framework of the comparative analysis of the two approaches for finding the better rule induction for an incomplete dataset. Fig. 2 shows the block diagram of the steps of evaluation and comparison. Table 1 shows the incomplete dataset.
In this experiment, the decision attribute corresponds to heart problem and the condition attributes correspond to blood pressure, chest pain and cholesterol. The two approaches are applied to this table to generate rules in the ROSE2 and WEKA tools: the classification technique in WEKA and attribute reduction in ROSE2.
In rough sets, rule induction is based on the consistency or inconsistency of the table. Using a reduct, the attribute set is reduced and the consistency between the condition attributes and the decision attribute is determined. After reducing the attributes, the consistent part of the table is chosen for rule induction. Before the consistency or inconsistency of the incomplete decision table can be determined, the table must be converted into a complete decision table. The rules are then generated from the consistent part of the table after removing an attribute.
In classification, the decision rules are generated using a decision tree. A Random Tree considers a set of K randomly chosen attributes to split on at each node; it builds its tree by selecting among the possible trees uniformly at random.
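The splitting scheme just described can be sketched in a few lines of Python. This is a simplified stand-in for WEKA's RandomTree, not its actual implementation: at each node it draws K candidate attributes at random and splits on the one with the lowest weighted Gini impurity. The rows below are a hypothetical completed version of the heart-problem table (the attribute keys are shortened column names):

```python
import random
from collections import Counter

def gini(rows):
    """Gini impurity of the decision attribute over a set of rows."""
    counts = Counter(r["heart"] for r in rows)
    return 1.0 - sum((c / len(rows)) ** 2 for c in counts.values())

def split(rows, attr):
    """Multiway split: one branch per observed value of the attribute."""
    branches = {}
    for r in rows:
        branches.setdefault(r[attr], []).append(r)
    return branches

def build(rows, attrs, k=2):
    """Grow a tree, splitting on the best of K random attributes at each node."""
    if len({r["heart"] for r in rows}) == 1 or not attrs:
        return Counter(r["heart"] for r in rows).most_common(1)[0][0]
    candidates = random.sample(attrs, min(k, len(attrs)))
    score = lambda a: sum(len(b) / len(rows) * gini(b)
                          for b in split(rows, a).values())
    best = min(candidates, key=score)
    rest = [a for a in attrs if a != best]
    return (best, {v: build(b, rest, k) for v, b in split(rows, best).items()})

def predict(tree, row):
    """Follow the branches of a grown tree down to a leaf label."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[row[attr]]
    return tree

# Hypothetical completed heart-problem rows (missing values already filled).
rows = [
    {"bp": "High",   "pain": "Yes", "chol": "High", "heart": "Yes"},
    {"bp": "High",   "pain": "Yes", "chol": "High", "heart": "Yes"},
    {"bp": "High",   "pain": "No",  "chol": "High", "heart": "No"},
    {"bp": "High",   "pain": "Yes", "chol": "High", "heart": "Yes"},
    {"bp": "High",   "pain": "Yes", "chol": "Low",  "heart": "No"},
    {"bp": "Normal", "pain": "No",  "chol": "High", "heart": "No"},
]
tree = build(rows, ["bp", "pain", "chol"], k=2)
print([predict(tree, r) for r in rows])  # reproduces each row's "heart" value
```

Because this table is consistent on the full attribute set, the tree always reaches pure leaves, so it classifies the training rows perfectly regardless of which candidate attributes were drawn.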
The complete description of the incomplete dataset's attribute values is presented in Table 1. In the rough set approach, the incomplete decision table is first transformed into a complete table by filling each missing attribute value with the most common attribute value (i.e. the value of the attribute that occurs most often is selected as the value for all of that attribute's missing entries). In WEKA, the decision rules are generated with the missing attribute values left in place, using the Random Tree classification algorithm. The complete decision table is shown in Table 2.
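The most-common-value filling performed in ROSE2 can be reproduced with a short Python sketch. This is a simplified illustration of the strategy, not ROSE2's own code; the attribute keys are shortened names for the columns of Table 1:

```python
from collections import Counter

def fill_most_common(rows, missing="?"):
    """Replace each missing value with the most common value of its attribute."""
    attrs = rows[0].keys()
    filled = [dict(r) for r in rows]
    for a in attrs:
        observed = [r[a] for r in rows if r[a] != missing]
        mode = Counter(observed).most_common(1)[0][0]
        for r in filled:
            if r[a] == missing:
                r[a] = mode
    return filled

# Table 1 from the paper (condition attributes only).
incomplete = [
    {"bp": "High",   "pain": "?",   "chol": "High"},
    {"bp": "?",      "pain": "Yes", "chol": "?"},
    {"bp": "?",      "pain": "No",  "chol": "?"},
    {"bp": "High",   "pain": "?",   "chol": "High"},
    {"bp": "?",      "pain": "Yes", "chol": "Low"},
    {"bp": "Normal", "pain": "No",  "chol": "?"},
]
complete = fill_most_common(incomplete)
print(complete[1]["bp"], complete[1]["chol"])  # High High
```

Note that for Chest Pain the observed values are tied (two Yes, two No), so the "most common" value is ambiguous there; this kind of distortion is one reason the strategy can mislead prediction, as the paper's results suggest.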
Table 1: Incomplete Decision Table
Case  Blood Pressure  Chest Pain  Cholesterol  Heart Problem
1     High            ?           High         Yes
2     ?               Yes         ?            Yes
3     ?               No          ?            No
4     High            ?           High         Yes
5     ?               Yes         Low          No
6     Normal          No          ?            No
Figure 1: Architecture of the Proposed Work
Figure 2: Block diagram of Evaluation and Comparison
V. Experimental Results and Performance Evaluation
In this experiment, a comparative study is carried out between rough set attribute reduction and a data mining classification technique on an incomplete dataset: a heart-problem dataset containing 3 condition attributes and 1 decision attribute with missing values. For the rough set approach, the incomplete dataset is given as input to the ROSE2 tool, the missing attribute values are filled with the most common attribute value, and the attributes are then reduced based on the consistency or inconsistency of the table before rules are generated. For simple classification in data mining, the training dataset is given as input to the WEKA tool and the Random Tree classification algorithm is applied.
The result of the experiment shows that generating the rules from the original incomplete dataset produces better results than filling the attributes with the most common attribute value. Table 2 shows the complete decision table generated by ROSE2, and Figure 3 shows the decision tree and rules generated by Random Tree classification in WEKA.
Table 2: Complete Decision Table
The next step is finding the reduct and core of the complete information table and generating rules based on the reduct. A reduct is a minimal subset of attributes that preserves relevancy while removing redundancy: an attribute subset is relevant if it is predictive of the decision features, otherwise it is irrelevant, and an attribute is considered redundant if it is highly correlated with other features. The analysis shows that there is only one reduct of Table 2, namely
Reduct = {Chest Pain, Cholesterol}
The rules generated by ROSE2 after reducing the attributes are:
(Chest pain = yes) & (Cholesterol = high) => (Heart problem = yes)
(Chest pain = no) => (Heart problem = no)
(Cholesterol = low) => (Heart problem = no)
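The uniqueness of such a reduct can be checked by brute force: enumerate attribute subsets and keep the minimal ones under which indiscernible objects never disagree on the decision. The sketch below uses a hypothetical completion of Table 1, since the paper's Table 2 is not reproduced in the text; in particular, the missing Chest Pain values are assumed here to be "Yes":

```python
from itertools import combinations

def consistent(rows, attrs):
    """True if objects equal on attrs always share the same decision."""
    seen = {}
    for r in rows:
        key = tuple(r[a] for a in attrs)
        if seen.setdefault(key, r["heart"]) != r["heart"]:
            return False
    return True

def reducts(rows, attrs):
    """All minimal attribute subsets preserving the table's consistency."""
    found = []
    for k in range(1, len(attrs) + 1):
        for sub in combinations(attrs, k):
            if consistent(rows, sub) and not any(set(f) <= set(sub) for f in found):
                found.append(sub)
    return found

# Hypothetical completion of Table 1 (missing Chest Pain values taken as "Yes").
rows = [
    {"bp": "High",   "pain": "Yes", "chol": "High", "heart": "Yes"},
    {"bp": "High",   "pain": "Yes", "chol": "High", "heart": "Yes"},
    {"bp": "High",   "pain": "No",  "chol": "High", "heart": "No"},
    {"bp": "High",   "pain": "Yes", "chol": "High", "heart": "Yes"},
    {"bp": "High",   "pain": "Yes", "chol": "Low",  "heart": "No"},
    {"bp": "Normal", "pain": "No",  "chol": "High", "heart": "No"},
]
print(reducts(rows, ["bp", "pain", "chol"]))  # [('pain', 'chol')]
```

Under this assumed completion the only reduct is {Chest Pain, Cholesterol}, which agrees with the ROSE2 result reported for Table 2.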
In WEKA, Table 1 is given as input and the decision rules are generated without filling in the missing attribute values, i.e. from the original incomplete decision table. Using the Random Tree classification algorithm in WEKA, the decision tree and rules are generated; Figure 3 shows the decision tree and rules.
Figure 3: Decision Tree
Fig. 4 shows the rules framed by the Random Tree classification, as follows:
(Blood pressure = high) & (Cholesterol = high) & (Chest pain = yes) => (Heart problem = yes)
(Blood pressure = normal) => (Heart problem = no)
(Blood pressure = high) & (Cholesterol = low) => (Heart problem = no)
(Blood pressure = high) & (Cholesterol = high) & (Chest pain = no) => (Heart problem = yes)
Figure 4: Decision Rule
Observations and Analysis:
It is observed that Random Tree classification provides better rules than rough set attribute reduction.
Filling the incomplete decision table with the most common attribute value is the worst method.
The accuracy of the Random Tree classifier is high, i.e. 100% (Table 3), which is what is required.
VI. Conclusion and Future Work
A comparative study of data mining classification and rough set attribute reduction for generating decision rules from an incomplete dataset has been performed. The presented experiment shows that the Random Tree classification algorithm is the best method for rule generation from an incomplete decision table, because the Random Tree generates its rules with the missing attribute values left in place rather than filled with the most common attribute value. Filling the missing attribute values with the most common attribute value is the worst method for prediction. It is therefore observed that handling the incomplete decision table without filling in the missing attribute values is best for prediction. In future, the work can be extended to various other approaches for handling missing attribute values, so as to observe the change in the decision rules.