In the real world, everything is an object that belongs to particular classes. Every object can be fully described by its attributes. Any real-world dataset contains a large number of attributes and objects. Classifiers perform poorly when such huge datasets are given to them as input for classification. Therefore, the most useful attributes, those that contribute the most to the decision, need to be extracted from these huge datasets. In the paper, the attribute set is reduced by generating reducts using the indiscernibility relation of Rough Set Theory (RST). The method measures similarity among the attributes using a relative indiscernibility relation and computes an attribute similarity set. Then the set is minimized and an attribute similarity table is constructed, from which the attribute similar to the maximum number of attributes is selected so that the resultant minimum set of selected attributes (called a reduct) covers all attributes of the attribute similarity table. The method has been applied to the glass dataset collected from the UCI repository and the classification accuracy is calculated by various classifiers. The result shows the efficiency of the proposed method.
Reduct generation for the incremental data using rough set theorycsandit
In today's changing world, a huge amount of data is generated and transferred frequently. Although the data is sometimes static, most commonly it is dynamic and transactional. New data that is being generated is constantly added to the old/existing data. To discover knowledge from this incremental data, one approach is to run the algorithm repeatedly on the modified data sets, which is time consuming. The paper proposes a dimension reduction algorithm that can be applied in a dynamic environment for the generation of a reduced attribute set as a dynamic reduct. The method analyzes the new dataset, when it becomes available, and modifies the reduct accordingly to fit the entire dataset. The concepts of discernibility relation, attribute dependency and attribute significance of Rough Set Theory are integrated for the generation of the dynamic reduct set, which not only reduces the complexity but also helps to achieve higher accuracy of the decision system. The proposed method has been applied on a few benchmark datasets collected from the UCI repository and a dynamic reduct is computed. Experimental results show the efficiency of the proposed method.
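Attribute dependency, one of the rough set notions mentioned in this summary, measures how completely a set of conditional attributes determines the decision. The following minimal Python sketch of the dependency degree gamma_B(D) = |POS_B(D)| / |U| is only an illustration on a hypothetical toy table, not the paper's implementation.

```python
from collections import defaultdict

# Hypothetical decision table: conditional attributes a, b and decision d.
rows = [
    {"a": 1, "b": 0, "d": "yes"},
    {"a": 1, "b": 0, "d": "no"},   # indiscernible from row 0 but a different decision
    {"a": 0, "b": 1, "d": "no"},
    {"a": 0, "b": 0, "d": "yes"},
]

def dependency(rows, attrs, decision="d"):
    """gamma_B(D): fraction of objects whose equivalence class under `attrs`
    lies entirely inside a single decision class (the positive region)."""
    blocks = defaultdict(list)
    for i, r in enumerate(rows):
        blocks[tuple(r[a] for a in attrs)].append(i)
    positive = sum(len(block) for block in blocks.values()
                   if len({rows[i][decision] for i in block}) == 1)
    return positive / len(rows)

print(dependency(rows, ["a", "b"]))  # 0.5: rows 0 and 1 conflict, rows 2 and 3 are consistent
print(dependency(rows, ["b"]))       # 0.25: only row 2 remains in the positive region
```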
Application for Logical Expression Processing csandit
Processing of logical expressions, especially conversion from conjunctive normal form (CNF) to disjunctive normal form (DNF), is a very common problem in many aspects of information retrieval and processing. There are some existing solutions for logical symbolic calculations, but none of them offers CNF-to-DNF conversion functionality. A new application for this purpose is presented in this paper.
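As a sketch of the underlying conversion (not the application presented in that paper, whose internals are not described here): a CNF formula can be turned into DNF by picking one literal from every clause, forming a conjunction from each such choice, and discarding contradictory terms.

```python
from itertools import product

def cnf_to_dnf(cnf):
    """cnf: list of clauses, each clause a list of literal strings ('A' or '~A').
    Returns the DNF as a list of terms (sorted lists of literals).
    No absorption/minimization is attempted, so the result may be redundant."""
    terms = set()
    for choice in product(*cnf):          # one literal from every clause
        term = frozenset(choice)
        # Drop terms that contain both a literal and its negation.
        if any(lit.startswith("~") and lit[1:] in term for lit in term):
            continue
        terms.add(term)
    return [sorted(t) for t in terms]

# (A or B) and (~A or C)
print(cnf_to_dnf([["A", "B"], ["~A", "C"]]))
# e.g. [['A', 'C'], ['B', 'C'], ['B', '~A']] in some order
```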
ON FEATURE SELECTION ALGORITHMS AND FEATURE SELECTION STABILITY MEASURES: A C...ijcsit
Data mining is indispensable for business organizations for extracting useful information from the huge volume of stored data, which can be used in managerial decision making to survive in the competition. Due to the day-to-day advancements in information and communication technology, the data collected from e-commerce and e-governance are mostly high dimensional. Data mining prefers small datasets to high-dimensional datasets. Feature selection is an important dimensionality reduction technique. The subsets selected in subsequent iterations by feature selection should be the same or similar even in the case of small perturbations of the dataset; this property is called selection stability. It has recently become an important topic for the research community. Selection stability has been measured by various measures. This paper analyses the selection of a suitable search method and stability measure for feature selection algorithms, and also the influence of the characteristics of the dataset, as the choice of the best approach is highly problem dependent.
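A common way to quantify selection stability is to compare the feature subsets chosen on perturbed versions of the data, for instance with the average pairwise Jaccard similarity. The sketch below only illustrates that idea; it is not a measure taken from the paper.

```python
from itertools import combinations

def jaccard(a, b):
    """Similarity of two selected feature subsets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def selection_stability(subsets):
    """Average pairwise Jaccard similarity over the subsets selected on
    different perturbations (e.g. bootstrap samples) of the same data."""
    pairs = list(combinations(subsets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Subsets selected on three hypothetical bootstrap samples of one dataset.
runs = [{"f1", "f2", "f3"}, {"f1", "f2", "f4"}, {"f1", "f3", "f4"}]
print(round(selection_stability(runs), 3))   # 0.5: moderately stable selection
```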
The document discusses clustering documents using a multi-viewpoint similarity measure. It begins with an introduction to document clustering and common similarity measures like cosine similarity. It then proposes a new multi-viewpoint similarity measure that calculates similarity between documents based on multiple reference points, rather than just the origin. This allows a more accurate assessment of similarity. The document outlines an optimization algorithm used to cluster documents by maximizing the new similarity measure. It compares the new approach to existing document clustering methods and similarity measures.
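As an illustration of the multi-viewpoint idea (a sketch of one common formulation, not necessarily the exact measure used in that document): instead of measuring similarity of two documents only from the origin, their difference vectors are taken from many reference documents outside the cluster and the resulting dot products are averaged.

```python
import numpy as np

def multi_viewpoint_similarity(di, dj, viewpoints):
    """Average similarity of documents di and dj as seen from each reference
    document dh in `viewpoints`: mean over h of (di - dh) . (dj - dh)."""
    diffs_i = di - viewpoints          # one difference vector per viewpoint
    diffs_j = dj - viewpoints
    return float(np.mean(np.sum(diffs_i * diffs_j, axis=1)))

# Tiny hypothetical tf-idf-like document vectors.
di = np.array([0.9, 0.1, 0.0])
dj = np.array([0.8, 0.2, 0.1])
others = np.array([[0.0, 0.9, 0.4],   # documents assumed to lie outside the cluster
                   [0.1, 0.8, 0.5]])
print(round(multi_viewpoint_similarity(di, dj, others), 3))
```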
An enhanced fuzzy rough set based clustering algorithm for categorical dataeSAT Journals
Abstract: In today's world everything is done digitally, and so we have lots of raw data. These data are useful for predicting future events if we use them properly. Clustering is a technique where we put closely related data together. Furthermore, there are several types of data: sequential, interval, categorical, etc. In this paper we show what the problem is with clustering categorical data using rough sets and how we can overcome it with an improvement.
An enhanced fuzzy rough set based clustering algorithm for categorical dataeSAT Publishing House
This document summarizes a research paper that proposes an enhanced fuzzy rough set-based clustering algorithm for categorical data. The paper discusses problems with using traditional rough set theory to cluster categorical data when there are no crisp relations between attributes. It proposes using fuzzy logic to assign weights to attribute values and calculate lower approximations based on the similarity between sets, in order to cluster categorical data when crisp relations do not exist. The proposed method is described through an example comparing traditional rough set clustering to the new fuzzy rough set approach.
A Novel Algorithm for Design Tree Classification with PCAEditor Jacotech
This document summarizes a research paper titled "A Novel Algorithm for Design Tree Classification with PCA". It discusses dimensionality reduction techniques like principal component analysis (PCA) that can improve the efficiency of classification algorithms on high-dimensional data. PCA transforms data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate, called the first principal component. The paper proposes applying PCA and linear transformation on an original dataset before using a decision tree classification algorithm, in order to get better classification results.
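A minimal sketch of that pipeline idea, PCA followed by a decision tree, using scikit-learn and a built-in dataset as an assumed environment (this is an illustration, not the paper's own implementation):

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Project onto the leading principal components, then fit a decision tree.
model = make_pipeline(PCA(n_components=5), DecisionTreeClassifier(random_state=0))
model.fit(X_train, y_train)
print("accuracy with PCA + tree:", round(model.score(X_test, y_test), 3))
```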
In the present day, a huge amount of data is generated every minute and transferred frequently. Although the data is sometimes static, most commonly it is dynamic and transactional. New data that is being generated is constantly added to the old/existing data. To discover knowledge from this incremental data, one approach is to run the algorithm repeatedly on the modified data sets, which is time consuming. Moreover, to analyze the datasets properly, construction of an efficient classifier model is necessary. The objective of developing such a classifier is to classify unlabeled data into appropriate classes. The paper proposes a dimension reduction algorithm that can be applied in a dynamic environment for the generation of a reduced attribute set as a dynamic reduct, and an optimization algorithm which uses the reduct and builds up the corresponding classification system. The method analyzes the new dataset, when it becomes available, modifies the reduct accordingly to fit the entire dataset, and from the entire data set, interesting optimal classification rule sets are generated. The concepts of discernibility relation, attribute dependency and attribute significance of Rough Set Theory are integrated for the generation of the dynamic reduct set, and optimal classification rules are selected using the PSO method, which not only reduces the complexity but also helps to achieve higher accuracy of the decision system. The proposed method has been applied on some benchmark datasets collected from the UCI repository, a dynamic reduct is computed, and from the reduct optimal classification rules are also generated. Experimental results show the efficiency of the proposed method.
This document compares two approaches for handling incomplete data and generating decision rules: 1) Rough set theory, which fills missing values and performs attribute reduction, and 2) Random tree classification in data mining, which ignores missing values. It uses a heart disease dataset with missing values to test the approaches in ROSE2 and WEKA. The results show that random tree classification ignoring missing values produces more accurate decision rules than rough set theory filling missing values.
Dimensionality reduction by matrix factorization using concept lattice in dat...eSAT Journals
Abstract: The concept lattice is an important technique that has become standard in data analytics and knowledge presentation in many fields such as statistics, artificial intelligence, pattern recognition, machine learning, information theory, social networks, information retrieval systems and software engineering. Formal concepts are adopted as the primitive notion. A concept is jointly defined as a pair consisting of an intension and an extension. FCA can handle huge amounts of data and generates concepts, rules and data visualizations. Matrix factorization methods have recently received greater exposure, mainly as an unsupervised learning method for latent variable decomposition. In this paper a novel method is proposed to decompose such concepts by using Boolean Matrix Factorization for dimensionality reduction. This paper focuses on finding all the concepts and the object intersections. Keywords: data mining, formal concepts, lattice, matrix factorization, dimensionality reduction.
The pertinent single-attribute-based classifier for small datasets classific...IJECEIAES
Classifying a dataset using machine learning algorithms can be a big challenge when the target is a small dataset. The OneR classifier can be used for such cases due to its simplicity and efficiency. In this paper, we revealed the power of a single attribute by introducing the pertinent single-attribute-based-heterogeneity-ratio classifier (SAB-HR), which uses a pertinent attribute to classify small datasets. The SAB-HR uses a feature selection method based on the Heterogeneity-Ratio (H-Ratio) measure to identify the most homogeneous attribute among the other attributes in the set. Our empirical results on 12 benchmark datasets from the UCI machine learning repository showed that the SAB-HR classifier significantly outperformed the classical OneR classifier for small datasets. In addition, using the H-Ratio as a feature selection criterion for selecting the single attribute was more effectual than other traditional criteria, such as Information Gain (IG) and Gain Ratio (GR).
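For context, a classical OneR-style classifier (the baseline mentioned above) builds one rule set per attribute and keeps the attribute whose rules make the fewest training errors. The sketch below is a generic illustration of OneR on hypothetical data, not the SAB-HR method itself.

```python
from collections import Counter, defaultdict

def one_r(rows, attrs, decision):
    """Return (best_attribute, rules): rules maps each value of the chosen
    attribute to the majority decision class observed for that value."""
    best = None
    for a in attrs:
        by_value = defaultdict(Counter)
        for r in rows:
            by_value[r[a]][r[decision]] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(sum(c.values()) - c[rules[v]] for v, c in by_value.items())
        if best is None or errors < best[0]:
            best = (errors, a, rules)
    return best[1], best[2]

# Hypothetical categorical training data.
rows = [{"outlook": "sunny", "windy": "no",  "play": "yes"},
        {"outlook": "sunny", "windy": "yes", "play": "no"},
        {"outlook": "rain",  "windy": "no",  "play": "yes"},
        {"outlook": "rain",  "windy": "yes", "play": "no"}]
print(one_r(rows, ["outlook", "windy"], "play"))
# ('windy', {'no': 'yes', 'yes': 'no'}): the single attribute 'windy' classifies perfectly
```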
1. This document reviews literature on dimensionality reduction techniques, including feature selection and feature extraction.
2. It discusses seminal papers on feature selection algorithms like FOCUS, RELIEF, and LVF, as well as feature extraction methods including PCA.
3. The document also compares the filter and wrapper approaches to feature selection, and surveys influential work by researchers like Liu, Kohavi, and John on developing and evaluating feature selection methods.
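As a concrete reminder of how RELIEF-style ranking works (a simplified two-class sketch on my own toy data, not code from any of the surveyed papers): each feature's weight is decreased by its difference to the nearest same-class neighbour (hit) and increased by its difference to the nearest other-class neighbour (miss).

```python
import numpy as np

def relief(X, y, n_iter=None):
    """Simplified RELIEF for a two-class problem with numeric features
    scaled to [0, 1]; returns one relevance weight per feature."""
    rng = np.random.default_rng(0)
    n, d = X.shape
    w = np.zeros(d)
    for i in rng.choice(n, size=n_iter or n, replace=False):
        same = np.where(y == y[i])[0]
        diff = np.where(y != y[i])[0]
        same = same[same != i]
        hit = same[np.argmin(np.abs(X[same] - X[i]).sum(axis=1))]
        miss = diff[np.argmin(np.abs(X[diff] - X[i]).sum(axis=1))]
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w / n

# Feature 0 separates the two classes, feature 1 is noise.
X = np.array([[0.1, 0.5], [0.2, 0.9], [0.8, 0.4], [0.9, 0.8]])
y = np.array([0, 0, 1, 1])
print(relief(X, y))   # the weight of feature 0 clearly dominates
```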
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD Editor
This document summarizes and compares various multi-relational decision tree learning algorithms. It first introduces the MRDTL algorithm, which constructs decision trees from relational databases by using selection graphs to represent patterns at nodes. The document then discusses limitations of MRDTL related to running time and handling missing values. It proposes an improved MRDTL-2 algorithm that uses a naive Bayesian classifier for preprocessing to address missing values, and tuple ID propagation instead of complex data structures to improve running time. Finally, it describes the RDC algorithm, which uses tuple ID propagation and information gain to build decision trees from relational databases.
KNN and ARL Based Imputation to Estimate Missing Valuesijeei-iaes
Missing data are the absence of data items for a subject; they hide some information that may be important. In practice, missing data have been one major factor affecting data quality, so missing value imputation is needed. Methods such as hierarchical clustering and K-means clustering are not robust to missing data and may lose effectiveness even with a few missing values. Therefore, to improve the quality of the data, a method for missing value imputation is needed. In this paper, KNN- and ARL-based imputation are introduced to impute missing values, and the accuracy of both algorithms is measured using the normalized root mean square error. The result shows that ARL is the more accurate and robust method for missing value estimation.
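A minimal sketch of k-nearest-neighbour imputation with an NRMSE check, using scikit-learn as an assumed environment (the ARL-based imputation of the paper is not reproduced here):

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
X_true = rng.random((100, 4))           # hypothetical complete data
X_missing = X_true.copy()
mask = rng.random(X_true.shape) < 0.1   # knock out ~10% of the entries
X_missing[mask] = np.nan

X_filled = KNNImputer(n_neighbors=5).fit_transform(X_missing)

# Normalized RMSE over the artificially removed entries only.
nrmse = np.sqrt(np.mean((X_filled[mask] - X_true[mask]) ** 2)) / np.std(X_true[mask])
print("NRMSE of KNN imputation:", round(float(nrmse), 3))
```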
The document discusses knowledge representation systems (KRS) and how they are used to represent knowledge in a symbolic form for computer processing. It provides examples of KRS tables and defines key concepts such as attributes, objects, equivalence relations, and discernibility matrices. Discernibility matrices are used to analyze the significance of attributes and determine attribute reducts.
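The discernibility matrix mentioned above can be built directly from a decision table: cell (i, j) holds the conditional attributes on which objects i and j differ, usually recorded only for object pairs with different decisions. A small sketch on a hypothetical table:

```python
def discernibility_matrix(rows, attrs, decision):
    """Upper-triangular matrix: entry (i, j) is the set of attributes that
    discern objects i and j, recorded only when their decisions differ."""
    n = len(rows)
    matrix = {}
    for i in range(n):
        for j in range(i + 1, n):
            if rows[i][decision] != rows[j][decision]:
                matrix[(i, j)] = {a for a in attrs if rows[i][a] != rows[j][a]}
    return matrix

rows = [{"a": 1, "b": 0, "c": 1, "d": "yes"},
        {"a": 0, "b": 0, "c": 1, "d": "no"},
        {"a": 1, "b": 1, "c": 0, "d": "no"}]
for pair, discerning in discernibility_matrix(rows, ["a", "b", "c"], "d").items():
    print(pair, sorted(discerning))
# (0, 1) ['a']        -> 'a' alone discerns this pair (a core attribute)
# (0, 2) ['b', 'c']   -> either 'b' or 'c' can discern this pair
```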
A Combined Approach for Feature Subset Selection and Size Reduction for High ...IJERA Editor
Selection of relevant features from a given set of features is one of the important issues in the field of data mining as well as classification. In general the dataset may contain a number of features; however, it is not necessary that the whole set of features is important for a particular analysis or decision making, because the features may share common information and can also be completely irrelevant to the processing at hand. This generally happens because of improper selection of features during dataset formation or because of improper information availability about the observed system. In both cases the data will contain features that merely increase the processing burden, which may ultimately cause an improper outcome when used for analysis. For these reasons, methods are required to detect and remove such features; hence in this paper we present an efficient approach for reducing not only the unimportant features but also the size of the complete dataset. The proposed algorithm utilizes information theory to compute the information gain of each feature and a minimum spanning tree to group similar features; in addition, fuzzy c-means clustering is used to remove similar entries from the dataset. Finally, the algorithm is tested with an SVM classifier using 35 publicly available real-world high-dimensional datasets, and the results show that the presented algorithm not only reduces the feature set and data lengths but also improves the performance of the classifier.
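As a reminder of the information-gain building block used above (a generic sketch, not the paper's implementation): the gain of a feature is the entropy of the class distribution minus the entropy remaining after splitting on that feature.

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, feature, target):
    """H(target) minus the weighted entropy of target within each feature value."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[feature] for r in rows}:
        subset = [r[target] for r in rows if r[feature] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

rows = [{"f": "x", "cls": "pos"}, {"f": "x", "cls": "pos"},
        {"f": "y", "cls": "neg"}, {"f": "y", "cls": "pos"}]
print(round(information_gain(rows, "f", "cls"), 3))   # 0.311
```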
QUANTIFYING THE THEORY VS. PROGRAMMING DISPARITY USING SPECTRAL BIPARTIVITY A...ijcsit
This document proposes two approaches to quantify the disparity between students' performance in theoretical versus programming assignments in computer science courses.
The first approach uses spectral bipartivity analysis to partition assignments based on student scores, then calculates a Theory vs Programming Tuple Proximity distance between the partitions and actual theoretical/programming assignments to derive a Theory vs Programming Disparity metric for each student.
The second approach applies principal component analysis separately to theoretical and programming assignment scores for all students, then correlates the principal components to determine a class-level disparity metric based on correlation coefficients and their variances.
Some students in Computer Science and related majors excel in programming-related assignments but not equally well in the theoretical assignments (those that are not programming-based), and vice versa. We refer to this as the "Theory vs. Programming Disparity (TPD)". In this paper, we propose a spectral bipartivity analysis-based approach to quantify the TPD metric for any student in a course based on the percentage scores (considered as decimal values in the range of 0 to 1) of the student in the course assignments (which involve both theoretical and programming-based assignments). We also propose a principal component analysis (PCA)-based approach to quantify the TPD metric for the entire class based on the percentage scores (on a scale of 0 to 100) of the students in the theoretical and programming assignments. The spectral analysis approach partitions the set of theoretical and programming assignments into two disjoint sets whose constituents are closer to each other within each set and relatively more different from each other across the two sets.
EFFECTIVENESS PREDICTION OF MEMORY BASED CLASSIFIERS FOR THE CLASSIFICATION O...cscpconf
Classification is a step-by-step practice for allocating a given piece of input to one of the given categories. Classification is an essential machine learning technique. Many classification problems occur in different application areas and need to be solved. Different types of classification algorithms, such as memory-based, tree-based and rule-based, are widely used. This work studies the performance of different memory-based classifiers for the classification of a multivariate data set from the UCI machine learning repository using an open source machine learning tool. A comparison of the different memory-based classifiers used and a practical guideline for selecting the most suitable algorithm for a classification task are presented. Apart from that, some empirical criteria for describing and evaluating the best classifiers are discussed.
This document summarizes a research paper that proposes a new method to accelerate the nearest neighbor search step of the k-means clustering algorithm. The k-means algorithm is computationally expensive due to calculating distances between data points and cluster centers. The proposed method uses geometric relationships between data points and centers to reject centers that are unlikely to be the nearest neighbor, without decreasing clustering accuracy. Experimental results showed the method significantly reduced the number of distance computations required.
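One standard way to realize such geometric pruning (a generic sketch in the spirit of triangle-inequality acceleration, not necessarily the exact rule of the summarized paper): if the distance from a point to its current best centre is at most half the distance between that centre and another centre, the other centre cannot be nearer and its distance need not be computed.

```python
import numpy as np

def assign_with_pruning(points, centers):
    """Nearest-centre assignment that skips centres ruled out by the triangle
    inequality: if d(x, best) <= d(best, c) / 2 then c cannot be closer."""
    center_dists = np.linalg.norm(centers[:, None] - centers[None, :], axis=2)
    labels, skipped = [], 0
    for x in points:
        best = 0
        best_d = np.linalg.norm(x - centers[0])
        for c in range(1, len(centers)):
            if best_d <= center_dists[best, c] / 2:
                skipped += 1          # pruned without a distance computation
                continue
            d = np.linalg.norm(x - centers[c])
            if d < best_d:
                best, best_d = c, d
        labels.append(best)
    return np.array(labels), skipped

rng = np.random.default_rng(0)
pts = rng.random((200, 2))
ctrs = np.array([[0.1, 0.1], [0.9, 0.9], [0.9, 0.1]])
labels, skipped = assign_with_pruning(pts, ctrs)
print("distance computations skipped:", skipped)
```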
This document summarizes several key dimensionality reduction algorithms:
1) FOCUS and RELIEF are two of the earliest and most widely cited dimensionality reduction algorithms. FOCUS implements feature selection by preferring consistent hypotheses with few features, while RELIEF ranks features based on distance between instances of the same and different classes.
2) RELIEF was later expanded to handle multi-class problems (RELIEF-F) and incomplete data (RELIEF-D). Liu and Dash also reviewed dimensionality reduction methods and categorized them based on their generation and evaluation procedures.
3) Liu et al. proposed the LVF (Las Vegas Filter) algorithm for feature selection, which uses random search guided by an inconsistency criterion to look for small consistent feature subsets.
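A minimal sketch of the LVF idea (random subsets kept only when they leave the data consistent; the inconsistency count used here is the usual one, but the toy data and parameters are my own):

```python
import random
from collections import Counter, defaultdict

def inconsistency(rows, attrs, decision):
    """Number of objects not in the majority decision class of their
    equivalence class under the candidate attribute subset."""
    groups = defaultdict(list)
    for r in rows:
        groups[tuple(r[a] for a in attrs)].append(r[decision])
    return sum(len(g) - Counter(g).most_common(1)[0][1] for g in groups.values())

def lvf(rows, attrs, decision, max_tries=1000, allowed=0, seed=0):
    """Las Vegas Filter: repeatedly draw random attribute subsets and keep
    the smallest one whose inconsistency does not exceed `allowed`."""
    random.seed(seed)
    best = list(attrs)
    for _ in range(max_tries):
        cand = random.sample(attrs, random.randint(1, len(best)))
        if len(cand) <= len(best) and inconsistency(rows, cand, decision) <= allowed:
            best = cand
    return best

rows = [{"a": 1, "b": 0, "c": 1, "d": "yes"},
        {"a": 0, "b": 0, "c": 1, "d": "no"},
        {"a": 1, "b": 1, "c": 0, "d": "yes"},
        {"a": 0, "b": 1, "c": 0, "d": "no"}]
print(lvf(rows, ["a", "b", "c"], "d"))   # typically ['a']: attribute a alone is consistent
```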
Improving circuit miniaturization and its efficiency using Rough Set Theory( ...Sarvesh Singh
This document discusses using rough set theory to improve circuit miniaturization and efficiency. It presents an example of applying rough set concepts like indiscernibility relations, lower and upper approximations, and decision rules to reduce the number of logic gates in a circuit without changing the output. The example generates data from each gate, analyzes it using rough set approximations and rules to identify redundant gates, allowing the circuit to be minimized. This technique could help reduce chip size, switching power usage, and increase the number of transistors implemented on a chip.
This paper introduces a new comparison-based stable sorting algorithm, named RA sort. RA sort involves only the comparison of pairs of elements in an array, which ultimately sorts the array, and does not compare each element with every other element. It tries to build upon the relationship established between the elements in each pass. Instead of a blind comparison, we prefer a selective comparison to get an efficient method. Sorting is a fundamental operation in computer science. The algorithm is analysed both theoretically and empirically to get a robust average case result. We have performed its empirical analysis and compared its performance with the well-known quicksort for various input types. Although the theoretical worst case complexity of RA sort is Yworst(n) = O(n√), the experimental results suggest an empirical Oemp(n lg n)^1.333 time complexity for typical input instances, where the parameter n characterizes the input size. The theoretical complexity is given for the comparison operation. We emphasize that the theoretical complexity is operation specific whereas the empirical one represents the overall algorithmic complexity.
Single Reduct Generation Based on Relative Indiscernibility of Rough Set Theory
International Journal on Soft Computing (IJSC), Vol.3, No.1, February 2012
DOI: 10.5121/ijsc.2012.3109
SINGLE REDUCT GENERATION BASED ON RELATIVE INDISCERNIBILITY OF ROUGH SET THEORY
Shampa Sengupta and Asit Kr. Das
M.C.K.V. Institute of Engineering, 243, G.T. Road (North), Liluah, Howrah 711204, West Bengal
Bengal Engineering and Science University, Shibpur, Howrah 711103, West Bengal
shampa2512@yahoo.co.in, akdas@cs.becs.ac.in
Abstract.
In the real world, everything is an object that belongs to particular classes. Every object can be fully described by its attributes. Any real-world dataset contains a large number of attributes and objects. Classifiers perform poorly when such huge datasets are given to them as input for classification. Therefore, the most useful attributes, those that contribute the most to the decision, need to be extracted from these huge datasets. In the paper, the attribute set is reduced by generating reducts using the indiscernibility relation of Rough Set Theory (RST). The method measures similarity among the attributes using a relative indiscernibility relation and computes an attribute similarity set. Then the set is minimized and an attribute similarity table is constructed, from which the attribute similar to the maximum number of attributes is selected so that the resultant minimum set of selected attributes (called a reduct) covers all attributes of the attribute similarity table. The method has been applied to the glass dataset collected from the UCI repository and the classification accuracy is calculated by various classifiers. The result shows the efficiency of the proposed method.
Keywords:
Rough Set Theory, Attribute Similarity, Relative Indiscernibility Relation, Reduct.
1. Introduction
In general, a classifier should achieve its highest accuracy when all attributes are considered. But real-world problems involve a huge number of attributes, which degrades the efficiency of classification algorithms. So, some attributes need to be neglected, which in turn may decrease the accuracy of the system. Therefore, a trade-off is required, for which strong dimensionality reduction or feature selection techniques are needed. The attributes that contribute the most to the decision must be retained. Rough Set Theory (RST) [1, 2], a new mathematical approach to imperfect knowledge, is popularly used to evaluate the significance of attributes and helps to find a minimal set of attributes called a reduct. Thus, a reduct is a set of attributes that preserves the partition. It means that a reduct is the minimal subset of attributes that enables the same classification of elements of the universe as the whole set of attributes. In other words, attributes that do not belong to a reduct are superfluous with regard to classification of elements of the universe. Hu et
al. [3] developed two new algorithms to calculate core attributes and reducts for feature selection. These algorithms can be applied to a wide range of real-life applications with very large data sets. Jensen et al. [4] developed the Quickreduct algorithm to compute a minimal reduct without exhaustively generating all possible subsets, and also developed fuzzy-rough attribute reduction with an application to web categorization. Zhong et al. [5] applied Rough Sets with Heuristics (RSH) and Rough Sets with Boolean Reasoning (RSBR) for attribute selection and discretization of real-valued attributes. Komorowski et al. [6] studied an application of rough sets to modelling the prognostic power of cardiac tests. Bazan [7] compared rough set-based methods, in particular dynamic reducts, with statistical methods, neural networks, decision trees and decision rules. Carlin et al. [8] presented an application of rough sets to diagnosing suspected acute appendicitis. The main advantage of rough set theory in data analysis is that it does not need any preliminary or additional information about data, like probability in statistics [9], basic probability assignment in Dempster-Shafer theory [10], or grade of membership or the value of possibility in fuzzy set theory [11]. But finding a reduct for classification is an NP-complete problem, and so some heuristic approach must be applied. In the paper, a novel reduct generation method is proposed based on the indiscernibility relation of rough set theory.
Fig. 1: Single Reduct Generation Process
In the method, a new kind of indiscernibility, called relative indiscernibility of an attribute with respect to another attribute, is introduced. This relative indiscernibility relation induces partitions of the objects, based on which the similarity between conditional attributes is measured and an
attribute similarity set (ASS) is obtained. Then, the similarity set is minimized by removing the attribute similarities having a similarity measure less than the average similarity. Lastly, an attribute similarity table is constructed for ASS, each row of which describes the similarity of an attribute with some other attributes. Each row is then traversed and the attribute of the row which has the maximum number of similar attributes is selected. Next, all the rows associated with the selected attribute and its similar attributes are deleted from the table, and another attribute is selected from the modified table in the same way. The process continues until all the rows are deleted from the table; finally, the selected attributes, covering all the attributes, are considered as the reduct, a minimum set of attributes.
The rest of the paper is organized as follows: similarity measurement of attributes by relative indiscernibility and single reduct generation are described in section 2 and section 3 respectively. Section 4 explains the experimental analysis of the proposed method, and finally the conclusion of the paper is stated in section 5.
2. Relative Indiscernibility and Dependency of Attributes
Formally, a decision system DS can be seen as a system DS = (U, A), where U is the universe (a finite set of objects, U = {x1, x2, ..., xm}) and A is the set of attributes such that A = C ∪ D and C ∩ D = ∅, where C and D are the set of condition attributes and the set of decision attributes, respectively.
2.1 Indiscernibility
Each attribute a ∈ A defines an information function fa : U → Va, where Va is the set of values of a, called the domain of the attribute. Every subset of attributes P determines an indiscernibility relation over U, denoted as IND(P), which can be defined as IND(P) = {(x, y) ∈ U × U | ∀ a ∈ P, fa(x) = fa(y)}. For each set of attributes P, the indiscernibility relation IND(P) partitions the set of objects into m equivalence classes [x]P, defined as the partition U/IND(P) (or U/P) = {[x]P}, where |U/P| = m. Elements belonging to the same equivalence class are indiscernible; otherwise, elements are discernible with respect to P. If one considers a non-empty attribute subset R ⊂ P with IND(R) = IND(P), then P − R is dispensable. Any minimal R such that IND(R) = IND(P) is a minimal set of attributes that preserves the indiscernibility relation computed on the set of attributes P. R is called a reduct of P and denoted as R = RED(P). The core of P is the intersection of these reducts, defined as CORE(P) = ∩RED(P). Naturally, the core contains all the attributes from P which are considered of greatest importance for classification, i.e., the most relevant for a correct classification of the objects of U. On the other hand, none of the attributes belonging to the core may be neglected without deteriorating the quality of the classification; that is, if any attribute in the core is eliminated from the given data, it will be impossible to obtain the highest quality of approximation with the remaining attributes.
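To make the reduct condition concrete, the following short Python sketch (illustrative only, not part of the original paper; all names are made up) computes the partition U/P induced by IND(P) and checks whether a candidate subset R preserves it, i.e. whether IND(R) = IND(P).

```python
# Minimal sketch: partition induced by IND(P) and the reduct-preservation test.
from collections import defaultdict
from typing import Dict, Hashable, Iterable


def ind_partition(objects: Dict[str, Dict[str, Hashable]], P: Iterable[str]):
    """Partition U/P induced by IND(P): objects agreeing on every attribute in P."""
    P = list(P)
    blocks = defaultdict(set)
    for name, row in objects.items():
        blocks[tuple(row[a] for a in P)].add(name)  # same key <=> indiscernible w.r.t. P
    return {frozenset(b) for b in blocks.values()}


def preserves_partition(objects, R, P) -> bool:
    """True iff IND(R) = IND(P); a reduct is a minimal such R."""
    return ind_partition(objects, R) == ind_partition(objects, P)
```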
2.2 Relative Indiscernibility
Here, the relation is defined based on the same information function fa : U → Va, where Va is the set of values of a, called the domain of the attribute. Every conditional attribute Ai of C determines a relative (relative to the decision attribute) indiscernibility relation (RIR) over U, denoted as RIRD(Ai), which can be defined by equation (1).
RIRD(Ai) = {(x, y) ∈ U × U | fAi(x) = fAi(y) ∧ fD(x) = fD(y)}        (1)
For each conditional attribute Ai, the relative indiscernibility relation RIRD(Ai) partitions the set of objects into n equivalence classes, defined as the partition U/RIRD(Ai) (or UD/Ai), where |UD/Ai| = n. Obviously, each equivalence class contains objects with the same decision value which are indiscernible by attribute Ai.
To illustrate the method, a sample dataset, represented by Table 1, is considered with eight objects, four conditional attributes and one decision attribute.
Table 1. Sample Dataset

      Diploma (i)   Experience (e)   French (f)   Reference (r)   Decision
x1    MBA           Medium           Yes          Excellent       Accept
x2    MBA           Low              Yes          Neutral         Reject
x3    MCE           Low              Yes          Good            Reject
x4    MSc           High             Yes          Neutral         Accept
x5    MSc           Medium           Yes          Neutral         Reject
x6    MSc           High             Yes          Excellent       Reject
x7    MBA           High             No           Good            Accept
x8    MCE           Low              No           Excellent       Reject
Table 2. Equivalence classes induced by the indiscernibility and relative indiscernibility relations

Equivalence classes for each attribute by relation IND(P):
U/D = ({x1, x4, x7}, {x2, x3, x5, x6, x8})
U/i = ({x1, x2, x7}, {x3, x8}, {x4, x5, x6})
U/e = ({x1, x5}, {x2, x3, x8}, {x4, x6, x7})
U/f = ({x1, x2, x3, x4, x5, x6}, {x7, x8})
U/r = ({x1, x6, x8}, {x2, x4, x5}, {x3, x7})

Equivalence classes for each conditional attribute by the relative indiscernibility relation RIRD(Ai):
UD/i = ({x1, x7}, {x2}, {x3, x8}, {x4}, {x5, x6})
UD/e = ({x1}, {x5}, {x2, x3, x8}, {x4, x7}, {x6})
UD/f = ({x1, x4}, {x2, x3, x5, x6}, {x7}, {x8})
UD/r = ({x1}, {x6, x8}, {x2, x5}, {x4}, {x3, x7})
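As an illustration of how the partitions of Table 2 can be obtained, the sketch below (a hedged example, not the authors' implementation; the dictionary encoding of Table 1 is ours) groups objects on the attribute value alone for U/Ai, and on the pair (attribute value, decision value) for UD/Ai.

```python
# Reproduces the partitions of Table 2 for the Table 1 dataset.
from collections import defaultdict

DATA = {  # Table 1, encoded as object -> {attribute: value}
    "x1": {"i": "MBA", "e": "Medium", "f": "Yes", "r": "Excellent", "D": "Accept"},
    "x2": {"i": "MBA", "e": "Low",    "f": "Yes", "r": "Neutral",   "D": "Reject"},
    "x3": {"i": "MCE", "e": "Low",    "f": "Yes", "r": "Good",      "D": "Reject"},
    "x4": {"i": "MSc", "e": "High",   "f": "Yes", "r": "Neutral",   "D": "Accept"},
    "x5": {"i": "MSc", "e": "Medium", "f": "Yes", "r": "Neutral",   "D": "Reject"},
    "x6": {"i": "MSc", "e": "High",   "f": "Yes", "r": "Excellent", "D": "Reject"},
    "x7": {"i": "MBA", "e": "High",   "f": "No",  "r": "Good",      "D": "Accept"},
    "x8": {"i": "MCE", "e": "Low",    "f": "No",  "r": "Excellent", "D": "Reject"},
}


def partition(attrs):
    """U/P for a tuple of attributes P; use ('i', 'D') etc. to obtain UD/Ai."""
    blocks = defaultdict(list)
    for obj, row in DATA.items():
        blocks[tuple(row[a] for a in attrs)].append(obj)
    return sorted(blocks.values())


print(partition(("D",)))       # U/D  -> [['x1','x4','x7'], ['x2','x3','x5','x6','x8']]
print(partition(("i", "D")))   # UD/i -> [['x1','x7'], ['x2'], ['x3','x8'], ['x4'], ['x5','x6']]
```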
2.3 Attribute Similarity
An attribute Ai is similar to another attribute Aj in the context of classification power if they induce the same equivalence classes of objects under their respective relative indiscernibility relations. But in real situations this rarely occurs, and so the similarity of attributes is measured by introducing a similarity measurement factor which indicates the degree of similarity of one attribute to another. Here, an attribute Ai is said to be similar to an attribute Aj with degree of similarity (or similarity factor) δf, denoted by Ai → Aj, if the probability of inducing the same equivalence classes of objects under their respective relative indiscernibility relations is (δf × 100)%, where δf is computed by equation (2):

δf(Ai → Aj) = (1 / |UD/Ai|) Σ[x] ∈ UD/Ai  max[y] ∈ UD/Aj ( |[x] ∩ [y]| / |[x]| )        (2)

The details of the similarity measurement for the attribute similarity Ai → Aj (Ai ≠ Aj) are described in the algorithm "SIM_FAC" below.
Algorithm: SIM_FAC(Ai, Aj)   /* Similarity factor computation for attribute similarity Ai → Aj */
Input: Partitions UD/Ai = {[x]1, [x]2, ..., [x]m} and UD/Aj = {[y]1, [y]2, ..., [y]n}, obtained by applying the relative indiscernibility relation RIRD (equation (1)) on Ai and Aj respectively.
Output: Similarity factor δf(Ai → Aj)
Begin
  δf = 0
  /* similarity measurement of Ai to Aj */
  For each [x]k ∈ UD/Ai
  {   max_overlap = 0
      For each [y]l ∈ UD/Aj
      {   overlap = |[x]k ∩ [y]l|
          if (overlap > max_overlap) then
              max_overlap = overlap
      }
      δf = δf + max_overlap / |[x]k|
  }
  δf = δf / |UD/Ai|      /* equation (2) */
End.
To illustrate the attribute similarity computation process, the attribute similarities and their similarity factors are listed in Table 2 for all attribute pairs of Table 1.
Table 2. Degree of similarity of all pairs of attributes

Attribute Similarity (Ai → Aj) | Equivalence Classes by RIRD(Ai) (UD/Ai) | Equivalence Classes by RIRD(Aj) (UD/Aj) | Similarity factor δf(Ai → Aj)
i → e | {x1, x7}, {x2}, {x3, x8}, {x4}, {x5, x6} | {x1}, {x5}, {x2, x3, x8}, {x4, x7}, {x6} | 0.80
i → f | {x1, x7}, {x2}, {x3, x8}, {x4}, {x5, x6} | {x1, x4}, {x2, x3, x5, x6}, {x7}, {x8} | 0.80
i → r | {x1, x7}, {x2}, {x3, x8}, {x4}, {x5, x6} | {x1}, {x6, x8}, {x2, x5}, {x4}, {x3, x7} | 0.70
e → i | {x1}, {x5}, {x2, x3, x8}, {x4, x7}, {x6} | {x1, x7}, {x2}, {x3, x8}, {x4}, {x5, x6} | 0.83
e → f | {x1}, {x5}, {x2, x3, x8}, {x4, x7}, {x6} | {x1, x4}, {x2, x3, x5, x6}, {x7}, {x8} | 0.83
e → r | {x1}, {x5}, {x2, x3, x8}, {x4, x7}, {x6} | {x1}, {x6, x8}, {x2, x5}, {x4}, {x3, x7} | 0.76
f → i | {x1, x4}, {x2, x3, x5, x6}, {x7}, {x8} | {x1, x7}, {x2}, {x3, x8}, {x4}, {x5, x6} | 0.75
f → e | {x1, x4}, {x2, x3, x5, x6}, {x7}, {x8} | {x1}, {x5}, {x2, x3, x8}, {x4, x7}, {x6} | 0.75
f → r | {x1, x4}, {x2, x3, x5, x6}, {x7}, {x8} | {x1}, {x6, x8}, {x2, x5}, {x4}, {x3, x7} | 0.75
r → i | {x1}, {x6, x8}, {x2, x5}, {x4}, {x3, x7} | {x1, x7}, {x2}, {x3, x8}, {x4}, {x5, x6} | 0.70
r → e | {x1}, {x6, x8}, {x2, x5}, {x4}, {x3, x7} | {x1}, {x5}, {x2, x3, x8}, {x4, x7}, {x6} | 0.70
r → f | {x1}, {x6, x8}, {x2, x5}, {x4}, {x3, x7} | {x1, x4}, {x2, x3, x5, x6}, {x7}, {x8} | 0.80
The computation of δf for each attribute similarity in Table 2 using equation (2) can be understood from Table 3, in which the similarity i → e in the first row of Table 2 is considered, where UD/i = ({x1, x7}, {x2}, {x3, x8}, {x4}, {x5, x6}) and UD/e = ({x1}, {x5}, {x2, x3, x8}, {x4, x7}, {x6}).
Table 3. Illustrates the similarity factor computation for i → e

[x]k of UD/i | Overlapping [y]l of UD/e | [x]k ∩ [y]l | max |[x]k ∩ [y]l| / |[x]k|
{x1, x7} | {x1}, {x4, x7} | {x1}, {x7} | 1/2
{x2} | {x2, x3, x8} | {x2} | 1
{x3, x8} | {x2, x3, x8} | {x3, x8} | 1
{x4} | {x4, x7} | {x4} | 1
{x5, x6} | {x5}, {x6} | {x5}, {x6} | 1/2

δf(i → e) = (1/5)(1/2 + 1 + 1 + 1 + 1/2) = 4/5 = 0.8
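The computation in Table 3 is equation (2) applied row by row. A small Python sketch of this reading of SIM_FAC follows (illustrative only; the function and variable names are assumptions); it is checked here against the i → e and e → i entries of Table 2.

```python
# Similarity factor per equation (2): for each block of UD/Ai, take its best
# fractional overlap with a block of UD/Aj and average over the blocks of UD/Ai.
def sim_fac(part_i, part_j):
    """part_i, part_j: lists of sets representing the partitions UD/Ai and UD/Aj."""
    total = 0.0
    for P in part_i:
        max_overlap = max(len(P & Q) for Q in part_j)  # best matching block of UD/Aj
        total += max_overlap / len(P)
    return total / len(part_i)


UD_i = [{"x1", "x7"}, {"x2"}, {"x3", "x8"}, {"x4"}, {"x5", "x6"}]
UD_e = [{"x1"}, {"x5"}, {"x2", "x3", "x8"}, {"x4", "x7"}, {"x6"}]
print(sim_fac(UD_i, UD_e))   # 0.8, matching the i -> e entry of Table 2
print(sim_fac(UD_e, UD_i))   # 0.8333..., matching the e -> i entry (0.83)
```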
2.4 Attribute Similarity Set
For each pair of conditional attributes (Ai, Aj), the similarity factor is computed by the "SIM_FAC" algorithm described in section 2.3. A higher similarity factor of Ai → Aj means that the relative indiscernibility relations RIRD(Ai) and RIRD(Aj) produce highly similar equivalence classes. This implies that the attributes Ai and Aj have almost the same classification power, and so Ai → Aj is considered a strong similarity of Ai to Aj. Since, for any two attributes Ai and Aj, two similarities Ai → Aj and Aj → Ai are computed, only the one with the higher similarity factor is selected in the list of the attribute similarity set ASS. Thus, for n conditional attributes, n(n-1)/2 similarities are selected, out of which some are strong and some are not. Out of these similarities, those with a δf value less than the average δf are discarded from ASS, and the rest are considered as the set of attribute similarities. So, each element x in ASS is of the form x: Ai → Aj such that Left(x) = Ai and Right(x) = Aj. The algorithm "ASS_GEN", described below, computes the attribute similarity set ASS.
Algorithm: ASS_GEN(C, δf)
/* Computes the attribute similarity set {Ai → Aj} */
Input: C = set of conditional attributes and δf = 2-D array containing the similarity factor between each pair of conditional attributes.
Output: Attribute Similarity Set ASS
Begin
  ASS = {}, sum_δf = 0
  /* compute only n(n – 1)/2 elements in ASS */
  for i = 1 to |C| - 1
  {  for j = i+1 to |C|
     {  if (δf(Ai → Aj) ≥ δf(Aj → Ai)) then
        {  sum_δf = sum_δf + δf(Ai → Aj)
           ASS = ASS ∪ {Ai → Aj}
        }
        else
        {  sum_δf = sum_δf + δf(Aj → Ai)
           ASS = ASS ∪ {Aj → Ai}
        }
     }
  }
  /* modify ASS by keeping only the elements Ai → Aj for which δf(Ai → Aj) > avg_δf */
  ASSmod = {}
  avg_δf = (2 × sum_δf) / (|C| × (|C| - 1))
  for each {Ai → Aj} ∈ ASS
  {  if (δf(Ai → Aj) > avg_δf) then
     {  ASSmod = ASSmod ∪ {Ai → Aj}
        ASS = ASS – {Ai → Aj}
     }
  }
  ASS = ASSmod
End
Algorithm "ASS_GEN" is applied and Table 4 is constructed from Table 2, where only six out of the twelve attribute similarities in Table 2 are considered. Thus, initially, ASS = {i → f, i → r, e → i, e → f, e → r, r → f} and avg_δf = 0.786. As the similarity factors for the attribute similarities i → f, e → i, e → f and r → f are greater than avg_δf, they are considered in the final attribute similarity set ASS. So, finally, ASS = {i → f, e → i, e → f, r → f}.
Table 4. Illustrates the selection of attribute similarities

Attribute Similarity (Ai → Aj; i ≠ j and δf(Ai → Aj) ≥ δf(Aj → Ai)) | Similarity factor δf(Ai → Aj) | δf > avg_δf
i → f | 0.80 | Yes
i → r | 0.70 | No
e → i | 0.83 | Yes
e → f | 0.83 | Yes
e → r | 0.76 | No
r → f | 0.80 | Yes
Average δf = 0.786
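The selection summarized in Table 4 can be rendered as the following Python sketch of ASS_GEN (illustrative only, not the authors' code; the dictionary of similarity factors is taken from Table 2 and the helper names are made up).

```python
# Keep, for each unordered pair, the direction with the larger similarity
# factor, then retain only those above the average factor.
def ass_gen(attrs, delta):
    """attrs: list of attribute names; delta: dict (Ai, Aj) -> similarity factor."""
    selected = []
    for a in range(len(attrs)):
        for b in range(a + 1, len(attrs)):
            Ai, Aj = attrs[a], attrs[b]
            # keep the stronger of Ai -> Aj and Aj -> Ai
            pair = (Ai, Aj) if delta[(Ai, Aj)] >= delta[(Aj, Ai)] else (Aj, Ai)
            selected.append(pair)
    avg = sum(delta[p] for p in selected) / len(selected)
    return [p for p in selected if delta[p] > avg]


delta = {("i", "e"): 0.80, ("e", "i"): 0.83, ("i", "f"): 0.80, ("f", "i"): 0.75,
         ("i", "r"): 0.70, ("r", "i"): 0.70, ("e", "f"): 0.83, ("f", "e"): 0.75,
         ("e", "r"): 0.76, ("r", "e"): 0.70, ("f", "r"): 0.75, ("r", "f"): 0.80}
print(ass_gen(["i", "e", "f", "r"], delta))
# [('e', 'i'), ('i', 'f'), ('e', 'f'), ('r', 'f')] -- the final ASS {e→i, i→f, e→f, r→f}
```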
3. Single Reduct Generation
The attribute similarity obtained so far is known as the simple similarity of one attribute to another attribute. But, to simplify the reduct generation process, the elements in ASS are minimized by combining some simple similarities. The new similarity obtained by the combination of some of the simple similarities is called a compound similarity. Here, all x from ASS with the same Left(x) are considered, and the obtained compound similarity is Left(x) → ∪ Right(x) over all such x. Thus, by introducing compound similarity, the set ASS is refined to a set with a minimum number of elements so that for each attribute there is at most one element in ASS representing either a simple or a compound similarity of the attribute. The detailed algorithm for determining the compound attribute similarity set is given below:
Algorithm: COMP_SIM(ASS)
/* Compute the compound attribute similarity of attributes*/
Input: Simple attribute similarity set ASS
Output: Compound attribute similarity set ASS
Begin
for each x ∈ ASS
{ for each y (≠x) ∈ ASS
{ if(Left(x) = = Left(y)) then
{ Right(x) = Right(x) ∪ Right(y)
ASS = ASS – {y}
}
}
}
End
Finally, the reduct is generated from the compound attribute similarity set ASS. First of all, an element x is selected from ASS for which the length of Right(x), i.e., |Right(x)|, is maximum. This selection guarantees that the attribute Left(x) is similar to the maximum number of attributes, and so Left(x) is an element of the reduct RED. Then, all elements z of ASS for which Left(z) ⊆ Right(x) are deleted, and x is also deleted from ASS. This process is repeated until the set ASS becomes empty, which provides the reduct RED. The proposed single reduct generation algorithm is discussed below:
Algorithm: SIN_RED_GEN(ASS, RED)
Input: Compound attribute similarity set ASS
Output: Single reduct RED
Begin
RED = φ
While (ASS ≠ φ)
{ max = 0
for each x ∈ ASS
{ if(|Right(x)| > max) then
{ max = |Right(x)|
L = Left(x)
}
}
for each x ∈ ASS
{ if (Left(x) = = L) then
{ RED = RED ∪ Left(x)
R = Right(x)
ASS = ASS – {x}
for each z ∈ ASS
if(Left(z) ⊆ R) then
ASS = ASS – {z}
break
}
}
} /*end-while*/
Return (RED)
End
Applying the "COMP_SIM" algorithm, the set ASS = {i → f, e → i, e → f, r → f} is refined to the compound similarity set ASS = {i → f, e → {i, f}, r → f}. So, the selected element from ASS is e → {i, f}, and thus e ∈ RED and ASS is modified to ASS = {r → f}. In the next iteration, r ∈ RED and ASS = φ. Thus, RED = {e, r}.
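The worked example above can be reproduced with the following sketch of COMP_SIM and SIN_RED_GEN (our reading of the two algorithms, not the authors' code; helper names are illustrative).

```python
# Compound-similarity formation followed by greedy reduct selection.
def comp_sim(ass):
    """ass: list of (left, right) simple similarities; merge rights with equal left."""
    merged = {}
    for left, right in ass:
        merged.setdefault(left, set()).add(right)
    return merged                      # e.g. {'i': {'f'}, 'e': {'i', 'f'}, 'r': {'f'}}


def sin_red_gen(compound):
    """Repeatedly pick the attribute covering the most others until ASS is empty."""
    red = set()
    ass = dict(compound)
    while ass:
        left = max(ass, key=lambda a: len(ass[a]))   # largest |Right(x)|
        red.add(left)
        covered = ass.pop(left)
        # drop every remaining entry whose left side is already covered
        ass = {l: r for l, r in ass.items() if l not in covered}
    return red


ass = [("i", "f"), ("e", "i"), ("e", "f"), ("r", "f")]
print(sin_red_gen(comp_sim(ass)))      # {'e', 'r'}, the reduct of the running example
```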
4. Results and discussions
The proposed method computes a single reduct for datasets collected from the UCI machine learning repository [12]. At first, all the numeric attributes are discretized by the ChiMerge [13] discretization algorithm. To measure the efficiency of the method, k-fold cross-validations, where k ranges from 1 to 10, have been carried out on the datasets, which are classified using the "Weka" tool [14]. The proposed method (PRP) and well-known dimensionality reduction methods, such as the Correlated Feature Subset (CFS) method [15] and the Consistency Subset Evaluator (CON) method [16], have been applied on the datasets for dimension reduction, and the reduced datasets are classified by various classifiers. The original number of attributes, the number of attributes after applying the various reduction methods, and the accuracies (in %) of the datasets are computed and listed in Table 5, which shows the efficiency of the proposed method.
Table 5. Accuracy Comparison of Proposed (PRP), CFS and CON methods (accuracy in %; the original number of attributes follows each dataset name, and the number of attributes retained by each method is shown in parentheses)

Classifier        | Machine (7)              | Heart (13)               | Wine (13)                | Liver disorder (6)       | Glass (9)
                  | PRP(3)  CFS(2)  CON(4)   | PRP(4)  CFS(8)  CON(11)  | PRP(6)  CFS(8)  CON(8)   | PRP(5)  CFS(5)  CON(4)   | PRP(6)  CFS(6)  CON(7)
Naïve Bayes       | 29.67   30.77   33.65    | 83.77   84.36   85.50    | 95.70   97.19   97.19    | 67.30   68.31   68.60    | 65.73   43.92   47.20
SMO               | 16.35   12.98   15.48    | 82.77   84.75   84.44    | 98.90   98.21   98.31    | 69.00   69.18   69.19    | 62.44   57.94   57.48
KStar             | 45.48   42.17   47.69    | 81.95   81.67   82.07    | 95.39   97.45   96.63    | 70.93   70.64   70.64    | 83.57   79.91   78.50
Bagging           | 50.09   45.07   50.77    | 80.40   81.11   81.48    | 94.86   94.94   94.94    | 72.22   70.64   71.22    | 76.53   73.83   71.50
J48               | 42.65   38.08   41.61    | 82.31   81.11   82.89    | 96.00   93.82   94.94    | 68.90   68.31   69.48    | 72.30   68.69   64.20
PART              | 52.09   46.17   54.37    | 83.70   81.67   79.55    | 94.00   93.10   94.30    | 69.14   69.48   68.60    | 77.00   70.09   68.60
Average Accuracy  | 39.38   35.87   40.59    | 82.48   82.44   82.71    | 95.80   95.78   96.05    | 69.58   69.40   69.62    | 72.90   65.73   64.58
5. Conclusion
The relative indiscernibility relation introduced in the paper is an equivalence relation which induces a partition of equivalence classes for each attribute. Then, the degree of similarity between two attributes is measured based on their equivalence classes. Since the target of the paper is to compute a reduced attribute set for decision making, the application of equivalence classes for similarity measurement is an appropriate choice.
References
[1] Pawlak, Z.: "Rough sets", International Journal of Computer and Information Sciences, Vol. 11, pp. 341-356 (1982)
[2] Pawlak, Z.: "Rough set theory and its applications to data analysis", Cybernetics and Systems, 29, pp. 661-688 (1998)
[3] Hu, X., Lin, T.Y., Jianchao, J.: "A New Rough Sets Model Based on Database Systems", Fundamenta Informaticae, pp. 1-18 (2004)
[4] Jensen, R., Shen, Q.: "Fuzzy-Rough Attribute Reduction with Application to Web Categorization", Fuzzy Sets and Systems, Vol. 141, No. 3, pp. 469-485 (2004)
[5] Zhong, N., Skowron, A.: "A Rough Set-Based Knowledge Discovery Process", Int. Journal of Applied Mathematics and Computer Science, 11(3), pp. 603-619 (2001)
[6] Komorowski, J., Ohrn, A.: "Modelling Prognostic Power of Cardiac Tests Using Rough Sets", Artificial Intelligence in Medicine, 15, pp. 167-191 (1999)
[7] Bazan, J.: "A Comparison of Dynamic and Non-dynamic Rough Set Methods for Extracting Laws from Decision Tables", Rough Sets in Knowledge Discovery, Physica-Verlag (1998)
[8] Carlin, U., Komorowski, J., Ohrn, A.: "Rough Set Analysis of Patients with Suspected Acute Appendicitis", Proc. IPMU (1998)
[9] Devroye, L., Gyorfi, L., Lugosi, G.: A Probabilistic Theory of Pattern Recognition, New York: Springer-Verlag (1996)
[10] Gupta, S.C., Kapoor, V.K.: Fundamentals of Mathematical Statistics, Sultan Chand & Sons, A.S. Printing Press, India (1994)
[11] Pal, S.K., Mitra, S.: Neuro-Fuzzy Pattern Recognition: Methods in Soft Computing, New York: Wiley (1999)
[12] Murphy, P., Aha, W.: UCI Repository of Machine Learning Databases (1996), http://www.ics.uci.edu/mlearn/MLRepository.html
[13] Kerber, R.: "ChiMerge: Discretization of Numeric Attributes", in Proceedings of AAAI-92, Ninth Int'l Conf. on Artificial Intelligence, AAAI Press, pp. 123-128 (1992)
[14] WEKA: Machine Learning Software, http://www.cs.waikato.ac.nz/~ml/
[15] Hall, M.A.: Correlation-Based Feature Selection for Machine Learning, PhD thesis, Dept. of Computer Science, Univ. of Waikato, Hamilton, New Zealand (1998)
[16] Liu, H., Setiono, R.: "A Probabilistic Approach to Feature Selection: A Filter Solution", Proc. 13th Int'l Conf. on Machine Learning, pp. 319-327 (1996)
Authors
Ms Shampa Sengupta is an Assistant Professor of Information Technology at MCKV Institute of Engineering, Liluah, Howrah. She received her M.Tech degree in Information Technology from Bengal Engineering and Science University, Shibpur, Howrah. Since 2011, she has been working toward the PhD degree in Data Mining at Bengal Engineering and Science University, Shibpur, Howrah.
Dr. Asit Kr. Das is an Assistant Professor of Computer Science and Technology at Bengal Engineering and Science University, Shibpur, Howrah. He received his M.Tech degree in Computer Science and Engineering from Calcutta University and obtained his PhD (Engg) degree from Bengal Engineering and Science University, Shibpur, Howrah. His research interests include Data Mining, Pattern Recognition, Rough Set Theory and Bio-informatics.