Optimization problems are increasingly solved using computational intelligence, and one issue that can be addressed in this context is attribute subset selection evaluation. This paper presents a computational intelligence technique for solving this optimization problem using a proposed model called Modified Genetic Search Algorithms (MGSA), which avoids poor local search spaces by using merit and scaled-fitness variables to detect and delete bad candidate chromosomes, thereby reducing the number of individual chromosomes in the search space and the iterations needed in subsequent generations. The paper also aims to show that Rotation Forest ensembles are useful in feature selection; the base classifier is multinomial logistic regression integrated with Haar wavelets as the projection filter, and the rank of each feature is reproduced with 10-fold cross-validation. It discusses the main findings and concludes with the promising results of the proposed model, exploring the combination of MGSA for optimization with Naïve Bayes classification. The result obtained using the proposed MGSA model is validated mathematically using Principal Component Analysis. The goal is to improve the accuracy and quality of breast cancer diagnosis with robust machine learning algorithms. Compared with other works in the literature survey, the experimental results achieved in this paper are better, with supporting statistical inference.
Building a Classifier Employing Prism Algorithm with Fuzzy Logic - IJDKP
Classification in data mining is receiving immense interest in recent times. As knowledge is based on historical data, classification of data is essential for discovering that knowledge. To decrease classification complexity, the quantitative attributes of the data need to be split, but splitting using classical logic is less accurate; this can be overcome by using fuzzy logic. This paper illustrates how to build classification rules using fuzzy logic. The fuzzy classifier is built using the Prism decision-tree algorithm and produces more realistic results than the classical one. The effectiveness of this method is demonstrated on a sample dataset.
Diagnosis of health conditions is a very challenging task for every human being, because life is directly related to health. Data-mining-based classification is one of the important applications for classifying data. In this research work we have used various classification techniques for classifying thyroid data; CART gives the highest accuracy, 99.47%, as the best model. Feature selection plays a very important role in making models computationally efficient and increasing their performance. This work focuses on the Info Gain and Gain Ratio feature selection techniques, which remove irrelevant features from the original dataset and improve the computational performance of the model. We applied both feature selection techniques to the best model, i.e. CART. Our proposed CART-Info Gain and CART-Gain Ratio give 99.47% and 99.20% accuracy with 25 and 3 features respectively.
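As an illustration of the Info Gain pipeline described above, the following sketch ranks features by mutual information (the quantity behind information gain) and retrains a CART-style decision tree on the top-ranked subset. It uses scikit-learn on a synthetic dataset; the data, the subset size k and the resulting accuracies are illustrative, not the thyroid results of the paper.

```python
# Sketch: rank features by information gain (mutual information) and
# retrain a CART-style decision tree on the top-k subset.
# Dataset and k are illustrative, not the thyroid data from the paper.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)

# Score every feature, keep the k highest-scoring ones.
scores = mutual_info_classif(X, y, random_state=0)
k = 5
top = np.argsort(scores)[::-1][:k]

full_acc = cross_val_score(DecisionTreeClassifier(random_state=0),
                           X, y, cv=10).mean()
sub_acc = cross_val_score(DecisionTreeClassifier(random_state=0),
                          X[:, top], y, cv=10).mean()
print(f"all 20 features: {full_acc:.3f}, top {k} features: {sub_acc:.3f}")
```

The same shape of experiment, swapping in the real dataset and WEKA's InfoGainAttributeEval scores, reproduces the paper's comparison of full versus reduced feature sets.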
Robust Breast Cancer Diagnosis on Four Different Datasets Using Multi-Classif... - ahmad abdelhafeez
Abstract- The goal of this paper is to compare different classifiers and multi-classifier fusion with respect to accuracy in detecting breast cancer on four different datasets. We implement various classification techniques, representing the best-known algorithms in this field, on four breast cancer datasets: two for diagnosis and two for prognosis. We fuse classifiers to find the best multi-classifier fusion approach for each dataset individually, using a confusion matrix built with the 10-fold cross-validation technique to obtain classification accuracy, and majority voting (the mode of the classifier outputs) for fusion. The experimental results show that no classification technique is better than the others across all datasets, since the classification task is affected by the type of dataset. With multi-classifier fusion, accuracy improved on three of the four datasets.
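The majority-voting fusion described above can be sketched with scikit-learn's VotingClassifier, which under hard voting takes the mode of the member classifiers' outputs. The member classifiers and the WDBC copy bundled with scikit-learn are stand-ins chosen for illustration, not the exact setup of the paper.

```python
# Sketch: majority-vote fusion of three classifiers, evaluated with
# 10-fold cross-validation as in the abstract. The dataset is
# scikit-learn's bundled WDBC copy, used purely for illustration.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

fusion = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=5000)),
                ("nb", GaussianNB()),
                ("dt", DecisionTreeClassifier(random_state=0))],
    voting="hard")  # hard voting = mode of the classifier outputs

acc = cross_val_score(fusion, X, y, cv=10).mean()
print(f"10-fold accuracy of the majority-vote fusion: {acc:.3f}")
```

Running the same loop per dataset, as the paper does, shows whether fusion beats the best single member on that particular dataset.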
ON FEATURE SELECTION ALGORITHMS AND FEATURE SELECTION STABILITY MEASURES: A C... - ijcsit
Data mining is indispensable for business organizations: it extracts useful information from the huge volume of stored data, which can be used in managerial decision making to survive the competition. Owing to day-to-day advances in information and communication technology, the data collected from e-commerce and e-governance are mostly high dimensional, while data mining prefers small datasets to high-dimensional ones. Feature selection is an important dimensionality reduction technique. The subsets selected in subsequent iterations of feature selection should be the same or similar even under small perturbations of the dataset; this property is called selection stability, and it has recently become an important topic in the research community. Selection stability has been quantified with various measures. This paper analyses the choice of a suitable search method and stability measure for feature selection algorithms, as well as the influence of the characteristics of the dataset, since the choice of the best approach is highly problem dependent.
Leave one out cross validated Hybrid Model of Genetic Algorithm and Naïve Bay... - IJERA Editor
This paper presents a new approach for selecting a reduced number of features in databases. Every database has a given number of features, but some of these features can be redundant and can harm or confuse the classification process. The proposed method first applies a binary-coded genetic algorithm to select a small subset of features; the importance of these features is judged by applying the Naïve Bayes (NB) classification method, and the reduced feature subset with the highest classification accuracy on the given databases is adopted. The classification accuracy obtained by the proposed method is compared with results recently reported in publications on eight databases. The proposed method performs satisfactorily on these databases and achieves higher classification accuracy with a smaller number of features.
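A minimal sketch of the hybrid described above: a binary-coded genetic algorithm searches over feature subsets, and Naïve Bayes cross-validation accuracy serves as the fitness function. The population size, generation count, selection scheme, mutation rate and the synthetic dataset are all illustrative assumptions, not the paper's settings.

```python
# Sketch of the GA + Naive Bayes hybrid: chromosomes are bit masks over
# features, fitness is NB cross-validation accuracy on the masked data.
# All hyperparameters and the dataset are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=15,
                           n_informative=4, random_state=0)
n_feat = X.shape[1]

def fitness(mask):
    if mask.sum() == 0:          # empty subsets are worthless
        return 0.0
    return cross_val_score(GaussianNB(), X[:, mask.astype(bool)],
                           y, cv=5).mean()

pop = rng.integers(0, 2, size=(12, n_feat))           # random chromosomes
for gen in range(10):
    fits = np.array([fitness(ind) for ind in pop])
    pop = pop[np.argsort(fits)[::-1]]                 # best first (elitism)
    children = []
    while len(children) < len(pop) // 2:
        p1, p2 = pop[rng.integers(0, 6)], pop[rng.integers(0, 6)]
        cut = rng.integers(1, n_feat)                 # one-point crossover
        child = np.concatenate([p1[:cut], p2[cut:]])
        flip = rng.random(n_feat) < 0.05              # bit-flip mutation
        children.append(np.where(flip, 1 - child, child))
    pop[len(pop) // 2:] = children                    # replace worst half

best = pop[0].astype(bool)
print(f"selected {best.sum()} of {n_feat} features, "
      f"NB accuracy {fitness(pop[0]):.3f}")
```

The elitist sort keeps the best mask found so far in row 0, so the final subset never regresses between generations, matching the paper's "best reduced subset is adopted" criterion in spirit.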
In this project, we investigated the use of association rules to extract useful knowledge from raw ontological data. To this end, we proposed an approach for passing from a graph representation to transactional data. We then used different technological solutions, such as the FP-growth algorithm and Hadoop, to improve the performance of frequent-itemset extraction. Check our code on Github: https://github.com/8-chems/OntologyMiner
Preprocessing and Classification in WEKA Using Different Classifiers - IJERA Editor
Data mining is a process of extracting information from a dataset and transforming it into an understandable structure for further use; it also discovers patterns in large datasets [1]. Data mining has a number of important techniques, such as preprocessing and classification. Classification is a technique based on supervised learning, used for predicting group membership for data instances. In this paper we apply preprocessing and classification to a diabetes database, applying classifiers and comparing the results on certain parameters using WEKA. 77.2 million people in India are suffering from pre-diabetes, and the ICMR estimates that around 65.1 million are diabetes patients. Globally in 2010, 227 to 285 million people had diabetes; 90% of cases were type 2, equal to 3.3% of the population, with equal rates in women and men. In 2011 diabetes resulted in 1.4 million deaths worldwide, making it the leading cause of death.
This article finds the best classifier for breast cancer disease classification; the survey shows that breast cancer is one of the major diseases in India. Classification of breast cancer is a multivariate data analysis, and classifying multivariate data is a big challenge for researchers because more than one decision variable is present. The proposed work determines which classifier is best among 48 different categories of classifiers and finds that the logistic function is best among all of them, showing 80.9524% accuracy. The logistic function is also called logistic regression; it is a statistical method for analysing the breast cancer dataset.
Regularized Weighted Ensemble of Deep Classifiers - ijcsa
An ensemble of classifiers increases classification performance, since the decisions of many experts are fused together to generate the resulting prediction. Deep learning is a classification approach in which, along with the basic learning technique, fine-tuning is done for improved learning precision; deep classifier ensemble learning therefore has good scope for research. Feature subset selection is another way of creating the individual classifiers to be fused for ensemble learning. All these ensemble techniques face the ill-posed problem of overfitting. A regularized weighted ensemble of deep support vector machines performs prediction analysis on three UCI repository problems (IRIS, Ionosphere and Seed datasets), thereby increasing the generalization of the boundary between the classes of the dataset. The singular-value-decomposition-reduced norm-2 regularization with the two-level deep classifier ensemble gives the best result in our experiments.
Data mining is an analytic process designed to explore data in search of consistent patterns and systematic relationships between variables, and then to validate the results by applying the patterns found to a new subset of data. It is often described as the process of discovering patterns, correlations, trends or relationships by searching through large amounts of data stored in repositories, databases, and data warehouses. Diabetes, often referred to by doctors as diabetes mellitus, describes a group of metabolic diseases in which the person has high blood glucose (blood sugar) [3], either because insulin production is insufficient, or because the body's cells do not respond properly to insulin, or both. This project helps identify whether a person has diabetes or not: if predicted diabetic [4], the project suggests measures for maintaining normal health, and if not, it predicts the risk of becoming diabetic. In this project a classification algorithm was used to classify the Pima Indian diabetes dataset, and results were obtained using an Android application.
Data Science - Part XIV - Genetic Algorithms - Derek Kane
This lecture provides an overview of biological evolution and genetic algorithms in a machine learning context. We start with a broad overview of the biological evolutionary process and then explore how genetic algorithms can be developed to mimic it. We dive into the types of problems that can be solved with genetic algorithms and conclude with a series of practical examples in R that highlight the techniques: the knapsack problem, feature selection and OLS regression, and constrained optimization.
Analysis of Parameter using Fuzzy Genetic Algorithm in E-learning System - Harshal Jain
The aim of this project is to analyze the parameters of the inputs in order to find a better solution to an optimization problem than the candidate solution we have. This helps us determine the knowledge level of a user more accurately, using a Genetic Algorithm (GA). In this algorithm, a population of candidate solutions (called individuals, creatures, or phenotypes) to an optimization problem is evolved toward better solutions. A GA is a search technique based on the principles of natural selection and genetics, and it can determine a good solution even for a hard problem.
Simplified Knowledge Prediction: Application of Machine Learning in Real Life - Peea Bal Chakraborty
Machine learning is the scientific study of algorithms and statistical models that machines use to perform a specific task based on patterns and inference rather than explicit instructions. This research and analysis aims to observe how precisely a machine can predict whether a patient suspected of breast cancer has a malignant or benign tumour. In this paper the classification of cancer type and the prediction of risk levels are done by various machine learning models and depicted pictorially with various visual analytics tools.
KNOWLEDGE BASED ANALYSIS OF VARIOUS STATISTICAL TOOLS IN DETECTING BREAST CANCER - cscpconf
In this paper, we study the performance of machine learning tools in classifying breast cancer. We compare data mining tools such as Naïve Bayes, support vector machines, radial basis function neural networks, the J48 decision tree and simple CART. We use both binary and multi-class datasets, namely WBC, WDBC and Breast Tissue from the UCI machine learning repository. The experiments are conducted in WEKA. The aim of this research is to find the best classifier with respect to accuracy, precision, sensitivity and specificity in detecting breast cancer.
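A comparison of this kind can be sketched outside WEKA as well; the snippet below evaluates scikit-learn stand-ins for three of the tools (GaussianNB for Naïve Bayes, SVC for the support vector machine, DecisionTreeClassifier for J48/CART) with 10-fold cross-validation on the WDBC copy bundled with scikit-learn. Recall on the positive class corresponds to sensitivity; specificity would need a custom scorer and is omitted here.

```python
# Sketch: compare several classifiers on accuracy, precision and
# recall (= sensitivity) with 10-fold cross-validation.
# The models and dataset are stand-ins for the WEKA setup in the paper.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
models = {"NaiveBayes": GaussianNB(),
          "SVM": SVC(),
          "DecisionTree": DecisionTreeClassifier(random_state=0)}

results = {}
for name, model in models.items():
    cv = cross_validate(model, X, y, cv=10,
                        scoring=["accuracy", "precision", "recall"])
    results[name] = {m: cv[f"test_{m}"].mean()
                     for m in ("accuracy", "precision", "recall")}
    print(name, results[name])
```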
An Heterogeneous Population-Based Genetic Algorithm for Data Clustering - ijeei-iaes
As a primary data mining method for knowledge discovery, clustering is a technique for partitioning a dataset into groups of similar objects. The most popular method for data clustering, K-means, suffers from the drawback of requiring the number of clusters and their initial centers, which must be provided by the user. In the literature, several methods have been proposed, in the form of K-means variants, genetic algorithms, or combinations of the two, for calculating the number of clusters and finding proper cluster centers. However, none of these solutions has provided satisfactory results, and determining the number of clusters and the initial centers remains the main challenge in clustering processes. In this paper we present an approach that automatically generates these parameters to achieve optimal clusters, using a modified genetic algorithm that operates on varied individual structures and uses a new crossover operator. Experimental results show that our modified genetic algorithm is a more efficient alternative to the existing approaches.
Microarray Data Classification Using Support Vector Machine - CSCJournals
DNA microarrays allow biologists to measure the expression of thousands of genes simultaneously on a small chip. These microarrays generate huge amounts of data, and new methods are needed to analyse them. In this paper, a new classification method based on support vector machines is proposed and used to classify gene expression data recorded on DNA microarrays. The proposed method is found to be faster than a neural network, with classification performance no worse than that of the neural network.
Classification of Breast Cancer Diseases using Data Mining Techniques - inventionjournals
Medical data mining offers a great deal for exploring new knowledge in large amounts of data, and classification is one of its important techniques. In this research work, we have used various data-mining-based classification techniques to classify whether patients have cancer or not. We applied the Breast Cancer-Wisconsin (Original) dataset to different data mining techniques and compared the accuracy of the models with two different data partitions. BayesNet achieved the highest accuracy, 97.13%, in the case of 10-fold data partitions. We also applied the info gain feature selection technique to BayesNet and the Support Vector Machine (SVM) and achieved the best accuracy, 97.28%, with BayesNet in the case of a 6-feature subset.
An intelligent mammogram diagnosis system can be very helpful for radiologists in detecting abnormalities earlier than typical screening techniques. This paper investigates a new classification approach for detecting breast abnormalities in digital mammograms using a League Championship Algorithm Optimized Ensembled Fully Complex-valued Relaxation Network (LCA-FCRN). The proposed algorithm is based on extracting curvelet fractal texture features from the mammograms and classifying the suspicious regions with a pattern classifier. The whole system includes steps for pre-processing, feature extraction, feature selection and classification, to determine whether a given input mammogram image is normal or abnormal. The method is applied to the MIAS database of 322 film mammograms, and the performance of the CAD system is analysed using the Receiver Operating Characteristic (ROC) curve. This curve indicates the trade-off between the sensitivity and specificity available from a diagnostic system, and thus describes the inherent discrimination capacity of the proposed system. The results show that the area under the ROC curve of the proposed algorithm is 0.985, with a sensitivity of 98.1% and a specificity of 92.105%. Experimental results demonstrate that the proposed method forms an effective CAD system and achieves good classification accuracy.
Clustering and Classification of Cancer Data Using Soft Computing Technique - IOSR Journals
Clustering and classification of cancer data have been used with success in the medical field. In this paper the two algorithms K-means and fuzzy C-means are compared to find the accuracy of the results. The paper addresses the problem of learning to classify cancer data with two different methods, using information derived from training and testing; it surveys various soft-computing-based classification techniques and compares their performance on this health-care data. The paper presents the accuracy of the results on the cancer data.
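The fuzzy C-means side of that comparison can be sketched in a few lines of NumPy: instead of K-means' hard assignments, each point carries a soft membership u_ik in every cluster, and centers are membership-weighted means. The two-blob data, the cluster count c and the fuzzifier m below are illustrative assumptions, not the cancer data of the paper.

```python
# Minimal NumPy sketch of the fuzzy C-means iteration: soft memberships
# u_ik replace K-means' hard assignments. Data, c and m are illustrative.
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated 2-D blobs stand in for the cancer data.
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)),
               rng.normal(4.0, 0.5, (50, 2))])

c, m = 2, 2.0                        # number of clusters, fuzzifier
u = rng.random((len(X), c))
u /= u.sum(axis=1, keepdims=True)    # memberships of each point sum to 1

for _ in range(50):
    w = u ** m
    centers = (w.T @ X) / w.sum(axis=0)[:, None]   # weighted cluster means
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
    inv = d ** (-2.0 / (m - 1.0))                  # standard FCM update
    u = inv / inv.sum(axis=1, keepdims=True)

labels = u.argmax(axis=1)            # harden memberships for inspection
print("cluster sizes:", np.bincount(labels))
```

Replacing the membership update with a hard argmin over distances turns this same loop into K-means, which is what makes the two algorithms natural to compare side by side.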
SVM-PSO based Feature Selection for Improving Medical Diagnosis Reliability u... - cscpconf
Improving the accuracy of supervised classification algorithms in biomedical applications, especially CADx, is an active area of research. This paper proposes constructing rotation forest (RF) ensembles using 20 learners over two clinical datasets, namely lymphography and backache. We propose a new feature selection strategy, based on support vector machines optimized by particle swarm optimization, for finding a relevant and minimal feature subset that yields higher ensemble accuracy. We have quantitatively analyzed the 20 base learners over the two datasets, carried out the experiments with a 10-fold cross-validation leave-one-out strategy, and evaluated the performance of the 20 classifiers using the metrics accuracy (acc), kappa value (K), root mean square error (RMSE) and area under the receiver operating characteristic curve (ROC). The base classifiers achieved average accuracies of 79.96% and 81.71% for the lymphography and backache datasets respectively, while the RF ensembles produced average accuracies of 83.72% and 85.77% for the respective diseases. The paper presents promising results using RF ensembles and provides a new direction towards the construction of reliable and robust medical diagnosis systems.
Analysis On Classification Techniques In Mammographic Mass Data Set - IJERA Editor
Data mining, the extraction of hidden information from large databases, is used to predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. Data-mining classification techniques determine to which group each data instance belongs; they can deal with a wide variety of data, so that large amounts of data can be involved in processing. This paper analyses various data mining classification techniques, such as decision tree induction, Naïve Bayes and k-nearest neighbour (KNN) classifiers, on a mammographic mass dataset.
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP... - cscpconf
In search based test data generation, the problem of test data generation is reduced to that of
function minimization or maximization.Traditionally, for branch testing, the problem of test data
generation has been formulated as a minimization problem. In this paper we define an alternate
maximization formulation and experimentally compare it with the minimization formulation. We
use a genetic algorithm as the search technique and in addition to the usual genetic algorithm
operators we also employ the path prefix strategy as a branch ordering strategy and memory and elitism. Results indicate that there is no significant difference in the performance or the coverage obtained through the two approaches and either could be used in test data generation when coupled with the path prefix strategy, memory and elitism.
Possibilistic Fuzzy C Means Algorithm For Mass classificaion In Digital Mammo...IJERA Editor
Mammography is an effective imaging modality of breast cancer abnormalities detection. Survival rate of breast cancer treatment can be increased via early detection of mammography. However detecting the mass in the early stage is a tough task for radiologist. Detection of suspicious abnormalities is a continual task. Out of thousand cases only 3 to 4 are analyzed as cancerous by a radiologist and thus abnormality may be left out. 10-30% of cancers are failed to detect by radiologist. Computer Aided Diagnosis helps the radiologists to detect abnormalities earlier than traditional procedures. Because of some negligence in capturing device, the image may be affected by noise this leads to fault diagnosis. Preprocessing can remove this unwanted noise. In this paper features such as entropy, circularity, edge detection, and correlation are extracted from the image to distinguish normal and abnormal regions of a mammogram. Classification and detection of mammogram can be done by Possibilistic Fuzzy C Means algorithm and Support Vector Machine using extracted features.
Classification AlgorithmBased Analysis of Breast Cancer DataIIRindia
The classification algorithms are very frequently used algorithms for analyzing various kinds of data available in different repositories which have real world applications. The main objective of this research work is to find the performance of classification algorithms in analyzing Breast Cancer data via analyzing the mammogram images based its characteristics.Different attribute values of cancer affected mammogram images are considered for analysis in this work. The Patients food habits, age of the patients, their life styles, occupation, their problem about the diseases and other information are taken into account for classification. Finally, performance of classification algorithms J48, CART and ADTree are given with its accuracy. The accuracy of taken algorithms is measured by various measures like specificity, sensitivity and kappa statistics (Errors).
Controlling informative features for improved accuracy and faster predictions...Damian R. Mingle, MBA
Identification of suitable biomarkers for accurate prediction of phenotypic outcomes is a goal for personalized medicine. However, current machine learning approaches are either too complex or perform poorly.
For more information:
http://societyofdatascientists.com/controlling-informative-features-for-improved-accuracy-and-faster-predictions-in-omentum-cancer-models/?src=slideshare
ENHANCED BREAST CANCER RECOGNITION BASED ON ROTATION FOREST FEATURE SELECTION ALGORITHM
224 Computer Science & Information Technology (CS & IT)
evolution process. It is a stochastic method that works on the principles of natural selection and natural genetics, and it can be applied to a wide range of problems. The basic steps involved in GA are detailed in [2, 3, 7, 9, 11, 16]. The GA has gained popularity owing to its inherent advantages [15]: it imposes few mathematical requirements on the problem [17, 19, 20], its evolution operators perform a global search, and it adapts readily to hybridization with domain-dependent heuristics.
2. Modified Genetic Search Algorithm
Because the GA finds solutions through an evolutionary process [18], it is not drawn toward good solutions so much as it moves away from bad ones, so the search may lead to a dead end. In addition, the number of iterations, or generations, required is very high. The idea of the proposed MGSA is to identify and eliminate the bad chromosomes using the merit and scaled variables so that the search space is reduced. Once the search space is narrowed to prospective chromosomes, the search process optimizes more effectively. Here the merit and scaled variables refer to the classification error rates of chromosomes, and bad chromosomes are those individuals that lead to a dead end. A brief summary of MGSA follows:
1) Initial population generation: The parameters of the model to be optimized are encoded as
chromosomes. A randomly chosen initial population of individuals is represented as binary
strings of fixed length.
2) Fitness function: In each generation for which the GA is run, the fitness of each individual,
i.e. its closeness to optimality, is determined. For the attribute selection problem we define the
linear function
f' = af + b, where f' and f are the scaled and raw fitness values of a chromosome and a, b are constants.
3) Detection and elimination: Chromosomes with a high classification error rate, as determined
by the merit and scaled variables, are detected and eliminated so that they cannot become
candidate chromosomes in the next generation. Filtering out the bad chromosomes in this way
reduces the number of iterations in subsequent generations.
4) Selection: A pair of best-fit individuals, i.e. those with the least classification error rate, is
selected from the residual population.
5) Crossover: In the reproductive stage, two individuals are crossed with probability
Pc to generate a new pair of offspring.
6) Mutation: A single-point alteration of the bit string from zero to one (or vice versa) is
applied with probability Pm to selected individuals; this helps avoid premature convergence.
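Under the assumption of a toy error-rate function (in the paper the merit comes from a Naïve Bayes wrapper), steps 1)–6) above can be sketched roughly as follows; all helper names and constants are illustrative, not from the paper:

```python
import random

def scaled_fitness(raw_error, a=-0.747, b=0.0996):
    """Step 2: linear scaling f' = a*f + b; step 3 relies on the clip to zero,
    which marks the worst chromosomes for deletion."""
    return max(0.0, a * raw_error + b)

def mgsa(error_rate, n_bits=9, pop_size=20, pc=0.6, pm=0.033,
         generations=30, seed=7):
    rng = random.Random(seed)
    # Step 1: initial population of fixed-length binary strings.
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(((scaled_fitness(error_rate(c)), c) for c in pop),
                        key=lambda s: -s[0])
        # Step 3: detect and delete chromosomes whose scaled fitness is zero.
        survivors = [s for s in scored if s[0] > 0.0]
        if len(survivors) < 2:          # always keep a breeding pair
            survivors = scored[:2]
        # Step 4: select the best-fit pair from the residual population.
        p1, p2 = survivors[0][1], survivors[1][1]
        children = []
        while len(children) < pop_size:
            c1, c2 = p1[:], p2[:]
            if rng.random() < pc:       # step 5: single-point crossover
                cut = rng.randrange(1, n_bits)
                c1 = p1[:cut] + p2[cut:]
                c2 = p2[:cut] + p1[cut:]
            for c in (c1, c2):          # step 6: single-bit mutation
                if rng.random() < pm:
                    c[rng.randrange(n_bits)] ^= 1
                children.append(c)
        pop = children[:pop_size]
    return min(pop, key=error_rate)     # best attribute subset found

# Toy merit: fraction of bits differing from a hypothetical "ideal" subset.
ideal = [1, 1, 1, 1, 0, 1, 1, 1, 0]
toy_error = lambda c: sum(x != y for x, y in zip(c, ideal)) / len(ideal)
best = mgsa(toy_error)
```

The clip to zero in the fitness scaling is what implements the detection-and-deletion step: any chromosome whose scaled fitness collapses to zero never reaches the selection stage.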
3. Experimental Design and Results
The experimental setup exploits the Modified Genetic Search algorithm to optimize the subset
of inputs used to classify patients as having either benign or malignant forms of breast cancer
tumors, implemented in Java. Benign tumors are non-progressive and harmless, whereas
malignant tumors spread rapidly and are very harmful. The real data are adapted from the
Breast Cancer database [1], which contains 683 instances after the deletion of 16 records with
one or more missing values. It contains nine numeric inputs and a target attribute Class which
takes the values 2 (benign) and 4 (malignant). The ten attributes are Clump_Thickness,
Cell_Size_Uniformity, Cell_Shape_Uniformity, Marginal_Adhesion, Single_Epi_Cell_Size,
Bare_Nuclei, Bland_Chromatin, Normal_Nucleoli, Mitoses and Class. These attributes are used
in the pathology report on fine needle aspirations to determine whether a lump in a breast could
be malignant (cancerous) or benign (non-cancerous). The details of these parameters are
discussed in [15]. The data distribution for Class indicates that 65% (445/683) of the records
have value 2 (benign) while the remaining 35% (238/683) have value 4 (malignant). The details
of the feature ranking are shown below.
3.1. Feature selection
Rotation forest is an ensemble classifier based on feature extraction. The heuristic component is
the feature extraction applied to subsets of features and the rebuilding of a full feature set for
each classifier. We have used an ensemble consisting of a multinomial logistic regression model
with a ridge estimator as the classifier [21] and Haar wavelets as the projection filter. To our
knowledge, this ensemble is used here for the first time. We have found experimentally that our
ensemble gives better results than the ensemble discussed by Rodriguez et al. [22].
The proposed ensemble is detailed below.
Let x = [x1, ..., xn]^T be an instance described by n variables and let X be the training sample in
the form of an N x n matrix. Let the vector Y = [y1, ..., yN] hold the class labels, where yj takes
a value from the set of class labels. Let D1, ..., DL be the classifiers in the ensemble and F the
feature set. In ensemble learning it is necessary to choose L in advance and to train the
classifiers in parallel.
The following steps prepare the training sample for classifier Di:
1. Split F randomly into K disjoint or intersecting subsets. To maximize the degree of
diversity, disjoint subsets are chosen.
2. Let Fi,j be the jth subset of features used to train classifier Di. For every such subset,
draw a bootstrap sample of objects of size 75 percent by randomly selecting a subset of
classes. Run the Haar wavelet on only the M features in Fi,j and the selected subset of X,
and store the coefficients of the Haar wavelet components, ai,j[1], ..., ai,j[Mj], each of
size M x 1.
3. Arrange the obtained coefficient vectors in a sparse "rotation" matrix Ri of
dimensionality n x Sum(Mj). Compute the training sample for classifier Di by rearranging
the columns of Ri; denote the rearranged rotation matrix Ri^a (of size n x n). The training
sample for classifier Di is then X Ri^a.
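The rotation-matrix construction in steps 1–3 can be illustrated with a minimal Python sketch. It assumes disjoint feature subsets of size two and a fixed 2x2 orthonormal Haar basis as the projection filter, and it omits the 75-percent bootstrap step for brevity; the toy data and helper names are not from the paper:

```python
import math
import random

# Fixed 2x2 orthonormal Haar basis used as the projection filter per subset.
HAAR2 = [[1 / math.sqrt(2),  1 / math.sqrt(2)],
         [1 / math.sqrt(2), -1 / math.sqrt(2)]]

def rotation_matrix(n_features, k_subsets, rng):
    """Build the sparse block 'rotation' matrix R_i (n x n) from Haar
    components, with blocks already placed at the original feature indices
    (i.e. the rearranged matrix R_i^a of step 3)."""
    order = list(range(n_features))
    rng.shuffle(order)                          # step 1: random disjoint subsets
    size = n_features // k_subsets
    R = [[0.0] * n_features for _ in range(n_features)]
    for j in range(k_subsets):
        subset = order[j * size:(j + 1) * size]  # F_{i,j}
        # step 2: store the Haar basis vectors as the block's entries
        for a, fa in enumerate(subset):
            for b, fb in enumerate(subset):
                R[fa][fb] = HAAR2[a][b]
    return R

def rotate(X, R):
    """Training sample for classifier D_i: the matrix product X * R."""
    n = len(R)
    return [[sum(row[k] * R[k][c] for k in range(n)) for c in range(n)]
            for row in X]

rng = random.Random(0)
X = [[rng.random() for _ in range(8)] for _ in range(5)]   # toy 5 x 8 sample
R = rotation_matrix(8, 4, rng)
Xr = rotate(X, R)
```

Because each 2x2 Haar block is orthonormal, R is orthonormal as a whole, so the transformation rotates the data without distorting it.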
Table 1: Attribute ranking by the Rotation forest ensemble comprising a multinomial logistic
regression model with a ridge estimator and Haar wavelets as projection filter, using 10-fold
cross validation. The ranks are generated with the Ranker search method.

average merit     average rank   attribute
92.704 +- 0.479   1.3 +- 0.46    Cell_Size_Uniformity
92.275 +- 0.455   1.7 +- 0.46    Cell_Shape_Uniformity
90.351 +- 0.524   3   +- 0       Bland_Chromatin
88.968 +- 0.697   4   +- 0       Bare_Nuclei
87.554 +- 0.251   5.1 +- 0.3     Single_Epi_Cell_Size
86.663 +- 0.324   6.6 +- 0.66    Normal_Nucleoli
86.409 +- 0.442   7   +- 0.89    Marginal_Adhesion
86.123 +- 0.444   7.3 +- 0.9     Clump_Thickness
78.97  +- 0.321   9   +- 0       Mitoses
3.2. Case 1
Naïve Bayes [5] is applied to the dataset with 10-fold cross validation [10] and achieves a
classification accuracy of 96.3397% (658/683). Details are shown in Table 2.
The confusion matrix [4, 10] in Table 2 contains the analytical details of the classification,
where all nine attributes are input. The classifier's performance is evaluated from the data in the
matrix for the two-class problem. Accuracy is also measured by the area under the Receiver
Operating Characteristic (ROC) curve [4], with the TP rate on the Y-axis and the FP rate on the
X-axis; the area ranges from zero to one, and an area of 1 represents a perfect test. For the
benign class, the ROC plot with area 0.99 is shown in Fig 1 for case 1, and the cost/benefit
analysis is shown in Fig 2, where the gain is 0.7. When the number of negative instances is
relatively higher than the number of positive instances, the F-measure gives a better indication
of the accuracy level. It is computed as

F = ((B^2 + 1) * P * TP) / (B^2 * P + TP)

where P is precision, TP is the TP rate (recall), and B varies from zero to infinity to balance the
weight assigned to TP and P. The higher the F-measure, the higher the accuracy level of the
classifier; its value ranges from zero to one. Tables 2 and 3 show the confusion matrix and the
accuracy parameters respectively.
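As a sanity check on these definitions, the benign-class recall, F-measure (with B = 1) and overall accuracy can be recomputed from the Table 2 counts; the variable names below are illustrative:

```python
# Benign-class view of the case-1 confusion matrix (Table 2).
tp, fn = 425, 20        # actual benign: predicted benign / predicted malignant
fp, tn = 5, 233         # actual malignant: predicted benign / predicted malignant

precision = tp / (tp + fp)
recall = tp / (tp + fn)                 # the TP rate
beta = 1.0
f_measure = ((beta**2 + 1) * precision * recall) / (beta**2 * precision + recall)

# Overall accuracy over all 683 instances.
accuracy = (tp + fn + fp + tn) and (tp + tn) / (tp + fn + fp + tn)
```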
3.3. Case 2
In case 1 all nine attributes are considered, but in real-world data irrelevant, redundant or noisy
attributes are a common phenomenon, which impairs the result. The learning scheme Wrapper
subset evaluation [8] with Naïve Bayes classification [8] is now integrated with the Genetic
Search algorithm [5]. After the filtration process only seven attributes are selected as relevant;
the attributes Single_Epi_Cell_Size and Mitoses are eliminated. The parameters of the Genetic
algorithm [23] include a population size of n = 20 chromosomes, crossover probability Pc = 0.6
and mutation probability Pm = 0.033. As specified, the Genetic Search algorithm creates an
initial set of 20 chromosomes. The records are then reclassified using Naïve Bayes with 10-fold
cross validation [13]; however, this time only seven attributes are input to the classifier. The
selected attributes are 1, 2, 3, 4, 6, 7, 8, i.e. Clump_Thickness, Cell_Size_Uniformity,
Cell_Shape_Uniformity, Marginal_Adhesion, Bare_Nuclei, Bland_Chromatin, Normal_Nucleoli.
In Table 6, every subset is a chromosome and the merit is the fitness score reported by Naïve
Bayes, which is equal to the corresponding classification error rate. Each chromosome's scaled
fitness is shown in the scaled column, where a linear scaling technique is used to scale the
values. By definition, the raw and scaled fitness values have the linear relationship

f' = af + b (1)

where f' and f are the scaled and raw fitness values respectively. The constants a and b are
chosen such that

f'avg = favg and f'max = K f'avg. (2)

The constant K represents the expected number of copies of the fittest individual in the
population. Computing the average fitness values from Table 6 gives f'avg = 0.055755 and
favg = 0.055753. To find a and b, the fitness values from the last two rows of Table 6 are chosen
to form the simultaneous equations:

0.05911 = 0.05417a + b (3)
0.06677 = 0.04392a + b (4)

Solving equations (3) and (4) gives a = -0.747317 and b = 0.099592. Equation (2) is used to
determine K, i.e. K = f'max / f'avg = 0.07006/0.056999 = 1.2291.
Observe that in the fifth row of Table 6, f' = 0. The raw fitness value of 0.13324 corresponds to
the largest classification error in the population, produced by chromosome {8}; as a result, f' is
mapped to zero to avoid the possibility of a negatively scaled fitness. The chromosome {8} is
therefore detected and deleted so that it does not propagate to the next generation.
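The scaling constants can be verified by solving equations (3) and (4) directly; a small Python check (variable names are illustrative):

```python
# Two (scaled, raw) fitness pairs taken from equations (3) and (4).
f1_scaled, f1_raw = 0.05911, 0.05417
f2_scaled, f2_raw = 0.06677, 0.04392

# Solve f' = a*f + b for a and b by elimination.
a = (f1_scaled - f2_scaled) / (f1_raw - f2_raw)
b = f1_scaled - a * f1_raw

# K = f'_max / f'_avg with the values used in the text.
k = 0.07006 / 0.056999
```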
The improved classifier has an accuracy level of 96.9253%, which indicates that the second
model outperforms the first by 0.5856%, where the input to the latter model is only the seven
attributes indicated by the ROC area. For the benign class, the ROC plot with area 0.993 is
shown in Fig 3 and the cost/benefit analysis in Fig 4, where the gain is 0.7; note the smoothness
achieved in the graph. Moreover, reducing the number of attributes in the second model did not
affect the gain in the cost/benefit analysis. The plot matrix of the selected attributes is shown in
Fig 5. That is, the classification accuracy improved when only seven of the nine attributes are
specified in the input. Though the increase in accuracy is not dramatic, it shows that the
Modified Genetic Search algorithm is an appropriate algorithm for the attribute selection
process.
3.4. Case 3
Now, to validate the result obtained in case 2, Principal Component Analysis (PCA) [6] with
attribute selection [14] is applied to the data set consisting of nine attributes. It can be noticed
that the accuracy of the Modified Genetic Search algorithm integrated with Naïve Bayes is
supported mathematically by comparing its results with the output of the PCA: as in the former
result, only seven attributes are selected in the PCA analysis. The correlation matrix [14] of the
entire data set, which gives the correlation between all pairs of attributes, is depicted in Table 7.
The eigenvectors shown in Table 8 give the relationships between attributes, where negative
and positive values indicate inverse and direct proportionality respectively. The corresponding
ranks of the attributes are shown in Table 9. It can therefore be inferred that the strength of the
Modified Genetic Search algorithm is confirmed mathematically. The output of the PCA is
shown in Tables 7, 8 and 9.
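As a minimal illustration of how PCA ranks attributes, the sketch below uses power iteration to find the dominant eigenvector of a correlation matrix in pure Python; the 3x3 toy matrix is an assumed stand-in for the 9x9 matrix of Table 7:

```python
def dominant_eigenvector(corr, iters=200):
    """Power iteration: repeatedly multiply by the matrix and renormalize,
    converging to the eigenvector of the largest eigenvalue."""
    n = len(corr)
    v = [1.0] * n
    for _ in range(iters):
        w = [sum(corr[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# Toy correlation matrix (symmetric, ones on the diagonal).
C = [[1.0, 0.8, 0.3],
     [0.8, 1.0, 0.4],
     [0.3, 0.4, 1.0]]
pc1 = dominant_eigenvector(C)

# Attributes are ranked by the magnitude of their loadings on the first
# principal component.
ranking = sorted(range(3), key=lambda i: -abs(pc1[i]))
```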
Fig 1. ROC area = 0.99 for case 1.
Fig 2. Cost/benefit analysis for gain = 0.7 in case 1 with 9 attributes.
Fig 3. ROC area = 0.993 for case 2.
Fig 4. Cost/benefit analysis for gain = 0.7 with 7 attributes in case 2.
Table 2: Confusion matrix for case 1

                       Predicted
                       Benign   Malignant
  Actual Benign          425          20
  Actual Malignant         5         233

Table 3: Accuracy by class for case 1

  Class             TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area
  Benign            0.955    0.025    0.986      0.955   0.97       0.99
  Malignant         0.975    0.045    0.921      0.975   0.947      0.985
  Weighted Average  0.962    0.032    0.963      0.962   0.962      0.989

Table 4: Confusion matrix for case 2

                       Predicted
                       Benign   Malignant
  Actual Benign          427          18
  Actual Malignant         3         235

Table 5: Accuracy by class for case 2

  Class             TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area
  Benign            0.96     0.021    0.988      0.96    0.974      0.993
  Malignant         0.979    0.04     0.928      0.979   0.953      0.991
  Weighted Average  0.966    0.028    0.967      0.966   0.967      0.992
Table 6: Initial population characteristics for the 20 chromosomes
Merit Scaled Subset
0.05417 0.05911 4 6 7 9
0.05183 0.06086 1 2 3 4 7 9
0.03953 0.07006 1 2 3 4 6 9
0.05798 0.05627 6 7 8
0.13324 0 8
0.04012 0.06962 2 3 5 6 7 8
0.04656 0.0648 2 6 7
0.09195 0.03087 5 8
0.07174 0.04598 2
0.04539 0.06568 1 6 8 9
0.04246 0.06787 3 4 5 6 7 8
0.04158 0.06853 2 4 6 7 8
0.08433 0.03656 4 5
Fig 5. Plot matrix distribution of selected attributes in MGSA (Class color red: Malignant, blue: Benign).
4. Conclusion
GA is a nature-inspired computational methodology for solving complex optimization problems,
and it exhibits outstanding performance, especially on nondeterministic polynomial (NP)
combinatorial optimization problems. In this paper, an enhanced performance accuracy of
0.5856% is achieved by the proposed Modified Genetic Search algorithm over the traditional
model, demonstrated on a breast cancer data set. The results are also validated mathematically
using PCA; both the proposed method and PCA eliminated two attributes from the data set. An
extensive experiment has shown that improving the initial population is effective in the
optimization process. The proposed MGSA model addresses the local-optima problem using
merit and scaled variables, in which bad individual chromosomes are detected and deleted,
reducing the search space and further iterations and thereby improving efficiency. Feature
selection techniques fall into major categories such as filtering methods, wrapper subset
evaluation and embedded models; the newly introduced method provided better accuracy, as
shown in the results section. The software reliability of a Computer Aided Diagnosis (CADx)
system is improved by the use of machine learning ensembles.
Although this paper considers only nine attributes, there are still 511 possible attribute subsets
that can be given as input. If the number of attributes increases to, say, one hundred, there are
about 1.27 x 10^30 possible attribute subsets to choose from. In such a hard situation, MGSA
may prove helpful in determining the optimal subset. From the experimental results, it is
concluded that GA is an effective algorithm that can be exploited in complex situations. It can
be used in various applications such as optimizing the weights of links in neural networks.
Further, the performance of GA can be improved by parallel implementation using MPI and
other techniques, and it can be applied to a vast field of emerging applications. As future work,
the methods need to be tested over larger datasets.
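The subset counts quoted above follow from the 2^n - 1 non-empty subsets of n attributes, which is easy to confirm:

```python
# Number of non-empty attribute subsets for n attributes is 2**n - 1.
subsets_9 = 2**9 - 1        # the nine-attribute case
subsets_100 = 2**100 - 1    # the hypothetical hundred-attribute case
```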
Conflict of interest
None
References
[1] Breast cancer dataset, compiled by Dr. William H. Wolberg, University of Wisconsin Hospitals,
Madison, WI, obtained January 8, 1991.
[2] Chen Lin, “An Adaptive Genetic Algorithm Based on Population Diversity Strategy”, Third
International Conference on Genetic and Evolutionary Computing. WGEC 2009. Page(s): 93 – 96.
[3] David E. Goldberg, “Genetic algorithms in search, optimization and machine learning”, Addison-
Wesley, 1989.
[4] http://www2.cs.uregina.ca/~hamilton/courses/831/notes/confusion_matrix/confusion_matrix.html
[5] Jiawei Han, Micheline Kamber, “Data Mining Concepts and Techniques”, Morgan Kaufmann
Publishers, 2001.
[6] Joseph F. Hair, Jr., William C. Black, Barry J. Babin, Rolph E. Anderson, Ronald L. Tatham,
“Multivariate Data Analysis”, Sixth Edition, 2007, Pearson Education.
[7] Kosiński, W.; Kotowski, S.; Michalewicz, Z, “On convergence and optimality of genetic algorithms”,
2010 IEEE Congress on Evolutionary Computation (CEC), Page(s): 1 – 6.
[8] Mallik, R.K. “The Uniform Correlation Matrix and its Application to Diversity” ,2007 IEEE
Transactions on Wireless Communications, Volume: 6 , Issue: 5 , Page(s): 1619 - 1625
[9] Melanie Mitchell, An Introduction to Genetic Algorithms, MIT Press, Cambridge, MA, 2002; first
edition, 1996.
[10] Nadav David Marom, Lior Rokach, Armin Shmilovici, ”Using the Confusion Matrix for Improving
Ensemble Classifiers”, 2010 IEEE 26-th Convention of Electrical and Electronics Engineers in Israel.
[11] Nor Ashidi Mat Isa, Esugasini Subramaniam, Mohd Yusoff Mashor and Nor Hayati Othman “Fine
Needle Aspiration Cytology Evaluation for Classifying Breast Cancer Using Artificial Neural
Network”, American Journal of Applied Sciences 4 (12): 999- 1008, ISSN 1546-9239, 2007 Science
Publications
[12] Pengfei Guo, Xuezhi Wang , Yingshi Han, “The Enhanced Genetic Algorithms for the Optimization
Design”, 2010 IEEE 3rd International Conference on Biomedical Engineering and Informatics, pages:
2990-2994.
[13] Richard R. Picard and R. Dennis Cook, “Cross-Validation of Regression Models”, Journal of the
American Statistical Association, Vol. 79, No. 387 (Sep., 1984), pp. 575-583,
URL: http://www.jstor.org/stable/2288403.
[14] Ron Kohavi, George H. John (1997),“Wrappers for feature subset selection”, Artificial Intelligence.
97(1-2):273-324.
[15] Sahu, S.S.; Panda, G.; Nanda, S.J., “Improved protein structural class prediction using genetic
algorithm and artificial immune system” , World Congress on Nature & Biologically Inspired
Computing, 2009. NaBIC 2009, Page(s): 731 – 735.
[16] Zhi-Qiang Chen, Rong-Long Wang, “An Efficient Real-coded Genetic Algorithm for Real-Parameter
Optimization”, 2010 IEEE Sixth International Conference on Natural Computation, pages: 2276-2280.
[17] Mandal, I., Sairam, N. Accurate Prediction of Coronary Artery Disease Using Reliable Diagnosis
System (2012) Journal of Medical Systems, pp. 1-21.
[18] Mandal, I., Sairam, N. Enhanced classification performance using computational intelligence (2011)
Communications in Computer and Information Science, 204 CCIS, pp. 384-391.
[19] Mandal, I. Software reliability assessment using artificial neural network (2010) ICWET 2010 -
International Conference and Workshop on Emerging Trends in Technology 2010, Conference
Proceedings, pp. 698-699. Cited 1 time.
[20] Mandal, I.A low-power content-addressable memory (CAM) using pipelined search scheme (2010)
ICWET 2010 - International Conference and Workshop on Emerging Trends in Technology 2010,
Conference Proceedings, pp. 853-858.
[21] le Cessie, S., van Houwelingen, J.C. (1992). Ridge Estimators in Logistic Regression. Applied
Statistics. 41(1):191-201.
[22] Juan J. Rodriguez, Ludmila I. Kuncheva, Carlos J. Alonso (2006). Rotation Forest: A new classifier
ensemble method. IEEE Transactions on Pattern Analysis and Machine Intelligence. 28(10):1619-1630.
[23] Larose, D. T. (2006) Genetic Algorithms, in Data Mining Methods and Models, John Wiley & Sons,
Inc., Hoboken, NJ, USA. doi: 10.1002/0471756482.ch6
Authors
Indrajit Mandal is working as research scholar at School of Computing,
SASTRA University, India. Based on his research work, he has received two
Gold medal awards in Computer Science & Engineering discipline from
National Design and Research forum, The Institution of Engineers (India). He
has won several prizes from IITs, NITs in technical paper presentations held at
national level. He has published research papers in international peer reviewed
journals and international conferences. His research interest includes Machine
learning, Applied statistics, Computational intelligence, Software reliability.
Sairam N is working as Professor in School of Computing, SASTRA
University, India and has teaching experience of 15 years. He has published
several research papers at national and international journals and conferences.
His research interest includes Soft Computing, Theory of Computation,
Parallel Computing and Algorithms, Data Mining.