Microarray technology is a useful technique for measuring the expression of thousands of genes
simultaneously. One of the challenges in classifying cancer from high-dimensional gene expression data
is selecting a minimal number of relevant genes that maximize classification accuracy. Because of the
distinct characteristics inherent to specific cancerous gene expression profiles, developing flexible and
robust gene identification methods is essential. Many gene selection methods and their
corresponding classifiers have been proposed. In the proposed method, a single gene with high
class-discrimination capability is selected, and classification rules for cancer are generated from gene
expression profiles. The method first computes an importance factor for each gene in the experimental cancer
dataset by counting the linguistic terms (defined over different discrete quantities) that have high
class-discrimination capability according to their degree of dependency on the classes. The genes with the
highest importance factors are then selected to form the initial reduct. Next, the traditional k-means
clustering algorithm is applied to each gene in the initial reduct, and the misclassification
error of each individual gene is computed. The final reduct is formed by selecting the genes with
the fewest misclassification errors. A classifier is then constructed from decision rules induced
by the selected (single) important genes on the training dataset and used to classify cancerous and
non-cancerous samples in the experimental test dataset. The proposed method is tested on four publicly
available cancer gene expression datasets. In most cases, accurate classification outcomes are obtained
using just single important genes, and the genes identified are highly correlated with the pathogenesis
of cancer. To demonstrate the robustness of the proposed method, its outcomes (correctly classified
instances) are also compared with those of some existing well-known classifiers.
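The k-means step described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes two classes, starts the centroids at the minimum and maximum expression value (the abstract does not state an initialisation), and uses made-up gene names and values. The importance-factor and rule-induction steps are not shown.

```python
from typing import Dict, List, Sequence, Tuple


def kmeans_1d(values: Sequence[float], iters: int = 100) -> Tuple[List[int], Tuple[float, float]]:
    """Two-cluster k-means on one gene's expression values.

    Centroids start at the minimum and maximum value (an assumption made
    here for determinism; the source does not specify an initialisation).
    """
    lo, hi = float(min(values)), float(max(values))
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: each sample joins the nearer centroid.
        labels = [0 if abs(v - lo) <= abs(v - hi) else 1 for v in values]
        c0 = [v for v, lab in zip(values, labels) if lab == 0]
        c1 = [v for v, lab in zip(values, labels) if lab == 1]
        # Update step: recompute centroids; keep the old one if a cluster is empty.
        new_lo = sum(c0) / len(c0) if c0 else lo
        new_hi = sum(c1) / len(c1) if c1 else hi
        if new_lo == lo and new_hi == hi:
            break
        lo, hi = new_lo, new_hi
    return labels, (lo, hi)


def misclassification_error(labels: Sequence[int], classes: Sequence[int]) -> int:
    """Number of samples whose cluster disagrees with the class label,
    minimised over both ways of mapping the two clusters onto the two classes."""
    mismatches = sum(1 for lab, cls in zip(labels, classes) if lab != cls)
    return min(mismatches, len(labels) - mismatches)


def rank_genes(expression: Dict[str, Sequence[float]], classes: Sequence[int]) -> List[Tuple[int, str]]:
    """Return (error, gene_name) pairs sorted from fewest errors to most."""
    scored = []
    for gene, values in expression.items():
        labels, _ = kmeans_1d(values)
        scored.append((misclassification_error(labels, classes), gene))
    return sorted(scored)


# Toy data: 6 samples, class 0 = normal, class 1 = tumour (hypothetical numbers).
classes = [0, 0, 0, 1, 1, 1]
expression = {
    "gene_A": [1.0, 1.2, 0.9, 5.1, 4.8, 5.3],  # separates the classes cleanly
    "gene_B": [2.0, 5.0, 2.1, 2.2, 5.1, 4.9],  # mixes the classes
}
ranking = rank_genes(expression, classes)
print(ranking)  # [(0, 'gene_A'), (2, 'gene_B')]
```

In this sketch, the genes at the front of `ranking` would be the candidates for the final reduct.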
EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...IJDKP
Over the past few years, the use of microarray technology has spread considerably across many areas of biology, particularly those pertaining to cancers such as leukemia, prostate cancer and colon cancer. The primary bottleneck in properly understanding such datasets lies in their dimensionality, so studying them efficiently and effectively requires a large reduction in their dimension. This study suggests different algorithms and approaches for reducing the dimensionality of such microarray datasets. It exploits the matrix-like structure of microarray data and uses a popular technique called Non-Negative Matrix Factorization (NMF) to reduce the dimensionality, primarily in the field of biological data. Classification accuracies are then compared for these algorithms. This technique gives an accuracy of 98%.
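To make the NMF idea concrete, here is a minimal pure-Python sketch of the standard multiplicative-update rule for factorising a non-negative matrix V (samples by genes) into W (samples by rank) and H (rank by genes). The abstract does not say which NMF variant the paper uses, so the update rule, the toy matrix, the rank and the iteration count are all assumptions made for illustration.

```python
import random


def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5


def nmf(v, rank, iters=200, seed=0, eps=1e-9):
    """Factorise V ~ W H with multiplicative updates.

    Returns W, H and the reconstruction error of the random starting point.
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    start_error = frobenius_error(v, w, h)
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h, start_error


# Toy expression matrix: 4 samples x 6 genes (hypothetical values).
v = [
    [5, 4, 5, 0, 1, 0],
    [4, 5, 4, 1, 0, 0],
    [0, 1, 0, 5, 4, 5],
    [1, 0, 0, 4, 5, 4],
]
w, h, start_error = nmf(v, rank=2)
final_error = frobenius_error(v, w, h)
print(len(w), len(w[0]))  # each sample is now described by 2 values instead of 6
```

The rows of `w` are the reduced representations that would then be passed to a classifier.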
Machine Learning Based Approaches for Cancer Classification Using Gene Expres...mlaij
The classification of different types of tumor is of great importance in cancer diagnosis and drug discovery.
Earlier studies on cancer classification have limited diagnostic ability. The recent development of DNA
microarray technology has made it possible to monitor the expression of thousands of genes simultaneously. Using this
abundance of gene expression data, researchers are exploring the possibilities of cancer classification.
A number of methods have been proposed with good results, but many issues still need to be addressed. This
paper presents an overview of various cancer classification methods and evaluates them
based on their classification accuracy, computational time and ability to reveal gene information. We have
also introduced and evaluated various proposed gene selection methods. Several issues
related to cancer classification are also discussed.
Interrogating differences in expression of targeted gene sets to predict brea...Enrique Moreno Gonzalez
Genomics provides opportunities to develop precise tests for diagnostics, therapy selection and monitoring. From analyses of our studies and those of published results, 32 candidate genes were identified, whose expression appears related to clinical outcome of breast cancer. Expression of these genes was validated by qPCR and correlated with clinical follow-up to identify a gene subset for development of a prognostic test.
A Classification of Cancer Diagnostics based on Microarray Gene Expression Pr...IJTET Journal
Abstract: Pattern Recognition (PR) plays an important role in the field of bioinformatics. PR is concerned with processing raw measurement data by computer to arrive at a prediction that can be used to formulate a decision. The important problems to which pattern recognition is applied have in common that they are too complex to model explicitly. Diverse PR methods are used to analyze, segment and manage high-dimensional microarray gene data for classification. PR is concerned with the development of systems that learn to solve a given problem from a set of instances, each represented by a number of features. Microarray expression technologies make it possible to monitor the expression levels of thousands of genes simultaneously. The large amount of data generated by microarrays has stimulated the development of various computational methods for studying different biological processes through gene expression profiling. Microarray Gene Expression Profiling (MGEP) is important in bioinformatics; it yields high-dimensional data used in clinical applications such as cancer diagnostics and drug design. In this work, a new scheme is developed for classifying unknown malignant tumors into known classes. The new classification scheme transforms the very high-dimensional microarray data into Mahalanobis space before classification. The proposed scheme is evaluated on 10 commonly available cancer gene datasets, which include both binary and multiclass datasets. To improve classification performance, a gene selection method is applied to the datasets as a preprocessing and data extraction step.
Application of Microarray Technology and softcomputing in cancer BiologyCSCJournals
DNA microarray technology has emerged as a boon to the scientific community in understanding the growth and development of life, as well as in exploring the genetic causes of anomalies in the working of the human body. Microarray technology enables biologists to monitor the expression of thousands of genes in a single experiment on a small chip. Extracting useful knowledge and information from these microarrays has attracted the attention of many biologists and computer scientists. Knowledge engineering has revolutionized the way medical data is looked at. Soft computing is a branch of computer science capable of analyzing complex medical data. Advances in microarray-based expression analysis have led to the promise of cancer diagnosis using new molecular-based approaches. Many studies and methodologies have emerged that analyze gene expression data using data mining techniques such as feature selection, classification and clustering, embodying soft computing methods for greater accuracy. This review looks at recent advances in cancer research with DNA microarray technology, data mining and soft computing techniques.
Cancer recognition from dna microarray gene expression data using averaged on...IJCI JOURNAL
Cancer is a leading cause of death, responsible for around 13% of all deaths worldwide. The cancer
incidence rate is growing at an alarming pace. Despite the fact that cancer is preventable and
curable in early stages, the vast majority of patients are diagnosed with cancer very late. Therefore, it is of
paramount importance to prevent and detect cancer early. Nonetheless, conventional methods of detecting
and diagnosing cancer rely solely on skilled physicians, with the help of medical imaging, to detect certain
symptoms that usually appear in the late stages of cancer. The microarray gene expression technology is a
promising technology that can detect cancerous cells in early stages of cancer by analyzing gene
expression of tissue samples. The microarray technology allows researchers to examine the expression of
thousands of genes simultaneously. This paper describes a state-of-the-art machine learning based
approach called averaged one-dependence estimators with subsumption resolution to tackle the problem of
recognizing cancer from DNA microarray gene expression data. To lower the computational complexity
and to increase the generalization capability of the system, we employ an entropy-based gene selection
approach to select relevant genes that are directly responsible for cancer discrimination. The proposed
system has achieved an average accuracy of 98.94% in recognizing and classifying cancer over 11
benchmark cancer datasets. The experimental results demonstrate the efficacy of our framework.
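As an illustration of what an entropy-based gene selection step can look like, the sketch below ranks genes by information gain: the drop in class entropy when samples are split at a threshold on one gene's expression. The abstract does not specify the exact criterion, so the use of information gain, the midpoint thresholds, and the toy gene names and values are assumptions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting the samples at `threshold`."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (left, right) if part)
    return entropy(labels) - remainder


def best_information_gain(values, labels):
    """Largest gain over thresholds placed midway between sorted distinct values."""
    distinct = sorted(set(values))
    thresholds = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max((information_gain(values, labels, t) for t in thresholds), default=0.0)


# Toy data: 4 samples, two classes (hypothetical values).
labels = [0, 0, 1, 1]
genes = {
    "gene_X": [1.0, 2.0, 8.0, 9.0],  # low in class 0, high in class 1
    "gene_Y": [1.0, 8.0, 2.0, 9.0],  # unrelated to the class
}
ranked = sorted(genes, key=lambda g: best_information_gain(genes[g], labels), reverse=True)
print(ranked)  # ['gene_X', 'gene_Y']
```

Keeping only the top-ranked genes shrinks the feature set before the classifier is trained, which is where the reduction in computational cost comes from.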
SCDT: FC-NNC-structured Complex Decision Technique for Gene Analysis Using Fu...IJECEIAES
In the classification of many diseases, accurate gene analysis is needed, for which selecting the most informative genes is very important; this requires a decision technique that works in a complex, ambiguous context. Traditional methods for selecting the most significant genes include statistical analyses such as the 2-Sample T-test (2STT), Entropy, and Signal-to-Noise Ratio (SNR). This paper evaluates gene selection and classification, selecting genes with a structured complex decision technique (SCDT) and classifying them with a fuzzy cluster based nearest neighbor classifier (FC-NNC). The effectiveness of the proposed SCDT and FC-NNC is evaluated using leave-one-out cross-validation (LOOCV) along with sensitivity, specificity, precision and F1-score, against four different classifiers, namely 1) Radial Basis Function (RBF), 2) Multi-layer Perceptron (MLP), 3) Feed Forward (FF) and 4) Support Vector Machine (SVM), on three datasets: DLBCL, Leukemia and Prostate tumor. The proposed SCDT and FC-NNC exhibit superior results and can be considered a more accurate decision mechanism.
GRAPHICAL MODEL AND CLUSTERINGREGRESSION BASED METHODS FOR CAUSAL INTERACTION...ijaia
The early detection of breast cancer, a deadly disease that mostly affects women, is extremely complex because it requires various features of the cell type. An efficient approach to diagnosing breast cancer at an early stage is therefore to apply artificial intelligence, in which machines are given simulated intelligence and programmed to think and act like a human. This allows machines to passively learn and find a pattern, which can be used later to detect any new changes that may occur. In general, machine learning is particularly useful in the medical field, which depends on complex genomic measurements such as the microarray technique, and can increase the accuracy and precision of results. With this technology, doctors can diagnose patients with cancer quickly and apply the proper treatment in a timely manner. The goal of this paper is therefore to propose a robust breast cancer diagnostic system using complex genomic analysis via microarray technology. The system will combine two machine learning methods: K-means clustering and linear regression.
Top downloaded article in academia 2020 - International Journal of Computatio...ijcsity
International Journal of Computational Science and Information Technology (IJCSITY) focuses on complex systems, information and computation using mathematics and engineering techniques. This open access, peer-reviewed journal will act as a major forum for the presentation of innovative ideas, approaches, developments, and research projects in the area of computation theory and applications. It will also serve to facilitate the exchange of information between researchers and industry professionals to discuss the latest issues and advancements in the area of advanced computation and its applications.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
A CLASSIFICATION MODEL ON TUMOR CANCER DISEASE BASED MUTUAL INFORMATION AND F...Kiogyf
A CLASSIFICATION MODEL ON TUMOR CANCER DISEASE BASED MUTUAL INFORMATION AND FIREFLY ALGORITHM
ABSTRACT
Cancer is a globally recognized cause of death. A proper cancer analysis demands the classification of several types of tumor. Investigations into microarray gene expressions seem to be a successful platform for revising genetic diseases. Although standard machine learning (ML) approaches have been efficient in identifying significant genes and in classifying new types of cancer cases, their medical and logical application has faced several drawbacks, such as the limitations of DNA microarray data analysis, which involves a very large number of features and a relatively small number of instances. To obtain reasonable and efficient information from a DNA microarray dataset, there is a need to extend the level of interpretability and the forecasting approach while maintaining a high level of precision. In this work, a novel way of classifying cancer based on gene expression profiles is presented. The method combines the Firefly algorithm and the Mutual Information method. First, Mutual Information is used to select the features, and then the Firefly algorithm is used for feature reduction. Finally, a Support Vector Machine is used to classify cancer into types. The performance of the proposed system was evaluated by using it to classify datasets from colon cancer; the results of the evaluation were compared with some recent approaches.
Keywords: Feature Selection, Firefly Algorithm, Cancer Disease, Mutual Information
Breast cancer is the leading cause of death for women worldwide. Cancer can be discovered early, lowering the rate of death. Machine learning techniques are an active field of research, and they have been shown to be helpful in cancer prediction and early detection. The primary purpose of this research is to identify which machine learning algorithms are the most successful in predicting and diagnosing breast cancer, according to five criteria: specificity, sensitivity, precision, accuracy, and F1 score. The project was carried out in the Anaconda environment, using Python's NumPy and SciPy numerical and scientific libraries as well as matplotlib and Pandas. In this study, the Wisconsin diagnostic breast cancer dataset was used to evaluate eleven machine learning classifiers: decision tree, quadratic discriminant analysis, AdaBoost, Bagging meta estimator, extremely randomized trees, Gaussian process classifier, Ridge, Gaussian naive Bayes, k-nearest neighbors, multilayer perceptron, and support vector classifier. During performance analysis, extremely randomized trees outperformed all other classifiers with an F1-score of 96.77% after data collection and data analysis.
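The five criteria named above follow directly from the confusion-matrix counts of a binary classifier. The sketch below uses the standard definitions; the labels and predictions are made-up numbers, not results from the study, and the function names are illustrative.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Standard metrics; assumes every denominator is non-zero."""
    sensitivity = tp / (tp + fn)  # recall: share of positives that were found
    specificity = tn / (tn + fp)  # share of negatives correctly rejected
    precision = tp / (tp + fp)    # share of positive predictions that are right
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical example: 10 malignant (1) and 10 benign (0) samples.
y_true = [1] * 10 + [0] * 10
y_pred = [1] * 8 + [0] * 2 + [1] * 1 + [0] * 9
metrics = classification_metrics(*confusion_counts(y_true, y_pred))
print(metrics)
```

Here the counts are 8 true positives, 2 false negatives, 1 false positive and 9 true negatives, so sensitivity is 0.8, specificity is 0.9 and the F1 score is 16/19.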
With this technology, doctors can easily diagnose patients with cancer quickly and apply the proper
treatment in a timely manner. Therefore, the goal of this paper is to address and propose a robust Breast
Cancer diagnostic system using complex genomic analysis via microarray technology. The system will
combine two machine learning methods, K-means cluster, and linear regression.
Breast cancer is the leading cause of death for women worldwide. Cancer can be discovered early, lowering the rate of death. Machine learning techniques are a hot field of research, and they have been shown to be helpful in cancer prediction and early detection. The primary purpose of this research is to identify which machine learning algorithms are the most successful in predicting and diagnosing breast cancer, according to five criteria: specificity, sensitivity, precision, accuracy, and F1 score. The project is finished in the Anaconda environment, which uses Python's NumPy and SciPy numerical and scientific libraries as well as matplotlib and Pandas. In this study, the Wisconsin diagnostic breast cancer dataset was used to evaluate eleven machine learning classifiers: decision tree, quadratic discriminant analysis, AdaBoost, Bagging meta estimator, Extra randomized trees, Gaussian process classifier, Ridge, Gaussian nave Bayes, k-Nearest neighbors, multilayer perceptron, and support vector classifier. During performance analysis, extremely randomized trees outperformed all other classifiers with an F1-score of 96.77% after data collection and data analysis.
USING DATA MINING TECHNIQUES FOR DIAGNOSIS AND PROGNOSIS OF CANCER DISEASEIJCSEIT Journal
Breast cancer is one of the leading cancers for women in developed countries including India. It is the
second most common cause of cancer death in women. The high incidence of breast cancer in women has
increased significantly in the last years. In this paper we have discussed various data mining approaches
that have been utilized for breast cancer diagnosis and prognosis. Breast Cancer Diagnosis is
distinguishing of benign from malignant breast lumps and Breast Cancer Prognosis predicts when Breast
Cancer is to recur in patients that have had their cancers excised. This study paper summarizes various
review and technical articles on breast cancer diagnosis and prognosis also we focus on current research
being carried out using the data mining techniques to enhance the breast cancer diagnosis and prognosis.
Classification AlgorithmBased Analysis of Breast Cancer DataIIRindia
The classification algorithms are very frequently used algorithms for analyzing various kinds of data available in different repositories which have real world applications. The main objective of this research work is to find the performance of classification algorithms in analyzing Breast Cancer data via analyzing the mammogram images based its characteristics.Different attribute values of cancer affected mammogram images are considered for analysis in this work. The Patients food habits, age of the patients, their life styles, occupation, their problem about the diseases and other information are taken into account for classification. Finally, performance of classification algorithms J48, CART and ADTree are given with its accuracy. The accuracy of taken algorithms is measured by various measures like specificity, sensitivity and kappa statistics (Errors).
Cancer prognosis prediction using balanced stratified samplingijscai
High accuracy in cancer prediction is important to improve the quality of the treatment and to improve the
rate of survivability of patients. As the data volume is increasing rapidly in the healthcare research, the
analytical challenge exists in double. The use of effective sampling technique in classification algorithms
always yields good prediction accuracy. The SEER public use cancer database provides various prominent
class labels for prognosis prediction. The main objective of this paper is to find the effect of sampling
techniques in classifying the prognosis variable and propose an ideal sampling method based on the
outcome of the experimentation. In the first phase of this work the traditional random sampling and
stratified sampling techniques have been used. At the next level the balanced stratified sampling with
variations as per the choice of the prognosis class labels have been tested. Much of the initial time has been
focused on performing the pre-processing of the SEER data set. The classification model for
experimentation has been built using the breast cancer, respiratory cancer and mixed cancer data sets with
three traditional classifiers namely Decision Tree, Naïve Bayes and K-Nearest Neighbour. The three
prognosis factors survival, stage and metastasis have been used as class labels for experimental
comparisons. The results shows a steady increase in the prediction accuracy of balanced stratified model
as the sample size increases, but the traditional approach fluctuates before the optimum results.
A Review on Data Mining Techniques for Prediction of Breast Cancer RecurrenceDr. Amarjeet Singh
The most common type of cancer in women
worldwide is the Breast Cancer. Breast cancer may be
detected early using Mammograms, probably before it's
spread. Recurrent breast cancer could occur months or years
after initial treatment. The cancer could return within the
same place because the original cancer (local recurrence), or it
may spread to different areas of your body (distant
recurrence). Early stage treatment is done not only to cure
breast cancer however additionally facilitate in preventing its
repetition/recurrence. Data mining algorithms provide
assistance in predicting the early-stage breast cancer that
continually has been difficult analysis drawback. The
projected analysis can establish the most effective algorithm
that predicts the recurrence of the breast cancer and improve
the accuracy the algorithms. Large information like Clump,
Classification, Association Rules, Prediction and Neural
Networks, Decision Trees can be analyzed using data mining
applications and techniques.
Innovative Technique for Gene Selection in Microarray Based on Recursive Clus...AM Publications
Gene selection is usually the crucial step in microarray data analysis. A great deal of recent research has focused on the
challenging task of selecting differentially expressed genes from microarray data (‘gene selection’). Numerous gene selection
algorithms have been proposed in the literature, but it is often unclear exactly how these algorithms respond to conditions like
small sample-sizes or differing variances. Choosing an appropriate algorithm can therefore be difficult in many cases. This paper
presents combination of Analysis of Variance (ANOVA), Principle Component Analysis (PCA), Recursive Cluster Elimination
(RCE) a classification algorithm by employing a innovative method for gene selection. It reduces the gene expression data into
minimal number of gene subset. This is a new feature selection method which uses ANOVA statistical test, principal component
analysis, KNN classification &RCE (recursive cluster elimination). At each step redundant & irrelevant features are get
eliminated. Classification accuracy reaches up to 99.10% and lesser time for classification when compared to other convectional techniques.
Cancer is one of the deadliest diseases in the world and is responsible for around 13% of all deaths worldwide.
Cancer incidence rate is growing at an alarming rate in the world. Despite the fact that cancer is
preventable and curable in early stages, the vast majority of patients are diagnosed with cancer very late.
Furthermore, cancer commonly comes back after years of treatment. Therefore, it is of paramount
importance to predict cancer recurrence so that specific treatments can be sought. Nonetheless,
conventional methods of predicting cancer recurrence rely solely on histopathology and the results are not
very reliable. The microarray gene expression technology is a promising technology that couldpredict
cancer recurrence by analyzing the gene expression of sample cells. The microarray technology allows
researchers to examine the expression of thousands of genes simultaneously. This paper describes a stateof-
the-art machine learning based approach called averaged one-dependence estimators with subsumption
resolution to tackle the problem of predicting, from DNA microarray gene expression data, whether a
particular cancer will recur within a specific timeframe, which is usually 5 years. To lower the
computational complexity, we employ an entropy-based geneselection approach to select relevant
prognosticgenes that are directly responsible for recurrence prediction. This proposed system has achieved
an average accuracy of 98.9% in predicting cancer recurrence over 3 datasets. The experimental results
demonstrate the efficacy of our framework.
An understanding towards genetics and epigenetics is essential to cope up with the paradigm shift which is underway. Personalized medicine and gene therapy will confluence the days to come.
This review highlights traditional approaches as well as current advancements in the analysis of the gene expression data from cancer perspective.
Due to improvements in biometric instrumentation and automation, it has become easier to collect a lot of experimental data in molecular biology.
Analysis of such data is extremely important as it leads to knowledge discovery that can be validated by experiments. Previously, the diagnosis of complex genetic diseases has conventionally been done based on the non-molecular characteristics like kind of tumor tissue, pathological characteristics, and clinical phase.
The microarray data can be well accounted for high dimensional space and noise. Same were the reasons for ineffective and imprecise results. Several machine learning and data mining techniques are presently applied for identifying cancer using gene expression data.
While differences in efficiency do exist, none of the well-established approaches is uniformly superior to others. The quality of algorithm is important, but is not in itself a guarantee of the quality of a specific data analysis.
http://kaashivinfotech.com/
http://inplanttrainingchennai.com/
http://inplanttraining-in-chennai.com/
http://internshipinchennai.in/
http://inplant-training.org/
http://kernelmind.com/
http://inplanttraining-in-chennai.com/
http://inplanttrainingchennai.com/
Performance and Evaluation of Data Mining Techniques in Cancer DiagnosisIOSR Journals
Abstract: We analyze the breast Cancer data available from the WBC, WDBC from UCI machine learning with
the aim of developing accurate prediction models for breast cancer using data mining techniques. Data mining
has, for good reason, recently attracted a lot of attention, it is a new Technology, tackling new problem, with
great potential for valuable commercial and scientific discoveries. The experiments are conducted in WEKA.
Several data mining classification techniques were used on the proposed data. There are many classification
techniques in data mining such as Decision Tree, Rules NNge, Tree random forest, Random Tree, lazy IBK. The
aim of this paper is to investigate the performance of different classification techniques. The data breast cancer
data with a total 286 rows and 10 columns will be used to test and justify the different between the classification
methods and algorithm.
Keywords - Machine learning, data mining Weka, classification, breast cancer
Clustering Approaches for Evaluation and Analysis on Formal Gene Expression C...rahulmonikasharma
Enormous generation of biological data and the need of analysis of that data led to the generation of the field Bioinformatics. Data mining is the stream which is used to derive, analyze the data by exploring the hidden patterns of the biological data. Though, data mining can be used in analyzing biological data such as genomic data, proteomic data here Gene Expression (GE) Data is considered for evaluation. GE is generated from Microarrays such as DNA and oligo micro arrays. The generated data is analyzed through the clustering techniques of data mining. This study deals with an implement the basic clustering approach K-Means and various clustering approaches like Hierarchal, Som, Click and basic fuzzy based clustering approach. Eventually, the comparative study of those approaches which lead to the effective approach of cluster analysis of GE.The experimental results shows that proposed algorithm achieve a higher clustering accuracy and takes less clustering time when compared with existing algorithms.
FRACTAL PARAMETERS OF TUMOUR MICROSCOPIC IMAGES AS PROGNOSTIC INDICATORS OF C...csandit
Research in the field of breast cancer outcome prognosis has been focused on molecular biomarkers, while neglecting the discovery of novel tumour histology structural clues. We thus
aimed to improve breast cancer prognosis by fractal analysis of tumour histomorphology. This study included 92 breast cancer patients without systemic treatment. Fractal parametersfractal dimension and lacunarity of the breast tumour microscopic histology possess prognostic value comparable to the major clinicopathological prognostic parameters. Fractal analysis was performed for the first time on routinely produced archived pan-tissue stained primary breast tumour sections, indicating its potential for clinical use as a simple and cost-effective prognostic indicator of distant metastasis risk to complement the molecular approaches for
cancer risk prognosis.
SVM &GA-CLUSTERING BASED FEATURE SELECTION APPROACH FOR BREAST CANCER DETECTIONijscai
Mortality leading among women in developed countries is breast cancer. Breast cancer is women's second most prominent cause of cancer mortality worldwide. In recent decades, women's high prevalence of breast cancer has risen dramatically. This paper discussed several data analysis methods used to detect breast cancer early. Breast cancer diagnosis distinguishes benign and malignant breast lumps. Using data processing tools, we tackled this disease analysis. Data mining is an important step of library discovery where intelligent methods are used to detect patterns. Several clinical breast cancer studies were conducted using soft computing and machine learning techniques. Sometimes their algorithms are easier, easier, or more comprehensive than others. This research is focused on genetic programming and machine learning algorithms to reliably identify benign and malignant breast cancer. This study aimed to optimise the testing algorithm. We used genetic programming methods to choose classification machines' best features and parameter values. Data mining is an important step of library discovery where intelligent methods are used to detect patterns. We are analysing data accessible from the U.C.I. deep-learning data set in Wisconsin. In this experiment, we equate four Weka clustering strategies with genetic clustering. A comparison of results reveals that sequential minimal optimization (S.M.O.) is better than I.B.K. and B.F. Tree processes, i.e. 97.71%.
Similar to MINING OF IMPORTANT INFORMATIVE GENES AND CLASSIFIER CONSTRUCTION FOR CANCER DATASET (20)
MINING OF IMPORTANT INFORMATIVE GENES AND CLASSIFIER CONSTRUCTION FOR CANCER DATASET
International Journal on Soft Computing (IJSC) Vol.3, No.3, August 2012
DOI: 10.5121/ijsc.2012.3306
Soumen Kumar Pati (1) and Asit Kumar Das (2)
(1) Department of Computer Science/Information Technology, St. Thomas' College of
Engineering and Technology, 4, D.H. Road, Kolkata-23
soumen_pati@rediffmail.com
(2) Department of Computer Science and Technology, Bengal Engineering and Science
University, Shibpur, Howrah-03
asitdas72@rediffmail.com
ABSTRACT
Microarray technology is a useful technique for measuring the expression of thousands of genes
simultaneously. One of the challenges in classifying cancer from high-dimensional gene expression data
is to select a minimal number of relevant genes that maximize classification accuracy. Because of the
distinct characteristics inherent to specific cancerous gene expression profiles, developing flexible and
robust gene identification methods is essential. Many gene selection methods, together with their
corresponding classifiers, have been proposed. In the proposed method, a single gene with high
class-discrimination capability is selected and classification rules for cancer are generated from gene
expression profiles. The method first computes the importance factor of each gene in the experimental
cancer dataset by counting the number of linguistic terms (defined in terms of different discrete
quantities) with high class-discrimination capability according to their degree of dependency on the
classes. The genes with high importance factors are then selected to form the initial reduct. Next, the
traditional k-means clustering algorithm is applied to each selected gene of the initial reduct and the
misclassification error of each individual gene is computed. The final reduct is formed by selecting the
most important genes, that is, those with the lowest misclassification errors. A classifier is then
constructed from decision rules induced by the selected important (single) genes on the training dataset
to classify cancerous and non-cancerous samples of the experimental test dataset. The proposed method is
tested on four publicly available cancer gene expression datasets. In most cases, accurate classification
outcomes are obtained using just the important (single) genes, and the genes identified are highly
correlated with the pathogenesis of cancer. Finally, to demonstrate the robustness of the proposed
method, its outcomes (correctly classified instances) are compared with those of some existing
well-known classifiers.
KEYWORDS
Microarray cancer data, K-means algorithm, Gene selection, Classification Rule, Cancer sample
identification, Gene reducts.
1. INTRODUCTION
Nowadays, an increasing number of applications in different fields, especially in the natural and
social sciences, produce massive volumes of very high-dimensional data under a variety of
experimental constraints. In scientific databases such as gene microarray datasets [1], it is
common to encounter large sets of observations represented by hundreds or more dimensions.
Microarray technology [2] allows thousands of genes to be analyzed simultaneously and
thus can give important insights into a cell's function, since changes in the composition of an
organism are generally associated with changes in gene expression patterns. The availability of
massive amounts of experimental data from genome-wide studies has in recent years given
momentum to a large effort to develop mathematical, statistical, and computational
techniques for inferring biological models from data. In many bioinformatics problems, the number of
genes is significantly larger than the number of samples (high gene-to-sample ratio datasets).
This is typical of cancer classification tasks, where a systematic investigation of the correlation of
expression patterns of thousands of genes with specific phenotypic variations is expected to provide
an improved catalog of cancer. In this context, the number of features corresponds to the number
of expressed gene probes (up to several thousand) and the number of observations to the number
of tumor samples (typically on the order of hundreds).
In DNA microarray data analysis [1], biologists generally measure the expression levels of genes
in tissue samples from patients and look for explanations of how the patients' genes relate to
the types of cancer they had. Many genes may be strongly correlated with a particular type of
cancer; however, biologists prefer to focus on a small subset of genes that dominates the
outcomes before performing in-depth analysis and expensive experiments on a
high-dimensional dataset. Therefore, automated selection of this small subset of genes is highly
advantageous. DNA microarray technology [2] has directed the focus of computational biology
towards analytical data interpretation [3]. However, when examining microarray data, the size of
the datasets and the noise contained within them compromise precise qualitative and
quantitative analysis [4].
Generally, this field involves two key procedures: important gene identification and classifier
construction. Gene selection [5,6] is particularly crucial because the number of genes
irrelevant to classification may be huge; hence, accurate prediction can be achieved only by
performing gene selection carefully, that is, by identifying the most informative genes among a large
number of candidates. Once such genes are chosen, constructing classifiers on the basis of those
genes is the second task. Most papers [7-9] obtain accurate classification results based on
more than two genes.
In this paper, a novel gene selection technique and a subsequent classification rule generation
technique are proposed for microarray data, selecting a single important gene to predict
cancerous samples with high classification accuracy. The method can be broken down into the following
four steps:
i. The gene expression dataset is standardized to Z-scores using the Transitional State
Discrimination method [10] and then discretized to five discrete values.
ii. Since not all genes are important for identifying a particular cancer, a
relevance analysis of the genes is performed to select only the important ones. As the
samples are collected from both normal and cancerous patients, they are
divided into two disjoint classes. For each gene, the frequencies of the discrete sample values are
computed in each class, and the importance of the gene is measured from these frequencies.
iii. Since each gene contains some normal samples and some cancerous samples, the traditional
k-means clustering algorithm [11-13] with k = 2 is applied to each selected gene and the
misclassification error is computed; only the most important genes, as judged by this error,
are selected for classification.
iv. Finally, classification rules [7, 14, 15] are generated for each gene on the basis of the training
dataset to identify cancerous and non-cancerous samples of the test dataset with satisfactory
accuracy.
The article is organized into four sections. Section 2 describes the proposed gene selection and
classification methodology, which selects only the important genes according to high classification
accuracy. The experimental results and the performance of the proposed method on a variety of
benchmark gene expression datasets are evaluated in Section 3. Finally, conclusions are drawn in
Section 4.
International Journal on Soft Computing (IJSC) Vol.3, No.3, August 2012
2. GENE SELECTION AND CLASSIFICATION
Conventional morphological identification of cancer is not always effective, as revealed by
frequent misdiagnoses. Recent molecular biological studies have shown that
cancer is a disease involving dynamic changes in the genome. Moreover, the rapid advances in
cancer diagnosis technology have made it possible to measure the expression
levels of the genes in a microarray simultaneously in a single experiment. This technology has greatly facilitated
the detection of cancerous molecular markers with respect to a specified microarray dataset [1].
One current difficulty in interpreting microarray data comes from their innate nature of ‘high
dimensional large sample size’. Therefore, robust and accurate gene selection methods are
required to identify groups of genes that are differentially expressed across different samples, e.g. between
cancerous and normal cells. Gene selection is necessary to find the genes responsible for a
complex disease, which take part in the disease network and provide information about disease-related
genes. Successful gene selection helps to classify different cancer types, leads to a better
understanding of genetic signatures in cancers and improves treatment strategies. Although gene
selection and cancer classification are two closely related problems, most existing approaches
handle them separately by selecting genes prior to classification.
2.1. Relevance Analysis of Genes
Let the labeled microarray gene expression dataset be MDS = (U, C, D), where U = {g1, g2, …, gn} is
the universe of discourse containing all the genes of the dataset, C = {C1, C2, …, Cm} is the
condition attribute set containing all the samples, and D = {d1, d2} is the set of decision attributes.
Table 1 shows an example of MDS with gene expression values and decision attributes.
Table 1. Microarray dataset decision table (genes/samples).

               Condition attributes (samples)
               Decision attributes (classes)
               Class 1 (d1)                     Class 2 (d2)
Set of genes   S1       S2       ...  Si        Si+1       ...  Sm
g1             M(1,1)   M(1,2)   ...  M(1,i)    M(1,i+1)   ...  M(1,m)
g2             M(2,1)   M(2,2)   ...  M(2,i)    M(2,i+1)   ...  M(2,m)
...            ...      ...      ...  ...       ...        ...  ...
gn             M(n,1)   M(n,2)   ...  M(n,i)    M(n,i+1)   ...  M(n,m)
As not all genes are important for identifying a particular cancer, a relevance analysis
of the genes is necessary to select only the important ones. Initially, the gene dataset MDS is
preprocessed by standardizing the samples to z-scores using the Transitional State Discrimination
(TSD) method [10]. In TSD, the discretization factor fij is computed for sample Cj ∈ C of gene gi ∈
U, i = 1, 2, …, n, j = 1, 2, …, m, using (1).
$$f_{ij} = \frac{M_i[C_j] - \mu_i}{\delta_i} \tag{1}$$
where µi and δi are the mean and standard deviation of gene gi and Mi[Cj] is the value of sample
Cj in gene gi. Then the mean (Ni) of the negative values and the mean (Pi) of the positive values of fij
are computed for each gene gi, and each fij is discretized to one of five fuzzy linguistic terms [16]
using (2).
$$f_{ij} = \begin{cases}
\text{'VL'} & \text{if } f_{ij} \le N_i \\
\text{'L'} & \text{if } N_i < f_{ij} < 0 \\
\text{'Z'} & \text{if } f_{ij} = 0 \\
\text{'H'} & \text{if } 0 < f_{ij} < P_i \\
\text{'VH'} & \text{if } f_{ij} \ge P_i
\end{cases} \tag{2}$$
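The standardization in (1) and the five-term discretization in (2) can be sketched for a single gene as follows. This is a minimal illustration, not the authors' MATLAB code: the function name `discretize_gene`, the use of the population standard deviation, and the handling of a constant gene are assumptions made for this example.

```python
from statistics import mean, pstdev

def discretize_gene(values):
    """Map one gene's expression values to the terms VL, L, Z, H, VH."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        return ["Z"] * len(values)  # assumption: a constant gene maps to 'Z'
    f = [(v - mu) / sigma for v in values]   # equation (1)
    neg = [x for x in f if x < 0]
    pos = [x for x in f if x > 0]
    n_i = mean(neg) if neg else 0.0          # N_i: mean of the negative z-scores
    p_i = mean(pos) if pos else 0.0          # P_i: mean of the positive z-scores
    terms = []
    for x in f:                              # equation (2)
        if x == 0:
            terms.append("Z")
        elif x <= n_i:
            terms.append("VL")
        elif x < 0:
            terms.append("L")
        elif x < p_i:
            terms.append("H")
        else:
            terms.append("VH")
    return terms

# hypothetical expression values for one gene
print(discretize_gene([1.0, 2.0, 4.0, 5.0, 6.0, 12.0]))
# ['VL', 'VL', 'L', 'Z', 'H', 'VH']
```

Because Ni and Pi are computed per gene, the cut points adapt to each gene's own spread rather than using fixed z-score thresholds.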
As the samples are collected from both normal and cancerous patients, they
are divided into two disjoint classes, say d1 and d2. For each gene gi, the frequencies of the discrete
sample values are computed in each class, and the maximum frequency in each class is obtained
using (3) and (4), respectively.
$$P_{li} = \mathrm{Count}\left(f_{ij} \mid j = 1, 2, \ldots, d_1 \text{ and } f_{ij} \in \{\text{'VL'}, \text{'L'}, \text{'Z'}, \text{'H'}, \text{'VH'}\}\right) \tag{3}$$
$$P_{ri} = \mathrm{Count}\left(f_{ij} \mid j = 1, 2, \ldots, d_2 \text{ and } f_{ij} \in \{\text{'VL'}, \text{'L'}, \text{'Z'}, \text{'H'}, \text{'VH'}\}\right) \tag{4}$$
where Count(x) returns the maximum frequency of a discrete value among the samples of class d1 and
of class d2, respectively, for gene gi. If the maximum frequencies Pli and Pri occur for the same discrete value, then
gene gi is not considered important, because its normal and cancerous samples are almost similar.
Otherwise, the values of the normal and cancerous samples are distinct for gene gi, and the
gene is considered important, with importance factor (PFi) computed using (5).
$$PF_i = \frac{P_{li} + P_{ri}}{m} \tag{5}$$
where i = 1, 2, …, n and m is the total number of samples. The higher the importance factor, the more
relevant the gene, and vice versa.
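The relevance test and equations (3) to (5) can be sketched as below for one discretized gene. The function name `importance_factor`, the class labels "d1" and "d2", the tie-breaking behaviour (the first most frequent term wins) and the choice to return 0.0 for an unimportant gene are assumptions of this example, not details given in the text.

```python
from collections import Counter

def importance_factor(terms, labels):
    """Return PF_i for one discretized gene, or 0.0 if the gene is not important."""
    c1 = Counter(t for t, y in zip(terms, labels) if y == "d1")
    c2 = Counter(t for t, y in zip(terms, labels) if y == "d2")
    term1, p_l = c1.most_common(1)[0]   # dominant term and its frequency in d1, eq. (3)
    term2, p_r = c2.most_common(1)[0]   # dominant term and its frequency in d2, eq. (4)
    if term1 == term2:
        return 0.0                      # same dominant term: classes look alike
    return (p_l + p_r) / len(terms)     # equation (5)

# hypothetical discretized gene with three samples per class
terms = ["VL", "VL", "L", "H", "VH", "VH"]
labels = ["d1", "d1", "d1", "d2", "d2", "d2"]
print(importance_factor(terms, labels))  # (2 + 2) / 6
```

Sorting all genes by this value in non-increasing order and keeping the first n1 gives the initial reduct described in the next subsection.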
2.2. Reduct Generation
Measuring similarity or dissimilarity among genes with a distance metric may
not be effective for gene data analysis in a high dimensional space. At the same time, careful
gene selection decreases the workload and greatly simplifies the subsequent design process.
The method therefore computes a minimum subset of genes, called a
reduct, which by itself characterizes the knowledge in the gene database as fully as the whole set
of genes (U) and preserves the partition of the data with respect to cancer classification. After the
importance factors of all genes have been computed, the top n1 genes (where n1 << n) are selected as the initial
reduct IRED. In most cases, however, the initial reduct cannot classify normal and cancerous
samples with high classification accuracy. Therefore, the most important genes of the
initial reduct are selected to form the final reduct FRED.
To obtain the final reduct, the genes in IRED are projected from the high dimensional space into lower
dimensional spaces, i.e., n1 one-dimensional matrices are formed, one for each gene.
Since each gene contains some normal and some cancerous samples, the
sample values are expected to form two disjoint clusters, one containing the normal sample values and the other
the cancerous sample values. So the traditional k-means clustering algorithm [11-13] with k = 2 is
applied to each gene and the misclassification error is computed using (6).
$$ME_i = \frac{m_{1i} + m_{2i}}{m} \tag{6}$$
where m1i is the number of d1 class samples clustered as d2 class samples, m2i is the number
of d2 class samples clustered as d1 class samples, and m is the total number of samples.
In a one-dimensional space, the k-means algorithm is very effective with respect to the distance metric,
and it is also efficient here because of the limited number of genes in IRED. The final reduct
FRED is formed by the n2 genes (where n2 << n1) with the lowest misclassification error.
Algorithm: Reduct Generation
Input: Discretized gene dataset U = {g1, g2, …, gn} with sample set C = {C1, C2, …, Cm}
Output: FRED, containing the most important genes
Begin
d1 = class in which the normal samples of the genes lie
d2 = class in which the cancerous samples of the genes lie
For i = 1 to n do {
Pli = maximum frequency among all discrete values in d1 of gene gi
Pri = maximum frequency among all discrete values in d2 of gene gi
If (Pli and Pri occur for different discrete values) then compute importance factor PFi of gene gi using (5)
}
Arrange all genes in non-increasing order of PFi
IRED = set of first n1 genes, where n1 << n
For i = 1 to n1 do {
Apply k-means clustering algorithm with k = 2 on gene gi in IRED
m1i = number of d1 class samples misplaced in d2 class
m2i = number of d2 class samples misplaced in d1 class
Compute misclassification error MEi of gene gi using (6)
}
Arrange the genes of IRED in non-decreasing order of MEi
FRED = set of first n2 genes, where n2 << n1
End
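The second half of the algorithm (one-dimensional 2-means per gene, equation (6), then ranking) might look like the following sketch. The names `two_means_1d`, `misclassification_error` and `final_reduct` are hypothetical, class labels are encoded as 0 (d1) and 1 (d2), the centroids are initialised at the minimum and maximum value, and cluster ids are matched to classes by taking the better of the two possible matchings; all of these are assumptions, since the text does not specify them.

```python
def two_means_1d(values, iters=100):
    """Lloyd's algorithm with k = 2 on a 1-D list; returns a 0/1 cluster id per value."""
    c0, c1 = min(values), max(values)  # assumption: initialise at the extremes
    assign = []
    for _ in range(iters):
        new = [0 if abs(v - c0) <= abs(v - c1) else 1 for v in values]
        if new == assign:
            break                      # assignments stable: converged
        assign = new
        g0 = [v for v, a in zip(values, assign) if a == 0]
        g1 = [v for v, a in zip(values, assign) if a == 1]
        if g0:
            c0 = sum(g0) / len(g0)
        if g1:
            c1 = sum(g1) / len(g1)
    return assign

def misclassification_error(values, labels):
    """Equation (6): fraction of samples whose cluster disagrees with their class."""
    assign = two_means_1d(values)
    wrong = sum(1 for a, y in zip(assign, labels) if a != y)
    # Cluster ids are arbitrary, so take the better of the two matchings (assumption).
    return min(wrong, len(values) - wrong) / len(values)

def final_reduct(genes, labels, n2):
    """genes: dict gene_id -> expression values; returns the n2 ids with lowest error."""
    ranked = sorted(genes, key=lambda g: misclassification_error(genes[g], labels))
    return ranked[:n2]

# hypothetical three-gene initial reduct with six samples (0 = d1, 1 = d2)
labels = [0, 0, 0, 1, 1, 1]
genes = {
    "a": [1.0, 1.2, 0.9, 5.0, 5.1, 1.1],
    "b": [1.0, 2.0, 3.0, 10.0, 11.0, 12.0],
    "c": [1.0, 10.0, 2.0, 11.0, 3.0, 12.0],
}
print(final_reduct(genes, labels, 2))  # ['b', 'a']
```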
2.3. Classifier Construction
The classifier [7, 14, 15] is constructed from the expression values
of the selected important genes in the training dataset and is used to classify cancerous and
non-cancerous test samples. Here, only the most important genes are selected from the gene
dataset and kept in FRED, and classification rules are generated individually for each of these genes.
A classification rule has the form “x -> y”, meaning “if x, then y”, where x is
a description on the condition attributes (samples) and y is a description on the decision attributes
(class types). A gene is described by its sample values, some from normal and some from
cancerous patients. So two classes, say d1 and d2, are associated with each gene, with some sample
values belonging to d1 and some to d2. Let the intervals in which the sample values of class
d1 and class d2 lie be [min1, max1] and [min2, max2], respectively. Then one of three
possibilities may occur: (i) non-overlapping intervals, (ii) overlapping intervals and (iii) one interval fully
contained in the other. The rules generated in the three cases are described separately.
(i) Non-overlapping intervals: Without loss of generality, assume that max1 < min2; otherwise
the two classes are interchanged before rule generation. The gap between the two intervals, (min2 -
max1), is divided equally and the intervals are extended accordingly. Thus the mid-point R of
the gap is taken as the upper limit of the sample values of normal genes, beyond which
samples are of cancerous genes, as shown in Fig. 1. So the rules are:
If (min1 <= sample value < R) then normal samples
If (R <= sample value <= max2) then cancerous samples
Figure 1. Range of values of samples in non-overlapping intervals
(ii) Overlapping intervals: In this case neither interval is a proper subset of the
other; that situation is described in the next case. Here also, without loss of generality, assume that min2 <
max1, so the width of the overlapping portion is max1 - min2. The overlap is not divided equally in this
case; rather, it is divided according to the number of samples of each class that lie in it. If the ratio of the
percentage of samples of class d1 to that of class d2 in the overlap is m : n, then the
point R at which the overlap is divided is obtained by (7) or (8), and R is taken as the upper limit of
the sample values of normal genes, beyond which samples are of cancerous genes, as shown in
Fig. 2.
$$R = min_2 + \frac{m}{m+n} \times (max_1 - min_2) \tag{7}$$
$$R = max_1 - \frac{m}{m+n} \times (max_1 - min_2) \tag{8}$$
So the rules are:
If (min1 <= sample value < R) then normal samples
If (R <= sample value <=max2) then cancerous samples
Figure2. Range of values of samples in overlapping intervals
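As a worked illustration of equation (7), take hypothetical values that are not from any of the datasets: the overlap runs from min2 = 5 to max1 = 9, and the ratio of class d1 to class d2 samples inside it is m : n = 1 : 3.

```latex
R = min_2 + \frac{m}{m+n} \times (max_1 - min_2)
  = 5 + \frac{1}{1+3} \times (9 - 5)
  = 5 + 1
  = 6
```

With these numbers, sample values below 6 would be assigned to the normal class and values from 6 upward to the cancerous class.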
(iii) One interval fully contained in the other: Without loss of generality, assume that class d2 is
fully contained in class d1, such that min1 < min2 < max2 < max1. Here, the range (max2 - min2)
contains all samples of class d2 together with some samples of class d1. As in case (ii), if the
ratio of the percentage of samples of class d1 to that of class d2 in the range is m : n, then the
point R at which the range (max2 - min2) is divided, as shown in Fig. 3, is obtained by (9) or
(10).
$$R = min_2 + \frac{m}{m+n} \times (max_2 - min_2) \tag{9}$$
$$R = max_2 - \frac{m}{m+n} \times (max_2 - min_2) \tag{10}$$
Since class d2 is fully contained in class d1, R may be either the lower limit or the upper limit
of the sample values of class d2 (i.e., cancerous genes), and thus two possible sets of rules are:
(i) If (min1 <= sample value < R) OR (max2 < sample value <= max1) then normal samples
If (R <= sample value <= max2) then cancerous samples
OR
(ii) If (min1 <= sample value < min2) OR (R < sample value <= max1) then normal samples
If (min2 <= sample value <= R) then cancerous samples
Figure 3. Range of values of samples with one interval contained in the other
Algorithm: Classification Rule Generation
Input: Final reduct FRED with G genes and all samples of the training dataset
Output: Suitable classification rules to classify the test dataset
Begin
For each gene g from FRED do {
d1 = normal class associated with gene g
d2 = cancerous class associated with gene g
Interval of sample values in d1 = [min1, max1] and in d2 = [min2, max2]
Case 1:
If (max1 < min2) then {
R = max1 + (min2 - max1) / 2
(min1 <= sample value < R) => d1 (normal samples)
(R <= sample value <= max2) => d2 (cancerous samples)
} /* otherwise interchange d1 and d2 and get rules */
Case 2:
If (min2 < max1) and neither interval contains the other then {
m : n = ratio of percentage of samples in d1 to d2 in (max1 - min2)
Compute R using (7) or (8)
(min1 <= sample value < R) => d1 (normal samples)
(R <= sample value <= max2) => d2 (cancerous samples)
} /* otherwise interchange d1 and d2 and get rules */
Case 3:
If (min1 < min2 < max2 < max1) then {
m : n = ratio of percentage of samples in d1 to d2 in (max2 - min2)
Compute R using (9) or (10)
Two possible sets of rules are:
(i) (min1 <= sample value < R) || (max2 < sample value <= max1) => d1 (normal samples)
and (R <= sample value <= max2) => d2 (cancerous samples)
OR
(ii) (min1 <= sample value < min2) || (R < sample value <= max1) => d1 (normal samples)
and (min2 <= sample value <= R) => d2 (cancerous samples)
} /* otherwise interchange d1 and d2 and get rules */
}
End
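The three cases can be combined into one rule builder, sketched below under several assumptions that the text leaves open: m and n are taken as the fractions of each class's training samples lying inside the range being divided, only equations (7) and (9) and rule set (i) of Case 3 are used, and values outside [min1, max2] are classified by the same threshold instead of being rejected. The names `split_point` and `make_rule` are hypothetical.

```python
def split_point(lo, hi, d1_values, d2_values):
    """Divide [lo, hi] in the ratio m : n, as in equations (7) and (9)."""
    # assumption: m and n are the fractions of each class lying in [lo, hi]
    m = sum(lo <= v <= hi for v in d1_values) / len(d1_values)
    n = sum(lo <= v <= hi for v in d2_values) / len(d2_values)
    return lo + m / (m + n) * (hi - lo)

def make_rule(d1_values, d2_values):
    """Return a function mapping a sample value to 'd1' or 'd2'."""
    min1, max1 = min(d1_values), max(d1_values)
    min2, max2 = min(d2_values), max(d2_values)
    if min1 > min2:
        # interchange the classes so that the first interval starts first
        swapped = make_rule(d2_values, d1_values)
        return lambda v: "d2" if swapped(v) == "d1" else "d1"
    if max1 < min2:                                    # Case 1: disjoint intervals
        r = max1 + (min2 - max1) / 2
        return lambda v: "d1" if v < r else "d2"
    if max2 <= max1:                                   # Case 3: d2 inside d1, rule (i)
        r = split_point(min2, max2, d1_values, d2_values)
        return lambda v: "d2" if r <= v <= max2 else "d1"
    r = split_point(min2, max1, d1_values, d2_values)  # Case 2: partial overlap
    return lambda v: "d1" if v < r else "d2"

# hypothetical training values for one gene
rule = make_rule([1.0, 2.0, 3.0], [7.0, 8.0, 9.0])     # Case 1, R = 5.0
print(rule(4.9), rule(5.0))  # d1 d2
```

Returning a closure keeps the threshold R together with the case-specific comparison, so the same call `rule(value)` works for every gene in FRED regardless of which case produced it.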
3. EXPERIMENTAL RESULTS AND PERFORMANCE EVALUATION
The experimental studies presented here provide evidence of the effectiveness of the proposed gene
selection and classification technique. Experiments were carried out on a large number of different
kinds of microarray data; a few of them, publicly available [17-21] as training and test datasets, are
summarized in Table 2. Each dataset contains two groups of samples, one normal and
the other cancerous.
Table 2. Summary of gene expression (training/testing) datasets.

Dataset          No. of genes  Class names          No. of training samples  No. of test samples
                                                    (class1/class2)          (class1/class2)
Leukemia         7129          ALL/AML              38 (27/11)               34 (20/14)
Lung Cancer      12533         MPM/ADCA             32 (16/16)               149 (15/134)
Prostate Cancer  12600         Tumor/Normal         102 (52/50)              34 (25/9)
Breast Cancer    24481         Relapse/Non-relapse  78 (34/44)               19 (12/7)
In addition, because two different experiments cause microarray intensity discrepancies between the
training set and the test set of the prostate cancer dataset [19, 20],
normalization is required for both the training and the test datasets. Each original expression level
M(i,j), i = 1, …, n and j = 1, …, m, is normalized using (11).
$$M(i,j) = \frac{M(i,j) - \left[\max\{M(i,j)\} + \min\{M(i,j)\}\right]/2}{\left[\max\{M(i,j)\} - \min\{M(i,j)\}\right]/2} \tag{11}$$
After normalization, all gene expression levels lie in the interval [-1, 1]. For the other
datasets, normalization is not performed, to avoid unnecessary loss of information,
since their training and test sets come from the same experiments [17, 18, 21].
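The min-max mapping of (11) can be sketched as below. The index range over which the maximum and minimum are taken is not recoverable from the extracted equation, so this hypothetical helper `normalize_to_unit_interval` simply takes whichever list of expression levels is to be normalized together; returning zeros for a constant list is also an assumption.

```python
def normalize_to_unit_interval(values):
    """Map a list of expression levels linearly onto [-1, 1], as in equation (11)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)   # assumption: a constant list maps to 0
    mid = (hi + lo) / 2              # centre of the observed range
    half = (hi - lo) / 2             # half-width of the observed range
    return [(v - mid) / half for v in values]

# hypothetical expression levels
print(normalize_to_unit_interval([2.0, 4.0, 6.0]))  # [-1.0, 0.0, 1.0]
```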
The proposed method first computes the initial reduct set IRED of seventy-five genes with the highest
importance factors, and then the final reduct set FRED of fifteen genes with the lowest misclassification
errors. It is observed that all the finally identified genes of every dataset are highly important with
respect to classification accuracy.
For the Leukemia dataset [17], seven genes are listed in Table 3 with their computed importance factor,
misclassification error and classification accuracy; all other selected genes have a
classification accuracy of more than 73% (not shown). The two classification rules induced from the
training dataset by gene index 2288 are: if M(#2288) ≥ 929.5, then AML, and if M(#2288) <
929.5, then ALL. Likewise, gene index 760 induces two rules: if M(#760) ≥ 720.5, then AML,
and if M(#760) < 720.5, then ALL.
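Applying such a single-gene rule to test samples is a one-line threshold test. The sketch below encodes the rule stated for gene index 2288; the helper names and the four sample values are hypothetical and are not taken from the Leukemia dataset.

```python
def classify_by_gene_2288(expression):
    """Rule induced by gene #2288: values >= 929.5 are AML, the rest ALL."""
    return "AML" if expression >= 929.5 else "ALL"

def accuracy(rule, values, labels):
    """Fraction of samples for which the rule reproduces the known label."""
    hits = sum(rule(v) == y for v, y in zip(values, labels))
    return hits / len(values)

# hypothetical expression values and labels, for illustration only
values = [120.0, 850.0, 929.5, 2400.0]
labels = ["ALL", "ALL", "AML", "ALL"]
print(accuracy(classify_by_gene_2288, values, labels))  # 0.75
```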
Table 3. Most important Leukemia (ALL/AML) genes

Gene_id  Gene name       Correctly classified  Classification       Kappa       Importance  Misclassification
                         samples               accuracy (%)         statistic   factor      error
                         [Total (ALL/AML)]     [Total (ALL/AML)]
2288     M84526_at       34 (21/13)            97.89 (100/93)       0.9459      0.921053    0.131579
1882     M27891_at       33 (20/13)            95.12 (96/93)        0.9078      0.894737    0.131579
1834     M23197_at       33 (19/14)            95.08 (92/97)        0.8954      0.921053    0.131579
4847     X95735_at       32 (19/13)            92.67 (91/93)        0.8650      0.973684    0.078947
760      D88422_at       32 (21/11)            91.78 (100/79)       0.8641      0.894737    0.236842
4373     X62320_at       31 (20/11)            89 (96/79)           0.8139      0.868421    0.236842
3320     U50136_rna1_at  26 (19/7)             75 (91/50)           0.7321      0.921053    0.052632
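As a reading aid, the importance factors and misclassification errors in Table 3 are consistent with equations (5) and (6) evaluated on the m = 38 Leukemia training samples of Table 2. The sample counts 35 and 5 below are inferred from the tabulated decimals and are not stated in the source.

```latex
PF_{2288} = \frac{P_{li} + P_{ri}}{m} = \frac{35}{38} \approx 0.921053
\qquad
ME_{2288} = \frac{m_{1i} + m_{2i}}{m} = \frac{5}{38} \approx 0.131579
```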
Similarly, for the Lung cancer dataset [18], the same information is shown in Table 4 for fourteen
genes; all other selected genes have a classification accuracy of more than 80% (not shown).
The two classification rules induced from the training dataset by gene index 5301 are: if M(#5301) ≤
-138.9, then MPM, and if M(#5301) > -138.9, then ADCA. Likewise, gene index 7765 induces two
rules: if M(#7765) > 185.9, then MPM, and if M(#7765) ≤ 185.9, then ADCA.
Table 4. Most important Lung cancer (MPM/ADCA) genes.
Similarly, for the Prostate cancer dataset [19, 20], the same information is shown in Table 5 for seven
genes; all other selected genes have a classification accuracy of more than 75% (not shown).
The two classification rules induced from the training dataset by gene index 6185 are: if M(#6185) >
-0.716381, then Tumor, and if M(#6185) ≤ -0.716381, then Normal. Likewise, gene index 3794
induces two rules: if M(#3794) ≤ -0.323077, then Tumor, and if M(#3794) > -0.323077, then
Normal.
Table 5. Most important Prostate cancer (Tumor/Normal) genes

Gene_id  Gene name   Correctly classified    Classification           Kappa       Importance  Misclassification
                     samples                 accuracy (%)             statistic   factor      error
                     [Total (Tumor/Normal)]  [Total (Tumor/Normal)]
6185     37639_at    33 (24/9)               97.06 (96/100)           92.80       0.852941    0.215686
3794     39939_at    32 (23/9)               94.12 (92/100)           0.8489      0.803922    0.215686
7557     32243_g_at  31 (22/9)               91.18 (88/100)           0.7982      0.794118    0.323529
10138    41288_at    31 (22/9)               91.18 (88/100)           0.7982      0.794118    0.235294
5757     36491_at    30 (23/7)               88.24 (92/77.78)         0.6756      0.754902    0.215686
9050     38044_at    29 (21/8)               85.30 (84/88.89)         0.6643      0.794118    0.215686
205      31444_s_at  28 (19/9)               82.36 (76/100)           0.6621      0.794118    0.186275
Similarly, for the Breast cancer dataset [21], the same information is shown in Table 6 for seven
genes; all other selected genes have a classification accuracy of more than 75% (not shown).
The two classification rules induced from the training dataset by gene index 1505 are: if M(#1505) ≤
-0.005, then Relapse, and if M(#1505) > -0.005, then Non-relapse. Likewise, gene index 6214
induces two rules: if M(#6214) ≤ -0.128, then Relapse, and if M(#6214) > -0.128, then
Non-relapse.
Table 6. Most important Breast cancer (Relapse/Non-relapse) genes.

Gene_id  Gene name       Correctly classified           Classification                 Kappa       Importance  Misclassification
                         samples                        accuracy (%)                   statistic   factor      error
                         [Total (Relapse/Non-relapse)]  [Total (Relapse/Non-relapse)]
1505     AF_148505       16 (10/6)                      84.22 (83.34/85.72)            0.8034      0.717949    0.294872
6214     NM_012429       15 (10/5)                      78.95 (83.34/71.43)            0.7566      0.717949    0.282051
10643    NM_020974       15 (9/6)                       78.95 (75/85.72)               0.7566      0.717949    0.307692
4732     AF_052087       15 (8/7)                       78.95 (66.67/100)              0.7843      0.705128    0.294872
14991    Contig48590_RC  14 (9/5)                       73.69 (75/71.43)               0.6578      0.717949    0.294872
1603     Contig46421_RC  14 (10/4)                      73.69 (83.34/57.15)            0.6487      0.717949    0.282051
719      NM_001685       14 (7/7)                       73.69 (53/100)                 0.6732      0.74359     0.282051
The rules generated by the proposed classification method for the selected genes shown in Table 3,
Table 4, Table 5 and Table 6, together with other methods such as a Bayes classifier (Naïve Bayes), tree-based
classifiers (J48 -C 0.25 and RandomForest), a rule-based classifier (PART), a meta classifier
(AdaBoostM1) and a lazy classifier (KStar), are applied to the test samples and the accuracies are
measured, as shown in Fig. 4, Fig. 5, Fig. 6 and Fig. 7. It is observed that, for all test datasets, both the
proposed and the other classifiers achieve good accuracy, which shows the importance of the selected genes.
Also, in most cases the accuracy obtained by the proposed method is higher than that of the other
methods, which shows the merit of the proposed classifier.
Figure 4. Performance of Leukemia genes
Figure 5. Performance of Lung Cancer genes
Figure 6. Performance of Prostate Cancer genes
Figure 7. Performance of Breast Cancer genes
The discretization and labeling of the experimental datasets are implemented in MATLAB version 7.8.1.
The proposed ‘Reduct Generation’ and ‘Classification Accuracy Computation’ are also
implemented in MATLAB 7.8.1, all classification performances are measured with the
Weka-3-6-5 data mining tool [22], and the comparison figures are drawn in MATLAB 7.8.1.
The comparison is performed on a PC (Intel(R) Core(TM) 2 Duo T5750, 2.0 GHz, with 2.0
GB of RAM).
4. DISCUSSIONS AND CONCLUSIONS
A systematic and unbiased approach to cancer classification is of great importance to cancer
treatment and drug discovery. It is known that gene expression holds the keys to the
fundamental problems of cancer diagnosis, cancer treatment and drug discovery. The recent
advent of microarray technology has made the production of large amounts of gene expression
data possible. This has motivated researchers to propose different cancer classification
algorithms using gene expression data.
In this paper, a novel gene selection and classification technique has been proposed that selects
important (single) genes and then constructs classification rules to classify cancerous and
non-cancerous samples with high classification accuracy. The proposed method is applied to four
publicly available experimental microarray cancer datasets. It selects important genes by
comparing the importance factors of all genes and forms the initial reduct according to the proposed
algorithm. The traditional k-means algorithm is then applied to each gene of the initial reduct, and the
final reduct is formed from the genes with the lowest misclassification error.
Classification rules are then constructed from each selected (single) gene of the training data and
applied to the test data, where the classification accuracy, in terms of correctly classified instances, shows
quantitatively satisfactory results. Gene selection, an important preprocessing step, was presented in
detail and evaluated for its relevance to cancer classification. A comparative study
with respect to correctly classified instances (%) is also made against some traditional classifiers, namely Bayes,
J48, PART, MLP, Random Forest, AdaBoost and KStar, which shows the merit of the
proposed method.
REFERENCES
[1] Lee, S.hyun. & Kim Mi Na, (2008) “This is my paper”, ABC Transactions on ECE, Vol. 10, No. 5,
pp120-122.
[2] Aerman D.A., Gish K., Ybarra S., Mack D., & Levine A.J., (1999) “Expression revealed by
clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays”, Proc. Natl.
Acad. Sci., vol 1, pp 6745–6750.
[3] DeRisi J, et al. (1996) “Use of a cDNA microarray to analyse gene expression patterns in human
cancer”, Nat Genet, Dec, vol. 14, No. 4, pp 457-60.
[4] Muralidhar K. & Sarathy R., (1999) “Security of random data perturbation methods”, ACM Trans.
Database Syst., Vol. 24, No. 4, pp 487–493.
[5] Petrov A. & Shams S., (2004) “Microarray image processing and quality control”, VLSI Signal
Processing, vol. 38, No. 3, pp 211–226.
[6] Su Y., Murali T. M., Pavlovic V., Schaffer M. & Kasif S., (2003) "RankGene: identification of
diagnostic genes based on expression data", BIOINFORMATICS, vol. 19, pp. 1578-1579.
[7] Li, L., Weinberg, R. C., Darden, T. A. & Pedersen L. G., (2001) "Gene selection for sample
classification based on gene expression data: study of sensitivity to choice of parameters of the
GA/KNN method," BIOINFORMATICS, vol. 17, pp.1131-1142.
[8] Zhang H., Yu C. Y., Singer B. & Xiong M., (2001) "Recursive partitioning for tumor classification
with gene expression microarray data," PNAS, vol. 98, pp. 6730-6735.
[9] Dudoit S., Fridlyand J., & Speed T. P., (2002) “Comparison of Discrimination Methods for the
Classification of Tumors Using Gene Expression Data,” J. Am. Statistical Assoc., vol. 97, No. 457,
pp. 77-87.
[10] Wang, X., & Gotoh, O., (2009) “Microarray-Based Cancer Prediction Using Soft Computing
Approach”, Cancer Informatics, vol. 7, pp 123–139.
[11] R.G. Pensa, C. Leschi, J. Besson, & J. Boulicaut., (2004) “Assessment of discretization techniques
for relevant pattern discovery from gene expression data”, In 4th Workshop on Data Mining in
Bioinformatics.
[12] Qu Y., & Xu S., (2004) “Supervised cluster analysis for microarray data based on multivariate
Gaussian mixture”, Bioinformatics, vol. 20, pp 1905-13.
[13] Guha, S., Rastogi R. & Shim K., (1998) “CURE: an efficient clustering algorithm for large
databases”, Proc. of ACM SIGMOD International Conference on Management of Data, pp. 73 – 84.
[14] Bradley P. S., Bennett K. P. & Demiriz A., (2000) “Constrained k-means clustering (Technical
ReportMSR-TR-2000-65)”, Microsoft Research, Redmond, WA.
[15] Dudoit S., Fridlyand J., & Speed T.P., (2002) “Comparison of Discrimination Methods for the
Classification of Tumors Using Gene Expression Data,” J. Am. Statistical Assoc., vol. 97, no. 457,
pp. 77-87.
[16] Golub. T. R., (1999) “Molecular classification of cancer: class discovery and class prediction by Gene
Expression Monitoring,” Science, vol. 286, pp 531-537.
[17] Ivars Peterson, (1993) "Fuzzy Sets", Science News, Vol. 144, July 24, pp. 55.
[18] Leukemia dataset: http://www-genome.wi.mit.edu/cgi-bin/cancer/datasets.cgi.
[19] Lung dataset: http://www-genome.wi.mit.edu/mpr/lung.
[20] Prostate cancer train dataset: http://www-genome.wi.mit.edu/mpr/prostate.
[21] Prostate cancer test dataset: http://carrier.gnf.org/welsh/prostate.
[22] Breast cancer dataset: http://www.rii.com/publications/2002/vantveer.htm.
[23] WEKA: Machine Learning Software, http://www.cs.waikato.ac.nz/~.html
Authors
Mr. Soumen Kumar Pati is an Assistant Professor of Computer Science/Information
Technology at St. Thomas’ College of Engineering and Technology, Kidderpore,
Kolkata, West Bengal, India. He has received M.Tech degree in Computer Science and
Engg from Jadavpur University. He is registered for PhD (Engg) degree at Bengal
Engineering and Science University, Shibpur, Howrah. His research interests include
Bio-informatics, Data Mining and Pattern Recognition, Rough Set Theory, etc.
Dr. Asit Kr. Das is an Assistant Professor of Computer Science and Technology at
Bengal Engineering and Science University, Shibpur, Howrah. He has received B.Sc.
Honours in Mathematics, B. Tech. and M.Tech degree in Computer Science and Engg
from Calcutta University. He obtained PhD (Engg) degree from Bengal Engineering and
Science University, Shibpur, Howrah. His research interests include Data Mining and
Pattern Recognition, Text Categorization, Rough Set Theory, Bio-informatics etc.