High accuracy in cancer prognosis prediction is important for improving the quality of treatment and the survival rate of patients. As data volumes in healthcare research grow rapidly, the analytical challenge grows with them. The use of an effective sampling technique in classification algorithms generally yields better prediction accuracy. The SEER public-use cancer database provides several prominent class labels for prognosis prediction. The main objective of this paper is to examine the effect of sampling techniques on classifying the prognosis variable and to propose an ideal sampling method based on the outcome of the experiments. In the first phase of this work, traditional random sampling and stratified sampling techniques were used. At the next level, balanced stratified sampling with variations according to the choice of prognosis class labels was tested. Much of the initial effort was devoted to pre-processing the SEER data set. The classification models were built using the breast cancer, respiratory cancer and mixed cancer data sets with three traditional classifiers, namely Decision Tree, Naïve Bayes and K-Nearest Neighbour. Three prognosis factors (survival, stage and metastasis) were used as class labels for experimental comparison. The results show a steady increase in the prediction accuracy of the balanced stratified model as the sample size increases, whereas the traditional approaches fluctuate before reaching their optimum results.
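The contrast between proportional and balanced stratified sampling can be sketched in a few lines of Python. This is a minimal illustration on made-up prognosis records, not the paper's actual SEER pipeline; the record layout and label name are assumptions:

```python
import random
from collections import defaultdict

def stratified_sample(records, label_key, n, balanced=False, seed=42):
    """Draw n records: proportional per-class shares (classic stratified)
    or equal per-class shares (balanced stratified)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r in records:
        strata[r[label_key]].append(r)
    sample = []
    if balanced:
        per_class = n // len(strata)           # equal share for every class label
        for group in strata.values():
            sample.extend(rng.sample(group, min(per_class, len(group))))
    else:
        for group in strata.values():
            share = round(n * len(group) / len(records))  # proportional share
            sample.extend(rng.sample(group, min(share, len(group))))
    return sample

# hypothetical prognosis records: 90 survivors vs 10 non-survivors (imbalanced)
data = ([{"survival": "yes"} for _ in range(90)]
        + [{"survival": "no"} for _ in range(10)])
balanced = stratified_sample(data, "survival", 20, balanced=True)
proportional = stratified_sample(data, "survival", 20, balanced=False)
```

The balanced variant gives each prognosis label an equal share of the sample, which is what the abstract credits for the steadier accuracy curve as the sample size grows.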
GRAPHICAL MODEL AND CLUSTERING-REGRESSION BASED METHODS FOR CAUSAL INTERACTION... (ijaia)
The early detection of breast cancer, a deadly disease that mostly affects women, is extremely complex because it requires examining various features of the cell type. An efficient approach to diagnosing breast cancer at an early stage is therefore to apply artificial intelligence, where machines are programmed to learn patterns that can later be used to detect new changes as they occur. Machine learning is particularly useful in the medical field, which depends on complex genomic measurements such as the microarray technique, and it can increase the accuracy and precision of results. With this technology, doctors can diagnose cancer patients quickly and apply the proper treatment in a timely manner. The goal of this paper is therefore to propose a robust breast cancer diagnostic system based on complex genomic analysis via microarray technology. The system combines two machine learning methods: K-means clustering and linear regression.
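The cluster-then-regress combination described above can be sketched with a toy NumPy implementation: cluster the samples with K-means, then fit a separate linear regression inside each cluster. The two-feature synthetic data stands in for the paper's microarray measurements, which are not reproduced here:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's algorithm: assign to the nearest center, recompute means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    return labels

def linreg(X, y):
    """Least-squares fit with an intercept column."""
    Xb = np.c_[np.ones(len(X)), X]
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

rng = np.random.default_rng(1)
X = np.r_[rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))]
# each cluster follows its own linear trend in the first feature
y = np.r_[2 * X[:50, 0] + 1, -X[50:, 0] + 10]
labels = kmeans(X, 2)
models = {j: linreg(X[labels == j], y[labels == j]) for j in (0, 1)}
pred = np.array([np.r_[1.0, x] @ models[j] for j, x in zip(labels, X)])
```

Because each cluster is exactly linear in this toy setup, the per-cluster regressions recover the two trends that a single global linear model would blur together.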
SVM & GA-CLUSTERING BASED FEATURE SELECTION APPROACH FOR BREAST CANCER DETECTION (ijscai)
Breast cancer is the leading cause of cancer mortality among women in developed countries and the second most prominent cause of cancer mortality among women worldwide. In recent decades, the prevalence of breast cancer in women has risen dramatically. This paper discusses several data analysis methods used to detect breast cancer early. Breast cancer diagnosis distinguishes benign from malignant breast lumps, and we tackled this analysis using data processing tools. Data mining is an important step of knowledge discovery in which intelligent methods are used to detect patterns. Several clinical breast cancer studies have been conducted using soft computing and machine learning techniques; some algorithms are simpler or more comprehensive than others. This research focuses on genetic programming and machine learning algorithms to reliably distinguish benign from malignant breast cancer, with the aim of optimising the classification algorithm. We used genetic programming methods to choose the best features and parameter values for the classifiers. We analyse the Wisconsin breast cancer data set available from the U.C.I. machine learning repository. In this experiment, we compare four Weka clustering strategies with genetic clustering. A comparison of results reveals that sequential minimal optimization (S.M.O.), at 97.71%, performs better than the I.B.K. and B.F. Tree methods.
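A toy version of the GA-driven feature selection described above can be written as a bitmask population evolved against a leave-one-out k-NN fitness. This is a sketch under assumed settings (population size, mutation rate, synthetic data); it does not reproduce the paper's Weka/S.M.O. experiments:

```python
import numpy as np

def knn_accuracy(X, y, mask, k=3):
    """Leave-one-out accuracy of k-NN restricted to features where mask is True."""
    if not mask.any():
        return 0.0
    Xs = X[:, mask]
    d = ((Xs[:, None] - Xs[None]) ** 2).sum(-1).astype(float)
    np.fill_diagonal(d, np.inf)                    # exclude the point itself
    nn = np.argsort(d, axis=1)[:, :k]
    pred = (y[nn].mean(axis=1) > 0.5).astype(int)  # majority vote of neighbours
    return float((pred == y).mean())

def ga_select(X, y, pop=20, gens=30, seed=0):
    """Genetic algorithm over feature bitmasks: tournament selection,
    uniform crossover, bit-flip mutation, and elitism."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    P = rng.random((pop, n)) < 0.5                 # random initial population
    for _ in range(gens):
        fit = np.array([knn_accuracy(X, y, m) for m in P])
        elite = P[np.argmax(fit)].copy()
        i, j = rng.integers(0, pop, (2, pop))
        parents = np.where((fit[i] >= fit[j])[:, None], P[i], P[j])  # tournaments
        cross = rng.random((pop, n)) < 0.5
        children = np.where(cross, parents, parents[rng.permutation(pop)])
        children ^= rng.random((pop, n)) < 0.05    # bit-flip mutation
        children[0] = elite                        # keep the best mask found
        P = children
    fit = np.array([knn_accuracy(X, y, m) for m in P])
    return P[np.argmax(fit)]

# synthetic data: only features 0 and 1 carry the class signal
rng = np.random.default_rng(2)
y = rng.integers(0, 2, 60)
X = rng.normal(0, 1, (60, 8))
X[:, 0] += 4 * y
X[:, 1] -= 4 * y
best = ga_select(X, y)
```

With elitism, the best mask found so far is never lost, so the selected subset should retain at least one of the informative features.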
Supervised machine learning based liver disease prediction approach with LASS... (journalBEEI)
In this contemporary era, the use of machine learning techniques is increasing rapidly in the field of medical science for detecting various diseases such as liver disease (LD). Around the globe, a large number of people die because of this deadly disease. By diagnosing the disease at a primary stage, early treatment can help cure the patient. In this research paper, a method is proposed to diagnose LD using supervised machine learning classification algorithms, namely logistic regression, decision tree, random forest, AdaBoost, KNN, linear discriminant analysis, gradient boosting and support vector machine (SVM). We also deployed a least absolute shrinkage and selection operator (LASSO) feature selection technique on the dataset to identify the attributes most highly correlated with LD. The predictions made by the algorithms with 10-fold cross-validation (CV) are evaluated in terms of accuracy, sensitivity, precision and f1-score. It is observed that the decision tree algorithm has the best performance, with accuracy, precision, sensitivity and f1-score values of 94.295%, 92%, 99% and 96% respectively with the inclusion of LASSO. Furthermore, a comparison with recent studies is shown to demonstrate the significance of the proposed system.
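The LASSO selection step can be illustrated on its own, independent of the eight classifiers. Below is a proximal-gradient (ISTA) sketch on synthetic data; the attribute count, coefficients and noise level are assumptions, not the paper's liver dataset:

```python
import numpy as np

def lasso_fit(X, y, lam=0.1, steps=2000):
    """LASSO via proximal gradient (ISTA):
    minimize ||Xw - y||^2 / (2n) + lam * ||w||_1."""
    n, p = X.shape
    w = np.zeros(p)
    L = np.linalg.norm(X, 2) ** 2 / n          # Lipschitz constant of the smooth part
    for _ in range(steps):
        g = X.T @ (X @ w - y) / n              # gradient of the least-squares term
        w = w - g / L
        w = np.sign(w) * np.maximum(np.abs(w) - lam / L, 0.0)  # soft-thresholding
    return w

# synthetic stand-in for a liver panel: 10 attributes, only 2 truly predictive
rng = np.random.default_rng(3)
X = rng.normal(0, 1, (200, 10))
true_w = np.zeros(10)
true_w[0], true_w[1] = 2.0, -1.5
y = X @ true_w + rng.normal(0, 0.1, 200)
w = lasso_fit(X, y, lam=0.1)
selected = np.where(np.abs(w) > 1e-3)[0]       # attributes LASSO keeps
```

The soft-thresholding step drives the coefficients of irrelevant attributes to exactly zero, which is why LASSO doubles as a feature selector before the downstream classifiers are trained.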
Innovative Technique for Gene Selection in Microarray Based on Recursive Clus... (AM Publications)
Gene selection is usually the crucial step in microarray data analysis. A great deal of recent research has focused on the challenging task of selecting differentially expressed genes from microarray data ('gene selection'). Numerous gene selection algorithms have been proposed in the literature, but it is often unclear exactly how these algorithms respond to conditions such as small sample sizes or differing variances, so choosing an appropriate algorithm can be difficult in many cases. This paper presents a combination of Analysis of Variance (ANOVA), Principal Component Analysis (PCA) and Recursive Cluster Elimination (RCE) with a classification algorithm, employing an innovative method for gene selection that reduces the gene expression data to a minimal gene subset. This new feature selection method uses the ANOVA statistical test, principal component analysis, KNN classification and RCE; at each step, redundant and irrelevant features are eliminated. Classification accuracy reaches up to 99.10%, with less classification time compared to other conventional techniques.
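The ANOVA-then-PCA front end of such a pipeline can be sketched as follows; the synthetic "microarray" (100 samples, 50 genes, 3 of them differentially expressed) is an assumption for illustration, and the RCE and KNN stages are omitted:

```python
import numpy as np

def anova_f(X, y):
    """Per-feature one-way ANOVA F statistic (between- vs within-class variance)."""
    classes = np.unique(y)
    overall = X.mean(axis=0)
    ssb = sum((y == c).sum() * (X[y == c].mean(axis=0) - overall) ** 2
              for c in classes)
    ssw = sum(((X[y == c] - X[y == c].mean(axis=0)) ** 2).sum(axis=0)
              for c in classes)
    return (ssb / (len(classes) - 1)) / (ssw / (len(X) - len(classes)))

def pca(X, d):
    """Project centered data onto its top-d principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T

rng = np.random.default_rng(4)
y = rng.integers(0, 2, 100)
X = rng.normal(0, 1, (100, 50))
X[:, :3] += 2.5 * y[:, None]                # genes 0-2 are differentially expressed

top = np.argsort(anova_f(X, y))[::-1][:10]  # keep the 10 highest-scoring genes
Z = pca(X[:, top], 2)                       # compress the kept genes to 2 components
```

The F-test filters genes univariately, and PCA then decorrelates what survives; a KNN classifier (and the recursive elimination loop) would operate on `Z`.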
Iganfis Data Mining Approach for Forecasting Cancer Threats (ijsrd.com)
Healthcare facilities have at their disposal vast amounts of cancer patients' data. Medical practitioners require more efficient techniques to extract relevant knowledge from this data for accurate decision-making; the challenge, however, is how to extract it and act upon it in a timely manner. If well engineered, this huge volume of data can aid in developing expert systems for decision support that assist physicians in diagnosing and predicting debilitating, life-threatening diseases such as cancer. Such expert systems can reduce cost and waiting time, liberate medical practitioners for more research, and reduce the errors and mistakes that humans make due to fatigue. The process of utilizing health data effectively, however, involves many challenges, such as missing feature values, the curse of dimensionality due to a large number of attributes, and determining the features that lead to more accurate diagnosis. Effective data mining tools can assist in the early detection of diseases such as cancer. In this paper, we propose a new approach called IGANFIS. This approach minimizes the number of features using the information gain (IG) algorithm, which is commonly used in text categorization to rank the quality of terms; here, IG is used to rank cancer features and thereby reduce their number. The reduced, high-quality feature dataset is then applied to an Adaptive Neuro-Fuzzy Inference System (ANFIS) to train and test the proposed approach. ANFIS training uses a hybrid learning algorithm that combines the gradient descent method with Least Squares Estimation (LSE) to compute the error measure for each training pair. Each cycle of ANFIS hybrid learning consists of a forward pass, which presents the input vector and calculates node outputs layer by layer for all data, and a backward pass, which uses the steepest descent algorithm to update the parameters, a process called back-propagation.
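The information-gain ranking that IGANFIS starts from is easy to show in isolation (the ANFIS stage is beyond a short sketch). Here is a minimal stdlib version on hypothetical discrete screening features:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(feature, labels):
    """Information gain of a discrete feature with respect to the class labels."""
    n = len(labels)
    cond = 0.0
    for v in set(feature):                  # expected entropy after the split
        sub = [l for f, l in zip(feature, labels) if f == v]
        cond += len(sub) / n * entropy(sub)
    return entropy(labels) - cond

# hypothetical screening records: 'lump' is perfectly informative, 'age_band' is not
labels   = ["pos", "neg", "pos", "neg", "pos", "neg"]
lump     = ["yes", "no",  "yes", "no",  "yes", "no"]
age_band = ["a",   "a",   "b",   "b",   "c",   "c"]
```

A perfectly predictive feature attains the full label entropy (here 1 bit), while a feature independent of the class gains nothing; IGANFIS would keep only the top-ranked features before training ANFIS.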
EVOLVING EFFICIENT CLUSTERING AND CLASSIFICATION PATTERNS IN LYMPHOGRAPHY DAT... (ijsc)
Data mining refers to the process of retrieving knowledge by discovering novel and relevant patterns from large datasets. Clustering and classification are two distinct phases in data mining that work to provide an established, proven structure from a voluminous collection of facts. A dominant area of modern-day research in the field of medical investigations includes disease prediction and malady categorization. In this paper, our focus is to analyze clusters of patient records obtained via unsupervised clustering techniques and to compare the performance of classification algorithms on the clinical data. Feature selection is a supervised method that attempts to select a subset of the predictor features based on information gain. The Lymphography dataset comprises 18 predictor attributes and 148 instances, with the class label taking four distinct values. This paper reports the accuracy of eight clustering algorithms in detecting clusters of patient records and predictor attributes, and the performance of sixteen classification algorithms on the Lymphography dataset for multi-class categorization of medical data. Our work asserts that the Random Tree algorithm and Quinlan's C4.5 algorithm give 100 percent classification accuracy both with all the predictor features and with the feature subset selected by the Fisher Filtering feature selection algorithm. It is also shown that the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm offers increased clustering accuracy in less computation time.
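Since DBSCAN is singled out, here is a minimal NumPy rendition of its density-based cluster expansion on synthetic points; `eps`, `min_pts` and the data are illustrative assumptions, not the Lymphography setup:

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN: grow clusters outward from core points; -1 marks noise."""
    n = len(X)
    d = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))   # pairwise distances
    neigh = [np.where(d[i] <= eps)[0] for i in range(n)]
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neigh[i]) < min_pts:
            continue                         # already assigned, or not a core point
        labels[i] = cluster
        queue = list(neigh[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neigh[j]) >= min_pts:  # core point: keep expanding
                    queue.extend(neigh[j])
        cluster += 1
    return labels

rng = np.random.default_rng(6)
X = np.r_[rng.normal(0, 0.1, (20, 2)),
          rng.normal(5, 0.1, (20, 2)),
          np.array([[10.0, 10.0]])]          # one far-away outlier
labels = dbscan(X, eps=0.5, min_pts=4)
```

Unlike k-means, the number of clusters is discovered from the density structure, and the isolated point is reported as noise rather than forced into a cluster.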
Hybrid System of Tiered Multivariate Analysis and Artificial Neural Network f... (IJECEIAES)
Improving the performance of coronary heart disease diagnosis systems has been an important research topic for several decades. One improvement is feature selection, so that only influential attributes are used in a diagnosis system based on data mining algorithms. Unfortunately, most feature selection is done under the assumption that all necessary attributes are already available, regardless of the stage at which each attribute is obtained and the cost required. This research proposes a hybrid model for the diagnosis of coronary heart disease. Diagnosis is preceded by a feature selection process using tiered multivariate analysis, with logistic regression as the analytical method. In the next stage, classification is performed using a multi-layer perceptron neural network. Based on test results, the proposed system achieves accuracy of 86.30%, sensitivity of 84.80%, specificity of 88.20%, positive predictive value (PPV) of 90.03%, negative predictive value (NPV) of 81.80% and area under the curve (AUC) of 92.1%. The diagnosis uses a combination of attributes covering risk factors, symptoms and exercise ECG. The conclusion that can be drawn is that the proposed diagnosis system delivers performance in the very good category, with relatively few examinations required and at a relatively low cost.
ICU Patient Deterioration Prediction: A Data-Mining Approach (csandit)
A huge amount of medical data is generated every day, which presents a challenge in analysing these data. The obvious solution to this challenge is to reduce the amount of data without information loss. Dimension reduction is considered the most popular approach for reducing data size and also for reducing noise and redundancies in data. In this paper, we investigate the effect of feature selection in improving the prediction of patient deterioration in ICUs. We consider lab tests as features; thus, choosing a subset of features means choosing the most important lab tests to perform. If the number of tests can be reduced by identifying the most important tests, then we can also identify the redundant tests. By omitting the redundant tests, observation time can be reduced and early treatment can be provided to avoid the risk; additionally, unnecessary monetary cost is avoided. Our approach uses state-of-the-art feature selection for predicting ICU patient deterioration from medical lab results. We apply our technique to the publicly available MIMIC-II database and show the effectiveness of the feature selection. We also provide a detailed analysis of the best features identified by our approach.
SCDT: FC-NNC-structured Complex Decision Technique for Gene Analysis Using Fu... (IJECEIAES)
In the classification of many diseases, accurate gene analysis is needed, for which selecting the most informative genes is very important; this requires a decision technique that can operate in a complex, ambiguous context. Traditional methods for selecting the most significant genes include statistical analyses such as the 2-Sample T-test (2STT), entropy and signal-to-noise ratio (SNR). This paper evaluates gene selection and classification on the basis of accurate gene selection using a structured complex decision technique (SCDT) and classifies the result using a fuzzy cluster-based nearest neighbor classifier (FC-NNC). The effectiveness of the proposed SCDT and FC-NNC is evaluated using the leave-one-out cross-validation (LOOCV) metric along with sensitivity, specificity, precision and F1-score, against four different classifiers, namely 1) Radial Basis Function (RBF), 2) Multi-Layer Perceptron (MLP), 3) Feed-Forward (FF) and 4) Support Vector Machine (SVM), on three different datasets: DLBCL, Leukemia and Prostate tumor. The proposed SCDT and FC-NNC exhibit superior results, making them the more accurate decision mechanism.
PERFORMANCE OF DATA MINING TECHNIQUES TO PREDICT IN HEALTHCARE CASE STUDY: CH... (ijdms)
With the promise of predictive analytics in big data and the use of machine learning algorithms, predicting the future is no longer a difficult task, especially for the health sector, which has witnessed a great evolution following the development of new computer technologies that gave birth to multiple fields of research. Many efforts have been made to cope with the explosion of medical data on one hand, and to obtain useful knowledge from it, predict diseases and anticipate cures on the other. This has prompted researchers to apply technical innovations such as big data analytics, predictive analytics and machine learning algorithms in order to extract useful knowledge and support decision-making. In this paper, we present an overview of the evolution of big data in the healthcare system and apply three learning algorithms to a set of medical data. The objective of this research work is to predict kidney disease using multiple machine learning algorithms, namely Support Vector Machine (SVM), Decision Tree (C4.5) and Bayesian Network (BN), and to choose the most efficient one.
Early Detection of Lung Cancer Using Neural Network Techniques (IJERA Editor)
Effective identification of lung cancer at an initial stage is a crucial image processing task. Several data mining methods have been used to detect lung cancer at an early stage. In this paper, an approach is presented which diagnoses lung cancer at an initial stage using CT scan images in DICOM (DCM) format. One of the key challenges is to remove white Gaussian noise from the CT scan image, which is done using a non-local means filter; Otsu's thresholding is then used to segment the lung. Textural and structural features are extracted from the processed image to form a feature vector. Three classifiers, namely SVM, ANN and k-NN, are applied for the detection of lung cancer to determine the severity of the disease (stage I or stage II), and the classifiers are compared with respect to quality attributes such as accuracy, sensitivity (recall), precision and specificity. The results show that SVM achieves the highest accuracy of 95.12%, while ANN achieves 92.68% on the given data set and k-NN shows the lowest accuracy of 85.37%. The SVM algorithm, at 95.12% accuracy, helps patients take remedial action on time and reduces the mortality rate from this deadly disease.
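Of the steps listed, Otsu's thresholding is the easiest to show in isolation. The sketch below runs the standard histogram search on a synthetic bright-region image; it is a generic illustration, not the paper's DICOM pipeline:

```python
import numpy as np

def otsu_threshold(img):
    """Otsu's method: pick the gray level maximizing between-class variance."""
    hist = np.bincount(img.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()
    bins = np.arange(256)
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = p[:t].sum(), p[t:].sum()    # class probabilities at threshold t
        if w0 == 0.0 or w1 == 0.0:
            continue
        mu0 = (bins[:t] * p[:t]).sum() / w0  # mean gray level of each class
        mu1 = (bins[t:] * p[t:]).sum() / w1
        var = w0 * w1 * (mu0 - mu1) ** 2     # between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t

rng = np.random.default_rng(5)
img = rng.normal(30, 5, (64, 64))                    # dark background near level 30
img[16:48, 16:48] = rng.normal(200, 5, (32, 32))     # bright region near level 200
img = np.clip(img, 0, 255).astype(np.uint8)
t = otsu_threshold(img)
mask = img >= t                                      # segmented bright region
```

On a bimodal histogram like this one, the maximizing threshold lands in the gap between the two modes, cleanly separating the bright region from the background.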
Nowadays, healthcare sector data are enormous, composite and diverse, because they contain data of different types, and extracting knowledge from these data is essential. For this purpose, data mining techniques may be utilized to mine knowledge by building models from a healthcare dataset. At present, the classification of breast cancer patients is a demanding research challenge for many researchers. To build a classification model for cancer patients, we used four different classification algorithms, namely J48, REPTree, RandomForest and RandomTree, tested on a dataset taken from UCI. The main aim of this paper is to classify a patient as benign (not cancer) or malignant (cancer), based on diagnostic measurements included in the dataset.
Breast cancer diagnosis and recurrence prediction using machine learning tech... (eSAT Journals)
Breast cancer has become a common cause of death among women. The long hours invested in manual diagnosis and the scarcity of available diagnostic systems emphasize the need for automated systems for early diagnosis of the disease. Our aim is to classify whether a breast cancer is benign or malignant and to predict the recurrence or non-recurrence of malignant cases after a certain period. To achieve this, we used machine learning techniques such as Support Vector Machine, Logistic Regression, KNN and Naive Bayes, coded in MATLAB using data from the UCI machine learning repository. We compared the accuracies of the different techniques and observed the results: SVM proved most suited for predictive analysis, and KNN performed best for our overall methodology. Keywords: Breast Cancer, SVM, KNN, Naive Bayes, Logistic Regression, Classification.
Performance Analysis of Data Mining Methods for Sexually Transmitted Disease ... (IJECEIAES)
According to health reports of Malang city, many people are exposed to sexually transmitted diseases, and most sufferers are not aware of the symptoms. Because Malang is known as a city of education, its population increases every year, which raises the risk of spreading sexually transmitted disease viruses. Solving this problem is important so that sufferers can be treated earlier and the financial burden on patients reduced. In this research, the authors apply data mining methods to classify sexually transmitted diseases. The experimental results show that K-NN is the best method for this problem, with 90% accuracy.
The growth of medical informatics can be observed nowadays. Advancement in different medical fields uncovers various critical diseases and provides guidelines for their cure. This has been possible only because of rich medical databases as well as automation of the data analysis process. This analysis process requires considerable learning and intelligence, for which data mining techniques provide the basis; various techniques are available, such as decision tree induction, rule-based classification or mining, support vector machines, stochastic classification, logistic regression, Naïve Bayes, artificial neural networks and fuzzy logic, and genetic algorithms. This paper provides the basics of data mining along with the effective techniques available in medical sciences, and reviews the efforts made on medical databases using data mining techniques for human disease diagnosis.
Study On Decision-Making For Café Management Alternatives (ijcsit)
In recent years, many new cafés have emerged onto the market. Other than view cafés, beautiful cafés that
seem as if they came from Paris or New York have gradually appeared in leisure and quiet residential
areas, in alleyways, in peripheral areas, and in local commercial areas. In particular, leisure is trendy at
present, and modern restaurants innovate in terms of their food, leisure, and consumption. Unlike
traditional restaurants, they are able to develop into cafés with unique styles to attract consumers. Even
though not all of these new cafés are successful, as cafés are an industry that is at the forefront of fashion,
many individuals who dream of entrepreneurship would want to open a café. However, as there are many
types of cafés on the market, which type and style of café is the most suitable? An overview of cafés in Taiwan shows that each café offers unique services and functions to attract consumers, which is the key to their sustainable operation. Therefore, this study explores the decisions companies make when choosing a style for their cafés. It uses the analytic hierarchy process (AHP) to explore the selection of café styles, in order to provide references for café operators seeking successful and sustainable operations. Based on a literature review, expert interviews, and AHP, this study intends to provide useful results to the operators of cafés.
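The AHP computation behind such a study reduces to extracting a priority vector from a pairwise comparison matrix and checking its consistency. The matrix below is a hypothetical three-style comparison on Saaty's 1-9 scale, not data from this study:

```python
import numpy as np

# pairwise comparisons for three hypothetical café styles:
# entry [i, j] = how strongly style i is preferred over style j (Saaty 1-9 scale)
A = np.array([[1.0, 3.0, 5.0],
              [1 / 3, 1.0, 3.0],
              [1 / 5, 1 / 3, 1.0]])

def ahp_priorities(A, iters=100):
    """Principal-eigenvector priorities and consistency ratio (Saaty's AHP)."""
    w = np.ones(len(A)) / len(A)
    for _ in range(iters):                 # power iteration
        w = A @ w
        w /= w.sum()
    lam = (A @ w / w).mean()               # approximate principal eigenvalue
    n = len(A)
    ci = (lam - n) / (n - 1)               # consistency index
    ri = {3: 0.58, 4: 0.90, 5: 1.12}[n]    # Saaty's random index for size n
    return w, ci / ri                      # priorities, consistency ratio

w, cr = ahp_priorities(A)
```

A consistency ratio below 0.1 is conventionally taken to mean the expert judgments are coherent enough to trust the resulting priorities.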
Clustering of Deep WebPages: A Comparative Study (ijcsit)
The internet has a massive amount of information, stored in the form of zillions of webpages. The information that can be retrieved by search engines is huge, and this information constitutes the 'surface web'. But the remaining information, which is not indexed by search engines (the 'deep web'), is much bigger than the 'surface web' and remains largely unexploited. Several machine learning techniques have been employed to access deep web content. Within machine learning, topic models provide a simple way to analyze large volumes of unlabeled text. A 'topic' is a cluster of words that frequently occur together; topic models can connect words with similar meanings and distinguish between words with multiple meanings. In this paper, we cluster deep web databases employing several methods and then perform a comparative study. In the first method, we apply Latent Semantic Analysis (LSA) to the dataset. In the second method, we use a generative probabilistic model called Latent Dirichlet Allocation (LDA) to model content representative of deep web databases. Both techniques are implemented after preprocessing the set of web pages to extract page contents and form contents. Further, we propose another version of Latent Dirichlet Allocation (LDA) applied to the dataset. Experimental results show that the proposed method outperforms the existing clustering methods.
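The LSA step of the comparison can be sketched directly from its definition (rank-k SVD of the term-document matrix); the four-document toy corpus below is an assumption, not the deep-web dataset, and the LDA variants are omitted:

```python
import numpy as np

docs = [
    "query interface schema database deep web",
    "schema matching database records web forms",
    "neural network training gradient descent",
    "gradient descent optimizer neural layers",
]
vocab = sorted({w for doc in docs for w in doc.split()})
# term-document count matrix: one row per document, one column per term
X = np.array([[doc.split().count(t) for t in vocab] for doc in docs], dtype=float)

def lsa(X, k):
    """Latent Semantic Analysis: rank-k SVD coordinates for each document."""
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * S[:k]

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

Z = lsa(X, 2)   # 2-dimensional 'topic space' coordinates
```

In the reduced space, documents about the same latent topic end up close together even when they share only a few literal terms, which is the property the clustering comparison relies on.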
Nowadays much is written about how to manage projects, but too little about what really happens in project actuality. Project actuality emerged in the Rethinking Project Management (RPM) agenda in 2006, and it aims at understanding what really happens in a project context. To understand the project actuality phenomenon, we first need a better comprehension of its definition and must discover how to observe and analyse it. This paper presents the results of a systematic review conducted to collect evidence on project actuality. The research focused on four search engines, covering publications from 1994 to 2013. Among other things, the study concludes that project actuality has been analysed by several methods and techniques, mostly in large organizations and public sectors in Northern Europe. The most common definitions, techniques and tips were identified, as well as the intent of transforming the results into knowledge.
Innovative Technique for Gene Selection in Microarray Based on Recursive Clus...AM Publications
Gene selection is usually the crucial step in microarray data analysis. A great deal of recent research has focused on the
challenging task of selecting differentially expressed genes from microarray data ('gene selection'). Numerous gene selection
algorithms have been proposed in the literature, but it is often unclear exactly how these algorithms respond to conditions like
small sample sizes or differing variances. Choosing an appropriate algorithm can therefore be difficult in many cases. This paper
presents an innovative gene selection method that combines Analysis of Variance (ANOVA), Principal Component Analysis (PCA), and
Recursive Cluster Elimination (RCE) with a KNN classifier. It reduces the gene expression data to a minimal gene subset. At each
step, redundant and irrelevant features are eliminated. Classification accuracy reaches up to 99.10%, with less classification
time than other conventional techniques.
Iganfis Data Mining Approach for Forecasting Cancer Threatsijsrd.com
Healthcare facilities have at their disposal vast amounts of cancer patients' data. Medical practitioners require more efficient techniques to extract relevant knowledge from this data for accurate decision-making. However, the challenge is how to extract this knowledge and act upon it in a timely manner. If well engineered, this huge volume of data can aid in developing expert systems for decision support that can assist physicians in diagnosing and predicting some debilitating, life-threatening diseases such as cancer. Expert systems for decision support can reduce cost and waiting time, liberate medical practitioners for more research, and reduce the errors and mistakes that humans can make due to fatigue and tiredness. The process of utilizing health data effectively, however, involves many challenges, such as the problem of missing feature values, the curse of dimensionality due to a large number of attributes, and the course of action to determine the features that can lead to a more accurate diagnosis. Effective data mining tools can assist in the early detection of diseases such as cancer. In this paper, we propose a new approach called IGANFIS. This approach optimally minimizes the number of features using the information gain (IG) algorithm, which is usually used in text categorization to select quality terms. Here, IG is used to select quality cancer features by reducing their number. The reduced, quality-feature dataset is then applied to an Adaptive Neuro-Fuzzy Inference System (ANFIS) to train and test the proposed approach. ANFIS training uses a hybrid learning algorithm that combines the gradient descent method and Least Squares Estimation (LSE) to compute the error measure for each training pair.
Each cycle of the ANFIS hybrid learning consists of a forward pass, which presents the input vector and calculates the node outputs layer by layer for all data, and a backward pass, which uses the steepest-descent algorithm to update parameters, a process called backpropagation.
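The information-gain scoring that IGANFIS applies to features can be illustrated with a short, self-contained sketch. This is a generic IG computation over discrete features, not the paper's code; the example labels are invented:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """IG of one discrete feature with respect to the class labels:
    entropy of the labels minus the expected entropy after splitting
    on the feature's values."""
    n = len(labels)
    by_value = {}
    for v, y in zip(feature, labels):
        by_value.setdefault(v, []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in by_value.values())
    return entropy(labels) - remainder
```

Features with IG near zero carry little class information and are the natural candidates for elimination before training the ANFIS stage.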
EVOLVING EFFICIENT CLUSTERING AND CLASSIFICATION PATTERNS IN LYMPHOGRAPHY DAT...ijsc
Data mining refers to the process of retrieving knowledge by discovering novel and relative patterns from
large datasets. Clustering and Classification are two distinct phases in data mining that work to provide an
established, proven structure from a voluminous collection of facts. A dominant area of modern-day
research in the field of medical investigations includes disease prediction and malady categorization. In
this paper, our focus is to analyze clusters of patient records obtained via unsupervised clustering
techniques and compare the performance of classification algorithms on the clinical data. Feature
selection is a supervised method that attempts to select a subset of the predictor features based on the
information gain. The Lymphography dataset comprises of 18 predictor attributes and 148 instances with
the class label having four distinct values. This paper highlights the accuracy of eight clustering algorithms
in detecting clusters of patient records and predictor attributes and highlights the performance of sixteen
classification algorithms on the Lymphography dataset that enables the classifier to accurately perform
multi-class categorization of medical data. Our work asserts that the Random Tree algorithm and
Quinlan's C4.5 algorithm give 100 percent classification accuracy with all the predictor features and
also with the feature subset selected by the Fisher Filtering feature selection algorithm. It is also noted
here that the Density Based Spatial Clustering of Applications with Noise (DBSCAN) clustering algorithm
offers increased clustering accuracy in less computation time.
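The DBSCAN algorithm highlighted above can be written compactly in pure Python. This is a textbook-style sketch with a toy 2-D dataset, not the evaluation setup from the paper:

```python
def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or -1 for noise.
    A point is a core point if it has at least min_pts neighbours
    (including itself) within distance eps."""
    def neighbours(i):
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

    labels = [None] * len(points)
    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1              # provisionally noise
            continue
        labels[i] = cid
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid         # border point reclaimed from noise
            if labels[j] is not None:
                continue
            labels[j] = cid
            nb = neighbours(j)
            if len(nb) >= min_pts:      # j is a core point: keep expanding
                queue.extend(nb)
        cid += 1
    return labels
```

On well-separated groups of patient records, density-reachable points collapse into one cluster each, while isolated records are flagged as noise (-1).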
Hybrid System of Tiered Multivariate Analysis and Artificial Neural Network f...IJECEIAES
Improving the diagnostic performance of systems for coronary heart disease has been an important research topic for several decades. One improvement can be achieved through feature selection, so that only influential attributes are used in a diagnosis system based on data mining algorithms. Unfortunately, most feature selection is done under the assumption that all the necessary attributes have already been provided, regardless of the stage at which each attribute is obtained and the cost required. This research proposes a hybrid model for the diagnosis of coronary heart disease. Diagnosis is preceded by a feature selection process using tiered multivariate analysis, with logistic regression as the analytical method. In the next stage, classification is performed using a multi-layer perceptron neural network. Based on test results, the proposed system achieves an accuracy of 86.3%, sensitivity of 84.80%, specificity of 88.20%, positive predictive value (PPV) of 90.03%, negative predictive value (NPV) of 81.80%, and area under the curve (AUC) of 92.1%. This performance is obtained using a combination of risk-factor, symptom, and exercise-ECG attributes. The conclusion that can be drawn is that the proposed diagnosis system delivers performance in the very good category, with relatively few examinations required and at a relatively low cost.
ICU Patient Deterioration Prediction : A Data-Mining Approachcsandit
A huge amount of medical data is generated every day, which presents a challenge in analysing these data. The obvious solution to this challenge is to reduce the amount of data without information loss. Dimension reduction is considered the most popular approach for reducing data size and also to reduce noise and redundancies in data. In this paper, we investigate the effect of feature selection in improving the prediction of patient deterioration in ICUs. We consider lab tests as features. Thus, choosing a subset of features would mean choosing the most important lab tests to perform. If the number of tests can be reduced by identifying the most important tests, then we could also identify the redundant tests. By omitting the redundant tests, observation time could be reduced and early treatment could be provided to avoid the risk. Additionally, unnecessary monetary cost would be avoided. Our approach uses state-of-the-art feature selection for predicting ICU patient deterioration using the medical lab results. We apply our technique on the publicly available MIMIC-II database and show the effectiveness of the feature selection. We also provide a detailed analysis of the best features identified by our approach.
SCDT: FC-NNC-structured Complex Decision Technique for Gene Analysis Using Fu...IJECEIAES
In the classification of many diseases, accurate gene analysis is needed, for which the selection of the most informative genes is very important; this requires a decision technique suited to a complex, ambiguous context. Traditional methods for selecting the most significant genes include statistical analyses such as the 2-Sample T-Test (2STT), entropy, and Signal-to-Noise Ratio (SNR). This paper evaluates gene selection and classification based on accurate gene selection using a structured complex decision technique (SCDT), and classifies the result using a fuzzy-cluster-based nearest neighbour classifier (FC-NNC). The effectiveness of the proposed SCDT and FC-NNC is evaluated with the leave-one-out cross-validation (LOOCV) metric, along with sensitivity, specificity, precision, and F1-score, against four different classifiers, namely 1) Radial Basis Function (RBF), 2) Multi-Layer Perceptron (MLP), 3) Feed Forward (FF), and 4) Support Vector Machine (SVM), on three different datasets: DLBCL, Leukemia, and Prostate tumor. The proposed SCDT and FC-NNC exhibit superior results, making them a more accurate decision mechanism.
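The SNR baseline mentioned above (the Golub-style signal-to-noise ratio) is simple enough to state in a few lines. This sketch and its toy expression values are illustrative, not taken from the paper:

```python
from statistics import mean, pstdev

def snr_score(values, labels):
    """Golub-style signal-to-noise ratio of one gene for a binary split:
    |mu0 - mu1| / (sigma0 + sigma1). Larger scores indicate genes whose
    expression separates the two classes more cleanly."""
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    return abs(mean(a) - mean(b)) / (pstdev(a) + pstdev(b))
```

Ranking all genes by this score and keeping the top few is the classical SNR selection step that methods like SCDT are compared against.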
PERFORMANCE OF DATA MINING TECHNIQUES TO PREDICT IN HEALTHCARE CASE STUDY: CH...ijdms
With the promises of predictive analytics in big data, and the use of machine learning algorithms,
predicting future is no longer a difficult task, especially for health sector, that has witnessed a great
evolution following the development of new computer technologies that gave birth to multiple fields of
research. Many efforts are done to cope with medical data explosion on one hand, and to obtain useful
knowledge from it, predict diseases and anticipate the cure on the other hand. This prompted researchers
to apply all the technical innovations like big data analytics, predictive analytics, machine learning and
learning algorithms in order to extract useful knowledge and help in making decisions. In this paper, we
present an overview of the evolution of big data in the healthcare system, and we apply three learning
algorithms to a set of medical data. The objective of this research work is to predict kidney disease by
using multiple machine learning algorithms, namely Support Vector Machine (SVM), Decision Tree (C4.5),
and Bayesian Network (BN), and to choose the most efficient one.
Early Detection of Lung Cancer Using Neural Network TechniquesIJERA Editor
Effective identification of lung cancer at an initial stage is an important and crucial aspect of image processing. Several data mining methods have been used to detect lung cancer at an early stage. In this paper, an approach is presented that diagnoses lung cancer at an initial stage using CT scan images in DICOM (DCM) format. One of the key challenges is to remove white Gaussian noise from the CT scan image, which is done using a non-local means filter; Otsu's thresholding is then used to segment the lung. Textural and structural features are extracted from the processed image to form a feature vector. Three classifiers, namely SVM, ANN, and k-NN, are applied for the detection of lung cancer to find the severity of the disease (stage I or stage II), and the classifiers are compared with respect to quality attributes such as accuracy, sensitivity (recall), precision, and specificity. The results show that SVM achieves the highest accuracy of 95.12%, while ANN achieves 92.68% accuracy on the given dataset and k-NN shows the lowest accuracy of 85.37%. The SVM algorithm, which achieves 95.12% accuracy, helps patients take remedial action on time and reduces the mortality rate from this deadly disease.
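Otsu's thresholding, used above to segment the lung, picks the grey level that maximises between-class variance over the image histogram. A compact sketch over a grey-level histogram (the histogram values here are made up, and real CT images would first be quantised to such a histogram):

```python
def otsu_threshold(hist):
    """Return the grey level t that maximises between-class variance.
    hist[g] is the pixel count for grey level g; pixels <= t form the
    background class, pixels > t the foreground class."""
    total = sum(hist)
    total_sum = sum(g * h for g, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = s0 = 0
    for t, h in enumerate(hist):
        w0 += h                           # pixels at or below t
        if w0 == 0 or w0 == total:
            continue                      # one class would be empty
        s0 += t * h
        mu0 = s0 / w0                     # background mean
        mu1 = (total_sum - s0) / (total - w0)  # foreground mean
        var = w0 * (total - w0) * (mu0 - mu1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t
```

For a bimodal histogram, the chosen threshold falls between the two modes, which is exactly what lung/background separation relies on.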
Nowadays, healthcare-sector data are enormous, complex, and diverse, containing data of many different types, and extracting knowledge from these data is essential. For this purpose, data mining techniques may be utilized to mine knowledge by building models from healthcare datasets. At present, the classification of breast cancer patients is a demanding research challenge for many researchers. To build a classification model for cancer patients, we used four different classification algorithms, namely J48, REPTree, RandomForest, and RandomTree, and tested them on a dataset taken from UCI. The main aim of this paper is to classify a patient as benign (not cancer) or malignant (cancer), based on diagnostic measurements included in the dataset.
Breast cancer diagnosis and recurrence prediction using machine learning tech...eSAT Journals
Abstract Breast cancer has become a common cause of death among women. The long hours invested in manual diagnosis and the scarcity of available diagnostic systems emphasize the need for automated systems for early diagnosis of the disease. Our aim is to classify whether a breast cancer is benign or malignant and to predict the recurrence or non-recurrence of malignant cases after a certain period. To achieve this, we used machine learning techniques such as Support Vector Machine, Logistic Regression, KNN, and Naive Bayes. These techniques were coded in MATLAB using the UCI machine learning repository. We compared the accuracies of the different techniques and observed the results. We found SVM most suited for predictive analysis, while KNN performed best for our overall methodology. Keywords: Breast Cancer, SVM, KNN, Naive Bayes, Logistic Regression, Classification.
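The accuracy comparisons reported in abstracts like the one above all derive from the same confusion-matrix arithmetic. A small sketch (the label vectors are invented, with 1 standing for malignant):

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity (recall), specificity, precision, and F1
    from paired true/predicted binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn)          # recall on the positive class
    specificity = tn / (tn + fp)          # recall on the negative class
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "precision": precision, "f1": f1}
```

Comparing classifiers on accuracy alone can mislead when classes are imbalanced, which is why sensitivity and specificity are usually reported alongside it.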
Performance Analysis of Data Mining Methods for Sexually Transmitted Disease ...IJECEIAES
According to health reports of Malang city, many people are exposed to sexually transmitted diseases, and most sufferers are not aware of the symptoms. As Malang is known as a city of education, its population increases every year, which raises the risk of spreading sexually transmitted disease viruses. This problem is important to solve so that sufferers can be treated earlier, in order to reduce the burden of patient spending. In this research, the authors apply data mining methods to classify sexually transmitted diseases. The experimental results show that K-NN is the best method for this problem, with 90% accuracy.
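The k-NN classifier that performs best above reduces to a distance sort and a majority vote. A minimal sketch with an invented toy dataset (not the Malang data):

```python
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of `query` by majority vote among the k
    nearest training points (squared Euclidean distance)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), y)
        for x, y in zip(train_x, train_y))
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]
```

In practice, features would be normalised first so that no single attribute dominates the distance, and k would be tuned by cross-validation.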
The growth of medical informatics can be observed nowadays. Advancements in different medical fields
help discover various critical diseases and provide guidelines for their cure. This has been possible
only because of rich medical databases as well as the automation of the data analysis process. This
analysis process requires considerable learning and intelligence, for which data mining techniques provide the
basis; various data mining techniques are available, such as Decision Tree Induction, Rule-Based
Classification or mining, Support Vector Machine, stochastic classification, Logistic Regression, Naïve
Bayes, Artificial Neural Network and Fuzzy Logic, and Genetic Algorithms. This paper provides the basics of
data mining, surveys the effective techniques available in the medical sciences, and reviews the efforts made on
medical databases using data mining techniques for human disease diagnosis.
Study On Decision-Making For Café Management Alternativesijcsit
In recent years, many new cafés have emerged onto the market. Other than view cafés, beautiful cafés that
seem as if they came from Paris or New York have gradually appeared in leisure and quiet residential
areas, in alleyways, in peripheral areas, and in local commercial areas. In particular, leisure is trendy at
present, and modern restaurants innovate in terms of their food, leisure, and consumption. Unlike
traditional restaurants, they are able to develop into cafés with unique styles to attract consumers. Even
though not all of these new cafés are successful, as cafés are an industry that is at the forefront of fashion,
many individuals who dream of entrepreneurship would want to open a café. However, as there are many
types of cafés on the market, what type and style of cafés are the most suitable? An overview of cafés in
Taiwan shows that each café offers unique services and functions to attract consumers, which is the key to
sustainable operations of cafés. Therefore, this study explores the decisions of companies when choosing
the style for their cafés. This study uses the analytical hierarchy process (AHP) to explore the selection of
café styles, in order to provide references for café operators to achieve successful and sustainable
operations. Based on literature review, expert interviews, and AHP, this study intends to provide useful
results to the operators of cafés.
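The AHP computation at the heart of the study above turns a pairwise-comparison matrix into a priority vector. A minimal sketch using the common geometric-mean approximation; the criteria and comparison values are invented for illustration:

```python
from math import prod

def ahp_priorities(matrix):
    """Priority vector from a pairwise-comparison matrix using the
    geometric-mean approximation, normalised to sum to 1.
    matrix[i][j] holds how much criterion i is preferred over j."""
    n = len(matrix)
    geo = [prod(row) ** (1.0 / n) for row in matrix]   # row geometric means
    total = sum(geo)
    return [g / total for g in geo]
```

For a consistent matrix such as [[1, 2, 4], [1/2, 1, 2], [1/4, 1/2, 1]] (say, style vs. price vs. location), the geometric-mean method recovers the exact weights 4/7, 2/7, 1/7; a full AHP study would also check the consistency ratio before trusting the weights.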
Clustering of Deep WebPages: A Comparative Studyijcsit
The internet has a massive amount of information. This information is stored in the form of zillions of
webpages. The information that can be retrieved by search engines is huge, and this information constitutes
the ‘surface web’. But the remaining information, which is not indexed by search engines – the ‘deep web’ –
is much bigger in size than the ‘surface web’, and remains unexploited yet.
Several machine learning techniques have been commonly employed to access deep web content. Under
machine learning, topic models provide a simple way to analyze large volumes of unlabeled text. A ‘topic’ is
a cluster of words that frequently occur together, and topic models can connect words with similar
meanings and distinguish between words with multiple meanings. In this paper, we cluster deep web
databases employing several methods, and then perform a comparative study. In the first method, we apply
Latent Semantic Analysis (LSA) over the dataset. In the second method, we use a generative probabilistic
model called Latent Dirichlet Allocation (LDA) for modeling content representative of deep web
databases. Both these techniques are implemented after preprocessing the set of web pages to extract page
contents and form contents. Further, we propose and apply a modified version of Latent Dirichlet Allocation (LDA) to the
dataset. Experimental results show that the proposed method outperforms the existing clustering methods.
Nowadays much is written about how to manage projects, but too little about what really happens in project
actuality. Project Actuality emerged from the Rethinking Project Management (RPM) agenda in 2006, and it
aims at understanding what really happens in project contexts. To be able to understand the project actuality
phenomenon, we first need a better comprehension of its definition and of how to observe
and analyse it. This paper presents the results of a systematic review conducted to collect evidence on
Project Actuality. The research focused on four search engines, covering publications from 1994 to 2013. Among
other findings, the study concludes that project actuality has been analysed by several methods and techniques,
mostly in large organizations and the public sector, in Northern Europe. The most common definitions,
techniques, and tips were identified, as well as the intent to transform the results into knowledge.
Improving initial generations in pso algorithm for transportation network des...ijcsit
The Transportation Network Design Problem (TNDP) aims to select the best project sets among a number of new projects. Recently, metaheuristic methods have been applied to solve the TNDP in the sense of finding better solutions sooner. PSO, as a metaheuristic method, is based on stochastic optimization and is a parallel evolutionary computation technique. The PSO system is initialized with a number of random solutions and searches for the optimal solution by improving generations. This paper studies the behavior of PSO with regard to improving the initial generation and the fitness value domain, to find better solutions in comparison with previous attempts.
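The generation-improvement loop that PSO performs can be sketched in a few dozen lines. This is a generic continuous-space PSO minimising a test function, not the discrete TNDP formulation from the paper; the inertia and acceleration constants are conventional textbook values:

```python
import random

def pso(f, dim, n_particles=20, iters=200, lo=-10.0, hi=10.0, seed=0):
    """Minimal PSO: each particle tracks its personal best, and all
    particles share one global best, which pulls the swarm together."""
    rng = random.Random(seed)
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    w, c1, c2 = 0.7, 1.5, 1.5         # inertia, cognitive, social weights
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = f(pos[i])
            if val < pbest_val[i]:    # improve personal best
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:   # improve global best
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val
```

For the TNDP, the continuous positions would be mapped to project selections, and the "improved initial generation" idea amounts to seeding `pos` with good heuristic solutions instead of uniform random ones.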
The use of mobile applications is now so common that users expect the companies whose services they
consume to already have an application providing those services, or a mobile version of their site, but this is not always simple or cheap to do. Thus, hybrid development has emerged as a potential alternative to this need. The evolution of this new paradigm has drawn the attention of researchers and companies as a viable alternative for mobile development. This paper shows how hybrid development can be an alternative for companies to provide their services with a low investment and still offer a great service to their clients.
SYMMETRICAL WEIGHTED SUBSPACE HOLISTIC APPROACH FOR EXPRESSION RECOGNITIONijcsit
Human facial expression is one of the cognitive activities or attributes used to deliver opinions to others. This paper mainly presents the performance of appearance-based holistic subspace methods based on Principal Component Analysis (PCA). In this work, texture features are extracted from face images using a Gabor filter. It was observed that the extracted texture feature vector space is high-dimensional and has
much redundant content. Hence training, testing, and classification times increase, and the expression recognition accuracy rate is reduced. To overcome this problem, the Symmetrical Weighted 2DPCA (SW2DPCA) subspace method is introduced. The extracted feature vector space is projected into a subspace using the SW2DPCA method. The proposed method is formed by applying weighted principles to the odd and even symmetrical
decomposition spaces of the training sample sets. The conventional PCA and 2DPCA methods yield a lower recognition rate due to larger variations in expression and lighting, caused by the greater number of redundant variants in the feature space. The proposed SW2DPCA method addresses this problem by reducing redundant content and discarding unequal variants. In this work, the well-known JAFFE database
is used for experiments and tested with the proposed SW2DPCA algorithm. The experimental results show that the facial expression recognition accuracy rate of the GF+SW2DPCA feature-fusion subspace method increases to 95.24% compared to the 2DPCA method.
Change management and version control of Scientific Applicationsijcsit
The development process of scientific applications is largely dependent on scientific progress and the
experimental research results. Thus, dealing with frequent changes is one of the main problems faced by
the developers of scientific software. Taking into account the results of the survey conducted among
scientists in the HP-SEE project, the implementation of change management and version control software
processes is inevitable. In this paper, we propose software engineering principles that should be included
in the development process to improve the version control and change management. Moreover, we give
some specific recommendations for their implementation, thereby making a slight modification of already
generally accepted templates and methods. The development steps practiced by scientists should not be
replaced completely, but they need to be supplemented with appropriate practices, documents and formal
methods. We also emphasize the reasons for the inclusion of these two processes and the consequences that
may arise as a result of their non-application.
Classification AlgorithmBased Analysis of Breast Cancer DataIIRindia
Classification algorithms are very frequently used for analyzing the various kinds of data available in different repositories with real-world applications. The main objective of this research work is to assess the performance of classification algorithms in analyzing breast cancer data, by analyzing mammogram images based on their characteristics. Different attribute values of cancer-affected mammogram images are considered for analysis in this work. The patients' food habits, age, lifestyle, occupation, information about the disease, and other details are taken into account for classification. Finally, the performance of the classification algorithms J48, CART, and ADTree is given with their accuracy. The accuracy of the chosen algorithms is measured by various measures such as specificity, sensitivity, and kappa statistics (errors).
A Classification of Cancer Diagnostics based on Microarray Gene Expression Pr...IJTET Journal
Abstract— Pattern Recognition (PR) plays an important role in the field of Bioinformatics. PR is concerned with processing raw measurement data by a computer to arrive at a prediction that can be used to formulate a decision. The important problems to which pattern recognition is applied have in common that they are too complex to model explicitly. Diverse PR methods are used to analyze, segment, and manage high-dimensional microarray gene data for classification. PR is concerned with the development of systems that learn to solve a given problem using a set of instances, each represented by a number of features. Microarray expression technologies make it possible to monitor the expression levels of thousands of genes simultaneously. The large amount of data generated by microarrays has stimulated the development of various computational methods for studying different biological processes by gene expression profiling. Microarray Gene Expression Profiling (MGEP) is important in Bioinformatics; it yields high-dimensional data used in various clinical applications such as cancer diagnostics and drug design. In this work, a new scheme is developed for the classification of unknown malignant tumors into known classes. In this scheme, the very high-dimensional microarray data are transformed into Mahalanobis space before classification. The suitability of the proposed classification scheme is demonstrated on 10 commonly available cancer gene datasets, containing both binary and multiclass data sets. To improve the performance of the classification, a gene selection method is applied to the datasets as a preprocessing and data extraction step.
EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...IJDKP
Over the past few years, there has been a considerable spread of microarray technology in many biological patterns, particularly in those pertaining to cancer diseases like leukemia, prostate, colon cancer, etc. The primary bottleneck that one experiences in the proper understanding of such datasets lies in their dimensionality, and thus, for an efficient and effective means of studying them, a large reduction in their dimension is deemed necessary. This study suggests different algorithms and approaches for reducing the dimensionality of such microarray datasets. It exploits the matrix-like structure of such microarray data and uses a popular technique called Non-Negative Matrix Factorization (NMF) to reduce the dimensionality, primarily in the field of biological data. Classification accuracies are then compared for these algorithms. This technique gives an accuracy of 98%.
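The NMF factorization itself can be sketched with the classic Lee–Seung multiplicative updates. This is a generic pure-Python illustration on a tiny made-up matrix, not the study's pipeline (real microarray data would be far larger and handled with NumPy):

```python
import random

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def nmf(V, rank, iters=200, seed=0):
    """Factor a non-negative matrix V (m x n) into W (m x rank) and
    H (rank x n) using Lee-Seung multiplicative updates, which keep
    all factors non-negative and never increase the squared error."""
    rng = random.Random(seed)
    m, n = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(rank)]
    eps = 1e-9                                  # guards against division by zero
    for _ in range(iters):
        WH = matmul(W, H)
        # H <- H * (W^T V) / (W^T W H)
        WtV, WtWH = matmul(transpose(W), V), matmul(transpose(W), WH)
        H = [[H[i][j] * WtV[i][j] / (WtWH[i][j] + eps)
              for j in range(n)] for i in range(rank)]
        WH = matmul(W, H)
        # W <- W * (V H^T) / (W H H^T)
        VHt, WHHt = matmul(V, transpose(H)), matmul(WH, transpose(H))
        W = [[W[i][j] * VHt[i][j] / (WHHt[i][j] + eps)
              for j in range(rank)] for i in range(m)]
    return W, H
```

For dimensionality reduction, the rows of W become the low-dimensional sample representations fed to the downstream classifier.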
Multivariate sample similarity measure for feature selection with a resemblan...IJECEIAES
Feature selection improves the classification performance of machine learning models. It also identifies the important features and eliminates those with little significance. Furthermore, feature selection reduces the dimensionality of training and testing data points. This study proposes a feature selection method that uses a multivariate sample similarity measure. The method selects features with significant contributions using a machine-learning model. The multivariate sample similarity measure is evaluated using the University of California, Irvine heart disease dataset and compared with existing feature selection methods. The multivariate sample similarity measure is evaluated with metrics such as minimum subset selected, accuracy, F1-score, and area under the curve (AUC). The results show that the proposed method is able to diagnose chest pain, thallium scan, and major vessels scanned using X-rays with a high capability to distinguish between healthy and heart disease patients with a 99.6% accuracy.
Breast Tumor Detection Using Efficient Machine Learning and Deep Learning Tec...mlaij
Breast cancer tissues grow when cells in the breast expand and divide uncontrollably, resulting in a lump of tissue commonly called a tumor. Breast cancer is the second most prevalent cancer among women, following skin cancer. While it is more commonly diagnosed in women aged 50 and above, it can affect individuals of any age. Although it is rare, men can also develop breast cancer, accounting for less than 1% of all cases, with approximately 2,600 cases reported annually in the United States. Early detection of breast tumors is crucial in reducing the risk of developing breast cancer. A publicly available dataset containing features of breast tumors was utilized to identify breast tumors using machine learning and deep learning techniques. Various prediction models were constructed, including logistic regression (LR), decision tree (DT), random forest (RF), support vector machine (SVM), Gradient Boosting (GB), Extreme Gradient Boosting (XGB), LightGBM, and a recurrent neural network (RNN) model. These models were trained to classify and predict breast tumor cases based on the provided features.
Breast cancer is the leading cause of death for women worldwide. Cancer can be discovered early, lowering the rate of death. Machine learning techniques are a hot field of research, and they have been shown to be helpful in cancer prediction and early detection. The primary purpose of this research is to identify which machine learning algorithms are the most successful in predicting and diagnosing breast cancer, according to five criteria: specificity, sensitivity, precision, accuracy, and F1 score. The project is finished in the Anaconda environment, which uses Python's NumPy and SciPy numerical and scientific libraries as well as matplotlib and Pandas. In this study, the Wisconsin diagnostic breast cancer dataset was used to evaluate eleven machine learning classifiers: decision tree, quadratic discriminant analysis, AdaBoost, Bagging meta estimator, Extra randomized trees, Gaussian process classifier, Ridge, Gaussian nave Bayes, k-Nearest neighbors, multilayer perceptron, and support vector classifier. During performance analysis, extremely randomized trees outperformed all other classifiers with an F1-score of 96.77% after data collection and data analysis.
GRAPHICAL MODEL AND CLUSTERINGREGRESSION BASED METHODS FOR CAUSAL INTERACTION...gerogepatton
The early detection of Breast Cancer, the deadly disease that mostly affects women is extremely complex because it requires various features of the cell type. Therefore, the efficient approach to diagnosing Breast Cancer at the early stage was to apply artificial intelligence where machines are simulated with intelligence and programmed to think and act like a human. This allows machines to passively learn and find a pattern, which can be used later to detect any new changes that may occur. In general, machine learning is quite useful particularly in the medical field, which depends on complex genomic measurements such as microarray technique and would increase the accuracy and precision of results. With this technology, doctors can easily diagnose patients with cancer quickly and apply the proper treatment in a timely manner. Therefore, the goal of this paper is to address and propose a robust Breast Cancer diagnostic system using complex genomic analysis via microarray technology. The system will combine two machine learning methods, K-means cluster, and linear regression.
Graphical Model and Clustering-Regression based Methods for Causal Interactio...gerogepatton
The early detection of Breast Cancer, the deadly disease that mostly affects women is extremely complex
because it requires various features of the cell type. Therefore, the efficient approach to diagnosing Breast
Cancer at the early stage was to apply artificial intelligence where machines are simulated with
intelligence and programmed to think and act like a human. This allows machines to passively learn and
find a pattern, which can be used later to detect any new changes that may occur. In general, machine
learning is quite useful particularly in the medical field, which depends on complex genomic
measurements such as microarray technique and would increase the accuracy and precision of results.
With this technology, doctors can easily diagnose patients with cancer quickly and apply the proper
treatment in a timely manner. Therefore, the goal of this paper is to address and propose a robust Breast
Cancer diagnostic system using complex genomic analysis via microarray technology. The system will
combine two machine learning methods, K-means cluster, and linear regression.
USING DATA MINING TECHNIQUES FOR DIAGNOSIS AND PROGNOSIS OF CANCER DISEASEIJCSEIT Journal
Breast cancer is one of the leading cancers for women in developed countries including India. It is the
second most common cause of cancer death in women. The high incidence of breast cancer in women has
increased significantly in the last years. In this paper we have discussed various data mining approaches
that have been utilized for breast cancer diagnosis and prognosis. Breast Cancer Diagnosis is
distinguishing of benign from malignant breast lumps and Breast Cancer Prognosis predicts when Breast
Cancer is to recur in patients that have had their cancers excised. This study paper summarizes various
review and technical articles on breast cancer diagnosis and prognosis also we focus on current research
being carried out using the data mining techniques to enhance the breast cancer diagnosis and prognosis.
SVM &GA-CLUSTERING BASED FEATURE SELECTION APPROACH FOR BREAST CANCER DETECTIONijscai
Mortality leading among women in developed countries is breast cancer. Breast cancer is women's second
most prominent cause of cancer mortality worldwide. In recent decades, women's high prevalence of breast
cancer has risen dramatically. This paper discussed several data analysis methods used to detect breast
cancer early. Breast cancer diagnosis distinguishes benign and malignant breast lumps. Using data
processing tools, we tackled this disease analysis. Data mining is an important step of library discovery
where intelligent methods are used to detect patterns. Several clinical breast cancer studies were
conducted using soft computing and machine learning techniques. Sometimes their algorithms are easier,
easier, or more comprehensive than others. This research is focused on genetic programming and machine
learning algorithms to reliably identify benign and malignant breast cancer. This study aimed to optimise
the testing algorithm. We used genetic programming methods to choose classification machines' best
features and parameter values. Data mining is an important step of library discovery where intelligent
methods are used to detect patterns. We are analysing data accessible from the U.C.I. deep-learning data
set in Wisconsin. In this experiment, we equate four Weka clustering strategies with genetic clustering. A
comparison of results reveals that sequential minimal optimization (S.M.O.) is better than I.B.K. and B.F.
Tree processes, i.e. 97.71%.
SVM &GA-CLUSTERING BASED FEATURE SELECTION APPROACH FOR BREAST CANCER DETECTIONijscai
Mortality leading among women in developed countries is breast cancer. Breast cancer is women's second most prominent cause of cancer mortality worldwide. In recent decades, women's high prevalence of breast cancer has risen dramatically. This paper discussed several data analysis methods used to detect breast
cancer early. Breast cancer diagnosis distinguishes benign and malignant breast lumps. Using data processing tools, we tackled this disease analysis. Data mining is an important step of library discovery where intelligent methods are used to detect patterns. Several clinical breast cancer studies were conducted using soft computing and machine learning techniques. Sometimes their algorithms are easier, easier, or more comprehensive than others. This research is focused on genetic programming and machine learning algorithms to reliably identify benign and malignant breast cancer. This study aimed to optimise
the testing algorithm. We used genetic programming methods to choose classification machines' best features and parameter values. Data mining is an important step of library discovery where intelligent methods are used to detect patterns. We are analysing data accessible from the U.C.I. deep-learning data
set in Wisconsin. In this experiment, we equate four Weka clustering strategies with genetic clustering. A comparison of results reveals that sequential minimal optimization (S.M.O.) is better than I.B.K. and B.F. Tree processes, i.e. 97.71%
Machine Learning Based Approaches for Cancer Classification Using Gene Expres...mlaij
The classification of different types of tumor is of great importance in cancer diagnosis and drug discovery.
Earlier studies on cancer classification have limited diagnostic ability. The recent development of DNA
microarray technology has made monitoring of thousands of gene expression simultaneously. By using this
abundance of gene expression data researchers are exploring the possibilities of cancer classification.
There are number of methods proposed with good results, but lot of issues still need to be addressed. This
paper present an overview of various cancer classification methods and evaluate these proposed methods
based on their classification accuracy, computational time and ability to reveal gene information. We have
also evaluated and introduced various proposed gene selection method. In this paper, several issues
related to cancer classification have also been discussed.
Predictive modeling for breast cancer based on machine learning algorithms an...IJECEIAES
Breast cancer is one of the leading causes of death among women worldwide. However, early prediction of breast cancer plays a crucial role. Therefore, strong needs exist for automatic accurate early prediction of breast cancer. In this paper, machine learning (ML) classifiers combined with features selection methods are used to build an intelligent tool for breast cancer prediction. The Wisconsin diagnostic breast cancer (WDBC) dataset is used to train and test the model. Classification algorithms, including support vector machine (SVM), light gradient boosting machine (LightGBM), random forest (RF), logistic regression (LR), k-nearest neighbors (k-NN), and naïve Bayes, were employed. Performance measures for each of them were obtained, namely: accuracy, precision, recall, F-score, Kappa, Matthews correlation coefficient (MCC), and time. The results indicate that without feature selection, LightGBM achieves the highest accuracy at 95%. With minimum redundancy maximum relevance (mRMR) feature selection (15 features), LightGBM outperforms other classifiers, achieving an accuracy of 98%. For Pearson correlation coefficient feature selection (15 features), LightGBM also excels with a 95% accuracy rate. Lasso feature selection (5 features) produces varied results across classifiers, with logistic regression achieving the highest accuracy at 96%. These findings underscore the importance of feature selection in refining model performance and in improving detection for breast cancer.
Mining of Important Informative Genes and Classifier Construction for Cancer ...ijsc
Microarray is a useful technique for measuring expression data of thousands or more of genes simultaneously. One of challenges in classification of cancer using high-dimensional gene expression data is to select a minimal number of relevant genes which can maximize classification accuracy. Because of the distinct characteristics inherent to specific cancerous gene expression profiles, developing flexible and robust gene identification methods is extremely fundamental. Many gene selection methods as well as their corresponding classifiers have been proposed. In the proposed method, a single gene with high classdiscrimination capability is selected and classification rules are generated for cancer based on gene expression profiles. The method first computes importance factor of each gene of experimental cancer dataset by counting number of linguistic terms (defined in terms of different discreet quantity) with high class discrimination capability according to their depended degree of classes. Then initial important genes are selected according to high importance factor of each gene and form initial reduct. Then traditional kmeans clustering algorithm is applied on each selected gene of initial reduct and compute missclassification errors of individual genes. The final reduct is formed by selecting most important genes with respect to less miss-classification errors. Then a classifier is constructed based on decision rules induced by selected important genes (single) from training dataset to classify cancerous and non-cancerous samples of experimental test dataset. The proposed method test on four publicly available cancerous gene expression test dataset. In most of cases, accurate classifications outcomes are obtained by just using important (single) genes that are highly correlated with the pathogenesis cancer are identified. 
Also to prove the robustness of proposed method compares the outcomes (correctly classified instances) with some existing well known classifiers.
MINING OF IMPORTANT INFORMATIVE GENES AND CLASSIFIER CONSTRUCTION FOR CANCER ...ijsc
Microarray is a useful technique for measuring expression data of thousands or more of genes
simultaneously. One of challenges in classification of cancer using high-dimensional gene expression data
is to select a minimal number of relevant genes which can maximize classification accuracy. Because of the
distinct characteristics inherent to specific cancerous gene expression profiles, developing flexible and
robust gene identification methods is extremely fundamental. Many gene selection methods as well as their
corresponding classifiers have been proposed. In the proposed method, a single gene with high classdiscrimination
capability is selected and classification rules are generated for cancer based on gene
expression profiles. The method first computes importance factor of each gene of experimental cancer
dataset by counting number of linguistic terms (defined in terms of different discreet quantity) with high
class discrimination capability according to their depended degree of classes. Then initial important genes
are selected according to high importance factor of each gene and form initial reduct. Then traditional kmeans
clustering algorithm is applied on each selected gene of initial reduct and compute missclassification
errors of individual genes. The final reduct is formed by selecting most important genes with
respect to less miss-classification errors. Then a classifier is constructed based on decision rules induced
by selected important genes (single) from training dataset to classify cancerous and non-cancerous samples
of experimental test dataset. The proposed method test on four publicly available cancerous gene
expression test dataset. In most of cases, accurate classifications outcomes are obtained by just using
important (single) genes that are highly correlated with the pathogenesis cancer are identified. Also to
prove the robustness of proposed method compares the outcomes (correctly classified instances) with some
existing well known classifiers.
Breast cancer diagnosis via data mining performance analysis of seven differe...cseij
According to World Health Organization (WHO), breast cancer is the top cancer in women both in the
developed and the developing world. Increased life expectancy, urbanization and adoption of western
lifestyles trigger the occurrence of breast cancer in the developing world. Most cancer events are
diagnosed in the late phases of the illness and so, early detection in order to improve breast cancer
outcome and survival is very crucial.
In this study, it is intended to contribute to the early diagnosis of breast cancer. An analysis on breast
cancer diagnoses for the patients is given. For the purpose, first of all, data about the patients whose
cancers’ have already been diagnosed is gathered and they are arranged, and then whether the other
patients are in trouble with breast cancer is tried to be predicted under cover of those data. Predictions of
the other patients are realized through seven different algorithms and the accuracies of those have been
given. The data about the patients have been taken from UCI Machine Learning Repository thanks to Dr.
William H. Wolberg from the University of Wisconsin Hospitals, Madison. During the prediction process,
RapidMiner 5.0 data mining tool is used to apply data mining with the desired algorithms.
A CLASSIFICATION MODEL ON TUMOR CANCER DISEASE BASED MUTUAL INFORMATION AND F...Kiogyf
A CLASSIFICATION MODEL ON TUMOR CANCER DISEASE BASED MUTUAL INFORMATION AND FIREFLY ALGORITHM
ABSTRACT
Cancer is a globally recognized cause of death. A proper cancer analysis demands the classification of several types of tumor. Investigations into microarray gene expressions seem to be a successful platform for revising genetic diseases. Although the standard machine learning (ML) approaches have been efficient in the realization of significant genes and in the classification of new types of cancer cases, their medical and logical application has faced several drawbacks such as DNA microarray data analysis limitation, which includes an incredible number of features and the relatively small size of an instance. To achieve a reasonable and efficient DNA microarray dataset information, there is a need to extend the level of interpretability and forecast approach while maintaining a great level of precision. In this work, a novel way of cancer classification based on based gene expression profiles is presented. This method is a combination of both Firefly algorithm and Mutual Information Method. First, the features are used to select the features before using the Firefly algorithm for feature reduction. Finally, the Support Vector Machine is used to classify cancer into types. The performance of the proposed system was evaluated by using it to classify datasets from colon cancer; the results of the evaluation were compared with some recent approaches.
Keywords: Feature Selection, Firefly Algorithm, Cancer Disease, Mutual Information
International Journal on Soft Computing, Artificial Intelligence and Applications (IJSCAI), Vol.3, No. 1, February 2014
DOI: 10.5121/ijscai.2014.3102

CANCER PROGNOSIS PREDICTION USING BALANCED STRATIFIED SAMPLING

J.S. Saleema¹, N. Bhagawathi², S. Monica², P. Deepa Shenoy²*, K.R. Venugopal² and L.M. Patnaik³

¹Department of Computer Science, Christ University, Bangalore, India
²Dept. of CSE, University Visvesvaraya College of Engineering, Bangalore, India
³Indian Institute of Science, Bangalore, India
ABSTRACT

High accuracy in cancer prediction is important for improving the quality of treatment and the survival rate of patients. As data volumes in healthcare research grow rapidly, the analytical challenge grows with them. The use of an effective sampling technique in classification algorithms consistently yields good prediction accuracy. The SEER public-use cancer database provides several prominent class labels for prognosis prediction. The main objective of this paper is to study the effect of sampling techniques on classifying the prognosis variable and to propose an ideal sampling method based on the outcome of the experiments. In the first phase of this work, the traditional random sampling and stratified sampling techniques were used. Next, balanced stratified sampling with variations according to the choice of prognosis class labels was tested. Much of the initial effort was devoted to pre-processing the SEER data set. The classification models for the experiments were built on the breast cancer, respiratory cancer and mixed cancer data sets with three traditional classifiers, namely Decision Tree, Naïve Bayes and K-Nearest Neighbour. The three prognosis factors survival, stage and metastasis were used as class labels for the experimental comparisons. The results show a steady increase in the prediction accuracy of the balanced stratified model as the sample size increases, whereas the traditional approaches fluctuate before reaching their optimum results.
Keywords
Cancer, Classification, Pre-processing, Sampling
1. INTRODUCTION

Data analytics and big data are today's market trends in academia, research institutes and industry. Solutions to most analytical problems are obtained mainly through machine learning and data mining techniques. The major data-handling challenge lies in healthcare, as the rate of increase in patients is proportional to population growth and lifestyle changes.
Cancer is the second most common cause of death, exceeded only by heart disease. Mortality rates are above 50% of the incidence in African and Asian countries; the incidence is higher in Western countries although the mortality there is lower. The Surveillance, Epidemiology, and End Results (SEER) Program of the National Cancer Institute in the United States is a well-known cancer database repository that is updated every year [1]. The repository contains data from 1973 to 2010. Many other countries have also taken up the maintenance of the
repository through various registries, as the overall number of cancer cases is rapidly increasing worldwide.
Classification-based decision systems have been popular in cancer diagnostics over the last decade. The hidden knowledge extracted helps physicians treat patients effectively and cure them at the earliest stage. Product-based cancer prediction calculators are available in the market and in the literature. Ankit Agrawal [2] has proposed a lung cancer outcome calculator that applies data mining to the SEER data set and assists doctors with a tool for survival prediction. Although efficient pre-processing improves prediction accuracy, it is widely observed in [3-8] that the choice of sampling technique strongly affects a classifier's results irrespective of its power.
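The effect of the sampling choice can be illustrated with a minimal, purely illustrative sketch (not taken from the paper): stratified selection preserves the class proportions of an imbalanced prognosis label, while plain random selection may not.

```python
# Illustrative sketch: random vs. stratified sampling of an imbalanced
# prognosis label, using only the Python standard library.
import random
from collections import Counter

def random_sample(labels, n, seed=42):
    """Plain random draw of n items; class proportions are not guaranteed."""
    rng = random.Random(seed)
    return rng.sample(labels, n)

def stratified_sample(labels, n, seed=42):
    """Draw from each class a share of n proportional to the stratum size."""
    rng = random.Random(seed)
    by_class = {}
    for lab in labels:
        by_class.setdefault(lab, []).append(lab)
    total = len(labels)
    sample = []
    for lab, members in by_class.items():
        k = round(n * len(members) / total)  # proportional allocation
        sample.extend(rng.sample(members, k))
    return sample

# 90/10 imbalanced label, e.g. survived vs. not survived at 5 years
labels = ["survived"] * 900 + ["not"] * 100
strat = stratified_sample(labels, 100)
print(Counter(strat))  # stratified draw keeps the 90/10 ratio exactly
```

With a 90/10 label distribution, the stratified draw of 100 records returns exactly 90 and 10 instances of the two classes; a random draw of the same size can deviate from that ratio.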
Random sampling and stratified sampling are the two traditional techniques used for selecting the data for classification. Although stratified sampling is balanced at the stratum level, the same may not hold for the overall sample. This paper follows a methodology that selects a balanced sample based on the class labels used in the experiments. If the class label contains a minority class, particularly for multivariate classes, a ratio is fixed while selecting the samples. If a class label value occurs below a fixed threshold, it is treated as missing and removed during pre-processing. The pre-processing includes null removal and correlation- and information-gain-ratio-based removal of attributes. A few attributes are transformed and modified as required for best results. Some attributes have been removed in consultation with the domain experts and references [3].
The organization of the paper is as follows: section 2 addresses the background work related to the issues in sampling and other pre-processing challenges for improving classifiers. Section 3 describes the data sets used for the experiments and the detailed methodology for building the proposed sampling model. Section 4 presents the analysis of the empirical results. Concluding remarks and a plan for future work are given in section 5.
2. LITERATURE REVIEW
A mature discussion of SEER data is found in the literature on SEER statistics and in the public domain [1-3]. A few researchers use the SEER*Stat software provided through NCI for cancer updates. The response variables for cancer prediction in this paper have been selected based on the prominent-variable identification study by Ankit Agrawal [2, 3]. Various classification methods with different kinds of pre-processing techniques have been used for individual cancer types such as lung, colorectal and breast. Principal steps of SEER data pre-processing include data conversion, merging or splitting the data based on the data types or the properties of the multivariate features, and construction of new features [3]. Most studies consider survival prediction the major problem. Survival prediction for lung, breast, colon and colorectal cancer using the base classifiers is found in the literature.
Survival prediction for cancer patients is of two types. In the first, a patient's survival period is defined based on the severity of the disease from the date of diagnosis. The second predicts using the follow-up features recorded after diagnosis. Less work is found on the survival of combined data, even though similarities in age, morphology, stage and extent of disease exist; this serves as a motivation for this paper. The SEER data set has been generated from different sources, and the recoding strategy based on the medical dictionary used by the SEER community leaves the data sparse and skewed. The data is skewed mainly because of the imbalanced class labels used for prediction analysis, as discussed in [4].
Class imbalance is treatable, but how it should be treated depends on the nature of the data set and the choice of balancing technique. In [5] the authors balance the samples using over-sampling with replacement for the minority class. Lemmens in [6] uses a weighted-ratio balancing approach to predict churn, and Liu explores ensemble-based under-sampling as a solution [7]. In this paper a simple balanced stratified sampling with a ratio of samples based on the prediction class labels is proposed. The importance of balanced sampling for improving classification accuracy is shown in Khalilia's work [8].
The SEER data pre-processing and selection of prominent labels have been applied as in the earlier work of this team [9]. Handling missing values was complex during the initial stage of pre-processing, and the random imputation for varied attributes in [10] supported the process. The varied survivability studies in [11-13] for breast, colorectal and colon cancer motivated the selection of two existing cancer data sets and the generation of a new mixed-type cancer data set by selecting samples from the breast and respiratory cancers for this study.
Quinlan [15] proposed the C4.5 decision tree algorithm, which overcomes the disadvantages of the basic ID3. The authors of [16] explain the three algorithms decision tree, Bayes and k-NN in detail. As the SEER data set is a combination of nominal and numerical data, these three algorithms are the right choice for building a classifier. Other well-known algorithms, ANN and SVM [17], have not been selected because of the time-consuming pre-processing of data they require in RapidMiner.
3. METHODOLOGY
The first part of this section describes the data pre-processing steps, followed by the sampling techniques and finally the classifiers with prediction and evaluation. Figure 1 presents the proposed framework with each major phase as a separate block.
Figure 1. Framework of the proposed model
3.1. Data Set Description
The SEER public-use database from the program of the National Cancer Institute (NCI) has been used for the empirical study of this paper [1]. The SEER program currently collects data from approximately 26% of the US population. Periodical incidences are recorded as separate files, and this paper is limited to the incidence files of all eight cancer categories during the years 2000-2007.
Each incidence text file contains different types of cancer data sets. More than two hundred thousand records are available in a single file. A patient profile is a single record of 254 characters representing a total of 118 attributes. The attributes are both nominal and numerical in type [2]. Only the breast and respiratory raw files have been used in the first phase of pre-processing. A mixed file generated from the two types serves as the third file in the experiment. Figure 2 is a sample MATLAB GUI developed for generating a mixed data set from the breast and respiratory cancer files. The repeatedly generated samples from the source files have been merged into Excel repositories.
Figure 2. Sample frame of mixed data set generation
There are 19 categories of attributes in the SEER database. From the 50000 processed samples, around 20000 sub-samples from the breast and respiratory data sets have been used for implementation. The major categories are demographic details, site-specific details, disease extension, multiple primaries and cause of death. The categories are nominal and continuous in nature.
3.2. Data Pre-processing
The pre-processing has been performed in two stages. In the first stage, the raw data from the SEER text file has been parsed into tokens as per the available data dictionary. The pre-processing has been coded in MATLAB, and the resulting data matrix has been preserved in an Excel file for further processing.
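The first-stage parsing can be sketched as follows. The attribute names and (start, length) positions below are illustrative placeholders, not the actual SEER dictionary layout, and Python stands in for the MATLAB code used by the authors.

```python
# Minimal sketch of the first pre-processing stage: slicing a fixed-width
# SEER incidence record into attribute tokens. The layout is hypothetical;
# the real 254-character layout comes from the SEER data dictionary.
ILLUSTRATIVE_LAYOUT = {
    "patient_id": (0, 8),   # (start index, field width) -- assumed values
    "age_recode": (8, 2),
    "sex": (10, 1),
}

def parse_record(line, layout):
    """Slice one fixed-width record into a {name: token} dict."""
    return {name: line[start:start + width].strip()
            for name, (start, width) in layout.items()}

record = "00001234072" + " " * 243      # padded to the full 254 characters
tokens = parse_record(record, ILLUSTRATIVE_LAYOUT)
```

Each parsed dict would then be appended as one row of the matrix exported to Excel.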
In the second phase, the features and the samples from the incidence years 2000-2007 have been analyzed for missing variables and records with null values. Based on the importance of the data, the missing features or samples have been removed. Renaming and transformation have been performed for a few attributes that were updated by the SEER registry. We reduced a few attributes manually by studying the actual parameters and the class label. Some attributes have been re-coded based on the SEER dictionary changes with respect to years. The number of attributes has been reduced from 118 to 64 in this phase.
SEER provides numeric variables and their relevant nominal recoded variants, such as age recode, histology recode, etc. We have used the re-coded nominal variables for the experiments. The survival variable has been re-coded using the Survival Time Recode (STR) in terms of months, as the data is a combination of years and months. For example, the input 0211 is converted to 35 months (2 × 12 + 11). As per the literature in [2], using the Vital Status Recode (VSR) and Cause of Death (COD), the survival data has been converted into a binary class label, survived and not survived, for further study in sampling.
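The STR conversion above can be expressed as a small helper; this is a sketch of the stated rule (two digits of years followed by two digits of months), not the authors' MATLAB code.

```python
def str_to_months(str_code):
    """Convert a 4-digit Survival Time Recode 'YYMM' into total months."""
    years, months = int(str_code[:2]), int(str_code[2:])
    return years * 12 + months

# the example from the text: 0211 -> 2 years 11 months
total = str_to_months("0211")
```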
As the target group of our study is between 2000 and 2007, based on the ICD-O-2 and ICD-O-3 medical standards, a few attributes changed from 2004 onwards. Such attributes have been identified and merged using the common definitions available in the SEER dictionary. For generating the metastasis variable, the Extent of Disease (EOD) has been merged with the Collaborative Stage (CS).
In the third phase all the remaining attributes have been processed for final feature extraction. The correlated attributes from the same category and attributes with low information gain have been removed as discussed in [9], reducing the set to 37 attributes. The correlation and filtration operators from RapidMiner have been used for this feature extraction with their default parameters. All filtered data sets have been stored in a separate Excel file.
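The information-gain criterion behind the attribute removal can be illustrated with a minimal computation. The toy attribute and class values are hypothetical, and the gain threshold applied by RapidMiner's filter operator is not reproduced here.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Reduction in class entropy obtained by splitting on one attribute."""
    n = len(labels)
    by_value = {}
    for v, y in zip(feature_values, labels):
        by_value.setdefault(v, []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in by_value.values())
    return entropy(labels) - remainder

# toy attribute that perfectly predicts the class: gain == class entropy
feat = ["a", "a", "b", "b"]
cls  = ["survived", "survived", "not", "not"]
gain = information_gain(feat, cls)
```

An attribute whose gain falls below a chosen cut-off would be dropped from the 64-attribute set.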
3.3. Sampling
The two traditional sampling techniques and the balanced stratified sampling have been used in this phase. The numbers of classes for the class labels metastasis, stage and survival are 10, 4 and 2 respectively. The sampling selection and classification have been implemented in the RapidMiner tool [14].
3.3.1. Random Sampling
In simple random sampling, each tuple of a data set D has an equal chance of being selected. The choice of sub-sample n from the selected data set N ⊆ D, ranging from 500 to 30000, has been fixed for the remaining sampling techniques; in this approach n is drawn at random. The direct random-sample operator from RapidMiner has been used for this analysis.
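A minimal sketch of the simple random sampling step, assuming plain without-replacement selection as in RapidMiner's sample operator:

```python
import random

def simple_random_sample(dataset, n, seed=0):
    """Each tuple has an equal chance of selection (without replacement)."""
    rng = random.Random(seed)
    return rng.sample(dataset, n)

data = list(range(50000))          # stand-in for the processed tuples
sub = simple_random_sample(data, 500)
```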
3.3.2. Stratified Sampling
Stratified sampling forms distinct strata of m homogeneous classes and selects samples from each stratum based on the required sub-sample n from the N total samples and m strata. A random choice of numbers from each stratum increases the occurrence of all class representatives, but can also result in over- or under-sampling depending on the availability of tuples in each stratum. The stratified sampling in RapidMiner uses the maximum possible size of data from each stratum; there is no guarantee that each class owns a representative. It depends on the overall population of 50000 that has been extracted during the pre-processing stage.
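The stratified selection described above can be sketched as proportional per-stratum draws. The exact allocation RapidMiner uses may differ, so this is an illustrative approximation:

```python
import random
from collections import defaultdict

def stratified_sample(rows, label_of, n, seed=0):
    """Draw roughly n rows, with per-stratum counts proportional to
    each stratum's share of the population (rounding may shift totals)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[label_of(row)].append(row)
    total = len(rows)
    sample = []
    for rows_in_stratum in strata.values():
        k = round(n * len(rows_in_stratum) / total)
        sample.extend(rng.sample(rows_in_stratum, min(k, len(rows_in_stratum))))
    return sample

rows = [("x", "I")] * 800 + [("y", "II")] * 200   # 80/20 class skew
sub = stratified_sample(rows, lambda r: r[1], n=100)
```

Note that a very small stratum can round to zero draws, which mirrors the text's point that a class representative is not guaranteed.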
3.3.3. Balanced Stratified Sampling
Over-sampling, under-sampling and imbalanced sampling have been avoided. Balanced stratified sampling selects samples from the m strata, but of equal size. If there are minority or majority classes, then over-sampling or under-sampling respectively needs to be performed as required, based on the distribution of the classes. In this paper the first balancing option is to ignore minority classes whose ratio falls below 1:100. As the response variable is multivariate in nature, ignoring the minority class leads to minimal error. The second method adopted in this paper is to take an equal number of samples from each stratum by reducing the strata to p < m, with p changing as the size of the sub-sample grows gradually from 500 to 30000. The balancing criteria have been maintained in an Excel file with the required size (i.e. from 500-30000), and that file has been given as input to the already built stratified model. As the survival attribute was binary, it was easy to pick the sample representatives; here the second, equal-width option has been used. For the minority multivariate classes like stage and metastasis, the 1:100 ratio has been verified using an Excel filter and the result given as input to RapidMiner for classification. This method corresponds to fixed allocation of balanced classes in the probability and statistics literature.
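The proposed balanced stratified sampling can be sketched as follows, under two assumptions flagged in the comments: that the 1:100 rule means strata holding less than 1% of the population are dropped (our reading of the text), and that the remaining p strata contribute equal shares of the sub-sample.

```python
import random
from collections import defaultdict

def balanced_stratified_sample(rows, label_of, n, min_ratio=1 / 100, seed=0):
    """Equal-size samples from each retained stratum.

    Assumption: strata rarer than min_ratio of the population are treated
    as minority classes and ignored, reducing the strata count to p < m.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[label_of(row)].append(row)
    total = len(rows)
    kept = {c: rs for c, rs in strata.items() if len(rs) / total >= min_ratio}
    per_class = n // len(kept)                 # the p strata share n equally
    return [row for rs in kept.values()
            for row in rng.sample(rs, min(per_class, len(rs)))]

# toy population: class "III" is below the 1:100 threshold and is dropped
rows = [("r", "I")] * 500 + [("r", "II")] * 495 + [("r", "III")] * 5
sub = balanced_stratified_sample(rows, lambda r: r[1], n=100)
```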
3.4. Classifiers
Many techniques are available in the literature to predict and classify cancer patterns. Machine learning and ANN systems are widely used in biomedical fields for various applications. A review of the literature shows that applications of these learning approaches have improved the accuracy of cancer classification and survival prediction compared to other statistical methods, but the system must be exercised to enhance the confidence, quality and reliability of the reported data [12]. To test the variations of the sampling techniques, three basic classifiers have been chosen for the implementation in this paper. The algorithms are well suited to nominal class labels [15-17].
3.4.1. Decision Tree
In decision tree classification, tree construction starts by identifying an attribute to create the root node based on the information-theoretic concept of entropy. Each such node splits the tree until a termination condition is met. The tree generated using the 36 attributes and one class label lets the model predict and evaluate the prediction accuracy for further testing. As the input attributes are of nominal and numeric type, a decision tree that works similarly to Quinlan's C4.5 [15] or CART has been selected among the various algorithms available in RapidMiner. The default information-gain settings have been used for training the decision tree. An initial trial was performed using the ID3 simple decision tree in RapidMiner, but the data conversion process for the numerical variables was time consuming, and hence C4.5 was chosen for the complete analysis.
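The entropy-based choice of a split attribute at a node can be illustrated with a one-level sketch. This shows only the selection criterion, not the full C4.5 algorithm used in RapidMiner, and the toy rows are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split_attribute(rows, labels):
    """Pick the attribute index with the highest information gain --
    the criterion used to choose each node of the tree."""
    base = entropy(labels)
    n = len(labels)
    best, best_gain = None, -1.0
    for j in range(len(rows[0])):
        groups = defaultdict(list)
        for row, y in zip(rows, labels):
            groups[row[j]].append(y)
        remainder = sum(len(g) / n * entropy(g) for g in groups.values())
        gain = base - remainder
        if gain > best_gain:
            best, best_gain = j, gain
    return best

# attribute 1 separates the classes perfectly, attribute 0 does not
rows   = [("a", "L"), ("a", "R"), ("b", "L"), ("b", "R")]
labels = ["s", "n", "s", "n"]
root = best_split_attribute(rows, labels)
```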
3.4.2. Naïve Bayes
Naïve Bayes uses prior knowledge from the tuples to predict the future through likelihood estimation. A Naïve Bayes classifier is a simple, low-cost probabilistic classifier based on Bayes' theorem. A Bayesian approach divides the posterior probability into a prior distribution and a likelihood function. In (1) the probability of X given C is calculated using the probability of C given X and the prior probabilities of X and C. The outcome of (1) is used to predict the unknown class C for any given X via P(C/X).
P(X/C) = P(C/X) P(X) / P(C)                                            (1)
The X vectors with 36 features and one class label C from survival, stage and metastasis have been used for training the algorithm. The 60:40 ratio has been applied using the RapidMiner split operator for random and stratified sampling, while manual filtering from the Excel file has been used for balanced sampling. The procedure has been executed iteratively and the experimental results stored in a separate file for comparison.
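A count-based sketch of the Naïve Bayes training and prediction described above; the feature values and class names are hypothetical, and no smoothing is applied, for brevity.

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count-based estimates of the prior P(C) and likelihoods P(x_j|C)."""
    prior = Counter(labels)
    likelihood = defaultdict(Counter)          # (class, feature j) -> counts
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            likelihood[(c, j)][v] += 1
    return prior, likelihood

def predict(row, prior, likelihood, n):
    """Pick the class maximising P(C) * product of P(x_j|C)."""
    def score(c):
        p = prior[c] / n
        for j, v in enumerate(row):
            p *= likelihood[(c, j)][v] / prior[c]
        return p
    return max(prior, key=score)

# hypothetical nominal features: (stage group, metastasis flag)
rows   = [("early", "no"), ("early", "no"), ("late", "yes"), ("late", "yes")]
labels = ["survived", "survived", "not_survived", "not_survived"]
prior, like = train_nb(rows, labels)
pred = predict(("early", "no"), prior, like, len(labels))
```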
3.4.3. K-Nearest Neighbours
K-Nearest Neighbours is a method that finds, in the training sample space, the closest neighbours having high similarity to a member of the testing sample. Distance measures are primarily used to find the k nearest neighbours (k-NN). k-NN is a type of instance-based, or lazy, learning, where the function is only approximated locally and all computation is deferred until classification. The default choice of k = 1 was initially used to test the working of the model; later k = 5 and k = 10 were used for variation analysis. As the accuracy deviated little during this test, k was kept at 10 for further analysis. The RapidMiner k-NN operator with the default mixed distance measures has been used, as the input variables are numeric and nominal in nature.
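The mixed distance measure and majority vote can be sketched as below. The equal weighting of numeric differences and nominal mismatches is an assumption for illustration; RapidMiner's own mixed measure is not reproduced here.

```python
from collections import Counter

def mixed_distance(a, b):
    """Assumed mixed measure: absolute difference for numeric fields,
    0/1 mismatch for nominal ones."""
    d = 0.0
    for x, y in zip(a, b):
        if isinstance(x, (int, float)):
            d += abs(x - y)
        else:
            d += 0.0 if x == y else 1.0
    return d

def knn_predict(train, query, k=10):
    """Majority label among the k nearest training tuples."""
    neighbours = sorted(train, key=lambda t: mixed_distance(t[0], query))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

# hypothetical tuples of (age, stage) with a survival label
train = ([((60, "II"), "not_survived")] * 6 +
         [((35, "I"), "survived")] * 6)
pred = knn_predict(train, (37, "I"), k=5)
```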
4. EXPERIMENTAL RESULTS
The SEER incidence raw data files of the years 2000-2007 have been used for the implementation. MATLAB code has been developed for the pre-processing phase to parse, transform and reduce the attributes. Based on [4], a few of the important features like stage, grade, marital status, etc. have been converted from numerical to nominal, while a few like survival time recode and sequence number have been split and reconstructed. As each source file contains over 200000 records, parsing has been performed in batches of 50000. The samples drawn from different batches are collated into one set for further processing. Table 1 presents the combinations of sample size, classifier algorithm and cancer type used for the comparison.
Table 1. Implementation features

Sample sizes: 500, 1000, 2000, 5000, 10000, 15000, 20000, 25000, 30000
Classifiers: Naïve Bayes, Decision Tree, KNN
Data sets: Breast Cancer, Respiratory Cancer, Mixed Type
Training:Testing ratio: 60:40
The training and testing data split is performed with a 60:40 ratio, as this is the default best split. Using the split operator of RapidMiner, this has been applied for random and stratified sampling. A minimum of 10 iterations has been performed for each sample size, and the best of the ten results is recorded here as output. Ten-fold cross-validation was not performed because of the difficulty of selecting balanced stratified samples as per our proposed definition. The time factor is a limitation of our work, which could be resolved in further studies.
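The best-of-ten evaluation protocol described above can be sketched as follows; the toy accuracy function is a placeholder for the actual classifier evaluation performed in RapidMiner.

```python
import random

def split_60_40(rows, seed):
    """Shuffle and split rows into 60% training and 40% testing."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * 0.6)
    return shuffled[:cut], shuffled[cut:]

def best_of_iterations(rows, evaluate, iterations=10):
    """Repeat the 60:40 split and keep the best accuracy, as in the text."""
    return max(evaluate(*split_60_40(rows, seed)) for seed in range(iterations))

def toy_accuracy(train, test):
    """Placeholder evaluator: accuracy of always predicting the
    majority training label."""
    train_labels = [l for _, l in train]
    majority = max(set(train_labels), key=train_labels.count)
    return sum(1 for _, l in test if l == majority) / len(test)

rows = [("x", "s")] * 70 + [("x", "n")] * 30
acc = best_of_iterations(rows, toy_accuracy)
```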
The processed Excel file has been given as input to the RapidMiner tool for classification. The results of all the sampling techniques used for the different class labels with the various classifiers are discussed below in detail. Classifier accuracy has been kept as the evaluation measure for this paper. Table 2 shows the simple random and stratified sample output of the breast cancer data set for all three classifiers with stage as the class label.
The result of the balanced sample for the same data set as in table 2 is shown in table 3. It has been observed that the balanced sampling technique gradually improves the accuracy and also shows a good difference in performance. Similar experiments have been conducted for all other class labels over the scaling sample sizes. Metastasis is a class generated from the severity of the disease using the increased primary node and multiple locations of disease extent. The balanced stage class is believed to score the maximum accuracy of all the members, and the result is consistent with the other two classifiers too.
Table 2. Breast cancer performance for class stage using stratified and random sampling
Sample Size   Stratified (DT, NB, KNN)   Random (DT, NB, KNN)
500 80.90% 84.92% 51.76% 85.00% 85.50% 43.50%
1000 66.83% 90.02% 49.88% 69.58% 89.03% 50.37%
2000 84.61% 93.12% 52.57% 84.25% 91.62% 51.38%
5000 66.87% 94.85% 55.62% 85.29% 93.25% 57.78%
10000 90.75% 95.52% 58.32% 87.55% 95.30% 59.65%
15000 66.90% 94.70% 61.33% 67.28% 95.03% 61.05%
20000 84.20% 93.95% 61.42% 85.38% 94.94% 63.41%
25000 84.49% 94.63% 63.00% 84.83% 94.17% 63.18%
30000 84.72% 94.73% 63.73% 84.37% 95.84% 64.14%
Table 3. Breast cancer performance for class stage using balanced stratified sampling
Sample Size   Balanced (DT, NB, KNN)
500 88.50% 67.60% 42.00%
1000 91.50% 69.00% 47.25%
2000 92.50% 71.33% 47.50%
5000 93.50% 72.10% 54.75%
10000 96.65% 73.62% 62.38%
15000 96.82% 76.15% 64.88%
20000 97.33% 78.00% 66.91%
25000 98.20% 78.25% 69.00%
30000 98.40% 79.00% 69.49%
As the stratified samples give results close to the random ones, only the comparison of random and balanced sampling for all three class labels of the respiratory data set is presented in figure 3. The result presented here is specific to KNN, as this was the weakest performer on breast cancer.
The results of the mixed cancer data set for stage and metastasis are shown in figure 4 for further discussion. As survival has been studied in detail by many authors in the literature, only stage and metastasis are shown in this figure. It is observed that the accuracies of the balanced stage and metastasis are almost equal throughout in the mixed type, and that the decision tree performed well in all cases of our experiment.
Figure 3. Performance of Respiratory Cancer for KNN Classifier
Figure 4. Performance of Mixed Cancer for Decision Tree Classifier
5. CONCLUSION
This paper predicts survival, metastasis and stage using the Naïve Bayes, decision tree and k-NN algorithms. As the SEER data is very large, ample pre-processing and proper sampling need to be used while modelling the classifier. The objective of this paper is to compare three sampling techniques: random, stratified, and balanced stratified. The model has been tested with the SEER data sets and shows a commendable variance in performance. It has been observed that balanced sampling always shows incremental performance proportional to the sample size. The experimental results show a steady increase in the prediction accuracy of the balanced stratified model as the sample size increases, whereas the traditional approaches fluctuate before reaching the optimum results. The work will be continued in future to study the performance of the modified balanced approach with ensemble classifiers and to automate the selection of balanced samples.
REFERENCES
[1] SEER Publication, Cancer Facts, Surveillance Research Program, Cancer Statistics Branch, limited-
use data (1973-2007). Available at: http:// seer.cancer.gov/data/.
[2] A. Agrawal, S. Misra, R. Narayanan, L. Polepeddi, and A. Choudhary, "A lung cancer outcome
calculator using ensemble data mining on SEER data," Proceedings of the ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, August 2011.
[3] R. Al-Bahrani, A. Agrawal, and A. Choudhary, “Colon cancer survival prediction using ensemble
data mining on SEER data,” in Proceedings of the IEEE Big Data Workshop on Bioinformatics and
Health Informatics (BHI), 2013.
[4] S. Li, Z. Wang, G. Zhou, and S. Y. Mei Lee, “Semi-supervised learning for imbalanced sentiment
classification,” in Proceedings of the Twenty-Second international joint conference on Artificial
Intelligence. AAAI Press. vol. 3, pp. 1826-1831, 2011.
[5] J. Thongkam, G. Xu, Y. Zhang, and F. Huang, "Toward breast cancer survivability prediction models
through improving training space," Expert Systems with Applications, vol. 36(10), pp. 12200-12209,
2009.
[6] Lemmens, Aurélie and C. Croux, “Bagging and boosting classification trees to predict churn,” Journal
of Marketing Research. pp. 276-286, 2006.
[7] X. Liu, J. Wu, and Z. Zhou, “Exploratory under-sampling for class-imbalance learning,” IEEE
Transactions on Systems, Man, and Cybernetics. vol. 39(2), pp.539-550, 2009.
[8] M. Khalilia, S. Chakraborty, and M. Popescu, “Predicting disease risks from highly imbalanced data
using random forest,” BMC Medical Informatics and Decision Making. 2011.
[9] J. S. Saleema, B. Sairam, S. D. Naveen, K. Yuvaraj and P Deepa Shenoy, “Prominent label
identification and multi-label classification for cancer prognosis prediction,” TENCON 2012 - 2012
IEEE Region 10 Conference. Cebu. November 2012.
[10] J. Chen, J. N. K. Rao, and R. R. Sitter, "Efficient random imputation for missing data in complex
surveys," Statistica Sinica, vol. 10, pp. 1153-1169, 2000.
[11] A. Bellaachia, E. Guven, “Predicting breast cancer survivability using data mining techniques,” Age:
Omaha, vol. 58, pp. 110-113, 2000.
[12] S. Kassem Fathy, “A prediction survival model for colorectal cancer,” Proceedings of the 2011
American conference on applied mathematics and the 5th WSEAS international conference on
Computer engineering and applications, pp 36-42, 2011.
[13] F.E Ahmed, “Artificial neural network for diagnosis and survival prediction in colon cancer,”
Molecular Cancer, vol. 4, no.29, 2005.
[14] Rapid Miner: Community Edition, Data Mining Software, Available at: http://rapidminer.com/
products/rapidminer-studio/.
[15] J. R. Quinlan, “Induction of decision trees,” Machine Learning, vol. 1, pp. 81-106, 1986.
[16] J. Han and M. Kamber, Data Mining Concepts and Techniques. San Francisco, CA: Morgan
Kaufmann, 2000.
[17] E. Alpaydin, Introduction to machine learning, 2nd ed. New Delhi: Prentice-Hall, 2010.