This document discusses developing predictive models for Acute Myeloid Leukemia (AML) patients using two computational modeling techniques: Artificial Neural Networks (ANN) and Logistic Regression (LR). Protein expression data from 191 AML patients was used to develop ANN and LR models to predict patient remission or resistance. The ANN exhibited high specificity and positive predictive value, indicating potential clinical use. While the LR performed poorly, it provided statistical insights. Integrating ANN and LR may create a more accurate predictive tool for physicians.
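The specificity and positive predictive value mentioned above are simple ratios of confusion-matrix counts. A minimal sketch follows; the function name and the counts are hypothetical and are not taken from the AML study, and which outcome (remission or resistance) counts as "positive" is an assumption left to the reader.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity and PPV from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true positives among all actual positives
        "specificity": tn / (tn + fp),  # true negatives among all actual negatives
        "ppv": tp / (tp + fp),          # true positives among all predicted positives
    }

# Hypothetical counts for illustration only (not results from the study).
metrics = diagnostic_metrics(tp=40, fp=5, tn=45, fn=10)
```

A model with high specificity and high PPV rarely raises a false alarm, which is one reason such a profile is attractive for clinical use even when sensitivity is lower.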
A CLASSIFICATION MODEL ON TUMOR CANCER DISEASE BASED MUTUAL INFORMATION AND F...Kiogyf
A CLASSIFICATION MODEL ON TUMOR CANCER DISEASE BASED MUTUAL INFORMATION AND FIREFLY ALGORITHM
ABSTRACT
Cancer is a globally recognized cause of death. Proper cancer analysis demands the classification of several types of tumor. Investigations into microarray gene expression seem to be a successful platform for studying genetic diseases. Although standard machine learning (ML) approaches have been efficient in identifying significant genes and in classifying new cancer cases, their medical and logical application has faced several drawbacks, such as the limitations of DNA microarray data analysis, which involves a very large number of features and a relatively small number of instances. To analyze DNA microarray datasets reasonably and efficiently, there is a need to improve interpretability and predictive ability while maintaining a high level of precision. In this work, a novel method of cancer classification based on gene expression profiles is presented. The method combines the Firefly algorithm with the Mutual Information method. First, Mutual Information is used to select the features before the Firefly algorithm is used for feature reduction. Finally, a Support Vector Machine is used to classify cancer into types. The performance of the proposed system was evaluated by using it to classify colon cancer datasets; the results of the evaluation were compared with some recent approaches.
Keywords: Feature Selection, Firefly Algorithm, Cancer Disease, Mutual Information
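As a rough illustration of the mutual-information filtering step described in the abstract, the sketch below scores discretised features against class labels and ranks them. It is not the authors' implementation: the function names, the toy data, and the assumption that expression levels are already discretised are all hypothetical, and the Firefly and SVM stages are omitted.

```python
import math
from collections import Counter

def mutual_information(feature, labels):
    """Mutual information (in bits) between a discrete feature and class labels."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    px = Counter(feature)
    py = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p(x,y) / (p(x) * p(y)) written with raw counts
        mi += p_xy * math.log2(p_xy * n * n / (px[x] * py[y]))
    return mi

def rank_features(columns, labels):
    """Return feature names sorted by decreasing mutual information with the labels."""
    scores = {name: mutual_information(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy, made-up data: expression levels already discretised to "low"/"high".
labels = ["tumor", "tumor", "normal", "normal"]
columns = {
    "gene_a": ["high", "high", "low", "low"],  # perfectly informative
    "gene_b": ["high", "low", "high", "low"],  # uninformative
}
ranking = rank_features(columns, labels)
```

In a pipeline of the kind the abstract outlines, only the top-ranked features would be passed on to the subsequent reduction and classification stages.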
Comparative analysis on bayesian classification for breast cancer problemjournalBEEI
The problem of imbalanced class distribution or small datasets is quite frequent in certain fields, especially in the medical domain. However, the classical Naive Bayes approach to dealing with uncertainties within medical datasets faces difficulties in selecting prior distributions, whereby parameter estimation methods such as maximum likelihood estimation (MLE) and maximum a posteriori (MAP) often hurt the accuracy of predictions. This paper presents the full Bayesian approach to assess the predictive distribution of all classes using three classifiers: naïve Bayes (NB), Bayesian networks (BN), and tree augmented naïve Bayes (TAN), with three datasets: Breast Cancer, Breast Cancer Wisconsin, and Breast Tissue. Next, the prediction accuracies of the Bayesian approaches are compared with three standard machine learning algorithms from the literature: K-nearest neighbor (K-NN), support vector machine (SVM), and decision tree (DT). The results showed that the best performance was achieved by the Bayesian networks (BN) algorithm, with an accuracy of 97.281%. The results are hoped to provide a baseline for further research on breast cancer detection. All experiments are conducted in the WEKA data mining tool.
Cancer prognosis prediction using balanced stratified samplingijscai
High accuracy in cancer prediction is important to improve the quality of the treatment and to improve the
rate of survivability of patients. As the data volume in healthcare research is increasing rapidly, the
analytical challenge is doubled. The use of an effective sampling technique in classification algorithms
always yields good prediction accuracy. The SEER public use cancer database provides various prominent
class labels for prognosis prediction. The main objective of this paper is to find the effect of sampling
techniques in classifying the prognosis variable and propose an ideal sampling method based on the
outcome of the experimentation. In the first phase of this work the traditional random sampling and
stratified sampling techniques have been used. At the next level the balanced stratified sampling with
variations as per the choice of the prognosis class labels has been tested. Much of the initial time has been
focused on performing the pre-processing of the SEER data set. The classification model for
experimentation has been built using the breast cancer, respiratory cancer and mixed cancer data sets with
three traditional classifiers namely Decision Tree, Naïve Bayes and K-Nearest Neighbour. The three
prognosis factors survival, stage and metastasis have been used as class labels for experimental
comparisons. The results show a steady increase in the prediction accuracy of the balanced stratified model
as the sample size increases, but the traditional approach fluctuates before the optimum results.
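One plausible reading of balanced stratified sampling is to draw the same number of records from every class label. The sketch below follows that reading; it is an assumption rather than the paper's exact procedure, and the function name, the record layout, and the toy data are hypothetical (no SEER data is used).

```python
import random
from collections import defaultdict

def balanced_stratified_sample(records, label_of, per_class, seed=0):
    """Draw up to `per_class` records from each class, without replacement."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    by_class = defaultdict(list)
    for record in records:
        by_class[label_of(record)].append(record)
    sample = []
    for label in sorted(by_class):
        group = by_class[label]
        k = min(per_class, len(group))  # a small class contributes all it has
        sample.extend(rng.sample(group, k))
    return sample

# Toy, made-up records: (patient id, survival label), heavily imbalanced.
records = [(i, "survived") for i in range(90)] + [(i, "died") for i in range(90, 100)]
sample = balanced_stratified_sample(records, lambda r: r[1], per_class=10)
```

Plain stratified sampling would instead keep the original 9:1 class proportions; equalising the per-class counts is what makes this variant "balanced" under the assumption above.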
Breast cancer diagnosis and recurrence prediction using machine learning tech...eSAT Journals
Abstract: Breast cancer has become a common cause of death among women. The long hours invested in manual diagnosis and the scarcity of available diagnostic systems emphasize the need to develop automated diagnosis for early detection of the disease. Our aim is to classify whether the breast cancer is benign or malignant and to predict the recurrence or non-recurrence of malignant cases after a certain period. To achieve this we have used machine learning techniques such as Support Vector Machine, Logistic Regression, KNN and Naive Bayes. These techniques are coded in MATLAB using data from the UCI Machine Learning Repository. We have compared the accuracies of the different techniques and observed the results. We found SVM most suited for predictive analysis, and KNN performed best for our overall methodology. Keywords: Breast Cancer, SVM, KNN, Naive Bayes, Logistic Regression, Classification.
A Comparative Study on the Methods Used for the Detection of Breast Cancerrahulmonikasharma
Breast cancer has become a leading cause of death among women worldwide. At an initial stage, a tumor in the breast is hard to detect. Manual attempts have proven to be time consuming and inefficient in many cases. Hence there is a need for efficient methods that diagnose cancerous cells with high accuracy and without human involvement. Mammography is an X-ray imaging method that uses high-resolution film so that it can detect tumors in the breast well. This paper describes a comparative study of various data mining methods for the detection of breast cancer using image processing techniques.
Survival Analysis of Determinants of Breast Cancer Patients at Hossana Queen ...Premier Publishers
Breast cancer is one of the most severe diseases in the world and has become an everyday public concern in both developed and developing countries. The primary goal of this study was to identify the determinants of survival time of breast cancer patients at Hossana hospital, south Ethiopia. The Kaplan-Meier estimation method and a new two-parameter probability distribution called hypertabastic are introduced to model the survival time of the data. A simulation study was carried out to evaluate the performance of the hypertabastic distribution in comparison with popular distributions with the help of the R and SAS statistical software packages. One-fourth (25%) of the total patients survived for only 2 days. 31 (35.2%) were censored, and 55 (62.5%) died. The hypertabastic survival model was found to be the best fit to the breast cancer data, and age, level of education, family history, previous breast problems, high-fat diet, having a child at a late age, early menarche, and late menopause were significant risk factors for the death of breast cancer patients. Awareness of the causes of breast cancer has to be raised in society, and screening tests and early detection policies for the highest-risk groups have to be established.
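The Kaplan-Meier estimator named in this abstract multiplies, at each observed death time, the fraction of at-risk patients who survive that time. A minimal sketch follows; the function name and the toy follow-up times are hypothetical and unrelated to the Hossana data, and the hypertabastic model is not shown.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival curve.

    times:  follow-up time for each patient
    events: 1 if death was observed at that time, 0 if the patient was censored
    Returns a list of (time, survival probability) at each distinct death time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed  # deaths and censored patients both leave the risk set
    return curve

# Toy, made-up data: five patients, two of them censored.
curve = kaplan_meier(times=[2, 3, 3, 5, 8], events=[1, 1, 0, 1, 0])
```

Censored patients never lower the curve directly; they only shrink the risk set used at later death times.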
On Predicting and Analyzing Breast Cancer using Data Mining ApproachMasud Rana Basunia
Breast cancer is one of the critical and pressing diseases affecting women. Predicting breast cancer manually takes a lot of time, and it is difficult for the physician to classify. So, detecting cancer through various automatic diagnostic techniques is very necessary. Data mining is the process of running powerful classification techniques that extract useful information from data. The uses and potential of these techniques have found their scope in medical data. Classification techniques tend to simplify the prediction segment.
MINING OF IMPORTANT INFORMATIVE GENES AND CLASSIFIER CONSTRUCTION FOR CANCER ...ijsc
Microarray is a useful technique for measuring expression data of thousands or more genes
simultaneously. One of the challenges in classifying cancer using high-dimensional gene expression data
is to select a minimal number of relevant genes that can maximize classification accuracy. Because of the
distinct characteristics inherent to specific cancerous gene expression profiles, developing flexible and
robust gene identification methods is fundamental. Many gene selection methods, as well as their
corresponding classifiers, have been proposed. In the proposed method, a single gene with high class-discrimination
capability is selected and classification rules are generated for cancer based on gene
expression profiles. The method first computes the importance factor of each gene of the experimental cancer
dataset by counting the number of linguistic terms (defined in terms of different discrete quantities) with high
class-discrimination capability according to their degree of dependency on the classes. Then the initial important genes
are selected according to the high importance factor of each gene and form the initial reduct. Then the traditional k-means
clustering algorithm is applied to each selected gene of the initial reduct, and the misclassification
errors of the individual genes are computed. The final reduct is formed by selecting the most important genes with
respect to the lowest misclassification errors. Then a classifier is constructed based on decision rules induced
by the selected important (single) genes from the training dataset to classify cancerous and non-cancerous samples
of the experimental test dataset. The proposed method is tested on four publicly available cancerous gene
expression test datasets. In most cases, accurate classification outcomes are obtained by using just the
important (single) genes, and genes that are highly correlated with the pathogenesis of cancer are identified. Also, to
prove the robustness of the proposed method, its outcomes (correctly classified instances) are compared with those of some
existing well-known classifiers.
Ensemble Classifier Approach in Breast Cancer Detection and Malignancy Gradin...ijmpict
The number of diagnosed cases of breast cancer is increasing annually and, unfortunately, translating into a
high mortality rate. Cancer, at the early stages, is hard to detect because the malignant cells show similar
properties (density) to the non-malignant cells. The mortality rate could be minimized if
breast cancer could be detected in its early stages. But current systems have not been able
to achieve a fully automatic system that is capable not just of detecting breast cancer but also of
detecting its stage. Estimation of malignancy grading is important in diagnosing the degree of growth of
malignant cells as well as in selecting a proper therapy for the patient. Therefore, a complete and efficient
clinical decision support system is proposed which is capable of performing breast cancer malignancy
grading very efficiently. The system is based on the image processing and machine learning domains.
The class imbalance problem, a machine learning problem, occurs when the number of instances of one class is
much higher than the number of instances of the other class, resulting in inefficient classification of samples and
hence a bad decision support system. Therefore EUSBoost, an ensemble-based classifier, is proposed, which is
efficient and is able to outperform other classifiers as it combines the benefits of a boosting algorithm with
random undersampling techniques. A comparison of EUSBoost with other techniques is also shown in the
paper.
Innovative Technique for Gene Selection in Microarray Based on Recursive Clus...AM Publications
Gene selection is usually the crucial step in microarray data analysis. A great deal of recent research has focused on the
challenging task of selecting differentially expressed genes from microarray data (‘gene selection’). Numerous gene selection
algorithms have been proposed in the literature, but it is often unclear exactly how these algorithms respond to conditions like
small sample-sizes or differing variances. Choosing an appropriate algorithm can therefore be difficult in many cases. This paper
presents a combination of Analysis of Variance (ANOVA), Principal Component Analysis (PCA), and Recursive Cluster Elimination
(RCE), a classification algorithm, employing an innovative method for gene selection. It reduces the gene expression data to
a minimal gene subset. This is a new feature selection method which uses the ANOVA statistical test, principal component
analysis, KNN classification, and RCE (recursive cluster elimination). At each step redundant and irrelevant features are
eliminated. Classification accuracy reaches up to 99.10%, with less classification time than other conventional techniques.
Outlier Modification and Gene Selection for Binary Cancer Classification usin...CSCJournals
The Gaussian linear Bayes classifier is one of the most popular approaches for classification. However, it is not so popular for cancer classification using gene expression data because of the problem of inverting its covariance matrix when there is a large number of gene variables and a small number of cancer patients/samples in the training dataset. To overcome these problems, we propose using a few top differentially expressed (DE) genes from both the upregulated and downregulated groups for binary cancer classification with the Gaussian linear Bayes classifier. Usually the top DE genes are selected by ranking the p-values of the t-test procedure. However, both the t-test statistic and the Gaussian linear Bayes classifier are sensitive to outliers. Therefore, we also propose outlier modification of the gene expression dataset before applying the proposed methods, since gene expression datasets are often contaminated by outliers due to the several steps involved in the data generating process, from hybridization to image analysis. The performance of the proposed method is investigated using both simulated and real gene expression datasets. It is observed that the proposed method improves performance with outlier modification for binary cancer classification.
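The idea of picking a few top DE genes from both the upregulated and downregulated groups can be sketched as follows. This is only an illustration: it ranks genes by Welch's t statistic as a stand-in for the p-value ranking the abstract mentions, it omits the outlier modification and the classifier, and the function names and expression values are hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's two-sample t statistic (positive when group a has the higher mean)."""
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / len(a) + variance(b) / len(b))

def top_de_genes(expr, group_a, group_b, k):
    """Return (upregulated, downregulated) gene names, k from each end of the ranking.

    expr:    gene name -> list of expression values, one per sample
    group_a: sample indices of the first group (e.g. cancer)
    group_b: sample indices of the second group (e.g. normal)
    """
    scores = {}
    for gene, values in expr.items():
        a = [values[i] for i in group_a]
        b = [values[i] for i in group_b]
        scores[gene] = welch_t(a, b)
    ranked = sorted(scores, key=lambda g: scores[g])  # most negative t first
    down = ranked[:k]
    up = ranked[::-1][:k]
    return up, down

# Toy, made-up expression values: samples 0-2 are "cancer", 3-5 are "normal".
expr = {
    "g_up":   [5.0, 5.2, 4.9, 1.0, 1.1, 0.9],
    "g_down": [1.0, 0.8, 1.2, 4.0, 4.1, 3.9],
    "g_flat": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],
}
up, down = top_de_genes(expr, group_a=[0, 1, 2], group_b=[3, 4, 5], k=1)
```

Because both the sample mean and the sample variance respond strongly to a single extreme value, one outlying measurement can change this ranking, which is the sensitivity the abstract sets out to address.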
Digital Pathology: Precision Medicine, Deep Learning and Computer Aided Inter...Joel Saltz
In this presentation, I will survey the development of Digital Pathology methodology beginning with the 1997 virtual microscope prototype at Hopkins to current tools, methods and algorithms designed to display, analyze and classify whole slide imaging data. I will describe methods, tools and algorithms to extract information from Pathology images. These tools include ability to traverse whole slide images, segment nuclei, carry out deep learning region classification and characterize relationship between extracted features and morphological structures. I will also describe some of the research efforts that motivate development of these tools, the role Pathomics is playing in precision medicine research as well as the impact of Pathology Informatics on clinical practice and health care quality.
Presentation at the Department of Biomedical Informatics, University Pittsburgh Medical Center, April 27, 2018
Association between delayed initiation of adjuvant CMF or anthracycline-based...Enrique Moreno Gonzalez
Adjuvant chemotherapy (AC) improves survival among patients with operable breast cancer. However, the effect of delay in AC initiation on survival is unclear. We performed a systematic review and meta-analysis to determine the relationship between time to AC and
survival outcomes.
USING DATA MINING TECHNIQUES FOR DIAGNOSIS AND PROGNOSIS OF CANCER DISEASEIJCSEIT Journal
Breast cancer is one of the leading cancers for women in developed countries including India. It is the
second most common cause of cancer death in women. The high incidence of breast cancer in women has
increased significantly in recent years. In this paper we have discussed various data mining approaches
that have been utilized for breast cancer diagnosis and prognosis. Breast cancer diagnosis
distinguishes benign from malignant breast lumps, and breast cancer prognosis predicts when breast
cancer is likely to recur in patients who have had their cancers excised. This study summarizes various
review and technical articles on breast cancer diagnosis and prognosis. We also focus on current research
being carried out using data mining techniques to enhance breast cancer diagnosis and prognosis.
Breast cancer diagnosis and recurrence prediction using machine learning tech...eSAT Journals
Abstract Breast Cancer has become the common cause of death among women. Due to long hours invested in manual diagnosis and lesser diagnostic system available emphasize the development of automated diagnosis for early diagnosis of the disease. Our aim is to classify whether the breast cancer is benign or malignant and predict the recurrence and non-recurrence of malignant cases after a certain period. To achieve this we have used machine learning techniques such as Support Vector Machine, Logistic Regression, KNN and Naive Bayes. These techniques are coded in MATLAB using UCI machine learning depository. We have compared the accuracies of different techniques and observed the results. We found SVM most suited for predictive analysis and KNN performed best for our overall methodology. Keywords: Breast Cancer, SVM, KNN, Naive Bayes, Logistic Regression, Classification.
A Comparative Study on the Methods Used for the Detection of Breast Cancerrahulmonikasharma
Among women in the world, the death caused by the Breast cancer has become the leading role. At an initial stage, the tumor in the breast is hard to detect. Manual attempt have proven to be time consuming and inefficient in many cases. Hence there is a need for efficient methods that diagnoses the cancerous cell without human involvement with high accuracy. Mammography is a special case of CT scan which adopts X-ray method with high resolution film. so that it can detect well the tumors in the breast. This paper describes the comparative study of the various data mining methods on the detection of the breast cancer by using image processing techniques.
Survival Analysis of Determinants of Breast Cancer Patients at Hossana Queen ...Premier Publishers
Breast cancer is one of the most severe diseases in the world and become the public’s ever day’s agenda in both developed and developing countries. The primary goal of this study was to identify the determinants of survival time of breast cancer patients at Hossana hospital, south Ethiopia. Kaplan-Meier estimation method and a new two-parameter probability distribution called hypertabastic are introduced to model the survival time of the data. A simulation study was carried out to evaluate the performance of the hypertabastic distribution in comparison with popular distribution with the help of R and SAS statistical software Packages. One-fourth (25%) of the total patients survived for only 2 days. 31(35.2%) were censored, and 55(62.5%) were died. Hypertabastic survival model was found to be best fitting to the breast cancer data and age, level of education, family history, breast problem before, High fat diet, child late age, early menarche, late menopause were significant risk factors for the death of breast cancer patients. Awareness has to be given for the society on causes of breast cancer and screening test and early detection policies for most risky groups has to be established.
On Predicting and Analyzing Breast Cancer using Data Mining ApproachMasud Rana Basunia
Breast Cancer is one of the crucial and burning diseases that has invaded women. Predicting breast cancer manually takes a lot of time and it is difficult for the physician to classification. So, detecting cancer through various automatic diagnostic techniques is very necessary. Data mining is the process of running powerful classification techniques that extract useful information from data. The uses and potentials of these techniques have found its scope in medical data. Classification techniques tend to simplify the prediction segment.
MINING OF IMPORTANT INFORMATIVE GENES AND CLASSIFIER CONSTRUCTION FOR CANCER ...ijsc
Microarray is a useful technique for measuring expression data of thousands or more of genes
simultaneously. One of challenges in classification of cancer using high-dimensional gene expression data
is to select a minimal number of relevant genes which can maximize classification accuracy. Because of the
distinct characteristics inherent to specific cancerous gene expression profiles, developing flexible and
robust gene identification methods is extremely fundamental. Many gene selection methods as well as their
corresponding classifiers have been proposed. In the proposed method, a single gene with high classdiscrimination
capability is selected and classification rules are generated for cancer based on gene
expression profiles. The method first computes importance factor of each gene of experimental cancer
dataset by counting number of linguistic terms (defined in terms of different discreet quantity) with high
class discrimination capability according to their depended degree of classes. Then initial important genes
are selected according to high importance factor of each gene and form initial reduct. Then traditional kmeans
clustering algorithm is applied on each selected gene of initial reduct and compute missclassification
errors of individual genes. The final reduct is formed by selecting most important genes with
respect to less miss-classification errors. Then a classifier is constructed based on decision rules induced
by selected important genes (single) from training dataset to classify cancerous and non-cancerous samples
of experimental test dataset. The proposed method test on four publicly available cancerous gene
expression test dataset. In most of cases, accurate classifications outcomes are obtained by just using
important (single) genes that are highly correlated with the pathogenesis cancer are identified. Also to
prove the robustness of proposed method compares the outcomes (correctly classified instances) with some
existing well known classifiers.
Ensemble Classifier Approach in Breast Cancer Detection and Malignancy Gradin...ijmpict
The diagnosed cases of Breast cancer is increasing annually and unfortunately getting converted into a
high mortality rate. Cancer, at the early stages, is hard to detect because the malicious cells show similar
properties (density) as shown by the non-malicious cells. The mortality ratio could have been minimized if
the breast cancer could have been detected in its early stages. But the current systems have not been able
to achieve a fully automatic system which is not just capable of detecting the breast cancer but also can
detect the stage of it. Estimation of malignancy grading is important in diagnosing the degree of growth of
malicious cells as well as in selecting a proper therapy for the patient. Therefore, a complete and efficient
clinical decision support system is proposed which is capable of achieving breast cancer malignancy
grading scheme very efficiently. The system is based on Image processing and machine learning domains.
Classification Imbalance problem, a machine learning problem, occurs when instances of one class is
much higher than the instances of the other class resulting in an inefficient classification of samples and
hence a bad decision support system. Therefore EUSBoost, ensemble based classifier is proposed which is
efficient and is able to outperform other classifiers as it takes the benefits of both-boosting algorithm with
Random Undersampling techniques. Also comparison of EUSBoost with other techniques is shown in the
paper.
Innovative Technique for Gene Selection in Microarray Based on Recursive Clus...AM Publications
Gene selection is usually the crucial step in microarray data analysis. A great deal of recent research has focused on the
challenging task of selecting differentially expressed genes from microarray data (‘gene selection’). Numerous gene selection
algorithms have been proposed in the literature, but it is often unclear exactly how these algorithms respond to conditions like
small sample-sizes or differing variances. Choosing an appropriate algorithm can therefore be difficult in many cases. This paper
presents combination of Analysis of Variance (ANOVA), Principle Component Analysis (PCA), Recursive Cluster Elimination
(RCE) a classification algorithm by employing a innovative method for gene selection. It reduces the gene expression data into
minimal number of gene subset. This is a new feature selection method which uses ANOVA statistical test, principal component
analysis, KNN classification &RCE (recursive cluster elimination). At each step redundant & irrelevant features are get
eliminated. Classification accuracy reaches up to 99.10% and lesser time for classification when compared to other convectional techniques.
Outlier Modification and Gene Selection for Binary Cancer Classification usin...CSCJournals
Gaussian linear Bayes classifier is one of the most popular approaches for classification. However, it is not so popular for cancer classification using gene expression data due to the inverse problem of its covariance matrix in presence of large number of gene variables with small number of cancer patients/samples in the training dataset. To overcome these problems, we propose few top differentially expressed (DE) genes from both upregulated and downregulated groups for binary cancer classification using the Gaussian linear Bayes classifier. Usually top DE genes are selected by ranking the p-values of t-test procedure. However, both t-test statistic and Gaussian linear Bayes classifier are sensitive to outliers. Therefore, we also propose outlier modification for gene expression dataset before applying to the proposed methods, since gene expression datasets are often contaminated by outliers due to several steps involves in the data generating process from hybridization to image analysis. The performance of the proposed method is investigated using both simulated and real gene expression datasets. It is observed that the proposed method improves the performance with outlier modifications for binary cancer classification.
Digital Pathology: Precision Medicine, Deep Learning and Computer Aided Inter...Joel Saltz
In this presentation, I will survey the development of Digital Pathology methodology beginning with the 1997 virtual microscope prototype at Hopkins to current tools, methods and algorithms designed to display, analyze and classify whole slide imaging data. I will describe methods, tools and algorithms to extract information from Pathology images. These tools include ability to traverse whole slide images, segment nuclei, carry out deep learning region classification and characterize relationship between extracted features and morphological structures. I will also describe some of the research efforts that motivate development of these tools, the role Pathomics is playing in precision medicine research as well as the impact of Pathology Informatics on clinical practice and health care quality.
Presentation at the Department of Biomedical Informatics, University Pittsburgh Medical Center, April 27, 2018
Association between delayed initiation of adjuvant CMF or anthracycline-based...Enrique Moreno Gonzalez
Adjuvant chemotherapy (AC) improves survival among patients with operable breast cancer. However, the effect of delay in AC initiation on survival is unclear. We performed a systematic review and meta-analysis to determine the relationship between time to AC and
survival outcomes.
USING DATA MINING TECHNIQUES FOR DIAGNOSIS AND PROGNOSIS OF CANCER DISEASEIJCSEIT Journal
Breast cancer is one of the leading cancers for women in developed countries including India. It is the
second most common cause of cancer death in women. The high incidence of breast cancer in women has
increased significantly in the last years. In this paper we have discussed various data mining approaches
that have been utilized for breast cancer diagnosis and prognosis. Breast cancer diagnosis
distinguishes benign from malignant breast lumps, and breast cancer prognosis predicts when breast
cancer is likely to recur in patients who have had their cancers excised. This paper summarizes various
review and technical articles on breast cancer diagnosis and prognosis; we also focus on current research
being carried out using data mining techniques to enhance breast cancer diagnosis and prognosis.
PREDICTION OF BREAST CANCER USING DATA MINING TECHNIQUESIAEME Publication
Women who have recovered from breast cancer (BC) constantly fear relapse. Having endured demanding treatment, they find recurrence their biggest fear. However, with current advances in technology, early prediction of recurrence can enable patients to receive treatment sooner. The availability of extensive data and advanced techniques makes accurate and fast prediction possible. This study aims to compare the accuracy of a few existing data mining algorithms in predicting BC recurrence. It embeds particle swarm optimization as a feature selection step in an ANN classifier, with the objective of increasing the accuracy of the prediction model.
A Review on Data Mining Techniques for Prediction of Breast Cancer RecurrenceDr. Amarjeet Singh
Breast cancer is the most common type of cancer in women
worldwide. Breast cancer may be
detected early using mammograms, probably before it has
spread. Recurrent breast cancer can occur months or years
after initial treatment. The cancer may return in the
same place as the original cancer (local recurrence), or it
may spread to other areas of the body (distant
recurrence). Early-stage treatment is given not only to cure
breast cancer but also to help prevent its
recurrence. Data mining algorithms provide
assistance in predicting early-stage breast cancer, which
has always been a difficult research problem. The
proposed research aims to identify the most effective algorithm
for predicting the recurrence of breast cancer and to improve
the accuracy of the algorithms. Large datasets can be analyzed
using data mining applications and techniques such as Clustering,
Classification, Association Rules, Prediction, Neural
Networks, and Decision Trees.
Evaluation of Logistic Regression and Neural Network Model With Sensitivity A...CSCJournals
Logistic Regression (LR) is a well known classification method in the field of statistical learning. It allows probabilistic classification and shows promising results on several benchmark problems. Logistic regression enables us to investigate the relationship between a categorical outcome and a set of explanatory variables. Artificial Neural Networks (ANNs) are popularly used as universal non-linear inference models and have gained extensive popularity in recent years. Research activities are considerable and literature is growing. The goal of this research work is to compare the performance of Logistic Regression and Neural Network models on publicly available medical datasets. The evaluation process of the model is as follows. The logistic regression and neural network methods with sensitivity analysis have been evaluated for the effectiveness of the classification. The Classification Accuracy is used to measure the performance of both the models. From the experimental results it is confirmed that the neural network model with sensitivity analysis model gives more efficient result.
REVIEW OF MACHINE LEARNING APPLICATIONS AND DATASETS IN CLASSIFICATION OF ACU...IJDKP
Cancer is an extremely heterogeneous disease. Leukemia is a cancer of the white blood cells and some other
cell types. Diagnosing leukemia is laborious in a multitude of areas including haematology. Machine
Learning (ML) is a branch of Artificial Intelligence. There is an emerging trend in ML models for data
classification. This review aimed to describe the literature on ML in the classification of datasets for acute
leukemia. In addition to describing the existing literature, this work aims to identify different sources of
publicly available data that could be utilised for research and development of intelligent machine learning
applications for classification. To the best of our knowledge, there is no such work that contributes such
information to the research community.
Development and Validation of a Nomogram for Predicting Response to Neoadjuva...semualkaira
Retrospective analysis of clinical data on female patients with breast cancer was performed. Model 1 was developed by entering variables from the univariate analysis (P < 0.1) into a multivariate logistic regression analysis. Model 2 was developed via the stepwise forward-backward variable selection technique in partial least squares regression. For model 3, the least absolute shrinkage and selection operator (LASSO) method was used to choose suitable variables, followed by the multivariate logistic regression analysis. Harrell’s C-index, calibration curves, and decision curve analyses (DCA) were used to compare the performance of the models. In the validation cohort, these results were validated
Breast cancer is the leading cause of death for women worldwide. Cancer can be discovered early, lowering the rate of death. Machine learning techniques are a hot field of research, and they have been shown to be helpful in cancer prediction and early detection. The primary purpose of this research is to identify which machine learning algorithms are the most successful in predicting and diagnosing breast cancer, according to five criteria: specificity, sensitivity, precision, accuracy, and F1 score. The project is finished in the Anaconda environment, which uses Python's NumPy and SciPy numerical and scientific libraries as well as matplotlib and Pandas. In this study, the Wisconsin diagnostic breast cancer dataset was used to evaluate eleven machine learning classifiers: decision tree, quadratic discriminant analysis, AdaBoost, Bagging meta estimator, Extra randomized trees, Gaussian process classifier, Ridge, Gaussian naive Bayes, k-Nearest neighbors, multilayer perceptron, and support vector classifier. During performance analysis, extremely randomized trees outperformed all other classifiers with an F1-score of 96.77% after data collection and data analysis.
INTEGRATING MACHINE LEARNING IN CLINICAL DECISION SUPPORT SYSTEMShiij
This review article examines the role of machine learning (ML) in enhancing Clinical Decision Support
Systems (CDSSs) within the modern healthcare landscape. Focusing on the integration of various ML
algorithms, such as regression, random forest, and neural networks, the review aims to showcase their
potential in advancing patient care. A rapid review methodology was utilized, involving a survey of recent
articles from PubMed and Google Scholar on ML applications in healthcare. Key findings include the
demonstration of ML's predictive power in patient outcomes, its ability to augment clinician knowledge,
and the effectiveness of ensemble algorithmic approaches. The review highlights specific applications of
diverse ML models, including moment kernel machines in predicting surgical outcomes, k-means clustering
in simplifying disease phenotypes, and extreme gradient boosting in estimating injury risk. Emphasizing
the potential of ML to tackle current healthcare challenges, the article highlights the critical role of ML in
evolving CDSSs for improved clinical decision-making and patient care. This comprehensive review also
addresses the challenges and limitations of integrating ML into healthcare systems, advocating for a
collaborative approach to refine these systems for safety, efficacy, and equity.
Breast Tumor Detection Using Efficient Machine Learning and Deep Learning Tec...mlaij
Breast cancer tissues grow when cells in the breast expand and divide uncontrollably, resulting in a lump of tissue commonly called a tumor. Breast cancer is the second most prevalent cancer among women, following skin cancer. While it is more commonly diagnosed in women aged 50 and above, it can affect individuals of any age. Although it is rare, men can also develop breast cancer, accounting for less than 1% of all cases, with approximately 2,600 cases reported annually in the United States. Early detection of breast tumors is crucial in reducing the risk of developing breast cancer. A publicly available dataset containing features of breast tumors was utilized to identify breast tumors using machine learning and deep learning techniques. Various prediction models were constructed, including logistic regression (LR), decision tree (DT), random forest (RF), support vector machine (SVM), Gradient Boosting (GB), Extreme Gradient Boosting (XGB), Light GBM, and a recurrent neural network (RNN) model. These models were trained to classify and predict breast tumor cases based on the provided features.
Anjali_Ganguly_Siemens_2014
Towards a Novel Model for the Prognosis of Acute Myeloid Leukemia Patients: An Evaluation
of Cancer Modeling Techniques
Executive Summary
In recent years, as the incidence of cancer continues to climb, much research attention has
been devoted to alternative methods of cancer management, among them computational
modeling. While computational modeling offers many promises to cancer researchers, it faces
several challenges as well, including the lack of a robust, clinically applicable cancer model.
This study aims to develop an accurate, comprehensive predictive model for patients of Acute
Myeloid Leukemia, one of the most devastating cancers. This paper presents the initial
development of this model with the evaluation of the predictive potential of two widely used
computational models, the Artificial Neural Network (ANN) and Logistic Regression (LR). Both
the ANN and LR were developed in Python using protein expression data from 191 AML
patients so that, given a patient’s profile at the time of diagnosis, the model could predict
whether the patient would go into remission or exhibit resistance to current therapies for AML.
The predictive potential of these models was measured using five criteria: sensitivity, specificity,
positive predictive value, negative predictive value, and overall accuracy. The ANN exhibited
excellent recognition of patients likely to go into remission, which would be greatly valuable in
the clinic. While the LR did not perform as well, it did yield important statistical information that
set it up as a likely supplement as part of a larger, robust model. Our results indicate that an
integration of the ANN and LR will likely yield the most effective predictive tool, as the two
together compensate for the flaws they exhibit individually. This research is the first step in the
creation of a novel tool that doctors can use to predict a patient’s prognosis at the time of
diagnosis, contributing to a more individualized, effective approach to cancer therapy.
Abstract
This paper presents the initial stages of the development of a predictive model for Acute Myeloid
Leukemia (AML) patients. This study assessed the respective advantages and disadvantages of
two widely used computational models, the Artificial Neural Network (ANN) and Logistic
Regression (LR), in order to firstly, evaluate their potential to predict, given a patient’s profile at
diagnosis, whether the patient will enter remission or exhibit primary resistance, and secondly, to
offer a broader understanding of the current necessities of cancer modeling. The ANN and LR
were both developed in Python using proteomic data (231 proteins) obtained from 191 AML
patients. Predictive potential of models was measured using sensitivity, specificity, positive
predictive value, negative predictive value, and overall accuracy. The ANN exhibited an
excellent specificity and positive predictive value, indicating its potential clinical applications;
however, it did exhibit a tendency to overfit to data, confirming one hypothesis of this research.
While the LR was a relatively poor performer, it did yield important statistical information that
set it up as a likely supplement as part of a larger, robust model. Thus, our results point us to
integration of the two modeling techniques to create the novel predictive tool for physicians we
envision.
1. Introduction
Cancer persists today as one of the leading causes of mortality worldwide, threatening to
claim almost 600,000 lives in the United States alone this year (ACS et al. 2014). As the
incidence of cancer continues to climb, researchers are prompted to go beyond traditional
techniques of treatment to explore alternative methods of cancer management. Among the
various methods gaining popularity, mathematical and computational modeling has become an
especially promising new avenue for cancer research.
The medical community has mastered the art of data collection with regard to cancer: today,
gathering and storing data about a patient’s clinical and genetic correlates is no longer the
challenge. Instead, extracting valuable information from this wealth of “big data” proves to be
the next large obstacle for cancer researchers to surmount (Quaranta et al. 2005). This is where
computational modeling, or a “top-down” approach to medicine, offers great potential: by
providing researchers with tools able to perform large-scale analysis, computational
modeling, with its enormous predictive capacity, may yield the means necessary to improve the
management of cancer starting from diagnosis. As the focus on cancer prediction models has
grown, ensuring their proper development and use has become of the utmost importance
(Durrett et al. 2013).
However, cancer modeling faces challenges that currently prevent it from becoming
widely adopted in clinics. As the type and number of cancer models continue to rise, the
relative lack of assessment of the specific strengths and weaknesses of individual cancer models
has resulted in a plethora of models that are functional but severely limited in scope, rendering
these models useless in the clinic (Freedman et al. 2005). There is a need for a better
fundamental understanding of cancer modeling techniques in order to identify the best strategies
to apply when modeling a specific type of cancer.
This study focuses on Acute Myeloid Leukemia (AML), a cancer originating in bone
marrow and the bone’s soft tissue. A heterogeneous clonal disorder in which hematopoietic
precursors are restricted in early stages, AML has garnered much research attention in recent
years; however, a better understanding of the pathophysiology of AML has not greatly improved the
five-year survival rate, which barely reaches 24% (Stone et al. 2004). Several reasons account for the
low rate of survival, including the rapid progression of the disease (as denoted by the term
“acute”), but more important in limiting instances of recovery is the ineffective use of current
AML therapies. Current therapies, which initially concentrate on chemotherapy coupled with
allogeneic hematopoietic stem cell transplantation, often induce relapse in AML patients. Thus,
chemotherapy doses must be consistently augmented depending on the specific patient’s
reaction, and this process eventually destroys bone marrow and results in the high mortality rate
(Stone et al. 2004). The observed resistance to chemotherapy has in turn contributed to a hypothesis
that the stem cells causing AML are particularly resistant to chemotherapy, eliminating one
heavily used avenue in cancer treatment and highlighting the need for a tool capable of
identifying the best treatment strategy for a given patient.
Thus, the ultimate goal of our research is to address this need for the betterment of AML
treatment by developing the best prediction model that, given a patient’s profile, can predict
whether he will be primary resistant (exhibiting a resistance to current therapies should they be
applied) or whether he will reach complete remission (becoming cancer-free for a given amount
of time). Developing this model as a predictive tool able to be used at the time of diagnosis
would enable doctors to use the prediction of the model—either remission or resistant—to decide
what treatment strategy would be best for the patient under consideration, and thus physicians
would be able to deliver an optimal combination of therapies to the patient, in essence creating a
more effective, individualized approach to cancer therapy.
This paper presents the primary stages of this enterprise. Given the current challenges in
the computational modeling of cancer, we sought to specifically evaluate the predictive potential
of highly used modeling techniques and assess their value in order to facilitate the development
of the comprehensive, optimal diagnostic tool we envision.
Figure 1: This figure depicts the dismally low survival rate for AML patients (both male and female) in the U.S.,
highlighting the need for a better treatment strategy for AML. Data collected between 1988 and 2007. Figure
obtained from http://www.medinfographics.com/ and data corroborated from the American Cancer Society
(http://www.cancer.org).
After much theoretical evaluation, we chose to work with two currently used modeling
techniques, the Artificial Neural Network (ANN) and Logistic Regression (LR), to assess their
predictive potential. Research reveals that the ANN, a type of machine learning algorithm, is
excellent for dealing with a nonlinear, dichotomous outcome, the type of outcome expected for
our models, and it learns easily, rendering highly accurate output. However, it is expected to
exhibit a tendency to overfit data, given the extensive emphasis it places on “learning” the
training dataset completely (Burke et al. 1997). On the other hand, the LR, a standard model used
in probabilistic statistics, can take in several features and is easy to set up, although it has been
widely recognized by the modeling community to exhibit a tendency to underfit data, given its
relative simplicity (Smith et al. 1982). Thus, we hypothesized that, by assessing the tendency of
the ANN to overfit and the LR to underfit simultaneously, we would gain a better understanding
of how to potentially use the two modeling techniques to complement each other and produce a
more accurate, comprehensive model.
2. Methods
2.1 Overview of Methods
The process of developing and evaluating the code for the cancer models selected can be
divided into two main steps: data and models. The first large body of work this paper presents
regards organizing, visualizing, and sorting the data to prepare a dataset ready for input into the
modeling code. The second body of work, the process of coding the models, used the dataset
prepared by Step 1 for developing and evaluating model performance.
2.2 Organization and Visualization of Data
Data were collected by Dr. Steven M. Kornblau over the course of eight years from 191
patients at the M. D. Anderson Cancer Center in Texas. Each patient’s data consisted of clinical
correlates, including patient demographics, mutation status, and cytogenetic category, and
proteomic measurements from 231 antibodies indicating the level of total or phosphorylated
proteins on a scale of -5 to 5 for each given protein. For the purposes of this study, only protein
data was considered in developing prediction models; it was hypothesized that, as proteins play
an important role in the regulation of apoptosis, protein expression could significantly affect
cancer prognosis. A schematic representation of the format of the original protein data is
depicted in the following table:
Patient  ACTB         AIFM         AKTI1_2_3.pS473  … FN1        … ZNF346
1        0.185844713  -0.21783783  -0.097350442     -0.97929751  -0.15678498
2        0.24250154   0.770726524  0.277496831      -0.34302295  1.260718032
3        0.555895971  0.331599622  -1.614156048     1.626415093  0.147560248
… 191    1.220082392  0.233040465  1.548540573      -0.80614041  -0.30311939
Additional visualization techniques, especially heatmaps, were used to better understand the
protein expression level for each patient. The heatmap is interactive and can be
segmented as different groups of proteins are selected, allowing patients to be categorized by
their relative expression levels. While not of critical relevance to the development of the
prediction models, the heatmaps were an important preliminary tool that allowed us to
understand the format of the data provided more clearly and guided the development of a
data-sorting algorithm (discussed in section 2.3).
Table 1: This table is a schematic representation of the type of data in the protein dataset. For each of the 231
named proteins, each of the 191 patients had a value between -5 and 5; this value, although initially inconsistent
between patients, corresponds to the level of expression of the protein under consideration. Data obtained from M. D.
Anderson Cancer Center.
2.3 Sorting of Data
After understanding the data, we wanted to eliminate all present outliers to essentially
have a “clean” dataset. As we were working with human data, no established standard for
comparison for proteomic data existed. In order to account for this lack of a control, we started
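by confirming the dataset was well-modeled by a normal distribution using the Shapiro-Wilk
normality test. This test evaluates the null hypothesis that the sample comes from a
normally distributed population, using the following standard statistic:
$$ W = \frac{\left( \sum_{i=1}^{n} a_i x_{(i)} \right)^2}{\sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2} $$ (Eq. 1)
where $x_{(i)}$ is the i-th smallest value in the sample, $\bar{x}$ is the sample mean, and the $a_i$ are the test’s tabulated coefficients.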
After determining that our dataset was indeed a normally distributed population, we began the
elimination of outliers. A script to remove outliers beyond a 3.5 σ threshold was written in
Python. The following standard formula was utilized to calculate standard deviation values:
$$ \sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( x_i - \mu \right)^2} $$ (Eq. 2)
Figure 2: This figure depicts one of the primary forms of data visualization used in this study: the protein
heatmaps. Each ring in the heatmap represents a selected protein; in this heatmap, red corresponds to a higher
relative expression level, and blue corresponds to a lower expression level. Heatmap visualization technique
obtained from Sage Synapse: Dream 9 Challenge.
Using this equation, which only includes values within a limited threshold, it was confirmed that
none of our data points were far from the median values. The script was run in iterations to
remove all detectable outliers to ensure that the data describe the majority of patients, critical to
producing an accurate model. All data points not in the 3.5 σ range for each individual protein
were eliminated. Eliminating those patients from the training data gave us a cleaner and more
statistically sound dataset with which to develop the models. Although clinical data was not used
extensively in this stage of the study, we subdivided the patients by their individual clinical
covariates and noticed that patients with certain clinical characteristics and protein expression
levels tend to exhibit similar trends when comparing remission rates, an interesting finding for
future considerations.
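The iterative 3.5 σ filtering described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the script used in the study: the function name `remove_outliers`, the use of the mean as the center, and the population standard deviation are all choices made here for the example, and the sketch filters the values of a single protein rather than removing whole patients.

```python
import statistics

def remove_outliers(values, n_sigma=3.5, max_iter=100):
    """Repeatedly drop values farther than n_sigma standard deviations from the mean.

    Hypothetical helper: centering on the mean and using the population
    standard deviation are assumptions of this sketch.
    """
    kept = list(values)
    for _ in range(max_iter):
        if len(kept) < 2:
            break
        mean = statistics.fmean(kept)
        sigma = statistics.pstdev(kept)
        if sigma == 0:
            break
        filtered = [v for v in kept if abs(v - mean) <= n_sigma * sigma]
        if len(filtered) == len(kept):
            break  # no outliers left, so the iteration has converged
        kept = filtered
    return kept

# Made-up expression values for one protein, with one extreme measurement.
expression = [0.1 * i for i in range(-10, 11)] + [50.0]
cleaned = remove_outliers(expression)
```

Running the filter in passes matters because a single extreme value inflates σ, so a second pass with the recomputed σ may reveal further outliers.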
2.4 Model Development
Following the preparation of a cleaned dataset, this dataset was used as the training set to
develop prediction models, both with an Artificial Neural Network and Logistic Regression.
Both models were coded in Python to output a binary value: either 1 to indicate a patient going
into remission, or 0 to indicate primary resistance. Furthermore, both were developed with a
split-sample technique, using 75% of the training dataset for development and 25% for testing
and redesign. The Artificial Neural Network and Logistic Regression were each evaluated
separately to determine the possibility of integration.
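A 75%/25% split-sample step of the kind described above might look like the following minimal sketch. The function name `split_sample`, the shuffling, and the fixed seed are assumptions made for illustration; the source does not say how patients were assigned to each portion.

```python
import random

def split_sample(rows, train_fraction=0.75, seed=0):
    """Shuffle the rows and split them into development and test portions.

    Hypothetical helper: random assignment with a fixed seed is an
    assumption of this sketch.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Illustrative use with 100 placeholder patient indices.
patients = list(range(100))
development, testing = split_sample(patients)
```

Fixing the seed keeps the split reproducible, so both models can be developed and tested on exactly the same patients.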
2.4.1 Artificial Neural Networks (ANN)
To reiterate, an Artificial Neural Network, or ANN, is a type of machine learning that
seeks to mimic the human brain; specifically it models neurons and their connections. Each
neuron is referred to as a node, and these various nodes connect to form a network that mimics
the biological nervous system. Much like a human child, an ANN learns by practice and thus
needs to be trained on a given dataset once created. During the training process, the ANN learns
to recognize patterns and complexities inherently present in the data before computing an output.
As shown in this oversimplified diagram, an ANN consists of input nodes, or neurons,
which make up the input layer. These inputs are fed to one or more hidden layers, which use the
ANN’s inbuilt algorithms to process the inputs and link them to an output layer, where the
answer is output:
Figure 3: This figure is an oversimplified representation of the architecture of a neural network. Input (protein
data) is fed in through the input nodes, after which the network uses the inbuilt algorithms in its hidden layer to
form connections between the input data points, strengthen connections through backpropagation, and output a
binary outcome.
The ANN used in this study was developed with 231 input nodes, 128 hidden nodes, and 1 output
node. The tanh squashing function was chosen as the activation function of the hidden layer, as it
limits complexities in the learning algorithm. The ANN was trained using supervised learning
via the PyBrain library’s backpropagation algorithm, which calculates the gradient of a loss
function with respect to all weights already present in the network, goes backward through the
network, and updates all parameters so as to minimize error. An arbitrary threshold was set: the
ANN was trained on the dataset until its error was less than 1e-15. In addition, in order to find
the optimal weights for a given network, networks were created and evaluated 1000 times (an
arbitrary number set through experimental observation), and the network that yielded the highest
overall accuracy was saved and used for further evaluation.
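To make the training procedure more concrete, the sketch below implements a small network with one tanh hidden layer, full-batch backpropagation, and a best-of-several-restarts loop. It is a plain-Python illustration, not the PyBrain code used in the study: the linear output node, the learning rate, the fixed epoch count (in place of the error threshold), the toy data, and all function names are assumptions, and accuracy is measured on the training data only to keep the example short.

```python
import math
import random

def init_network(n_in, n_hidden, rng):
    """Random weights; the last entry of each weight row is a bias term."""
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    return w1, w2

def forward(net, x):
    """Return (hidden activations, output) for one input vector."""
    w1, w2 = net
    xb = list(x) + [1.0]
    hidden = [math.tanh(sum(w * v for w, v in zip(row, xb))) for row in w1]
    out = sum(w * h for w, h in zip(w2, hidden + [1.0]))  # linear output (assumed)
    return hidden, out

def mean_squared_error(net, xs, ys):
    return sum((forward(net, x)[1] - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def train_epoch(net, xs, ys, lr=0.1):
    """One full-batch gradient-descent step computed by backpropagation."""
    w1, w2 = net
    g1 = [[0.0] * len(row) for row in w1]
    g2 = [0.0] * len(w2)
    for x, y in zip(xs, ys):
        hidden, out = forward(net, x)
        delta_out = out - y                      # gradient of 0.5 * error^2 at the output
        hb = hidden + [1.0]
        xb = list(x) + [1.0]
        for j in range(len(w2)):
            g2[j] += delta_out * hb[j]
        for j, h in enumerate(hidden):
            delta_h = delta_out * w2[j] * (1.0 - h * h)  # derivative of tanh is 1 - tanh^2
            for k in range(len(xb)):
                g1[j][k] += delta_h * xb[k]
    n = len(xs)
    for j in range(len(w2)):
        w2[j] -= lr * g2[j] / n
    for j in range(len(w1)):
        for k in range(len(w1[j])):
            w1[j][k] -= lr * g1[j][k] / n

def accuracy(net, xs, ys):
    """Fraction of samples whose thresholded output matches the label."""
    hits = sum((forward(net, x)[1] >= 0.5) == (y == 1) for x, y in zip(xs, ys))
    return hits / len(xs)

def best_of_restarts(xs, ys, n_hidden=3, restarts=5, epochs=200, seed=0):
    """Train several randomly initialised networks and keep the most accurate."""
    rng = random.Random(seed)
    best_net, best_acc = None, -1.0
    for _ in range(restarts):
        net = init_network(len(xs[0]), n_hidden, rng)
        for _ in range(epochs):
            train_epoch(net, xs, ys)
        acc = accuracy(net, xs, ys)
        if acc > best_acc:
            best_net, best_acc = net, acc
    return best_net, best_acc
```

In the setting described above, the same structure would use 231 inputs and 128 hidden nodes, with 1000 restarts instead of 5.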
2.4.2 Logistic Regression (LR)
The LR, a standard probabilistic model used in statistics, uses the logistic function
$$ f(t) = \frac{1}{1 + e^{-t}} $$ (Eq. 3)
to take in the input dataset; it then assigns coefficients, or weights, to the data points, in order to
output a probability value between 0 and 1. For the purposes of this study, 0 would mean the
patient is primary resistant, and 1 would indicate his going into remission. To reiterate, an LR
does have several advantages: it is much more robust than a linear regression, does not tend to
overfit data, and can take in many input features. However, as previously discussed, theoretically it
may be hampered by its assumption of linearity and consequent tendency to underfit. We
hypothesized that, if this drawback were indeed revealed in the course of this research, it
would likely need to be accounted for in order to produce a robust model.
The LR in this study was developed using modules from the scikit-learn library in
Python. An L1-norm (Least Absolute Deviations) was chosen; although an L1-norm does not
always yield a stable solution, it is robust and has a built-in feature selection method, factors we
considered important when working with protein data of a diverse nature. The regularization
parameter C was experimentally set to 1.0. Following evaluation of the model’s output, the
capacity of the LR to output probability values, a confidence interval, and decision scores for
each of the patients in the test set was explored.
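The relationship between a decision score, a probability value, and the binary prediction can be illustrated with a short sketch. It does not reproduce the scikit-learn model from the study: the weights, bias, and feature values are invented for the example, and the function names and the 0.5 cut-off are assumptions. The zero weight stands in for a protein that an L1 penalty might have dropped.

```python
import math

def logistic(t):
    """Numerically stable logistic function 1 / (1 + e^-t)."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def decision_score(weights, bias, features):
    """Linear combination of protein levels: the raw score before squashing."""
    return bias + sum(w * x for w, x in zip(weights, features))

def predict(weights, bias, features, threshold=0.5):
    """Return (probability of remission, predicted class 0 or 1)."""
    probability = logistic(decision_score(weights, bias, features))
    return probability, int(probability >= threshold)

# Invented coefficients for three proteins; the third has been zeroed out.
weights = [1.0, -2.0, 0.0]
bias = 0.5
patient = [0.2, -0.1, 3.0]
score = decision_score(weights, bias, patient)
probability, label = predict(weights, bias, patient)
```

A positive decision score maps to a probability above 0.5 and therefore to a predicted remission, while the size of the score indicates how far the patient lies from the decision boundary.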
2.5 Model Evaluation
Both the ANN and LR models were evaluated following the calculation of True Positive
(TP), True Negative (TN), False Positive (FP), and False Negative (FN) rates using a module we
developed to output a confusion matrix:
                      Positive (actual)   Negative (actual)
Positive (predicted)  True Positive       False Positive
Negative (predicted)  False Negative      True Negative
Calculating the TP, TN, FP, and FN allowed us to derive a range of performance measures; we
specifically focused on the following five criteria, as research reveals them to be the most
clinically relevant:
1. Sensitivity: the likelihood of the test to correctly identify a patient who will go into
remission
2. Specificity: the likelihood of the test to correctly identify a patient who will not go into
remission
3. Positive Predictive Value (PPV): the probability that a patient who tests positive will
actually go into remission
4. Negative Predictive Value (NPV): the probability that a patient who tests negative will
not go into remission and exhibit resistance
5. Accuracy: the overall accuracy of the model, as computed by the sum of the number of
correctly classified positives and number of correctly classified negatives divided by the
total number of data points in the dataset.
Following the evaluation of the models with the abovementioned criteria, additional avenues for
research were explored, including the ability of the LR to produce information such as a
confidence interval.
Table 2: This table depicts the structure of the confusion matrix used to evaluate results.
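The five criteria can be computed from the confusion matrix as in the sketch below. It uses the standard definitions and takes remission (label 1) as the positive class, which is an assumption based on the 0/1 coding described for the LR. The function names are hypothetical, and the example counts are hypothetical as well: the source does not report the underlying counts, and these were chosen only because they round to the ANN values reported in Section 3.1.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, TN, FP, and FN for two equal-length lists of binary labels."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def evaluation_criteria(tp, tn, fp, fn):
    """Return the five criteria; assumes every denominator is nonzero."""
    return {
        "sensitivity": tp / (tp + fn),   # share of actual positives found
        "specificity": tn / (tn + fp),   # share of actual negatives found
        "ppv": tp / (tp + fp),           # share of positive calls that are right
        "npv": tn / (tn + fn),           # share of negative calls that are right
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Hypothetical labels (assumption): 28 remission (1) and 14 resistant (0) patients.
actual = [1] * 28 + [0] * 14
predicted = [1] * 23 + [0] * 5 + [1] * 1 + [0] * 13

tp, tn, fp, fn = confusion_counts(actual, predicted)
criteria = evaluation_criteria(tp, tn, fp, fn)
for name, value in criteria.items():
    print(f"{name}: {value:.2f}")
# prints sensitivity: 0.82, specificity: 0.93, ppv: 0.96, npv: 0.72, accuracy: 0.86
```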
3. Results
This study presents the beginning stages of a long-term effort to establish a thorough
basis for researchers in the pursuit of developing a comprehensive model for AML patients.
However, it is hoped that this study and similar ones in the future simultaneously contribute to a
broader understanding of cancer modeling techniques as a whole. During this project, we
evaluated two widely used modeling techniques—the ANN and LR—in view of this long-term
goal; thus, our results are presented in line with their potential implications.
3.1 ANN Results
The following table summarizes the results obtained from the ANN:
Sensitivity   0.82
Specificity   0.93
PPV           0.96
NPV           0.72
Accuracy      0.86
Table 3: ANN results with the test dataset with respect to the 5 criteria
Notable features of these results are the high specificity and PPV values, which indicate to us
that when the ANN predicts remission, the patient almost always does go into remission. Thus, the
ANN specifically exhibits the capacity to recognize protein profiles of patients almost certain
to go into remission, a predictive capacity that would be of high clinical value. However, as the
lower NPV shows, the ANN has more difficulty with patients likely to be resistant, an area that
would need to be improved upon to create a fully comprehensive model for clinical use.
To evaluate the ANN’s tendency to overfit, we additionally tested it with its own training
dataset, for which the results are summarized below:
Sensitivity   1.0
Specificity   1.0
PPV           1.0
NPV           1.0
Accuracy      1.0
Table 4: ANN results with the training dataset with respect to the 5 criteria
The perfect scores output when the ANN was tested on its training data confirm its tendency to
overfit. While the ANN is excellent at learning the trends and patterns inherent in a dataset, it
would likely need either a larger dataset or integration with another, simpler model to increase
its accuracy with unknown data. Nevertheless, this result is significant, as it confirms both the
ANN’s major successes and its drawback and thus highlights the path needed to use an ANN most
effectively.
3.2 LR Results
The following table summarizes the results obtained from the LR:
Sensitivity   0.71
Specificity   0.43
PPV           0.71
NPV           0.43
Accuracy      0.62
Table 5: LR results with the test dataset with respect to the 5 criteria
The LR’s relatively poor performance, especially in specificity and NPV, made it evident to us
that the LR, which likely does underfit as hypothesized, is unreliable when used as the sole
predictive model. Thus, we explored other means of exploiting the LR’s capacities to use it as a
supplementary, or confirmatory, model in the future. We developed modules to successfully extract
probability scores and a confidence interval for each patient’s profile, a development that is
significant, as it allows us to individually identify the model’s confidence regarding a given
patient. This information, when integrated into another model, would establish in essence a
system of “checks and balances” where a potentially incorrect output may be reevaluated and
corrected with the additional input of probability and confidence scores.
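One way such a "checks and balances" step might look is sketched below. This is a hypothetical illustration, not the authors' method: the function name flag_for_review, the 0.5 decision cutoff, and the margin of 0.2 are all assumptions, and the inputs are made-up values. It flags a patient for re-evaluation when the primary model's label disagrees with the LR probability, or when the LR probability is too close to the cutoff to be informative.

```python
def flag_for_review(primary_labels, lr_probabilities, cutoff=0.5, margin=0.2):
    """Return indices of patients whose primary-model label should be re-examined.

    primary_labels:   0/1 labels from the primary model (1 = remission).
    lr_probabilities: LR estimates of P(remission) for the same patients.
    """
    if len(primary_labels) != len(lr_probabilities):
        raise ValueError("inputs must have the same length")
    flagged = []
    for i, (label, prob) in enumerate(zip(primary_labels, lr_probabilities)):
        lr_label = 1 if prob >= cutoff else 0
        uncertain = abs(prob - cutoff) < margin   # LR is not confident either way
        disagrees = lr_label != label             # the two models conflict
        if uncertain or disagrees:
            flagged.append(i)
    return flagged


# Made-up example values (assumption), one entry per patient.
ann_labels = [1, 1, 0, 0, 1]
lr_probs = [0.92, 0.55, 0.10, 0.81, 0.25]
print(flag_for_review(ann_labels, lr_probs))  # prints [1, 3, 4]
```

Patient 1 is flagged because the LR probability sits near the cutoff, while patients 3 and 4 are flagged because the two models disagree. How flagged cases would then be resolved is left open here, as the source does not specify it.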
4. Application
We believe that this study, which is unique in its exploration of different avenues with the
goal of developing the optimal model, has several implications for cancer research, both
immediately and in the long term. The current results of both models point us towards an
integration of the ANN and LR, a little-investigated approach to cancer modeling that may yield
the most promising results: the ANN, an otherwise powerful predictive tool, can compensate for
its one drawback, its tendency to overfit, by incorporating models such as the LR, which are
unreliable when used alone but can supply useful information as input to a larger,
comprehensive model. Perhaps more importantly, in the long term, the models we are continuing
to research and develop are the fundamental steps in creating a novel tool for physicians. This
tool will allow doctors to view an AML patient’s profile at the time of diagnosis and
immediately determine the likelihood of the patient to respond to current therapies. Recognizing
either remission or resistance, the doctor can use the model’s prediction to identify the best
possible treatment path for the patient, creating a much-needed accurate, individualized approach
to cancer therapy.
5. Conclusions and Future Work
Our study provides the first set of code modules and the first evaluation that our mentors’
research group will be using for years to come as part of a larger effort to improve cancer
modeling. This research, which points us to the necessity of integration of more than one
modeling technique, opens up a new path for other researchers to create a more effective model.
We continue to explore additional avenues to improve the process of cancer modeling,
many of which will continue under other researchers. Future steps include devising an algorithm
to quantify clinical correlates so they can corroborate with proteomic data once incorporated into
a computational model. Furthermore, we look to construct a dynamical system for our data—a
model that is different from the traditional computational approach to modeling by closely
emulating human cognition, which happens continuously and linearly rather than cyclically. The
development of these improved models as well as of a website for doctors to use these models
will continue through 2015 into future years.
Works Cited
ACS Home. (2014). Viewed 8 August 2014. Web. http://www.cancer.org/
Burke, Harry B. and Philip H. Goodman, David B. Rosen, Donald E. Henson, John N.
Weinstein, Frank E. Harrell, Jr., Jeffrey R. Marks, David P. Winchester, and David G.
Bostwick. 1997. “Artificial Neural Networks Improve the Accuracy of Cancer Survival
Prediction.” American Cancer Society 79(4): 857-862.
Durrett, Rick. 2013. “Cancer Modeling: A Personal Perspective.” Notices of the AMS 60(3): 304-
309.
Freedman, Andrew N., and Daniela Seminara, Mitchell H. Gail, Patricia Hartge, Graham A.
Colditz, Rachel Ballard-Barbash, and Ruth M. Pfeiffer. “Cancer Risk Prediction Models:
A Workshop on Development, Evaluation, and Application.” Journal of the National
Cancer Institute 97(10): 715-723.
Quaranta, Vito and Alissa M. Weaver, Peter T. Cummings, and Alexandra R. A. Anderson.
2005. “Mathematical Modeling of Cancer: The Future of Prognosis and Treatment.”
Clinica Chimica Acta 357(2): 173-179.
Smith, Terry L., and Edmund A. Gehan, Michael J. Keating, and Emil J. Freireich. 1982.
“Prediction of Remission in Adult Acute Leukemia.” Cancer 50: 466-472.
Stone, Richard M., and Margaret R. O’Donnell, and Mikkael A. Sekeres. 2004. “Acute Myeloid
Leukemia.” Hematology 1: 98-117.
Works Consulted
Delen, Dursun, and Glenn Walker, and Amit Kadam. 2004. “Predicting Breast Cancer
Survivability: A Comparison of Three Data Mining Methods.” Artificial Intelligence in
Medicine 34: 113-127.
Herst, Patries M., and Rebecca A. Howman, Paul J. Neeson, Michael V. Berridge, and David S.
Ritchie. 2010. “The Level of Glycolytic Metabolism in Acute Myeloid Leukemia Blasts
at Diagnosis is Prognostic for Clinical Outcome.” Journal of Leukocyte Biology 89: 51-
55.
Li, Lihua, and Hong Tang, Zuobao Wu, Jianli Gong, Michael Gruidl, Jun Zou, Melvyn
Tockman, and Robert A. Clark. 2004. “Data Mining Techniques for Cancer Detection
Using Serum Proteomic Profiling.” Artificial Intelligence in Medicine 32: 71-83.