Analyzing Gene Data Using a Logistic Model, LDA, QDA, KNN and a Logistic GAM
Author: Jonathan Fivelsdal
Introduction
Genetics is a field in which data mining methods are widely used. The data from Brem and Kruglyak contain expression measurements for 231 active genes across 95 segregants (individuals), with each segregant measured at time points 0, 10, 20, 30, 40 and 50 minutes. A subset of the original data is used here: 122 genes form the training set and 53 genes form the validation set. Five methods are used in this report to analyze the gene expression data: logistic regression, LDA, QDA, KNN and a logistic GAM. These models are used to identify protein-protein interactions (PPIs) within the gene data.
Analysis and Results
For each method, we obtain the posterior probability of interaction for each of the 7,381 gene pairs in the training data. A pair is assigned to class 1 (interacting) if it is among the 200 pairs with the highest posterior probability, and to class 0 (not interacting) otherwise. The predictors used in each model, in the order in which the predictor matrix is constructed in the appendix, are the mean of gene i (denoted X1), the mean of gene j (X2), the covariance between gene i and gene j (X3), the variance of gene i (X4) and the variance of gene j (X5).
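All five methods share the same classification rule. Below is a minimal sketch of it, mirroring the appendix and assuming that posterior is a numeric vector holding the 7,381 training posterior probabilities of interaction:
# The cutoff is the 201st largest posterior probability, so roughly the
# 200 most probable pairs are labeled as interacting.
cutoff <- sort(posterior, decreasing = TRUE)[201]
predicted <- ifelse(posterior > cutoff, 1, 0)  # 1 = interacting, 0 = not
table(predicted)                               # about 200 pairs in class 1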
Logistic regression is used first. The fitted model is logit(p̂) = -14.124 + 0.494*X1 + 0.404*X2 + 1.435*X3 + 0.002*X4 - 0.084*X5, where p̂ is the estimated probability of interaction. The predictors X1, X2 and X3 are significant at the 5% level, while X4 and X5 (p-values 0.984 and 0.387) are not. The cutoff probability for the logistic model is 0.1721.
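As a quick illustration of how the fitted model produces a posterior probability, the sketch below applies the coefficients above to a gene pair whose predictor values are invented for illustration:
# Coefficients from the fitted model (intercept, X1, ..., X5).
beta <- c(-14.124, 0.494, 0.404, 1.435, 0.002, -0.084)
# Hypothetical predictor values for one gene pair: mean of gene i, mean of
# gene j, covariance, variance of gene i, variance of gene j
# (a leading 1 is included for the intercept).
x <- c(1, 11.5, 11.8, 0.15, 0.85, 0.80)
eta <- sum(beta * x)   # linear predictor (log-odds of interaction)
p.hat <- plogis(eta)   # posterior probability of class 1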
Below is the confusion matrix when applying the logistic model on the training data:
Confusion Matrix for Fitted Logistic Regression Model (Training Set)
Actual Class 0 Actual Class 1
Predicted Class 0 7062 119
Predicted Class 1 122 78
For the training data, we have that the classification error is 3.27%, the sensitivity is 39.59% and the specificity
is 98.30%. Below is the confusion matrix when the logistic model is applied on the validation data:
Confusion Matrix for Fitted Logistic Regression Model (Validation Set)
Actual Class 0 Actual Class 1
Predicted Class 0 1301 19
Predicted Class 1 34 24
For the validation data, the classification error is 3.85%, the sensitivity is 55.81% and the specificity is 97.45%.
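For reference, a small helper (not part of the original appendix) that computes the three reported metrics from 0/1 predicted and actual label vectors:
# Classification error, sensitivity and specificity from 0/1 labels.
class.metrics <- function(pred, actual) {
  c(error       = mean(pred != actual),
    sensitivity = sum(pred == 1 & actual == 1) / sum(actual == 1),
    specificity = sum(pred == 0 & actual == 0) / sum(actual == 0))
}
# Example: class.metrics(Ghat.valid.logistic, Y.valid)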
Now we use LDA. The fitted LDA model has the following coefficients of the linear discriminant for the five predictors:
Terms X1 X2 X3 X4 X5
Coefficients 0.352 0.294 1.033 0.081 0.045
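The LD1 coefficients weight the five predictors to form a discriminant score, with larger scores favoring class 1. A rough sketch (the predictor values are hypothetical, and the centering that MASS applies before forming the score is omitted):
# LD1 weights from the fitted model; predictor values are made up.
ld1 <- c(X1 = 0.352, X2 = 0.294, X3 = 1.033, X4 = 0.081, X5 = 0.045)
x   <- c(11.5, 11.8, 0.15, 0.85, 0.80)
score <- sum(ld1 * x)  # weighted sum; centering omitted in this sketch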
The cutoff probability for the LDA model applied to the training data is 0.1450. Below is the confusion matrix
for the training data that we obtain for the fitted LDA model:
Confusion Matrix for Fitted Linear Discriminant Analysis Model (Training Set)
Actual Class 0 Actual Class 1
Predicted Class 0 7051 130
Predicted Class 1 133 67
For the training data, the classification error is 3.56%, the sensitivity is 34.01% and the specificity is 98.15%.
Below is the confusion matrix for the validation set data that we obtain for the LDA model:
Confusion Matrix for Fitted Linear Discriminant Analysis Model (Validation Set)
Actual Class 0 Actual Class 1
Predicted Class 0 1311 21
Predicted Class 1 24 22
For the validation data, the classification error is 3.27%, the sensitivity is 51.16% and the specificity is 98.20%.
Now we use QDA. The cutoff probability for the QDA model applied to the training data is 0.3514. Below is the
confusion matrix for applying the QDA model on the training data:
Confusion Matrix for Quadratic Discriminant Analysis Model (Training Set)
Actual Class 0 Actual Class 1
Predicted Class 0 7055 126
Predicted Class 1 129 71
For the training data, we have that the classification error is 3.45%, the sensitivity is 36.04% and the specificity
is 98.20%. The following is the confusion matrix for applying the QDA model on the validation data:
Confusion Matrix for Fitted Quadratic Discriminant Analysis Model (Validation Set)
Actual Class 0 Actual Class 1
Predicted Class 0 1307 19
Predicted Class 1 28 24
For the validation data, we have that the classification error is 3.41%, the sensitivity is 55.81% and the
specificity is 97.90%.
Now we use KNN. Leave-one-out cross-validation selects k = 13 for the KNN model applied to the training data. The cutoff probability for the KNN model is 0.2308.
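The KNN posterior probability of class 1 is recovered from the proportion of neighbor votes for the winning class, as in the appendix. A condensed sketch, assuming the standardized predictor matrix X.train.std and the labels Y.train defined there:
library(class)
set.seed(2014)  # knn() breaks ties at random
fit.knn <- knn(train = X.train.std, test = X.train.std, cl = Y.train, k = 13, prob = TRUE)
win  <- as.numeric(fit.knn) - 1               # winning class as 0/1
vote <- attr(fit.knn, "prob")                 # vote share of the winning class
post.knn <- ifelse(win == 1, vote, 1 - vote)  # P(class 1) for each pair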
Confusion Matrix for Fitted K-Nearest Neighbors Model (Training Set)
Actual Class 0 Actual Class 1
Predicted Class 0 7091 103
Predicted Class 1 93 94
For the training data, the classification error is 2.66%, the sensitivity is 47.72% and the specificity is 98.71%.
Confusion Matrix for Fitted K-Nearest Neighbors Model (Validation Set)
Actual Class 0 Actual Class 1
Predicted Class 0 1294 30
Predicted Class 1 41 13
For the validation data, the classification error is 5.15%, the sensitivity is 30.23% and the specificity is 96.93%.
Finally, we use a logistic GAM. Applying the logistic GAM to the training data gives a cutoff probability of 0.1726.
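The logistic GAM replaces each linear term with a smoothing spline. A sketch of the fit, mirroring the appendix (gam package, 5 degrees of freedom per smooth):
library(gam)
model.gam <- gam(Y ~ s(X1, df = 5) + s(X2, df = 5) + s(X3, df = 5) +
                     s(X4, df = 5) + s(X5, df = 5),
                 family = binomial, data = data.train)
post.train.gam <- model.gam$fitted.values  # training posterior probabilities of Y = 1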
Confusion Matrix for Fitted Logistic Generalized Additive Model (Training Set)
Actual Class 0 Actual Class 1
Predicted Class 0 7071 110
Predicted Class 1 113 87
For the training data, the classification error is 3.02%, the sensitivity is 44.16% and the specificity is 98.43%.
The cutoff from the training fit (0.1726) is reused when the logistic GAM is applied to the validation data.
Confusion Matrix for Fitted Logistic Generalized Additive Model (Validation Set)
Actual Class 0 Actual Class 1
Predicted Class 0 1254 19
Predicted Class 1 81 24
For the validation data, we have that the classification error is 7.26%, the sensitivity is 55.81% and the
specificity is 93.93%.
Conclusion
The following is a table that summarizes the results from the analysis:
Method         Class Error   Sensitivity   Specificity   Class Error    Sensitivity    Specificity
               (train)       (train)       (train)       (validation)   (validation)   (validation)
Logistic       3.27%         39.59%        98.30%        3.85%          55.81%         97.45%
LDA            3.56%         34.01%        98.15%        3.27%          51.16%         98.20%
QDA            3.45%         36.04%        98.20%        3.41%          55.81%         97.90%
KNN            2.66%         47.72%        98.71%        5.15%          30.23%         96.93%
Logistic GAM   3.02%         44.16%        98.43%        7.26%          55.81%         93.93%
The KNN model has the lowest classification error on the training data, while LDA has the highest training classification error. QDA, logistic regression and the logistic GAM have the 2nd, 3rd and 4th highest training classification errors, respectively. KNN also has the highest sensitivity and the highest specificity on the training data, while LDA has the lowest training specificity.
The LDA model has the lowest classification error and the highest specificity on the validation data, but it has the second lowest sensitivity there. The three models with the highest validation sensitivity are logistic regression, QDA and the logistic GAM (all at 55.81%). Despite being one of these three, the logistic GAM has the highest classification error on the validation set and is likely too flexible for this gene expression dataset. The model with the lowest validation sensitivity is KNN, which also has the 2nd highest classification error and the 2nd lowest specificity on the validation set.
Based on the training data, KNN would appear to be the best technique for the gene expression data, since it has the lowest training classification error, the highest training sensitivity and the highest training specificity. On the validation set, however, KNN has the 2nd highest classification error, the 2nd lowest specificity and the lowest sensitivity. LDA performs well on the validation set: its classification error of 3.27% is the lowest and its specificity is the highest. The main concern with LDA's performance on the validation set is that it has the 2nd lowest sensitivity. An alternative model that performs well on the validation set is QDA, which has the 2nd lowest classification error, the 2nd highest specificity and is one of the three models with the highest sensitivity (55.81%). If a lower sensitivity can be tolerated, LDA appears to be the best model, since it has the lowest classification error and the highest specificity on the validation set. If higher sensitivity is desired for a particular application, QDA has desirable properties: it has the 2nd lowest classification error and the 2nd highest specificity on the validation set, and it is one of the three models with the highest validation sensitivity.
Appendix
load('Project2.RData') #Load the gene expression data
ls() #View the variable names in the data set
# Create the training set
Y.train <- Network.train[lower.tri(Network.train)] # 7381 pairs in training data
n.train <- length(Y.train)
Y.train.mean <- mean(Y.train)
# 0.02669015 proportion of PPI interactions in training data
Y.valid <- Network.valid[lower.tri(Network.valid)] # 1378 pairs in validation data
n.valid <- length(Y.valid)
mean(Y.valid)
# 0.03120464 proportion of PPI interactions in validation data
X.train = NULL # 7381 by 5 matrix with 5 predictors for each gene pair
for (i in 1:(dim(DATA.train)[1]-1))
for (j in (i+1):dim(DATA.train)[1])
X.train = rbind(X.train,
c(mean(DATA.train[i,]), mean(DATA.train[j,]),
cov(DATA.train[i,],DATA.train[j,]),
var(DATA.train[i,]), var(DATA.train[j,])))
data.train <-as.data.frame(cbind(Y.train, X.train))
names(data.train) <- c("Y", "X1", "X2", "X3", "X4", "X5")
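# Note on column order: given the rbind() above, X1 = mean of gene i,
# X2 = mean of gene j, X3 = cov(gene i, gene j), X4 = var(gene i),
# X5 = var(gene j).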
##########################################################
#Create the validation set
X.valid = NULL # 1378 by 5 matrix with 5 predictors for each gene pair
for (i in 1:(dim(DATA.valid)[1]-1))
for (j in (i+1):dim(DATA.valid)[1])
X.valid = rbind(X.valid,
                c(mean(DATA.valid[i,]), mean(DATA.valid[j,]),
                  cov(DATA.valid[i,],DATA.valid[j,]),
                  var(DATA.valid[i,]), var(DATA.valid[j,])))
data.valid <- as.data.frame(X.valid)
names(data.valid) <- c("X1", "X2", "X3", "X4", "X5")
# Regression Formula for the logistic regression model
reg.formula <- paste(names(data.train)[1],
paste(c("X1", "X2", "X3", "X4", "X5"),collapse = "+"),
sep = "~")
print(reg.formula) #Shows what the regression formula is
# 1.) Logistic Regression
# Logistic regression analysis using glm function
logisticMod1 <- glm(reg.formula, data = data.train, family=binomial("logit"))
summary(logisticMod1)
posterior.logistic.train <- logisticMod1$fitted.values # n.train posterior probabilities of Y=1
#The cut-off probability is the 201st highest posterior probability
cut.off.logistic.Train <- sort(posterior.logistic.train,
decreasing=T)[201]
#Cut-off Probability for logistic regression is 0.1720898
Ghat.train.logistic <- ifelse(posterior.logistic.train > cut.off.logistic.Train,1,0)
##classification rule
table(Ghat.train.logistic,Y.train) # classification table
#Actual
# 0 1
#Predicted 0 7062 119
# 1 122 78
sum(abs(Ghat.train.logistic-Y.train))/n.train #training classification error rate = 0.0327
# 0.0326514
sum(Ghat.train.logistic==1&Y.train==1)/sum(Y.train==1) # sensitivity = 0.3959
# 0.3959391
sum(Ghat.train.logistic==0&Y.train==0)/sum(Y.train==0) # specificity = 0.9830
# 0.9830178
post.valid.logistic <- predict(logisticMod1, data.valid, type="response") # n.valid post probs
Ghat.valid.logistic <- ifelse(post.valid.logistic>cut.off.logistic.Train,1,0)
#use same probability cutoff
table(Ghat.valid.logistic,Y.valid) # classification table
# 0 1
#0 1301 19
#1 34 24
sum(abs(Ghat.valid.logistic-Y.valid))/n.valid # classification error rate = 0.0385
# 0.0385
sum(Ghat.valid.logistic==1&Y.valid==1)/sum(Y.valid==1) # sensitivity = 0.5581
# 0.5581
sum(Ghat.valid.logistic==0&Y.valid==0)/sum(Y.valid==0) # specificity = 0.9745
# 0.9745
# Linear discriminant analysis using lda function in MASS
require(MASS)
model.lda.Gene <- lda(Y ~ X1 + X2 + X3 + X4 + X5,data = data.train)
model.lda.Gene
#Prior probabilities of groups:
# 0 1
# 0.97330985 0.02669015
#Group means:
# X1 X2 X3 X4 X5
#0 10.67300 10.80075 0.01331963 0.8531054 0.9114245
#1 12.26554 12.22904 0.27908714 0.7315978 0.6442998
#Coefficients of linear discriminants:
# LD1
#X1 0.35181460
#X2 0.29353120
#X3 1.03316558
#X4 0.08062700
#X5 0.04478497
plot(model.lda.Gene) # for 2 classes this displays histograms
post.train.lda.Gene <- predict(model.lda.Gene)$posterior[,2] # n.train posterior probabilities of Y=1
#The cut-off probability for LDA is the 201st highest posterior probability
cut.off.lda.Train <- sort(post.train.lda.Gene,
decreasing=T)[201]
# cut-off probabiliy of LDA is 0.1450
Ghat.train.lda <- ifelse(post.train.lda.Gene >cut.off.lda.Train,1,0) # classification rule
#Ghat.train.lda <- ifelse(predict(model.lda.Gene)$class=="0",0,1) # alternative rule when cutoff=0.5
table(Ghat.train.lda,Y.train) # classification table
#Actual
# 0 1
# Predicted 0 7051 130
# 1 133 67
sum(abs(Ghat.train.lda-Y.train))/n.train #training classification error rate = 0.0356
# 0.03563203
sum(Ghat.train.lda==1&Y.train==1)/sum(Y.train==1) # sensitivity = 0.3401
# 0.3401015
sum(Ghat.train.lda==0&Y.train==0)/sum(Y.train==0) # specificity = 0.9815
# 0.9814866
post.valid.lda <- predict(model.lda.Gene, data.valid)$posterior[,2] # n.valid posterior probabilities of Y=1
Ghat.valid.lda <- ifelse(post.valid.lda>cut.off.lda.Train,1,0)
# use same probability cutoff
table(Ghat.valid.lda,Y.valid) # classification table
#Classification Table for LDA
# 0 1
#0 1311 21
#1 24 22
sum(abs(Ghat.valid.lda-Y.valid))/n.valid #classification error rate = 0.0327
# 0.0327
sum(Ghat.valid.lda==1&Y.valid==1)/sum(Y.valid==1) # sensitivity = 0.5116
# 0.5116
sum(Ghat.valid.lda==0&Y.valid==0)/sum(Y.valid==0) # specificity = 0.9820
# 0.9820
#################################################
# Quadratic discriminant analysis using qda function in MASS
model.qda.gene <- qda(Y ~ X1 + X2 + X3 + X4 + X5,data = data.train)
post.train.qda <- predict(model.qda.gene)$posterior[,2] # n.train posterior probabilities of Y=1
#The cut-off probability for QDA is the 201st highest posterior probability
cut.off.qda.Train <- sort(post.train.qda,
decreasing=T)[201]
#The cutoff probability when QDA is 0.3514358
Ghat.train.qda <- ifelse(post.train.qda > cut.off.qda.Train,1,0) # classification rule
table(Ghat.train.qda,Y.train) # classification table
#Classification Table for QDA
# 0 1
#0 7055 126
#1 129 71
sum(abs(Ghat.train.qda-Y.train))/n.train #training classification error rate = 0.0345
# 0.03454816
sum(Ghat.train.qda==1&Y.train==1)/sum(Y.train==1) # sensitivity = 0.3604
# 0.3604061
sum(Ghat.train.qda==0&Y.train==0)/sum(Y.train==0) # specificity = 0.9820
# 0.9820
#QDA results for the validation set
post.valid.qda <- predict(model.qda.gene,data.valid)$posterior[,2] # n.valid posterior probabilities of Y=1
# The training cutoff (0.3514358) is reused to classify the validation pairs
Ghat.valid.qda <- ifelse(post.valid.qda > cut.off.qda.Train,1,0) # classification rule
table(Ghat.valid.qda,Y.valid) # classification table
#Classification Table for QDA
# 0 1
#0 1307 19
#1 28 24
sum(abs(Ghat.valid.qda-Y.valid))/n.valid #validation classification error rate = 0.0341
# 0.0341
sum(Ghat.valid.qda==1&Y.valid==1)/sum(Y.valid==1) # sensitivity = 0.5581
# 0.5581
sum(Ghat.valid.qda==0&Y.valid==0)/sum(Y.valid==0) # specificity = 0.9790
# 0.9790
# Method 4: K-nearest neighbors (KNN) applied to the training set
require(class)
mer <- rep(NA, 30) # misclassification error rates based on leave-one-out cross-validation
X <- rbind(X.train,X.valid)
X.std <- scale(X)
X.train.std <- X.std[1:n.train,]
X.valid.std <- X.std[(n.train+1):(n.train+n.valid),]
set.seed(2014) # seed must be set because R randomly breaks ties
for (i in 1:30) mer[i] <- sum((Y.train-(c(knn.cv(train=X.train.std, cl=Y.train, k=i))-1))^2)/n.train
plot(mer)
which.min(mer) # minimum occurs at k = 13
set.seed(2014)
model.knn <- knn(train=X.train.std, test=X.train.std, cl=Y.train, k=13, prob=T)
predclass.knn <- c(model.knn)-1 # convert factor to numeric classes
predprob.knn <- attr(model.knn, "prob") # proportion of votes for winning class
post.train.knn <- predclass.knn*predprob.knn+(1-predclass.knn)*(1-predprob.knn) # n.train post probs of Y=1
cutoff.knn.Train <- sort(post.train.knn,decreasing = T)[201] # probability cutoff for predicting classes
#The cutoff probability for the KNN is 0.2307692
Ghat.train.knn <- ifelse(post.train.knn>cutoff.knn.Train,1,0) # classification rule
table(Ghat.train.knn,Y.train) # classification table
#Classification data for KNN for the training set
# 0 1
#0 7091 103
#1 93 94
sum(abs(Ghat.train.knn-Y.train))/n.train # classification error rate = 0.02655
# 0.02655
sum(Ghat.train.knn==1&Y.train==1)/sum(Y.train==1) # sensitivity = 0.4772
# 0.4772
sum(Ghat.train.knn==0&Y.train==0)/sum(Y.train==0) # specificity = 0.9871
# 0.9871
# Method 4 (continued): KNN applied to the validation set
set.seed(2014)
model.knn <- knn(train=X.train.std, test=X.valid.std, cl=Y.train, k=13, prob=T)
predclass.knn <- c(model.knn)-1 # convert factor to numeric classes
predprob.knn <- attr(model.knn, "prob") # proportion of votes for winning class
post.valid.knn <- predclass.knn*predprob.knn+(1-predclass.knn)*(1-predprob.knn) # n.valid post probs of Y=1
Ghat.valid.knn <- ifelse(post.valid.knn>cutoff.knn.Train,1,0) # use same probability cutoff
table(Ghat.valid.knn,Y.valid) # classification table
#Confusion Matrix for KNN results with k = 13
#        0    1
#0    1294   30
#1      41   13
sum(abs(Ghat.valid.knn-Y.valid))/n.valid # classification error rate = 0.0515
# 0.0515
sum(Ghat.valid.knn==1&Y.valid==1)/sum(Y.valid==1) # sensitivity = 0.3023
# 0.3023
sum(Ghat.valid.knn==0&Y.valid==0)/sum(Y.valid==0) # specificity = 0.9693
# 0.9693
require(gam)
model.gam <- gam(Y ~ s(X1,df=5) + s(X2,df=5) + s(X3,df=5) + s(X4,df=5) + s(X5,df=5),
                 data = data.train, family=binomial)
summary(model.gam)
post.train.gam <- model.gam$fitted.values # n.train posterior probabilities of Y=1
cutoff.gam.Train <- sort(post.train.gam,decreasing = T)[201] # probability cutoff for predicting classes
#Cutoff probability for the logistic GAM is 0.1725833
Ghat.train.gam <- ifelse(post.train.gam>cutoff.gam.Train,1,0) # classification rule
table(Ghat.train.gam,Y.train) # classification table
#Confusion Matrix for the Logistic GAM model
# 0 1
#0 7071 110
#1 113 87
sum(abs(Ghat.train.gam-Y.train))/n.train #training classification error rate = 0.0302
# 0.03021271
sum(Ghat.train.gam==1&Y.train==1)/sum(Y.train==1) # sensitivity = 0.4416
# 0.4416244
sum(Ghat.train.gam==0&Y.train==0)/sum(Y.train==0) # specificity = 0.9843
# 0.9842706
#Method 5.) Logistic GAM with Validation Set
# Apply the logistic GAM to the validation data
post.valid.gam <- predict(model.gam, data.valid, type="response") # n.valid post probs
# The training cutoff (0.1725833) is reused for the validation data
Ghat.valid.gam <- ifelse(post.valid.gam>cutoff.gam.Train,1,0) # use same probability cutoff
table(Ghat.valid.gam,Y.valid) # classification table
#Confusion Matrix for Logistic GAM
# 0 1
#0 1254 19
#1 81 24
sum(abs(Ghat.valid.gam-Y.valid))/n.valid # classification error rate = 0.0726
# 0.07256894
sum(Ghat.valid.gam==1&Y.valid==1)/sum(Y.valid==1) # sensitivity = 0.5581
# 0.5581395
sum(Ghat.valid.gam==0&Y.valid==0)/sum(Y.valid==0) # specificity = 0.9393
# 0.9393258
