Analyzing Gene Data Using a Logistic Model, LDA, QDA, KNN and a Logistic GAM
Author: Jonathan Fivelsdal
Introduction
Data mining methods are widely used in genetics. The data from Brem and Kruglyak cover 231 active genes and
95 segregants (individuals), with measurements for each segregant at 0, 10, 20, 30, 40 and 50 minutes.
Using a subset of the original data, 122 genes are in the training set and 53 genes are in the validation
set. This report analyzes the gene expression data with five methods: logistic regression, linear discriminant
analysis (LDA), quadratic discriminant analysis (QDA), k-nearest neighbors (KNN) and a logistic generalized
additive model (GAM). Each method is used to identify protein-protein interactions (PPI) between pairs of genes.
Analysis and Results
For each method, we obtain the posterior probabilities of interaction for the 7,381 gene pairs in the training data.
A pair is assigned to class 1 (interacted) if it is among the 200 pairs with the highest posterior probability and
to class 0 (not interacted) otherwise. Following the order in which the predictors are built in the Appendix code,
the predictors used in each model are the mean of gene i (denoted as X1), the mean of gene j (denoted as X2), the
covariance between gene i and gene j (denoted as X3), the variance of gene i (denoted as X4) and the variance of
gene j (denoted as X5).
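The top-200 rule described above can be sketched in a few lines of R. This is only an illustration on simulated probabilities, not the gene data: the vector `post` is a hypothetical stand-in for the posterior probabilities produced by any of the five models.

```r
# Sketch of the top-200 classification rule on simulated scores (not the gene data).
set.seed(1)
n.pairs <- choose(122, 2)            # 122 training genes give 7381 unordered pairs
post <- runif(n.pairs)               # hypothetical posterior probabilities of interaction

n.top <- 200
# The cutoff is the (n.top + 1)-th largest probability, so that exactly
# n.top pairs lie strictly above it when there are no ties at the cutoff.
cutoff <- sort(post, decreasing = TRUE)[n.top + 1]
Ghat <- ifelse(post > cutoff, 1, 0)  # 1 = predicted interaction, 0 = no interaction

n.pairs                              # number of gene pairs
sum(Ghat)                            # number of pairs assigned to class 1
```

Because the rule uses a strict inequality, fewer than 200 pairs are assigned to class 1 whenever several pairs share the cutoff value.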
Logistic regression is used first. The fitted model for the log-odds of interaction is
log-odds = -14.124 + 0.494*X1 + 0.404*X2 + 1.435*X3 + 0.002*X4 - 0.084*X5.
The X1, X2 and X3 variables are significant at the 5% significance level. The predictors X4 and X5
have p-values of 0.984 and 0.387 and are not significant. The cutoff probability for the logistic model is 0.1721.
Below is the confusion matrix when applying the logistic model on the training data:
Confusion Matrix for Fitted Logistic Regression Model (Training Set)
                    Actual Class 0   Actual Class 1
Predicted Class 0        7062             119
Predicted Class 1         122              78
For the training data, the classification error is 3.27%, the sensitivity is 39.59% and the specificity
is 98.30%. Below is the confusion matrix when the logistic model is applied on the validation data:
Confusion Matrix for Fitted Logistic Regression Model (Validation Set)
                    Actual Class 0   Actual Class 1
Predicted Class 0        1301              19
Predicted Class 1          34              24
For the validation data, the classification error is 3.85%, the sensitivity is 55.81% and the specificity is 97.45%.
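As a worked example of how the fitted logistic model turns predictor values into a class, the sketch below plugs one set of predictor values into the reported coefficients, converts the log-odds to a probability and compares it with the 0.1721 cutoff. The predictor values in `x` are hypothetical and chosen only for illustration, and the coefficients are the three-decimal roundings quoted in the text, so the result is approximate.

```r
# Reported (rounded) logistic regression coefficients: intercept, X1, ..., X5
beta <- c(-14.124, 0.494, 0.404, 1.435, 0.002, -0.084)

# Hypothetical predictor values for a single gene pair (illustration only)
x <- c(X1 = 12.27, X2 = 12.23, X3 = 0.279, X4 = 0.73, X5 = 0.64)

eta <- beta[1] + sum(beta[-1] * x)   # linear predictor = log-odds of interaction
p   <- plogis(eta)                   # inverse logit: 1 / (1 + exp(-eta))

cutoff <- 0.1721                     # training cutoff reported for the logistic model
pred.class <- ifelse(p > cutoff, 1, 0)

c(log.odds = eta, probability = p, class = pred.class)
```

For these illustrative values the probability falls below the cutoff, so the pair would be assigned to class 0.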
Now we use LDA. The LDA model has the following coefficients for the five predictor model:
Terms          X1      X2      X3      X4      X5
Coefficients   0.352   0.294   1.033   0.081   0.045
The cutoff probability for the LDA model applied to the training data is 0.1450. Below is the confusion matrix
for the training data that we obtain for the fitted LDA model:
Confusion Matrix for Fitted Linear Discriminant Analysis Model (Training Set)
                    Actual Class 0   Actual Class 1
Predicted Class 0        7051             130
Predicted Class 1         133              67
For the training data, the classification error is 3.56%, the sensitivity is 34.01% and the specificity is 98.15%.
Below is the confusion matrix for the validation set data that we obtain for the LDA model:
Confusion Matrix for Fitted Linear Discriminant Analysis Model (Validation Set)
                    Actual Class 0   Actual Class 1
Predicted Class 0        1311              21
Predicted Class 1          24              22
For the validation data, the classification error is 3.27%, the sensitivity is 51.16% and the specificity is 98.20%.
Now we use QDA. The cutoff probability for the QDA model applied to the training data is 0.3514. Below is the
confusion matrix for applying the QDA model on the training data:
Confusion Matrix for Fitted Quadratic Discriminant Analysis Model (Training Set)
                    Actual Class 0   Actual Class 1
Predicted Class 0        7055             126
Predicted Class 1         129              71
For the training data, the classification error is 3.45%, the sensitivity is 36.04% and the specificity
is 98.20%. The following is the confusion matrix for applying the QDA model on the validation data:
Confusion Matrix for Fitted Quadratic Discriminant Analysis Model (Validation Set)
                    Actual Class 0   Actual Class 1
Predicted Class 0        1307              19
Predicted Class 1          28              24
For the validation data, the classification error is 3.41%, the sensitivity is 55.81% and the
specificity is 97.90%.
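The three summary measures reported for every method follow directly from the confusion-matrix counts. The sketch below recomputes them for the QDA validation matrix; `class.metrics` is a helper name introduced here for illustration and is not part of the Appendix code.

```r
# Summarise a 2 x 2 confusion matrix laid out as in this report:
# rows = predicted class (0, 1), columns = actual class (0, 1).
class.metrics <- function(cm) {
  c(error       = (cm[1, 2] + cm[2, 1]) / sum(cm),  # off-diagonal share
    sensitivity = cm[2, 2] / sum(cm[, 2]),          # true positives / actual positives
    specificity = cm[1, 1] / sum(cm[, 1]))          # true negatives / actual negatives
}

# QDA validation counts copied from the confusion matrix above (filled column by column)
qda.valid <- matrix(c(1307, 28, 19, 24), nrow = 2,
                    dimnames = list(Predicted = c("0", "1"),
                                    Actual    = c("0", "1")))

round(100 * class.metrics(qda.valid), 2)
#       error sensitivity specificity
#        3.41       55.81       97.90
```

The same function applied to any other confusion matrix in this report reproduces the percentages quoted beneath it.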
Now we use KNN. Leave-one-out cross-validation on the training data selects k = 13 for the KNN model.
The cutoff probability for the KNN model is 0.2308.
Confusion Matrix for Fitted K-Nearest Neighbors Model (Training Set)
                    Actual Class 0   Actual Class 1
Predicted Class 0        7091             103
Predicted Class 1          93              94
For the training data, the classification error is 2.66%, the sensitivity is 47.72% and the specificity is 98.71%.
Confusion Matrix for Fitted K-Nearest Neighbors Model (Validation Set)
                    Actual Class 0   Actual Class 1
Predicted Class 0        1294              30
Predicted Class 1          41              13
For the validation data, the classification error is 5.15%, the sensitivity is 30.23% and the specificity is 96.93%.
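The KNN training matrix assigns only 93 + 94 = 187 pairs to class 1 rather than 200. A likely explanation is that a 13-neighbor vote share takes only a few distinct values, so many pairs tie at the cutoff and the strict inequality excludes them. The sketch below checks that the reported cutoff corresponds to 3 of 13 votes and then illustrates the effect with simulated vote counts. The simulation is hypothetical and assumes every vote share is a multiple of 1/13, which may not hold exactly when neighbors are tied in distance.

```r
k <- 13
possible.post <- (0:k) / k     # values a 13-neighbour vote share can take
cutoff <- 0.2307692            # KNN cutoff reported above

# Number of class-1 votes that corresponds to the reported cutoff
n.votes <- which(abs(possible.post - cutoff) < 1e-6) - 1
n.votes                        # 3, i.e. the cutoff equals 3/13

# Toy illustration with simulated vote counts (not the gene data)
set.seed(1)
votes.toy <- rbinom(7381, size = k, prob = 0.03)   # hypothetical class-1 votes per pair
post.toy  <- votes.toy / k
cut.toy   <- sort(post.toy, decreasing = TRUE)[201]
sum(post.toy > cut.toy)        # at most 200; smaller when pairs tie at the cutoff
```

With discrete scores like these, the number of predicted interactions depends on how many pairs sit exactly at the cutoff.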
Finally we use a logistic GAM. By applying the logistic GAM to the training data, we get a cutoff probability of
0.1726.
Confusion Matrix for Fitted Logistic Generalized Additive Model (Training Set)
                    Actual Class 0   Actual Class 1
Predicted Class 0        7071             110
Predicted Class 1         113              87
For the training data, the classification error is 3.02%, the sensitivity is 44.16% and the specificity is 98.43%.
The cutoff probability for the logistic GAM model applied to the validation data is 0.0692.
Confusion Matrix for Fitted Logistic Generalized Additive Model (Validation Set)
                    Actual Class 0   Actual Class 1
Predicted Class 0        1254              19
Predicted Class 1          81              24
For the validation data, the classification error is 7.26%, the sensitivity is 55.81% and the
specificity is 93.93%.
Conclusion
The following is a table that summarizes the results from the analysis:
Method         Class Error   Sensitivity   Specificity   Class Error    Sensitivity    Specificity
               (train)       (train)       (train)       (validation)   (validation)   (validation)
Logistic       3.27%         39.59%        98.30%        3.85%          55.81%         97.45%
LDA            3.56%         34.01%        98.15%        3.27%          51.16%         98.20%
QDA            3.45%         36.04%        98.20%        3.41%          55.81%         97.90%
KNN            2.66%         47.72%        98.71%        5.15%          30.23%         96.93%
Logistic GAM   3.02%         44.16%        98.43%        7.26%          55.81%         93.93%
The KNN model has the lowest classification error on the training data while LDA has the highest. QDA, logistic
regression and the logistic GAM have the 2nd highest, 3rd highest and 4th highest training classification errors
respectively. The KNN model also has the highest sensitivity and specificity on the training data, while the LDA
model has the lowest training sensitivity and specificity.
The LDA model has the lowest classification error and the highest specificity on the validation data, but it has the
2nd lowest sensitivity on the validation data. The three models with the highest sensitivity are logistic
regression, QDA and the logistic GAM (all having a sensitivity of 55.81%). Despite being one of the three models with
the highest sensitivity on the validation set, the logistic GAM has the highest classification error and is likely too
flexible for this gene expression dataset. The model with the lowest sensitivity is the KNN model, which also has
the 2nd highest classification error and the 2nd lowest specificity on the validation set.
Based on the training data alone, KNN would appear to be the best technique to use on the gene expression
data since it has the lowest training classification error, the highest training sensitivity and the highest training
specificity. On the validation set, however, the KNN method has the 2nd highest classification error, the 2nd lowest
specificity and the lowest sensitivity. LDA performs well on the validation set since its classification
error of 3.27% is the lowest and it has the highest specificity. The main concern with the performance of
LDA on the validation set is that it has the 2nd lowest sensitivity. An alternative model that performs well on the
validation set is the QDA model, which has the 2nd lowest classification error, the 2nd highest specificity and is one
of the three models with the highest sensitivity (55.81%) on the validation set. If a lower
sensitivity can be tolerated, LDA appears to be the best model since it has the lowest classification
error and the highest specificity on the validation set. If a higher sensitivity is desired for a particular
application, QDA is the better choice, since it matches the highest validation sensitivity while giving up little in
classification error and specificity.
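The validation-set comparison in this conclusion can be checked by entering the validation columns of the summary table into a data frame and ranking the methods. The data frame name `results` and its column names are introduced here for illustration only.

```r
# Validation-set percentages copied from the summary table
results <- data.frame(
  method      = c("Logistic", "LDA", "QDA", "KNN", "Logistic GAM"),
  error.valid = c(3.85, 3.27, 3.41, 5.15, 7.26),
  sens.valid  = c(55.81, 51.16, 55.81, 30.23, 55.81),
  spec.valid  = c(97.45, 98.20, 97.90, 96.93, 93.93),
  stringsAsFactors = FALSE
)

# Methods ordered from lowest to highest validation classification error
results[order(results$error.valid), ]

# Methods that share the highest validation sensitivity
results$method[results$sens.valid == max(results$sens.valid)]
```

Ordering by `spec.valid` or `sens.valid` instead gives the other rankings quoted in the text.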
Appendix
load('Project2.RData') #Load the gene expression data
ls() #View the variable names in the data set
# Create the training set
Y.train <- Network.train[lower.tri(Network.train)] # 7381 pairs in training
#data
n.train <- length(Y.train)
Y.train.mean <-mean(Y.train)
# 0.02669015 proportion of PPI interactions in training data
Y.valid <- Network.valid[lower.tri(Network.valid)] # 1378 pairs in validation
#data
n.valid <- length(Y.valid)
mean(Y.valid)
# 0.03120464 proportion of PPI interactions in validation data
X.train = NULL # 7381 by 5 matrix with 5 predictors for each gene pair
# Column order: mean of gene i, mean of gene j, cov(i, j), variance of gene i, variance of gene j
for (i in 1:(dim(DATA.train)[1]-1)) {
  for (j in (i+1):dim(DATA.train)[1]) {
    X.train = rbind(X.train,
                    c(mean(DATA.train[i,]), mean(DATA.train[j,]),
                      cov(DATA.train[i,],DATA.train[j,]),
                      var(DATA.train[i,]), var(DATA.train[j,])))
  }
}
data.train <-as.data.frame(cbind(Y.train, X.train))
names(data.train) <- c("Y", "X1", "X2", "X3", "X4", "X5")
##########################################################
#Create the validation set
X.valid = NULL # 1378 by 5 matrix with 5 predictors for each gene pair
for (i in 1:(dim(DATA.valid)[1]-1)) {
  for (j in (i+1):dim(DATA.valid)[1]) {
    X.valid = rbind(X.valid,
                    c(mean(DATA.valid[i,]), mean(DATA.valid[j,]),
                      cov(DATA.valid[i,],DATA.valid[j,]),
                      var(DATA.valid[i,]), var(DATA.valid[j,])))
  }
}
data.valid <-as.data.frame(X.valid)
names(data.valid) <-c("X1", "X2", "X3", "X4", "X5")
# Regression Formula for the logistic regression model
reg.formula <- paste(names(data.train)[1],
                     paste(c("X1", "X2", "X3", "X4", "X5"),collapse = "+"),
                     sep = "~")
print(reg.formula) #Shows what the regression formula is
# 1.) Logistic Regression
# Logistic regression analysis using glm function
logisticMod1 <- glm(reg.formula, data = data.train, family=binomial("logit"))
summary(logisticMod1)
posterior.logistic.train <- logisticMod1$fitted.values # n.train posterior probabilities of Y=1
#The cut-off probability is the 201st highest posterior probability
cut.off.logistic.Train <- sort(posterior.logistic.train,
decreasing=T)[201]
#Cut-off Probability for logistic regression is 0.1720898
Ghat.train.logistic <- ifelse(posterior.logistic.train > cut.off.logistic.Train,1,0)
##classification rule
table(Ghat.train.logistic,Y.train) # classification table
#Actual
# 0 1
#Predicted 0 7062 119
# 1 122 78
sum(abs(Ghat.train.logistic-Y.train))/n.train #training classification error rate = 0.0327
# 0.0326514
sum(Ghat.train.logistic==1&Y.train==1)/sum(Y.train==1) # sensitivity = 0.3959
# 0.3959391
sum(Ghat.train.logistic==0&Y.train==0)/sum(Y.train==0) # specificity = 0.9830
# 0.9830178
post.valid.logistic <- predict(logisticMod1, data.valid, type="response") # n.valid post probs
Ghat.valid.logistic <- ifelse(post.valid.logistic>cut.off.logistic.Train,1,0)
#use same probability cutoff
table(Ghat.valid.logistic,Y.valid) # classification table
# 0 1
#0 1301 19
#1 34 24
sum(abs(Ghat.valid.logistic-Y.valid))/n.valid # classification error rate = 0.0385
# 0.0385
sum(Ghat.valid.logistic==1&Y.valid==1)/sum(Y.valid==1) # sensitivity = 0.5581
# 0.5581
sum(Ghat.valid.logistic==0&Y.valid==0)/sum(Y.valid==0) # specificity = 0.9745
# 0.9745
# Linear discriminant analysis using lda function in MASS
require(MASS)
model.lda.Gene <- lda(Y ~ X1 + X2 + X3 + X4 + X5,data = data.train)
model.lda.Gene
#Prior probabilities of groups:
# 0 1
# 0.97330985 0.02669015
#Group means:
# X1 X2 X3 X4 X5
#0 10.67300 10.80075 0.01331963 0.8531054 0.9114245
#1 12.26554 12.22904 0.27908714 0.7315978 0.6442998
#Coefficients of linear discriminants:
# LD1
#X1 0.35181460
#X2 0.29353120
#X3 1.03316558
#X4 0.08062700
#X5 0.04478497
plot(model.lda.Gene) # for 2 classes this displays histograms
post.train.lda.Gene <- predict(model.lda.Gene)$posterior[,2] # n.train posterior probabilities of Y=1
#The cut-off probability for LDA is the 201st highest posterior probability
cut.off.lda.Train <- sort(post.train.lda.Gene,
decreasing=T)[201]
# cut-off probability of LDA is 0.1450
Ghat.train.lda <- ifelse(post.train.lda.Gene >cut.off.lda.Train,1,0) # classification rule
#Ghat.train.lda <- ifelse(predict(model.lda.Gene)$class=="0",0,1) # alternative rule when cutoff=0.5
table(Ghat.train.lda,Y.train) # classification table
#Actual
# 0 1
# Predicted 0 7051 130
# 1 133 67
sum(abs(Ghat.train.lda-Y.train))/n.train #training classification error rate = 0.0356
# 0.03563203
sum(Ghat.train.lda==1&Y.train==1)/sum(Y.train==1) # sensitivity = 0.3401
# 0.3401015
sum(Ghat.train.lda==0&Y.train==0)/sum(Y.train==0) # specificity = 0.9815
# 0.9814866
post.valid.lda <- predict(model.lda.Gene, data.valid)$posterior[,2] # n.valid posterior probabilities of Y=1
Ghat.valid.lda <- ifelse(post.valid.lda>cut.off.lda.Train,1,0)
# use same probability cutoff
table(Ghat.valid.lda,Y.valid) # classification table
#Classification Table for LDA
# 0 1
#0 1311 21
#1 24 22
sum(abs(Ghat.valid.lda-Y.valid))/n.valid #classification error rate = 0.0327
# 0.0327
sum(Ghat.valid.lda==1&Y.valid==1)/sum(Y.valid==1) # sensitivity = 0.5116
# 0.5116
sum(Ghat.valid.lda==0&Y.valid==0)/sum(Y.valid==0) # specificity = 0.9820
# 0.9820
#################################################
# Quadratic discriminant analysis using qda function in MASS
model.qda.gene <- qda(Y ~ X1 + X2 + X3 + X4 + X5,data = data.train)
post.train.qda <- predict(model.qda.gene)$posterior[,2] # n.train posterior probabilities of Y=1
#The cut-off probability for QDA is the 201st highest posterior probability
cut.off.qda.Train <- sort(post.train.qda,
decreasing=T)[201]
#The cutoff probability when QDA is 0.3514358
Ghat.train.qda <- ifelse(post.train.qda > cut.off.qda.Train,1,0) # classification rule
table(Ghat.train.qda,Y.train) # classification table
#Classification Table for QDA
# 0 1
#0 7055 126
#1 129 71
sum(abs(Ghat.train.qda-Y.train))/n.train #training classification error rate = 0.0345
# 0.03454816
sum(Ghat.train.qda==1&Y.train==1)/sum(Y.train==1) # sensitivity = 0.3604
# 0.3604061
sum(Ghat.train.qda==0&Y.train==0)/sum(Y.train==0) # specificity = 0.9820
# 0.9820
#QDA results for the validation set
post.valid.qda <- predict(model.qda.gene,data.valid)$posterior[,2] # n.valid posterior probabilities of Y=1
#The cut-off probability for QDA is the 201st highest posterior probability
#The cutoff probability for the QDA is 0.054226
Ghat.valid.qda <- ifelse(post.valid.qda > cut.off.qda.Train,1,0) # classification rule
table(Ghat.valid.qda,Y.valid) # classification table
#Classification Table for QDA
# 0 1
#0 1307 19
#1 28 24
sum(abs(Ghat.valid.qda-Y.valid))/n.valid #validation classification error rate = 0.0341
# 0.0341
sum(Ghat.valid.qda==1&Y.valid==1)/sum(Y.valid==1) # sensitivity = 0.5581
# 0.5581
sum(Ghat.valid.qda==0&Y.valid==0)/sum(Y.valid==0) # specificity = 0.9790
# 0.9790
#Method #4: KNN with no validation set
require(class)
mer <- rep(NA, 30) # misclassification error rates based on leave-one-out cross-validation
X <- rbind(X.train,X.valid)
X.std <- scale(X)
X.train.std <- X.std[1:n.train,]
X.valid.std <- X.std[(n.train+1):(n.train+n.valid),]
set.seed(2014) # seed must be set because R randomly breaks ties
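# as.integer() returns the factor codes (1, 2); subtracting 1 gives classes 0/1.
# (c() on a factor returns a factor in R >= 4.1, so c(...)-1 no longer works there.)
for (i in 1:30) mer[i] <- sum((Y.train-(as.integer(knn.cv(train=X.train.std, cl=Y.train, k=i))-1))^2)/n.train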
plot(mer)
which.min(mer) # minimum occurs at k = 13
set.seed(2014)
model.knn <- knn(train=X.train.std, test=X.train.std, cl=Y.train, k=13, prob=T)
predclass.knn <- as.integer(model.knn)-1 # convert factor codes (1, 2) to numeric classes (0, 1)
predprob.knn <- attr(model.knn, "prob") # proportion of votes for winning class
post.train.knn <- predclass.knn*predprob.knn+(1-predclass.knn)*(1-predprob.knn) # n.train post probs of Y=1
cutoff.knn.Train <- sort(post.train.knn,decreasing = T)[201] # probability cutoff for predicting classes
#The cutoff probability for the KNN is 0.2307692
Ghat.train.knn <- ifelse(post.train.knn>cutoff.knn.Train,1,0) # classification rule
table(Ghat.train.knn,Y.train) # classification table
#Classification data for KNN for the training set
# 0 1
#0 7091 103
#1 93 94
sum(abs(Ghat.train.knn-Y.train))/n.train # classification error rate = 0.02655
# 0.02655
sum(Ghat.train.knn==1&Y.train==1)/sum(Y.train==1) # sensitivity = 0.4772
# 0.4772
sum(Ghat.train.knn==0&Y.train==0)/sum(Y.train==0) # specificity = 0.9871
# 0.9871
#Method # 4: KNN with validation set
set.seed(2014)
model.knn <- knn(train=X.train.std, test=X.valid.std, cl=Y.train, k=13, prob=T)
predclass.knn <- as.integer(model.knn)-1 # convert factor codes (1, 2) to numeric classes (0, 1)
predprob.knn <- attr(model.knn, "prob") # proportion of votes for winning class
post.valid.knn <- predclass.knn*predprob.knn+(1-predclass.knn)*(1-predprob.knn) # n.valid post probs of Y=1
Ghat.valid.knn <- ifelse(post.valid.knn>cutoff.knn.Train ,1,0) # use same probability cutoff
table(Ghat.valid.knn,Y.valid) # classification table
#Confusion Matrix for KNN results with k = 13
# 0 1
#0 1294 30
#1 41 13
sum(abs(Ghat.valid.knn-Y.valid))/n.valid # classification error rate = 0.0515
# 0.0515
sum(Ghat.valid.knn==1&Y.valid==1)/sum(Y.valid==1) # sensitivity = 0.3023
# 0.3023
sum(Ghat.valid.knn==0&Y.valid==0)/sum(Y.valid==0) # specificity = 0.9693
# 0.9693
require(gam)
model.gam <- gam(Y ~ s(X1,df=5) + s(X2,df=5) + s(X3,df=5) + s(X4,df=5) + s(X5,df=5),
                 data = data.train, family=binomial)
summary(model.gam)
post.train.gam <- model.gam$fitted.values # n.train posterior probabilities of Y=1
cutoff.gam.Train <- sort(post.train.gam,decreasing = T)[201] # probability cutoff for predicting classes
#Cutoff probability for the logistic GAM is 0.1725833
Ghat.train.gam <- ifelse(post.train.gam>cutoff.gam.Train,1,0) # classification rule
table(Ghat.train.gam,Y.train) # classification table
#Confusion Matrix for the Logistic GAM model
# 0 1
#0 7071 110
#1 113 87
sum(abs(Ghat.train.gam-Y.train))/n.train #training classification error rate = 0.0302
# 0.03021271
sum(Ghat.train.gam==1&Y.train==1)/sum(Y.train==1) # sensitivity = 0.4416
# 0.4416244
sum(Ghat.train.gam==0&Y.train==0)/sum(Y.train==0) # specificity = 0.9843
# 0.9842706
#Method 5.) Logistic GAM with Validation Set
# Suppose we had validation data as for other examples
post.valid.gam <- predict(model.gam, data.valid, type="response") # n.valid post probs
#Cutoff probability for the logistic GAM for the validation set is 0.0692
Ghat.valid.gam <- ifelse(post.valid.gam>cutoff.gam.Train,1,0) # use same probability cutoff
table(Ghat.valid.gam,Y.valid) # classification table
#Confusion Matrix for Logistic GAM
# 0 1
#0 1254 19
#1 81 24
sum(abs(Ghat.valid.gam-Y.valid))/n.valid # classification error rate = 0.0726
# 0.07256894
sum(Ghat.valid.gam==1&Y.valid==1)/sum(Y.valid==1) # sensitivity = 0.5581
# 0.5581395
sum(Ghat.valid.gam==0&Y.valid==0)/sum(Y.valid==0) # specificity = 0.9393
# 0.9393258