Kaggle digits analysis_final_fc

Kaggle Digits Analysis
Zachary Combs, Philip Remmele, M.S. Data Science Candidates
South Dakota State University
July 2, 2015
Zachary Combs, Philip Remmele, M.S. Data Science Candidates Kaggle Digits Analysis

Introduction
In the following presentation we will be discussing our analysis of the Kaggle Digits data.
The Digits data set is comprised of a training set of 42,000 observations and 784
variables (not including the response), and a test set, containing 28,000 observations.
The variables contain pixelation values of hand written digits, ranging from 0-9.
For more information regarding the Kaggle Digits data please visit the site:
https://www.kaggle.com/c/digit-recognizer.

Objective
Develop a classiﬁcation model that is able to accurately classify digit labels in the
test set where class labels are unknown.

Methods
Employed a repeated 10-fold cross-validation to obtain stable estimates of
classiﬁcation accuracy.
Iteratively maximized model tuning parameters (e.g. number of components, decay
factor, etc.).
Performed model comparison.
Selected optimal model based on accuracy measure.

!"#$%#&'!()
*+,-.'./"'01.1'-2.%'
3#1-2451,-01.-%2'*".6
7%#'819/'
:%0",'3;+"
<6"'=>?@A7%,0'(5'.%'6","9.'./"'
:%0",B6'3C2-2D'!1#1&"."#6
*","9.'E"6.'&%0",'E16"0'%2'(5'F"6C,.6
G".'1'6"9%201#;'"6.-&1."'%$'199C#19;'
E;'+#"0-9.-2D'H1,-01.-%2'6".

Data Exploration: Mean
0.00
0.02
0.04
0.06
0.08
0 50 100 150
Mean
Density
Train Data Mean Pixel Values
Table 1:Train Data Summary Statistics
Mean Median
33.40891 7.2315

Data Exploration: Percent Unique
0.00
0.05
0.10
0.15
0 20 40 60 80
Percent Unique
Density
Percent of Unique Pixel Values in Train Data
Max Percentage Unique
60.95238

Data Exploration: Max
0.00
0.02
0.04
0.06
0 100 200 300
Max
Density
Max Pixel Values in Training Data
Maximum Pixel Values
255

Image of Kaggle Handwritten Digit Labels
1
1:28
1:28
0
1:28
1:28
1
1:28
1:28
4
1:28
1:28
0
1:28
1:28
0
1:28
1:28
7
1:28
1:28
3
1:28
1:28
5
1:28
1:28
3
1:28
1:28
8
1:281:28
9
1:28
1:28
1
1:28
3
1:28
3
1:28
1
1:28

PCA With Diﬀerent Transformations
0.25
0.50
0.75
1.00
0 50 100 150 200
Number of Components
PercentofTotalVarianceExplained
transform_Type
Dr. Saunder's Transform
Log Transformation
No Transform
Square Root

Kaggle Digits Data Variance Explained via. PCA
0.75
0.80
0.85
0.90
0.95
1.00
0 200 400 600 800
Components
CummulativeVarianceExplained
0.0
0.2
0.4
0.6
0 200 400 600 800
Components
ProportionofVarianceExplained

Two-dimensional Visualization of PCA
−25
0
25
50
−70 −60 −50
PC1
PC2
−30
−20
−10
0
10
20
30
−70 −60 −50
PC1
PC3
−30
−20
−10
0
10
20
30
−25 0 25 50
PC2
PC3

Shiny Applications: PCA Exploration
Shiny PCA 1
Shiny PCA 2

Data Partitioning
We created a 70/30 split of the data based on the distributions of class labels for
our training and validation set.
training_index <- createDataPartition(y = training[,1],
p = .7,
list = FALSE)
training <- training[training_index,]
validation <- training[-training_index,]
100 covariates were kept due to explaining approximately 95% of variation in the
data, and for the ease of presentation.
dim(training)
## [1] 29404 101
dim(validation)
## [1] 8821 101

Class Proportions
Train
0%
3%
6%
9%
0 1 2 3 4 5 6 7 8 9
Training Partition
0%
3%
6%
9%
0 1 2 3 4 5 6 7 8 9
Class Label
Validation
0%
3%
6%
9%
0 1 2 3 4 5 6 7 8 9

Class Proportions Continued
Table 4:Class Proportions
0 1 2 3 4 5 6 7 8 9
Orig. 0.1 0.11 0.1 0.10 0.1 0.09 0.1 0.10 0.1 0.1
Train 0.1 0.11 0.1 0.10 0.1 0.09 0.1 0.10 0.1 0.1
Valid 0.1 0.11 0.1 0.11 0.1 0.09 0.1 0.11 0.1 0.1

Linear Discriminant Analysis
Discriminant Function
δk (x) = xT
Σ−1
µk −
1
2
µT
k Σ−1
µk + logπk
Estimating Class Probabilities

Pr(Y = k|X = x) =
πk e

δk
K
l=1 πl e

δ l (x)
Assigning x to the class with the largest discriminant score δk (x) will result in the
highest probability for that classiﬁcation. [James, 2013]

Model Fitting: LDA
ind - seq(10,100,10)
lda_Ctrl - trainControl(method = repeatedcv, repeats = 3,
classProbs = TRUE,
summaryFunction = defaultSummary)
accuracy_measure_lda - NULL
ptm - proc.time()
for(i in 1:length(ind)){
lda_Fit - train(label ~ ., data = training[,1:(ind[i]+1)],
method = lda,
metric = Accuracy,
maximize = TRUE,
trControl = lda_Ctrl)
accuracy_measure_lda[i] - confusionMatrix(validation$label,
predict(lda_Fit,
validation[,2:(ind[i]+1)]))$overall[1]
}
proc.time() - ptm
## user system elapsed
## 22.83 2.44 129.86

LDA Optimal Model: Number of Components vs. Model Accuracy
0.876 0.876
0.78
0.80
0.82
0.84
0.86
0.88
25 50 75 100
ClassificationAccuracy
LDA Accuracy vs. Number of Components

LDA Optimal Model Summary Statistics
Table 5:Confusion Matrix (Columns:Predicted,Rows:Actual)
zero one two three four ﬁve six seven eight nine
zero 827 1 2 4 2 16 7 2 4 5
one 0 916 2 4 0 7 3 2 16 1
two 9 31 726 17 21 8 19 11 42 7
three 3 11 23 803 6 41 7 26 26 25
four 0 9 2 0 770 2 5 1 8 56
ﬁve 10 16 2 39 5 653 18 9 29 15
six 11 9 2 3 13 23 804 0 9 0
seven 2 26 9 4 16 4 0 791 3 76
eight 4 46 6 28 13 32 7 3 686 17
nine 8 5 1 16 28 1 1 29 5 748
Table 6:Overall Accuracy
Accuracy 0.8756377
AccuracyLower 0.8685703
AccuracyUpper 0.8824559

LDA Optimal Model Confusion Matrix Image
827 1 2 4 2 16 7 2 4 5
0 916 2 4 0 7 3 2 16 1
9 31 726 17 21 8 19 11 42 7
3 11 23 803 6 41 7 26 26 25
0 9 2 0 770 2 5 1 8 56
10 16 2 39 5 653 18 9 29 15
11 9 2 3 13 23 804 0 9 0
2 26 9 4 16 4 0 791 3 76
4 46 6 28 13 32 7 3 686 17
8 5 1 16 28 1 1 29 5 748
9.4% 0.0% 0.0% 0.0% 0.0% 0.2% 0.1% 0.0% 0.0% 0.1%
10.4% 0.0% 0.0% 0.1% 0.0% 0.0% 0.2% 0.0%
0.1% 0.4% 8.2% 0.2% 0.2% 0.1% 0.2% 0.1% 0.5% 0.1%
0.0% 0.1% 0.3% 9.1% 0.1% 0.5% 0.1% 0.3% 0.3% 0.3%
0.1% 0.0% 8.7% 0.0% 0.1% 0.0% 0.1% 0.6%
0.1% 0.2% 0.0% 0.4% 0.1% 7.4% 0.2% 0.1% 0.3% 0.2%
0.1% 0.1% 0.0% 0.0% 0.1% 0.3% 9.1% 0.1%
0.0% 0.3% 0.1% 0.0% 0.2% 0.0% 9.0% 0.0% 0.9%
0.0% 0.5% 0.1% 0.3% 0.1% 0.4% 0.1% 0.0% 7.8% 0.2%
0.1% 0.1% 0.0% 0.2% 0.3% 0.0% 0.0% 0.3% 0.1% 8.5%
nine
eight
seven
six
five
four
three
two
one
zero
zero one two three four five six seven eight nine
Predicted
Actual
0
20
40
60
80
Count
LDA Optimal Model
Confusion Matrix Image

LDA Optimal Model Bar Plot
0
300
600
900
Labels
Count
Labels
actual
predicted
LDA Optimal Model
Predicted vs. Actual Class Labels

LDA Optimal Model Predictions for Test Set
2
1:28
1:28
0
1:28
1:28
9
1:28
1:28
4
1:28
1:28
3
1:28
1:28
7
1:28
1:28
0
1:281:28
3
1:28
1:28
0
1:28
1:28
3
1:28
1:28
5
1:28
1:28
7
1:28
1:28
4
1:28
1:28
0
1:28
1:28
4
1:28
1:28
0
1:28
1:28
2
1:28
1:28
1
1:28
1:28
9
1:28
1:28
0
1:28
1:28
9
1:28
1:28
1
1:28
1:28
8
1:28
1:28
5
1:28
1:28
7
1:28
1:28

LDA Summary Statistics on Manually Labeled Test Set
zero 92 1 1 0 1 3 1 0 3 0
one 0 111 0 0 0 1 0 0 3 0
two 1 6 62 2 3 1 1 3 4 0
three 1 1 4 100 0 4 1 5 5 1
four 0 0 0 0 100 1 0 1 0 6
ﬁve 0 2 0 3 1 83 0 0 4 2
six 2 0 1 0 0 1 92 0 4 0
seven 0 1 1 0 1 0 0 91 1 6
eight 0 8 1 2 1 5 0 0 65 4
nine 1 0 0 1 4 0 0 1 1 80
Accuracy 0.8760000

Quadratic Discriminant Analysis
Discriminant Function
δk (x) = −
1
2
(x − µk )T
Σ−1
k (x − µk ) + logπk
Estimating Class Probabilities

Pr(Y = k|X = x) =
πk fk (x)
K
l=1 πl fl (x)
While fk (x) are Gaussian densities with diﬀerent covariance matrix

for each class
we obtain a Quadratic Discriminant Analysis. [James, 2013]

Model Fitting: QDA
qda_Ctrl - trainControl(method = repeatedcv, repeats = 3,
classProbs = TRUE,
accuracy_measure_qda - NULL
ptm - proc.time()
for(i in 1:length(ind)){
qda_Fit - train(label ~ ., data = training[,1:(ind[i]+1)],
method = qda,
metric = Accuracy,
maximize = TRUE,
trControl = lda_Ctrl)
accuracy_measure_qda[i] - confusionMatrix(validation$label,
predict(qda_Fit,
validation[,2:(ind[i]+1)]))$overall[1]
}
proc.time() - ptm
## 20.89 2.16 66.20

QDA Optimal Model: Number of Components vs. Model Accuracy
0.967
0.875
0.900
0.925
0.950
25 50 75 100
QDA Accuracy vs. Number of Components

QDA Optimal Model Summary Statistics
zero 862 0 2 1 0 1 0 0 4 0
one 0 917 10 2 2 0 1 2 17 0
two 1 0 871 0 1 0 0 3 15 0
three 0 0 12 929 0 9 0 4 17 0
four 0 1 1 0 838 0 0 0 6 7
ﬁve 2 0 1 13 0 773 0 0 6 1
six 2 0 0 1 2 14 850 0 5 0
seven 3 4 15 3 3 3 0 874 11 15
eight 0 1 9 7 2 4 0 0 816 3
nine 1 0 5 12 5 1 0 9 9 800
Accuracy 0.9670105

QDA Optimal Model Confusion Matrix Image
862 0 2 1 0 1 0 0 4 0
0 917 10 2 2 0 1 2 17 0
1 0 871 0 1 0 0 3 15 0
0 0 12 929 0 9 0 4 17 0
0 1 1 0 838 0 0 0 6 7
2 0 1 13 0 773 0 0 6 1
2 0 0 1 2 14 850 0 5 0
3 4 15 3 3 3 0 874 11 15
0 1 9 7 2 4 0 0 816 3
1 0 5 12 5 1 0 9 9 800
9.8% 0.0% 0.0% 0.0% 0.0%
10.4% 0.1% 0.0% 0.0% 0.0% 0.0% 0.2%
0.0% 9.9% 0.0% 0.0% 0.2%
0.1% 10.5% 0.1% 0.0% 0.2%
0.0% 0.0% 9.5% 0.1% 0.1%
0.0% 0.0% 0.1% 8.8% 0.1% 0.0%
0.0% 0.0% 0.0% 0.2% 9.6% 0.1%
0.0% 0.0% 0.2% 0.0% 0.0% 0.0% 9.9% 0.1% 0.2%
0.0% 0.1% 0.1% 0.0% 0.0% 9.3% 0.0%
0.0% 0.1% 0.1% 0.1% 0.0% 0.1% 0.1% 9.1%
nine
eight
seven
six
five
four
three
two
one
zero
Predicted
Actual
0
20
40
60
80
Count
QDA Optimal Model

QDA Optimal Model Bar Plot
0
250
500
750
1000
Labels
Count
Labels
actual
predicted
QDA Optimal Model

QDA Optimal Model Predictions for Test Set
2
1:28
1:28
0
1:28
1:28
9
1:28
1:28
9
1:28
1:28
3
1:28
1:28
7
1:28
1:28
0
1:281:28
3
1:28
1:28
0
1:28
1:28
3
1:28
1:28
5
1:28
1:28
7
1:28
1:28
4
1:28
1:28
0
1:28
1:28
4
1:28
1:28
3
1:28
1:28
3
1:28
1:28
1
1:28
1:28
9
1:28
1:28
0
1:28
1:28
9
1:28
1:28
1
1:28
1:28
8
1:28
1:28
5
1:28
1:28
7
1:28
1:28

QDA Summary Statistics on Manually Labeled Test Set
zero 99 0 0 0 0 1 0 0 1 1
one 0 111 1 0 0 0 0 0 3 0
two 0 0 79 1 1 0 1 1 0 0
three 0 0 1 117 0 0 0 0 4 0
four 0 0 0 0 107 0 0 1 0 0
ﬁve 0 0 0 1 0 93 0 0 1 0
six 0 0 0 0 0 1 98 0 1 0
seven 1 0 1 0 0 0 0 98 1 0
eight 0 0 0 0 1 1 0 0 84 0
nine 0 0 0 1 0 0 0 0 1 86
Accuracy 0.9720000

K-Nearest Neighbor
KNN Algorithm
1. Each predictor in the training set represents a dimension in some space.
2. The value that an observation has for each predictor is that values coordinates in
this space.
3. The similarity between points are based on a distance metric (e.g. Euclidean
Distance).
4. The class of an observation is predicted by taking the k-closest data points to
that observation, and assigning the observation to that class which it has most in
common with.

KNN Model Fitting and Parameter Tuning
0.80
0.85
0.90
0.95
1.00
1 2 3 4 5
Neighbors
Accuracy
Component
10
20
30
40
KNN Accuracy vs. Number of Components
and
Number of Neighbors

KNN: Number of Components vs. Accuracy
0.972
0.92
0.94
0.96
10 20 30 40
KNN Classification Accuracy
vs

KNN: Optimal Model Fitting
knn_Ctrl - trainControl(method = repeatedcv, repeats = 3,
classProbs = TRUE,
knn_grid - expand.grid(k=c(1,2,3,4,5))
knn_Fit_opt - train(label~., data = training[,1:(knn_opt+1)],
method = knn,
metric = Accuracy,
maximize = TRUE,
tuneGrid = knn_grid,
trControl = knn_Ctrl)
accuracy_measure_knn_opt - confusionMatrix(validation$label,
predict(knn_Fit_opt,
validation[,2:(knn_opt+1)]))

KNN Optimal Model Summary Statistics
zero 868 0 0 0 0 0 2 0 0 0
one 0 945 1 0 0 0 0 2 2 1
two 1 0 879 0 0 0 1 8 2 0
three 0 0 6 949 0 7 0 4 4 1
four 0 3 0 0 835 0 1 1 0 13
ﬁve 2 1 0 4 0 781 7 0 0 1
six 1 0 0 0 1 1 871 0 0 0
seven 0 9 5 1 1 0 0 909 0 6
eight 0 3 1 2 4 6 2 1 822 1
nine 0 0 2 7 4 1 1 4 1 822
Accuracy 0.9841288

KNN Optimal Model Confusion Matrix Image
868 0 0 0 0 0 2 0 0 0
0 945 1 0 0 0 0 2 2 1
1 0 879 0 0 0 1 8 2 0
0 0 6 949 0 7 0 4 4 1
0 3 0 0 835 0 1 1 0 13
2 1 0 4 0 781 7 0 0 1
1 0 0 0 1 1 871 0 0 0
0 9 5 1 1 0 0 909 0 6
0 3 1 2 4 6 2 1 822 1
0 0 2 7 4 1 1 4 1 822
9.8% 0.0%
10.7% 0.0% 0.0% 0.0% 0.0%
0.0% 10.0% 0.0% 0.1% 0.0%
0.1% 10.8% 0.1% 0.0% 0.0% 0.0%
0.0% 9.5% 0.0% 0.0% 0.1%
0.0% 0.0% 0.0% 8.9% 0.1% 0.0%
0.0% 0.0% 0.0% 9.9%
0.1% 0.1% 0.0% 0.0% 10.3% 0.1%
0.0% 0.0% 0.0% 0.0% 0.1% 0.0% 0.0% 9.3% 0.0%
0.0% 0.1% 0.0% 0.0% 0.0% 0.0% 0.0% 9.3%nine
eight
seven
six
five
four
three
two
one
zero
Predicted
Actual
0
20
40
60
80
Count
KNN Optimal Model

KNN Optimal Bar Plot
0
250
500
750
1000
Labels
Count
Labels
actual
predicted
KNN Optimal Model

KNN Optimal Model Predictions for Test Set
2
1:28
1:28
0
1:28
1:28
9
1:28
1:28
0
1:28
1:28
3
1:28
1:28
7
1:28
1:28
0
1:281:28
3
1:28
1:28
0
1:28
1:28
3
1:28
1:28
5
1:28
1:28
7
1:28
1:28
4
1:28
1:28
0
1:28
1:28
4
1:28
1:28
3
1:28
1:28
3
1:28
1:28
1
1:28
1:28
9
1:28
1:28
0
1:28
1:28
9
1:28
1:28
1
1:28
1:28
1
1:28
1:28
5
1:28
1:28
7
1:28
1:28

KNN Summary Statistics on Manually Labeled Test Set
zero 101 0 0 0 0 0 0 0 0 1
one 0 115 0 0 0 0 0 0 0 0
two 0 0 81 0 1 0 0 1 0 0
three 0 0 2 116 0 1 0 1 2 0
four 0 0 0 0 105 0 0 0 0 3
ﬁve 0 0 0 0 0 95 0 0 0 0
six 0 1 0 0 0 2 97 0 0 0
seven 0 1 0 0 1 0 0 99 0 0
eight 0 1 0 0 0 1 0 0 82 2
nine 0 0 0 0 0 1 0 0 1 86
Accuracy 0.9770000

Random Forest
”A random forest is a classiﬁer consisting of a collection of tree-structured
classiﬁers {h(x, θk ), k = 1} where the {θk } are independent identically
distributed random vectors and each tree casts a unit vote for the most
popular class input x.” [Breiman, 2001]

RF Model Fitting: Recursive Feature Selection
subsets - c(1:40,seq(45,100,5)) # vector of variable subsets
# for recursive feature selection
ptm - proc.time() # starting timer for code execution
ctrl - rfeControl(functions = rfFuncs, method = repeatedcv,
number = 3, verbose = FALSE,
returnResamp = all, allowParallel = FALSE)
rfProfile - rfe(x = training[,-1],
y = as.factor(as.character(training$label)),
sizes = subsets, rfeControl = ctrl)
rf_opt - rfProfile$optVariables
proc.time() - ptm
## 7426.48 64.87 7491.48

Random Forest: Accuracy vs. Number of Variables
0.4
0.6
0.8
1.0
0 25 50 75 100
Variables
Accuracy(RepeatedCross−Validation)
Random Forest Recursive Feature Selection

Random Forest Optimal Model Summary Statistics
eight ﬁve four nine one seven six three two zero
eight 842 0 0 0 0 0 0 0 0 0
ﬁve 0 796 0 0 0 0 0 0 0 0
four 0 0 853 0 0 0 0 0 0 0
nine 0 0 0 842 0 0 0 0 0 0
one 0 0 0 0 951 0 0 0 0 0
seven 0 0 0 0 0 931 0 0 0 0
six 0 0 0 0 0 0 874 0 0 0
three 0 0 0 0 0 0 0 971 0 0
two 0 0 0 0 0 0 0 0 891 0
zero 0 0 0 0 0 0 0 0 0 870
Accuracy 1.0000000

Random Forest Optimal: Confusion Matrix Image
842 0 0 0 0 0 0 0 0 0
0 796 0 0 0 0 0 0 0 0
0 0 853 0 0 0 0 0 0 0
0 0 0 842 0 0 0 0 0 0
0 0 0 0 951 0 0 0 0 0
0 0 0 0 0 931 0 0 0 0
0 0 0 0 0 0 874 0 0 0
0 0 0 0 0 0 0 971 0 0
0 0 0 0 0 0 0 0 891 0
0 0 0 0 0 0 0 0 0 870
9.5%
9.0%
9.7%
9.5%
10.8%
10.6%
9.9%
11.0%
10.1%
9.9%zero
two
three
six
seven
one
nine
four
five
eight
eight five four nine one seven six three two zero
Predicted
Actual
0
20
40
60
80
Count
Random Forest Optimal Model

Random Forest Bar Plot
0
250
500
750
1000
eight five four nine one seven six three two zero
Labels
Count
Labels
actual
predicted
Random Forest
Actual vs. Predicted Class Labels

RF Summary Statistics on Manually Labeled Test Set
eight ﬁve four nine one seven six three two zero
eight 82 1 0 1 0 1 1 2 2 0
ﬁve 1 93 0 1 1 0 1 2 0 0
four 1 0 104 0 0 0 0 0 1 0
nine 0 0 1 84 0 0 0 1 0 0
one 2 0 0 0 114 0 0 0 0 0
seven 0 0 2 1 0 100 0 2 0 1
six 0 0 1 0 0 0 97 0 1 0
three 0 1 0 1 0 0 0 114 0 0
two 0 0 0 0 0 0 1 1 77 0
zero 0 0 0 0 0 0 0 0 2 101
Accuracy 0.9660000

Conditional Inference Tree
General Recursive Partitioning Tree
1. Perform an exhaustive search over all possible splits
2. Maximize information measure of node impurity
3. Select covariate split that maximized this measure
CTREE
1. In each node the partial hypotheses Hj
o : D(Y |Xj ) = D(Y ) is tested against the
global null hypothesis of H0 =
m
j=1 Hj
0.
2. If the global hypothesis can be rejected then the association between Y and each
of the covariates Xj , j = 1..., m is measured by P-value.
3. If we are unable to reject H0 at the speciﬁed α then recursion is stopped.
[Hothorn, 2006]

CTREE Model Fitting and Tuning
0.83
0.805
0.810
0.815
0.820
0.825
0.830
10 15 20 25 30
CTREE Classification Accuracy
vs

CTREE: Optimal Model Fitting
ctree_Ctrl - trainControl(method = repeatedcv, repeats = 3,
classProbs = TRUE,
ctree_Fit_opt - train(label~., data = training[,1:(ctree_opt+1)],
method = ctree,
metric = Accuracy,
tuneLength = 5,
maximize = TRUE,
trControl = ctree_Ctrl)
accuracy_measure_ctree_opt - confusionMatrix(validation$label,
predict(ctree_Fit_opt,
validation[,2:(ctree_opt+1)]))

CTREE Optimal Model Summary Statistics
zero 825 0 7 8 1 6 13 2 6 2
one 0 924 2 3 1 7 0 7 5 2
two 10 11 797 14 5 7 11 16 16 4
three 15 3 20 847 6 23 8 8 33 8
four 5 8 7 7 749 6 10 14 10 37
ﬁve 15 6 4 37 9 671 14 7 26 7
six 23 4 13 9 5 16 799 1 2 2
seven 2 6 11 4 12 3 1 851 6 35
eight 12 10 15 31 5 25 5 10 720 9
nine 3 5 8 13 54 11 3 26 11 708
Accuracy 0.8945698

CTREE Optimal Model Confusion Matrix Image
825 0 7 8 1 6 13 2 6 2
0 924 2 3 1 7 0 7 5 2
10 11 797 14 5 7 11 16 16 4
15 3 20 847 6 23 8 8 33 8
5 8 7 7 749 6 10 14 10 37
15 6 4 37 9 671 14 7 26 7
23 4 13 9 5 16 799 1 2 2
2 6 11 4 12 3 1 851 6 35
12 10 15 31 5 25 5 10 720 9
3 5 8 13 54 11 3 26 11 708
9.4% 0.1% 0.1% 0.0% 0.1% 0.1% 0.0% 0.1% 0.0%
10.5% 0.0% 0.0% 0.0% 0.1% 0.1% 0.1% 0.0%
0.1% 0.1% 9.0% 0.2% 0.1% 0.1% 0.1% 0.2% 0.2% 0.0%
0.2% 0.0% 0.2% 9.6% 0.1% 0.3% 0.1% 0.1% 0.4% 0.1%
0.1% 0.1% 0.1% 0.1% 8.5% 0.1% 0.1% 0.2% 0.1% 0.4%
0.2% 0.1% 0.0% 0.4% 0.1% 7.6% 0.2% 0.1% 0.3% 0.1%
0.3% 0.0% 0.1% 0.1% 0.1% 0.2% 9.1% 0.0% 0.0% 0.0%
0.0% 0.1% 0.1% 0.0% 0.1% 0.0% 0.0% 9.6% 0.1% 0.4%
0.1% 0.1% 0.2% 0.4% 0.1% 0.3% 0.1% 0.1% 8.2% 0.1%
0.0% 0.1% 0.1% 0.1% 0.6% 0.1% 0.0% 0.3% 0.1% 8.0%nine
eight
seven
six
five
four
three
two
one
zero
Predicted
Actual
0
20
40
60
80
Count
CTREE Optimal Model

CTREE Optimal Bar Plot
0
250
500
750
1000
Labels
Count
Labels
actual
predicted
CTREE Optimal Model

CTREE Optimal Model Confusion Matrix on Manually Labeled Test Set
zero 93 0 1 3 0 1 2 0 1 1
one 0 110 0 0 3 0 1 0 1 0
two 1 0 74 2 1 0 2 1 2 0
three 2 0 4 96 0 7 0 3 9 1
four 0 0 2 1 89 1 1 2 0 12
ﬁve 1 0 0 2 2 77 3 1 6 3
six 0 1 3 0 0 2 90 0 4 0
seven 0 0 2 1 4 2 0 90 0 2
eight 0 2 4 1 1 3 1 1 70 3
nine 0 0 1 1 11 1 0 1 3 70
Accuracy 0.8590000

Multinomial Logistic Regression
Class Probabilities
Pr(Y = k|X = x) =
eβ0k +β1k X1+...+βpk Xp
K
l=1 eβ0l +β1l X1+...+βpl Xp
Logistic Regression Model generalized for problems containing more than two classes.
[James, 2013]

MLR Model Fitting and Tuning
0.80
0.82
0.84
0.86
0.88
20 40 60
Multinomial Logistic Model:
Number of Components vs. Accuracy

MLR Optimal Model Summary Statistics
zero 802 0 5 8 0 43 6 0 2 4
one 0 900 16 6 0 14 4 2 9 0
two 25 19 674 28 34 7 54 15 31 4
three 11 12 27 730 5 90 8 12 60 16
four 5 8 3 4 672 9 22 9 7 114
ﬁve 27 19 9 68 14 585 14 15 31 14
six 16 20 29 7 12 31 748 3 6 2
seven 8 17 22 8 10 14 0 775 12 65
eight 6 31 39 68 6 48 6 5 608 25
nine 14 8 7 15 142 16 1 71 17 551
Accuracy 0.7986623

MLR Optimal Model Confusion Matrix Image
802 0 5 8 0 43 6 0 2 4
0 900 16 6 0 14 4 2 9 0
25 19 674 28 34 7 54 15 31 4
11 12 27 730 5 90 8 12 60 16
5 8 3 4 672 9 22 9 7 114
27 19 9 68 14 585 14 15 31 14
16 20 29 7 12 31 748 3 6 2
8 17 22 8 10 14 0 775 12 65
6 31 39 68 6 48 6 5 608 25
14 8 7 15 142 16 1 71 17 551
9.1% 0.1% 0.1% 0.5% 0.1% 0.0% 0.0%
10.2% 0.2% 0.1% 0.2% 0.0% 0.0% 0.1%
0.3% 0.2% 7.6% 0.3% 0.4% 0.1% 0.6% 0.2% 0.4% 0.0%
0.1% 0.1% 0.3% 8.3% 0.1% 1.0% 0.1% 0.1% 0.7% 0.2%
0.1% 0.1% 0.0% 0.0% 7.6% 0.1% 0.2% 0.1% 0.1% 1.3%
0.3% 0.2% 0.1% 0.8% 0.2% 6.6% 0.2% 0.2% 0.4% 0.2%
0.2% 0.2% 0.3% 0.1% 0.1% 0.4% 8.5% 0.0% 0.1% 0.0%
0.1% 0.2% 0.2% 0.1% 0.1% 0.2% 8.8% 0.1% 0.7%
0.1% 0.4% 0.4% 0.8% 0.1% 0.5% 0.1% 0.1% 6.9% 0.3%
0.2% 0.1% 0.1% 0.2% 1.6% 0.2% 0.0% 0.8% 0.2% 6.2%nine
eight
seven
six
five
four
three
two
one
zero
Predicted
Actual
0
20
40
60
80
Count
Multinomial Logistic Optimal Model

MLR Optimal Bar Plot
0
250
500
750
1000
Labels
Count
Labels
actual
predicted
Multinomial Logistic Optimal Model

MLR Optimal Model Confusion Matrix on Manually Labeled Test Set
zero 93 0 0 0 1 4 3 1 0 0
one 0 109 2 0 0 1 1 1 1 0
two 1 1 74 3 2 0 1 1 0 0
three 1 0 0 108 0 4 1 3 0 5
four 0 0 0 0 104 0 0 1 0 3
ﬁve 2 1 0 3 4 81 1 0 2 1
six 0 0 1 0 0 1 97 1 0 0
seven 0 0 2 0 3 0 0 88 1 7
eight 0 1 0 2 2 11 0 0 62 8
nine 0 0 0 1 8 1 0 1 0 77
Accuracy 0.8930000

Model Comparison: Summary Statistics
Table 29:Model Comparison: Summary Statistics
Min. 1st Qu. Median Mean 3rd Qu. Max. NA’s
KNN 0.9653 0.9685 0.9711 0.9713 0.9737 0.9779 0
LDA 0.8606 0.8681 0.8722 0.8706 0.8733 0.8851 0
QDA 0.9524 0.9575 0.9585 0.9590 0.9613 0.9667 0
RF 0.9422 0.9486 0.9521 0.9514 0.9548 0.9572 0
Log 0.8690 0.8800 0.8846 0.8857 0.8911 0.9062 0
Ctree 0.8158 0.8229 0.8254 0.8270 0.8314 0.8387 0

Testing for Normality: LDA
0
50
100
150
200
0.86 0.87 0.88
accuracy
density
0.860
0.865
0.870
0.875
0.880
0.885
−2 −1 0 1 2
theoretical
sample
Table 30:Shapiro-Wilk normality Test
Test-statistic (W) P-value
0.9224415 0.0310465

Testing for Normality: QDA
0
50
100
150
200
0.955 0.960 0.965
accuracy
density
0.952
0.956
0.960
0.964
−2 −1 0 1 2
theoretical
sample
0.9769401 0.7396847

Testing for Normality: KNN
0
50
100
150
200
250
0.964 0.968 0.972 0.976
accuracy
density
0.965
0.970
0.975
−2 −1 0 1 2
theoretical
sample
0.9774543 0.7545886

Testing for Normality: RF
0
50
100
150
200
0.945 0.950 0.955
accuracy
density
0.945
0.950
0.955
−2 −1 0 1 2
theoretical
sample
0.9504195 0.1734898

Testing for Normality: CTREE
0
50
100
0.815 0.820 0.825 0.830 0.835 0.840
accuracy
density
0.815
0.820
0.825
0.830
0.835
−2 −1 0 1 2
theoretical
sample
0.9686452 0.5028018

Testing for Normality: Log
0
30
60
90
0.87 0.88 0.89 0.90
accuracy
density
0.87
0.88
0.89
0.90
−2 −1 0 1 2
theoretical
sample
0.9850217 0.9375558

Model Comprison: Statistical Inference
Table 36:Summary Statistics
nbr.val min max median mean var
KNN 30 0.96532 0.97788 0.97111 0.97133 1e-05
QDA 30 0.95236 0.96669 0.95852 0.95901 1e-05
Table 37:Wilcoxon Signed Rank Test
Test-statistic (V) P-value
Two-sided 465 0
Greater 465 0
Table 38:T-test
Test-statistic (t) P-value
Two-sided 15.75693 0
Greater 15.75693 0

Model Comparison: Box Plot
Ctree
LDA
Log
RF
QDA
KNN
0.80 0.85 0.90 0.95
Accuracy
0.80 0.85 0.90 0.95
Kappa

Class Accuracy by Model
Table 39:Optimal Model Class Accuracy Measures
0 1 2 3 4 5 6 7 8 9
KNN 0.998 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99
LDA 0.970 0.93 0.96 0.93 0.94 0.91 0.96 0.94 0.90 0.89
QDA 0.994 0.99 0.97 0.98 0.99 0.98 1.00 0.99 0.95 0.98
RF 0.983 0.98 0.94 0.92 0.94 0.95 0.97 0.95 0.93 0.92
Ctree 0.950 0.97 0.94 0.93 0.94 0.93 0.96 0.95 0.92 0.93
Log 0.934 0.93 0.89 0.87 0.86 0.83 0.93 0.92 0.87 0.83

Ensemble Predictions:
Goal: Develop a method though which the class accuracy of each ‘optimized’
model can be employed in making class predictions.
Condition 1: Majority vote wins.
Condition 2: If each model predicts a diﬀerent class label, go with the prediction
from the model that has the maximum accuracy for that class prediction.
Condition 3: If there is a two-way tie or split-vote then go with that class label
that has the maximum mean accuracy among all models for that class.
Condition 4: If there is a three-way tie then go with that class label that has the
maximum mean accuracy among all models for that class.

Ensemble Summary Statistics
0 1 2 3 4 5 6 7 8 9
0 101 0 1 0 0 0 0 0 0 0
1 1 114 1 4 2 3 1 3 4 1
2 0 0 78 1 0 0 0 0 0 0
3 0 0 0 112 0 1 0 0 0 1
4 0 0 1 0 105 0 0 0 1 0
5 0 1 0 0 0 91 1 0 1 1
6 0 0 1 0 0 0 98 0 0 0
7 0 0 1 1 1 0 0 97 0 1
8 0 0 0 4 0 0 0 1 79 0
9 0 0 0 0 0 0 0 0 1 84
Accuracy 0.9590000

Conclusion
1. KNN was the best performing model with a classification accuracy of 0.978.
2. Examine effectiveness of Support Vector Machine classifiers, as well as Neural
Network models.
3. Also, may wish to examine the effectiveness of employing a hierarchical clustering
technique for dimension reduction and compare results with principle component
analysis.
4. Continue to explore ensemble prediction method, with a variety of logic rules.

Parallel Processing
50
100
25 50 75 100
TimeElasped(seconds)
group
LDA
LDA 2 cores
QDA
QDA 2 cores
LDA and QDA
Parallel vs Non−parallel Processing

Parallel Processing Continued

References
Breiman, L. (2001). ”Random forests.” Machine learning 45(1): 5-32.
Hothorn, T., et al. (2006). ”Unbiased recursive partitioning: A conditional inference
framework.” Journal of Computational and Graphical statistics 15(3): 651-674.
James, G., et al. (2013). An introduction to statistical learning, Springer.
Kuhn, M. and K. Johnson (2013). Applied predictive Modeling, Springer.

Kaggle digits analysis_final_fc

Recommended

Recommended

More Related Content

Viewers also liked

Viewers also liked (20)

Similar to Kaggle digits analysis_final_fc

Similar to Kaggle digits analysis_final_fc (20)

Recently uploaded

Recently uploaded (20)

Kaggle digits analysis_final_fc