Summary of a 4-fold cross-validation study performed on classifiers used in OCR. Presented for a class in pattern recognition. OCstar Inc. is a made-up company name used for the purposes of the presentation, per the requirements of the project.
2. Introduction
Purpose
● Determine Best Classifier
● Predict classifier performance on unseen data
4-fold cross validation performed on:
● K Nearest Neighbors
● Bayesian
● Artificial Neural Network
3. N-Fold Cross Validation
● Technique for comparing classification algorithms
● Insight on how classifiers perform on unseen data
Process
● Training data partitioned into N groups
● N−1 groups used to train classifier
● 1 group used to test classifier
● Repeated for all groups
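The process above can be sketched in Python (a minimal sketch; the `train` and `evaluate` callables are hypothetical stand-ins for fitting and scoring any of the classifiers in this study):

```python
# Minimal sketch of N-fold cross validation: partition the data into
# N groups, train on N-1 of them, test on the held-out group, repeat.
def n_fold_cross_validation(samples, labels, n, train, evaluate):
    """Return the accuracy measured on each of the n folds."""
    fold_size = len(samples) // n
    accuracies = []
    for i in range(n):
        lo, hi = i * fold_size, (i + 1) * fold_size
        test_x, test_y = samples[lo:hi], labels[lo:hi]
        train_x = samples[:lo] + samples[hi:]   # the other N-1 groups
        train_y = labels[:lo] + labels[hi:]
        model = train(train_x, train_y)
        accuracies.append(evaluate(model, test_x, test_y))
    return accuracies
```

With N = 4, as in this study, each fold holds out one quarter of the training data for testing.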
4. K = 5 Nearest Neighbors
Algorithm
● The 5 nearest training points to the input are found
● Majority vote of the nearest points classifies the input
● If a tie exists, the number of nearest points is reduced
Distance metric is Euclidean
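The algorithm above can be sketched as follows (a minimal sketch; function and variable names are my own):

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, x, k=5):
    """5-NN rule: Euclidean distance, majority vote, and on a tie
    the number of nearest points considered is reduced by one."""
    # Sort training points by Euclidean distance to the input.
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], x))
    while k >= 1:
        votes = Counter(train_labels[i] for i in order[:k])
        top = votes.most_common()
        # Unique winner -> classify; tie -> drop the farthest neighbor.
        if len(top) == 1 or top[0][1] > top[1][1]:
            return top[0][0]
        k -= 1
```

With k reduced on ties, the loop always terminates: at k = 1 a single nearest neighbor decides the class.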
5. Bayesian
● Founded on probability mathematics
● Uses statistical data from the training set
● Mean μᵢ of each class
● Average covariance of all classes
● Uses discriminants
gᵢ(x) = −0.5 (x − μᵢ)ᵀ Σ⁻¹ (x − μᵢ) + ln P(ωᵢ)
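The discriminant above can be sketched in Python (a minimal sketch assuming per-class means, one pooled covariance Σ shared by all classes, and known priors; variable names are mine):

```python
import numpy as np

def bayes_classify(x, means, pooled_cov, priors):
    """Pick the class i that maximizes the discriminant g_i(x)."""
    inv = np.linalg.inv(pooled_cov)
    scores = []
    for mu, p in zip(means, priors):
        d = x - mu
        # g_i(x) = -0.5 (x - mu_i)^T Sigma^-1 (x - mu_i) + ln P(omega_i)
        scores.append(-0.5 * d @ inv @ d + np.log(p))
    return int(np.argmax(scores))
```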
6. Artificial Neural Network
● Interconnected network of nonlinear nodes
● Weight matrices govern performance
● Weights trained by gradient descent
[Diagram: feature input → network → output class]
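The gradient-descent weight training mentioned above can be sketched for a single sigmoid layer (a toy sketch, not the network used in the study; the squared-error loss, sizes, and learning rate are my own assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_layer(X, Y, epochs=2000, lr=0.5, seed=0):
    """Train one weight matrix W by batch gradient descent on
    squared error between sigmoid(X @ W) and the targets Y."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(X.shape[1], Y.shape[1]))
    for _ in range(epochs):
        out = sigmoid(X @ W)                         # forward pass
        grad = X.T @ ((out - Y) * out * (1 - out))   # dLoss/dW
        W -= lr * grad                               # descent step
    return W
```

A multilayer network repeats this update for each weight matrix, propagating the error gradient backward through the layers.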
7. Results: 5 Nearest Neighbors
● Consistent performance of 97% between folds
● Most commonly confused class varies between folds
● Worst class 80% correct in worst case
● Does not provide insight into error classes for the application
8. Results: Bayesian Classifier
● Performance varied slightly between folds
● Accuracy varied between 97% and 100%, with an average of 98.75%
● All observed errors on class 'x'
● Class 'x' 70% correct in worst case, and 87.5% on average
● 'x' is likely to be a problem class in application
9. Results: Artificial Neural Net
● Inconsistent results, varying between 77% and 96% correct
● Possible that the worst case did not converge during training
● Average performance without worst case: 95.33%
● Problem classes varied between folds
10. Study Conclusion
● Bayesian classifier recommended
● Best average precision between folds
● Errors confined to class 'x'
● Class 'x' correct 87.5% on average, 70% in worst case
● Provides insight that a postprocessing technique could take advantage of
11. Results on Final Data Set
actual \ predicted   a    c    e    m    n    o    r    s    x    z  errors
a                  120    0    0    0    0    0    0    0    0    0    0
c                    0  120    0    0    0    0    0    0    0    0    0
e                    0    2  118    0    0    0    0    0    0    0    2
m                    0    0    0  120    0    0    0    0    0    0    0
n                    0    0    0    0  120    0    0    0    0    0    0
o                    0    0    1    0    0  119    0    0    0    0    1
r                    0    0    0    0    0    0  120    0    0    0    0
s                    0    0    0    0    0    0    0  120    0    0    0
x                    0    0    0    0    1    0    0   27   92    0   28
z                    0    0    0    0    0    0    2    0    0  118    2
errors               0    2    1    0    1    0    2   27    0    0   33
97.25% correct, 2.75% error
● Class 'x' 76.7% correct
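The headline numbers above can be recomputed directly from the confusion matrix (rows are the actual class, columns the predicted class; a minimal sketch, with the function name my own):

```python
import numpy as np

def accuracy_from_confusion(cm):
    """Overall accuracy and per-class accuracy (row-wise recall)
    from a square confusion matrix with rows = actual class."""
    total = cm.sum()
    overall = np.trace(cm) / total          # correct / all samples
    per_class = np.diag(cm) / cm.sum(axis=1)
    return overall, per_class
```

Applied to the 10×10 matrix above, this gives 97.25% overall and 92/120 ≈ 76.7% for class 'x'.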