Classification by Machine Learning Approaches - Exercise Solution
Michael J. Kerner – [email_address]
Center for Biological Sequence Analysis, Technical University of Denmark
Exercise Solution: donors_trainset.arff - All features: trees.J48

=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances      4972    94.5967 %
Incorrectly Classified Instances     284     5.4033 %
Kappa statistic                     0.8381

=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.87     0.034    0.875      0.87    0.872      true
0.966    0.13     0.965      0.966   0.966      false

=== Confusion Matrix ===
   a    b   <-- classified as
 971  145 |  a = true
 139 4001 |  b = false
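All of the summary figures above follow directly from the confusion matrix; a quick sanity check of the J48 numbers in plain Python (standard metric definitions, not Weka code):

```python
# Confusion matrix for J48 on donors_trainset.arff (from the slide above):
tp, fn = 971, 145    # actual true:  classified a (true), b (false)
fp, tn = 139, 4001   # actual false: classified a (true), b (false)

total = tp + fn + fp + tn                      # 5256 instances
accuracy = (tp + tn) / total                   # correctly classified fraction

# Per-class metrics for the 'true' class
precision = tp / (tp + fp)                     # 971 / 1110
recall    = tp / (tp + fn)                     # 971 / 1116 (= TP rate)
f_measure = 2 * precision * recall / (precision + recall)

# Cohen's kappa: agreement beyond chance
p_true  = ((tp + fn) / total) * ((tp + fp) / total)  # chance agreement on 'true'
p_false = ((fp + tn) / total) * ((fn + tn) / total)  # chance agreement on 'false'
p_e = p_true + p_false
kappa = (accuracy - p_e) / (1 - p_e)

print(f"accuracy={accuracy:.4%} precision={precision:.3f} "
      f"recall={recall:.3f} F={f_measure:.3f} kappa={kappa:.4f}")
```

Rounding to the precision shown on the slide reproduces 94.5967 %, 0.875, 0.87, 0.872 and a kappa of 0.8381.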
Exercise Solution: donors_trainset.arff - All features: bayes.NaiveBayes

=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances      4910    93.417  %
Incorrectly Classified Instances     346     6.583  %
Kappa statistic                     0.8056

=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.862    0.046    0.834      0.862   0.848      true
0.954    0.138    0.962      0.954   0.958      false

=== Confusion Matrix ===
   a    b   <-- classified as
 962  154 |  a = true
 192 3948 |  b = false
Exercise Solution: donors_trainset.arff - All features: functions.SMO

=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances      4986    94.863  %
Incorrectly Classified Instances     270     5.137  %
Kappa statistic                     0.8455

=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.871    0.03     0.885      0.871   0.878      true
0.97     0.129    0.965      0.97    0.967      false

=== Confusion Matrix ===
   a    b   <-- classified as
 972  144 |  a = true
 126 4014 |  b = false
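Every run above uses Weka's stratified 10-fold cross-validation. The same protocol can be sketched outside Weka with scikit-learn on toy data (illustrative only; DecisionTreeClassifier is just a rough stand-in for J48, and the data shapes are assumptions):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier  # rough stand-in for trees.J48

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 52))   # toy binary features (like the 52 indicators)
y = rng.integers(0, 2, size=500)         # toy true/false labels

# Stratified folds preserve the class ratio in every fold, which matters here
# because the donor dataset is imbalanced (~79% of windows are non-sites).
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=skf)
print(f"mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

On random labels the mean accuracy hovers around 0.5; the point is the fold protocol, not the score.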
Exercise Solution: two feature encodings of the same data

donors_trainset.arff - binary feature encoding (one 0/1 indicator per position and nucleotide):

@RELATION donors.train
@ATTRIBUTE -7_A {0,1}
@ATTRIBUTE -7_T {0,1}
@ATTRIBUTE -7_C {0,1}
[...]
@ATTRIBUTE 6_A {0,1}
@ATTRIBUTE 6_T {0,1}
@ATTRIBUTE 6_C {0,1}
@ATTRIBUTE 6_G {0,1}
@ATTRIBUTE class {true,false}
@DATA
0,0,1,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,1,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,1,0,0,0,0,1,0,0,0,1,0,0,true
0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,1,0,true
[...]

donors_trainset_diffencod.arff - fewer features, four (nominal) values per feature:

@RELATION donors.train
@ATTRIBUTE -7 {A,C,G,T}
@ATTRIBUTE -6 {A,C,G,T}
@ATTRIBUTE -5 {A,C,G,T}
@ATTRIBUTE -4 {A,C,G,T}
[...]
@ATTRIBUTE +3 {A,C,G,T}
@ATTRIBUTE +4 {A,C,G,T}
@ATTRIBUTE +5 {A,C,G,T}
@ATTRIBUTE +6 {A,C,G,T}
@ATTRIBUTE splicesite {true,false}
@DATA
C,T,C,C,G,A,A,A,G,G,A,T,T,true
T,C,A,G,A,A,G,G,A,G,G,G,C,true
T,T,G,G,A,A,G,T,C,G,C,A,G,true
[...]
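The two files encode the same windows; only the representation differs. A minimal sketch of expanding one 13-nucleotide window into the 52 binary indicators (attribute order A, T, C, G per position, as in the binary header above; the helper name is ours):

```python
# Indicator order per position matches the binary ARFF header: _A, _T, _C, _G.
ALPHABET = ["A", "T", "C", "G"]

def one_hot(window: str) -> list[int]:
    """Expand a nucleotide window into the binary feature encoding."""
    bits = []
    for base in window:
        bits.extend(int(base == b) for b in ALPHABET)
    return bits

# First @DATA row of donors_trainset_diffencod.arff, without the class label:
window = "CTCCGAAAGGATT"
bits = one_hot(window)
print(bits)
# 13 positions x 4 indicators = 52 binary features
assert len(bits) == 52
```

Applied to the first diffencod row, this reproduces exactly the first row of the binary-encoded file.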
Exercise Solution: donors_trainset_diffencod.arff - All features: trees.J48

=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances      4948    94.14   %
Incorrectly Classified Instances     308     5.86   %
Kappa statistic                     0.8248

=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.862    0.037    0.862      0.862   0.862      true
0.963    0.138    0.963      0.963   0.963      false

=== Confusion Matrix ===
   a    b   <-- classified as
 962  154 |  a = true
 154 3986 |  b = false
Exercise Solution: donors_trainset_diffencod.arff - All features: bayes.NaiveBayes

=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances      4922    93.6454 %
Incorrectly Classified Instances     334     6.3546 %
Kappa statistic                     0.8078

=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.834    0.036    0.862      0.834   0.848      true
0.964    0.166    0.956      0.964   0.96       false

=== Confusion Matrix ===
   a    b   <-- classified as
 931  185 |  a = true
 149 3991 |  b = false
Exercise Solution: donors_trainset_diffencod.arff - All features: functions.SMO

=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances      4986    94.863  %
Incorrectly Classified Instances     270     5.137  %
Kappa statistic                     0.8456

=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.872    0.031    0.885      0.872   0.878      true
0.969    0.128    0.966      0.969   0.967      false

=== Confusion Matrix ===
   a    b   <-- classified as
 973  143 |  a = true
 127 4013 |  b = false
Exercise Solution: Feature Selection

CfsSubsetEval, BestFirst:
  Features: -2A, -1G, 1A, 2A, 3G
  Correlation coefficients:
    J48: 0.7981
    NaiveBayes: 0.7762
    SMO: 0.7388
    MultilayerPerceptron: 0.8053

ClassifierSubsetEval (w/ NaiveBayes), BestFirst:
  Features: -7A, -7C, -6G, -4A, -1G, 1A, 1T, 1C, 2A, 3G, 4T, 5A
  Correlation coefficients:
    J48: 0.7935
    NaiveBayes: 0.8033
    SMO: 0.7597
    MultilayerPerceptron: 0.7765
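Weka's ClassifierSubsetEval with BestFirst search has no exact scikit-learn counterpart; a rough wrapper-style analogue is greedy forward selection scored by a cross-validated NaiveBayes. A sketch on toy data (illustrative only; sizes and estimator choice are assumptions):

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.naive_bayes import BernoulliNB  # NaiveBayes for binary features

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(300, 52))   # toy stand-in for the 52 binary features
y = rng.integers(0, 2, size=300)

# Greedy forward search, scored by cross-validated NaiveBayes accuracy:
# a wrapper analogue of ClassifierSubsetEval (w/ NaiveBayes) + BestFirst.
sfs = SequentialFeatureSelector(BernoulliNB(), n_features_to_select=5,
                                direction="forward", cv=5)
sfs.fit(X, y)
selected = np.flatnonzero(sfs.get_support())
print("selected feature indices:", selected)
```

Unlike this greedy sketch, Weka's BestFirst also allows backtracking, and CfsSubsetEval is a filter (classifier-independent) method rather than a wrapper.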
Summary

Generally, there is no single 'best' method for all problems.
Feature representation can influence classification results.
Feature selection often improves classification performance, but not always.
Feature selection significantly speeds up classification, which also makes computationally very demanding classifiers feasible.
Always test multiple methods!
