  1. 1. Algorithm-Independent Machine Learning Shyh-Kang Jeng Department of Electrical Engineering/ Graduate Institute of Communication/ Graduate Institute of Networking and Multimedia, National Taiwan University
  2. 2. Some Fundamental Problems <ul><li>Which algorithm is “best”? </li></ul><ul><li>Are there any reasons to favor one algorithm over another? </li></ul><ul><li>Is “Occam’s razor” really so evident? </li></ul><ul><li>Do simpler or “smoother” classifiers generalize better? If so, why? </li></ul><ul><li>Are there fundamental “conservation” or “constraint” laws other than Bayes error rate? </li></ul>
  3. 3. Meaning of “Algorithm-Independent” <ul><li>Mathematical foundations that do not depend upon the particular classifier or learning algorithm used </li></ul><ul><ul><li>e.g., bias and variance concept </li></ul></ul><ul><li>Techniques that can be used in conjunction with different learning algorithms, or provide guidance in their use </li></ul><ul><ul><li>e.g., cross-validation and resampling techniques </li></ul></ul>
  4. 4. Roadmap <ul><li>No pattern classification method is inherently superior to any other </li></ul><ul><li>Ways to quantify and adjust the “match” between a learning algorithm and the problem it addresses </li></ul><ul><li>Estimation of accuracies and comparison of different classifiers with certain assumptions </li></ul><ul><li>Methods for integrating component classifiers </li></ul>
  5. 5. Generalization Performance by Off-Training Set Error <ul><li>Consider a two-category problem </li></ul><ul><li>Training set D </li></ul><ul><li>Training patterns x i </li></ul><ul><li>y i = 1 or -1 for i = 1, . . ., n is generated by unknown target function F ( x ) to be learned </li></ul><ul><li>F ( x ) often has a random component </li></ul><ul><ul><li>The same input could lead to different categories </li></ul></ul><ul><ul><li>Giving non-zero Bayes error </li></ul></ul>
  6. 6. Generalization Performance by Off-Training Set Error <ul><li>Let H be the (discrete) set of hypotheses, or sets of parameters to be learned </li></ul><ul><li>A particular h belongs to H </li></ul><ul><ul><li>quantized weights in neural network </li></ul></ul><ul><ul><li>Parameters q in a functional model </li></ul></ul><ul><ul><li>Sets of decisions in a tree </li></ul></ul><ul><li>P ( h ) : prior probability that the algorithm will produce hypothesis h after training </li></ul>
  7. 7. Generalization Performance by Off-Training Set Error <ul><li>P ( h | D ) : probability the algorithm will yield h when trained on data D </li></ul><ul><ul><li>Nearest-neighbor and decision tree: non-zero only for a single hypothesis </li></ul></ul><ul><ul><li>Neural network: can be a broad distribution </li></ul></ul><ul><li>E : error for zero-one or other loss function </li></ul>
  8. 8. Generalization Performance by Off-Training Set Error <ul><li>A natural measure </li></ul><ul><li>Expected off-training-set classification error for the k th candidate learning algorithm </li></ul>
  9. 9. No Free Lunch Theorem
  10. 10. No Free Lunch Theorem <ul><li>For any two algorithms </li></ul><ul><li>No matter how clever in choosing a “good” algorithm and a “bad” algorithm, if all target functions are equally likely, the “good” algorithm will not outperform the “bad” one </li></ul><ul><li>There is at least one target function for which random guessing is a better algorithm </li></ul>
  11. 11. No Free Lunch Theorem <ul><li>Even if we know D , averaged over all target functions no algorithm yields an off-training set error that is superior to any other </li></ul>
  12. 12. Example 1 No Free Lunch for Binary Data

      x     F    h1   h2
      000    1    1    1   } training
      001   -1   -1   -1   } set D
      010    1    1    1   }
      011   -1    1   -1
      100    1    1   -1
      101   -1    1   -1
      110    1    1   -1
      111    1    1   -1
  13. 13. No Free Lunch Theorem
  14. 14. Conservation in Generalization <ul><li>Can not achieve positive performance on some problems without getting an equal and opposite amount of negative performance on other problems </li></ul><ul><li>Can trade performance on problems we do not expect to encounter with those that we do expect to encounter </li></ul><ul><li>It is the assumptions about the learning domains that are relevant </li></ul>
  15. 15. Ugly Duckling Theorem <ul><li>In the absence of assumptions there is no privileged or “best” feature representation </li></ul><ul><li>Even the notion of similarity between patterns depends implicitly on assumptions that may or may not be correct </li></ul>
  16. 16. Venn Diagram Representation of Features as Predicates
  17. 17. Rank of a Predicate <ul><li>Number of the simplest or indivisible elements it contains </li></ul><ul><li>Example: rank r = 1 </li></ul><ul><ul><li>x 1 : f 1 AND NOT f 2 </li></ul></ul><ul><ul><li>x 2 : f 1 AND f 2 </li></ul></ul><ul><ul><li>x 3 : f 2 AND NOT f 1 </li></ul></ul><ul><ul><li>x 4 : NOT ( f 1 OR f 2 ) </li></ul></ul><ul><ul><li>C (4,1) = 4 predicates </li></ul></ul>
  18. 18. Examples of Rank of a Predicate <ul><li>Rank r = 2 </li></ul><ul><ul><li>x 1 OR x 2 : f 1 </li></ul></ul><ul><ul><li>x 1 OR x 3 : f 1 XOR f 2 </li></ul></ul><ul><ul><li>x 1 OR x 4 : NOT f 2 </li></ul></ul><ul><ul><li>x 2 OR x 3 : f 2 </li></ul></ul><ul><ul><li>x 2 OR x 4 : ( f 1 AND f 2 ) OR NOT ( f 1 OR f 2 ) </li></ul></ul><ul><ul><li>x 3 OR x 4 : NOT f 1 </li></ul></ul><ul><ul><li>C (4, 2) = 6 predicates </li></ul></ul>
  19. 19. Examples of Rank of a Predicate <ul><li>Rank r = 3 </li></ul><ul><ul><li>x 1 OR x 2 OR x 3 : f 1 OR f 2 </li></ul></ul><ul><ul><li>x 1 OR x 2 OR x 4 : f 1 OR NOT f 2 </li></ul></ul><ul><ul><li>x 1 OR x 3 OR x 4 : NOT ( f 1 AND f 2 ) </li></ul></ul><ul><ul><li>x 2 OR x 3 OR x 4 : f 2 OR NOT f 1 </li></ul></ul><ul><ul><li>C (4, 3) = 4 predicates </li></ul></ul>
  20. 20. Total Number of Predicates in Absence of Constraints <ul><li>Let d be the number of regions in the Venn diagrams (i.e., number of distinctive patterns, or number of possible values determined by combinations of the features) </li></ul>
  21. 21. A Measure of Similarity in Absence of Prior Information <ul><li>Number of features or attributes shared by two patterns </li></ul><ul><ul><li>Conceptual difficulties </li></ul></ul><ul><ul><ul><li>e.g., blind_in_right_eye and blind_in_left_eye , (1,0) more similar to (1,1) and (0,0) than to (0,1) </li></ul></ul></ul><ul><ul><li>There are always multiple ways to represent vectors of attributes </li></ul></ul><ul><ul><ul><li>e.g. blind_in_right_eye and same_in_both_eyes </li></ul></ul></ul><ul><ul><li>No principled reason to prefer one of these representations over another </li></ul></ul>
  22. 22. A Plausible Measure of Similarity in Absence of Prior Information <ul><li>Number of predicates the patterns share </li></ul><ul><li>Consider two distinct patterns </li></ul><ul><ul><li>no predicate of rank 1 is shared </li></ul></ul><ul><ul><li>1 predicate of rank 2 is shared </li></ul></ul><ul><ul><li>C ( d -2, 1) predicates of rank 3 are shared </li></ul></ul><ul><ul><li>C ( d -2, r -2) predicates of rank r are shared </li></ul></ul><ul><ul><li>Total number of predicates shared </li></ul></ul>
  23. 23. Ugly Duckling Theorem <ul><li>Given a finite set of predicates that enables us to distinguish any two patterns </li></ul><ul><li>The number of predicates shared by any two such patterns is constant and independent of the choice of those patterns </li></ul><ul><li>If pattern similarity is based on the total number of predicates shared, any two patterns are “equally similar” </li></ul>
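The constancy claimed by the theorem is easy to check by enumeration. In the sketch below (illustrative Python, not from the slides), a predicate of rank r is modeled as a subset of r of the d indivisible patterns; for every distinct pair of patterns the number of shared predicates comes out the same, namely 2^(d-2).

```python
from itertools import combinations

d = 4  # number of indivisible regions (i.e., distinct patterns)
patterns = range(d)

def shared_predicates(a, b):
    # A predicate of rank r is a disjunction of r indivisible patterns,
    # i.e. a size-r subset; it is "shared" by a and b if it contains both
    count = 0
    for r in range(1, d + 1):
        for subset in combinations(patterns, r):
            if a in subset and b in subset:
                count += 1
    return count

# The count is constant over all distinct pairs: 2^(d-2) = 4 for d = 4
print({shared_predicates(a, b) for a, b in combinations(patterns, 2)})  # {4}
```

The single value in the printed set is exactly the Σ C(d-2, r-2) total from the previous slide, independent of which two patterns were chosen.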
  24. 24. Ugly Duckling Theorem <ul><li>No problem-independent or privileged or “best” set of features or feature attributes </li></ul><ul><li>Also applies to continuous feature spaces </li></ul>
  25. 25. Minimum Description Length (MDL) <ul><li>Find some irreducible, smallest representation (“signal”) of all members of a category </li></ul><ul><li>All variation among the individual patterns is then “noise” </li></ul><ul><li>By simplifying recognizers appropriately, the signal can be retained while the noise is ignored </li></ul>
  26. 26. Algorithm Complexity (Kolmogorov Complexity) <ul><li>Kolmogorov complexity of binary string x </li></ul><ul><ul><li>On an abstract computer (Turing machine) U </li></ul></ul><ul><ul><li>As the shortest program (binary) string y </li></ul></ul><ul><ul><li>Without additional data, computes the string x and halts </li></ul></ul>
  27. 27. Algorithm Complexity Example <ul><li>Suppose x consists solely of n 1 s </li></ul><ul><li>Use some fixed number of bits k to specify a loop for printing a string of 1 s </li></ul><ul><li>Need log 2 n more bits to specify the iteration number n , the condition for halting </li></ul><ul><li>Thus K ( x ) = O (log 2 n ) </li></ul>
  28. 28. Algorithm Complexity Example <ul><li>Constant π = 11.001001000011111110110101010001… (in binary) </li></ul><ul><li>The shortest program is the one that can produce an arbitrarily large number of consecutive digits of π </li></ul><ul><li>Thus K ( x ) = O ( 1 ) </li></ul>
  29. 29. Algorithm Complexity Example <ul><li>Assume that x is a “truly” random binary string </li></ul><ul><li>Cannot be expressed as a shorter string </li></ul><ul><li>Thus K ( x ) = O (| x |) </li></ul>
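Kolmogorov complexity itself is uncomputable, but the flavor of these three examples can be illustrated with a general-purpose compressor: the compressed length is a computable upper bound on description length. The snippet below (an illustration, not from the slides) shows a highly regular string compressing to almost nothing, while a random string does not compress at all.

```python
import random
import zlib

def desc_len(s: bytes) -> int:
    # Compressed length: a computable upper-bound proxy for K(x)
    return len(zlib.compress(s, 9))

random.seed(0)
ones = b"1" * 100_000                                        # highly regular
rand = bytes(random.getrandbits(8) for _ in range(100_000))  # "truly" random

print(desc_len(ones))   # tiny: a short "program" (run-length) reproduces it
print(desc_len(rand))   # about 100000: no shorter description was found
```

This mirrors the slides: the all-ones string needs only O(log n) bits beyond a fixed loop, while a random string needs essentially all |x| of its bits.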
  30. 30. Minimum Description Length (MDL) Principle <ul><li>Minimize the sum of the model’s algorithmic complexity and the description of the training data with respect to that model </li></ul>
  31. 31. An Application of MDL Principle <ul><li>For decision-tree classifiers, a model h specifies the tree and the decisions at the nodes </li></ul><ul><li>Algorithmic complexity of h is proportional to the number of nodes </li></ul><ul><li>Complexity of data is expressed in terms of the entropy (in bits) of the data </li></ul><ul><li>Tree-pruning based on entropy is equivalent to the MDL principle </li></ul>
  32. 32. Convergence of MDL Classifiers <ul><li>MDL classifiers are guaranteed to converge to the ideal or true model in the limit of more and more data </li></ul><ul><li>Cannot prove that the MDL principle leads to a superior performance in the finite data case </li></ul><ul><li>Still consistent with the no free lunch principle </li></ul>
  33. 33. Bayesian Perspective of MDL Principle
  34. 34. Overfitting Avoidance <ul><li>Avoiding overfitting or minimizing description length are not inherently beneficial </li></ul><ul><li>Amount to a preference over the forms or parameters of classifiers </li></ul><ul><li>Beneficial only if they match their problems </li></ul><ul><li>There are problems for which overfitting avoidance leads to worse performance </li></ul>
  35. 35. Explanation of Success of Occam’s Razor <ul><li>Through evolution and strong selection pressure on our neurons </li></ul><ul><li>Likely to ignore problems for which Occam’s razor does not hold </li></ul><ul><li>Researchers naturally develop simple algorithms before more complex ones – a bias imposed by methodology </li></ul><ul><li>Principle of satisficing: creating an adequate though possibly nonoptimal solution </li></ul>
  36. 36. Bias and Variance <ul><li>Ways to measure the “match” or “alignment” of the learning algorithm to the classification problem </li></ul><ul><li>Bias </li></ul><ul><ul><li>Accuracy or quality of the match </li></ul></ul><ul><ul><li>High bias implies a poor match </li></ul></ul><ul><li>Variance </li></ul><ul><ul><li>Precision or specificity of the match </li></ul></ul><ul><ul><li>High variance implies a weak match </li></ul></ul>
  37. 37. Bias and Variance for Regression
  38. 38. Bias-Variance Dilemma
  39. 39. Bias-Variance Dilemma <ul><li>Given a target function </li></ul><ul><li>Model has many parameters </li></ul><ul><ul><li>Generally low bias </li></ul></ul><ul><ul><li>Fits data well </li></ul></ul><ul><ul><li>Yields high variance </li></ul></ul><ul><li>Model has few parameters </li></ul><ul><ul><li>Generally high bias </li></ul></ul><ul><ul><li>May not fit data well </li></ul></ul><ul><ul><li>The fit does not change much for different data sets (low variance) </li></ul></ul>
  40. 40. Bias-Variance Dilemma <ul><li>Best way to get low bias and low variance </li></ul><ul><ul><li>Have prior information about the target function </li></ul></ul><ul><li>Virtually never get zero bias and zero variance </li></ul><ul><ul><li>That would mean there is only one learning problem to solve and the answer is already known </li></ul></ul><ul><li>A large amount of training data will yield improved performance if the model is sufficiently general </li></ul>
  41. 41. Bias and Variance for Classification <ul><li>Reference </li></ul><ul><ul><li>J. H. Friedman, “On bias, variance, 0/1-loss, and the curse of dimensionality,” Data Mining and Knowledge Discovery , vol. 1, no. 1, pp. 55-77, 1997. </li></ul></ul>
  42. 42. Bias and Variance for Classification <ul><li>Two-category problem </li></ul><ul><li>Target function </li></ul><ul><li>F ( x )=Pr[ y =1| x ]=1-Pr[ y =0| x ] </li></ul><ul><li>Discriminant function </li></ul><ul><li>y d = F ( x ) + ε </li></ul><ul><li>ε : zero-mean, centered binomially distributed random variable </li></ul><ul><li>F ( x ) = E [ y d | x ] </li></ul>
  43. 43. Bias and Variance for Classification <ul><li>Find estimate g ( x ; D ) to minimize </li></ul><ul><li>E D [( g ( x ; D )- y d ) 2 ] </li></ul><ul><li>The estimated g ( x ; D ) can be used to classify x </li></ul><ul><li>The bias and variance concept for regression can be applied to g ( x ; D ) as an estimate of F ( x ) </li></ul><ul><li>However, this is not related to classification error directly </li></ul>
  44. 44. Bias and Variance for Classification
  45. 45. Bias and Variance for Classification
  46. 46. Bias and Variance for Classification
  47. 47. Bias and Variance for Classification
  48. 48. Bias and Variance for Classification <ul><li>Sign of the boundary bias affects the role of variance in the error </li></ul><ul><li>Low variance is generally important for accurate classification, if the sign is positive </li></ul><ul><li>Low boundary bias need not result in lower error rate </li></ul><ul><li>Simple methods often have lower variance, and need not be inferior to more flexible methods </li></ul>
  49. 49. Error rates and optimal K vs. N for d = 20 in KNN
  50. 50. Estimation and classification error vs. d for N = 12800 in KNN
  51. 51. Boundary Bias-Variance Trade-Off
  52. 52. Leave-One-Out Method (Jackknife)
  53. 53. Generalization to Estimates of Other Statistics <ul><li>Estimator of other statistics </li></ul><ul><ul><li>median, 25th percentile, mode, etc. </li></ul></ul><ul><li>Leave-one-out estimate </li></ul><ul><li>Jackknife estimate and its related variance </li></ul>
  54. 54. Jackknife Bias Estimate
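A minimal implementation of the jackknife bias and variance estimates (the standard leave-one-out formulas; the data set is invented for illustration): the bias-corrected estimate is n·θ̂ − (n−1)·θ̂(·), where θ̂(·) is the mean of the leave-one-out replications.

```python
import numpy as np

def jackknife(data, statistic):
    """Jackknife bias-corrected estimate and variance for any statistic."""
    data = np.asarray(data)
    n = len(data)
    theta_hat = statistic(data)
    # Leave-one-out replications theta_(i)
    loo = np.array([statistic(np.delete(data, i)) for i in range(n)])
    theta_dot = loo.mean()
    bias = (n - 1) * (theta_dot - theta_hat)          # jackknife bias estimate
    var = (n - 1) / n * np.sum((loo - theta_dot) ** 2)  # jackknife variance
    return theta_hat - bias, var

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(jackknife(x, np.mean))  # mean is unbiased: correction leaves it at 5.0
print(jackknife(x, np.var))   # biased plug-in variance is corrected upward
```

For the sample mean the correction vanishes; for the plug-in variance (divide by n) the jackknife correction recovers exactly the unbiased estimate (divide by n−1), a classical exact result.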
  55. 55. Example 2 Jackknife for Mode
  56. 56. Bootstrap <ul><li>Randomly select n points from the training data set D , with replacement </li></ul><ul><li>Repeat this process independently B times to yield B bootstrap data sets, treated as independent sets </li></ul><ul><li>Bootstrap estimate of a statistic θ </li></ul>
  57. 57. Bootstrap Bias and Variance Estimate
  58. 58. Properties of Bootstrap Estimates <ul><li>The larger the number B of bootstrap samples, the more satisfactory is the estimate of a statistic and its variance </li></ul><ul><li>B can be adjusted to the computational resources </li></ul><ul><ul><li>Jackknife estimate requires exactly n repetitions </li></ul></ul>
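The procedure on the preceding slides can be sketched in a few lines (illustrative Python; the data and the choice of the median as the statistic are arbitrary): draw B bootstrap data sets of n points with replacement, evaluate the statistic on each, and use the mean and spread of the replications.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap(data, statistic, B=2000):
    """Bootstrap estimates of a statistic, its bias, and its variance."""
    data = np.asarray(data)
    n = len(data)
    # B bootstrap data sets: n points drawn from D with replacement
    reps = np.array([statistic(data[rng.integers(0, n, n)]) for _ in range(B)])
    theta_boot = reps.mean()                 # bootstrap estimate of theta
    bias = theta_boot - statistic(data)      # bootstrap bias estimate
    var = reps.var(ddof=1)                   # bootstrap variance estimate
    return theta_boot, bias, var

x = rng.normal(10.0, 2.0, 50)
est, bias, var = bootstrap(x, np.median)
print(est, bias, var)
```

Unlike the jackknife's fixed n repetitions, B here can simply be raised until the estimates stabilize, subject to computational resources.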
  59. 59. Bagging <ul><li>Arcing </li></ul><ul><ul><li>Adaptive Reweighting and Combining </li></ul></ul><ul><ul><li>Reusing or selecting data in order to improve classification, e.g., AdaBoost </li></ul></ul><ul><li>Bagging </li></ul><ul><ul><li>Bootstrap aggregation </li></ul></ul><ul><ul><li>Multiple versions of D , by drawing n’ < n samples from D with replacement </li></ul></ul><ul><ul><li>Each set trains a component classifier </li></ul></ul><ul><ul><li>Final decision is based on vote of each component classifier </li></ul></ul>
  60. 60. Unstable Algorithm and Bagging <ul><li>Unstable algorithm </li></ul><ul><ul><li>“small” changes in training data lead to significantly different classifiers and relatively “large” changes in accuracy </li></ul></ul><ul><li>In general bagging improves recognition for unstable classifiers </li></ul><ul><ul><li>Effectively averages over such discontinuities </li></ul></ul><ul><ul><li>No convincing theoretical derivations or simulation studies showing bagging helps for all unstable classifiers </li></ul></ul>
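A minimal bagging sketch (illustrative Python, not from the slides): the component classifier is a single-threshold decision stump, chosen precisely because it is simple and unstable, and the toy two-Gaussian data are invented for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

class Stump:
    """A deliberately simple, unstable component classifier:
    one threshold on one feature, labels in {-1, +1}."""
    def fit(self, X, y):
        best = (np.inf, 0, 0.0, 1)
        for j in range(X.shape[1]):
            for t in X[:, j]:
                for s in (1, -1):
                    pred = np.where(X[:, j] > t, s, -s)
                    err = np.mean(pred != y)
                    if err < best[0]:
                        best = (err, j, t, s)
        _, self.j, self.t, self.s = best
        return self
    def predict(self, X):
        return np.where(X[:, self.j] > self.t, self.s, -self.s)

def bag(X, y, B=25, n_prime=None):
    # Each component classifier is trained on n' samples drawn
    # from D with replacement (a bootstrap replicate of D)
    n_prime = n_prime or len(X)
    models = []
    for _ in range(B):
        idx = rng.integers(0, len(X), n_prime)
        models.append(Stump().fit(X[idx], y[idx]))
    return models

def vote(models, X):
    # Final decision: majority vote of the component classifiers
    return np.sign(np.sum([m.predict(X) for m in models], axis=0))

# Toy data: two overlapping Gaussian classes
X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
y = np.r_[-np.ones(100), np.ones(100)]
print(np.mean(vote(bag(X, y), X) == y))   # training accuracy of the ensemble
```

With B odd the vote can never tie, and averaging over the stumps' discontinuous decision boundaries is exactly the smoothing effect the slide describes for unstable classifiers.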
  61. 61. Boosting <ul><li>Create the first classifier </li></ul><ul><ul><li>With accuracy on the training set greater than average (weak learner) </li></ul></ul><ul><li>Add a new classifier </li></ul><ul><ul><li>Form an ensemble </li></ul></ul><ul><ul><li>Joint decision rule has much higher accuracy on the training set </li></ul></ul><ul><li>Classification performance has been “boosted” </li></ul>
  62. 62. Boosting Procedure <ul><li>Trains successive component classifiers with a subset of the training data that is most “informative” given the current set of component classifiers </li></ul>
  63. 63. Training Data and Weak Learner
  64. 64. “Most Informative” Set Given C 1 <ul><li>Flip a fair coin </li></ul><ul><li>Head </li></ul><ul><ul><li>Select remaining samples from D </li></ul></ul><ul><ul><li>Present them to C 1 one by one until C 1 misclassifies a pattern </li></ul></ul><ul><ul><li>Add the misclassified pattern to D 2 </li></ul></ul><ul><li>Tail </li></ul><ul><ul><li>Find a pattern that C 1 classifies correctly </li></ul></ul><ul><ul><li>Add the correctly classified pattern to D 2 </li></ul></ul>
  65. 65. Third Data Set and Classifier C 3 <ul><li>Randomly select a remaining training pattern </li></ul><ul><li>Add the pattern if C 1 and C 2 disagree, otherwise ignore it </li></ul>
  66. 66. Classification of a Test Pattern <ul><li>If C 1 and C 2 agree, use their label </li></ul><ul><li>If they disagree, trust C 3 </li></ul>
  67. 67. Choosing n 1 <ul><li>For final vote, n 1 ~ n 2 ~ n 3 ~ n /3 is desired </li></ul><ul><li>Reasonable guess: n /3 </li></ul><ul><li>Simple problem: n 2 << n 1 </li></ul><ul><li>Difficult problem: n 2 too large </li></ul><ul><li>In practice we need to run the whole boosting procedure a few times </li></ul><ul><ul><li>To use the full training set </li></ul></ul><ul><ul><li>To get roughly equal partitions of the training set </li></ul></ul>
  68. 68. AdaBoost <ul><li>Adaptive boosting </li></ul><ul><li>Most popular version of basic boosting </li></ul><ul><li>Continue adding weak learners until some desired low training error has been achieved </li></ul>
  69. 69. AdaBoost Algorithm
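A hedged sketch of the standard AdaBoost procedure (the stump weak learner, the toy data, and k_max = 20 are illustrative choices, not from the slides): weights start uniform, each round fits a weak learner to the weighted data, and misclassified patterns are up-weighted exponentially.

```python
import numpy as np

rng = np.random.default_rng(1)

def stump_fit(X, y, w):
    # Weak learner: best weighted single-feature threshold
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for t in X[:, j]:
            for s in (1, -1):
                pred = np.where(X[:, j] > t, s, -s)
                err = np.sum(w[pred != y])       # weighted training error
                if err < best[0]:
                    best = (err, j, t, s)
    return best  # (weighted error, feature, threshold, sign)

def stump_predict(params, X):
    _, j, t, s = params
    return np.where(X[:, j] > t, s, -s)

def adaboost(X, y, k_max=20):
    n = len(X)
    w = np.full(n, 1.0 / n)               # uniform initial weights
    ensemble = []
    for _ in range(k_max):
        params = stump_fit(X, y, w)
        err = params[0]
        if err >= 0.5:                    # weak learner no better than chance
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
        pred = stump_predict(params, X)
        w *= np.exp(-alpha * y * pred)    # up-weight misclassified patterns
        w /= w.sum()
        ensemble.append((alpha, params))
    return ensemble

def decide(ensemble, X):
    # Discriminant: sign of the alpha-weighted vote of the weak learners
    return np.sign(sum(a * stump_predict(p, X) for a, p in ensemble))

X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
y = np.r_[-np.ones(100), np.ones(100)]
ens = adaboost(X, y)
print(np.mean(decide(ens, X) == y))       # training accuracy of the ensemble
```

The loop exits early if no weak learner beats chance, consistent with the requirement in the boosting slides; note the exponentially shrinking training error says nothing by itself about off-training-set error.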
  70. 70. Final Decision <ul><li>Discriminant function </li></ul>
  71. 71. Ensemble Training Error
  72. 72. AdaBoost vs. No Free Lunch Theorem <ul><li>Boosting only improves classification if the component classifiers perform better than chance </li></ul><ul><ul><li>Cannot be guaranteed a priori </li></ul></ul><ul><li>Exponential reduction in error on the training set does not ensure reduction of the off-training-set error or generalization error </li></ul><ul><li>Proven effective in many real-world applications </li></ul>
  73. 73. Learning with Queries <ul><li>Set of unlabeled patterns </li></ul><ul><li>Exists some (possibly costly) way of labeling any pattern </li></ul><ul><li>To determine which unlabeled patterns would be most informative if they were presented as a query to an oracle </li></ul><ul><li>Also called active learning or interactive learning </li></ul><ul><li>Can be refined further as cost-based learning </li></ul>
  74. 74. Application Example <ul><li>Design a classifier for handwritten numerals </li></ul><ul><li>Using unlabeled pixel images scanned from documents from a corpus too large to label every pattern </li></ul><ul><li>Human as the oracle </li></ul>
  75. 75. Learning with Queries <ul><li>Begin with a preliminary, weak classifier developed with a small set of labeled samples </li></ul><ul><li>Two related methods for selecting an informative pattern </li></ul><ul><ul><li>Confidence-based query selection </li></ul></ul><ul><ul><li>Voting-based or committee-based query selection </li></ul></ul>
  76. 76. Selecting Most Informative Patterns <ul><li>Confidence-based query selection </li></ul><ul><ul><li>Pattern that two largest discriminant functions have nearly the same value </li></ul></ul><ul><ul><li>i.e., patterns lie near the current decision boundaries </li></ul></ul><ul><li>Voting-based query selection </li></ul><ul><ul><li>Pattern that yields the greatest disagreement among the k resulting category labels </li></ul></ul>
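Both selection rules can be stated compactly (illustrative Python; the unlabeled pool, the two-class discriminants, and the three-member threshold committee are invented for the demonstration).

```python
import numpy as np

def confidence_query(unlabeled, discriminants):
    """Pick the pattern whose two largest discriminant values are closest,
    i.e., the pattern nearest the current decision boundaries."""
    g = discriminants(unlabeled)                 # shape (n, c)
    top2 = np.sort(g, axis=1)[:, -2:]
    margin = top2[:, 1] - top2[:, 0]             # small margin = informative
    return unlabeled[np.argmin(margin)]

def voting_query(unlabeled, committee):
    """Pick the pattern with the greatest disagreement among k classifiers."""
    votes = np.array([clf(unlabeled) for clf in committee])  # (k, n) labels
    def disagreement(col):
        _, counts = np.unique(col, return_counts=True)
        return 1.0 - counts.max() / len(col)     # 0 = unanimous committee
    scores = np.apply_along_axis(disagreement, 0, votes)
    return unlabeled[np.argmax(scores)]

# Toy 1-D pool; hypothetical two-class discriminants g1 = -x, g2 = x
U = np.array([[-2.0], [0.1], [3.0]])
g = lambda X: np.hstack([-X, X])
print(confidence_query(U, g))                    # [0.1]: nearest the boundary

# Hypothetical committee of three threshold classifiers
committee = [lambda X, t=t: np.sign(X[:, 0] - t) for t in (-0.5, 0.0, 0.5)]
print(voting_query(U, committee))                # [0.1]: the committee splits
```

Both rules pick the pattern at 0.1, which sits near the boundary at x = 0; patterns far from the boundary would be labeled at cost without being informative.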
  77. 77. Active Learning Example
  78. 78. Arcing and Active Learning vs. IID Sampling <ul><li>If we take a model of the true distribution and train it with a highly skewed distribution produced by active learning, the final classifier accuracy might be low </li></ul><ul><li>Resampling methods generally use techniques that do not attempt to model or fit the full category distributions </li></ul><ul><ul><li>Not fitting parameters in a model </li></ul></ul><ul><ul><li>But instead seeking decision boundaries directly </li></ul></ul>
  79. 79. Arcing and Active Learning <ul><li>As the number of component classifiers is increased, resampling, boosting and related methods effectively broaden the class of implementable functions </li></ul><ul><li>Allow us to “match” the final classifier to the problem by indirectly adjusting the bias and variance </li></ul><ul><li>Can be used with arbitrary classification techniques </li></ul>
  80. 80. Estimating the Generalization Rate <ul><li>See if the classifier performs well enough to be useful </li></ul><ul><li>Compare its performance with that of a competing design </li></ul><ul><li>Requires making assumptions about the classifier or the problem or both </li></ul><ul><li>All the methods given here are heuristic </li></ul>
  81. 81. Parametric Models <ul><li>Compute from the assumed parametric model </li></ul><ul><li>Example: two-class multivariate normal case </li></ul><ul><ul><li>Bhattacharyya or Chernoff bounds using estimated mean and covariance matrix </li></ul></ul><ul><li>Problems </li></ul><ul><ul><li>Overly optimistic </li></ul></ul><ul><ul><li>Always suspect the model </li></ul></ul><ul><ul><li>Error rate may be difficult to compute </li></ul></ul>
  82. 82. Simple Cross-Validation <ul><li>Randomly split the set of labeled training samples D into a training set and a validation set </li></ul>
  83. 83. m -Fold Cross-Validation <ul><li>Training set is randomly divided into m disjoint sets of equal size n/m </li></ul><ul><li>The classifier is trained m times </li></ul><ul><ul><li>Each time with a different set held out as a validation set </li></ul></ul><ul><li>Estimated performance is the mean of these m errors </li></ul><ul><li>When m = n , it is in effect the leave-one-out approach </li></ul>
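The m-fold procedure can be sketched as follows (illustrative Python; the nearest-mean classifier and the toy Gaussian data are hypothetical stand-ins for whatever learning algorithm is being validated).

```python
import numpy as np

def m_fold_cv(X, y, train_fn, error_fn, m=5, seed=0):
    """m-fold cross-validation: train m times, each time holding out
    one of m disjoint, roughly equal-sized folds as the validation set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, m)       # m disjoint sets of size ~ n/m
    errors = []
    for i in range(m):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(m) if j != i])
        model = train_fn(X[train], y[train])
        errors.append(error_fn(model, X[val], y[val]))
    return float(np.mean(errors))        # estimated generalization error

# Hypothetical learner: nearest class mean
def train_fn(Xt, yt):
    return {c: Xt[yt == c].mean(axis=0) for c in (0.0, 1.0)}

def error_fn(model, Xv, yv):
    d = np.stack([np.linalg.norm(Xv - mu, axis=1) for mu in model.values()])
    pred = np.array(list(model.keys()))[np.argmin(d, axis=0)]
    return np.mean(pred != yv)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (60, 2)), rng.normal(1, 1, (60, 2))])
y = np.r_[np.zeros(60), np.ones(60)]
print(m_fold_cv(X, y, train_fn, error_fn))   # mean held-out error over 5 folds
```

Setting m = n turns this into the leave-one-out approach of the preceding slide.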
  84. 84. Forms of Learning for Cross-Validation <ul><li>neural networks of fixed topology </li></ul><ul><ul><li>Number of epochs or presentations of the training set </li></ul></ul><ul><ul><li>Number of hidden units </li></ul></ul><ul><li>Width of the Gaussian window in Parzen windows </li></ul><ul><li>Optimal k in the k -nearest neighbor classifier </li></ul>
  85. 85. Portion γ of D as a Validation Set <ul><li>Should be small </li></ul><ul><ul><li>Validation set is used merely to know when to stop adjusting parameters </li></ul></ul><ul><ul><li>Training set is used to set large number of parameters or degrees of freedoms </li></ul></ul><ul><li>Traditional default </li></ul><ul><ul><li>Set γ = 0.1 </li></ul></ul><ul><ul><li>Proven effective in many applications </li></ul></ul>
  86. 86. Anti-Cross-Validation <ul><li>Cross-validation need not work on every problem </li></ul><ul><li>Anti-cross-validation </li></ul><ul><ul><li>Halt when the validation error is the first local maximum </li></ul></ul><ul><ul><li>Must explore different values of γ </li></ul></ul><ul><ul><li>Possibly abandon the use of cross-validation if performance cannot be improved </li></ul></ul>
  87. 87. Estimation of Error Rate <ul><li>Let p be the true and unknown error rate of the classifier </li></ul><ul><li>Assume k of the n’ independent, randomly drawn test samples are misclassified, then k has the binomial distribution </li></ul><ul><li>Maximum-likelihood estimate for p </li></ul>
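With k misclassifications among n' independent test samples, k is binomial and the maximum-likelihood estimate is p̂ = k/n'. A minimal sketch, using the normal approximation to the binomial for a 95% confidence interval (the approximation is an assumption, reasonable for moderately large n'):

```python
import math

def error_rate_estimate(k, n):
    """ML estimate of the true error rate p, plus an approximate 95% CI.

    k misclassified out of n independent test samples => k ~ Binomial(n, p),
    so p_hat = k / n, with standard error sqrt(p_hat (1 - p_hat) / n).
    """
    p_hat = k / n
    half = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, (max(0.0, p_hat - half), min(1.0, p_hat + half))

print(error_rate_estimate(13, 100))   # (0.13, (about 0.064, about 0.196))
```

The width of the interval shrinks like 1/sqrt(n'), which is why small test sets give only crude error-rate estimates.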
  88. 88. 95% Confidence Intervals for a Given Estimated p
  89. 89. Jackknife Estimation of Classification Accuracy <ul><li>Use leave-one-out approach </li></ul><ul><li>Obtain Jackknife estimate for the mean and variance of the leave-one-out accuracies </li></ul><ul><li>Use traditional hypothesis testing to see if one classifier is superior to another with statistical significance </li></ul>
  90. 90. Jackknife Estimation of Classification Accuracy
  91. 91. Bootstrap Estimation of Classification Accuracy <ul><li>Train B classifiers, each with a different bootstrap data set </li></ul><ul><li>Test on other bootstrap data sets </li></ul><ul><li>Bootstrap estimate is the mean of these bootstrap accuracies </li></ul>
  92. 92. Maximum-Likelihood Comparison (ML-II) <ul><li>Also called maximum-likelihood selection </li></ul><ul><li>Find the maximum-likelihood parameters for each of the candidate models </li></ul><ul><li>Calculate the resulting likelihoods (evidences) </li></ul><ul><li>Choose the model with the largest likelihood </li></ul>
  93. 93. Maximum-Likelihood Comparison
  94. 94. Scientific Process *D. J. C. MacKay, “Bayesian interpolation,” Neural Computation, 4(3), 415-447, 1992
  95. 95. Bayesian Model Comparison
  96. 96. Concept of Occam Factor
  97. 97. Concept of Occam Factor <ul><li>An inherent bias toward simple models (small prior parameter volume Δ 0 θ ) </li></ul><ul><li>Models that are overly complex (large Δ 0 θ ) are automatically self-penalizing </li></ul>
  98. 98. Evidence for Gaussian Parameters
  99. 99. Bayesian Model Selection vs. No Free Lunch Theorem <ul><li>Bayesian model selection </li></ul><ul><ul><li>Ignore the prior over the space of models </li></ul></ul><ul><ul><li>Effectively assume that it is uniform </li></ul></ul><ul><ul><li>Not take into account how models correspond to underlying target functions </li></ul></ul><ul><ul><li>Usually corresponds to non-uniform prior over target functions </li></ul></ul><ul><li>No Free Lunch Theorem </li></ul><ul><ul><li>Allows that for some particular non-uniform prior there may be an algorithm that gives better than chance, or even optimal, results </li></ul></ul>
  100. 100. Error Rate as a Function of Number n of Training Samples <ul><li>Classifiers trained by a small number of samples will not perform well on new data </li></ul><ul><li>Typical steps </li></ul><ul><ul><li>Estimate unknown parameters from samples </li></ul></ul><ul><ul><li>Use these estimates to determine the classifier </li></ul></ul><ul><ul><li>Calculate the error rate for the resulting classifier </li></ul></ul>
  101. 101. Analytical Analysis <ul><li>Case of two categories having equal prior probabilities </li></ul><ul><li>Partition feature space into some m disjoint cells, C 1 , . . ., C m </li></ul><ul><li>Conditional probabilities p ( x | ω 1 ) and p ( x | ω 2 ) do not vary appreciably within any cell </li></ul><ul><li>Need only know which cell x falls in </li></ul>
  102. 102. Analytical Analysis
  103. 103. Analytical Analysis
  104. 104. Analytical Analysis
  105. 105. Results of Simulation Experiments
  106. 106. Discussions on Error Rate for Given n <ul><li>For every curve involving finite n there is an optimal number of cells </li></ul><ul><li>At first, increasing the number of cells makes it easier to distinguish between the distributions represented by p and q </li></ul><ul><li>If the number of cells becomes too large, there will not be enough training patterns to fill them </li></ul><ul><ul><li>Eventually the number of patterns in most cells will be zero </li></ul></ul>
  107. 107. Discussions on Error Rate for Given n <ul><li>For n = 500 , the minimal error rate occurs somewhere around m = 20 </li></ul><ul><li>Form the cells by dividing each feature axis into l intervals </li></ul><ul><li>With d features, m = l^d </li></ul><ul><li>If l = 2 , using more than four or five binary features will lead to worse rather than better performance </li></ul>
  108. 108. Test Errors vs. Number of Training Patterns
  109. 109. Test and Training Error
  110. 110. Power Law
  111. 111. Sum and Difference of Test and Training Error
  112. 112. Fraction of Dichotomies of n Points in d Dimensions That are Linear
  113. 113. One-Dimensional Case f ( n = 4, d = 1) = 0.5

      Labels  Linearly Separable?    Labels  Linearly Separable?
      0000    X                      1000    X
      0001    X                      1001
      0010                           1010
      0011    X                      1011
      0100                           1100    X
      0101                           1101
      0110                           1110    X
      0111    X                      1111    X
  114. 114. Capacity of a Separating Plane <ul><li>Not until n is a sizable fraction of 2( d +1) does the problem begin to become difficult </li></ul><ul><li>Capacity of a hyperplane </li></ul><ul><ul><li>At n = 2( d +1) , half of the possible dichotomies are still linear </li></ul></ul><ul><li>Cannot expect a linear classifier to “match” a problem, on average, if the dimension of the feature space is greater than n /2 - 1 </li></ul>
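These counts follow from the standard counting formula for the fraction of linearly separable dichotomies of n points in general position in d dimensions; the short sketch below (illustrative Python) evaluates it and reproduces both the one-dimensional table value and the capacity property.

```python
from math import comb

def f(n, d):
    """Fraction of the 2^n dichotomies of n points in general position
    in d dimensions that are linearly separable."""
    if n <= d + 1:
        return 1.0                        # every dichotomy is separable
    return 2 ** (1 - n) * sum(comb(n - 1, i) for i in range(d + 1))

print(f(4, 1))              # 0.5, the one-dimensional case tabulated above
print(f(2 * (3 + 1), 3))    # 0.5: at the capacity n = 2(d+1), half are linear
```

For n well below 2(d+1) almost every labeling is separable (the problem is "easy"); beyond the capacity the fraction falls rapidly toward zero.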
  115. 115. Mixture-of-Expert Models <ul><li>Classifiers whose decision is based on the outputs of component classifiers </li></ul><ul><li>Also called </li></ul><ul><ul><li>Ensemble classifiers </li></ul></ul><ul><ul><li>Modular classifiers </li></ul></ul><ul><ul><li>Pooled classifiers </li></ul></ul><ul><li>Useful if each component classifier is highly trained (“expert”) in a different region of the feature space </li></ul>
  116. 116. Mixture Model for Producing Patterns
  117. 117. Mixture-of-Experts Architecture
  118. 118. Ensemble Classifiers
  119. 119. Maximum-Likelihood Estimation
  120. 120. Final Decision Rule <ul><li>Choose the category corresponding to the maximum discriminant value after the pooling system </li></ul><ul><li>Winner-take-all method </li></ul><ul><ul><li>Use the decision of the single component classifier that is the “most confident”, i.e., largest g rj </li></ul></ul><ul><ul><li>Suboptimal but simple </li></ul></ul><ul><ul><li>Works well if the component classifiers are experts in separate regions </li></ul></ul>
  121. 121. Component Classifiers without Discriminant Functions <ul><li>Example </li></ul><ul><ul><li>A KNN classifier (rank order) </li></ul></ul><ul><ul><li>A decision tree (label) </li></ul></ul><ul><ul><li>A neural network (analog value) </li></ul></ul><ul><ul><li>A rule-based system (label) </li></ul></ul>
  122. 122. Heuristics to Convert Outputs to Discriminant Values
  123. 123. Illustration Examples

      Analog  g i      Rank Order  g i            One-of- c  g i
      0.1     0.111    4th         3/21 = 0.143   0          0.0
      0.2     0.129    2nd         5/21 = 0.238   0          0.0
      0.3     0.143    1st         6/21 = 0.286   0          0.0
      0.9     0.260    5th         2/21 = 0.095   0          0.0
      0.6     0.193    6th         1/21 = 0.048   1          1.0
      0.4     0.158    3rd         4/21 = 0.190   0          0.0
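The three conversions illustrated in the table can be sketched as follows (illustrative Python; the softmax for analog outputs and the linear rank weighting are common heuristic choices consistent with the table's structure, and the six-category values are taken from the example above).

```python
import numpy as np

def analog_to_g(values):
    # Softmax heuristic for analog outputs (e.g. a neural network)
    e = np.exp(np.asarray(values))
    return e / e.sum()

def rank_to_g(ranks, c):
    # Rank-order heuristic (e.g. k-NN): best rank weighted c, worst 1,
    # normalized by the sum 1 + 2 + ... + c = c(c+1)/2
    scores = c + 1 - np.asarray(ranks)
    return scores / scores.sum()

def one_of_c_to_g(label, c):
    # Hard label (e.g. a decision tree or rule-based system)
    g = np.zeros(c)
    g[label] = 1.0
    return g

analog = [0.1, 0.2, 0.3, 0.9, 0.6, 0.4]    # analog outputs from the table
print(np.round(analog_to_g(analog), 3))     # softmax-normalized discriminants
ranks = [4, 2, 1, 5, 6, 3]                  # rank-order outputs from the table
print(np.round(rank_to_g(ranks, 6), 3))     # e.g. 1st -> 6/21, 6th -> 1/21
print(one_of_c_to_g(4, 6))                  # one-of-c: a single 1.0
```

Each conversion yields discriminant values that are non-negative and sum to one, so the pooled classifier can combine heterogeneous component outputs on a common scale.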