MachineLearning.ppt

Transcript

  • 1. Algorithm-Independent Machine Learning Shyh-Kang Jeng Department of Electrical Engineering/ Graduate Institute of Communication/ Graduate Institute of Networking and Multimedia, National Taiwan University
  • 2. Some Fundamental Problems
    • Which algorithm is “best”?
    • Are there any reasons to favor one algorithm over another?
    • Is “Occam’s razor” really so evident?
    • Do simpler or “smoother” classifiers generalize better? If so, why?
    • Are there fundamental “conservation” or “constraint” laws other than Bayes error rate?
  • 3. Meaning of “Algorithm-Independent”
    • Mathematical foundations that do not depend upon the particular classifier or learning algorithm used
      • e.g., bias and variance concept
    • Techniques that can be used in conjunction with different learning algorithms, or that provide guidance in their use
      • e.g., cross-validation and resampling techniques
  • 4. Roadmap
    • No pattern classification method is inherently superior to any other
    • Ways to quantify and adjust the “match” between a learning algorithm and the problem it addresses
    • Estimation of accuracies and comparison of different classifiers with certain assumptions
    • Methods for integrating component classifiers
  • 5. Generalization Performance by Off-Training Set Error
    • Consider a two-category problem
    • Training set D
    • Training patterns x i
    • y i = 1 or -1 for i = 1, …, n is generated by the unknown target function F ( x ) to be learned
    • F ( x ) often has a random component
      • The same input could lead to different categories
      • Giving non-zero Bayes error
  • 6. Generalization Performance by Off-Training Set Error
    • Let H be the (discrete) set of hypotheses, or sets of parameters to be learned
    • A particular h belongs to H
      • quantized weights in neural network
      • Parameters θ in a functional model
      • Sets of decisions in a tree
    • P ( h ) : prior probability that the algorithm will produce hypothesis h after training
  • 7. Generalization Performance by Off-Training Set Error
    • P ( h | D ) : probability the algorithm will yield h when trained on data D
      • Nearest-neighbor and decision tree: non-zero only for a single hypothesis
      • Neural network: can be a broad distribution
    • E : error for zero-one or other loss function
  • 8. Generalization Performance by Off-Training Set Error
    • A natural measure
    • Expected off-training-set classification error for the k th candidate learning algorithm
  • 9. No Free Lunch Theorem
  • 10. No Free Lunch Theorem
    • For any two algorithms
    • No matter how clever we are in choosing a “good” algorithm and a “bad” algorithm, if all target functions are equally likely, the “good” algorithm will not outperform the “bad” one
    • There is at least one target function for which random guessing is a better algorithm
  • 11. No Free Lunch Theorem
    • Even if we know D , averaged over all target functions no algorithm yields an off-training set error that is superior to any other
  • 12. Example 1: No Free Lunch for Binary Data

       x  |  F  | h 1 | h 2
      ----+-----+-----+-----
      000 |  1  |  1  |  1    (in D)
      001 | -1  | -1  | -1    (in D)
      010 |  1  |  1  |  1    (in D)
      011 | -1  |  1  | -1
      100 |  1  |  1  | -1
      101 | -1  |  1  | -1
      110 |  1  |  1  | -1
      111 |  1  |  1  | -1
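The averaging claim of the theorem can be checked directly on this example. A minimal Python sketch (not part of the original slides) that enumerates every target function consistent with the training set D and averages the off-training-set errors of h 1 and h 2:

```python
from itertools import product

# Off-training-set patterns; the training set D fixes F on 000, 001, 010.
h1 = {'011': 1, '100': 1, '101': 1, '110': 1, '111': 1}
h2 = {'011': -1, '100': -1, '101': -1, '110': -1, '111': -1}
patterns = sorted(h1)

def off_training_error(h, F):
    """Fraction of off-training-set patterns on which hypothesis h disagrees with target F."""
    return sum(h[x] != F[x] for x in patterns) / len(patterns)

# Average over all 2**5 target functions that agree with the training data D.
errs_h1, errs_h2 = [], []
for values in product([-1, 1], repeat=len(patterns)):
    F = dict(zip(patterns, values))
    errs_h1.append(off_training_error(h1, F))
    errs_h2.append(off_training_error(h2, F))

print(sum(errs_h1) / len(errs_h1), sum(errs_h2) / len(errs_h2))  # both 0.5
```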
  • 13. No Free Lunch Theorem
  • 14. Conservation in Generalization
    • We cannot achieve positive performance on some problems without an equal and opposite amount of negative performance on other problems
    • We can trade performance on problems we do not expect to encounter for performance on those we do expect to encounter
    • It is the assumptions about the learning domain that are relevant
  • 15. Ugly Duckling Theorem
    • In the absence of assumptions there is no privileged or “best” feature representation
    • Even the notion of similarity between patterns depends implicitly on assumptions that may or may not be correct
  • 16. Venn Diagram Representation of Features as Predicates
  • 17. Rank of a Predicate
    • Number of the simplest or indivisible elements it contains
    • Example: rank r = 1
      • x 1 : f 1 AND NOT f 2
      • x 2 : f 1 AND f 2
      • x 3 : f 2 AND NOT f 1
      • x 4 : NOT ( f 1 OR f 2 )
      • C (4,1) = 4 predicates
  • 18. Examples of Rank of a Predicate
    • Rank r = 2
      • x 1 OR x 2 : f 1
      • x 1 OR x 3 : f 1 XOR f 2
      • x 1 OR x 4 : NOT f 2
      • x 2 OR x 3 : f 2
      • x 2 OR x 4 : ( f 1 AND f 2 ) OR NOT ( f 1 OR f 2 )
      • x 3 OR x 4 : NOT f 1
      • C (4, 2) = 6 predicates
  • 19. Examples of Rank of a Predicate
    • Rank r = 3
      • x 1 OR x 2 OR x 3 : f 1 OR f 2
      • x 1 OR x 2 OR x 4 : f 1 OR NOT f 2
      • x 1 OR x 3 OR x 4 : NOT ( f 1 AND f 2 )
      • x 2 OR x 3 OR x 4 : f 2 OR NOT f 1
      • C (4, 3) = 4 predicates
  • 20. Total Number of Predicates in Absence of Constraints
    • Let d be the number of regions in the Venn diagrams (i.e., number of distinctive patterns, or number of possible values determined by combinations of the features)
  • 21. A Measure of Similarity in Absence of Prior Information
    • Number of features or attributes shared by two patterns
      • Concept difficulties
        • e.g., with features blind_in_right_eye and blind_in_left_eye , (1,0) is more similar to (1,1) and (0,0) than to (0,1)
      • There are always multiple ways to represent vectors of attributes
        • e.g. blind_in_right_eye and same_in_both_eyes
      • No principled reason to prefer one of these representations over another
  • 22. A Plausible Measure of Similarity in Absence of Prior Information
    • Number of predicates the patterns share
    • Consider two distinct patterns
      • no predicate of rank 1 is shared
      • 1 predicate of rank 2 is shared
      • C ( d -2, 1) predicates of rank 3 are shared
      • C ( d -2, r -2) predicates of rank r are shared
      • Total number of predicates shared: the sum over r , which equals 2^( d -2) regardless of which two patterns were chosen (verified in the sketch below)
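This constant count is easy to verify by brute force. A minimal Python sketch (not from the slides), identifying predicates with non-empty subsets of the d atomic patterns:

```python
from itertools import combinations
from math import comb

def shared_predicates(d, i, j):
    """Count predicates (non-empty subsets of the d atomic patterns) containing both i and j."""
    count = 0
    for r in range(1, d + 1):                    # rank r = number of atomic patterns in the predicate
        for subset in combinations(range(d), r):
            if i in subset and j in subset:
                count += 1
    return count

d = 6
for i in range(d):                               # every pair of distinct patterns shares 2**(d-2) predicates
    for j in range(i + 1, d):
        assert shared_predicates(d, i, j) == 2 ** (d - 2)

# closed form: sum_{r=2}^{d} C(d-2, r-2) = 2**(d-2)
assert sum(comb(d - 2, r - 2) for r in range(2, d + 1)) == 2 ** (d - 2)
print("every pair shares", 2 ** (d - 2), "predicates")
```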
  • 23. Ugly Duckling Theorem
    • Given a finite set of predicates that enables us to distinguish any two patterns
    • The number of predicates shared by any two such patterns is constant and independent of the choice of those patterns
    • If pattern similarity is based on the total number of predicates shared, any two patterns are “equally similar”
  • 24. Ugly Duckling Theorem
    • No problem-independent or privileged or “best” set of features or feature attributes
    • Also applies to continuous feature spaces
  • 25. Minimum Description Length (MDL)
    • Find some irreducible, smallest representation (“signal”) of all members of a category
    • All variation among the individual patterns is then “noise”
    • By simplifying recognizers appropriately, the signal can be retained while the noise is ignored
  • 26. Algorithm Complexity (Kolmogorov Complexity)
    • Kolmogorov complexity of a binary string x
      • Defined on an abstract computer (Turing machine) U
      • As the length of the shortest program (binary string) y
      • That, without additional data, computes the string x and halts
  • 27. Algorithm Complexity Example
    • Suppose x consists solely of n 1 s
    • Use some fixed number of bits k to specify a loop for printing a string of 1 s
    • Need log 2 n more bits to specify the iteration number n , the condition for halting
    • Thus K ( x ) = O (log 2 n )
  • 28. Algorithm Complexity Example
    • Constant π = 11.0010010000111111011010101000… (the binary expansion of π )
    • The shortest program is one that can produce any arbitrarily large number of consecutive digits of π
    • Thus K ( x ) = O (1)
  • 29. Algorithm Complexity Example
    • Assume that x is a “truly” random binary string
    • Can not be expressed as a shorter string
    • Thus K ( x ) = O (| x |)
  • 30. Minimum Description Length (MDL) Principle
    • Minimize the sum of the model’s algorithmic complexity and the description of the training data with respect to that model
  • 31. An Application of MDL Principle
    • For decision-tree classifiers, a model h specifies the tree and the decisions at the nodes
    • Algorithmic complexity of h is proportional to the number of nodes
    • Complexity of data is expressed in terms of the entropy (in bits) of the data
    • Tree-pruning based on entropy is equivalent to the MDL principle
  • 32. Convergence of MDL Classifiers
    • MDL classifiers are guaranteed to converge to the ideal or true model in the limit of more and more data
    • Can not prove that the MDL principle leads to a superior performance in the finite data case
    • Still consistent with the no free lunch principle
  • 33. Bayesian Perspective of MDL Principle
  • 34. Overfitting Avoidance
    • Avoiding overfitting and minimizing description length are not inherently beneficial
    • They amount to a preference over the forms or parameters of classifiers
    • They are beneficial only if they match their problems
    • There are problems for which overfitting avoidance leads to worse performance
  • 35. Explanation of Success of Occam’s Razor
    • Through evolution and strong selection pressure on our neurons
    • Likely to ignore problems for which Occam’s razor does not hold
    • Researchers naturally develop simple algorithms before more complex ones – a bias imposed by methodology
    • Principle of satisficing: creating an adequate though possibly nonoptimal solution
  • 36. Bias and Variance
    • Ways to measure the “match” or “alignment” of the learning algorithm to the classification problem
    • Bias
      • Accuracy or quality of the match
      • High bias implies a poor match
    • Variance
      • Precision or specificity of the match
      • High variance implies a weak match
  • 37. Bias and Variance for Regression
  • 38. Bias-Variance Dilemma
  • 39. Bias-Variance Dilemma
    • Given a target function
    • Model has many parameters
      • Generally low bias
      • Fits data well
      • Yields high variance
    • Model has few parameters
      • Generally high bias
      • May not fit data well
      • The fit does not change much for different data sets (low variance)
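The dilemma is easy to reproduce empirically. A small illustrative sketch (numpy only; the sinusoidal target, noise level, and polynomial degrees are arbitrary choices, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
F = lambda x: np.sin(2 * np.pi * x)              # assumed target function (illustrative)
x_test = np.linspace(0, 1, 50)

def fit_predict(degree, n=20, trials=200):
    """Fit polynomials of a given degree to many independent noisy data sets."""
    preds = np.empty((trials, x_test.size))
    for t in range(trials):
        x = rng.uniform(0, 1, n)
        y = F(x) + rng.normal(0, 0.3, n)         # noisy training samples
        preds[t] = np.polyval(np.polyfit(x, y, degree), x_test)
    return preds

for degree in (1, 3, 9):                         # few parameters -> high bias; many -> high variance
    p = fit_predict(degree)
    bias2 = np.mean((p.mean(axis=0) - F(x_test)) ** 2)
    var = np.mean(p.var(axis=0))
    print(f"degree {degree}: bias^2 = {bias2:.3f}  variance = {var:.3f}")
```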
  • 40. Bias-Variance Dilemma
    • Best way to get low bias and low variance
      • Have prior information about the target function
    • Virtually never get zero bias and zero variance
      • Only one learning problem to solve and the answer is already known
    • A large amount of training data will yield improved performance provided the model is sufficiently general
  • 41. Bias and Variance for Classification
    • Reference
      • J. H. Friedman, “On bias, variance, 0/1-loss, and the curse of dimensionality,” Data Mining and Knowledge Discovery , vol. 1, no. 1, pp. 55-77, 1997.
  • 42. Bias and Variance for Classification
    • Two-category problem
    • Target function F ( x ) = Pr[ y =1| x ] = 1 - Pr[ y =0| x ]
    • Discriminant function y d = F ( x ) + ε
    • ε : zero-mean, centered binomially distributed random variable
    • F ( x ) = E [ y d | x ]
  • 43. Bias and Variance for Classification
    • Find estimate g ( x ; D ) to minimize
    • E D [( g ( x ; D )- y d ) 2 ]
    • The estimated g ( x ; D ) can be used to classify x
    • The bias and variance concept for regression can be applied to g ( x ; D ) as an estimate of F ( x )
    • However, this is not related to classification error directly
  • 44. Bias and Variance for Classification
  • 45. Bias and Variance for Classification
  • 46. Bias and Variance for Classification
  • 47. Bias and Variance for Classification
  • 48. Bias and Variance for Classification
    • Sign of the boundary bias affects the role of variance in the error
    • Low variance is generally important for accurate classification, if the sign is positive
    • Low boundary bias need not result in lower error rate
    • Simple methods often have lower variance, and need not be inferior to more flexible methods
  • 49. Error rates and optimal K vs. N for d = 20 in KNN
  • 50. Estimation and classification error vs. d for N = 12800 in KNN
  • 51. Boundary Bias-Variance Trade-Off
  • 52. Leave-One-Out Method (Jackknife)
  • 53. Generalization to Estimates of Other Statistics
    • Estimator of other statistics
      • median, 25th percentile, mode, etc.
    • Leave-one-out estimate
    • Jackknife estimate and its related variance
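A minimal sketch of the leave-one-out (jackknife) estimate of a statistic and its variance, here applied to the median of an arbitrary small data set:

```python
import numpy as np

def jackknife(data, statistic):
    """Leave-one-out estimates, jackknife estimate, and jackknife variance of a statistic."""
    n = len(data)
    loo = np.array([statistic(np.delete(data, i)) for i in range(n)])  # leave-one-out values
    theta_jack = loo.mean()                                            # jackknife estimate
    var_jack = (n - 1) / n * np.sum((loo - theta_jack) ** 2)           # jackknife variance
    return theta_jack, var_jack

data = np.array([2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9])
print(jackknife(data, np.median))
```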
  • 54. Jackknife Bias Estimate
  • 55. Example 2 Jackknife for Mode
  • 56. Bootstrap
    • Randomly selecting n points from the training data set D , with replacement
    • Repeat this process independently B times to yield B bootstrap data sets, treated as independent sets
    • Bootstrap estimate of a statistic θ
  • 57. Bootstrap Bias and Variance Estimate
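A minimal sketch of the bootstrap estimate of a statistic θ together with the bootstrap bias and variance estimates (the data set and B are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap(data, statistic, B=1000):
    """Bootstrap estimate of a statistic theta, with bootstrap bias and variance estimates."""
    n = len(data)
    boot = np.array([statistic(rng.choice(data, size=n, replace=True))  # n draws with replacement
                     for _ in range(B)])
    theta_boot = boot.mean()                     # bootstrap estimate of the statistic
    bias = theta_boot - statistic(data)          # bootstrap bias estimate
    var = boot.var(ddof=1)                       # bootstrap variance estimate
    return theta_boot, bias, var

data = np.array([2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9])
print(bootstrap(data, np.mean))
```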
  • 58. Properties of Bootstrap Estimates
    • The larger the number B of bootstrap samples, the more satisfactory is the estimate of a statistic and its variance
    • B can be adjusted to the computational resources
      • Jackknife estimate requires exactly n repetitions
  • 59. Bagging
    • Arcing
      • Adaptive Reweighting and Combining
      • Reusing or selecting data in order to improve classification, e.g., AdaBoost
    • Bagging
      • Bootstrap aggregation
      • Multiple versions of D , by drawing n’ < n samples from D with replacement
      • Each set trains a component classifier
      • Final decision is based on a vote over the component classifiers
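A minimal bagging sketch, using scikit-learn decision trees as the component classifiers (the choice of base classifier, n', and the number of components are assumptions, not taken from the slides):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def bagging_fit(X, y, n_classifiers=25, frac=0.8):
    """Train component classifiers on bootstrap samples of size n' = frac * n."""
    n_prime = int(frac * len(X))
    models = []
    for _ in range(n_classifiers):
        idx = rng.choice(len(X), size=n_prime, replace=True)   # draw with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Final decision by majority vote over the component classifiers (integer class labels)."""
    votes = np.stack([m.predict(X) for m in models])           # shape (n_classifiers, n_samples)
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])

# usage with toy data:
# X = rng.normal(size=(200, 2)); y = (X[:, 0] + X[:, 1] > 0).astype(int)
# labels = bagging_predict(bagging_fit(X, y), X)
```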
  • 60. Unstable Algorithm and Bagging
    • Unstable algorithm
      • “small” changes in training data lead to significantly different classifiers and relatively “large” changes in accuracy
    • In general bagging improves recognition for unstable classifiers
      • Effectively averages over such discontinuities
      • No convincing theoretical derivations or simulation studies showing bagging helps for all unstable classifiers
  • 61. Boosting
    • Create the first classifier
      • With accuracy on the training set greater than average (weak learner)
    • Add a new classifier
      • Form an ensemble
      • Joint decision rule has much higher accuracy on the training set
    • Classification performance has been “boosted”
  • 62. Boosting Procedure
    • Trains successive component classifiers with a subset of the training data that is most “informative” given the current set of component classifiers
  • 63. Training Data and Weak Learner
  • 64. “Most Informative” Set Given C 1
    • Flip a fair coin
    • Heads
      • Select remaining samples from D
      • Present them to C 1 one by one until C 1 misclassifies a pattern
      • Add the misclassified pattern to D 2
    • Tails
      • Find a pattern that C 1 classifies correctly, and add it to D 2
  • 65. Third Data Set and Classifier C 3
    • Randomly select a remaining training pattern
    • Add the pattern if C 1 and C 2 disagree, otherwise ignore it
  • 66. Classification of a Test Pattern
    • If C 1 and C 2 agree, use their label
    • If they disagree, trust C 3
  • 67. Choosing n 1
    • For final vote, n 1 ~ n 2 ~ n 3 ~ n /3 is desired
    • Reasonable guess: n /3
    • Simple problem: n 2 << n 1
    • Difficult problem: n 2 too large
    • In practice we need to run the whole boosting procedure a few times
      • To use the full training set
      • To get roughly equal partitions of the training set
  • 68. AdaBoost
    • Adaptive boosting
    • Most popular version of basic boosting
    • Continue adding weak learners until some desired low training error has been achieved
  • 69. AdaBoost Algorithm
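The algorithm box on this slide did not survive extraction. The sketch below is a standard two-class AdaBoost loop with decision stumps as weak learners, offered as an illustration rather than the exact variant shown on the slide; labels are assumed to be +1 / -1 and scikit-learn supplies the stumps:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, k_max=50):
    """Binary AdaBoost with depth-1 trees (stumps); labels y must be +1 / -1."""
    n = len(X)
    W = np.full(n, 1.0 / n)                      # start with uniform pattern weights
    learners, alphas = [], []
    for _ in range(k_max):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=W)
        pred = stump.predict(X)
        err = W[pred != y].sum()                 # weighted training error
        if err == 0 or err >= 0.5:               # stop if perfect or no better than chance
            break
        alpha = 0.5 * np.log((1 - err) / err)    # weight of this component classifier
        W *= np.exp(-alpha * y * pred)           # raise weights of misclassified patterns
        W /= W.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    """Final discriminant: sign of the alpha-weighted vote of the weak learners."""
    return np.sign(sum(a * m.predict(X) for a, m in zip(alphas, learners)))
```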
  • 70. Final Decision
    • Discriminant function
  • 71. Ensemble Training Error
  • 72. AdaBoost vs. No Free Lunch Theorem
    • Boosting only improves classification if the component classifiers perform better than chance
      • Can not be guaranteed a priori
    • Exponential reduction in error on the training set does not ensure reduction of the off-training set error or generalization
    • Proven effective in many real-world applications
  • 73. Learning with Queries
    • Set of unlabeled patterns
    • Exists some (possibly costly) way of labeling any pattern
    • To determine which unlabeled patterns would be most informative if they were presented as a query to an oracle
    • Also called active learning or interactive learning
    • Can be refined further as cost-based learning
  • 74. Application Example
    • Design a classifier for handwritten numerals
    • Using unlabeled pixel images scanned from documents from a corpus too large to label every pattern
    • Human as the oracle
  • 75. Learning with Queries
    • Begin with a preliminary, weak classifier developed with a small set of labeled samples
    • Two related methods for selecting an informative pattern
      • Confidence-based query selection
      • Voting-based or committee-based query selection
  • 76. Selecting Most Informative Patterns
    • Confidence-based query selection
      • Pattern for which the two largest discriminant functions have nearly the same value
      • i.e., a pattern that lies near the current decision boundaries
    • Voting-based query selection
      • Pattern that yields the greatest disagreement among the k resulting category labels
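A minimal sketch of confidence-based query selection, assuming the current classifier exposes per-class discriminant values (here via a scikit-learn-style predict_proba):

```python
import numpy as np

def most_informative(clf, X_unlabeled, n_queries=10):
    """Confidence-based selection: pick unlabeled patterns whose two largest
    discriminant values are closest, i.e. patterns near the current decision boundaries."""
    g = clf.predict_proba(X_unlabeled)           # one discriminant value per class
    top2 = np.sort(g, axis=1)[:, -2:]            # two largest values for each pattern
    margin = top2[:, 1] - top2[:, 0]             # small margin = low confidence
    return np.argsort(margin)[:n_queries]        # indices to submit to the oracle
```

A voting-based variant would instead train k classifiers on resampled data and query the patterns on which their predicted labels disagree most.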
  • 77. Active Learning Example
  • 78. Arcing and Active Learning vs. IID Sampling
    • If we take a model of the true distribution but train it with a highly skewed distribution produced by active learning, the final classifier accuracy might be low
    • Resampling methods generally use techniques that do not attempt to model or fit the full category distributions
      • Not fitting parameters in a model
      • But instead seeking decision boundaries directly
  • 79. Arcing and Active Learning
    • As the number of component classifiers is increased, resampling, boosting, and related methods effectively broaden the class of implementable functions
    • They allow us to try to “match” the final classifier to the problem by indirectly adjusting the bias and variance
    • Can be used with arbitrary classification techniques
  • 80. Estimating the Generalization Rate
    • See if the classifier performs well enough to be useful
    • Compare its performance with that of a competing design
    • Requires making assumptions about the classifier or the problem or both
    • All the methods given here are heuristic
  • 81. Parametric Models
    • Compute from the assumed parametric model
    • Example: two-class multivariate normal case
      • Bhattacharyya or Chernoff bounds using estimated mean and covariance matrix
    • Problems
      • Overly optimistic
      • Always suspect the model
      • Error rate may be difficult to compute
  • 82. Simple Cross-Validation
    • Randomly split the set of labeled training samples D into a training set and a validation set
  • 83. m -Fold Cross-Validation
    • Training set is randomly divided into m disjoint sets of equal size n/m
    • The classifier is trained m times
      • Each time with a different set held out as a validation set
    • Estimated performance is the mean of these m errors
    • When m = n , it is in effect the leave-one-out approach
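A minimal sketch of m-fold cross-validation (the estimator is any object with fit/predict methods; names are illustrative):

```python
import numpy as np

def m_fold_cv(estimator, X, y, m=5, seed=0):
    """Train m times, each time holding out one of m disjoint folds as the validation set.
    Returns the mean validation error; m = n gives the leave-one-out estimate."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, m)               # m disjoint sets of (roughly) equal size n/m
    errors = []
    for i in range(m):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(m) if j != i])
        estimator.fit(X[train], y[train])
        errors.append(np.mean(estimator.predict(X[val]) != y[val]))
    return float(np.mean(errors))
```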
  • 84. Forms of Learning for Cross-Validation
    • neural networks of fixed topology
      • Number of epochs or presentations of the training set
      • Number of hidden units
    • Width of the Gaussian window in Parzen windows
    • Optimal k in the k -nearest neighbor classifier
  • 85. Portion γ of D as a Validation Set
    • Should be small
      • Validation set is used merely to know when to stop adjusting parameters
      • Training set is used to set large number of parameters or degrees of freedoms
    • Traditional default
      • Set γ = 0.1
      • Proven effective in many applications
  • 86. Anti-Cross-Validation
    • Cross-validation need not work on every problem
    • Anti-cross-validation
      • Halt when the validation error is the first local maximum
      • Must explore different values of γ
      • Possibly abandon the use of cross-validation if performance cannot be improved
  • 87. Estimation of Error Rate
    • Let p be the true and unknown error rate of the classifier
    • Assume k of the n ’ independent, randomly drawn test samples are misclassified; then k has the binomial distribution with parameters n ’ and p
    • Maximum-likelihood estimate for p : the observed error fraction k / n ’
  • 88. 95% Confidence Intervals for a Given Estimated p
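The interval chart on this slide did not survive extraction. As a rough stand-in, the sketch below uses the normal approximation to the binomial, p_hat ± 1.96·sqrt(p_hat(1 − p_hat)/n'), rather than the exact curves presumably plotted on the slide:

```python
import math

def error_rate_estimate(k, n_prime, z=1.96):
    """ML estimate of the true error rate p and an approximate 95% confidence interval
    (normal approximation to the binomial)."""
    p_hat = k / n_prime                                        # fraction of misclassified test samples
    half = z * math.sqrt(p_hat * (1 - p_hat) / n_prime)
    return p_hat, (max(0.0, p_hat - half), min(1.0, p_hat + half))

print(error_rate_estimate(k=8, n_prime=100))                   # e.g. 8 errors on 100 test samples
```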
  • 89. Jackknife Estimation of Classification Accuracy
    • Use leave-one-out approach
    • Obtain Jackknife estimate for the mean and variance of the leave-one-out accuracies
    • Use traditional hypothesis testing to see if one classifier is superior to another with statistical significance
  • 90. Jackknife Estimation of Classification Accuracy
  • 91. Bootstrap Estimation of Classification Accuracy
    • Train B classifiers, each with a different bootstrap data set
    • Test on other bootstrap data sets
    • Bootstrap estimate is the mean of these bootstrap accuracies
  • 92. Maximum-Likelihood Comparison (ML-II)
    • Also called maximum-likelihood selection
    • Find the maximum-likelihood parameters for each of the candidate models
    • Calculate the resulting likelihoods (evidences)
    • Choose the model with the largest likelihood
  • 93. Maximum-Likelihood Comparison
  • 94. Scientific Process *D. J. C. MacKay, “Bayesian interpolation,” Neural Computation, 4(3), 415-447, 1992
  • 95. Bayesian Model Comparison
  • 96. Concept of Occam Factor
  • 97. Concept of Occam Factor
    • An inherent bias toward simple models (those with a small prior parameter volume Δ 0 θ )
    • Models that are overly complex (large Δ 0 θ ) are automatically self-penalizing
  • 98. Evidence for Gaussian Parameters
  • 99. Bayesian Model Selection vs. No Free Lunch Theorem
    • Bayesian model selection
      • Ignore the prior over the space of models
      • Effectively assume that it is uniform
      • Not take into account how models correspond to underlying target functions
      • Usually corresponds to non-uniform prior over target functions
    • No Free Lunch Theorem
      • Allows that for some particular non-uniform prior there may be an algorithm that gives better than chance, or even optimal, results
  • 100. Error Rate as a Function of Number n of Training Samples
    • Classifiers trained with a small number of samples will not perform well on new data
    • Typical steps
      • Estimate unknown parameters from samples
      • Use these estimates to determine the classifier
      • Calculate the error rate for the resulting classifier
  • 101. Analytical Analysis
    • Case of two categories having equal prior probabilities
    • Partition feature space into some m disjoint cells, C 1 , . . ., C m
    • Conditional probabilities p ( x |  1 ) and p ( x |  2 ) do not vary appreciably within any cell
    • Need only know which cell x falls in
  • 102. Analytical Analysis
  • 103. Analytical Analysis
  • 104. Analytical Analysis
  • 105. Results of Simulation Experiments
  • 106. Discussions on Error Rate for Given n
    • For every curve involving finite n there is an optimal number of cells
    • At first, increasing the number of cells makes it easier to distinguish between the distributions represented by p and q
    • If the number of cells becomes too large, there will not be enough training patterns to fill them
      • Eventually number of patterns in most cells will be zero
  • 107. Discussions on Error Rate for Given n
    • For n = 500 , the minimal error rate occurs somewhere around m = 20
    • Form the cells by dividing each feature axis into l intervals
    • With d features, m = l ^ d
    • If l = 2 , using more than four or five binary features will lead to worse rather than better performance
  • 108. Test Errors vs. Number of Training Patterns
  • 109. Test and Training Error
  • 110. Power Law
  • 111. Sum and Difference of Test and Training Error
  • 112. Fraction of Dichotomies of n Points in d Dimensions That are Linear
  • 113. One-Dimensional Case: f ( n = 4, d = 1) = 0.5

      Labels | Linearly Separable? | Labels | Linearly Separable?
      -------+---------------------+--------+---------------------
       0000  |          X          |  1000  |          X
       0001  |          X          |  1001  |
       0010  |                     |  1010  |
       0011  |          X          |  1011  |
       0100  |                     |  1100  |          X
       0101  |                     |  1101  |
       0110  |                     |  1110  |          X
       0111  |          X          |  1111  |          X
  • 114. Capacity of a Separating Plane
    • Not until n is a sizable fraction of 2( d +1) does the problem begin to become difficult
    • Capacity of a hyperplane
      • At n = 2( d +1) , half of the possible dichotomies are still linear
    • Can not expect a linear classifier to “match” a problem, on average, if the dimension of the feature space is greater than n /2 - 1
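The fraction of linearly separable dichotomies has a closed form (Cover's function-counting result), f(n, d) = 2^(1−n) Σ_{i=0}^{d} C(n−1, i), which also reproduces the f(4, 1) = 0.5 of the one-dimensional table on the previous slide. A small sketch:

```python
from math import comb

def frac_linear_dichotomies(n, d):
    """Fraction of the 2**n dichotomies of n points in general position in d dimensions
    that are linearly separable (Cover's function-counting result)."""
    return 2 ** (1 - n) * sum(comb(n - 1, i) for i in range(min(d, n - 1) + 1))

print(frac_linear_dichotomies(4, 1))                 # 0.5, matching the one-dimensional table
for d in (1, 2, 5, 10):                              # at capacity n = 2(d+1), half are still linear
    print(d, frac_linear_dichotomies(2 * (d + 1), d))
```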
  • 115. Mixture-of-Expert Models
    • Classifiers whose decision is based on the outputs of component classifiers
    • Also called
      • Ensemble classifiers
      • Modular classifiers
      • Pooled classifiers
    • Useful if each component classifier is highly trained (“expert”) in a different region of the feature space
  • 116. Mixture Model for Producing Patterns
  • 117. Mixture-of-Experts Architecture
  • 118. Ensemble Classifiers
  • 119. Maximum-Likelihood Estimation
  • 120. Final Decision Rule
    • Choose the category corresponding to the maximum discriminant value after the pooling system
    • Winner-take-all method
      • Use the decision of the single component classifier that is the “most confident”, i.e., largest g rj
      • Suboptimal but simple
      • Works well if the component classifiers are experts in separate regions
  • 121. Component Classifiers without Discriminant Functions
    • Example
      • A KNN classifier (rank order)
      • A decision tree (label)
      • A neural network (analog value)
      • A rule-based system (label)
  • 122. Heuristics to Convert Outputs to Discriminant Values
  • 123. Illustration Examples

      Analog | g i   | Rank Order | g i        | One-of- c | g i
      -------+-------+------------+------------+-----------+-----
        0.1  | 0.111 |    4th     | 3/21=0.143 |     0     | 0.0
        0.2  | 0.129 |    2nd     | 5/21=0.238 |     0     | 0.0
        0.3  | 0.143 |    1st     | 6/21=0.286 |     0     | 0.0
        0.9  | 0.260 |    5th     | 2/21=0.095 |     0     | 0.0
        0.6  | 0.193 |    6th     | 1/21=0.048 |     1     | 1.0
        0.4  | 0.158 |    3rd     | 4/21=0.190 |     0     | 0.0
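A minimal sketch of the three conversions in the table: a softmax for analog outputs, linear rank weighting for rank-order outputs, and pass-through for one-of-c outputs. Treating the analog conversion as a softmax is an assumption consistent with most (not all) of the tabulated values:

```python
import numpy as np

def analog_to_g(values):
    """Softmax normalization of analog outputs (assumed conversion; see note above)."""
    e = np.exp(np.asarray(values, dtype=float))
    return e / e.sum()

def rank_to_g(ranks):
    """Linear rank weighting: best rank gets c points, worst gets 1, then normalize."""
    c = len(ranks)
    scores = c + 1 - np.asarray(ranks)           # 1st -> c, 2nd -> c-1, ...
    return scores / scores.sum()                 # e.g. 1st -> 6/21 when c = 6

def one_of_c_to_g(labels):
    """One-of-c (hard label) outputs already serve as discriminant values."""
    return np.asarray(labels, dtype=float)

print(analog_to_g([0.1, 0.2, 0.3, 0.9, 0.6, 0.4]))
print(rank_to_g([4, 2, 1, 5, 6, 3]))             # reproduces the 3/21, 5/21, ... column
print(one_of_c_to_g([0, 0, 0, 0, 1, 0]))
```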
