    MachineLearning.ppt Presentation Transcript

    • Algorithm-Independent Machine Learning Shyh-Kang Jeng Department of Electrical Engineering/ Graduate Institute of Communication/ Graduate Institute of Networking and Multimedia, National Taiwan University
    • Some Fundamental Problems
      • Which algorithm is “best”?
      • Are there any reasons to favor one algorithm over another?
      • Is “Occam’s razor” really so evident?
      • Do simpler or “smoother” classifiers generalize better? If so, why?
      • Are there fundamental “conservation” or “constraint” laws other than Bayes error rate?
    • Meaning of “Algorithm-Independent”
      • Mathematical foundations that do not depend upon the particular classifier or learning algorithm used
        • e.g., bias and variance concept
      • Techniques that can be used in conjunction with different learning algorithms, or provide guidance in their use
        • e.g., cross-validation and resampling techniques
    • Roadmap
      • No pattern classification method is inherently superior to any other
      • Ways to quantify and adjust the “match” between a learning algorithm and the problem it addresses
      • Estimation of accuracies and comparison of different classifiers with certain assumptions
      • Methods for integrating component classifiers
    • Generalization Performance by Off-Training Set Error
      • Consider a two-category problem
      • Training set D
      • Training patterns x i
      • y i = 1 or -1 for i = 1, . . ., n is generated by unknown target function F ( x ) to be learned
      • F ( x ) is often with a random component
        • The same input could lead to different categories
        • Giving non-zero Bayes error
    • Generalization Performance by Off-Training Set Error
      • Let H be the (discrete) set of hypotheses, or sets of parameters to be learned
      • A particular h belongs to H
        • quantized weights in neural network
        • Parameters q in a functional model
        • Sets of decisions in a tree
      • P ( h ) : prior probability that the algorithm will produce hypothesis h after training
    • Generalization Performance by Off-Training Set Error
      • P ( h | D ) : probability the algorithm will yield h when trained on data D
        • Nearest-neighbor and decision tree: non-zero only for a single hypothesis
        • Neural network: can be a broad distribution
      • E : error for zero-one or other loss function
    • Generalization Performance by Off-Training Set Error
      • A natural measure
      • Expected off-training-set classification error for the k th candidate learning algorithm
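      • In symbols (a sketch of the usual formulation, with \delta(\cdot,\cdot) the Kronecker delta): E_k(E \mid F, n) = \sum_{x \notin D} P(x)\,[1 - \delta(F(x), h(x))]\,P_k(h(x) \mid D)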
    • No Free Lunch Theorem
    • No Free Lunch Theorem
      • For any two algorithms
      • No matter how clever in choosing a “good” algorithm and a “bad” algorithm, if all target functions are equally likely, the “good” algorithm will not outperform the “bad” one
      • There is at least one target function for which random guessing is a better algorithm
    • No Free Lunch Theorem
      • Even if we know D , averaged over all target functions no algorithm yields an off-training set error that is superior to any other
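      • Formally (a sketch of the standard statement, uniformly averaged over all target functions F): \sum_{F} [E_1(E \mid F, n) - E_2(E \mid F, n)] = 0, and for any fixed training set D, \sum_{F} [E_1(E \mid F, D) - E_2(E \mid F, D)] = 0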
    • Example 1: No Free Lunch for Binary Data (D = training set, rows 000-010)
              x      F     h 1    h 2
        D     000    1     1      1
        D     001   -1    -1     -1
        D     010    1     1      1
              011   -1     1     -1
              100    1     1     -1
              101   -1     1     -1
              110    1     1     -1
              111    1     1     -1
    • No Free Lunch Theorem
    • Conservation in Generalization
      • Can not achieve positive performance on some problems without getting an equal and opposite amount of negative performance on other problems
      • Can trade performance on problems we do not expect to encounter with those that we do expect to encounter
      • It is the assumptions about the learning domains that are relevant
    • Ugly Duckling Theorem
      • In the absence of assumptions there is no privileged or “best” feature representation
      • Even the notion of similarity between patterns depends implicitly on assumptions that may or may not be correct
    • Venn Diagram Representation of Features as Predicates
    • Rank of a Predicate
      • Number of the simplest or indivisible elements it contains
      • Example: rank r = 1
        • x 1 : f 1 AND NOT f 2
        • x 2 : f 1 AND f 2
        • x 3 : f 2 AND NOT f 1
        • x 4 : NOT ( f 1 OR f 2 )
        • C (4,1) = 4 predicates
    • Examples of Rank of a Predicate
      • Rank r = 2
        • x 1 OR x 2 : f 1
        • x 1 OR x 3 : f 1 XOR f 2
        • x 1 OR x 4 : NOT f 2
        • x 2 OR x 3 : f 2
        • x 2 OR x 4 : ( f 1 AND f 2 ) OR NOT ( f 1 OR f 2 )
        • x 3 OR x 4 : NOT f 1
        • C (4, 2) = 6 predicates
    • Examples of Rank of a Predicate
      • Rank r = 3
        • x 1 OR x 2 OR x 3 : f 1 OR f 2
        • x 1 OR x 2 OR x 4 : f 1 OR NOT f 2
        • x 1 OR x 3 OR x 4 : NOT ( f 1 AND f 2 )
        • x 2 OR x 3 OR x 4 : f 2 OR NOT f 1
        • C (4, 3) = 4 predicates
    • Total Number of Predicates in Absence of Constraints
      • Let d be the number of regions in the Venn diagrams (i.e., number of distinctive patterns, or number of possible values determined by combinations of the features)
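      • Counting predicates of every rank (a sketch, with no constraints): \sum_{r=1}^{d} \binom{d}{r} = 2^{d} - 1 nonempty predicates (2^{d} if the empty predicate of rank 0 is also counted)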
    • A Measure of Similarity in Absence of Prior Information
      • Number of features or attributes shared by two patterns
        • Concept difficulties
          • e.g., blind_in_right_eye and blind_in_left_eye , (1,0) more similar to (1,1) and (0,0) than to (0,1)
        • There are always multiple ways to represent vectors of attributes
          • e.g. blind_in_right_eye and same_in_both_eyes
        • No principled reason to prefer one of these representations over another
    • A Plausible Measure of Similarity in Absence of Prior Information
      • Number of predicates the patterns share
      • Consider two distinct patterns
        • No predicates of rank 1 are shared
        • 1 predicate of rank 2 is shared
        • C ( d -2, 1) predicates of rank 3 are shared
        • C ( d -2, r -2) predicates of rank r are shared
        • Total number of predicates shared
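        • Summing over all ranks (a sketch of the count just described): \sum_{r=2}^{d} \binom{d-2}{r-2} = 2^{d-2}, a constant independent of which two distinct patterns are chosen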
    • Ugly Duckling Theorem
      • Given a finite set of predicates that enables us to distinguish any two patterns
      • The number of predicates shared by any two such patterns is constant and independent of the choice of those patterns
      • If pattern similarity is based on the total number of predicates shared, any two patterns are “equally similar”
    • Ugly Duckling Theorem
      • No problem-independent or privileged or “best” set of features or feature attributes
      • Also applies to a continuous feature spaces
    • Minimum Description Length (MDL)
      • Find some irreducible, smallest representation (“signal”) of all members of a category
      • All variation among the individual patterns is then “noise”
      • By simplifying recognizers appropriately, the signal can be retained while the noise is ignored
    • Algorithm Complexity (Kolmogorov Complexity)
      • Kolmogorov complexity of a binary string x
        • Defined on an abstract computer (Turing machine) U
        • The length of the shortest (binary) program string y that, without additional data, computes the string x and halts
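      • In symbols (a sketch of the standard definition): K(x) = \min_{\{\, y \,:\, U(y) = x \,\}} |y|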
    • Algorithm Complexity Example
      • Suppose x consists solely of n 1 s
      • Use some fixed number of bits k to specify a loop for printing a string of 1 s
      • Need log 2 n more bits to specify the iteration number n , the condition for halting
      • Thus K ( x ) = O (log 2 n )
    • Algorithm Complexity Example
      • Constant π = 11.00100100001111110110101010001…
      • The shortest program is one that can produce any arbitrarily large number of consecutive digits of π
      • Thus K ( x ) = O ( 1 )
    • Algorithm Complexity Example
      • Assume that x is a “truly” random binary string
      • Can not be expressed as a shorter string
      • Thus K ( x ) = O (| x |)
    • Minimum Description Length (MDL) Principle
      • Minimize the sum of the model’s algorithmic complexity and the description of the training data with respect to that model
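      • In symbols (a sketch, using the algorithmic complexity K above): h^{*} = \arg\min_{h} \big[ K(h) + K(D \text{ using } h) \big]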
    • An Application of MDL Principle
      • For decision-tree classifiers, a model h specifies the tree and the decisions at the nodes
      • Algorithmic complexity of h is proportional to the number of nodes
      • Complexity of data is expressed in terms of the entropy (in bits) of the data
      • Tree-pruning based on entropy is equivalent to the MDL principle
    • Convergence of MDL Classifiers
      • MDL classifiers are guaranteed to converge to the ideal or true model in the limit of more and more data
      • Can not prove that the MDL principle leads to a superior performance in the finite data case
      • Still consistent with the no free lunch principle
    • Bayesian Perspective of MDL Principle
    • Overfitting Avoidance
      • Avoiding overfitting or minimizing description length are not inherently beneficial
      • Amount to a preference over the forms or parameters of classifiers
      • Beneficial only if they match their problems
      • There are problems that overfitting avoidance leads to worse performances
    • Explanation of Success of Occam’s Razor
      • Through evolution and strong selection pressure on our neurons
      • Likely to ignore problems for which Occam’s razor does not hold
      • Researchers naturally develop simple algorithms before more complex ones – a bias imposed by methodology
      • Principle of satisficing: creating an adequate though possibly nonoptimal solution
    • Bias and Variance
      • Ways to measure the “match” or “alignment” of the learning algorithm to the classification problem
      • Bias
        • Accuracy or quality of the match
        • High bias implies a poor match
      • Variance
        • Precision or specificity of the match
        • High variance implies a weak match
    • Bias and Variance for Regression
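      • The standard decomposition (a sketch; g ( x ; D ) estimates F ( x ), with the expectation taken over training sets D ): E_D[(g(x;D) - F(x))^2] = (E_D[g(x;D)] - F(x))^2 + E_D[(g(x;D) - E_D[g(x;D)])^2], i.e., bias^2 + variance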
    • Bias-Variance Dilemma
    • Bias-Variance Dilemma
      • Given a target function
      • Model has many parameters
        • Generally low bias
        • Fits data well
        • Yields high variance
      • Model has few parameters
        • Generally high bias
        • May not fit data well
        • The fit does not change much for different data sets (low variance)
    • Bias-Variance Dilemma
      • Best way to get low bias and low variance
        • Have prior information about the target function
      • Virtually never get zero bias and zero variance
        • Only one learning problem to solve and the answer is already known
      • A large amount of training data will yield improved performance as long as the model is sufficiently general to represent the target function
    • Bias and Variance for Classification
      • Reference
        • J. H. Friedman, “On bias, variance, 0/1-loss, and the curse of dimensionality,” Data Mining and Knowledge Discovery , vol. 1, no. 1, pp. 55-77, 1997.
    • Bias and Variance for Classification
      • Two-category problem
      • Target function
      • F ( x )=Pr[ y =1| x ]=1-Pr[ y =0| x ]
      • Discriminant function
      • y d = F ( x ) + ε
      • ε : zero-mean, centered binomially distributed random variable
      • F ( x ) = E [ y d | x ]
    • Bias and Variance for Classification
      • Find estimate g ( x ; D ) to minimize
      • E D [( g ( x ; D )- y d ) 2 ]
      • The estimated g ( x ; D ) can be used to classify x
      • The bias and variance concept for regression can be applied to g ( x ; D ) as an estimate of F ( x )
      • However, this is not related to classification error directly
    • Bias and Variance for Classification
    • Bias and Variance for Classification
    • Bias and Variance for Classification
    • Bias and Variance for Classification
    • Bias and Variance for Classification
      • Sign of the boundary bias affects the role of variance in the error
      • Low variance is generally important for accurate classification when the sign of the boundary bias is positive
      • Low boundary bias need not result in lower error rate
      • Simple methods often have lower variance, and need not be inferior to more flexible methods
    • Error rates and optimal K vs. N for d = 20 in KNN
    • Estimation and classification error vs. d for N = 12800 in KNN
    • Boundary Bias-Variance Trade-Off
    • Leave-One-Out Method (Jackknife)
    • Generalization to Estimates of Other Statistics
      • Estimator of other statistics
        • median, 25th percentile, mode, etc.
      • Leave-one-out estimate
      • Jackknife estimate and its related variance
    • Jackknife Bias Estimate
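      • In symbols (a sketch of the standard jackknife estimates; \hat{\theta}_{(i)} is the estimate with point i left out and \hat{\theta}_{(\cdot)} their mean): bias_{jack} = (n-1)(\hat{\theta}_{(\cdot)} - \hat{\theta}), \quad Var_{jack}[\hat{\theta}] = \frac{n-1}{n} \sum_{i=1}^{n} (\hat{\theta}_{(i)} - \hat{\theta}_{(\cdot)})^2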
    • Example 2 Jackknife for Mode
    • Bootstrap
      • Randomly selecting n points from the training data set D , with replacement
      • Repeat this process independently B times to yield B bootstrap data set, treated as independent sets
      • Bootstrap estimate of a statistic θ
    • Bootstrap Bias and Variance Estimate
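      • In symbols (a sketch; \hat{\theta}^{*(b)} is the estimate from the b-th bootstrap set and \hat{\theta}^{*(\cdot)} their mean): \hat{\theta}^{*(\cdot)} = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}^{*(b)}, \quad bias_{boot} = \hat{\theta}^{*(\cdot)} - \hat{\theta}, \quad Var_{boot}[\hat{\theta}] = \frac{1}{B} \sum_{b=1}^{B} (\hat{\theta}^{*(b)} - \hat{\theta}^{*(\cdot)})^2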
    • Properties of Bootstrap Estimates
      • The larger the number B of bootstrap samples, the more satisfactory is the estimate of a statistic and its variance
      • B can be adjusted to the computational resources
        • Jackknife estimate requires exactly n repetitions
    • Bagging
      • Arcing
        • Adaptive Reweighting and Combining
        • Reusing or selecting data in order to improve classification, e.g., AdaBoost
      • Bagging
        • Bootstrap aggregation
        • Multiple versions of D , by drawing n’ < n samples from D with replacement
        • Each set trains a component classifier
        • Final decision is based on vote of each component classifier
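      • A minimal Python sketch of the bagging idea (illustrative only; the decision-tree base learner and the choices of B and n' are assumptions, not from the slides):

          import numpy as np
          from sklearn.tree import DecisionTreeClassifier  # stand-in component classifier

          def bagging_fit(X, y, B=25, n_prime=None, rng=None):
              """Train B component classifiers, each on n' samples drawn from D with replacement."""
              gen = np.random.default_rng(rng)
              n = len(X)
              n_prime = n_prime or int(0.8 * n)              # n' < n
              ensemble = []
              for _ in range(B):
                  idx = gen.integers(0, n, size=n_prime)     # bootstrap draw, with replacement
                  ensemble.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
              return ensemble

          def bagging_predict(ensemble, X):
              """Final decision: majority vote of the component classifiers."""
              votes = np.stack([clf.predict(X) for clf in ensemble])  # shape (B, n_samples)
              # integer class labels are assumed, so a per-column bincount gives the vote winner
              return np.apply_along_axis(lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)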
    • Unstable Algorithm and Bagging
      • Unstable algorithm
        • “small” changes in training data lead to significantly different classifiers and relatively “large” changes in accuracy
      • In general bagging improves recognition for unstable classifiers
        • Effectively averages over such discontinuities
        • No convincing theoretical derivations or simulation studies showing bagging helps for all unstable classifiers
    • Boosting
      • Create the first classifier
        • With accuracy on the training set greater than average (weak learner)
      • Add a new classifier
        • Form an ensemble
        • Joint decision rule has much higher accuracy on the training set
      • Classification performance has been “boosted”
    • Boosting Procedure
      • Trains successive component classifiers with a subset of the training data that is most “informative” given the current set of component classifiers
    • Training Data and Weak Learner
      • Flip a fair coin
      • Head
        • Select remaining samples from D
        • Present them to C 1 one by one until C 1 misclassifies a pattern
        • Add the misclassified pattern to D 2
      • Tail
        • Find a pattern that C 1 classifies correctly and add it to D 2
      • D 2 is the “most informative” set given C 1
    • Third Data Set and Classifier C 3
      • Randomly select a remaining training pattern
      • Add the pattern if C 1 and C 2 disagree, otherwise ignore it
    • Classification of a Test Pattern
      • If C 1 and C 2 agree, use their label
      • If they disagree, trust C 3
    • Choosing n 1
      • For final vote, n 1 ~ n 2 ~ n 3 ~ n /3 is desired
      • Reasonable guess: n /3
      • Simple problem: n 2 << n 1
      • Difficult problem: n 2 too large
      • In practice we need to run the whole boosting procedure a few times
        • To use the full training set
        • To get roughly equal partitions of the training set
    • AdaBoost
      • Adaptive boosting
      • Most popular version of basic boosting
      • Continue adding weak learners until some desired low training error has been achieved
    • AdaBoost Algorithm
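      • A sketch of the standard weight update (assuming labels y_i \in \{-1, +1\}, component classifier h_k with weighted training error E_k, and normalization Z_k so the weights sum to one): \alpha_k = \tfrac{1}{2} \ln \frac{1 - E_k}{E_k}, \quad W_{k+1}(i) = \frac{W_k(i)}{Z_k}\, e^{-\alpha_k y_i h_k(x_i)}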
    • Final Decision
      • Discriminant function
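        • In symbols (a sketch, using the \alpha_k and h_k above): g(x) = \sum_{k=1}^{k_{max}} \alpha_k h_k(x), with the sign of g(x) giving the category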
    • Ensemble Training Error
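      • A sketch of the standard bound on the ensemble training error (in terms of the weighted errors E_k above): E \le \prod_{k} 2\sqrt{E_k (1 - E_k)}, which falls exponentially fast as long as every E_k < 1/2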
    • AdaBoost vs. No Free Lunch Theorem
      • Boosting only improves classification if the component classifiers perform better than chance
        • Can not be guaranteed a priori
      • Exponential reduction in error on the training set does not ensure reduction of the off-training set error or generalization
      • Proven effective in many real-world applications
    • Learning with Queries
      • Set of unlabeled patterns
      • Exists some (possibly costly) way of labeling any pattern
      • To determine which unlabeled patterns would be most informative if they were presented as a query to an oracle
      • Also called active learning or interactive learning
      • Can be refined further as cost-based learning
    • Application Example
      • Design a classifier for handwritten numerals
      • Using unlabeled pixel images scanned from documents from a corpus too large to label every pattern
      • Human as the oracle
    • Learning with Queries
      • Begin with a preliminary, weak classifier developed with a small set of labeled samples
      • Two related methods for selecting an informative pattern
        • Confidence-based query selection
        • Voting-based or committee-based query selection
    • Selecting Most Informative Patterns
      • Confidence-based query selection
        • Pattern for which the two largest discriminant functions have nearly the same value
        • i.e., patterns lie near the current decision boundaries
      • Voting-based query selection
        • Pattern that yields the greatest disagreement among the k resulting category labels
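      • A minimal Python sketch of confidence-based selection (illustrative; G is assumed to hold the current classifier's discriminant values, one row per unlabeled pattern, one column per category):

          import numpy as np

          def most_informative_index(G):
              """Return the index of the unlabeled pattern whose two largest
              discriminant values are closest, i.e., the pattern lying nearest
              the current decision boundaries."""
              top2 = np.sort(G, axis=1)[:, -2:]     # two largest discriminants per pattern
              margin = top2[:, 1] - top2[:, 0]      # small margin = low confidence
              return int(np.argmin(margin))         # query this pattern's label from the oracle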
    • Active Learning Example
    • Arcing and Active Learning vs. IID Sampling
      • If we take a model of the true distribution and train it with the highly skewed distribution produced by active learning, the final classifier accuracy might be low
      • Resampling methods generally do not attempt to model or fit the full category distributions
        • Not fitting parameters in a model
        • But instead seeking decision boundaries directly
    • Arcing and Active Learning
      • As the number of component classifiers is increased, resampling, boosting, and related methods effectively broaden the class of implementable functions
      • Allow us to “match” the final classifier to the problem by indirectly adjusting the bias and variance
      • Can be used with arbitrary classification techniques
    • Estimating the Generalization Rate
      • See if the classifier performs well enough to be useful
      • Compare its performance with that of a competing design
      • Requires making assumptions about the classifier or the problem or both
      • All the methods given here are heuristic
    • Parametric Models
      • Compute from the assumed parametric model
      • Example: two-class multivariate normal case
        • Bhattacharyya or Chernoff bounds using estimated mean and covariance matrix
      • Problems
        • Overly optimistic
        • Always suspect the model
        • Error rate may be difficult to compute
    • Simple Cross-Validation
      • Randomly split the set of labeled training samples D into a training set and a validation set
    • m -Fold Cross-Validation
      • Training set is randomly divided into m disjoint sets of equal size n/m
      • The classifier is trained m times
        • Each time with a different set held out as a validation set
      • Estimated performance is the mean of these m errors
      • When m = n , it is in effect the leave-one-out approach
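      • A minimal Python sketch of m-fold cross-validation (illustrative; the k-nearest-neighbor base classifier is an arbitrary stand-in, not prescribed by the slides):

          import numpy as np
          from sklearn.neighbors import KNeighborsClassifier

          def m_fold_cv_error(X, y, m=10, seed=0):
              """Train m times, each time holding out one fold as the validation set,
              and return the mean validation error over the m runs."""
              idx = np.random.default_rng(seed).permutation(len(X))
              folds = np.array_split(idx, m)                    # m disjoint sets of size ~ n/m
              errors = []
              for k in range(m):
                  val = folds[k]
                  train = np.concatenate([folds[j] for j in range(m) if j != k])
                  clf = KNeighborsClassifier(n_neighbors=3).fit(X[train], y[train])
                  errors.append(np.mean(clf.predict(X[val]) != y[val]))
              return float(np.mean(errors))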
    • Forms of Learning for Cross-Validation
      • neural networks of fixed topology
        • Number of epochs or presentations of the training set
        • Number of hidden units
      • Width of the Gaussian window in Parzen windows
      • Optimal k in the k -nearest neighbor classifier
    • Portion γ of D as a Validation Set
      • Should be small
        • Validation set is used merely to know when to stop adjusting parameters
        • Training set is used to set large number of parameters or degrees of freedoms
      • Traditional default
        • Set γ = 0.1
        • Proven effective in many applications
    • Anti-Cross-Validation
      • Cross-validation need not work on every problem
      • Anti-cross-validation
        • Halt when the validation error is the first local maximum
        • Must explore different values of γ
        • Possibly abandon the use of cross-validation if performance cannot be improved
    • Estimation of Error Rate
      • Let p be the true and unknown error rate of the classifier
      • Assume k of the n’ independent, randomly drawn test samples are misclassified, then k has the binomial distribution
      • Maximum-likelihood estimate for p
    • 95% Confidence Intervals for a Given Estimated p
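      • In symbols (a sketch, assuming the binomial model above and, for the interval, a large-sample normal approximation): P(k) = \binom{n'}{k} p^{k} (1-p)^{n'-k}, \quad \hat{p} = \frac{k}{n'}, \quad \hat{p} \pm 1.96 \sqrt{\frac{\hat{p}(1-\hat{p})}{n'}} as an approximate 95% interval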
    • Jackknife Estimation of Classification Accuracy
      • Use leave-one-out approach
      • Obtain Jackknife estimate for the mean and variance of the leave-one-out accuracies
      • Use traditional hypothesis testing to see if one classifier is superior to another with statistical significance
    • Jackknife Estimation of Classification Accuracy
    • Bootstrap Estimation of Classification Accuracy
      • Train B classifiers, each with a different bootstrap data set
      • Test on other bootstrap data sets
      • Bootstrap estimate is the mean of these bootstrap accuracies
    • Maximum-Likelihood Comparison (ML-II)
      • Also called maximum-likelihood selection
      • Find the maximum-likelihood parameters for each of the candidate models
      • Calculate the resulting likelihoods (evidences)
      • Choose the model with the largest likelihood
    • Maximum-Likelihood Comparison
    • Scientific Process (D. J. C. MacKay, “Bayesian interpolation,” Neural Computation , vol. 4, no. 3, pp. 415-447, 1992)
    • Bayesian Model Comparison
    • Concept of Occam Factor
    • Concept of Occam Factor
      • An inherent bias toward simple models (small prior parameter volume Δ⁰θ )
      • Models that are overly complex (large Δ⁰θ ) are automatically self-penalizing
    • Evidence for Gaussian Parameters
    • Bayesian Model Selection vs. No Free Lunch Theorem
      • Bayesian model selection
        • Ignore the prior over the space of models
        • Effectively assume that it is uniform
        • Not take into account how models correspond to underlying target functions
        • Usually corresponds to non-uniform prior over target functions
      • No Free Lunch Theorem
        • Allows that for some particular non-uniform prior there may be an algorithm that gives better than chance, or even optimal, results
    • Error Rate as a Function of Number n of Training Samples
      • Classifiers trained with a small number of samples will not perform well on new data
      • Typical steps
        • Estimate unknown parameters from samples
        • Use these estimates to determine the classifier
        • Calculate the error rate for the resulting classifier
    • Analytical Analysis
      • Case of two categories having equal prior probabilities
      • Partition feature space into some m disjoint cells, C 1 , . . ., C m
      • Conditional probabilities p ( x | ω 1 ) and p ( x | ω 2 ) do not vary appreciably within any cell
      • Need only know which cell x falls in
    • Analytical Analysis
    • Analytical Analysis
    • Analytical Analysis
    • Results of Simulation Experiments
    • Discussions on Error Rate for Given n
      • For every curve involving finite n there is an optimal number of cells
      • At first, increasing the number of cells makes it easier to distinguish between the distributions represented by p and q
      • If the number of cells becomes too large, there will not be enough training patterns to fill them
        • Eventually number of patterns in most cells will be zero
    • Discussions on Error Rate for Given n
      • For n = 500 , the minimal error rate occurs somewhere around m = 20
      • Form the cells by dividing each feature axis into l intervals
      • With d features, m = l d
      • If l = 2 , using more than four or five binary features will lead to worse rather than better performance
    • Test Errors vs. Number of Training Patterns
    • Test and Training Error
    • Power Law
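      • A commonly observed empirical form (a sketch; a, b, and \alpha > 0 are problem-dependent constants): E_{test}(n) \approx a + \frac{b}{n^{\alpha}}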
    • Sum and Difference of Test and Training Error
    • Fraction of Dichotomies of n Points in d Dimensions That are Linear
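      • The standard counting result (a sketch, for n points in general position in d dimensions): f(n, d) = 1 for n \le d + 1, and f(n, d) = \frac{2}{2^{n}} \sum_{i=0}^{d} \binom{n-1}{i} for n > d + 1; e.g., f(4, 1) = \frac{2}{16}(1 + 3) = 0.5, matching the one-dimensional table below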
    • One-Dimensional Case: f ( n = 4, d = 1) = 0.5
        Labels   Linearly Separable?      Labels   Linearly Separable?
        0000     X                        1000     X
        0001     X                        1001
        0010                              1010
        0011     X                        1011
        0100                              1100     X
        0101                              1101
        0110                              1110     X
        0111     X                        1111     X
    • Capacity of a Separating Plane
      • Not until n is a sizable fraction of 2( d +1) does the problem begin to become difficult
      • Capacity of a hyperplane
        • At n = 2( d +1) , half of the possible dichotomies are still linear
      • Can not expect a linear classifier to “match” a problem, on average, if the dimension of the feature space is greater than n /2 - 1
    • Mixture-of-Expert Models
      • Classifiers whose decision is based on the outputs of component classifiers
      • Also called
        • Ensemble classifiers
        • Modular classifiers
        • Pooled classifiers
      • Useful if each component classifier is highly trained (“expert”) in a different region of the feature space
    • Mixture Model for Producing Patterns
    • Mixture-of-Experts Architecture
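      • A sketch of the usual form (assuming a gating network whose softmax weights w_r ( x ) pool the k expert outputs g_r ( x )): g(x) = \sum_{r=1}^{k} w_r(x)\, g_r(x), \quad w_r(x) = \frac{e^{u_r(x)}}{\sum_{r'} e^{u_{r'}(x)}}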
    • Ensemble Classifiers
    • Maximum-Likelihood Estimation
    • Final Decision Rule
      • Choose the category corresponding to the maximum discriminant value after the pooling system
      • Winner-take-all method
        • Use the decision of the single component classifier that is the “most confident”, i.e., largest g rj
        • Suboptimal but simple
        • Works well if the component classifiers are experts in separate regions
    • Component Classifiers without Discriminant Functions
      • Example
        • A KNN classifier (rank order)
        • A decision tree (label)
        • A neural network (analog value)
        • A rule-based system (label)
    • Heuristics to Convert Outputs to Discriminant Values
    • Illustration Examples (outputs of three component classifiers for c = 6 categories, and the discriminant values g i derived from them)
        Analog   g i      Rank Order   g i             One-of- c   g i
        0.1      0.111    4th          3/21 = 0.143    0           0.0
        0.2      0.129    2nd          5/21 = 0.238    0           0.0
        0.3      0.143    1st          6/21 = 0.286    0           0.0
        0.9      0.260    5th          2/21 = 0.095    0           0.0
        0.6      0.193    6th          1/21 = 0.048    1           1.0
        0.4      0.158    3rd          4/21 = 0.190    0           0.0
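      • A sketch of plausible conversions consistent with the table (assuming c = 6): softmax for the analog outputs, normalized inverse rank for the rank order, and the one-of-c output used directly: \tilde{g}_i = \frac{e^{g_i}}{\sum_{j=1}^{c} e^{g_j}} for analog outputs, and \tilde{g}_i = \frac{c - r_i + 1}{c(c+1)/2} for rank r_i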