Machine Learning: An Introduction Fu Chang



  1. 1. Machine Learning: An Introduction. Fu Chang, Institute of Information Science, Academia Sinica. 2788-3799 ext. 1819, [email_address]
  2. 2. Machine Learning: as a Tool for Classifying Patterns <ul><li>What is the difference between you and me? </li></ul><ul><li>Tentative answer 1: </li></ul><ul><ul><li>You are pretty, and I am ugly </li></ul></ul><ul><li>A vague answer, not very useful </li></ul><ul><li>Tentative answer 2: </li></ul><ul><ul><li>You have a tiny mouth, and I have a big one </li></ul></ul><ul><li>More useful, but what if we are viewed from the side? </li></ul><ul><li>In general, can we use a single feature difference to distinguish one pattern from another? </li></ul>
  3. 3. Old Philosophical Debates <ul><li>What makes a cup a cup? </li></ul><ul><li>Philosophical views </li></ul><ul><ul><li>Plato: the ideal type </li></ul></ul><ul><ul><li>Aristotle: the collection of all cups </li></ul></ul><ul><ul><li>Wittgenstein: family resemblance </li></ul></ul>
  4. 4. Machine Learning Viewpoint <ul><li>Represent each object with a set of features: </li></ul><ul><ul><li>Mouth, nose, eyes, etc., viewed from the front, the right side, the left side, etc. </li></ul></ul><ul><li>Each pattern is taken as a conglomeration of sample points or feature vectors </li></ul>
  5. 5. Patterns as Conglomerations of Sample Points [Figure: two types of sample points, labeled A and B]
  6. 6. Types of Separation <ul><li>Left panel: positive separation between heterogeneous data points </li></ul><ul><li>Right panel: a margin between them </li></ul>
  7. 7. ML Viewpoint (Cont’d) <ul><li>Training phase: </li></ul><ul><ul><li>Want to learn pattern differences among conglomerations of labeled samples </li></ul></ul><ul><ul><li>Have to describe the differences by means of a model: probability distribution, prototype, neural network, etc. </li></ul></ul><ul><ul><li>Have to estimate the parameters involved in the model </li></ul></ul><ul><li>Testing phase: </li></ul><ul><ul><li>Have to classify at acceptable accuracy rates </li></ul></ul>
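The two phases can be illustrated with a minimal sketch of a prototype classifier, one of the model types named on the slide. It assumes the simplest possible prototype (the mean feature vector of each class) and Euclidean distance; the function names and the toy data are hypothetical and are not taken from the slides.

```python
from collections import defaultdict
from math import dist

def train_prototypes(samples, labels):
    """Training phase: estimate one prototype (the mean vector) per class."""
    grouped = defaultdict(list)
    for x, y in zip(samples, labels):
        grouped[y].append(x)
    return {
        y: tuple(sum(col) / len(xs) for col in zip(*xs))
        for y, xs in grouped.items()
    }

def classify(prototypes, x):
    """Testing phase: assign x to the class whose prototype is nearest."""
    return min(prototypes, key=lambda y: dist(prototypes[y], x))

def accuracy(prototypes, samples, labels):
    """Fraction of samples whose predicted class matches the given label."""
    hits = sum(classify(prototypes, x) == y for x, y in zip(samples, labels))
    return hits / len(samples)

# Hypothetical labeled training samples for two class types, A and B.
train_x = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
train_y = ["A", "A", "A", "B", "B", "B"]
prototypes = train_prototypes(train_x, train_y)

# Hypothetical held-out test samples.
test_x = [(1.2, 1.4), (6.8, 6.1)]
test_y = ["A", "B"]
print(prototypes)
print(accuracy(prototypes, test_x, test_y))
```

Here the model parameters are just the class means. Richer models (probability distributions, neural networks) follow the same train-then-test pattern with more parameters to estimate.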
  8. 8. Models <ul><li>Prototype classifiers </li></ul><ul><li>Neural networks </li></ul><ul><li>Support vector machines </li></ul><ul><li>Classification and regression tree </li></ul><ul><li>AdaBoost </li></ul><ul><li>Boltzmann-Gibbs Models </li></ul>
  9. 9. Neural Networks
  10. 10. Back-Propagation Neural Networks <ul><li>Layers: </li></ul><ul><ul><li>Input: number of nodes = dimension of feature vector </li></ul></ul><ul><ul><li>Output: number of nodes = number of class types </li></ul></ul><ul><ul><li>Hidden: number of nodes > dimension of feature vector </li></ul></ul><ul><li>Direction of data migration </li></ul><ul><ul><li>Training: backward propagation </li></ul></ul><ul><ul><li>Testing: forward propagation </li></ul></ul><ul><li>Training problems </li></ul><ul><ul><li>Overfitting </li></ul></ul><ul><ul><li>Convergence </li></ul></ul>
  11. 11. Illustration
  12. 12. Neural Networks <ul><li>Forward propagation: </li></ul><ul><li>Error: </li></ul><ul><li>Backward update of weights: gradient descent </li></ul>
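The formulas for this slide did not survive in the transcript. As a hedged reminder, the usual textbook form for a network with one hidden layer is sketched here; the notation (inputs x_i, hidden outputs y_j, network outputs z_k, targets t_k, activation f, learning rate η) is an assumption and may differ from the original slide.

```latex
% Assumed notation: w_{ji} are input-to-hidden weights, w_{kj} are hidden-to-output weights.
\begin{align}
  % Forward propagation through the hidden layer and the output layer
  y_j &= f\Big(\sum_i w_{ji}\, x_i\Big), \qquad
  z_k = f\Big(\sum_j w_{kj}\, y_j\Big) \\
  % Squared error between targets and outputs
  J(\mathbf{w}) &= \tfrac{1}{2} \sum_k (t_k - z_k)^2 \\
  % Gradient-descent update applied to every weight w
  \Delta w &= -\eta\, \frac{\partial J}{\partial w}
\end{align}
```

The gradient for the hidden-layer weights is obtained from the output-layer error terms by the chain rule, which is why the update is said to propagate backward.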
  13. 13. Support Vector Machines (SVM)
  14. 14. SVM <ul><li>Yields a solution to the binary classification problem that is optimal in the maximum-margin sense </li></ul><ul><li>Finds a separating boundary (hyperplane) that maintains the largest margin between samples of the two class types </li></ul><ul><li>Things to tune: </li></ul><ul><ul><li>Kernel function: defines the similarity measure between two sample vectors </li></ul></ul><ul><ul><li>Tolerance for misclassification </li></ul></ul><ul><ul><li>Parameters associated with the kernel function </li></ul></ul>
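To make the tunable pieces concrete, here is a minimal sketch of two common kernel functions and of the kernel form of the SVM decision function. It does not train anything: the support vectors, multipliers `alphas`, and `bias` are hand-picked hypothetical values, whereas a real SVM would obtain them by solving the margin-maximization problem. In the soft-margin formulation, the tolerance for misclassification is usually a constant C that caps each multiplier.

```python
from math import exp

def linear_kernel(u, v):
    """Linear kernel: the plain inner product of two sample vectors."""
    return sum(a * b for a, b in zip(u, v))

def rbf_kernel(u, v, gamma=0.5):
    """Gaussian (RBF) kernel; gamma is the parameter associated with the kernel."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return exp(-gamma * squared_distance)

def decision_value(x, support_vectors, labels, alphas, bias, kernel):
    """Kernel SVM decision function; its sign gives the predicted class."""
    return sum(
        a * y * kernel(sv, x)
        for sv, y, a in zip(support_vectors, labels, alphas)
    ) + bias

# Hypothetical, hand-picked values (not the result of training).
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
labels = [-1, 1]
alphas = [1.0, 1.0]
bias = 0.0

print(decision_value((1.8, 1.9), support_vectors, labels, alphas, bias, rbf_kernel))
print(decision_value((0.1, 0.2), support_vectors, labels, alphas, bias, rbf_kernel))
```

A test point close to the positive support vector gets a positive decision value and one close to the negative support vector gets a negative value, which is the sense in which the kernel acts as a similarity measure.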
  15. 15. Illustration
  16. 16. Classification and Regression Tree (CART)
  17. 17. Illustration
  18. 18. Determination of Branch Points <ul><li>At the root, a feature of each input sample is examined </li></ul><ul><li>For a given feature f , we want to determine a branch point b </li></ul><ul><ul><li>Samples whose f values fall below b are assigned to the left branch; otherwise they are assigned to the right branch </li></ul></ul><ul><li>Determination of a branch point </li></ul><ul><ul><li>The impurity of a set of samples S is defined as </li></ul></ul><ul><ul><li>where is the proportion of samples in S labeled as class type C </li></ul></ul>
  19. 19. Branch Point (Cont’d) <ul><li>At a branch point b, the impurity reduction is then defined as </li></ul><ul><li>The optimal branch point for the given feature f examined at this node is then set as </li></ul><ul><li>To determine which feature type should be examined at the root, we compute b(f) for all possible feature types. We then take the feature type at the root as </li></ul>
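The formulas on these two slides were lost in the transcript, so the following is only a sketch of the procedure under stated assumptions. It assumes entropy impurity (the Gini index is another common choice), defines the impurity reduction as the parent impurity minus the size-weighted impurities of the two branches, and takes candidate branch points midway between consecutive distinct feature values. All function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2

def impurity(labels):
    """Entropy impurity of a sample set: -sum over classes of p(C) * log2 p(C)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def impurity_reduction(values, labels, b):
    """Impurity of S minus the size-weighted impurity of the two branches at b."""
    left = [y for v, y in zip(values, labels) if v < b]    # f value below b
    right = [y for v, y in zip(values, labels) if v >= b]  # otherwise
    if not left or not right:
        return 0.0
    n = len(labels)
    return (impurity(labels)
            - len(left) / n * impurity(left)
            - len(right) / n * impurity(right))

def best_branch_point(values, labels):
    """Return (b, reduction) with the largest impurity reduction for one feature."""
    distinct = sorted(set(values))
    candidates = [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]
    if not candidates:
        return None, 0.0  # a constant feature cannot split the samples
    return max(((b, impurity_reduction(values, labels, b)) for b in candidates),
               key=lambda pair: pair[1])

def best_feature(samples, labels):
    """Return (feature index, branch point, reduction) to examine at the root."""
    results = []
    for f in range(len(samples[0])):
        values = [x[f] for x in samples]
        b, gain = best_branch_point(values, labels)
        results.append((f, b, gain))
    return max(results, key=lambda r: r[2])

# Hypothetical samples with two features each.
samples = [(1.0, 5.0), (2.0, 1.0), (3.0, 4.0), (4.0, 2.0)]
labels = ["A", "A", "B", "B"]
print(best_feature(samples, labels))
```

In this toy data the first feature separates the two classes perfectly at 2.5, so it is selected for the root. The same search would then be repeated on the samples that reach each child node.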
  20. 20. AdaBoost
  21. 21. AdaBoost <ul><li>Can be thought of as a linear combination of the same classifier c(·, ·) with varying weights </li></ul><ul><li>The idea: </li></ul><ul><ul><li>Iteratively apply the same classifier C to a set of samples </li></ul></ul><ul><ul><li>At iteration m, the samples erroneously classified at the (m-1)st iteration are duplicated at a rate γ<sub>m</sub> </li></ul></ul><ul><ul><li>The weight β<sub>m</sub> is related to γ<sub>m</sub> in a certain way </li></ul></ul>
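A minimal sketch of the idea, assuming the common discrete AdaBoost formulation with one-dimensional threshold classifiers ("stumps") as the repeated classifier. Instead of literally duplicating misclassified samples, this version multiplies their weights by a factor `gamma`, which has the same effect on a weighted learner. Taking `beta = log((1 - err) / err)` and `gamma = exp(beta)` is one standard choice and is only a guess at the relation between β and γ that the slide alludes to. Function names and data are hypothetical.

```python
from math import exp, log

def stump_predict(x, threshold, polarity):
    """Threshold classifier: returns polarity if x >= threshold, else -polarity."""
    return polarity if x >= threshold else -polarity

def train_stump(xs, ys, weights):
    """Weak learner: the stump with the smallest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    """Return a list of (beta, threshold, polarity), one entry per iteration."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = train_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        beta = log((1 - err) / err)            # weight of this classifier
        gamma = exp(beta)                      # boost factor for its mistakes
        ensemble.append((beta, threshold, polarity))
        weights = [w * gamma if stump_predict(x, threshold, polarity) != y else w
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]  # renormalize to sum to 1
    return ensemble

def predict(ensemble, x):
    """Sign of the weighted linear combination of the stumps."""
    score = sum(beta * stump_predict(x, t, p) for beta, t, p in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical data that no single stump can classify correctly.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])
```

No single threshold separates these labels, but the weighted combination of three stumps does, which illustrates why re-emphasizing the previously misclassified samples helps.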
  22. 22. Boltzmann-Gibbs Model
  23. 23. Boltzmann-Gibbs Density Function <ul><li>Given: </li></ul><ul><ul><li>States s<sub>1</sub>, s<sub>2</sub>, …, s<sub>n</sub> </li></ul></ul><ul><ul><li>Density p(s) = p<sub>s</sub> </li></ul></ul><ul><ul><li>Features f<sub>i</sub>, i = 1, 2, … </li></ul></ul><ul><li>Maximum entropy principle: </li></ul><ul><ul><li>Without any information, one chooses the density p<sub>s</sub> to maximize the entropy </li></ul></ul><ul><ul><li>subject to the constraints </li></ul></ul>
  24. 24. Boltzmann-Gibbs (Cont’d) <ul><li>Consider the Lagrangian </li></ul><ul><li>Taking partial derivatives of L with respect to p<sub>s</sub> and setting them to zero, we obtain the Boltzmann-Gibbs density functions </li></ul><ul><li>where Z is the normalizing factor </li></ul>
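The equations for the entropy, the constraints, the Lagrangian, and the resulting density are missing from this transcript. The standard maximum-entropy setup is sketched here as a hedged reconstruction. The symbols F_i (the prescribed expected value of feature f_i), λ_i, and μ (the Lagrange multipliers) are assumptions and may not match the original slides.

```latex
\begin{align}
  % Entropy to be maximized over the density p_s
  H(p) &= -\sum_{s} p_s \log p_s \\
  % Constraints: normalization and prescribed feature expectations
  \sum_{s} p_s &= 1, \qquad \sum_{s} p_s\, f_i(s) = F_i, \quad i = 1, 2, \ldots \\
  % Lagrangian with multipliers \lambda_i and \mu
  L &= -\sum_{s} p_s \log p_s
       + \sum_{i} \lambda_i \Big( \sum_{s} p_s\, f_i(s) - F_i \Big)
       + \mu \Big( \sum_{s} p_s - 1 \Big) \\
  % Resulting Boltzmann-Gibbs density and its normalizing factor
  p_s &= \frac{1}{Z} \exp\Big( \sum_{i} \lambda_i f_i(s) \Big), \qquad
  Z = \sum_{s} \exp\Big( \sum_{i} \lambda_i f_i(s) \Big)
\end{align}
```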
  25. 25. Exercise I <ul><li>Derive the Boltzmann-Gibbs density functions from the Lagrangian, shown on the last viewgraph </li></ul>
  26. 26. Boltzmann-Gibbs (Cont’d) <ul><li>Maximum entropy (ME) </li></ul><ul><ul><li>Use the Boltzmann-Gibbs density as the prior distribution </li></ul></ul><ul><ul><li>Compute the posterior for the given observed data and features f<sub>i</sub> </li></ul></ul><ul><ul><li>Use the optimal posterior to classify </li></ul></ul>
  27. 27. Bayesian Approach <ul><li>Given: </li></ul><ul><ul><li>Training samples X = {x<sub>1</sub>, x<sub>2</sub>, …, x<sub>n</sub>} </li></ul></ul><ul><ul><li>Probability density p(t | Θ) </li></ul></ul><ul><ul><li>t is an arbitrary vector (a test sample) </li></ul></ul><ul><ul><li>Θ is the set of parameters </li></ul></ul><ul><ul><li>Θ is taken as a set of random variables </li></ul></ul>
  28. 28. Bayesian Approach (Cont’d) <ul><li>Posterior density: </li></ul><ul><li>Different class types give rise to different posteriors </li></ul><ul><li>Use the posteriors to evaluate the class type of a given test sample t </li></ul>
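The posterior formula itself is not preserved in this transcript. In the usual Bayesian treatment it takes the following form, given here as a hedged sketch. The prior p(Θ), the per-class training sets X_c, and the class priors P(c) are assumed notation that the slides may have written differently.

```latex
\begin{align}
  % Posterior over the parameters, given the training samples X
  p(\Theta \mid X) &= \frac{p(X \mid \Theta)\, p(\Theta)}
                         {\int p(X \mid \Theta')\, p(\Theta')\, d\Theta'} \\
  % Density of a test sample t after averaging over the posterior
  p(t \mid X) &= \int p(t \mid \Theta)\, p(\Theta \mid X)\, d\Theta \\
  % Classification: training on each class's samples X_c gives a separate posterior
  \hat{c} &= \arg\max_{c}\; p(t \mid X_c)\, P(c)
\end{align}
```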
  29. 29. Boltzmann-Gibbs (Cont’d) <ul><li>Maximum entropy Markov model (MEMM) </li></ul><ul><ul><li>The posterior consists of transition probability densities </li></ul></ul><ul><ul><li>p(s | s′, X) </li></ul></ul><ul><li>Conditional random field (CRF) </li></ul><ul><ul><li>The posterior consists of both transition probability densities p(s | s′, X) and state probability densities </li></ul></ul><ul><ul><li>p(s | X) </li></ul></ul>
  30. 30. References <ul><li>R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd Ed., Wiley Interscience, 2001. </li></ul><ul><li>T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer-Verlag, 2001. </li></ul><ul><li>S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, 1999. </li></ul>