Machine Learning: An Introduction Fu Chang

    Presentation Transcript

    • Machine Learning: An Introduction Fu Chang Institute of Information Science Academia Sinica 2788-3799 ext. 1819 [email_address]
    • Machine Learning as a Tool for Classifying Patterns
      • What is the difference between you and me?
      • Tentative answer 1:
        • You are pretty, and I am ugly
      • A vague answer, not very useful
      • Tentative answer 2:
        • You have a tiny mouth, and I have a big one
      • More useful, but what if we are viewed from the side?
      • In general, can we use a single feature difference to distinguish one pattern from another?
    • Old Philosophical Debates
      • What makes a cup a cup?
      • Philosophical views
        • Plato: the ideal type
        • Aristotle: the collection of all cups
        • Wittgenstein: family resemblance
    • Machine Learning Viewpoint
      • Represent each object with a set of features:
        • Mouth, nose, eyes, etc., viewed from the front, the right side, the left side, etc.
      • Each pattern is taken as a conglomeration of sample points or feature vectors
    • Patterns as Conglomerations of Sample Points
      • [Figure: two types of sample points, A and B]
    • Types of Separation
      • Left panel: positive separation between heterogeneous data points
      • Right panel: a margin between them
    • ML Viewpoint (Cont’d)
      • Training phase:
        • Want to learn pattern differences among conglomerations of labeled samples
        • Have to describe the differences by means of a model: probability distribution, prototype, neural network, etc.
        • Have to estimate parameters involved in the model
      • Testing phase:
        • Have to classify previously unseen samples at an acceptable accuracy rate
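
The two phases can be illustrated with a prototype classifier, one of the model types named above. This is only a minimal sketch: it assumes each class is represented by the mean of its training vectors, and the function names and toy data are made up for illustration.

```python
import math
from collections import defaultdict

def train_prototypes(samples, labels):
    """Training phase: estimate one prototype (the mean vector) per class."""
    sums = {}
    counts = defaultdict(int)
    for x, y in zip(samples, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def classify(prototypes, x):
    """Testing phase: assign x to the class whose prototype is nearest."""
    return min(prototypes, key=lambda y: math.dist(prototypes[y], x))

# Toy labeled samples for two pattern types, A and B (invented numbers).
train_x = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.5), (7.0, 6.0), (6.5, 7.0)]
train_y = ["A", "A", "A", "B", "B", "B"]
prototypes = train_prototypes(train_x, train_y)

# Held-out samples used only to measure the accuracy rate.
test_x = [(1.2, 1.4), (6.8, 6.2)]
test_y = ["A", "B"]
predicted = [classify(prototypes, x) for x in test_x]
accuracy = sum(p == t for p, t in zip(predicted, test_y)) / len(test_y)
```

Here the only parameters estimated during training are the class means; richer models (neural networks, SVMs) follow the same train-then-test pattern with more parameters.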
    • Models
      • Prototype classifiers
      • Neural networks
      • Support vector machines
      • Classification and regression trees (CART)
      • AdaBoost
      • Boltzmann-Gibbs Models
    • Neural Networks
    • Back-Propagation Neural Networks
      • Layers:
        • Input: number of nodes = dimension of feature vector
        • Output: number of nodes = number of class types
        • Hidden: number of nodes > dimension of feature vector
      • Direction of data migration
        • Training: a forward pass to compute the outputs, then backward propagation of the errors to update the weights
        • Testing: forward propagation only
      • Training problems
        • Overfitting
        • Convergence
    • Illustration
    • Neural Networks
      • Forward propagation:
      • Error:
      • Backward update of weights: gradient descent
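
The formulas on this slide did not survive in the transcript, so the following is a sketch of one common formulation rather than the author's own: sigmoid units, a squared error E = ½ Σ_k (z_k − t_k)², and the update w ← w − η ∂E/∂w. The network size, learning rate, and XOR data are assumptions chosen for illustration.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, W1, W2):
    """Forward propagation: input -> hidden -> output (bias via a constant 1)."""
    xb = list(x) + [1.0]
    h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in W1]
    hb = h + [1.0]
    z = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in W2]
    return xb, hb, z

def loss(data, W1, W2):
    """Squared error summed over all samples and output nodes."""
    return sum(0.5 * sum((zk - tk) ** 2 for zk, tk in zip(forward(x, W1, W2)[2], t))
               for x, t in data)

def gradients(data, W1, W2):
    """Backward propagation: gradient of the loss with respect to W1 and W2."""
    g1 = [[0.0] * len(r) for r in W1]
    g2 = [[0.0] * len(r) for r in W2]
    for x, t in data:
        xb, hb, z = forward(x, W1, W2)
        # Output deltas: derivative of the error w.r.t. each output pre-activation.
        d_out = [(zk - tk) * zk * (1 - zk) for zk, tk in zip(z, t)]
        # Hidden deltas: output deltas propagated back through W2.
        d_hid = [hb[j] * (1 - hb[j]) * sum(d_out[k] * W2[k][j] for k in range(len(W2)))
                 for j in range(len(W1))]
        for k, dk in enumerate(d_out):
            for j, hj in enumerate(hb):
                g2[k][j] += dk * hj
        for j, dj in enumerate(d_hid):
            for i, xi in enumerate(xb):
                g1[j][i] += dj * xi
    return g1, g2

def train(data, n_hidden=3, lr=0.5, epochs=2000, seed=0):
    """Full-batch gradient descent from small random initial weights."""
    rng = random.Random(seed)
    n_in, n_out = len(data[0][0]), len(data[0][1])
    W1 = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    W2 = [[rng.uniform(-1, 1) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    for _ in range(epochs):
        g1, g2 = gradients(data, W1, W2)
        W1 = [[w - lr * g for w, g in zip(rw, rg)] for rw, rg in zip(W1, g1)]
        W2 = [[w - lr * g for w, g in zip(rw, rg)] for rw, rg in zip(W2, g2)]
    return W1, W2

# XOR: a small problem that a network without a hidden layer cannot solve.
xor = [((0.0, 0.0), (0.0,)), ((0.0, 1.0), (1.0,)),
       ((1.0, 0.0), (1.0,)), ((1.0, 1.0), (0.0,))]
W1, W2 = train(xor)
```

Whether training converges, and how quickly, depends on the initial weights and the learning rate, which is the convergence problem listed on the earlier slide.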
    • Support Vector Machines (SVM)
    • SVM
      • Gives a solution to the binary classification problem that is optimal in the maximum-margin sense
      • Finds a separating boundary (hyperplane) that maintains the largest margin between samples of two class types
      • Things to tune:
        • Kernel functions: defining the similarity measure of two sample vectors
        • Tolerance for misclassification
        • Parameters associated with the kernel function
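
The quantities being tuned can be made concrete with a short sketch. It does not train an SVM; it only evaluates a kernel, the margin of a given hyperplane, and a soft-margin objective. The parameter names `gamma` and `C` follow common usage and are assumptions, as are the toy samples.

```python
import math

def linear_kernel(x, y):
    """Inner product: the similarity measure behind a linear SVM."""
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel; gamma is the kernel parameter to tune."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def margin(w, b, samples, labels):
    """Smallest signed distance from the labeled samples to the hyperplane w.x + b = 0."""
    norm = math.sqrt(linear_kernel(w, w))
    return min(y * (linear_kernel(w, x) + b) / norm for x, y in zip(samples, labels))

def soft_margin_objective(w, b, samples, labels, C=1.0):
    """0.5 * ||w||^2 + C * (sum of hinge losses); C sets the tolerance for misclassification."""
    hinge = sum(max(0.0, 1.0 - y * (linear_kernel(w, x) + b))
                for x, y in zip(samples, labels))
    return 0.5 * linear_kernel(w, w) + C * hinge

samples = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (3.0, 1.0)]
labels = [-1, -1, 1, 1]

# Two candidate separating lines: x1 = 1.5 (midway between the classes) and x1 = 0.5.
margin_mid = margin((1.0, 0.0), -1.5, samples, labels)
margin_off = margin((1.0, 0.0), -0.5, samples, labels)
```

Both lines separate the toy data, but the midway line leaves the larger margin, which is the one a maximum-margin classifier prefers.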
    • Illustration
    • Classification and Regression Tree (CART)
    • Illustration
    • Determination of Branch Points
      • At the root, a feature of each input sample is examined
      • For a given feature f, we want to determine a branch point b
        • Samples whose f values fall below b are assigned to the left branch; otherwise they are assigned to the right branch
      • Determination of a branch point
        • The impurity of a set of samples S is defined as
        • where p_C is the proportion of samples in S labeled as class type C
    • Branch Point (Cont’d)
      • At a branch point b, the impurity reduction is then defined as
      • The optimal branch point for the given feature f examined at this node is then set as
      • To determine which feature type should be examined at the root, we compute b(f) for all possible feature types. We then take the feature type at the root as
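
The branch-point search can be sketched as follows. The slide's formulas are not preserved, so this assumes the entropy impurity −Σ_C p_C log₂ p_C (the Gini index is another common choice) and the usual reduction: the impurity of S minus the size-weighted impurities of the left and right subsets. Function names and data are illustrative.

```python
import math
from collections import Counter

def impurity(labels):
    """Entropy impurity of a label set (an assumed choice of impurity measure)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def impurity_reduction(values, labels, b):
    """Impurity of S minus the weighted impurities of the two branches at b."""
    left = [y for v, y in zip(values, labels) if v < b]
    right = [y for v, y in zip(values, labels) if v >= b]
    if not left or not right:
        return 0.0
    n = len(labels)
    return (impurity(labels)
            - len(left) / n * impurity(left)
            - len(right) / n * impurity(right))

def best_branch_point(values, labels):
    """Try midpoints between consecutive distinct values; return (b, reduction)."""
    distinct = sorted(set(values))
    candidates = [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]
    return max(((b, impurity_reduction(values, labels, b)) for b in candidates),
               key=lambda pair: pair[1], default=(None, 0.0))

def best_feature(samples, labels):
    """Pick the feature whose optimal branch point gives the largest reduction."""
    results = []
    for f in range(len(samples[0])):
        b, gain = best_branch_point([x[f] for x in samples], labels)
        results.append((f, b, gain))
    return max(results, key=lambda r: r[2])

samples = [(1.0, 5.0), (2.0, 1.0), (3.0, 4.0), (6.0, 2.0), (7.0, 5.0), (8.0, 1.0)]
labels = ["A", "A", "A", "B", "B", "B"]
feature, branch_point, gain = best_feature(samples, labels)
```

In this toy set the first feature separates the two classes cleanly, so it is chosen for the root; applying the same search recursively to each branch grows the rest of the tree.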
    • AdaBoost
    • AdaBoost
      • Can be thought of as a linear combination of the same classifier c(·, ·) with varying weights
      • The Idea:
        • Iteratively apply the same classifier C to a set of samples
        • At iteration m, the samples erroneously classified at the (m−1)-th iteration are duplicated at a rate γ_m
        • The weight β_m is related to γ_m in a certain way
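
A minimal sketch of the idea, under stated assumptions: the weak classifier is a one-dimensional decision stump, duplication is emulated by multiplying the weights of misclassified samples by γ_m = (1 − ε_m)/ε_m, and β_m = ½ ln γ_m, as in the standard discrete AdaBoost formulation. The slide does not say which relation between β_m and γ_m the author uses, so this link is an assumption.

```python
import math

def stump_predict(x, threshold, polarity):
    """Weak classifier: predict `polarity` below the threshold, `-polarity` otherwise."""
    return polarity if x < threshold else -polarity

def fit_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    distinct = sorted(set(xs))
    thresholds = [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]
    best = None
    for threshold in thresholds:
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []  # list of (beta, threshold, polarity)
    for _ in range(rounds):
        err, threshold, polarity = fit_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep gamma finite and positive
        gamma = (1 - err) / err                # assumed "duplication rate"
        beta = 0.5 * math.log(gamma)           # assumed relation of beta to gamma
        # Emulate duplication: scale up the weights of misclassified samples.
        weights = [w * gamma if stump_predict(x, threshold, polarity) != y else w
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
        ensemble.append((beta, threshold, polarity))
    return ensemble

def predict(ensemble, x):
    """Sign of the beta-weighted vote of all stumps."""
    score = sum(beta * stump_predict(x, t, p) for beta, t, p in ensemble)
    return 1 if score >= 0 else -1

# Toy 1-D data that no single stump classifies perfectly.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(xs, ys, rounds=3)
training_errors = sum(predict(ensemble, x) != y for x, y in zip(xs, ys))
```

On this data a single stump misclassifies three samples, while the weighted vote of three stumps classifies all ten training samples correctly.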
    • Boltzmann-Gibbs Model
    • Boltzmann-Gibbs Density Function
      • Given:
        • States s_1, s_2, …, s_n
        • Density p(s) = p_s
        • Features f_i, i = 1, 2, …
      • Maximum entropy principle:
        • Without any information, one chooses the density p_s to maximize the entropy
        • subject to the constraints
    • Boltzmann-Gibbs (Cont’d)
      • Consider the Lagrangian
      • Taking partial derivatives of L with respect to p_s and setting them to zero, we obtain the Boltzmann-Gibbs density functions
      • where Z is the normalizing factor
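
The formulas for this construction are not preserved in the transcript. One standard way to write them is given below as a sketch: the target values D_i, the multipliers λ_i and μ, and the sign convention on λ_i are assumptions, not the author's notation.

```latex
% Entropy to be maximized
H(p) = -\sum_{s} p_s \log p_s

% Constraints: fixed expected feature values, plus normalization
\sum_{s} p_s f_i(s) = D_i \quad (i = 1, 2, \ldots), \qquad \sum_{s} p_s = 1

% Lagrangian
L = -\sum_{s} p_s \log p_s
    + \sum_{i} \lambda_i \Bigl( \sum_{s} p_s f_i(s) - D_i \Bigr)
    + \mu \Bigl( \sum_{s} p_s - 1 \Bigr)

% Resulting Boltzmann-Gibbs density and its normalizing factor
p_s = \frac{1}{Z} \exp\Bigl( \sum_{i} \lambda_i f_i(s) \Bigr), \qquad
Z = \sum_{s} \exp\Bigl( \sum_{i} \lambda_i f_i(s) \Bigr)
```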
    • Exercise I
      • Derive the Boltzmann-Gibbs density functions from the Lagrangian shown on the previous viewgraph
    • Boltzmann-Gibbs (Cont’d)
      • Maximum entropy (ME)
        • Use of Boltzmann-Gibbs as prior distribution
        • Compute the posterior for given observed data and features f_i
        • Use the optimal posterior to classify
    • Bayesian Approach
      • Given:
        • Training samples X = {x_1, x_2, …, x_n}
        • Probability density p(t | Θ)
        • t is an arbitrary vector (a test sample)
        • Θ is the set of parameters
        • Θ is taken as a set of random variables
    • Bayesian Approach (Cont’d)
      • Posterior density:
      • Different class types give rise to different posteriors
      • Use the posteriors to evaluate the class type of a given test sample t
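
The posterior formula is missing from the transcript; by Bayes' rule it is proportional to p(X | Θ) p(Θ). The sketch below works through one simple case chosen for illustration, not taken from the slides: each class has a one-dimensional Gaussian density with known variance, the unknown parameter is the class mean with a Gaussian prior, and the classes are equally likely a priori. All names and numbers are assumptions.

```python
import math

def posterior_mean_var(xs, sigma2=1.0, prior_mean=0.0, prior_var=100.0):
    """Posterior of the class mean given samples xs (Gaussian prior and likelihood)."""
    n = len(xs)
    post_var = 1.0 / (1.0 / prior_var + n / sigma2)
    post_mean = post_var * (prior_mean / prior_var + sum(xs) / sigma2)
    return post_mean, post_var

def predictive_density(t, xs, sigma2=1.0, prior_mean=0.0, prior_var=100.0):
    """p(t | X): the density p(t | mean) averaged over the posterior of the mean."""
    m, v = posterior_mean_var(xs, sigma2, prior_mean, prior_var)
    var = sigma2 + v
    return math.exp(-(t - m) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def classify(t, class_samples):
    """Assign t to the class with the largest predictive density (equal class priors)."""
    return max(class_samples, key=lambda c: predictive_density(t, class_samples[c]))

# Invented training samples for two class types.
class_samples = {"A": [0.8, 1.1, 1.3, 0.9], "B": [3.9, 4.2, 4.0, 4.4]}
label_low = classify(1.0, class_samples)
label_high = classify(4.1, class_samples)
```

Each class type yields its own posterior and therefore its own predictive density; the test sample t goes to the class under which it is most probable.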
    • Boltzmann-Gibbs (Cont’d)
      • Maximum entropy Markov model (MEMM)
        • The posterior consists of transition probability densities p(s | s′, X)
      • Conditional random field (CRF)
        • The posterior consists of both transition probability densities p(s | s′, X) and state probability densities p(s | X)
    • References
      • R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd Ed., Wiley Interscience, 2001.
      • T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer-Verlag, 2001.
      • S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, 1999.