Lecture 17: Supervised Learning Recap

Slide note (mixture model, slide 40): p(x) = π0 f0(x) + π1 f1(x) + … + πk fk(x)

    1. Lecture 17: Supervised Learning Recap (Machine Learning, April 6, 2010)
    2. Last Time
       • Support Vector Machines
       • Kernel Methods
    3. Today
       • Short recap of Kernel Methods
       • Review of Supervised Learning
       • Unsupervised Learning
       • (Soft) K-means clustering
       • Expectation Maximization
       • Spectral Clustering
       • Principal Components Analysis
       • Latent Semantic Analysis
    4. Kernel Methods
       • Feature extraction to higher-dimensional spaces.
       • Kernels describe the relationship between vectors (points) rather than the new feature space directly.
    5. When can we use kernels?
       • Any time training and evaluation both depend only on dot products between pairs of points.
       • SVMs
       • Perceptron
       • k-nearest neighbors
       • k-means
       • etc.
    6. Kernels in SVMs
       • Optimize the αi's and bias w.r.t. the kernel
       • Decision function expressed through the kernel (equation on slide)
    7. Kernels in Perceptrons
       • Training
       • Decision function
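
The training and decision equations on slides 6 and 7 did not survive extraction. As a stand-in, here is a minimal kernel perceptron sketch in NumPy; the RBF kernel, the ±1 label convention, and the toy data are my own choices, not taken from the slides.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """K(a, b) = exp(-gamma * ||a - b||^2), one popular kernel choice."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def train_kernel_perceptron(X, y, kernel=rbf_kernel, epochs=10):
    """Kernel perceptron: alpha_i counts the mistakes made on training point i.
    The decision function is f(x) = sum_i alpha_i * y_i * K(x_i, x)."""
    n = len(X)
    alpha = np.zeros(n)
    for _ in range(epochs):
        for i in range(n):
            f = sum(alpha[j] * y[j] * kernel(X[j], X[i]) for j in range(n))
            if y[i] * f <= 0:          # mistake: increase this point's weight
                alpha[i] += 1
    return alpha

def predict(x, X, y, alpha, kernel=rbf_kernel):
    f = sum(alpha[j] * y[j] * kernel(X[j], x) for j in range(len(X)))
    return 1 if f >= 0 else -1

# Tiny example: two well-separated clusters with labels -1 / +1.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y = np.array([-1, -1, 1, 1])
alpha = train_kernel_perceptron(X, y)
print(predict(np.array([0.1, 0.0]), X, y, alpha))   # expected: -1
```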
    8. Good and Valid Kernels
       • Good: computing K(xi, xj) is cheaper than computing ϕ(xi) explicitly
       • Valid:
         • Symmetric: K(xi, xj) = K(xj, xi)
         • Decomposable into ϕ(xi)Tϕ(xj)
         • Positive semi-definite Gram matrix
       • Popular kernels:
         • Linear, Polynomial
         • Radial Basis Function
         • String (technically infinite dimensions)
         • Graph
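
To make the "valid" conditions concrete, one can build the Gram matrix of a candidate kernel on a sample of points and test symmetry and positive semi-definiteness. This is an illustration of mine (the specific kernels and parameters are arbitrary); it checks necessary conditions on a finite sample, not Mercer's condition in full generality.

```python
import numpy as np

def gram_matrix(kernel, X):
    """Gram matrix G[i, j] = K(x_i, x_j) over all pairs of sample points."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

def looks_valid(kernel, X, tol=1e-9):
    """Necessary conditions on this sample: symmetric Gram matrix
    with all eigenvalues >= 0 (positive semi-definite)."""
    G = gram_matrix(kernel, X)
    symmetric = np.allclose(G, G.T)
    psd = np.all(np.linalg.eigvalsh(G) >= -tol)
    return symmetric and psd

X = np.random.default_rng(0).normal(size=(5, 3))
linear = lambda a, b: a @ b
poly = lambda a, b: (a @ b + 1.0) ** 3
rbf = lambda a, b: np.exp(-0.5 * np.sum((a - b) ** 2))
bad = lambda a, b: -np.sum((a - b) ** 2)   # not a valid kernel in general

for name, k in [("linear", linear), ("poly", poly), ("rbf", rbf), ("bad", bad)]:
    print(name, looks_valid(k, X))   # first three True, "bad" False
```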
    9. Supervised Learning
       • Linear Regression
       • Logistic Regression
       • Graphical Models
       • Hidden Markov Models
       • Neural Networks
       • Support Vector Machines
       • Kernel Methods
    10. Major concepts
        • Gaussian, Multinomial, Bernoulli Distributions
        • Joint vs. Conditional Distributions
        • Marginalization
        • Maximum Likelihood
        • Risk Minimization
        • Gradient Descent
        • Feature Extraction, Kernel Methods
    11. Some favorite distributions
        • Bernoulli
        • Multinomial
        • Gaussian
    12. Maximum Likelihood
        • Identify the parameter values that maximize the likelihood of the observed data.
          • Take the partial derivative of the likelihood function
          • Set it to zero
          • Solve
        • NB: maximum likelihood parameters are the same as maximum log-likelihood parameters
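
As one worked instance of the recipe (differentiate the log likelihood, set it to zero, solve), the maximum likelihood estimates for a univariate Gaussian come out in closed form as the sample mean and the biased sample variance. A quick numerical check on simulated data of my own:

```python
import numpy as np

# Simulated i.i.d. data from a Gaussian with known parameters.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=10_000)

# Setting d/dmu log L = 0 gives mu_ML = sample mean;
# setting d/dsigma^2 log L = 0 gives sigma^2_ML = mean squared deviation
# (note: divides by N, not N-1, so it is the biased estimator).
mu_ml = x.mean()
var_ml = np.mean((x - mu_ml) ** 2)

print(mu_ml, np.sqrt(var_ml))   # should be close to 2.0 and 3.0
```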
    13. Maximum Log Likelihood
        • Why do we like the log function?
        • It turns products (difficult to differentiate) into sums (easy to differentiate):
          • log(xy) = log(x) + log(y)
          • log(x^c) = c·log(x)
    14. Risk Minimization
        • Pick a loss function
          • Squared loss
          • Linear loss
          • Perceptron (classification) loss
        • Identify the parameters that minimize the loss function.
          • Take the partial derivative of the loss function
          • Set it to zero
          • Solve
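
The same recipe applied to squared loss on a linear model, with gradient descent taking over when the "set to zero and solve" step is done numerically. A sketch on synthetic data; the learning rate and iteration count are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
true_w = np.array([1.5, -2.0])
y = X @ true_w + 0.5 + 0.1 * rng.normal(size=200)

# Squared loss: R(w, b) = (1/N) * sum_i (w.x_i + b - y_i)^2
w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(500):
    err = X @ w + b - y                  # residuals
    grad_w = 2 * X.T @ err / len(y)      # dR/dw
    grad_b = 2 * err.mean()              # dR/db
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)   # should approach [1.5, -2.0] and 0.5
```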
    15. Frequentists vs. Bayesians
        • Point estimates vs. posteriors
        • Risk Minimization vs. Maximum Likelihood
        • L2-Regularization
          • Frequentists: add a constraint on the size of the weight vector
          • Bayesians: introduce a zero-mean prior on the weight vector
          • The result is the same!
    16. L2-Regularization
        • Frequentists: introduce a cost on the size of the weights
        • Bayesians: introduce a prior on the weights
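
To see that the two views coincide: penalized least squares and MAP estimation under a zero-mean Gaussian prior both lead to the ridge solution w = (XᵀX + λI)⁻¹Xᵀy. A small check of mine (λ, the noise variance, and the prior variance are made-up values chosen so that λ = σ²/τ²; the bias term is omitted for brevity).

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 0.0, -1.0]) + 0.1 * rng.normal(size=100)
lam = 5.0

# Frequentist view: argmin_w ||Xw - y||^2 + lam * ||w||^2 (cost on weight size).
w_penalized = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Bayesian view: Gaussian likelihood plus zero-mean Gaussian prior on w.
# The log posterior is -||Xw - y||^2 / (2 sigma^2) - ||w||^2 / (2 tau^2) + const,
# so the MAP estimate solves the same system with lam = sigma^2 / tau^2.
sigma2, tau2 = 1.0, 1.0 / lam     # chosen so sigma^2 / tau^2 == lam
w_map = np.linalg.solve(X.T @ X + (sigma2 / tau2) * np.eye(3), X.T @ y)

print(np.allclose(w_penalized, w_map))   # True: the estimates are identical
```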
    17. Types of Classifiers
        • Generative Models
          • Highest resource requirements; need to approximate the joint probability
        • Discriminative Models
          • Moderate resource requirements; typically fewer parameters to approximate than generative models
        • Discriminant Functions
          • Can be trained probabilistically, but the output does not include confidence information
    18. Linear Regression
        • Fit a line to a set of points
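
"Fit a line to a set of points" in its simplest form: least squares with a design matrix that includes an intercept column. A sketch on noisy synthetic data of my own.

```python
import numpy as np

# Noisy points roughly on the line y = 2x + 1.
x = np.linspace(0, 5, 50)
y = 2 * x + 1 + 0.3 * np.random.default_rng(3).normal(size=50)

# Design matrix with a column of ones for the intercept;
# the least-squares fit minimizes ||A @ [slope, intercept] - y||^2.
A = np.column_stack([x, np.ones_like(x)])
slope, intercept = np.linalg.lstsq(A, y, rcond=None)[0]
print(slope, intercept)   # close to 2 and 1
```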
    19. Linear Regression
        • Extension to higher dimensions
        • Polynomial fitting
        • Arbitrary function fitting
          • Wavelets
          • Radial basis functions
        • Classifier output
    20. Logistic Regression
        • Fit Gaussians to the data for each class
        • The decision boundary is where the PDFs cross
        • No closed-form solution when the gradient is set to zero
        • Gradient Descent
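
Because setting the logistic-regression gradient to zero has no closed-form solution, the parameters are found by gradient descent. A minimal batch gradient-descent sketch with 0/1 labels on synthetic data; the learning rate and step count are my own choices.

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-1, 1, size=(100, 2)), rng.normal(+1, 1, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Negative log-likelihood gradient for logistic regression:
# dL/dw = X^T (sigmoid(Xw + b) - y),  dL/db = sum(sigmoid(Xw + b) - y)
w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(1000):
    p = sigmoid(X @ w + b)
    w -= lr * X.T @ (p - y) / len(y)
    b -= lr * np.mean(p - y)

preds = (sigmoid(X @ w + b) > 0.5).astype(int)
print("training accuracy:", np.mean(preds == y))
```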
    21. Graphical Models
        • A general way to describe the dependence relationships between variables.
        • The Junction Tree Algorithm allows us to efficiently calculate marginals over any variable.
    22. Junction Tree Algorithm
        • Moralization
          • "Marry the parents"
          • Make the graph undirected
        • Triangulation
          • Add edges so that no chordless cycle of length 4 or more remains
        • Junction Tree Construction
          • Identify separators such that the running intersection property holds
        • Introduction of Evidence
          • Pass messages around the junction tree to generate marginals
    23. Hidden Markov Models
        • Sequential Modeling
        • Generative Model
        • Relationship between observations and state (class) sequences
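
The relationship between observations and state sequences is the HMM's factored joint probability, p(x, z) = p(z1) p(x1|z1) ∏t p(zt|zt-1) p(xt|zt). A tiny sketch that evaluates it for made-up initial, transition, and emission tables (the numbers are arbitrary).

```python
import numpy as np

pi = np.array([0.6, 0.4])                 # initial state distribution p(z1)
A = np.array([[0.7, 0.3],                 # transition matrix p(z_t | z_{t-1})
              [0.2, 0.8]])
B = np.array([[0.9, 0.1],                 # emission matrix p(x_t | z_t)
              [0.3, 0.7]])

def joint_log_prob(states, obs):
    """log p(x, z) = log p(z1) + log p(x1|z1) + sum_t [log p(zt|zt-1) + log p(xt|zt)]"""
    lp = np.log(pi[states[0]]) + np.log(B[states[0], obs[0]])
    for t in range(1, len(states)):
        lp += np.log(A[states[t - 1], states[t]]) + np.log(B[states[t], obs[t]])
    return lp

print(joint_log_prob(states=[0, 0, 1], obs=[0, 0, 1]))
```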
    24. Perceptron
        • Step function used for squashing.
        • Classifier-as-neuron metaphor.
    25. Perceptron Loss
        • Classification Error vs. Sigmoid Error
        • Loss is only calculated on mistakes
        • Perceptrons use strictly classification error
    26. Neural Networks
        • Interconnected layers of Perceptron or Logistic Regression "neurons"
    27. Neural Networks
        • There are many possible configurations of neural networks
          • Vary the number of layers
          • Vary the size of each layer
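
One way to read "interconnected layers of logistic-regression neurons": a forward pass alternates affine maps with a squashing function. A sketch with random weights; the layer sizes are arbitrary choices of mine.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(5)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # hidden layer: 3 inputs -> 4 units
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)   # output layer: 4 units -> 1 output

def forward(x):
    """Each layer applies weights then a squashing nonlinearity,
    like a stack of logistic-regression 'neurons'."""
    h = sigmoid(W1 @ x + b1)       # hidden activations
    return sigmoid(W2 @ h + b2)    # output in (0, 1)

print(forward(np.array([0.5, -1.0, 2.0])))
```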
    28. Support Vector Machines
        • Maximum Margin Classification
        • Small margin vs. large margin (figures on slide)
    29. Support Vector Machines
        • Optimization Function (equation on slide)
        • Decision Function (equation on slide)
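
The slide's optimization and decision functions did not come through as text. The standard kernel-SVM decision function is f(x) = Σi αi yi K(xi, x) + b, with the αi found by the dual optimization. Rather than re-deriving the solver, this sketch fits an SVM with scikit-learn (assuming it is installed) and then recomputes that decision function by hand from the learned support vectors; the data, kernel width, and C value are my own choices.

```python
import numpy as np
from sklearn.svm import SVC   # assumes scikit-learn is available

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(-2, 1, size=(50, 2)), rng.normal(+2, 1, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

# Decision function f(x) = sum_i alpha_i y_i K(x_i, x) + b,
# where the sum runs only over the support vectors.
def decision(x):
    k = np.exp(-gamma * np.sum((clf.support_vectors_ - x) ** 2, axis=1))
    return clf.dual_coef_[0] @ k + clf.intercept_[0]   # dual_coef_ holds alpha_i * y_i

x_test = np.array([1.5, 1.5])
print(decision(x_test), clf.decision_function([x_test])[0])   # should match
```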
    30. Visualization of Support Vectors (figure on slide)
    31. Questions?
        • Now would be a good time to ask questions about supervised techniques.
    32. Clustering
        • Identify discrete groups of similar data points
        • Data points are unlabeled
    33. Recall K-Means
        Algorithm:
        • Select K, the desired number of clusters
        • Initialize K cluster centroids
        • For each point in the data set, assign it to the cluster with the closest centroid
        • Update each centroid based on the points assigned to it
        • If any data point has changed clusters, repeat
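
The algorithm above as a direct NumPy sketch. Initializing the centroids by sampling K data points is my own choice; the slide does not prescribe an initialization.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # initialize K centroids
    for _ in range(n_iters):
        # Assignment step: each point goes to the cluster with the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):   # assignments have stopped changing
            break
        centroids = new_centroids
    return centroids, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)), rng.normal(4, 0.5, size=(50, 2))])
centroids, labels = kmeans(X, k=2)
print(centroids)   # roughly [0, 0] and [4, 4]
```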
    34. k-means output (figure on slide)
    35. Soft K-means
        • In k-means, we force every data point to exist in exactly one cluster.
        • This constraint can be relaxed.
        • Minimizes the entropy of cluster assignment
    36. Soft k-means example (figure on slide)
    37. Soft k-means
        • We still define a cluster by a centroid, but we calculate the centroid as the weighted mean of all the data points
        • Convergence is based on a stopping threshold rather than changed assignments
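
A sketch of soft k-means along those lines: responsibilities from a softmax over negative squared distances (the stiffness β is a parameter I introduce for illustration), centroids as responsibility-weighted means of all points, and convergence via a stopping threshold on centroid movement.

```python
import numpy as np

def soft_kmeans(X, k, beta=2.0, tol=1e-6, max_iters=200, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Soft assignment: responsibility of cluster j for point i is a softmax
        # over negative squared distances (beta controls how hard the assignment is).
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        r = np.exp(-beta * d2)
        r /= r.sum(axis=1, keepdims=True)
        # Each centroid is the weighted mean of *all* points, weighted by responsibility.
        new_centroids = (r.T @ X) / r.sum(axis=0)[:, None]
        if np.max(np.abs(new_centroids - centroids)) < tol:   # stopping threshold
            return new_centroids, r
        centroids = new_centroids
    return centroids, r

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)), rng.normal(3, 0.5, size=(50, 2))])
centroids, resp = soft_kmeans(X, k=2)
print(centroids)   # near the two cluster centers
print(resp[0])     # soft membership of the first point; the row sums to 1
```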
    38. Gaussian Mixture Models
        • Rather than identifying clusters by "nearest" centroids
        • Fit a set of k Gaussians to the data.
    39. GMM example (figure on slide)
    40. Gaussian Mixture Models
        • Formally, a mixture model is the weighted sum of a number of pdfs, where the weights are determined by a distribution.
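
The slide-note formula near the top of the page, p(x) = π0 f0(x) + … + πk fk(x), evaluated directly with Gaussian components; the particular weights, means, and variances below are made up.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Mixture weights pi_j (a distribution: non-negative, sum to 1) and component pdfs f_j.
weights = np.array([0.5, 0.3, 0.2])
means = np.array([-2.0, 0.0, 3.0])
variances = np.array([1.0, 0.5, 2.0])

def mixture_pdf(x):
    """p(x) = sum_j pi_j * f_j(x), the weighted sum of component densities."""
    return np.sum(weights * gaussian_pdf(x, means, variances))

print(mixture_pdf(0.0))

# Sanity check: the mixture density integrates to approximately 1.
grid = np.linspace(-10, 10, 2001)
step = grid[1] - grid[0]
print(sum(mixture_pdf(t) for t in grid) * step)   # ~1.0
```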
    41. Graphical Models with unobserved variables
        • What if you have variables in a graphical model that are never observed?
          • Latent Variables
        • Training latent variable models is an unsupervised learning application
        • (Figure on slide with example variables: uncomfortable, amused, laughing, sweating)
    42. Latent Variable HMMs
        • We can cluster sequences using an HMM with unobserved state variables
        • We will train the latent variable models using Expectation Maximization
    43. Expectation Maximization
        • Both GMMs and graphical models with latent variables are trained using Expectation Maximization
        • Step 1: Expectation (E-step)
          • Evaluate the "responsibilities" of each cluster with the current parameters
        • Step 2: Maximization (M-step)
          • Re-estimate the parameters using the existing "responsibilities"
        • Related to k-means
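
A compact E-step/M-step loop for a one-dimensional Gaussian mixture, to make the two steps concrete. The initialization and iteration count are arbitrary choices of mine; the responsibilities computed in the E-step are exactly the quantities the slide names.

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.concatenate([rng.normal(-3, 1, 300), rng.normal(2, 0.7, 200)])

k = 2
pi = np.full(k, 1.0 / k)                       # mixture weights
mu = rng.choice(x, size=k, replace=False)      # initial means: two random data points
var = np.full(k, x.var())                      # initial variances

def gauss(x, mu, var):
    return np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(100):
    # E-step: responsibility r[i, j] = p(component j | x_i) under current parameters.
    weighted = pi * gauss(x, mu, var)              # shape (N, k)
    r = weighted / weighted.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters using the responsibilities as soft counts.
    nk = r.sum(axis=0)
    pi = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print(pi)    # roughly [0.6, 0.4]
print(mu)    # roughly [-3, 2] (component order may differ)
print(var)
```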
    44. Questions
        • One more time for questions on supervised learning…
    45. Next Time
        • Gaussian Mixture Models (GMMs)
        • Expectation Maximization
