- 1. Lecture 17: Supervised Learning Recap
  Machine Learning, April 6, 2010
- 2. Last Time
  - Support Vector Machines
  - Kernel Methods
- 3. Today
  - Short recap of kernel methods
  - Review of supervised learning
  - Unsupervised learning: (soft) k-means clustering, Expectation Maximization, spectral clustering, Principal Components Analysis, Latent Semantic Analysis
- 4. Kernel Methods
  - Feature extraction to higher-dimensional spaces.
  - Kernels describe the relationship between vectors (points) rather than the new feature space directly.
- 5. When can we use kernels?
  - Any time training and evaluation are both based on the dot product between two points: SVMs, perceptrons, k-nearest neighbors, k-means, etc.
- 6. Kernels in SVMs
  - Optimize the αi's and the bias with respect to the kernel.
  - Decision function: f(x) = sign(Σi αi yi K(xi, x) + b)
- 7. Kernels in Perceptrons
  - Both training and the decision function can be written in terms of K(xi, xj).
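The slide's formulas were images, so as a sketch (my assumptions: an RBF kernel with γ = 0.5 and a toy XOR-style data set), here is what the dual, kernelized perceptron looks like. Each training point's α counts the mistakes made on it, and both training and prediction touch the data only through kernel evaluations:

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """Radial basis function kernel; gamma is an assumed hyperparameter."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def train_kernel_perceptron(X, y, kernel, epochs=10):
    """Dual perceptron: alpha[i] counts the mistakes made on example i."""
    alpha = [0] * len(X)
    for _ in range(epochs):
        for j in range(len(X)):
            # The prediction uses kernel evaluations against all training points.
            s = sum(alpha[i] * y[i] * kernel(X[i], X[j]) for i in range(len(X)))
            if y[j] * s <= 0:          # mistake (or undecided): update
                alpha[j] += 1
    return alpha

def predict(alpha, X, y, kernel, x):
    s = sum(alpha[i] * y[i] * kernel(X[i], x) for i in range(len(X)))
    return 1 if s >= 0 else -1

# XOR-like data: not linearly separable, but separable with an RBF kernel.
X = [(0, 0), (1, 1), (0, 1), (1, 0)]
y = [1, 1, -1, -1]
alpha = train_kernel_perceptron(X, y, rbf_kernel)
```

With these choices the trained model classifies all four training points correctly, which a primal perceptron on the raw 2-D inputs cannot do.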
- 8. Good and Valid Kernels
  - Good: computing K(xi, xj) is cheaper than computing ϕ(xi).
  - Valid: symmetric (K(xi, xj) = K(xj, xi)), decomposable into ϕ(xi)ᵀϕ(xj), positive semi-definite Gram matrix.
  - Popular kernels: linear, polynomial, radial basis function, string (technically infinite-dimensional), graph.
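A quick way to see "decomposable into ϕ(xi)ᵀϕ(xj)" concretely (my example, not the slide's): the homogeneous degree-2 polynomial kernel in two dimensions equals a dot product in an explicit 3-D feature space:

```python
import math

def poly2_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel on 2-D inputs."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    """The explicit feature map that decomposes poly2_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, y = (1.0, 2.0), (3.0, -1.0)
# Validity in action: K(x, y) = phi(x)^T phi(y), and K is symmetric.
assert abs(poly2_kernel(x, y) - dot(phi(x), phi(y))) < 1e-9
assert poly2_kernel(x, y) == poly2_kernel(y, x)
```

The "good" property also shows: the kernel costs one dot product and a square, while the feature map needs three products per vector plus a 3-D dot product, and the gap widens quickly with dimension and degree.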
- 9. Supervised Learning
  - Linear Regression
  - Logistic Regression
  - Graphical Models
  - Hidden Markov Models
  - Neural Networks
  - Support Vector Machines
  - Kernel Methods
- 10. Major Concepts
  - Gaussian, Multinomial, and Bernoulli distributions
  - Joint vs. conditional distributions
  - Marginalization
  - Maximum likelihood
  - Risk minimization
  - Gradient descent
  - Feature extraction, kernel methods
- 11. Some Favorite Distributions
  - Bernoulli
  - Multinomial
  - Gaussian
- 12. Maximum Likelihood
  - Identify the parameter values that yield the maximum likelihood of generating the observed data: take the partial derivative of the likelihood function, set it to zero, and solve.
  - NB: the maximum likelihood parameters are the same as the maximum log-likelihood parameters.
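As an illustration of that recipe (my example, not from the slide): for a Bernoulli(p), differentiating the log likelihood and setting it to zero gives the sample mean as the closed-form MLE, which can be checked numerically against a grid of candidate parameters:

```python
import math

def bernoulli_log_likelihood(p, data):
    """Log likelihood of i.i.d. 0/1 observations under Bernoulli(p)."""
    return sum(math.log(p) if x == 1 else math.log(1 - p) for x in data)

data = [1, 0, 1, 1, 0, 1, 1, 0]        # 5 successes out of 8
p_hat = sum(data) / len(data)          # closed-form MLE: the sample mean

# The closed-form estimate beats every other candidate on a fine grid.
candidates = [i / 100 for i in range(1, 100)]
best = max(candidates, key=lambda p: bernoulli_log_likelihood(p, data))
assert abs(best - p_hat) < 0.011
```

Maximizing the log likelihood rather than the likelihood itself avoids differentiating the product of eight factors, which is exactly the point of the next slide.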
- 13. Maximum Log Likelihood
  - Why do we like the log function? It turns products (difficult to differentiate) into sums (easy to differentiate):
    log(xy) = log(x) + log(y)
    log(x^c) = c log(x)
- 14. Risk Minimization
  - Pick a loss function: squared loss, linear loss, or perceptron (classification) loss.
  - Identify the parameters that minimize the loss function: take the partial derivative of the loss, set it to zero, and solve.
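The same derivative drives gradient descent when solving in closed form is inconvenient. A minimal sketch (made-up data, a no-intercept line, and an assumed learning rate) minimizing squared loss:

```python
# Squared loss for a no-intercept line y ≈ w * x, minimized by gradient descent.
# Illustrative data: y = 2x exactly, so the loss minimizer is w = 2.
X = [1.0, 2.0, 3.0, 4.0]
Y = [2.0, 4.0, 6.0, 8.0]

w, lr = 0.0, 0.01
for _ in range(500):
    # d/dw of (1/2) * sum (w*x - y)^2  =  sum (w*x - y) * x
    grad = sum((w * x - y) * x for x, y in zip(X, Y))
    w -= lr * grad

assert abs(w - 2.0) < 1e-6
```

Here each update multiplies the error (w - 2) by a constant factor below one, so the iterates converge geometrically to the minimizer.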
- 15. Frequentists vs. Bayesians
  - Point estimates vs. posteriors
  - Risk minimization vs. maximum likelihood
  - L2 regularization: frequentists add a constraint on the size of the weight vector; Bayesians introduce a zero-mean prior on the weight vector. The result is the same!
- 16. L2 Regularization
  - Frequentists: introduce a cost on the size of the weights.
  - Bayesians: introduce a prior on the weights.
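In one dimension the frequentist version has a closed form that makes the shrinkage visible; the same estimate is the MAP solution under a zero-mean Gaussian prior on w. A sketch with illustrative data and an assumed penalty weight lam:

```python
# 1-D ridge regression (no intercept): minimize sum (w*x - y)^2 + lam * w^2.
# Setting the derivative to zero gives the closed form below; the same w is
# the MAP estimate under a zero-mean Gaussian prior on w.
def ridge_1d(X, Y, lam):
    return sum(x * y for x, y in zip(X, Y)) / (sum(x * x for x in X) + lam)

X = [1.0, 2.0, 3.0]
Y = [2.0, 4.0, 6.0]
w_unreg = ridge_1d(X, Y, 0.0)    # 2.0: the ordinary least-squares fit
w_reg = ridge_1d(X, Y, 14.0)     # shrunk toward zero by the penalty
assert w_unreg == 2.0 and 0 < w_reg < w_unreg
```

Larger lam (a tighter constraint, or a narrower prior) pulls the weight further toward zero; lam = 0 recovers plain least squares.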
- 17. Types of Classifiers
  - Generative models: highest resource requirements; need to approximate the joint probability.
  - Discriminative models: moderate resource requirements; typically fewer parameters to approximate than generative models.
  - Discriminant functions: can be trained probabilistically, but the output does not include confidence information.
- 18. Linear Regression
  - Fit a line to a set of points.
- 19. Linear Regression
  - Extends to higher dimensions, polynomial fitting, and arbitrary function fitting (wavelets, radial basis functions), as well as classifier output.
- 20. Logistic Regression
  - Fit Gaussians to the data for each class; the decision boundary is where the PDFs cross.
  - Setting the gradient to zero has no closed-form solution, so use gradient descent.
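A minimal sketch of that gradient descent for 1-D logistic regression, on toy data of my choosing; the update uses the standard gradient of the negative log likelihood:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy 1-D data: negatives below zero, positives above zero.
X = [-2.0, -1.0, 1.0, 2.0]
Y = [0, 0, 1, 1]

w, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    # Gradient of the negative log likelihood:
    #   dL/dw = sum (sigmoid(w*x + b) - y) * x,  dL/db = sum (sigmoid(w*x + b) - y)
    gw = sum((sigmoid(w * x + b) - y) * x for x, y in zip(X, Y))
    gb = sum((sigmoid(w * x + b) - y) for x, y in zip(X, Y))
    w, b = w - lr * gw, b - lr * gb

# The fitted model separates the two classes.
assert all((sigmoid(w * x + b) > 0.5) == (y == 1) for x, y in zip(X, Y))
```

Unlike the squared-loss case, there is no formula to jump to, so iterating the gradient update is the whole algorithm here.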
- 21. Graphical Models
  - A general way to describe the dependence relationships between variables.
  - The Junction Tree Algorithm lets us efficiently calculate marginals over any variable.
- 22. Junction Tree Algorithm
  - Moralization: "marry the parents," then make the graph undirected.
  - Triangulation: add chords so that no cycle of length four or more is chordless.
  - Junction tree construction: identify separators such that the running intersection property holds.
  - Introduction of evidence: pass messages around the junction tree to generate marginals.
- 23. Hidden Markov Models
  - Sequential modeling with a generative model.
  - Captures the relationship between observation sequences and state (class) sequences.
- 24. Perceptron
  - A step function is used for squashing.
  - The classifier follows a neuron metaphor.
- 25. Perceptron Loss
  - Classification error vs. sigmoid error: loss is calculated only on mistakes, and perceptrons use strictly classification error.
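The "loss only on mistakes" property is visible in code: the primal perceptron below (a sketch on toy AND-style data of my choosing) changes its weights only when a point is misclassified:

```python
# Primal perceptron: the weights change only when a point is misclassified,
# reflecting the strict classification loss described above.
def train_perceptron(X, Y, epochs=25):
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(X, Y):          # labels y are in {-1, +1}
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]   # mistake: update
                b += y                                       # otherwise: do nothing
    return w, b

# Linearly separable AND-style data.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
Y = [-1, -1, -1, 1]
w, b = train_perceptron(X, Y)
assert all(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) > 0 for x, y in zip(X, Y))
```

Because the data are separable, the perceptron convergence theorem guarantees the loop stops making mistakes after finitely many updates.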
- 26. Neural Networks
  - Interconnected layers of perceptron or logistic regression "neurons."
- 27. Neural Networks
  - There are many possible configurations of neural networks: vary the number of layers and the size of each layer.
- 28. Support Vector Machines
  - Maximum margin classification. (Figures contrast a small margin with a large margin.)
- 29. Support Vector Machines
  - Optimization function and decision function. (Formulas shown on the slide.)
- 30. Visualization of Support Vectors
- 31. Questions?
  - Now would be a good time to ask questions about supervised techniques.
- 32. Clustering
  - Identify discrete groups of similar data points.
  - The data points are unlabeled.
- 33. Recall K-Means
  - Select K, the desired number of clusters.
  - Initialize K cluster centroids.
  - Assign each point in the data set to the cluster with the closest centroid.
  - Update each centroid based on the points assigned to its cluster.
  - If any data point has changed clusters, repeat.
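The algorithm on slide 33 can be sketched directly; the toy data, seed, and iteration cap are my choices:

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means following the slide's algorithm."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # initialize K centroids
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[j].append(p)
        # Update step: recompute each centroid as the mean of its points.
        new = [tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centroids[j]
               for j, cl in enumerate(clusters)]
        if new == centroids:                     # no change: converged
            break
        centroids = new
    return centroids

# Two well-separated blobs; the centroids settle on the blob means.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = sorted(kmeans(pts, 2))
```

On this data every initialization of two distinct data points ends at the same two blob means, so the result does not depend on the seed here; on harder data k-means can stop at a local optimum.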
- 34. K-means output
- 35. Soft K-Means
  - In k-means, we force every data point to belong to exactly one cluster; this constraint can be relaxed.
  - Soft assignment minimizes the entropy of the cluster assignment.
- 36. Soft k-means example
- 37. Soft K-Means
  - We still define a cluster by a centroid, but we calculate the centroid as the weighted mean of all the data points.
  - Convergence is based on a stopping threshold rather than on changed assignments.
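A sketch of one soft k-means iteration, with an assumed stiffness parameter beta controlling how soft the assignments are. Note that, as slide 37 says, every point contributes to every centroid, weighted by its responsibility:

```python
import math

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def soft_kmeans_step(points, centroids, beta=1.0):
    """One soft k-means iteration: soft responsibilities, then weighted means."""
    k = len(centroids)
    # Responsibility of each cluster for each point (each row sums to 1).
    resp = []
    for p in points:
        w = [math.exp(-beta * dist2(p, c)) for c in centroids]
        total = sum(w)
        resp.append([wi / total for wi in w])
    # Each centroid becomes the responsibility-weighted mean of ALL points.
    new_centroids = []
    for j in range(k):
        rsum = sum(r[j] for r in resp)
        new_centroids.append(tuple(
            sum(r[j] * p[d] for r, p in zip(resp, points)) / rsum
            for d in range(len(points[0]))))
    return new_centroids, resp

pts = [(0.0, 0.0), (1.0, 0.0), (9.0, 0.0), (10.0, 0.0)]
cents = [(2.0, 0.0), (8.0, 0.0)]
for _ in range(20):
    cents, resp = soft_kmeans_step(pts, cents, beta=1.0)
```

With beta large the responsibilities become nearly hard and this reduces to ordinary k-means; with beta small every centroid drifts toward the global mean.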
- 38. Gaussian Mixture Models
  - Rather than identifying clusters by "nearest" centroids, fit a set of k Gaussians to the data.
- 39. GMM example
- 40. Gaussian Mixture Models
  - Formally, a mixture model is the weighted sum of a number of PDFs, where the weights are determined by a distribution π: p(x) = Σk πk pk(x).
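The weighted-sum definition translates directly into code; the component parameters below are illustrative, and the Riemann sum is just a sanity check that the weighted sum of normalized PDFs is itself normalized:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_pdf(x, weights, mus, sigmas):
    """Mixture model: a weighted sum of component PDFs (weights sum to 1)."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))

# Two illustrative components: 30% of the mass at N(0, 1), 70% at N(5, 4).
weights, mus, sigmas = [0.3, 0.7], [0.0, 5.0], [1.0, 2.0]

# Sanity check: a Riemann sum of the density over a wide interval is ~1.
total = sum(mixture_pdf(t / 100, weights, mus, sigmas) for t in range(-1000, 2000)) / 100
```

Because the weights are a distribution over components, a mixture can also be read generatively: first draw a component from π, then draw x from that component's PDF.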
- 41. Graphical Models with Unobserved Variables
  - What if you have variables in a graphical model that are never observed? These are latent variables.
  - Training latent variable models is an unsupervised learning application.
  - (Slide example nodes: uncomfortable, amused, laughing, sweating.)
- 42. Latent Variable HMMs
  - We can cluster sequences using an HMM with unobserved state variables.
  - We will train the latent variable models using Expectation Maximization.
- 43. Expectation Maximization
  - Both GMMs and Gaussian models with latent variables are trained using Expectation Maximization.
  - Step 1, Expectation (E-step): evaluate the "responsibilities" of each cluster with the current parameters.
  - Step 2, Maximization (M-step): re-estimate the parameters using the existing "responsibilities."
  - Related to k-means.
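The two steps can be sketched as EM for a two-component 1-D Gaussian mixture; the toy data, initialization, and variance floor are my assumptions:

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_gmm_1d(data, iters=50):
    """EM for a two-component 1-D Gaussian mixture; a compact sketch."""
    # Crude initialization: means at the data extremes, unit variances, equal weights.
    mus = [min(data), max(data)]
    sigmas = [1.0, 1.0]
    pis = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            w = [pis[j] * gaussian_pdf(x, mus[j], sigmas[j]) for j in range(2)]
            total = sum(w)
            resp.append([wi / total for wi in w])
        # M-step: re-estimate weights, means, and variances from responsibilities.
        for j in range(2):
            nj = sum(r[j] for r in resp)
            pis[j] = nj / len(data)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj
            sigmas[j] = max(math.sqrt(var), 1e-3)   # floor to avoid collapse
    return pis, mus, sigmas

# Two clearly separated 1-D clusters; EM recovers their means and weights.
data = [0.0, 0.2, -0.2, 0.1, 10.0, 10.2, 9.8, 10.1]
pis, mus, sigmas = em_gmm_1d(data)
```

The k-means connection is visible in the structure: hardening the responsibilities to 0/1 and fixing equal spherical variances turns the E-step into nearest-centroid assignment and the M-step into the centroid update.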
- 44. Questions
  - One more time for questions on supervised learning…
- 45. Next Time
  - Gaussian Mixture Models (GMMs)
  - Expectation Maximization
