3. Today Short recap of Kernel Methods; Review of Supervised Learning; Unsupervised Learning: (Soft) K-means Clustering, Expectation Maximization, Spectral Clustering, Principal Component Analysis, Latent Semantic Analysis
4. Kernel Methods Feature extraction into higher-dimensional spaces. Kernels describe the relationship (an inner product) between pairs of vectors (points) rather than describing the new feature space directly.
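A minimal sketch of this idea in Python (the RBF kernel and the gamma value are illustrative choices, not from the slides): we never construct the high-dimensional feature space, we only evaluate the kernel between pairs of points.

    import numpy as np

    def rbf_kernel(x, z, gamma=0.5):
        # Gaussian (RBF) kernel: an inner product in an implicit feature space.
        return np.exp(-gamma * np.sum((x - z) ** 2))

    def gram_matrix(X, kernel=rbf_kernel):
        # Pairwise kernel evaluations -- all a kernel method ever needs to see.
        n = len(X)
        K = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                K[i, j] = kernel(X[i], X[j])
        return K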
5. When can we use kernels? Any time training and evaluation are both based on the dot product between two points: SVMs, the perceptron, k-nearest neighbors, k-means, etc.
6. Kernels in SVMs Optimize the αi’s and the bias w.r.t. the kernel. Decision function:
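As a reminder, the standard kernelized SVM decision function (written here in its usual form; the slide's own equation is not reproduced in this transcript) is

    f(\mathbf{x}) = \operatorname{sign}\Big( \sum_{i=1}^{n} \alpha_i \, y_i \, K(\mathbf{x}_i, \mathbf{x}) + b \Big)

where only the support vectors have nonzero αi.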
12. Maximum Likelihood Identify the parameter values that yield the maximum likelihood of generating the observed data: take the partial derivative of the likelihood function, set it to zero, and solve. NB: the maximum likelihood parameters are the same as the maximum log likelihood parameters, since log is monotonic.
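A standard worked example (not from the slides): for n Bernoulli trials with k successes and parameter θ,

    \log L(\theta) = k \log\theta + (n-k)\log(1-\theta), \qquad
    \frac{\partial \log L}{\partial \theta} = \frac{k}{\theta} - \frac{n-k}{1-\theta} = 0
    \;\Rightarrow\; \hat{\theta} = \frac{k}{n}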
13. Maximum Log Likelihood Why do we like the log function? It turns products (difficult to differentiate) into sums (easy to differentiate): log(xy) = log(x) + log(y), log(x^c) = c·log(x)
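Concretely, for i.i.d. data the log turns the product of likelihoods into a sum of log-likelihoods:

    \log p(x_1, \dots, x_n \mid \theta) = \log \prod_{i=1}^{n} p(x_i \mid \theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)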
14. Risk Minimization Pick a loss function: squared loss, linear loss, perceptron (classification) loss. Identify the parameters that minimize the loss function: take the partial derivative of the loss function, set it to zero, and solve.
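For the squared-loss case this recipe has a closed form; a minimal sketch in Python (assuming a design matrix X with one row per point and a target vector y):

    import numpy as np

    def least_squares(X, y):
        # Setting the gradient of the squared loss, X^T (Xw - y), to zero
        # gives the normal equations X^T X w = X^T y; solve them for w.
        return np.linalg.solve(X.T @ X, X.T @ y)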
15. Frequentists vs. Bayesians Point estimates vs. posteriors. Risk Minimization vs. Maximum Likelihood. L2-Regularization: Frequentists add a constraint on the size of the weight vector; Bayesians introduce a zero-mean prior on the weight vector. The result is the same!
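One way to see the equivalence for squared loss (a standard result, stated here rather than taken from the slide): L2-regularized regression is MAP estimation with a zero-mean Gaussian prior on the weights,

    \arg\min_{\mathbf{w}} \sum_i (y_i - \mathbf{w}^\top \mathbf{x}_i)^2 + \lambda \|\mathbf{w}\|^2
    \;=\; \arg\max_{\mathbf{w}} \Big[ \sum_i \log \mathcal{N}(y_i \mid \mathbf{w}^\top \mathbf{x}_i, \sigma^2) + \log \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \tau^2 I) \Big], \qquad \lambda = \sigma^2 / \tau^2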
17. Types of Classifiers Generative Models: highest resource requirements; need to approximate the joint probability. Discriminative Models: moderate resource requirements; typically fewer parameters to approximate than generative models. Discriminant Functions: can be trained probabilistically, but the output does not include confidence information.
19. Linear Regression Extension to higher dimensions. Polynomial fitting. Arbitrary function fitting: wavelets, radial basis functions. Classifier output.
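A brief sketch of basis-function regression in Python (the sine data and the degree-3 polynomial basis are illustrative assumptions): expand the input with a feature map, then do ordinary least squares in that basis.

    import numpy as np

    def poly_features(x, degree=3):
        # Map a scalar input to polynomial basis features [1, x, x^2, ...].
        return np.vstack([x ** d for d in range(degree + 1)]).T

    x = np.linspace(0.0, 1.0, 20)                            # toy inputs
    y = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(20)    # noisy targets
    Phi = poly_features(x, degree=3)
    w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)              # least squares in the basis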
20. Logistic Regression Fit Gaussians to the data for each class; the decision boundary is where the PDFs cross. Setting the gradient to zero has no closed-form solution, so use gradient descent.
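A minimal gradient-descent sketch in Python (the learning rate and step count are arbitrary assumptions; labels are taken to be 0/1):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def logistic_regression_gd(X, y, lr=0.1, steps=1000):
        # Gradient descent on the negative log likelihood of a logistic model.
        w = np.zeros(X.shape[1])
        for _ in range(steps):
            grad = X.T @ (sigmoid(X @ w) - y)   # gradient of the negative log likelihood
            w -= lr * grad
        return w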
21. Graphical Models A general way to describe the dependence relationships between variables. The Junction Tree Algorithm allows us to efficiently calculate marginals over any variable.
22. Junction Tree Algorithm Moralization: “marry the parents,” then make the graph undirected. Triangulation: add chords so no chordless cycle of length 4 or more remains. Junction Tree Construction: identify separators such that the running intersection property holds. Introduction of Evidence: pass messages around the junction tree to generate marginals.
23. Hidden Markov Models Sequential modeling. A generative model of the relationship between observation sequences and state (class) sequences.
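In standard notation, the generative factorization an HMM assumes over a state sequence z and an observation sequence x is

    p(x_{1:T}, z_{1:T}) = p(z_1)\, p(x_1 \mid z_1) \prod_{t=2}^{T} p(z_t \mid z_{t-1})\, p(x_t \mid z_t)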
33. Recall K-Means Algorithm Select K, the desired number of clusters. Initialize K cluster centroids. For each point in the data set, assign it to the cluster with the closest centroid. Update each centroid based on the points assigned to its cluster. If any data point has changed clusters, repeat.
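A compact sketch of this loop in Python (random initialization from the data and a fixed iteration cap are implementation assumptions):

    import numpy as np

    def kmeans(X, k, iters=100):
        # Hard k-means: assign each point to its nearest centroid, then recompute centroids.
        centroids = X[np.random.choice(len(X), k, replace=False)]
        for _ in range(iters):
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            assign = dists.argmin(axis=1)
            new_centroids = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                                      else centroids[j] for j in range(k)])
            if np.allclose(new_centroids, centroids):   # centroids stopped moving, so assignments cannot change
                break
            centroids = new_centroids
        return centroids, assign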
35. Soft K-means In k-means, we force every data point to belong to exactly one cluster. This constraint can be relaxed; the soft version minimizes the entropy of the cluster assignment.
37. Soft k-means We still define a cluster by a centroid, but we calculate the centroid as the weighted mean of all the data points. Convergence is based on a stopping threshold rather than on changed assignments.
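A sketch of one update in Python (the stiffness parameter beta and its softmax form are a common formulation, assumed here rather than taken from the slides):

    import numpy as np

    def soft_kmeans_step(X, centroids, beta=1.0):
        # Responsibilities: a softmax over negative squared distances to each centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        logits = -beta * d2
        logits -= logits.max(axis=1, keepdims=True)        # numerical stability
        r = np.exp(logits)
        r /= r.sum(axis=1, keepdims=True)                  # soft cluster assignments
        # New centroids: responsibility-weighted means of all the data points.
        new_centroids = (r.T @ X) / r.sum(axis=0)[:, None]
        return new_centroids, r

The loop around this step would end when the centroids move less than a chosen stopping threshold.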
38. Gaussian Mixture Models Rather than identifying clusters by “nearest” centroids, fit a set of k Gaussians to the data.
40. Gaussian Mixture Models Formally, a mixture model is the weighted sum of a number of PDFs, where the weights are determined by a mixing distribution.
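Written out for Gaussian components (standard form; the slide's own equation is not in this transcript):

    p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1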
41. Graphical Models with Unobserved Variables What if you have variables in a graphical model that are never observed? These are latent variables. Training latent variable models is an unsupervised learning application. (Figure on the slide: an example model with the variables “uncomfortable,” “amused,” “laughing,” and “sweating.”)
42. Latent Variable HMMs We can cluster sequences using an HMM with unobserved state variables. We will train the latent variable models using Expectation Maximization.
43. Expectation Maximization Both the training of GMMs and of graphical models with latent variables is accomplished using Expectation Maximization. Step 1: Expectation (E-step): evaluate the “responsibilities” of each cluster with the current parameters. Step 2: Maximization (M-step): re-estimate the parameters using the existing “responsibilities.” Related to k-means.
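A compact EM-for-GMM sketch in Python (random initialization, a fixed iteration count, and the small covariance regularizer are implementation assumptions):

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_gmm(X, k, iters=50):
        n, d = X.shape
        pi = np.full(k, 1.0 / k)                           # mixing weights
        mu = X[np.random.choice(n, k, replace=False)]      # initial means
        cov = np.array([np.eye(d) for _ in range(k)])      # initial covariances
        for _ in range(iters):
            # E-step: responsibility of each component for each point.
            r = np.array([pi[j] * multivariate_normal.pdf(X, mu[j], cov[j])
                          for j in range(k)]).T
            r /= r.sum(axis=1, keepdims=True)
            # M-step: re-estimate weights, means, and covariances from the responsibilities.
            nk = r.sum(axis=0)
            pi = nk / n
            mu = (r.T @ X) / nk[:, None]
            for j in range(k):
                diff = X - mu[j]
                cov[j] = (r[:, j, None] * diff).T @ diff / nk[j] + 1e-6 * np.eye(d)
        return pi, mu, cov

With responsibilities forced to be hard 0/1 assignments and covariances fixed to the identity, the same loop reduces to k-means.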