
Introduction to Machine Learning
Aristotelis Tsirigos



  1. Introduction to Machine Learning. Aristotelis Tsirigos (email: ). Dennis Shasha, Advanced Database Systems, NYU Computer Science.
  2. What is Machine Learning?
     - Principles, methods and algorithms for making predictions from past experience
     - Not mere memorization, but the capability to generalize to novel situations based on past experience
     - Where no theoretical model exists to explain the data, machine learning can be employed to offer such a model
  3. Learning models
     - Classified according to how active or passive the learner is:
     - Statistical learning model
       - No control over observations; they are presented at random in an independent, identically distributed (iid) fashion
     - Online model
       - An external source presents the observations to the learner in a query form
     - Query model
       - The learner queries an external "expert" source
  4. Types of learning problems
     - A very rough categorization of learning problems:
     - Unsupervised learning
       - Clustering, density estimation, feature selection
     - Supervised learning
       - Classification, regression
     - Reinforcement learning
       - Feedback, games
  5. Outline
     - Learning methods
       - Bayesian learning
       - Nearest neighbor
       - Decision trees
       - Linear classifiers
       - Ensemble methods (bagging & boosting)
     - Testing the learner
     - Learner evaluation
     - Practical issues
     - Resources
  6. Bayesian learning - Introduction
     - Given are:
       - Observed data D = { d_1, d_2, ..., d_n }
       - Hypothesis space H
     - In the Bayesian setup we want to find the hypothesis that best fits the data in a probabilistic manner
     - In general, this is computationally intractable without any assumptions about the data
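The slide's formula was an image and did not survive extraction; the "best fit in a probabilistic manner" criterion it describes is, in the standard maximum a posteriori (MAP) formulation:

```latex
h^{*} = \arg\max_{h \in H} P(h \mid D)
```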
  7. Bayesian learning - Elaboration
     - First transformation: apply Bayes' rule to P(h | D)
     - Notice that the optimal choice also depends on the a priori probability P(h) of hypothesis h
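The two formulas on this slide were images; the standard derivation the bullets describe is Bayes' rule followed by dropping the constant denominator:

```latex
P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}
\quad\Longrightarrow\quad
h^{*} = \arg\max_{h \in H} P(D \mid h)\,P(h)
```

since P(D) is the same for every hypothesis and does not affect the maximization.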
  8. Bayesian learning - Independence
     - We can further simplify by assuming that the data in D are independently drawn from the underlying distribution
     - Under this assumption the likelihood factorizes over the individual observations
     - How do we estimate P(d_i | h) in practice?
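The factorization formula itself was lost in extraction; under the stated independence assumption it reads:

```latex
P(D \mid h) = \prod_{i=1}^{n} P(d_i \mid h)
\quad\Longrightarrow\quad
h^{*} = \arg\max_{h \in H}\; P(h) \prod_{i=1}^{n} P(d_i \mid h)
```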
  9. Bayesian learning - Analysis
     - For any hypothesis h in H we assume a distribution:
       - P(d | h) for any point d in the input space
     - If d is a point in an m-dimensional space, then the distribution is in fact:
       - P(d^(1), d^(2), ..., d^(m) | h)
     - Problems:
       - Complex optimization problem; suboptimal techniques can be used if the distributions are differentiable
       - In general, there are too many parameters to estimate, therefore a lot of data is needed for reliable estimation
  10. Bayesian learning - Analysis
     - Need to further analyze the distribution P(d | h):
       - We can assume the features are independent (Naïve Bayes), or
       - Build Bayesian networks where the dependencies among the features are explicitly modeled
     - Still we have to somehow learn the distributions:
       - Model them as parametrized distributions (e.g. Gaussians)
       - Estimate the parameters using standard greedy techniques (e.g. Expectation Maximization)
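To make the Naïve Bayes assumption concrete, here is a minimal sketch (not from the slides) of a categorical Naïve Bayes classifier with Laplace smoothing; the toy weather-style data and the helper names (`train_nb`, `predict`) are illustrative assumptions:

```python
from collections import Counter, defaultdict

def train_nb(X, y):
    """Estimate P(class) and P(feature_value | class) with Laplace smoothing."""
    classes = Counter(y)
    counts = defaultdict(int)       # counts[(j, v, c)] = times feature j took value v in class c
    values = defaultdict(set)       # distinct values seen for each feature j
    for xi, c in zip(X, y):
        for j, v in enumerate(xi):
            counts[(j, v, c)] += 1
            values[j].add(v)

    def predict(x):
        best, best_p = None, -1.0
        for c, nc in classes.items():
            p = nc / len(y)                               # prior P(c)
            for j, v in enumerate(x):                     # naive product of P(v | c)
                p *= (counts[(j, v, c)] + 1) / (nc + len(values[j]))
            if p > best_p:
                best, best_p = c, p
        return best
    return predict

# toy data: features (outlook, wind), label = play
X = [("sunny", "weak"), ("sunny", "strong"), ("rain", "weak"), ("rain", "strong")]
y = ["yes", "no", "yes", "no"]
predict = train_nb(X, y)
print(predict(("sunny", "weak")))   # "yes"
```

The factored per-feature estimates are exactly what makes the parameter count manageable compared with modeling the full joint P(d | h).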
  11. Bayesian learning - Summary
     - Makes use of prior knowledge of:
       - The likelihood of alternative hypotheses, and
       - The probability of observing the data given a specific hypothesis
     - The goal is to determine the most probable hypothesis given a series of observations
     - The Naïve Bayes method has been found useful in practical applications (e.g. text classification)
     - If the naïve assumption is not appropriate, there is a generic algorithm (EM) that can be used to find a hypothesis that is locally optimal
  12. Nearest Neighbor - Introduction
     - Belongs to the class of instance-based learners:
       - The learner does not make any global prediction of the target function; it only predicts locally for a given point (lazy learner)
     - Idea:
       - Given a query instance x, look at past observations D that are "close" to x in order to determine x's class y
     - Issues:
       - How do we define distance?
       - How do we define the notion of "neighborhood"?
  13. Nearest Neighbor - Details
     - Classify a new instance x according to its neighborhood N(x)
     - The neighborhood can be defined in different ways:
       - Constant radius
       - k nearest neighbors
     - Weights are a function of distance: w_i = w(d(x, x_i))
     - Classification rule (weighted vote over the neighborhood, with y_i the label of point x_i and w_i its weight):
       - y = sign( sum over x_i in N(x) of w_i * y_i )
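A minimal sketch (not from the slides) of the distance-weighted k-nearest-neighbor rule for +1/-1 labels; the inverse-distance weight and the toy 2-D points are illustrative assumptions:

```python
import math

def knn_predict(train, x, k=3):
    """Distance-weighted k-NN vote; labels are +1 / -1."""
    # take the k training points closest to the query (Euclidean distance)
    neighbors = sorted(train, key=lambda p: math.dist(p[0], x))[:k]
    score = 0.0
    for xi, yi in neighbors:
        w = 1.0 / (math.dist(xi, x) + 1e-9)   # weight shrinks with distance
        score += w * yi
    return 1 if score >= 0 else -1

train = [((0.0, 0.0), -1), ((0.1, 0.2), -1), ((1.0, 1.0), 1), ((0.9, 1.1), 1)]
print(knn_predict(train, (0.95, 1.0)))   # 1
print(knn_predict(train, (0.05, 0.1)))   # -1
```

Swapping the weight function or the neighborhood definition changes the classifier without retraining anything, which is the "lazy learner" point of the slide.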
  14. Nearest Neighbor - Summary
     - Classify new instances according to their closest points
     - Control accuracy in three ways:
       - Distance metric
       - Definition of the neighborhood
       - Weight assignment
     - These parameters must be tuned depending on the problem:
       - Is there noise in the data?
       - Are there outliers?
       - What is a "natural" distance for the data?
  15. Decision Trees - Introduction
     - Suppose the data is categorical
     - Observation:
       - Distance cannot be defined in a natural way
       - We need a learner that operates directly on the attribute values

     Example data (target attribute: Elected):

       Economy | Popularity | Gas prices | War casualties | Elected
       --------|------------|------------|----------------|--------
       BAD     | HIGH       | LOW        | NO             | YES
       BAD     | HIGH       | LOW        | YES            | NO
       BAD     | LOW        | HIGH       | NO             | NO
       BAD     | LOW        | LOW        | NO             | NO
       GOOD    | LOW        | LOW        | YES            | YES
       GOOD    | LOW        | HIGH      | NO             | NO
       GOOD    | HIGH       | LOW        | NO             | YES
       GOOD    | HIGH       | LOW        | YES            | YES
       GOOD    | HIGH       | HIGH       | NO             | YES
       GOOD    | HIGH       | HIGH       | YES            | YES
  16. Decision Trees - The model
     - Idea: a decision tree that "explains" the data
     - Observation:
       - In general, there is no unique tree to represent the data
       - In some nodes the decision is not strongly supported by the data

     [Figure: a decision tree for the example data, splitting first on Economy, then on Popularity and War casualties in the GOOD branch and on Gas prices and Popularity in the BAD branch; each leaf is annotated with its Elected YES/NO counts (e.g. YES=4 NO=0).]
  17. Decision Trees - Training
     - Build the tree from top to bottom, choosing one attribute at a time
     - How do we make the choice? Idea:
       - Choose the most "informative" attribute first
       - Having no other information, which attribute allows us to classify correctly most of the time?
     - This can be quantified using the Information Gain metric:
       - Based on entropy, a measure of randomness
       - Measures the reduction in uncertainty about the target value given the value of one of the attributes, and therefore tells us how informative that attribute is
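A minimal sketch (not from the slides) of entropy and information gain, applied to two attributes from the slide-15 example table; splitting on Economy turns out to be far more informative than splitting on Gas prices:

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum p(y) log2 p(y) over the label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Reduction in label entropy after splitting on one attribute."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# target and two candidate split attributes from the example table
elected = ["YES", "NO", "NO", "NO", "YES", "NO", "YES", "YES", "YES", "YES"]
economy = ["BAD"] * 4 + ["GOOD"] * 6
gas     = ["LOW", "LOW", "HIGH", "LOW", "LOW", "HIGH", "LOW", "LOW", "HIGH", "HIGH"]
print(round(information_gain(economy, elected), 3))   # 0.256
print(round(information_gain(gas, elected), 3))       # Economy wins by a wide margin
```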
  18. Decision Trees - Overfitting
     - Problems:
       - The solution is greedy, and therefore suboptimal
       - The optimal solution is infeasible due to time constraints
       - Is "optimal" really optimal? What if the observations are corrupted by noise?
     - We are really interested in the true error, not the training error:
       - Overfitting occurs in the presence of noise
       - Occam's razor: prefer simpler solutions
       - Apply pruning to eliminate nodes with low statistical support
  19. Decision Trees - Pruning
     - Get rid of nodes with low support

     [Figure: the tree from slide 16, with its weakly supported subtrees collapsed into leaves.]

     - The pruned tree does not fully explain the data, but we hope that it will generalize better on unseen instances
  20. Decision Trees - Summary
     - Advantages:
       - Handles categorical data
       - Easy to interpret in a simple rule format
     - Disadvantages:
       - Hard to accommodate numerical data
       - Suboptimal solution
       - Bias towards simple trees
  21. Linear Classifiers - Introduction
     - Decision function: f(x) = <w . x> + b; predicted label: y = sign(f(x))
     - There is an infinite number of hyperplanes f(x) = 0 that can separate positive from negative examples!
     - Is there an optimal one to choose?
  22. Linear Classifiers - Margins
     - Make sure the hyperplane f(x) = 0 leaves enough room for future points!
     - Margins:
       - For a training point x_i, define its margin as its signed distance from the hyperplane: gamma_i = y_i f(x_i) / ||w||
       - For the classifier f, the margin is the worst of all the training margins: gamma = min_i gamma_i
  23. Linear Classifiers - Optimization
     - Now the only thing we have to do is find the f with the maximum possible margin!
       - This is a quadratic optimization problem
     - This classifier is known as the Support Vector Machine
       - The optimal w* yields the maximum margin gamma*
       - Finally, the optimal hyperplane can be written in terms of the training points
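The optimization problem appeared as an image on the slide; the standard hard-margin form it most likely showed is:

```latex
\min_{\mathbf{w},\,b}\ \tfrac{1}{2}\|\mathbf{w}\|^2
\quad \text{subject to} \quad
y_i\,(\langle \mathbf{w}, \mathbf{x}_i \rangle + b) \ge 1 \ \ \forall i
```

with maximum margin gamma* = 1 / ||w*||, and the optimal hyperplane expanded over the training points as

```latex
f(\mathbf{x}) = \sum_i \alpha_i\, y_i\, \langle \mathbf{x}_i, \mathbf{x} \rangle + b
```

where the coefficients alpha_i are nonzero only for the support vectors.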
  24. Linear Classifiers - Problems
     - What if the data is noisy or, worse, linearly inseparable?
     - Solution 1:
       - Allow for outliers in the data when the data is noisy
     - Solution 2:
       - Increase the dimensionality by creating composite features if the target function is nonlinear
     - Solution 3:
       - Do both 1 and 2
  25. Linear Classifiers - Outliers
     - Impose softer restrictions on the margin distribution to accept outliers
     - Now our classifier is more flexible and more powerful, but there are more parameters to estimate

     [Figure: a separating hyperplane f(x) = 0 with one outlier on the wrong side of its margin.]
  26. Linear Classifiers - Nonlinearity
     - Combine input features to form more complex ones:
       - Initial space: x = (x_1, x_2, ..., x_m)
       - Induced space: Phi(x) = (x_1^2, ..., x_m^2, sqrt(2) x_1 x_2, sqrt(2) x_1 x_3, ..., sqrt(2) x_{m-1} x_m)
       - The inner product can now be written as <Phi(x) . Phi(y)> = <x . y>^2
     - Kernels:
       - The above product is denoted K(x, y) = <Phi(x) . Phi(y)> and is called a kernel
       - Kernels induce nonlinear feature spaces based on the initial feature space
       - There is a huge collection of kernels for vectors, trees, strings, graphs, time series, ...
     - Linear separation in the composite feature space implies a nonlinear separation in the initial space!
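The kernel identity above can be checked numerically; this sketch (not from the slides, with illustrative vectors) verifies that the explicit map Phi gives the same inner product as the squared dot product:

```python
def phi(x):
    """Explicit degree-2 feature map: squares, then sqrt(2)-scaled cross terms."""
    m = len(x)
    feats = [xi * xi for xi in x]
    feats += [(2 ** 0.5) * x[i] * x[j] for i in range(m) for j in range(i + 1, m)]
    return feats

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

x = [1.0, 2.0, 3.0]
y = [0.5, -1.0, 2.0]
lhs = dot(phi(x), phi(y))   # inner product in the induced space
rhs = dot(x, y) ** 2        # kernel K(x, y) = <x . y>^2
print(abs(lhs - rhs) < 1e-9)   # True
```

The point of the "kernel trick" is the right-hand side: it costs O(m) instead of building the O(m^2)-dimensional feature vector at all.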
  27. Linear Classifiers - Summary
     - Provide a generic solution to the learning problem
     - We just have to solve an easy optimization problem
     - Parametrized by the induced feature space and the noise parameters
     - There exist theoretical bounds on their performance
  28. Ensembles - Introduction
     - Motivation:
       - Relying on just one classifier is "too risky"
     - Idea:
       - Combine a group of classifiers into the final learner
     - Intuition:
       - Each classifier is associated with some risk of wrong predictions on future data
       - Instead of investing in just one risky classifier, we can distribute the decision over many classifiers, effectively reducing the overall risk
  29. Ensembles - Bagging
     - Main idea:
       - From the observations D, draw T subsets D_1, ..., D_T at random
       - For each D_i, train a "base" classifier f_i (e.g. a decision tree)
       - Finally, combine the T classifiers into one classifier f by taking a majority vote
     - Observations:
       - We need enough observations to get partitions that approximately respect the iid condition (|D| >> T)
       - How do we decide on the base classifier?
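A minimal sketch (not from the slides) of the bagging idea; note it draws bootstrap samples (sampling with replacement), the most common variant, rather than the disjoint partition the slide describes, and the one-dimensional "stump" base learner and toy data are illustrative assumptions:

```python
import random
from collections import Counter

def bagging_train(data, T, base_learner, seed=0):
    """Train T base classifiers, each on a bootstrap sample of the data."""
    rng = random.Random(seed)
    models = []
    for _ in range(T):
        sample = [rng.choice(data) for _ in data]   # sample with replacement
        models.append(base_learner(sample))
    return models

def majority_vote(models, x):
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

def stump_learner(sample):
    """Toy base learner: threshold at the sample mean, positive class on the right."""
    t = sum(x for x, _ in sample) / len(sample)
    return lambda x, t=t: 1 if x >= t else -1

data = [(0.0, -1), (0.2, -1), (0.4, -1), (1.6, 1), (1.8, 1), (2.0, 1)]
models = bagging_train(data, T=11, base_learner=stump_learner)
print(majority_vote(models, 2.5))    # 1
print(majority_vote(models, -0.1))   # -1
```

Using an odd T avoids ties in the binary vote.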
  30. Ensembles - Boosting
     - Main idea:
       - Run through a number of iterations
       - At each iteration t, a "weak" classifier f_t is trained on a weighted version of the training data (initially the weights are equal)
       - Each point's weight is updated so that examples with a poor margin with respect to f_t are assigned a higher weight, in an attempt to "boost" them in the next iterations
       - The classifier itself is assigned a weight alpha_t according to its training error
     - Combine all the classifiers into a weighted majority vote
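The weighted-vote formula on this slide was an image; in the standard AdaBoost formulation (assuming labels y_i in {-1, +1} and point weights D_t), the quantities the bullets describe are:

```latex
\varepsilon_t = \sum_i D_t(i)\,\mathbb{1}[f_t(x_i) \neq y_i],
\qquad
\alpha_t = \tfrac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}
```

```latex
D_{t+1}(i) \propto D_t(i)\, e^{-\alpha_t\, y_i\, f_t(x_i)},
\qquad
F(x) = \operatorname{sign}\Big(\sum_t \alpha_t\, f_t(x)\Big)
```

Points misclassified by f_t (negative margin y_i f_t(x_i)) have their weight multiplied up, so the next weak learner concentrates on them.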
  31. Bagging vs. Boosting
     - Two distinct ways to apply the diversification idea:

                     | Bagging                   | Boosting
       --------------|---------------------------|--------------------------
       Training data | Partition before training | Adaptive data weighting
       Base learner  | Complex                   | Simple
       Effect        | Risk minimization         | Margin maximization
  32. Testing the learner
     - How do we estimate the learner's performance?
     - Create test sets from the original observations:
       - Test set:
         - Partition the data into training and test sets, and use the error on the test set as an estimate of the true error
       - Leave-one-out:
         - Remove one point, train on the rest, then report the error on that point
         - Do this for all points and report the mean error
       - k-fold cross-validation:
         - Randomly partition the data set into k non-overlapping sets
         - Choose one set at a time for testing and train on the rest
         - Report the mean error
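The k-fold procedure can be sketched in a few lines (not from the slides; the fixed-threshold "learner" and integer toy data are illustrative assumptions):

```python
import random

def k_fold_indices(n, k, seed=0):
    """Randomly partition indices 0..n-1 into k non-overlapping folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(data, k, train_fn, error_fn):
    folds = k_fold_indices(len(data), k)
    errors = []
    for i, test_idx in enumerate(folds):
        test = [data[j] for j in test_idx]
        train = [data[j] for f in folds[:i] + folds[i + 1:] for j in f]
        model = train_fn(train)            # train on k-1 folds
        errors.append(error_fn(model, test))   # test on the held-out fold
    return sum(errors) / k                 # mean error over the k folds

# toy usage: integers labeled 1 if greater than 4
data = [(i, 1 if i > 4 else 0) for i in range(10)]
train_fn = lambda train: (lambda x: 1 if x > 4 else 0)   # hypothetical fixed learner
error_fn = lambda model, test: sum(model(x) != y for x, y in test) / len(test)
print(cross_validate(data, 5, train_fn, error_fn))   # 0.0
```

Leave-one-out is the special case k = n.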
  33. Learner evaluation - PAC learning
     - Probably Approximately Correct (PAC) learning of a target class C using hypothesis space H:
       - For all target functions f in C and for all 0 < epsilon, delta < 1/2, with probability at least (1 - delta) we can learn a hypothesis h in H that approximates f with an error of at most epsilon
     - Generalization (or true) error:
       - We really care about the error on unseen data
     - Statistical learning theory gives us the tools to express the true error (and its confidence) in terms of:
       - The empirical (or training) error
       - The confidence 1 - delta of the true error
       - The number of training examples
       - The complexity of the classes C and/or H
  34. Learner evaluation - VC dimension
     - The VC dimension of a hypothesis space H measures its power to interpret the observations
     - An infinite hypothesis space does not necessarily imply infinite VC dimension!
     - Bad news:
       - If we allow a hypothesis space with infinite VC dimension, learning is impossible (it requires an infinite number of observations)
     - Good news:
       - For the class of large-margin linear classifiers, an error bound in terms of the margin can be proven
  35. Practical issues
     - Machine learning is driven by data, so a good learner must be data-dependent in all aspects:
       - Hypothesis space
       - Prior knowledge
       - Feature selection and composition
       - Distance/similarity measures
       - Outliers/noise
     - Never forget that learners and training algorithms must be efficient in time and space with respect to:
       - The feature space dimensionality
       - The training set size
       - The hypothesis space size
  36. Conclusions
     - Machine learning is mostly art and a little bit of science!
     - For each problem at hand, a different classifier will be the optimal one
     - This simply means that the solution must be data-dependent:
       - Select an "appropriate" family of classifiers (e.g. decision trees)
       - Choose the right representation for the data in the feature space
       - Tune the available parameters of your favorite classifier to reflect the "nature" of the data
     - There are many practical applications, especially when there is no good theory available to model the data
  37. Resources
     - Books
       - T. Mitchell, Machine Learning
       - N. Cristianini & J. Shawe-Taylor, An Introduction to Support Vector Machines
       - V. Kecman, Learning and Soft Computing
       - R. Duda, P. Hart & D. Stork, Pattern Classification
     - Online tutorials
       - A. Moore,
     - Software
       - WEKA:
       - SVMlight: