  1. 1. Introduction to Machine Learning course 67577 fall 2007 <ul><li>Lecturer: Amnon Shashua </li></ul><ul><li>Teaching Assistant: Yevgeny Seldin </li></ul><ul><li>School of Computer Science and Engineering </li></ul><ul><li>Hebrew University </li></ul>
  2. 2. What is Machine Learning? <ul><li>Inference engine (computer program) that when given sufficient data (examples) computes a function that matches as closely as possible the process generating the data. </li></ul><ul><li>Make accurate predictions based on observed data </li></ul><ul><li>Algorithms to optimize a performance criterion based on observed data </li></ul><ul><li>Learning to do better in the future based on what was experienced in the past </li></ul><ul><li>Programming by examples: instead of writing a program to solve a task directly, machine learning seeks methods by which the computer will come up with its own program based on training examples. </li></ul>
  3. 3. Why Machine Learning? <ul><li>Data-driven algorithms are able to examine large amounts of data. A human expert, on the other hand, is likely to be guided by subjective impressions or by examining a relatively small number of examples. </li></ul><ul><li>Humans often have trouble expressing what they know but have no difficulty in labeling data </li></ul><ul><li>Machine learning is effective in domains where declarative (rule based) knowledge is difficult to obtain yet generating training data is easy </li></ul>
  4. 4. Typical Examples <ul><li>Visual recognition (say, detect faces in an image): the amount of variability in appearance introduces challenges that are beyond the capacity of direct programming </li></ul><ul><li>Spam filtering: data-driven programming can adapt to changing tactics by spammers </li></ul><ul><li>Extract topics from documents: categorize news articles by whether they are about politics, sports, science, etc. </li></ul><ul><li>Natural language understanding: from spoken words to text; categorize the meaning of spoken sentences </li></ul><ul><li>Optical character recognition (OCR) </li></ul><ul><li>Medical diagnosis: from symptoms to diagnosis </li></ul><ul><li>Credit card transaction fraud detection </li></ul><ul><li>Wealth prediction </li></ul>
  5. 5. Fundamental Issues <ul><li>Over-fitting: doing well on a training set does not guarantee accuracy on new examples </li></ul><ul><li>What is the resource we wish to optimize? For a given accuracy, use the smallest size training set </li></ul><ul><li>Examples are drawn from some (fixed) distribution D over X x Y (instance space x output space). Does the learner actually need to recover D during the learning process? </li></ul><ul><li>How does the learning process depend on the complexity of the family of learning functions (concept class C)? How does one define complexity of C? </li></ul><ul><li>When the goal is to learn the joint distribution D then the problem is computationally unwieldy because the joint distribution table is exponentially large. What assumptions can be made to simplify the task? </li></ul>
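The over-fitting point above can be illustrated with a small sketch. Everything here is an assumption made for illustration and is not taken from the slides: the data are one-dimensional, labels follow a threshold at 0.5 with 20% label noise, and the learner is a 1-nearest-neighbor rule that simply memorizes the training set.

```python
import random

def make_data(m, rng, noise=0.2):
    """Draw m examples: x uniform in [0,1], y = 1 if x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(m):
        x = rng.random()
        y = 1 if x > 0.5 else 0
        if rng.random() < noise:
            y = 1 - y  # label noise
        data.append((x, y))
    return data

def nearest_neighbor(train):
    """Hypothesis that memorizes the training set (1-nearest-neighbor)."""
    def h(x):
        return min(train, key=lambda ex: abs(ex[0] - x))[1]
    return h

def accuracy(h, data):
    """Fraction of examples in `data` that h labels correctly."""
    return sum(h(x) == y for x, y in data) / len(data)

rng = random.Random(0)
train = make_data(50, rng)
test = make_data(1000, rng)

h = nearest_neighbor(train)
train_acc = accuracy(h, train)  # perfect on the memorized examples
test_acc = accuracy(h, test)    # lower on fresh examples from the same distribution
print(train_acc, test_acc)
```

The memorizing hypothesis reproduces every training label, including the noisy ones, so its training accuracy says little about how it behaves on new draws from the same distribution.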
  6. 6. Supervised vs. Un-supervised Supervised Learning Models: h: X → Y, where X is the instance (data) space and Y is the output space <ul><li>Multiclass classification: Y = {1, ..., k}. k=2 is normally of most interest. </li></ul><ul><li>Regression: Y = R. Predict the price of a used car given brand, year, mileage. Kinematics of a robot arm; navigate by determining steering angle from image input. </li></ul> Un-supervised Learning Models: Find regularities in the input data assuming there is some structure in the input space <ul><li>Density estimation </li></ul><ul><li>Clustering (non-parametric density estimation): divide customers into groups that have similar attributes. </li></ul><ul><li>Latent class models: extract topics from documents </li></ul><ul><li>Compression: represent the input space with fewer parameters; projection to lower-dimensional spaces </li></ul>
  7. 7. Notations X is the instance space: the space from which observations are drawn. x ∈ X is an input instance, a single observation. Y is the output space: the set of possible outcomes that can be associated with a measurement. An example is an instance-label pair (x,y). If |Y|=2 one typically uses {0,1} or {-1,1}. We say that an example (x,y) is positive if y=1 and otherwise we call it a negative example. A training set Z consists of m instance-label pairs: Z = {(x_1,y_1), ..., (x_m,y_m)}. In some cases we refer to the training set without labels: S = {x_1, ..., x_m}.
  8. 8. Notations A concept (hypothesis) class C is a set (not necessarily finite) of functions of the form h: X → Y. Each h ∈ C is called a concept or hypothesis or classifier. Examples of concept classes: Decision trees: when X = {0,1}^n, any Boolean function can be described by a binary tree, so C consists of decision trees. Conjunction learning: a conjunction is a special case of a Boolean formula. A literal is a variable or its negation and a term is a conjunction of literals, for example x_1 ∧ ¬x_2 ∧ x_3. A target function is a term which consists of a subset of literals. In this case X = {0,1}^n and Y = {0,1}. Separating hyperplanes: a concept h(x) is specified by a vector w and a scalar b such that the label of x is determined by the sign of w·x - b.
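Two of the concept classes named above, separating hyperplanes and conjunctions, can be written as ordinary functions. This is a minimal sketch: the function names, the `(index, value)` encoding of literals, and the choice to label ties (w·x equal to b) as +1 are assumptions made here, not conventions from the slides.

```python
def hyperplane(w, b):
    """Concept specified by a vector w and a scalar b.

    Labels x as +1 when w.x >= b and -1 otherwise (tie convention assumed).
    """
    def h(x):
        score = sum(wi * xi for wi, xi in zip(w, x))
        return 1 if score >= b else -1
    return h

def conjunction(literals):
    """Term over Boolean inputs.

    `literals` is a list of (index, value) pairs: (i, 1) stands for the
    variable x_i and (i, 0) for its negation. The term is positive only
    when every literal holds.
    """
    def h(x):
        return 1 if all(x[i] == v for i, v in literals) else 0
    return h

# Hypothetical instances of each concept class.
line = hyperplane([1.0, 1.0], 1.0)   # positive side: x_1 + x_2 >= 1
term = conjunction([(0, 1), (2, 0)]) # first variable AND NOT third variable

print(line((1, 0)), line((0, 0)))       # 1 -1
print(term((1, 0, 0)), term((1, 1, 1))) # 1 0
```

Each call to `hyperplane` or `conjunction` picks out one concept h; the concept class C is the set of all functions obtainable by varying the parameters.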
  9. 9. The Formal Learning Model Probably Approximately Correct (PAC) <ul><li>Distribution invariant: Learner does not need to estimate the joint distribution D over X x Y. Assumptions are that examples arrive i.i.d. and that D exists and is fixed. </li></ul><ul><li>The training sample complexity (size of the training set Z) depends only on the desired accuracy and confidence parameters - it does not depend on D. </li></ul><ul><li>Not all concept classes C are PAC-learnable. But some interesting classes are. </li></ul>
  10. 10. PAC Model Definitions Realizable case: when a target concept c_t is known to lie inside C. In this case, the training set is Z = {(x_i, c_t(x_i))}, i = 1, ..., m, and D is over X. Unrealizable case: when c_t is not in C, the training set is Z = {(x_i, y_i)}, i = 1, ..., m, and D is over XxY. The training set is sampled randomly and independently (i.i.d.) according to some (unknown) distribution D, i.e., S is distributed according to the product distribution D^m. Given a concept function h ∈ C, err(h) is the probability that an instance x sampled according to D will be labeled incorrectly by h(x).
  11. 11. PAC Model Definitions ε given to the learner specifies desired accuracy, i.e. err(h) ≤ opt(C) + ε, where opt(C) is the smallest error achievable by any concept in C. Note: in the realizable case opt(C) = 0 because the target concept lies inside C. δ given to the learner specifies desired confidence, i.e. the probability that err(h) ≤ opt(C) + ε is at least 1 - δ. The learner is allowed to deviate occasionally from the desired accuracy, but only rarely.
  12. 12. PAC Model Definitions We will say that an algorithm L learns C if for every ε and δ and for every D over XxY, L generates a concept function h ∈ C such that the probability that err(h) ≤ opt(C) + ε is at least 1 - δ.
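The error of a hypothesis, defined above as the probability that an instance drawn from D is mislabeled by h, can be approximated by sampling. The sketch below assumes a realizable setting with D uniform on [0,1] and two hypothetical threshold concepts. The helper name `estimate_error` and its parameters are illustrative only.

```python
import random

def estimate_error(h, target, draw, trials=100_000, seed=0):
    """Monte Carlo estimate of the probability that h disagrees with the target.

    `draw(rng)` returns one instance sampled from the distribution D.
    """
    rng = random.Random(seed)
    mistakes = 0
    for _ in range(trials):
        x = draw(rng)
        if h(x) != target(x):
            mistakes += 1
    return mistakes / trials

# Hypothetical target concept and hypothesis: thresholds on [0, 1].
target = lambda x: int(x >= 0.3)
h = lambda x: int(x >= 0.4)

# With D uniform on [0, 1] the two disagree exactly on [0.3, 0.4),
# so the true error is 0.1 and the estimate should land close to it.
est = estimate_error(h, target, lambda rng: rng.random())
print(est)
```

In practice D is unknown, so an estimate like this is computed on held-out examples rather than on fresh draws from a known distribution.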
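  13. 13. Formal Definition of PAC Learning A learning algorithm L is a function from the set of all training examples to C, L: ∪_{m≥1} (X x Y)^m → C, with the following property: given any ε, δ in (0,1) there is an integer m_0(ε,δ) such that if m ≥ m_0(ε,δ) then, for any probability distribution D on XxY, if Z is a training set of length m drawn randomly according to the product distribution D^m, then with probability of at least 1 - δ the hypothesis h = L(Z) is such that err(h) ≤ opt(C) + ε, where err(h) is the probability that h mislabels an example drawn from D and opt(C) is the smallest such error achievable in C. We say that C is learnable (or PAC-learnable) if there is a learning algorithm for C.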
  14. 14. Formal Definition of PAC Learning Notes: The required training set size m_0(ε,δ) does not depend on D, i.e., the PAC model is distribution invariant. The class C determines the sample complexity: for “simple” classes m_0(ε,δ) would be small compared to more “complex” classes.
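To make the role of the sample size concrete, here is a sketch for the class of conjunctions over n Boolean variables in the realizable case. It relies on two things that are not stated in the slides and are brought in as assumptions: the standard bound for a finite class and a learner consistent with the training set, m ≥ (ln|C| + ln(1/δ)) / ε, with |C| taken as 3^n (each variable appears plain, negated, or not at all), and the classical elimination learner for conjunctions. The target term and all parameter values are hypothetical.

```python
import itertools
import math
import random

def sample_size(n, eps, delta):
    """Finite-class bound m >= (ln|C| + ln(1/delta)) / eps with |C| = 3**n (assumed)."""
    return math.ceil((n * math.log(3) + math.log(1 / delta)) / eps)

def evaluate(literals, x):
    """A term labels x positive iff every literal (index, value) holds."""
    return int(all(x[i] == v for i, v in literals))

def learn_conjunction(Z, n):
    """Elimination learner: start from all 2n literals and drop every
    literal that some positive example violates."""
    literals = {(i, v) for i in range(n) for v in (0, 1)}
    for x, y in Z:
        if y == 1:
            literals = {(i, v) for (i, v) in literals if x[i] == v}
    return literals

n, eps, delta = 6, 0.1, 0.05
target = {(0, 1), (3, 0)}  # hypothetical target: first variable AND NOT fourth
m = sample_size(n, eps, delta)

# Draw the training set i.i.d. with D uniform over {0,1}^n (an assumption).
rng = random.Random(1)
Z = []
for _ in range(m):
    x = tuple(rng.randint(0, 1) for _ in range(n))
    Z.append((x, evaluate(target, x)))

h = learn_conjunction(Z, n)

# Exact error under the uniform D, by enumerating all 2**n instances.
cube = list(itertools.product((0, 1), repeat=n))
err = sum(evaluate(h, x) != evaluate(target, x) for x in cube) / len(cube)
print(m, err)
```

The sample size is computed from n, ε, and δ alone. Nothing about D enters the formula, which is the distribution invariance noted above, and a larger class (larger n) raises the required m.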
  15. 15. Course Syllabus <ul><li>3 x PAC </li></ul><ul><li>2 x Separating Hyperplanes: Support Vector Machine, Kernels, Linear Discriminant Analysis </li></ul><ul><li>3 x Unsupervised Learning: Dimensionality Reduction (PCA), Density Estimation, Non-parametric Clustering (spectral methods) </li></ul><ul><li>5 x Statistical Inference: Maximum Likelihood, Conditional Independence, Latent Class Models, Expectation-Maximization Algorithm, Graphical Models </li></ul>