• As we said, this is the game we are playing. In NLP it has always been clear that the raw information in a sentence is not, as is, sufficient to build a good predictor; better functions of the input were generated, and learning was done in terms of those.
• Badges game: Don’t give me the answer. Start thinking about how to write a program that will figure out whether my name has + or − next to it.
• Good treatment in Bishop, Ch. 3. Classic Wiener filtering solution; the text omits the 0.5 factor. In any case we use the gradient and η (text) or R (these notes) to modulate the step size.

1. 1. CS 446: Machine Learning Gerald DeJong [email_address] 3-0491 3320 SC Recent approval for a TA to be named later
2. 2. <ul><li>Office hours: after most classes and Thur @ 3 </li></ul><ul><li>Text: Mitchell’s Machine Learning </li></ul><ul><li>Midterm: Oct. 4 </li></ul><ul><li>Final: Dec. 12 each a third </li></ul><ul><li>Homeworks / projects </li></ul><ul><ul><li>Submit at the beginning of class </li></ul></ul><ul><ul><li>Late penalty: 20% / day up to 3 days </li></ul></ul><ul><ul><li>Programming, some in-class assignments </li></ul></ul><ul><li>Class web site soon </li></ul><ul><li>Cheating: none allowed! We adopt dept. policy </li></ul>
3. 3. Please answer these and hand in now <ul><li>Name </li></ul><ul><li>Department </li></ul><ul><li>Where (If?*) you had Intro AI course </li></ul><ul><li>Who taught it (esp. if not here) </li></ul><ul><li>1) Why interested in Machine Learning? </li></ul><ul><li>2) Any topics you would like to see covered? </li></ul><ul><li>* may require significant additional effort </li></ul>
4. 4. Approx. Course Overview / Topics <ul><li>Introduction: Basic problems and questions </li></ul><ul><li>A detailed examples: Linear threshold units </li></ul><ul><li>Basic Paradigms: </li></ul><ul><ul><li>PAC (Risk Minimization); Bayesian Theory; SRM (Structural Risk Minimization); Compression; Maximum Entropy;… </li></ul></ul><ul><ul><li>Generative/Discriminative; Classification/Skill;… </li></ul></ul><ul><li>Learning Protocols </li></ul><ul><ul><li>Online/Batch; Supervised/Unsupervised/Semi-supervised; Delayed supervision </li></ul></ul><ul><li>Algorithms: </li></ul><ul><ul><li>Decision Trees (C4.5) </li></ul></ul><ul><ul><li>[Rules and ILP (Ripper, Foil)] </li></ul></ul><ul><ul><li>Linear Threshold Units (Winnow, Perceptron; Boosting; SVMs; Kernels) </li></ul></ul><ul><ul><li>Probabilistic Representations (naïve Bayes, Bayesian trees; density estimation) </li></ul></ul><ul><ul><li>Delayed supervision: RL </li></ul></ul><ul><ul><li>Unsupervised/Semi-supervised: EM </li></ul></ul><ul><li>Clustering, Dimensionality Reduction, or others of student interest </li></ul>
5. 5. What to Learn <ul><li>Classifiers: Learn a hidden function </li></ul><ul><ul><li>Concept Learning: chair ? face ? game ? </li></ul></ul><ul><ul><li>Diagnosis: medical; risk assessment </li></ul></ul><ul><li>Models: Learn a map (and use it to navigate) </li></ul><ul><ul><li>Learn a distribution (and use it to answer queries) </li></ul></ul><ul><ul><li>Learn a language model; Learn an Automaton </li></ul></ul><ul><li>Skills: </li></ul><ul><ul><li>Learn to play games; Learn a Plan / Policy </li></ul></ul><ul><ul><li>Learn to Reason; Learn to Plan </li></ul></ul><ul><li>Clusterings: </li></ul><ul><ul><li>Shapes of objects; Functionality; Segmentation </li></ul></ul><ul><ul><li>Abstraction </li></ul></ul><ul><li>Focus on classification (importance, theoretical richness, generality,…) </li></ul>
6. 6. What to Learn? <ul><li>Direct Learning: (discriminative, model-free[bad name]) </li></ul><ul><ul><li>Learn a function that maps an input instance to the sought after property. </li></ul></ul><ul><li>Model Learning: (indirect, generative) </li></ul><ul><ul><li>Learning a model of the domain; then use it to answer various questions about the domain </li></ul></ul><ul><li>In both cases, several protocols can be used – </li></ul><ul><ul><li>Supervised – learner is given examples and answers </li></ul></ul><ul><ul><li>Unsupervised – examples, but no answers </li></ul></ul><ul><ul><li>Semi-supervised – some examples w/answers, others w/o </li></ul></ul><ul><ul><li>Delayed supervision </li></ul></ul>
7. 7. Supervised Learning <ul><li>Given: Examples (x,f ( x)) of some unknown function f </li></ul><ul><li>Find: A good approximation to f </li></ul><ul><li>x provides some representation of the input </li></ul><ul><ul><li>The process of mapping a domain element into a representation is called Feature Extraction. (Hard; ill-understood; important) </li></ul></ul><ul><ul><li>x 2 {0,1} n or x 2 < n </li></ul></ul><ul><li>The target function (label) </li></ul><ul><ul><li>f(x) 2 {-1,+1} Binary Classification </li></ul></ul><ul><ul><li>f(x) 2 {1,2,3,.,k-1} Multi-class classification </li></ul></ul><ul><ul><li>f(x) 2 < Regression </li></ul></ul>
8. 8. Example and Hypothesis Spaces X: Example Space – set of all well-formed inputs [w/a distribution] H: Hypothesis Space – set of all well-formed outputs
9. 9. Supervised Learning: Examples <ul><li>Disease diagnosis </li></ul><ul><ul><li>x: Properties of patient (symptoms, lab tests) </li></ul></ul><ul><ul><li>f : Disease (or maybe: recommended therapy) </li></ul></ul><ul><li>Part-of-Speech tagging </li></ul><ul><ul><li>x: An English sentence (e.g., The can will rust) </li></ul></ul><ul><ul><li>f : The part of speech of a word in the sentence </li></ul></ul><ul><li>Face recognition </li></ul><ul><ul><li>x: Bitmap picture of person’s face </li></ul></ul><ul><ul><li>f : Name the person (or maybe: a property of) </li></ul></ul><ul><li>Automatic Steering </li></ul><ul><ul><li>x: Bitmap picture of road surface in front of car </li></ul></ul><ul><ul><li>f : Degrees to turn the steering wheel </li></ul></ul>
10. 10. A Learning Problem: y = f (x 1 , x 2 , x 3 , x 4 ), an unknown function of the inputs x 1 , x 2 , x 3 , x 4 (Boolean: x 1 , x 2 , x 3 , x 4 , f)
11. 11. y = f (x 1 , x 2 , x 3 , x 4 ) Unknown function x 1 x 2 x 3 x 4 Training Set:
Example | x 1 x 2 x 3 x 4 | y
1 | 0 0 1 0 | 0
2 | 0 1 0 0 | 0
3 | 0 0 1 1 | 1
4 | 1 0 0 1 | 1
5 | 0 1 1 0 | 0
6 | 1 1 0 0 | 0
7 | 0 1 0 1 | 0
12. 12. Hypothesis Space <ul><li>Complete Ignorance: </li><li>How many possible functions? </li><li>2^16 = 65536 over four input features. </li><li>After seven examples how many possibilities for f? </li><li>2^9 = 512 possibilities remain for f </li><li>How many examples until we figure out which is correct? </li><li>We need to see labels for all 16 examples! </li><li>Is Learning Possible? </li></ul>(In the full 16-row truth table, the seven training examples fix seven output bits; the other nine entries remain ‘?’.)
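The counting on this slide can be verified with a short brute-force script (a sketch, not part of the original deck; the seven examples are the training set from slide 11):

```python
from itertools import product

# The seven training examples from slide 11: (x1, x2, x3, x4) -> y
examples = {
    (0, 0, 1, 0): 0,
    (0, 1, 0, 0): 0,
    (0, 0, 1, 1): 1,
    (1, 0, 0, 1): 1,
    (0, 1, 1, 0): 0,
    (1, 1, 0, 0): 0,
    (0, 1, 0, 1): 0,
}

inputs = list(product([0, 1], repeat=4))   # all 16 possible inputs
total = 2 ** len(inputs)                   # 2^16 = 65536 Boolean functions

# Encode each function as a 16-bit truth table; keep those matching all labels.
consistent = 0
for table in range(total):
    values = {x: (table >> i) & 1 for i, x in enumerate(inputs)}
    if all(values[x] == y for x, y in examples.items()):
        consistent += 1

print(total, consistent)  # 65536 512
```

Seven of the sixteen truth-table bits are pinned down by the data, which is exactly why 2^9 functions remain.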
13. 13. Another Hypothesis Space <ul><li>Simple Rules: There are only 16 simple conjunctive rules of the form y = x i ∧ x j ∧ x k ... </li><li>No simple rule explains the data. The same is true for simple clauses. </li></ul>(Each of the 16 conjunctions, from y = c and the single variables up to x 1 ∧ x 2 ∧ x 3 ∧ x 4 , has a counterexample in the training data.)
14. 14. Third Hypothesis Space <ul><li>m-of-n rules: There are 29 possible rules of the form “y = 1 if and only if at least m of the following n variables are 1”. </li><li>Found a consistent hypothesis: “at least 2 of {x 1 , x 3 , x 4 }” agrees with all seven training examples. </li></ul>
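A brute-force search over the m-of-n rules (a sketch, using the training set from slide 11; tuple indices 0–3 stand for x1–x4) confirms that a consistent hypothesis exists:

```python
from itertools import combinations

# Training examples from slide 11: ((x1, x2, x3, x4), y)
examples = [
    ((0, 0, 1, 0), 0), ((0, 1, 0, 0), 0), ((0, 0, 1, 1), 1),
    ((1, 0, 0, 1), 1), ((0, 1, 1, 0), 0), ((1, 1, 0, 0), 0),
    ((0, 1, 0, 1), 0),
]

def m_of_n(m, vars_, x):
    """Predict 1 iff at least m of the chosen variables are 1."""
    return int(sum(x[i] for i in vars_) >= m)

consistent = []
for n in range(1, 5):
    for vars_ in combinations(range(4), n):       # which variables
        for m in range(1, n + 1):                 # how many must be on
            if all(m_of_n(m, vars_, x) == y for x, y in examples):
                consistent.append((m, vars_))

print(consistent)  # [(2, (0, 2, 3))] -> "2 of {x1, x3, x4}"
```

The search finds exactly one consistent m-of-n rule, matching the slide's claim that a consistent hypothesis was found in this space.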
15. 15. Views of Learning <ul><li>Learning is the removal of our remaining uncertainty: </li></ul><ul><ul><li>Suppose we knew that the unknown function was an m-of-n Boolean function; then we could use the training data to infer which function it is. </li></ul></ul><ul><li>Learning requires guessing a good, small hypothesis class : </li></ul><ul><ul><li>We can start with a very small class and enlarge it until it contains a hypothesis that fits the data. </li></ul></ul><ul><li>We could be wrong ! </li></ul><ul><ul><li>Our prior knowledge might be wrong: y = x 4 ∧ one-of(x 1 , x 3 ) is also consistent </li></ul></ul><ul><ul><li>Our guess of the hypothesis class could be wrong </li></ul></ul><ul><li>If this is the unknown function, then we will make errors when we are given new examples and are asked to predict the value of the function </li></ul>
16. 16. General strategy for Machine Learning <ul><li>H should respect our prior understanding: </li></ul><ul><ul><li>Excess expressivity makes learning difficult </li></ul></ul><ul><ul><li>Expressivity of H should match our ignorance </li></ul></ul><ul><li>Understand flexibility of std. hypothesis spaces: </li></ul><ul><ul><li>Decision trees, neural networks, rule grammars, stochastic models </li></ul></ul><ul><ul><li>Hypothesis spaces of flexible size; Nested collections of hypotheses. </li></ul></ul><ul><li>ML succeeds when these interrelate </li></ul><ul><ul><li>Develop algorithms for finding a hypothesis h that fits the data </li></ul></ul><ul><ul><li>h will likely perform well when the richness of H is less than the information in the training set </li></ul></ul>
17. 17. Terminology <ul><li>Training example: A pair of the form (x, f (x)) </li></ul><ul><li>Target function (concept): The true function f (unknown) </li></ul><ul><li>Hypothesis: A proposed function h, believed to be similar to f. </li></ul><ul><li>Concept: A Boolean function. Examples for which f (x) = 1 are positive examples; those for which f (x) = 0 are negative examples (instances). (Sometimes used interchangeably with “Hypothesis”.) </li></ul><ul><li>Classifier: A discrete-valued function. The possible values of f, {1,2,…,K}, are the classes or class labels . </li></ul><ul><li>Hypothesis space: The space of all hypotheses that can, in principle, be output by the learning algorithm. </li></ul><ul><li>Version space: The space of all hypotheses in the hypothesis space that have not yet been ruled out. </li></ul>
18. 18. Key Issues in Machine Learning <ul><li>Modeling </li></ul><ul><ul><li>How to formulate application problems as machine learning problems ? </li></ul></ul><ul><ul><li>Learning Protocols (where is the data coming from, how?) </li></ul></ul><ul><li>Project examples: [complete products] </li></ul><ul><li>EMAIL </li></ul><ul><ul><li>Given a seminar announcement, place the relevant information in my outlook </li></ul></ul><ul><ul><li>Given a message, place it in the appropriate folder </li></ul></ul><ul><li>Image processing: </li></ul><ul><ul><li>Given a folder with pictures; automatically rotate all those that need it. </li></ul></ul><ul><li>My office: </li></ul><ul><ul><li>have my office greet me in the morning and unlock the door (but do it only for me!) </li></ul></ul><ul><li>Context Sensitive Spelling: Incorporate into Word </li></ul>
19. 19. Key Issues in Machine Learning <ul><li>Modeling </li></ul><ul><ul><li>How to formulate application problems as machine learning problems ? </li></ul></ul><ul><ul><li>Learning Protocols (where is the data coming from, how?) </li></ul></ul><ul><li>Representation: </li></ul><ul><ul><li>What are good hypothesis spaces ? </li></ul></ul><ul><ul><li>Any rigorous way to find these? Any general approach? </li></ul></ul><ul><li>Algorithms: </li></ul><ul><ul><li>What are good algorithms? </li></ul></ul><ul><ul><li>How do we define success? </li></ul></ul><ul><ul><li>Generalization vs. overfitting </li></ul></ul><ul><ul><li>The computational problem </li></ul></ul>
20. 20. Example: Generalization vs Overfitting <ul><li>What is a Tree ? </li></ul><ul><li>A botanist: “A tree is something with leaves.” Her brother: “A tree is a green thing I’ve seen before.” </li></ul><ul><li>Neither will generalize well </li></ul>
21. 21. Self-organize into Groups of 4 or 5 <ul><li>Assignment 1 </li></ul><ul><li>The Badges Game …… </li></ul><ul><li>Prediction or Modeling? </li></ul><ul><ul><li>Representation </li></ul></ul><ul><ul><li>Background Knowledge </li></ul></ul><ul><ul><li>When did learning take place? </li></ul></ul><ul><ul><li>Learning Protocol? </li></ul></ul><ul><ul><li>What is the problem? </li></ul></ul><ul><ul><li>Algorithms </li></ul></ul>
22. 22. Linear Discriminators <ul><li>I don’t know { whether, weather} to laugh or cry </li></ul><ul><li>How can we make this a learning problem? </li></ul><ul><li>We will look for a function </li></ul><ul><li>F: Sentences  { whether, weather} </li></ul><ul><li>We need to define the domain of this function better. </li></ul><ul><li>An option : For each word w in English define a Boolean feature x w : </li></ul><ul><li>[x w =1] iff w is in the sentence </li></ul><ul><li>This maps a sentence to a point in {0,1} 50,000 </li></ul><ul><li>In this space: some points are whether points </li></ul><ul><li>some are weather points </li></ul>Learning Protocol? Supervised? Unsupervised?
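The word-indicator encoding described on this slide can be sketched as follows; the tiny vocabulary here is a stand-in for the 50,000-word one:

```python
def sentence_to_features(sentence, vocabulary):
    """Map a sentence to a point in {0,1}^|vocabulary|: x_w = 1 iff w occurs."""
    words = set(sentence.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

# Toy vocabulary standing in for the full 50,000-word English vocabulary
vocab = ["i", "know", "laugh", "cry", "whether", "weather", "rain"]
x = sentence_to_features("I don't know whether to laugh or cry", vocab)
print(x)  # [1, 1, 1, 1, 1, 0, 0]
```

Every sentence becomes a very sparse Boolean point; the learning problem is then to separate the whether points from the weather points in that space.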
23. 23. What’s Good? <ul><li>Learning problem : </li></ul><ul><ul><li>Find a function that </li></ul></ul><ul><ul><li>best separates the data </li></ul></ul><ul><li>What function? </li></ul><ul><li>What’s best? </li></ul><ul><li>How to find it? </li></ul><ul><li>A possibility: Define the learning problem to be: </li></ul><ul><li>Find a (linear) function that best separates the data </li></ul>
24. 24. Exclusive-OR (XOR) <ul><li>(x 1 ∧ x 2 ) ∨ (¬x 1 ∧ ¬x 2 ) </li></ul><ul><li>In general: a parity function. </li></ul><ul><li>x i ∈ {0,1} </li></ul><ul><li>f(x 1 , x 2 ,…, x n ) = 1 iff Σ x i is even </li></ul><ul><li>This function is not linearly separable . </li></ul>
25. 25. Sometimes Functions Can be Made Linear <ul><li>x 1 x 2 x 4 ∨ x 2 x 4 x 5 ∨ x 1 x 3 x 7 </li></ul><ul><li>Space: X = x 1 , x 2 ,…, x n </li></ul><ul><li>Input Transformation </li></ul><ul><li>New Space: Y = {y 1 , y 2 ,…} = {x i , x i x j , x i x j x k } </li></ul>y 3 ∨ y 4 ∨ y 7 The new discriminator is functionally simpler Weather Whether
26. 26. <ul><li>Data are not separable in one dimension </li></ul><ul><li>Not separable if you insist on using a specific class of functions </li></ul>Feature Space x
27. 27. Blown Up Feature Space <ul><li>Data are separable in <x, x 2 > space </li></ul>x x 2 Key issue: what features to use. Computationally, can be done implicitly (kernels)
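A minimal illustration of the <x, x²> trick, with hypothetical 1-D data labeled positive inside [-1, 1]:

```python
# Hypothetical 1-D points labeled + iff they lie in [-1, 1]; no single threshold
# on x separates them, but in the blown-up space y = (x, x^2) the linear rule
# sgn{0*x - 1*x^2 + 1} does.
data = [(-2.0, 0), (-1.5, 0), (-0.5, 1), (0.0, 1), (0.5, 1), (1.5, 0), (2.0, 0)]

def predict(x):
    # Linear in the new coordinates (x, x^2): weight 0 on x, -1 on x^2, bias +1
    return int(-x * x + 1 >= 0)

print(all(predict(x) == y for x, y in data))  # True
```

The separator is still linear, just in the new coordinates; as the slide notes, kernels let this be done implicitly.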
28. 28. A General Framework for Learning <ul><li>Goal: predict an unobserved output value y ∈ Y </li></ul><ul><li>based on an observed input vector x ∈ X </li></ul><ul><li>Estimate a functional relationship y ~ f(x) </li></ul><ul><li>from a set {(x,y) i } i=1,n </li></ul><ul><li>Most relevant - Classification : y ∈ {0,1} (or y ∈ {1,2,…,k} ) </li></ul><ul><ul><li>(But, within the same framework we can also talk about Regression, y ∈ ℝ) </li></ul></ul><ul><li>What do we want f(x) to satisfy? </li></ul><ul><ul><li>We want to minimize the Loss (Risk): L(f()) = E X,Y ( [f(x) ≠ y] ) </li></ul></ul><ul><ul><li>Where: E X,Y denotes the expectation with respect to the true distribution . </li></ul></ul>Simply: # of mistakes; […] is an indicator function
29. 29. A General Framework for Learning (II) <ul><li>We want to minimize the Loss: L(f()) = E X,Y ( [f(X) ≠ Y] ) </li></ul><ul><li>Where: E X,Y denotes the expectation with respect to the true distribution . </li></ul><ul><li>We cannot do that. Why not? </li></ul><ul><li>Instead, we try to minimize the empirical classification error. </li></ul><ul><li>For a set of training examples {(X i ,Y i )} i=1,n </li></ul><ul><li>Try to minimize the observed loss </li></ul><ul><ul><li>(Issue I: when is this good enough? Not now) </li></ul></ul><ul><li>This minimization problem is typically NP hard. </li></ul><ul><li>To alleviate this computational problem, minimize a new function – a convex upper bound of the classification error function </li></ul>I(f(x),y) = [f(x) ≠ y] = {1 when f(x) ≠ y; 0 otherwise}
30. 30. Learning as an Optimization Problem <ul><li>A Loss Function L(f(x),y) measures the penalty incurred by a classifier f on example (x,y). </li></ul><ul><li>There are many different loss functions one could define: </li></ul><ul><ul><li>Misclassification Error: </li></ul></ul><ul><li>L(f(x),y) = 0 if f(x) = y; 1 otherwise </li></ul><ul><ul><li>Squared Loss: </li></ul></ul><ul><li>L(f(x),y) = (f(x) − y) 2 </li></ul><ul><ul><li>Input dependent loss: </li></ul></ul><ul><li>L(f(x),y) = 0 if f(x) = y; c(x) otherwise. </li></ul>A continuous convex loss function also allows a conceptually simple optimization algorithm. (Plotted as a function of f(x) − y)
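The three losses above, written out directly (a sketch; the cost function c(x) is whatever the application supplies):

```python
def misclassification_loss(fx, y):
    """0-1 loss: 0 if f(x) = y, 1 otherwise."""
    return 0 if fx == y else 1

def squared_loss(fx, y):
    """(f(x) - y)^2: continuous and convex in f(x)."""
    return (fx - y) ** 2

def input_dependent_loss(fx, y, x, cost):
    """0 if f(x) = y, c(x) otherwise; cost is the application-supplied c."""
    return 0 if fx == y else cost(x)

print(misclassification_loss(1, 0))                    # 1
print(input_dependent_loss(0, 1, 5, lambda x: 2 * x))  # 10
```

Only the squared loss is continuous and convex in f(x), which is why it supports the simple gradient-based optimization developed on the following slides.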
31. 31. How to Learn? <ul><li>Local search: </li></ul><ul><ul><li>Start with a linear threshold function. </li></ul></ul><ul><ul><li>See how well you are doing. </li></ul></ul><ul><ul><li>Correct </li></ul></ul><ul><ul><li>Repeat until you converge. </li></ul></ul><ul><li>There are other ways that do not </li></ul><ul><li>search directly in the </li></ul><ul><li>hypotheses space </li></ul><ul><ul><li>Directly compute the hypothesis? </li></ul></ul>
32. 32. Learning Linear Separators (LTU) <ul><li>f(x) = sgn{x · w − θ} = sgn{Σ i=1..n w i x i − θ} </li></ul><ul><li>x = ( x 1 , x 2 ,…, x n ) ∈ {0,1} n </li></ul><ul><ul><li>is the feature-based encoding of the data point </li></ul></ul><ul><li>w = ( w 1 , w 2 ,…, w n ) ∈ ℝ n </li></ul><ul><ul><li>is the target function. </li></ul></ul><ul><li>θ determines the shift with respect to the origin </li></ul>
33. 33. Expressivity <ul><li>f(x) = sgn{x · w − θ} = sgn{Σ i=1..n w i x i − θ} </li></ul><ul><li>Many functions are Linear </li></ul><ul><ul><li>Conjunctions: </li></ul></ul><ul><ul><ul><li>y = x 1 ∧ x 3 ∧ x 5 </li></ul></ul></ul><ul><ul><ul><li>y = sgn{1 · x 1 + 1 · x 3 + 1 · x 5 − 3} </li></ul></ul></ul><ul><ul><li>At least m of n: </li></ul></ul><ul><ul><ul><li>y = at least 2 of { x 1 , x 3 , x 5 } </li></ul></ul></ul><ul><ul><ul><li>y = sgn{1 · x 1 + 1 · x 3 + 1 · x 5 − 2} </li></ul></ul></ul><ul><li>Many functions are not </li></ul><ul><ul><li>Xor: y = (x 1 ∧ ¬x 2 ) ∨ (¬x 1 ∧ x 2 ) </li></ul></ul><ul><ul><li>Non-trivial DNF: y = (x 1 ∧ x 2 ) ∨ (x 3 ∧ x 4 ) </li></ul></ul><ul><li>But some can be made linear </li></ul>Probabilistic Classifiers as well
34. 34. Canonical Representation <ul><li>f(x) = sgn{x · w − θ} = sgn{Σ i=1..n w i x i − θ} </li></ul><ul><li>sgn{x · w − θ} ≡ sgn{x’ · w’} </li></ul><ul><li>Where: </li></ul><ul><ul><li>x’ = (x, −θ) and w’ = (w, 1) </li></ul></ul><ul><li>Moved from an n-dimensional representation to an (n+1)-dimensional representation, but now can look for hyperplanes that go through the origin. </li></ul>
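The augmentation is one line of code (a sketch with made-up numbers for x, w, and θ):

```python
def canonical(x, w, theta):
    """x' = (x, -theta), w' = (w, 1), so that x.w - theta == x'.w'."""
    return list(x) + [-theta], list(w) + [1.0]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# Made-up numbers for illustration
x, w, theta = [1.0, 0.0, 1.0], [0.5, -0.2, 0.8], 1.0
x_aug, w_aug = canonical(x, w, theta)

print(dot(x, w) - theta, dot(x_aug, w_aug))  # the two values agree
```

The extra coordinate absorbs the threshold, so every separating hyperplane in the new space passes through the origin.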
35. 35. LMS: An online, local search algorithm <ul><li>A local search learning algorithm requires: </li></ul><ul><li>Hypothesis Space: </li></ul><ul><ul><li>Linear Threshold Units </li></ul></ul><ul><li>Loss function: </li></ul><ul><ul><li>Squared loss </li></ul></ul><ul><ul><li>LMS (Least Mean Square, L 2 ) </li></ul></ul><ul><li>Search procedure: </li></ul><ul><ul><li>Gradient Descent </li></ul></ul>
36. 36. LMS: An online, local search algorithm <ul><li>Let w (j) be our current weight vector </li></ul><ul><li>Our prediction on the d-th example x is therefore: o d = w (j) · x d </li></ul><ul><li>Let t d be the target value for this example ( real value; represents u · x ) </li></ul><ul><li>A convenient error function of the data set is: Err(w) = ½ Σ d∈D (t d − o d ) 2 </li></ul>(i (subscript) – vector component; j (superscript) – time; d – example #) Assumption: x ∈ ℝ n ; u ∈ ℝ n is the target weight vector; the target (label) is t d = u · x. Noise has been added, so, possibly, no weight vector is consistent with the data.
37. 37. Gradient Descent <ul><li>We use gradient descent to determine the weight vector that minimizes Err(w) ; </li></ul><ul><li>Fixing the set D of examples, E is a function of w j </li></ul><ul><li>At each step, the weight vector is modified in the direction that produces the steepest descent along the error surface . </li></ul>(Figure: the error surface E(w), with successive weight vectors w 1 , w 2 , w 3 , w 4 descending toward the minimum.)
38. 38. <ul><li>To find the best direction in the weight space we compute the gradient of E with respect to each of the components of w: ∇E(w) = [∂E/∂w 1 , ∂E/∂w 2 , …, ∂E/∂w n ] </li></ul><ul><li>This vector specifies the direction that produces the steepest increase in E; </li></ul><ul><li>We want to modify w in the direction of −∇E(w) </li></ul><ul><li>Where: w ← w + Δw, with Δw = −R ∇E(w) </li></ul>Gradient Descent
39. 39. <ul><li>We have: Err(w) = ½ Σ d (t d − o d ) 2 </li></ul><ul><li>Therefore: ∂E/∂w i = Σ d (t d − o d )(−x id ) = −Σ d (t d − o d ) x id </li></ul>Gradient Descent: LMS
40. 40. Gradient Descent: LMS <ul><li>Weight update rule: Δw i = R Σ d (t d − o d ) x id </li></ul>
41. 41. Gradient Descent: LMS <ul><li>Weight update rule: Δw i = R Σ d (t d − o d ) x id </li></ul><ul><li>Gradient descent algorithm for training linear units: </li></ul><ul><li>- Start with an initial random weight vector </li></ul><ul><li>- For every example d with target value t d : </li></ul><ul><li>- Evaluate the linear unit o d = w · x d </li></ul><ul><li>- Update w by adding Δw i to each component </li></ul><ul><li>- Continue until E below some threshold </li></ul>
42. 42. <ul><li>Weight update rule: Δw i = R Σ d (t d − o d ) x id </li></ul><ul><li>Gradient descent algorithm for training linear units: </li></ul><ul><li>- Start with an initial random weight vector </li></ul><ul><li>- For every example d with target value t d : </li></ul><ul><li>- Evaluate the linear unit o d = w · x d </li></ul><ul><li>- Update w by adding Δw i to each component </li></ul><ul><li>- Continue until E below some threshold </li></ul><ul><li>Because the surface contains only a single global minimum, the algorithm will converge to a weight vector with minimum error, regardless of whether the examples are linearly separable </li></ul>Gradient Descent: LMS
43. 43. <ul><li>Weight update rule: Δw i = R (t d − o d ) x id (applied after each example d) </li></ul>Incremental Gradient Descent: LMS
44. 44. Incremental Gradient Descent - LMS <ul><li>Weight update rule: Δw i = R (t d − o d ) x id </li></ul><ul><li>Gradient descent algorithm for training linear units: </li></ul><ul><li>- Start with an initial random weight vector </li></ul><ul><li>- For every example d with target value t d : </li></ul><ul><li>- Evaluate the linear unit o d = w · x d </li></ul><ul><li>- Update w by incrementally adding R (t d − o d ) x id to each component </li></ul><ul><li>- Continue until E below some threshold </li></ul><ul><li>In general - does not converge to the global minimum </li></ul><ul><li>Decreasing R with time guarantees convergence </li></ul><ul><li>Incremental algorithms are sometimes advantageous … </li></ul>
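Since the update-rule formulas were images on the original slides, here is a sketch of both variants, assuming squared error E(w) = ½ Σ_d (t_d − o_d)² and learning rate R: the batch rule adds R Σ_d (t_d − o_d) x_d once per pass, while the incremental (Widrow-Hoff) rule adds R (t_d − o_d) x_d after each example. The toy data below is made up for illustration.

```python
def batch_lms(data, R=0.05, epochs=500):
    """Batch gradient descent for a linear unit o = w . x on squared error."""
    n = len(data[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        # Delta w_i = R * sum_d (t_d - o_d) * x_id   (one update per pass)
        grad = [0.0] * n
        for x, t in data:
            o = sum(wi * xi for wi, xi in zip(w, x))
            for i in range(n):
                grad[i] += (t - o) * x[i]
        for i in range(n):
            w[i] += R * grad[i]
    return w

def incremental_lms(data, R=0.05, epochs=500):
    """Widrow-Hoff / incremental LMS: w_i += R * (t_d - o_d) * x_id per example."""
    n = len(data[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        for x, t in data:
            o = sum(wi * xi for wi, xi in zip(w, x))
            for i in range(n):
                w[i] += R * (t - o) * x[i]
    return w

# Toy noise-free data generated by a hypothetical target u = (2, -1): t = u . x
data = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)]
print([round(wi, 3) for wi in batch_lms(data)])        # converges near [2.0, -1.0]
print([round(wi, 3) for wi in incremental_lms(data)])  # converges near [2.0, -1.0]
```

On this noise-free data both variants recover the target; with noise, the incremental rule needs a decreasing R to converge, as the slide notes.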
45. 45. Learning Rates and Convergence <ul><li>In the general (non-separable) case the learning rate R must decrease to zero to guarantee convergence. It cannot decrease too quickly nor too slowly. </li></ul><ul><li>The learning rate is called the step size. There are more sophisticated algorithms (Conjugate Gradient) that choose the step size automatically and converge faster. </li></ul><ul><li>There is only one “basin” for linear threshold units, so a local minimum is the global minimum. However, choosing a starting point can make the algorithm converge much faster. </li></ul>
46. 46. Computational Issues Assume the data is linearly separable. Sample complexity: Suppose we want to ensure that our LTU has an error rate (on new examples) of less than ε with high probability (at least 1 − δ). How large must m (the number of examples) be in order to achieve this? It can be shown that for n-dimensional problems m = O(1/ε [ln(1/δ) + (n+1) ln(1/ε)]). Computational complexity: What can be said? It can be shown that there exists a polynomial time algorithm for finding a consistent LTU (by reduction to linear programming). (On-line algorithms have inverse quadratic dependence on the margin)
47. 47. Other methods for LTUs <ul><li>Direct Computation: </li></ul><ul><li>Set ∇J(w) = 0 and solve for w. Can be accomplished using SVD methods. </li></ul><ul><li>Fisher Linear Discriminant: </li></ul><ul><li>A direct computation method. </li></ul><ul><li>Probabilistic methods (naive Bayes): </li></ul><ul><li>Produces a stochastic classifier that can be viewed as a linear threshold unit. </li></ul><ul><li>Winnow: </li></ul><ul><li>A multiplicative update algorithm with the property that it can handle large numbers of irrelevant attributes. </li></ul>
48. 48. Summary of LMS algorithms for LTUs Local search: Begins with an initial weight vector. Modifies it iteratively to minimize an error function. The error function is loosely related to the goal of minimizing the number of classification errors. Memory: The classifier is constructed from the training examples. The examples can then be discarded. Online or Batch: Both online and batch variants of the algorithm can be used.
49. 49. Fisher Linear Discriminant <ul><li>This is a classical method for discriminant analysis. </li></ul><ul><li>It is based on dimensionality reduction – finding a better representation for the data. </li></ul><ul><li>Notice that just finding good representations for the data may not always be good for discrimination . [E.g., O, Q] </li></ul><ul><li>Intuition: </li></ul><ul><ul><li>Consider projecting data from d dimensions to the line. </li></ul></ul><ul><ul><li>Likely results in a mixed set of points and poor separation. </li></ul></ul><ul><ul><li>However, by moving the line around we might be able to find an orientation for which the projected samples are well separated. </li></ul></ul>
50. 50. Fisher Linear Discriminant <ul><li>Sample S = {x 1 , x 2 , … x n } ⊆ ℝ d </li></ul><ul><li>P, N are the positive, negative examples, resp. </li></ul><ul><li>Let w ∈ ℝ d , and assume ||w|| = 1. Then: </li></ul><ul><li>The projection of a vector x on a line in the direction w is w t · x . </li></ul><ul><li>If the data is linearly separable, there exists a good direction w . </li></ul>(all vectors are column vectors)
51. 51. Finding a Good Direction <ul><li>Sample mean (positive, P; negative, N): </li></ul><ul><li>M P = 1/|P| Σ P x i </li></ul><ul><li>The mean of the projected (positive, negative) points </li></ul><ul><li>m P = 1/|P| Σ P w t · x i = 1/|P| Σ P y i = w t · M P </li></ul><ul><li>is simply the projection of the sample mean. </li></ul><ul><li>Therefore, the distance between the projected means is: </li></ul><ul><li>|m P − m N | = |w t · (M P − M N )| </li></ul>Want large difference
52. Finding a Good Direction (2) <ul><li>Scaling w isn’t the solution. We want the difference to be large relative to some measure of the standard deviation of each class. </li></ul><ul><li>s 2 P = Σ P (y - m P ) 2 s 2 N = Σ N (y - m N ) 2 </li></ul><ul><li>(s 2 P + s 2 N ), the within-class scatter , estimates the variance of the sample. </li></ul><ul><li>The Fisher linear discriminant employs the linear function w t · x for which </li></ul><ul><li>J(w) = | m P – m N | 2 / (s 2 P + s 2 N ) </li></ul><ul><li>is maximized. </li></ul><ul><li>How to make this a classifier? </li></ul><ul><li>How to find the optimal w? </li></ul><ul><ul><li>Some algebra </li></ul></ul>
53. J as an explicit function of w (1) <ul><li>Compute the scatter matrices: </li></ul><ul><li>S P = Σ P (x - M P )(x - M P ) t S N = Σ N (x - M N )(x - M N ) t </li></ul><ul><li>and </li></ul><ul><li>S W = S P + S N </li></ul><ul><li>We can write: </li></ul><ul><li>s 2 P = Σ P (y - m P ) 2 = Σ P (w t x - w t M P ) 2 = </li></ul><ul><li>= Σ P w t (x - M P )(x - M P ) t w = w t S P w </li></ul><ul><li>Therefore: </li></ul><ul><li>s 2 P + s 2 N = w t S W w </li></ul><ul><li>S W is the within-class scatter matrix. It is proportional to the sample covariance matrix for the d-dimensional sample. </li></ul>
54. J as an explicit function of w (2) <ul><li>We can do a similar computation for the means: </li></ul><ul><li>S B = (M P - M N )(M P - M N ) t </li></ul><ul><li>and we can write: </li></ul><ul><li>(m P - m N ) 2 = (w t M P - w t M N ) 2 = </li></ul><ul><li>= w t (M P - M N )(M P - M N ) t w = w t S B w </li></ul><ul><li>S B is the between-class scatter matrix . It is the outer product of two vectors, and therefore its rank is at most 1. </li></ul><ul><li>S B w is always in the direction of (M P - M N ). </li></ul>
55. J as an explicit function of w (3) <ul><li>Now we can compute J explicitly: </li></ul><ul><li>J(w) = | m P – m N | 2 / (s 2 P + s 2 N ) = w t S B w / w t S W w </li></ul><ul><li>We are looking for the value of w that maximizes this expression. </li></ul><ul><li>This is a generalized eigenvalue problem; when S W is nonsingular, it is just an eigenvalue problem. The solution can be written without solving the problem, as: </li></ul><ul><li>w = S W -1 (M P - M N ) </li></ul><ul><li>This is the Fisher Linear Discriminant . </li></ul><ul><li>1: We converted a d-dimensional problem to a 1-dimensional problem and suggested a solution that makes some sense. </li></ul><ul><li>2: We have a solution that makes sense; how do we make it a classifier? And how good is it? </li></ul>
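The closed form w = S W -1 (M P - M N ) can be computed directly from a sample. Below is a sketch in pure Python for the 2-dimensional case (so the matrix inverse can be written by hand); the two small point sets are made up for illustration.

```python
# Sketch of the Fisher direction w = S_W^{-1} (M_P - M_N) for 2-D data.
# Pure Python; the 2x2 inverse is written out explicitly.

def mean(points):
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]

def scatter(points, m):
    # Class scatter matrix: sum of (x - m)(x - m)^t over the sample.
    s = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in points:
        dx, dy = x - m[0], y - m[1]
        s[0][0] += dx * dx; s[0][1] += dx * dy
        s[1][0] += dy * dx; s[1][1] += dy * dy
    return s

def fisher_direction(P, N):
    Mp, Mn = mean(P), mean(N)
    Sp, Sn = scatter(P, Mp), scatter(N, Mn)
    Sw = [[Sp[i][j] + Sn[i][j] for j in range(2)] for i in range(2)]
    det = Sw[0][0] * Sw[1][1] - Sw[0][1] * Sw[1][0]  # assumes S_W nonsingular
    inv = [[ Sw[1][1] / det, -Sw[0][1] / det],
           [-Sw[1][0] / det,  Sw[0][0] / det]]
    d = [Mp[0] - Mn[0], Mp[1] - Mn[1]]
    return [inv[0][0] * d[0] + inv[0][1] * d[1],
            inv[1][0] * d[0] + inv[1][1] * d[1]]

# Hypothetical toy sample: positive and negative points.
P = [(2.0, 2.0), (3.0, 3.0), (3.0, 2.0)]
N = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
w = fisher_direction(P, N)
```

In higher dimensions one would solve the linear system S W w = (M P - M N ) rather than inverting by hand, but the structure of the computation is the same.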
56. Fisher Linear Discriminant - Summary <ul><li>It turns out that both problems can be solved if we make assumptions. E.g., if the data consists of two classes of points, each generated according to a normal distribution with the same covariance, then: </li></ul><ul><li>The solution is optimal. </li></ul><ul><li>Classification can be done by choosing a threshold, which can be computed. </li></ul><ul><li>Is this satisfactory? </li></ul>
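One common way to choose the threshold (a simple illustrative choice, not the only one the slides' assumptions support) is the midpoint of the two projected means: classify x as positive when w · x falls on the positive side of that midpoint. The direction and means below are hypothetical values for the sake of the example.

```python
# Turn a Fisher direction into a classifier by thresholding the
# projection w.x at the midpoint of the two projected class means.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def make_classifier(w, Mp, Mn):
    c = (dot(w, Mp) + dot(w, Mn)) / 2.0  # midpoint threshold
    return lambda x: +1 if dot(w, x) > c else -1

w = [1.75, 1.5]                    # hypothetical Fisher direction
Mp, Mn = [8/3, 7/3], [1/3, 1/3]    # hypothetical class means
classify = make_classifier(w, Mp, Mn)
```

Under the equal-covariance Gaussian assumption the optimal threshold can be computed from the class densities and priors; the midpoint above corresponds to the symmetric case.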
57. Introduction - Summary <ul><li>We introduced the technical part of the class by giving two examples of (very different) approaches to linear discrimination. </li></ul><ul><li>There are many other solutions. </li></ul><ul><li>Question 1 : But this assumes that we are linear. Can we learn a function that is more flexible in terms of what it does with the feature space? </li></ul><ul><li>Question 2 : Can we say something about the quality of what we learn (sample complexity, time complexity, quality)? </li></ul>