ML


  1. 1. Machine Learning in Practice, Lecture 6. Carolyn Penstein Rosé, Language Technologies Institute / Human-Computer Interaction Institute
  2. 2. Plan for the Day
     Announcements:
       - Answer keys posted
       - Assignment 3 assigned
       - Project proposal: due next Thursday!
     Finish Naïve Bayes
     Maybe start Linear Models
  3. 4. Project Proposal
     - A few sentences about what problem you are working on
     - A description of what your data is
       - How many instances
       - What features you have
     - Preferably some baseline performance using any ML technique, plus some error analysis
     - Tell me something about your ideas for approaching this problem
  4. 5. Assignment 3
     - Compares statistical models and weight-based models
     - Who remembers what we discussed last time about the contrasts between these two types of models?
  5. 7. Weka Helpful Hints
  6. 8. Use the Visualize tab to view 3-way interactions
  7. 9. Naïve Bayes
  8. 10. Bayes Theorem
  9. 11. Bayes Theorem
     - How would you compute the likelihood that a person was a bagpipe major, given that they had red hair?
  10. 12. Bayes Theorem
     - How would you compute the likelihood that a person was a bagpipe major, given that they had red hair?
     - Could you compute the likelihood that a person has red hair, given that they were a bagpipe major?
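(A minimal sketch of the Bayes' theorem computation behind the bagpipe example; every probability below is invented for illustration.)

     # Bayes' theorem: P(bagpipe | red) = P(red | bagpipe) * P(bagpipe) / P(red)
     p_red_given_bagpipe = 0.40    # hypothetical: red hair is common among bagpipe majors
     p_bagpipe = 0.0001            # hypothetical prior: bagpipe majors are rare
     p_red = 0.02                  # hypothetical base rate of red hair

     p_bagpipe_given_red = p_red_given_bagpipe * p_bagpipe / p_red
     print(p_bagpipe_given_red)    # 0.002 -- still small, because the prior is tiny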
  12. 14. How do we train a model?
     - We need to compute what evidence each value of every feature gives for each possible prediction (i.e., how typical that value is for instances of that class).
     - What is P(Outlook = rainy | Class = yes)?
     - Store counts on (class value, feature value) pairs: how many times is Outlook = rainy when class = yes?
     - Likelihood that play = yes given Outlook = rainy = [Count(yes & rainy) / Count(yes)] * [Count(yes) / Count(yes or no)]
  13. 15. How do we train a model? Now try to compute the likelihood that play = yes for Outlook = sunny, Temperature = cool, Humidity = high, Windy = TRUE
  14. 16. Scaling
     - Likelihood that play = yes when Outlook = sunny, Temperature = cool, Humidity = high, Windy = true:
       2/9 * 3/9 * 3/9 * 3/9 * 9/14 = 0.0053
     - Likelihood that play = no when Outlook = sunny, Temperature = cool, Humidity = high, Windy = true:
       3/5 * 1/5 * 4/5 * 3/5 * 5/14 = 0.0206
     - Normalizing: P(yes | sunny, cool, high, true) = 0.0053 / 0.0259 = 20.5%; P(no | sunny, cool, high, true) = 0.0206 / 0.0259 = 79.5%
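(A short Python sketch of the computation above, using only the fractions from the weather data on the slide.)

     # Evidence: Outlook = sunny, Temperature = cool, Humidity = high, Windy = true
     like_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # ~0.0053
     like_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # ~0.0206

     total = like_yes + like_no
     print(round(like_yes / total, 3))   # 0.205 -> P(yes | evidence) ~ 20.5%
     print(round(like_no  / total, 3))   # 0.795 -> P(no  | evidence) ~ 79.5%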
  15. 17. Unknown Values
     - Not a problem for Naïve Bayes
     - Probabilities are computed using only the specified values
     - Likelihood that play = yes when Outlook = sunny, Temperature = cool, Humidity = high, Windy = true:
       2/9 * 3/9 * 3/9 * 3/9 * 9/14
     - If Outlook is unknown: 3/9 * 3/9 * 3/9 * 9/14
     - Likelihoods will be higher when there are unknown values, but this is factored out during normalization
     - Note that unknown values are different from unobserved combinations!
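(Continuing the sketch above: when Outlook is unknown, its factor is simply dropped from every class, and normalization absorbs the difference.)

     like_yes = (3/9) * (3/9) * (3/9) * (9/14)   # Temperature, Humidity, Windy only
     like_no  = (1/5) * (4/5) * (3/5) * (5/14)

     total = like_yes + like_no
     # The raw likelihoods are larger than with all four attributes,
     # but the normalized probabilities are still valid.
     print(like_yes / total, like_no / total)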
  16. 18. How do we train a model? What is the conditional probability P(Humidity = Low | Play = yes)?
  17. 19. Another Example Model
     - Compute conditional probabilities for each attribute value / class pair: P(B|A) = Count(B & A) / Count(A)
     - P(coffee ice-cream | yum) = .25
     - P(vanilla ice-cream | yum) = 0
     Dataset:
       @relation is-yummy
       @attribute ice-cream {chocolate, vanilla, coffee, rocky-road, strawberry}
       @attribute cake {chocolate, vanilla}
       @attribute yummy {yum, good, ok}
       @data
       chocolate,chocolate,yum
       vanilla,chocolate,good
       coffee,chocolate,yum
       coffee,vanilla,ok
       rocky-road,chocolate,yum
       strawberry,vanilla,yum
  18. 20. Another Example Model
     - What class would you assign to strawberry ice cream with chocolate cake?
     - Compute likelihoods and then normalize
     - Note: this model cannot take into account that the class might depend on how well the cake and ice cream "go together"
     Using the same is-yummy dataset as above:
     - Likelihood that the answer is yum: P(strawberry | yum) = .25, P(chocolate cake | yum) = .75, so .25 * .75 * .66 = .124
     - Likelihood that the answer is good: P(strawberry | good) = 0, P(chocolate cake | good) = 1, so 0 * 1 * .17 = 0
     - Likelihood that the answer is ok: P(strawberry | ok) = 0, P(chocolate cake | ok) = 0, so 0 * 0 * .17 = 0
  20. 22. Another Example Model
     - What about vanilla ice cream and vanilla cake?
     - Intuitively, there is more evidence that the selected category should be Good.
     Using the same is-yummy dataset as above:
     - Likelihood that the answer is yum: P(vanilla | yum) = 0, P(vanilla cake | yum) = .25, so 0 * .25 * .66 = 0
     - Likelihood that the answer is good: P(vanilla | good) = 1, P(vanilla cake | good) = 0, so 1 * 0 * .17 = 0
     - Likelihood that the answer is ok: P(vanilla | ok) = 0, P(vanilla cake | ok) = 1, so 0 * 1 * .17 = 0
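(A sketch of this failure mode in code, using only the six instances from the dataset above; the helper function name is ours, not Weka's.)

     data = [  # (ice_cream, cake, class)
         ("chocolate",  "chocolate", "yum"),
         ("vanilla",    "chocolate", "good"),
         ("coffee",     "chocolate", "yum"),
         ("coffee",     "vanilla",   "ok"),
         ("rocky-road", "chocolate", "yum"),
         ("strawberry", "vanilla",   "yum"),
     ]

     def likelihood(ice, cake, cls):
         rows = [r for r in data if r[2] == cls]
         p_ice  = sum(r[0] == ice  for r in rows) / len(rows)   # unsmoothed conditional probability
         p_cake = sum(r[1] == cake for r in rows) / len(rows)
         prior  = len(rows) / len(data)
         return p_ice * p_cake * prior

     for cls in ("yum", "good", "ok"):
         print(cls, likelihood("vanilla", "vanilla", cls))   # every class picks up a zero factor, so all are 0.0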
  22. 24. Statistical Modeling with Small Datasets
     - When you train your model, how many probabilities are you trying to estimate?
     - This statistical modeling approach has problems with small datasets, where not every class is observed in combination with every attribute value
       - What potential problem occurs when you never observe coffee ice-cream with class ok?
       - When is this not a problem?
  23. 25. Smoothing
     - One way to compensate for 0 counts is to add 1 to every count
     - Then you never have 0 probabilities
     - But what problem might you still have on small datasets?
  24. 26. Naïve Bayes with smoothing
     Using the same is-yummy dataset as above, for vanilla ice cream with vanilla cake:
     - Likelihood that the answer is yum: P(vanilla | yum) = .11, P(vanilla cake | yum) = .33, so .11 * .33 * .66 = .03
     - Likelihood that the answer is good: P(vanilla | good) = .33, P(vanilla cake | good) = .33, so .33 * .33 * .17 = .02
     - Likelihood that the answer is ok: P(vanilla | ok) = .17, P(vanilla cake | ok) = .66, so .17 * .66 * .17 = .02
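(The same computation with add-one smoothing, reproducing the figures above; the convention assumed here adds 1 to each count and the number of possible attribute values to each denominator, which is what matches the slide's numbers.)

     N_ICE, N_CAKE = 5, 2            # ice-cream has 5 possible values, cake has 2

     def smoothed(count, class_total, n_values):
         return (count + 1) / (class_total + n_values)

     # Vanilla ice cream with vanilla cake; class totals are yum = 4, good = 1, ok = 1 out of 6 instances.
     like_yum  = smoothed(0, 4, N_ICE) * smoothed(1, 4, N_CAKE) * 4/6   # .11 * .33 * .66
     like_good = smoothed(1, 1, N_ICE) * smoothed(0, 1, N_CAKE) * 1/6   # .33 * .33 * .17
     like_ok   = smoothed(0, 1, N_ICE) * smoothed(1, 1, N_CAKE) * 1/6   # .17 * .66 * .17
     print(like_yum, like_good, like_ok)   # no zero factors any more; yum comes out slightly ahead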
  26. 28. Numeric Values
     - List the values of the numeric feature for each class value
       - Values for play = yes: 83, 70, 68, 64, 69, 75, 75, 72, 81
     - Compute the mean and standard deviation
       - Values for play = yes: 83, 70, 68, 64, 69, 75, 75, 72, 81 give μ = 73, σ = 6.16
       - Values for play = no: 85, 80, 65, 72, 71 give μ = 74.6, σ = 7.89
     - Compute likelihoods: f(x) = [1 / (σ * sqrt(2π))] * e^(-(x-μ)² / (2σ²))
     - Normalize using the proportion of the predicted class value, as before
     - Assumes a normal distribution!
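(A sketch of the Gaussian likelihood for the numeric temperature attribute, using the play = yes values listed above.)

     import math

     def gaussian(x, mu, sigma):
         # Normal density: f(x) = 1 / (sigma * sqrt(2*pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
         return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

     temps_yes = [83, 70, 68, 64, 69, 75, 75, 72, 81]
     mu = sum(temps_yes) / len(temps_yes)                                              # 73.0
     sigma = math.sqrt(sum((t - mu) ** 2 for t in temps_yes) / (len(temps_yes) - 1))   # ~6.16

     print(gaussian(66, mu, sigma))   # ~0.034, the factor contributed by temperature = 66 for play = yes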
  27. 29. Example
     - Movie Review Dataset
     - Task: movie reviews are either positive or negative
     - Represent the data two ways:
       - Binary: each feature is a word; the value is 1 if the word is present at least once and 0 otherwise
       - Counts: each feature is a word; the value is the number of times it occurs in the document
  28. 30. Does this hold for text?
  29. 31. Multinomial Naïve Bayes: multiply the product of the per-word probabilities (each raised to its count in the document) by the prior probability of H to get the likelihood, just like with standard Naïve Bayes.
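(A rough sketch of the multinomial variant in code; the word probabilities, counts, and priors below are invented purely for illustration, and the multinomial coefficient is dropped because it is identical for every class.)

     import math

     def multinomial_log_likelihood(word_counts, word_probs, prior):
         # log P(H) + sum over words of count(w) * log P(w | H)
         return math.log(prior) + sum(c * math.log(word_probs[w]) for w, c in word_counts.items())

     counts = {"great": 2, "boring": 1}            # word counts in one review
     probs_pos = {"great": 0.05, "boring": 0.01}   # hypothetical P(word | positive)
     probs_neg = {"great": 0.01, "boring": 0.06}   # hypothetical P(word | negative)
     print(multinomial_log_likelihood(counts, probs_pos, 0.5))
     print(multinomial_log_likelihood(counts, probs_neg, 0.5))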
  30. 32. Does it matter?
  31. 33. Scenario
     [Diagram: a set of math story problems on one side and Math Skills 1-14 on the other]
  32. 34. Scenario
     Each problem may be associated with more than one skill.
     [Diagram: math story problems linked to Math Skills 1-14, with one problem pointing to several skills]
  33. 35. Scenario
     Each skill may be associated with more than one problem.
     [Diagram: math story problems linked to Math Skills 1-14, with one skill pointed to by several problems]
  34. 36. How to address the problem?
     - In reality there is a many-to-many mapping between math problems and skills
  35. 37. How to address the problem?
     - In reality there is a many-to-many mapping between math problems and skills
     - Ideally, we should be able to assign any subset of the full set of skills to any problem
       - But can we do that accurately?
  36. 38. How to address the problem?
     - In reality there is a many-to-many mapping between math problems and skills
     - Ideally, we should be able to assign any subset of the full set of skills to any problem
       - But can we do that accurately?
     - If we can't do that, it may be good enough to assign the single most important skill
  37. 39. How to address the problem?
     - In reality there is a many-to-many mapping between math problems and skills
     - Ideally, we should be able to assign any subset of the full set of skills to any problem
       - But can we do that accurately?
     - If we can't do that, it may be good enough to assign the single most important skill
     - In that case, we will not accomplish the whole task
  38. 40. How to address the problem?
     - But if we can do that part of the task more accurately, then we might accomplish more overall than if we try to achieve the more ambitious goal
  39. 41. Low resolution gives more information if the accuracy is higher. (Remember this discussion from lecture 2?)
  40. 42. Which of these approaches is better?
     - You have a corpus of math problem texts, and you are trying to learn models that assign skill labels.
     - Approach one: you have 91 binary prediction models, each of which makes an independent decision about each math text.
     - Approach two: you have one multi-class classifier that assigns one out of the same 91 skill labels.
  41. 43. Approach 1
     - Each skill corresponds to a separate binary predictor.
     - Each of 91 binary predictors is applied to each text.
     - 91 separate predictions are made for each text.
     [Diagram: every math story problem is checked independently against each of Math Skills 1-14]
  42. 44. Approach 2
     - Each skill corresponds to a separate class value.
     - A single multi-class predictor is applied to each text.
     - Only 1 prediction is made for each text.
     [Diagram: every math story problem receives a single label from Math Skills 1-14]
  43. 45. Which of these approaches is better?
     - You have a corpus of math problem texts, and you are trying to learn models that assign skill labels.
     - Approach one: you have 91 binary prediction models, each of which makes an independent decision about each math text.
     - Approach two: you have one multi-class classifier that assigns one out of the same 91 skill labels.
     More power, but more opportunity for error.
  44. 46. Which of these approaches is better?
     - You have a corpus of math problem texts, and you are trying to learn models that assign skill labels.
     - Approach one: you have 91 binary prediction models, each of which makes an independent decision about each math text.
     - Approach two: you have one multi-class classifier that assigns one out of the same 91 skill labels.
     Less power, but fewer opportunities for error.
  45. 47. Approach 1: One versus all
     - Assume you have 80 example texts, and 4 of them have skill5 associated with them
     - Assume you are using some form of smoothing: 0 counts become 1
     - Let's say WordX occurs with skill5 75% of the time and only 5% of the time for the majority class (it's the best predictor for skill5)
       - After smoothing, P(WordX | Skill5) = 2/3
       - P(WordX | majority) = 2/38
  46. 48. Counts Without Smoothing
     - 80 math problem texts
     - 7 instances of WordX
     - 3 of them are skill5 (75% of skill5 texts)
     - WordX is the best predictor for skill5
     Counts:
                 Skill5   Majority Class
       WordX       3           4
       WordY
  47. 49. Counts With Smoothing
     - 80 math problem texts
     - 7 instances of WordX
     - 3 of them are skill5 (75% of skill5 texts)
     - WordX is the best predictor for skill5
     Counts (after adding 1):
                 Skill5   Majority Class
       WordX       4           5
       WordY
  48. 50. Approach 1
     - Assume you have 80 example texts, and 4 of them have skill5 associated with them
     - Assume you are using some form of smoothing: 0 counts become 1
     - Let's say WordX occurs with skill5 75% of the time and 4 times with the majority class (it's the best predictor for skill5)
       - After smoothing, P(WordX | Skill5) = 2/3 = .66
       - P(WordX | majority) = 5/78 = .06
  49. 51. Approach 1
     - Let's say WordY occurs 17% of the time with the majority class and 25% of the time with skill5 (it's a moderately good predictor)
       - In reality, 13 counts of WordY with the majority class and 1 with skill5
       - With smoothing, we get 14 counts of WordY with the majority class and 2 with skill5
       - P(WordY | Skill5) = 1/3
       - P(WordY | Majority) = 7/38
       - Because you multiply the conditional probabilities and the prior probabilities together, it's nearly impossible to predict the minority class when the data is this skewed
       - For "WordX WordY" you would get .66 * .33 * .04 = .009 for skill5 and .05 * .18 * .96 = .009 for the majority class
     - What would you predict without smoothing?
  50. 52. Counts Without Smoothing
     - 80 math problem texts
     - 4 of them are skill5
     - WordY occurs 17% of the time with the majority class and 25% of the time with skill5 (it's a moderately good predictor)
     Counts:
                 Skill5   Majority Class
       WordX       3           4
       WordY       1          13
  51. 53. Counts With Smoothing
     - 80 math problem texts
     - 4 of them are skill5
     - WordY occurs 17% of the time with the majority class and 25% of the time with skill5 (it's a moderately good predictor)
     Counts (after adding 1):
                 Skill5   Majority Class
       WordX       4           5
       WordY       2          14
  52. 54. Approach 1
     - Let's say WordY occurs 17% of the time with the majority class and 25% of the time with skill5 (it's a moderately good predictor)
       - In reality, 13 counts of WordY with the majority class and 1 with skill5
       - With smoothing, we get 14 counts of WordY with the majority class and 2 with skill5
       - P(WordY | Skill5) = 1/3 = .33
       - P(WordY | Majority) = 14/78 = .18
       - Because you multiply the conditional probabilities and the prior probabilities together, it's nearly impossible to predict the minority class when the data is this skewed
       - For "WordX WordY" you would get .66 * .33 * .04 = .009 for skill5 and .05 * .18 * .96 = .009 for the majority class
     - What would you predict without smoothing?
  53. 55. Approach 1
     - Let's say WordY occurs 17% of the time with the majority class and 25% of the time with skill5 (it's a moderately good predictor)
       - In reality, 13 counts of WordY with the majority class and 1 with skill5
       - With smoothing, we get 14 counts of WordY with the majority class and 2 with skill5
       - P(WordY | Skill5) = 1/3 = .33
       - P(WordY | Majority) = 14/78 = .18
       - Because you multiply the conditional probabilities and the prior probabilities together, it's nearly impossible to predict the minority class when the data is this skewed
       - For "WordX WordY" you would get .66 * .33 * .05 = .01 for skill5 and .06 * .18 * .95 = .01 for the majority class
     - What would you predict without smoothing?
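(A sketch of the skill5 arithmetic, using the smoothed probabilities from the slides above; the point is how strongly the prior pulls toward the majority class.)

     p_skill5, p_majority = 4/80, 76/80               # priors: .05 vs .95

     p_wordx_skill5, p_wordx_majority = 4/6, 5/78     # .66 vs .06
     p_wordy_skill5, p_wordy_majority = 2/6, 14/78    # .33 vs .18

     # For a text containing "WordX WordY":
     like_skill5   = p_wordx_skill5   * p_wordy_skill5   * p_skill5     # ~.011
     like_majority = p_wordx_majority * p_wordy_majority * p_majority   # ~.011
     print(like_skill5, like_majority)   # essentially a tie: the prior nearly cancels the strong word evidence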
  54. 56. Linear Models
  55. 57. Remember this: What do concepts look like?
  57. 59. Review: Concepts as Lines
     [Figure: data points (labels R, B, S, T, C and several X's) in a two-dimensional feature space, with a line separating the classes]
  61. 63. Review: Concepts as Lines
     What will be the prediction for this new data point?
     [Figure: the same two-dimensional plot with a new point X added]
  62. 64. What are we learning?
     - We're learning to draw a line through a multidimensional space (really a "hyperplane")
     - Each function we learn is like a single split in a decision tree, but it can take many features into account at one time rather than just one
     - F(x) = C0 + C1*X1 + C2*X2 + C3*X3
       - X1 through Xn are our attributes
       - C0 through Cn are coefficients
       - We're learning the coefficients, which are weights
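(A sketch of evaluating such a learned function; the coefficient values and the decision threshold at zero are illustrative assumptions, not something learned from data.)

     def f(x, coeffs):
         # f(x) = c0 + c1*x1 + c2*x2 + ... + cn*xn
         c0, rest = coeffs[0], coeffs[1:]
         return c0 + sum(c * xi for c, xi in zip(rest, x))

     coeffs = [-1.0, 0.8, -0.3, 2.1]   # hypothetical learned weights (c0 is the intercept)
     x = [1.5, 0.0, 0.4]               # one instance's attribute values
     print("class A" if f(x, coeffs) > 0 else "class B")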
  63. 65. Taking a Step Back
     - We started out with tree-learning algorithms that learn symbolic rules with the goal of achieving the highest accuracy
       - 0R, 1R, Decision Trees (J48)
     - Then we talked about statistical models that make decisions based on probability
       - Naïve Bayes
       - The rules look different: we just store counts
       - No explicit focus on accuracy during learning
     - What are the implications of the contrast between an accuracy focus and a probability focus?
  64. 66. Performing well with skewed class distributions
     - Naïve Bayes has trouble with skewed class distributions because of the contribution of the prior probabilities
       - Remember our math problem case
     - Linear models can compensate for this
       - They don't have any notion of prior probability per se
       - If a good split of the data exists, they will find it wherever it is
       - It is a problem if there is no good split
  65. 67. Skewed but clean separation
  67. 69. Skewed but no clean separation
  69. 71. Taking a Step Back
     - The models we will look at now have rules composed of numbers, so they "look" more like Naïve Bayes than like Decision Trees
     - But the numbers are obtained through a focus on achieving accuracy, so the learning process is more like Decision Trees
     - Given these two properties, what can you say about the assumptions these models make about the form of the solution and about the world?
