Lecture 8: Machine Learning in Practice (1)

Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad

Check these out next

1 of 36 Ad

Lecture 8: Machine Learning in Practice (1)

Download to read offline

evaluation, t-test, cost-sensitive measures, Occam's razor, kappa statistic, lift charts, ROC curves, recall-precision curves, loss function, counting the cost, Weka

1. Machine Learning for Language Technology 2015
   http://stp.lingfil.uu.se/~santinim/ml/2015/ml4lt_2015.htm
   Machine Learning in Practice (1)
   Marina Santini, santinim@stp.lingfil.uu.se
   Department of Linguistics and Philology, Uppsala University, Uppsala, Sweden
   Autumn 2015
2. Acknowledgements
   - Weka's slides
   - Witten et al. (2011): Ch. 5 (pp. 156-180)
   - Daumé III (2015): Ch. 4, pp. 65-67
3. Outline
   - Comparing schemes: the t-test
   - Predicting probabilities
   - Cost-sensitive measures
   - Occam's razor
4. Comparing data mining schemes
   - Frequent question: which of two learning schemes performs better?
   - Note: this is domain dependent!
   - Obvious way: compare 10-fold cross-validation (CV) estimates
   - Generally sufficient in applications (we lose little if the chosen method is not truly better)
   - However, what about machine learning research?
     - Need to show convincingly that a particular method works better
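Below is a minimal sketch, not part of the original slides, of how such paired 10-fold CV estimates could be obtained. The lecture works with Weka; scikit-learn is used here instead, and the dataset and the two classifiers are placeholders.

```python
# Sketch (assumption: scikit-learn instead of Weka): paired 10-fold CV
# estimates for two learning schemes on the same folds.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Use the same folds for both schemes so the estimates are paired.
folds = KFold(n_splits=10, shuffle=True, random_state=42)

acc_a = cross_val_score(GaussianNB(), X, y, cv=folds)                          # scheme A
acc_b = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=folds)  # scheme B

print("mean accuracy A:", acc_a.mean())
print("mean accuracy B:", acc_b.mean())
```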
5. Comparing schemes II
   - Want to show that scheme A is better than scheme B in a particular domain
     - For a given amount of training data
     - On average, across all possible training sets
   - Let's assume we have an infinite amount of data from the domain:
     - Sample infinitely many datasets of a specified size
     - Obtain a cross-validation estimate on each dataset for each scheme
     - Check if the mean accuracy for scheme A is better than the mean accuracy for scheme B
6. Paired t-test
   - In practice we have limited data and a limited number of estimates for computing the mean
   - Student's t-test tells us whether the means of two samples are significantly different
   - In our case the samples are cross-validation estimates for different datasets from the domain
   - Use a paired t-test because the individual samples are paired
     - The same CV is applied twice
   William Gosset (born 1876 in Canterbury; died 1937 in Beaconsfield, England) obtained a post as a chemist in the Guinness brewery in Dublin in 1899. He invented the t-test to handle small samples for quality control in brewing and wrote under the name "Student".
7. Distribution of the means
   - Let x1, x2, ..., xk and y1, y2, ..., yk be the two sets of estimates
   - mx and my are the means
   - With enough samples, the mean of a set of independent samples is normally distributed
   - The estimated variances of the means are σx²/k and σy²/k
   - If μx and μy are the true means, then
       (mx − μx) / sqrt(σx²/k)   and   (my − μy) / sqrt(σy²/k)
     are approximately normally distributed with mean 0 and variance 1
8. Student's distribution
   - With small samples (k < 100) the mean follows Student's distribution with k−1 degrees of freedom
   - Confidence limits (assuming we have 10 estimates, i.e. 9 degrees of freedom):

     Pr[X ≥ z]   z (9 degrees of freedom)   z (normal distribution)
     0.1%        4.30                       3.09
     0.5%        3.25                       2.58
     1%          2.82                       2.33
     5%          1.83                       1.65
     10%         1.38                       1.28
     20%         0.88                       0.84
9. Distribution of the differences
   - Let md = mx − my
   - The difference of the means (md) also has a Student's distribution with k−1 degrees of freedom
   - The standardized version of md is called the t-statistic:
       t = md / sqrt(σd²/k)
   - We use t to perform the t-test
   - σd² is the variance of the difference samples
10. Performing the test
    - Fix a significance level
    - If a difference is significant at the α% level, there is a (100−α)% chance that the true means differ
    - Divide the significance level by two because the test is two-tailed
      - i.e. the true difference can be positive or negative
    - Look up the value of z that corresponds to α/2
    - If t ≤ −z or t ≥ z, the difference is significant
      - i.e. the null hypothesis (that the difference is zero) can be rejected
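A minimal sketch, not from the slides, of this test applied to the paired CV estimates produced above (acc_a, acc_b). The 5% significance level is an arbitrary choice and scipy is an assumption.

```python
# Sketch: paired t-test on the paired CV estimates from the previous sketch.
import numpy as np
from scipy import stats

d = acc_a - acc_b                                 # difference samples
k = len(d)
t_stat = d.mean() / np.sqrt(d.var(ddof=1) / k)    # t = md / sqrt(sigma_d^2 / k)

alpha = 0.05                                      # 5% significance level, two-tailed
z = stats.t.ppf(1 - alpha / 2, k - 1)             # critical value, k-1 degrees of freedom

print("t =", t_stat, "critical value =", z)
print("significant:", abs(t_stat) >= z)

# The same test in one call:
print(stats.ttest_rel(acc_a, acc_b))
```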
11. Unpaired observations
    - If the CV estimates are from different datasets, they are no longer paired (or maybe we have k estimates for one scheme and j estimates for the other)
    - Then we have to use an unpaired t-test with min(k, j) − 1 degrees of freedom
    - The estimate of the variance of the difference of the means becomes:
        σx²/k + σy²/j
12. Predicting probabilities
    - Performance measure so far: success rate
    - Also called the 0-1 loss function:
        Σi (0 if prediction i is correct, 1 if prediction i is incorrect)
    - Most classifiers produce class probabilities
    - Depending on the application, we might want to check the accuracy of the probability estimates
    - 0-1 loss is not the right thing to use in those cases
13. Quadratic loss function
    - p1 ... pk are probability estimates for an instance
    - c is the index of the instance's actual class
    - a1 ... ak = 0, except for ac, which is 1
    - The quadratic loss is:
        Σj (pj − aj)²
    - We want to minimize the expected loss:
        E[ Σj (pj − aj)² ]
14. Informational loss function
    - The informational loss function is −log(pc), where c is the index of the instance's actual class
    - Let p1*, ..., pk* be the true class probabilities
    - Then the expected value of the loss function is:
        −Σj pj* log(pj)
15. Discussion
    - Which loss function should we choose?
      - The quadratic loss function takes into account all class probability estimates for an instance; it is bounded by 1 + Σj pj² and can never exceed 2
      - The informational loss focuses only on the probability estimate for the actual class (and can become infinite as that estimate approaches 0)
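A minimal sketch, not from the slides, computing the three loss functions above for a single instance of a 3-class problem. The probability estimates are made-up numbers, and log base 2 is an assumption for the informational loss.

```python
# Sketch: 0-1, quadratic and informational loss for one instance.
import math

p = [0.7, 0.2, 0.1]          # classifier's probability estimates p1..pk (made-up)
c = 0                        # index of the instance's actual class
a = [1 if j == c else 0 for j in range(len(p))]   # a_j = 1 only for the actual class

zero_one      = 0 if max(range(len(p)), key=lambda j: p[j]) == c else 1
quadratic     = sum((p[j] - a[j]) ** 2 for j in range(len(p)))
informational = -math.log2(p[c])     # base 2 (bits) is an assumption

print(zero_one, quadratic, informational)
```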
16. The kappa statistic
    - Two confusion matrices for a 3-class problem: actual predictions (left) vs. random predictions (right) [matrices shown as figures on the slide]
    - Number of successes: sum of the entries on the diagonal (D)
    - Kappa statistic:
        κ = (D_observed − D_random) / (D_perfect − D_random)
      measures the relative improvement over random predictions
17. Kappa statistic: calculations
    - Proportion of class "a" = 0.5 (i.e. 100 instances out of 200 → 50% → 0.5)
    - Proportion of class "b" = 0.3 (i.e. 60 instances out of 200 → 30% → 0.3)
    - Proportion of class "c" = 0.2 (i.e. 40 instances out of 200 → 20% → 0.2)
    Both classifiers (actual predictions in the left-hand table, random predictions in the right-hand table) return 120 a's, 60 b's and 20 c's, but one classifier is random. How much does the actual classifier improve on the random classifier?
    - A classifier guessing randomly would return the predictions in the right-hand table: 0.5 × 120 = 60; 0.3 × 60 = 18; 0.2 × 20 = 4 → 60 + 18 + 4 = 82
    - The actual classifier returns the predictions in the left-hand table: 140 correct predictions (see the diagonal), i.e. a 70% success rate. However:
        κ = (D_observed − D_random) / (D_perfect − D_random) = (140 − 82) / (200 − 82) = 58/118 = 0.49 = 49%
    - So the actual success rate of 70% represents an improvement of 49% on random guessing!
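A minimal sketch, not from the slides, computing the kappa statistic from a confusion matrix. The matrix below is an assumed reconstruction chosen to be consistent with the slide's totals (200 instances, 140 on the diagonal, 120/60/20 predicted as a/b/c).

```python
# Sketch: kappa statistic from a confusion matrix.
import numpy as np

# rows = actual class (a, b, c), columns = predicted class (a, b, c); assumed values
confusion = np.array([
    [88, 10,  2],   # actual a (100 instances)
    [14, 40,  6],   # actual b (60 instances)
    [18, 10, 12],   # actual c (40 instances)
])

total       = confusion.sum()
d_observed  = np.trace(confusion)                        # successes of the actual classifier
class_props = confusion.sum(axis=1) / total              # proportions of the actual classes: 0.5, 0.3, 0.2
d_random    = (class_props * confusion.sum(axis=0)).sum()  # expected successes of a random classifier
d_perfect   = total

kappa = (d_observed - d_random) / (d_perfect - d_random)
print(kappa)   # ≈ 0.49
```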
18. In summary
    - A kappa statistic of 100% (or 1) implies a perfect classifier.
    - A kappa statistic of 0 implies that the classifier provides no information and behaves as if it were guessing randomly.
    - The kappa statistic measures the agreement between predicted and observed categorizations of a dataset, correcting for the agreement that occurs by chance.
    - Weka reports the kappa statistic to assess the success rate beyond chance.
19. Quiz 1: kappa statistic
    Our classifier predicts Red 41 times, Green 29 times and Blue 30 times. The actual numbers for the sample are: 40 Red, 30 Green and 30 Blue.
    Overall, our classifier is right 70% of the time.
    Suppose these predictions had been random guesses. Our classifier would have been right by chance: 0.4 × 41 + 0.3 × 29 + 0.3 × 30 = 34.1 (random guessing).
    So the actual success rate of 70% represents an improvement of 35.9% on random guessing.
    What is the kappa statistic for our classifier?
    1. 0.54
    2. 0.60
    3. 0.70
20. Counting the cost
    - In practice, different types of classification errors often incur different costs
    - Examples:
      - Promotional mailing
      - Terrorist profiling: "not a terrorist" is correct 99.99% of the time, but if you miss the 0.01% the cost is very high
      - Loan decisions
      - etc.
    - There are many other types of cost as well, e.g. the cost of collecting training data
21. Counting the cost
    - The confusion matrix:

                            Predicted class
                            Yes              No
      Actual class   Yes    True positive    False negative
                     No     False positive   True negative
22. Classification with costs
    - Two cost matrices [shown as figures on the slide]
    - Success rate is replaced by the average cost per prediction
      - The cost is given by the appropriate entry in the cost matrix
23. Cost-sensitive classification
    - We can take costs into account when making predictions
      - Basic idea: only predict the high-cost class when very confident about the prediction
    - Given: predicted class probabilities
      - Normally we just predict the most likely class
      - Here, we should make the prediction that minimizes the expected cost
    - Expected cost: dot product of the vector of class probabilities and the appropriate column in the cost matrix
    - Choose the column (class) that minimizes the expected cost
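A minimal sketch, not from the slides, of minimum-expected-cost prediction. The probability vector and the cost matrix are made-up numbers for illustration.

```python
# Sketch: cost-sensitive prediction by picking the class with the smallest expected cost.
import numpy as np

probs = np.array([0.3, 0.7])          # predicted class probabilities: [yes, no] (made-up)

# cost[i, j] = cost of predicting class j when the actual class is i (made-up)
cost = np.array([
    [0.0, 10.0],   # actual yes: predicting "no" (a miss) is expensive
    [1.0,  0.0],   # actual no:  predicting "yes" (a false alarm) is cheap
])

expected_cost = probs @ cost          # dot product with each column of the cost matrix
prediction = int(expected_cost.argmin())

print(expected_cost)                  # [0.7, 3.0]
print("predict class", prediction)    # 0 ("yes"), even though "no" is more likely
```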
24. Cost-sensitive learning
    - So far we haven't taken costs into account at training time
    - Most learning schemes do not perform cost-sensitive learning
      - They generate the same classifier no matter what costs are assigned to the different classes
      - Example: a standard decision tree learner
    - Simple methods for cost-sensitive learning (see the sketch below):
      - Resampling of instances according to costs
      - Weighting of instances according to costs
    - Some schemes can take costs into account by varying a parameter, e.g. naïve Bayes
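A minimal sketch, not from the slides, of the resampling idea: draw a bootstrap sample of the training data in which costly-to-miss classes are over-represented. The helper name and the cost values are assumptions.

```python
# Sketch: cost-sensitive learning by resampling instances according to costs.
import numpy as np

rng = np.random.default_rng(0)

def resample_by_cost(X, y, misclassification_cost):
    """Bootstrap sample in which each instance's chance of being drawn is
    proportional to the cost of misclassifying its class."""
    weights = np.array([misclassification_cost[label] for label in y], dtype=float)
    weights /= weights.sum()
    idx = rng.choice(len(y), size=len(y), replace=True, p=weights)
    return X[idx], y[idx]

# Example: class 1 is assumed to be 5 times as costly to misclassify as class 0.
X = np.arange(10).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
X_res, y_res = resample_by_cost(X, y, misclassification_cost={0: 1.0, 1: 5.0})
print(np.bincount(y_res))   # class 1 is now over-represented in the training sample
```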
25. Lift charts
    - In practice, costs are rarely known
    - Decisions are usually made by comparing possible scenarios
    - Example: promotional mailout to 1,000,000 households
      - Mail to all: 0.1% respond (1000)
      - A data mining tool identifies a subset of the 100,000 most promising households, of which 0.4% respond (400): 40% of the responses for 10% of the cost may pay off
      - Or identify a subset of the 400,000 most promising households, of which 0.2% respond (800)
    - A lift chart allows a visual comparison
26. Data for a lift chart [table shown as a figure on the slide]
27. Generating a lift chart
    - Sort the instances according to their predicted probability of being positive:

      Rank   Predicted probability   Actual class
      1      0.95                    Yes
      2      0.93                    Yes
      3      0.93                    No
      4      0.88                    Yes
      ...    ...                     ...

    - x axis: sample size; y axis: number of true positives
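A minimal sketch, not from the slides, computing the points of a lift chart in exactly this way; the toy probabilities and labels are assumptions.

```python
# Sketch: lift chart points (cumulative true positives vs. sample size).
import numpy as np

predicted_prob = np.array([0.51, 0.95, 0.88, 0.93, 0.30, 0.93, 0.76, 0.15])  # made-up
actual         = np.array([0,    1,    1,    1,    0,    0,    1,    0   ])  # 1 = Yes

order        = np.argsort(-predicted_prob)          # most confident first
cum_true_pos = np.cumsum(actual[order])             # y axis
sample_size  = np.arange(1, len(actual) + 1)        # x axis

for n, tp in zip(sample_size, cum_true_pos):
    print(f"top {n} instances -> {tp} true positives")
```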
28. A hypothetical lift chart [figure]: 40% of the responses for 10% of the cost; 80% of the responses for 40% of the cost
29. ROC curves
    - ROC curves are similar to lift charts
      - ROC stands for "receiver operating characteristic"
      - Used in signal detection to show the tradeoff between hit rate and false alarm rate over a noisy channel
    - Differences from a lift chart:
      - The y axis shows the percentage of true positives in the sample rather than the absolute number
      - The x axis shows the percentage of false positives in the sample rather than the sample size
30. A sample ROC curve [figure]
    - Jagged curve: one set of test data
    - Smooth curve: use cross-validation
31. Cross-validation and ROC curves
    - Simple method of getting a ROC curve using cross-validation:
      - Collect the probabilities for the instances in the test folds
      - Sort the instances according to their probabilities
    - This method is implemented in Weka
    - However, this is just one possibility
      - Another possibility is to generate an ROC curve for each fold and average them
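A minimal sketch, not from the slides, of the simple method: pool the probabilities, sort, and read off true and false positive rates. The scores and labels below are placeholders for what the CV test folds would produce.

```python
# Sketch: ROC points from probabilities pooled over cross-validation test folds.
import numpy as np

scores = np.array([0.95, 0.93, 0.93, 0.88, 0.76, 0.51, 0.30, 0.15])  # P(positive), made-up
labels = np.array([1,    1,    0,    1,    1,    0,    0,    0   ])  # 1 = positive

order  = np.argsort(-scores)                          # sort by decreasing probability
labels = labels[order]

tpr = np.cumsum(labels) / labels.sum()                # y axis: % of true positives
fpr = np.cumsum(1 - labels) / (1 - labels).sum()      # x axis: % of false positives

for x, y in zip(fpr, tpr):
    print(f"FPR={x:.2f}  TPR={y:.2f}")
```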
32. ROC curves for two schemes [figure]
    - For a small, focused sample, use method A
    - For a larger one, use method B
    - In between, choose between A and B with appropriate probabilities
33. Recall-precision curves
    - Percentage of retrieved documents that are relevant: precision = TP / (TP + FP)
    - Percentage of relevant documents that are returned: recall = TP / (TP + FN)
    - Precision/recall curves have a hyperbolic shape
    - Summary measures: average precision at 20%, 50% and 80% recall (three-point average recall)
    - F-measure = (2 × recall × precision) / (recall + precision)
    - sensitivity × specificity = (TP / (TP + FN)) × (TN / (FP + TN))
    - Area under the ROC curve (AUC): the probability that a randomly chosen positive instance is ranked above a randomly chosen negative one
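A minimal sketch, not from the slides, computing precision, recall, F-measure and AUC on the same toy data as above. The 0.5 decision threshold is an assumption, and the AUC is computed directly from its ranking definition (ties ignored for simplicity).

```python
# Sketch: precision, recall, F-measure and AUC from predicted labels/scores.
import numpy as np

labels = np.array([1, 1, 0, 1, 1, 0, 0, 0])                           # actual classes
scores = np.array([0.95, 0.93, 0.93, 0.88, 0.76, 0.51, 0.30, 0.15])   # made-up scores
preds  = (scores >= 0.5).astype(int)                                   # threshold at 0.5 (assumption)

tp = np.sum((preds == 1) & (labels == 1))
fp = np.sum((preds == 1) & (labels == 0))
fn = np.sum((preds == 0) & (labels == 1))

precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f_measure = 2 * recall * precision / (recall + precision)

# AUC: probability that a random positive is ranked above a random negative
pos, neg = scores[labels == 1], scores[labels == 0]
auc = np.mean([s_p > s_n for s_p in pos for s_n in neg])

print(precision, recall, f_measure, auc)
```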
34. Model selection criteria
    - Model selection criteria attempt to find a good compromise between:
      - The complexity of a model
      - Its prediction accuracy on the training data
    - Reasoning: a good model is a simple model that achieves high accuracy on the given data
    - Also known as Occam's razor: the best theory is the smallest one that describes all the facts
    William of Ockham, born in the village of Ockham in Surrey (England) about 1285, was the most influential philosopher of the 14th century and a controversial theologian.
35. Elegance vs. errors
    - Model 1: a very simple, elegant model that accounts for the data almost perfectly
    - Model 2: a significantly more complex model that reproduces the data without mistakes
    - Model 1 is probably preferable.
36. The End
