NIPS2007: Theory and Applications of Boosting


  1. 1. Theory and Applications of Boosting Rob Schapire Princeton University
  2. 2. Example: “How May I Help You?” [Gorin et al.] • goal: automatically categorize type of call requested by phone customer (Collect, CallingCard, PersonToPerson, etc.) • yes I’d like to place a collect call long distance please (Collect) • operator I need to make a call but I need to bill it to my office (ThirdNumber) • yes I’d like to place a call on my master card please (CallingCard) • I just called a number in sioux city and I musta rang the wrong number because I got the wrong party and I would like to have that taken off of my bill (BillingCredit)
  3. 3. Example: “How May I Help You?” [Gorin et al.] • goal: automatically categorize type of call requested by phone customer (Collect, CallingCard, PersonToPerson, etc.) • yes I’d like to place a collect call long distance please (Collect) • operator I need to make a call but I need to bill it to my office (ThirdNumber) • yes I’d like to place a call on my master card please (CallingCard) • I just called a number in sioux city and I musta rang the wrong number because I got the wrong party and I would like to have that taken off of my bill (BillingCredit) • observation: • easy to find “rules of thumb” that are “often” correct • e.g.: “IF ‘card’ occurs in utterance THEN predict ‘CallingCard’ ” • hard to find single highly accurate prediction rule
  4. 4. The Boosting Approach • devise computer program for deriving rough rules of thumb • apply procedure to subset of examples • obtain rule of thumb • apply to 2nd subset of examples • obtain 2nd rule of thumb • repeat T times
  5. 5. Details • how to choose examples on each round? • concentrate on “hardest” examples (those most often misclassified by previous rules of thumb) • how to combine rules of thumb into single prediction rule? • take (weighted) majority vote of rules of thumb
  6. 6. Boosting • boosting = general method of converting rough rules of thumb into highly accurate prediction rule • technically: • assume given “weak” learning algorithm that can consistently find classifiers (“rules of thumb”) at least slightly better than random, say, accuracy ≥ 55% (in two-class setting) • given sufficient data, a boosting algorithm can provably construct single classifier with very high accuracy, say, 99%
  7. 7. Outline of Tutorial • brief background • basic algorithm and core theory • other ways of understanding boosting • experiments, applications and extensions
  8. 8. Brief Background
  9. 9. Strong and Weak Learnability • boosting’s roots are in “PAC” (Valiant) learning model • get random examples from unknown, arbitrary distribution • strong PAC learning algorithm: • for any distribution with high probability given polynomially many examples (and polynomial time) can find classifier with arbitrarily small generalization error • weak PAC learning algorithm • same, but generalization error only needs to be slightly better than random guessing (1/2 − γ) • [Kearns & Valiant ’88]: • does weak learnability imply strong learnability?
  10. 10. Early Boosting Algorithms • [Schapire ’89]: • first provable boosting algorithm • [Freund ’90]: • “optimal” algorithm that “boosts by majority” • [Drucker, Schapire & Simard ’92]: • first experiments using boosting • limited by practical drawbacks
  11. 11. AdaBoost • [Freund & Schapire ’95]: introduced “AdaBoost” algorithm • strong practical advantages over previous boosting algorithms • experiments and applications using AdaBoost: [Drucker & Cortes ’96] [Abney, Schapire & Singer ’99] [Tieu & Viola ’00] [Jackson & Craven ’96] [Haruno, Shirai & Ooyama ’99] [Walker, Rambow & Rogati ’01] [Freund & Schapire ’96] [Cohen & Singer ’99] [Rochery, Schapire, Rahim & Gupta ’01] [Quinlan ’96] [Dietterich ’00] [Merler, Furlanello, Larcher & Sboner ’01] [Breiman ’96] [Schapire & Singer ’00] [Di Fabbrizio, Dutton, Gupta et al. ’02] [Maclin & Opitz ’97] [Collins ’00] [Qu, Adam, Yasui et al. ’02] [Bauer & Kohavi ’97] [Escudero, Màrquez & Rigau ’00] [Tur, Schapire & Hakkani-Tür ’03] [Schwenk & Bengio ’98] [Iyer, Lewis, Schapire et al. ’00] [Viola & Jones ’04] [Schapire, Singer & Singhal ’98] [Onoda, Rätsch & Müller ’00] [Middendorf, Kundaje, Wiggins et al. ’04] . . . • continuing development of theory and algorithms: [Breiman ’98, ’99] [Duffy & Helmbold ’99, ’02] [Koltchinskii, Panchenko & Lozano ’01] [Schapire, Freund, Bartlett & Lee ’98] [Freund & Mason ’99] [Collins, Schapire & Singer ’02] [Grove & Schuurmans ’98] [Ridgeway, Madigan & Richardson ’99] [Demiriz, Bennett & Shawe-Taylor ’02] [Mason, Bartlett & Baxter ’98] [Kivinen & Warmuth ’99] [Lebanon & Lafferty ’02] [Schapire & Singer ’99] [Friedman, Hastie & Tibshirani ’00] [Wyner ’02] [Cohen & Singer ’99] [Rätsch, Onoda & Müller ’00] [Rudin, Daubechies & Schapire ’03] [Freund & Mason ’99] [Rätsch, Warmuth, Mika et al. ’00] [Jiang ’04] [Domingo & Watanabe ’99] [Allwein, Schapire & Singer ’00] [Lugosi & Vayatis ’04] [Mason, Baxter, Bartlett & Frean ’99] [Friedman ’01] [Zhang ’04] . . .
  12. 12. Basic Algorithm and Core Theory • introduction to AdaBoost • analysis of training error • analysis of test error based on margins theory
  13. 13. A Formal Description of Boosting • given training set (x1, y1), . . . , (xm, ym) • yi ∈ {−1, +1} correct label of instance xi ∈ X
  14. 14. A Formal Description of Boosting • given training set (x1, y1), . . . , (xm, ym) • yi ∈ {−1, +1} correct label of instance xi ∈ X • for t = 1, . . . , T : • construct distribution Dt on {1, . . . , m} • find weak classifier (“rule of thumb”) ht : X → {−1, +1} with small error εt on Dt : εt = Pr_{i∼Dt}[ht(xi) ≠ yi]
  15. 15. A Formal Description of Boosting • given training set (x1, y1), . . . , (xm, ym) • yi ∈ {−1, +1} correct label of instance xi ∈ X • for t = 1, . . . , T : • construct distribution Dt on {1, . . . , m} • find weak classifier (“rule of thumb”) ht : X → {−1, +1} with small error εt on Dt : εt = Pr_{i∼Dt}[ht(xi) ≠ yi] • output final classifier Hfinal
  16. 16. AdaBoost [with Freund] • constructing Dt : • D1(i) = 1/m
  17. 17. AdaBoost [with Freund] • constructing Dt : • D1(i) = 1/m • given Dt and ht : Dt+1(i) = (Dt(i)/Zt) × (e^(−αt) if yi = ht(xi), e^(αt) if yi ≠ ht(xi)) = (Dt(i)/Zt) · exp(−αt yi ht(xi)) where Zt = normalization constant and αt = (1/2) ln((1 − εt)/εt) > 0
  18. 18. AdaBoost [with Freund] • constructing Dt : • D1(i) = 1/m • given Dt and ht : Dt+1(i) = (Dt(i)/Zt) × (e^(−αt) if yi = ht(xi), e^(αt) if yi ≠ ht(xi)) = (Dt(i)/Zt) · exp(−αt yi ht(xi)) where Zt = normalization constant and αt = (1/2) ln((1 − εt)/εt) > 0 • final classifier: • Hfinal(x) = sign(Σt αt ht(x))
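A minimal NumPy sketch of the algorithm on the slide above, using axis-aligned decision stumps as the weak learners; the stump search and all names are illustrative, not part of the tutorial:

```python
import numpy as np

def adaboost(X, y, T):
    """AdaBoost with decision stumps as weak learners (illustrative sketch).

    X: (m, n) feature matrix, y: (m,) labels in {-1, +1}, T: number of rounds.
    Returns the chosen stumps (feature, threshold, sign) and their weights alpha_t.
    """
    m = len(y)
    D = np.full(m, 1.0 / m)                 # D_1(i) = 1/m
    stumps, alphas = [], []
    for _ in range(T):
        # weak learner: exhaustively pick the stump with smallest weighted error under D_t
        best = None
        for j in range(X.shape[1]):
            for thr in np.unique(X[:, j]):
                for s in (+1, -1):
                    pred = s * np.where(X[:, j] > thr, 1, -1)
                    err = D[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, (j, thr, s), pred)
        eps, stump, pred = best
        eps = min(max(eps, 1e-10), 1 - 1e-10)    # guard against eps = 0 on separable data
        alpha = 0.5 * np.log((1 - eps) / eps)    # alpha_t = 1/2 ln((1 - eps_t)/eps_t)
        D = D * np.exp(-alpha * y * pred)        # up-weight mistakes, down-weight correct points
        D /= D.sum()                             # Z_t normalization
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def predict(stumps, alphas, X):
    """H_final(x) = sign(sum_t alpha_t h_t(x)); ties return 0."""
    f = sum(a * s * np.where(X[:, j] > thr, 1, -1)
            for a, (j, thr, s) in zip(alphas, stumps))
    return np.sign(f)
```

The exhaustive stump search only keeps the example self-contained; any weak learner that does slightly better than chance under Dt could be plugged in instead.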
  19. 19. Toy Example D1 weak classifiers = vertical or horizontal half-planes
  20. 20. Round 1: weak classifier h1 has error ε1 = 0.30 and weight α1 = 0.42; misclassified examples are up-weighted to form D2 [figure]
  21. 21. Round 2: weak classifier h2 has error ε2 = 0.21 under D2 and weight α2 = 0.65; misclassified examples are up-weighted to form D3 [figure]
  22. 22. Round 3: weak classifier h3 has error ε3 = 0.14 under D3 and weight α3 = 0.92 [figure]
  23. 23. Final Classifier: H = sign(0.42 h1 + 0.65 h2 + 0.92 h3) [figure]
  24. 24. Analyzing the training error • Theorem: • write εt as 1/2 − γt
  25. 25. Analyzing the training error • Theorem: • write εt as 1/2 − γt • then training error(Hfinal) ≤ ∏t [2 √(εt(1 − εt))] = ∏t √(1 − 4γt²) ≤ exp(−2 Σt γt²)
  26. 26. Analyzing the training error • Theorem: • write εt as 1/2 − γt • then training error(Hfinal) ≤ ∏t [2 √(εt(1 − εt))] = ∏t √(1 − 4γt²) ≤ exp(−2 Σt γt²) • so: if ∀t : γt ≥ γ > 0 then training error(Hfinal) ≤ e^(−2γ²T) • AdaBoost is adaptive: • does not need to know γ or T a priori • can exploit γt ≫ γ
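A quick numerical reading of the bound above (assuming, purely for illustration, that every round’s edge satisfies γt ≥ γ = 0.1):

```python
import math

# The bound exp(-2 * gamma^2 * T) for a few values of T, with gamma = 0.1.
gamma = 0.1
for T in (10, 100, 500):
    print(T, math.exp(-2 * gamma**2 * T))

# With m = 1000 training examples, the bound drops below 1/m once
# exp(-0.02 * T) < 1/1000, i.e. T > ln(1000) / 0.02 ~ 346 rounds,
# at which point the training error must be exactly zero.
```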
  27. 27. Proof • let f(x) = Σt αt ht(x) ⇒ Hfinal(x) = sign(f(x)) • Step 1: unwrapping recurrence: Dfinal(i) = (1/m) ∏t [exp(−αt yi ht(xi)) / Zt] = (1/m) exp(−yi f(xi)) / ∏t Zt
  28. 28. Proof (cont.) • Step 2: training error(Hfinal) ≤ ∏t Zt
  29. 29. Proof (cont.) • Step 2: training error(Hfinal) ≤ ∏t Zt • Proof: training error(Hfinal) = (1/m) Σi [1 if yi ≠ Hfinal(xi), else 0]
  30. 30. Proof (cont.) • Step 2: training error(Hfinal) ≤ ∏t Zt • Proof: training error(Hfinal) = (1/m) Σi [1 if yi ≠ Hfinal(xi), else 0] = (1/m) Σi [1 if yi f(xi) ≤ 0, else 0]
  31. 31. Proof (cont.) • Step 2: training error(Hfinal) ≤ ∏t Zt • Proof: training error(Hfinal) = (1/m) Σi [1 if yi ≠ Hfinal(xi), else 0] = (1/m) Σi [1 if yi f(xi) ≤ 0, else 0] ≤ (1/m) Σi exp(−yi f(xi))
  32. 32. Proof (cont.) • Step 2: training error(Hfinal) ≤ ∏t Zt • Proof: training error(Hfinal) = (1/m) Σi [1 if yi ≠ Hfinal(xi), else 0] = (1/m) Σi [1 if yi f(xi) ≤ 0, else 0] ≤ (1/m) Σi exp(−yi f(xi)) = Σi Dfinal(i) ∏t Zt
  33. 33. Proof (cont.) • Step 2: training error(Hfinal) ≤ ∏t Zt • Proof: training error(Hfinal) = (1/m) Σi [1 if yi ≠ Hfinal(xi), else 0] = (1/m) Σi [1 if yi f(xi) ≤ 0, else 0] ≤ (1/m) Σi exp(−yi f(xi)) = Σi Dfinal(i) ∏t Zt = ∏t Zt
  34. 34. Proof (cont.) • Step 3: Zt = 2 √(εt(1 − εt))
  35. 35. Proof (cont.) • Step 3: Zt = 2 √(εt(1 − εt)) • Proof: Zt = Σi Dt(i) exp(−αt yi ht(xi)) = Σ_{i: yi ≠ ht(xi)} Dt(i) e^(αt) + Σ_{i: yi = ht(xi)} Dt(i) e^(−αt) = εt e^(αt) + (1 − εt) e^(−αt) = 2 √(εt(1 − εt))
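The slides state the value of αt without deriving it; the following short derivation (not part of the deck) shows that this αt is exactly the choice that minimizes Zt, which justifies the last equality above:

```latex
\begin{align*}
Z_t(\alpha) &= \epsilon_t e^{\alpha} + (1-\epsilon_t)e^{-\alpha},\\
\frac{dZ_t}{d\alpha} &= \epsilon_t e^{\alpha} - (1-\epsilon_t)e^{-\alpha} = 0
\;\Longrightarrow\; e^{2\alpha} = \frac{1-\epsilon_t}{\epsilon_t}
\;\Longrightarrow\; \alpha_t = \tfrac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t},\\
Z_t(\alpha_t) &= \epsilon_t\sqrt{\frac{1-\epsilon_t}{\epsilon_t}}
  + (1-\epsilon_t)\sqrt{\frac{\epsilon_t}{1-\epsilon_t}}
  = 2\sqrt{\epsilon_t(1-\epsilon_t)}.
\end{align*}
```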
  36. 36. How Will Test Error Behave? (A First Guess) [plot: hypothesized training and test error vs. # of rounds T; training error falls while test error eventually rises] expect: • training error to continue to drop (or reach zero) • test error to increase when Hfinal becomes “too complex” • “Occam’s razor” • overfitting • hard to know when to stop training
  37. 37. Actual Typical Run [plot: training and test error of boosting C4.5 on the “letter” dataset vs. # of rounds T] • test error does not increase, even after 1000 rounds • (total size 2,000,000 nodes) • test error continues to drop even after training error is zero! • # rounds: 5 / 100 / 1000; train error: 0.0 / 0.0 / 0.0; test error: 8.4 / 3.3 / 3.1 • Occam’s razor wrongly predicts “simpler” rule is better
  38. 38. A Better Story: The Margins Explanation [with Freund, Bartlett & Lee] • key idea: • training error only measures whether classifications are right or wrong • should also consider confidence of classifications
  39. 39. A Better Story: The Margins Explanation [with Freund, Bartlett & Lee] • key idea: • training error only measures whether classifications are right or wrong • should also consider confidence of classifications • recall: Hfinal is weighted majority vote of weak classifiers
  40. 40. A Better Story: The Margins Explanation [with Freund, Bartlett & Lee] • key idea: • training error only measures whether classifications are right or wrong • should also consider confidence of classifications • recall: Hfinal is weighted majority vote of weak classifiers • measure confidence by margin = strength of the vote = (fraction voting correctly) − (fraction voting incorrectly) [diagram: margin scale from −1 (high-confidence incorrect) through 0 (low confidence) to +1 (high-confidence correct)]
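A small sketch (not from the slides) of how these normalized margins could be computed from the weak classifiers’ votes; the function name and array layout are assumptions made for illustration:

```python
import numpy as np

def margins(alphas, weak_preds, y):
    """Normalized margins y_i * f(x_i) / sum_t |alpha_t|, one per training example.

    alphas:     length-T array of weak-classifier weights alpha_t
    weak_preds: (T, m) array with weak_preds[t, i] = h_t(x_i) in {-1, +1}
    y:          length-m array of labels in {-1, +1}
    """
    f = alphas @ weak_preds                   # unnormalized vote f(x_i) for each example
    return y * f / np.sum(np.abs(alphas))     # lies in [-1, +1]; positive iff correctly classified
```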
  41. 41. Empirical Evidence: The Margin Distribution • margin distribution = cumulative distribution of margins of training examples [plots: train/test error vs. # of rounds T, and cumulative margin distributions after 5, 100, and 1000 rounds] • # rounds: 5 / 100 / 1000; train error: 0.0 / 0.0 / 0.0; test error: 8.4 / 3.3 / 3.1; % margins ≤ 0.5: 7.7 / 0.0 / 0.0; minimum margin: 0.14 / 0.52 / 0.55
  42. 42. Theoretical Evidence: Analyzing Boosting Using Margins • Theorem: large margins ⇒ better bound on generalization error (independent of number of rounds)
  43. 43. Theoretical Evidence: Analyzing Boosting Using Margins • Theorem: large margins ⇒ better bound on generalization error (independent of number of rounds) • proof idea: if all margins are large, then can approximate final classifier by a much smaller classifier (just as polls can predict not-too-close election)
  44. 44. Theoretical Evidence: Analyzing Boosting Using Margins • Theorem: large margins ⇒ better bound on generalization error (independent of number of rounds) • proof idea: if all margins are large, then can approximate final classifier by a much smaller classifier (just as polls can predict not-too-close election) • Theorem: boosting tends to increase margins of training examples (given weak learning assumption)
  45. 45. Theoretical Evidence: Analyzing Boosting Using Margins • Theorem: large margins ⇒ better bound on generalization error (independent of number of rounds) • proof idea: if all margins are large, then can approximate final classifier by a much smaller classifier (just as polls can predict not-too-close election) • Theorem: boosting tends to increase margins of training examples (given weak learning assumption) • proof idea: similar to training error proof
  46. 46. Theoretical Evidence: Analyzing Boosting Using Margins • Theorem: large margins ⇒ better bound on generalization error (independent of number of rounds) • proof idea: if all margins are large, then can approximate final classifier by a much smaller classifier (just as polls can predict not-too-close election) • Theorem: boosting tends to increase margins of training examples (given weak learning assumption) • proof idea: similar to training error proof • so: although final classifier is getting larger, margins are likely to be increasing, so final classifier actually getting close to a simpler classifier, driving down the test error
  47. 47. More Technically... • with high probability, ∀θ > 0: generalization error ≤ P̂r[margin ≤ θ] + Õ(√(d/m) / θ) (P̂r[·] = empirical probability) • bound depends on • m = # training examples • d = “complexity” of weak classifiers • entire distribution of margins of training examples • P̂r[margin ≤ θ] → 0 exponentially fast (in T) if (error of ht on Dt) < 1/2 − θ (∀t) • so: if weak learning assumption holds, then all examples will quickly have “large” margins
  48. 48. Other Ways of Understanding AdaBoost • game theory • loss minimization • estimating conditional probabilities
  49. 49. Game Theory • game defined by matrix M: Rock Paper Scissors Rock 1/2 1 0 Paper 0 1/2 1 Scissors 1 0 1/2 • row player chooses row i • column player chooses column j (simultaneously) • row player’s goal: minimize loss M(i , j)
  50. 50. Game Theory • game defined by matrix M: Rock Paper Scissors Rock 1/2 1 0 Paper 0 1/2 1 Scissors 1 0 1/2 • row player chooses row i • column player chooses column j (simultaneously) • row player’s goal: minimize loss M(i, j) • usually allow randomized play: • players choose distributions P and Q over rows and columns • learner’s (expected) loss = Σi,j P(i) M(i, j) Q(j) = PᵀMQ ≡ M(P, Q)
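A brief check (not from the slides) of the expected-loss formula PᵀMQ on the rock-paper-scissors loss matrix above:

```python
import numpy as np

# Loss matrix from the slide, rows/columns ordered Rock, Paper, Scissors
# (entry = row player's loss: 1/2 for a tie, 1 for a loss, 0 for a win).
M = np.array([[0.5, 1.0, 0.0],
              [0.0, 0.5, 1.0],
              [1.0, 0.0, 0.5]])

P = np.array([1/3, 1/3, 1/3])  # row player's mixed strategy
Q = np.array([1/3, 1/3, 1/3])  # column player's mixed strategy

print(P @ M @ Q)  # 0.5 -- the value of this symmetric game; uniform play achieves it
```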
  51. 51. The Minmax Theorem • von Neumann’s minmax theorem: min_P max_Q M(P, Q) = max_Q min_P M(P, Q) = v = “value” of game M • in words: • v = min max means: row player has strategy P∗ such that ∀ column strategy Q, loss M(P∗, Q) ≤ v • v = max min means: this is optimal in the sense that column player has strategy Q∗ such that ∀ row strategy P, loss M(P, Q∗) ≥ v
  52. 52. The Boosting Game [slide contents (a figure/table) did not survive extraction]