AdaBoost and BrownBoost with respect to Noisy Data

Shadhin Rahman
Prof. Stephen Lucci

May 27, 2010
Abstract

Boosting is a learning technique which builds a strong hypothesis from weak
hypotheses. In this paper, we investigate two well-known boosting algorithms,
AdaBoost and BrownBoost, with respect to noisy data. We run both algorithms
on a non-noisy dataset, then introduce artificial noise into the dataset and
compare the variability of the results.
Introduction
    Many of us played soccer in our early childhood. Kicking the ball high
in the air accurately takes practice. Our coaches may have told us that, while
kicking the ball high into the air, we need to concentrate on the follow-through.
That is one aspect of kicking the ball accurately. However, there are other
variables we need to pay attention to while mastering an accurate kick. As we
practice, concentrating only on the follow-through, we quickly discover that
the angle of the foot and the amount of force applied are other techniques
that matter. We have all heard the expression "practice makes perfect".
    Boosting is the same concept. Boosting helps us obtain a strong learning
algorithm from a weak learning algorithm through repetition. In our soccer
scenario, the concept of the follow-through is our weak learner, and mastering
how to kick the ball accurately is our strong learner.
    While boosting enables us to learn a strong hypothesis from weak hypotheses,
overfitting may happen with noisy datasets. In this paper we investigate two
well-known boosting algorithms. We theoretically and experimentally show that
BrownBoost works better than AdaBoost with respect to noisy datasets.
Background
    PAC
    At the heart of boosting is the PAC learning model. PAC learning was
introduced by Leslie Valiant. Before we define PAC learnability, we need to
clarify a few concepts.

   • Let X be an instance space.

   • A concept c is a subset of the instance space X.

   • A collection C of concepts over X is a concept class.

   • The oracle EX(c, D) is a system that draws an example x according to the
     probability distribution D and returns it together with its correct
     label c(x).

    Now that we have our notation in place, we can define PAC learning.
Suppose an algorithm A, given access to the oracle EX(c, D) and inputs ε and
δ, outputs a hypothesis h ∈ C with error at most ε with probability at least
1 − δ. If such an algorithm works for every c ∈ C, every distribution D over
X, and all 0 < ε < 1/2 and 0 < δ < 1/2, then the concept class C is PAC
learnable. It is interesting to note that ε and δ must be known prior to
executing the algorithm. [1]
    WeakLearner


For any probability distribution D_t, the error rate of a hypothesis h_t under
that distribution is denoted by

    ε_t = Pr_{i∼D_t}[h_t(x_i) ≠ y_i] = Σ_{i : h_t(x_i) ≠ y_i} D_t(i)

A WeakLearner is an algorithm which does only a little better than random
guessing. For any concept, random guessing can be thought of as having a 1/2
chance of predicting it correctly. The amount by which the weak algorithm
exceeds random guessing is denoted by γ. Algorithm A is a weak PAC learner for
C with advantage γ if, for any concept c ∈ C, any distribution D, and any δ,
with probability 1 − δ, A outputs a hypothesis h such that

    Pr_{x∼D}[h(x) ≠ c(x)] ≤ 1/2 − γ [1]
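    To make the notion of a weak learner concrete, the sketch below implements
a decision stump (a one-level threshold classifier) in Python. It is only an
illustrative example of a weak learner, with names of our own choosing; it is
not a component of the algorithms or tools discussed in this paper.

    import numpy as np

    def train_stump(X, y, D):
        """Fit a decision stump minimizing the weighted error under the
        distribution D. Returns ((feature, threshold, polarity), error), where
        the stump predicts polarity * sign(x[feature] - threshold)."""
        m, n = X.shape
        best, best_err = (0, 0.0, 1), float("inf")
        for f in range(n):
            for thr in np.unique(X[:, f]):
                for polarity in (+1, -1):
                    pred = polarity * np.sign(X[:, f] - thr)
                    pred[pred == 0] = polarity        # break ties consistently
                    err = np.sum(D[pred != y])        # weighted error Σ_{i: h(x_i)≠y_i} D(i)
                    if err < best_err:
                        best_err, best = err, (f, thr, polarity)
        return best, best_err

    def stump_predict(stump, X):
        f, thr, polarity = stump
        pred = polarity * np.sign(X[:, f] - thr)
        pred[pred == 0] = polarity
        return pred

On most real datasets such a stump has error only slightly below 1/2, i.e. a
small positive advantage γ, which is exactly the kind of weak hypothesis the
boosting algorithms below assume.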
    AdaBoost
    Now we are ready to discuss the two main algorithms of our paper. As
mentioned earlier, boosting is a learning technique which iteratively learns
a strong learner from a weak learner. A booster B learns a concept c, given
access to the oracle EX(c, D) and a weak base learning algorithm A, by running
A multiple times. In the boosting scenario, the weak learning algorithm is
called on a training set and outputs a hypothesis. In the next round we run
algorithm A on the training set again, but this time we make sure that A
concentrates more on the examples it misclassified in the previous round.
When we go through this setup multiple times, we ultimately obtain a strong
hypothesis. We need to ask ourselves a few key questions. How are we going to
choose our distribution so that algorithm A concentrates more on misclassified
examples? And how are we going to combine all the hypotheses into a single
hypothesis? We answer these questions below with our mathematical
representation of the boosting steps.[1]
    Given: (x_1, y_1), (x_2, y_2), …, (x_m, y_m) where x_i ∈ X, y_i ∈ Y = {−1, +1}
    Initialize D_1(i) = 1/m
    For t = 1, …, T:

   • Train WeakLearner using distribution D_t

   • Get base classifier h_t : X → ℝ

   • Choose α_t ∈ ℝ

   • Update:

     D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t

     where Z_t is a normalizing factor.

    Output the final hypothesis:

    H(x) = sign(Σ_{t=1}^{T} α_t h_t(x)) [1]




There is a key practical limitation to this algorithm: we needed to know the
α_t in advance. The AdaBoost algorithm, introduced by Freund and Schapire,
addressed this difficulty. The algorithm is the same as discussed above,
except that here we calculate α_t from the training error. The AdaBoost steps
are described below.
    Given: (x_1, y_1), (x_2, y_2), …, (x_m, y_m) where x_i ∈ X, y_i ∈ Y = {−1, +1}
    Initialize D_1(i) = 1/m
    For t = 1, …, T:

   • Train WeakLearner using distribution D_t

   • Get weak hypothesis h_t : X → {−1, +1} with error
     ε_t = Pr_{i∼D_t}[h_t(x_i) ≠ y_i]

   • Choose α_t = (1/2) ln((1 − ε_t)/ε_t)

   • Update:

     D_{t+1}(i) = D_t(i)/Z_t × { e^{−α_t} if h_t(x_i) = y_i ; e^{α_t} if h_t(x_i) ≠ y_i }
                = D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t

     where Z_t is a normalizing factor.

    Output the final hypothesis:

    H_final(x) = sign(Σ_{t=1}^{T} α_t h_t(x))
    The key component of the above AdaBoost description is coming up with a
proper distribution on each trial. In each trial we assign more weight to the
misclassified examples and less weight to the correctly classified examples.
The hypothesis weight is α_t = (1/2) ln((1 − ε_t)/ε_t). In the final
hypothesis, we take the weighted majority vote of the T weak hypotheses.[3]
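    The following short Python sketch translates the AdaBoost pseudocode above
almost line by line. It reuses the hypothetical train_stump/stump_predict weak
learner from the earlier sketch and is meant only to illustrate the update
rule; it is not the implementation used in our experiments, which rely on
JBoost.

    import numpy as np

    def adaboost(X, y, T):
        """Plain AdaBoost for labels y in {-1,+1} with decision stumps as weak learners."""
        m = X.shape[0]
        D = np.full(m, 1.0 / m)                       # D_1(i) = 1/m
        hypotheses = []                               # list of (alpha_t, stump_t)
        for t in range(T):
            stump, eps = train_stump(X, y, D)         # weak hypothesis with error ε_t
            eps = np.clip(eps, 1e-10, 1 - 1e-10)      # guard against ε_t = 0 or 1
            alpha = 0.5 * np.log((1 - eps) / eps)     # α_t = (1/2) ln((1-ε_t)/ε_t)
            pred = stump_predict(stump, X)
            D = D * np.exp(-alpha * y * pred)         # up-weight misclassified examples
            D = D / D.sum()                           # normalize by Z_t
            hypotheses.append((alpha, stump))
        return hypotheses

    def adaboost_predict(hypotheses, X):
        """Weighted majority vote H(x) = sign(Σ_t α_t h_t(x))."""
        score = sum(alpha * stump_predict(stump, X) for alpha, stump in hypotheses)
        return np.sign(score)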
    We have described the steps of AdaBoost, but how can we be sure that
AdaBoost will ultimately come up with a strong hypothesis?
    In the AdaBoost algorithm, we assumed that each weak hypothesis does a
little better than random guessing, by an advantage γ_t. As long as every γ_t
is positive, the training error decreases exponentially with the number of
rounds. We make this precise in the theorem below.

Theorem 1   • Let ε_t = 1/2 − γ_t.

   • Then

     trainingerror(H_final) ≤ Π_t 2√(ε_t(1 − ε_t))
                            = Π_t √(1 − 4γ_t²)
                            ≤ exp(−2 Σ_t γ_t²)

   • So: if ∀t, γ_t > γ > 0, then

     trainingerror(H_final) < e^{−2γ²T}

[5]
    AdaBoost is adaptive, and a key advantage is that we do not need to know γ
or T in advance. As long as each γ_t is positive, the training error decreases
exponentially as a function of the number of training rounds.
    We now go through a simple proof to show that the above bound on the
training error holds.

Proof   • Let f(x) = Σ_t α_t h_t(x), so that H_final(x) = sign(f(x)).

   • Step 1:

     D_final(i) = (1/m) · exp(−y_i Σ_t α_t h_t(x_i)) / (Π_t Z_t)
                = (1/m) · exp(−y_i f(x_i)) / (Π_t Z_t)

   • Step 2: trainingerror(H_final) ≤ Π_t Z_t

   • Proof:

     trainingerror(H_final) = (1/m) Σ_i [1 if y_i ≠ H_final(x_i), 0 else]
                            = (1/m) Σ_i [1 if y_i f(x_i) ≤ 0, 0 else]
                            ≤ (1/m) Σ_i exp(−y_i f(x_i))
                            = Σ_i D_final(i) · Π_t Z_t
                            = Π_t Z_t

   • Step 3: Z_t = 2√(ε_t(1 − ε_t))

   • Proof:

     Z_t = Σ_i D_t(i) exp(−α_t y_i h_t(x_i))
         = Σ_{i: y_i ≠ h_t(x_i)} D_t(i) e^{α_t} + Σ_{i: y_i = h_t(x_i)} D_t(i) e^{−α_t}
         = ε_t e^{α_t} + (1 − ε_t) e^{−α_t}
         = 2√(ε_t(1 − ε_t))

[5]
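    As a quick numerical sanity check of the bound just proved, the short
script below (our own illustration, with an arbitrary constant advantage
γ_t = 0.1) compares the product Π_t √(1 − 4γ_t²) against the exponential bound
e^{−2γ²T} for increasing T; both decay exponentially and the product never
exceeds the exponential bound.

    import math

    gamma = 0.1                                   # assumed constant advantage γ_t = γ
    for T in (10, 50, 100, 200):
        product_bound = math.prod(math.sqrt(1 - 4 * gamma ** 2) for _ in range(T))
        exp_bound = math.exp(-2 * gamma ** 2 * T)
        print(T, product_bound, exp_bound, product_bound <= exp_bound)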
    We have just derived the training error bound of AdaBoost. We briefly
discuss the generalization error of the AdaBoost algorithm. Freund and
Schapire initially bounded the generalization error of AdaBoost in terms of
the sample size m, the VC dimension d of the weak hypothesis space, and the
number of iterations T, as follows:

    Pr_train[H(x) ≠ y] + Õ(√(Td/m))

    This bound suggests that overfitting can happen with a large number of
iterations. However, empirical findings showed that the generalization error
keeps going down even after the training error reaches zero. Based on this
finding, Schapire et al. gave a bound on the generalization error in terms of
the margins of the training examples. The margin of an example (x, y) is
defined to be

    y Σ_t α_t h_t(x) / Σ_t α_t

    The margin is a number in [−1, +1] which is positive only if the combined
hypothesis correctly classifies the example. The generalization error bound
given by Schapire et al. is

    Pr_train[margin(x, y) ≤ θ] + Õ(√(d/(mθ²)))

    This bound is independent of the number of iterations T. In this view, the
margins continue to increase even after the training error reaches zero.
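    The normalized margin above is straightforward to compute once the
(α_t, h_t) pairs are available; a minimal sketch, reusing the hypothetical
adaboost output and stump_predict helper from the earlier examples:

    import numpy as np

    def margins(hypotheses, X, y):
        """Normalized margin y · Σ_t α_t h_t(x) / Σ_t α_t for every example."""
        alphas = np.array([alpha for alpha, _ in hypotheses])
        score = sum(alpha * stump_predict(stump, X) for alpha, stump in hypotheses)
        return y * score / alphas.sum()           # in [-1, +1]; positive iff correct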
    Now that we have established the error bounds of AdaBoost, we come to the
main topic of our paper. AdaBoost's selection strategy, and the way it
combines all hypotheses into a single hypothesis, make it a poor choice when
dealing with noisy datasets. Several empirical studies and theoretical works
have shown that AdaBoost's ability to generalize decreases as noise in the
dataset increases.[2]
    BrownBoost
    The BrownBoost algorithm was introduced by Freund as an enhancement of his
earlier Boosting by Majority algorithm. BrownBoost works similarly to
AdaBoost; however, there is a core difference between the two algorithms.
BrownBoost relies on the core assumption that examples which are repeatedly
misclassified are noisy. Thus, BrownBoost gives up on noisy examples, and only
the non-noisy part of the dataset contributes to the final hypothesis.[2]

    The BrownBoost derivation starts by fixing δ to some small value, small
enough that most hypotheses can achieve error 1/2 − δ. Given a hypothesis h
with error 1/2 − γ, γ > δ, Freund introduced a hypothesis h́ with the following
properties:

    h́(x) = h(x) with probability δ/γ,
           0    with probability (1 − δ/γ)/2,
           1    with probability (1 − δ/γ)/2.

    Since δ is very small and the error is 1/2 − δ, we can use the same
hypothesis over and over, instead of calling the weak learner on each
iteration, until its error becomes larger than 1/2 − δ. Unlike AdaBoost,
instead of choosing a weight proportional to the error, here we choose the
weight from the last weak hypothesis under the newly altered distribution.
This process works because of the well-known notion of "Brownian motion with
drift", which is beyond the scope of this paper. Hence the name BrownBoost.[6]
    BrownBoost uses c as a time parameter specifying how long the algorithm is
set to run. BrownBoost assumes that each hypothesis takes a variable amount of
time t, which is directly related to the weight α given to the hypothesis. The
time parameter in BrownBoost is analogous to the number of iterations T in
AdaBoost. [1]
    A larger value of the parameter c tells BrownBoost that the dataset we are
dealing with is less noisy, and a smaller value tells BrownBoost that we are
dealing with a noisy dataset.
    During each iteration, a hypothesis is selected with some advantage over
random guessing, just as in AdaBoost. The weight α of the hypothesis and the
amount of time that has passed so far are given to the algorithm. The
algorithm runs until there is no time left. The final combined hypothesis is
the weighted majority vote of all the hypotheses. The key point to note here
is the time parameter and the determination of how much time is left at each
iteration. If there is no time left, then, unlike AdaBoost, BrownBoost gives
up on that particular example. The BrownBoost steps are described below.
   • Input: (x_1, y_1), (x_2, y_2), …, (x_m, y_m) where x_i ∈ X, y_i ∈ Y = {−1, +1}

   • The time parameter c

   • Initialize: s = c. The value of s is the time left in the game.

   • r_i(x_j) = 0 for all j. The value of r_i(x_j) is the margin at iteration i
     for example j.

    While s > 0:

   • W_i(x_j) = e^{−(r_i(x_j) + s)² / c}

   • Find a classifier h_i : X → {−1, +1} such that Σ_j W_i(x_j) h_i(x_j) y_j > 0

   • Find values α, t that satisfy the equation

     Σ_j h_i(x_j) y_j e^{−(r_i(x_j) + α h_i(x_j) y_j + s − t)² / c} = 0

   • Update the margins: r_{i+1}(x_j) = r_i(x_j) + α h_i(x_j) y_j

   • Update the time: s = s − t

   • Output H(x) = sign(Σ_i α_i h_i(x)). [1]
    The key thing to note here is that, for each example and each class, the
algorithm maintains a margin. The margins are initially set to 0 and are
updated at each iteration i. The hypothesis weights α_i are related to the
margins. Also, the algorithm only runs while there is time s left. Finally, we
point out that the final training error is ε = 1 − erf(√c), where erf is the
error function. [4]
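    The delicate part of BrownBoost is the coupled solve for α and t. The
sketch below is only a rough skeleton under our own simplifying assumptions:
it advances time by a fixed small step and finds α by bisection, whereas the
real algorithm (and JBoost's implementation) solves for α and t together. It
again uses the hypothetical decision-stump weak learner from the earlier
sketches and is meant to illustrate the weighting function W_i and the
bookkeeping of margins and remaining time, nothing more.

    import numpy as np

    def brownboost_sketch(X, y, c, dt=0.01, max_rounds=1000):
        """Simplified BrownBoost-style loop (not Freund's exact algorithm)."""
        m = X.shape[0]
        r = np.zeros(m)                  # margins r_i(x_j), initially 0
        s = float(c)                     # time remaining in the "game"
        hypotheses = []
        for _ in range(max_rounds):
            if s <= 0:
                break
            W = np.exp(-((r + s) ** 2) / c)          # W_i(x_j) = e^{-(r_i(x_j)+s)^2 / c}
            stump, _ = train_stump(X, y, W / W.sum())
            h = stump_predict(stump, X)
            if np.sum(W * h * y) <= 0:               # need an edge over random guessing
                break
            t = min(dt, s)                           # simplification: fixed time advance

            def residual(alpha):
                return np.sum(h * y * np.exp(-((r + alpha * h * y + s - t) ** 2) / c))

            lo, hi = 0.0, 1.0                        # bisection for residual(alpha) = 0
            while residual(hi) > 0 and hi < 100:
                hi *= 2
            for _ in range(60):
                mid = 0.5 * (lo + hi)
                lo, hi = (mid, hi) if residual(mid) > 0 else (lo, mid)
            alpha = 0.5 * (lo + hi)

            r = r + alpha * h * y                    # update margins
            s = s - t                                # update the time left
            hypotheses.append((alpha, stump))
        return hypotheses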
Experimental Data
    For our experiments we used the JBoost package. The current version of
JBoost is 2.0; however, we used JBoost 1.4 because the most recent version
does not support BrownBoost. JBoost comes with a few visualization tools,
which are extremely useful. The dataset is described below.
    Our dataset came from the UCI machine learning repository. The Blood
Transfusion dataset was collected from a blood transfusion service in Taiwan.
The dataset consists of 748 donors selected at random from the donor database.
Each donor record includes R (Recency: months since last donation), F
(Frequency: total number of donations), M (Monetary: total blood donated in
c.c.), T (Time: months since first donation), and a binary variable indicating
whether the donor gave blood in March 2007 (1 stands for donating blood; 0
stands for not donating blood).


We ran both algorithms, AdaBoost and BrownBoost, on our dataset. We also ran
them with 5 percent artificially introduced noise, created by randomly
flipping labels within the dataset (a sketch of this procedure appears after
the table). We ran the AdaBoost algorithm for 3000 iterations and the
BrownBoost algorithm with the parameter c set to 4 minutes. The resulting
error rates are listed below in tabular format.
                                     Result Set
    dataset        error type       AdaBoost  AdaBoost(noisy)  BrownBoost  BrownBoost(noisy)
    Transfusion    training error   0.1483    0.1383           0.2024      0.2004
    Transfusion    testing error    0.2177    0.2540           0.2258      0.2278
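    For reference, label noise of the kind described above can be injected
with a few lines of Python; the 5 percent rate and the random seed below are
illustrative.

    import numpy as np

    def flip_labels(y, noise_rate=0.05, seed=0):
        """Return a copy of the {-1,+1} label vector y with a random fraction flipped."""
        rng = np.random.default_rng(seed)
        y_noisy = np.array(y).copy()
        n_flip = int(round(noise_rate * len(y_noisy)))
        idx = rng.choice(len(y_noisy), size=n_flip, replace=False)
        y_noisy[idx] = -y_noisy[idx]                  # flip the chosen labels
        return y_noisy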
Discussion of Results and Conclusions
    In any supervised learning setting, the biggest challenge is to come up
with a dataset. Labeled data is extremely expensive and hard to come by. In
our experiments we see that the variability of the testing error from
noiseless data to noisy data is higher for AdaBoost than for BrownBoost.
However, this is not very conclusive from the table. One reason is that the
dataset we worked with was very small. Also, we introduced only 5 percent
noise into the dataset; creating noisy data manually is a time-consuming task.
Still, the experiments suggest that BrownBoost is better able to cope with
noisy datasets.
    BrownBoost has a very bright future in the machine learning arena. We live
in a world full of information, where analyzing data and deriving conclusions
from data has become the norm in many fields. We can see many applications of
the BrownBoost algorithm in real-world scenarios. Spammers try to fool spam
detection programs by injecting non-spam words into their emails; BrownBoost
can play a key role by treating these emails as noisy data and still correctly
detecting spam. A serial killer who changes his killing pattern to fool the
authorities might still be predicted with the BrownBoost algorithm. We intend
to do more research on this topic to find out how to accurately choose the c
parameter of BrownBoost.
    Acknowledgments We would like to thank the JBoost community for providing
a rich set of tools for our experiment. We would also like to thank the UCI
machine learning repository for providing the data for our project.




Bibliography

[1] http://en.wikipedia.org

[2] “Mathematical Analysis of Evolution, Information, and Complexity”,
    Wolfgang Arendt and Wolfgang P. Schleich

[3] “A Short Introduction to Boosting”, Yoav Freund and Robert E. Schapire

[4] “An Empirical Comparison of Three Boosting Algorithms on Real Data Sets
    with Artificial Class Noise”, Ross A. McDonald, David J. Hand, and
    Idris A. Eckley

[5] http://videolectures.net/mlss05us schapire b

[6] “An Adaptive Version of the Boost by Majority Algorithm”, Yoav Freund



