AdaBoost and BrownBoost with respect to Noisy Data

Shadhin Rahman
Prof. Stephen Lucci

May 27, 2010
Abstract

Boosting is a learning technique which builds a strong hypothesis from weak
hypotheses. In this paper, we investigate two well-known boosting algorithms,
AdaBoost and BrownBoost, with respect to noisy data. We run both algorithms
on a non-noisy dataset, then introduce artificial noise into the dataset and
compare the variability of the results.
Introduction
    Many of us played soccer in our early childhood. Kicking the ball high
in the air accurately takes practice. Our coaches may have told us that, while
kicking the ball high into the air, we need to concentrate on the follow-through.
That is one aspect of kicking the ball accurately. However, there are other
variables we need to pay attention to while mastering an accurate kick. As we
practice, concentrating only on the follow-through, we quickly discover that
the angle of the foot and the amount of force applied are other techniques
that matter. We have all heard the expression "practice makes perfect".
    Boosting is the same concept. Boosting helps us obtain a strong learning
algorithm from a weak learning algorithm through repetition. In our soccer
scenario, the concept of the follow-through is our weak learner, and mastering
how to kick the ball accurately is our strong learner.
    While boosting enables us to learn a strong hypothesis from weak hypotheses,
overfitting may happen with noisy datasets. In this paper we investigate two
well-known boosting algorithms. We theoretically and experimentally show that
BrownBoost works better than AdaBoost with respect to noisy datasets.
Background
    PAC
    At the heart of boosting is the PAC learning model. PAC learning was
introduced by Leslie Valiant. Before we define PAC learnability, we need to
clarify a few concepts.

   • Let X be an instance space.

   • A concept c is a subset of the instance space X.

   • A collection C of concepts over X is a concept class.

   • The oracle EX(c, D) is a system that draws an example x according to the
     probability distribution D and returns it together with its correct
     label c(x).

    Now that we have our notation in place, we can define PAC learning.
Suppose an algorithm A, given access to the oracle EX(c, D) and inputs ε and
δ, outputs a hypothesis h ∈ C with error at most ε with probability at least
1 − δ. If such an algorithm works for every c ∈ C, every distribution D over
X, and all 0 < ε < 1/2 and 0 < δ < 1/2, then the concept class C is PAC
learnable. It is interesting to note that ε and δ must be known prior to
executing the algorithm. [1]
    WeakLearner


For any probability distribution D_t, the error rate of a hypothesis h_t under
that distribution is denoted by

    ε_t = Pr_{i∼D_t}[h_t(x_i) ≠ y_i] = Σ_{i : h_t(x_i) ≠ y_i} D_t(i)

A WeakLearner is an algorithm which does only a little better than random
guessing. For any concept, random guessing can be thought of as having a 1/2
chance of predicting it correctly. The amount by which the weak algorithm
exceeds random guessing is denoted by γ. Algorithm A is a weak PAC learner for
C with advantage γ if, for any concept c ∈ C, any distribution D, and any δ,
with probability 1 − δ, A outputs a hypothesis h such that

    Pr_{x∼D}[h(x) ≠ c(x)] ≤ 1/2 − γ [1]
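    To make the notion of a weak learner concrete, the sketch below implements
a decision stump (a one-level threshold classifier) in Python. It is only an
illustrative example of a weak learner, with names of our own choosing; it is
not a component of the algorithms or tools discussed in this paper.

    import numpy as np

    def train_stump(X, y, D):
        """Fit a decision stump minimizing the weighted error under the
        distribution D. Returns ((feature, threshold, polarity), error), where
        the stump predicts polarity * sign(x[feature] - threshold)."""
        m, n = X.shape
        best, best_err = (0, 0.0, 1), float("inf")
        for f in range(n):
            for thr in np.unique(X[:, f]):
                for polarity in (+1, -1):
                    pred = polarity * np.sign(X[:, f] - thr)
                    pred[pred == 0] = polarity        # break ties consistently
                    err = np.sum(D[pred != y])        # weighted error Σ_{i: h(x_i)≠y_i} D(i)
                    if err < best_err:
                        best_err, best = err, (f, thr, polarity)
        return best, best_err

    def stump_predict(stump, X):
        f, thr, polarity = stump
        pred = polarity * np.sign(X[:, f] - thr)
        pred[pred == 0] = polarity
        return pred

On most real datasets such a stump has error only slightly below 1/2, i.e. a
small positive advantage γ, which is exactly the kind of weak hypothesis the
boosting algorithms below assume.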
    AdaBoost
    Now we are ready to discuss the two main algorithms of our paper. As
mentioned earlier, boosting is a learning technique which iteratively learns
a strong learner from a weak learner. A booster B learns a concept c, given
access to the oracle EX(c, D) and a weak base learning algorithm A, by running
A multiple times. In the boosting scenario, the weak learning algorithm is
called on a training set and outputs a hypothesis. In the next round we run
algorithm A on the training set again, but this time we make sure that A
concentrates more on the examples it misclassified in the previous round.
When we go through this setup multiple times, we ultimately obtain a strong
hypothesis. We need to ask ourselves a few key questions. How are we going to
choose our distribution so that algorithm A concentrates more on misclassified
examples? And how are we going to combine all the hypotheses into a single
hypothesis? We answer these questions below with our mathematical
representation of the boosting steps.[1]
    Given: (x_1, y_1), (x_2, y_2), …, (x_m, y_m) where x_i ∈ X, y_i ∈ Y = {−1, +1}
    Initialize D_1(i) = 1/m
    For t = 1, …, T:

   • Train WeakLearner using distribution D_t

   • Get base classifier h_t : X → ℝ

   • Choose α_t ∈ ℝ

   • Update:

     D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t

     where Z_t is a normalizing factor.

    Output the final hypothesis:

    H(x) = sign(Σ_{t=1}^{T} α_t h_t(x)) [1]




There is a key practical limitation to this algorithm: we needed to know the
α_t in advance. The AdaBoost algorithm, introduced by Freund and Schapire,
addressed this difficulty. The algorithm is the same as discussed above,
except that here we calculate α_t from the training error. The AdaBoost steps
are described below.
    Given: (x_1, y_1), (x_2, y_2), …, (x_m, y_m) where x_i ∈ X, y_i ∈ Y = {−1, +1}
    Initialize D_1(i) = 1/m
    For t = 1, …, T:

   • Train WeakLearner using distribution D_t

   • Get weak hypothesis h_t : X → {−1, +1} with error
     ε_t = Pr_{i∼D_t}[h_t(x_i) ≠ y_i]

   • Choose α_t = (1/2) ln((1 − ε_t)/ε_t)

   • Update:

     D_{t+1}(i) = D_t(i)/Z_t × { e^{−α_t} if h_t(x_i) = y_i ; e^{α_t} if h_t(x_i) ≠ y_i }
                = D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t

     where Z_t is a normalizing factor.

    Output the final hypothesis:

    H_final(x) = sign(Σ_{t=1}^{T} α_t h_t(x))
    The key component of the above AdaBoost description is coming up with a
proper distribution on each trial. In each trial we assign more weight to the
misclassified examples and less weight to the correctly classified examples.
The hypothesis weight is α_t = (1/2) ln((1 − ε_t)/ε_t). In the final
hypothesis, we take the weighted majority vote of the T weak hypotheses.[3]
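    The following short Python sketch translates the AdaBoost pseudocode above
almost line by line. It reuses the hypothetical train_stump/stump_predict weak
learner from the earlier sketch and is meant only to illustrate the update
rule; it is not the implementation used in our experiments, which rely on
JBoost.

    import numpy as np

    def adaboost(X, y, T):
        """Plain AdaBoost for labels y in {-1,+1} with decision stumps as weak learners."""
        m = X.shape[0]
        D = np.full(m, 1.0 / m)                       # D_1(i) = 1/m
        hypotheses = []                               # list of (alpha_t, stump_t)
        for t in range(T):
            stump, eps = train_stump(X, y, D)         # weak hypothesis with error ε_t
            eps = np.clip(eps, 1e-10, 1 - 1e-10)      # guard against ε_t = 0 or 1
            alpha = 0.5 * np.log((1 - eps) / eps)     # α_t = (1/2) ln((1-ε_t)/ε_t)
            pred = stump_predict(stump, X)
            D = D * np.exp(-alpha * y * pred)         # up-weight misclassified examples
            D = D / D.sum()                           # normalize by Z_t
            hypotheses.append((alpha, stump))
        return hypotheses

    def adaboost_predict(hypotheses, X):
        """Weighted majority vote H(x) = sign(Σ_t α_t h_t(x))."""
        score = sum(alpha * stump_predict(stump, X) for alpha, stump in hypotheses)
        return np.sign(score)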
    We have described the steps of AdaBoost, but how can we be sure that
AdaBoost will ultimately come up with a strong hypothesis?
    In the AdaBoost algorithm, we assumed that each weak hypothesis does a
little better than random guessing, by an advantage γ_t. As long as every γ_t
is positive, the training error decreases exponentially with the number of
rounds. We make this precise in the theorem below.

Theorem 1   • Let ε_t = 1/2 − γ_t.

   • Then

     trainingerror(H_final) ≤ Π_t 2√(ε_t(1 − ε_t))
                            = Π_t √(1 − 4γ_t²)
                            ≤ exp(−2 Σ_t γ_t²)

   • So: if ∀t, γ_t > γ > 0, then

     trainingerror(H_final) < e^{−2γ²T}

[5]
    AdaBoost is adaptive, and a key advantage is that we do not need to know γ
or T in advance. As long as each γ_t is positive, the training error decreases
exponentially as a function of the number of training rounds.
    We now go through a simple proof to show that the above bound on the
training error holds.

Proof   • Let f(x) = Σ_t α_t h_t(x), so that H_final(x) = sign(f(x)).

   • Step 1:

     D_final(i) = (1/m) · exp(−y_i Σ_t α_t h_t(x_i)) / (Π_t Z_t)
                = (1/m) · exp(−y_i f(x_i)) / (Π_t Z_t)

   • Step 2: trainingerror(H_final) ≤ Π_t Z_t

   • Proof:

     trainingerror(H_final) = (1/m) Σ_i [1 if y_i ≠ H_final(x_i), 0 else]
                            = (1/m) Σ_i [1 if y_i f(x_i) ≤ 0, 0 else]
                            ≤ (1/m) Σ_i exp(−y_i f(x_i))
                            = Σ_i D_final(i) · Π_t Z_t
                            = Π_t Z_t

   • Step 3: Z_t = 2√(ε_t(1 − ε_t))

   • Proof:

     Z_t = Σ_i D_t(i) exp(−α_t y_i h_t(x_i))
         = Σ_{i: y_i ≠ h_t(x_i)} D_t(i) e^{α_t} + Σ_{i: y_i = h_t(x_i)} D_t(i) e^{−α_t}
         = ε_t e^{α_t} + (1 − ε_t) e^{−α_t}
         = 2√(ε_t(1 − ε_t))

[5]
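    As a quick numerical sanity check of the bound just proved, the short
script below (our own illustration, with an arbitrary constant advantage
γ_t = 0.1) compares the product Π_t √(1 − 4γ_t²) against the exponential bound
e^{−2γ²T} for increasing T; both decay exponentially and the product never
exceeds the exponential bound.

    import math

    gamma = 0.1                                   # assumed constant advantage γ_t = γ
    for T in (10, 50, 100, 200):
        product_bound = math.prod(math.sqrt(1 - 4 * gamma ** 2) for _ in range(T))
        exp_bound = math.exp(-2 * gamma ** 2 * T)
        print(T, product_bound, exp_bound, product_bound <= exp_bound)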
    We have just derived the training error bound of AdaBoost. We briefly
discuss the generalization error of the AdaBoost algorithm. Freund and
Schapire initially bounded the generalization error of AdaBoost in terms of
the sample size m, the VC dimension d of the weak hypothesis space, and the
number of iterations T, as follows:

    Pr_train[H(x) ≠ y] + Õ(√(Td/m))

    This bound suggests that overfitting can happen with a large number of
iterations. However, empirical findings showed that the generalization error
keeps going down even after the training error reaches zero. Based on this
finding, Schapire et al. gave a bound on the generalization error in terms of
the margins of the training examples. The margin of an example (x, y) is
defined to be

    y Σ_t α_t h_t(x) / Σ_t α_t

    The margin is a number in [−1, +1] which is positive only if the combined
hypothesis correctly classifies the example. The generalization error bound
given by Schapire et al. is

    Pr_train[margin(x, y) ≤ θ] + Õ(√(d/(mθ²)))

    This bound is independent of the number of iterations T. In this view, the
margins continue to increase even after the training error reaches zero.
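    The normalized margin above is straightforward to compute once the
(α_t, h_t) pairs are available; a minimal sketch, reusing the hypothetical
adaboost output and stump_predict helper from the earlier examples:

    import numpy as np

    def margins(hypotheses, X, y):
        """Normalized margin y · Σ_t α_t h_t(x) / Σ_t α_t for every example."""
        alphas = np.array([alpha for alpha, _ in hypotheses])
        score = sum(alpha * stump_predict(stump, X) for alpha, stump in hypotheses)
        return y * score / alphas.sum()           # in [-1, +1]; positive iff correct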
    Now that we have established the error bounds of AdaBoost, we come to the
main topic of our paper. AdaBoost's selection strategy, and the way it
combines all hypotheses into a single hypothesis, make it a poor choice when
dealing with noisy datasets. Several empirical studies and theoretical works
have shown that AdaBoost's ability to generalize decreases as noise in the
dataset increases.[2]
    BrownBoost
    The BrownBoost algorithm was introduced by Freund as an enhancement of his
earlier Boosting by Majority algorithm. BrownBoost works similarly to
AdaBoost; however, there is a core difference between the two algorithms.
BrownBoost relies on the core assumption that examples which are repeatedly
misclassified are noisy. Thus, BrownBoost gives up on noisy examples, and only
the non-noisy part of the dataset contributes to the final hypothesis.[2]

    The BrownBoost derivation starts by fixing δ to some small value, small
enough that most hypotheses can achieve error 1/2 − δ. Given a hypothesis h
with error 1/2 − γ, γ > δ, Freund introduced a hypothesis h́ with the following
properties:

    h́(x) = h(x) with probability δ/γ,
           0    with probability (1 − δ/γ)/2,
           1    with probability (1 − δ/γ)/2.

    Since δ is very small and the error is 1/2 − δ, we can use the same
hypothesis over and over, instead of calling the weak learner on each
iteration, until its error becomes larger than 1/2 − δ. Unlike AdaBoost,
instead of choosing a weight proportional to the error, here we choose the
weight from the last weak hypothesis under the newly altered distribution.
This process works because of the well-known notion of "Brownian motion with
drift", which is beyond the scope of this paper. Hence the name BrownBoost.[6]
    BrownBoost uses c as a time parameter specifying how long the algorithm is
set to run. BrownBoost assumes that each hypothesis takes a variable amount of
time t, which is directly related to the weight α given to the hypothesis. The
time parameter in BrownBoost is analogous to the number of iterations T in
AdaBoost. [1]
    A larger value of the parameter c tells BrownBoost that the dataset we are
dealing with is less noisy, and a smaller value tells BrownBoost that we are
dealing with a noisy dataset.
    During each iteration, a hypothesis is selected with some advantage over
random guessing, just as in AdaBoost. The weight α of the hypothesis and the
amount of time that has passed so far are given to the algorithm. The
algorithm runs until there is no time left. The final combined hypothesis is
the weighted majority vote of all the hypotheses. The key point to note here
is the time parameter and the determination of how much time is left at each
iteration. If there is no time left, then, unlike AdaBoost, BrownBoost gives
up on that particular example. The BrownBoost steps are described below.
   • Input: (x_1, y_1), (x_2, y_2), …, (x_m, y_m) where x_i ∈ X, y_i ∈ Y = {−1, +1}

   • The time parameter c

   • Initialize: s = c. The value of s is the time left in the game.

   • r_i(x_j) = 0 for all j. The value of r_i(x_j) is the margin at iteration i
     for example j.

    While s > 0:

   • W_i(x_j) = e^{−(r_i(x_j) + s)² / c}

   • Find a classifier h_i : X → {−1, +1} such that Σ_j W_i(x_j) h_i(x_j) y_j > 0

   • Find values α, t that satisfy the equation

     Σ_j h_i(x_j) y_j e^{−(r_i(x_j) + α h_i(x_j) y_j + s − t)² / c} = 0

   • Update the margins: r_{i+1}(x_j) = r_i(x_j) + α h_i(x_j) y_j

   • Update the time: s = s − t

   • Output H(x) = sign(Σ_i α_i h_i(x)). [1]
    The key thing to note here is that, for each example and each class, the
algorithm maintains a margin. The margins are initially set to 0 and are
updated at each iteration i. The hypothesis weights α_i are related to the
margins. Also, the algorithm only runs while there is time s left. Finally, we
point out that the final training error is ε = 1 − erf(√c), where erf is the
error function. [4]
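    The delicate part of BrownBoost is the coupled solve for α and t. The
sketch below is only a rough skeleton under our own simplifying assumptions:
it advances time by a fixed small step and finds α by bisection, whereas the
real algorithm (and JBoost's implementation) solves for α and t together. It
again uses the hypothetical decision-stump weak learner from the earlier
sketches and is meant to illustrate the weighting function W_i and the
bookkeeping of margins and remaining time, nothing more.

    import numpy as np

    def brownboost_sketch(X, y, c, dt=0.01, max_rounds=1000):
        """Simplified BrownBoost-style loop (not Freund's exact algorithm)."""
        m = X.shape[0]
        r = np.zeros(m)                  # margins r_i(x_j), initially 0
        s = float(c)                     # time remaining in the "game"
        hypotheses = []
        for _ in range(max_rounds):
            if s <= 0:
                break
            W = np.exp(-((r + s) ** 2) / c)          # W_i(x_j) = e^{-(r_i(x_j)+s)^2 / c}
            stump, _ = train_stump(X, y, W / W.sum())
            h = stump_predict(stump, X)
            if np.sum(W * h * y) <= 0:               # need an edge over random guessing
                break
            t = min(dt, s)                           # simplification: fixed time advance

            def residual(alpha):
                return np.sum(h * y * np.exp(-((r + alpha * h * y + s - t) ** 2) / c))

            lo, hi = 0.0, 1.0                        # bisection for residual(alpha) = 0
            while residual(hi) > 0 and hi < 100:
                hi *= 2
            for _ in range(60):
                mid = 0.5 * (lo + hi)
                lo, hi = (mid, hi) if residual(mid) > 0 else (lo, mid)
            alpha = 0.5 * (lo + hi)

            r = r + alpha * h * y                    # update margins
            s = s - t                                # update the time left
            hypotheses.append((alpha, stump))
        return hypotheses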
Experimental Data
    For our experiments we used the JBoost package. The current version of
JBoost is 2.0; however, we used JBoost 1.4 because the most recent version
does not support BrownBoost. JBoost comes with a few visualization tools,
which are extremely useful. The dataset is described below.
    Our dataset came from the UCI machine learning repository. The Blood
Transfusion dataset was collected from a blood transfusion service in Taiwan.
The dataset consists of 748 donors selected at random from the donor database.
Each donor record includes R (Recency: months since last donation), F
(Frequency: total number of donations), M (Monetary: total blood donated in
c.c.), T (Time: months since first donation), and a binary variable indicating
whether the donor gave blood in March 2007 (1 stands for donating blood; 0
stands for not donating blood).


We ran both algorithms, AdaBoost and BrownBoost, on our dataset. We also ran
them with 5 percent artificially introduced noise, created by randomly
flipping labels within the dataset (a sketch of this procedure appears after
the table). We ran the AdaBoost algorithm for 3000 iterations and the
BrownBoost algorithm with the parameter c set to 4 minutes. The resulting
error rates are listed below in tabular format.
                                     Result Set
    dataset        error type       AdaBoost  AdaBoost(noisy)  BrownBoost  BrownBoost(noisy)
    Transfusion    training error   0.1483    0.1383           0.2024      0.2004
    Transfusion    testing error    0.2177    0.2540           0.2258      0.2278
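    For reference, label noise of the kind described above can be injected
with a few lines of Python; the 5 percent rate and the random seed below are
illustrative.

    import numpy as np

    def flip_labels(y, noise_rate=0.05, seed=0):
        """Return a copy of the {-1,+1} label vector y with a random fraction flipped."""
        rng = np.random.default_rng(seed)
        y_noisy = np.array(y).copy()
        n_flip = int(round(noise_rate * len(y_noisy)))
        idx = rng.choice(len(y_noisy), size=n_flip, replace=False)
        y_noisy[idx] = -y_noisy[idx]                  # flip the chosen labels
        return y_noisy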
Discussion of Results and Conclusions
    In any supervised learning setting, the biggest challenge is to come up
with a dataset. Labeled data is extremely expensive and hard to come by. In
our experiments we see that the variability of the testing error from
noiseless data to noisy data is higher for AdaBoost than for BrownBoost.
However, this is not very conclusive from the table. One reason is that the
dataset we worked with was very small. Also, we introduced only 5 percent
noise into the dataset; creating noisy data manually is a time-consuming task.
Still, the experiments suggest that BrownBoost is better able to cope with
noisy datasets.
    BrownBoost has a very bright future in the machine learning arena. We live
in a world full of information, where analyzing data and deriving conclusions
from data has become the norm in many fields. We can see many applications of
the BrownBoost algorithm in real-world scenarios. Spammers try to fool spam
detection programs by injecting non-spam words into their emails; BrownBoost
can play a key role by treating these emails as noisy data and still correctly
detecting spam. A serial killer who changes his killing pattern to fool the
authorities might still be predicted with the BrownBoost algorithm. We intend
to do more research on this topic to find out how to accurately choose the c
parameter of BrownBoost.
    Acknowledgments We would like to thank the JBoost community for providing
a rich set of tools for our experiment. We would also like to thank the UCI
machine learning repository for providing the data for our project.




Bibliography

[1] http://en.wikipedia.org

[2] “Mathematical Analysis of Evolution, Information, and Complexity”,
    Wolfgang Arendt and Wolfgang P. Schleich

[3] “A Short Introduction to Boosting”, Yoav Freund and Robert E. Schapire

[4] “An Empirical Comparison of Three Boosting Algorithms on Real Data Sets
    with Artificial Class Noise”, Ross A. McDonald, David J. Hand, and
    Idris A. Eckley

[5] http://videolectures.net/mlss05us schapire b

[6] “An Adaptive Version of the Boost by Majority Algorithm”, Yoav Freund



