GBM & Random Forest in H2O

Mark Landry
Presentation Outline

•  Algorithm Background
  o  Decision Trees
  o  Random Forest
  o  Gradient Boosted Machines (GBM)
•  H2O Implementations
  o  Code examples
  o  Description of parameters and general usage
Decision Trees: Concept

•  Separate the data according to a series of questions
  o  Age > 9.5?
•  The questions are found automatically to optimize separation of the data points by the "target"

Example decision tree: predicting survival of Titanic passengers
(Source: Wikimedia, CART tree of Titanic survivors)
Decision Trees: Practical Use

Strengths
•  Non-linear
•  Robust to correlated features
•  Robust to feature distributions
•  Robust to missing values
•  Simple to comprehend
•  Fast to train
•  Fast to score

Weaknesses
•  Poor accuracy
•  Cannot project (extrapolate beyond the training data)
•  Inefficiently fits linear relationships
Improved Decision Trees: Ensembles

•  Bootstrap aggregation (bagging): the basis of Random Forest
  o  Fit many trees against different samples of the data and average them together
•  Boosting: the basis of GBM
  o  Fits consecutive trees where each solves for the net error of the prior trees
(Both ideas are sketched in a short code example below.)
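To make the two ideas concrete, here is a minimal standalone R sketch (not H2O code): it uses the rpart package and synthetic data, so every name and value is illustrative only.

  # Toy bagging vs. boosting with rpart on synthetic data (illustrative only)
  library(rpart)

  set.seed(1)
  df <- data.frame(x1 = runif(500), x2 = runif(500))
  df$y <- 2 * df$x1 + sin(10 * df$x2) + rnorm(500, sd = 0.1)

  ## Bagging: fit many trees to bootstrap samples of the rows, then average them
  bag_matrix <- replicate(50, {
    idx <- sample(nrow(df), replace = TRUE)
    fit <- rpart(y ~ x1 + x2, data = df[idx, ])
    predict(fit, df)
  })
  bag_pred <- rowMeans(bag_matrix)

  ## Boosting: each new tree fits the net error (residual) of the trees so far;
  ## only a damped fraction of its prediction is added to the running solution
  boost_pred <- rep(mean(df$y), nrow(df))
  shrinkage  <- 0.1
  for (m in 1:50) {
    df$resid_y <- df$y - boost_pred
    fit <- rpart(resid_y ~ x1 + x2, data = df)
    boost_pred <- boost_pred + shrinkage * predict(fit, df)
  }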
Random Forest

Conceptual
•  Combine multiple decision trees, each fit to a random sample of the original data
•  Randomly samples
  o  Rows
  o  Columns
•  Reduce variance, with minimal increase in bias

Practical
•  Strengths
  o  Easy to use
    •  Few parameters
    •  Well-established default values for parameters
  o  Robust
  o  Competitive accuracy on most data sets
•  Weaknesses
  o  Slow to score
  o  Lack of transparency
(An example call is sketched below.)
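A hedged sketch of fitting a Random Forest from R with H2O follows; the file names and the "target" response column are placeholders, and parameter defaults can differ across H2O versions.

  library(h2o)
  h2o.init()                               # start (or connect to) a local H2O cluster

  train <- h2o.importFile("train.csv")     # hypothetical training file
  valid <- h2o.importFile("valid.csv")     # hypothetical validation file

  rf <- h2o.randomForest(
    x = setdiff(names(train), "target"),   # predictor columns
    y = "target",                          # response column (placeholder name)
    training_frame = train,
    validation_frame = valid,
    ntrees = 100,                          # number of trees in the forest
    max_depth = 20,                        # maximum depth of each tree
    mtries = -1                            # columns sampled per split (-1 = H2O default)
  )
  h2o.performance(rf, valid)               # validation metrics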
Gradient Boosted Machines (GBM)

Conceptual
•  Boosting: ensemble of weak learners*
•  Fits consecutive trees where each solves for the net loss of the prior trees
•  Results of new trees are applied partially to the entire solution

Practical
•  Strengths
  o  Often the best possible model
  o  Robust
  o  Directly optimizes the cost function
•  Weaknesses
  o  Overfits
    •  Need to find the proper stopping point
  o  Sensitive to noise and extreme values
  o  Several hyper-parameters
  o  Lack of transparency

* the notion of "weak" is being challenged in practice
(An example call is sketched below.)
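A comparable hedged sketch for GBM, reusing the train and valid frames from the Random Forest sketch above; the parameter values are illustrative, not recommendations.

  gbm <- h2o.gbm(
    x = setdiff(names(train), "target"),
    y = "target",
    training_frame = train,
    validation_frame = valid,   # watch validation error to pick the stopping point
    ntrees = 500,               # upper bound on the number of trees
    learn_rate = 0.05,          # fraction of each new tree applied to the solution
    max_depth = 5               # shallow trees keep the learners "weak"
  )
  h2o.performance(gbm, valid)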
Trees in H2O

•  Individual tree fitting is performed in parallel
•  Shared histograms calculate cut-points
•  Greedy search of histogram bins, optimizing squared error
(The histogram granularity is illustrated with the nbins parameter below.)
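As one illustration of the histogram-based split search, the nbins parameter controls how many histogram bins (and therefore candidate cut-points) are considered per column; this sketch assumes the frames from the earlier examples and an H2O version that exposes nbins this way.

  gbm_fine <- h2o.gbm(
    x = setdiff(names(train), "target"),
    y = "target",
    training_frame = train,
    ntrees = 100,
    nbins = 64    # more bins = finer-grained candidate cut-points (illustrative value)
  )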
  
Explore Further through Examples

Prerequisites for following along:
•  I have H2O installed
•  I have R installed
•  I have the H2O World data sets
(A getting-started sketch follows.)
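A hedged getting-started sketch matching the three prerequisites above; the data file name is a placeholder for whichever H2O World data set you downloaded.

  install.packages("h2o")                          # R package for H2O (from CRAN)
  library(h2o)
  h2o.init(nthreads = -1)                          # launch a local cluster on all cores

  data   <- h2o.importFile("h2o_world_train.csv")  # hypothetical local path
  splits <- h2o.splitFrame(data, ratios = 0.75)    # 75/25 train/validation split
  train  <- splits[[1]]
  valid  <- splits[[2]]
  dim(train)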
