Final Project: Vince Velocci
ME 5013.005
Thursday, May 7, 2015

Using Boosted Decision Classification Trees on the Higgs Kaggle Contest Data Set

Abstract
We use boosted decision trees to form a statistical model for classifying, as signal or background, the simulated data set used in the 2014 Higgs Boson Machine Learning Challenge on Kaggle, as described in the Project Proposal. Using a validation set, we choose a model with a shrinkage parameter λ = 0.2, a tree depth of d = 2, and 1000 trees. With the R programming language, we then train the model on the full training set provided and apply it to a test data set of 550,000 events, which the contest organizers use for contestant ranking. This test set does not contain the true class labels. Although the contest has long been over, the organizers still accept submissions. The resulting predictions were submitted and achieved an AMS score of 2.85538.

Introduction to the Problem
In particle physics, the Higgs boson is an elementary particle whose existence was first postulated in the 1960s on purely theoretical grounds. The role of the Higgs boson is to break the symmetry present between the electromagnetic and weak nuclear forces at high energies and, in doing so, to give mass to all the particles of the Standard Model with nonzero mass: six quarks, three leptons, and the W and Z particles (which mediate the weak nuclear force). The discovery of the Higgs boson was announced in July 2012 at CERN, as a result of a joint analysis by the two main particle physics experiments, ATLAS and CMS. The two experiments are giant particle detectors located within the Large Hadron Collider (LHC), the most energetic particle accelerator ever built. The LHC accelerates two beams of protons moving in opposite directions and collides these beams at two locations, corresponding to the locations of ATLAS and CMS. At its design energy, each beam of protons carries 7 TeV (one TeV is one trillion times the energy an electron gains in crossing a potential difference of 1 V), so when two protons collide (each collision constitutes an "event"), the energy of the collision can reach up to 14 TeV. That energy, via E = mc², can go into creating other particles, one of which was the Higgs particle.
The Higgs boson, while discovered, was not observed directly. Rather, its presence was inferred from the properties of the observed decay products, which are precisely predicted by the Standard Model of particle physics including a Higgs boson. In July 2012, the discovery of the particle was announced after both experiments claimed to detect a 5σ excess of signal above background, the standard for the discovery of a new particle in the high energy physics community.
In May 2014, a contest ("Higgs Boson Machine Learning Challenge") was posted on Kaggle (www.kaggle.com/c/higgs-boson) by ATLAS to entice machine learning and data science researchers and programmers to come up with clever models and algorithms to help ATLAS improve its ability to discriminate between a true Higgs signal and a background event (anything that looks similar to a Higgs decay but does not involve any Higgs particle). A true Higgs signal is given by a Higgs decay into a tau lepton pair of opposite electric charge:
h → τ⁺ + τ⁻.
The organizers of the contest posted a simulated dataset consisting of a training set, whose class labels are given, and a test set, whose class labels are not given but must be predicted by the contestants. The simulated data was not generated by sampling any explicit joint probability distribution function (PDF): physicists do not have a usable joint PDF for the particular process they are trying to detect, and computing one is infeasible with limited CPU resources; otherwise, one presumes they could simply use a likelihood ratio test. Rather, the data was simulated using their "official" detector simulator, which consists of two parts. The first part simulates the proton-proton collisions inside the accelerator based on everything known about particle physics. The second part simulates the detector using a virtual model of the detector. The result is simulated events that resemble the statistical properties of the real events. From what I understand, these simulations take a long time to complete, but they are not generated from a joint probability distribution. What the organizers were looking for is the software architecture that could extract the most discrimination power; they turned to the data science community for the best available and most efficient algorithm for doing this.
The purpose of this project is to use the boosted decision tree algorithm, via R, to find a model that adequately classifies the events as either signal or background. The model's predictions may also be submitted to the contest organizers to determine its score relative to the other competitors.
The (simulated) dataset consists of a training set of 250,000 events, with an event ID column, 30 feature columns, a weight column, and a label column (s or b), and a test set of 550,000 events (used to judge each contestant's algorithm), with an ID column and 30 feature columns, but no weights or labels. A typical training event looks like this:
100000,138.47,51.655,97.827,27.98,0.91,124.711,2.666,3.064,41.928,197.76,1.582,1.396,0.2,32.638,1.017,0.381,51.626,2.273,-2.414,16.824,-0.277,258.733,2,67.435,2.15,0.444,46.062,1.24,-2.475,113.497,0.00265331133733,s
The above event has an event ID of 100000, a weight of 0.00265331133733, and a class label 's' (signal). The other 30 pieces of data are kinematic quantities measured in the ATLAS detector (though the data is simulated). For more information and documentation on this contest and the dataset, please see http://higgsml.lal.in2p3.fr/documentation/ (also used below in the description of the features).
Data Features
The data set contains 250,000 observations and 33 columns. Note that a value of -999 represents an undefined value. The features are as follows:
EventID: An integer, from 100,000 to 349,999, which identifies the observation ("event"). It is not a feature and should not be used in the analysis
DER_mass_MMC: Estimated mass of the Higgs boson candidate. May be undefined for a given event
DER_mass_transverse_met_lep: The transverse mass between the missing transverse energy and the lepton
DER_mass_vis: The invariant mass of the hadronic tau and the lepton
DER_pt_h: The modulus of the vector sum of the transverse momentum of the hadronic tau, the lepton, and the missing transverse energy vector
DER_deltaeta_jet_jet: The absolute value of the pseudorapidity separation between the two jets (undefined if PRI_jet_num ≤ 1)
DER_mass_jet_jet: The invariant mass of the two jets (undefined if PRI_jet_num ≤ 1)
DER_prodeta_jet_jet: The product of the pseudorapidities of the two jets (undefined if PRI_jet_num ≤ 1)
DER_deltar_tau_lep: The R separation between the hadronic tau and the lepton
DER_pt_tot: The modulus of the vector sum of the missing transverse momenta and the transverse momenta of the hadronic tau, the lepton, the leading jet (if PRI_jet_num ≥ 1) and the subleading jet (if PRI_jet_num = 2)
DER_sum_pt: The sum of the moduli of the transverse momenta of the hadronic tau, the lepton, the leading jet (if PRI_jet_num ≥ 1), the subleading jet (if PRI_jet_num = 2), and the other jets (if PRI_jet_num = 3)
DER_pt_ratio_lep_tau: The ratio of the transverse momenta of the lepton and the hadronic tau
DER_met_phi_centrality: The centrality of the azimuthal angle of the missing transverse energy vector with respect to the hadronic tau and the lepton
DER_lep_eta_centrality: The centrality of the pseudorapidity of the lepton with respect to the two jets (undefined if PRI_jet_num ≤ 1)
PRI_tau_pt: The transverse momentum of the hadronic tau
PRI_tau_eta: The pseudorapidity of the hadronic tau
PRI_tau_phi: The azimuth angle of the hadronic tau
PRI_lep_pt: The transverse momentum of the lepton (electron or muon)
PRI_lep_eta: The pseudorapidity of the lepton
PRI_lep_phi: The azimuth angle of the lepton
PRI_met: The missing transverse energy
PRI_met_phi: The azimuth angle of the missing transverse energy
PRI_met_sumet: The total transverse energy in the detector
PRI_jet_num: The number of jets (0, 1, 2, or 3; any larger values are capped at 3)
PRI_jet_leading_pt: The transverse momentum of the leading jet, that is, the jet with the largest transverse momentum (undefined if PRI_jet_num = 0)
PRI_jet_leading_eta: The pseudorapidity of the leading jet (undefined if PRI_jet_num = 0)
PRI_jet_leading_phi: The azimuth angle of the leading jet (undefined if PRI_jet_num = 0)
PRI_jet_subleading_pt: The transverse momentum of the subleading jet, that is, the jet with the second-largest transverse momentum (undefined if PRI_jet_num ≤ 1)
PRI_jet_subleading_eta: The pseudorapidity of the subleading jet (undefined if PRI_jet_num ≤ 1)
PRI_jet_subleading_phi: The azimuth angle of the subleading jet (undefined if PRI_jet_num ≤ 1)
PRI_jet_all_pt: The scalar sum of the transverse momenta of all the jets of the event
Weight: Event weight (not a feature and should not be used in the analysis; not given in the test sample used in the submission)
Label: The response. Event label (s for signal, b for background; not given in the test sample used in the submission)
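Many of the undefined values follow directly from PRI_jet_num, as the jet-dependent features above show. As an illustrative check (not part of the submitted analysis), one can count how often each feature carries the -999 sentinel, assuming the training set has been read into a data frame named Higgs as in the code listing below:

```r
# Count the -999 sentinels (undefined values) in each numeric column.
# Columns such as DER_deltaeta_jet_jet should be undefined exactly
# for the events with PRI_jet_num <= 1.
num.cols <- sapply(Higgs, is.numeric)           # skip the character Label column
undef.counts <- sapply(Higgs[num.cols], function(col) sum(col == -999))
sort(undef.counts, decreasing = TRUE)           # most-often-undefined first
```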
Evaluation of Submissions (AMS Score)
As mentioned above, contestants use their model to predict the class labels of a test sample of 550,000 observations provided by the contest organizers. The submission is a .csv file whose first column contains the event ID, second column the rank order, and third column the class label prediction (s or b). The rank order is an integer from 1 to 550,000. A rank order of 550,000 is given to the event that is most signal-like; the event that is least background-like is given a rank order one less than the smallest signal rank; and the most background-like event is given a rank order of 1. Since our model outputs probabilities for 's', probabilities from 1 down to 0 are given rankings from 550,000 down to 1, respectively.
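The probability-to-rank mapping just described can be sketched in R; here p is a hypothetical vector of predicted signal probabilities (the name is illustrative, not from the submitted code):

```r
# rank() assigns 1 to the smallest value, so the event with the largest
# predicted signal probability receives the top rank order of 550,000.
p <- runif(550000)                           # stand-in for model probabilities
RankOrder <- rank(p, ties.method = "first")  # integers 1..550000, no ties
```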
The AMS (approximate median significance) score given to contestants is given by the formula:

AMS = sqrt( 2 * [ (s + b + b_r) * ln(1 + s / (b + b_r)) - s ] )

where s and b are the unnormalized true positive and false positive rates, respectively, and b_r = 10 is a regularization term.

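As a concrete restatement of the formula, here is a small R helper (a sketch, not part of the submitted code; s, b, and b_reg name the quantities defined above):

```r
# Approximate median significance (AMS):
#   AMS = sqrt(2 * ((s + b + b_r) * log(1 + s / (b + b_r)) - s))
# where s and b are the unnormalized true/false positive rates
# and b_r = 10 is the regularization term.
AMS <- function(s, b, b_reg = 10) {
  sqrt(2 * ((s + b + b_reg) * log(1 + s / (b + b_reg)) - s))
}
```

At fixed b, a larger s raises the score, and the regularization term keeps the score finite when b is near zero.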
R Statistical Model Used
A statistical model based on the boosted decision tree algorithm was chosen for this data set. A tree depth (number of splits), a shrinkage parameter λ, and a number of trees had to be chosen. These were chosen by randomly splitting the training data in half, assigning one half as the training set and the other half as a validation set.
The data set first needed to be purged of columns 1 and 32, which are the event ID and weight columns, respectively. Then, every instance of -999 was replaced with NA, since those entries represent undefined values. In order to use the gbm function for binary classification, the Label column values had to be converted to {0, 1}: a value of 's' was set to 1, and a value of 'b' to 0. Then, the old Label column was deleted.
I wanted to compute the classification error rate on the test set for tree depths from 1 to 4 and numbers of trees from 100 to 5000 in increments of 100. R ran the loops for the entire night but, for some reason, stopped after a depth of 2 with 1500 trees. For a tree depth of 1, the error rate was 0.3422 until 1300 trees, then began dropping and leveled off at around 0.19. I also calculated the error rate for depth = 2, which also began at 0.3422 and remained there until the number of trees reached 700, at which point the error dropped to 0.2112. It gradually kept dropping until reaching 0.2027 at 1500 trees, at which point R stopped. The following graph captures this data (see the code for how I did this):

[Figure: error rates on the test data vs. number of trees, for depth = 1 (grey) and depth = 2 (red)]
Not shown in my code is the boosted tree model on the training set with a tree depth of 4 and 5000 trees. The error rate was 0.1764, which is lower than the result for a depth of 1 and 5000 trees (0.1923).
Next, I wanted to test different values of the shrinkage parameter for a tree depth of 1 and 4000 trees. I considered the values 0.01, 0.1, 0.2, and 1, which gave test error rates of 0.194992, 0.181088, 0.176024, and 0.177096, respectively. Contrast these results with the result of applying a model with shrinkage 0.001, depth 1, and 4000 trees: 0.19496. Out of curiosity, I also tried a shrinkage of 0.5 (depth 1, 4000 trees) and got a test error rate of 0.17808. Overall, the best choice among these shrinkage values was 0.2.
I also tried models with depths of 2 and 3, shrinkage 0.2, and 4000 trees. The test error rates were 0.163752 and 0.164976, respectively. Comparing these depth values, a depth of 2 would probably work best.
Finally, I wanted to test different numbers of trees. I tried models with tree numbers ranging from 1000 to 5000 in increments of 1000, all with shrinkage 0.2 and tree depth 2. The five test error rates were 0.161792, 0.162168, 0.163128, 0.163984, and 0.163688. Therefore a tree number of 1000 seems to work best. The lowest test error rate obtained so far was around 0.162, which corresponds to a success rate of 83.8%. Note that 34% of the training class labels belong to class 's' and 66% to class 'b'; thus, a trivial classifier that only predicts 'b' will be, on average, 66% successful.
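The manual grid over tree counts above could also be delegated to gbm's built-in cross-validation: passing cv.folds to gbm and then calling gbm.perf selects the iteration with the lowest cross-validated deviance. A sketch under the settings chosen above (this was not part of the original analysis):

```r
library(gbm)

# 5-fold cross-validation at shrinkage 0.2 and depth 2; Higgs2 is the
# prepared training frame with the numeric Label2 response.
cv.boost <- gbm(Label2 ~ ., data = Higgs2, distribution = "bernoulli",
                n.trees = 5000, interaction.depth = 2, shrinkage = 0.2,
                cv.folds = 5, verbose = FALSE)

# Returns the tree count minimizing the CV deviance and plots the
# training and cross-validation error curves.
best.iter <- gbm.perf(cv.boost, method = "cv")
```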
[Figure: relative influence of the model variables, from summary(final.boost)]
We see that the four most important variables in the model are DER_mass_MMC, DER_mass_transverse_met_lep, DER_mass_vis, and PRI_tau_pt.
The test data set (for submission) was read into R, and the EventId column was purged. Then, all entries containing -999 were converted into NA. This step took a very long time (more than two hours).
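The two-hour runtime comes from the element-wise double loop used for the recoding; matrix-style logical indexing does the same thing in one vectorized step and runs in seconds. A sketch, equivalent to the nested loop in the code listing below:

```r
# Replace every -999 sentinel in the test frame with NA at once;
# logical indexing over the whole data frame avoids the nested loop.
HiggsTest2[HiggsTest2 == -999] <- NA
```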
The boosted decision tree model with shrinkage 0.2, depth 2, and 1000 trees was then used to make class probability predictions on this test data set. Probabilities > 0.5 are classified as signal (1, or 's'). These predictions were then added into the test data frame and the probabilities sorted in descending order. Finally, a submission .csv file was formed and submitted.
The following is the text of my R code, followed by my results:
Higgs=read.csv("training.csv",header=T) # Read in the training data .csv file
Higgs <- subset(Higgs, select = -c(1,32)) # Columns 1 (eventID) and 32 (weights) are not to be used in the analysis

for (i in 1:nrow(Higgs)){ # This routine takes every entry equal to -999 (meaning the entry is undefined) and assigns it NA
  for (j in 1:ncol(Higgs)){
    if(Higgs[i,j]==-999){
      Higgs[i,j]=NA}
  }
}

attach(Higgs) # Attaches Higgs data set
dim(Higgs) # training data size and number of features + classification label
sum(Higgs$Label=='s')/nrow(Higgs) # Roughly 1/3 of labels are signal (Higgs produced)

Label2=ifelse(Label=='s',1,0) # This converts the class labels into 1, for 's', and 0, for 'b', so that the gbm function can use them
Higgs2=data.frame(Higgs,Label2) # New data frame with the numeric label column appended
Higgs2$Label <- NULL # This gets rid of the old class labels

train=sample(1:nrow(Higgs2),nrow(Higgs2)/2) # define training set for validation
test=(-train) # test set is everything else
Higgs2.test=Higgs2[-train,] # test set
Label2.test=Higgs2$Label2[-train] # labels in the test set

library(tree)
library(gbm)

lam=c(0.01, 0.1, 0.2, 1) # different values of the shrinkage parameter/learning rate
x=1:50 # a vector of integers from 1 to 50
ntrees=100*x # Number of trees from 100 to 5000 in increments of 100
A=matrix(0,4,50) # 4 x 50 matrix of zeros. Rows = tree depth, Columns = number of trees

for (d in 1:4){ # This routine computes the error rate on the test data for boosted trees at depths from 1 to 4 and numbers of trees from 100 to 5000, shrinkage = 0.001 (the gbm default)
  for (i in 1:50){
    boost.Higgs=gbm(Label2~.,data=Higgs2[train,],distribution="bernoulli",n.trees=ntrees[i],interaction.depth=d,verbose=F) # Boosted trees model on training set
    yhat.boost=predict(boost.Higgs,newdata=Higgs2.test,n.trees=ntrees[i],type="response") # Predicted probabilities of a 1 response on the test set
    yhat=rep(0,125000) # form a vector of zeros, length 125000
    yhat[yhat.boost>.5]=1 # Convert probabilities into class decisions for the test set
    A[d,i]=1-sum(diag(table(yhat,Label2.test)))/sum(table(yhat,Label2.test)) # Compare predictions with the actual values. The (d,i)-th element is the error rate for tree depth = d and number of trees = 100i
  }
} # The above routine ran the whole night, but only until d=2, ntrees=1500. I'm not sure if R stopped automatically, but I had to stop it anyway

plot(ntrees,A[1,],col=1,main="Error Rates on Test Data for depth=1 (grey) and depth=2 (red)",xlab="Number of trees",ylab="Error Rate") # scatter plot of the data in matrix A
points(ntrees,A[2,],col=2) # This and the line above plot the error rates vs number of trees for all the error rates I was able to get R to compute

B=rep(0,4) # vector of zeros, length 4
for (i in 1:4){ # This routine computes the error rate on the test data for boosted trees with different shrinkage values, 4000 trees, depth = 1
  boost.Higgs=gbm(Label2~.,data=Higgs2[train,],distribution="bernoulli",n.trees=4000,interaction.depth=1,shrinkage=lam[i],verbose=F)
  yhat.boost=predict(boost.Higgs,newdata=Higgs2.test,n.trees=4000,type="response")
  yhat=rep(0,125000)
  yhat[yhat.boost>.5]=1 # If probability is > 0.5, classify as 1 (signal)
  B[i]=1-sum(diag(table(yhat,Label2.test)))/sum(table(yhat,Label2.test)) # i-th element is the error rate for tree depth = 1, number of trees = 4000, shrinkage = lam[i]
} # The results are stored in the vector B

boost.Higgs=gbm(Label2~.,data=Higgs2[train,],distribution="bernoulli",n.trees=4000,interaction.depth=1,shrinkage=0.5,verbose=F) # Trying shrinkage=0.5
yhat.boost=predict(boost.Higgs,newdata=Higgs2.test,n.trees=4000,type="response")
yhat=rep(0,125000)
yhat[yhat.boost>.5]=1 # If probability is > 0.5, classify as 1 (signal)
1-sum(diag(table(yhat,Label2.test)))/sum(table(yhat,Label2.test)) # Test error

boost.Higgs=gbm(Label2~.,data=Higgs2[train,],distribution="bernoulli",n.trees=4000,interaction.depth=2,shrinkage=0.2,verbose=F) # Trying depth=2
yhat.boost=predict(boost.Higgs,newdata=Higgs2.test,n.trees=4000,type="response")
yhat=rep(0,125000)
yhat[yhat.boost>.5]=1 # If probability is > 0.5, classify as 1 (signal)
1-sum(diag(table(yhat,Label2.test)))/sum(table(yhat,Label2.test)) # Test error

boost.Higgs=gbm(Label2~.,data=Higgs2[train,],distribution="bernoulli",n.trees=4000,interaction.depth=3,shrinkage=0.2,verbose=F) # Trying depth=3
yhat.boost=predict(boost.Higgs,newdata=Higgs2.test,n.trees=4000,type="response")
yhat=rep(0,125000)
yhat[yhat.boost>.5]=1 # If probability is > 0.5, classify as 1 (signal)
1-sum(diag(table(yhat,Label2.test)))/sum(table(yhat,Label2.test)) # Test error

num=c(1000,2000,3000,4000,5000) # vector of numbers of trees used
C=rep(0,5) # vector of zeros, length 5
for (i in 1:5){ # This routine computes the error rate on the test data for boosted trees with tree numbers from 1000 to 5000, depth 2, shrinkage 0.2
  boost.Higgs=gbm(Label2~.,data=Higgs2[train,],distribution="bernoulli",n.trees=num[i],interaction.depth=2,shrinkage=0.2,verbose=F)
  yhat.boost=predict(boost.Higgs,newdata=Higgs2.test,n.trees=num[i],type="response")
  yhat=rep(0,125000)
  yhat[yhat.boost>.5]=1 # If probability is > 0.5, classify as 1 (signal)
  C[i]=1-sum(diag(table(yhat,Label2.test)))/sum(table(yhat,Label2.test)) # i-th element is the error rate for tree depth = 2, number of trees = 1000i, shrinkage = 0.2
} # Store the error rates in vector C

final.boost=gbm(Label2~.,data=Higgs2,distribution="bernoulli",n.trees=1000,interaction.depth=2,shrinkage=0.2,verbose=F) # Train the final model on the full training set
summary(final.boost)

# Using the model to make predictions on the Test data set
HiggsTest=read.csv("test.csv",header=T) # Read in the Test data .csv file
HiggsTest2 <- subset(HiggsTest, select = -c(1)) # Drop the EventId column

for (i in 1:nrow(HiggsTest2)){ # This routine takes every entry equal to -999 (meaning the entry is undefined) and assigns it NA
  for (j in 1:ncol(HiggsTest2)){
    if(HiggsTest2[i,j]==-999){
      HiggsTest2[i,j]=NA}
  }
}

Class.boost=predict(final.boost,newdata=HiggsTest2,n.trees=1000,type="response") # Predicted probability of getting a 1 on the test set using our model
ClassBinary=rep(0,550000) # vector of zeros, length 550000
ClassBinary[Class.boost>.5]=1 # If probability is > 0.5, classify as 1 (signal)
Class=ifelse(ClassBinary==1,'s','b') # Convert binary predictions into 's' or 'b'

S=matrix(0,550000,4) # Submission matrix
S[,1]=HiggsTest$EventId # First column is the EventId
S[,2]=rank(Class.boost,ties.method="first") # Second column is the RankOrder: the most signal-like event gets rank 550,000
S[,3]=Class.boost # Third column holds the probabilities for 's' associated with each EventId
S[,4]=Class # Corresponding 's' or 'b' label

colnames(S)=c("EventId","RankOrder","Probs","Class") # Specifies the column names for matrix S
S=as.data.frame(S) # Convert S to a data frame (the result must be assigned to take effect)

write.csv(file="Submission.csv", x=S, row.names=FALSE) # Writes the S data frame to a .csv file

final.boost2=gbm(Label2~DER_mass_MMC+DER_mass_transverse_met_lep+DER_mass_vis+PRI_tau_pt,data=Higgs2,distribution="bernoulli",n.trees=1000,interaction.depth=2,shrinkage=0.2,verbose=F) # Train the final model on the full training set using the 4 most important variables

Class.boost2=predict(final.boost2,newdata=HiggsTest2,n.trees=1000,type="response") # Predicted probability of getting a 1 on the test set using our model
ClassBinary2=rep(0,550000) # vector of zeros, length 550000
ClassBinary2[Class.boost2>.5]=1 # If probability is > 0.5, classify as 1 (signal)
Class2=ifelse(ClassBinary2==1,'s','b') # Convert binary predictions into 's' or 'b'

T2=matrix(0,550000,4) # Submission matrix (named T2 to avoid masking T, R's shorthand for TRUE)
T2[,1]=HiggsTest$EventId # First column is the EventId
T2[,2]=rank(Class.boost2,ties.method="first") # Second column is the RankOrder
T2[,3]=Class.boost2 # Third column holds the probabilities for 's' associated with each EventId
T2[,4]=Class2 # Corresponding 's' or 'b' label

colnames(T2)=c("EventId","RankOrder","Probs","Class") # Specifies the column names for matrix T2
T2=as.data.frame(T2) # Convert T2 to a data frame