 
 
 
 
 
Sabancı University
Data Analytics M.Sc. Programme
2015-2016 Term

DA 592 - Term Project

Grupo Bimbo Inventory Demand
Kaggle Contest

Students: Berker Kozan, Can Köklü
 
Abstract

This data analytics project was done for a Kaggle contest where the goal was to perform demand prediction for the Grupo Bimbo company. The Python language was used with Jupyter notebooks, and the XGBoost library was used to perform training and predictions.
Various feature engineering techniques, such as text extraction with NLTK, creation of lag columns and averaging over a large number of variables, were used to enrich the data. After the train table was created, XGBoost was used to optimize for the scoring function dictated by the contest, RMSLE. Hyperparameter tuning was then applied, after feature selection based on feature importance and correlation analysis, to determine the best parameters for the XGBoost optimizer.
The final submission to Kaggle achieved a score of 0.48666, placing our team in the top 17% of the 2000 contestants.
The biggest challenges were related to analyzing and training on a large dataset. This was overcome by forcing the data types to smaller types (unsigned integers, low-precision floats, etc.), using the HDF5 file format for data storage and launching a powerful Google Cloud Compute preemptible instance (with 208 GB RAM).
Further improvements would include attempting hyperparameter tuning across a wider range of training tables (with different features) and implementing a failsafe method for running the experiment on preemptible instances. Additionally, creating different models and averaging them to find optimal, non-overfitted models would have yielded better results.

Keywords: data science, kaggle, demand prediction, python, jupyter, xgboost, cloud, google cloud compute, hdf5, hyperparameter tuning, feature selection
 
   
Table of Contents

Abstract
Table of Contents
Introduction
    What is Kaggle?
    What is the contest about?
    Why this project?
Tools Used
    Python
    Platforms
Data Exploration
    Definition of the Data Sets
    Exploratory Data Analysis
        Data Types and Sizes
        Summary of Data
        Correlations
    Decreasing the Data Size
Models
    Naive Prediction
        Score
    NLTK based Modelling
        Feature Engineering
        Modeling
        Technical Problems
            Garbage Collection
            Data Size
        Score
        Conclusion
    Final Models
        Digging Deeper in Data Exploration
            Demanda-Dev-Venta Relationship
            Train-Test Difference
        Feature Engineering
            Agencia
            Producto
            Demand Features
            Client Features
            General Totals
        Validation Technique
        Xgboost
        Training
        Hyperparameter Tuning
            Max Depth
            Subsample
            ColSampleByTree
            Learning Rate
        Technical Problems
            Storing Data
            RAM Problem
            Code Reuse & Automatization
Results
Conclusion
    Critical Mistakes
    Further Exploration

Introduction

What is Kaggle?

Kaggle[1] is a website founded in 2010 that provides data science challenges to its participants. Participants compete against each other to solve data science problems. Kaggle has unranked "practice" challenges as well as contests with monetary rewards. Companies that have data challenges work together with Kaggle to formulate the problem and reward the top performers.

What is the contest about?

The contest that we have taken on for our project[2] belongs to a Mexican company named Grupo Bimbo. Grupo Bimbo is a company that produces and distributes fresh bakery products. The nature of the problem at its core is demand estimation.
Grupo Bimbo produces the products and ships them from storage facilities (agencies) to stores (clients). The following week, a certain number of products that aren't sold are returned from the clients to Bimbo. To maximize their profits, Grupo Bimbo needs to predict the demand of stores accurately to minimize these returns.
In the contest, we are provided with 9 weeks' worth of data regarding these shipments and we are asked to predict the demand for weeks 10 and 11. Participants are allowed to submit 3 sets of predictions every day until the deadline of the project and can pick any two of these predictions as their final submissions.
As is standard practice at Kaggle, when making initial submissions, the predictions are ranked based on a "public" ranking which only evaluates a certain part of the submission. This is done to prevent gaming the system by overfitting through trial and error of submissions. For this contest, the "public" ranking is done on week 10 data; meaning, when submitting our predictions we would only be able to see our performance for week 10. The private performance of our predictions (i.e. week 11) is only shown after the contest ends.
For this contest the evaluation metric is the Root Mean Squared Logarithmic Error (RMSLE) of our predictions.
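
For reference, with p_i the predicted demand, a_i the actual demand and n the number of rows, the metric is:

\[
\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\log(p_i + 1) - \log(a_i + 1)\bigr)^2}
\]
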
 
[1] "About | Kaggle." 2012. 10 Sep. 2016 <https://www.kaggle.com/about>
[2] "Grupo Bimbo Inventory Demand | Kaggle." 2016. 10 Sep. 2016 <https://www.kaggle.com/c/grupo-bimbo-inventory-demand>
Why this project?

We decided to do a Kaggle contest for our project for various reasons:
1. It would allow us to benchmark our data science abilities in an international field.
2. Kaggle has very active forums for each individual contest and these would provide us with great new methods and insights in solving problems.
3. Since the data provided is clean, we could spend more time on feature and model building rather than data cleaning.
4. We could work towards a clear goal and not be distracted.
5. From the number of contests on Kaggle, we picked the Grupo Bimbo project because:
   a. It deals with text data, which is considerably easier to work with for beginners.
   b. The data was very large and provided a learning opportunity in working with large datasets.
   c. The deadline of the project (August 30) was in line with the deadline of our term project.
Tools Used

Python

We decided to use Python (version 2.7)[3] as our scripting language. This is the language we worked with most in our programme and also one of the most popular data science languages. We built our systems on the Anaconda package by Continuum[4], as it offers a large number of libraries that helped us face the challenges.
We mainly ran Jupyter[5] (IPython) notebooks on various systems to code and report results.
A few of the specific tools/packages that we used were:
● NLTK[6]: NLTK is the most popular Natural Language Processing toolkit for Python. It offers great features like stemming, tokenizing and chunking in multiple languages. This was critical since the product names were in Spanish.
● XGBoost: XGBoost is a library that can be used in conjunction with various scripting languages (including R and Python) and is designed for gradient boosted trees. It is much faster than regular scripting tools since the computational parts are written and precompiled in C++. We picked this solution based simply on its fame, as many of the winners of Kaggle contests have used this tool[7].
● Pickle[8]: The pickle module implements binary protocols for serializing and de-serializing a Python object structure.
● HDF5[9] File Format: HDF is self-describing, allowing an application to interpret the structure and contents of a file with no outside information. It is a general-purpose, machine-independent standard for storing scientific data in files, developed by the National Center for Supercomputing Applications (NCSA).
● Scikit-Learn[10]: Scikit-Learn is a simple and efficient tool for data mining and machine learning; besides that, it is free and built on NumPy, matplotlib and SciPy. We used it in the feature extraction phase.
● NumPy[11]: NumPy is an open source extension module for Python which provides fast precompiled functions for mathematical and numerical routines. Furthermore, NumPy enriches the Python language with powerful data structures for efficient computation of multidimensional arrays and matrices.
● SciPy[12]: SciPy is a Python-based ecosystem of open-source software for mathematics, science, and engineering. We used it for sparse matrices.
● Garbage Collector[13]: The gc module was used in order to free up memory periodically and optimize performance.

[3] "Python 2.7.0 Release | Python.org." 2014. 10 Sep. 2016 <https://www.python.org/download/releases/2.7/>
[4] "Download Anaconda Now! | Continuum - Continuum Analytics." 2015. 10 Sep. 2016 <https://www.continuum.io/downloads>
[5] "Project Jupyter | Home." 2014. 10 Sep. 2016 <http://jupyter.org/>
[6] "Natural Language Toolkit — NLTK 3.0 documentation." 2005. 10 Sep. 2016 <http://www.nltk.org/>
[7] "xgboost/demo at master · dmlc/xgboost · GitHub." 2015. 10 Sep. 2016 <https://github.com/dmlc/xgboost/tree/master/demo>
[8] "12.1. pickle — Python object serialization — Python 3.5.2 documentation." 2014. 20 Sep. 2016 <https://docs.python.org/3/library/pickle.html>
[9] "Importing HDF5 Files - MATLAB & Simulink - MathWorks." 2012. 20 Sep. 2016 <http://www.mathworks.com/help/matlab/import_export/importing-hierarchical-data-format-hdf5-files.html>
[10] "scikit-learn: machine learning in Python — scikit-learn 0.17.1..." 2011. 20 Sep. 2016 <http://scikit-learn.org/>
[11] "What is NumPy? - Numpy and Scipy Documentation." 2009. 20 Sep. 2016 <http://docs.scipy.org/doc/numpy/user/whatisnumpy.html>
[12] "SciPy.org — SciPy.org." 2002. 21 Sep. 2016 <http://www.scipy.org/>
[13] "28.12. gc — Garbage Collector interface — Python 2.7.12..." 2014. 21 Sep. 2016 <https://docs.python.org/2/library/gc.html>

Platforms

For coding and performing our computations, we initially attempted to use our laptops (a Macbook Pro and an Ubuntu machine, each with 16 GB of RAM). However, after getting numerous MemoryErrors, we gradually came to realize that our computers would not be able to run the computations that we needed (at least not in an efficient and timely manner). To solve our problem we turned to cloud services.
We first set up an EC2 instance on Amazon Web Services with about 100 GB of RAM and 16 virtual CPU cores, using a public tutorial[14]. However, running such a powerful instance continuously proved costly; a two-day attempt to build and run models cost over 150 USD. (An important side note: one should make sure that all items that relate to the created instance are removed completely to avoid incurring charges. In the case of one of the authors of this paper, an extra 50 USD was later charged because backup copies of the instances were not deleted.)
We then decided to switch to the Google Cloud Compute service, building a system with similar specs, again following a publicly available tutorial[15]. Although slightly cheaper, having a dedicated machine run for an entire day again proved costly, incurring about 50 USD. At this point we decided to find a cheaper solution and looked at Amazon's Spot Instances and Google's Preemptible Instances.
Both Amazon Spot Instances and Google Preemptible Instances operate on the principle that they offer the company's surplus computing power at a discount. The caveat is that if there are other consumers that want to use this computing power, the instances can be stopped by the company at any point. The biggest difference between the two is that Amazon offers a bidding model where the price for the computing power fluctuates; if the buyer's bid is higher than the current market price, the instance remains active; however, if the market price rises above the bid, it is shut down. Google, on the other hand, offers a fixed price for the instance[16].
We eventually settled on using a Google Preemptible instance with 32 virtual CPUs and 208 GB of RAM. We had to deal with a premature shutdown only once while running the instance over the course of three days. The total cost of the preemptible instances, backups etc. came to about 60 USD.
The key interface to the Google Cloud instance was a command prompt terminal, where the Jupyter notebook was initiated and data files were uploaded and submission files were downloaded via SSH.

[14] "Setting up AWS for Kaggle Part 1 – Creating a first Instance – grants..." 2016. 10 Sep. 2016 <http://www.grant-mckinnon.com/?p=6>
[15] "Set up Anaconda + IPython + Tensorflow + Julia on a Google..." 2016. 10 Sep. 2016 <https://haroldsoh.com/2016/04/28/set-up-anaconda-ipython-tensorflow-julia-on-a-google-compute-engine-vm/>
[16] "What are the key differences between AWS Spot Instances... - Quora." 2015. 10 Sep. 2016 <https://www.quora.com/What-are-the-key-differences-between-AWS-Spot-Instances-and-Googles-Preemptive-Instances>

Github

GitHub[17] is a code hosting platform for version control and collaboration which lets people work together on projects from anywhere. We used it to work on our code in parallel while easily merging our developments.

[17] "Hello World · GitHub Guides." 2014. 20 Sep. 2016 <https://guides.github.com/activities/hello-world/>

Data Exploration

Definition of the Data Sets

The data sets that we were provided with were as follows:
● train.csv — the training set; total demand data from clients and products per week for weeks 3-9, containing the following fields:
  ○ Semana - The week
  ○ Agencia_ID - ID of the storage facility from which the order is dispatched
  ○ Canal_ID - The channel through which the order is placed
  ○ Ruta_SAK - The route ID of the delivery route
  ○ Cliente_ID - The client ID
  ○ Producto_ID - The product ID
  ○ Venta_uni_hoy - The number of items that were ordered
  ○ Venta_hoy - The total cost of the items that were ordered
  ○ Dev_uni_proxima - The number of items that were returned
  ○ Dev_proxima - The total cost of the items that were returned
  ○ Demanda_uni_equil - Actual demand (the stock that was actually sold); this is the label that we need to predict for weeks 10 and 11
● test.csv — the test set; data from clients and products for weeks 10 and 11, containing the fields:
  ○ Id
  ○ Semana
  ○ Agencia_ID
  ○ Canal_ID
  ○ Ruta_SAK
  ○ Cliente_ID
  ○ Producto_ID
● cliente_tabla.csv — client names (can be joined with train/test on Cliente_ID)
● producto_tabla.csv — product names (can be joined with train/test on Producto_ID)
● town_state.csv — town and state (can be joined with train/test on Agencia_ID)
● sample_submission.csv — a sample submission file in the correct format

Image 1: Data Structure

None of the numeric variables existing in the train data are present in the test set, so the problem here is predicting the demand with only 6 categorical features.

Exploratory Data Analysis

Data Types and Sizes

The sizes of the data files were as follows:
● town_state.csv 0.03 MB
● train.csv 3199.36 MB
● cliente_tabla.csv 21.25 MB
● test.csv 251.11 MB
● producto_tabla.csv 0.11 MB
● sample_submission.csv 68.88 MB

Distributions and Summary of Data

Image 2: Summary of Train Data

Image 3: Summary of Train Data (cont.)

Image 4: Distribution of Target Variable

The target variable's mean is 7, its median is 3, its maximum is 5000, its standard deviation is 25, and 75% of the data lies between 0 and 6. This is classic right-skewed data, which explains why the evaluation metric is RMSLE. Moreover, we log-transformed the target variable (log(variable + 1)) before starting modeling and then took its exponential (exp(variable) - 1) before submitting.
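
A minimal sketch of this transform, assuming the train data and the predictions are held in pandas/NumPy objects named as below:

import numpy as np

# log(x + 1) on the target before training ...
train["Demanda_log"] = np.log1p(train["Demanda_uni_equil"])

# ... and exp(x) - 1 on the predictions before writing the submission file.
submission["Demanda_uni_equil"] = np.expm1(predictions)
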
Correlations

Image 5: Scatter Plots of Key Variables

In these scatter plots, we see that orders are highly correlated with demand and, secondly, that where demand is high, returns are low.

Decreasing the Data Size

In order to optimize RAM usage and speed up XGBoost's performance, we made sure to force the types of the data fields explicitly. We defined all our integers as unsigned integers and decreased the precision of the floating point fields as much as possible. For example, Canal_ID can be a uint8. After these conversions, memory usage is reduced from 6.1 GB to 2.1 GB.
 
Data with Default Data Types:

RangeIndex: 74180464 entries, 0 to 74180463
Data columns (total 11 columns):
Semana               int64
Agencia_ID           int64
Canal_ID             int64
Ruta_SAK             int64
Cliente_ID           int64
Producto_ID          int64
Venta_uni_hoy        int64
Venta_hoy            float64
Dev_uni_proxima      int64
Dev_proxima          float64
Demanda_uni_equil    int64
dtypes: float64(2), int64(9)
memory usage: 6.1 GB

Data with Optimized Data Types:

RangeIndex: 74180464 entries, 0 to 74180463
Data columns (total 11 columns):
Semana               uint8
Agencia_ID           uint16
Canal_ID             uint8
Ruta_SAK             uint16
Cliente_ID           uint32
Producto_ID          uint16
Venta_uni_hoy        uint16
Venta_hoy            float32
Dev_uni_proxima      uint32
Dev_proxima          float32
Demanda_uni_equil    uint32
dtypes: float32(2), uint16(4), uint32(3), uint8(2)
memory usage: 2.1 GB
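
A minimal sketch of how such explicit typing can be done while loading train.csv with pandas (column types taken from the optimized listing above; the file path is assumed):

import numpy as np
import pandas as pd

# Explicit column types matching the known value ranges avoid the int64/float64 defaults.
dtypes = {
    "Semana": np.uint8, "Agencia_ID": np.uint16, "Canal_ID": np.uint8,
    "Ruta_SAK": np.uint16, "Cliente_ID": np.uint32, "Producto_ID": np.uint16,
    "Venta_uni_hoy": np.uint16, "Venta_hoy": np.float32,
    "Dev_uni_proxima": np.uint32, "Dev_proxima": np.float32,
    "Demanda_uni_equil": np.uint32,
}
train = pd.read_csv("train.csv", dtype=dtypes)
train.info(memory_usage="deep")   # reports roughly 2 GB instead of 6 GB
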
 
Models

Naive Prediction

We first decided to create a naive prediction. For this we grouped the training data by Product ID, Client ID, Agency ID and Route ID and simply took the median of each grouping. If a specific grouping did not exist in the training data set, we defaulted back to the product's median demand, and if this also did not exist, we simply took the average of the overall demand.
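
A minimal sketch of this fallback logic, assuming pandas DataFrames named train and test (a merge-based implementation would be faster in practice):

import pandas as pd

keys = ["Producto_ID", "Cliente_ID", "Agencia_ID", "Ruta_SAK"]
group_median = train.groupby(keys)["Demanda_uni_equil"].median()
product_median = train.groupby("Producto_ID")["Demanda_uni_equil"].median()
overall_mean = train["Demanda_uni_equil"].mean()

def naive_predict(row):
    # Most specific grouping first, then the product median, then the overall mean.
    key = tuple(row[k] for k in keys)
    if key in group_median.index:
        return group_median.loc[key]
    if row["Producto_ID"] in product_median.index:
        return product_median.loc[row["Producto_ID"]]
    return overall_mean

test["Demanda_uni_equil"] = test.apply(naive_predict, axis=1)
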
Score

This method resulted in a score of 0.73 when submitted.
 
NLTK based Modelling

Feature Engineering

We utilized the NLTK library to extract the following information from the Producto Tabla file (we used a slightly modified version of code provided by Andrey Vykhodtsev[18]); see the sketch below:
● Weight: In grams
● Pieces
● Brand Name: Extracted through a three-letter acronym
● Short Name: Extracted from the Product Name field. We processed this information using the NLTK library; we first removed the Spanish "stop words" and then used stemming to make sure only the cores of the names remained.

[18] "Exploring products - Kaggle." 2016. 10 Sep. 2016 <https://www.kaggle.com/vykhand/grupo-bimbo-inventory-demand/exploring-products>
 
Image 6: Product Data Names after preprocessing
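
A minimal sketch of the short-name and weight extraction, assuming the NLTK Spanish stopword list and Snowball stemmer (the exact regular expressions used in the original kernel may differ):

import re
from nltk.corpus import stopwords          # may require nltk.download("stopwords")
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("spanish")
spanish_stopwords = set(stopwords.words("spanish"))

def short_name(product_name):
    # Keep only the stemmed, non-stopword word tokens of the product name.
    tokens = re.findall(r"[a-záéíóúñ]+", product_name.lower())
    kept = [stemmer.stem(t) for t in tokens if len(t) > 2 and t not in spanish_stopwords]
    return " ".join(kept)

def weight_grams(product_name):
    # Extract a weight such as "460g" or "1Kg" and normalise it to grams.
    match = re.search(r"(\d+)\s*(kg|g)", product_name.lower())
    if match is None:
        return None
    value, unit = int(match.group(1)), match.group(2)
    return value * 1000 if unit == "kg" else value

print(short_name("Pan Blanco 460g WON 2025"))    # a stemmed name such as "pan blanc won"
print(weight_grams("Pan Blanco 460g WON 2025"))  # 460
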
Modeling

We wanted to model the text data and predict from it. Here are the steps that were taken (a sketch of steps 2-5 follows the list):
1) Separate x and y of the train data.
2) Append the test data to the train data so that they share the same sparse product feature order (if they don't have the same column order, training gives false results).
3) Merge this data with the products.
4) Use the CountVectorizer of Scikit-learn on the brand and short_name columns to create sparse count-word matrices and append them to the train-test data horizontally.
5) Separate the appended train and test data again.
6) Train XGBoost with default parameters on the train data and predict the test data.
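
A minimal sketch of steps 2-5, assuming DataFrames train_x, test_x and a products table with short_name and brand columns:

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

# Stack train and test so the bag-of-words columns line up, then add the product text.
full = pd.concat([train_x, test_x], ignore_index=True)
full = full.merge(products, on="Producto_ID", how="left")

name_counts = CountVectorizer().fit_transform(full["short_name"].fillna(""))
brand_counts = CountVectorizer().fit_transform(full["brand"].fillna(""))
numeric = csr_matrix(full[["Semana", "Agencia_ID", "Canal_ID"]].values.astype(np.float32))

# Horizontal append of the sparse count-word matrices.
features = hstack([numeric, name_counts, brand_counts]).tocsr()

# Split the appended data back into train and test by row count.
n_train = len(train_x)
X_train, X_test = features[:n_train], features[n_train:]
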
Technical Problems

1) Garbage Collection
Garbage collection was a big problem because of the size of the data. When we stopped using a Python object, we had to delete it and force the garbage collection mechanism to free its memory. For this the gc library was used[19].
2) Data Size
Before using XGBoost, we had 70+ million records with 577 columns. Holding this sparse data in memory as a dataframe was impossible. We solved this issue with the sparse matrices of the SciPy library.
In the example below, instead of holding all the data (including zeros) in memory, the sparse method holds only the values different from 0. There are many sparse matrix formats; we used the "CSR" and "COO" ones[20].

[19] "28.12. gc — Garbage Collector interface — Python 2.7.12..." 2014. 21 Sep. 2016 <https://docs.python.org/2/library/gc.html>
[20] "Sparse matrices (scipy.sparse) — SciPy v0.18.1 Reference Guide." 2008. 21 Sep. 2016 <http://docs.scipy.org/doc/scipy/reference/sparse.html>
 
Image 7: Visual explanation of how the COO sparse matrix works.
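
For illustration, a tiny COO matrix built with SciPy; only the non-zero entries are stored as (row, column, value) triplets:

import numpy as np
from scipy.sparse import coo_matrix

rows = np.array([0, 1, 3])
cols = np.array([2, 0, 1])
vals = np.array([4.0, 7.0, 1.0])
sparse = coo_matrix((vals, (rows, cols)), shape=(4, 3))

print(sparse.toarray())      # dense 4x3 view, mostly zeros
sparse_csr = sparse.tocsr()  # CSR form supports fast row slicing and arithmetic
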
Score

The RMSLE scores obtained by using this method were as follows:

Validation   Test 10 (Public)   Test 11 (Private)
0.764        0.775              0.781

Conclusion

These scores are worse than the naive approach, so we started to think about a new model.

Final Models

Digging Deeper in Data Exploration

1. Demanda-Dev-Venta Relationship
On the data description page of the contest, it is stated that Demanda = Venta - Dev, except for some return situations.
When we query this equation, there are 615,000 records which are exceptions, as shown below. This can mean that returns can be made after more than 1 week. We flagged these products.

Image 8: Exceptional cases where the number of returns is higher than the number of orders (lagging returns).

Secondly, we queried for records where the demand and the number of orders are both zero (Demanda = 0 and Venta_uni_hoy = 0); there are 199,767 such records, which contain only returns. When we compute the mean demand of a product, these records can falsify our results as they only include return values.

Image 9: Exceptional cases where the number of orders and demand are both zero.
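
Both checks can be expressed directly in pandas; the flag column name below is the one that appears in the final training table:

# Rows where demand does not equal orders minus returns (lagging returns).
train["DemandaNotEqualTheDifferenceOfVentaUniAndDev"] = (
    train["Demanda_uni_equil"] != train["Venta_uni_hoy"] - train["Dev_uni_proxima"])

# Rows that contain nothing but returns (zero demand and zero orders).
only_returns = train[(train["Demanda_uni_equil"] == 0) & (train["Venta_uni_hoy"] == 0)]
print(train["DemandaNotEqualTheDifferenceOfVentaUniAndDev"].sum(), len(only_returns))
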
2. Train-Test Difference
We analyzed the product, client, agency and route identifiers which exist in train but not in test, and vice versa, or only in specific files.
There were 9,663 clients, 34 products, 0 agencies and 1,012 routes in the test data that do not exist in the train data.
The important outcome of this analysis was that we should build a general model that can handle new products, clients and routes which don't exist in the train data but do appear in the test data.
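
A minimal sketch of this check, assuming train and test DataFrames:

# Identifiers that appear in the test set but never in the training set.
for col in ["Cliente_ID", "Producto_ID", "Agencia_ID", "Ruta_SAK"]:
    unseen = set(test[col].unique()) - set(train[col].unique())
    print(col, len(unseen))
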
Feature Engineering

In order to provide our models with more information, we had to perform some feature engineering.

Agencia

The agencia file shows each agency's town ID and state name. We can merge this file with the train and test data on the Agencia_ID column and encode the state column into integers.

Image 10: Agencia Table after processing.
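
A minimal sketch of this merge and encoding, assuming the town_state.csv columns are Agencia_ID, Town and State:

import pandas as pd

town_state = pd.read_csv("town_state.csv")
# Encode the state names as small integers, then join onto train and test by Agencia_ID.
town_state["State_ID"] = town_state["State"].astype("category").cat.codes.astype("uint8")
train = train.merge(town_state[["Agencia_ID", "State_ID"]], on="Agencia_ID", how="left")
test = test.merge(town_state[["Agencia_ID", "State_ID"]], on="Agencia_ID", how="left")
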
Producto

We used the features from the NLTK model: weights and pieces. In addition to them, we included the short names of products and a brand ID.
In the excerpt from the product file below, we can see the same product with different weights and different IDs. We take the short name of these products (we will add a feature indicating that they are the same product) and include it in the features. Later we will see why.

Product file:
2025, Pan Blanco 460g WON 2025
2027, Pan Blanco 567g WON 2027

Demand Features

This was the most critical part of our data structure. We generated 4 new columns for our training and testing data and named them Lag0, Lag1, Lag2 and Lag3, after asking ourselves why we hadn't yet added the product's previous demands.
Lag0 is a special case that attempts to find the average demand for a specific row. This is done by attempting to find the average based on a large number of variables (as specific as possible) and, failing that, attempting to find the average over a smaller number of variables (a more relaxed, less accurate and more general average).
For example:
● Average demand based on: "Producto_ID", "Cliente_ID", "Ruta_SAK", "Agencia_ID", "Canal_ID"
● If this combination is not found, attempt to find the average based on: "Producto_ID", "Cliente_ID", "Ruta_SAK", "Agencia_ID"
● If this combination is not found, attempt to find the average based on: "Producto_ID", "Cliente_ID", "Ruta_SAK"
● And so on and so forth.
This was done in the order of finding the various averages based on the product ID first, then falling back on averages based on the short names of products (in the Pan Blanco example above, if product 2025 can't be found, we used product 2027 instead, reasoning that product 2027 gives an idea about product 2025), and, failing that, falling back on averages based on the brand names (in the same example, "WON" is used). A sketch of this fallback scheme is given below.
Lag1 through Lag3 were constructed in a similar fashion but were more strict and considered only a single week's data. In these cases we did not want to create any information based on brand names, as it would be too general; only combinations with product ID and product short name were used. So for a line of training data that pertained to week 7, Lag1 would be the averages of that product ID or product name based on week 6 data, Lag2 would be averages of week 5 data, and so on and so forth.
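
A minimal sketch of the fallback scheme for Lag0, assuming the engineered Prod_name_ID and Brand_ID columns are already present on both frames:

import numpy as np
import pandas as pd

# Grouping levels from the most specific to the most general.
GROUPINGS = [
    ["Producto_ID", "Cliente_ID", "Ruta_SAK", "Agencia_ID", "Canal_ID"],
    ["Producto_ID", "Cliente_ID", "Ruta_SAK", "Agencia_ID"],
    ["Producto_ID", "Cliente_ID", "Ruta_SAK"],
    ["Prod_name_ID"],   # products sharing the same short name (2025 vs 2027 above)
    ["Brand_ID"],       # last resort: brand-level average
]

def add_lag0(history, target):
    # Attach the average past demand at the most specific level available per row.
    lag0 = pd.Series(np.nan, index=target.index)
    for keys in GROUPINGS:
        means = (history.groupby(keys, as_index=False)["Demanda_uni_equil"]
                 .mean().rename(columns={"Demanda_uni_equil": "grp_mean"}))
        matched = target[keys].merge(means, on=keys, how="left")["grp_mean"]
        matched.index = target.index
        lag0 = lag0.fillna(matched)  # rows already filled keep their more specific value
    out = target.copy()
    out["Lag0"] = lag0
    return out

test = add_lag0(train, test)
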
Client Features

The client features were more difficult to engineer. Unlike the product table, the client table had a large number of duplicates, where client names were misspelled in different ways. We removed the duplicates from the client table and then used a code snippet provided by Abder Rahman Sobh[21] (the process made use of TF-IDF scoring of the client names and then manual selection of certain keywords) in order to classify the clients based on their types, resulting in the following categorization:
● Individual 353,145
● NO IDENTIFICADO 281,670
● Small Franchise 160,501
● General Market/Mart 66,416
● Eatery 30,419
● Supermarket 16,019
● Oxxo Store 9,313
● Hospital/Pharmacy 5,798
● School 5,705
● Post 2,667
● Hotel 1,127
● Fresh Market 1,069
● Govt Store 959
● Bimbo Store 320
● Walmart 220
● Consignment 14

[21] "Classifying Client Type using Client Names - Kaggle." 2016. 10 Sep. 2016 <https://www.kaggle.com/abbysobh/grupo-bimbo-inventory-demand/classifying-client-type-using-client-names>

General Totals

After obtaining the above averages, we also included the following:
● Total Venta per client (the turnover of the client)
● Total Venta_uni_hoy per client (total product units sold by a client)
● Division of the sum of Venta_hoy by Venta_uni_hoy (giving the approximate price per unit)
● Division of the sum of demand by the sum of Venta_uni (giving the ratio of goods actually sold by the client, i.e. the ability to sell inventory)
This was done for product short names and also product IDs, resulting in an additional 12 columns for our training data. Other added columns are shown below:
● Clients per town
● Sum of returns of a product
● Sum of returns of the short name of a product
After eliminating highly correlated features (above 90%; a sketch of this filter is given after the listing), the training data table was as follows:

Int64Index: 74180464 entries, 0 to 74180463
Data columns (total 36 columns):
Semana                                           uint8
Agencia_ID                                       uint16
Canal_ID                                         uint8
Ruta_SAK                                         uint16
Cliente_ID                                       uint32
Producto_ID                                      uint16
Venta_uni_hoy                                    uint16
Venta_hoy                                        float32
Dev_uni_proxima                                  uint32
Dev_proxima                                      float32
Demanda_uni_equil                                float64
Town_ID                                          uint16
State_ID                                         uint8
weight                                           uint16
pieces                                           uint8
Prod_name_ID                                     uint16
Brand_ID                                         uint8
Demanda_uni_equil_original                       float64
DemandaNotEqualTheDifferenceOfVentaUniAndDev     bool
Lag0                                             float64
Lag1                                             float64
Lag2                                             float64
Lag3                                             float64
weightppieces                                    uint16
Client_Sum_Venta_hoy                             float32
Client_Sum_Venta_uni_hoy                         float32
Client_Sum_venta_div_venta_uni                   float32
prod_name_sum_Venta_hoy                          float32
prod_name_sum_Venta_uni_hoy                      float32
prod_name_sum_venta_div_venta_uni                float32
Producto_sum_Venta_hoy                           float32
Producto_sum_Venta_uni_hoy                       float32
Producto_sum_venta_div_venta_uni                 float32
Producto_ID_sum_demanda_divide_sum_venta_uni     float64
Prod_name_ID_sum_demanda_divide_sum_venta_uni    float64
Cliente_ID_sum_demanda_divide_sum_venta_uni      float64
memory usage: 10.6 GB
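
The correlation filter can be sketched as follows, assuming the feature table is a pandas DataFrame named train:

import numpy as np

# Drop one column from every pair whose absolute correlation exceeds 0.90.
corr = train.select_dtypes(include=[np.number]).corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep the upper triangle only
to_drop = [col for col in upper.columns if (upper[col] > 0.90).any()]
train = train.drop(columns=to_drop)
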
 
Validation Technique

Validation is maybe the most critical part of a data science project. The top priority was to not overfit the data. We used different models to predict week 10 and week 11.

Image 11: Structure of the training, validation and test mechanism.

We used the 6th and 7th week data for training. Our validation for week 10 was the 8th week and our validation for week 11 was the 9th week. In the latter we did not use the Lag1 variable, because predicting week 11 with it would require week 10's demand (Lag1 of week 11 is week 10), which doesn't exist. Alternatively, we could predict week 10 first and use those predicted demands to predict week 11, but that carries the error from week 10 into week 11.
After the feature extraction phase and adding features to each record, we deleted the first 3 weeks, because they don't have the Lag1, Lag2 and Lag3 features.
Xgboost

XGBoost can be given 2 different data sets (train and validation). By playing with the parameters, we can make it train until the validation score stops improving for "N" iterations. It automatically stops and reports the best iteration number and its score.
XGBoost can also report feature importances according to the counts of the features in the trees of the model. For example:

Image 12: Feature importance graph of the fitted training data based on XGBoost.
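
A minimal sketch of this early-stopping setup with the xgboost Python API (the parameter values shown are the tuned ones reported later; the variable names are assumptions):

import matplotlib.pyplot as plt
import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_valid, label=y_valid)
params = {"objective": "reg:linear", "eta": 0.05, "max_depth": 22,
          "subsample": 0.9, "colsample_bytree": 0.4, "eval_metric": "rmse"}

# Stop when the validation RMSE has not improved for 10 rounds; since the target
# is log-transformed, RMSE here corresponds to the contest's RMSLE.
model = xgb.train(params, dtrain, num_boost_round=1000,
                  evals=[(dtrain, "train"), (dvalid, "valid")],
                  early_stopping_rounds=10)
print(model.best_iteration, model.best_score)

xgb.plot_importance(model)   # counts of feature usage across the trees of the model
plt.show()
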
Training

We started to make models after defining the validation strategy and the feature extraction.

Features                                              Validation 1 (Week 8)   Validation 2 (Week 9)
Trial 1                                               0.476226                0.498475
Trial 2: Removing highly correlated features          0.477067                0.493038
Trial 3: Adding lag interactions                      0.502514                N/A
Trial 4: Adding more lag interactions                 0.51825                 N/A
Trial 5: Lag interactions but removing more
         correlated features                          0.517606                N/A
Trial 6: Replacing extreme values with NaN            0.517467                0.517375
Trial 7: Removing low importance features (all
         lag interactions are removed)                0.480394                0.494104
Trial 8: Adding client types                          0.48101                 0.494804

Many other variations were tried but abandoned due to poor performance. Interestingly, the original dataset (with engineered features such as averages, lags etc.) resulted in the best performance. There is a caveat, however: these attempts were all made with a fixed setting in XGBoost and, as will be seen next, the number of trees may have been set too low in these trials to take into account the benefits of added features such as interactions between lags or client types.

Hyperparameter Tuning

After selecting the dataset, we proceeded with hyperparameter tuning of the XGBoost model. The XGBoost library has numerous parameters; the ones that were used for tuning were:

Max Depth:
The maximum depth of the decision trees.
● Values tried: 10, 12, 8, 6, 14, 18, 20, 22
● Optimal value: 22

Subsample:
The subsampling rate of the rows of the data.
● Values tried: 1, 0.9, 0.8, 0.6
● Optimal value: 0.9

ColSampleByTree:
The subsampling rate of the columns of the data.
● Values tried: 0.4, 0.3, 0.5, 0.6, 0.8, 1
● Optimal value: 0.4

Learning Rate:
The gradient descent optimization parameter (the size of each step).
● Values tried: 0.1, 0.05
● Optimal value: 0.05

Features                             Validation 1 (Week 8)   Validation 2 (Week 9)
Original Training                    0.476226                0.498475
Training after Parameter Tuning      0.469628                0.489799

Technical Problems

Storing Data

The CSV file type is very slow to load and save. In addition to that, it isn't self-describing: when we try to load data from it, we have to redo all the type conversions we did before saving it. We searched for a better file format to store that much data.
First, we tried the pickle library, which we had used for storing XGBoost models, because of its self-describing feature; but once the file size gets bigger, it starts to give errors. Second, we tried HDF5, which is designed for storing big data on disk. It was both very fast to load and save and also self-describing, so we picked it. A minimal sketch of this storage approach follows.
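
Shown with pandas (the key name and path are assumptions; pandas uses PyTables underneath):

import pandas as pd

# Dtypes are preserved in the HDF5 store, so no re-conversion is needed on load.
train.to_hdf("train_wz.h5", key="train", mode="w")
train = pd.read_hdf("train_wz.h5", key="train")
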
RAM Problem

Due to the size of the training and test tables, it was not possible to perform the operations using our underpowered laptops. Attempting to join large tables or use XGBoost to create models always resulted in memory errors. We solved this issue by migrating our environment to Google Cloud Compute. We used Linux command line prompts to install Anaconda and related libraries and then launched a Jupyter notebook to create a development environment. At its highest load, our instance (with 32 virtual CPUs and 208 GB RAM) was running at 100% CPU load and 40% RAM usage. Training and predicting over our full train and test data took more than 2 hours.

Code Reuse and Automatization

There were lots of coding challenges for us, as follows:
● Opening CSV files with predefined data types and names
● Handling HDF5 files
● Adding configurable features (Lag0, Lag1, ...) to the data
● Automatically deleting the first "N" lagged weeks from the train data
● Appending test to train data
● Separating test and train data automatically
● Configurable XGBoost hyperparameter tuning
● Handling memory issues
We solved these issues with object-oriented programming in Python. This is the structure of our general class:

class FeatureEngineering:
    def __init__(self, ValidationStart, ValidationEnd, trainHdfPath, trainHdfFile,
                 testHdfPath1, testHdfPath2, testHdfFile, testTypes, trainTypes,
                 trainCsvPath, testCsvPath, maxLag=0)
    def __printDataFrameBasics__(data)
    def ReadHdf(self, trainOrTestOrBoth)
    def ReadCsv(self, trainOrTestOrBoth)
    def ConvertCsvToHdf(csvPath, HdfPath, HdfName, ColumnTypeDict)
    def Preprocess(self, trainOrTestOrBoth, columnFunctionTypeList)
    def SaveDataFrameToHdf(self, trainOrTestOrBoth)
    def AddConfigurableFeaturesToTrain(self, config)
    def DeleteLaggedWeeksFromTrain(self)
    def ReadFirstNRowsOfACsv(self, nrows, trainOrTestOrBoth)
    def AppendTestToTrain(self, deleteTest=True)
    def SplitTrainToTestUsingValidationStart(self)

We can use this class by giving it configurable parameters:

parameterDict = {"ValidationStart": 8, "ValidationEnd": 9, "maxLag": 3,
    "trainHdfPath": '../../input/train_wz.h5', "trainHdfFile": "train",
    "testHdfPath1": "../../input/test1_wz.h5", "testHdfPath2": "../../input/test2_wz.h5",
    "testHdfFile": "test",
    "trainTypes": {'Semana': np.uint8, 'Demanda_uni_equil': np.uint32},
    "testTypes": {'id': np.uint32, 'Semana': np.uint8, 'Agencia_ID': np.uint16},
    "trainCsvPath": '../../input/train.csv', "testCsvPath": '../../input/test.csv'}
FE = FeatureEngineering(**parameterDict)

To add a complex lagged feature, we built an automation system which works with a config variable:

configLag0Target1DeleteColumnsFalse = ConfigElements(0, [
    ("SPClRACh0_mean",
     ["Producto_ID", "Cliente_ID", "Ruta_SAK", "Agencia_ID", "Canal_ID"], ["mean"]),
    ("SPClRA0_mean",
     ["Producto_ID", "Cliente_ID", "Ruta_SAK", "Agencia_ID"], ["mean"]),
    ("SB0_mean", ["Brand_ID"], ["mean"])], "Lag0", True)
FE.AddConfigurableFeaturesToTrain(configLag0Target1DeleteColumnsFalse)

To do hyperparameter tuning automatically, we wrote a Python function:

defaultParams = {"max_depth": 10, "subsample": 1., "colsample_bytree": 0.4, "missing": np.nan,
                 "n_estimators": 500, "learning_rate": 0.1}
testParams = [("max_depth", [12, 8, 6, 14, 16, 18, 20, 22]), ("subsample", [0.9, 0.8, 0.6]),
              ("colsample_bytree", [0.3, 0.5, 0.6, 0.8, 1]), ("learning_rate", [0.05])]
fitParams = {"verbose": 2, "early_stopping_rounds": 10}
GiveBestParameterWithoutCV(defaultParams, testParams, X_train, X_test, y_train, y_test,
                           fitParams)

Results

Over the course of the contest, we can name 4 milestone submissions. The validation, public and private scores of these submissions are shown below.

Model                                Validation 1   Validation 2   Public Score   Private Score
Naive (averages)                     0.736          -              0.734          0.754
Optimized with Product Data
via NLTK                             0.764          -              0.775          0.781
XGBoost with default parameters      0.476226       0.498475       0.46949        0.49596
XGBoost with parameter tuning        0.469628       0.489799       0.46257        0.48666

We can see from the results that we did not overfit the data at any point.
For our final submission of predictions, we achieved a score of 0.48666, placing our team in the top 17% of the 2000 contestants.

Looking over the scores of other participants, we would like to say that for first-time participants in a Kaggle contest, our results were very promising.

Conclusion

We are extremely happy that we picked a Kaggle contest for our project. It allowed us to work on a common real-world problem while giving us a benchmark of our abilities in the global arena.
We learned to leverage powerful cloud computing capabilities across various platforms, to manipulate large datasets under memory and computing power constraints, and to use the XGBoost library for training and testing purposes.
We also learned how to use important tools like command prompts to launch development environments and GitHub for code sharing and collaboration.

Critical Mistakes

Poor data exploration
We performed very little data exploration on our own. We mainly depended on the data exploration that was done by other Kagglers. This resulted in sub-optimal solutions in our training and testing, as we did not exclude outliers etc.

Not preparing for system outages
We faced one outage while using the Google Cloud Preemptible Instance (possibly due to high demand from other clients) which caused a key data file to become corrupted. The re-creation of this data file cost us over 5 hours of work. In the future, it would be preferable if the system listened for the "shut-down" signals that are sent by the platform and took the necessary steps to prevent the corruption of this data.

Performing hyperparameter tuning too late
In our process we initially performed feature selection using a fixed set of parameters for XGBoost and then proceeded to the hyperparameter tuning step. However, it became apparent that some features were being given lower scores because our initial set of parameters was not optimal for a high number of feature columns. Specifically, the depth of the trees was set to 6 in our initial feature selection; when this was increased to 22, it became apparent that the features that were originally dropped could have been good predictors.

Further Exploration

If we had more time and resources, we would have liked to undertake additional actions.

Partial Fitting
When faced with the memory problem we decided to use cloud services. However, another method would have been loading and processing the data in smaller batches. This would be a more scalable model and could even be used to create a cluster of cloud machines to perform operations in parallel.
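
A minimal sketch of such batch processing with pandas, shown for a simple per-product aggregate (the chunk size and column choice are arbitrary):

import pandas as pd

# Aggregate the 3 GB train.csv in pieces instead of loading it all at once.
product_totals = None
for chunk in pd.read_csv("train.csv", chunksize=5000000):
    part = chunk.groupby("Producto_ID")["Demanda_uni_equil"].sum()
    product_totals = part if product_totals is None else product_totals.add(part, fill_value=0)
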
Multiple Models
Although XGBoost is a very effective tool, it gives a single model (or in our case, 2 models, one for each week). We would like to explore the possibility of creating a larger number of models using different systems and seeing how they perform for different slices of the data. We would then take some sort of weighted average of these predictions to reach our final prediction.
As an extension to this idea, we would also perform parameter tuning across these various models to find optimal solutions for each one.

Neural Networks
We would also have liked to approach this problem with a neural network solution to see the accuracy of its predictions and to compare the performance of the neural network solution vs the XGBoost tool.