The Hitchhiker’s Guide to Kaggle July 27, 2011 firstname.lastname@example.org [doubleclix.wordpress.com] email@example.com
The Amateur Data Scientist ◦ Analytics competitions ◦ Algorithms (CART, randomForest) ◦ Tools ◦ Datasets – Titanic, Churn, Ford, HHP (competition in flight)
Encounters
1st ◦ This workshop
2nd ◦ Do the hands-on walkthrough – I will post the walkthrough scripts in ~10 days
3rd ◦ Participate in the HHP & other competitions
Goals of this workshop
1. Introduction to analytics competitions from a data, algorithms & tools perspective
2. End-to-end flow of a Kaggle competition – Ford
3. Introduction to the Heritage Health Prize competition
4. Materials for you to explore further ◦ Lots more slides ◦ Walkthrough – will post in 10 days
Agenda
Algorithms for the amateur data scientist [25 min] ◦ Algorithms, tools & frameworks in perspective
The art of analytics competitions [10 min] ◦ The Kaggle challenges
How the RTA & Ford were won – anatomy of a competition [15 min] ◦ Predicting Ford using trees ◦ Submit an entry
Competition in flight – the Heritage Health Prize [30 min] ◦ Walkthrough: introduction, dataset organization, analytics ◦ Submit our entry
Conclusion [5 min]
ALGORITHMS FOR THE AMATEUR DATA SCIENTIST
Algorithms! The most massively useful thing an amateur data scientist can have …
"A towel is about the most massively useful thing an interstellar hitchhiker can have … any man who can hitch the length and breadth of the Galaxy, rough it … win through, and still know where his towel is, is clearly a man to be reckoned with." – From The Hitchhiker's Guide to the Galaxy, by Douglas Adams. Published by Harmony Books in 1979
The Amateur Data Scientist ◦ I am not a quant or an ML expert ◦ School of Amz, Springer & UTube ◦ For the rest of us
References I used (refs also on the respective slides): ◦ The Elements of Statistical Learning (a.k.a. ESLII) by Hastie, Tibshirani & Friedman ◦ Statistical Learning from a Regression Perspective by Richard Berk
As Jeremy says, you can dig into these as needed ◦ You need not be an expert in the R toolbox
Jeremy's Axioms ◦ Iteratively explore data ◦ Tools: Excel format, Perl, Perl Book ◦ Get your head around the data – pivot tables ◦ Don't over-complicate ◦ If people give you data, don't assume that you need to use all of it ◦ Look at pictures! ◦ Keep a tab on the history of your submissions ◦ Don't be afraid to submit simple solutions – we will do this during this workshop
Ref: http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-sciencetalk-by-jeremy-howard/
1. Don't throw away any data! – big data to smart data
2. Be ready for different ways of organizing the data (summaries)
Users apply different techniques: Support Vector Machines, AdaBoost, Bayesian Networks, Decision Trees, Ensemble Methods, Random Forests, Logistic Regression, Genetic Algorithms, Monte Carlo Methods, Principal Component Analysis, Kalman Filters, Evolutionary Fuzzy Modelling, Neural Networks
Quora: http://www.quora.com/What-are-the-top-10-data-mining-or-machine-learning-algorithms
Ref: Anthony's Kaggle Presentation!
Let us take a 15-min overview of the algorithms ◦ Relevant in the context of this workshop ◦ From the perspective of the datasets we plan to use ◦ More qualitative than mathematical – to get a feel for the how & the why
Concepts: Bias, Variance, Model Complexity, Over-fitting ◦ Continuous variables → Linear Regression ◦ Categorical variables → Classifiers: k-NN (Nearest Neighbors), Decision Trees (CART), Bagging, Boosting
Datasets ◦ Titanic passenger metadata – small, 3 predictors (Class, Sex, Age; target: Survived?) ◦ Customer churn – 17 predictors ◦ Kaggle competition: Ford "Stay Alert" challenge – simple dataset ◦ Heritage Health Prize data – complex, competition in flight
http://www.ohgizmo.com/2007/03/21/romain-jerome-titanic!
http://www.homestoriesatoz.com/2011/06/blogger-to-wordpress-a-fish-out-of-water.html!
Titanic dataset ◦ Taken from the passenger manifest ◦ A good candidate for a decision tree
CART [Classification & Regression Trees] ◦ Greedy, top-down, binary, recursive partitioning that divides the feature space into a set of disjoint rectangular regions ◦ CART in R
http://www.dabi.temple.edu/~hbling/8590.002/Montillo_RandomForests_4-2-2009.pdf!
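The greedy split search at the heart of CART can be sketched briefly. The workshop itself does this in R with rpart/rattle; what follows is an illustrative Python sketch on a made-up Titanic-like toy dataset, showing only how the root split is chosen by minimising weighted Gini impurity:

```python
# Minimal sketch of CART's greedy root-split search (Gini criterion).
# The rows/labels below are invented toy data, not the real Titanic manifest.

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(rows, labels):
    """Find the (feature, value) binary split minimising weighted Gini."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        for v in set(r[f] for r in rows):
            left = [y for r, y in zip(rows, labels) if r[f] == v]
            right = [y for r, y in zip(rows, labels) if r[f] != v]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, v)
    return best

# Toy rows: (sex, passenger class) -> survived?
rows = [("female", 1), ("female", 3), ("male", 1), ("male", 3), ("male", 3)]
labels = [1, 1, 0, 0, 0]
score, feature, value = best_split(rows, labels)
print(score, feature)  # → 0.0 0  (sex separates the toy labels perfectly)
```

CART then recurses on each side of the chosen split, which is what produces the disjoint rectangular regions described above.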
Titanic dataset – R walkthrough ◦ Load libraries ◦ Load data ◦ Model with CART via rattle() ◦ Inspect the tree ◦ Discussion
[Decision-tree figure: splits on Male?, Adult?, 3rd class?]
[CART tree figure: root split on Male?/Female; further splits on Adult?/Child and 3rd class]
CART lessons:
1. Do not over-fit
2. All predictors are not needed
3. All data rows are not needed
4. Tuning the algorithms will give different results
Churn data ◦ Predict churn based on service calls, v-mail and so forth
Challenges ◦ Model complexity – a complex model increases the fit on training data, but then over-fits and doesn't perform as well on real data ◦ Bias vs. variance – the classical diagram (prediction error vs. training error) from ESLII by Hastie, Tibshirani & Friedman
Goal: Model complexity (-), Variance (-), Prediction accuracy (+)
Solution #1: Partition the data ◦ Training (60%), Validation (20%) & "Vault" test (20%) datasets
k-fold cross-validation ◦ Split the data into k equal parts ◦ Fit the model on k-1 parts & calculate the prediction error on the kth part ◦ Non-overlapping datasets
But the fundamental problem still exists!
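The k-fold procedure above can be sketched in a few lines. The workshop does this in R; here is an illustrative Python sketch where the fit/predict/loss callables and the toy data are placeholders (the "model" is just the training mean):

```python
# Sketch of k-fold cross-validation: fit on k-1 folds, measure prediction
# error on the held-out kth fold, average over the k folds.

def k_fold_indices(n, k):
    """Split range(n) into k non-overlapping folds."""
    return [list(range(n))[i::k] for i in range(k)]

def cross_validate(xs, ys, k, fit, predict, loss):
    n = len(xs)
    fold_errors = []
    for held_out in k_fold_indices(n, k):
        held = set(held_out)
        train = [i for i in range(n) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errs = [loss(predict(model, xs[i]), ys[i]) for i in held_out]
        fold_errors.append(sum(errs) / len(errs))
    return sum(fold_errors) / k          # average error over the k folds

# Toy data; the "model" is deliberately weak: the mean of the training ys.
xs = list(range(10)); ys = [2.0 * x for x in xs]
cv_err = cross_validate(xs, ys, 5,
                        fit=lambda X, Y: sum(Y) / len(Y),
                        predict=lambda m, x: m,
                        loss=lambda p, y: (p - y) ** 2)
print(cv_err)  # → 37.5
```

Note the folds are disjoint, matching the slide's "non-overlapping" point; the bootstrap in Solution #2 drops exactly that restriction.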
Goal: Model complexity (-), Variance (-), Prediction accuracy (+)
Solution #2: Bootstrap ◦ Draw datasets (with replacement) and fit a model for each dataset ◦ Remember: data partitioning (#1) & cross-validation (#2) are without replacement
Bagging (bootstrap aggregation) ◦ Average the prediction over a collection of bootstrapped samples, thus reducing variance
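A minimal sketch of bagging as just described, in illustrative Python (the workshop uses R). The base learner is a deliberately weak stand-in (the sample mean); the point is only the with-replacement sampling and the averaging:

```python
# Bagging sketch: fit a model on each bootstrap sample (drawn WITH
# replacement, unlike Solution #1's partitioning), then average predictions.
import random

def bootstrap_sample(data, rng):
    return [rng.choice(data) for _ in data]     # same size, with replacement

def bagged_predict(data, n_models, fit, predict, x, seed=42):
    rng = random.Random(seed)
    preds = [predict(fit(bootstrap_sample(data, rng)), x)
             for _ in range(n_models)]
    return sum(preds) / n_models                # averaging reduces variance

# Toy base learner: predict the mean y of the bootstrap sample.
data = [(x, 2 * x + 1) for x in range(20)]
fit = lambda sample: sum(y for _, y in sample) / len(sample)
predict = lambda model, x: model
est = bagged_predict(data, 50, fit, predict, x=10)
print(round(est, 1))
```
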
Goal: Model complexity (-), Variance (-), Prediction accuracy (+)
Solution #3: Boosting ◦ "Output of weak classifiers into a powerful committee" ◦ Final prediction = weighted majority vote ◦ Later classifiers get the misclassified points with higher weight, so they are forced to concentrate on them ◦ AdaBoost (Adaptive Boosting) ◦ Boosting vs. bagging: bagging – independent trees; boosting – successively weighted
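The weight-update idea behind AdaBoost can be sketched with 1-D threshold "stumps" as the weak classifiers; the dataset and round count below are made up for illustration (the workshop itself stays in R):

```python
# AdaBoost sketch: after each round, misclassified points gain weight so later
# stumps concentrate on them; the final output is a weighted majority vote.
import math

def stump_fit(xs, ys, w):
    """Pick the (threshold, polarity) stump with the lowest weighted error."""
    best = None
    for t in xs:
        for pol in (1, -1):
            err = sum(wi for wi, x, y in zip(w, xs, ys)
                      if (pol if x >= t else -pol) != y)
            if best is None or err < best[0]:
                best = (err, t, pol)
    return best

def adaboost(xs, ys, rounds):
    n = len(xs)
    w = [1.0 / n] * n
    learners = []
    for _ in range(rounds):
        err, t, pol = stump_fit(xs, ys, w)
        err = max(err, 1e-10)                    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # this stump's vote weight
        learners.append((alpha, t, pol))
        # Up-weight misclassified points, down-weight the rest, renormalise.
        w = [wi * math.exp(-alpha * y * (pol if x >= t else -pol))
             for wi, x, y in zip(w, xs, ys)]
        s = sum(w)
        w = [wi / s for wi in w]
    return learners

def committee_predict(learners, x):
    vote = sum(a * (pol if x >= t else -pol) for a, t, pol in learners)
    return 1 if vote >= 0 else -1                # weighted majority vote

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [-1, -1, -1, 1, 1, 1, -1, 1]                # no single stump fits this
model = adaboost(xs, ys, rounds=10)
correct = sum(committee_predict(model, x) == y for x, y in zip(xs, ys))
print(correct, "of", len(xs), "training points classified correctly")
```

The best single stump here misclassifies one point; the committee recovers it, which is exactly the "weak classifiers into a powerful committee" effect.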
Goal: Model complexity (-), Variance (-), Prediction accuracy (+)
Solution #4: Random Forests+ ◦ Builds a large collection of de-correlated trees & averages them ◦ Improves on bagging by selecting i.i.d* random variables for splitting ◦ Simpler to train & tune ◦ "Do remarkably well, with very little tuning required" – ESLII ◦ Less susceptible to over-fitting (than boosting) ◦ Many RF implementations – original version in Fortran-77! by Breiman/Cutler; R, Mahout, Weka, Milk (ML toolkit for Python), MATLAB
* i.i.d – independent, identically distributed
+ http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
Goal: Model complexity (-), Variance (-), Prediction accuracy (+)
Solution – General: Ensemble methods ◦ Two steps: develop a set of learners, then combine the results into a composite predictor ◦ Ensemble methods can take the form of: using different algorithms; using the same algorithm with different settings; assigning different parts of the dataset to different classifiers ◦ Bagging & random forests are examples of ensemble methods
Ref: Machine Learning In Action
Random Forests ◦ While boosting splits based on the best among all variables, RF splits based on the best among randomly chosen variables ◦ Simpler because it has only two tuning parameters – the number of predictors tried at each split (typically √k) & the number of trees (500 for a large dataset, 150 for a smaller one) ◦ Error prediction: for each iteration, predict for the data not in the sample (OOB data); aggregate the OOB predictions; calculate the prediction error for the aggregate, which is basically the OOB estimate of the error rate. Can use this to search for the optimal # of predictors ◦ We will see how close this is to the actual error in the Heritage Health Prize ◦ Assumes equal cost for mis-prediction; can add a cost function ◦ Proximity matrix & applications like imputing missing data and dropping outliers
Ref: R News Vol 2/3, Dec 2002; Statistical Learning from a Regression Perspective: Berk; A Brief Overview of RF by Dan Steinberg
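The OOB bookkeeping described above can be sketched as follows; this is an illustrative Python sketch (randomForest in R does this internally), with a trivial majority-class learner standing in for a tree:

```python
# Sketch of the out-of-bag (OOB) error estimate: for each bootstrap sample,
# predict the rows NOT drawn (the OOB rows), then aggregate those predictions
# per row by majority vote and compare against the true labels.
import random

def oob_error(xs, ys, n_models, fit, predict, seed=7):
    rng = random.Random(seed)
    n = len(xs)
    votes = [[] for _ in range(n)]                  # OOB predictions per row
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap with replacement
        in_bag = set(idx)
        model = fit([ys[i] for i in idx])
        for i in range(n):
            if i not in in_bag:                     # row i is OOB this round
                votes[i].append(predict(model, xs[i]))
    wrong = sum(1 for i in range(n)
                if votes[i] and max(set(votes[i]), key=votes[i].count) != ys[i])
    counted = sum(1 for v in votes if v)
    return wrong / counted                          # OOB estimate of error rate

# Toy base learner: always predict the majority class of its bootstrap sample.
ys = [0] * 8 + [1] * 2
xs = list(range(10))
fit = lambda sample_ys: max(set(sample_ys), key=sample_ys.count)
predict = lambda model, x: model
err = oob_error(xs, ys, 100, fit, predict)
print(err)
```

Since each row is out-of-bag in roughly a third of the bootstrap samples, the OOB estimate comes almost for free, with no separate validation set.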
Lots more to explore (homework!) ◦ Loss matrix – e.g. telecom churn: better to give incentives to false positives (who are not leaving) than to miss incentives for false negatives (who are leaving) ◦ Missing values ◦ Additive models ◦ Bayesian models ◦ Gradient boosting
Ref: http://www.louisaslett.com/Courses/Data_Mining_09-10/ST4003-Lab4-New_Tree_Data_Set_and_Loss_Matrices.pdf
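A small sketch of the loss-matrix idea for churn: the cost numbers below are invented for illustration, but they encode the asymmetry above (a lost churner costs far more than a wasted incentive), so the cheapest decision is no longer "predict the most likely class":

```python
# Cost-sensitive decision sketch: pick the prediction that minimises
# EXPECTED cost under an asymmetric loss matrix (made-up illustrative costs).

# cost[(actual, predicted)] — predicting "churn" triggers an incentive offer.
cost = {
    ("stay",  "stay"):  0.0,
    ("stay",  "churn"): 10.0,    # incentive wasted on a loyal customer
    ("churn", "stay"):  100.0,   # churner lost: much more expensive
    ("churn", "churn"): 10.0,    # incentive spent, customer kept
}

def min_cost_decision(p_churn):
    """Pick the prediction with the lowest expected cost given P(churn)."""
    expected = {
        pred: (1 - p_churn) * cost[("stay", pred)]
              + p_churn * cost[("churn", pred)]
        for pred in ("stay", "churn")
    }
    return min(expected, key=expected.get)

# With these costs the break-even churn probability is only 10%:
print(min_cost_decision(0.05), min_cost_decision(0.5))  # → stay churn
```
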
"I keep saying the sexy job in the next ten years will be statisticians." – Hal Varian, Google Chief Economist, 2009
Crowdsourcing ◦ Mismatch between those with data and those with the skills to analyse it
Tourism Forecasting Competition [chart: forecast error (MASE) falling below the existing model, from Aug 9 through 2 weeks later, 1 month later, and the competition end]
Chess Ratings Competition [chart: error rate (RMSE) vs. the existing model (ELO), from Aug 4 through 1 month and 2 months later, to today]
12,500 “Amateur” Data Scientists with different backgrounds
Tools: R, Matlab, SAS, WEKA, SPSS, Python, Excel, Mathematica, Stata [charts: tool usage on Kaggle vs. among academics vs. among Americans] Ref: Anthony's Kaggle Presentation!
Mapping Dark Matter is an image-analysis competition whose aim is to encourage the development of new algorithms that can be applied to the challenge of measuring the tiny distortions in galaxy images caused by dark matter. ◦ ~25% successful grant applications ◦ NASA tried, now it's our turn
"The world's brightest physicists have been working for decades on solving one of the great unifying problems of our universe" ◦ "In less than a week, Martin O'Leary, a PhD student in glaciology, outperformed the state-of-the-art algorithms"
How the Ford Competition was won – "How I Did It" blogs:
http://blog.kaggle.com/2011/03/25/inference-on-winning-the-ford-stay-alert-competition/
http://blog.kaggle.com/2011/04/20/mick-wagner-on-finishing-second-in-the-ford-challenge/
http://blog.kaggle.com/2011/03/16/junpei-komiyama-on-finishing-4th-in-the-ford-competition/
How the Ford Competition was won – Junpei Komiyama (#4) ◦ To solve this problem, I constructed a Support Vector Machine (SVM), which is one of the best tools for classification and regression analysis, using the libSVM package ◦ This approach took more than 3 hours to complete ◦ I found some data (P3-P6) were characterized by strong noise... Also, many environmental and vehicular data showed discrete values continuously increased and decreased. These suggested the necessity of pre-processing the observation data before SVM analysis for better performance
How the Ford Competition was won – Junpei Komiyama (#4) ◦ Averaging improved both score and processing time ◦ Averaging 7 data points reduced processing by 86% & increased the score by 0.01 ◦ Tools: Python processing of CSV; libSVM
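The 7-point averaging above can be sketched in a couple of lines; the series below is made up, and Junpei's actual script is not public here, so this is only an illustration of the idea:

```python
# Sketch of the 7-point averaging pre-processing: replace each non-overlapping
# run of 7 samples with its mean, shrinking (and smoothing) the series ~7x.

def average_blocks(series, block=7):
    return [sum(series[i:i + block]) / len(series[i:i + block])
            for i in range(0, len(series), block)]

noisy = [10, 12, 8, 11, 9, 13, 7,          # first 7 samples, mean 10
         20, 22, 18, 21, 19, 23, 17]       # next 7 samples, mean 20
print(average_blocks(noisy))               # → [10.0, 20.0]
```

This both suppresses the noise he observed in P3-P6 and cuts the SVM training time, since the training set is ~7x smaller.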
How the Ford Competition was won Mick Wagner (#2) ◦ Tools Excel, SQL Server ◦ I spent the majority of my time analyzing the data. I inputted the data into Excel and started examining the data taking note of discrete and continuous values, category based parameters, and simple statistics (mean, median, variance, coefficient of variance). I also looked for extreme outliers. ◦ I made the first 150 trials (~30%) be my test data and the remainder be my training dataset (~70%). This single factor had the largest impact on the accuracy of my final model. ◦ I was concerned that using the entire data set would create too much noise and lead to inaccuracies in the model … so focussed on data with state change
How the Ford Competition was won – Mick Wagner (#2) ◦ After testing the Decision Tree and Neural Network algorithms against each other and submitting models to Kaggle, I found the Neural Network model to be more accurate ◦ Only used E4, E5, E6, E7, E8, E9, E10, P6, V4, V6, V10, and V11
How the Ford Competition was won – Inference (#1) ◦ Very interesting ◦ "Our first observation is that trials are not homogeneous – so calculated mean, sd et al" ◦ "Training set & test set are not from the same population" – a good fit for training will result in a low score ◦ Lucky model (regression): -410.6073·sd(E5) + 0.1494·V11 + 4.4185·E9 ◦ (Remember – the data had P1-P8, E1-E11, V1-V11)
HOW THE RTA WAS WON
"This competition requires participants to predict travel time on Sydney's M4 freeway from past travel time observations."
Thanks to ◦ François GUILLEM & ◦ Andrzej Janusz – they both used R and shared their code & algorithms
How the RTA was won ◦ "I effectively used R for the RTA competition. For my best submission, I just used simple techniques (OLS and means) but in a clever way" – François GUILLEM (#14) ◦ "I used a simple k-NN approach, but the idea was to process the data first & to compute some summaries of time series in consecutive timestamps using some standard indicators from technical analysis" – Andrzej Janusz (#17)
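Andrzej's summarise-then-k-NN idea can be sketched as follows; this is an illustrative Python sketch (he worked in R), and the two summary statistics here (mean and end-to-end trend) are stand-ins for his technical-analysis indicators:

```python
# k-NN on time-series SUMMARIES rather than raw series: compress each series
# to a feature vector first, then classify by majority vote of the k nearest.

def summarise(series):
    """Toy 2-feature summary: (mean level, end-to-end trend)."""
    return (sum(series) / len(series), series[-1] - series[0])

def knn_predict(train, query, k=3):
    """train: list of (summary, label); majority vote of the k nearest."""
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    nearest = sorted(train, key=lambda t: dist(t[0], query))[:k]
    labels = [lab for _, lab in nearest]
    return max(set(labels), key=labels.count)

train = [(summarise([1, 2, 3]), "rising"), (summarise([2, 3, 4]), "rising"),
         (summarise([4, 3, 2]), "falling"), (summarise([5, 4, 3]), "falling"),
         (summarise([3, 2, 1]), "falling")]
print(knn_predict(train, summarise([2, 3, 5]), k=3))  # → rising
```
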
How the RTA was won ◦ #1 used Random Forests with time, date & week as predictors – José P. González-Brenes and Matías Cortés ◦ Regression models for data segments (~600 in total!); tools: Java/Weka; 4 processors, 12 GB RAM; 48 hours of computation – Marcin Pionnier (#5)
Ref: http://blog.kaggle.com/2011/02/17/marcin-pionnier-on-finishing-5th-in-the-rta-competition/
Ref: http://blog.kaggle.com/2011/03/25/jose-p-gonzalez-brenes-and-matias-cortes-on-winning-the-rta-challenge/
Lessons from Kaggle winners:
1. Don't over-fit
2. All predictors are not needed
3. All data rows are not needed, either
4. Tuning the algorithms will give different results
5. Reduce the dataset (average, select transition data, …)
6. Test set & training set can differ
7. Iteratively explore & get your head around the data
8. Don't be afraid to submit simple solutions
9. Keep a tab & a history of your submissions
The Competition: "The goal of the prize is to develop a predictive algorithm that can identify patients who will be admitted to the hospital within the next year, using historical claims data"
Data Organization
◦ Members – 113,000 entries; ID, Age at 1st Claim, Sex; missing values
◦ Claims – 2,668,990 entries; MemberID, ProvID, Vendor, PCP, Year, Speciality, PlaceOfSvc, PayDelay (162+ truncated), LengthOfStay, DaysSinceFirstClaimThatYear, PrimaryConditionGroup, CharlsonIndex, ProcedureGroup, SupLOS; missing values; different coding; claims truncated; SupLOS – length of stay is suppressed during the de-identification process for some entries
◦ DaysInHospital Y2 – 76,039 entries; DaysInHospital Y3 – 71,436 entries; DaysInHospital Y4 (target) – 70,943 entries; MemberID, DaysInHospital; lots of zeros
◦ LabCount – 361,485 entries; MemberID, Year, DSFS, LabCount; fairly consistent coding (10+)
◦ DrugCount – 818,242 entries; MemberID, Year, DSFS, DrugCount; fairly consistent coding (10+)
Calculation & prizes ◦ Prediction error rate ◦ Deadlines: Aug 31, 2011 06:59:59 UTC; Feb 13, 2012; Sep 04, 2012; final deadline Apr 04, 2013
POA ◦ Load the data into SQLite ◦ Use SQL to de-normalize & pick out datasets ◦ Load them into R for analytics
Total/distinct counts: ◦ Claims = 2,668,991/113,001 ◦ Members = 113,001 ◦ Drug = 818,242/75,999 <- unique = 141,532/75,999 (test) ◦ Lab = 361,485/86,640 <- unique = 154,935/86,640 (test) ◦ dih_y2 = 76,039 distinct / 11,770 with dih > 0 ◦ dih_y3 = 71,436 distinct / 10,730 with dih > 0 ◦ dih_y4 = 70,943 distinct
Idea #1
dih_Y2 = β0 + β1·dih_Y1 + β2·DC + β3·LC
dih_Y3 = β0 + β1·dih_Y2 + β2·DC + β3·LC
dih_Y4 = β0 + β1·dih_Y3 + β2·DC + β3·LC
select count(*) from dih_y2 join dih_y3 on dih_y2.member_id = dih_y3.member_id;
Y2-Y3 overlap = 51,967 (8,339 with dih_y2 > 0) / Y3-Y4 overlap = 49,683 (7,699 with dih_y3 > 0)
The data is not straightforward to get into this shape ◦ Summarize drug and lab counts by member & year ◦ Split by year to get DC & LC per year ◦ Add to the dih_Yx tables ◦ Linear regression
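The plan fits these regressions in R (lm()); as a sketch of what the fit computes, here is closed-form OLS for the one-predictor case (dih_Y3 ~ dih_Y2) in pure Python, with made-up member records:

```python
# Closed-form simple OLS sketch: beta1 = cov(x, y) / var(x), beta0 = mean
# residual. A stand-in for the full dih_Yn = b0 + b1*dih_Yn-1 + ... model.

def ols(xs, ys):
    """Return (beta0, beta1) minimising the sum of squared residuals."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    beta1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    beta0 = my - beta1 * mx
    return beta0, beta1

# Toy member records: days-in-hospital in Y2 vs Y3 (invented numbers).
dih_y2 = [0, 1, 2, 3, 4, 5]
dih_y3 = [1, 3, 5, 7, 9, 11]        # exactly 1 + 2*dih_y2
b0, b1 = ols(dih_y2, dih_y3)
print(b0, b1)  # → 1.0 2.0
```

Extending to the DC and LC predictors means solving the multi-variable normal equations, which is exactly what R's lm() does.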
Some SQL for Idea #1
create table drug_tot as select member_id, year, total(drug_count) from drug_count group by member_id, year order by member_id, year; <- total drug per year for each member; same for lab_tot
create table drug_tot_y1 as select * from drug_tot where year = 'Y1'; … similarly for y2, y3, and for lab_tot … then join with the dih_yx tables
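The drug_tot aggregation above can be run as-is from Python's sqlite3 module, which is a handy way to sanity-check the SQL before loading into R; the sample rows below are invented:

```python
# Runnable check of the drug_tot aggregation, using an in-memory SQLite
# database; table and column names follow the slide, data rows are made up.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table drug_count (member_id, year, drug_count)")
con.executemany("insert into drug_count values (?,?,?)",
                [(1, "Y1", 2), (1, "Y1", 3), (1, "Y2", 1), (2, "Y1", 4)])

# total() is SQLite's null-safe sum aggregate (always returns a float).
con.execute("""create table drug_tot as
               select member_id, year, total(drug_count) as tot
               from drug_count group by member_id, year
               order by member_id, year""")

rows = con.execute("select * from drug_tot").fetchall()
print(rows)  # → [(1, 'Y1', 5.0), (1, 'Y2', 1.0), (2, 'Y1', 4.0)]
```
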
Idea #2 ◦ Add claims at year n-1 to the Idea #1 equations:
dih_Yn = β0 + β1·dih_Yn-1 + β2·DCn-1 + β3·LCn-1 + β4·Claimn-1
◦ Then we will have to define the criteria for Claimn-1 from the claim predictors, viz. PrimaryConditionGroup, CharlsonIndex and ProcedureGroup
The Beginning As The End ◦ We started with a set of goals ◦ Homework: For me – to finish the hands-on walkthrough & post it in ~10 days; for you – go through the slides, do the walkthrough, submit entries to Kaggle
I enjoyed preparing the materials a lot … hope you enjoyed attending even more … Questions?
IDE <- RStudio
R_Packages <- c(plyr, rattle, rpart, randomForest)
R_Search <- http://www.rseek.org/, powered=google