The Hitchhiker’s Guide to Kaggle July 27, 2011 firstname.lastname@example.org [doubleclix.wordpress.com] email@example.com
The Amateur Data Scientist ◦ Analytics competitions ◦ Algorithms (CART, randomForest) ◦ Tools ◦ Datasets – Titanic, Churn, Ford, HHP (competition in flight)
Encounters
1st ◦ This workshop
2nd ◦ Do the hands-on walkthrough – I will post the walkthrough scripts in ~10 days
3rd ◦ Participate in the HHP & other competitions
Goals of this workshop
1. Introduction to analytics competitions from a data, algorithms & tools perspective
2. End-to-end flow of a Kaggle competition – Ford
3. Introduction to the Heritage Health Prize competition
4. Materials for you to explore further ◦ Lots more slides ◦ Walkthrough – will post in 10 days
Agenda
Algorithms for the amateur data scientist [25 min] ◦ Algorithms, tools & frameworks in perspective
The art of analytics competitions [10 min] ◦ The Kaggle challenges
How the RTA & Ford were won – anatomy of a competition [15 min] ◦ Predicting Ford using trees ◦ Submit an entry
Competition in flight – the Heritage Health Prize [30 min] ◦ Walkthrough: introduction, dataset organization, analytics ◦ Submit our entry
Conclusion [5 min]
ALGORITHMS FOR THE AMATEUR DATA SCIENTIST
Algorithms! The most massively useful thing an amateur data scientist can have …
"A towel is about the most massively useful thing an interstellar hitchhiker can have … any man who can hitch the length and breadth of the Galaxy, rough it … win through, and still know where his towel is, is clearly a man to be reckoned with." – From The Hitchhiker's Guide to the Galaxy, by Douglas Adams. Published by Harmony Books in 1979
The Amateur Data Scientist ◦ I am not a quant or an ML expert ◦ School of Amz, Springer & UTube ◦ For the rest of us
References I used (refs also on the respective slides): ◦ The Elements of Statistical Learning (a.k.a. ESLII) by Hastie, Tibshirani & Friedman ◦ Statistical Learning from a Regression Perspective by Richard Berk
As Jeremy says, you can dig into these as needed ◦ You need not be an expert in the R toolbox
Jeremy's Axioms ◦ Iteratively explore data ◦ Tools: Excel format, Perl, Perl Book ◦ Get your head around the data – pivot tables ◦ Don't over-complicate ◦ If people give you data, don't assume that you need to use all of it ◦ Look at pictures! ◦ Keep a tab on the history of your submissions ◦ Don't be afraid to submit simple solutions – we will do this during this workshop
Ref: http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-sciencetalk-by-jeremy-howard/
1. Don't throw away any data! – big data to smart data
2. Be ready for different ways of organizing the data (summaries)
Users apply different techniques: Support Vector Machines, AdaBoost, Bayesian Networks, Decision Trees, Ensemble Methods, Random Forests, Logistic Regression, Genetic Algorithms, Monte Carlo Methods, Principal Component Analysis, Kalman Filters, Evolutionary Fuzzy Modelling, Neural Networks
Quora: http://www.quora.com/What-are-the-top-10-data-mining-or-machine-learning-algorithms
Ref: Anthony's Kaggle Presentation!
Let us take a 15-min overview of the algorithms ◦ Relevant in the context of this workshop ◦ From the perspective of the datasets we plan to use ◦ More qualitative than mathematical – to get a feel for the how & the why
Concepts: Bias, Variance, Model Complexity, Over-fitting ◦ Continuous variables → Linear Regression ◦ Categorical variables → Classifiers: k-NN (Nearest Neighbors), Decision Trees (CART), Bagging, Boosting
Datasets ◦ Titanic passenger metadata – small, 3 predictors (Class, Sex, Age; target: Survived?) ◦ Customer churn – 17 predictors ◦ Kaggle competition: Ford "Stay Alert" challenge – simple dataset ◦ Heritage Health Prize data – complex, competition in flight
http://www.ohgizmo.com/2007/03/21/romain-jerome-titanic!
http://www.homestoriesatoz.com/2011/06/blogger-to-wordpress-a-fish-out-of-water.html!
Titanic dataset ◦ Taken from the passenger manifest ◦ A good candidate for a decision tree
CART [Classification & Regression Trees] ◦ Greedy, top-down, binary, recursive partitioning that divides the feature space into a set of disjoint rectangular regions ◦ CART in R
http://www.dabi.temple.edu/~hbling/8590.002/Montillo_RandomForests_4-2-2009.pdf!
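The greedy split search at the heart of CART can be sketched briefly. The workshop itself does this in R with rpart/rattle; what follows is an illustrative Python sketch on a made-up Titanic-like toy dataset, showing only how the root split is chosen by minimising weighted Gini impurity:

```python
# Minimal sketch of CART's greedy root-split search (Gini criterion).
# The rows/labels below are invented toy data, not the real Titanic manifest.

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(rows, labels):
    """Find the (feature, value) binary split minimising weighted Gini."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        for v in set(r[f] for r in rows):
            left = [y for r, y in zip(rows, labels) if r[f] == v]
            right = [y for r, y in zip(rows, labels) if r[f] != v]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, v)
    return best

# Toy rows: (sex, passenger class) -> survived?
rows = [("female", 1), ("female", 3), ("male", 1), ("male", 3), ("male", 3)]
labels = [1, 1, 0, 0, 0]
score, feature, value = best_split(rows, labels)
print(score, feature)  # → 0.0 0  (sex separates the toy labels perfectly)
```

CART then recurses on each side of the chosen split, which is what produces the disjoint rectangular regions described above.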
Titanic dataset – R walkthrough ◦ Load libraries ◦ Load data ◦ Model with CART via rattle() ◦ Inspect the tree ◦ Discussion
[Decision-tree figure: splits on Male?, Adult?, 3rd class?]
[CART tree figure: root split on Male?/Female; further splits on Adult?/Child and 3rd class]
CART lessons:
1. Do not over-fit
2. All predictors are not needed
3. All data rows are not needed
4. Tuning the algorithms will give different results
Churn data ◦ Predict churn based on service calls, v-mail and so forth
Challenges ◦ Model complexity – a complex model increases the fit on training data, but then over-fits and doesn't perform as well on real data ◦ Bias vs. variance – the classical diagram (prediction error vs. training error) from ESLII by Hastie, Tibshirani & Friedman
Goal: Model complexity (-), Variance (-), Prediction accuracy (+)
Solution #1: Partition the data ◦ Training (60%), Validation (20%) & "Vault" test (20%) datasets
k-fold cross-validation ◦ Split the data into k equal parts ◦ Fit the model on k-1 parts & calculate the prediction error on the kth part ◦ Non-overlapping datasets
But the fundamental problem still exists!
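The k-fold procedure above can be sketched in a few lines. The workshop does this in R; here is an illustrative Python sketch where the fit/predict/loss callables and the toy data are placeholders (the "model" is just the training mean):

```python
# Sketch of k-fold cross-validation: fit on k-1 folds, measure prediction
# error on the held-out kth fold, average over the k folds.

def k_fold_indices(n, k):
    """Split range(n) into k non-overlapping folds."""
    return [list(range(n))[i::k] for i in range(k)]

def cross_validate(xs, ys, k, fit, predict, loss):
    n = len(xs)
    fold_errors = []
    for held_out in k_fold_indices(n, k):
        held = set(held_out)
        train = [i for i in range(n) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errs = [loss(predict(model, xs[i]), ys[i]) for i in held_out]
        fold_errors.append(sum(errs) / len(errs))
    return sum(fold_errors) / k          # average error over the k folds

# Toy data; the "model" is deliberately weak: the mean of the training ys.
xs = list(range(10)); ys = [2.0 * x for x in xs]
cv_err = cross_validate(xs, ys, 5,
                        fit=lambda X, Y: sum(Y) / len(Y),
                        predict=lambda m, x: m,
                        loss=lambda p, y: (p - y) ** 2)
print(cv_err)  # → 37.5
```

Note the folds are disjoint, matching the slide's "non-overlapping" point; the bootstrap in Solution #2 drops exactly that restriction.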
Goal: Model complexity (-), Variance (-), Prediction accuracy (+)
Solution #2: Bootstrap ◦ Draw datasets (with replacement) and fit a model for each dataset ◦ Remember: data partitioning (#1) & cross-validation (#2) are without replacement
Bagging (bootstrap aggregation) ◦ Average the prediction over a collection of bootstrapped samples, thus reducing variance
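A minimal sketch of bagging as just described, in illustrative Python (the workshop uses R). The base learner is a deliberately weak stand-in (the sample mean); the point is only the with-replacement sampling and the averaging:

```python
# Bagging sketch: fit a model on each bootstrap sample (drawn WITH
# replacement, unlike Solution #1's partitioning), then average predictions.
import random

def bootstrap_sample(data, rng):
    return [rng.choice(data) for _ in data]     # same size, with replacement

def bagged_predict(data, n_models, fit, predict, x, seed=42):
    rng = random.Random(seed)
    preds = [predict(fit(bootstrap_sample(data, rng)), x)
             for _ in range(n_models)]
    return sum(preds) / n_models                # averaging reduces variance

# Toy base learner: predict the mean y of the bootstrap sample.
data = [(x, 2 * x + 1) for x in range(20)]
fit = lambda sample: sum(y for _, y in sample) / len(sample)
predict = lambda model, x: model
est = bagged_predict(data, 50, fit, predict, x=10)
print(round(est, 1))
```
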
Goal: Model complexity (-), Variance (-), Prediction accuracy (+)
Solution #3: Boosting ◦ "Output of weak classifiers into a powerful committee" ◦ Final prediction = weighted majority vote ◦ Later classifiers get the misclassified points with higher weight, so they are forced to concentrate on them ◦ AdaBoost (Adaptive Boosting) ◦ Boosting vs. bagging: bagging – independent trees; boosting – successively weighted
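The weight-update idea behind AdaBoost can be sketched with 1-D threshold "stumps" as the weak classifiers; the dataset and round count below are made up for illustration (the workshop itself stays in R):

```python
# AdaBoost sketch: after each round, misclassified points gain weight so later
# stumps concentrate on them; the final output is a weighted majority vote.
import math

def stump_fit(xs, ys, w):
    """Pick the (threshold, polarity) stump with the lowest weighted error."""
    best = None
    for t in xs:
        for pol in (1, -1):
            err = sum(wi for wi, x, y in zip(w, xs, ys)
                      if (pol if x >= t else -pol) != y)
            if best is None or err < best[0]:
                best = (err, t, pol)
    return best

def adaboost(xs, ys, rounds):
    n = len(xs)
    w = [1.0 / n] * n
    learners = []
    for _ in range(rounds):
        err, t, pol = stump_fit(xs, ys, w)
        err = max(err, 1e-10)                    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # this stump's vote weight
        learners.append((alpha, t, pol))
        # Up-weight misclassified points, down-weight the rest, renormalise.
        w = [wi * math.exp(-alpha * y * (pol if x >= t else -pol))
             for wi, x, y in zip(w, xs, ys)]
        s = sum(w)
        w = [wi / s for wi in w]
    return learners

def committee_predict(learners, x):
    vote = sum(a * (pol if x >= t else -pol) for a, t, pol in learners)
    return 1 if vote >= 0 else -1                # weighted majority vote

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [-1, -1, -1, 1, 1, 1, -1, 1]                # no single stump fits this
model = adaboost(xs, ys, rounds=10)
correct = sum(committee_predict(model, x) == y for x, y in zip(xs, ys))
print(correct, "of", len(xs), "training points classified correctly")
```

The best single stump here misclassifies one point; the committee recovers it, which is exactly the "weak classifiers into a powerful committee" effect.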
Goal: Model complexity (-), Variance (-), Prediction accuracy (+)
Solution #4: Random Forests+ ◦ Builds a large collection of de-correlated trees & averages them ◦ Improves on bagging by selecting i.i.d* random variables for splitting ◦ Simpler to train & tune ◦ "Do remarkably well, with very little tuning required" – ESLII ◦ Less susceptible to over-fitting (than boosting) ◦ Many RF implementations – original version in Fortran-77! by Breiman/Cutler; R, Mahout, Weka, Milk (ML toolkit for Python), MATLAB
* i.i.d – independent, identically distributed
+ http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
Goal: Model complexity (-), Variance (-), Prediction accuracy (+)
Solution – General: Ensemble methods ◦ Two steps: develop a set of learners, then combine the results into a composite predictor ◦ Ensemble methods can take the form of: using different algorithms; using the same algorithm with different settings; assigning different parts of the dataset to different classifiers ◦ Bagging & random forests are examples of ensemble methods
Ref: Machine Learning In Action
Random Forests ◦ While boosting splits based on the best among all variables, RF splits based on the best among randomly chosen variables ◦ Simpler because it has only two tuning parameters – the number of predictors tried at each split (typically √k) & the number of trees (500 for a large dataset, 150 for a smaller one) ◦ Error prediction: for each iteration, predict for the data not in the sample (OOB data); aggregate the OOB predictions; calculate the prediction error for the aggregate, which is basically the OOB estimate of the error rate. Can use this to search for the optimal # of predictors ◦ We will see how close this is to the actual error in the Heritage Health Prize ◦ Assumes equal cost for mis-prediction; can add a cost function ◦ Proximity matrix & applications like imputing missing data and dropping outliers
Ref: R News Vol 2/3, Dec 2002; Statistical Learning from a Regression Perspective: Berk; A Brief Overview of RF by Dan Steinberg
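The OOB bookkeeping described above can be sketched as follows; this is an illustrative Python sketch (randomForest in R does this internally), with a trivial majority-class learner standing in for a tree:

```python
# Sketch of the out-of-bag (OOB) error estimate: for each bootstrap sample,
# predict the rows NOT drawn (the OOB rows), then aggregate those predictions
# per row by majority vote and compare against the true labels.
import random

def oob_error(xs, ys, n_models, fit, predict, seed=7):
    rng = random.Random(seed)
    n = len(xs)
    votes = [[] for _ in range(n)]                  # OOB predictions per row
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap with replacement
        in_bag = set(idx)
        model = fit([ys[i] for i in idx])
        for i in range(n):
            if i not in in_bag:                     # row i is OOB this round
                votes[i].append(predict(model, xs[i]))
    wrong = sum(1 for i in range(n)
                if votes[i] and max(set(votes[i]), key=votes[i].count) != ys[i])
    counted = sum(1 for v in votes if v)
    return wrong / counted                          # OOB estimate of error rate

# Toy base learner: always predict the majority class of its bootstrap sample.
ys = [0] * 8 + [1] * 2
xs = list(range(10))
fit = lambda sample_ys: max(set(sample_ys), key=sample_ys.count)
predict = lambda model, x: model
err = oob_error(xs, ys, 100, fit, predict)
print(err)
```

Since each row is out-of-bag in roughly a third of the bootstrap samples, the OOB estimate comes almost for free, with no separate validation set.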
Lots more to explore (homework!) ◦ Loss matrix – e.g. telecom churn: better to give incentives to false positives (who are not leaving) than to miss incentives for false negatives (who are leaving) ◦ Missing values ◦ Additive models ◦ Bayesian models ◦ Gradient boosting
Ref: http://www.louisaslett.com/Courses/Data_Mining_09-10/ST4003-Lab4-New_Tree_Data_Set_and_Loss_Matrices.pdf
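A small sketch of the loss-matrix idea for churn: the cost numbers below are invented for illustration, but they encode the asymmetry above (a lost churner costs far more than a wasted incentive), so the cheapest decision is no longer "predict the most likely class":

```python
# Cost-sensitive decision sketch: pick the prediction that minimises
# EXPECTED cost under an asymmetric loss matrix (made-up illustrative costs).

# cost[(actual, predicted)] — predicting "churn" triggers an incentive offer.
cost = {
    ("stay",  "stay"):  0.0,
    ("stay",  "churn"): 10.0,    # incentive wasted on a loyal customer
    ("churn", "stay"):  100.0,   # churner lost: much more expensive
    ("churn", "churn"): 10.0,    # incentive spent, customer kept
}

def min_cost_decision(p_churn):
    """Pick the prediction with the lowest expected cost given P(churn)."""
    expected = {
        pred: (1 - p_churn) * cost[("stay", pred)]
              + p_churn * cost[("churn", pred)]
        for pred in ("stay", "churn")
    }
    return min(expected, key=expected.get)

# With these costs the break-even churn probability is only 10%:
print(min_cost_decision(0.05), min_cost_decision(0.5))  # → stay churn
```
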
"I keep saying the sexy job in the next ten years will be statisticians." – Hal Varian, Google Chief Economist, 2009
Crowdsourcing ◦ Mismatch between those with data and those with the skills to analyse it
Tourism Forecasting Competition [chart: forecast error (MASE) falling below the existing model, from Aug 9 through 2 weeks later, 1 month later, and the competition end]
Chess Ratings Competition [chart: error rate (RMSE) vs. the existing model (ELO), from Aug 4 through 1 month and 2 months later, to today]
12,500 “Amateur” Data Scientists with different backgrounds
Tools: R, Matlab, SAS, WEKA, SPSS, Python, Excel, Mathematica, Stata [charts: tool usage on Kaggle vs. among academics vs. among Americans] Ref: Anthony's Kaggle Presentation!
Mapping Dark Matter is an image-analysis competition whose aim is to encourage the development of new algorithms that can be applied to the challenge of measuring the tiny distortions in galaxy images caused by dark matter. ◦ ~25% successful grant applications ◦ NASA tried, now it's our turn
"The world's brightest physicists have been working for decades on solving one of the great unifying problems of our universe" ◦ "In less than a week, Martin O'Leary, a PhD student in glaciology, outperformed the state-of-the-art algorithms"
How the Ford Competition was won – "How I Did It" blogs:
http://blog.kaggle.com/2011/03/25/inference-on-winning-the-ford-stay-alert-competition/
http://blog.kaggle.com/2011/04/20/mick-wagner-on-finishing-second-in-the-ford-challenge/
http://blog.kaggle.com/2011/03/16/junpei-komiyama-on-finishing-4th-in-the-ford-competition/
How the Ford Competition was won – Junpei Komiyama (#4) ◦ To solve this problem, I constructed a Support Vector Machine (SVM), which is one of the best tools for classification and regression analysis, using the libSVM package ◦ This approach took more than 3 hours to complete ◦ I found some data (P3-P6) were characterized by strong noise... Also, many environmental and vehicular data showed discrete values continuously increased and decreased. These suggested the necessity of pre-processing the observation data before SVM analysis for better performance
How the Ford Competition was won – Junpei Komiyama (#4) ◦ Averaging improved both score and processing time ◦ Averaging 7 data points reduced processing by 86% & increased the score by 0.01 ◦ Tools: Python processing of CSV; libSVM
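The 7-point averaging above can be sketched in a couple of lines; the series below is made up, and Junpei's actual script is not public here, so this is only an illustration of the idea:

```python
# Sketch of the 7-point averaging pre-processing: replace each non-overlapping
# run of 7 samples with its mean, shrinking (and smoothing) the series ~7x.

def average_blocks(series, block=7):
    return [sum(series[i:i + block]) / len(series[i:i + block])
            for i in range(0, len(series), block)]

noisy = [10, 12, 8, 11, 9, 13, 7,          # first 7 samples, mean 10
         20, 22, 18, 21, 19, 23, 17]       # next 7 samples, mean 20
print(average_blocks(noisy))               # → [10.0, 20.0]
```

This both suppresses the noise he observed in P3-P6 and cuts the SVM training time, since the training set is ~7x smaller.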
How the Ford Competition was won Mick Wagner (#2) ◦ Tools Excel, SQL Server ◦ I spent the majority of my time analyzing the data. I inputted the data into Excel and started examining the data taking note of discrete and continuous values, category based parameters, and simple statistics (mean, median, variance, coefficient of variance). I also looked for extreme outliers. ◦ I made the first 150 trials (~30%) be my test data and the remainder be my training dataset (~70%). This single factor had the largest impact on the accuracy of my final model. ◦ I was concerned that using the entire data set would create too much noise and lead to inaccuracies in the model … so focussed on data with state change
How the Ford Competition was won – Mick Wagner (#2) ◦ After testing the Decision Tree and Neural Network algorithms against each other and submitting models to Kaggle, I found the Neural Network model to be more accurate ◦ Only used E4, E5, E6, E7, E8, E9, E10, P6, V4, V6, V10, and V11
How the Ford Competition was won – Inference (#1) ◦ Very interesting ◦ "Our first observation is that trials are not homogeneous – so calculated mean, sd et al" ◦ "Training set & test set are not from the same population" – a good fit for training will result in a low score ◦ Lucky model (regression): -410.6073·sd(E5) + 0.1494·V11 + 4.4185·E9 ◦ (Remember – the data had P1-P8, E1-E11, V1-V11)
HOW THE RTA WAS WON
"This competition requires participants to predict travel time on Sydney's M4 freeway from past travel time observations."
Thanks to ◦ François GUILLEM & ◦ Andrzej Janusz – they both used R and shared their code & algorithms
How the RTA was won ◦ "I effectively used R for the RTA competition. For my best submission, I just used simple techniques (OLS and means) but in a clever way" – François GUILLEM (#14) ◦ "I used a simple k-NN approach, but the idea was to process the data first & to compute some summaries of time series in consecutive timestamps using some standard indicators from technical analysis" – Andrzej Janusz (#17)
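Andrzej's summarise-then-k-NN idea can be sketched as follows; this is an illustrative Python sketch (he worked in R), and the two summary statistics here (mean and end-to-end trend) are stand-ins for his technical-analysis indicators:

```python
# k-NN on time-series SUMMARIES rather than raw series: compress each series
# to a feature vector first, then classify by majority vote of the k nearest.

def summarise(series):
    """Toy 2-feature summary: (mean level, end-to-end trend)."""
    return (sum(series) / len(series), series[-1] - series[0])

def knn_predict(train, query, k=3):
    """train: list of (summary, label); majority vote of the k nearest."""
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    nearest = sorted(train, key=lambda t: dist(t[0], query))[:k]
    labels = [lab for _, lab in nearest]
    return max(set(labels), key=labels.count)

train = [(summarise([1, 2, 3]), "rising"), (summarise([2, 3, 4]), "rising"),
         (summarise([4, 3, 2]), "falling"), (summarise([5, 4, 3]), "falling"),
         (summarise([3, 2, 1]), "falling")]
print(knn_predict(train, summarise([2, 3, 5]), k=3))  # → rising
```
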
How the RTA was won ◦ #1 used Random Forests with time, date & week as predictors – José P. González-Brenes and Matías Cortés ◦ Regression models for data segments (~600 in total!); tools: Java/Weka; 4 processors, 12 GB RAM; 48 hours of computation – Marcin Pionnier (#5)
Ref: http://blog.kaggle.com/2011/02/17/marcin-pionnier-on-finishing-5th-in-the-rta-competition/
Ref: http://blog.kaggle.com/2011/03/25/jose-p-gonzalez-brenes-and-matias-cortes-on-winning-the-rta-challenge/
Lessons from Kaggle winners:
1. Don't over-fit
2. All predictors are not needed
3. All data rows are not needed, either
4. Tuning the algorithms will give different results
5. Reduce the dataset (average, select transition data, …)
6. Test set & training set can differ
7. Iteratively explore & get your head around the data
8. Don't be afraid to submit simple solutions
9. Keep a tab & a history of your submissions
The Competition: "The goal of the prize is to develop a predictive algorithm that can identify patients who will be admitted to the hospital within the next year, using historical claims data"
Data Organization
◦ Members – 113,000 entries; ID, Age at 1st Claim, Sex; missing values
◦ Claims – 2,668,990 entries; MemberID, ProvID, Vendor, PCP, Year, Speciality, PlaceOfSvc, PayDelay (162+ truncated), LengthOfStay, DaysSinceFirstClaimThatYear, PrimaryConditionGroup, CharlsonIndex, ProcedureGroup, SupLOS; missing values; different coding; claims truncated; SupLOS – length of stay is suppressed during the de-identification process for some entries
◦ DaysInHospital Y2 – 76,039 entries; DaysInHospital Y3 – 71,436 entries; DaysInHospital Y4 (target) – 70,943 entries; MemberID, DaysInHospital; lots of zeros
◦ LabCount – 361,485 entries; MemberID, Year, DSFS, LabCount; fairly consistent coding (10+)
◦ DrugCount – 818,242 entries; MemberID, Year, DSFS, DrugCount; fairly consistent coding (10+)
Calculation & prizes ◦ Prediction error rate ◦ Deadlines: Aug 31, 2011 06:59:59 UTC; Feb 13, 2012; Sep 04, 2012; final deadline Apr 04, 2013
POA ◦ Load the data into SQLite ◦ Use SQL to de-normalize & pick out datasets ◦ Load them into R for analytics
Total/distinct counts: ◦ Claims = 2,668,991/113,001 ◦ Members = 113,001 ◦ Drug = 818,242/75,999 <- unique = 141,532/75,999 (test) ◦ Lab = 361,485/86,640 <- unique = 154,935/86,640 (test) ◦ dih_y2 = 76,039 distinct / 11,770 with dih > 0 ◦ dih_y3 = 71,436 distinct / 10,730 with dih > 0 ◦ dih_y4 = 70,943 distinct
Idea #1
dih_Y2 = β0 + β1·dih_Y1 + β2·DC + β3·LC
dih_Y3 = β0 + β1·dih_Y2 + β2·DC + β3·LC
dih_Y4 = β0 + β1·dih_Y3 + β2·DC + β3·LC
select count(*) from dih_y2 join dih_y3 on dih_y2.member_id = dih_y3.member_id;
Y2-Y3 overlap = 51,967 (8,339 with dih_y2 > 0) / Y3-Y4 overlap = 49,683 (7,699 with dih_y3 > 0)
The data is not straightforward to get into this shape ◦ Summarize drug and lab counts by member & year ◦ Split by year to get DC & LC per year ◦ Add to the dih_Yx tables ◦ Linear regression
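The plan fits these regressions in R (lm()); as a sketch of what the fit computes, here is closed-form OLS for the one-predictor case (dih_Y3 ~ dih_Y2) in pure Python, with made-up member records:

```python
# Closed-form simple OLS sketch: beta1 = cov(x, y) / var(x), beta0 = mean
# residual. A stand-in for the full dih_Yn = b0 + b1*dih_Yn-1 + ... model.

def ols(xs, ys):
    """Return (beta0, beta1) minimising the sum of squared residuals."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    beta1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    beta0 = my - beta1 * mx
    return beta0, beta1

# Toy member records: days-in-hospital in Y2 vs Y3 (invented numbers).
dih_y2 = [0, 1, 2, 3, 4, 5]
dih_y3 = [1, 3, 5, 7, 9, 11]        # exactly 1 + 2*dih_y2
b0, b1 = ols(dih_y2, dih_y3)
print(b0, b1)  # → 1.0 2.0
```

Extending to the DC and LC predictors means solving the multi-variable normal equations, which is exactly what R's lm() does.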
Some SQL for Idea #1
create table drug_tot as select member_id, year, total(drug_count) from drug_count group by member_id, year order by member_id, year; <- total drug per year for each member; same for lab_tot
create table drug_tot_y1 as select * from drug_tot where year = 'Y1'; … similarly for y2, y3, and for lab_tot … then join with the dih_yx tables
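The drug_tot aggregation above can be run as-is from Python's sqlite3 module, which is a handy way to sanity-check the SQL before loading into R; the sample rows below are invented:

```python
# Runnable check of the drug_tot aggregation, using an in-memory SQLite
# database; table and column names follow the slide, data rows are made up.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table drug_count (member_id, year, drug_count)")
con.executemany("insert into drug_count values (?,?,?)",
                [(1, "Y1", 2), (1, "Y1", 3), (1, "Y2", 1), (2, "Y1", 4)])

# total() is SQLite's null-safe sum aggregate (always returns a float).
con.execute("""create table drug_tot as
               select member_id, year, total(drug_count) as tot
               from drug_count group by member_id, year
               order by member_id, year""")

rows = con.execute("select * from drug_tot").fetchall()
print(rows)  # → [(1, 'Y1', 5.0), (1, 'Y2', 1.0), (2, 'Y1', 4.0)]
```
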
Idea #2 ◦ Add claims at year n-1 to the Idea #1 equations:
dih_Yn = β0 + β1·dih_Yn-1 + β2·DCn-1 + β3·LCn-1 + β4·Claimn-1
◦ Then we will have to define the criteria for Claimn-1 from the claim predictors, viz. PrimaryConditionGroup, CharlsonIndex and ProcedureGroup
The Beginning As The End ◦ We started with a set of goals ◦ Homework: For me – to finish the hands-on walkthrough & post it in ~10 days; for you – go through the slides, do the walkthrough, submit entries to Kaggle
I enjoyed preparing the materials a lot … hope you enjoyed attending even more … Questions?
IDE <- RStudio
R_Packages <- c(plyr, rattle, rpart, randomForest)
R_Search <- http://www.rseek.org/, powered=google