
The Hitchhiker’s Guide to Kaggle


For the OSCON Data 2011 workshop "The Hitchhiker’s Guide to A Kaggle Competition"



  1. 1. The Hitchhiker’s Guide to Kaggle (July 27, 2011)
  2. 2. The Amateur Data Scientist
     [Concept map: Analytics Competitions; Algorithms (CART, randomForest); Tools; Old DataSets; Competitions: Titanic, Churn, Ford, HHP (in flight)]
  3. 3. Encounters
     —  1st: This workshop
     —  2nd: Do the hands-on walkthrough (I will post the walkthrough scripts in ~10 days)
     —  3rd: Participate in the HHP & other competitions
  4. 4. Goals of This Workshop
     1.  Introduction to analytics competitions from a data, algorithms & tools perspective
     2.  End-to-end flow of a Kaggle competition (Ford)
     3.  Introduction to the Heritage Health Prize competition
     4.  Materials for you to explore further
         ◦  Lots more slides
         ◦  Walkthrough (will post in 10 days)
  5. 5. Agenda
     —  Algorithms for the Amateur Data Scientist [25 min]
         ◦  Algorithms, tools & frameworks in perspective
     —  The Art of Analytics Competitions [10 min]
         ◦  The Kaggle challenges
     —  How the RTA / Ford was won: anatomy of a competition [15 min]
         ◦  Predicting Ford using trees
         ◦  Submit an entry
     —  Competition in flight: the Heritage Health Prize [30 min]
         ◦  Walkthrough: introduction, dataset organization, analytics
         ◦  Submit our entry
     —  Conclusion [5 min]
  6. 6. ALGORITHMS FOR THE AMATEUR DATA SCIENTIST
     Algorithms! The most massively useful thing an amateur data scientist can have …
     “A towel is about the most massively useful thing an interstellar hitchhiker can have … any man who can hitch the length and breadth of the Galaxy, rough it … win through, and still know where his towel is, is clearly a man to be reckoned with.”
     - From The Hitchhiker’s Guide to the Galaxy, by Douglas Adams. Published by Harmony Books in 1979
  7. 7. The Amateur Data Scientist
     —  I am not a quant or an ML expert
     —  School of Amazon, Springer & YouTube
     —  For the rest of us
     —  References I used (refs also in the respective slides):
         ◦  The Elements of Statistical Learning (a.k.a. ESLII), by Hastie, Tibshirani & Friedman
         ◦  Statistical Learning From a Regression Perspective, by Richard Berk
     —  As Jeremy says, you can dig into it as needed; you need not be an expert in the R toolbox
  8. 8. Jeremy’s Axioms
     —  Iteratively explore data
     —  Tools: Excel format, Perl, Perl book
     —  Get your head around the data (pivot table)
     —  Don’t over-complicate
     —  If people give you data, don’t assume that you need to use all of it
     —  Look at pictures!
     —  Keep a tab on the history of your submissions
     —  Don’t be afraid to submit simple solutions (we will do this during this workshop)
     Ref: sciencetalk-by-jeremy-howard/
  9. 9. Summary
     1.  Don’t throw away any data! Big data to smart data
     2.  Be ready for different ways of organizing the data
  10. 10. Users apply different techniques
      •  Support Vector Machines
      •  AdaBoost
      •  Bayesian Networks
      •  Decision Trees
      •  Ensemble Methods
      •  Random Forest
      •  Logistic Regression
      •  Genetic Algorithms
      •  Monte Carlo Methods
      •  Principal Component Analysis
      •  Kalman Filter
      •  Evolutionary Fuzzy Modelling
      •  Neural Networks
      Quora: data-mining-or-machine-learning-algorithms
      Ref: Anthony’s Kaggle Presentation
  11. 11. Let us take a 15-minute overview of the algorithms
      ◦  Relevant in the context of this workshop
      ◦  From the perspective of the datasets we plan to use
      —  More qualitative than mathematical
      —  To get a feel for the how & the why
  12. 12. [Concept map: bias vs. variance; model complexity & over-fitting; continuous variables: linear regression; categorical variables: classifiers, decision trees (CART), k-NN (nearest neighbors), bagging, boosting]
  13. 13. Datasets
      —  Titanic Passenger Metadata: small; 3 predictors (Class, Sex, Age); target: Survived?
      —  Customer Churn: 17 predictors
      —  Kaggle Competition - Stay Alert! Ford Challenge: simple dataset; competition class
      —  Heritage Health Prize data: complex; competition in flight!
  14. 14. Titanic Dataset
      —  Taken from the passenger manifest
      —  Good candidate for a decision tree
      —  CART [Classification & Regression Trees]
          ◦  Greedy, top-down, binary, recursive partitioning that divides feature space into sets of disjoint rectangular regions
      —  CART in R
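The deck's hands-on CART code is in R (rpart); as a language-neutral illustration, here is a minimal pure-Python sketch of the greedy split step that CART repeats recursively: try every (feature, value) split and keep the one with the lowest weighted Gini impurity. The toy rows and labels are hypothetical, not the real Titanic data.

```python
# Minimal sketch of one greedy CART split using Gini impurity.
# The rows/labels below are made up for illustration only.

def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = labels.count(1) / n
    return 1.0 - p1 * p1 - (1.0 - p1) ** 2

def best_split(rows, labels):
    """Try every (feature, value) equality split; return the
    (score, feature, value) with the lowest weighted Gini impurity."""
    best = None
    for f in range(len(rows[0])):
        for v in set(r[f] for r in rows):
            left = [y for r, y in zip(rows, labels) if r[f] == v]
            right = [y for r, y in zip(rows, labels) if r[f] != v]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, v)
    return best

# Toy rows: (sex, pclass); label 1 = survived (hypothetical values).
rows = [("male", 3), ("male", 1), ("female", 3),
        ("female", 1), ("female", 2), ("male", 3)]
labels = [0, 0, 1, 1, 1, 0]
score, feature, value = best_split(rows, labels)
print(feature)  # here sex (feature 0) separates the toy labels perfectly
```

A real CART implementation then recurses into each side of the split and prunes the resulting tree; rpart in R and DecisionTreeClassifier in scikit-learn do all of this for you.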
  15. 15. Titanic Dataset: R walkthrough
      —  Load libraries
      —  Load data
      —  Model CART
      —  Model rattle()
      —  Tree
      —  Discussion
      [Decision-tree diagram: splits on Male?, 3rd class?, Adult?]
  16. 16. CART
      [Decision-tree diagram: splits on Male?, Adult?, 3rd class?; leaves labelled Female, Child]
  17. 17. CART
      1.  Do not over-fit
      2.  All predictors are not needed
      3.  All data rows are not needed
      4.  Tuning the algorithms will give different results
  18. 18. Churn Data
      —  Predict churn
      —  Based on service calls, v-mail and so forth
  19. 19. CART Tree
  20. 20. Challenges
      —  Model complexity
          ◦  A complex model increases the training-data fit
          ◦  But then over-fits and doesn’t perform as well on real data
      —  Bias vs. variance
          ◦  Classical diagram (prediction error vs. training error)
          ◦  From ESLII, by Hastie, Tibshirani & Friedman
  21. 21. Solution #1
      —  Goal: model complexity (-), variance (-), prediction accuracy (+)
      —  Partition data into training (60%), validation (20%) & “vault” test (20%) data sets
      —  k-fold cross-validation
          ◦  Split data into k equal parts
          ◦  Fit the model to k-1 parts & calculate the prediction error on the k-th part
          ◦  Non-overlapping datasets
      —  But the fundamental problem still exists!
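The k-fold scheme above can be sketched in a few lines of pure Python (R and every ML toolkit ship a ready-made version; this is only to make the non-overlapping-folds idea concrete):

```python
# k-fold cross-validation sketch: split row indices into k
# non-overlapping folds; each fold serves once as the test set
# while the model is fit on the remaining k-1 folds.

def kfold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k non-overlapping folds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# Example: 10 rows, 5 folds; every row lands in exactly one test fold.
seen = []
for train, test in kfold_indices(10, 5):
    assert set(train).isdisjoint(test)   # folds never overlap
    seen.extend(test)
assert sorted(seen) == list(range(10))   # together they cover all rows
```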
  22. 22. Solution #2
      —  Goal: model complexity (-), variance (-), prediction accuracy (+)
      —  Bootstrap
          ◦  Draw datasets (with replacement) and fit a model for each dataset
          ◦  Remember: data partitioning (#1) & cross-validation (#2) are without replacement
      —  Bagging (bootstrap aggregation)
          ◦  Average the prediction over a collection of bootstrapped samples, thus reducing variance
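As a sketch of the bagging idea above: draw bootstrap samples with replacement, fit a "model" on each, and average the predictions. The "model" here is just the sample mean, standing in for any real learner such as a tree.

```python
import random

# Bagging sketch: bootstrap samples are drawn WITH replacement
# (unlike the partitioning/CV schemes, which are without).

def bagged_mean(data, n_bootstrap=200, seed=42):
    rng = random.Random(seed)
    preds = []
    for _ in range(n_bootstrap):
        sample = [rng.choice(data) for _ in data]  # draw with replacement
        preds.append(sum(sample) / len(sample))    # fit trivial "model"
    return sum(preds) / len(preds)                 # aggregate predictions

data = [2.0, 4.0, 6.0, 8.0]
print(bagged_mean(data))  # close to the plain mean of 5.0
```

With a trivial estimator like the mean, bagging changes little; the variance reduction pays off when the base learner is unstable, such as a deep decision tree.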
  23. 23. Solution #3
      —  Goal: model complexity (-), variance (-), prediction accuracy (+)
      —  Boosting
          ◦  Combines the output of weak classifiers into a powerful committee
          ◦  Final prediction = weighted majority vote
          ◦  Later classifiers get the misclassified points with higher weight, so they are forced to concentrate on them
          ◦  AdaBoost (Adaptive Boosting)
          ◦  Boosting vs. bagging: bagging grows independent trees; boosting grows successively weighted ones
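The "later classifiers get the misclassified points with higher weight" step is the heart of AdaBoost; here is a minimal sketch of one reweighting round (the surrounding loop over weak classifiers and the final weighted vote are omitted):

```python
import math

# One AdaBoost reweighting round: after scoring a weak classifier,
# up-weight the points it got wrong so the next classifier
# concentrates on them.

def reweight(weights, correct):
    """weights[i] is point i's weight; correct[i] is True if the
    weak classifier classified point i correctly."""
    err = sum(w for w, c in zip(weights, correct) if not c) / sum(weights)
    alpha = 0.5 * math.log((1 - err) / err)   # this classifier's vote weight
    new = [w * math.exp(-alpha if c else alpha)
           for w, c in zip(weights, correct)]
    total = sum(new)                           # renormalize to sum to 1
    return [w / total for w in new], alpha

weights = [0.25, 0.25, 0.25, 0.25]
weights, alpha = reweight(weights, [True, True, True, False])
print(weights)  # the single misclassified point now carries weight 0.5
```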
  24. 24. Solution #4
      —  Goal: model complexity (-), variance (-), prediction accuracy (+)
      —  Random Forests
          ◦  Builds a large collection of de-correlated trees & averages them
          ◦  Improves bagging by selecting an i.i.d.* random subset of variables for splitting
          ◦  Simpler to train & tune
          ◦  “Do remarkably well, with very little tuning required” (ESLII)
          ◦  Less susceptible to over-fitting (than boosting)
          ◦  Many RF implementations: original version in Fortran 77 by Breiman/Cutler; R, Mahout, Weka, Milk (ML toolkit for Python), MATLAB
      * i.i.d. = independent, identically distributed
  25. 25. Solution - General
      —  Goal: model complexity (-), variance (-), prediction accuracy (+)
      —  Ensemble methods
          ◦  Two steps: develop a set of learners, then combine their results into a composite predictor
          ◦  Ensemble methods can take the form of: using different algorithms; using the same algorithm with different settings; assigning different parts of the dataset to different classifiers
          ◦  Bagging & Random Forests are examples of ensemble methods
      Ref: Machine Learning In Action
  26. 26. Random Forests
      —  While boosting splits based on the best among all variables, RF splits based on the best among randomly chosen variables
      —  Simpler because it requires only two parameters: the number of predictors tried per split (typically √k) & the number of trees (500 for a large dataset, 150 for a smaller one)
      —  Error prediction
          ◦  For each iteration, predict for the data that is not in the sample (the OOB data)
          ◦  Aggregate the OOB predictions
          ◦  Calculate the prediction error for the aggregate, which is basically the OOB estimate of the error rate; can use this to search for the optimal number of predictors
          ◦  We will see how close this is to the actual error in the Heritage Health Prize
      —  Assumes equal cost for mis-prediction; can add a cost function
      —  Proximity matrix & applications like imputing missing data and dropping outliers
      Refs: R News Vol 2/3, Dec 2002; Statistical Learning from a Regression Perspective, by Berk; A Brief Overview of RF, by Dan Steinberg
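The out-of-bag (OOB) bookkeeping above follows directly from the bootstrap: each tree's sample leaves out roughly a third of the rows, and those rows act as a free test set for that tree. A small sketch of just that bookkeeping (the trees themselves are omitted):

```python
import random

# OOB sketch: for each bootstrap sample of n rows drawn with
# replacement, the rows never drawn are "out of bag" and can be
# used to estimate that tree's error without a held-out set.

def oob_rows(n, rng):
    """Draw one bootstrap sample of n row indices; return (in_bag, oob)."""
    in_bag = {rng.randrange(n) for _ in range(n)}
    oob = [i for i in range(n) if i not in in_bag]
    return in_bag, oob

rng = random.Random(0)
n = 1000
fractions = []
for _ in range(50):                       # 50 "trees"
    _, oob = oob_rows(n, rng)
    fractions.append(len(oob) / n)
avg = sum(fractions) / len(fractions)
print(avg)  # about 1/e (~0.37) of rows are OOB for each tree
```

In R's randomForest this is what the reported OOB error rate is built from; each row is scored only by the trees that never saw it.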
  27. 27. Lots more to explore (homework!)
      —  Loss matrix
          ◦  E.g. telecom churn: better to give incentives to false positives (who are not leaving) than to optimize incentives for false negatives (who are leaving)
      —  Missing values
      —  Additive models
      —  Bayesian models
      —  Gradient boosting
      Ref: New_Tree_Data_Set_and_Loss_Matrices.pdf
  28. 28. Churn Data w/ randomForest
  30. 30. “I keep saying the sexy job in the next ten years will be statisticians.”
      - Hal Varian, Google Chief Economist, 2009
  31. 31. Crowdsourcing
      Mismatch between those with data and those with the skills to analyse it
  32. 32. Tourism Forecasting Competition
      [Chart: forecast error (MASE) over time vs. the existing model; Aug 9, 2 weeks later, 1 month later, competition end]
  33. 33. Chess Ratings Competition
      [Chart: error rate (RMSE) over time vs. the existing model (ELO); Aug 4, 1 month later, 2 months later, today]
  34. 34. 12,500 “Amateur” Data Scientists with different backgrounds
  35. 35. Tool usage
      [Charts: R, MATLAB, SAS, WEKA, SPSS, Python, Excel, Mathematica, Stata; usage on Kaggle, among academics, and among Americans]
      Ref: Anthony’s Kaggle Presentation
  36. 36. Mapping Dark Matter is an image-analysis competition whose aim is to encourage the development of new algorithms that can be applied to the challenge of measuring the tiny distortions in galaxy images caused by dark matter.
      ~25% successful grant applications. NASA tried, now it’s our turn.
  37. 37. “The world’s brightest physicists have been working for decades on solving one of the great unifying problems of our universe”
      “In less than a week, Martin O’Leary, a PhD student in glaciology, outperformed the state-of-the-art algorithms”
  38. 38. Who to hire?
  39. 39. Why Participants Compete
      1.  Clean, real-world data
      2.  Professional reputation & experience
      3.  Interactions with experts in related fields
      4.  Prizes
  40. 40. Use the wizard to post a competition
  41. 41. Participants make their entries
  42. 42. Competitions are judged based on predictive accuracy
  43. 43. Competition Mechanics Competitions are judged on objective criteria
  45. 45. Ford Challenge - DataSet
      —  Goal: predict driver alertness
      —  Predictors: psychology (P1..P8), environment (E1..E11), vehicle (V1..V11), IsAlert?
      —  Data statistics are meaningless outside the IsAlert context
  46. 46. Ford Challenge – DataSet Files
      —  Three files:
          ◦  ford_train: 510 trials, ~1,200 observations each spaced by 0.1 sec -> 604,330 rows
          ◦  ford_test: 100 trials, ~1,200 observations/trial, 120,841 rows
          ◦  example_submission.csv
  47. 47. A Plan
  48. 48. glm
  49. 49. Submission & Results
      —  Raw, all variables, rpart
      —  Raw, selected variables, rpart
      —  All variables, glm
  50. 50. How the Ford Competition was won
      —  “How I Did It” blog posts:
          ◦  2011/03/25/inference-on-winning-the-ford-stay-alert-competition/
          ◦  2011/04/20/mick-wagner-on-finishing-second-in-the-ford-challenge/
          ◦  2011/03/16/junpei-komiyama-on-finishing-4th-in-the-ford-competition/
  51. 51. How the Ford Competition was won
      —  Junpei Komiyama (#4)
          ◦  “To solve this problem, I constructed a Support Vector Machine (SVM), which is one of the best tools for classification and regression analysis, using the libSVM package.”
          ◦  This approach took more than 3 hours to complete
          ◦  “I found some data (P3-P6) were characterized by strong noise... Also, many environmental and vehicular data showed discrete values continuously increased and decreased. These suggested the necessity of pre-processing the observation data before SVM analysis for better performance.”
  52. 52. How the Ford Competition was won
      —  Junpei Komiyama (#4)
          ◦  Averaging improved both score and processing time
          ◦  Averaging 7 data points reduced processing by 86% & increased the score by 0.01
          ◦  Tools: Python processing of CSV; libSVM
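The "average 7 data points" reduction Komiyama describes amounts to collapsing each run of 7 consecutive observations into its mean, shrinking the dataset roughly 7-fold before classification. His actual pre-processing code is not public; a minimal sketch of the idea on made-up readings:

```python
# Block-averaging sketch: collapse each run of `block` consecutive
# values into their mean. Input values are hypothetical sensor readings.

def average_blocks(values, block=7):
    return [sum(values[i:i + block]) / len(values[i:i + block])
            for i in range(0, len(values), block)]

signal = list(range(14)) + [100]     # 15 hypothetical readings
print(average_blocks(signal))        # -> [3.0, 10.0, 100.0]
```

The last block is shorter than 7 and is averaged on its own, so no data is silently dropped.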
  53. 53. How the Ford Competition was won
      —  Mick Wagner (#2)
          ◦  Tools: Excel, SQL Server
          ◦  “I spent the majority of my time analyzing the data. I inputted the data into Excel and started examining the data taking note of discrete and continuous values, category based parameters, and simple statistics (mean, median, variance, coefficient of variance). I also looked for extreme outliers.”
          ◦  “I made the first 150 trials (~30%) be my test data and the remainder be my training dataset (~70%). This single factor had the largest impact on the accuracy of my final model.”
          ◦  Was concerned that using the entire data set would create too much noise and lead to inaccuracies in the model, so focused on data with state changes
  54. 54. How the Ford Competition was won
      —  Mick Wagner (#2)
          ◦  “After testing the Decision Tree and Neural Network algorithms against each other and submitting models to Kaggle, I found the Neural Network model to be more accurate.”
          ◦  Only used E4, E5, E6, E7, E8, E9, E10, P6, V4, V6, V10, and V11
  55. 55. How the Ford Competition was won
      —  Inference (#1)
          ◦  Very interesting
          ◦  “Our first observation is that trials are not homogeneous”, so they calculated mean, sd et al.
          ◦  “Training set & test set are not from the same population”: a good fit on training will result in a low score
          ◦  Lucky model (regression): -410.6073*sd(E5) + 0.1494*V11 + 4.4185*E9
          ◦  (Remember: the data had P1-P8, E1-E11, V1-V11)
  56. 56. HOW THE RTA WAS WON
      “This competition requires participants to predict travel time on Sydney’s M4 freeway from past travel time observations.”
  57. 57. Thanks to François GUILLEM & Andrzej Janusz
      —  They both used R
      —  They shared their code & algorithms
  58. 58. How the RTA was won
      —  “I effectively used R for the RTA competition. For my best submission, I just used simple techniques (OLS and means) but in a clever way.” - François GUILLEM (#14)
      —  “I used a simple k-NN approach, but the idea was to process the data first & to compute some summaries of time series in consecutive timestamps using some standard indicators from technical analysis.” - Andrzej Janusz (#17)
  59. 59. How the RTA was won
      —  #1 used Random Forests with time, date & week as predictors - José P. González-Brenes and Matías Cortés
      —  Regression models for data segments (total ~600!)
          ◦  Tools: Java/Weka; 4 processors, 12 GB RAM; 48 hours of computation - Marcin Pionnier (#5)
  60. 60. THE HHP
      (Time check: should be ~2:40)
  61. 61. Lessons from Kaggle Winners
      1.  Don’t over-fit
      2.  All predictors are not needed
      3.  All data rows are not needed, either
      4.  Tuning the algorithms will give different results
      5.  Reduce the dataset (average, select transition data, …)
      6.  Test set & training set can differ
      7.  Iteratively explore & get your head around the data
      8.  Don’t be afraid to submit simple solutions
      9.  Keep a tab & a history of your submissions
  62. 62. The Competition
      “The goal of the prize is to develop a predictive algorithm that can identify patients who will be admitted to the hospital within the next year, using historical claims data”
  63. 63. TimeLine
  64. 64. Data Organization
      —  Members (113,000 entries; missing values): MemberID, AgeAtFirstClaim, Sex
      —  Claims (2,668,990 entries; missing values; different coding; PayDelay 162+): MemberID, ProviderID, Vendor, PCP, Year, Speciality, PlaceOfSvc, PayDelay, LengthOfStay, DaysSinceFirstClaimThatYear, PrimaryConditionGroup, CharlsonIndex, ProcedureGroup, SupLOS
          ◦  SupLOS: length of stay is suppressed during the de-identification process for some entries; LengthOfStay is truncated
      —  DaysInHospital (target): Y2 (76,039 entries), Y3 (71,436 entries), Y4 (70,943 entries); MemberID, DaysInHospital; lots of zeros
      —  LabCount (361,485 entries; fairly consistent coding, 10+): MemberID, Year, DSFS, LabCount
      —  DrugCount (818,242 entries; fairly consistent coding, 10+): MemberID, Year, DSFS, DrugCount
  65. 65. Calculation & Prizes
      —  Judged on prediction error rate
      —  Deadlines: Aug 31, 2011 06:59:59 UTC; Feb 13, 2012; Sep 04, 2012; final deadline Apr 04, 2013
  66. 66. Now it is our turn … HHP ANALYTICS
  67. 67. POA
      —  Load data into SQLite
      —  Use SQL to de-normalize & pick out datasets
      —  Load them into R for analytics
      —  Total/distinct counts:
          ◦  Claims = 2,668,991 / 113,001
          ◦  Members = 113,001
          ◦  Drug = 818,242 / 75,999 <- unique = 141,532 / 75,999 (test)
          ◦  Lab = 361,485 / 86,640 <- unique = 154,935 / 86,640 (test)
          ◦  dih_y2 = 76,039 distinct / 11,770 with dih > 0
          ◦  dih_y3 = 71,436 distinct / 10,730 with dih > 0
          ◦  dih_y4 = 70,943 distinct
  68. 68. Idea #1
      —  dih_Y2 = β0 + β1·dih_Y1 + β2·DC + β3·LC
      —  dih_Y3 = β0 + β1·dih_Y2 + β2·DC + β3·LC
      —  dih_Y4 = β0 + β1·dih_Y3 + β2·DC + β3·LC
      —  select count(*) from dih_y2 join dih_y3 on dih_y2.member_id = dih_y3.member_id;
      —  Y2-Y3 = 51,967 (8,339 with dih_y2 > 0) / Y3-Y4 = 49,683 (7,699 with dih_y3 > 0)
      —  The data is not straightforward to get into this shape:
          ◦  Summarize drug and lab counts by member and year
          ◦  Split by year to get DC & LC per year
          ◦  Add to the dih_Yx table
          ◦  Linear regression
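Idea #1 is an ordinary least-squares regression; the deck would fit it with lm() in R. As a sketch, here it is simplified to a single predictor (dih_Y3 regressed on dih_Y2 alone, dropping the DC and LC terms) with entirely made-up member counts:

```python
from statistics import mean

# Single-predictor OLS sketch of Idea #1: dih_Y3 = b0 + b1 * dih_Y2.
# The full idea adds drug-count (DC) and lab-count (LC) terms; these
# days-in-hospital values are hypothetical, not HHP data.

dih_y2 = [0, 1, 0, 3, 2, 0, 5, 1]   # hypothetical days in hospital, Y2
dih_y3 = [0, 1, 1, 2, 2, 0, 4, 1]   # hypothetical days in hospital, Y3

mx, my = mean(dih_y2), mean(dih_y3)
slope = (sum((x - mx) * (y - my) for x, y in zip(dih_y2, dih_y3))
         / sum((x - mx) ** 2 for x in dih_y2))
intercept = my - slope * mx

def predict(d):
    """Predicted dih_Y3 for a member with d days in hospital in Y2."""
    return intercept + slope * d

print(round(predict(3), 2))
```

With all three predictors you would solve the full normal equations instead; in R that is simply lm(dih_y3 ~ dih_y2 + dc + lc).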
  69. 69. Some SQL for Idea #1
      —  create table drug_tot as select member_id, year, total(drug_count) from drug_count group by member_id, year order by member_id, year;  <- total drug count per year for each member
      —  Same for lab_tot
      —  create table drug_tot_y1 as select * from drug_tot where year = "Y1"
      —  … likewise for Y2 and Y3, and for Y1, Y2, Y3 of lab_tot
      —  … join with the dih_yx tables
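The drug_tot aggregation above can be run end-to-end in an in-memory SQLite database via Python's stdlib sqlite3 module; the few drug_count rows below are hypothetical, just enough to show the per-member, per-year totals that feed Idea #1:

```python
import sqlite3

# Run the slide's aggregation on a toy in-memory drug_count table.
con = sqlite3.connect(":memory:")
con.executescript("""
create table drug_count (member_id text, year text, dsfs text, drug_count int);
insert into drug_count values
  ('M1', 'Y1', '0-1 month',  2),
  ('M1', 'Y1', '2-3 months', 3),
  ('M1', 'Y2', '0-1 month',  1),
  ('M2', 'Y1', '1-2 months', 4);
create table drug_tot as
  select member_id, year, total(drug_count) as dc
  from drug_count
  group by member_id, year
  order by member_id, year;
""")
rows = con.execute(
    "select * from drug_tot order by member_id, year").fetchall()
print(rows)  # [('M1', 'Y1', 5.0), ('M1', 'Y2', 1.0), ('M2', 'Y1', 4.0)]
```

SQLite's total() returns a float and treats NULLs as 0, which is convenient here since the HHP files have missing values; sum() would return NULL for an all-NULL group instead.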
  70. 70. Idea #2
      —  Add claims at year n-1 to the Idea #1 equations
      —  dih_Yn = β0 + β1·dih_Yn-1 + β2·DCn-1 + β3·LCn-1 + β4·Claimn-1
      —  Then we will have to define the criteria for Claimn-1 from the claim predictors, viz. PrimaryConditionGroup, CharlsonIndex and ProcedureGroup
  71. 71. The Beginning As the End
      —  We started with a set of goals
      —  Homework
          ◦  For me: finish the hands-on walkthrough & post it in ~10 days
          ◦  For you: go through the slides, do the walkthrough, submit entries to Kaggle
  72. 72. I enjoyed preparing the materials a lot … hope you enjoyed attending even more … Questions?
      IDE <- RStudio
      R_Packages <- c(plyr, rattle, rpart, randomForest)
      R_Search <-, powered=google