The Hitchhiker’s Guide to Kaggle




                                          July 27, 2011
         ksankar42@gmail.com [doubleclix.wordpress.com]
                         anthony.goldbloom@kaggle.com
The Hitchhiker’s Guide to Kaggle
[Overview map: The Amateur Data Scientist — Analytics Competitions!, Algorithms (CART, randomForest), Tools, DataSets (Titanic, Churn, Ford), Old Competition, Competition in-flight (HHP)]
Encounters
—  1st
   ◦  This Workshop
—  2nd
   ◦  Do Hands-on Walkthrough
   ◦  I will post the walkthrough scripts in ~10 days
—  3rd
   ◦  Participate in HHP & other competitions
Goals of This Workshop
1.     Introduction to Analytics Competitions
       from Data, Algorithms & Tools
       perspective
2.     End-To-End Flow of a Kaggle
       Competition – Ford
3.     Introduction to the Heritage Health Prize
       Competition
4.     Materials for you to explore further
      ◦  Lots more slides
      ◦  Walkthrough – will post in 10 days
Agenda
—    Algorithms for the Amateur Data Scientist [25 min]
      ◦  Algorithms, Tools & frameworks in perspective
—    The Art of Analytics Competitions [10 min]
      ◦  The Kaggle challenges
—    How the RTA & FORD were won - Anatomy of a competition [15 min]
      ◦  Predicting FORD using Trees
      ◦  Submit an Entry
—    Competition in flight - The Heritage Health Prize [30 min]
      ◦  Walkthrough
        –  Introduction
        –  Dataset Organization
        –  Analytics Walkthrough
      ◦  Submit our entry
—    Conclusion [5 min]
ALGORITHMS FOR THE
AMATEUR DATA SCIENTIST


      Algorithms! The most massively useful thing an Amateur Data
      Scientist can have …




      “A towel is about the most massively useful thing an
      interstellar hitchhiker can have … any man who can hitch
      the length and breadth of the Galaxy, rough it … win
      through, and still know where his towel is, is clearly a
      man to be reckoned with.”
             - From The Hitchhiker's Guide to the Galaxy, by Douglas Adams.
                                        Published by Harmony Books in 1979
The Amateur Data Scientist
—    I am not a quant or an ML expert
—    School of Amz, Springer & UTube
—    For the Rest of us
—    References I used (refs also in the respective slides):
      ◦  The Elements of Statistical Learning (a.k.a. ESLII)
        –  By Hastie, Tibshirani & Friedman
      ◦  Statistical Learning From a Regression Perspective
        –  By Richard Berk
—    As Jeremy says, you can dig into it as needed
      ◦  You need not be an expert in the R toolbox
Jeremy’s Axioms
—  Iteratively explore data
—  Tools
      ◦  Excel Format, Perl, Perl Book
—    Get your head around data
      ◦  Pivot Table
—    Don’t over-complicate
—    If people give you data, don’t assume
      that you need to use all of it
—    Look at pictures !
—    History of your submissions – keep a
      tab
—    Don’t be afraid to submit simple
      solutions
      ◦  We will do this during this workshop


                       Ref: http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-science-talk-by-jeremy-howard/
1   Don’t throw away any data!
        Big data to smart data
2   Be ready for different ways of organizing the data
        —  summary
Users apply different techniques



•    Support Vector Machine
•    adaBoost
•    Bayesian Networks
•    Decision Trees
•    Ensemble Methods
•    Random Forest
•    Logistic Regression
•    Genetic Algorithms
•    Monte Carlo Methods
•    Principal Component Analysis
•    Kalman Filter
•    Evolutionary Fuzzy Modelling
•    Neural Networks
Quora
•  http://www.quora.com/What-are-the-top-10-
   data-mining-or-machine-learning-algorithms
                                                      Ref: Anthony’s Kaggle Presentation
—  Let us take a 15-min overview of the algorithms
   ◦  Relevant in the context of this workshop
   ◦  From the perspective of the datasets we plan to use
—  More qualitative than mathematical
—  To get a feel for the how & the why
[Concept map: Linear Regression — Bias, Variance, Model Complexity, Over-fitting, Continuous Variables; Classifiers — Categorical Variables, Bagging, Boosting, Decision Trees (CART), k-NN (Nearest Neighbors)]
Titanic Passenger Metadata
  •  Small
  •  3 Predictors
       •  Class
       •  Sex
       •  Age
       •  Survived?

Customer Churn
  •  17 Predictors

Kaggle Competition - Stay Alert Ford Challenge
  •  Simple Dataset
  •  Competition Class

Heritage Health Prize Data
  •  Complex
  •  Competition in Flight
Titanic Dataset
—  Taken from passenger
    manifest
—  Good candidate for a
    Decision Tree
—  CART [Classification
    & Regression Tree]
      ◦  Greedy, top-down
         binary, recursive
         partitioning that
         divides feature space
         into sets of disjoint
         rectangular regions
—    CART in R                  Ref: http://www.dabi.temple.edu/~hbling/8590.002/Montillo_RandomForests_4-2-2009.pdf
Titanic Dataset
R walk through
 —  Load libraries
 —  Load data
 —  Model CART
 —  Model rattle()
 —  Tree
 —  Discussion
[Tree diagram: splits on Male?, 3rd class?, and Adult?, with Y/N branches]
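The walk-through above can be sketched in a few lines of R. This is a minimal sketch on a toy data frame that stands in for the real passenger manifest (the column names and values here are illustrative, not the Kaggle file format):

```r
# Load libraries: rpart implements CART-style recursive partitioning
library(rpart)

# Load data: a tiny Titanic-shaped stand-in
titanic <- data.frame(
  survived = factor(c(0, 1, 1, 0, 1, 0, 0, 1, 0, 1)),
  sex      = factor(c("male", "female", "female", "male", "female",
                      "male", "male", "female", "male", "female")),
  age      = c(22, 38, 26, 35, 35, 54, 2, 27, 32, 14),
  pclass   = factor(c(3, 1, 3, 1, 3, 1, 3, 2, 3, 3))
)

# Model CART: greedy, top-down, binary recursive partitioning
fit <- rpart(survived ~ sex + age + pclass, data = titanic,
             method = "class", control = rpart.control(minsplit = 2))

print(fit)                                  # text form of the tree
pred <- predict(fit, titanic, type = "class")
```

In the workshop the same steps run against the real manifest, and rattle() gives a point-and-click view of the identical model.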
CART
[Tree diagram: root split on Male?; leaf Female on the N branch; further splits on 3rd class? and Adult?, with leaf Child and Y/N branches]
CART
[Tree diagram repeated: splits on Male? and 3rd class?]

1   Do not over-fit
2   All predictors are not needed
3   All data rows are not needed
4   Tuning the algorithms will give different results
Churn Data


—  Predict churn
—  Based on
  ◦  Service calls, v-mail and so forth
CART Tree
Challenges
—  Model Complexity
  ◦  A complex model increases the training data fit
  ◦  But then over-fits and doesn't perform as well with real data
—  Bias vs. Variance
  ◦  Classical diagram from ESLII, by Hastie, Tibshirani & Friedman
  ◦  [Plot: Training Error keeps falling with model complexity while Prediction Error rises again]
Solution #1
—    Goal
      ◦  Model Complexity (-)
      ◦  Variance (-)
      ◦  Prediction Accuracy (+)

Partition Data!
 ◦  Training (60%)
 ◦  Validation (20%) &
 ◦  “Vault” Test (20%) data sets
k-fold Cross-Validation
  ◦  Split data into k equal parts
  ◦  Fit model to k-1 parts & calculate prediction error on the kth part
  ◦  Non-overlapping datasets
But the fundamental problem still exists!
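The k-fold recipe above fits in a few lines of base R. A minimal sketch with lm on synthetic data (no packages, toy numbers):

```r
set.seed(42)
n <- 100
d <- data.frame(x = runif(n))
d$y <- 2 * d$x + rnorm(n, sd = 0.1)

k <- 5
fold <- sample(rep(1:k, length.out = n))    # k non-overlapping parts

cv_err <- sapply(1:k, function(i) {
  fit  <- lm(y ~ x, data = d[fold != i, ])        # fit to k-1 parts
  pred <- predict(fit, newdata = d[fold == i, ])  # predict on the kth part
  mean((d$y[fold == i] - pred)^2)                 # prediction error on that part
})
mean(cv_err)    # the CV estimate of prediction error
```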
Solution #2
—    Goal
      ◦  Model Complexity (-)
      ◦  Variance (-)
      ◦  Prediction Accuracy (+)

Bootstrap
  ◦  Draw datasets (with replacement) and fit model for each dataset
      –  Remember: Data Partitioning (#1) & Cross-Validation (#2) are without replacement
Bagging (Bootstrap aggregation)
  ◦  Average prediction over a collection of bootstrapped samples, thus reducing variance
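Bagging can be sketched directly from the definition above, in base R: resample with replacement, fit a flexible model to each resample, and average the predictions. The polynomial fit and the sizes are toy choices for illustration:

```r
set.seed(1)
n <- 200
d <- data.frame(x = runif(n))
d$y <- sin(2 * pi * d$x) + rnorm(n, sd = 0.2)

B <- 50                                     # number of bootstrap samples
grid <- data.frame(x = seq(0, 1, length.out = 20))

preds <- sapply(1:B, function(b) {
  idx <- sample(n, replace = TRUE)          # draw a dataset WITH replacement
  fit <- lm(y ~ poly(x, 5), data = d[idx, ])
  predict(fit, newdata = grid)
})
bagged <- rowMeans(preds)                   # averaging reduces variance
```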
Solution #3
—    Goal
      ◦  Model Complexity (-)
      ◦  Variance (-)
      ◦  Prediction Accuracy (+)

Boosting
  ◦  “Output of weak classifiers into a powerful committee”
  ◦  Final Prediction = weighted majority vote
  ◦  Later classifiers get the misclassified points with higher weight, so they are forced to concentrate on them
  ◦  AdaBoost (Adaptive Boosting)
  ◦  Boosting vs Bagging
      –  Bagging – independent trees
      –  Boosting – successively weighted
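One round of the AdaBoost bookkeeping described above, hand-rolled in base R for illustration (real work would use a package such as gbm or ada; the data and the stump threshold are made up):

```r
set.seed(7)
x <- runif(20)
y <- ifelse(x > 0.5, 1, -1)
y[3] <- -y[3]                              # one noisy label the stump will miss
w <- rep(1 / 20, 20)                       # start with equal weights

stump <- function(x, thr) ifelse(x > thr, 1, -1)   # a weak classifier
h1    <- stump(x, 0.5)
err   <- sum(w * (h1 != y))                # weighted error of the weak classifier
alpha <- 0.5 * log((1 - err) / err)        # its vote in the final committee

w <- w * exp(-alpha * y * h1)              # misclassified points gain weight,
w <- w / sum(w)                            # so round 2 concentrates on them
```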
Solution #4
—    Goal
      ◦  Model Complexity (-)
      ◦  Variance (-)
      ◦  Prediction Accuracy (+)

Random Forests+
  ◦  Builds a large collection of de-correlated trees & averages them
  ◦  Improves Bagging by selecting i.i.d* random variables for splitting
  ◦  Simpler to train & tune
  ◦  “Do remarkably well, with very little tuning required” – ESLII
  ◦  Less susceptible to overfitting (than boosting)
  ◦  Many RF implementations
      –  Original version - Fortran-77! By Breiman/Cutler
      –  R, Mahout, Weka, Milk (ML toolkit for py), matlab
                                                 * i.i.d – independent identically distributed
                     + http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
Solution - General
—    Goal
      ◦  Model Complexity (-)
      ◦  Variance (-)
      ◦  Prediction Accuracy (+)

Ensemble methods
  ◦  Two steps
      –  Develop a set of learners
      –  Combine the results to develop a composite predictor
  ◦  Ensemble methods can take the form of:
      –  Using different algorithms
      –  Using the same algorithm with different settings
      –  Assigning different parts of the dataset to different classifiers
  ◦  Bagging & Random Forests are examples of ensemble methods
                                                     Ref: Machine Learning In Action
Random Forests
—  While Boosting splits based on the best among all variables, RF splits based on the best among randomly chosen variables
—  Simpler because it requires only two parameters – the no. of predictors tried per split (typically √k) & the no. of trees (500 for a large dataset, 150 for a smaller one)
—  Error prediction
      ◦  For each iteration, predict for the data that is not in the sample (OOB data)
      ◦  Aggregate the OOB predictions
      ◦  Calculate the prediction error for the aggregate, which is basically the OOB estimate of the error rate
        –  Can use this to search for the optimal # of predictors
      ◦  We will see how close this is to the actual error in the Heritage Health Prize
—  Assumes equal cost for mis-prediction; can add a cost function
—  Proximity matrix & applications like adding missing data, dropping outliers
                                                                  Ref: R News Vol 2/3, Dec 2002
                                        Statistical Learning from a Regression Perspective: Berk
                                                        A Brief Overview of RF by Dan Steinberg
Lots more to explore (Homework!)

—  Loss matrix
  ◦  E.g. telecom churn - better to waste an incentive on a false positive (someone who is not leaving) than to miss a false negative (someone who is leaving)
—  Missing values
—  Additive Models
—  Bayesian Models
—  Gradient Boosting

            Ref: http://www.louisaslett.com/Courses/Data_Mining_09-10/ST4003-Lab4-New_Tree_Data_Set_and_Loss_Matrices.pdf
Churn Data w/ randomForest
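The call behind this slide can be sketched with the randomForest package on a synthetic churn-style data frame (the variable names and the churn rule below are made up for illustration, not the workshop dataset):

```r
library(randomForest)

set.seed(11)
n <- 500
churn_df <- data.frame(
  svc_calls  = rpois(n, 2),        # service calls
  vmail_msgs = rpois(n, 8),        # v-mail messages
  day_mins   = rnorm(n, 180, 50)
)
churn_df$churn <- factor(ifelse(churn_df$svc_calls > 3, "yes", "no"))

# Only two knobs matter most: mtry (~sqrt(k) predictors per split) and ntree
fit <- randomForest(churn ~ ., data = churn_df,
                    ntree = 150, mtry = floor(sqrt(3)), importance = TRUE)

fit$err.rate[150, "OOB"]    # the OOB estimate of error rate discussed earlier
importance(fit)             # which predictors matter
```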
KAGGLE
COMPETITIONS
“I keep saying the sexy job in
the next ten years will be
statisticians.”
Hal Varian
Google Chief Economist
2009
Crowdsourcing




Mismatch between those with data and
   those with the skills to analyse it
Tourism Forecasting Competition
[Chart: Forecast Error (MASE) from the Aug 9 launch through 2 weeks later, 1 month later, and the competition end, against the existing model]
Chess Ratings Competition
[Chart: Error Rate (RMSE) against the existing model (ELO), from Aug 4 through 1 month later, 2 months later, and today]
12,500 “Amateur” Data Scientists with different backgrounds

[Charts: tool usage – R, Matlab, SAS, WEKA, SPSS, Python, Excel, Mathematica, Stata – as R on Kaggle, among academics, and among Americans]

                                Ref: Anthony’s Kaggle Presentation
Mapping Dark Matter is an image analysis competition whose aim is to encourage the development of new algorithms that can be applied to the challenge of measuring the tiny distortions in galaxy images caused by dark matter.

      ~25%
      Successful
      grant applications

NASA tried, now it's our turn
“The world’s brightest
physicists have been
working for decades on
solving one of the great
unifying problems of our
universe”




  “In less than a week,
  Martin O’Leary, a PhD
  student in glaciology,
  outperformed the state-of-
  the-art algorithms”
Who to hire?
Why Participants Compete

1  Clean, real-world data
2  Professional reputation & experience
3  Interactions with experts in related fields
4  Prizes
Competition Mechanics
—  Use the wizard to post a competition
—  Participants make their entries
—  Competitions are judged based on predictive accuracy
—  Competitions are judged on objective criteria
The Anatomy of a KAGGLE COMPETITION

THE FORD
COMPETITION
Ford Challenge - DataSet
—  Goal:
  ◦  Predict Driver Alertness
—  Predictors:
  ◦  Physiological – P1 .. P8
  ◦  Environmental – E1 .. E11
  ◦  Vehicular – V1 .. V11
  ◦  IsAlert?
—  Data statistics are meaningless outside the IsAlert context
Ford Challenge – DataSet Files
—    Three files
      ◦  ford_train
        –  510 trials, ~1,200 observations each, spaced 0.1 sec apart -> 604,330 rows
      ◦  ford_test
        –  100 trials, ~1,200 observations/trial, 120,841 rows
      ◦  example_submission.csv
A Plan
glm
Submission & Results
—  Raw, all variables, rpart
—  Raw, selected variables, rpart
—  All variables, glm
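The "all variables, glm" entry above can be sketched end to end: fit a logistic regression and write a submission CSV. This runs on a synthetic Ford-shaped data frame; the real flow would read ford_train/ford_test, and the column names and coefficients here are assumptions:

```r
set.seed(3)
n <- 1000
train <- data.frame(P6 = rnorm(n), E9 = rnorm(n), V11 = rnorm(n))
train$IsAlert <- rbinom(n, 1, plogis(0.8 * train$E9 - 0.5 * train$P6))
test  <- data.frame(P6 = rnorm(100), E9 = rnorm(100), V11 = rnorm(100))

# Logistic regression on all predictors
fit <- glm(IsAlert ~ ., data = train, family = binomial)
p   <- predict(fit, newdata = test, type = "response")

# One prediction per test row, written out for upload
submission <- data.frame(Prediction = p)
write.csv(submission, "submission.csv", row.names = FALSE)
```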
How the Ford Competition was won
—    How I Did It blogs
—    http://blog.kaggle.com/2011/03/25/inference-on-winning-the-ford-stay-alert-competition/
—    http://blog.kaggle.com/2011/04/20/mick-wagner-on-finishing-second-in-the-ford-challenge/
—    http://blog.kaggle.com/2011/03/16/junpei-komiyama-on-finishing-4th-in-the-ford-competition/
How the Ford Competition was won
—  Junpei Komiyama (#4)
  ◦  To solve this problem, I constructed a Support Vector Machine (SVM), which is one of the best tools for classification and regression analysis, using the libSVM package.
  ◦  This approach took more than 3 hours to complete
  ◦  I found some data (P3-P6) were characterized by strong noise... Also, many environmental and vehicular data showed discrete values continuously increased and decreased. These suggested the necessity of pre-processing the observation data before SVM analysis for better performance
How the Ford Competition was won

—  Junpei   Komiyama (#4)
  ◦  Averaging – improved score and processing time
  ◦  Average 7 data points
    –  Reduced processing by 86% &
    –  Increased score by 0.01
  ◦  Tools
    –  Python processing of csv
    –  libSVM
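Junpei's averaging trick above is one line of base R: replace every 7 consecutive observations with their mean, shrinking the data by ~86%. The 700-point vector below is a stand-in for one signal column:

```r
x   <- 1:700                             # stand-in for one observation column
grp <- rep(seq_len(100), each = 7)       # 700 points -> 100 groups of 7
avg <- tapply(x, grp, mean)              # one averaged point per group

length(avg) / length(x)                  # 1/7 of the rows, an ~86% reduction
```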
How the Ford Competition was won
—    Mick Wagner (#2)
      ◦  Tools
        –  Excel, SQL Server
      ◦  I spent the majority of my time analyzing the data. I
         inputted the data into Excel and started examining the
         data taking note of discrete and continuous values, category
         based parameters, and simple statistics (mean, median,
         variance, coefficient of variance). I also looked for extreme
         outliers.
      ◦  I made the first 150 trials (~30%) be my test data and the
         remainder be my training dataset (~70%). This single factor
         had the largest impact on the accuracy of my final model.
      ◦  I was concerned that using the entire data set would create
         too much noise and lead to inaccuracies in the model … so
         focussed on data with state change
How the Ford Competition was won

—  Mick Wagner   (#2)
 ◦  After testing the Decision Tree and Neural Network
    algorithms against each other and submitting
    models to Kaggle, I found the Neural Network
    model to be more accurate
 ◦  Only used E4, E5, E6, E7, E8, E9, E10, P6,V4,V6,
    V10, and V11
How the Ford Competition was won

—  Inference (#1)
  ◦  Very interesting
  ◦  “Our first observation is that trials are not homogeneous – so calculated mean, sd et al”
  ◦  “Training set & test set are not from the same population” – a good fit for training will result in a low score
  ◦  Lucky Model (Regression)
    –  -410.6073·sd(E5) + 0.1494·V11 + 4.4185·E9
  ◦  (Remember – data had P1-P8, E1-E11, V1-V11)
HOW THE RTA WAS WON

“This competition requires participants to predict travel time on Sydney's M4 freeway from past travel time observations.”
—  Thanks to
  ◦  François GUILLEM &
  ◦  Andrzej Janusz
—  They both used R
—  They share their code & algorithms
How the RTA was won
—  I effectively used R for the RTA competition. For my best submission, I just used simple techniques (OLS and means) but in a clever way
                          - François GUILLEM (#14)
—  I used a simple k-NN approach, but the idea was to process data first & to compute some summaries of time series in consecutive timestamps using some standard indicators from technical analysis
                               - Andrzej Janusz (#17)
How the RTA was won
  —  #1      used Random Forests
       ◦  Time, Date & Week as predictors
                  - José P. González-Brenes and Matías Cortés

  —  Regression              models for data segments (total
      ~600!)
  —  Tools:
       ◦  Java/Weka
       ◦  4 processors, 12 GB RAM
       ◦  48 hours of computations
                                                               - Marcin Pionnier (#5)
               Ref: http://blog.kaggle.com/2011/02/17/marcin-pionnier-on-finishing-5th-in-the-rta-competition/!
Ref: http://blog.kaggle.com/2011/03/25/jose-p-gonzalez-brenes-and-matias-cortes-on-winning-the-rta-challenge/!
THE HHP




TimeCheck : Should be ~2:40!!
Lessons from Kaggle Winners
1    Don’t over-fit

2    All predictors are not needed

3    All data rows are not needed, either

4    Tuning the algorithms will give different results

5    Reduce the dataset (Average, select transition data,…)

6    Test set & training set can differ

7    Iteratively explore & get your head around data

8    Don’t be afraid to submit simple solutions

9    Keep a tab & a history of your submissions
The Competition
“The goal of the prize is to develop a predictive
algorithm that can identify patients who will be
admitted to the hospital within the next year,
using historical claims data”
TimeLine
Data Organization

Members (113,000 entries; missing values)
  •  MemberID
  •  Age at 1st Claim
  •  Sex

Claims (2,668,990 entries; missing values; different coding; PayDelay truncated at 162+)
  •  MemberID, Prov ID, Vendor, PCP
  •  Year, Speciality, PlaceOfSvc, PayDelay, LengthOfStay
  •  DaysSinceFirstClaimThatYear, PrimaryConditionGroup, CharlsonIndex, ProcedureGroup, SupLOS
  •  SupLOS – Length of stay is suppressed during the de-identification process for some entries

LabCount (361,485 entries; fairly consistent coding (10+))
  •  MemberID, Year, DSFS, LabCount

DrugCount (818,242 entries; fairly consistent coding (10+))
  •  MemberID, Year, DSFS, DrugCount

Days In Hospital Y2 (76,039 entries), Y3 (71,436 entries), Y4 (70,943 entries, the target)
  •  MemberID, Claims Truncated, DaysInHospital
  •  Lots of zeros
Calculation & Prizes

[Chart: Prediction Error Rate over the competition milestones]
—  Deadline: Aug 31, 2011 06:59:59 UTC
—  Deadline: Feb 13, 2012
—  Deadline: Sep 04, 2012
—  Final Deadline: Apr 04, 2013
Now it is our turn …

HHP ANALYTICS
POA
—  Load data into SQLite
—  Use SQL to de-normalize & pick out datasets
—  Load them into R for analytics
—  Total/Distinct count
  ◦    Claims = 2,668,991/113,001
  ◦    Members = 113,001
  ◦    Drug = 818,242/75,999 <- unique = 141,532/75,999 (test)
  ◦    Lab = 361,485/86,640 <- unique = 154,935/86,640 (test)
  ◦    dih_y2 = 76,039 distinct / 11,770 with dih > 0
  ◦    dih_y3 = 71,436 distinct / 10,730 with dih > 0
  ◦    dih_y4 = 70,943 distinct
Idea #1
—  dih_Y2 = β0 + β1·dih_Y1 + β2·DC + β3·LC
—  dih_Y3 = β0 + β1·dih_Y2 + β2·DC + β3·LC
—  dih_Y4 = β0 + β1·dih_Y3 + β2·DC + β3·LC
—  select count(*) from dih_y2 join dih_y3 on dih_y2.member_id = dih_y3.member_id;
—  Y2-Y3 = 51,967 (8,339 dih_y2 > 0) / Y3-Y4 = 49,683 (7,699 dih_y3 > 0)

—  Data is not straightforward to get to this
   ◦  Summarize drug and lab by member, year
   ◦  Split by year to get DC & LC by year
   ◦  Add to the dih_Yx table
   ◦  Linear Regression
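The regression in Idea #1 can be sketched on synthetic data: regress a year's days-in-hospital on the previous year's, plus drug and lab totals. Variable names mirror the slide; the data-generating numbers are made up:

```r
set.seed(5)
n <- 5000
d <- data.frame(dih_y2 = rpois(n, 0.3),  # previous year's days in hospital
                DC     = rpois(n, 6),    # drug-count total for the year
                LC     = rpois(n, 4))    # lab-count total for the year
d$dih_y3 <- pmax(0, round(0.4 * d$dih_y2 + 0.05 * d$DC + rnorm(n, sd = 0.5)))

# dih_Y3 = b0 + b1*dih_Y2 + b2*DC + b3*LC
fit <- lm(dih_y3 ~ dih_y2 + DC + LC, data = d)
coef(fit)

# Roll forward: feed Y3 in as the lagged term to predict Y4
pred_y4 <- predict(fit, newdata = data.frame(dih_y2 = d$dih_y3,
                                             DC = d$DC, LC = d$LC))
```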
Some SQL for idea #1
—  create table drug_tot as select member_id, year, total(drug_count) from drug_count group by member_id, year order by member_id, year; <- total drug per year for each member!
—  Same for lab_tot
—  create table drug_tot_y1 as select * from drug_tot where year = "Y1"
—  … for y2, y3 and y1, y2, y3 for lab_tot
—  … join with the dih_yx tables
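The drug_tot query above has a base-R equivalent once the claims-derived table is loaded as a data frame (the tiny data frame below is a stand-in with the same columns):

```r
drug_count <- data.frame(
  member_id  = c(1, 1, 1, 2, 2),
  year       = c("Y1", "Y1", "Y2", "Y1", "Y3"),
  drug_count = c(2, 3, 1, 4, 5)
)

# total(drug_count) ... group by member_id, year
drug_tot <- aggregate(drug_count ~ member_id + year,
                      data = drug_count, FUN = sum)

# the per-year tables (drug_tot_y1 etc.)
drug_tot_y1 <- subset(drug_tot, year == "Y1")
```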
Idea #2
—  Add claims at Yx to the Idea #1 equations
—  dih_Yn = β0 + β1·dih_Yn-1 + β2·DCn-1 + β3·LCn-1 + β4·Claimn-1
—  Then we will have to define the criteria for Claimn-1 from the claim predictors viz. PrimaryConditionGroup, CharlsonIndex and ProcedureGroup
The Beginning As the End
          —  We  started with a set of
              goals
          —  Homework
            ◦  For me :
             –  To finish the hands-on walkthrough
                 & post it in ~10 days
            ◦  For you
             –  Go through the slides
             –  Do the walkthrough
             –  Submit entries to Kaggle
I enjoyed a lot preparing the materials … Hope you enjoyed more attending …

                            Questions?

                IDE <- RStudio
                R_Packages <- c(plyr, rattle, rpart, randomForest)
                R_Search <- http://www.rseek.org/, powered=google

Scott Gombar
 
How to become a data scientist in 6 months
How to become a data scientist in 6 monthsHow to become a data scientist in 6 months
How to become a data scientist in 6 months
Tetiana Ivanova
 
Final pink panthers_03_31
Final pink panthers_03_31Final pink panthers_03_31
Final pink panthers_03_31
Michelle Darling
 
カスタマーサポートのことは嫌いでも、カスタマーサクセスは嫌いにならないでください
カスタマーサポートのことは嫌いでも、カスタマーサクセスは嫌いにならないでくださいカスタマーサポートのことは嫌いでも、カスタマーサクセスは嫌いにならないでください
カスタマーサポートのことは嫌いでも、カスタマーサクセスは嫌いにならないでください
Takaaki Umada
 
Tips and tricks to win kaggle data science competitions
Tips and tricks to win kaggle data science competitionsTips and tricks to win kaggle data science competitions
Tips and tricks to win kaggle data science competitions
Darius Barušauskas
 

Viewers also liked (11)

Machine learning from disaster
Machine learning from disasterMachine learning from disaster
Machine learning from disaster
 
Winning Kaggle 101: Introduction to Stacking
Winning Kaggle 101: Introduction to StackingWinning Kaggle 101: Introduction to Stacking
Winning Kaggle 101: Introduction to Stacking
 
Manage services presentation
Manage services presentationManage services presentation
Manage services presentation
 
Managed Services is not a product, it's a business model!
Managed Services is not a product, it's a business model!Managed Services is not a product, it's a business model!
Managed Services is not a product, it's a business model!
 
Kaggle: Crowd Sourcing for Data Analytics
Kaggle: Crowd Sourcing for Data AnalyticsKaggle: Crowd Sourcing for Data Analytics
Kaggle: Crowd Sourcing for Data Analytics
 
NYAI - Interactive Machine Learning by Daniel Hsu
NYAI - Interactive Machine Learning by Daniel HsuNYAI - Interactive Machine Learning by Daniel Hsu
NYAI - Interactive Machine Learning by Daniel Hsu
 
Managed Services Presentation
Managed Services PresentationManaged Services Presentation
Managed Services Presentation
 
How to become a data scientist in 6 months
How to become a data scientist in 6 monthsHow to become a data scientist in 6 months
How to become a data scientist in 6 months
 
Final pink panthers_03_31
Final pink panthers_03_31Final pink panthers_03_31
Final pink panthers_03_31
 
カスタマーサポートのことは嫌いでも、カスタマーサクセスは嫌いにならないでください
カスタマーサポートのことは嫌いでも、カスタマーサクセスは嫌いにならないでくださいカスタマーサポートのことは嫌いでも、カスタマーサクセスは嫌いにならないでください
カスタマーサポートのことは嫌いでも、カスタマーサクセスは嫌いにならないでください
 
Tips and tricks to win kaggle data science competitions
Tips and tricks to win kaggle data science competitionsTips and tricks to win kaggle data science competitions
Tips and tricks to win kaggle data science competitions
 

Similar to The Hitchhiker’s Guide to Kaggle

R, Data Wrangling & Kaggle Data Science Competitions
R, Data Wrangling & Kaggle Data Science CompetitionsR, Data Wrangling & Kaggle Data Science Competitions
R, Data Wrangling & Kaggle Data Science Competitions
Krishna Sankar
 
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538
Krishna Sankar
 
Explainable AI
Explainable AIExplainable AI
Explainable AI
Arithmer Inc.
 
Slide3.ppt
Slide3.pptSlide3.ppt
Slide3.ppt
butest
 
Why am I doing this???
Why am I doing this???Why am I doing this???
Why am I doing this???
Anne-Marie Tousch
 
Titanic LinkedIn Presentation - 20022015
Titanic LinkedIn Presentation - 20022015Titanic LinkedIn Presentation - 20022015
Titanic LinkedIn Presentation - 20022015
Carlos Hernandez
 
Big Data Workshop
Big Data WorkshopBig Data Workshop
Big Data Workshop
Adrien Ickowicz
 
BAS 250 Lecture 8
BAS 250 Lecture 8BAS 250 Lecture 8
BAS 250 Lecture 8
Wake Tech BAS
 
How to Effectively Combine Numerical Features and Categorical Features
How to Effectively Combine Numerical Features and Categorical FeaturesHow to Effectively Combine Numerical Features and Categorical Features
How to Effectively Combine Numerical Features and Categorical Features
Domino Data Lab
 
How to Feed a Data Hungry Organization – by Traveloka Data Team
How to Feed a Data Hungry Organization – by Traveloka Data TeamHow to Feed a Data Hungry Organization – by Traveloka Data Team
How to Feed a Data Hungry Organization – by Traveloka Data Team
Traveloka
 
Titanic survivor prediction ppt (5)
Titanic survivor prediction ppt (5)Titanic survivor prediction ppt (5)
Titanic survivor prediction ppt (5)
GLA University
 
On Entities and Evaluation
On Entities and EvaluationOn Entities and Evaluation
On Entities and Evaluation
krisztianbalog
 
Data Science Folk Knowledge
Data Science Folk KnowledgeData Science Folk Knowledge
Data Science Folk Knowledge
Krishna Sankar
 
Kx for wine tasting
Kx for wine tastingKx for wine tasting
Kx for wine tasting
Mark Lefevre, CQF
 
Weka bike rental
Weka bike rentalWeka bike rental
Weka bike rental
Pratik Doshi
 
Sql saturday el salvador 2016 - Me, A Data Scientist?
Sql saturday el salvador 2016 - Me, A Data Scientist?Sql saturday el salvador 2016 - Me, A Data Scientist?
Sql saturday el salvador 2016 - Me, A Data Scientist?
Fabricio Quintanilla
 
Skew-symmetric matrix completion for rank aggregation
Skew-symmetric matrix completion for rank aggregationSkew-symmetric matrix completion for rank aggregation
Skew-symmetric matrix completion for rank aggregation
David Gleich
 
A General Overview of Machine Learning
A General Overview of Machine LearningA General Overview of Machine Learning
A General Overview of Machine Learning
Ashish Sharma
 
Seminar nov2017
Seminar nov2017Seminar nov2017
Seminar nov2017
Ahmed Youssef Ali Amer
 
Computer Vision for Beginners
Computer Vision for BeginnersComputer Vision for Beginners
Computer Vision for Beginners
Sanghamitra Deb
 

Similar to The Hitchhiker’s Guide to Kaggle (20)

R, Data Wrangling & Kaggle Data Science Competitions
R, Data Wrangling & Kaggle Data Science CompetitionsR, Data Wrangling & Kaggle Data Science Competitions
R, Data Wrangling & Kaggle Data Science Competitions
 
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538
 
Explainable AI
Explainable AIExplainable AI
Explainable AI
 
Slide3.ppt
Slide3.pptSlide3.ppt
Slide3.ppt
 
Why am I doing this???
Why am I doing this???Why am I doing this???
Why am I doing this???
 
Titanic LinkedIn Presentation - 20022015
Titanic LinkedIn Presentation - 20022015Titanic LinkedIn Presentation - 20022015
Titanic LinkedIn Presentation - 20022015
 
Big Data Workshop
Big Data WorkshopBig Data Workshop
Big Data Workshop
 
BAS 250 Lecture 8
BAS 250 Lecture 8BAS 250 Lecture 8
BAS 250 Lecture 8
 
How to Effectively Combine Numerical Features and Categorical Features
How to Effectively Combine Numerical Features and Categorical FeaturesHow to Effectively Combine Numerical Features and Categorical Features
How to Effectively Combine Numerical Features and Categorical Features
 
How to Feed a Data Hungry Organization – by Traveloka Data Team
How to Feed a Data Hungry Organization – by Traveloka Data TeamHow to Feed a Data Hungry Organization – by Traveloka Data Team
How to Feed a Data Hungry Organization – by Traveloka Data Team
 
Titanic survivor prediction ppt (5)
Titanic survivor prediction ppt (5)Titanic survivor prediction ppt (5)
Titanic survivor prediction ppt (5)
 
On Entities and Evaluation
On Entities and EvaluationOn Entities and Evaluation
On Entities and Evaluation
 
Data Science Folk Knowledge
Data Science Folk KnowledgeData Science Folk Knowledge
Data Science Folk Knowledge
 
Kx for wine tasting
Kx for wine tastingKx for wine tasting
Kx for wine tasting
 
Weka bike rental
Weka bike rentalWeka bike rental
Weka bike rental
 
Sql saturday el salvador 2016 - Me, A Data Scientist?
Sql saturday el salvador 2016 - Me, A Data Scientist?Sql saturday el salvador 2016 - Me, A Data Scientist?
Sql saturday el salvador 2016 - Me, A Data Scientist?
 
Skew-symmetric matrix completion for rank aggregation
Skew-symmetric matrix completion for rank aggregationSkew-symmetric matrix completion for rank aggregation
Skew-symmetric matrix completion for rank aggregation
 
A General Overview of Machine Learning
A General Overview of Machine LearningA General Overview of Machine Learning
A General Overview of Machine Learning
 
Seminar nov2017
Seminar nov2017Seminar nov2017
Seminar nov2017
 
Computer Vision for Beginners
Computer Vision for BeginnersComputer Vision for Beginners
Computer Vision for Beginners
 

More from Krishna Sankar

Pandas, Data Wrangling & Data Science
Pandas, Data Wrangling & Data SciencePandas, Data Wrangling & Data Science
Pandas, Data Wrangling & Data Science
Krishna Sankar
 
An excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXAn excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphX
Krishna Sankar
 
An excursion into Text Analytics with Apache Spark
An excursion into Text Analytics with Apache SparkAn excursion into Text Analytics with Apache Spark
An excursion into Text Analytics with Apache Spark
Krishna Sankar
 
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesBig Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Krishna Sankar
 
Data Science with Spark
Data Science with SparkData Science with Spark
Data Science with Spark
Krishna Sankar
 
Architecture in action 01
Architecture in action 01Architecture in action 01
Architecture in action 01
Krishna Sankar
 
Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)
Krishna Sankar
 
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkThe Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
Krishna Sankar
 
Bayesian Machine Learning - Naive Bayes
Bayesian Machine Learning - Naive BayesBayesian Machine Learning - Naive Bayes
Bayesian Machine Learning - Naive Bayes
Krishna Sankar
 
AWS VPC distilled for MongoDB devOps
AWS VPC distilled for MongoDB devOpsAWS VPC distilled for MongoDB devOps
AWS VPC distilled for MongoDB devOps
Krishna Sankar
 
The Art of Social Media Analysis with Twitter & Python
The Art of Social Media Analysis with Twitter & PythonThe Art of Social Media Analysis with Twitter & Python
The Art of Social Media Analysis with Twitter & Python
Krishna Sankar
 
Big Data Engineering - Top 10 Pragmatics
Big Data Engineering - Top 10 PragmaticsBig Data Engineering - Top 10 Pragmatics
Big Data Engineering - Top 10 Pragmatics
Krishna Sankar
 
Scrum debrief to team
Scrum debrief to team Scrum debrief to team
Scrum debrief to team
Krishna Sankar
 
The Art of Big Data
The Art of Big DataThe Art of Big Data
The Art of Big Data
Krishna Sankar
 
Precision Time Synchronization
Precision Time SynchronizationPrecision Time Synchronization
Precision Time Synchronization
Krishna Sankar
 
Nosql hands on handout 04
Nosql hands on handout 04Nosql hands on handout 04
Nosql hands on handout 04
Krishna Sankar
 
Cloud Interoperability Demo at OGF29
Cloud Interoperability Demo at OGF29Cloud Interoperability Demo at OGF29
Cloud Interoperability Demo at OGF29
Krishna Sankar
 
A Hitchhiker's Guide to NOSQL v1.0
A Hitchhiker's Guide to NOSQL v1.0A Hitchhiker's Guide to NOSQL v1.0
A Hitchhiker's Guide to NOSQL v1.0
Krishna Sankar
 

More from Krishna Sankar (18)

Pandas, Data Wrangling & Data Science
Pandas, Data Wrangling & Data SciencePandas, Data Wrangling & Data Science
Pandas, Data Wrangling & Data Science
 
An excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXAn excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphX
 
An excursion into Text Analytics with Apache Spark
An excursion into Text Analytics with Apache SparkAn excursion into Text Analytics with Apache Spark
An excursion into Text Analytics with Apache Spark
 
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesBig Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
 
Data Science with Spark
Data Science with SparkData Science with Spark
Data Science with Spark
 
Architecture in action 01
Architecture in action 01Architecture in action 01
Architecture in action 01
 
Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)
 
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkThe Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
 
Bayesian Machine Learning - Naive Bayes
Bayesian Machine Learning - Naive BayesBayesian Machine Learning - Naive Bayes
Bayesian Machine Learning - Naive Bayes
 
AWS VPC distilled for MongoDB devOps
AWS VPC distilled for MongoDB devOpsAWS VPC distilled for MongoDB devOps
AWS VPC distilled for MongoDB devOps
 
The Art of Social Media Analysis with Twitter & Python
The Art of Social Media Analysis with Twitter & PythonThe Art of Social Media Analysis with Twitter & Python
The Art of Social Media Analysis with Twitter & Python
 
Big Data Engineering - Top 10 Pragmatics
Big Data Engineering - Top 10 PragmaticsBig Data Engineering - Top 10 Pragmatics
Big Data Engineering - Top 10 Pragmatics
 
Scrum debrief to team
Scrum debrief to team Scrum debrief to team
Scrum debrief to team
 
The Art of Big Data
The Art of Big DataThe Art of Big Data
The Art of Big Data
 
Precision Time Synchronization
Precision Time SynchronizationPrecision Time Synchronization
Precision Time Synchronization
 
Nosql hands on handout 04
Nosql hands on handout 04Nosql hands on handout 04
Nosql hands on handout 04
 
Cloud Interoperability Demo at OGF29
Cloud Interoperability Demo at OGF29Cloud Interoperability Demo at OGF29
Cloud Interoperability Demo at OGF29
 
A Hitchhiker's Guide to NOSQL v1.0
A Hitchhiker's Guide to NOSQL v1.0A Hitchhiker's Guide to NOSQL v1.0
A Hitchhiker's Guide to NOSQL v1.0
 

Recently uploaded

Tirana Tech Meetup - Agentic RAG with Milvus, Llama3 and Ollama
Tirana Tech Meetup - Agentic RAG with Milvus, Llama3 and OllamaTirana Tech Meetup - Agentic RAG with Milvus, Llama3 and Ollama
Tirana Tech Meetup - Agentic RAG with Milvus, Llama3 and Ollama
Zilliz
 
CiscoIconsLibrary cours de réseau VLAN.ppt
CiscoIconsLibrary cours de réseau VLAN.pptCiscoIconsLibrary cours de réseau VLAN.ppt
CiscoIconsLibrary cours de réseau VLAN.ppt
moinahousna
 
Acumatica vs. Sage Intacct vs. NetSuite _ NOW CFO.pdf
Acumatica vs. Sage Intacct vs. NetSuite _ NOW CFO.pdfAcumatica vs. Sage Intacct vs. NetSuite _ NOW CFO.pdf
Acumatica vs. Sage Intacct vs. NetSuite _ NOW CFO.pdf
BrainSell Technologies
 
How Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdfHow Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdf
HackersList
 
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - MydbopsScaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Mydbops
 
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
maigasapphire
 
Data Integration Basics: Merging & Joining Data
Data Integration Basics: Merging & Joining DataData Integration Basics: Merging & Joining Data
Data Integration Basics: Merging & Joining Data
Safe Software
 
Choose our Linux Web Hosting for a seamless and successful online presence
Choose our Linux Web Hosting for a seamless and successful online presenceChoose our Linux Web Hosting for a seamless and successful online presence
Choose our Linux Web Hosting for a seamless and successful online presence
rajancomputerfbd
 
Introduction-to-the-IAM-Platform-Implementation-Plan.pptx
Introduction-to-the-IAM-Platform-Implementation-Plan.pptxIntroduction-to-the-IAM-Platform-Implementation-Plan.pptx
Introduction-to-the-IAM-Platform-Implementation-Plan.pptx
313mohammedarshad
 
Calgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptxCalgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptx
ishalveerrandhawa1
 
How to build a generative AI solution A step-by-step guide (2).pdf
How to build a generative AI solution A step-by-step guide (2).pdfHow to build a generative AI solution A step-by-step guide (2).pdf
How to build a generative AI solution A step-by-step guide (2).pdf
ChristopherTHyatt
 
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptxRPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
SynapseIndia
 
"Mastering Graphic Design: Essential Tips and Tricks for Beginners and Profes...
"Mastering Graphic Design: Essential Tips and Tricks for Beginners and Profes..."Mastering Graphic Design: Essential Tips and Tricks for Beginners and Profes...
"Mastering Graphic Design: Essential Tips and Tricks for Beginners and Profes...
Anant Gupta
 
EuroPython 2024 - Streamlining Testing in a Large Python Codebase
EuroPython 2024 - Streamlining Testing in a Large Python CodebaseEuroPython 2024 - Streamlining Testing in a Large Python Codebase
EuroPython 2024 - Streamlining Testing in a Large Python Codebase
Jimmy Lai
 
July Patch Tuesday
July Patch TuesdayJuly Patch Tuesday
July Patch Tuesday
Ivanti
 
Implementations of Fused Deposition Modeling in real world
Implementations of Fused Deposition Modeling  in real worldImplementations of Fused Deposition Modeling  in real world
Implementations of Fused Deposition Modeling in real world
Emerging Tech
 
The Role of IoT in Australian Mobile App Development - PDF Guide
The Role of IoT in Australian Mobile App Development - PDF GuideThe Role of IoT in Australian Mobile App Development - PDF Guide
The Role of IoT in Australian Mobile App Development - PDF Guide
Shiv Technolabs
 
Salesforce AI & Einstein Copilot Workshop
Salesforce AI & Einstein Copilot WorkshopSalesforce AI & Einstein Copilot Workshop
Salesforce AI & Einstein Copilot Workshop
CEPTES Software Inc
 
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdfWhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
ArgaBisma
 
find out more about the role of autonomous vehicles in facing global challenges
find out more about the role of autonomous vehicles in facing global challengesfind out more about the role of autonomous vehicles in facing global challenges
find out more about the role of autonomous vehicles in facing global challenges
huseindihon
 

Recently uploaded (20)

Tirana Tech Meetup - Agentic RAG with Milvus, Llama3 and Ollama
Tirana Tech Meetup - Agentic RAG with Milvus, Llama3 and OllamaTirana Tech Meetup - Agentic RAG with Milvus, Llama3 and Ollama
Tirana Tech Meetup - Agentic RAG with Milvus, Llama3 and Ollama
 
CiscoIconsLibrary cours de réseau VLAN.ppt
CiscoIconsLibrary cours de réseau VLAN.pptCiscoIconsLibrary cours de réseau VLAN.ppt
CiscoIconsLibrary cours de réseau VLAN.ppt
 
Acumatica vs. Sage Intacct vs. NetSuite _ NOW CFO.pdf
Acumatica vs. Sage Intacct vs. NetSuite _ NOW CFO.pdfAcumatica vs. Sage Intacct vs. NetSuite _ NOW CFO.pdf
Acumatica vs. Sage Intacct vs. NetSuite _ NOW CFO.pdf
 
How Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdfHow Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdf
 
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - MydbopsScaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
 
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
 
Data Integration Basics: Merging & Joining Data
Data Integration Basics: Merging & Joining DataData Integration Basics: Merging & Joining Data
Data Integration Basics: Merging & Joining Data
 
Choose our Linux Web Hosting for a seamless and successful online presence
Choose our Linux Web Hosting for a seamless and successful online presenceChoose our Linux Web Hosting for a seamless and successful online presence
Choose our Linux Web Hosting for a seamless and successful online presence
 
Introduction-to-the-IAM-Platform-Implementation-Plan.pptx
Introduction-to-the-IAM-Platform-Implementation-Plan.pptxIntroduction-to-the-IAM-Platform-Implementation-Plan.pptx
Introduction-to-the-IAM-Platform-Implementation-Plan.pptx
 
Calgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptxCalgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptx
 
How to build a generative AI solution A step-by-step guide (2).pdf
How to build a generative AI solution A step-by-step guide (2).pdfHow to build a generative AI solution A step-by-step guide (2).pdf
How to build a generative AI solution A step-by-step guide (2).pdf
 
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptxRPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
 
"Mastering Graphic Design: Essential Tips and Tricks for Beginners and Profes...
"Mastering Graphic Design: Essential Tips and Tricks for Beginners and Profes..."Mastering Graphic Design: Essential Tips and Tricks for Beginners and Profes...
"Mastering Graphic Design: Essential Tips and Tricks for Beginners and Profes...
 
EuroPython 2024 - Streamlining Testing in a Large Python Codebase
EuroPython 2024 - Streamlining Testing in a Large Python CodebaseEuroPython 2024 - Streamlining Testing in a Large Python Codebase
EuroPython 2024 - Streamlining Testing in a Large Python Codebase
 
July Patch Tuesday
July Patch TuesdayJuly Patch Tuesday
July Patch Tuesday
 
Implementations of Fused Deposition Modeling in real world
Implementations of Fused Deposition Modeling  in real worldImplementations of Fused Deposition Modeling  in real world
Implementations of Fused Deposition Modeling in real world
 
The Role of IoT in Australian Mobile App Development - PDF Guide
The Role of IoT in Australian Mobile App Development - PDF GuideThe Role of IoT in Australian Mobile App Development - PDF Guide
The Role of IoT in Australian Mobile App Development - PDF Guide
 
Salesforce AI & Einstein Copilot Workshop
Salesforce AI & Einstein Copilot WorkshopSalesforce AI & Einstein Copilot Workshop
Salesforce AI & Einstein Copilot Workshop
 
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdfWhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
 
find out more about the role of autonomous vehicles in facing global challenges
find out more about the role of autonomous vehicles in facing global challengesfind out more about the role of autonomous vehicles in facing global challenges
find out more about the role of autonomous vehicles in facing global challenges
 

The Hitchhiker’s Guide to Kaggle

  • 1. The Hitchhiker’s Guide to Kaggle July 27, 2011 ksankar42@gmail.com [doubleclix.wordpress.com] anthony.goldbloom@kaggle.com
  • 3. [Overview diagram] The Amateur Data Scientist – Analytics Competitions! Algorithms (CART, randomForest), Tools, DataSets (Titanic, Churn), old competitions (Ford) and a competition in flight (HHP)
  • 4. Encounters —  1st ◦  This Workshop —  2nd ◦  Do Hands-on Walkthrough ◦  I will post the walkthrough scripts in ~ 10 days —  3rd ◦  Participate in HHP & Other competitions
  • 5. Goals Of This workshop 1.  Introduction to Analytics Competitions from Data, Algorithms & Tools perspective 2.  End-To-End Flow of a Kaggle Competition – Ford 3.  Introduction to the Heritage Health Prize Competition 4.  Materials for you to explore further ◦  Lot more slides ◦  Walkthrough – will post in 10 days
  • 6. Agenda —  Algorithms for the Amateur Data Scientist [25 Min] ◦  Algorithms, Tools & frameworks in perspective —  The Art of Analytics Competitions [10 Min] ◦  The Kaggle challenges —  How the RTA FORD was won – Anatomy of a competition [15 Min] ◦  Predicting FORD using Trees ◦  Submit an Entry —  Competition in flight – The Heritage Health Prize [30 Min] ◦  Walkthrough –  Introduction –  Dataset Organization –  Analytics Walkthrough ◦  Submit our entry —  Conclusion [5 Min]
  • 7. ALGORITHMS FOR THE AMATEUR DATA SCIENTIST Algorithms! The most massively useful thing an Amateur Data Scientist can have … “A towel is about the most massively useful thing an interstellar hitchhiker can have … any man who can hitch the length and breadth of the Galaxy, rough it … win through, and still know where his towel is, is clearly a man to be reckoned with.” – From The Hitchhiker's Guide to the Galaxy, by Douglas Adams. Published by Harmony Books in 1979
  • 8. The Amateur Data Scientist —  Am not a quant or an ML expert —  School of Amz, Springer & UTube —  For the rest of us —  References I used (refs also in the respective slides): ◦  The Elements of Statistical Learning (a.k.a. ESLII) –  By Hastie, Tibshirani & Friedman ◦  Statistical Learning from a Regression Perspective –  By Richard Berk —  As Jeremy says it, you can dig into it as needed ◦  Not necessarily be an expert in the R toolbox
  • 9. Jeremy’s Axioms —  Iteratively explore data —  Tools ◦  Excel Format, Perl, Perl Book —  Get your head around data ◦  Pivot Table —  Don’t over-complicate —  If people give you data, don’t assume that you need to use all of it —  Look at pictures! —  History of your submissions – keep a tab —  Don’t be afraid to submit simple solutions ◦  We will do this during this workshop Ref: http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-sciencetalk-by-jeremy-howard/
  • 10. Summary: 1. Don’t throw away any data! Big data to smart data. 2. Be ready for different ways of organizing the data.
  • 11. Users apply different techniques •  Support Vector Machine •  Genetic Algorithms •  adaBoost •  Monte Carlo Methods •  Bayesian Networks •  Principal Component Analysis •  Decision Trees •  Kalman Filter •  Ensemble Methods •  Evolutionary Fuzzy Modelling •  Random Forest •  Logistic Regression •  Neural Networks Quora: http://www.quora.com/What-are-the-top-10-data-mining-or-machine-learning-algorithms Ref: Anthony’s Kaggle Presentation
  • 12. —  Let us take a 15 min overview of the algorithms ◦  Relevant in the context of this workshop ◦  From the perspective of the datasets we plan to use —  More of a qualitative than mathematical —  To get a feel for the how & the why
  • 13. [Concept map] Bias, Variance, Model Complexity, Over-fitting; Continuous Variables → Linear Regression; Categorical Variables → Classifiers: k-NN (Nearest Neighbors), Decision Trees, CART, Bagging, Boosting
  • 14. Titanic Passenger Metadata •  Small •  3 Predictors •  Class •  Sex •  Age •  Survived? Customer Churn •  17 Predictors Kaggle Competition – Ford “Stay Alert” Challenge •  Simple Dataset •  Competition Class Heritage Health Prize Data •  Complex •  Competition in Flight http://www.ohgizmo.com/2007/03/21/romain-jerome-titanic http://www.homestoriesatoz.com/2011/06/blogger-to-wordpress-a-fish-out-of-water.html
  • 15. Titanic Dataset —  Taken from the passenger manifest —  Good candidate for a Decision Tree —  CART [Classification & Regression Tree] ◦  Greedy, top-down, binary, recursive partitioning that divides feature space into sets of disjoint rectangular regions —  CART in R http://www.dabi.temple.edu/~hbling/8590.002/Montillo_RandomForests_4-2-2009.pdf
  • 16. Titanic Dataset – R walkthrough —  Load libraries —  Load data —  Model CART —  Model rattle() —  Tree —  Discussion [Decision-tree diagram: Male? → 3rd class? → Adult? → 3rd class?]
  • 17. CART [Decision-tree diagram: Male? / Female; 3rd class?; Adult? / Child; 3rd class?]
  • 18. CART lessons: 1. Do not over-fit. 2. All predictors are not needed. 3. All data rows are not needed. 4. Tuning the algorithms will give different results. [Decision-tree diagram: Male? / Female; 3rd class?]
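The greedy split behind CART can be sketched in a few lines: at each node, try every binary split and keep the one that minimizes weighted Gini impurity. A minimal pure-Python sketch on an illustrative toy table (the `sex`/`cls` columns stand in for the Titanic predictors; they are not the actual competition data):

```python
# Greedy CART-style split: pick the binary split that minimizes
# weighted Gini impurity. Toy data; columns are illustrative.
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n  # fraction of positives
    return 2 * p * (1 - p)

def best_split(rows, labels):
    """rows: list of dicts feature -> value. Returns (feature, value, score)."""
    best = None
    for feat in rows[0]:
        for val in {r[feat] for r in rows}:
            left = [l for r, l in zip(rows, labels) if r[feat] == val]
            right = [l for r, l in zip(rows, labels) if r[feat] != val]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[2]:
                best = (feat, val, score)
    return best

# Tiny illustrative passenger table: sex and class vs. survived
rows = [{"sex": "male", "cls": 3}, {"sex": "male", "cls": 1},
        {"sex": "female", "cls": 3}, {"sex": "female", "cls": 1}]
labels = [0, 0, 1, 1]
print(best_split(rows, labels))  # sex separates the labels perfectly
```

Recursing on each side of the chosen split, until a stopping rule fires, yields the rectangular regions described on slide 15.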
  • 19. Churn Data —  Predict churn —  Based on ◦  Service calls, v-mail and so forth
  • 21. Challenges —  Model Complexity ◦  A complex model increases the training data fit ◦  But then over-fits and doesn’t perform as well with real data —  Bias vs. Variance ◦  Classical diagram (training error vs. prediction error) ◦  From ESLII, by Hastie, Tibshirani & Friedman
• 22. —  Goal ◦  Model Complexity (-) ◦  Variance (-) ◦  Prediction Accuracy (+) Solution #1 – Partition Data ◦  Training (60%) ◦  Validation (20%) ◦  “Vault” Test (20%) data sets k-fold Cross-Validation ◦  Split data into k equal parts ◦  Fit model to k-1 parts & calculate prediction error on the kth part ◦  Non-overlapping datasets But the fundamental problem still exists!
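Both ideas on this slide can be sketched in base R. A minimal sketch, assuming a data frame `df` with a factor outcome column `y` (both names are placeholders):

```r
library(rpart)
set.seed(42)

# Solution #1: 60/20/20 partition (without replacement)
n     <- nrow(df)
idx   <- sample(n)
train <- df[idx[1:floor(0.6 * n)], ]
valid <- df[idx[(floor(0.6 * n) + 1):floor(0.8 * n)], ]
vault <- df[idx[(floor(0.8 * n) + 1):n], ]      # untouched until the very end

# k-fold cross-validation on the training partition
k     <- 10
folds <- sample(rep(1:k, length.out = nrow(train)))
errs  <- sapply(1:k, function(i) {
  fit  <- rpart(y ~ ., data = train[folds != i, ], method = "class")
  hold <- train[folds == i, ]
  mean(predict(fit, hold, type = "class") != hold$y)
})
mean(errs)                                      # cross-validated error estimate
```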
• 23. —  Goal ◦  Model Complexity (-) ◦  Variance (-) ◦  Prediction Accuracy (+) Solution #2 – Bootstrap ◦  Draw datasets (with replacement) and fit a model for each dataset –  Remember: Data Partitioning (#1) & Cross-Validation (#2) are without replacement Bagging (Bootstrap aggregation) ◦  Average predictions over a collection of bootstrapped samples, thus reducing variance
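Bagging follows directly from the definition above. A sketch, assuming a training frame `train` with a binary factor outcome `y` and a held-out frame `valid` (all placeholder names):

```r
library(rpart)
set.seed(42)

# Solution #2: bootstrap + aggregation (bagging)
B <- 25
preds <- replicate(B, {
  boot <- train[sample(nrow(train), replace = TRUE), ]  # draw WITH replacement
  fit  <- rpart(y ~ ., data = boot, method = "class")
  predict(fit, newdata = valid)[, 2]                    # P(positive class)
})
bagged <- rowMeans(preds)      # averaging over bootstrapped fits reduces variance
```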
• 24. —  Goal ◦  Model Complexity (-) ◦  Variance (-) ◦  Prediction Accuracy (+) Solution #3 – Boosting ◦  Combines the “output of weak classifiers into a powerful committee” ◦  Final prediction = weighted majority vote ◦  Misclassified points get higher weight, so later classifiers are forced to concentrate on them ◦  AdaBoost (Adaptive Boosting) ◦  Boosting vs Bagging –  Bagging – independent trees –  Boosting – successively weighted
• 25. —  Goal ◦  Model Complexity (-) ◦  Variance (-) ◦  Prediction Accuracy (+) Solution #4 – Random Forests+ ◦  Builds a large collection of de-correlated trees & averages them ◦  Improves Bagging by selecting i.i.d* random variables for splitting ◦  Simpler to train & tune ◦  “Do remarkably well, with very little tuning required” – ESLII ◦  Less susceptible to over-fitting (than boosting) ◦  Many RF implementations –  Original version in Fortran-77, by Breiman/Cutler –  R, Mahout, Weka, Milk (ML toolkit for py), matlab * i.i.d – independent identically distributed! + http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm!
• 26. —  Goal ◦  Model Complexity (-) ◦  Variance (-) ◦  Prediction Accuracy (+) Solution – General: Ensemble methods ◦  Two steps –  Develop a set of learners –  Combine the results to develop a composite predictor ◦  Ensemble methods can take the form of: –  Using different algorithms, –  Using the same algorithm with different settings –  Assigning different parts of the dataset to different classifiers ◦  Bagging & Random Forests are examples of ensemble methods Ref: Machine Learning In Action!
• 27. Random Forests —  While Boosting splits based on the best among all variables, RF splits based on the best among randomly chosen variables —  Simpler because it requires only two tuning parameters – the no. of predictors tried at each split (typically √k) & the no. of trees (500 for a large dataset, 150 for smaller ones) —  Error prediction ◦  For each iteration, predict for the data that is not in the sample (OOB data) ◦  Aggregate the OOB predictions ◦  Calculate the prediction error for the aggregate, which is basically the OOB estimate of the error rate –  Can use this to search for the optimal # of predictors ◦  We will see how close this is to the actual error in the Heritage Health Prize —  Assumes equal cost for mis-prediction. Can add a cost function —  Proximity matrix & applications like adding missing data, dropping outliers Ref: R News Vol 2/3, Dec 2002! Statistical Learning from a Regression Perspective: Berk! A Brief Overview of RF by Dan Steinberg!
• 28. Lot more to explore (Homework!) —  Loss matrix ◦  E.g. Telecom churn – better to give incentives to false positives (who are not leaving) than to miss incentives for false negatives (who are leaving) —  Missing values —  Additive Models —  Bayesian Models —  Gradient Boosting Ref: http://www.louisaslett.com/Courses/Data_Mining_09-10/ST4003-Lab4-New_Tree_Data_Set_and_Loss_Matrices.pdf!
  • 29. Churn Data w/ randomForest
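A minimal randomForest sketch for the churn slide. The file name, the outcome column `churned`, and the parameter values are assumptions, not the workshop script:

```r
library(randomForest)
set.seed(42)

churn <- read.csv("churn.csv")                 # service calls, v-mail, ...
churn$churned <- as.factor(churn$churned)

rf <- randomForest(churned ~ ., data = churn,
                   ntree = 150,                          # smaller dataset -> ~150 trees
                   mtry  = floor(sqrt(ncol(churn) - 1)), # ~ sqrt(k) predictors per split
                   importance = TRUE)

print(rf)           # includes the OOB estimate of error rate
varImpPlot(rf)      # which predictors drive churn
```

Note how the two tuning parameters from the previous slide (number of trees, number of predictors per split) map directly onto `ntree` and `mtry`.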
  • 31. “I keep saying the sexy job in the next ten years will be statisticians.” Hal Varian Google Chief Economist 2009
  • 33. Crowdsourcing Mismatch between those with data and those with the skills to analyse it
• 34. Tourism Forecasting Competition – chart of Forecast Error (MASE) over time: the existing model was beaten shortly after the Aug 9 launch, with error still falling 2 weeks and 1 month later, through competition end
• 35. Chess Ratings Competition – chart of Error Rate (RMSE) over time: the existing model (ELO) was outperformed soon after the Aug 4 launch, with error still falling 1 month and 2 months later, up to today
  • 36. 12,500 “Amateur” Data Scientists with different backgrounds
• 37. Charts of tool usage (R, Matlab, SAS, WEKA, SPSS, Python, Excel, Mathematica, Stata): R on Kaggle vs. among academics vs. among Americans Ref: Anthony’s Kaggle Presentation!
• 38. Mapping Dark Matter is an image analysis competition whose aim is to encourage the development of new algorithms that can be applied to the challenge of measuring the tiny distortions in galaxy images caused by dark matter. ~25% successful grant applications. NASA tried, now it’s our turn
• 39. “The world’s brightest physicists have been working for decades on solving one of the great unifying problems of our universe” “In less than a week, Martin O’Leary, a PhD student in glaciology, outperformed the state-of-the-art algorithms”
• 41. Why Participants Compete 1 Clean, real-world data 2 Professional reputation & experience 3 Interactions with experts in related fields 4 Prizes
  • 42. Use the wizard to post a competition
  • 44. Competitions are judged based on predictive accuracy
  • 45. Competition Mechanics Competitions are judged on objective criteria
  • 46. The Anatomy of a KAGGLE COMPETITION THE FORD COMPETITION
• 47. Ford Challenge - DataSet —  Goal: ◦  Predict driver alertness —  Predictors: ◦  Psychology – P1 .. P8 ◦  Environment – E1 .. E11 ◦  Vehicle – V1 .. V11 ◦  IsAlert? —  Data statistics are meaningless outside the IsAlert context
• 48. Ford Challenge – DataSet Files —  Three files ◦  ford_train –  510 trials, ~1,200 observations each spaced by 0.1 sec -> 604,330 rows ◦  ford_test –  100 trials, ~1,200 observations/trial, 120,841 rows ◦  example_submission.csv
  • 50. glm
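The glm entry can be reproduced roughly as below. The predictor names come from the slides; the training/test file names and the submission columns (TrialID, ObsNum, Prediction) are assumptions:

```r
# Logistic regression on the Ford data (sketch)
ford <- read.csv("ford_train.csv")
fit  <- glm(IsAlert ~ E4 + E5 + E9 + V11,       # a hand-picked subset of predictors
            data = ford, family = binomial)

test  <- read.csv("ford_test.csv")
probs <- predict(fit, newdata = test, type = "response")
write.csv(data.frame(TrialID    = test$TrialID, # submission columns assumed
                     ObsNum     = test$ObsNum,
                     Prediction = probs),
          "my_submission.csv", row.names = FALSE)
```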
• 51. Submission & Results ◦  Raw, all variables, rpart ◦  Raw, selected variables, rpart ◦  All variables, glm
• 52. How the Ford Competition was won —  How I Did It Blogs —  http://blog.kaggle.com/2011/03/25/inference-on-winning-the-ford-stay-alert-competition/ —  http://blog.kaggle.com/2011/04/20/mick-wagner-on-finishing-second-in-the-ford-challenge/ —  http://blog.kaggle.com/2011/03/16/junpei-komiyama-on-finishing-4th-in-the-ford-competition/
• 53. How the Ford Competition was won —  Junpei Komiyama (#4) ◦  To solve this problem, I constructed a Support Vector Machine (SVM), which is one of the best tools for classification and regression analysis, using the libSVM package. ◦  This approach took more than 3 hours to complete ◦  I found some data (P3-P6) were characterized by strong noise... Also, many environmental and vehicular data showed discrete values continuously increased and decreased. These suggested the necessity of pre-processing the observation data before SVM analysis for better performance
  • 54. How the Ford Competition was won —  Junpei Komiyama (#4) ◦  Averaging – improved score and processing time ◦  Average 7 data points –  Reduced processing by 86% & –  Increased score by 0.01 ◦  Tools –  Python processing of csv –  libSVM
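Junpei's 7-point averaging could be sketched like this in R (his actual pre-processing was Python over the csv files; the grouping below ignores trial boundaries for brevity, and `ford` is an assumed data frame of the raw training rows):

```r
# Collapse every 7 consecutive observations into their mean (sketch)
avg7 <- function(x) tapply(x, (seq_along(x) - 1) %/% 7, mean)

num_cols   <- sapply(ford, is.numeric)
ford_small <- as.data.frame(lapply(ford[num_cols], avg7))
# ~86% fewer rows to push through libSVM, at a small cost in resolution
```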
• 55. How the Ford Competition was won —  Mick Wagner (#2) ◦  Tools –  Excel, SQL Server ◦  I spent the majority of my time analyzing the data. I inputted the data into Excel and started examining it, taking note of discrete and continuous values, category-based parameters, and simple statistics (mean, median, variance, coefficient of variation). I also looked for extreme outliers. ◦  I made the first 150 trials (~30%) my test data and the remainder my training dataset (~70%). This single factor had the largest impact on the accuracy of my final model. ◦  I was concerned that using the entire data set would create too much noise and lead to inaccuracies in the model … so focused on data with state change
• 56. How the Ford Competition was won —  Mick Wagner (#2) ◦  After testing the Decision Tree and Neural Network algorithms against each other and submitting models to Kaggle, I found the Neural Network model to be more accurate ◦  Only used E4, E5, E6, E7, E8, E9, E10, P6, V4, V6, V10, and V11
• 57. How the Ford Competition was won —  Inference (#1) ◦  Very interesting ◦  “Our first observation is that trials are not homogeneous – so calculated mean, sd et al” ◦  “Training set & test set are not from the same population” – a good fit on training will result in a low score ◦  Lucky Model (Regression) –  −410.6073·sd(E5) + 0.1494·V11 + 4.4185·E9 ◦  (Remember – Data had P1-P8, E1-E11, V1-V11)
  • 58. HOW THE RTA WAS WON “This competition requires participants to predict travel time on Sydney's M4 freeway from past travel time observations.”
• 59. —  Thanks to ◦  François GUILLEM & ◦  Andrzej Janusz —  They both used R —  They shared their code & algorithms
• 60. How the RTA was won —  I effectively used R for the RTA competition. For my best submission, I just used simple techniques (OLS and means) but in a clever way - François GUILLEM (#14) —  I used a simple k-NN approach but the idea was to process data first & to compute some summaries of time series in consecutive timestamps using some standard indicators from technical analysis - Andrzej Janusz (#17)
  • 61. How the RTA was won —  #1 used Random Forests ◦  Time, Date & Week as predictors - José P. González-Brenes and Matías Cortés —  Regression models for data segments (total ~600!) —  Tools: ◦  Java/Weka ◦  4 processors, 12 GB RAM ◦  48 hours of computations - Marcin Pionnier (#5) Ref: http://blog.kaggle.com/2011/02/17/marcin-pionnier-on-finishing-5th-in-the-rta-competition/! Ref: http://blog.kaggle.com/2011/03/25/jose-p-gonzalez-brenes-and-matias-cortes-on-winning-the-rta-challenge/!
  • 62. THE HHP TimeCheck : Should be ~2:40!!
• 63. Lessons from Kaggle Winners 1 Don’t over-fit 2 All predictors are not needed 3 All data rows are not needed, either 4 Tuning the algorithms will give different results 5 Reduce the dataset (average, select transition data, …) 6 Test set & training set can differ 7 Iteratively explore & get your head around the data 8 Don’t be afraid to submit simple solutions 9 Keep a tab & history of your submissions
  • 64. The Competition “The goal of the prize is to develop a predictive algorithm that can identify patients who will be admitted to the hospital within the next year, using historical claims data”
• 66. Data Organization —  Members (113,000 entries; missing values) ◦  MemberID, Age at 1st Claim, Sex —  Claims (2,668,990 entries; missing values; different coding; PayDelay truncated at 162+) ◦  MemberID, ProvID, Vendor, PCP, Year, Speciality, PlaceOfSvc, PayDelay, LengthOfStay, DaysSinceFirstClaimThatYear, PrimaryConditionGroup, CharlsonIndex, ProcedureGroup, SupLOS ◦  SupLOS – Length of stay is suppressed during the de-identification process for some entries —  Days In Hospital Y2 (76,039 entries), Y3 (71,436 entries), Y4 – the target (70,943 entries) ◦  MemberID, DaysInHospital —  LabCount (361,485 entries; fairly consistent coding (10+); lots of zeros) ◦  MemberID, Year, DSFS, LabCount —  DrugCount (818,242 entries; fairly consistent coding (10+)) ◦  MemberID, Year, DSFS, DrugCount
• 68. Calculation & Prizes —  Judged on prediction error rate —  Milestone deadlines: Aug 31, 2011 06:59:59 UTC; Feb 13, 2012; Sep 04, 2012 —  Final deadline: Apr 04, 2013
  • 69. Now it is our turn … HHP ANALYTICS
• 70. POA —  Load data into SQLite —  Use SQL to de-normalize & pick out datasets —  Load them into R for analytics —  Total/Distinct counts ◦  Claims = 2,668,991/113,001 ◦  Members = 113,001 ◦  Drug = 818,242/75,999 <- unique = 141,532/75,999 (test) ◦  Lab = 361,485/86,640 <- unique = 154,935/86,640 (test) ◦  dih_y2 = 76,039 distinct / 11,770 with dih > 0 ◦  dih_y3 = 71,436 distinct / 10,730 with dih > 0 ◦  dih_y4 = 70,943 distinct
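The SQLite-to-R leg of this plan might look like the following sketch; the database file name, table name, and columns are assumptions:

```r
# Pull a de-normalized summary out of SQLite into R (names assumed)
library(RSQLite)

con <- dbConnect(SQLite(), dbname = "hhp.db")
claims_per_member <- dbGetQuery(con, "
  select member_id, count(*) as n_claims
  from   claims
  group  by member_id")
dbDisconnect(con)

head(claims_per_member)        # ready for merging with the dih_yx tables
```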
• 71. Idea #1 —  dih_Y2 = β0 + β1·dih_Y1 + β2·DC + β3·LC —  dih_Y3 = β0 + β1·dih_Y2 + β2·DC + β3·LC —  dih_Y4 = β0 + β1·dih_Y3 + β2·DC + β3·LC —  select count(*) from dih_y2 join dih_y3 on dih_y2.member_id = dih_y3.member_id; —  Y2-Y3 = 51,967 (8,339 with dih_y2 > 0) / Y3-Y4 = 49,683 (7,699 with dih_y3 > 0) —  The data is not straightforward to get into this shape ◦  Summarize drug and lab by member, year ◦  Split by year to get DC & LC per year ◦  Add to the dih_Yx tables ◦  Linear Regression
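Once the joins are done, Idea #1 reduces to ordinary lm() calls. The data frames (`y2y3`, `y3y4`) and column names below are placeholders for the joined Y2-Y3 and Y3-Y4 tables:

```r
# Fit dih_Y3 = b0 + b1*dih_Y2 + b2*DC + b3*LC on the Y2-Y3 join (sketch)
fit <- lm(dih_y3 ~ dih_y2 + dc_y2 + lc_y2, data = y2y3)
summary(fit)                    # beta estimates & fit diagnostics

# Apply the same coefficients one year forward to predict Y4
pred_y4 <- predict(fit, newdata = data.frame(dih_y2 = y3y4$dih_y3,
                                             dc_y2  = y3y4$dc_y3,
                                             lc_y2  = y3y4$lc_y3))
```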
• 72. Some SQL for idea #1 —  create table drug_tot as select member_id, year, total(drug_count) from drug_count group by member_id, year order by member_id, year; <- total drug (likewise lab) per year for each member —  Same for lab_tot —  create table drug_tot_y1 as select * from drug_tot where year = “Y1” —  … likewise for Y2, Y3, and for Y1, Y2, Y3 of lab_tot —  … join with the dih_yx tables
• 73. Idea #2 —  Add claims at Yx to the Idea #1 equations —  dih_Yn = β0 + β1·dih_Y(n−1) + β2·DC(n−1) + β3·LC(n−1) + β4·Claim(n−1) —  Then we will have to define the criteria for Claim(n−1) from the claim predictors viz. PrimaryConditionGroup, CharlsonIndex and ProcedureGroup
  • 74. The Beginning As the End —  We started with a set of goals —  Homework ◦  For me : –  To finish the hands-on walkthrough & post it in ~10 days ◦  For you –  Go through the slides –  Do the walkthrough –  Submit entries to Kaggle
• 75. I enjoyed preparing the materials a lot … hope you enjoyed attending even more … Questions?! IDE <- RStudio R_Packages <- c(plyr, rattle, rpart, randomForest) R_Search <- http://www.rseek.org/, powered=google