Research Triangle Analysts
          rtpanalysts.org
 • Intro to Kaggle.com
 • Titanic Getting Started Competition
 • Prediction problem with two outcome levels
 • Opportunity for an extended Data Shootout, with Kaggle.com
   providing data, scoring, tutorials, and forums.
 • Public domain data allows for detailed discussion of modeling
   issues and solutions without client data confidentiality concerns.
 • A common ground for in-depth learning and debates on
   analytics topics.

 • Participants of all levels of expertise welcome
 • You influence the direction of this effort by your
   participation. Post questions and thoughts on
   rtpanalysts.org.
 • Welcome!



Slides by Linda Schumacher. Contact via Research Triangle Analysts LinkedIn group member list
Classification Problems
• Two levels, or outcomes
• Data → Model → Predictions
• Examples
  – Find customers who are likely to buy a product
  – Identify patients likely to be admitted to hospital
  – Categorize cells as cancerous or benign
  – Who survives the Titanic disaster?


Classifier - Trees
• Decision Trees

  All Passengers
    Female
      First Class
      Second / Third Class
    Male
      Age < 16
      Age >= 16
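
The splits in this tree can be sketched in Python as nested conditions. This is a minimal illustration, not the tutorial's code: it assumes the female branch separates first class from second and third class, and the function names (`leaf`, `fit`, `predict`) and the four toy rows are made up for the example. Each leaf predicts the majority outcome of the training rows that land in it.

```python
from collections import Counter, defaultdict


def leaf(sex, pclass, age):
    """Return the name of the leaf a passenger falls into."""
    if sex == "female":
        # Assumed split: first class versus second or third class.
        return "female/first" if pclass == 1 else "female/second-third"
    return "male/under-16" if age < 16 else "male/16-plus"


def fit(rows):
    """rows: iterable of (sex, pclass, age, survived).

    Returns a dict mapping each leaf name to its majority outcome.
    """
    votes = defaultdict(Counter)
    for sex, pclass, age, survived in rows:
        votes[leaf(sex, pclass, age)][survived] += 1
    return {name: counts.most_common(1)[0][0] for name, counts in votes.items()}


def predict(model, sex, pclass, age):
    """Predict 1 (survived) or 0 using the fitted leaf outcomes."""
    return model[leaf(sex, pclass, age)]


# Made-up rows for illustration only; they are not taken from train.csv.
toy_rows = [
    ("female", 1, 30, 1),
    ("female", 3, 22, 0),
    ("male", 2, 8, 1),
    ("male", 3, 40, 0),
]
model = fit(toy_rows)
```

Real passenger records have missing ages, so a working version would need a rule for those rows before the age split.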

Classifier - Logistic Regression
• Equation – Logistic Regression
• F(x) = sigmoid(age+class-embarked+gender)
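
The schematic formula can be made concrete with a short Python sketch. The sigmoid maps a weighted sum of the inputs to a value between 0 and 1, which is read as a survival probability. The feature names, weights, and intercept below are hypothetical placeholders, not fitted values; in practice they would be estimated from the training set.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def survival_probability(features, weights, intercept):
    """Weighted sum of the features passed through the sigmoid."""
    z = intercept + sum(weights[name] * value for name, value in features.items())
    return sigmoid(z)


# Hypothetical weights for illustration; not estimated from the Titanic data.
weights = {"age": -0.03, "pclass": -1.0, "embarked_s": -0.2, "is_female": 2.5}
passenger = {"age": 29.0, "pclass": 1, "embarked_s": 1, "is_female": 1}

p = survival_probability(passenger, weights, intercept=1.0)
predicted_class = 1 if p >= 0.5 else 0  # threshold the probability at 0.5
```

Categorical inputs such as gender and port of embarkation have to be encoded as numbers first; the 0/1 indicator columns used here are one common choice.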




Titanic Data
• Passenger List
  – Name, class, fare, embarked, family members, age, cabin, etc.
  – Survival
• Training Set of 891 Passengers
• Test Set of 418 Passengers
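
A minimal Python sketch of reading a passenger list like this one with the standard `csv` module follows. The two sample rows are invented placeholders, and the shortened header is an assumption that uses a subset of the column names in train.csv.

```python
import csv
import io


def load_passengers(fileobj):
    """Read passenger rows from an open CSV file into a list of dicts."""
    return list(csv.DictReader(fileobj))


def survival_rate(passengers):
    """Fraction of passengers whose Survived column is '1'."""
    survived = sum(1 for p in passengers if p["Survived"] == "1")
    return survived / len(passengers)


# Invented sample standing in for train.csv; names contain commas,
# so they are quoted, and a missing age is an empty field.
sample = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age\n"
    '1,0,3,"Doe, Mr. John",male,40\n'
    '2,1,1,"Doe, Mrs. Jane",female,\n'
)
passengers = load_passengers(sample)
rate = survival_rate(passengers)
```

With the downloaded file, the same function would be called as `with open("train.csv", newline="") as f: passengers = load_passengers(f)`. All values arrive as strings, so numeric columns such as age need converting, with a decision about how to treat blanks.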



Kaggle.com

• Data
• Tutorials
  – Tools: Excel, Python
  – Models: Trees, Random Forests
• Submission
• Leaderboard
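
For the Submission step, the sketch below writes predictions in a two-column layout, assuming the competition expects a header of `PassengerId,Survived` followed by one row per test passenger. Check the competition page for the exact requirements. The function name and the two example predictions are hypothetical.

```python
import csv
import io


def write_submission(fileobj, predictions):
    """Write (passenger_id, survived) pairs as a submission CSV."""
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerow(["PassengerId", "Survived"])
    for passenger_id, survived in predictions:
        writer.writerow([passenger_id, int(survived)])


# Two made-up predictions written to an in-memory buffer for illustration.
buffer = io.StringIO()
write_submission(buffer, [(892, 0), (893, 1)])
submission_text = buffer.getvalue()
```

To produce a file for upload, pass an open file instead of the buffer, for example `open("submission.csv", "w", newline="")`.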

Where to Start
• Create a Kaggle account:
  http://www.kaggle.com/account/register
• Read the rules, and agree to them if you choose to continue
• Enter the Kaggle Titanic competition:
  http://www.kaggle.com/c/titanic-gettingStarted
• Download train.csv and test.csv
• If you choose to use R, download it from
  http://www.r-project.org/ (you will have to choose a
  ‘mirror’ site, usually a university or research site)
• If you share code or data outside of your Kaggle
  team, be sure to post a copy on the Kaggle Titanic forum;
  see http://www.kaggle.com/c/titanic-gettingStarted/details/rules
Benefits
• Extended Data Shoot-Out
• Tailor participation
• Opportunities
  – New classifiers
  – New tools, languages
  – Training vs test error
  – Round Table Discussion of Solutions
    - Compare model results
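
The "training vs test error" item can be illustrated with a small Python sketch. It holds out part of the labeled data, then compares accuracy on the rows the model saw with accuracy on the rows it did not. The `memorizer` model and the toy rows are hypothetical and chosen only because a model that memorizes its training rows shows the gap clearly.

```python
import random


def split(rows, holdout_fraction=0.25, seed=0):
    """Shuffle the rows and split them into (train, held_out) lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]


def accuracy(predict, rows):
    """Fraction of (features, label) rows that predict() gets right."""
    correct = sum(1 for features, label in rows if predict(features) == label)
    return correct / len(rows)


def memorizer(train_rows, default=0):
    """A deliberately overfit model: looks up rows it has already seen."""
    table = {features: label for features, label in train_rows}
    return lambda features: table.get(features, default)


# Made-up rows with unique features; not Titanic data.
rows = [((i,), i % 2) for i in range(20)]
train, held_out = split(rows)
model = memorizer(train)

train_acc = accuracy(model, train)        # rows the model has seen
held_out_acc = accuracy(model, held_out)  # rows the model has not seen
```

Accuracy on the training rows is perfect by construction, so only the held-out figure says anything about how the model would score on the leaderboard.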


