Data mining using lifeboat info = competitive edge. 12
additional male survivors is highly significant because they
counte...
Transcript of "Final pink panthers_03_31"

  1. The Titanic: Machine Learning from Disaster
     Data Mining and Machine Learning. Winter 2014. Final Project
     Jean Callao | Michelle Darling | Paul Marxhausen
  2. AGENDA
     In-depth analysis: by Jean Callao
     • Logistic Regression: glm
     • Tree-based methods: rpart, ctree
     In-depth analysis: by Paul Marxhausen
     • Ensemble Methods: randomForest, cforest
     Summary: by Michelle Darling
     • Data Visualization
     • Machine Learning Kaggle Results
  3. Titanic: Machine Learning from Disaster
     Why we picked this project:
     ● Historical context to understand "What does the data mean?"
     ● Learn one data set well, and then apply different algorithms and modelling tools.
     ● Practice the steps of data analysis:
       ○ Data exploration and visualization.
       ○ Model selection, building and testing.
     ● Prize: $0 + "knowledge & confidence" to go on to more challenging data science problems.
     kaggle.com provides:
     ● Online data science competitions.
     ● Structured problems, tutorials, help forums and discussion groups.
     ● Easy, consistent way to test models and track results.
     >>> Focus <<<
  4. April 1912: The Titanic Disaster
  5. RMS Titanic, April 1912
     A priori knowledge from problem domain
     What factors contributed to survival? Gender, Age, Passenger Class, Fare, Family
     More likely to survive:
     • Females
     • Children, Adults < 50
     • 1st Class
     • Paid higher fares
     • Travelling with family
     More likely to perish:
     • Males
     • Adults > 50
     • 2nd, 3rd class
     • Paid lower fares
     • Travelling alone
     • Immigrants
  6. Titanic Dataset: Predictor & Target Variables
     Response VARIABLE    DESCRIPTION
     Survived             Survived (1 = Yes; 0 = No)
     Predictor VARIABLE   DESCRIPTION
     Pclass               Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
     Name                 Passenger Name
     Sex                  Sex ("male", "female")
     Age                  Age (numeric, may be fractional, e.g., 1.5)
     Fare                 Passenger Fare
     SibSp                Number of Siblings/Spouses Aboard
     Parch                Number of Parents/Children Aboard
     Ticket               Ticket Number
     Cabin                Cabin
     Embarked             Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
     Legend: QUANTITATIVE variables; the rest are QUALITATIVE.
  7. Feature Engineering
     Data relating to one's location on the ship:
         library(stringr)
         # Last character of Cabin as an integer; NA when Cabin is blank or does not end in a digit
         data$cabin.last.digit <- suppressWarnings(as.integer(str_sub(as.character(data$Cabin), -1)))
         data$Side <- "Unknown"
         data$Side[which(data$cabin.last.digit %% 2 == 0)] <- "port"        # even cabin number
         data$Side[which(data$cabin.last.digit %% 2 == 1)] <- "starboard"   # odd cabin number
     Classifying Fares:
         combi$Fare2 <- '30+'
         combi$Fare2[combi$Fare < 30 & combi$Fare >= 20] <- '20-30'
         combi$Fare2[combi$Fare < 20 & combi$Fare >= 10] <- '10-20'
         combi$Fare2[combi$Fare < 10] <- '<10'
     Title: extracted from Name to find wealthy passengers; rare titles are merged:
         combi$Title[combi$Title %in% c('Mme', 'Mlle')] <- 'Mlle'
         combi$Title[combi$Title %in% c('Dona', 'Lady', 'the Countess')] <- 'Lady'
         combi$Title[combi$Title %in% c('Capt', 'Col', 'Don', 'Dr', 'Jonkheer', 'Major', 'Rev', 'Sir')] <- 'Noble'
     FamilySize: siblings/spouses plus parents/children plus the passenger:
         combi$FamilySize <- combi$SibSp + combi$Parch + 1
  8. Decision Trees and Logistic Regression
     Presented by Jean Callao
  9. Decision Trees
     • A decision tree is a simple but powerful form of multiple-variable analysis. It displays a tree-like graph of decisions and their possible consequences.
     • Recursive partitioning -> at each step, we identify a question that we use to partition the data.
     Advantages:
     • Data-driven: makes no prior assumptions; selects significant predictors based on the greatest information gain.
     • Flexible: no data pre-processing needed! Handles numeric and categorical data.
     • Easy to interpret and explain to others.
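To make "greatest information gain" concrete, here is a minimal sketch that scores a single candidate split (Sex) using the train counts quoted in the spine-plot slide of this deck (233 of 314 females and 109 of 577 males survived). It assumes an entropy-based criterion; rpart's own default splitting rule (Gini) may rank splits slightly differently, and the helper name `entropy` is ours.

```r
# Binary entropy (in bits) of a node with survival proportion p.
entropy <- function(p) {
  q <- c(p, 1 - p)
  q <- q[q > 0]            # treat 0 * log2(0) as 0
  -sum(q * log2(q))
}

# Counts by Sex as quoted for train in the spine-plot slide.
survived <- c(female = 233, male = 109)
perished <- c(female = 81,  male = 468)
n_group  <- survived + perished          # 314 females, 577 males
n_total  <- sum(n_group)                 # 891 passengers

h_root  <- entropy(sum(survived) / n_total)
h_child <- sapply(survived / n_group, entropy)

# Information gain = parent entropy minus size-weighted child entropy.
info_gain <- h_root - sum(n_group / n_total * h_child)
print(round(c(root = h_root, h_child, gain = info_gain), 3))
```

A tree builder would repeat this calculation for every candidate question and keep the one with the largest gain, then recurse into each child node.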
  10. Decision Tree with New Variables
          library(rpart)
          library(rattle)   # provides fancyRpartPlot()
          tree <- rpart(Survived ~ Class + Sex + Age + SibSp + Parch + Fare + Title + Side,
                        data = train, method = "class",
                        control = rpart.control(minsplit = 0, minbucket = 0, maxdepth = 10))
          fancyRpartPlot(tree)
          Prediction <- predict(tree, test, type = "class")
          table(Prediction)
          # Perished Survived
          #      262      156
  11. Decision Tree with New Variables
      Root node -> 62% perished, 38% survived
      Mr or Noble -> 84% perished, 16% survived
      Not a Mr or Noble -> 28% perished, 72% survived
      3rd class -> 52% perished, 48% survived
      Not 3rd class -> 5% perished, 95% survived
      Paid >= $23 -> 91% perished, 9% survived
      Paid < $23 -> 38% perished, 62% survived
      Age >= 36 yrs -> 86% perished, 14% survived
      Age < 36 yrs -> 36% perished, 64% survived
  12. Overfitted rpart Decision Tree
      Disadvantages of rpart:
      • Can suffer from:
        o High variance
        o High bias
      • Decision tree algorithms can result in overly complex or overfitted trees.
      Function ctree() in package party addresses these weaknesses by providing:
      • Unbiased variable selection
      • Statistical stopping rules to optimize tree growth.
  13. Conditional Tree: ctree
          library(party)
          train.ctree <- ctree(Survived ~ Class + Sex + Age + Fare + Title + Side, data = train)
          plot(train.ctree)
          Prediction2 <- predict(train.ctree, newdata = test, type = "response")
          table(Prediction2)
          # Perished Survived
          #      256      162
  14. Conditional Tree: ctree
      Mr or Noble -> Side -> Port or Starboard: 40% chance of surviving, 60% of dying
      Mr or Noble -> Side -> Unknown: 16% chance of surviving, 84% of dying
      Not a Mr or Noble -> 1st or 2nd Class: 98% chance of surviving, 2% of dying
      Not a Mr or Noble -> 3rd Class -> Paid <= $23.25: 61% chance of surviving, 39% of dying
      Not a Mr or Noble -> 3rd Class -> Paid > $23.25: 14% chance of surviving, 86% of dying
  15. Logistic Regression
      Least squares linear regression:
      • Predicted probabilities can be greater than 1 or less than 0 if used for classification!
      LOGISTIC REGRESSION
      • Used for a binary qualitative response.
      • Using the logit ensures all predicted probabilities lie between 0 and 1.
      Why use Logistic Regression?
      It lets us establish a relationship between a binary outcome variable and a group of predictor variables. It can be used as a:
      • CLASSIFICATION METHOD: classifies a binary response (e.g., Yes/No, Pass/Fail, Survived/Perished).
      • REGRESSION METHOD: calculates the probability (0.0 to 1.0) of the response.
  16. The "logit" model solves the problem:
      Log odds (logit) = linear combination:
        \ln\left(\frac{p}{1-p}\right) = B_0 + B_1 X
      Where:
      • "p" is the probability that Y equals 1 for a case, P(Y=1).
      • "1 - p" is the probability that Y equals 0.
      Transformed, the "log odds" are linear in X.
      Solving for the odds:
        \frac{p}{1-p} = e^{B_0 + B_1 X}
      Probability (logistic function), which produces an S-shaped curve:
        p = \frac{e^{B_0 + B_1 X}}{1 + e^{B_0 + B_1 X}}
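The chain from linear combination to odds to probability can be checked numerically. The sketch below uses made-up coefficients (`B0` and `B1` are hypothetical, not fitted values from this project) and compares the hand-written formula with base R's `plogis()` and `qlogis()`.

```r
# Hypothetical coefficients, chosen only for illustration.
B0 <- -1.0
B1 <-  0.5
X  <- c(-10, 0, 2, 10)

z    <- B0 + B1 * X               # linear combination = log odds
odds <- exp(z)                    # odds = p / (1 - p)
p    <- odds / (1 + odds)         # logistic function

# plogis() is base R's logistic function; qlogis() is the logit.
agrees_with_plogis <- isTRUE(all.equal(p, plogis(z)))
logit_recovers_z   <- isTRUE(all.equal(qlogis(p), z))
print(round(p, 3))
print(c(agrees_with_plogis, logit_recovers_z))
```

However extreme X becomes, p stays strictly between 0 and 1, which is the property least squares regression lacks.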
  17. Confirming the "women & children first" policy
          Titanic.glm <- glm(Survived ~ I(Sex == "female") + Class + I(Age <= 10) + Embarked + Fare2,
                             data = train, family = binomial("logit"))
          table(test$Survived)
          # Perished Survived
          #      252      166
          summary(Titanic.glm)
      Each logistic regression coefficient gives the change in the log odds of the outcome for a one-unit increase in the predictor variable.
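As a worked illustration of "change in the log odds": exponentiating a coefficient turns it into an odds ratio. Assuming the value 2.6816 that appears in the "Making Predictions" slide is the estimate for I(Sex == "female") (the transcript does not label it, so this is our reading), the calculation would be:

```latex
\frac{\text{odds}(\text{female})}{\text{odds}(\text{male})}
  = e^{B_{\text{female}}}
  = e^{2.6816}
  \approx 14.6
```

Under that assumption, and holding the other predictors fixed, the odds of survival for a female passenger would be roughly 14.6 times those of a male passenger.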
  18. Making Predictions
      Estimated survival probability: p = \frac{e^{z}}{1 + e^{z}}, where z is the fitted linear combination.
      • A female who is 10 years old: z is built from the terms 12.3958, 2.6816(1) and 1.6133(10); p = 0.99
      • A 2nd class man who paid 20 dollars for a ticket: z is built from the terms 12.3958, 2.6816(0), (0.9530)(2) and (0.6531)(20); p = 0.70
      The signs joining the terms of z are not legible in this transcript.
  19. Interpreting Coefficients…
          summary(Titanic.glm)
      Estimate > 0 -> higher probability of surviving
      Estimate < 0 -> lower probability of surviving
  20. Passengers travelling with relatives have higher chances of survival.
          Titanic.glm2 <- glm(Survived ~ Class + I(FamilySize >= 2) + Parch + I(SibSp >= 2),
                              data = train, family = binomial("logit"))
          table(test$Survived)
          # Perished Survived
          #      276      142
          summary(Titanic.glm2)
      We see that passenger class (Class) is a strong predictor, which supports the hypotheses about:
      • location on the ship
      • lifeboat access.
  21. First class adult males have lower chances of survival
          Titanic.glm3 <- glm(Survived ~ Class + I(Title == "Mr") + I(Title == "Noble") +
                                I(Age >= 30 & Age <= 50) + I(Fare >= 27),
                              data = train, family = binomial("logit"))
          table(test$Survived)
          # Perished Survived
          #      239      179
          summary(Titanic.glm3)
  22. "Any data relating to one's location on the ship could prove helpful to survival predictions…"
  23. First class adult males had lower chances of survival
          summary(Titanic.glm3)
      Those in upper decks (1st class) had more timely, accurate information and a shorter journey to the lifeboats… Yet why did 1st Class Males have lower survival rates?
      Possible explanation:
      • 1st Class Males were expected to be "gentlemen" and perish with the ship. "No woman shall be left aboard this ship because Ben Guggenheim was a coward."
      • 1st Class Male Survivors were condemned by society:
        > Bruce Ismay: had to resign as Chairman of White Star Line.
        > William Carter: divorced by wife.
  24. Third class adult males had lower chance of survival
      Those located in the bow or lower decks (3rd Class) had less chance of survival.
          Titanic.glm4 <- glm(Survived ~ Class + I(Age >= 30 & Age <= 65) +
                                I(Title == "Mr" & Class == "Third") + I(Fare <= 10),
                              data = train, family = binomial("logit"))
          table(test$Survived)
          # Perished Survived
          #      258      160
          summary(Titanic.glm4)
  25. Ensemble Methods: randomForest and cforest
      Presented by Paul Marxhausen
  26. Random Forests
      Advantages:
      • Easy to use: can be used quite efficiently with default parameters.
      • Ideal for people without a deep background in statistics.
      • Produces fairly strong predictions with only a small amount of coding.
      Ensemble:
      • Everyday sense: "a group of actors who perform together."
      • An example of an ENSEMBLE METHOD: combines multiple models to produce one result.
      • Unlike single decision trees, which can suffer from high variance or high bias, Random Forests use random sampling and averaging to find a natural balance between the two extremes.
  27. Random Forests: Built-in Randomness Logic and Data Pre-processing
      Randomness logic used
      • Built-in: random rows (bagging) and random columns (mtry) are sampled as part of fitting with training data.
      Restrictions / Disadvantages
      • Data has to be pre-processed to remove NAs, NULLs and blanks.
      • We have to fix Age, Embarked, Fare and FamilyID to meet these requirements.
      • Factor levels must be <= 32 for FamilyID (it started with roughly double that).
      DATA PRE-PROCESSING TASKS USING COMBINED DATA
      • Age (263 NAs) => rpart/predict
      • Embarked (2 blanks) => assign
      • Fare (1 NA) => median
      • FamilyID (exceeded levels) => re-group (now 22 levels)
          # Replace Fare NAs (see example)
          which(is.na(combi$Fare))
          combi$Fare[1044] <- median(combi$Fare, na.rm = TRUE)
  28. Model: randomForest(…) using the 'randomForest' package
          library(randomForest)
          # Build Random Forest Ensemble
          set.seed(415)   # two sources of randomness
          fit <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare +
                                Embarked + Title + FamilySize + FamilyID2,
                              data = train, importance = TRUE, ntree = 2000)
          # Generate importance graphs
          varImpPlot(fit)
          # Now let's make a prediction
          Prediction <- predict(fit, test)
      Kaggle.com score: 0.81818. Surprised!
  29. cforest(…): a type of random forest, implemented in the party package
          library(party)
          # Build a forest of conditional inference trees
          fit <- cforest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare +
                           Embarked + Title + FamilySize + FamilyID,
                         data = train, controls = cforest_unbiased(ntree = 4000, mtry = 2))
          # Now let's make a prediction and write a submission file
          Prediction <- predict(fit, test, OOB = TRUE, type = "response")
      Kaggle score after parameter adjustments: 0.81818. Surprised again!
  30. randomForest package vs. party package
      randomForest package
      • randomForest(…) function
      • mtry, the number of features randomly selected at each split, defaults to floor(sqrt(p)).
      • randomForest is computationally faster.
      • Popular in applied research.
      party package
      • cforest(…) function
      • mtry is set to 5 by default for technical reasons.
      • Resulting forests are unbiased even if the predictor variables are of different types.
      • Ensembles of conditional inference trees have not yet been extensively tested, so this routine is meant for the expert user only and its current state is rather experimental.
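As a quick check of the floor(sqrt(p)) rule, the sketch below counts the predictors in the randomForest formula shown earlier in this deck and derives the default mtry for classification. The variable names `f`, `p` and `mtry_default` are ours; no data or package is needed because only the formula is inspected.

```r
# Formula copied from the randomForest slide in this deck.
f <- as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare +
  Embarked + Title + FamilySize + FamilyID2

p <- length(attr(terms(f), "term.labels"))   # number of predictors
mtry_default <- floor(sqrt(p))               # default rule for classification
print(c(p = p, mtry = mtry_default))
```

With ten predictors, each split would consider three randomly chosen candidates unless mtry is set explicitly.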
  31. Ensemble Methods: kaggle results
      Model                    Description                                              Result    Leader Board
      fit <- randomForest(…)   Traditional Random Forest (randomForest package)         0.81818   03/20/14
      fit <- cforest(…)        Conditional inference trees (party package)              0.81340   03/20/14
      fit <- cforest(…)        Changed ntree from 2000 to 4000, and mtry from 3 to 2    0.81818   03/22/14
  32. Summary: Data Visualization, Algorithms & kaggle results
      by Michelle Darling
      www.datastudentblog.wordpress.com
  33. Data Visualization Summary
      1. Created Conceptual Data Model
         • to understand the denormalized data file.
      2. Tried lots of visualizations:
         • Categorical vs. Continuous
         • Uni-, Bi- and Multivariate
      3. Compared datasets:
         • Titanic vs. train vs. test ARE similar
      4. Created rule-based models using the most significant predictors:
         • Sex == "female"
         • Sex == "female" OR Age < 10
         • Sex:Child:Fare:FamilySize
      Data Visualization prototyping tools:
      • MS Excel
      • wordle.net
      • Google Fusion
      • R {rattle} package
      Conceptual Data Model (entities and attributes):
      • PORT: Embarked (S = Southampton, C = Cherbourg, Q = Queenstown)
      • TICKET: Ticket, Pclass, Cabin
      • PASSENGER: PassengerID, Name, Age, SibSp, Parch, Fare, Survived
  34. Text Analysis of Passenger Name
      Word clouds (SURVIVORS vs. PERISHED) created in www.wordle.net for Survivors$Name and Perished$Name:
          Survivors <- train[train$Survived == 1, ]
          Perished  <- train[train$Survived == 0, ]
      -> Sex ("male" vs. "female") is an important predictor of survival.
  35. Do family members affect survival?
          > table(Survived, Parch)
                  Parch
          Survived   0   1   2   3   4   5   6
                 0 445  53  40   2   4   4   1
                 1 233  65  40   3   0   1   0
          > table(Survived, SibSp)
                  SibSp
          Survived   0   1   2   3   4   5   8
                 0 398  97  15  12  15   5   7
                 1 210 112  13   4   3   0   0
      -> Survival is higher for passengers with Parch==3 (60%), or SibSp==1 (54%)
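The percentages quoted for Parch and SibSp follow directly from the two tables. A minimal sketch that re-enters the counts by hand (the object names `parch`, `sibsp` and the `_rate` vectors are ours) and computes the share of survivors in each column:

```r
# Counts copied from the two tables in the slide (rows: Survived 0/1).
parch <- rbind(c(445, 53, 40, 2, 4, 4, 1),
               c(233, 65, 40, 3, 0, 1, 0))
dimnames(parch) <- list(Survived = c("0", "1"), Parch = as.character(0:6))

sibsp <- rbind(c(398,  97, 15, 12, 15, 5, 7),
               c(210, 112, 13,  4,  3, 0, 0))
dimnames(sibsp) <- list(Survived = c("0", "1"), SibSp = as.character(c(0:5, 8)))

# Column-wise proportions: share of each group that survived.
parch_rate <- prop.table(parch, margin = 2)["1", ]
sibsp_rate <- prop.table(sibsp, margin = 2)["1", ]
print(round(parch_rate, 2))
print(round(sibsp_rate, 2))
```

Note that the Parch==3 group contains only five passengers, so its 60% rate rests on a very small sample.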
  36. What is the relationship between: Embarked, Pclass, Ticket, Fare?
      Chart panels: Cherbourg, France | Southampton, England | Queenstown, Ireland
      -> All three Embarked Ports (C, Q, S) boarded passengers from all classes (1st, 2nd, 3rd).
      -> But 50% of Cherbourg Passengers were 1st Class; they paid much higher fares (blue spikes).
      -> Based on this, Fare is likely a stronger predictor of survival than Embarked.
      Graph created in MS Excel using data from table(Embarked, Pclass, Fare, Ticket)
  37. Google Fusion Tables: Geospatial Heatmap, Network Diagrams
      Google Fusion Heatmap, GEOCODED by Embarkation Port:
      • Southampton, UK: 644 passengers
      • Cherbourg, France: 168 passengers
      • Queenstown, Ireland: 77 passengers
      Network Diagrams (SURVIVORS vs. PERISHED / No Lifeboat) showing Lifeboats (orange) vs. Embarkation Port (blue).
      Based on external data (Encyclopedia Titanica) imported into Google Fusion Tables.
  38. Rule-Based Models: Everyone Survived vs. Everyone Perished
          # Model: Everyone survived
          test$Survived <- 1
          submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
          write.csv(submit, file = "mdarling_model_0.csv", row.names = FALSE)
      Result: 0.37321 ☹
          # Model: Everyone perished
          test$Survived <- 0
          submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
          write.csv(submit, file = "mdarling_model_1.csv", row.names = FALSE)
      Result: Your Best Entry: 0.62679 ☺ You improved on your best score by 0.25359. You just moved up 12 positions on the leaderboard.
      -> Survival rate for test is similar to RMS Titanic
  39. Rule-Based Models: Random vs. Informed Guess
          # Model: Random Guess
          test$Survived <- sample(c(0, 1), 418, replace = TRUE)
          submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
          write.csv(submit, file = "mdarling_model_1random.csv", row.names = FALSE)
      Your submission scored 0.50718, ☹ which is not an improvement of your best score.
      Model: Informed Guess
      ● Used problem domain info, data visualizations and intuition to make an "informed guess" about each passenger.
      ● Manually typed 1 or 0 into the test.csv file with 418 rows…
      Your Best Entry: 0.70335! ☺ You improved on your best score by 0.07656!
      The process is similar to everyday human decision-making (no machine learning). The score is much better than random chance!
  40. Data Visualization in R
      R Visualization Packages:
      • Base R: plot, barplot, boxplot, hist, dotchart, heatmap, pairs
      • ggplot2: qplot, ggplot
      • lattice: xyplot, dotplot, parallelplot
      • vcd ("Visualizing Categorical Data"): mosaic, assoc
      • Rcmdr ("R Commander"): scatter3d
      • rattle: Explore tab, latticist, ggobi
  41. Continuous vs. Discrete (Categorical) Variables
      CORRELOGRAM (scatterplot matrix): {base R} pairs()
          attach(train)
          # data.frame() builds the table from the attached columns
          t <- data.frame(Survived, Pclass, Sex, Age, Fare, Embarked, SibSp, Parch)
          pairs(t, col = t$Pclass + 2)
          # Shift base R color palette by 2
          # 1st class: green (1+2=3)
          # 2nd class: blue  (2+2=4)
          # 3rd class: cyan  (3+2=5)
          # base R Color Wheel is not very subtle!
      • The correlogram is meant to show pair-wise relationships.
      • Continuous variables appear as "clouds".
      • Discrete variables appear as "bands".
  42. Which variables are correlated?
      (Models perform better when variables are independent!)
      Correlation plots created using the {rattle} R package.
      Variables plotted: FamilySize, SibSp, Parch, Fare3, Fare, Age
  43. Continuous, Multivariate
      Marginal Plots: {rattle} latticist
      • {rattle} is an R package
      • latticist is an interactive GUI for Data Visualization
  44. Continuous, Multivariate
      Intensity Map: {base R} heatmap()
      • Useful for visualizing and comparing data sets.
      • Requires a data matrix.
      • Values must be numeric (recode qualitative variables, e.g., Pclass, Sex).
      • Can use a custom color palette (e.g., RColorBrewer).
      PassengerID 1:891 = train (891 obs.); PassengerID 892:1309 = test (418 obs.). test does not have a Survived attribute.
      -> train is representative of test. "Soup Analogy": values look like they are randomly distributed and "well-stirred", with no big chunks of dark or light bands.
      -> Models based on train can be used to predict test fairly accurately.
  45. Continuous, Univariate
      Histogram: {base R} hist()
      Shows the range, density and distribution of a single, continuous variable.
          # Use 2X2 grid
          par(mfrow = c(2, 2))
          hist(test$Age)
          hist(test$Fare)
          hist(train$Age)
          hist(train$Fare)
      "Small Multiples" concept by Tufte: displaying multiple small plots side-by-side is effective for analysis.
      -> test and train have similar distributions for continuous variables.
  46. Categorical, Univariate
      Bar Plots: {base R} barplot()
      "Small Multiples" of Bar Plots for categorical variables. E.g., barplot(table(test$Child))
      -> test and train have similar distributions for categorical variables.
  47. Continuous, Univariate
      Dot Plot: {lattice} dotplot()
          library(lattice)
          attach(train)
          # Each dot is a passenger.
          # Survived==1 Red
          # Survived==0 Black
          cols <- ifelse(Survived == 1, "red", "black")   # explicit colors matching the legend above
          dotplot(Age, pch = 1, col = cols, main = "train$Age")
          dotplot(Fare, pch = 1, col = cols, main = "train$Fare")
      Annotations on the plots: a cluster of survivors (young children); outliers; a cluster of perished passengers (who paid the lowest fares).
  48. Continuous, Univariate
      Box Plot: {base R} boxplot()
      Shows interquartile range (IQR), median, outliers.
          # Plot Age grouped by Pclass
          par(mfrow = c(1, 2))
          Survivors <- train[train$Survived == 1, ]
          Perished  <- train[train$Survived == 0, ]
          boxplot(Age ~ Pclass, data = Survivors, col = "lightblue", main = "Survived",
                  xlab = "Passenger Class", ylab = "Age")
          boxplot(Age ~ Pclass, data = Perished, col = "gray", main = "Perished",
                  xlab = "Passenger Class", ylab = "Age")
      -> Survivors had a younger age range than the perished across all three passenger classes.
      Median labels on the plots: 33.50, 28.00, 27.00, 28.00, 30.00, 38.50
  49. Categorical, Multivariate
      Spine Plot = 3 Bar Plots
      • All passengers: 314 females (35%), 577 males (65%)
      • Survived: 233 females (68%), 109 males (32%)
      • Perished: 81 females (15%), 468 males (85%)
      FEMALES: greater than expected survival rate (68% of survivors vs. 35% of passengers).
      MALES: greater than expected mortality rate (85% of the perished vs. 65% of passengers).
      Class: mutually exclusive, rectilinear partition. E.g., Female Survivors.
      Probability: frequency count / size of the reference set. E.g., female share of survivors = 233/342 = 68%.
      A Spine Plot is a visualization of a rules-based model; it exhaustively describes the feature space = Titanic Passengers (female vs. male).
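The percentages on this slide can be reproduced from its four counts. The sketch below re-enters them by hand (object names are ours) and shows the three different denominators in play: all passengers, all passengers with the same outcome, and all passengers of the same sex.

```r
# Counts copied from the slide (train, 891 passengers).
counts <- matrix(c(233, 109,
                    81, 468),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Outcome = c("Survived", "Perished"),
                                 Sex = c("female", "male")))

share_of_passengers  <- colSums(counts) / sum(counts)     # 314 and 577 of 891
share_within_outcome <- prop.table(counts, margin = 1)    # rows sum to 1
rate_within_sex      <- prop.table(counts, margin = 2)    # columns sum to 1

print(round(100 * share_of_passengers))
print(round(100 * share_within_outcome))
print(round(100 * rate_within_sex))
```

The 68% figure is the female share of survivors (row-wise proportion); the survival rate among females (column-wise proportion) is a different and higher number.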
  50. Categorical, Multivariate
      Spine Plot: {base R} spineplot()
      Indicates a higher than expected survival rate.
  51. Categorical, Multivariate
      Mosaic Plot: {vcd} mosaic()
      Visualization of a contingency table. vcd = "Visualizing Categorical Data"
      • Blue: High Probability
      • Gray: Neutral
      • Red: Low Probability
      Example: 3rd Class Male (Sex==male & Pclass==3)
      • High Probability: Survived==0
      • Low Probability: Survived==1
          # Mosaic Plot
          library(vcd)
          attach(train)
          t <- table(Sex, Survived, Child)
          mosaic(t, shade = TRUE, main = "train dataset")
      Plot labels: female adults, female children, male adults, male children; 60% Perished, 40% Survived
  52. Mosaic Plot is similar to a Decision Tree
      • females (survived): 36% of all passengers, 77% of all survivors
      • male adults (perished): 61% of all passengers, 83% of all who perished
      Mosaic plot tiles: male adults, male children, female children, female adults; 60% Perished, 40% Survived
      Decision tree leaves: male adults (perished), male children (survived), females (survived), males (perished)
  53. Rule-Based Models: "Females" / "Women or Children"
          # Model: Females Survive
          test$Survived <- 0
          test$Survived[test$Sex == 'female'] <- 1
          submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
          write.csv(submit, file = "mdarling_model_female.csv", row.names = FALSE)
      Your Best Entry: 0.76555 ☺ You improved on your best score by 0.06220.
          # Model: Women OR Children Survive
          test$Survived <- 0
          test$Survived[test$Sex == 'female'] <- 1
          test$Survived[test$Age < 10] <- 1   # Tried different age cutoffs until score improved.
          submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
          write.csv(submit, file = "mdarling_model_wc.csv", row.names = FALSE)
      Your Best Entry: 0.77033 ☺ You improved on your best score by 0.00478.
  54. Rule-based model (70 rules)
      Sex : Child : Fare2 : FamilySize
      • Inspired by Principal Components Analysis (PCA).
      • Performed better than naiveBayes, qda, glm, svm (radial, sigmoid, polynomial)!
          aggregate(Survived ~ Sex + Child + Fare2 + FamilySize, data = train,
                    FUN = function(x) {sum(x)/length(x)})
  55. Summary: kaggle.com results so far…
      Model               Description                                                          Result
      70-Rule Model       aggregate(Survived~Sex+Child+Fare2+FamilySize, data=train,
                          FUN=function(x) {sum(x)/length(x)})                                  0.77512
      Female OR Child     [test$Sex == 'female' | test$Age < 10]                               0.77033
      Female              [test$Sex == 'female']                                               0.76555
      Informed Guess      Data Visualization + Problem Domain info + manual typing of 1,0
                          into .csv file                                                       0.70335
      Random Guess        sample(c(1,0), 418, replace=TRUE)                                    0.50718
      Everyone Perished   test$Survived <- 0                                                   0.62679
      Everyone Survived   test$Survived <- 1                                                   0.37321
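The slides show how the rule table is built with aggregate() but not how it is turned into predictions. The sketch below is one plausible way to do that on toy data: every value in `toy_train` and `toy_test` is made up, only two of the four grouping variables are used, and the 0.5 cut-off is our assumption rather than the authors' documented choice.

```r
# Toy training data (hypothetical values, not the Titanic file).
toy_train <- data.frame(
  Sex      = c("female", "female", "female", "male", "male", "male", "male", "male"),
  Child    = c(0, 0, 1, 0, 0, 0, 1, 1),
  Survived = c(1, 1, 1, 0, 0, 1, 1, 0)
)

# One "rule" per observed combination: the group's survival rate.
rules <- aggregate(Survived ~ Sex + Child, data = toy_train, FUN = mean)
names(rules)[names(rules) == "Survived"] <- "Rate"

# Assumed cut-off: predict survival when the group's rate exceeds 0.5.
toy_test <- data.frame(PassengerId = 1:3,
                       Sex   = c("female", "male", "male"),
                       Child = c(0, 0, 1))
pred <- merge(toy_test, rules, by = c("Sex", "Child"), all.x = TRUE)
pred$Survived <- as.integer(!is.na(pred$Rate) & pred$Rate > 0.5)
pred <- pred[order(pred$PassengerId), c("PassengerId", "Survived")]
print(pred)
```

Combinations that never occur in the training data get no rule; this sketch defaults them to 0 (perished), which is another design choice a real submission would need to make explicitly.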
  56. Machine Learning: Titanic Dataset
      START: Is training data available?
      • No -> UNSUPERVISED LEARNING
      • Yes (train.csv) -> SUPERVISED LEARNING
        • Continuous Target -> REGRESSION
        • Categorical Target: Survived -> CLASSIFICATION
          • Multivariate Classification
          • BINARY Classification (== 1, 0)
            • SINGLE CLASSIFIERS: glm, knn, qda, naiveBayes, rpart, ctree, svm
            • ENSEMBLE METHODS: randomForest, cforest
  57. Overview of Machine Learning Algorithms
  58. QDA (0.75598) vs Logistic Regression (0.76077)
      Logistic Regression:
      • Linear model = straight line boundaries.
      • Better fit for Titanic data set.
      QDA:
      • Polynomial (quadratic) model = curved boundaries.
      Both:
      • Eager learners. 2-step process: 1) Fit model using global info. 2) Predict test using reusable model.
  59. Naïve Bayes (0.76555) vs. KNN (0.77990)
          library(klaR)   # provides partimat()
          ptm <- proc.time()
          partimat(Survived ~ ., data = train_bc, method = "sknn")
          end <- (proc.time() - ptm)
          # 769.72 milliseconds: MORE TIME CONSUMING but MORE CUSTOMIZED BOUNDARIES -> greater accuracy.
          ptm <- proc.time()
          partimat(Survived ~ ., data = train_bc, method = "naiveBayes")
          end <- (proc.time() - ptm)
          # 39.99 milliseconds: only 5% of the knn time.
  60. AdaBoost (0.77990, same as KNN)
          # rattle Model output
          Summary of the Ada Boost model:
          Call:
          ada(Survived ~ ., data = crs$dataset[crs$train, c(crs$input, crs$target)],
              control = rpart.control(maxdepth = 30, cp = 0.01, minsplit = 20, xval = 10),
              iter = 50)
          Loss: exponential   Method: discrete   Iteration: 50
          Final Confusion Matrix for Data:
                    Final Prediction
          True value   0   1
                   0 350  23
                   1  45 205
          Train Error: 0.109
          Out-Of-Bag Error: 0.136  iteration= 50
          Additional Estimates of number of iterations:
          train.err1 train.kap1
                  50         50
          Variables actually used in tree construction:
          [1] "Age" "FamilyID2" "Fare" "Sex" "Title"
          Frequency of variables actually used:
          FamilyID2      Fare     Title       Age       Sex
                 49        49        48        46         8
          Time taken: 3.42 secs
      Only 50 trees were used, compared to 4000 trees in cforest, which may explain the lower performance.
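The reported train error can be recovered from the confusion matrix in the rattle output. The sketch below re-enters the four counts by hand (object names are ours) and also derives the per-class rates, which the output does not print.

```r
# Confusion matrix copied from the rattle output (rows: true value).
cm <- matrix(c(350,  23,
                45, 205),
             nrow = 2, byrow = TRUE,
             dimnames = list(True = c("0", "1"), Predicted = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)
train_error <- 1 - accuracy
sensitivity <- cm["1", "1"] / sum(cm["1", ])   # survivors correctly identified
specificity <- cm["0", "0"] / sum(cm["0", ])   # perished correctly identified

print(round(c(error = train_error, sensitivity = sensitivity,
              specificity = specificity), 3))
```

The misclassified cases are 23 + 45 = 68 out of 623 training rows, which matches the 0.109 train error; the model misses survivors more often than it misses the perished.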
  61. 61. Examples of AdaBoost "weak learner" trees: 1, 3, 10, 20, 35, 47. Total: 50 trees
  62. 62. Support Vector Machines (2D): SVM Kernels & Decision Boundary Shapes
      • Linear -> Line
      • Radial -> Circle
      • Polynomial -> C Curve
      • Sigmoid -> S Curve
      Plots: linear, cost=1, 68% correct | radial, cost=100, 73.4% correct | polynomial, cost=10, 68% correct | sigmoid, cost=0.1, 66% correct
      "Goodness of Fit" – svm: radial performed best with two dimensions (0.77033).
  63. 63. Scatterplots for visualizing SVM: 2D {ggplot2} qplot vs. 3-D {Rcmdr} scatter3d
      # Interactive 3D hyperplane with spline
      library(Rcmdr); attach(train)
      scatter3d(Age,Survived,Fare)
      # Point and Line ScatterPlot
      library(ggplot2); attach(train)
      qplot(Age, Fare, data=train, geom=c("point","line"), colour=Survived, main = "Titanic Passengers")
  64. 64. SVM using 11 inputs Advantages of SVM: • Minimal pre-processing needed. • Tuning improves accuracy. • Helps reveal best fit (linear/poly/radial/sigmoid). • Held up well against the "Curse of Dimensionality": instead of worsening, accuracy improved when dimensions increased from 2 to 11 attributes. Score of 0.79904 is good, but still below cforest and randomForest (0.81818).
  65. 65. cforest (0.81818) + Lifeboat Data Fusion = 0.83732
      # Added 12 male survivors based on merged
      # lifeboat data from Encyclopedia Titanica.
      ciforest2 <- read.csv("ciforest2.csv")
      testlb <- read.csv("test_lifeboats.csv")
      ensembles <- merge(ciforest2, testlb, by.x="PassengerId", by.y="PassengerId")
      ensembles$Survived[ensembles$Lifeboat==1] <- 1
      table(ensembles$Survived)
      #   0   1
      # 272 146
      submit <- data.frame(PassengerId = ensembles$PassengerId, Survived = ensembles$Survived)
      write.csv(submit, file = "ensembles_5.csv", row.names = FALSE)
  66. 66. "Ensemble of ensembles": randomForest + cForest + random tiebreaker
      What if we combine results from randomForest and cForest? Use random tiebreaker for non-unanimous votes.
      # Code for 95/05 tiebreaker (score 0.81818)
      # Merge randomForest and cForest and average
      # the results. Reuse unanimous votes.
      ensembles <- merge(rforest, ciforest2, by.x="PassengerId", by.y="PassengerId")
      ensembles$Vote <- (as.numeric(ensembles$Survived.x) + as.numeric(ensembles$Survived.y))/2
      ensembles$Survived[ensembles$Vote==1.0] <- 1
      ensembles$Survived[ensembles$Vote==0.0] <- 0
      # Create vector of 418 random 0s and 1s
      set.seed(pi)
      probs <- c(.95,.05)
      ensembles$rvote <- sample(c(0,1), 418, replace = TRUE, prob=probs)
      # For each tie, use a random vote
      ensembles$Survived[ensembles$Vote==0.5] <- ensembles$rvote[ensembles$Vote==0.5]
      table(ensembles$Survived)
      #   0   1
      # 281 137
      Results: Combinations did not outperform individuals, even when lifeboat data was added.
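The voting logic can be tried on a small example before running it on the full submission files. The sketch below uses two hypothetical six-row prediction tables (`rf` and `cf`, invented for illustration) and draws random votes only for the tied rows, a slight simplification of the slide's approach of drawing 418 votes up front.

```r
# Hypothetical predictions from two models for six passengers
rf <- data.frame(PassengerId = 1:6, Survived = c(1, 0, 1, 0, 1, 0))
cf <- data.frame(PassengerId = 1:6, Survived = c(1, 0, 0, 1, 1, 0))

# Merge on PassengerId; the suffixes keep the two Survived columns apart
ens <- merge(rf, cf, by = "PassengerId", suffixes = c(".rf", ".cf"))
ens$Vote <- (ens$Survived.rf + ens$Survived.cf) / 2

# Unanimous votes (0 or 1) are reused as-is; ties (0.5) are marked NA for now
ens$Survived <- ifelse(ens$Vote == 0.5, NA, ens$Vote)

# Break each tie with a random vote weighted 95% toward 0 and 5% toward 1
set.seed(1)
ties <- is.na(ens$Survived)
ens$Survived[ties] <- sample(c(0, 1), sum(ties), replace = TRUE, prob = c(0.95, 0.05))

print(ens)
```

With a 95/05 weighting the tiebreaker almost always votes 0, so the result is close to requiring both models to agree before predicting survival.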
  67. 67. Machine Learning Summary
      • Data mining using lifeboat info = competitive edge. The 12 additional male survivors are highly significant because they countered social norms and survived "against the odds".
      • Ensemble methods (randomForest, cforest) outperform single classifiers. "Many models work better than one."
      • Embedded feature selection models (svm, ctree, rpart) outperform models that need "manual" feature selection.
      • Decision trees are great communication tools.
      • knn has the same accuracy as glm and AdaBoost, but takes a lot of processing time.
      • Simple rule-based models can outperform naiveBayes if features are chosen by Principal Components Analysis (PCA).
      • Social norms ("Women and Children First", "Male survivors are cowards") greatly influenced survival.
      • Human decision-making outperforms random chance, and can outperform machine learning (depending on the human's expertise).
      • Math-based models like glm are sensitive to feature selection.
      • "Goodness of fit" determines performance. Linear and radial (glm, svm:linear/radial) outperformed others (qda, svm:polynomial/sigmoid).