Predicting a Match
For Speed Dating
SAMUEL BINENFELD
Contents
Introduction
Data Cleaning
Data Exploration
Modeling
Conclusions
Introduction
The Problem: The dating process is inefficient, and dates are unsuccessful far too often.
The Solution: If we can successfully predict the likelihood that two people will be a
match for each other, then we can improve the success rate of dates.
How We Do It: We analyze the Speed Dating Experiment dataset from Kaggle.com to
find out what makes two people a match for each other. Then, we create a machine
learning model that can predict match likelihood.
Data Cleaning
The data set started out with 195 columns and 8,378 rows. We reduced this to
38 columns and 8,038 rows. We also reformatted some of the variables. The
actions taken and the reasons why are described in this section.
Data Cleaning: Filtering Columns
Remove fields with significant missing data - Any field missing 10% or more of its values was removed.
• This step removed 108 fields
Remove repetitive fields - Some fields asked the same questions over and over again.
• This step removed 6 fields
Remove overly varied fields - Some fields were simply too varied to give any insight.
• This step removed 6 fields
Remove after-date information - Some fields contained information gathered after the date, and we wanted to keep our analysis to information gathered pre-date.
• This step removed 16 fields
Remove irrelevant fields - Fields where nothing of importance could be found were removed.
• This step removed 11 fields
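The missing-data threshold above can be sketched in pandas (the deck presumably used Python/pandas alongside scikit-learn; the toy column names here are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw Kaggle data (columns are hypothetical).
df = pd.DataFrame({
    "age": [25, 27, 30, 22],
    "mostly_missing": [1.0, np.nan, np.nan, np.nan],  # 75% missing
    "goal": [2, 3, 1, 2],
})

# Drop any column missing 10% or more of its values.
threshold = 0.10
keep = [c for c in df.columns if df[c].isna().mean() < threshold]
df = df[keep]
```

The same `isna().mean()` check, run per column on the real 195-column frame, is what would drop the 108 sparse fields.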
Data Cleaning: Filtering Rows
Remove dates with null values - We had to remove rows with null values because we could not make educated guesses as to what their values were.
• This step removed 263 rows
Remove dates with a 55-year-old person - This person was an outlier in terms of age.
• This step removed 6 rows
Remove dates whose partner was never a primary - Every date consisted of two people, a "primary" and a "partner". Later, we engineer variables that contain information from both daters, so we had to remove anyone who was never a primary.
• This step removed 71 rows
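The first two row filters are a `dropna` plus a boolean mask; a minimal pandas sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical rows; one has a null, one has the 55-year-old outlier.
df = pd.DataFrame({"age": [25.0, np.nan, 55.0, 30.0],
                   "match": [0, 1, 0, 1]})

# Drop rows with any null value, then drop the age outlier.
df = df.dropna()
df = df[df["age"] != 55]
```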
Data Cleaning: Reformat Variables
Fix "Date" and "Go Out" Variables - The rating scale for these variables was 1 through 7, with 1 being "very often" and 7 being "never". We found these variables easier to interpret with the scale reversed, so we subtracted each value from 8.
Fix Race Variable Errors - Some participants were recorded as the wrong race. We corrected these by using the race each person was listed as most frequently.
Change "Other" Race from 6 to 5 - Race was listed as an integer. Since no race used the integer 5, we changed "Other" from 6 to 5.
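The scale reversal is a one-liner in pandas; a sketch with illustrative values (the column name mirrors the deck's "Date" variable):

```python
import pandas as pd

# "date" frequency on the original scale: 1 = "very often" .. 7 = "never".
df = pd.DataFrame({"date": [1, 4, 7]})

# Reverse so that higher means more often: subtract each value from 8.
df["date"] = 8 - df["date"]
```

After the transform, 1 maps to 7 and 7 maps to 1, so larger numbers now mean more frequent dating.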
Data Exploration
Our dataset consists of 538 people, who went on a combined total of 4,019 dates
All dates were heterosexual, with one male and one female
Our variable of interest is match: a date is a match if both daters say yes to wanting to see their date again
The average match rate across all dates was 16.62%
Data Exploration: Gender
Males were more accepting than
females
Males said yes to females 47.87% of
the time
Females said yes to males 36.63% of
the time
Data Exploration: Age
The average age was 26.25
The age distribution was slightly skewed to the right
Data Exploration: Age
Daters preferred to date those who were closer to their own age
We saw a decrease in match rate as the age gap between daters increased
Data Exploration: Race
The race distribution was not balanced
Over half were Caucasian/European
About one fourth were Asian/Pacific
Islander/Asian American
1- African American
2- Caucasian/European
3- Latino/Hispanic
4- Asian/Pacific Islander/Asian-American
5- Other
Data Exploration: Same Race
We looked at match rates where both daters were the same race.
We saw that African Americans had the highest match rate increase. However, there were only 8 dates where both daters were African American, so the sample size was too small; we would need a more racially diverse dataset to confirm this trend.
1- African American
2- Caucasian/European
3- Latino/Hispanic
4- Asian/Pacific Islander/Asian-American
5- Other
Data Exploration: Interests
Participants were asked to rate their level of interest on a variety of activities such as sports,
movies, art, etc.
To get a look at each interest’s relationship with match rate, we performed a correlation test
Those who rated clubbing and yoga highly had a higher match rate (on average)
Those who rated movies and TV highly had a lower match rate (on average)
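A correlation test of this kind can be run directly on the frame; the ratings and outcomes below are made up purely to illustrate the sign of the correlation, not the deck's actual numbers:

```python
import pandas as pd

# Toy interest ratings (1-10) and the binary match outcome.
df = pd.DataFrame({
    "clubbing": [8, 2, 9, 1, 7, 3],
    "tv":       [2, 9, 1, 8, 3, 7],
    "match":    [1, 0, 1, 0, 1, 0],
})

# Pearson correlation of each interest rating with match.
corr = df.corr()["match"].drop("match")
```

With these toy values `corr["clubbing"]` comes out positive and `corr["tv"]` negative, mirroring the pattern described above.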
Data Exploration: Interests
We engineered a new variable that combines both participants' interest ratings
We found that for some variables, match rate was higher when the combined interest variable was high
This makes sense if we reason that people with shared interests are more likely to be a match
Here is an example with our combined interest variable for clubbing, labeled clubbing_com
Data Exploration:
Desires and Preferences
Desires were ratings of attributes in response to the question “what do you look for in a date?”
Preferences were ratings of a diverse, mixed-bag of questions
We perform correlation tests to see the relationship between each variable and match
Those who desired their partner to be fun had a higher match rate (on average)
Those who preferred going out and dating also had a higher match rate (on average)
*See Appendix 1 for full list
of the variables’ definitions
Modeling
For modeling, we go through the following steps:
Recap all of our engineered features
Pre-process the data
Model the data
Perform feature selection and re-model the data
Tune our models’ parameters
Modeling: Feature Engineering
The following is a list of the steps taken for feature engineering:
Create interaction terms (primary rating * partner rating) for all variables in the interests,
desires, and preferences categories
Create age difference variable (male age – female age)
Create age group variable that separates age into bins of [18-24, 25-30, 31-42]
Create combined age variable (primary age + partner age)
Create category variable that contains both the male and female’s race
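The engineered features above can be sketched in pandas. The `_o` suffix for partner columns is an assumption borrowed from the Kaggle dataset's naming convention, and all values are illustrative:

```python
import pandas as pd

# Hypothetical primary/partner columns ("_o" = partner, per Kaggle naming).
df = pd.DataFrame({
    "clubbing":   [8, 3],   # primary's interest rating
    "clubbing_o": [6, 2],   # partner's interest rating
    "age":        [24, 31], # primary's age
    "age_o":      [26, 29], # partner's age
})

# Interaction term: primary rating * partner rating.
df["clubbing_com"] = df["clubbing"] * df["clubbing_o"]

# Combined age: primary age + partner age.
df["age_com"] = df["age"] + df["age_o"]

# Age group bins of [18-24, 25-30, 31-42].
df["age_group"] = pd.cut(df["age"], bins=[17, 24, 30, 42],
                         labels=["18-24", "25-30", "31-42"])
```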
Modeling: Data Pre-Processing
We split the dataset into two parts, X and y.
X- all variables except for dec, dec_o, and match
y- match
We then split each of these up into training and testing sets. Our training set uses 75% of the
data and our test set is the remaining 25%.
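The 75/25 split is scikit-learn's `train_test_split`; a sketch on a toy frame (column names and the fixed `random_state` are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the cleaned dataset.
df = pd.DataFrame({
    "clubbing_com": range(100),
    "age_com": range(50, 150),
    "match": [0] * 84 + [1] * 16,  # roughly the 16.62% match rate
})

# X drops the target (the deck also drops dec and dec_o); y is match.
X = df.drop(columns=["match"])
y = df["match"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
```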
Modeling: First Try
We use four different supervised learning algorithms in modeling our data
Since the overall match rate is 16.62%, a model that predicts "no match" for every date would be 83.38% accurate. We therefore want our models to beat this baseline.
Our first run is a disappointment, as even our highest test accuracy score barely beats 83.38%.
Model                Train Acc.  Test Acc.  Precision  Recall  ROC AUC  PR AUC
Random Forest        98.47%      84.03%     74.51%     10.98%  75.32%   47.60%
Logistic Regression  83.69%      82.74%     40.00%      0.58%  63.44%   29.22%
SVC                 100.00%      82.79%      0.00%      0.00%  93.54%   88.45%
Gradient Boosting    85.95%      84.13%     93.55%      8.38%  77.17%   47.58%
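A sketch of fitting the four classifiers and collecting test accuracy. Synthetic imbalanced data stands in for the real features, and default hyperparameters are assumed, so the numbers will not match the table:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data with ~83% negatives, mimicking the match-rate imbalance.
X, y = make_classification(n_samples=400, weights=[0.83], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}
scores = {name: accuracy_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
          for name, m in models.items()}
```

Precision, recall, and the two AUCs in the table would come from `sklearn.metrics` (`precision_score`, `recall_score`, `roc_auc_score`, `average_precision_score`) computed the same way.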
Modeling: Feature Selection
Using the feature importance attribute of our Random Forest model, we take a look at which features are performing well
We see that many of the top-performing features are the combined interest variables we created
We will reduce the number of features in X to only the combined interest variables, and re-model the data
Feature        Importance
clubbing_com   0.035848
yoga_com       0.027607
date_com       0.026963
exercise_com   0.025950
shopping_com   0.025670
concerts_com   0.025075
hiking_com     0.025022
sinc1_com      0.024160
pid            0.023971
exphappy_com   0.023931
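Pulling a ranked importance table out of a fitted Random Forest looks roughly like this; the data is synthetic and the feature names are borrowed from the deck for illustration:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 5 features with names echoing the deck's columns.
cols = ["clubbing_com", "yoga_com", "date_com", "tv", "pid"]
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X, y)

# feature_importances_ sums to 1; sort to get the ranking shown above.
importances = (pd.Series(rf.feature_importances_, index=cols)
                 .sort_values(ascending=False))
```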
Modeling: Second Try
We see tremendous improvement with the Random Forest Classifier and the SVC
The next slide shows the AUC for the ROC and PR curves
Model                Train Acc.  Test Acc.  Precision  Recall  ROC AUC  PR AUC
Random Forest        99.45%      94.08%    100.00%     63.38%  93.15%   87.06%
Logistic Regression  83.23%      83.83%      0.00%      0.00%  58.40%   22.17%
SVC                 100.00%      95.42%    100.00%     71.69%  96.67%   92.23%
Gradient Boosting    85.50%      84.88%     86.21%      7.69%  69.65%   38.54%
Modeling: ROC and PR Curves
Modeling: RF Parameter Tuning
We move forward with only the Random Forest and SVC, and tune their parameters using GridSearchCV.
We'll start with Random Forest. On the left are the ranges of values we tried for each parameter, and on the right are the optimal parameters.
Ranges Attempted
n_estimators – [10, 25, 50, 75, 100]
min_samples_leaf – [1, 25, 50, 75, 100]
max_features – [.1, .25, .5, .75]
Optimal
n_estimators – 100
min_samples_leaf – 1
max_features – .1
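The grid search above maps directly onto scikit-learn's `GridSearchCV`; a sketch on synthetic data with the grids trimmed so it runs quickly (the full deck grids are listed above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=0)

# Trimmed versions of the deck's grids, for speed.
param_grid = {
    "n_estimators": [10, 50],       # deck tried [10, 25, 50, 75, 100]
    "min_samples_leaf": [1, 25],    # deck tried [1, 25, 50, 75, 100]
    "max_features": [0.1, 0.5],     # deck tried [.1, .25, .5, .75]
}
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid, cv=3).fit(X, y)
best = grid.best_params_
```

`grid.best_params_` is where the optimal `n_estimators=100, min_samples_leaf=1, max_features=.1` combination was read off.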
Modeling: RF Parameter Tuning
We re-model the Random Forest with the tuned parameters
We see a strong improvement, especially in recall
The RF model now looks to be about as accurate as the SVC
Model            Train Acc.  Test Acc.  Precision  Recall  ROC AUC  PR AUC
RF (default)     99.45%      94.08%    100.00%     63.38%  93.15%   87.06%
RF (new params) 100.00%      95.52%    100.00%     73.13%  96.15%   90.30%
Improvement      0.55%       1.44%      0.00%       9.75%   3.00%    3.24%
Modeling: SVC Parameter Tuning
The following is the parameter tuning for the SVC
The gamma parameter proved to be irrelevant, so we removed it
The optimal parameters turn out to be the same as the SVC's defaults
We don't need to re-model, since our previous SVC was already optimal
Ranges Attempted
C – [.0001, .001, .01, .1, 1]
kernel – ['linear', 'rbf', 'sigmoid', 'poly']
gamma – [0.01, 0.02, 0.03, 0.04, 0.05, 0.10, 0.2, 0.3, 0.4, 0.5]
Optimal
C – 1
kernel – 'rbf'
gamma – irrelevant
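The SVC search follows the same pattern; a quick sketch on synthetic data with a reduced grid (the deck's full C and gamma ranges are listed above):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# Reduced grid for speed; gamma is dropped, as the deck found it irrelevant.
param_grid = {
    "C": [0.01, 0.1, 1],
    "kernel": ["linear", "rbf"],
}
grid = GridSearchCV(SVC(), param_grid, cv=3).fit(X, y)
```

On the real data, `grid.best_params_` came back as `C=1, kernel='rbf'`, which are `SVC`'s defaults, so no refit was needed.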
Conclusions
The most important features in predicting a match were the combined interests variables
We exceeded our goal for our models, with our highest accuracy score reaching 95.52%
The best performing models were the Random Forest Classifier with tuned parameters, and the
SVC with default parameters
Both models performed well in test accuracy and PR-curve AUC, so we are comfortable using either one
Appendix 1
Appendix 1 cont.
Appendix 1 cont.
Appendix 1 cont.
Appendix 1 cont.