SlideShare a Scribd company logo
1 of 32
Predicting a Match
For Speed Dating
SAMUEL BINENFELD
Contents
Introduction
Data Cleaning
Data Exploration
Modeling
Conclusions
Introduction
The Problem: The dating process is inefficient, and dates are unsuccessful far too often.
The Solution: If we can successfully predict the likelihood that two people will be a
match for each other, then we can improve the success rate of dates.
How We Do It: We analyze the Speed Dating Experiment dataset from Kaggle.com to
find out what makes two people a match for each other. Then, we create a machine
learning model that can predict match likelihood.
Data Cleaning
The data set started out with 195 columns and 8,378 rows. We reduced this to
38 columns and 8,038 rows. We also reformatted some of the variables. The
actions taken and the reasons why are described in this section.
Data Cleaning: Filtering Columns
Remove fields with significant missing data- Any field that was missing 10% or more of our data was
removed.
 This step removed 108 fields
 Remove repetitive fields- Some fields were the same questions being asked over and over again.
 This step removed 6 fields
Remove varied fields- Some fields were simply too varied to give any insight.
 This step removed 6 fields
Remove after-date information- Some fields contained information gathered after the date, and we
wanted to keep our analysis to information gathered pre-date.
 This step removed 16 fields
Remove irrelevant fields- Fields where nothing of importance could be found were removed.
 This step removed 11 fields
Data Cleaning: Filtering Rows
Remove dates with null values- We had to remove rows with null values because we could not
make educated guesses as to what their values were.
 This step removed 263 rows
Remove dates with 55 year old person- This was person was an outlier in terms of age.
 This step removed 6 rows
Remove dates that had a partner who was never a primary- Every date consisted of two
people, a “primary”, and a “partner”. Later, we engineer variables that contain information from
both daters, so we had to remove anyone who was never a primary.
 This step removed 71 rows
Data Cleaning: Reformat Variables
Fix “Date” and “Go Out” Variables- The rating scale for these variables was 1 through 7, with 1
being “very often”, and 7 being “never”. We found these variables to be easier to interpret
when the scale was reversed, and so we subtracted each row from 8 to accomplish this.
Fix Race Variable Errors- There was an error where some participants were being recorded as
the wrong race. We were able to correct these by inputting the race that the person was listed
as most frequently.
Change “Other” Race from 6 to 5- Race was listed as an integer. Since there was no race with
the integer 5, we changed “Other” from 6 to 5.
Data Exploration
Our dataset consists of 538 people, who went on a combined total of 4,019 dates
All dates were of heterosexual nature, and had one male and one female
Our variable of interest is match– A date is a match if both daters say yes to wanting to
see their date again
The average match rate for all dates was 16.62%
Data Exploration: Gender
Males were more accepting than
females
Males said yes to females 47.87% of
the time
Females said yes to males 36.63% of
the time
Data Exploration: Age
The average age was 26.25
The age distribution was slightly skewed to the right
Data Exploration: Age
Daters preferred to date those who were closer to their own age
We saw a decrease in match rate as the age gap between daters increased
Data Exploration: Race
The race distribution was not balanced
Over half were Caucasian/European
About one fourth were Asian/Pacific
Islander/Asian American
1- African American
2- Caucasian/European
3- Latino/Hispanic
4- Asian/Pacific Islander/Asian-American
5- Other
Data Exploration: Same Race
We looked at match rates where both daters were the same race.
We saw that African Americans had the highest match rate increase– However, there were only
8 dates where both daters were African American, and so our sample size was too small. We
would need to get a more racially diverse dataset to confirm this trend.
1- African American
2- Caucasian/European
3- Latino/Hispanic
4- Asian/Pacific Islander/Asian-American
5- Other
Data Exploration: Interests
Participants were asked to rate their level of interest on a variety of activities such as sports,
movies, art, etc.
To get a look at each interest’s relationship with match rate, we performed a correlation test
Those who rated clubbing and yoga highly, had a higher match rate (on average)
Those who rated movies and tv highly, had a lower match rate (on average)
Data Exploration: Interests
We engineered a new variable that was a combination of both participants’ interest rating
We found that for some of the variables, match rate was higher when the combined interest
variable was high
This makes sense, if we reason that people with shared interests are more likely to be a match
Here is an example with our combined interest variable for clubbing, labeled clubbing_com
Data Exploration:
Desires and Preferences
Desires were ratings of attributes in response to the question “what do you look for in a date?”
Preferences were ratings of a diverse, mixed-bag of questions
We perform correlation tests to see the relationship between each variable and match
Those who desired their partner to be fun, had a higher match rate (on average)
Those who preferred going out and dating, also had a higher match rate (on average)
*See Appendix 1 for full list
of the variables’ definitions
Modeling
For modeling, we go through the following steps
Recap all of our engineered features
Pre-process the data
Model the data
Perform feature selection and re-model the data
Tune our models’ parameters
Modeling: Feature Engineering
The following is a list of the steps taken for feature engineering:
Create interaction terms (primary rating * partner rating) for all variables in the interests,
desires, and preferences categories
Create age difference variable (male age – female age)
Create age group variable that separates age into bins of [18-24, 25-30, 31-42]
Create combined age variable (primary age + partner age)
Create category variable that contains both the male and female’s race
Modeling: Data Pre-Processing
We split the dataset into two parts, X and y.
X- all variables except for dec, dec_o, and match
y- match
We then split each of these up into training and testing sets. Our training set uses 75% of the
data and our test set is the remaining 25%.
Modeling: First Try
We use four different supervised learning algorithms in modeling our data
Since the overall match rate, is 16.62%, a model that predicts “no match” for every date, would be
83.38% accurate. This means we want our models to strive for higher than this.
Our first run is a disappointment, as even our highest test accuracy score barely beats 83.38%.
Model
Train
Accuracy
Test
Accuracy
Precision Recall ROC AUC PR AUC
Random Forest 98.47% 84.03% 74.51% 10.98% 75.32% 47.60%
Logistic Regression 83.69% 82.74% 40.00% 0.58% 63.44% 29.22%
SVC 100.00% 82.79% 0.00% 0.00% 93.54% 88.45%
Gradient Boosting 85.95% 84.13% 93.55% 8.38% 77.17% 47.58%
Modeling: Feature Selection
Using the feature importance attribute in our
Random Forest model, we take a look to see
which features are performing well
We see that many of the top-performing
features are our combined interest variables
that we created
We will reduce the amount of features in X to
only the combined interest variables, and re-
model the data
Features Importance
clubbing_com 0.035848
yoga_com 0.027607
date_com 0.026963
exercise_com 0.02595
shopping_com 0.02567
concerts_com 0.025075
hiking_com 0.025022
sinc1_com 0.02416
pid 0.023971
exphappy_com 0.023931
Modeling- Second Try
We see tremendous improvement with the Random Forest Classifier and the SVC
The next slide shows the AUC for the ROC and PR curves
Model
Train
Accuracy
Test
Accuracy
Precision Recall ROC AUC PR AUC
Random Forest 99.45% 94.08% 100.00% 63.38% 93.15% 87.06%
Logistic
Regression
83.23% 83.83% 0.00% 0.00% 58.40% 22.17%
SVC 100.00% 95.42% 100.00% 71.69% 96.67% 92.23%
Gradient Boosting 85.50% 84.88% 86.21% 7.69% 69.65% 38.54%
Modeling: ROC and PR Curves
Modeling- RF Parameter Tuning
We move forward with only the Random Forest and SVC and tune the parameters using Grid-
Search CV.
We’ll start with Random Forest. On the left are the range of values we tried for each
parameter, and on the right are the optimal parameters.
Ranges Attempted
n_estimators – [10, 25, 50, 75, 100]
min_samples_leaf – [1, 25, 50, 75, 100]
max_features – [.1, .25, .5, .75]
Optimal
n_estimators – 100
min_samples_leaf – 1
max_features – .1
Modeling- RF Parameter Tuning
We re-model the Random Forest with the tuned parameters
We see a strong improvement, especially in recall
The RF model now looks to be about as accurate as the SVC
Model
Train
Accuracy
Test
Accuracy
Precision Recall ROC AUC PR AUC
RF (default) 99.45% 94.08% 100.00% 63.38% 93.15% 87.06%
RF (new params) 100.00% 95.52% 100.00% 73.13% 96.15% 90.30%
Improvement 0.55% 1.44% 0.00% 9.75% 3.00% 3.24%
Modeling- SVC Parameter Tuning
The following is the parameter tuning for the SVC
The gamma parameter proved to be irrelevant, so we removed it
The optimal parameters end up being the same as SVC’s default parameters
We don’t need to re-model it, since we know our previous model was already optimal
Ranges Attempted
C – [.0001,.001,.01,.1,1]
kernel – ['linear', 'rbf', 'sigmoid', 'poly']
gamma –
[0.01,0.02,0.03,0.04,0.05,0.10,0.2,0.3,0.4,0.5]
Optimal
C – 1
kernel – 'rbf’
gamma – irrelevant
Conclusions
The most important features in predicting a match were the combined interests variables
We exceeded our goal for our models, with our highest accuracy score reaching 95.52%
The best performing models were the Random Forest Classifier with tuned parameters, and the
SVC with default parameters
Both models performed well in test accuracy score and AUC of PR Curve, so we are comfortable
using either one
Appendix 1
Appendix 1 cont.
Appendix 1 cont.
Appendix 1 cont.
Appendix 1 cont.

More Related Content

What's hot

Basic statistics 8 - statistical estimation (2)
Basic statistics   8 - statistical estimation (2)Basic statistics   8 - statistical estimation (2)
Basic statistics 8 - statistical estimation (2)angita wahyu suprapti
 
Statistika parametrik_teknik analisis korelasi
Statistika parametrik_teknik analisis korelasiStatistika parametrik_teknik analisis korelasi
Statistika parametrik_teknik analisis korelasiM. Jainuri, S.Pd., M.Pd
 
Dasar dasar aljabar linier
Dasar dasar aljabar linierDasar dasar aljabar linier
Dasar dasar aljabar linierL Yudhi Prihadi
 
3. distribusi bentuk kuadrat
3. distribusi bentuk kuadrat3. distribusi bentuk kuadrat
3. distribusi bentuk kuadratjeky_SUY
 
Analisis regresi dan korelasi
Analisis regresi dan korelasiAnalisis regresi dan korelasi
Analisis regresi dan korelasiMousetha Bell
 
KORELASI PARSIAL.pptx
KORELASI PARSIAL.pptxKORELASI PARSIAL.pptx
KORELASI PARSIAL.pptxarisantomico
 
PPT Regresi Berganda
PPT Regresi BergandaPPT Regresi Berganda
PPT Regresi BergandaLusi Kurnia
 
Distribusi probabilitas deskriptif
Distribusi probabilitas deskriptifDistribusi probabilitas deskriptif
Distribusi probabilitas deskriptifAgus Candra
 
Teori perilaku konsumen
Teori perilaku konsumenTeori perilaku konsumen
Teori perilaku konsumenDaniel Arie
 
DIFERENSIAL PARSIAL/3/EKOMA/1
DIFERENSIAL PARSIAL/3/EKOMA/1DIFERENSIAL PARSIAL/3/EKOMA/1
DIFERENSIAL PARSIAL/3/EKOMA/1muliajayaabadi
 
APG Pertemuan 6 : Mean Vectors From Two Populations
APG Pertemuan 6 : Mean Vectors From Two PopulationsAPG Pertemuan 6 : Mean Vectors From Two Populations
APG Pertemuan 6 : Mean Vectors From Two PopulationsRani Nooraeni
 
Salinan terjemahan johnson, richard a wichern, dean w applied multivariate ...
Salinan terjemahan johnson, richard a wichern, dean w   applied multivariate ...Salinan terjemahan johnson, richard a wichern, dean w   applied multivariate ...
Salinan terjemahan johnson, richard a wichern, dean w applied multivariate ...PratiwiArdinDatu1
 
Section 3.3 quadratic functions and their properties
Section 3.3 quadratic functions and their properties Section 3.3 quadratic functions and their properties
Section 3.3 quadratic functions and their properties Wong Hsiung
 
APG Pertemuan 4 : Multivariate Normal Distribution (2)
APG Pertemuan 4 : Multivariate Normal Distribution (2)APG Pertemuan 4 : Multivariate Normal Distribution (2)
APG Pertemuan 4 : Multivariate Normal Distribution (2)Rani Nooraeni
 
Isoquant. "ekonomi produksi"
Isoquant. "ekonomi produksi"Isoquant. "ekonomi produksi"
Isoquant. "ekonomi produksi"nuelsitohang
 

What's hot (20)

Basic statistics 8 - statistical estimation (2)
Basic statistics   8 - statistical estimation (2)Basic statistics   8 - statistical estimation (2)
Basic statistics 8 - statistical estimation (2)
 
Statistika parametrik_teknik analisis korelasi
Statistika parametrik_teknik analisis korelasiStatistika parametrik_teknik analisis korelasi
Statistika parametrik_teknik analisis korelasi
 
Dasar dasar aljabar linier
Dasar dasar aljabar linierDasar dasar aljabar linier
Dasar dasar aljabar linier
 
3. distribusi bentuk kuadrat
3. distribusi bentuk kuadrat3. distribusi bentuk kuadrat
3. distribusi bentuk kuadrat
 
Analisis regresi dan korelasi
Analisis regresi dan korelasiAnalisis regresi dan korelasi
Analisis regresi dan korelasi
 
Materi p13 nonpar_satu sampel
Materi p13 nonpar_satu sampelMateri p13 nonpar_satu sampel
Materi p13 nonpar_satu sampel
 
KORELASI PARSIAL.pptx
KORELASI PARSIAL.pptxKORELASI PARSIAL.pptx
KORELASI PARSIAL.pptx
 
PPT Regresi Berganda
PPT Regresi BergandaPPT Regresi Berganda
PPT Regresi Berganda
 
Distribusi probabilitas deskriptif
Distribusi probabilitas deskriptifDistribusi probabilitas deskriptif
Distribusi probabilitas deskriptif
 
Teori perilaku konsumen
Teori perilaku konsumenTeori perilaku konsumen
Teori perilaku konsumen
 
DIFERENSIAL PARSIAL/3/EKOMA/1
DIFERENSIAL PARSIAL/3/EKOMA/1DIFERENSIAL PARSIAL/3/EKOMA/1
DIFERENSIAL PARSIAL/3/EKOMA/1
 
APG Pertemuan 6 : Mean Vectors From Two Populations
APG Pertemuan 6 : Mean Vectors From Two PopulationsAPG Pertemuan 6 : Mean Vectors From Two Populations
APG Pertemuan 6 : Mean Vectors From Two Populations
 
Contoh-soal-kalkulus-iii
Contoh-soal-kalkulus-iiiContoh-soal-kalkulus-iii
Contoh-soal-kalkulus-iii
 
Ekonometrika 1
Ekonometrika 1Ekonometrika 1
Ekonometrika 1
 
Salinan terjemahan johnson, richard a wichern, dean w applied multivariate ...
Salinan terjemahan johnson, richard a wichern, dean w   applied multivariate ...Salinan terjemahan johnson, richard a wichern, dean w   applied multivariate ...
Salinan terjemahan johnson, richard a wichern, dean w applied multivariate ...
 
Section 3.3 quadratic functions and their properties
Section 3.3 quadratic functions and their properties Section 3.3 quadratic functions and their properties
Section 3.3 quadratic functions and their properties
 
APG Pertemuan 4 : Multivariate Normal Distribution (2)
APG Pertemuan 4 : Multivariate Normal Distribution (2)APG Pertemuan 4 : Multivariate Normal Distribution (2)
APG Pertemuan 4 : Multivariate Normal Distribution (2)
 
proses poisson
proses poissonproses poisson
proses poisson
 
Isoquant. "ekonomi produksi"
Isoquant. "ekonomi produksi"Isoquant. "ekonomi produksi"
Isoquant. "ekonomi produksi"
 
PERILAKU KONSUMEN ( KARDINAL).pptx
PERILAKU KONSUMEN ( KARDINAL).pptxPERILAKU KONSUMEN ( KARDINAL).pptx
PERILAKU KONSUMEN ( KARDINAL).pptx
 

Similar to Speed Dating SS

Regression Analysis of SAT Scores Final
Regression Analysis of SAT Scores FinalRegression Analysis of SAT Scores Final
Regression Analysis of SAT Scores FinalJohn Michael Croft
 
Regression Analysis of NBA Points Final
Regression Analysis of NBA Points  FinalRegression Analysis of NBA Points  Final
Regression Analysis of NBA Points FinalJohn Michael Croft
 
The Beginnings Of A Search Engine
The Beginnings Of A Search EngineThe Beginnings Of A Search Engine
The Beginnings Of A Search EngineVirenKhandal
 
The Beginnings of a Search Engine
The Beginnings of a Search EngineThe Beginnings of a Search Engine
The Beginnings of a Search EngineVirenKhandal
 
Classification via Logistic Regression
Classification via Logistic RegressionClassification via Logistic Regression
Classification via Logistic RegressionTaweh Beysolow II
 
JAVA 2013 IEEE DATAMINING PROJECT Comparable entity mining from comparative q...
JAVA 2013 IEEE DATAMINING PROJECT Comparable entity mining from comparative q...JAVA 2013 IEEE DATAMINING PROJECT Comparable entity mining from comparative q...
JAVA 2013 IEEE DATAMINING PROJECT Comparable entity mining from comparative q...IEEEGLOBALSOFTTECHNOLOGIES
 
Comparable entity mining from comparative questions
Comparable entity mining from comparative questionsComparable entity mining from comparative questions
Comparable entity mining from comparative questionsIEEEFINALYEARPROJECTS
 
Classification modelling review
Classification modelling reviewClassification modelling review
Classification modelling reviewJaideep Adusumelli
 
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...Simplilearn
 
HRUG - Linear regression with R
HRUG - Linear regression with RHRUG - Linear regression with R
HRUG - Linear regression with Regoodwintx
 
OPIM 5604 predictive modeling presentation group7
OPIM 5604 predictive modeling presentation group7OPIM 5604 predictive modeling presentation group7
OPIM 5604 predictive modeling presentation group7Shu-Feng Tsao
 
R - what do the numbers mean? #RStats
R - what do the numbers mean? #RStatsR - what do the numbers mean? #RStats
R - what do the numbers mean? #RStatsJen Stirrup
 
Barga Data Science lecture 10
Barga Data Science lecture 10Barga Data Science lecture 10
Barga Data Science lecture 10Roger Barga
 
DataIDSalaryCompaMidpoint AgePerformance RatingServiceGenderRaiseD.docx
DataIDSalaryCompaMidpoint AgePerformance RatingServiceGenderRaiseD.docxDataIDSalaryCompaMidpoint AgePerformance RatingServiceGenderRaiseD.docx
DataIDSalaryCompaMidpoint AgePerformance RatingServiceGenderRaiseD.docxtheodorelove43763
 
Barga Data Science lecture 9
Barga Data Science lecture 9Barga Data Science lecture 9
Barga Data Science lecture 9Roger Barga
 
Recommender system
Recommender systemRecommender system
Recommender systemBhumi Patel
 

Similar to Speed Dating SS (20)

Regression Analysis of SAT Scores Final
Regression Analysis of SAT Scores FinalRegression Analysis of SAT Scores Final
Regression Analysis of SAT Scores Final
 
Regression Analysis of NBA Points Final
Regression Analysis of NBA Points  FinalRegression Analysis of NBA Points  Final
Regression Analysis of NBA Points Final
 
The Beginnings Of A Search Engine
The Beginnings Of A Search EngineThe Beginnings Of A Search Engine
The Beginnings Of A Search Engine
 
The Beginnings of a Search Engine
The Beginnings of a Search EngineThe Beginnings of a Search Engine
The Beginnings of a Search Engine
 
Classification via Logistic Regression
Classification via Logistic RegressionClassification via Logistic Regression
Classification via Logistic Regression
 
JAVA 2013 IEEE DATAMINING PROJECT Comparable entity mining from comparative q...
JAVA 2013 IEEE DATAMINING PROJECT Comparable entity mining from comparative q...JAVA 2013 IEEE DATAMINING PROJECT Comparable entity mining from comparative q...
JAVA 2013 IEEE DATAMINING PROJECT Comparable entity mining from comparative q...
 
Comparable entity mining from comparative questions
Comparable entity mining from comparative questionsComparable entity mining from comparative questions
Comparable entity mining from comparative questions
 
Classification modelling review
Classification modelling reviewClassification modelling review
Classification modelling review
 
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
 
HRUG - Linear regression with R
HRUG - Linear regression with RHRUG - Linear regression with R
HRUG - Linear regression with R
 
Employee mode of commuting
Employee mode of commutingEmployee mode of commuting
Employee mode of commuting
 
OPIM 5604 predictive modeling presentation group7
OPIM 5604 predictive modeling presentation group7OPIM 5604 predictive modeling presentation group7
OPIM 5604 predictive modeling presentation group7
 
R - what do the numbers mean? #RStats
R - what do the numbers mean? #RStatsR - what do the numbers mean? #RStats
R - what do the numbers mean? #RStats
 
SecondaryStructurePredictionReport
SecondaryStructurePredictionReportSecondaryStructurePredictionReport
SecondaryStructurePredictionReport
 
Barga Data Science lecture 10
Barga Data Science lecture 10Barga Data Science lecture 10
Barga Data Science lecture 10
 
Curvefitting
CurvefittingCurvefitting
Curvefitting
 
DataIDSalaryCompaMidpoint AgePerformance RatingServiceGenderRaiseD.docx
DataIDSalaryCompaMidpoint AgePerformance RatingServiceGenderRaiseD.docxDataIDSalaryCompaMidpoint AgePerformance RatingServiceGenderRaiseD.docx
DataIDSalaryCompaMidpoint AgePerformance RatingServiceGenderRaiseD.docx
 
Barga Data Science lecture 9
Barga Data Science lecture 9Barga Data Science lecture 9
Barga Data Science lecture 9
 
SEM
SEMSEM
SEM
 
Recommender system
Recommender systemRecommender system
Recommender system
 

Recently uploaded

Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...ThinkInnovation
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一F La
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAbdelrhman abooda
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 

Recently uploaded (20)

Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 

Speed Dating SS

  • 1. Predicting a Match For Speed Dating SAMUEL BINENFELD
  • 3. Introduction The Problem: The dating process is inefficient, and dates are unsuccessful far too often. The Solution: If we can successfully predict the likelihood that two people will be a match for each other, then we can improve the success rate of dates. How We Do It: We analyze the Speed Dating Experiment dataset from Kaggle.com to find out what makes two people a match for each other. Then, we create a machine learning model that can predict match likelihood.
  • 4. Data Cleaning The data set started out with 195 columns and 8,378 rows. We reduced this to 38 columns and 8,038 rows. We also reformatted some of the variables. The actions taken and the reasons why are described in this section.
  • 5. Data Cleaning: Filtering Columns Remove fields with significant missing data- Any field that was missing 10% or more of our data was removed.  This step removed 108 fields  Remove repetitive fields- Some fields were the same questions being asked over and over again.  This step removed 6 fields Remove varied fields- Some fields were simply too varied to give any insight.  This step removed 6 fields Remove after-date information- Some fields contained information gathered after the date, and we wanted to keep our analysis to information gathered pre-date.  This step removed 16 fields Remove irrelevant fields- Fields where nothing of importance could be found were removed.  This step removed 11 fields
  • 6. Data Cleaning: Filtering Rows Remove dates with null values- We had to remove rows with null values because we could not make educated guesses as to what their values were.  This step removed 263 rows Remove dates with 55 year old person- This was person was an outlier in terms of age.  This step removed 6 rows Remove dates that had a partner who was never a primary- Every date consisted of two people, a “primary”, and a “partner”. Later, we engineer variables that contain information from both daters, so we had to remove anyone who was never a primary.  This step removed 71 rows
  • 7. Data Cleaning: Reformat Variables Fix “Date” and “Go Out” Variables- The rating scale for these variables was 1 through 7, with 1 being “very often”, and 7 being “never”. We found these variables to be easier to interpret when the scale was reversed, and so we subtracted each row from 8 to accomplish this. Fix Race Variable Errors- There was an error where some participants were being recorded as the wrong race. We were able to correct these by inputting the race that the person was listed as most frequently. Change “Other” Race from 6 to 5- Race was listed as an integer. Since there was no race with the integer 5, we changed “Other” from 6 to 5.
  • 8. Data Exploration Our dataset consists of 538 people, who went on a combined total of 4,019 dates All dates were of heterosexual nature, and had one male and one female Our variable of interest is match– A date is a match if both daters say yes to wanting to see their date again The average match rate for all dates was 16.62%
  • 9. Data Exploration: Gender Males were more accepting than females Males said yes to females 47.87% of the time Females said yes to males 36.63% of the time
  • 10. Data Exploration: Age The average age was 26.25 The age distribution was slightly skewed to the right
  • 11. Data Exploration: Age Daters preferred to date those who were closer to their own age We saw a decrease in match rate as the age gap between daters increased
  • 12. Data Exploration: Race The race distribution was not balanced Over half were Caucasian/European About one fourth were Asian/Pacific Islander/Asian American 1- African American 2- Caucasian/European 3- Latino/Hispanic 4- Asian/Pacific Islander/Asian-American 5- Other
  • 13. Data Exploration: Same Race We looked at match rates where both daters were the same race. We saw that African Americans had the highest match rate increase– However, there were only 8 dates where both daters were African American, and so our sample size was too small. We would need to get a more racially diverse dataset to confirm this trend. 1- African American 2- Caucasian/European 3- Latino/Hispanic 4- Asian/Pacific Islander/Asian-American 5- Other
  • 14. Data Exploration: Interests Participants were asked to rate their level of interest on a variety of activities such as sports, movies, art, etc. To get a look at each interest’s relationship with match rate, we performed a correlation test Those who rated clubbing and yoga highly, had a higher match rate (on average) Those who rated movies and tv highly, had a lower match rate (on average)
  • 15. Data Exploration: Interests We engineered a new variable that was a combination of both participants’ interest rating We found that for some of the variables, match rate was higher when the combined interest variable was high This makes sense, if we reason that people with shared interests are more likely to be a match Here is an example with our combined interest variable for clubbing, labeled clubbing_com
  • 16. Data Exploration: Desires and Preferences Desires were ratings of attributes in response to the question “what do you look for in a date?” Preferences were ratings of a diverse, mixed-bag of questions We perform correlation tests to see the relationship between each variable and match Those who desired their partner to be fun, had a higher match rate (on average) Those who preferred going out and dating, also had a higher match rate (on average) *See Appendix 1 for full list of the variables’ definitions
  • 17. Modeling For modeling, we go through the following steps Recap all of our engineered features Pre-process the data Model the data Perform feature selection and re-model the data Tune our models’ parameters
  • 18. Modeling: Feature Engineering The following is a list of the steps taken for feature engineering: Create interaction terms (primary rating * partner rating) for all variables in the interests, desires, and preferences categories Create age difference variable (male age – female age) Create age group variable that separates age into bins of [18-24, 25-30, 31-42] Create combined age variable (primary age + partner age) Create category variable that contains both the male and female’s race
  • 19. Modeling: Data Pre-Processing We split the dataset into two parts, X and y. X- all variables except for dec, dec_o, and match y- match We then split each of these up into training and testing sets. Our training set uses 75% of the data and our test set is the remaining 25%.
  • 20. Modeling: First Try We use four different supervised learning algorithms in modeling our data Since the overall match rate, is 16.62%, a model that predicts “no match” for every date, would be 83.38% accurate. This means we want our models to strive for higher than this. Our first run is a disappointment, as even our highest test accuracy score barely beats 83.38%. Model Train Accuracy Test Accuracy Precision Recall ROC AUC PR AUC Random Forest 98.47% 84.03% 74.51% 10.98% 75.32% 47.60% Logistic Regression 83.69% 82.74% 40.00% 0.58% 63.44% 29.22% SVC 100.00% 82.79% 0.00% 0.00% 93.54% 88.45% Gradient Boosting 85.95% 84.13% 93.55% 8.38% 77.17% 47.58%
  • 21. Modeling: Feature Selection Using the feature importance attribute in our Random Forest model, we take a look to see which features are performing well We see that many of the top-performing features are our combined interest variables that we created We will reduce the amount of features in X to only the combined interest variables, and re- model the data Features Importance clubbing_com 0.035848 yoga_com 0.027607 date_com 0.026963 exercise_com 0.02595 shopping_com 0.02567 concerts_com 0.025075 hiking_com 0.025022 sinc1_com 0.02416 pid 0.023971 exphappy_com 0.023931
  • 22. Modeling- Second Try We see tremendous improvement with the Random Forest Classifier and the SVC The next slide shows the AUC for the ROC and PR curves Model Train Accuracy Test Accuracy Precision Recall ROC AUC PR AUC Random Forest 99.45% 94.08% 100.00% 63.38% 93.15% 87.06% Logistic Regression 83.23% 83.83% 0.00% 0.00% 58.40% 22.17% SVC 100.00% 95.42% 100.00% 71.69% 96.67% 92.23% Gradient Boosting 85.50% 84.88% 86.21% 7.69% 69.65% 38.54%
  • 23. Modeling: ROC and PR Curves
  • 24. Modeling- RF Parameter Tuning We move forward with only the Random Forest and SVC and tune the parameters using Grid- Search CV. We’ll start with Random Forest. On the left are the range of values we tried for each parameter, and on the right are the optimal parameters. Ranges Attempted n_estimators – [10, 25, 50, 75, 100] min_samples_leaf – [1, 25, 50, 75, 100] max_features – [.1, .25, .5, .75] Optimal n_estimators – 100 min_samples_leaf – 1 max_features – .1
  • 25. Modeling- RF Parameter Tuning We re-model the Random Forest with the tuned parameters We see a strong improvement, especially in recall The RF model now looks to be about as accurate as the SVC Model Train Accuracy Test Accuracy Precision Recall ROC AUC PR AUC RF (default) 99.45% 94.08% 100.00% 63.38% 93.15% 87.06% RF (new params) 100.00% 95.52% 100.00% 73.13% 96.15% 90.30% Improvement 0.55% 1.44% 0.00% 9.75% 3.00% 3.24%
  • 26. Modeling- SVC Parameter Tuning The following is the parameter tuning for the SVC The gamma parameter proved to be irrelevant, so we removed it The optimal parameters end up being the same as SVC’s default parameters We don’t need to re-model it, since we know our previous model was already optimal Ranges Attempted C – [.0001,.001,.01,.1,1] kernel – ['linear', 'rbf', 'sigmoid', 'poly'] gamma – [0.01,0.02,0.03,0.04,0.05,0.10,0.2,0.3,0.4,0.5] Optimal C – 1 kernel – 'rbf’ gamma – irrelevant
  • 27. Conclusions The most important features in predicting a match were the combined interests variables We exceeded our goal for our models, with our highest accuracy score reaching 95.52% The best performing models were the Random Forest Classifier with tuned parameters, and the SVC with default parameters Both models performed well in test accuracy score and AUC of PR Curve, so we are comfortable using either one