Classification via Logistic Regression:
Predicting Probability of Speed Dating Success
Taweh Beysolow II
Professor Nagaraja
Fordham University
I. Introduction
In speed dating, participants meet many people, each for a few minutes, and then
decide who they would like to see again. The data set we will be working with contains
information on speed dating experiments conducted on graduate and professional
students. Each person in the experiment met with 10-20 randomly selected people of the
opposite sex (only heterosexual pairings) for four minutes. After each speed date, each
participant filled out a questionnaire about the other person. Our goal is to build a model
to predict which pairs of daters want to meet each other again (i.e., have a second date).
The list of variables is as follows:
• Decision: 1 = Yes (you would like to see the date again), 0 = No (you would not
like to see the date again)
• Like: Overall, how much do you like this person? (1 = not at all, 10 = like a lot)
• PartnerYes: How probable do you think it is that this person will say ‘yes’ for
you? (1 = not probable, 10 = extremely probable)
• Age: Age
• Race: Caucasian, Asian, Black, Latino, or Other
• Attractive: Rate attractiveness of partner on a scale of 1-10 (1 = awful, 10 =
great)
• Sincere: Rate sincerity of partner on a scale of 1-10 (1 = awful, 10 = great)
• Fun: Rate how fun partner is on a scale of 1 – 10 (1 = awful, 10 = great)
• Ambitious: Rate ambition of partner on a scale of 1 – 10 (1 = awful, 10 = great)
• Shared Interest: Rate the extent to which you share interests/hobbies with
partner on a scale of 1 – 10 (1 = awful, 10 = great)
We will be using a reduced version of this experimental data with 276 unique male-female date pairs. In the file “SpeedDating.csv”, each variable name ends in either “M” for male or “F” for female. For example, “LikeM” refers to the “Like” variable as answered by the male participant (about the female participant). We treat the rating-scale variables (such as “PartnerYes” and “Attractive”) as numerical rather than categorical variables in our analysis.
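
As a starting point, the data can be loaded and inspected in R. This is a minimal sketch; the exact column names in “SpeedDating.csv” (e.g., “DecisionM”, “LikeF”) are assumptions based on the naming convention described above.

```r
# Load the speed dating data and inspect it. Column names such as DecisionM
# and LikeF are assumptions based on the "M"/"F" suffix convention above.
speed <- read.csv("SpeedDating.csv")
str(speed)      # variable types and dimensions (276 pairs expected)
summary(speed)  # ranges and NA counts for each variable
```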
II. Exploratory Analysis
When constructing the contingency table of the male and female decisions, we observe that approximately 22.83% of those who participated in the study are interested in a second date. Moving forward, we will say a second date is planned only if both people within the matched pair want to see each other again. As such, we make a new column in our data set, called “second.date”, whose value is 0 if there will be no second date and 1 if there will be a second date.
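
A brief sketch of this step in R follows; the decision column names are assumptions, while “second.date” is the column name given above.

```r
# Contingency table of the male and female decisions, and construction of the
# second.date indicator (1 only when both partners said yes).
# DecisionM and DecisionF are assumed column names.
table(speed$DecisionM, speed$DecisionF)
speed$second.date <- as.integer(speed$DecisionM == 1 & speed$DecisionF == 1)
mean(speed$second.date)  # proportion of matched pairs planning a second date
```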
Here, we observe roughly equal numbers of data points in each of the four corners of the scatterplot. This is the visual representation of the contingency table described above. Using the jitter function, we added random noise to the decisions; otherwise, all of the data points would have plotted on top of one another at the corresponding corners of the graph. Blue denotes pairs who will be going on a second date, whereas red denotes those who will not.
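
The jittered scatterplot described here could be produced along these lines (a sketch with assumed column names):

```r
# Jittered scatterplot of the two decisions, colored by second-date status.
plot(jitter(speed$DecisionM), jitter(speed$DecisionF),
     col  = ifelse(speed$second.date == 1, "blue", "red"),
     xlab = "Male decision (0 = no, 1 = yes)",
     ylab = "Female decision (0 = no, 1 = yes)")
```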
All of the variables, with the exception of the decision, second date, and age variables, are on a 1 to 10 scale. The recorded responses, excluding NA values, range from as low as 1 to as high as 10. Furthermore, we observe that there are exactly 142 missing entries, scattered among all the variables except the decision variable and our newly constructed second date variable. Because the great majority of the variables are on a 1-10 scale, centering of their values would be unnecessary. We exclude the decision variable from the predictors, and should we choose to use other variables that are not on the 1-10 scale, we would consider centering or normalization of some sort. We should hesitate to remove NA values before we decide which variables we are using in our model, as we do not want to unnecessarily reduce our sample size.
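
The missing-value counts can be checked directly, for example:

```r
# Total and per-variable missing entries; row removal is deferred until the
# model variables are chosen, to avoid shrinking the sample unnecessarily.
sum(is.na(speed))       # 142 missing entries in the report
colSums(is.na(speed))   # missing entries per variable
```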
The possible categories for race are Asian, Black, Caucasian, Latino, and Other. There are 3 missing race values within the data set. It is worth considering whether to delete these data points, but this should be done only if we use this variable in our model. The reason is that while a respondent might have neglected to report their race, they could have filled out responses for the other variables. As such, we do not want to unnecessarily remove observations from our data set, so as to preserve as large a sample size as possible.
III. Experimental Design
To determine the best logistic regression model, we began by including all of the variables on a 1-10 scale, leaving out race as well as the decision variable. We observe that this full model is relatively weak, so we perform backward stepwise selection, at each step keeping the candidate model with the best (lowest) AIC. After two more iterations, we stop at our third model. With second date as the response variable, shared interests (male) and shared interests (female) are the explanatory variables. The summary output for this model is shown below:
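
A minimal sketch of the fitting and backward stepwise selection in R follows; the predictor column names are assumptions, and complete cases are used so that AIC values are comparable across candidate models.

```r
# Fit the full logistic regression on the 1-10 scale variables, then reduce
# it by backward stepwise selection on AIC (lower is better).
model.vars <- c("second.date", "LikeM", "LikeF", "PartnerYesM", "PartnerYesF",
                "AttractiveM", "AttractiveF", "SincereM", "SincereF",
                "FunM", "FunF", "AmbitiousM", "AmbitiousF",
                "SharedInterestsM", "SharedInterestsF")
complete.cases.df <- na.omit(speed[, model.vars])
full.fit <- glm(second.date ~ ., data = complete.cases.df, family = binomial)
reduced.fit <- step(full.fit, direction = "backward", trace = FALSE)
summary(reduced.fit)  # the report's final model keeps the two shared-interest ratings
```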
We observe an AIC of 211.09, and all variables are statistically significant at the 95% confidence level. As for the regression assumptions, the data were collected independently, as the test subjects recorded their responses separately, and we assume those responses were recorded accurately. There are few outliers, as the scores are limited to defined ranges from which no participants deviated (with the exception of the NA values, which the regression can handle). Our sample size is considerably large, at 226 pairs, preserving approximately 81% of the original data. With respect to the residuals, we observe the following:
The errors exhibit independence, and the spread of the residuals shows that knowing the x values does not completely determine whether y = 0 or y = 1. As such, we conclude that our regression assumptions are satisfied and can proceed with the remainder of the experiment. We remove NA values by row and establish a classification threshold based on what maximizes our sensitivity statistic. As described above, we remove 50 rows containing NA values, and our probability threshold is approximately 48%.
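
A sketch of the refit on complete cases and of the threshold calculation follows; the shared-interest column names are assumptions, and the report's "mean of the odds" is interpreted here as the mean of the fitted probabilities.

```r
# Refit on complete cases of the two retained predictors (226 of 276 pairs in
# the report), then set the cutoff as the average of the mean and median
# fitted probabilities (roughly 0.48 in the report).
dating <- na.omit(speed[, c("second.date", "SharedInterestsM", "SharedInterestsF")])
final.fit <- glm(second.date ~ SharedInterestsM + SharedInterestsF,
                 data = dating, family = binomial)
p.hat <- fitted(final.fit)
threshold <- mean(c(mean(p.hat), median(p.hat)))
```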
Looking at the coefficients for our explanatory variables and the intercept, we observe the following, where “sharm” denotes shared interests (male) and “sharf” denotes shared interests (female). Both variables have similar slopes, and both increase the probability of a second date. The intercept, however, has a strongly negative effect on the probability of a date relative to the slopes. While it was expected that the two variables would each increase the probability of a date, as they are highly correlated with
one another, it was surprising that the intercept had so negative an effect on the probability. Intuitively, this indicates that the model is biased toward assuming two individuals will not be a good match before any data are entered.
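
The coefficients can be read off the fitted model and translated to the probability scale; the rating values below are illustrative, not taken from the report.

```r
# Inspect the estimated intercept and slopes, then compute an example
# predicted probability for a pair who both rate shared interests at 5
# (an illustrative value, not from the report).
coef(final.fit)
predict(final.fit,
        newdata = data.frame(SharedInterestsM = 5, SharedInterestsF = 5),
        type = "response")
```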
As briefly touched upon before, we chose a threshold of roughly 48%, calculated as the average of the mean and the median of the predicted probabilities produced by the model. This choice maximizes the sensitivity statistic while retaining reasonably high accuracy and specificity. We examined the effect of thresholds of 10 percent, 20 percent, the mean of the predicted probabilities, and the average of their mean and median on all of these statistics. Under the final threshold choice our sensitivity statistic was approximately 89%, and the other statistics remained acceptable, so we chose 48%.
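
The threshold comparison can be sketched as follows, computing accuracy, sensitivity, and specificity at each candidate cutoff (the helper function below is for illustration only).

```r
# Classification statistics at a given cutoff; used to compare the candidate
# thresholds described above (0.10, 0.20, mean, and mean-of-mean-and-median).
classify.stats <- function(cutoff) {
  pred <- as.integer(p.hat >= cutoff)
  c(accuracy    = mean(pred == dating$second.date),
    sensitivity = mean(pred[dating$second.date == 1] == 1),
    specificity = mean(pred[dating$second.date == 0] == 0))
}
sapply(c(0.10, 0.20, mean(p.hat), threshold), classify.stats)
```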
IV. Results
a. Accuracy: 0.6725664
b. Sensitivity: 0.8888889
c. Specificity: 0.4
d. ROC Curve (Area Under the Curve): 0.696
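
These statistics could be reproduced along the following lines; the pROC package is one common choice for the ROC curve and AUC, though the report does not state which package was used.

```r
# Confusion matrix at the chosen threshold, plus ROC curve and AUC.
pred <- as.integer(p.hat >= threshold)
table(actual = dating$second.date, predicted = pred)

library(pROC)                      # assumed package choice
roc.obj <- roc(dating$second.date, p.hat)
plot(roc.obj)
auc(roc.obj)                       # approximately 0.696 in the report
```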
V. Conclusion
As we can see from the confusion matrix of predicted and actual outcomes, our model has a sensitivity of approximately 89%. The reason we chose to maximize sensitivity rather than other statistics, such as accuracy or specificity, is that in a contemporary application of this data there is much more benefit in matching people properly than in preventing them from matching with people they might not like as much. Should this model be used for an online dating application, the goal would be to maximize user retention and attract new users, and most people's highest priority when using such an application would be to correctly match with someone.
With this being said, the accuracy of the model under this threshold is lower than under the default 50% threshold, which balances sensitivity and specificity; our accuracy is not as high as it could be, but our sensitivity is markedly higher. An increase in sensitivity is generally accompanied by a decrease in specificity, so some loss must be accepted when choosing a threshold that maximizes one of these statistics. In conclusion, approximately 70% accuracy and approximately 70% AUC allow us to forecast second dates to a reasonable degree, although using the default threshold would yield higher overall accuracy.
