Classification via Logistic Regression:
Predicting Probability of Speed Dating Success
Taweh Beysolow II
Professor Nagaraja
Fordham University
I. Introduction
In speed dating, participants meet many people, each for a few minutes, and then
decide who they would like to see again. The data set we will be working with contains
information on speed dating experiments conducted on graduate and professional
students. Each person in the experiment met with 10-20 randomly selected people of the
opposite sex (only heterosexual pairings) for four minutes. After each speed date, each
participant filled out a questionnaire about the other person. Our goal is to build a model
to predict which pairs of daters want to meet each other again (i.e., have a second date).
The list of variables is as follows:
• Decision: 1 = Yes (you would like to see the date again), 0 = No (you would not
like to see the date again)
• Like: Overall, how much do you like this person? (1 = not at all, 10 = like a lot)
• PartnerYes: How probable do you think it is that this person will say ‘yes’ for
you? (1 = not probable, 10 = extremely probable)
• Age: Age
• Race: Caucasian, Asian, Black, Latino, or Other
• Attractive: Rate attractiveness of partner on a scale of 1-10 (1 = awful, 10 =
great)
• Sincere: Rate sincerity of partner on a scale of 1-10 (1 = awful, 10 = great)
• Fun: Rate how fun partner is on a scale of 1 – 10 (1 = awful, 10 = great)
• Ambitious: Rate ambition of partner on a scale of 1 – 10 (1 = awful, 10 = great)
• Shared Interest: Rate the extent to which you share interests/hobbies with
partner on a scale of 1 – 10 (1 = awful, 10 = great)
We will be using a reduced version of this experimental data with 276 unique male-female date pairs. In the file “SpeedDating.csv”, each variable name ends in either “M” for male or “F” for female. For example, “LikeM” refers to the “Like” variable as answered by the male participant (about the female participant). We treat the rating-scale variables (such as “PartnerYes” and “Attractive”) as numerical rather than categorical variables in our analysis.
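
As a starting point, the data can be loaded and inspected in R. This is a minimal sketch; the exact column names in “SpeedDating.csv” (e.g., “DecisionM”, “LikeF”) are assumptions based on the naming convention described above.

```r
# Load the speed dating data and inspect it. Column names such as DecisionM
# and LikeF are assumptions based on the "M"/"F" suffix convention above.
speed <- read.csv("SpeedDating.csv")
str(speed)      # variable types and dimensions (276 pairs expected)
summary(speed)  # ranges and NA counts for each variable
```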
II. Exploratory Analysis
When constructing the contingency table of the male and female decisions, we observe that approximately 22.83% of those who participated in the study are interested in a second date. Moving forward, we will say a second date is planned only if both people within the matched pair want to see each other again. As such, we make a new column in our data set, called “second.date”, whose value is 0 if there will be no second date and 1 if there will be a second date.
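
A brief sketch of this step in R follows; the decision column names are assumptions, while “second.date” is the column name given above.

```r
# Contingency table of the male and female decisions, and construction of the
# second.date indicator (1 only when both partners said yes).
# DecisionM and DecisionF are assumed column names.
table(speed$DecisionM, speed$DecisionF)
speed$second.date <- as.integer(speed$DecisionM == 1 & speed$DecisionF == 1)
mean(speed$second.date)  # proportion of matched pairs planning a second date
```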
Here, we observe roughly equal numbers of data points in each of the four corners of the scatterplot. This is the visual representation of the contingency table described above. Using the jitter function, we added random noise to the decisions; otherwise, all of the data points would have plotted on top of one another at the corresponding corners of the graph. Blue denotes pairs who will be going on a second date, whereas red denotes those who will not.
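
The jittered scatterplot described here could be produced along these lines (a sketch with assumed column names):

```r
# Jittered scatterplot of the two decisions, colored by second-date status.
plot(jitter(speed$DecisionM), jitter(speed$DecisionF),
     col  = ifelse(speed$second.date == 1, "blue", "red"),
     xlab = "Male decision (0 = no, 1 = yes)",
     ylab = "Female decision (0 = no, 1 = yes)")
```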
All of the variables, with the exception of the decision, second date, and age variables, are on a 1 to 10 scale. The recorded responses, excluding NA values, range from as low as 1 to as high as 10. Furthermore, we observe that there are exactly 142 missing entries, scattered among all the variables except the decision variable and our newly constructed second date variable. Because the great majority of the variables are on a 1-10 scale, centering of their values would be unnecessary. We exclude the decision variable from the predictors, and should we choose to use other variables that are not on the 1-10 scale, we would consider centering or normalization of some sort. We should hesitate to remove NA values before we decide which variables we are using in our model, as we do not want to unnecessarily reduce our sample size.
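
The missing-value counts can be checked directly, for example:

```r
# Total and per-variable missing entries; row removal is deferred until the
# model variables are chosen, to avoid shrinking the sample unnecessarily.
sum(is.na(speed))       # 142 missing entries in the report
colSums(is.na(speed))   # missing entries per variable
```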
The possible categories for race are Asian, Black, Caucasian, Latino, and Other. There are 3 missing race values within the data set. It is worth considering whether to delete these data points, but this should be done only if we use this variable in our model. The reason is that while a respondent might have neglected to report their race, they could have filled out responses for the other variables. As such, we do not want to unnecessarily remove observations from our data set, so as to preserve as large a sample size as possible.
III. Experimental Design
To determine the best logistic regression model, we began by including all of the variables on a 1-10 scale, leaving out race as well as the decision variable. We observe that this full model is relatively weak, so we perform backward stepwise selection, at each step keeping the candidate model with the best (lowest) AIC. After two more iterations, we stop at our third model. With second date as the response variable, shared interests (male) and shared interests (female) are the explanatory variables. The summary output for this model is shown below:
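
A minimal sketch of the fitting and backward stepwise selection in R follows; the predictor column names are assumptions, and complete cases are used so that AIC values are comparable across candidate models.

```r
# Fit the full logistic regression on the 1-10 scale variables, then reduce
# it by backward stepwise selection on AIC (lower is better).
model.vars <- c("second.date", "LikeM", "LikeF", "PartnerYesM", "PartnerYesF",
                "AttractiveM", "AttractiveF", "SincereM", "SincereF",
                "FunM", "FunF", "AmbitiousM", "AmbitiousF",
                "SharedInterestsM", "SharedInterestsF")
complete.cases.df <- na.omit(speed[, model.vars])
full.fit <- glm(second.date ~ ., data = complete.cases.df, family = binomial)
reduced.fit <- step(full.fit, direction = "backward", trace = FALSE)
summary(reduced.fit)  # the report's final model keeps the two shared-interest ratings
```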
We observe an AIC of 211.09, and all variables are statistically significant at the 95% confidence level. As for the regression assumptions, the data were collected independently, as the test subjects recorded their responses separately, and we assume those responses were recorded accurately. There are few outliers, as the scores are limited to defined ranges from which no participants deviated (with the exception of the NA values, which the regression can handle). Our sample size is considerably large, at 226 pairs, preserving approximately 81% of the original data. With respect to the residuals, we observe the following:
The errors exhibit independence, and the spread of the residuals shows that knowing the x values does not completely determine whether y = 0 or y = 1. As such, we conclude that our regression assumptions are satisfied and can proceed with the remainder of the experiment. We remove NA values by row and establish a classification threshold based on what maximizes our sensitivity statistic. As described above, we remove 50 rows containing NA values, and our probability threshold is approximately 48%.
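
A sketch of the refit on complete cases and of the threshold calculation follows; the shared-interest column names are assumptions, and the report's "mean of the odds" is interpreted here as the mean of the fitted probabilities.

```r
# Refit on complete cases of the two retained predictors (226 of 276 pairs in
# the report), then set the cutoff as the average of the mean and median
# fitted probabilities (roughly 0.48 in the report).
dating <- na.omit(speed[, c("second.date", "SharedInterestsM", "SharedInterestsF")])
final.fit <- glm(second.date ~ SharedInterestsM + SharedInterestsF,
                 data = dating, family = binomial)
p.hat <- fitted(final.fit)
threshold <- mean(c(mean(p.hat), median(p.hat)))
```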
Looking at the coefficients for our explanatory variables and the intercept, we observe the following, where “sharm” denotes shared interests (male) and “sharf” denotes shared interests (female). Both variables have similar slopes, and both increase the probability of a second date. The intercept, however, has a strongly negative effect on the probability of a date relative to the slopes. While it was expected that the two variables would each increase the probability of a date, as they are highly correlated with
one another, it was surprising that the intercept had so negative an effect on the probability. Intuitively, this indicates that the model is biased toward assuming two individuals will not be a good match before any data are entered.
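
The coefficients can be read off the fitted model and translated to the probability scale; the rating values below are illustrative, not taken from the report.

```r
# Inspect the estimated intercept and slopes, then compute an example
# predicted probability for a pair who both rate shared interests at 5
# (an illustrative value, not from the report).
coef(final.fit)
predict(final.fit,
        newdata = data.frame(SharedInterestsM = 5, SharedInterestsF = 5),
        type = "response")
```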
As briefly touched upon before, we chose a threshold of roughly 48%, calculated as the average of the mean and the median of the predicted probabilities produced by the model. This choice maximizes the sensitivity statistic while retaining reasonably high accuracy and specificity. We examined the effect of thresholds of 10 percent, 20 percent, the mean of the predicted probabilities, and the average of their mean and median on all of these statistics. Under the final threshold choice our sensitivity statistic was approximately 89%, and the other statistics remained acceptable, so we chose 48%.
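
The threshold comparison can be sketched as follows, computing accuracy, sensitivity, and specificity at each candidate cutoff (the helper function below is for illustration only).

```r
# Classification statistics at a given cutoff; used to compare the candidate
# thresholds described above (0.10, 0.20, mean, and mean-of-mean-and-median).
classify.stats <- function(cutoff) {
  pred <- as.integer(p.hat >= cutoff)
  c(accuracy    = mean(pred == dating$second.date),
    sensitivity = mean(pred[dating$second.date == 1] == 1),
    specificity = mean(pred[dating$second.date == 0] == 0))
}
sapply(c(0.10, 0.20, mean(p.hat), threshold), classify.stats)
```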
IV. Results
a. Accuracy: 0.6725664
b. Sensitivity: 0.8888889
c. Specificity: 0.4
d. ROC Curve (Area Under the Curve): 0.696
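
These statistics could be reproduced along the following lines; the pROC package is one common choice for the ROC curve and AUC, though the report does not state which package was used.

```r
# Confusion matrix at the chosen threshold, plus ROC curve and AUC.
pred <- as.integer(p.hat >= threshold)
table(actual = dating$second.date, predicted = pred)

library(pROC)                      # assumed package choice
roc.obj <- roc(dating$second.date, p.hat)
plot(roc.obj)
auc(roc.obj)                       # approximately 0.696 in the report
```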
V. Conclusion
As we can see from the confusion matrix of predicted and actual outcomes, our model has a sensitivity of approximately 89%. The reason we chose to maximize sensitivity rather than other statistics, such as accuracy or specificity, is that in a contemporary application of this data there is much more benefit in matching people properly than in preventing them from matching with people they might not like as much. Should this model be used for an online dating application, the goal would be to maximize user retention and attract new users, and most people's highest priority when using such an application would be to correctly match with someone.
With this being said, the accuracy of the model under this threshold is lower than under the default 50% threshold, which balances sensitivity and specificity; our accuracy is not as high as it could be, but our sensitivity is markedly higher. An increase in sensitivity is generally accompanied by a decrease in specificity, so some loss must be accepted when choosing a threshold that maximizes one of these statistics. In conclusion, approximately 70% accuracy and approximately 70% AUC allow us to forecast second dates to a reasonable degree, although using the default threshold would yield higher overall accuracy.
