MIS 6324 BUSINESS ANALYTICS WITH SAS
Dating Application
Group 3
Vaibhav Pande, Mary Gramer, Tejasvi Ramdas Sagar, Ritesh KP, Foram Gohil
11/27/2016
Executive Summary
Young adults in the twenty-first century are among the busiest and most technology-tethered
generations. When they are not juggling school, careers, or hobbies, many of them are glued to
their smartphones surfing the web, too preoccupied to meet new people. In this cultural
environment, it is difficult for young, single adults to find potential dating partners.
Using data from twenty-one speed dating events to create a new dating app, we can connect two
individuals based on their interests and preferences, thus expediting the dating process. The app
will direct the user to rate other users’ profiles based not only on the other user’s image, but
also on how much he/she likes the other user based on their profile information. The profiles will
include demographic information, shared interests, and other attributes such as fun factor,
attractiveness, etc. After evaluating each user’s preferences and ratings, the app will suggest
partners who have similar interests and matching preferences.
After comparing the accuracies and the true positive rates of various models created using SAS
Enterprise Miner, we have selected a decision tree to predict the target variable. The data was
first altered by applying a replacement node in SAS. Our model can predict whether or not an
individual will be interested in dating another person based on their attributes and interests
with an 80.5% true positive rate and 81.1% accuracy.
Project Motivation
The current popular mobile applications for meeting other singles, such as Tinder or Bumble, do
not consider a person’s preferences or personality - the only deciding factor on whether or not
two people converse is their pictures. This inefficient system causes singles to waste their time
messaging with people who do not share any of their interests. After spending perhaps hours
chatting, two people may realize that they are not interested in going on a face-to-face date with
their ‘match.’ Using the speed dating data, we can create a superior dating app for young adults.
Description of Data
The dataset includes observations from twenty-one speed dating events (also called waves) in which
each person was paired with five to twenty-two partners of the opposite gender for four minutes
each. Before, during, and after the event, participants were asked to rate multiple
characteristics about themselves and each of the partners they met. Every participant identified
which attributes in a partner are most important to them, rated each partner they met on these
same attributes (called the ‘scorecard’ for each member), and indicated whether they would like to
go on a second date with the partner.
The scorecard given to each participant after the date is as follows:
SCORECARD
YOUR ID NUMBER:
Circle “Yes” or “No” below the ID number of each person you meet to indicate whether you would
like to see him or her again. Rate their attributes on a scale of 1-10 (1=awful, 10=great). If you
haven’t formed an opinion based on your conversation, fill in N/A, but please fill in all boxes.
This will be TOTALLY confidential and will NOT be shared with anyone. Then, answer the remaining
questions for each person you meet.
ID #: 1  2  3  4  5  6  7  8  9  10  (one column per person met)

Item                                                  Variable  Response per ID #
Decision (1=yes, 0=no)                                          yes / no
Attributes (1=awful, 10=great)
  Attractive                                          attr
  Sincere                                             sinc
  Intelligent                                         intel
  Fun                                                 fun
  Ambitious                                           amb
  Shared Interests/Hobbies                            shar
Overall, how much do you like this person?            like
  (1=don't like at all, 10=like a lot)
How probable do you think it is that this person      prob
  will say 'yes' for you?
  (1=not probable, 10=extremely probable)
Have you met this person before? (1=yes, 2=no)        met       yes / no
In the data set, each observation represents a meeting between a participant and a partner. The
observation includes all the information collected about the participant, including demographics,
preferences, how they scored their partner, how their partner scored them, and whether both people
agreed to go on a second date.
Prior to modeling the data, there were many discrepancies and non-uniformities among the variables
to be reconciled. For four of the speed dating events (numbers six to nine), the participants
rated their preference for each of the six attributes on a scale of 1-10. For the remaining
events, participants ranked their preference by allocating 100 points among the same six
attributes. To create consistency in these variables, the values for speed dating events six to
nine have been scaled to 100 points to be consistent with the other waves.
We used the following formula to scale the data for waves 6-9:
$$\text{Rating}_{\text{scaled}} = \frac{100}{\sum \text{Attribute Ratings}} \times \text{Rating}_{\text{original}}$$
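The rescaling above can be sketched in a few lines of Python. This is only an illustration, not the SAS step used in the project: the function name `scale_to_100` and the example ratings are hypothetical, and the attribute keys reuse the scorecard variable names.

```python
def scale_to_100(ratings):
    """Rescale one participant's attribute ratings so that they sum to 100."""
    total = sum(ratings.values())
    if total == 0:
        raise ValueError("ratings sum to zero; cannot rescale")
    # Rating_scaled = 100 / (sum of attribute ratings) * Rating_original
    return {name: 100.0 * value / total for name, value in ratings.items()}


# Hypothetical participant from waves 6-9 who rated each attribute on a 1-10 scale.
original = {"attr": 8, "sinc": 6, "intel": 7, "fun": 9, "amb": 4, "shar": 6}
scaled = scale_to_100(original)
print(scaled)
```

After scaling, the six values sum to 100, so they are directly comparable with the point allocations from the other waves.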
The target variable we have selected is decision (dec). In the app we are developing, we are more
concerned with making the right recommendations for a person.
We rejected the following attributes from the data:
 • Match (when both persons agree to go on a second date)
 • dec_o (decision of the partner to go on a second date)
 • Num_in_3 (how many of your matches have you been on a date with so far)
Match and dec_o were rejected because match is derived from the participant’s and the partner’s
decisions, so these variables leak our target variable, decision. If we keep these two variables,
the model would predict the target variable (decision to go on a second date) with close to 100%
accuracy. We rejected Num_in_3 because more than 90% of observations were missing.
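A minimal sketch of this rejection step follows, assuming the data is held as a list of dictionaries with `None` marking a missing value. The helper `columns_to_reject`, the 0.90 threshold argument, and the toy rows are hypothetical; in the project itself the variables were rejected inside SAS Enterprise Miner.

```python
def columns_to_reject(rows, leaky, max_missing=0.90):
    """Return the columns to drop: known leakage columns plus any column
    whose share of missing values (None) exceeds max_missing."""
    rejected = set(leaky)
    for col in rows[0].keys():
        missing = sum(1 for row in rows if row[col] is None)
        if missing / len(rows) > max_missing:
            rejected.add(col)
    return sorted(rejected)


# Toy data: match is 1 only when both dec and dec_o are 1,
# and num_in_3 is observed for just one of the twenty rows.
rows = []
for i in range(20):
    dec = i % 2
    dec_o = 1 if i % 4 < 2 else 0
    rows.append({
        "dec": dec,
        "dec_o": dec_o,
        "match": dec * dec_o,
        "num_in_3": 1 if i == 0 else None,
        "like": 5 + dec,
    })

rejected = columns_to_reject(rows, leaky=["match", "dec_o"])
print(rejected)
```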
After processing the dataset, we explored the observations to gain a better understanding of the
data. Interesting aspects include:
Overall Match Rate: 16.5%
 • Individual ‘Yes’ Rate: 42%
Age Range of Participants: 18-55
 • Mean: 26.3
 • St. Deviation: 3.6
 • Skewness: 1.07
Using interactive decision trees in SAS, we chose several initial nodes to split the data on, and
then let SAS decide how to split the tree into subsequent branches. This method shows how the
target variable varies among participants of different genders, races, ages, and the season in
which the event was held. Results are shown below.
Gender:
Note: ‘0’ represents female, ‘1’ represents male.
As the tree shows, females are more conservative in whom they choose to go on a second date with.
On average, women said ‘yes’ to only 37.4% of males while men said ‘yes’ to 46.57% of females.
Race:
The decision rate varies among races. Black/African Americans said ‘yes’ to 51.2% of partners,
while European/Caucasians said ‘yes’ to 38.79% of partners. The percentages for the other races
lie somewhere in between.
Age:
SAS Enterprise Miner split the tree based on two age ranges, younger than 38.5 and older than 38.5
years old. For participants under 38.5 years old, ‘like’ was the most important attribute when
deciding if they want to go on a second date. However, for participants over 38.5 years old, the
most important factor was how ‘fun’ they found their partner to be.
Season:
We found a slight difference in the outcome of the decision variable when we chose to split the
decision tree based on the season in which the speed dating event was held.
The tree shows that people are more likely to say ‘yes’ to any given date if the speed dating
event is held in winter.
These nuances in the data help us understand how the decision variable is affected by a user’s
demographics.
In the dataset, the binary target variable ‘Decision’ is ‘yes’ 41.99 percent of the time. If we
take a simple model in which we predict that every ‘Decision’ is no, our misclassification rate
would be 41.99 percent.
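The baseline arithmetic can be checked with a short sketch. The label list below is synthetic: it only reproduces the reported 41.99 percent share of ‘yes’ decisions, and the function name is hypothetical.

```python
def baseline_misclassification(labels, predicted_class):
    """Misclassification rate of a model that always predicts one class."""
    wrong = sum(1 for y in labels if y != predicted_class)
    return wrong / len(labels)


# Synthetic labels with the reported share of 'yes' (1) decisions: 41.99 percent.
labels = [1] * 4199 + [0] * 5801

always_no = baseline_misclassification(labels, predicted_class=0)
always_yes = baseline_misclassification(labels, predicted_class=1)
print(f"always 'no':  {always_no:.2%}")
print(f"always 'yes': {always_yes:.2%}")
```

Any model worth keeping has to beat the always-‘no’ rate, since that rate is obtained without looking at a single attribute.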
BI Model:
We partitioned the data into Train 70%, Validation 20%, and Test 10%. We tried running all the
classifiers with different sampling techniques, such as simple random and stratified sampling, and
we got the best results using the stratified sampling technique.
For the observations which are missing values for certain variables, we have used the replacement
node to replace the missing class variables with a dot so SAS will recognize the values as missing.
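A stratified 70/20/10 partition can be sketched as below. This is a plain-Python illustration under stated assumptions, not the Data Partition node from SAS Enterprise Miner: the function `stratified_split`, the seed, and the synthetic rows are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(rows, target, fractions=(0.7, 0.2, 0.1), seed=0):
    """Split rows into train/validation/test while keeping the share of
    each target class roughly the same in all three partitions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[target]].append(row)

    train, valid, test = [], [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_train = round(fractions[0] * len(members))
        n_valid = round(fractions[1] * len(members))
        train.extend(members[:n_train])
        valid.extend(members[n_train:n_train + n_valid])
        test.extend(members[n_train + n_valid:])  # remainder goes to test
    return train, valid, test


# Synthetic data: 1000 meetings, 42 percent of them with dec = 1.
rows = [{"id": i, "dec": 1 if i < 420 else 0} for i in range(1000)]
train, valid, test = stratified_split(rows, target="dec")
print(len(train), len(valid), len(test))
```

Because each class is split separately, the ‘yes’ share stays at 42 percent in every partition, which a simple random split does not guarantee.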
After data pre-processing, we ran the following models:
 • Regression with replacement node
 • Regression with replacement, variable selection and impute node
 • Regression with replacement, variable selection, impute and transform variables
 • Dmine regression with replacement, variable selection, impute and variable transformation
 • Neural networks with variable selection
 • Decision tree
 • Decision tree using variable selection and replacement node
 • Gradient boosting with replacement
 • Decision tree with replacement node
Impute node: The dataset has numerous missing values. To address this issue, the mean value of
each respective variable was used to replace the missing interval values, and the mode of each
respective variable was used to replace missing ordinal values.
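The imputation rule can be illustrated with the standard library. This is a sketch, assuming missing values are stored as `None`; the function `impute` and the two example columns are hypothetical and stand in for the Impute node.

```python
from statistics import mean, mode


def impute(values, kind):
    """Fill None entries with the mean (interval variables) or the
    mode (ordinal variables) of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if kind == "interval" else mode(observed)
    return [fill if v is None else v for v in values]


age = [24, None, 27, 30, None, 23]       # interval variable
importance = [3, 5, None, 5, 4, 5]       # ordinal variable

age_filled = impute(age, "interval")
importance_filled = impute(importance, "ordinal")
print(age_filled)
print(importance_filled)
```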
Variable selection: Since we have many attributes, the variable selection node was used to let SAS
automatically choose the variables which most affected the target variable, ‘decision.’
Variable transformation: Certain attributes were highly positively skewed. These variables have
been transformed using the log function. This method gave superior results to other methods such
as inverse or square root.
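The effect of the log transformation on a positively skewed variable can be seen in a small sketch. The values in `raw` are invented for illustration, and `skewness` here is the population skewness (third standardized moment), which may differ slightly from the estimator SAS reports.

```python
import math


def skewness(values):
    """Population skewness: the third standardized moment."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical positively skewed variable (all values must be > 0 for log).
raw = [1, 2, 2, 3, 3, 4, 5, 8, 20, 60]
logged = [math.log(v) for v in raw]

print(f"skewness before log: {skewness(raw):.2f}")
print(f"skewness after log:  {skewness(logged):.2f}")
```

The long right tail is compressed by the logarithm, so the transformed variable is noticeably less skewed.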
We altered the ‘maximum branch’ parameter in every decision tree and got the best results when
‘maximum branch’ was set to 4 for the Decision tree with replacement node.
We executed forward, backward, and stepwise regression for every regression node. We got the best
results while keeping the ‘model selection’ parameter set to none with the Regression with impute
node.
Model comparison results:
The model comparison node shows that the best model selected by SAS Enterprise Miner is the
Gradient Boosting with replacement node, with a misclassification rate of 18.1 percent.
ROC curve for all Models
For our application we are more interested in the true positive rate of the model because we will
be making recommendations, and it would be better to recommend to the user the people to whom
they are more likely to say yes for going on a date.
True positive rates:
Model                                    True positive rate
Dmine Regression                         71.4%
Regression with impute                   74.2%
Regression with transformed variables    73.8%
Neural Network                           31.5%
Gradient boosting                        75.2%
Decision tree                            72.7%
Decision tree with variable selection    75.8%
Decision tree with replacement           80.2%
Even though gradient boosting gives the best misclassification rate, we have chosen the Decision
tree with replacement as our BI model based on its higher true positive rate. The decision tree
has a true positive rate of 80.2 percent whereas gradient boosting has a true positive rate of
75.2 percent. Please see the attached document containing the image of the decision tree.
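The selection rule can be sketched as follows. The true positive rates are the ones reported above; the confusion-matrix counts used to show how the two metrics are computed are hypothetical and are not the project’s actual counts.

```python
def true_positive_rate(tp, fn):
    """Share of actual 'yes' decisions that the model predicts as 'yes'."""
    return tp / (tp + fn)


def accuracy(tp, tn, fp, fn):
    """Share of all decisions predicted correctly (1 - misclassification rate)."""
    return (tp + tn) / (tp + tn + fp + fn)


# True positive rates (percent) as reported in this section.
reported_tpr = {
    "Dmine Regression": 71.4,
    "Regression with impute": 74.2,
    "Regression with transformed variables": 73.8,
    "Neural Network": 31.5,
    "Gradient boosting": 75.2,
    "Decision tree": 72.7,
    "Decision tree with variable selection": 75.8,
    "Decision tree with replacement": 80.2,
}
best = max(reported_tpr, key=reported_tpr.get)
print(best)

# Hypothetical confusion-matrix counts, only to illustrate the two formulas.
print(true_positive_rate(tp=80, fn=20))
print(accuracy(tp=80, tn=83, fp=17, fn=20))
```

Ranking by true positive rate rather than by misclassification rate favors models that miss fewer of the partners a user would actually say ‘yes’ to.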
Conclusion:
The decision tree uses these variables to split upon, and the root node selected is ‘like.’
Some interesting results:
All the ratings are on a scale of 1 to 10.
 • If the user likes a person greater than or equal to 8 → the user rates them on attractiveness
   greater than or equal to 7.5 → the user thinks the probability of getting a match is greater
   than or equal to 3, then there is an 86.28 percent chance that the user will say yes.
 • If the user likes a person greater than or equal to 8 → the user rates them on attractiveness
   greater than or equal to 4 and less than 7 → the user estimates that the number of matches
   (match_es) is greater than 1.5 and they are of the same race, then there is a 60 percent chance
   that the user will say yes.
 • If the user likes the person greater than or equal to 5.5 and less than 6.5 → if they are from
   London, England, they have a 100 percent chance of saying yes, but if the user is from Alabama,
   Texas, or Argentina, there is a 68.12 percent chance of saying no.
 • If the user likes a person less than 5.5 → and is a lawyer, then there is a 93.16 percent
   chance that the user will say no to the other person. Similarly, if the user is in the field of
   Informatics or Psychology, the user will say no 100 percent of the time, and if the user is a
   journalist, there is an 83 percent chance of saying yes.
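To show how such tree branches translate into a scoring rule, here is a sketch that encodes only two of the branches listed above. The function name and its arguments are hypothetical, the full tree has many more branches, and any case not covered here returns `None` rather than a guess.

```python
def estimate_yes_probability(like, attr=None, prob=None, field=None):
    """Return the reported chance of 'yes' for two of the branches quoted
    above, or None when the inputs fall outside those branches."""
    # Branch: like >= 8, attractiveness >= 7.5, perceived probability >= 3.
    if like >= 8 and attr is not None and attr >= 7.5 \
            and prob is not None and prob >= 3:
        return 0.8628
    # Branch: like < 5.5 and the user is a lawyer (93.16 percent chance of 'no').
    if like < 5.5 and field == "lawyer":
        return 1 - 0.9316
    return None


print(estimate_yes_probability(like=9, attr=8, prob=5))
print(estimate_yes_probability(like=4, field="lawyer"))
print(estimate_yes_probability(like=7))
```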
Overview of the application:
 • In the mobile application, users make their profiles with some pictures and a description of
   themselves. The users are asked to specify their preferences, such as the age range of their
   partners, the location range, and whether they are interested in meaningful friendships or
   relationships.
 • The user is shown the profiles of people who match the user’s preferences and is asked to rate
   their profiles on features such as attractiveness, fun, and how much they like the overall
   profile of the person.
 • Based on these ratings, our BI model generates a list of potential partners with whom the user
   is likely to be compatible, and the user has an option to start a chat.
 • After a significant user base has been established, we will be able to design a recommendation
   system that increases the accuracy by selecting the profiles which similar users have matched
   with.
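One possible shape for the future recommendation step is a simple user-based collaborative filter. The sketch below is an assumption, not a design decided in this report: it measures user similarity with the Jaccard index over match sets, and the user names, profile IDs, and functions are all hypothetical.

```python
def jaccard(a, b):
    """Overlap between two sets of matched profiles, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user, matches, top_n=3):
    """Rank profiles the user has not matched with yet, weighting each
    profile by how similar the users who matched with it are."""
    mine = matches[user]
    scores = {}
    for other, theirs in matches.items():
        if other == user:
            continue
        weight = jaccard(mine, theirs)
        if weight == 0:
            continue  # users with no overlap contribute nothing
        for profile in theirs - mine:
            scores[profile] = scores.get(profile, 0.0) + weight
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [profile for profile, _ in ranked[:top_n]]


# Hypothetical match history: user -> set of profile IDs they matched with.
matches = {
    "ana": {"p1", "p2", "p3"},
    "ben": {"p1", "p2", "p4"},
    "cy": {"p2", "p3", "p4", "p5"},
    "dee": {"p9"},
}
print(recommend("ana", matches))
```

Profiles matched by several similar users accumulate a higher score, so they are suggested first.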
References:
Data source: Kaggle.com
Columbia Business School. Ray Fisman and Sheena. Gender Differences in Mate Selection:
Evidence from a Speed Dating Experiment.
https://www.kaggle.com/annavictoria/speed-dating-experiment

