Project report SAS

MIS 6324 BUSINESS ANALYTICS WITH SAS
Dating Application
Group 3
Vaibhav Pande, Mary Gramer, Tejasvi Ramdas Sagar, Ritesh KP, Foram Gohil
11/27/2016

Executive Summary
Young adults in the twenty-first century are among the busiest and most technology-tethered generations.
When they are not juggling school, careers, or hobbies, many of them are glued to their smartphones
surfing the web, too preoccupied to meet new people. In this cultural environment, it is difficult for
young, single adults to find potential dating partners.
Using data from twenty-one speed dating events to create a new dating app, we can connect two
individuals based on their interests and preferences, thus expediting the dating process. The app will
direct the user to rate other users’ profiles based not only on the user’s image, but also on how much
he/she likes the other user based on their profile information. The profiles will include demographic
information, shared interests, and other attributes such as fun factor, attractiveness, etc. After
evaluating each user’s preferences and ratings, the app will suggest partners who have similar interests
and matching preferences.
After comparing the accuracies and the true positive rates of various models created using SAS
Enterprise Miner, we have selected a decision tree to predict the target variable. The data was first
altered by applying a replacement node in SAS. Our model can predict whether or not an individual will
be interested in dating another person based on their attributes and interests with an 80.5% true positive
rate and 81.1% accuracy.
Project Motivation
The current popular mobile applications for meeting other singles, such as Tinder or Bumble, do not
consider a person’s preferences or personality; the only deciding factor in whether or not two people
converse is their pictures. This inefficient system causes singles to waste their time messaging
people who do not share any of their interests. After spending perhaps hours chatting, two people may
realize that they are not interested in going on a face-to-face date with their ‘match.’ Using the speed
dating data, we can create a superior dating app for young adults.
Description of Data
The dataset includes observations from twenty-one speed dating events (also called waves) in which
each person was paired with five to twenty-two partners of the opposite gender for four minutes each.
Before, during, and after the event, participants were asked to rate multiple characteristics about
themselves and about each of the partners they met. Every participant identified which attributes
in a partner are most important to them, rated each partner they met on these same attributes
(called the ‘scorecard’ for each member), and indicated whether they would like to go on a second date
with the partner.

The scorecard given to each participant after the date is as follows:
SCORECARD
YOUR ID NUMBER:
Circle “Yes” or “No” below the ID number of each person you meet to indicate whether you would like
to see him or her again. Rate their attributes on a scale of 1-10 (1=awful, 10=great). If you haven’t
formed an opinion based on your conversation, fill in N/A, but please fill in all boxes. This will be
TOTALLY confidential and will NOT be shared with anyone. Then, answer the remaining questions for
each person you meet.
ID #: 1 2 3 4 5 6 7 8 9 10 (one response column per person met)

Item                                                         Variable   Response per ID
Decision (1=yes, 0=no)                                                  yes / no
Attributes (1=awful, 10=great):
  Attractive                                                 attr
  Sincere                                                    sinc
  Intelligent                                                intel
  Fun                                                        fun
  Ambitious                                                  amb
  Shared Interests/Hobbies                                   shar
Overall, how much do you like this person?                   like
  (1=don't like at all, 10=like a lot)
How probable do you think it is that this person will say    prob
  'yes' for you? (1=not probable, 10=extremely probable)
Have you met this person before? (1=yes, 2=no)               met        yes / no

In the data set, each observation represents a meeting between a participant and a partner. The
observation includes all the information collected about the participant, including demographics,
preferences, how they scored their partner, how their partner scored them, and whether both people
agreed to go on a second date.
Prior to modeling the data, there were many discrepancies and non-uniformities among the variables to
be reconciled. For four of the speed dating events (numbers six to nine), the participants ranked their
preference for each of the six attributes on a scale of 1-10. For the remaining events, participants ranked
their preference by allocating 100 points among the same six attributes. To make these variables
consistent, the values for speed dating events six to nine have been scaled to 100 points to match
the other waves.
We used the following formula to scale the data for waves 6-9:
$$\text{Rating}_{\text{scaled}} = \frac{100}{\sum \text{Attribute Ratings}} \times \text{Rating}_{\text{original}}$$
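As an illustration of the scaling formula, the following minimal Python sketch rescales one participant's six attribute ratings so that they sum to 100. The function name `scale_to_100` and the sample ratings are assumptions made for this example; the project itself performed this step on the SAS side.

```python
def scale_to_100(ratings):
    """Rescale attribute ratings so that they sum to 100 points."""
    total = sum(ratings.values())
    if total == 0:
        raise ValueError("ratings must not all be zero")
    return {name: 100 * value / total for name, value in ratings.items()}


# Hypothetical participant from waves 6-9 who rated each attribute on 1-10.
original = {"attr": 8, "sinc": 6, "intel": 7, "fun": 9, "amb": 4, "shar": 6}
scaled = scale_to_100(original)

for name, value in scaled.items():
    print(f"{name}: {value:.1f}")
```

After scaling, the six values keep their relative proportions and can be compared directly with the 100-point allocations from the other waves.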
The target variable we have selected is decision (dec). In the app we are developing, we are more
concerned with making the right recommendations for a person.
We rejected the following attributes from the data:
• Match (when both persons agree to go on a second date)
• dec_o (decision of the partner to go on a second date)
• Num_in_3 (how many of your matches have you been on a date with so far)
Match and dec_o were rejected because the combination of the participant’s and the partner’s decisions is
equivalent to our target variable, decision. If we kept these two variables, the model would predict the
target variable (decision to go on a second date) with close to 100% accuracy. We rejected Num_in_3
because more than 90% of observations were missing.
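The leakage described above can be sketched in a few lines of Python. The sketch assumes that Match equals the product of the two decisions, as its definition suggests, and the four meetings are hypothetical. It shows that whenever the partner said yes, the participant's decision can be read directly from Match.

```python
# Hypothetical meetings covering every combination of the two decisions.
meetings = [
    {"dec": 1, "dec_o": 1},
    {"dec": 1, "dec_o": 0},
    {"dec": 0, "dec_o": 1},
    {"dec": 0, "dec_o": 0},
]
for m in meetings:
    # Assumption: a match occurs only when both people say yes.
    m["match"] = m["dec"] * m["dec_o"]


def recover_dec(match, dec_o):
    """Return dec when match and dec_o determine it, otherwise None."""
    if match == 1:
        return 1   # a match requires the participant's yes
    if dec_o == 1:
        return 0   # partner said yes but no match: participant said no
    return None    # partner said no: match carries no information


recovered = [recover_dec(m["match"], m["dec_o"]) for m in meetings]
print(recovered)
```

A model given both variables can learn this shortcut instead of the rating attributes, which is why they are excluded.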
After processing the dataset, we explored the observations to gain a better understanding of the data.
Interesting aspects include:
Overall Match Rate: 16.5%
• Individual ‘Yes’ Rate: 42%
Age Range of Participants: 18-55
• Mean: 26.3
• St. Deviation: 3.6
• Skewness: 1.07
Using interactive decision trees in SAS, we chose several initial nodes to split the data on, and then let
SAS decide how to split the tree into subsequent branches. This method shows how the target variable
varies among participants of different genders, races, and ages, and with the season in which the event
was held. Results are shown below.

Gender:
Note: ‘0’ represents female, ‘1’ represents male.
As the tree shows, females are more conservative in whom they choose to go on a second date with. On
average, women said ‘yes’ to only 37.4% of males, while men said ‘yes’ to 46.57% of females.
Race:
The decision rate varies among races. Black/African Americans said ‘yes’ to 51.2% of partners, while
European/Caucasians said ‘yes’ to 38.79% of partners. The percentages for the other races lie somewhere
in between.

Age:
SAS Enterprise Miner split the tree based on two age ranges, younger than 38.5 and older than 38.5 years.
For participants under 38.5 years old, like was the most important attribute when deciding whether they
wanted to go on a second date. However, for participants over 38.5 years old, the most important factor
was how ‘fun’ they found their partner to be.
Season:
We found a slight difference in the outcome of the decision variable when we chose to split the decision
tree based on the season in which the speed dating event was held.
The tree shows that people are more likely to say ‘yes’ to any given date if the speed dating event is held
in winter.
These nuances in the data help us understand how the decision variable is affected by a user’s
demographics.
In the dataset, the binary target variable ‘Decision’ is ‘yes’ in 41.99 percent of observations. If we take
a simple model in which we predict that every ‘Decision’ is no, our misclassification rate would be
41.99 percent.
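The baseline can be checked with a short Python sketch. The sample below is hypothetical: 10,000 observations constructed to have the same 41.99 percent share of ‘yes’ decisions as the report, not the actual data.

```python
def misclassification_rate(actual, predicted):
    """Share of observations whose predicted label differs from the actual one."""
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)


# Hypothetical sample: 4,199 'yes' (1) and 5,801 'no' (0) decisions.
actual = [1] * 4199 + [0] * 5801

# Naive model: predict 'no' for every meeting.
always_no = [0] * len(actual)

baseline = misclassification_rate(actual, always_no)
print(f"baseline misclassification rate: {baseline:.2%}")
```

Any candidate model has to beat this rate to be worth using.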

BI Model:
We partitioned the data into training (70%), validation (20%), and test (10%) sets. We tried running all the
classifiers with different sampling techniques, such as simple random and stratified sampling, and got the
best results using stratified sampling.
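To make the partitioning concrete, here is a minimal Python sketch of a stratified 70/20/10 split. It is not the SAS Enterprise Miner Data Partition node. The function `stratified_split` and the 1,000-row sample are assumptions for illustration only. Stratifying on the target keeps the share of ‘yes’ decisions the same in each partition.

```python
import random
from collections import defaultdict


def stratified_split(rows, target, fractions=(0.7, 0.2, 0.1), seed=0):
    """Split rows into train/validation/test, preserving the target's class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[target]].append(row)

    train, valid, test = [], [], []
    for group in by_class.values():
        rng.shuffle(group)
        n_train = round(fractions[0] * len(group))
        n_valid = round(fractions[1] * len(group))
        train += group[:n_train]
        valid += group[n_train:n_train + n_valid]
        test += group[n_train + n_valid:]  # remainder goes to the test set
    return train, valid, test


# Hypothetical data: 420 'yes' and 580 'no' decisions.
rows = [{"id": i, "dec": 1 if i < 420 else 0} for i in range(1000)]
train, valid, test = stratified_split(rows, "dec")

for name, part in (("train", train), ("valid", valid), ("test", test)):
    share = sum(r["dec"] for r in part) / len(part)
    print(name, len(part), f"{share:.2f}")
```

With simple random sampling the ‘yes’ share can drift between partitions, which may explain part of the difference the comparison found.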
For observations that are missing values for certain variables, we used the replacement node
to replace the missing class variables with a dot so that SAS recognizes the values as missing.
After data pre-processing, we ran the following models:
• Regression with replacement node
• Regression with replacement, variable selection and impute node
• Regression with replacement, variable selection, impute and transform variables
• Dmine regression with replacement, variable selection, impute and variable transformation
• Neural networks with variable selection
• Decision tree
• Decision tree using variable selection and replacement node
• Gradient boosting with replacement
• Decision tree with replacement node

Impute node: The dataset has numerous missing values. To address this issue, the mean value of each
relevant variable was used to replace missing interval values, and the mode of each relevant variable
was used to replace missing ordinal values.
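The imputation rule can be sketched in Python as follows. This illustrates mean and mode imputation in general, not the SAS Impute node itself. The helper `impute` and the sample values are hypothetical.

```python
from statistics import mean, mode


def impute(values, method):
    """Replace None with the mean ('mean') or the most common value ('mode')."""
    present = [v for v in values if v is not None]
    fill = mean(present) if method == "mean" else mode(present)
    return [fill if v is None else v for v in values]


# Interval variable (age): the missing entry is filled with the mean.
ages = [24, None, 30, 27]
imputed_ages = impute(ages, "mean")

# Ordinal variable (1-10 rating): the missing entry is filled with the mode.
ratings = [7, 8, None, 8, 6]
imputed_ratings = impute(ratings, "mode")

print(imputed_ages)
print(imputed_ratings)
```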
Variable selection: Since we have many attributes, the variable selection node was used to let SAS
automatically choose the variables which most affected the target variable, decision (dec).
Variable transformation: Certain attributes were highly positively skewed. These variables have been
transformed using the log function. This method gave superior results to other methods such as inverse
or square root.
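The effect of a log transformation on a positively skewed variable can be illustrated with a small Python sketch. The sample values are hypothetical and strictly positive (the logarithm is undefined at zero), and `skewness` here is the simple moment-based coefficient, which may differ slightly from the estimator SAS reports.

```python
import math


def skewness(values):
    """Moment-based skewness: third central moment over variance to the power 1.5."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed variable: most values small, a few very large.
raw = [1, 2, 2, 3, 3, 4, 5, 8, 15, 40]
logged = [math.log(v) for v in raw]

print(f"skewness before: {skewness(raw):.2f}")
print(f"skewness after:  {skewness(logged):.2f}")
```

The log compresses the long right tail, so the transformed variable is much closer to symmetric.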
We altered the ‘maximum branch’ parameter in every decision tree and got the best results when
‘maximum branch’ was set to 4 for the Decision tree with replacement node.
We executed forward, backward, and stepwise regression for every regression node. We got the best
results while keeping the ‘model selection’ parameter set to none with the Regression with impute node.
Model comparison results:
The model comparison node shows that the best model selected by SAS Enterprise Miner is the
Gradient Boosting with replacement node, with a misclassification rate of 18.1 percent.

ROC curve for all Models
For our application we are more interested in the true positive rate of the model because we will be
making recommendations, and it is better to recommend to a user the people to whom they are more
likely to say yes for going on a date.
True positive rates:
Model                                     True positive rate
Dmine Regression                          71.4%
Regression with impute                    74.2%
Regression with transformed variables     73.8%
Neural Network                            31.5%
Gradient boosting                         75.2%
Decision tree                             72.7%
Decision tree with variable selection     75.8%
Decision tree with replacement            80.2%
Even though gradient boosting gives the best misclassification rate, we have chosen Decision tree with
replacement as our BI model based on its higher true positive rate. The decision tree has a true positive
rate of 80.2 percent, whereas gradient boosting has a true positive rate of 75.2 percent. Please see the
attached document containing the image of the decision tree.
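The selection criterion can be written out as a short Python sketch. The true positive rates are those reported above. The confusion-matrix counts passed to `true_positive_rate` are hypothetical and only illustrate how the metric is computed.

```python
def true_positive_rate(tp, fn):
    """Share of actual 'yes' decisions that the model also predicts as 'yes'."""
    return tp / (tp + fn)


# True positive rates (percent) as reported in the model comparison.
tpr_by_model = {
    "Dmine Regression": 71.4,
    "Regression with impute": 74.2,
    "Regression with transformed variables": 73.8,
    "Neural Network": 31.5,
    "Gradient boosting": 75.2,
    "Decision tree": 72.7,
    "Decision tree with variable selection": 75.8,
    "Decision tree with replacement": 80.2,
}

# Choose the model by true positive rate rather than misclassification rate.
best = max(tpr_by_model, key=tpr_by_model.get)
print(best, tpr_by_model[best])

# Hypothetical counts: 802 of 1,000 actual 'yes' decisions predicted correctly.
example_tpr = true_positive_rate(tp=802, fn=198)
print(f"{example_tpr:.1%}")
```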

Conclusion:
The decision tree splits on the variables described below, and the root node selected is like.
Some interesting results:
All the ratings are on a scale of 1 to 10.
• If the user likes a person greater than or equal to 8 → the user rates them on attractiveness greater
than or equal to 7.5 → the user thinks the probability of getting a match is greater than or equal to 3,
then there is an 86.28 percent chance that the user will say yes.
• If the user likes a person greater than or equal to 8 → the user rates them on attractiveness greater
than or equal to 4 and less than 7 → the user estimates the number of matches (match_es) to be greater
than 1.5 and they are of the same race, then there is a 60 percent chance that the user will say yes.
• If the user likes the person greater than or equal to 5.5 and less than 6.5 → if they are from
London, England, there is a 100 percent chance of saying yes, but if the user is from Alabama,
Texas, or Argentina, there is a 68.12 percent chance of saying no.
• If the user likes a person less than 5.5 → the user is a lawyer, then there is a 93.16 percent chance
that the user will say no to the other person. Similarly, if the user is in the field of Informatics or
Psychology, the user will say no 100 percent of the time, and if the user is a journalist, there is an
83 percent chance of saying yes.
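A branch of the tree reads naturally as a chain of conditions. The Python sketch below encodes only the first rule listed above (like ≥ 8, attractiveness ≥ 7.5, estimated probability ≥ 3). The function name `yes_probability` is hypothetical, and every other branch of the actual tree is deliberately left out, so this is an illustration rather than the full model.

```python
def yes_probability(like, attr, prob):
    """Return the reported chance of 'yes' for the single branch encoded here.

    Returns None for inputs that fall into branches not covered by this sketch.
    """
    if like >= 8 and attr >= 7.5 and prob >= 3:
        return 0.8628  # 86.28 percent, as reported for this leaf
    return None


print(yes_probability(like=9, attr=8, prob=5))
print(yes_probability(like=5, attr=8, prob=5))
```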

Overview of the application:
• In the mobile application, users create their profiles with some pictures and a description of
themselves. The users are asked to specify their preferences, such as the age range and location
range of their partners, and whether they are interested in meaningful friendships or relationships.
• The user is shown the profiles of people who match the user’s preferences and is asked
to rate their profiles on features such as attractiveness, fun, and how much they like the overall
profile of the person.
• Based on these ratings, our BI model generates a list of potential partners with whom the user is
likely to be compatible, along with an option to start a chat.
• After a significant user base has been established, we will be able to design a recommendation
system that increases accuracy by selecting the profiles with which similar users have matched.
References:
Data source: Kaggle.com
Columbia Business School. Ray Fisman and Sheena. Gender Differences in Mate Selection:
Evidence from a Speed Dating Experiment. https://www.kaggle.com/annavictoria/speed-dating-experiment
