Project report SAS


  1. MIS 6324 Business Analytics with SAS. Dating Application. Group 3: Vaibhav Pande, Mary Gramer, Tejasvi Ramdas Sagar, Ritesh KP, Foram Gohil. 11/27/2016
  2. Executive Summary

     Young adults in the twenty-first century are among the busiest and most technology-tethered generations. When they are not juggling school, careers, or hobbies, many of them are glued to their smartphones surfing the web, too preoccupied to meet new people. In this cultural environment, it is difficult for young, single adults to find potential dating partners.

     Using data from twenty-one speed dating events to create a new dating app, we can connect two individuals based on their interests and preferences, thus expediting the dating process. The app will direct the user to rate other users' profiles based not only on the user's image, but also on how much he/she likes the other user based on their profile information. The profiles will include demographic information, shared interests, and other attributes such as fun factor, attractiveness, etc. After evaluating each user's preferences and ratings, the app will suggest partners who have similar interests and matching preferences.

     After comparing the accuracies and the true positive rates of various models created using SAS Enterprise Miner, we have selected a decision tree to predict the target variable. The data was first altered by applying a replacement node in SAS. Our model can predict whether or not an individual will be interested in dating another person based on their attributes and interests, with an 80.5% true positive rate and 81.1% accuracy.

     Project Motivation

     The current popular mobile applications for meeting other singles, such as Tinder or Bumble, do not consider a person's preferences or personality; the only deciding factor in whether or not two people converse is their pictures. This inefficient system causes singles to waste their time messaging with people who do not share any of their interests. After spending perhaps hours chatting, two people may realize that they are not interested in going on a face-to-face date with their 'match.' Using the speed dating data, we can create a superior dating app for young adults.

     Description of Data

     The dataset includes observations from twenty-one speed dating events (also called waves) in which each person was paired with five to twenty-two partners of the opposite gender for four minutes each. Before, during, and after the event, participants were asked to rate multiple characteristics of themselves and of each of the partners they met. Every participant identified which attributes in a partner are most important to them, rated each partner they met on these same attributes (called the 'scorecard' for each member), and indicated whether they would like to go on a second date with the partner.
  3. The scorecard given to each participant after the date is as follows:

     SCORECARD
     YOUR ID NUMBER:

     Circle "Yes" or "No" below the ID number of each person you meet to indicate whether you would like to see him or her again. Rate their attributes on a scale of 1-10 (1=awful, 10=great). If you haven't formed an opinion based on your conversation, fill in N/A, but please fill in all boxes. This will be TOTALLY confidential and will NOT be shared with anyone. Then, answer the remaining questions for each person you meet.

     The form has one column per partner (ID # 1 to 10) and the rows below. The short label printed beside each row is shown in parentheses.

     Decision (1=yes, 0=no): yes / no
     Attributes (1=awful, 10=great):
       Attractive (attr)
       Sincere (sinc)
       Intelligent (intel)
       Fun (fun)
       Ambitious (amb)
       Shared Interests/Hobbies (shar)
     Overall, how much do you like this person? 1=don't like at all, 10=like a lot (like)
     How probable do you think it is that this person will say 'yes' for you? 1=not probable, 10=extremely probable (prob)
     Have you met this person before? 1=yes, 2=no: yes / no (met)
  4. In the dataset, each observation represents a meeting between a participant and a partner. The observation includes all the information collected about the participant, including demographics, preferences, how they scored their partner, how their partner scored them, and whether both people agreed to go on a second date.

     Prior to modeling the data, there were many discrepancies and non-uniformities among the variables to be reconciled. For four of the speed dating events (waves six to nine), the participants ranked their preference for each of the six attributes on a scale of 1-10. For the remaining events, participants ranked their preference by allocating 100 points across the same six attributes. To make these variables consistent, the values for waves six to nine have been scaled to 100 points. We used the following formula to scale the data for waves 6-9:

     $$\text{Rating}_{\text{scaled}} = \frac{100}{\sum \text{Attribute Ratings}} \times \text{Rating}_{\text{original}}$$

     The target variable we have selected is decision (dec). In the app we are developing, we are more concerned with making the right recommendations for a person. We rejected the following binary attributes from the data:

     • match (whether both persons agree to go on a second date)
     • dec_o (decision of the partner to go on a second date)
     • Num_in_3 (how many of your matches have you been on a date with so far)

     match and dec_o were rejected because the combination of the participant's and the partner's decisions is equivalent to our target variable, decision. If we kept these two variables, the model would predict the target variable (decision to go on a second date) with close to 100% accuracy. We rejected Num_in_3 because more than 90% of observations were missing.

     After processing the dataset, we explored the observations to gain a better understanding of the data. Interesting aspects include:

     • Overall match rate: 16.5%
     • Individual 'yes' rate: 42%
     • Age of participants: range 18-55, mean 26.3, standard deviation 3.6, skewness 1.07

     Using interactive decision trees in SAS, we chose several initial nodes to split the data on, and then let SAS decide how to split the tree into subsequent branches. This method shows how the target variable varies among participants of different genders, races, and ages, and with the season in which the event was held. Results are shown below.
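The rescaling applied to waves six to nine can be illustrated with a short Python sketch. The function name and the sample ratings are hypothetical and only meant to show the arithmetic; the report itself performed this step in SAS.

```python
def scale_to_100(ratings):
    """Rescale one participant's attribute ratings so they sum to 100 points."""
    total = sum(ratings.values())
    if total == 0:
        # Nothing to rescale; return a copy rather than divide by zero.
        return dict(ratings)
    return {name: 100 * value / total for name, value in ratings.items()}


# Hypothetical participant from waves 6-9 who rated each attribute on a 1-10 scale.
original = {"attr": 8, "sinc": 6, "intel": 7, "fun": 9, "amb": 4, "shar": 6}
scaled = scale_to_100(original)
```

After scaling, the six values sum to 100, so they can be compared directly with the point allocations used in the other waves.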
  5. Gender:

     Note: '0' represents female, '1' represents male.

     As the tree shows, females are more conservative in whom they choose to go on a second date with. On average, women said 'yes' to only 37.4% of males, while men said 'yes' to 46.57% of females.

     Race:

     The decision rate varies among races. Black/African American participants said 'yes' to 51.2% of partners, while European/Caucasian participants said 'yes' to 38.79% of partners. The percentages for the other races lie somewhere in between.
  6. Age:

     SAS Enterprise Miner split the tree into two age ranges: below 38.5 and above 38.5 years old. For participants under 38.5 years old, 'like' was the most important attribute when deciding whether they want to go on a second date. However, for participants over 38.5 years old, the most important factor was how 'fun' they found their partner to be.

     Season:

     We found a slight difference in the outcome of the decision variable when we chose to split the decision tree based on the season in which the speed dating event was held. The tree shows that people are more likely to say 'yes' to any given date if the speed dating event is held in winter.

     These nuances in the data help us understand how the decision variable is affected by a user's demographics.

     In the dataset, the binary target variable 'Decision' is 'yes' for 41.99 percent of observations. If we take a simple model in which we predict that every 'Decision' is 'no', our misclassification rate would be 41.99 percent.
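The baseline error rate quoted above follows directly from the share of 'yes' decisions. A minimal Python sketch of that arithmetic, with a hypothetical function name, is:

```python
def baseline_misclassification(yes_rate, predict_yes=False):
    """Error rate of a constant model, given the share of 'yes' decisions."""
    # Predicting 'no' everywhere is wrong exactly on the 'yes' rows, and vice versa.
    return 1 - yes_rate if predict_yes else yes_rate


always_no = baseline_misclassification(0.4199)
always_yes = baseline_misclassification(0.4199, predict_yes=True)
```

Predicting 'no' for every meeting is the better of the two constant models, so 41.99 percent is the error rate any fitted model needs to beat.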
  7. BI Model:

     We partitioned the data into training (70%), validation (20%), and test (10%) sets. We tried running all the classifiers with different sampling techniques, such as simple random and stratified sampling, and got the best results using stratified sampling.

     For observations that are missing values for certain variables, we used the replacement node to replace the missing class variables with a dot so that SAS recognizes the values as missing.

     After data pre-processing, we ran the following models:

     • Regression with replacement node
     • Regression with replacement, variable selection, and impute nodes
     • Regression with replacement, variable selection, impute, and transform variables
     • Dmine regression with replacement, variable selection, impute, and variable transformation
     • Neural network with variable selection
     • Decision tree
     • Decision tree using variable selection and replacement node
     • Gradient boosting with replacement
     • Decision tree with replacement node
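The 70/20/10 stratified partition was done with SAS Enterprise Miner; the sketch below only illustrates the idea in plain Python. The function name, the seed, and the synthetic rows are assumptions made for the example, not part of the original project.

```python
import random
from collections import defaultdict


def stratified_split(rows, target, fractions=(0.7, 0.2, 0.1), seed=0):
    """Split rows into train/validation/test, preserving the target class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[target]].append(row)

    train, valid, test = [], [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_train = round(fractions[0] * len(members))
        n_valid = round(fractions[1] * len(members))
        train.extend(members[:n_train])
        valid.extend(members[n_train:n_train + n_valid])
        test.extend(members[n_train + n_valid:])  # remainder goes to the test set
    return train, valid, test


# Synthetic rows: 42 'yes' (dec=1) and 58 'no' (dec=0) decisions.
rows = [{"id": i, "dec": 1 if i < 42 else 0} for i in range(100)]
train, valid, test = stratified_split(rows, "dec")
```

Because each class is split separately, the share of 'yes' decisions stays roughly the same in all three partitions, which is the property that distinguishes stratified sampling from simple random sampling.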
  8. Impute node: The dataset has numerous missing values. To address this issue, the mean of each respective variable was used to replace missing interval values, and the mode of each respective variable was used to replace missing ordinal values.

     Variable selection: Since we have many attributes, the variable selection node was used to let SAS automatically choose the variables that most affected the target variable, 'match.'

     Variable transformation: Certain attributes were highly positively skewed. These variables have been transformed using the log function. This method gave superior results to other methods such as inverse or square root.

     We altered the 'maximum branch' parameter in every decision tree and got the best results when 'maximum branch' was set to 4 for the decision tree with replacement node.

     We executed forward, backward, and stepwise regression for every regression node. We got the best results when keeping the 'model selection' parameter set to none with the regression with impute node.

     Model comparison results:

     The model comparison node shows that the best model selected by SAS Enterprise Miner is gradient boosting with replacement node, with a misclassification rate of 18.1 percent.
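As a rough illustration of the imputation and transformation steps, the following Python sketch fills missing values with the mean or mode and applies a log transform. The function names and sample values are hypothetical, and `log1p` (log of 1 + x) is an assumption chosen here so that zero values stay defined; the report only states that "the log function" was used.

```python
import math
from statistics import mean, mode


def impute(values, kind):
    """Fill None entries with the mean (interval) or the mode (ordinal/class)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if kind == "interval" else mode(observed)
    return [fill if v is None else v for v in values]


def log_transform(values):
    """Compress a positively skewed variable; log1p is an assumed variant."""
    return [math.log1p(v) for v in values]


ages = impute([24, None, 30, 27], "interval")   # missing age becomes the mean
goals = impute([1, 2, None, 2], "ordinal")      # missing level becomes the mode
skewed = log_transform([0, 9, 99])
```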
  9. [Figure: ROC curve for all models]

     For our application we are more interested in the true positive rate of the model, because we will be making recommendations and it would be better to recommend to users the people to whom they are more likely to say yes for going on a date.

     True positive rates:

     Model                                    True positive rate
     Dmine regression                         71.4%
     Regression with impute                   74.2%
     Regression with transformed variables    73.8%
     Neural network                           31.5%
     Gradient boosting                        75.2%
     Decision tree                            72.7%
     Decision tree with variable selection    75.8%
     Decision tree with replacement           80.2%

     Even though gradient boosting gives the best misclassification rate, we have chosen the decision tree with replacement as our BI model based on its higher true positive rate. The decision tree has a true positive rate of 80.2 percent, whereas gradient boosting has a true positive rate of 75.2 percent. Please see the attached document containing the image of the decision tree.
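The selection criterion can be sketched in Python as follows. The helper functions show how the true positive rate and accuracy are defined from confusion-matrix counts; the dictionary holds the true positive rates reported in this section, and the variable and function names are hypothetical.

```python
def true_positive_rate(tp, fn):
    """Share of actual 'yes' decisions that the model also predicts as 'yes'."""
    return tp / (tp + fn)


def accuracy(tp, tn, fp, fn):
    """Share of all decisions that the model predicts correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# True positive rates (%) as reported for each model in this section.
reported_tpr = {
    "Dmine regression": 71.4,
    "Regression with impute": 74.2,
    "Regression with transformed variables": 73.8,
    "Neural network": 31.5,
    "Gradient boosting": 75.2,
    "Decision tree": 72.7,
    "Decision tree with variable selection": 75.8,
    "Decision tree with replacement": 80.2,
}

# Pick the model with the highest true positive rate.
best_model = max(reported_tpr, key=reported_tpr.get)
```

Ranking by true positive rate rather than by overall misclassification favors models that recover more of the actual 'yes' decisions, which is what matters for a recommendation list.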
  10. Conclusion:

      The decision tree uses these variables to split upon, and the root node selected is 'like'.

      Some interesting results (all ratings are on a scale of 1 to 10):

      • If the user likes a person at 8 or above → the user rates them 7.5 or above on attractiveness → the user thinks the probability of getting a match is 3 or above, then there is an 86.28 percent chance that the user will say yes.
      • If the user likes a person at 8 or above → the user rates them at least 4 and less than 7 on attractiveness → the user's estimated number of matches (match_es) is greater than 1.5 and they are of the same race, then there is a 60 percent chance that the user will say yes.
      • If the user likes the person at least 5.5 and less than 6.5 → if they are from London, England, there is a 100 percent chance of saying yes, but if the user is from Alabama, Texas, or Argentina, there is a 68.12 percent chance of saying no.
      • If the user likes a person less than 5.5 → the user is a lawyer, then there is a 93.16 percent chance that the user will say no to the other person. Similarly, if the user is in the field of Informatics or Psychology, the user will say no 100 percent of the time, and if the user is a journalist, there is an 83 percent chance of saying yes.
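A path through the tree is simply a chain of threshold checks. As a minimal sketch, the first rule listed above can be written as a Python function; the function name is hypothetical, and only this one branch is encoded, not the full tree built in SAS Enterprise Miner.

```python
def first_branch_yes_rate(like, attr, prob):
    """Return the reported 'yes' rate if the ratings fall on the first branch, else None."""
    if like >= 8 and attr >= 7.5 and prob >= 3:
        return 0.8628  # 86.28 percent, as reported for this branch
    return None  # other branches are not modeled in this sketch
```

For example, `first_branch_yes_rate(9, 8, 5)` falls on this branch, while a rating with `like` below 8 does not and returns `None`.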
  11. Overview of the application:

      • In the mobile application, users create their profiles with some pictures and a description of themselves. The users are asked to specify their preferences, such as the age range of their partners, the location range, and whether they are interested in meaningful friendships or relationships.
      • The user is shown the profiles of people who match the user's preferences and is asked to rate their profiles on features such as attractiveness, fun, and how much they like the overall profile of the person.
      • Based on these ratings, our BI model generates a list of potential partners with whom the user is likely to be compatible, and the user has an option to start a chat.
      • After a significant user base has been established, we will be able to design a recommendation system that increases accuracy by selecting the profiles that similar users have matched with.

      References:

      Data source: Kaggle.com. Columbia Business School. Ray Fisman and Sheena. Gender Differences in Mate Selection: Evidence from a Speed Dating Experiment. https://www.kaggle.com/annavictoria/speed-dating-experiment
