Yelp Dataset Challenge
MSIS 5633
Deliverable 2
25 NOV 2015
James Lynn (CWID 11644030)
Yolande Mbah Mbole (CWID 11696431)
Vegard Oelstad (CWID 11681522)
YELP Dataset Challenge, 2nd deliverable, Lynn, Mbole, Oelstad
Executive Summary
Yelp is a web-based company providing crowd-sourced reviews of local businesses via Yelp.com. Its stated goal is to connect people with great local businesses. In recent years, Yelp has made subsets of its data available to the public to promote innovative uses of data and groundbreaking research.
The goal of our project is to leverage this Yelp data to create a classification scheme utilizing Ratings and Price information. The analysis should provide insights into what makes some restaurants earn top rankings while others fall short. Obviously, consumers expect high quality in terms of service, food, ambiance, etc. The question is which dimensions are more important. Can a restaurant fall short in some areas and still be rated highly?
Our project could benefit those looking to open a new restaurant by identifying key areas to focus on. Every advantage can help when you consider that a study by Cornell University and Michigan State University researchers found that after the first year 27% of restaurant startups failed. Chef Robert Irvine of TV’s Restaurant Impossible cited inexperience as the primary reason most restaurants fail. Our project can help educate inexperienced restaurateurs on customer expectations and what it takes to succeed in terms of ratings and customer perception.
The one actionable factor our analysis found for improving a restaurant’s rating is its opening hours. Although longer opening hours may increase revenue, shorter hours are associated with higher ratings. Together with the fact that the majority of the reviews are concerned with food and service, this suggests that managers may want to consider reducing their hours to increase their ratings, which in turn could help bring in more customers and more revenue.
Project Schedule, Duration and Estimates
Initial Project Timeline
YELP DATASET CHALLENGE ANALYSIS TIMELINE
Task | Lead | Est. Duration (days) | Start Date | End Date
Kick Off Meeting | Team | 1 | 9/2/15 | 9/2/15
Prepare project proposal | Team | 7 | 9/6/15 | 9/12/15
Submit project proposal | Team | 1 | 9/13/15 | 9/13/15
Define data requirements for analysis | Team | 5 | 9/13/15 | 9/18/15
Data consolidation | Team | 27 | 9/18/15 | 10/15/15
Data cleaning | Team | 27 | 9/18/15 | 10/15/15
Data reduction | Team | 27 | 9/18/15 | 10/15/15
Prepare first deliverable | Team | 3 | 10/15/15 | 10/17/15
Submit first deliverable | Team | 1 | 10/18/15 | 10/18/15
Build models | Team | 10 | 10/19/15 | 10/30/15
Analyze models | Team | 24 | 11/1/15 | 11/24/15
Prepare second deliverable | Team | 3 | 11/25/15 | 11/28/15
Submit second deliverable | Team | 1 | 11/29/15 | 11/29/15
Prepare report and presentation | Team | 11 | 11/30/15 | 12/10/15
Submit final deliverable | Team | 1 | 12/11/15 | 12/11/15
Final Project Timeline
Comparing our initial timeline with the final one, we initially planned to do the data reduction before submitting the first deliverable but were only able to do so after submitting the deliverable because we spent more time than expected on the data cleaning and consolidation. We also included the duration of the Data Transformation in our updated timeline. We met almost every week, but only the major meetings are included in our final timeline. Another major difference between our planned and actual schedule is that we spent more time on data transformation than planned. As a result, we had to use some of the time we planned to spend on building and analyzing our models on the data transformation. It worked out well and we were able to complete the project on time.
YELP DATASET CHALLENGE ANALYSIS TIMELINE
Task | Lead | Est. Duration (days) | Start Date | End Date
Kick Off Meeting | Team | 1 | 9/2/15 | 9/2/15
Prepare project proposal | Team | 7 | 9/6/15 | 9/12/15
Submit project proposal | Team | 1 | 9/13/15 | 9/13/15
** Major Group meeting | Team | 1 | 9/14/15 | 9/14/15
Define data requirements for analysis | Team | 4 | 9/15/15 | 9/18/15
Data cleaning and data consolidation | Team | 27 | 9/18/15 | 10/15/15
Prepare first deliverable | Team | 3 | 10/15/15 | 10/17/15
Submit first deliverable | Team | 1 | 10/18/15 | 10/18/15
** Major Group meeting | Team | 1 | 10/19/15 | 10/19/15
Data Transformation | Team | 18 | 10/20/15 | 11/7/15
Data Reduction | Team | 6 | 11/8/15 | 11/14/15
** Major Group meeting | Team | 1 | 11/15/15 | 11/15/15
Build models | Team | 5 | 11/16/15 | 11/20/15
Analyze models and start preparing 2nd deliverable | Team | 3 | 11/21/15 | 11/23/15
** Major Group meeting | Team | 1 | 11/23/15 | 11/23/15
Finalize second deliverable | Team | 1 | 11/24/15 | 11/24/15
Submit second deliverable | Team | 1 | 11/25/15 | 11/25/15
** Major Group meeting | Team | 1 | 11/26/15 | 11/26/15
Prepare report and presentation | Team | 10 | 11/27/15 | 12/6/15
Submit final deliverable | Team | 1 | 12/7/15 | 12/7/15
Work Breakdown Structure
YELP Data Mining Project
  Project Proposal
  First Deliverable
    - Define data requirements for analysis
    - Data cleaning and consolidation
  Second Deliverable
    - Data Transformation
    - Data reduction
    - Building and analyzing models
  Final Deliverable
    - Report
    - Final Presentation
Statement of Scope
Project Objective
The objective of our analysis is to uncover the factors most important in categorizing a Yelp restaurant into a high review category (4, 4.5, or 5 Star rating).
Target Variable
• TARGET – this target variable is a binary field with values of 0 or 1. It is created by assigning a value of 1 to restaurants within the High review category. All other restaurants will be assigned a 0 value.
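As an illustration of the rule above, the target flag could be derived as follows. This is a minimal Python sketch, not the code used in the project: the function name `make_target` and the sample star values are hypothetical, and the 4.0 threshold is taken from the High review category definition (4, 4.5, or 5 stars).

```python
# Hypothetical helper: derive the binary TARGET from a restaurant's star rating.
HIGH_REVIEW_MIN_STARS = 4.0  # High review category = 4, 4.5, or 5 stars


def make_target(stars: float) -> int:
    """Return 1 for restaurants in the High review category, else 0."""
    return 1 if stars >= HIGH_REVIEW_MIN_STARS else 0


# Illustrative star ratings only (not taken from the Yelp data).
sample_stars = [2.5, 3.5, 4.0, 4.5, 5.0]
targets = [make_target(s) for s in sample_stars]
print(targets)
```

Keeping the threshold in a single named constant makes it easy to rerun the analysis with a different cut-off for the High category.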
Predictor Variables
Our initial file included over 100 possible predictor variables. To limit the scope, we started with the variables below and used a decision tree to identify the most important variables in determining the desired outcome. In addition, we selected a few additional variables based on our intuition and curiosity to see how well they performed in terms of classification and prediction. The bolded variables are those actually selected for use in our models.
• Ethnicity – type of food (e.g. Italian, Mexican, etc.)
• Neighborhood Flag – binary variable to indicate whether neighborhoods were listed; could be an indicator of trendy locations
• Review Count – number of Yelp reviews
• Good for Kids – whether restaurant is good for kids
• Alcohol – full bar, beer and wine, none, etc.
• Noise Level – loud, very loud, average, etc.
• Attire – dressy, casual, etc.
• Coat Check – True, False
• Romantic – True, False
• Classy – True, False
• Intimate – True, False
• Hipster – True, False
• Divey – True, False
• Touristy – True, False
• Trendy – True, False
• Upscale – True, False
• Casual – True, False
• Good for Dessert – True, False
• Good for Late Night – True, False
• Good for Lunch – True, False
• Good for Dinner – True, False
• Good for Breakfast – True, False
• Good for Brunch – True, False
• Live Music – True, False
• Dairy Free – True, False
• Gluten Free – True, False
• Vegan – True, False
• Vegetarian – True, False
• Wi-Fi – True, False
• Takes Reservations – True, False
• Smoking – Yes, No, Outdoor
• Hours Open – open/close time broken out by day of week
• Text Topics 1-20 – themes identified through text mining
• Total Reviews voted as cool
• Total hours open on weekends
• Total Tips
• Total Likes of Tips
• Percentage of reviews voted Funny
• Percentage of reviews voted Useful
• Percentage of reviews voted Cool
People Benefitting from the Analysis
The primary beneficiaries of this analysis will be restaurant owners and operators. They will receive insights into the most important dimensions of a highly rated restaurant.
Consumers may also benefit. When restaurants aren’t rated or when they have fewer reviews, the criteria may help them determine whether or not to take a chance on a restaurant.
Yelp and advertisers may also benefit. They can use the information from the analysis to approach businesses in a more consultative fashion by providing offerings and recommendations that help restaurants improve key areas of weakness or consumer perceptions in those areas.
Companies who help restaurants could benefit. Perhaps a restaurant scores low for ambiance. Companies specializing in remodeling or interior design could approach these restaurants with proposals or ideas on how improvements could be made.
Finally, job seekers may benefit. The results of the analysis would give them clues on the major values and characteristics that distinguish one restaurant from another. They would then be able to make a better choice of the restaurant they want to work for based on the attributes they value most.
Constraints and Limitations
There are a number of possible constraints associated with this project.
1. Small sample size of highly rated, expensive restaurants - While there are over 6,000 restaurants in the data set rated as a 4, 4.5, or 5, there are only about 175 with those ratings also falling into the most expensive category (price range of 4). Given that fact, we adjusted our original project idea of investigating why expensive restaurants receive low ratings to something broader. We are now looking to predict high restaurant ratings irrespective of price.
2. Format of the data - There are several data fields that include nuggets of information that are not easily accessible without text mining. Even with text mining, over 400 concepts emerge. These concepts must be combined into themes. This is a time-consuming and inexact process.
3. Samples - The samples we are using are from a few U.S. cities - Pittsburgh, Charlotte, Urbana-Champaign, Phoenix, Las Vegas, and Madison. The samples may not be representative of the U.S. as a whole.
4. Timing – As of the time this paper was written, we have received no formal feedback on our original project proposal. Should changes be required, we will have less time to adapt.
5. Expertise – A good data science team is comprised of individuals with expertise in several disciplines – computer science, statistics/math, and the business domain. Our group lacks anyone with an in-depth statistics/math background.
Project Costs
The project team associated with this analysis consists of 3 senior data analysts. We estimate the time required to be 50 hours per analyst (150 hours total). At a rate of $250 per hour, we estimate the total project cost to be $37,500. This estimate does not take into account the opportunity cost of other projects that are not undertaken.
Since we are using free analysis software and there are no data charges, the intangible costs are negligible.
Feasibility and Risk Assessment
Despite our team’s shortcomings in the realm of statistics, we felt our project was feasible based on the training we have received in MSIS 5633. We felt the biggest challenge facing us was the conversion of JSON files to a format easily readable by SPSS Modeler. The rest of the project was less daunting.
Timing and resource availability were another challenge we faced. With a distance learning student and a student athlete on the team, scheduling meetings was sometimes difficult. We were able to overcome the challenge by scheduling regular meetings on Google Hangouts and maintaining ongoing, open communication via email.
We were fortunate to have a robust data set from Yelp. The data set permitted us to easily adjust or modify our sample and the specific data to be used in the project. We also had the necessary programs to perform our analysis, with each team member having access to Excel, JMP, R, SAS, SPSS Modeler and Tableau. These tools, combined with training on key data mining and analysis techniques from MSIS 5633, gave us what we needed to successfully achieve our project goals.
Implementing the Plan / Measuring Results
To implement our plan, we would identify startup restaurants in the cities our sample was based on (Pittsburgh, Charlotte, Urbana-Champaign, Phoenix, Las Vegas, and Madison) and present our ideas to them.
Our analytic program will be successful if we are able to determine whether there are factors in the Yelp data set that accurately identify what most contributes to an expensive restaurant having a poor rating. If we discover that none of the factors present predict a low rating, that is an interesting insight that may be of value to Yelp. If we discover there are factors that may result in low ratings, that will be of interest to Yelp, restaurant owners, and possibly diners.
Beyond our analysis, we would like to succeed by helping struggling restaurants. By leveraging our insights, they could improve the number of customer visits as well as their reviews. If the number of customers significantly increases alongside high ratings, our analysis has done more than succeed.
Our potential clients would be mainly startup restaurants, as well as restaurants with really low ratings (1 or 2 stars). We could present our findings at a range of industry events like the National Restaurant Association Conference, the Restaurant Finance & Development Conference, or something more interesting like the TV show Restaurant Impossible.
Beyond that, we would present our model to customers who may have a vested interest in helping struggling restaurants turn their businesses around. This could include chefs who help with menu selections, interior designers who could improve the look, musicians who could improve the ambience, etc.
Scope Proposal
The scope of this project was limited to U.S. restaurants in the Yelp Dataset Challenge data. We focused on identifying the factors common to highly rated restaurants within this group that are not present in restaurants with lower ratings.
Data Dictionary
Our data dictionary is extensive given the number of variables provided by Yelp and the number of derived fields we created. We elected to maintain a large data dictionary to illustrate the breadth of data we had available and the new fields we created. We also used variable screening methods that leveraged a large number of variables to identify those useful to our model.
Yelp Data Set Challenge Master Data Dictionary
Variable | Description | Type | Length | Format | Informat
Ages_Allowed | Describes ages allowed in restaurant (e.g. 19plus). | Char | 7 | $CHAR7. | $CHAR7.
Alcohol | Describes if/how alcohol is served (e.g. full bar, beer and wine, etc.). | Char | 13 | $CHAR13. | $CHAR13.
Attire | Describes appropriate dress for restaurant (e.g. dressy, casual). | Char | 6 | $CHAR6. | $CHAR6.
BYOB | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
BYOB_Corkage | Field identifies whether attribute is True, False, or NA. | Char | 11 | $CHAR11. | $CHAR11.
Caters | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Coat_Check | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Corkage | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Credit_Cards | Field identifies whether attribute is True, False, or NA. | Char | 6 | $CHAR6. | $CHAR6.
Delivery | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Dogs_Allowed | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Drive_Thru | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Friday_close | Close time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
Friday_open | Open time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
Good_For_Dancing | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Good_For_Groups | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Good_For_Kids2 | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Good_For_breakfast | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Good_For_brunch | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Good_For_dessert | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Good_For_dinner | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Good_For_latenight | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Good_For_lunch | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Good_for_Kids | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Happy_Hour | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Has_TV | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Monday_close | Close time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
Monday_open | Open time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
Music_dj | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Music_jukebox | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Music_karaoke | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Music_live | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Music_playlist | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Music_video | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Noise_Level | Describes noise level (e.g. average, quiet, loud). | Char | 9 | $CHAR9. | $CHAR9.
Open_24_Hrs | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Order_at_Counter | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Outdoor_Seating | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Parking_garage | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Parking_lot | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Parking_street | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Parking_valet | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Parking_validated | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Payment_amex | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Payment_cash_only | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Payment_discover | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Payment_mastercard | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Payment_visa | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Saturday_close | Close time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
Saturday_open | Open time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
Smoking | Describes if/where smoking is permitted (e.g. no, outdoor). | Char | 7 | $CHAR7. | $CHAR7.
Sunday_close | Close time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
Sunday_open | Open time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
Take_out | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Takes_Reservations | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Thursday_close | Close time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
Thursday_open | Open time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
Tuesday_close | Close time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
Tuesday_open | Open time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
Waiter_Service | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Wednesday_close | Close time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
Wednesday_open | Open time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
Wheelchair_Accessible | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Wi_Fi | Describes wi-fi availability and cost (e.g. no, free). | Char | 4 | $CHAR4. | $CHAR4.
afternoon_check-ins* | Derived from check-ins file. Sum of afternoon check-ins from 11AM to 3PM. | Num | 8 | |
avgstars_review_file* | Derived from reviews file. Average ratings on rating file for a restaurant. | Num | 8 | |
background_music | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
business_id | Unique identifier for individual restaurants. Also the primary key. | Char | 22 | $CHAR22. | $CHAR22.
casual | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
categories | Catchall field from Yelp that includes restaurant type, foods, etc. | Char | 199 | $CHAR199. | $CHAR199.
city | City where restaurant is located. | Char | 35 | $CHAR35. | $CHAR35.
classy | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
cool_pct* | Derived from reviews file. Percent of total reviews that were voted cool. | Num | 8 | |
dairy_free | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
divey | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
ethnicity* | Derived from restaurants file. Text mining done to create flags for food type. | Char | 25 | |
evening_check-ins* | Derived from check-ins file. Sum of evening check-ins from 6PM to 11PM. | Num | 8 | |
frihours* | Derived from open and close times. Number of hours open this day. | Num | 8 | |
full_address | Full physical address of restaurant. | Char | 110 | $CHAR110. | $CHAR110.
fullweek_hours* | Derived from open and close times. Number of hours open for the week. | Num | 8 | |
funny_pct* | Derived from reviews file. Percent of total reviews that were voted funny. | Num | 8 | |
gluten_free | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
halal | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
hipster | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
intimate | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
kosher | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
lateafternoon_check-ins* | Derived from check-ins file. Sum of check-ins from 3PM to 6PM. | Num | 8 | |
latenight_check-ins* | Derived from check-ins file. Sum of check-ins from 11PM to 5AM. | Num | 8 | |
latitude | Latitude of restaurant. | Num | 8 | BEST16. | BEST16.
longitude | Longitude of restaurant. | Num | 8 | BEST17. | BEST17.
monhours* | Derived from open and close times. Number of hours open this day. | Num | 8 | |
morning_check-ins* | Derived from check-ins file. Sum of morning check-ins from 5AM to 11AM. | Num | 8 | |
name | Name of restaurant. | Char | 61 | $CHAR61. | $CHAR61.
neighborhoods | Neighborhood restaurant is located in. | Char | 52 | $CHAR52. | $CHAR52.
open | Whether the restaurant is still in business (True or False). | Char | 5 | $CHAR5. | $CHAR5.
pct_likes_of_tips* | Derived from Tips file. Percentage of tips that were liked by other users. | Num | 8 | |
price_range | 1 to 4 with 4 being the most expensive. | Char | 2 | $7,00 | $CHAR2.
rating* | Derived from Stars field. Low (1-2), Medium (2.5-3.5), High (4-5). | Char | 3 | $3,00 |
restaurant_type* | Derived from text mining categories field. Type of restaurant (e.g. Bar, Pub, Fast Food). | Char | 25 | |
review_count | Total number of reviews for restaurant as reported on Yelp business file. | Num | 8 | BEST4. | BEST4.
romantic | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
sathours* | Derived from open and close times. Number of hours open this day. | Num | 8 | |
soy_free | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
stars | Overall rating of restaurant. | Num | 8 | BEST3. | BEST3.
state | State where restaurant is located. | Char | 3 | $CHAR3. | $CHAR3.
sunhours* | Derived from open and close times. Number of hours open this day. | Num | 8 | |
target* | Derived dependent variable. 1 when restaurant has High rating. Zero otherwise. | Num | 8 | |
thurshours* | Derived from open and close times. Number of hours open this day. | Num | 8 | |
tot_check-ins* | Derived from check-ins file. Total number of check-ins for restaurant. | Num | 8 | |
tot_cool* | Derived from tips file. Total number of tips voted cool. | Num | 8 | |
tot_funny* | Derived from tips file. Total number of tips voted funny. | Num | 8 | |
tot_reviews* | Derived from reviews file. Total number of reviews for restaurant. | Num | 8 | |
tot_tip_likes* | Derived from tips file. Total number of likes for all tips for a restaurant. | Num | 8 | |
tot_tips* | Derived from tips file. Total number of tips for restaurant. | Num | 8 | |
tot_useful* | Derived from tips file. Total number of reviews voted useful. | Num | 8 | |
touristy | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
trendy | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
tueshours* | Derived from open and close times. Number of hours open this day. | Num | 8 | |
type | Type of record (e.g. business, review, tip, etc.) | Char | 8 | $CHAR8. | $CHAR8.
upscale | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
useful_pct* | Derived field. Percent of total reviews that were voted useful. | Num | 8 | |
vegan | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
vegetarian | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
wedhours* | Derived from open and close times. Number of hours open this day. | Num | 8 | |
weekday_afternoon_check-ins* | Derived from check-ins file. Sum of weekday afternoon check-ins from 11AM to 3PM. | Num | 8 | |
weekday_evening_check-ins* | Derived from check-ins file. Sum of weekday evening check-ins from 6PM to 11PM. | Num | 8 | |
weekday_hours* | Derived from open and close times. Sum of hours open Monday-Friday. | Num | 8 | |
weekday_lateafternoon_check-ins* | Derived from check-ins file. Sum of weekday check-ins from 3PM to 6PM. | Num | 8 | |
weekday_latenight_check-ins* | Derived from check-ins file. Sum of weekday check-ins from 11PM to 5AM. | Num | 8 | |
weekday_morn_check-ins* | Derived from check-ins file. Sum of weekday morning check-ins from 5AM to 11AM. | Num | 8 | |
weekend_afternoon_check-ins* | Derived from check-ins file. Sum of weekend afternoon check-ins from 11AM to 3PM. | Num | 8 | |
weekend_evening_check-ins* | Derived from check-ins file. Sum of weekend evening check-ins from 6PM to 11PM. | Num | 8 | |
weekend_hours* | Derived from open and close times. Sum of hours open Saturday-Sunday. | Num | 8 | |
weekend_lateafternoon_check-ins* | Derived from check-ins file. Sum of weekend check-ins from 3PM to 6PM. | Num | 8 | |
weekend_latenight_check-ins* | Derived from check-ins file. Sum of weekend check-ins from 11PM to 6AM. | Num | 8 | |
weekend_morn_check-ins* | Derived from check-ins file. Sum of weekend morning check-ins from 5AM to 11AM. | Num | 8 | |
budget_tm* | Derived from text mining tips file. Concepts related to money. 0=False, 1=True | Num | 8 | |
drinks_tm* | Derived from text mining tips file. Concepts related to drinks in general, e.g. beer, juice, water, tea, shakes. 0=False, 1=True | Num | 8 | |
food_tm* | Derived from text mining tips file. Concepts related to food, ingredients, vegetables, fruits, dessert. 0=False, 1=True | Num | 8 | |
hours_tm* | Derived from text mining tips file. Concepts related to days, dates, time, open, closed, etc. 0=False, 1=True | Num | 8 | |
location_tm* | Derived from text mining tips file. Concepts related to location and ambiance of the location, e.g. seats, doors, kitchen, Arizona. 0=False, 1=True | Num | 8 | |
negative_tm* | Derived from text mining tips file. Concepts related to negative feelings, e.g. rude, dirty. 0=False, 1=True | Num | 8 | |
people_tm* | Derived from text mining tips file. Concepts related to individuals, e.g. family, friends, kids, wife. 0=False, 1=True | Num | 8 | |
positive_tm* | Derived from text mining tips file. Concepts which were generally related to positive feelings, e.g. clean, crispy. 0=False, 1=True | Num | 8 | |
service_tm* | Derived from text mining tips file. Concepts related to how the service is viewed, e.g. waitress, manager, wait time. 0=False, 1=True | Num | 8 | |
neighborhood_flg* | Derived from neighborhood field. 1 if neighborhood was listed, 0 if not. | Num | 8 | |
text_topic1* | Derived from text mining reviews. Concepts related to: "+taco,+salsa,+chip,+burrito,mexican" | Num | 8 | |
text_topic2* | Derived from text mining reviews. Concepts related to: "+customer,+know,+bad,+manager,+location" | Num | 8 | |
text_topic3* | Derived from text mining reviews. Concepts related to: "+pizza,+crust,+slice,+cheese,+thin" | Num | 8 | |
text_topic4* | Derived from text mining reviews. Concepts related to: "+great,+great food,+great service,+service,+food" | Num | 8 | |
text_topic5* | Derived from text mining reviews. Concepts related to: "+burger,fries,+fry,+bun,+onion" | Num | 8 | |
text_topic6* | Derived from text mining reviews. Concepts related to: "+wine,+restaurant,+dish,+dessert,+meal" | Num | 8 | |
text_topic7* | Derived from text mining reviews. Concepts related to: "+sushi,+roll,+fish,+tuna,+roll" | Num | 8 | |
text_topic8* | Derived from text mining reviews. Concepts related to: "+breakfast,+egg,+coffee,+toast,+pancake" | Num | 8 | |
text_topic9* | Derived from text mining reviews. Concepts related to: "+thai,+rice,+dish,+noodle,thai" | Num | 8 | |
text_topic10* | Derived from text mining reviews. Concepts related to: "+buffet,+crab,+dessert,+leg,+selection" | Num | 8 | |
text_topic11* | Derived from text mining reviews. Concepts related to: "+beer,+bar,+selection,+drink,+night" | Num | 8 | |
text_topic12* | Derived from text mining reviews. Concepts related to: "+sandwich,+bread,+lunch,+salad,+meat" | Num | 8 | |
text_topic13* | Derived from text mining reviews. Concepts related to: "+hour,+happy,+happy hour,+drink,+special" | Num | 8 | |
text_topic14* | Derived from text mining reviews. Concepts related to: "+price,+steak,+good,good,+portion" | Num | 8 | |
text_topic15* | Derived from text mining reviews. Concepts related to: "de,est,le,à,+pour" | Num | 8 | |
text_topic16* | Derived from text mining reviews. Concepts related to: "+steak,+rib,+chicken,bbq,+sauce" | Num | 8 | |
text_topic17* | Derived from text mining reviews. Concepts related to: "+minute,+wait,+table,+wait,+order" | Num | 8 | |
text_topic18* | Derived from text mining reviews. Concepts related to: "always,+staff,+friendly,+love,+location" | Num | 8 | |
text_topic19* | Derived from text mining reviews. Concepts related to: "+time,first,+first time,vegas,+love" | Num | 8 | |
text_topic20* | Derived from text mining reviews. Concepts related to: "+salad,+lunch,+chicken,always,+special" | Num | 8 | |
* Denotes that this is a derived or
calculated field.
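The derived hours fields (for example frihours, weekday_hours, weekend_hours and fullweek_hours) are described as computed from the open and close times. The sketch below shows one way such a calculation could work in Python; it is not the project's code. The function names and the sample schedule are hypothetical, and the rule that a close time at or before the open time means the restaurant closes after midnight is an assumption made for illustration.

```python
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
WEEKEND = ["Saturday", "Sunday"]


def hours_open(open_time, close_time):
    """Hours between 'HH:MM' open and close times (24-hour format).

    Assumption: a close time at or before the open time means the
    restaurant closes after midnight, so 24 hours are added.
    Missing times count as 0 hours.
    """
    if not open_time or not close_time:
        return 0.0
    open_h, open_m = (int(part) for part in open_time.split(":"))
    close_h, close_m = (int(part) for part in close_time.split(":"))
    minutes = (close_h * 60 + close_m) - (open_h * 60 + open_m)
    if minutes <= 0:
        minutes += 24 * 60
    return minutes / 60.0


def summarize_hours(schedule):
    """Roll daily hours up into weekday, weekend and full-week totals."""
    daily = {
        day: hours_open(*schedule.get(day, (None, None)))
        for day in WEEKDAYS + WEEKEND
    }
    weekday = sum(daily[day] for day in WEEKDAYS)
    weekend = sum(daily[day] for day in WEEKEND)
    return {
        "weekday_hours": weekday,
        "weekend_hours": weekend,
        "fullweek_hours": weekday + weekend,
    }


# Hypothetical schedule: (open, close) per day; Sunday is closed (missing).
schedule = {
    "Monday": ("11:00", "22:00"),
    "Tuesday": ("11:00", "22:00"),
    "Wednesday": ("11:00", "22:00"),
    "Thursday": ("11:00", "22:00"),
    "Friday": ("11:00", "02:00"),
    "Saturday": ("10:00", "02:00"),
}
summary = summarize_hours(schedule)
print(summary)
```

How to treat closing times after midnight is a design choice that changes the totals, so it is worth stating whichever rule is used alongside the derived fields.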
Data Access
Our data was downloaded from the Yelp Dataset Challenge webpage. The URL for that page is http://www.yelp.com/dataset_challenge. Click on the ‘Get the Data’ button and complete a form to download.
The data includes information on the businesses that have been reviewed, the reviews, the user/reviewer, user check-ins, and user-provided tips. Yelp defines the data as follows:
The Challenge Dataset:
• 1.6M reviews and 500K tips by 366K users for 61K businesses
• 481K business attributes, e.g., hours, parking availability, ambience.
• Social network of 366K users for a total of 2.9M social edges.
• Aggregated check-ins over time for each of the 61K businesses
Cities:
• U.K.: Edinburgh
• Germany: Karlsruhe
• Canada: Montreal and Waterloo
• U.S.: Pittsburgh, Charlotte, Urbana-Champaign, Phoenix, Las Vegas, Madison
From the data, we focused only on records associated with restaurants. The process of consolidating and cleaning the data is outlined in the sections that follow.
Data Consolidation
Yelp provided the data in 5 files. Descriptions of each file are included below.
File Name | Description | File Format | Size | Number of Records
yelp_academic_dataset_business | List of reviewed businesses | JSON | 54MB | 61,181
yelp_academic_dataset_review | Review information on businesses | JSON | 1.39GB | 1,569,264
yelp_academic_dataset_user | Information on Yelp users/reviewers | JSON | 162MB | 366,715
yelp_academic_dataset_checkin | Information on check-ins at businesses | JSON | 20MB | 45,166
yelp_academic_dataset_tip | Tips for each business | JSON | 96MB | 495,107
A lotof data cleansingandmanipulationhadtobe done to consolidate the dataintoasingle datasetfor
modelingpurposes. Inordertogetto a single dataset,we wentthrougha 5 stepprocess.
1. Identify restaurants on the business file
2. Create a subset of the business file that only includes restaurants
3. Create subsets of the reviews, check-ins, and tips files
4. Summarize data from the review, check-in, and tips files (e.g. sum the number of check-ins/tips/reviews for each restaurant) and create a file for the summarized data containing only the business ID and summary fields that can be appended back to the restaurants file
5. Text mine key text fields in the review and tips files to create content category flags for each restaurant
6. Merge the summary tables back to the restaurant/business file, which serves as the final modeling dataset
Here is a sample of the SQL code used to merge the individual files back to the master.
proc sql;
  create table yelp.yelp_restaurant_reviews as
  select a.*, b.rating, b.stars as avg_star_rating
  from yelp.yelp_restaurant_reviews a left join yelp.yelp_restaurants b
    on a.business_id = b.business_id;
quit;
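The summarize-and-merge steps (steps 3, 4, and 6) can also be sketched in a few lines of Python. This is only an illustration on made-up records, not the code used in the project; the helper names and the tot_reviews field are hypothetical, while business_id and tot_tips follow the names used in this report.

```python
from collections import Counter

# Made-up miniature inputs; only the field names mirror the Yelp files.
restaurants = [
    {"business_id": "b1", "name": "Cafe A", "stars": 4.5},
    {"business_id": "b2", "name": "Diner B", "stars": 3.0},
]
reviews = [
    {"business_id": "b1", "text": "Great food"},
    {"business_id": "b1", "text": "Friendly staff"},
    {"business_id": "b3", "text": "Review of a non-restaurant"},
]
tips = [{"business_id": "b2", "text": "Try the pie"}]


def count_by_business(records):
    """Step 4: one summary count per business_id."""
    return Counter(r["business_id"] for r in records)


def left_join_counts(master, counts, field):
    """Step 6: append a summary field to every master row (0 when no match)."""
    return [{**row, field: counts.get(row["business_id"], 0)} for row in master]


# Step 3: keep only records that belong to a restaurant.
ids = {r["business_id"] for r in restaurants}
restaurant_reviews = [r for r in reviews if r["business_id"] in ids]
restaurant_tips = [t for t in tips if t["business_id"] in ids]

modeling = left_join_counts(restaurants, count_by_business(restaurant_reviews), "tot_reviews")
modeling = left_join_counts(modeling, count_by_business(restaurant_tips), "tot_tips")
```

As with the left join in the SQL sample, every restaurant row is kept even when it has no matching reviews or tips.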
Data Cleaning
The data cleaning process was extensive and time consuming with the Yelp data. The JSON data required extensive formatting, and some Yelp data fields combine somewhat unrelated data into a single field.
To convert the JSON fields into a more usable tab-delimited text format, we used the jsonlite R package and the following commands for each file. The file names were changed for each run to match the file being processed.
library(jsonlite)                                # load jsonlite library
yelp <- "yelp_academic_dataset_review.json"      # assign file to yelp variable
reviews <- stream_in(file(yelp))                 # read in file
reviews <- flatten(reviews, recursive = TRUE)    # flatten JSON file
reviews$text <- gsub('\n', ' ', reviews$text)    # strip linefeed from text field
reviews$text <- gsub('\r', ' ', reviews$text)    # strip carriage return from text field
# create data frame that works with write.table
reviews <- data.frame(lapply(reviews, as.character), stringsAsFactors = FALSE)
# write out data frame as tab-delimited text file
write.table(reviews, "yelp_reviews.txt", sep = "\t", row.names = FALSE)
The Business/Restaurant file had a field labeled category which was basically a list of key/value pairs. A great deal of text mining leveraging SPSS Text Analytics was required to clean this attribute and create new fields from it.
Data Transformation
Our data transformation focused primarily on the conversion of free-form text fields into flags that indicate whether a restaurant had reviews, tips, or category descriptions containing certain keywords or themes. To accomplish these transformations, we essentially constructed text mining models to create fields that could be fed into our final classification and predictor models.
Our text mining initiatives leveraged SPSS Modeler Text Analytics to accomplish this task for text in the Tips file and Restaurants file. SAS Text Analytics was used to create clusters from the review files.
A number of derived fields were also created. These were generally ways to summarize data that was already available in a different form. The hours each restaurant was open on a daily, weekly, and weekend level were calculated from the start and close time, for example.
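A minimal sketch of how such hour fields could be derived is shown below. It assumes the hours are available as a per-day mapping of "open" and "close" times in HH:MM format, that a closing time at or before the opening time means the restaurant closes after midnight, and that the weekend is Saturday and Sunday. The function names are hypothetical and this is not the calculation used in the project.

```python
from datetime import datetime, timedelta

WEEKEND = {"Saturday", "Sunday"}  # assumption: which days count as the weekend


def hours_open(open_time, close_time):
    """Hours between two HH:MM strings; a close at or before the open
    is treated as running past midnight."""
    fmt = "%H:%M"
    start = datetime.strptime(open_time, fmt)
    end = datetime.strptime(close_time, fmt)
    if end <= start:
        end += timedelta(days=1)
    return (end - start).total_seconds() / 3600


def summarize_hours(hours):
    """Return (daily hours per day, weekly total, weekend total)."""
    daily = {day: hours_open(t["open"], t["close"]) for day, t in hours.items()}
    weekly = sum(daily.values())
    weekend = sum(h for day, h in daily.items() if day in WEEKEND)
    return daily, weekly, weekend


# Made-up example restaurant that is open three days a week.
example = {
    "Friday": {"open": "11:00", "close": "22:00"},
    "Saturday": {"open": "17:00", "close": "02:00"},
    "Sunday": {"open": "10:00", "close": "14:30"},
}
daily, weekly_hours, weekend_hours = summarize_hours(example)
```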
Some of the more important derived fields are described below.
Rating – a field that bins Yelp star ratings from a 1 to 5 scale (in increments of .5) into Low, Medium, or High
Text Mining Fields – we mined the reviews for the restaurants to create a list of indicators for the key concepts that emerge. An example of a theme is budget_tm, which included concepts involving keywords surrounding price. A value of 1 indicates that a restaurant had a tip related to budget; 0 indicates that the restaurant did not.
Target – a field that serves as the target variable for our analysis. It identifies the restaurants with a price value of 4 (the highest value) and a rating of High
Categories – The business file categories field contains a lot of valuable information about each restaurant. Unfortunately, the values in it are often unrelated to one another and must be parsed out using a text mining tool to create indicator variables. The field may contain multiple values: Mexican, Tex-Mex, Nightlife, Lounge, etc.
In all, more than 30 fields were created through the text mining process. Those fields, as well as other derived fields, are denoted in the data dictionary with an asterisk.
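The derived fields described above can be illustrated with a short Python sketch. The star cut-offs for Low/Medium/High are assumptions made for the example (the report does not state them), and the keyword match is a much cruder stand-in for the concept extraction done in SPSS Text Analytics. All function names are hypothetical.

```python
def bin_rating(stars, low_max=2.5, high_min=4.0):
    """Bin a 1-5 star rating into Low/Medium/High.
    The cut-offs are illustrative assumptions, not the project's values."""
    if stars <= low_max:
        return "Low"
    if stars >= high_min:
        return "High"
    return "Medium"


def keyword_flag(texts, keywords):
    """Return 1 if any text mentions any keyword (case-insensitive
    substring match), else 0. A simplified version of a flag like budget_tm."""
    lowered = [t.lower() for t in texts]
    return int(any(k in t for t in lowered for k in keywords))


def target_flag(price, rating, top_price=4):
    """Return 1 for a restaurant with the highest price value and a High rating."""
    return int(price == top_price and rating == "High")


# Made-up example record.
tips = ["Cheap eats and big portions!", "Nice patio"]
budget_tm = keyword_flag(tips, ["cheap", "price", "budget"])
rating = bin_rating(4.5)
target = target_flag(price=4, rating=rating)
```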
Data Reduction
Data reduction efforts focused on restricting our data to only the businesses we identified as restaurants. To do that, we restricted our business file universe to restaurants using the code below to look for the keyword restaurants in the Yelp categories field. From there, we created a new restaurant indicator. We were able to subset the data in the second line of code below with the new restaurant indicator variable. The business IDs from this subset of restaurants were used to restrict records in our reviews, tips, and check-ins files to restaurants only.
# Identify restaurants
business$restaurant_flg <- grepl("Restaurant|restaurant", business$categories)
yelp_restaurants <- business[business$restaurant_flg, ]
Our next task was to reduce the review dataset to include only reviews that corresponded to our newly created list of restaurants. The code below shows our approach to this process using R.
ids <- yelp_restaurants$business_id
# subset
restaurant_reviews <- reviews[reviews$business_id %in% ids, ]
Descriptive analysis
Using JMP 12, we did some descriptive analysis to get a better understanding of the distributions of some of the key variables.
Ethnicity
First, plotting the ethnicity variable against the target variable (see Data Transformation) shows us the likelihood of a restaurant being a 4-5 star restaurant for the different ethnicities.
In the graph, we can see that certain ethnicities stand out. In terms of high likelihood of a high rating, Polish, Russian, Scandinavian, and African restaurants seem to be well received. On the other end of the scale, American, Irish, Mexican, and Unknown restaurants are not particularly successful.
To illustrate an essential problem with this analysis, we also brought in a frequency table for the different restaurants. Here we see that most of the different ethnicities have relatively few records to base any assumptions on.
Based on the frequency table above, the most frequent ethnicities are American, Asian, Mexican, Italian, and Unknown. Interestingly enough, this list seems to be pretty much the opposite of the list of ethnicities with a high likelihood of a high rating. This could be taken as an indicator that one of the aspects needed for a good review might be scarcity or originality, which would make sense for various reasons. When a restaurant serves the only food of its kind, there will be fewer restaurants to compare it to. You see the same effect with people who taste very high-end food: their standards rise after going to a Michelin-rated restaurant, compared to someone who has never tasted a Michelin-star-worthy meal.
Weekly hours
Another interesting observation is the importance of the weekly hours. In the graph below, you can see that the likelihood of a high rating decreases as the number of hours goes above 70.
Again, we use a simple frequency table to double-check that we are not making assumptions based on a small sample size.
As seen in the frequency table, there are at least 400 reviews for each of the blocks of full-week hours between 30 and 110 hours. Hence, making assumptions within this range may be safe. Focusing on fewer hours may help increase the quality of the restaurant, as it may help ensure that high-quality staff is already at the restaurant; having more shifts increases the chance of having to hire less qualified workers.
Location
It is interesting to see the importance of location. Hence we made a map in Tableau to show the relationship between the location, number of reviews, and rating.
[Maps with scale legend, one per city: Karlsruhe, Germany; Edinburgh, U.K.; Montreal, Canada; Waterloo, Canada; Pittsburgh, PA; Madison, WI; Urbana-Champaign, IL; Charlotte, NC; Phoenix, AZ; Las Vegas, NV]
As seen in the maps above, the distribution of highly rated restaurants seems to be independent of the centrality of the location for all the cities. There do, however, seem to be more high-end restaurants in the larger cities.
Restaurant Type
Another aspect, similar to the restaurant ethnicity, is the restaurant type. Below, you can see graphs and summary statistics generated using JMP 12.
We see that there are certain groups that seem to be underrepresented in the high rating category. Examples of these are fast food, caterer, and buffet. Amongst the ones that are relatively more represented in the highly rated category, we find Bakeries, Cafés, Delis, Coffee/Tea Houses, Food Trucks, and Tapas Bars. Again, a case of originality seems to occur, as we saw in the analysis of ethnicity.
Select Modeling Techniques
We elected to build multiple models in order to have a range of techniques and potential outcomes. This section provides the details on each model: why it was selected, how it was used, how it was built, and its results.
Model 1 – The Decision Tree
Our first model choice was a decision tree. Given the high number of potential independent variables in our data set, we needed a way to quickly identify the variables most useful in classifying each record into the highly rated restaurant bucket or non-highly rated restaurant bucket using our target variable. A decision tree seemed to be a logical choice. Decision trees offer a number of benefits in this sort of scenario:
1. They are easy to understand and visualize
2. They are easy to implement
3. They handle most any kind of data, so little pre-processing is required (missing value corrections, binning, correlation analysis, etc. generally aren’t needed)
4. Outliers generally aren’t a problem
Consequently, decision trees provide a quick way to explore data and determine which variables may be of interest in predictive modeling.
Model 1 – Data Splitting and Sub-sampling
Before building the model, we had to determine how the data was to be split and sampled within SPSS Modeler. Model 1 uses three data partitions.
• Training (used to build the model) – 60% of file
• Testing (used to evaluate the model on a different data sample) – 20% of file
• Validation (used to verify accuracy of the model on a third sample) – 20% of file
Our data set size of over 21,000 records allowed for the three partitions. The ratio of these splits should provide sufficient quantities to minimize variance in each. We used the default seed setting to ensure that our partition assignment was repeatable across iterations and models.
SPSS Modeler Partition Settings
These settings did a good job of randomly assigning target records to each partition. The screen capture below illustrates that the distribution of 0 and 1 values (High Rating = 1, Non-High Rating = 0) is roughly proportional in the Training, Testing, and Validation data sets.
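The partitioning idea can be illustrated outside SPSS Modeler with a short Python sketch. It assigns each record to one of the three partitions at random with a fixed seed and then compares the share of target = 1 records per partition. The seed value, the synthetic 30% target rate, and the function names are all assumptions made for the example; this is not how SPSS Modeler implements its Partition node.

```python
import random

LABELS = ("Training", "Testing", "Validation")


def assign_partitions(n_records, seed=1234567, shares=(0.6, 0.2, 0.2)):
    """Randomly label each record; a fixed seed makes the assignment repeatable."""
    rng = random.Random(seed)
    return rng.choices(LABELS, weights=shares, k=n_records)


def target_rate_by_partition(partitions, targets):
    """Share of records with target = 1 inside each partition."""
    totals, hits = {}, {}
    for part, y in zip(partitions, targets):
        totals[part] = totals.get(part, 0) + 1
        hits[part] = hits.get(part, 0) + y
    return {part: hits[part] / totals[part] for part in totals}


# Synthetic 0/1 target for 21,000 records (the 30% rate is made up).
target_rng = random.Random(0)
targets = [1 if target_rng.random() < 0.3 else 0 for _ in range(21000)]

partitions = assign_partitions(len(targets))
rates = target_rate_by_partition(partitions, targets)
for label in LABELS:
    print(label, partitions.count(label), round(rates[label], 3))
```

If the random assignment works as intended, the three printed rates should be close to one another, which is the same check the screen capture provides.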
Model 1 – Building the Model
The construction of our initial decision tree model was based on our goal of identifying the variables that are most important in classifying our target variable. With that in mind, our target variable was the target field itself.
Most potential classifier/predictive variables were fed into the model in an effort to screen for independent variables for other model types. The only fields that were excluded were those that had a direct tie to the target variable (e.g. the target variable was derived from ratings, so all variations of the ratings field were excluded).
Input Fields for the Decision Tree
With the input variables ready and partitions created, the next step was to select the appropriate type of decision tree to build. Past experience has shown that the decision tree variants within SPSS Modeler produce similar results. Even so, we decided to experiment with CART, QUEST, C5 and CHAID trees to determine which provided the best initial results. The screen capture below shows the resulting SPSS Modeler stream.
As we will see, the CART tree performed best on our data, so that’s where we will focus our build screen captures. For the final CART model we made two changes to the default settings in an attempt to enhance performance.
The first change was to enable boosting. This means that a series of trees are built to improve fitting.
The second change was to increase the tree depth in an attempt to bring in more variables that may be of importance in future model builds (e.g. predictive models).
Model 1 – Assessing the Model
Our primary metric in evaluating and assessing decision trees was the percentage of records accurately classified on the Validation data set. Generally speaking, all of our decision trees performed well. They all correctly classified our target variable around 66-68% of the time.
You can see from the following table that CART had the best performance at 68.46%.
Model    % Correct (Validation Data)
CART     68.46%
QUEST    66.50%
C5       68.37%
CHAID    67.66%
CART Results – Default Settings
The CART results with default settings are listed below. The tree performed consistently from Training to Testing to Validation, which means there was little overfitting. Additionally, 10 variables showed up as having the most predictive importance. Four of those, tot_cool, text_topic4, weekend_hours and text_topic2, stood out from the pack. These may be key variables to focus on with something like a logistic regression model.
The actual tree output and decision rules have been omitted since we were using this model only to identify the variables with the most predictive importance.
CART Results – Enhanced Settings
Running the same CART tree with boosting improved results a bit. The percentage accurately classified moved up to 70.25%. The list of variables with the most predictive importance looked very different, however. The top 10 fields are totally different and their predictive importance as assessed by the tree is much more evenly balanced.
Based on our results, we have two good decision tree models for classifying records based on our target variable. The question now becomes whether the variables identified can be used in a predictive model.
Model 2 – Logistic Regression
The second model builds on the output of the first. The original decision tree identified 4 variables that may be useful in a predictive model: tot_cool, text_topic4, weekend_hours and text_topic2. The goal of this model is to determine whether these fields can be used to predict our target variable (High Yelp rating).
Given that we have a binary target variable, a binary logistic regression model seems appropriate. Binary logistic regression models require that the dependent variable be binary (have only two possible values like 0/1 or True/False). Our target variable meets that criterion. Although logistic regression models appear similar to linear regression, they don’t rely on many of the assumptions that linear regression models do. In particular, logistic regression does not require the following:
• A linear relationship between the independent and dependent variables
• Normally distributed independent variables
• Normally distributed error terms
• Homoscedasticity
In addition, ordinal and nominal variables can be used as predictors.
These differences mean that the tests required for the linear regression models discussed in class do not apply to this modeling technique.
Model 2 – Data Splitting and Sub-sampling
This model will use the same data splitting and sub-sampling techniques described for Model 1. It will leverage a Training data set (60% of original file), Test data set (20% of original file), and Validation data set (20% of original file). The rationale for this decision is the same as for Model 1.
Model 2 – Building the Model
Construction of the logistic regression model is an outflow of the decision tree created for Model 1. The target variable will be the binary target field created to indicate whether a restaurant was rated highly. The independent predictor variables will include the variables that stood out in the original decision tree (tot_cool, text_topic4, weekend_hours and text_topic2).
The Logistic node was selected in IBM SPSS Modeler for this model. The resulting stream is shown below.
Logistic Regression Model Stream in IBM SPSS Modeler
The Enter method was leveraged for variable selection. Using this approach, all variables are entered in a single step. This makes sense in our scenario because we want to test the variables identified in the decision tree together.
The Model Evaluation setting was changed to calculate predictor importance. This will result in output that shows the predictive power of each model variable.
Aside from these selections, the default settings were used.
Model 2 – Assessing the Model
Our primary metric for evaluating this model is accuracy in predicting our target value of 1 in the Validation data set. As illustrated in the screenshot below, the model did not do a good job of prediction. The model correctly identified the target variable in the Validation data set only 39.44% of the time.
Pseudo R Square values confirm that the model was not fit well. McFadden Pseudo R Square values between .2 and .4 generally indicate that a model has an excellent fit. This model is much lower at .078.
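For readers unfamiliar with the statistic, McFadden's Pseudo R Square compares the log-likelihood of the fitted model with that of a null model that predicts the overall base rate for every record: 1 - LL(model) / LL(null). The Python sketch below computes it for made-up outcomes and predicted probabilities; the data and function names are illustrative assumptions and are unrelated to the SPSS Modeler output.

```python
import math


def log_likelihood(y_true, p_pred):
    """Bernoulli log-likelihood of 0/1 outcomes given predicted probabilities."""
    return sum(math.log(p) if y == 1 else math.log(1 - p)
               for y, p in zip(y_true, p_pred))


def mcfadden_r2(y_true, p_pred):
    """1 - LL(model) / LL(null); the null model predicts the base rate for everyone."""
    base_rate = sum(y_true) / len(y_true)
    ll_null = log_likelihood(y_true, [base_rate] * len(y_true))
    ll_model = log_likelihood(y_true, p_pred)
    return 1 - ll_model / ll_null


# Made-up outcomes with two sets of made-up predicted probabilities.
y = [1, 0, 0, 1, 0, 0, 1, 0]
weak = [0.45, 0.35, 0.40, 0.40, 0.35, 0.30, 0.45, 0.35]   # barely better than the base rate
strong = [0.9, 0.1, 0.2, 0.8, 0.1, 0.1, 0.9, 0.2]         # separates the classes well

weak_r2 = mcfadden_r2(y, weak)
strong_r2 = mcfadden_r2(y, strong)
```

Predictions that hover near the base rate, as in the weak example, produce a value close to zero, which is the situation the .078 figure describes.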
The independent variables, although already known, are shown below. Interestingly, the predictive importance was different between the decision tree and the logistic regression model. Tot_cool, the number of reviews classified as cool, remained at the top in both models, however.
The equation for this logistic regression model was:
Although the equation isn’t terribly predictive, it is interesting that the total cool ratings have a positive impact toward a high rating while weekend hours is slightly negative.
While the variables from our decision tree in Model 1 seemed to work well for classification, they did not perform well for prediction. We had to try different approaches to boost predictive performance.
Model 3 – Logistic Regression Part II
Our first logistic regression model was constructed using variables that looked promising from the decision tree in Model 1. Since that logistic regression model did not perform well in terms of predictive power, we decided to try logistic regression again. This time, the focus is on variables selected using our intuition and curiosity. For this model, more variables were selected. The idea was to let the model select those with the most predictive power.
Model 3 – Data Splitting and Sub-sampling
Once again, we used the same data splitting and sub-sampling methodology used in prior models: 60% Training, 20% Test, and 20% Validation.
Model 3 – Building the Model
For this model, the same target variable was used. The independent variables shifted to include 50 variables related to type of food, food specialties, total reviews, types of reviews, hours open, check-in times and days, and a range of text mining fields. For brevity, the fields are not listed here. The model assessment section highlights those selected by the model, however.
For this model, the variable selection method was set to Stepwise. Stepwise is a good method to use when you have a large number of potential independent variables and are unsure which may be best for modeling. It allows for multiple model iterations where variables are added and removed at each step until the best combination of variables has been selected.
Aside from this change, all settings remain the same as in the previous logistic regression model.
Model 3 – Assessing the Model
Using the same criteria to evaluate this logistic regression model, we see that it correctly predicted true values for the target variable only 39.74% of the time. This is a slight improvement over the previous model, but its predictive power is still weak.
The list of variables pulled into the model shows the variables with the most predictive importance.
A few interesting variables rise to the top: weekend hours, ethnicity, restaurant type, afternoon check-ins, touristy, good for breakfast and good for late night could all inform restaurant decision making to drive higher reviews. Unfortunately, their predictive performance is relatively low. Decision making based on the variables selected would be sketchy at best.
The regression equation for this model becomes extremely long, making it virtually unusable. For that reason, it has been omitted.
The McFadden Pseudo R Square value has improved but not above .2, where we could say the model is well fitted.
Model 4 – Fit Least Squares
To investigate the topics found in the text mining, we ran a least squares regression with the 20 topics as the variables using the JMP 12 software. The software picks the topics with the highest LogWorth (calculated as -log10(p-value)), that is, the lowest p-values, and then uses them to compute the best model.
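LogWorth is simply a rescaled p-value: -log10(p-value), so a smaller p-value gives a larger LogWorth, and a LogWorth above 2 corresponds to p < 0.01. The short Python sketch below ranks a few topics by LogWorth. The p-values are invented for the illustration and are not the values JMP reported.

```python
import math


def logworth(p_value):
    """LogWorth = -log10(p-value); larger means more significant."""
    return -math.log10(p_value)


# Hypothetical p-values for three of the topic variables (not JMP output).
p_values = {
    "text_topic2": 1e-40,
    "text_topic4": 1e-25,
    "text_topic19": 0.30,
}

ranked = sorted(p_values, key=lambda name: logworth(p_values[name]), reverse=True)
for name in ranked:
    print(name, round(logworth(p_values[name]), 2))
```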
Model 4 – Data Splitting and Sub-Sampling
There was no need to do any splitting, as JMP was able to run through all the variables and records without any splits or samples.
Model 4 – Building the Model
To build the model, we used the Fit Model function in JMP. With this model, we used stars as the Y variable to be predicted, and text_topic1-20 to construct model effects. The personality was Standard Least Squares.
Model 4 – Assessing the Model
In the output above, you can see the importance of the different topics. The R-square being as low as 0.22 shows how poorly this model is working, though. What can be taken from this model, however, is the LogWorth value for the different topics. We can see that text_topic2 and text_topic4 are the more important ones when analyzing the different topics, together with topics 18, 17, 6, and 12 in order of descending importance. It is interesting to note that text_topic2 and text_topic4 also stood out in our decision tree model.
If we look at the following groups, we can see that the most important things are manager, location, food, service, wine, dessert, staff, friendliness, time, and bread, salad and meat. So for newly opening restaurants, there is a great need to focus on these parts of the restaurant.
text_topic2*    "+customer,+know,+bad,+manager,+location"
text_topic4*    "+great,+great food,+great service,+service,+food"
text_topic6*    "+wine,+restaurant,+dish,+dessert,+meal"
text_topic18*   "always,+staff,+friendly,+love,+location"
text_topic17*   "+minute,+wait,+table,+wait,+order"
text_topic12*   "+sandwich,+bread,+lunch,+salad,+meat"
Model 5 - Text Profiling
To investigate the reviews and find which terms were most associated with the different stars, we chose to use SAS’ Text Profiler tool. The resulting output gives the most commonly occurring terms in the different star reviews.
Model 5 – Data Splitting and Sub-Sampling
The data was first reduced to a 5% sample to be able to handle the size of the data. Then the sample was split into three separate sections: training (20%), validation (50%), and testing (30%).
Model 5 – Building the Model
To build the model, the data was first sub-sampled into a 5% sample. Then the sample was run through a partition node to split the data into a 20-50-30 training, validation, testing split. Next was a text parsing node to extract the text to be used in the analysis, then a text filter node to filter out unnecessary terms, special signs, etc. Last, before the text profiling node, came a text topic node to create a set of categorical variables to be used in the text profiling.
Text Parsing settings:
Text Filter settings:
Text Topic settings:
Text Profile settings:
Text Profile output:
Model 5 – Assessing the Model
With the text profile, we can see that there are certain areas the customers seem to be more concerned about when rating. For the low rated restaurants, the terms seem to be focused on staff/service, mistakes like hair in the food, price, portion, and taste. For the better restaurants, the main terms found in the reviews seem to be more about owner, town, service, and great food.
This model represents very well how we can go about analyzing the YELP reviews. As it is hard to predict the rating based on any terms or other aspects, the best way seems to be through descriptive analytics, and finding the commonalities between the best reviews.
Model 5 Modification
When analyzing model 5, we did come across one problem: adjectives. Despite telling us about the content of the review, adjectives don’t give much knowledge in terms of specific parts to focus on when trying to make a restaurant successful. Hence, we kept only the nouns found in the reviews by ignoring all the other terms in the text parsing node. The following were the resulting terms the reviewers focused on:
Modified Model 5 Output:
In this model, we can see that the most important things to the reviewers seem to be staff/service, town, food, portion and price.
Model 6
Model 6 was built by using linear regression to predict the degree to which the nature of
Reviews and Tips influences ratings. The target variable for the model was “Stars,” which is
made up of the number of stars per each of the ratings.
Model 6 - Building the model
Before building the model, we assessed the numeric independent variables to determine which to include in the model. Based on the results of the statistical analysis, we excluded all the independent variables with a correlation value higher than 0.7 with other independent variables from the model.
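One simple way to apply a 0.7 correlation cut-off is a greedy filter: walk through the candidate variables in order and keep a variable only if its absolute Pearson correlation with every variable already kept is at most 0.7. The Python sketch below shows this on made-up columns. It is an illustrative assumption about how such a screen could be coded, not the procedure used for Model 6, and a greedy filter keeps one variable from each correlated pair rather than dropping both.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def drop_correlated(columns, threshold=0.7):
    """Keep a variable only if |r| <= threshold with every variable already kept."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Made-up values; only the variable names come from this report.
columns = {
    "cool_pct": [10, 20, 30, 40, 50],
    "tot_cool": [11, 19, 32, 41, 48],   # nearly a copy of cool_pct
    "tot_tips": [3, 9, 1, 7, 4],
}
kept = drop_correlated(columns)
```

Because the result depends on the order in which variables are visited, a real screen would typically decide which member of each correlated pair to keep based on its relationship with the target.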
Correlation between Independent variables
Correlation between dependent variable and independent variables
This left us with 5 input variables which were included in the final model:
Model 6 - Assessing the Model
The basic results show the Percentage of total reviews voted “Cool” to have the greatest predictor importance on Ratings, followed by the Percentage of total reviews voted “Funny”. The variables tot_tips and tot_tip_likes had the same degree of importance, which was not very significant. It was interesting to discover that the Percentage of reviews voted “Useful” had a predictor importance of zero, though it had a strong correlation with the Target variable.
The regression equation: Stars = 3.407 - 0.01232 funny_pct - 0.0012 useful_pct + 0.013202 cool_pct + 0.002287 tot_tips + 0.02388
The results of the regression are presented in the following screenshots:
The adjusted R squared value of 0.108 means that the model does not do a very good job of
explaining variation in the dependent variable. Looking at the F value and t values, it seems that
the independent variables selected for the model do have some limited ability to explain
variation in the dependent variable.
Discussion
From the models we have tried to create, there seems to be great difficulty in actually predicting the review that a customer is going to give. This is natural, as people are very diverse and focus on different things. No two people are going to think the exact same thing about a place. There is still some support for saying certain factors may help improve the chances of satisfied customers.
In model 1, we saw that the most important factors were total cool reviews, text topic 4 (food and service), weekend hours, and text topic 2 (customer, manager, and location). This is similar to what we found in models 4 and 5 in terms of text topics, and similar to models 2, 3, and 6 in terms of the importance of weekend hours and total cool reviews.
Though the number of cool reviews may not explain a lot to us about what to focus on when making a successful restaurant, the fact that weekend hours seems to be so important is of interest. As seen in the plot below, there does seem to be a trend similar to that which we saw in the descriptive analytics part: fewer hours = more stars. The reason may be hard to explain without further investigation and data from the businesses, but a possible reason may be as explained in the descriptive analysis section: fewer shifts may help ensure a high-quality staff at all times.
The suggestion about the staff does seem to hold up in the other models too. When looking at topic 4 and model 5, the main two things people seem to be concerned about are in fact the staff/service and food. The argument that fewer shifts help improve the quality is hence also supported by those models (we must not forget that food is as closely connected to people as service, as it is the chefs preparing the food that determine how good the food tastes).
Conclusion
From the above models, we can see that the data given from YELP does not work very well with predictive models. Hence, the better way to go about analyzing the reviews seems to be through text analytics and grouping. Through the Text Profiler, we found that the most important terms seem to be food and service. In terms of service, we actually see that people use words like love, good service, hair, bug, and care. In other words, if restaurants focus on the quality of their staff, cleanliness, and quality food, they will most likely succeed in the business. We also found in the analysis that the one thing restaurants may need to do is to reduce their hours. This may help resolve a lot of quality issues, and may in turn help increase the rating of the restaurant.

More Related Content

Similar to YELP Data Set Challenge

Yelp Rating Prediction
Yelp Rating PredictionYelp Rating Prediction
Yelp Rating Prediction
Kartik Lunkad
 
Nidos-Making a good funding application to the scottish government
Nidos-Making a good funding application to the scottish governmentNidos-Making a good funding application to the scottish government
Nidos-Making a good funding application to the scottish government
Scotland Malawi Partnership
 
Rehearsal Script Page 1 Introduction Lets get down t.docx
Rehearsal Script Page 1  Introduction Lets get down t.docxRehearsal Script Page 1  Introduction Lets get down t.docx
Rehearsal Script Page 1 Introduction Lets get down t.docx
debishakespeare
 
1. ACT is sometimes referred to as a Hospital without walls. W.docx
1. ACT is sometimes referred to as a Hospital without walls. W.docx1. ACT is sometimes referred to as a Hospital without walls. W.docx
1. ACT is sometimes referred to as a Hospital without walls. W.docx
keturahhazelhurst
 
1. ACT is sometimes referred to as a Hospital without walls. W.docx
1. ACT is sometimes referred to as a Hospital without walls. W.docx1. ACT is sometimes referred to as a Hospital without walls. W.docx

YELP Data Set Challenge

  • 1. Yelp Dataset Challenge, MSIS 5633, Deliverable 2, 25 NOV 2015. James Lynn (CWID11644030), Yolande Mbah Mbole (CWID11696431), Vegard Oelstad (CWID11681522)
  • 2. YELP Dataset Challenge, 2nd deliverable, Lynn, Mbole, Oelstad
Executive Summary
Yelp is a web-based company providing crowd-sourced reviews of local businesses via Yelp.com. Its stated goal is to connect people with great local businesses. In recent years, Yelp has made subsets of its data available to the public to promote innovative uses of data and groundbreaking research.
The goal of our project is to leverage this Yelp data to create a classification scheme utilizing Ratings and Price information. The analysis should provide insights into what makes some restaurants earn top rankings while others fall short. Obviously, consumers expect high quality in terms of service, food, ambiance, etc. The question is which dimensions are more important. Can a restaurant fall short in some areas and still be rated highly?
Our project could benefit those looking to open a new restaurant by identifying key areas to focus on. It could also help educate inexperienced restaurateurs on customer expectations and what it takes to succeed in terms of ratings and customer perception. Every advantage can help when you consider that a study by Cornell University and Michigan State University researchers found that 27% of restaurant startups failed after the first year. Chef Robert Irvine of TV's Restaurant Impossible cited inexperience as the primary reason most restaurants fail.
The one factor the analysis identified for improving a restaurant's rating is its opening hours. Although longer opening hours may increase revenue, shorter hours help increase the rating of the place. This, together with the fact that the majority of the reviews are concerned with food and service, suggests that managers may consider reducing hours to increase their ratings, which in turn will help bring in more customers and more revenue.
Project Schedule, Duration and Estimates
Initial Project Timeline
YELP DATASET CHALLENGE ANALYSIS TIMELINE
[Gantt chart: weekly axis from 9/7 to 12/14]
Task | Lead | Est. Duration (days) | Start Date | End Date
Kick Off Meeting | Team | 1 | 9/2/15 | 9/2/15
Prepare project proposal | Team | 7 | 9/6/15 | 9/12/15
Submit project proposal | Team | 1 | 9/13/15 | 9/13/15
Define data requirements for analysis | Team | 5 | 9/13/15 | 9/18/15
Data consolidation | Team | 27 | 9/18/15 | 10/15/15
Data cleaning | Team | 27 | 9/18/15 | 10/15/15
Data reduction | Team | 27 | 9/18/15 | 10/15/15
Prepare first deliverable | Team | 3 | 10/15/15 | 10/17/15
Submit first deliverable | Team | 1 | 10/18/15 | 10/18/15
Build models | Team | 10 | 10/19/15 | 10/30/15
Analyze models | Team | 24 | 11/1/15 | 11/24/15
Prepare second deliverable | Team | 3 | 11/25/15 | 11/28/15
Submit second deliverable | Team | 1 | 11/29/15 | 11/29/15
Prepare report and presentation | Team | 11 | 11/30/15 | 12/10/15
Submit final deliverable | Team | 1 | 12/11/15 | 12/11/15
  • 3. Final Project Timeline
Comparing our initial timeline with the final one, we initially planned to do the data reduction before submitting the first deliverable but were only able to do so after submitting the deliverable because we spent more time than expected on the data cleaning and consolidation. We also included the duration of the Data Transformation in our updated timeline. We met almost every week, but only the major meetings are included in our final timeline. Another major difference between our planned and actual schedule is that we spent more time on data transformation than planned. As a result, we had to use some of the time we planned to spend on building and analyzing our models on the data transformation. It worked out well and we were able to complete the project on time.
YELP DATASET CHALLENGE ANALYSIS TIMELINE
[Gantt chart: weekly axis from 9/7 to 12/14]
Task | Lead | Est. Duration (days) | Start Date | End Date
Kick Off Meeting | Team | 1 | 9/2/15 | 9/2/15
Prepare project proposal | Team | 7 | 9/6/15 | 9/12/15
Submit project proposal | Team | 1 | 9/13/15 | 9/13/15
** Major Group meeting | Team | 1 | 9/14/15 | 9/14/15
Define data requirements for analysis | Team | 4 | 9/15/15 | 9/18/15
Data cleaning and data consolidation | Team | 27 | 9/18/15 | 10/15/15
Prepare first deliverable | Team | 3 | 10/15/15 | 10/17/15
Submit first deliverable | Team | 1 | 10/18/15 | 10/18/15
** Major Group meeting | Team | 1 | 10/19/15 | 10/19/15
Data Transformation | Team | 18 | 10/20/15 | 11/7/15
Data Reduction | Team | 6 | 11/8/15 | 11/14/15
** Major Group meeting | Team | 1 | 11/15/15 | 11/15/15
Build models | Team | 5 | 11/16/15 | 11/20/15
Analyze models and start preparing 2nd deliverable | Team | 3 | 11/21/15 | 11/23/15
** Major Group meeting | Team | 1 | 11/23/15 | 11/23/15
Finalize second deliverable | Team | 1 | 11/24/15 | 11/24/15
Submit second deliverable | Team | 1 | 11/25/15 | 11/25/15
** Major Group meeting | Team | 1 | 11/26/15 | 11/26/15
Prepare report and presentation | Team | 10 | 11/27/15 | 12/6/15
Submit final deliverable | Team | 1 | 12/7/15 | 12/7/15
Work Breakdown Structure
YELP Data Mining Project
First Deliverable
- Define data requirements for analysis
- Data cleaning and consolidation
Second Deliverable
- Data Transformation
- Data reduction
- Building and analyzing models
Final Deliverable
- Report
- Final Presentation
Project Proposal
  • 4. Statement of Scope
Project Objective
The objective of our analysis is to uncover the factors most important in categorizing a Yelp restaurant into a high review category (4, 4.5, or 5 star rating).
Target Variable
- TARGET: this target variable is a binary field with values of 0 or 1. It is created by assigning a value of 1 to restaurants within the High review category. All other restaurants are assigned a value of 0.
Predictor Variables
Our initial file included over 100 possible predictor variables. To limit the scope, we started with the variables below and used a decision tree to identify the most important variables in determining the desired outcome. In addition, we selected a few additional variables based on our intuition and curiosity to see how well they performed in terms of classification and prediction. The bolded variables are those actually selected for use in our models.
- Ethnicity: type of food (e.g. Italian, Mexican, etc.)
- Neighborhood Flag: binary variable to indicate whether neighborhoods were listed; could be an indicator of trendy locations
- Review Count: number of Yelp reviews
- Good for Kids: whether restaurant is good for kids
- Alcohol: full bar, beer and wine, none, etc.
- Noise Level: loud, very loud, average, etc.
- Attire: dressy, casual, etc.
- Coat Check: True, False
- Romantic: True, False
- Classy: True, False
- Intimate: True, False
- Hipster: True, False
- Divey: True, False
- Touristy: True, False
- Trendy: True, False
- Upscale: True, False
- Casual: True, False
- Good for Dessert: True, False
- Good for Late Night: True, False
- Good for Lunch: True, False
- Good for Dinner: True, False
- Good for Breakfast: True, False
- Good for Brunch: True, False
- Live Music: True, False
  • 5. Predictor Variables (continued)
- Dairy Free: True, False
- Gluten Free: True, False
- Vegan: True, False
- Vegetarian: True, False
- Wi-Fi: True, False
- Takes Reservations: True, False
- Smoking: Yes, No, Outdoor
- Hours Open: open/close time broken out by day of week
- Text Topics 1-20: themes identified through text mining
- Total Reviews voted as cool
- Total hours open on weekends
- Total Tips
- Total Likes of Tips
- Percentage of reviews voted Funny
- Percentage of reviews voted Useful
- Percentage of reviews voted Cool
People Benefitting from the Analysis
The primary beneficiaries of this analysis will be restaurant owners and operators. They will receive insights into the most important dimensions of a highly rated restaurant.
Consumers may also benefit. When restaurants aren't rated or when they have few reviews, the criteria may help them determine whether or not to take a chance on a restaurant.
Yelp and advertisers may also benefit. They can use the information from the analysis to approach businesses in a more consultative fashion by providing offerings and recommendations that help restaurants improve key areas of weakness or consumer perceptions in those areas.
Companies who help restaurants could benefit. Perhaps a restaurant scores low for ambiance. Companies specializing in remodeling or interior design could approach these restaurants with proposals or ideas on how improvements could be made.
Finally, job seekers may benefit. The results of the analysis would give them clues on the major values and characteristics that distinguish one restaurant from another. They would then be able to make a better choice of the restaurant they want to work for based on the attributes they value most.
Constraints and Limitations
There are a number of possible constraints associated with this project.
1. Small sample size of highly rated, expensive restaurants - While there are over 6,000 restaurants in the data set rated as a 4, 4.5, or 5, there are only about 175 with those ratings that also fall into the most expensive category (price range of 4). Given that fact, we adjusted our original project idea of investigating why expensive restaurants receive low ratings to something broader. We are now looking to predict high restaurant ratings irrespective of price.
  • 6. 2. Format of the data - There are several data fields that include nuggets of information that are not easily accessible without text mining. Even with text mining, over 400 concepts emerge. These concepts must be combined into themes. This is a time-consuming and inexact process.
3. Samples - The samples we are using are from a few U.S. cities: Pittsburgh, Charlotte, Urbana-Champaign, Phoenix, Las Vegas, and Madison. The samples may not be representative of the U.S. as a whole.
4. Timing - As of the time this paper was written, we have received no formal feedback on our original project proposal. Should changes be required, we will have less time to adapt.
5. Expertise - A good data science team is comprised of individuals with expertise in several disciplines: statistics/math, computer science, and the business domain. Our group lacks anyone with an in-depth statistics/math background.
Project Costs
The project team associated with this analysis consists of 3 senior data analysts. We estimate the time required to be 50 hours per analyst (150 hours total). At a rate of $250 per hour, we estimate the total project cost to be $37,500. This estimate does not take into account the opportunity cost of other projects that are not undertaken. Since we are using free analysis software and there are no data charges, the intangible costs are negligible.
Feasibility and Risk Assessment
Despite our team's shortcomings in the realm of statistics, we felt our project was feasible based on the training we have received in MSIS 5633. We felt the biggest challenge facing us was the conversion of JSON files to a format easily readable by SPSS Modeler. The rest of the project was less daunting.
Timing and resource availability was one challenge we faced. With a distance learning student and a student athlete on the team, scheduling meetings was sometimes difficult. We were able to overcome the challenge by scheduling regular meetings on Google Hangouts and maintaining ongoing, open communication via email.
We were fortunate to have a robust data set from Yelp. The data set permitted us to easily adjust or modify our sample and the specific data to be used in the project. We also had the necessary programs to perform our analysis, with each team member having access to Excel, JMP, R, SAS, SPSS Modeler and Tableau. These tools, combined with training on key data mining and analysis techniques from MSIS 5633, gave us what we needed to achieve our project goals.
Implementing the Plan / Measuring Results
To implement our plan, we would identify startup restaurants in the cities our sample was based on (Pittsburgh, Charlotte, Urbana-Champaign, Phoenix, Las Vegas, and Madison) and present our ideas to them.
Our analytic program will be successful if we are able to determine whether there are factors in the Yelp data set that accurately identify what most contributes to an expensive restaurant having a poor rating. If we discover that none of the factors present predict a low rating, that is an interesting insight that may be of value to Yelp. If we discover there are factors that may result in low ratings, that will be of interest to Yelp, restaurant owners, and possibly diners.
  • 7. Beyond our analysis, we would like to succeed by helping struggling restaurants. By leveraging our insights, they could improve the number of customer visits as well as their reviews. If the number of customers significantly increases alongside high ratings, our analysis has done more than succeed.
Our potential clients would be mainly startup restaurants, as well as restaurants with really low ratings (1 or 2 stars). We could present our findings at a range of industry events like the National Restaurant Association Conference, the Restaurant Finance & Development Conference, or something more interesting like the TV show Restaurant Impossible.
Beyond that, we would present our model to customers who may have a vested interest in helping struggling restaurants turn their businesses around. This could include chefs who help with menu selections, interior designers who could improve the look, musicians who could improve the ambience, etc.
Scope Proposal
The scope of this project was limited to U.S. restaurants in the Yelp Dataset Challenge data. We focused on identifying the factors common to highly rated restaurants within this group that are not present in restaurants with lower ratings.
Data Dictionary
Our data dictionary is extensive given the number of variables provided by Yelp and the number of derived fields we created. We elected to maintain a large data dictionary to illustrate the breadth of data we had available and the new fields we created. We also used variable screening methods that leveraged a large number of variables to identify those useful to our model.
Yelp Data Set Challenge Master Data Dictionary
Variable | Description | Type | Length | Format | Informat
Ages_Allowed | Describes ages allowed in restaurant (e.g. 19plus). | Char | 7 | $CHAR7. | $CHAR7.
Alcohol | Describes if/how alcohol is served (e.g. full bar, beer and wine, etc.). | Char | 13 | $CHAR13. | $CHAR13.
Attire | Describes appropriate dress for restaurant (e.g. dressy, casual). | Char | 6 | $CHAR6. | $CHAR6.
BYOB | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
BYOB_Corkage | Field identifies whether attribute is True, False, or NA. | Char | 11 | $CHAR11. | $CHAR11.
Caters | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Coat_Check | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Corkage | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Credit_Cards | Field identifies whether attribute is True, False, or NA. | Char | 6 | $CHAR6. | $CHAR6.
  • 8. Delivery | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Dogs_Allowed | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Drive_Thru | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Friday_close | Close time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
Friday_open | Open time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
Good_For_Dancing | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Good_For_Groups | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Good_For_Kids2 | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Good_For_breakfast | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Good_For_brunch | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Good_For_dessert | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Good_For_dinner | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Good_For_latenight | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Good_For_lunch | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Good_for_Kids | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Happy_Hour | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Has_TV | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Monday_close | Close time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
Monday_open | Open time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
Music_dj | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Music_jukebox | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Music_karaoke | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Music_live | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Music_playlist | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Music_video | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Noise_Level | Describes noise level (e.g. average, quiet, loud). | Char | 9 | $CHAR9. | $CHAR9.
  • 9. Open_24_Hrs | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Order_at_Counter | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Outdoor_Seating | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Parking_garage | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Parking_lot | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Parking_street | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Parking_valet | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Parking_validated | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Payment_amex | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Payment_cash_only | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Payment_discover | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Payment_mastercard | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Payment_visa | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Saturday_close | Close time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
Saturday_open | Open time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
Smoking | Describes if/where smoking is permitted (e.g. no, outdoor). | Char | 7 | $CHAR7. | $CHAR7.
Sunday_close | Close time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
Sunday_open | Open time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
Take_out | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Takes_Reservations | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Thursday_close | Close time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
Thursday_open | Open time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
Tuesday_close | Close time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
Tuesday_open | Open time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
Waiter_Service | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Wednesday_close | Close time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
Wednesday_open | Open time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
  • 10. Wheelchair_Accessible | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Wi_Fi | Describes wi-fi availability and cost (e.g. no, free). | Char | 4 | $CHAR4. | $CHAR4.
afternoon_check-ins* | Derived from check-ins file. Sum of afternoon check-ins from 11AM to 3PM. | Num | 8
avgstars_review_file* | Derived from reviews file. Average ratings on rating file for a restaurant. | Num | 8
background_music | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
business_id | Unique identifier for individual restaurants. Also the primary key. | Char | 22 | $CHAR22. | $CHAR22.
casual | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
categories | Catchall field from Yelp that includes restaurant type, foods, etc. | Char | 199 | $CHAR199. | $CHAR199.
city | City where restaurant is located. | Char | 35 | $CHAR35. | $CHAR35.
classy | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
cool_pct* | Derived from reviews file. Percent of total reviews that were voted cool. | Num | 8
dairy_free | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
divey | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
ethnicity* | Derived from restaurants file. Text mining done to create flags for food type. | Char | 25
evening_check-ins* | Derived from check-ins file. Sum of evening check-ins from 6PM to 11PM. | Num | 8
frihours* | Derived from open and close times. Number of hours open this day. | Num | 8
full_address | Full physical address of restaurant. | Char | 110 | $CHAR110. | $CHAR110.
fullweek_hours* | Derived from open and close times. Number of hours open for the week. | Num | 8
funny_pct* | Derived from reviews file. Percent of total reviews that were voted funny. | Num | 8
gluten_free | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
halal | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
hipster | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
intimate | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
kosher | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
lateafternoon_check-ins* | Derived from check-ins file. Sum of check-ins from 3PM to 6PM. | Num | 8
  • 11. latenight_check-ins* | Derived from check-ins file. Sum of check-ins from 11PM to 5AM. | Num | 8
latitude | Latitude of restaurant. | Num | 8 | BEST16. | BEST16.
longitude | Longitude of restaurant. | Num | 8 | BEST17. | BEST17.
monhours* | Derived from open and close times. Number of hours open this day. | Num | 8
morning_check-ins* | Derived from check-ins file. Sum of morning check-ins from 5AM to 11AM. | Num | 8
name | Name of restaurant. | Char | 61 | $CHAR61. | $CHAR61.
neighborhoods | Neighborhood restaurant is located in. | Char | 52 | $CHAR52. | $CHAR52.
open | Whether the restaurant is still in business (True or False). | Char | 5 | $CHAR5. | $CHAR5.
pct_likes_of_tips* | Derived from Tips file. Percentage of tips that were liked by other users. | Num | 8
price_range | 1 to 4 with 4 being the most expensive. | Char | 2 | $7,00 | $CHAR2.
rating* | Derived from Stars field. Low (1-2), Medium (2.5-3.5), High (4-5). | Char | 3 | $3,00
restaurant_type* | Derived from text mining categories field. Type of restaurant (e.g. Bar, Pub, Fast Food). | Char | 25
review_count | Total number of reviews for restaurant as reported on Yelp business file. | Num | 8 | BEST4. | BEST4.
romantic | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
sathours* | Derived from open and close times. Number of hours open this day. | Num | 8
soy_free | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
stars | Overall rating of restaurant. | Num | 8 | BEST3. | BEST3.
state | State where restaurant is located. | Char | 3 | $CHAR3. | $CHAR3.
sunhours* | Derived from open and close times. Number of hours open this day. | Num | 8
target* | Derived dependent variable. 1 when restaurant has High rating. Zero otherwise. | Num | 8
thurshours* | Derived from open and close times. Number of hours open this day. | Num | 8
tot_check-ins* | Derived from check-ins file. Total number of check-ins for restaurant. | Num | 8
tot_cool* | Derived from reviews file. Total number of reviews voted cool. | Num | 8
tot_funny* | Derived from reviews file. Total number of reviews voted funny. | Num | 8
tot_reviews* | Derived from reviews file. Total number of reviews for restaurant. | Num | 8
  • 12. tot_tip_likes* | Derived from tips file. Total number of likes for all tips for a restaurant. | Num | 8
tot_tips* | Derived from tips file. Total number of tips for restaurant. | Num | 8
tot_useful* | Derived from reviews file. Total number of reviews voted useful. | Num | 8
touristy | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
trendy | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
tueshours* | Derived from open and close times. Number of hours open this day. | Num | 8
type | Type of record (e.g. business, review, tip, etc.) | Char | 8 | $CHAR8. | $CHAR8.
upscale | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
useful_pct* | Derived field. Percent of total reviews that were voted useful. | Num | 8
vegan | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
vegetarian | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
wedhours* | Derived from open and close times. Number of hours open this day. | Num | 8
weekday_afternoon_check-ins* | Derived from check-ins file. Sum of weekday afternoon check-ins from 11AM to 3PM. | Num | 8
weekday_evening_check-ins* | Derived from check-ins file. Sum of weekday evening check-ins from 6PM to 11PM. | Num | 8
weekday_hours* | Derived from open and close times. Sum of hours open Monday-Friday. | Num | 8
weekday_lateafternoon_check-ins* | Derived from check-ins file. Sum of weekday check-ins from 3PM to 6PM. | Num | 8
weekday_latenight_check-ins* | Derived from check-ins file. Sum of weekday check-ins from 11PM to 5AM. | Num | 8
weekday_morn_check-ins* | Derived from check-ins file. Sum of weekday morning check-ins from 5AM to 11AM. | Num | 8
weekend_afternoon_check-ins* | Derived from check-ins file. Sum of weekend afternoon check-ins from 11AM to 3PM. | Num | 8
weekend_evening_check-ins* | Derived from check-ins file. Sum of weekend evening check-ins from 6PM to 11PM. | Num | 8
weekend_hours* | Derived from open and close times. Sum of hours open Saturday-Sunday. | Num | 8
weekend_lateafternoon_check-ins* | Derived from check-ins file. Sum of weekend check-ins from 3PM to 6PM. | Num | 8
weekend_latenight_check-ins* | Derived from check-ins file. Sum of weekend check-ins from 11PM to 6AM. | Num | 8
  • 13. weekend_morn_check-ins* | Derived from check-ins file. Sum of weekend morning check-ins from 5AM to 11AM. | Num | 8
budget_tm* | Derived from text mining tips file. Concepts related to money. 0=False, 1=True | Num | 8
drinks_tm* | Derived from text mining tips file. Concepts related to drinks in general, e.g. beer, juice, water, tea, shakes. 0=False, 1=True | Num | 8
food_tm* | Derived from text mining tips file. Concepts related to food, ingredients, vegetables, fruits, dessert. 0=False, 1=True | Num | 8
hours_tm* | Derived from text mining tips file. Concepts related to days, dates, time, open, closed, etc. 0=False, 1=True | Num | 8
location_tm* | Derived from text mining tips file. Concepts related to location and ambiance of the location, e.g. seats, doors, kitchen, Arizona. 0=False, 1=True | Num | 8
negative_tm* | Derived from text mining tips file. Concepts related to negative feelings, e.g. rude, dirty. 0=False, 1=True | Num | 8
people_tm* | Derived from text mining tips file. Concepts related to individuals, e.g. family, friends, kids, wife. 0=False, 1=True | Num | 8
positive_tm* | Derived from text mining tips file. Concepts which were generally related to positive feelings, e.g. clean, crispy. 0=False, 1=True | Num | 8
service_tm* | Derived from text mining tips file. Concepts related to how the service is viewed, e.g. waitress, manager, wait time. 0=False, 1=True | Num | 8
neighborhood_flg* | Derived from neighborhood field. 1 if neighborhood was listed, 0 if not. | Num | 8
text_topic1* | Derived from text mining reviews. Concepts related to: "+taco,+salsa,+chip,+burrito,mexican" | Num | 8
text_topic2* | Derived from text mining reviews. Concepts related to: "+customer,+know,+bad,+manager,+location" | Num | 8
text_topic3* | Derived from text mining reviews. Concepts related to: "+pizza,+crust,+slice,+cheese,+thin" | Num | 8
  • 14. text_topic4* | Derived from text mining reviews. Concepts related to: "+great,+great food,+great service,+service,+food" | Num | 8
text_topic5* | Derived from text mining reviews. Concepts related to: "+burger,fries,+fry,+bun,+onion" | Num | 8
text_topic6* | Derived from text mining reviews. Concepts related to: "+wine,+restaurant,+dish,+dessert,+meal" | Num | 8
text_topic7* | Derived from text mining reviews. Concepts related to: "+sushi,+roll,+fish,+tuna,+roll" | Num | 8
text_topic8* | Derived from text mining reviews. Concepts related to: "+breakfast,+egg,+coffee,+toast,+pancake" | Num | 8
text_topic9* | Derived from text mining reviews. Concepts related to: "+thai,+rice,+dish,+noodle,thai" | Num | 8
text_topic10* | Derived from text mining reviews. Concepts related to: "+buffet,+crab,+dessert,+leg,+selection" | Num | 8
text_topic11* | Derived from text mining reviews. Concepts related to: "+beer,+bar,+selection,+drink,+night" | Num | 8
text_topic12* | Derived from text mining reviews. Concepts related to: "+sandwich,+bread,+lunch,+salad,+meat" | Num | 8
text_topic13* | Derived from text mining reviews. Concepts related to: "+hour,+happy,+happy hour,+drink,+special" | Num | 8
text_topic14* | Derived from text mining reviews. Concepts related to: "+price,+steak,+good,good,+portion" | Num | 8
text_topic15* | Derived from text mining reviews. Concepts related to: "de,est,le,à,+pour" | Num | 8
text_topic16* | Derived from text mining reviews. Concepts related to: "+steak,+rib,+chicken,bbq,+sauce" | Num | 8
text_topic17* | Derived from text mining reviews. Concepts related to: "+minute,+wait,+table,+wait,+order" | Num | 8
text_topic18* | Derived from text mining reviews. Concepts related to: "always,+staff,+friendly,+love,+location" | Num | 8
text_topic19* | Derived from text mining reviews. Concepts related to: "+time,first,+first time,vegas,+love" | Num | 8
  • 15. text_topic20* | Derived from text mining reviews. Concepts related to: "+salad,+lunch,+chicken,always,+special" | Num | 8
* Denotes that this is a derived or calculated field.
Data Access
Our data was downloaded from the Yelp Dataset Challenge webpage. The URL for that page is http://www.yelp.com/dataset_challenge. Click on the 'Get the Data' button and complete a form to download. The data includes information on the businesses that have been reviewed, the reviews, the users/reviewers, user check-ins, and user-provided tips. Yelp defines the data as follows:
The Challenge Dataset:
- 1.6M reviews and 500K tips by 366K users for 61K businesses
- 481K business attributes, e.g., hours, parking availability, ambience
- Social network of 366K users for a total of 2.9M social edges
- Aggregated check-ins over time for each of the 61K businesses
Cities:
- U.K.: Edinburgh
- Germany: Karlsruhe
- Canada: Montreal and Waterloo
- U.S.: Pittsburgh, Charlotte, Urbana-Champaign, Phoenix, Las Vegas, Madison
From the data, we focused only on records associated with restaurants. The process of consolidating and cleaning the data is outlined in the sections that follow.
Data Consolidation
Yelp provided the data in 5 files. Descriptions of each file are included below.
File Name | Description | File Format | Size | Number of Records
yelp_academic_dataset_business | List of reviewed businesses | JSON | 54MB | 61,181
yelp_academic_dataset_review | Review information on businesses | JSON | 1.39GB | 1,569,264
yelp_academic_dataset_user | Information on Yelp users/reviewers | JSON | 162MB | 366,715
yelp_academic_dataset_checkin | Information on check-ins at businesses | JSON | 20MB | 45,166
yelp_academic_dataset_tip | Tips for each business | JSON | 96MB | 495,107
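The record counts in the file table can be checked with a few lines of code. The following is a minimal Python sketch, not the method the team used; it assumes each Yelp file stores one JSON object per line, and the helper name count_records and the tiny demo file are hypothetical stand-ins for the real download.

```python
import json
import os
import tempfile


def count_records(path):
    """Count non-empty lines in a line-delimited JSON file, parsing each one."""
    n = 0
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                json.loads(line)  # raises ValueError on a malformed record
                n += 1
    return n


# Hypothetical two-record stand-in so the sketch runs without the real files.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_business.json")
with open(demo_path, "w", encoding="utf-8") as fh:
    fh.write('{"business_id": "a1", "type": "business"}\n')
    fh.write('{"business_id": "b2", "type": "business"}\n')

demo_count = count_records(demo_path)
print(demo_count)
```

Pointing count_records at each downloaded file would give numbers to compare against the Number of Records column.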
A lot of data cleansing and manipulation had to be done to consolidate the data into a single dataset for modeling purposes. In order to get to a single dataset, we went through a 6-step process.

1. Identify restaurants on the business file
2. Create a subset of the business file that only includes restaurants
3. Create subsets of the reviews, check-ins, and tips files
4. Summarize data from the review, check-in, and tips files (e.g. sum the number of check-ins/tips/reviews for each restaurant) and create a file for the summarized data containing only business ID and summary fields that can be appended back to the restaurants file
5. Text mine key text fields in the review and tips files to create content category flags for each restaurant
6. Merge the summary tables back to the restaurant/business file that will serve as the final modeling dataset

Here is a sample of the SQL code used to merge the individual files back to the master.

    proc sql;
      create table yelp.yelp_restaurant_reviews as
      select a.*, b.rating, b.stars as avg_star_rating
      from yelp.yelp_restaurant_reviews a
      left join yelp.yelp_restaurants b
        on a.business_id = b.business_id;
    quit;

Data Cleaning

The data cleaning process was extensive and time consuming with the Yelp data. The JSON data required extensive formatting, and some Yelp data fields combine somewhat unrelated data into a single field.

To convert the JSON fields into a more usable tab-delimited text format, we used the jsonlite R package and the following commands for each file. The file names were changed for each run to match the file being processed.
    library(jsonlite)                                # load the jsonlite library
    yelp <- "yelp_academic_dataset_review.json"      # assign file name to the yelp variable
    reviews <- stream_in(file(yelp))                 # read in the file (one JSON record per line)
    reviews <- flatten(reviews, recursive = TRUE)    # flatten nested JSON columns
    reviews$text <- gsub('\n', ' ', reviews$text)    # strip line feeds from the text field
    reviews$text <- gsub('\r', ' ', reviews$text)    # strip carriage returns from the text field
    # create a data frame that works with write.table
    reviews <- data.frame(lapply(reviews, as.character), stringsAsFactors = FALSE)
    # write out the data frame as a tab-delimited text file
    write.table(reviews, "yelp_reviews.txt", sep = "\t", row.names = FALSE)
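Steps 4 and 6 of the consolidation process (summarize per restaurant, then merge the summaries back) could also be sketched in base R. This is a minimal illustration on made-up toy data, not the code used for the project; the helper `count_by_business` and the column `tot_reviews` are hypothetical names, while `business_id` and `tot_tips` follow the field names used in this report.

```r
# Hypothetical toy data standing in for the restaurant, review and tip files.
restaurants <- data.frame(
  business_id = c("b1", "b2", "b3"),
  stars = c(4.5, 3.0, 2.5),
  stringsAsFactors = FALSE
)
reviews <- data.frame(business_id = c("b1", "b1", "b2"), stringsAsFactors = FALSE)
tips <- data.frame(business_id = c("b1", "b3", "b3"), stringsAsFactors = FALSE)

# Count rows per business_id and return a two-column summary table.
count_by_business <- function(df, count_name) {
  counts <- as.data.frame(table(business_id = df$business_id),
                          stringsAsFactors = FALSE)
  names(counts)[names(counts) == "Freq"] <- count_name
  counts
}

tot_reviews <- count_by_business(reviews, "tot_reviews")
tot_tips <- count_by_business(tips, "tot_tips")

# Left-join each summary onto the restaurant table (all.x keeps every restaurant).
modeling <- merge(restaurants, tot_reviews, by = "business_id", all.x = TRUE)
modeling <- merge(modeling, tot_tips, by = "business_id", all.x = TRUE)

# Restaurants with no reviews or tips get a count of zero instead of NA.
modeling$tot_reviews[is.na(modeling$tot_reviews)] <- 0
modeling$tot_tips[is.na(modeling$tot_tips)] <- 0
```

Using `all.x = TRUE` mirrors the left join in the SQL sample: every restaurant is kept even when it has no matching summary row.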
The Business/Restaurant file had a field labeled category which was basically a list of key/value pairs. A great deal of text mining leveraging SPSS Text Analytics was required to clean and create new fields from this attribute.

Data Transformation

Our data transformation focused primarily on the conversion of free-form text fields into flags that indicate whether a restaurant had reviews, tips, or category descriptions containing certain keywords or themes. To accomplish these transformations, we essentially constructed text mining models to create fields that could be fed into our final classification and predictor models.

Our text mining initiatives leveraged SPSS Modeler Text Analytics to accomplish this task for text in the Tips file and Restaurants file. SAS Text Analytics was used to create clusters from the review files.

A number of derived fields were also created. These were generally ways to summarize data that was already available in a different form. For example, the hours each restaurant was open on a daily, weekly, and weekend level were calculated from the start and close times.

Some of the more important derived fields are described below.

Rating – a field that bins Yelp star ratings from a 1 to 5 scale (in increments of .5) into Low, Medium, or High

Text Mining Fields – we are mining reviews for the restaurants to create a list of indicators for the key concepts that emerge. An example of a theme is budget_tm, which included concepts involving keywords surrounding price. A value of 1 indicates that a restaurant had a tip related to budget; 0 indicates that the restaurant did not.

Target – a field that serves as the target variable for our analysis. It identifies the restaurants with a price value of 4 (the highest value) and a rating of High

Categories – The business file categories field contains a lot of valuable information about each restaurant. Unfortunately, the values are often unrelated to one another and must be parsed out using a text mining tool to create indicator variables. The field may contain multiple values – Mexican, Tex-Mex, Nightlife, Lounge, etc.
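The Rating and Target definitions above could be derived along the following lines. This is only a sketch on made-up records: the cut points separating Low, Medium, and High are placeholders (the report does not state the ones actually used), and the column names `stars`, `price`, `rating`, and `target` are illustrative.

```r
# Hypothetical restaurant records; column names are illustrative only.
restaurants <- data.frame(
  business_id = c("b1", "b2", "b3", "b4"),
  stars = c(2.0, 3.5, 4.5, 5.0),
  price = c(1, 2, 4, 3)
)

# ASSUMPTION: these cut points are placeholders, not the ones used in the report.
# stars up to 2.5 -> Low, 3 to 3.5 -> Medium, 4 and above -> High
restaurants$rating <- cut(
  restaurants$stars,
  breaks = c(-Inf, 2.75, 3.75, Inf),
  labels = c("Low", "Medium", "High")
)

# Target = 1 when the restaurant is in the top price tier and rated High.
restaurants$target <- as.integer(
  restaurants$price == 4 & restaurants$rating == "High"
)
```

Placing the breaks between the half-star steps (2.75, 3.75) avoids any ambiguity about which bin a boundary value such as 3.5 falls into.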
In all, more than 30 fields were created through the text mining process. Those fields, as well as other derived fields, are denoted in the data dictionary with an asterisk.

Data Reduction

Data reduction efforts focused on restricting our data only to the businesses we identified as restaurants. To do that, we restricted our business file universe to restaurants using the code below to look for the keyword "restaurant" in the Yelp categories field. From there, we created a new restaurant indicator. We were able to subset the data in the second line of code below with the new restaurant indicator variable. The business IDs from this subset of restaurants were used to restrict records in our reviews, tips, and check-ins files to restaurants only.

    # Identify restaurants
    business$restaurant_flg <- grepl("Restaurant|restaurant", business$categories)
    yelp_restaurants <- business[business$restaurant_flg, ]

Our next task was to reduce the review dataset to include only reviews that corresponded to our newly created list of restaurants. The code below shows our approach to this process using R.

    ids <- yelp_restaurants$business_id
    # subset
    restaurant_reviews <- reviews[reviews$business_id %in% ids, ]

Descriptive analysis

Using JMP 12, we did some descriptive analysis to get a better understanding of the distributions of some of the key variables.

Ethnicity

First, the ethnicity variable against the target variable (see Data Transformation) shows us the likelihood of a restaurant being a 4-5 star restaurant for the different ethnicities.
In the graph, we can see that certain ethnicities stand out. In terms of a high likelihood of a high rating, Polish, Russian, Scandinavian, and African restaurants seem to be well received. On the other end of the scale, American, Irish, Mexican, and Unknown restaurants are not particularly successful.

To illustrate an essential problem with this analysis, we also brought in a frequency table for the different restaurants. Here we see that most of the different ethnicities have relatively few records to base any assumptions on.

Based on the frequency table above, the most frequent ethnicities are American, Asian, Mexican, Italian, and Unknown. Interestingly enough, this list of ethnicities seems to be pretty much the opposite of the likelihood of a high rating. This could be taken as an indicator that one of the aspects needed for a good review might be scarcity or originality, which would make sense for various reasons. By having a restaurant that serves the only food of its kind, there will be fewer restaurants to compare it to. You see this happening to people that taste very high-end food – their standards rise after going to a Michelin-rated restaurant, compared to someone who has never tasted a Michelin-star-worthy meal.

Weekly hours

Another interesting observation is the importance of the weekly hours. In the graph below, you can see that the likelihood of a high rating decreases as the number of hours goes above 70.
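The two descriptive checks used in this section (the share of highly rated restaurants in each group, and the number of records behind that share) were produced in JMP. A rough base R equivalent, on made-up data, might look like the sketch below; the small-sample threshold is an arbitrary assumption chosen for the toy data.

```r
# Hypothetical data: one row per restaurant with a group label and a 0/1 target.
restaurants <- data.frame(
  ethnicity = c("American", "American", "American", "American", "Polish", "Polish"),
  target = c(1, 0, 0, 0, 1, 1)
)

# Share of highly rated restaurants and number of records in each group.
high_rate <- tapply(restaurants$target, restaurants$ethnicity, mean)
n_records <- tapply(restaurants$target, restaurants$ethnicity, length)

summary_tbl <- data.frame(
  ethnicity = names(high_rate),
  high_rate = as.numeric(high_rate),
  n_records = as.integer(n_records)
)

# Flag groups whose sample is too small to support a conclusion.
# ASSUMPTION: the threshold of 3 is arbitrary and only suits this toy data.
summary_tbl$small_sample <- summary_tbl$n_records < 3
```

Keeping the rate and the record count side by side makes it harder to over-read a high rate that rests on only a handful of restaurants.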
Again, we use a simple frequency table to double check that we are not making assumptions based on a small sample size.

As seen in the frequency table, there are at least 400 reviews for each of the blocks of full-week hours between 30 and 110 hours. Hence, making assumptions within this range may be safe. Focusing on fewer hours may help increase the quality of the restaurant, as it may help ensure that high-quality staff are already at the restaurant, whereas having more shifts will increase the chance of having to hire less qualified workers.

Location

It is interesting to see the importance of location. Hence, we made a map in Tableau to show the relationship between the location, number of reviews, and rating.

Scale:

Karlsruhe, Germany
Edinburgh, U.K.

Montreal, Canada

Waterloo, Canada

Pittsburgh, PA

Madison, WI

Urbana-Champaign, IL

Charlotte, NC

Phoenix, AZ
Las Vegas, NV

As seen in the maps above, the distribution of highly rated restaurants seems to be independent of the centrality of the location for all the cities. There does, however, seem to be more high-end restaurants in the larger cities.

Restaurant Type

Another aspect, similar to the restaurant ethnicity, is the restaurant type. Below, you can see graphs and summary statistics generated using JMP 12.
We see that there are certain groups that seem to be underrepresented in the high rating category. Examples of these are fast food, caterer, and buffet. Amongst the ones that are relatively more represented in the highly rated category, we find Bakeries, Cafés, Delis, Coffee/Tea Houses, Food Trucks, and Tapas Bars. Again, a case of originality seems to occur, as we saw in the analysis of ethnicity.

Select Modeling Techniques

We elected to build multiple models in order to have a range of techniques and potential outcomes. This section provides the details on each model – why it was selected, how it was used, how it was built, and its results.

Model 1 – The Decision Tree

Our first model choice was a decision tree. Given the high number of potential independent variables in our data set, we needed a way to quickly identify the variables most useful in classifying each record into the highly rated restaurant bucket or non-highly rated restaurant bucket using our target variable. A decision tree seemed to be a logical choice. Decision trees offer a number of benefits in this sort of scenario:

1. They are easy to understand and visualize
2. They are easy to implement
3. They handle most any kind of data so little pre-processing is required (missing value corrections, binning, correlation analysis, etc. generally aren't needed)
4. Outliers generally aren't a problem

Consequently, decision trees provide a quick way to explore data and determine which variables may be of interest in predictive modeling.

Model 1 – Data Splitting and Sub-sampling

Before building the model, we had to determine how the data was to be split and sampled within SPSS Modeler. Model 1 uses three data partitions.

- Training (used to build the model) – 60% of file
- Testing (used to evaluate model on different data sample) – 20% of file
- Validation (used to verify accuracy of model on a third sample) – 20% of file

Our data set size of over 21,000 records allowed for the three partitions. The ratio of these splits should provide sufficient quantities to minimize variance in each. We used the default seed setting to ensure that our partition assignment was repeatable in various iterations and models.

SPSS Modeler Partition Settings

These settings did a good job of randomly assigning target records in each partition. The screen capture below illustrates that the distribution of 0 and 1 values (High Rating=1, Non-High Rating=0) is roughly proportional in the Training, Testing, and Validation datasets.
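For readers without SPSS Modeler, a seeded 60/20/20 partition and the proportionality check described above could be approximated in base R as follows. This is a sketch under stated assumptions: the data is synthetic, the 30% target rate and the seed are arbitrary, and it does not reproduce the Partition node's internal method or its default seed.

```r
set.seed(1234)  # ASSUMPTION: arbitrary seed, not SPSS Modeler's default

# Hypothetical data set with a 0/1 target; about 21,000 rows as in the report.
n <- 21000
restaurants <- data.frame(id = seq_len(n), target = rbinom(n, 1, 0.3))

# Assign each record to a partition with probabilities 60/20/20.
restaurants$partition <- sample(
  c("Training", "Testing", "Validation"),
  size = n, replace = TRUE, prob = c(0.6, 0.2, 0.2)
)

# Share of records per partition, and share of target = 1 within each partition.
partition_share <- prop.table(table(restaurants$partition))
target_share <- tapply(restaurants$target, restaurants$partition, mean)
```

Because assignment is random per record, the partition sizes land near 60/20/20 rather than exactly on it, and the target share should be roughly equal across the three partitions.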
Model 1 – Building the Model

The construction of our initial decision tree model was based on our goal of identifying the variables that are most important in classifying our target variable. With that in mind, our target variable was the target field itself.

Most potential classifier/predictive variables were fed into the model in an effort to screen for independent variables for other model types. The only fields that were excluded were those that had a direct tie to the target variable (e.g. the target variable was derived from ratings so all variations of the ratings field were excluded).

Input Fields for the Decision Tree
Input Fields for Decision Tree Continued

Input Fields for Decision Tree Continued

With the input variables ready and partitions created, the next step was to select the appropriate type of decision tree to build. Past experience has shown that the decision tree variants within SPSS Modeler produce similar results. Even so, we decided to experiment with CART, QUEST, C5 and CHAID trees to determine which provided the best initial results. The screen capture below shows the resulting SPSS Modeler stream.

As we will see, the CART tree performed best on our data, so that's where we will focus our build screen captures. For the final CART model we made two changes to the default settings in an attempt to enhance performance.
The first change was to enable boosting. This means that a series of trees are built to improve fitting. The second change was to increase the tree depth in an attempt to bring in more variables that may be of importance in future model builds (e.g. predictive models).

Model 1 – Assessing the Model

Our primary metric in evaluating and assessing decision trees was the percentage of records accurately classified on the Validation dataset. Generally speaking, all of our decision trees performed well. They all correctly classified our target variable around 66-68% of the time. You can see from the following table that CART had the best performance at 68.46%.

Model    % Correct (Validation Data)
CART     68.46%
QUEST    66.50%
C5       68.37%
CHAID    67.66%

CART Results – Default Settings

The CART results with default settings are listed below. The model performed consistently from Training to Testing to Validation, which means there was little overfitting. Additionally, 10 variables showed up as having the most predictive importance. Four of those, tot_cool, text_topic4, weekend_hours and text_topic2, stood out from the pack. These may be key variables to focus on with something like a logistic regression model.
The actual tree output and decision rules have been omitted since we were using this model only to identify the variables with the most predictive importance.

CART Results – Enhanced Settings

Running the same CART tree with boosting improved results a bit. The percentage accurately classified moved up to 70.25%. The list of variables with the most predictive importance looked very different, however. The top 10 fields are totally different and their predictive importance as assessed by the tree is much more evenly balanced.
Based on our results, we have two good decision tree models for classifying records based on our target variable. The question now becomes whether the variables identified can be used in a predictive model.

Model 2 – Logistic Regression

The second model builds on the output of the first. The original decision tree identified 4 variables that may be useful in a predictive model - tot_cool, text_topic4, weekend_hours and text_topic2. The goal of this model is to determine whether these fields can be used to predict our target variable (High Yelp rating). Given that we have a binary target variable, a binary logistic regression model seems appropriate.

Binary logistic regression models require that the dependent variable be binary (have only two possible values like 0/1 or True/False). Our target variable meets that criterion. Although logistic regression models appear similar to linear regression, they don't rely on many of the assumptions that linear regression models do. In particular:

- A linear relationship between independent and dependent variables is not required
- Independent variables do not need to be normal
- Error terms do not need to be normally distributed
- Homoscedasticity is not required
- Ordinal and nominal variables can be used as predictors

These differences mean that the tests required for the linear regression models discussed in class do not apply to this modeling technique.
Model 2 – Data Splitting and Sub-sampling

This model will use the same data splitting and sub-sampling techniques described for Model 1. It will leverage a Training dataset (60% of original file), Test dataset (20% of original file), and Validation dataset (20% of original file). The rationale for this decision is the same as for Model 1.

Model 2 – Building the Model

Construction of the logistic regression model is an outflow of the decision tree created for Model 1. The target variable will be the binary target field created to indicate whether a restaurant was rated highly. The independent predictor variables will include the variables that stood out in the original decision tree (tot_cool, text_topic4, weekend_hours and text_topic2).

The Logistic node was selected in IBM SPSS Modeler for this model. The resulting stream is shown below.

Logistic Regression Model Stream in IBM SPSS Modeler

The Enter method was leveraged for variable selection. Using this approach, all variables are entered in a single step. This makes sense in our scenario because we want to test the variables identified in the decision tree together.
The Model Evaluation setting was changed to calculate predictor importance. This will result in output that shows the predictive power of each model variable.

Aside from these selections, the default settings were used.

Model 2 – Assessing the Model

Our primary metric for evaluating this model is accuracy in predicting our target value of 1 in the Validation dataset. As illustrated in the screenshot below, the model did not do a good job of prediction. The model correctly identified the target variable in the Validation dataset only 39.44% of the time.

Pseudo R-Square values confirm that the model was not fit well. McFadden Pseudo R-Square values between .2 and .4 generally indicate that a model has an excellent fit. This model is much lower at .078.

The independent variables, although known, are shown below. Interestingly, the predictive importance was different between the decision tree and the logistic regression model. Tot_cool, the number of reviews classified as cool, remained at the top in both models, however.
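The two fit measures discussed above can be illustrated with a small base R sketch. It fits a binary logistic regression with all predictors entered at once, then computes the McFadden pseudo R-square as 1 minus the ratio of the model's log-likelihood to that of an intercept-only model, and the share of actual 1s predicted as 1. The data is synthetic and the coefficients are made up; only the column names `tot_cool`, `weekend_hours`, and `target` come from this report, and for brevity the hit rate is computed on the fitting data rather than on a separate Validation partition.

```r
set.seed(42)  # arbitrary seed for the synthetic data

# Synthetic stand-in data; only the column names come from the report.
n <- 2000
d <- data.frame(
  tot_cool = rpois(n, 5),
  weekend_hours = runif(n, 0, 48)
)
# ASSUMPTION: made-up relationship, chosen only so the model has signal to find.
p <- plogis(-1 + 0.3 * d$tot_cool - 0.03 * d$weekend_hours)
d$target <- rbinom(n, 1, p)

# Binary logistic regression with all predictors entered in one step.
fit <- glm(target ~ tot_cool + weekend_hours, data = d, family = binomial)
null_fit <- glm(target ~ 1, data = d, family = binomial)

# McFadden pseudo R-squared: 1 - logLik(model) / logLik(intercept-only model).
mcfadden_r2 <- 1 - as.numeric(logLik(fit)) / as.numeric(logLik(null_fit))

# Share of actual 1s that the model predicts as 1 at a 0.5 cutoff.
pred <- as.integer(fitted(fit) > 0.5)
hit_rate_ones <- sum(pred == 1 & d$target == 1) / sum(d$target == 1)
```

A value of `mcfadden_r2` near zero means the predictors add little over always predicting the base rate, which is the situation the .078 figure above describes.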
The equation for this logistic regression model was:

Although the equation isn't terribly predictive, it is interesting that the total cool ratings have a positive impact toward a high rating while weekend hours is slightly negative.

While the variables from our decision tree in Model 1 seemed to work well for classification, they did not perform well for prediction. We had to try different approaches to boost predictive performance.

Model 3 – Logistic Regression Part II

Our first logistic regression model was constructed using variables that looked promising from the decision tree in Model 1. Since that logistic regression model did not perform well in terms of predictive power, we decided to try logistic regression again. This time, the focus is on variables selected using our intuition and curiosity. For this model, more variables were selected. The idea was to let the model select those with the most predictive power.

Model 3 – Data Splitting and Sub-sampling

Once again, we used the same data splitting and sub-sampling methodology used in prior models: 60% Training, 20% Test, and 20% Validation.
Model 3 – Building the Model

For this model, the same target variable was used. The independent variables shifted to include 50 variables related to type of food, food specialties, total reviews, types of reviews, hours open, check-in times and days, and a range of text mining fields. For brevity, the fields are not listed here. The model assessment section highlights those selected by the model, however.

For this model, the variable selection method was set to Stepwise. Stepwise is a good method to use when you have a large number of potential independent variables and are unsure which may be best for modeling. It allows for multiple model iterations where variables are added and removed step by step until the best combination of variables has been selected.

Aside from this change, all settings remain the same as in the previous logistic regression model.

Model 3 – Assessing the Model

Using the same criteria to evaluate this logistic regression model, we see that it correctly predicted true values for the target variable only 39.74% of the time. This is a slight improvement over the previous model but its predictive power is still weak.

The list of variables pulled into the model shows the variables with the most predictive importance.
A few interesting variables rise to the top – weekend hours, ethnicity, restaurant type, afternoon check-ins, touristy, good for breakfast and good for late night could all inform restaurant decision making to drive higher reviews. Unfortunately, their predictive performance is relatively low. Decision making based on the variables selected would be sketchy at best.

The regression equation for this model becomes extremely long, making it virtually unusable. For that reason, it has been omitted.

The McFadden Pseudo R-Square value has improved but not above .2, where we could say the model is well fitted.

Model 4 – Fit Least Squares

To investigate the topics found in the text mining, we ran a least squares regression with the 20 topics as the variables using the JMP 12 software. The software would pick the topics with the highest LogWorth (calculated as -log10(p-value)), and then use those to compute the best model.

Model 4 – Data Splitting and Sub-Sampling

There was no need to do any splitting, as JMP was able to run through all the variables and records without any splits or samples.
Model 4 – Building the model

To build the model, we used the Fit Model function in JMP. With this model, we used stars as the Y variable to be predicted, and text_topic1-20 to construct model effects. The personality was Standard Least Squares.
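The same kind of fit can be sketched outside JMP. The base R example below fits an ordinary least squares model of stars on a few topic scores and ranks the topics by LogWorth, taken here as -log10 of each coefficient's p-value. The data is synthetic with made-up coefficients, and only three of the twenty topic fields are included to keep the example short; it is an illustration of the idea, not a reproduction of the JMP output.

```r
set.seed(7)  # arbitrary seed for the synthetic data

# Synthetic stand-in: three topic scores, only the first two drive the rating.
n <- 500
d <- data.frame(
  text_topic2 = rnorm(n),
  text_topic4 = rnorm(n),
  text_topic15 = rnorm(n)
)
# ASSUMPTION: made-up coefficients, not estimates from the Yelp data.
d$stars <- 3.5 - 0.4 * d$text_topic2 + 0.5 * d$text_topic4 + rnorm(n, sd = 0.8)

# Ordinary least squares fit of stars on the topic scores.
fit <- lm(stars ~ text_topic2 + text_topic4 + text_topic15, data = d)

# LogWorth = -log10(p-value); larger values mean stronger evidence of an effect.
p_values <- summary(fit)$coefficients[-1, "Pr(>|t|)"]
logworth <- sort(-log10(p_values), decreasing = TRUE)
```

Sorting by LogWorth puts the topics with the smallest p-values first, which is the ordering used when the topics are ranked in the assessment of this model.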
Model 4 – Assessing the model

In the output above, you can see the importance of the different topics. The R-square being as low as 0.22 shows how poorly this model is working, though. The thing that can be taken from this model, however, is the LogWorth value for the different topics. We can see that text_topic2 and text_topic4 are the most important ones when analyzing the different topics, together with topics 18, 17, 6, and 12 in order of descending importance. It is interesting to note that text_topic2 and text_topic4 also stood out in our decision tree model.

If we look at the following groups, we can see that the most important things are manager, location, food, service, wine, dessert, staff, friendliness, time, and bread, salad and meat. So for newly opening restaurants, there is a great need to focus on these parts of the restaurant.

text_topic2*    "+customer,+know,+bad,+manager,+location"
text_topic4*    "+great,+great food,+great service,+service,+food"
text_topic6*    "+wine,+restaurant,+dish,+dessert,+meal"
text_topic18*   "always,+staff,+friendly,+love,+location"
text_topic17*   "+minute,+wait,+table,+wait,+order"
text_topic12*   "+sandwich,+bread,+lunch,+salad,+meat"

Model 5 - Text Profiling

To investigate the reviews and find which terms were most associated with the different star ratings, we chose SAS' Text Profiler tool. The resulting output gives the most commonly occurring terms in the different star reviews.

Model 5 – Data Splitting and Sub-Sampling

The data was first reduced to a 5% sample to be able to handle the size of the data. Then the data was split into three separate sections: training (20%), validation (50%), and testing (30%).

Model 5 – Building the Model

To build the model, the data was first sub-sampled into a 5% sample. Then the sample was run through a partition node to split the data into a 20-50-30 training, validation, testing split. Next was a text parsing node to extract the text fields to be used in the analysis, then a text filter to filter out unnecessary terms, special signs, etc. Last, before the text profiling node, came a text topic node to create a set of categorical variables to be used in the text profiling.

Text Parsing settings:
Text Filter settings:

Text Topic settings:

Text Profile settings:

Text Profile output:

Model 5 – Assessing the model

With the text profile, we can see that there are certain areas the customers seem to be more concerned about when rating. For the low-rated restaurants, the terms seem to be focused on staff/service, mistakes like hair in the food, price, portion, and taste. For the better restaurants, the main terms found in the reviews seem to be more about owner, town, service, and great food.

This model represents very well how we can go about analyzing the Yelp reviews. As it is hard to predict the rating based on any terms or other aspects, the best way seems to be through descriptive analytics, and finding the commonalities between the best reviews.

Model 5 Modification

When analyzing Model 5, we did come across one problem: adjectives. Despite telling us about the content of the review, adjectives don't give much knowledge in terms of specific parts to focus on when trying to make a restaurant successful. Hence, we kept only the nouns found in the reviews by ignoring all the other terms in the text parsing node. The following were the resulting terms the reviewers focused on:
Modified Model 5 Output:

In this model, we can see that the most important things to the reviewers seem to be staff/service, town, food, portion and price.

Model 6

Model 6 was built by using linear regression to predict the degree to which the nature of Reviews and Tips influences ratings. The target variable for the model was "Stars," which holds the number of stars for each rating.

Model 6 - Building the model

Before building the model, we assessed the numeric independent variables to determine which to include in the model. Based on the results of the statistical analysis, we excluded from the model all the independent variables with a correlation value higher than 0.7 with other independent variables.

Correlation between independent variables
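The correlation screen described above could be sketched in base R as follows. This is one possible reading of the rule, stated as an assumption: for each pair of inputs whose absolute correlation exceeds 0.7, the later column of the pair is dropped. The data is synthetic, `tot_votes` is a hypothetical column built to correlate with `tot_cool`, and the sketch does not claim to reproduce which variables were actually excluded.

```r
set.seed(11)  # arbitrary seed for the synthetic data

# Synthetic candidate inputs; tot_votes is hypothetical and tracks tot_cool closely.
n <- 300
tot_cool <- rpois(n, 10)
inputs <- data.frame(
  tot_cool = tot_cool,
  tot_votes = tot_cool + rnorm(n, sd = 1),
  cool_pct = runif(n, 0, 100)
)

# Absolute pairwise correlations; zero out the lower triangle and diagonal
# so that each pair is looked at only once.
cor_mat <- abs(cor(inputs))
cor_mat[lower.tri(cor_mat, diag = TRUE)] <- 0

# ASSUMPTION: for every pair above the cutoff, drop the later of the two columns.
cutoff <- 0.7
drop_cols <- colnames(cor_mat)[apply(cor_mat > cutoff, 2, any)]
kept <- setdiff(names(inputs), drop_cols)
```

Dropping one variable per highly correlated pair keeps the information the pair shares while reducing multicollinearity in the regression.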
Correlation between dependent variable and independent variables

This left us with 5 input variables which were included in the final model:

Model 6 - Assessing the Model

The basic results show the percentage of total reviews voted "Cool" to have the greatest predictor importance on ratings, followed by the percentage of total reviews voted "Funny".
tot_tips and tot_tip_likes had the same degree of importance, which was not very significant. It was interesting to discover that the percentage of reviews voted "Useful" had a predictor importance of zero, though it had a strong correlation with the target variable.

The regression equation:

Stars = 3.407 - -0.01232 funny_pct + -0.0012 useful_pct + 0.013202 cool_pct + 0.002287 tot_tips + 0.02388

The results of the regression are presented in the following screenshots:

The adjusted R squared value of 0.108 means that the model does not do a very good job of explaining variation in the dependent variable. Looking at the F value and t values, it seems that the independent variables selected for the model do have some limited ability to explain variation in the dependent variable.
Discussion

From the models we have tried to create, there seems to be great difficulty in actually predicting the review that a customer is going to give. This is natural, as people are diverse and focus on different things. No two people are going to think the exact same thing about a place. There is still some support for saying certain factors may help improve the chances of satisfied customers.

In Model 1, we saw that the most important factors were total cool reviews, text topic 4 (food and service), weekend hours, and text topic 2 (customer, manager, and location). This is similar to what we found in Models 4 and 5 in terms of text topics, and similar to Models 2, 3, and 6 in terms of the importance of weekend hours and total cool reviews.

Though the number of cool reviews may not explain a lot to us about what to focus on when making a successful restaurant, the fact that weekend hours seems to be so important is of interest. As seen in the plot below, there does seem to be a trend similar to that which we saw in the descriptive analytics part: fewer hours = more stars. The reason may be hard to explain without further investigation and data from the businesses, but a possible reason may be as explained in the descriptive analysis section: fewer shifts may help ensure a high-quality staff at all times.

The suggestion about the staff does seem to hold up in the other models too. When looking at topic 4 and Model 5, the main two things people seem to be concerned about are in fact the staff/service and food. The argument that fewer shifts help improve the quality is hence also supported by those models (we must not forget that food is as closely connected to people as service, as it is the chefs preparing the food that determine how good the food tastes).

Conclusion

From the above models, we can see that the data given from Yelp does not work very well with predictive models. Hence, the better way to go about analyzing the reviews seems to be through text analytics and grouping. Through the Text Profiler, we found that the most important terms seem to be food and service. In terms of service, we actually see that people use words like love, good service, hair, bug, and care. In other words, if the restaurants focus on quality of their staff, cleanliness, and quality food, they will most likely succeed in the business. We also found in the analysis that the one thing restaurants may need to do is to reduce their hours. This may help resolve a lot of quality issues, and may in turn help increase the rating of the restaurant.