YELP Data Set Challenge
MSIS 5633
Deliverable 2
25 NOV 2015
James Lynn (CWID11644030)
Yolande Mbah Mbole (CWID11696431)
Vegard Oelstad (CWID11681522)
YELP Dataset Challenge, 2nd deliverable, Lynn, Mbole, Oelstad
Executive Summary
Yelp is a web-based company providing crowd-sourced reviews of local businesses via Yelp.com. Its stated goal is to connect people with great local businesses. In recent years, Yelp has made subsets of its data available to the public to promote innovative uses of data and groundbreaking research.
The goal of our project is to leverage this Yelp data to create a classification scheme utilizing ratings and price information. The analysis should provide insights into what makes some restaurants earn top rankings while others fall short. Obviously, consumers expect high quality in terms of service, food, ambiance, etc. The question is which dimensions are more important. Can a restaurant fall short in some areas and still be rated highly?
Our project could benefit those looking to open a new restaurant by identifying key areas to focus on. It could also help educate inexperienced restaurateurs on customer expectations and what it takes to succeed in terms of ratings and customer perception. Every advantage can help when you consider that a study by Cornell University and Michigan State University researchers found that 27% of restaurant startups failed after the first year. Chef Robert Irvine of TV's Restaurant Impossible cited inexperience as the primary reason most restaurants fail.
The one factor our analysis found that can improve a restaurant's rating is its opening hours. Although longer opening hours may increase revenue, shorter hours help increase the rating of the place. This, together with the fact that the majority of the reviews are concerned with food and service, suggests that managers may consider reducing hours to increase their ratings, which in turn will help bring in more customers and more revenue.
Project Schedule, Duration and Estimates
Initial Project Timeline
YELP DATASET CHALLENGE ANALYSIS TIMELINE
Step | Task Lead | Est. Duration (days) | Start Date | End Date
Kick Off Meeting | Team | 1 | 9/2/15 | 9/2/15
Prepare project proposal | Team | 7 | 9/6/15 | 9/12/15
Submit project proposal | Team | 1 | 9/13/15 | 9/13/15
Define data requirements for analysis | Team | 5 | 9/13/15 | 9/18/15
Data consolidation | Team | 27 | 9/18/15 | 10/15/15
Data cleaning | Team | 27 | 9/18/15 | 10/15/15
Data reduction | Team | 27 | 9/18/15 | 10/15/15
Prepare first deliverable | Team | 3 | 10/15/15 | 10/17/15
Submit first deliverable | Team | 1 | 10/18/15 | 10/18/15
Build models | Team | 10 | 10/19/15 | 10/30/15
Analyze models | Team | 24 | 11/1/15 | 11/24/15
Prepare second deliverable | Team | 3 | 11/25/15 | 11/28/15
Submit second deliverable | Team | 1 | 11/29/15 | 11/29/15
Prepare report and presentation | Team | 11 | 11/30/15 | 12/10/15
Submit final deliverable | Team | 1 | 12/11/15 | 12/11/15
Final Project Timeline
Comparing our initial timeline with the final one, we initially planned to do the data reduction before submitting the first deliverable but were only able to do so after submitting the deliverable because we spent more time than expected on the data cleaning and consolidation. We also included the duration of the data transformation in our updated timeline. We met almost every week, but only the major meetings are included in our final timeline. Another major difference between our planned and actual schedule is that we spent more time on data transformation than planned. As a result, we had to use some of the time we planned to spend on building and analyzing our models on the data transformation. It worked out well and we were able to complete the project on time.
YELP DATASET CHALLENGE ANALYSIS TIMELINE
Step | Task Lead | Est. Duration (days) | Start Date | End Date
Kick Off Meeting | Team | 1 | 9/2/15 | 9/2/15
Prepare project proposal | Team | 7 | 9/6/15 | 9/12/15
Submit project proposal | Team | 1 | 9/13/15 | 9/13/15
** Major Group meeting | Team | 1 | 9/14/15 | 9/14/15
Define data requirements for analysis | Team | 4 | 9/15/15 | 9/18/15
Data cleaning and data consolidation | Team | 27 | 9/18/15 | 10/15/15
Prepare first deliverable | Team | 3 | 10/15/15 | 10/17/15
Submit first deliverable | Team | 1 | 10/18/15 | 10/18/15
** Major Group meeting | Team | 1 | 10/19/15 | 10/19/15
Data Transformation | Team | 18 | 10/20/15 | 11/7/15
Data Reduction | Team | 6 | 11/8/15 | 11/14/15
** Major Group meeting | Team | 1 | 11/15/15 | 11/15/15
Build models | Team | 5 | 11/16/15 | 11/20/15
Analyze models and start preparing 2nd deliverable | Team | 3 | 11/21/15 | 11/23/15
** Major Group meeting | Team | 1 | 11/23/15 | 11/23/15
Finalize second deliverable | Team | 1 | 11/24/15 | 11/24/15
Submit second deliverable | Team | 1 | 11/25/15 | 11/25/15
** Major Group meeting | Team | 1 | 11/26/15 | 11/26/15
Prepare report and presentation | Team | 10 | 11/27/15 | 12/6/15
Submit final deliverable | Team | 1 | 12/7/15 | 12/7/15
Work Breakdown Structure
YELP Data Mining Project
Project Proposal
First Deliverable
-Define data requirements for analysis
-Data cleaning and consolidation
Second Deliverable
-Data Transformation
-Data reduction
-Building and analyzing models
Final Deliverable
-Report
-Final Presentation
Statement of Scope
Project Objective
The objective of our analysis is to uncover the factors most important in categorizing a Yelp restaurant into a high review category (4, 4.5, or 5 star rating).
Target Variable
TARGET – this target variable is a binary field with values of 0 or 1. It is created by assigning a value of 1 to restaurants within the High review category. All other restaurants are assigned a value of 0.
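The rule above can be sketched in a few lines of Python. This is only an illustration of the stated definition, not the code used in the project; the function name `make_target` and the sample records are hypothetical, and only the `stars` field mirrors the Yelp business file.

```python
def make_target(stars: float) -> int:
    """Return 1 for the High review category (4, 4.5, or 5 stars), else 0."""
    return 1 if stars >= 4.0 else 0


# Hypothetical sample records for illustration only.
restaurants = [
    {"business_id": "a1", "stars": 4.5},
    {"business_id": "b2", "stars": 3.5},
    {"business_id": "c3", "stars": 4.0},
]

for restaurant in restaurants:
    restaurant["target"] = make_target(restaurant["stars"])

targets = [restaurant["target"] for restaurant in restaurants]
print(targets)  # prints [1, 0, 1]
```

Because Yelp star ratings move in half-star steps, a threshold of 4.0 covers exactly the 4, 4.5, and 5 star restaurants.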
Predictor Variables
Our initial file included over 100 possible predictor variables. To limit the scope, we started with the variables below and used a decision tree to identify the most important variables in determining the desired outcome. In addition, we selected a few additional variables based on our intuition and curiosity to see how well they performed in terms of classification and prediction. The bolded variables are those actually selected for use in our models.
Ethnicity – type of food (e.g. Italian, Mexican, etc.)
Neighborhood Flag – binary variable to indicate whether neighborhoods were listed; could be an indicator of trendy locations
Review Count – number of Yelp reviews
Good for Kids – whether restaurant is good for kids
Alcohol – full bar, beer and wine, none, etc.
Noise Level – loud, very loud, average, etc.
Attire – dressy, casual, etc.
Coat Check – True, False
Romantic – True, False
Classy – True, False
Intimate – True, False
Hipster – True, False
Divey – True, False
Touristy – True, False
Trendy – True, False
Upscale – True, False
Casual – True, False
Good for Dessert – True, False
Good for Late Night – True, False
Good for Lunch – True, False
Good for Dinner – True, False
Good for Breakfast – True, False
Good for Brunch – True, False
Live Music – True, False
Dairy Free – True, False
Gluten Free – True, False
Vegan – True, False
Vegetarian – True, False
Wi-Fi – True, False
Takes Reservations – True, False
Smoking – Yes, No, Outdoor
Hours Open – open/close time broken out by day of week
Text Topics 1-20 – themes identified through text mining
Total Reviews voted as cool
Total hours open on weekends
Total Tips
Total Likes of Tips
Percentage of reviews voted Funny
Percentage of reviews voted Useful
Percentage of reviews voted Cool
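As an illustration of how a derived predictor such as "Total hours open on weekends" could be computed from the open/close times, here is a minimal Python sketch. It is not the authors' derivation. It assumes times are 24-hour "HH:MM" strings, that a close time at or before the open time means the restaurant closes the next day, and that missing values appear as "NA". The function names are hypothetical; the field names follow the data dictionary.

```python
def hours_open(open_time: str, close_time: str) -> float:
    """Hours between two 24-hour "HH:MM" strings.

    Assumption: a close time at or before the open time falls on the next day.
    """
    open_h, open_m = map(int, open_time.split(":"))
    close_h, close_m = map(int, close_time.split(":"))
    minutes = (close_h * 60 + close_m) - (open_h * 60 + open_m)
    if minutes <= 0:
        minutes += 24 * 60
    return minutes / 60


def weekend_hours(record: dict) -> float:
    """Sum of hours open on Saturday and Sunday; days with missing times are skipped."""
    total = 0.0
    for day in ("Saturday", "Sunday"):
        open_time = record.get(f"{day}_open")
        close_time = record.get(f"{day}_close")
        if open_time and close_time and open_time != "NA" and close_time != "NA":
            total += hours_open(open_time, close_time)
    return total


# Hypothetical restaurant: open 11:00 to 02:00 on Saturday, 11:00 to 21:00 on Sunday.
example = {
    "Saturday_open": "11:00", "Saturday_close": "02:00",
    "Sunday_open": "11:00", "Sunday_close": "21:00",
}
print(weekend_hours(example))  # prints 25.0
```

Under these assumptions, identical open and close times are counted as 24 hours; a real pipeline would need to confirm how such records should be treated.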
People Benefitting from the Analysis
The primary beneficiaries of this analysis will be restaurant owners and operators. They will receive insights into the most important dimensions of a highly rated restaurant.
Consumers may also benefit. When restaurants aren't rated or when they have fewer reviews, the criteria may help them determine whether or not to take a chance on a restaurant.
Yelp and advertisers may also benefit. They can use the information from the analysis to approach businesses in a more consultative fashion by providing offerings and recommendations that help restaurants improve key areas of weakness or consumer perceptions in those areas.
Companies that help restaurants could benefit. Perhaps a restaurant scores low for ambiance. Companies specializing in remodeling or interior design could approach these restaurants with proposals or ideas on how improvements could be made.
Finally, job seekers may benefit. The results of the analysis would give them clues on the major values and characteristics that distinguish one restaurant from another. They would then be able to make a better choice of the restaurant they want to work for based on the attributes they value most.
Constraints and Limitations
There are a number of possible constraints associated with this project.
1. Small sample size of highly rated, expensive restaurants - While there are over 6,000 restaurants in the data set rated as a 4, 4.5, or 5, there are only about 175 with those ratings also falling into the most expensive category (price range of 4). Given that fact, we adjusted our original project idea of investigating why expensive restaurants receive low ratings to something broader. We are now looking to predict high restaurant ratings irrespective of price.
2. Format of the data - There are several data fields that include nuggets of information that are not easily accessible without text mining. Even with text mining, over 400 concepts emerge. These concepts must be combined into themes. This is a time-consuming and inexact process.
3. Samples - The samples we are using are from a few U.S. cities - Pittsburgh, Charlotte, Urbana-Champaign, Phoenix, Las Vegas, and Madison. The samples may not be representative of the U.S. as a whole.
4. Timing – As of the time this paper was written, we had received no formal feedback on our original project proposal. Should changes be required, we will have less time to adapt.
5. Expertise – A good data science team is comprised of individuals with expertise in several disciplines – computer science, statistics/math, and the business domain. Our group lacks anyone with an in-depth statistics/math background.
Project Costs
The project team associated with this analysis consists of 3 senior data analysts. We estimate the time required to be 50 hours per analyst (150 hours total). At a rate of $250 per hour, we estimate the total project cost to be $37,500. This estimate does not take into account the opportunity cost of other projects that are not undertaken.
Since we are using free analysis software and there are no data charges, the intangible costs are negligible.
Feasibility and Risk Assessment
Despite our team's shortcomings in the realm of statistics, we felt our project was feasible based on the training we have received in MSIS 5633. We felt the biggest challenge facing us was the conversion of JSON files to a format easily readable by SPSS Modeler. The rest of the project was less daunting.
Timing and resource availability were one challenge we faced. With a distance learning student and a student athlete on the team, scheduling meetings was sometimes difficult. We were able to overcome the challenge by scheduling regular meetings on Google Hangouts and maintaining ongoing, open communication via email.
We were fortunate to have a robust data set from Yelp. The data set permitted us to easily adjust or modify our sample and the specific data to be used in the project. We also had the necessary programs to perform our analysis, with each team member having access to Excel, JMP, R, SAS, SPSS Modeler and Tableau. These tools, combined with training on key data mining and analysis techniques from MSIS 5633, gave us what we needed to achieve our project goals.
Implementing the Plan / Measuring Results
To implement our plan, we would identify startup restaurants in the cities our sample was based on (Pittsburgh, Charlotte, Urbana-Champaign, Phoenix, Las Vegas, and Madison) and present our ideas to them.
Our analytic program will be successful if we are able to determine whether there are factors in the Yelp data set that accurately identify what most contributes to an expensive restaurant having a poor rating. If we discover that none of the factors present predict a low rating, that is an interesting insight that may be of value to Yelp. If we discover there are factors that may result in low ratings, that will be of interest to Yelp, restaurant owners, and possibly diners.
Beyond our analysis, we would like to succeed by helping struggling restaurants. By leveraging our insights, they could improve the number of customer visits as well as their reviews. If the number of customers significantly increases alongside high ratings, our analysis has done more than succeed.
Our potential clients would be mainly startup restaurants, as well as restaurants with really low ratings (1 or 2 stars). We could present our findings at a range of industry events like the National Restaurant Association Conference, the Restaurant Finance & Development Conference, or something more interesting like the TV show Restaurant Impossible.
Beyond that, we would present our model to customers who may have a vested interest in helping struggling restaurants turn their businesses around. This could include chefs who help with menu selections, interior designers who could improve the look, musicians who could improve the ambience, etc.
Scope Proposal
The scope of this project was limited to U.S. restaurants in the Yelp Dataset Challenge data. We focused on identifying the factors common to highly rated restaurants within this group that are not present in restaurants with lower ratings.
Data Dictionary
Our data dictionary is extensive given the number of variables provided by Yelp and the number of derived fields we created. We elected to maintain a large data dictionary to illustrate the breadth of data we had available and the new fields we created. We also used variable screening methods that leveraged a large number of variables to identify those useful to our model.
Yelp Data Set Challenge Master Data Dictionary
Variable | Description | Type | Length | Format | Informat
Ages_Allowed | Describes ages allowed in restaurant (e.g. 19plus). | Char | 7 | $CHAR7. | $CHAR7.
Alcohol | Describes if/how alcohol is served (e.g. full bar, beer and wine, etc.). | Char | 13 | $CHAR13. | $CHAR13.
Attire | Describes appropriate dress for restaurant (e.g. dressy, casual). | Char | 6 | $CHAR6. | $CHAR6.
BYOB | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
BYOB_Corkage | Field identifies whether attribute is True, False, or NA. | Char | 11 | $CHAR11. | $CHAR11.
Caters | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Coat_Check | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Corkage | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Credit_Cards | Field identifies whether attribute is True, False, or NA. | Char | 6 | $CHAR6. | $CHAR6.
Delivery | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Dogs_Allowed | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Drive_Thru | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Friday_close | Close time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
Friday_open | Open time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
Good_For_Dancing | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Good_For_Groups | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Good_For_Kids2 | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Good_For_breakfast | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Good_For_brunch | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Good_For_dessert | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Good_For_dinner | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Good_For_latenight | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Good_For_lunch | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Good_for_Kids | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Happy_Hour | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Has_TV | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Monday_close | Close time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
Monday_open | Open time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
Music_dj | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Music_jukebox | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Music_karaoke | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Music_live | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Music_playlist | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Music_video | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Noise_Level | Describes noise level (e.g. average, quiet, loud). | Char | 9 | $CHAR9. | $CHAR9.
Open_24_Hrs | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Order_at_Counter | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Outdoor_Seating | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Parking_garage | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Parking_lot | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Parking_street | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Parking_valet | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Parking_validated | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Payment_amex | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Payment_cash_only | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Payment_discover | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Payment_mastercard | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Payment_visa | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Saturday_close | Close time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
Saturday_open | Open time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
Smoking | Describes if/where smoking is permitted (e.g. no, outdoor). | Char | 7 | $CHAR7. | $CHAR7.
Sunday_close | Close time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
Sunday_open | Open time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
Take_out | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Takes_Reservations | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Thursday_close | Close time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
Thursday_open | Open time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
Tuesday_close | Close time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
Tuesday_open | Open time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
Waiter_Service | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Wednesday_close | Close time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
Wednesday_open | Open time for this day in 24 hour format. | Char | 5 | $CHAR5. | $CHAR5.
Wheelchair_Accessible | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
Wi_Fi | Describes wi-fi availability and cost (e.g. no, free). | Char | 4 | $CHAR4. | $CHAR4.
afternoon_check-ins* | Derived from check-ins file. Sum of afternoon check-ins from 11AM to 3PM. | Num | 8 | |
avgstars_review_file* | Derived from reviews file. Average ratings on rating file for a restaurant. | Num | 8 | |
background_music | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
business_id | Unique identifier for individual restaurants. Also the primary key. | Char | 22 | $CHAR22. | $CHAR22.
casual | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
categories | Catchall field from Yelp that includes restaurant type, foods, etc. | Char | 199 | $CHAR199. | $CHAR199.
city | City where restaurant is located. | Char | 35 | $CHAR35. | $CHAR35.
classy | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
cool_pct* | Derived from reviews file. Percent of total reviews that were voted cool. | Num | 8 | |
dairy_free | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
divey | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
ethnicity* | Derived from restaurants file. Text mining done to create flags for food type. | Char | 25 | |
evening_check-ins* | Derived from check-ins file. Sum of evening check-ins from 6PM to 11PM. | Num | 8 | |
frihours* | Derived from open and close times. Number of hours open this day. | Num | 8 | |
full_address | Full physical address of restaurant. | Char | 110 | $CHAR110. | $CHAR110.
fullweek_hours* | Derived from open and close times. Number of hours open for the week. | Num | 8 | |
funny_pct* | Derived from reviews file. Percent of total reviews that were voted funny. | Num | 8 | |
gluten_free | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
halal | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
hipster | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
intimate | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
kosher | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
lateafternoon_check-ins* | Derived from check-ins file. Sum of check-ins from 3PM to 6PM. | Num | 8 | |
latenight_check-ins* | Derived from check-ins file. Sum of check-ins from 11PM to 5AM. | Num | 8 | |
latitude | Latitude of restaurant. | Num | 8 | BEST16. | BEST16.
longitude | Longitude of restaurant. | Num | 8 | BEST17. | BEST17.
monhours* | Derived from open and close times. Number of hours open this day. | Num | 8 | |
morning_check-ins* | Derived from check-ins file. Sum of morning check-ins from 5AM to 11AM. | Num | 8 | |
name | Name of restaurant. | Char | 61 | $CHAR61. | $CHAR61.
neighborhoods | Neighborhood restaurant is located in. | Char | 52 | $CHAR52. | $CHAR52.
open | Whether the restaurant is still in business (True or False). | Char | 5 | $CHAR5. | $CHAR5.
pct_likes_of_tips* | Derived from Tips file. Percentage of tips that were liked by other users. | Num | 8 | |
price_range | 1 to 4 with 4 being the most expensive. | Char | 2 | $7,00 | $CHAR2.
rating* | Derived from Stars field. Low (1-2), Medium (2.5-3.5), High (4-5) | Char | 3 | $3,00 |
restaurant_type* | Derived from text mining categories field. Type of restaurant (e.g. Bar, Pub, Fast Food). | Char | 25 | |
review_count | Total number of reviews for restaurant as reported on Yelp business file. | Num | 8 | BEST4. | BEST4.
romantic | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
sathours* | Derived from open and close times. Number of hours open this day. | Num | 8 | |
soy_free | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
stars | Overall rating of restaurant. | Num | 8 | BEST3. | BEST3.
state | State where restaurant is located. | Char | 3 | $CHAR3. | $CHAR3.
sunhours* | Derived from open and close times. Number of hours open this day. | Num | 8 | |
target* | Derived dependent variable. 1 when restaurant has High rating. Zero otherwise. | Num | 8 | |
thurshours* | Derived from open and close times. Number of hours open this day. | Num | 8 | |
tot_check-ins* | Derived from check-ins file. Total number of check-ins for restaurant. | Num | 8 | |
tot_cool* | Derived from tips file. Total number of tips voted cool. | Num | 8 | |
tot_funny* | Derived from tips file. Total number of tips voted funny. | Num | 8 | |
tot_reviews* | Derived from reviews file. Total number of reviews for restaurant. | Num | 8 | |
tot_tip_likes* | Derived from tips file. Total number of likes for all tips for a restaurant. | Num | 8 | |
tot_tips* | Derived from tips file. Total number of tips for restaurant. | Num | 8 | |
tot_useful* | Derived from tips file. Total number of reviews voted useful. | Num | 8 | |
touristy | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
trendy | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
tueshours* | Derived from open and close times. Number of hours open this day. | Num | 8 | |
type | Type of record (e.g. business, review, tip, etc.) | Char | 8 | $CHAR8. | $CHAR8.
upscale | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
useful_pct* | Derived field. Percent of total reviews that were voted useful. | Num | 8 | |
vegan | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
vegetarian | Field identifies whether attribute is True, False, or NA. | Char | 5 | $CHAR5. | $CHAR5.
wedhours* | Derived from open and close times. Number of hours open this day. | Num | 8 | |
weekday_afternoon_check-ins* | Derived from check-ins file. Sum of weekday afternoon check-ins from 11AM to 3PM. | Num | 8 | |
weekday_evening_check-ins* | Derived from check-ins file. Sum of weekday evening check-ins from 6PM to 11PM. | Num | 8 | |
weekday_hours* | Derived from check-ins file. Sum of hours open Monday-Friday. | Num | 8 | |
weekday_lateafternoon_check-ins* | Derived from check-ins file. Sum of weekday check-ins from 3PM to 6PM. | Num | 8 | |
weekday_latenight_check-ins* | Derived from check-ins file. Sum of weekday check-ins from 11PM to 5AM. | Num | 8 | |
weekday_morn_check-ins* | Derived from check-ins file. Sum of weekday morning check-ins from 5AM to 11AM. | Num | 8 | |
weekend_afternoon_check-ins* | Derived from check-ins file. Sum of weekend afternoon check-ins from 11AM to 3PM. | Num | 8 | |
weekend_evening_check-ins* | Derived from check-ins file. Sum of weekend evening check-ins from 6PM to 11PM. | Num | 8 | |
weekend_hours* | Derived from check-ins file. Sum of hours open Saturday-Sunday. | Num | 8 | |
weekend_lateafternoon_check-ins* | Derived from check-ins file. Sum of weekend check-ins from 3PM to 6PM. | Num | 8 | |
weekend_latenight_check-ins* | Derived from check-ins file. Sum of weekend check-ins from 11PM to 6AM. | Num | 8 | |
weekend_morn_check-ins* | Derived from check-ins file. Sum of weekend morning check-ins from 5AM to 11AM. | Num | 8 | |
budget_tm* | Derived from text mining tips file. Concepts related to money. 0=False, 1=True | Num | 8 | |
drinks_tm* | Derived from text mining tips file. Concepts related to drinks in general, e.g. beer, juice, water, tea, shakes. 0=False, 1=True | Num | 8 | |
food_tm* | Derived from text mining tips file. Concepts related to food, ingredients, vegetables, fruits, dessert. 0=False, 1=True | Num | 8 | |
hours_tm* | Derived from text mining tips file. Concepts related to days, dates, time, open, closed, etc. 0=False, 1=True | Num | 8 | |
location_tm* | Derived from text mining tips file. Concepts related to location and ambiance of the location, e.g. seats, doors, kitchen, Arizona. 0=False, 1=True | Num | 8 | |
negative_tm* | Derived from text mining tips file. Concepts related to negative feelings, e.g. rude, dirty. 0=False, 1=True | Num | 8 | |
people_tm* | Derived from text mining tips file. Concepts related to individuals, e.g. family, friends, kids, wife. 0=False, 1=True | Num | 8 | |
positive_tm* | Derived from text mining tips file. Concepts which were generally related to positive feelings, e.g. clean, crispy. 0=False, 1=True | Num | 8 | |
service_tm* | Derived from text mining tips file. Concepts related to how the service is viewed, e.g. waitress, manager, wait time. 0=False, 1=True | Num | 8 | |
neighborhood_flg* | Derived from neighborhood field. 1 if neighborhood was listed, 0 if not. | Num | 8 | |
text_topic1* | Derived from text mining reviews. Concepts related to: "+taco,+salsa,+chip,+burrito,mexican" | Num | 8 | |
text_topic2* | Derived from text mining reviews. Concepts related to: "+customer,+know,+bad,+manager,+location" | Num | 8 | |
text_topic3* | Derived from text mining reviews. Concepts related to: "+pizza,+crust,+slice,+cheese,+thin" | Num | 8 | |
text_topic4* | Derived from text mining reviews. Concepts related to: "+great,+great food,+great service,+service,+food" | Num | 8 | |
text_topic5* | Derived from text mining reviews. Concepts related to: "+burger,fries,+fry,+bun,+onion" | Num | 8 | |
text_topic6* | Derived from text mining reviews. Concepts related to: "+wine,+restaurant,+dish,+dessert,+meal" | Num | 8 | |
text_topic7* | Derived from text mining reviews. Concepts related to: "+sushi,+roll,+fish,+tuna,+roll" | Num | 8 | |
text_topic8* | Derived from text mining reviews. Concepts related to: "+breakfast,+egg,+coffee,+toast,+pancake" | Num | 8 | |
text_topic9* | Derived from text mining reviews. Concepts related to: "+thai,+rice,+dish,+noodle,thai" | Num | 8 | |
text_topic10* | Derived from text mining reviews. Concepts related to: "+buffet,+crab,+dessert,+leg,+selection" | Num | 8 | |
text_topic11* | Derived from text mining reviews. Concepts related to: "+beer,+bar,+selection,+drink,+night" | Num | 8 | |
text_topic12* | Derived from text mining reviews. Concepts related to: "+sandwich,+bread,+lunch,+salad,+meat" | Num | 8 | |
text_topic13* | Derived from text mining reviews. Concepts related to: "+hour,+happy,+happy hour,+drink,+special" | Num | 8 | |
text_topic14* | Derived from text mining reviews. Concepts related to: "+price,+steak,+good,good,+portion" | Num | 8 | |
text_topic15* | Derived from text mining reviews. Concepts related to: "de,est,le,à,+pour" | Num | 8 | |
text_topic16* | Derived from text mining reviews. Concepts related to: "+steak,+rib,+chicken,bbq,+sauce" | Num | 8 | |
text_topic17* | Derived from text mining reviews. Concepts related to: "+minute,+wait,+table,+wait,+order" | Num | 8 | |
text_topic18* | Derived from text mining reviews. Concepts related to: "always,+staff,+friendly,+love,+location" | Num | 8 | |
text_topic19* | Derived from text mining reviews. Concepts related to: "+time,first,+first time,vegas,+love" | Num | 8 | |
text_topic20* | Derived from text mining reviews. Concepts related to: "+salad,+lunch,+chicken,always,+special" | Num | 8 | |
* Denotes that this is a derived or calculated field.
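The time-of-day check-in fields in the dictionary (morning 5AM to 11AM, afternoon 11AM to 3PM, late afternoon 3PM to 6PM, evening 6PM to 11PM, late night 11PM to 5AM) could be derived along the following lines. This is a hedged Python sketch, not the authors' code. It assumes each restaurant's check-ins arrive as a dictionary keyed by "hour-day" strings with counts as values; that key layout, the function names, and the sample data are all assumptions made for illustration.

```python
from collections import Counter


def checkin_bucket(hour: int) -> str:
    """Map an hour of day (0-23) to the time-of-day buckets used in the data dictionary."""
    if 5 <= hour < 11:
        return "morning"
    if 11 <= hour < 15:
        return "afternoon"
    if 15 <= hour < 18:
        return "lateafternoon"
    if 18 <= hour < 23:
        return "evening"
    return "latenight"  # 23:00 through 04:59


def summarize_checkins(checkin_info: dict) -> Counter:
    """Sum check-in counts per bucket, plus an overall total.

    Assumption: keys look like "9-5" (hour, then day index); only the hour is used here.
    """
    totals = Counter()
    for key, count in checkin_info.items():
        hour = int(key.split("-")[0])
        totals[f"{checkin_bucket(hour)}_check-ins"] += count
        totals["tot_check-ins"] += count
    return totals


# Hypothetical check-in counts for one restaurant.
sample = {"9-5": 2, "12-1": 3, "23-6": 1, "3-0": 4}
summary = summarize_checkins(sample)
print(summary["morning_check-ins"], summary["latenight_check-ins"], summary["tot_check-ins"])
# prints 2 5 10
```

The weekday and weekend variants in the dictionary would additionally need the day index, whose numbering should be confirmed against the Yelp documentation before use.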
Data Access
Our data was downloaded from the Yelp Dataset Challenge webpage. The URL for that page is http://www.yelp.com/dataset_challenge. Click on the 'Get the Data' button and complete a form to download.
The data includes information on the businesses that have been reviewed, the reviews, the users/reviewers, user check-ins, and user-provided tips. Yelp defines the data as follows:
The Challenge Dataset:
1.6M reviews and 500K tips by 366K users for 61K businesses
481K business attributes, e.g., hours, parking availability, ambience
Social network of 366K users for a total of 2.9M social edges
Aggregated check-ins over time for each of the 61K businesses
Cities:
U.K.: Edinburgh
Germany: Karlsruhe
Canada: Montreal and Waterloo
U.S.: Pittsburgh, Charlotte, Urbana-Champaign, Phoenix, Las Vegas, Madison
From the data, we focused only on records associated with restaurants. The process of consolidating and cleaning the data is outlined in the sections that follow.
Data Consolidation
Yelp provided the data in 5 files. Descriptions of each file are included below.
File Name | Description | File Format | Size | Number of Records
yelp_academic_dataset_business | List of reviewed businesses | JSON | 54MB | 61,181
yelp_academic_dataset_review | Review information on businesses | JSON | 1.39GB | 1,569,264
yelp_academic_dataset_user | Information on Yelp users/reviewers | JSON | 162MB | 366,715
yelp_academic_dataset_checkin | Information on check-ins at businesses | JSON | 20MB | 45,166
yelp_academic_dataset_tip | Tips for each business | JSON | 96MB | 495,107
A lot of data cleansing and manipulation had to be done to consolidate the data into a single dataset for modeling purposes. To get to a single dataset, we went through a 6-step process.
1. Identify restaurants on the business file
2. Create a subset of the business file that only includes restaurants
3. Create subsets of the reviews, check-ins, and tips files
4. Summarize data from the review, check-in, and tips files (e.g. sum the number of check-ins/tips/reviews for each restaurant) and create a file for the summarized data containing only business ID and summary fields that can be appended back to the restaurants file
5. Text mine key text fields in the review and tips files to create content category flags for each restaurant
6. Merge the summary tables back to the restaurant/business file that serves as the final modeling dataset
Here is a sample of the SQL code usedto merge the individualfilesbacktothe master.
proc sql;
create table yelp.yelp_restaurant_reviews as
select a.*, b.rating, b.stars as avg_star_rating
from yelp.yelp_restaurant_reviews a left join yelp.yelp_restaurants b on
a.business_id=b.business_id;
quit;
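Steps 4 and 6 (summarize per restaurant, then left join the summary back) can be sketched outside SAS as well. The Python below is only an illustration of the idea, not the code used in the project; the sample records and the field name tot_reviews are hypothetical.

```python
from collections import Counter

def summarize_counts(records, key="business_id"):
    """Count how many records (reviews, tips, ...) each business has."""
    return Counter(rec[key] for rec in records)

def left_join_counts(restaurants, counts, field):
    """Left join: keep every restaurant and attach its count (0 when absent)."""
    return [{**r, field: counts.get(r["business_id"], 0)} for r in restaurants]

# Hypothetical sample data; "b9" is a non-restaurant business and is ignored.
restaurants = [{"business_id": "b1", "stars": 4.5}, {"business_id": "b2", "stars": 3.0}]
reviews = [{"business_id": "b1"}, {"business_id": "b1"}, {"business_id": "b9"}]

merged = left_join_counts(restaurants, summarize_counts(reviews), "tot_reviews")
```

As with the left join above, restaurants without any matching summary record stay in the output; here they simply receive a count of 0.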
Data Cleaning
The data cleaning process was extensive and time consuming with the Yelp data. The JSON data
required extensive formatting and some Yelp data fields combine somewhat unrelated data into a single
field.
To convert the JSON fields into a more useable tab-delimited text format, we used the jsonlite R package
and the following commands for each file. The filenames were changed for each run to match the file
being processed.
library(jsonlite) # load jsonlite library
yelp <- "yelp_academic_dataset_review.json" # assign file to yelp variable
reviews <- stream_in(file(yelp)) # read in file
reviews <- flatten(reviews, recursive = TRUE) # flatten JSON file
reviews$text <- gsub('\n', ' ', reviews$text) # strip linefeed from text field
reviews$text <- gsub('\r', ' ', reviews$text) # strip carriage return from text field
# create data frame that works with write.table
reviews <- data.frame(lapply(reviews, as.character), stringsAsFactors = FALSE)
# write out data frame as tab-delimited text file
write.table(reviews, "yelp_reviews.txt", sep = "\t", row.names = FALSE)
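For readers without R, the same conversion (read one JSON object per line, flatten nested fields, strip line breaks, write tab-delimited text) can be sketched in Python. This is an assumption-laden illustration rather than the project's code: the dot-joined column names and the sample record are made up, and the sketch holds all rows in memory, which may not suit the 1.39GB review file.

```python
import csv
import io
import json

def flatten(record, prefix=""):
    """Flatten nested dicts into one level, joining keys with a dot."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat

def clean(value):
    """Render a value as text with line breaks replaced by spaces."""
    return str(value).replace("\r", " ").replace("\n", " ")

def json_lines_to_tsv(lines, out):
    """Convert an iterable of JSON strings (one object per line) to TSV."""
    rows = [flatten(json.loads(line)) for line in lines if line.strip()]
    columns = sorted({name for row in rows for name in row})
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    for row in rows:
        writer.writerow([clean(row.get(name, "")) for name in columns])

# Hypothetical one-record sample; the review text contains a line break.
sample = ['{"business_id": "b1", "text": "Great\\nfood", "votes": {"cool": 2}}']
buffer = io.StringIO()
json_lines_to_tsv(sample, buffer)
```

Passing an open file handle instead of the in-memory buffer would write the result to disk.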
The Business/Restaurant file had a field labeled category which was basically a list of key/value pairs. A great deal
of text mining leveraging SPSS Text Analytics was required to clean this attribute and create new fields from
it.
Data Transformation
Our data transformation focused primarily on the conversion of free-form text fields into flags that
indicate whether a restaurant had reviews, tips, or category descriptions containing certain keywords or
themes. To accomplish these transformations, we essentially constructed text mining models to create
fields that could be fed into our final classification and predictor models.
Our text mining initiatives leveraged SPSS Modeler Text Analytics to accomplish this task for text in the
Tips file and Restaurants file. SAS Text Analytics was used to create clusters from the review files.
A number of derived fields were also created. These were generally ways to summarize data that was
already available in a different form. The hours each restaurant was open on a daily, weekly, and
weekend level were calculated from the start and close time, for example.
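The hours calculation can be illustrated with a short Python sketch. It assumes the hours arrive as a per-day mapping of "open" and "close" times in HH:MM form, that a closing time at or before the opening time means the restaurant closes after midnight, and that the weekend is Saturday and Sunday; none of these details are spelled out in this report.

```python
from datetime import datetime

WEEKEND = ("Saturday", "Sunday")  # assumed definition of "weekend"

def hours_open(open_time, close_time):
    """Hours between two HH:MM strings; a close at or before open wraps to the next day."""
    fmt = "%H:%M"
    delta = datetime.strptime(close_time, fmt) - datetime.strptime(open_time, fmt)
    minutes = delta.total_seconds() / 60
    if minutes <= 0:  # e.g. 18:00 to 02:00 runs past midnight
        minutes += 24 * 60
    return minutes / 60

def summarize_hours(hours):
    """Daily, weekly, and weekend hours from a {day: {"open", "close"}} mapping."""
    daily = {day: hours_open(t["open"], t["close"]) for day, t in hours.items()}
    return {
        "daily": daily,
        "weekly_hours": sum(daily.values()),
        "weekend_hours": sum(h for day, h in daily.items() if day in WEEKEND),
    }

# Hypothetical restaurant open two days a week.
example = {
    "Friday": {"open": "11:00", "close": "22:00"},
    "Saturday": {"open": "18:00", "close": "02:00"},
}
summary = summarize_hours(example)
```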
Some of the more important derived fields are described below.
Rating – a field that bins Yelp star ratings from a 1 to 5 (in increments of .5) scale into Low, Medium, or
High
Text Mining Fields – we are mining reviews for the restaurants to create a list of indicators for the key
concepts that emerge. An example of a theme is budget_tm which included concepts involving
keywords surrounding price. A value of 1 indicates that a restaurant had a tip related to budget, 0
indicates that the restaurant did not.
Target – a field that serves as the target variable for our analysis. It identifies the restaurants with a
price value of 4 (the highest value) and a rating of High
Categories – The business file categories field contains a lot of valuable information about each
restaurant. Unfortunately, the information is often unrelated and must be parsed out using a text
mining tool to create indicator variables. The field may contain multiple values – Mexican, Tex-Mex,
Nightlife, Lounge, etc.
In all, more than 30 fields were created through the text mining process. Those fields, as well as other
derived fields, are denoted in the data dictionary with an asterisk.
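The derived fields above can be illustrated with a small Python sketch. The cut points for Low/Medium/High and the budget keyword list are hypothetical (the report does not state them), and plain substring matching stands in for the concept extraction done in SPSS Text Analytics.

```python
def bin_rating(stars, low_max=2.5, high_min=4.0):
    """Bin a 1-5 star value into Low/Medium/High; the cut points are hypothetical."""
    if stars <= low_max:
        return "Low"
    if stars >= high_min:
        return "High"
    return "Medium"

def target_flag(price, rating):
    """1 for restaurants with the highest price value (4) and a High rating, else 0."""
    return int(price == 4 and rating == "High")

def keyword_flag(text, keywords):
    """1 if any keyword appears in the text (case-insensitive), else 0."""
    lowered = text.lower()
    return int(any(word in lowered for word in keywords))

BUDGET_WORDS = ("cheap", "price", "deal")  # hypothetical keyword list

rating = bin_rating(4.5)
target = target_flag(4, rating)
budget_tm = keyword_flag("Great deal at lunch", BUDGET_WORDS)
```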
Data Reduction
Data reduction efforts focused on restricting our data only to the businesses we identified as restaurants.
To do that, we restricted our business file universe to restaurants using the code below to look for the
keyword "restaurants" in the Yelp categories field. From there, we created a new restaurant indicator. We
were able to subset the data in the second line of code below with the new restaurant indicator
variable. The business IDs from this subset of restaurants were used to restrict records in our reviews,
tips, and check-ins files to restaurants only.
# Identify restaurants
business$restaurant_flg <- grepl("Restaurant|restaurant", business$categories)
yelp_restaurants <- business[business$restaurant_flg, ]
Our next task was to reduce the review dataset to include only reviews that corresponded to our newly
created list of restaurants. The code below shows our approach to this process using R.
ids <- yelp_restaurants$business_id
# subset
restaurant_reviews <- reviews[reviews$business_id %in% ids, ]
Descriptive analysis
Using JMP 12, we did some descriptive analysis to get a better understanding of the distributions of
some of the key variables.
Ethnicity
First, the ethnicity variable against the target variable (see data transformation) shows us the likelihood
of a restaurant being a 4-5 star restaurant for the different ethnicities.
In the graph, we can see that certain ethnicities stand out. In terms of high likelihood of high rating,
Polish, Russian, Scandinavian, and African restaurants seem to be well received. On the other end of the
scale, American, Irish, Mexican, and Unknown restaurants are not particularly successful.
To illustrate an essential problem with this analysis, we also brought in a frequency table for the
different restaurants. Here we see that most of the different ethnicities have relatively few records to
base any assumptions on.
Based on the frequency table above, the most frequent ethnicities are American, Asian, Mexican, Italian,
and Unknown. Interestingly enough, this list of ethnicities seems to be pretty much the opposite of the
list with a high likelihood of a high rating. This could be taken as an indicator that one of the aspects
needed for a good review might be scarcity or originality, which would make sense for various reasons.
When a restaurant serves the only food of its kind, there will be fewer restaurants to compare it to. You see
this happening to people who taste very high-end food: their standards rise after going to a
Michelin-rated restaurant, compared to someone who has never tasted a Michelin-star-worthy meal.
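The caution about small groups can be made concrete by always reporting the group size next to the share of highly rated restaurants. The Python sketch below is illustrative only: the records are invented and the min_count cut-off of 30 is an arbitrary assumption, not a value from the JMP analysis.

```python
from collections import defaultdict

def high_rating_rate_by_group(rows, group_key, target_key="target", min_count=30):
    """Return {group: (n, share_with_target_1, enough_data)} for each group."""
    totals = defaultdict(lambda: [0, 0])  # group -> [records, records with target == 1]
    for row in rows:
        totals[row[group_key]][0] += 1
        totals[row[group_key]][1] += row[target_key]
    return {g: (n, hits / n, n >= min_count) for g, (n, hits) in totals.items()}

# Invented records: a large group with a modest rate and a tiny group with a high rate.
rows = (
    [{"ethnicity": "American", "target": 1}] * 10
    + [{"ethnicity": "American", "target": 0}] * 30
    + [{"ethnicity": "Polish", "target": 1}] * 2
    + [{"ethnicity": "Polish", "target": 0}] * 1
)
summary = high_rating_rate_by_group(rows, "ethnicity")
```

In this made-up example the small group shows the higher rate, but it rests on three records and is flagged as lacking enough data.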
Weekly hours
Another interesting observation is the importance of the weekly hours. In the graph below, you can see
that the likelihood of a high rating decreases as the number of hours goes above 70.
Again, we do a simple frequency table to double check that we are not making assumptions based on a
small sample size.
As seen in the frequency table, there are at least 400 reviews for each of the blocks of full-week hours
between 30 and 110 hours. Hence making assumptions within this range may be safe to do. Focusing on
fewer hours may help increase the quality of the restaurant, as it may help ensure that high quality staff
is already at the restaurant, as having more shifts will increase the chance of having to hire less qualified
workers.
Location
It is interesting to see the importance of location. Hence we made a map in Tableau to show the
relationship between the location, number of reviews, and rating.
Scale:
Karlsruhe, Germany
Las Vegas, NV
As seen in the maps above, the distribution of highly rated restaurants seems to be independent of the
centrality of the location for all the cities. There does however seem to be more high-end restaurants in
the larger cities.
Restaurant Type
Another aspect, similar to the restaurant ethnicity, is the restaurant type. Below, you can see graphs and
summary statistics generated using JMP 12.
We see that there are certain groups that seem to be underrepresented in the high rating category.
Examples of these are fast food, caterer, and buffet. Amongst the ones that are relatively more
represented in the highly rated category, we find Bakeries, Cafés, Delis, Coffee/Tea Houses, Food Trucks,
and Tapas Bars. Again, a case of originality seems to occur, as we saw in the analysis of ethnicity.
Select Modeling Techniques
We elected to build multiple models in order to have a range of techniques and potential outcomes. This
section provides the details on each model – why it was selected, how it was used, how it was built, and
its results.
Model 1 – The Decision Tree
Our first model choice was a decision tree. Given the high number of potential independent variables in
our data set, we needed a way to quickly identify the variables most useful in classifying each record
into the highly rated restaurant bucket or non-highly rated restaurant bucket using our target variable. A
decision tree seemed to be a logical choice. Decision trees offer a number of benefits in this sort of
scenario:
1. They are easy to understand and visualize
2. They are easy to implement
3. They handle most any kind of data so little pre-processing is required (missing value corrections,
binning, correlation analysis, etc. generally aren't needed)
4. Outliers generally aren't a problem
Consequently, decision trees provide a quick way to explore data and determine which variables may be
of interest in predictive modeling.
Model 1 – Data Splitting and Sub-sampling
Before building the model, we had to determine how the data was to be split and sampled within SPSS
Modeler. Model 1 uses three data partitions.
Training (used to build the model) – 60% of file
Testing (used to evaluate model on different data sample) – 20% of file
Validation (used to verify accuracy of model on a third sample) – 20% of file
Our data set size of over 21,000 records allowed for the three partitions. The ratio of these splits should
provide sufficient quantities to minimize variance in each. We used the default seed setting to ensure
that our partition assignment was repeatable in various iterations and models.
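A seeded 60/20/20 split can be sketched in a few lines of Python. This is not the SPSS Modeler Partition node: the seed value 1234 is an arbitrary placeholder, and the sketch shuffles and cuts the list so the partition sizes are exact, whereas a per-record random assignment would only approximate the percentages.

```python
import random

DEFAULT_SHARES = (("Training", 0.6), ("Testing", 0.2), ("Validation", 0.2))

def assign_partitions(ids, seed=1234, shares=DEFAULT_SHARES):
    """Randomly assign each id to a partition; the same seed gives the same split."""
    rng = random.Random(seed)  # private generator so the split is repeatable
    shuffled = list(ids)
    rng.shuffle(shuffled)
    assignment, start, cumulative = {}, 0, 0.0
    for name, share in shares:
        cumulative += share
        end = round(cumulative * len(shuffled))
        for record_id in shuffled[start:end]:
            assignment[record_id] = name
        start = end
    return assignment

partitions = assign_partitions(range(1000))
```

Calling the function again with the same seed returns the identical assignment, which is the property the fixed seed is meant to provide.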
SPSS Modeler Partition Settings
These settings did a good job of randomly assigning target records in each partition. The screen capture
below illustrates that the distribution of 0 and 1 values (High Rating=1, Non-High Rating=0) is roughly
proportional in the Training, Testing, and Validation data sets.
Model 1 – Building the Model
The construction of our initial decision tree model was based on our goal of identifying the variables
that are most important in classifying our target variable. With that in mind, our target variable was the
target field itself.
Most potential classifier/predictive variables were fed into the model in an effort to screen for
independent variables for other model types. The only fields that were excluded were those that had a
direct tie to the target variable (e.g. the target variable was derived from ratings so all variations of the
ratings field were excluded).
Input Fields for the Decision Tree
With the input variables ready and partitions created, the next step was to select the appropriate type
of decision tree to build. Past experience has shown that the decision tree variants within SPSS Modeler
produce similar results. Even so, we decided to experiment with CART, QUEST, C5 and CHAID trees to
determine which provided the best initial results. The screen capture below shows the resulting
SPSS Modeler stream.
As we will see, the CART tree performed best on our data so that's where we will focus our build screen
captures. For the final CART model we made two changes to the default settings in an attempt to enhance
performance.
The first change was to enable boosting. This means that a series of trees is built to improve fitting.
The second change was to increase the tree depth in an attempt to bring in more variables that may be
of importance in future model builds (e.g. predictive models).
Model 1 – Assessing the Model
Our primary metric in evaluating and assessing decision trees was the percentage of records accurately
classified on the Validation data set. Generally speaking, all of our decision trees performed well. They
all correctly classified our target variable around 66-68% of the time.
You can see from the following table that CART had the best performance at 68.46%.
CART Results – Default Settings
The CART results with default settings are listed below. The tree performed consistently from Training to
Testing to Validation, which means there was little overfitting. Additionally, 10 variables showed up as
having the most predictive performance. Four of those, tot_cool, text_topic4, weekend_hours and
text_topic2, stood out from the pack. These may be key variables to focus on with something like a
logistic regression model.
Model % Correct (Validation Data)
CART 68.46%
QUEST 66.50%
C5 68.37%
CHAID 67.66%
The actual tree output and decision rules have been omitted since we were using this model only to
identify the variables with the most predictive importance.
CART Results – Enhanced Settings
Running the same CART tree with boosting improved results a bit. The percentage accurately classified
moved up to 70.25%. The list of variables with the most predictive performance looked very different,
however. The top 10 fields are totally different and their predictive importance as assessed by the tree is
much more evenly balanced.
Based on our results, we have two good decision tree models for classifying records based on our target
variable. The question now becomes whether the variables identified can be used in a predictive model.
Model 2 – Logistic Regression
The second model builds on the output of the first. The original decision tree identified 4 variables that
may be useful in a predictive model - tot_cool, text_topic4, weekend_hours and text_topic2. The goal of
this model is to determine whether these fields can be used to predict our target variable (High Yelp rating).
Given that we have a binary target variable, a binary logistic regression model seems appropriate.
Binary logistic regression models require that the dependent variable be binary (have only two
possible values like 0/1 or True/False). Our target variable meets that criterion. Although logistic
regression models appear similar to linear regression, they don't rely on many of the assumptions that
linear regression models do. In particular, logistic regression does not require the following:
Linear relationship between independent and dependent variables
Normally distributed independent variables
Normally distributed error terms
Homoscedasticity
In addition, ordinal and nominal variables can be used as predictors.
These differences mean that the tests required for the linear regression models discussed in class do not
apply to this modeling technique.
Model 2 – Data Splitting and Sub-sampling
This model will use the same data splitting and sub-sampling techniques described for Model 1. It will
leverage a Training data set (60% of original file), Test data set (20% of original file), and Validation data
set (20% of original file). The rationale for this decision is the same as for Model 1.
Model 2 – Building the Model
Construction of the logistic regression model is an outflow of the decision tree created for Model 1. The
target variable will be the binary target field created to indicate whether a restaurant was rated highly.
The independent predictor variables will include the variables that stood out in the original decision tree
(tot_cool, text_topic4, weekend_hours and text_topic2).
The Logistic node was selected in IBM SPSS Modeler for this model. The resulting stream is shown
below.
Logistic Regression Model Stream in IBM SPSS Modeler
The Enter method was leveraged for variable selection. Using this approach, all variables are entered in
a single step. This makes sense in our scenario because we want to test the variables identified in the
decision tree together.
The Model Evaluation setting was changed to calculate predictor importance. This will result in output
that shows the predictive power of each model variable.
Aside from these selections, the default settings were used.
Model 2 – Assessing the Model
Our primary metric for evaluating this model is accuracy in predicting our target value of 1 in the
Validation data set. As illustrated in the screenshot below, the model did not do a good job of
prediction. The model correctly identified the target variable in the Validation data set only 39.44% of the
time.
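If this metric is read as the share of actual target = 1 records that the model also predicts as 1 (an interpretation, since the screenshot is not reproduced here), it differs from the overall percent-correct figure used for the decision trees. The Python sketch below uses made-up labels to show how the two can diverge.

```python
def accuracy(actual, predicted):
    """Share of all records classified correctly."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def true_positive_rate(actual, predicted):
    """Share of actual 1s that the model also predicted as 1."""
    positives = [(a, p) for a, p in zip(actual, predicted) if a == 1]
    return sum(p == 1 for _, p in positives) / len(positives)

# Made-up labels: most 0s are classified correctly, most 1s are missed.
actual = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

overall = accuracy(actual, predicted)
hit_rate = true_positive_rate(actual, predicted)
```

A model can therefore look acceptable on overall accuracy while still missing most of the highly rated restaurants.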
Pseudo R Square values confirm that the model was not fit well. McFadden Pseudo R Square values
between .2 and .4 generally indicate that a model has an excellent fit. This model is much lower at .078.
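For reference, McFadden's pseudo R Square compares the log-likelihood of the fitted model with that of an intercept-only model: 1 - LL(model) / LL(null). The Python sketch below computes it from predicted probabilities; the outcome and probability values are invented and are not output from our model.

```python
import math

def log_likelihood(actual, probabilities):
    """Bernoulli log-likelihood of 0/1 outcomes under predicted probabilities."""
    return sum(
        math.log(p) if y == 1 else math.log(1 - p)
        for y, p in zip(actual, probabilities)
    )

def mcfadden_r2(actual, probabilities):
    """1 - LL(model) / LL(intercept-only model predicting the base rate)."""
    base_rate = sum(actual) / len(actual)
    ll_null = log_likelihood(actual, [base_rate] * len(actual))
    return 1 - log_likelihood(actual, probabilities) / ll_null

actual = [1, 0, 1, 0]  # invented outcomes
no_better_than_base = mcfadden_r2(actual, [0.5, 0.5, 0.5, 0.5])
sharper = mcfadden_r2(actual, [0.8, 0.2, 0.8, 0.2])
```

A model that only reproduces the base rate scores 0, and the value rises toward 1 as the predicted probabilities move closer to the observed outcomes.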
The independent variables, although already known, are shown below. Interestingly, the predictive importance
was different between the decision tree and the logistic regression model. Tot_cool, the number of
reviews classified as cool, remained at the top in both models, however.
The equation for this logistic regression model was:
Although the equation isn't terribly predictive, it is interesting that the total cool ratings has a positive
impact toward a high rating while weekend hours is slightly negative.
While the variables from our decision tree in Model 1 seemed to work well for classification, they did
not perform well for prediction. We had to try different approaches to boost predictive performance.
Model 3 – Logistic Regression Part II
Our first logistic regression model was constructed using variables that looked promising from the
decision tree in Model 1. Since that logistic regression model did not perform well in terms of predictive
power, we decided to try logistic regression again. This time, the focus is based on variables selected
using our intuition and curiosity. For this model, more variables were selected. The idea was to let the
model select those with the most predictive power.
Model 3 – Data Splitting and Sub-sampling
Once again, we used the same data splitting and sub-sampling methodology used in prior models: 60%
Training, 20% Test, and 20% Validation.
Model 3 – Building the Model
For this model, the same target variable was used. The independent variables shifted to include 50
variables related to type of food, food specialties, total reviews, types of reviews, hours open, check-in
times and days, and a range of text mining fields. For brevity, the fields are not listed here. The model
assessment section highlights those selected by the model, however.
For this model, the variable selection method was set to Stepwise. Stepwise is a good method to use
when you have a large number of potential independent variables and are unsure which may be best for
modeling. It allows for multiple model iterations where variables are added and removed
at each step until the best combination of variables has been selected.
Aside from this change, all settings remain the same as in the previous logistic regression model.
Model 3 – Assessing the Model
Using the same criteria to evaluate this logistic regression model, we see that it correctly predicted true
values for the target variable only 39.74% of the time. This is a slight improvement over the previous
model but its predictive power is still weak.
The list of variables pulled into the model shows the variables with the most predictive importance.
A few interesting variables rise to the top – weekend hours, ethnicity, restaurant type, afternoon
check-ins, touristy, good for breakfast and good for late night could all inform restaurant decision making to
drive higher reviews. Unfortunately, their predictive performance is relatively low. Decision making
based on the variables selected would be sketchy at best.
The regression equation for this model becomes extremely long, making it virtually unusable. For that
reason, it has been omitted.
The McFadden Pseudo R Square value has improved but not above .2 where we could say the model is
well fitted.
Model 4 – Fit Least Squares
To investigate the topics found in the text mining, we ran a least squares regression with
the 20 topics as the variables using the JMP 12 software. The software would pick the topics with
the highest LogWorth (calculated as -log10(p-value)), and then use them to compute the best model.
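LogWorth is a simple transformation of the p-value, so a smaller p-value gives a larger LogWorth; a LogWorth of 2 corresponds to p = 0.01. The Python sketch below ranks three placeholder effects by LogWorth; the names and p-values are invented, not taken from the JMP output.

```python
import math

def logworth(p_value):
    """LogWorth as -log10(p-value); larger values mean stronger evidence."""
    return -math.log10(p_value)

# Invented p-values for three placeholder effects.
p_values = {"topic_a": 0.0001, "topic_b": 0.05, "topic_c": 0.4}

ranked = sorted(p_values, key=lambda name: logworth(p_values[name]), reverse=True)
```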
Model 4 – Data Splitting and Sub-Sampling
There was no need to do any splitting, as JMP was able to run through all the variables and records
without any splits or samples.
Model 4 – Building the Model
To build the model, we used the Fit Model function in JMP. With this model, we used stars as the Y
variable to be predicted, and text_topic1-20 to construct model effects. The personality was
Standard Least Squares.
Model 4 – Assessing the Model
In the output above, you can see the importance of the different topics. The R-square being as low as
0.22 shows how poorly this model is working, though. The thing that can be taken from this model,
however, is the LogWorth value for the different topics. We can see that text_topic2 and text_topic4 are the more
important ones when analyzing the different topics, together with topics 18, 17, 6, and 12 in order of
descending importance. It is interesting to note that text_topic2 and text_topic4 also stood out in our
decision tree model.
If we look at the following groups, we can see that the most important things are manager, location,
food, service, wine, dessert, staff, friendliness, time, and bread, salad and meat. So for newly opening
restaurants, there is a great need to focus on these parts of the restaurant.
text_topic2* "+customer,+know,+bad,+manager,+location"
text_topic4* "+great,+great food,+great service,+service,+food"
text_topic6* "+wine,+restaurant,+dish,+dessert,+meal"
text_topic18* "always,+staff,+friendly,+love,+location"
text_topic17* "+minute,+wait,+table,+wait,+order"
text_topic12* "+sandwich,+bread,+lunch,+salad,+meat"
Model 5 – Text Profiling
To investigate the reviews to find which terms were most associated with the different star ratings, we
chose to use SAS' Text Profiler tool. The resulting output would give the most commonly
occurring terms in the different star reviews.
Model 5 – Data Splitting and Sub-Sampling
The data was first reduced to a 5% sample to be able to handle the size of the data. Then the data was split
into three separate sections: training (20%), validation (50%), and testing (30%).
Model 5 – Building the Model
To build the model, the data was first sub-sampled into a 5% sample. Then the sample was run through
a partition node to split the data into a 20-50-30 training, validation, testing split. Next was a text
parsing node to extract the text fields to be used in the analysis, then a text filter to filter out
unnecessary terms, special signs, etc. Lastly, before the text profiling node, a text topic node created a
set of categorical variables to be used in the text profiling.
Text Parsing settings:
Text Topic settings:
Text Profile settings:
Text Profile output:
Model 5 – Assessing the Model
With the text profile, we can see that there are certain areas the customers seem to be more
concerned about when rating. For the low rated restaurants, the terms seem to be focused on
staff/service, mistakes like hair in the food, price, portion, and taste. For the better restaurants, the
main terms found in the reviews seem to be more about owner, town, service, and great food.
This model represents very well how we can go about analyzing the YELP reviews. As it is hard to predict
the rating based on any terms or other aspects, the best way seems to be through descriptive analytics,
and finding the commonalities between the best reviews.
Model 5 Modification
When analyzing Model 5, we did come across one problem: adjectives. Despite telling us about the
content of the review, adjectives don't give much knowledge in terms of specific parts to focus on when
trying to make a restaurant successful. Hence, we kept only the nouns found in the
reviews by ignoring all the other terms in the text parsing node. The following were the resulting terms
the reviewers focused on:
Modified Model 5 Output:
In this model, we can see that the most important things to the reviewers seem to be staff/service,
town, food, portion and price.
Model 6
Model 6 was built by using linear regression to predict the degree to which the nature of
Reviews and Tips influences ratings. The target variable for the model was “Stars,” which is
made up of the number of stars per each of the ratings.
Model 6 - Building the model
Before building the model, we assessed the numeric independent variables to determine which
to include in the model. Based on the results of the statistical analysis, we excluded from the
model all the independent variables with a correlation value higher than 0.7 with other
independent variables.
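The correlation screen can be sketched in Python as follows. This is an illustrative greedy version that keeps the first variable of each highly correlated pair and drops the later one; the report does not say whether one or both variables of such a pair were removed, and the variable names and values are invented.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def drop_correlated(columns, threshold=0.7):
    """Keep a column only if |r| <= threshold with every column already kept."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Invented columns: var_b is nearly proportional to var_a, var_c is not.
columns = {
    "var_a": [1, 2, 3, 4, 5],
    "var_b": [2, 4, 6, 8, 11],
    "var_c": [5, 1, 4, 2, 3],
}
kept = drop_correlated(columns)
```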
Correlation between Independent variables
Correlation between dependent variable and independent variables
This left us with 5 input variables which were included in the final model:
Model 6 - Assessing the Model
The basic results show the Percentage of total reviews voted “Cool” to have the greatest
predictor importance on Ratings, followed by the Percentage of total reviews voted “Funny”.
tot_tips and tot_tip_likes had the same degree of importance, which was not very significant. It
was interesting to discover that the Percentage of reviews voted “Useful” had a predictor
importance of zero, though it had a strong correlation with the Target variable.
The regression equation: Stars = 3.407 - -0.01232 funny_pct + -0.0012 useful_pct + 0.013202
cool_pct + 0.002287 tot_tips + 0.02388
The results of the regression are presented in the following screenshots:
The adjusted R squared value of 0.108 means that the model does not do a very good job of
explaining variation in the dependent variable. Looking at the F value and t values, it seems that
the independent variables selected for the model do have some limited ability to explain
variation in the dependent variable.
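For reference, adjusted R squared penalizes R squared for the number of predictors: 1 - (1 - R²)(n - 1)/(n - p - 1). The Python sketch below applies that formula to illustrative numbers, not to the figures in the screenshots.

```python
def adjusted_r2(r2, n_records, n_predictors):
    """Adjusted R squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_records - 1) / (n_records - n_predictors - 1)

# Illustrative values: R^2 of 0.5 from 101 records and 5 predictors.
example = adjusted_r2(0.5, 101, 5)
```

With thousands of records and only 5 predictors the penalty is tiny, so the adjusted and unadjusted values would be almost identical.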
Discussion
From the models we have tried to create, there seems to be great difficulty in actually predicting the
review that a customer is going to give. This is natural, as people are diverse and focus
on different things. No two people are going to think the exact same thing about a place. There is still
some support for saying certain factors may help improve the chances of satisfied customers.
In Model 1, we saw that the most important factors were total cool reviews, text topic 4 (food and
service), weekend hours, and text topic 2 (customer, manager, and location). This is similar to what we
found in Models 4 and 5 in terms of text topics, and similar to Models 2, 3, and 6 in terms of the
importance of weekend hours and total cool reviews.
Though the number of cool reviews may not explain a lot to us about what to focus on when making a
successful restaurant, the fact that weekend hours seems to be so important is of interest. As seen in
the plot below, there does seem to be a trend similar to that which we saw in the descriptive analytics
part: fewer hours = more stars. The reason may be hard to explain without further investigation and data
from the businesses, but a possible reason may be as explained in the descriptive analysis section: fewer
shifts may help ensure a high quality staff at all times.
The suggestion about the staff does seem to hold up in the other models too. When looking at topic 4
and Model 5, the main two things people seem to be concerned about are in fact the staff/service and
food. The argument that fewer shifts help improve the quality is hence also shown in those models (we
must not forget that food is as closely connected to people as service, as it is the chefs preparing the
food that determine how good the food tastes).
Conclusion
From the above models, we can see that the data given from YELP does not work very well with
predictive models. Hence, the better way to go about analyzing the reviews seems to be through text
analytics and grouping. Through the Text Profiler, we found that the most important terms seem to be
food and service. In terms of service, we actually see that people use words like love, good service, hair,
bug, and care. In other words, if the restaurants focus on quality of their staff, cleanliness, and quality
food, they will most likely succeed in the business. We also found in the analysis that the one thing
restaurants may need to do is to reduce their hours. This may help resolve a lot of quality issues, and may
in turn help increase the rating of the restaurant.