What drives restaurant ratings?
Understanding social recommendation systems with the Yelp dataset
Team
Kaushik Subramaniam Gnanaskandan - A11815321
Forough Nasirpouri Shadbad - A11725946
Prashanth Raj Goud - A11810448
Srujana Mereddy - A11809432

MSIS 5223 - Programming for Data Science
Project Deliverable 2
Table of Contents
Executive Summary
Statement of Scope
Project Schedule
Team availability Tracker
Lessons Learnt
Data Preparation
Data Access
Data Consolidation
Data Cleaning
Data Transformation
Data Reduction
Descriptive Statistics
Data Dictionary
Modelling Techniques
Regression
Neural Network
Data Splitting and Subsampling
Data Modelling
Model A (Regression)
Model B (Neural Network)
Model C (Regression)
Model D (Neural Network)
Model E (Regression)
Model F (Neural Network)
Model Assessment
Model A vs Model B (review_stars)
Model C vs Model D (business_stars)
Model E vs Model F (business_review_count)
Model technique Assessment
Regression
Neural Network

Executive Summary
The Yelp dataset contains several kinds of data about restaurants, including user-generated
reviews, user-generated ratings, aggregated business ratings and other numerical attributes
relating to users, reviews and businesses. This data can help us understand, statistically,
certain behavioral aspects of a business, such as what influences its ratings and its reviews.
With internet ratings playing a considerable role in determining the popularity, and hence the
profitability, of a restaurant (or any business these days), a valuable question to ask is: how
does one improve customer-driven ratings on social media platforms such as Yelp? To answer this
question, we first need to recognize what influences social recommendation systems. Our hunch is
that, for businesses on a platform like Yelp, it largely comes down to identifying the most
influential users, since they play a major role in determining the ratings. We also believe that
this dataset contains other valuable business attributes that could be influencing the ratings.
So how do we determine what is truly significant? The need for a business to have a positive
presence on the internet makes it imperative to study the patterns associated with a
recommendation system.
Statement of Scope
The broader scope of our project is to analyze the effects of all the variables in this dataset and
identify the most significant ones that influence the customer-driven ratings of a business.
Although we haven’t yet merged this dataset with other external datasets, we want to identify
variables relating to users, reviews and businesses within this dataset itself that could
give us more insight into what the most significant influencers could be.
Aside from this, and as an expansion of our initial scope, we want to identify the effect that our
chosen target variables can have on one another. For example, is there a correlation between the
average rating of a business, the number of reviews a business receives and the rating of a single
review itself? Finally, we want to arrive at a statistical model for each of our target variables
that will help us understand these relationships better.
The table below shows the target variables and the corresponding predictor variables that we used
in our final analysis of the dataset:

Target: Ratings of individual reviews (review_stars)
Predictors:
1. Number of useful votes the review received (useful)
2. Number of funny votes the review received (funny)
3. Number of cool votes the review received (cool)
4. How many reviews the user who wrote this review has given to other businesses (user_review_count)
5. How many fans the user has (user_fans)
6. The average rating that the user gives other businesses (user_average_stars)
7. How many compliments the user has received (user_compliments)
8. How many votes the user has received in total (user_votes)
9. The average rating of a business (business_stars)
10. The number of reviews a single business has (business_review_count)

Target: Average rating of the business (business_stars)
Predictors:
1. How many reviews the business has (business_review_count)
2. Ratings of individual reviews (review_stars)
3. How many reviews the users who rated this business have given to other businesses (user_review_count)
4. How many fans the users who rated this business have (user_fans)
5. The average rating that the users who rated this business give to other businesses (user_average_stars)
6. How many compliments the users who rated this business have in total (user_compliments)
7. How many votes the users who rated this business have in total (user_votes)
8. Total number of useful votes the business received from reviews (useful)
9. Total number of funny votes the business received from reviews (funny)
10. Total number of cool votes the business received from reviews (cool)

Target: Number of reviews the business has received (business_review_count)
Predictors:
1. The average rating of the business (business_stars)
2. Ratings of individual reviews (review_stars)
3. How many reviews the users who rated this business have given to other businesses (user_review_count)
4. How many fans the users who rated this business have (user_fans)
5. The average rating that the users who rated this business give to other businesses (user_average_stars)
6. How many compliments the users who rated this business have in total (user_compliments)
7. How many votes the users who rated this business have in total (user_votes)
8. Total number of useful votes the business received from reviews (useful)
9. Total number of funny votes the business received from reviews (funny)
10. Total number of cool votes the business received from reviews (cool)

Project Schedule
We were able to mostly stick to our original schedule. Apart from a few delays in the modelling
process, caused by some poor results that required iterative attention, things went smoothly.
Below you can find Gantt charts showing the duration of our entire project.
We have also developed a team availability tracker, which shows the availability of all project
members throughout the duration of the project:
Team availability Tracker
Timeline         Team members
                 Kaushik  Forough  Prashanth  Srujana
Feb    Week 1    X        X        X          X
Feb    Week 2    X        X        X          X
Feb    Week 3    X        X        X          X
Feb    Week 4    X        X        X          X
March  Week 1    X        X        X          X
March  Week 2    O        O        O          O
March  Week 3    X        X        X          X
March  Week 4    X        X        X          X
April  Week 1    X        X        X          X
April  Week 2    X        X        X          X
April  Week 3    X        X        X          X
April  Week 4    X        X        X          X
Lessons Learnt
After going through the complete statistical analysis process with a very large dataset, we felt
that building a model early on would have been a good way to confirm some of our initial hunches
before carrying out the initial analysis of the dataset. Given the size of the dataset, we could
also have taken a small sample of the data to do this modelling beforehand. Due to the time spent
on the other items mentioned above, we did not get to the modelling phase until much later.

Data Preparation
The steps we took to prepare our data for analysis are described below.
Data Access
The dataset is freely available for anyone to download from the following link:
https://www.yelp.com/dataset_challenge. Yelp has consolidated a big portion of its database of
reviews, businesses and users into a dataset of approximately 4GB in size (when uncompressed),
which anyone can use to perform analysis. We downloaded the compressed dataset (2.5GB in size)
to access the data. The dataset consists of data relating to businesses, reviews, users, check-ins
and tips collected through the popular Yelp app. All of these data files are formatted as
line-delimited JSON files (a.k.a. NDJSON). The largest part of the dataset is the review data
file, which contains 4 million rows, followed by the user data file with 1 million rows and the
business data file with 100 thousand rows. Given that the dataset is already well prepared for
consumption, we did not need to access additional datasets.
Data Consolidation
We first tried to import the JSON-formatted data files directly into R using a third-party
library called jsonlite, but faced serious performance issues due to the format of the data. We
therefore converted the dataset into CSV format, which allowed for ease of access. We realised
that MongoDB (a NoSQL database) could convert the dataset into CSV files because it handles JSON
seamlessly. Using the mongoexport command, we were able to convert all the data files into CSV
format while choosing the variables we needed in the export. During this process we also
eliminated certain variables that we thought we either wouldn’t need for this project or realized
would be beyond our scope to deal with. These variables are mostly textual data such as reviews,
addresses and zip codes, or location-based data such as latitudes and longitudes. We also
eliminated variables whose data types we don’t yet know how to deal with in either R or Python.
These include variables like categories and attributes in the business data file, which are of
the array data type. Though R reads such a variable as a list data type, we are yet to understand
how to use it in our analysis. Below is an image showing the MongoDB commands we used to
pull the data that we needed:
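As a rough illustration of the same idea (streaming line-delimited JSON and keeping only selected fields in a CSV), here is a minimal Python sketch. It is not the mongoexport workflow described above. The field names (business_id, stars, review_count, state, categories) and the two sample records are assumptions made up for the example.

```python
import csv
import io
import json

def ndjson_to_csv(lines, fields, out):
    """Stream line-delimited JSON records and write only the chosen fields as CSV."""
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    count = 0
    for line in lines:
        line = line.strip()
        if not line:                 # skip blank lines between records
            continue
        record = json.loads(line)    # one JSON object per line
        # Keep only the requested fields; array fields such as categories are dropped
        writer.writerow({field: record.get(field, "") for field in fields})
        count += 1
    return count

# Hypothetical sample records (made-up values, not taken from the Yelp files)
sample = io.StringIO(
    '{"business_id": "b1", "stars": 4.0, "review_count": 12, "state": "WI", "categories": ["Pizza"]}\n'
    '{"business_id": "b2", "stars": 3.5, "review_count": 7, "state": "AZ", "categories": ["Bars"]}\n'
)
out = io.StringIO()
n_rows = ndjson_to_csv(sample, ["business_id", "stars", "review_count", "state"], out)
csv_text = out.getvalue()
print(csv_text)
```

Because each record is parsed on its own line, the whole file never has to be held in memory, which is the property that makes this format manageable at the sizes quoted above.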
Data Cleaning
After importing the dataset into R, we had 3 different data frames containing the variables that
we had chosen during the export from MongoDB. We identified variables like business_id and
user_id in the review data frame, which showed us the potential to merge these data frames into
one. But before doing that, we had to ensure that the data was clean in terms of erroneous data
and missing values. After running the na.omit function in R to deal with missing values, we
realized that the data had already gone through some rigorous cleaning by the Yelp developer
team. To ensure there was no erroneous data, we reviewed the structure of the data frames by
running the str command in R.
Data Transformation
Most of our data frame contains numerical data pertaining to ratings, review counts, number of
fans, number of compliments, number of votes etc., which doesn’t require any transformation as
such since it is already continuous in nature. We also identified variables in the user data
frame that could be aggregated into one. Variables like compliment_hot, compliment_more,
compliment_profile, compliment_cute, compliment_list, compliment_note, compliment_plain,
compliment_cool, compliment_funny, compliment_writer, compliment_photos could be
aggregated into a single user_compliments variable, and variables like useful, funny and cool
(which are essentially upvotes that users received for their reviews) could be aggregated into a
single user_votes variable. Since this was just our initial hunch, we wanted to perform a data
reduction procedure to confirm it. Finally, the data frames showed us the potential to
merge the review, business and user data frames into a single data frame by using the
business_id and user_id variables. We performed two left joins in R to achieve this (after
renaming certain conflicting variables in all the data frames). After doing this, we realized
that we no longer needed the respective id variables, so we dropped them.
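The merge described above can be sketched in a few lines of plain Python. This is only an illustration of two successive left joins followed by dropping the id columns; the original work was done in R, and the rows and the left_join helper here are hypothetical.

```python
def left_join(left_rows, right_rows, key):
    """Left join two lists of dicts on `key` (assumes `key` is unique on the right side)."""
    lookup = {row[key]: row for row in right_rows}
    right_cols = [col for col in (right_rows[0] if right_rows else {}) if col != key]
    joined = []
    for row in left_rows:
        match = lookup.get(row[key])
        merged = dict(row)
        for col in right_cols:
            # Unmatched left rows are kept, with None standing in for R's NA
            merged[col] = match[col] if match is not None else None
        joined.append(merged)
    return joined

# Made-up rows; column names follow the renamed variables used in this report
reviews = [
    {"business_id": "b1", "user_id": "u1", "review_stars": 5},
    {"business_id": "b2", "user_id": "u2", "review_stars": 2},
]
businesses = [{"business_id": "b1", "business_stars": 4.0, "business_review_count": 12}]
users = [
    {"user_id": "u1", "user_fans": 3, "user_average_stars": 4.2},
    {"user_id": "u2", "user_fans": 0, "user_average_stars": 3.1},
]

merged = left_join(left_join(reviews, businesses, "business_id"), users, "user_id")
# Drop the id columns once both joins are done
final = [{k: v for k, v in row.items() if k not in ("business_id", "user_id")} for row in merged]
print(final)
```

A left join keeps every review even when its business or user is missing from the other table, so the number of rows in the merged data equals the number of reviews.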
Data Reduction
To confirm our hunch about the compliments and votes fields in the user data frame, we ran a
principal component analysis to see if we were actually dealing with just one variable in each
case:
1. PCA results using compliment_hot, compliment_more, compliment_profile, compliment_cute,
compliment_list, compliment_note, compliment_plain, compliment_cool, compliment_funny,
compliment_writer, compliment_photos from the user data frame.
The code we used to run this procedure is shown below.
2. PCA results using useful, funny, cool from the user data frame.
The code we used to run this procedure is shown below.
The results of the two PCA procedures have indeed shown us that we are actually dealing with
just one variable in both cases. To complete the reduction procedure, we went ahead and merged
all the compliments variables into a single user_compliments variable by doing a summation and
all the votes variables into a single user_votes variable by doing a summation.
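To make the reasoning concrete, the sketch below estimates how much of the total variance falls on the first principal component of three correlated columns, and then collapses them by summation. It is a minimal pure-Python illustration, not the R procedure used for this report: the vote counts are made up, and power iteration on the correlation matrix stands in for a full PCA routine.

```python
import math

def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length numeric columns."""
    n = len(columns[0])
    standardized = []
    for col in columns:
        mean = sum(col) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / (n - 1))
        standardized.append([(x - mean) / sd for x in col])
    p = len(columns)
    return [[sum(a * b for a, b in zip(standardized[i], standardized[j])) / (n - 1)
             for j in range(p)] for i in range(p)]

def first_component_share(corr, iterations=200):
    """Share of total variance carried by the first principal component.

    Uses power iteration to approximate the largest eigenvalue of the
    correlation matrix; the trace of a correlation matrix equals p.
    """
    p = len(corr)
    vec = [1.0 / math.sqrt(p)] * p
    eigenvalue = 0.0
    for _ in range(iterations):
        product = [sum(corr[i][j] * vec[j] for j in range(p)) for i in range(p)]
        eigenvalue = math.sqrt(sum(x * x for x in product))
        vec = [x / eigenvalue for x in product]
    return eigenvalue / p

# Hypothetical vote counts for six users (made-up numbers, not Yelp data)
useful = [0, 1, 2, 5, 9, 14]
funny = [0, 0, 1, 3, 5, 8]
cool = [0, 1, 1, 4, 6, 9]

corr = correlation_matrix([useful, funny, cool])
share = first_component_share(corr)
print(f"Variance explained by the first component: {share:.1%}")

# When one component dominates, the columns can be collapsed by summation
user_votes = [u + f + c for u, f, c in zip(useful, funny, cool)]
```

A share close to 100% is the pattern that would support treating the columns as a single underlying variable.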
To further reduce our sample size, we decided to focus only on a single state in the United States
for the rest of our analysis. We randomly chose Wisconsin as the state we would focus on and
created a subset of the final data frame using the state variable as a filter.
Finally, we removed 2 date variables, the cities variable and the business_name variable, leaving
us with only numerical/ordinal data types, which makes more sense given the numerical nature
of our dataset itself.

Descriptive Statistics
Here’s a table showing some basic descriptive statistics of our final data frame:
Variable               Mean    Median  Min  Max     Std Dev  Skew   Kurtosis
review_stars           3.723   4       1    5       1.33     -0.82  -0.53
useful                 1.008   0       0    1128    1.7      6.82   131.2
funny                  0.4195  0       0    632     1.09     19.48  809.42
cool                   0.5262  0       0    513     1.18     14.8   495.64
business_stars         3.726   4       1    5       0.67     -0.77  1.16
business_review_count  326.1   97.0    3    6414    180.99   3.94   19.39
user_review_count      125.6   25      0    11284   198.06   6.8    131.59
user_fans              10.94   1       0    4691    33.44    19.55  682.39
user_average_stars     3.73    3.79    1    5       0.71     -1.11  2.76
user_compliments       206.1   2       0    266318  781.03   40.66  2675.61
user_votes             665.2   5       0    529730  2604.22  20.81  705.91
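For readers who want to see how such statistics are computed, here is a minimal Python sketch. It assumes the simple moment-based definitions of skewness and excess kurtosis; the package used for this report may apply a different small-sample correction, so values can differ slightly. The ratings list is made up for illustration.

```python
import statistics

def describe(values):
    """Basic descriptive statistics, with moment-based skewness and excess kurtosis."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n   # third central moment
    m4 = sum((x - mean) ** 4 for x in values) / n   # fourth central moment
    return {
        "mean": mean,
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "std_dev": statistics.stdev(values),        # sample standard deviation
        "skew": m3 / m2 ** 1.5,                     # negative means a longer left tail
        "kurtosis": m4 / m2 ** 2 - 3,               # excess kurtosis (normal = 0)
    }

# Hypothetical ratings concentrated at the high end (made-up values)
review_stars = [5, 5, 4, 5, 3, 4, 1, 5, 2, 4]
summary = describe(review_stars)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

A negative skew, as in this toy sample, corresponds to ratings piling up at the high end with a tail towards low ratings.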

We have essentially chosen the 11 numerical variables that we think might matter the most in
our model. After filtering our data down to observations from Wisconsin, we were left with
88778 observations.
Here are histograms for our target variables:
1. review_stars
The distribution is clearly left skewed, with the highest concentration at 5. This is very likely
due to the nature of internet rating systems, wherein users mostly tend to vote in extremes. In
our case, however, the concentration is clearly greater at one extreme. Further investigation
could reveal more details about the nature of this variable.

2. business_stars
This histogram shows that the data is close to normal, but still left skewed like the previous
one. The concentration is highest at the 4 level, showing that the average number of stars a
business receives is around 4. We can also see that very few businesses are able to maintain a
rating of 5. This makes sense in the real world, where only a few restaurants are truly
considered the best, with all or most of the users giving a full rating of 5 along with their
reviews. But this still doesn’t tell us anything about the number of reviews each business
received. There could be businesses with very few reviews, all of which are positive. Comparing
such a business with a much larger business with many more reviews, which could have lost its
standing at 5 due to only a few bad reviews, would be unfair.

3. business_review_count
This histogram shows that the data is heavily right skewed. There are a lot of businesses with
very few reviews, and as the number of reviews increases, the number of businesses decreases.
This clearly shows that only a few businesses are truly popular on Yelp. This can also be
considered a real-world bias, in which customers usually tend to trust the more popular
businesses when purchasing a product. These kinds of businesses have a greater advantage over
newly born businesses or businesses that are just entering the market. This is often referred to
as the “first mover advantage” in the industry, wherein the business has been around for a
while, allowing it to gain the sort of popularity that it has.
Data Dictionary
This is the dictionary of our final merged dataset:
Variable               Data type  Description                              Source
review_stars           int        Star rating rounded to half stars        https://www.yelp.com/dataset_challenge
useful                 int        Number of useful votes sent by the user  https://www.yelp.com/dataset_challenge
funny                  int        Number of funny votes sent by the user   https://www.yelp.com/dataset_challenge
cool                   int        Number of cool votes sent by the user    https://www.yelp.com/dataset_challenge
business_stars         int        Number of stars the business has         https://www.yelp.com/dataset_challenge
business_review_count  int        Number of reviews                        https://www.yelp.com/dataset_challenge
user_review_count      int        Number of reviews                        https://www.yelp.com/dataset_challenge
user_fans              int        Number of fans user has                  https://www.yelp.com/dataset_challenge
user_average_stars     int        Number of average stars user has given   https://www.yelp.com/dataset_challenge
user_compliments       int        Number of compliments user has given     https://www.yelp.com/dataset_challenge
user_votes             int        Number of votes the user has given       https://www.yelp.com/dataset_challenge

Modelling Techniques
The goal of our project is to assess the effect of different factors on the ratings (on a scale of
1 to 5) that a business receives on the Yelp app. The main idea is to see if there is a
relationship between user behavior (based on the variables present in this dataset), businesses
and the reviews that users write for these businesses. We have chosen 3 target variables because
we feel they are important for understanding how online rating systems work. Here’s a breakdown
of the 3 target variables we have chosen and what we hope to achieve with the predictor
variables we have:
1. review_stars - The rating associated with every individual review
a. Does the number of votes (useful/funny/cool) affect the overall rating of a
review? Do users upvote the good reviews or the bad reviews?
b. Does the popularity (compliments/votes/fans) of users affect the overall rating of
a review? Are popular users strict or lenient with their ratings?
c. Does the history of a user’s rating behavior (average rating) affect the overall
rating of a review?
d. Are the active users (number of reviews given by a user) more strict or lenient
with their reviews? Is there an association here?
e. Does the current standing of a business on the app (rating, review count), affect
what rating a user is going to give a business?
2. business_stars - The average rating of each individual business based on the reviews it
received
a. Does the number of votes (useful/funny/cool) that each review has received for a
particular business affect the overall rating of the business?
b. Does the popularity (compliments/votes/fans) of the users who have rated a
particular business affect the overall rating of the business? Are popular users
associated with highly rated businesses?
c. Does the history of a user’s rating behavior (average rating) who has rated a
particular business affect the overall rating of the business?
d. Do the active users (number of reviews given by a user) play a part in determining
the overall rating of a business?
e. Does the rating of each review that a business has received affect the overall
rating of a business?
3. business_review_count - The total number of reviews each business has received over
time.
a. Does the number of votes (useful/funny/cool) that each review has received for a
particular business affect the number of reviews a business receives?
b. Does the popularity (compliments/votes/fans) of the users who have rated a
particular business affect the number of reviews a business receives? Are popular
users associated with popular businesses?
c. Does the history of a user’s rating behavior (average rating) who has rated a
particular business affect the number of reviews a business receives?
d. Do the active users (number of reviews given by a user) play a part in determining
the number of reviews a business receives?
e. Does the rating of each review that a business has received affect the number of
reviews a business receives?
Now, given that all of our variables (targets and predictors) are numeric in nature and that we
are trying to understand the correlation/association between these variables as delineated
above, we will be building a regression model. We believe that a regression model can not only
reveal the truly significant predictor variables, but also give us an equation that can be used
to predict a pattern associated with this dataset, which can help a great deal in understanding
the nature of these target variables. The main assumption here is that all or most of our chosen
predictor variables are linearly related to at least one of our target variables. The model will
allow us to check this. Also, identifying only a few predictor variables (based on significance)
from the pool would be a valuable insight.
The second modelling technique we plan to use is a neural network model. Along with giving
us information about the correlations and associations that exist between our chosen variables,
this model can also help a great deal in predicting the final value of our target variables based
on the trends and patterns present in the dataset. The model itself would produce these predicted
values. Our presumption is that the more hidden layers (the neuron counterpart of an ANN) we
insert into the model, the better the results (in terms of computation and accuracy) we are going
to receive. This model also produces the “weights” associated with each of the indicator
variables, which are essentially similar to the coefficients we observe in the result of a
regression model. This information can again inform us about the most significant of the given
indicator variables.
Given that we have 3 target variables, we are planning to implement both of the modelling
techniques on each of the target variables and choose the best (the most successful) model
for each of the target variables.
Regression
For the regression models we plan to create for the target variables, the dataset should satisfy
the assumptions of linearity, absence of collinearity, homoscedasticity and normality of
residuals. The tables below show the tests we performed for each of the target variables:
[Table: Collinearity (assessment of VIF values) for the target variables review_stars,
business_stars and business_review_count]
The VIF values are well below 10 for all of the target variables, so we have no issues with
collinearity. The collinearity assumption is thus verified for all the target variables.
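As a reminder of what the VIF measures: for predictor j, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. Equivalently, the VIF values are the diagonal of the inverse of the predictors' correlation matrix. The pure-Python sketch below uses that second form. It is an illustration with made-up data and hypothetical helper names, not the R output assessed in this report.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting for a small square matrix."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vif(predictors):
    """VIF per predictor: diagonal of the inverse correlation matrix."""
    names = list(predictors)
    corr = [[pearson(predictors[a], predictors[b]) for b in names] for a in names]
    inverse = invert(corr)
    return {name: inverse[i][i] for i, name in enumerate(names)}

# Made-up predictor values for eight reviews (not Yelp data)
data = {
    "useful": [0, 1, 2, 5, 9, 14, 3, 7],
    "user_fans": [2, 0, 5, 1, 3, 0, 8, 4],
    "business_stars": [3.5, 4.0, 2.5, 4.5, 3.0, 4.0, 3.5, 5.0],
}
values = vif(data)
for name, value in values.items():
    flag = "ok" if value < 10 else "possible collinearity"
    print(f"{name}: VIF = {value:.2f} ({flag})")
```

A VIF of 1 means a predictor is uncorrelated with the others; the threshold of 10 used in the text corresponds to R_j^2 = 0.9.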
[Table: Normality of residuals (assessment of Q-Q plot) for the target variables review_stars,
business_stars and business_review_count]
From the graphs above, it’s clear that the normality assumption can only be verified for the
target variable review_stars. This can be seen in the alignment of the Q-Q plots. Only the first
graph shows normality, while the other two either have too many outliers
(business_review_count) or aren’t as close to normal as they should be (business_stars).
[Table: Constant variance (assessment of scatter plot) for the target variables review_stars,
business_stars and business_review_count]
From the graphs above, it’s clear that the homoscedasticity assumption can only be verified for
the target variables review_stars and business_stars. This can be seen in the scatter plots,
where we observe a similar number of data points on both sides of the regression line. The third
plot, however, fails to show the same trend.
Linearity (Assessment of correlation procedure)
The correlation matrix in the table above shows the significance (p values = 0) of all of the
variables in our dataset. Therefore, the linearity assumption is verified for all of our target
variables.
Though not all of the assumptions are satisfied for all of the target variables, we will still
pursue the regression models for the review_stars, business_stars and business_review_count
target variables. We will assess these models based on the results we get from the regression
procedure.
Neural Network
The main requirement of a neural network model is that missing values are removed. We
ensured this much earlier on when we ran the na.omit procedure in R, which removed all the
missing values. Further, the original dataset itself was already clean and well prepared due to
a lot of preprocessing done by the developers at Yelp. The dataset is therefore ready for a
neural network procedure. However, a neural network is generally considered a black box model,
which makes interpretation of the model difficult. The plan is to try different hidden layer
sizes and activation methods to arrive at the model with the lowest error possible for the given
target variables.
Data Splitting and Subsampling
To make an honest assessment, we want to do a 60-40 split of the dataset. This means that
we would have 60% of the data for our training dataset and 40% of the data for our validation
dataset. Given the massive size of our dataset, it makes sense for us to follow best practice and
use this ratio, since it’s considered to give a good assessment of most models in the real
world. The reason we chose a larger training share is to get better results from our models,
because more training data improves the predictive capabilities of most models. A lower
testing share usually helps in assessing the error rate more accurately. However, we don’t plan
to create a testing dataset, again owing to the size and scope of our main dataset. Therefore,
considering the size of our dataset and the predictive analytics we hope to achieve with this
project, we are moving ahead with the 60-40 split. The image below shows the code we used to
split the data for our regression model:
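For reference, a 60-40 split of this kind can be sketched as follows. This is a minimal Python illustration rather than the original code: the split_rows helper, the fixed seed and the stand-in row list are assumptions, and only the row count (88778) comes from the report.

```python
import random

def split_rows(rows, train_fraction=0.6, seed=42):
    """Shuffle a copy of the rows and cut it into training and validation parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # fixed seed so the split is reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(88778))   # stand-in for the 88778 Wisconsin observations
train, validation = split_rows(rows)
print(len(train), len(validation))
```

Shuffling before cutting matters: if the rows are ordered (for example by business), taking the first 60% without shuffling would give the two parts different distributions.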
Here is an assessment of the data splits relating to each of our target variables:
1. review_stars
2. business_stars
3. business_review_count
Comparing the mean, standard deviation, median, minimum and maximum statistics from the
images above, a clear uniformity can be noticed. The split is remarkably even, with the values
of these statistics across the split datasets being very close or, in most cases, exactly the
same.

Data Modelling
Based on the assessments and subsampling done above, we are going to create regression and
neural network models using review_stars, business_stars and business_review_count as our
target variables. The idea is to understand the effects of our chosen predictor variables on our
chosen target variables.
Model A (Regression)
Target variable: review_stars (The rating associated with each individual review)
Predictor variables: useful, funny, cool, user_compliments, user_votes, user_fans,
user_average_stars, user_review_count, business_stars, business_review_count
Here is an image of the model we built by running the regression procedure in R:
Interpretation
We can see from the results above that all of our indicator variables except for
business_review_count are significant at a 0.05 level of significance. We can also see that
user_compliments and user_votes are significant at a 0.01 level of significance, while useful,
funny, cool, user_review_count, user_average_stars and business_stars are all significant at a
0.001 level of significance. From the looks of it, our hunch about most of our chosen predictor
variables is true. All these variables definitely have an effect on the review_stars target
variable. But to what extent, and what do these results mean? This is delineated below:
1. The variable useful is significant at a 0.001 level of significance. From the coefficient, we
can say that for a unit increase in useful, there is a 0.1351 decrease in review_stars. In
other words, when a review has more useful votes, the rating of the review tends to
decrease. This relationship is actually contrary to what we believed might be a positive
linear relationship. We expected a review with a higher rating to have more useful
votes, but it seems that users find the stricter reviews more useful than the more lenient
ones.
2. The variable funny is significant at a 0.001 level of significance. From the coefficient, we
can say that for a unit increase in funny, there is a 0.1576 decrease in review_stars. In
other words, when a review has more funny votes, the rating of the review tends to
decrease. This relationship is in line with what we believed would be a negative linear
relationship. It makes sense for a review with a lower rating to have more funny votes
since users find the stricter reviews to be funnier than the more lenient ones. This could
be due to the more sarcastic tone users might use with their bad reviews.

3. The variable cool is also significant at a 0.001 level of significance. From the coefficient,
we can say that for a unit increase in cool, there is a 0.2989 increase in review_stars. In
other words, when a review has more cool votes, the rating of the review tends to
increase. This relationship is in line with what we believed might be a positive
linear relationship. It makes sense for a review with a higher rating to have more cool
votes.
4. The variable user_review_count is also significant at a 0.001 level of significance. From
the coefficient, we can say that for a unit increase in user_review_count, there is a
.000109 increase in review_stars. In other words, when a user has written more reviews,
the rating of their review tends towards a higher value. This relationship is in line
with what we believed might be a positive linear relationship. It makes sense for an active
reviewer to be more lenient with their reviews.
5. The variable user_average_stars is also significant at a 0.001 level of significance. From
the coefficient, we can say that for a unit increase in user_average_stars, there is a .7706
increase in review_stars. In other words, when a user has a higher average rating score,
the rating of their review tends towards a higher value. This relationship is in line
with what we believed might be a positive linear relationship. It makes sense for a user
with a higher average rating to award higher ratings in their reviews.
6. The variable user_compliments is significant at a 0.01 level of significance. From the
coefficient, we can say that for a unit increase in user_compliments, there is a .00002133
decrease in review_stars. In other words, when a user has received more compliments
(a popular user, in other words), the rating of their review tends towards a lower value.
This relationship is in line with what we believed might be a negative linear
relationship. It makes sense for a popular reviewer to be stricter with their reviews. The
popularity of a user is definitely an influencer of the rating they give in their reviews.
7. The variable user_fans is significant at a 0.05 level of significance. From the coefficient,
we can say that for a unit increase in user_fans, there is a .0003611 decrease in
review_stars. In other words, when a user has more fans (a popular user, in other
words), the rating of their review tends towards a lower value. This relationship is
in line with what we believed might be a negative linear relationship. It makes
sense for a popular reviewer to be stricter with their reviews. The popularity of a user is
definitely an influencer of the rating they give in their reviews.
8. The variable user_votes is significant at a 0.01 level of significance. From the coefficient,
we can say that for a unit increase in user_votes, there is a .000005412 increase in
review_stars. In other words, when a user has more votes (the users whose reviews have
received more votes), the rating of their review tends towards a higher value. This
relationship is in line with what we believed might be a positive linear
relationship.
9. The variable business_stars is significant at a 0.05 level of significance. From the
coefficient, we can say that for a unit increase in business_stars, there is a 0.7023 increase
in review_stars. In other words, when a business has a higher average rating, the rating of
the reviews that the business receives tends towards a higher value. This relationship is
in line with the positive linear relationship we expected. It makes
sense for a business with a higher rating to receive more such positive reviews.

10. The variable business_review_count is not significant. The number of reviews a
business receives does not influence the rating of the reviews it receives. This makes
sense since we can’t say that popular businesses (in terms of reviews) receive a
higher or a lower rating. It depends on what the user experienced when the review
was given.
Therefore, it is clear that the users definitely have a big role to play in deciding the rating of a
review given to any business on the Yelp app. Considering the social nature of apps like Yelp,
this makes a lot of sense.
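To make the "unit increase" reading of the coefficients concrete, here is a minimal Python sketch. It uses only the Model A coefficients quoted in items 5 to 9 above; the intercept and the example feature values are hypothetical placeholders, so the absolute predictions are not meaningful and only the difference between them is.

```python
# Coefficients quoted in items 5-9 above (change in review_stars per unit increase).
COEFFICIENTS = {
    "user_average_stars": 0.7706,
    "user_compliments": -0.00002133,
    "user_fans": -0.0003611,
    "user_votes": 0.000005412,
    "business_stars": 0.7023,
}
INTERCEPT = 0.0  # hypothetical placeholder; it cancels out in the difference below


def predict_review_stars(features):
    """Linear prediction: intercept plus the sum of coefficient * value."""
    return INTERCEPT + sum(COEFFICIENTS[name] * value for name, value in features.items())


# Hypothetical reviewer/business pair (values are illustrative only).
baseline = {
    "user_average_stars": 3.5,
    "user_compliments": 10,
    "user_fans": 5,
    "user_votes": 200,
    "business_stars": 4.0,
}
# Same pair with user_average_stars raised by one unit and everything else held fixed.
bumped = dict(baseline, user_average_stars=baseline["user_average_stars"] + 1)

effect = predict_review_stars(bumped) - predict_review_stars(baseline)
print(round(effect, 4))  # the difference equals the user_average_stars coefficient
```

Because every other predictor is held fixed, the difference between the two predictions is exactly the coefficient, which is what "for a unit increase in user_average_stars" means in the interpretation.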
Model B (Neural Network)
Target variable: review_stars (The rating associated with each individual review)
Predictor variables: useful, funny, cool, user_compliments, user_fans, user_average_stars,
user_votes, business_stars, business_review_count (We removed the user_review_count variable
after observing better results without it)
Activation: RELU
Hidden Layers: 200
After experimenting with the activation type and number of hidden layers, we would like to
present the best model for our target variable. Here is an image of the model we built by running
the neural network procedure in Python:

Interpretation
Since a neural network model is a black box, we won’t be able to say much about the specific
relationship that exists between the target variable and the indicator variables. We can, however,
assess the error rate produced by the model along with the R² value to determine the efficiency
of the model. In this particular model we can see that the mean absolute error is at 0.7309 and the
mean square error is at 0.94. On a 1-to-5 star scale, these values are signs of a fairly low error
rate in the model. The R² value is 0.45 (or 45%), which is another indication of this being a good
model. This model can be used for further predictive analysis.
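For reference, the three measures quoted above can be computed as in the following minimal sketch. It uses the standard definitions of mean absolute error, mean square error and R²; the ratings and predictions are hypothetical and are not taken from the Yelp data or from our model output.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares around the mean)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical review_stars values and model predictions (illustrative only).
actual = [5, 4, 1, 3, 2]
predicted = [4.5, 4.0, 2.0, 3.5, 2.0]

mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
r2 = r_squared(actual, predicted)
print(mae, mse, r2)
```

The mean absolute error and mean square error are expressed in the units of the target (stars, or stars squared), while R² is unit-free, which is why R² is the measure used later to compare models.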
Here are a few predicted values from the model, as output by Python:
The patterns associated with the given set of indicator variables and their respective values can
be observed by comparing them to the predicted values of the variable review_stars. These
predictions can definitely be considered as accurate considering the low error rate of the model.

Model C (Regression)
Target variable: business_stars (The average rating of a business)
Predictor variables: useful, funny, cool, user_compliments, user_fans, user_votes,
user_average_stars, user_review_count, review_stars, business_review_count
Interpretation
We can see from the results above that all of our indicator variables except for
user_average_stars, user_compliments and user_votes are significant at a 0.05 level of
significance. We can also see that user_fans is significant at a 0.01 level of significance while
review_stars, useful, funny, cool, user_review_count and business_review_count are all
significant at a 0.001 level of significance. From the looks of it, our hunch about most of our
chosen predictor variables is true. All these variables definitely have an effect on the
business_stars target variable. But to what extent and what do these results mean? This is
delineated below:
1. The variable review_stars is significant at a 0.001 level of significance. From the
coefficient, we can say that for a unit increase in review_stars, there is a 0.2374 increase
in business_stars. In other words, when a business has more reviews with higher ratings,
the rating of the business tends to increase. This relationship is in line with what we
believed might be a positive linear relationship.
2. The variable useful is significant at a 0.001 level of significance. From the coefficient, we
can say that for a unit increase in useful, there is a 0.11884 increase in business_stars. In
other words, when a business has more reviews with higher useful votes, the rating of the
business tends to increase. This relationship is in line with what we believed might be a
positive linear relationship.
3. The variable funny is significant at a 0.001 level of significance. From the coefficient, we
can say that for a unit increase in funny, there is a 0.1967 decrease in business_stars. In
other words, when a business has more reviews with higher funny votes, the rating of the
business tends to decrease. This relationship is in line with what we believed would be a
negative linear relationship.
4. The variable cool is also significant at a 0.001 level of significance. From the coefficient,
we can say that for a unit increase in cool, there is a 0.01167 increase in business_stars.
In other words, when a business has more reviews with higher cool votes, the rating of
the business tends to increase. This relationship is in line with what we believed
might be a positive linear relationship.
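5. The variable user_review_count is significant at a 0.001 level of significance. From the
coefficient, we can say that for a unit increase in user_review_count, there is a
0.00004964 decrease in business_stars. In other words, when a business has more reviews
written by users who have themselves written more reviews, the rating of the business
tends towards a lower value.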
6. The variable user_average_stars is not significant. The average rating of users who
write reviews for businesses doesn’t affect the average rating of a business. This is
surprising, since our hunch was that there would be a relationship between these two
variables.
7. The variable user_compliments is not significant at all. The number of compliments
received by users who write reviews for businesses doesn’t affect the average rating of a
business.
8. The variable user_fans is significant at a 0.05 level of significance. From the coefficient,
we can say that for a unit increase in user_fans, there is a 0.0002323 increase in
business_stars. In other words, when a business has more reviews from users who have
more fans (that is, more popular users), the rating of the business tends towards a
higher value. This relationship is in line with what we believed might be a
positive linear relationship.
9. The variable user_votes is not significant. The number of votes received
by users who write reviews for businesses doesn’t affect the average rating of a
business.

10. The variable business_review_count is significant at a 0.001 level of significance. From
the coefficient, we can say that for a unit increase in business_review_count, there is a
0.0004423 increase in business_stars. In other words, when a business has a higher
number of reviews, the rating of the business tends towards a higher value. This
relationship is in line with what we believed might be a positive linear
relationship.
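Several of the coefficients above look very small because the underlying variables are counts that vary by tens or hundreds, not by single units. The sketch below multiplies three of the Model C coefficients quoted above by a change of a more typical size. The sizes of those changes (1 star, 100 reviews, 100 fans) are our own illustrative assumptions, not figures from the data.

```python
# Model C coefficients quoted above (change in business_stars per unit increase).
MODEL_C_COEFFICIENTS = {
    "review_stars": 0.2374,
    "business_review_count": 0.0004423,
    "user_fans": 0.0002323,
}

# Hypothetical changes chosen only to illustrate scale.
ILLUSTRATIVE_CHANGE = {
    "review_stars": 1,
    "business_review_count": 100,
    "user_fans": 100,
}


def implied_change(coefficients, deltas):
    """Change in the target implied by each predictor moving by its delta."""
    return {name: coefficients[name] * deltas[name] for name in coefficients}


changes = implied_change(MODEL_C_COEFFICIENTS, ILLUSTRATIVE_CHANGE)
for name, change in changes.items():
    print(f"{name}: {change:+.4f} business_stars")
```

Read this way, 100 additional reviews correspond to roughly 0.044 of a star, which is small compared with the effect of a one-star change in review_stars, even though both predictors are statistically significant.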
Model D (Neural Network)
Target variable: business_stars (The average rating of each business)
user_votes, review_stars, business_review_count (We removed the user_review_count variable
after observing better results without it)
Activation: RELU
Hidden Layers: 200

Interpretation
Since a neural network model is a black box, we won’t be able to say much about the specific
relationship that exists between the target variable and the indicator variables. We can, however,
assess the error rate produced by the model along with the R² value to determine the efficiency
of the model. In this particular model we can see that the mean absolute error is at 0.4163 and the
mean square error is at 0.3047. On a 1-to-5 star scale, these values are signs of a low error rate
in the model. The R² value is 0.31 (or 31%), which is another indication of this being a reasonably
good model. This model can be used for further predictive analysis.
Here are a few predicted values from the model, as output by Python:
The patterns associated with the given set of indicator variables and their respective values can
be observed by comparing them to the predicted values of the variable business_stars. These
predictions can definitely be considered as accurate considering the low error rate of the model.

Model E (Regression)
Target variable: business_review_count (The number of reviews a business receives)
Predictor variables: useful, funny, cool, user_compliments, user_fans, user_votes,
user_average_stars, user_review_count, review_stars, business_stars
Interpretation
We can see from the results above that all of our indicator variables except for review_stars,
user_average_stars and user_fans are significant at a 0.05 level of significance. We can also see
that funny is significant at a 0.05 level of significance, user_average_stars is significant at a 0.1
level of significance, while useful, cool, user_review_count, user_compliments, user_votes and
business_stars are all significant at a 0.001 level of significance. From the looks of it, our hunch
about most of our chosen predictor variables is true. All these variables definitely have an effect
on the business_review_count target variable. But to what extent and what do these results
mean? This is delineated below:
1. The variable review_stars is not significant. The rating of reviews received by a
business does not affect the number of reviews a business receives.
2. The variable useful is significant at a 0.001 level of significance. From the coefficient, we
can say that for a unit increase in useful, there is a 6.662 decrease in
business_review_count. In other words, when a business has more reviews with higher
useful votes, the number of reviews the business receives tends to decrease.
3. The variable funny is significant at a 0.05 level of significance. From the coefficient, we
can say that for a unit increase in funny, there is a 2.005 increase in
business_review_count. In other words, when a business has more reviews with higher
funny votes, the number of reviews the business receives tends to increase.
4. The variable cool is significant at a 0.001 level of significance. From the coefficient, we
can say that for a unit increase in cool, there is a 4.23 increase in business_review_count.
In other words, when a business has more reviews with higher cool votes, the number of
reviews the business receives tends to increase.
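5. The variable user_review_count is significant at a 0.001 level of significance. From the
coefficient, we can say that for a unit increase in user_review_count, there is a
0.03024 increase in business_review_count. In other words, when a business has more
reviews written by users who have themselves written more reviews, the number of
reviews the business receives tends to increase.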
6. The variable user_average_stars is significant at a 0.1 level of significance. For a unit
increase in user_average_stars, there is a 1.659 increase in business_review_count. In
other words, when a business has more users with a higher average rating, the number of
reviews the business receives tends to increase.
7. The variable user_compliments is significant at a 0.001 level of significance. For a unit
increase in user_compliments, there is a 0.004867 increase in business_review_count. In
other words, when a business has more users with a higher number of compliments, the
number of reviews the business receives tends to increase.
8. The variable user_fans is not significant at all. The number of fans of a user who has
reviewed a business does not affect the number of reviews a business receives.
9. The variable user_votes is significant at a 0.001 level of significance. For a unit
increase in user_votes, there is a 0.002149 decrease in business_review_count.
In other words, when a business has more users with a higher number of votes, the number
of reviews the business receives tends to decrease.
10. The variable business_stars is significant at a 0.001 level of significance. From the
coefficient, we can say that for a unit increase in business_stars, there is a .4211 increase
in business_review_count. In other words, when a business has a higher average rating,
the number of reviews the business receives tends to increase.
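The significance levels quoted throughout these interpretations (0.001, 0.01, 0.05 and 0.1) are thresholds that each predictor's p-value is compared against. The sketch below shows that mapping. The p-values in it are hypothetical, chosen only so that the resulting levels match the ones stated above for Model E; they are not the p-values from our regression output.

```python
# Conventional significance thresholds, strictest first.
THRESHOLDS = [0.001, 0.01, 0.05, 0.1]


def significance_level(p_value):
    """Return the strictest threshold the p-value falls below, or None if not significant."""
    for level in THRESHOLDS:
        if p_value < level:
            return level
    return None


# Hypothetical p-values (illustrative only, not taken from the regression output).
p_values = {
    "useful": 0.0002,
    "funny": 0.03,
    "user_average_stars": 0.08,
    "user_fans": 0.4,
}

levels = {name: significance_level(p) for name, p in p_values.items()}
for name, level in levels.items():
    label = f"significant at {level}" if level is not None else "not significant"
    print(f"{name}: {label}")
```

A predictor reported as "significant at a 0.1 level", such as user_average_stars here, would not pass the stricter 0.05 cut-off used in the summary paragraph, which is why it is listed there among the exceptions.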

Model F (Neural Network)
Target variable: business_review_count (The number of reviews each business has)
user_votes, business_stars, review_stars (We removed the user_review_count variable after
observing better results without it)
Activation: RELU
Hidden Layers: 100
Interpretation
In this particular model we can see that the mean absolute error is at 95.61 and the mean square
error is at 28223.50. These values are signs of a very high error rate in the model. The R² value is
0.14 (or 14%), which is not bad, but the error rate is too high in this model for it to be considered
a good one. Further analysis, or identification of more significant variables (which we probably
didn’t include in the beginning), is required to improve the predictive capabilities of
this model. Here are predicted values from the model (which might not be very accurate):

Model Assessment
From the six models we’ve built above, our goal is to choose the best model for each of the three
target variables we have. For the objective assessment, we will be comparing the R² values from the
two models built for each target variable. For the subjective assessment, we will be elaborating on
the implications of the model in the real world.
Model A vs Model B (review_stars)
For the objective assessment of the regression and the neural network model for the target
variable review_stars, let’s first compare the R² values produced by each of the models. The
table below shows the value produced by both the procedures:
Regression (Model A)        Neural Network (Model B)
R²: 0.4405 (44.05%)         R²: 0.4715 (47.15%)

Firstly, both of these are very good models with high R² values. From the table above it is
clear that Model B performs better than Model A: the R² of the Neural Network model is
slightly higher than that of the regression model. From a real world perspective, however, the
regression model makes more sense, considering the fact that it helps understand the exact
relationship between the target and predictor variables. In our scenario, the goal is to understand
what influences the rating associated with each individual review and the regression model does this
job the best. Therefore, we would choose Model A as our model of choice for the target
variable review_stars.
Model C vs Model D (business_stars)
For the objective assessment of the regression and the neural network model for the target
variable business_stars, let’s first compare the R² values produced by each of the models. The
table below shows the value produced by both the procedures:
Regression (Model C)        Neural Network (Model D)
R²: 0.2518 (25.18%)         R²: 0.3173 (31.73%)
Firstly, both of these are reasonably good models with decently high R² values. From the table
above it is clear that Model D performs better than Model C: the R² of the Neural
Network model is slightly higher than that of the regression model. From a real world perspective,
however, the regression model makes more sense, considering the fact that it helps understand
the exact relationship between the target and predictor variables. But the assumptions for
regression weren’t satisfied earlier for this model, hence choosing regression wouldn’t be wise in
this case. In this scenario, using the Neural Network model for predicting the behaviour of the
variable business_stars makes more sense. Therefore, we would choose Model D as our model
of choice for the target variable business_stars.
Model E vs Model F (business_review_count)
For the objective assessment of the regression and the neural network model for the target
variable business_review_count, let’s first compare the R² values produced by each of the
models. The table below shows the value produced by both the procedures:
Regression (Model E)        Neural Network (Model F)
R²: 0.0208 (2.08%)          R²: 0.1493 (14.93%)
Firstly, neither of these is a very good model, as both have comparatively low R² values. From the
table above it is clear that Model F performs better than Model E. Though the Neural Network model
has a moderate R² value, the error rate for this model is very high (as observed earlier), which
brings into question the accuracy of this model. From a real world perspective, however, the
regression model makes more sense, considering the fact that it helps understand the exact
relationship between the target and predictor variables. But the assumptions for regression weren’t
satisfied earlier for this model and the R² value is very low, hence choosing regression wouldn’t be
the right way to go. Therefore, we wouldn’t be choosing either of the models for the target
variable business_review_count. Further analysis or consideration of other significant variables
is required before coming to any conclusions about this particular target variable.
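One way to put the error rates mentioned above on a comparable footing is to take the square root of each mean square error, which gives a root mean square error (RMSE) in the units of the target variable. The minimal sketch below does this for the mean square errors reported earlier for the three neural network models. RMSE is an additional measure introduced here for illustration and was not part of the original model output.

```python
import math

# Mean square errors reported in the interpretations of Models B, D and F.
REPORTED_MSE = {
    "Model B (review_stars)": 0.94,
    "Model D (business_stars)": 0.3047,
    "Model F (business_review_count)": 28223.50,
}

# RMSE = sqrt(MSE), expressed in the same units as the target variable.
rmse = {model: math.sqrt(mse) for model, mse in REPORTED_MSE.items()}

for model, value in rmse.items():
    print(f"{model}: RMSE = {value:.2f}")
```

For Model F the RMSE works out to about 168 reviews, well above its mean absolute error of 95.61. Since squaring weights large errors more heavily, a gap of this size suggests that a small number of businesses are predicted very poorly, which is consistent with the decision not to rely on this model.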

Model technique Assessment
Regression and Neural Networks are both effective modelling techniques, and each
has its own strengths and weaknesses. These are delineated below:
Regression
Regression analysis is a statistical process for estimating the relationships among variables. The
focus is on the relationship between a dependent variable and one or more independent variables.
Strengths
1. Multiple regression is a very flexible method. The independent variables can be numeric
or categorical, interactions between variables can be incorporated, and polynomial
terms can also be included.
2. Multiple regression uses multiple independent variables, with each controlling for the
others. The coefficient of each of these variables can be derived using a
regression model.
3. Regression models can have accurate predictive capabilities when their assumptions hold,
and can be used in forecasting trends in the future.
4. When relationships between the independent variables and the dependent variable are
almost linear, regression shows optimal results.
Weaknesses
1. Linear regression is limited to predicting numeric output.
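As an illustration of the first strength listed above, interaction and polynomial terms are added to a regression simply by creating extra columns from the existing predictors. The sketch below shows this for two predictors. The choice of business_stars and user_average_stars, and the example values, are hypothetical; none of our six models included such terms.

```python
def expand_features(row):
    """Return the original two predictors plus a squared term and an interaction term."""
    business_stars, user_average_stars = row
    return [
        business_stars,
        user_average_stars,
        business_stars ** 2,                  # polynomial (squared) term
        business_stars * user_average_stars,  # interaction term
    ]


# Hypothetical (business_stars, user_average_stars) pairs.
rows = [(4.0, 3.5), (2.0, 5.0)]
expanded = [expand_features(row) for row in rows]
print(expanded)
```

The model remains linear in its coefficients, so it can still be fitted and interpreted with the same regression procedure; only the set of input columns changes.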

Neural Network
An Artificial Neural Network (ANN) is an information processing model loosely inspired by the
human brain, which uses layers of artificial neurons (including hidden layers) to learn patterns from data.
Strengths
1. Neural Networks have the ability to learn relationships between the indicator and
target variables even when they are not linearly related to each other, which means that they
can be used to understand trends and patterns in a dataset.
2. Neural Network is capable of self organization. An ANN can create its own organisation
or representation of the information it receives during learning time.
3. Neural Network is capable of adaptive learning. An ANN has the ability to learn how to
do tasks based on the data given for training or initial experience.
Weaknesses
1. Since it is very difficult to extract interpretable information from a Neural Network model,
its implications are very hard to understand.
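To show why the weights of a neural network are harder to read than regression coefficients, here is a minimal sketch of a forward pass through one hidden layer with the ReLU activation used in our models. The two inputs, two hidden units and all weights are hypothetical and far smaller than the networks we trained; this is not the fitted model.

```python
def relu(x):
    """ReLU activation: passes positive values through and clips negatives to zero."""
    return max(0.0, x)


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """One hidden layer with ReLU, followed by a linear output unit."""
    hidden = [
        relu(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias


# Hypothetical weights: 2 inputs -> 2 hidden units -> 1 output.
hidden_weights = [[0.5, 0.5], [1.0, -1.0]]
hidden_biases = [0.0, 0.0]
output_weights = [1.0, 0.5]
output_bias = 0.0

prediction_a = forward([4.0, 3.0], hidden_weights, hidden_biases, output_weights, output_bias)
prediction_b = forward([3.0, 4.0], hidden_weights, hidden_biases, output_weights, output_bias)
print(prediction_a, prediction_b)
```

Swapping the two inputs changes the prediction because the second hidden unit switches off when its weighted sum is negative. The effect of a single input therefore depends on the values of the others, so no single weight plays the role that a coefficient plays in a regression model.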
