What drives restaurant ratings?
Understanding social recommendation systems with the Yelp dataset
Team
Kaushik Subramaniam Gnanaskandan - A11815321
Forough Nasirpouri Shadbad - A11725946
Prashanth Raj Goud - A11810448
Srujana Mereddy - A11809432
MSIS 5223 - Programming for datascience
Project deliverable2
Table of Contents
Executive Summary
Statement of Scope
Project Schedule
  Team availability Tracker
  Lessons Learnt
Data Preparation
  Data Access
  Data Consolidation
  Data Cleaning
  Data Transformation
  Data Reduction
  Descriptive Statistics
  Data Dictionary
Modelling Techniques
  Regression
  Neural Network
Data Splitting and Subsampling
Data Modelling
  Model A (Regression)
  Model B (Neural Network)
  Model C (Regression)
  Model D (Neural Network)
  Model E (Regression)
  Model F (Neural Network)
Model Assessment
  Model A vs Model B (review_stars)
  Model C vs Model D (business_stars)
  Model E vs Model F (business_review_count)
Model technique Assessment
  Regression
  Neural Network
Executive Summary
The Yelp dataset consists of different forms of data about restaurants including user generated
reviews, user generated ratings, aggregated business ratings and other numerical attributes
relating to users, reviews and businesses. This data can be used in various ways to help us in
statistically understanding certain behavioral aspects of a business such as rating influencers and
review influencers.
With internet ratings playing a considerable role in determining the popularity and hence the
profitability of a restaurant (or any business these days), a valuable question to ask is: How does
one improve customer-driven ratings on social media platforms such as Yelp? To answer this
question, we first need to recognize what influences social recommendation systems. Our hunch is
that, for businesses on a platform like Yelp, ratings are driven largely by the most influential
users. We also believe that this dataset contains other valuable attributes relating to businesses
that could be influencing the ratings. So how do we determine what is truly significant? The need
for a business to have a positive presence on the internet makes it imperative to study the
patterns associated with a recommendation system.
Statement of Scope
The broader scope of our project is to analyze the effects of all the variables in this dataset and
identify the most significant ones that influence the customer-driven ratings of a business.
Although we haven’t yet merged this dataset with other external datasets, we want to identify
variables relating to users, reviews and businesses within this dataset itself that could possibly
give us more insight into what the most significant of the influencers could be.
Aside from this, and as an expansion of our initial scope, we want to identify the effect that our
chosen target variables have on one another. For example, is there a correlation between the
average rating of a business, the number of reviews a business receives and the rating of a single
review? Finally, we want to build a statistical model for each of our target variables that will
help us understand these relationships better.
For our final analysis of the dataset, the table below shows each target variable and the
corresponding predictor variables that we used in this project:

Target: Ratings of individual reviews (review_stars)
Predictors:
1. Number of useful votes the review received (useful)
2. Number of funny votes the review received (funny)
3. Number of cool votes the review received (cool)
4. How many reviews the user who gave this review has given to other businesses (user_review_count)
5. How many fans the user has (user_fans)
6. The average rating that the user gives other businesses (user_average_stars)
7. How many compliments the user has received (user_compliments)
8. How many votes the user has received in total (user_votes)
9. The average rating of a business (business_stars)
10. The number of reviews a single business has (business_review_count)

Target: Average ratings of the business (business_stars)
Predictors:
1. How many reviews the business has (business_review_count)
2. Ratings of individual reviews (review_stars)
3. How many reviews the users who rated this business have given to other businesses
(user_review_count)
4. How many fans the users who rated this business have (user_fans)
5. The average rating that the users who rated this business give to other businesses
(user_average_stars)
6. How many compliments the users who rated this business have in total (user_compliments)
7. How many votes the users who rated this business have in total (user_votes)
8. Total number of useful votes the business received from reviews (useful)
9. Total number of funny votes the business received from reviews (funny)
10. Total number of cool votes the business received from reviews (cool)

Target: Number of reviews the business has received (business_review_count)
Predictors:
1. The average rating of the business (business_stars)
2. Ratings of individual reviews (review_stars)
3. How many reviews the users who rated this business have given to other businesses
(user_review_count)
4. How many fans the users who rated this business have (user_fans)
5. The average rating that the users who rated this business give to other businesses
(user_average_stars)
6. How many compliments the users who rated this business have in total (user_compliments)
7. How many votes the users who rated this business have in total (user_votes)
8. Total number of useful votes the business received from reviews (useful)
9. Total number of funny votes the business received from reviews (funny)
10. Total number of cool votes the business received from reviews (cool)
Project Schedule
We were able to mostly stick with our original schedule. Apart from a few delays in the modelling
process, caused by some bad results that needed several iterations to fix, things went smoothly.
Below you can find GANTT charts showing the duration of our entire project.
We have also developed a team availability tracker which shows the availability of all project
members throughout the duration of the project:
Team availability Tracker
Timeline         Team members
                 Kaushik  Forough  Prashanth  Srujana
Feb    Week 1    X        X        X          X
       Week 2    X        X        X          X
       Week 3    X        X        X          X
       Week 4    X        X        X          X
March  Week 1    X        X        X          X
       Week 2    O        O        O          O
       Week 3    X        X        X          X
       Week 4    X        X        X          X
April  Week 1    X        X        X          X
       Week 2    X        X        X          X
       Week 3    X        X        X          X
       Week 4    X        X        X          X
Lessons Learnt
After going through the complete statistical analysis process with a very large dataset, we felt
that building a model early on is a good way to confirm some of our initial hunches before carrying
out the initial analysis of the dataset. Given the size of the dataset, we could also have taken a
small sample of the data and modelled it beforehand. Because of the time spent on the earlier
stages of the project, we did not get to the modelling phase until much later.
Data Preparation
The steps we took to prepare our data for analysis are delineated below.
Data Access
The dataset is freely available for anyone to download from the following link:
https://www.yelp.com/dataset_challenge. Yelp has consolidated a large portion of its database of
reviews, businesses and users into a dataset of approximately 4GB (uncompressed), which anyone can
use for analysis. We downloaded the compressed dataset (2.5GB) to access the data. The dataset
consists of data relating to businesses, reviews, users, check-ins and tips collected through the
popular Yelp app. All of these data files are formatted as line-delimited JSON (also known as
NDJSON). The largest part of the dataset is the review data file, which contains 4 million rows,
followed by the user data file with 1 million rows and the business data file with 100 thousand
rows. Since the dataset was already well suited to our analysis, we did not need to access
additional datasets.
Data Consolidation
We first tried to import the JSON-formatted data files directly into R using a third-party library
called jsonlite, but faced serious performance issues due to the format of the data. We therefore
resorted to converting the dataset into CSV format, which allowed for ease of access. We realised
that MongoDB (a NoSQL database) could convert the dataset into CSV files, thanks to its ability to
handle JSON seamlessly. Using the mongoexport command we were able to convert all the data files
into CSV format while choosing the variables we needed from the export. During this process we also
eliminated certain variables that we thought we either wouldn’t need for this project or realized
would be beyond our scope to deal with. These variables are mostly textual data like reviews,
addresses and zip codes, or location-based data such as latitudes and longitudes. We also eliminated
variables whose data types we don’t yet know how to deal with in either R or Python. These include
variables like categories and attributes in the business data file, which are of the array data
type. Though R understands such a variable as a list data type, we are yet to understand how to use
it in our analysis. Below is an image showing the MongoDB commands we used to
pull the data that we needed:
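As a rough stand-in for the field-selection step, the sketch below shows how a line-delimited JSON file can be reduced to a CSV with only the chosen variables. It uses Python's standard library rather than mongoexport, so it is not the command the team ran. The function name `ndjson_to_csv` and the sample records are made up for illustration.

```python
import csv
import io
import json

def ndjson_to_csv(lines, fields, out):
    """Write only the chosen fields of each JSON line to `out` as CSV rows."""
    writer = csv.DictWriter(out, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        # Fields missing from a record are written as empty cells
        writer.writerow({f: record.get(f, "") for f in fields})
        count += 1
    return count

# Hypothetical sample records; the array-typed "categories" field is left out of the export
sample = [
    '{"business_id": "b1", "stars": 4.0, "review_count": 12, "categories": ["Food"]}',
    '{"business_id": "b2", "stars": 3.5, "review_count": 3, "categories": []}',
]
buffer = io.StringIO()
rows_written = ndjson_to_csv(sample, ["business_id", "stars", "review_count"], buffer)
csv_text = buffer.getvalue()
```

Because the input is consumed one line at a time, the same function could be given an open file handle instead of a list, which avoids loading a multi-gigabyte file into memory.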
Data Cleaning
After importing the dataset into R, we had 3 different data frames containing the variables that
we had chosen during the export from MongoDB. We identified variables like business_id and user_id
in the review data frame, which showed the potential to merge these data frames into one. But
before doing that we had to ensure that the data was clean, in terms of erroneous data and missing
values. After running the na.omit function in R to check for missing values, we realized that the
data had already gone through some rigorous cleaning by the Yelp developer team. To ensure there
was no erroneous data, we reviewed the structure of the data frames by running the str command in R.
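The missing-value check can be illustrated with a small Python sketch. It mimics the effect of dropping incomplete rows and reports how many were removed; it is not the R call itself. The helper `drop_incomplete` and the sample rows are hypothetical.

```python
def drop_incomplete(rows):
    """Return (complete_rows, dropped_count).

    A row counts as incomplete if any of its values is None or an empty string.
    """
    complete = [
        r for r in rows
        if all(v is not None and v != "" for v in r.values())
    ]
    return complete, len(rows) - len(complete)

# Hypothetical review rows; r2 has a missing star rating
reviews = [
    {"review_id": "r1", "stars": 5, "useful": 0},
    {"review_id": "r2", "stars": None, "useful": 2},
    {"review_id": "r3", "stars": 3, "useful": 1},
]
clean, dropped = drop_incomplete(reviews)
```

A dropped count of zero on the real data would be consistent with the observation that the dataset had already been cleaned.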
Data Transformation
Most of our data frame contains numerical data pertaining to ratings, review counts, number of
fans, number of compliments, number of votes etc., which doesn’t require any transformation as such
since it is already continuous in nature. We also identified variables in the user data frame
that could be aggregated into one. Variables like compliment_hot, compliment_more,
compliment_profile, compliment_cute, compliment_list, compliment_note, compliment_plain,
compliment_cool, compliment_funny, compliment_writer and compliment_photos could be
aggregated into a single user_compliments variable, and variables like useful, funny and cool (which
are essentially up votes that users received for their reviews) could be aggregated into a single
user_votes variable. Since this was just our initial hunch, we wanted to perform a data
reduction procedure to test it. Finally, the data frames showed us the potential to
merge the review, business and user data frames into a single data frame by using the
business_id and the user_id variables. We performed two left joins in R to achieve
this (after renaming certain conflicting variables in all the data frames). After doing this, we
realized that we no longer needed the respective id variables, so we dropped them.
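The two left joins can be sketched in Python as follows. This is only an illustration of the idea, not the R code used in the project. The `left_join` helper, the prefix-based renaming and the sample rows are assumptions, and the sketch assumes each id appears once in the business and user tables.

```python
def left_join(left, right, key, prefix):
    """Left-join two lists of dicts on `key`, prefixing right-hand column names.

    Prefixing stands in for renaming conflicting variables before the merge.
    """
    index = {row[key]: row for row in right}
    right_cols = [c for c in (right[0] if right else {}) if c != key]
    joined = []
    for row in left:
        match = index.get(row[key])
        merged = dict(row)
        for col in right_cols:
            # Unmatched left rows are kept, with None in the right-hand columns
            merged[prefix + col] = match[col] if match else None
        joined.append(merged)
    return joined

# Hypothetical sample rows
reviews = [
    {"business_id": "b1", "user_id": "u1", "review_stars": 5},
    {"business_id": "b2", "user_id": "u9", "review_stars": 2},
]
businesses = [
    {"business_id": "b1", "stars": 4.5, "review_count": 40},
    {"business_id": "b2", "stars": 3.0, "review_count": 7},
]
users = [{"user_id": "u1", "review_count": 12, "fans": 3}]

merged = left_join(reviews, businesses, "business_id", "business_")
merged = left_join(merged, users, "user_id", "user_")

# Drop the id columns once the joins are done
final = [
    {k: v for k, v in row.items() if k not in ("business_id", "user_id")}
    for row in merged
]
```

Here both tables have a review_count column, which is why the prefixes matter: after the joins they become business_review_count and user_review_count.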
Data Reduction
To confirm our hunch about the compliments and votes fields in the user data frame, we ran a
principal component analysis to see if we were actually dealing with just one variable in each case:
1. PCA results using compliment_hot, compliment_more, compliment_profile, compliment_cute,
compliment_list, compliment_note, compliment_plain, compliment_cool, compliment_funny,
compliment_writer, compliment_photos from the user data frame.
The code we used to run this procedure is shown below
2. PCA results using useful, funny, cool from the user data frame.
The code we used to run this procedure is shown below
The results of the two PCA procedures have indeed shown us that we are actually dealing with
just one variable in both cases. To complete the reduction procedure, we merged all the
compliments variables into a single user_compliments variable by summing them, and all the votes
variables into a single user_votes variable in the same way.
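The check and the summation can be sketched roughly as follows. This is a pure-Python illustration, not the R procedure the team ran. It estimates the share of variance carried by the first principal component of the correlation matrix using power iteration, and the column values are made up. It assumes no column is constant, and that the starting vector is not orthogonal to the leading component (which holds when the columns are positively correlated).

```python
import math

def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length numeric columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    centered = [[x - m for x in c] for c, m in zip(columns, means)]
    norms = [math.sqrt(sum(x * x for x in c)) for c in centered]
    return [
        [sum(a * b for a, b in zip(ci, cj)) / (ni * nj)
         for cj, nj in zip(centered, norms)]
        for ci, ni in zip(centered, norms)
    ]

def first_component_share(corr, iterations=200):
    """Share of total variance on the first principal component (power iteration)."""
    p = len(corr)
    v = [1.0 / math.sqrt(p)] * p
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)]
        eigenvalue = math.sqrt(sum(x * x for x in w))
        v = [x / eigenvalue for x in w]
    # The eigenvalues of a correlation matrix sum to the number of variables
    return eigenvalue / p

# Hypothetical vote counts for six users
useful = [0, 1, 2, 5, 9, 3]
funny = [0, 0, 1, 3, 6, 1]
cool = [0, 1, 1, 4, 7, 2]

share = first_component_share(correlation_matrix([useful, funny, cool]))

# If one component dominates, collapse the columns by summation
user_votes = [u + f + c for u, f, c in zip(useful, funny, cool)]
```

A share close to 1 indicates that the columns move together and can reasonably be treated as a single variable.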
To further reduce our sample size, we decided to focus on a single state in the United States
for the rest of our analysis. We randomly chose Wisconsin and created a subset of the final data
frame using the state variable as a filter.
Finally, we removed 2 date variables, the cities variable and the business_name variable, leaving
us with only numerical/ordinal data types, which suits the numerical nature of our dataset.
Descriptive Statistics
Here’s a table showing some basic descriptive statistics of our final data frame:
Variable               Mean    Median  Min  Max     Std Dev  Skew   Kurtosis
review_stars           3.723   4       1    5       1.33     -0.82  -0.53
useful                 1.008   0       0    1128    1.7      6.82   131.2
funny                  0.4195  0       0    632     1.09     19.48  809.42
cool                   0.5262  0       0    513     1.18     14.8   495.64
business_stars         3.726   4       1    5       0.67     -0.77  1.16
business_review_count  326.1   97.0    3    6414    180.99   3.94   19.39
user_review_count      125.6   25      0    11284   198.06   6.8    131.59
user_fans              10.94   1       0    4691    33.44    19.55  682.39
user_average_stars     3.73    3.79    1    5       0.71     -1.11  2.76
user_compliments       206.1   2       0    266318  781.03   40.66  2675.61
user_votes             665.2   5       0    529730  2604.22  20.81  705.91
We have essentially chosen the 11 numerical variables that we think might matter the most in
our model. After filtering our data down to businesses in Wisconsin, we were
left with 88778 observations.
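For readers who want to reproduce statistics of this kind, here is a minimal Python sketch of the columns in the table. It uses moment-based skewness and excess kurtosis; statistical packages differ slightly in the exact formulas, so the values need not match the R output digit for digit. The `describe` helper and the sample values are illustrative only.

```python
import math

def describe(values):
    """Basic descriptive statistics with moment-based skew and excess kurtosis."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    ordered = sorted(values)
    mid = n // 2
    median = ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2
    return {
        "mean": mean,
        "median": median,
        "min": ordered[0],
        "max": ordered[-1],
        "sd": math.sqrt(m2 * n / (n - 1)),  # sample standard deviation
        "skew": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3,       # excess kurtosis (0 for a normal curve)
    }

# Hypothetical vote counts: mostly zeros with one large value, hence right skewed
votes = describe([0, 0, 0, 1, 1, 10])

# A symmetric sample has zero skew
symmetric = describe([1, 2, 3, 4, 5])
```

The pattern in the vote and count variables above (median near 0, very large maximum, large positive skew) is what a long right tail looks like in these statistics.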
Here are histograms for our target variables:
1. review_stars
The graph is definitely left skewed showing a higher concentration at 5. This is very likely due
to the nature of internet rating systems, wherein users mostly tend to vote in extremes. But in our
case, the concentration is definitely more on one extreme. Further investigation could reveal
more details about the nature of this variable.
2. business_stars
This histogram shows that the data is close to normal, but still left skewed like the
previous one. The concentration is highest at the 4 level, showing that the average rating
a business receives is around 4. We can also see that very few businesses are able to
maintain a rating of 5. This makes sense in the real world, where only a few restaurants are
truly considered the best, with all or most of the users giving a full rating of 5 along with their
reviews. But this still doesn’t tell us anything about the number of reviews each business
received. There could be businesses with very few reviews, all of them positive. Comparing them
with much larger businesses that have many more reviews, and that may have lost their standing at 5
due to only a few bad reviews, would be unfair.
3. business_review_count
This histogram shows that the data is heavily right skewed. There are a lot
of businesses with very few reviews, and as the number of reviews increases, the
number of businesses decreases. This clearly shows that only a few businesses are truly
popular on Yelp. It can also be seen as a real-world bias in which customers usually
tend to trust the more popular businesses when purchasing a product. These kinds of businesses
have a greater advantage over new businesses or businesses that are just entering the
market. This is often referred to as the “first mover advantage” in the industry, wherein the
business has been around for a while, allowing it to gain the sort of popularity that it has.
Data Dictionary
This is the dictionary of our final merged dataset:
Variable               Data type  Description                                  Source
review_stars           int        Star rating rounded to half stars            https://www.yelp.com/dataset_challenge
useful                 int        Number of useful votes sent by the user      https://www.yelp.com/dataset_challenge
funny                  int        Number of funny votes sent by the user       https://www.yelp.com/dataset_challenge
cool                   int        Number of cool votes sent by the user        https://www.yelp.com/dataset_challenge
business_stars         int        Number of stars the business has             https://www.yelp.com/dataset_challenge
business_review_count  int        Number of reviews                            https://www.yelp.com/dataset_challenge
user_review_count      int        Number of reviews                            https://www.yelp.com/dataset_challenge
user_fans              int        Number of fans user has                      https://www.yelp.com/dataset_challenge
user_average_stars     int        Number of average stars user has given       https://www.yelp.com/dataset_challenge
user_compliments       int        Number of compliments user has given         https://www.yelp.com/dataset_challenge
user_votes             int        Number of votes the user has given           https://www.yelp.com/dataset_challenge
Modelling Techniques
The goal of our project is to assess the effect of different factors on the ratings (on a scale of
1 to 5) that a business receives on the Yelp app. The main idea is to see if there is a relationship
between user behavior (based on the variables present in this dataset), businesses and the reviews
that users write for these businesses. We have chosen 3 target variables because we feel that they
could be really important in understanding how online rating
systems work. Here’s a breakdown of the 3 target variables we have chosen and what we hope to
achieve with the predictor variables we have:
1. review_stars - The rating associated with every individual review
a. Does the number of votes (useful/funny/cool) affect the overall rating of a
review? Do users upvote the good reviews or the bad reviews?
b. Does the popularity (compliments/votes/fans) of users affect the overall rating of
a review? Are popular users strict or lenient with their ratings?
c. Does the history of a user’s rating behavior (average rating) affect the overall
rating of a review?
d. Are the active users (number of reviews given by a user) more strict or lenient
with their reviews? Is there an association here?
e. Does the current standing of a business on the app (rating, review count), affect
what rating a user is going to give a business?
2. business_stars - The average rating of each individual business based on the reviews it
received
a. Does the number of votes (useful/funny/cool) that each review has received for a
particular business affect the overall rating of the business?
b. Does the popularity (compliments/votes/fans) of the users who have rated a
particular business affect the overall rating of the business? Are popular users
associated with highly rated businesses?
c. Does the history of a user’s rating behavior (average rating) who has rated a
particular business affect the overall rating of the business?
d. Do the active users (number of reviews given by a user) play a part in determining
the overall rating of a business?
e. Does the rating of each review that a business has received affect the overall
rating of a business?
3. business_review_count - The total number of reviews each business has received over
time.
a. Does the number of votes (useful/funny/cool) that each review has received for a
particular business affect the number of reviews a business receives?
b. Does the popularity (compliments/votes/fans) of the users who have rated a
particular business affect the number of reviews a business receives? Are popular
users associated with popular businesses?
c. Does the history of a user’s rating behavior (average rating) who has rated a
particular business affect the number of reviews a business receives?
d. Do the active users (number of reviews given by a user) play a part in determining
the number of reviews a business receives?
e. Does the rating of each review that a business has received affect the number of
reviews a business receives?
Now, given that all of our variables (targets and predictors) are numeric in nature and that
we are trying to understand the correlation/association between these variables as delineated
above, we will be building a Regression model. We believe that a regression model can not
only reveal the truly significant predictor variables, but can also give us an equation
which can be used to determine/predict a pattern associated with this dataset, and so help a great
deal in understanding the nature of these target variables. The main assumption here is that all or
most of our chosen predictor variables are linearly related to at least one of our target variables.
We expect the model to show whether this holds. Also, identifying only a few predictor variables
(based on significance) from the pool would be a valuable insight.
The second modelling technique we plan to use is a Neural Network model. Along with giving
us information about the correlations and associations that exist between our chosen variables,
this model can also help a great deal in predicting the final value of our target variables based on
the trends and patterns present in the dataset. The model itself would produce these predicted
values. Our presumption is that the more hidden layers (the counterpart of neurons in an ANN) we
insert into the model, the better the results (in terms of computation and accuracy) will be.
Along with this information, the model also reveals the “weights”
associated with each of the predictor variables, which are essentially similar to the coefficients
we observe in the result of a regression model. This information can again inform us about the
most significant of the given predictor variables.
Given that we have 3 target variables, we plan to apply both modelling techniques to each of the
target variables and choose the best (the most successful) model
for each of them.
Regression
For the regression models we plan to create for the target variables, the dataset should satisfy
the assumptions of linearity, absence of multicollinearity, homoscedasticity and normality of
residuals. The tables below show the tests we performed for each of the target variables:
[Table: Collinearity (assessment of VIF values) for review_stars, business_stars and business_review_count]
This assumption is satisfied for all of the target variables, since the VIF
values are well below 10. We therefore have no issues with collinearity.
[Table: Normality of residuals (assessment of Q-Q plot) for review_stars, business_stars and business_review_count]
From the graphs above, it’s clear that the normality assumption can only be verified for the target
variable review_stars. This can be seen in the alignment of the Q-Q plots. Only the first graph
shows normality, while the other two either have too many outliers (business_review_count) or
deviate too far from normal (business_stars).
[Table: Constant variance (assessment of scatter plot) for review_stars, business_stars and business_review_count]
From the graphs above, it’s clear that the homoscedasticity assumption can only be verified for
the target variables review_stars and business_stars. This can be seen in the scatterplots, where
we observe a similar number of data points on both sides of the regression line. The third plot,
however, fails to show the same trend.
Linearity (Assessment of correlation procedure)
The correlation matrix in the table above shows that the correlations between all of the
variables in our dataset are significant (p values = 0). Therefore, we consider the linearity
assumption verified for all of our target variables.
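To make the correlation check concrete, here is a small Python sketch of the Pearson coefficient and the t statistic used to test it. It is an illustration with made-up numbers, not the R correlation procedure from the report. One caveat when reading such output: with roughly 88778 observations, even a very weak correlation produces a p value near zero, so the size of r matters alongside its significance.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t statistic for testing r = 0 with n observations (n - 2 degrees of freedom)."""
    return r * math.sqrt((n - 2) / (1 - r * r))

# Hypothetical small sample
r = pearson_r([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])

# A weak correlation of 0.02 is still well beyond the usual 1.96 cut-off at this sample size
t_large_sample = t_statistic(0.02, 88778)
```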
Although not all of the assumptions are satisfied for every target variable, we will still
pursue the regression models for the review_stars, business_stars and
business_review_count target variables. We will assess these models based on the results we
get from the regression procedure.
Neural Network
The main requirement we considered for a Neural Network model is that missing values are removed.
We ensured this much earlier on when we ran the na.omit procedure in R, which removed all rows with
missing values. Further, the original dataset was already clean and well organized
thanks to the preprocessing done by the developers at Yelp. The dataset is therefore
ready for a Neural Network procedure. However, a Neural Network is generally
considered a black box model, which makes interpretation of the model difficult. The plan is to
try different hidden layer sizes and activation methods to arrive at the model with the
lowest possible error for the given target variables.
Data Splitting and Subsampling
Looking to make an honest assessment, we want to do a 60-40 split of the dataset. This means that
60% of the data goes to our training dataset and 40% to our validation
dataset. Given the massive size of our dataset, it makes sense to follow common practice and use
this ratio, since it is considered a good basis for assessing most models in the real
world. We chose the larger share for training to get better results from our models, because
a larger training set improves the predictive capabilities of most models. A lower
testing value usually helps in assessing the error rate more accurately. However, we don’t plan to
create a separate testing dataset, again owing to the size and scope of our main dataset. Therefore,
considering the size of our dataset and the predictive analytics we hope to achieve with this
project, we are moving ahead with the 60-40 split. The image below shows the code we used to
split the data for our regression model:
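As a rough illustration of a 60-40 split, a Python sketch might look like this. It is not the R code from the project. The `split_rows` helper, the seed value and the stand-in rows are assumptions made for the example.

```python
import random

def split_rows(rows, train_fraction=0.6, seed=42):
    """Shuffle a copy of the rows and split it into training and validation parts."""
    shuffled = list(rows)
    # A fixed seed keeps the split reproducible between runs
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in for the observations: 100 row identifiers
rows = list(range(100))
train, validation = split_rows(rows)
```

Shuffling before cutting is what makes the two parts comparable; summary statistics of the target variable in each part can then be compared to confirm the split is balanced.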
Here is an assessment of the data splits relating to each of our target variables:
1. review_stars
2. business_stars
3. business_review_count
Comparing the mean, standard deviation, median, minimum and maximum statistics from the
images above, a clear uniformity can be noticed. The split is well balanced: the values of
these statistics across the split datasets are very close or, in most cases, exactly
the same.
Data Modelling
Based on the assessments and subsampling done above, we are going to create regression and
neural network models using review_stars, business_stars and business_review_count as our
target variables. The idea is to understand the effects of our chosen predictor variables on our
chosen target variables.
Model A (Regression)
Target variable: review_stars (The rating associated with each individual review)
Predictor variables: useful, funny, cool, user_compliments, user_votes, user_fans,
user_average_stars, user_review_count, business_stars, business_review_count
Here is an image of the model we built by running the regression procedure in R:
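For readers unfamiliar with how such a model is fitted, the sketch below estimates regression coefficients by ordinary least squares in plain Python. It is an illustration only. It is not the R procedure used for Model A, it uses two stand-in predictors instead of ten, and the numbers are synthetic rather than taken from the Yelp data.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_ols(X, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    design = [[1.0] + list(row) for row in X]
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)

# Synthetic data generated from y = 1 + 0.5 * x1 - 0.25 * x2
X = [(0, 0), (1, 0), (0, 1), (2, 1), (3, 4), (4, 2)]
y = [1.0, 1.5, 0.75, 1.75, 1.5, 2.5]
intercept, b1, b2 = fit_ols(X, y)
```

Each fitted coefficient is read the same way as in the interpretation that follows: the expected change in the target for a one-unit increase in that predictor, holding the others fixed.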
Interpretation
We can see from the results above that all of our predictor variables except for
business_review_count are significant at a 0.05 level of significance. We can also see that
user_compliments and user_votes are significant at a 0.01 level of significance, while useful,
funny, cool, user_review_count, user_average_stars and business_stars are all significant at a
0.001 level of significance. From the looks of it, our hunch about most of our chosen predictor
variables is true. All of these variables show a statistically significant association with the
review_stars target variable.
But to what extent, and what do these results mean? This is delineated below:
1. The variable useful is significant at a 0.001 level of significance. From the coefficient, we
can say that for a unit increase in useful, there is a 0.1351 decrease in review_stars. In
other words, when a review has more useful votes, the rating of the review tends to
decrease. This relationship is actually contrary to what we believed might be a positive
linear relationship. It makes sense for a review with a higher rating to have more useful
votes, but it seems like users find the stricter reviews more useful than the more lenient
ones.
2. The variable funny is significant at a 0.001 level of significance. From the coefficient, we
can say that for a unit increase in funny, there is a 0.1576 decrease in review_stars. In
other words, when a review has more funny votes, the rating of the review tends to
decrease. This relationship is in line with what we believed would be a negative linear
relationship. It makes sense for a review with a lower rating to have more funny votes
since users find the stricter reviews to be funnier than the more lenient ones. This could
be due to the more sarcastic tone users might use with their bad reviews.
3. The variable cool is also significant at a 0.001 level of significance. From the coefficient,
we can say that for a unit increase in cool, there is a 0.2989 increase in review_stars. In
other words, when a review has more cool votes, the rating of the review tends to
increase. This relationship is actually in line with what we believed might be a positive
linear relationship. It makes sense for a review with a higher rating to have more cool
votes.
4. The variable user_review_count is also significant at a 0.001 level of significance. From
the coefficient, we can say that for a unit increase in user_review_count, there is a
.000109 increase in review_stars. In other words, when a user has written more reviews,
the rating of their review tends towards a higher value. This relationship is actually inline
with what we believed might be a positive linear relationship. It makes sense for an active
reviewer to be more lenient with their reviews.
5. The variable user_average_stars is also significant at a 0.001 level of significance. From
the coefficient, we can say that for a unit increase in user_average_stars, there is a .7706
increase in review_stars. In other words, when a user has a higher average rating score,
the rating of their review tends towards a higher value. This relationship is actually in line
with what we believed might be a positive linear relationship. It makes sense for a user
with a higher average rating to award higher ratings to reviews.
6. The variable user_compliments is significant at a 0.01 level of significance. From the
coefficient, we can say that for a unit increase in user_compliments, there is a .00002133
decrease in review_stars. In other words, when a user has received more compliments
(the popular user in other words), the rating of their review tends towards a lower value.
This relationship is actually in line with what we believed might be a negative linear
relationship. It makes sense for a popular reviewer to be stricter with their reviews. The
popularity of a user is definitely an influencer of the rating given to their reviews.
7. The variable user_fans is significant at a 0.05 level of significance. From the coefficient,
we can say that for a unit increase in user_fans, there is a .0003611 decrease in
review_stars. In other words, when a user has more fans (the popular user in other
words), the rating of their review tends towards a lower value. This relationship is
actually in line with what we believed might be a negative linear relationship. It makes
sense for a popular reviewer to be stricter with their reviews. The popularity of a user is
definitely an influencer of the rating given to their reviews.
8. The variable user_votes is significant at a 0.01 level of significance. From the coefficient,
we can say that for a unit increase in user_votes, there is a .000005412 increase in
review_stars. In other words, when a user has more votes (the users whose reviews have
received more votes), the rating of their review tends towards a higher value. This
relationship is actually in line with what we believed might be a positive linear
relationship.
9. The variable business_stars is significant at a 0.05 level of significance. From the
coefficient, we can say that for a unit increase in business_stars, there is a .7023 increase
in review_stars. In other words, when a business has a higher average rating, the rating of
the reviews that the business receives tends towards a higher value. This relationship is
actually in line with what we believed might be a positive linear relationship. It makes
sense for a business with a higher rating to receive more such positive reviews.
10. The variable business_review_count is not significant at all. The number of reviews a
business receives does not influence the rating of the reviews it receives. This makes
sense since we really can’t say that the popular businesses (in terms of reviews) receive a
higher or a lower rating. It really depends on what the user experienced when the review
was given.
Therefore, it is clear that the users definitely have a big role to play in deciding the rating of a
review given to any business on the Yelp app. Considering the social nature of apps like Yelp,
this makes a lot of sense.
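To make the coefficient reading above concrete, the short sketch below plugs the Model A coefficients quoted in this section into a helper that estimates how the predicted review_stars would shift for a given change in the predictors, holding everything else fixed. The function name and the example change are hypothetical and only illustrate the arithmetic. The useful coefficient and the intercept are not restated in this section, so they are left out.

```python
# Model A coefficients as quoted in the interpretation above.
# The useful coefficient and the intercept are not restated here, so they are omitted.
MODEL_A_COEFFICIENTS = {
    "funny": -0.1576,
    "cool": 0.2989,
    "user_review_count": 0.000109,
    "user_average_stars": 0.7706,
    "user_compliments": -0.00002133,
    "user_fans": -0.0003611,
    "user_votes": 0.000005412,
    "business_stars": 0.7023,
}


def predicted_change(deltas, coefficients=MODEL_A_COEFFICIENTS):
    """Return the change in predicted review_stars for the given predictor changes.

    In a linear model each predictor contributes coefficient * change,
    with every other predictor held constant.
    """
    return sum(coefficients[name] * change for name, change in deltas.items())


# Hypothetical example: a review gains one cool vote and one funny vote.
shift = predicted_change({"cool": 1, "funny": 1})
print(f"Predicted change in review_stars: {shift:+.4f}")
```

Because the model is linear, the two effects simply add up: the positive cool coefficient outweighs the negative funny coefficient, so the predicted rating rises slightly.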
Model B (Neural Network)
Target variable: review_stars (The rating associated with each individual review)
Predictor variables: useful, funny, cool, user_compliments, user_fans, user_average_stars,
user_votes, business_stars, business_review_count (We removed the user_review_count variable
after observing better results without it)
Activation: RELU
Hidden Layers: 200
After playing around with the activation type and number of hidden layers, we would like to
present the best model for our target variable. Here is an image of the model we built by running
the neural network procedure in Python:
Interpretation
Since a neural network model is a black box, we won’t be able to say much about the specific
relationship that exists between the target variable and the indicator variables. We can however
assess the error rate produced by the model along with the R² value to determine the efficiency
of the model. In this particular model we can see that the mean absolute error is at 0.7309 and the
mean square error is at 0.94. These values are signs of a low error rate in the model. The R² value
is 0.45 (or 45%), which is another indication of this being a good model. This model can
definitely be used for further predictive analysis.
Here are a few predicted values from the model as output by Python:
The patterns associated with the given set of indicator variables and their respective values can
be observed by comparing them to the predicted values of the variable review_stars. These
predictions can definitely be considered as accurate considering the low error rate of the model.
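As a rough illustration of the three figures quoted in the interpretation above, the sketch below computes the mean absolute error, the mean square error and R² by hand. The function names and the five ratings are made up for illustration; they are not taken from our dataset or from the Python procedure we ran.

```python
def mean_absolute_error(actual, predicted):
    """Average size of the prediction errors, in the units of the target."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average squared prediction error, in squared units of the target."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Share of the variance in the target that the predictions account for."""
    mean_actual = sum(actual) / len(actual)
    residual_ss = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    total_ss = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - residual_ss / total_ss


# Made-up star ratings and predictions, for illustration only.
actual = [5, 4, 1, 3, 2]
predicted = [4.5, 4.0, 2.0, 3.5, 2.0]

mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
r2 = r_squared(actual, predicted)
print(f"MAE={mae:.2f}  MSE={mse:.2f}  R2={r2:.2f}")
```

The two error measures describe how far predictions fall from the actual values, while R² compares the model against always predicting the mean of the target.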
Model C (Regression)
Target variable: business_stars (The average rating of a business)
Predictor variables: useful, funny, cool, user_compliments, user_votes, user_fans,
user_average_stars, user_review_count, review_stars, business_review_count
Here is an image of the model we built by running the regression procedure in R:
Interpretation
We can see from the results above that all of our indicator variables except for
user_average_stars, user_compliments and user_votes are significant at a 0.05 level of
significance. We can also see that user_fans is significant at a 0.01 level of significance while
review_stars, useful, funny, cool, user_review_count and business_review_count are all
significant at a 0.001 level of significance. From the looks of it, our hunch about most of our
chosen predictor variables is true. All these variables definitely have an effect on the
business_stars target variable. But to what extent and what do these results mean? This is
delineated below:
1. The variable review_stars is significant at a 0.001 level of significance. From the
coefficient, we can say that for a unit increase in review_stars, there is a 0.2374 increase
in business_stars. In other words, when a business has more reviews with higher ratings,
the rating of the business tends to increase. This relationship is in line with what we
believed might be a positive linear relationship.
2. The variable useful is significant at a 0.001 level of significance. From the coefficient, we
can say that for a unit increase in useful, there is a 0.11884 increase in business_stars. In
other words, when a business has more reviews with higher useful votes, the rating of the
business tends to increase. This relationship is in line with what we believed might be a
positive linear relationship.
3. The variable funny is significant at a 0.001 level of significance. From the coefficient, we
can say that for a unit increase in funny, there is a 0.1967 decrease in business_stars. In
other words, when a business has more reviews with higher funny votes, the rating of the
business tends to decrease. This relationship is in line with what we believed would be a
negative linear relationship.
4. The variable cool is also significant at a 0.001 level of significance. From the coefficient,
we can say that for a unit increase in cool, there is a 0.01167 increase in business_stars.
In other words, when a business has more reviews with higher cool votes, the rating of
the business tends to increase. This relationship is actually in line with what we believed
might be a positive linear relationship.
5. The variable user_review_count is also significant at a 0.001 level of significance. From
the coefficient, we can say that for a unit increase in user_review_count, there is a
.00004964 decrease in business_stars. In other words, when a business has more reviews
written by users who have themselves written more reviews, the rating of the business
tends towards a lower value.
6. The variable user_average_stars is not significant at all. The average rating of users who
write reviews for businesses doesn’t affect the average rating of a business. This is
surprising, since our hunch was that there would be a relationship between these two
variables.
7. The variable user_compliments is not significant at all. The number of compliments
received by users who write reviews for businesses doesn’t affect the average rating of a
business.
8. The variable user_fans is significant at a 0.05 level of significance. From the coefficient,
we can say that for a unit increase in user_fans, there is a .0002323 increase in
business_stars. In other words, when a business has more reviews from users who have
more fans (the popular user in other words), the rating of the business tends towards a
higher value. This relationship is actually in line with what we believed might be a
positive linear relationship.
9. The variable user_votes is not significant at all. The number of votes received
by users who write reviews for businesses doesn’t affect the average rating of the reviews
received by a business.
10. The variable business_review_count is significant at a 0.001 level of significance. From
the coefficient, we can say that for a unit increase in business_review_count, there is a
.0004423 increase in business_stars. In other words, when a business has a higher
number of reviews, the rating of the business tends towards a higher value. This
relationship is actually in line with what we believed might be a positive linear
relationship.
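The significance levels quoted in the list above come from comparing each coefficient's p-value against the conventional cut-offs of 0.001, 0.01, 0.05 and 0.1. A minimal sketch of that bucketing is shown below. The helper name and the p-values are hypothetical, since the regression output itself is only included as an image.

```python
# Conventional cut-offs, ordered from strictest to most lenient.
SIGNIFICANCE_LEVELS = (0.001, 0.01, 0.05, 0.1)


def strictest_level(p_value):
    """Return the smallest conventional level at which p_value is significant.

    Returns None when the p-value is not below any of the cut-offs,
    i.e. the variable is "not significant at all".
    """
    for level in SIGNIFICANCE_LEVELS:
        if p_value < level:
            return level
    return None


# Hypothetical p-values, for illustration only.
for p in (0.0004, 0.03, 0.5):
    print(p, "->", strictest_level(p))
```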
Model D (Neural Network)
Target variable: business_stars (The average rating of each business)
Predictor variables: useful, funny, cool, user_compliments, user_fans, user_average_stars,
user_votes, review_stars, business_review_count (We removed the user_review_count variable
after observing better results without it)
Activation: RELU
Hidden Layers: 200
After playing around with the activation type and number of hidden layers, we would like to
present the best model for our target variable. Here is an image of the model we built by running
the neural network procedure in Python:
Interpretation
Since a neural network model is a black box, we won’t be able to say much about the specific
relationship that exists between the target variable and the indicator variables. We can however
assess the error rate produced by the model along with the R² value to determine the efficiency
of the model. In this particular model we can see that the mean absolute error is at 0.4163 and the
mean square error is at 0.3047. These values are signs of a low error rate in the model. The R²
value is 0.31 (or 31%), which is another indication of this being a good model. This model can
definitely be used for further predictive analysis.
Here are a few predicted values from the model as output by Python:
The patterns associated with the given set of indicator variables and their respective values can
be observed by comparing them to the predicted values of the variable business_stars. These
predictions can definitely be considered as accurate considering the low error rate of the model.
Model E (Regression)
Target variable: business_review_count (The number of reviews a business receives)
Predictor variables: useful, funny, cool, user_compliments, user_votes, user_fans,
user_average_stars, user_review_count, review_stars, business_stars
Here is an image of the model we built by running the regression procedure in R:
Interpretation
We can see from the results above that all of our indicator variables except for review_stars,
user_average_stars and user_fans are significant at a 0.05 level of significance. We can also see
that funny is significant at a 0.05 level of significance, user_average_stars is significant at a 0.1
level of significance, while useful, cool, user_review_count, user_compliments, user_votes and
business_stars are all significant at a 0.001 level of significance. From the looks of it, our hunch
about most of our chosen predictor variables is true. All these variables definitely have an effect
on the business_review_count target variable. But to what extent and what do these results
mean? This is delineated below:
1. The variable review_stars is not significant at all. The rating of reviews received by a
business does not affect the number of reviews a business receives.
2. The variable useful is significant at a 0.001 level of significance. From the coefficient, we
can say that for a unit increase in useful, there is a 6.662 decrease in
business_review_count. In other words, when a business has more reviews with higher
useful votes, the number of reviews the business receives tends to decrease.
3. The variable funny is significant at a 0.05 level of significance. From the coefficient, we
can say that for a unit increase in funny, there is a 2.005 increase in
business_review_count. In other words, when a business has more reviews with higher
funny votes, the number of reviews the business receives tends to increase.
4. The variable cool is significant at a 0.001 level of significance. From the coefficient, we
can say that for a unit increase in cool, there is a 4.23 increase in business_review_count.
In other words, when a business has more reviews with higher cool votes, the number of
reviews the business receives tends to increase.
5. The variable user_review_count is also significant at a 0.001 level of significance. From
the coefficient, we can say that for a unit increase in user_review_count, there is a
.03024 increase in business_review_count. In other words, when a business has more
reviews written by users who have themselves written more reviews, the number of
reviews the business receives tends to increase.
6. The variable user_average_stars is significant at a 0.1 level of significance. For a unit
increase in user_average_stars, there is a 1.659 increase in business_review_count. In
other words, when a business has more users with a higher average rating, the number of
reviews the business receives tends to increase.
7. The variable user_compliments is significant at a 0.005 level of significance. For a unit
increase in user_compliments, there is a 0.004867 increase in business_review_count. In
other words, when a business has more users with a higher number of compliments, the
number of reviews the business receives tends to increase.
8. The variable user_fans is not significant at all. The number of fans of a user who has
reviewed a business does not affect the number of reviews a business receives.
9. The variable user_votes is significant at a 0.001 level of significance. For a unit
increase in user_votes, there is a 0.002149 decrease in business_review_count.
In other words, when a business has more users with a higher number of votes, the number
of reviews the business receives tends to decrease.
10. The variable business_stars is significant at a 0.001 level of significance. From the
coefficient, we can say that for a unit increase in business_stars, there is a .4211 increase
in business_review_count. In other words, when a business has a higher average rating,
the number of reviews the business receives tends to increase.
Model F (Neural Network)
Target variable: business_review_count (The number of reviews each business has)
Predictor variables: useful, funny, cool, user_compliments, user_fans, user_average_stars,
user_votes, business_stars, review_stars (We removed the user_review_count variable after
observing better results without it)
Activation: RELU
Hidden Layers: 100
After playing around with the activation type and number of hidden layers, we would like to
present the best model for our target variable. Here is an image of the model we built by running
the neural network procedure in Python:
Interpretation
In this particular model we can see that the mean absolute error is at 95.61 and the mean square
error is at 28223.50. These values are signs of a very high error rate in the model. The R² value is
0.14 (or 14%), which is not bad, but the error rate is too high in this model for it to be considered
a good one. Further analysis or identification of more significant variables (which we probably
didn’t include in the beginning) is definitely required to improve the predictive capabilities of
this model. Here are predicted values from the model (which might not be very accurate):
Model Assessment
From the six models we’ve built above, our goal is to choose the best model for the three target
variables we have. For the objective assessment, we will be comparing the R² values from the
two models. For the subjective assessment, we will be elaborating on the implications of the
model in the real world.
Model A vs Model B (review_stars)
For the objective assessment of the regression and the neural network model for the target
variable review_stars, let’s first compare the R² values produced by each of the models. The
table below shows the value produced by both the procedures:
Regression (Model A)      Neural Network (Model B)
R²: 0.4405 (44.05%)       R²: 0.4715 (47.15%)
Firstly, both these are very good models with such high R² values. From the table above it is
clear that Model B performs better than Model A. The accuracy of the Neural Network model is
slightly higher than the regression model. From a real world perspective however, the regression
model makes more sense, considering the fact that it helps understand the exact relationship
between the target and predictor variables. In our scenario, the goal is to understand what
influences the rating associated with each individual review and the regression model does this
job the best. Therefore, we would choose Model A as our model of choice for the target
variable review_stars.
Model C vs Model D (business_stars)
For the objective assessment of the regression and the neural network model for the target
variable business_stars, let’s first compare the R² values produced by each of the models. The
table below shows the value produced by both the procedures:
Regression (Model C)      Neural Network (Model D)
R²: 0.2518 (25.18%)       R²: 0.3173 (31.73%)
Firstly, both these are reasonably good models with decently high R² values. From the table
above it is clear that Model D performs better than Model C. The accuracy of the Neural
Network model is slightly higher than the regression model. From a real world perspective
however, the regression model makes more sense, considering the fact that it helps understand
the exact relationship between the target and predictor variables. But the assumptions for
regression weren’t satisfied earlier for this model, hence choosing regression wouldn’t be wise in
this case. In this scenario, using the Neural Network model for predicting the behaviour of the
variable business_stars makes more sense. Therefore, we would choose Model D as our model
of choice for the target variable business_stars.
Model E vs Model F (business_review_count)
For the objective assessment of the regression and the neural network model for the target
variable business_review_count, let’s first compare the R² values produced by each of the
models. The table below shows the value produced by both the procedures:
Regression (Model E)      Neural Network (Model F)
R²: 0.0208 (2.08%)        R²: 0.1493 (14.93%)
Firstly, both are not very good models with comparatively low R² values. From the table above it
is clear that Model F performs better than Model E. Though the Neural Network model has a
moderate R² value, the error rate for this model is very high (as observed earlier), which brings
into question the accuracy of this model. From a real world perspective however, the regression
model makes more sense, considering the fact that it helps understand the exact relationship
between the target and predictor variables. But the assumptions for regression weren’t satisfied
earlier for this model and the R² value is very low, hence choosing regression wouldn’t be the
right way to go. Therefore, we wouldn’t be choosing either of the models for the target
variable business_review_count. Further analysis or consideration of other significant variables
is definitely required before coming to any conclusions about this particular target variable.
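One reason the error figures for business_review_count look so much larger than those for the star ratings is that mean square error is expressed in squared units of the target. Taking the square root gives a root mean square error in the target's own units, which is easier to judge against the scale of the variable. The sketch below applies this to the mean square errors reported earlier for Models B, D and F. The dictionary and function names are ours, for illustration only.

```python
import math

# Mean square errors as reported in the neural network interpretations.
REPORTED_MSE = {
    "Model B (review_stars)": 0.94,
    "Model D (business_stars)": 0.3047,
    "Model F (business_review_count)": 28223.50,
}


def rmse_from_mse(mse):
    """Convert a mean square error into a root mean square error."""
    return math.sqrt(mse)


rmse = {name: rmse_from_mse(mse) for name, mse in REPORTED_MSE.items()}
for name, value in rmse.items():
    print(f"{name}: RMSE = {value:.2f}")
```

Star ratings run from 1 to 5, so a typical error of under one star is modest relative to that range. An error of well over a hundred reviews may be large, depending on how review counts are distributed across businesses.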
Model technique Assessment
Regression and Neural Network are both extremely effective modelling techniques and both
have their own strengths and weaknesses. These are delineated below:
Regression
Regression analysis is a statistical process for estimating the relationships among variables. The
focus is on the relationship between a dependent variable and one or more independent variables.
Strengths
1. Multiple regression is a very flexible method. The independent variables can be numeric
or categorical, and interactions between variables can be incorporated; and polynomial
terms can also be included.
2. Multiple regression uses multiple independent variables, with each controlling for the
others. The parameter or coefficients of each of these variables can be derived using a
regression model.
3. Regression models have very accurate predictive capabilities and can be used in
forecasting trends in the future.
4. When relationships between the independent variables and the dependent variable are
almost linear, regression shows optimal results.
Weaknesses
1. Linear regression is limited to predicting numeric output.
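To illustrate how a regression coefficient is obtained and read, here is a minimal sketch of a one-predictor least squares fit. The data are made up so that the relationship is exactly linear, and the function name is hypothetical. Our actual models were fitted with the regression procedure in R, not with this code.

```python
def fit_simple_regression(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance-like and variance-like sums (the common 1/n factor cancels).
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data that follows y = 1 + 0.5 * x exactly.
x = [1, 2, 3, 4, 5]
y = [1.5, 2.0, 2.5, 3.0, 3.5]

intercept, slope = fit_simple_regression(x, y)
print(f"intercept={intercept:.2f}, slope={slope:.2f}")
```

The slope is read the same way as the coefficients in our models: a unit increase in the predictor changes the predicted target by the value of the slope.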
Neural Network
An Artificial Neural Network (ANN) is an information processing model loosely inspired by the
human brain, which uses artificial neurons arranged in hidden layers to carry out its computations.
Strengths
1. Neural Networks have the ability to understand relationships between the indicator and
target variables even when they are not linearly related to each other, which means that they
can be used to understand trends and patterns in a dataset.
2. Neural Network is capable of self organization. An ANN can create its own organisation
or representation of the information it receives during learning time.
3. Neural Network is capable of adaptive learning. An ANN has the ability to learn how to
do tasks based on the data given for training or initial experience.
Weaknesses
1. Since it is very difficult to pull interpretable information out of a Neural Network model,
its implications are very hard to understand.
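A network with a ReLU hidden layer (the activation used in Models B, D and F) can represent relationships that a straight line cannot. The sketch below is a minimal forward pass through one hidden layer. The weights are hand-picked and hypothetical, chosen so that two hidden units together compute the absolute value of the input. They are not the weights learned by any of our models.

```python
def relu(value):
    """Rectified linear unit: pass positive values through, clip negatives to zero."""
    return max(0.0, value)


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass through one ReLU hidden layer followed by a linear output."""
    hidden = [
        relu(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias


# Hypothetical hand-picked weights: two hidden units that together compute |x|.
hidden_weights = [[1.0], [-1.0]]
hidden_biases = [0.0, 0.0]
output_weights = [1.0, 1.0]
output_bias = 0.0

for x in (-2.0, 0.0, 3.0):
    print(x, "->", forward([x], hidden_weights, hidden_biases, output_weights, output_bias))
```

No single regression coefficient can reproduce this V-shaped response. The same flexibility is also what makes the learned weights hard to interpret.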

VVIP Pune Call Girls Sinhagad Road (7001035870) Pune Escorts Nearby with Comp...VVIP Pune Call Girls Sinhagad Road (7001035870) Pune Escorts Nearby with Comp...
VVIP Pune Call Girls Sinhagad Road (7001035870) Pune Escorts Nearby with Comp...Call Girls in Nagpur High Profile
 
Let Me Relax Dubai Russian Call girls O56338O268 Dubai Call girls AgenCy
Let Me Relax Dubai Russian Call girls O56338O268 Dubai Call girls AgenCyLet Me Relax Dubai Russian Call girls O56338O268 Dubai Call girls AgenCy
Let Me Relax Dubai Russian Call girls O56338O268 Dubai Call girls AgenCystephieert
 
Best Connaught Place Call Girls Service WhatsApp -> 9999965857 Available 24x7...
Best Connaught Place Call Girls Service WhatsApp -> 9999965857 Available 24x7...Best Connaught Place Call Girls Service WhatsApp -> 9999965857 Available 24x7...
Best Connaught Place Call Girls Service WhatsApp -> 9999965857 Available 24x7...srsj9000
 
(PRIYA) Call Girls Budhwar Peth ( 7001035870 ) HI-Fi Pune Escorts Service
(PRIYA) Call Girls Budhwar Peth ( 7001035870 ) HI-Fi Pune Escorts Service(PRIYA) Call Girls Budhwar Peth ( 7001035870 ) HI-Fi Pune Escorts Service
(PRIYA) Call Girls Budhwar Peth ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
4th QT WEEK 2 Cook Meat Cuts part 2.pptx
4th QT WEEK 2 Cook Meat Cuts part 2.pptx4th QT WEEK 2 Cook Meat Cuts part 2.pptx
4th QT WEEK 2 Cook Meat Cuts part 2.pptxKattieAlisonMacatugg1
 
(ASHA) Sb Road Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(ASHA) Sb Road Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(ASHA) Sb Road Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(ASHA) Sb Road Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escortsranjana rawat
 
(ISHITA) Call Girls Manchar ( 7001035870 ) HI-Fi Pune Escorts Service
(ISHITA) Call Girls Manchar ( 7001035870 ) HI-Fi Pune Escorts Service(ISHITA) Call Girls Manchar ( 7001035870 ) HI-Fi Pune Escorts Service
(ISHITA) Call Girls Manchar ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Call Girl Nashik Khushi 7001305949 Independent Escort Service Nashik
Call Girl Nashik Khushi 7001305949 Independent Escort Service NashikCall Girl Nashik Khushi 7001305949 Independent Escort Service Nashik
Call Girl Nashik Khushi 7001305949 Independent Escort Service Nashikranjana rawat
 

Recently uploaded (20)

(SUNAINA) Call Girls Alandi Road ( 7001035870 ) HI-Fi Pune Escorts Service
(SUNAINA) Call Girls Alandi Road ( 7001035870 ) HI-Fi Pune Escorts Service(SUNAINA) Call Girls Alandi Road ( 7001035870 ) HI-Fi Pune Escorts Service
(SUNAINA) Call Girls Alandi Road ( 7001035870 ) HI-Fi Pune Escorts Service
 
Assessment on SITXINV007 Purchase goods.pdf
Assessment on SITXINV007 Purchase goods.pdfAssessment on SITXINV007 Purchase goods.pdf
Assessment on SITXINV007 Purchase goods.pdf
 
BPP NC II Lesson 3 - Pastry Products.pptx
BPP NC II Lesson 3 - Pastry Products.pptxBPP NC II Lesson 3 - Pastry Products.pptx
BPP NC II Lesson 3 - Pastry Products.pptx
 
Grade Eight Quarter 4_Week 6_Cookery.pptx
Grade Eight Quarter 4_Week 6_Cookery.pptxGrade Eight Quarter 4_Week 6_Cookery.pptx
Grade Eight Quarter 4_Week 6_Cookery.pptx
 
VIP Call Girls In Singar Nagar ( Lucknow ) 🔝 8923113531 🔝 Cash Payment Avai...
VIP Call Girls In Singar Nagar ( Lucknow  ) 🔝 8923113531 🔝  Cash Payment Avai...VIP Call Girls In Singar Nagar ( Lucknow  ) 🔝 8923113531 🔝  Cash Payment Avai...
VIP Call Girls In Singar Nagar ( Lucknow ) 🔝 8923113531 🔝 Cash Payment Avai...
 
Russian Call Girls in Nagpur Devyani Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Devyani Call 7001035870 Meet With Nagpur EscortsRussian Call Girls in Nagpur Devyani Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Devyani Call 7001035870 Meet With Nagpur Escorts
 
VIP Russian Call Girls Gorakhpur Chhaya 8250192130 Independent Escort Service...
VIP Russian Call Girls Gorakhpur Chhaya 8250192130 Independent Escort Service...VIP Russian Call Girls Gorakhpur Chhaya 8250192130 Independent Escort Service...
VIP Russian Call Girls Gorakhpur Chhaya 8250192130 Independent Escort Service...
 
young Whatsapp Call Girls in Jamuna Vihar 🔝 9953056974 🔝 escort service
young Whatsapp Call Girls in Jamuna Vihar 🔝 9953056974 🔝 escort serviceyoung Whatsapp Call Girls in Jamuna Vihar 🔝 9953056974 🔝 escort service
young Whatsapp Call Girls in Jamuna Vihar 🔝 9953056974 🔝 escort service
 
Call Girls Laxmi Nagar Delhi reach out to us at ☎ 9711199012
Call Girls Laxmi Nagar Delhi reach out to us at ☎ 9711199012Call Girls Laxmi Nagar Delhi reach out to us at ☎ 9711199012
Call Girls Laxmi Nagar Delhi reach out to us at ☎ 9711199012
 
(MAYA) Baner Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MAYA) Baner Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MAYA) Baner Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MAYA) Baner Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
(PRIYANKA) Katraj Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune E...
(PRIYANKA) Katraj Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune E...(PRIYANKA) Katraj Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune E...
(PRIYANKA) Katraj Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune E...
 
VIP Call Girls Service Secunderabad Hyderabad Call +91-8250192130
VIP Call Girls Service Secunderabad Hyderabad Call +91-8250192130VIP Call Girls Service Secunderabad Hyderabad Call +91-8250192130
VIP Call Girls Service Secunderabad Hyderabad Call +91-8250192130
 
VVIP Pune Call Girls Sinhagad Road (7001035870) Pune Escorts Nearby with Comp...
VVIP Pune Call Girls Sinhagad Road (7001035870) Pune Escorts Nearby with Comp...VVIP Pune Call Girls Sinhagad Road (7001035870) Pune Escorts Nearby with Comp...
VVIP Pune Call Girls Sinhagad Road (7001035870) Pune Escorts Nearby with Comp...
 
Let Me Relax Dubai Russian Call girls O56338O268 Dubai Call girls AgenCy
Let Me Relax Dubai Russian Call girls O56338O268 Dubai Call girls AgenCyLet Me Relax Dubai Russian Call girls O56338O268 Dubai Call girls AgenCy
Let Me Relax Dubai Russian Call girls O56338O268 Dubai Call girls AgenCy
 
Best Connaught Place Call Girls Service WhatsApp -> 9999965857 Available 24x7...
Best Connaught Place Call Girls Service WhatsApp -> 9999965857 Available 24x7...Best Connaught Place Call Girls Service WhatsApp -> 9999965857 Available 24x7...
Best Connaught Place Call Girls Service WhatsApp -> 9999965857 Available 24x7...
 
(PRIYA) Call Girls Budhwar Peth ( 7001035870 ) HI-Fi Pune Escorts Service
(PRIYA) Call Girls Budhwar Peth ( 7001035870 ) HI-Fi Pune Escorts Service(PRIYA) Call Girls Budhwar Peth ( 7001035870 ) HI-Fi Pune Escorts Service
(PRIYA) Call Girls Budhwar Peth ( 7001035870 ) HI-Fi Pune Escorts Service
 
4th QT WEEK 2 Cook Meat Cuts part 2.pptx
4th QT WEEK 2 Cook Meat Cuts part 2.pptx4th QT WEEK 2 Cook Meat Cuts part 2.pptx
4th QT WEEK 2 Cook Meat Cuts part 2.pptx
 
(ASHA) Sb Road Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(ASHA) Sb Road Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(ASHA) Sb Road Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(ASHA) Sb Road Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
(ISHITA) Call Girls Manchar ( 7001035870 ) HI-Fi Pune Escorts Service
(ISHITA) Call Girls Manchar ( 7001035870 ) HI-Fi Pune Escorts Service(ISHITA) Call Girls Manchar ( 7001035870 ) HI-Fi Pune Escorts Service
(ISHITA) Call Girls Manchar ( 7001035870 ) HI-Fi Pune Escorts Service
 
Call Girl Nashik Khushi 7001305949 Independent Escort Service Nashik
Call Girl Nashik Khushi 7001305949 Independent Escort Service NashikCall Girl Nashik Khushi 7001305949 Independent Escort Service Nashik
Call Girl Nashik Khushi 7001305949 Independent Escort Service Nashik
 

What drives restaurant ratings

  • 1. What drives restaurant ratings? Understanding social recommendation systems with the Yelp dataset Team Kaushik Subramaniam Gnanaskandan - A11815321 Forough Nasirpouri Shadbad - A11725946 Prashanth Raj Goud - A11810448 Srujana Mereddy - A11809432
  • 2. Table of Contents
    Executive Summary
    Statement of Scope
    Project Schedule: Team availability Tracker, Lessons Learnt
    Data Preparation: Data Access, Data Consolidation, Data Cleaning, Data Transformation, Data Reduction, Descriptive Statistics, Data Dictionary
    Modelling Techniques: Regression, Neural Network
    Data Splitting and Subsampling
    Data Modelling: Model A (Regression), Model B (Neural Network), Model C (Regression), Model D (Neural Network), Model E (Regression), Model F (Neural Network)
  • 3. Table of Contents (continued)
    Model Assessment: Model A vs Model B (review_stars), Model C vs Model D (business_stars), Model E vs Model F (business_review_count)
    Model technique Assessment: Regression, Neural Network
  • 4. Executive Summary
    The Yelp dataset contains several kinds of data about restaurants, including user-generated reviews, user-generated ratings, aggregated business ratings and other numerical attributes relating to users, reviews and businesses. This data can help us understand, statistically, certain behavioral aspects of a business, such as what influences its ratings and its reviews. With internet ratings playing a considerable role in determining the popularity, and hence the profitability, of a restaurant (or any business these days), a valuable question to ask is: how does one improve customer-driven ratings on social media platforms such as Yelp? To answer this question, we first need to recognize what influences social recommendation systems. For businesses on a platform like Yelp, our hunch is that it often comes down to identifying the most influential users, since they play a major role in determining the ratings. We also believe that this dataset contains other valuable business attributes that could be influencing the ratings. So how do we determine what is truly significant? Because a business needs a positive presence on the internet, it is imperative to study the patterns associated with a recommendation system.
    Statement of Scope
    The broader scope of our project is to analyze the effects of all the variables in this dataset and identify the most significant ones that influence the customer-driven ratings of a business. Although we haven’t yet merged this dataset with other external datasets, we want to identify
  • 5. variables relating to users, reviews and businesses within this dataset itself that could give us more insight into what the most significant influencers are. Beyond this, and as an expansion of our initial scope, we want to identify the effect that our chosen target variables have on one another. For example, is there a correlation between the average rating of a business, the number of reviews a business receives and the rating of a single review? Finally, we want to arrive at a statistical model for each of our target variables that helps explain these relationships. The table below shows the targets and the corresponding predictor variables that we’ve used in our final analysis of the dataset:
    Target: Ratings of individual reviews (review_stars)
    Predictors:
    1. Number of useful votes the review received (useful)
    2. Number of funny votes the review received (funny)
    3. Number of cool votes the review received (cool)
    4. How many reviews the user who wrote this review has given to other businesses (user_review_count)
    5. How many fans the user has (user_fans)
    6. The average rating that the user gives other businesses (user_average_stars)
    7. How many compliments the user has received (user_compliments)
    8. How many votes the user has received in total (user_votes)
    9. The average rating of the business (business_stars)
    10. The number of reviews the business has (business_review_count)
  • 6. Target: Average ratings of the business (business_stars)
    Predictors:
    1. How many reviews the business has (business_review_count)
    2. Ratings of individual reviews (review_stars)
    3. How many reviews the users who rated this business have given to other businesses (user_review_count)
    4. How many fans the users who rated this business have (user_fans)
    5. The average rating that the users who rated this business give to other businesses (user_average_stars)
    6. How many compliments the users who rated this business have in total (user_compliments)
    7. How many votes the users who rated this business have in total (user_votes)
    8. Total number of useful votes the business received from reviews (useful)
    9. Total number of funny votes the business received from reviews (funny)
    10. Total number of cool votes the business received from reviews (cool)
    Target: Number of reviews the business has received (business_review_count)
    Predictors:
    1. The average rating of the business (business_stars)
    2. Ratings of individual reviews (review_stars)
    3. How many reviews the users who rated this business have given to other businesses (user_review_count)
    4. How many fans the users who rated this business have (user_fans)
    5. The average rating that the users who rated this business give to other businesses (user_average_stars)
    6. How many compliments the users who rated this business have in total (user_compliments)
    7. How many votes the users who rated this business have in total (user_votes)
    8. Total number of useful votes the business received from reviews (useful)
    9. Total number of funny votes the business received from reviews (funny)
    10. Total number of cool votes the business received from reviews (cool)
  • 7. Project Schedule
    We were able to mostly stick with our original schedule. Apart from a few delays in the modelling process, caused by poor results that required iterative attention, things went smoothly. Below are Gantt charts showing the duration of our entire project. We also developed a team availability tracker, which shows the availability of all project members throughout the project:
  • 8. Team availability Tracker
    Timeline         Kaushik  Forough  Prashanth  Srujana
    Feb    Week 1    X        X        X          X
    Feb    Week 2    X        X        X          X
    Feb    Week 3    X        X        X          X
    Feb    Week 4    X        X        X          X
    March  Week 1    X        X        X          X
    March  Week 2    O        O        O          O
    March  Week 3    X        X        X          X
    March  Week 4    X        X        X          X
    April  Week 1    X        X        X          X
    April  Week 2    X        X        X          X
    April  Week 3    X        X        X          X
    April  Week 4    X        X        X          X
    Lessons Learnt
    After going through the complete statistical analysis process with a very large dataset, we felt that building a model early would have been a good way to confirm some of our initial hunches before the initial analysis of the dataset. Given the size of the dataset, we could also have taken a small sample of the data and modelled on it beforehand. Because of the time spent on the other items mentioned above, we did not get to the modelling phase until much later.
  • 9. Data Preparation
    The steps we took to prepare our data for analysis are described below.
    Data Access
    The dataset is freely available for anyone to download from the following link: https://www.yelp.com/dataset_challenge. Yelp has consolidated a large portion of its database of reviews, businesses and users into a dataset of approximately 4GB (uncompressed), which anyone can use for analysis. We downloaded the compressed dataset (2.5GB) to access the data. The dataset consists of data relating to businesses, reviews, users, check-ins and tips collected through the popular Yelp app. All of these data files are formatted as line-delimited JSON (a.k.a. ND-JSON). The largest file is the review data file, which contains 4 million rows, followed by the user data file with 1 million rows and the business data file with 100 thousand rows. Given that the dataset is already well suited for consumption, we did not need to access additional datasets.
    Data Consolidation
    We first tried to import the JSON-formatted data files directly into R using a third-party library called jsonlite, but faced serious performance issues due to the format of the data. We therefore converted the dataset into CSV format, which made it easier to access. We realised that MongoDB (a NoSQL database) could convert the dataset into CSV files because it handles JSON seamlessly. Using the mongoexport command, we were able to convert all the data files into CSV format while being
  • 10. able to choose the variables we needed from the export. During this process we also eliminated variables that we thought we either wouldn’t need for this project or that would be beyond our scope to deal with. These are mostly textual data such as reviews, addresses and zip codes, or location-based data such as latitudes and longitudes. We also eliminated variables whose data types we don’t yet know how to handle in either R or Python. These include the categories and attributes variables in the business data file, which are of the array data type. Though R reads these variables as lists, we have yet to understand how to use them in our analysis. Below is an image showing the MongoDB commands we used to pull the data that we needed:
    Data Cleaning
    After importing the dataset into R, we had 3 different data frames containing the variables that we had chosen during the export from MongoDB. We identified variables such as business_id and user_id in the review data frame, which showed us the potential to merge these data
  • 11. frames together into one. Before doing that, we had to ensure that the data was clean, that is, free of erroneous data and missing values. After running na.omit in R to look for missing values, we realized that the data had already gone through rigorous cleaning by the Yelp developer team. To ensure there was no erroneous data, we reviewed the structure of the data frames by running the str command in R.
    Data Transformation
    Most of our data frame contains numerical data pertaining to ratings, review counts, number of fans, number of compliments, number of votes and so on, and doesn’t require any transformation since it is already continuous in nature. We also identified variables in the user data frame that could be aggregated into one. The variables compliment_hot, compliment_more, compliment_profile, compliment_cute, compliment_list, compliment_note, compliment_plain, compliment_cool, compliment_funny, compliment_writer and compliment_photos could be aggregated into a single user_compliments variable, and the variables useful, funny and cool (which are essentially upvotes that users received for their reviews) could be aggregated into a single user_votes variable. Since this was just our initial hunch, we wanted to perform a data reduction procedure to confirm it. Finally, the data frames showed us the potential to merge the review, business and user data frames into a single data frame using the business_id and user_id variables. We performed two left joins in R to achieve this (after renaming certain conflicting variables in all the data frames). After doing this, we
  • 12. realized that we no longer needed the respective id variables, so we dropped them.
    Data Reduction
    To confirm our hunch about the compliments and votes fields in the user data frame, we ran a principal component analysis to see if we were actually dealing with just one variable:
    1. PCA results using compliment_hot, compliment_more, compliment_profile, compliment_cute, compliment_list, compliment_note, compliment_plain, compliment_cool, compliment_funny, compliment_writer and compliment_photos from the user data frame. The code we used to run this procedure is shown below.
  • 13. 2. PCA results using useful, funny and cool from the user data frame. The code we used to run this procedure is shown below.
    The results of the two PCA procedures have indeed shown us that we are actually dealing with just one variable in both cases. To complete the reduction procedure, we went ahead and merged
  • 14. all the compliments variables into a single user_compliments variable by summing them, and all the votes variables into a single user_votes variable in the same way. To further reduce our sample size, we decided to focus on a single state in the United States for the rest of our analysis. We randomly chose Wisconsin and created a subset of the final data frame using the state variable as a filter. Finally, we removed the 2 date variables, the cities variable and the business_name variable, leaving us with only numerical/ordinal data types, which makes sense given the numerical nature of the dataset itself.
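The report ran its PCA in R, and that code is not reproduced in this transcript. As a rough illustration of the check described above (do several columns behave like one variable?), here is a minimal Python sketch using only the standard library. It estimates the share of total variance captured by the first principal component through power iteration on the covariance matrix. The function names and the toy vote counts are hypothetical, and working on the unscaled covariance matrix is an assumption rather than the report's confirmed setting.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length numeric rows."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    cov = [[0.0] * k for _ in range(k)]
    for r in rows:
        dev = [r[j] - means[j] for j in range(k)]
        for a in range(k):
            for b in range(k):
                cov[a][b] += dev[a] * dev[b]
    return [[c / (n - 1) for c in row] for row in cov]


def leading_eigenvalue(matrix, iterations=500):
    """Largest eigenvalue of a symmetric positive semi-definite matrix.

    Uses plain power iteration from a uniform start vector, which assumes
    the start vector is not orthogonal to the leading eigenvector.
    """
    k = len(matrix)
    v = [1.0 / math.sqrt(k)] * k
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(matrix[a][b] * v[b] for b in range(k)) for a in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
        eigenvalue = norm
    return eigenvalue


def first_component_share(rows):
    """Fraction of total variance explained by the first principal component."""
    cov = covariance_matrix(rows)
    total_variance = sum(cov[j][j] for j in range(len(cov)))
    return leading_eigenvalue(cov) / total_variance


# Hypothetical (useful, funny, cool) vote counts for five users.
toy_votes = [(2, 1, 1), (4, 2, 2), (6, 3, 3), (8, 4, 4), (10, 5, 6)]
share = first_component_share(toy_votes)
print(f"first component explains {share:.1%} of the variance")
```

A share close to 1 suggests the columns carry essentially one signal, which is the situation in which summing them into a single variable such as user_votes or user_compliments is reasonable.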
  • 15. Descriptive Statistics
    Here’s a table showing some basic descriptive statistics of our final data frame:
    Variable               Mean    Median  Min  Max     Std Dev  Skew   Kurtosis
    review_stars           3.723   4       1    5       1.33     -0.82  -0.53
    useful                 1.008   0       0    1128    1.7      6.82   131.2
    funny                  0.4195  0       0    632     1.09     19.48  809.42
    cool                   0.5262  0       0    513     1.18     14.8   495.64
    business_stars         3.726   4       1    5       0.67     -0.77  1.16
    business_review_count  326.1   97.0    3    6414    180.99   3.94   19.39
    user_review_count      125.6   25      0    11284   198.06   6.8    131.59
    user_fans              10.94   1       0    4691    33.44    19.55  682.39
    user_average_stars     3.73    3.79    1    5       0.71     -1.11  2.76
    user_compliments       206.1   2       0    266318  781.03   40.66  2675.61
    user_votes             665.2   5       0    529730  2604.22  20.81  705.91
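For readers who want to reproduce the columns of a table like this one, the following is a minimal Python sketch using only the standard library. It uses the plain moment-based definitions of skewness and excess kurtosis; the R function the team actually used is not stated, and R packages apply slightly different small-sample corrections, so values may differ in the last digits. The function name and the sample ratings are hypothetical.

```python
import statistics


def describe(values):
    """Basic descriptive statistics for one numeric column."""
    n = len(values)
    mean = statistics.fmean(values)
    # Central moments (population form, dividing by n).
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return {
        "mean": mean,
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "std_dev": statistics.stdev(values),  # sample standard deviation
        "skew": m3 / m2 ** 1.5,               # negative means left skewed
        "kurtosis": m4 / m2 ** 2 - 3.0,       # excess kurtosis, 0 for a normal
    }


# Hypothetical star ratings, concentrated at the top of the scale.
toy_stars = [1, 2, 3, 4, 4, 5, 5, 5, 5, 5]
summary = describe(toy_stars)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

A negative skew, as reported for review_stars and business_stars, means the long tail is on the low side of the scale; the large positive values for the vote and compliment counts indicate a long right tail.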
  • 16. We have chosen the 11 numerical variables that we think might matter the most in our model. After filtering our data down to the Wisconsin area, we were left with 88778 observations. Here are histograms for our target variables:
    1. review_stars
    The graph is clearly left skewed, showing a higher concentration at 5. This is very likely due to the nature of internet rating systems, in which users tend to vote in extremes. In our case, however, the concentration is clearly on one extreme. Further investigation could reveal more about the nature of this variable.
  • 17. 2. business_stars
    This histogram shows that the data tends towards normal but is still left skewed, like the previous one. The concentration is at the 4 level, showing that the average rating a business receives is around 4. We can also see that very few businesses are able to maintain a rating of 5. This makes sense in the real world, where only a few restaurants are truly considered the best, with all or most users giving a full rating of 5 along with their reviews. But this still doesn’t tell us anything about the number of reviews each business received. There could be businesses with very few reviews, all of them positive. Comparing these with much larger businesses that have many more reviews is unfair, since a large business could lose its standing at 5 because of only a few bad reviews.
  • 18. 3. business_review_count
    This histogram shows that the data is heavily right skewed. There are many businesses with very few reviews, and the number of businesses decreases as the number of reviews increases. This clearly shows that only a few businesses are truly popular on Yelp. It can also be seen as a real-world bias in which customers tend to trust the more popular businesses when purchasing a product. These kinds of businesses have an advantage over newly founded businesses or businesses that are just entering the market. This is often referred to as the “first mover advantage” in the industry: the business has been around for a while, which has allowed it to gain its popularity.
  • 19. Data Dictionary
    This is the dictionary of our final merged dataset:
    Variable               Data type  Description
    review_stars           int        Star rating, rounded to half stars
    useful                 int        Number of useful votes sent by the user
    funny                  int        Number of funny votes sent by the user
    cool                   int        Number of cool votes sent by the user
    business_stars         int        Number of stars the business has
    business_review_count  int        Number of reviews
    user_review_count      int        Number of reviews
    user_fans              int        Number of fans user has
    user_average_stars     int        Number of average stars user has given
    user_compliments       int        Number of compliments user has given
    user_votes             int        Number of votes the user has given
    Source for all variables: https://www.yelp.com/dataset_challenge
  • 20. Modelling Techniques
    The goal of our project is to assess the effect of different factors on the ratings (on a scale of 1 to 5) that a business receives on the Yelp app. The main idea is to see if there is a relationship between user behavior (based on the variables present in this dataset), businesses and the reviews that users write for these businesses. We have chosen 3 target variables because we feel that they could be important in understanding how online rating systems work. Here’s a breakdown of the 3 target variables and what we hope to learn from the predictor variables we have:
    1. review_stars - The rating associated with every individual review
       a. Does the number of votes (useful/funny/cool) affect the overall rating of a review? Do users upvote the good reviews or the bad reviews?
       b. Does the popularity (compliments/votes/fans) of users affect the overall rating of a review? Are popular users strict or lenient with their ratings?
       c. Does the history of a user’s rating behavior (average rating) affect the overall rating of a review?
       d. Are the active users (number of reviews given by a user) more strict or lenient with their reviews? Is there an association here?
       e. Does the current standing of a business on the app (rating, review count) affect the rating a user is going to give the business?
    2. business_stars - The average rating of each individual business based on the reviews it received
  • 21.    a. Does the number of votes (useful/funny/cool) that each review of a particular business has received affect the overall rating of the business?
       b. Does the popularity (compliments/votes/fans) of the users who have rated a particular business affect the overall rating of the business? Are popular users associated with highly rated businesses?
       c. Does the rating history (average rating) of the users who have rated a particular business affect the overall rating of the business?
       d. Do the active users (number of reviews given by a user) play a part in determining the overall rating of a business?
       e. Does the rating of each review that a business has received affect the overall rating of the business?
    3. business_review_count - The total number of reviews each business has received over time
       a. Does the number of votes (useful/funny/cool) that each review of a particular business has received affect the number of reviews the business receives?
       b. Does the popularity (compliments/votes/fans) of the users who have rated a particular business affect the number of reviews the business receives? Are popular users associated with popular businesses?
       c. Does the rating history (average rating) of the users who have rated a particular business affect the number of reviews the business receives?
       d. Do the active users (number of reviews given by a user) play a part in determining the number of reviews a business receives?
  • 22.    e. Does the rating of each review that a business has received affect the number of reviews the business receives?
    Given that all of our variables (targets and predictors) are numeric and that we are trying to understand the correlations and associations between them as described above, we will build a Regression model. We believe that a regression model can not only reveal the truly significant predictor variables, but also give us an equation that can be used to predict a pattern in this dataset, which helps a great deal in understanding the nature of these target variables. The main assumption here is that all or most of our chosen predictor variables are linearly related to at least one of our target variables. We expect the model to tell us whether this holds. Identifying only a few predictor variables (based on significance) from the pool would also be a valuable insight.
    The second modelling technique we plan to use is a Neural Network model. Along with giving us information about the correlations and associations between our chosen variables, this model can also help a great deal in predicting the final value of our target variables based on the trends and patterns present in the dataset. The model itself would produce these predicted values. Our presumption is that the more hidden layers (the ANN counterpart of neurons) we insert into the model, the better the results (in terms of computation and accuracy) will be. This model will also reveal the “weights” associated with each of the predictor variables, which are similar to the coefficients we observe in the result of a regression model. This information can again tell us which of the given predictor variables are the most significant.
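To make the idea of "an equation which can be used to predict" concrete, here is a minimal Python sketch of ordinary least squares with a single predictor, written with the standard library only. It is an illustration, not the report's model: the report fits ten predictors per target in R, while this sketch fits one hypothetical predictor (a user's average rating) against one hypothetical target (the rating of a review). The function name and the toy numbers are assumptions.

```python
def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by least squares; also return R squared."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("the predictor has no variation")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R squared: share of the target's variance explained by the fitted line.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return slope, intercept, r_squared


# Hypothetical data: a user's average rating vs. the rating of one review.
toy_user_average = [2.0, 3.0, 3.5, 4.0, 4.5]
toy_review_stars = [2, 3, 3, 4, 5]
slope, intercept, r_squared = fit_simple_ols(toy_user_average, toy_review_stars)
print(f"review_stars = {intercept:.2f} + {slope:.2f} * user_average_stars")
print(f"R squared = {r_squared:.2f}")
```

The slope plays the role of a regression coefficient: its sign and size say how the target moves with the predictor, which is the kind of reading the report intends to take from its fitted models.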
  • 23. Given that we have 3 target variables, we are planning to implement both of the modelling techniques on each of the target variables and choose the best (the most successful) model for each of them. Regression For the regression models we plan to create for the target variables, the dataset should satisfy the assumptions of linearity, no collinearity, homoscedasticity and normality of residuals. The tables below show the tests we performed for each of the target variables: Table: Collinearity (assessment of VIF values) by target variable. review_stars: [VIF output image]. business_stars: [VIF output image].
  • 24. business_review_count: [VIF output image]. This assumption can be successfully verified for all of the target variables, given that the VIF values are well below 10. Therefore, we have no issues with collinearity. Table: Normality of residuals (assessment of Q-Q plot) by target variable. review_stars: [Q-Q plot image]. business_stars: [Q-Q plot image].
  • 25. business_review_count: [Q-Q plot image]. From the graphs above, it’s clear that the normality assumption can only be verified for the target variable review_stars. This can be seen in the alignment of the Q-Q plots. Only the first graph shows normality, while the other two either have too many outliers (business_review_count) or deviate noticeably from normality (business_stars). Therefore, the normality assumption is verified only for the target variable review_stars. Table: Constant variance (assessment of scatter plot) by target variable. review_stars: [scatter plot image].
  • 26. business_stars: [scatter plot image]. business_review_count: [scatter plot image]. From the graphs above, it’s clear that the homoscedasticity assumption can only be verified for the target variables review_stars and business_stars. This can be seen in the scatterplots, where we observe a similar number of data points on both sides of the regression line. The third plot, however, fails to show the same trend. Therefore, the homoscedasticity assumption is verified only for the target variables review_stars and business_stars.
  • 27. Linearity (assessment of correlation procedure): [correlation matrix image]. The correlation matrix in the table above shows the significance (p values = 0) of all of the variables in the dataset. Therefore, the linearity assumption is verified for all of our target variables.
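The collinearity check described earlier (VIF values well below 10) can be sketched as follows. This is a minimal illustration in plain Python rather than the R procedure used in the project; the helper names (`solve_linear_system`, `r_squared`, `vif`) and the toy data are assumptions made up for this example. Each predictor is regressed on the remaining predictors, and its VIF is 1 / (1 - R²) of that regression.

```python
# Sketch: variance inflation factor (VIF) for each predictor.
# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j
# on all the other predictors (with an intercept).
# All names and data below are hypothetical, for illustration only.

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def r_squared(xs, y):
    """R^2 of a least-squares fit of y on the columns in xs plus an intercept."""
    rows = [[1.0] + [col[i] for col in xs] for i in range(len(y))]
    k = len(rows[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * y[i] for i, r in enumerate(rows)) for a in range(k)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(bj * rj for bj, rj in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def vif(columns):
    """Return {name: VIF} for a dict of equally long predictor columns."""
    result = {}
    for name, col in columns.items():
        others = [c for other, c in columns.items() if other != name]
        result[name] = 1.0 / (1.0 - r_squared(others, col))
    return result

# Made-up toy data; these are NOT values from the Yelp dataset.
toy = {
    "useful": [0, 1, 2, 3, 4, 5, 6, 7],
    "funny": [3, 0, 2, 5, 1, 4, 7, 6],
    "cool": [4, 6, 1, 7, 0, 5, 2, 3],
}
toy_vif = vif(toy)
no_collinearity_issue = all(value < 10 for value in toy_vif.values())
```

A VIF of 1 means a predictor is uncorrelated with the others; the threshold of 10 is the rule of thumb the text itself applies.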
  • 28. Though not all of the assumptions have been satisfied for some of the variables, we will still pursue the regression models for the review_stars, business_stars and business_review_count target variables. We will assess these models based on the results we get from the regression procedure. Neural Network The main assumption of a Neural Network model is that the missing values are removed. We ensured this much earlier on when we ran the na.omit procedure in R, which removed all the missing values. Further, the original dataset itself was already clean and well organized due to the preprocessing done by the developers at Yelp. The dataset is therefore in good condition for a Neural Network procedure. However, a Neural Network is generally considered a black box model, which makes interpretation of the model difficult. The plan is to try different hidden layer sizes and activation methods to arrive at the model with the lowest error possible for the given target variables. Data Splitting and Subsampling To make an honest assessment, we want to do a 60-40 split of the dataset. This means that we would have 60% of the data for our training dataset and 40% of the data for our validation dataset. Given the massive size of our dataset, it makes sense for us to follow best practice and use this ratio, since it is considered a sound basis for assessing most models in the real world. The reason we chose a higher training value is to get better results from our models. This is because a higher training value improves the predictive capabilities of most models. A lower
  • 29. testing value usually helps in assessing the error rate more accurately. However, we don’t plan to create a testing dataset, again owing to the size and scope of our main dataset. Therefore, considering the size of our dataset and the predictive analytics we hope to achieve with this project, we are moving ahead with the 60-40 split. The image below shows the code we used to split the data for our regression model. Here is an assessment of the data splits relating to each of our target variables: 1. review_stars
  • 30. 2. business_stars 3. business_review_count Comparing the mean, standard deviation, median, minimum and maximum statistics from the images above, a clear uniformity can be noticed. The split is very well balanced, with the values of these statistics across the split datasets being very close or, in most cases, exactly the same.
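The original splitting code is only available as an image, so the following is a minimal Python sketch of a 60-40 split together with the summary statistics used to compare the two partitions. The function names (`split_60_40`, `summarize`), the seed, and the randomly generated ratings are assumptions for illustration; they are not the project's code or data.

```python
import random
import statistics

def split_60_40(rows, seed=42):
    """Shuffle a copy of rows and return (training, validation) at 60/40."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.6 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def summarize(values):
    """Statistics the text compares across the training and validation sets."""
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }

# Made-up star ratings (1 to 5) standing in for review_stars.
rng = random.Random(0)
review_stars = [rng.randint(1, 5) for _ in range(10_000)]

train, valid = split_60_40(review_stars)
train_stats = summarize(train)
valid_stats = summarize(valid)
```

With a random shuffle and a large sample, the two partitions are expected to have nearly identical summary statistics, which is the uniformity the comparison above relies on.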
  • 31. Data Modelling Based on the assessments and subsampling done above, we are going to create regression and neural network models using review_stars, business_stars and business_review_count as our target variables. The idea is to understand the effects of our chosen predictor variables on our chosen target variables. Model A (Regression) Target variable: review_stars (The rating associated with each individual review) Predictor variables: useful, funny, cool, user_compliments, user_votes, user_fans, user_average_stars, user_review_count, business_stars, business_review_count Here is an image of the model we built by running the regression procedure in R:
  • 32. Interpretation We can see from the results above that all of our indicator variables except for business_review_count are significant at a 0.05 level of significance. We can also see that user_compliments and user_votes are significant at a 0.01 level of significance while useful, funny, cool, user_review_count, user_average_stars and business_stars are all significant at a 0.001 level of significance. From the looks of it, our hunch about most of our chosen predictor variables is true. All these variables have an effect on the review_stars target variable. But to what extent, and what do these results mean? This is delineated below: 1. The variable useful is significant at a 0.001 level of significance. From the coefficient, we can say that for a unit increase in useful, there is a 0.1351 decrease in review_stars. In other words, when a review has more useful votes, the rating of the review tends to decrease. This is contrary to the positive linear relationship we expected. We assumed a review with a higher rating would have more useful votes, but it seems that users find the stricter reviews more useful than the more lenient ones. 2. The variable funny is significant at a 0.001 level of significance. From the coefficient, we can say that for a unit increase in funny, there is a 0.1576 decrease in review_stars. In other words, when a review has more funny votes, the rating of the review tends to decrease. This relationship is in line with what we believed would be a negative linear relationship. It makes sense for a review with a lower rating to have more funny votes, since users find the stricter reviews to be funnier than the more lenient ones. This could be due to the more sarcastic tone users might use in their bad reviews.
  • 33. 3. The variable cool is also significant at a 0.001 level of significance. From the coefficient, we can say that for a unit increase in cool, there is a 0.2989 increase in review_stars. In other words, when a review has more cool votes, the rating of the review tends to increase. This relationship is in line with what we believed might be a positive linear relationship. It makes sense for a review with a higher rating to have more cool votes. 4. The variable user_review_count is also significant at a 0.001 level of significance. From the coefficient, we can say that for a unit increase in user_review_count, there is a 0.000109 increase in review_stars. In other words, when a user has written more reviews, the rating of their review tends towards a higher value. This relationship is in line with what we believed might be a positive linear relationship. It makes sense for an active reviewer to be more lenient with their reviews. 5. The variable user_average_stars is also significant at a 0.001 level of significance. From the coefficient, we can say that for a unit increase in user_average_stars, there is a 0.7706 increase in review_stars. In other words, when a user has a higher average rating score, the rating of their review tends towards a higher value. This relationship is in line with what we believed might be a positive linear relationship. It makes sense for a user with a higher average rating to award higher ratings in their reviews. 6. The variable user_compliments is significant at a 0.01 level of significance. From the coefficient, we can say that for a unit increase in user_compliments, there is a 0.00002133 decrease in review_stars. In other words, when a user has received more compliments (in other words, a popular user), the rating of their review tends towards a lower value.
  • 34. This relationship is in line with what we believed might be a negative linear relationship. It makes sense for a popular reviewer to be stricter with their reviews. The popularity of a user is an influencer of the rating given in their reviews. 7. The variable user_fans is significant at a 0.05 level of significance. From the coefficient, we can say that for a unit increase in user_fans, there is a 0.0003611 decrease in review_stars. In other words, when a user has more fans (in other words, a popular user), the rating of their review tends towards a lower value. This relationship is in line with what we believed might be a negative linear relationship. It makes sense for a popular reviewer to be stricter with their reviews. The popularity of a user is an influencer of the rating given in their reviews. 8. The variable user_votes is significant at a 0.01 level of significance. From the coefficient, we can say that for a unit increase in user_votes, there is a 0.000005412 increase in review_stars. In other words, when a user has more votes (that is, their reviews have received more votes), the rating of their review tends towards a higher value. This relationship is in line with what we believed might be a positive linear relationship. 9. The variable business_stars is significant at a 0.001 level of significance. From the coefficient, we can say that for a unit increase in business_stars, there is a 0.7023 increase in review_stars. In other words, when a business has a higher average rating, the rating of the reviews that the business receives tends towards a higher value. This relationship is in line with what we believed might be a positive linear relationship. It makes sense for a business with a higher rating to receive more such positive reviews.
  • 35. 10. The variable business_review_count is not significant at all. The number of reviews a business receives does not influence the rating of the reviews it receives. This makes sense, since we can’t say that popular businesses (in terms of reviews) receive a higher or a lower rating. It depends on what the user experienced when the review was given. Therefore, it is clear that users have a big role to play in deciding the rating of a review given to any business on the Yelp app. Considering the social nature of apps like Yelp, this makes a lot of sense. Model B (Neural Network) Target variable: review_stars (The rating associated with each individual review) Predictor variables: useful, funny, cool, user_compliments, user_fans, user_average_stars, user_votes, business_stars, business_review_count (We removed the user_review_count variable after observing better results without it) Activation: RELU Hidden Layers: 200 After experimenting with the activation type and number of hidden layers, we would like to present the best model for our target variable. Here is an image of the model we built by running the neural network procedure in Python:
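The model image is not reproduced in this transcript. As a rough illustration of what a hidden layer with ReLU activation computes, here is a small Python sketch of a forward pass. It is not the project's model: the two hidden units, the three input features, and every weight below are hypothetical values chosen so the arithmetic is easy to follow.

```python
def relu(x):
    """Rectified linear unit: pass positive values through, clamp the rest to 0."""
    return x if x > 0 else 0.0

def forward(features, w_hidden, b_hidden, w_out, b_out):
    """One hidden layer with ReLU activation, followed by a linear output unit."""
    hidden = [
        relu(sum(w * f for w, f in zip(weights, features)) + bias)
        for weights, bias in zip(w_hidden, b_hidden)
    ]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out

# Hypothetical inputs, e.g. (useful, funny, cool) votes for one review.
features = [1.0, 2.0, 3.0]

# Hypothetical weights for two hidden units (the project reports 200).
w_hidden = [[0.5, -1.0, 0.25], [-0.5, 0.5, 0.0]]
b_hidden = [0.0, 0.1]
w_out = [2.0, 1.5]
b_out = 3.0

# Unit 1: 0.5 - 2.0 + 0.75 = -0.75, clamped to 0 by ReLU.
# Unit 2: -0.5 + 1.0 + 0.0 + 0.1 = 0.6, passed through unchanged.
predicted_review_stars = forward(features, w_hidden, b_hidden, w_out, b_out)
```

Training adjusts the weights so that predictions like this one match the observed target values as closely as possible; the sketch only shows the prediction step.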
  • 36. Interpretation Since a neural network model is a black box, we won’t be able to say much about the specific relationship that exists between the target variable and the indicator variables. We can, however, assess the error rate produced by the model along with the R² value to determine the efficiency of the model. In this particular model we can see that the mean absolute error is 0.7309 and the mean square error is 0.94. These values indicate a low error rate in the model. The R² value is 0.45 (or 45%), which is another indication of this being a good model. This model can be used for further predictive analysis. Here are a few predicted values from the model as output by Python: The patterns associated with the given set of indicator variables and their respective values can be observed by comparing them to the predicted values of the variable review_stars. These predictions can be considered reasonably accurate given the low error rate of the model.
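The three quantities used to judge the neural network models (mean absolute error, mean square error and R²) can be computed from actual and predicted values as in the sketch below. The function names and the five example ratings are assumptions made up for illustration; they are not output from the project's models.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mean_squared_error(actual, predicted):
    """Average of (actual - predicted)^2."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def r2_score(actual, predicted):
    """1 - SS_res / SS_tot: share of the target's variance explained."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Made-up example: five actual star ratings and model predictions.
actual = [5, 4, 3, 1, 2]
predicted = [4.5, 4.0, 3.5, 2.0, 2.0]

mae = mean_absolute_error(actual, predicted)  # (0.5 + 0 + 0.5 + 1 + 0) / 5
mse = mean_squared_error(actual, predicted)   # (0.25 + 0 + 0.25 + 1 + 0) / 5
r2 = r2_score(actual, predicted)              # 1 - 1.5 / 10
```

Note that MAE and MSE are expressed in the units of the target, so an error of 0.73 on a 1 to 5 star scale and an error of 95 on a review count are not directly comparable; R² is unitless.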
  • 37. Model C (Regression) Target variable: business_stars (The average rating of a business) Predictor variables: useful, funny, cool, user_compliments, user_votes, user_fans, user_average_stars, user_review_count, review_stars, business_review_count Here is an image of the model we built by running the regression procedure in R: Interpretation We can see from the results above that all of our indicator variables except for user_average_stars, user_compliments and user_votes are significant at a 0.05 level of significance. We can also see that user_fans is significant at a 0.01 level of significance while review_stars, useful, funny, cool, user_review_count and business_review_count are all significant at a 0.001 level of significance. From the looks of it, our hunch about most of our
  • 38. chosen predictor variables is true. All these variables have an effect on the business_stars target variable. But to what extent, and what do these results mean? This is delineated below: 1. The variable review_stars is significant at a 0.001 level of significance. From the coefficient, we can say that for a unit increase in review_stars, there is a 0.2374 increase in business_stars. In other words, when a business has more reviews with higher ratings, the rating of the business tends to increase. This relationship is in line with what we believed might be a positive linear relationship. 2. The variable useful is significant at a 0.001 level of significance. From the coefficient, we can say that for a unit increase in useful, there is a 0.11884 increase in business_stars. In other words, when a business has more reviews with higher useful votes, the rating of the business tends to increase. This relationship is in line with what we believed might be a positive linear relationship. 3. The variable funny is significant at a 0.001 level of significance. From the coefficient, we can say that for a unit increase in funny, there is a 0.1967 decrease in business_stars. In other words, when a business has more reviews with higher funny votes, the rating of the business tends to decrease. This relationship is in line with what we believed would be a negative linear relationship. 4. The variable cool is also significant at a 0.001 level of significance. From the coefficient, we can say that for a unit increase in cool, there is a 0.01167 increase in business_stars. In other words, when a business has more reviews with higher cool votes, the rating of the business tends to increase. This relationship is in line with what we believed
  • 39. might be a positive linear relationship. 5. The variable user_review_count is also significant at a 0.001 level of significance. From the coefficient, we can say that for a unit increase in user_review_count, there is a 0.00004964 decrease in business_stars. In other words, when a business has more reviews written by users who have themselves written more reviews, the rating of the business tends towards a lower value. 6. The variable user_average_stars is not significant at all. The average rating of users who write reviews for businesses doesn’t affect the average rating of a business. This is surprising, since our hunch was that there would be a relationship between these two variables. 7. The variable user_compliments is not significant at all. The number of compliments received by users who write reviews for businesses doesn’t affect the average rating of a business. 8. The variable user_fans is significant at a 0.05 level of significance. From the coefficient, we can say that for a unit increase in user_fans, there is a 0.0002323 increase in business_stars. In other words, when a business has more reviews from users who have more fans (in other words, popular users), the rating of the business tends towards a higher value. This relationship is in line with what we believed might be a positive linear relationship. 9. The variable user_votes is not significant at all. The number of votes received by users who write reviews for businesses doesn’t affect the average rating of a business.
  • 40. 10. The variable business_review_count is significant at a 0.001 level of significance. From the coefficient, we can say that for a unit increase in business_review_count, there is a 0.0004423 increase in business_stars. In other words, when a business has a higher number of reviews, the rating of the business tends towards a higher value. This relationship is in line with what we believed might be a positive linear relationship. Model D (Neural Network) Target variable: business_stars (The average rating of each business) Predictor variables: useful, funny, cool, user_compliments, user_fans, user_average_stars, user_votes, review_stars, business_review_count (We removed the user_review_count variable after observing better results without it) Activation: RELU Hidden Layers: 200 After experimenting with the activation type and number of hidden layers, we would like to present the best model for our target variable. Here is an image of the model we built by running the neural network procedure in Python:
  • 41. Interpretation Since a neural network model is a black box, we won’t be able to say much about the specific relationship that exists between the target variable and the indicator variables. We can, however, assess the error rate produced by the model along with the R² value to determine the efficiency of the model. In this particular model we can see that the mean absolute error is 0.4163 and the mean square error is 0.3047. These values indicate a low error rate in the model. The R² value is 0.31 (or 31%), which is another indication of this being a reasonably good model. This model can be used for further predictive analysis. Here are a few predicted values from the model as output by Python: The patterns associated with the given set of indicator variables and their respective values can be observed by comparing them to the predicted values of the variable business_stars. These predictions can be considered reasonably accurate given the low error rate of the model.
  • 42. Model E (Regression) Target variable: business_review_count (The number of reviews a business receives) Predictor variables: useful, funny, cool, user_compliments, user_votes, user_fans, user_average_stars, user_review_count, review_stars, business_stars Here is an image of the model we built by running the regression procedure in R: Interpretation We can see from the results above that all of our indicator variables except for review_stars, user_average_stars and user_fans are significant at a 0.05 level of significance. We can also see that funny is significant at a 0.05 level of significance, user_average_stars is significant at a 0.1 level of significance, while useful, cool, user_review_count, user_compliments, user_votes and business_stars are all significant at a 0.001 level of significance. From the looks of it, our hunch
  • 43. about most of our chosen predictor variables is true. All these variables have an effect on the business_review_count target variable. But to what extent, and what do these results mean? This is delineated below: 1. The variable review_stars is not significant at all. The rating of reviews received by a business does not affect the number of reviews a business receives. 2. The variable useful is significant at a 0.001 level of significance. From the coefficient, we can say that for a unit increase in useful, there is a 6.662 decrease in business_review_count. In other words, when a business has more reviews with higher useful votes, the number of reviews the business receives tends to decrease. 3. The variable funny is significant at a 0.05 level of significance. From the coefficient, we can say that for a unit increase in funny, there is a 2.005 increase in business_review_count. In other words, when a business has more reviews with higher funny votes, the number of reviews the business receives tends to increase. 4. The variable cool is significant at a 0.001 level of significance. From the coefficient, we can say that for a unit increase in cool, there is a 4.23 increase in business_review_count. In other words, when a business has more reviews with higher cool votes, the number of reviews the business receives tends to increase. 5. The variable user_review_count is also significant at a 0.001 level of significance. From the coefficient, we can say that for a unit increase in user_review_count, there is a 0.03024 increase in business_review_count. In other words, when a business has more reviews written by users who have themselves written more reviews, the number of reviews the business receives tends to increase.
  • 44. 6. The variable user_average_stars is significant at a 0.1 level of significance. For a unit increase in user_average_stars, there is a 1.659 increase in business_review_count. In other words, when a business has more users with a higher average rating, the number of reviews the business receives tends to increase. 7. The variable user_compliments is significant at a 0.001 level of significance. For a unit increase in user_compliments, there is a 0.004867 increase in business_review_count. In other words, when a business has more users with a higher number of compliments, the number of reviews the business receives tends to increase. 8. The variable user_fans is not significant at all. The number of fans of a user who has reviewed a business does not affect the number of reviews a business receives. 9. The variable user_votes is significant at a 0.001 level of significance. For a unit increase in user_votes, there is a 0.002149 decrease in business_review_count. In other words, when a business has more users with a higher number of votes, the number of reviews the business receives tends to decrease. 10. The variable business_stars is significant at a 0.001 level of significance. From the coefficient, we can say that for a unit increase in business_stars, there is a 0.4211 increase in business_review_count. In other words, when a business has a higher average rating, the number of reviews the business receives tends to increase.
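The "unit increase" readings in the list above all follow the same rule: in a linear regression, the predicted change in the target is the coefficient times the change in the predictor, with the other predictors held fixed. The sketch below applies that rule to the Model E coefficients quoted in the text. The intercept is not reported, so only changes in the prediction are computed; the function name `predicted_change` and the example scenario are assumptions for illustration.

```python
# Model E coefficients as quoted in the interpretation above
# (target: business_review_count). Non-significant predictors are omitted.
model_e_coefficients = {
    "useful": -6.662,
    "funny": 2.005,
    "cool": 4.23,
    "user_review_count": 0.03024,
    "user_average_stars": 1.659,
    "user_compliments": 0.004867,
    "user_votes": -0.002149,
    "business_stars": 0.4211,
}

def predicted_change(coefficients, deltas):
    """Change in the predicted target for the given changes in predictors,
    holding every predictor not listed in deltas constant."""
    return sum(coefficients[name] * delta for name, delta in deltas.items())

# Hypothetical scenario: one more useful vote and one more cool vote.
change_one_useful_one_cool = predicted_change(
    model_e_coefficients, {"useful": 1, "cool": 1}
)

# Hypothetical scenario: reviewers with 100 more reviews each.
change_100_user_reviews = predicted_change(
    model_e_coefficients, {"user_review_count": 100}
)
```

This also shows why a tiny coefficient is not necessarily unimportant: user_review_count has a coefficient of only 0.03024, but the predictor ranges over hundreds of reviews, so its total contribution can be comparable to that of the vote counts.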
  • 45. Model F (Neural Network) Target variable: business_review_count (The number of reviews each business has) Predictor variables: useful, funny, cool, user_compliments, user_fans, user_average_stars, user_votes, business_stars, review_stars (We removed the user_review_count variable after observing better results without it) Activation: RELU Hidden Layers: 100 After experimenting with the activation type and number of hidden layers, we would like to present the best model for our target variable. Here is an image of the model we built by running the neural network procedure in Python: Interpretation In this particular model we can see that the mean absolute error is 95.61 and the mean square error is 28223.50. These values indicate a very high error rate in the model. The R² value is 0.14 (or 14%), which is not bad, but the error rate is too high for this to be considered a good model. Further analysis, or identification of more significant variables (which we probably didn’t include in the beginning), is required to improve the predictive capabilities of this model. Here are predicted values from the model (which might not be very accurate):
  • 46. Model Assessment From the six models we’ve built above, our goal is to choose the best model for each of our three target variables. For the objective assessment, we will be comparing the R² values from the two models built for each target. For the subjective assessment, we will be elaborating on the implications of the model in the real world. Model A vs Model B (review_stars) For the objective assessment of the regression and the neural network model for the target variable review_stars, let’s first compare the R² values produced by each of the models. The table below shows the value produced by both procedures: Regression (Model A): R² = 0.4405 (44.05%). Neural Network (Model B): R² = 0.4715 (47.15%).
  • 47. Firstly, both of these are very good models, with high R² values. From the table above it is clear that Model B performs better than Model A. The accuracy of the Neural Network model is slightly higher than that of the regression model. From a real world perspective, however, the regression model makes more sense, considering that it helps us understand the exact relationship between the target and predictor variables. In our scenario, the goal is to understand what influences the rating associated with each individual review, and the regression model does this job best. Therefore, we would choose Model A as our model of choice for the target variable review_stars. Model C vs Model D (business_stars) For the objective assessment of the regression and the neural network model for the target variable business_stars, let’s first compare the R² values produced by each of the models. The table below shows the value produced by both procedures: Regression (Model C): R² = 0.2518 (25.18%). Neural Network (Model D): R² = 0.3173 (31.73%). Firstly, both of these are reasonably good models with decently high R² values. From the table above it is clear that Model D performs better than Model C. The accuracy of the Neural Network model is slightly higher than that of the regression model. From a real world perspective, however, the regression model makes more sense, considering that it helps us understand the exact relationship between the target and predictor variables. But the assumptions for regression weren’t satisfied earlier for this model, hence choosing regression wouldn’t be wise in
  • 48. this case. In this scenario, using the Neural Network model for predicting the behaviour of the variable business_stars makes more sense. Therefore, we would choose Model D as our model of choice for the target variable business_stars. Model E vs Model F (business_review_count) For the objective assessment of the regression and the neural network model for the target variable business_review_count, let’s first compare the R² values produced by each of the models. The table below shows the value produced by both procedures: Regression (Model E): R² = 0.0208 (2.08%). Neural Network (Model F): R² = 0.1493 (14.93%). Firstly, neither is a very good model, as both have comparatively low R² values. From the table above it is clear that Model F performs better than Model E. Though the Neural Network model has a moderate R² value, the error rate for this model is very high (as observed earlier), which brings into question the accuracy of this model. From a real world perspective, however, the regression model makes more sense, considering that it helps us understand the exact relationship between the target and predictor variables. But the assumptions for regression weren’t satisfied earlier for this model and the R² value is very low, hence choosing regression wouldn’t be the right way to go. Therefore, we wouldn’t choose either of the models for the target variable business_review_count. Further analysis or consideration of other significant variables is required before coming to any conclusions about this particular target variable.
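The three comparisons above follow a common pattern that can be summarized in a short Python sketch. The R² values are copied from the comparison tables; the two boolean flags per target and the `choose_model` rule are our paraphrase of the reasoning in the text (prefer the interpretable regression when its assumptions held, otherwise fall back to the neural network if its error rate was judged acceptable), not a procedure stated by the original analysis.

```python
# R^2 values copied from the comparison tables in the text.
# The boolean flags are a paraphrase of the assumption checks and error-rate
# judgments made earlier, encoded here for illustration only.
models = {
    "review_stars": {
        "regression_r2": 0.4405, "nn_r2": 0.4715,
        "assumptions_met": True, "nn_error_acceptable": True,
    },
    "business_stars": {
        "regression_r2": 0.2518, "nn_r2": 0.3173,
        "assumptions_met": False, "nn_error_acceptable": True,
    },
    "business_review_count": {
        "regression_r2": 0.0208, "nn_r2": 0.1493,
        "assumptions_met": False, "nn_error_acceptable": False,
    },
}

def choose_model(entry):
    """Prefer the interpretable regression when its assumptions hold;
    otherwise use the neural network if its error is acceptable;
    otherwise choose neither (None)."""
    if entry["assumptions_met"]:
        return "regression"
    if entry["nn_error_acceptable"]:
        return "neural network"
    return None

choices = {target: choose_model(entry) for target, entry in models.items()}

# How much R^2 the neural network gains over the regression for each target.
r2_gain = {
    target: entry["nn_r2"] - entry["regression_r2"]
    for target, entry in models.items()
}
```

Under these flags the rule reproduces the choices made in the text: Model A for review_stars, Model D for business_stars, and neither model for business_review_count.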
  • 49. Model technique Assessment Regression and Neural Networks are both effective modelling techniques, and each has its own strengths and weaknesses. These are delineated below: Regression Regression analysis is a statistical process for estimating the relationships among variables. The focus is on the relationship between a dependent variable and one or more independent variables. Strengths 1. Multiple regression is a very flexible method. The independent variables can be numeric or categorical, interactions between variables can be incorporated, and polynomial terms can also be included. 2. Multiple regression uses multiple independent variables, with each controlling for the others. The parameters or coefficients of each of these variables can be derived using a regression model. 3. Regression models can have accurate predictive capabilities and can be used in forecasting future trends. 4. When relationships between the independent variables and the dependent variable are almost linear, regression shows optimal results. Weaknesses 1. Linear regression is limited to predicting numeric output.
  • 50. Neural Network An Artificial Neural Network (ANN) is an information processing model, loosely inspired by the human brain, that uses artificial neurons (arranged in hidden layers) for computation. Strengths 1. Neural Networks have the ability to understand relationships between the indicator and target variables when they are linearly related to each other, which means that they can be used to understand trends and patterns in a dataset. 2. A Neural Network is capable of self-organization. An ANN can create its own organisation or representation of the information it receives during learning time. 3. A Neural Network is capable of adaptive learning. An ANN has the ability to learn how to do tasks based on the data given for training or initial experience. Weaknesses 1. Since it is difficult to extract interpretable information from a Neural Network model, its implications are very hard to understand.