How Popular Will Your
Restaurant Be?
Text Mining Approach To Word
Frequency Prediction on Online Reviews
Mark Chesney & Albert Nguyen
June 11, 2014
Contents

Abstract
Background
Motivation
Data Acquisition
    Web Scraping
    Training Data
    Testing Data
Data Processing
Data Variables
Model
    Google Prediction API
Acknowledgement of Biases
Analysis Results
    Histograms
    Word Clouds
Evaluation
    Confusion Matrix
Conclusion
Appendices
    Appendix A: Import.io GUI Screenshots
    Appendix B: Python Code using Import.io API
    Appendix C: R Code
    Appendix D: Resulting Training Data
    Appendix E: Google Prediction API Training Page
    Appendix F: Google Prediction API Testing Worksheet
Abstract
While online consumer review sites hold a massive amount of seemingly unorganized
information, big data analysis can treat these sites as a repository of useful information. In
this project we apply data mining techniques to build a model that predicts the star rating a
restaurant would receive, based on the text of its online reviews. We then run the model on
a test set to show that, to a reasonable extent, it possesses predictive power that may help
a restaurateur plan a business expansion.
Background
Yelp is a highly popular company that collects and shares user-provided business reviews
through social media. Though its content is available for any internet user to read, Yelp
appeals particularly to younger demographics and to the smartphone-carrying portion of the
population. In our project, Yelp serves as a vast source of online data on restaurant
reviews from cities across the United States.

The restaurant industry is highly competitive and very sensitive to consumer
sentiment. Online reviews have a significant impact on business: a one-star increase on
Yelp can boost a restaurant's revenue by as much as 9%1. (Interestingly, this effect is
observed only in independent restaurants. Because chain restaurants see no significant
effect from Yelp reviews, Yelp has effectively expanded the success of independent
restaurants in the overall marketplace.) It is this impact that makes a Yelp rating vital for a
restaurateur who hopes to open a successful business.
Motivation
The rating seen on Yelp is the most recent aggregate of all of a restaurant's reviews. The
goal of this analysis is therefore not to predict the most recent Yelp rating, but to predict a
restaurant's potential rating before it opens. For instance, imagine a restaurateur who
recently held a soft launch for her restaurant. She can use the soft launch to gauge diners'
sentiments, for example by handing out surveys. She can then compare the survey results
with the output of this text mining model to estimate her restaurant's potential Yelp rating,
which may in turn inform her business decisions. This is a more robust estimate of a
restaurant's potential rating than a basic survey, since it takes into account diners' detailed
written thoughts rather than just an overall star rating. Our business model is to give
restaurateurs the ability to estimate their future Yelp rating from roughly 40 written
reviews. This estimate gives them a chance to make any needed changes before going
public.
1 Luca, Michael, and Georgios Zervas. "Fake It Till You Make It: Reputation, Competition,
and Yelp Review Fraud." Harvard Business School, September 17, 2013.
http://businessinnovation.berkeley.edu/WilliamsonSeminar/luca092613.pdf
Data Acquisition
Web Scraping
The first step in gathering the Yelp reviews was to determine which restaurants to pull
reviews from. Import.io was our primary web-scraping tool. We first used it to pull the URLs
of 300 Yelp pages for Mexican restaurants in each metropolitan area. We chose the eight
biggest metropolitan areas by population: New York City, Los Angeles, Chicago, Houston,
Philadelphia, Phoenix, San Antonio, and San Diego. We sampled 300 restaurants per city
to get a wide sample for our testing and training data.

After collecting the URLs, we used the Import.io API with Python to scrape the first 40
reviews for each restaurant, giving an upper estimate of 96,000 reviews (the actual figure is
slightly lower because some restaurants had incomplete review pages). After pulling the
data, we used R to convert the exported JSON file into a CSV file. This became the
baseline data from which we built our training and testing data sets.
Training Data
The dataset used to train our model came from the reviews extracted with the Import.io API.
A random number generator split the data, assigning 80% to training. The training data
consist of two variables: the star rating and the review text, which concatenates all 40
reviews for each restaurant. A minimal sketch of this split appears after the table below.
Star Rating   Restaurant Count   Percent
2                    8             0.51%
2.5                 38             2.42%
3                  183            11.63%
3.5                462            29.37%
4                  622            39.54%
4.5                220            13.99%
5                   40             2.54%

Training Data Sample
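A minimal R sketch of that split. The file name Output.csv comes from the conversion script
in Appendix C; the exact column layout of the CSV is an assumption here.

reviews_df <- read.csv("Output.csv", stringsAsFactors = FALSE)  # one row per restaurant

set.seed(42)  # make the random split reproducible
train_idx  <- sample(nrow(reviews_df), floor(0.8 * nrow(reviews_df)))
train_data <- reviews_df[train_idx, ]   # 80% of restaurants for training
test_data  <- reviews_df[-train_idx, ]  # remaining 20% for testing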
Testing Data
The dataset used to test our model was the 20% left over after forming the training
data. We verified that its distribution of ratings was very similar to that of the training
set. The testing data consist of the same two variables: the star rating and the
concatenated 40 reviews.
Star Rating   Restaurant Count   Percent
2                    5             1.27%
2.5                  5             1.27%
3                   46            11.70%
3.5                110            27.99%
4                  144            36.64%
4.5                 70            17.81%
5                   13             3.31%

Test Data Sample
Data Processing
Our procedure transforms the raw text so that irrelevant tokens are separated from useful
terms in the corpus. The steps are as follows:
1) Separation of words conjoined by “/” or “@” into individual terms. Examples are
“yummy/delicious” or “slower@lunchtime”.
2) Conversion of all terms into lower case.
3) Removal of numbers and punctuation symbols.
4) Removal of common stopwords in English. These are encoded in the tm package.
5) Removal of whitespace (blanks and tabs).
6) Stemming, or removal of common word endings, as defined in the SnowballC package.
Examples would be removing “es”, “ed”, and “s”, so that “taco” is not unique from “tacos”.
See Appendix C for the complete R code; a minimal sketch of the pipeline follows.
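The sketch below implements the six steps with the tm and SnowballC packages, using two
toy reviews in place of the full corpus.

library(tm)
library(SnowballC)

# Toy reviews standing in for the scraped corpus
docs <- c("Yummy/delicious tacos!!", "Slower@lunchtime, but the 3 tacos were great")
corpus <- VCorpus(VectorSource(docs))

split_joined <- content_transformer(function(x) gsub("[/@]", " ", x))
corpus <- tm_map(corpus, split_joined)                       # 1) split "/" and "@" joins
corpus <- tm_map(corpus, content_transformer(tolower))       # 2) lower case
corpus <- tm_map(corpus, removeNumbers)                      # 3) drop numbers...
corpus <- tm_map(corpus, removePunctuation)                  #    ...and punctuation
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # 4) English stopwords
corpus <- tm_map(corpus, stripWhitespace)                    # 5) collapse whitespace
corpus <- tm_map(corpus, stemDocument)                       # 6) Snowball stemming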
Data Variables
Our independent variables are the terms: the unique words from the Yelp reviews that
remain in the data file after processing. These 57,626 terms, along with their frequencies of
occurrence, are stored in the term-document matrix. Term frequencies are the essential
input to the prediction model.

Our dependent variable is the restaurant's Yelp star rating. Ratings range from 1 star to 5
stars in half-star increments. Any correlation that particular terms and term frequencies
have with high or low star ratings, while inconclusive on its own, may be useful to
incorporate into the prediction model.
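A minimal sketch of how the term-document matrix and term frequencies are obtained with
tm, continuing from the corpus built in the previous sketch (as.matrix is fine at toy scale but
memory-hungry for a corpus of our full size):

tdm <- TermDocumentMatrix(corpus)   # rows: terms; columns: restaurants

# Overall term frequencies, most frequent first
term_freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(term_freq, 14)                 # the most frequent terms across all reviews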
Model
Our initial approach was to compare rating categories by their correlation with key term
frequencies. We ran into a sample-size problem for several rating categories, since very
few restaurants were rated under 3 stars, so we were unable to produce helpful comparison
plots. We then turned to comparing word clouds across the ratings of the training and test
sets.

We ran several comparisons of word clouds and term frequencies between the different
star ratings. We found some preliminary differences between 2-star and 5-star restaurants:
reviews of 5-star restaurants tend to be more descriptive and to emphasize the restaurant's
environment and service. Nonetheless, we were unable to find strong tendencies in
particular terms or term frequencies that could predict a restaurant's star rating.

In general, we could not make a strong prediction using word clouds and word frequencies
alone. We instead needed to move toward sentiment analysis, which is outside our current
R toolset. The tool we chose is the Google Prediction API, which took our data and
attempted to categorize and rate restaurants based on the uploaded reviews.
Google Prediction API
We used the Google Prediction API to predict the star ratings of the testing dataset.
We created a project within the Google Developers Console and activated the Prediction
API and the Cloud Storage API, the latter used to hold the training data. The training data
were then fed into the Prediction API with the star rating designated as the variable to be
predicted and the review text as the input. After the model was fully trained, it was queried
through a specially designed Google Docs spreadsheet: the reviews from the testing data
were entered into the spreadsheet, and the model returned a predicted star rating for
each. The predicted star ratings were then compared with the actual ratings for the
analysis below.
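The spreadsheet's script is not reproduced here. As an illustration, below is a minimal R
sketch of a single prediction request, assuming the Prediction API's v1.6 REST endpoint;
the project ID, model ID, and OAuth token are hypothetical placeholders.

library(httr)
library(jsonlite)

project <- "my-yelp-project"    # hypothetical Developers Console project ID
model   <- "restaurant-stars"   # hypothetical trained-model ID
token   <- "ya29.placeholder"   # placeholder OAuth 2.0 access token

url <- sprintf(
  "https://www.googleapis.com/prediction/v1.6/projects/%s/trainedmodels/%s/predict",
  project, model)

# The input is the concatenated review text for one restaurant
body <- toJSON(list(input = list(csvInstance = list("Great tacos, friendly staff"))),
               auto_unbox = TRUE)

resp <- POST(url,
             add_headers(Authorization = paste("Bearer", token)),
             content_type_json(),
             body = body)

# The predicted rating comes back as outputValue (regression) or
# outputLabel (classification), depending on how the model was trained
content(resp)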
Acknowledgement of Biases
There are several areas in which bias can be introduced into the data sets. The first bias
may be present in our use of the 8 biggest metropolitan areas as the basis of selecting the
Mexican restaurants. These ratings and reviews may differ from in restaurants in rural
areas and our prediction model might not be effective in predicting ratings in those areas.
The second potential bias may occur when we selected the first 300 restaurants from
Yelp. Yelp has its own sorting algorithm which might impose a certain selection bias in
which restaurants are displayed. Although we used a sample size of 300 to compensate for
this, there is still a possibility of bias especially in cities with a large number of Mexican
restaurants like Los Angeles or San Diego.
Even with the selection of the first 300 restaurants in each metropolitan area, there appears
to be a rating bias. About 68% of the restaurants in our data set are rated 3.5 or 4 stars,
while only about 3% are rated 2 or 2.5 stars. This may reflect a general trend in Yelp
ratings, or a selection bias in our search for Mexican restaurants.
We must acknowledge the biases of Yelp reviews in the first place. Yelp captures the
opinions of only those who use the internet -- and who use it frequently. While internet
users make up a broad sector of the population, customers who actually report their
experiences on Yelp represent a narrower band of people, particularly young, middle-class,
and slightly more affluent consumers.
Further bias may come from potentially fake Yelp reviews. Academic researchers are
studying the prevalence of dishonest reviews -- by some estimates as high as 20%2 --
created by restaurant owners and managers intending to artificially raise their own Yelp
rating or lower those of competitors. As with the revenue effect of Yelp reviews, fake
reviews are more common for independent restaurants than for chains. The same study
shows that fake reviews tend to appear at restaurants with few reviews to begin with,
where any single negative review carries large weight.

2 Ibid.
Analysis Results
Our initial approach, as described in the Model section, was to compare rating categories
by their correlation with key term frequencies, but small sample sizes in several rating
categories prevented us from producing descriptive comparison plots. We therefore
examine term-frequency histograms and word clouds instead.
Histograms
Two histograms, one for the training set and one for the test set, show the 14 most
frequently occurring terms in each set. Despite slight variations in the ordering of term
frequencies, these 14 terms are largely common to both sets, and the overall patterns take
a similar shape. For example, after the most frequent term, "food", the next three terms
share similar frequencies before frequency drops visibly from the fifth word on.
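The histogram images are not reproduced here; a minimal R sketch of how such a plot can
be drawn from the term_freq vector computed above:

# 14 most frequent terms, tallest bar first
top14 <- head(term_freq, 14)
barplot(top14, las = 2, ylab = "Term frequency",
        main = "14 most frequent terms (training set)")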
Word Clouds
Word clouds visualize term frequency by drawing the most frequent terms in larger
type. Because word placement uses a random seed, a frequent word may occasionally be
omitted from the cloud, so these are an approximation aid rather than a quantitative
tool. For the training set we show a high-resolution cloud (term-frequency threshold of
2,500) and a low-resolution cloud (threshold of 10,000). For the test set, the low- and
high-resolution thresholds are 5,000 and 500, respectively.
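A minimal R sketch of one such cloud with the wordcloud package, reusing the term_freq
vector from above; the 2,500 threshold matches the training set's high-resolution cloud.

library(wordcloud)

set.seed(2014)  # placement is randomized, so fix the seed for reproducibility
wordcloud(words = names(term_freq), freq = term_freq,
          min.freq = 2500, random.order = FALSE)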
[Figure: word clouds for the training set, low and high resolution]
[Figure: word clouds for the test set, low and high resolution]
Evaluation
Confusion Matrix
The confusion matrix evaluates how well our model predicts the review star rating.

When exact matches are required, our accuracy of 32% suggests the model has room for
improvement, especially in predicting very high or very low ratings. When we allow a
0.5-star margin of error, however, accuracy improves markedly to 74%. For business
purposes, we can state that our model predicts within 0.5 stars of the actual rating about
74% of the time. To further support this claim, we could run larger, more balanced test
cases that push the model to predict high and low ratings.
[Table: confusion matrix and accuracy summary]
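The matrix itself is an image above, but the two accuracy figures are straightforward to
compute. A minimal R sketch, using toy rating vectors in place of our actual test-set
predictions:

actual    <- c(3.5, 4.0, 4.0, 3.0, 4.5, 3.5)   # toy stand-ins for the test set
predicted <- c(3.0, 4.0, 3.5, 3.0, 4.0, 3.5)   # toy model outputs

conf_mat <- table(Predicted = predicted, Actual = actual)  # confusion matrix

exact_acc   <- mean(predicted == actual)             # exact-match accuracy (32% in our run)
within_half <- mean(abs(predicted - actual) <= 0.5)  # within 0.5 stars (74% in our run)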
The boxplots below show actual star ratings grouped by predicted star rating, which lets us
compare the accuracy of each prediction level. Generally, the model predicts with a
downward bias: for example, it predicts a 3-star rating for an observation whose actual
rating is 3.5. Taking the difference between predicted and actual ratings, predictions
average 0.05 stars below the real ratings. This tendency can be viewed as benign, since
giving a restaurateur a falsely low rating is more useful than a falsely high one: a low
estimate pushes them to keep improving the restaurant.
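A minimal sketch of the comparison, continuing with the toy vectors above:

# Actual ratings grouped by predicted rating; medians below the
# diagonal indicate the downward bias described above
boxplot(actual ~ predicted, xlab = "Predicted stars", ylab = "Actual stars")

mean(predicted - actual)   # average signed error; about -0.05 on our test set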
We could attempt to correct this downward bias by training the model with more reviews
from low-rated restaurants, although this may be difficult because low Yelp ratings are
relatively rare.
Conclusion
The final model offers valuable, though limited, explanatory power into the factors that
predict a restaurant's Yelp star rating. Whether it adds value to a restaurateur's business
decisions is up to his or her discretion. Given the wide range of customer preferences, our
model accounts for much of this variety because it is trained on such a large set of data.

We recommend further analysis to improve the accuracy of the model, particularly its
performance at the lower and higher ends of the five-star rating spectrum. Analysis can
also be expanded to investigate potential biases in our model or to cover areas it does not
include: the urban bias in our model can be tested by incorporating data from rural places,
and the model can be re-tested with more or fewer reviews per restaurant to see whether
accuracy improves.

The model could also be improved by merging our data with other data sets. For example,
data on consumer trends, economic growth areas, crime rates, and various other factors
may help capture the correlation of Yelp star ratings with other covariates; our model may
offer a starting point into which such a comprehensive data model could be
integrated. Lastly, there are always opportunities for us to deepen our understanding of the
inner workings of the Google Prediction API and Yelp's sorting algorithm, both of which
crucially influence our data sources and data processing.
Appendices
Appendix A: Import.io GUI Screenshots
We used Import.io to download the listings of 300 Mexican restaurants for each city:
Appendix B: Python Code using Import.io API
We then loaded this list into the Import.io Python API to download the restaurant reviews:
import logging, json, importio, latch

# To use an API key for authentication, use the following code:
client = importio.importio(user_id="dac82174-0f48-4b63-9584-87ce0a99336e", api_key="4Q9CL7FCM5lOHA/WC25YDcUNF7JUXtGRC3sYPz9LvxGRSjfTM+3tb+Y/MydOvNl8lcFA9ChJRsMaI7SxMD4low==")
# client = importio.importio(user_id="dac82174-0f48-4b63-9584-87ce0a99336e", api_key="4Q9CL7FCM5lOHA/WC25YDcUNF7JUXtGRC3sYPz9LvxGRSjfTM+3tb+Y/MydOvNl8lcFA9ChJRsMaI7SxMD4low==", host="https://query.import.io")

# Once we have started the client and authenticated, we need to connect it to the server:
client.connect()

# Because import.io queries are asynchronous, for this simple script we will use a "latch"
# to stop the script from exiting before all of our queries are returned.
# For more information on the latch class, see the latch.py file included in this client library.
queryLatch = latch.latch(2292)  # one countdown per restaurant query

# Define here a global variable that we can put all our results into when they come back from
# the server, so we can use the data later on in the script
dataRows = []

# In order to receive the data from the queries we issue, we need to define a callback method.
# This method will receive each message that comes back from the queries, and we can take that
# data and store it for use in our app.
def callback(query, message):
    global dataRows

    # Disconnect messages happen if we disconnect the client library while a query is in progress
    if message["type"] == "DISCONNECT":
        print "Query in progress when library disconnected"
        print json.dumps(message["data"], indent = 4)

    # Check the message we receive actually has some data in it
    if message["type"] == "MESSAGE":
        if "errorType" in message["data"]:
            # In this case, we received a message, but it was an error from the external service
            print "Got an error!"
            print json.dumps(message["data"], indent = 4)
        else:
            # We got a message and it was not an error, so we can process the data
            print "Got data!"
            print json.dumps(message["data"], indent = 4)
            # Save the data we got in our dataRows variable for later
            dataRows.extend(message["data"]["results"])

    # When the query is finished, countdown the latch so the program can continue when everything is done
    if query.finished(): queryLatch.countdown()

# Issue queries to your data sources and with your inputs.
# You can modify the inputs and connectorGuids so as to query your own sources.
# Query for tile extractor: the first 40 reviews (2292 restaurants)
client.query({
    "connectorGuids": [
        "e9916865-4fa9-4bf1-a224-728649d2958a"
    ],
    "input": {
        "webpage/url": "http://www.yelp.com/biz/12th-street-cantina-philadelphia?sort_by=date_asc"
    }
}, callback)

# NOTE: This client.query() command is repeated thousands of times, once for each restaurant.
# ..............

client.query({
    "connectorGuids": [
        "e9916865-4fa9-4bf1-a224-728649d2958a"
    ],
    "input": {
        "webpage/url": "http://www.yelp.com/biz/3-amigos-mexican-restaurant-san-antonio?sort_by=date_asc"
    }
}, callback)

print "Queries dispatched, now waiting for results"

# Now we have issued all of the queries, we can "await" on the latch so that we know
# when it is all done
queryLatch.await()

print "Latch has completed, all results returned"

# It is best practice to disconnect when you are finished sending queries and getting data -
# it allows us to clean up resources on the client and the server
client.disconnect()

# Now we can print out the data we got
print "All data received:"
with open('iodata.txt', 'w') as outfile:
    json.dump(dataRows, outfile)
print json.dumps(dataRows, indent = 4)
Appendix C: R Code
The output from the Python script is a JSON file. We use R to process this into a CSV file.
library("rjson")
library("plyr")
setwd("C:/Users/Albert/Desktop/data_test")
json_file <- "C:/Users/Albert/Desktop"
fp <- file.path(json_file, "1st.txt")
json_data <- fromJSON(file = fp)
#json_data <- fromJSON(paste(readLines(json_file), collapse=""))
outputs <- json_data
outputs$fivenum <- fivenum(rnorm(100))
outputs$summary <- as.data.frame(as.vector(summary(rnorm(100))))
tmp <- lapply(outputs, as.data.frame)
write.table(tmp, file="Output.csv",append=T, sep=",")
After R processes the JSON file, the resulting CSV is loaded into the Google Prediction API.
Appendix D: Resulting Training Data
Appendix E: Google Prediction API Training Page
Appendix F: Google Prediction API Testing Worksheet