Predicting Restaurant Ratings and Popularity Based on the Yelp Dataset
  
MACHINE LEARNING PROJECT REPORT 
 
 
 
 
 
Submitted by 
ALIN BABU (67) 
NANDU O (66) 
LIJU THOMAS (36) 
 
 
 
 
 
 
Introduction  
A restaurant's rating on Yelp has become an important indicator of its future. In this project, we focus on predicting the ratings and popularity change of restaurants. With data from Yelp, we use several machine learning methods, including logistic regression and Naive Bayes, to make these predictions. While logistic regression seems to perform better than the others, predictions from all the methods are far from perfect. This suggests room for improvement with more data and a better-suited methodology.
 
Project Objectives 
 
➔ To predict the ratings and popularity change of restaurants on Yelp based on restaurant features.
➔ To shed light on what customers value most about a restaurant.
 
Dataset 
➔ The data comes from the Yelp Dataset Challenge.
➔ It includes review data: text, time, and star rating.
➔ From the raw dataset, we select 20000 samples for testing (a loading sketch follows this list).
➔ Because dining culture differs across cities, we focus only on restaurants in a particular city and its surrounding areas in this project.
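The report does not show the loading step; a minimal sketch, assuming pandas and the challenge's JSON-lines review file (the file name and column names are assumptions, not taken from the report), could look like:

import pandas as pd

# Assumed file name and fields from the Yelp Dataset Challenge review dump;
# adjust to the actual release used by the project.
reviews = pd.read_json("yelp_academic_dataset_review.json", lines=True)

# Keep the fields the report relies on and draw 20000 samples.
sample = reviews[["review_id", "date", "stars", "text"]].sample(n=20000, random_state=0)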
 
 
Algorithm and Methods 
 
In this project we mainly use three machine learning algorithms to predict restaurant ratings. All of them are supervised learning algorithms:
★ Logistic Regression 
★ Multinomial Naive Bayes 
★ Naive Bayes 
Logistic Regression 
Logistic regression is a classification algorithm used to assign observations to a discrete set of classes. Unlike linear regression, which outputs continuous values, logistic regression transforms its output using the logistic sigmoid function to return a probability value, which can then be mapped to two or more discrete classes.
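A minimal sketch of such a classifier, assuming scikit-learn and a TF-IDF representation of the review text (neither is specified in the report), could look like:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical review texts and their star ratings (1-5).
reviews = ["Great food and friendly staff", "Slow service and cold food"]
stars = [5, 2]

# TF-IDF features feeding a logistic regression classifier over star classes.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(reviews, stars)
print(model.predict(["Decent food, average prices"]))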
Naive Bayes 
A Naive Bayes classifier is a probabilistic machine learning model used for classification tasks. The crux of the classifier is Bayes' theorem.
Bayes' Theorem:
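With hypothesis A and evidence B as described below, the theorem can be written as:

P(A | B) = P(B | A) · P(A) / P(B)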
 
Using Bayes' theorem, we can find the probability of A happening given that B has occurred. Here, B is the evidence and A is the hypothesis. The assumption made here is that the predictors/features are independent; that is, the presence of one particular feature does not affect another. Hence it is called "naive".
Multinomial Naive Bayes 
Multinomial Naive Bayes is a specialized version of Naive Bayes designed for text documents. Whereas simple Naive Bayes would model a document as the presence or absence of particular words, Multinomial Naive Bayes explicitly models the word counts and adjusts the underlying calculations to deal with them.
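A minimal sketch, assuming scikit-learn's CountVectorizer and MultinomialNB (the report does not name a library), could look like:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical review texts and star labels.
reviews = ["Loved the pasta, will come back", "Terrible wait and a rude waiter"]
stars = [5, 1]

# CountVectorizer produces word counts, which Multinomial Naive Bayes models directly.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(reviews, stars)
print(model.predict(["the pasta was okay"]))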
 
 
Data Pre-Processing 
In this project we mainly use the Yelp dataset, which consists of user reviews and ratings. The raw data includes restaurant name, date, comfortability, star rating, comments, and review id. Out of these, we manually select the two features that are essential for our prediction:
❖ Star rating 
❖ Comments 
After selecting the valid features, we handle missing values of each attribute and then extract root words from the comment text using the methods below (a code sketch follows the list):
❖ Removing punctuations 
 
❖ Removing stop words 
❖ Stemming - Reducing each word to its root/base form (for example, "waiting" becomes "wait").
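A minimal sketch of these cleaning steps, assuming NLTK's stop-word list and Porter stemmer (the report does not name the tools used), could look like:

import string

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Requires the NLTK 'stopwords' corpus: nltk.download("stopwords").
stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def clean_comment(text):
    # Remove punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Remove stop words and reduce each remaining word to its stem.
    return [stemmer.stem(word) for word in text.lower().split() if word not in stop_words]

print(clean_comment("The waiters were friendly and the food arrived quickly!"))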
 
Performance Evaluation 
 
Logistic Regression 
 
 
Naive Bayes 
 
 
 
 
Multinomial Naive Bayes 
 
 
 
Conclusion 
 
After testing with 20000 samples, we can see that logistic regression performs better than the other methods. One possible explanation is that the assumptions behind the other models are problematic, and logistic regression is more robust to such violated assumptions. This suggests room for improvement with more data and a better-suited methodology. However, the predictions still need further improvement: we compare our best predictor, logistic regression, with a random-number predictor and a constant-number predictor, and the logistic predictor is only slightly better than the constant-number predictor.
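A sketch of this comparison, assuming scikit-learn's DummyClassifier for the random and constant baselines (the report does not say how the baselines were built), could look like:

from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

# Placeholder features and star labels; in the project these would be the
# held-out Yelp samples and their ratings.
X = [[0.1], [0.9], [0.5], [0.7]]
y = [1, 5, 3, 5]

baselines = [
    ("random", DummyClassifier(strategy="uniform", random_state=0)),
    ("constant", DummyClassifier(strategy="most_frequent")),
    ("logistic", LogisticRegression(max_iter=1000)),
]
for name, clf in baselines:
    clf.fit(X, y)
    print(name, clf.score(X, y))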
 
