
Yelp Dataset Challenge

The Yelp dataset consists of 1.6M customer reviews of 61K businesses. Three tasks are accomplished using this dataset:
1. Assign categories to businesses based on customer reviews
2. Recommend food items and services of a restaurant based on reviews
3. Determine influential factors in a city affecting restaurants



  1. Yelp Dataset Challenge
     Anwar Shaikh, Ashwin Nimhan, Manashree Rao, Shrijit Pillai, Tejas Shah
  2. Project Tasks
     • Task 1: Assign categories to businesses in the Yelp dataset
     • Task 2: Recommend food items and/or services in a restaurant
     • Task 3: Determine influential factors in a city affecting restaurants
  3. Task 1 ...
  4. Task 1: Methodology
     [Pipeline diagram] The training set is used to build a business-to-review map and a business-to-category map, and the reviews are stored in a Lucene index. In the mapping phase, TF-IDF scoring is applied with three similarity functions (1. Default, 2. BM25, 3. Dirichlet), and a category-to-review mapping produces predicted categories for businesses in the testing set.
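As a rough, hypothetical sketch of the index-and-score idea on this slide, the snippet below swaps the Lucene index for scikit-learn's TfidfVectorizer and cosine similarity. The category documents, the sample business reviews, and the top-k cutoff are illustrative assumptions, not the project's actual setup.

```python
# Sketch: predict business categories by TF-IDF similarity between a business's
# reviews and per-category review text (scikit-learn stands in for Lucene here).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical training data: concatenated review text per category.
category_docs = {
    "Mexican": "tacos burritos salsa margaritas carne asada",
    "Coffee & Tea": "latte espresso pastry cappuccino cozy cafe",
    "Bars": "beer cocktails happy hour bartender pool table",
}

vectorizer = TfidfVectorizer(stop_words="english")
category_matrix = vectorizer.fit_transform(category_docs.values())
category_names = list(category_docs.keys())

def predict_categories(business_reviews, top_k=3):
    """Return the top_k categories most similar to this business's reviews."""
    query = vectorizer.transform([" ".join(business_reviews)])
    scores = cosine_similarity(query, category_matrix).ravel()
    return [category_names[i] for i in scores.argsort()[::-1][:top_k]]

print(predict_categories(["great tacos and fresh salsa", "cheap margaritas"]))
```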
  5. Evaluation: BM25 Similarity
                      Top 3   Top 5   Top 7
     Precision        0.54    0.38    0.33
     Recall           0.55    0.66    0.72
     F2-Measure       0.55    0.57    0.58
     At least 1 TP    0.85    0.88    0.89
  6. Evaluation: Default Similarity
                      Top 3   Top 5   Top 7
     Precision        0.51    0.36    0.33
     Recall           0.53    0.62    0.66
     F2-Measure       0.53    0.54    0.55
     At least 1 TP    0.84    0.85    0.87
  7. Evaluation: LMDirichlet Similarity
                      Top 3   Top 5   Top 7
     Precision        0.42    0.32    0.30
     Recall           0.58    0.60    0.55
     F2-Measure       0.53    0.51    0.47
     At least 1 TP    0.81    0.84    0.86
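For reference, the precision, recall, F2, and "at least 1 TP" figures in the three tables above can be computed per business along these lines. This is a minimal sketch with made-up inputs, not the evaluation code used in the project.

```python
# Sketch: top-k precision, recall and F2 for one business's predicted categories.
def topk_metrics(predicted, actual, k):
    top = predicted[:k]
    tp = len(set(top) & set(actual))
    precision = tp / k
    recall = tp / len(actual) if actual else 0.0
    beta_sq = 4  # F2 weights recall twice as heavily as precision (beta = 2)
    f2 = ((1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f2, tp >= 1  # last flag feeds the "at least 1 TP" rate

print(topk_metrics(["Mexican", "Bars", "Nightlife"], ["Mexican", "Restaurants"], k=3))
```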
  8. Task 2: Recommend Restaurant Food Items or Services ...
  9. Task 2: Methodology
  10. Feature Extraction
     • Every token has an associated POS tag
     • Tokens tagged “NN” are nouns and tokens tagged “JJ” are adjectives
     • Nouns are considered as features and adjectives as sentiments
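The slides do not say which tagger is used, so the sketch below uses NLTK's default POS tagger purely to illustrate the noun-as-feature / adjective-as-sentiment split.

```python
# Sketch: split review tokens into noun "features" and adjective "sentiments"
# using NLTK POS tags (NLTK is an assumption; the slides do not name a tagger).
import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def extract_features_and_sentiments(review):
    tagged = nltk.pos_tag(nltk.word_tokenize(review))
    features = [w for w, tag in tagged if tag.startswith("NN")]   # nouns
    sentiments = [w for w, tag in tagged if tag.startswith("JJ")] # adjectives
    return features, sentiments

print(extract_features_and_sentiments("The burger was great but the fries were soggy"))
```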
  11. Feature Filtering
     • Noise is present in the features obtained from the feature extraction phase
     • Using the Task 1 solution, the categories of the input features are determined
     • Features whose categories are related to restaurants are considered for further processing
     Before feature filtering: cheese, burger, ones, menu, combinations, idea, commission
     After feature filtering: cheese, burger, menu
  12. Feature Processing
     • Stanford CoreNLP dependency parsing: for each sentence, a dependency of a given type (e.g., NSUBJ) links a dependent and its tag to a governor and its tag
     • Problem: the relationship between noun and adjective was ambiguous for some sentences
     • Example: “The food was great but the service was bad”. After parsing, does “bad” belong to food or to service?
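The project parses with Stanford CoreNLP; only to keep the example short, the sketch below shows the same noun-adjective disambiguation idea with spaCy, pairing each predicative adjective with the nsubj of its clause. It illustrates the technique and is not the authors' implementation.

```python
# Sketch: pair each adjective with the noun it describes via dependency parsing.
# spaCy is used here as an illustration; the slides use Stanford CoreNLP.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def adjective_noun_pairs(sentence):
    pairs = []
    for token in nlp(sentence):
        if token.pos_ == "ADJ":
            # Predicative use ("the food was great"): take the nsubj of the adjective's head.
            subjects = [c for c in token.head.children if c.dep_ == "nsubj"]
            if subjects:
                pairs.append((subjects[0].text, token.text))
            # Attributive use ("great food"): the adjective modifies its head noun.
            elif token.dep_ == "amod":
                pairs.append((token.head.text, token.text))
    return pairs

print(adjective_noun_pairs("The food was great but the service was bad"))
# expected: [('food', 'great'), ('service', 'bad')]
```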
  13. Classification of Reviews
     [Flow: new review → is the adjective positive or negative? → is a negation word within a 4-word distance? → decision (recommended or not recommended)]
     1. For each sentence, the noun is extracted through feature extraction
     2. The corresponding adjective is identified as positive or negative
     3. Negation is searched for within a 4-word distance of the adjective
     4. A feature is classified as Recommended if the number of positive sentiments associated with it exceeds the number of negative sentiments
     • All of the above steps are repeated for each review
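A minimal sketch of steps 2-4 above, assuming small hand-picked positive and negative adjective lists; the actual sentiment lexicon used in the project is not given in the slides.

```python
# Sketch of the recommend/not-recommend vote with a 4-word negation window.
# The POSITIVE/NEGATIVE/NEGATIONS word lists are illustrative assumptions.
from collections import defaultdict

POSITIVE = {"great", "good", "fresh", "nice", "flavorful"}
NEGATIVE = {"bad", "bland", "soggy", "chewy"}
NEGATIONS = {"not", "never", "no", "n't"}

def recommend(tokens, noun_adj_pairs):
    """noun_adj_pairs: list of (noun_index, adjective_index) into tokens."""
    tally = defaultdict(int)
    for noun_i, adj_i in noun_adj_pairs:
        adj = tokens[adj_i].lower()
        polarity = 1 if adj in POSITIVE else -1 if adj in NEGATIVE else 0
        # Flip polarity if a negation word appears within 4 tokens of the adjective.
        window = tokens[max(0, adj_i - 4):adj_i + 5]
        if any(w.lower() in NEGATIONS for w in window):
            polarity = -polarity
        tally[tokens[noun_i].lower()] += polarity
    # A feature is Recommended when positive mentions outnumber negative ones.
    return {feature: score > 0 for feature, score in tally.items()}

tokens = "The burger was not good but the fries were great".split()
print(recommend(tokens, [(1, 4), (7, 9)]))  # {'burger': False, 'fries': True}
```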
  14. Sample Result
     Feature      Predicted feature sentiments                Predicted as recommended?   Actually recommended?
     sub          next, decent                                Y                           Y
     bread        flavorful, bland, fresh, great, nice        Y                           Y
     peppercorn   nice                                        Y                           Y
     stuff-it     chewy                                       Y                           N
     sandwich     mayo/mustard/vinegar, east, good, unknown   Y                           Y
     menu         decent                                      Y                           Y
     bacon        real                                        Y                           Y
     bite         huge                                        Y                           N
     veggies      sorry                                       N                           Y
  15. Evaluation
     • Set 1: recommended features are obtained from 60% of the reviews of a particular restaurant
     • Set 2: the remaining 40% of the reviews are held out for testing
     • If a recommended feature from Set 1 is also present as a recommended feature in Set 2, it is a true positive
     • Evaluation metrics: Precision = 0.53, Recall = 0.67
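The 0.53/0.67 figures above compare the recommended-feature sets from the two splits. A minimal sketch of that comparison, with made-up feature sets, might look like this.

```python
# Sketch: precision/recall where a Set 1 feature counts as a true positive
# only if it is also recommended by the held-out Set 2 reviews.
def feature_precision_recall(set1_features, set2_features):
    set1, set2 = set(set1_features), set(set2_features)
    tp = len(set1 & set2)
    precision = tp / len(set1) if set1 else 0.0
    recall = tp / len(set2) if set2 else 0.0
    return precision, recall

print(feature_precision_recall({"burger", "fries", "shake"}, {"burger", "shake", "patio"}))
```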
  16. Identifying Influential Topics
     “Identify features from reviews which are relevant city-wide and influence the user's choice and the restaurant's popularity”
     Phases:
     I. Business classification by city
     II. Popular-item word count
     III. NLP feature extraction
     IV. Feature re-ranking model
     V. Model fitness evaluation
  17. Business Classification (Phase I)
     Issue: reviews specify a neighborhood, not a city (~150 of them!)
     Solution:
     1. Identify the city from the business geo-code through a mapping service
     2. K-means clustering
        • Data-point features: business id, latitude, longitude
        • Dissimilarity metric: Euclidean distance
        • Cluster count k = 10
        • Centroid labeling
     3. Data persistence and indexing
        • Split reviews based on clustered business ids
        • Save and index for the next phase
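A minimal sketch of the clustering step with scikit-learn's KMeans (which uses Euclidean distance, matching the slide). The coordinates are made up, and k is reduced from the slide's 10 to 3 so the toy sample has enough points per cluster.

```python
# Sketch: cluster businesses by latitude/longitude with k-means.
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical (business_id, latitude, longitude) rows.
businesses = [
    ("b1", 43.07, -89.40), ("b2", 43.08, -89.38),    # around Madison
    ("b3", 36.11, -115.17), ("b4", 36.12, -115.15),  # around Las Vegas
    ("b5", 33.45, -112.07), ("b6", 33.44, -112.05),  # around Phoenix
]
coords = np.array([[lat, lon] for _, lat, lon in businesses])

# The slides use k = 10 over the full dataset; k = 3 fits this toy sample.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(coords)
for (business_id, _, _), label in zip(businesses, kmeans.labels_):
    print(business_id, "-> cluster", label)
```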
  18. Word Count (Phase II)
     Issue: how do we get the influential factors of a city?
     Solution: word count as a first pass
     Observation: noise (adjectives, verbs, expressions)
     Proposal: include features derived through NLP
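The first-pass word count could be as simple as the sketch below; the stop-word list is an assumption (the slides only note that raw counts are noisy).

```python
# Sketch: first-pass word count over a city's reviews.
from collections import Counter
import re

STOP_WORDS = {"the", "was", "and", "a", "but", "were", "of", "is"}  # illustrative

def top_terms(reviews, n=10):
    words = re.findall(r"[a-z']+", " ".join(reviews).lower())
    return Counter(w for w in words if w not in STOP_WORDS).most_common(n)

print(top_terms(["The cheese curds were great", "Great beer and cheese curds"], n=5))
```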
  19. NLP Features (Phase III)
     Issue: noise reduction and contextual awareness
     Solution: use NLP to identify features in the reviews
     Observation: subtle changes in the ordering of words
     Proposal: re-rank the words using metrics from the user and the review
  20. Mathematical Formula
     \[ \sum_{R \in R_{1000}} tf \cdot \log\left(1 - \frac{df}{|D_{count}|}\right) \times \left(0.15\,R_v + 0.15\,R_s + 0.7\left(0.25\,U_e + 0.55\,\frac{U_v}{U_{rc}} + 0.20\,U_f\right)\right) \]
  21. Elite User: Who is Important?
  22. Elite User, Useful Review: Who is Important? What is Important?
  23. Mathematical Formula
     • Features from NLP do take word count and context into account, but do NOT consider user weight and review weight
     [Flow: word list from NLP + Solr index → top 1K relevant reviews → program with the mathematical formula → scored words]
  24. Mathematical Formula: Normalization of Votes
     User fields: Review Count = Urc, Average Stars, Votes = Uv, Friends, Elite = Ue, Yelping Since, Compliments, Fans = Uf
     User term: 0.25·Ue + 0.55·(Uv/Urc) + 0.20·Uf
     Normalization of votes: Uv_norm = U_TotalVotes / U_ReviewCount
     Example:
     User   Review Count   Votes
     U1     10             1000
     U2     1000           1000
  25. Mathematical Formula: Review Weight
     Review fields: User, Stars = Rs, Text, Date, Votes = Rv (user fields as on slide 24)
     Combined term: 0.15·Rv + 0.15·Rs + 0.7·(0.25·Ue + 0.55·(Uv/Urc) + 0.20·Uf)
     Stars   Sentiment
     1       Very strong
     2       Inclined -ve
     3       Ambivalent
     4       Inclined +ve
     5       Very strong
  26. Mathematical Formula
     \[ tf \cdot \log\left(1 - \frac{df}{|D_{count}|}\right) \times \left(0.15\,R_v + 0.15\,R_s + 0.7\left(0.25\,U_e + 0.55\,\frac{U_v}{U_{rc}} + 0.20\,U_f\right)\right) \]
     Relevance: term frequency = tf, document frequency = df, document count = Dcount
     Review: Stars = Rs, Votes = Rv
     User: Review Count = Urc, Votes = Uv, Elite = Ue, Fans = Uf
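Read literally, the reconstructed formula scores a word by summing, over the top-1K relevant reviews, a tf/df term multiplied by weighted review and user factors. The sketch below follows that reading; how each field (votes, stars, elite, fans) is normalized is not stated on the slides, so the inputs here are assumptions.

```python
# Sketch of the re-ranking score as reconstructed from the slides.
# The log(1 - df/Dcount) term is taken verbatim from the slide; field
# normalization to [0, 1] is an assumption.
import math

def word_score(tf, df, doc_count, reviews):
    """reviews: dicts with review votes/stars (Rv, Rs) and the reviewer's
    elite flag (Ue), votes (Uv), review count (Urc) and fans (Uf)."""
    idf_like = math.log(1 - df / doc_count) if df < doc_count else 0.0
    total = 0.0
    for r in reviews:  # intended to run over the top-1K relevant reviews
        user_term = 0.25 * r["Ue"] + 0.55 * (r["Uv"] / r["Urc"]) + 0.20 * r["Uf"]
        total += tf * idf_like * (0.15 * r["Rv"] + 0.15 * r["Rs"] + 0.7 * user_term)
    return total

print(word_score(tf=3, df=40, doc_count=100,
                 reviews=[{"Rv": 0.8, "Rs": 0.9, "Ue": 1, "Uv": 50, "Urc": 100, "Uf": 0.3}]))
```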
  27. Output
     [Figure: numbered screenshot of the ranked output; no text labels recoverable]
  28. Madison, Phoenix, Las Vegas: top 25 terms (wordcount list vs. unformatted NLP list vs. NLP list from the re-ranking model; each list is ordered by rank 1-25)
     Madison
       Wordcount list: food, place, like, from, service, go, time, madison, been, cheese, menu, bar, restaurant, ordered, love, order, chicken, beer, pizza, sauce, night, people, make, staff, made
       NLP list (unformatted): food, beer, cheese, menu, curds, atmosphere, burger, dane, drinks, beers, restaurant, table, coffee, pizza, something, sandwich, dinner, lunch, meal, sauce, burgers, drink, bread, server, chicken
       NLP list (model): pizza, cheese, coffee, breakfast, burger, taco, sushi, chocolate, beer, sandwich, curds, ice, wine, store, cream, lunch, rolls, atmosphere, tea, curries, steak, noodle, spot, soup, egg
     Phoenix
       Wordcount list: food, good, place, great, like, service, time, go, back, from, been, love, ordered, chicken, nice, order, restaurant, little, menu, pizza, bar, delicious, friendly, first, pretty
       NLP list (unformatted): food, pizza, burger, menu, restaurant, fries, atmosphere, chicken, patio, breakfast, table, lunch, dinner, meal, salad, cheese, potato, server, something, sauce, drinks, rice, burgers, beer, spot
       NLP list (model): donut, bagel, cupcake, gelato, gyro, yogurt, buffet, boba, pizza, sushi, coffee, sub, wing, crepe, burger, burrito, taco, cookie, gluten, breakfast, coffee-shop, hash-brown, cake, vegan, teas
     Las Vegas
       Wordcount list: food, good, place, like, great, service, from, time, vegas, go, back, ordered, restaurant, nice, been, order, chicken, little, pretty, love, menu, eat, delicious, first, people
       NLP list (unformatted): food, beer, sushi, restaurant, meal, menu, atmosphere, table, steak, dinner, server, salad, tables, rib, buffet, dining, breakfast, waitress, shrimp, something, beers, dishes, dish, restaurants, sauce
       NLP list (model): donuts, bagel, crepe, pizza, oyster, yogurt, shrimp, burger, gelato, sushi, wings, sandwich, pancake, coffee, burrito, curry, buffet, waffle, chocolate, cake, breakfast, tea, cookies, gluten, pastrami
  29. Evaluation Metric: NDCG
     • Predicted topics for Phoenix fall under the categories Bakery and Breakfast & Brunch
     • To capture the strongest sentiments about these topics, we analyzed the top 1000 features for businesses predicted under Bakery and Breakfast & Brunch for the specific city, in this case Phoenix
     • Using these features as input for the relevance score, we analyze the top 30 topics predicted by the model
     • NDCG = 18.80190835 / 21.8978282 = 0.8586
  30. NDCG detail: top 30 topics from the model with relevance scores
     Rank  Topic (model output)  Relevance  DCG contribution = rel(i)/log2(i)
     1     donut                 3          3
     2     bagel                 3          3
     3     cupcake               3          1.892789
     4     gelato                0          0
     5     gyro                  2          0.861353
     6     yogurt                2          0.773706
     7     buffet                0          0
     8     boba                  1          0.333333
     9     pizza                 0          0
     10    sushi                 0          0
     11    coffee                3          0.867194
     12    sub                   2          0.557886
     13    wing                  1          0.270238
     14    crepe                 2          0.525299
     15    burger                2          0.511916
     16    burrito               2          0.5
     17    taco                  2          0.489301
     18    cookie                2          0.479625
     19    gluten                0          0
     20    breakfast             2          0.462756
     21    coffee-shop           2          0.45534
     22    hash-brown            1          0.224244
     23    cake                  3          0.663194
     24    vegan                 1          0.218104
     25    teas                  2          0.430677
     26    bruschetta            1          0.212746
     27    waffle                3          0.63093
     28    pancake               3          0.624044
     29    subway                1          0.205847
     30    latte                 3          0.611385
     Ideal DCG (IDCG) uses the relevance scores sorted in descending order: rel 3 at ranks 1-9, rel 2 at ranks 10-20, rel 1 at ranks 21-26, rel 0 at ranks 27-30.
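The DCG column on this slide uses rel(1) directly at rank 1 and rel(i)/log2(i) afterwards, and NDCG divides by the DCG of the ideally ordered relevances. A small sketch of that computation, with a short made-up relevance list rather than the full 30-item table:

```python
# Sketch: NDCG with the slide's convention (DCG = rel_1 + sum_{i>=2} rel_i / log2(i)).
import math

def dcg(relevances):
    return sum(rel if i == 1 else rel / math.log2(i)
               for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(round(ndcg([3, 3, 3, 0, 2, 2, 0, 1]), 4))  # illustrative scores, not the slide's data
```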
  31. Things to Note!
     • Based on the results, identified categories: Breakfast and Brunch, Bakery
     • Keywords: donut, bagel, cupcake, gelato, gyro, yogurt, buffet, boba, pizza, sushi, coffee, sub, wing, crepe, burger, burrito, taco, cookie, gluten, breakfast, coffee-shop, hash-brown, cake
  32. Things to Note!
     • (Repeats the identified categories and keywords of slide 31)
  33. Thank You!
