Yelp Dataset Challenge
Aparna Nanda Shivika Thapar
Arnab Kumar Mishra
Vishesh Tanksale Vraj Parikh
Task 1 - Toolkit / API
Lucene Java API - querying index
MongoDB - Loading JSON files for easy access
Apache Spark - Fast large-scale index creation
Stanford NLP POS Tagging - Generating effective queries
Task 1 - Method / Algorithm
1. Create and index a corpus of business documents using Lucene.
Document -> (Business ID, Review Text, Tip Text, Category)
2. Take a new review/tip from the test set, preprocess it by applying
Part-of-Speech tagging, and then query it against the index.
3. Rank the documents in the index for the given query using
BM25 Similarity, LMDirichlet Similarity, etc.
4. Based on the top-ranked documents, rank the corresponding
categories and assign the top 5 of these categories to the input review/tip.
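The last step above (turning ranked documents into ranked categories) can be sketched in Python as follows. The slides do not say how categories are scored, so summing the retrieval scores of the documents that carry each category is an assumption, and `rank_categories` and the sample scores are hypothetical.

```python
from collections import defaultdict

def rank_categories(ranked_docs, top_k=5):
    """ranked_docs: list of (retrieval_score, [categories]) pairs from the index."""
    totals = defaultdict(float)
    for score, categories in ranked_docs:
        for category in categories:
            # Assumption: a category's weight is the sum of the scores
            # of the retrieved documents that are labeled with it.
            totals[category] += score
    # Highest accumulated score first; ties are broken alphabetically.
    ordered = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [category for category, _ in ordered[:top_k]]

# Made-up scores standing in for BM25 or LMDirichlet output.
hits = [
    (12.4, ["Restaurants", "Indian"]),
    (9.1, ["Restaurants", "Buffets"]),
    (3.3, ["Gyms"]),
]
top_categories = rank_categories(hits)
```

A simple vote count (one vote per document) would be an equally plausible reading of the slide; score summation just lets higher-ranked documents weigh more.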
Task 1 - Evaluation Metrics
Precision = #(relevant items retrieved) / #(retrieved items)
Recall = #(relevant items retrieved) / #(relevant items)
BM25 Similarity
Language Model with Dirichlet Smoothing
Language Model with Jelinek Mercer Smoothing
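The precision and recall definitions on this slide can be sketched for a single review as below. Treating the assigned categories as the retrieved items and the business's true categories as the relevant items is an assumption about how the metrics were applied; `precision_recall` and the sample categories are hypothetical.

```python
def precision_recall(predicted, relevant):
    """Return (precision, recall) for one set of predicted categories."""
    predicted, relevant = set(predicted), set(relevant)
    hits = len(predicted & relevant)  # relevant items retrieved
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Five assigned categories versus three true categories (made-up data).
p, r = precision_recall(
    ["Restaurants", "Indian", "Buffets", "Bars", "Nightlife"],
    ["Restaurants", "Indian", "Vegetarian"],
)
```

Here two of the five assigned categories are correct, so precision is 2/5 and recall is 2/3.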
Task 1 - Evaluation Results
Task 2 - The Challenge
Information retrieval for city-wise and category-wise comparison of businesses.
What is a business famous for? What do customers like most
about a business? What do they dislike?
Consider all businesses in a city to get consolidated city sentiments.
Find scope for improvement of a business by fetching negative remarks and
complaints from reviews.
City-wise comparison of businesses.
Suggestions/recommendations based on the above findings.
Task 2 - Toolkit / API
Java
Python
MongoDB
PyMongo
NLTK for chunking and POS tagging
Pattern for sentiment analysis
Matplotlib for line graph plotting
Task 2 - Method / Algorithm
Filter the businesses down to the selected business types (hospitals, Indian
restaurants, gyms) so that their reviews can be filtered.
Filter the reviews by business city (Madison, Pittsburgh, Charlotte)
and category, and generate the corresponding MongoDB collections.
Use these collections to access the review texts one by one for further
processing.
Perform sentiment analysis on the review text using the Pattern package to
determine which reviews are positive and which are negative.
Task 2 - Method / Algorithm
For each positive review, fetch phrases using the chunker from the NLTK
package. We used {<JJ> <NN>|<JJ> <NNS>|<NN> <NNS>} to fetch chunks
from the review. Ex.: wonderful stay, fresh towels, great staff, etc.
For each negative review, fetch phrases using the chunker from the NLTK
package. We used {<NNP> <NN|NNP>|<RB> <JJ>|<JJ> <NNS>|<NN> <NN>}
to fetch chunks from the review. Ex.: always understaffed, horrible
hospital, worst service, parking lot, etc.
Add the good and bad phrases to the “good” and “bad” sets for a city’s
business.
Compare each business’s strengths and weaknesses.
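The two chunk grammars above match pairs of adjacent POS tags, so the extraction step can be illustrated without NLTK. The sketch below is a plain-Python stand-in, not the NLTK chunker itself: it assumes the tokens are already POS-tagged (the tags in the examples are assigned by hand), and `extract_phrases` is a hypothetical helper.

```python
# Tag pairs taken from the positive and negative grammars on the slide.
POSITIVE_PATTERNS = {("JJ", "NN"), ("JJ", "NNS"), ("NN", "NNS")}
NEGATIVE_PATTERNS = {
    ("NNP", "NN"), ("NNP", "NNP"), ("RB", "JJ"), ("JJ", "NNS"), ("NN", "NN"),
}

def extract_phrases(tagged_tokens, patterns):
    """Collect non-overlapping two-word phrases whose tag pair is in patterns."""
    phrases = []
    i = 0
    while i < len(tagged_tokens) - 1:
        (word1, tag1), (word2, tag2) = tagged_tokens[i], tagged_tokens[i + 1]
        if (tag1, tag2) in patterns:
            phrases.append(f"{word1} {word2}")
            i += 2  # skip both words so that chunks never overlap
        else:
            i += 1
    return phrases

# Hand-tagged fragments of a positive and a negative review.
positive = [("wonderful", "JJ"), ("stay", "NN"), ("with", "IN"),
            ("fresh", "JJ"), ("towels", "NNS")]
negative = [("always", "RB"), ("understaffed", "JJ"), (",", ","),
            ("parking", "NN"), ("lot", "NN")]

good_phrases = extract_phrases(positive, POSITIVE_PATTERNS)
bad_phrases = extract_phrases(negative, NEGATIVE_PATTERNS)
```

In a real pipeline the tags would come from a POS tagger, and the extracted phrases would be added to the per-city “good” and “bad” sets.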
Pittsburgh Hospitals Negative Word Cloud
Charlotte Hospitals Negative Word Cloud
Pittsburgh Hospitals Positive Word Cloud
Charlotte Hospitals Positive Word Cloud
Madison Positive Word Cloud
Madison Negative Word Cloud
Positive and Negative Score
Attributes based comparison of cities
Task 2 - Evaluation Metrics
Percentage Error: Compare the average star rating of the reviews in the data
set against the average rating inferred from the sentiment of those reviews
for that category.
x: average of the review ratings from the data set
y: average of the ratings based on review sentiment
Percentage Error = (|y - x| / x) * 100
Error greatly impacts our analysis and recommendations.
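As a worked instance of the formula, the sketch below applies it to the Madison averages reported in the results (x = 4.0, y = 3.54). The function name `percentage_error` is hypothetical.

```python
def percentage_error(actual_avg, sentiment_avg):
    """Percentage Error = (|y - x| / x) * 100, with x the data set average."""
    return abs(sentiment_avg - actual_avg) / actual_avg * 100

# Madison: average rating 4.0, sentiment-based average rating 3.54.
madison_error = percentage_error(4.0, 3.54)
print(f"{madison_error:.1f}%")
```

This gives |3.54 - 4.0| / 4.0 * 100 = 11.5%, matching the reported error for Madison.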
Task 2 - Evaluation Metrics
Accuracy: Estimate the rating of each review based on its good and bad
phrases.
Positiveness = #good phrases / #phrases
Scale Positiveness to a 5-point scale to get the rating.
Rating_Predicted = Positiveness * 5
Total correct predictions = #(|Actual_Rating - Rating_Predicted| <= Error)
Accuracy = (Total correct predictions / Total predictions) * 100
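The accuracy computation can be sketched as below. The slides do not state the value of the Error threshold or how reviews with no extracted phrases are handled, so the tolerance of 1.0 star and the 0.0 rating for phrase-less reviews are assumptions; the function names and sample reviews are hypothetical.

```python
def predicted_rating(good_phrases, total_phrases):
    """Rating_Predicted = Positiveness * 5."""
    if total_phrases == 0:
        return 0.0  # assumption: no phrases means no positive evidence
    return good_phrases / total_phrases * 5

def accuracy(reviews, tolerance):
    """reviews: list of (actual_rating, good_phrase_count, total_phrase_count)."""
    correct = sum(
        1
        for actual, good, total in reviews
        if abs(actual - predicted_rating(good, total)) <= tolerance
    )
    return correct / len(reviews) * 100

# Made-up reviews: (actual stars, number of good phrases, number of phrases).
reviews = [(5, 4, 4), (4, 4, 5), (1, 2, 4), (2, 1, 4)]
score = accuracy(reviews, tolerance=1.0)
```

Three of the four sample predictions fall within one star of the actual rating, so the sketch reports 75% accuracy.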
Task 2 - Evaluation Result
City         Average Rating   Sentiment Average Rating   Error    Accuracy
Madison      4.0              3.54                       11.5%    63.27
Charlotte    3.84             4.11                       7%       75.55
Pittsburgh   3.54             2.79                       21.1%    59.45
THANK YOU! :)
