Presentation for Information Retrieval / Extraction Project on Yelp Data set. The project utilizes various Information Retrieval and Natural Language Processing concepts to build the models.
2. Task 1 - Toolkit / API
Lucene Java API - querying index
MongoDB - Loading JSON files for easy access
Apache Spark - Fast large scale index creation
Stanford NLP POS Tagging - Generating effective queries
3. Task 1 - Method / Algorithm
1. Create and index a corpus of business documents using Lucene.
Document -> (Business ID, Review Text, Tip Text, Category)
2. Take in a new review/tip from the test set, preprocess it by applying
Part-of-Speech tagging, and then query it against the index.
3. Rank the indexed documents for the given query using BM25 Similarity,
LMDirichlet Similarity, etc.
4. Based on the top-ranked documents, rank the corresponding categories
and assign the top 5 of these categories to the input review/tip.
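The final step, turning top-ranked documents into ranked categories, can be sketched in Python as follows. The slides do not say how document scores are combined, so summing the retrieval score of every document that carries a category is an assumption, and `rank_categories` and its input shape are hypothetical names used only for illustration.

```python
from collections import defaultdict


def rank_categories(hits, top_k=5):
    """Rank categories from top-ranked documents.

    hits: list of (score, categories) pairs, one per retrieved document.
    Assumption: a category's weight is the sum of the scores of the
    documents that list it; the slides do not specify the aggregation.
    """
    totals = defaultdict(float)
    for score, categories in hits:
        for category in categories:
            totals[category] += score
    # Highest total first; ties are broken alphabetically for determinism.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [category for category, _ in ranked[:top_k]]


if __name__ == "__main__":
    example_hits = [
        (3.0, ["Restaurants", "Indian"]),
        (2.0, ["Restaurants"]),
        (1.0, ["Gyms"]),
    ]
    print(rank_categories(example_hits))
```

The scores here are made-up values; in the project they would come from the similarity used at query time (BM25, LMDirichlet, etc.).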
4. Task 1 - Evaluation Metrics
Precision = #(relevant items retrieved) / #(retrieved items)
Recall = #(relevant items retrieved) / #(relevant items)
BM25 Similarity
Language Model with Dirichlet Smoothing
Language Model with Jelinek Mercer Smoothing
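The precision and recall definitions on this slide can be computed per review/tip as in the sketch below. It assumes the "retrieved items" are the predicted categories and the "relevant items" are the true categories of the business; `precision_recall` is a hypothetical helper name.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: categories assigned to the review/tip (e.g. the top 5).
    relevant:  true categories of the business (assumed ground truth).
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    relevant_retrieved = len(retrieved_set & relevant_set)
    # Guard against empty sets to avoid division by zero.
    precision = relevant_retrieved / len(retrieved_set) if retrieved_set else 0.0
    recall = relevant_retrieved / len(relevant_set) if relevant_set else 0.0
    return precision, recall


if __name__ == "__main__":
    predicted = ["Restaurants", "Indian", "Bars", "Nightlife", "Pizza"]
    actual = ["Restaurants", "Food"]
    print(precision_recall(predicted, actual))
```

Running the same function once per similarity model would give comparable numbers across the models listed above.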
7. Task 2 - The Challenge
Information Retrieval for City and Category wise comparison of businesses.
What is a business famous for? What is it that the customers like the most
about a business? What is it that they don’t like?
Considered all businesses in a city to get consolidated city sentiments
Scope for improvement of a business by fetching negative remarks,
complaints from reviews.
City wise comparison of businesses
Suggestions/ recommendations based on above findings
8. Task 2 - Toolkit / API
Java
Python
MongoDB
PyMongo
NLTK for chunking and POS tagging
Pattern for sentiment analysis
Matplotlib for line graph plotting
9. Task 2 - Method / Algorithm
Filter the businesses to the selected business types (hospitals, Indian
restaurants, gyms) so that review filtering can be performed for them.
Filter the reviews by business city (Madison, Pittsburgh, Charlotte) and
category, and generate the corresponding MongoDB collections.
Use these collections to access the review texts one by one for further
processing.
Perform sentiment analysis on each review text using the Pattern package to
determine whether the review is positive or negative.
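A minimal in-memory sketch of the filtering and sentiment split described above. It works on plain dictionaries instead of MongoDB collections, and the polarity scorer is passed in as a function; `toy_polarity` is a deliberately crude stand-in, not the Pattern package. The field names (`business_id`, `city`, `categories`, `text`) and the threshold of 0.0 are assumptions.

```python
def select_reviews(businesses, reviews, city, category):
    """Keep reviews whose business is in the given city and category."""
    business_ids = {
        b["business_id"]
        for b in businesses
        if b["city"] == city and category in b["categories"]
    }
    return [r for r in reviews if r["business_id"] in business_ids]


def split_by_sentiment(reviews, polarity_of, threshold=0.0):
    """Split reviews into (positive, negative) using a polarity function.

    polarity_of: callable mapping review text to a number; with Pattern
    installed, its sentiment polarity could be plugged in here (assumed,
    not exercised in this sketch).
    """
    positive, negative = [], []
    for review in reviews:
        if polarity_of(review["text"]) > threshold:
            positive.append(review)
        else:
            negative.append(review)
    return positive, negative


def toy_polarity(text):
    """Stand-in scorer: counts a few hand-picked words (illustration only)."""
    words = text.lower().split()
    good = sum(w in {"great", "wonderful", "fresh"} for w in words)
    bad = sum(w in {"horrible", "worst", "understaffed"} for w in words)
    return good - bad


if __name__ == "__main__":
    businesses = [
        {"business_id": "b1", "city": "Madison", "categories": ["Hospitals"]},
        {"business_id": "b2", "city": "Charlotte", "categories": ["Gyms"]},
    ]
    reviews = [
        {"business_id": "b1", "text": "great staff"},
        {"business_id": "b1", "text": "horrible wait"},
        {"business_id": "b2", "text": "great gym"},
    ]
    selected = select_reviews(businesses, reviews, "Madison", "Hospitals")
    print(split_by_sentiment(selected, toy_polarity))
```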
10. Task 2 - Method / Algorithm
For each positive review, fetch phrases using the chunker from the NLTK
package. We used {<JJ> <NN>|<JJ> <NNS>|<NN> <NNS>} to fetch chunks
from the review, e.g. wonderful stay, fresh towels, great staff.
For each negative review, fetch phrases using the chunker from the NLTK
package. We used {<NNP> <NN|NNP>|<RB> <JJ>|<JJ> <NNS>|<NN> <NN>}
to fetch chunks from the review, e.g. always understaffed, horrible
hospital, worst service, parking lot.
Add the good and bad phrases to the “good” and “bad” sets for a city’s
business.
Compare each business’s strengths and weaknesses.
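Because every tag pattern listed above is a two-tag sequence, the phrase extraction can be illustrated without NLTK by scanning adjacent (word, tag) pairs. This is a simplified stand-in for the NLTK chunker, not the project's code; the input is assumed to be already POS-tagged, and the tags in the example are assigned by hand.

```python
# Tag bigrams taken from the chunk patterns on the slide.
POSITIVE_PATTERNS = {("JJ", "NN"), ("JJ", "NNS"), ("NN", "NNS")}
NEGATIVE_PATTERNS = {
    ("NNP", "NN"), ("NNP", "NNP"), ("RB", "JJ"), ("JJ", "NNS"), ("NN", "NN"),
}


def extract_phrases(tagged_tokens, patterns):
    """Return two-word phrases whose tag pair is in `patterns`.

    tagged_tokens: list of (word, tag) pairs.
    Matches are taken left to right and do not overlap.
    """
    phrases = []
    i = 0
    while i < len(tagged_tokens) - 1:
        (word1, tag1), (word2, tag2) = tagged_tokens[i], tagged_tokens[i + 1]
        if (tag1, tag2) in patterns:
            phrases.append(f"{word1} {word2}".lower())
            i += 2  # skip past the matched pair
        else:
            i += 1
    return phrases


if __name__ == "__main__":
    good, bad = set(), set()
    positive_review = [("The", "DT"), ("wonderful", "JJ"), ("stay", "NN"),
                       ("and", "CC"), ("fresh", "JJ"), ("towels", "NNS")]
    negative_review = [("always", "RB"), ("understaffed", "JJ"), (",", ","),
                       ("parking", "NN"), ("lot", "NN")]
    good.update(extract_phrases(positive_review, POSITIVE_PATTERNS))
    bad.update(extract_phrases(negative_review, NEGATIVE_PATTERNS))
    print(sorted(good), sorted(bad))
```

Collecting the phrases into sets, as in the usage lines, mirrors the "good" and "bad" sets kept per city and business type.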
19. Task 2 - Evaluation Metrics
Percentage Error: For each category, compare the average star rating of the
reviews in the data set against the average rating derived from the
sentiment of those reviews.
x: avg of ratings of reviews from data set
y: avg of ratings based on sentiment of reviews
Percentage Error = (|y - x| / x) * 100
Error greatly impacts our analysis and recommendations.
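The percentage error formula translates directly into code. The sketch below uses made-up averages rather than project results, and `percentage_error` is a hypothetical helper name.

```python
def percentage_error(dataset_avg, sentiment_avg):
    """Percentage Error = (|y - x| / x) * 100.

    dataset_avg   (x): average of ratings of reviews from the data set.
    sentiment_avg (y): average of ratings based on review sentiment.
    """
    if dataset_avg == 0:
        raise ValueError("dataset_avg must be non-zero")
    return abs(sentiment_avg - dataset_avg) / dataset_avg * 100


if __name__ == "__main__":
    # Illustrative values only.
    print(percentage_error(4.0, 3.5))
```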
20. Task 2 - Evaluation Metrics
Accuracy: Estimate the rating of each review based on its good and bad
phrases.
Positiveness = #good phrases / #phrases
Scale Positiveness to a 5-point scale to get the rating.
Rating_Predicted = Positiveness * 5
Total correct predictions = #(|Actual_Rating - Rating_Predicted| <= Error)
Accuracy = (Total correct predictions / Total predictions) * 100
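A short sketch of the accuracy computation above. The slide does not define the "Error" threshold, so it is treated here as a fixed tolerance passed in by the caller; that reading, the decision to skip reviews with no extracted phrases, and the function names are all assumptions.

```python
def predicted_rating(good_phrases, total_phrases):
    """Rating_Predicted = Positiveness * 5, Positiveness = good / total."""
    return good_phrases / total_phrases * 5


def accuracy(reviews, tolerance):
    """Percentage of reviews whose predicted rating is within `tolerance`.

    reviews: list of (actual_rating, good_phrases, total_phrases).
    Reviews with no phrases are skipped (assumption: the slide does not
    say how they are handled).
    """
    scored = [r for r in reviews if r[2] > 0]
    if not scored:
        return 0.0
    correct = sum(
        abs(actual - predicted_rating(good, total)) <= tolerance
        for actual, good, total in scored
    )
    return correct / len(scored) * 100


if __name__ == "__main__":
    # Illustrative values only: (actual rating, #good phrases, #phrases).
    sample = [(5, 4, 4), (4, 3, 4), (1, 2, 4), (2, 1, 4)]
    print(accuracy(sample, tolerance=1.0))
```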
21. Task 2 - Evaluation Result
City        Average Rating Sentiment   Average Rating   Error    Accuracy
Madison     4.0                        3.54             11.5%    63.27
Charlotte   3.84                       4.11             7%       75.55
Pittsburgh  3.54                       2.79             21.1%    59.45