Project Report
On
Text Mining of
Chashme Baddoor Movie Reviews
Submitted to
Prof Subhasis Dasgupta
(Text Analytics)
On
10-Mar-2014
Submitted by:
Maruthi Nataraj K (A13009)
Table of Contents
SL NO  Topic
1  Introduction
2  Project Objective
3  Data Description
4  Experimental Approaches
5  Evaluation Methodology
6  Final Results
7  Analysis and Inferences
8  Conclusions
1. INTRODUCTION
Text mining, also referred to as text data mining and roughly equivalent to text analytics,
refers to the process of deriving high-quality information from text. Typical text mining
tasks include text categorization, text clustering, concept/entity extraction, sentiment
analysis and document summarization.
A large amount of textual content is subjective and reflects opinions. With
the rapid growth of the Web, more people write online reviews for all types of products
and services. It is becoming a common practice for a consumer to learn why others like
or dislike a product before he or she buys it, or for a manufacturer to track customer
opinions on its products to improve customer satisfaction. However, as the number of
reviews for a product grows, it becomes harder to understand and evaluate customer
opinions about a specific product.
Sentiment classification, also referred to as polarity, tone, or opinion analysis, can track
changes in attitudes toward a brand or product, compare the attitudes of the public
between one brand or product and another, and extract examples of types of positive or
negative opinions. It may also involve analysis of movie reviews for estimating how
favorable a review is for a movie. Such an analysis may need a labeled data set.
A web search engine often returns thousands of pages in response to a broad query,
making it difficult for users to browse or to identify relevant information. Document
Clustering methods can be used to automatically group the retrieved documents into a
list of meaningful categories.
2. PROJECT OBJECTIVE
The first objective of the study is to build a model that classifies reviews of the
Bollywood movie “Chashme Baddoor” as positive or negative opinions.
The second objective is to cluster the movie reviews in order to surface the
pertinent points that reveal what reviewers noticed about the movie and indicate
areas for improvement.
3. DATA DESCRIPTION
For this study, I used the collection of user reviews of the Bollywood movie “Chashme
Baddoor” that is available at
http://www.imdb.com/title/tt2229848/reviews?ref_=tt_ov_rt
The collection contains 26 reviews by a wide variety of individuals. Each review is
several paragraphs long and has an associated star rating out of 10. Of these, 23 reviews
were considered meaningful after crawling and were used for further processing.
Once the initial data of 23 reviews with their respective ratings is obtained, 4 reviews are
selected at random and set aside as an unlabeled dataset for validation, and the remaining
19 reviews are used as the development sample for building the classification model.
Then, in the development dataset, all reviews with a rating greater than or equal to
7 (out of 10) are tagged as “Positive” and the remaining ones are tagged as “Negative” for
convenience.
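The split and tagging rule described above can be sketched in Python. This is only an illustration under stated assumptions: the function names, the fixed random seed and the placeholder ratings are hypothetical and are not taken from the actual IMDb data or from the tool used in the study.

```python
import random

POSITIVE_THRESHOLD = 7  # ratings >= 7 (out of 10) are tagged "Positive", as in the report


def label_rating(rating):
    """Map a star rating out of 10 to one of the two opinion classes."""
    return "Positive" if rating >= POSITIVE_THRESHOLD else "Negative"


def split_reviews(reviews, n_validation=4, seed=0):
    """Set aside n_validation reviews at random; the rest form the development sample."""
    rng = random.Random(seed)  # the seed is an assumption, used only for repeatability
    shuffled = list(reviews)
    rng.shuffle(shuffled)
    return shuffled[n_validation:], shuffled[:n_validation]


# Placeholder data: 23 (review_id, rating) pairs, NOT the actual IMDb ratings.
reviews = [(i, (i % 10) + 1) for i in range(23)]
development, validation = split_reviews(reviews)

# Only the development sample is tagged; the validation reviews stay unlabeled.
labeled = [(review_id, label_rating(rating)) for review_id, rating in development]
print(len(development), len(validation))
```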
4. EXPERIMENTAL APPROACHES
Classification
URL 
http://www.imdb.com/title/tt2229848/
Rules –
store_with_matching_url  .*reviews.*
follow_link_with_matching_url  http://www.imdb.com/title/tt2229848/reviews.*
Enter the path of the output directory in which the crawled HTML pages are to be
stored. Note that the file extension should be given as HTML, as shown above.
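To see what the two crawl rules select, the patterns can be tried out with Python's `re` module. This is a minimal sketch that assumes a rule applies when its pattern matches the entire URL; that matching behaviour, the function name `crawl_decision` and the third test URL are assumptions made for illustration, not details confirmed by the report.

```python
import re

# Patterns copied verbatim from the crawl rules above.
STORE_RULE = re.compile(r".*reviews.*")
FOLLOW_RULE = re.compile(r"http://www.imdb.com/title/tt2229848/reviews.*")


def crawl_decision(url):
    """Return which rules a URL satisfies, assuming whole-URL matching."""
    return {
        "store": STORE_RULE.fullmatch(url) is not None,
        "follow": FOLLOW_RULE.fullmatch(url) is not None,
    }


urls = [
    "http://www.imdb.com/title/tt2229848/",
    "http://www.imdb.com/title/tt2229848/reviews?ref_=tt_ov_rt",
    "http://www.imdb.com/title/tt0000000/reviews",  # hypothetical other title
]
for url in urls:
    print(url, crawl_decision(url))
```

Under this assumption the start page itself is neither stored nor followed, while the review pages of the movie satisfy both rules. A review page of another title would satisfy only the store rule.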
Once the HTML pages are stored in the dedicated location as shown above, we need
to use “Process Documents from Files” to read them and pull out the ratings and reviews
using “Cut Document” and “Extract Information” within the process (double-click
on “Process Documents from Files”).
Note that the query type should be XPath here.
To check how the ratings and reviews are embedded in the HTML page, we need to
inspect the elements by right-clicking the review text and the star ratings, as
shown below.
XPath queries for Extract Information are shown below:
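The queries actually used in the study are not reproduced in this text. As a stand-in, the following sketch shows the general idea of cutting a page into review blocks and then extracting fields from each block with XPath, using Python's standard `xml.etree.ElementTree` module. The markup, class names and queries are entirely hypothetical and do not reflect the real IMDb page structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment; the real IMDb markup is not shown in the report.
SAMPLE_PAGE = """
<html><body>
  <div class="review">
    <span class="rating">8</span>
    <p class="text">Light-hearted and fun.</p>
  </div>
  <div class="review">
    <span class="rating">4</span>
    <p class="text">The original was better.</p>
  </div>
</body></html>
"""


def extract_reviews(page):
    """Cut the page into review blocks, then pull the rating and text from each."""
    root = ET.fromstring(page.strip())
    rows = []
    # Step 1 (analogous to cutting the document): one element per review.
    for block in root.findall(".//div[@class='review']"):
        # Step 2 (analogous to extracting information): one query per field.
        rating = block.find("./span[@class='rating']").text
        text = block.find("./p[@class='text']").text
        rows.append((int(rating), text))
    return rows


print(extract_reviews(SAMPLE_PAGE))
```

Real HTML pages are rarely well-formed XML, so a dedicated HTML parser would normally be needed; the strict parser is used here only to keep the sketch dependency-free.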
Once the output is obtained, the data has to be cleaned to remove unwanted
information and duplicates. The cleaned data is then read and processed as below:
The internal structure of X-Validation is shown below, where a Decision Tree classifier is used.
Here, the model is stored so that it can be used in later stages.
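For readers unfamiliar with cross-validation, the fold-splitting step can be sketched as follows. The number of folds is not stated in the report, so `k = 5` is an assumption, and `k_fold_indices` is a hypothetical helper; in each round a classifier would be trained on the training indices and scored on the held-out indices.

```python
def k_fold_indices(n_items, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_items) if i < start or i >= start + size]
        yield train, test
        start += size


# 19 development reviews, as in the report; k = 5 is an assumed fold count.
folds = list(k_fold_indices(19, 5))
for train, test in folds:
    # A classifier would be fitted on `train` and evaluated on `test` here.
    print(len(train), len(test))
```

In practice the reviews would be shuffled (or stratified by label) before splitting, so that each fold contains a mix of positive and negative reviews.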
Now, the stored model is retrieved and applied to the unlabeled
validation dataset (4 reviews) described earlier.
Clustering
The development sample is also used for clustering, without taking the rating into
consideration.
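Clustering operates on word vectors rather than raw text. A minimal sketch of one common TF-IDF weighting is given below, assuming whitespace tokenization, term frequency normalized by document length and a natural-log inverse document frequency; the exact weighting and preprocessing used in the study may differ, and the sample sentences are placeholders.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dictionary per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Placeholder sentences, not actual review text.
vectors = tf_idf(["funny movie", "boring movie", "funny songs"])
for vector in vectors:
    print(vector)
```

Terms that occur in every document receive a weight of zero under this scheme, which is why very common words contribute little to separating clusters.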
5. EVALUATION METHODOLOGY
The classification model is evaluated on the accuracy derived from the confusion matrix of
predicted versus actual opinions. The clustering is evaluated on the contribution of each
set of words to each cluster, based on the word vector with their respective TF-IDF
values, using a statistical procedure such as ANOVA.
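Accuracy is the share of reviews on the diagonal of the confusion matrix, that is, the reviews whose predicted opinion equals the actual one. The sketch below illustrates the calculation with placeholder labels; the helper names and the data are hypothetical and are not results of the study.

```python
def confusion_matrix(actual, predicted, labels=("Positive", "Negative")):
    """Count reviews per (actual, predicted) pair: matrix[actual][predicted]."""
    matrix = {a: {p: 0 for p in labels} for a in labels}
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[label][label] for label in matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total


# Placeholder labels, NOT results from the report.
actual = ["Positive", "Positive", "Negative", "Negative"]
predicted = ["Positive", "Negative", "Negative", "Negative"]
matrix = confusion_matrix(actual, predicted)
print(matrix, accuracy(matrix))
```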
6. FINAL RESULTS
Classification
Total occurrences of different sets of words are shown below:
Confusion Matrix of labeled (development) dataset based on X-Validation:
Opinion prediction for the unlabeled (validation) dataset using the Decision Tree
classifier:
Clustering
7. ANALYSIS AND INFERENCES
Cluster 1: Highlights the plot of the movie and its cast. It indicates
that the movie is light-hearted and a good time-pass, though it cannot be compared with
the original by Sai Paranjpye. The heroine’s "Dum Hai Boss" dialogues are a big letdown,
whereas the lead actors Sid, Jai and Omi try their best to bring out the humor. The songs
"Har Ek Friend Kamina Hota Hai" and "Dichkyaon Doom Doom" by Sajid-Wajid seem to
score high.
With this information, the film makers could come up with more entertaining stories
that address the weak spots, and work on better dialogues.
Cluster 0: Showcases the larger elements of the movie. It shows that the newer version
(remake) of the movie by David Dhawan, when compared to the older one, has decent
performances that engage the spectators and make them feel it is worth the money to
some extent. Even so, a few others opine that the budget of the movie does not appear to
justify the creative screenplay they expected, and that Ali, the main lead,
is lacking in a few areas and needs improvement. The songs on the whole by Sajid-Wajid
fall short of the required potential.
Here, the director could focus on a gripping screenplay that maintains the humor quotient,
with strong leads and better music. Also, effective utilization of the budget has to be
planned.
8. CONCLUSIONS
Text mining can be used as an effective tool for obtaining genuine feedback about
movies from a large audience, and this feedback should be kept in mind when
starting a new script in order to meet viewers’ expectations.
APPENDIX
The same process can also be carried out using “Get Pages”.
ANOVA example for contribution of word to cluster
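The original example is not reproduced in this text. As an illustration of the idea, a one-way ANOVA compares the TF-IDF values of a single word across clusters: a large F statistic suggests the word's weight differs between clusters and therefore helps to characterize them. The sketch below computes F from scratch; the function name and the numbers are placeholders, not values from the study.

```python
def one_way_anova_f(groups):
    """F statistic: between-group mean square over within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Placeholder TF-IDF values of one word in two clusters, NOT data from the report.
cluster_0 = [1.0, 2.0, 3.0]
cluster_1 = [2.0, 3.0, 4.0]
f_value = one_way_anova_f([cluster_0, cluster_1])
print(f_value)
```

The F value would then be compared against the F distribution with (k - 1, N - k) degrees of freedom to obtain a p-value, a step that statistical packages perform automatically.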
Word Cloud
