NLP based Mining on Movie Critics

Sushanth Reddy Vanga
Computer Science, Kent State University
svanga@kent.edu
Akhay Kumar Kataiah
akataiah@kent.edu
Laxmi Supraja Narayan
Computer Science, Kent State University
lnarayan@kent.edu
Sushanth Kumar Mukka
smukka@kent.edu
Abstract— In this project, data is collected through Online
Movie Database (OMDB) API. We apply sentiment analysis to the
cleaned data using Python, which tells us which critiques are
positive and which are negative. We apply Naive Bayes
classification to label the data. Finally, we build a web
application that reports whether a given critique is a positive
or negative review. The web application demonstrates the
effectiveness of our project.
I. INTRODUCTION
The internet provides a large amount of data that can be easily
accessed from all over the world. Finding the information
relevant to a user's needs in such a huge amount of raw data has
become very important. Most of the information on the web is in
the form of text. For instance, we find a huge number of review
documents that contain user opinions about a product. A user who
wants to buy a product usually surveys its reviews first.
The same holds for movie reviews. A movie critique is the
analysis and evaluation of a movie. It generally gives an
impression of the film while mentioning the movie's title,
director, and key actors. With the growth of internet usage,
arts criticism in general does not hold the same place it once
held with the general public, yet positive film reviews have
been known to spark interest in little-known films. Movie
reviews give just a quick look at the movie. A critique may be
lengthy or very short [1]. Not everyone has time to read every
review, so in the end what matters is judging whether the movie
is good or bad. In our project we considered two movie review
websites, Rotten Tomatoes and IMDB, which are popular at
present; because such websites carry many reviews, they yield a
huge amount of data. The chief aim of a review is to tell the
user whether a movie is worth watching, which saves the user
time and money. Reviews offer a precise and effective way to
evaluate a movie, so movie reviewing has become one of the
largest commercial applications all over the world.
Our project focuses on collecting data from critics, extracting
word features with feature extractors, creating a training data
set, and then classifying each item as positive or negative.
Initially, data is collected through the Online Movie Database
(OMDB) API. The data is then cleaned and gathered into a bag of
words using Python. Applying sentiment analysis to the processed
data gives us the positive and negative critiques. We use both a
Naive Bayes classifier and a random forest to classify and
predict the data, because we found that Naive Bayes gives the
better solution for a small amount of data, whereas random
forest gives the better solution for a large amount of data. We
use Natural Language Processing (NLP) and then apply sentiment
analysis, a linguistic analysis technique that identifies the
opinion in a piece of text at an early stage. It helps to
classify a critique as good or bad.
Previous work mainly focuses on classifying whether a movie is
good or bad, but our work also develops a web application that
predicts whether a user's critique is positive or negative. When
a critique is entered into the web application, it predicts
whether that critique or review is positive or negative. This
gives the user a more convenient approach.
II. BACKGROUND
Background work for this project began with exploring
application programming interfaces (APIs) for gathering movie
critiques. The critiques were obtained through the API of OMDB,
the Online Movie Database, which is a domain of IMDB where
information such as images, videos, and other movie content is
updated frequently by naive users. The obtained critique data is
preprocessed, and mining techniques are applied to get accurate
results so that naive users can analyze the opinion in a piece
of text. We discuss the individual concepts below for a better
understanding of the project. We began with natural language
processing for opinion mining, to extract the critique trait
from the obtained data.
A. Sentiment analysis
• The analytical process of extracting a mood or opinion
from a piece of text is called sentiment analysis [2]. It
is a linguistic analysis technique for assessing the
opinion in a text document at an early stage. Sentiment
analysis relies on text analysis and natural language
processing to filter and extract the precise mood or
opinion from the text. Its main purpose is to find the
polarity of a text document for optimum classification.
The analyzed sentiment or opinion is classified as
positive, negative, or neutral. Sentiment analysis is
part of text classification. Classification is based on
a user's personal traits, emotions, mood, and attitudes
about a particular topic at a given instant, as
expressed in user-updated data.
• The analysis of the sentiment or mood of a text is
mainly concerned with three parameters, each reflecting
a perspective [3]:
1) Source perspective of sentiment or opinion
2) Destination perspective
3) Nature perspective
• From the above aspects, the opinion or sentiment
factor is extracted by considering the source opinion,
which is a fixed set of classes used for prediction.
The destination aspect targets the sort of opinion to be
analyzed, and the nature perspective finds which sort of
opinion or mood is retrieved. Text attitude is filtered
as a positive or negative critique, and ranking is then
done.
• Labeling the data by its sentiment or opinion [4]
about a particular topic gives comprehensive data for
naive users. A vital point in sentiment analysis is
feature extraction. Features are extracted by relying on
the extraction of subjective content, so the feature
words in the parsed data are filtered explicitly.
Feature generation [5] is the process of extracting the
relevant features; for sentiment classification it
relies on negation handling while considering
adjectives for evaluating the sentiment. After feature
extraction, the polarity of the text is determined,
since the word features are linked with the opinion of
the text. Basic text classification [6] is the process
of predicting, for a document 'D', a class 'c' from a
fixed set of classes C = (c1, c2, c3, ..., ci). Text
classification is mainly applied to spam detection,
identification of language, genre, and gender, and
sentiment analysis.
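The negation handling mentioned above can be illustrated with a minimal Python sketch. It assumes one common scheme, in which every token after a negation cue is prefixed with NOT_ until the clause ends; the paper does not state which scheme was used, and the cue list, the function name mark_negation, and the example sentence are all hypothetical.

```python
import re

# Hypothetical negation cues; a real system would use a fuller list.
NEGATIONS = {"not", "no", "never", "cannot"}
CLAUSE_END = {".", ",", ";", "!", "?"}


def mark_negation(tokens):
    """Prefix tokens that follow a negation cue with 'NOT_' until the clause ends."""
    marked, negating = [], False
    for tok in tokens:
        low = tok.lower()
        if low in CLAUSE_END:
            negating = False          # punctuation closes the negated clause
            marked.append(tok)
        elif low in NEGATIONS or low.endswith("n't"):
            negating = True           # start negating the tokens that follow
            marked.append(tok)
        else:
            marked.append("NOT_" + tok if negating else tok)
    return marked


# Split an invented sentence into words and punctuation marks.
tokens = re.findall(r"[\w']+|[.,;!?]", "The plot is not good, but the cast is great.")
marked = mark_negation(tokens)
print(marked)
```

With this marking, "good" and "NOT_good" become distinct word features, so a classifier can learn opposite polarities for them.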
B. Preprocessing stage
• Enthusiasm for data mining is driving the current era
of obtaining knowledge from large, unsorted, and
inefficient data. The preprocessing stage [7] is
crucial for building a clean knowledge base and
discovering precise knowledge, because it fills the
voids in the knowledge discovery process. It consists
of successive steps that extract the desired knowledge
from the raw data; these steps are described here.
Tokenization [8] is the key technique in data
preprocessing. It parses, or divides, long connected
text into words in order to capture the writer's
intention. The text is split into separate words or
sequences of words. For instance, the sentence "data
mining and machine learning class" is transformed into
("data", "mining", "and", "machine", "learning",
"class"), which can then be compared with other texts
or analyzed in context. Stop word filtering is a vital
part of purifying the data. Stop words take up space
and are unnecessary, so they should be eliminated for
accurate analysis. First a list of stop words is
indexed, and then the stop words, which form a static
list, are removed using a statistical approach.
Finally, case conversion and punctuation removal are
applied to obtain the cleaned data from which the
essential mood or sentiment of the critique is
retrieved.
• Classification of text or documents is the pivotal
factor in our project. We used the Naïve Bayes technique
for the classification problem because it assumes that
the features are independent of one another, and it
classifies by considering probabilities, which makes it
simple in nature. Given a class 'C' and a document 'd',
the output of Naïve Bayes is the probability p(C|d) that
the document belongs to the class. The probabilities
assigned depend on the number of times each feature
term occurs. Since it is a machine learning algorithm,
it learns a model from the training data set and applies
that model to classify new documents.
• To enhance our project we also considered the random
forest [9] method, which is a state-of-the-art
methodology. The method is basic and clear, yet it
produces accurate and sophisticated results.
Classification accuracy is improved by increasing the
number of trees, each built by selecting features or
variables at random (with or without replacement).
Finally, a poll [10] among the trees chooses the class,
giving a precise classification.
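To make the count-based probability p(C|d) concrete, here is a minimal multinomial Naive Bayes sketch in plain Python. It is an illustration under stated assumptions rather than the project's implementation: add-one (Laplace) smoothing is assumed, the function names are hypothetical, and the four training snippets are invented toy data.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """docs: list of (tokens, label). Returns priors, per-class word counts, vocabulary."""
    class_docs = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)   # how often each term occurs per class
        vocab.update(tokens)
    priors = {c: n / len(docs) for c, n in class_docs.items()}
    return priors, word_counts, vocab


def classify(tokens, priors, word_counts, vocab):
    """Return the class with the highest log posterior, using add-one smoothing."""
    best_label, best_score = None, float("-inf")
    for label, prior in priors.items():
        total = sum(word_counts[label].values())
        score = math.log(prior)
        for tok in tokens:
            if tok in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][tok] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, invented for illustration (not the paper's dataset).
train = [
    ("a pleasant and diverting film".split(), "fresh"),
    ("great performance and direction".split(), "fresh"),
    ("mediocre and too pat".split(), "rotten"),
    ("never escapes the queasy aura".split(), "rotten"),
]
model = train_naive_bayes(train)
print(classify("a pleasant performance".split(), *model))  # fresh
print(classify("mediocre and queasy".split(), *model))     # rotten
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied together.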

III. APPROACH
The approach we have chosen, shown in Figure 1, starts with
gathering data from different open data sources and training a
classifier on a corpus of self-tagged critiques from the
retrieved data. We then refine our classifier using this same
corpus before applying it to sentences mined from the web.
Fig 1.
3.1 Collecting Critics
The first step was to collect a large dataset from a well-known
movie website, which would then be labeled and used for training
and testing a sentiment analysis classifier, as in [12]. We had
two sites in mind, OMDB and Rotten Tomatoes, where a large
number of reviews and critiques can be found. We took movies
released from 2000 to 2008. OMDB has a system where the user can
input a text which returns a positive or negative rating. Extra
data such as date, time, and review data was available but is
not used in this project. We selected a wide array of critic
reviews, amounting to around 15000 instances.
3.2 Pre-processing
The next step in our process was to fetch word features [13]
from the collected data. The pre-processing stage removes
unnecessary details from the comma-separated values (CSV) data.
It proceeds as follows:
• Tokenization
• Case conversion
• Word conversion to full forms (“Don’t” to “Do not”)
• Removal of punctuation
• Stop word filtering
Tokenization is carried out by a parser, as implemented in [14],
which clips sentences down to meaningful words without changing
the meaning of the words. A large number of transformations can
then be applied to the resulting ordered list. Words with
apostrophes and other short forms are converted to their full
forms, and punctuation is removed. Stop words from the NLTK
corpus were used to remove words irrelevant to the collected
data, such as ‘the’, ‘if’, ‘what’, and ‘when’. Every remaining
token should be a meaningful English word. We then defined a
function that uses a bag of words to map feature names to
feature values [15][16]. It collects the frequency with which
each word is repeated in positive and in negative reviews.
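The pre-processing steps listed in this section can be sketched in a few lines of Python. This is a simplified, dependency-free illustration, not the project's code: the stop-word and contraction lists are small hypothetical stand-ins (the paper takes its stop words from the NLTK corpus), the function names are assumptions, and the steps run in a slightly different order so that contractions are expanded before punctuation is stripped.

```python
import string
from collections import Counter

# Small illustrative lists; the paper takes its stop words from the NLTK corpus.
STOP_WORDS = {"the", "if", "what", "when", "a", "is", "it", "and", "of"}
CONTRACTIONS = {"don't": "do not", "isn't": "is not", "can't": "cannot"}


def preprocess(text):
    """Lower-case, expand contractions, strip punctuation, tokenize, drop stop words."""
    text = text.lower().replace("’", "'")        # case conversion, normalize apostrophes
    for short, full in CONTRACTIONS.items():     # word conversion to full forms
        text = text.replace(short, full)
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()                        # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]


def bag_of_words(tokens):
    """Map each feature name (word) to its feature value (frequency)."""
    return dict(Counter(tokens))


tokens = preprocess("Don’t miss it: the movie is a delight!")
print(tokens)  # ['do', 'not', 'miss', 'movie', 'delight']
print(bag_of_words(tokens))
```

Note that "not" is deliberately absent from the stop-word list here, since removing it would erase the negation that sentiment analysis depends on.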
3.3 Feature Extraction
To generate the feature vectors, we used the dataset collected
in the previous step, which is then used to train our
classifier. We used a method defined in [15][16], in which the
frequency of keyword occurrence proved the better feature for
our usage. We wrote a function that takes three arguments: the
words extracted from the reviews, a trained word2vec model, and
the dimension of the vectors. Its output is a numpy array
representing the reviews. NumPy is a Python library for
scientific computing that provides powerful array objects.
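A function with that three-argument shape might look like the sketch below. Several things here are assumptions: the paper does not say how word vectors are combined, so simple averaging is used; a plain dict stands in for a trained word2vec model; and Python lists stand in for the numpy array so that the sketch has no dependencies. The name average_vector and the toy embeddings are hypothetical.

```python
def average_vector(words, model, num_features):
    """Average the vectors of the words that the model knows about.

    `model` is any mapping from word to a vector of length `num_features`;
    a plain dict stands in here for a trained word2vec model.
    """
    total = [0.0] * num_features
    known = 0
    for word in words:
        if word in model:
            known += 1
            for i, value in enumerate(model[word]):
                total[i] += value
    if known == 0:
        return total  # all zeros when no word is in the vocabulary
    return [value / known for value in total]


# Toy 3-dimensional "embeddings", invented for illustration.
toy_model = {
    "pleasant": [1.0, 0.0, 2.0],
    "diverting": [3.0, 2.0, 0.0],
}
review_vector = average_vector(["pleasant", "diverting", "unseen"], toy_model, 3)
print(review_vector)  # [2.0, 1.0, 1.0]
```

Every review, whatever its length, maps to a vector of the same dimension, which is what a classifier needs as input.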
We used the NLTK corpus reader package to create a text corpus
of all the data we collected [17]. We use 60 percent of the
corpus as the training set and the remaining 40 percent as the
test set. We then labeled the words as rotten or fresh through a
function that takes words from a dictionary and classifies them
by sentiment.
A Naïve Bayes classifier [18] was used to build the sentiment
classifier; words are classified as rotten or fresh, and the
frequency of each is displayed. We did not take a third, neutral
class into account. A neutral class may well improve accuracy,
but we cannot confirm this because our classifier treats all
words the same. Handling it calls for improved sentiment
analysis, which is a possible future extension of our project.
3.4 Classification
Classification is an important part of data mining, and it lets
us measure accuracy on the trained data. We used two
classification methods: Naïve Bayes [18] and a random forest
classification model [19]. Decision trees are useful because
they tell you which inputs are the best predictors of the
outputs.
The Naïve Bayes classification model reached an accuracy of 87%.
This is a bit lower than random forest, because Naïve Bayes
performs well on small amounts of data, whereas decision trees
perform well on large data and categorize it well. This is only
a tentative explanation, as the reverse can hold for some data
sets. For problems based on true or false decisions, however,
decision trees are the best predictors.
Using the random forest machine learning algorithm to check the
accuracy of data classification, we obtained 93%. Random forest
was specifically chosen because a decision-tree-based method
suits supervised learning on labeled data. Each tree is
constructed using a random subset of the training data. After
training, each test instance is passed through the forest to
obtain a predicted output [19].

Random forest is an ensemble technique that combines the outputs
of many weaker learners to obtain a stronger result. The weak
learner is a decision tree, and the ensemble gives good
predictive output when the trees split on good features. Using
pandas, we created a data structure that is split into train and
test data sets. One weakness we observed is that random forest
fails for higher-dimensional data, so we have not dealt with
that case.
Random Forest
Input: X = number of trees, T = training data, P = total number
of features, p = size of the feature subset.
Output: Class label for the input data, decided by majority vote.
a) For each tree:
1) Select a bootstrap sample Y, of the same size as T, from the
training data.
2) Grow the tree by repeatedly choosing p features at random
from P, selecting the best of the p, and splitting the points
on it.
b) When all trees are built, pass each test instance through
every tree and assign the class label with the largest number
of votes.
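The procedure above can be sketched in plain Python as follows. This is a deliberately reduced illustration, not the project's implementation: each "tree" is a one-split decision stump over binary word-presence features rather than a fully grown tree, and the function names, feature layout, and toy rows are all hypothetical.

```python
import random
from collections import Counter


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def train_stump(rows, labels, feature_ids):
    """Pick the feature (among feature_ids) whose 0/1 split misclassifies the fewest rows."""
    best = None
    fallback = majority(labels)
    for f in feature_ids:
        left = [y for x, y in zip(rows, labels) if x[f] == 0]
        right = [y for x, y in zip(rows, labels) if x[f] == 1]
        left_label = majority(left) if left else fallback
        right_label = majority(right) if right else fallback
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, f, left_label, right_label)
    return best[1:]  # (feature index, label if 0, label if 1)


def train_forest(rows, labels, n_trees, n_sub_features, seed=0):
    """Step a): one bootstrap sample and one random feature subset per tree."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]              # bootstrap sample Y
        feats = rng.sample(range(n_features), n_sub_features)   # p features out of P
        forest.append(train_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest


def predict(forest, x):
    """Step b): every tree votes and the majority label wins."""
    votes = [right if x[f] == 1 else left for f, left, right in forest]
    return majority(votes)


# Toy data: presence (1) or absence (0) of "great", "pleasant", "mediocre", "queasy".
rows = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 1, 1]]
labels = ["fresh", "fresh", "fresh", "rotten", "rotten", "rotten"]
forest = train_forest(rows, labels, n_trees=5, n_sub_features=2)
print(predict(forest, [1, 1, 0, 0]))
```

The fixed seed only makes repeated runs reproducible; a full random forest would also grow each tree beyond a single split.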
The main aim of our project was to create an interactive
application in which a search field takes the words or sentences
entered and produces an output. The application is built on
Python Flask, with HTML and JavaScript handling the user
interface. It comprises a search field in which an entered
critique sentence is classified as fresh or rotten. This can be
extended with a spider that crawls a website or review site,
grabs all the text, and comments on the data provided. The
project will soon be made open source for other researchers to
build on.
A. Figures and Tables
1) Dataset Retrieved.
Critic | Publication | Critique | Title
Derek Adams | Time Out | Mediocre Regrettably | Toy Story 3
Roger Ebert | Chicago Sun-Times | The movie is too pat. | Grumpy Old Men
Liam Lacey | Globe and Mail | Never escapes the queasy aura of place | Grumpy Old Men
Janet Maslin | New York Times | Children will enjoy a new take on the idea. | Toy Story 3
Kenneth Turan | Los Angeles Times | A pleasant if undemanding piece of work that is diverting | Grumpy Old Men
Mike Clark | USA Today | For a film that deserves Oscars for photography, editing and sound | Heat
Edward Guthmann | San Francisco Chronicle | What make it work are the integrity of Pfeiffer's performance and Smith's direction, and the high spirits of the young. | Dangerous Minds
Bruce Reid | Film.com | Robbins and Susan Sarandon have crafted a film that transcends its own political message. | Dead Man Walking
TABLE I
IV. CONCLUSION
In this project, we performed NLP-based mining on movie
critiques. The pre-processing techniques for filtering the data
and the bag of words are a valuable basis for us to build on.
Even though the individual steps have been achieved before, the
application we implemented through the methods described has a
profound impact on the project. We applied Naïve Bayes and
random forest to the data set to achieve an acceptable accuracy
for our application's predictions, and the results came out
well. Movie reviews were classified by type. The Naïve Bayes
classifier works on small data sets, which means it initially
takes pre-allocated memory from the device, while random forest
has the advantage of combining multiple true or false decisions
to perform classification and regression.
Ultimately, by extending the process described above, we
created a web application that takes a word or sentence as input
and outputs whether it is positive or negative. This will be
expanded to other languages, as well as to a web crawler that
can review a site.
Focusing on the user interface, we plan to release this
application on mobile as well. The opportunities gained through
this will bring immense knowledge to us as well as to the open
source users of our project.
V. ACKNOWLEDGMENT
We feel honored and privileged to extend our warm salutations to
Kent State University and the Department of Computer Science,
which gave us the opportunity to gain expertise in engineering
and profound technical knowledge. We express our gratitude to
Professor Dr. Kambiz Ghazinour for providing us with the
environment and means to enrich our skills, motivating us in our
endeavor, and helping us realize our full potential. We would
also like to thank Mr. Sravan Kumar for his regular guidance and
constant encouragement; we are extremely grateful to him for his
valuable suggestions and unflinching co-operation throughout the
project work.
References
[1] Gary Handman. Film Reviews and Film Criticism: An Introduction. Film
Studies, UC Berkeley Library.
[2] Rudy Prabowo and Mike Thelwall. Sentiment Analysis: A Combined
Approach. School of Computing and Information Technology, University
of Wolverhampton, Wulfruna Street, WV1 1SB Wolverhampton, UK.
[3] https://web.stanford.edu/~jurafsky/slp3/slides/7_Sent.pdf
[4] Bo Pang, Lillian Lee and Shivakumar Vaithyanathan. Thumbs up?
Sentiment Classification using Machine Learning Techniques.
[5] Henrique Siqueira and Flavia Barros. A Feature Extraction Process for
Sentiment Analysis of Opinions on Services. Centro de Informática
(CIn), Universidade Federal de Pernambuco (UFPE), Recife-PE.
[6] https://web.stanford.edu/class/cs124/lec/naivebayes.pdf
[7] S. Kannan and Vairaprakash Gurusamy. Preprocessing Techniques for
Text Mining.
[8] http://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html
[9] S.L. Ting, W.H. Ip and Albert H.C. Tsang. Is Naïve Bayes a Good
Classifier for Document Classification? International Journal of
Software Engineering and Its Applications, Vol. 5, No. 3, July 2011.
Department of Industrial and Systems Engineering, The Hong Kong
Polytechnic University, Hung Hom, Kowloon, Hong Kong.
[10] Gerard Biau. Analysis of a Random Forests Model. Journal of Machine
Learning Research 13 (2012) 1063-1095. Submitted 10/10; Revised
10/11; Published 4/12.
[11] Leo Breiman. Random Forests. Machine Learning, 45, 5–32, 2001.
University of California, Berkeley.
[12] Philip Beineke, Trevor Hastie, Christopher Manning and Shivakumar
Vaithyanathan. An exploration of sentiment summarization. In
Proceedings of AAAI 2003, pp. 12-15.
[13] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang,
Andrew Y. Ng and Christopher Potts. Learning Word Vectors for
Sentiment Analysis.
[14] Shravan Vishwanathan. Sentiment Analysis for Movie Reviews.
[15] Minqing Hu and Bing Liu. Opinion Feature Extraction Using Class
Sequential Rules.
[16] Kushal Dave, Steve Lawrence and David M. Pennock. Mining the
peanut gallery: Opinion extraction and semantic classification of product
reviews. In Proceedings of WWW 2003, pp. 519-528.
[17] Steven Bird, Ewan Klein and Edward Loper. Natural Language
Processing with Python, 2014.
[18] Alyssa Liang. Rotten Tomatoes: Sentiment Classification in Movie
Reviews. CS 229, 2006.
[19] Hitesh Parmar, Sanjay Bhanderi and Glory Shah. Sentiment Mining of
Movie Reviews using Random Forest with Tuned Hyperparameters.
