Movie Review Project

Innovative Geeks
Presented By- Shekhar Bhardwaj - 03351202716
Shubham Siddhartha - 03451202716
Mukul Sharma - 02051202716
Eckovation Machine Learning

TOPIC: SENTIMENT ANALYSIS ON MOVIE REVIEWS
The Rotten Tomatoes movie review dataset is a corpus of movie reviews
used for sentiment analysis, originally collected by Pang and Lee [1]. In
their work on sentiment treebanks, Socher et al. used Amazon's
Mechanical Turk to create fine-grained labels for all parsed phrases in the
corpus.

DATA DESCRIPTION
• The dataset is comprised of tab-separated files with phrases from the
Rotten Tomatoes dataset. The train/test split has been preserved for the
purposes of benchmarking, but the sentences have been shuffled from
their original order. Each sentence has been parsed into many phrases
by the Stanford parser. Each phrase has a PhraseId. Each sentence has a
SentenceId. Phrases that are repeated (such as short/common words)
are only included once in the data.
• train.tsv contains the phrases and their associated sentiment labels. We
have additionally provided a SentenceId so that you can track which
phrases belong to a single sentence.
• test.tsv contains just phrases. You must assign a sentiment label to each
phrase.
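
As a rough illustration of the file layout described above, the following sketch reads a tab-separated sample and groups phrases by SentenceId. The column names "Phrase" and "Sentiment", the helper names, and the three sample rows are assumptions made for this example; they are not taken from the project code.

```python
import csv
import io

# Hypothetical three-row sample in the layout described above. The column
# names "Phrase" and "Sentiment" and the rows themselves are assumptions.
SAMPLE_TSV = (
    "PhraseId\tSentenceId\tPhrase\tSentiment\n"
    "1\t1\tA series of escapades\t2\n"
    "2\t1\tA series\t2\n"
    "3\t2\tgood movie\t3\n"
)


def load_phrases(handle):
    """Read tab-separated rows into dicts with integer ids and labels."""
    # QUOTE_NONE keeps quotation marks inside phrases as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        rows.append({
            "PhraseId": int(row["PhraseId"]),
            "SentenceId": int(row["SentenceId"]),
            "Phrase": row["Phrase"],
            "Sentiment": int(row["Sentiment"]),
        })
    return rows


def group_by_sentence(rows):
    """Map each SentenceId to the list of phrases that belong to it."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row["SentenceId"], []).append(row["Phrase"])
    return grouped


rows = load_phrases(io.StringIO(SAMPLE_TSV))
sentences = group_by_sentence(rows)
print(sentences)
```

To read a real file, the same function could be given an open file handle instead of the in-memory sample.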

THE SENTIMENT LABELS ARE:
• 0 - negative
• 1 - somewhat negative
• 2 - neutral
• 3 - somewhat positive
• 4 - positive

DATA ANALYSIS
• Data analysis is a process of inspecting, cleansing, transforming,
and modeling data with the goal of discovering useful information,
informing conclusions, and supporting decision-making. Data analysis
has multiple facets and approaches, encompassing diverse techniques
under a variety of names, while being used in different business, science,
and social science domains.
• Data mining is a particular data analysis technique that focuses on
modeling and knowledge discovery for predictive rather than purely
descriptive purposes.

EXTRACTING FEATURES FROM DATA SET AND REMOVING IRRELEVANT DATA: CODE
• We removed the stopwords and names from our dataset.
• We also converted the uppercase letters into lowercase.
• After that, we calculated the number of features in our model so as to
create the DataFrame.
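
The steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the stopword list is a tiny hypothetical one, the removal of names is left out, and the function names are invented for the example. The project's actual code is not shown in the slides.

```python
import re

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "the", "of", "is", "and", "to", "in"}


def preprocess(phrase):
    """Lowercase a phrase, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z']+", phrase.lower())
    return [t for t in tokens if t not in STOPWORDS]


def build_vocabulary(phrases):
    """Give a column index to every distinct token that survives preprocessing."""
    vocab = {}
    for phrase in phrases:
        for token in preprocess(phrase):
            vocab.setdefault(token, len(vocab))
    return vocab


def to_count_vector(phrase, vocab):
    """Turn one phrase into a list of per-token counts (one entry per feature)."""
    vector = [0] * len(vocab)
    for token in preprocess(phrase):
        if token in vocab:
            vector[vocab[token]] += 1
    return vector


# Made-up example phrases.
phrases = ["The plot is a MESS", "An entertaining and clever plot"]
vocab = build_vocabulary(phrases)
matrix = [to_count_vector(p, vocab) for p in phrases]
print("number of features:", len(vocab))
print(matrix)
```

Each row of `matrix` corresponds to one phrase and each column to one feature, which is the shape a table of features would take before being handed to a classifier.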

DIFFERENT CLASSIFIERS TO FIND THE ACCURACY OF DATA
• Logistic Regression
• Decision Tree
• Random Forest
• SVM (Support Vector Machine)
• Naïve Bayes
• kNN (k-Nearest Neighbors)

DECISION TREE:
• A decision tree is a type of supervised learning algorithm (having a pre-defined target
variable) that is mostly used in classification problems. It works for both categorical
and continuous input and output variables. In this technique, we split the population
or sample into two or more homogeneous sets (or sub-populations) based on the most
significant splitter / differentiator among the input variables.
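
One way to make "most significant splitter" concrete is to score every candidate split by how pure the resulting groups are. The sketch below uses Gini impurity for that score, which is an assumption: the slides do not say which criterion the project used. The feature values and labels are made up.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 for a pure set, larger for more mixed sets."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best binary split on one feature."""
    best = (None, float("inf"))
    for threshold in sorted(set(values)):
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        if not left or not right:
            continue  # a split must put at least one sample on each side
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


# Made-up feature: how often a hypothetical word occurs in each phrase.
word_counts = [0, 0, 1, 2, 3, 3]
sentiments = [0, 0, 0, 4, 4, 4]
threshold, impurity = best_threshold(word_counts, sentiments)
print(threshold, impurity)
```

A full decision tree would repeat this search over all features and then recurse into each resulting subset.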

CODE:
• On applying the decision tree, we get an accuracy of
0.63696369

LOGISTIC REGRESSION-
• Despite its name, it is a classification algorithm, not a regression algorithm. It is used to estimate
discrete values (binary values like 0/1, yes/no, true/false) based on a given set of independent variable(s).
In simple words, it predicts the probability of occurrence of an event by fitting data to a logit function.
Hence, it is also known as logit regression. Since it predicts a probability, its output values lie between
0 and 1 (as expected).
• The logistic function (the inverse of the logit) is an S-shaped curve with the equation
$f(x) = \frac{L}{1 + e^{-k(x - x_0)}}$, where:
• e = the natural logarithm base (also known as Euler's number),
• x_0 = the x-value of the sigmoid's midpoint,
• L = the curve's maximum value, and
• k = the steepness of the curve.
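
The S-shaped curve with parameters L, k and the midpoint x0 can be evaluated directly. The function below is a small sketch; the default values L = 1, k = 1 and x0 = 0 are assumptions chosen so that the output stays between 0 and 1, as it would for a probability.

```python
import math


def logistic(x, L=1.0, k=1.0, x0=0.0):
    """Logistic curve: L / (1 + e^(-k * (x - x0)))."""
    return L / (1.0 + math.exp(-k * (x - x0)))


# At the midpoint the curve reaches half of its maximum value L.
for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(x, round(logistic(x), 4))
```

Increasing k makes the transition around x0 steeper, while L scales the height of the curve.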

CODE-
• After applying logistic regression to our model, we get an accuracy of
0.6644664466..
• This is about 2.8 percentage points higher than the decision tree's 0.63696369.

RANDOM FOREST-
• Random forests or random decision forests are an ensemble learning method
for classification, regression and other tasks, that operate by constructing a multitude
of decision trees at training time and outputting the class that is the mode of the
classes (classification) or mean prediction (regression) of the individual trees. Random
decision forests correct for decision trees' habit of overfitting to their training set.
• On applying the random forest, we get a maximum accuracy of 0.6611661166..
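
The way a forest combines its trees, by taking the mode for classification and the mean for regression, can be sketched as follows. Only the aggregation step is shown; the per-tree predictions are made-up numbers, and the function names are invented for this example.

```python
from collections import Counter
from statistics import mean


def forest_classify(tree_predictions):
    """Return the most common class among the individual trees' votes."""
    # On a tie, Counter.most_common keeps the class that was seen first.
    return Counter(tree_predictions).most_common(1)[0][0]


def forest_regress(tree_predictions):
    """Return the average of the individual trees' numeric predictions."""
    return mean(tree_predictions)


votes = [3, 2, 3, 4, 3]  # hypothetical sentiment labels predicted by five trees
print(forest_classify(votes))
print(forest_regress([2.0, 3.0, 4.0]))
```

In a real random forest each tree is also trained on a different random sample of the data and features, which is what reduces the overfitting of any single tree.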

SVM (SUPPORT VECTOR MACHINE)-
• It is a classification method. In this algorithm, we plot each data item as a point in
n-dimensional space (where n is the number of features) with the value of each
feature being the value of a particular coordinate.
• For example, if we only had two features like Height and Hair length of an individual,
we'd first plot these two variables in two-dimensional space where each point has two
coordinates. (The points that end up closest to the separating line are known as
support vectors.)
• Now, we find the line that splits the data between the two differently classified
groups. This will be the line for which the distance to the closest point in each of the
two groups is as large as possible.
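
The idea of preferring the separating line that stays farthest from the closest points can be illustrated by comparing a couple of candidate lines. This sketch does not train an SVM: a real SVM solves an optimisation problem over all possible lines, whereas here the candidates, the points and the helper names are all made up for the example.

```python
import math


def margin(w, b, points):
    """Smallest distance from any point to the line w[0]*x + w[1]*y + b = 0."""
    norm = math.hypot(w[0], w[1])
    return min(abs(w[0] * x + w[1] * y + b) / norm for x, y in points)


def separates(w, b, group_a, group_b):
    """True if group_a is strictly on the negative side and group_b on the positive side."""
    def side(point):
        return w[0] * point[0] + w[1] * point[1] + b

    return all(side(p) < 0 for p in group_a) and all(side(p) > 0 for p in group_b)


# Made-up two-feature points for two groups.
group_a = [(1.0, 1.0), (2.0, 1.0)]
group_b = [(4.0, 3.0), (5.0, 3.0)]

# Candidate vertical lines, written as (w, b) for w[0]*x + w[1]*y + b = 0.
candidates = {
    "x = 3": ((1.0, 0.0), -3.0),
    "x = 2.5": ((1.0, 0.0), -2.5),
}

margins = {
    name: margin(w, b, group_a + group_b)
    for name, (w, b) in candidates.items()
    if separates(w, b, group_a, group_b)
}
best = max(margins, key=margins.get)
print(margins, best)
```

Both candidates separate the groups, but the one that leaves the larger gap to its nearest points is the one an SVM-style criterion would prefer.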

CODE-
• On applying SVM to our model, we get an accuracy of 0.572697260

K-NEAREST NEIGHBOUR-
• It can be used for both classification and regression problems. However, it is more
widely used in classification problems in the industry. K nearest neighbors is a simple
algorithm that stores all available cases and classifies new cases by a majority vote of
their k neighbors. A case is assigned to the class that is most common among its K
nearest neighbors, as measured by a distance function.
• These distance functions can be Euclidean, Manhattan, Minkowski and Hamming
distance. The first three are used for continuous variables and the fourth one
(Hamming) for categorical variables. If K = 1, then the case is simply assigned to the
class of its nearest neighbor. At times, choosing K turns out to be a challenge while
performing kNN modeling.
• On applying this algorithm to our model, we get an accuracy of 0.644114411.. for
k=3.
• For k=10, we get an accuracy of 0.628162816.
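
The majority vote among the k nearest neighbors can be written out directly. The following is a minimal sketch with made-up two-dimensional points; the function names are invented for this example and it is not the code used in the project.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train_points, train_labels, query, k=3, distance=euclidean):
    """Classify `query` by a majority vote among its k nearest training points."""
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: distance(pair[0], query),
    )
    nearest_labels = [label for _, label in ranked[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up training data: two clusters labelled 0 (negative) and 4 (positive).
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = [0, 0, 0, 4, 4, 4]
print(knn_predict(points, labels, (1, 1), k=3))
print(knn_predict(points, labels, (4, 4), k=1))
print(knn_predict(points, labels, (4, 4), k=3, distance=manhattan))
```

Trying several values of k on held-out data, as was done above with k=3 and k=10, is a common way to choose it.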

PLOTTING GRAPHS FOR EACH MODEL-
• After applying the different classifiers, we achieve the maximum
accuracy with logistic regression: 0.670517051

COUNTING NUMBERS OF WORDS FOR DIFFERENT SENTIMENT
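
A count of words per sentiment label could be produced along the following lines. This is only a sketch with made-up phrases and labels; the function name is invented for the example, and the slides do not show how the project computed or plotted these counts.

```python
from collections import Counter, defaultdict


def words_per_sentiment(phrases, labels):
    """Build one word-frequency Counter for each sentiment label."""
    counts = defaultdict(Counter)
    for phrase, label in zip(phrases, labels):
        counts[label].update(phrase.lower().split())
    return counts


# Made-up phrases with labels 0 (negative) and 4 (positive).
phrases = ["a dull dull film", "a charming film", "simply charming"]
labels = [0, 4, 4]
counts = words_per_sentiment(phrases, labels)

for label in sorted(counts):
    total_words = sum(counts[label].values())
    print(label, total_words, counts[label].most_common(2))
```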
