TEXT ANALYSIS TECHNIQUES TO
ANALYZE REVIEWS OF SAMSUNG
GALAXY MEGA 5.8 I9152
SUBMITTED BY
KOUSHIK RAKSHIT
ROLL NO:-A14034
CONTENTS
1. Introduction------------------------------------------------------------------------ 3
2. Problem Statement---------------------------------------------------------------- 3
3. Key features------------------------------------------------------------------------ 3
4. Research Design------------------------------------------------------------------- 4
5. Research Methodology---------------------------------------------------------- 4
A. Insights from Web Crawling & Word Cloud--------------------------5
B. Latent Semantic Analysis (LSA) and Cluster Analysis--------------7
C. Reviews Ratings Analysis-----------------------------------------------14
D. Classification using Support Vector Machine(SVM)---------------14
E. Reviews Sentiment Analysis-------------------------------------------16
6. Business Perspective------------------------------------------------------------18
7. Appendix-------------------------------------------------------------------------19
1. INTRODUCTION:- TEXT ANALYTICS
Text mining, also referred to as text data mining and roughly equivalent to text analytics,
refers to the process of deriving high-quality information from text. High-quality
information is typically derived by discerning patterns and trends through
means such as statistical pattern learning.
2. PROBLEM STATEMENT
ANALYZING REVIEWS FOR SAMSUNG GALAXY MEGA 5.8 I9152 (BLACK, WITH
BLACK)
Reviews for the Samsung Galaxy Mega 5.8 I9152 (at least 100 reviews)
were downloaded from flipkart.com, and a thorough analysis using text
analysis techniques was carried out.
3. KEY FEATURES OF SAMSUNG GALAXY MEGA 5.8 I9152
 Wi-Fi Enabled
 Expandable Storage Capacity of 64 GB
 5.8-inch TFT Capacitive Touchscreen
 Android v4.2.2 (Jelly Bean) OS
 8 MP Primary Camera
 1.9 MP Secondary Camera
 1.4 GHz Dual Core Processor
 Full HD Recording
4. RESEARCH DESIGN
 To analyze users' responses, we had to collect primary and secondary
information from users' mobile reviews on the website http://www.flipkart.com.
 To analyze users' perception about the phone, we took 100 reviews from
the review section on Flipkart.
5. RESEARCH METHODOLOGY
To analyze the user reviews, the following research analysis procedures were undertaken:
A. Web Crawling & Word Cloud
B. Latent Semantic Analysis (LSA) and Clustering Analysis
C. Rating Analysis
D. Classification Analysis using Support Vector Machine(SVM)
E. Reviews Sentiment Analysis
A. INSIGHTS FROM WEB CRAWLING & WORD CLOUD
A tag cloud (word cloud, or weighted list in visual design) is a visual representation
for text data, typically used to depict keyword metadata (tags) on websites, or to
visualize free form text. Tags are usually single words, and the importance of each tag
is shown with font size or color. This format is useful for quickly perceiving the most
prominent terms and for locating a term alphabetically to determine its relative
prominence. When used as website navigation aids, the terms are hyperlinked to items
associated with the tag.
R packages used for the word cloud: RCurl, XML, rvest, wordcloud, tm
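The frequency counting that underlies a word cloud is simple: clean the text, drop stopwords, count terms. The report does this in R with the tm and wordcloud packages; the sketch below shows the same idea in Python (the toy reviews and the tiny stopword list are both illustrative):

```python
import re
from collections import Counter

def word_frequencies(reviews, stopwords=frozenset({"the", "a", "is", "and", "it"})):
    """Lower-case, keep alphabetic words, drop stopwords, count the rest."""
    counts = Counter()
    for review in reviews:
        for word in re.findall(r"[a-z]+", review.lower()):
            if word not in stopwords:
                counts[word] += 1
    return counts

reviews = ["Good phone, great screen!", "Screen is good, battery average."]
freq = word_frequencies(reviews)
print(freq.most_common(2))  # [('good', 2), ('screen', 2)]
```

The resulting frequency table is exactly what wordcloud() consumes: the count of each word sets its font size in the cloud.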
1. Fetching reviews from FLIPKART.COM
FLIPKART<-"http://www.flipkart.com/samsung-galaxy-mega-5-8-i9152/product-reviews/ITMEYFRTWAXZXTUT?pid=MOBDZSDJAPQXGAWN&type=all"
2. Word Cloud Creation
wordcloud(d$words,d$freq,max.words=300,colors=brewer.pal(8,"Dark2"),scale=c(3,0.5),random.order=F)
INFERENCE DRAWN:- The words that took prominence in this word cloud give a
clear idea that the mobile under discussion is well liked and may be known for its
screen size, display, camera and battery. But the cloud does not tell us whether the
product is worth buying or whether its users are satisfied. So, to gain more insight
into our data, we had to analyze the ratings (out of 5).
B. LATENT SEMANTIC ANALYSIS AND CLUSTER ANALYSIS:
In Latent Semantic Analysis we break the term-document matrix into 3 matrices:
 Word-Dimension Matrix
 Document-Dimension Matrix
 Diagonal Matrix of Singular Values
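The three-matrix decomposition described above is the singular value decomposition (SVD) of the term-document matrix; the report performs it in R with the lsa package. A minimal numerical check in Python (the toy matrix is invented):

```python
import numpy as np

# Toy term-document matrix: rows = words, columns = documents (counts invented).
tdm = np.array([
    [2.0, 0.0, 1.0],   # "good"
    [1.0, 1.0, 0.0],   # "screen"
    [0.0, 2.0, 1.0],   # "battery"
])

# SVD: tdm = U @ diag(s) @ Vt
# U    -> word-dimension matrix
# Vt.T -> document-dimension matrix
# s    -> singular values on the diagonal of the middle matrix
U, s, Vt = np.linalg.svd(tdm, full_matrices=False)
assert np.allclose(U @ np.diag(s) @ Vt, tdm)   # the product recovers the matrix
```

Truncating to the first few columns of U and Vt.T gives the low-dimensional word and document coordinates plotted in the next pages.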
Word-Dimension Matrix:-
PLOTTING X1 vs X2
Inferences:
 When we break the term-document matrix into the dimension-word vector space, it is
clearly visible that positive words like "good" and feature words like "screen", "battery"
etc. occur mostly along dimension-1.
 "Grand", "Mega" and "phone" occur mostly along dimension-2.
 "Display", "price", "money" and quality-related words occur more or less along both
dimensions.
PLOTTING X1 vs X3
Inferences:
 "Grand", "Mega" and "phone" occur mostly along dimension-1.
 Quality-related words occur more or less along both dimensions.
PLOTTING X2 vs X3
Inferences:
 "Grand" and "Mega" occur mostly along dimension-1.
 "Camera" and "samsung" occur more or less along both dimensions.
PLOTTING FOR DOCUMENT MATRIX
Inference:-
 Document nos. 71, 67 and 49 are close to dimension-1.
 Document no. 68 is close to dimension-2.
 HIERARCHICAL CLUSTERING TO DETERMINE THE OPTIMUM NO. OF
CLUSTERS
In data mining, hierarchical clustering (also called hierarchical cluster analysis or
HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters.
Strategies for hierarchical clustering generally fall into two types:
 Agglomerative: This is a "bottom up" approach: each observation starts in its
own cluster, and pairs of clusters are merged as one moves up the hierarchy.
 Divisive: This is a "top down" approach: all observations start in one cluster,
and splits are performed recursively as one moves down the hierarchy.
In general, the merges and splits are determined in a greedy manner. The
results of hierarchical clustering are usually presented in a dendrogram.
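The agglomerative ("bottom up") strategy described above can be sketched in a few lines. The report does this in R (Ward linkage via hclust); below is a deliberately minimal single-linkage version in Python on invented 1-D points standing in for document coordinates:

```python
# Minimal single-linkage agglomerative clustering on illustrative 1-D points.
def agglomerative(points, n_clusters):
    clusters = [[p] for p in points]               # each point starts in its own cluster
    while len(clusters) > n_clusters:
        best = None                                # (distance, i, j) of closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]    # merge the closest pair ("bottom up")
        del clusters[j]
    return clusters

clusters = agglomerative([0.0, 0.1, 2.0, 2.1, 4.0], 3)
print(sorted(sorted(c) for c in clusters))  # [[0.0, 0.1], [2.0, 2.1], [4.0]]
```

Recording the distance at each merge is what produces the dendrogram: cutting the tree at a chosen height yields a candidate number of clusters.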
As per the plot, the optimum number of clusters could be 3, 4 or 5.
CLUSTER ANALYSIS:
 From the above LSA analysis we found 5 to be the optimum number of clusters,
which shows that there are 5 categories of reviews among the 100 reviews
in total.
 So for now we will concentrate on the 5 review clusters, which help club
together the different types of reviews from the users.
CLUSTER-1
 There are a total of 57 observations in this cluster.
 This cluster consists of words related to the price and the look of the phone.
CLUSTER-2
 There are 32 observations in this cluster.
 This cluster consists of words from reviews by customers who have had a good
experience with this mobile.
WORD CLOUD FOR CLUSTER-1
CLUSTER-3
 There are 38 observations in this cluster.
 This cluster consists of words from reviews by customers who have good faith in
the company.
WORD CLOUD FOR CLUSTER-2
CLUSTER-4
 There are only 2 observations in this cluster.
 This cluster does not throw any light on the nature of its reviews.
WORD CLOUD FOR CLUSTER-4
WORD CLOUD FOR CLUSTER-3
CLUSTER-5
 This cluster has 1449 observations.
 It consists of words related to product features & quality.
WORD CLOUD FOR CLUSTER-5
INFERENCE DRAWN FROM CLUSTERING
Apart from Cluster 1, the other clusters do not give sufficient information about the
customer base or type, and no constructive storyline can be carved out of
Clusters 2, 3, 4 and 5.
Cluster 1, on the other hand, reflects almost everything about the various features of
the phone that the customers might have liked.
C.REVIEWS RATINGS ANALYSIS
Total Reviews=100
Satisfied Reviews: 73
Dissatisfied Reviews: 27
Checking the ratings gives a better idea: most users are satisfied with this mobile.
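The split above amounts to parsing the "N stars" strings and counting ratings above a threshold, which is what the appendix code does in R with gsub(). A Python sketch of the same step (the example ratings are invented; the report's 100 real ratings gave 73 vs 27):

```python
import re

def satisfaction_split(ratings, threshold=3):
    """Turn '4 stars'-style strings into numbers; > threshold counts as satisfied."""
    values = [int(re.sub(r" star[s]?", "", r)) for r in ratings]
    satisfied = sum(v > threshold for v in values)
    return satisfied, len(values) - satisfied

print(satisfaction_split(["5 stars", "4 stars", "2 stars", "1 star", "5 stars"]))  # (3, 2)
```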
D. CLASSIFICATION USING SUPPORT VECTOR MACHINE(SVM)
In machine learning, support vector machines (SVMs, also support vector
networks) are supervised learning models with associated learning algorithms
that analyze data and recognize patterns, used for classification and regression
analysis. Given a set of training examples, each marked as belonging to one of
two categories, an SVM training algorithm builds a model that assigns new
examples into one category or the other, making it a non-probabilistic binary
linear classifier. An SVM model is a representation of the examples as points in
space, mapped so that the examples of the separate categories are divided by a
clear gap that is as wide as possible. New examples are then mapped into that
same space and predicted to belong to a category based on which side of the
gap they fall on.
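As a concrete illustration of the classifier just described, here is a tiny linear SVM trained by subgradient descent on the hinge loss. This is a simplification: the report trains its SVM in R with the e1071 package, and the two-feature review data below (counts of "good" and "bad" per review) is invented:

```python
# A tiny linear SVM trained by subgradient descent on the hinge loss.
def train_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """X: feature vectors, y: labels in {-1, +1}. Returns (weights, bias)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:   # inside the margin: hinge-loss subgradient step
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:            # outside the margin: only the regularizer acts
                w = [wj * (1 - lr * lam) for wj in w]
    return w, b

def predict(w, b, x):
    """Side of the separating hyperplane: +1 or -1."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

X = [[2, 0], [1, 0], [0, 2], [0, 1]]   # counts of ("good", "bad") per review
y = [1, 1, -1, -1]                     # +1 = satisfied, -1 = dissatisfied
w, b = train_svm(X, y)
print(predict(w, b, [3, 0]), predict(w, b, [0, 3]))  # 1 -1
```

The learned weight vector plays the role of the word coefficients examined later in the report: a positive weight pushes a review toward one class, a negative weight toward the other.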
The generalization properties of an SVM do not depend on the dimensionality of
the space. You can bound the generalization error by a term depending on the
quotient of the radius of a ball that contains all the data and the margin realized on
that data, but not on the dimensionality of the space. Many extensions exist, but
the answer is essentially the same: The generalization does not depend on the
dimensionality.
An extended explanation is that you can generalize well even in high-dimensional
spaces because the data occupies only a low-dimensional subspace of the feature
space, and regularization results in the learner dealing only with that subspace.
You can see this for yourself by looking at the eigenvalues of the kernel matrix,
which typically decay quickly, meaning that you can project your data to a
low-dimensional subspace with negligible error.
So even if you have, for example, a Gaussian kernel, where the feature space is
infinite-dimensional, you are actually dealing with an essentially finite-dimensional
kernel feature space in which you are learning a linear decision function, which is
statistically tractable. Note that you still need to regularize, though.
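The quick eigenvalue decay claimed above is easy to verify numerically. A small sketch in Python with a Gaussian kernel on illustrative 1-D points (all values invented):

```python
import numpy as np

# Gaussian-kernel matrix for 20 illustrative 1-D points.
xs = np.linspace(0.0, 1.0, 20)
K = np.exp(-((xs[:, None] - xs[None, :]) ** 2) / 0.5)

# Eigenvalues in descending order: the spectrum decays quickly,
# so the data effectively occupies a low-dimensional subspace.
eigvals = np.linalg.eigvalsh(K)[::-1]
print(eigvals[:10].sum() / eigvals.sum() > 0.99)  # True
```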
From the above figure we can infer that, out of 100 data points, 95 contribute to the
formation of the marginal plane. In the coefficient plot, the 6 words displayed at the
head have negative coefficients and the 6 words displayed at the tail have positive
coefficients.
Using SVM, we have classified the reviews into two categories.
Since "dissatisfied" is the first level, the words with negative coefficients have a
positive impact, and vice versa.
Snapshot of Data Frame containing list of words & their
frequency count
E. REVIEWS SENTIMENT ANALYSIS
Sentiment essentially relates to feelings: attitudes, emotions and opinions.
Sentiment Analysis refers to the practice of applying Natural Language
Processing and Text Analysis techniques to identify and extract subjective
information from a piece of text. A person's opinions or feelings are for the
most part subjective, not facts, which means that accurately analyzing an
individual's opinion or mood from a piece of text can be extremely difficult.
With Sentiment Analysis from a text analytics point of view, we are essentially
looking to understand the attitude of a writer with respect to a topic in a piece
of text and its polarity: whether it is positive, negative or neutral.
In recent years there has been a steady increase in interest from brands,
companies and researchers in Sentiment Analysis and its application to
business analytics.
The business world today, as in many data analytics streams, is looking for
"business insight."
● Installed the 'qdap' package.
● We decided the threshold value of polarity to classify between satisfied
and dissatisfied on the basis of the plot on the next page.
● The tree plot was produced using the library "party".
● The output of the polarity check, i.e. the sentiment analysis, gives the clear
message that 65% of the buyers are satisfied with their purchase of the
mobile.
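The polarity-threshold rule described above can be sketched as follows. This is a deliberately crude lexicon scorer in Python: the report uses qdap's polarity() in R, the word lists and example reviews here are invented, and 0.385 mirrors the threshold chosen from the tree plot:

```python
# A crude lexicon-based polarity scorer; word lists and examples are invented.
POSITIVE = {"good", "great", "excellent", "awesome"}
NEGATIVE = {"bad", "poor", "worst", "slow"}

def polarity(review):
    """(positive hits - negative hits) / word count, in [-1, 1]."""
    words = review.lower().split()
    if not words:
        return 0.0
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return score / len(words)

def classify(review, threshold=0.385):
    """Same decision rule as the report: polarity above the threshold = satisfied."""
    return "Satisfied" if polarity(review) > threshold else "Dissatisfied"

print(classify("great phone"))            # Satisfied
print(classify("slow and poor battery"))  # Dissatisfied
```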
CONCLUSION & BUSINESS PERSPECTIVE
• The output of our text analytics techniques brings out the fact that the
Samsung Galaxy Mega 5.8 I9152 is a mobile worth buying.
• Most of the customers who bought it are extremely satisfied with the
various features it offers.
• Customer segmentation is possible, but a very clear classification is not,
as there are many features that are equally liked across clusters.
• Buyers can, of course, be classified in terms of their satisfaction level.
APPENDIX
CODES
#WEB CRAWLING & WORD CLOUD
#install.packages("RCurl")
library(RCurl)
#install.packages("XML")
library(XML)
#install.packages("rvest")
library(rvest)
library(wordcloud)
library(tm)
FLIPKART<-"http://www.flipkart.com/samsung-galaxy-core-18262/product-reviews/ITMDV6F6KYTTPGU4"
d = getURL(FLIPKART)
doc=htmlParse(d)
list=getNodeSet(doc,"//a")
list_href=sapply(list,function(x)xmlGetAttr(x,"href"))
page_link=grep("start=",list_href)
page_links<-list_href[page_link]
page_links<-unique(page_links)
crawl_candidate<-"start="
base="http://www.flipkart.com"
num<-10
doclist=list()
anchorlist=vector()
j=0
while(j<num)
{
if(j==0)
{
doclist[j+1]<-getURL(FLIPKART)
}
else
{
doclist[j+1]=getURL(paste(base,anchorlist[j+1],sep=""))
}
doc<-htmlParse(doclist[[j+1]])
anchor<-getNodeSet(doc,"//a")
anchor<-sapply(anchor,function(x)xmlGetAttr(x,"href"))
anchor<-anchor[grep(crawl_candidate,anchor)]
anchorlist=c(anchorlist,anchor)
anchorlist=unique(anchorlist)
j=j+1
}
reviews=c()
for(i in 1:10)
{
doc=htmlParse(doclist[[i]])
l=getNodeSet(doc,"//div/p/span[@class='review-text']")
l1=html_text(l)
#r=l1[nchar(l1)>200]
reviews=c(reviews,l1)
}
save(reviews,file="C:/Users/Koushik/Desktop/New folder/A/reviews.RData")
#install.packages("wordcloud")
library(wordcloud)
corpus=Corpus(VectorSource(reviews[1:100]))
corpus=tm_map(corpus,tolower)
corpus=tm_map(corpus,removePunctuation)
corpus=tm_map(corpus,removeNumbers)
corpus=tm_map(corpus,removeWords,stopwords("en"))
corpus=Corpus(VectorSource(corpus))
tdm=TermDocumentMatrix(corpus)
m=as.matrix(tdm)
v=sort(rowSums(m),decreasing=T)
d=data.frame(words=names(v),freq=v)
wordcloud(d$words,d$freq,max.words=300,colors=brewer.pal(8,"Dark2"),scale=c(3,0.5),random.order=F)
#REVIEW RATINGS
reviews=c()
ratings=c()
missingRating=data.frame(Page=0,missing=0)
for(i in 1:10){
doc=htmlParse(doclist[[i]])
l=getNodeSet(doc,"//div/p/span")
rateNodes=getNodeSet(doc,"//div[@class='fk-stars']")
rates=sapply(rateNodes,function(x)xmlGetAttr(x,"title"))
ratings=c(ratings,rates)
l1=html_text(l)
reviews=c(reviews,l1)
}
View(reviews)
View(ratings)
reviews100=reviews[1:100]
reviews100
ratings
rating=gsub(" star[s]?","",ratings)
rating=as.numeric(rating)
satisfaction=ifelse(rating>3,"satisfied","dissatisfied")
satisfaction
library(RTextTools) # create_matrix() comes from RTextTools
dtmmobile=create_matrix(reviews100,removePunctuation=T,removeNumbers=T,weighting=weightTfIdf,stemWords=TRUE)
dtmmobile=as.matrix(dtmmobile)
data=as.data.frame(dtmmobile)
data=cbind(data,satisfaction)
#data1=na.omit(data)
data=data[,colSums(data[,-length(data)])>0]
View(data)
table(data$satisfaction)
library(e1071) # svm() comes from e1071
svm=svm(satisfaction~.,data=data)
svm
#To get variable importance in prediction, SVM weights are evaluated as shown below
coef_imp=as.data.frame(t(svm$coefs)%*%svm$SV)
coef_imp1=data.frame(words=names(coef_imp),Importance=t(coef_imp))
coef_imp1=coef_imp1[order(coef_imp1$Importance),]
head(coef_imp1)
tail(coef_imp1)
View(coef_imp1)
#LSA & CLUSTERING
library(vegan)
#install.packages("RTextTools")
library(RTextTools)
library(mclust)
library(lsa)
library(cluster)
tdm=create_matrix(reviews,removeNumbers=T)
tdm_tfidf=weightTfIdf(tdm)
m=as.matrix(tdm)
m_tfidf=as.matrix(tdm_tfidf)
lsa_m=lsa(t(m),dimcalc_share(share=0.8))
lsa_m_tk=as.data.frame(lsa_m$tk)
lsa_m_dk=as.data.frame(lsa_m$dk)
lsa_mtfidf=lsa(t(m_tfidf),dimcalc_share(share=0.8))
k50=kmeans(scale(lsa_m$dk),centers=50,nstart=20)
centers50=aggregate(cbind(V1,V2,V3)~k50$cluster,data=as.data.frame(lsa_m$dk),FUN=mean)
d=dist(centers50[,-1])
hc=hclust(d,method="ward.D")
plot(hc,hang=-1)
rect.hclust(hc,h=0.3)
rect.hclust(hc,h=0.4,border="blue")
rect.hclust(hc,h=1.0,border="cyan")
rect.hclust(hc,h=1.25,border="green")
rect.hclust(hc,h=1.7,border="black")
#As per the plot, optimum values could be either 3,4 or 5
k3=kmeans(scale(lsa_m$tk),centers=3,nstart=20)
centers3=aggregate(cbind(V1,V2,V3)~k3$cluster,data=as.data.frame(lsa_m$tk),FUN=mean)
k4=kmeans(scale(lsa_m$tk),centers=4,nstart=20)
centers4=aggregate(cbind(V1,V2,V3)~k4$cluster,data=as.data.frame(lsa_m$tk),FUN=mean)
k5=kmeans(scale(lsa_m$tk),centers=5,nstart=20)
centers5=aggregate(cbind(V1,V2,V3)~k5$cluster,data=as.data.frame(lsa_m$tk),FUN=mean)
lsa_tk=lsa_m$tk
v=sort(colSums(m),decreasing=T)
wordFreq=data.frame(words=names(v),freq=v)
k5_1=wordFreq[k5$cluster==1,]
k5_2=wordFreq[k5$cluster==2,]
k5_3=wordFreq[k5$cluster==3,]
k5_4=wordFreq[k5$cluster==4,]
k5_5=wordFreq[k5$cluster==5,]
lsa_dk=as.data.frame(lsa_m$dk)
lsa_dk3=data.frame(words=rownames(lsa_dk),lsa_dk[,1:3])
plot(lsa_dk3$V1,lsa_dk3$V2)
text(lsa_dk3$V1,lsa_dk3$V2,label=lsa_dk3$words)
k50=kmeans(scale(lsa_m$tk),centers=50,nstart=20)
centers50=aggregate(cbind(V1,V2,V3)~k50$cluster,data=as.data.frame(lsa_m$tk),FUN=mean)
lsa_tk3=data.frame(words=rownames(lsa_tk),lsa_tk[,1:3])
plot(lsa_tk3$X1,lsa_tk3$X2)
text(lsa_tk3$X1,lsa_tk3$X2,label=lsa_tk3$words)
plot(lsa_tk3$X2,lsa_tk3$X3)
text(lsa_tk3$X2,lsa_tk3$X3,label=lsa_tk3$words)
plot(lsa_tk3$X3,lsa_tk3$X1)
text(lsa_tk3$X3,lsa_tk3$X1,label=lsa_tk3$words)
#SENTIMENT ANALYSIS
library(qdap) # sent_detect() and polarity() come from qdap
data1=data
satisfaction1=as.data.frame(satisfaction)
for(i in 1:100)
{
sent=sent_detect(reviews[i])
pol=polarity(sent)
data1$polarity[i]=pol$group$stan.mean.polarity
satisfaction1$polarity_val[i]=pol$group$stan.mean.polarity
if(is.na(satisfaction1$polarity_val[i]))
{satisfaction1$polarity_val[i]=pol$group$ave.polarity
data1$polarity[i]=pol$group$ave.polarity}
}
new_rate=cbind(rating,satisfaction1)
aggregate(polarity_val~rating,data=new_rate,FUN=mean)
tree=party::ctree(satisfaction~polarity_val, data=new_rate)
plot(tree)
new_rate$status=ifelse(new_rate$polarity_val>0.385,"Satisfied","Dissatisfied")
count_status1=as.data.frame(table(new_rate$status))
View(count_status1)
More Related Content

What's hot

Predicting Azure Churn with Deep Learning and Explaining Predictions with LIME
Predicting Azure Churn with Deep Learning and Explaining Predictions with LIMEPredicting Azure Churn with Deep Learning and Explaining Predictions with LIME
Predicting Azure Churn with Deep Learning and Explaining Predictions with LIMEFeng Zhu
 
Publication - The feasibility of gaze tracking for “mind reading” during search
Publication - The feasibility of gaze tracking for “mind reading” during searchPublication - The feasibility of gaze tracking for “mind reading” during search
Publication - The feasibility of gaze tracking for “mind reading” during searchA. LE
 
Reasonable confidence limits for binomial proportions
Reasonable confidence limits for binomial proportionsReasonable confidence limits for binomial proportions
Reasonable confidence limits for binomial proportionsJohn Zorich, MS, CQE
 
Jarrar.lecture notes.aai.2011s.ch6.games
Jarrar.lecture notes.aai.2011s.ch6.gamesJarrar.lecture notes.aai.2011s.ch6.games
Jarrar.lecture notes.aai.2011s.ch6.gamesPalGov
 
Visual Tools for explaining Machine Learning Models
Visual Tools for explaining Machine Learning ModelsVisual Tools for explaining Machine Learning Models
Visual Tools for explaining Machine Learning ModelsLeonardo Auslender
 
Approach to BSA/AML Rule Thresholds
Approach to BSA/AML Rule ThresholdsApproach to BSA/AML Rule Thresholds
Approach to BSA/AML Rule ThresholdsMayank Johri
 
Chris Hughes Final Year Project
Chris Hughes Final Year ProjectChris Hughes Final Year Project
Chris Hughes Final Year ProjectChris Hughes
 
Imtiaz khan data_science_analytics
Imtiaz khan data_science_analyticsImtiaz khan data_science_analytics
Imtiaz khan data_science_analyticsimtiaz khan
 
Machine-Learning: Customer Segmentation and Analysis.
Machine-Learning: Customer Segmentation and Analysis.Machine-Learning: Customer Segmentation and Analysis.
Machine-Learning: Customer Segmentation and Analysis.Siddhanth Chaurasiya
 
MIM (Mobile Instant Messaging) Classification using Term Frequency-Inverse Do...
MIM (Mobile Instant Messaging) Classification using Term Frequency-Inverse Do...MIM (Mobile Instant Messaging) Classification using Term Frequency-Inverse Do...
MIM (Mobile Instant Messaging) Classification using Term Frequency-Inverse Do...IJMREMJournal
 
Continuous Sentiment Intensity Prediction based on Deep Learning
Continuous Sentiment Intensity Prediction based on Deep LearningContinuous Sentiment Intensity Prediction based on Deep Learning
Continuous Sentiment Intensity Prediction based on Deep LearningYunchao He
 

What's hot (13)

Predicting Azure Churn with Deep Learning and Explaining Predictions with LIME
Predicting Azure Churn with Deep Learning and Explaining Predictions with LIMEPredicting Azure Churn with Deep Learning and Explaining Predictions with LIME
Predicting Azure Churn with Deep Learning and Explaining Predictions with LIME
 
Moviereview prjct
Moviereview prjctMoviereview prjct
Moviereview prjct
 
Publication - The feasibility of gaze tracking for “mind reading” during search
Publication - The feasibility of gaze tracking for “mind reading” during searchPublication - The feasibility of gaze tracking for “mind reading” during search
Publication - The feasibility of gaze tracking for “mind reading” during search
 
Reasonable confidence limits for binomial proportions
Reasonable confidence limits for binomial proportionsReasonable confidence limits for binomial proportions
Reasonable confidence limits for binomial proportions
 
Jarrar.lecture notes.aai.2011s.ch6.games
Jarrar.lecture notes.aai.2011s.ch6.gamesJarrar.lecture notes.aai.2011s.ch6.games
Jarrar.lecture notes.aai.2011s.ch6.games
 
Visual Tools for explaining Machine Learning Models
Visual Tools for explaining Machine Learning ModelsVisual Tools for explaining Machine Learning Models
Visual Tools for explaining Machine Learning Models
 
Approach to BSA/AML Rule Thresholds
Approach to BSA/AML Rule ThresholdsApproach to BSA/AML Rule Thresholds
Approach to BSA/AML Rule Thresholds
 
Chris Hughes Final Year Project
Chris Hughes Final Year ProjectChris Hughes Final Year Project
Chris Hughes Final Year Project
 
Imtiaz khan data_science_analytics
Imtiaz khan data_science_analyticsImtiaz khan data_science_analytics
Imtiaz khan data_science_analytics
 
Machine-Learning: Customer Segmentation and Analysis.
Machine-Learning: Customer Segmentation and Analysis.Machine-Learning: Customer Segmentation and Analysis.
Machine-Learning: Customer Segmentation and Analysis.
 
MIM (Mobile Instant Messaging) Classification using Term Frequency-Inverse Do...
MIM (Mobile Instant Messaging) Classification using Term Frequency-Inverse Do...MIM (Mobile Instant Messaging) Classification using Term Frequency-Inverse Do...
MIM (Mobile Instant Messaging) Classification using Term Frequency-Inverse Do...
 
Continuous Sentiment Intensity Prediction based on Deep Learning
Continuous Sentiment Intensity Prediction based on Deep LearningContinuous Sentiment Intensity Prediction based on Deep Learning
Continuous Sentiment Intensity Prediction based on Deep Learning
 
General lect
General lectGeneral lect
General lect
 

Similar to Analyzing Samsung Galaxy Mega 5.8 Reviews

Sentiment Analysis: A comparative study of Deep Learning and Machine Learning
Sentiment Analysis: A comparative study of Deep Learning and Machine LearningSentiment Analysis: A comparative study of Deep Learning and Machine Learning
Sentiment Analysis: A comparative study of Deep Learning and Machine LearningIRJET Journal
 
Open06
Open06Open06
Open06butest
 
CUSTOMER CHURN PREDICTION
CUSTOMER CHURN PREDICTIONCUSTOMER CHURN PREDICTION
CUSTOMER CHURN PREDICTIONIRJET Journal
 
Density Based Clustering Approach for Solving the Software Component Restruct...
Density Based Clustering Approach for Solving the Software Component Restruct...Density Based Clustering Approach for Solving the Software Component Restruct...
Density Based Clustering Approach for Solving the Software Component Restruct...IRJET Journal
 
Sales_Prediction_Technique using R Programming
Sales_Prediction_Technique using R ProgrammingSales_Prediction_Technique using R Programming
Sales_Prediction_Technique using R ProgrammingNagarjun Kotyada
 
Profile Analysis of Users in Data Analytics Domain
Profile Analysis of   Users in Data Analytics DomainProfile Analysis of   Users in Data Analytics Domain
Profile Analysis of Users in Data Analytics DomainDrjabez
 
IRJET- Sentimental Analysis for Online Reviews using Machine Learning Algorithms
IRJET- Sentimental Analysis for Online Reviews using Machine Learning AlgorithmsIRJET- Sentimental Analysis for Online Reviews using Machine Learning Algorithms
IRJET- Sentimental Analysis for Online Reviews using Machine Learning AlgorithmsIRJET Journal
 
Weka_Manual_Sagar
Weka_Manual_SagarWeka_Manual_Sagar
Weka_Manual_SagarSagar Kumar
 
Support Vector Machine Optimal Kernel Selection
Support Vector Machine Optimal Kernel SelectionSupport Vector Machine Optimal Kernel Selection
Support Vector Machine Optimal Kernel SelectionIRJET Journal
 
RESUME SCREENING USING LSTM
RESUME SCREENING USING LSTMRESUME SCREENING USING LSTM
RESUME SCREENING USING LSTMIRJET Journal
 
Software EngineeringBackground for Question 1-7 Kean Universi.docx
Software EngineeringBackground for Question 1-7 Kean Universi.docxSoftware EngineeringBackground for Question 1-7 Kean Universi.docx
Software EngineeringBackground for Question 1-7 Kean Universi.docxwhitneyleman54422
 
Performance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsPerformance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsDinusha Dilanka
 
A tour of the top 10 algorithms for machine learning newbies
A tour of the top 10 algorithms for machine learning newbiesA tour of the top 10 algorithms for machine learning newbies
A tour of the top 10 algorithms for machine learning newbiesVimal Gupta
 
IRJET- Slant Analysis of Customer Reviews in View of Concealed Markov Display
IRJET- Slant Analysis of Customer Reviews in View of Concealed Markov DisplayIRJET- Slant Analysis of Customer Reviews in View of Concealed Markov Display
IRJET- Slant Analysis of Customer Reviews in View of Concealed Markov DisplayIRJET Journal
 
An Approach to Software Testing of Machine Learning Applications
An Approach to Software Testing of Machine Learning ApplicationsAn Approach to Software Testing of Machine Learning Applications
An Approach to Software Testing of Machine Learning Applicationsbutest
 
Methodological study of opinion mining and sentiment analysis techniques
Methodological study of opinion mining and sentiment analysis techniquesMethodological study of opinion mining and sentiment analysis techniques
Methodological study of opinion mining and sentiment analysis techniquesijsc
 

Similar to Analyzing Samsung Galaxy Mega 5.8 Reviews (20)

Stock Market Prediction Using ANN
Stock Market Prediction Using ANNStock Market Prediction Using ANN
Stock Market Prediction Using ANN
 
Sentiment Analysis: A comparative study of Deep Learning and Machine Learning
Sentiment Analysis: A comparative study of Deep Learning and Machine LearningSentiment Analysis: A comparative study of Deep Learning and Machine Learning
Sentiment Analysis: A comparative study of Deep Learning and Machine Learning
 
Open06
Open06Open06
Open06
 
Marvin_Capstone
Marvin_CapstoneMarvin_Capstone
Marvin_Capstone
 
CUSTOMER CHURN PREDICTION
CUSTOMER CHURN PREDICTIONCUSTOMER CHURN PREDICTION
CUSTOMER CHURN PREDICTION
 
Top Machine Learning Algorithms Used By AI Professionals ARTiBA.pdf
Top Machine Learning Algorithms Used By AI Professionals ARTiBA.pdfTop Machine Learning Algorithms Used By AI Professionals ARTiBA.pdf
Top Machine Learning Algorithms Used By AI Professionals ARTiBA.pdf
 
Density Based Clustering Approach for Solving the Software Component Restruct...
Density Based Clustering Approach for Solving the Software Component Restruct...Density Based Clustering Approach for Solving the Software Component Restruct...
Density Based Clustering Approach for Solving the Software Component Restruct...
 
Sales_Prediction_Technique using R Programming
Sales_Prediction_Technique using R ProgrammingSales_Prediction_Technique using R Programming
Sales_Prediction_Technique using R Programming
 
Profile Analysis of Users in Data Analytics Domain
Profile Analysis of   Users in Data Analytics DomainProfile Analysis of   Users in Data Analytics Domain
Profile Analysis of Users in Data Analytics Domain
 
IRJET- Sentimental Analysis for Online Reviews using Machine Learning Algorithms
IRJET- Sentimental Analysis for Online Reviews using Machine Learning AlgorithmsIRJET- Sentimental Analysis for Online Reviews using Machine Learning Algorithms
IRJET- Sentimental Analysis for Online Reviews using Machine Learning Algorithms
 
Weka_Manual_Sagar
Weka_Manual_SagarWeka_Manual_Sagar
Weka_Manual_Sagar
 
Support Vector Machine Optimal Kernel Selection
Support Vector Machine Optimal Kernel SelectionSupport Vector Machine Optimal Kernel Selection
Support Vector Machine Optimal Kernel Selection
 
Bank loan purchase modeling
Bank loan purchase modelingBank loan purchase modeling
Bank loan purchase modeling
 
RESUME SCREENING USING LSTM
RESUME SCREENING USING LSTMRESUME SCREENING USING LSTM
RESUME SCREENING USING LSTM
 
Software EngineeringBackground for Question 1-7 Kean Universi.docx
Software EngineeringBackground for Question 1-7 Kean Universi.docxSoftware EngineeringBackground for Question 1-7 Kean Universi.docx
Software EngineeringBackground for Question 1-7 Kean Universi.docx
 
Performance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsPerformance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning Algorithms
 
A tour of the top 10 algorithms for machine learning newbies
A tour of the top 10 algorithms for machine learning newbiesA tour of the top 10 algorithms for machine learning newbies
A tour of the top 10 algorithms for machine learning newbies
 
IRJET- Slant Analysis of Customer Reviews in View of Concealed Markov Display
IRJET- Slant Analysis of Customer Reviews in View of Concealed Markov DisplayIRJET- Slant Analysis of Customer Reviews in View of Concealed Markov Display
IRJET- Slant Analysis of Customer Reviews in View of Concealed Markov Display
 
An Approach to Software Testing of Machine Learning Applications
An Approach to Software Testing of Machine Learning ApplicationsAn Approach to Software Testing of Machine Learning Applications
An Approach to Software Testing of Machine Learning Applications
 
Methodological study of opinion mining and sentiment analysis techniques
Methodological study of opinion mining and sentiment analysis techniquesMethodological study of opinion mining and sentiment analysis techniques
Methodological study of opinion mining and sentiment analysis techniques
 

Analyzing Samsung Galaxy Mega 5.8 Reviews

  • 1. 1 TEXT ANALYSIS TECHNIQUES TO ANALYZE REVIEWS OF SAMSUNG GALAXY MEGA 5.8 I9152 SUBMITTED BY KOUSHIK RAKSHIT ROLL NO:-A14034
  • 2. 2 CONTENTS 1. Introduction------------------------------------------------------------------------ 3 2. Problem Statement---------------------------------------------------------------- 3 3. Key features------------------------------------------------------------------------ 3 4. Research Design------------------------------------------------------------------- 4 5. Research Methodology---------------------------------------------------------- 4 A. Insights from Web Crawling & Word Cloud--------------------------5 B. Latent Semantic Analysis (LSA) and Cluster Analysis--------------7 C. Reviews Ratings Analysis-----------------------------------------------14 D. Classification using Support Vector Machine(SVM)---------------14 E. Reviews Sentiment Analysis-------------------------------------------16 6. Business Perspective------------------------------------------------------------18 7. Appendix-------------------------------------------------------------------------19
  • 3. 3 1. INTRODUCTION:- TEXT ANALYTICS Text mining,referred to as text data mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning. 2. PROBLEM STATEMENT ANALYZING REVIEWS FOR SAMSUNG GALAXY MEGA 5.8 I9152 (BLACK, WITH BLACK) From flipkart.com reviews for Samsung Galaxy Mega 5.8 I9152 (at least 100 reviews) were downloaded and a thorough analysis using text analysis techniques was carried out. 3. KEY FEATURES OF SAMSUNG GALAXY MEGA 5.8 I9152  Wi-Fi Enabled  Expandable Storage Capacity of 64 GB  5.8-inch TFT Capacitive Touchscreen  Android v4.2.2 (Jelly Bean) OS  8 MP Primary Camera  1.9 MP Secondary Camera  1.4 GHz Dual Core Processor  Full HD Recording
  • 4. 4. RESEARCH DESIGN
     To analyze users' responses, we collected primary and secondary information from mobile reviews on http://www.flipkart.com.
     To analyze users' perception of the phone, we took 100 reviews from the review section on Flipkart.
    5. RESEARCH METHODOLOGY
    To analyze the user reviews, the following research procedures were undertaken:
    A. Web Crawling & Word Cloud
    B. Latent Semantic Analysis (LSA) and Cluster Analysis
    C. Ratings Analysis
    D. Classification Analysis using Support Vector Machine (SVM)
    E. Reviews Sentiment Analysis
    A. INSIGHTS FROM WEB CRAWLING & WORD CLOUD
    A tag cloud (word cloud, or weighted list in visual design) is a visual representation of text data, typically used to depict keyword metadata (tags) on websites or to visualize free-form text. Tags are usually single words, and the importance of each tag is shown by font size or color. This format is useful for quickly perceiving the most prominent terms and for locating a term alphabetically to determine its relative prominence. When used as website navigation aids, the terms are hyperlinked to items associated with the tag.
    R packages used for the word cloud: RCurl, XML, rvest, wordcloud, tm
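    The mechanics behind a word cloud are just term-frequency counting over cleaned text. A minimal sketch in Python (the report's actual pipeline is the R code in the Appendix; the stop-word list and toy reviews here are made up purely for illustration):

    ```python
    from collections import Counter
    import re

    def term_frequencies(reviews, stopwords):
        """Tokenize, drop stop words, and count terms -- the data behind a word cloud."""
        counts = Counter()
        for review in reviews:
            for word in re.findall(r"[a-z]+", review.lower()):
                if word not in stopwords:
                    counts[word] += 1
        return counts

    # Toy reviews and stop words (hypothetical, for illustration only).
    reviews = [
        "Good phone with a good display",
        "The display is great and the battery lasts long",
        "Battery could be better but the display is good",
    ]
    stopwords = {"the", "a", "is", "and", "with", "but", "be", "could"}
    freq = term_frequencies(reviews, stopwords)
    # In a word cloud, each term's font size is proportional to its frequency.
    print(freq.most_common(3))
    ```

    The R `wordcloud()` call used later in the report does exactly this counting (via `rowSums` on the term-document matrix) before mapping frequency to font size.
    
    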
  • 5. 1. Fetching reviews from FLIPKART.COM
    FLIPKART <- "http://www.flipkart.com/samsung-galaxy-mega-5-8-i9152/product-reviews/ITMEYFRTWAXZXTUT?pid=MOBDZSDJAPQXGAWN&type=all"
    2. Word cloud creation
    wordcloud(d$words, d$freq, max.words = 300, colors = brewer.pal(8, "Dark2"), scale = c(3, 0.5), random.order = F)
    INFERENCE DRAWN:
    The words that took prominence in this word cloud give a clear idea that the mobile under discussion is good and may be known for its screen size, display, camera and battery. However, they do not indicate whether the product is worth buying or whether users of this mobile are satisfied. So, to gain more insight into our data, we also had to analyze the review ratings (out of 5).
    B. LATENT SEMANTIC ANALYSIS AND CLUSTER ANALYSIS
    For Latent Semantic Analysis, we break the term-document matrix into 3 matrices:
  • 6.  Word-Dimension Matrix
     Document-Dimension Matrix
     Diagonal Matrix of Singular Values
    Word-Dimension Matrix: PLOTTING X1 vs X2
    Inferences:
     When we break the term-document matrix into a dimension-word vector space chart, it is clearly visible that positive words like "good" and feature words like screen, battery, etc. occur mostly along dimension 1.
     "Grand", "Mega" and "phone" occur mostly along dimension 2.
     Display, price, money and quality occur more or less in both dimensions.
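    For reference, LSA's decomposition writes the term-document matrix A as U·S·Vᵀ, where U is the word-dimension matrix, V the document-dimension matrix, and S the diagonal matrix of singular values. A hand-sized Python sketch with invented rank-1 numbers shows the bookkeeping (the report itself uses the R lsa package, whose tk/dk/sk components play the roles of U, V and S):

    ```python
    # Toy LSA factors: 3 terms x 2 documents, truncated to one latent dimension.
    # All numbers are invented for illustration.
    U = [[0.6], [0.8], [0.0]]   # word-dimension matrix (one row per term)
    S = [[5.0]]                 # diagonal matrix of singular values
    V = [[0.8], [0.6]]          # document-dimension matrix (one row per document)

    def reconstruct(U, S, V):
        """Rebuild the (approximate) term-document matrix as U * S * V^T."""
        k = len(S)
        return [[sum(U[i][d] * S[d][d] * V[j][d] for d in range(k))
                 for j in range(len(V))]
                for i in range(len(U))]

    A = reconstruct(U, S, V)
    # Each entry is (term weight) x (singular value) x (document weight),
    # e.g. A[0][0] = 0.6 * 5.0 * 0.8 = 2.4
    ```

    The X1-vs-X2 scatter plots in the report are simply the first two columns of U (for words) or of V (for documents).
    
    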
  • 7. PLOTTING X1 vs X3
    Inferences:
     "Grand", "Mega" and "phone" occur mostly along dimension 1.
     Quality-related words occur more or less in both dimensions.
    PLOTTING X2 vs X3
  • 8. Inferences:
     "Grand" and "Mega" occur mostly along dimension 1.
     "Camera" and "samsung" occur more or less in both dimensions.
    PLOTTING THE DOCUMENT MATRIX
    Inference:
     Document nos. 71, 67 and 49 are close to dimension 1.
     Document no. 68 is close to dimension 2.
    HIERARCHICAL CLUSTERING TO DETERMINE THE OPTIMUM NUMBER OF CLUSTERS
    In data mining, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:
     Agglomerative: a "bottom-up" approach; each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
     Divisive: a "top-down" approach; all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
    In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering are usually presented in a dendrogram.
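    The agglomerative ("bottom-up") strategy just described can be sketched in a few lines. This toy Python version uses single linkage on one-dimensional points purely for illustration; the report itself uses R's hclust with Ward's method on the LSA cluster centers:

    ```python
    def agglomerative(points, k):
        """Single-linkage agglomerative clustering: start with singleton
        clusters and greedily merge the closest pair until k remain."""
        clusters = [[p] for p in points]
        while len(clusters) > k:
            best = None  # (distance, i, j) of the closest pair of clusters
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    # single linkage: distance between the two nearest members
                    d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                    if best is None or d < best[0]:
                        best = (d, i, j)
            _, i, j = best
            clusters[i] = clusters[i] + clusters[j]
            del clusters[j]
        return [sorted(c) for c in clusters]

    print(agglomerative([1, 2, 10, 11, 50], k=3))  # [[1, 2], [10, 11], [50]]
    ```

    Cutting the dendrogram at different heights (the rect.hclust calls in the Appendix) corresponds to stopping this merge loop at different values of k.
    
    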
  • 9. As per the plot, the optimum number of clusters could be 3, 4 or 5.
    CLUSTER ANALYSIS:
     From the above LSA analysis we settled on 5 as the optimum number of clusters, which suggests there are 5 categories of reviews among the 100 reviews in total.
     For now, we will concentrate on these 5 review clusters, which help to group the different types of reviews from the users.
    CLUSTER-1
  • 10.  There are a total of 57 observations in this cluster.
     This cluster consists of words related to the price and the look of the phone.
    CLUSTER-2
     There are 32 observations in this cluster.
     This cluster consists of words from reviews by customers who have had a good experience with this mobile.
    WORD CLOUD FOR CLUSTER-1
  • 11. CLUSTER-3
     There are 38 observations in this cluster.
     This cluster consists of words from reviews by customers who have good faith in the company.
    WORD CLOUD FOR CLUSTER-2
  • 12. CLUSTER-4
     There are only 2 observations in this cluster.
     This cluster does not throw any light on the nature of the reviews it contains.
    WORD CLOUD FOR CLUSTER-4
    WORD CLOUD FOR CLUSTER-3
  • 13. CLUSTER-5
     This cluster has 1449 observations.
     It consists of words related to product features and quality.
    WORD CLOUD FOR CLUSTER-5
  • 14. INFERENCE DRAWN FROM CLUSTERING
    Apart from Cluster 1, the other clusters do not give sufficient information about the customer base or customer type. Moreover, Clusters 2, 3, 4 and 5 are substantially smaller than Cluster 1, and no constructive storyline can be carved out of them, whereas Cluster 1 reflects almost everything about the various features of the phone that the customers might have liked.
    C. REVIEWS RATINGS ANALYSIS
    Total Reviews = 100
    Satisfied Reviews: 73
    Dissatisfied Reviews: 27
    Checking the ratings gives a better idea: most users are satisfied with this mobile.
    D. CLASSIFICATION USING SUPPORT VECTOR MACHINE (SVM)
    In machine learning, support vector machines (SVMs, also support vector networks) are supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into the same space and predicted to belong to a category based on which side of the gap they fall on.
    The generalization properties of an SVM do not depend on the dimensionality of the space. The generalization error can be bounded by a term depending on the quotient of the radius of a ball containing all the data and the margin realized on that data, but not on the dimensionality of the space. Many extensions exist, but the answer is essentially the same: the generalization does not depend on the
  • 15. dimensionality. An extended explanation is that you can generalize well even in high-dimensional spaces because the data occupies only a low-dimensional subspace of the feature space, and regularization restricts the learner to that subspace. You can see this for yourself by looking at the eigenvalues of the kernel matrix, which typically decay quickly, meaning that you can project your data onto a low-dimensional subspace with negligible error. So even if you have, for example, a Gaussian kernel, where the feature space is infinite-dimensional, you are actually dealing with an essentially finite-dimensional kernel feature space in which you are learning a linear decision function, which is statistically tractable. Note that you do need to regularize, though.
    From the above figure we can infer that out of 100 data points, 95 contribute to the formation of the marginal plane (i.e., they are support vectors).
    The 6 words displayed at the head have negative coefficients; the 6 words displayed at the tail have positive coefficients.
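    Once a linear SVM is trained, the "gap" intuition above reduces to a simple decision rule: classify by the sign of w·x + b, with the gap width equal to 2/||w||. A Python sketch with hand-picked (not trained) weights for two hypothetical word-count features, mirroring the satisfied/dissatisfied labels; the report's actual classifier comes from R's e1071 package:

    ```python
    import math

    # Hypothetical weights for two features (say, counts of "good" and "poor").
    # In the report these come from the trained SVM; here they are hand-picked.
    w = [1.0, -1.0]
    b = 0.0

    def classify(x):
        """Linear SVM decision rule: which side of the separating hyperplane."""
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        return "satisfied" if score > 0 else "dissatisfied"

    # Width of the gap between the two classes: 2 / ||w||.
    margin_width = 2 / math.sqrt(sum(wi * wi for wi in w))

    print(classify([3, 1]))   # more "good" than "poor"
    print(classify([0, 2]))   # only "poor"
    ```

    The Appendix's `t(svm$coefs) %*% svm$SV` line recovers exactly this weight vector w from the fitted model, which is why its signed entries rank the words by importance.
    
    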
  • 16. Using SVM, we have classified the reviews into two categories. Since "dissatisfied" is the first factor level, words with negative coefficients have a positive impact, and vice versa.
    Snapshot of the data frame containing the list of words and their frequency counts.
    E. SENTIMENT ANALYSIS
    Sentiment essentially relates to feelings: attitudes, emotions and opinions. Sentiment analysis refers to the practice of applying Natural Language Processing and text analysis techniques to identify and extract subjective information from a piece of text. A person's opinions or feelings are for the most part subjective and not facts, which means that accurately analyzing an individual's opinion or mood from a piece of text can be extremely difficult. With sentiment analysis, from a text analytics point of view, we are essentially looking to get an understanding of the attitude of a writer with respect to a
  • 17. topic in a piece of text and its polarity: whether it is positive, negative or neutral. In recent years there has been a steady increase in interest from brands, companies and researchers in sentiment analysis and its application to business analytics. The business world today, as is the case in many data analytics streams, is looking for "business insight."
    ● Installing the 'qdap' package
    ● We decided the threshold value of polarity for classifying between satisfied and dissatisfied on the basis of the plot on the next page.
    ● The tree plot was produced using the "party" library.
    ● The output of the polarity check (sentiment analysis) gives the clear message that 65% of the buyers are satisfied with their purchase of the mobile.
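    A polarity score of the kind used above can be approximated by simple lexicon counting. qdap's real scorer also handles negators and amplifiers; this Python sketch, with a made-up lexicon and a zero threshold, only illustrates the idea of a (positives − negatives)/√n score:

    ```python
    import math

    # Made-up polarity lexicon; qdap ships a much larger, curated one.
    POSITIVE = {"good", "great", "excellent", "awesome", "satisfied"}
    NEGATIVE = {"bad", "poor", "worst", "disappointing", "slow"}

    def polarity(review, threshold=0.0):
        """Crude polarity score: (pos - neg) / sqrt(word count),
        thresholded into satisfied/dissatisfied."""
        words = review.lower().split()
        score = sum((w in POSITIVE) - (w in NEGATIVE) for w in words)
        pol = score / math.sqrt(len(words)) if words else 0.0
        label = "satisfied" if pol > threshold else "dissatisfied"
        return pol, label

    pol, label = polarity("good phone great display")
    # score = 2 over 4 words -> polarity 2 / sqrt(4) = 1.0
    ```

    The report's threshold was chosen from a "party" tree plot rather than fixed at zero; the mechanics of scoring each review are the same.
    
    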
  • 18. CONCLUSION & BUSINESS PERSPECTIVE
    • The output of our text analytics techniques brings out the fact that the Samsung Galaxy Mega 5.8 I9152 is a mobile worth buying.
    • Most of the customers who bought it are extremely satisfied with the various features it offers.
    • Customer segmentation is possible, but a very clear classification is not, as there are many features that are equally liked across clusters.
    • Buyers can, of course, be classified in terms of their satisfaction level.
  • 19. APPENDIX: CODES
    # WEB CRAWLING & WORD CLOUD
    # install.packages("RCurl")
    library(RCurl)
    # install.packages("XML")
    library(XML)
    # install.packages("rvest")
    library(rvest)
    library(wordcloud)
    library(tm)
    FLIPKART <- "http://www.flipkart.com/samsung-galaxy-core-18262/product-reviews/ITMDV6F6KYTTPGU4"
    d = getURL(FLIPKART)
    doc = htmlParse(d)
    list = getNodeSet(doc, "//a")
    list_href = sapply(list, function(x) xmlGetAttr(x, "href"))
    page_link = grep("start=", list_href)
    page_links <- list_href[page_link]
    page_links <- unique(page_links)
    crawl_candidate <- "start="
    base = "http://www.flipkart.com"
    num <- 10
    doclist = list()
    anchorlist = vector()
    j = 0
    while (j < num) {
      if (j == 0) {
        doclist[j + 1] <- getURL(FLIPKART)
  • 21. corpus = tm_map(corpus, removeNumbers)
    corpus = tm_map(corpus, removeWords, stopwords("en"))
    corpus = Corpus(VectorSource(corpus))
    tdm = TermDocumentMatrix(corpus)
    m = as.matrix(tdm)
    v = sort(rowSums(m), decreasing = T)
    d = data.frame(words = names(v), freq = v)
    wordcloud(d$words, d$freq, max.words = 300, colors = brewer.pal(8, "Dark2"), scale = c(3, 0.5), random.order = F)
    # REVIEW RATINGS
    reviews = c()
    ratings = c()
    missingRating = data.frame(Page = 0, missing = 0)
    for (i in 1:10) {
      doc = htmlParse(doclist[[i]])
      l = getNodeSet(doc, "//div/p/span")
      rateNodes = getNodeSet(doc, "//div[@class='fk-stars']")
      rates = sapply(rateNodes, function(x) xmlGetAttr(x, "title"))
      ratings = c(ratings, rates)
      l1 = html_text(l)
      reviews = c(reviews, l1)
    }
    View(reviews)
    View(ratings)
    reviews100 = reviews[1:100]
    reviews100
    ratings
    rating = gsub(" star[s]?", "", ratings)
    rating = as.numeric(rating)
    satisfaction = ifelse(rating > 3, "satisfied", "dissatisfied")
  • 22. satisfaction
    library(RTextTools)  # provides create_matrix()
    library(e1071)       # provides svm()
    dtmmobile = create_matrix(reviews100, removePunctuation = T, removeNumbers = T, weighting = weightTfIdf, stemWords = TRUE)
    dtmmobile = as.matrix(dtmmobile)
    data = as.data.frame(dtmmobile)
    data = cbind(data, satisfaction)
    # data1 = na.omit(data)
    data = data[, colSums(data[, -length(data)]) > 0]
    View(data)
    table(data$satisfaction)
    svm = svm(satisfaction ~ ., data = data)
    svm
    # To get variable importance in prediction, SVM weights are evaluated as shown below
    coef_imp = as.data.frame(t(svm$coefs) %*% svm$SV)
    coef_imp1 = data.frame(words = names(coef_imp), Importance = t(coef_imp))
    coef_imp1 = coef_imp1[order(coef_imp1$Importance), ]
    head(coef_imp1)
    tail(coef_imp1)
    View(coef_imp1)
    # LSA & CLUSTERING
    library(vegan)
    library(mclust)
    library(lsa)
    library(cluster)
    tdm = create_matrix(reviews, removeNumbers = T)
    tdm_tfidf = weightTfIdf(tdm)
    m = as.matrix(tdm)
    m_tfidf = as.matrix(tdm_tfidf)
  • 23. lsa_m = lsa(t(m), dimcalc_share(share = 0.8))
    lsa_m_tk = as.data.frame(lsa_m$tk)
    lsa_m_dk = as.data.frame(lsa_m$dk)
    lsa_mtfidf = lsa(t(m_tfidf), dimcalc_share(share = 0.8))
    k50 = kmeans(scale(lsa_m$dk), centers = 50, nstart = 20)
    centers50 = aggregate(cbind(V1, V2, V3) ~ k50$cluster, data = as.data.frame(lsa_m$dk), FUN = mean)
    d = dist(centers50[, -1])
    hc = hclust(d, method = "ward.D")
    plot(hc, hang = -1)
    rect.hclust(hc, h = 0.3)
    rect.hclust(hc, h = 0.4, border = "blue")
    rect.hclust(hc, h = 1.0, border = "cyan")
    rect.hclust(hc, h = 1.25, border = "green")
    rect.hclust(hc, h = 1.7, border = "black")
    # As per the plot, optimum values could be either 3, 4 or 5
    k3 = kmeans(scale(lsa_m$tk), centers = 3, nstart = 20)
    centers3 = aggregate(cbind(V1, V2, V3) ~ k3$cluster, data = as.data.frame(lsa_m$tk), FUN = mean)
    k4 = kmeans(scale(lsa_m$tk), centers = 4, nstart = 20)
    centers4 = aggregate(cbind(V1, V2, V3) ~ k4$cluster, data = as.data.frame(lsa_m$tk), FUN = mean)
    k5 = kmeans(scale(lsa_m$tk), centers = 5, nstart = 20)
    centers5 = aggregate(cbind(V1, V2, V3) ~ k5$cluster, data = as.data.frame(lsa_m$tk), FUN = mean)
    lsa_tk = lsa_m$tk
    v = sort(colSums(m), decreasing = T)
    wordFreq = data.frame(words = names(v), freq = v)
    k5_1 = wordFreq[k5$cluster == 1, ]
    k5_2 = wordFreq[k5$cluster == 2, ]
    k5_3 = wordFreq[k5$cluster == 3, ]
    k5_4 = wordFreq[k5$cluster == 4, ]
    k5_5 = wordFreq[k5$cluster == 5, ]
  • 24. lsa_dk = as.data.frame(lsa_m$dk)
    lsa_dk3 = data.frame(words = rownames(lsa_dk), lsa_dk[, 1:3])
    plot(lsa_dk3$V1, lsa_dk3$V2)
    text(lsa_dk3$V1, lsa_dk3$V2, label = lsa_dk3$words)
    k50 = kmeans(scale(lsa_m$tk), centers = 50, nstart = 20)
    centers50 = aggregate(cbind(V1, V2, V3) ~ k50$cluster, data = as.data.frame(lsa_m$tk), FUN = mean)
    lsa_tk3 = data.frame(words = rownames(lsa_tk), lsa_tk[, 1:3])
    plot(lsa_tk3$X1, lsa_tk3$X2)
    text(lsa_tk3$X1, lsa_tk3$X2, label = lsa_tk3$words)
    plot(lsa_tk3$X2, lsa_tk3$X3)
    text(lsa_tk3$X2, lsa_tk3$X3, label = lsa_tk3$words)
    plot(lsa_tk3$X3, lsa_tk3$X1)
    text(lsa_tk3$X3, lsa_tk3$X1, label = lsa_tk3$words)
    # SENTIMENT ANALYSIS
    library(qdap)  # provides sent_detect() and polarity()
    data1 = data
    satisfaction1 = as.data.frame(satisfaction)
    for (i in 1:100) {
      sent = sent_detect(reviews[i])
      pol = polarity(sent)
      data1$polarity[i] = pol$group$stan.mean.polarity
      satisfaction1$polarity_val[i] = pol$group$stan.mean.polarity
      if (is.na(satisfaction1$polarity_val[i])) {
        satisfaction1$polarity_val[i] = pol$group$ave.polarity
        data1$polarity[i] = pol$group$ave.polarity
      }