Multivariate Data Analysis Project Report

ONLINE NEWS POPULARITY
Neha Tembe, Utkarsh Agrawal, Vighnesh Kulkarni
MS in Information Systems, Stevens Institute of Technology
Email: ntembe@stevens.edu, uagrawal@stevens.edu, vkulkar1@stevens.edu
Under the guidance of:
Prof. David Belanger
Abstract- With the growth of the Internet, an ever-increasing number of people enjoy reading and
sharing online news articles. The number of shares an article receives indicates how popular it is.
In this project, we analyze a dataset to predict the popularity of online news using machine
learning techniques. Our data comes from Mashable, a well-known online news site. We implemented
three learning algorithms on the dataset: K-Nearest Neighbor, Classification and Regression Trees,
and Random Forest. Their performance is recorded and compared. Random Forest turns out to be the
best model for prediction, and it can achieve an accuracy of 70% with optimal parameters. Our work
can help online news organizations predict news popularity before publication.
INTRODUCTION
In this information era, reading and sharing news has become central to people’s
entertainment. It would therefore be very helpful to social media workers (authors, advertisers,
etc.) if we could accurately predict the popularity of news prior to its publication.
In this paper, we use a dataset that summarizes a heterogeneous set of features about
articles published by Mashable over a period of two years.
The goals are to:
● Predict the popularity of online news.
● Classify online news as popular or not popular.
● Analyze the data using the K-Nearest Neighbor, Classification and Regression Trees, and
Random Forest algorithms.
● Compare the three algorithms and identify the most suitable model for prediction.

DATA AND DATA PREPARATION
Our dataset, “Online News Popularity”, was originally acquired and preprocessed by K. Fernandes
and is provided by the UCI Machine Learning Repository. It consists of 61 attributes of type
Integer and Real, with 39797 instances to date.
Attribute Information:
0. url: URL of the article (non-predictive)
1. timedelta: Days between the article publication and the dataset acquisition (non-predictive)
2. n_tokens_title: Number of words in the title
3. n_tokens_content: Number of words in the content
4. n_unique_tokens: Rate of unique words in the content
5. n_non_stop_words: Rate of non-stop words in the content
6. n_non_stop_unique_tokens: Rate of unique non-stop words in the content
7. num_hrefs: Number of links
8. num_self_hrefs: Number of links to other articles published by Mashable
9. num_imgs: Number of images
10. num_videos: Number of videos
11. average_token_length: Average length of the words in the content
12. num_keywords: Number of keywords in the metadata
13. data_channel_is_lifestyle: Is data channel 'Lifestyle'?
14. data_channel_is_entertainment: Is data channel 'Entertainment'?
15. data_channel_is_bus: Is data channel 'Business'?
16. data_channel_is_socmed: Is data channel 'Social Media'?
17. data_channel_is_tech: Is data channel 'Tech'?
18. data_channel_is_world: Is data channel 'World'?
19. kw_min_min: Worst keyword (min. shares)
20. kw_max_min: Worst keyword (max. shares)
21. kw_avg_min: Worst keyword (avg. shares)
22. kw_min_max: Best keyword (min. shares)
23. kw_max_max: Best keyword (max. shares)
24. kw_avg_max: Best keyword (avg. shares)
25. kw_min_avg: Avg. keyword (min. shares)
26. kw_max_avg: Avg. keyword (max. shares)
27. kw_avg_avg: Avg. keyword (avg. shares)
28. self_reference_min_shares: Min. shares of referenced articles in Mashable
29. self_reference_max_shares: Max. shares of referenced articles in Mashable
30. self_reference_avg_sharess: Avg. shares of referenced articles in Mashable
31. weekday_is_monday: Was the article published on a Monday?
32. weekday_is_tuesday: Was the article published on a Tuesday?
33. weekday_is_wednesday: Was the article published on a Wednesday?
34. weekday_is_thursday: Was the article published on a Thursday?
35. weekday_is_friday: Was the article published on a Friday?
36. weekday_is_saturday: Was the article published on a Saturday?
37. weekday_is_sunday: Was the article published on a Sunday?
38. is_weekend: Was the article published on the weekend?
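39. LDA_00: Closeness to LDA topic 0
40. LDA_01: Closeness to LDA topic 1
41. LDA_02: Closeness to LDA topic 2
42. LDA_03: Closeness to LDA topic 3
43. LDA_04: Closeness to LDA topic 4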
44. global_subjectivity: Text subjectivity
45. global_sentiment_polarity: Text sentiment polarity
46. global_rate_positive_words: Rate of positive words in the content
47. global_rate_negative_words: Rate of negative words in the content
48. rate_positive_words: Rate of positive words among non-neutral tokens
49. rate_negative_words: Rate of negative words among non-neutral tokens
50. avg_positive_polarity: Avg. polarity of positive words
51. min_positive_polarity: Min. polarity of positive words
52. max_positive_polarity: Max. polarity of positive words
53. avg_negative_polarity: Avg. polarity of negative words
54. min_negative_polarity: Min. polarity of negative words
55. max_negative_polarity: Max. polarity of negative words
56. title_subjectivity: Title subjectivity
57. title_sentiment_polarity: Title polarity
58. abs_title_subjectivity: Absolute subjectivity level
59. abs_title_sentiment_polarity: Absolute polarity level
60. shares: Number of shares (target)

First, we read and viewed the data in R as follows:
#Read the dataset
news <- read.csv("C:/Users/neha tembe/Desktop/MIS/SEMESTER 2/Multivariate/OnlineNewsPopularity.csv") # read the popularity data set
#View the dataset
View(news)
OUTPUT:
Total Number of Attributes - 61
Number of Predictive Attributes - 58
Number of Non Predictive Attributes - 2
Goal Field - 1

DATA PREPROCESSING
In this step, we removed the first two columns of our dataset, ‘url’ and ‘timedelta’, as they
are non-predictive and irrelevant to our analysis. Then we standardized the predictors by
generating z-scores using the scale function.
#Delete url and timedelta columns
newsreg <- subset(news, select = -c(url, timedelta))
#Standardize data
#Generate z-scores for every column except the last one (shares, the target)
for(i in 1:(ncol(newsreg)-1)){newsreg[,i]<-as.numeric(scale(newsreg[,i], center = TRUE, scale = TRUE))}
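As a small illustration of the standardization step, the sketch below applies z-scoring to every predictor column of a toy data frame and leaves the target untouched. The data values and the names `toy` and `predictor_cols` are made up for this example and are not part of the original analysis.

```r
# Toy data frame standing in for the news features (values are made up).
toy <- data.frame(
  n_tokens_title = c(9, 12, 10, 14, 8),
  num_imgs       = c(1, 0, 3, 20, 1),
  shares         = c(593, 1400, 2100, 7600, 900)
)

# Indices of the predictor columns: every column except the last (the target).
predictor_cols <- seq_len(ncol(toy) - 1)

# scale() returns a one-column matrix, so as.numeric() keeps plain numeric columns.
toy[predictor_cols] <- lapply(toy[predictor_cols],
                              function(x) as.numeric(scale(x)))

# After standardization each predictor has mean 0 and standard deviation 1.
col_means <- sapply(toy[predictor_cols], mean)
col_sds   <- sapply(toy[predictor_cols], sd)
print(round(col_means, 10))
print(col_sds)
```

Note that a loop header such as `for (i in n)` visits only the single value `n`, whereas `for (i in 1:n)` or `seq_len(n)` visits every column, which is why the sketch builds the index vector explicitly.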
We calculated the median of the ‘shares’ column, which comes out to be 1400, and labeled the
articles with shares > 1400 as popular.
#Dataset for classification
newscla <- newsreg
newscla$shares <- as.factor(ifelse(newscla$shares > 1400, 1, 0))
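The median split can be checked on a handful of values. The sketch below uses made-up share counts (not taken from the dataset) whose median happens to be 1400, and labels each article the same way as above.

```r
# Made-up share counts; chosen so that the median is 1400.
shares <- c(593, 711, 1400, 1500, 2100, 7600, 900)

threshold <- median(shares)

# 1 = popular (strictly above the median), 0 = not popular.
popular <- factor(ifelse(shares > threshold, 1, 0), levels = c(0, 1))

print(threshold)
print(table(popular))
```

Because the comparison is strict, an article whose share count equals the threshold is labeled 0, so the two classes need not be exactly the same size.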
Finally, we set a random seed so that the results are reproducible, and then split the data into
training data and prediction (test) data in a 70%/30% ratio. Evaluating on held-out data guards
against overfitting, a modeling error which occurs when a function is too closely fit to a limited
set of data points.
#Set random seed
set.seed(100)
#Training data (ind==1) and prediction data (ind==2)
ind <- sample(2, nrow(newscla), replace=TRUE, prob=c(0.7,0.3))
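The following sketch shows how such an index vector partitions a data frame. The data frame `toy` and its columns are hypothetical stand-ins for the news data.

```r
set.seed(100)

# Hypothetical stand-in for the news data: 1000 rows with an id and one feature.
toy <- data.frame(id = 1:1000, x = rnorm(1000))

# Each row is independently assigned to group 1 (prob 0.7) or group 2 (prob 0.3).
ind <- sample(2, nrow(toy), replace = TRUE, prob = c(0.7, 0.3))

train <- toy[ind == 1, ]
test  <- toy[ind == 2, ]

print(c(train = nrow(train), test = nrow(test)))
```

Since every row is assigned independently, the split is only approximately 70/30; the two parts are always disjoint and together cover the whole data frame.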

ANALYSIS
1. PRINCIPAL COMPONENT ANALYSIS
PCA is a dimensionality reduction algorithm which gives a lower-dimensional
approximation of the original dataset while preserving as much variability as possible. We
first created a data frame in R and performed the principal component analysis using both
varimax (orthogonal) and promax (oblique) rotation.
#Creating the dataframe using R
all_data<-news[,c(2:61)]
data_frame <- data.frame(all_data)
#Performing principal component analysis with varimax rotation
install.packages("psych")
library(psych)
pca_varimax <- principal(data_frame, nfactors=4, rotate="varimax")
pca_varimax
                       RC1  RC2  RC3  RC4
SS loadings           4.49 4.35 3.79 3.00
Proportion Var        0.07 0.07 0.06 0.05
Cumulative Var        0.07 0.15 0.21 0.26
Proportion Explained  0.29 0.28 0.24 0.19
Cumulative Proportion 0.29 0.57 0.81 1.00
Mean item complexity = 1.5
Test of the hypothesis that 4 components are sufficient.
The root mean square of the residuals (RMSR) is 0.09
with the empirical chi square 1187905 with prob < 0
Fit based upon off diagonal values = 0.57

#Performing principal component analysis with oblique rotation
pca_oblique <- principal(data_frame, nfactors=4, rotate="promax")
pca_oblique
However, PCA did not improve our models. The four components together explain only 26% of
the total variance, which suggests that the original feature set is well designed and that
the features share little correlated information.
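To make the "variance explained" figures concrete, here is a minimal sketch using base R's `prcomp` on the built-in `mtcars` data. This is an unrotated PCA on a stand-in dataset, not the rotated `principal` analysis of the news data reported above.

```r
# mtcars ships with base R; it stands in for the news features here.
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Each component's variance is the square of its standard deviation.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Cumulative share of variance captured by the first four components.
print(round(cum_var[1:4], 2))
```

When the first few components capture only a small share of the variance, as in the output above for the news data, replacing the original features with those components discards most of the information.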
2. K-NEAREST NEIGHBOR ALGORITHM
K-Nearest Neighbor (KNN) is one of the essential classification algorithms in machine learning.
A case is classified by the majority vote of its K nearest neighbors: it is assigned the class
most common among them. Before applying the KNN algorithm to the dataset, we deleted the ‘url’
and ‘timedelta’ columns and standardized the data. We then obtained the confusion matrix and
ROC curve, which showed an accuracy of 56%.
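The majority-vote rule can be sketched in a few lines of base R. The function `knn_predict` and the six training points are hypothetical and serve only to illustrate the idea; they are not the `knn3` model used in this project.

```r
# Classify one new point by majority vote among its k nearest training points.
knn_predict <- function(train_x, train_y, new_x, k = 5) {
  # Euclidean distance from the new point to every training row.
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  nearest <- order(dists)[seq_len(k)]
  # Count the class labels among the k nearest neighbors.
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Two well-separated made-up clusters: class 0 near (0, 0), class 1 near (5, 5).
train_x <- matrix(c(0, 0,
                    0, 1,
                    1, 0,
                    5, 5,
                    5, 6,
                    6, 5), ncol = 2, byrow = TRUE)
train_y <- factor(c(0, 0, 0, 1, 1, 1))

pred_near_origin <- knn_predict(train_x, train_y, c(0.5, 0.5), k = 3)
pred_near_five   <- knn_predict(train_x, train_y, c(5.5, 5.5), k = 3)
print(c(pred_near_origin, pred_near_five))
```

Because the vote depends on distances, features on large scales would dominate it, which is one reason the data is standardized before KNN is applied.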
#KNN
library(caret) # provides knn3() and confusionMatrix()
newscla.knn <- knn3(shares ~ ., newscla[ind==1,])
#Predicted classes and class probabilities on the test set
newscla.knn.pred <- predict(newscla.knn, newscla[ind==2,], type="class")
newscla.knn.prob <- predict(newscla.knn, newscla[ind==2,], type="prob")
#Confusion matrix
confusionMatrix(newscla.knn.pred, newscla[ind==2,]$shares)
OUTPUT:

3. CLASSIFICATION AND REGRESSION TREES
Classification and regression trees are used for predicting continuous dependent variables
(regression) and categorical dependent variables (classification). The models are obtained by
recursively partitioning the data space and fitting a simple prediction model within each
partition. As a result, the partitioning can be represented graphically as a decision tree. We
plotted the classification tree for our data and then obtained the confusion matrix and ROC
curve, which showed an accuracy of 61%.
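One step of recursive partitioning can be sketched as a search for the threshold that gives the purest pair of child nodes. The sketch below uses Gini impurity as the purity criterion on a single made-up feature; the functions `gini` and `best_split` are illustrative names, not part of `rpart`.

```r
# Gini impurity of a set of class labels: 0 means the node is pure.
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Find the threshold on x that minimizes the weighted impurity of the two children.
best_split <- function(x, y) {
  # Candidate thresholds: midpoints between consecutive distinct sorted values.
  v <- sort(unique(x))
  thresholds <- (head(v, -1) + tail(v, -1)) / 2
  weighted <- sapply(thresholds, function(t) {
    left  <- y[x <= t]
    right <- y[x > t]
    (length(left) * gini(left) + length(right) * gini(right)) / length(y)
  })
  list(threshold = thresholds[which.min(weighted)], impurity = min(weighted))
}

# Made-up feature values and class labels.
x <- c(1, 2, 3, 4, 10, 11, 12, 13)
y <- c(0, 0, 0, 1, 1, 1, 1, 1)

best <- best_split(x, y)
print(best)
```

A full tree repeats this search over all features within each resulting partition until a stopping rule is met.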
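#CART (Classification and Regression Trees)
library(rpart) # provides rpart()
library(rattle) # provides fancyRpartPlot()
newscla.cart <- rpart(shares ~ ., newscla[ind==1,], method='class')
#Plot tree
fancyRpartPlot(newscla.cart)
#Predict classes on the test set (assumed step; it is not shown in the original listing)
newscla.cart.pred <- predict(newscla.cart, newscla[ind==2,], type="class")
#Confusion matrix
confusionMatrix(newscla.cart.pred, newscla[ind==2,]$shares)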
Confusion Matrix and Statistics

4. RANDOM FOREST ALGORITHM
Random Forest uses multiple decision trees, each built on a separate set of examples drawn
from the dataset. Each tree can use a subset of all the available features.
By using more decision trees and averaging their results, the variance of the model can be greatly
lowered. For Random Forest, there are two main parameters to consider: the number of trees
and the number of features considered at each decision point.
Our approach is to use a smaller node size in order to improve accuracy.
In theory, accuracy increases as more trees take part in the decision. We obtained the
confusion matrix and ROC curve for this model, which showed an accuracy of 66%.
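The claim that more voters help can be illustrated with a small simulation. The sketch below makes an idealized assumption that real forests do not satisfy: every voter is independently correct with the same probability (0.6, an arbitrary choice). The function `vote_accuracy` is a hypothetical helper.

```r
set.seed(1)

# Fraction of cases on which a strict majority of independent voters is correct.
vote_accuracy <- function(n_voters, p_correct = 0.6, n_cases = 2000) {
  # Each row is one case, each column one voter; 1 means the voter is right.
  correct <- matrix(rbinom(n_cases * n_voters, 1, p_correct), nrow = n_cases)
  mean(rowSums(correct) > n_voters / 2)
}

# Odd numbers of voters avoid ties.
acc <- sapply(c(1, 11, 101), vote_accuracy)
names(acc) <- c("1", "11", "101")
print(round(acc, 3))
```

Trees in a real forest are correlated because they are trained on overlapping data, so the gain from adding trees levels off much earlier than in this idealized setting.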
Plot Feature Importance
Here we plot importance based on two measures:
● Mean decrease in accuracy: the global variable importance, that is, the mean decrease in
accuracy over all out-of-bag cross-validated predictions when a given variable is permuted
after training but before prediction.
● Mean decrease in Gini: a measure of how much each variable contributes to
the homogeneity of the nodes and leaves in the resulting random forest.
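The permutation idea behind the first measure can be sketched without a forest. The example below swaps in a logistic regression (`glm`) and a held-out test set in place of a random forest and its out-of-bag samples, and uses simulated data in which `x1` is informative and `x2` is pure noise; all names and values are assumptions made for illustration.

```r
set.seed(42)
n  <- 500
x1 <- rnorm(n)                          # informative feature (simulated)
x2 <- rnorm(n)                          # pure noise feature (simulated)
y  <- as.integer(x1 + rnorm(n, sd = 0.5) > 0)
d  <- data.frame(x1, x2, y)

train <- d[1:350, ]
test  <- d[351:n, ]

# Stand-in model: logistic regression instead of a random forest.
fit <- glm(y ~ x1 + x2, data = train, family = binomial)

accuracy <- function(data) {
  pred <- predict(fit, newdata = data, type = "response") > 0.5
  mean(pred == (data$y == 1))
}
baseline <- accuracy(test)

# Importance of a variable = drop in accuracy after shuffling that variable.
perm_importance <- sapply(c("x1", "x2"), function(v) {
  shuffled <- test
  shuffled[[v]] <- sample(shuffled[[v]])  # break the link between v and y
  baseline - accuracy(shuffled)
})
print(round(perm_importance, 3))
```

Shuffling the informative feature destroys most of the model's accuracy, while shuffling the noise feature changes it very little, which is the pattern an importance plot makes visible.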

#Confusion matrix
confusionMatrix(newscla.rf.pred, newscla[ind==2,]$shares)
Confusion Matrix and Statistics
          Reference
Prediction    0    1
         0 3952 1922
         1 2059 3817
Accuracy : 0.6612
95% CI : (0.6526, 0.6698)
No Information Rate : 0.5116
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.3224
Mcnemar's Test P-Value : 0.03112
Sensitivity : 0.6575
Specificity : 0.6651
Pos Pred Value : 0.6728
Neg Pred Value : 0.6496
Prevalence : 0.5116
Detection Rate : 0.3363
Detection Prevalence : 0.4999
Balanced Accuracy : 0.6613
'Positive' Class : 0

RESULTS
By comparing the ROC curves of the three methods, we see that the Random Forest algorithm
gives the highest accuracy, 66%, with an area under the curve (AUC) of 0.72. On the traditional
academic point scale (0.9-1.0 excellent, 0.8-0.9 good, 0.7-0.8 fair), this value falls in the
‘fair’ band. Hence we can state that the model is acceptable, though not excellent.
AUC by model (from the ROC curves):
KNN: 0.592
CART: 0.638
Random Forest: 0.72
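For reference, the AUC can be computed directly from predicted scores and true labels using ranks: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below uses four made-up cases; `auc_rank` is a hypothetical helper, not the function used to produce the values above.

```r
# AUC via the rank-sum (Mann-Whitney) formula.
auc_rank <- function(scores, labels) {
  r  <- rank(scores)                 # average ranks handle tied scores
  n1 <- sum(labels == 1)             # number of positive cases
  n0 <- sum(labels == 0)             # number of negative cases
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Made-up labels and predicted scores.
labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)

auc <- auc_rank(scores, labels)
print(auc)  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1 to a perfect ranking of positives above negatives.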

PERFORMANCE MEASURES
1. Confusion Matrix: It tabulates predicted classes against actual classes and is used to assess
the correctness of the model. Ideally, the model should give 0 false positives and 0 false
negatives, but in practice no model is 100% accurate.
2. Accuracy: In classification problems, accuracy is the number of correct predictions made
by the model divided by the total number of predictions made.
3. Precision: It tells us what proportion of the articles that we
classified as popular actually were popular.
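These measures can be recomputed from the Random Forest confusion matrix reported earlier in this report. The sketch below copies the four counts from that output; the variable names are chosen for this example.

```r
# Counts copied from the Random Forest confusion matrix in this report.
# Rows are predictions, columns are the reference (true) classes.
cm <- matrix(c(3952, 1922,
               2059, 3817),
             nrow = 2, byrow = TRUE,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy <- sum(diag(cm)) / sum(cm)

# Precision for the popular class (1): of the articles predicted popular,
# the share that really were popular.
precision_popular <- cm["1", "1"] / sum(cm["1", ])

# Recall for the popular class: of the truly popular articles, the share found.
recall_popular <- cm["1", "1"] / sum(cm[, "1"])

print(round(c(accuracy = accuracy,
              precision = precision_popular,
              recall = recall_popular), 4))
```

Because the reported output treats class 0 as the 'Positive' class, the precision for popular articles computed here corresponds to the line labeled Neg Pred Value, and the recall corresponds to Specificity.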
CONCLUSION
Random Forest gives the best result for this classification problem. Its performance depends on
the number of decision trees, the number of features considered at each decision point, and the
number of training examples, so these settings should be tuned in a systematic
way.
In the future, we could treat the words in an article directly as additional features and then
apply machine learning algorithms such as Naive Bayes and SVM. This would take into account what
the article actually talks about, and should improve prediction accuracy when combined with our
current work.
References:
● UCI Machine Learning Repository. Online News Popularity Data Set.
https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity
● Maguire, Joe, and Scott Michelson. "Predicting the Popularity of Social News Posts."
CS229 projects (2013).
● Hensinger, Elena, Ilias Flaounas, and Nello Cristianini. "Modelling and predicting news
popularity." Pattern Analysis and Applications 16.4 (2013): 623-635.
