Predicting Online News Popularity

Predicting News Popularity
CSC 424 Advanced Data Analysis and Regression
Ke Feng
07/03/2019

Introduction
 Mashable.com is a digital media website
founded in 2005. It has become one of
today’s most popular online sources
of information.

Dataset
 This dataset summarizes articles published by Mashable over a
period of two years. The data is publicly available from the University
of California, Irvine (UCI) Machine Learning Repository.
 The original dataset has a total of 39,644 observations and 61
variables. 58 of the variables will be used as predictors.
 The goal of this analysis is to predict news popularity, measured as
the number of shares on social media networks. This share count is
my response variable.

Literature Review
 Ding, C., & He, X. (2004). K-means clustering via principal component analysis. Twenty-first International Conference
on Machine Learning - ICML 04. doi:10.1145/1015330.1015408
 Heller, B. (1986). Statistics for experimenters, an introduction to design, data analysis, and model
building. Mathematical Modelling, 7(9-12), 1657-1658. doi:10.1016/0270-0255(86)90102-8
 Khuntia, J., Sun, H., & Yim, D. (2016). Sharing News Through Social Networks. International Journal on Media
Management, 18(1), 59-74. doi:10.1080/14241277.2016.1185429
 Hate Speech, Online and Social Media. (n.d.). Encyclopedia of Social Media and Politics.
doi:10.4135/9781452244723.n252
 Barthel, M. (2017, June 01). Despite subscription surges for largest U.S. newspapers, circulation and revenue fall for
industry overall. Retrieved from http://www.pewresearch.org/fact-tank/2017/06/01/circulation-and-revenue-fall-for-newspaper-industry/
 Advertise With Mashable. (n.d.). Retrieved from https://mashable.com/advertise/
 Al-Zwainy, F. M., Abdulmajeed, M. H., & Aljumaily, H. S. (2013). Using Multivariable Linear Regression Technique for
Modeling Productivity Construction in Iraq. Open Journal of Civil Engineering, 03(03), 127-135.
doi:10.4236/ojce.2013.33015

Exploratory Stage: Clean and Explore the Data
 Check for missing values
 Remove redundant columns
 Check categorical variables
 Produce a descriptive statistical
summary and check the structure
again
Categorical Variables Detection
No Missing Values
Descriptive Summary of Y-variable
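
The cleaning checks listed on this slide can be sketched in a few lines of NumPy. This is only an illustration on synthetic data, not the code used for the analysis: the table, the response `y`, and the column `num_hrefs_copy` are made up for the example, and the other column names are used purely as labels.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-in for the article table (assumption: illustrative columns only).
names = ["n_tokens_content", "num_hrefs", "is_weekend", "num_hrefs_copy"]
hrefs = rng.poisson(10, n).astype(float)
X = np.column_stack([
    rng.poisson(500, n).astype(float),    # word count
    hrefs,                                # link count
    rng.integers(0, 2, n).astype(float),  # 0/1 indicator
    hrefs,                                # exact duplicate of the link count
])
y = np.exp(rng.normal(7.0, 1.0, n))       # right-skewed stand-in for shares

# 1. Count missing values per column.
missing = {name: int(np.isnan(col).sum()) for name, col in zip(names, X.T)}

# 2. Drop any column that is identical to an earlier column.
keep = []
for j in range(X.shape[1]):
    if not any(np.array_equal(X[:, j], X[:, k]) for k in keep):
        keep.append(j)
kept_names = [names[j] for j in keep]
X_clean = X[:, keep]

# 3. Flag columns that only contain 0 and 1 as categorical indicators.
binary = [name for name, col in zip(kept_names, X_clean.T)
          if np.isin(col, (0.0, 1.0)).all()]

# 4. Descriptive summary of the response variable.
summary = {
    "min": float(y.min()),
    "median": float(np.median(y)),
    "mean": float(y.mean()),
    "max": float(y.max()),
}

print("missing:", missing)
print("kept columns:", kept_names)
print("binary columns:", binary)
print("response summary:", summary)
```

A mean well above the median in the response summary is a quick sign of right skew.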

Techniques
 Multiple Regression and Model Building
 PCA & Factor Analysis

Multiple Regression and Model Building
 Check for multicollinearity
 Split the data into 80% training and 20%
testing
 Use the training set to build the
model and the testing set to
evaluate its predictions
 Model 1 has an R² of 11.9% (too
low!)
 Automatic model selection (stepwise &
backward)
 Final model
Data Partition (80% training + 20% testing)
First Model Fitting Result
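
A minimal sketch of the first three steps (multicollinearity check, 80/20 split, fit on training and score on testing), assuming synthetic data and plain least squares in NumPy. The helper names `fit_ols`, `predict`, `r_squared`, and `vif` are hypothetical, and the variance inflation factor is one common multicollinearity check, not necessarily the one used here. Stepwise and backward selection are not covered.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients; element 0 is the intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict(beta, X):
    return beta[0] + X @ beta[1:]

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def vif(X):
    """Variance inflation factor: regress each column on all the others."""
    out = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2 = r_squared(X[:, j], predict(fit_ols(others, X[:, j]), others))
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic data (assumption): x3 is almost a copy of x1.
rng = np.random.default_rng(42)
n = 2000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + 0.1 * rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 - 1.0 * x2 + rng.normal(scale=3.0, size=n)

# 80% training / 20% testing partition on shuffled row indices.
idx = rng.permutation(n)
cut = int(0.8 * n)
train, test = idx[:cut], idx[cut:]

# Multicollinearity check on the training rows.
vifs = vif(X[train])
print("VIF:", np.round(vifs, 1))

# Drop the near-duplicate predictor, fit on training, score on testing.
X_reduced = np.delete(X, 2, axis=1)
beta = fit_ols(X_reduced[train], y[train])
r2_train = r_squared(y[train], predict(beta, X_reduced[train]))
r2_test = r_squared(y[test], predict(beta, X_reduced[test]))
print(f"R^2 train={r2_train:.3f}  test={r2_test:.3f}")
```

A VIF above roughly 10 is a common rule of thumb for a predictor that is largely explained by the others.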

Result & Business Insights
 The parameter estimates show association,
not causation, between Y and the
X-variables. Variables such as
data_channel_is_tech and
abs_title_subjectivity stand out.
 Insight
 Categorizing articles in the right channel is
important. More tech articles may increase
popularity.
 More subjectivity may increase popularity.
Personal views can boost traffic.
Model Fitting Results

Principal Component Analysis (PCA)
 Select components based on the scree
plot and the eigenvalues.
Scree Plot
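
As a rough sketch of how components can be chosen from eigenvalues, the example below runs PCA on the correlation matrix of synthetic data and prints a text version of a scree plot. The data and the "eigenvalue greater than 1" cutoff (the Kaiser rule) are assumptions for illustration; the slide does not state which cutoff was applied.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Synthetic data (assumption): two latent factors drive six observed variables.
f = rng.normal(size=(n, 2))
X = np.column_stack(
    [f[:, 0] + 0.3 * rng.normal(size=n) for _ in range(3)]
    + [f[:, 1] + 0.3 * rng.normal(size=n) for _ in range(3)]
)

# PCA on the correlation matrix, i.e. on standardized variables.
R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)

# eigh returns ascending eigenvalues; reorder from largest to smallest.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()
n_keep = int(np.sum(eigvals > 1.0))  # Kaiser rule: keep eigenvalues above 1

# Text scree plot: look for the "elbow" where the bars flatten out.
for i, (ev, share) in enumerate(zip(eigvals, explained), start=1):
    bar = "#" * int(round(share * 40))
    print(f"PC{i}: eigenvalue={ev:5.2f}  variance={share:6.1%}  {bar}")
print("components kept:", n_keep)
```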

Naming Components
 Factor 1: Length of the article
 Factor 2: Use of keywords
 Factor 3: Number of links
 Factor 4: Publication channel
 Factor 5: Title polarity
 Factor 6: Publication date
* Some factors overlap
Loadings (After Rotation)
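
The slide does not say which rotation was applied to the loadings. Assuming a varimax rotation, which is a common default, the sketch below shows how rotation makes each variable load mainly on one factor, which is what makes factors nameable. The `varimax` function and the toy loading matrix are illustrative, not taken from the analysis.

```python
import numpy as np

def varimax(loadings, max_iter=500, tol=1e-8):
    """Orthogonal varimax rotation via the iterative SVD scheme.

    Returns the rotated loadings and the rotation matrix.
    """
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        col_ss = (rotated ** 2).sum(axis=0)  # sum of squared loadings per factor
        grad = loadings.T @ (rotated ** 3 - rotated @ np.diag(col_ss) / p)
        u, s, vt = np.linalg.svd(grad)
        rotation = u @ vt                    # nearest orthogonal matrix
        new_criterion = s.sum()
        if new_criterion < criterion * (1 + tol):
            break                            # improvement is negligible
        criterion = new_criterion
    return loadings @ rotation, rotation

# Toy example (assumption): a clean two-factor structure, blurred by a 30-degree turn.
simple = np.array([
    [0.9, 0.0], [0.8, 0.0], [0.7, 0.0],
    [0.0, 0.9], [0.0, 0.8], [0.0, 0.7],
])
theta = np.deg2rad(30)
mix = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
unrotated = simple @ mix

rotated, rot = varimax(unrotated)
print(np.round(unrotated, 2))
print(np.round(rotated, 2))
```

Because the rotation is orthogonal, each variable's communality (its row sum of squared loadings) is unchanged; only the split across factors changes. Signs and factor order after rotation are arbitrary.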

Result & Business Insight
 Keep articles an appropriate length
 The day of publication matters
 Embed more links to popular articles
 The Tech channel is usually more popular
 Use a proper amount of keywords
 Create titles with unique words
 Titles should be polarized

Data and News Ethics
 Should we focus solely on news traffic?
 Is there a better way to measure “good news”?
Hate Speech, Online and Social Media. (n.d.). Encyclopedia of Social Media and Politics.
doi:10.4135/9781452244723.n252

Future Work
 Do more data cleaning (the R² is low)
 Try different transformations and model selection methods to see
whether they improve the R²
 Try different techniques to uncover underlying relationships that
this analysis failed to find
 Test a more diverse set of variables
