This is my class project using the UCI Mashable dataset to determine what makes a news article popular. In this project, I used (1) multiple regression and model building and (2) PCA and factor analysis.
Data Analytics Tools: SAS and R
2. Introduction
Mashable.com is a digital media website founded in 2005. It has since become one of the most popular online sources of information.
3. Dataset
This dataset summarizes articles published by Mashable over a period of two years. The data is publicly available at the University of California Irvine Machine Learning Repository.
The original dataset has a total of 39,644 observations and 61 variables, 58 of which are used as predictors.
The goal of this analysis is to predict how often news is shared on social media networks (popularity). My response variable is the number of shares on social media networks.
4. Literature Review
Ding, C., & He, X. (2004). K-means clustering via principal component analysis. Twenty-first International Conference on Machine Learning - ICML 04. doi:10.1145/1015330.1015408
Heller, B. (1986). Statistics for experimenters, an introduction to design, data analysis, and model building. Mathematical Modelling, 7(9-12), 1657-1658. doi:10.1016/0270-0255(86)90102-8
Khuntia, J., Sun, H., & Yim, D. (2016). Sharing News Through Social Networks. International Journal on Media Management, 18(1), 59-74. doi:10.1080/14241277.2016.1185429
Hate Speech, Online and Social Media. (n.d.). Encyclopedia of Social Media and Politics. doi:10.4135/9781452244723.n252
Barthel, M. (2017, June 01). Despite subscription surges for largest U.S. newspapers, circulation and revenue fall for industry overall. Retrieved from http://www.pewresearch.org/fact-tank/2017/06/01/circulation-and-revenue-fall-for-newspaper-industry/
Advertise With Mashable. (n.d.). Retrieved from https://mashable.com/advertise/
Al-Zwainy, F. M., Abdulmajeed, M. H., & Aljumaily, H. S. (2013). Using Multivariable Linear Regression Technique for Modeling Productivity Construction in Iraq. Open Journal of Civil Engineering, 03(03), 127-135. doi:10.4236/ojce.2013.33015
5. Exploratory Stage: Clean and Explore the Data
Check for missing values
Remove redundant columns
Check categorical variables
Produce a descriptive statistical summary and check the data structure again
Categorical Variables Detection
No Missing Values
Descriptive Summary of Y-variable
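The cleaning steps listed above can be sketched in a few lines. The project itself used SAS and R, so this Python version is only an illustrative stand-in, not the code behind the slides. Apart from data_channel_is_tech, which the slides mention, the column names and every value below are assumptions made up for the example, and num_hrefs_copy is a hypothetical duplicate column added so that the redundancy check has something to find.

```python
import statistics

# Toy rows standing in for the Mashable data; all values are made up.
rows = [
    {"n_tokens_content": 219, "num_hrefs": 4, "num_hrefs_copy": 4,
     "data_channel_is_tech": 0, "shares": 593},
    {"n_tokens_content": 255, "num_hrefs": 3, "num_hrefs_copy": 3,
     "data_channel_is_tech": 1, "shares": 711},
    {"n_tokens_content": 211, "num_hrefs": 3, "num_hrefs_copy": 3,
     "data_channel_is_tech": 1, "shares": 1500},
    {"n_tokens_content": 531, "num_hrefs": 9, "num_hrefs_copy": 9,
     "data_channel_is_tech": 0, "shares": 1200},
    {"n_tokens_content": 1072, "num_hrefs": 19, "num_hrefs_copy": 19,
     "data_channel_is_tech": 1, "shares": 505},
]


def missing_counts(rows):
    """Count missing (None or empty-string) values per column."""
    return {c: sum(1 for r in rows if r[c] is None or r[c] == "")
            for c in rows[0]}


def repeated_columns(rows):
    """Return columns whose values exactly repeat an earlier column."""
    seen = set()
    repeats = []
    for c in rows[0]:
        values = tuple(r[c] for r in rows)
        if values in seen:
            repeats.append(c)
        else:
            seen.add(values)
    return repeats


def binary_columns(rows):
    """Flag 0/1 indicator columns, which encode categorical variables."""
    return [c for c in rows[0] if {r[c] for r in rows} <= {0, 1}]


def describe(values):
    """Basic descriptive summary of a numeric column."""
    return {"n": len(values), "mean": statistics.mean(values),
            "median": statistics.median(values),
            "min": min(values), "max": max(values)}


missing = missing_counts(rows)
repeats = repeated_columns(rows)
# Drop the redundant columns before the remaining checks.
cleaned = [{k: v for k, v in r.items() if k not in repeats} for r in rows]
categorical = binary_columns(cleaned)
shares_summary = describe([r["shares"] for r in cleaned])

print(missing, repeats, categorical, shares_summary)
```

On the real data the same checks would run over all 61 columns. Exact-duplicate detection is the simplest notion of a redundant column; near-duplicates would need a correlation-based check instead.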
7. Multiple Regression and Model Building
Check multicollinearity
Split the data into 80% training and 20% testing
Use the training set to build the model and the testing set to predict values
Model 1 has an R² of 11.9% (too low!)
Automatic model selection (stepwise and backward)
Final model
Data Partition (80% training + 20% testing)
First Model Fitting Result
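The partition-and-fit workflow on this slide can be illustrated with a small self-contained sketch. It is a Python stand-in for the SAS and R procedures used in the project, and it runs on synthetic data: the two predictors, the true coefficients (1, 2, -1), and the noise level are all assumptions chosen for the example, so the resulting R² says nothing about the Mashable model. The multicollinearity check and the stepwise and backward selection are not shown.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(xs, ys):
    """Least-squares coefficients (intercept first) via the normal equations."""
    design = [[1.0] + list(row) for row in xs]
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def predict(coefs, row):
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


def r_squared(coefs, xs, ys):
    """R² = 1 - SS_res / SS_tot on the given data."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - predict(coefs, row)) ** 2 for row, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Synthetic data: y = 1 + 2*x1 - 1*x2 + noise (assumed for illustration).
rng = random.Random(42)
xs = [[rng.uniform(0, 10), rng.uniform(0, 10)] for _ in range(100)]
ys = [1.0 + 2.0 * x1 - 1.0 * x2 + rng.gauss(0, 0.5) for x1, x2 in xs]

# Shuffle the row indices, then cut at 80% for training and 20% for testing.
indices = list(range(len(xs)))
rng.shuffle(indices)
cut = len(indices) * 8 // 10
train_idx, test_idx = indices[:cut], indices[cut:]

coefs = fit_ols([xs[i] for i in train_idx], [ys[i] for i in train_idx])
test_r2 = r_squared(coefs, [xs[i] for i in test_idx],
                    [ys[i] for i in test_idx])
print(coefs, test_r2)
```

Scoring R² on the held-out 20% rather than on the training rows is what keeps the estimate honest: a model that merely memorizes the training set gains nothing on data it has not seen.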
8. Result & Business Insights
The parameter estimates show association, not causation, between Y and the X-variables. Variables such as data_channel_is_tech and abs_title_subjectivity stand out.
Insight
Categorizing articles in the right channel is important. More tech articles may increase popularity.
More subjectivity may increase popularity. Personal views can boost traffic.
Model Fitting Results
10. Naming Components
Factor 1: Length of the article
Factor 2: Use of keywords
Factor 3: Number of links
Factor 4: Publication channel
Factor 5: Whether the title is polarized
Factor 6: Publication date
* Some factors overlap
Loadings (After Rotation)
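For readers unfamiliar with loadings, the sketch below shows how the loadings of a single principal component come out of a correlation matrix. It is a simplified Python illustration, not the SAS or R analysis behind the slide: it extracts only the first, unrotated component by power iteration, whereas the deck reports six factors after rotation. The three variables (word_count, sentence_count, link_count) are hypothetical and the data is simulated.

```python
import math
import random


def standardize(column):
    """Center a column and scale it to unit sample standard deviation."""
    mean = sum(column) / len(column)
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / (len(column) - 1))
    return [(v - mean) / sd for v in column]


def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equal-length columns."""
    z = [standardize(c) for c in columns]
    n = len(columns[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z]
            for zi in z]


def first_component(matrix, iterations=500):
    """Leading eigenvalue and unit eigenvector by power iteration."""
    vec = [1.0] * len(matrix)
    for _ in range(iterations):
        nxt = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    # Rayleigh quotient v' M v gives the eigenvalue for a unit vector v.
    eigenvalue = sum(v * sum(m * w for m, w in zip(row, vec))
                     for v, row in zip(vec, matrix))
    return eigenvalue, vec


# Simulated data: two "length" variables that move together, plus an
# unrelated link count. Names and distributions are assumptions.
rng = random.Random(0)
word_count = [rng.gauss(500, 150) for _ in range(300)]
sentence_count = [w / 20 + rng.gauss(0, 3) for w in word_count]
link_count = [rng.gauss(10, 3) for _ in range(300)]

corr = correlation_matrix([word_count, sentence_count, link_count])
eigenvalue, vector = first_component(corr)
# Loading = eigenvector entry scaled by the square root of the eigenvalue,
# i.e. the correlation between each variable and the component.
loadings = [math.sqrt(eigenvalue) * v for v in vector]
print(eigenvalue, loadings)
```

In this toy setup the two length variables load heavily on the first component and the link count does not, which is the kind of pattern that would justify a name such as "length of the article". Rotation, which this sketch omits, redistributes the loadings so that each variable loads mainly on one factor.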
11. Result & Business Insight
Keep articles an appropriate length
The day of publication matters
Embed more links to popular articles
The tech channel is usually more popular
Use an appropriate number of keywords
Write titles with unique words
Titles should be polarized
12. Data and News Ethics
Should we focus solely on news traffic?
Is there a better way to measure “good news”?
Hate Speech, Online and Social Media. (n.d.). Encyclopedia of Social Media and Politics.
doi:10.4135/9781452244723.n252
13. Future Work
More work on data cleaning (low R²)
Try different transformations and model selection methods to see whether they improve my R²
Try different techniques to look for underlying relationships that I failed to find in the analyses so far
Test a more diverse set of variables
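As one example of a transformation worth trying, a log transform of the response is a common first step when a count variable is heavily right-skewed. The sketch below is an assumption about what such a step could look like, not something the project has done: the share counts are made up, and log1p is just one candidate among several.

```python
import math


def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up share counts with a long right tail, for illustration only.
shares = [505, 593, 711, 855, 1200, 1500, 2100, 3600, 12000, 84000]

# log1p(x) = log(1 + x), which stays defined if a count is zero.
log_shares = [math.log1p(v) for v in shares]

raw_skew = skewness(shares)
log_skew = skewness(log_shares)
print(raw_skew, log_skew)
```

If the transformed response is closer to symmetric, a linear model fitted to it may explain more variance than one fitted to the raw counts, though that would have to be checked on the actual data.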
Editor's Notes
3 Mashable stories shared per second
1.7 billion monthly cross platform content views
70 million unique content visitors
48 million social media followers