3. Introduction
• Goal: analyze how the number of shares depends on the article attributes and
predict whether an article will be popular online.
• 39,644 observations
• 61 attributes
• Mashable website: articles collected over a two-year period, Jan 2013 to Jan 2015
• No missing values, but some topics were unclassified
• Target: number of shares
7. LDA
The Latent Dirichlet Allocation (LDA) algorithm was applied to all Mashable
texts (known before publication) to first identify the five most relevant
topics and then measure the closeness of each article to those topics.
• Topics were named LDA_00 … LDA_04 (undefined topics)
• The five LDA closeness values sum to one per observation
• Maximum LDA impurity (no dominant topic) corresponds to lower shares overall
• Mean: 1,660 vs. 3,395 shares
• Median: 1,100 vs. 1,400 shares
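The topic-closeness features can be sketched with scikit-learn's LDA implementation. The toy corpus and parameter choices below are illustrative assumptions, not the paper's exact preprocessing; the point is that each article gets a five-value topic distribution that sums to one.

```python
# Sketch of the per-article topic-closeness features (LDA_00 ... LDA_04).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Illustrative stand-in for the Mashable article texts.
docs = [
    "stock market business earnings report",
    "new smartphone technology gadget release",
    "world news politics election government",
    "viral video social media trend sharing",
    "celebrity movie entertainment premiere",
]

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=5, random_state=0)
closeness = lda.fit_transform(counts)  # rows: articles, cols: LDA_00..LDA_04

# Each row is a probability vector: the five closeness values sum to 1.
print(closeness.sum(axis=1))
```

The maximum value in a row indicates the dominant topic; a near-uniform row is a high-impurity article of the kind the slide associates with low shares.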
9. Data Modification
Recoding
Data channel:
0 Viral
1 Lifestyle
2 Entertainment
3 Business
4 Social Media
5 Technology
6 World
Date of publication:
1 Monday
2 Tuesday
3 Wednesday
4 Thursday
5 Friday
6 Saturday
7 Sunday
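The recoding can be sketched with pandas. The one-hot column names below follow the UCI Online News Popularity naming convention (an assumption about the exact dataset columns); the loop collapses them into the single channel code above, with 0 ("Viral") as the default for unclassified rows.

```python
import pandas as pd

# Toy frame mimicking the dataset's one-hot channel columns (assumed names).
df = pd.DataFrame({
    "data_channel_is_lifestyle":     [1, 0, 0],
    "data_channel_is_entertainment": [0, 1, 0],
    "data_channel_is_bus":           [0, 0, 0],
    "data_channel_is_socmed":        [0, 0, 0],
    "data_channel_is_tech":          [0, 0, 1],
    "data_channel_is_world":         [0, 0, 0],
})

channel_cols = list(df.columns)
# Code 0 ("Viral") is kept for rows where no channel flag is set.
df["channel"] = 0
for code, col in enumerate(channel_cols, start=1):
    df.loc[df[col] == 1, "channel"] = code

print(df["channel"].tolist())  # [1, 2, 5]
```

The weekday one-hot columns can be collapsed into the 1–7 code the same way.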
10. Conference Paper
• Shares — Max: 843,300; Mean: 3,395.380; Std. deviation: 11,626.951; Median: 1,400
• Attribute "popularity": shares ≤ 1,400 → unpopular; shares > 1,400 → popular
• Splitting at the median avoided a class imbalance problem
• Turned the task into a binary problem: popular or unpopular
AUC = 0.73
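The median split can be sketched in a few lines. The share counts below are illustrative; the threshold of 1,400 is the paper's reported median, and cutting at the median is what keeps the two classes roughly balanced.

```python
import numpy as np

# Illustrative share counts; the real dataset's median is 1,400.
shares = np.array([500, 1100, 1400, 2000, 843300])

threshold = 1400  # the paper's median split
popular = (shares > threshold).astype(int)  # 1 = popular, 0 = unpopular
print(popular.tolist())  # [0, 0, 0, 1, 1]
```

Because roughly half the articles fall on each side of the median by construction, no resampling or class weighting is needed downstream.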
15. Data Insights
Publication day:
Most articles published on Tuesday, Wednesday, and Thursday.
Fewest articles published on weekends.
Channel:
Most popular topic is Viral, followed by Tech and Business.
Least popular topic is Social Media.
No. of keywords:
Generally between 5 and 10.
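The publication-day counts behind these insights can be derived by summing the dataset's weekday indicator columns. The column names and toy values below are assumptions for illustration.

```python
import pandas as pd

# Toy weekday one-hot columns in the dataset's naming convention (assumed).
df = pd.DataFrame({
    "weekday_is_monday":   [1, 0, 0, 0],
    "weekday_is_tuesday":  [0, 1, 1, 0],
    "weekday_is_saturday": [0, 0, 0, 1],
})

# Summing each indicator column gives the number of articles per day.
counts = df.sum()
print(counts.idxmax())  # the busiest publication day in this toy sample
```

The same column-sum trick applied to the channel indicators yields the topic popularity ranking on the slide.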
17. Challenges
• Understanding the variables:
what the LDA topic numbers mean
sentiment
polarity
keywords
• Finding relations among attributes and identifying which attributes are
important for modelling.
• Discrepancies between numbers in the dataset and numbers on Mashable:
shares
videos
images
• Could not do boosting because the raw outcome (share count) is not binary
19. Recommendations
For Mashable
Publish during the week rather than on weekends
Publish about world, technology, and business; avoid social media articles
Publish articles closer to a single topic (minimize LDA impurity)
For Researchers
Always identify your attributes
Collect data ethically and accurately
For more accurate results, gather data on the number of likes and comments,
the number of tweets or hashtags, and the number of URL mentions, and work to
understand the source of shares
21. Conclusion
● R2 is very small regardless of the model
● Using all attributes is the best combination
● Removing attributes, changing number of trees, and changing
classifier does not improve R2 value
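The kind of check behind this conclusion can be sketched as follows. The data here is synthetic noise standing in for the article features and raw share counts (an assumption, not the paper's data); a very noisy target yields a small, often negative, test-set R² no matter how the forest is tuned.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the article features and share counts (assumed shapes).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = rng.normal(size=300)  # heavy-noise target, like raw share counts

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
score = r2_score(y_te, model.predict(X_te))
print(round(score, 3))  # small (often negative) R² on a noisy target
```

This is one reason the conference paper's binary popular/unpopular framing, evaluated with AUC, is more forgiving than regressing on raw share counts.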