Popularity of Online News Article

Sumit Saini

Online News Popularity Dataset
PRESENTED BY Sumit Kumar Saini, Shivali Advilkar, Chengdong Ben, Hebatalla Zaky, Manan Patel

Introduction
• Created to analyze how the number of shares depends on article attributes and
to predict whether an article will become popular online.
• 39,644 observations
• 61 attributes
• Source: Mashable website, collected over a two-year period (Jan 2013 to Jan 2015)
• No missing values, but some topics were unclassified
• Target: number of shares

Data Set Introduction
Data accuracy
Data Set Website
843,330 shares
12 videos
128 videos
792 shares
0 videos
12 videos

LDA
The Latent Dirichlet Allocation (LDA) algorithm was applied to all Mashable
texts (known before publication) to first identify the five most relevant
topics and then measure the closeness of each article to those topics.
• The topics were named LDA_00 through LDA_04 (the topics themselves are undefined)
• The LDA values add up to one per observation
• Maximum LDA impurity → overall low shares
• Mean shares: 1,660 vs 3,395 for all articles
• Median shares: 1,100 vs 1,400 for all articles
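The slides do not say how "impurity" was computed. As a minimal sketch, assuming a Gini-style impurity over the five LDA weights (an assumption, not the presenters' stated method), the two properties above can be checked like this. The example weights are hypothetical and not taken from the dataset.

```python
import math

# Hypothetical LDA weights (LDA_00 .. LDA_04) for two made-up articles.
focused = [0.92, 0.02, 0.02, 0.02, 0.02]  # mostly one topic
mixed = [0.20, 0.20, 0.20, 0.20, 0.20]    # spread evenly over all topics


def sums_to_one(weights, tol=1e-6):
    """Check that the topic weights of one observation add up to one."""
    return math.isclose(sum(weights), 1.0, abs_tol=tol)


def gini_impurity(weights):
    """Gini-style impurity (assumed measure): 0 for a single topic,
    largest when the weights are spread evenly."""
    return 1.0 - sum(w * w for w in weights)


for name, weights in (("focused", focused), ("mixed", mixed)):
    print(name, sums_to_one(weights), round(gini_impurity(weights), 3))
```

Under this assumed measure, an article spread evenly over all five topics has the maximum possible impurity, which is the group the slide associates with low shares.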

Data Modification
Recoding

Data channel:
0 Viral
1 Lifestyle
2 Entertainment
3 Business
4 Social Media
5 Technology
6 World

Date of publication:
1 Monday
2 Tuesday
3 Wednesday
4 Thursday
5 Friday
6 Saturday
7 Sunday
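A minimal sketch of how such a recoding could be done, assuming the raw data stores the channel and weekday as 0/1 indicator columns. The column names below are assumed from the public dataset and are not confirmed by the slides. Treating rows with no channel flag as code 0 is also an assumption.

```python
# Assumed indicator column names, listed in the order of the codes above.
CHANNEL_COLUMNS = [
    "data_channel_is_lifestyle",      # 1 Lifestyle
    "data_channel_is_entertainment",  # 2 Entertainment
    "data_channel_is_bus",            # 3 Business
    "data_channel_is_socmed",         # 4 Social Media
    "data_channel_is_tech",           # 5 Technology
    "data_channel_is_world",          # 6 World
]
WEEKDAY_COLUMNS = [
    "weekday_is_monday", "weekday_is_tuesday", "weekday_is_wednesday",
    "weekday_is_thursday", "weekday_is_friday", "weekday_is_saturday",
    "weekday_is_sunday",
]


def recode(row, columns, default=0):
    """Return the 1-based position of the first indicator set to 1.

    Falls back to `default` when no indicator is set (assumed here to be
    how unclassified articles end up with code 0).
    """
    for code, name in enumerate(columns, start=1):
        if row.get(name, 0) == 1:
            return code
    return default


# Hypothetical row: a technology article published on a Sunday.
row = {"data_channel_is_tech": 1, "weekday_is_sunday": 1}
print(recode(row, CHANNEL_COLUMNS), recode(row, WEEKDAY_COLUMNS))
```

Collapsing the indicators into one code per attribute gives the two summary columns that appear later in the attribute lists (summary_channel_value and summary_weekday).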

Conference Paper
• Shares: max 843,300; mean 3,395.380; standard deviation 11,626.951; median 1,400
• Popularity attribute: shares <= 1,400 unpopular; shares > 1,400 popular
• Splitting at the median avoided dealing with a class imbalance problem
• Made it into a binary problem: Popular or Unpopular
• AUC = 0.73
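The thresholding described above can be sketched as follows. The function names and the sample share counts are hypothetical. Only the 1,400 cutoff comes from the slide.

```python
THRESHOLD = 1400  # median number of shares reported on the slide


def label_popularity(shares, threshold=THRESHOLD):
    """Label an article 'popular' if shares exceed the threshold, else 'unpopular'."""
    return "popular" if shares > threshold else "unpopular"


def class_balance(share_counts, threshold=THRESHOLD):
    """Count how many articles fall into each class."""
    labels = [label_popularity(s, threshold) for s in share_counts]
    return {name: labels.count(name) for name in ("unpopular", "popular")}


# Hypothetical share counts, not a sample of the real data.
sample = [792, 1100, 1400, 1660, 3395, 843300]
print(class_balance(sample))
```

Because the cutoff sits at the median, roughly half of the articles land in each class, which is why the split sidesteps class imbalance.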

Models - Chosen Attributes
Subjective Opinion Random Forest Importance Highly Correlated (w/ shares)
• n_tokens_title
• n_tokens_content
• average_token_length
• summary_channel_value
• summary_weekday
• LDA_00
• LDA_01
• LDA_02
• LDA_03
• LDA_04
• global_subjectivity
• global_sentiment_polarity
• global_rate_positive_words
• global_rate_negative_words
• title_subjectivity
• title_sentiment_polarity
• LDA_03
• LDA_02
• kw_max_avg
• kw_avg_avg
• summary_channel_value
• self_reference_min_shares
• self_reference_avg_shares

Models - Chosen Attributes
Random Forest Importance
R2: -1.376
Highly Correlated (w/ shares)
R2: 0.01434
R2: 0.0148
Subjective Opinion
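For readers wondering how an R2 value can be negative, here is a minimal sketch of the standard coefficient of determination, 1 - SS_res / SS_tot. The slides do not say which tool computed the reported values, so this is an illustration of the definition rather than the presenters' code. The numbers are made up.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0]          # made-up target values
print(r_squared(actual, actual))           # perfect predictions
print(r_squared(actual, [2.0, 2.0, 2.0]))  # always predicting the mean
print(r_squared(actual, [3.0, 2.0, 1.0]))  # worse than predicting the mean
```

A value near zero means the model does barely better than always predicting the mean number of shares, and a negative value means it does worse.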

Data Insights
Publication Day:
Most articles published - Tuesday, Wednesday,
and Thursday.
Least articles published - Weekends.
Channel:
Most popular topic is Viral,
followed by Tech and Business.
Least popular topic is Social Media.
No. of keywords:
Generally between 5 and 10.

Challenges
• Understanding the variables: what each LDA topic number means, sentiment,
polarity, keywords
• Finding relations among attributes and which attributes are important for
modelling.
• Numbers in dataset vs. numbers on Mashable: shares, videos, images
• Can’t do boosting because we don’t have a binary outcome

Recommendations
For Mashable
• Publish during the week rather than on the weekend
• Publish about world, technology, and business, and avoid social media articles
• Publish articles closer to the topic (minimize impurity)
For Researchers
• Always identify your attributes
• Collect data ethically and accurately
• To get more accurate results, collect data on the number of likes and comments,
the number of tweets or hashtags, and the number of URL mentions, and
understand the source of shares

Conclusion
● R2 is very small regardless of the model
● Using all attributes is the best combination
● Removing attributes, changing the number of trees, and changing the
classifier do not improve the R2 value

THANK YOU!
PRESENTED BY Sumit Kumar Saini, Shivali Advilkar, Chengdong Ben, Hebatalla Zaky, Manan Patel

11. Model 1 • 1500 trees • All attributes
