Online News Popularity Dataset
PRESENTED BY Sumit Kumar Saini, Shivali Advilkar, Chengdong Ben, Hebatalla Zaky, Manan Patel
01
Introduction
• Created to analyze how the number of shares depends on an article's
attributes and to predict whether an article will become popular online.
• 39,644 observations
• 61 attributes
• Mashable website: collected over a two-year period, Jan 2013 to Jan 2015
• No missing values, but some topics were unclassified
• Target: number of shares
02
Data Set Introduction
Data accuracy
Data Set Website
843,330 shares
12 videos
128 videos
792 shares
0 videos
12 videos
Attributes
LDA
The Latent Dirichlet Allocation (LDA) algorithm was applied to all Mashable
texts (known before publication) to first identify the five most relevant
topics and then measure the closeness of each article to those topics.
• The topics were named LDA_00 … LDA_04 (undefined topics)
• The LDA values add up to one per observation
• Maximum LDA impurity → overall low shares
• Mean shares: 1,660 vs 3,395 overall
• Median shares: 1,100 vs 1,400 overall
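The slides do not define "LDA impurity". As one possible reading, the sketch below treats it as the Gini impurity of an article's five topic weights: close to 0 when an article sits almost entirely in one topic, and at its maximum (1 - 1/5 = 0.8) when the weights are spread evenly. The function name and the example weight vectors are hypothetical, not taken from the dataset.

```python
def gini_impurity(weights, tol=1e-6):
    """Gini impurity of a topic-weight vector: 1 - sum(w_i ** 2).

    Assumes the weights are LDA topic proportions that sum to one,
    as stated on the slide. This definition is an assumption; the
    presenters may have measured impurity differently.
    """
    total = sum(weights)
    if abs(total - 1.0) > tol:
        raise ValueError(f"weights should sum to 1, got {total}")
    return 1.0 - sum(w * w for w in weights)


# Made-up examples: one article focused on a single topic, one evenly mixed.
focused = [0.92, 0.02, 0.02, 0.02, 0.02]
mixed = [0.2, 0.2, 0.2, 0.2, 0.2]

print(gini_impurity(focused))  # low impurity
print(gini_impurity(mixed))    # maximum impurity for five topics
```

Under this reading, "minimize impurity" means favoring articles whose weight is concentrated in one LDA topic.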
03
Data Modification And Models
Data Modification
Recoding
Data channel:
0 Viral
1 Lifestyle
2 Entertainment
3 Business
4 Social Media
5 Technology
6 World
Date of publication:
1 Monday
2 Tuesday
3 Wednesday
4 Thursday
5 Friday
6 Saturday
7 Sunday
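A minimal sketch of how this recoding might be done, assuming the raw data stores channel and weekday as 0/1 indicator columns named `data_channel_is_*` and `weekday_is_*`. The column names, the helper functions, and the rule that code 0 ("Viral") is given to rows with no channel flag set are all assumptions for illustration; the slides do not show the actual procedure.

```python
# Assumed indicator-column names mapped to the codes listed on the slide.
CHANNEL_CODES = {
    "data_channel_is_lifestyle": 1,
    "data_channel_is_entertainment": 2,
    "data_channel_is_bus": 3,
    "data_channel_is_socmed": 4,
    "data_channel_is_tech": 5,
    "data_channel_is_world": 6,
}
WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def channel_code(row):
    """Return 1-6 for the flagged channel, or 0 when no flag is set (assumed 'Viral')."""
    for column, code in CHANNEL_CODES.items():
        if row.get(column, 0) == 1:
            return code
    return 0


def weekday_code(row):
    """Return 1 (Monday) through 7 (Sunday) for the flagged weekday."""
    for code, day in enumerate(WEEKDAYS, start=1):
        if row.get(f"weekday_is_{day}", 0) == 1:
            return code
    raise ValueError("row has no weekday flag set")


# Made-up row: a technology article published on a Wednesday.
row = {"data_channel_is_tech": 1, "weekday_is_wednesday": 1}
print(channel_code(row), weekday_code(row))
```

Collapsing the indicator columns this way yields one categorical value per article, which matches the `summary_channel_value` and `summary_weekday` attributes used later in the deck.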
Conference Paper
• Shares: max 843,300; mean 3,395.380; standard deviation 11,626.951; median 1,400
• Popularity label: shares ≤ 1,400 is unpopular; shares > 1,400 is popular
• Splitting at the median avoided a class imbalance problem
• Turned the task into a binary problem: popular or unpopular
AUC = 0.73
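The thresholding and the AUC metric can be illustrated with a short sketch. The share counts and model scores below are made up, and `label_popular` and `auc` are hypothetical helper names; this is not the conference paper's code, only the standard pairwise definition of AUC applied to toy data.

```python
def label_popular(shares, threshold=1400):
    """1 (popular) if shares exceed the threshold, else 0 (unpopular)."""
    return 1 if shares > threshold else 0


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen popular article gets a
    higher score than a randomly chosen unpopular one (ties count half).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up share counts and classifier scores, for illustration only.
shares = [593, 1400, 1401, 2900, 711, 8500]
labels = [label_popular(s) for s in shares]
scores = [0.2, 0.6, 0.4, 0.8, 0.3, 0.9]

print(labels)
print(auc(labels, scores))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which puts the reported 0.73 in context.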
Model 1
• 1500 trees
• All attributes
Models - Chosen Attributes
Subjective Opinion Random Forest Importance Highly Correlated (w/ shares)
• n_tokens_title
• n_tokens_content
• average_token_length
• summary_channel_value
• summary_weekday
• LDA_00
• LDA_01
• LDA_02
• LDA_03
• LDA_04
• global_subjectivity
• global_sentiment_polarity
• global_rate_positive_words
• global_rate_negative_words
• title_subjectivity
• title_sentiment_polarity
• LDA_03
• LDA_02
• kw_max_avg
• kw_avg_avg
• summary_channel_value
• self_reference_min_shares
• self_reference_avg_shares
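For the "Highly Correlated (w/ shares)" selection, one plausible approach is to rank attributes by the absolute Pearson correlation with the share count. The sketch below assumes that approach; the helper names and the toy values are hypothetical, and the slides do not say which correlation measure or cutoff was used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (sd_x * sd_y)


def rank_by_correlation(columns, target):
    """Sort attribute names by |correlation with target|, strongest first."""
    scored = {name: pearson(values, target) for name, values in columns.items()}
    return sorted(scored.items(), key=lambda item: abs(item[1]), reverse=True)


# Toy values, not real dataset rows.
columns = {
    "n_tokens_title": [9, 12, 9, 12],
    "kw_avg_avg": [1, 2, 3, 4],
}
shares = [10, 20, 30, 40]

ranked = rank_by_correlation(columns, shares)
print(ranked)
```

Keeping only the top few names from such a ranking would give an attribute subset comparable to the one listed on the slide.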
Models - Chosen Attributes
Random Forest Importance
R2: -1.376
Highly Correlated (w/ shares)
R2: 0.01434
R2: 0.0148
Subjective Opinion
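A negative R2 such as -1.376 may look like an error, but it follows from the usual definition, R2 = 1 - SS_res / SS_tot: any model whose squared error exceeds that of always predicting the mean scores below zero. The sketch below uses made-up values and a hypothetical `r_squared` helper to show this; it does not reproduce the presenters' models.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


# Made-up share counts with mean 2,000.
y_true = [1000, 1400, 2000, 3600]
mean_baseline = [2000, 2000, 2000, 2000]  # always predict the mean
poor_model = [4000, 500, 5000, 1000]      # worse than the mean baseline

print(r_squared(y_true, mean_baseline))  # exactly 0 for the mean baseline
print(r_squared(y_true, poor_model))     # negative
```

Values near zero, like 0.01434 and 0.0148, therefore mean the model explains almost none of the variance in shares.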
04
Data Insights
Publication Day:
Most articles published - Tuesday, Wednesday,
and Thursday.
Least articles published - Weekends.
Channel:
Most popular topic is Viral,
followed by Tech and Business.
Least popular topic is Social Media.
No. of keywords:
Generally between 5 and 10.
Challenges
• Understanding the variables:
  what an LDA topic number means
  sentiment
  polarity
  keywords
• Finding relations among attributes and deciding which attributes are
important for modelling.
• Numbers in the dataset vs. numbers on Mashable differ for:
  shares
  videos
  images
• Could not use a boosting classifier because the target (number of shares)
is not a binary outcome
Recommendations
For Mashable
Publish during the week rather than on the weekend
Publish about world, technology, and business, and avoid social media articles
Publish articles closer to a single topic (minimize impurity)
For Researchers
Always identify your attributes
Collect data ethically and accurately
To get more accurate results, collect data on the number of likes and comments,
tweets or hashtags, and URL mentions, and identify the source of shares
Conclusion
● R2 is very small regardless of the model
● Using all attributes is the best combination
● Removing attributes, changing the number of trees, and changing the
classifier do not improve the R2 value
THANK YOU!
PRESENTED BY Sumit Kumar Saini, Shivali Advilkar, Chengdong Ben, Hebatalla Zaky, Manan Patel


