ONLINE NEWS POPULARITY
Neha Tembe, Utkarsh Agrawal, Vighnesh Kulkarni
MS in Information Systems, Stevens Institute of Technology
Email: ntembe@stevens.edu, uagrawal@stevens.edu, vkulkar1@stevens.edu
Under the guidance of:
Prof. David Belanger
Abstract- With the growth of the Internet, an ever-increasing number of people enjoy reading and
sharing online news articles. The number of shares a news article receives indicates how popular
it is. In this project, we analyze a dataset to predict the popularity of online news using
machine learning techniques. Our data comes from Mashable, a well-known online news website.
We implemented three learning algorithms on the dataset: K-Nearest Neighbor, Classification and
Regression Trees, and Random Forest. Their performances are recorded and compared. Random Forest
turns out to be the best model for prediction, and it can achieve an accuracy of 70% with optimal
parameters. Our work can help online news companies predict news popularity before publication.
INTRODUCTION
In this information era, reading and sharing news have become central to people’s
entertainment. It would therefore be very helpful to social media workers (authors, advertisers,
etc.) if we could accurately predict the popularity of news prior to its publication.
For this paper, we use a dataset that summarizes a heterogeneous set of features about
articles published by Mashable over a period of two years.
The goals are to:
● Predict the popularity of online news.
● Classify online news as popular or not popular.
● Analyse the data using the K-Nearest Neighbor, Classification and Regression Trees, and
Random Forest algorithms.
● Compare the three algorithms and identify the most suitable model for prediction.
Data and Data Preparation
Our dataset, “Online News Popularity”, was originally acquired and preprocessed by K. Fernandes
and is provided by the UCI Machine Learning Repository. It consists of 61 attributes of type
Integer and Real, with 39797 instances to date.
Attribute Information:
0. url: URL of the article (non-predictive)
1. timedelta: Days between the article publication and the dataset acquisition (non-predictive)
2. n_tokens_title: Number of words in the title
3. n_tokens_content: Number of words in the content
4. n_unique_tokens: Rate of unique words in the content
5. n_non_stop_words: Rate of non-stop words in the content
6. n_non_stop_unique_tokens: Rate of unique non-stop words in the content
7. num_hrefs: Number of links
8. num_self_hrefs: Number of links to other articles published by Mashable
9. num_imgs: Number of images
10. num_videos: Number of videos
11. average_token_length: Average length of the words in the content
12. num_keywords: Number of keywords in the metadata
13. data_channel_is_lifestyle: Is data channel 'Lifestyle'?
14. data_channel_is_entertainment: Is data channel 'Entertainment'?
15. data_channel_is_bus: Is data channel 'Business'?
16. data_channel_is_socmed: Is data channel 'Social Media'?
17. data_channel_is_tech: Is data channel 'Tech'?
18. data_channel_is_world: Is data channel 'World'?
19. kw_min_min: Worst keyword (min. shares)
20. kw_max_min: Worst keyword (max. shares)
21. kw_avg_min: Worst keyword (avg. shares)
22. kw_min_max: Best keyword (min. shares)
23. kw_max_max: Best keyword (max. shares)
24. kw_avg_max: Best keyword (avg. shares)
25. kw_min_avg: Avg. keyword (min. shares)
26. kw_max_avg: Avg. keyword (max. shares)
27. kw_avg_avg: Avg. keyword (avg. shares)
28. self_reference_min_shares: Min. shares of referenced articles in Mashable
29. self_reference_max_shares: Max. shares of referenced articles in Mashable
30. self_reference_avg_sharess: Avg. shares of referenced articles in Mashable
31. weekday_is_monday: Was the article published on a Monday?
32. weekday_is_tuesday: Was the article published on a Tuesday?
33. weekday_is_wednesday: Was the article published on a Wednesday?
34. weekday_is_thursday: Was the article published on a Thursday?
35. weekday_is_friday: Was the article published on a Friday?
36. weekday_is_saturday: Was the article published on a Saturday?
37. weekday_is_sunday: Was the article published on a Sunday?
38. is_weekend: Was the article published on the weekend?
39. LDA_00: Closeness to LDA topic 0
40. LDA_01: Closeness to LDA topic 1
41. LDA_02: Closeness to LDA topic 2
42. LDA_03: Closeness to LDA topic 3
43. LDA_04: Closeness to LDA topic 4
44. global_subjectivity: Text subjectivity
45. global_sentiment_polarity: Text sentiment polarity
46. global_rate_positive_words: Rate of positive words in the content
47. global_rate_negative_words: Rate of negative words in the content
48. rate_positive_words: Rate of positive words among non-neutral tokens
49. rate_negative_words: Rate of negative words among non-neutral tokens
50. avg_positive_polarity: Avg. polarity of positive words
51. min_positive_polarity: Min. polarity of positive words
52. max_positive_polarity: Max. polarity of positive words
53. avg_negative_polarity: Avg. polarity of negative words
54. min_negative_polarity: Min. polarity of negative words
55. max_negative_polarity: Max. polarity of negative words
56. title_subjectivity: Title subjectivity
57. title_sentiment_polarity: Title polarity
58. abs_title_subjectivity: Absolute subjectivity level
59. abs_title_sentiment_polarity: Absolute polarity level
60. shares: Number of shares (target)
First we read and viewed the data in R as follows:
#Read the dataset
news <- read.csv("C:/Users/neha tembe/Desktop/MIS/SEMESTER 2/Multivariate/OnlineNewsPopularity.csv") # read the popularity data set
#View the dataset
View(news)
OUTPUT:
Total Number of Attributes - 61
Number of Predictive Attributes - 58
Number of Non Predictive Attributes - 2
Goal Field - 1
DATA PREPROCESSING
We removed the first two columns of the dataset, ‘url’ and ‘timedelta’, because they are
non-predictive. We then standardized every predictor by generating z-scores with the scale
function, leaving the target column ‘shares’ unchanged.
# Delete url and timedelta columns
newsreg <- subset(news, select = -c(url, timedelta))
# Standardize data: generate z-scores for every column except the last one (shares)
for (i in 1:(ncol(newsreg) - 1)) { newsreg[, i] <- scale(newsreg[, i], center = TRUE, scale = TRUE) }
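As a minimal sketch of what the scale function computes, the z-score of a value is its distance from the column mean divided by the column's sample standard deviation. The vector below is made up for illustration and is not taken from the dataset.

```r
# Toy column; the values are hypothetical.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# z-score by hand: subtract the mean, divide by the sample standard deviation.
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; as.numeric() drops the extra attributes.
z_scale <- as.numeric(scale(x, center = TRUE, scale = TRUE))

# Both routes should agree, and the result has mean 0 and standard deviation 1.
print(all.equal(z_manual, z_scale))
print(c(mean = mean(z_scale), sd = sd(z_scale)))
```

After standardization every predictor is on the same scale, which matters for distance-based methods such as KNN.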
We calculated the median of the ‘shares’ column, which is 1400, and labeled the articles with
shares > 1400 as popular.
# Dataset for classification
newscla <- newsreg
newscla$shares <- as.factor(ifelse(newscla$shares > 1400, 1, 0))
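The labeling rule can be sketched on a handful of hypothetical share counts, chosen here so that their median is also 1400. Articles exactly at the median fall into class 0 because the comparison is strict, which is one reason the two classes need not be perfectly balanced.

```r
# Hypothetical share counts (not from the dataset); the median is 1400.
shares <- c(593, 711, 1100, 1400, 1500, 2300, 8400)

threshold <- median(shares)

# 1 = popular (strictly above the median), 0 = not popular.
label <- as.factor(ifelse(shares > threshold, 1, 0))

print(threshold)
print(table(label))
```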
Finally, we set a random seed and split the data into a training set (70%) and a test set (30%).
Evaluating on held-out test data guards against overfitting, a modeling error which occurs when
a function is too closely fit to a limited set of data points.
# Set random seed
set.seed(100)
# Training data and prediction data (1 = training, 2 = test)
ind <- sample(2, nrow(newscla), replace = TRUE, prob = c(0.7, 0.3))
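The sketch below shows how such an indicator vector splits a data frame. The data frame toy is synthetic and stands in for newscla. Because each row is assigned independently with probability 0.7, the split is only approximately 70/30.

```r
set.seed(100)

# Synthetic stand-in for the real data (hypothetical columns).
toy <- data.frame(x = rnorm(1000), y = rbinom(1000, 1, 0.5))

# Draw 1 (training) or 2 (test) for every row.
ind <- sample(2, nrow(toy), replace = TRUE, prob = c(0.7, 0.3))

train <- toy[ind == 1, ]
test  <- toy[ind == 2, ]

# Share of rows that ended up in the training set.
print(nrow(train) / nrow(toy))
```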
ANALYSIS
1. PRINCIPAL COMPONENT ANALYSIS
PCA is a dimensionality reduction algorithm that gives a lower-dimensional approximation of
the original dataset while preserving as much variability as possible. We first created a
data frame in R and then performed the principal component analysis using both varimax
(orthogonal) and promax (oblique) rotation.
#Creating the dataframe using R
all_data<-news[,c(2:61)]
data_frame <- data.frame(all_data)
#Performing principal component analysis with varimax rotation
install.packages("psych")
library(psych)
pca_varimax <- principal(data_frame, nfactors=4, rotate="varimax")
pca_varimax
                       RC1  RC2  RC3  RC4
SS loadings           4.49 4.35 3.79 3.00
Proportion Var        0.07 0.07 0.06 0.05
Cumulative Var        0.07 0.15 0.21 0.26
Proportion Explained  0.29 0.28 0.24 0.19
Cumulative Proportion 0.29 0.57 0.81 1.00
Mean item complexity = 1.5
Test of the hypothesis that 4 components are sufficient.
The root mean square of the residuals (RMSR) is 0.09
with the empirical chi square 1187905 with prob < 0
Fit based upon off diagonal values = 0.57
#Performing principal component analysis with oblique rotation
pca_oblique <- principal(data_frame, nfactors=4, rotate="promax")
pca_oblique
However, PCA did not improve our models, because the original feature set is well designed
and there is little correlated information between features.
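The variance rows of the varimax output can be re-derived from the SS loadings, which makes the limited benefit of PCA visible: four components capture only about a quarter of the total variance. The sketch below assumes 60 input variables (columns 2 to 61 were passed to principal) and uses the SS loadings as printed, so the results are only as precise as those two-decimal values.

```r
# SS loadings copied from the varimax output (rounded to two decimals there).
ss_loadings <- c(RC1 = 4.49, RC2 = 4.35, RC3 = 3.79, RC4 = 3.00)

# Assumption: 60 standardized variables, so the total variance is 60.
n_vars <- 60

prop_var       <- ss_loadings / n_vars            # share of total variance
cum_var        <- cumsum(prop_var)                # running total
prop_explained <- ss_loadings / sum(ss_loadings)  # share among the 4 components

print(round(rbind(prop_var, cum_var, prop_explained), 2))
```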
2. K-NEAREST NEIGHBOR ALGORITHM
K-Nearest Neighbor (KNN) is one of the essential classification algorithms in machine learning.
A case is classified by a majority vote of its K nearest neighbors: it is assigned the class
most common among them. We applied the KNN algorithm to the dataset after deleting the ‘url’
and ‘timedelta’ columns and standardizing the data. We obtained the confusion matrix and ROC
curve, with a resulting accuracy of 56%.
#KNN
library(caret) # provides knn3 and confusionMatrix
# Fit on the training rows, then predict classes and class probabilities on the test rows
newscla.knn <- knn3(shares ~ ., newscla[ind==1,])
newscla.knn.pred <- predict(newscla.knn, newscla[ind==2,], type="class")
newscla.knn.prob <- predict(newscla.knn, newscla[ind==2,], type="prob")
# Confusion matrix
confusionMatrix(newscla.knn.pred, newscla[ind==2,]$shares)
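To make the majority-vote rule concrete, here is a minimal base R sketch of a KNN classifier for a single query point. It is an illustration only, not the knn3 implementation; the helper knn_predict and the toy points are hypothetical.

```r
# Classify one query point by majority vote among its k nearest training points.
knn_predict <- function(train_x, train_y, query, k = 3) {
  # Euclidean distance from the query to every training row.
  d <- sqrt(rowSums(sweep(train_x, 2, query)^2))
  # Indices of the k closest rows.
  nearest <- order(d)[seq_len(k)]
  # Count the class labels among the neighbors and return the most common one.
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Two well-separated toy clusters (hypothetical data).
train_x <- rbind(c(0, 0), c(0, 1), c(1, 0), c(5, 5), c(5, 6), c(6, 5))
train_y <- factor(c(0, 0, 0, 1, 1, 1))

pred_a <- knn_predict(train_x, train_y, query = c(0.5, 0.5), k = 3)
pred_b <- knn_predict(train_x, train_y, query = c(5.5, 5.5), k = 3)
print(c(pred_a, pred_b))
```

Because the vote depends on distances, features with large ranges would dominate unless the data is standardized first.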
3. CLASSIFICATION AND REGRESSION TREES
Classification and regression trees are used for predicting continuous dependent variables
(regression) and categorical dependent variables (classification). The models are obtained by
recursively partitioning the data space and fitting a simple prediction model within each
partition. As a result, the partitioning can be represented graphically as a decision tree. We
plotted the classification tree for our data and then obtained the confusion matrix and ROC
curve, with a resulting accuracy of 61%.
#CART (Classification and Regression Trees)
library(rpart)  # provides rpart
library(rattle) # provides fancyRpartPlot
newscla.cart <- rpart(shares ~ ., newscla[ind==1,], method='class')
# Plot tree
fancyRpartPlot(newscla.cart)
# Predict classes on the test rows
newscla.cart.pred <- predict(newscla.cart, newscla[ind==2,], type="class")
# Confusion matrix
confusionMatrix(newscla.cart.pred, newscla[ind==2,]$shares)
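Recursive partitioning chooses, at each node, the split that makes the child nodes as pure as possible. The base R sketch below scores candidate thresholds on one feature by weighted Gini impurity, a common purity criterion for classification trees. The helper names and the toy data are hypothetical, and the sketch is not the rpart implementation.

```r
# Gini impurity of a set of class labels: 0 means the node is pure.
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Weighted Gini impurity of the two children produced by splitting x at a threshold.
split_impurity <- function(x, y, threshold) {
  left  <- y[x <  threshold]
  right <- y[x >= threshold]
  (length(left) * gini(left) + length(right) * gini(right)) / length(y)
}

# Toy feature and labels (hypothetical).
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(0, 0, 0, 0, 1, 0, 1, 1)

candidates <- c(2.5, 4.5, 6.5)
impurity <- sapply(candidates, function(t) split_impurity(x, y, t))

# The best split is the candidate with the lowest weighted impurity.
best <- candidates[which.min(impurity)]
print(data.frame(threshold = candidates, impurity = impurity))
print(best)
```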
4. RANDOM FOREST ALGORITHM
Random Forest uses multiple decision trees, each built on a separate sample of examples drawn
from the dataset. Each tree can use a subset of all the available features.
By using more decision trees and averaging their results, the variance of the model can be
greatly lowered. For Random Forest, there are two main parameters to consider: the number of
trees and the number of features considered at each decision point.
We used a small node size in order to improve accuracy. In theory, accuracy increases as more
trees contribute to the decision, although the gain levels off once the forest is large. We
obtained the confusion matrix and ROC curve, with a resulting accuracy of 66%.
Plot Feature Importance
Here we plot importance based on two measures:
● Mean decrease in accuracy: the global variable importance is the mean decrease in accuracy
over all out-of-bag cross-validated predictions when a given variable is permuted after
training but before prediction.
● Mean decrease in Gini: a measure of how much each variable contributes to the homogeneity
of the nodes and leaves in the resulting random forest.
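The parameters and importance measures described above can be sketched with the randomForest package. This is an illustration under stated assumptions, not the exact call used for the reported results: it assumes the package is installed, the data frame toy is synthetic, and the parameter values are arbitrary.

```r
# Assumes the randomForest package is installed.
library(randomForest)

set.seed(100)

# Synthetic data: 'signal' drives the label, 'noise' is unrelated to it.
n <- 300
toy <- data.frame(signal = rnorm(n), noise = rnorm(n))
toy$popular <- factor(ifelse(toy$signal + rnorm(n, sd = 0.5) > 0, 1, 0))

rf <- randomForest(popular ~ ., data = toy,
                   ntree = 200,       # number of trees
                   mtry = 1,          # features tried at each decision point
                   nodesize = 5,      # minimum size of terminal nodes
                   importance = TRUE) # also compute permutation importance

# One row per feature; the last two columns are the measures described above.
imp <- importance(rf)
print(imp[, c("MeanDecreaseAccuracy", "MeanDecreaseGini")])

# varImpPlot(rf) draws both measures side by side.
```

On this toy data the informative feature should rank well above the noise feature on both measures.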
# Confusion matrix (newscla.rf.pred holds the Random Forest predictions for the test rows)
confusionMatrix(newscla.rf.pred, newscla[ind==2,]$shares)
Confusion Matrix and Statistics
          Reference
Prediction    0    1
         0 3952 1922
         1 2059 3817
Accuracy : 0.6612
95% CI : (0.6526, 0.6698)
No Information Rate : 0.5116
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.3224
Mcnemar's Test P-Value : 0.03112
Sensitivity : 0.6575
Specificity : 0.6651
Pos Pred Value : 0.6728
Neg Pred Value : 0.6496
Prevalence : 0.5116
Detection Rate : 0.3363
Detection Prevalence : 0.4999
Balanced Accuracy : 0.6613
'Positive' Class : 0
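The headline statistics can be re-derived from the four cell counts in base R, which shows how they are defined. Note that the positive class in this output is 0 (not popular), so sensitivity refers to class 0. The variable names are ours.

```r
# Cell counts copied from the output above (rows = prediction, columns = reference).
cm <- matrix(c(3952, 2059, 1922, 3817), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

n <- sum(cm)

# Accuracy: share of predictions on the diagonal.
accuracy <- sum(diag(cm)) / n

# Sensitivity and specificity, with class 0 as the positive class.
sensitivity <- cm["0", "0"] / sum(cm[, "0"])
specificity <- cm["1", "1"] / sum(cm[, "1"])

# Cohen's kappa: agreement beyond what the row and column totals give by chance.
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, kappa = cohen_kappa), 4))
```

The rounded values agree with the Accuracy, Sensitivity, Specificity, and Kappa lines reported above.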
RESULTS
Comparing the ROC curves of the three methods, Random Forest gives the highest accuracy, 66%,
with an area under the curve (AUC) of 0.72. On the traditional academic point scale for AUC
(0.9 to 1 excellent, 0.8 to 0.9 good, 0.7 to 0.8 fair), this value falls in the ‘fair’ band.
Hence we can state that the model is usable, though not excellent.
ROC curves: KNN, AUC 0.592; CART, AUC 0.638; Random Forest, AUC 0.72.
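The AUC can be read as the probability that a randomly chosen popular article receives a higher score than a randomly chosen unpopular one. A minimal base R sketch of this rank-based calculation follows; the helper auc_rank and the four scores are hypothetical, and this is not necessarily how the values above were computed.

```r
# AUC from ranks (equivalent to the Mann-Whitney U statistic, scaled to [0, 1]).
auc_rank <- function(score, label) {
  r <- rank(score)              # average ranks are used for tied scores
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy predicted probabilities of class 1 and the true labels (hypothetical).
score <- c(0.10, 0.40, 0.35, 0.80)
label <- c(0, 0, 1, 1)

auc <- auc_rank(score, label)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly: 0.75
```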
PERFORMANCE MEASURES
1. Confusion Matrix: It is used to assess the correctness and accuracy of the model.
Ideally, the model should give 0 false positives and 0 false negatives, but in practice
almost no model is 100% accurate.
2. Accuracy: In classification problems, accuracy is the number of correct predictions made
by the model divided by the total number of predictions made.
3. Precision: It is the proportion of the articles that we classified as popular that
actually were popular.
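As a worked example of precision as defined here, the sketch below uses the Random Forest confusion matrix counts and treats class 1 (popular) as the class of interest. Recall is included for comparison. The variable names are ours.

```r
# Counts from the Random Forest confusion matrix, with 'popular' (class 1) as the target.
tp <- 3817  # predicted popular, actually popular
fp <- 2059  # predicted popular, actually not popular
fn <- 1922  # predicted not popular, actually popular

# Precision: of the articles we called popular, how many really were.
precision <- tp / (tp + fp)

# Recall: of the articles that really were popular, how many we found.
recall <- tp / (tp + fn)

print(round(c(precision = precision, recall = recall), 4))
```

Because the reported output uses class 0 as the positive class, this precision for the popular class matches the Neg Pred Value line (0.6496) in that output.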
CONCLUSION
Random Forest gives the best result for this classification problem. Its performance depends
on the number of decision trees, the number of features considered at each decision point, and
the number of training examples, so these settings should be tuned systematically.
In the future, we could directly treat all the words in an article as additional features, and then
apply machine learning algorithms like Naive Bayes and SVM. In this way, what the article
really talks about is taken into account, and this approach should improve the accuracy of
prediction if combined with our current work.
References:
● https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity
● "Predicting the Popularity of Social News Posts." 2013 cs229 projects. Joe Maguire Scott
Michelson.
● Hensinger, Elena, Ilias Flaounas, and Nello Cristianini. "Modelling and predicting news
popularity." Pattern Analysis and Applications 16.4 (2013): 623-635.

 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
 
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 

Multivariate Data Analysis Project Report

  • 1. ONLINE NEWS POPULARITY
Neha Tembe, Utkarsh Agrawal, Vighnesh Kulkarni
MS in Information Systems, Stevens Institute of Technology
Email: ntembe@stevens.edu, uagrawal@stevens.edu, vkulkar1@stevens.edu
Under the guidance of: Prof. David Belanger
Abstract- With the growth of the Internet, an ever-increasing number of people enjoy reading and sharing online news articles. The number of shares under a news article shows how popular the news is. In this project, we analyze the dataset to predict the popularity of online news using machine learning techniques. Our data comes from Mashable, a well-known online news site. We implemented three learning algorithms on the dataset: K-Nearest Neighbor, Classification and Regression Trees, and Random Forest. Their performance is recorded and compared. Random Forest turns out to be the best model for prediction, and it can achieve an accuracy of 70% with optimal parameters. Our work can help online news organizations anticipate news popularity before publication.
INTRODUCTION
In this information era, reading and sharing news have become central to people’s entertainment lives. It would therefore be very helpful to social media workers (authors, advertisers, etc.) if we could accurately predict the popularity of news before it is published. For this paper, we use a dataset that summarizes a heterogeneous set of features about articles published by Mashable over a period of two years.
The goal is to-
● Predict the popularity of online news.
● Classify online news into popular or not popular category.
● Analyse the data using the K-Nearest Neighbor Algorithm, Classification and Regression Trees and the Random Forest Algorithm.
● Compare the three algorithms and select the most suitable model for prediction.
  • 2. Data and Data Preparation
Our dataset “Online News Popularity” was originally acquired and preprocessed by K. Fernandes and is provided by the UCI Machine Learning Repository. It consists of 61 attributes of type Integer and Real, with 39797 instances to date.
Attribute Information:
0. url: URL of the article (non-predictive)
1. timedelta: Days between the article publication and the dataset acquisition (non-predictive)
2. n_tokens_title: Number of words in the title
3. n_tokens_content: Number of words in the content
4. n_unique_tokens: Rate of unique words in the content
5. n_non_stop_words: Rate of non-stop words in the content
6. n_non_stop_unique_tokens: Rate of unique non-stop words in the content
7. num_hrefs: Number of links
8. num_self_hrefs: Number of links to other articles published by Mashable
9. num_imgs: Number of images
10. num_videos: Number of videos
11. average_token_length: Average length of the words in the content
12. num_keywords: Number of keywords in the metadata
13. data_channel_is_lifestyle: Is data channel 'Lifestyle'?
14. data_channel_is_entertainment: Is data channel 'Entertainment'?
15. data_channel_is_bus: Is data channel 'Business'?
16. data_channel_is_socmed: Is data channel 'Social Media'?
17. data_channel_is_tech: Is data channel 'Tech'?
18. data_channel_is_world: Is data channel 'World'?
19. kw_min_min: Worst keyword (min. shares)
20. kw_max_min: Worst keyword (max. shares)
21. kw_avg_min: Worst keyword (avg. shares)
22. kw_min_max: Best keyword (min. shares)
23. kw_max_max: Best keyword (max. shares)
24. kw_avg_max: Best keyword (avg. shares)
25. kw_min_avg: Avg. keyword (min. shares)
26. kw_max_avg: Avg. keyword (max. shares)
27. kw_avg_avg: Avg. keyword (avg. shares)
28. self_reference_min_shares: Min. shares of referenced articles in Mashable
29. self_reference_max_shares: Max. shares of referenced articles in Mashable
  • 3. Attribute Information (continued):
30. self_reference_avg_sharess: Avg. shares of referenced articles in Mashable
31. weekday_is_monday: Was the article published on a Monday?
32. weekday_is_tuesday: Was the article published on a Tuesday?
33. weekday_is_wednesday: Was the article published on a Wednesday?
34. weekday_is_thursday: Was the article published on a Thursday?
35. weekday_is_friday: Was the article published on a Friday?
36. weekday_is_saturday: Was the article published on a Saturday?
37. weekday_is_sunday: Was the article published on a Sunday?
38. is_weekend: Was the article published on the weekend?
39. LDA_00: Closeness to LDA topic 0
40. LDA_01: Closeness to LDA topic 1
41. LDA_02: Closeness to LDA topic 2
42. LDA_03: Closeness to LDA topic 3
43. LDA_04: Closeness to LDA topic 4
44. global_subjectivity: Text subjectivity
45. global_sentiment_polarity: Text sentiment polarity
46. global_rate_positive_words: Rate of positive words in the content
47. global_rate_negative_words: Rate of negative words in the content
48. rate_positive_words: Rate of positive words among non-neutral tokens
49. rate_negative_words: Rate of negative words among non-neutral tokens
50. avg_positive_polarity: Avg. polarity of positive words
51. min_positive_polarity: Min. polarity of positive words
52. max_positive_polarity: Max. polarity of positive words
53. avg_negative_polarity: Avg. polarity of negative words
54. min_negative_polarity: Min. polarity of negative words
55. max_negative_polarity: Max. polarity of negative words
56. title_subjectivity: Title subjectivity
57. title_sentiment_polarity: Title polarity
58. abs_title_subjectivity: Absolute subjectivity level
59. abs_title_sentiment_polarity: Absolute polarity level
60. shares: Number of shares (target)
  • 4. First we read and viewed the data in R as follows:
    # Read the dataset
    news <- read.csv("C:/Users/neha tembe/Desktop/MIS/SEMESTER 2/Multivariate/OnlineNewsPopularity.csv")  # read the popularity data set
    # View the dataset
    View(news)
OUTPUT:
Total Number of Attributes - 61
Number of Predictive Attributes - 58
Number of Non Predictive Attributes - 2
Goal Field - 1
  • 5. DATA PREPROCESSING
We removed the first two columns of the dataset, ‘url’ and ‘timedelta’, because they are non-predictive and irrelevant to our analysis. We then standardized the predictors by generating z-scores with the scale function.
    # Delete url and timedelta columns
    newsreg <- subset(news, select = -c(url, timedelta))
    # Standardize data: generate z-scores for every predictor (all columns except shares)
    for (i in 1:(ncol(newsreg) - 1)) {
      newsreg[, i] <- scale(newsreg[, i], center = TRUE, scale = TRUE)
    }
The median of the ‘shares’ column is 1400, so we labeled the articles with shares > 1400 as popular.
    # Dataset for classification
    newscla <- newsreg
    newscla$shares <- as.factor(ifelse(newscla$shares > 1400, 1, 0))
Finally, we set a random seed and split the data into training data (70%) and prediction data (30%). Evaluating on held-out data guards against overfitting, a modeling error that occurs when a function is fit too closely to a limited set of data points.
    # Set random seed
    set.seed(100)
    # Training data and prediction data
    ind <- sample(2, nrow(newscla), replace = TRUE, prob = c(0.7, 0.3))
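The preprocessing steps above can be tried end to end without the original CSV. The following is a minimal sketch on synthetic data, not the authors' script: the data frame `toy`, its columns `x1` and `x2`, and the generated share counts are all hypothetical stand-ins for the Mashable data.

```r
# Minimal sketch on synthetic data; `toy`, `x1`, `x2` and the values are hypothetical.
set.seed(100)
toy <- data.frame(
  x1 = rnorm(200, mean = 50, sd = 10),
  x2 = runif(200),
  shares = round(rexp(200, rate = 1 / 1400))
)

# Standardize every predictor column (all but the last) to z-scores.
pred_cols <- seq_len(ncol(toy) - 1)
toy[pred_cols] <- lapply(toy[pred_cols], function(col) as.numeric(scale(col)))

# Binarize the target at its median: 1 = popular, 0 = not popular.
cutoff <- median(toy$shares)
toy$shares <- factor(ifelse(toy$shares > cutoff, 1, 0))

# Assign each row to group 1 (training, about 70%) or group 2 (test, about 30%).
ind <- sample(2, nrow(toy), replace = TRUE, prob = c(0.7, 0.3))
train <- toy[ind == 1, ]
test <- toy[ind == 2, ]

# After scaling, each predictor has mean 0 and standard deviation 1.
z_means <- colMeans(toy[pred_cols])
z_sds <- sapply(toy[pred_cols], sd)
```

Because `sample` draws the group of each row independently, the split is only approximately 70/30; the exact sizes depend on the seed.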
  • 6. ANALYSIS
1. PRINCIPAL COMPONENT ANALYSIS
PCA is a dimensionality reduction algorithm that gives a lower-dimensional approximation of the original dataset while preserving as much variability as possible. We first created a data frame in R and performed the principal component analysis using both varimax and oblique rotation.
    # Creating the data frame using R
    all_data <- news[, c(2:61)]
    data_frame <- data.frame(all_data)
    # Performing principal component analysis with varimax rotation
    install.packages("psych")
    library(psych)
    pca_varimax <- principal(data_frame, nfactors = 4, rotate = "varimax")
    pca_varimax

                           RC1  RC2  RC3  RC4
    SS loadings           4.49 4.35 3.79 3.00
    Proportion Var        0.07 0.07 0.06 0.05
    Cumulative Var        0.07 0.15 0.21 0.26
    Proportion Explained  0.29 0.28 0.24 0.19
    Cumulative Proportion 0.29 0.57 0.81 1.00

    Mean item complexity = 1.5
    Test of the hypothesis that 4 components are sufficient.
    The root mean square of the residuals (RMSR) is 0.09
    with the empirical chi square 1187905 with prob < 0
    Fit based upon off diagonal values = 0.57
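To make the "Proportion Var" and "Cumulative Var" rows concrete, here is a minimal sketch that computes the same two quantities with base R's `prcomp`. This is a different, unrotated PCA routine from the `psych::principal` call used in the report, and the synthetic data frame `toy` with columns `a` to `d` is hypothetical.

```r
# Minimal sketch: unrotated PCA with base R on synthetic data (names are hypothetical).
set.seed(1)
n <- 300
latent <- rnorm(n)
toy <- data.frame(
  a = latent + rnorm(n, sd = 0.3),  # a and b share one latent factor,
  b = latent + rnorm(n, sd = 0.3),  # so they are strongly correlated
  c = rnorm(n),                     # c and d are independent noise
  d = rnorm(n)
)

# Center and scale each column, then extract principal components.
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Each component's variance is its squared standard deviation.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

print(round(rbind(Proportion = var_explained, Cumulative = cum_var), 2))
```

In the toy data the correlated pair lets the first component absorb a large share of the variance. In the report's output, by contrast, four components reach a cumulative variance of only 0.26 across the 60 input columns.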
  • 7.
    # Performing principal component analysis with oblique rotation
    pca_oblique <- principal(data_frame, nfactors = 4, rotate = "promax")
    pca_oblique
However, PCA did not improve our models, because the original feature set is well designed and there is little correlated information between features.
2. K-NEAREST NEIGHBOR ALGORITHM
K-Nearest Neighbor (KNN) is one of the essential classification algorithms in Machine Learning. A case is classified by a majority vote of its K nearest neighbors, that is, it is given the class most common among those neighbors. We applied the KNN algorithm to the dataset after deleting the ‘url’ and ‘timedelta’ columns and standardizing the data. We obtained the confusion matrix and ROC curve, which gave 56% accuracy.
    # KNN
    library(caret)  # provides knn3 and confusionMatrix
    newscla.knn <- knn3(shares ~ ., newscla[ind == 1, ])
    newscla.knn.pred <- predict(newscla.knn, newscla[ind == 2, ], type = "class")
    newscla.knn.prob <- predict(newscla.knn, newscla[ind == 2, ], type = "prob")
    # Confusion matrix
    confusionMatrix(newscla.knn.pred, newscla[ind == 2, ]$shares)
OUTPUT: [confusion matrix figure]
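The majority-vote rule described above can be written out in a few lines of base R. This is an illustrative sketch, not the `knn3` implementation: the function `knn_predict`, the toy features `f1` and `f2`, and the choice of Euclidean distance are assumptions made for the example.

```r
# Classify one query point by majority vote among its k nearest training points.
# `knn_predict` is a hypothetical helper written for this sketch.
knn_predict <- function(train_x, train_y, query, k = 5) {
  # Euclidean distance from the query to every training row.
  diffs <- sweep(as.matrix(train_x), 2, query)
  dists <- sqrt(rowSums(diffs^2))
  # Indices of the k closest training rows.
  nearest <- order(dists)[seq_len(k)]
  # Count class labels among the neighbors and return the most common one.
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Two small, well-separated clusters with labels 0 and 1.
train_x <- data.frame(
  f1 = c(0.0, 0.1, 0.2, 1.0, 1.1, 0.9),
  f2 = c(0.0, 0.2, 0.1, 1.0, 0.9, 1.1)
)
train_y <- factor(c(0, 0, 0, 1, 1, 1))

pred_low <- knn_predict(train_x, train_y, query = c(0.1, 0.1), k = 3)
pred_high <- knn_predict(train_x, train_y, query = c(1.0, 1.0), k = 3)
```

Because the vote depends on distances, the z-score standardization from the preprocessing step matters here: without it, features with large numeric ranges would dominate the neighbor search.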
  • 8. 3. CLASSIFICATION AND REGRESSION TREES
Classification and regression trees are used for predicting continuous dependent variables (regression) and categorical dependent variables (classification). The models are obtained by recursively partitioning the data space and fitting a simple prediction model within each partition. As a result, the partitioning can be represented graphically as a decision tree. We plotted the classification tree for our data and then obtained its confusion matrix and ROC curve, which gave 61% accuracy.
    # CART (Classification and Regression Trees)
    library(rpart)   # provides rpart
    library(rattle)  # provides fancyRpartPlot
    library(caret)   # provides confusionMatrix
    newscla.cart <- rpart(shares ~ ., newscla[ind == 1, ], method = 'class')
    # Plot tree
    fancyRpartPlot(newscla.cart)  # the most beautiful one
    # Predict classes for the test rows
    newscla.cart.pred <- predict(newscla.cart, newscla[ind == 2, ], type = "class")
    # Confusion matrix
    confusionMatrix(newscla.cart.pred, newscla[ind == 2, ]$shares)
OUTPUT: Confusion Matrix and Statistics [figure]
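One step of the recursive partitioning described above is choosing where to split a feature. The sketch below scores candidate thresholds by weighted Gini impurity, a common splitting criterion for classification trees. It is a simplified illustration and not the `rpart` internals; the helpers `gini` and `split_gini` and the toy vectors are hypothetical.

```r
# Gini impurity of a set of class labels: 1 - sum of squared class proportions.
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Weighted impurity of the two partitions produced by splitting x at a threshold.
split_gini <- function(x, y, threshold) {
  left <- y[x <= threshold]
  right <- y[x > threshold]
  (length(left) * gini(left) + length(right) * gini(right)) / length(y)
}

# Toy feature and binary labels (hypothetical values).
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(0, 0, 1, 0, 1, 1, 1, 1)

# Candidate thresholds are the midpoints between consecutive feature values.
thresholds <- (head(x, -1) + tail(x, -1)) / 2
scores <- sapply(thresholds, function(t) split_gini(x, y, t))

# The best split is the threshold with the lowest weighted impurity.
best <- thresholds[which.min(scores)]
```

A full tree repeats this search over every feature, applies the best split, and then recurses into each partition until a stopping rule is met.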
  • 9. 4. RANDOM FOREST ALGORITHM
Random Forest uses multiple decision trees, each built on a separate set of examples drawn from the dataset. Each tree can use a subset of all the available features. By using more decision trees and averaging their results, the variance of the model can be greatly reduced.
For Random Forest, there are two main parameters to consider: the number of trees and the number of features selected at each decision point. The approach is to use a smaller node size in order to improve accuracy. Theoretically, accuracy increases as more trees take part in the decision. We obtained the confusion matrix and ROC curve for this model, which gave 66% accuracy.
Plot Feature Importance
Here we plot importance based on two measures:
● Global variable importance is the mean decrease of accuracy over all out-of-bag cross-validated predictions when a given variable is permuted after training but before prediction.
● The mean decrease in Gini coefficient is a measure of how much each variable contributes to the homogeneity of the nodes and leaves in the resulting random forest.
  • 10.
    # Confusion matrix
    confusionMatrix(newscla.rf.pred, newscla[ind == 2, ]$shares)

    Confusion Matrix and Statistics

              Reference
    Prediction    0    1
             0 3952 1922
             1 2059 3817

                   Accuracy : 0.6612
                     95% CI : (0.6526, 0.6698)
        No Information Rate : 0.5116
        P-Value [Acc > NIR] : < 2e-16
                      Kappa : 0.3224
     Mcnemar's Test P-Value : 0.03112
                Sensitivity : 0.6575
                Specificity : 0.6651
             Pos Pred Value : 0.6728
             Neg Pred Value : 0.6496
                 Prevalence : 0.5116
             Detection Rate : 0.3363
       Detection Prevalence : 0.4999
          Balanced Accuracy : 0.6613
           'Positive' Class : 0
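The headline statistics in this output follow directly from the four cell counts. As a check, the sketch below recomputes them in base R from the matrix printed above; the variable names are chosen for this example, and the 'Positive' class is 0, as the output states.

```r
# Rebuild the reported confusion matrix (rows = prediction, columns = reference).
cm <- matrix(
  c(3952, 2059, 1922, 3817),
  nrow = 2,
  dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1"))
)

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy <- sum(diag(cm)) / sum(cm)

# With class 0 as the positive class:
sensitivity <- cm["0", "0"] / sum(cm[, "0"])  # true 0s that were predicted 0
specificity <- cm["1", "1"] / sum(cm[, "1"])  # true 1s that were predicted 1
pos_pred_value <- cm["0", "0"] / sum(cm["0", ])  # predicted 0s that are truly 0
neg_pred_value <- cm["1", "1"] / sum(cm["1", ])  # predicted 1s that are truly 1
balanced_accuracy <- (sensitivity + specificity) / 2

print(round(c(accuracy, sensitivity, specificity,
              pos_pred_value, neg_pred_value, balanced_accuracy), 4))
```

Since class 0 (not popular) is treated as positive here, the share of articles predicted popular that really were popular is the Neg Pred Value row, 0.6496, rather than Pos Pred Value.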
  • 11. RESULTS
Comparing the ROC curves of the three methods, we see that the Random Forest Algorithm gives the highest accuracy, 66%, with an area under the curve (AUC) of 0.72. Because this value is closer to 1 than those of the other models, we place it in the category ‘good’ on the traditional academic scale. Hence we can state that the model is good, though not excellent.
ROC for KNN - AUC 0.592
ROC for CART - AUC 0.638
ROC for RF - AUC 0.72
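For readers unfamiliar with AUC, it can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it from ranks in base R. It is an illustration on made-up scores, not the ROC routine used for the figures above, and the function `auc_rank` and the toy vectors are hypothetical.

```r
# AUC via the rank-sum (Mann-Whitney) formulation.
# `scores` are predicted probabilities for class 1; `labels` are the true 0/1 classes.
auc_rank <- function(scores, labels) {
  r <- rank(scores)  # average ranks are used for tied scores
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  # Rank sum of the positives, minus its minimum possible value,
  # divided by the number of positive-negative pairs.
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy predictions (hypothetical): 4 positives and 4 negatives.
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
labels <- c(1, 1, 0, 1, 0, 1, 0, 0)

# 13 of the 16 positive-negative pairs are ordered correctly.
auc <- auc_rank(scores, labels)
```

An AUC of 0.5 corresponds to random ordering and 1.0 to a perfect ranking, which is the scale on which the three values above are compared.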
  • 12. PERFORMANCE MEASURES
1. Confusion Matrix: It is used to assess the correctness and accuracy of the model. Ideally, the model should give 0 False Positives and 0 False Negatives. In practice, however, almost no model is 100% accurate.
2. Accuracy: Accuracy in classification problems is the number of correct predictions made by the model divided by the total number of predictions made.
3. Precision: It is a measure that tells us what proportion of the articles that we classified as popular actually were popular.
CONCLUSION
Random Forest gives the best result for this classification problem. It can use different numbers of decision trees and different numbers of features at each decision point, and the number of training examples can also change. Therefore, these parameters should be tuned in a systematic way.
In the future, we could directly treat all the words in an article as additional features and then apply machine learning algorithms such as Naive Bayes and SVM. In this way, what the article actually talks about is taken into account, and this approach should improve prediction accuracy if combined with our current work.
References:
● https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity
● Maguire, Joe, and Scott Michelson. "Predicting the Popularity of Social News Posts." 2013 cs229 projects.
● Hensinger, Elena, Ilias Flaounas, and Nello Cristianini. "Modelling and predicting news popularity." Pattern Analysis and Applications 16.4 (2013): 623-635.