SlideShare a Scribd company logo
1 of 16
Download to read offline
News Classifier and Trending Event
Detection
https://github.com/sujithcp/NewsAnalyser
SHARANYA KV
SUJITH CP
A Brief Introduction
Data classification and mining are trending and rapidly evolving fields of
Computer science. Data mining is the process of identifying patterns and relationships in data
that often are not obvious in large and complex data sets.
News event detection and classification are very helpful in socio-economic research and such
purposes. As social media and real-time information sharing gains popularity, an automated
event detection system can do better in understanding the required events.
Machine Learning Technologies for Data Mining
• Inductive Logic Programming
• Genetic Algorithms
• Neural Networks
• Statistical Methods (We use Bayesian method for classification)
• Decision Trees
• Hidden Markov Models
Data Collection
•RSS links collection
•Feed fetching
Preprocessing and
cleaning
•Extract news from feed links
using beautifulsoup and sumy
•Discard very short and
irrelevant data
Transformation and
reduction
•Tokenizing,stopwords removal,
stemming etc.
•Prepare categorized datasets
for training
Training
•Train the machine using naïve
Bayes classifier.(Used nltk
module for the same)
•Dump the classifier object for
easy retrieval.(Using pickle)
Testing
•Test 1: Use same dataset for
training and classification.
•Expecting higher accuracy.
•Test 2: Use mutually exclusive
50% data for training and
classification.
Procedure for building news classifier
Data collection and cleaning
• RSS url with
category
indication
Feed URL
• Link
• Title
• Date
• summary
feed parser
• Main content
• summarization
Html parser
• Insert into table
with attributes
id,date,title,link,
content etc.
Sqlite
database
Training
 Supervised learning using naïve Bayes classifier
NB classifier works on the Bayes theorem in probability mathematics
Processed
data
NB
algorithm
NB classifier
Processed
test data
Category,
probability
Dump the classifier object for easy
reuse (Using pickle library)
Analysis
 Random testing
Data set size : 5000+
Accuracy : 94.6
 Best case test
No. of fault classification : 265
Data set size : 9000+
Accuracy : 97.17
11 173 22 45 14
Best case Errors out of 9000+
articles
Best case
Analysis
Proposed Enhancements
 Multi-category tagging for documents
 Dynamic categorization during fetching
Event detection
 Event detection is an unsupervised learning task.
 We used retrospective event detection which identifies previously
unidentified events in chronological order.
 Events are those phrases which occur in a peaking frequency for a short
period of time.
Event detection Procedure
Data selection and
sampling
•Fetch data from database
for a period
Data selection and
sampling
•Fetch data from database
for a period
Transformation and
reduction
•tokenize
•N-gramize
Transformation and
reduction
•tokenize
•N-gramize
Unsupervised learning
•Compute weight for each
phrases
•Compare with previous
events
Unsupervised learning
•Compute weight for each
phrases
•Compare with previous
events
Analysis and
verification
•Compare with online data
manually
•visualization
•Compute hit and miss and
false alarms of events. (Out
of the program task)
Analysis and
verification
•Compare with online data
manually
•visualization
•Compute hit and miss and
false alarms of events. (Out
of the program task)
Data transformation
 The data used for learning purpose must be cleansed before the task
 Main processing steps are
• Tokenize the news data and remove stopwords.
“Brain cancer detected in the Assam state
{ brain, cancer, detected, assam, state }
• Create unigrams, bigrams, and trigrams and add it to the phrases list (learning
vocabulary)
{ {brain, cancer, . . . . }
{brain cancer, cancer detected, . . . . . }
{brain cancer detected, cancer detected assam, . . . . . }
}
Event detection working
Weight of a term is calculated as,
𝒘 𝒕, 𝒅 =
𝟏+𝒕𝒇 𝒕,𝒅 ×𝒍𝒐𝒈(𝑵/𝒏 𝒕)
||𝒅||
 𝑤 𝑡, 𝑑 𝑖𝑠 𝑡ℎ𝑒 𝑤𝑒𝑖𝑔ℎ𝑡 𝑜𝑓 𝑡𝑒𝑟𝑚 𝑡 𝑖𝑛 𝑑𝑜𝑐𝑢𝑚𝑒𝑛𝑡 𝑑
 𝑡𝑓 𝑡, 𝑑 𝑖𝑠 𝑡ℎ𝑒 𝑡𝑒𝑟𝑚 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦 𝑇𝐹 𝑡𝑓 𝑡, 𝑑 = 𝑎 + 1 − 𝑎 .
𝑓 𝑡,𝑑
𝑚𝑎𝑥{𝑡′∈𝑑} 𝑓 𝑡′,𝑑
 log(𝑁/𝑛 𝑡) 𝑖𝑠 𝑡ℎ𝑒 𝐼𝑛𝑣𝑒𝑟𝑡𝑒𝑑 𝐷𝑜𝑐𝑢𝑚𝑒𝑛𝑡 𝐹𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦
 𝑁 𝑖𝑠 𝑡ℎ𝑒 𝑠𝑖𝑧𝑒 𝑜𝑓 𝑡ℎ𝑒 𝑡𝑟𝑎𝑖𝑛𝑖𝑛𝑔 𝑐𝑜𝑟𝑝𝑢𝑠
 𝑛 𝑡 𝑖𝑠 𝑡ℎ𝑒 𝑛𝑜. 𝑜𝑓 𝑑𝑜𝑐𝑢𝑚𝑒𝑛𝑡𝑠 𝑐𝑜𝑛𝑡𝑎𝑖𝑛𝑖𝑛𝑔 𝑡𝑒𝑟𝑚 𝑡
 || 𝑑|| = 𝑡 𝑤 𝑡, 𝑑 2 is the 2-norm vector
Analysis
Some correctly detected events
• Met gala
• ipl 2016
• Mother day
• Assembly polls
• Cannes review
• Rio Olympics
Some noises
• Film review
• Times India
• Jessica parker
• Indian express
• Captain America
• Net profit
Proposed Enhancements
 Clustering and event merging.
 Real-time online reference, comparison, and verification.
References
 Learning approaches for Detecting and Tracking News Events: IEEE paper
by Yiming Yang, Jaime Carbonell, Ralf Brown,Tom Pierce, Brian T. Archibald,
Xin Liu
 Web references:
• http://stevenloria.com/how-to-build-a-text-classification-system-with-python-
and-textblob/
• https://en.wikipedia.org/wiki/Tf%E2%80%93idf
Thank you….

More Related Content

What's hot

Pitfalls in benchmarking data stream classification and how to avoid them
Pitfalls in benchmarking data stream classification and how to avoid themPitfalls in benchmarking data stream classification and how to avoid them
Pitfalls in benchmarking data stream classification and how to avoid them
Albert Bifet
 
Data Automation at Light Sources
Data Automation at Light SourcesData Automation at Light Sources
Data Automation at Light Sources
Ian Foster
 

What's hot (20)

WSO2 Big Data Platform and Applications
WSO2 Big Data Platform and ApplicationsWSO2 Big Data Platform and Applications
WSO2 Big Data Platform and Applications
 
The data, they are a-changin’
The data, they are a-changin’The data, they are a-changin’
The data, they are a-changin’
 
Introduction Big data
Introduction Big data  Introduction Big data
Introduction Big data
 
Machine learning for java developers
Machine learning for java developersMachine learning for java developers
Machine learning for java developers
 
ReComp for genomics
ReComp for genomicsReComp for genomics
ReComp for genomics
 
Moa: Real Time Analytics for Data Streams
Moa: Real Time Analytics for Data StreamsMoa: Real Time Analytics for Data Streams
Moa: Real Time Analytics for Data Streams
 
Shikha fdp 62_14july2017
Shikha fdp 62_14july2017Shikha fdp 62_14july2017
Shikha fdp 62_14july2017
 
"Machine Learning and Internet of Things, the future of medical prevention", ...
"Machine Learning and Internet of Things, the future of medical prevention", ..."Machine Learning and Internet of Things, the future of medical prevention", ...
"Machine Learning and Internet of Things, the future of medical prevention", ...
 
Sentiment Knowledge Discovery in Twitter Streaming Data
Sentiment Knowledge Discovery in Twitter Streaming DataSentiment Knowledge Discovery in Twitter Streaming Data
Sentiment Knowledge Discovery in Twitter Streaming Data
 
MOA for the IoT at ACML 2016
MOA for the IoT at ACML 2016 MOA for the IoT at ACML 2016
MOA for the IoT at ACML 2016
 
Time to Science/Time to Results: Transforming Research in the Cloud
Time to Science/Time to Results: Transforming Research in the CloudTime to Science/Time to Results: Transforming Research in the Cloud
Time to Science/Time to Results: Transforming Research in the Cloud
 
Efficient Online Evaluation of Big Data Stream Classifiers
Efficient Online Evaluation of Big Data Stream ClassifiersEfficient Online Evaluation of Big Data Stream Classifiers
Efficient Online Evaluation of Big Data Stream Classifiers
 
Artificial intelligence and data stream mining
Artificial intelligence and data stream miningArtificial intelligence and data stream mining
Artificial intelligence and data stream mining
 
Pitfalls in benchmarking data stream classification and how to avoid them
Pitfalls in benchmarking data stream classification and how to avoid themPitfalls in benchmarking data stream classification and how to avoid them
Pitfalls in benchmarking data stream classification and how to avoid them
 
Data Automation at Light Sources
Data Automation at Light SourcesData Automation at Light Sources
Data Automation at Light Sources
 
Spark and Cassandra: An Amazing Apache Love Story by Patrick McFadin
Spark and Cassandra: An Amazing Apache Love Story by Patrick McFadinSpark and Cassandra: An Amazing Apache Love Story by Patrick McFadin
Spark and Cassandra: An Amazing Apache Love Story by Patrick McFadin
 
"Quantum Clustering - Physics Inspired Clustering Algorithm", Sigalit Bechler...
"Quantum Clustering - Physics Inspired Clustering Algorithm", Sigalit Bechler..."Quantum Clustering - Physics Inspired Clustering Algorithm", Sigalit Bechler...
"Quantum Clustering - Physics Inspired Clustering Algorithm", Sigalit Bechler...
 
Java ug
Java ugJava ug
Java ug
 
Вебинар: Julia — A fresh approach to numerical computing and data science
Вебинар: Julia — A fresh approach to numerical computing and data scienceВебинар: Julia — A fresh approach to numerical computing and data science
Вебинар: Julia — A fresh approach to numerical computing and data science
 
Introduction to WSO2 Analytics Platform: 2016 Q2 Update
Introduction to WSO2 Analytics Platform: 2016 Q2 UpdateIntroduction to WSO2 Analytics Platform: 2016 Q2 Update
Introduction to WSO2 Analytics Platform: 2016 Q2 Update
 

Similar to Project

04-Data-Analysis-Overview.pptx
04-Data-Analysis-Overview.pptx04-Data-Analysis-Overview.pptx
04-Data-Analysis-Overview.pptx
Shree Shree
 

Similar to Project (20)

Machinr Learning and artificial_Lect1.pdf
Machinr Learning and artificial_Lect1.pdfMachinr Learning and artificial_Lect1.pdf
Machinr Learning and artificial_Lect1.pdf
 
01-pengantar.pdf
01-pengantar.pdf01-pengantar.pdf
01-pengantar.pdf
 
AutoML for Data Science Productivity and Toward Better Digital Decisions
AutoML for Data Science Productivity and Toward Better Digital DecisionsAutoML for Data Science Productivity and Toward Better Digital Decisions
AutoML for Data Science Productivity and Toward Better Digital Decisions
 
Data mining
Data mining Data mining
Data mining
 
04-Data-Analysis-Overview.pptx
04-Data-Analysis-Overview.pptx04-Data-Analysis-Overview.pptx
04-Data-Analysis-Overview.pptx
 
Data analytcis-first-steps
Data analytcis-first-stepsData analytcis-first-steps
Data analytcis-first-steps
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 
Chapter 1: Introduction to Data Mining
Chapter 1: Introduction to Data MiningChapter 1: Introduction to Data Mining
Chapter 1: Introduction to Data Mining
 
Real time streaming analytics
Real time streaming analyticsReal time streaming analytics
Real time streaming analytics
 
Reaction RuleML 1.0
Reaction RuleML 1.0Reaction RuleML 1.0
Reaction RuleML 1.0
 
Barga Data Science lecture 8
Barga Data Science lecture 8Barga Data Science lecture 8
Barga Data Science lecture 8
 
Navy security contest-bigdataforsecurity
Navy security contest-bigdataforsecurityNavy security contest-bigdataforsecurity
Navy security contest-bigdataforsecurity
 
Data mining weka
Data mining wekaData mining weka
Data mining weka
 
Machine Learning Algorithm & Anomaly detection 2021
Machine Learning Algorithm & Anomaly detection 2021Machine Learning Algorithm & Anomaly detection 2021
Machine Learning Algorithm & Anomaly detection 2021
 
OpenML data@Sheffield
OpenML data@SheffieldOpenML data@Sheffield
OpenML data@Sheffield
 
Data Mining 101
Data Mining 101Data Mining 101
Data Mining 101
 
Cloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataCloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big Data
 
5. Machine Learning.pptx
5.  Machine Learning.pptx5.  Machine Learning.pptx
5. Machine Learning.pptx
 
H2O with Erin LeDell at Portland R User Group
H2O with Erin LeDell at Portland R User GroupH2O with Erin LeDell at Portland R User Group
H2O with Erin LeDell at Portland R User Group
 

Project

  • 1. News Classifier and Trending Event Detection https://github.com/sujithcp/NewsAnalyser SHARANYA KV SUJITH CP
  • 2. A Brief Introduction Data classification and mining are trending and rapidly evolving fields of Computer science. Data mining is the process of identifying patterns and relationships in data that often are not obvious in large and complex data sets. News event detection and classification are very helpful in socio-economic research and such purposes. As social media and real-time information sharing gains popularity, an automated event detection system can do better in understanding the required events. Machine Learning Technologies for Data Mining • Inductive Logic Programming • Genetic Algorithms • Neural Networks • Statistical Methods (We use Bayesian method for classification) • Decision Trees • Hidden Markov Models
  • 3. Data Collection •RSS links collection •Feed fetching Preprocessing and cleaning •Extract news from feed links using beautifulsoup and sumy •Discard very short and irrelevant data Transformation and reduction •Tokenizing,stopwords removal, stemming etc. •Prepare categorized datasets for training Training •Train the machine using naïve Bayes classifier.(Used nltk module for the same) •Dump the classifier object for easy retrieval.(Using pickle) Testing •Test 1: Use same dataset for training and classification. •Expecting higher accuracy. •Test 2: Use mutually exclusive 50% data for training and classification. Procedure for building news classifier
  • 4. Data collection and cleaning • RSS url with category indication Feed URL • Link • Title • Date • summary feed parser • Main content • summarization Html parser • Insert into table with attributes id,date,title,link, content etc. Sqlite database
  • 5. Training  Supervised learning using naïve Bayes classifier NB classifier works on the Bayes theorem in probability mathematics Processed data NB algorithm NB classifier Processed test data Category, probability Dump the classifier object for easy reuse (Using pickle library)
  • 6. Analysis  Random testing Data set size : 5000+ Accuracy : 94.6  Best case test No. of fault classification : 265 Data set size : 9000+ Accuracy : 97.17 11 173 22 45 14 Best case Errors out of 9000+ articles Best case
  • 8. Proposed Enhancements  Multi-category tagging for documents  Dynamic categorization during fetching
  • 9. Event detection  Event detection is an unsupervised learning task.  We used retrospective event detection which identifies previously unidentified events in chronological order.  Events are those phrases which occur in a peaking frequency for a short period of time.
  • 10. Event detection Procedure Data selection and sampling •Fetch data from database for a period Data selection and sampling •Fetch data from database for a period Transformation and reduction •tokenize •N-gramize Transformation and reduction •tokenize •N-gramize Unsupervised learning •Compute weight for each phrases •Compare with previous events Unsupervised learning •Compute weight for each phrases •Compare with previous events Analysis and verification •Compare with online data manually •visualization •Compute hit and miss and false alarms of events. (Out of the program task) Analysis and verification •Compare with online data manually •visualization •Compute hit and miss and false alarms of events. (Out of the program task)
  • 11. Data transformation  The data used for learning purpose must be cleansed before the task  Main processing steps are • Tokenize the news data and remove stopwords. “Brain cancer detected in the Assam state { brain, cancer, detected, assam, state } • Create unigrams, bigrams, and trigrams and add it to the phrases list (learning vocabulary) { {brain, cancer, . . . . } {brain cancer, cancer detected, . . . . . } {brain cancer detected, cancer detected assam, . . . . . } }
  • 12. Event detection working Weight of a term is calculated as, 𝒘 𝒕, 𝒅 = 𝟏+𝒕𝒇 𝒕,𝒅 ×𝒍𝒐𝒈(𝑵/𝒏 𝒕) ||𝒅||  𝑤 𝑡, 𝑑 𝑖𝑠 𝑡ℎ𝑒 𝑤𝑒𝑖𝑔ℎ𝑡 𝑜𝑓 𝑡𝑒𝑟𝑚 𝑡 𝑖𝑛 𝑑𝑜𝑐𝑢𝑚𝑒𝑛𝑡 𝑑  𝑡𝑓 𝑡, 𝑑 𝑖𝑠 𝑡ℎ𝑒 𝑡𝑒𝑟𝑚 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦 𝑇𝐹 𝑡𝑓 𝑡, 𝑑 = 𝑎 + 1 − 𝑎 . 𝑓 𝑡,𝑑 𝑚𝑎𝑥{𝑡′∈𝑑} 𝑓 𝑡′,𝑑  log(𝑁/𝑛 𝑡) 𝑖𝑠 𝑡ℎ𝑒 𝐼𝑛𝑣𝑒𝑟𝑡𝑒𝑑 𝐷𝑜𝑐𝑢𝑚𝑒𝑛𝑡 𝐹𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦  𝑁 𝑖𝑠 𝑡ℎ𝑒 𝑠𝑖𝑧𝑒 𝑜𝑓 𝑡ℎ𝑒 𝑡𝑟𝑎𝑖𝑛𝑖𝑛𝑔 𝑐𝑜𝑟𝑝𝑢𝑠  𝑛 𝑡 𝑖𝑠 𝑡ℎ𝑒 𝑛𝑜. 𝑜𝑓 𝑑𝑜𝑐𝑢𝑚𝑒𝑛𝑡𝑠 𝑐𝑜𝑛𝑡𝑎𝑖𝑛𝑖𝑛𝑔 𝑡𝑒𝑟𝑚 𝑡  || 𝑑|| = 𝑡 𝑤 𝑡, 𝑑 2 is the 2-norm vector
  • 13. Analysis Some correctly detected events • Met gala • ipl 2016 • Mother day • Assembly polls • Cannes review • Rio Olympics Some noises • Film review • Times India • Jessica parker • Indian express • Captain America • Net profit
  • 14. Proposed Enhancements  Clustering and event merging.  Real-time online reference, comparison, and verification.
  • 15. References  Learning approaches for Detecting and Tracking News Events: IEEE paper by Yiming Yang, Jaime Carbonell, Ralf Brown,Tom Pierce, Brian T. Archibald, Xin Liu  Web references: • http://stevenloria.com/how-to-build-a-text-classification-system-with-python- and-textblob/ • https://en.wikipedia.org/wiki/Tf%E2%80%93idf