Major Project
Event-Based News Clustering
Submitted By: Aniket Mishra
Problem Statement:
• To implement a clustering system that groups related news items into
the same cluster, so that one can see what happens next in an event.
In short, the task is to implement an event-based news clustering
system using a clustering algorithm.
Implementation Steps Followed:
• Crawled election campaign news using the Bing API over different
time periods.
• Used the sub-categories AAP, BJP, and Congress.
• First applied k-means with 10 clusters.
• Then applied modified k-means to the data to improve its efficiency.
• Implemented the algorithms using TF-IDF, centroid calculation, and cosine similarity.
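The steps above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the whitespace tokeniser, the smoothed IDF formula, the function names, and the `iterations` and `seed` parameters are all assumptions made for the example.

```python
import math
import random
from collections import Counter, defaultdict


def normalise(vec):
    """Scale a sparse vector (term -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def tfidf_vectors(docs):
    """Turn raw texts into sparse, unit-length TF-IDF vectors."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Assumed weighting: tf * (ln(N / df) + 1).
        vec = {t: c * (math.log(n_docs / df[t]) + 1.0) for t, c in tf.items()}
        vectors.append(normalise(vec))
    return vectors


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (a dot product)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def centroid(members):
    """Mean of the member vectors, re-normalised to unit length."""
    total = defaultdict(float)
    for vec in members:
        for t, w in vec.items():
            total[t] += w
    return normalise({t: w / len(members) for t, w in total.items()})


def kmeans(vectors, k=10, iterations=10, seed=0):
    """Basic k-means with cosine similarity and random initial centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign every document to its most similar centroid.
        labels = [max(range(k), key=lambda j: cosine(v, centroids[j]))
                  for v in vectors]
        # Recompute centroids; an empty cluster keeps its old centroid.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = centroid(members)
    return labels
```

Calling `kmeans(tfidf_vectors(docs), k=10)` on a list of headline strings would return one cluster label per headline.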
Algorithm          RSS     Purity   Rand Index
K-means            73.52   65.9     0.66
Modified K-means   73.70   71.5     0.649
Table 1 shows the results obtained by our system for the k-means and
modified k-means algorithms.
Table 1: Comparison of clustering results
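For reference, purity and the Rand index can be computed from a set of gold labels and the predicted cluster labels as sketched below. The function names and the toy labels are assumptions for illustration; the slides do not show how the reported figures were computed.

```python
from collections import Counter
from itertools import combinations


def purity(labels_true, labels_pred):
    """Fraction of documents that belong to the majority class of their cluster."""
    clusters = {}
    for truth, pred in zip(labels_true, labels_pred):
        clusters.setdefault(pred, Counter())[truth] += 1
    majority_total = sum(max(counts.values()) for counts in clusters.values())
    return majority_total / len(labels_true)


def rand_index(labels_true, labels_pred):
    """Fraction of document pairs on which the clustering and the gold labels agree."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Toy example: six documents, two gold classes, two predicted clusters.
gold = ["A", "A", "A", "B", "B", "B"]
pred = [0, 0, 1, 1, 1, 1]
print(purity(gold, pred))      # 5 of 6 documents match their cluster's majority class
print(rand_index(gold, pred))  # 10 of 15 pairs are treated consistently
```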
When calculating the purity and Rand index of k-means and modified
k-means, we found that repeating the clustering 10 times and taking the
initial k points from each of the k different clusters, rather than
using random restarts, gives modified k-means better results, in
particular higher purity (71.5 versus 65.9 in Table 1).
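One possible reading of this seeding strategy is sketched below: after a first randomly seeded run, each later round starts from one point drawn from each of the k clusters of the previous round instead of from fresh random points. This is an interpretation, not the project's confirmed method; the helper names (`lloyd`, `reseeded_kmeans`), the dense-vector representation, and the fallback for empty clusters are assumptions.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity of two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean(vectors):
    """Component-wise mean of a non-empty list of dense vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def lloyd(points, centroids, iterations=10):
    """Plain k-means starting from the given centroids; returns labels."""
    k = len(centroids)
    centroids = list(centroids)
    labels = []
    for _ in range(iterations):
        labels = [max(range(k), key=lambda j: cosine(p, centroids[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = mean(members)
    return labels


def reseeded_kmeans(points, k, rounds=10, seed=0):
    """Repeat k-means, seeding each round from the previous round's clusters."""
    rng = random.Random(seed)
    labels = lloyd(points, rng.sample(points, k))  # first round: random seeds
    for _ in range(rounds - 1):
        seeds = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Assumed fallback: pick any point if cluster j ended up empty.
            seeds.append(rng.choice(members) if members else rng.choice(points))
        labels = lloyd(points, seeds)
    return labels
```

Because every seed comes from a different cluster of the previous round, the starting points are spread across the data, which is one plausible reason such a scheme could help purity compared with independent random restarts.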
Results Demonstration
These are the results in cluster 9. The items grouped together are related news: all 4 items concern
Rahul Gandhi. The news was collected on 29-05-14; these items were originally scattered, and k-means
clustering brought them together.
In this second example, the news is mostly related to the Punjab unit of the Congress, which suggests
that the news was clustered correctly. However, 2 of the items are unrelated, so the cluster is not
100% pure.
Conclusion
• In this project I have designed and evaluated a clustering system. The
system crawls incoming news reports from the Bing API and clusters them according to
the event they describe. The clustering is performed by representing
incoming news reports as a bag of words with TF-IDF weighting, and using a
variation of the k-means algorithm that works in a single pass without cluster
re-organization. The number of clusters to produce is fixed at 29 for every query, and
new events are detected automatically. The process takes 1-2 minutes to
fetch the news from the website.
• The evaluation results show that the system is very effective when clustering
documents into highly specific clusters, but performs rather poorly when
clustering documents into more general categories. It performs better with
modified k-means.
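The single-pass variant mentioned in the conclusion could look something like the sketch below. Each incoming report is compared with the existing clusters once; it either joins the most similar cluster or opens a new one (a new event), and earlier assignments are never revisited. The similarity `threshold`, the way the cluster cap is enforced, and all function names are assumptions, since the slides give only the cluster limit of 29.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors (term -> weight)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def single_pass(vectors, threshold=0.3, max_clusters=29):
    """Assign each vector once, in arrival order, without re-organising clusters."""
    sums = []    # running sum of member vectors; its direction is the centroid
    labels = []
    for vec in vectors:
        sims = [cosine(vec, s) for s in sums]
        best = max(range(len(sums)), key=sims.__getitem__, default=None)
        below = best is None or sims[best] < threshold
        if best is None or (below and len(sums) < max_clusters):
            # Nothing similar enough and room left: treat as a new event.
            sums.append(dict(vec))
            labels.append(len(sums) - 1)
        else:
            # Join the most similar cluster and update its running sum.
            for t, w in vec.items():
                sums[best][t] = sums[best].get(t, 0.0) + w
            labels.append(best)
    return labels
```

Once the cap is reached, a report that matches no cluster well is still placed in its nearest cluster, which is one simple way to keep the cluster count fixed.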
Future Work:
• In my opinion, this clustering approach can be applied in domains
other than online news. For example, it could be applied to
social media feeds to produce clusters according to
the item being discussed by different people. In the future, a
user interface could be added to make the system easier to use, and its
scalability could be improved.
Thank you!
