1) The document describes a project implementing event-based news clustering using k-means and modified k-means clustering algorithms.
2) The author crawled election campaign data from different time periods using Bing API and applied k-means and modified k-means clustering.
3) Evaluation results found that modified k-means achieved higher purity (71.5 vs. 65.9) with a comparable Rand index (0.649 vs. 0.66), clustering related news articles into coherent events.
2. Problem Statement:
• To implement a clustering system that groups related news items
into one cluster, so that one can see what happens next in an
event. In short, the task is to implement an event-based news
clustering system using a clustering algorithm.
3. Implementation Steps Followed:
• Crawled election campaign data over different time periods using
the Bing API.
• Used the sub-categories AAP, BJP, and Congress.
• Applied k-means first, with 10 clusters.
• Then applied modified k-means to the data to improve its efficiency.
• Applied the algorithms using TF-IDF, centroid calculation, and cosine similarity.
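The last step above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's actual code: the function names (`tfidf_vectors`, `cosine`, `centroid`, `kmeans`), the TF-IDF variant (relative term frequency times log inverse document frequency), and the choice to pass seed vectors in explicitly are all assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenized documents into L2-normalized sparse TF-IDF vectors (dicts)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors

def cosine(a, b):
    """Cosine similarity of two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}

def kmeans(vectors, seeds, iterations=10):
    """Assign each vector to its most similar centroid, then recompute the centroids."""
    centroids = [dict(s) for s in seeds]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [max(range(len(centroids)), key=lambda k: cosine(v, centroids[k]))
                  for v in vectors]
        for k in range(len(centroids)):
            members = [v for v, l in zip(vectors, labels) if l == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = centroid(members)
    return labels, centroids

# Toy usage with made-up headlines, already tokenized.
docs = [["rahul", "gandhi", "rally"], ["rahul", "gandhi", "speech"],
        ["punjab", "congress", "unit"], ["punjab", "congress", "leader"]]
vectors = tfidf_vectors(docs)
labels, _ = kmeans(vectors, seeds=[vectors[0], vectors[2]])
```

Because the vectors are compared by cosine similarity, only the direction of each centroid matters, so the mean does not need to be re-normalized.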
4. Table 1 shows the results obtained by our system for the k-means and
modified k-means algorithms.
Table 1 - Comparison of clustering results
Algorithm          RSS     Purity   Rand Index
K-means            73.52   65.9     0.66
Modified K-means   73.70   71.5     0.649
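For reference, the three measures in Table 1 can be computed as in the sketch below. It uses the standard textbook definitions, which may differ in detail from what the project used. The function names and toy data are illustrative, and the purity and Rand index functions return fractions in [0, 1] rather than percentages.

```python
from collections import Counter
from itertools import combinations

def purity(labels, truth):
    """Fraction of documents that belong to the majority class of their cluster."""
    clusters = {}
    for l, t in zip(labels, truth):
        clusters.setdefault(l, []).append(t)
    majority = sum(Counter(m).most_common(1)[0][1] for m in clusters.values())
    return majority / len(labels)

def rand_index(labels, truth):
    """Fraction of document pairs on which the clustering and the ground truth agree."""
    pairs = list(combinations(range(len(labels)), 2))
    agree = sum((labels[i] == labels[j]) == (truth[i] == truth[j]) for i, j in pairs)
    return agree / len(pairs)

def rss(points, labels):
    """Residual sum of squares: squared distance of each point to its cluster mean."""
    total = 0.0
    for k in set(labels):
        members = [p for p, l in zip(points, labels) if l == k]
        mean = [sum(col) / len(members) for col in zip(*members)]
        total += sum((x - m) ** 2 for p in members for x, m in zip(p, mean))
    return total

# Toy usage: six documents, two clusters, three hand-assigned classes.
labels = [0, 0, 0, 1, 1, 1]
truth = ["a", "a", "b", "b", "b", "c"]
p = purity(labels, truth)        # 4 of 6 documents are in their cluster's majority class
ri = rand_index(labels, truth)   # 9 of 15 pairs agree
```

Purity rewards clusters dominated by one class, while the Rand index also penalizes splitting one class across clusters, so the two measures can move in different directions.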
5. When calculating the purity and Rand index of k-means and modified
k-means, we found that repeating the clustering 10 times and taking
the initial k points from each of the k different clusters, rather than
using random restarts, gives modified k-means better results and
higher purity.
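One possible reading of this seeding strategy is sketched below: only the first round starts from random points, and every later round draws one seed from each cluster of the previous round. The function names are hypothetical, and the sketch uses dense points with Euclidean distance to stay short, whereas the project used TF-IDF vectors with cosine similarity.

```python
import random

def assign(points, centers):
    """Index of the nearest center (squared Euclidean distance) for each point."""
    return [min(range(len(centers)),
                key=lambda k: sum((x - c) ** 2 for x, c in zip(p, centers[k])))
            for p in points]

def mean(members):
    """Component-wise mean of a non-empty list of points."""
    return [sum(col) / len(members) for col in zip(*members)]

def kmeans_round(points, seeds, iterations=10):
    """Plain k-means started from the given seeds; returns one label per point."""
    centers = [list(s) for s in seeds]
    labels = assign(points, centers)
    for _ in range(iterations):
        for k in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == k]
            if members:
                centers[k] = mean(members)
        labels = assign(points, centers)
    return labels

def reseeded_kmeans(points, k, rounds=10, seed=0):
    """Repeat k-means, seeding each round with one point from each previous cluster."""
    rng = random.Random(seed)
    seeds = rng.sample(points, k)  # only the first round is seeded at random
    labels = kmeans_round(points, seeds)
    for _ in range(rounds - 1):
        new_seeds = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            # Keep the old seed if a cluster ended up empty.
            new_seeds.append(rng.choice(members) if members else seeds[c])
        seeds = new_seeds
        labels = kmeans_round(points, seeds)
    return labels

# Toy usage: two well-separated groups of points.
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
labels = reseeded_kmeans(points, k=2)
```

Seeding from distinct clusters keeps the starting points spread out, whereas a random restart can place several seeds inside the same group.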
6. Results Demonstration
These are the results in cluster 9. They come together as related news: all 4 news items are
related to Rahul Gandhi. The news was collected on 29-05-14. The items were originally scattered,
and k-means clustering grouped them together.
7. In this second example, the news is mostly related to the Punjab unit of Congress, which
suggests that the news was clustered correctly. However, 2 news items are unrelated, so the
cluster is not 100% pure.
8. Conclusion
• In this project I have designed and evaluated a clustering system. The system
crawls incoming news reports from the Bing API and clusters them according to
the event they describe. Clustering is performed by representing incoming news
reports as a bag of words with TF-IDF weighting and using a variation of the
k-means algorithm that works in a single pass without cluster reorganization.
The number of clusters to produce is fixed at 29 for every query, and new events
are detected automatically. The clustering process takes 1-2 minutes to fetch
news from the website.
• The evaluation results show that the system is very effective when clustering
documents into highly specific clusters but performs rather poorly when
clustering documents into more general categories, and that it performs better
with modified k-means.
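The single-pass behaviour described above could look roughly like the sketch below. The slides do not say how a new event is detected, so the similarity threshold of 0.3 is an assumption, as are the function names. The cap of 29 clusters is taken from the text.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def single_pass(vectors, max_clusters=29, threshold=0.3):
    """Assign each vector once, in arrival order; earlier assignments are never revised."""
    sums, labels = [], []  # per-cluster summed vector, and one label per input vector
    for v in vectors:
        sims = [cosine(v, s) for s in sums]
        best = max(range(len(sums)), key=sims.__getitem__) if sums else None
        if len(sums) < max_clusters and (best is None or sims[best] < threshold):
            sums.append(dict(v))  # nothing similar enough: treat as a new event
            labels.append(len(sums) - 1)
        else:
            for t, w in v.items():  # fold the report into the closest cluster
                sums[best][t] = sums[best].get(t, 0.0) + w
            labels.append(best)
    return labels

# Toy usage with hand-made term weights standing in for TF-IDF vectors.
reports = [{"rahul": 1.0, "gandhi": 1.0}, {"punjab": 1.0, "congress": 1.0},
           {"rahul": 1.0, "rally": 1.0}, {"punjab": 1.0, "unit": 1.0}]
labels = single_pass(reports)
```

Each cluster is stored as a running sum rather than a mean because cosine similarity ignores vector length, so both give the same assignments.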
9. Future Work:
• In my opinion, this clustering approach can be applied in domains
other than online news. For example, it could be applied to social
media feeds to produce clusters according to the item being
discussed by different people. In future, a user interface could be
created to make the system easier to use, and its scalability could
also be improved.