Published in: Education, Technology, Business


  1. 1. Topic Modelling in Social Media Group 28 Project 06
  2. 2. Group Members ● Prateek Mehta (201203006) ● Saurabh Kanaujia (201305551) ● Nikita Nataraj (201101079)
  3. 3. Aim ● Apply topic-modelling techniques to social media. ● Main focus: reduce the cost of computing LDA models in social networks while keeping the technique scalable. ● Efficient representation and calculation of topics for the whole network.
  4. 4. Introduction Topical categorization of blogs, documents, or other objects that can be tagged with text improves the experience for end users. When the set of documents is very large and varies significantly from user to user, calculating a single global topic model, or an individual topic model for every user, can become very expensive in large-scale internet settings. To implement topic modelling, we used LDA. Latent Dirichlet allocation (LDA) is an unsupervised, probabilistic text-clustering algorithm. LDA defines a generative model that describes how documents are generated given a set of topics and the words in those topics. We chose LDA because it is well suited to modelling human-like corpora, in other words social media.
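The generative story behind LDA mentioned in the introduction can be sketched in a few lines. This is only an illustration under stated assumptions: the vocabulary, the two hand-written topics, the prior `alpha`, and all function names are hypothetical and do not come from the project. The sketch shows how a document is generated from known topics; it does not fit a model to data.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalised Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Return an index drawn according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(topics, vocab, alpha, length, rng):
    """Generate one document under the LDA generative story.

    topics: per-topic word distributions (each sums to 1 over vocab).
    alpha:  Dirichlet prior over the document's topic proportions.
    """
    theta = sample_dirichlet(alpha, rng)        # topic mixture of this document
    words = []
    for _ in range(length):
        z = sample_categorical(theta, rng)      # choose a topic for this word slot
        w = sample_categorical(topics[z], rng)  # choose a word from that topic
        words.append(vocab[w])
    return theta, words

# Hypothetical toy vocabulary and topics (illustrative values only).
VOCAB = ["goal", "match", "team", "vote", "party", "election"]
TOPICS = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # a sports-like topic
    [0.0, 0.0, 0.0, 0.3, 0.3, 0.4],  # a politics-like topic
]

rng = random.Random(0)
theta, doc = generate_document(TOPICS, VOCAB, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta, doc)
```

Fitting LDA runs this story in reverse: given only the words, it infers the topic mixtures and the per-topic word distributions, which is the expensive step the project aims to avoid repeating for every user.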
  5. 5. Possible Approaches 1. Fit an LDA model for each user in the network (very costly). 2. Find the top K influential users and fit an LDA model for each of them. 3. Detect communities and apply the LDA model across communities. We tried to implement Approaches 2 and 3.
  6. 6. Approach No. 3 Drawbacks ● This community detection is based on bidirectional follower-followee relationships. Only 22-23% of users on Twitter have such a relationship, in which they follow each other. ● Finding communities based on unidirectional follower-followee relationships was neither feasible nor scalable.
  7. 7. Approach No. 2 Phase 1: Finding Influential Users ● Top-k users are found using the PageRank algorithm from the GraphChi API. ● Their tweets and the URLs embedded in them are fetched, along with metadata, tags, and ids. ● The URLs are crawled and summarized. ● The tweet documents plus the URL summaries are used as training data.
  8. 8. Approach No. 2 Phase 1: Diagram
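The top-k selection step of Phase 1 can be sketched as follows. The project used GraphChi's PageRank; the code below is a plain-Python power iteration swapped in for illustration, not GraphChi's API. The follower graph, the damping factor, the iteration count, and the function names are all assumptions.

```python
def pagerank(follows, damping=0.85, iterations=50):
    """Power-iteration PageRank on a follower graph.

    follows maps each user to the list of users they follow, so rank
    flows from a follower to the accounts they follow.
    """
    users = set(follows)
    for followees in follows.values():
        users.update(followees)
    users = sorted(users)
    n = len(users)
    rank = {u: 1.0 / n for u in users}
    for _ in range(iterations):
        # Rank held by users who follow nobody is spread evenly over everyone.
        dangling = sum(rank[u] for u in users if not follows.get(u))
        new_rank = {u: (1.0 - damping) / n + damping * dangling / n for u in users}
        for u in users:
            followees = follows.get(u, [])
            if followees:
                share = damping * rank[u] / len(followees)
                for v in followees:
                    new_rank[v] += share
        rank = new_rank
    return rank

def top_k_users(follows, k):
    """Return the k highest-ranked users, breaking ties by name."""
    rank = pagerank(follows)
    return sorted(rank, key=lambda u: (-rank[u], u))[:k]

# Hypothetical follower graph (illustrative only).
FOLLOWS = {
    "alice": ["carol"],
    "bob": ["carol", "alice"],
    "carol": ["alice"],
    "dave": ["carol"],
}
print(top_k_users(FOLLOWS, 2))
```

Because PageRank works on directed edges, it does not depend on the mutual-follow relationships that limited Approach No. 3.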
  9. 9. Approach No. 2 Phase 2: User Similarity ● Tweets and URLs are fetched. Each URL is summarised to 15-20 sentences. ● The Jaccard index is calculated to match the user with one of the top users. ● The maximum Jaccard index implies that the user adopts the topic distribution of the corresponding top user.
  10. 10. Approach No. 2 Phase 2: Diagram
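The matching step of Phase 2 can be sketched as below. The slides do not say which sets the Jaccard index is computed over; this sketch assumes the sets are the distinct lower-cased words in each user's text (tweets plus URL summaries). The tokenizer, sample texts, and function names are hypothetical.

```python
import re

def tokens(text):
    """Split text into a set of lower-cased word tokens (assumed representation)."""
    return set(re.findall(r"[a-z0-9#@']+", text.lower()))

def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B|; defined here as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def closest_top_user(user_text, top_user_texts):
    """Return the top user whose text has the highest Jaccard index with user_text."""
    user_tokens = tokens(user_text)
    scores = {
        name: jaccard(user_tokens, tokens(text))
        for name, text in top_user_texts.items()
    }
    # Iterate names in sorted order so ties resolve deterministically.
    best = max(sorted(scores), key=scores.get)
    return best, scores[best]

# Hypothetical texts standing in for tweets plus URL summaries.
TOP_USERS = {
    "sports_fan": "great match today the team scored a late goal",
    "politics_buff": "the election vote count favours the ruling party",
}
best, score = closest_top_user("what a goal by the team in that match", TOP_USERS)
print(best, round(score, 3))
```

The ordinary user would then reuse the topic distribution already computed for the matched top user, so no new LDA model has to be fitted for them.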
  11. 11. Conclusion Of the three approaches proposed, we settled on the second, in which we define 100 top users and create an LDA model for each.