Twitter LDA

Published on http://www.akshaybhat.com/LDA

Entity and Link Annotation in Online Social Networks
Karan Kurani & Akshay Bhat
CS 6740, Fall 2010 project at Cornell University

Overview
- Introduction
- Prior work
- Methodology
- Datasets and Implementation
- Results
- Discussion
- Future Work

Introduction
Motivation:
- How can the social network and the textual information associated with its entities be modeled to gain insight?
Goal:
- Create a model for annotating entities and links using both the social network and the textual content.
Dataset:
- Social network of 36 million Twitter users and 450 million tweets.
Applications:
- Targeted advertising, friend suggestions, etc.

Prior Work
- "What is Twitter, a Social Network or a News Media?" by Kwak et al.: a large-scale analysis of user behavior on Twitter; studies how information propagates through the network via retweets in order to determine user influence.
- "Automatic generation of personalized annotation tags for Twitter users" by Wu et al.: uses TF-IDF weights to assign tags to each user, based on textual information alone.

Prior Work: Motivation
- "Connections between the lines: Augmenting social networks with text" by Chang et al.: using Wikipedia and the Bible annotated with entities, a network between entities and a topic model are constructed jointly.

Prior Work: Other Models
- Block-LDA ("Jointly modeling entity-annotated text and entity-entity links" by Cohen et al., applied to a protein-protein interaction dataset): a predefined undirected network with text associated with each node.
- Topic-Link LDA ("Joint models of topic and author communities" by Liu et al.): a corpus of academic publications is modeled with a hierarchical Bayesian topic model to find topics within the papers as well as communities of authors.

Methodology: Overview
Community detection:
- Detect communities of users in the social network.
- Use the label propagation algorithm.
- Only the network information (who follows whom) is used.
- Communities are detected at various levels.
LDA:
- Each community is treated as a corpus.
- All tweets by a single entity are treated as a single document.
- Only the textual information is used.
- Stemming, stop-word removal, and rare-word removal are applied.

Methodology: Generating Annotations
- A topic is considered relevant to a user if its probability exceeds 0.05.
- Users are annotated with the topics generated by the LDA model.
- A link between two users is annotated with the intersection of the topic sets generated for the two users forming the link.
- General topics are detected by comparing against topics generated for users selected at random from the whole network rather than from the community (the pipeline is sketched below).
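
A minimal sketch of the pipeline on the two slides above, assuming networkx's label propagation and gensim's LDA as stand-ins for the project's Hadoop and LingPipe implementations; the toy graph, tweets, and names are illustrative, and only the 0.05 relevance threshold and the link-intersection rule come from the slides.

```python
# Sketch: label-propagation communities, one LDA model per community,
# topic annotations for users, intersection of topic sets for links.
import networkx as nx
from gensim import corpora, models

# Toy follower graph and tweets (the real data: 36M users, 450M tweets).
G = nx.Graph()
G.add_edges_from([("alice", "bob"), ("bob", "carol"), ("carol", "alice"),
                  ("dave", "erin"), ("erin", "frank"), ("frank", "dave")])
tweets = {
    "alice": "lda topic models are fun",    "bob": "topic models for twitter",
    "carol": "gibbs sampling for lda",      "dave": "cricket match in india today",
    "erin": "india wins the cricket match", "frank": "watching cricket tonight",
}

THRESHOLD = 0.05  # a topic is relevant to a user if P(topic | user) exceeds 0.05

def annotate_community(users, num_topics=2):
    """Treat the community as a corpus, one document per user, and return
    the set of relevant topic ids for each user."""
    docs = [tweets[u].split() for u in users]   # stemming/stop-word removal omitted
    vocab = corpora.Dictionary(docs)
    bow = [vocab.doc2bow(d) for d in docs]
    lda = models.LdaModel(bow, id2word=vocab, num_topics=num_topics)
    return {u: {t for t, _ in lda.get_document_topics(b, minimum_probability=THRESHOLD)}
            for u, b in zip(users, bow)}

# Communities come from the network structure alone, via label propagation.
for community in nx.algorithms.community.asyn_lpa_communities(G):
    users = sorted(community)
    user_topics = annotate_community(users)
    # A link is annotated with the topics shared by its two endpoints.
    for u, v in G.subgraph(users).edges():
        print(u, v, user_topics[u] & user_topics[v])
```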

Methodology: Evaluation
- Select a single community.
- Generate the LDA model from all users within that community.
- Use the model to generate topic probabilities for a set of randomly selected users.
- Train a classifier (linear SVM) to discriminate between users in the community and the randomly selected users.
- Measure accuracy, precision, and recall.
- Repeat the above procedure for different communities (a sketch follows).
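
A sketch of this evaluation for one community, assuming scikit-learn's LinearSVC as the linear SVM; the topic-probability matrices are random stand-ins for the per-user distributions the LDA model would produce.

```python
# Can per-user topic probabilities separate community members from
# randomly selected users? (scikit-learn stand-in for the project's classifier)
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical inputs: one row per user, 50 topic probabilities per row.
rng = np.random.default_rng(0)
community_users = rng.dirichlet(np.full(50, 0.02), size=200)  # users in the community
random_users = rng.dirichlet(np.full(50, 0.5), size=200)      # randomly selected users

X = np.vstack([community_users, random_users])
y = np.array([1] * len(community_users) + [0] * len(random_users))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
pred = LinearSVC().fit(X_train, y_train).predict(X_test)

print("accuracy: ", accuracy_score(y_test, pred))
print("precision:", precision_score(y_test, pred))
print("recall:   ", recall_score(y_test, pred))
# Repeat for each community of interest.
```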

Dataset
Two types of data sets:
Network:
- Twitter follower network of 36 million users, collected in June 2009.
- Users with more than 900 followers are removed to uncover the underlying social network (a filtering sketch follows the slide).
Textual data:
- 450 million tweets from 20 million users.
- Collected from June 2009 to December 2009.
- Covers roughly 20-30% of all public tweets from the above time period.
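
A small sketch of the follower-count filter mentioned above, assuming the network is stored as a plain "follower followee" edge list; the file format and helper name are assumptions rather than the project's actual data layout.

```python
# Drop high-degree accounts (more than 900 followers) before community detection.
from collections import Counter

MAX_FOLLOWERS = 900

def load_filtered_edges(path):
    # Each line: "<follower> <followee>"
    with open(path) as f:
        edges = [tuple(line.split()) for line in f if line.strip()]

    followers_of = Counter(followee for _, followee in edges)        # in-degree
    drop = {user for user, n in followers_of.items() if n > MAX_FOLLOWERS}
    return [(u, v) for u, v in edges if u not in drop and v not in drop]

# edges = load_filtered_edges("follower_network.txt")
```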

Implementation
Community detection:
- Communities are detected using the label propagation algorithm.
- Implemented on the Cornell Web Lab Hadoop cluster.
- 15 iterations are performed; the communities from the 7th and 15th iterations are considered.
LDA:
- LingPipe, a Java package that provides an implementation of LDA.
- Collapsed Gibbs sampling is used to infer the topic distributions.
- Stop-word removal, stemming, and rare-word removal are performed.
- Number of topics: 50.
- Topic prior: 0.02; word prior: 0.001; number of samples: 2000; burn-in epochs: 100 (see the sketch below).
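
For concreteness, here is a compact collapsed Gibbs sampler with the hyperparameters listed above (50 topics, topic prior 0.02, word prior 0.001, 100 burn-in epochs, 2000 samples). It is an illustrative NumPy re-implementation rather than LingPipe's sampler, and a pure-Python loop like this would be far too slow for the real corpus; it is only meant to show what those parameters control.

```python
# Illustrative collapsed Gibbs sampler for LDA (not the LingPipe implementation).
# docs: list of documents, each a list of integer word ids in [0, vocab_size).
import numpy as np

def lda_gibbs(docs, vocab_size, num_topics=50, alpha=0.02, beta=0.001,
              burnin=100, samples=2000, seed=0):
    rng = np.random.default_rng(seed)
    # Count tables with the Dirichlet priors folded in.
    n_dk = np.full((len(docs), num_topics), alpha)   # document-topic counts
    n_kw = np.full((num_topics, vocab_size), beta)   # topic-word counts
    n_k = np.full(num_topics, beta * vocab_size)     # topic totals

    z = []                                           # topic assignment per token
    for d, doc in enumerate(docs):
        zd = rng.integers(num_topics, size=len(doc))
        for w, k in zip(doc, zd):
            n_dk[d, k] += 1
            n_kw[k, w] += 1
            n_k[k] += 1
        z.append(zd)

    doc_topic = np.zeros((len(docs), num_topics))
    for it in range(burnin + samples):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                n_dk[d, k] -= 1
                n_kw[k, w] -= 1
                n_k[k] -= 1
                # Collapsed conditional P(z_i = k | rest); priors are in the counts.
                p = n_dk[d] * n_kw[:, w] / n_k
                k = rng.choice(num_topics, p=p / p.sum())
                z[d][i] = k
                n_dk[d, k] += 1
                n_kw[k, w] += 1
                n_k[k] += 1
        if it >= burnin:
            # Average the smoothed document-topic estimate over the kept samples.
            doc_topic += n_dk / n_dk.sum(axis=1, keepdims=True)
    return doc_topic / samples                       # estimated P(topic | user)
```

Calling `lda_gibbs(docs, vocab_size)` on one community's word-id documents yields the per-user topic distributions that the annotation and evaluation steps consume.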

Results (figure-only slides)
- Topics
- User distribution over number of topics
- Topic distribution (two slides)
- Portland-specific topics
- Portland general topics
- India-specific topics
- India general topics
- Common/shared topics
- Classifier performance

Discussion (figure-only slide)

Future Work
- Use Hierarchical Dirichlet Processes to determine the number of topics automatically.
- Use the online version of LDA currently being developed by David Blei at Princeton, which would make it possible to generate topic distributions over the whole Twitter dataset (a sketch follows).
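
As a pointer in that direction, a hedged sketch using gensim: HdpModel infers the number of topics instead of fixing it at 50, and gensim's LdaModel implements the online variational LDA of Hoffman, Blei, and Bach, so the full corpus could be streamed through it in chunks. The toy documents and parameter choices here are illustrative.

```python
# Future-work sketch: let an HDP pick the number of topics, and train LDA online.
from gensim import corpora, models

docs = [["cricket", "india", "match"],
        ["lda", "topic", "model"],
        ["topic", "model", "gibbs"]]
vocab = corpora.Dictionary(docs)
bow = [vocab.doc2bow(d) for d in docs]

# Hierarchical Dirichlet Process: no num_topics argument is needed.
hdp = models.HdpModel(bow, id2word=vocab)
print(hdp.print_topics(num_topics=5, num_words=3))

# Online variational LDA: update the model chunk by chunk as tweets arrive.
online_lda = models.LdaModel(id2word=vocab, num_topics=50, alpha=0.02, eta=0.001)
online_lda.update(bow)   # call update() once per incoming chunk of documents
```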