Topic Discovery in Unstructured Data: The Next Generation

A look at text clustering with probabilistic methods (LDA) and at correlating the resulting topics with structured data, in particular geolocation.

  1. Web Science & Technologies, University of Koblenz ▪ Landau, Germany. Topic Discovery in Unstructured Data: The Next Generation. Christoph Kling, Sergej Sizov, Steffen Staab
  2. Understanding Social Media: Example Yahoo News Comments • Many comments • More opinions • Commenting on different (sub)topics (WeST, Steffen Staab, Topic Detection - TNG, 10/09/2012)
  3. Discovering topics using LDA
  4. Browse by topic: more.. more..
  5. We have: Topic-Document – All Fine? How do we understand the topics? Are all topics of the same value? Is there structured data to correlate? • Space • Time • Network information. We work on: • Opinions about topics • Diversity of opinions • Localisation of topics • Time-varying topic models (Blei, Lafferty) • .... • Geo-varying topic models
  6. Geo-located social media content [Figure: geo-located posts labelled with car brands: BMW, Audi, Citroen, Chevrolet, Peugeot, Renault, Pontiac, Mercedes, Fiat]
  7. Geo-located social media content [Figure: brand-labelled posts with lowercase topic word lists, e.g. "bmw audi mercedes fiat citroen"]
  8. Related work: LGTA, Yin et al. 2011 [Figure: brand-labelled posts with topic word lists]
  9. Problem: Geographical distribution of topics [Figures: Language areas; Dominating religion]
  10. Our approach [Figure: brand-labelled posts with topic word lists]
  11. Our approach [Figure: brand-labelled posts with topic word lists]
  12. Geographical network construction [Panels: Data points; Spatial region centroids; Geographical network]
  13. Topic detection: Topic assignments
  14. Topic detection: Topic assignments
  15. Topic detection: Topic assignments
  16. Topic detection: Topic assignments
  17. Topic detection: Topic assignments
  18. Topic detection. Topic exchange between adjacent clusters [Figure: adjacent clusters with topic words Pontiac, Chevrolet, BMW]
  19. Topic detection. Topic exchange between adjacent clusters [Figure: spatial regions A, B, C, D and document 1, with topic words Pontiac, Chevrolet, BMW]
  20. Topic detection. Topic exchange between adjacent clusters [Figure: spatial regions A, B, C, D and document 1, with topic words Pontiac, Chevrolet, BMW]
  21. Topic detection [Figure: spatial regions A, B, C, D and document 1, with topic words Pontiac, Chevrolet, BMW] ... is drawn from ... with equal probability [formula symbols lost in extraction]
  22. Visualisation. Two topic word distributions: (chevrolet 0.35, bmw 0.18, cadillac 0.16, pontiac 0.09, gmc 0.07, buick 0.06, audi 0.05) and (bmw 0.29, audi 0.18, fiat 0.10, citroen 0.09, renault 0.09, peugeot 0.08, mercedesbenz 0.06, chevrolet 0.05)
  23. Visualisation. Four topic word distributions: (bmw 0.63, mercedesbenz 0.17, audi 0.13); (fiat 0.66, bmw 0.10, citroen 0.09, renault 0.05); (renault 0.28, citroen 0.22, peugeot 0.15, bmw 0.10, audi 0.09, fiat 0.07); (pontiac 0.92)
  24. Topic Detection: The next generation. GeoMTD • Better understandability: "nicer regions" • Improved quality • Better explanation of the data • Measured in terms of reduced perplexity • about half compared to related work
  25. Topic Detection: The next generation. Other next generation mechanisms for understanding social media: • Opinions • adding vocabularies with meaning (LIWC, POMS,...) • Diversity • maximizing for spread of topics and opinions • Author-topic-time... Need to balance between complexity of model and sparsity of data!
  26. Web Science & Technologies, University of Koblenz ▪ Landau, Germany. Thank you for your attention!
  27. References
      • Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, Vol. 101 (2006), p. 1566-1581.
      • Sergej Sizov. GeoFolk: latent spatial semantics in web 2.0 social media. WSDM, ACM (2010), p. 281-290.
      • Zhijun Yin, Liangliang Cao, Jiawei Han, Chengxiang Zhai, and Thomas S. Huang. Geographical topic discovery and comparison. WWW, ACM (2011), p. 247-256.
      • Kevin Robert Canini and Thomas L. Griffiths. A Nonparametric Bayesian Model of Multi-Level Category Learning. AAAI, AAAI Press (2011).
      • Nasir Naveed, Thomas Gottron, Sergej Sizov, and Steffen Staab. FREuD: Feature-Centric Sentiment Diversification of Online Discussions. WebSci12: Proceedings of the 4th International Conference on Web Science, ACM (2012).
      • Nasir Naveed, Sergej Sizov, and Steffen Staab. ATTention: Understanding Authors and Topics in Context of Temporal Evolution. European Conference on Information Retrieval 2011, Springer (2011), p. 733-737.
      Further papers about our work currently in preparation. Contact us if interested.
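
Slide 24 reports quality as reduced perplexity. As a reminder of what that metric computes, here is a minimal Python sketch; it is not the authors' evaluation code. The function name `perplexity` and the per-token probabilities are invented for illustration. The numbers are chosen so that the second model's perplexity comes out at half the first's, which mirrors only the shape of the claim on the slide, not the reported results.

```python
import math


def perplexity(token_log_probs):
    """Return exp of the negative mean natural-log probability per token.

    token_log_probs: natural-log probabilities that a model assigns to each
    token of a held-out text. Lower perplexity means the model explains the
    held-out data better.
    """
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Hypothetical probabilities two models assign to the same four held-out
# tokens (made-up values, for illustration only).
baseline = [math.log(p) for p in (0.05, 0.10, 0.05, 0.10)]
improved = [math.log(p) for p in (0.10, 0.20, 0.10, 0.20)]

ratio = perplexity(improved) / perplexity(baseline)
print(f"baseline perplexity: {perplexity(baseline):.2f}")
print(f"improved perplexity: {perplexity(improved):.2f}")
print(f"ratio: {ratio:.2f}")
```

Because perplexity is the inverse geometric mean of the per-token probabilities, doubling every token probability halves it, which is why the ratio in this toy example is 0.5.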
