Notes on mining social media updated


Published in: Technology

  1. 1. Notes On: MINING SOCIAL MEDIA COMMUNITIES AND CONTENT by Akshay Java Dissertation submitted to the Faculty of the Graduate School of the University of Maryland in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2008
  2. 2. <ul><li>Open Access Link </li></ul><ul><li>Key Words: social media, folksonomies (tags), social graph, structural vs. semantic information/knowledge </li></ul><ul><li>Disclaimer: Most notes are taken directly from the paper/article and should be appropriately referenced/cited directly from the author(s) </li></ul>
  3. 3. Introduction <ul><li>Social media is described as… </li></ul><ul><li>an umbrella term that defines the various activities that integrate technology, social interaction, and the construction of words, pictures, videos and audio. This interaction, and the manner in which information is presented, depends on the varied perspectives and “building” of shared meaning, as people share their stories and understandings. </li></ul><ul><li>Institute for language and information technologies. </li></ul>
  4. 4. Social Media <ul><li>Social Media has radically changed the way we communicate and share information both within and outside our social networks </li></ul>
  5. 5. Folksonomies <ul><li>Free-form tags (also known as folksonomies) </li></ul>
  6. 6. “Social Graph” <ul><li>social graph can be described as the sum of all declared social relationships across the participants in a given network </li></ul>
  7. 7. “ User-generated content” <ul><li>Content produced in social media is often referred to as “user-generated content” </li></ul>
  8. 8. (Criticism: Lack of Reference) <ul><li>User-generated content contributes to about five times more content present on the Web today </li></ul><ul><li>(No reference for this source was quoted!) </li></ul>
  9. 9. Motivating Thesis Question <ul><li>The motivating question that has guided this thesis is the following: “How can we analyze the structure and content of social media data to understand the nature of online communication and collaboration in social applications?” </li></ul>
  10. 10. Thesis Statement <ul><li>It is possible to develop effective algorithms to detect Web-scale communities using their inherent properties: structure and content </li></ul><ul><li>This thesis is based on two key observations… </li></ul><ul><li>• Understanding communication in social media requires identifying and modeling communities </li></ul><ul><li>• Communities are a result of collective, social interactions and usage </li></ul>
  11. 11. Semantic Web language OWL <ul><li>Why OWL? An acronym of Web Ontology Language </li></ul><ul><li>The Semantic Web is a vision for the future of the Web in which information is given explicit meaning, making it easier for machines to automatically process and integrate information available on the Web. The Semantic Web will build on XML's ability to define customized tagging schemes and RDF's flexible approach to representing data. The first level above RDF required for the Semantic Web is an ontology language that can formally describe the meaning of terminology used in Web documents. If machines are expected to perform useful reasoning tasks on these documents, the language must go beyond the basic semantics of RDF Schema. The OWL Use Cases and Requirements Document provides more details on ontologies, motivates the need for a Web Ontology Language in terms of six use cases, and formulates design goals, requirements and objectives for OWL. </li></ul>
  12. 12. Thesis Focus on Blogs & Wikis <ul><li>We soon realized that processing blogs and social media data required new techniques to be developed </li></ul><ul><li>The problem of spam blogs in social media </li></ul><ul><li>Blogs empower users with a channel to freely express themselves </li></ul><ul><li>The open, unrestricted format of blogs means that the user is now able to express themselves and freely air opinions </li></ul>
  13. 13. Open Retrieval/Access <ul><li>Opinion retrieval is thus an important application of social media analysis </li></ul>
  14. 14. TREC Conference Blog Track
  15. 15. TREC Tracks <ul><li>A TREC workshop consists of a set of tracks, areas of focus in which particular retrieval tasks are defined. The tracks serve several purposes. First, tracks act as incubators for new research areas: the first running of a track often defines what the problem really is, and a track creates the necessary infrastructure (test collections, evaluation methodology, etc.) to support research on its task. The tracks also demonstrate the robustness of core retrieval technology in that the same techniques are frequently appropriate for a variety of tasks. Finally, the tracks make TREC attractive to a broader community by providing tasks that match the research interests of more groups. </li></ul>
  16. 16. TREC Tracks <ul><li>Each track has a mailing list. The primary purpose of the mailing list is to discuss the details of the track's tasks in the current TREC. However, a track mailing list also serves as a place to discuss general methodological issues related to the track's retrieval tasks. Further, some tracks have track-specific web pages that provide history and background material regarding the track's focus. Thus, this page lists contact information for all the TREC tracks, whether or not the track is scheduled to be run in the current TREC. TREC track mailing lists are open to all; you need not participate in TREC to join a list. Most lists do require that you become a member of the list before you can send a message to it. </li></ul><ul><li>The set of tracks that will be run in a given year of TREC is determined by the TREC program committee. The committee has established a procedure for proposing new tracks . </li></ul>
  17. 17. Goal of Thesis TREC Track <ul><li>The goal of this track was to build and evaluate a retrieval system that would find blog posts that express some opinion (either positive or negative) about a given topic or query word </li></ul>
  18. 18. The BlogVox System <ul><li>The BlogVox system retrieves opinionated blog posts specified by ad hoc queries. BlogVox was developed for the 2006 TREC blog track by the University of Maryland, Baltimore County and the Johns Hopkins University Applied Physics Laboratory using a novel system to recognize legitimate posts and discriminate against spam blogs. It also processes posts to eliminate extraneous non-content, including blog-rolls, link-rolls, advertisements and sidebars. After retrieving posts relevant to a topic query, the system processes them to produce a set of independent features estimating the likelihood that a post expresses an opinion about the topic. These are combined using an SVM-based system and integrated with the relevancy score to rank the results. </li></ul>
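The final ranking step described in the slide above can be sketched roughly as follows. This is a minimal illustration, not the BlogVox implementation: the weights and bias stand in for a trained linear SVM, and the linear blend controlled by `alpha` is an assumption, since the notes do not say how the opinion score and the relevancy score are integrated. All names (`Post`, `WEIGHTS`, `BIAS`, `rank_posts`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Post:
    post_id: str
    relevance: float               # topical relevance score from the retrieval step
    opinion_features: List[float]  # independent opinion indicators for this post


# Hypothetical weights and bias standing in for a trained linear SVM.
WEIGHTS = [0.8, 0.5, -0.3]
BIAS = -0.2


def opinion_score(features: List[float]) -> float:
    """Linear decision value w . x + b, as a linear SVM would compute."""
    return sum(w * x for w, x in zip(WEIGHTS, features)) + BIAS


def rank_posts(posts: List[Post], alpha: float = 0.5) -> List[Post]:
    """Rank posts by a blend of relevance and opinion score (assumed blend)."""
    def combined(post: Post) -> float:
        return alpha * post.relevance + (1 - alpha) * opinion_score(post.opinion_features)
    return sorted(posts, key=combined, reverse=True)


posts = [
    Post("a", relevance=0.9, opinion_features=[0.0, 0.0, 0.0]),
    Post("b", relevance=0.6, opinion_features=[1.0, 1.0, 0.0]),
]
ranked_ids = [p.post_id for p in rank_posts(posts)]
```

With `alpha=1.0` the ranking falls back to pure topical relevance, which makes it easy to see how much the opinion features change the order.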
  19. 19. The BlogVox System <ul><li>BlogVox has resulted in the development of novel techniques for identifying trust and influence in online social media systems </li></ul>
  20. 20. General Content <ul><li>This dissertation is dedicated to social media content analysis and outlines both the semantic analysis system and the opinion retrieval system </li></ul>
  21. 21. Microblogging <ul><li>The activity of posting regular updates on a microblog </li></ul><ul><li>A variety of microblogging sites have sprung up </li></ul>
  22. 22. (Is this claim accurate?) <ul><li>This is the first study in the literature that has analyzed the microblogging phenomenon. </li></ul><ul><li>(Akshay Java, Xiaodan Song, Tim Finin, and Belle Tseng. Why we twitter: understanding microblogging usage and communities. In WebKDD/SNA-KDD ’07: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis, pages 56–65, New York, NY, USA, 2007. ACM.) </li></ul>
  23. 23. Social Graphs & Algorithms <ul><li>The thesis presents how to utilize the special structure of social media and the nature of social graphs to develop efficient algorithms for community detection. </li></ul>
  24. 24. SVD or Matrix Factorization Methods
  25. 25. Social Media Tags & Graphs <ul><li>One important property of social media datasets is the availability of tags. Tags or folksonomies, as they are typically called, are free-form descriptive terms that are associated with any resource. Lately, folksonomies have become an extremely popular means to organize and share information. Tags can be used for videos, photos or URLs. While structural analysis is the most widely used method for community detection, the rich meta-data available via tags can provide additional information that helps group related nodes together. However, techniques that combine tag information (or more generally content) with the structural analysis typically tend to be complicated. We present a simple, yet effective method that combines the metadata provided by tags with structural information from the graphs to identify communities in social media. The main contribution of this technique is a simplified and intuitive approach to combining tags and graphs. </li></ul>
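One simple way to picture combining tags with link structure is to reweight each link by how similar the tag sets of its two endpoints are. The sketch below is an illustration under that assumption only; it is not the thesis's SimCut algorithm, and the function names, the `beta` parameter and the sample data are hypothetical. The resulting weighted edges could then be handed to any weighted community detection method.

```python
def jaccard(a, b):
    """Jaccard similarity of two tag sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def combine_links_and_tags(edges, tags, beta=1.0):
    """Return {(u, v): weight}, where weight = 1 (the link) + beta * tag similarity."""
    weights = {}
    for u, v in edges:
        similarity = jaccard(tags.get(u, set()), tags.get(v, set()))
        weights[(u, v)] = 1.0 + beta * similarity
    return weights


# Hypothetical blogs and tags.
edges = [("a", "b"), ("b", "c")]
tags = {
    "a": {"python", "web"},
    "b": {"python", "web"},
    "c": {"cooking"},
}
weighted = combine_links_and_tags(edges, tags)
```

Here the link between `a` and `b` is strengthened because the two nodes share all their tags, while the link between `b` and `c` keeps only its structural weight.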
  26. 26. General Content <ul><li>This thesis outlines the structural analysis of social graphs </li></ul><ul><li>Focuses on the (social media) user perspective by analyzing feed subscriptions across a large population of users </li></ul><ul><li>Analyzes the subscription patterns of over eighty-three thousand publicly listed Bloglines users </li></ul>
  27. 27. (Criticism: Lack of Reference) <ul><li>According to some estimates, “the size of the Blogosphere continues to double every six months” and there are over seventy million blogs (with many that are actively posting) </li></ul>
  28. 28. Few Feeds <ul><li>However, our studies indicate that of all these blogs and feeds, the ones that really matter are relatively few. What blogs and feeds these users subscribe to and how they organize their subscriptions revealed interesting properties and characteristics of the way we consume information. For instance, most users have relatively few feeds in their subscriptions, indicating an inherent limit to the amount of attention that can be devoted to different channels. </li></ul>
  29. 29. User-Defined Folder Names <ul><li>Many users organize their feeds under user-defined folder names. Aggregated across a large number of users, these folder names are good indicators of the topics (or categories) associated with each blog. The study uses this collective intelligence to measure a readership-based influence of each feed for a given topic. </li></ul>
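The readership-based influence idea in the slide above can be sketched as a simple count: for each folder name (treated as a topic), count how many distinct users file a given feed under it. This is a minimal sketch under that assumption; the input layout, the lowercase normalization of folder names and the sample feeds are hypothetical and not taken from the thesis.

```python
from collections import defaultdict


def topic_influence(subscriptions):
    """subscriptions: iterable of (user, folder_name, feed) tuples.

    Returns {topic: [(feed, reader_count), ...]} sorted by reader count.
    """
    readers = defaultdict(set)
    for user, folder, feed in subscriptions:
        topic = folder.strip().lower()  # naive normalization of folder names
        readers[(topic, feed)].add(user)

    ranking = defaultdict(list)
    for (topic, feed), users in readers.items():
        ranking[topic].append((feed, len(users)))
    for topic in ranking:
        # Most-read feeds first; ties broken alphabetically for stable output.
        ranking[topic].sort(key=lambda item: (-item[1], item[0]))
    return dict(ranking)


# Hypothetical subscription records.
subscriptions = [
    ("u1", "Politics", "feedA"),
    ("u2", "politics ", "feedA"),
    ("u3", "Politics", "feedB"),
    ("u1", "Tech", "feedC"),
]
influence = topic_influence(subscriptions)
```

Counting distinct users rather than raw rows keeps a single user with duplicate subscriptions from inflating a feed's score.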
  30. 30. Feed Distillation Task <ul><li>The task of identifying the most relevant feed for a given topic or query term is now known as the “feed distillation task” in the literature </li></ul>
  31. 31. Thesis Contributions <ul><li>Following are the main contributions of this thesis: </li></ul><ul><li>• We provide a systematic study of the social media landscape by analyzing the content, structure and special properties. </li></ul><ul><li>• Developed and evaluated innovative approaches for community detection. </li></ul><ul><li>– We present a new algorithm for finding communities in social datasets. </li></ul><ul><li>– SimCut, a novel algorithm for combining structural and semantic information. </li></ul><ul><li>• First to comprehensively analyze two important social media forms </li></ul><ul><li>– We analyze the subscription patterns of a large collection of blog subscribers. The insights gained in this study were critical in developing a blog categorization system and a recommendation system, and they provide a basis for further, recent studies on feed subscription patterns. </li></ul><ul><li>– We analyze the microblogging phenomenon and develop a taxonomy of user intentions and types of communities present in this setting. </li></ul><ul><li>• Finally we have built systems, infrastructure and datasets for the social media research community. </li></ul>
  32. 32. The Social Web (Web 2.0) <ul><li>The World Wide Web today has become increasingly social </li></ul>
  33. 33. (Reference Cited) Here Comes Everybody: The Power of Organizing Without Organizations by Clay Shirky (Book Overview) (YouTube)
  34. 34. (Criticism: Lack of Reference) <ul><li>Content on the Web today </li></ul><ul><li>According to recent estimates, while edited content like CNN or Reuters news reports is about 2G per day, user-generated content produced today is four to five times as much </li></ul>
  35. 35. So, what makes the Web “social”? <ul><li>Web 1.0: most websites and homepages that exist are a one-way communication medium </li></ul><ul><li>Web 2.0: blogs and social media sites changed this by adding functionality to comment and interact with the content – be it blogs, music, videos or photos </li></ul>
  36. 36. The Blogosphere <ul><li>There are a number of studies that have specifically analyzed its structure and content </li></ul><ul><li>Blogging provides a channel to express opinions, facts and thoughts </li></ul><ul><li>Through these pieces of information, also known as memes, bloggers influence each other and engage in conversations that ultimately lead to exchange of ideas and spread of information </li></ul>
  37. 37. The Blogosphere <ul><li>By analyzing the graphs generated through such interactions, we can answer several questions about the structure of the blogosphere: </li></ul><ul><li>Community structure </li></ul><ul><li>Spread of influence </li></ul><ul><li>Opinion detection </li></ul><ul><li>Formation, friendship networks </li></ul><ul><li>Information cascades </li></ul>
  38. 38. (Criticism: Lack of Reference) <ul><li>As of 2006 there were over 52 million blogs and presently there are in excess of 70 million blogs </li></ul><ul><li>The number of blogs is doubling every six months and a large fraction of these blogs are active </li></ul>
  39. 39. Blogs <ul><li>Blogs enjoy a significant readership: according to a recent report by Forrester Research, one in four Americans reads blogs, and a large fraction of users also participate by commenting </li></ul><ul><li>Blogs are typically published through blog hosting sites or tools like WordPress that can be self-hosted </li></ul><ul><li>Blogs can be subscribed to by RSS (Really Simple Syndication) feeds </li></ul>
  40. 40. (Reference Cited) Click: What Millions of People are Doing Online and Why It Matters by Bill Tancer
  41. 41. Social Networking Sites <ul><li>In a recent study of Facebook users, Dr. Zeynep Tufekci concluded that Facebook users are very open about their personal information </li></ul><ul><li>A surprisingly large fraction openly disclose their real names, phone numbers and other personal information </li></ul><ul><li>(Zeynep Tufekci. Can you see me now? Audience and disclosure regulation in online social network sites, 2008) </li></ul><ul><li>(Zeynep Tufekci. Grooming, gossip, Facebook and MySpace: What can we learn about social networking sites from non-users. In Information, Communication and Society, volume 11, pages 544–564, 2008) </li></ul>
  42. 42. (Reference Cited) Snoop: What Your Stuff Says About You by Dr. Sam Gosling
  43. 43. Social Networking Sites <ul><li>Dr. Sam Gosling talks about how personal spaces like bedrooms, office desks and even Facebook profiles reveal a whole lot about the real self </li></ul><ul><li>The research indicates how using just the information from a Facebook profile page, users can accurately score openness, conscientiousness, extraversion, agreeableness, and neuroticism (also known as the five factor model in Psychology) </li></ul>
  44. 44. Tagging & Folksonomies <ul><li>The term folksonomy is derived from folk and taxonomy and is attributed to Thomas Vander Wal </li></ul>
  45. 45. (Reference Cited) <ul><li>Heymann et al. inquire about the effectiveness of tagging and applications of social bookmarking in Web search </li></ul><ul><li>(Paul Heymann, Georgia Koutrika, and Hector Garcia-Molina. Can social bookmarking improve web search? In WSDM ’08: Proceedings of the international conference on Web search and web data mining , pages 195–206, New York, NY, USA, 2008. ACM) </li></ul>
  46. 46. (Reference Cited) <ul><li>Brooks and Montanez have also studied the phenomenon of user-generated tags and evaluate effectiveness of tagging </li></ul><ul><li>(Christopher H. Brooks and Nancy Montanez. Improved annotation of the blogosphere via autotagging and hierarchical clustering. In WWW , 2006) </li></ul>
  47. 47. Tagging & Folksonomies <ul><li>Studies have also shown that tagging can explain user behavior </li></ul>
  48. 48. Tagging & Folksonomies <ul><li>Cattuto et al. model users as simple agents that tag documents with a frequency-bias and have the notion of memory, such that they are less likely to use older tags </li></ul><ul><li>(Ciro Cattuto, Vittorio Loreto, and Luciano Pietronero. Collaborative tagging and semiotic dynamics. CoRR , abs/cs/0605015, 2006) </li></ul>
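A toy simulation can make the two ingredients named in the slide above concrete: frequency bias (tags used more often are more likely to be reused) and memory (older uses count for less). The sketch below assumes a Yule-Simon style process with a memory kernel proportional to 1 / (x + tau), where x is how many steps ago a tag was used; the parameter names and default values are hypothetical, and the code is an illustration rather than the authors' implementation.

```python
import random


def simulate_tag_stream(steps, p_new=0.1, tau=5.0, seed=0):
    """Generate a sequence of integer tag ids of length `steps`."""
    rng = random.Random(seed)
    stream = [0]   # start with a single tag
    next_tag = 1
    for _ in range(steps - 1):
        if rng.random() < p_new:
            # Invent a brand-new tag.
            stream.append(next_tag)
            next_tag += 1
        else:
            # Copy an earlier tag. stream[i] was used (n - i) steps ago, so
            # recent uses get larger weights; frequent tags occupy more
            # positions and are therefore picked more often.
            n = len(stream)
            weights = [1.0 / (x + tau) for x in range(n, 0, -1)]
            stream.append(rng.choices(stream, weights=weights, k=1)[0])
    return stream


stream = simulate_tag_stream(200)
```

Raising `tau` flattens the kernel, so old and recent tags become almost equally likely to be reused; lowering it makes the simulated users more forgetful.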
  49. 49. Tagging & Folksonomies <ul><li>AutoTagging is a collaborative filtering-based recommendation system for suggesting appropriate tags </li></ul>
  50. 50. Tagging & Folksonomies <ul><li>TagAssist is a system that recommends tags related to a given blog post </li></ul>
  51. 51. Growth of the Blogosphere <ul><li>Ravi Kumar et al. have studied the evolution of the blog graph and find that the size of the blogosphere grew drastically in 2001 </li></ul><ul><li>But only a small percentage of blogs have the most in-links </li></ul><ul><li>(Ravi Kumar, Jasmine Novak, Prabhakar Raghavan, and Andrew Tomkins. On the bursty evolution of blogspace. In WWW, pages 568–576, 2003) </li></ul>
  52. 52. “Forest Fire” Model <ul><li>Leskovec et al. present the “Forest Fire” model to explain the growth and evolution of dynamic social network graphs </li></ul><ul><li>(Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. In KDD ’05: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 177–187, New York, NY, USA, 2005. ACM) </li></ul><ul><li>Two observations motivate this model: </li></ul><ul><li>Out-degree increases over time as networks evolve </li></ul><ul><li>“Shrinking diameter”: the diameter of the network decreases over time </li></ul><ul><li>The “Forest Fire” model tries to mimic the way information spreads in networks </li></ul>
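A simplified generator in the spirit of the “Forest Fire” model is sketched below. It is an illustration under stated assumptions: only forward burning is modelled (the published model also burns along in-links), and the names `forest_fire_graph` and `p_forward` and the default values are hypothetical. Each new node picks a random existing “ambassador”, links to it, and then the fire spreads along a geometrically distributed number of the out-links of every node it reaches.

```python
import random


def forest_fire_graph(n, p_forward=0.35, seed=0):
    """Return {node: set of out-links} for a graph grown one node at a time."""
    rng = random.Random(seed)
    out_links = {0: set()}
    for v in range(1, n):
        ambassador = rng.randrange(v)   # uniformly chosen existing node
        visited = {ambassador}
        frontier = [ambassador]
        while frontier:
            w = frontier.pop()
            candidates = [u for u in sorted(out_links[w]) if u not in visited]
            rng.shuffle(candidates)
            # Geometric number of links to burn: continue with prob. p_forward.
            burn = 0
            while burn < len(candidates) and rng.random() < p_forward:
                burn += 1
            for u in candidates[:burn]:
                visited.add(u)
                frontier.append(u)
        out_links[v] = visited          # the new node links to every burned node
    return out_links


graph = forest_fire_graph(50)
```

Because later nodes can inherit many links from a well-connected ambassador, the average out-degree tends to grow as the graph grows, which is the densification behaviour the slide refers to.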
  53. 53. Information Cascades <ul><li>The forest fire model was also shown to describe information cascades in blog graphs </li></ul><ul><li>Information cascades are a chain of links from one blog to another that describe a conversation </li></ul>
  54. 54. Behavioral Model <ul><li>Blogger is treated as both a reader and a writer </li></ul>
  55. 55. 80/20 Distribution <ul><li>Herring et al. performed an empirical study of the interconnectivity of a sample of blogs and found that conversations in the blogosphere are sporadic; they highlight the importance of the ‘A-list’ bloggers and their roles in conversations </li></ul><ul><li>(Susan C. Herring, Inna Kouper, John C. Paolillo, Lois Ann Scheidt, Michael Tyworth, Peter Welsch, Elijah Wright, and Ning Yu. Conversations in the blogosphere: An analysis “from the bottom up”. In HICSS ’05: Proceedings of the 38th Annual Hawaii International Conference on System Sciences (HICSS’05) - Track 4, page 107.2, Washington, DC, USA, 2005. IEEE Computer Society) </li></ul><ul><li>A-list bloggers are those that enjoy a high degree of influence in the blogosphere </li></ul><ul><li>These are the blogs that correspond to the head of the long tail (or power-law) distribution of the blogosphere </li></ul><ul><li>These constitute a small fraction of all the blogs that receive the most attention or links </li></ul>
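The “head of the long tail” idea can be checked on any in-link count table by asking what share of all in-links goes to the most-linked fraction of blogs. The sketch below is a minimal illustration; the function name, the 20% default and the sample counts are hypothetical and not data from the study.

```python
def head_share(inlinks, top_fraction=0.2):
    """Share of all in-links received by the top `top_fraction` of blogs."""
    counts = sorted(inlinks.values(), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    k = max(1, int(len(counts) * top_fraction))  # size of the "head"
    return sum(counts[:k]) / total


# Hypothetical in-link counts for five blogs.
inlinks = {"a": 80, "b": 10, "c": 5, "d": 3, "e": 2}
share = head_share(inlinks)
```

In this made-up sample a single blog out of five receives 80 of the 100 in-links, the kind of skew the slide title alludes to.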
  56. 56. Feed Distillation <ul><li>Technorati lists the top 100 blogs on the blogosphere </li></ul><ul><li>These lists, while serving a generic ranking purpose, do not indicate the most popular blogs in different categories </li></ul><ul><li>This task was explored by Java et al. to identify the “Feeds that Matter” </li></ul><ul><li>(Akshay Java, Pranam Kolari, Tim Finin, Anupam Joshi, and Tim Oates. Feeds That Matter: A Study of Bloglines Subscriptions. In Proceedings of the International Conference on Weblogs and Social Media (ICWSM 2007). Computer Science and Electrical Engineering, University of Maryland, Baltimore County, March 2007) </li></ul><ul><li>The TREC 2007 blog track defines a new task called the feed distillation task </li></ul><ul><li>Feed distillation, as defined in TREC 2007, is the task of identifying blogs with recurrent interest in a given topic </li></ul>
  57. 57. Thesis Context <ul><li>Presents two techniques for community analysis </li></ul><ul><li>Most of the existing approaches to community detection are based on link analysis and ignore the folksonomy meta-data that is easily available in social media </li></ul><ul><li>Presents a novel method to combine the link analysis for community detection with information available in tags and folksonomies, yielding more accurate communities </li></ul>
  58. 58. Influence & Trust <ul><li>Influence on the Web is often a function of topic </li></ul><ul><li>Measures of a blog’s authority are mostly based on the number of inlinks </li></ul><ul><li>This can sometimes be slightly misleading since a single post from a popular blogger on any topic may make it the top-most blog for that topic, even if the blog has little to do with the given subject </li></ul>
  59. 59. Discussion <ul><li>The broader impact of this work is to understand online, human communications and study how various elements of social media tools and platforms facilitate this goal </li></ul><ul><li>The study spans a period of three years and is a snapshot into the World Wide Web’s changing landscape </li></ul><ul><li>Sees the emergence of social media and its mainstream adoption as a key factor that has brought about a substantial change in how we interact with each other </li></ul><ul><li>The study found that blogs are an important component of social media </li></ul><ul><li>The goal has been to understand social behavior through the Web </li></ul>
  60. 60. Discussion <ul><li>The approach of the study takes a simplistic view of a community </li></ul><ul><li>Defines a community as a set of nodes that have more links to each other than the rest of the network </li></ul>
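The working definition in the slide above can be turned into a small check. The sketch below reads “more links to each other than the rest of the network” as “more internal links than links crossing to outside nodes”, which is one possible interpretation; the function name and the sample graph are hypothetical.

```python
def is_community(nodes, edges):
    """True if the node set has more internal links than links leaving it."""
    members = set(nodes)
    internal = external = 0
    for u, v in edges:
        inside = (u in members) + (v in members)
        if inside == 2:
            internal += 1      # both endpoints are members
        elif inside == 1:
            external += 1      # the link crosses the community boundary
    return internal > external


# Hypothetical undirected graph: a triangle a-b-c with a tail c-d-e.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")]
triangle_ok = is_community({"a", "b", "c"}, edges)
pair_ok = is_community({"c", "d"}, edges)
```

The triangle has three internal links and one crossing link, so it passes; the pair `c`, `d` has one internal link and three crossing links, so it does not. A yes/no test like this assigns each node to a set outright, which is why partial membership is raised as further research.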
  61. 61. Further Research <ul><li>Discovering partial membership and multi-dimensional communities is a challenging problem and something worth investigating further </li></ul>
  62. 62. Blog Search Implications <ul><li>As social media content becomes even more pervasive, more Web search engine queries also return a number of blog posts within their results </li></ul><ul><li>It is an open question as to how this affects Web search ranking </li></ul>