CSE509 Lecture 5

Lecture 5 of CSE509:Web Science and Technology Summer Course

  • The past decade has witnessed the emergence of the participatory Web and social media, bringing people together in many creative ways. Millions of users are playing, tagging, working, and socializing online, demonstrating new forms of collaboration, communication, and intelligence that were hardly imaginable just a short time ago. Social media also helps reshape business models, sway opinions and emotions, and opens up numerous possibilities to study human interaction and collective behavior on an unparalleled scale. This lecture, from a data mining perspective, introduces characteristics of social media, reviews representative tasks of computing with social media, and illustrates associated challenges.
  • In traditional media such as TV, radio, movies, and newspapers, it is only a small number of “authorities” or “experts” who decide which information should be produced and how it is distributed. The majority of users are consumers who are separated from the production process. The communication pattern in the traditional media is one-way traffic, from a centralized producer to widespread consumers. This new type of mass publication enables the production of timely news and grassroots information and leads to mountains of user-generated contents, forming the wisdom of crowds.
  • Twitter: a directed graph. Facebook: an undirected graph. In Twitter, for example, one user x follows another user y, but user y does not necessarily follow user x. In this case, the follower-followee network is directed and asymmetrical.
  • a linear relationship between the logarithms of the variables
  • the number of connections between one’s friends over the total number of possible connections among them
  • Previously: email communication networks, instant messaging networks, mobile call networks, friendship networks. Other forms of complex networks: coauthorship or citation networks, biological networks, metabolic pathways, genetic regulatory networks, and food webs. These large-scale networks, combined with the unique characteristics of social media, present novel challenges for mining social media.
    In reality, multiple relationships can exist between individuals. Two persons can be friends and colleagues at the same time. Thus, a variety of interactions exist between the same set of actors in a network. Multiple types of entities can also be involved in one network. For many social bookmarking and media sharing sites, users, tags, and content are intertwined with each other, leading to heterogeneous entities in one network. Analysis of these heterogeneous networks involving heterogeneous entities or interactions requires new theories and tools.
    Social media emphasizes timeliness. For example, in content sharing sites and the blogosphere, people quickly lose their interest in most shared contents and blog posts. This differs from classical web mining. New users join in, new connections are established between existing members, and senior users become dormant or simply leave. How can we capture the dynamics of individuals in networks? Can we find the die-hard members that are the backbone of communities? Can they determine the rise and fall of their communities?
    In social media, people tend to share their connections. The wisdom of crowds, in the form of tags, comments, reviews, and ratings, is often accessible. This meta information, in conjunction with user interactions, might be useful for many applications. It remains a challenge to effectively employ social connectivity information and collective intelligence to build social computing applications.
    A research barrier concerning mining social media is evaluation. In traditional data mining, we are used to the training-testing model of evaluation. It differs in social media. Since many social media sites are required to protect user privacy information, limited benchmark data is available. Another frequently encountered problem is the lack of ground truth for many social computing tasks, which further hinders comparative study of different works. Without ground truth, how can we conduct fair comparison and evaluation?
    Slide 7-11
  • Transcript

    • 1. CSE509: Introduction to Web Science and Technology
      Lecture 5: Social Network Analysis
      Arjumand Younus
      Web Science Research Group
      Institute of Business Administration (IBA)
    • 2. Last Time…
      Web Data Explosion
      Part I
      MapReduce Basics
      MapReduce Example and Details
      MapReduce Case-Study: Web Crawler based on MapReduce Architecture
      Part II
      Large-Scale File Systems
      Google File System Case-Study
      August 06, 2011
    • 3. Today
      Transition from Web 1.0 to Web 2.0
      Social Media Characteristics
      Part I: Theoretical Aspects
      Social Networks as a Graph
      Properties of Social Networks
      Part II: Getting Hands-On Experience on Social Media Analytics
      Twitter Data Hacks
      Part III: Example Research
    • 4. Quick Survey
      Do you have a Facebook, MySpace, Twitter, or LinkedIn account?
      Do you own a blog?
      Do you read blogs?
      Have you ever searched for something on Wikipedia?
      Have you ever submitted content to a social network?
    • 5. Web 1.0 vs. Web 2.0
      Borrowed from SIGKDD 2008 tutorial slides of Professor Huan Liu and Professor Nitin Agarwal with permission
    • 6. What is so Different about Web 2.0?
      User Generated Content
      Collaborative Environment: Participatory Web, Citizen Journalism
      User is the Driving Factor
      A Paradigm Shift rather than a Technology Shift
    • 7. Top 20 Most Visited Web Sites
      Internet traffic report by Alexa on July 29th 2008
      Borrowed from SIGKDD 2008 tutorial slides of Professor Huan Liu and Professor Nitin Agarwal with permission
    • 8. Various forms of Social Media
      Blog: Wordpress, blogspot, LiveJournal
      Forum: Yahoo! Answers, Epinions
      Media Sharing: Flickr, YouTube, Scribd
      Microblogging: Twitter, FourSquare
      Social Networking: Facebook, LinkedIn, Orkut
      Social Bookmarking: Del.icio.us, Diigo
      Wikis: Wikipedia, scholarpedia, AskDrWiki
    • 9. Characteristics of Social Media
      “Consumers” become “Producers”
      Rich User Interaction
      User-Generated Contents
      Collaborative environment
      Collective Wisdom
      Long Tail
      Broadcast Media: Filter, then Publish
      Social Media: Publish, then Filter
    • 11. PART I: Theoretical Aspects
    • 12. Networks and Representation
      Social Network: A social structure made of nodes (individuals or organizations) and edges that connect nodes in various relationships such as friendship, kinship, etc.
      Graph Representation
      Matrix Representation
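The two representations named on this slide can be sketched in a few lines of Python. This is only an illustration: the four-node friendship network and its node names are made up, and an undirected network is assumed, so every edge is stored in both directions.

```python
# Hypothetical toy friendship network; node names are invented for illustration.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]

nodes = sorted({name for edge in edges for name in edge})
index = {name: i for i, name in enumerate(nodes)}

# Graph (adjacency-list) representation: node -> set of neighbours.
adjacency_list = {name: set() for name in nodes}
for u, v in edges:
    adjacency_list[u].add(v)
    adjacency_list[v].add(u)  # undirected, so store both directions

# Matrix representation: entry [i][j] is 1 when nodes i and j are connected.
adjacency_matrix = [[0] * len(nodes) for _ in nodes]
for u, v in edges:
    adjacency_matrix[index[u]][index[v]] = 1
    adjacency_matrix[index[v]][index[u]] = 1  # symmetric for undirected graphs

for name, row in zip(nodes, adjacency_matrix):
    print(name, row)
```

For a directed network such as Twitter's follower graph, only one direction would be stored and the matrix would no longer be symmetric.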
    • Properties of Large-Scale Networks
      Networks in social media are typically huge, involving millions of actors and connections
      Large-scale networks in the real world demonstrate similar patterns
      Scale-free Distributions
      Small-world Effect
      Strong Community Structure
    • 14. Scale-Free Distributions
      Degree distribution in large-scale networks often follows a power law.
      A.k.a. long tail distribution, scale-free distribution
      [Figure: degree distribution plot; axes labeled Degrees and Nodes]
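A power law shows up as a straight line when both axes are logarithmic, so one rough check is to fit a line to log(count) against log(degree). The sketch below does that with ordinary least squares. The degree counts are synthetic (generated from an exact power law with exponent 2, purely as an assumption for illustration), and `loglog_slope` is a hypothetical helper name.

```python
import math

def loglog_slope(degree_counts):
    """Least-squares slope of log(count) against log(degree)."""
    xs = [math.log(degree) for degree in degree_counts]
    ys = [math.log(count) for count in degree_counts.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Synthetic data, assumed for illustration: count = 10000 * degree^-2.
degree_counts = {k: 10000.0 * k ** -2 for k in (1, 2, 4, 8, 16, 32)}

slope = loglog_slope(degree_counts)
print(round(slope, 3))  # the negated slope estimates the power-law exponent
```

On real network data the fit is noisy, especially in the tail, so a line fit like this is a quick sanity check rather than a rigorous test for a power law.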
    • 15. Small-World Effect
      “Six Degrees of Separation”
      A famous experiment conducted by Travers and Milgram (1969)
      Each subject was asked to send a chain letter to an acquaintance in order to reach a target person
      The average path length is around 5.5
      Verified on a planetary-scale IM network of 180 million users (Leskovec and Horvitz 2008)
      The average path length is 6.6
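The average path length quoted above is the mean shortest-path distance over all pairs of connected nodes. A minimal sketch using breadth-first search follows. The six-person ring and the extra shortcut are hypothetical data, chosen to show how a single long-range tie shortens paths.

```python
from collections import deque

def shortest_path_lengths(graph, source):
    """Breadth-first search: hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def average_path_length(graph):
    """Mean shortest-path length over all ordered pairs of distinct, connected nodes."""
    total = 0
    pairs = 0
    for source in graph:
        for target, d in shortest_path_lengths(graph, source).items():
            if target != source:
                total += d
                pairs += 1
    return total / pairs

# Hypothetical ring of six people, each knowing only two neighbours.
ring = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}

# The same ring with one long-range acquaintance between persons 0 and 3.
shortcut = {person: set(friends) for person, friends in ring.items()}
shortcut[0].add(3)
shortcut[3].add(0)

print(average_path_length(ring), average_path_length(shortcut))
```

Running one BFS per node is fine for toy graphs. On networks with millions of nodes, path lengths are usually estimated from a sample of source nodes.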
    • 16. Small World Facebook Experiment by Yahoo! Labs
      Anyone in the world can get a message to anyone else in just "six degrees of separation" by passing it from friend to friend. Sociologists have tried to prove (or disprove) this claim for decades, but it is still unresolved.
      http://smallworld.sandbox.yahoo.com/
    • 17. Community Structure
      Community: People in a group interact with each other more frequently than those outside the group
      Friends of a friend are likely to be friends as well
      Measured by clustering coefficient:
      Density of connections among one’s friends
      Ci = ki/(di*(di-1)/2), where di = number of neighbors of node i and ki = number of edges among node i’s neighbors
    • 18. Clustering Coefficient
      d6 = 4, N6 = {4, 5, 7, 8}
      k6 = 4 as e(4,5), e(5,7), e(5,8), e(7,8)
      C6 = 4/(4*3/2) = 2/3
      Average clustering coefficient
      C = (C1 + C2 + … + Cn)/n
      C = 0.61 for the left network
      In a random graph, the expected coefficient is 14/(9*8/2) ≈ 0.39.
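The calculation for node 6 can be reproduced with a short Python sketch. Only the neighbourhood of node 6 is reconstructed here, from the edges listed on the slide. The rest of the slide's network is not available in this transcript, so the network-wide average is not recomputed.

```python
from itertools import combinations

def clustering_coefficient(graph, node):
    """Fraction of possible links among a node's neighbours that actually exist."""
    neighbours = graph[node]
    d = len(neighbours)
    if d < 2:
        return 0.0  # fewer than two neighbours: no pair can be connected
    k = sum(1 for u, v in combinations(neighbours, 2) if v in graph[u])
    return k / (d * (d - 1) / 2)

# Node 6 with its neighbours {4, 5, 7, 8} and the four edges among them.
edges = [(6, 4), (6, 5), (6, 7), (6, 8), (4, 5), (5, 7), (5, 8), (7, 8)]

graph = {}
for u, v in edges:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)

c6 = clustering_coefficient(graph, 6)
print(c6)
```

The average clustering coefficient is then the mean of this value over all nodes of the full network.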
    • Challenges
      Scalability
      Social networks are often in a scale of millions of nodes and connections
      Traditional network analysis often deals with at most hundreds of subjects
      Heterogeneity
      Various types of entities and interactions are involved
      Evolution
      Timeliness is emphasized in social media
      Collective Intelligence
      How to utilize wisdom of crowds in forms of tags, wikis, reviews
      Evaluation
      Lack of ground truth and of complete information, due to privacy
    • 23. Social Computing Tasks
      Social Computing: a young and vibrant field
      Conferences: KDD, WSDM, WWW, ICML, AAAI/IJCAI, SocialCom, etc.
      Tasks
      Centrality Analysis and Influence Modeling
      Community Detection
      Classification and Recommendation
      Privacy, Spam and Security
    • 24. Centrality Analysis and Influence Modeling
      Centrality Analysis:
      Identify the most important actors or edges
      E.g. PageRank in Google
      Various other criteria
      Influence modeling:
      How is information diffused?
      How do people influence each other?
      Related Problems
      Viral marketing: word-of-mouth effect
      Influence maximization
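As a concrete instance of centrality analysis, the sketch below runs a basic PageRank power iteration on a tiny directed "follows" graph. It is a simplified textbook variant, not Google's production algorithm. The user names, the damping factor of 0.85, and the fixed iteration count are all assumptions made for illustration.

```python
def pagerank(out_links, damping=0.85, iterations=100):
    """Basic PageRank by power iteration; out_links maps node -> list of targets."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the same small "teleport" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in out_links.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A node without out-links spreads its rank evenly over all nodes.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank

# Hypothetical follower graph: an edge x -> y means "x follows y".
follows = {
    "alice": ["carol"],
    "bob": ["carol"],
    "carol": ["alice"],
    "dave": ["carol", "alice"],
}

ranks = pagerank(follows)
most_central = max(ranks, key=ranks.get)
print(most_central, round(ranks[most_central], 3))
```

Every target must also appear as a key of `out_links`. The scores sum to 1, so they can be read as the share of attention each user receives.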
    • 25. Community Detection
      A community is a set of nodes between which the interactions are (relatively) frequent
      A.k.a., group, cluster, cohesive subgroups, modules
      Applications: Recommendation based on communities, Network Compression, Visualization of a huge network
      New lines of research in social media
      Community Detection in Heterogeneous Networks
      Community Evolution in Dynamic Networks
      Scalable Community Detection in Large-Scale Networks
    • 26. Classification and Recommendation
      Common in social media applications
      Tag suggestion, Product/Friend/Group Recommendation
      Link prediction
      Network-Based Classification
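Link prediction for friend recommendation can be sketched with a simple neighbourhood score. The example below ranks pairs that are not yet connected by the Jaccard similarity of their friend sets. The slide does not prescribe a method, so this is one common baseline chosen for illustration, and the friendship data is hypothetical.

```python
from itertools import combinations

def jaccard_scores(graph):
    """Score each unconnected pair by |common friends| / |all friends of either|."""
    scores = {}
    for u, v in combinations(sorted(graph), 2):
        if v in graph[u]:
            continue  # already friends, nothing to predict
        union = graph[u] | graph[v]
        if union:
            scores[(u, v)] = len(graph[u] & graph[v]) / len(union)
    return scores

# Hypothetical undirected friendship network.
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"bob", "cat", "eve"},
    "eve": {"dan"},
}

scores = jaccard_scores(friends)
best_pair = max(scores, key=scores.get)
print(best_pair, scores[best_pair])
```

The highest-scoring pair is the one to recommend first. It reflects the earlier observation that friends of a friend are likely to be friends as well.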
    • 27. Privacy, Spam and Security
      Privacy is a big concern in social media
      Facebook and Google Buzz often appear in debates about privacy
      Netflix Prize sequel cancelled due to privacy concerns
      Simple anonymization does not necessarily protect privacy
      Spam blogs (splogs), spam comments, fake identities, etc., all require new techniques
      As private information is involved, a secure and trustable system is critical
      Need to achieve a balance between sharing and privacy
    • 28. PART II: Practical SNA with Twittersphere Mining
    • 29. Pre-Requisites
      Python is assumed to be installed, and you are expected to have some hands-on experience with it
      Dependencies
      easy_install
      networkx
      twitter (Twitter API for Python)
      For Windows users
      Install ActivePython: comes bundled with easy_install
      easy_install networkx
      easy_install twitter
      For Linux users
      sh setuptools-0.6c11-py2.6.egg
      sudo easy_install networkx
      sudo easy_install twitter
    • 30. Getting Tweets from Twitter Search API
      import twitter
      import json
      # Client for the Twitter Search API
      twitter_search=twitter.Twitter(domain="search.twitter.com")
      search_results=[]
      # Fetch the first five result pages, 100 tweets per page
      for page in range(1,6):
          search_results.append(twitter_search.search(q="pakistan",rpp=100,page=page))
      print json.dumps(search_results, sort_keys=True, indent=1)
      # Keep only the text of each tweet
      tweets=[r['text'] for result in search_results for r in result['results']]
      print tweets
    • 31. Lexical Diversity for Tweets
      words=[]
      for t in tweets:
          words+= [w for w in t.split()]
      # Ratio of distinct words to total words
      lexical_diversity=1.0*len(set(words))/len(words)
    • 32. What People are Tweeting: Frequency Analysis
      import nltk
      freq_dist=nltk.FreqDist(words)
      # 50 most frequent words
      freq_dist.keys()[:50]
      # 50 least frequent words
      freq_dist.keys()[-50:]
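Lexical diversity and word frequencies can also be tried offline, without NLTK or a live Twitter connection. The sketch below uses `collections.Counter` from the Python 3 standard library in place of `nltk.FreqDist`. The three sample tweets are made up for illustration.

```python
from collections import Counter

# Hypothetical sample tweets standing in for real search results.
tweets = [
    "RT @newsdesk: Floods hit Karachi",
    "Floods hit Karachi again via @citywatch",
    "Cricket fans celebrate in Lahore",
]

# Split every tweet on whitespace and pool the words.
words = [word for tweet in tweets for word in tweet.split()]

# Lexical diversity: distinct words divided by total words.
lexical_diversity = len(set(words)) / len(words)

# Frequency analysis: Counter plays the role of nltk.FreqDist here.
freq_dist = Counter(words)

print(lexical_diversity)
print(freq_dist.most_common(3))
```

A value close to 1 means almost every word is different. A low value means that people are repeating the same words, which is typical when a story is being retweeted heavily.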
    • 33. Extracting Relationships from Tweets (1/3)
      Step 1: Extracting Graph Data
      import networkx as nx
      import re
      import twitter
      g=nx.DiGraph()
      twitter_search=twitter.Twitter(domain="search.twitter.com")
      search_results=[]
      for page in range(1,6):
          search_results.append(twitter_search.search(q="pakistan",rpp=100,page=page))
      all_tweets=[tweet for page in search_results for tweet in page["results"]]
      def get_rt_sources(tweet):
          # Match "RT" or "via" followed by one or more @mentions
          rt_patterns=re.compile(r"(RT|via)((?:\b\W*@\w+)+)",re.IGNORECASE)
          return [mentions.strip() for _, mentions in rt_patterns.findall(tweet)]
      for tweet in all_tweets:
          rt_sources=get_rt_sources(tweet["text"])
          if not rt_sources:
              continue
          for rt_source in rt_sources:
              # Edge from the retweeted source to the user who retweeted it
              g.add_edge(rt_source,tweet["from_user"],tweet_id=tweet["id"])
    • 34. Extracting Relationships from Tweets (2/3)
      Step 2: Generating DOT File
      OUT = "pakistan_search_results.dot"
      dot=['"%s" -> "%s" [tweet_id=%s]' % (n1.encode('utf-8'), n2.encode('utf-8'), g[n1][n2]['tweet_id']) for n1, n2 in g.edges()]
      f=open(OUT, 'w')
      f.write('strict digraph {\n%s\n}' % (';\n'.join(dot),))
      f.close()
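Steps 1 and 2 can be rehearsed offline before pointing them at live data. The sketch below is a Python 3, standard-library-only variant: it uses the same retweet pattern, but keeps the edges in a plain list instead of a networkx graph and returns the DOT text as a string instead of writing a file. The sample tweets, user names, and the `to_dot` helper are hypothetical.

```python
import re

# "RT" or "via", followed by one or more @mentions.
RT_PATTERN = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)

def get_rt_sources(text):
    """Return the @mention text that follows each RT/via marker."""
    return [mentions.strip() for _, mentions in RT_PATTERN.findall(text)]

def to_dot(edges):
    """Render (source, retweeter, tweet_id) triples as a DOT digraph string."""
    lines = ['"%s" -> "%s" [tweet_id=%s]' % edge for edge in edges]
    return "strict digraph {\n%s\n}" % ";\n".join(lines)

# Hypothetical tweets shaped like the search results used above.
sample_tweets = [
    {"id": 1, "from_user": "reader_one", "text": "RT @newsdesk: Floods hit Karachi"},
    {"id": 2, "from_user": "reader_two", "text": "Floods hit Karachi again via @citywatch"},
    {"id": 3, "from_user": "reader_three", "text": "Cricket fans celebrate in Lahore"},
]

# One edge per retweet, pointing from the original source to the retweeter.
edges = [
    (source, tweet["from_user"], tweet["id"])
    for tweet in sample_tweets
    for source in get_rt_sources(tweet["text"])
]

dot_text = to_dot(edges)
print(dot_text)
```

The third tweet contains no retweet marker, so it contributes no edge. The printed text can be saved to a `.dot` file and rendered with Graphviz.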
    • 35. Extracting Relationships from Tweets (3/3)
      Step 3: Visualizing the Retweet Data in Graphical Form
      For Windows users
      For Linux users
      circo -Tpng -o pakistan_search_results.png pakistan_search_results.dot
    • 36. PART III: Example Research
    • 37. Million Follower Fallacy (New York Times)
    • 38. Twitter: More a News Medium than a Social Network (PC World)
    • 39. Twitter for World Peace (Business Week)
    • 40. SocialFlow: Social Media Optimization
      Social Media Optimization Platform
      Works in Domains of Viral and Word-of-Mouth Marketing
      Provides Services to Major Media Outlets
      Recent study
      How different audiences consumed and rebroadcast the messages that news organizations were sending out: Al Jazeera English, BBC News, CNN, The Economist, Fox News, and The New York Times
    • 41. Twitter as a Real-Time News Analysis Service
    • 42. Studying Ins and Outs of News
      Using Twitter to study hot news items people are heavily tweeting about
    • 43. Algorithm for Identification of Popular News
    • 44. Application Prototype
    • 45. Observations (1/3)
      Percentage of news in tweets per day is greater than 50% for all days except one
    • 46. Observations (2/3)
      Highest Number of Recorded Tweets per Day
    • 47. Observations (3/3)