
Mining the Social Web - Lecture 2 - T61.6020

Slides for the class "Mining the Social Web".
This set of slides was created by Aristides Gionis.

Published in: Education

  1. mining the social web. Aris Gionis, Michael Mathioudakis. Mon, Feb 2, lecture #2: structure and dynamics of social networks
  2. T-61.6020: Mining the social web, lecture #2. class web page in Piazza: https://piazza.com/aalto.fi/spring2015/t616020/home. share resources and also use it as a discussion forum. sensible posts: looking for a project mate; looking for a project mate on idea X; does anyone know how to access data Y?; has anyone seen some analysis on data Z?; … or just anything else
  3. today’s themes: analysis of the structure and dynamics of social networks. what do social networks look like? how do social networks evolve over time? how do people in social networks behave and interact? how does information spread in social networks and social media? who is influential? what is the interplay between structure and content?
  4. objectives in today’s presentation: focus on one particular topic; review some “classic” papers in the literature; ideas for projects. assess the presented papers: what is the main idea? what is the novelty? why did they have impact?
  5. criteria to evaluate the research projects: originality (has it been done before?); potential impact (how interesting is it, and why?); rigorousness and technical novelty; reproducibility; presentation
  6. structure of social networks: social networks and social-media data can be represented as graphs (or networks). what do these graphs look like? what is their structure? data contain additional information (actions, interactions, dynamics, attributes, …); mining this additional information as part of the network structure
  7. contrast against random graphs: random graph model by Erdős-Rényi, with edges independently drawn with probability p. real-world networks do not look like random graphs; also, random graphs are static.
     property | random graphs | real-world networks
     degree distribution | binomial | power law
     hubs | no | yes
     triangle coefficient | no | yes
     clusters | no | yes
     diameter | small | small
     giant component | yes | yes
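The contrast on this slide can be checked on a small scale. The following is a minimal sketch, assuming a plain G(n, p) sampler written from scratch; the function names `erdos_renyi` and `degree_histogram` and the values n = 500, p = 0.02 are illustrative choices, not taken from the lecture.

```python
import random
from collections import Counter


def erdos_renyi(n, p, seed=0):
    """Sample an undirected G(n, p) graph: each of the n*(n-1)/2
    possible edges is included independently with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj


def degree_histogram(adj):
    """Map each degree value to the number of vertices having it."""
    return Counter(len(neighbors) for neighbors in adj.values())


# Illustrative parameters (assumptions): expected degree is p * (n - 1), about 10.
graph = erdos_renyi(n=500, p=0.02, seed=1)
hist = degree_histogram(graph)
mean_degree = sum(d * c for d, c in hist.items()) / len(graph)
max_degree = max(hist)
```

In such a sample the degrees concentrate around the mean and no vertex is far above it, which is the "no hubs" entry for random graphs; a real-world network of the same size and density would typically show a few vertices with much larger degree.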
  8. graph generation models: a large number of graph-generation models have been proposed (preferential-attachment model, copy model, Watts-Strogatz model), each typically trying to capture some property of the data. beyond the scope of this class and the project
  9. arXiv:0810.1355v1 [cs.DS] 8 Oct 2008. Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters. Jure Leskovec, Kevin J. Lang, Anirban Dasgupta, Michael W. Mahoney. Abstract: A large body of work has been devoted to defining and identifying clusters or communities in social and information networks, i.e., in graphs in which the nodes represent underlying social entities and the edges represent some sort of interaction between pairs of nodes. Most such research begins with the premise that a community or a cluster should be thought of as a set of nodes that has more and/or better connections between its members than to the remainder of the network. In this paper, we explore from a novel perspective several questions related to identifying meaningful communities in large social and information networks, and we come to several striking conclusions. Rather than defining a procedure to extract sets of nodes from a graph and then attempt to interpret these sets as “real” communities, we employ approximation algorithms for the graph partitioning problem to characterize as a function of size the statistical and structural properties of partitions of graphs that could plausibly be interpreted as communities. In particular, we define the network community profile plot, which characterizes the “best” possible community, according to the conductance measure, over a wide range of size scales. We study over 100 large real-world networks, ranging from traditional and on-line social networks, to technological and information networks and web graphs, and ranging in size from thousands up to tens of millions of nodes. Our results suggest a significantly more refined picture of community structure in large networks than has been appreciated previously. Our observations agree with previous work on small networks, but we show that large networks have a very different structure.
     In particular, we observe tight communities that are barely connected to the rest of the network at very small size scales (up to ≈ 100 nodes); and communities of size scale beyond ≈ 100 nodes gradually “blend into” the expander-like core of the network and thus become less “community-like,” with a roughly inverse relationship between community size and optimal community quality. This observation agrees well with the so-called Dunbar number which gives a limit to the size of a well-functioning community. However, this behavior is not explained, even at a qualitative level, by any of the commonly-used network generation models. Moreover, it is exactly the opposite of what one would expect based on intuition from expander graphs, low-dimensional or manifold-like graphs, and from small social networks that have served as testbeds of community detection algorithms. The relatively gradual increase of the network community profile plot as a function of increasing community size depends in a subtle manner on the way in which local clustering information is propagated from smaller to larger size scales in the network. We have found that a generative graph model, in which new edges are added via an iterative “forest fire” burning process, is able to produce graphs exhibiting a network community profile plot similar to what we observe in our network datasets.
  10. community structure in social networks. hypothesis: social networks have well-formed communities. loose definition of community: a set of vertices densely connected to each other and sparsely connected to the rest of the graph. artificial communities: http://projects.skewed.de/graph-tool/
  11. community structure in social networks: the authors study community structure in an extensive collection of real-world networks and introduce the network community profile (NCP) plot, which characterizes the best possible community over a range of size scales
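To make the NCP idea concrete, here is a minimal sketch assuming the usual definition of conductance (edges leaving the set divided by the smaller of the two volumes) and an exhaustive search over vertex sets. The paper relies on approximation algorithms for graph partitioning; the brute-force loop below is only workable for toy graphs, and the names `conductance`, `ncp_bruteforce`, and `toy` are illustrative.

```python
from itertools import combinations


def conductance(adj, subset):
    """Conductance of a vertex set S: number of edges leaving S divided by
    the smaller of vol(S) and vol(V \\ S), where vol is the sum of degrees."""
    subset = set(subset)
    cut = sum(1 for u in subset for v in adj[u] if v not in subset)
    vol_in = sum(len(adj[u]) for u in subset)
    vol_out = sum(len(adj[u]) for u in adj if u not in subset)
    return cut / min(vol_in, vol_out)


def ncp_bruteforce(adj):
    """For each size k up to n/2, the lowest conductance of any k-vertex set.
    Exhaustive search: exponential time, for illustration only."""
    nodes = sorted(adj)
    return {
        k: min(conductance(adj, s) for s in combinations(nodes, k))
        for k in range(1, len(nodes) // 2 + 1)
    }


# Toy graph (an assumption for illustration): two triangles joined by one edge.
toy = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
profile = ncp_bruteforce(toy)
```

On this toy graph the best community of size 3 is one of the triangles, with a single cut edge and conductance 1/7; lower values in the profile mean more community-like sets.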
  12. community structure in social networks: dolphins network and its NCP (source: [Leskovec et al., 2009], via Frieze, Gionis, Tsourakakis, Algorithmic Techniques for Modeling and Mining Large Graphs)
  13. community structure in social networks: NCP of a DBLP co-authorship graph (source: [Leskovec et al., 2009]). do large-scale real-world networks have such nice artificial structure? NO!
  14. datasets (publicly available in SNAP), from Leskovec, Lang, Dasgupta, and Mahoney, Community structure in large networks.
      Table 1: Network datasets we analyzed. Statistics of networks we consider: number of nodes N; number of edges E; fraction of nodes not in whiskers (size of largest biconnected component) Nb/N; fraction of edges in biconnected component Eb/E; average degree d̄ = 2E/N; second order average degree d̃; average clustering coefficient C̄; diameter D; and average path length D̄.
      Network | N | E | Nb | Eb | d̄ | d̃ | C̄ | D | D̄ | Description
      Social networks
      Delicious | 147,567 | 301,921 | 0.40 | 0.65 | 4.09 | 48.44 | 0.30 | 24 | 6.28 | del.icio.us collaborative tagging social network
      Epinions | 75,877 | 405,739 | 0.48 | 0.90 | 10.69 | 183.88 | 0.26 | 15 | 4.27 | Who-trusts-whom network from epinions.com [142]
      Flickr | 404,733 | 2,110,078 | 0.33 | 0.86 | 10.43 | 442.75 | 0.40 | 18 | 5.42 | Flickr photo sharing social network [101]
      LinkedIn | 6,946,668 | 30,507,070 | 0.47 | 0.88 | 8.78 | 351.66 | 0.23 | 23 | 5.43 | Social network of professional contacts
      LiveJournal01 | 3,766,521 | 30,629,297 | 0.78 | 0.97 | 16.26 | 111.24 | 0.36 | 23 | 5.55 | Friendship network of a blogging community [20]
      LiveJournal11 | 4,145,160 | 34,469,135 | 0.77 | 0.97 | 16.63 | 122.44 | 0.36 | 23 | 5.61 | Friendship network of a blogging community [20]
      LiveJournal12 | 4,843,953 | 42,845,684 | 0.76 | 0.97 | 17.69 | 170.66 | 0.35 | 20 | 5.53 | Friendship network of a blogging community [20]
      Messenger | 1,878,736 | 4,079,161 | 0.53 | 0.78 | 4.34 | 15.40 | 0.09 | 26 | 7.42 | Instant messenger social network
      Email-All | 234,352 | 383,111 | 0.18 | 0.50 | 3.27 | 576.87 | 0.50 | 14 | 4.07 | Research organization email network (all addresses) [113]
      Email-InOut | 37,803 | 114,199 | 0.47 | 0.82 | 6.04 | 165.73 | 0.58 | 8 | 3.74 | (all addresses but email has to be sent both ways) [113]
      Email-Inside | 986 | 16,064 | 0.90 | 0.99 | 32.58 | 74.66 | 0.45 | 7 | 2.60 | (only emails inside the research organization) [113]
      Email-Enron | 33,696 | 180,811 | 0.61 | 0.90 | 10.73 | 142.36 | 0.71 | 13 | 3.99 | Enron email dataset [100]
      Answers | 488,484 | 1,240,189 | 0.45 | 0.78 | 5.08 | 251.78 | 0.11 | 22 | 5.72 | Yahoo Answers social network
      Answers-1 | 26,971 | 91,812 | 0.56 | 0.87 | 6.81 | 59.17 | 0.08 | 16 | 4.49 | Cluster 1 from Yahoo Answers
      Answers-2 | 25,431 | 65,551 | 0.48 | 0.80 | 5.16 | 56.57 | 0.10 | 15 | 4.76 | Cluster 2 from Yahoo Answers
      Answers-3 | 45,122 | 165,648 | 0.53 | 0.87 | 7.34 | 417.83 | 0.21 | 15 | 3.94 | Cluster 3 from Yahoo Answers
      Answers-4 | 93,971 | 266,199 | 0.49 | 0.82 | 5.67 | 94.48 | 0.08 | 16 | 4.91 | Cluster 4 from Yahoo Answers
      Answers-5 | 5,313 | 11,528 | 0.41 | 0.73 | 4.34 | 29.55 | 0.12 | 14 | 4.75 | Cluster 5 from Yahoo Answers
      Answers-6 | 290,351 | 613,237 | 0.40 | 0.71 | 4.22 | 57.16 | 0.09 | 22 | 5.92 | Cluster 6 from Yahoo Answers
      Information (citation) networks
      Cit-Patents | 3,764,105 | 16,511,682 | 0.82 | 0.96 | 8.77 | 21.34 | 0.09 | 26 | 8.15 | Citation network of all US patents [112]
      Cit-hep-ph | 34,401 | 420,784 | 0.96 | 1.00 | 24.46 | 63.50 | 0.30 | 14 | 4.33 | Citations between physics (arxiv hep-ph) papers [78]
      Cit-hep-th | 27,400 | 352,021 | 0.94 | 0.99 | 25.69 | 106.40 | 0.33 | 15 | 4.20 | Citations between physics (arxiv hep-th) papers [78]
      Blog-nat05-6m | 29,150 | 182,212 | 0.74 | 0.96 | 12.50 | 342.51 | 0.24 | 10 | 3.40 | Blog citation network (6 months of data) [116]
      Blog-nat06all | 32,384 | 315,713 | 0.87 | 0.99 | 19.50 | 153.08 | 0.20 | 18 | 3.94 | Blog citation network (1 year of data) [116]
      Post-nat05-6m | 238,305 | 297,338 | 0.21 | 0.34 | 2.50 | 39.51 | 0.13 | 45 | 10.34 | Blog post citation network (6 months) [116]
      Post-nat06all | 437,305 | 565,072 | 0.22 | 0.38 | 2.58 | 35.54 | 0.11 | 54 | 10.48 | Blog post citation network (1 year) [116]
      Collaboration networks
      AtA-IMDB | 883,963 | 27,473,042 | 0.87 | 0.99 | 62.16 | 517.40 | 0.79 | 15 | 3.48 | IMDB actor collaboration network from Dec 2007
      CA-astro-ph | 17,903 | 196,972 | 0.89 | 0.98 | 22.00 | 65.70 | 0.67 | 14 | 4.21 | Co-authorship in astro-ph of arxiv.org [112]
      CA-cond-mat | 21,363 | 91,286 | 0.81 | 0.93 | 8.55 | 22.47 | 0.70 | 15 | 5.36 | Co-authorship in cond-mat category [112]
      CA-gr-qc | 4,158 | 13,422 | 0.64 | 0.78 | 6.46 | 17.98 | 0.66 | 17 | 6.10 | Co-authorship in gr-qc category [112]
      CA-hep-ph | 11,204 | 117,619 | 0.81 | 0.97 | 21.00 | 130.88 | 0.69 | 13 | 4.71 | Co-authorship in hep-ph category [112]
      CA-hep-th | 8,638 | 24,806 | 0.68 | 0.85 | 5.74 | 12.99 | 0.58 | 18 | 5.96 | Co-authorship in hep-th category [112]
      CA-DBLP | 317,080 | 1,049,866 | 0.67 | 0.84 | 6.62 | 21.75 | 0.73 | 23 | 6.75 | DBLP co-authorship network [20]
      Table 2: Network datasets we analyzed. Columns as in Table 1.
      Network | N | E | Nb | Eb | d̄ | d̃ | C̄ | D | D̄ | Description
      Web graphs
      Web-BerkStan | 319,717 | 1,542,940 | 0.57 | 0.88 | 9.65 | 1,067.55 | 0.32 | 35 | 5.66 | Web graph of Stanford and UC Berkeley [98]
      Web-Google | 855,802 | 4,291,352 | 0.75 | 0.92 | 10.03 | 170.35 | 0.62 | 24 | 6.27 | Web graph Google released in 2002 [3]
      Web-Notredame | 325,729 | 1,090,108 | 0.41 | 0.76 | 6.69 | 280.68 | 0.47 | 46 | 7.22 | Web graph of University of Notre Dame [11]
      Web-Trec | 1,458,316 | 6,225,033 | 0.59 | 0.78 | 8.54 | 682.89 | 0.68 | 112 | 8.58 | Web graph of TREC WT10G web corpus [2]
      Internet networks
      As-RouteViews | 6,474 | 12,572 | 0.62 | 0.80 | 3.88 | 164.81 | 0.40 | 9 | 3.72 | AS from Oregon Exchange BGP Route View [112]
      As-Caida | 26,389 | 52,861 | 0.61 | 0.81 | 4.01 | 281.93 | 0.33 | 17 | 3.86 | CAIDA AS Relationships Dataset
      As-Skitter | 1,719,037 | 12,814,089 | 0.99 | 1.00 | 14.91 | 9,934.01 | 0.17 | 5 | 3.44 | AS from traceroutes run daily in 2005 by Skitter
      As-Newman | 22,963 | 48,436 | 0.65 | 0.83 | 4.22 | 261.46 | 0.35 | 11 | 3.83 | AS graph from Newman [5]
      As-Oregon | 13,579 | 37,448 | 0.72 | 0.90 | 5.52 | 235.97 | 0.46 | 9 | 3.58 | Autonomous systems [1]
      Gnutella-25 | 22,663 | 54,693 | 0.59 | 0.83 | 4.83 | 10.75 | 0.01 | 11 | 5.57 | Gnutella network on March 25 2000 [143]
      Gnutella-30 | 36,646 | 88,303 | 0.55 | 0.81 | 4.82 | 11.46 | 0.01 | 11 | 5.75 | Gnutella P2P network on March 30 2000 [143]
      Gnutella-31 | 62,561 | 147,878 | 0.54 | 0.81 | 4.73 | 11.60 | 0.01 | 11 | 5.94 | Gnutella network on March 31 2000 [143]
      eDonkey | 5,792,297 | 147,829,887 | 0.93 | 1.00 | 51.04 | 6,139.99 | 0.08 | 5 | 3.66 | P2P eDonkey graph for a period of 47 hours in 2004
      Bi-partite networks
      IpTraffic | 2,250,498 | 21,643,497 | 1.00 | 1.00 | 19.23 | 94,889.05 | 0.00 | 5 | 2.53 | IP traffic graph at a single router for 24 hours
      AtP-astro-ph | 54,498 | 131,123 | 0.70 | 0.87 | 4.81 | 16.67 | 0.00 | 28 | 7.78 | Authors-to-papers network of astro-ph [116]
      AtP-cond-mat | 57,552 | 104,179 | 0.65 | 0.79 | 3.62 | 10.54 | 0.00 | 31 | 9.96 | Authors-to-papers network of cond-mat [116]
      AtP-gr-qc | 14,832 | 22,266 | 0.47 | 0.60 | 3.00 | 9.72 | 0.00 | 35 | 11.08 | Authors-to-papers network of gr-qc [116]
      AtP-hep-ph | 47,832 | 86,434 | 0.60 | 0.76 | 3.61 | 16.80 | 0.00 | 27 | 8.55 | Authors-to-papers network of hep-ph [116]
      AtP-hep-th | 39,986 | 64,154 | 0.53 | 0.68 | 3.21 | 13.07 | 0.00 | 36 | 10.74 | Authors-to-papers network of hep-th [116]
      AtP-DBLP | 615,678 | 944,456 | 0.49 | 0.64 | 3.07 | 13.61 | 0.00 | 48 | 12.69 | DBLP authors-to-papers bipartite network
      Spending | 1,831,540 | 2,918,920 | 0.34 | 0.58 | 3.19 | 1,536.35 | 0.00 | 26 | 5.62 | Users-to-keywords they bid
      Hw7 | 653,260 | 2,278,448 | 0.99 | 0.99 | 6.98 | 346.85 | 0.00 | 24 | 6.26 | Downsampled advertiser-query bid graph
      Netflix | 497,959 | 100,480,507 | 1.00 | 1.00 | 403.57 | 28,432.89 | 0.00 | 5 | 2.31 | Users-to-movies they rated. From Netflix prize [4]
      QueryTerms | 13,805,808 | 17,498,668 | 0.28 | 0.41 | 2.53 | 14.92 | 0.00 | 86 | 19.81 | Users-to-queries they submit to a search engine
      Clickstream | 199,308 | 951,649 | 0.39 | 0.87 | 9.55 | 430.74 | 0.00 | 7 | 3.83 | Users-to-URLs they visited [126]
      Biological networks
      Bio-Proteins | 4,626 | 14,801 | 0.72 | 0.91 | 6.40 | 24.25 | 0.12 | 12 | 4.24 | Yeast protein interaction network [51]
      Bio-Yeast | 1,458 | 1,948 | 0.37 | 0.51 | 2.67 | 7.13 | 0.14 | 19 | 6.89 | Yeast protein interaction network data [92]
      Bio-YeastP0.001 | 353 | 1,517 | 0.73 | 0.93 | 8.59 | 20.18 | 0.57 | 11 | 4.33 | Yeast protein-protein interaction map [135]
      Bio-YeastP0.01 | 1,266 | 8,511 | 0.79 | 0.97 | 13.45 | 47.73 | 0.44 | 12 | 3.87 | Yeast protein-protein interaction map [135]
      Third table page (no caption in the transcript). Columns as in Table 1.
      Network | N | E | Nb | Eb | d̄ | d̃ | C̄ | D | D̄ | Description
      Nearly low-dimensional networks
      Road-CA | 1,957,027 | 2,760,388 | 0.80 | 0.85 | 2.82 | 3.17 | 0.06 | 865 | 310.97 | California road network
      Road-USA | 126,146 | 161,950 | 0.97 | 0.98 | 2.57 | 2.81 | 0.03 | 617 | 218.55 | USA road network (only main roads)
      Road-PA | 1,087,562 | 1,541,514 | 0.79 | 0.85 | 2.83 | 3.20 | 0.06 | 794 | 306.89 | Pennsylvania road network
      Road-TX | 1,351,137 | 1,879,201 | 0.78 | 0.84 | 2.78 | 3.15 | 0.06 | 1,064 | 418.73 | Texas road network
      PowerGrid | 4,941 | 6,594 | 0.62 | 0.69 | 2.67 | 3.87 | 0.11 | 46 | 19.07 | Power grid of Western States Power Grid [156]
      Mani-faces7k | 696 | 6,979 | 0.98 | 0.99 | 20.05 | 37.99 | 0.56 | 16 | 5.52 | Faces (64x64 grayscale images) (connect 7k closest pairs)
      Mani-faces4k | 663 | 3,465 | 0.90 | 0.97 | 10.45 | 20.20 | 0.56 | 29 | 8.96 | Faces (connect 4k closest pairs)
      Mani-faces2k | 551 | 1,981 | 0.84 | 0.94 | 7.19 | 12.77 | 0.54 | 32 | 11.07 | Faces (connect 2k closest pairs)
      Mani-facesK10 | 698 | 6,935 | 1.00 | 1.00 | 19.87 | 25.32 | 0.51 | 6 | 3.25 | Faces (connect every node to 10 nearest neighbors)
      Mani-facesK3 | 698 | 2,091 | 1.00 | 1.00 | 5.99 | 7.98 | 0.45 | 9 | 4.89 | Faces (connect every node to 3 nearest neighbors)
      Mani-facesK5 | 698 | 3,480 | 1.00 | 1.00 | 9.97 | 12.91 | 0.48 | 7 | 4.03 | Faces (connect every node to 5 nearest neighbors)
      Mani-swiss200k | 20,000 | 200,000 | 1.00 | 1.00 | 20.00 | 21.08 | 0.59 | 103 | 37.21 | Swiss-roll (connect 200k nearest pairs of nodes)
      Mani-swiss100k | 19,990 | 99,979 | 1.00 | 1.00 | 10.00 | 11.02 | 0.59 | 162 | 58.32 | Swiss-roll (connect 100k nearest pairs of nodes)
      Mani-swiss60k | 19,042 | 57,747 | 0.93 | 0.96 | 6.07 | 7.03 | 0.59 | 243 | 89.15 | Swiss-roll (connect 60k nearest pairs of nodes)
      Mani-swissK10 | 20,000 | 199,955 | 1.00 | 1.00 | 20.00 | 25.38 | 0.56 | 10 | 5.47 | Swiss-roll (every node connects to 10 nearest neighbors)
      Mani-swissK5 | 20,000 | 99,990 | 1.00 | 1.00 | 10.00 | 12.89 | 0.54 | 13 | 8.34 | Swiss-roll (every node connects to 5 nearest neighbors)
      Mani-swissK3 | 20,000 | 59,997 | 1.00 | 1.00 | 6.00 | 7.88 | 0.50 | 17 | 6.89 | Swiss-roll (every node connects to 3 nearest neighbors)
      IMDB Actor-to-Movie graphs
      AtM-IMDB | 2,076,978 | 5,847,693 | 0.49 | 0.82 | 5.63 | 65.41 | 0.00 | 32 | 6.82 | Actors-to-movies graph from IMDB (imdb.com)
      Imdb-top30 | 198,430 | 566,756 | 0.99 | 1.00 | 5.71 | 18.19 | 0.00 | 26 | 8.32 | Actors-to-movies graph heavily preprocessed
      Imdb-raw07 | 601,481 | 1,320,616 | 0.54 | 0.79 | 4.39 | 20.94 | 0.00 | 32 | 8.55 | Country clusters were extracted from this graph
      Imdb-France | 35,827 | 74,201 | 0.51 | 0.76 | 4.14 | 14.62 | 0.00 | 20 | 6.57 | Cluster of French movies
      Imdb-Germany | 21,258 | 42,197 | 0.56 | 0.78 | 3.97 | 13.69 | 0.00 | 34 | 7.47 | German movies (to actors that played in them)
  15. community structure in social networks, main findings: 1. up to a certain size k (k ∼ 100 vertices) there exist good cuts; as the size increases so does the quality of the community. 2. at size k we observe the best possible community; such communities are typically connected to the remainder with a single edge. 3. above size k the community quality decreases, because larger communities blend into the rest of the network and gradually disappear
  16. summary of Community structure in large networks: natural cluster sizes and the absence of large well-defined clusters (Leskovec, Lang, Dasgupta, Mahoney). hypothesis: well-formed and interesting, assumed true. data: very extensive collection. methodology: introduce a new metric (NCP). impact / interestingness: challenged the starting hypothesis. reproducibility: datasets and code publicly available
  17. rating of Community structure in large networks: natural cluster sizes and the absence of large well-defined clusters (Leskovec, Lang, Dasgupta, Mahoney), each criterion on a scale from 1 (low) to 5 (high): originality; impact; rigorousness / technical novelty; reproducibility
  18. arXiv:physics/0603229v3 [physics.soc-ph] 28 Jan 2007. Graph Evolution: Densification and Shrinking Diameters. Jure Leskovec, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA; Jon Kleinberg, Department of Computer Science, Cornell University, Ithaca, NY; Christos Faloutsos, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA. February 2, 2008. Abstract: How do real graphs evolve over time? What are “normal” growth patterns in social, technological, and information networks? Many studies have discovered patterns in static graphs, identifying properties in a single snapshot of a large network, or in a very small number of snapshots; these include heavy tails for in- and out-degree distributions, communities, small-world phenomena, and others. However, given the lack of information about network evolution over long periods, it has been hard to convert these findings into statements about trends over time. Here we study a wide range of real graphs, and we observe some surprising phenomena. First, most of these graphs densify over time, with the number of edges growing super-linearly in the number of nodes. Second, the average distance between nodes often shrinks over time, in contrast to the conventional wisdom that such distance parameters should increase slowly as a function of the number of nodes (like O(log n) or O(log(log n))). Existing graph generation models do not exhibit these types of behavior, even at a qualitative level. We provide a new graph generator, based on a “forest fire” spreading process, that has a simple, intuitive justification, requires very few parameters (like the “flammability” of nodes), and produces graphs exhibiting the full range of properties observed both in prior work and in the
  19. graph evolution and shrinking diameters: networks evolve over time; typically new vertices/edges are added (not many deletions). how do network distances change over time? constant average degree and vertex addition imply diameter = O(log n), slowly increasing, according to the random-graph model and also according to other more “realistic” models, e.g., preferential attachment
  20. graph evolution and shrinking diameters. empirical observation: as networks evolve, distances shrink (e.g., the diameter shrinks). why? the number of edges grows faster than the number of vertices, so the graph becomes denser (graph densification). time-evolving networks, J. Leskovec, J. Kleinberg, C. Faloutsos [Leskovec et al., 2005b]: densification power law $|E_t| \propto |V_t|^{\alpha}$ with $1 \le \alpha \le 2$; shrinking diameters: the diameter is shrinking over time
  21. average degree. graph evolution and shrinking diameters. [Figure 1: The average node out-degree over time. Notice that it increases, in all 4 datasets. That is, all graphs are densifying. Panels: (a) arXiv, (b) Patents, (c) Autonomous Systems, (d) Affiliation network; y-axis: average out-degree; x-axis: year of publication, year granted, or time in days.]
  22. number of edges. graph evolution and shrinking diameters. [Figure 2: Number of edges e(t) versus number of nodes n(t), in log-log scales, for several graphs. All 4 graphs obey the Densification Power Law, with a consistently good fit. Slopes: a = 1.68, 1.66, 1.18, 1.15, 1.12, and 1.11 respectively. Fits printed in the panels, with x the number of nodes: (a) arXiv, Edges = 0.0113 x^1.69, R² = 1.0; (b) Patents, Edges = 0.0002 x^1.66, R² = 0.99; (c) Autonomous Systems, Edges = 0.87 x^1.18, R² = 1.00; (d) Affiliation network, Edges = 0.4255 x^1.15, R² = 1.0; (e) Email network, Edges = 1 x^1.12, R² = 1.00; (f) IMDB actors to movies network, Edges = 0.9 x^1.11, R² = 0.98.]
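A power law is a straight line in log-log scale, so a densification exponent can be estimated with an ordinary least-squares fit on the logarithms of the node and edge counts. The sketch below assumes that approach; the function name `fit_power_law` is illustrative, and the snapshot sizes are synthetic values generated from an exact power law (not data from the paper), used only to check that the fit recovers the exponent.

```python
import math


def fit_power_law(nodes, edges):
    """Fit edges = c * nodes**alpha by least squares on log-log values.
    Returns (c, alpha)."""
    xs = [math.log(n) for n in nodes]
    ys = [math.log(e) for e in edges]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx                      # slope of the log-log line
    c = math.exp(mean_y - alpha * mean_x)  # intercept, mapped back from logs
    return c, alpha


# Synthetic snapshots (an assumption): sizes chosen by hand, edge counts
# generated from an exact power law with exponent 1.69.
snapshot_nodes = [1_000, 2_000, 5_000, 10_000, 20_000]
snapshot_edges = [0.0113 * n ** 1.69 for n in snapshot_nodes]
c_hat, alpha_hat = fit_power_law(snapshot_nodes, snapshot_edges)
```

An exponent of 1 would mean a constant average degree; values above 1 mean the average degree grows as the graph grows, which is what densification refers to.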
  23. effective diameter. graph evolution and shrinking diameters. [Figure 3: The effective diameter over time for 6 different datasets. Notice consistent decrease of the diameter over time. Panels: (a) arXiv citation graph, (b) Affiliation network, (c) Patents citation graph, (d) Autonomous Systems, (e) Email network, (f) IMDB actors to movies network; y-axis: effective diameter; x-axis: time in years or months, or size of the graph in number of nodes for (d); curves: full graph, post-cutoff subgraph, and post-cutoff subgraph with no past; panel (d) shows a linear fit.]
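The effective diameter is less sensitive to a few long chains than the plain diameter. A minimal sketch follows, assuming the common definition (the smallest distance d such that at least 90% of connected vertex pairs are within d hops) and omitting the interpolation between integer distances that smoothed variants use; `bfs_distances`, `effective_diameter`, and the path graph are illustrative.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from source to every vertex reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def effective_diameter(adj, percent=90):
    """Smallest integer d such that at least `percent` % of the connected
    ordered vertex pairs are at distance <= d (no interpolation)."""
    counts = {}
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                counts[d] = counts.get(d, 0) + 1
    total = sum(counts.values())
    reached = 0
    for d in sorted(counts):
        reached += counts[d]
        if 100 * reached >= percent * total:  # integer comparison, no rounding
            return d
    return 0  # graph with no connected pairs


# Toy example (an assumption): a path on 5 vertices, 0 - 1 - 2 - 3 - 4.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
path_effective = effective_diameter(path)
path_full = effective_diameter(path, percent=100)
```

On the path, 9 of the 10 vertex pairs are within 3 hops, so the 90% effective diameter is 3 while the full diameter is 4. Running one breadth-first search per vertex is only practical for small graphs; large graphs need sampling or approximation.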
  24. graph evolution and shrinking diameters. theoretical justification: the authors propose a graph-evolution model, the forest fire model (FF), that explains the empirical findings (graph densification and shrinking diameters)
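The flavor of a forest-fire process can be sketched in a few lines. This is a simplified, undirected variant written for illustration, not the authors' exact model (which works on directed graphs with separate forward and backward burning probabilities): each new vertex picks a random "ambassador", the fire spreads from each burned vertex to a geometrically distributed number of its unburned neighbors, and the new vertex links to everything that burned. The names `geometric` and `forest_fire_graph` and the parameters n = 200, p = 0.35 are assumptions.

```python
import random


def geometric(rng, p):
    """Number of successes before the first failure; mean p / (1 - p)."""
    count = 0
    while rng.random() < p:
        count += 1
    return count


def forest_fire_graph(n, p, seed=0):
    """Grow an undirected graph on n vertices with a simplified
    forest-fire process with burning probability p (requires p < 1)."""
    rng = random.Random(seed)
    adj = {0: set()}
    for new in range(1, n):
        ambassador = rng.randrange(new)      # uniformly random existing vertex
        burned = {ambassador}
        frontier = [ambassador]
        while frontier:
            w = frontier.pop()
            candidates = [u for u in adj[w] if u not in burned]
            rng.shuffle(candidates)
            for u in candidates[:geometric(rng, p)]:
                burned.add(u)                # the fire spreads to u
                frontier.append(u)
        adj[new] = set()
        for w in burned:                     # link the new vertex to all burned ones
            adj[new].add(w)
            adj[w].add(new)
    return adj


graph_ff = forest_fire_graph(n=200, p=0.35, seed=7)
num_edges = sum(len(neighbors) for neighbors in graph_ff.values()) // 2
```

Because a new vertex links to every vertex its fire reaches, larger values of p produce more edges per arriving vertex and shorter paths through the ambassadors' neighborhoods, which is the intuition behind densification and shrinking distances in this family of models.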
  25. summary of Graph evolution: densification and shrinking diameters (Leskovec, Kleinberg, Faloutsos). hypothesis: well-formed, assumed true (as a graph evolves, distances increase). data: extensive collection (how does one collect evolving networks?). methodology: simple statistics, but never done before. impact / interestingness: challenged the hypothesis, interesting findings. reproducibility: datasets and code publicly available
  26. rating of Graph evolution: densification and shrinking diameters (Leskovec, Kleinberg, Faloutsos), each criterion on a scale from 1 (low) to 5 (high): originality; impact; rigorousness / technical novelty; reproducibility
  27. Feedback Effects between Similarity and Social Influence in Online Communities. David Crandall, Dept. of Computer Science, Cornell University, Ithaca, NY 14853, crandall@cs.cornell.edu; Dan Cosley, Dept. of Communication, Cornell University, Ithaca, NY 14853, drc44@cornell.edu; Daniel Huttenlocher, Dept. of Computer Science, Cornell University, Ithaca, NY 14853, dph@cs.cornell.edu; Jon Kleinberg, Dept. of Computer Science, Cornell University, Ithaca, NY 14853, kleinber@cs.cornell.edu; Siddharth Suri, Dept. of Computer Science, Cornell University, Ithaca, NY 14853, ssuri@cs.cornell.edu. ABSTRACT: A fundamental open question in the analysis of social networks is to understand the interplay between similarity and social ties. People are similar to their neighbors in a social network for two distinct reasons: first, they grow to resemble their current friends due to social influence; and second, they tend to form new links to others who are already like them, a process often termed selection by sociologists. While both factors are present in everyday social processes, they are in tension: social influence can push systems toward uniformity of behavior, while selection can lead to fragmentation. As such, it is important to understand the relative effects of these forces, and this has been a challenge due to the difficulty of isolating and quantifying them in real settings. We develop techniques for identifying and modeling the interactions between social influence and selection, using data from online communities where both social interaction and changes in behavior over time can be measured. We find clear feedback effects between the two factors, with rising similarity between two individuals serving, in aggregate, as an indicator of future interaction, but with similarity then continuing to increase steadily, although at a slower rate, for … the current activities of their friends, or of the people most similar to them?
      Categories and Subject Descriptors: H.2.8 Database Management: Database Applications – Data Mining. General Terms: Measurement, Theory. Keywords: social networks, online communities, social influence. 1. INTRODUCTION. Social influence and selection. A fundamental property of social networks is that people tend to have attributes similar to those of their friends. There are two underlying reasons for this. First, the process of social influence [7] leads people to adopt behaviors exhibited by those they interact with; this effect is at work in many settings where new ideas diffuse by word-of-mouth or imitation through a network of people [19, 22]. A second, distinct reason is that people tend to form relationships with others who are already similar to them. This phenomenon, which is often termed selection, has a long history of study in sociology [13, 16]. The two forces of social influence and selection are both seen in a wide range of social settings: people decide to adopt
  28. similarity and social influence. observation: people are similar to their friends. selection or influence? questions: how does social interaction affect interests, and vice versa? can we use social similarity and interaction to predict future behavior?
  29. user interests and similarity between users: focus on Wikipedia editors. who edits which page? the edits up to time t form a vector expressing the user's interests up to that time point. similarity between users: consider the similarity of two users who “meet” (one posts on the discussion page of the other). excerpt from the paper: “… use one of the more common measures, the cosine metric, $\mathrm{Cosine}(\vec{u},\vec{v}) = \cos\angle(\vec{u},\vec{v}) = \frac{\vec{u}\cdot\vec{v}}{\|\vec{u}\|_2\,\|\vec{v}\|_2}$ (1), where $\|\vec{v}\|_2$ denotes the Euclidean norm of $\vec{v}$. While a comparison of similarity measures is not the focus of our current work, we have evaluated a wide range of measures for our purpose. We use the cosine metric here because it is independent of the rate at which people are edit-”
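The cosine metric on edit vectors is straightforward to compute. The sketch below assumes each user's history is stored as a sparse mapping from page title to edit count; that representation, the function name `cosine_similarity`, and the three example editors are illustrative assumptions, not details from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors given as
    {page: edit_count} dictionaries; 0.0 if either vector is empty."""
    dot = sum(count * v.get(page, 0) for page, count in u.items())
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical editors: the second edits the same pages twice as often,
# the third edits a completely different page.
casual = {"Graph theory": 3, "Random graph": 4}
prolific = {"Graph theory": 6, "Random graph": 8}
unrelated = {"Dolphin": 5}

same_interests = cosine_similarity(casual, prolific)
no_overlap = cosine_similarity(casual, unrelated)
```

The first pair scores 1.0 even though one editor is twice as active, which illustrates the property quoted in the excerpt: the measure depends on how edits are distributed over pages, not on the editing rate. The second pair shares no pages and scores 0.0.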
  30. user interests and similarity between users. main finding: [Figure 1: Average cosine similarity of user pairs as a function of the number of edits from time of first interaction, for Wikipedia.]
  31. user interests and similarity between users. possible explanation: a feedback loop between social influence and selection; similarity leads to interaction, which leads to further similarity. the authors propose a theoretical model to explain the findings (neighbors may affect actions and interactions)
  32. predicting future behavior based on user similarity and user interaction. [Figure 4: Probability of joining a community based on k exposure via social ties versus similarity ties for (a) Wikipedia and (b) LiveJournal. The solid black curves correspond to social ties and the dashed red curves to similarity ties. The error bars represent ±2 standard errors.]
  33. summary of Feedback effects between similarity and social influence in online communities (Crandall et al.). question to study: interplay between influence and selection. data: Wikipedia edits (creative but somewhat limited). methodology: simple statistics, theoretical model, prediction model. impact / interestingness: some interesting findings. reproducibility: datasets publicly available
  34. rating of Feedback effects between similarity and social influence in online communities (Crandall et al.), each criterion on a scale from 1 (low) to 5 (high): originality; impact; rigorousness / technical novelty; reproducibility
  35. 35. Meme-tracking and the Dynamics of the News Cycle Jure Leskovec ∗† Lars Backstrom ∗ Jon Kleinberg ∗ ∗ Cornell University † Stanford University jure@cs.stanford.edu lars@cs.cornell.edu kleinber@cs.cornell.edu ABSTRACT Tracking new topics, ideas, and “memes” across the Web has been an issue of considerable interest. Recent work has developed methods for tracking topic shifts over long time scales, as well as abrupt spikes in the appearance of particular named entities. However, these approaches are less well suited to the identification of content that spreads widely and then fades over time scales on the order of days — the time scale at which we perceive news and events. We develop a framework for tracking short, distinctive phrases that travel relatively intact through on-line text; developing scalable algorithms for clustering textual variants of such phrases, we identify a broad class of memes that exhibit wide spread and rich variation on a daily basis. As our principal domain of study, we show how such a meme-tracking approach can provide a coherent representation of the news cycle — the daily rhythms in the news media that have long been the subject of qualitative interpretation but have never been captured accurately enough to permit actual quantitative analysis. We tracked 1.6 million mainstream media sites and blogs over a period of three months with the total of 90 million articles and we find a set of novel and persistent temporal patterns in the news cycle. In particular, we observe a typical lag of 2.5 hours between the peaks of attention to a phrase in the news media and in blogs respectively, with divergent behavior around the overall peak and a “heartbeat”-like pattern in the handoff between news and blogs. We also develop and analyze a mathematical model for the kinds of temporal variation that the system exhibits.
Categories and Subject Descriptors: H.2.8 [Database Management]: Database applications—Data mining General Terms: Algorithms; Experimentation. Keywords: Meme-tracking, Blogs, News media, News cycle, Information cascades, Information diffusion, Social networks
[…] probabilistic term mixtures have been successful at identifying long-range trends in general topics over time [5, 7, 16, 17, 30, 31]. At the other extreme, identifying hyperlinks between blogs and extracting rare named entities has been used to track short information cascades through the blogosphere [3, 14, 20, 23]. However, between these two extremes lies much of the temporal and textual range over which propagation on the web and between people typically occurs, through the continuous interaction of news, blogs, and websites on a daily basis. Intuitively, short units of text, short phrases, and “memes” that act as signatures of topics and events propagate and diffuse over the web, from mainstream media to blogs, and vice versa. This is exactly the focus of our study here. Moreover, it is at this intermediate temporal and textual granularity of memes and phrases that people experience news and current events. A succession of story lines that evolve and compete for attention within a relatively stable set of broader topics collectively produces an effect that commentators refer to as the news cycle. Tracking dynamic information at this temporal and topical resolution has proved difficult, since the continuous appearance, growth, and decay of new story lines takes place without significant shifts in the overall vocabulary; in general, this process can also not be closely aligned with the appearance and disappearance of specific named entities (or hyperlinks) in the text. As a result, while the dynamics of the news cycle has been a subject of intense interest to researchers in media and the political process, the focus has been mainly qualitative, with a corresponding lack of techniques for undertaking quantitative analysis of the news cycle as a whole.
Our approach to meme-tracking, with applications to the news cycle. Here we develop a method for tracking units of information as they spread over the web. Our approach is the first to scalably identify short distinctive phrases that travel relatively intact through […]
  36. 36. T-61.6020: Mining the social web — lecture #2 meme tracking 36 understand the dynamics of reported news focus on 24-hour news cycles questions : do such news cycles exist? can we detect them in the data? can we measure their properties?
  37. 37. T-61.6020: Mining the social web — lecture #2 meme tracking 37 dataset : 90 million news articles from the 2008 US presidential elections how to identify news cycles : urls, topics, named entities, bag-of-words…? approach taken : quotes (memes) easy to manage at large scale travel relatively unchanged across many articles
  38. 38. [Figure 1: A small portion of the full set of variants of Sarah Palin’s quote, “Our opponent is someone who sees America, it seems, as being so imperfect, imperfect enough that he’s palling around with terrorists who would target their own country.” The arrows indicate the (approximate) inclusion of one variant in another, as part of the methodology developed in Section 2. Example variants shown in the figure: “pal around with terrorists who targeted their own country”; “palling around with terrorists who target their own country”; “our opponent is someone who sees america as imperfect enough to pal around with terrorists who targeted their own country”; “we see america as a force for good in this world”.]
[…] phrases with this property are exclusively produced by spammers. (We use ε = .25, L = 4, and M = 10 in our implementation.)
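The inclusion arrows of Figure 1 can be illustrated with a small sketch. It assumes exact, word-level, contiguous inclusion, which is a simplification of the approximate inclusion the caption mentions; the function names `is_included` and `phrase_graph` are hypothetical, and the roles of ε, L, and M are not modeled because the slide text does not define them.

```python
def is_included(short, long):
    """True if the words of `short` appear as a contiguous run inside `long`.

    Assumption: exact word match; the paper allows approximate inclusion.
    """
    s, l = short.split(), long.split()
    if len(s) >= len(l):
        return False
    return any(l[i:i + len(s)] == s for i in range(len(l) - len(s) + 1))


def phrase_graph(phrases):
    """Directed edges (p, q) meaning: variant p is included in longer variant q."""
    return [(p, q) for p in phrases for q in phrases if is_included(p, q)]


# Variants taken from the figure on this slide.
short_variant = "pal around with terrorists who targeted their own country"
long_variant = ("our opponent is someone who sees america as imperfect enough "
                "to pal around with terrorists who targeted their own country")
unrelated = "we see america as a force for good in this world"

edges = phrase_graph([short_variant, long_variant, unrelated])
print(edges)
```

Edges point from the shorter variant to the longer one, so the longest variants end up with no out-edges and act as roots of the phrase graph.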
  39. 39. meme tracking interesting optimization problem identify single-rooted propagations
[Figure 1 (repeated, cropped): variants of Sarah Palin’s quote; arrows indicate the (approximate) inclusion of one variant in another.]
[Figure 2: Phrase graph. Each phrase is a node and we want to delete the least edges so that each resulting connected component has a single root node/phrase, a node with zero out-edges. By deleting the indicated edges we obtain the optimal solution.]
To begin, we define some terminology. We will refer to each news article or blog post as an item, and refer to a quoted string […]
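The single-root requirement in the Figure 2 caption can be sketched with a greedy pass over the phrase graph. This is a simplified illustration, not the authors' exact heuristic: it assumes roots are nodes with zero out-edges, processes a node once all its successors are assigned, and keeps only the edges into the cluster that receives most of the node's out-edge weight. The function name and the toy graph are hypothetical.

```python
from collections import defaultdict


def greedy_dag_partition(nodes, edges):
    """Assign every node of a weighted DAG to exactly one root.

    edges: dict mapping (u, v) -> weight, where u points to v.
    Returns (cluster, deleted): cluster maps node -> root label,
    deleted lists the edges that cross cluster boundaries.
    """
    out = defaultdict(dict)
    for (u, v), w in edges.items():
        out[u][v] = w

    # Every node without out-edges is a root and starts its own cluster.
    cluster = {n: n for n in nodes if not out[n]}
    pending = [n for n in nodes if n not in cluster]

    while pending:
        progressed = False
        for n in list(pending):
            if all(v in cluster for v in out[n]):
                weight = defaultdict(float)
                for v, w in out[n].items():
                    weight[cluster[v]] += w
                # Join the cluster receiving the most out-edge weight
                # (ties broken by the smallest root label).
                cluster[n] = max(sorted(weight), key=lambda c: weight[c])
                pending.remove(n)
                progressed = True
        if not progressed:
            raise ValueError("graph contains a cycle")

    deleted = [(u, v) for (u, v) in edges if cluster[u] != cluster[v]]
    return cluster, deleted


# Hypothetical toy graph with two roots, 3 and 4.
toy_edges = {(1, 3): 1, (2, 3): 2, (2, 4): 1, (5, 4): 1,
             (6, 2): 1, (6, 5): 1, (6, 4): 1}
cluster, deleted = greedy_dag_partition([1, 2, 3, 4, 5, 6], toy_edges)
print(cluster, deleted)
```

Because each non-root node keeps at least one edge into its chosen cluster, every resulting component contains exactly one root. The greedy choice gives no optimality guarantee.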
  40. 40. meme tracking volume distributions
Cropped excerpt from the paper (left edge of the column cut off): […] in Fig. 2). So, the phrase cluster should be a […] all paths terminate in a single root node. […] to identify phrase clusters, we would like […] delete weight from the phrase graph so it falls apart […] with the property that each piece “feeds into” […] that can serve as the exemplar for the phrase […] we define a directed acyclic graph to be […] contains exactly one root node. […] We now define the following DAG […]: Given a directed acyclic graph with […] delete a set of edges of minimum to[…] that each of the resulting components is […] shows a DAG with all edge weights equal to […] edges forms the unique optimal solution. DAG Partitioning is computationally intractable […] We then discuss the heuristic we use for the […] which we find to work well in practice. […] DAG Partitioning is NP-hard.
[Figure 3: Phrase volume distribution. We consider the volume of individual phrases, phrase-clusters, and the phrases that compose the “Lipstick on a pig” cluster. Notice phrases and phrase-clusters have similar power-law distribution while the “Lipstick on a pig” cluster has much fatter tail, which means that popular phrases have unexpectedly high popularity. Axes: Volume, x; No. of items with volume ≥ x. Fitted tails: Phrases ∝ x^-1.8, Clusters ∝ x^-2.1, Lipstick ∝ x^-0.85.]
[…] to the cluster to which it has the most edges. For example, in Fig. 2 […]
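The quantity plotted in Figure 3, the number of items with volume ≥ x, and a rough estimate of a tail exponent can be sketched as follows. The least-squares fit on log-log points is an assumption made for illustration; the slide does not say how the exponents were fitted, and this estimator is known to be crude. Function names and data are hypothetical.

```python
import math
from collections import Counter


def tail_counts(volumes):
    """Map each observed volume x to the number of items with volume >= x."""
    counts = Counter(volumes)
    remaining = len(volumes)
    tail = {}
    for x in sorted(counts):
        tail[x] = remaining
        remaining -= counts[x]
    return tail


def loglog_slope(points):
    """Least-squares slope of log10(y) against log10(x)."""
    xs = [math.log10(x) for x, _ in points]
    ys = [math.log10(y) for _, y in points]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical volumes, not taken from the paper's data.
tail = tail_counts([1, 1, 2, 3, 3, 3])
slope = loglog_slope([(x, 1000 * x ** -2.0) for x in (1, 10, 100)])
print(tail, slope)
```

A slope near -1.8 on such a plot would correspond to the "Phrases" curve in the figure.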
  41. 41. Figure 4: Top 50 threads in the news cycle with highest volume for the period Aug. 1 – Oct. 31, 2008. Each thread consists of all news articles and blog posts containing a textual variant of a particular quoted phrase. (Phrase variants for the two largest threads in each week are shown as labels pointing to the corresponding thread.) The data is drawn as a stacked plot in which the thickness of the strand corresponding to each thread indicates its volume over time. Interactive visualization is available at http://memetracker.org. threads dynamics
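A thread's volume over time, and the lag between its peak in news media and in blogs, can be sketched as below. This assumes timestamps measured in hours and a one-hour bin width, which the slides do not specify; function names and data are hypothetical.

```python
from collections import Counter


def hourly_volume(timestamps_hours):
    """Count mentions per one-hour bin (assumed bin width)."""
    return Counter(int(t) for t in timestamps_hours)


def peak_hour(volume):
    """Hour with the largest volume; the earliest hour wins ties."""
    return max(sorted(volume), key=lambda h: volume[h])


def peak_lag(news_times, blog_times):
    """Hours by which the blog peak trails the news peak."""
    return peak_hour(hourly_volume(blog_times)) - peak_hour(hourly_volume(news_times))


# Hypothetical mention times (in hours) for one thread.
news = [1.0, 1.2, 1.5, 2.1]
blogs = [2.0, 3.1, 3.4, 3.9, 4.2]
lag = peak_lag(news, blogs)
print(lag)
```

With finer bins, the same computation could be used to look for a lag of the size reported in the paper's abstract.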
  42. 42. T-61.6020: Mining the social web — lecture #2 42 question to study : identify news cycles, study their dynamics data : news articles methodology : interesting computational problems in managing memes impact/interestingness : interesting methods interesting findings reproducibility : datasets publicly available summary Meme-tracking and the dynamics of the news cycle Leskovec, Backstrom, Kleinberg
  43. 43. T-61.6020: Mining the social web — lecture #2 Meme-tracking and the dynamics of the news cycle Leskovec, Backstrom, Kleinberg 43 1 2 3 4 5 originality low high 1 2 3 4 5 impact low high 1 2 3 4 5 rigorousness / technical novelty low high 1 2 3 4 5 reproducibility low high
  44. 44. Everyone’s an Influencer: Quantifying Influence on Twitter Eytan Bakshy∗ University of Michigan, USA ebakshy@umich.edu Jake M. Hofman Yahoo! Research, NY, USA hofman@yahoo-inc.com Winter A. Mason Yahoo! Research, NY, USA winteram@yahoo-inc.com Duncan J. Watts Yahoo! Research, NY, USA djw@yahoo-inc.com
ABSTRACT In this paper we investigate the attributes and relative influence of 1.6M Twitter users by tracking 74 million diffusion events that took place on the Twitter follower graph over a two month interval in 2009. Unsurprisingly, we find that the largest cascades tend to be generated by users who have been influential in the past and who have a large number of followers. We also find that URLs that were rated more interesting and/or elicited more positive feelings by workers on Mechanical Turk were more likely to spread. In spite of these intuitive results, however, we find that predictions of which particular user or URL will generate large cascades are relatively unreliable. We conclude, therefore, that word-of-mouth diffusion can only be harnessed reliably by targeting large numbers of potential influencers, thereby capturing average effects. Finally, we consider a family of hypothetical marketing strategies, defined by the relative cost of identifying versus compensating potential “influencers.” We find that although under some circumstances, the most influential users are also the most cost-effective, under a wide range of plausible assumptions the most cost-effective performance can be realized using “ordinary influencers”—individuals who exert average or even less-than-average influence.
Categories and Subject Descriptors H.1.2 [Models and Principles]: User/Machine Systems; J.4 [Social and Behavioral Sciences]: Sociology Keywords Communication networks, Twitter, diffusion, influence, word of mouth marketing.
1. INTRODUCTION Word-of-mouth diffusion has long been regarded as an important mechanism by which information can reach large populations, possibly influencing public opinion [14], adoption of innovations [26], new product market share [4], or brand awareness [15]. In recent years, interest among researchers and marketers alike has increasingly focused on whether or not diffusion can be maximized by seeding a piece of information or a new product with certain special individuals, often called “influentials” [34, 15] or simply “influencers,” who exhibit some combination of desirable attributes—whether personal attributes like credibility, expertise, or enthusiasm, or network attributes such as connectivity or centrality—that allows them to influence a disproportionately large number of others [10], possibly indirectly via a cascade of influence [31, 16]. Although appealing, the claim that word-of-mouth diffusion is driven disproportionately by a small number of key influencers necessarily makes certain assumptions about the underlying influence process that are not based directly on empirical evidence. Empirical studies of diffusion are therefore highly desirable, but historically have suffered from two major difficulties. First, the network over which word-of-mouth influence spreads is generally unobservable, hence […]
  45. 45. T-61.6020: Mining the social web — lecture #2 who is influential in twitter? 45 questions : who is influential and in which content? (celebrity vs. expert on a topic vs. trusted friend…) can we predict who is influential?
  46. 46. T-61.6020: Mining the social web — lecture #2 who is influential in twitter? 46 dataset : track 1.6 m users 74 m diffusion events (cascades of shortened urls) two-month period in 2009 definition of influential : someone who posts urls that many retweet (narrow for the purpose of the study)
  47. 47. T-61.6020: Mining the social web — lecture #2 the dataset 47 the urls the follower graph
[Figure 1: Probability density of number of bit.ly URLs posted per user. Axes: URLs posted, Density.]
[…] “leaders,” not on prediction.) Second, whereas the focus of previous studies has been largely descriptive (e.g. comparing the most influential users), we are interested explicitly in predicting influence; thus we consider all users, not merely the most influential. Third, in addition to predicting diffusion as a function of the attributes of individual seeds, we also study the effects of content. We believe these differences bring the understanding of diffusion on Twitter closer to practical applications, although as we describe later, experimental studies are still required.
DATA To study diffusion on Twitter, we combined two separate but related sources of data. First, over the two-month period of September 13 2009 - November 15 2009 we recorded 1.03B public tweets broadcast on Twitter, excluding October 14-16 during which there were intermittent outages in the Twitter API. Of these, we extracted 87M tweets that included bit.ly URLs and which corresponded to distinct diffusion “events,” where each event comprised a single initiator, or “seed,” followed by some number of repostings of the same URL by the seed’s followers, their followers, and so on. Finally, we identified a subset of 74M diffusion events that were initiated by seed users who were active in both the first and second months of the observation period; thus enabling us to train our regression model on first month […]
[…] the same two-month period. We did this by querying the Twitter API to find the followers of every user who posted a bit.ly URL. Subsequently, we placed those followers in a queue to be crawled, thereby identifying their followers, who were then also placed in the queue, and so on. In this way, we obtained a large fraction of the Twitter follower graph comprising all active bit.ly posters and anyone connected to these users via one-way directed chains of followers. Specifically, the subgraph comprised approximately 56M users and 1.7B edges.
Consistent with previous work [7, 18, 35], both the in-degree (“followers”) and out-degree (“friends”) distributions are highly skewed, but the former much more so—whereas the maximum # of followers was nearly 4M, the maximum # of friends was only about 760K—reflecting the passive and one-way nature of the “follow” action on Twitter (i.e. A can follow B without any action required from B). We emphasize, moreover, that because the crawled graph was seeded exclusively with active users, it is almost certainly not representative of the entire follower graph. In particular, active users are likely to have more followers than average, in which case we would expect that the average in-degree will exceed the average out-degree for our sample—as indeed we observe. Table 1 presents some basic statistics of the distributions of the number of friends, followers and number of URLs posted per user.
Table 1: Statistics of the Twitter follower graph and seed activity
          # Followers     # Friends    # Seeds Posted
Median          85.00         82.00             11.00
Mean           557.10        294.10             46.33
Max.     3,984,000.00    759,700.00            54,890
4. COMPUTING INFLUENCE ON TWITTER To calculate the influence score for a given URL post, we tracked the diffusion of the URL from its origin at a particular “seed” node through a series of reposts—by that user’s followers, those users’ followers, and so on—until the diffusion event, or cascade, terminated. To do this, we used the time each URL was posted: if person B is following person A, and person A posted the URL before B and was the only of B’s friends to post the URL, we say person A influenced person B to post the URL. As Figure 2 shows, if B has more than one friend who has previously posted the same URL, we have three choices for how to assign the corresponding influence: first, we can assign full credit to the […]
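The attribution rule quoted above (A influenced B if B follows A and A posted the URL earlier) can be sketched for a single URL. The excerpt breaks off before listing the options for users with several earlier-posting friends, so this sketch arbitrarily credits the earliest such friend; that choice, the function names, and the toy data are assumptions. Cascade size here counts the seed itself.

```python
def attribute_influence(posts, follows):
    """Assign each poster of one URL a single influencer, or None for seeds.

    posts:   dict user -> time at which the user posted the URL
    follows: dict user -> set of users that this user follows ("friends")
    Assumption: with several earlier-posting friends, credit the earliest.
    """
    parent = {}
    for user, t in posts.items():
        earlier = [(posts[f], f) for f in follows.get(user, ())
                   if f in posts and posts[f] < t]
        parent[user] = min(earlier)[1] if earlier else None
    return parent


def cascade_sizes(parent):
    """Number of users in each cascade, keyed by its seed (seed included)."""
    def seed_of(user):
        # Parents always posted strictly earlier, so this walk terminates.
        while parent[user] is not None:
            user = parent[user]
        return user

    sizes = {}
    for user in parent:
        seed = seed_of(user)
        sizes[seed] = sizes.get(seed, 0) + 1
    return sizes


# Hypothetical posting times and follow relations for one URL.
posts = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}
follows = {"B": {"A"}, "C": {"A", "B"}, "E": {"D"}}
parent = attribute_influence(posts, follows)
sizes = cascade_sizes(parent)
print(parent, sizes)
```

Users with no earlier-posting friend become seeds of their own cascades, which matches the observation on the following slides that most events do not spread at all.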
  48. 48. T-61.6020: Mining the social web — lecture #2 cascades 48
[Figure 3: Examples of information cascades on Twitter.]
[…] there are many reasons why individuals may choose to pass along information other than the number and identity of the individuals from whom they received it—in particular, the nature of the content itself. Moreover, influencing another individual to pass along a piece of information does not […]
[Figure 4: (a) Frequency distribution of cascade sizes (axes: Size, Density). (b) Distribution of cascade depths (axes: Depth, Frequency).]
[…] we study size or depth, therefore, the implication is that most events do not spread at all, and even moderately sized cascades are extremely rare. To identify consistently influential individuals, we aggregated all URL posts by user and computed individual-level influence as the logarithm of the average size of all cascades for which that user was a seed. We then fit a regression tree model [6], in which a greedy optimization process recursively partitions the feature space, resulting in a piecewise-constant function where the value in each partition is fit to the mean of the corresponding training data. An important […]
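The two steps described above, individual-level influence as the logarithm of the average cascade size and a greedily fitted piecewise-constant regression tree, can be sketched in plain Python. The sketch assumes a base-10 logarithm and a single feature, whereas the study uses several features; all names and data are hypothetical.

```python
import math


def user_influence(cascade_sizes):
    """Log of the average cascade size seeded by one user (base 10 assumed)."""
    return math.log10(sum(cascade_sizes) / len(cascade_sizes))


def sse(values):
    """Sum of squared errors around the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_threshold(xs, ys):
    """Greedy step: threshold on x minimizing the two-piece squared error."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between equal feature values
        cost = (sse([y for _, y in pairs[:i]]) +
                sse([y for _, y in pairs[i:]]))
        if best is None or cost < best[0]:
            best = (cost, (pairs[i - 1][0] + pairs[i][0]) / 2)
    return best[1]


def fit_tree(xs, ys, depth):
    """Return a leaf (mean of ys) or a tuple (threshold, left, right)."""
    if depth == 0 or len(set(xs)) < 2:
        return sum(ys) / len(ys)
    thr = best_threshold(xs, ys)
    left = [(x, y) for x, y in zip(xs, ys) if x < thr]
    right = [(x, y) for x, y in zip(xs, ys) if x >= thr]
    return (thr,
            fit_tree([x for x, _ in left], [y for _, y in left], depth - 1),
            fit_tree([x for x, _ in right], [y for _, y in right], depth - 1))


def predict(tree, x):
    """Follow the left child when x is below the node's threshold."""
    while isinstance(tree, tuple):
        thr, left, right = tree
        tree = left if x < thr else right
    return tree


# Hypothetical feature (e.g. past influence) and target values.
xs = [1, 2, 3, 10, 11, 12]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
tree = fit_tree(xs, ys, depth=1)
print(tree, predict(tree, 2), predict(tree, 11))
```

Each leaf stores the mean target value of the training points that fall into it, which is the piecewise-constant behavior the excerpt describes.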
  49. 49. T-61.6020: Mining the social web — lecture #2 prediction task 49 build a model to predict influence model features : user attributes # followers # friends # tweets date of joining past influence of seed users average, minimum, and maximum total influence average, minimum, and maximum local influence (repeat study with additional content features)
  50. 50. T-61.6020: Mining the social web — lecture #2 prediction task — results 50 # followers and past influence are important features individuals who have been influential in the past and who have many followers are more likely to be influential in the future however, this is correct only on average predictor features are necessary but not sufficient cannot really predict who will initiate a cascade advertisers need a diverse portfolio of users to target
  51. 51. T-61.6020: Mining the social web — lecture #2 prediction task — results 51
[Figure 5: Regression tree fit for one of the five cross-validation folds. Leaf nodes give the predicted influence for the corresponding partition, where the left (right) child is followed if the node condition is satisfied (violated). Extracted split conditions: log10(pastLocalInfluence + 1) < 0.09791, < 0.3028, < 0.3027, < 0.856. Extracted leaf values: 0.0124, 0.03631, 0.05991, 0.09241, 0.1452, 0.1229, 0.1929, 0.3045, 0.275, 0.4118, 0.6034, 0.9854.]
[Figure 6: Influence as a function of past local influence and number of followers for (a) all users and (b) users with the top 25 actual influence. Each circle represents a single seed user, where the size of the circle represents that user’s actual average influence. Axes: Past Local Influence, Followers. Accounts labeled in panel (b): TreySongz, Orbitz, stephenfry, marissamayer, disneypolls, MrEdLover, BarackObama, pigeonPOLL, iphone_dev, geohot, mslayel, cnnbrk, TreysAngels, OFA_TX, britneyspears, riskybusinessmb, nprnews, wealthtv, garagemkorova, michelebachmann, billprady.]
[…] than others (e.g. news articles of specialized interest), or […] First, we filtered URLs that we knew to be spam or in a lan[…]
[Figure 7: Actual vs. predicted influence for regression tree. The model assigns each seed user to a leaf in the regression tree. Points representing the average actual influence values are placed at the pre[…] Axes: Predicted Influence, Actual Influence.]
  52. 52. Duncan Watts’ youtube video
  53. 53. T-61.6020: Mining the social web — lecture #2 53 question to study : can we identify influential users in twitter? data : large twitter dataset over two months (proprietary) methodology : prediction and analysis of a regression task impact/interestingness : interesting question, potentially very high impact for advertisers reproducibility : dataset not publicly available summary Everyone is an influencer: quantifying influence on twitter Bakshy, Hofman, Mason, Watts
  54. 54. T-61.6020: Mining the social web — lecture #2 Everyone is an influencer: quantifying influence on twitter Bakshy, Hofman, Mason, Watts 54 1 2 3 4 5 originality low high 1 2 3 4 5 impact low high 1 2 3 4 5 rigorousness / technical novelty low high 1 2 3 4 5 reproducibility low high
  55. 55. Coevolution of Network Structure and Content Chun-Yuen Teng School of Information University of Michigan Ann Arbor, MI 48109 chunyuen@umich.edu Liuling Gong School of Information University of Michigan Ann Arbor, MI 48109 llgong@umich.edu Avishay Livne EECS University of Michigan Ann Arbor, MI 48109 avishay@umich.edu Celso Brunetti Carey Business School Johns Hopkins Baltimore, MD 21202 celsob@jhu.edu Lada Adamic School of Information University of Michigan Ann Arbor, MI 48109 ladamic@umich.edu
ABSTRACT As individuals communicate, their exchanges form a dynamic network. We demonstrate, using time series analysis of communication in three online settings, that network structure alone can be highly revealing of the diversity and novelty of the information being communicated. Our approach uses both standard and novel network metrics to characterize how unexpected a network configuration is, and to capture a network’s ability to conduct information. We find that networks with a higher conductance in link structure exhibit higher information entropy, while unexpected network configurations can be tied to information novelty. We use a simulation model to explain the observed correspondence between the evolution of a network’s structure and the information it carries.
Categories and Subject Descriptors J.4 [Computer Applications]: Social and Behavioral Sciences; H.2.8 [Database Applications]: Data Mining General Terms Measurement, Human Factors Keywords social media, information networks, network evolution
[…] adoption of ideas and behavior [28, 6, 3], convergence of opinion [5], or the speed and extent of innovation [14]. In practice, networks are rarely static, unless one considers only the strongest and most stable ties [7] or experimentally dictates the network topology to be fixed [6]. However, even stable ties transfer information at different rates [25, 13, 21], and a portion of information flow occurs outside of established social ties [4]. New ties are also induced by information flow, e.g. a Pakistani Twitter user who inadvertently live-tweeted the Bin Laden assassination quickly gained tens of thousands of new followers on Twitter. This points to a need to approach the relationship between network structure and information content in a substantively different way.
In this paper, rather than treating the network structure as static, we specifically use its dynamic nature to infer two properties of the information being communicated through the network. The first is the diversity of the information; whether everyone is talking about the same topic or whether one is observing many disparate conversation topics being discussed. The second is the novelty of the information; whether individuals in the network are continuing to talk about the same topic they talked about in the previous time period, or whether new topics have arisen that are different from what has been discussed before. For example, one could imagine oneself at a dinner party, where most conversations are out of earshot, but one can easily observe who is conversing with whom. While individuals are milling about […]
arXiv:1107.5543v2 [cs.SI] 21 May 2012
  56. 56. T-61.6020: Mining the social web — lecture #2 content vs. structure 56 questions : understand the interplay between content and structure what is said in the network vs. how the information spreads more concretely : can the network structure tell what people talk about? are they talking about the same thing or do they gossip? is what people talk about novel?
  57. 57. T-61.6020: Mining the social web — lecture #2 content vs. structure 57 What’s different here : we look at network dynamics at relatively short time scales and construct time series; a range of network metrics, instead of just community structure; information novelty and diversity as opposed to tracking single events / pieces of information. Example messages shown on the slide : “big news!”, “virus epidemic”, “weather is horrible today”
  58. 58. T-61.6020: Mining the social web — lecture #2 content vs. structure — methodology 58 extract features that capture network structure # vertices, # edges, avg degree, degree correlations, … conductance (does information flow along many paths?) expectedness of conversation (have I seen this edge before?) extract features that characterize content diversity and novelty correlation analysis between structure and content features analysis on 3 datasets : twitter, virtual game, enron email network
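Two of the feature types listed above can be illustrated with deliberately simple stand-ins. The sketch assumes that content diversity is measured as the Shannon entropy of the topics mentioned in a time window, and that expectedness is the fraction of a window's edges already seen in the previous window; neither definition is confirmed by the slides as the paper's exact metric, and all names and data are hypothetical.

```python
import math
from collections import Counter


def topic_entropy(topics):
    """Shannon entropy (bits) of the topic labels seen in one time window."""
    counts = Counter(topics)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def edge_expectedness(current_edges, previous_edges):
    """Fraction of the current window's edges that appeared in the previous one."""
    current = set(current_edges)
    if not current:
        return 0.0
    return len(current & set(previous_edges)) / len(current)


# Hypothetical windows of conversation topics and communication edges.
diverse = topic_entropy(["a", "a", "b", "b"])
uniform = topic_entropy(["a", "a", "a", "a"])
seen = edge_expectedness([("u", "v"), ("v", "w")], [("u", "v")])
print(diverse, uniform, seen)
```

Computing both values per time window yields the kind of time series that the correlation analysis on this slide compares.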
  59. 59. T-61.6020: Mining the social web — lecture #2 content vs. structure — findings 59 “simple” structure features are not correlated with content diversity and novelty conductance correlates with content diversity expectedness correlates with content novelty
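The correlation analysis behind these findings can be sketched with a plain Pearson coefficient between a structure time series and a content time series. Pearson correlation is an assumption here, since the slides say only "correlation analysis"; the series values are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical per-window values of a structure feature and a content feature.
conductance_series = [1.0, 2.0, 3.0, 4.0]
diversity_series = [2.0, 4.0, 6.0, 8.0]
r = pearson(conductance_series, diversity_series)
print(r)
```

A value near 1 would indicate that the structure feature and the content feature rise and fall together across time windows.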
  60. 60. T-61.6020: Mining the social web — lecture #2 60 question to study : interplay between structure and content data : three datasets methodology : feature extraction and correlation analysis impact/interestingness : interesting question, potentially very high impact reproducibility : some datasets publicly available summary Coevolution of network structure and content Teng, Gong, Livne, Brunetti, and Adamic
  61. 61. T-61.6020: Mining the social web — lecture #2 Coevolution of network structure and content Teng, Gong, Livne, Brunetti, and Adamic 61 1 2 3 4 5 originality low high 1 2 3 4 5 impact low high 1 2 3 4 5 rigorousness / technical novelty low high 1 2 3 4 5 reproducibility low high
  62. 62. T-61.6020: Mining the social web — lecture #2 what is next? continue literature review (next week, Michael) meanwhile… keep thinking about project ideas browse papers (the ones in Noppa; main conferences : ICWSM, WSDM, WWW) talk to your colleagues talk to your instructors 62
