Mining the Social Web - Lecture 2 - T61.6020

mining the social web
Aris Gionis Michael Mathioudakis
Mon, Feb 2 — lecture #2
structure and dynamics of social networks

class web page in piazza
https://piazza.com/aalto.fi/spring2015/t616020/home
share resources and also use as a discussion forum
sensible posts:
looking for a project mate
looking for a project mate on idea X
does anyone know how to access data Y?
has anyone seen some analysis on data Z?
… or just anything else

today’s themes
analysis of the structure and dynamics of social networks
what do social networks look like?
how do social networks evolve over time?
how do people in social networks behave and interact?
how does information spread in social networks and social media?
who is influential?
what is the interplay between structure and content?

objectives in today’s presentation
focus on one particular topic
review some “classic” papers in the literature
ideas for projects
assess the presented papers
what is the main idea?
what is the novelty?
why did they have impact?

criteria to evaluate the research projects
originality (has it been done before?)
potential impact (how interesting is it, and why?)
rigor and technical novelty
reproducibility
presentation

structure of social networks
social networks and social-media data can be
represented as graphs (or networks)
what do these graphs look like?
what is their structure?
data contain additional information
(actions, interactions, dynamics, attributes, …)
mining this additional information as part of
the network structure

contrast against
random graphs
random graph model by Erdős-Rényi
edges independently drawn with probability p
real-world networks do not look like random graphs
also, random graphs are static
                     degree distribution  hubs  triangle coefficient  clusters  diameter  giant component
random graphs        binomial             no    no                    no        small     yes
real-world networks  power law            yes   yes                   yes       small     yes
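The contrast above can be checked on a small example. The following is a minimal sketch of the Erdős-Rényi model as described on the slide (each edge drawn independently with probability p), using only the Python standard library. The function names and the values n = 200, p = 0.05 are illustrative assumptions, not part of the lecture.

```python
import random
from itertools import combinations


def erdos_renyi(n, p, seed=0):
    """Return adjacency sets of a G(n, p) graph: every pair of
    vertices is linked independently with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def mean_degree(adj):
    """Average number of neighbors per vertex."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)


def mean_clustering(adj):
    """Average, over all vertices, of the fraction of neighbor pairs
    that are themselves linked (vertices with < 2 neighbors count as 0)."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += links / (k * (k - 1) / 2)
    return total / len(adj)


# Illustrative parameters (assumed, not from the lecture).
g = erdos_renyi(200, 0.05)
print("mean degree:", mean_degree(g))          # close to p * (n - 1)
print("mean clustering:", mean_clustering(g))  # close to p, i.e. few triangles
```

In such a graph the degrees concentrate around p(n - 1) and the clustering coefficient stays near p, which matches the "binomial" and "no" entries in the random-graph row of the table.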

graph generation models
a large number of graph-generation models have been proposed
preferential-attachment model
copy model
Watts-Strogatz model
each typically tries to capture some property of the data
beyond the scope of this class and the project

arXiv:0810.1355v1 [cs.DS] 8 Oct 2008
Community Structure in Large Networks: Natural Cluster Sizes
and the Absence of Large Well-Defined Clusters ∗
Jure Leskovec †
Kevin J. Lang ‡
Anirban Dasgupta †
Michael W. Mahoney §
Abstract
A large body of work has been devoted to defining and identifying clusters or communities in social
and information networks, i.e., in graphs in which the nodes represent underlying social entities and
the edges represent some sort of interaction between pairs of nodes. Most such research begins with
the premise that a community or a cluster should be thought of as a set of nodes that has more
and/or better connections between its members than to the remainder of the network. In this paper,
we explore from a novel perspective several questions related to identifying meaningful communities
in large social and information networks, and we come to several striking conclusions.
Rather than defining a procedure to extract sets of nodes from a graph and then attempt to
interpret these sets as a “real” communities, we employ approximation algorithms for the graph
partitioning problem to characterize as a function of size the statistical and structural properties of
partitions of graphs that could plausibly be interpreted as communities. In particular, we define the
network community profile plot, which characterizes the “best” possible community—according to the
conductance measure—over a wide range of size scales. We study over 100 large real-world networks,
ranging from traditional and on-line social networks, to technological and information networks and
web graphs, and ranging in size from thousands up to tens of millions of nodes.
Our results suggest a significantly more refined picture of community structure in large networks
than has been appreciated previously. Our observations agree with previous work on small networks,
but we show that large networks have a very different structure. In particular, we observe tight
communities that are barely connected to the rest of the network at very small size scales (up to
≈ 100 nodes); and communities of size scale beyond ≈ 100 nodes gradually “blend into” the expander-
like core of the network and thus become less “community-like,” with a roughly inverse relationship
between community size and optimal community quality. This observation agrees well with the
so-called Dunbar number which gives a limit to the size of a well-functioning community.
However, this behavior is not explained, even at a qualitative level, by any of the commonly-used
network generation models. Moreover, it is exactly the opposite of what one would expect based
on intuition from expander graphs, low-dimensional or manifold-like graphs, and from small social
networks that have served as testbeds of community detection algorithms. The relatively gradual
increase of the network community profile plot as a function of increasing community size depends in
a subtle manner on the way in which local clustering information is propagated from smaller to larger
size scales in the network. We have found that a generative graph model, in which new edges are
added via an iterative “forest fire” burning process, is able to produce graphs exhibiting a network
community profile plot similar to what we observe in our network datasets.

community structure in social networks
hypothesis : social networks have well-formed communities
Community structure
loose definition of community: a set of vertices densely
connected to each other and sparsely connected to the rest of
the graph
artificial communities:
http://projects.skewed.de/graph-tool/

study community structure in an extensive collection of real-world networks
authors introduce the network community profile (NCP) plot
characterizes the best possible community over a range of scales
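The abstract states that "best" is measured by conductance. The sketch below assumes the standard definition, conductance(S) = (edges leaving S) / min(vol(S), vol(V \ S)), where vol is the sum of degrees, and computes the NCP by brute force on a toy graph. The brute-force search and the toy graph are illustrative assumptions; the paper relies on approximation algorithms for graph partitioning, which are not reproduced here.

```python
from itertools import combinations


def conductance(adj, S):
    """Edges leaving S divided by the smaller of the two volumes
    (volume = sum of degrees), for an undirected graph."""
    S = set(S)
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    vol_S = sum(len(adj[u]) for u in S)
    vol_rest = sum(len(adj[u]) for u in adj) - vol_S
    return cut / min(vol_S, vol_rest)


def ncp_brute_force(adj, max_size):
    """NCP value for each size k: the minimum conductance over all
    vertex sets of size k. Exponential time, toy graphs only."""
    return {
        k: min(conductance(adj, S) for S in combinations(adj, k))
        for k in range(1, max_size + 1)
    }


# Toy graph (assumed for illustration): two triangles joined by one edge.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = {v: set() for v in range(6)}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

ncp = ncp_brute_force(adj, 3)
print(ncp)  # the minimum is reached at k = 3, one whole triangle
```

On this toy graph the best set of size 3 is a whole triangle, which is attached to the rest by a single edge; lower conductance means a better community.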

dolphins network and its NCP
(source [Leskovec et al., 2009])
(slide from Frieze, Gionis, Tsourakakis, Algorithmic Techniques for Modeling and Mining Large Graphs)

NCP on DBLP co-authorship network
NCP of a DBLP graph (source [Leskovec et al., 2009])
do large-scale real-world networks have such nice artificial structure?
NO!

Network N E Nb/N Eb/E ¯d ˜d ¯C D ¯D Description
Social networks
Delicious 147,567 301,921 0.40 0.65 4.09 48.44 0.30 24 6.28 del.icio.us collaborative tagging social network
Epinions 75,877 405,739 0.48 0.90 10.69 183.88 0.26 15 4.27 Who-trusts-whom network from epinions.com [142]
Flickr 404,733 2,110,078 0.33 0.86 10.43 442.75 0.40 18 5.42 Flickr photo sharing social network [101]
LinkedIn 6,946,668 30,507,070 0.47 0.88 8.78 351.66 0.23 23 5.43 Social network of professional contacts
LiveJournal01 3,766,521 30,629,297 0.78 0.97 16.26 111.24 0.36 23 5.55 Friendship network of a blogging community [20]
Messenger 1,878,736 4,079,161 0.53 0.78 4.34 15.40 0.09 26 7.42 Instant messenger social network
Email-All 234,352 383,111 0.18 0.50 3.27 576.87 0.50 14 4.07 Research organization email network (all addresses) [113]
Email-InOut 37,803 114,199 0.47 0.82 6.04 165.73 0.58 8 3.74 (all addresses but email has to be sent both ways) [113]
Email-Inside 986 16,064 0.90 0.99 32.58 74.66 0.45 7 2.60 (only emails inside the research organization) [113]
Email-Enron 33,696 180,811 0.61 0.90 10.73 142.36 0.71 13 3.99 Enron email dataset [100]
Answers 488,484 1,240,189 0.45 0.78 5.08 251.78 0.11 22 5.72 Yahoo Answers social network
Answers-1 26,971 91,812 0.56 0.87 6.81 59.17 0.08 16 4.49 Cluster 1 from Yahoo Answers
Information (citation) networks
Cit-Patents 3,764,105 16,511,682 0.82 0.96 8.77 21.34 0.09 26 8.15 Citation network of all US patents [112]
Cit-hep-ph 34,401 420,784 0.96 1.00 24.46 63.50 0.30 14 4.33 Citations between physics (arxiv hep-ph) papers [78]
Cit-hep-th 27,400 352,021 0.94 0.99 25.69 106.40 0.33 15 4.20 Citations between physics (arxiv hep-th) papers [78]
Blog-nat05-6m 29,150 182,212 0.74 0.96 12.50 342.51 0.24 10 3.40 Blog citation network (6 months of data) [116]
Blog-nat06all 32,384 315,713 0.87 0.99 19.50 153.08 0.20 18 3.94 Blog citation network (1 year of data) [116]
Post-nat05-6m 238,305 297,338 0.21 0.34 2.50 39.51 0.13 45 10.34 Blog post citation network (6 months) [116]
Post-nat06all 437,305 565,072 0.22 0.38 2.58 35.54 0.11 54 10.48 Blog post citation network (1 year) [116]
Collaboration networks
AtA-IMDB 883,963 27,473,042 0.87 0.99 62.16 517.40 0.79 15 3.48 IMDB actor collaboration network from Dec 2007
CA-astro-ph 17,903 196,972 0.89 0.98 22.00 65.70 0.67 14 4.21 Co-authorship in astro-ph of arxiv.org [112]
CA-cond-mat 21,363 91,286 0.81 0.93 8.55 22.47 0.70 15 5.36 Co-authorship in cond-mat category [112]
CA-gr-qc 4,158 13,422 0.64 0.78 6.46 17.98 0.66 17 6.10 Co-authorship in gr-qc category [112]
CA-hep-ph 11,204 117,619 0.81 0.97 21.00 130.88 0.69 13 4.71 Co-authorship in hep-ph category [112]
CA-hep-th 8,638 24,806 0.68 0.85 5.74 12.99 0.58 18 5.96 Co-authorship in hep-th category [112]
CA-DBLP 317,080 1,049,866 0.67 0.84 6.62 21.75 0.73 23 6.75 DBLP co-authorship network [20]
Table 1: Network datasets we analyzed. Statistics of networks we consider: number of nodes N; number of edges E; fraction nodes not in whiskers
(size of largest biconnected component) Nb/N; fraction of edges in biconnected component Eb/E; average degree ¯d = 2E/N; second order average
degree ˜d; average clustering coefficient ¯C; diameter D; and average path length ¯D.
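As a quick sanity check on the caption's formula for average degree, ¯d = 2E/N, the following sketch recomputes it for three rows of Table 1. The N, E and reported ¯d values are copied from the table; the dictionary layout and function name are illustrative assumptions.

```python
# (N, E, reported average degree) copied from three rows of Table 1.
rows = {
    "Delicious": (147_567, 301_921, 4.09),
    "Epinions": (75_877, 405_739, 10.69),
    "CA-DBLP": (317_080, 1_049_866, 6.62),
}


def average_degree(n_nodes, n_edges):
    """Each undirected edge contributes to the degree of two nodes."""
    return 2 * n_edges / n_nodes


for name, (n, e, reported) in rows.items():
    print(f"{name}: computed {average_degree(n, e):.2f}, reported {reported}")
```

The factor 2 appears because every undirected edge is counted once at each endpoint.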
Network N E Nb/N Eb/E ¯d ˜d ¯C D ¯D Description
Web graphs
Web-BerkStan 319,717 1,542,940 0.57 0.88 9.65 1,067.55 0.32 35 5.66 Web graph of Stanford and UC Berkeley [98]
Web-Google 855,802 4,291,352 0.75 0.92 10.03 170.35 0.62 24 6.27 Web graph Google released in 2002 [3]
Web-Notredame 325,729 1,090,108 0.41 0.76 6.69 280.68 0.47 46 7.22 Web graph of University of Notre Dame [11]
Web-Trec 1,458,316 6,225,033 0.59 0.78 8.54 682.89 0.68 112 8.58 Web graph of TREC WT10G web corpus [2]
Internet networks
As-RouteViews 6,474 12,572 0.62 0.80 3.88 164.81 0.40 9 3.72 AS from Oregon Exchange BGP Route View [112]
As-Caida 26,389 52,861 0.61 0.81 4.01 281.93 0.33 17 3.86 CAIDA AS Relationships Dataset
As-Skitter 1,719,037 12,814,089 0.99 1.00 14.91 9,934.01 0.17 5 3.44 AS from traceroutes run daily in 2005 by Skitter
As-Newman 22,963 48,436 0.65 0.83 4.22 261.46 0.35 11 3.83 AS graph from Newman [5]
As-Oregon 13,579 37,448 0.72 0.90 5.52 235.97 0.46 9 3.58 Autonomous systems [1]
Gnutella-25 22,663 54,693 0.59 0.83 4.83 10.75 0.01 11 5.57 Gnutella network on March 25 2000 [143]
Gnutella-30 36,646 88,303 0.55 0.81 4.82 11.46 0.01 11 5.75 Gnutella P2P network on March 30 2000 [143]
Gnutella-31 62,561 147,878 0.54 0.81 4.73 11.60 0.01 11 5.94 Gnutella network on March 31 2000 [143]
eDonkey 5,792,297 147,829,887 0.93 1.00 51.04 6,139.99 0.08 5 3.66 P2P eDonkey graph for a period of 47 hours in 2004
Bi-partite networks
IpTraffic 2,250,498 21,643,497 1.00 1.00 19.23 94,889.05 0.00 5 2.53 IP traffic graph a single router for 24 hours
AtP-astro-ph 54,498 131,123 0.70 0.87 4.81 16.67 0.00 28 7.78 Authors-to-papers network of astro-ph [116]
AtP-cond-mat 57,552 104,179 0.65 0.79 3.62 10.54 0.00 31 9.96 Authors-to-papers network of cond-mat [116]
AtP-gr-qc 14,832 22,266 0.47 0.60 3.00 9.72 0.00 35 11.08 Authors-to-papers network of gr-qc [116]
AtP-hep-ph 47,832 86,434 0.60 0.76 3.61 16.80 0.00 27 8.55 Authors-to-papers network of hep-ph [116]
AtP-hep-th 39,986 64,154 0.53 0.68 3.21 13.07 0.00 36 10.74 Authors-to-papers network of hep-th [116]
AtP-DBLP 615,678 944,456 0.49 0.64 3.07 13.61 0.00 48 12.69 DBLP authors-to-papers bipartite network
Spending 1,831,540 2,918,920 0.34 0.58 3.19 1,536.35 0.00 26 5.62 Users-to-keywords they bid
Hw7 653,260 2,278,448 0.99 0.99 6.98 346.85 0.00 24 6.26 Downsampled advertiser-query bid graph
Netflix 497,959 100,480,507 1.00 1.00 403.57 28,432.89 0.00 5 2.31 Users-to-movies they rated. From Netflix prize [4]
QueryTerms 13,805,808 17,498,668 0.28 0.41 2.53 14.92 0.00 86 19.81 Users-to-queries they submit to a search engine
Clickstream 199,308 951,649 0.39 0.87 9.55 430.74 0.00 7 3.83 Users-to-URLs they visited [126]
Biological networks
Bio-Proteins 4,626 14,801 0.72 0.91 6.40 24.25 0.12 12 4.24 Yeast protein interaction network [51]
Bio-Yeast 1,458 1,948 0.37 0.51 2.67 7.13 0.14 19 6.89 Yeast protein interaction network data [92]
Bio-YeastP0.001 353 1,517 0.73 0.93 8.59 20.18 0.57 11 4.33 Yeast protein-protein interaction map [135]
Bio-YeastP0.01 1,266 8,511 0.79 0.97 13.45 47.73 0.44 12 3.87 Yeast protein-protein interaction map [135]
Table 2: Network datasets we analyzed. Statistics of networks we consider: number of nodes N; number of edges E; fraction nodes not in whiskers
(size of largest biconnected component) Nb/N; fraction of edges in biconnected component Eb/E; average degree ¯d = 2E/N; second order average
degree ˜d; average clustering coefficient ¯C; diameter D; and average path length ¯D.
Network N E Nb/N Eb/E ¯d ˜d ¯C D ¯D Description
Nearly low-dimensional networks
Road-CA 1,957,027 2,760,388 0.80 0.85 2.82 3.17 0.06 865 310.97 California road network
Road-USA 126,146 161,950 0.97 0.98 2.57 2.81 0.03 617 218.55 USA road network (only main roads)
Road-PA 1,087,562 1,541,514 0.79 0.85 2.83 3.20 0.06 794 306.89 Pennsylvania road network
Road-TX 1,351,137 1,879,201 0.78 0.84 2.78 3.15 0.06 1,064 418.73 Texas road network
PowerGrid 4,941 6,594 0.62 0.69 2.67 3.87 0.11 46 19.07 Power grid of Western States Power Grid [156]
Mani-faces7k 696 6,979 0.98 0.99 20.05 37.99 0.56 16 5.52 Faces (64x64 grayscale images) (connect 7k closest pairs)
Mani-faces4k 663 3,465 0.90 0.97 10.45 20.20 0.56 29 8.96 Faces (connect 4k closest pairs)
Mani-faces2k 551 1,981 0.84 0.94 7.19 12.77 0.54 32 11.07 Faces (connect 2k closest pairs)
Mani-facesK10 698 6,935 1.00 1.00 19.87 25.32 0.51 6 3.25 Faces (connect every to 10 nearest neighbors)
Mani-swiss200k 20,000 200,000 1.00 1.00 20.00 21.08 0.59 103 37.21 Swiss-roll (connect 200k nearest pairs of nodes)
Mani-swissK10 20,000 199,955 1.00 1.00 20.00 25.38 0.56 10 5.47 Swiss-roll (every node connects to 10 nearest neighbors)
IMDB Actor-to-Movie graphs
AtM-IMDB 2,076,978 5,847,693 0.49 0.82 5.63 65.41 0.00 32 6.82 Actors-to-movies graph from IMDB (imdb.com)
Imdb-top30 198,430 566,756 0.99 1.00 5.71 18.19 0.00 26 8.32 Actors-to-movies graph heavily preprocessed
Imdb-raw07 601,481 1,320,616 0.54 0.79 4.39 20.94 0.00 32 8.55 Country clusters were extracted from this graph
Imdb-France 35,827 74,201 0.51 0.76 4.14 14.62 0.00 20 6.57 Cluster of French movies
Imdb-Germany 21,258 42,197 0.56 0.78 3.97 13.69 0.00 34 7.47 German movies (to actors that played in them)
datasets! publicly available in SNAP

main findings
1. up to a certain size k (k ∼ 100 vertices) there exist good cuts
as the size increases, so does the quality of the community
2. at size k we observe the best possible community
such communities are typically connected to the remainder
with a single edge
3. above size k the community quality decreases
this is because communities blend in and gradually disappear

summary
Community structure in large networks: natural cluster
sizes and the absence of large well-defined clusters
Leskovec, Lang, Dasgupta, Mahoney
hypothesis : well-formed and interesting, assumed true
data : very extensive collection
methodology : introduce a new metric (NCP)
impact / interestingness : challenged the starting hypothesis
reproducibility : datasets and code publicly available

Community structure in large networks: natural cluster
sizes and the absence of large well-defined clusters
Leskovec, Lang, Dasgupta, Mahoney
originality: 1 2 3 4 5 (low to high)
impact: 1 2 3 4 5 (low to high)
rigorousness / technical novelty: 1 2 3 4 5 (low to high)
reproducibility: 1 2 3 4 5 (low to high)

arXiv:physics/0603229v3 [physics.soc-ph] 28 Jan 2007
Graph Evolution:
Densification and Shrinking Diameters
Jure Leskovec
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA
Jon Kleinberg
Department of Computer Science, Cornell University, Ithaca, NY
Christos Faloutsos
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA
February 2, 2008
Abstract
How do real graphs evolve over time? What are “normal” growth patterns in
social, technological, and information networks? Many studies have discovered
patterns in static graphs, identifying properties in a single snapshot of a large
network, or in a very small number of snapshots; these include heavy tails
for in- and out-degree distributions, communities, small-world phenomena, and
others. However, given the lack of information about network evolution over
long periods, it has been hard to convert these findings into statements about
trends over time.
Here we study a wide range of real graphs, and we observe some surprising
phenomena. First, most of these graphs densify over time, with the number
of edges growing super-linearly in the number of nodes. Second, the average
distance between nodes often shrinks over time, in contrast to the conventional
wisdom that such distance parameters should increase slowly as a function of
the number of nodes (like O(log n) or O(log(log n))).
Existing graph generation models do not exhibit these types of behavior,
even at a qualitative level. We provide a new graph generator, based on a
“forest fire” spreading process, that has a simple, intuitive justification, requires
very few parameters (like the “flammability” of nodes), and produces graphs
exhibiting the full range of properties observed both in prior work and in the

graph evolution and shrinking diameters
networks evolve over time
typically new vertices/edges are added (not many deletions)
how do network distances change over time?
constant average degree and vertex addition…
… implies diameter = O(log n), slowly increasing
according to the random-graph model
also according to other more “realistic” models
e.g., preferential attachment

empirical observation :
as networks evolve, distances shrink (e.g., the diameter shrinks)
why?
the number of edges grows faster than the number of vertices
the graph becomes denser: graph densification
Time-evolving networks
J. Leskovec, J. Kleinberg, C. Faloutsos
[Leskovec et al., 2005b]
• densification power law:
$|E_t| \propto |V_t|^{\alpha}$, with $1 \le \alpha \le 2$
• shrinking diameters: the diameter shrinks over time.
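The densification exponent can be estimated from snapshots of a growing graph by a least-squares line fit in log-log space. The sketch below assumes that approach and uses made-up snapshot sizes generated from a known power law; the function name and the data are hypothetical, not taken from the paper.

```python
import math


def fit_power_law(nodes, edges):
    """Least-squares fit of log(edges) = log(c) + alpha * log(nodes).
    Returns (alpha, c)."""
    xs = [math.log(n) for n in nodes]
    ys = [math.log(e) for e in edges]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    alpha = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    c = math.exp(mean_y - alpha * mean_x)
    return alpha, c


# Hypothetical snapshots: edge counts follow 0.5 * n^1.5 exactly.
nodes = [100, 1_000, 10_000, 100_000]
edges = [0.5 * n ** 1.5 for n in nodes]

alpha, c = fit_power_law(nodes, edges)
print(alpha, c)  # recovers the exponent and constant used above
```

An exponent of 1 would mean a constant average degree; any value above 1 means the average degree grows as the graph grows.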

average degree
[Figure 1, four panels of average out-degree over time: (a) arXiv, by year of publication, 1994-2002; (b) Patents, by year granted, 1975-1995; (c) Autonomous Systems, by time in days; (d) Affiliation network, by year of publication, 1994-2000]
Figure 1: The average node out-degree over time. Notice that it increases, in all 4 datasets.
That is, all graphs are densifying.

number of edges
[Figure 2, six log-log panels of number of edges versus number of nodes, with fitted lines:
(a) arXiv: Edges = 0.0113 x^1.69, R^2 = 1.0
(b) Patents: Edges = 0.0002 x^1.66, R^2 = 0.99
(c) Autonomous Systems: Edges = 0.87 x^1.18, R^2 = 1.00
(d) Affiliation network: Edges = 0.4255 x^1.15, R^2 = 1.0
(e) Email network (Oct ’03 to May ’05): Edges = 1 x^1.12, R^2 = 1.00
(f) IMDB actors to movies network (1910 to 2004): Edges = 0.9 x^1.11, R^2 = 0.98]
Figure 2: Number of edges e(t) versus number of nodes n(t), in log-log scales, for several
graphs. All 4 graphs obey the Densification Power Law, with a consistently good fit. Slopes:
a = 1.68, 1.66, 1.18, 1.15, 1.12, and 1.11 respectively.

effective diameter
[Figure 3, six panels of effective diameter: (a) arXiv citation graph, over time in years; (b) Affiliation network, over time in years; (c) Patents citation graph, over time in years; (d) Autonomous Systems, versus size of the graph in number of nodes, with a linear fit; (e) Email network, over time in months; (f) IMDB actors to movies network, over time in years. Curves shown include the full graph, a post-cutoff subgraph, and a post-cutoff subgraph with no past.]
Figure 3: The effective diameter over time for 6 different datasets. Notice consistent decrease
of the diameter over time.
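For concreteness, here is a minimal sketch of an effective-diameter computation. It assumes the common definition: the smallest number of hops within which 90% of all connected pairs of nodes can reach each other. It uses plain breadth-first search from every node; any smoothing or interpolation used in the paper is not reproduced, and the toy graphs are illustrative.

```python
from collections import Counter, deque


def bfs_distances(adj, source):
    """Hop distance from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def effective_diameter(adj, q=0.9):
    """Smallest d such that at least a fraction q of all connected
    (ordered) pairs are within d hops of each other."""
    counts = Counter()
    for s in adj:
        for d in bfs_distances(adj, s).values():
            if d > 0:
                counts[d] += 1
    total = sum(counts.values())
    reached = 0
    for d in sorted(counts):
        reached += counts[d]
        if reached >= q * total:
            return d
    return 0  # graph without any connected pair


# Toy example (assumed): a path 0-1-2-3, then the same path closed into a cycle.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
cycle = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {2, 0}}
print(effective_diameter(path), effective_diameter(cycle))
```

Adding a single edge to the path shortens the distances, a tiny instance of how densification can shrink the diameter.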

theoretical justification :
proposed a graph-evolution model that explains the
empirical findings
(graph densification and shrinking diameters)
forest fire model (FF)
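To give a feel for the "burning" process, the following is a simplified sketch of a forest-fire style generator. It is an assumption-laden variant, not the authors' exact model: it burns forward links only, uses a single forward-burning probability p, and ignores backward burning. Each new node picks a random earlier node (its "ambassador"), links to it, and then recursively links to a geometrically distributed number of that node's out-neighbors.

```python
import random


def forest_fire_graph(n, p, seed=0):
    """Grow a directed graph node by node.
    Returns a dict: node -> set of out-neighbors (all older nodes)."""
    rng = random.Random(seed)
    out = {0: set()}
    for v in range(1, n):
        ambassador = rng.randrange(v)   # uniformly chosen earlier node
        burned = {ambassador}           # nodes the new node will link to
        frontier = [ambassador]
        while frontier:
            w = frontier.pop()
            # Geometric number of links to follow, with mean p / (1 - p).
            x = 0
            while rng.random() < p:
                x += 1
            candidates = sorted(u for u in out[w] if u not in burned)
            rng.shuffle(candidates)
            for u in candidates[:x]:
                burned.add(u)
                frontier.append(u)
        out[v] = burned
    return out


# Illustrative parameters (assumed, not from the paper).
g = forest_fire_graph(500, 0.35)
n_edges = sum(len(nbrs) for nbrs in g.values())
print("nodes:", len(g), "edges:", n_edges)
```

Because a fire can spread further in a larger, denser graph, later nodes tend to create more links than early ones, which is the intuition behind densification in this family of models.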

summary
Graph evolution: densification and shrinking diameters
Leskovec, Kleinberg, Faloutsos
hypothesis : well-formed, assumed true
as a graph evolves, distances increase
data : extensive collection; how to collect evolving networks?
methodology : simple statistics, but never done before
impact/interestingness : challenged the hypothesis, interesting findings
reproducibility : datasets and code publicly available

Graph evolution: densification and shrinking diameters
Leskovec, Kleinberg, Faloutsos
originality: 1 2 3 4 5 (low to high)
impact: 1 2 3 4 5 (low to high)
rigorousness / technical novelty: 1 2 3 4 5 (low to high)
reproducibility: 1 2 3 4 5 (low to high)

Feedback Effects between Similarity and Social Influence
in Online Communities
David Crandall
Dept. of Computer Science
Cornell University
Ithaca, NY 14853
crandall@cs.cornell.edu
Dan Cosley
Dept. of Communication
Cornell University
Ithaca, NY 14853
drc44@cornell.edu
Daniel Huttenlocher
Cornell University
Ithaca, NY 14853
dph@cs.cornell.edu
Jon Kleinberg
Cornell University
Ithaca, NY 14853
kleinber@cs.cornell.edu
Siddharth Suri
Cornell University
Ithaca, NY 14853
ssuri@cs.cornell.edu
ABSTRACT
A fundamental open question in the analysis of social net-
works is to understand the interplay between similarity and
social ties. People are similar to their neighbors in a social
network for two distinct reasons: first, they grow to resemble
their current friends due to social influence; and second, they
tend to form new links to others who are already like them,
a process often termed selection by sociologists. While both
factors are present in everyday social processes, they are in
tension: social influence can push systems toward unifor-
mity of behavior, while selection can lead to fragmentation.
As such, it is important to understand the relative effects
of these forces, and this has been a challenge due to the
difficulty of isolating and quantifying them in real settings.
We develop techniques for identifying and modeling the in-
teractions between social influence and selection, using data
from online communities where both social interaction and
changes in behavior over time can be measured. We find
clear feedback effects between the two factors, with rising
similarity between two individuals serving, in aggregate, as
an indicator of future interaction — but with similarity then
continuing to increase steadily, although at a slower rate, for
the current activities of their friends, or of the people most
similar to them?
Categories and Subject Descriptors: H.2.8 Database
Management: Database Applications – Data Mining
General Terms: Measurement, Theory
Keywords: social networks, online communities, social in-
fluence
1. INTRODUCTION
Social influence and selection. A fundamental property
of social networks is that people tend to have attributes
similar to those of their friends. There are two underlying
reasons for this. First, the process of social influence [7] leads
people to adopt behaviors exhibited by those they interact
with; this effect is at work in many settings where new ideas
diffuse by word-of-mouth or imitation through a network of
people [19, 22]. A second, distinct reason is that people tend
to form relationships with others who are already similar to
them. This phenomenon, which is often termed selection,
has a long history of study in sociology [13, 16].
The two forces of social influence and selection are both
seen in a wide range of social settings: people decide to adopt

similarity and social influence
observation : people are similar to their friends
selection or influence?
questions :
how does social interaction affect interests, and vice versa?
can we use social similarity and interaction to predict
future behavior?

user interests and
similarity between users
focus on wikipedia editors
who edits which page?
edits up to time t form a vector expressing user interests
up to that time point
consider the similarity of two users who “meet”:
one posts in the discussion page of the other
use one of the more common measures, the cosine metric,
$$\mathrm{Cosine}(\vec{u},\vec{v}) = \cos\theta_{\vec{u},\vec{v}} = \frac{\vec{u}\cdot\vec{v}}{\|\vec{u}\|_2\,\|\vec{v}\|_2}, \qquad (1)$$
where $\|\vec{v}\|_2$ denotes the Euclidean norm of $\vec{v}$.
While a comparison of similarity measures is not the fo-
cus of our current work, we have evaluated a wide range of
measures for our purpose. We use the cosine metric here be-
cause it is independent of the rate at which people are edit-

user interests and similarity between users
main finding :
Figure 1: Average cosine similarity of user pairs as
a function of the number of edits from time of first
interaction, for Wikipedia.
user interests and similarity between users
possible explanation :
feedback loop between social influence and selection
similarity leads to interaction, which leads to further similarity
the authors propose a theoretical model to explain the findings
(neighbors may affect actions and interactions)

predicting future behavior based on
user similarity and user interaction
(a) Wikipedia (b) LiveJournal
Figure 4: Probability of joining a community based on k exposures via social ties versus similarity ties
for (a) Wikipedia and (b) LiveJournal. The solid black curves correspond to social ties and the dashed red
curves to similarity ties. The error bars represent ±2 standard errors.

summary
Feedback effects between similarity and social influence in
online communities
Crandall et al.
question to study : interplay between influence and selection
data : wikipedia edits (creative but somewhat limited)
methodology : simple statistics, theoretical model, prediction model
impact/interestingness : some interesting findings
reproducibility : datasets publicly available

Feedback effects between similarity and social influence in
online communities
Crandall et al.
originality: 1 2 3 4 5 (low to high)
impact: 1 2 3 4 5 (low to high)
rigorousness / technical novelty: 1 2 3 4 5 (low to high)
reproducibility: 1 2 3 4 5 (low to high)

Meme-tracking and the Dynamics of the News Cycle
Jure Leskovec ∗†, Lars Backstrom ∗, Jon Kleinberg ∗
∗ Cornell University, † Stanford University
jure@cs.stanford.edu lars@cs.cornell.edu kleinber@cs.cornell.edu
ABSTRACT
Tracking new topics, ideas, and “memes” across the Web has been
an issue of considerable interest. Recent work has developed meth-
ods for tracking topic shifts over long time scales, as well as abrupt
spikes in the appearance of particular named entities. However,
these approaches are less well suited to the identification of content
that spreads widely and then fades over time scales on the order of
days — the time scale at which we perceive news and events.
We develop a framework for tracking short, distinctive phrases
that travel relatively intact through on-line text; developing scalable
algorithms for clustering textual variants of such phrases, we iden-
tify a broad class of memes that exhibit wide spread and rich vari-
ation on a daily basis. As our principal domain of study, we show
how such a meme-tracking approach can provide a coherent repre-
sentation of the news cycle — the daily rhythms in the news media
that have long been the subject of qualitative interpretation but have
never been captured accurately enough to permit actual quantitative
analysis. We tracked 1.6 million mainstream media sites and blogs
over a period of three months with the total of 90 million articles
and we find a set of novel and persistent temporal patterns in the
news cycle. In particular, we observe a typical lag of 2.5 hours
between the peaks of attention to a phrase in the news media and
in blogs respectively, with divergent behavior around the overall
peak and a “heartbeat”-like pattern in the handoff between news
and blogs. We also develop and analyze a mathematical model for
the kinds of temporal variation that the system exhibits.
Categories and Subject Descriptors: H.2.8 [Database Manage-
ment]: Database applications—Data mining
General Terms: Algorithms; Experimentation.
Keywords: Meme-tracking, Blogs, News media, News cycle, In-
formation cascades, Information diffusion, Social networks
abilistic term mixtures have been successful at identifying long-
range trends in general topics over time [5, 7, 16, 17, 30, 31]. At the
other extreme, identifying hyperlinks between blogs and extracting
rare named entities has been used to track short information cas-
cades through the blogosphere [3, 14, 20, 23]. However, between
these two extremes lies much of the temporal and textual range
over which propagation on the web and between people typically
occurs, through the continuous interaction of news, blogs, and web-
sites on a daily basis. Intuitively, short units of text, short phrases,
and “memes” that act as signatures of topics and events propagate
and diffuse over the web, from mainstream media to blogs, and vice
versa. This is exactly the focus of our study here.
Moreover, it is at this intermediate temporal and textual granular-
ity of memes and phrases that people experience news and current
events. A succession of story lines that evolve and compete for at-
tention within a relatively stable set of broader topics collectively
produces an effect that commentators refer to as the news cycle.
Tracking dynamic information at this temporal and topical resolu-
tion has proved difficult, since the continuous appearance, growth,
and decay of new story lines takes place without significant shifts
in the overall vocabulary; in general, this process can also not be
closely aligned with the appearance and disappearance of specific
named entities (or hyperlinks) in the text. As a result, while the
dynamics of the news cycle has been a subject of intense interest to
researchers in media and the political process, the focus has been
mainly qualitative, with a corresponding lack of techniques for un-
dertaking quantitative analysis of the news cycle as a whole.
Our approach to meme-tracking, with applications to the news
cycle. Here we develop a method for tracking units of information
as they spread over the web. Our approach is the first to scalably
identify short distinctive phrases that travel relatively intact through

meme tracking
understand the dynamics of reported news
focus on 24-hour news cycles
questions :
do such news cycles exist?
can we detect them in the data?
can we measure their properties?

meme tracking
37
dataset :
90 m news articles from the 2008 US presidential elections
how to identify news cycles :
urls, topics, name entities, bag-of-words…?
approach taken : quotes (memes)
easy to manage at large scale
travel relatively unchanged via many articles
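As a rough illustration of why quotes are easy to manage at scale, the sketch below pulls quoted strings out of an article and normalizes them (lowercase, punctuation collapsed to spaces), similar in spirit to the normalized variants shown in the paper's figures. The function name `extract_quotes`, the regular expression, and the `min_words` threshold are assumptions made for this example, not the authors' pipeline.

```python
import re

# Matches text between straight or curly double quotes (assumed convention).
QUOTE_RE = re.compile('["“]([^"“”]+)["”]')


def extract_quotes(text, min_words=4):
    """Return normalized quoted phrases that have at least min_words words."""
    quotes = []
    for raw in QUOTE_RE.findall(text):
        # Lowercase and replace every run of non-alphanumerics with one space.
        norm = re.sub(r"[^a-z0-9]+", " ", raw.lower()).strip()
        if len(norm.split()) >= min_words:
            quotes.append(norm)
    return quotes


article = "She said \"He's palling around with terrorists\" and \"No way\"."
print(extract_quotes(article))
```

Very short quotes are dropped here because they are unlikely to be distinctive; the right threshold depends on the data.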

[Figure 1: graph of phrase variants. Some of the intact variants shown in the figure:]
our opponent is someone who sees america it seems as being so imperfect that he s palling around with terrorists who would target their own country
our opponent is someone who sees america as imperfect enough to pal around with terrorists who targeted their own country
palling around with terrorists who would target their own country
is palling around with terrorists
a force for good in the world
Figure 1: A small portion of the full set of variants of Sarah Palin’s quote, “Our opponent is someone who sees America, it seems,
as being so imperfect, imperfect enough that he’s palling around with terrorists who would target their own country.” The arrows
indicate the (approximate) inclusion of one variant in another, as part of the methodology developed in Section 2.
phrases with this property are exclusively produced by spammers.
(We use ε = .25, L = 4, and M = 10 in our implementation.)

meme tracking
interesting optimization problem
identify single-rooted propagations
Figure 2: Phrase graph. Each phrase is a node and we want to
delete the fewest edges so that each resulting connected component
has a single root node/phrase, a node with zero out-edges.
By deleting the indicated edges we obtain the optimal solution.
To begin, we deﬁne some terminology. We will refer to each
news article or blog post as an item, and refer to a quoted string

meme tracking
volume distributions
… 5 in Fig. 2). So, the phrase cluster should be a subgraph for which
all paths terminate in a single root node.
To identify phrase clusters, we would like to delete edges of small total
weight from the phrase graph so it falls apart into disjoint pieces,
with the property that each piece “feeds into” a single root phrase
that can serve as the exemplar for the phrase cluster.
More precisely, we define a directed acyclic graph to be single-rooted
if it contains exactly one root node. (Note that every DAG has at least
one root.) We now define the following DAG Partitioning problem:
DAG Partitioning: Given a directed acyclic graph with edge weights,
delete a set of edges of minimum total weight so that each of the
resulting components is single-rooted.
Figure 2 shows a DAG with all edge weights equal to 1; deleting the
indicated edges forms the unique optimal solution.
DAG Partitioning is computationally intractable to solve optimally.
We then discuss the heuristic we use for the problem, which we find
to work well in practice.
DAG Partitioning is NP-hard.
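Since the exact problem is NP-hard, a heuristic is needed. The sketch below shows one greedy variant, assuming (as in the Figure 2 caption) that a root is a node with zero out-edges: each root starts its own cluster, and every other node joins the cluster that receives the largest total weight of its out-edges. The function name `greedy_dag_partition`, the edge-dictionary input format, and the tie-breaking rule are assumptions for illustration, not necessarily the authors' exact procedure.

```python
from collections import defaultdict


def greedy_dag_partition(edges):
    """Greedy sketch for DAG partitioning.

    edges maps (u, v) -> weight for a directed edge u -> v.
    Returns (cluster, deleted): cluster maps each node to the root of its
    cluster, deleted lists the edges that cross clusters and are cut.
    """
    out = defaultdict(dict)
    nodes = set()
    for (u, v), w in edges.items():
        out[u][v] = w
        nodes.update((u, v))

    cluster = {}

    def assign(u):
        if u in cluster:
            return cluster[u]
        if not out[u]:
            # A node with no out-edges is a root and labels its own cluster.
            cluster[u] = u
            return u
        votes = defaultdict(float)
        for v, w in out[u].items():
            votes[assign(v)] += w
        # Ties are broken by sorted order of the root labels (an assumption).
        cluster[u] = max(sorted(votes), key=votes.get)
        return cluster[u]

    for u in sorted(nodes):
        assign(u)

    deleted = sorted(e for e in edges if cluster[e[0]] != cluster[e[1]])
    return cluster, deleted


edges = {("a", "r1"): 1, ("b", "r2"): 1, ("c", "a"): 1, ("c", "b"): 2}
cluster, deleted = greedy_dag_partition(edges)
print(cluster, deleted)
```

The recursion is fine for small examples; a large phrase graph would need an explicit topological order instead.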
[Figure 3 plot: log-log axes. x-axis: volume, x. y-axis: number of items with volume ≥ x. Fitted tails: Phrases ∝ x^-1.8, Clusters ∝ x^-2.1, Lipstick ∝ x^-0.85.]
Figure 3: Phrase volume distribution. We consider the volume
of individual phrases, phrase-clusters, and the phrases that
compose the “Lipstick on a pig” cluster. Notice phrases and
phrase-clusters have similar power-law distribution while the
“Lipstick on a pig” cluster has much fatter tail, which means
that popular phrases have unexpectedly high popularity.
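A curve of this kind (number of items with volume ≥ x) can be computed directly from a list of volumes. The sketch below builds that cumulative count and estimates the tail exponent with an ordinary least-squares fit on log-log axes. The function names and the toy data are assumptions for illustration, and least squares is only a rough estimator; the excerpt does not say how the exponents in the figure were fitted.

```python
import math
from collections import Counter


def ccdf(volumes):
    """Return sorted (x, number of items with volume >= x) pairs."""
    counts = Counter(volumes)
    remaining = len(volumes)
    pairs = []
    for x in sorted(counts):
        pairs.append((x, remaining))
        remaining -= counts[x]
    return pairs


def loglog_slope(pairs):
    """Least-squares slope of log10(count) against log10(x)."""
    xs = [math.log10(x) for x, _ in pairs]
    ys = [math.log10(c) for _, c in pairs]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


# Toy volumes chosen so that the cumulative count halves when x doubles.
volumes = [1, 1, 1, 1, 2, 2, 4, 8]
pairs = ccdf(volumes)
print(pairs, loglog_slope(pairs))
```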
to the cluster to which it has the most edges. For example, in Fig. 2

Figure 4: Top 50 threads in the news cycle with highest volume for the period Aug. 1 – Oct. 31, 2008. Each thread consists of all news
articles and blog posts containing a textual variant of a particular quoted phrase. (Phrase variants for the two largest threads in
each week are shown as labels pointing to the corresponding thread.) The data is drawn as a stacked plot in which the thickness of the
strand corresponding to each thread indicates its volume over time. Interactive visualization is available at http://memetracker.org.
threads dynamics

question to study : identify news cycles, study their dynamics
data : news articles
methodology : interesting computational problems in
managing memes
impact/interestingness : interesting methods
interesting findings
reproducibility : datasets publicly available
summary
Meme-tracking and the dynamics of the news cycle
Leskovec, Backstrom, Kleinberg

Meme-tracking and the dynamics of the news cycle
Leskovec, Backstrom, Kleinberg
43
1 2 3 4 5
originality
low high
1 2 3 4 5
impact
low high
1 2 3 4 5
low high
1 2 3 4 5
reproducibility
low high

Everyone’s an Influencer:
Quantifying Influence on Twitter
Eytan Bakshy∗
University of Michigan, USA
ebakshy@umich.edu
Jake M. Hofman
Yahoo! Research, NY, USA
hofman@yahoo-inc.com
Winter A. Mason
winteram@yahoo-inc.com
Duncan J. Watts
djw@yahoo-inc.com
ABSTRACT
In this paper we investigate the attributes and relative influ-
ence of 1.6M Twitter users by tracking 74 million diffusion
events that took place on the Twitter follower graph over
a two month interval in 2009. Unsurprisingly, we find that
the largest cascades tend to be generated by users who have
been influential in the past and who have a large number
of followers. We also find that URLs that were rated more
interesting and/or elicited more positive feelings by workers
on Mechanical Turk were more likely to spread. In spite of
these intuitive results, however, we find that predictions of
which particular user or URL will generate large cascades
are relatively unreliable. We conclude, therefore, that word-
of-mouth diffusion can only be harnessed reliably by tar-
geting large numbers of potential influencers, thereby cap-
turing average effects. Finally, we consider a family of hy-
pothetical marketing strategies, defined by the relative cost
of identifying versus compensating potential “influencers.”
We find that although under some circumstances, the most
influential users are also the most cost-effective, under a
wide range of plausible assumptions the most cost-effective
performance can be realized using “ordinary influencers”—
individuals who exert average or even less-than-average in-
fluence.
Categories and Subject Descriptors
H.1.2 [Models and Principles]: User/Machine Systems;
J.4 [Social and Behavioral Sciences]: Sociology
Keywords
Communication networks, Twitter, diffusion, influence, word
of mouth marketing.
1. INTRODUCTION
Word-of-mouth diffusion has long been regarded as an im-
portant mechanism by which information can reach large
populations, possibly influencing public opinion [14], adop-
tion of innovations [26], new product market share [4], or
brand awareness [15]. In recent years, interest among re-
searchers and marketers alike has increasingly focused on
whether or not diffusion can be maximized by seeding a
piece of information or a new product with certain spe-
cial individuals, often called “influentials” [34, 15] or sim-
ply “influencers,” who exhibit some combination of desirable
attributes—whether personal attributes like credibility, ex-
pertise, or enthusiasm, or network attributes such as connec-
tivity or centrality—that allows them to influence a dispro-
portionately large number of others [10], possibly indirectly
via a cascade of influence [31, 16].
Although appealing, the claim that word-of-mouth diffu-
sion is driven disproportionately by a small number of key
influencers necessarily makes certain assumptions about the
underlying influence process that are not based directly on
empirical evidence. Empirical studies of diffusion are there-
fore highly desirable, but historically have suffered from two
major difficulties. First, the network over which word-of-
mouth influence spreads is generally unobservable, hence

who is influential in twitter?
45
questions :
who is influential and in which content?
(celebrity vs. expert on a topic vs. trusted friend…)
can we predict who is influential?

who is influential in twitter?
46
dataset :
track 1.6 m users
74 m diffusion events (cascades of shortened urls)
two-month period in 2009
definition of influential :
someone who posts urls that many retweet
(narrow for the purpose of the study)

the dataset
47
[Figure 1 plot: log-log axes. x-axis: URLs posted. y-axis: density.]
Figure 1: Probability density of number of bit.ly
URLs posted per user
the same two-month period. We did this by querying the
Twitter API to find the followers of every user who posted
a bit.ly URL. Subsequently, we placed those followers in a
queue to be crawled, thereby identifying their followers, who
were then also placed in the queue, and so on. In this way,
we obtained a large fraction of the Twitter follower graph
comprising all active bit.ly posters and anyone connected to
these users via one-way directed chains of followers. Specifi-
cally, the subgraph comprised approximately 56M users and
1.7B edges.
Consistent with previous work [7, 18, 35], both the in-degree
(“followers”) and out-degree (“friends”) distributions
are highly skewed, but the former much more so—whereas
the maximum # of followers was nearly 4M, the maximum
# of friends was only about 760K—reflecting the passive
and one-way nature of the “follow” action on Twitter (i.e.
A can follow B without any action required from B). We
emphasize, moreover, that because the crawled graph was
seeded exclusively with active users, it is almost certainly
not representative of the entire follower graph. In particular,
active users are likely to have more followers than average,
in which case we would expect that the average in-degree
will exceed the average out-degree for our sample—as indeed
we observe. Table 1 presents some basic statistics of the
distributions of the number of friends, followers and number
of URLs posted per user.
“leaders,” not on prediction.) Second, whereas the focus of
previous studies has been largely descriptive (e.g. comparing
the most influential users), we are interested explicitly in
predicting influence; thus we consider all users, not merely
the most influential. Third, in addition to predicting diffusion
as a function of the attributes of individual seeds, we
also study the effects of content. We believe these differences
bring the understanding of diffusion on Twitter closer
to practical applications, although as we describe later,
experimental studies are still required.
DATA
To study diffusion on Twitter, we combined two separate
but related sources of data. First, over the two-month period
of September 13 2009 - November 15 2009 we recorded
1.03B public tweets broadcast on Twitter, excluding October
14-16 during which there were intermittent outages in
the Twitter API. Of these, we extracted 87M tweets that
included bit.ly URLs and which corresponded to distinct
diffusion “events,” where each event comprised a single initiator,
or “seed,” followed by some number of repostings of
the same URL by the seed’s followers, their followers, and so
on. Finally, we identified a subset of 74M diffusion events
that were initiated by seed users who were active in both
the first and second months of the observation period; thus
enabling us to train our regression model on first month
Table 1: Statistics of the Twitter follower graph and seed activity
          # Followers     # Friends    # Seeds Posted
Median          85.00         82.00             11.00
Mean           557.10        294.10             46.33
Max.     3,984,000.00    759,700.00            54,890
4. COMPUTING INFLUENCE ON TWITTER
To calculate the influence score for a given URL post,
we tracked the diffusion of the URL from its origin at a
particular “seed” node through a series of reposts—by that
user’s followers, those users’ followers, and so on—until the
diffusion event, or cascade, terminated. To do this, we used
the time each URL was posted: if person B is following
person A, and person A posted the URL before B and was
the only of B’s friends to post the URL, we say person A
influenced person B to post the URL. As Figure 2 shows,
if B has more than one friend who has previously posted
the same URL, we have three choices for how to assign the
corresponding influence: first, we can assign full credit to the
the urls the follower graph
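The attribution rule described in the excerpt (B follows A, and A posted the URL before B) can be sketched as below. The input format (one dictionary of post times for a single URL, one dictionary of who each user follows) and the function name `attribute_influence` are assumptions for this example. When several friends posted earlier, the excerpt mentions three options but is cut off; the code simply credits the earliest poster, which is an illustrative assumption.

```python
def attribute_influence(posts, friends):
    """Map each poster of one URL to the user credited with influencing them.

    posts:   dict user -> time at which the user posted the URL
    friends: dict user -> set of users that this user follows
    Seeds (no friend posted earlier) map to None.
    """
    parent = {}
    for user, t in posts.items():
        earlier = [
            (posts[f], f)
            for f in friends.get(user, ())
            if f in posts and posts[f] < t
        ]
        # Assumed policy: full credit to the friend who posted first.
        parent[user] = min(earlier)[1] if earlier else None
    return parent


posts = {"A": 1, "B": 2, "C": 3, "D": 5}
friends = {"B": {"A"}, "C": {"A", "B"}, "D": set()}
print(attribute_influence(posts, friends))
```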

cascades
48
Figure 3: Examples of information cascades on
Twitter.
there are many reasons why individuals may choose to pass
along information other than the number and identity of
the individuals from whom they received it—in particular,
the nature of the content itself. Moreover, influencing an-
other individual to pass along a piece of information does not
[Figure 4 plots: (a) Cascade Sizes, density versus size on log-log axes; (b) Cascade Depths, frequency (log scale) versus depth (0 to 8).]
Figure 4: (a). Frequency distribution of cascade
sizes. (b). Distribution of cascade depths.
we study size or depth, therefore, the implication is that
most events do not spread at all, and even moderately sized
cascades are extremely rare.
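Size and depth can be computed once every repost has been attributed to a single influencer. The sketch below assumes each diffusion event is stored as a parent map (user to influencer, with None for the seed), which is a hypothetical representation, and it counts the seed itself in the cascade size.

```python
def cascade_stats(parent):
    """Return {seed: (size, depth)} for cascades stored as a parent map.

    parent maps each user to the user credited with influencing them,
    or to None if the user is a seed.
    """
    stats = {}
    for user in parent:
        # Walk up to the seed, counting the number of hops.
        node, hops = user, 0
        while parent[node] is not None:
            node = parent[node]
            hops += 1
        size, depth = stats.get(node, (0, 0))
        stats[node] = (size + 1, max(depth, hops))
    return stats


parent = {"A": None, "B": "A", "C": "B", "D": None}
print(cascade_stats(parent))
```

A seed whose URL is never reposted gets size 1 and depth 0 under this convention, which is the case the text describes as most common.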
To identify consistently influential individuals, we aggre-
gated all URL posts by user and computed individual-level
influence as the logarithm of the average size of all cascades
for which that user was a seed. We then fit a regression
tree model [6], in which a greedy optimization process recur-
sively partitions the feature space, resulting in a piecewise-
constant function where the value in each partition is fit to
the mean of the corresponding training data. An important

prediction task
49
build a model to predict influence
model features :
user attributes
# followers
# friends
# tweets
date of joining
past influence of seed users
average, minimum, and maximum total influence
average, minimum, and maximum local influence
(repeat study with additional content features)
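The paper fits a regression tree for this task: a greedy procedure that recursively splits the feature space and predicts the mean of the training targets in each partition. The sketch below is a minimal pure-Python version of that idea, not the authors' implementation. The names `fit_tree` and `predict`, the parameters `max_depth` and `min_leaf`, and the toy data are assumptions for illustration.

```python
def sse(ys):
    """Sum of squared deviations from the mean."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def fit_tree(X, y, max_depth=2, min_leaf=2):
    """Return a leaf value (float) or a node (feature, threshold, left, right)."""
    mean = sum(y) / len(y)
    if max_depth == 0 or len(y) < 2 * min_leaf:
        return mean
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] < t]
            right = [yi for row, yi in zip(X, y) if row[j] >= t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                best = (cost, j, t)
    if best is None or best[0] >= sse(y):
        return mean  # no split reduces the squared error
    _, j, t = best
    lo = [(row, yi) for row, yi in zip(X, y) if row[j] < t]
    hi = [(row, yi) for row, yi in zip(X, y) if row[j] >= t]
    return (
        j,
        t,
        fit_tree([r for r, _ in lo], [v for _, v in lo], max_depth - 1, min_leaf),
        fit_tree([r for r, _ in hi], [v for _, v in hi], max_depth - 1, min_leaf),
    )


def predict(tree, row):
    """Follow the left child when row[feature] < threshold, else the right."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if row[j] < t else right
    return tree


# Toy rows: [number of followers, past local influence]; targets are invented.
X = [[10, 0.0], [20, 0.0], [1000, 0.5], [2000, 0.5]]
y = [0.0, 0.0, 1.0, 1.0]
tree = fit_tree(X, y)
print(tree, predict(tree, [50, 0.1]))
```

Each leaf is a constant, so the fitted model is piecewise constant over the feature space, which also makes the resulting splits easy to read off.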

prediction task — results
50
# followers and past influence are important features
individuals who have been influential in the past and who have
many followers are more likely to be influential in the future
however, this is correct only on average
predictor features are necessary but not sufficient
cannot really predict who will initiate a cascade
advertisers need a diverse portfolio of users to target

prediction task — results
51
[Figure 5 tree diagram: the visible split conditions test log10(pastLocalInfluence + 1) against thresholds 0.09791, 0.3028, 0.3027, and 0.856; the leaf predictions shown are 0.0124, 0.03631, 0.05991, 0.09241, 0.1452, 0.1229, 0.1929, 0.3045, 0.275, 0.4118, 0.6034, and 0.9854.]
Figure 5: Regression tree fit for one of the five cross-validation folds. Leaf nodes give the predicted influence
for the corresponding partition, where the left (right) child is followed if the node condition is satisfied
(violated).
[Figure 6 plots: followers versus past local influence on logarithmic axes, for (a) all users and (b) top 25 users. Accounts labeled in the plot: TreySongz, Orbitz, stephenfry, marissamayer, disneypolls, MrEdLover, BarackObama, pigeonPOLL, iphone_dev, geohot, mslayel, cnnbrk, TreysAngels, OFA_TX, britneyspears, riskybusinessmb, nprnews, wealthtv, garagemkorova, michelebachmann, billprady.]
Figure 6: Influence as a function of past local influence and number of followers for (a) all users and (b)
users with the top 25 actual influence. Each circle represents a single seed user, where the size of the circle
represents that user’s actual average influence.
than others (e.g. news articles of specialized interest), or
First, we filtered URLs that we knew to be spam or in a lan-
[Figure 7 plot: actual influence (y-axis, 0.0 to 1.2) versus predicted influence (x-axis, 0.2 to 1.0).]
Figure 7: Actual vs. predicted influence for regres-
sion tree. The model assigns each seed user to a leaf
in the regression tree. Points representing the av-
erage actual influence values are placed at the pre-

question to study : can we identify influential users in twitter?
data : large twitter dataset over two months (proprietary)
methodology : prediction and analysis of a regression task
impact/interestingness : interesting question, potentially very
high impact for advertisers
reproducibility : not publicly available dataset
summary
Everyone is an influencer: quantifying influence on twitter
Bakshy, Hofman, Mason, Watts

Everyone is an influencer: quantifying influence on twitter
Bakshy, Hofman, Mason, Watts
54
1 2 3 4 5
originality
low high
1 2 3 4 5
impact
low high
1 2 3 4 5
low high
1 2 3 4 5
reproducibility
low high

Coevolution of Network Structure and Content
Chun-Yuen Teng
School of Information
University of Michigan
Ann Arbor, MI 48109
chunyuen@umich.edu
Liuling Gong
Ann Arbor, MI 48109
llgong@umich.edu
Avishay Livne EECS
Ann Arbor, MI 48109
avishay@umich.edu
Celso Brunetti
Carey Business School
Johns Hopkins
Baltimore, MD 21202
celsob@jhu.edu
Lada Adamic
Ann Arbor, MI 48109
ladamic@umich.edu
ABSTRACT
As individuals communicate, their exchanges form a dy-
namic network. We demonstrate, using time series analy-
sis of communication in three online settings, that network
structure alone can be highly revealing of the diversity and
novelty of the information being communicated. Our ap-
proach uses both standard and novel network metrics to
characterize how unexpected a network configuration is, and
to capture a network’s ability to conduct information. We
find that networks with a higher conductance in link struc-
ture exhibit higher information entropy, while unexpected
network configurations can be tied to information novelty.
We use a simulation model to explain the observed corre-
spondence between the evolution of a network’s structure
and the information it carries.
Categories and Subject Descriptors
J.4 [Computer Applications]: Social and Behavioral Sci-
ences; H.2.8 [Database Applications]: Data Mining
General Terms
Measurement, Human Factors
Keywords
social media, information networks, network evolution
adoption of ideas and behavior [28, 6, 3], convergence of
opinion [5], or the speed and extent of innovation [14].
In practice, networks are rarely static, unless one consid-
ers only the strongest and most stable ties [7] or experimen-
tally dictates the network topology to be fixed [6]. However,
even stable ties transfer information at different rates [25,
13, 21], and a portion of information flow occurs outside
of established social ties [4]. New ties are also induced by
information flow, e.g. a Pakistani Twitter user who inad-
vertently live-tweeted the Bin Laden assassination quickly
gained tens of thousands of new followers on Twitter. This
points to a need to approach the relationship between net-
work structure and information content in a substantively
different way.
In this paper, rather than treating the network structure
as static, we specifically use its dynamic nature to infer two
properties of the information being communicated through
the network. The first is the diversity of the information;
whether everyone is talking about the same topic or whether
one is observing many disparate conversation topics being
discussed. The second is the novelty of the information;
whether individuals in the network are continuing to talk
about the same topic they talked about in the previous time
period, or whether new topics have arisen that are different
from what has been discussed before. For example, one
could imagine oneself at a dinner party, where most conver-
sations are out of earshot, but one can easily observe who is
conversing with whom. While individuals are milling about
arXiv:1107.5543v2 [cs.SI] 21 May 2012

content vs. structure
56
questions :
understand the interplay between content and structure
what is said in the network vs. how the information spreads
more concretely :
can the network structure tell what people talk about?
are they talking about the same thing, or do they gossip?
is what people talk about novel?

content vs. structure
57
What’s different here
We look at network dynamics at relatively short time scales and construct time series
A range of network metrics, instead of just community structure
Information novelty and diversity, as opposed to tracking single events / pieces of information
big news! virus epidemic weather is horrible today

content vs. structure — methodology
58
extract features that capture network structure
# vertices, # edges, avg degree, degree correlations, …
conductance (does information flow along many paths?)
expectedness of conversation (have I seen this edge before?)
extract features that characterize content diversity and novelty
correlation analysis between structure and content features
analysis on 3 datasets : twitter, virtual game, enron email network
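The methodology above can be illustrated with a small sketch: one structure feature per time step, one content feature per time step, and a correlation between the two series. The definitions used here are simplified assumptions, not the paper's exact metrics: `expectedness` is taken as the fraction of a snapshot's edges already seen in earlier snapshots, and content diversity is taken as the Shannon entropy of the tokens used in a time step.

```python
import math
from collections import Counter


def expectedness(snapshots):
    """Fraction of each snapshot's edges already seen in earlier snapshots."""
    seen = set()
    scores = []
    for edges in snapshots:
        if seen and edges:
            scores.append(len(edges & seen) / len(edges))
        else:
            scores.append(0.0)
        seen |= edges
    return scores


def entropy(tokens):
    """Shannon entropy (in bits) of the token distribution."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def pearson(xs, ys):
    """Pearson correlation between two equally long series."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den


snapshots = [{("a", "b")}, {("a", "b"), ("b", "c")}, {("b", "c"), ("c", "d")}]
print(expectedness(snapshots))
print(entropy(["x", "x", "y", "y"]))
print(pearson([1, 2, 3], [2, 4, 6]))
```

In a real analysis, each series would have one value per time window, and the correlation would be computed between a structure series and a content series from the same dataset.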

content vs. structure — ﬁndings
59
“simple” structure features are not correlated with content
diversity and novelty
conductance correlates with content diversity
expectedness correlates with content novelty

question to study : interplay between structure and content
data : three datasets
methodology : feature extraction and correlation analysis
impact/interestingness : interesting question, potentially very
high impact
reproducibility : some datasets publicly available
summary
Coevolution of network structure and content
Teng, Gong, Livne, Brunetti, and Adamic

Coevolution of network structure and content
Teng, Gong, Livne, Brunetti, and Adamic
61
1 2 3 4 5
originality
low high
1 2 3 4 5
impact
low high
1 2 3 4 5
low high
1 2 3 4 5
reproducibility
low high

what is next?
continue literature review (next week, Michael)
meanwhile…
keep thinking about project ideas
browse papers
the ones in Noppa
main conferences : ICWSM, WSDM, WWW
talk to your colleagues
talk to your instructors
62
