Computational Social Science, Lecture 08: Counting Fast, Part II

Counting Fast
(Part II)

Sergei Vassilvitskii
Columbia University
Computational Social Science
March 8, 2013

Thursday, March 14, 13

Last time

Counting fast:
– Quadratic time doesn’t scale
– Sorting is slightly more than linear
– Hashing allows you to do membership queries in constant time



Today

Counting on Networks:
– Large Graphs: Internet, Facebook, Twitter
– Recommendation Graphs: Netflix, Amazon, etc.



Friends & Followers

Given a network:
– When do people become friends?
– What factors influence this?



Friends & Followers

Given a network:

Products:
– People You May Know (PYMK): reconnect people, help new users
– Twitter’s Who to Follow

Recommendations:
– Netflix, Amazon, etc. (future lectures)



Triadic Closure

Likely to become friends with:
– People in similar groups
– Friends of friends



Defining Tight-Knit Circles

Looking for tight-knit circles:
– People whose friends are friends themselves

Why?
– Network Cohesion: Tightly knit communities foster more trust, social
norms. [Coleman ’88, Portes ’88]
– Structural Holes: Individuals benefit from bridging [Burt ’04, ’07]



Clustering Coefficient

[Figure: two example nodes, one with cc = 0.5 and one with cc = 0.1]

Given an undirected graph:
– For each node v, the clustering coefficient cc(v) is the fraction of pairs of v’s neighbors that are themselves neighbors
– Equivalent to counting the triangles containing v, divided by the number of pairs of v’s neighbors
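
The definition can be sketched in a few lines of Python. This is only an illustration: the function name, the toy graph, and the convention that a node with fewer than two neighbors gets coefficient 0 are assumptions, not part of the slides.

```python
from itertools import combinations

def clustering_coefficient(adj, v):
    """Fraction of pairs of v's neighbors that are themselves connected."""
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumed convention: no neighbor pairs means coefficient 0
    linked = sum(1 for u, w in combinations(neighbors, 2) if w in adj[u])
    return linked / (k * (k - 1) / 2)

# Hypothetical toy graph: v has 4 neighbors, and 3 of the 6 neighbor pairs are linked.
edges = [("v", "a"), ("v", "b"), ("v", "c"), ("v", "d"),
         ("a", "b"), ("b", "c"), ("c", "d")]
adj = {}
for x, y in edges:
    adj.setdefault(x, set()).add(y)
    adj.setdefault(y, set()).add(x)

print(clustering_coefficient(adj, "v"))  # 0.5
```

Each linked neighbor pair closes one triangle through v, which is why counting triangles is the expensive part of the computation.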



How to Count Triangles

Sequential Version:
foreach v in V
    foreach u,w in Adjacency(v)
        if (u,w) in E
            Triangles[v]++


Running time:
– For each vertex, look at all pairs of neighbors
– Number of pairs ~ quadratic in the degree of the vertex

– What happens if the degree is very large?
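
A minimal runnable sketch of the sequential version, assuming the graph is held in memory as a dict of neighbor sets (the helper names and the toy edges are hypothetical):

```python
from itertools import combinations

def build_adjacency(edges):
    """Turn an undirected edge list into a dict of neighbor sets."""
    adj = {}
    for x, y in edges:
        adj.setdefault(x, set()).add(y)
        adj.setdefault(y, set()).add(x)
    return adj

def count_triangles_naive(adj):
    """For each vertex, test every pair of its neighbors for an edge."""
    triangles = {v: 0 for v in adj}
    for v, neighbors in adj.items():
        # d * (d - 1) / 2 pairs for a vertex of degree d: quadratic in d
        for u, w in combinations(neighbors, 2):
            if w in adj[u]:
                triangles[v] += 1
    return triangles

# Hypothetical toy graph: one triangle a-b-c plus a pendant edge c-d.
adj = build_adjacency([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
print(count_triangles_naive(adj))
```

The set membership test is the constant-time hash lookup from last time; the cost is dominated by the number of neighbor pairs.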



Parallel Version

But use 1,000 machines!
– Quadratic algorithms still don’t scale
– Simple parallelization: process each vertex separately

Naive parallelization does not help with data skew
– Some nodes will have very high degree
– Example: a node with 3.2 million followers generates about 10 trillion (10^13) potential edges to check.
– Even at 100M edges per second, that is 100K seconds, or about 27 hours, for one vertex!
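
The arithmetic behind this example, as a quick check. The variable names are illustrative. Note that 10^13 corresponds to roughly n^2 candidate pairs; counting each unordered pair once gives about half of that, which does not change the conclusion.

```python
followers = 3_200_000

ordered_pairs = followers * followers                # about 1.0e13
unordered_pairs = followers * (followers - 1) // 2   # about 5.1e12

edges_per_second = 100_000_000                       # 100M checks per second
seconds = 10**13 / edges_per_second                  # using the slide's 10^13 figure
hours = seconds / 3600

print(ordered_pairs, unordered_pairs)
print(seconds, round(hours, 1))
```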



“Just 5 more minutes”

On the LiveJournal Graph (5M nodes, 70M edges)
– 80% of vertices are done after 5 min
– 99% done after 35 min



Adapting the Algorithm

Approach 1: Dealing with skew directly
– Currently every triangle is counted 3 times (once per vertex)
– Running time is quadratic in the degree of the vertex
– Idea: count each triangle once, from the perspective of its lowest-degree vertex
– Does this heuristic work?



How to Count Triangles Better

Idea [Schank ’07]
– Only pivot on nodes that have smaller degree than both neighbors.
– Neighbors of high degree nodes tend to have small degrees




# compare degrees with ties broken consistently (e.g. by node id)
foreach v in V
    foreach u in Adjacency(v) with deg(u) > deg(v):
        foreach w in Adjacency(v) with deg(w) > deg(v), taking each pair {u,w} once:
            if (u,w) is an edge:
                Triangles[v]++
                Triangles[w]++
                Triangles[u]++
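
A runnable sketch of the degree-ordered version. The tie-breaking rule (degree first, then node id) is an assumption chosen here so that every triangle has exactly one lowest vertex; the function name and toy graphs are hypothetical.

```python
from itertools import combinations

def count_triangles_by_degree(adj):
    """Count each triangle once, pivoting on its lowest-ranked vertex."""
    # Total order on nodes: degree first, node id as the tie-breaker (assumed).
    rank = {v: (len(adj[v]), v) for v in adj}
    triangles = {v: 0 for v in adj}
    for v in adj:
        higher = [u for u in adj[v] if rank[u] > rank[v]]
        for u, w in combinations(higher, 2):  # each pair {u, w} once
            if w in adj[u]:
                for node in (v, u, w):
                    triangles[node] += 1
    return triangles

def build_adjacency(edges):
    adj = {}
    for x, y in edges:
        adj.setdefault(x, set()).add(y)
        adj.setdefault(y, set()).add(x)
    return adj

# Hypothetical toy graphs: a triangle with a pendant edge, and a 4-clique.
toy = build_adjacency([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
clique = build_adjacency([(i, j) for i in range(4) for j in range(i + 1, 4)])
print(count_triangles_by_degree(toy))
print(count_triangles_by_degree(clique))
```

The per-vertex totals match the naive version, but a high-degree vertex only enumerates pairs among its even-higher-ranked neighbors.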



Does it make a difference?



Why does it help?

Look at two different kinds of nodes:
– Few friends:
• OK to be quadratic on small instances
– Lots of friends
• Only care about number of friends with even more friends!
• Cannot have too many (can make this formal)
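
One way to make this formal: if a node has h neighbors of higher (or equal) degree, each of them has degree at least h, so their degrees sum to at least h^2. The degrees of all nodes sum to 2m for a graph with m edges, so h is at most sqrt(2m). The sketch below checks this on a hypothetical star graph; the function name and the graph are assumptions for illustration.

```python
import math

def higher_neighbor_counts(adj):
    """Number of neighbors ranked above each node (degree, then node id)."""
    rank = {v: (len(adj[v]), v) for v in adj}
    return {v: sum(1 for u in adj[v] if rank[u] > rank[v]) for v in adj}

# Hypothetical star graph: hub 0 connected to leaves 1..1000, so m = 1000.
num_leaves = 1000
adj = {0: set(range(1, num_leaves + 1))}
for leaf in range(1, num_leaves + 1):
    adj[leaf] = {0}
m = num_leaves

counts = higher_neighbor_counts(adj)
bound = math.sqrt(2 * m)

naive_pairs_at_hub = num_leaves * (num_leaves - 1) // 2
print(naive_pairs_at_hub)           # pairs the naive algorithm checks at the hub
print(counts[0], max(counts.values()), bound)
```

The hub has a thousand friends but none with more friends than itself, so the degree-ordered algorithm generates no pairs there.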



Break



Working in Parallel

MapReduce (review):

Map:
– Decide how to group the data for computation

Reduce:
– Given the grouping, perform the computation
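
As a reminder of the pattern, here is a toy single-machine simulation of map, shuffle, and reduce. It is not a real MapReduce framework; `run_mapreduce` and the degree-counting example are hypothetical names used only for illustration.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Apply mapper to every record, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # Map: decide the grouping
            groups[key].append(value)       # Shuffle: collect values per key
    output = []
    for key, values in groups.items():
        output.extend(reducer(key, values)) # Reduce: compute on each group
    return output

# Example: compute vertex degrees from an undirected edge list.
def edge_mapper(edge):
    u, v = edge
    yield u, 1
    yield v, 1

def degree_reducer(node, ones):
    yield node, sum(ones)

edges = [("Alice", "Bob"), ("Alice", "Mary"), ("Bob", "Mary"), ("Mary", "Joe")]
degrees = dict(run_mapreduce(edges, edge_mapper, degree_reducer))
print(degrees)
```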



Building People You May Know

Friendships are undirected:
– If Alice knows Bob, Bob knows Alice
– Data stored as a list of all edges
– Find all friends of friends
– Score the possible pairs



Data

Suppose you have edges and degrees of each vertex:

Joe 56 Mary 78
Alice 398 Bob 198
Dan 983 Justin 11,985,234
...

An alternate view may be data stored as adjacency list:
Joe 56 Mary 78 Don 99 Bill 1
Alice 398 Kate 55 Bob 198 Mary 78
...



Previous Algorithm

Adjacency list input.
– Map:
• For each node and its neighbors, output all paths through the node
– Reduce:
• none

– Examples (figures not reproduced): Map takes [node | neighbors]; Output is the paths through the node, or none
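
A sketch of the map-only step on adjacency-list input, assuming each record has already been parsed into a node and a list of neighbor names (the function name is hypothetical; the example record uses the names from the Data slide):

```python
from itertools import combinations

def map_adjacency(node, neighbors):
    """Emit every 2-path u - node - w through the central node."""
    for u, w in combinations(neighbors, 2):
        yield node, (u, w)

paths = list(map_adjacency("Joe", ["Mary", "Don", "Bill"]))
print(paths)

# A node of degree d emits d * (d - 1) / 2 paths, which is the skew problem again.
print(len(list(map_adjacency("x", range(100)))))
```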



Idea [Schank ’07]
– Only pivot on nodes that have smaller degree than both neighbors.
– Neighbors of high degree nodes tend to have small degrees



Want to compute all open triads

Data Needed:
– Central node
– Neighbors that have higher degree
– Orient each edge to point to a node of higher degree, breaking ties arbitrarily but consistently




Map:
– Orient each edge to point to a node of higher degree, breaking ties
arbitrarily but consistently
– Given: Joe 56 Mary 78
– Output: <Key = Joe, Value = Mary>
– Given: Alice 398 Bob 198
– Output: <Key = Bob, Value = Alice>

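map(key, value):
    # value is one edge: "name1 degree1 name2 degree2"
    u, deg_u, v, deg_v = value.split()
    # compare degrees as integers, not strings (strip thousands separators)
    deg_u = int(deg_u.replace(",", ""))
    deg_v = int(deg_v.replace(",", ""))
    # key = lower-degree endpoint; ties broken by name
    if deg_v > deg_u or (deg_v == deg_u and u < v):
        emit(u, v)
    elif deg_v < deg_u or (deg_v == deg_u and u > v):
        emit(v, u)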
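
A standalone version of the orientation step that can be run outside any framework. It returns the (key, value) pair instead of calling `emit`; `parse_degree` and `orient_edge` are hypothetical names, and the tie-breaking by name follows the rule above.

```python
def parse_degree(text):
    """Parse a degree such as '11,985,234' into an integer."""
    return int(text.replace(",", ""))

def orient_edge(line):
    """Return (lower, higher) for one edge line 'name1 deg1 name2 deg2'."""
    u, deg_u, v, deg_v = line.split()
    rank_u = (parse_degree(deg_u), u)   # degree first, name as tie-breaker
    rank_v = (parse_degree(deg_v), v)
    return (u, v) if rank_u < rank_v else (v, u)

for line in ["Joe 56 Mary 78", "Alice 398 Bob 198", "Dan 983 Justin 11,985,234"]:
    print(orient_edge(line))
```

Comparing the degrees as integers matters: as strings, "983" would sort above "11,985,234".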




Aggregate (Shuffle):
– Collect all values with same key (nodes with higher degree)

Computation:
– Generate all 2-paths (friend of a friend relationships)
– Given: key= Joe, value={Mary, Justin, Alice}
– Output:
• (key = Joe, Value = (Mary, Justin))
• (key = Joe, Value = (Mary, Alice))
• (key = Joe, Value = (Justin, Alice))

reduce(key, values):
    # emit each unordered pair of distinct friends once
    for i in 0 .. len(values)-1
        for j in i+1 .. len(values)-1
            emit(key, (values[i], values[j]))
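
The shuffle and reduce steps can be simulated on one machine as follows. The oriented edges below are hypothetical Map output, chosen to reproduce the Joe example; the function name is an assumption.

```python
from collections import defaultdict
from itertools import combinations

def reduce_two_paths(key, values):
    """Emit each unordered pair of the key's higher-degree neighbors once."""
    for friend1, friend2 in combinations(values, 2):
        yield key, (friend1, friend2)

# Hypothetical oriented edges, as (lower-degree node, higher-degree node).
oriented = [("Joe", "Mary"), ("Joe", "Justin"), ("Joe", "Alice"), ("Bob", "Alice")]

# Shuffle: collect all values that share a key.
groups = defaultdict(list)
for key, value in oriented:
    groups[key].append(value)

two_paths = [path
             for key, values in groups.items()
             for path in reduce_two_paths(key, values)]
print(two_paths)
```

Bob has a single higher-degree neighbor, so his group produces no 2-paths.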



Comparing Algorithms

Adjacency-list MapOnly Algorithm:
– MapOnly
– Output from some nodes is quadratic

Edge-at-a-time Algorithm:
– Map & Reduce
– More balanced output from each node



Scoring

Some suggestions are better than others:
– Some people are already friends!
– Or they used to be friends...
– Connected through a friend with 1000s of friends
– Connected through multiple friends
– ...
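
One possible scoring rule, purely as an illustration: the slides do not specify a formula. This sketch assumes that each connecting friend contributes 1/degree (so a friend with thousands of friends counts for little), that contributions from multiple connecting friends add up, and that existing friendships are skipped. All names and data are hypothetical.

```python
from collections import defaultdict

def score_candidates(two_paths, degree, existing_edges):
    """Score candidate pairs from 2-paths; higher means a better suggestion."""
    scores = defaultdict(float)
    for middle, (a, b) in two_paths:
        pair = tuple(sorted((a, b)))
        if pair in existing_edges:
            continue  # already friends, nothing to suggest
        scores[pair] += 1.0 / degree[middle]  # assumed weighting, not from the slides
    return dict(scores)

# Hypothetical inputs.
two_paths = [("Joe", ("Mary", "Alice")), ("Bob", ("Alice", "Mary")),
             ("Justin", ("Mary", "Dan")), ("Joe", ("Mary", "Kate"))]
degree = {"Joe": 4, "Bob": 2, "Justin": 1000}
existing_edges = {("Kate", "Mary")}

scores = score_candidates(two_paths, degree, existing_edges)
print(sorted(scores.items(), key=lambda item: -item[1]))
```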



Spring Break!


