Computational Social Science, Lecture 08: Counting Fast, Part II
1. Counting Fast (Part II)
Sergei Vassilvitskii
Columbia University
Computational Social Science
March 8, 2013
Thursday, March 14, 13
2. Last time
Counting fast:
– Quadratic time doesn’t scale
– Sorting is slightly more than linear
– Hashing allows you to do membership queries in constant time
3. Today
Counting on Networks:
– Large Graphs: Internet, Facebook, Twitter
– Recommendation Graphs: Netflix, Amazon, etc.
7. Friends & Followers
Given a network:
– When do people become friends?
– What factors influence this?
Products:
– People You May Know (PYMK). Reconnect people, help new users
– Twitter’s who to follow?
Recommendations:
– Netflix, Amazon, etc. (Future lectures)
8. Triadic Closure
Likely to become friends with:
– People in similar groups
– Friends of friends
9. Defining Tight Knit Circles
Looking for tight-knit circles:
– People whose friends are friends themselves
Why?
– Network Cohesion: Tightly knit communities foster more trust, social
norms. [Coleman ’88, Portes ’88]
– Structural Holes: Individuals benefit from bridging [Burt ’04, ’07]
11. Clustering Coefficient
[Figure: two example nodes, one with cc = 0.5 and one with cc = 0.1]
Given an undirected graph:
– For each node v, cc(v) is the fraction of pairs of v’s neighbors that are neighbors themselves
– Equivalent to counting the triangles containing v, divided by the number of pairs of v’s neighbors
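A minimal sketch of the clustering coefficient as the fraction of connected neighbor pairs, assuming the graph fits in memory as a dict mapping each node to the set of its neighbors. The function name `clustering_coefficient`, the toy graph, and the convention that a node with fewer than two neighbors gets 0 are assumptions for illustration, not part of the lecture.

```python
from itertools import combinations

def clustering_coefficient(adj, v):
    """Fraction of pairs of v's neighbors that are themselves connected."""
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumed convention: no neighbor pairs means cc = 0
    # Each connected pair of neighbors closes one triangle through v.
    triangles = sum(1 for u, w in combinations(neighbors, 2) if w in adj[u])
    return triangles / (k * (k - 1) / 2)

# Hypothetical toy graph: a is linked to b, c, d; only b and c know each other.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
print(clustering_coefficient(adj, "a"))
```

Here one of the three neighbor pairs of `a` is connected, so its coefficient is one third.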
15. How to Count Triangles
Sequential Version:
foreach v in V
    foreach u,w in Adjacency(v)
        if (u,w) in E
            Triangles[v]++
Running time:
– For each vertex, look at all pairs of neighbors
– Number of pairs ~ quadratic in the degree of the vertex
– What happens if the degree is very large?
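The sequential version could be written in Python roughly as follows. This is a sketch that assumes an in-memory dict of neighbor sets; the name `count_triangles_naive` and the four-node graph are made up for illustration.

```python
from itertools import combinations

def count_triangles_naive(adj):
    """For every node, test all pairs of its neighbors for an edge."""
    triangles = {v: 0 for v in adj}
    for v, neighbors in adj.items():
        # deg(v) * (deg(v) - 1) / 2 pairs: quadratic in the degree of v.
        for u, w in combinations(neighbors, 2):
            if w in adj[u]:  # constant-time membership test on a hash set
                triangles[v] += 1
    return triangles

# Hypothetical graph with edges 1-2, 1-3, 1-4, 2-3, 3-4.
adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3}}
counts = count_triangles_naive(adj)
print(counts)
```

Each triangle is visited three times here, once from each of its corners, and the inner loop dominates the cost for high-degree nodes.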
16. Parallel Version
But use 1,000 machines!
– Quadratic algorithms still don’t scale
– Simple parallelization: process each vertex separately
Naive parallelization does not help with data skew
– Some nodes will have very high degree
– Example. 3.2 Million followers, must generate 10 Trillion (10^13)
potential edges to check.
– Even if generating 100M edges per second this is 100K seconds ~ 27
hours for one vertex!
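The arithmetic behind this example can be checked directly. The variable names below are illustrative, and the slide's 10 trillion figure is read here as the square of the follower count (the number of unordered pairs is about half of that).

```python
followers = 3_200_000
rate = 100_000_000  # candidate edges generated per second, as in the example

ordered_pairs = followers ** 2                      # about 10^13
unordered_pairs = followers * (followers - 1) // 2  # about 5 * 10^12

seconds = ordered_pairs / rate
hours = seconds / 3600
print(f"{ordered_pairs:.2e} pairs, {seconds:.0f} seconds, {hours:.1f} hours")
```

With the exact square this comes to a little over 100K seconds, or roughly 28 hours, in line with the rounded figures above.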
17. “Just 5 more minutes”
On the LiveJournal Graph (5M nodes, 70M edges)
– 80% of vertices are done after 5 min
– 99% done after 35 min
18. Adapting the Algorithm
Approach 1: Dealing with skew directly
– Currently every triangle is counted 3 times (once per vertex)
– Running time is quadratic in the degree of the vertex
– Idea: Count each triangle once, from the perspective of its lowest-degree vertex
– Does this heuristic work?
19. How to Count Triangles Better
Idea [Schank ’07]
– Only pivot on nodes that have smaller degree than both neighbors.
– Neighbors of high degree nodes tend to have small degrees
20. How to Count Triangles Better
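foreach v in V
    foreach u in Adjacency(v) with deg(u) > deg(v):
        foreach w in Adjacency(v) with deg(w) > deg(v), w after u:
            if (u,w) is an edge:
                Triangles[v]++
                Triangles[w]++
                Triangles[u]++
(Taking w after u visits each pair once; break degree ties arbitrarily but consistently.)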
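A runnable sketch of this degree-ordered count, again assuming a dict of neighbor sets. Ranking nodes by the pair (degree, node id) is one possible consistent tie-break and assumes node ids are comparable; the function name and test graph are illustrative.

```python
from itertools import combinations

def count_triangles_ordered(adj):
    """Count each triangle once, pivoting on its lowest-ranked node."""
    # Rank by (degree, node id) so that equal degrees are ordered consistently.
    rank = {v: (len(adj[v]), v) for v in adj}
    triangles = {v: 0 for v in adj}
    for v in adj:
        higher = [u for u in adj[v] if rank[u] > rank[v]]
        for u, w in combinations(higher, 2):
            if w in adj[u]:
                # v is the lowest-ranked corner, so the triangle is found only here.
                triangles[v] += 1
                triangles[u] += 1
                triangles[w] += 1
    return triangles

# Hypothetical graph with edges 1-2, 1-3, 1-4, 2-3, 3-4.
adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3}}
counts = count_triangles_ordered(adj)
print(counts)
```

The per-node totals match what the all-pairs version produces, but a high-degree node only enumerates pairs among its neighbors that rank even higher.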
21. Does it make a difference?
22. Why does it help?
Look at two different kinds of nodes:
– Few friends:
• OK to be quadratic on small instances
– Lots of friends
• Only care about number of friends with even more friends!
• Cannot have too many (can make this formal)
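One standard way to make this formal, sketched here rather than taken from the slides: let m be the number of edges and d⁺(v) the number of neighbors of v whose degree is at least deg(v). Degrees sum to 2m, so at most 2m/deg(v) nodes can have degree deg(v) or more, and v has only deg(v) neighbors in total. The symbols m and d⁺ are notation introduced for this sketch.

```latex
d^{+}(v) \;\le\; \min\!\left(\deg(v),\ \frac{2m}{\deg(v)}\right) \;\le\; \sqrt{2m}
```

So a pivot node examines at most about m pairs of higher-degree neighbors, however large its own degree is.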
23. Break
24. Working in Parallel
MapReduce (review):
Map:
– Decide how to group the data for computation
Reduce:
– Given the grouping, perform the computation
25. Building People You May Know
Friendships are undirected:
– If Alice knows Bob, Bob knows Alice
– Data stored as a list of all edges
– Find all friends of friends
– Score the possible pairs
26. Data
Suppose you have edges and degrees of each vertex:
Joe 56 Mary 78
Alice 398 Bob 198
Dan 983 Justin 11,985,234
...
An alternate view may be data stored as adjacency list:
Joe 56 Mary 78 Don 99 Bill 1
Alice 398 Kate 55 Bob 198 Mary 78
...
27. Previous Algorithm
Adjacency list input.
– Map:
• For each node and its neighbors, output all paths through the node
– Reduce:
• none
[Figure: two example map inputs of the form [node | neighbors]; the first outputs the paths through the node, the second outputs none]
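A sketch of what this map-only step might look like, assuming each adjacency record is a whitespace-separated line of the form "node degree neighbor degree neighbor degree ...". The function name `map_adjacency` and the tuple layout of the output are assumptions for illustration.

```python
from itertools import combinations

def map_adjacency(line):
    """Map-only step: yield every 2-path (u, center, w) through the record's node."""
    fields = line.split()
    center = fields[0]
    neighbors = fields[2::2]  # names only; skip the degree after each name
    for u, w in combinations(neighbors, 2):
        yield (u, center, w)

paths = list(map_adjacency("Joe 56 Mary 78 Don 99 Bill 1"))
print(paths)
```

A record with d neighbors produces d(d-1)/2 paths, so a single very popular node generates a quadratic amount of output on one machine.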
32. Want to compute all open triads
Data Needed:
– Central node
– Neighbors that have higher degree
– Orient each edge to point to a node of higher degree, breaking ties
arbitrarily but consistently
33. Want to compute all open triads
Map:
– Orient each edge to point to a node of higher degree, breaking ties
arbitrarily but consistently
– Given: Joe 56 Mary 78
– Output: <Key = Joe, Value = Mary>
– Given: Alice 398 Bob 198
– Output: <Key = Bob, Value = Alice>
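def map(key, value):
    # value is one edge record: "name1 degree1 name2 degree2"
    name1, deg1, name2, deg2 = value.split()
    # Compare degrees as integers, not strings (strip thousands separators)
    deg1 = int(deg1.replace(",", ""))
    deg2 = int(deg2.replace(",", ""))
    # Key is the lower-degree endpoint; ties are broken by name
    if deg2 > deg1 or (deg2 == deg1 and name1 < name2):
        emit(name1, name2)  # emit() is provided by the MapReduce framework
    elif deg2 < deg1 or (deg2 == deg1 and name1 > name2):
        emit(name2, name1)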
36. Want to compute all open triads
Aggregate (Shuffle):
– Collect all values with same key (nodes with higher degree)
Computation:
– Generate all 2-paths (friend of a friend relationships)
– Given: key= Joe, value={Mary, Justin, Alice}
– Output:
• (key = Joe, Value = (Mary, Justin))
• (key = Joe, Value = (Mary, Alice))
• (key = Joe, Value = (Justin, Alice))
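def reduce(key, values):
    friends = list(values)
    # Emit each unordered pair once, skipping (x, x) and reversed duplicates
    for i, friend1 in enumerate(friends):
        for friend2 in friends[i + 1:]:
            emit(key, (friend1, friend2))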
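To see the map, shuffle, and reduce steps work together, the whole job can be simulated locally. This is only a sketch: the function names `map_edge`, `reduce_paths`, and `run` are hypothetical, a dict stands in for the shuffle, generators stand in for `emit`, and the degrees in the sample edge list are made up.

```python
from collections import defaultdict
from itertools import combinations

def map_edge(record):
    """Orient one edge record 'name1 deg1 name2 deg2' toward the higher degree."""
    name1, deg1, name2, deg2 = record.split()
    if (int(deg1), name1) < (int(deg2), name2):
        yield name1, name2
    else:
        yield name2, name1

def reduce_paths(key, values):
    """Emit each unordered pair of the key's higher-degree neighbors once."""
    for friend1, friend2 in combinations(values, 2):
        yield key, (friend1, friend2)

def run(records):
    groups = defaultdict(list)  # stands in for the shuffle phase
    for record in records:
        for key, value in map_edge(record):
            groups[key].append(value)
    return [out for key, values in groups.items()
            for out in reduce_paths(key, values)]

# Hypothetical edge list; the degrees are illustrative only.
edges = ["Joe 56 Mary 78", "Joe 56 Justin 900", "Joe 56 Alice 398", "Alice 398 Bob 198"]
result = run(edges)
print(result)
```

Joe is the lowest-degree endpoint of three edges, so he becomes the key for three 2-paths, while Bob collects a single neighbor and produces no output.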
37. Comparing Algorithms
Edgelist MapOnly Algorithm:
– MapOnly
– Output from some nodes is quadratic
Edge-at-a-time Algorithm:
– Map & Reduce
– More balanced output from each node
38. Scoring
Some suggestions are better than others:
– Some people are already friends!
– Or they used to be friends...
– Connected through a friend with 1000s of friends
– Connected through multiple friends
– ...
39. Spring Break!