Computational Social Science, Lecture 08: Counting Fast, Part II

Published in: Education

  1. 1. Counting Fast (Part II). Sergei Vassilvitskii, Columbia University. Computational Social Science, March 8, 2013. Thursday, March 14, 13
  2. 2. Last time. Counting fast: – Quadratic time doesn’t scale – Sorting is slightly more than linear, O(n log n) – Hashing allows you to do membership queries in expected constant time
  3. 3. Today. Counting on Networks: – Large Graphs: Internet, Facebook, Twitter – Recommendation Graphs: Netflix, Amazon, etc.
  7. 7. Friends & Followers. Given a network: – When do people become friends? – What factors influence this? Products: – People You May Know (PYMK): reconnect people, help new users – Twitter’s Who to Follow. Recommendations: – Netflix, Amazon, etc. (future lectures)
  8. 8. Triadic Closure. Likely to become friends with: – People in similar groups – Friends of friends
  9. 9. Defining Tight Knit Circles. Looking for tight-knit circles: – People whose friends are friends themselves. Why? – Network Cohesion: Tightly knit communities foster more trust and social norms [Coleman ’88, Portes ’88] – Structural Holes: Individuals benefit from bridging [Burt ’04, ’07]
  10. 10. Clustering Coefficient [figure: two example nodes shown side by side for comparison]
  11. 11. Clustering Coefficient. cc(first example node) = 0.5 vs. cc(second example node) = 0.1 [node figures not preserved]. Given an undirected graph: – For each node v, cc(v) is the fraction of pairs of v’s neighbors who are neighbors themselves – Equivalently, the number of triangles containing v divided by the number of pairs of v’s neighbors
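The definition on this slide can be sketched in a few lines of Python. This is only an illustration: the toy graph, the `adj` dictionary-of-sets representation, and the convention that a node with fewer than two neighbors gets coefficient 0 are assumptions, not part of the lecture.

```python
from itertools import combinations

def clustering_coefficient(adj, v):
    """Fraction of pairs of v's neighbors that are themselves connected."""
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumed convention: no neighbor pairs means coefficient 0
    # Each connected neighbor pair closes one triangle through v.
    linked = sum(1 for u, w in combinations(neighbors, 2) if w in adj[u])
    return linked / (k * (k - 1) / 2)

# Hypothetical toy graph: a, b, c form a triangle and d hangs off a.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

for node in sorted(adj):
    print(node, clustering_coefficient(adj, node))
```

In this toy graph only one of the three neighbor pairs of `a` is connected, while the single neighbor pair of `b` is connected, so `b` sits in a tighter circle than `a`.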
  12. 12. How to Count Triangles
      Sequential Version:
        foreach v in V
          foreach u,w in Adjacency(v)
            if (u,w) in E
              Triangles[v]++
      [figure: node v, Triangles[v]=0]
  13. 13. How to Count Triangles
      Sequential Version:
        foreach v in V
          foreach u,w in Adjacency(v)
            if (u,w) in E
              Triangles[v]++
      [figure: node v with neighbors u and w, Triangles[v]=1]
  15. 15. How to Count Triangles
      Sequential Version:
        foreach v in V
          foreach u,w in Adjacency(v)
            if (u,w) in E
              Triangles[v]++
      Running time:
      – For each vertex, look at all pairs of neighbors
      – Number of pairs ~ quadratic in the degree of the vertex
      – What happens if the degree is very large?
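A minimal runnable sketch of the sequential version, assuming the graph is held as a dictionary mapping each node to a set of neighbors (the `adj` structure and the toy graph are assumptions made for illustration):

```python
from itertools import combinations

def count_triangles(adj):
    """Per-node triangle counts, checking every pair of neighbors."""
    triangles = {v: 0 for v in adj}
    for v in adj:
        # combinations() yields each unordered neighbor pair once,
        # so the work at v is deg(v) * (deg(v) - 1) / 2 membership tests.
        for u, w in combinations(adj[v], 2):
            if w in adj[u]:
                triangles[v] += 1
    return triangles

# Hypothetical toy graph: a, b, c form a triangle and d hangs off a.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
triangles = count_triangles(adj)
print(triangles)
```

Because each neighbor set is a hash set, the edge test is an expected constant-time lookup; the cost is dominated by the number of neighbor pairs, which is what blows up at high-degree nodes.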
  16. 16. Parallel Version. But use 1,000 machines! – Quadratic algorithms still don’t scale – Simple parallelization: process each vertex separately. Naive parallelization does not help with data skew: – Some nodes will have very high degree – Example: a vertex with 3.2 million followers must generate about 10 trillion (10^13) potential edges to check – Even when generating 100M edges per second, this is 100K seconds, roughly 28 hours, for one vertex!
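The arithmetic behind the skew example can be checked directly. The sketch below assumes the slide's 10^13 figure refers to ordered pairs of followers (3.2 million squared); counting each unordered pair once would give about half that.

```python
followers = 3_200_000
rate = 100_000_000  # potential edges generated per second (from the slide)

ordered_pairs = followers ** 2                    # about 1.02e13
unordered_pairs = followers * (followers - 1) // 2  # about 5.1e12

seconds = ordered_pairs / rate
hours = seconds / 3600

print(f"ordered pairs:   {ordered_pairs:.3e}")
print(f"unordered pairs: {unordered_pairs:.3e}")
print(f"time at 100M/s:  {seconds:,.0f} s, about {hours:.1f} hours")
```

Either way the conclusion is the same: a single high-degree vertex keeps one machine busy for many hours, no matter how many other machines are idle.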
  17. 17. “Just 5 more minutes”. On the LiveJournal graph (5M nodes, 70M edges): – 80% of vertices are done after 5 min – 99% are done after 35 min
  18. 18. Adapting the Algorithm. Approach 1: Dealing with skew directly – Currently every triangle is counted 3 times (once per vertex) – Running time is quadratic in the degree of the vertex – Idea: count each triangle once, from the perspective of its lowest-degree vertex – Does this heuristic work?
  19. 19. How to Count Triangles Better. Idea [Schank ’07]: – Only pivot on nodes that have smaller degree than both neighbors – Neighbors of high-degree nodes tend to have small degrees
  20. 20. How to Count Triangles Better
        foreach v in V
          foreach u in Adjacency(v) with deg(u) > deg(v):
            foreach w in Adjacency(v) with deg(w) > deg(v), w after u:
              if (u,w) is an edge:
                Triangles[v]++
                Triangles[w]++
                Triangles[u]++
      (Visit each pair {u, w} once, and break degree ties consistently so that every triangle has exactly one pivot.)
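A runnable sketch of the degree-ordered idea, under two assumptions that the slide leaves open: degree ties are broken by comparing node names, and each pair of higher-ranked neighbors is visited once. The `adj` dictionary and the toy graph are illustrative only.

```python
from itertools import combinations

def count_triangles_degree_ordered(adj):
    """Count each triangle once, pivoting on its lowest-ranked node."""
    def rank(v):
        # Degree first, node name as an assumed tie-breaker.
        return (len(adj[v]), v)

    triangles = {v: 0 for v in adj}
    for v in adj:
        higher = [u for u in adj[v] if rank(u) > rank(v)]
        for u, w in combinations(higher, 2):
            if w in adj[u]:
                # The triangle is found once, so credit all three corners.
                for node in (v, u, w):
                    triangles[node] += 1
    return triangles

# Hypothetical toy graph: a, b, c form a triangle and d hangs off a.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
triangles = count_triangles_degree_ordered(adj)
print(triangles)
```

The per-node counts match the plain sequential version, but the quadratic pair enumeration now runs only over neighbors that rank higher than the pivot, so a very high-degree node enumerates few or no pairs.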
  21. 21. Does it make a difference?
  22. 22. Why does it help? Look at two different kinds of nodes: – Few friends: • OK to be quadratic on small instances – Lots of friends: • Only care about the number of friends with even more friends! • Cannot have too many (can make this formal)
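One standard way to make the "cannot have too many" claim formal is a counting argument over degrees. The notation is an assumption made here: $m$ is the number of edges and $k$ is the number of neighbors of $v$ whose degree is at least $\deg(v)$.

```latex
% Every higher-degree neighbor u of v satisfies deg(u) >= deg(v) >= k,
% because v has at least k neighbors in total.
\[
  2m \;=\; \sum_{x \in V} \deg(x)
     \;\ge\; \sum_{\substack{u \in N(v) \\ \deg(u) \ge \deg(v)}} \deg(u)
     \;\ge\; k \cdot k ,
  \qquad\text{so}\qquad
  k \;\le\; \sqrt{2m}.
\]
```

Under this argument no node, however popular, pivots on more than $\sqrt{2m}$ neighbors, so the number of pairs it enumerates is at most on the order of $m$ rather than quadratic in its degree.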
  23. 23. Break
  24. 24. Working in Parallel. MapReduce (review): Map: – Decide how to group the data for computation. Reduce: – Given the grouping, perform the computation
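The map, group, reduce pattern can be imitated on one machine to see how the pieces fit. This is a toy, single-process sketch and not a real MapReduce framework; `run_mapreduce`, the degree-counting job, and the edge list are all hypothetical names and data chosen for illustration.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Toy in-memory stand-in for map -> shuffle -> reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # map: choose the grouping key
            groups[key].append(value)       # shuffle: collect values by key
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))  # reduce: compute per group
    return output

def degree_mapper(edge):
    u, v = edge
    yield u, 1
    yield v, 1

def degree_reducer(node, ones):
    yield node, sum(ones)

# Hypothetical undirected edge list.
edges = [("Joe", "Mary"), ("Alice", "Bob"), ("Alice", "Mary")]
degrees = dict(run_mapreduce(edges, degree_mapper, degree_reducer))
print(degrees)
```

The example job computes node degrees from an edge list, which is also how the degree annotations used later in the lecture could be produced.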
  25. 25. Building People You May Know. Friendships are undirected: – If Alice knows Bob, Bob knows Alice – Data stored as a list of all edges – Find all friends of friends – Score the possible pairs
  26. 26. Data
      Suppose you have edges and degrees of each vertex (one edge per line: node, degree, node, degree):
        Joe    56    Mary    78
        Alice  398   Bob     198
        Dan    983   Justin  11,985,234
        ...
      An alternate view may be data stored as an adjacency list (node and degree, then each neighbor with its degree):
        Joe    56    Mary 78    Don 99     Bill 1
        Alice  398   Kate 55    Bob 198    Mary 78
        ...
  27. 27. Previous Algorithm. Adjacency list input. – Map: • For each node and its neighbors, output all paths through the node – Reduce: • none [two worked map examples not preserved; the second one produces no output]
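A sketch of what such a map-only step might look like. The function name, the `(u, node, w)` output format, and the sample neighbor list are assumptions, since the slide's worked examples did not survive.

```python
from itertools import combinations

def map_adjacency(node, neighbors):
    """Emit every 2-path u - node - w that passes through node."""
    for u, w in combinations(sorted(neighbors), 2):
        yield (u, node, w)

# Hypothetical adjacency record: Joe with three of his neighbors.
paths = list(map_adjacency("Joe", ["Mary", "Don", "Bill"]))
print(paths)

# A node with a single neighbor has no paths through it, so nothing is emitted.
print(list(map_adjacency("Bill", ["Joe"])))
```

A node with d neighbors emits d(d-1)/2 paths from a single map call, which is where the skew in this version comes from.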
  28. 28. How to Count Triangles Better. Idea [Schank ’07]: – Only pivot on nodes that have smaller degree than both neighbors – Neighbors of high-degree nodes tend to have small degrees
  32. 32. Want to compute all open triads. Data Needed: – Central node – Neighbors that have higher degree – Orient each edge to point to a node of higher degree, breaking ties arbitrarily but consistently
  33. 33. Want to compute all open triads
      Map:
      – Orient each edge to point to a node of higher degree, breaking ties arbitrarily but consistently
      – Given: Joe 56 Mary 78
      – Output: <Key = Joe, Value = Mary>
      – Given: Alice 398 Bob 198
      – Output: <Key = Bob, Value = Alice>
        map(key, value):
          name1, deg1, name2, deg2 = value.split()
          # compare degrees as numbers: as strings, "78" > "398"
          deg1 = int(deg1.replace(",", ""))
          deg2 = int(deg2.replace(",", ""))
          if deg2 > deg1 or (deg2 == deg1 and name1 < name2):
            emit(name1, name2)
          if deg2 < deg1 or (deg2 == deg1 and name1 > name2):
            emit(name2, name1)
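The map step can be tried outside any framework. The sketch below returns the key/value pair instead of calling `emit`; the helper names `parse_degree` and `orient_edge` are hypothetical, and stripping thousands separators is an assumption based on how the sample data writes large degrees.

```python
def parse_degree(text):
    """Parse a degree such as '11,985,234' (commas assumed to be separators)."""
    return int(text.replace(",", ""))

def orient_edge(line):
    """Return (lower_node, higher_node) for one 'name deg name deg' record."""
    name1, deg1, name2, deg2 = line.split()
    # Order by degree, then by name, so ties are broken consistently.
    rank1 = (parse_degree(deg1), name1)
    rank2 = (parse_degree(deg2), name2)
    if rank1 < rank2:
        return (name1, name2)
    if rank2 < rank1:
        return (name2, name1)
    return None  # both ends identical (a self-loop): nothing to emit

for record in ["Joe 56 Mary 78", "Alice 398 Bob 198",
               "Dan 983 Justin 11,985,234", "Mary 78 Alice 398"]:
    print(record, "->", orient_edge(record))
```

The last record shows why the degrees need to be compared as integers: compared as text, "78" sorts after "398" and the edge would be oriented the wrong way.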
  36. 36. Want to compute all open triads
      Aggregate (Shuffle):
      – Collect all values with the same key (the values are the nodes with higher degree)
      Computation:
      – Generate all 2-paths (friend-of-a-friend relationships)
      – Given: key = Joe, values = {Mary, Justin, Alice}
      – Output:
        • (key = Joe, Value = (Mary, Justin))
        • (key = Joe, Value = (Mary, Alice))
        • (key = Joe, Value = (Justin, Alice))
        reduce(key, values):
          values = list(values)
          # take each unordered pair once; skip pairing a friend with itself
          for i in range(len(values)):
            for j in range(i + 1, len(values)):
              emit(key, (values[i], values[j]))
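The shuffle and reduce steps can be simulated in memory to reproduce the slide's example. The `shuffle` and `two_paths` helpers and the oriented edge list are assumptions for illustration; a real MapReduce system performs the grouping itself.

```python
from collections import defaultdict
from itertools import combinations

def shuffle(pairs):
    """Group values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def two_paths(key, values):
    """Emit each unordered pair of higher-degree neighbors of key once."""
    return [(key, pair) for pair in combinations(values, 2)]

# Hypothetical oriented edges (lower-degree node first), as the map step would emit.
oriented = [("Joe", "Mary"), ("Joe", "Justin"), ("Joe", "Alice"), ("Bob", "Alice")]

paths = []
for key, values in shuffle(oriented).items():
    paths.extend(two_paths(key, values))
print(paths)
```

Joe yields the three pairs listed on the slide, while Bob, with a single higher-degree neighbor, yields none.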
  37. 37. Comparing Algorithms. Edgelist MapOnly Algorithm: – MapOnly – Output from some nodes is quadratic. Edge at a time Algorithm: – Map & Reduce – More balanced output from each node
  38. 38. Scoring. Some suggestions are better than others: – Some people are already friends! – Or they used to be friends... – Connected through a friend with 1000s of friends – Connected through multiple friends – ...
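As one possible illustration of two of these signals, the sketch below drops pairs who are already friends and scores the rest by how many mutual friends connect them. The scoring rule, the `score_candidates` name, and the sample data are assumptions; the lecture does not specify a scoring formula.

```python
from collections import Counter

def score_candidates(two_paths, friendships):
    """Score each candidate pair by its number of connecting friends."""
    scores = Counter()
    for _middle, (a, b) in two_paths:
        pair = frozenset((a, b))
        if pair in friendships:
            continue  # already friends: not a useful suggestion
        scores[pair] += 1
    return scores

# Hypothetical 2-paths (connecting friend, (candidate, candidate)).
two_paths = [
    ("Joe", ("Mary", "Justin")),
    ("Joe", ("Mary", "Alice")),
    ("Bob", ("Mary", "Alice")),
]
friendships = {frozenset(("Mary", "Justin"))}

scores = score_candidates(two_paths, friendships)
print(scores.most_common())
```

A fuller score might also down-weight paths through friends with thousands of connections, in line with the slide's third bullet, but the choice of weighting is left open here.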
  39. 39. Spring Break!
