Presentation from WWW 2011

Counting Triangles and the Curse of the Last Reducer

Counting Triangles & The Curse of the Last Reducer
Siddharth Suri, Sergei Vassilvitskii (Yahoo! Research)
WWW 2011
Why Count Triangles?
Clustering coefficient: given an undirected graph G = (V, E),

    cc(v) = fraction of v's neighbors who are themselves neighbors
          = |{(u, w) ∈ E : u ∈ Γ(v) ∧ w ∈ Γ(v)}| / (d_v choose 2)
          = (# triangles incident on v) / (d_v choose 2)

(The slides work this out on four small pictured graphs, with cc = N/A for a node of degree < 2, then cc = 1/3, 1, and 1 for the other examples.)
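The definition above can be sketched as runnable Python. This is a minimal illustration, not code from the talk; the adjacency-dict representation and the example graph are mine:

```python
from itertools import combinations

def clustering_coefficient(adj, v):
    """cc(v) = (# edges among v's neighbors) / (d_v choose 2); None if d_v < 2."""
    neighbors = adj[v]
    d = len(neighbors)
    if d < 2:
        return None  # cc is undefined (the "N/A" case on the slide)
    # Count neighbor pairs that are themselves connected by an edge
    links = sum(1 for u, w in combinations(neighbors, 2) if w in adj[u])
    return links / (d * (d - 1) / 2)

# Triangle {a, b, c} plus a pendant node d attached to a
adj = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
```

Here cc(b) = 1 (both of b's neighbors are adjacent), cc(a) = 1/3 (only one of the three neighbor pairs is an edge), and cc(d) is undefined.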
Why Clustering Coefficient?
Captures how tight-knit the network is around a node (pictured: cc = 0.5 vs. cc = 0.1).
– Network cohesion: tightly knit communities foster more trust and stronger social norms. [Coleman '88, Portes '88]
– Structural holes: individuals benefit from bridging otherwise-disconnected communities. [Burt '04, '07]
Why MapReduce?
De facto standard for parallel computation on large data.
– Widely used at Yahoo!, Google, Facebook
– Also at: New York Times, Amazon.com, Match.com, ...
– Commodity hardware
– Reliable infrastructure
– Data continues to outpace available RAM!
How to Count Triangles
Sequential version:

    foreach v in V
        foreach u,w in Adjacency(v)
            if (u,w) in E
                Triangles[v]++

(The slides step through this on a small example, incrementing Triangles[v] as each wedge at v is found to be closed.)

Running time: Σ_{v∈V} d_v². Even for sparse graphs this can be quadratic if one vertex has high degree.
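The sequential loop above, as runnable Python (a minimal sketch; the adjacency dict and the two-triangle example graph are mine, not from the talk):

```python
from itertools import combinations

def count_triangles(adj):
    """Per-node triangle counts; each triangle is counted once at each of its vertices."""
    triangles = {v: 0 for v in adj}
    for v in adj:
        # Enumerate all wedges u - v - w centered at v
        for u, w in combinations(adj[v], 2):
            if w in adj[u]:          # edge check: (u, w) in E closes the wedge
                triangles[v] += 1
    return triangles

# Two triangles: {1, 2, 3} and {2, 3, 4}
adj = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3}}
```

Nodes 2 and 3 sit on both triangles, so their counts are 2; nodes 1 and 4 each get 1.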
Parallel Version
Parallelize the edge-checking phase:
– Map 1: for each u send (u, Γ(u)) to a single machine.
– Reduce 1: input ⟨u; Γ(u)⟩; output all length-2 paths ⟨(v1, v2); u⟩ where v1, v2 ∈ Γ(u).
– Map 2: send each ⟨(v1, v2); u⟩ and, for every (v1, v2) ∈ E, a marker ⟨(v1, v2); $⟩ to the same machine.
– Reduce 2: input ⟨(v, w); u1, u2, ..., uk, $?⟩; if the $ marker is present, each wedge (v, u_i, w) is a triangle, and each of its three vertices is credited +1/3 per detection (every triangle is detected three times, once per wedge center, so each vertex ends with its full count).
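The two rounds can be simulated in a single process to check the logic. This is my sketch, not the talk's Hadoop implementation; and instead of the slide's +1/3-per-corner bookkeeping it credits the wedge center +1 on each detection, which yields the same per-node totals:

```python
from collections import defaultdict
from itertools import combinations

def mr_triangles(adj):
    """Single-process simulation of the two MapReduce rounds."""
    # Round 1: key = sorted wedge endpoints, value = wedge center
    wedges = defaultdict(list)
    for v, nbrs in adj.items():
        for u, w in combinations(sorted(nbrs), 2):
            wedges[(u, w)].append(v)

    # The "$" records of round 2: one marker per edge, keyed the same way
    edges = {(u, w) for u in adj for w in adj[u] if u < w}

    # Round 2: reduce by key; centers of closed wedges get credit
    triangles = defaultdict(int)
    for pair, centers in wedges.items():
        if pair in edges:
            for u in centers:
                triangles[u] += 1
    return dict(triangles)

adj = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3}}
```

Because every triangle appears as three different wedges, each of its vertices is the center exactly once, reproducing the sequential counts.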
Data Skew
How much parallelization can we achieve?
– Generate all the paths to check in parallel.
– The running time becomes max_{v∈V} d_v².

Naive parallelization does not help with data skew:
– Some nodes have very high degree.
– Example: a node with 3.2 million followers forces ~10 trillion (10^13) potential edges to be checked.
– Even generating 100M edges per second, that is ~10^5 seconds, about 27 hours.
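The back-of-the-envelope arithmetic behind that example (reading the slide's 10^13 as the squared degree of the hypothetical 3.2M-follower node):

```python
followers = 3_200_000
checks = followers ** 2        # wedge endpoints to test at this one node, ~1.02e13
rate = 100_000_000             # optimistic: 100M edge checks per second
hours = checks / rate / 3600   # ~28 hours, in line with the slide's "~27 hours"
```

A single skewed node thus dominates the entire job, no matter how many other reducers finish early.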
“Just 5 more minutes”
Running the naive algorithm on the LiveJournal graph:
– 80% of reducers done after 5 minutes
– 99% done after 35 minutes
Adapting the Algorithm
Approach 1: deal with skew directly.
– Currently every triangle is counted 3 times (once per vertex).
– Running time is quadratic in the degree of the vertex.
– Idea: count each triangle once, from the perspective of its lowest-degree vertex.
– Does this heuristic work?

Approach 2: divide and conquer.
– Equally divide the graph between machines.
– But any edge partition is bound to miss triangles.
– So divide into overlapping subgraphs, and account for the overlap.
How to Count Triangles Better
Sequential version [Schank '07]:

    foreach v in V
        foreach u,w in Adjacency(v)
            if deg(u) > deg(v) and deg(w) > deg(v)
                if (u,w) in E
                    Triangles[v]++

(Ties in degree are broken by node id, so the ordering is total and each triangle is counted exactly once, at its lowest-degree vertex.)
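A runnable sketch of the degree-ordered count (my Python rendering of the pseudocode above; the (degree, id) tuple implements the tie-breaking):

```python
from itertools import combinations

def count_triangles_schank(adj):
    """Count each triangle once, from its lowest-degree vertex [Schank '07]."""
    # Total order on vertices: by degree, ties broken by node id
    rank = {v: (len(adj[v]), v) for v in adj}
    total = 0
    for v in adj:
        # Only look at wedges whose endpoints both outrank v
        higher = [u for u in adj[v] if rank[u] > rank[v]]
        for u, w in combinations(higher, 2):
            if w in adj[u]:
                total += 1
    return total

adj = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3}}
```

On the two-triangle example each triangle is found exactly once (at node 1 and node 4 respectively), so the total is 2 rather than 6 wedge-detections.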
Does it make a difference?
Dealing with Skew
Why does it help?
– Partition nodes into two groups:
  • Low: L = {v : d_v ≤ √m}
  • High: H = {v : d_v > √m}
– There are at most n low nodes; each produces at most O(m) paths.
– There are at most 2√m high nodes; with the degree check, each produces paths only to other high nodes: O(m) paths per node.
– The two bounds are identical: O(m) paths per node in both cases.
– Therefore no mapper can produce substantially more work than the others.
– Total work is O(m^(3/2)), which is optimal.
Approach 2: Graph Split
Partitioning the nodes:
– The previous algorithm shows one way to achieve better parallelization.
– But what if even O(m) per machine is too much? Is it possible to divide the input into smaller chunks?

Graph Split algorithm:
– Partition vertices into p equal-sized groups V1, V2, ..., Vp.
– Consider all possible triples (Vi, Vj, Vk) and the induced subgraph G_ijk = G[Vi ∪ Vj ∪ Vk].
– Compute the triangles on each G_ijk separately.
Approach 2: Graph Split
Some triangles are present in multiple subgraphs:
– a triangle with vertices in three distinct groups appears in exactly 1 subgraph;
– a triangle spanning two groups Vi, Vj appears in p − 2 subgraphs;
– a triangle entirely inside one group Vk appears in ~p² subgraphs (any choice of the other two groups).
We can count exactly how many subgraphs each triangle falls in, and correct for the multiplicity.
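The multiplicity correction can be made concrete in a single-process sketch. This is my illustration, not the talk's MapReduce implementation: a hash of the node id stands in for the random partition, and a triangle whose vertices fall into t distinct groups is down-weighted by the number of subgraphs containing it, C(p − t, 3 − t):

```python
from itertools import combinations
from math import comb

def graph_split_count(adj, p):
    """Exact triangle count via overlapping induced subgraphs G[Vi ∪ Vj ∪ Vk].

    Requires p >= 3 and comparable node ids. Each local count is divided by
    the number of subgraphs the triangle appears in: comb(p - t, 3 - t),
    where t is the number of distinct groups its vertices occupy.
    """
    group = {v: hash(v) % p for v in adj}
    total = 0.0
    for gi, gj, gk in combinations(range(p), 3):
        keep = {gi, gj, gk}
        node_set = {v for v in adj if group[v] in keep}
        for v in node_set:
            # Count each local triangle once, at its smallest vertex id
            local = sorted(n for n in adj[v] if n in node_set and n > v)
            for u, w in combinations(local, 2):
                if w in adj[u]:
                    t = len({group[v], group[u], group[w]})
                    total += 1 / comb(p - t, 3 - t)
    return round(total)

adj = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3}}
```

The weights match the slide: t = 3 gives weight 1, t = 2 gives 1/(p − 2), and t = 1 gives 1/C(p − 1, 2) ≈ 1/p², so the weighted sum over all C(p, 3) subgraphs recovers the exact count for any p.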
Approach 2: Graph Split
Analysis:
– Each subgraph has O(m/p²) edges in expectation.
– Very balanced running times.
– p controls the memory needed per machine.
– Total work: ~p³ subgraphs × O((m/p²)^(3/2)) each = O(m^(3/2)), independent of p.
– Caveat: the input as a whole gets bigger; shuffle time and paging increase with the duplication.
Overall
Naive parallelization doesn't help with data skew.
Related Work
• Tsourakakis et al. ['09]: count the global number of triangles by estimating the trace of the cube of the adjacency matrix; they don't specifically deal with skew, and obtain high-probability approximations.
• Becchetti et al. ['08]: approximate the number of triangles per node, using multiple passes to obtain successively better approximations.
Conclusions
Think about data skew... and avoid the curse:
– Get programs to run faster
– Publish more papers
– Get more sleep
– ...
– The possibilities are endless!
Thank You
