XXL Graph Algorithms
Sergei Vassilvitskii
Yahoo! Research

With help from Jake Hofman, Siddharth Suri, Cong Yu and many others
Introduction
  XXL Graphs are everywhere:
   – Web graph
   – Friend graphs
   – Advertising graphs...



  But we have Hadoop!
   – Few algorithms have been ported (no Hadoop Algorithms book)
   – Few general algorithmic approaches
   – Active area of research




Outline
  Today:
   – Act 1: Crawl before you walk
      • Counting connected components
   – Act 2: The curse of the last reducer
      • Finding tight-knit friend groups




Act 1: Connected Components
  Given a graph, how many components does it have?

  [Figure: example graph on vertices a–h, stored as edge records]
   – (a,b) 1, (a,c) 1, (b,c) 1, (b,d) 1, (b,e) 1, (c,d) 1
   – (c,e) 1, (d,e) 1, (d,e) 1, (f,g) 1, (f,h) 1, (g,h) 1

  Data too big to fit on one reducer!
CC Overview
  Outline for Connected Components
  – Partition the input into several chunks (map 1)
  – Summarize the connectivity on each chunk (reduce 1)
  – Combine all of the (small) summaries (map 2)
  – Find the number of connected components (reduce 2)
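
The four steps above can be simulated on one machine. This is only a rough sketch, not the speaker's implementation: the names `spanning_forest` and `count_components` are hypothetical, the summary is assumed to be a spanning forest built with union-find, and the "reducers" are plain Python lists rather than Hadoop tasks.

```python
import random
from collections import defaultdict


def spanning_forest(edges):
    """Summarize a chunk: keep an edge only if it joins two components."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    kept = []
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            kept.append((u, v))
    return kept  # fewer than n edges for a chunk touching n vertices


def count_components(edges, num_reducers=2, seed=0):
    rng = random.Random(seed)
    # Map 1: send each edge to a random reducer.
    chunks = defaultdict(list)
    for edge in edges:
        chunks[rng.randrange(num_reducers)].append(edge)
    # Reduce 1: summarize each chunk independently.
    summaries = [spanning_forest(chunk) for chunk in chunks.values()]
    # Map 2: route all of the small summaries to a single reducer.
    combined = [edge for summary in summaries for edge in summary]
    # Reduce 2: a forest on n vertices with k edges has n - k trees.
    forest = spanning_forest(combined)
    vertices = {v for edge in edges for v in edge}  # isolated vertices are not seen
    return len(vertices) - len(forest)


# Edge list of the example graph on vertices a-h.
EDGES = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("b", "e"),
         ("c", "d"), ("c", "e"), ("d", "e"), ("f", "g"), ("f", "h"), ("g", "h")]
print(count_components(EDGES))  # 2
```

The answer does not depend on how the edges happen to be partitioned, because each summary preserves the connectivity of its own chunk.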




Connected Components
  1. Partition (randomly):
   – [Figure: the edges of the example graph are split between two reducers, Reduce 1 and Reduce 2]
  2. Summarize (retain < n edges):
   – [Figure: each reducer drops the redundant edges in its chunk]
  3. Recombine:
   – [Figure: Round 2, the two summaries are merged back into one graph on vertices a–h]
   – The 12 original edge records shrink to 6: (a,b) 1, (a,c) 1, (c,d) 1, (d,e) 1, (f,g) 1, (g,h) 1

  Small enough to fit!
Connected Components
  The summarization does not affect connectivity
  – Drops redundant edges
  – Dramatically reduces data size
  – Takes two MapReduce rounds


  Similar approach works in other situations:
  – Consider vertices connected only if there are k edges between them
  – Consider vertices connected if their similarity score is above a threshold
     • E.g. approximate Jaccard similarity, as computed for recommendation systems
  – Find minimum spanning trees
     • Summarize by computing an MST on the subset graph
  – Clustering
     • Cluster each partition, then aggregate the clusters
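
As an illustration of the minimum spanning tree variant, here is a small sketch that assumes Kruskal's algorithm as the per-partition summary. The function names and the example weights are made up for illustration, and the partitions are in-memory lists rather than real reducers.

```python
import random


def minimum_spanning_forest(weighted_edges):
    """Kruskal's algorithm on (weight, u, v) triples; returns the kept edges."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    kept = []
    for w, u, v in sorted(weighted_edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            kept.append((w, u, v))
    return kept


def two_round_msf(weighted_edges, num_chunks=3, seed=0):
    rng = random.Random(seed)
    chunks = [[] for _ in range(num_chunks)]
    for edge in weighted_edges:
        chunks[rng.randrange(num_chunks)].append(edge)
    # Round 1: an edge discarded here is a heaviest edge on some cycle
    # inside its chunk, so the final answer never needs it.
    summaries = [minimum_spanning_forest(chunk) for chunk in chunks]
    # Round 2: solve the much smaller combined problem on one reducer.
    return minimum_spanning_forest([e for s in summaries for e in s])


# Hypothetical weights, chosen only to have something to run.
WEIGHTED = [(4, "a", "b"), (1, "a", "c"), (3, "b", "c"), (2, "b", "d"),
            (7, "c", "d"), (5, "d", "e"), (6, "c", "e")]
direct = sum(w for w, _, _ in minimum_spanning_forest(WEIGHTED))
two_round = sum(w for w, _, _ in two_round_msf(WEIGHTED))
print(direct == two_round)  # True
```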



Act 2: Clustering Coefficient
  Finding tight-knit groups of friends

  [Figure: two example friend networks, one with CC = 2/15 ≈ 0.13 vs. one with CC = 8/15 ≈ 0.53]

  CC(v) = Fraction of pairs of v’s friends who know each other
   – Count: number of triangles incident on v
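
A minimal sketch of this definition, assuming the graph is held in memory as a dictionary of neighbor sets. The helper names are hypothetical, and returning 0 for nodes with fewer than two friends is a convention chosen here, not something the slides specify.

```python
from itertools import combinations


def build_adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def clustering_coefficient(adj, v):
    """Fraction of pairs of v's friends that are themselves friends."""
    friends = adj[v]
    d = len(friends)
    if d < 2:
        return 0.0  # no pairs to check; convention assumed for this sketch
    # Each connected pair of friends closes one triangle incident on v.
    triangles = sum(1 for u, w in combinations(friends, 2) if w in adj[u])
    return triangles / (d * (d - 1) / 2)


# Node "v" has 6 friends, and 2 of the 15 possible friend pairs know each other.
edges = [("v", f"f{i}") for i in range(1, 7)] + [("f1", "f2"), ("f3", "f4")]
adj = build_adjacency(edges)
print(round(clustering_coefficient(adj, "v"), 2))  # 0.13
```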


Finding CC For Each Node
  Attempt 1:
  – Look at each node
  – Enumerate all possible triangles (Pivot)
  – Check which of those edges exist:

  [Figure: (15 edges possible) ∩ (edges in the graph) = 2 edges present]
Finding CC For Each Node
  Attempt 1:
  – Look at each node
  – Enumerate all possible triangles
  – Check which of those edges exist


  Amount of intermediate data
  – Quadratic in the degree of the nodes
  – 6 friends: 15 possible triangles
  – n friends, n(n-1)/2 possible triangles
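
To make the quadratic blow-up concrete, the sketch below imitates Attempt 1 in plain Python and counts the candidate records it generates. The map and reduce phases are ordinary loops, and `pivot_triangle_counts` is a hypothetical name, not part of any library.

```python
from collections import defaultdict
from itertools import combinations


def pivot_triangle_counts(adj):
    """Attempt 1: every node proposes all pairs of its neighbors."""
    # "Map": node v emits one candidate record per pair of its neighbors.
    candidates = [(frozenset((u, w)), v)
                  for v, friends in adj.items()
                  for u, w in combinations(sorted(friends), 2)]
    # "Reduce": keep a candidate only if the proposed edge really exists.
    triangles = defaultdict(int)
    for pair, v in candidates:
        u, w = tuple(pair)
        if w in adj[u]:
            triangles[v] += 1
    return dict(triangles), len(candidates)


# A single hub with 1000 friends who do not know each other.
leaves = [f"n{i}" for i in range(1000)]
star = {"hub": set(leaves), **{leaf: {"hub"} for leaf in leaves}}
counts, records = pivot_triangle_counts(star)
print(records)  # 499500 candidate records, none of them a real triangle
```

All 1000 · 999 / 2 records come from the hub alone, so whichever reducer handles that node does nearly all of the work.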


  There’s always “that guy”:
  – tens of thousands of friends
  – tens of thousands of movie ratings (really!)
  – millions of followers
Finding CC For Each Node
  Attempt 1: Does not scale
  – Look at each node
  – Enumerate all possible triangles
  – Check which of those edges exist


  Attempt 2:
  – There is a limited number of High degree nodes
  – Count LLL, LLH, LHH, and HHH triangles differently
     – If a triangle has at least one Low node
        – Pivot on Low node to count the triangles
     – If a triangle has all High nodes
        – Pivot but only on other neighboring High nodes (not all nodes)


Algorithm in Pictures
  When looking at Low degree nodes
   – Check for all triangles

  When looking at High degree nodes
   – Check for triangles with other High degree nodes




Clustering Coefficient Discussion
  Attempt 2:
   – Main idea: treat High and Low degree nodes differently
      • Limit the amount of data generated (No more than O(n) per node)
   – All triangles accounted for
   – Can set High-Low threshold to balance the two cases
      • Rule of thumb: threshold around square root of number of vertices
   – A bit more complex, but still easy to code
      • Doesn’t suffer from the one high degree node problem
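
One way the High-Low idea might look in code is sketched below, under several assumptions: a node counts as High when its degree exceeds the threshold, the default threshold is the square root of the number of vertices, and duplicate discoveries of the same triangle are removed with an in-memory set. A real MapReduce job would need some other de-duplication or weighting scheme. The name `high_low_triangles` is hypothetical.

```python
import math
from itertools import combinations


def high_low_triangles(adj, threshold=None):
    """Attempt 2: Low nodes pivot on all neighbors, High nodes only on High ones."""
    if threshold is None:
        threshold = math.sqrt(len(adj))  # rule of thumb: sqrt of vertex count
    high = {v for v, friends in adj.items() if len(friends) > threshold}
    triangles = set()  # the set removes triangles found from several pivots
    checked = 0        # candidate pairs generated, i.e. intermediate data
    for v, friends in adj.items():
        pool = friends & high if v in high else friends
        for u, w in combinations(sorted(pool), 2):
            checked += 1
            if w in adj[u]:
                triangles.add(frozenset((v, u, w)))
    per_node = {v: 0 for v in adj}
    for triangle in triangles:
        for v in triangle:
            per_node[v] += 1
    return per_node, checked


# A hub with 1000 friends, two of whom (n0 and n1) also know each other.
leaves = [f"n{i}" for i in range(1000)]
graph = {"hub": set(leaves), **{leaf: {"hub"} for leaf in leaves}}
graph["n0"].add("n1")
graph["n1"].add("n0")
per_node, checked = high_low_triangles(graph)
print(per_node["hub"], checked)  # 1 2
```

Here the hub is the only High node and has no High neighbors, so it generates no candidate pairs at all. The single triangle is found by pivoting on the Low nodes n0 and n1.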




XXL Graphs: Conclusions
  Algorithm Design
   – Prove performance guarantees independent of input data
      • Input skew (e.g. high degree nodes) should not severely affect
        algorithm performance
      • Number of rounds fixed (and hopefully small)



  Rethink graph algorithms:
   – Connected Components: Two round approach
   – Clustering Coefficient: High-Low node decomposition
   – (Breaking News) Matchings: Two round sampling technique




Thank You
sergei@yahoo-inc.com

XXL Graph Algorithms, Hadoop Summit 2010
