Summer School
“Achievements and Applications of Contemporary Informatics,
         Mathematics and Physics” (AACIMP 2011)
              August 8-20, 2011, Kiev, Ukraine




            Graph Based Clustering

                                 Erik Kropat

                     University of the Bundeswehr Munich
                      Institute for Theoretical Computer Science,
                        Mathematics and Operations Research
                                Neubiberg, Germany
Real World Networks

• Biological Networks
  −   Gene regulatory networks
  −   Metabolic networks
  −   Neural networks
  −   Food webs

• Technological Networks
  −   Telecommunication networks
  −   Internet
  −   Power grids

[Figures: food web; power grid]
Real World Networks

• Social Networks
    −    Communication networks
    −    Organizational networks
    −    Social media
    −    Online communities

• Economic Networks
    −    Financial market networks
    −    Trade networks
    −    Collaboration networks

[Figures: social networks; economic networks]


Source: Frank Schweitzer et al., “Economic Networks: The New Challenges,” Science 325, no. 5939 (July 24, 2009): 422-425.
Graph Theory

• Graph theory can provide more detailed information
  about the inner structure of the data set in terms of
    −   cliques          (subsets of nodes where each pair of elements is connected)
    −   clusters         (highly connected groups of nodes)
    −   centrality       (important nodes, hubs)
    −   outliers . . .   (unimportant nodes)

• Applications
    − social network analysis
    − diffusion of information
    − spreading of diseases or rumours

⇒    marketing campaigns, viral marketing, social network advertising
Graph-Based Clustering

• A collection of widely used clustering algorithms
  that are based on graph theory.
• Organizes information in large datasets so that users
  can access the required information faster.
Idea

• Objects are represented as nodes in a complete or connected graph.
• Assign a weight to each branch (x,y) between two nodes x and y.
  The weight is defined by the distance d(x,y) between the nodes.
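
As a minimal sketch of this idea, assume the objects are points in the plane and d(x,y) is the Euclidean distance (the slides do not fix a particular distance). The names `points` and `complete_graph` are illustrative only.

```python
import math
from itertools import combinations

def complete_graph(points):
    """Return the branches of the complete graph on `points`.

    `points` maps a node name to its (x, y) coordinates. Each branch is a
    tuple (weight, x, y), where the weight is the Euclidean distance d(x, y).
    """
    return [
        (math.dist(points[x], points[y]), x, y)
        for x, y in combinations(sorted(points), 2)
    ]

# Hypothetical data: three objects in the plane.
points = {"a": (0.0, 0.0), "b": (3.0, 4.0), "c": (6.0, 8.0)}
edges = complete_graph(points)
```

Any other distance measure could be substituted for `math.dist` without changing the rest of the method.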


[Figure: clustering, showing the distance between clusters and the distance between objects]
Idea




[Figure: graph ⇒ minimal spanning tree ⇒ clusters]
Graph Based Clustering

Hierarchical method
(1) Determine a minimal spanning tree (MST)
(2) Delete branches iteratively
    New connected components = Cluster
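
Step (2) relies on reading off the connected components once branches have been deleted. A minimal sketch of that step, using a depth-first search (one possible choice; the function name `connected_components` is an assumption, not from the slides):

```python
def connected_components(nodes, edges):
    """Return the connected components (clusters) of an undirected graph.

    `edges` is an iterable of (x, y) pairs; weights are not needed here.
    """
    neighbours = {v: set() for v in nodes}
    for x, y in edges:
        neighbours[x].add(y)
        neighbours[y].add(x)

    seen, clusters = set(), []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first search collects everything reachable from `start`.
        component, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v in component:
                continue
            component.add(v)
            stack.extend(neighbours[v] - component)
        seen |= component
        clusters.append(component)
    return clusters

# Hypothetical tree a-b-c-d after the branch (b, c) has been deleted.
clusters = connected_components("abcd", [("a", "b"), ("c", "d")])
```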




[Figure: example graph with branch weights 1, 3, 4, 5, 6, 8]
Minimal Spanning Trees
Minimal Spanning Tree

A minimal spanning tree of a connected, weighted graph G = (V,E)
is a connected subgraph with minimal total weight
that contains all nodes of G and has no cycles.

                       c                                     c

             4                                      4
                 6           5                          6        5
    b                                    b

     1                8                   1                 8

    a            3               d       a              3             d

          graph G = (V, E)                    minimal spanning tree
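
In symbols, the definition can be written as follows. The weight function w and the set 𝒯(G) of all spanning trees of G are notation introduced here, not used in the slides.

```latex
T^{*} \in \operatorname*{arg\,min}_{T \in \mathcal{T}(G)} w(T),
\qquad
w(T) = \sum_{e \in E_T} w(e),
\qquad
|E_T| = |V| - 1 .
```

For the example graph, the tree with branches (a,b), (a,d), (b,c) has weight 1 + 3 + 4 = 8.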
Minimal spanning trees can be calculated with...

(1) Prim’s algorithm.
(2) Kruskal’s algorithm.

                                                               c

                                                       4

                                                           6       5
                                                   b

                                                   1           8

                                                   a       3           d
Example – Prim’s Algorithm

Start: Set VT = {a}, ET = { }.

Choose an edge (x,y) with minimal weight such that x ∈ VT and y ∉ VT.
⇒ VT = {a,b} and ET = { (a,b) }.


                     c                                     c

           4                                      4

               6         5                            6           5
 b                                       b

 1                  8                    1                8


 a             3             d           a            3               d
Example – Prim’s Algorithm

Choose an edge (x,y) with minimal weight such that x ∈ VT and y ∉ VT.
⇒ VT = {a,b,d} and ET = { (a,b), (a,d) }.

Choose an edge (x,y) with minimal weight such that x ∈ VT and y ∉ VT.
⇒ VT = {a,b,c,d} and ET = { (a,b), (a,d), (b,c) }.


                           c                                             c

                  4                                            4

                      6        5                                   6         5
        b                                             b

        1                 8                           1                 8

        a             3             d                 a            3              d
Prim’s Algorithm


 INPUT:       Weighted graph G = (V, E), undirected + connected
 OUTPUT:      Minimal spanning tree T = (VT, ET)

 (1) Set VT = {v}, ET = { }, where v is an arbitrary node from V (starting point).
 (2) REPEAT
 (3)   Choose an edge (a,b) with minimal weight, such that a ∈ VT and b ∉ VT.
 (4)   Set VT = VT ∪ {b} and ET = ET ∪ { (a,b) }.
 (5) UNTIL VT = V
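
The pseudocode above can be sketched in Python as follows. The priority queue (`heapq`) is an implementation choice not prescribed by the slides. The example graph is the one from the slides, but which diagonal carries weight 6 and which carries weight 8 is an assumption; it does not affect the resulting tree.

```python
import heapq

def prim(graph, start):
    """Prim's algorithm on an undirected, connected, weighted graph.

    `graph` maps each node to a dict {neighbour: weight}.
    Returns the tree branches E_T as (a, b, weight), in the order added.
    """
    visited = {start}                      # V_T
    tree = []                              # E_T
    # Candidate edges (weight, a, b) with a in V_T, cheapest first.
    frontier = [(w, start, b) for b, w in graph[start].items()]
    heapq.heapify(frontier)
    while frontier and len(visited) < len(graph):
        w, a, b = heapq.heappop(frontier)
        if b in visited:
            continue                       # both ends already in V_T
        visited.add(b)
        tree.append((a, b, w))
        for c, wc in graph[b].items():
            if c not in visited:
                heapq.heappush(frontier, (wc, b, c))
    return tree

# Assumed placement of the diagonal weights: (a,c) = 6 and (b,d) = 8.
graph = {
    "a": {"b": 1, "d": 3, "c": 6},
    "b": {"a": 1, "c": 4, "d": 8},
    "c": {"b": 4, "d": 5, "a": 6},
    "d": {"a": 3, "c": 5, "b": 8},
}
mst = prim(graph, "a")
```

Starting from node a, the branches are added in the same order as in the worked example: (a,b), (a,d), (b,c).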
Kruskal’s Algorithm


 INPUT:        Weighted graph G = (V, E), undirected + connected
 OUTPUT:       Minimal spanning tree T = (VT, ET)

 (1) Set VT = V, ET = { }, H = E.
 (2) Initialize a queue to contain all edges in G, using the weights in ascending
     order as keys.
 (3) WHILE H ≠ { }
 (4)       Choose an edge e ∈ H with minimal weight.
 (5)       Set H = H \ {e}.
 (6)       If (VT, ET ∪ {e}) has no cycles, then ET = ET ∪ {e} .
 (7) END
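
A minimal sketch of the pseudocode above. The cycle test in step (6) is done here with a union-find structure, which is one common choice and is not specified in the slides. The edge list reuses the example graph, with the same assumed placement of the weights 6 and 8.

```python
def kruskal(nodes, edges):
    """Kruskal's algorithm; `edges` is a list of (weight, x, y) tuples.

    Returns the tree branches E_T as (x, y, weight), in the order added.
    """
    parent = {v: v for v in nodes}

    def find(v):
        # Follow parent links to the representative, compressing the path.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for w, x, y in sorted(edges):          # ascending weight
        rx, ry = find(x), find(y)
        if rx != ry:                       # adding (x, y) creates no cycle
            parent[rx] = ry
            tree.append((x, y, w))
    return tree

# Assumed placement of the diagonal weights: (a,c) = 6 and (b,d) = 8.
edges = [(1, "a", "b"), (3, "a", "d"), (4, "b", "c"),
         (5, "c", "d"), (6, "a", "c"), (8, "b", "d")]
mst = kruskal("abcd", edges)
```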
Branch Deletion
Delete Branches - Different Strategies

(1) Delete the branch with maximum weight.
(2) Delete inconsistent branches.
(3) Delete by analysis of weights.
(1) Delete the branch with maximum weight

• In each step, create two new clusters
  by deleting the branch with maximum weight.
• Repeat until the given number of clusters is reached.
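
A minimal sketch of this strategy, assuming the minimal spanning tree is given as a list of branches. The function name `cut_heaviest` and the node names are hypothetical.

```python
def cut_heaviest(tree, k):
    """Delete the k - 1 heaviest branches of a spanning tree.

    `tree` is a list of (x, y, weight) branches. Deleting one branch of a
    tree always adds exactly one connected component, so k - 1 deletions
    leave k clusters. Ties are broken by list order (an arbitrary choice).
    """
    by_weight = sorted(tree, key=lambda branch: branch[2], reverse=True)
    return by_weight[k - 1:]

# Hypothetical tree; the node names are illustrative, not from the slides.
tree = [("p", "q", 2), ("q", "r", 6), ("r", "s", 2), ("s", "t", 4)]
remaining = cut_heaviest(tree, 3)
```

Here the branches with weights 6 and 4 are removed, which leaves three clusters.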

[Figure: spanning tree with branch weights 2, 2, 2, 2, 2, 3, 4, 6]
Example: Delete the branch with maximum weight

[Figure: minimum spanning tree]



Ordered weights of branches:   6, 4, 3, 2, 2, 2, 2, 2.
Step 1: Delete branch (weight 6)       ⇒   2 clusters
Step 2: Delete branch (weight 4)       ⇒   3 clusters
(2) Delete inconsistent branches

• A branch e is inconsistent if its weight $d_e$ is (much) larger
  than a reference value $\bar{d}_e$.
• The reference value $\bar{d}_e$ can be defined as the average weight
  of all branches adjacent to e.

Example: branch e has weight 6, and its adjacent branches have weights 1, 2 and 3.

  $\bar{d}_e = \frac{3 + 2 + 1}{3} = 2$

  $d_e = 6 > 2 = \bar{d}_e$  ⇒  e inconsistent
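
The reference value can be sketched in code as follows. Two branches are treated as adjacent when they share a node. The `factor` parameter is a hypothetical knob for "(much) larger"; the slides give no specific value. Node names are made up; only the weights come from the slide.

```python
def reference_value(tree, e):
    """Average weight of all branches adjacent to branch `e`.

    `tree` maps a branch (x, y) to its weight; two branches are adjacent
    if they share a node. Returns None if `e` has no adjacent branches.
    """
    adjacent = [w for f, w in tree.items() if f != e and set(f) & set(e)]
    return sum(adjacent) / len(adjacent) if adjacent else None

def is_inconsistent(tree, e, factor=1.0):
    """True if the weight of `e` exceeds `factor` times its reference value."""
    ref = reference_value(tree, e)
    return ref is not None and tree[e] > factor * ref

# The slide's example: e = (u, v) has weight 6, adjacent branches weigh 1, 2, 3.
tree = {("u", "v"): 6, ("u", "p"): 1, ("u", "q"): 2, ("v", "r"): 3}
ref = reference_value(tree, ("u", "v"))
```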
(3) Delete by analysis of weights
• Perform an “analysis” of all weights of branches in the MST.
  Determine a threshold S.
• The threshold can be estimated by
  histograms on the weights of branches (= length of branches).
• Delete a branch if its weight is higher than the threshold S.
[Figure: two histograms of branch weights; x-axis: weight of branch (length of branch), y-axis: number of branches; the threshold S is marked on the weight axis]
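
A minimal sketch of this strategy. The histogram is a plain count of branches per weight bin, and the threshold S is chosen by hand after inspecting it; the slides do not give an automatic rule. The tree, `bin_width` and the function names are hypothetical.

```python
from collections import Counter

def weight_histogram(tree, bin_width=1):
    """Count MST branches per weight bin (bin label = lower bin edge)."""
    return Counter((w // bin_width) * bin_width for _, _, w in tree)

def cut_above(tree, threshold):
    """Keep only the branches whose weight does not exceed the threshold S."""
    return [branch for branch in tree if branch[2] <= threshold]

# Hypothetical MST branches (x, y, weight) with two clearly long branches.
tree = [("a", "b", 1), ("b", "c", 2), ("c", "d", 9),
        ("d", "e", 2), ("e", "f", 10), ("f", "g", 1)]
hist = weight_histogram(tree, bin_width=5)
kept = cut_above(tree, threshold=5)   # S = 5 chosen by eye from the histogram
```

In this example the histogram shows four short branches and two long ones, so any S between the two groups separates them.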
Exercise                                  d

                               3                      20

                                              5
                        e                                  c
                               9                  8


                         1                                 4
                               15     g           6

                                          12
                         f                                 b

                               10                     2


                                       a

Find a minimal spanning tree and provide a clustering of the graph
by deleting all inconsistent branches.
Example

Set VT = {a}, ET = { }.
In each of the six steps, choose an edge (x,y) with minimal weight
such that x ∈ VT and y ∉ VT.

[Figures: the six steps of the construction, ending with the minimal spanning tree]
Example
          For each branch calculate the reference value
              (average weight of adjacent branches)
                                    d

                        3
                        (3)   (4.5) 5
               e                                        c


                1 (3)                               (4) 4
                                g          6
                                        (3.6)
                f                                       b
                                         (5)
                                                2


                                  a
Example
                Delete inconsistent branches
          (weight is larger than the reference value)
                               d
                                         2 clusters
                       3
                       (3)
              e                              c


               1 (3)                     (4) 4
                               g

              f                              b
                             Noise?


                               a
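
The whole exercise can be reproduced with a short script. The endpoints of each branch are read off the figure layout and are therefore an assumption; they do reproduce the reference values shown on the slide. Kruskal's algorithm is used for the tree, and components are tracked with simple labels to keep the sketch short.

```python
# Exercise graph as (weight, x, y); endpoints are assumed from the figure.
edges = [(3, "d", "e"), (20, "c", "d"), (5, "d", "g"), (9, "e", "g"),
         (8, "c", "g"), (1, "e", "f"), (4, "b", "c"), (15, "f", "g"),
         (6, "b", "g"), (12, "a", "g"), (10, "a", "f"), (2, "a", "b")]
nodes = "abcdefg"

# Kruskal: label[v] names the component of v; merge when labels differ.
label = {v: v for v in nodes}
mst = {}
for w, x, y in sorted(edges):
    if label[x] != label[y]:
        old, new = label[x], label[y]
        for v in nodes:
            if label[v] == old:
                label[v] = new
        mst[(x, y)] = w

def reference(e):
    """Average weight of the MST branches sharing a node with `e`."""
    adj = [w for f, w in mst.items() if f != e and set(f) & set(e)]
    return sum(adj) / len(adj)

refs = {e: reference(e) for e in mst}
# Keep a branch unless its weight is larger than its reference value.
kept = [e for e, w in mst.items() if w <= refs[e]]

# Clusters = connected components of the remaining branches.
label = {v: v for v in nodes}
for x, y in kept:
    old, new = label[x], label[y]
    for v in nodes:
        if label[v] == old:
            label[v] = new
clusters = {}
for v in nodes:
    clusters.setdefault(label[v], set()).add(v)
clusters = sorted(clusters.values(), key=sorted)
```

Under these assumptions the branches (d,g) and (b,g) are deleted, which leaves the clusters {a,b,c} and {d,e,f} plus the isolated node g (marked "Noise?" on the slide). The reference value of (b,g) is 11/3 ≈ 3.67, which the slide shows as 3.6.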
Summary

• In graph-based clustering, objects are represented as nodes
  in a complete or connected graph.
• The distance between two objects is given by the weight
  of the corresponding branch.
• Hierarchical method
     (1) Determine a minimal spanning tree (MST)
     (2) Delete branches iteratively
• Visualization of information in large datasets.
Literature

• V. Kumar, M. Steinbach, P.-N. Tan
  Introduction to Data Mining.
  Addison Wesley, 2005.

Other work mentioned in the presentation
• J.A. Dunne, R.J. Williams, N.D. Martinez, R.A. Wood, D.H. Erwin
  Compilation and Network Analyses of Cambrian Food Webs.
  PLoS Biol 6(4): e102. doi:10.1371/journal.pbio.0060102

• F. Schweitzer, G. Fagiolo, D. Sornette, F. Vega-Redondo,
  A. Vespignani, D.R. White
  Economic Networks: The New Challenges.
  Science 325, no. 5939 (July 24, 2009): 422-425.
Thank you very much!
