Mining Communities in Networks: A Solution for Consistency and Its Evaluation
 Haewoon Kwak Yoonchan Choi* Youn...
  1. 1. Mining Communities in Networks: A Solution for Consistency and Its Evaluation. Haewoon Kwak, Yoonchan Choi*, Young-Ho Eom, Hawoong Jeong, Sue Moon (KAIST, Korea; *Samsung Advanced Institute of Technology, Korea) 1
  7. 7. Outline • Introduction to Community Identification • Inconsistency problem in CI • Metrics for the inconsistency in CI • Empirical solution to remove inconsistency • The case study of AS network 2
  8. 8. “Sense of Community” Introduction 3
  9. 9. Definitions of community • “Subsets of nodes characterized by having more internal connections than external connections between them” • “Set of web pages dealing with similar topics” • “Functional units” 4
  10. 10. Community in ... • Sociology • Biology • Epidemiology • Information theory • Social network analysis 5
  11. 11. Community identification • Graph partitioning based on betweenness • Clique-based approach • Link-pattern based approach • Random walks on network 6
  12. 12. How do we know whether ‘communities are well identified?’ 7
  13. 13. 8
  14. 14. 8
  15. 15. Which is better? 9
  16. 16. Quantitative metric • Modularity, Q [15]: $Q = \sum_i (e_{ii} - a_i^2)$ • $e_{ii}$: ratio of the number of links between nodes belonging to community i over all links • $a_i$: ratio of ends of edges that are attached to vertices in community i 10
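The modularity definition on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `modularity`, the edge-list input format, and the example graph (two triangles joined by one edge) are assumptions chosen because they are consistent with the arithmetic Q = 2 × (6/14 − (7/14)^2) shown later in the deck; the deck's actual figure is not recoverable.

```python
from collections import defaultdict

def modularity(edges, community):
    """Q = sum_i (e_ii - a_i^2) for an undirected, unweighted graph.

    edges: list of (u, v) pairs; community: dict mapping node -> label.
    """
    two_m = 2 * len(edges)            # total number of edge ends
    internal_ends = defaultdict(int)  # edge ends on intra-community links
    total_ends = defaultdict(int)     # all edge ends attached to a community
    for u, v in edges:
        total_ends[community[u]] += 1
        total_ends[community[v]] += 1
        if community[u] == community[v]:
            internal_ends[community[u]] += 2
    return sum(
        internal_ends[c] / two_m - (total_ends[c] / two_m) ** 2
        for c in total_ends
    )

# Hypothetical example graph: two triangles joined by a single edge.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
community = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
print(round(modularity(edges, community), 3))  # 0.357
```

Putting every node into one community gives Q = 0 under this definition, which is why a single all-encompassing community is never rewarded.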
  17. 17. < 11
  18. 18. Problem transformation from • Finding GOOD communities? 12
  19. 19. Problem transformation to • Achieving HIGH modularity! 13
  20. 20. Greedy approach for high Q • CNM algorithm • Wakita algorithm • Louvain algorithm 14
  21. 21. CNM: Initialization 15
  23. 23. Calculate ∆Q ‣ [Figure: ∆Q labels 0.051, 0.041, 0.026, 0.041, 0.041, 0.041, 0.051] 16
  24. 24. Select ‘global’ max ∆Q ‣ [Figure: ∆Q labels 0.026, 0.041, 0.041, 0.051] 17
  25. 25. Update ∆Q ‣ [Figure: ∆Q labels 0.026, 0.041, 0.082, 0.041, 0.051] 18
  26. 26. Keep going... ‣ [Figure: ∆Q labels -0.005, 0.041, 0.041, 0.051] 19
  27. 27. CNM - Final ‣ [Figure: ∆Q label -0.005] Q = 2 × (6/14 − (7/14)^2) = 0.357 20
  28. 28. Wakita algorithm • CNM + heuristic ‣ More scalable, but not high Q 21
  29. 29. Louvain algorithm • Iterative 2-phase algorithm ‣ P(I): Finding ‘local’ max ∆Q ‣ P(II): Rebuild the weighted network • Node ← community • Weight ← ∑ (weights btwn. nodes) • Repeat until Q no longer increases 22
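The second phase described on this slide (node ← community, weight ← sum of weights between nodes) can be sketched as follows. This is only an illustration of the aggregation step, not the Louvain implementation used in the study; the function name `aggregate`, the dict-based edge format, and the example graph are assumptions.

```python
from collections import defaultdict

def aggregate(weighted_edges, community):
    """Collapse each community into one node and sum the edge weights.

    weighted_edges: dict {(u, v): weight}; community: dict node -> label.
    Links inside a community become a self-loop on the new node.
    """
    merged = defaultdict(float)
    for (u, v), w in weighted_edges.items():
        cu, cv = community[u], community[v]
        key = (cu, cv) if cu <= cv else (cv, cu)  # undirected: order labels
        merged[key] += w
    return dict(merged)

# Hypothetical example: two triangles joined by one edge, unit weights.
weighted_edges = {
    (0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0,
    (2, 3): 1.0,
    (3, 4): 1.0, (3, 5): 1.0, (4, 5): 1.0,
}
community = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
print(aggregate(weighted_edges, community))
# {('A', 'A'): 3.0, ('A', 'B'): 1.0, ('B', 'B'): 3.0}
```

The first phase would then be rerun on the aggregated network, and the iteration stops once a pass fails to raise Q.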
  30. 30. Louvain: 1st phase ‣ [Figure: ∆Q labels 0.051, 0.041, 0.041, 0.051] Q = 2 × (6/14 − (7/14)^2) = 0.357 23
  31. 31. Louvain: 2nd phase ‣ Q’ = (2/2 - (2/2)^2) = 0 < Q Louvain stops here 24
  32. 32. “Similar, but not the same” Inconsistency problem 25
  33. 33. Multiple max ∆Q • When 2 or more max ∆Q exist, how do we pick two communities to merge? • Inconsistent communities are produced!! 26
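The effect of ties can be illustrated with a small greedy merger. This is a simplified sketch, not the CNM implementation: it recomputes every gain at each step instead of using CNM's data structures, and it breaks ties with a seeded random choice as a stand-in for the input-order dependence discussed in the deck. The names `greedy_merge` and `modularity` and the 6-node ring example are assumptions. It assumes a simple graph without self-loops.

```python
import random
from fractions import Fraction

def greedy_merge(edges, seed):
    """Repeatedly merge the community pair with the largest modularity gain.

    Stops when no merge has a positive gain. Ties are broken at random
    (an assumption standing in for input-order effects).
    """
    rng = random.Random(seed)
    two_m = 2 * len(edges)
    nodes = sorted({n for e in edges for n in e})
    members = {n: {n} for n in nodes}   # community id -> member nodes
    degree = {n: 0 for n in nodes}      # community id -> attached edge ends
    links = {}                          # frozenset({a, b}) -> edges between
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        key = frozenset((u, v))
        links[key] = links.get(key, 0) + 1
    while links:
        # Exact gain of merging a and b: 2 * (m_ab / 2m - a_a * a_b).
        gains = {}
        for pair, m_ab in links.items():
            a, b = tuple(pair)
            gains[pair] = 2 * (Fraction(m_ab, two_m)
                               - Fraction(degree[a] * degree[b], two_m ** 2))
        best = max(gains.values())
        if best <= 0:
            break
        tied = sorted((p for p, g in gains.items() if g == best), key=sorted)
        a, b = sorted(rng.choice(tied))
        members[a] |= members.pop(b)    # merge community b into a
        degree[a] += degree.pop(b)
        new_links = {}
        for pair, m in links.items():
            if pair == frozenset((a, b)):
                continue                # these edges are now internal
            renamed = frozenset(a if x == b else x for x in pair)
            new_links[renamed] = new_links.get(renamed, 0) + m
        links = new_links
    return frozenset(frozenset(s) for s in members.values())

def modularity(edges, partition):
    """Exact Q = sum_i (e_ii - a_i^2) for a partition given as node sets."""
    two_m = 2 * len(edges)
    q = Fraction(0)
    for block in partition:
        internal = sum(2 for u, v in edges if u in block and v in block)
        ends = sum((u in block) + (v in block) for u, v in edges)
        q += Fraction(internal, two_m) - Fraction(ends, two_m) ** 2
    return q

# Hypothetical example: a 6-node ring, where every initial gain is tied.
ring = [(i, (i + 1) % 6) for i in range(6)]
results = {greedy_merge(ring, seed) for seed in range(50)}
for partition in results:
    print(sorted(sorted(b) for b in partition), modularity(ring, partition))
```

On the ring, different seeds can end in three pairs or in two triples. Both outcomes have the same Q, so modularity alone cannot tell these partitions apart.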
  34. 34. Inconsistent results [Figure: two network visualizations with numbered nodes] (a) Q=0.273176 (b) Q=0.380671 FIG. 1: [Color Online] Visualization of inconsistent community identification in the K... belong to the same community, and node ordering is depicted as the number in the nod... 27
  35. 35. “New metrics” Evaluate inconsistency 28
  36. 36. Datasets • 12 networks: 34 to 11M nodes ‣ Online social network ‣ Biological network ‣ Internet AS network ‣ Wikipedia link network ‣ WWW network 29
  37. 37. Overview of datasets [Figure: # of edges (log) vs. # of nodes (log) for Orkut, Cyworld, Flickr, Wikipedia, YouTube, Facebook, WWW, BBS, AS Graph, C. Elegans, Protein, Karate] 30
  38. 38. Measurement methodology • Which of several tied max ∆Q merges is chosen depends on the input order of nodes • For each network: ‣ Generate N input sets with different node orders ‣ Find communities in each of the N sets ‣ Compare the identified communities 31
  39. 39. Variance of modularity [Figure panels: (a) Karate club, (b) C.Elegans, (c) Protein Interaction, (d) BBS, (e) AS graph, (f) Facebook] 32
  41. 41. We learned... • Louvain algorithm produces highest Q • CNM shows the smallest variance • *Only Louvain works in a huge network [Figure panels: (a) Karate club, (b) C.Elegans, (c) Protein Interaction, (d) BBS, (e) AS graph, (f) Facebook] 33
  42. 42. Pairwise membership prob. • The likelihood of a pair of nodes resulting in the same community • Over N runs of an algorithm, each over a randomly ordered input set, we quantify the likelihood of a pair of nodes resulting in the same community as: $p_{ij} = \frac{1}{N} \sum_{k=1}^{N} \delta_k(c_i, c_j)$, where $\delta_k(c_i, c_j) = 1$ if $c_i = c_j$ in the $k$-th dataset and $0$ otherwise, $i$ and $j$ are nodes, and $c_i$ and $c_j$ represent the communities that $i$ and $j$ belong to, respectively. We call this metric pairwise membership probability. 34
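The pairwise membership probability can be computed directly from the outcomes of N runs. The sketch below assumes each run's result is available as a dict from node to community label; the function name `pairwise_membership_prob` and the toy data are hypothetical.

```python
def pairwise_membership_prob(edges, partitions):
    """Fraction of runs in which both endpoints of an edge share a community.

    edges: list of (u, v) pairs; partitions: one dict node -> label per run.
    """
    n_runs = len(partitions)
    return {
        (u, v): sum(p[u] == p[v] for p in partitions) / n_runs
        for u, v in edges
    }

# Hypothetical outcomes of N = 4 runs on a 3-node path 0 - 1 - 2.
edges = [(0, 1), (1, 2)]
partitions = [
    {0: 0, 1: 0, 2: 0},
    {0: 0, 1: 0, 2: 1},
    {0: 0, 1: 0, 2: 0},
    {0: 0, 1: 0, 2: 1},
]
print(pairwise_membership_prob(edges, partitions))
# {(0, 1): 1.0, (1, 2): 0.5}
```

Only the equality of labels within a run matters, so labels do not need to agree across runs.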
  43. 43. Distribution of p.m.p. [Figure panels: (a) Karate club, (b) C.Elegans, (c) Protein Interaction, (d) BBS, (e) AS graph, (f) Facebook] 35
  45. 45. Distribution of p.m.p. • There are many edges whose pairwise membership prob. is not 0 or 1 36
  46. 46. Consistency, C • To quantify network-wide consistency, we define a metric of consistency for the entire network [equation not recoverable from the extraction] ‣ First term: weighs the pairwise membership prob. away from 0.5 ‣ Second term: normalization 37
  48. 48. C in 12 networks • No algorithm outperforms the other two [Figure 6: Consistency (no data available by CNM and Wakita for Wikipedia and Cyworld)] 38
  49. 49. “Totally intuitive” Our approach 39
  50. 50. Intuitions behind our approach • Every edge has a pairwise membership prob. • A high pairwise membership probability indicates that two nodes are likely to be in the same community • In a weighted network, all 3 algorithms place edges of high weight within a community 40
  54. 54. Reinforcing p.m.p. • After a cycle of N runs, ‣ Calculate pairwise membership prob. ‣ Assign p.m.p. as edge weight • Return to another cycle of N runs • Continue until C gains no improvement 41
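The reinforcement cycle on this slide can be sketched as a loop around any weighted community detection routine. Everything below is an assumption made for illustration: `detect` is a caller-supplied function (weights, rng) → dict node → label, `toy_detect` is a hypothetical stand-in for CNM, Wakita, or Louvain, and `consistency` is an assumed placeholder, since the deck's exact definition of C is not recoverable from this text.

```python
import random

def pmp(edges, partitions):
    """Pairwise membership probability of each edge over a list of runs."""
    n = len(partitions)
    return {(u, v): sum(p[u] == p[v] for p in partitions) / n for u, v in edges}

def consistency(probs):
    """ASSUMED placeholder for C: mean of 4 * (p - 0.5)^2 over edges.

    It is 1 when every p.m.p. is 0 or 1 and 0 when every p.m.p. is 0.5.
    This is not the deck's formula.
    """
    return sum(4 * (x - 0.5) ** 2 for x in probs.values()) / len(probs)

def reinforce(edges, detect, n_runs=100, max_cycles=10, seed=0):
    """Run cycles of n_runs detections, feeding p.m.p. back as edge weights."""
    rng = random.Random(seed)
    weights = {e: 1.0 for e in edges}      # start from an unweighted network
    history = []                           # consistency after each cycle
    for _ in range(max_cycles):
        partitions = [detect(weights, rng) for _ in range(n_runs)]
        weights = pmp(edges, partitions)   # assign p.m.p. as edge weight
        history.append(consistency(weights))
        if len(history) > 1 and history[-1] <= history[-2]:
            break                          # C gained no improvement
    return weights, history

def toy_detect(weights, rng):
    """Hypothetical detector on the path 0 - 1 - 2: node 1 joins node 0 or
    node 2, chosen at random in proportion to the two edge weights."""
    w01, w12 = weights[(0, 1)], weights[(1, 2)]
    joins_left = rng.random() * (w01 + w12) < w01
    return {0: "L", 1: "L" if joins_left else "R", 2: "R"}

weights, history = reinforce([(0, 1), (1, 2)], toy_detect)
print(weights, history)
```

With a real detector, the stopping rule and the choice of N per cycle would need tuning; the deck reports N = 100.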
  56. 56. Convergence of C • Except Orkut & Cyworld, C converges to 1 within 5 cycles [Figure 8: Convergence of consistency] 42
  58. 58. Agreement btwn. trials • Communities of independent trials are almost identical [Figure 10: Comparison of community size distribution in 4 tri...] 43
  60. 60. For non-converging case • Is N = 100 not enough? • Resolution limit in community detection? • We are building an analytical framework to explain inconsistency problems 44
  61. 61. “Internet Jellyfish” Preliminary analysis of AS graph 45
  63. 63. Communities in AS graph • The largest community • The geographically concentrated community • The star-shaped community 46
  64. 64. The largest community, L • 32.3% of all ASes • MCI, Level3, AT&T WorldNet, Sprint, ... • 9 of top 10 listed in AS ranking of CAIDA 47
  66. 66. Reapplying our approach to L • 3 of the 9 fall in the same community; the remaining 6 fall into different communities 48
  68. 68. Geographically concentrated community 97.4% of ASes in Korea 49
  69. 69. Star-shaped community All relations are provider-customer 50
  74. 74. Summary • We identify inconsistency in community identification • We define new metrics for measuring inconsistency • We propose empirical solutions reinforcing pairwise membership probability • We present preliminary analysis of communities in AS graph 51
  75. 75. Supplementary material • http://an.kaist.ac.kr/traces/IMC2009-kwak.html 52
  76. 76. Supplementary material • http://arxiv.org/abs/0910.1508 53
  77. 77. Thank you 54
  78. 78. Backup slides 55
  79. 79. Change of p.m.p. [Figure panel: (a) Facebook] 56
