Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Modeling Decisions for Artificial Intelligence (MDAI)               2012, Girona, Spain
Introduction We present a benchmarking of two distinct algorithms forextracting community structure from On-line Social N...
Extraction of the community structureAlgorithm 1: Newman’s algorithm   • Extracts the communities by successively dividing...
Extraction of the community structureAlgorithm 2: Blondel’s method   1. The method looks for smaller communities by optimi...
Filtering / Sampling process2-step process: In order to obtain a subset of a complete graph weapply a process consisting o...
Datasets used                         Graph Statistics               Karate   Dolphins    GrQc    Enron    Facebook#Nodes ...
Sample criteria• Sampling requires user supervision• Sampling % desired: [ 10%-20% ]• Only large graphs datasets (3) are s...
Sampling statistics• Graph statistics for sampled Vs original datasets (sss / ooo)                           GrQc         ...
Sampling statistics• Indicator: Clustering coefficient shows a common pattern, increasing   in sampled datasets. This serv...
Empirical Tests and Results1. First, we evaluate Newman’s algorithm with the sampled datasets.                        Stop...
Empirical Tests and Results2. Blondel’s method allows us to extract the communities from theoriginal dataset, given it’s g...
Normalized Mutual Information After labeling the communities, we match the nodes inside everycorresponding community in t...
Normalized Mutual Information After labeling the communities, we match the nodes inside everycorresponding community in t...
Newman’s Vs. Blondel’s• In terms of modularity (Q) and number of communities (C)                    Blondel’s          Blo...
Newman’s Vs. Blondel’s• In terms of modularity (Q) and number of communities (C)                     Blondel’s          Bl...
Newman’s (NG) Vs. Blondel’s (BN)• In terms of NMI: Normalized Mutual Information.         Comparing Top N communities     ...
Conclusions We’ve benchmarked 5 statistically and topologically distinct datasets    • Applying 2 community structure alg...
Thank you foryour attention :-)
Upcoming SlideShare
Loading in …5
×

Analysis of on line social networks (OSNs) represented as graphs- extraction of an approximation of community structure using sampling

424 views

Published on

Presentation celebrated in 9th International Conference on Modeling Decisions for Artificial (MDAI'12) at Girona, Spain.

Abstract. In this paper we benchmark two distinct algorithms for extracting community structure from social networks represented as graphs, considering how we can representatively sample an OSN graph while maintaining its com-munity structure. We also evaluate the extraction algorithms’ optimum value (modularity) for the number of communities using five well-known benchmark-ing datasets, two of which represent real online OSN data. Also we consider the assignment of the filtering and sampling criteria for each dataset. We find that the extraction algorithms work well for finding the major communities in the original and the sampled datasets. The quality of the results is measured using an NMI (Normalized Mutual Information) type metric to identify the grade of correspondence between the communities generated from the original data and those generated from the sampled data. We find that a representative sampling is possible which preserves the key community structures of an OSN graph, significantly reducing computational cost and also making the resulting graph structure easier to visualize. Finally, comparing the communities generated by each algorithm, we identify the grade of correspondence.

Published in: Technology
  • Be the first to comment

  • Be the first to like this

Analysis of on line social networks (OSNs) represented as graphs- extraction of an approximation of community structure using sampling

  1. 1. Modeling Decisions for Artificial Intelligence (MDAI) 2012, Girona, Spain
  2. 2. Introduction We present a benchmarking of two distinct algorithms forextracting community structure from On-line Social Networks(OSNs), considering how we can representatively sample an OSNgraph while maintaining its community structure. We do this by extracting the community structure from theoriginal and sampled versions of five well-known benchmarkingdatasets and comparing the results. • We assume there is NO a priori knowledge about the expected result. • A supervised sampling is performed.
  3. 3. Extraction of the community structureAlgorithm 1: Newman’s algorithm • Extracts the communities by successively dividing the graph into components, using Freeman’s betweenness centrality measure until modularity Q is maximized. • Modularity (Q): Is the measure used to quantify the quality of the community partitions ‘on the fly’. Usual range: [0.3 - 0.7]. • We have implemented it in Python (using NetworkX library).
  4. 4. Extraction of the community structureAlgorithm 2: Blondel’s method 1. The method looks for smaller communities by optimizing modularity locally. 2. Then it aggregates nodes of the same community and builds a new network whose nodes are communities. Steps 1 and 2 are repeated until modularity Q is maximized. • The default version was used from the Gephi graph processing software.
  5. 5. Filtering / Sampling process2-step process: In order to obtain a subset of a complete graph weapply a process consisting of filtering and sampling. • First step: Filtering (seed node selection). Consists of filtering the graph nodes based on their degree or their clustering coefficient. Filtering thresholds are user defined.  Goal: Identify hub nodes and dense regions of the graph. • Second step: Sampling. We apply a sampling at 1 hop to obtain all the neighbours connected to each seed node.  Goal: Maintain core community structure.
  6. 6. Datasets used Graph Statistics Karate Dolphins GrQc Enron Facebook#Nodes 34 62 5242 10630 31720#Edges 78 159 14496 164837 80592Avg. degree 4.59 5.13 5.530 31.013 5.081Clust. coef. 0.57 0.26 0.529 0.383 0.079Avg. path 2.408 3.356 6.049 3.160 6.432lengthDiameter 5 8 17 20 9
  7. 7. Sample criteria• Sampling requires user supervision• Sampling % desired: [ 10%-20% ]• Only large graphs datasets (3) are sampled Resulting sample Filter Value Sample size ArXiv-GrQc Degree ≥30 17.91% All neighbors Enron Clustering coef. =1 20.83% All neighbors Facebook Clustering coef. ≥0.5 10.75% All neighbors
  8. 8. Sampling statistics• Graph statistics for sampled Vs original datasets (sss / ooo) GrQc Enron Facebook Degree >= 30 Clust.Coef = 1 Clust.Coef >= 0.5 #Nodes 939 / 5242 2218 / 10630 3410 / 31720 #Edges 5715 / 14446 14912 / 164387 6561 / 80592 Avg. degree 12.17 / 5,53 12.315 / 31,013 3.848 / 5,081 Clust. coef. 0.698 / 0,529 0.761 / 0,383 0.632 / 0,079 Avg. path length 4.51 / 6,049 3.143 / 3,160 8.388 / 6,432 Diameter 10 / 17 7 / 20 27 / 9
  9. 9. Sampling statistics• Indicator: Clustering coefficient shows a common pattern, increasing in sampled datasets. This serves as an indicator that the core is preferentially included in the samples. GrQc Enron Facebook Degree >= 30 Clust.Coef = 1 Clust.Coef >= 0.5 #Nodes 939 / 5242 2218 / 10630 3410 / 31720 #Edges 5715 / 14446 14912 / 164387 6561 / 80592 Avg. degree 12.17 / 5,53 12.315 / 31,013 3.848 / 5,081 Clust. coef. 0.698 / 0,529 0.761 / 0,383 0.632 / 0,079 Avg. path length 4.51 / 6,049 3.143 / 3,160 8.388 / 6,432 Diameter 10 / 17 7 / 20 27 / 9
  10. 10. Empirical Tests and Results1. First, we evaluate Newman’s algorithm with the sampled datasets. Stop Original or Iteration Q Communities Sampled Karate 4 0.494 5 O Dolphins 5 0.591 6 O GrQc 56 0.777 57 S Enron 865 0.421 869 S Enron Early* 51 0.325 56 S Facebook 40 0.870 190 S
  11. 11. Empirical Tests and Results2. Blondel’s method allows us to extract the communities from theoriginal dataset, given it’s greater execution speed in comparison withNewman’s method. Original Sampled Q C Q C GrQc 0.856 390 0.789 11 Enron 0.491 43 0.560 68 Facebook 0.681 1105 0.519 33• How to compare nodes community matching?  NMI : Normalized Mutual Information
  12. 12. Normalized Mutual Information After labeling the communities, we match the nodes inside everycorresponding community in the sampled and original datasets. Purity: 100% means that all nodes in same communities are matched inboth datasets. We compare the Top N communities ( N =10 )o Handicap Newman’s and Blondel’s methods are stochastic and non-deterministic  Give slightly different results in each execution.
  13. 13. Normalized Mutual Information After labeling the communities, we match the nodes inside everycorresponding community in the sampled and original datasets. Purity: 100% means that all nodes in same communities are matched inboth datasets. We compare the Top N communities ( N =10 )o Handicap Newman’s and Blondel’s methods are stochastic and non-deterministic  Give slightly different results in each execution. NMI sampled NMI orig. Vs. NMI orig Vs. orig. Net loss Vs. sampled (B) sampled (A) (C) (C- A) GrQc 0.66559 0.82544 0.77301 0.10742 Enron 0.69069 0.86903 0.82012 0.12943 Facebook 0.58996 0.73249 0.69215 0.10219
  14. 14. Newman’s Vs. Blondel’s• In terms of modularity (Q) and number of communities (C) Blondel’s Blondel’s Newman’s Original Sampled Sampled Q C Q C Q C GrQc 0.856 390 0.789 11 0.777 57 Enron 0.491 43 0.560 68 0.325 56 Facebook 0.681 1105 0.519 33 0.870 190• The best modularity values are dataset dependent.
  15. 15. Newman’s Vs. Blondel’s• In terms of modularity (Q) and number of communities (C) Blondel’s Blondel’s Newman’s Original Sampled Sampled Q C Q C Q C GrQc 0.856 390 0.789 11 0.777 57 Enron 0.491 43 0.560 68 0.325 56 Facebook 0.681 1105 0.519 33 0.870 190• The methods may give distinct results in terms of the number ofcommunities and modularity values.
  16. 16. Newman’s (NG) Vs. Blondel’s (BN)• In terms of NMI: Normalized Mutual Information. Comparing Top N communities NMI BN Vs. NG NMI NG Vs. BN NMI orig Vs. orig. Net loss (A) (B) (C) (C- Avg (A,B))GrQc 0.69116 0.87243 0.77301 -0.00878Enron 0.31313 0.68796 0.82012 0.31958Enron Early 0.83437 0.44320 0.82012 0.18133Facebook 0.62056 0.54551 0.69215 0.10911• Results show significant differences between the assignment of thenodes between methods.
  17. 17. Conclusions We’ve benchmarked 5 statistically and topologically distinct datasets • Applying 2 community structure algorithms • Sampling original datasets Results indicate that is possible to identify the principal communitiesfor large complex datasets using sampling.  It maintains the key facets of the community structure of a dataset (NMI statistic shows high correspondence is maintained)  Significantly reduces of the dataset size (80-90%) However, a difference is found in the assignment of nodes tocommunities between different executions and methods, due to theirstochastic nature.
  18. 18. Thank you foryour attention :-)

×