Bridging the Gap between Community and Node
Representations: Graph Embedding via Community Detection
IEEE BigData 2019, Special Session on Information Granulation in Data Science
Artem Lutov, Dingqi Yang and Philippe Cudré-Mauroux
eXascale Infolab, University of Fribourg, Switzerland
https://bit.ly/daor-slides
https://github.com/eXascaleInfolab/daor
Problem
Graph embedding techniques project graph nodes onto a low-dimensional
vector space preserving the key structural properties of the graph (e.g.,
proximity between nodes).
Existing embedding techniques are hard to apply in practice:
● they rely on multiple parameters,
● they operate effectively in a single metric space only (e.g., one
produced with cosine similarity),
● they are computationally intensive.
Solution
DAOR is our embedding technique based on community detection
(i.e., clustering) that produces embeddings without any manual tuning:
● Parameter-free embedding method based on graph clustering
● Produces metric-space robust embeddings
● Embeddings are produced in near-linear runtime
Moreover, DAOR preserves both high- and low-order structural properties
of the graph and produces interpretable embeddings by design.
Preliminaries
The DAOR embedding method is based on DAOC clustering (BigData19,
bit.ly/daoc-slides), which granulates the graph. DAOC is parameter-free, has
near-linear runtime, and is hierarchical, overlapping, stable (both robust and
deterministic), capturing fine-grained (i.e., micro-scale) clusters.
DAOC clustering ← Meta-optimization function ← Generalized Modularity
Generalized Modularity: the optimal value of the resolution
parameter (Newman’16): 𝛾 = (ω_in − ω_out) / (ln ω_in − ln ω_out)
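Assuming Newman's (2016) result 𝛾 = (ω_in − ω_out) / (ln ω_in − ln ω_out) for planted-partition parameters ω_in (within-community edge propensity) and ω_out (between-community edge propensity), the optimal resolution can be computed directly; a minimal sketch:

```python
import math

def optimal_gamma(w_in: float, w_out: float) -> float:
    """Optimal resolution parameter of generalized modularity for a
    planted partition with edge propensities w_in > w_out > 0
    (Newman, 2016)."""
    return (w_in - w_out) / (math.log(w_in) - math.log(w_out))

# For w_in = 0.4, w_out = 0.1: gamma = 0.3 / ln(4) ≈ 0.216
gamma = optimal_gamma(0.4, 0.1)
```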
Method Outline
DAOR embedding method is an extension of DAOC clustering:
● The hierarchy of clusters is constructed by varying the resolution
parameter 𝛾 within the evaluated bounds.
● The resulting clusters are transformed into embeddings (the
required number of dimensions can optionally be specified):
a) features (salient clusters) are extracted from the clusters, and
b) embedding dimensions are formed from the features.
Contributions
DAOR - a parameter-free embedding method based on graph clustering:
● 𝛾 bound identification for the most fine-grained clusters
● Automatic identification of features (salient clusters)
● Formation of embedding dimensions from the features, with optional
constraining of their number
Preliminaries: Overlapping vs Multi-resolution
(Figure: 'Overlapping Clusters' panel - racing cars, blue cars, and their
overlap 'racing & blue cars'; 'Clusters on Various Resolutions' panel -
cars and bikes at a coarse resolution, jeeps and racing cars at a finer one.)
Hierarchical Multi-resolution Clustering: 𝛾 Bounds
The resolution bound 𝛾_min for the most coarse-grained clusters is
derived in the paper.
The resolution bound 𝛾_max for the most fine-grained clusters is inferred
from the resolution limit analysis for the marginal case of cluster
detectability, which also yields a rule of thumb for the maximal expected
number of clusters (see the paper).
For real-world networks modelled with sparse graphs (i.e., m ≤ n^{3/2}),
these bounds simplify further (see the paper).
Clusters Transformation into Node Embeddings
Feature Extraction from Clusters: Salient Clusters
The number of embedding dimensions d ≤ s (salient clusters, i.e.,
features) ≤ k (all clusters). Salient clusters are the t top-level clusters and
all nested clusters that a) have a non-decreasing density of links and
b) are more lightweight than their super-cluster by at least the factor r_w
(see the paper).
Example: C3 is not salient since it violates the density constraint;
C2 is not salient since it violates the weight constraint.
Constraining the Number of Dimensions
The number of dimensions d is bounded by t ≤ d ≤ s.
If the number of top-level clusters t cannot be controlled by the clustering
algorithm and t > d, then, following the “Rag Bag” constraint (Xmeasures,
BigComp19), the t − (d − 1) most lightweight clusters are grouped together.
If t − z outliers are present among the top-level clusters, the
embedding dimensions are formed from the min(z, d) (z < t) root-level
clusters and d − z ≥ 0 nested salient clusters.
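The “Rag Bag” grouping above can be sketched as follows; the cluster representation as (name, weight) pairs and the function name are illustrative, not DAOR's API:

```python
def ragbag_group(clusters, d):
    """Group top-level clusters into at most d dimensions: keep the
    d - 1 heaviest clusters on their own and merge the remaining
    t - (d - 1) most lightweight ones into a single 'rag bag'.
    `clusters` is a list of (name, weight) pairs (hypothetical format)."""
    t = len(clusters)
    if t <= d:
        return [[c] for c in clusters]
    ranked = sorted(clusters, key=lambda c: c[1], reverse=True)
    heavy = [[c] for c in ranked[:d - 1]]
    ragbag = ranked[d - 1:]  # the t - (d - 1) lightest clusters
    return heavy + [ragbag]

groups = ragbag_group([("a", 5), ("b", 3), ("c", 2), ("d", 1)], 3)
# → [[("a", 5)], [("b", 3)], [("c", 2), ("d", 1)]]
```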
Optional Bounding of the Number of Clusters
If a specific number of clusters is required on the top level (t ≳ d):
● The hierarchy generation is interrupted early if the number of
clusters at level i, |h_i|, reaches the required number d.
● The hierarchy generation is forced to continue until the number
of clusters reaches the required number d, even if the value of the
optimization function (∆Q) becomes negative.
Dimension Formation from Features
Each embedding dimension is formed from ≥ 1 salient cluster, so the
salient clusters determine the recommended number of dimensions.
The embedding vector v_i ∈ V of size d = |D| for each node #i is
produced by quantifying the belonging degree w_{i,D_j} of the node to
each dimension D_j.
Example with |D| = 2:
v_A = {½, ½}, v_B = {⅖, ⅗}, v_C = {0, 1}
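A toy sketch of this step, under the simplified assumption that a node's belonging degrees are its raw per-dimension membership weights normalized to sum to one (DAOR's exact weighting is defined in the paper); it reproduces vectors like those in the example:

```python
def embed(node_weights):
    """Turn a node's raw per-dimension membership weights into an
    embedding vector by normalizing them to sum to one (a simplified
    reading of the belonging degree w_{i,Dj})."""
    total = sum(node_weights)
    return [w / total for w in node_weights]

# Toy nodes resembling the slide's example with |D| = 2:
vA = embed([1, 1])  # → [0.5, 0.5]
vB = embed([2, 3])  # → [0.4, 0.6]
vC = embed([0, 4])  # → [0.0, 1.0]
```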
Dimension Interpretability
Dimensions are formed from (salient) clusters representing
ground-truth semantic categories, with the correspondence evaluated
using extrinsic quality metrics (the F1 measures family, GNMI and
Omega; Xmeasures, BigComp19).
Hence, it is possible to fetch only a subset of the dimensions
having some required semantics.
(Figure: produced clusters vs. ground-truth categories.)
Experimental Evaluation: Baselines
Baselines - ten state-of-the-art graph embedding techniques tuned for
each dataset:
a) Graph-sampling based: DeepWalk, Node2Vec, LINE and VERSE
b) Factorization-based: GraRep, HOPE and NetMF
c) Similarity-preserving hashing based: INH-MF, NetHash and
NodeSketch
Experimental Evaluation: Tasks & Datasets
Evaluation on the Node Classiication & Link prediction tasks on
the widely used datasets for graph embedding evaluation:
* YouTube is used only to evaluate the efficiency, since the ground-truth includes only 3% of
the graph (as opposed to a 100% coverage for the other graphs)
16
Classification Performance Using Kernel SVM
Link Prediction Performance
Robustness to the Metric Space
Efficiency: Learning Time (sec)
Conclusions
DAOR is our embedding technique based on clustering:
● the first method we are aware of that provides
embeddings for any input graph without any manual tuning,
● produces metric-space robust embeddings,
● several orders of magnitude more efficient, with competitive
performance on diverse tasks compared to the manually tuned best
state-of-the-art embedding methods.
In addition, the produced embeddings are interpretable by design.
Q&A
Artem Lutov <artem.lutov@unifr.ch>
https://github.com/eXascaleInfolab/daor
Supplementary Slides
Granular Computing (GrC) ⇔ Clustering
“GrC is a superset of the theory of fuzzy information granulation,
rough set theory and interval computations, and is a subset of
granular mathematics.” (L.A. Zadeh, 1997)
Network Community Detection is a special case of Graph Clustering,
which is a special case of Information Granulation.
Rag Bag
Elements with low relevance to the
categories (e.g., noise) should preferably be
assigned to the less homogeneous clusters
(macro-scale, low-resolution, coarse-grained,
or top-level clusters in a hierarchy).
MMG: Robustness via Micro-consensus
Modularity: Q = (1/2m) Σ_{i,j} [A_ij − k_i k_j / (2m)] δ(c_i, c_j)
Modularity gain: the change ∆Q from merging a pair of nodes/clusters
Mutual Maximal(⬦) Gain: a merge is applied only when its ∆Q is maximal
for both merge candidates (micro-consensus)
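As a reference point, standard modularity Q = (1/2m) Σ_{i,j} [A_ij − γ k_i k_j/(2m)] δ(c_i, c_j) can be computed directly from its definition for a small graph; the edge-list representation is illustrative:

```python
from collections import defaultdict

def modularity(edges, comm, gamma=1.0):
    """Generalized modularity of an undirected, unweighted graph given
    as an edge list, with a node -> community mapping `comm`."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    m2 = 2 * len(edges)  # 2m
    # Each intra-community edge contributes A_uv and A_vu:
    inside = sum(2 for u, v in edges if comm[u] == comm[v])
    # Null-model term over all ordered intra-community node pairs:
    expected = sum(deg[u] * deg[v] for u in deg for v in deg
                   if comm[u] == comm[v]) / m2
    return (inside - gamma * expected) / m2

# Two triangles joined by a single edge, clustered into the triangles:
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
comm = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
q = modularity(edges, comm)  # → 5/14 ≈ 0.357
```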
Overlaps Decomposition (OD): Determinism
Decomposition of a node of degree d = 3 into K = 3 fragments
(the OD constraints are listed in the paper).
Accuracy Evaluation for Clusterings
Matching clusterings (unordered sets of elements), even when the
elements have a single membership, may yield multiple best matches
=> strict cluster labeling is not always possible, and is undesirable.
Many dedicated accuracy metrics have been designed, but few of them
are applicable to elements with multiple membership.
Our Requirements for the Accuracy Metrics
● Applicable to elements having multiple membership
● Applicable to large datasets: ideally O(N), runtime up to O(N²)
Families with accuracy metrics satisfying our requirements:
● Pair-counting based metrics: Omega Index [Collins, 1988]
● Cluster-matching based metrics: Average F1 score [Yang, 2013]
● Information-theory based metrics: Generalized NMI [Esquivel, 2012]
Problem: interpretability of the accuracy values and the metric selection.
Omega Index (Fuzzy ARI) [Collins,1988]
The Omega Index (𝛀) counts the pairs of elements occurring in exactly
the same number of produced clusters (C) as ground-truth categories (C'),
adjusted for the expected number of such pairs:
𝛀 = (p_o − p_e) / (1 − p_e),
where p_o is the observed fraction of element pairs sharing the same
co-occurrence count in C and C', and p_e is its expectation by chance.
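A minimal (deliberately inefficient, pairwise) sketch of the Omega Index for overlapping clusterings, following Collins' chance-adjusted pair agreement; the function signature is illustrative:

```python
from itertools import combinations
from collections import Counter

def omega_index(cl1, cl2, elements):
    """Omega Index [Collins, 1988]: agreement on the number of shared
    clusters per element pair, adjusted for chance (fuzzy ARI).
    cl1, cl2 are lists of sets (possibly overlapping clusters)."""
    def pair_counts(clustering):
        cnt = {}
        for a, b in combinations(sorted(elements), 2):
            cnt[(a, b)] = sum(1 for c in clustering if a in c and b in c)
        return cnt
    c1, c2 = pair_counts(cl1), pair_counts(cl2)
    n = len(c1)
    observed = sum(1 for p in c1 if c1[p] == c2[p]) / n
    h1, h2 = Counter(c1.values()), Counter(c2.values())
    expected = sum(h1[j] * h2[j] for j in h1) / n ** 2
    return (observed - expected) / (1 - expected) if expected != 1 else 1.0

cl = [{1, 2}, {2, 3}]
assert omega_index(cl, cl, {1, 2, 3}) == 1.0  # identical clusterings
```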
Soft Omega Index
The Soft Omega Index takes into account pairs present in different
numbers of clusters by normalizing the smaller number of occurrences of
each pair of elements in all clusters of one clustering by the larger
number of occurrences in the other clustering.
Average F1 Score [Yang,2013]
F1a is defined as the average of the weighted F1 scores of a) the best
match of the ground-truth clusters to the formed clusters and b) the best
match of the formed clusters to the ground-truth clusters
(F1 is the F1-measure [Rijsbergen, 1974]).
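A compact sketch of the average F1 score, assuming a cluster-size-weighted average of per-cluster best-match F1 in both directions (the exact weighting is in the paper); swapping the arithmetic mean for a harmonic one yields the F1h variant discussed below:

```python
def f1(a, b):
    """F1 overlap score between two clusters given as sets."""
    tp = len(a & b)
    if tp == 0:
        return 0.0
    p, r = tp / len(b), tp / len(a)
    return 2 * p * r / (p + r)

def avg_f1(cs1, cs2, mean=lambda x, y: (x + y) / 2):
    """Average F1 (F1a): size-weighted best-match F1 averaged in both
    directions; pass a harmonic `mean` to obtain an F1h-style score."""
    def side(src, dst):
        w = sum(len(c) for c in src)
        return sum(len(c) * max(f1(c, d) for d in dst) for c in src) / w
    return mean(side(cs1, cs2), side(cs2, cs1))

identical = [{1, 2}, {3, 4}]
assert avg_f1(identical, identical) == 1.0
```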
Mean F1 Scores: F1h
F1h uses the harmonic instead of the arithmetic mean to address F1a ≳ 0.5
for clusters produced from all combinations of the nodes: F1_{C',C} = 1,
since for each category there exists an exactly matching cluster, while
F1_{C,C'} → 0, since the majority of the clusters have low similarity to
the categories (the contribution m of each node is defined in the paper).
Mean F1 Scores: F1p
F1p is the harmonic mean of the average, over each clustering, of the
best local probabilities (f1 → p_prob) for each cluster.
Indexing Technique for Mean F1 Score
Purpose: O(N(|C’| + |C|)) → O(N)
Data structures:
Cluster:
  mbs      # member nodes, const
  cont     # members' contribution, const
  counter  # contributions counter
Counter:
  orig  # originating cluster
  ctr   # raw counter, <= |mbs|

for a in g2.mbs:
    for c in cls(C.a):
        cc = c.counter
        if cc.orig != g2:
            cc.ctr = 0; cc.orig = g2
        cc.ctr += 1 / |C.a| if ovp else 1
        fmatch(cc.ctr, c.cont, g2.cont)
Xmeasures MF1 vs ParallelComMetric F1-Measure
SNAP DBLP (Nodes: 317,080; Edges: 1,049,866;
Clusters: 13,477) ground-truth vs. the clustering
produced by Louvain.
Evaluation on an Intel Xeon E5-2620
(32 logical CPUs) @ 2.10 GHz; apps compiled
with GCC 5.4 and the -O3 flag.
Generalized Normalized Mutual Information (NMI)
NMI is the Mutual Information I(C’:C) normalized by the maximum or mean
value of the unconditional entropy H of the clusterings C’ and C, e.g.:
NMI = I(C’:C) / max(H(C’), H(C))
GNMI [Esquivel, 2012] uses a stochastic process to compute MI for
overlapping clusterings.
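For hard partitions, NMI can be computed directly from the definition above (GNMI extends it to overlapping clusterings via stochastic MI estimation); a minimal sketch over label sequences:

```python
import math
from collections import Counter

def nmi(labels1, labels2):
    """NMI = I(C':C) / max(H(C'), H(C)) for two hard partitions given
    as equal-length label sequences."""
    n = len(labels1)
    p1, p2 = Counter(labels1), Counter(labels2)
    joint = Counter(zip(labels1, labels2))
    mi = sum(c / n * math.log(c * n / (p1[a] * p2[b]))
             for (a, b), c in joint.items())
    h = lambda p: -sum(c / n * math.log(c / n) for c in p.values())
    hmax = max(h(p1), h(p2))
    return mi / hmax if hmax > 0 else 1.0

# Identical partitions up to relabeling score 1; independent ones score 0:
assert abs(nmi([0, 0, 1, 1], [1, 1, 0, 0]) - 1.0) < 1e-12
```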
Metrics Applicability
● (Soft) 𝛀: values are not affected by the number of clusters; however,
it is O(N²) and performs poorly for multi-resolution clusterings.
● MF1: O(N), and F1p satisfies more formal constraints than the others;
however, it evaluates the best-matching clusters only (an unfair
advantage for the larger clusters).
● GNMI: highly parallelized, evaluates full matches, and is well-grounded
theoretically; however, it is biased to the number of clusters, yields
non-deterministic results, and convergence is not guaranteed in the
stochastic implementation.
Preliminaries: Modularity & Optimal Resolution
Modularity: Q = (1/2m) Σ_{i,j} [A_ij − k_i k_j / (2m)] δ(c_i, c_j)
Modularity gain (in Louvain): the change ∆Q from moving a node into a
neighboring cluster
Generalized Modularity: the optimal value of the resolution
parameter: 𝛾 = (ω_in − ω_out) / (ln ω_in − ln ω_out)
Preliminaries: DAOC & Louvain Properties
Human perception-adapted taxonomy construction
for large evolving networks by incremental clustering requires:
● Stable
● Fully-automatic
● Browsable
● Large
● Multi-viewpoint
● Narrow (7 ± 2 rule)
DAOC (vs. Louvain) is:
● Robust + deterministic
● Parameter-free
● Hierarchical
● Near-linear runtime
● Overlapping
● Fine-grained
Adjusting primitives for graph : SHORT REPORT / NOTES
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTESAdjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfEnhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdfUnleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 

DAOR - Bridging the Gap between Community and Node Representations: Graph Embedding via Community Detection

  • 5. Method Outline
The DAOR embedding method is an extension of DAOC clustering:
● The hierarchy of clusters is constructed by varying the resolution parameter 𝛾 within the evaluated bounds.
● The resulting clusters are transformed into embeddings (the required number of dimensions can optionally be specified):
a) features (salient clusters) are extracted from the clusters, and
b) embedding dimensions are formed from the features.
  • 6. Contributions
DAOR is a parameter-free embedding method based on graph clustering:
● 𝛾 bound identification for the most fine-grained clusters
● Automatic identification of features (salient clusters)
● Formation of the embedding dimensions from the features, with optional constraining of their number
  • 7. Preliminaries: Overlapping vs Multi-resolution
[Figure: overlapping clusters (e.g., "racing & blue cars" shared by the "racing cars" and "blue cars" clusters) vs clusters on various resolutions ("cars" and "bikes" on the coarse level; "racing cars", "blue cars" and "jeeps" on a finer one).]
  • 8. Hierarchical Multi-resolution Clustering: 𝛾 Bounds
The resolution bound 𝛾min for the most coarse-grained clusters, the bound for the most fine-grained clusters (inferred from the resolution limit analysis for the marginal case of cluster detectability) and a rule of thumb for the maximal expected number of clusters are derived in the paper. For real-world networks modelled with sparse graphs (i.e., m ≤ n^(3/2)), a further simplified bound is given in the paper.
  • 9. Clusters Transformation into Node Embeddings
  • 10. Feature Extraction from Clusters: Salient Clusters
The number of embedding dimensions d ≤ s (the number of salient clusters, i.e. features) ≤ k (the number of all clusters). The salient clusters are the top-level clusters t and all nested clusters that a) have a non-decreasing density of links and b) are more lightweight than their super-cluster by the factor rw (see the paper).
[Figure example: C3 is not salient, violating the density constraint; C2 is not salient, violating the weight constraint.]
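A sketch of this salience filter under the two constraints above. The `Cluster` structure, the exact density/weight definitions, the traversal order and the default `rw` value are simplifying assumptions; see the paper for the actual criteria:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical minimal cluster node; the actual DAOC hierarchy is richer.
@dataclass
class Cluster:
    name: str
    density: float   # internal link density
    weight: float    # total cluster weight
    subs: List["Cluster"] = field(default_factory=list)

def salient_clusters(tops, rw=0.9):
    """Collect salient clusters: all top-level clusters, plus nested clusters
    whose link density does not decrease w.r.t. their super-cluster and whose
    weight is at most rw times the super-cluster weight."""
    res = []
    def visit(c, is_top):
        if is_top:
            res.append(c)
        for s in c.subs:
            if s.density >= c.density and s.weight <= rw * c.weight:
                res.append(s)
            visit(s, False)
    for t in tops:
        visit(t, True)
    return res
```

On the slide's example, a sub-cluster violating either the density or the weight constraint (like C3 and C2) is simply skipped, while its descendants are still inspected.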
  • 11. Constraining the Number of Dimensions
The number of dimensions d is bounded by t ≤ d ≤ s. If the number of top-level clusters t cannot be controlled by the clustering algorithm and t > d, then, according to the “Rag Bag” constraint (Xmeasures, BigComp19), the t − (d − 1) most lightweight clusters are grouped together. If t − z outliers are present among the top-level clusters, the embedding dimensions are formed from the min(z, d), z < t, root-level clusters and d − z ≥ 0 nested salient clusters.
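A toy sketch of the rag-bag grouping, operating on cluster weights only. `constrain_dimensions` is a hypothetical name, and the outlier handling described above is omitted:

```python
def constrain_dimensions(weights, d):
    """Given the weights of the t top-level clusters, keep the d - 1 heaviest
    clusters as individual dimensions and, if t > d, group the t - (d - 1)
    most lightweight ones into a single "rag bag" dimension."""
    if len(weights) <= d:
        return sorted(weights, reverse=True)
    heavy = sorted(weights, reverse=True)
    ragbag = sum(heavy[d - 1:])  # combined weight of the lightest clusters
    return heavy[:d - 1] + [ragbag]
```

For example, five top-level clusters with weights [5, 4, 3, 2, 1] constrained to d = 3 dimensions yield [5, 4, 6], where 6 is the rag-bag weight.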
  • 12. Optional Bounding of the Number of Clusters
If a specific number of clusters is required on the top level (t ≳ d):
● The hierarchy generation is interrupted early if the number of clusters at level i, |hᵢ|, reaches the required number d.
● The hierarchy generation is forced to continue until the number of clusters reaches the required number d, even if the value of the optimization function (∆Q) becomes negative.
  • 13. Dimension Formation from Features
Each embedding dimension is formed from ≥ 1 salient cluster, so the number of salient clusters constitutes the recommended number of dimensions. The embedding vector vᵢ ∈ V of size d = |D| for each node #i is produced by quantifying the belonging degree w(i, Dⱼ) of the node to each dimension Dⱼ.
Example from the slide: |D| = 2: vA = {½, ½}, vB = {⅖, ⅗}, vC = {0, 1}.
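A minimal sketch of this quantification: the belonging degrees themselves come from the cluster structure, so here the raw per-dimension weights (and the dimension names D1, D2) are assumed as given, and only the normalization reproducing the slide's example is shown:

```python
def node_vector(belonging, dims):
    """Build the embedding vector of one node from its raw belonging
    weights towards each dimension, normalized to sum to 1."""
    total = sum(belonging.get(dim, 0.0) for dim in dims)
    if total == 0.0:
        return [0.0] * len(dims)
    return [belonging.get(dim, 0.0) / total for dim in dims]
```

With raw weights {D1: 2, D2: 3}, for instance, the vector becomes {⅖, ⅗}, matching vB above.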
  • 14. Dimension Interpretability
Dimensions are taken from (salient) clusters representing ground-truth semantic categories, with performance evaluated using extrinsic quality metrics (the F1 measures family, GNMI and Omega from Xmeasures, BigComp19). So, it is possible to fetch only a subset of the dimensions having some required semantics.
  • 15. Experimental Evaluation: Baselines
Baselines: ten state-of-the-art graph embedding techniques, tuned for each dataset:
a) graph-sampling based: DeepWalk, Node2Vec, LINE and VERSE
b) factorization-based: GraRep, HOPE and NetMF
c) similarity-preserving hashing based: INH-MF, NetHash and NodeSketch
  • 16. Experimental Evaluation: Tasks & Datasets
Evaluation on the node classification and link prediction tasks on datasets widely used for graph embedding evaluation.
* YouTube is used only to evaluate the efficiency, since the ground-truth includes only 3% of the graph (as opposed to a 100% coverage for the other graphs).
  • 19. Robustness to the Metric Space
  • 21. Conclusions
DAOR is our embedding technique based on clustering:
● the first method we are aware of that produces embeddings for any input graph without any manual tuning,
● produces metric-space robust embeddings,
● is several orders of magnitude more efficient than the manually tuned best state-of-the-art embedding methods, while having competitive performance on diverse tasks.
In addition, the produced embeddings are interpretable by design.
  • 24. Granular Computing (GrC) ⇔ Clustering
“GrC is a superset of the theory of fuzzy information granulation, rough set theory and interval computations, and is a subset of granular mathematics.” (L.A. Zadeh, 1997)
Network Community Detection is a special case of Graph Clustering, which is a special case of Information Granulation.
  • 25. Rag Bag
Elements with low relevance to the categories (e.g., noise) should preferably be assigned to the less homogeneous clusters (macro-scale, low-resolution, coarse-grained or top-level clusters in a hierarchy).
  • 26. MMG: Robustness via Micro-consensus
MMG performs a merge only when the modularity gain ∆Q is maximal for both merge candidates mutually.
[The slide shows the formulas for Modularity, the Modularity gain and the Mutual Maximal(⬦) Gain.]
  • 27. Overlaps Decomposition (OD): Determinism
Decomposition of a node of degree d = 3 into K = 3 fragments.
[The slide shows the decomposition figure and the OD constraints.]
  • 28. Accuracy Evaluation for Clusterings
Matching the clusterings (unordered sets of elements), even with elements having a single membership, may yield multiple best matches => strict labeling of the clusters is not always possible and can be undesirable. Many dedicated accuracy metrics have been designed, but few of them are applicable to elements with multiple membership.
  • 29. Our Requirements for the Accuracy Metrics
● Applicable to elements having multiple membership
● Applicable to large datasets: ideally O(N) runtime, up to O(N²)
Families with accuracy metrics satisfying our requirements:
● Pair counting based metrics: Omega Index [Collins, 1988]
● Cluster matching based metrics: Average F1 score [Yang, 2013]
● Information theory based metrics: Generalized NMI [Esquivel, 2012]
Problem: interpretability of the accuracy values and selection of the metric.
  • 30. Omega Index (Fuzzy ARI) [Collins, 1988]
The Omega Index (𝛀) counts the number of pairs of elements occurring in exactly the same number of clusters as in the number of categories, adjusted for the expected number of such pairs (C′ denotes the ground-truth categories, C the produced clusters):
𝛀 = (𝛀u − 𝛀e) / (1 − 𝛀e), where 𝛀u is the observed fraction of element pairs co-occurring in exactly the same number of clusters in C as in C′, and 𝛀e is the expected value of that fraction under chance.
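A straightforward pairs-based reference implementation of the crisp Omega Index may look as follows. `omega_index` is a hypothetical name and this O(N²) sketch is for illustration only; the production Xmeasures implementation is far more optimized:

```python
from itertools import combinations
from collections import Counter

def omega_index(cov1, cov2):
    """Omega Index of two covers given as node -> set of cluster ids:
    the chance-adjusted fraction of node pairs co-occurring in exactly
    the same number of clusters in both covers."""
    nodes = sorted(set(cov1) | set(cov2))
    pairs = list(combinations(nodes, 2))
    P = len(pairs)
    t1, t2 = Counter(), Counter()  # j -> number of pairs sharing j clusters
    agree = 0
    for a, b in pairs:
        j1 = len(cov1.get(a, set()) & cov1.get(b, set()))
        j2 = len(cov2.get(a, set()) & cov2.get(b, set()))
        t1[j1] += 1
        t2[j2] += 1
        agree += j1 == j2
    obs = agree / P                                    # observed fraction
    exp = sum(t1[j] * t2[j] for j in t1) / P ** 2      # expected fraction
    return (obs - exp) / (1 - exp) if exp != 1 else 1.0
```

Identical covers score 1, and covers that agree less often than chance score below 0, as for ARI.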
  • 31. Soft Omega Index
The Soft Omega Index also takes into account pairs present in a different number of clusters, by normalizing the smaller number of occurrences of each pair of elements in all clusters of one clustering by the larger number of occurrences in the other clustering.
  • 32. Average F1 Score [Yang, 2013]
F1a is defined as the average of the weighted F1 scores of a) the best-matching ground-truth clusters to the formed clusters and b) the best-matching formed clusters to the ground-truth clusters:
F1a(C′, C) = ½ (F1(C′, C) + F1(C, C′)), where F1(X, Y) averages the best-matching F1 score of each cluster of X against the clusters of Y, and F1 is the F1-measure [Rijsbergen, 1974].
  • 33. Mean F1 Scores: F1h
F1h uses the harmonic instead of the arithmetic mean to address F1a ≳ 0.5 for the clusters produced from all combinations of the nodes (F1(C′, C) = 1, since for each category there exists an exactly matching cluster; F1(C, C′) → 0, since the majority of the clusters have low similarity to the categories):
F1h(C′, C) = 2 · F1(C′, C) · F1(C, C′) / (F1(C′, C) + F1(C, C′))
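The contrast between F1a and F1h can be sketched as follows. This is an unweighted variant for brevity, with `f1a`/`f1h` as hypothetical names; it shows how the harmonic mean penalizes a one-sided match more than the arithmetic mean does:

```python
def f1(c1, c2):
    """F1-measure of two clusters (sets of elements)."""
    tp = len(c1 & c2)
    return 2 * tp / (len(c1) + len(c2)) if c1 or c2 else 0.0

def best_avg_f1(cs1, cs2):
    """Average over cs1 of the best-matching F1 score against cs2."""
    return sum(max(f1(c, g) for g in cs2) for c in cs1) / len(cs1)

def f1a(cats, clusters):
    """Arithmetic mean of the two directional average F1 scores."""
    a, b = best_avg_f1(cats, clusters), best_avg_f1(clusters, cats)
    return (a + b) / 2

def f1h(cats, clusters):
    """Harmonic mean of the two directional average F1 scores."""
    a, b = best_avg_f1(cats, clusters), best_avg_f1(clusters, cats)
    return 2 * a * b / (a + b) if a + b else 0.0
```

For two categories {1,2} and {3,4} against the clustering formed from all node pairs, F1(C′, C) = 1 but F1(C, C′) = ⅔, so F1a = ⅚ while F1h = 0.8.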
  • 34. Mean F1 Scores: F1p
F1p is the harmonic mean of the average, over each clustering, of the best local probabilities (f1 ➞ pprob) for each cluster.
  • 35. Indexing Technique for Mean F1 Score
Purpose: O(N(|C′| + |C|)) ➞ O(N).
Each Cluster holds: mbs (member nodes, const), cont (members contribution, const) and counter (contributions counter). Each Counter holds: orig (originating cluster) and ctr (raw counter, ≤ mbs).
for a in g2.mbs:
    for c in cls(C.a):
        cc = c.counter
        if cc.orig != g2:
            cc.ctr = 0; cc.orig = g2
        cc.ctr += 1 / |C.a| if ovp else 1
        fmatch(cc.ctr, c.cont, g2.cont)
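The lazy counter reset behind this indexing can be rendered in Python as follows. Instead of zeroing every cluster's match counter for each ground-truth cluster g2, each counter remembers which cluster it was last used for (orig) and is reset on first touch. This is an illustrative sketch of the idea, not the actual Xmeasures code; overlap weighting (ovp) is omitted and `best_f1_per_category` is a hypothetical name:

```python
def best_f1_per_category(categories, clusters_of):
    """categories: id -> set of nodes; clusters_of: node -> list of produced
    cluster node-sets (shared objects). Returns id -> best F1 of that
    category against any produced cluster, in one pass over the members."""
    state = {}  # id(cluster) -> [orig category, raw counter]
    best = {}
    for gid, g2 in categories.items():
        best[gid] = 0.0
        for a in g2:                       # for a in g2.mbs
            for c in clusters_of[a]:       # for c in cls(C.a)
                cc = state.setdefault(id(c), [None, 0])
                if cc[0] != gid:           # lazy reset instead of bulk zeroing
                    cc[0], cc[1] = gid, 0
                cc[1] += 1                 # raw counter, <= |c|
                score = 2 * cc[1] / (len(c) + len(g2))   # fmatch(...)
                if score > best[gid]:
                    best[gid] = score
    return best
```

Since the F1 of a cluster against g2 only grows as more shared members are counted, the running maximum ends up equal to the best full-match F1, without ever iterating over non-matching clusters.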
  • 36. Xmeasures MF1 vs ParallelComMetric F1-Measure
SNAP DBLP (nodes: 317,080; edges: 1,049,866; clusters: 13,477) ground-truth vs the clustering produced by Louvain. Evaluation on an Intel Xeon E5-2620 (32 logical CPUs) @ 2.10 GHz; the apps were compiled using GCC 5.4 with the -O3 flag.
  • 37. Generalized Normalized Mutual Information (NMI)
NMI is the Mutual Information I(C′:C) normalized by the max or the mean value of the unconditional entropy H of the clusterings C′ and C, e.g. NMI_max = I(C′:C) / max(H(C′), H(C)). GNMI [Esquivel, 2012] uses a stochastic process to compute MI.
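As an illustration, a minimal NMI implementation for hard (non-overlapping) partitions; GNMI's stochastic generalization to overlapping clusters is not sketched here, and `nmi_max` is a hypothetical name:

```python
from math import log
from collections import Counter

def nmi_max(part1, part2):
    """NMI of two hard partitions (node -> cluster label), normalized by
    the maximum of the two unconditional entropies."""
    n = len(part1)
    c1, c2 = Counter(part1.values()), Counter(part2.values())
    joint = Counter((part1[v], part2[v]) for v in part1)
    h1 = -sum(c / n * log(c / n) for c in c1.values())
    h2 = -sum(c / n * log(c / n) for c in c2.values())
    mi = sum(p / n * log((p / n) / ((c1[a] / n) * (c2[b] / n)))
             for (a, b), p in joint.items())
    hmax = max(h1, h2)
    return mi / hmax if hmax else 1.0
```

Identical partitions score 1, statistically independent ones score 0.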
  • 38. Metrics Applicability
(Soft) 𝛀: O(N²) and performs poorly for multi-resolution clusterings; however, its values are not affected by the number of clusters.
MF1: evaluates the best-matching clusters only (an unfair advantage for the larger clusters); however, it is O(N) and F1p satisfies more formal constraints than the others.
GNMI: biased to the number of clusters, non-deterministic results, and convergence is not guaranteed in the stochastic implementation; however, it is highly parallelized, evaluates full matches and is well-grounded theoretically.
  • 39. Preliminaries: Modularity & Optimal Resolution
Modularity: Q = 1/(2m) · Σij [Aij − ki·kj/(2m)] · δ(ci, cj)
Modularity gain (in Louvain), when moving node i into community c: ∆Q = [(Σin + ki,in)/(2m) − ((Σtot + ki)/(2m))²] − [Σin/(2m) − (Σtot/(2m))² − (ki/(2m))²]
Generalized Modularity: Q(𝛾) = 1/(2m) · Σij [Aij − 𝛾·ki·kj/(2m)] · δ(ci, cj)
Optimal value of the resolution parameter (Newman’16): 𝛾 = (ωin − ωout) / (ln ωin − ln ωout)
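The community-summed form of generalized modularity, Q(𝛾) = Σc [ec/m − 𝛾·(degc/(2m))²], is equivalent to the standard pairwise definition and easy to check on a toy graph. An illustrative sketch, not the DAOC implementation:

```python
def modularity(edges, communities, gamma=1.0):
    """Generalized modularity Q(gamma) of an undirected unweighted graph:
    sum over communities of (e_c / m - gamma * (deg_c / (2m))^2), where
    e_c is the number of intra-community edges and deg_c the total degree."""
    m = len(edges)
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    q = 0.0
    for com in communities:
        e_c = sum(1 for u, v in edges if u in com and v in com)
        d_c = sum(deg[u] for u in com)
        q += e_c / m - gamma * (d_c / (2 * m)) ** 2
    return q
```

For two triangles joined by a single bridge edge and split into their natural communities, Q(1) = 5/14; raising 𝛾 penalizes large communities and thus drives the optimum towards finer-grained clusters.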
  • 40. Preliminaries: DAOC & Louvain Properties
Human perception-adapted taxonomy construction for large evolving networks by incremental clustering requires the clustering to be:
● Stable ● Fully-automatic ● Browsable ● Large ● Multi-viewpoint ● Narrow (7 ± 2 rule)
The corresponding properties of the DAOC (vs Louvain) clustering:
● Robust + deterministic ● Parameter-free ● Hierarchical ● Near-linear runtime ● Overlapping ● Fine-grained