Bridging the Gap between Community and Node
Representations: Graph Embedding via Community Detection
IEEE BigData 2019, Special Session on Information Granulation in Data Science
Artem Lutov, Dingqi Yang and Philippe Cudré-Mauroux
eXascale Infolab, University of Fribourg, Switzerland
https://bit.ly/daor-slides
https://github.com/eXascaleInfolab/daor
Problem
Graph embedding techniques project graph nodes onto a low-dimensional
vector space preserving the key structural properties of the graph (e.g.,
proximity between nodes).
Existing embedding techniques are hard to apply in practice:
● they rely on multiple parameters,
● they operate effectively in a single metric space only (e.g., one
produced with cosine similarity),
● they are computationally intensive.
Solution
DAOR is our embedding technique based on community detection
(i.e., clustering) that produces embeddings without any manual tuning:
● Parameter-free embedding method based on graph clustering
● Produces metric-space robust embeddings
● Embeddings are produced in near-linear runtime
Moreover, DAOR preserves both high- and low-order structural properties
of the graph and produces interpretable embeddings by design.
Preliminaries
The DAOR embedding method is based on DAOC clustering (BigData19,
bit.ly/daoc-slides), which granulates the graph. DAOC is parameter-free, has
near-linear runtime, and is hierarchical, overlapping, stable (both robust and
deterministic), capturing fine-grained (i.e., micro-scale) clusters.
DAOC clustering ← Meta-optimization function ← Generalized Modularity
Generalized Modularity: the optimal value of the resolution
parameter (Newman’16): 𝛾 = (ω_in − ω_out) / (ln ω_in − ln ω_out)
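Assuming Newman's (2016) result 𝛾 = (ω_in − ω_out) / (ln ω_in − ln ω_out) for planted-partition parameters ω_in (within-community edge propensity) and ω_out (between-community edge propensity), the optimal resolution can be computed directly; a minimal sketch:

```python
import math

def optimal_gamma(w_in: float, w_out: float) -> float:
    """Optimal resolution parameter of generalized modularity for a
    planted partition with edge propensities w_in > w_out > 0
    (Newman, 2016)."""
    return (w_in - w_out) / (math.log(w_in) - math.log(w_out))

# For w_in = 0.4, w_out = 0.1: gamma = 0.3 / ln(4) ≈ 0.216
gamma = optimal_gamma(0.4, 0.1)
```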
Method Outline
DAOR embedding method is an extension of DAOC clustering:
● The hierarchy of clusters is constructed by varying the resolution
parameter 𝛾 within the evaluated bounds.
● The resulting clusters are transformed into embeddings (the
required number of dimensions can optionally be specified):
a) features (salient clusters) are extracted from the clusters, and
b) embedding dimensions are formed from the features.
Contributions
DAOR - a parameter-free embedding method based on graph clustering:
● 𝛾 bound identification for the most fine-grained clusters
● Automatic identification of features (salient clusters)
● Formation of embedding dimensions from the features, with optional
constraining of their number
Preliminaries: Overlapping vs Multi-resolution
(Figure: 'Overlapping Clusters' panel - racing cars, blue cars, and their
overlap 'racing & blue cars'; 'Clusters on Various Resolutions' panel -
cars and bikes at a coarse resolution, jeeps and racing cars at a finer one.)
Hierarchical Multi-resolution Clustering: 𝛾 Bounds
The resolution bound 𝛾_min for the most coarse-grained clusters is
derived in the paper.
The resolution bound 𝛾_max for the most fine-grained clusters is inferred
from the resolution limit analysis for the marginal case of cluster
detectability, which also yields a rule of thumb for the maximal expected
number of clusters (see the paper).
For real-world networks modelled with sparse graphs (i.e., m ≤ n^{3/2}),
these bounds simplify further (see the paper).
Clusters Transformation into Node Embeddings
Feature Extraction from Clusters: Salient Clusters
The number of embedding dimensions d ≤ s (salient clusters, i.e.,
features) ≤ k (all clusters). Salient clusters are the t top-level clusters and
all nested clusters that a) have a non-decreasing density of links and
b) are more lightweight than their super-cluster by at least the factor r_w
(see the paper).
Example: C3 is not salient since it violates the density constraint;
C2 is not salient since it violates the weight constraint.
Constraining the Number of Dimensions
The number of dimensions d is bounded by t ≤ d ≤ s.
If the number of top-level clusters t cannot be controlled by the clustering
algorithm and t > d, then, following the “Rag Bag” constraint (Xmeasures,
BigComp19), the t − (d − 1) most lightweight clusters are grouped together.
If t − z outliers are present among the top-level clusters, the
embedding dimensions are formed from the min(z, d) (z < t) root-level
clusters and d − z ≥ 0 nested salient clusters.
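The “Rag Bag” grouping above can be sketched as follows; the cluster representation as (name, weight) pairs and the function name are illustrative, not DAOR's API:

```python
def ragbag_group(clusters, d):
    """Group top-level clusters into at most d dimensions: keep the
    d - 1 heaviest clusters on their own and merge the remaining
    t - (d - 1) most lightweight ones into a single 'rag bag'.
    `clusters` is a list of (name, weight) pairs (hypothetical format)."""
    t = len(clusters)
    if t <= d:
        return [[c] for c in clusters]
    ranked = sorted(clusters, key=lambda c: c[1], reverse=True)
    heavy = [[c] for c in ranked[:d - 1]]
    ragbag = ranked[d - 1:]  # the t - (d - 1) lightest clusters
    return heavy + [ragbag]

groups = ragbag_group([("a", 5), ("b", 3), ("c", 2), ("d", 1)], 3)
# → [[("a", 5)], [("b", 3)], [("c", 2), ("d", 1)]]
```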
Optional Bounding of the Number of Clusters
If a specific number of clusters is required on the top level (t ≳ d):
● The hierarchy generation is interrupted early if the number of
clusters at level i, |h_i|, reaches the required number d.
● The hierarchy generation is forced to continue until the number
of clusters reaches the required number d, even if the value of the
optimization function (∆Q) becomes negative.
Dimension Formation from Features
Each embedding dimension is formed from ≥ 1 salient cluster, so the
salient clusters determine the recommended number of dimensions.
The embedding vector v_i ∈ V of size d = |D| for each node #i is
produced by quantifying the belonging degree w_{i,D_j} of the node to
each dimension D_j.
Example with |D| = 2:
v_A = {½, ½}, v_B = {⅖, ⅗}, v_C = {0, 1}
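A toy sketch of this step, under the simplified assumption that a node's belonging degrees are its raw per-dimension membership weights normalized to sum to one (DAOR's exact weighting is defined in the paper); it reproduces vectors like those in the example:

```python
def embed(node_weights):
    """Turn a node's raw per-dimension membership weights into an
    embedding vector by normalizing them to sum to one (a simplified
    reading of the belonging degree w_{i,Dj})."""
    total = sum(node_weights)
    return [w / total for w in node_weights]

# Toy nodes resembling the slide's example with |D| = 2:
vA = embed([1, 1])  # → [0.5, 0.5]
vB = embed([2, 3])  # → [0.4, 0.6]
vC = embed([0, 4])  # → [0.0, 1.0]
```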
Dimension Interpretability
Dimensions are formed from (salient) clusters representing
ground-truth semantic categories, with the correspondence evaluated
using extrinsic quality metrics (the F1 measures family, GNMI and
Omega; Xmeasures, BigComp19).
Hence, it is possible to fetch only a subset of the dimensions
having some required semantics.
(Figure: produced clusters vs. ground-truth categories.)
Experimental Evaluation: Baselines
Baselines - ten state-of-the-art graph embedding techniques tuned for
each dataset:
a) Graph-sampling based: DeepWalk, Node2Vec, LINE and VERSE
b) Factorization-based: GraRep, HOPE and NetMF
c) Similarity-preserving hashing based: INH-MF, NetHash and
NodeSketch
Experimental Evaluation: Tasks & Datasets
Evaluation on the Node Classiication & Link prediction tasks on
the widely used datasets for graph embedding evaluation:
* YouTube is used only to evaluate the efficiency, since the ground-truth includes only 3% of
the graph (as opposed to a 100% coverage for the other graphs)
16
Classification Performance Using Kernel SVM
Link Prediction Performance
Robustness to the Metric Space
Efficiency: Learning Time (sec)
Conclusions
DAOR is our embedding technique based on clustering:
● the first method we are aware of that provides
embeddings for any input graph without any manual tuning,
● produces metric-space robust embeddings,
● several orders of magnitude more efficient, with competitive
performance on diverse tasks compared to the manually tuned best
state-of-the-art embedding methods.
In addition, the produced embeddings are interpretable by design.
Q&A
Artem Lutov <artem.lutov@unifr.ch>
https://github.com/eXascaleInfolab/daor
Supplementary Slides
Granular Computing (GrC) ⇔ Clustering
“GrC is a superset of the theory of fuzzy information granulation,
rough set theory and interval computations, and is a subset of
granular mathematics.” (L.A. Zadeh, 1997)
Network Community Detection is a special case of Graph Clustering,
which is a special case of Information Granulation.
Rag Bag
Elements with low relevance to the
categories (e.g., noise) should preferably be
assigned to the less homogeneous clusters
(macro-scale, low-resolution, coarse-grained,
or top-level clusters in a hierarchy).
MMG: Robustness via Micro-consensus
Modularity: Q = (1/2m) Σ_{i,j} [A_ij − k_i k_j / (2m)] δ(c_i, c_j)
Modularity gain: the change ∆Q from merging a pair of nodes/clusters
Mutual Maximal(⬦) Gain: a merge is applied only when its ∆Q is maximal
for both merge candidates (micro-consensus)
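As a reference point, standard modularity Q = (1/2m) Σ_{i,j} [A_ij − γ k_i k_j/(2m)] δ(c_i, c_j) can be computed directly from its definition for a small graph; the edge-list representation is illustrative:

```python
from collections import defaultdict

def modularity(edges, comm, gamma=1.0):
    """Generalized modularity of an undirected, unweighted graph given
    as an edge list, with a node -> community mapping `comm`."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    m2 = 2 * len(edges)  # 2m
    # Each intra-community edge contributes A_uv and A_vu:
    inside = sum(2 for u, v in edges if comm[u] == comm[v])
    # Null-model term over all ordered intra-community node pairs:
    expected = sum(deg[u] * deg[v] for u in deg for v in deg
                   if comm[u] == comm[v]) / m2
    return (inside - gamma * expected) / m2

# Two triangles joined by a single edge, clustered into the triangles:
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
comm = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
q = modularity(edges, comm)  # → 5/14 ≈ 0.357
```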
Overlaps Decomposition (OD): Determinism
Decomposition of a node of degree d = 3 into K = 3 fragments
(the OD constraints are listed in the paper).
Accuracy Evaluation for Clusterings
Matching clusterings (unordered sets of elements), even when the
elements have a single membership, may yield multiple best matches
=> strict cluster labeling is not always possible, and is undesirable.
Many dedicated accuracy metrics have been designed, but few of them
are applicable to elements with multiple membership.
Our Requirements for the Accuracy Metrics
● Applicable to elements having multiple membership
● Applicable to large datasets: ideally O(N), runtime up to O(N²)
Families with accuracy metrics satisfying our requirements:
● Pair-counting based metrics: Omega Index [Collins, 1988]
● Cluster-matching based metrics: Average F1 score [Yang, 2013]
● Information-theory based metrics: Generalized NMI [Esquivel, 2012]
Problem: interpretability of the accuracy values and the metric selection.
Omega Index (Fuzzy ARI) [Collins,1988]
The Omega Index (𝛀) counts the pairs of elements occurring in exactly
the same number of produced clusters (C) as ground-truth categories (C'),
adjusted for the expected number of such pairs:
𝛀 = (p_o − p_e) / (1 − p_e),
where p_o is the observed fraction of element pairs sharing the same
co-occurrence count in C and C', and p_e is its expectation by chance.
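A minimal (deliberately inefficient, pairwise) sketch of the Omega Index for overlapping clusterings, following Collins' chance-adjusted pair agreement; the function signature is illustrative:

```python
from itertools import combinations
from collections import Counter

def omega_index(cl1, cl2, elements):
    """Omega Index [Collins, 1988]: agreement on the number of shared
    clusters per element pair, adjusted for chance (fuzzy ARI).
    cl1, cl2 are lists of sets (possibly overlapping clusters)."""
    def pair_counts(clustering):
        cnt = {}
        for a, b in combinations(sorted(elements), 2):
            cnt[(a, b)] = sum(1 for c in clustering if a in c and b in c)
        return cnt
    c1, c2 = pair_counts(cl1), pair_counts(cl2)
    n = len(c1)
    observed = sum(1 for p in c1 if c1[p] == c2[p]) / n
    h1, h2 = Counter(c1.values()), Counter(c2.values())
    expected = sum(h1[j] * h2[j] for j in h1) / n ** 2
    return (observed - expected) / (1 - expected) if expected != 1 else 1.0

cl = [{1, 2}, {2, 3}]
assert omega_index(cl, cl, {1, 2, 3}) == 1.0  # identical clusterings
```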
Soft Omega Index
The Soft Omega Index takes into account pairs present in different
numbers of clusters by normalizing the smaller number of occurrences of
each pair of elements in all clusters of one clustering by the larger
number of occurrences in the other clustering.
Average F1 Score [Yang,2013]
F1a is defined as the average of the weighted F1 scores of a) the best
match of the ground-truth clusters to the formed clusters and b) the best
match of the formed clusters to the ground-truth clusters
(F1 is the F1-measure [Rijsbergen, 1974]).
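A compact sketch of the average F1 score, assuming a cluster-size-weighted average of per-cluster best-match F1 in both directions (the exact weighting is in the paper); swapping the arithmetic mean for a harmonic one yields the F1h variant discussed below:

```python
def f1(a, b):
    """F1 overlap score between two clusters given as sets."""
    tp = len(a & b)
    if tp == 0:
        return 0.0
    p, r = tp / len(b), tp / len(a)
    return 2 * p * r / (p + r)

def avg_f1(cs1, cs2, mean=lambda x, y: (x + y) / 2):
    """Average F1 (F1a): size-weighted best-match F1 averaged in both
    directions; pass a harmonic `mean` to obtain an F1h-style score."""
    def side(src, dst):
        w = sum(len(c) for c in src)
        return sum(len(c) * max(f1(c, d) for d in dst) for c in src) / w
    return mean(side(cs1, cs2), side(cs2, cs1))

identical = [{1, 2}, {3, 4}]
assert avg_f1(identical, identical) == 1.0
```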
Mean F1 Scores: F1h
F1h uses the harmonic instead of the arithmetic mean to address F1a ≳ 0.5
for clusters produced from all combinations of the nodes: F1_{C',C} = 1,
since for each category there exists an exactly matching cluster, while
F1_{C,C'} → 0, since the majority of the clusters have low similarity to
the categories (the contribution m of each node is defined in the paper).
Mean F1 Scores: F1p
F1p is the harmonic mean of the average, over each clustering, of the
best local probabilities (f1 → p_prob) for each cluster.
Indexing Technique for Mean F1 Score
Purpose: O(N(|C’| + |C|)) → O(N)
Data structures:
Cluster:
  mbs      # member nodes, const
  cont     # members' contribution, const
  counter  # contributions counter
Counter:
  orig  # originating cluster
  ctr   # raw counter, <= |mbs|

for a in g2.mbs:
    for c in cls(C.a):
        cc = c.counter
        if cc.orig != g2:
            cc.ctr = 0; cc.orig = g2
        cc.ctr += 1 / |C.a| if ovp else 1
        fmatch(cc.ctr, c.cont, g2.cont)
Xmeasures MF1 vs ParallelComMetric F1-Measure
SNAP DBLP (Nodes: 317,080; Edges: 1,049,866;
Clusters: 13,477) ground-truth vs. the clustering
produced by Louvain.
Evaluation on an Intel Xeon E5-2620
(32 logical CPUs) @ 2.10 GHz; apps compiled
with GCC 5.4 and the -O3 flag.
Generalized Normalized Mutual Information (NMI)
NMI is the Mutual Information I(C’:C) normalized by the maximum or mean
value of the unconditional entropy H of the clusterings C’ and C, e.g.:
NMI = I(C’:C) / max(H(C’), H(C))
GNMI [Esquivel, 2012] uses a stochastic process to compute MI for
overlapping clusterings.
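For hard partitions, NMI can be computed directly from the definition above (GNMI extends it to overlapping clusterings via stochastic MI estimation); a minimal sketch over label sequences:

```python
import math
from collections import Counter

def nmi(labels1, labels2):
    """NMI = I(C':C) / max(H(C'), H(C)) for two hard partitions given
    as equal-length label sequences."""
    n = len(labels1)
    p1, p2 = Counter(labels1), Counter(labels2)
    joint = Counter(zip(labels1, labels2))
    mi = sum(c / n * math.log(c * n / (p1[a] * p2[b]))
             for (a, b), c in joint.items())
    h = lambda p: -sum(c / n * math.log(c / n) for c in p.values())
    hmax = max(h(p1), h(p2))
    return mi / hmax if hmax > 0 else 1.0

# Identical partitions up to relabeling score 1; independent ones score 0:
assert abs(nmi([0, 0, 1, 1], [1, 1, 0, 0]) - 1.0) < 1e-12
```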
Metrics Applicability
● (Soft) 𝛀: values are not affected by the number of clusters; however,
it is O(N²) and performs poorly for multi-resolution clusterings.
● MF1: O(N), and F1p satisfies more formal constraints than the others;
however, it evaluates the best-matching clusters only (an unfair
advantage for the larger clusters).
● GNMI: highly parallelized, evaluates full matches, and is well-grounded
theoretically; however, it is biased to the number of clusters, yields
non-deterministic results, and convergence is not guaranteed in the
stochastic implementation.
Preliminaries: Modularity & Optimal Resolution
Modularity: Q = (1/2m) Σ_{i,j} [A_ij − k_i k_j / (2m)] δ(c_i, c_j)
Modularity gain (in Louvain): the change ∆Q from moving a node into a
neighboring cluster
Generalized Modularity: the optimal value of the resolution
parameter: 𝛾 = (ω_in − ω_out) / (ln ω_in − ln ω_out)
Preliminaries: DAOC & Louvain Properties
Human perception-adapted taxonomy construction
for large evolving networks by incremental clustering requires:
● Stable
● Fully-automatic
● Browsable
● Large
● Multi-viewpoint
● Narrow (7 ± 2 rule)
DAOC (vs. Louvain) is:
● Robust + deterministic
● Parameter-free
● Hierarchical
● Near-linear runtime
● Overlapping
● Fine-grained
Adjusting primitives for graph : SHORT REPORT / NOTES
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTESAdjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfEnhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdfUnleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 

DAOR - Bridging the Gap between Community and Node Representations: Graph Embedding via Community Detection

  • 5. Method Outline
The DAOR embedding method is an extension of DAOC clustering:
● The hierarchy of clusters is constructed by varying the resolution parameter 𝛾 within the evaluated bounds.
● The resulting clusters are transformed into embeddings (the required number of dimensions can optionally be specified):
a) features (salient clusters) are extracted from the clusters, and
b) embedding dimensions are formed from the features.
  • 6. Contributions
DAOR is a parameter-free embedding method based on graph clustering:
● 𝛾 bound identification for the most fine-grained clusters
● Automatic identification of features (salient clusters)
● Formation of the embedding dimensions from the features, with optional constraining of their number
  • 7. Preliminaries: Overlapping vs Multi-resolution
[Figure: overlapping clusters (e.g., "racing & blue cars" shared by the "racing cars" and "blue cars" clusters) vs clusters on various resolutions ("cars" and "bikes" on the coarse level; "racing cars", "blue cars" and "jeeps" on a finer one).]
  • 8. Hierarchical Multi-resolution Clustering: 𝛾 Bounds
The resolution bound 𝛾min for the most coarse-grained clusters, the bound for the most fine-grained clusters (inferred from the resolution limit analysis for the marginal case of cluster detectability) and a rule of thumb for the maximal expected number of clusters are derived in the paper. For real-world networks modelled with sparse graphs (i.e., m ≤ n^(3/2)), a further simplified bound is given in the paper.
  • 9. Clusters Transformation into Node Embeddings
  • 10. Feature Extraction from Clusters: Salient Clusters
The number of embedding dimensions d ≤ s (the number of salient clusters, i.e. features) ≤ k (the number of all clusters). The salient clusters are the top-level clusters t and all nested clusters that a) have a non-decreasing density of links and b) are more lightweight than their super-cluster by the factor rw (see the paper).
[Figure example: C3 is not salient, violating the density constraint; C2 is not salient, violating the weight constraint.]
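A sketch of this salience filter under the two constraints above. The `Cluster` structure, the exact density/weight definitions, the traversal order and the default `rw` value are simplifying assumptions; see the paper for the actual criteria:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical minimal cluster node; the actual DAOC hierarchy is richer.
@dataclass
class Cluster:
    name: str
    density: float   # internal link density
    weight: float    # total cluster weight
    subs: List["Cluster"] = field(default_factory=list)

def salient_clusters(tops, rw=0.9):
    """Collect salient clusters: all top-level clusters, plus nested clusters
    whose link density does not decrease w.r.t. their super-cluster and whose
    weight is at most rw times the super-cluster weight."""
    res = []
    def visit(c, is_top):
        if is_top:
            res.append(c)
        for s in c.subs:
            if s.density >= c.density and s.weight <= rw * c.weight:
                res.append(s)
            visit(s, False)
    for t in tops:
        visit(t, True)
    return res
```

On the slide's example, a sub-cluster violating either the density or the weight constraint (like C3 and C2) is simply skipped, while its descendants are still inspected.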
  • 11. Constraining the Number of Dimensions
The number of dimensions d is bounded by t ≤ d ≤ s. If the number of top-level clusters t cannot be controlled by the clustering algorithm and t > d, then, according to the “Rag Bag” constraint (Xmeasures, BigComp19), the t − (d − 1) most lightweight clusters are grouped together. If t − z outliers are present among the top-level clusters, the embedding dimensions are formed from the min(z, d), z < t, root-level clusters and d − z ≥ 0 nested salient clusters.
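A toy sketch of the rag-bag grouping, operating on cluster weights only. `constrain_dimensions` is a hypothetical name, and the outlier handling described above is omitted:

```python
def constrain_dimensions(weights, d):
    """Given the weights of the t top-level clusters, keep the d - 1 heaviest
    clusters as individual dimensions and, if t > d, group the t - (d - 1)
    most lightweight ones into a single "rag bag" dimension."""
    if len(weights) <= d:
        return sorted(weights, reverse=True)
    heavy = sorted(weights, reverse=True)
    ragbag = sum(heavy[d - 1:])  # combined weight of the lightest clusters
    return heavy[:d - 1] + [ragbag]
```

For example, five top-level clusters with weights [5, 4, 3, 2, 1] constrained to d = 3 dimensions yield [5, 4, 6], where 6 is the rag-bag weight.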
  • 12. Optional Bounding of the Number of Clusters
If a specific number of clusters is required on the top level (t ≳ d):
● The hierarchy generation is interrupted early if the number of clusters at level i, |hᵢ|, reaches the required number d.
● The hierarchy generation is forced to continue until the number of clusters reaches the required number d, even if the value of the optimization function (∆Q) becomes negative.
  • 13. Dimension Formation from Features
Each embedding dimension is formed from ≥ 1 salient cluster, so the number of salient clusters constitutes the recommended number of dimensions. The embedding vector vᵢ ∈ V of size d = |D| for each node #i is produced by quantifying the belonging degree w(i, Dⱼ) of the node to each dimension Dⱼ.
Example from the slide: |D| = 2: vA = {½, ½}, vB = {⅖, ⅗}, vC = {0, 1}.
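A minimal sketch of this quantification: the belonging degrees themselves come from the cluster structure, so here the raw per-dimension weights (and the dimension names D1, D2) are assumed as given, and only the normalization reproducing the slide's example is shown:

```python
def node_vector(belonging, dims):
    """Build the embedding vector of one node from its raw belonging
    weights towards each dimension, normalized to sum to 1."""
    total = sum(belonging.get(dim, 0.0) for dim in dims)
    if total == 0.0:
        return [0.0] * len(dims)
    return [belonging.get(dim, 0.0) / total for dim in dims]
```

With raw weights {D1: 2, D2: 3}, for instance, the vector becomes {⅖, ⅗}, matching vB above.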
  • 14. Dimension Interpretability
Dimensions are taken from (salient) clusters representing ground-truth semantic categories, with performance evaluated using extrinsic quality metrics (the F1 measures family, GNMI and Omega from Xmeasures, BigComp19). So, it is possible to fetch only a subset of the dimensions having some required semantics.
  • 15. Experimental Evaluation: Baselines
Baselines: ten state-of-the-art graph embedding techniques, tuned for each dataset:
a) graph-sampling based: DeepWalk, Node2Vec, LINE and VERSE
b) factorization-based: GraRep, HOPE and NetMF
c) similarity-preserving hashing based: INH-MF, NetHash and NodeSketch
  • 16. Experimental Evaluation: Tasks & Datasets
Evaluation on the node classification and link prediction tasks on datasets widely used for graph embedding evaluation.
* YouTube is used only to evaluate the efficiency, since the ground-truth includes only 3% of the graph (as opposed to a 100% coverage for the other graphs).
  • 19. Robustness to the Metric Space
  • 21. Conclusions
DAOR is our embedding technique based on clustering:
● the first method we are aware of that produces embeddings for any input graph without any manual tuning,
● produces metric-space robust embeddings,
● is several orders of magnitude more efficient than the manually tuned best state-of-the-art embedding methods, while having competitive performance on diverse tasks.
In addition, the produced embeddings are interpretable by design.
  • 24. Granular Computing (GrC) ⇔ Clustering
“GrC is a superset of the theory of fuzzy information granulation, rough set theory and interval computations, and is a subset of granular mathematics.” (L.A. Zadeh, 1997)
Network Community Detection is a special case of Graph Clustering, which is a special case of Information Granulation.
  • 25. Rag Bag
Elements with low relevance to the categories (e.g., noise) should preferably be assigned to the less homogeneous clusters (macro-scale, low-resolution, coarse-grained or top-level clusters in a hierarchy).
  • 26. MMG: Robustness via Micro-consensus
MMG performs a merge only when the modularity gain ∆Q is maximal for both merge candidates mutually.
[The slide shows the formulas for Modularity, the Modularity gain and the Mutual Maximal(⬦) Gain.]
  • 27. Overlaps Decomposition (OD): Determinism
Decomposition of a node of degree d = 3 into K = 3 fragments.
[The slide shows the decomposition figure and the OD constraints.]
  • 28. Accuracy Evaluation for Clusterings
Matching the clusterings (unordered sets of elements), even with elements having a single membership, may yield multiple best matches => strict labeling of the clusters is not always possible and can be undesirable. Many dedicated accuracy metrics have been designed, but few of them are applicable to elements with multiple membership.
  • 29. Our Requirements for the Accuracy Metrics
● Applicable to elements having multiple membership
● Applicable to large datasets: ideally O(N) runtime, up to O(N²)
Families with accuracy metrics satisfying our requirements:
● Pair counting based metrics: Omega Index [Collins, 1988]
● Cluster matching based metrics: Average F1 score [Yang, 2013]
● Information theory based metrics: Generalized NMI [Esquivel, 2012]
Problem: interpretability of the accuracy values and selection of the metric.
  • 30. Omega Index (Fuzzy ARI) [Collins, 1988]
The Omega Index (𝛀) counts the number of pairs of elements occurring in exactly the same number of clusters as in the number of categories, adjusted for the expected number of such pairs (C′ denotes the ground-truth categories, C the produced clusters):
𝛀 = (𝛀u − 𝛀e) / (1 − 𝛀e), where 𝛀u is the observed fraction of element pairs co-occurring in exactly the same number of clusters in C as in C′, and 𝛀e is the expected value of that fraction under chance.
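A straightforward pairs-based reference implementation of the crisp Omega Index may look as follows. `omega_index` is a hypothetical name and this O(N²) sketch is for illustration only; the production Xmeasures implementation is far more optimized:

```python
from itertools import combinations
from collections import Counter

def omega_index(cov1, cov2):
    """Omega Index of two covers given as node -> set of cluster ids:
    the chance-adjusted fraction of node pairs co-occurring in exactly
    the same number of clusters in both covers."""
    nodes = sorted(set(cov1) | set(cov2))
    pairs = list(combinations(nodes, 2))
    P = len(pairs)
    t1, t2 = Counter(), Counter()  # j -> number of pairs sharing j clusters
    agree = 0
    for a, b in pairs:
        j1 = len(cov1.get(a, set()) & cov1.get(b, set()))
        j2 = len(cov2.get(a, set()) & cov2.get(b, set()))
        t1[j1] += 1
        t2[j2] += 1
        agree += j1 == j2
    obs = agree / P                                    # observed fraction
    exp = sum(t1[j] * t2[j] for j in t1) / P ** 2      # expected fraction
    return (obs - exp) / (1 - exp) if exp != 1 else 1.0
```

Identical covers score 1, and covers that agree less often than chance score below 0, as for ARI.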
  • 31. Soft Omega Index
The Soft Omega Index also takes into account pairs present in a different number of clusters, by normalizing the smaller number of occurrences of each pair of elements in all clusters of one clustering by the larger number of occurrences in the other clustering.
  • 32. Average F1 Score [Yang, 2013]
F1a is defined as the average of the weighted F1 scores of a) the best-matching ground-truth clusters to the formed clusters and b) the best-matching formed clusters to the ground-truth clusters:
F1a(C′, C) = ½ (F1(C′, C) + F1(C, C′)), where F1(X, Y) averages the best-matching F1 score of each cluster of X against the clusters of Y, and F1 is the F1-measure [Rijsbergen, 1974].
  • 33. Mean F1 Scores: F1h
F1h uses the harmonic instead of the arithmetic mean to address F1a ≳ 0.5 for the clusters produced from all combinations of the nodes (F1(C′, C) = 1, since for each category there exists an exactly matching cluster; F1(C, C′) → 0, since the majority of the clusters have low similarity to the categories):
F1h(C′, C) = 2 · F1(C′, C) · F1(C, C′) / (F1(C′, C) + F1(C, C′))
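The contrast between F1a and F1h can be sketched as follows. This is an unweighted variant for brevity, with `f1a`/`f1h` as hypothetical names; it shows how the harmonic mean penalizes a one-sided match more than the arithmetic mean does:

```python
def f1(c1, c2):
    """F1-measure of two clusters (sets of elements)."""
    tp = len(c1 & c2)
    return 2 * tp / (len(c1) + len(c2)) if c1 or c2 else 0.0

def best_avg_f1(cs1, cs2):
    """Average over cs1 of the best-matching F1 score against cs2."""
    return sum(max(f1(c, g) for g in cs2) for c in cs1) / len(cs1)

def f1a(cats, clusters):
    """Arithmetic mean of the two directional average F1 scores."""
    a, b = best_avg_f1(cats, clusters), best_avg_f1(clusters, cats)
    return (a + b) / 2

def f1h(cats, clusters):
    """Harmonic mean of the two directional average F1 scores."""
    a, b = best_avg_f1(cats, clusters), best_avg_f1(clusters, cats)
    return 2 * a * b / (a + b) if a + b else 0.0
```

For two categories {1,2} and {3,4} against the clustering formed from all node pairs, F1(C′, C) = 1 but F1(C, C′) = ⅔, so F1a = ⅚ while F1h = 0.8.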
  • 34. Mean F1 Scores: F1p
F1p is the harmonic mean of the average, over each clustering, of the best local probabilities (f1 ➞ pprob) for each cluster.
  • 35. Indexing Technique for Mean F1 Score
Purpose: O(N(|C′| + |C|)) ➞ O(N).
Each Cluster holds: mbs (member nodes, const), cont (members contribution, const) and counter (contributions counter). Each Counter holds: orig (originating cluster) and ctr (raw counter, ≤ mbs).
for a in g2.mbs:
    for c in cls(C.a):
        cc = c.counter
        if cc.orig != g2:
            cc.ctr = 0; cc.orig = g2
        cc.ctr += 1 / |C.a| if ovp else 1
        fmatch(cc.ctr, c.cont, g2.cont)
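The lazy counter reset behind this indexing can be rendered in Python as follows. Instead of zeroing every cluster's match counter for each ground-truth cluster g2, each counter remembers which cluster it was last used for (orig) and is reset on first touch. This is an illustrative sketch of the idea, not the actual Xmeasures code; overlap weighting (ovp) is omitted and `best_f1_per_category` is a hypothetical name:

```python
def best_f1_per_category(categories, clusters_of):
    """categories: id -> set of nodes; clusters_of: node -> list of produced
    cluster node-sets (shared objects). Returns id -> best F1 of that
    category against any produced cluster, in one pass over the members."""
    state = {}  # id(cluster) -> [orig category, raw counter]
    best = {}
    for gid, g2 in categories.items():
        best[gid] = 0.0
        for a in g2:                       # for a in g2.mbs
            for c in clusters_of[a]:       # for c in cls(C.a)
                cc = state.setdefault(id(c), [None, 0])
                if cc[0] != gid:           # lazy reset instead of bulk zeroing
                    cc[0], cc[1] = gid, 0
                cc[1] += 1                 # raw counter, <= |c|
                score = 2 * cc[1] / (len(c) + len(g2))   # fmatch(...)
                if score > best[gid]:
                    best[gid] = score
    return best
```

Since the F1 of a cluster against g2 only grows as more shared members are counted, the running maximum ends up equal to the best full-match F1, without ever iterating over non-matching clusters.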
  • 36. Xmeasures MF1 vs ParallelComMetric F1-Measure
SNAP DBLP (nodes: 317,080; edges: 1,049,866; clusters: 13,477) ground-truth vs the clustering produced by Louvain. Evaluation on an Intel Xeon E5-2620 (32 logical CPUs) @ 2.10 GHz; the apps were compiled using GCC 5.4 with the -O3 flag.
  • 37. Generalized Normalized Mutual Information (NMI)
NMI is the Mutual Information I(C′:C) normalized by the max or the mean value of the unconditional entropy H of the clusterings C′ and C, e.g. NMI_max = I(C′:C) / max(H(C′), H(C)). GNMI [Esquivel, 2012] uses a stochastic process to compute MI.
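As an illustration, a minimal NMI implementation for hard (non-overlapping) partitions; GNMI's stochastic generalization to overlapping clusters is not sketched here, and `nmi_max` is a hypothetical name:

```python
from math import log
from collections import Counter

def nmi_max(part1, part2):
    """NMI of two hard partitions (node -> cluster label), normalized by
    the maximum of the two unconditional entropies."""
    n = len(part1)
    c1, c2 = Counter(part1.values()), Counter(part2.values())
    joint = Counter((part1[v], part2[v]) for v in part1)
    h1 = -sum(c / n * log(c / n) for c in c1.values())
    h2 = -sum(c / n * log(c / n) for c in c2.values())
    mi = sum(p / n * log((p / n) / ((c1[a] / n) * (c2[b] / n)))
             for (a, b), p in joint.items())
    hmax = max(h1, h2)
    return mi / hmax if hmax else 1.0
```

Identical partitions score 1, statistically independent ones score 0.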
  • 38. Metrics Applicability
(Soft) 𝛀: O(N²) and performs poorly for multi-resolution clusterings; however, its values are not affected by the number of clusters.
MF1: evaluates the best-matching clusters only (an unfair advantage for the larger clusters); however, it is O(N) and F1p satisfies more formal constraints than the others.
GNMI: biased to the number of clusters, non-deterministic results, and convergence is not guaranteed in the stochastic implementation; however, it is highly parallelized, evaluates full matches and is well-grounded theoretically.
  • 39. Preliminaries: Modularity & Optimal Resolution
Modularity: Q = 1/(2m) · Σij [Aij − ki·kj/(2m)] · δ(ci, cj)
Modularity gain (in Louvain), when moving node i into community c: ∆Q = [(Σin + ki,in)/(2m) − ((Σtot + ki)/(2m))²] − [Σin/(2m) − (Σtot/(2m))² − (ki/(2m))²]
Generalized Modularity: Q(𝛾) = 1/(2m) · Σij [Aij − 𝛾·ki·kj/(2m)] · δ(ci, cj)
Optimal value of the resolution parameter (Newman’16): 𝛾 = (ωin − ωout) / (ln ωin − ln ωout)
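The community-summed form of generalized modularity, Q(𝛾) = Σc [ec/m − 𝛾·(degc/(2m))²], is equivalent to the standard pairwise definition and easy to check on a toy graph. An illustrative sketch, not the DAOC implementation:

```python
def modularity(edges, communities, gamma=1.0):
    """Generalized modularity Q(gamma) of an undirected unweighted graph:
    sum over communities of (e_c / m - gamma * (deg_c / (2m))^2), where
    e_c is the number of intra-community edges and deg_c the total degree."""
    m = len(edges)
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    q = 0.0
    for com in communities:
        e_c = sum(1 for u, v in edges if u in com and v in com)
        d_c = sum(deg[u] for u in com)
        q += e_c / m - gamma * (d_c / (2 * m)) ** 2
    return q
```

For two triangles joined by a single bridge edge and split into their natural communities, Q(1) = 5/14; raising 𝛾 penalizes large communities and thus drives the optimum towards finer-grained clusters.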
  • 40. Preliminaries: DAOC & Louvain Properties
Human perception-adapted taxonomy construction for large evolving networks by incremental clustering requires the clustering to be:
● Stable ● Fully-automatic ● Browsable ● Large ● Multi-viewpoint ● Narrow (7 ± 2 rule)
The corresponding properties of the DAOC (vs Louvain) clustering:
● Robust + deterministic ● Parameter-free ● Hierarchical ● Near-linear runtime ● Overlapping ● Fine-grained