1. Introduction Preliminaries MapReduce GPAbstraction Experiments
Distributed Algorithms for k-Truss
Decomposition
Pei-Ling Chen (1)   Ming-Syan Chen (2)
(1,2) Department of Electrical Engineering, National Taiwan University
(2) Research Center of Information Technology Innovation, Academia Sinica
July 17, 2014
Distributed Algorithms for k-Truss Decomposition July 17, 2014 1 / 37
Outline
1 Introduction
2 Preliminaries
3 Distributed k-Truss Decomposition in MapReduce Framework
4 Distributed k-Truss Decomposition in Graph Parallel Abstractions
5 Experimental Analysis
Outline
1 Introduction
  Motivation
  Related Work and Our Contribution
2 Preliminaries
3 Distributed k-Truss Decomposition in MapReduce Framework
4 Distributed k-Truss Decomposition in Graph Parallel Abstractions
5 Experimental Analysis
Motivation
1 k-truss is one of the graph measures for describing the characteristics of a vertex or capturing the structure of a network;
2 Graph measures have several applications, such as marketing and group formation;
3 With the emergence of large online networks, e.g., Facebook, computing graph measures on a single machine becomes difficult due to long running time and limited memory;
4 Designing algorithms based on cloud computing is therefore an important issue.
Related Work and Our Contribution
For k-truss in large graphs:
• Wang and Cheng [8] propose I/O-efficient algorithms for k-truss decomposition. They break a graph into several partitions and use a sequential processing method to cope with the limited memory of a single machine;
• A heuristic distributed k-truss decomposition in the MapReduce framework is mentioned in [1] without experiments; since MapReduce is not designed for iterative algorithms, that algorithm suffers from I/O waiting time between MapReduce jobs.
We adopt the most recent graph computing model, graph parallel abstractions, and provide a rigorous theoretical basis to propose an efficient and scalable k-truss decomposition algorithm.
Outline
1 Introduction
2 Preliminaries
  Definition
  Traditional k-Truss Decomposition
3 Distributed k-Truss Decomposition in MapReduce Framework
4 Distributed k-Truss Decomposition in Graph Parallel Abstractions
5 Experimental Analysis
Definition
(Figure: an example graph on vertices A–F; the highlighted edge has sup = 2.)
Definition (Support)
The support of an edge e = (u, v) ∈ EG, denoted by sup(e, G), is defined as |nb(u) ∩ nb(v)|, where nb(u) and nb(v) are the sets of neighbors of u and v, respectively. When G is obvious from context, we write sup(e) for sup(e, G).
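The support computation can be sketched directly in Python (a minimal illustration on a toy graph of my own, not the authors' code):

```python
# Toy example (edge list is illustrative): compute
# sup(e) = |nb(u) ∩ nb(v)| for every edge of an undirected graph.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D")]

# Build the neighbor sets nb(v).
nb = {}
for u, v in edges:
    nb.setdefault(u, set()).add(v)
    nb.setdefault(v, set()).add(u)

def support(u, v):
    """sup((u, v)) = number of common neighbors of u and v."""
    return len(nb[u] & nb[v])

for u, v in edges:
    print((u, v), support(u, v))  # e.g., sup(B, C) = 2: triangles ABC and BCD
```

Note that sup(e) is exactly the number of triangles containing e, which is why it reappears throughout the decomposition.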
Definition
(Figure: the example graph on vertices A–F again; this is a 4-truss.)
Definition (k-Truss)
A k-truss Rk of G, where k ≥ 2, is defined as a connected subgraph such that sup(e, Rk) ≥ k − 2 for every edge e of Rk.
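The definition can be checked directly; a small sketch (my own, with connectivity assumed and to be verified separately):

```python
def is_k_truss(edges, k):
    """True iff every edge of the (assumed connected) subgraph given by
    `edges` has support >= k - 2 within that subgraph."""
    nb = {}
    for u, v in edges:
        nb.setdefault(u, set()).add(v)
        nb.setdefault(v, set()).add(u)
    return all(len(nb[u] & nb[v]) >= k - 2 for u, v in edges)

# K4 (complete graph on 4 vertices): every edge lies in 2 triangles,
# so K4 is a 4-truss but not a 5-truss.
k4 = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D")]
print(is_k_truss(k4, 4), is_k_truss(k4, 5))  # True False
```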
Definition
We further define Tk as the union of all k-trusses, that is, Tk = ∪i R^i_k, where R^i_k is the i-th k-truss in G.
(Figure: an example graph on vertices A–J, each edge labeled with its trussness (3, 4, or 5); the subgraphs T5, T4, and T3 are highlighted in turn.)
Definition (Trussness)
The trussness of an edge e in G, denoted by φ(e) = k, is the maximal k such that e is contained in ETk.
Traditional k-Truss Decomposition
1 Start from the graph to be processed by k-truss decomposition.
2 Compute the support of each edge.
3 For k = 4, remove the edges with support < 4 − 2 = 2, then update the supports of the remaining edges.
4 For k = 5, remove the edges with support < 5 − 2 = 3.
5 The final result.
(Figure: an example graph on vertices A–E with edge supports labeled; edges are peeled step by step, and the final result labels each surviving edge with φ = 3 or φ = 4.)
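The peeling procedure above can be written as a sequential Python baseline (my own sketch of the standard batch algorithm, not the authors' implementation):

```python
def truss_decomposition(edges):
    """Return {edge: phi(edge)} by iteratively removing, for increasing k,
    every edge whose support falls below k - 2."""
    nb = {}
    for u, v in edges:
        nb.setdefault(u, set()).add(v)
        nb.setdefault(v, set()).add(u)
    alive = {tuple(sorted(e)) for e in edges}
    phi = {}
    k = 3
    while alive:
        while True:  # repeat until no edge fails the level-k threshold
            weak = [(u, v) for u, v in alive if len(nb[u] & nb[v]) < k - 2]
            if not weak:
                break
            for u, v in weak:
                alive.discard((u, v))
                nb[u].discard(v)
                nb[v].discard(u)
                phi[(u, v)] = k - 1  # removed at level k => trussness k - 1
        k += 1
    return phi

# K4 plus a pendant edge: the pendant edge has phi = 2, the K4 edges phi = 4.
g = [("A", "B"), ("A", "C"), ("A", "D"),
     ("B", "C"), ("B", "D"), ("C", "D"), ("D", "E")]
print(truss_decomposition(g))
```

The inner loop handles cascades: removing an edge can push a neighboring edge's support below the threshold at the same level k.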
MRTruss
A heuristic algorithm for distributed k-truss decomposition under the MapReduce framework is proposed in [1]. We abbreviate this method as MRTruss.
1 For each pair of edges sharing a common vertex, i.e., an open triad, generate a record whose value is the triad and whose key is the potential closure: the edge that would close this triad into a triangle (whether this edge exists is unknown in this task);
2 Check whether the closure specified in each key exists, and output the existing triangles;
3 Count sup(e) for each edge e, and delete the edges with the smallest support.
The procedure of MRTruss follows the traditional batch k-truss decomposition.
MRTruss
However, MRTruss has several problems.
1 The main issue is that the edge-triangle relationships of the input graph are not preserved across iterations, so they must be recomputed.
2 The three jobs required in each iteration cause many unnecessary disk I/O operations.
3 It produces too many intermediate outputs.
Therefore, we propose an improved version as follows.
i-MRTruss
Algorithm 1 i-MRTruss
Input: G = (V, E)
Output: records with (e, φ(e))
1: run Procedure 1: Triangle Finding
2: t ← 2
3: repeat
4:   run Procedure 2: Trussness Counting
5: until ∀e ∈ E, sup(e) + 2 ≥ t
6: t ← t + 1
7: goto step 4
1 We sacrifice memory usage to condense the three tasks of MRTruss into one, which decreases the number of disk I/O operations and speeds up the running time.
i-MRTruss
Triangle Finding
(Figure: Triangle Finding on the example graph A–D. The map phase emits each edge under its endpoints; reduce phase 1 produces the adjacency lists A: B D, B: A C D, D: A B C; reduce phase 2 outputs one record per edge, e.g., AB 3 AD BD and AD 3 AB BD.)
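A toy, in-memory analogue of Triangle Finding (the real procedure runs as MapReduce phases; this sketch only mimics their outputs, and all names are illustrative):

```python
from collections import defaultdict

def find_triangle_edges(edges):
    """For each edge, list the edges that form triangles with it,
    mirroring the output of the second reduce phase."""
    adj = defaultdict(set)        # reduce phase 1: adjacency lists
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    out = {}                      # reduce phase 2: triangle edges per edge
    for u, v in edges:
        out[(u, v)] = sorted(
            tuple(sorted(p)) for w in adj[u] & adj[v] for p in ((u, w), (v, w))
        )
    return out

g = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "D"), ("B", "D")]
print(find_triangle_edges(g)[("A", "B")])  # AD and BD close a triangle with AB
```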
i-MRTruss
Trussness Counting
(Figure: one round of Trussness Counting with c = 4 on the example graph A–D. Records have the form ⟨edge, count, triangle edges⟩: the map phase processes BD 4 AB AD BC CD, and the reduce phase outputs BD 3 AB AD BC CD together with AB 3 AD BD, AD 3 AB BD, BC 3 CD BD, and CD 3 BC BD.)
Outline
1 Introduction
2 Preliminaries
3 Distributed k-Truss Decomposition in MapReduce Framework
4 Distributed k-Truss Decomposition in Graph Parallel Abstractions
  Graph Parallel Abstractions
  Definitions and Theorems
  Algorithm and Illustrated Example
5 Experimental Analysis
Graph Parallel Abstractions
• A graph-parallel abstraction comprises a graph and a vertex-program executed in parallel on every vertex of the graph.
• A vertex-program can interact with the neighbors of its vertex.
• Pregel [5] and GraphLab [4] are two well-known graph parallel abstractions.
(Figure: two adjacent vertices A and B, each running its own vertex-program Compute{· · · }.)
Graph Parallel Abstractions
• Pregel is a well-known abstraction based on the BSP model, in which a vertex-program passes messages to its neighbors in a sequence of supersteps. Barrier synchronization separates the supersteps and ensures consistency. Both Apache Hama [6] and Apache Giraph are open source counterparts to Pregel.
(Figure: two vertex-programs exchanging messages; an active vertex votes to halt and becomes inactive, and becomes active again when a message is received.)
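The BSP execution model can be simulated compactly (illustrative only; Pregel's actual API differs): active vertices run a compute function each superstep, vote to halt, and are reactivated by incoming messages.

```python
from collections import defaultdict

def run_bsp(vertices, compute, max_supersteps=30):
    """Synchronous superstep loop: the barrier is implicit in
    processing one superstep at a time."""
    inbox = defaultdict(list)
    active = set(vertices)
    for _ in range(max_supersteps):
        if not active:
            break
        outbox = defaultdict(list)
        for v in list(active):
            if compute(v, inbox[v], outbox):  # True = vote to halt
                active.discard(v)
        inbox = outbox
        active |= set(outbox)  # a received message reactivates a vertex

# Example vertex-program: propagate the maximum initial value.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
value = {"A": 1, "B": 5, "C": 2}
first = {v: True for v in graph}

def compute(v, messages, outbox):
    old = value[v]
    if messages:
        value[v] = max(value[v], max(messages))
    if value[v] != old or first[v]:  # changed (or first superstep): notify
        first[v] = False
        for w in graph[v]:
            outbox[w].append(value[v])
    return True  # always vote to halt; a message will wake us up

run_bsp(graph, compute)
print(value)  # {'A': 5, 'B': 5, 'C': 5}
```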
Definitions and Theorems
We want to solve k-truss decomposition from a new perspective:
• The trussness φ(e) of an edge e ∈ EG can be decided by the trussnesses of a subset of edges in the graph G.
This idea provides a computation logic different from that of the traditional batch algorithms.
Therefore, we first derive a theorem proving a locality property of k-truss, which determines a sufficient range of edges in the graph for computing the trussness φ(e) of an edge e.
Definitions and Theorems
(Figure: an example graph on vertices A–D; every edge, including BD, has trussness φ = 3.)
1 Since φ(BD) = 3, we can find at least 2(3 − 2) = 2 edges forming 1 triangle with BD, both with φ = 3 ≥ 3, but no 2(4 − 2) = 4 edges forming 2 triangles with it have φ ≥ 4.
Theorem (Locality)
∀e ∈ EG: φ(e) = k if and only if
1 there exists a subset Ek ⊆ enb(e) such that |Ek| = 2(k − 2), the edges in Ek form a total of (k − 2) triangles with e, and φ(e′) ≥ k for each edge e′ ∈ Ek;
2 there is no subset Ek+1 ⊆ enb(e) such that |Ek+1| = 2(k − 1), the edges in Ek+1 form a total of (k − 1) triangles with e, and φ(e′) ≥ k + 1 for each edge e′ ∈ Ek+1.
Definitions and Theorems
2 If φ(BD) = k is unknown, since sup(BD) = 2, let us start by assuming k = 4.
Definitions and Theorems
3 Then we would need at least 2(4 − 2) = 4 edges forming 2 triangles with BD, but all of these edges have φ = 3 < 4, so φ(BD) ≠ 4.
Definitions and Theorems
4 Assuming k = 3, there are at least 2(3 − 2) = 2 edges forming 1 triangle with BD, and both of these edges have φ = 3 ≥ 3, so φ(BD) = 3.
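The feasibility check carried out in steps 2–4 can be written down directly (my own encoding of the condition, not the paper's code): level k is feasible for an edge e when at least k − 2 of its triangles have both side edges at trussness ≥ k.

```python
def supports_level(triangles, k):
    """triangles: one (phi(e1), phi(e2)) pair per triangle containing e.
    Return True iff at least k - 2 triangles have both sides at >= k."""
    good = sum(1 for p1, p2 in triangles if p1 >= k and p2 >= k)
    return good >= k - 2

# Edge BD from the example lies in two triangles (ABD and BCD), and every
# side edge has trussness 3.
tris_bd = [(3, 3), (3, 3)]
print(supports_level(tris_bd, 4))  # False: phi(BD) != 4
print(supports_level(tris_bd, 3))  # True:  phi(BD) = 3
```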
Definitions and Theorems
(Figure: a graph G on vertices A–E and its line graph L(G), whose vertices are the edges AB, BC, CD, DE, AE, AC of G.)
Definition (Line Graph)
Given a simple graph G = (VG, EG), its line graph L(G) = (VL(G), EL(G)) is a graph where each vertex v ∈ VL(G) represents an edge e ∈ EG (a one-to-one correspondence), and two vertices in VL(G) are adjacent if and only if their corresponding edges in EG share a common endpoint.
Definitions and Theorems
(Figure: the graph G, its line graph L(G), and its pruned line graph PL(G), all on the vertices AB, BC, CD, DE, AE, AC.)
Definition (Pruned Line Graph)
Given a simple graph G = (VG, EG) and its line graph L(G) = (VL(G), EL(G)), the pruned line graph of G is PL(G) = (VPL(G), EPL(G)), where VPL(G) is the same as VL(G), but EPL(G) is reduced by the constraint: two vertices in VPL(G) are adjacent if and only if their corresponding edges e1, e2 ∈ EG form a triangle with a third edge of EG.
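A sketch of constructing PL(G) (identifiers are mine): enumerate the triangles of G and connect the three corresponding line-graph vertices, so edges that share only an endpoint, such as AB and BC below, stay non-adjacent.

```python
from itertools import combinations

def pruned_line_graph(edges):
    """PL(G) as an adjacency map over the (sorted) edges of G."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    pl = {tuple(sorted(e)): set() for e in edges}
    # Connect the three edges of every triangle of G.
    for u, v, w in combinations(sorted(adj), 3):
        if v in adj[u] and w in adj[u] and w in adj[v]:
            tri = [tuple(sorted(p)) for p in ((u, v), (u, w), (v, w))]
            for e1, e2 in combinations(tri, 2):
                pl[e1].add(e2)
                pl[e2].add(e1)
    return pl

g = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "D"), ("B", "D")]
pl = pruned_line_graph(g)
print(pl[("A", "B")])  # adjacent to AD and BD only; BC is pruned away
```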
Algorithm and Illustrated Example
Trussness-Parallel Computing
1 The pruned line graph PL(G) is constructed by Trussness-Parallel Computing from the output of Triangle Finding on graph G (dashed lines represent edges pruned from the original line graph).
(Figure: G on vertices A–D and PL(G) with vertices AB, 3; BC, 3; CD, 3; AD, 3; BD, 4.)
Algorithm and Illustrated Example
Trussness-Parallel Computing
2 The first superstep in Trussness-Parallel Computing: every vertex of PL(G) sends msg = (ID, potential trussness) to its neighbors.
(Figure: the vertices of PL(G) exchanging messages; AB's list starts with AB : 3 and BD's list with BD : 4.)
Algorithm and Illustrated Example
3 The second superstep in Trussness-Parallel Computing: each vertex merges the received msg = (ID, potential trussness) pairs into its list, groups them in a table M by the third vertex of each triangle, and counts how many triangles support each level φ ≥ k.
AB's list: AB : 3, AD : 3, BD : 4; AB's table M: D : (A, 3) (B, 4); AB's counter: φ ≥ 2 : 1, φ ≥ 3 : 1.
BD's list: BD : 4, AB : 3, BC : 3, CD : 3, AD : 3; BD's table M: A : (B, 3) (D, 3), C : (B, 3) (D, 3); BD's counter: φ ≥ 2 : 2, φ ≥ 3 : 2, φ ≥ 4 : 0.
Since no triangle supports φ ≥ 4, BD lowers its potential trussness from 4 to 3 and notifies its neighbors.
Algorithm and Illustrated Example
4 The third superstep in Trussness-Parallel Computing: the neighbors of BD update their lists and tables with BD's new potential trussness 3; e.g., AB's list now holds AB : 3, BD : 3, and AB's table M entry for D becomes (A, 3) (B, 3).
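The supersteps above can be simulated compactly as a synchronous fixpoint (a sketch of the idea, not the actual vertex-program): every edge starts at sup(e) + 2 and repeatedly lowers its estimate to the highest level its triangles still support.

```python
def trussness_parallel(edges, triangles_of):
    """triangles_of[e]: list of (e1, e2) edge pairs forming a triangle with e."""
    phi = {e: len(triangles_of[e]) + 2 for e in edges}  # start at sup(e) + 2
    changed = True
    while changed:
        changed = False
        new = {}
        for e in edges:
            k = phi[e]
            # Lower k until at least k - 2 triangles have both sides at >= k.
            while k > 2:
                good = sum(1 for e1, e2 in triangles_of[e]
                           if phi[e1] >= k and phi[e2] >= k)
                if good >= k - 2:
                    break
                k -= 1
            new[e] = k
            changed |= (k != phi[e])
        phi = new
    return phi

# The running example: BD starts at 4 and settles at 3 after one round.
tri = {
    ("A", "B"): [(("A", "D"), ("B", "D"))],
    ("A", "D"): [(("A", "B"), ("B", "D"))],
    ("B", "C"): [(("B", "D"), ("C", "D"))],
    ("C", "D"): [(("B", "C"), ("B", "D"))],
    ("B", "D"): [(("A", "B"), ("A", "D")), (("B", "C"), ("C", "D"))],
}
print(trussness_parallel(list(tri), tri))  # every edge ends with phi = 3
```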
Outline
1 Introduction
2 Preliminaries
3 Distributed k-Truss Decomposition in MapReduce Framework
4 Distributed k-Truss Decomposition in Graph Parallel Abstractions
5 Experimental Analysis
  Synthetic Data
  Real Data
  Conclusion
Synthetic Data
Table: Statistics of Synthetic Graph Datasets
Scale  i   V         E         dmax  davg      supmax  supavg  kmax
10^3   10  1024      1368      44    2.6718    13      0.147   4
10^4   14  16384     25514     110   3.1145    11      0.052   4
10^5   18  262144    465679    363   3.5528    6       0.017   4
10^6   20  1048576   1986937   648   3.7898    7       0.009   4
10^7   24  16777216  36146725  2164  4.308711  29      0.003   4
1 These five datasets share the same Kronecker matrix setting: {0.999 0.327; 0.348 0.391}.
2 This ensures similar properties across the datasets at different scales.
3 All graphs are simple graphs (undirected, unweighted, no self loops or multiple edges).
Synthetic Data
(Figure: running time (10^3 sec.) versus node-number scale, 10^3 to 10^7, for GPTruss, i-MRTruss, and MRTruss.)
1 The running time of all three methods grows as the scale of the dataset increases.
2 The running time of GPTruss is always about half that of i-MRTruss.
3 The running time of MRTruss grows significantly when the scale reaches 10^7.
Synthetic Data
(Figure: number of jobs and number of iterations versus node-number scale for GPTruss, i-MRTruss, and MRTruss.)
1 Since the range of k for k-trusses in these datasets is narrow, the numbers of iterations required by the three methods differ little.
2 However, the numbers of required jobs differ greatly among the three methods.
Real Data
Table: Statistics of Real World Network Datasets
Name         V        E        dmax   davg   supmax  supavg  kmax
com-Youtube  1134890  2987624  28754  5.265  4034    3.069   19
loc-Gowalla  196591   950327   14730  9.668  1297    7.176   29
roadNet-TX   1379917  1921660  12     2.785  3       0.129   4
com-DBLP     317080   1049866  343    6.622  312     6.356   114
1 Among these four datasets, com-Youtube, loc-Gowalla, and com-DBLP are dense, with high average degree.
2 Datasets like roadNet-TX are considered large, with over one million vertices.
3 Since roadNet-TX has davg < 3, it is viewed as a sparse dataset.
4 All datasets are preprocessed into simple graphs (undirected, unweighted, no self loops or multiple edges).
Real Data
(Figure: running time (sec.) versus number of reducers (4–12) for GPTruss and i-MRTruss on com-Youtube, loc-Gowalla, com-DBLP, and roadNet-TX.)
1 For i-MRTruss, since the number of iterations cannot be decreased by using more reducers, the running time is eventually bounded by the time consumed by disk I/O operations.
2 GPTruss is shown to be much more efficient than i-MRTruss on this kind of large, dense dataset.
3 For a sparse dataset like roadNet-TX, the performance difference is relatively smaller, but the running time of GPTruss is still less than half that of i-MRTruss.
Real Data
(Figure: running time (sec.) versus number of reducers (4–12) for the Trussness-Parallel Computing and Triangle Finding phases of GPTruss on com-Youtube, loc-Gowalla, com-DBLP, and roadNet-TX.)
1 The running times of the two phases are roughly similar on loc-Gowalla, com-DBLP, and roadNet-TX, which have similar numbers of edges.
2 On com-Youtube, the running time of Triangle Finding is much longer than that of Trussness-Parallel Computing.
3 These results indicate that when a graph has a large number of edges, the running time of Triangle Finding dominates the performance of GPTruss.
Real Data
(Figure: memory usage (MB) of GPTruss and i-MRTruss on Youtube, Gowalla, DBLP, and RoadNet.)
1 All methods are tested using 4 slaves.
2 The memory usage of i-MRTruss is the average memory used by one job (iteration) on a single machine.
3 The memory usage of GPTruss is the average memory used on a single machine.
4 Since GPTruss needs a line graph transformation, the line graph of the original dataset becomes larger, with a new average degree roughly equal to 2 × davg − 2. Therefore, its memory usage is higher than that of i-MRTruss.
Real Data
(Figure: disk usage (MB) and number of iterations for GPTruss and i-MRTruss on Youtube, Gowalla, DBLP, and roadNet.)
1 For i-MRTruss, the disk usage is roughly correlated with the number of iterations the dataset needs and the dataset size.
2 For GPTruss, since only 2 jobs (disk I/O rounds) are needed in total, the disk usage is always lower than that of i-MRTruss.
3 For the datasets with dense vertices and edges, the difference in iteration numbers between the two methods is much more pronounced.
Conclusion
• We provide an improved MapReduce version, i-MRTruss, based on an existing distributed k-truss decomposition;
• We prove the locality property of k-truss and design a distributed k-truss decomposition based on this property under graph-parallel abstractions, which efficiently improves the performance;
• As future work, it is worth studying how to process the pruned line graph efficiently when a graph has many edges, as pointed out in the experimental analysis of GPTruss.
References
[1] Jonathan Cohen. Graph twiddling in a MapReduce world. Computing in Science & Engineering, 11(4):29–41, 2009.
[2] Jonathan D. Cohen. Trusses: Cohesive subgraphs for social network analysis. National Security Agency Technical Report, 2008.
[3] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. PowerGraph: Distributed graph-parallel computation on natural graphs. In OSDI, 2012.
[4] Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. Distributed GraphLab: A framework for machine learning and data mining in the cloud. In VLDB, 2012.
[5] Grzegorz Malewicz, Matthew H. Austern, Aart J.C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: A system for large-scale graph processing. In SIGMOD, 2010.
[6] Sangwon Seo, Edward J. Yoon, Jaehong Kim, Seongwook Jin, Jin-Soo Kim, and Seungryoul Maeng. HAMA: An efficient matrix computation with the MapReduce framework. In CloudCom, 2010.
[7] Johan Ugander, Lars Backstrom, Cameron Marlow, and Jon Kleinberg. Structural diversity in social contagion. In PNAS, 2012.
[8] Jia Wang and James Cheng. Truss decomposition in massive networks. In VLDB, 2012.
[9] De-Nian Yang, Yi-Ling Chen, Wang-Chien Lee, and Ming-Syan Chen. On social-temporal group query with acquaintance constraint. In VLDB, 2011.
[10] De-Nian Yang, Chih-Ya Shen, Wang-Chien Lee, and Ming-Syan Chen. On socio-spatial group query for location-based social networks. In SIGKDD, 2012.