"Incremental Lossless Graph Summarization", KDD 2020

Incremental Lossless
Graph Summarization
Jihoon Ko* Yunbum Kook* Kijung Shin
Large-scale Graphs are Everywhere!
Icon made by Freepik from www.flaticon.com
2B+ active users
600M+ users
1.5B+ users
Large-scale Graphs are Everywhere! (cont.)
4B+ web pages 5M papers 6K+ proteins
Icon made by Freepik from www.flaticon.com
Graph Compression for Efficient Manipulation
• Handling large-scale graphs as they are → heavy disk or network I/O
• Their compact representation makes efficient manipulation possible!
• A larger portion of the original graph can be stored in main memory or cache
Previous Graph Compression Techniques
• Various compression techniques have been proposed
• Relabeling nodes
• Pattern mining
• Lossless graph summarization → one of the most effective compression techniques
• …
• Lossless graph summarization is a batch algorithm for “static graphs”,
which are a single (or a few) snapshots of evolving graphs
However, most real-world graphs in fact go through lots of changes...
Real-world Graphs are Evolving
2M+ users → 2B+ users over 10 years
Previous algorithms: not designed to allow for changes in graphs
→ Algorithms must be rerun from scratch to reflect each change
Solution: Incrementally update compressed graphs in a fast and effective manner!
Outline
• Preliminaries
• Proposed Algorithm: MoSSo
• Experimental Results
• Conclusions
Lossless Graph Summarization: Example
[Figure: input graph on nodes a–i, summarized step by step]
Input graph with 11 edges
→ Merge nodes into supernodes A = {a} and B = {b, c, d, e}, with correction Delete {f, i}
→ Merge C = {f, g, h, i}, with corrections Add {a, f} & Delete {f, i}
→ Output with 4 edges
Lossless Graph Summarization: Definition
Lossless summarization takes an input graph G = (V, E) and yields (1) a summary graph G* = (S, P) and (2) edge corrections (C+, C-), while minimizing the edge count |P| + |C+| + |C-| (≈ “description cost”, denoted by φ)
Proposed in [NRS08] based on “the Minimum Description Length principle”
Ex: A = {a}, B = {b, c, d, e}, C = {f, g, h, i}; C+ = {af}, C- = {fi}
Lossless Graph Summarization: Definition (cont.)
1. Summary graph G* = (S, P)
• Supernodes S = a partition of V, where each supernode is a set of nodes
• Superedges P = a set of pairs of supernodes (ex: {A, B} in the example above)
2. Edge corrections (C+, C-)
• Residual graph (positive) C+
• Residual graph (negative) C-
Lossless Graph Summarization: Notation
• Supernode containing u: S_u (i.e., u ∈ S_u)
• Edges between supernodes A and B: E_AB = {uv ∈ E : u ∈ A, v ∈ B, u ≠ v}
• All possible edges between A and B: T_AB = {{u, v} ⊆ V : u ∈ A, v ∈ B, u ≠ v}
• Neighborhood of a node u: N(u) = {v ∈ V : uv ∈ E}
• Nodes incident to u in C+ (or C-): C+(u) (or C-(u))
• Compression rate: (|P| + |C+| + |C-|) / |E|
Lossless Graph Summarization: Optimal Encoding
For summarization, determining the supernodes S (a partition of V) is our main concern
→ Given S, the superedges P and edge corrections C are optimally determined
Lossless Graph Summarization: Optimal Encoding (cont.)
The edges E_AB between two supernodes are encoded as either (1) a superedge with C- or (2) no superedge with C+
Case 1: |E_AB| ≥ (|T_AB| + 1) / 2 → add superedge AB to P and T_AB \ E_AB to C- (cost: 1 + |T_AB| − |E_AB|)
Case 2: |E_AB| < (|T_AB| + 1) / 2 → add all edges in E_AB to C+ (cost: |E_AB|)
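A minimal sketch of this encoding rule in Python (function and variable names are our own, not the paper's API): given two supernodes and the edge set (edges stored as frozensets), it picks the cheaper of the two cases.

```python
from itertools import combinations

def encode_pair(A, B, edges):
    """Encode the edges between supernodes A and B optimally.
    Returns (has_superedge, C_plus, C_minus)."""
    if A == B:  # node pairs inside a single supernode
        T = {frozenset(p) for p in combinations(A, 2)}
    else:
        T = {frozenset((u, v)) for u in A for v in B if u != v}
    E = {p for p in T if p in edges}  # edges actually present
    if E and len(E) >= (len(T) + 1) / 2:
        # Case 1: superedge AB plus negative corrections T_AB \ E_AB
        return True, set(), T - E
    # Case 2: no superedge; all present edges go to C+
    return False, E, set()

# Ex: A = {a}, B = {b, c, d, e} from the running example
edges = {frozenset(e) for e in [("a","b"), ("a","c"), ("a","d"), ("a","e")]}
print(encode_pair({"a"}, {"b","c","d","e"}, edges))
# -> (True, set(), set()): one superedge, no corrections needed
```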
Lossless Graph Summarization: Optimal Encoding (Example)
Encoding the A–B edges with superedge AB: φ = |P| + |C+| + |C-| = 2 + 1 + 1 = 4, with C+ = {af}, C- = {fi}
Encoding them with C+ only: φ = 1 + 5 + 1 = 7, with C+ = {ab, ac, ad, ae, af}, C- = {fi}
Recovery: Example
Given A = {a}, B = {b, c, d, e}, C = {f, g, h, i} with C+ = {af} and C- = {fi}:
(1) Add all pairs of nodes between two adjacent supernodes
(2) Remove all edges in C-
(3) Add all edges in C+
→ The original graph on nodes a–i is recovered exactly
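The three recovery steps translate directly into set operations. Below is a minimal sketch (the data layout and names are our assumptions, not the paper's API): supernodes as a dict of node sets, superedges as ordered name pairs, and corrections as sets of frozenset edges.

```python
def recover(supernodes, superedges, c_plus, c_minus):
    """Rebuild the exact edge set from a summary graph + corrections."""
    edges = set()
    for X, Y in superedges:  # (X, X) encodes a superedge inside one supernode
        A, B = supernodes[X], supernodes[Y]
        # (1) add every node pair between the two adjacent supernodes
        edges |= {frozenset((u, v)) for u in A for v in B if u != v}
    edges -= set(c_minus)    # (2) remove all edges in C-
    edges |= set(c_plus)     # (3) add all edges in C+
    return edges
```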
Why Lossless Graph Summarization?
• Queryable (retrieving the neighborhood of a query node)
• Neighborhood queries: a key building block in numerous graph algorithms (ex: DFS, PageRank, Dijkstra’s, etc.)
• Rapidly answered from a summary and corrections
• Combinable
• Its outputs are also graphs → they can be further compressed via other compression techniques!
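As an illustration of queryability, here is a sketch of a neighborhood query answered straight from the summary and corrections, without decompressing the whole graph. It reuses the assumed layout of the recovery sketch plus an S_of map from node to supernode name; the usage below wires up the running example under our assumption that P = {AB, CC}.

```python
def neighbors(u, S_of, supernodes, superedges, c_plus, c_minus):
    """Answer N(u) from the compressed representation."""
    su, out = S_of[u], set()
    for X, Y in superedges:          # collect superedge-implied neighbors
        if su == X:
            out |= supernodes[Y]
        if su == Y:
            out |= supernodes[X]
    out.discard(u)                   # simple graphs: no self-loops
    out -= {next(iter(e - {u})) for e in c_minus if u in e}  # drop C- partners
    out |= {next(iter(e - {u})) for e in c_plus if u in e}   # add C+ partners
    return out

S_of = {"a": "A", "b": "B", "c": "B", "d": "B", "e": "B",
        "f": "C", "g": "C", "h": "C", "i": "C"}
supernodes = {"A": {"a"}, "B": {"b","c","d","e"}, "C": {"f","g","h","i"}}
superedges = {("A", "B"), ("C", "C")}
c_plus = {frozenset(("a", "f"))}; c_minus = {frozenset(("f", "i"))}
print(sorted(neighbors("f", S_of, supernodes, superedges, c_plus, c_minus)))
# -> ['a', 'g', 'h']: superedge CC gives g, h, i; C- drops i; C+ adds a
```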
Fully Dynamic Graph Stream
A fully dynamic graph can be represented by a sequence {e_t}, t = 0, 1, 2, …, of edge additions e_t = {u, v}+ and deletions e_t = {u, v}−
→ The graph G_t at time t is constructed by aggregating all edge changes up to time t
Stream of changes: + − − + + … starting from the empty graph G_0 at time t = 0 and producing the current graph G_t at time t
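A tiny sketch of this stream model (the (u, v, op) tuple encoding is our assumption):

```python
def graph_at(stream, t):
    """Aggregate the first t edge changes into the edge set of G_t."""
    edges = set()
    for u, v, op in stream[:t]:
        e = frozenset((u, v))
        if op == '+':
            edges.add(e)       # edge addition
        else:
            edges.discard(e)   # edge deletion
    return edges

# Ex: G_3 after adding {a,b}, adding {a,c}, then deleting {a,b}
print(graph_at([("a","b","+"), ("a","c","+"), ("a","b","-")], 3))
# -> {frozenset({'a', 'c'})}
```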
Problem Formulation
• Given: a fully dynamic graph stream {e_t}, t = 0, 1, 2, …
• Retain: the summary graph G*_t = (S_t, P_t) and edge corrections C_t = (C+_t, C-_t) of the graph G_t at each time t
• To Minimize: the size of the output representation |P_t| + |C+_t| + |C-_t|
→ Whenever an edge change e_{t+1} arrives, the retained summary graph and edge corrections must be updated to reflect it
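In code form, the problem is a maintenance loop over the stream. This skeleton only fixes the interface; the state layout and the `update` hook are our assumptions, to be filled in by an incremental rule such as MoSSo:

```python
def summarize_stream(stream, update):
    """Maintain (G*_t, C_t) while consuming the change stream."""
    state = {"S": {}, "P": set(), "C+": set(), "C-": set()}
    for change in stream:
        update(state, change)   # reflect e_{t+1} in G*_{t+1}, C_{t+1}
        yield state             # retained outputs at each time t
```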
Challenge: Fast Updates yet Good Compression Performance
Outline
• Preliminaries
• Proposed Algorithm: MoSSo
• Experimental Results
• Conclusions
Scheme for Incremental Summarization
Current state: the graph on nodes a–i and its lossless summarization G* (A = {a}, B = {b, c, d, e}, C = {f, g, h, i}) with C+ = {af}, C- = {fi}, so φ = |P| + |C+| + |C-| = 4
A new edge {a, j} arrives, attaching a new node j → stored trivially in the corrections, so C+ = {af, aj} and φ = 5
How to update the current summarization?
Our approach:
(1) Attempt to move nodes among supernodes
(2) Accept a move if φ decreases
(3) Reject it otherwise
MoSSo finds...
(1) testing nodes whose move likely results in φ ↓
(2) candidates for each testing node, likely resulting in φ ↓
Ex: testing node j moves into candidate supernode B → superedge AB now covers {a, j}, C+ shrinks back to {af}, and φ = 4 again
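A brute-force sketch of this accept/reject rule, under our own minimal data model (S as a dict of node sets, edges as frozensets). Note that MoSSo computes the change Δφ incrementally; recomputing φ from scratch, as below, is only for illustration.

```python
from itertools import combinations

def phi(S, edges):
    """Description cost |P| + |C+| + |C-| under the optimal encoding,
    recomputed naively over every supernode pair."""
    groups, cost = list(S.values()), 0
    for i, A in enumerate(groups):
        for B in groups[i:]:
            if A is B:
                T = {frozenset(p) for p in combinations(A, 2)}
            else:
                T = {frozenset((u, v)) for u in A for v in B}
            E = {p for p in T if p in edges}
            if E:  # cheaper of: superedge + C-, or C+ only
                cost += min(len(E), 1 + len(T) - len(E))
    return cost

def try_move(y, target, S, edges):
    """Move node y into supernode `target`; keep the move iff φ drops."""
    src = next(k for k, g in S.items() if y in g)
    before = phi(S, edges)
    S[src].discard(y); S[target].add(y)
    if phi(S, edges) >= before:           # reject: φ did not decrease
        S[target].discard(y); S[src].add(y)
        return False
    if not S[src]:
        del S[src]                        # drop an emptied supernode
    return True

# Running example: moving j from its singleton into B lowers φ
S = {"A": {"a"}, "B": {"b", "c", "d", "e"}, "J": {"j"}}
edges = {frozenset(("a", x)) for x in "bcdej"}
print(try_move("j", "B", S, edges), phi(S, edges))  # -> True 1
```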
MoSSo: Main Ideas
Given a changed edge e = {u, v}:
• Step 1: Set the testing nodes (which nodes to move?)
• (S1) No restoration of the graph from the current summarization G*_t = (S_t, P_t), C_t = (C+_t, C-_t)
• (S2) Reduce redundant testing by stochastic filtering
• Step 2: Find candidates (move into which supernode?)
• (S3) Utilize an incremental coarse clustering
• (S4) Inject flexibility into the reorganization of supernodes
MoSSo: Details
Parameters:
• Sample number c
• Escape probability e
Input:
• Summary graph G*_t & edge corrections C_t
• Edge change {u, v}+ (addition) or {u, v}− (deletion)
Output:
• Summary graph G*_{t+1} & edge corrections C_{t+1}
MoSSo: Details (Step 1) – MCMC
Notation: N(u) is the neighborhood of a node u
The neighborhood N(u) of an endpoint u of the changed edge e = {u, v} is the most likely to be affected
→ Focus on testing nodes in N(u)
P1. To sample neighbors, one would have to retrieve all of N(u) from G* and C, which takes O(average degree) time on average
→ Deadly to scalability…
Graph densification law [LKF05]: “The average degree of real-world graphs increases over time.”
S1. Without fully retrieving N(u), sample c neighbors uniformly at random using the Markov Chain Monte Carlo (MCMC) method
→ MCMC: sampling from a random variable whose probability density is proportional to a given function
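For intuition, here is a generic Metropolis-style sketch of the MCMC idea the slide cites: drawing from a distribution proportional to f using only pointwise evaluations of f. It illustrates the principle only; it is not MoSSo's actual neighbor sampler, and all names are ours.

```python
import random

def mcmc_sample(states, f, steps=200, seed=None):
    """Return a state drawn (approximately) with probability ∝ f(state)."""
    rng = random.Random(seed)
    x = rng.choice([s for s in states if f(s) > 0])  # valid start state
    for _ in range(steps):
        y = rng.choice(states)                       # symmetric uniform proposal
        if rng.random() < min(1.0, f(y) / f(x)):
            x = y                                    # Metropolis accept
    return x

# With f constant on the valid set, the samples are (approximately) uniform
print(mcmc_sample(list("bcde"), lambda v: 1.0, seed=7))
```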
MoSSo: Details (Step 1) – Probabilistic Filtering
Test all the sampled nodes? → Better not…
P2. Testing would fall too frequently on high-degree nodes, since P(v is sampled) ∝ deg(v), and testing such nodes is computationally heavy (too many neighbors):
- updating the optimal encoding
- computing the change Δφ in the description cost
S2. Test a sampled node v w.p. 1/deg(v)
(1) Likely avoids expensive testing on high-degree nodes
(2) In expectation, P(v is actually tested) is the same across all nodes v (i.e., it smooths the imbalance in the number of tests)
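S2 is a one-liner in practice; a sketch with assumed inputs (a sampled-node list and a degree map):

```python
import random

def filtered_tests(sampled, deg):
    """Keep each sampled node v with probability 1/deg(v), so that
    P(sampled) * P(kept | sampled) no longer depends on the degree."""
    return [v for v in sampled if random.random() < 1.0 / deg[v]]
```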
MoSSo: Details (Step 2) – Coarse Clustering
P3. Among many choices, how do we find “good” candidates for a testing node y (i.e., candidates likely resulting in φ ↓)?
S3. Utilize an incremental coarse clustering
→ Desirable property: nodes with “similar connectivity” fall in the same cluster
→ Any incremental coarse clustering with this property will do!
Our choice: min-hashing
(1) Fast, with the desirable theoretical property:
P(u, v in the same cluster) ∝ Jaccard(N(u), N(v))
⇒ groups nodes with similar connectivity
(2) Clusters from min-hashing are updated rapidly in response to edge changes
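A one-hash sketch of min-hash clustering (our own minimal version, assuming every node has a non-empty neighborhood; for a single random hash, two nodes collide with probability exactly Jaccard(N(u), N(v))):

```python
import random

def minhash_clusters(adj, seed=0):
    """Group nodes by the min-hash of their neighborhoods."""
    rng = random.Random(seed)
    nodes = set(adj) | {v for nbrs in adj.values() for v in nbrs}
    h = {v: rng.random() for v in nodes}       # one random hash value per node
    clusters = {}
    for u in adj:
        key = min(adj[u], key=h.__getitem__)   # min-hash signature of N(u)
        clusters.setdefault(key, set()).add(u)
    return clusters

# b..e all have N = {a}, so they land in one cluster
print(minhash_clusters({v: {"a"} for v in "bcde"}))
```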
MoSSo: Details (Step 2) – Separation of Node
P4. Moving nodes this way only decreases or maintains |S|
→ Discourages reorganization of the supernodes in the long run
S4. Instead of moving to a found candidate, with escape probability e, separate y from S_y and create a singleton supernode S_y = {y}
→ Injects flexibility into the supernodes (a partition of V)
→ Empirically, a significant improvement in compression rates
As before, accept or reject the separation depending on Δφ
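Combining S3 and S4, the target-selection step can be sketched as follows (the singleton-naming scheme is an assumption; the returned target then goes through the same Δφ accept/reject test as the try_move sketch above):

```python
import random

def choose_target(y, candidates, S, escape_prob):
    """Pick a move target for testing node y: usually a clustered
    candidate supernode, but with probability `escape_prob` a fresh
    singleton, so the partition can also grow."""
    if random.random() < escape_prob or not candidates:
        name = ("singleton", y)   # fresh supernode id (assumed scheme)
        S[name] = set()
        return name
    return random.choice(list(candidates))
```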
Outline
• Preliminaries
• Proposed Algorithm: MoSSo
• Experimental Results
• Conclusions
Experimental Settings
• 10 real-world graphs (up to 0.3B edges): web, social, collaboration, email, and others!
• Batch lossless graph summarization algorithms as baselines:
• Randomized [NRS08], SAGS [KNL15], SWeG [SGKR19]
Baseline Incremental Algorithms
• MoSSo-Greedy
• Greedily moves the nodes related to each inserted/deleted edge, while fixing all other nodes, so that the objective is minimized
• MoSSo-MCMC
• See the paper for details
• MoSSo-Simple
• MoSSo without coarse clustering
Experiment results: Speed
• MoSSo processed each change up to 7 orders of magnitude faster than running the fastest batch algorithm
[Figures: per-change speed on insertion-only graph streams (ex: UK) and fully dynamic graph streams]
Experiment results: Compression Performance
Notation: compression ratio = (|P| + |C+| + |C-|) / |E|
• The compression ratio of MoSSo was even comparable to those of the best batch algorithms
• MoSSo achieved the best compression ratios among the streaming algorithms
[Figures: compression ratios on UK and on PR, EN, FB, DB, YT, SK, LJ, EU, HW]
Experiment results: Scalability
• MoSSo processed each change in near-constant time
[Figures: EU (insertion-only) and SK (fully dynamic)]
Outline
• Preliminaries
• Proposed Algorithm: MoSSo
• Experimental Results
• Conclusions
Conclusions
We propose MoSSo, the first algorithm for incremental lossless graph summarization
• Fast and ‘any time’
• Effective
• Scalable
The code and datasets used in the paper are available at http://dmlab.kaist.ac.kr/mosso/
Incremental Lossless
Graph Summarization
Jihoon Ko* Yunbum Kook* Kijung Shin