gSpan algorithm

gSpan: Graph-Based Substructure Pattern Mining
Presented By: Sadik Mussah
University of Vermont
CS 332 – Data mining
- Algorithm -

Outline
• Background
• Problem Definition
• Authors' Contribution
• Concepts Behind gSpan
• Experimental Results
• Conclusion

Background
• Frequent Subgraph Mining Is An Extension To Existing
Frequent Pattern Mining Algorithms
• A Major Challenge Is To Count How Many Instances Of
A Pattern Occur In The Dataset
• Counting Instances May Be Easy For Sets, But Is Subtle For
Graphs
• Graph Isomorphism Problem

Background
Theorem
Given two graphs G and G’ (G prime), G is isomorphic to G’ iff min(G)
= min(G’), where min(·) denotes the minimum DFS code of a graph.
04/12/16 Sadik Mussah

Background
[Figure: Two isomorphic graphs (a) and (b), G1=(V1,E1,L1) and G2=(V2,E2,L2), each with five vertices numbered 1 to 5 and labeled X, W, U, Y, V, with their mapping function (c)]
• Two Graphs Are Isomorphic If One Can Find A One-To-One Mapping Of Nodes Of
The First Graph To The Second Graph Such That Edges And The Labels On Nodes
And Edges Are Preserved.
(c) Mapping function:
f(V1.1) = V2.2
f(V1.2) = V2.5
f(V1.3) = V2.3
f(V1.4) = V2.4
f(V1.5) = V2.1
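The definition above can be checked mechanically for a given mapping. The following is a minimal Python sketch, assuming each graph is stored as a vertex-label dictionary plus an edge-label dictionary keyed by unordered vertex pairs. The function name `is_isomorphism` and the tiny example graphs are illustrative and are not taken from the figure.

```python
def is_isomorphism(f, g1, g2):
    """Return True if f maps graph g1 onto g2 and preserves all labels.

    Each graph is a pair (vertex_labels, edge_labels):
      vertex_labels: {vertex: label}
      edge_labels:   {frozenset({u, v}): label}  (undirected, simple)
    """
    v1, e1 = g1
    v2, e2 = g2
    # f must be a bijection from the vertices of g1 to the vertices of g2.
    if set(f) != set(v1) or set(f.values()) != set(v2):
        return False
    if len(set(f.values())) != len(f):
        return False
    # Vertex labels must be preserved.
    if any(v1[u] != v2[f[u]] for u in v1):
        return False
    # Both graphs need the same number of edges, and every edge of g1
    # must land on an edge of g2 that carries the same label.
    if len(e1) != len(e2):
        return False
    return all(
        e2.get(frozenset(f[u] for u in edge)) == label
        for edge, label in e1.items()
    )


# Hypothetical example: a labeled path X -a- Y -b- Z under two vertex namings.
g1 = ({1: "X", 2: "Y", 3: "Z"},
      {frozenset({1, 2}): "a", frozenset({2, 3}): "b"})
g2 = ({"p": "Z", "q": "X", "r": "Y"},
      {frozenset({"q", "r"}): "a", frozenset({"r", "p"}): "b"})

f = {1: "q", 2: "r", 3: "p"}      # label-preserving mapping
bad = {1: "r", 2: "q", 3: "p"}    # sends an X vertex to a Y vertex

print(is_isomorphism(f, g1, g2))    # True
print(is_isomorphism(bad, g1, g2))  # False
```

Verifying one candidate mapping is cheap; the hard part is that the number of candidate mappings grows factorially with the number of vertices.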

Problem: Finding Frequent Subgraphs
• Problem Setting: Similar To Finding Frequent Itemsets For
Association Rule Discovery
• Input: Database Of Graph Transactions
• Graphs Are Undirected And Simple (No Multiple Edges)
• Each Graph Transaction Has Labeled Edges/Vertices.
• Transactions May Not Be Connected
• Minimum Support Threshold
• Output: Frequent Subgraphs That Satisfy The Support Threshold,
Where Each Frequent Subgraph Is Connected.
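To make the problem setting concrete, here is a small Python sketch that counts support for the simplest patterns, single labeled edges. It assumes support is the absolute number of graph transactions containing the pattern; the function name `frequent_edges`, the data layout, and the three-graph database are hypothetical.

```python
from collections import Counter


def frequent_edges(database, min_support):
    """Return {edge_pattern: support} for 1-edge patterns meeting min_support.

    database: list of graphs, each a pair (vertex_labels, edges) with
      vertex_labels: {vertex: label}
      edges:         list of (u, v, edge_label), undirected
    An edge pattern is (label_a, edge_label, label_b) with label_a <= label_b.
    """
    support = Counter()
    for vertex_labels, edges in database:
        seen = set()
        for u, v, edge_label in edges:
            a, b = sorted((vertex_labels[u], vertex_labels[v]))
            seen.add((a, edge_label, b))
        # A pattern counts once per transaction, however often it occurs in it.
        support.update(seen)
    return {p: s for p, s in support.items() if s >= min_support}


# Hypothetical database of three graph transactions.
database = [
    ({0: "X", 1: "Y", 2: "Z"}, [(0, 1, "a"), (1, 2, "b")]),
    ({0: "X", 1: "Y"}, [(0, 1, "a")]),
    ({0: "Y", 1: "X", 2: "X"}, [(0, 1, "a"), (1, 2, "c"), (0, 2, "a")]),
]

print(frequent_edges(database, min_support=2))
# {('X', 'a', 'Y'): 3}
```

Frequent single edges like these are the starting points from which larger connected patterns are grown.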

Authors' Contribution
• Representing Graphs As Strings (Like TreeMiner)
• No Candidate Generation!
• “It Combines The Growing And Checking Of Frequent Subgraphs
Into One Procedure, Thus Accelerates The Mining Process.”
• Very Fast, And Still A Standard Baseline That Most Competing
Systems Are Compared Against.

Concepts Behind gSpan
• The Idea Is To Produce A Depth-First Search (DFS) Code For
Each Edge In A Graph
• Edges Are Sorted According To The Lexicographic Order Of Their Codes
• Yan And Han Proved That Graph Isomorphism Can Be Tested
By Comparing The Minimum DFS Codes Of Two Graphs
• Starting With Small Graph Patterns Containing One Edge, Patterns
Are Expanded Systematically By Depth-First Search
• The Anti-Monotonic Property Of Graph Frequency Is Used For Pruning:
A Pattern Can Be Frequent Only If All Its Subgraphs Are Frequent

Lexicographic Ordering In Graph
• It Can Tell Us The Order Of Two Graphs.
• The Design Can Help Us Build A Similar Hierarchy.
• The Design Should Make It Easy To Grow From One Level To
The Next Lower Level And To Roll Up From A Lower Level To A Higher
Level.
• For Graphs, It May Be Difficult To Design A Hierarchy In Which No Two
Tree Nodes Represent The Same Graph.
• It Can Tell Us Whether A Graph Has Already Been Discovered.
• Most Importantly, If A Graph Has Been Discovered,
All Its Child Nodes In The Hierarchy Must Have Been
Discovered.

Lexicographic Ordering In Graph
[Figure: search tree of graph patterns, with levels for 1-edge, 2-edge and 3-edge graphs]

DFS Code And Minimum DFS Code
• We Use A 5-tuple (vi, vj, L(vi), L(vj), L(vi,vj)) To Represent An Edge. (It May Be
Redundant, But It Is Much Easier To Understand.)
• Turn A Graph Into A Sequence Whose Basic Element Is This 5-tuple. Form The
Sequence In The Following Order:
• To Extend The Graph By One New Node, Add The Forward Edge
That Connects A Node In The Old Graph With This
New Node.
• Add All Backward Edges That Connect This New Node
To Other Nodes In The Old Graph.
• Repeat This Procedure.
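The construction above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the function name `dfs_code`, the vertex names A to E, and the input layout are assumptions. It performs one depth-first search and emits, for each newly reached node, its forward edge followed by its backward edges.

```python
def dfs_code(labels, edges, start):
    """Build one DFS code (a list of 5-tuples) by depth-first search from start.

    labels: {vertex: label}; edges: list of (u, v, edge_label), undirected.
    Neighbours are tried in the order the edges are listed, so another edge
    order or start vertex generally produces a different code.
    """
    adj = {v: [] for v in labels}
    for u, v, lab in edges:
        adj[u].append((v, lab))
        adj[v].append((u, lab))

    index = {start: 0}  # discovery index of every visited vertex
    code = []

    def visit(v):
        for w, lab in adj[v]:
            if w in index:
                continue
            index[w] = len(index)
            # Forward edge: old vertex v -> new vertex w.
            code.append((index[v], index[w], labels[v], labels[w], lab))
            # Backward edges: new vertex w -> visited vertices other than v,
            # ordered by the discovery index of the target.
            back = sorted((index[x], labels[x], l) for x, l in adj[w]
                          if x in index and x != v)
            code.extend((index[w], j, labels[w], lj, l) for j, lj, l in back)
            visit(w)

    visit(start)
    return code


# Hypothetical five-vertex example graph with vertex labels x, y, x, z, z.
labels = {"A": "x", "B": "y", "C": "x", "D": "z", "E": "z"}
edges = [("A", "B", "a"), ("B", "C", "b"), ("C", "A", "a"),
         ("C", "D", "c"), ("D", "B", "b"), ("B", "E", "d")]

for edge in dfs_code(labels, edges, "A"):
    print(edge)
# (0, 1, 'x', 'y', 'a')
# (1, 2, 'y', 'x', 'b')
# (2, 0, 'x', 'x', 'a')
# (2, 3, 'x', 'z', 'c')
# (3, 1, 'z', 'y', 'b')
# (1, 4, 'y', 'z', 'd')
```

Starting from a different vertex, for example `dfs_code(labels, edges, "B")`, yields a different code for the same graph.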

DFS code
[Figure: example graph with vertices v0 (X), v1 (Y), v2 (X), v3 (Z), v4 (Z) and edge labels a, a, b, b, c, d, built up one edge at a time]
e0: (0,1,x,y,a)
e1: (1,2,y,x,b)
e2: (2,0,x,x,a)
e3: (2,3,x,z,c)
e4: (3,1,z,y,b)
e5: (1,4,y,z,d)

DFS Code And Minimum DFS Code
[Figure: Depth-First Tree And Forward/Backward Edge Sets]

Minimum DFS code
Each graph may have many DFS codes (why? the search can start at any vertex and visit neighbors in any order).
The lexicographically smallest one is its Minimum DFS Code.
Edge no.  (B)          (C)          (D)
0         (0,1,x,y,a)  (0,1,y,x,a)  (0,1,x,x,a)
1         (1,2,y,x,b)  (1,2,x,x,a)  (1,2,x,y,b)
2         (2,0,x,x,a)  (2,0,x,y,b)  (2,0,y,x,a)
3         (2,3,x,z,c)  (2,3,x,z,c)  (2,3,y,z,b)
4         (3,1,z,y,b)  (3,0,z,y,b)  (3,1,z,x,c)
5         (1,4,y,z,d)  (0,4,y,z,d)  (2,4,y,z,d)

Graph Parent And Its Children
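[Figure: graph with vertices labeled X, Y, X, Z, Z and edges labeled a, b, c, a, with eight positions marked "?"]
Given a DFS code
c0=(e0,e1,…,en)
and a DFS code c1=(e0,e1,…,en,ex) that extends it by one edge,
if c0<c1, then
c0 is c1’s parent and
c1 is c0’s child.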

Theorem
• 1. Given Two Graphs G0 And G1, G0 Is Isomorphic To G1 Iff
min_dfs_code(G0) = min_dfs_code(G1).
• 2. The DFS Code Tree Covers All Graphs, Although Some Tree Nodes May
Represent The Same Graph.
• 3. Given A Node In The DFS Code Tree, If Its DFS Code Is Not The Minimum DFS
Code Of Its Graph, Pruning This Node And All Its Descendants Does Not Change The “Covering”.
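Point 1 can be illustrated by brute force on small graphs: enumerate every DFS code of each graph, take the smallest, and compare. The Python sketch below makes several assumptions. All function names are hypothetical, graphs must be connected, and `edge_key` is one plausible encoding of the DFS lexicographic order (backward edges before forward edges, deeper forward extensions first). For an isomorphism test alone, any fixed total order on codes would serve. gSpan itself does not enumerate all codes this way.

```python
def make_graph(labels, edges):
    """Return (labels, adj) with adj[u][v] = edge label, stored symmetrically."""
    adj = {v: {} for v in labels}
    for u, v, lab in edges:
        adj[u][v] = lab
        adj[v][u] = lab
    return labels, adj


def dfs_codes(labels, adj):
    """Yield every DFS code (tuple of 5-tuples) of a connected labeled graph."""
    def extend(index, stack, code):
        # Drop finished vertices: all of their neighbours are already visited.
        while stack and all(w in index for w in adj[stack[-1]]):
            stack = stack[:-1]
        if not stack:
            yield tuple(code)
            return
        v = stack[-1]
        for w in adj[v]:
            if w in index:
                continue
            new_index = {**index, w: len(index)}
            i, j = new_index[v], new_index[w]
            step = [(i, j, labels[v], labels[w], adj[v][w])]  # forward edge
            back = sorted((new_index[u], u) for u in adj[w]
                          if u in index and u != v)
            step += [(j, k, labels[w], labels[u], adj[w][u]) for k, u in back]
            yield from extend(new_index, stack + [w], code + step)

    for start in labels:
        yield from extend({start: 0}, [start], [])


def edge_key(edge):
    """Sort key for one 5-tuple (assumed encoding of the edge order)."""
    i, j, li, lj, le = edge
    if j < i:                    # backward edge
        return (0, j, le)
    return (1, -i, li, le, lj)   # forward edge


def min_dfs_code(labels, adj):
    return min(dfs_codes(labels, adj),
               key=lambda code: [edge_key(e) for e in code])


def isomorphic(g0, g1):
    return min_dfs_code(*g0) == min_dfs_code(*g1)


# Hypothetical example graphs: g1 is g0 with renamed vertices and shuffled
# edges; g2 differs from g0 in a single edge label.
g0 = make_graph({"A": "x", "B": "y", "C": "x", "D": "z", "E": "z"},
                [("A", "B", "a"), ("B", "C", "b"), ("C", "A", "a"),
                 ("C", "D", "c"), ("D", "B", "b"), ("B", "E", "d")])
g1 = make_graph({1: "z", 2: "x", 3: "z", 4: "y", 5: "x"},
                [(4, 1, "d"), (3, 4, "b"), (2, 3, "c"),
                 (5, 2, "a"), (4, 5, "a"), (2, 4, "b")])
g2 = make_graph({"A": "x", "B": "y", "C": "x", "D": "z", "E": "z"},
                [("A", "B", "a"), ("B", "C", "b"), ("C", "A", "a"),
                 ("C", "D", "c"), ("D", "B", "b"), ("B", "E", "e")])

print(isomorphic(g0, g1))  # True
print(isomorphic(g0, g2))  # False
```

Because isomorphic graphs have exactly the same set of DFS codes, their minima coincide, which is what reduces the isomorphism test to comparing two sequences.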

DFS Code Tree
[Figure: DFS code tree with levels for 1-edge, 2-edge and 3-edge graphs; one subtree is marked as pruned]

[Figure: FSG: two substructure patterns and their potential candidates.]

AGM: two substructures joined by two chains

Algorithm: AprioriGraph

Algorithm: gSpan
04/12/16 Xifeng Yan

Conclusion
• No Candidate Generation And No False-Positive Testing
• Space Saving From Depth-First Search
• Good Performance: With A “Memory Pool” And One Major
Counting Improvement, Performance Appears To Improve By A
Further Factor Of 5 (More Testing Is Needed).

Questions
Q1) What Two Major Costs Of Apriori-Like Frequent
Substructure Mining Algorithms Did gSpan Aim To
Reduce Or Avoid?
• Answer:
1) The Creation Of Size K+1 Candidate Subgraphs From Size K
Frequent Subgraphs Is More Complicated And Costly Than
Standard Apriori Large Itemset Generation.
2) Pruning False Positives Is An Expensive Process, Because The Subgraph
Isomorphism Problem Is NP-Complete.

Security Graph 3D Visualization
• https://www.youtube.com/watch?v=JsEm-CDj4qM

Questions (cont.)
• Q2) Which DFS Tree Does The DFS Code Below Belong To?
[Figure: graph with five vertices v0 to v4 carrying labels Y, X, X, Z, Z and edges labeled a, a, b, b, c, d]
Answer: tree (c)

Questions
• Q3) What Does gSpan Compare When Testing For
Isomorphism Between Two Graphs, And Why?
• Answer: gSpan Compares The Minimum DFS Codes Of The Two
Graphs. Given Two Graphs G And G’, G Is Isomorphic To G’ Iff
min(G) = min(G’). This Theorem Reduces Isomorphism Testing To A Simple String
Comparison. If Two Nodes Contain
The Same Graph But Different DFS Codes, We Can
Prune The Sub-branch Of The Rightmost Of The Two Nodes. This
Greatly Decreases The Problem Size.
