Lgm pakdd2011 public

The 15th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD2011)
25 May 2011

LGM: Mining Frequent Subgraphs
from Linear Graphs

Yasuo Tabei
ERATO Minato Project
Japan Science and Technology Agency
joint work with
Daisuke Okanohara (Preferred Infrastructure),
Shuichi Hirose (AIST),
Koji Tsuda (AIST)


Outline
• Introduction to linear graph
★ Linear subgraph relation
★ Total order among edges
• Frequent subgraph mining from a set of
linear graphs
• Experiments
★ Motif extraction from protein 3D
structures

Linear graph (Davydov et al., 2004)
• Labeled graph whose vertices are totally
ordered
• Linear graph g = (V, E, L_V, L_E)
‣ V ⊂ N : ordered vertex set
‣ E ⊆ V × V : edge set
‣ L_V : V → Σ_V : vertex labels
‣ L_E : E → Σ_E : edge labels

Example:
[Figure: a linear graph on vertices 1-6 with vertex labels A, B, A, B, C, A and four edges labeled a, a, b, c]
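The definition above can be written down as a small data structure. This is a minimal sketch, not the authors' implementation: the class name `LinearGraph`, the convention that every edge is stored as (i, j) with i < j, and the edge endpoints in the example are assumptions made for illustration. Only the six vertex labels and the four edge labels come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class LinearGraph:
    """Labeled graph whose vertices are totally ordered (integers)."""
    vertex_labels: dict                              # L_V: vertex -> label
    edge_labels: dict = field(default_factory=dict)  # L_E: (i, j) -> label

    def __post_init__(self):
        for (i, j) in self.edge_labels:
            if i not in self.vertex_labels or j not in self.vertex_labels:
                raise ValueError(f"edge {(i, j)} uses an unknown vertex")
            if not i < j:
                # Assumed convention: left endpoint first.
                raise ValueError(f"edge {(i, j)} must satisfy i < j")

    @property
    def vertices(self):
        """Vertices in their total order."""
        return sorted(self.vertex_labels)

    @property
    def edges(self):
        """Edges sorted by left endpoint, then right endpoint."""
        return sorted(self.edge_labels)


# Vertex labels as on the slide; the edge endpoints are hypothetical.
g = LinearGraph(
    vertex_labels={1: "A", 2: "B", 3: "A", 4: "B", 5: "C", 6: "A"},
    edge_labels={(1, 3): "a", (2, 5): "b", (3, 6): "c", (4, 6): "a"},
)
print(g.vertices, g.edges)
```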

Linear subgraph relation
• g1 is a linear subgraph of g2
i) Conventional subgraph condition
★ Vertex labels are matched
★ All edges of g1 exist in g2 with the correct labels
ii) The order of vertices is conserved
Example:
[Figure: g1 ⊂ g2, where g1 has vertices 1-3 labeled A, B, A and g2 has vertices 1-6 labeled A, A, B, B, C, A]
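The two conditions can be checked with a simple backtracking search. The sketch below is an illustration under assumptions, not the procedure used in LGM: the function name `is_linear_subgraph` is hypothetical, vertices are 0-based list positions, and edges are stored as (i, j) with i < j. Vertices of g1 are matched to strictly increasing positions in g2, which enforces condition ii), while labels and edges are checked for condition i).

```python
def is_linear_subgraph(g1_vlabels, g1_elabels, g2_vlabels, g2_elabels):
    """Return True if g1 is a linear subgraph of g2.

    *_vlabels: list of vertex labels in vertex order.
    *_elabels: dict {(i, j): label} with 0-based positions and i < j.
    """
    n1, n2 = len(g1_vlabels), len(g2_vlabels)
    mapping = []  # mapping[p] = position in g2 matched to vertex p of g1

    def extend(p, start):
        if p == n1:
            return True
        # Leave room for the n1 - p - 1 vertices still to be placed.
        for q in range(start, n2 - (n1 - p) + 1):
            if g1_vlabels[p] != g2_vlabels[q]:
                continue
            # Every g1 edge ending at p must exist in g2 with the same label.
            edges_ok = all(
                g2_elabels.get((mapping[i], q)) == label
                for (i, j), label in g1_elabels.items()
                if j == p
            )
            if edges_ok:
                mapping.append(q)
                if extend(p + 1, q + 1):  # later vertices go strictly right
                    return True
                mapping.pop()
        return False

    return extend(0, 0)


# Hypothetical toy graphs: one edge labeled 'a' between an A and a B vertex.
g1 = (["A", "B"], {(0, 1): "a"})
in_order = (["A", "C", "B"], {(0, 2): "a"})
reversed_order = (["B", "A"], {(0, 1): "a"})
print(is_linear_subgraph(*g1, *in_order))        # order conserved
print(is_linear_subgraph(*g1, *reversed_order))  # order not conserved
```

In the second call g1 would still be an ordinary subgraph of the target if the edge is read as undirected, but the B vertex precedes the A vertex, so the order condition fails.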


Subgraph but not linear
subgraph
• g1 is a subgraph of g2
★ vertex labels are matched
★ all edges in g1 also exist in g2 with
correct labels
• g1 is not a linear subgraph of g2
★ the order of vertices is not conserved
[Figure: g1 with vertices 1-3 labeled A, A, B, and g2 with vertices 1-4 labeled A, A, B, A]

Total order among edges in a
linear graph
• Compare the left vertices first. If they
are identical, compare the right vertices
• ∀ e1 = (i, j), e2 = (k, l) ∈ E_g : e1 <_e e2
if and only if (i) i < k, or (ii) i = k and j < l
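Example:
[Figure: left, two edges e1 = (i, j) and e2 = (k, l); right, a linear graph on vertices 1-4 whose edges are numbered 1, 2, 3 in this order]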
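The edge order is the lexicographic order on (left vertex, right vertex) pairs. As a small sketch, assuming edges are stored as Python tuples (i, j) with i < j, the rule can be written out directly and compared with Python's built-in tuple comparison. The helper name `edge_less` and the sample edges are illustrative only.

```python
from itertools import product


def edge_less(e1, e2):
    """e1 <_e e2: compare left vertices first, then right vertices."""
    (i, j), (k, l) = e1, e2
    return i < k or (i == k and j < l)


edges = [(2, 4), (1, 3), (1, 2)]  # hypothetical edges on vertices 1-4

# Python compares tuples lexicographically, so sorted() yields this order.
ordered = sorted(edges)
print(ordered)

# The explicit rule and the built-in comparison agree on every pair.
agree = all(edge_less(a, b) == (a < b) for a, b in product(edges, repeat=2))
print(agree)
```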


Frequent subgraph mining
from linear graphs
• Enumerate all frequent subgraphs from a set of
linear graphs
★ Subgraphs included in a set of linear graphs at
least τ times (minimum support threshold)
★ Enumerate connected and disconnected subgraphs
within a unified framework
★ Use reverse search for efficient enumeration
(Avis and Fukuda, 1993)
• Polynomial delay
★ gSpan = exponential delay

Enumeration of all linear
subgraphs of a linear graph
• Before considering a mining
algorithm, we first have to solve the
problem of subgraph enumeration
• How to enumerate all subgraphs of
the following linear graph without
duplication?


Search lattice of all subgraphs

Reverse search (Avis and Fukuda, 1993)
• To enumerate all subgraphs without
duplication, we need to define a search tree
in the search lattice

• Reduction map f
★ Mapping from a child to its parent
★ Remove the largest edge

[Figure: f maps a pattern with edges 1, 2, 3 on vertices 1-4 to its parent with edges 1, 2 on vertices 1-3]
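The reduction map can be sketched as a function on edge lists. This is an illustration under assumptions rather than the authors' code: the name `reduce_pattern` is hypothetical, labels are omitted for brevity, and the step that drops isolated vertices and renumbers the rest is one reading of the figure, in which the parent has fewer vertices than the child.

```python
def reduce_pattern(edges):
    """Parent of a pattern: remove the largest edge, then drop vertices
    that no longer touch any edge and renumber the rest from 1.

    edges: iterable of (i, j) tuples with i < j.
    """
    edges = sorted(edges)            # tuple order = total order among edges
    if not edges:
        raise ValueError("the empty pattern is the root and has no parent")
    remaining = edges[:-1]           # drop the largest edge
    used = sorted({v for e in remaining for v in e})
    rank = {v: r for r, v in enumerate(used, start=1)}  # order-preserving
    return [(rank[i], rank[j]) for (i, j) in remaining]


# Hypothetical pattern with three edges on vertices 1-4.
child = [(1, 2), (1, 3), (3, 4)]
parent = reduce_pattern(child)
print(parent)

# Applying the map repeatedly always reaches the empty root pattern.
chain = [child]
while chain[-1]:
    chain.append(reduce_pattern(chain[-1]))
print(len(chain) - 1)  # number of steps to the root
```

Because every non-empty pattern has exactly one parent, following the map from any pattern traces a single path to the root, which is what turns the lattice into a tree.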


Search tree induced by the
reduction map
• Applying the reduction map to each
element induces a search tree

Inverting the reduction map (f⁻¹)

• When traversing the tree from the root,
child nodes are created on demand
• In most cases, inverting the reduction
map takes the following two steps:
★ Generate all candidate children
★ Keep those that the reduction map sends back to the current node

• However, in this particular case, the
reduction map can be inverted explicitly
★ An explicit pattern extension rule
(parent to children) can be derived
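For the simpler problem of enumerating the edge subsets of one fixed linear graph, the explicit inversion is easy to state: since the parent is obtained by removing the largest edge, the children of a pattern are exactly the patterns obtained by adding one edge larger than every edge already present. The sketch below covers only that simplified setting and is not the LGM pattern extension rule, which also has to handle vertex labels and newly introduced vertices. The function name `enumerate_edge_subsets` is hypothetical.

```python
def enumerate_edge_subsets(edges):
    """Yield every non-empty edge subset of a linear graph exactly once,
    walking the search tree depth first from the empty root pattern."""
    edges = sorted(edges)  # total order among edges

    def visit(pattern, next_index):
        if pattern:
            yield tuple(pattern)
        # Children: append any edge larger than the current largest edge.
        for idx in range(next_index, len(edges)):
            pattern.append(edges[idx])
            yield from visit(pattern, idx + 1)
            pattern.pop()

    yield from visit([], 0)


subsets = list(enumerate_edge_subsets([(1, 2), (1, 3), (3, 4)]))
for s in subsets:
    print(s)
```

No candidate is ever generated and then rejected, and no duplicate check is needed, because each subset is reached only through the chain of its own prefixes in edge order.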

Pattern extension rule


Traversing search tree from root
• Depth-first traversal, chosen for its memory efficiency

Frequent subgraph mining
• Basic idea: find all possible extensions of the
current pattern in the graph database, and
extend the pattern
• Occurrence list L_G(g)
★ Records every occurrence of a pattern g in
the graph database G
★ The support of a pattern g is calculated from
its occurrence list

• Use the anti-monotonicity of support
for pruning
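The interplay of occurrence lists, support, and pruning can be sketched as a generic depth-first miner. Several things here are assumptions for illustration: support is taken to be the number of distinct graphs in the occurrence list, an occurrence is a (graph id, matched vertices) pair, and `extend` stands in for the pattern extension step. The toy table `TOY_CHILDREN` is made-up data, not output of LGM.

```python
def support(occurrences):
    """Number of distinct database graphs in which the pattern occurs
    (assumed definition of support)."""
    return len({graph_id for graph_id, _ in occurrences})


def mine(pattern, occurrences, extend, tau, found):
    """Depth-first mining with anti-monotone pruning.

    extend(pattern, occurrences) must return (child, child_occurrences)
    pairs; a child can only occur where its parent occurs, so an
    infrequent pattern has no frequent descendants.
    """
    if support(occurrences) < tau:
        return  # prune the whole subtree
    found.append((pattern, support(occurrences)))
    for child, child_occurrences in extend(pattern, occurrences):
        mine(child, child_occurrences, extend, tau, found)


# Toy search tree: pattern name -> children with their occurrence lists.
TOY_CHILDREN = {
    "a": [("ab", [(0, (1, 2, 3)), (1, (1, 2, 4))]), ("ac", [(2, (1, 3, 4))])],
    "ab": [("abc", [(0, (1, 2, 3, 5))])],
}


def toy_extend(pattern, occurrences):
    return TOY_CHILDREN.get(pattern, [])


found = []
root_occurrences = [(0, (1, 2)), (1, (1, 2)), (2, (1, 3)), (2, (2, 4))]
mine("a", root_occurrences, toy_extend, 2, found)
print(found)
```

Graph 2 contributes two occurrences of the root pattern but counts once toward its support, and the children "ac" and "abc" are cut off because each occurs in a single graph.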



Motif extraction from protein
3D structures
• Pairs of homologous proteins from thermophilic
and mesophilic organisms
• Construct a linear graph from a protein
★ Use vertex order from N- to C-terminus
★ Assign vertex labels from {1,...,6}
★ Draw an edge between pairs of amino acid
residues whose distance is within 5 Å
• Number of graphs: 742; average number of vertices: 371;
average number of edges: 496
• Rank the enumerated patterns by statistical
significance (p-value)
★ Association with thermophilic/mesophilic labels
★ Fisher's exact test
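The graph construction and the ranking statistic can be sketched as follows. Everything beyond the slide's description is an assumption: one 3D coordinate per residue, an inclusive cutoff, a one-sided test, and the function names `contact_graph` and `fisher_one_sided`. The coordinates and counts in the example are made up.

```python
import math
from math import comb


def contact_graph(coords, labels, cutoff=5.0):
    """Build a linear graph from residues listed from N- to C-terminus.

    coords: one (x, y, z) point per residue, in angstroms.
    labels: one vertex label per residue (e.g. a class in 1..6).
    Returns (vertex_labels, edges) with vertices numbered from 1.
    """
    n = len(coords)
    edges = [
        (i + 1, j + 1)
        for i in range(n)
        for j in range(i + 1, n)
        if math.dist(coords[i], coords[j]) <= cutoff
    ]
    return dict(enumerate(labels, start=1)), edges


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]:
    probability of a count >= a in the top-left cell with fixed margins.

    Assumed layout: rows = thermophilic / mesophilic,
    columns = pattern present / pattern absent.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / comb(n, col1)


vertex_labels, edges = contact_graph(
    [(0, 0, 0), (3, 0, 0), (10, 0, 0)], [1, 2, 3])
print(vertex_labels, edges)
print(fisher_one_sided(3, 0, 0, 3))
```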

Runtime comparison
• Compared with gSpan
• Made gapped linear graphs and ran gSpan on them
• LGM is faster than gSpan


• Minimum support = 10
• 103 patterns with p-value < 0.001
• Thermophilic (TATA), Mesophilic (pol II)
★ Both function as DNA-binding
proteins, but their thermostability
differs


Mapping motifs in 3D structure

• Thermophilic (TATA), Mesophilic (pol II)


Summary

• Efficient algorithm for mining subgraphs from
linear graphs
• The search tree is defined by the reverse search
principle
• Patterns include disconnected subgraphs
• Enumeration has polynomial delay
• Interesting patterns found in proteins