# Lgm pakdd2011 public

## on May 26, 2011


## Lgm pakdd2011 public: Presentation Transcript

• The 15th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD2011), 25 May 2011. LGM: Mining Frequent Subgraphs from Linear Graphs. Yasuo Tabei, ERATO Minato Project, Japan Science and Technology Agency. Joint work with Daisuke Okanohara (Preferred Infrastructure), Shuichi Hirose (AIST), and Koji Tsuda (AIST).
• Outline
  • Introduction to linear graphs
    ★ Linear subgraph relation
    ★ Total order among edges
  • Frequent subgraph mining from a set of linear graphs
  • Experiments
    ★ Motif extraction from protein 3D structures
• Linear graph (Davydov et al., 2004)
  • A labeled graph whose vertices are totally ordered
  • Linear graph $g = (V, E, L_V, L_E)$
    ‣ $V \subset \mathbb{N}$: ordered vertex set
    ‣ $E \subseteq V \times V$: edge set
    ‣ $L_V : V \to \Sigma_V$: vertex labels
    ‣ $L_E : E \to \Sigma_E$: edge labels
  • Example (figure): six vertices 1 to 6 with vertex labels A, B, A, B, C, A, joined by edges labeled a, a, b, c
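The definition above can be written down as a small data structure. This is a minimal sketch, not the authors' implementation: the class name `LinearGraph`, the convention of storing each edge as a pair `(i, j)` with `i < j`, and the edge endpoints in the example are all assumptions made for illustration (the endpoints in the slide's figure cannot be recovered from the transcript).

```python
from dataclasses import dataclass


@dataclass
class LinearGraph:
    """Labeled graph whose vertices are integers in a fixed total order."""

    vertex_labels: dict[int, str]              # L_V: vertex -> label
    edge_labels: dict[tuple[int, int], str]    # L_E: (i, j) -> label

    def __post_init__(self):
        for i, j in self.edge_labels:
            if i not in self.vertex_labels or j not in self.vertex_labels:
                raise ValueError(f"edge ({i}, {j}) uses an unknown vertex")
            # Assumption: an edge is stored with its left endpoint first.
            if not i < j:
                raise ValueError(f"edge ({i}, {j}) must satisfy i < j")

    @property
    def vertices(self):
        """Vertices in their total order."""
        return sorted(self.vertex_labels)

    @property
    def edges(self):
        """Edges sorted by left endpoint, then right endpoint."""
        return sorted(self.edge_labels)


# Vertex labels follow the slide's example; the edge endpoints are hypothetical.
g = LinearGraph(
    vertex_labels={1: "A", 2: "B", 3: "A", 4: "B", 5: "C", 6: "A"},
    edge_labels={(1, 3): "a", (2, 5): "b", (4, 6): "a", (3, 6): "c"},
)
```

Keeping the labels in dictionaries keyed by vertex and by edge mirrors the two labeling functions directly.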
• Linear subgraph relation
  • g1 is a linear subgraph of g2 if
    i) the conventional subgraph condition holds
      ★ Vertex labels are matched
      ★ All edges of g1 exist in g2 with the correct labels
    ii) the order of vertices is conserved
  • Example (figure): a three-vertex linear graph g1 shown as a linear subgraph (g1 ⊂ g2) of a six-vertex linear graph g2
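The two conditions can be checked by brute force for small graphs. The sketch below is an illustration only and is not the method used in LGM: it tries every increasing choice of vertices of g2, so condition ii) holds by construction, and then tests condition i). The function name, the list/dict representation, and the example graphs are assumptions.

```python
from itertools import combinations


def is_linear_subgraph(labels1, edges1, labels2, edges2):
    """Return True if graph 1 is a linear subgraph of graph 2.

    labels: list of vertex labels in vertex order (vertex k has labels[k - 1]).
    edges:  dict mapping (i, j), 1-based with i < j, to an edge label.
    """
    n1, n2 = len(labels1), len(labels2)
    # Each combination is increasing, so the vertex order is conserved.
    for image in combinations(range(1, n2 + 1), n1):
        if any(labels1[k] != labels2[image[k] - 1] for k in range(n1)):
            continue  # vertex labels do not match
        if all(
            edges2.get((image[i - 1], image[j - 1])) == label
            for (i, j), label in edges1.items()
        ):
            return True  # every edge of graph 1 exists with the right label
    return False


# Hypothetical six-vertex graph g2.
g2_labels = ["A", "B", "A", "B", "C", "A"]
g2_edges = {(1, 3): "a", (2, 5): "b", (4, 6): "c"}

# A-A joined by an edge labeled "a" maps to vertices 1 and 3 of g2.
found = is_linear_subgraph(["A", "A"], {(1, 2): "a"}, g2_labels, g2_edges)

# Same labels and edge as the two-vertex graph A-B, but in reversed order:
# an ordinary subgraph, yet not a linear subgraph.
reversed_order = is_linear_subgraph(["B", "A"], {(1, 2): "x"},
                                    ["A", "B"], {(1, 2): "x"})
```

The search is exponential in the size of g1, which is acceptable only for toy inputs.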
• Subgraph but not linear subgraph
  • g1 is a subgraph of g2
    ★ Vertex labels are matched
    ★ All edges in g1 also exist in g2 with the correct labels
  • g1 is not a linear subgraph of g2
    ★ The order of vertices is not conserved
  • Example (figure): a three-vertex graph g1 and a four-vertex graph g2
• Total order among edges in a linear graph
  • Compare the left vertices first. If they are identical, look at the right vertices
  • $\forall e_1 = (i, j),\ e_2 = (k, l) \in E_g$: $e_1 <_e e_2$ if and only if (i) $i < k$, or (ii) $i = k$ and $j < l$
  • Example (figure): edges $e_1 = (i, j)$ and $e_2 = (k, l)$ drawn over vertices 1 to 4, with the edges numbered according to this order
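The order defined above is the lexicographic order on pairs, which is also how Python compares tuples. The short sketch below spells out the rule and checks it against the built-in comparison; the name `edge_less` and the sample edges are illustrative only.

```python
def edge_less(e1, e2):
    """e1 <_e e2: compare left vertices first, then right vertices."""
    (i, j), (k, l) = e1, e2
    return i < k or (i == k and j < l)


edges = [(2, 4), (1, 3), (1, 2), (2, 3)]

# Tuple comparison is lexicographic, so sorted() applies the same order.
ordered = sorted(edges)

# The explicit rule and the built-in comparison agree on every pair.
agree = all(edge_less(a, b) == (a < b) for a in edges for b in edges)
```

Because the order is total, the "largest edge" of any non-empty edge set is well defined.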
• Outline
  • Introduction to linear graphs
    ★ Linear subgraph relation
    ★ Total order among edges
  • Frequent subgraph mining from a set of linear graphs
  • Experiments
    ★ Motif extraction from protein 3D structures
• Frequent subgraph mining from linear graphs
  • Enumerate all frequent subgraphs from a set of linear graphs
    ★ Subgraphs included in a set of linear graphs at least τ times (τ: minimum support threshold)
    ★ Enumerate connected and disconnected subgraphs within a unified framework
    ★ Use reverse search for efficient enumeration (Avis and Fukuda, 1993)
  • Polynomial delay
    ★ gSpan = exponential delay
• Enumeration of all linear subgraphs of a linear graph
  • Before considering a mining algorithm, we first have to solve the problem of subgraph enumeration
  • How to enumerate all subgraphs of the following linear graph without duplication? (figure)
• Search lattice of all subgraphs (figure)
• Reverse search (Avis and Fukuda, 1993)
  • To enumerate all subgraphs without duplication, we need to define a search tree in the search lattice
  • Reduction map f
    ★ Mapping from a child to its parent
    ★ Remove the largest edge
  • Example (figure): f maps a graph on vertices 1 to 4 with three edges to a graph on vertices 1 to 3 with two edges
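A reduction map of this kind can be sketched in a few lines. The version below ignores labels and works on edge lists only. Dropping vertices that lose their last edge and renumbering the rest is an assumption made here so that a four-vertex pattern can map to a three-vertex parent; the slides do not spell this step out. The function name is hypothetical.

```python
def reduce_pattern(edges):
    """Parent of a pattern: remove the largest edge, then renumber vertices."""
    if not edges:
        raise ValueError("the empty pattern is the root and has no parent")
    kept = sorted(edges)[:-1]  # the largest edge under the total order is last
    # Assumption: vertices left without any edge are dropped, and the
    # remaining vertices are renumbered 1..m in their original order.
    remaining = sorted({v for e in kept for v in e})
    new_id = {v: k for k, v in enumerate(remaining, start=1)}
    return tuple((new_id[i], new_id[j]) for i, j in kept)


# A hypothetical four-vertex pattern with three edges.
child = [(1, 2), (1, 3), (3, 4)]
parent = reduce_pattern(child)

# Applying the map repeatedly always ends at the empty pattern (the root).
chain = [tuple(sorted(child))]
while chain[-1]:
    chain.append(reduce_pattern(chain[-1]))
```

Every pattern has exactly one parent under this map, which is what turns the search lattice into a tree.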
• Search tree induced by the reduction map
  • By applying the reduction map to each element, a search tree can be induced (figure)
• Inverting the reduction map: $f^{-1}$
  • When traversing the tree from the root, child nodes are created on demand
  • In most cases, inverting the reduction map takes the following two steps:
    ★ Consider all candidate children
    ★ Keep the ones that the reduction map sends back to the parent
  • However, in this particular case, the reduction map can be inverted explicitly
    ★ A pattern extension rule (parent to children) can be derived
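The difference between the generic two-step inversion and an explicit inversion can be illustrated in a simplified setting: the subgraphs of one fixed linear graph, represented as subsets of its edge set, with "remove the largest edge" as the reduction map. This is a toy stand-in, not LGM's actual pattern extension rule, and all function names are hypothetical.

```python
def parent(subset):
    """Reduction map: remove the largest edge of a non-empty edge subset."""
    return subset - {max(subset)}


def children_by_filtering(subset, all_edges):
    """Generic inversion: build all candidates, keep those f maps back."""
    candidates = [subset | {e} for e in all_edges - subset]
    return {c for c in candidates if parent(c) == subset}


def children_explicit(subset, all_edges):
    """Explicit inversion: only an edge larger than every edge of the
    parent can be the one that f removes."""
    largest = max(subset) if subset else None
    return {subset | {e} for e in all_edges
            if largest is None or e > largest}


all_edges = frozenset({(1, 2), (1, 3), (2, 4)})
start = frozenset({(1, 3)})

filtered = children_by_filtering(start, all_edges)
explicit = children_explicit(start, all_edges)
```

The explicit rule never builds a candidate that is later rejected, which is the practical benefit of inverting the map in closed form.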
• Pattern extension rule (figure)
• Traversing the search tree from the root
  • Depth-first traversal, for its memory efficiency (figure)
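A depth-first traversal that creates children on demand might look as follows. As a simplifying assumption, the sketch enumerates the edge subsets of a single linear graph, where a child is obtained by adding an edge larger than all edges already chosen; it is not the LGM traversal over labeled patterns.

```python
def enumerate_subgraphs(all_edges):
    """Yield every edge subset of a linear graph exactly once, depth first."""
    ordered = sorted(all_edges)

    def visit(current, start):
        yield tuple(current)
        # Children add one edge that is larger than every edge in `current`.
        for idx in range(start, len(ordered)):
            current.append(ordered[idx])
            yield from visit(current, idx + 1)
            current.pop()  # backtrack: only the current path is kept in memory

    yield from visit([], 0)


subgraphs = list(enumerate_subgraphs([(1, 3), (1, 2), (2, 4)]))
```

Only the path from the root to the current node is stored, so memory use grows with the depth of the tree rather than with the number of subgraphs.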
• Frequent subgraph mining
  • Basic idea: find all possible extensions of the current pattern in the graph database, and extend the pattern
  • Occurrence list $L_G(g)$
    ★ Record every occurrence of a pattern g in the graph database G
    ★ Calculate the support of a pattern g from the occurrence list
  • Use the anti-monotonicity of support for pruning (figure)
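The role of the occurrence list in pruning can be sketched with toy data. The sketch assumes that an occurrence is a pair (graph id, matched vertices) and that support counts the distinct graphs containing the pattern; both are assumptions, as are the function names and the occurrence values, which are invented for illustration.

```python
def support(occurrences):
    """Number of distinct database graphs that contain the pattern."""
    return len({graph_id for graph_id, _ in occurrences})


def is_frequent(occurrences, tau):
    """A pattern below the threshold is pruned together with its subtree."""
    return support(occurrences) >= tau


# Hypothetical occurrence list of a parent pattern: (graph id, vertices).
parent_occurrences = [(0, (1, 3)), (0, (2, 5)), (1, (1, 2)), (2, (4, 6))]

# Each child occurrence extends some parent occurrence in the same graph,
# so the child's graphs form a subset of the parent's graphs.
child_occurrences = [(0, (1, 3, 4)), (2, (4, 6, 7))]

tau = 3
parent_ok = is_frequent(parent_occurrences, tau)
child_ok = is_frequent(child_occurrences, tau)
```

Counting distinct graphs rather than raw occurrences matters here: one parent occurrence may extend to several child occurrences, but it can never introduce a new graph, so support cannot increase from parent to child.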
• Outline
  • Introduction to linear graphs
    ★ Linear subgraph relation
    ★ Total order among edges
  • Frequent subgraph mining from a set of linear graphs
  • Experiments
    ★ Motif extraction from protein 3D structures
• Motif extraction from protein 3D structures
  • Pairs of homologous proteins from thermophilic and mesophilic organisms
  • Construct a linear graph from a protein
    ★ Use the vertex order from the N- to the C-terminus
    ★ Assign vertex labels from {1,...,6}
    ★ Draw an edge between pairs of amino acid residues whose distance is 5 Å
  • Number of data items: 742; average number of vertices: 371; average number of edges: 496
  • Rank the enumerated patterns by statistical significance (p-value)
    ★ Association with the thermophilic/mesophilic labels
    ★ Fisher's exact test
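The ranking step relies on Fisher's exact test applied to a 2×2 table of pattern presence against class label. Below is a minimal two-sided version using only the standard library. It is a sketch of the standard test, not the authors' code, and the counts in the usage example are made up.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    For example: a = thermophilic proteins with the pattern, b = thermophilic
    without, c = mesophilic with, d = mesophilic without.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def weight(x):
        # Unnormalized hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    observed = weight(a)
    # Sum over all tables that are at most as likely as the observed one.
    extreme = sum(weight(x) for x in range(lo, hi + 1)
                  if weight(x) <= observed)
    return extreme / comb(n, col1)


# Hypothetical counts for one pattern.
p_value = fisher_exact_two_sided(8, 2, 1, 5)
```

Comparing the integer weights before dividing avoids floating-point ties when deciding which tables count as at least as extreme as the observed one.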
• Runtime comparison
  • Compared to gSpan
  • Made gapped linear graphs and ran gSpan on them
  • LGM is faster than gSpan
• Minimum support = 10; 103 patterns with p-value < 0.001
  • Thermophilic (TATA), mesophilic (pol II)
    ★ They share the function of DNA-binding proteins, but their thermostability is different
• Mapping motifs in 3D structure
  • Thermophilic (TATA), mesophilic (pol II) (figure)
• Summary
  • Efficient subgraph mining algorithm for linear graphs
  • The search tree is defined by the reverse search principle
  • Patterns include disconnected subgraphs
  • Computational time is polynomial-delay
  • Interesting patterns from proteins