Lgm pakdd2011 public: Presentation Transcript

  • The 15th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD2011), 25 May 2011
    LGM: Mining Frequent Subgraphs from Linear Graphs
    Yasuo Tabei, ERATO Minato Project, Japan Science and Technology Agency
    Joint work with Daisuke Okanohara (Preferred Infrastructure), Shuichi Hirose (AIST), Koji Tsuda (AIST)
  • Outline
    • Introduction to linear graph
      ★ Linear subgraph relation
      ★ Total order among edges
    • Frequent subgraph mining from a set of linear graphs
    • Experiments
      ★ Motif extraction from protein 3D structures
  • Linear graph (Davydov et al., 2004)
    • Labeled graph whose vertices are totally ordered
    • Linear graph g = (V, E, L_V, L_E)
      ‣ V ⊂ N : ordered vertex set
      ‣ E ⊆ V × V : edge set
      ‣ L_V : V → Σ_V : vertex labels
      ‣ L_E : E → Σ_E : edge labels
    • Example (figure): six vertices in order 1 to 6 with labels A, B, A, B, C, A, connected by edges labeled c, b, a, a
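The definition above maps naturally onto a small data structure. The sketch below is one possible Python encoding, not the authors' implementation: the class name `LinearGraph`, its field names, and the convention that every edge is stored as a pair (i, j) with i < j are assumptions made for illustration. The vertex labels follow the slide's example, but the edge endpoints are invented because the figure did not survive in the transcript.

```python
from dataclasses import dataclass


@dataclass
class LinearGraph:
    """Labeled graph whose vertices are totally ordered (hypothetical encoding)."""
    vertex_labels: dict  # vertex index (positive int) -> vertex label
    edge_labels: dict    # (i, j) with i < j -> edge label

    def __post_init__(self):
        for (i, j) in self.edge_labels:
            # Assumption: an edge is written with its left endpoint first.
            if not i < j:
                raise ValueError(f"edge {(i, j)} must satisfy i < j")
            if i not in self.vertex_labels or j not in self.vertex_labels:
                raise ValueError(f"edge {(i, j)} uses an unknown vertex")

    def vertices(self):
        """Vertices in their total order."""
        return sorted(self.vertex_labels)

    def edges(self):
        """Edges sorted by left endpoint, then by right endpoint."""
        return sorted(self.edge_labels)


# Vertex labels as on the slide; the edge endpoints are made up.
g = LinearGraph(
    vertex_labels={1: "A", 2: "B", 3: "A", 4: "B", 5: "C", 6: "A"},
    edge_labels={(1, 3): "a", (2, 6): "c", (2, 4): "b", (5, 6): "a"},
)
```

Storing both label maps as dictionaries keeps the vertex order implicit in the integer keys, so no separate ordering structure is needed.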
  • Linear subgraph relation
    • g1 is a linear subgraph of g2 if
      i) Conventional subgraph condition
        ★ Vertex labels are matched
        ★ All edges of g1 exist in g2 with the correct labels
      ii) The order of vertices is conserved
    • Example (figure): g1 ⊂ g2, where g1 has vertices 1 to 3 and g2 has vertices 1 to 6
  • Subgraph but not linear subgraph
    • g1 is a subgraph of g2
      ★ Vertex labels are matched
      ★ All edges in g1 also exist in g2 with the correct labels
    • g1 is not a linear subgraph of g2
      ★ The order of vertices is not conserved
    • Example (figure): g1 with vertices 1 to 3 and g2 with vertices 1 to 4
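The two conditions can be checked by brute force, which makes the difference between a subgraph and a linear subgraph concrete. The following is a minimal sketch under stated assumptions: graphs are passed as plain dictionaries (vertex to label, and (i, j) with i < j to edge label), the function name `is_linear_subgraph` is hypothetical, and the toy graphs are invented rather than taken from the slides. It tries every order-preserving placement of g1's vertices in g2, so it is exponential and only meant for illustration.

```python
from itertools import combinations


def is_linear_subgraph(v1, e1, v2, e2):
    """Return True if graph 1 is a linear subgraph of graph 2.

    v1, v2: dict vertex -> label; e1, e2: dict (i, j) with i < j -> label.
    """
    vs1, vs2 = sorted(v1), sorted(v2)
    # combinations() of a sorted list yields increasing tuples, so pairing
    # the k-th vertex of graph 1 with the k-th chosen vertex keeps the order.
    for chosen in combinations(vs2, len(vs1)):
        m = dict(zip(vs1, chosen))
        if any(v1[u] != v2[m[u]] for u in vs1):
            continue  # some vertex label does not match
        if all(e2.get((m[i], m[j])) == lab for (i, j), lab in e1.items()):
            return True  # every edge exists with the correct label
    return False


# Invented host graph: labels A, B, A, B and two labeled edges.
host_v = {1: "A", 2: "B", 3: "A", 4: "B"}
host_e = {(1, 2): "a", (3, 4): "b"}

# Matches at host vertices (3, 4).
found = is_linear_subgraph({1: "A", 2: "B"}, {(1, 2): "b"}, host_v, host_e)

# B followed by A joined by an "a" edge: present if vertex order is ignored
# (host vertices 2 and 1), but no order-preserving placement exists.
reversed_found = is_linear_subgraph({1: "B", 2: "A"}, {(1, 2): "a"}, host_v, host_e)
```

The second query is the situation of the "subgraph but not linear subgraph" slide: the labels and the edge can be matched, yet only by swapping the vertex order.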
  • Total order among edges in a linear graph
    • Compare the left vertices first. If they are identical, look at the right vertices
    • ∀ e1 = (i, j), e2 = (k, l) ∈ E_g: e1 <_e e2 if and only if (i) i < k, or (ii) i = k and j < l
    • Example (figure): edges e1 = (i, j) and e2 = (k, l) drawn on vertices 1 to 4
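The comparison rule is short enough to write out directly. In this sketch the names `edge_less` and `sort_edges` are illustrative, and edges are assumed to be pairs (i, j) with i < j. The rule coincides with Python's built-in lexicographic ordering of tuples, which the last line relies on.

```python
from functools import cmp_to_key


def edge_less(e1, e2):
    """e1 <_e e2: compare left vertices first, then right vertices."""
    (i, j), (k, l) = e1, e2
    return i < k or (i == k and j < l)


def sort_edges(edges):
    """Sort edges using only the rule from the slide."""
    def compare(e1, e2):
        if edge_less(e1, e2):
            return -1
        if edge_less(e2, e1):
            return 1
        return 0
    return sorted(edges, key=cmp_to_key(compare))


edges = [(2, 3), (1, 4), (1, 2), (3, 4)]
ordered = sort_edges(edges)

# Tuple comparison in Python is lexicographic, so this gives the same order.
same_as_builtin = ordered == sorted(edges)
```

Because the order is total, every non-empty edge set has a unique largest edge, which is what the reduction map later in the talk removes.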
  • Outline
    • Introduction to linear graph
      ★ Linear subgraph relation
      ★ Total order among edges
    • Frequent subgraph mining from a set of linear graphs
    • Experiments
      ★ Motif extraction from protein 3D structures
  • Frequent subgraph mining from linear graphs
    • Enumerate all frequent subgraphs from a set of linear graphs
      ★ Subgraphs included in a set of linear graphs at least τ times (minimum support threshold)
      ★ Enumerate connected and disconnected subgraphs with a unified framework
      ★ Use reverse search for an efficient enumeration (Avis and Fukuda, 1993)
    • Polynomial delay
      ★ gSpan = exponential delay
  • Enumeration of all linear subgraphs of a linear graph
    • Before considering a mining algorithm, we first have to solve the problem of subgraph enumeration
    • How can we enumerate all subgraphs of the following linear graph without duplication? (figure)
  • Search lattice of all subgraphs (figure)
  • Reverse search (Avis and Fukuda, 1993)
    • To enumerate all subgraphs without duplication, we need to define a search tree in the search lattice
    • Reduction map f
      ★ Mapping from a child to its parent
      ★ Remove the largest edge
    • Example (figure): applying f to a graph on vertices 1 to 4 yields its parent on vertices 1 to 3
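A reduction map of this kind can be sketched on unlabeled edge lists. The code below is an illustration under assumptions, not the talk's definition: labels are ignored, edges are pairs (i, j) with i < j, and after the largest edge is removed any vertex left without an edge is dropped and the remaining vertices are renumbered from 1 (the slide's figure, which goes from four vertices to three, seems to suggest this). The function names are hypothetical.

```python
def reduce_pattern(edges):
    """Map a pattern to its parent by removing its largest edge."""
    ordered = sorted(edges)
    if not ordered:
        raise ValueError("the empty pattern is the root and has no parent")
    kept = ordered[:-1]  # tuple order equals the edge order, so the last is largest
    # Assumption: vertices that no longer touch any edge disappear, and the
    # survivors are renumbered 1, 2, ... without changing their relative order.
    used = sorted({v for edge in kept for v in edge})
    rename = {old: new for new, old in enumerate(used, start=1)}
    return [(rename[i], rename[j]) for i, j in kept]


def path_to_root(edges):
    """Apply the reduction map repeatedly until the empty pattern is reached."""
    path = [sorted(edges)]
    while path[-1]:
        path.append(reduce_pattern(path[-1]))
    return path


chain = path_to_root([(1, 2), (1, 3), (3, 4)])
```

Since every non-empty pattern has exactly one parent and each step removes one edge, following f always ends at the empty pattern, which is why the map induces a tree.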
  • Search tree induced by the reduction map
    • By applying the reduction map to each element, a search tree is induced (figure)
  • Inverting the reduction map: f⁻¹
    • When traversing the tree from the root, child nodes are created on demand
    • In most cases, inverting the reduction map takes the following two steps:
      ★ Consider all child candidates
      ★ Keep the ones that the reduction map sends back to the current node
    • However, in this particular case, the reduction map can be inverted explicitly
      ★ Can derive the pattern extension rule (parent to children)
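The contrast between the generic two-step inversion and an explicit one can be shown in a deliberately simplified setting: subsets of the edges of one fixed linear graph, with vertex indices left untouched. This is not the pattern extension rule of the talk, which is not reproduced in the transcript; all function names here are hypothetical.

```python
def parent(edge_set):
    """Reduction map on non-empty edge subsets: drop the largest edge."""
    return edge_set - {max(edge_set)}


def children_by_filtering(edge_set, host_edges):
    """Generic inversion: build all one-edge extensions, keep those whose parent is edge_set."""
    candidates = [edge_set | {e} for e in host_edges if e not in edge_set]
    return [c for c in candidates if parent(c) == edge_set]


def children_explicit(edge_set, host_edges):
    """Explicit inversion for this setting: only add edges above the current largest edge."""
    largest = max(edge_set) if edge_set else None
    return [edge_set | {e} for e in host_edges if largest is None or e > largest]


host_edges = [(1, 2), (1, 3), (2, 4)]
current = frozenset({(1, 3)})

filtered = children_by_filtering(current, host_edges)
explicit = children_explicit(current, host_edges)
```

Both functions return the same children, but the explicit version never constructs a candidate that it later throws away, which is the benefit the slide attributes to an explicit inverse.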
  • Pattern extension rule
  • Traversing search tree from root
    • Depth-first traversal, chosen for its memory efficiency (figure)
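A depth-first traversal of such a search tree can be written as a recursive generator. The sketch below again uses the simplified setting of edge subsets of one fixed graph (an assumption, with a hypothetical function name) and creates children on demand by appending only edges larger than every edge already chosen. At any moment it holds just the current root-to-node path, which is where the memory saving comes from.

```python
def enumerate_edge_subsets(host_edges):
    """Yield every subset of host_edges exactly once, depth first."""
    ordered = sorted(host_edges)

    def visit(current, start):
        yield current
        # Children: append one edge larger than every edge in `current`.
        for pos in range(start, len(ordered)):
            yield from visit(current + [ordered[pos]], pos + 1)

    yield from visit([], 0)


subsets = list(enumerate_edge_subsets([(1, 2), (1, 3), (2, 3)]))
```

For three edges this visits 2³ = 8 subsets, starting from the empty one, and no subset is produced twice because each has a single parent in the tree.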
  • Frequent subgraph mining
    • Basic idea: find all possible extensions of the current pattern in the graph database, and extend the pattern
    • Occurrence list L_G(g)
      ★ Record every occurrence of a pattern g in the graph database G
      ★ Calculate the support of a pattern g from the occurrence list
    • Use the anti-monotonicity of support for pruning (figure)
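An occurrence list and the pruning test can be illustrated with a brute-force matcher. This is a sketch under assumptions, not LGM itself: graphs and patterns are pairs of dictionaries (vertex to label, and (i, j) with i < j to edge label), support is taken to be the number of database graphs containing the pattern, occurrences are found by trying every order-preserving vertex placement, and the database is invented. The function names are hypothetical.

```python
from itertools import combinations


def occurrences(pattern, database):
    """List (graph id, matched vertices) for every occurrence of pattern."""
    pv, pe = pattern
    pvs = sorted(pv)
    found = []
    for gid, (gv, ge) in enumerate(database):
        for chosen in combinations(sorted(gv), len(pvs)):
            m = dict(zip(pvs, chosen))  # order-preserving vertex mapping
            labels_ok = all(pv[u] == gv[m[u]] for u in pvs)
            edges_ok = all(ge.get((m[i], m[j])) == lab for (i, j), lab in pe.items())
            if labels_ok and edges_ok:
                found.append((gid, chosen))
    return found


def support(pattern, database):
    """Assumption: support counts distinct graphs, not individual occurrences."""
    return len({gid for gid, _ in occurrences(pattern, database)})


# Invented database of two small linear graphs.
database = [
    ({1: "A", 2: "B", 3: "A"}, {(1, 2): "a", (2, 3): "b"}),
    ({1: "A", 2: "B"}, {(1, 2): "a"}),
]
parent_pattern = ({1: "A", 2: "B"}, {(1, 2): "a"})
child_pattern = ({1: "A", 2: "B", 3: "A"}, {(1, 2): "a", (2, 3): "b"})

tau = 2  # minimum support threshold
parent_is_frequent = support(parent_pattern, database) >= tau
child_is_frequent = support(child_pattern, database) >= tau
```

The child pattern extends the parent, so its support cannot exceed the parent's. Once a pattern falls below τ, none of its descendants in the search tree needs to be visited.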
  • Outline
    • Introduction to linear graph
      ★ Linear subgraph relation
      ★ Total order among edges
    • Frequent subgraph mining from a set of linear graphs
    • Experiments
      ★ Motif extraction from protein 3D structures
  • Motif extraction from protein 3D structures
    • Pairs of homologous proteins in thermophilic and mesophilic organisms
    • Construct a linear graph from a protein
      ★ Use vertex order from N- to C-terminal
      ★ Assign vertex labels from {1,...,6}
      ★ Draw an edge between pairs of amino acid residues whose distance is 5 Å
    • # of data: 742, avg. # of vertices: 371, avg. # of edges: 496
    • Rank the enumerated patterns by statistical significance (p-value)
      ★ Association to thermophilic/mesophilic labels
      ★ Fisher's exact test
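The two steps on this slide, building a linear graph from residue coordinates and scoring a pattern against the class labels, can be sketched as follows. Several points are assumptions rather than details from the talk: "distance is 5 Å" is read as "at most 5 Å apart", edges are left unlabeled, the labels and coordinates are toy values, and the test is computed two-sided. The function names are hypothetical.

```python
import math


def protein_to_linear_graph(coords, labels, cutoff=5.0):
    """Build (vertex_labels, edges) from residues listed from N- to C-terminal."""
    vertex_labels = {i + 1: lab for i, lab in enumerate(labels)}
    edges = set()
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            # Assumption: residues at most `cutoff` angstroms apart are joined.
            if math.dist(coords[i], coords[j]) <= cutoff:
                edges.add((i + 1, j + 1))
    return vertex_labels, edges


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    n, row1, col1 = a + b + c + d, a + b, a + c

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return math.comb(col1, x) * math.comb(n - col1, row1 - x) / math.comb(n, row1)

    observed = prob(a)
    low, high = max(0, row1 + col1 - n), min(row1, col1)
    # Sum the probabilities of all tables at most as likely as the observed one.
    return sum(p for p in map(prob, range(low, high + 1)) if p <= observed * (1 + 1e-9))


# Toy data (made up): four residues with hypothetical labels from {1,...,6}.
coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (10.0, 0.0, 0.0), (3.0, 3.0, 0.0)]
vertex_labels, edges = protein_to_linear_graph(coords, labels=[1, 4, 2, 6])

# Rows: pattern present / absent; columns: thermophilic / mesophilic.
p_value = fisher_exact_two_sided(3, 0, 0, 3)
```

Ranking then amounts to computing one such p-value per enumerated pattern and sorting in ascending order.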
  • Runtime comparison
    • Compared to gSpan
    • Made gapped linear graphs and ran gSpan on them
    • LGM is faster than gSpan
  • Minimum support = 10
    • 103 patterns whose p-value < 0.001
    • Thermophilic (TATA), Mesophilic (pol II)
      ★ Share the function of DNA-binding protein, but differ in thermostability
  • Mapping motifs in 3D structure
    • Thermophilic (TATA), Mesophilic (pol II)
  • Summary
    • Efficient subgraph mining algorithm from linear graphs
    • Search tree is defined by reverse search principle
    • Patterns include disconnected subgraphs
    • Computational time is polynomial-delay
    • Interesting patterns from proteins