Survey on Frequent Pattern Mining on Graph Data - Slides

  • Sriskandarajah Suhothayan Kasun Gajasinghe Isuru Loku Narangoda Subash Chaturanga
  • Outline
    • Introduction
    • Basic principles
    • Solution patterns
  • Introduction
    • Graphs can be seen everywhere.
    • In computer science, a graph is viewed as an abstract data structure that represents relationships among data.
  • Graph based data mining
    • Graph-based data mining is the discovery of useful and understandable patterns in graph representations of data.
    • The main subject area of graph-based data mining is identifying frequently occurring subgraph patterns.
  • Approaches
    • In the recent past, significant work has been done in this area to develop algorithms that mine graph data efficiently.
    • In this paper we discuss several such well-known algorithms under the following categories.
      • Mathematical Graph Theory Based Approaches
      • Greedy Search Based Approaches
      • Inductive Logic Programming Approach
      • Inductive Database Based Approaches
  • Applications
    • Bioinformatics
      • Mining biochemical structures
      • Finding biologically conserved subnetworks
    • Chemical compound analysis
    • Web browsing pattern analysis
    • Intrusion network analysis
    • Mining communication networks
  • Basic Principles
    • Subgraph categories
      • general subgraphs
      • induced subgraphs
      • connected subgraphs
    • Subgraph Isomorphism Problem
      • Decides whether there exists a one-to-one mapping from the vertices of one graph to vertices of another graph that preserves the edges between them.
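The subgraph isomorphism test described above can be illustrated with a brute-force sketch. It assumes undirected, unlabeled graphs given as a vertex list plus an edge list, and it checks general (not induced) subgraphs. The function name and the toy graphs are hypothetical, and the exhaustive enumeration is only practical for very small inputs.

```python
from itertools import permutations

def is_subgraph_isomorphic(pattern, graph):
    """Return True if the pattern maps one-to-one into the graph so that
    every pattern edge lands on a graph edge (undirected, unlabeled)."""
    pattern_vertices, pattern_edges = pattern
    graph_vertices, graph_edges = graph
    graph_edge_set = {frozenset(edge) for edge in graph_edges}
    # Try every injective assignment of pattern vertices to graph vertices.
    for image in permutations(graph_vertices, len(pattern_vertices)):
        mapping = dict(zip(pattern_vertices, image))
        if all(frozenset((mapping[u], mapping[v])) in graph_edge_set
               for u, v in pattern_edges):
            return True
    return False

# Hypothetical toy graphs: (vertices, edges)
path = (["a", "b", "c"], [("a", "b"), ("b", "c")])
triangle = ([0, 1, 2], [(0, 1), (1, 2), (0, 2)])
square = ([0, 1, 2, 3], [(0, 1), (1, 2), (2, 3), (3, 0)])

print(is_subgraph_isomorphic(path, square))      # True
print(is_subgraph_isomorphic(triangle, square))  # False
```

The number of assignments grows factorially with the graph size, which is one way to see why the mining algorithms in the later slides try to avoid or limit this test.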
  • Basic Principles
    • Graph Invariants
      • Quantities that characterize the topological structure of a graph
        • number of vertices
        • degree of each vertex (the number of edges connected to the vertex)
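As a small illustration of the invariants listed above, the sketch below computes the number of vertices, the number of edges, and the sorted degree sequence of an undirected graph. The function names and sample graphs are hypothetical. Equal invariants are a necessary condition for two graphs to be isomorphic, not a sufficient one, so this can only serve as a cheap pre-filter.

```python
from collections import Counter

def graph_invariants(vertices, edges):
    """Compute simple invariants of an undirected graph."""
    degree = Counter({v: 0 for v in vertices})
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return {
        "num_vertices": len(vertices),
        "num_edges": len(edges),
        "degree_sequence": sorted(degree.values(), reverse=True),
    }

def may_be_isomorphic(g1, g2):
    """False means the graphs are certainly not isomorphic; True means a full test is still needed."""
    return graph_invariants(*g1) == graph_invariants(*g2)

# Hypothetical toy graphs: (vertices, edges)
path3 = ([0, 1, 2], [(0, 1), (1, 2)])
triangle = (["x", "y", "z"], [("x", "y"), ("y", "z"), ("x", "z")])

print(graph_invariants(*path3)["degree_sequence"])  # [2, 1, 1]
print(may_be_isomorphic(path3, triangle))           # False
```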
  • Solution Approaches: Categorization
    • Completeness
      • complete search
      • heuristic search
    • Subgraph isomorphism matching problem
      • direct
      • indirect (solves the subgraph similarity problem)
  • Solution Approaches
    • Greedy search
    • Inductive logic programming (ILP)
    • Inductive database
    • Complete level-wise search
    • Support Vector Machine (SVM)
  • Greedy search
    • The conventional solution
    • Categorized into
      • Depth-First Search (DFS) and
      • Breadth-First Search (BFS)
    • Beam search
      • The disadvantage: as the search proceeds, it prunes the branches that do not fit within the maximum branch number limit
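The pruning drawback of beam search can be seen in a minimal sketch. It assumes a generic state space supplied through hypothetical `expand` and `score` callables, and the toy tree and its values are invented for illustration. With a beam width of 1 the branch leading to the best state is discarded early; a wider beam keeps it.

```python
def beam_search(start, expand, score, beam_width, max_steps):
    """Return the best-scoring state seen, keeping at most beam_width states per level."""
    beam = [start]
    best = start
    for _ in range(max_steps):
        candidates = [child for state in beam for child in expand(state)]
        if not candidates:
            break
        # Only the beam_width best candidates survive; the rest are pruned for good.
        beam = sorted(candidates, key=score, reverse=True)[:beam_width]
        best = max([best] + beam, key=score)
    return best

# Hypothetical toy search tree and state values
children = {"root": ["a", "b"], "a": ["a1"], "b": ["b1"]}
value = {"root": 0, "a": 5, "b": 4, "a1": 6, "b1": 10}

def expand(state):
    return children.get(state, [])

narrow = beam_search("root", expand, value.get, beam_width=1, max_steps=3)
wide = beam_search("root", expand, value.get, beam_width=2, max_steps=3)
print(narrow, wide)  # a1 b1
```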
  • Inductive logic programming (ILP)
    • Induction?
      • A combination of 'abduction' (guessing), which selects some hypotheses, and 'justification', which checks that those hypotheses explain the observed facts.
  • Inductive logic programming (ILP)
    • positive examples + negative examples + background knowledge => hypothesis
    • background knowledge
      • controls the search process (prunes some search paths)
      • introduces predetermined subgraph patterns
    • ILP can fall into any of the four categories
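The way positive examples, negative examples and background knowledge constrain a hypothesis can be sketched in a heavily simplified, propositional form. Real ILP systems work with first-order logic clauses; here, as an assumption made only for illustration, examples and hypotheses are plain sets of fact strings, and all names and facts are hypothetical.

```python
def covers(hypothesis, example, background):
    """A hypothesis covers an example if all its facts hold in the example plus the background."""
    return hypothesis <= (example | background)

def is_consistent(hypothesis, positives, negatives, background):
    """Accept a hypothesis only if it covers every positive and no negative example."""
    return (all(covers(hypothesis, e, background) for e in positives)
            and not any(covers(hypothesis, e, background) for e in negatives))

# Hypothetical facts describing small labeled graphs
background = {"atom(C)"}
positives = [{"edge(C,O)", "edge(C,N)"}, {"edge(C,O)", "edge(O,H)"}]
negatives = [{"edge(C,N)"}]

print(is_consistent({"edge(C,O)", "atom(C)"}, positives, negatives, background))  # True
print(is_consistent({"atom(C)"}, positives, negatives, background))               # False
```

The second hypothesis is rejected because it is too general: it also covers the negative example.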
  • Inductive database
    • Subgraphs and relations among subgraphs are pre-generated and stored in an inductive database
    • Advantage: fast operation, as the basic patterns are already pre-generated
    • Disadvantage: requires a large amount of computation and memory
  • Complete level-wise search
    • It is complete and direct
    • Here the data are not sets of items but graphs, each combining a vertex set V(G) and an edge set E(G) that carry topological information.
    • An extended version of the Apriori algorithm is used
  • Support Vector Machine (SVM)
    • Used for classification and regression analysis
    • A non-probabilistic binary linear classifier
    • SVM is a heuristic search and an indirect method in terms of the subgraph isomorphism problem.
  • Categorization
    • Mathematical Graph Theory Based Approaches
    • Greedy Search Based Approaches
    • Inductive Logic Programming Approach
    • Inductive Database Based Approaches
    • Kernel Function Based Approaches
  • Greedy Search Based Approaches
    • Use heuristics to evaluate the solution.
    • Two major works
      • SUBDUE
      • GBI
  • Graph Based Induction (GBI)
    • Has two methods
      • one for chunking and the other for extracting patterns.
    • Can arrive at local minimum solutions, using pairwise chunking at each step with an opportunistic beam search.
    • Can reconstruct the original graph as and when needed
    • Advantage: handles both directed and undirected labelled graphs, even those with closed paths, including closed edges.
    • Uses an empirical graph size definition to limit continuous compression of the graph, so the graph never becomes a single vertex.
    • Extracts substructures and constructs a classifier.
  • SUBDUE
    • A graph-based relational learning system
    • Compresses the graphs based on the Minimum Description Length (MDL) principle
    • Does not face high computational complexity (uses a computationally constrained beam search)
    • May miss some optimal subgraphs
    • Produces a smaller number of highly interesting patterns, rather than a large number of patterns from which the interesting ones must then be identified.
    • Runtime is much larger than that of gSpan and FSG and is non-linear in the dataset size (because of its implementation of graph isomorphism testing)
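The compression idea behind SUBDUE can be illustrated with a size-based proxy. This is a sketch under stated assumptions, not SUBDUE's actual MDL encoding: graph size is taken as vertices plus edges rather than bits, instances are assumed not to overlap, and each instance is assumed to collapse into a single vertex. All function names and numbers are hypothetical.

```python
def graph_size(num_vertices, num_edges):
    """Crude stand-in for description length: vertices plus edges."""
    return num_vertices + num_edges

def compression_value(graph, substructure, num_instances):
    """Ratio size(G) / (size(S) + size(G compressed by S)); higher means better compression."""
    graph_v, graph_e = graph
    sub_v, sub_e = substructure
    # Each non-overlapping instance is replaced by one vertex and loses its internal edges.
    compressed_v = graph_v - num_instances * (sub_v - 1)
    compressed_e = graph_e - num_instances * sub_e
    return graph_size(graph_v, graph_e) / (
        graph_size(sub_v, sub_e) + graph_size(compressed_v, compressed_e)
    )

# Hypothetical graph with 10 vertices and 12 edges
triangle_score = compression_value((10, 12), (3, 3), num_instances=2)  # 22 / 18
edge_score = compression_value((10, 12), (2, 1), num_instances=3)      # 22 / 19
print(triangle_score > edge_score)  # True
```

Under this proxy the triangle occurring twice is preferred over the single edge occurring three times, which mirrors the preference for a few highly compressing patterns noted above.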
  • Mathematical Approaches
    • Apriori-based methods
      • AGM
      • FSG
    • Pattern Growth methods
      • gSpan
  • Apriori-based Approach
    • AGM
      • Used to mine “frequent induced subgraphs”
      • Works with both directed and undirected graphs
      • Importantly, this algorithm is not limited to connected graphs; it also supports isolated graphs.
      • Breadth-first search: creates new candidates for level k+1 by joining two graphs at level k.
      • AGM generates new graphs by adding a new node, and then proceeds as per Apriori.
    • FSG
      • FSG works better on graph data sets with more edge and vertex labels
      • It is an optimized version of AGM with added techniques for efficiency.
      • FSG increases the efficiency of candidate generation for frequent subgraphs by introducing the Transaction ID (TID) method.
      • It also uses efficient candidate subgraph generation algorithms.
      • FSG is Apriori-based and therefore uses a level-wise algorithm
      • Faces two challenges:
        • candidate generation: generating subgraph candidates of a given size is more complicated and costly than generating itemsets
        • pruning false positives: the subgraph isomorphism test is an NP-complete problem
    • gSpan (pattern growth method)
      • Uses Depth-First Search (DFS)
      • Finds frequent subgraphs one by one, from small to large.
      • Advantages
        • No candidate generation and no false-positive test
        • Better space savings thanks to DFS.
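The Transaction ID (TID) idea mentioned for FSG above can be sketched as a set intersection. A graph can contain a candidate only if it contains every already-frequent subgraph of that candidate, so intersecting their TID lists bounds the candidate's support before any isomorphism test is run. The pattern names, TID lists and function names below are hypothetical.

```python
def possible_tids(sub_patterns, tid_lists):
    """Intersect the TID lists of a candidate's already-frequent subgraphs."""
    lists = [tid_lists[p] for p in sub_patterns]
    return set.intersection(*lists) if lists else set()

def survives_tid_pruning(sub_patterns, tid_lists, min_support):
    """A candidate whose upper bound is below min_support needs no isomorphism test."""
    return len(possible_tids(sub_patterns, tid_lists)) >= min_support

# Hypothetical TID lists: pattern -> ids of the dataset graphs that contain it
tid_lists = {"A-B": {0, 1, 2}, "B-C": {1, 2, 3}, "A-C": {2, 4}}

print(possible_tids(["A-B", "B-C"], tid_lists))                    # {1, 2}
print(survives_tid_pruning(["A-B", "B-C", "A-C"], tid_lists, 2))   # False
```

Candidates that survive still need a subgraph isomorphism test, but only against the graphs in the intersected TID set.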
  • [Figure: a graph dataset with graphs (A), (B), (C) and its frequent patterns (1), (2); minimum support is 2]
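The notion of minimum support can be illustrated on the smallest possible patterns, single labeled edges. The sketch below counts each pattern at most once per graph and keeps those found in at least `min_support` graphs. The three graphs are hypothetical and are not the ones from the figure, which is not recoverable here.

```python
from collections import Counter

def frequent_edge_patterns(graph_dataset, min_support):
    """Return single-edge label patterns that occur in at least min_support graphs."""
    support = Counter()
    for vertex_labels, edges in graph_dataset:
        # A set ensures each pattern is counted once per graph, however often it occurs.
        patterns_in_graph = {
            tuple(sorted((vertex_labels[u], vertex_labels[v]))) for u, v in edges
        }
        support.update(patterns_in_graph)
    return {p: s for p, s in support.items() if s >= min_support}

# Hypothetical dataset: (vertex id -> label, undirected edges)
graph_a = ({0: "C", 1: "O", 2: "N"}, [(0, 1), (0, 2)])
graph_b = ({0: "C", 1: "O"}, [(0, 1)])
graph_c = ({0: "C", 1: "N", 2: "N"}, [(0, 1), (1, 2)])

result = frequent_edge_patterns([graph_a, graph_b, graph_c], min_support=2)
print(result == {("C", "O"): 2, ("C", "N"): 2})  # True
```

A level-wise miner would use these frequent edges as its first level and grow larger candidates from them.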
  • Three further approaches to mining graph-based data:
      • Inductive Logic Programming approach
      • Inductive database approach
      • Kernel function based approach
  • ILP approach. ILP systems construct a predictive model for a given data set by searching a large space of candidate hypotheses.
      • WARMR: proposed in 1998. A combination of Apriori-like level-wise search and the ILP method.
      • But it has a high computational complexity.
      • FARMER: proposed in 2001. Runs two orders of magnitude faster than WARMR.
  • Inductive database approach. Inductive databases are capable of handling patterns within data and are quite different from typical databases. They use an interactive querying process to mine the data they hold.
      • MolFea is an effort in this area. It mines linear fragments in chemical compounds with better computational efficiency.
      • It also performs a complete search of the paths in graph data.
  • Kernel function based approach. The “kernel” function defines a similarity between two graphs. The paper covers two efforts based on this approach, which classify graphs into binary classes with an SVM (Support Vector Machine).
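To make the idea of a kernel as a graph similarity concrete, here is a minimal sketch of one very simple kernel: the dot product of vertex-label histograms. It is chosen for illustration only and is not claimed to be one of the kernels used in the efforts the paper describes; it ignores edges entirely. The function name and the sample label lists are hypothetical.

```python
from collections import Counter

def label_histogram_kernel(labels_a, labels_b):
    """Similarity of two graphs as the dot product of their vertex-label counts."""
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    return sum(counts_a[label] * counts_b[label]
               for label in counts_a.keys() & counts_b.keys())

# Hypothetical vertex labels of two small graphs
graph_1 = ["C", "C", "O"]
graph_2 = ["C", "O", "O", "N"]

print(label_histogram_kernel(graph_1, graph_2))  # 4
print(label_histogram_kernel(graph_1, graph_1))  # 5
```

A matrix of such pairwise values over a set of labeled graphs is the kind of input a kernel-based classifier such as an SVM works from.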