Research and improvement on
Distributed Graph Pattern
Matching algorithm
Chao Chen (University College Dublin / JST)
Toyotaro Suzumura (IBM T.J. Watson Research Center / Columbia
University / JST)
30th / Nov / 2015
Definition
Here is an example based on data from LinkedIn [1]:
Input : Data Graph, Pattern
Output: {4,6,7,8} and {5,6,7,8}
Instance
[1] A. Fard, M. U. Nisar, J. A. Miller, and L. Ramaswamy, Distributed and scalable graph pattern matching:
Models and algorithms. International Journal of Big Data (IJBD), vol. 1, no. 1, 2014.
Definition
Graph pattern matching: find subgraphs in a large graph (the data graph) that
are similar to a given graph (the pattern graph).
Graph:
1. directed or undirected edges
2. labelled or unlabelled vertices
3. labelled or unlabelled edges
Graph pattern matching is fundamental to many applications,
such as social network analysis and substructure search in
biochemistry.
Definition
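To make the definition concrete, here is a minimal sketch of a directed graph with labelled vertices. The class name, field names, and the sample labels ("PM", "DB") are assumptions chosen for illustration; they do not come from the slides, and edge labels are left out.

```python
from dataclasses import dataclass, field


@dataclass
class LabelledDigraph:
    """Directed graph with labelled vertices (edge labels omitted in this sketch)."""
    labels: dict = field(default_factory=dict)    # vertex id -> label
    children: dict = field(default_factory=dict)  # vertex id -> set of child ids
    parents: dict = field(default_factory=dict)   # vertex id -> set of parent ids

    def add_vertex(self, v, label):
        self.labels[v] = label
        self.children.setdefault(v, set())
        self.parents.setdefault(v, set())

    def add_edge(self, u, v):
        # Both endpoints must have been added with add_vertex first.
        self.children[u].add(v)
        self.parents[v].add(u)


# Hypothetical two-vertex pattern graph: a "PM" vertex pointing to a "DB" vertex.
pattern = LabelledDigraph()
pattern.add_vertex("p1", "PM")
pattern.add_vertex("p2", "DB")
pattern.add_edge("p1", "p2")
```

The same structure can hold both the data graph and the pattern graph; only the size differs.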
Challenges
1. Real-life social graphs are typically large. For
instance, Facebook has more than 500 million users
(nodes) with billions of links (edges).
2. Graph pattern matching is costly.
- Traditional algorithms, which solve this problem by linear scan, are not
practical.
- There may be more than one subgraph that matches the given pattern graph.
Challenge
Challenges
Traditional algorithm:
- Subgraph isomorphism: finds exact matches, but the problem is NP-complete and
thus not practical for massive graphs.
Distributed algorithms:
- Distributed graph simulation: a faster algorithm that relaxes
some restrictions on matches. It only preserves the child
relationships of each vertex.
- Distributed tight simulation: a novel modification of
distributed graph simulation and the state-of-the-art algorithm for
distributed graph pattern matching, with good scalability. However,
its performance was not what we expected. The algorithm we
propose is based on distributed tight simulation.
Existing algorithms
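As an illustration of what "only preserves the child relationships" means, here is a minimal single-machine sketch of plain graph simulation, written as a naive fixpoint loop. It is not the distributed algorithm from [1] and not tight simulation; the function name, the dictionary-based inputs, and the sample graph are assumptions made for this example.

```python
def graph_simulation(p_labels, p_children, g_labels, g_children):
    """Return {pattern vertex: set of matching data vertices}, or {} if no match.

    p_labels / g_labels: vertex id -> label
    p_children / g_children: vertex id -> iterable of child ids
    """
    # Start with every data vertex whose label equals the pattern vertex's label.
    sim = {u: {v for v, lab in g_labels.items() if lab == p_labels[u]}
           for u in p_labels}
    changed = True
    while changed:
        changed = False
        for u in p_labels:
            for u_child in p_children.get(u, ()):
                # v survives only if at least one of its children
                # still matches the pattern child u_child.
                keep = {v for v in sim[u]
                        if set(g_children.get(v, ())) & sim[u_child]}
                if keep != sim[u]:
                    sim[u] = keep
                    changed = True
    if any(not matches for matches in sim.values()):
        return {}
    return sim


# Hypothetical pattern: a -> b. Hypothetical data: 1 -> 2, plus isolated 3 and 4.
p_labels = {"a": "A", "b": "B"}
p_children = {"a": ["b"]}
g_labels = {1: "A", 2: "B", 3: "A", 4: "B"}
g_children = {1: [2]}
result = graph_simulation(p_labels, p_children, g_labels, g_children)
# result == {"a": {1}, "b": {2, 4}}
```

Vertex 4 stays in the result even though nothing points to it, because plain simulation checks children only. Stricter variants such as tight simulation add further conditions to prune matches like this one.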
Improvement
The difference between distributed and traditional algorithms for graph
pattern matching lies in how the computation is designed.
1. Traditional graph pattern matching: designed at the level of the whole graph;
computation is linear and tries to find exact matches.
2. Distributed graph pattern matching: to achieve high
scalability, computation must be at the level of a single vertex
(vertex-centric model).
Thus, we think that distributed graph pattern matching algorithms
should focus on removing invalid vertices.
Difference between traditional and
distributed algorithms
Improvement
Boundary filter, which aims to shrink the massive data graph from its
border.
Boundary nodes: in a directed graph, vertices that have only one relationship.
Algorithm explanation: the paper "From Intractable to
Polynomial Time" [2] also observes that it is easier and faster to evaluate
boundary nodes than internal nodes. However, there is no such implementation
for parallel computing.
Concrete solution: each vertex maintains a dynamic status table of its
neighbors, so each vertex can apply its own evaluation
independently.
[2] Wenfei Fan, Jianzhong Li. 2010a. Graph Pattern Matching: From Intractable to Polynomial Time.
Proposed Solution : Boundary Filter
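The slides do not spell out the exact filtering rule, so the following is only a sketch under stated assumptions: a boundary vertex is one with at most one live relationship, and it is kept only if some pattern vertex with the same label has all its required child and parent labels covered by that relationship. The per-vertex status table is modelled as a dictionary. All names and the sample graph are hypothetical, and the loop runs sequentially rather than on a vertex-centric framework.

```python
from collections import deque


def boundary_filter(g_labels, g_edges, p_labels, p_edges):
    """Peel invalid boundary vertices off the data graph; return surviving ids."""
    # Pattern side: labels of children and parents each pattern vertex requires.
    p_child = {u: set() for u in p_labels}
    p_parent = {u: set() for u in p_labels}
    for a, b in p_edges:
        p_child[a].add(p_labels[b])
        p_parent[b].add(p_labels[a])

    # Data side: each vertex keeps its own status table of neighbours.
    # table[v][(n, direction)] is True while neighbour n is still alive.
    table = {v: {} for v in g_labels}
    for a, b in g_edges:
        table[a][(b, "child")] = True
        table[b][(a, "parent")] = True

    def live(v):
        return [key for key, ok in table[v].items() if ok]

    def is_valid(v):
        kids = {g_labels[n] for n, d in live(v) if d == "child"}
        folks = {g_labels[n] for n, d in live(v) if d == "parent"}
        return any(p_labels[u] == g_labels[v]
                   and p_child[u] <= kids and p_parent[u] <= folks
                   for u in p_labels)

    alive = set(g_labels)
    queue = deque(v for v in g_labels if len(live(v)) <= 1)
    while queue:
        v = queue.popleft()
        if v not in alive or is_valid(v):
            continue
        alive.discard(v)
        for n, d in live(v):
            # Update the neighbour's status table; it may become a boundary.
            back = "parent" if d == "child" else "child"
            table[n][(v, back)] = False
            if len(live(n)) <= 1:
                queue.append(n)
    return alive


# Hypothetical pattern: PM -> DB. Hypothetical data: 1(PM) -> 2(DB) -> 13(PM).
survivors = boundary_filter(
    g_labels={1: "PM", 2: "DB", 13: "PM"},
    g_edges=[(1, 2), (2, 13)],
    p_labels={"pm": "PM", "db": "DB"},
    p_edges=[("pm", "db")],
)
# survivors == {1, 2}
```

In this toy graph, vertex 13 is labelled PM but its only relationship is an incoming edge, while the pattern's PM vertex requires a DB child, so it is removed. Internal vertices are never evaluated here; the filter only shrinks the graph before the main matching algorithm runs.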
Improvement
Here is an example: vertex 13 can be viewed as a boundary node.
According to the corresponding PM vertex in the pattern, which has only one
child, vertex 13 will be removed because its relationship is wrong.
Example for Boundary Filter
Improvement
- Experiment environment: the experiments were conducted on Amazon
AWS EC2 cluster nodes. The cluster has 3 workers, each with 61 GB
RAM, 26 ECU (EC2 Compute Units), and eight vCPUs (2.5 GHz Intel Xeon
E5-2670v2).
- Dataset: "email-EuAll" [3], which contains 265,214 vertices and 420,045
edges, is the input data graph. There are 200 distinct labels,
assigned to vertices randomly. The pattern graph was extracted
randomly from the data graph, and it has at most 100 vertices.
- Accuracy: to our knowledge, there is no standard accuracy criterion for
distributed graph pattern matching algorithms. In the following experiments,
we report the number of vertices the algorithms found.
[3] Snap of Stanford University. https://snap.stanford.edu/data/email-EuAll.html
Experiments
1. Running time comparison. This experiment measures the effect of the
boundary filter on running time. We tested patterns with 20, 40, 60, 80,
and 100 vertices.
Running time
Running time (sec)            Pattern:20  Pattern:40  Pattern:60  Pattern:80  Pattern:100
Original                      13314       12395       11259       10673       10086
Original + boundary filter    400         421         444         472         804
Running time comparison
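The speedup implied by the running-time table can be computed directly from its values. The short sketch below only restates the table's numbers; the variable names are assumptions.

```python
# Running times in seconds, copied from the table above.
pattern_sizes = [20, 40, 60, 80, 100]
original = [13314, 12395, 11259, 10673, 10086]
with_filter = [400, 421, 444, 472, 804]

# Speedup = original time / time with the boundary filter, rounded to one decimal.
speedup = {n: round(o / f, 1)
           for n, o, f in zip(pattern_sizes, original, with_filter)}
# speedup == {20: 33.3, 40: 29.4, 60: 25.4, 80: 22.6, 100: 12.5}
```

On these measurements the filter gives roughly a 12x to 33x reduction in running time, with the gain shrinking as the pattern grows.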
2. Accuracy comparison. This experiment measures the effect of the
boundary filter on accuracy. We tested patterns with 20, 40, 60, 80,
and 100 vertices. Each value in the table is the size of the result, which
already includes all pattern vertices.
New Dual via New Tight
Accuracy                      Pattern:20  Pattern:40  Pattern:60  Pattern:80  Pattern:100
Original                      2266        1509        1206        899         759
Original + boundary filter    26          49          65          82          106
Comparison with original
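Since each result already contains the pattern's own vertices, one way to read the accuracy table is to subtract the pattern size from each value. This reading is an assumption, not a metric defined in the slides; the sketch below only rearranges the table's numbers.

```python
# Result sizes (number of vertices found), copied from the table above.
pattern_sizes = [20, 40, 60, 80, 100]
original = [2266, 1509, 1206, 899, 759]
with_filter = [26, 49, 65, 82, 106]

# Vertices reported beyond the pattern's own vertices (assumed reading).
extra_original = [r - n for r, n in zip(original, pattern_sizes)]
extra_filtered = [r - n for r, n in zip(with_filter, pattern_sizes)]
# extra_original == [2246, 1469, 1146, 819, 659]
# extra_filtered == [6, 9, 5, 2, 6]
```

Under this reading, the filtered results stay within a few vertices of the pattern size, while the original results contain hundreds to thousands of additional vertices.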
The boundary filter clearly improves distributed tight simulation
in both running time and accuracy.
From the angle of computation design, the more complicated the graph, the
faster the algorithm is.
In conclusion, our proposed algorithm outperforms the original one (tight
simulation) and preserves its important properties.
Moreover, the table that we add to track the status of each vertex's
neighbors makes it possible for this algorithm to handle incremental graphs.
Dual vs Tight
Conclusion