20151130

Research and Improvement on a
Distributed Graph Pattern
Matching Algorithm
Chao Chen (University College Dublin / JST)
Toyotaro Suzumura (IBM T.J. Watson Research Center / Columbia University / JST)
30th / Nov / 2015

Definition
The following example uses data from LinkedIn [1]:
Input : Data Graph, Pattern
Output: {4,6,7,8} and {5,6,7,8}
Instance
[1] A. Fard, M. U. Nisar, J. A. Miller, and L. Ramaswamy, Distributed and scalable graph pattern matching:
Models and algorithms. International Journal of Big Data (IJBD), vol. 1, no. 1, 2014.

Definition
Graph pattern matching: find subgraphs in a large graph (the data graph) that
are similar to a given graph (the pattern graph).
A graph may have:
1 directed or undirected edges
2 labelled or unlabelled vertices
3 labelled or unlabelled edges
Graph pattern matching is fundamental to many applications,
such as social network analysis and substructure search in
biochemistry.
Definition

Challenges
1 Real-life social graphs are typically large. For
instance, Facebook has more than 500 million users
(nodes) with billions of links (edges).
2 Graph pattern matching is costly.
• Traditional algorithms, which solve the problem by linear scan, are not
practical.
• There may be more than one subgraph that matches the given pattern graph.
Challenge

Challenges
Traditional algorithm:
• Subgraph isomorphism: finds exact matches. The problem is NP-complete, so
it is not practical for massive graphs.
Distributed algorithms:
• Distributed graph simulation: a faster algorithm that relaxes
some restrictions on matches. It preserves only the child
relationships of each vertex.
• Distributed tight simulation: a novel modification of
distributed graph simulation and the state-of-the-art algorithm for
distributed graph pattern matching. It scales well, but
its performance is not what we expected. The algorithm we
propose is based on distributed tight simulation.
Existing algorithms
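To make "preserves only the child relationships" concrete, here is a minimal sketch of plain, single-machine graph simulation. It is not the distributed algorithm of [1]; the function name `graph_simulation` and the toy graphs are assumptions made for illustration only.

```python
def graph_simulation(pattern_edges, pattern_labels, data_edges, data_labels):
    """Return {pattern vertex: set of data vertices that simulate it}."""
    # Build child adjacency for both graphs (edges are (parent, child) pairs).
    p_children = {u: set() for u in pattern_labels}
    for u, w in pattern_edges:
        p_children[u].add(w)
    d_children = {v: set() for v in data_labels}
    for v, w in data_edges:
        d_children[v].add(w)

    # Start from all data vertices that carry the same label.
    sim = {
        u: {v for v, lab in data_labels.items() if lab == pattern_labels[u]}
        for u in pattern_labels
    }

    # Repeatedly drop candidates that lack a matching child, until stable.
    changed = True
    while changed:
        changed = False
        for u in pattern_labels:
            for u_child in p_children[u]:
                keep = {v for v in sim[u] if d_children[v] & sim[u_child]}
                if keep != sim[u]:
                    sim[u] = keep
                    changed = True

    # If any pattern vertex has no candidate left, there is no match at all.
    if any(not candidates for candidates in sim.values()):
        return {u: set() for u in pattern_labels}
    return sim


# Toy pattern: A -> B. Toy data: 1(A) -> 2(B), 3(A) -> 4(C), and an isolated 5(B).
pattern_labels = {"p1": "A", "p2": "B"}
pattern_edges = [("p1", "p2")]
data_labels = {1: "A", 2: "B", 3: "A", 4: "C", 5: "B"}
data_edges = [(1, 2), (3, 4)]

result = graph_simulation(pattern_edges, pattern_labels, data_edges, data_labels)
print(result)
```

In this toy run vertex 3 is dropped because its only child is not labelled B, while the isolated vertex 5 still matches `p2`: only child relationships are checked, never parents. That looseness is what the stricter simulation variants tighten.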

Improvement
The difference between distributed and traditional algorithms for graph
pattern matching lies in how the computation is designed.
1 Traditional graph pattern matching: designed at the level of the whole graph;
the computation is linear and tries to find exact matches.
2 Distributed graph pattern matching: to achieve high
scalability, the computation must be designed at the level of a single vertex
(the vertex-centric model).
We therefore think that distributed graph pattern matching algorithms
should focus on removing invalid vertices.
Difference between traditional and
distributed algorithms

Improvement
Boundary filter: aims to shrink the massive data graph from its
border.
Boundary nodes: in a directed graph, vertices that have only one relationship.
Algorithm explanation: the paper “Graph Pattern Matching: From Intractable to
Polynomial Time” [2] also observes that boundary nodes are easier and faster
to evaluate than internal nodes. However, there is no such implementation for
parallel computing.
Concrete solution: each vertex maintains a dynamic status table of its
neighbors, so each vertex can apply its own evaluation
independently.
[2] Wenfei Fan, Jianzhong Li. 2010a. Graph Pattern Matching: From Intractable to Polynomial Time.
Proposed Solution: Boundary Filter
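The slide does not spell out the filter's exact rules, so the following is only one possible reading, sketched on a single machine. Assumptions: a boundary vertex is one with exactly one live relationship (incoming or outgoing); it is removed when the pattern contains no edge with the same direction between the same pair of labels; and removals are announced to neighbors, which update their status tables round by round. The name `boundary_filter` and the toy graphs are hypothetical.

```python
from collections import defaultdict


def boundary_filter(data_edges, data_labels, pattern_edges, pattern_labels):
    """Return the data vertices that survive the (assumed) boundary filter."""
    # Label pairs (parent label, child label) that the pattern allows.
    allowed = {(pattern_labels[a], pattern_labels[b]) for a, b in pattern_edges}

    # Per-vertex status table: (neighbor, direction) -> is that neighbor still alive?
    status = defaultdict(dict)
    for src, dst in data_edges:
        status[src][(dst, "out")] = True
        status[dst][(src, "in")] = True

    alive = set(data_labels)
    frontier = set(alive)  # vertices to evaluate in the current round
    while frontier:
        removed = set()
        for v in frontier:
            live = [key for key, ok in status[v].items() if ok]
            if len(live) != 1:
                continue  # not a boundary vertex under the assumption above
            nbr, direction = live[0]
            if direction == "out":
                pair = (data_labels[v], data_labels[nbr])
            else:
                pair = (data_labels[nbr], data_labels[v])
            if pair not in allowed:
                removed.add(v)

        # "Message passing": neighbors of removed vertices update their tables
        # and are re-evaluated in the next round.
        alive -= removed
        frontier = set()
        for v in removed:
            for nbr, direction in status[v]:
                back = "in" if direction == "out" else "out"
                if status[nbr].get((v, back)):
                    status[nbr][(v, back)] = False
                    if nbr in alive:
                        frontier.add(nbr)
    return alive


# Toy pattern: A -> B -> C.
pattern_labels = {"pa": "A", "pb": "B", "pc": "C"}
pattern_edges = [("pa", "pb"), ("pb", "pc")]
# Toy data: 1(A) -> 2(B) -> 3(C), plus a dangling chain 5(A) -> 4(C) -> 2(B).
data_labels = {1: "A", 2: "B", 3: "C", 4: "C", 5: "A"}
data_edges = [(1, 2), (2, 3), (4, 2), (5, 4)]

kept = boundary_filter(data_edges, data_labels, pattern_edges, pattern_labels)
print(sorted(kept))
```

In this toy run vertex 5 is removed in the first round (the pattern has no A -> C edge). That turns vertex 4 into a boundary vertex, which is removed in the second round (no C -> B edge). Each decision uses only the vertex's own status table, which is the property that would let the evaluation run independently per vertex.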

Improvement
Here is an example: vertex 13 can be viewed as a boundary vertex.
According to the corresponding PM vertex in the pattern, which has only one
child, vertex 13 is removed because it has the wrong relationship.
Example for Boundary Filter

Improvement
• Experimental environment: the experiments were conducted on an Amazon
AWS EC2 cluster. The cluster has 3 workers, each with 61 GB
RAM, 26 ECU (EC2 Compute Units), and eight vCPUs (2.5 GHz Intel Xeon
E5-2670 v2).
• Dataset: the input data graph is “email-EuAll” [3], which contains 265,214
vertices and 420,045 edges. There are 200 distinct labels,
assigned to vertices randomly. The pattern graph was extracted
from the data graph randomly, with at most 100 vertices.
• Accuracy: to our knowledge, there are no established criteria for distributed
graph pattern matching algorithms. In the following experiments, we report the
number of vertices the algorithms found.
[3] Snap of Stanford University. https://snap.stanford.edu/data/email-EuAll.html
Experiments

1 Running time comparison. This experiment measures the effect of the
boundary filter on running time. We tested patterns with 20, 40, 60, 80
and 100 vertices.
Running time
Running time (sec)           Pattern:20  Pattern:40  Pattern:60  Pattern:80  Pattern:100
Original                     13314       12395       11259       10673       10086
Original + boundary filter   400         421         444         472         804
Running time comparison
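The speedup implied by the running times reported above can be worked out directly. The short sketch below only divides the reported numbers; the variable names are illustrative.

```python
# Running times in seconds, copied from the table above.
pattern_sizes = [20, 40, 60, 80, 100]
original = [13314, 12395, 11259, 10673, 10086]
with_filter = [400, 421, 444, 472, 804]

# Speedup = original time / time with the boundary filter.
speedups = [round(o / f, 1) for o, f in zip(original, with_filter)]

for size, s in zip(pattern_sizes, speedups):
    print(f"Pattern:{size:<4} speedup ~{s}x")
```

On these figures the speedup falls from about 33x at 20 pattern vertices to about 12.5x at 100 pattern vertices.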

2 Accuracy comparison. This experiment measures the effect of the
boundary filter on accuracy. We tested patterns with 20, 40, 60, 80
and 100 vertices. Each value in the table is the number of vertices in the
result, which already contains all pattern vertices.
New Dual via New Tight
Accuracy                     Pattern:20  Pattern:40  Pattern:60  Pattern:80  Pattern:100
Original                     2266        1509        1206        899         759
Original + boundary filter   26          49          65          82          106
Comparison with original

The boundary filter clearly improves distributed tight simulation
in both running time and accuracy.
From the angle of computation design, the more complicated the graph, the
faster the algorithm is.
In conclusion, our proposed algorithm outperforms the original one (tight
simulation) and preserves its important properties.
Moreover, the table we add to track the status of each vertex's
neighbors makes it possible for this algorithm to handle incremental graphs.
Dual vs Tight
Conclusion
