This document proposes a new approach called SASUM for approximate subgraph matching in large graphs. Approximate subgraph matching allows missing edges in query matches, which is important for real-world graphs that may be incomplete. SASUM improves upon the basic approach of generating all possible query subgraphs and doing exact matching for each. It exploits the overlapping nature of query subgraphs to reduce the number that require costly exact matching. SASUM uses a lattice framework to identify sharing opportunities between query subgraphs. It generates small "base graphs" that are shared between queries and chooses a minimum set of these to match, from which it can derive matches for all queries. The approach outperforms the state-of-the-art by orders of magnitude.
A popular programming model for running data-intensive applications in the cloud is MapReduce. In Hadoop, jobs are scheduled in FIFO order by default, yet many MapReduce applications require strict deadlines. The Hadoop framework does not implement a scheduler with deadline constraints: existing schedulers do not guarantee that a job will be completed by a specific deadline, and those that do address deadlines focus more on improving system utilization. We propose an algorithm that lets the user specify a job's deadline and evaluates whether the job can be finished before that deadline. The resulting deadline-aware scheduler for Hadoop ensures that only jobs whose deadlines can be met are scheduled for execution. If a submitted job cannot satisfy the specified deadline, physical or virtual nodes can be added dynamically to complete the job within the deadline [8].
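The admission test described above can be sketched as a simple cost-model check. The function name, the wave-based cost model, and all figures below are illustrative assumptions, not the paper's actual scheduler:

```python
# Hypothetical sketch of a deadline feasibility test: a job is admitted
# only if its estimated completion time fits the user-specified deadline.
# The wave-based cost model (tasks run in waves of available slots) is an
# assumed simplification, not the cited algorithm.

def can_meet_deadline(map_tasks, reduce_tasks, avg_map_cost, avg_reduce_cost,
                      free_map_slots, free_reduce_slots, deadline):
    """Estimate job completion time and compare it to the deadline."""
    map_waves = -(-map_tasks // free_map_slots)          # ceiling division
    reduce_waves = -(-reduce_tasks // free_reduce_slots)
    estimated = map_waves * avg_map_cost + reduce_waves * avg_reduce_cost
    return estimated <= deadline

# 100 map tasks (10 s each) and 20 reduce tasks (30 s each)
# on 25 free map slots and 10 free reduce slots:
print(can_meet_deadline(100, 20, 10, 30, 25, 10, deadline=120))  # → True
```

If the check fails, the abstract's remedy applies: add nodes (raising the free slot counts) until the estimate fits.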
GRAPH MATCHING ALGORITHM FOR TASK ASSIGNMENT PROBLEM - IJCSEA Journal
Task assignment is one of the most challenging problems in a distributed computing environment. An optimal task assignment guarantees minimum turnaround time for a given architecture. Several approaches to optimal task assignment have been proposed by various researchers, ranging from graph-partitioning-based tools to heuristic graph matching. Using heuristic graph matching, it is often impossible to obtain an optimal task assignment for practical test cases within an acceptable time limit. In this paper, we have parallelized the basic heuristic graph-matching algorithm for task assignment, which is suitable only for cases where processors and inter-processor links are homogeneous. This proposal is a derivative of the basic task assignment methodology using heuristic graph matching. The results show that near-optimal assignments are obtained much faster than with the sequential program in all cases, with reasonable speed-up.
HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm - Sunny Kr
Cardinality estimation has a wide range of applications and is of particular importance in database systems. Various algorithms have been proposed in the past, and the HyperLogLog algorithm is one of them.
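For intuition, the core of HyperLogLog fits in a few lines. This is the textbook estimator with the small-range (linear counting) correction, not the engineered variant from the paper above (no 64-bit bias tables or sparse representation); the choice of SHA-1 as the hash is an illustrative assumption:

```python
import hashlib
import math

# Minimal HyperLogLog sketch: m = 2^p registers each remember the maximum
# "rank" (position of the leftmost 1-bit) seen among hashes routed to them.
class HyperLogLog:
    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def _hash(self, item):
        # Low 64 bits of SHA-1 as a stand-in for a fast 64-bit hash.
        return int(hashlib.sha1(str(item).encode()).hexdigest(), 16) & ((1 << 64) - 1)

    def add(self, item):
        x = self._hash(item)
        idx = x >> (64 - self.p)                 # first p bits pick the register
        rest = x & ((1 << (64 - self.p)) - 1)    # remaining bits give the rank
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)    # bias correction for m >= 128
        est = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:        # small-range correction
            est = self.m * math.log(self.m / zeros)
        return est
```

With p=12 (4096 registers, about 4 KB), the standard error is roughly 1.04/sqrt(m), i.e. under 2 percent.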
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters - Xiao Qin
An increasing number of popular applications are data-intensive in nature. In the past decade, the World Wide Web has been adopted as an ideal platform for developing data-intensive applications, since the communication paradigm of the Web is sufficiently open and powerful. Data-intensive applications like data mining and web indexing need to access ever-expanding data sets ranging from a few gigabytes to several terabytes or even petabytes. Google leverages the MapReduce model to process approximately twenty petabytes of data per day in a parallel fashion. In this talk, we introduce Google's MapReduce framework for processing huge datasets on large clusters. We first outline the motivations of the MapReduce framework. Then, we describe the dataflow of MapReduce. Next, we show a couple of example applications of MapReduce. Finally, we present our research project on the Hadoop Distributed File System.
The current Hadoop implementation assumes that computing nodes in a cluster are homogeneous in nature. Data locality has not been taken into account for launching speculative map tasks, because it is assumed that most maps are data-local. Unfortunately, both the homogeneity and data-locality assumptions are violated in virtualized data centers. We show that ignoring the data-locality issue in heterogeneous environments can noticeably reduce MapReduce performance. In this paper, we address the problem of how to place data across nodes so that each node has a balanced data-processing load. Given a data-intensive application running on a Hadoop MapReduce cluster, our data placement scheme adaptively balances the amount of data stored on each node to achieve improved data-processing performance. Experimental results on two real data-intensive applications show that our data placement strategy consistently improves MapReduce performance by rebalancing data across nodes before a data-intensive application runs in a heterogeneous Hadoop cluster.
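The rebalancing idea reduces to proportional allocation: each node receives a share of the input fragments proportional to its measured processing speed, so all nodes finish their local work at roughly the same time. The function and the speed figures below are illustrative, not HDFS-HC's actual implementation:

```python
# Sketch of speed-proportional data placement for a heterogeneous cluster.
# node_speeds maps node name -> relative processing speed (assumed measured
# beforehand, e.g. by timing a calibration job).

def placement_shares(num_fragments, node_speeds):
    """Return fragments per node, proportional to node speed."""
    total = sum(node_speeds.values())
    shares = {n: int(num_fragments * s / total) for n, s in node_speeds.items()}
    # Hand leftover fragments (lost to rounding down) to the fastest nodes.
    leftover = num_fragments - sum(shares.values())
    for n in sorted(node_speeds, key=node_speeds.get, reverse=True)[:leftover]:
        shares[n] += 1
    return shares

speeds = {"node-a": 3.0, "node-b": 2.0, "node-c": 1.0}
print(placement_shares(600, speeds))  # → {'node-a': 300, 'node-b': 200, 'node-c': 100}
```

Each node then processes its local share in time proportional to share/speed, which the allocation makes equal across nodes.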
Relaxing global-as-view in mediated data integration from linked data - Alessandro Adamou
Slides of my presentation at Semantic Big Data (SBD 2020), co-located with SIGMOD 2020. The actual presentation includes my pre-recorded commentary and a superimposed picture, so these slides can be used as a reference.
Parallel Batch-Dynamic Graphs: Algorithms and Lower Bounds - Subhajit Sahu
In this paper we study the problem of dynamically maintaining graph properties under batches of edge insertions and deletions in the massively parallel model of computation. In this setting, the graph is stored on a number of machines, each having space strongly sublinear with respect to the number of vertices, that is, n^ε for some constant 0 < ε < 1. Our goal is to handle batches of updates and queries where the data for each batch fits onto one machine in constant rounds of parallel computation, as well as to reduce the total communication between the machines. This objective corresponds to the gradual buildup of databases over time, while the goal of obtaining constant rounds of communication for problems in the static setting has been elusive for problems as simple as undirected graph connectivity.

We give an algorithm for dynamic graph connectivity in this setting with constant communication rounds and communication cost almost linear in terms of the batch size. Our techniques combine a new graph contraction technique, an independent random sample extractor from correlated samples, and distributed data structures supporting parallel updates and queries in batches.

We also illustrate the power of dynamic algorithms in the MPC model by showing that the batched version of the adaptive connectivity problem is P-complete in the centralized setting, but sub-linear-sized batches can be handled in a constant number of rounds. Due to the wide applicability of our approaches, we believe this represents a practically motivated workaround to the current difficulties in designing more efficient massively parallel static graph algorithms.
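To make the problem statement concrete, here is a toy single-machine version of batch-dynamic connectivity: insertion batches are absorbed into a union-find structure, and a deletion batch triggers a rebuild from the surviving edges. This illustrates only the interface; none of the paper's MPC machinery (constant rounds, sublinear space per machine, graph contraction) appears here:

```python
# Toy batch-dynamic connectivity. Deletions are the hard case in the real
# algorithm; here we simply rebuild the union-find from scratch.

class BatchConnectivity:
    def __init__(self, n):
        self.n = n
        self.edges = set()
        self._rebuild()

    def _rebuild(self):
        self.parent = list(range(self.n))
        for u, v in self.edges:
            self._union(u, v)

    def _find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def _union(self, u, v):
        ru, rv = self._find(u), self._find(v)
        if ru != rv:
            self.parent[ru] = rv

    def insert_batch(self, batch):
        for u, v in batch:
            self.edges.add((min(u, v), max(u, v)))
            self._union(u, v)

    def delete_batch(self, batch):
        for u, v in batch:
            self.edges.discard((min(u, v), max(u, v)))
        self._rebuild()

    def connected(self, u, v):
        return self._find(u) == self._find(v)
```

The paper's contribution is, in essence, doing what `delete_batch` does naively here with near-linear communication in the batch size and a constant number of rounds.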
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nodes - Xiao Qin
Hadoop and the term 'Big Data' go hand in hand. The information explosion caused by cloud and distributed computing has led to the curiosity to process and analyze massive amounts of data; such processing and analysis helps to add value to an organization or to derive valuable information.

The current Hadoop implementation assumes that computing nodes in a cluster are homogeneous in nature. Hadoop relies on its capability to take computation to the nodes rather than migrating data around the nodes, which might cause significant network overhead. This strategy has potential benefits in a homogeneous environment, but it might not be suitable in a heterogeneous one: the time taken to process data on a slower node can be significantly higher than the sum of the network overhead and the processing time on a faster node. Hence, it is necessary to study a data placement policy that distributes data based on the processing power of each node. The project explores this data placement policy and notes the ramifications of this strategy based on running a few benchmark applications.
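The trade-off described above is a one-line decision rule: move a block off a slow node only when shipping it plus processing it remotely still beats processing it in place. The function and all figures are illustrative assumptions:

```python
# Migrate-or-not rule for one data block in a heterogeneous cluster.
# Rates are in MB/s; all values are assumed measurements, not HDFS-HC2's.

def should_migrate(block_mb, slow_mbps, fast_mbps, net_mbps):
    t_local = block_mb / slow_mbps                        # process where it sits
    t_remote = block_mb / net_mbps + block_mb / fast_mbps  # ship, then process
    return t_remote < t_local

# A 128 MB block on a 4 MB/s node, with a 20 MB/s node reachable
# over a 100 MB/s network: 7.68 s remote vs 32 s local.
print(should_migrate(128, 4, 20, 100))  # → True
```

The placement policy in the project amounts to applying this comparison up front, so data lands on nodes in proportion to their processing power and migration is rarely needed at run time.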
A Subgraph Pattern Search over Graph Databases - IJMER
International Journal of Modern Engineering Research (IJMER) is a peer-reviewed online journal. It serves as an international archival forum for scholarly research related to engineering and science education.
Classification of Iris Data using Kernel Radial Basis Probabilistic Neural Network - Scientific Review
The Radial Basis Probabilistic Neural Network (RBPNN) has a broad generalization capability and has been successfully applied in multiple fields. In this paper, the Euclidean distance of each data point in the RBPNN is extended by computing its kernel-induced distance instead of the conventional sum-of-squares distance. The kernel function generalizes the distance metric: it measures the distance between two data points as if they were mapped into a high-dimensional space. Comparing the four constructed classification models (Kernel RBPNN, Radial Basis Function networks, RBPNN, and Back-Propagation networks), the results show that classification of the Iris data with Kernel RBPNN displays outstanding performance.
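The kernel-induced distance referred to above needs no explicit mapping: the squared distance between the feature-space images of x and y expands to K(x,x) - 2K(x,y) + K(y,y). A sketch with a Gaussian (RBF) kernel, one common choice (the paper's exact kernel and parameters are not assumed here):

```python
import math

# Kernel-induced distance: ||phi(x) - phi(y)||^2 = K(x,x) - 2 K(x,y) + K(y,y),
# so the high-dimensional mapping phi never has to be computed.

def rbf_kernel(x, y, gamma=0.5):
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)

def kernel_distance(x, y, kernel=rbf_kernel):
    sq = kernel(x, x) - 2 * kernel(x, y) + kernel(y, y)
    return math.sqrt(max(sq, 0.0))   # guard tiny negatives from rounding
```

For the RBF kernel K(x,x) = 1, so the distance simplifies to sqrt(2 - 2K(x,y)): nearby points (kernel value near 1) get distance near 0, distant points saturate at sqrt(2).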
This introduction to the popular ggplot2 R graphics package will show you how to create a wide variety of graphical displays in R. Data sets and additional workshop materials available at http://projects.iq.harvard.edu/rtc/event/r-graphics
Presenter: Hwanjun Song (Ph.D. student, KAIST)
Date: August 2018
(Parallel Clustering Algorithm Optimization for Large-Scale Data Analytics)
Clustering, one of the most widely used methods in data analysis, is the task of dividing given data into groups based on similarity. However, because of its high computational complexity, clustering has not been widely applied to large-scale data analysis. To address this complexity, many recent studies apply distributed computing frameworks such as Hadoop and Spark, but optimizing existing clustering algorithms for a distributed environment is not easy. In particular, trading accuracy for efficiency and load imbalance across workers are two representative problems that arise when parallelizing such algorithms. This seminar focuses on the challenges of distributing DBSCAN, a representative clustering algorithm, and presents a new solution to them. In practice, the method improves performance by up to 180x over state-of-the-art approaches without any loss of accuracy.
This seminar is based on the following paper, presented at SIGMOD 2018:
Song, H. and Lee, J., "RP-DBSCAN: A Superfast Parallel DBSCAN Algorithm Based on Random Partitioning," In Proc. 2018 ACM Int'l Conf. on Management of Data (SIGMOD), Houston, Texas, pp. 1173-1187, June 2018.
1. Background
- Concept of Clustering
- Concept of Distributed Processing (MapReduce)
- Clustering Algorithms (Focus on DBSCAN)
2. Challenges of Parallel Clustering
- Parallelization of Clustering Algorithm (Focus on DBSCAN)
- Existing Work
- Challenges
3. Our Approach
- Key Idea and Key Contribution
- Overview of Random Partitioning-DBSCAN
4. Experimental Results
5. Conclusions
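As background for the outline above, a compact brute-force DBSCAN shows the algorithm being parallelized (RP-DBSCAN's random partitioning and cell-based pruning are not reproduced; this is only the sequential baseline):

```python
# Brute-force DBSCAN on points in the plane. A core point has at least
# min_pts neighbors (itself included) within radius eps; clusters grow by
# expanding core points, and unreachable points are labeled noise (-1).

def dbscan(points, eps, min_pts):
    def neighbors(i):
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1                 # noise (may become a border point later)
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster        # claimed as a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = neighbors(j)
            if len(more) >= min_pts:       # j is itself a core point: expand
                queue.extend(more)
        cluster += 1
    return labels
```

The O(n^2) neighbor search in `neighbors` is exactly the cost that motivates distributing the computation.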
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS - csandit
The ability to automatically mine and extract useful information from large datasets has been a common concern for organizations over the last few decades. Data on the internet is increasing rapidly, and consequently the capacity to collect and store very large data is growing significantly. Existing clustering algorithms are not always efficient and accurate in solving clustering problems for large datasets, and the development of accurate and fast data classification algorithms for very-large-scale datasets remains a challenge. In this paper, various algorithms and techniques, especially an approach using a non-smooth optimization formulation of the clustering problem, are proposed for solving minimum sum-of-squares clustering problems in very large datasets. This research also develops an accurate and real-time L2-DC algorithm, based on the incremental approach, to solve the minimum sum-of-squares clustering problem.
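For reference, the minimum sum-of-squares objective targeted above, together with one Lloyd's-style refinement step as the classical baseline. The paper's L2-DC / non-smooth incremental approach is not reproduced; this only fixes what is being minimized:

```python
# Minimum sum-of-squares clustering: choose centers minimizing the total
# squared distance of each point to its assigned center.

def sse(points, centers, assign):
    """Sum of squared errors for a given assignment."""
    return sum(sum((p - c) ** 2 for p, c in zip(points[i], centers[a]))
               for i, a in enumerate(assign))

def lloyd_step(points, centers):
    """One k-means step: assign to nearest center, then recompute centers."""
    assign = [min(range(len(centers)),
                  key=lambda k: sum((p - c) ** 2 for p, c in zip(pt, centers[k])))
              for pt in points]
    new_centers = []
    for k in range(len(centers)):
        members = [points[i] for i, a in enumerate(assign) if a == k]
        if members:
            new_centers.append(tuple(sum(d) / len(members) for d in zip(*members)))
        else:
            new_centers.append(centers[k])   # keep an empty center unchanged
    return new_centers, assign
```

Each step never increases the SSE, which is why the objective is well-suited to incremental refinement schemes like the one the paper builds on.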
Machine Learning (ML) models are often composed as pipelines of operators, from "classical" ML operators to pre-processing and featurization operators. Current systems deploy pipelines as "black boxes", where the same implementation of training is run for inference. This solution is convenient but leaves large room for improving performance and resource usage. This talk presents Pretzel, a framework for deploying ML pipelines that is inspired by database systems: Pretzel inspects and optimizes pipelines end-to-end, much like queries, and manages resources common to multiple pipelines, such as operators' state. Pretzel is joint work with the University of Seoul and Microsoft Research and was recently presented at OSDI '18. After the overview, this talk also shows experimental results comparing Pretzel against state-of-the-art ML solutions and discusses limitations and extensions.
Practical Parallel Hypergraph Algorithms
Author: Julian Shun
Publication: PPoPP '20: Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, February 2020, Pages 232-249. https://doi.org/10.1145/3332466.3374527
One More Comments on Programming with Big Number Library in Scientific Computing - theijes
This article comments on programming with big-number libraries such as GMP and MPFR. It proposes avoiding recursive procedures in such programs, since a recursive procedure that performs big-number computations may use a large amount of stack space that neither the operating system nor the compiler can afford to provide. Based on this observation, the article recommends designing algorithms that avoid recursion. With an example that uses the GMP big-number library to factor a large odd number, the article demonstrates how to convert a recursive procedure into a non-recursive one, and its numerical experiments confirm that the sample program produces the expected results.
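The recursion-to-loop conversion advice can be illustrated in Python, whose built-in arbitrary-precision integers stand in for GMP here (the article's own example is big-odd-number factorization in C; factorial is a simpler stand-in):

```python
# A deeply recursive big-number computation exhausts the call stack,
# while the equivalent loop runs in constant stack space.

def factorial_recursive(n):
    return 1 if n == 0 else n * factorial_recursive(n - 1)

def factorial_iterative(n):
    acc = 1
    for k in range(2, n + 1):
        acc *= k
    return acc

# factorial_recursive(5000) exceeds CPython's default recursion limit
# (about 1000 frames); the iterative form handles it easily.
print(len(str(factorial_iterative(5000))))   # number of decimal digits
```

The same transformation, explicit accumulator instead of a call stack, is what the article demonstrates for its GMP factorization routine.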
SUBGRAPH MATCHING WITH SET SIMILARITY IN A LARGE GRAPH DATABASE - IEEE PROJE... - Nexgen Technology
This paper presents a new algorithm for the problem of graph isomorphism detection. The method compares known model graphs against an unknown input graph, which is given online; the problem to be solved is to detect all model graphs for which a graph isomorphism from the input graph exists. At run time, the input graph is compared with all of the model graphs, and those admitting an isomorphism from the input graph are detected. The time complexity depends on the number of input graphs given and on the size of the graphs; furthermore, it is independent of the number of model graphs and of the number of edges in any of the graphs.
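As a baseline for what the paper improves on, the naive specification of graph isomorphism checks every vertex permutation. This brute-force sketch (factorial time, usable only for small graphs) is not the paper's decision-tree method:

```python
from itertools import permutations

# Brute-force graph isomorphism test. Graphs are given as lists of
# adjacency sets over vertices 0..n-1, with both directions of each
# undirected edge present.

def isomorphic(g1, g2):
    if len(g1) != len(g2):
        return False
    if sum(len(adj) for adj in g1) != sum(len(adj) for adj in g2):
        return False                       # edge counts must match
    n = len(g1)
    edges2 = {(u, v) for u in range(n) for v in g2[u]}
    # An isomorphism is a vertex permutation mapping every edge of g1
    # onto an edge of g2 (equal edge counts make the map a bijection).
    return any(all((p[u], p[v]) in edges2 for u in range(n) for v in g1[u])
               for p in permutations(range(n)))
```

The paper's contribution is avoiding this factorial search when one input graph must be tested against many model graphs.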
Joint3DShapeMatching - a fast approach to 3D model matching using MatchALS 3... - Mamoon Ismail Khalid
We extend the global optimization-based approach of jointly matching a set of images to jointly matching a set of 3D meshes. The estimated correspondences simultaneously maximize pairwise feature affinities and cycle consistency across multiple models. We show that the low-rank matrix recovery problem can be efficiently applied to 3D meshes as well. The fast alternating minimization algorithm helps to handle real-world practical problems with thousands of features. Experimental results show that, unlike state-of-the-art algorithms which rely on semi-definite programming, our algorithm provides an order-of-magnitude speed-up along with competitive performance. Alongside the joint shape matching, we propose an approach that applies a distortion term in pairwise matching, which helps to successfully match the reflexive sub-parts of two models distinctively. Finally, we demonstrate the applicability of the algorithm by matching a set of 3D meshes from the SCAPE benchmark database.
Scalable and Adaptive Graph Querying with MapReduce - Kyong-Ha Lee
In this letter, we address the problem of processing multiple graph queries over a massive set of data graphs. As the number of data graphs grows rapidly, it is often hard to process graph queries with serial algorithms in a timely manner. We propose a distributed graph querying algorithm, which employs feature-based comparison and a filter-and-verify scheme working on the MapReduce framework. Moreover, we devise an efficient scheme that adaptively tunes a proper feature size at runtime by sampling data graphs. With various experiments, we show that the proposed method outperforms conventional algorithms in terms of both scalability and efficiency.
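The filter-and-verify scheme in miniature: cheap features (here, label counts, an illustrative choice) prune data graphs that cannot possibly contain the query, and only the survivors would go to costly exact verification. The letter's MapReduce distribution and adaptive feature sizing are not shown:

```python
from collections import Counter

# Filter stage of filter-and-verify subgraph querying. A data graph can
# contain the query only if it has at least as many vertices of each label,
# a necessary (not sufficient) condition, so pruning is always safe.

def features(labels):
    return Counter(labels)

def may_contain(data_feat, query_feat):
    return all(data_feat[l] >= c for l, c in query_feat.items())

def filter_candidates(data_graphs, query_labels):
    qf = features(query_labels)
    return [gid for gid, labels in data_graphs.items()
            if may_contain(features(labels), qf)]

data = {"g1": ["A", "A", "B"], "g2": ["A", "C"], "g3": ["A", "B", "B"]}
print(filter_candidates(data, ["A", "B"]))   # → ['g1', 'g3']
```

In a MapReduce deployment, mappers would compute features and apply this filter per partition, and reducers would run exact subgraph-isomorphism verification on the surviving candidates only.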
Parallel Batch-Dynamic Graphs: Algorithms and Lower BoundsSubhajit Sahu
Highlighted notes on Parallel Batch-Dynamic Graphs: Algorithms and Lower Bounds.
While doing research work under Prof. Kishore Kothapalli.
Laxman Dhulipala, David Durfee, Janardhan Kulkarni, Richard Peng, Saurabh Sawlani, Xiaorui Sun:
Parallel Batch-Dynamic Graphs: Algorithms and Lower Bounds. SODA 2020: 1300-1319
In this paper we study the problem of dynamically maintaining graph properties under batches of edge insertions and deletions in the massively parallel model of computation. In this setting, the graph is stored on a number of machines, each having space strongly sublinear in the number of vertices, that is, n^ε for some constant 0 < ε < 1. Our goal is to handle batches of updates and queries, where the data for each batch fits onto one machine, in constant rounds of parallel computation, as well as to reduce the total communication between the machines. This objective corresponds to the gradual buildup of databases over time, while the goal of obtaining constant rounds of communication for problems in the static setting has been elusive for problems as simple as undirected graph connectivity. We give an algorithm for dynamic graph connectivity in this setting with constant communication rounds and communication cost almost linear in the batch size. Our techniques combine a new graph contraction technique, an independent random sample extractor from correlated samples, and distributed data structures supporting parallel updates and queries in batches. We also illustrate the power of dynamic algorithms in the MPC model by showing that the batched version of the adaptive connectivity problem is P-complete in the centralized setting, yet sublinear-sized batches can be handled in a constant number of rounds. Due to the wide applicability of our approaches, we believe they represent a practically motivated workaround to the current difficulties in designing more efficient massively parallel static graph algorithms.
Report on Efficient Estimation for High Similarities using Odd Sketches AXEL FOTSO
This is a short report of the paper "Efficient Estimation for High Similarities using Odd Sketches" produced for the course Softskills seminar at Telecom Paristech
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation that decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It comes, however, with the precondition that the input graph contain no dead ends. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by the submission of a large number of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank, rely on a few core primitives. Compressed Sparse Row (CSR) is an adjacency-list based graph representation that stores the graph in compact, contiguous arrays. The experiments below compare implementations of these primitives:
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
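As a quick illustration of the CSR layout these notes refer to, here is a minimal Python sketch (the notes' own experiments use C++/OpenMP/CUDA; `build_csr` is an illustrative helper, not code from the report):

```python
def build_csr(num_vertices, edges):
    """Build CSR arrays (offsets, targets) from a directed edge list."""
    degree = [0] * num_vertices
    for u, _ in edges:
        degree[u] += 1
    # offsets[u] .. offsets[u+1] delimit the out-neighbours of u in targets
    offsets = [0] * (num_vertices + 1)
    for u in range(num_vertices):
        offsets[u + 1] = offsets[u] + degree[u]
    targets = [0] * len(edges)
    nxt = offsets[:-1]  # slice copy: next free slot for each source vertex
    for u, v in edges:
        targets[nxt[u]] = v
        nxt[u] += 1
    return offsets, targets

offsets, targets = build_csr(4, [(0, 1), (0, 2), (1, 2), (2, 3)])
print(targets[offsets[0]:offsets[1]])   # out-neighbours of vertex 0 -> [1, 2]
```

The two flat arrays make neighbour scans cache-friendly, which is why CSR is the usual starting point for the vector-style primitives benchmarked above.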
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...pchutichetpong
M Capital Group ("MCG") expects demand to keep growing and supply to keep evolving, driven by institutional investment rotating out of offices and into work-from-home ("WFH") infrastructure, and by the ever-expanding need for data storage as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to expect strong annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Large Graphs
IEICE TRANS. INF. & SYST., VOL.Exx–??, NO.xx XXXX 200x
PAPER
SASUM: A Sharing based Approach to Fast Approximate
Subgraph Matching for Large Graphs
Song-Hyon KIM†a), Student Member, Inchul SONG†, Kyong-Ha LEE†, and Yoon-Joon LEE†, Nonmembers
SUMMARY Subgraph matching is a fundamental operation for querying graph-structured data. Due to potential errors and noise in real-world graph data, exact subgraph matching is sometimes not appropriate in practice. In this paper we consider an approximate subgraph matching model that allows missing edges. Based on this model, approximate subgraph matching finds all occurrences of a given query graph in a database graph, allowing missing edges. A straightforward approach to this problem is to first generate query subgraphs of the query graph by deleting edges and then perform exact subgraph matching for each query subgraph. In this paper we propose a sharing-based approach to approximate subgraph matching, called SASUM. Our method is based on the fact that query subgraphs are highly overlapped. Due to this overlapping nature of query subgraphs, the matches of a query subgraph can be computed from the matches of a smaller query subgraph, which reduces the number of query subgraphs that need costly exact subgraph matching. Our method uses a lattice framework to identify sharing opportunities between query subgraphs. To further reduce the number of graphs that need exact subgraph matching, SASUM generates small base graphs that are shared by query subgraphs and chooses the minimum number of base graphs whose matches are used to derive the matching results of all query subgraphs. A comprehensive set of experiments shows that our approach outperforms the state-of-the-art approach by orders of magnitude in terms of query execution time.
key words: graph database, approximate subgraph matching
1. Introduction
A graph is a useful data model that represents objects and their relationships in various applications. For example, a protein-protein interaction network (PPIN) is modeled as a graph where each vertex represents a protein and each edge represents an interaction between two proteins [1]. Graph data are often very large and complex; e.g., a PPIN consists of tens of thousands of vertices and hundreds of thousands of edges.
One of the fundamental operations in graph data processing is subgraph matching. Given a query graph Q and a database graph G, subgraph matching finds all occurrences of Q in G. Subgraph matching requires the subgraph isomorphism test, which is known to be an NP-complete problem [2]. A lot of effort has been devoted to solving the subgraph matching problem efficiently [3–8].
Graph data are incomplete in many cases. For example, when constructing a PPIN, many PPI detection methods today produce a significant number of false positive protein-to-protein interactions. Moreover, they sometimes miss real interactions, generating false negatives [9]. Indeed, exact subgraph matching is not appropriate in such cases. Therefore, approximate subgraph matching is used instead in many applications.
† The authors are with the Department of Computer Science, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 305-701, Republic of Korea.
a) E-mail: songhyon.kim@kaist.ac.kr
DOI: 10.1587/transinf.E0.D.1
In this paper we adopt an approximate subgraph matching model that allows missing edges. Missing edges represent noise in a database graph. In this model, the edge edit distance (the number of edge deletions needed to transform one graph into another) is used to identify an occurrence of Q. If the edge edit distance between the query graph Q and a subgraph g of G is no more than some user-specified threshold θ, then g is considered an approximate match of Q. Note that in this paper we do not consider approximate matches with edges additional to the query graph, since such matches are always contained by the matches of the query graph.
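Under this deletion-only model, the edge edit distance from Q to a candidate reduces to counting query edges absent from the match. A minimal Python sketch (illustrative names, not the paper's code; it assumes the vertex correspondence between Q and the candidate is already fixed):

```python
def edge_edit_distance(query_edges, match_edges):
    """Deletion-only edge edit distance: query edges absent from the match.

    Assumes the vertex correspondence between Q and the candidate subgraph
    is already fixed, so edges can be compared directly.
    """
    q = {frozenset(e) for e in query_edges}   # undirected edges as sets
    m = {frozenset(e) for e in match_edges}
    return len(q - m)

Q = [(1, 2), (2, 3), (1, 3)]   # triangle query
g = [(1, 2), (2, 3)]           # candidate with one missing edge
theta = 1
assert edge_edit_distance(Q, g) <= theta   # g is an approximate match of Q
```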
A simple solution for approximate subgraph matching is to first generate all graphs whose edge edit distance to Q is no more than θ. We call these graphs query subgraphs in this paper and denote the set of all query subgraphs by S(Q, θ). Next, for each query subgraph q in S(Q, θ), we perform exact subgraph matching to find the exact occurrences of q in G. However, there are two shortcomings in this approach. First, exact subgraph matching itself is a very difficult problem, since it requires the subgraph isomorphism test, as mentioned above. Second, the number of query subgraphs can be very large, especially when the threshold θ is large.
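The straightforward enumeration of S(Q, θ) can be sketched as follows (illustrative Python, not the paper's code; the paper additionally keeps only connected subgraphs, which this sketch omits). It also shows why |S(Q, θ)| grows combinatorially in θ:

```python
from itertools import combinations

def query_subgraphs(edges, theta):
    """All edge sets obtained from Q by deleting at most theta edges."""
    edges = list(edges)
    result = []
    for k in range(theta + 1):
        # choose which k edges to delete
        for deleted in combinations(range(len(edges)), k):
            result.append([e for i, e in enumerate(edges)
                           if i not in deleted])
    return result

Q = [(1, 2), (2, 3), (1, 3)]        # a triangle query
print(len(query_subgraphs(Q, 1)))   # -> 4 (the query itself + 3 one-edge deletions)
```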
To overcome these shortcomings, we propose a Sharing-based approach to Approximate SUbgraph Matching, called SASUM. Our approach is based on the fact that query subgraphs are highly overlapped. Due to this overlapping nature of query subgraphs, the matches of a query subgraph can be computed from the matches of a smaller query subgraph. For example, if a query subgraph qi is a subgraph of another query subgraph qj, then the matches of qj can be computed from the matches of qi by simply checking whether the additional edges of qj exist in the matches of qi. Thus the number of graphs that need costly exact subgraph matching can be reduced. SASUM uses a lattice framework to identify sharing opportunities between query subgraphs. To further reduce the number of graphs that need exact subgraph matching, SASUM generates small base graphs that are shared by query subgraphs and chooses the minimum number of base graphs whose matches are used to derive the matching results of all query subgraphs. The selected base graphs are called seed graphs. SASUM performs subgraph matching only for the seed graphs and systematically computes the matches of all query subgraphs from them.
Preprint submitted to the IEICE Transactions on Information and Systems on Oct 12, 2012
Fig. 7 Base graphs generated from the graphs in T(Q, 2) (base graphs q14–q21, each a small graph over vertex labels A, B, and C)
graph in T(Q, θ). We say that a base graph qb covers a terminal graph qt if qb ⊆ qt. We denote by C(qb) the set of terminal graphs in T(Q, θ) that are covered by a base graph qb in B(Q, θ). Now we are required to find the smallest subset Bseed(Q, θ) of B(Q, θ) such that the seed graphs in Bseed(Q, θ) collectively cover every graph in T(Q, θ): that is, T(Q, θ) = ⋃_{q ∈ Bseed(Q,θ)} C(q).
It is easy to see that the seed selection problem reduces to the set cover problem [2], whose decision version is NP-complete. An instance of the set cover problem consists of a finite set U and a collection S of subsets of U such that every element of U belongs to at least one subset in S. We want to find the smallest subset C of S whose union is U. The seed selection problem is transformed into the set cover problem by setting U = T(Q, θ) and S = {C(q) | q ∈ B(Q, θ)}. The set cover problem requires that every element of U belong to at least one subset in S, which is guaranteed by Lemma 2.
There is a greedy approximation algorithm for the set cover problem whose approximation ratio is ln |U| + 1 [12]. This algorithm begins by selecting the largest subset from S and then deletes all its elements from U. The algorithm repeatedly adds the subset containing the largest number of remaining uncovered elements until all elements are covered. In the seed selection phase, SASUM obtains Bseed(Q, θ) from B(Q, θ) by using the greedy approximation algorithm just described.
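The greedy selection just described can be sketched as follows (illustrative Python, not the paper's implementation; `cover` plays the role of the map qb → C(qb)):

```python
def greedy_seed_selection(cover):
    """Greedy set cover: cover maps each base-graph id to the set C(q_b)
    of terminal-graph ids it covers. Returns the chosen seed graphs."""
    uncovered = set().union(*cover.values())
    seeds = []
    while uncovered:
        # pick the base graph covering the most still-uncovered terminals
        best = max(cover, key=lambda b: len(cover[b] & uncovered))
        seeds.append(best)
        uncovered -= cover[best]
    return seeds

# Toy instance: q15 covers three terminals, q16 covers the remaining one.
cover = {"q14": {"t1"}, "q15": {"t1", "t2", "t3"}, "q16": {"t3", "t4"}}
print(greedy_seed_selection(cover))  # -> ['q15', 'q16']
```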
Example 2. Figure 8 shows how the base graphs in B(Q, 2) cover the terminal graphs in T(Q, 2). There is an edge between a base graph and a terminal graph if the base graph covers the terminal graph. If we apply the greedy approximation algorithm to the base graphs in B(Q, 2), the algorithm selects the base graphs in the following order: q15, q21, q14, and q16. Thus we have Bseed(Q, 2) = {q14, q15, q16, q21}, the size of which is only half the size of T(Q, 2).
4.3 Query Evaluation
In this section we describe how to find the matches of all graphs in S(Q, θ). SASUM first finds the matches of graphs in Bseed(Q, θ) by exact subgraph matching. Any exact subgraph matching algorithm can be used for this purpose. Then, from these matches, it computes the matches of all graphs in T(Q, θ). Given a graph qt in T(Q, θ), the matches of any graph qb in Bseed(Q, θ) can be used to compute the matches of qt if qt is in C(qb). According to Lemma 1, the matches of graphs in T(Q, θ) can be used to compute the matches of the other graphs in S(Q, θ). To reuse matching results, we have to determine the evaluation order of graphs in S(Q, θ) such that if qi → qj, we find the matches of qi first. By topologically sorting the query lattice (S(Q, θ), ⊆) on the successor relation →, we can obtain the required evaluation order of graphs.
Let the determined order be q1 → q2 → · · · → qk, where k = |S(Q, θ)|. SASUM finds the matches of graphs in this order. If a graph q is a terminal graph in T(Q, θ), then SASUM has already computed the matches of q against the database graph G. If a query subgraph q is a non-terminal graph in S(Q, θ), SASUM finds the matches of q from the matches of q's predecessors in the query lattice. It is guaranteed that the matches of these predecessors have already been found, by the topological sorting order of graphs over the successor relation →. We can reuse the matches of any predecessor graph here. To speed up the process of computing matches by reducing the size of intermediate matching results, SASUM uses the predecessor graph q* with the smallest number of matches, i.e., q* = arg min_{q′} {|M(q′, G)| : q′ → q}, to compute the matches of q.
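The required evaluation order can be obtained with a standard topological sort over the successor relation →, sketched here in Python (identifiers are illustrative, not the paper's implementation):

```python
from collections import deque

def evaluation_order(nodes, successors):
    """Topological order of the query lattice over the successor relation ->.

    successors: dict mapping each graph id to the graphs it points to.
    """
    indegree = {q: 0 for q in nodes}
    for q in nodes:
        for s in successors.get(q, []):
            indegree[s] += 1
    queue = deque(q for q in nodes if indegree[q] == 0)
    order = []
    while queue:
        q = queue.popleft()
        order.append(q)
        for s in successors.get(q, []):
            indegree[s] -= 1
            if indegree[s] == 0:
                queue.append(s)
    return order

# Terminal graphs q7 and q11 precede their common successor q4.
order = evaluation_order(["q4", "q7", "q11"], {"q7": ["q4"], "q11": ["q4"]})
assert order.index("q7") < order.index("q4")
```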
Example 3. In Figure 4, the matches of q4 can be computed from the matches of any one of the graphs q7, q10, q11, and q13. The numbers of matches of these graphs are as follows: |M(q7, G)| = 5, |M(q10, G)| = 4, |M(q11, G)| = 2, and |M(q13, G)| = 6. Therefore, SASUM computes the matches of q4 from those of q11.
Now we describe how to compute the matches of a graph from those of another graph. There are two cases to consider when reusing matching results: (1) from the matches of a graph in Bseed(Q, θ) to those of a graph in T(Q, θ), and (2) from the matches of a graph in S(Q, θ) to those of another graph in S(Q, θ). We consider case (2) first. When qi and qj are both in S(Q, θ) and qi → qj, then graph qj has one more edge than graph qi. In this case, for each match in M(qi, G), SASUM checks whether that match has the additional edge and prunes those matches not having that edge.
We now discuss case (1). When qi is in Bseed(Q, θ), qj is in T(Q, θ), and qj is in C(qi), then qj has either (i) one additional edge, or (ii) one additional edge and one additional vertex, since qi is generated by edge pruning. For case (i), SASUM does the same as in case (2) above. For case (ii), however, we cannot compute the matches of qj by only pruning the matches of qi. Instead, we have to extend the matches of qi with the mappings of the additional vertex. Let graph qj have n vertices and qi have n − 1 vertices. Given a match m in M(qi, G), let vertices u1, u2, . . . , un−1 in G be the mappings of vertices v1, v2, . . . , vn−1 in qi. Suppose that vn is the additional vertex in qj, and there is an additional edge (vi, vn) in qj where 1 ≤ i ≤ n − 1. Now we extend the match m with the mappings (vn, un) for each un in G such that the edge (ui, un) exists in E(G) and the label of un is equal to that of vn, i.e., l(un) = l(vn). We can easily find such un's from G by visiting the adjacent vertices uj of ui and checking whether l(uj) = l(vn).
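The two reuse operations just described, pruning by an additional edge and extending by an additional vertex, can be sketched as follows (illustrative Python, not the paper's code; a match is represented as a dict from query vertices to database vertices):

```python
def prune_by_edge(matches, edge, g_edges):
    """Cases (2) and (1)(i): keep only matches containing the extra edge."""
    vi, vj = edge
    return [m for m in matches if frozenset((m[vi], m[vj])) in g_edges]

def extend_by_vertex(matches, vi, vn, label_vn, g_adj, g_label):
    """Case (1)(ii): extend each match with mappings (vn, un), where un is a
    neighbour of the image of vi in G carrying the label of vn."""
    extended = []
    for m in matches:
        for un in g_adj[m[vi]]:
            if g_label[un] == label_vn and un not in m.values():
                extended.append({**m, vn: un})
    return extended

# Toy database graph: u2 is adjacent to u1 (label 'B') and u4 (label 'A').
g_adj = {"u1": ["u2"], "u2": ["u1", "u4"], "u4": ["u2"]}
g_label = {"u1": "B", "u2": "A", "u4": "A"}
print(extend_by_vertex([{"v1": "u2"}], "v1", "v3", "A", g_adj, g_label))
# -> [{'v1': 'u2', 'v3': 'u4'}]
```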
Example 4. We illustrate how to compute matches from other matches by an example, shown in Figure 9. In this example, we compute the matches of graphs q6 and q1 from those of the seed graph q14. On the left, we show the query graph Q and the database graph G again for the convenience of the reader. On the right, we show the graphs q14, q6, and q1, where q14 covers q6, and q6 → q1. The set of matches of a graph is shown below the graph, as a table. The first row of a table shows the vertices in the graph above it, and the remaining rows represent the matches of the graph. We use m1, m2, . . . to denote the first match (row), the second match, and so on, in each set of matches.
Assume that we have already found the matches in M(q14, G) by subgraph matching. Now we compute M(q6, G) from M(q14, G). In the figure, graph q6 has one additional edge (v1, v3) and one additional vertex v3 compared to graph q14 (indicated by solid lines). Thus, we compute M(q6, G) from M(q14, G) by extending each match in M(q14, G) with the mappings of vertex v3. Let us consider the match m1 = {(v1, u2), (v2, u1), (v4, u3)} first. Since the additional edge is between vertices v1 and v3, and the label of vertex v3 is 'A', we find the adjacent vertices of u2 in G having label 'A'. Only vertex u4 is eligible as the mapping of vertex v3. Thus, the match m1 in M(q14, G) is extended by the mapping (v3, u4) to become the match m1 in M(q6, G). The other matches in M(q14, G) can be extended in the same way just described. A dotted arrow from a match m in M(q14, G) to a match m′ in M(q6, G) indicates that the match m is extended to become the match m′.
Next we compute M(q1, G) from M(q6, G). The graph q1 has one more edge, (v2, v4), than q6. Here we compute M(q1, G) by filtering out those matches in M(q6, G) not having the edge corresponding to the edge (v2, v4) in G. The matches m2 and m3 in M(q6, G) are filtered out because there is no edge between vertices u8 and u3.
4.4 Outputting Matching Results
In the problem statement presented in Section 2, we do not impose any restrictions on the order of outputting the matching results of the graphs in S(Q, θ). However, the user may want to get the matching results in order of edge edit distance from the query graph: that is, the matching results of the query graph, then those with one missing edge, those with two missing edges, and so on. In this case, we must keep the matching results of every graph in S(Q, θ) before outputting them.
If the user does not care about the output order, then SASUM can reduce space usage by producing and removing intermediate matching results early. There are two cases to consider. First, let q be a graph in Bseed(Q, θ). If SASUM has obtained the matches of all graphs q′ where q′ is in C(q), then SASUM can safely throw away the matches of q because they will not be used later. Second, let q be a graph in S(Q, θ). If SASUM has obtained the matches of all graphs that are successors of q, then SASUM can remove the matches of q right away for the same reason.
5. Analytical Study
This section analyzes our approach to approximate subgraph matching. We aim at proving the correctness of SASUM and showing the superiority of our approach compared to the state-of-the-art method in terms of the number of graphs that need subgraph matching.
5.1 Proof of Correctness
The following theorem shows the correctness of SASUM.
Theorem 1. Given a database graph G, a connected query graph Q, and a positive integer threshold θ, SASUM finds all matches of graphs in S(Q, θ).
Proof. In the evaluation phase, SASUM first finds the matches of graphs in Bseed(Q, θ) by subgraph matching. Then it computes the matches of graphs in T(Q, θ) from those of graphs in Bseed(Q, θ). It remains to show that SASUM correctly computes the matches of all non-terminal graphs in S(Q, θ). According to Lemma 1, for every non-terminal graph q in S(Q, θ), there is a path from some terminal graph qt in T(Q, θ) to q. Let the path be q1(= qt) → q2 → · · · → qk(= q). By the evaluation order of graphs, which is a topological order, the matches of qi are computed from those of qi−1 for 2 ≤ i ≤ k. We already have the matches of qt. Therefore, SASUM eventually computes the matches of q.
5.2 Performance Guarantee of SASUM
We compare SASUM with the state-of-the-art approach in terms of the number of graphs that need costly subgraph matching. The number of graphs that need subgraph matching is a dominant factor in the overall performance of a given method for approximate subgraph matching (we verify this in Section 6 through experimental evaluation). We compare three approaches: NAIVE, SHARE, and SASUM. The NAIVE approach finds the matches of each graph in S(Q, θ) independently. The SHARE approach is a basic sharing-based approach, which is employed by the state-of-the-art method [10]. It computes the matches of a query subgraph from those of another query subgraph. SHARE needs subgraph matching for the graphs in T(Q, θ). Our approach, SASUM, increases sharing opportunities by generating base graphs and selecting a small number of seed graphs. SASUM needs subgraph matching only for the graphs in Bseed(Q, θ). Let C_NAIVE, C_SHARE, and C_SASUM denote the
Fig. 10 Query execution time for synthetic datasets: (a) varying |V(G)|, (b) varying |L(G)|, (c) varying θ (|V(G)| = 5000, |V(Q)| = 20, log scale), (d) varying θ (|V(G)| = 2500, |V(Q)| = 10, log scale); each panel plots runtime (sec) for SASUM and SAPPER.
Fig. 11 Comparing the number of subgraph matching executions: reduction ratio vs. θ, for |V(Q)| = 20.
Fig. 12 Query execution time for different query graphs: (a) varying |V(Q)|, (b) varying deg(Q); each panel plots runtime (sec) for SASUM and SAPPER.
fewer graphs that need subgraph matching. When threshold θ is large, the number of graphs in T(Q, θ) is large, but SASUM uses Bseed(Q, θ) instead, which is far smaller than T(Q, θ). We can verify this in Figure 11, which shows the reduction ratio of SASUM in the number of graphs that need subgraph matching. For example, a reduction ratio of 0.6 indicates that SASUM reduces the number of graphs that need subgraph matching by 60% compared to SAPPER. As the figure shows, SASUM requires less than half the number of graphs needing subgraph matching required by SAPPER in all cases, and as θ increases, the reduction ratio further increases.
We vary the number of vertices in Q, and the results are shown in Figure 12(a). With more vertices in Q, more vertices and edges need to be compared in subgraph matching. Thus the query times of both methods increase. However, SASUM performs better than SAPPER because it requires subgraph matching for a smaller number of graphs.
Fig. 13 Breakdown of query execution time: (a) varying |V(G)|, (b) varying θ; each panel plots the relative portion (%) of exact subgraph matching vs. the other parts.
The last parameter we vary is the average degree of Q. The results are shown in Figure 12(b). A high vertex degree generates more graphs in S(Q, θ), since the number of graphs in S(Q, θ) is exponential in the average degree of Q. Moreover, when the average degree of Q is large, subgraph matching needs to compare more edges. Thus, as the average degree of Q increases, the performance gap between SASUM and SAPPER widens.
We analyze the query execution time of SASUM to assess the relative portion of time spent on subgraph matching and that spent on the other parts. The results are shown in Figures 13(a) and (b). Figure 13(a) shows the portion of time spent on subgraph matching and on the other parts in the overall query execution time. As the number of vertices in G increases, subgraph matching needs to compare more vertices. Thus, the portion of time spent on subgraph matching increases and dominates the overall performance. In Figure 13(b), we vary threshold θ. Here the portion of time spent on the other parts increases as θ increases. This is because the number of graphs in S(Q, θ) is exponential in θ. Nevertheless, the portion of time spent on subgraph matching is still dominant in all cases.
6.3 Real Datasets
For real-world data, we prepared two large real graphs: a human protein interaction network [14] and a collaboration network [15]. The protein interaction network contains 10,527 vertices and 40,903 edges. Each vertex represents a protein, and the label of the vertex is its gene ontology term. An edge in the graph represents an interaction between the two proteins it connects. The collaboration network includes 5,241 vertices and 14,484 edges. Each vertex represents an author, and there is an edge between two authors if they coauthored a paper. 250 labels are randomly distributed over the vertices in the collaboration network. Figure 14 shows
Fig. 14 Degree sequences of real graphs: (a) human protein interaction network, (b) collaboration network (vertices in descending order of degree).
Fig. 15 Query execution time with real graphs (log scale): (a) human protein interaction network, (b) collaboration network (runtime in seconds vs. θ, for SASUM and SAPPER).
sample degree sequences of the two real datasets. As you
can see in the figure, their degree sequences show a heavy-
tail behavior.
We compare the performance of SASUM and SAPPER over different query graphs extracted from the two graphs. Since most of the results are similar to those for the synthetic datasets, we show only the results of varying threshold θ, which are shown in Figure 15. The figure shows that SASUM outperforms SAPPER by orders of magnitude in terms of query execution time on both datasets.
7. Related Work
Subgraph matching, which finds the occurrences of a specified graph pattern in a graph, is a fundamental operation in graph data processing. Ullmann [3] proposed a subgraph matching algorithm based on a state-space search method with backtracking. VF2 [4] is a more recent work that introduces a set of feasibility rules for pruning the state space. These two methods, however, are prohibitively expensive for query processing against a large database graph, since they do not use any index structure built by preprocessing the database graph.
Several indexing-based methods have been developed for subgraph matching. In graph indexing methods, e.g., GraphGrep [16], gIndex [17], TreePi [18], Tree+∆ [19], and FG-Index [20], the graph database consists of a set of small graphs. The goal of graph indexing is to find all graphs that contain a given query graph. On the other hand, subgraph indexing finds all occurrences of a given query graph in a very large database graph. GADDI [5], NOVA [6], SUMMA [7], the approach proposed in [8], and SAPPER [10] fall into this category. Our approach, SASUM, also belongs to this category.
Recently, a number of methods have been proposed to support approximate subgraph matching [10, 21, 22]. Among them, similarity search methods, e.g., TALE [21] and G-Hash [22], are not designed to find all approximate matches of a given query graph. SAPPER [10] is the state-of-the-art method that finds all approximate matches of a given query graph in a large database graph. SAPPER uses the fact that subgraphs of a given query are highly overlapped. It finds the matches of a larger graph from those of a smaller graph. However, SAPPER still needs to perform subgraph matching for a large number of graphs.
8. Conclusions
In this paper we have investigated how to find all approxi-
mate matches of a given graph from a large database graph,
allowing missing edges. A straightforward way to solve this
problem is to generate a set of query subgraphs, which have
no more missing edges than a user-specified threshold, and
perform subgraph matching for each query subgraph independently.
However, this simple method is not feasible: the number
of query subgraphs could still be very large, and subgraph
matching itself is computationally hard, since subgraph
isomorphism is NP-complete [2].
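To make the blow-up concrete, the baseline's query-subgraph generation can be sketched as follows (the name and representation are ours; a real implementation would additionally discard subgraphs that become disconnected):

```python
from itertools import combinations

def query_subgraphs(edges, max_missing):
    """Yield every edge set obtained by deleting at most `max_missing`
    edges from the query; the naive baseline runs exact subgraph
    matching once per such edge set. For a query with m edges this
    produces sum_{i=0}^{k} C(m, i) subgraphs."""
    edges = list(edges)
    for k in range(max_missing + 1):
        for removed in combinations(range(len(edges)), k):
            dropped = set(removed)
            yield [e for i, e in enumerate(edges) if i not in dropped]
```

For a 20-edge query with a threshold of 3 missing edges, this is already 1 + 20 + 190 + 1140 = 1351 independent exact matching runs, which is why sharing work across query subgraphs matters.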
In this paper we have proposed a sharing based
approach to fast approximate subgraph matching, called
SASUM. We aim to reduce the number of graphs that
require subgraph matching, which determines the overall
performance of an approximate subgraph matching algorithm. To
this end, SASUM exploits the fact that query subgraphs are
highly overlapped. Due to this property of query subgraphs,
the matches of a query subgraph can be computed from
those of another query subgraph without costly subgraph
matching. SASUM uses a lattice framework to identify shar-
ing opportunities between query subgraphs. SASUM goes
one step further and produces small base graphs that are
shared by query subgraphs by edge pruning. It then chooses
a small number of seed graphs from them, and performs sub-
graph matching only for the seed graphs. SASUM system-
atically computes the matches of all query subgraphs from
those of seed graphs. We have proven that the number of
graphs for which SASUM must perform subgraph matching
is at most the number required by the state-of-the-art method,
and our experimental evaluation shows that it is much smaller
in many cases. A comprehensive set of experiments has shown
that SASUM outperforms the state-of-the-art method by orders
of magnitude in query execution time, owing to the large
reduction in the number of graphs that need subgraph matching.
References
[1] T. Keshava Prasad, R. Goel, K. Kandasamy, S. Keerthikumar, S. Ku-
mar, S. Mathivanan, D. Telikicherla, R. Raju, B. Shafreen, A. Venu-
gopal, et al., “Human protein reference database—2009 update,”
Nucleic Acids Res., vol.37, no.suppl 1, pp.D767–D772, 2009.
[2] M. Garey and D. Johnson, Computers and Intractability: A Guide
to the Theory of NP-completeness, WH Freeman & Co. NY, USA,
1979.
[3] J. Ullmann, “An algorithm for subgraph isomorphism,” J. ACM,
vol.23, no.1, pp.31–42, 1976.
[4] L. Cordella, P. Foggia, C. Sansone, and M. Vento, “A (sub) graph
isomorphism algorithm for matching large graphs,” IEEE Trans. Pat-
tern Anal. Mach. Intell., vol.26, no.10, pp.1367–1372, 2004.
[5] S. Zhang, S. Li, and J. Yang, “GADDI: distance index based subgraph
matching in biological networks,” Proc. EDBT, pp.192–203, 2009.
[6] K. Zhu, Y. Zhang, X. Lin, G. Zhu, and W. Wang, “NOVA: A novel and
efficient framework for finding subgraph isomorphism mappings in
large graphs,” Proc. DASFAA, pp.140–154, 2010.
[7] S. Zhang, S. Li, and J. Yang, “SUMMA: subgraph matching in massive
graphs,” Proc. CIKM, pp.1285–1288, 2010.
[8] S. Kim, I. Song, and Y. Lee, “An edge-based framework for fast
subgraph matching in a large graph,” Proc. DASFAA, pp.404–417,
2011.
[9] S. Suthram, T. Shlomi, E. Ruppin, R. Sharan, and T. Ideker, “A direct
comparison of protein interaction confidence assignment schemes,”
BMC Bioinf., vol.7, no.1, p.360, 2006.
[10] S. Zhang, J. Yang, and W. Jin, “SAPPER: subgraph indexing and
approximate matching in large graphs,” Proc. VLDB Endow., vol.3,
no.1-2, pp.1185–1194, 2010.
[11] D.B. West, Introduction to Graph Theory, Prentice Hall, 2001.
[12] T. Cormen, Introduction to algorithms, MIT electrical engineering
and computer science series, MIT Press, 2001.
[13] F. Viger and M. Latapy, “Efficient and simple generation of random
simple connected graphs with prescribed degree sequence,” Com-
puting and Combinatorics, pp.440–449, 2005.
[14] C. Stark, B. Breitkreutz, A. Chatr-aryamontri, L. Boucher,
R. Oughtred, M. Livstone, J. Nixon, K. Van Auken, X. Wang, X. Shi,
et al., “The BioGRID interaction database: 2011 update,” Nucleic
Acids Res., vol.39, no.suppl 1, pp.D698–D704, 2011.
[15] J. Leskovec, J. Kleinberg, and C. Faloutsos, “Graph evolution: Den-
sification and shrinking diameters,” ACM Trans. Knowledge Dis-
covery from Data, vol.1, no.1, pp.2:1–2:41, 2007.
[16] D. Shasha, J.T.L. Wang, and R. Giugno, “Algorithmics and applica-
tions of tree and graph searching,” Proc. PODS, pp.39–52, 2002.
[17] X. Yan, P. Yu, and J. Han, “Graph indexing: A frequent structure-
based approach,” Proc. SIGMOD, pp.335–346, 2004.
[18] S. Zhang, M. Hu, and J. Yang, “TreePi: A novel graph indexing
method,” Proc. ICDE, pp.966–975, 2007.
[19] P. Zhao, J. Yu, and P. Yu, “Graph indexing: tree + delta >= graph,”
Proc. VLDB, pp.938–949, 2007.
[20] J. Cheng, Y. Ke, W. Ng, and A. Lu, “Fg-index: towards verification-
free query processing on graph databases,” Proc. SIGMOD, pp.857–
872, 2007.
[21] Y. Tian and J. Patel, “TALE: A tool for approximate large graph
matching,” Proc. ICDE, pp.963–972, 2008.
[22] X. Wang, A. Smalter, J. Huan, and G. Lushington, “G-hash: towards
fast kernel-based similarity search in large graph databases,” Proc.
EDBT, pp.472–480, 2009.