SCALABLE PATTERN MATCHING OVER COMPRESSED GRAPHS VIA DE-DENSIFICATION

Antonio Maccioni & Daniel Abadi
22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 13-17, 2016 | San Francisco, California
Aftab Alam
Department of Computer Engineering, Kyung Hee University

Contents
1. Background
2. Problem Statement
3. Proposed Solution
4. Dedensification
5. Graph Pattern Matching
6. Empirical Evaluation
7. Conclusion and Future Work

Data & Knowledge Engineering Laboratory
Department of Computer Engineering, Kyung Hee
Background
• Graph database, with graph G (V, E, W):
– Uses graph structures for semantic queries, with nodes, edges and properties to represent and store data.
• Graph query language (SPARQL example):
– SELECT ?subject ?predicate ?object
WHERE {?subject ?predicate ?object}
LIMIT 100
• One of the most common operations on a graph database is graph pattern matching.
• Example triples:
subject  predicate  object
s1       p1         o1
s2       p2         o2
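
The query above can be read as a triple pattern whose three positions are all variables. A minimal Python sketch of that idea follows; the `match` helper and the use of `None` as a variable are assumptions made for illustration, not part of any graph database API.

```python
from itertools import islice

# The two example triples from the slide: (subject, predicate, object).
TRIPLES = [("s1", "p1", "o1"), ("s2", "p2", "o2")]

def match(triples, s=None, p=None, o=None, limit=None):
    """Return triples matching the pattern; None acts as a variable."""
    hits = (
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )
    # islice with limit=None returns every hit.
    return list(islice(hits, limit))

# Equivalent in spirit to: SELECT ?subject ?predicate ?object ... LIMIT 100
all_rows = match(TRIPLES, limit=100)
# A pattern with a constant subject and two variables.
only_s2 = match(TRIPLES, s="s2")
```

Fixing one position to a constant turns the full scan into a selection, which is the building block used later for pattern matching.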

Problem Statement
• Modern social networks such as Facebook and Twitter are large and contain dense areas.
• High-degree (HD) nodes make query processing over graphs challenging for a database.
– E.g., almost 1/15 of registered Twitter users follow @B. Obama.
• Scaling queries over graphs is harder than scaling queries over relational data.
• Relational data:
o Set-oriented.
o Partitioning, replication and indexing can be applied.
o Multiple cores/servers can be used, each operating on its partition independently.

Problem Statement (Cont’d)
• Real-world graph query operations are less partitionable,
– because real-world graphs follow a power law.
– E.g., 10% of Twitter accounts follow the same five users.
– The graph is dense around these high-degree (HD) nodes.
• Techniques like partitioning, replication and indexing
– are not enough to solve the problems raised by these dense areas.
• Parallelization: complicated, almost impossible in some situations.
• Node or edge partitioning:
– Extremely skewed (processing time of HD nodes >> processing time of low-degree nodes).

Proposed Solution
Graph G → Graph G`
• Dedensification (a lossless compression technique)
– Reduces the number of adjacencies of HD nodes.
• HD nodes are surrounded by redundant information
– that can be synthesized and eliminated.
• Identify clusters of low-degree nodes connected to HD nodes.
– Insert special nodes, called compressor nodes, into the graph.
– Each compressor node represents the common connections of a cluster of related nodes to HD nodes.
– Remove the redundant edges.

Coming Next
• A non-expansive strategy for dedensification
• Query Answering for graph pattern matching
• Experiments on real and synthetic graphs

DEDENSIFICATION
Features and Parameters
• Setting the high-degree (HD) threshold τ
– The threshold τ indicates the minimum number of adjacencies that an HD node should have.
– A node is HD if it has at least τ incoming edges sharing the same label.
– The threshold divides the N nodes into two sets:
o High-degree nodes Hh
o Low-degree nodes Hi
• Dedensification
– Lossless compression technique.
– Reduces the number of adjacencies of HD nodes.
– Applicable to both undirected and directed graphs.
– Can be performed on both incoming and outgoing edges of the graph.
• In real-world graphs (social network domains), incoming edges are the bigger problem.
• Directed labeled graph G = (N, E), where N is a set of unique nodes and E is a set of non-unique edges.
– Twitter: "follow" labels.
– Facebook: "follow" or "like" labels.
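
The threshold rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: edges are given as (source, label, target) tuples, and the function name `split_by_degree` and parameter `tau` are hypothetical names, not taken from the paper.

```python
from collections import Counter

def split_by_degree(edges, tau):
    """Split nodes into (high_degree, low_degree) sets.

    A node is high-degree when at least `tau` of its incoming
    edges share the same label.
    """
    # Count incoming edges per (target, label) pair.
    incoming = Counter((target, label) for _, label, target in edges)
    nodes = {n for source, _, target in edges for n in (source, target)}
    high = {target for (target, _), count in incoming.items() if count >= tau}
    return high, nodes - high

# Three accounts follow "A"; one of them also likes "B".
edges = [
    ("1", "follow", "A"),
    ("2", "follow", "A"),
    ("3", "follow", "A"),
    ("1", "like", "B"),
]
high, low = split_by_degree(edges, tau=3)
```

With `tau=3`, only "A" qualifies as HD; lowering the threshold to 1 would also promote "B".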

DEDENSIFICATION (Cont’d)
Working & Example
• Often a cluster of related nodes is connected to the same group of HD nodes.
• Add a compressor node that summarizes multiple connections of the same kind to high-degree nodes.
• This process is called "dedensification".
• Example:
– Figure (a): 6 low-degree nodes (white) and 3 HD nodes (red).
– The low-degree nodes have outgoing edges to the same set of three HD nodes.
– Remove the edges that connect the white nodes to this set of HD nodes, and instead create a single edge from each white node to the new yellow node (the compressor node).
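
The example above can be reproduced with a small Python sketch. It assumes a single edge label (so edges are plain (source, target) pairs) and that every white node is connected to every HD node in the set; the node names and the helper `dedensify_cluster` are hypothetical, and this is not the paper's Algorithm 1.

```python
def dedensify_cluster(edges, members, hubs, compressor):
    """Reroute all member -> hub edges through one compressor node."""
    members, hubs = set(members), set(hubs)
    # Keep every edge that is not a direct member -> hub connection.
    kept = {(s, t) for s, t in edges if not (s in members and t in hubs)}
    # One edge from each member to the compressor ...
    kept |= {(m, compressor) for m in members}
    # ... and one edge from the compressor to each hub.
    kept |= {(compressor, h) for h in hubs}
    return kept

white = [1, 2, 3, 4, 5, 6]   # low-degree nodes
red = ["A", "B", "C"]        # HD nodes
original = {(w, r) for w in white for r in red}
compressed = dedensify_cluster(original, white, red, compressor="c1")
```

In this toy case the 18 original edges become 9: six into the compressor and three out of it.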

DEDENSIFICATION (Cont’d)
Advantages and Constraint
• Advantages
– If 1,000 nodes are connected to a set of 3 HD nodes,
– then 3,000 edges are incoming to the three HD nodes.
– These are replaced with 1,000 edges into a compressor node, plus 3 edges from the compressor node to the HD nodes.
– This reduces congestion around the HD nodes.
• The aim is to accelerate query performance and optimize for compression,
• so constraints are placed on how and when dedensification occurs.
• Dedensification creates a new compressor node if Constraint 1 holds on a set of HD nodes H and other nodes M.
• If M and H overlap, Constraint 1 is still valid.
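
As a worked check of the edge counts in this example, assume every node in M is connected to every HD node in H, with |M| = 1000 and |H| = 3. The inequality noted after the block is a plain edge-count observation under that assumption and is not claimed to be the paper's Constraint 1.

```latex
\begin{align*}
\text{edges before} &= |M| \cdot |H| = 1000 \cdot 3 = 3000,\\
\text{edges after}  &= |M| + |H| = 1000 + 3 = 1003,\\
\text{edges saved}  &= |M| \cdot |H| - (|M| + |H|) = 1997.
\end{align*}
```

Under this assumption, one compressor node reduces the total number of edges whenever |M|·|H| > |M| + |H|.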

DEDENSIFICATION
Algorithm
• Algorithm 1: Dedensification of a graph.
– Operates on the HD nodes H and the nodes M (the white nodes in the example).
• Lines 4-11 find the node sets H and M.
• Lines 12-19 compute the actual dedensification over H and M.
• The initial graph G can be reconstructed from G`
– by iterating over each compressor node nc ∈ Nc
– and connecting each incoming node of nc to all the outgoing nodes of nc.
– Finally, empty and remove Nc.
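
The reconstruction steps can be sketched as follows. This is a minimal illustration, assuming unlabeled (source, target) edges and compressor nodes that are never connected to each other; the function name `reconstruct` and the sample graph are hypothetical.

```python
def reconstruct(edges, compressors):
    """Rebuild the original edge set from a dedensified one."""
    compressors = set(compressors)
    # Start from the edges that do not touch any compressor node.
    restored = {
        (s, t) for s, t in edges
        if s not in compressors and t not in compressors
    }
    for c in compressors:
        sources = [s for s, t in edges if t == c]   # incoming nodes of c
        targets = [t for s, t in edges if s == c]   # outgoing nodes of c
        # Connect each incoming node to all outgoing nodes.
        restored |= {(s, t) for s in sources for t in targets}
    return restored

# Six nodes routed through compressor "c1" to hubs A, B, C, plus one plain edge.
compressed = (
    {(m, "c1") for m in range(1, 7)}
    | {("c1", h) for h in "ABC"}
    | {(1, 2)}
)
original = reconstruct(compressed, {"c1"})
```

Because every removed edge can be regenerated this way, the compression is lossless for this toy graph.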

GRAPH PATTERN MATCHING
Query Pattern
• How are graph pattern matching queries processed over the compressed graph (CG)?
• A query pattern is itself a graph, consisting of
– a set of nodes and edges, each of which can have either a label or a variable.
• Example:
– Fig. (d) consists of two nodes and a link:
o Constant node (6), variable link ?v3 and variable node ?v2.
o Returns all possible values of ?v3 and ?v2 connected to node (6).
• A query pattern is a composition of triples:
– Triple: node-edge-node.
– 1st node: the source or subject "s".
– Link: the label, edge or predicate "p".
– 2nd node: the destination or object "o".
• The query in Fig. (b) can be decomposed as:
– (?v1, ?v3, ?v4),
– (?v1, ?v3, A), and
– (?v1, ?v3, 6).

GRAPH PATTERN MATCHING (Cont’d)
Query Pattern Matching
• A significant number of algorithms exist for processing pattern matching queries.
• This work builds on the most prevalent one, which proceeds as follows:
– Each triple pattern t1, t2, …, tn of query q corresponds to
– a selection on the graph database G (i.e., t1(G), t2(G), …, tn(G)), and
– a common node leads to a join operation between two sets of triples,
– e.g., t1(G) ⋈ t2(G) if t1 and t2 have a node in common.
– After the selections and joins, the output is the complete answer set to query q, i.e., q(G).
– Such joins are commutative; however, the source node is commonly used to connect them.
• Star query: a query in which a node in the query graph has multiple edges emerging from it.
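
The selection-then-join procedure can be sketched in Python for a two-branch star query. This is a generic illustration, not the paper's implementation: the helpers `select` and `join`, the "?"-prefix convention for variables, and the sample triples are all assumptions.

```python
def is_var(term):
    """Terms starting with '?' are treated as variables."""
    return isinstance(term, str) and term.startswith("?")

def select(pattern, triples):
    """Selection: one variable binding per triple matching the pattern."""
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if is_var(term):
                # A repeated variable must bind to the same value.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            results.append(binding)
    return results

def join(left, right):
    """Natural join: merge bindings that agree on shared variables."""
    return [
        {**a, **b}
        for a in left for b in right
        if all(a[k] == b[k] for k in a.keys() & b.keys())
    ]

G = [("1", "follow", "A"), ("1", "follow", "B"), ("2", "follow", "A")]
t1 = ("?s", "follow", "A")
t2 = ("?s", "follow", "B")
answer = join(select(t1, G), select(t2, G))
```

Here the shared source variable `?s` is the join node, so only node "1", which follows both A and B, survives the join.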

• Proposed solution processes a query q over G
– by rewriting q to a new query pattern q` over the compressed graph G`,
– such that q(G) = q`(G`).
• The underlying system does not need to be aware of the original q(G).
• It simply needs to optimize the processing of q` over G`.
• In the dedensified G`,
– there are no direct connections from low-degree nodes to HD nodes;
– compressor nodes are always present between them.

• The flowchart shows the computation of all
– types of star joins over G`.
• Queries are submitted to the graph database G`
– as a set of edges (s, o).
• The algorithm's input is the graph G` and the query q.
• If q contains references to constant low-degree node labels,
– these are less important,
– since for them there is no difference between G` and G.
• The focus is only on the parts of q involving constant HD nodes:
– HD = {h1, h2, …, hn} and VAR = {s, v1, v2, …, vm}, where s is the source.
– Variables can be matched either to low-degree nodes or to HD nodes.

Three Micro Cases of Star Queries
1. Stars formed by only HD nodes in their fan-out.
– The left branch of the flowchart.
2. Stars formed by a mixed fan-out of variables and HD nodes.
– The central branch of the flowchart.
3. Stars formed by only variables in their fan-out.
– The right branch of the flowchart.

1. HD Nodes Only.
• When VAR = {s}, the query contains a fan-out with HD nodes only.
• This is the easiest case to compute,
– as incoming edges to HD nodes only come from compressor nodes.
– We simply need to search for compressor nodes
– that connect to all the HD nodes h1, h2, …, hn in the query.
• If no such compressor node is found, the empty set is returned.
• If compressor nodes are found,
– then all nodes with an outgoing edge to one of these compressor nodes form the result set for q`(G`).
• Block A is responsible for these computations.
• Example: q1 = (HD = {A, B}, VAR = {s}) over the graph in Fig. (b):
– Find the compressors AB and ABC, which are connected to both A and B.
– The solutions are the incoming nodes to those compressors, namely nodes 2, 3, 4 and 5.
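
The HD-only case can be sketched in Python as a lookup over compressor nodes. The figure is not reproduced here, so the assignment of nodes 2 and 3 to compressor AB and nodes 4 and 5 to compressor ABC is an assumption chosen to match the stated answer; the function name `hd_only_star` and the two dictionaries are hypothetical.

```python
def hd_only_star(hd_nodes, compressor_out, compressor_in):
    """Answer a star query whose fan-out contains only constant HD nodes.

    compressor_out maps a compressor to the HD nodes it points to;
    compressor_in maps a compressor to the nodes pointing to it.
    """
    wanted = set(hd_nodes)
    answers = set()
    for compressor, targets in compressor_out.items():
        # The compressor must reach every HD node named in the query.
        if wanted <= targets:
            answers |= compressor_in[compressor]
    return answers

# Assumed layout of the compressed graph (not taken from the figure).
compressor_out = {"AB": {"A", "B"}, "ABC": {"A", "B", "C"}}
compressor_in = {"AB": {2, 3}, "ABC": {4, 5}}

q1 = hd_only_star({"A", "B"}, compressor_out, compressor_in)
```

Only compressor nodes are inspected, so the cost depends on the number of compressors rather than on the in-degree of A or B.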

2. Mix of HD and Variable Nodes.
• When the fan-out contains both HD nodes and variables.
• First, search for the partial stars containing the HD nodes that are specified in the query
– (push high-degree nodes down),
• and assign them to the set y.
– If y is empty, no further search is needed,
– as the final result set q`(G`) is empty (Block O).
• If y is not empty,
– Blocks C, D, E and F are computed in sequence
– to enrich the partial answers in y.
• The goal of Blocks C-F is to refine y by matching the variable parts of the query.
• This is complicated by the fact that variables can correspond to both low-degree (LD) and HD nodes.

3. Variable Nodes Only.
• When HD is empty:
– Equivalent to case 2 (mix of HD and variable nodes),
– except that there is no filter to perform on the HD constants.
• Block G = Block C
– It finds the HD nodes that are potential matches for object variables in the query.
• Block H = Block D and Block E
– It finds all the low-degree (LD) nodes that are potential matches for variables in the query, i.e., all nodes that have incoming edges from non-compressor nodes.
• Block I = Block F,
– finally outputting q`(G`).

EMPIRICAL EVALUATION
• Goals: to understand
– how the proposed approach compares with running normal queries over uncompressed graphs,
– the scalability of the approach on real-world graphs, and
– whether the approach can complement existing indexing approaches.
• Created a graph database system prototype
– that can generate the compressed graph (G`) and
– implements the proposed algorithms.
• Cold-cache and warm-cache experiments.
• The following datasets were used for testing: Twitter, Google, LiveJournal and Barabasi (synthetic).

EMPIRICAL EVALUATION (Cont’d)
Star Queries & Patterns
• Star queries were used to evaluate the proposed algorithms.
– Five classes of queries:
o Pattern A: only HD nodes.
o Pattern B: mix of variables and HD nodes.
o Pattern C: variables only.
o Patterns D and E: compositions of multiple stars through HD nodes.

Results
• Focus on query performance rather than data compression.
• The threshold τ was set large, to create 100 compressor nodes per dataset:
o τ = 2,500 for Twitter
o τ = 5,000 for Google
o τ = 10,000 for LiveJournal
o τ = 7,000-28,000 for Barabasi
• Performance was checked both
o without indexing (ni) and with indexing (in).
• dd = dedensified graph; or = original graph.

Evaluation with Evolving Graphs
• How do different pattern-matching techniques scale as the graph grows?
• Use the Barabasi model to generate graphs and measure the performance with
– 100,000 nodes and 2,000,000 edges,
– 200,000 nodes and 4,000,000 edges,
– 300,000 nodes and 6,000,000 edges, and
– 400,000 nodes and 8,000,000 edges,
– where the threshold τ was set to 7,000; 14,000; 21,000 and 28,000, respectively.
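
The graph sizes above correspond to roughly 20 edges per node. A minimal preferential-attachment generator in pure Python is sketched below to show how such graphs can be produced; it is a common textbook formulation of the Barabasi-Albert model and is an assumption, not the generator used in the experiments. The function name and parameters are hypothetical.

```python
import random

def barabasi_albert(n, m, seed=0):
    """Generate directed edges (new_node, older_node) by preferential attachment.

    Each new node attaches to m distinct existing nodes, chosen with
    probability proportional to their current degree.
    """
    rng = random.Random(seed)
    edges = []
    targets = list(range(m))   # the first new node links to the m seed nodes
    repeated = []              # every node appears once per incident edge
    for new in range(m, n):
        edges.extend((new, t) for t in targets)
        repeated.extend(targets)
        repeated.extend([new] * m)
        # Sample m distinct targets for the next node, weighted by degree.
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(repeated))
        targets = list(chosen)
    return edges

# Small instance: 1,000 nodes with m = 20 gives 20 * (1000 - 20) edges.
edges = barabasi_albert(1000, 20)
```

With n = 100,000 and m = 20 this formulation yields 1,999,600 edges, close to the 2,000,000 reported for the smallest graph; old nodes accumulate incoming edges and become the HD nodes.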

Conclusion
• Introduced dedensification for graph databases
– to improve the scalability of query performance around HD nodes.
– Removes redundancy in graphs using compressor nodes.
– Improves query performance.
