COSI: Cloud Oriented Subgraph Identification in Massive Social Networks

Matthias Bröcheler, Andrea Pugliese
& V.S. Subrahmanian

SNA Challenge: Scalability

Huge Social Networks
500 million users
50M tweets / day

Cloud based asynchronous storage

Answers complex queries in ~1 sec on a 778 million edge network

Outline
Motivation
Subgraph Identification
Graph Partitioning
Query Answering
Experiments
Conclusion

[Figure: example social network graph. Vertices include people (e.g. Jones, Dooley, Smith, Calero, John Doe), papers (“ABC”, “HIJ”, “UVW”, “XYZ”), universities and departments (e.g. University MD, UMD CS), events and places (e.g. USA, Italy, Denmark). Edge labels include author, comment, faculty, friends, member, dean, department in, in, attended, presented, submitted, accepted, organized, visited, student of, collaborates and colleagues.]

Example Query
[Figure: query graph over the variables ?p, ?v1, ?v2, ?v3 and the constants University MD and Italy; edge labels: author, comment, faculty (twice), friends, in.]
Simple query, yet already difficult to answer by hand.

Fraud Detection Example
[Figure: query graph over the constants Bank1 and Suspicious and the variables ?v1, ?v2, ?v3; edge labels: wired (twice), friends, labeled.]
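
What a subgraph query of this kind asks for can be sketched with a tiny single-machine backtracking matcher. This is only an illustration under stated assumptions, not COSI's distributed algorithm: the triples, the names alice, bob and carol, the edge directions, and the helpers `unify` and `match` are all made up for the example, and the ?v3 / Suspicious part of the slide's query is left out because its wiring is not recoverable here.

```python
def unify(term, value, binding):
    """Bind a ?variable to value, or check a constant / an existing binding."""
    if term.startswith("?"):
        if term in binding:
            return binding[term] == value
        binding[term] = value
        return True
    return term == value


def match(query, graph, binding=None):
    """Yield every variable substitution that maps all query edges onto graph edges."""
    binding = {} if binding is None else binding
    if not query:
        yield dict(binding)
        return
    (s, label, o), rest = query[0], query[1:]
    for gs, glabel, go in graph:
        if glabel != label:
            continue
        candidate = dict(binding)  # copy, so a failed branch leaves no trace
        if unify(s, gs, candidate) and unify(o, go, candidate):
            yield from match(rest, graph, candidate)


# Hypothetical data: edges are (subject, label, object) triples.
graph = [
    ("Bank1", "wired", "alice"),
    ("Bank1", "wired", "bob"),
    ("Bank1", "wired", "carol"),
    ("alice", "friends", "bob"),
]
query = [
    ("Bank1", "wired", "?v1"),
    ("Bank1", "wired", "?v2"),
    ("?v1", "friends", "?v2"),
]
answers = list(match(query, graph))
```

Each query edge scans the whole edge list, which is exactly the kind of cost that does not scale to hundreds of millions of edges.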

COSI Architecture
[Figure: architecture diagram shown in two phases, “Partition Graph” and “Answer Queries”. Graph data is loaded and a client submits a query graph (constants A, B, C; variables ?X, ?Y, ?Z). Arrow labels: load; receive query / return results; distribute data / dispatch query; query answer; exchange data / forward query.]

COSI Graph Partitioning
• How should we partition the graph?
• GOAL: Find a way to partition the graph DB into “blocks” across the k storage nodes so that the expected time to answer queries is small.

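Example Query & Naive Approach
[Figure: the example query graph (variables ?p, ?v1, ?v2, ?v3; constants University MD and Italy; edge labels author, comment, faculty, friends, in), annotated with the vertices Jones, Dooley and Smith.]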

Co-Retrieval
[Figure: the example query graph with Jones in place of ?v1 and Paper “ABC” shown next to ?p.]
Co-retrieval: Jones – Paper “ABC”

Cost Model
• Query trace: A query trace w.r.t. a query plan x for query Q consists of
  - All vertices in the DB whose neighborhood is retrieved during execution of x
  - All pairs (u,v) of vertices where x retrieves v’s neighborhood immediately after retrieving u’s neighborhood.
• Intuition: Try to put u and v on the same storage node.
• Assumption: Retrieved neighborhoods are cached in memory.
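
The two parts of a query trace can be sketched as follows. This is a minimal illustration that assumes the plan's execution is available as an ordered list of the vertices whose neighborhoods it retrieves; the function name `query_trace`, that input format, and the example retrieval order are assumptions, not part of the slides.

```python
def query_trace(retrieval_order):
    """Return (vertices, pairs) for one execution of a query plan.

    retrieval_order: vertices whose neighborhoods the plan retrieves,
    in execution order (an assumed input format).
    """
    vertices = set(retrieval_order)
    # (u, v) is in the trace when v is retrieved immediately after u.
    pairs = set(zip(retrieval_order, retrieval_order[1:]))
    return vertices, pairs


# Hypothetical execution order for illustration.
vertices, pairs = query_trace(["University MD", "Jones", "Paper ABC"])
```

Pairs that show up in many traces are the ones worth keeping on the same storage node.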

Cost Model (continued)
• Assume a fixed but arbitrary distribution $\mathbb{P}$ over the set $\mathcal{Q}$ of all queries.
• This induces a pdf over the set of all feasible query plans qp(Q) for query Q.
  - $\mathbb{P}(x) = \sum_{Q \in \mathcal{Q},\, qp(Q) = x} \mathbb{P}(Q)$
  - The probability of query plan “x” is the sum of the probabilities of the queries requiring query plan x.
• Let E(v) be the event that v is retrieved by a query trace of a random query plan for Q.
• The probability that vertex v occurs in the trace of a randomly chosen query plan is
  $\mathbb{P}(E(v)) = \sum_{x \in qp(Q) \,\wedge\, v \in qt(x, DB)} \mathbb{P}(x)$
• The probability that (u,v) occurs in the trace of a randomly chosen query plan is
  $\mathbb{P}(E(u,v)) = \sum_{x \in qp(Q) \,\wedge\, (u,v) \in qt(x, DB)} \mathbb{P}(x)$
  where qt(x, DB) is the query trace of plan x on the database DB.
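
The three probabilities can be computed directly once the query distribution, the plan of each query, and the trace of each plan are given. The sketch below assumes small in-memory dictionaries for all three; the names `plan_distribution`, `vertex_probability`, `pair_probability` and the toy numbers are illustrative assumptions.

```python
from collections import defaultdict


def plan_distribution(query_prob, plan_of):
    """P(x): sum of the probabilities of the queries that require plan x."""
    plan_prob = defaultdict(float)
    for query, p in query_prob.items():
        plan_prob[plan_of[query]] += p
    return dict(plan_prob)


def vertex_probability(v, plan_prob, trace_of):
    """P(E(v)): total probability of the plans whose trace contains v."""
    return sum(p for x, p in plan_prob.items() if v in trace_of[x][0])


def pair_probability(u, v, plan_prob, trace_of):
    """P(E(u,v)): total probability of the plans whose trace contains (u, v)."""
    return sum(p for x, p in plan_prob.items() if (u, v) in trace_of[x][1])


# Hypothetical workload: three queries, two plans.
query_prob = {"q1": 0.5, "q2": 0.25, "q3": 0.25}
plan_of = {"q1": "x1", "q2": "x1", "q3": "x2"}
# trace_of[x] = (vertices in the trace, ordered pairs in the trace)
trace_of = {
    "x1": ({"a", "b"}, {("a", "b")}),
    "x2": ({"b", "c"}, {("b", "c")}),
}
plan_prob = plan_distribution(query_prob, plan_of)
```

In practice the query distribution is not known exactly, so these quantities would have to be estimated; the sketch only mirrors the definitions.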


Key Theorem
Suppose vertex retrieval and inter-node communication costs are uniform across storage nodes. The partition of the DB graph that minimizes query execution time coincides with the partition that minimizes edge cut cost in the graph $(V, V \times V)$ with weight function $w(u,v) = \mathbb{P}(E(u,v)) + \mathbb{P}(E(v,u))$.

• So MIN EDGE-CUT in complete graphs is closely related to minimizing query execution time.
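
The quantity the theorem refers to can be sketched as below: build the symmetric weights w(u,v) from the pair probabilities, then sum the weights of all vertex pairs that a partition separates. The function names, the `block_of` mapping and the numbers are assumptions for illustration.

```python
def symmetric_weights(pair_prob):
    """w(u,v) = P(E(u,v)) + P(E(v,u)), keyed by the unordered pair {u, v}."""
    weights = {}
    for (u, v), p in pair_prob.items():
        key = frozenset((u, v))
        weights[key] = weights.get(key, 0.0) + p
    return weights


def edge_cut_cost(weights, block_of):
    """Sum the weights of the pairs whose endpoints lie in different blocks."""
    return sum(
        w for pair, w in weights.items()
        if len({block_of[x] for x in pair}) > 1
    )


# Hypothetical pair probabilities P(E(u,v)).
pair_prob = {("a", "b"): 0.5, ("b", "a"): 0.25, ("b", "c"): 0.25}
weights = symmetric_weights(pair_prob)
cut_good = edge_cut_cost(weights, {"a": 0, "b": 0, "c": 1})
cut_bad = edge_cut_cost(weights, {"a": 0, "b": 1, "c": 1})
```

Separating the frequently co-retrieved pair (a, b) costs more than separating (b, c), which is the intuition behind minimizing the cut.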

Partitioning Algorithm
• Challenges
  - Finding MIN EDGE-CUT is NP-complete.
  - We want to process graphs containing 100s of millions of edges.
• So we want an algorithm that
  - Is very fast
  - Produces good edge cuts
    • but maybe not optimal
• To achieve speed, we focus on partition strategies that permanently assign vertices to blocks.

Individual edge insertion
• Suppose we have a partition $P = \{P_1, \ldots, P_k\}$.
• We are inserting the edge (v,p,o).
• Vertex force vectors: measure how strongly each $P_i$ “pulls” a vertex.
  - $|v|[i] = f_P\left(\sum_{y \in nbhd(v) \cap P_i} w(v,y)\right)$
  - $f_P$ maps positive reals to reals and is an “affinity” measure.
  - The argument of $f_P$ sums up the weights of the edges from v to each of its neighbors in $P_i$. Insert v into the block $P_i$ with the highest $|v|[i]$.
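
A force vector and the resulting block choice can be sketched as follows. This is a simplified illustration: it places a single vertex rather than handling a whole edge (v,p,o), it uses the identity function as a stand-in for f_P, and it breaks ties by lowest block index. The function names, the data layout and the tie-breaking rule are all assumptions.

```python
def force_vector(v, neighbors, weights, block_of, k, affinity=lambda s: s):
    """Return [|v|[0], ..., |v|[k-1]]: the pull of each block on vertex v."""
    pull = [0.0] * k
    for y in neighbors.get(v, ()):
        i = block_of.get(y)
        if i is not None:  # neighbor y is already placed in block i
            pull[i] += weights.get(frozenset((v, y)), 0.0)
    return [affinity(s) for s in pull]


def insert_vertex(v, neighbors, weights, block_of, k, affinity=lambda s: s):
    """Permanently assign v to the block with the highest force."""
    forces = force_vector(v, neighbors, weights, block_of, k, affinity)
    best = max(range(k), key=forces.__getitem__)  # first maximum wins ties
    block_of[v] = best
    return best


# Hypothetical neighborhood of v with already placed neighbors a, b, c.
neighbors = {"v": ["a", "b", "c"]}
weights = {
    frozenset(("v", "a")): 1.0,
    frozenset(("v", "b")): 2.0,
    frozenset(("v", "c")): 2.5,
}
block_of = {"a": 0, "b": 0, "c": 1}
forces = force_vector("v", neighbors, weights, block_of, k=3)
chosen = insert_vertex("v", neighbors, weights, block_of, k=3)
```

Because the assignment is permanent, each insertion only touches the neighborhood of the new vertex, which is what keeps the strategy fast.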

Affinity Measures
• Must take 3 properties into account
  - Connectedness of a vertex to a partition block. This helps minimize edge cut.
  - Imbalance of block sizes.
    • E.g. standard deviation of block sizes, normalized by expected DB size.
  - Excessive size should be punished.
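
The imbalance measure named above, and one possible way to punish oversized blocks, can be sketched as below. The slides do not give the actual affinity formula, so `penalized_affinity` and its `capacity` parameter are purely hypothetical; the use of the population standard deviation in `imbalance` is also an assumption.

```python
from statistics import pstdev


def imbalance(block_sizes, expected_db_size):
    """Standard deviation of block sizes, normalized by the expected DB size."""
    return pstdev(block_sizes) / expected_db_size


def penalized_affinity(connectedness, block_size, capacity):
    """Hypothetical affinity: connectedness scaled down as a block fills up.

    A block at or over capacity is never chosen (excessive size is punished).
    """
    if block_size >= capacity:
        return float("-inf")
    return connectedness * (1.0 - block_size / capacity)


balanced = imbalance([10, 10, 10], expected_db_size=30)
skewed = imbalance([0, 20], expected_db_size=20)
half_full = penalized_affinity(4.0, block_size=5, capacity=10)
```

Any function with the same qualitative behavior (rewarding connectedness, discouraging imbalance and oversized blocks) would fit the three properties on the slide.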

Batch insertion
• Adding a set of edges at once.
• Idea: Find strongly connected components (densely connected groups of vertices) using modularity maximization and assign each of them to the partition block with the highest affinity.

Batch Partitioning Algorithm
[Figure: diagram of the batch partitioning algorithm, showing repeated “Maximize Modularity” and “Contract” steps and a “Force Vector Affinity” step.]
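
The “Contract” step can be pictured as collapsing every community found by modularity maximization into a single super-vertex and summing the edge weights between communities. The sketch below assumes the communities are already known; `contract`, `community_of` and the example weights are hypothetical names and data, not taken from the COSI implementation.

```python
def contract(weights, community_of):
    """Collapse each community into one super-vertex.

    weights: {(u, v): weight} for undirected edges.
    community_of: vertex -> integer community id (assumed precomputed).
    Returns {(c1, c2): summed weight} with c1 <= c2; (c, c) holds the
    weight of the edges inside community c.
    """
    contracted = {}
    for (u, v), w in weights.items():
        cu, cv = community_of[u], community_of[v]
        key = (min(cu, cv), max(cu, cv))
        contracted[key] = contracted.get(key, 0.0) + w
    return contracted


# Hypothetical path a-b-c-d split into two communities.
weights = {("a", "b"): 1.0, ("b", "c"): 2.0, ("c", "d"): 3.0}
community_of = {"a": 0, "b": 0, "c": 1, "d": 1}
smaller_graph = contract(weights, community_of)
```

Repeating the maximize-then-contract cycle shrinks the graph at every round, so the final affinity-based assignment works on far fewer vertices than the original batch.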

Graph modularity
• $\mathrm{Mod}(P) = \sum_{P_i \in P} \left( \frac{W(P_i, P_i)}{2|E|} - \frac{\deg_W(P_i)^2}{(2|E|)^2} \right)$

• Where
  - $W(X,Y)$ is the sum of the weights of edges (x,y) with x in X, y in Y.
  - $\deg_W(v)$ is the sum of the weights of edges (v,-), and
  - $\deg_W(P_i)$ is the sum of the $\deg_W(v)$’s for v in $P_i$.
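
The modularity formula can be evaluated directly on a small graph. The sketch below makes two assumptions that the slide leaves open: W(Pi,Pi) counts every internal undirected edge once per direction, and |E| is taken as the total edge weight (which equals the number of edges when all weights are 1). The function name and the example graph are illustrative.

```python
def modularity(edges, blocks):
    """Mod(P) for undirected weighted edges.

    edges: {(u, v): weight}; blocks: list of vertex sets forming a partition.
    """
    two_m = 2.0 * sum(edges.values())  # 2|E|, with |E| as total edge weight
    score = 0.0
    for block in blocks:
        internal = 0.0  # W(Pi, Pi): internal edges, once per direction
        degree = 0.0    # deg_W(Pi): weighted degrees summed over the block
        for (u, v), w in edges.items():
            if u in block and v in block:
                internal += 2.0 * w
            degree += w * ((u in block) + (v in block))
        score += internal / two_m - (degree / two_m) ** 2
    return score


# Two triangles joined by a single bridge edge (hypothetical example).
edges = {
    ("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): 1.0,
    ("d", "e"): 1.0, ("e", "f"): 1.0, ("d", "f"): 1.0,
    ("c", "d"): 1.0,
}
split = modularity(edges, [{"a", "b", "c"}, {"d", "e", "f"}])
merged = modularity(edges, [{"a", "b", "c", "d", "e", "f"}])
```

Splitting along the bridge scores higher than keeping everything in one block, which is why modularity maximization tends to recover the two triangles as separate groups.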

Query Answering
[Figure: the architecture diagram restricted to query answering, with the client's query graph. Arrow labels: dispatch query; query answer; return results; forward (partially answered) query.]

Example Query
[Figure: the example query graph (variables ?p, ?v1, ?v2, ?v3; constants University MD and Italy), labeled with partition block P1.]

Example Query
[Figure: the same query graph, annotated with vertices and the partition blocks hosting them: Jones : P2, Dooley : P2, Smith : P3.]

Example Query
[Figure: the same query graph with Dooley in place of ?v1, labeled P2 and annotated with Paper “ABC” : P2, Paper “HIJ” : P3 and Calero : P2.]
Where to send query next?

Query answering
• Basic: the next substitution is chosen arbitrarily.
• COSI_heur is a heuristic version that makes intelligent choices about the next variable to be substituted, based on
  - Branching factor: # possible substitutions
  - Communication cost: # messages to be sent
  - Workload distribution: partitions hosting vertices
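
One way to picture such a choice is to score each unbound variable by its branching factor and by the number of remote partitions that would have to be contacted, then pick the cheapest. This is a hypothetical scoring rule, not the heuristic actually used by COSI; workload distribution is not modelled, and the function name, the lexicographic ordering and the candidate data (loosely based on the example slide) are assumptions.

```python
def choose_next_variable(candidates, current_partition):
    """Pick the variable to substitute next.

    candidates: variable -> {candidate vertex: hosting partition}.
    Score = (branching factor, messages to other partitions, name);
    the smallest score wins.
    """
    def score(var):
        hosts = candidates[var]
        branching = len(hosts)
        messages = len({p for p in hosts.values() if p != current_partition})
        return (branching, messages, var)

    return min(candidates, key=score)


# Hypothetical state while the query is being processed at P2.
candidates = {
    "?p": {"Paper ABC": "P2", "Paper HIJ": "P3"},
    "?v3": {"Calero": "P2"},
}
next_var = choose_next_variable(candidates, current_partition="P2")
```

Here ?v3 has a single local candidate, so substituting it first avoids both branching and a message to P3.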

COSI implementation
• Implementation is in Java (approx. 10,000 LOC)
• 778M-edge social network DB
  - Flickr, Orkut, LiveJournal, YouTube
  - [Mislove ’07]
• 16-node compute cluster
  - 8 GB of RAM
  - 30 GB HDs
  - 8 core Intel CPU

Partitioning quality
[Figure: bar chart “Comparison of Partitioning Methods”. Y-axis: 0.0% to 40.0%. Series: Edge Cut Improvement, Imbalance. Methods: Single Greedy, Batch Greedy, Batch Partition.]

COSI_Partition achieves a 36% improvement in edge-cut with only slightly higher imbalance.
Took 7.5 h to load with individual triple insertion, 10.5 h with batch.

Query answering time
[Figure: chart “Query Times by Cost Model (in ms)”, logarithmic scale from 100 to 10,000,000 ms. X-axis: query size, from 6 edges / 3 vars to 23 edges / 6 vars (6E/3V, 7E/4V, 8E/3V, 9E/3V, 10E/3V, 11E/4V, 11E/5V, 14E/5V, 16E/7V, 17E/5V, 23E/6V). Legend: Cost Model A, Cost Model B, Cost Model C, Cost Model 2.0/0.5, Cost Model 1.2/0.1, Cost Model 8.0/5.0, No Cost Model.]

COSI_heur does very well, answering pretty complex queries in under a second.
X-axis shows number of edges and variable vertices.

Partitioning Effect
[Figure: chart on a logarithmic scale. Y-axis: time (ms), 100 to 100,000. X-axis: size of the query (# edges / # vertices): 6E/3V, 7E/4V, 8E/3V, 9E/3V, 10E/3V, 11E/4V, 11E/5V, 14E/5V, 16E/7V, 17E/5V, 23E/6V. Series: COSI Batch Partition, Individual Edge Insertion.]

COSI_heur does very well, answering pretty complex queries in under a second.

Related Work
Approach | Systems | Pros | Cons
Single Machine | Neo4j, DEO, Hypergraph, RDF-3X, OWLIM, AllegroGraph, etc. | Latency, Speed | Limited size, Limited Throughput
Orchestrated Distribution | YARS 2, system extensions | Size Scalability | Latency, Limited Throughput
Asynchronous Cloud oriented | COSI | Size Scalability, Throughput Scalability, Resource Elasticity | Latency

Conclusion
• COSI is a general, scalable and fast graph database framework for social network analysis
• Demonstrated scalability and speed on the problem of subgraph identification

Questions? Comments?
