Note: The images used in this presentation are copyright by the respective owners as indicated with the picture. Pictures used are either CC or fair use. Please notify the author if you feel that your images are unfairly used in this presentation.

Published in:
Technology

- 1. COSI: Cloud Oriented Subgraph Identification in Massive Social Networks. Matthias Bröcheler, Andrea Pugliese & V.S. Subrahmanian. (Image © Adam Perer)
- 2. [Images © solofotones/flickr and © Felix Heinen]
- 3. SNA Challenge: Scalability. [Images © solofotones/flickr and © Felix Heinen]
- 4. Huge Social Networks: 500 million users, 50M tweets / day. (Image © Ludwig Gatzke)
- 5. COSI: cloud-based asynchronous storage. COSI answers complex queries in ~1 sec on a 778-million-edge network.
- 6. Outline: Motivation, Subgraph Identification, Graph Partitioning, Query Answering, Experiments, Conclusion
- 7. [Figure: example social network graph. Vertices include professors (e.g., Prof Jones, Prof Dooley, Prof Smith, Prof Calero), papers (e.g., Paper “ABC”, Paper “HIJ”, Paper “UVW”, Paper “XYZ”), universities, departments, conferences, and countries; edge labels include author, comment, faculty, friends, member, dean, department, attended, submitted, accepted, presented, organized, visited, collaborates, colleagues, student of, and in.]
- 8. Example Query. [Figure: query graph over the variables ?p, ?v1, ?v2, ?v3; labels shown: author, comment, faculty, friends, in, University MD, Italy.] Simple query, yet already difficult to answer by hand.
- 9. Fraud Detection Example. [Figure: query graph with the vertices Bank1 and Suspicious, the variables ?v1, ?v2, ?v3, and the edge labels wired (twice), friends, labeled.]
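To make the subgraph identification task concrete, here is a minimal Python sketch of a backtracking matcher. It assumes the graph is stored as a set of (subject, label, object) triples and that query variables start with `?`. The data, the exact pattern edges, and the function names `match` and `unify` are hypothetical illustrations (the slides do not spell out the edges of the fraud query), and this is not COSI's implementation. The sketch also does not force distinct variables to bind to distinct vertices.

```python
from typing import Dict, Iterator, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]


def is_var(term: str) -> bool:
    """Query variables are written with a leading '?' (assumed convention)."""
    return term.startswith("?")


def unify(term: str, value: str, binding: Dict[str, str]) -> bool:
    """Bind a query term to a graph vertex, or check an existing binding."""
    if is_var(term):
        if term in binding:
            return binding[term] == value
        binding[term] = value
        return True
    return term == value


def match(query: List[Triple], graph: Set[Triple],
          binding: Optional[Dict[str, str]] = None) -> Iterator[Dict[str, str]]:
    """Yield every variable binding that maps all query edges onto graph edges."""
    binding = dict(binding or {})
    if not query:
        yield binding
        return
    (s, p, o), rest = query[0], query[1:]
    for gs, gp, go in graph:
        if gp != p:
            continue
        candidate = dict(binding)
        if unify(s, gs, candidate) and unify(o, go, candidate):
            yield from match(rest, graph, candidate)


# Hypothetical toy data, loosely modeled on the fraud detection slide.
graph = {
    ("Bank1", "wired", "alice"), ("Bank1", "wired", "bob"),
    ("Bank1", "wired", "dave"),
    ("alice", "friends", "carol"), ("bob", "friends", "carol"),
    ("carol", "labeled", "Suspicious"),
}
query = [
    ("Bank1", "wired", "?v1"), ("Bank1", "wired", "?v2"),
    ("?v1", "friends", "?v3"), ("?v2", "friends", "?v3"),
    ("?v3", "labeled", "Suspicious"),
]
results = list(match(query, graph))
```

Every query edge is tried against every graph edge here, which is fine for a toy graph but is exactly the cost that partitioning and substitution ordering aim to reduce at scale.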
- 10. COSI Architecture. [Figure: a client loads graph data and submits a query (a pattern over A, B, C, ?X, ?Y, ?Z). Diagram labels: Receive query / Return results; Distribute data / Dispatch query; Query answer; Exchange data / Forward query.]
- 11. COSI Architecture. [Figure: same diagram as the previous slide, with the additional labels Partition Graph and Answer Queries.]
- 12. Outline: Motivation, Subgraph Identification, Graph Partitioning, Query Answering, Experiments, Conclusion
- 13. COSI Graph Partitioning. How should we partition the graph? Goal: find a way to partition the graph DB into “blocks” across the k storage nodes so that the expected time to answer queries is small.
- 14. Example Query & Naive Approach. [Figure: the example query graph, annotated with the vertices Jones, Dooley, and Smith.]
- 15. Co-Retrieval. [Figure: the example query graph, annotated with the vertices Jones and Paper “ABC”.] Co-retrieval: Jones – Paper “ABC”
- 16. Cost Model. Query trace: a query trace w.r.t. a query plan x for query Q consists of
  - all vertices in the DB whose neighborhood (nbhd) is retrieved during execution of x;
  - all pairs (u,v) of vertices where x retrieves v’s nbhd immediately after retrieving u’s nbhd.
  - Intuition: try to put u and v on the same storage node.
  - Assumption: retrieved nbhds are cached in memory.
- 17. Cost Model (continued). Assume a fixed but arbitrary probability distribution $\mathbb{P}$ over the set $\mathcal{Q}$ of all queries. This induces a probability distribution over the set of all feasible query plans qp(Q) for query Q:
  - $\mathbb{P}(x) = \sum_{Q \in \mathcal{Q},\; qp(Q) = x} \mathbb{P}(Q)$
  - That is, the probability of query plan x is the sum of the probabilities of the queries requiring query plan x.
  - Let E(v) be the event that v is retrieved by a query trace of a random query plan for Q.
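- 18. Cost Model (continued). Here qt(x,DB) denotes the query trace of plan x on the database.
  - The probability that vertex v occurs in the trace of a randomly chosen query plan is $\mathbb{P}(E(v)) = \sum_{x \in qp(Q) \,\wedge\, v \in qt(x,DB)} \mathbb{P}(x)$.
  - The probability that the pair (u,v) occurs in the trace of a randomly chosen query plan is $\mathbb{P}(E(u,v)) = \sum_{x \in qp(Q) \,\wedge\, (u,v) \in qt(x,DB)} \mathbb{P}(x)$.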
- 19. Cost Model (continued). Key Theorem: Suppose vertex retrieval and inter-node communication are uniform across storage nodes. Then the partition of the DB graph that minimizes query execution time coincides with the partition that minimizes the edge-cut cost in the graph $(V, V \times V)$ with weight function $w(u,v) = \mathbb{P}(E(u,v)) + \mathbb{P}(E(v,u))$. So finding minimum edge cuts in complete graphs is closely related to minimizing query execution time.
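The cost model above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the authors' code: each query trace is modeled as an ordered list of retrieved vertices, so the pairs (u,v) are just consecutive entries, and the plan probabilities are made-up numbers. The names `edge_weights` and `edge_cut_cost` are hypothetical.

```python
from collections import defaultdict


def edge_weights(plan_probs, traces):
    """Compute w(u,v) = P(E(u,v)) + P(E(v,u)) for every co-retrieved pair.

    plan_probs: plan id -> probability of that query plan
    traces:     plan id -> vertices in the order their nbhds are retrieved
    Returns a dict keyed by frozenset({u, v}), since w is symmetric.
    """
    pair_prob = defaultdict(float)  # P(E(u,v)) for ordered pairs
    for plan, prob in plan_probs.items():
        order = traces[plan]
        # A pair counts once per plan, even if it repeats within the trace.
        for u, v in set(zip(order, order[1:])):
            if u != v:
                pair_prob[(u, v)] += prob
    weights = defaultdict(float)
    for (u, v), prob in pair_prob.items():
        weights[frozenset((u, v))] += prob
    return dict(weights)


def edge_cut_cost(weights, block_of):
    """Total weight of pairs whose endpoints sit on different storage nodes."""
    return sum(w for pair, w in weights.items()
               if len({block_of[x] for x in pair}) > 1)


# Hypothetical plans and traces (vertex names borrowed from the example graph).
plan_probs = {"x1": 0.5, "x2": 0.3, "x3": 0.2}
traces = {
    "x1": ["Jones", "Paper ABC", "Calero"],
    "x2": ["Paper ABC", "Jones"],
    "x3": ["Dooley", "Paper HIJ"],
}
weights = edge_weights(plan_probs, traces)

together = {"Jones": 0, "Paper ABC": 0, "Calero": 0, "Dooley": 1, "Paper HIJ": 1}
split = {"Jones": 0, "Paper ABC": 1, "Calero": 1, "Dooley": 0, "Paper HIJ": 0}
cut_together = edge_cut_cost(weights, together)
cut_split = edge_cut_cost(weights, split)
```

Placing Jones and Paper ABC on the same node removes the heaviest pair from the cut, which is the intuition behind co-retrieval stated on the earlier slides.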
- 20. Partitioning Algorithm. Challenges:
  - Finding MIN EDGE-CUT is NP-complete.
  - We want to process graphs containing hundreds of millions of edges.
  - So we want an algorithm that is very fast and produces good, but maybe not optimal, edge cuts.
  - To achieve speed, we focus on partitioning strategies that permanently assign vertices to blocks.
- 21. Individual edge insertion. Suppose we have a partition $P = \{P_1, \ldots, P_k\}$ and are inserting the edge (v,p,o). Vertex force vectors measure how strongly each $P_i$ “pulls” a vertex:
  - $|v|[i] = f_P\left(\sum_{y \in \mathrm{nbhd}(v) \cap P_i} w(v,y)\right)$
  - $f_P$ maps positive reals to reals and is an “affinity” measure.
  - The argument of $f_P$ sums up the weights of the edges from v to each of its neighbors in $P_i$.
  - Insert v into the block $P_i$ with the highest $|v|[i]$.
- 22. Affinity Measures. Must satisfy 3 properties:
  - Connectedness of a vertex to a partition block. This helps minimize the edge cut.
  - Imbalance of block sizes, e.g., the standard deviation of block sizes, normalized by expected DB size.
  - Excessive size should be punished.
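The force-vector assignment and the role of the affinity measure can be sketched as follows. The slides do not give a concrete affinity function, so `size_penalized_affinity` is a hypothetical stand-in that rewards connectedness and subtracts a penalty proportional to the block's share of all vertices; the function names and the parameter `alpha` are likewise assumptions for illustration.

```python
def size_penalized_affinity(alpha):
    """Hypothetical affinity: pull minus alpha times the block's size share."""
    def affinity(pull, i, blocks):
        total = sum(len(b) for b in blocks) or 1
        return pull - alpha * len(blocks[i]) / total
    return affinity


def force_vector(v, neighbors, weight, blocks, affinity):
    """Return |v|[i] for every block i: affinity of the summed edge weights
    from v to its neighbors that already live in block i."""
    vec = []
    for i, block in enumerate(blocks):
        pull = sum(weight(v, y) for y in neighbors.get(v, ()) if y in block)
        vec.append(affinity(pull, i, blocks))
    return vec


def insert_vertex(v, neighbors, weight, blocks, affinity):
    """Permanently assign v to the block with the highest force entry."""
    vec = force_vector(v, neighbors, weight, blocks, affinity)
    best = max(range(len(blocks)), key=lambda i: vec[i])
    blocks[best].add(v)
    return best


# Toy example: v has two neighbors in block 0 and one in block 1.
neighbors = {"v": ["a", "b", "c"]}
unit_weight = lambda u, y: 1.0

blocks_mild = [{"a", "b"}, {"c"}]
chosen_mild = insert_vertex("v", neighbors, unit_weight, blocks_mild,
                            size_penalized_affinity(0.5))

blocks_strict = [{"a", "b"}, {"c"}]
chosen_strict = insert_vertex("v", neighbors, unit_weight, blocks_strict,
                              size_penalized_affinity(10.0))
```

With a mild penalty the vertex follows its neighbors into the larger block; with a strong penalty the balance term wins and it goes to the smaller block, which shows the trade-off between edge cut and imbalance.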
- 23. Batch insertion. Adding a set of edges at once. Idea: find strongly connected components using modularity maximization and assign those to the partition block with the highest affinity.
- 24. Batch Partitioning Algorithm. [Figure: diagram with stages labeled Force Vector Affinity, Contract, Maximize Modularity, Contract, Maximize Modularity.]
- 25. Graph modularity. $\mathrm{Mod}(P) = \sum_{P_i \in P} \left( \frac{W(P_i,P_i)}{2|E|} - \frac{\deg_W(P_i)^2}{(2|E|)^2} \right)$, where
  - W(X,Y) is the sum of the weights of edges (x,y) with x in X and y in Y;
  - $\deg_W(v)$ is the sum of the weights of edges (v,-); and
  - $\deg_W(P_i)$ is the sum of the $\deg_W(v)$’s for v in $P_i$.
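As a rough check of the modularity formula, the sketch below evaluates it on a toy graph. It assumes an undirected graph with one dictionary entry per edge, counts each internal edge in both directions for W(P_i,P_i), and uses the total edge weight as the normalizer, which equals |E| when all weights are 1. That normalizer choice and the function name `modularity` are assumptions of this sketch.

```python
from collections import defaultdict


def modularity(blocks, edges):
    """Mod(P) = sum_i ( W(Pi,Pi) / 2m - (deg_W(Pi) / 2m)^2 ).

    blocks: list of vertex sets forming a partition
    edges:  dict (u, v) -> weight, one entry per undirected edge, no self-loops
    m is the total edge weight (equal to |E| for unit weights).
    """
    two_m = 2.0 * sum(edges.values())
    block_of = {v: i for i, block in enumerate(blocks) for v in block}

    degree = defaultdict(float)          # deg_W(v)
    internal = [0.0] * len(blocks)       # W(Pi, Pi), both directions counted
    for (u, v), w in edges.items():
        degree[u] += w
        degree[v] += w
        if block_of[u] == block_of[v]:
            internal[block_of[u]] += 2.0 * w

    total = 0.0
    for i, block in enumerate(blocks):
        block_degree = sum(degree[v] for v in block)
        total += internal[i] / two_m - (block_degree / two_m) ** 2
    return total


# Two triangles joined by a single edge (c, d), all weights 1.
edges = {
    ("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): 1.0,
    ("d", "e"): 1.0, ("e", "f"): 1.0, ("d", "f"): 1.0,
    ("c", "d"): 1.0,
}
mod_two_blocks = modularity([{"a", "b", "c"}, {"d", "e", "f"}], edges)
mod_one_block = modularity([{"a", "b", "c", "d", "e", "f"}], edges)
```

Splitting the graph into its two triangles scores higher than keeping everything in one block, which is the signal modularity maximization uses to find tightly connected groups.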
- 26. Outline: Motivation, Subgraph Identification, Graph Partitioning, Query Answering, Experiments, Conclusion
- 27. Query Answering. [Figure: a client loads graph data and submits a query (a pattern over A, B, C, ?X, ?Y, ?Z). Diagram labels: Receive query / Return results; Dispatch query; Query answer; Forward (partially answered) query.]
- 28. Example Query. [Figure: the example query graph over ?p, ?v1, ?v2, ?v3, annotated with partition block P1.]
- 29. Example Query. [Figure: the example query graph, annotated with vertices and their partition blocks: Jones : P2, Dooley : P2, Smith : P3.]
- 30. Example Query. [Figure: the example query graph with the vertex Dooley, annotated with block P2 and with Paper “ABC” : P2, Paper “HIJ” : P3, Calero : P2.] Where to send the query next?
- 31. Query answering. Basic: the next substitution is arbitrary. COSI_Heur is a heuristic version that makes intelligent choices about the next variable to be substituted, considering:
  - Branching factor: number of possible substitutions
  - Communication cost: number of messages to be sent
  - Workload distribution: partitions hosting vertices
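One possible way to turn these criteria into a choice of the next variable is sketched below. The slide does not say how COSI_Heur combines the criteria, so the lexicographic ordering (branching factor first, then the number of remote partitions that would need a message) is an assumption, as are the function name `choose_next_variable` and the candidate sets. Workload distribution is not modeled here.

```python
def choose_next_variable(candidates, block_of, current_block):
    """Pick the variable to substitute next.

    candidates:    variable -> list of candidate vertices for it
    block_of:      vertex -> partition block hosting that vertex
    current_block: block currently holding the partially answered query
    Scores are compared lexicographically: fewer candidates first, then
    fewer remote blocks to contact (an assumed tie-breaking rule).
    """
    def score(var):
        vertices = candidates[var]
        branching = len(vertices)
        remote = {block_of[v] for v in vertices} - {current_block}
        return (branching, len(remote))

    return min(candidates, key=score)


# Hypothetical situation, loosely following the example query slides.
candidates = {
    "?p": ["Paper ABC", "Paper HIJ"],
    "?v3": ["Calero"],
}
block_of = {"Paper ABC": "P2", "Paper HIJ": "P3", "Calero": "P2"}
next_var = choose_next_variable(candidates, block_of, current_block="P2")
```

In this toy case `?v3` is chosen: it has a single candidate, and that candidate is hosted on the block that already holds the query, so no message has to be sent.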
- 32. Outline: Motivation, Subgraph Identification, Graph Partitioning, Query Answering, Experiments, Conclusion
- 33. COSI implementation.
  - Implementation is in Java (approx. 10,000 LOC)
  - 778M-edge social network DB: Flickr, Orkut, Livejournal, Youtube [Mislove ’07]
  - 16-node compute cluster: 8 GB of RAM, 30 GB HDs, 8-core Intel CPU
- 34. Partitioning quality. [Figure: bar chart “Comparison of Partitioning Methods” showing Edge Cut Improvement and Imbalance (0% to 40%) for Single Greedy, Batch Greedy, and Batch Partition.] COSI_Partition achieves a 36% improvement in edge cut with only slightly higher imbalance. Loading took 7.5 h with individual triple insertion and 10.5 h with batch insertion.
- 35. Query answering time. [Figure: chart “Query Times by Cost Model (in ms)”, logarithmic scale from 100 to 10,000,000 ms. X-axis: query size (6 Edges / 3 Vars, 7 / 4, 8 / 3, 9 / 3, 10 / 3, 11 / 4, 11 / 5, 14 / 5, 16 / 7, 17 / 5, 23 / 6). Series: Cost Model A (2.0/0.5), Cost Model B (1.2/0.1), Cost Model C (8.0/5.0), No Cost Model.] COSI_heur does very well, answering pretty complex queries in under a second. The x-axis shows the number of edges and variable vertices.
- 36. Partitioning Effect. [Figure: chart of query time (ms), logarithmic scale from 100 to 100,000 ms, by size of the query (# edges / # vertices): 6E/3V, 7E/4V, 8E/3V, 9E/3V, 10E/3V, 11E/4V, 11E/5V, 14E/5V, 16E/7V, 17E/5V, 23E/6V. Series: COSI Batch Partition, Individual Edge Insertion.] COSI_heur does very well, answering pretty complex queries in under a second.
- 37. Outline: Motivation, Subgraph Identification, Graph Partitioning, Query Answering, Experiments, Conclusion
- 38. Related Work.

  | | Systems | Pros | Cons |
  |---|---|---|---|
  | Single machine | Neo4j, DEO, Hypergraph, RDF-3X, OWLIM, AllegroGraph, etc. | Latency, speed | Limited size, limited throughput |
  | Orchestrated distribution | YARS 2, system extensions | Size scalability | Latency, limited throughput |
  | Asynchronous, cloud oriented | COSI | Size scalability, throughput scalability, resource elasticity | Latency |
- 39. Conclusion. COSI is a general, scalable, and fast graph database framework for social network analysis. Demonstrated scalability and speed on the problem of subgraph identification.
- 40. dogma.umiacs.umd.edu
- 41. Questions? Comments?
