COSI: Cloud Oriented Subgraph Identification in Massive Social Networks

Slides presenting our work on COSI at the ASONAM conference 2010

Note: The images used in this presentation are copyright by the respective owners as indicated with the picture. Pictures used are either CC or fair use. Please notify the author if you feel that your images are unfairly used in this presentation.


  1. COSI: Cloud Oriented Subgraph Identification in Massive Social Networks. Matthias Bröcheler, Andrea Pugliese & V.S. Subrahmanian. [Image: © Adam Perer]
  2. [Images: © solofotones/flickr, © Felix Heinen]
  3. SNA Challenge: Scalability. [Images: © solofotones/flickr, © Felix Heinen]
  4. Huge Social Networks: 500 million users; 50M tweets / day. [Image: © Ludwig Gatzke]
  5. COSI: cloud based storage; asynchronous. Answers complex queries in ~1 sec on a 778 million edge network.
  6. Outline: Motivation; Subgraph Identification; Graph Partitioning; Query Answering; Experiments; Conclusion
  7. [Figure: example social network graph. Vertices are people, papers, universities, departments, conferences and countries (e.g. Jones, Paper “ABC”, UMD, Italy, Denmark). Edge labels include author, comment, faculty, friends, member, dean, department, in, presented, attended, submitted, accepted, organized, visited, student of, collaborates and colleagues.]
  8. Example Query. [Figure: query graph over the variables ?p, ?v1, ?v2, ?v3 and the constants University MD and Italy, with edge labels author, comment, faculty, friends, faculty, in.] Simple query, yet already difficult to answer by hand.
  9. Fraud Detection Example. [Figure: query graph over Bank1, Suspicious and the variables ?v1, ?v2, ?v3, with edge labels wired, wired, friends, labeled.]
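
The two example queries above are subgraph patterns with variable vertices. As a minimal single-machine sketch of what answering such a pattern means, the following backtracking matcher binds variables (names starting with `?`) to vertices of a labeled edge list. The function names, the toy data and the exact wiring of the fraud pattern are assumptions for illustration; the slide's edge directions cannot be recovered from the transcript, and this is not COSI's distributed algorithm.

```python
def bind(term, value, binding):
    """Bind a query term to a vertex; variables start with '?'."""
    if term.startswith("?"):
        # setdefault binds the variable if unbound, else returns the old value.
        return binding.setdefault(term, value) == value
    return term == value


def match(query, edges, binding=None):
    """Yield every variable binding under which all query edges occur in `edges`."""
    binding = binding or {}
    if not query:
        yield dict(binding)
        return
    (s, label, o), rest = query[0], query[1:]
    for es, el, eo in edges:
        if el != label:
            continue
        candidate = dict(binding)
        if bind(s, es, candidate) and bind(o, eo, candidate):
            yield from match(rest, edges, candidate)


# Hypothetical data and pattern, reusing only the labels shown on the slide.
edges = [
    ("Ann", "wired", "Bank1"),
    ("Bob", "wired", "Bank1"),
    ("Ann", "friends", "Cy"),
    ("Cy", "labeled", "Suspicious"),
    ("Dan", "friends", "Eve"),
]
query = [
    ("?v1", "wired", "Bank1"),
    ("?v2", "wired", "Bank1"),
    ("?v1", "friends", "?v3"),
    ("?v3", "labeled", "Suspicious"),
]
results = list(match(query, edges))
```

Note that this sketch allows two different variables to bind to the same vertex; a stricter matcher would reject such bindings.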
  10. COSI Architecture. [Figure: a client loads graph data and submits a query pattern over A, B, C, ?X, ?Y, ?Z. Labels: Receive query / Return results; Distribute data / Dispatch query; Query answer; Exchange data / Forward query.]
  11. COSI Architecture. [Figure: same as the previous slide, with two added labels: Partition Graph; Answer Queries.]
  12. Outline: Motivation; Subgraph Identification; Graph Partitioning; Query Answering; Experiments; Conclusion
  13. COSI Graph Partitioning. How should we partition the graph? GOAL: Find a way to partition the graph DB into “blocks” across the k storage nodes so that the expected time to answer queries is small.
  14. Example Query & Naive Approach. [Figure: the example query, with the candidate vertices Jones, Dooley and Smith listed next to it.]
  15. Co-Retrieval. [Figure: the example query with Jones in place of ?v1 and Paper “ABC” listed next to ?p.] Co-retrieval: Jones – Paper “ABC”
  16. Cost Model. Query trace: a query trace w.r.t. a query plan x for query Q consists of (a) all vertices in the DB whose neighborhood is retrieved during execution of x, and (b) all pairs (u,v) of vertices where x retrieves v’s neighborhood immediately after retrieving u’s neighborhood. Intuition: try to put u and v on the same storage node. Assumption: retrieved neighborhoods are cached in memory.
  17. Cost Model (continued). Assume a fixed but arbitrary distribution $\mathbb{P}$ over the set $\mathcal{Q}$ of all queries. This induces a pdf over the set of all feasible query plans qp(Q) for query Q: $\mathbb{P}(x) = \sum_{Q \in \mathcal{Q},\ qp(Q) = x} \mathbb{P}(Q)$, i.e. the probability of query plan x is the sum of the probabilities of the queries requiring query plan x. Let E(v) be the event that v is retrieved by a query trace of a random query plan for Q.
  18. Cost Model (continued). The probability that vertex v occurs in the trace of a randomly chosen query plan is $\mathbb{P}(E(v)) = \sum_{x \in qp(Q) \,\wedge\, v \in qt(x, DB)} \mathbb{P}(x)$. The probability that (u,v) occurs in the trace of a randomly chosen query plan is $\mathbb{P}(E(u,v)) = \sum_{x \in qp(Q) \,\wedge\, (u,v) \in qt(x, DB)} \mathbb{P}(x)$.
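
The trace probabilities on the three cost-model slides above can be illustrated with a small sketch. It assumes each query plan is given as a probability together with the order in which it retrieves vertex neighborhoods; the function name `trace_probabilities`, the input format and the toy plans are hypothetical and not taken from the slides.

```python
from collections import defaultdict


def trace_probabilities(plans):
    """Compute P(E(v)) and P(E(u,v)) from (probability, retrieval_order) pairs.

    A plan's trace holds every retrieved vertex and every pair (u, v) where
    v's neighborhood is retrieved immediately after u's neighborhood.
    """
    p_vertex = defaultdict(float)
    p_pair = defaultdict(float)
    for prob, order in plans:
        for v in set(order):  # each vertex counts once per plan
            p_vertex[v] += prob
        for pair in set(zip(order, order[1:])):  # consecutive retrievals
            p_pair[pair] += prob
    return dict(p_vertex), dict(p_pair)


# Three hypothetical plans whose probabilities sum to 1.
plans = [
    (0.5, ["Jones", "Paper ABC", "Calero"]),
    (0.25, ["Jones", "Paper HIJ"]),
    (0.25, ["Dooley", "Paper ABC"]),
]
p_vertex, p_pair = trace_probabilities(plans)
```

Here `p_vertex["Jones"]` sums the probabilities of the two plans that retrieve Jones, and `p_pair[("Jones", "Paper ABC")]` is the probability of the single plan that retrieves Paper ABC right after Jones.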
  19. Cost Model (continued). Key Theorem: Suppose vertex retrieval and inter-node communication are uniform across storage nodes. Then the partition of the DB graph that minimizes query execution time coincides with the partition that minimizes edge-cut cost in the graph $(V, V \times V)$ with weight function $w(u,v) = \mathbb{P}(E(u,v)) + \mathbb{P}(E(v,u))$. So min edge-cut in complete graphs is closely related to minimizing query execution time.
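
To make the theorem's objective concrete, the following sketch computes the edge-cut cost of a given partition under the weight function w(u,v) = P(E(u,v)) + P(E(v,u)). The pair probabilities, the block labels and the name `edge_cut_cost` are hypothetical. Summing P(E(u,v)) over every ordered pair that crosses blocks equals summing w over the unordered cut edges, so the code never has to build w explicitly.

```python
def edge_cut_cost(p_pair, block_of):
    """Total weight of vertex pairs whose endpoints lie in different blocks.

    p_pair maps ordered pairs (u, v) to P(E(u,v)); block_of maps each
    vertex to the block (storage node) it is assigned to.
    """
    cost = 0.0
    for (u, v), prob in p_pair.items():
        if block_of[u] != block_of[v]:
            cost += prob
    return cost


# Hypothetical co-retrieval probabilities.
p_pair = {
    ("Jones", "Paper ABC"): 0.5,
    ("Paper ABC", "Jones"): 0.25,
    ("Jones", "Paper HIJ"): 0.25,
}
good = {"Jones": "P2", "Paper ABC": "P2", "Paper HIJ": "P3"}
bad = {"Jones": "P2", "Paper ABC": "P3", "Paper HIJ": "P2"}
good_cost = edge_cut_cost(p_pair, good)
bad_cost = edge_cut_cost(p_pair, bad)
```

Keeping the frequently co-retrieved pair Jones and Paper ABC on the same block gives the lower cost, which matches the intuition stated on the query-trace slide.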
  20. Partitioning Algorithm. Challenges: finding MIN EDGE-CUT is NP-complete; we want to process graphs containing 100s of millions of edges. So we want an algorithm that is very fast and produces good edge cuts (but maybe not optimal). To achieve speed, we focus on partition strategies that permanently assign vertices to blocks.
  21. Individual edge insertion. Suppose we have a partition $P = \{P_1, \ldots, P_k\}$ and we are inserting the edge (v,p,o). Vertex force vectors measure how strongly each $P_i$ “pulls” a vertex: $|v|[i] = f_P\left(\sum_{y \in \mathrm{nbhd}(v) \cap P_i} w(v,y)\right)$, where $f_P$ maps positive reals to reals and is an “affinity” measure. $|v|[i]$ sums up the weights of the edges from v to each neighbor in $P_i$. Insert v into the block $P_i$ with the highest $|v|[i]$.
  22. Affinity Measures. Must satisfy 3 properties: (1) connectedness of a vertex to a partition block, which helps minimize edge cut; (2) imbalance of block sizes, e.g. the standard deviation of block sizes, normalized by expected DB size; (3) excessive size should be punished.
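
The force-vector insertion and the affinity properties on the two slides above can be sketched as follows. The slides do not give a concrete affinity function, so `affinity` here is a hypothetical choice that discounts the connection weight as a block approaches an assumed `capacity`; the function names, parameters and data are illustrative only.

```python
def affinity(weight, size, capacity):
    """Hypothetical affinity: connection weight, discounted as the block fills."""
    return weight * max(0.0, 1.0 - size / capacity)


def force_vector(v, neighbors, block_of, sizes, capacity):
    """Entry i is the affinity of the summed edge weights from v into block i."""
    pull = [0.0] * len(sizes)
    for y, w in neighbors.get(v, {}).items():
        if y in block_of:  # only already-assigned neighbors exert a pull
            pull[block_of[y]] += w
    return [affinity(p, sizes[i], capacity) for i, p in enumerate(pull)]


def insert_vertex(v, neighbors, block_of, sizes, capacity):
    """Permanently assign v to the block with the highest force entry."""
    force = force_vector(v, neighbors, block_of, sizes, capacity)
    # Ties are broken in favor of the smaller block.
    best = max(range(len(sizes)), key=lambda i: (force[i], -sizes[i]))
    block_of[v] = best
    sizes[best] += 1
    return best


neighbors = {"v": {"a": 1.0, "b": 1.0, "c": 3.0}}
block_of = {"a": 0, "b": 0, "c": 1}
sizes = [2, 1]
chosen = insert_vertex("v", neighbors, block_of, sizes, capacity=4)
```

With these numbers block 1 wins because v's heaviest edge leads there and the block is still small. If block 1 were already at capacity, its entry would drop to zero and v would go to block 0, which is one simple way to punish excessive size.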
  23. Batch insertion. Adding a set of edges at once. Idea: find strongly connected components using modularity maximization and assign those to the partition block with the highest affinity.
  24. Batch Partitioning Algorithm. [Figure: pipeline with stage labels Force Vector Affinity; Contract; Maximize Modularity; Contract; Maximize Modularity.]
  25. Graph modularity. $\mathrm{Mod}(P) = \sum_{P_i \in P} \left( \frac{W(P_i, P_i)}{2|E|} - \frac{\deg_W(P_i)^2}{(2|E|)^2} \right)$, where $W(X,Y)$ is the sum of the weights of edges (x,y) with x in X and y in Y, $\deg_W(v)$ is the sum of the weights of edges (v,-), and $\deg_W(P_i)$ is the sum of the $\deg_W(v)$ for v in $P_i$.
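
A small sketch of the modularity formula above, for an undirected graph given as weighted edges. Two assumptions are made here: W(P_i, P_i) counts each internal edge in both directions, and the total weighted degree stands in for 2|E|, which is exact when all weights are 1. The function name and the example graph are hypothetical.

```python
from collections import defaultdict


def modularity(edges, block_of):
    """Mod(P) for undirected (u, v, weight) edges and a vertex -> block map."""
    deg = defaultdict(float)       # deg_W(P_i): summed weighted degree per block
    internal = defaultdict(float)  # W(P_i, P_i): internal weight, both directions
    total = 0.0                    # total weighted degree, i.e. 2|E| for unit weights
    for u, v, w in edges:
        deg[block_of[u]] += w
        deg[block_of[v]] += w
        total += 2 * w
        if block_of[u] == block_of[v]:
            internal[block_of[u]] += 2 * w
    if total == 0:
        return 0.0
    return sum(internal[b] / total - (deg[b] / total) ** 2 for b in deg)


# Two triangles joined by a single edge, all weights 1.
edges = [
    ("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 1.0),
    ("d", "e", 1.0), ("e", "f", 1.0), ("d", "f", 1.0),
    ("c", "d", 1.0),
]
by_triangle = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_block = {v: 0 for v in by_triangle}
mod_triangles = modularity(edges, by_triangle)
mod_single = modularity(edges, one_block)
```

Splitting the graph into its two triangles gives 2 · (6/14 − (7/14)²) = 5/14, while putting everything into one block gives 0, so modularity maximization prefers the split.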
  26. Outline: Motivation; Subgraph Identification; Graph Partitioning; Query Answering; Experiments; Conclusion
  27. Query Answering. [Figure: a client loads graph data and submits a query pattern over A, B, C, ?X, ?Y, ?Z. Labels: Receive query / Return results; Dispatch query; Query answer; Forward (partially answered) query.]
  28. Example Query. [Figure: the example query over ?p, ?v1, ?v2, ?v3, University MD and Italy, annotated with P1.]
  29. Example Query. [Figure: the example query with candidate vertices and their hosting partitions: Jones : P2, Dooley : P2, Smith : P3.]
  30. Example Query. [Figure: the example query with Dooley substituted, annotated with P2, and candidates Paper “ABC” : P2, Paper “HIJ” : P3, Calero : P2.] Where to send the query next?
  31. Query answering. Basic: the next substitution is chosen arbitrarily. COSI_Heur is a heuristic version that makes intelligent choices about the next variable to be substituted, based on: branching factor (# possible substitutions); communication cost (# messages to be sent); workload distribution (partitions hosting vertices).
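
The slides do not spell out how COSI_Heur combines its criteria, so the following is only a sketch of one plausible scoring rule: prefer the variable with few candidate substitutions (branching factor) whose candidates sit on few remote partitions (messages to send). The function name, the two weights and the assignment of candidates to variables are assumptions for illustration.

```python
def choose_next_variable(candidates, current_block,
                         branch_weight=1.0, message_weight=1.0):
    """Pick the unbound variable with the lowest estimated cost.

    candidates maps each variable to {candidate vertex: hosting partition}.
    """
    def score(var):
        hosts = candidates[var]
        # One message per distinct remote partition hosting a candidate.
        remote = {block for block in hosts.values() if block != current_block}
        return branch_weight * len(hosts) + message_weight * len(remote)

    # Sort first so that ties are resolved deterministically by name.
    return min(sorted(candidates), key=score)


# Hypothetical state loosely modeled on the example query slides.
candidates = {
    "?p": {"Paper ABC": "P2", "Paper HIJ": "P3"},
    "?v3": {"Calero": "P2"},
}
next_var = choose_next_variable(candidates, current_block="P2")
```

In this toy state ?v3 is chosen: it has a single candidate and that candidate is local to P2, whereas substituting ?p would branch two ways and require forwarding to P3.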
  32. Outline: Motivation; Subgraph Identification; Graph Partitioning; Query Answering; Experiments; Conclusion
  33. COSI implementation. Implementation is in Java (approx. 10,000 LOC). 778M-edge social network DB: Flickr, Orkut, Livejournal, Youtube [Mislove ‘07]. 16-node compute cluster: 8 GB of RAM, 30 GB HDs, 8-core Intel CPU.
  34. Partitioning quality. [Chart: Comparison of Partitioning Methods; y-axis 0.0% to 40.0%; series Edge Cut Improvement and Imbalance; method labels as extracted: "Single Greedy Batch Greedy Batch Partition".] COSI_Partition achieves a 36% improvement in edge-cut with only slightly higher imbalance. Took 7.5 h to load with individual triple insertion, 10.5 h with batch.
  35. Query answering time. [Chart: Query Times by Cost Model (in ms); logarithmic scale, 100 to 10,000,000 ms; x-axis query sizes (edges / variables): 6/3, 7/4, 8/3, 9/3, 10/3, 11/4, 11/5, 14/5, 16/7, 17/5, 23/6; legend entries: Cost Model A, Cost Model B, Cost Model C, No Cost Model, and Cost Model 2.0/0.5, Cost Model 1.2/0.1, Cost Model 8.0/5.0, No Cost Model.] COSI_heur does very well, answering pretty complex queries in under a second. X-axis shows number of edges and variable vertices.
  36. Partitioning Effect. [Chart: Time (ms); logarithmic scale, 100 to 100,000 ms; x-axis size of the query (# edges / # vertices): 6E/3V, 7E/4V, 8E/3V, 9E/3V, 10E/3V, 11E/4V, 11E/5V, 14E/5V, 16E/7V, 17E/5V, 23E/6V; series: COSI Batch Partition, Individual Edge Insertion.] COSI_heur does very well, answering pretty complex queries in under a second.
  37. Outline: Motivation; Subgraph Identification; Graph Partitioning; Query Answering; Experiments; Conclusion
  38. Related Work (Systems / Pros / Cons). Single Machine: Neo4j, DEO, Hypergraph, RDF-3X, OWLIM, AllegroGraph, etc. / Pros: Latency, Speed / Cons: Limited size, Limited Throughput. Orchestrated Distribution: YARS 2, system extensions / Pros: Size Scalability / Cons: Latency, Limited Throughput. Asynchronous Cloud oriented: COSI / Pros: Size Scalability, Latency, Throughput Scalability, Resource Elasticity.
  39. Conclusion. COSI is a general, scalable and fast graph database framework for social network analysis. Demonstrated scalability and speed on the problem of subgraph identification.
  40. dogma.umiacs.umd.edu
  41. Questions? Comments?
