© Adam Perer
                                                   COSI




    COSI: Cloud Oriented Subgraph
Identification in Massive Social Networks
      Matthias Bröcheler, Andrea Pugliese
             & V.S. Subrahmanian
© solofotones/flickr                    COSI
© Felix Heinen




                       SNA Challenge:
                       Scalability

                   500 million users



50M tweets / day




   Huge Social Networks
                             © Ludwig Gatzke
Cloud-based storage
Asynchronous

COSI answers complex queries in ~1 sec
on a 778 million edge network
              Outline
Motivation
Subgraph Identification
Graph Partitioning
Query Answering
Experiments
Conclusion
[Figure: example social network graph. Vertex labels include Jones, Dooley, Smith, Calero, Baneri, Roma, Olsen, Lund, Larsen, Jamie Lock, John Doe, Karl Oede, Paper “ABC”, Paper “HIJ”, Paper “UVW”, Paper “XYZ”, ASONAM 10, KPLLC 09, UMD, CS, Physics, Social Science, University MD, UC, Universita Calabria, SDU, Odense, USA, Italy and Denmark. Edge labels include author, comment, faculty, friends, member, dean, department, in, attended, presented, submitted, accepted, organized, visited, collaborate(s), colleagues and student of.]

Example Query

                                   ?p
                author                   comment

                     ?v1                 ?v3
         faculty              friends
                                               faculty
        University                  in
           MD              Italy         ?v2




     Simple query, yet already
     difficult to answer by hand




Fraud Detection Example


                            Bank1
              wired                   wired

                 ?v1                  ?v2
                           friends


         Suspicious             ?v3
                      labeled




    COSI Architecture
    [Figure: COSI architecture, with a client, graph data and an example query pattern over A, B, C and the variables ?X, ?Y, ?Z. Figure labels: "load", "Partition Graph", "Distribute data / Dispatch query", "Exchange Data / Forward query", "Answer Queries", "Receive query - Return results", "Query answer".]

 COSI Graph Partitioning
      How should we partition the graph?
      GOAL: Find a way to partition the
       graph DB into “blocks” across the k
       storage nodes so that expected
       time to answer queries is small.






 Example Query & Naive Approach
       Jones
       Dooley                        ?p
                  author
       Smith                               comment

                       ?v1                 ?v3
           faculty              friends
                                                 faculty
          University                  in
             MD              Italy         ?v2






 Co-Retrieval
                                         Paper “ABC”
                                   ?p
                  author                  comment

                       Jones              ?v3
           faculty             friends
                                                faculty
          University                in
             MD            Italy          ?v2




       Co-retrieval:
       Jones – Paper “ABC”




 Cost Model
       Query trace: A query trace w.r.t. a query plan x
        for query Q consists of
         -  All vertices in the DB whose neighborhood is
             retrieved during execution of x
         -  All pairs (u,v) of vertices where x retrieves
             v’s neighborhood immediately after retrieving u’s
             neighborhood.
          •  Intuition: Try to put u and v on the same storage node.
          •  Assumption: Retrieved neighborhoods are cached in
              memory.
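
The definition above can be sketched in a few lines of Python. This is only an illustration, assuming that the execution of a plan x is available as the ordered list of vertices whose neighborhoods it retrieves; the function name `query_trace` and the sample retrieval order are hypothetical and not taken from the COSI implementation.

```python
def query_trace(retrieval_order):
    """Return (vertices, pairs) for one plan execution.

    retrieval_order: vertices in the order in which their neighborhoods
    were retrieved (an assumed representation of a plan execution).
    """
    # All vertices whose neighborhood was retrieved.
    vertices = set(retrieval_order)
    # All pairs (u, v) where v's neighborhood was retrieved
    # immediately after u's neighborhood.
    pairs = set(zip(retrieval_order, retrieval_order[1:]))
    return vertices, pairs


# Hypothetical execution order for illustration.
trace_vertices, trace_pairs = query_trace(["University MD", "Jones", "Paper ABC"])
```

Pairs that appear in many traces are the ones worth placing on the same storage node.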



 Cost Model            (continued)

      Assume fixed but arbitrary distribution
        over the set of all queries.
      This induces a pdf over the set of all
        feasible query plans qp(Q) for query Q.
        -  $\mathbb{P}(x) = \sum_{Q \in \mathcal{Q},\, qp(Q)=x} \mathbb{P}(Q)$, where $\mathcal{Q}$ is the set of all queries.
       -  Prob of query plan “x” is the sum of the probs of
           queries requiring query plan x.
      Let E(v) be the event that v is retrieved by
        a query trace of a random query plan for
        Q.


 Cost Model          (continued)

       Prob that vertex v occurs in the trace of a
         randomly chosen query plan is
          $\mathbb{P}(E(v)) = \sum_{x \in qp(Q) \,\wedge\, v \in qt(x,DB)} \mathbb{P}(x)$.
       Prob that (u,v) occurs in the trace of a randomly
         chosen query plan is
          $\mathbb{P}(E(u,v)) = \sum_{x \in qp(Q) \,\wedge\, (u,v) \in qt(x,DB)} \mathbb{P}(x)$.
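
As a rough illustration of the two sums, the sketch below adds up plan probabilities over the traces in which a vertex or a pair occurs. It assumes the plan distribution and the traces are given as plain dictionaries; the names `trace_probabilities`, `plan_prob` and `plan_trace` and all sample values are hypothetical.

```python
from collections import defaultdict


def trace_probabilities(plan_prob, plan_trace):
    """Compute P(E(v)) and P(E(u,v)) from a distribution over plans.

    plan_prob:  {plan: probability of that plan}
    plan_trace: {plan: (set of vertices, set of (u, v) pairs)}
    """
    p_vertex = defaultdict(float)
    p_pair = defaultdict(float)
    for plan, prob in plan_prob.items():
        vertices, pairs = plan_trace[plan]
        for v in vertices:
            p_vertex[v] += prob      # plan contributes to P(E(v))
        for pair in pairs:
            p_pair[pair] += prob     # plan contributes to P(E(u,v))
    return dict(p_vertex), dict(p_pair)


# Two hypothetical plans with made-up probabilities and traces.
plan_prob = {"x1": 0.75, "x2": 0.25}
plan_trace = {
    "x1": ({"a", "b"}, {("a", "b")}),
    "x2": ({"b", "c"}, {("b", "c")}),
}
p_vertex, p_pair = trace_probabilities(plan_prob, plan_trace)
```

Here vertex b occurs in both traces, so its probability is the sum of both plan probabilities.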






 Cost Model         (continued)

     Key Theorem
      Suppose vertex retrieval and inter-node comms
       are uniform across storage nodes. The partition
       of the DB graph that minimizes query exec time
       coincides with the partition that minimizes edge
       cut cost in the graph $(V, V \times V)$ with weight
       function $w(u,v) = \mathbb{P}(E(u,v)) + \mathbb{P}(E(v,u))$.

       SO MIN EDGE-CUTS IN COMPLETE GRAPHS IS
         CLOSELY RELATED TO MINIMIZING QUERY
         EXECUTION TIME.
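
To make the edge-cut cost concrete, the following sketch evaluates it for a given assignment of vertices to blocks. It assumes the pair probabilities are stored per ordered pair, so summing them over all cut ordered pairs equals summing w(u,v) over all cut unordered pairs. The function name `edge_cut_cost` and the sample numbers are hypothetical.

```python
def edge_cut_cost(p_pair, block_of):
    """Edge-cut cost of a partition under w(u,v) = P(E(u,v)) + P(E(v,u)).

    p_pair:   {(u, v): P(E(u,v))} for ordered pairs
    block_of: {vertex: index of the block (storage node) holding it}
    """
    return sum(
        prob
        for (u, v), prob in p_pair.items()
        if block_of[u] != block_of[v]  # the pair crosses storage nodes
    )


# Made-up pair probabilities for three vertices.
p_pair = {("a", "b"): 0.75, ("b", "a"): 0.125, ("b", "c"): 0.25}

cost_ab_together = edge_cut_cost(p_pair, {"a": 0, "b": 0, "c": 1})
cost_bc_together = edge_cut_cost(p_pair, {"a": 0, "b": 1, "c": 1})
```

In this toy case, keeping a and b together cuts only the low-probability pair (b,c), so that partition has the lower cost.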


 Partitioning Algorithm
       Challenges
         -  Finding MIN EDGE-CUT is NP-complete.
         -  We want to process graphs containing 100s of
             millions of edges.
       So we want an algorithm that is
         -  Very fast
         -  Produces good edge cuts
            •  but maybe not optimal
       To achieve speed, we focus on partition strategies that
         permanently assign vertices to blocks.



     Individual edge insertion
      Suppose we have a partition P={P1,..,Pk}.
      We are inserting the edge (v,p,o).
      Vertex force vectors: Measure how strongly
        each Pi “pulls” a vertex.
       -  $|v|[i] = f_P\left(\sum_{y \in nbhd(v) \cap P_i} w(v,y)\right)$
       -  $f_P$ maps positive reals to reals and is an “affinity”
           measure.
       -  $|v|[i]$ is based on the sum of the weights of edges from v to
           its neighbors in Pi. Insert v into the block Pi with the highest $|v|[i]$.
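
A minimal sketch of this greedy assignment follows. The affinity function used here (summed edge weight minus a penalty proportional to block size) is an assumption chosen for illustration, not the measure used in COSI; the names `force_vector`, `insert_vertex` and `size_penalized` are likewise hypothetical.

```python
def force_vector(v, neighbors, weight, blocks, affinity):
    """Return the list of |v|[i], one entry per block."""
    forces = []
    for i, block in enumerate(blocks):
        # Sum of w(v, y) over the neighbors y of v that already sit in block i.
        pull = sum(
            weight.get(frozenset((v, y)), 0.0)
            for y in neighbors.get(v, ())
            if y in block
        )
        forces.append(affinity(i, pull))
    return forces


def insert_vertex(v, neighbors, weight, blocks, affinity):
    """Permanently assign v to the block with the highest force entry."""
    forces = force_vector(v, neighbors, weight, blocks, affinity)
    best = max(range(len(blocks)), key=lambda i: forces[i])
    blocks[best].add(v)
    return best


# Toy data: two blocks, one new vertex "v" with two weighted neighbors.
blocks = [{"a", "b"}, {"c"}]
neighbors = {"v": ["a", "c"]}
weight = {frozenset(("v", "a")): 1.0, frozenset(("v", "c")): 3.0}


def size_penalized(i, pull):
    # Assumed affinity: reward connectedness, punish large blocks.
    return pull - 0.5 * len(blocks[i])


forces = force_vector("v", neighbors, weight, blocks, size_penalized)
chosen = insert_vertex("v", neighbors, weight, blocks, size_penalized)
```

The vertex ends up in the second block, which pulls harder and is also smaller.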



 Affinity Measures
      Must satisfy 3 properties
       -  Connectedness of a vertex to a partition
           block. This helps minimize edge cut.
       -  Imbalance of block sizes.
         •  E.g. standard deviation of block sizes,
             normalized by expected DB size.
       -  Excessive size should be punished.




 Batch insertion
      Adding a set of edges at once.
      Idea: Find strongly connected
        components using modularity
        maximization and assign those to the
        partition block with highest affinity.






Batch Partitioning Algorithm
                        Force Vector
                          Affinity

                         Contract

                        Maximize
                        Modularity

                         Contract


                        Maximize
                        Modularity


 Graph modularity
      $\mathrm{Mod}(P) = \sum_{P_i \in P} \left( \frac{W(P_i,P_i)}{2|E|} - \frac{\deg_W(P_i)^2}{(2|E|)^2} \right)$

      Where
       -  $W(X,Y)$ is the sum of the weights of
           edges (x,y) with x in X, y in Y,
       -  $\deg_W(v)$ is the sum of the weights of
           edges (v,-), and
       -  $\deg_W(P_i)$ is the sum of the $\deg_W(v)$’s for
           v in $P_i$.
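
The formula can be checked on a small example. The sketch below assumes an undirected graph whose edges are each listed once and counted in both directions when computing W, so an edge inside a block contributes twice its weight to W(Pi,Pi). The function name `modularity` and the toy graph (two triangles joined by one edge, unit weights) are assumptions for illustration.

```python
def modularity(edges, blocks):
    """Mod(P) for an undirected weighted graph.

    edges:  {(x, y): weight}, each undirected edge listed once
    blocks: list of vertex sets forming the partition P
    """
    two_e = 2 * len(edges)  # 2|E|
    total = 0.0
    for block in blocks:
        w_inside = 0.0  # W(Pi, Pi): both directions of each inner edge
        deg = 0.0       # deg_W(Pi): summed weighted degree of the block
        for (x, y), w in edges.items():
            if x in block and y in block:
                w_inside += 2 * w
            deg += w * ((x in block) + (y in block))
        total += w_inside / two_e - deg ** 2 / two_e ** 2
    return total


# Two triangles {1,2,3} and {4,5,6} joined by the edge (3,4).
edges = {
    (1, 2): 1.0, (1, 3): 1.0, (2, 3): 1.0,
    (4, 5): 1.0, (4, 6): 1.0, (5, 6): 1.0,
    (3, 4): 1.0,
}
mod_split = modularity(edges, [{1, 2, 3}, {4, 5, 6}])
mod_single = modularity(edges, [{1, 2, 3, 4, 5, 6}])
```

Splitting along the bridge edge gives 5/14, while putting everything in one block gives 0, so modularity maximization prefers the split.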
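     Query Answering
     [Figure: COSI architecture during query answering, with a client, graph data and an example query pattern over A, B, C and the variables ?X, ?Y, ?Z. Figure labels: "load", "Receive query - Return results", "Dispatch query", "Forward (partially answered) query", "Query answer".]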
 Example Query

                                    ?p
                 author                   comment

                      ?v1                 ?v3
          faculty              friends
                                                faculty
         University                  in
            MD              Italy         ?v2


          P1






 Example Query
     Jones : P2
     Dooley : P2                        ?p
                     author
     Smith : P3                               comment

                          ?v1                 ?v3
              faculty              friends
                                                    faculty
             University                  in
                MD              Italy         ?v2






 Example Query
                                         Paper “ABC” : P2
                                         Paper “HIJ” : P3
                                  ?p
                 author                    comment
         P2                                               Calero : P2
                      Dooley              ?v3
          faculty              friends
                                                faculty
         University                in
            MD            Italy           ?v2




      Where to send query next?





 Query answering
      Basic: the next substitution is chosen arbitrarily.
      COSI_Heur is a heuristic version that makes
        intelligent choices about the next variable
        to be substituted, based on:
       -  Branching factor: # possible substitutions
       -  Communication cost: # messages to be sent
       -  Workload distribution: partitions hosting
           vertices
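
One way such a choice could look is sketched below. The scoring rule is an assumption made for illustration: it combines only the branching factor and a message count (one message per remote partition hosting a candidate), ignores workload distribution, and uses hypothetical weights `w_branch` and `w_comm`. It is not the actual COSI_Heur cost model.

```python
def choose_next_variable(candidates, current_partition, w_branch=1.0, w_comm=1.0):
    """Pick the unbound variable with the lowest assumed cost.

    candidates: {variable: {candidate vertex: partition hosting it}}
    current_partition: partition currently processing the query
    """
    def score(var):
        hosts = candidates[var]
        branching = len(hosts)  # number of possible substitutions
        remote = {p for p in hosts.values() if p != current_partition}
        messages = len(remote)  # assumed: one message per remote partition
        return w_branch * branching + w_comm * messages

    # Sort first so that ties are broken deterministically by name.
    return min(sorted(candidates), key=score)


# Situation modeled on the example slide: the query is at P2.
candidates = {
    "?p": {'Paper "ABC"': "P2", 'Paper "HIJ"': "P3"},
    "?v3": {"Calero": "P2"},
}
next_var = choose_next_variable(candidates, current_partition="P2")
```

Under these assumed weights, ?v3 is substituted next because it has a single candidate that is hosted locally.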


 COSI implementation
      Implementation is in Java (approx
        10,000 loc)
      778M edges social network DB
       -  Flickr, Orkut, Livejournal, Youtube
       -  [Mislove ‘07]

      16-node compute cluster
       -  8 GB of RAM
       -  30 GB HDs
       -  8 core Intel CPU


 Partitioning quality
      [Figure: bar chart “Comparison of Partitioning Methods” for Single Greedy, Batch Greedy and Batch Partition. Series: Edge Cut Improvement and Imbalance. Y-axis: 0% to 40%.]

     COSI_Partition achieves a 36% improvement in
     edge-cut with only slightly higher imbalance.
     Took 7.5 h to load with individual triple insertion, 10.5 h with batch.

      Query answering time
      [Figure: “Query Times by Cost Model (in ms)”, logarithmic scale from 100 to 10,000,000 ms. Query sizes: 6 Edges / 3 Vars, 7 Edges / 4 Vars, 8 Edges / 3 Vars, 9 Edges / 3 Vars, 10 Edges / 3 Vars, 11 Edges / 4 Vars, 11 Edges / 5 Vars, 14 Edges / 5 Vars, 16 Edges / 7 Vars, 17 Edges / 5 Vars, 23 Edges / 6 Vars. Series: Cost Model A (2.0/0.5), Cost Model B (1.2/0.1), Cost Model C (8.0/5.0), No Cost Model.]

                       COSI_heur does very well, answering
                       pretty complex queries in under a second.
            X-axis shows number of edges and variable vertices.
  Partitioning Effect
      [Figure: query time in ms (logarithmic scale, 100 to 100,000) by size of the query (# edges / # vertices): 6E/3V, 7E/4V, 8E/3V, 9E/3V, 10E/3V, 11E/4V, 11E/5V, 14E/5V, 16E/7V, 17E/5V, 23E/6V. Series: COSI Batch Partition and Individual Edge Insertion.]


                      COSI_heur does very well, answering
                      pretty complex queries in under a second.
 Related Work
Approach                      Systems                                                     Pros                                                            Cons
Single machine                Neo4j, DEO, Hypergraph, RDF-3X, OWLIM, AllegroGraph, etc.   Latency, speed                                                  Limited size, limited throughput
Orchestrated distribution     YARS 2, system extensions                                   Size scalability                                                Latency, limited throughput
Asynchronous, cloud oriented  COSI                                                        Size scalability, throughput scalability, resource elasticity   Latency




 Conclusion
  COSI is a general, scalable and fast
    graph database framework for social
    network analysis
  Demonstrated scalability and speed on
    the problem of subgraph identification




dogma.umiacs.umd.edu
Questions?
Comments?

More Related Content

More from Matthias Broecheler

Adding Value through graph analysis using Titan and Faunus
Adding Value through graph analysis using Titan and FaunusAdding Value through graph analysis using Titan and Faunus
Adding Value through graph analysis using Titan and Faunus
Matthias Broecheler
 
Big Graph Data
Big Graph DataBig Graph Data
Big Graph Data
Matthias Broecheler
 
Titan: Big Graph Data with Cassandra
Titan: Big Graph Data with CassandraTitan: Big Graph Data with Cassandra
Titan: Big Graph Data with Cassandra
Matthias Broecheler
 
PMatch: Probabilistic Subgraph Matching on Huge Social Networks
PMatch: Probabilistic Subgraph Matching on Huge Social NetworksPMatch: Probabilistic Subgraph Matching on Huge Social Networks
PMatch: Probabilistic Subgraph Matching on Huge Social Networks
Matthias Broecheler
 
Budget-Match: Cost Effective Subgraph Matching on Large Networks
Budget-Match: Cost Effective Subgraph Matching on Large NetworksBudget-Match: Cost Effective Subgraph Matching on Large Networks
Budget-Match: Cost Effective Subgraph Matching on Large Networks
Matthias Broecheler
 
Probabilistic Soft Logic
Probabilistic Soft LogicProbabilistic Soft Logic
Probabilistic Soft Logic
Matthias Broecheler
 
Computing Marginal in CCMRFs - NIPS 2010
Computing Marginal in CCMRFs - NIPS 2010Computing Marginal in CCMRFs - NIPS 2010
Computing Marginal in CCMRFs - NIPS 2010
Matthias Broecheler
 
A Scalable Framework for Modeling Competitive Diffusion in Social Networks
A Scalable Framework for Modeling Competitive Diffusion in Social NetworksA Scalable Framework for Modeling Competitive Diffusion in Social Networks
A Scalable Framework for Modeling Competitive Diffusion in Social Networks
Matthias Broecheler
 

More from Matthias Broecheler (8)

Adding Value through graph analysis using Titan and Faunus
Adding Value through graph analysis using Titan and FaunusAdding Value through graph analysis using Titan and Faunus
Adding Value through graph analysis using Titan and Faunus
 
Big Graph Data
Big Graph DataBig Graph Data
Big Graph Data
 
Titan: Big Graph Data with Cassandra
Titan: Big Graph Data with CassandraTitan: Big Graph Data with Cassandra
Titan: Big Graph Data with Cassandra
 
PMatch: Probabilistic Subgraph Matching on Huge Social Networks
PMatch: Probabilistic Subgraph Matching on Huge Social NetworksPMatch: Probabilistic Subgraph Matching on Huge Social Networks
PMatch: Probabilistic Subgraph Matching on Huge Social Networks
 
Budget-Match: Cost Effective Subgraph Matching on Large Networks
Budget-Match: Cost Effective Subgraph Matching on Large NetworksBudget-Match: Cost Effective Subgraph Matching on Large Networks
Budget-Match: Cost Effective Subgraph Matching on Large Networks
 
Probabilistic Soft Logic
Probabilistic Soft LogicProbabilistic Soft Logic
Probabilistic Soft Logic
 
Computing Marginal in CCMRFs - NIPS 2010
Computing Marginal in CCMRFs - NIPS 2010Computing Marginal in CCMRFs - NIPS 2010
Computing Marginal in CCMRFs - NIPS 2010
 
A Scalable Framework for Modeling Competitive Diffusion in Social Networks
A Scalable Framework for Modeling Competitive Diffusion in Social NetworksA Scalable Framework for Modeling Competitive Diffusion in Social Networks
A Scalable Framework for Modeling Competitive Diffusion in Social Networks
 

Recently uploaded

UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
Pixlogix Infotech
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 

Recently uploaded (20)

UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
Pushing the limits of ePRTC: 100ns holdover for 100 days

COSI: Cloud Oriented Subgraph Identification in Massive Social Networks

  • 1. COSI: Cloud Oriented Subgraph Identification in Massive Social Networks. Matthias Bröcheler, Andrea Pugliese & V.S. Subrahmanian (image © Adam Perer)
  • 2. [Image slide; images © solofotones/flickr and © Felix Heinen]
  • 3. SNA Challenge: Scalability (images © solofotones/flickr and © Felix Heinen)
  • 4. Huge Social Networks: 500 million users; 50M tweets / day (image © Ludwig Gatzke)
  • 5. COSI: cloud-based storage; asynchronous. Answers complex queries in ~1 sec on a 778 million edge network
  • 6. Outline: Motivation; Subgraph Identification; Graph Partitioning; Query Answering; Experiments; Conclusion
  • 7. [Figure: example social network graph. Vertex labels include people (Jones, Dooley, Smith, Calero and others), papers (“ABC”, “HIJ”, “UVW”, “XYZ”), and institutions and places (UMD, University MD, SDU, Odense, USA, Italy, Denmark); edge labels include author, comment, faculty, friends, member, dean, department, attended, submitted, visited, student of and collaborate.]
  • 8. Example Query. [Figure: query graph over variables ?p, ?v1, ?v2, ?v3 and constants University MD and Italy, with edge labels author, comment, faculty, friends, in.] Simple query, yet already difficult to answer by hand.
  • 9. Fraud Detection Example. [Figure: query graph over variables ?v1, ?v2, ?v3 and constants Bank1 and Suspicious, with edge labels wired, friends, labeled.]
  • 10. COSI Architecture. [Diagram labels: Graph Data (load); Client with a query graph over A, B, C, ?X, ?Y, ?Z; Receive query / Return results; Distribute data / Dispatch query; Exchange data / Forward query; Query answer.]
  • 11. COSI Architecture. [Same diagram, with two added labels: Partition Graph and Answer Queries.]
  • 12. Outline: Motivation; Subgraph Identification; Graph Partitioning; Query Answering; Experiments; Conclusion
  • 13. COSI Graph Partitioning
      - How should we partition the graph?
      - GOAL: Find a way to partition the graph DB into “blocks” across the k storage nodes so that the expected time to answer queries is small.
  • 14. Example Query & Naive Approach. [Figure: the example query graph (?p, ?v1, ?v2, ?v3, University MD, Italy) annotated with the vertices Jones, Dooley and Smith.]
  • 15. Co-Retrieval. [Figure: the example query graph with Jones in place of ?v1 and Paper “ABC” shown next to ?p.] Co-retrieval: Jones and Paper “ABC”.
  • 16. Cost Model
      - Query trace: a query trace w.r.t. a query plan x for query Q consists of
          - all vertices in the DB whose neighborhood is retrieved during execution of x;
          - all pairs (u,v) of vertices where x retrieves v’s neighborhood immediately after retrieving u’s neighborhood.
      - Intuition: try to put u and v on the same storage node.
      - Assumption: retrieved neighborhoods are cached in memory.
  • 17. Cost Model (continued)
      - Assume a fixed but arbitrary probability distribution $\Pr$ over the set $\mathcal{Q}$ of all queries.
      - This induces a probability distribution over the set of all feasible query plans qp(Q) for query Q:
          - $\Pr(x) = \sum_{Q \in \mathcal{Q},\ qp(Q) = x} \Pr(Q)$
          - That is, the probability of query plan x is the sum of the probabilities of the queries requiring query plan x.
      - Let E(v) be the event that v is retrieved by a query trace of a random query plan for Q.
  • 18. Cost Model (continued)
      - The probability that vertex v occurs in the trace of a randomly chosen query plan is $\Pr(E(v)) = \sum_{x \in qp(Q),\ v \in qt(x, DB)} \Pr(x)$, where qt(x,DB) is the query trace of plan x on the DB.
      - The probability that (u,v) occurs in the trace of a randomly chosen query plan is $\Pr(E(u,v)) = \sum_{x \in qp(Q),\ (u,v) \in qt(x, DB)} \Pr(x)$.
  • 19. Cost Model (continued): Key Theorem
      - Suppose vertex retrieval and inter-node communication are uniform across storage nodes. The partition of the DB graph that minimizes query execution time coincides with the partition that minimizes edge cut cost in the graph $(V, V \times V)$ with weight function $w(u,v) = \Pr(E(u,v)) + \Pr(E(v,u))$.
      - So finding minimum edge cuts in complete graphs is closely related to minimizing query execution time.
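The cost model on the slides above can be made concrete with a small sketch. It assumes the query plans are given as a list of (probability, trace pairs) tuples, an input format invented here for illustration, and computes the pair probabilities, the symmetric weights w(u,v), and the edge cut cost of a candidate partition. The function names, the toy probabilities and the two candidate partitions are all hypothetical; none of them come from the COSI implementation.

```python
from collections import defaultdict

def pair_probabilities(plans):
    """Pr(E(u, v)): total probability of the plans whose trace contains (u, v).

    `plans` is a list of (probability, trace_pairs) tuples; this input
    format is an assumption made for the sketch.
    """
    prob = defaultdict(float)
    for p, trace_pairs in plans:
        for pair in set(trace_pairs):  # count a pair once per plan
            prob[pair] += p
    return dict(prob)

def edge_weights(prob):
    """w(u, v) = Pr(E(u, v)) + Pr(E(v, u)), keyed by the unordered pair."""
    weights = defaultdict(float)
    for (u, v), p in prob.items():
        if u != v:
            weights[frozenset((u, v))] += p
    return dict(weights)

def edge_cut_cost(weights, block_of):
    """Sum of the weights of pairs whose endpoints lie in different blocks."""
    return sum(w for pair, w in weights.items()
               if len({block_of[x] for x in pair}) > 1)

# Toy traces: made-up probabilities, vertex names borrowed from the example graph.
plans = [
    (0.5, [("Jones", "ABC"), ("ABC", "Dooley")]),
    (0.3, [("ABC", "Jones")]),
    (0.2, [("Smith", "HIJ")]),
]
weights = edge_weights(pair_probabilities(plans))
together = {"Jones": 0, "ABC": 0, "Dooley": 0, "Smith": 1, "HIJ": 1}
split = {"Jones": 0, "ABC": 1, "Dooley": 1, "Smith": 0, "HIJ": 0}
print(edge_cut_cost(weights, together), edge_cut_cost(weights, split))
```

In this toy input, keeping co-retrieved vertices in the same block cuts no weighted pair, while separating Jones from Paper “ABC” cuts the heaviest one. This is the intuition behind placing u and v on the same storage node.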
  • 20. Partitioning Algorithm
      - Challenges
          - Finding MIN EDGE-CUT is NP-complete.
          - We want to process graphs containing hundreds of millions of edges.
      - So we want an algorithm that
          - is very fast;
          - produces good edge cuts, though maybe not optimal ones.
      - To achieve speed, we focus on partitioning strategies that permanently assign vertices to blocks.
  • 21. Individual edge insertion
      - Suppose we have a partition P = {P1, ..., Pk}.
      - We are inserting the edge (v,p,o).
      - Vertex force vectors measure how strongly each Pi “pulls” a vertex:
          - $|v|[i] = f_P\left(\sum_{y \in nbhd(v) \cap P_i} w(v,y)\right)$
          - f_P maps positive reals to reals and is an “affinity” measure.
          - The sum adds up the weights of the edges from v to each of its neighbors in Pi.
      - Insert v into the block Pi with the highest |v|[i].
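The individual edge insertion step can be sketched as follows. The slides do not give a concrete form for the affinity measure f_P, so the one used here (connectedness minus a block-size penalty scaled by a hypothetical parameter `alpha`) is only an assumption for illustration. The class name, the default edge weight of 1.0 and the tie-breaking rule (lowest block index wins) are likewise hypothetical.

```python
from collections import defaultdict

class GreedyPartitioner:
    """Assigns each vertex permanently to one of k blocks as edges arrive."""

    def __init__(self, k, alpha=0.5):
        self.k = k
        self.alpha = alpha             # hypothetical size-penalty strength
        self.block_of = {}             # vertex -> block index (never changed)
        self.sizes = [0] * k           # number of vertices per block
        self.adj = defaultdict(dict)   # adj[v][y] = w(v, y)

    def _affinity(self, i, connectedness):
        # Assumed affinity: reward connectedness, penalize large blocks.
        total = sum(self.sizes) + 1
        return connectedness - self.alpha * self.sizes[i] / total

    def force_vector(self, v):
        """|v|[i]: affinity of the summed edge weights from v into block i."""
        pull = [0.0] * self.k
        for y, w in self.adj[v].items():
            if y in self.block_of:
                pull[self.block_of[y]] += w
        return [self._affinity(i, pull[i]) for i in range(self.k)]

    def _place(self, v):
        if v in self.block_of:         # assignments are permanent
            return
        forces = self.force_vector(v)
        best = max(range(self.k), key=forces.__getitem__)
        self.block_of[v] = best
        self.sizes[best] += 1

    def insert_edge(self, v, o, weight=1.0):
        self.adj[v][o] = self.adj[v].get(o, 0.0) + weight
        self.adj[o][v] = self.adj[o].get(v, 0.0) + weight
        self._place(v)
        self._place(o)

partitioner = GreedyPartitioner(k=2)
for v, o in [("Jones", "ABC"), ("Smith", "HIJ"), ("Dooley", "ABC")]:
    partitioner.insert_edge(v, o)
print(partitioner.block_of)
```

With this toy input the connected vertices Jones, Paper “ABC” and Dooley end up in one block, while the size penalty pushes the unrelated pair Smith and Paper “HIJ” into the other.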
  • 22. Affinity Measures
      - Must satisfy 3 properties:
          - Connectedness of a vertex to a partition block. This helps minimize edge cut.
          - Imbalance of block sizes, e.g. standard deviation of block sizes, normalized by expected DB size.
          - Excessive size should be punished.
  • 23. Batch insertion
      - Adding a set of edges at once.
      - Idea: Find strongly connected components using modularity maximization and assign those to the partition block with highest affinity.
  • 24. Batch Partitioning Algorithm. [Diagram with stages labeled Maximize Modularity and Contract (each shown twice) and Force Vector Affinity.]
  • 25. Graph modularity
      - $Mod(P) = \sum_{P_i \in P} \left( \frac{W(P_i, P_i)}{2|E|} - \frac{deg_W(P_i)^2}{(2|E|)^2} \right)$
      - where
          - W(X,Y) is the sum of the weights of edges (x,y) with x in X, y in Y;
          - deg_W(v) is the sum of the weights of edges (v,-);
          - deg_W(Pi) is the sum of the deg_W(v)’s for v in Pi.
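As a worked instance of the modularity formula, the sketch below evaluates Mod(P) for a given block assignment. It assumes an undirected graph stored as (x, y, weight) tuples, treats every undirected edge as the two directed pairs (x,y) and (y,x) when computing W(Pi,Pi), and uses twice the total edge weight in place of 2|E|, which coincides with the slide's formula when all weights are 1. These conventions and the toy graph are assumptions for the sketch, not taken from the slides.

```python
from collections import defaultdict

def modularity(edges, block_of):
    """Mod(P) for undirected weighted edges [(x, y, w), ...]."""
    two_m = 2.0 * sum(w for _, _, w in edges)
    if two_m == 0:
        return 0.0
    internal = defaultdict(float)  # W(Pi, Pi), counting both directions
    degree = defaultdict(float)    # deg_W(Pi)
    for x, y, w in edges:
        bx, by = block_of[x], block_of[y]
        degree[bx] += w
        degree[by] += w
        if bx == by:
            internal[bx] += 2.0 * w
    return sum(internal[b] / two_m - (degree[b] / two_m) ** 2 for b in degree)

# Toy graph: two triangles joined by a single edge, all weights 1.
edges = [("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 1.0),
         ("d", "e", 1.0), ("e", "f", 1.0), ("d", "f", 1.0),
         ("c", "d", 1.0)]
two_blocks = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_block = {v: 0 for v in "abcdef"}
print(modularity(edges, two_blocks), modularity(edges, one_block))
```

For the toy graph, each triangle contributes 6/14 - (7/14)^2, so splitting along the bridge gives a modularity of 5/14, whereas putting everything in one block gives 0.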
  • 26. Outline: Motivation; Subgraph Identification; Graph Partitioning; Query Answering; Experiments; Conclusion
  • 27. Query Answering. [Diagram labels: Graph Data (load); Client with a query graph over A, B, C, ?X, ?Y, ?Z; Receive query / Return results; Dispatch query; Forward (partially answered) query; Query answer.]
  • 28. Example Query. [Figure: the example query graph (?p, ?v1, ?v2, ?v3, University MD, Italy), with partition block P1 marked.]
  • 29. Example Query. [Figure: the example query graph annotated with vertices and their partition blocks: Jones: P2, Dooley: P2, Smith: P3.]
  • 30. Example Query. [Figure: the example query graph with Dooley in place of ?v1, block P2 marked, and vertices with their partition blocks: Paper “ABC”: P2, Paper “HIJ”: P3, Calero: P2.] Where to send the query next?
  • 31. Query answering
      - Basic: the next substitution is arbitrary.
      - COSI_Heur is a heuristic version that makes informed choices about the next variable to be substituted, based on:
          - Branching factor: # possible substitutions
          - Communication cost: # messages to be sent
          - Workload distribution: partitions hosting vertices
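To illustrate answering a query by successive variable substitution, here is a minimal single-machine sketch. It is not the COSI_Heur algorithm: it only imitates the branching-factor criterion by always substituting the variable with the fewest candidate values, and it ignores communication cost and workload distribution entirely. It also assumes that variables are strings starting with "?", that data vertices never start with "?", and that two variables may bind to the same vertex. The function names and the toy triples are hypothetical.

```python
def is_var(term):
    return term.startswith("?")

def candidates(var, query, triples, vertices, binding):
    """Values for `var` consistent with query edges whose other end is known."""
    cands = None
    for s, label, o in query:
        s, o = binding.get(s, s), binding.get(o, o)
        if s == var and not is_var(o):
            found = {a for (a, l, b) in triples if l == label and b == o}
        elif o == var and not is_var(s):
            found = {b for (a, l, b) in triples if l == label and a == s}
        else:
            continue
        cands = found if cands is None else cands & found
    return set(vertices) if cands is None else cands

def match(query, triples):
    """Return all variable bindings under which every query edge is in the data."""
    triples = set(triples)
    vertices = {a for a, _, _ in triples} | {b for _, _, b in triples}
    variables = sorted({t for s, _, o in query for t in (s, o) if is_var(t)})

    def solve(binding):
        unbound = [v for v in variables if v not in binding]
        if not unbound:
            if all((binding.get(s, s), l, binding.get(o, o)) in triples
                   for s, l, o in query):
                yield dict(binding)
            return
        options = {v: candidates(v, query, triples, vertices, binding)
                   for v in unbound}
        # Branching-factor heuristic: fewest possible substitutions first.
        var = min(unbound, key=lambda v: len(options[v]))
        for value in sorted(options[var]):
            binding[var] = value
            yield from solve(binding)
            del binding[var]

    return list(solve({}))

# Toy data loosely modeled on the example graph; the edge directions are assumed.
data = [("Jones", "author", "ABC"), ("Dooley", "comment", "ABC"),
        ("Jones", "faculty", "UMD"), ("Dooley", "faculty", "UMD"),
        ("Jones", "friends", "Dooley"), ("Smith", "author", "HIJ"),
        ("Smith", "faculty", "SDU")]
query = [("?v1", "author", "?p"), ("?v3", "comment", "?p"),
         ("?v1", "faculty", "UMD"), ("?v1", "friends", "?v3")]
print(match(query, data))
```

Starting with ?v1 is cheap here because only two vertices are faculty at UMD, whereas ?p and ?v3 are initially unconstrained. In a distributed setting the same choice would also determine which storage nodes receive the partially answered query.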
  • 32. Outline: Motivation; Subgraph Identification; Graph Partitioning; Query Answering; Experiments; Conclusion
  • 33. COSI implementation
      - Implementation is in Java (approx. 10,000 LOC).
      - Social network DB with 778M edges
          - Flickr, Orkut, Livejournal, Youtube
          - [Mislove ‘07]
      - 16-node compute cluster
          - 8 GB of RAM
          - 30 GB HDs
          - 8 core Intel CPU
  • 34. Partitioning quality. [Chart: Comparison of Partitioning Methods; y-axis from 0% to 40%; series: Edge Cut Improvement and Imbalance, grouped by partitioning method.] COSI_Partition achieves a 36% improvement in edge cut with only slightly higher imbalance. Loading took 7.5 h with individual triple insertion and 10.5 h with batch insertion.
  • 35. Query answering time. [Chart: Query Times by Cost Model (in ms); logarithmic y-axis from 100 to 10,000,000 ms; x-axis: query size from 6 edges / 3 variables to 23 edges / 6 variables; series: Cost Model A (2.0/0.5), Cost Model B (1.2/0.1), Cost Model C (8.0/5.0) and No Cost Model.] COSI_heur does very well, answering fairly complex queries in under a second. The x-axis shows the number of edges and variable vertices.
  • 36. Partitioning Effect. [Chart: time (ms) on a logarithmic scale from 100 to 100,000; x-axis: size of the query (# edges / # vertices) at 6E/3V, 7E/4V, 8E/3V, 9E/3V, 10E/3V, 11E/4V, 11E/5V, 14E/5V, 16E/7V, 17E/5V and 23E/6V; series: COSI Batch Partition and Individual Edge Insertion.] COSI_heur does very well, answering fairly complex queries in under a second.
  • 37. Outline: Motivation; Subgraph Identification; Graph Partitioning; Query Answering; Experiments; Conclusion
  • 38. Related Work
      - Single machine. Systems: Neo4j, DEO, Hypergraph, RDF-3X, OWLIM, AllegroGraph, etc. Pros: latency, speed. Cons: limited size, limited throughput.
      - Orchestrated distribution. Systems: YARS 2, system extensions. Pros: size scalability. Cons: latency, limited throughput.
      - Asynchronous, cloud oriented. Systems: COSI. Pros: size scalability, latency, throughput scalability, resource elasticity.
  • 39. Conclusion
      - COSI is a general, scalable and fast graph database framework for social network analysis.
      - Demonstrated scalability and speed on the problem of subgraph identification.
  • 41. Questions? Comments?