Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
© Adam Perer                                                 PMatchPMatch: Probabilistic Subgraph Matching       on Huge S...
PMatch                                                                                                                    ...
PMatchExample Query                          attended                   ?p                 Francis                        ...
PMatchQuery Uncertainty                       attended                ?p                 Francis                          ...
PMatchProbabilistic Matching Query        2                      attended                          Certain         organiz...
PMatchIntelligence and Security       Very bad        affiliated                                     ?person1             ...
PMatchIntelligence and Security       Very bad        affiliated                                     ?person1             ...
PMatchPMatch Applications Intelligence    -  Information retrieval on open source data Search    -  Facilitate productiv...
PMatch              OutlineMotivationPMatch Query DefinitionPMatch Query AnsweringExperimentsRelated Work & Conclusion
PMatch PMatch Query                   2                                            ?p                                     ...
PMatch PMatch Query                     2                                              ?p                                 ...
PMatch PMatch Query Answer  All substitutions   with probability                             2                           ...
PMatch PMatch Answer  1/(1+e-(-5+6)) ≈ 73%                                               Francis         -  w0=-5        ...
PMatch              OutlineMotivationPMatch Query DefinitionPMatch Query AnsweringExperimentsRelated Work  Conclusion
PMatch PMatch Query Answering  Try to find all answers at once by    exploiting shared substructures and    efficiently m...
PMatch PMatch Initialization                                 Attended {2,3}                           ?p                  ...
PMatch PMatch Data Structures Maintained per vertex:  R: Set of substitution candidates     -  Pairs of the form x,Hx, wh...
PMatch PMatch Initialization                         Attended {2,3}         R = {}                                       R...
PMatch Vertex selection heuristic                         progress(v)  Goodness(v) =                           cost(v)  ...
PMatch PMatch Algorithm                                         Initial vertex                                            ...
PMatch PMatch Algorithm                                                 Processing of                                     ...
PMatch PMatch Algorithm                                              Threshold                                            ...
PMatch                                                                             Processing PMatch Algorithm            ...
PMatch PMatch Algorithm                                                     Crossed     R = {Peter’s Bday|2,3}            ...
PMatch PMatch Algorithm                                         Found                                                     ...
PMatch Query Splitting                                       Selecting „Peter“ first                                      ...
PMatch              OutlineMotivationPMatch Query DefinitionPMatch Query AnsweringExperimentsRelated Work  Conclusion
PMatch Experimental Setup I      Implementation in Java (approx 5500        loc) on top of Neo4j      Datasets       -  ...
PMatch Experimental Setup II       Compared         -  PMatch-1: box weight         -  PMatch-2: normalized box weight   ...
PMatch                   Friendship network (small queries)                  350                  300                  250...
PMatch                   Friendship network (large queries)                   25000                   20000Query Time (ms)...
PMatch                   Delicious network                                      (small queries)                  7000     ...
PMatch                   Delicious network                                 (large queries)                   40000        ...
PMatch              OutlineMotivationPMatch Query DefinitionPMatch Query AnsweringExperimentsRelated Work  Conclusion
PMatch Related Work I  Exact Subgraph Matching     -  Many systems: RDF-3X, OWLIM, YARS (2),         Sesame, Jena, DOGMA,...
PMatch Related Work II  Approximate Subgraph Matching    -  SAGA, TALE, Grafil, etc    -  Database containing many small ...
PMatch Conclusion  Query uncertainty often occurs in    social network search, information    retrieval and pattern match...
PMatchdogma.umiacs.umd.edu
?             PMatchQuestions?Comments?
PMatch Related Work III  Data Uncertainty     -  Specified at the tuple or edge level     -  How to answer certain (SQL) ...
PMatch Related Work III  Data Uncertainty     -  How to encode dependencies?     -  Need to make strong independence     ...
PMatch Query Uncertainty Motivation  User Uncertainty     -  User does not know what she is looking for exactly  Lack of...
Upcoming SlideShare
Loading in …5
×

PMatch: Probabilistic Subgraph Matching on Huge Social Networks

4,118 views

Published on

Users querying massive social networks or RDF databases are often not 100% certain about what they are looking for due to the complexity of the query or heterogeneity of the data. In this paper, we propose “probabilistic subgraph” (PS) queries over a graph/network database, which afford users great flexibility in specifying “approximately” what they are looking for. We formally define the probability that a substitution satisfies a PS-query with respect to a graph database. We then present the PMATCH algorithm to answer such queries and prove its correctness. Our experimental evaluation demonstrates that PMATCH is efficient and scales to massive social networks with over a billion edges.

  • Be the first to comment

PMatch: Probabilistic Subgraph Matching on Huge Social Networks

  1. 1. © Adam Perer PMatchPMatch: Probabilistic Subgraph Matching on Huge Social Networks Matthias Bröcheler, Andrea Pugliese & V.S. Subrahmanian
  2. 2. PMatch Star type friend friend likes Sci-Fi Bob Mark Wars IV friend likes attended The friend friend likes Titanic Godfather likes Halloween attended Pizza John 2008 Peter Feast type likes likes attended organized friend organized organized attended Francis Jenny Drama attended Peter‘s Bday likes friend party attended organized attended Ashley likes friend attended Pulp Sylvester attended type Homecoming friend Fiction 2009 09 organized type attended organized attended Fundraiser likes for School Gone with Thriller Bob the wind Jessie attended Alice Chill-out Night likes Goodbye friend attended likes type Mrs. Doubtfire organized attended organized Inception type Melissa friend Spring Break Trip Alice attended likes likes Comedy likes friend organized likes Harry Emily Potter likes The Lion type Kinglikes type Jon likes type Mystery type Toy Story Familylikes
  3. 3. PMatchExample Query attended ?p Francis friend organized attended Peter ?u ?f likes likes ?b Drama type Simple query, yet complex structure and difficult to answer by hand3
  4. 4. PMatchQuery Uncertainty attended ?p Francis friend organized attended Peter ?u ?f likes likes ?b Drama type Was it really a drama movie?4
  5. 5. PMatchProbabilistic Matching Query 2 attended Certain organized ?p attended Francis friend ∞ Peter ?u ?f likes likes 3 ?b Drama 6 type Uncertain Capture uncertainty in weighted query boxes5
  6. 6. PMatchIntelligence and Security Very bad affiliated ?person1 in Africa guy transfer affiliated ?bank contacted transfer ?person1 ?target posted to posted to traveled ?person3 ?city labeledRadical lives in labeled ?forum Middle located in East Radical6
  7. 7. PMatchIntelligence and Security Very bad affiliated ?person1 3 guy transfer in Africa affiliated ∞ transfer ?bank 1 contacted ?person1 ?target2 posted to traveled posted to 1 2 ?person3 ?city labeledRadical lives in labeled ?forum Middle located in East Radical7
  8. 8. PMatchPMatch Applications Intelligence -  Information retrieval on open source data Search -  Facilitate productive use of social network data Security -  Finding suspicious patterns Social Network Analysis -  Querying as a primitive operation for data retrieval8
  9. 9. PMatch OutlineMotivationPMatch Query DefinitionPMatch Query AnsweringExperimentsRelated Work & Conclusion
  10. 10. PMatch PMatch Query 2 ?p attended Francis friend ∞ organized attended  Set of Boxes Peter ?u likes ?f likes 3 -  Each contains edges ?b type Drama 6 -  Has a weight •  Infinity  certainty or “core” of query  Naturally extends exact subgraph matching queries10
  11. 11. PMatch PMatch Query 2 ?p attended Francis friend ∞ organized attended  Matching boxes Peter ?u likes ?f likes 3 -  Let bi = 1 if box Bi is ?b type Drama 6 matched by a substitution else 0 for i=1 to n (= # of boxes) -  Weights wi, offset w0  Probability of match (user defined) =0 if b0 = 0, core must be matched = 1 n 1 + e−(w0 + i=1 wi bi )11
  12. 12. PMatch PMatch Query Answer  All substitutions with probability 2 ?p attended Francis ∞ friend organized attended above user defined Peter ?u ?f likes threshold τ 3 likes ?b Drama 6 type  Need to eliminate redundant answers -  Substitutions that can be extended to higher probability12
  13. 13. PMatch PMatch Answer  1/(1+e-(-5+6)) ≈ 73% Francis -  w0=-5 friend 2 ?p attended Francis ∞ John organized attended friend likes Peter ?u ?f likes The 3 likes Godfather type Drama ?b Drama 6 type13
  14. 14. PMatch OutlineMotivationPMatch Query DefinitionPMatch Query AnsweringExperimentsRelated Work Conclusion
  15. 15. PMatch PMatch Query Answering  Try to find all answers at once by exploiting shared substructures and efficiently maintain data structures.  Depth first-search algorithm -  Terminates when search space is exhausted or answer probability falls below threshold.  Provably correct15
  16. 16. PMatch PMatch Initialization Attended {2,3} ?p Francis Friend {0} organi zed {2} Attended {2,3} Peter ?u ?f Likes {0} Likes {2,3} type ?b {6} Drama Rewrite query (0=certain)16
  17. 17. PMatch PMatch Data Structures Maintained per vertex:  R: Set of substitution candidates -  Pairs of the form x,Hx, where x is a potential substitution and Hx the set of box ids it matches. Hx maintained as bit vector  S: Set of box ids for which R has been initialized17
  18. 18. PMatch PMatch Initialization Attended {2,3} R = {} R = {francis|0} S = {} ?p Francis S = {0} Friend {0} organi zed {2} Attended {2,3} R = {} Peter ?u S = {} ?f Likes {0} R = {} R = {peter|2} Likes {2,3} S = {} S = {2} type ?b {6} Drama R = {} R = {drama|6} τ = 0.75 S = {} S = {6} θ = {}18
  19. 19. PMatch Vertex selection heuristic progress(v)  Goodness(v) = cost(v)  Progress(v) = cumulative box weight of edges to be processed -  Normalize weight by box cardinality  Cost(v) = use standard cost models from exact subgraph matching -  Cardinality, estimated selectivity19
  20. 20. PMatch PMatch Algorithm Initial vertex selection Attended {2,3} R = {} R = {francis|0} S = {} ?p Francis S = {0} Friend {0} organi zed {2} Attended {2,3} R = {} Peter ?u S = {} ?f Likes {0} R = {} R = {peter|2} Likes {2,3} S = {} S = {2} type ?b {6} Drama R = {} R = {drama|6} τ = 0.75 S = {} S = {6} θ = {}20
  21. 21. PMatch PMatch Algorithm Processing of „Francis“ vertex R = {Peter’s Bday|2,3, Silvester 09|2,3} S = {2,3} ?p Francis organi zed {2} Attended {2,3} R = {} Peter ?u S = {} ?f Likes {0} R = {John|0,Mark|0, R = {peter|2} Likes {2,3} Melissa|0} S = {2} S = {0} type ?b {6} Drama R = {} R = {drama|6} τ = 0.75 S = {} S = {6} θ = {}21
  22. 22. PMatch PMatch Algorithm Threshold Pruning R = {Peter’s Bday|2,3, Silvester 09|2,3} S = {2,3} ?p Francis organi zed {2} Attended {2,3} R = {} Peter ?u S = {} ?f R = {peter|2} Likes {2,3} S = {2} type ?b {6} Drama R = {Titanic,0,Star Wars,0} R = {drama|6} τ = 0.75 S = {0} S = {6} θ = {?f/Mark}22
  23. 23. PMatch Processing PMatch Algorithm vertex „?b“. R = {Peter’s Bday|2,3, Maintaining Silvester 09|2,3} box indices S = {2,3} ?p Francis organi zed {2} Attended {2,3} R = {Jenny|2,3} Peter ?u S = {2,3} ?f R = {peter|2} S = {2} ?b Drama R = {} τ = 0.75 S = {} θ = {?f/Mark,?b/Titanic}23
  24. 24. PMatch PMatch Algorithm Crossed R = {Peter’s Bday|2,3} Threshold S = {2,3} ?p Francis organi zed {2} Peter ?u ?f R = {peter|2} S = {2} ?b Drama R = {} τ = 0.75 S = {} θ = {?f/Mark,?b/Titanic,?u/Jenny}24
  25. 25. PMatch PMatch Algorithm Found Result ?p Francis Peter ?u ?f R = {peter|2} S = {2} ?b Drama R = {} τ = 0.75 S = {} θ = {?f/Mark,?b/Titanic,?u/Jenny,?p/Peter’s Bday} 0.99725
  26. 26. PMatch Query Splitting Selecting „Peter“ first causes query to be split. R = {Peter’s Bday|2, Attended {2,3} Silvester 09|2} R = {francis|0} S = {2} ?p Francis S = {0} Friend {0} Attended {2,3} R = {} Peter ?u S = {} ?f Likes {0} R = {} Likes {2,3} S = {} type ?b {6} Drama R = {} R = {drama|6} τ = 0.75 S = {} S = {6} θ = {}26
  27. 27. PMatch OutlineMotivationPMatch Query DefinitionPMatch Query AnsweringExperimentsRelated Work Conclusion
  28. 28. PMatch Experimental Setup I  Implementation in Java (approx 5500 loc) on top of Neo4j  Datasets -  Friendship Network (Flickr, Orkut, Livejournal, Youtube): 778 million edges -  Delicious Network: 1.1 billion edges  Query Benchmark for each dataset: 9 PMatch queries each of increasing complexity28
  29. 29. PMatch Experimental Setup II   Compared -  PMatch-1: box weight -  PMatch-2: normalized box weight -  DOGMA: Exact matching for all combination -  Neo4j: Exact matching for all combinations •  Too slow – not shown in the charts   System: AMD Opteron, 15K RPM 300 GB SAS HD, 2GB of heap space29
  30. 30. PMatch Friendship network (small queries) 350 300 250Query Time (ms) 200 150 100 50 0 Query SA: 6B,23E Query SB: 4B,6E Query SC: 5B,12E Query SD: 4B,6E Query SE: 5B,12E Query SF: 6B,23E Queries PMatch-1 PMatch-2 DOGMA For each query, we report query times with cold (right) and warm (left) caches. 30
  31. 31. PMatch Friendship network (large queries) 25000 20000Query Time (ms) 15000 10000 5000 0 Query SG: 4B,6E Query SH: 5B,12E Query SI: 6B,23E Queries PMatch-1 PMatch-2 DOGMA For each query, we report query times with cold (right) and warm (left) caches. 31
  32. 32. PMatch Delicious network (small queries) 7000 6000 5000Query Time (ms) 4000 3000 2000 1000 0 Query TA: 5B,13E Query TB: 5B,17E Query TC: 4B,6E Query TD: 5B,12E Query TE:5B,12E Queries PMatch-1 PMatch-2 DOGMA For each query, we report query times with cold (right) and warm (left) caches. 32
  33. 33. PMatch Delicious network (large queries) 40000 35000 30000Query Time (ms) 25000 20000 15000 10000 5000 0 Query TF:4B,6E Query TG: 5B,18E Query TH: 5B,18E Query TI: 4B,7E Queries PMatch-1 PMatch-2 DOGMA For each query, we report query times with cold (right) and warm (left) caches. 33
  34. 34. PMatch OutlineMotivationPMatch Query DefinitionPMatch Query AnsweringExperimentsRelated Work Conclusion
  35. 35. PMatch Related Work I  Exact Subgraph Matching -  Many systems: RDF-3X, OWLIM, YARS (2), Sesame, Jena, DOGMA, COSI etc. -  Efficient on-disk index structures -  Query answering algorithms focused on exact queries (e.g. SPARQL).  No treatment of query uncertainty35
  36. 36. PMatch Related Work II  Approximate Subgraph Matching -  SAGA, TALE, Grafil, etc -  Database containing many small graphs -  Do not scale to large social networks -  Query uncertainty is defined in the system.  PMatch provides an expressive language to specify probabilistic queries and answers them efficiently.36
  37. 37. PMatch Conclusion  Query uncertainty often occurs in social network search, information retrieval and pattern matching.  PMatch provides a language to concisely express and algorithms to efficiently answer probabilistic subgraph matching queries.37
  38. 38. PMatchdogma.umiacs.umd.edu
  39. 39. ? PMatchQuestions?Comments?
  40. 40. PMatch Related Work III  Data Uncertainty -  Specified at the tuple or edge level -  How to answer certain (SQL) queries over uncertain data and what are the answer probabilities? First Name Last Name Income Probability John Smith 80,000 0.6 Alice Baker 85,000 0.5 Jennifer Temper 90,000 0.940
  41. 41. PMatch Related Work III  Data Uncertainty -  How to encode dependencies? -  Need to make strong independence assumptions to be efficient -  Where are the probabilities coming form?  Instead, we assume certain data and model uncertainty in the user specified query.41
  42. 42. PMatch Query Uncertainty Motivation  User Uncertainty -  User does not know what she is looking for exactly  Lack of schema -  Lack of schema complicates query design  Data heterogeneity -  Can be caused by data integration  Noisy or Missing Data -  It’s real world data42

×