• Share
  • Email
  • Embed
  • Like
  • Save
  • Private Content
PMatch: Probabilistic Subgraph Matching on Huge Social Networks
 

PMatch: Probabilistic Subgraph Matching on Huge Social Networks

on

  • 3,192 views

Users querying massive social networks or RDF databases are often not 100% certain about what they are looking for due to the complexity of the query or heterogeneity of the data. In this paper, we ...

Users querying massive social networks or RDF databases are often not 100% certain about what they are looking for due to the complexity of the query or heterogeneity of the data. In this paper, we propose “probabilistic subgraph” (PS) queries over a graph/network database, which afford users great flexibility in specifying “approximately” what they are looking for. We formally define the probability that a substitution satisfies a PS-query with respect to a graph database. We then present the PMATCH algorithm to answer such queries and prove its correctness. Our experimental evaluation demonstrates that PMATCH is efficient and scales to massive social networks with over a billion edges.

Statistics

Views

Total Views
3,192
Views on SlideShare
1,170
Embed Views
2,022

Actions

Likes
1
Downloads
6
Comments
0

5 Embeds 2,022

http://www.knowledgefrominformation.com 2011
http://translate.googleusercontent.com 4
http://www.knowledgefrominformation.de 3
http://webcache.googleusercontent.com 2
http://knowledgefrominformation.de 2

Accessibility

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    PMatch: Probabilistic Subgraph Matching on Huge Social Networks PMatch: Probabilistic Subgraph Matching on Huge Social Networks Presentation Transcript

    • © Adam Perer PMatchPMatch: Probabilistic Subgraph Matching on Huge Social Networks Matthias Bröcheler, Andrea Pugliese & V.S. Subrahmanian
    • PMatch Star type friend friend likes Sci-Fi Bob Mark Wars IV friend likes attended The friend friend likes Titanic Godfather likes Halloween attended Pizza John 2008 Peter Feast type likes likes attended organized friend organized organized attended Francis Jenny Drama attended Peter‘s Bday likes friend party attended organized attended Ashley likes friend attended Pulp Sylvester attended type Homecoming friend Fiction 2009 09 organized type attended organized attended Fundraiser likes for School Gone with Thriller Bob the wind Jessie attended Alice Chill-out Night likes Goodbye friend attended likes type Mrs. Doubtfire organized attended organized Inception type Melissa friend Spring Break Trip Alice attended likes likes Comedy likes friend organized likes Harry Emily Potter likes The Lion type Kinglikes type Jon likes type Mystery type Toy Story Familylikes
    • PMatchExample Query attended ?p Francis friend organized attended Peter ?u ?f likes likes ?b Drama type Simple query, yet complex structure and difficult to answer by hand3
    • PMatchQuery Uncertainty attended ?p Francis friend organized attended Peter ?u ?f likes likes ?b Drama type Was it really a drama movie?4
    • PMatchProbabilistic Matching Query 2 attended Certain organized ?p attended Francis friend ∞ Peter ?u ?f likes likes 3 ?b Drama 6 type Uncertain Capture uncertainty in weighted query boxes5
    • PMatchIntelligence and Security Very bad affiliated ?person1 in Africa guy transfer affiliated ?bank contacted transfer ?person1 ?target posted to posted to traveled ?person3 ?city labeledRadical lives in labeled ?forum Middle located in East Radical6
    • PMatchIntelligence and Security Very bad affiliated ?person1 3 guy transfer in Africa affiliated ∞ transfer ?bank 1 contacted ?person1 ?target2 posted to traveled posted to 1 2 ?person3 ?city labeledRadical lives in labeled ?forum Middle located in East Radical7
    • PMatchPMatch Applications Intelligence -  Information retrieval on open source data Search -  Facilitate productive use of social network data Security -  Finding suspicious patterns Social Network Analysis -  Querying as a primitive operation for data retrieval8
    • PMatch OutlineMotivationPMatch Query DefinitionPMatch Query AnsweringExperimentsRelated Work & Conclusion
    • PMatch PMatch Query 2 ?p attended Francis friend ∞ organized attended  Set of Boxes Peter ?u likes ?f likes 3 -  Each contains edges ?b type Drama 6 -  Has a weight •  Infinity  certainty or “core” of query  Naturally extends exact subgraph matching queries10
    • PMatch PMatch Query 2 ?p attended Francis friend ∞ organized attended  Matching boxes Peter ?u likes ?f likes 3 -  Let bi = 1 if box Bi is ?b type Drama 6 matched by a substitution else 0 for i=1 to n (= # of boxes) -  Weights wi, offset w0  Probability of match (user defined) =0 if b0 = 0, core must be matched = 1 ￿n 1 + e−(w0 + i=1 wi bi )11
    • PMatch PMatch Query Answer  All substitutions with probability 2 ?p attended Francis ∞ friend organized attended above user defined Peter ?u ?f likes threshold τ 3 likes ?b Drama 6 type  Need to eliminate redundant answers -  Substitutions that can be extended to higher probability12
    • PMatch PMatch Answer  1/(1+e-(-5+6)) ≈ 73% Francis -  w0=-5 friend 2 ?p attended Francis ∞ John organized attended friend likes Peter ?u ?f likes The 3 likes Godfather type Drama ?b Drama 6 type13
    • PMatch OutlineMotivationPMatch Query DefinitionPMatch Query AnsweringExperimentsRelated Work & Conclusion
    • PMatch PMatch Query Answering  Try to find all answers at once by exploiting shared substructures and efficiently maintain data structures.  Depth first-search algorithm -  Terminates when search space is exhausted or answer probability falls below threshold.  Provably correct15
    • PMatch PMatch Initialization Attended {2,3} ?p Francis Friend {0} organi zed {2} Attended {2,3} Peter ?u ?f Likes {0} Likes {2,3} type ?b {6} Drama Rewrite query (0=certain)16
    • PMatch PMatch Data Structures Maintained per vertex:  R: Set of substitution candidates -  Pairs of the form <x,Hx>, where x is a potential substitution and Hx the set of box ids it matches. Hx maintained as bit vector  S: Set of box ids for which R has been initialized17
    • PMatch PMatch Initialization Attended {2,3} R = {} R = {<francis|0>} S = {} ?p Francis S = {0} Friend {0} organi zed {2} Attended {2,3} R = {} Peter ?u S = {} ?f Likes {0} R = {} R = {<peter|2>} Likes {2,3} S = {} S = {2} type ?b {6} Drama R = {} R = {<drama|6>} τ = 0.75 S = {} S = {6} θ = {}18
    • PMatch Vertex selection heuristic progress(v)  Goodness(v) = cost(v)  Progress(v) = cumulative box weight of edges to be processed -  Normalize weight by box cardinality  Cost(v) = use standard cost models from exact subgraph matching -  Cardinality, estimated selectivity19
    • PMatch PMatch Algorithm Initial vertex selection Attended {2,3} R = {} R = {<francis|0>} S = {} ?p Francis S = {0} Friend {0} organi zed {2} Attended {2,3} R = {} Peter ?u S = {} ?f Likes {0} R = {} R = {<peter|2>} Likes {2,3} S = {} S = {2} type ?b {6} Drama R = {} R = {<drama|6>} τ = 0.75 S = {} S = {6} θ = {}20
    • PMatch PMatch Algorithm Processing of „Francis“ vertex R = {<Peter’s Bday|2,3>, <Silvester 09|2,3>} S = {2,3} ?p Francis organi zed {2} Attended {2,3} R = {} Peter ?u S = {} ?f Likes {0} R = {<John|0>,<Mark|0>, R = {<peter|2>} Likes {2,3} <Melissa|0>} S = {2} S = {0} type ?b {6} Drama R = {} R = {<drama|6>} τ = 0.75 S = {} S = {6} θ = {}21
    • PMatch PMatch Algorithm Threshold Pruning R = {<Peter’s Bday|2,3>, <Silvester 09|2,3>} S = {2,3} ?p Francis organi zed {2} Attended {2,3} R = {} Peter ?u S = {} ?f R = {<peter|2>} Likes {2,3} S = {2} type ?b {6} Drama R = {<Titanic,0>,<Star Wars,0>} R = {<drama|6>} τ = 0.75 S = {0} S = {6} θ = {?f/Mark}22
    • PMatch Processing PMatch Algorithm vertex „?b“. R = {<Peter’s Bday|2,3>, Maintaining <Silvester 09|2,3>} box indices S = {2,3} ?p Francis organi zed {2} Attended {2,3} R = {<Jenny|2,3>} Peter ?u S = {2,3} ?f R = {<peter|2>} S = {2} ?b Drama R = {} τ = 0.75 S = {} θ = {?f/Mark,?b/Titanic}23
    • PMatch PMatch Algorithm Crossed R = {<Peter’s Bday|2,3>} Threshold S = {2,3} ?p Francis organi zed {2} Peter ?u ?f R = {<peter|2>} S = {2} ?b Drama R = {} τ = 0.75 S = {} θ = {?f/Mark,?b/Titanic,?u/Jenny}24
    • PMatch PMatch Algorithm Found Result ?p Francis Peter ?u ?f R = {<peter|2>} S = {2} ?b Drama R = {} τ = 0.75 S = {} θ = {?f/Mark,?b/Titanic,?u/Jenny,?p/Peter’s Bday} 0.99725
    • PMatch Query Splitting Selecting „Peter“ first causes query to be split. R = {<Peter’s Bday|2>, Attended {2,3} <Silvester 09|2>} R = {<francis|0>} S = {2} ?p Francis S = {0} Friend {0} Attended {2,3} R = {} Peter ?u S = {} ?f Likes {0} R = {} Likes {2,3} S = {} type ?b {6} Drama R = {} R = {<drama|6>} τ = 0.75 S = {} S = {6} θ = {}26
    • PMatch OutlineMotivationPMatch Query DefinitionPMatch Query AnsweringExperimentsRelated Work & Conclusion
    • PMatch Experimental Setup I  Implementation in Java (approx 5500 loc) on top of Neo4j  Datasets -  Friendship Network (Flickr, Orkut, Livejournal, Youtube): 778 million edges -  Delicious Network: 1.1 billion edges  Query Benchmark for each dataset: 9 PMatch queries each of increasing complexity28
    • PMatch Experimental Setup II   Compared -  PMatch-1: box weight -  PMatch-2: normalized box weight -  DOGMA: Exact matching for all combination -  Neo4j: Exact matching for all combinations •  Too slow – not shown in the charts   System: AMD Opteron, 15K RPM 300 GB SAS HD, 2GB of heap space29
    • PMatch Friendship network (small queries) 350 300 250Query Time (ms) 200 150 100 50 0 Query SA: 6B,23E Query SB: 4B,6E Query SC: 5B,12E Query SD: 4B,6E Query SE: 5B,12E Query SF: 6B,23E Queries PMatch-1 PMatch-2 DOGMA For each query, we report query times with cold (right) and warm (left) caches. 30
    • PMatch Friendship network (large queries) 25000 20000Query Time (ms) 15000 10000 5000 0 Query SG: 4B,6E Query SH: 5B,12E Query SI: 6B,23E Queries PMatch-1 PMatch-2 DOGMA For each query, we report query times with cold (right) and warm (left) caches. 31
    • PMatch Delicious network (small queries) 7000 6000 5000Query Time (ms) 4000 3000 2000 1000 0 Query TA: 5B,13E Query TB: 5B,17E Query TC: 4B,6E Query TD: 5B,12E Query TE:5B,12E Queries PMatch-1 PMatch-2 DOGMA For each query, we report query times with cold (right) and warm (left) caches. 32
    • PMatch Delicious network (large queries) 40000 35000 30000Query Time (ms) 25000 20000 15000 10000 5000 0 Query TF:4B,6E Query TG: 5B,18E Query TH: 5B,18E Query TI: 4B,7E Queries PMatch-1 PMatch-2 DOGMA For each query, we report query times with cold (right) and warm (left) caches. 33
    • PMatch OutlineMotivationPMatch Query DefinitionPMatch Query AnsweringExperimentsRelated Work & Conclusion
    • PMatch Related Work I  Exact Subgraph Matching -  Many systems: RDF-3X, OWLIM, YARS (2), Sesame, Jena, DOGMA, COSI etc. -  Efficient on-disk index structures -  Query answering algorithms focused on exact queries (e.g. SPARQL).  No treatment of query uncertainty35
    • PMatch Related Work II  Approximate Subgraph Matching -  SAGA, TALE, Grafil, etc -  Database containing many small graphs -  Do not scale to large social networks -  Query uncertainty is defined in the system.  PMatch provides an expressive language to specify probabilistic queries and answers them efficiently.36
    • PMatch Conclusion  Query uncertainty often occurs in social network search, information retrieval and pattern matching.  PMatch provides a language to concisely express and algorithms to efficiently answer probabilistic subgraph matching queries.37
    • PMatchdogma.umiacs.umd.edu
    • ? PMatchQuestions?Comments?
    • PMatch Related Work III  Data Uncertainty -  Specified at the tuple or edge level -  How to answer certain (SQL) queries over uncertain data and what are the answer probabilities? First Name Last Name Income Probability John Smith 80,000 0.6 Alice Baker 85,000 0.5 Jennifer Temper 90,000 0.940
    • PMatch Related Work III  Data Uncertainty -  How to encode dependencies? -  Need to make strong independence assumptions to be efficient -  Where are the probabilities coming form?  Instead, we assume certain data and model uncertainty in the user specified query.41
    • PMatch Query Uncertainty Motivation  User Uncertainty -  User does not know what she is looking for exactly  Lack of schema -  Lack of schema complicates query design  Data heterogeneity -  Can be caused by data integration  Noisy or Missing Data -  It’s real world data42