Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

On Sampling from Massive Graph Streams

447 views

Published on

Slides from my talk at VLDB 2017

Published in: Data & Analytics
  • Be the first to comment

On Sampling from Massive Graph Streams

  1. 1. Joint  work  with: Nick  Duffield  -­‐ Texas  A&M  University Ted  Willke – Intel  Labs Ryan  Rossi  – PARC  research VLDB’17,  Germany August  31st,  2017 Nesreen  K.  Ahmed Research  Scientist,  Intel  Labs
  2. 2. -­‐ -­‐ -­‐ -­‐ -­‐ Social  network   Human  Disease  Network   [Barabasi 2007] Food  Web  [2007] Terrorist  Network [Krebs  2002]Internet  (AS)  [2005] Gene  Regulatory  Network   [Decourty 2008] Protein  Interactions   [breast  cancer] Political  blogs Power  grid
  3. 3. Social  Network Internet  (AS) BiologicalPolitical  Blogs Graph Mining
  4. 4. Studying  and  analyzing  complex  networks is  a  challenging and  computationally  intensive task Studying  and  analyzing  complex  networks is  a  challenging and  computationally  intensive task Ø Today’s  networks  are  dynamic/streaming  over  time -­‐ e.g.,  Twitter  streams,  email  communications   Ø Today’s  networks  are  massive  in  size   -­‐ e.g.,  online  social  networks  have  billions  of  users Ø Today’s  networks  are  dynamic/streaming  over  time -­‐ e.g.,  Twitter  streams,  email  communications   Ø Today’s  networks  are  massive  in  size   -­‐ e.g.,  online  social  networks  have  billions  of  users
  5. 5. Studying  and  analyzing  complex  networks is  a  challenging and  computationally  intensive task Studying  and  analyzing  complex  networks is  a  challenging and  computationally  intensive task Due  to  these  challenges,  we  usually  need  to  sampleDue  to  these  challenges,  we  usually  need  to  sample Statistical   Sampling Graph  G Sample  S e.g. Uniform Random Sampling Ø Today’s  networks  are  dynamic/streaming  over  time -­‐ e.g.,  Twitter  streams,  email  communications   Ø Today’s  networks  are  massive  in  size   -­‐ e.g.,  online  social  networks  have  billions  of  users Ø Today’s  networks  are  dynamic/streaming  over  time -­‐ e.g.,  Twitter  streams,  email  communications   Ø Today’s  networks  are  massive  in  size   -­‐ e.g.,  online  social  networks  have  billions  of  users
  6. 6. Given a large graph G represented as a stream of edges e1, e2, e3… We show how to efficiently sample from G while limiting memory space to calculate unbiased estimates of various graph properties Given a large graph G represented as a stream of edges e1, e2, e3… We show how to efficiently sample from G while limiting memory space to calculate unbiased estimates of various graph properties
  7. 7. Given a large graph G represented as a stream of edges e1, e2, e3… We show how to efficiently sample from G while limiting memory space to calculate unbiased estimates of various graph properties Given a large graph G represented as a stream of edges e1, e2, e3… We show how to efficiently sample from G while limiting memory space to calculate unbiased estimates of various graph properties
  8. 8. Given a large graph G represented as a stream of edges e1, e2, e3… We show how to efficiently sample from G while limiting memory space to calculate unbiased estimates of various graph properties Given a large graph G represented as a stream of edges e1, e2, e3… We show how to efficiently sample from G while limiting memory space to calculate unbiased estimates of various graph properties No.  TrianglesNo.  Wedges Frequent  connected  subsets  of  edges
  9. 9. Given a large graph G represented as a stream of edges e1, e2, e3… We show how to efficiently sample from G while limiting memory space to calculate unbiased estimates of various graph properties Given a large graph G represented as a stream of edges e1, e2, e3… We show how to efficiently sample from G while limiting memory space to calculate unbiased estimates of various graph properties Frequent  connected  subsets  of  edges Transitivity No.  TrianglesNo.  Wedges
  10. 10. § Random  Sampling • Uniform  random  sampling – [Tsourakakis et.  al  KDD’09]     — Graph  Sparsification with  probability  p   — Chance  of  sampling  a  subgraph (e.g.,  triangle)  is  very  low — Estimates  suffer  from  high  variance   • Wedge  Sampling – [Seshadhri et.  al  SDM’13]   — Sample  vertices,  then  sample  pairs  of  incident  edges  (wedges) — Output  the  estimate  of  the  closed  wedges  (triangles)  
  11. 11. § Random  Sampling • Uniform  random  sampling – [Tsourakakis et.  al  KDD’09]     — Graph  Sparsification with  probability  p   — Chance  of  sampling  a  subgraph (e.g.,  triangle)  is  very  low — Estimates  suffer  from  high  variance   • Wedge  Sampling – [Seshadhri et.  al  SDM’13]   — Sample  vertices,  then  sample  pairs  of  incident  edges  (wedges) — Output  the  estimate  of  the  closed  wedges  (triangles)   Assume  we’ve  access  to  the  full  graph Not  a  good  fit  for  massive  streaming  graphs
  12. 12. § Assume  specific  order  of  the  stream  – [Buriol et.  al  2006]   • Incidence  stream  model– vertex  neighbors  arrive  together  in  the  stream § Use  multiple  passes  over  the  stream  – [Becchetti et.  al  KDD’08] § Single-­‐pass  algorithms  for  arbitrary-­‐ordered  graph  streams
  13. 13. § Single-­‐pass  algorithms  for  arbitrary-­‐ordered  graph  streams • Streaming-­‐Triangles  – [Jha et.  al  KDD’13] — Sample  edges  using  reservoir  sampling,  then  sample  pairs  of  incident   edges  (wedges),  and  finally  scan  for  closed  wedges  (triangles) • Neighborhood  Sampling  – [Pavan et.  al  VLDB’13] — Sampling  vectors  of  wedge  estimators,  scan  the  stream  for  closed  wedges   (triangles) • TRIEST– [De  Stefani  et.  al  KDD’16] — Uses  standard  reservoir  sampling  to  maintain  the  edge  sample • MASCOT– [Lim  et.  al  KDD’15] — Independent  edge  sampling  with  probability  p • Graph  Sample  &  Hold– [Ahmed  et.  al  KDD’14] — Conditionally  independent  edge  sampling
  14. 14. Summary  of  previous  work Sampling  designs  for  specific  graph  properties  (triangles)   Not  generally  applicable  to  other  properties Uniform-­‐based  Sampling Obtain  variable-­‐size  sample   We  propose  a  generic  unbiased  sampling  framework:  Graph  Priority  Sampling • Weight-­‐sensitive • Fixed-­‐size  sample • Single-­‐pass • Applicable  for  general  graph  properties • Use  topological  information  that  we  wish  to  estimate  as  auxiliary  variables • Variance-­‐optimal  sampling  (cost  optimization  approach)
  15. 15. Input Graph Priority Sampling Framework GPS(m) Output Edge stream k1, k2, ..., k, ... Sampled Edge stream ˆK Stored State m = O(| ˆK|)
  16. 16. Input Graph Priority Sampling Framework GPS(m) Output For each edge k Generate a random number u(k) ⇠ Uni(0, 1] Edge stream k1, k2, ..., k, ... Sampled Edge stream ˆK Stored State m = O(| ˆK|) Compute edge weight w(k) = W(k, ˆK) Compute edge priority r(k) = w(k)/u(k) ˆK = ˆK [ {k}
  17. 17. Input Graph Priority Sampling Framework GPS(m) Output For each edge k Edge stream k1, k2, ..., k, ... Sampled Edge stream ˆK Stored State m = O(| ˆK|) Find edge with lowest priority k⇤ = arg mink02 ˆK r(k0 ) Update sample threshold z⇤ = max{z⇤ , r(k⇤ )} Remove lowest priority edge ˆK = ˆK{k⇤ } Use a priority queue with O(log m) updates
  18. 18. § We  use  edge  weights  to  express  the  role  of  the  arriving   edge  in  the  sampled  graph • e.g.,  no.  subgraphs completed  by  the  arriving  edge,  and/or  other   auxiliary  variables § Computational  feasibility   • Efficient  implementation  by  using  a  priority  queue   • Implemented  as  a  Min-­‐heap  with  O(log  m)  insertion/deletion • O(1)  access  to  the  edge  with  minimum  priority     w(k) = W(k, ˆK)
  19. 19. For each edge i, we construct a sequence of edge estimators ˆSi,t We achieve unbiasedness by establishing that the sequence is a Martingale (Theorem 1) E[ ˆSi,t] = Si,t ˆSi,t = I(i 2 ˆKt)/min{1, wi/z⇤ } where ˆSi,t are unbiased estimators of the corresponding edge ˆKt is the sample at time t Edge Estimation
  20. 20. For each subgraph J ⇢ [t], we define the sequence of subgraph estimators as ˆSJ,t = Q i2J ˆSi,t E[ ˆSJ,t] = SJ,t We prove the sequence is a Martingale (Theorem 2) Subgraph Estimation
  21. 21. Subgraph Counting For any set J of subgraphs of G, ˆNt(J ) = P J2J :J⇢Kt ˆSJ,t is an unbiased estimator of Nt(J ) = |Jt| (Theorem 2)
  22. 22. § We  provide  a  cost  minimization  approach   • inspired  by  IPPS  sampling  [Duffield  et.  al  2005]     § By  minimizing  the  conditional  variance  of  the  increment   incurred  by  the  arriving  edge  in How the ranks ri,t should be distributed in order to minimize the variance of the unbiased estimator of Nt(J )? Nt(J )
  23. 23. § Post-­‐stream  Estimation • enables  retrospective  subgraph queries • after  any  number  t of  edge  arrivals  have  taken  place,  we  can   compute  an  unbiased  estimator  for  any  subgraph § In-­‐stream  Estimation • we  can  take  “snapshots”  of  estimates  of  specific  sampled  subgraphs at  arbitrary  times  during  the  stream • Still  Unbiased! • Lightweight  online/incremental  update  of  unbiased  estimates  of   subgraph counts • Same  sampling  procedure • Using  stopped  Martingale
  24. 24. Input Graph Priority Sampling Framework GPS(m) Output For each edge k Edge stream k1, k2, ..., k, ... Sampled Edge stream ˆK Stored State m = O(| ˆK|) Compute edge priority r(k) = w(k)/u(k) Update the sample Update unbiased estimates of subgraph counts
  25. 25. In-stream Estimation We define a snapshot as an edge subset J, with a family of stopping times T such that T = {Tj : j 2 J} We prove the sequence is a stopped Martingale (Theorem 4) ˆST J,t = Q j2J ˆS Tj j,t = Q j2J ˆSj,min{Tj ,t} E[ ˆST J,t] = SJ,t
  26. 26. § We  use  GPS  for  the  estimation  of   • Triangle  counts • Wedge  counts • Global  clustering  coefficient • And  their  unbiased  variance    (Theorem  3  in  the  paper) • Weight  function • Used    a  large  set  of  graphs  from  a  variety  of  domains  (social,  we,   tech,  etc)    -­‐ data  is  available  on  http://networkrepository.com/ — Up  to  49B  edges W(k, ˆK) = 9 ⇤ ˆ4(k) + 1 where ˆ4(k) is the number of triangles completed by edge k and whose edges in ˆK
  27. 27. -­ GPS  accurately  estimates  various  properties  simultaneously -­ Consistent  performance  across  graphs  from  various  domains -­ A  key  advantage  for  GPS  in-­stream  has  smaller  variance  and  tight  confidence  bounds
  28. 28. Results  for  triangle  counts   Using  massive  real-­world  and  synthetic  graphs  of  up  to  49B  edges GPS  is  shown  to  be  accurate  with  <0.01  error   Sample  size  =  1M  edges,  in-­stream  estimation 95%  confidence  intervals
  29. 29. 10 4 10 5 0.6 0.7 0.8 0.9 1 1.1 1.2 1.3 1.4 soc−twitter−2010 Sample Size |K| x/x 10 4 10 5 0.6 0.7 0.8 0.9 1 1.1 1.2 1.3 1.4 soc−twitter−2010 Sample Size |K| x/x Global  Clustering  Coeff 10 4 10 5 0.6 0.7 0.8 0.9 1 1.1 1.2 1.3 1.4 soc−twitter−2010 Sample Size |K| x/x 10 4 10 5 0.6 0.7 0.8 0.9 1 1.1 1.2 1.3 1.4 soc−twitter−2010 Sample Size |K| x/x Triangle  Count 10 4 10 5 0.9 0.92 0.94 0.96 0.98 1 1.02 1.04 1.06 1.08 1.1 soc−twitter−2010 Sample Size |K| x/x 10 4 10 5 0.9 0.92 0.94 0.96 0.98 1 1.02 1.04 1.06 1.08 1.1 soc−twitter−2010 Sample Size |K| x/x Wedge  Count Actual Estimated/Actual Confidence  Upper  &  Lower  Bounds   Sample  Size  =  40K  edges Accurate  estimates  for  large  Twitter  graph  ~  265M  edges,  and  17.2B  triangles 95%  confidence  intervals
  30. 30. Global  Clustering  CoeffTriangle  Count Wedge  Count Actual Estimated/Actual Confidence  Upper  &  Lower  Bounds   Sample  Size  =  40K  edges Accurate  estimates  for  large  social  network  Orkut  ~  120M  edges,  and  630M  triangles 95%  confidence  intervals 10 4 10 5 10 6 0.9 0.92 0.94 0.96 0.98 1 1.02 1.04 1.06 1.08 1.1 soc−orkut Sample Size |K| x/x 10 4 10 5 10 6 0.9 0.92 0.94 0.96 0.98 1 1.02 1.04 1.06 1.08 1.1 soc−orkut Sample Size |K| x/x 10 4 10 5 10 6 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 soc−orkut Sample Size |K| x/x 10 4 10 5 10 6 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 soc−orkut Sample Size |K| x/x 10 4 10 5 10 6 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 soc−orkut Sample Size |K| x/x 10 4 10 5 10 6 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 soc−orkut Sample Size |K| x/x
  31. 31. 0 2 4 6 8 10 12 x 10 7 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 5.5 x 10 8 Stream Size at time t (|Kt|) Trianglesattimet(xt) soc−orkut Actual Estimate Upper Bound Lower Bound 0 2 4 6 8 10 12 x 10 7 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 5.5 x 10 8 Stream Size at time t (|Kt|) Trianglesattimet(xt) soc−orkut 0 2 4 6 8 10 12 x 10 7 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 Stream Size at time t (|Kt|) ClusteringCoeff.attimet(xt) soc−orkut Actual Estimate Upper Bound Lower Bound 0 2 4 6 8 10 12 x 10 7 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 Stream Size at time t (|Kt|) ClusteringCoeff.attimet(xt) soc−orkut GPS  in-­stream  estimates  over  time Sample  size  =  80K  edges 95%  confidence  intervals
  32. 32. 0.994 0.996 0.998 1 1.002 1.004 1.006 0.994 0.996 0.998 1 1.002 1.004 1.006 ca-hollywood-2009 com-amazon higgs-social-network soc-flickr soc-youtube-snap socfb-Indiana69 socfb-Penn94 socfb-Texas84 socfb-UF21 tech-as-skitter web-BerkStan web-google GPS  In-­stream  Estimation,  sample  size  100K  edges GPS  accurately  estimates  both  triangle  and  wedge  counts   simultaneously  with  a  single  sample
  33. 33. We  observe  accurate  results  with  no  significant  difference  in  error  between   the  ordering  schemes
  34. 34. § We  used  three  schemes  for  weighting  edges  during  sampling § Goal:  estimate  triangle  counts  for  Friendster  social  network   with  sample  size=1M  (0.1%  of  the  graph) 1. triangle-­‐based  weights  (3%  relative  error) 2. wedge-­‐based  weights  (25%  relative  error) 3. uniform  weights  for  all  incoming  edges  (43%  relative  error) -­‐ this  is  equivalent  to  simple  random  sampling The  estimator  variance  was  3.8x  higher  using  wedge-­based weights,  and   6.2x  higher  using  uniform  weights  compared  to  triangle-­based  weights.
  35. 35. § A  sample  is  representative if  graph  properties  of  interest  can  be   estimated  with  a  known  degree  of  accuracy   § We  proposed  a  generic  framework  Graph  Priority  Sampling  (GPS) -­‐ GPS  is  an  efficient single-­‐pass  streaming  framework -­‐ GPS  selects  a  representative sample  and  computes  unbiased estimates  of   counts  of  connected  subsets  of  edges  (e.g.,  triangles,  wedges  …)     -­‐ Theoretical  properties  of  GPS  are  supported  by  empirical  analysis       § GPS  admits  generalizations  by  allowing  the  dependence of  the   sampling  process  as  a  function  of  the  stored  state  and/or  auxiliary   variables § GPS  is  variance  minimizing  sampling  approach   § GPS  has  a  relative  estimation  error  <  1%
  36. 36. Thank  you! Questions?

×