SIAM Annual Meeting 2012: Streaming Graph Analytics for Massive Graphs

Emerging real-world graph problems include detecting community structure in large social networks, improving the resilience of the electric power grid, and detecting and preventing disease in human populations. The volume and richness of the data, combined with its rate of change, render monitoring properties at scale by static recomputation infeasible. We approach these problems with massive, fine-grained parallelism across different shared-memory architectures, both to compute solutions and to explore the sensitivity of these solutions to natural bias and omissions within the data.

1. Streaming Graph Analytics for Massive Graphs
   Jason Riedy, David A. Bader, David Ediger
   Georgia Institute of Technology
   10 July 2012
2. Outline
   Motivation
     Where graphs appear... (hint: everywhere)
     Data volumes and rates of change
     Why analyze data streams?
   Technical
     Overall streaming approach
     Data structure benchmark
     Clustering coefficients
     Connected components
     Common aspects and questions
   Session outline
3. Exascale Data Analysis
   Health care: finding outbreaks, population epidemiology
   Social networks: advertising, searching, grouping
   Intelligence: decisions at scale, regulating algorithms
   Systems biology: understanding interactions, drug design
   Power grid: disruptions, conservation
   Simulation: discrete events, cracking meshes
   The data is full of semantically rich relationships. Graphs! Graphs! Graphs!
4. Graphs are pervasive
   • Sources of massive data: petascale simulations, experimental devices, the Internet, scientific applications.
   • New challenges for analysis: data sizes, heterogeneity, uncertainty, data quality.

                    Astrophysics           Bioinformatics           Social Informatics
   Problem          Outlier detection      Identifying target       Emergent behavior,
                                           proteins                 information spread
   Challenges       Massive data sets,     Data heterogeneity,      New analysis, data
                    temporal variation     quality                  uncertainty
   Graph problems   Matching,              Centrality,              Clustering, flows,
                    clustering             clustering               shortest paths
5. These are not easy graphs.
   Yifan Hu's (AT&T) visualization of the LiveJournal data set.
6. But no shortage of structure...
   (Images: protein interactions, Giot et al., "A Protein Interaction Map of Drosophila melanogaster", Science 302, 1722-1736, 2003; Jason's network via LinkedIn Labs.)
   • Globally, there rarely are good, balanced separators in the scientific computing sense.
   • Locally, there are clusters or communities and many levels of detail.
7. Also no shortage of data...
   Existing (some out-of-date) data volumes:
   NYSE: 1.5 TB generated daily into a maintained 8 PB archive
   Google: "several dozen" 1 PB data sets (CACM, Jan 2010)
   LHC: 15 PB per year (avg. 21 TB daily), http://public.web.cern.ch/public/en/lhc/Computing-en.html
   Wal-Mart: 536 TB, 1B entries daily (2006)
   eBay: 2 PB traditional DB and 6.5 PB streaming; 17 trillion records, 1.5B records/day; each web click is 50-150 details. http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/
   Facebook: 845M users... and growing.
   • All data is rich and semantic (graphs!) and changing.
   • Base data rates include items and not relationships.
8. General approaches
   • High-performance static graph analysis
     • Develop techniques that apply to unchanging massive graphs.
     • Provides useful after-the-fact information, starting points.
     • Serves many existing applications well: market research, much bioinformatics, ...
   • High-performance streaming graph analysis
     • Focus on the dynamic changes within massive graphs.
     • Find trends or new information as they appear.
     • Serves upcoming applications: fault or threat detection, trend analysis, ...
   Both very important to different areas. Remaining focus is on streaming.
   Note: not CS-theory streaming, but analysis of streaming data.
9. Why analyze data streams?
   Data volumes:
     NYSE: 1.5 TB daily
     LHC: 41 TB daily
     Facebook: who knows?
   Data transfer:
     • 1 Gb Ethernet: 8.7 TB daily at 100%, 5-6 TB daily realistic
     • Multi-TB storage on 10GE: 300 TB daily read, 90 TB daily write
     • CPU ↔ memory (QPI, HT): 2 PB/day at 100%
   Data growth:
     • Facebook: > 2×/yr
     • Twitter: > 10×/yr
     • Growing sources: bioinformatics, µsensors, security
   Speed growth:
     • Ethernet/IB/etc.: 4× in the next 2 years. Maybe.
     • Flash storage, direct: 10× write, 4× read. Relatively huge cost.
10. Overall streaming approach
    (Images as on slide 6: protein interaction map; Jason's network via LinkedIn Labs.)
    Assumptions:
    • A graph represents some real-world phenomenon.
      • But not necessarily exactly!
      • Noise comes from lost updates, partial information, ...
11. Overall streaming approach
    (Images as on slide 6.)
    Assumptions (continued):
    • We target massive, "social network" graphs.
      • Small diameter, power-law degrees.
    • Small changes in massive graphs often are unrelated.
12. Overall streaming approach
    (Images as on slide 6.)
    Assumptions (continued):
    • The graph changes, but we don't need a continuous view.
      • We can accumulate changes into batches...
      • but not so many that it impedes responsiveness.
13. Difficulties for performance
    (Image: Jason's network via LinkedIn Labs.)
    • What partitioning methods apply?
      • Geometric? Nope.
      • Balanced? Nope.
    • Is there a single, useful decomposition? Not likely.
    • Some partitions exist, but they don't often help with balanced bisection or memory locality.
    • Performance needs new approaches, not just standard scientific computing methods.
14. STING's focus
    (Diagram with components: source data, STING, summary, prediction, simulation/query, visualization, control action.)
    • STING manages queries against changing graph data.
      • Visualization and control often are application specific.
    • Ideal: maintain many persistent graph analysis kernels.
      • Keep one current snapshot of the graph resident.
      • Let kernels maintain smaller histories.
      • Also (a harder goal), coordinate the kernels' cooperation.
    • Gather data into a typed graph structure, STINGER.
15. STINGER
    STING Extensible Representation:
    • Rule #1: No explicit locking.
      • Rely on atomic operations.
    • Massive graph: scattered updates and scattered reads rarely conflict.
    • Use time stamps for some view of time.
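As a concrete illustration of Rule #1, here is a minimal C/OpenMP sketch of lock-free counting with atomic fetch-and-add. It is illustrative only, assuming a toy degree array rather than the actual STINGER structures:

```c
/* Minimal sketch of "no explicit locking": concurrent edge insertions
 * bump per-vertex degree counters with atomic fetch-and-add instead of
 * taking a lock. NOT the STINGER source; the degree array is a toy.
 * Build: gcc -fopenmp -O2 atomic_sketch.c */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define NV 1024                 /* toy vertex count (an assumption) */
static int64_t degree[NV];      /* per-vertex degree counters */

static void count_insertion(int64_t u, int64_t v)
{
    /* Scattered updates rarely hit the same counter on a massive
     * graph, so these atomics see little contention. */
    __sync_fetch_and_add(&degree[u], 1);
    __sync_fetch_and_add(&degree[v], 1);
}

int main(void)
{
    #pragma omp parallel for
    for (int64_t k = 0; k < 100000; k++)
        count_insertion(k % NV, (31 * k + 7) % NV);

    printf("degree[0] = %" PRId64 "\n", degree[0]);
    return 0;
}
```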
16. Initial results
    Prototype STING and STINGER.
    Background benchmark for STINGER maintenance, plus monitoring the following properties:
    1. clustering coefficients, and
    2. connected components.
    High-level:
    • Support high rates of change, over 10k updates per second.
    • Performance scales somewhat with available processing.
    • Gut feeling: scales as much with sockets as cores.
    http://www.cc.gatech.edu/stinger/
17. Experimental setup
    Unless otherwise noted:

    Line       Model     Speed (GHz)   Sockets   Cores/socket
    Nehalem    X5570     2.93          2         4
    Westmere   E7-8870   2.40          4         10

    • Westmere loaned by Intel (thank you!)
    • All memory: 1067 MHz DDR3, installed appropriately
    • Implementations: OpenMP, gcc 4.6.1, Linux ≈ 3.0 kernel
    • Artificial graph and edge stream generated by R-MAT [Chakrabarti, Zhan, & Faloutsos].
      • Scale x, edge factor f ⇒ 2^x vertices, ≈ f · 2^x edges. For example, scale 22 with edge factor 16 gives 2^22 ≈ 4.2M vertices and ≈ 67M edges.
      • Edge actions: 7/8 insertions, 1/8 deletions
      • Results over five batches of edge actions.
    • Caveat: no vector instructions or low-level optimizations yet.
    • Portable: OpenMP, Cray XMT family
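For reference, a compact sketch of an R-MAT edge generator in the sense of [Chakrabarti, Zhan, & Faloutsos]. The quadrant probabilities below are common Graph500-style defaults and an assumption here; the talk does not list its parameters:

```c
/* Sketch of an R-MAT generator: each of `scale` rounds picks a
 * quadrant of the (recursively divided) adjacency matrix, fixing one
 * bit of each endpoint. Probabilities a,b,c (d implied) are assumed. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static void rmat_edge(int scale, double a, double b, double c,
                      int64_t *u, int64_t *v)
{
    int64_t x = 0, y = 0;
    for (int bit = 0; bit < scale; bit++) {
        double r = drand48();
        x <<= 1;
        y <<= 1;
        if (r < a) {                  /* upper-left quadrant: no bits */
        } else if (r < a + b) {       /* upper-right */
            y |= 1;
        } else if (r < a + b + c) {   /* lower-left */
            x |= 1;
        } else {                      /* lower-right */
            x |= 1;
            y |= 1;
        }
    }
    *u = x;
    *v = y;
}

int main(void)
{
    const int scale = 22, edgefactor = 16;      /* as in the experiments */
    const int64_t nedges = (int64_t)edgefactor << scale;
    srand48(1);
    printf("would generate %" PRId64 " edges; first few:\n", nedges);
    for (int k = 0; k < 4; k++) {
        int64_t u, v;
        rmat_edge(scale, 0.57, 0.19, 0.19, &u, &v);   /* d = 0.05 */
        printf("%" PRId64 " -> %" PRId64 "\n", u, v);
    }
    return 0;
}
```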
18. STINGER insertion / removal rates: edges per second
    On the E7-8870, 80 hardware threads across four sockets.
    (Plot: aggregate STINGER updates per second vs. number of threads, one curve per batch size from 100 up to 3,000,000; rates span roughly 10^4 to above 10^7 edges per second.)
19. STINGER insertion / removal rates: seconds per batch ("latency")
    On the E7-8870, 80 hardware threads across four sockets.
    (Plot: seconds per batch processed vs. number of threads, one curve per batch size from 100 up to 3,000,000; latencies span roughly 10^-3 to 10^0 seconds.)
20. Clustering coefficients
    • Used to measure "small-world-ness" [Watts & Strogatz] and potential community structure.
    • Larger clustering coefficient ⇒ more inter-connected.
    • Roughly the ratio of the number of actual to potential triangles.
    • Defined in terms of triplets.
      • i – v – j is a closed triplet (triangle).
      • m – v – n is an open triplet.
    • Clustering coefficient: # of closed triplets / total # of triplets.
    • Locally around v or globally for the entire graph.
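The slide's ratio has a standard closed form for the local coefficient around v (standard notation; the formula itself is not printed on the slide):

```latex
C_v \;=\; \frac{\#\,\text{closed triplets centered at } v}{\#\,\text{triplets centered at } v}
    \;=\; \frac{2\,T_v}{d_v\,(d_v - 1)}
```

Here T_v is the number of triangles containing v and d_v its degree; the triplet count d_v(d_v - 1)/2 is just the number of neighbor pairs of v. The update methods on the next slide maintain the T_v counts incrementally.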
21. Updating triangle counts
    Given: edge {u, v} to be inserted (+) or deleted (-).
    Approach: search for vertices adjacent to both u and v; update counts on those and on u and v.
    Three methods:
    • Brute force: intersect the neighbors of u and v by iterating over each, O(d_u · d_v) time.
    • Sorted list: sort u's neighbors; for each neighbor of v, check if it is in the sorted list.
    • Compressed bits: summarize u's neighbors in a bit array, reducing the check for each of v's neighbors to O(1) time. Approximate with Bloom filters. [MTAAP10]
    All rely on atomic addition.
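A sketch of the brute-force method, assuming plain adjacency arrays (the `nbrs_t` type is a stand-in, not STINGER's data structure); note the atomic additions the slide mentions:

```c
/* Brute-force triangle-count update for an edge {u, v}. Every vertex
 * w adjacent to both u and v closes a triplet, so its triangle count
 * and those of u and v each change by delta (+1 insert, -1 delete). */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    const int64_t *adj;   /* neighbor list (toy representation) */
    int64_t deg;          /* its length */
} nbrs_t;

static void update_triangles(const nbrs_t *nbr, int64_t *ntri,
                             int64_t u, int64_t v, int64_t delta)
{
    int64_t common = 0;
    /* O(d_u * d_v): compare all pairs of neighbors. */
    for (int64_t i = 0; i < nbr[u].deg; i++)
        for (int64_t j = 0; j < nbr[v].deg; j++)
            if (nbr[u].adj[i] == nbr[v].adj[j]) {
                __sync_fetch_and_add(&ntri[nbr[u].adj[i]], delta);
                common++;
            }
    /* u and v each gain/lose one triangle per common neighbor. */
    __sync_fetch_and_add(&ntri[u], delta * common);
    __sync_fetch_and_add(&ntri[v], delta * common);
}

int main(void)
{
    /* Path 0-1-2; inserting {0, 2} closes one triangle. */
    const int64_t a0[] = {1}, a1[] = {0, 2}, a2[] = {1};
    nbrs_t nbr[3] = {{a0, 1}, {a1, 2}, {a2, 1}};
    int64_t ntri[3] = {0, 0, 0};
    update_triangles(nbr, ntri, 0, 2, +1);
    printf("triangles: %" PRId64 " %" PRId64 " %" PRId64 "\n",
           ntri[0], ntri[1], ntri[2]);
    return 0;
}
```

The sorted-list and compressed-bit methods replace only the inner membership test; the surrounding atomic updates stay the same.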
22. Batches of 10k actions
    (Plot: updates per second, both metric and STINGER, vs. threads for brute force, Bloom filter, and sorted list on 4 × E7-8870 and 2 × X5570 machines; labeled rates run from about 3.9 × 10^3 to 1.4 × 10^5 updates per second.)
    Graph size: scale 22, edge factor 16.
23. Different batch sizes
    (Plot: updates per second, both metric and STINGER, vs. threads for brute force, Bloom filter, and sorted list at batch sizes 100, 1000, and 10000, on 4 × E7-8870 and 2 × X5570 machines.)
    Graph size: scale 22, edge factor 16.
24. Different batch sizes: latency
    (Plot: seconds between updates, both metric and STINGER, vs. threads for brute force, Bloom filter, and sorted list at batch sizes 100, 1000, and 10000, on 4 × E7-8870 and 2 × X5570 machines.)
    Graph size: scale 22, edge factor 16.
25. Connected components
    • Maintain a mapping from vertex to component.
    • Global property, unlike triangle counts.
    • In "scale free" social networks:
      • often one big component, and
      • many tiny ones.
    • Edge changes often sit within components.
    • Remaining insertions merge components.
    • Deletions are more difficult...
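A minimal sketch of the insertion side under a vertex-to-component map, written as a small union-find over a label array; this stands in for, but is not necessarily, the algorithm used in the talk:

```c
/* Insertions under a vertex-to-component map: an edge inside one
 * component is a no-op; an edge between components merges labels. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

static int64_t comp_find(int64_t *comp, int64_t v)
{
    while (comp[v] != v) {        /* chase labels to the root, */
        comp[v] = comp[comp[v]];  /* halving the path as we go */
        v = comp[v];
    }
    return v;
}

static void insert_edge_cc(int64_t *comp, int64_t u, int64_t v)
{
    int64_t cu = comp_find(comp, u), cv = comp_find(comp, v);
    if (cu == cv) return;         /* common case: stays within a component */
    if (cu < cv) comp[cv] = cu;   /* merge: point one root at the other */
    else         comp[cu] = cv;
}

int main(void)
{
    int64_t comp[4] = {0, 1, 2, 3};   /* four singleton components */
    insert_edge_cc(comp, 0, 1);
    insert_edge_cc(comp, 2, 3);
    insert_edge_cc(comp, 1, 3);       /* merges the two pairs */
    for (int v = 0; v < 4; v++)
        printf("comp[%d] = %" PRId64 "\n", v, comp_find(comp, v));
    return 0;
}
```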
26. Connected components: deleted edges
    The difficult case:
    • Very few deletions matter.
    • Determining which matter may require a large graph search.
      • Re-running static component detection.
      • (Long history; see related work in [MTAAP11].)
    • Coping mechanisms:
      • Heuristics.
      • Second level of batching.
27. Deletion heuristics
    Rule out effect-less deletions:
    • Use the spanning tree by-product of static connected component algorithms.
    • Ignore deletions when one of the following occurs:
      1. The deleted edge is not in the spanning tree.
      2. The endpoints share a common neighbor (*).
      3. The loose endpoint can reach the root (*).
    • In the last two (*), also fix the spanning tree.
    Rules out 99.7% of deletions.
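The filtering logic might look like the following sketch. The helper predicates are hypothetical names standing in for the real spanning-tree queries, not STINGER API; only the control flow mirrors the slide's three rules:

```c
/* Sketch of the deletion filter for an edge {u, v}, with u assumed to
 * be the endpoint left "loose" by removing a tree edge. The externs
 * are hypothetical helpers, declared here only so the unit compiles.
 * Returns nonzero when the deletion may split a component and must
 * fall through to the expensive re-check. */
#include <stdint.h>

extern int in_spanning_tree(int64_t u, int64_t v);      /* hypothetical */
extern int have_common_neighbor(int64_t u, int64_t v);  /* hypothetical */
extern int can_reach_root(int64_t loose);               /* hypothetical */
extern void fix_tree(int64_t u, int64_t v);             /* hypothetical */

int deletion_needs_recheck(int64_t u, int64_t v)
{
    if (!in_spanning_tree(u, v))
        return 0;                 /* rule 1: non-tree edge, no effect */
    if (have_common_neighbor(u, v)) {
        fix_tree(u, v);           /* rule 2: reroute through that neighbor */
        return 0;
    }
    if (can_reach_root(u)) {
        fix_tree(u, v);           /* rule 3: loose endpoint still attached */
        return 0;
    }
    return 1;                     /* the rare ~0.3% that need a real search */
}
```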
28. Connected components: performance
    On the E7-8870.
    (Plot: connected component updates per second vs. number of threads, one curve per batch size from 100 up to 3,000,000; rates span roughly 10^3 to above 10^5 updates per second.)
29. Connected components: latency
    On the E7-8870.
    (Plot: seconds per batch processed vs. number of threads, one curve per batch size from 100 up to 3,000,000; latencies span roughly 10^-1.5 to 10^1 seconds.)
30. Common aspects
    • Each parallelizes sufficiently well over the affected vertices V, those touched by new or removed edges.
    • Total amount of work is O(Vol(V)) = O(∑_{v∈V} deg(v)).
    • Our in-progress work on refining or re-agglomerating communities with updates also is O(Vol(V)).
    • How many interesting graph properties can be updated with O(Vol(V)) work? Do these parallelize well?
    • The hidden constant, and how quickly performance becomes asymptotic, determines the metric update rate. What implementation techniques bash down the constant?
    • How sensitive are these metrics to noise and error?
    • How quickly can we "forget" data and still maintain metrics?
31. Session outline
    • Fast Counting of Patterns in Graphs
      Ali Pinar, C. Seshadhri, and Tamara G. Kolda, Sandia National Laboratories, USA
    • Statistical Models and Methods for Anomaly Detection in Large Graphs
      Nicholas Arcolano, Massachusetts Institute of Technology, USA
    • Perfect Power Law Graphs: Generation, Sampling, Construction and Fitting
      Jeremy Kepner, Massachusetts Institute of Technology, USA
32. Acknowledgment of support