Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs

448 views

Published on

Graph-structured data in network security, social networks, finance, and other applications not only are massive but also under continual evolution. The changes often are scattered across the graph, permitting novel parallel and incremental analysis algorithms. We discuss analysis algorithms for streaming graph data to maintain both local and global metrics with low latency and high efficiency.

Published in: Data & Analytics
  • Be the first to comment

  • Be the first to like this

Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs

  1. 1. scalable and efficient algorithms for analysis of massive, streaming graphs E. Jason Riedy and David A. Bader MS76 Scalable Network Analysis: Tools, Algorithms, Applications SIAM PP, 15 April 2016 HPC Lab, School of Computational Science and Engineering Georgia Institute of Technology
  2. 2. motivation and applications
  3. 3. (insert prefix here)-scale data analysis Cyber-security Identify anomalies, malicious actors Health care Finding outbreaks, population epidemiology Social networks Advertising, searching, grouping Intelligence Decisions at scale, regulating algorithms Systems biology Understanding interactions, drug design Power grid Disruptions, conservation Simulation Discrete events, cracking meshes • Graphs are a motif / theme in data analysis. • Changing and dynamic graphs are important! 3
  4. 4. outline 1. Motivation and background 2. Incremental PageRank 3. Seed set expansion 4. Community maintenance 5. STINGER: Framework for streaming graph analysis 4
  5. 5. why graphs? Another tool, like dense and sparse linear algebra. • Combine things with pairwise relationships • Smaller, more generic than raw data. • Taught (roughly) to all CS students... • Semantic attributions can capture essential relationships. • Traversals can be faster than filtering DB joins. • Provide clear phrasing for queries about relationships. 5
  6. 6. potential applications • Social Networks • Identify communities, influences, bridges, trends, anomalies (trends before they happen)... • Potential to help social sciences, city planning, and others with large-scale data. • Cybersecurity • Determine if new connections can access a device or represent new threat in < 5ms... • Is the transfer by a virus / persistent threat? • Bioinformatics, health • Construct gene sequences, analyze protein interactions, map brain interactions • Credit fraud forensics ⇒ detection ⇒ monitoring • Integrate all the customer’s data, identify in real-time 6
  7. 7. streaming graph data Networks data rates: • Gigabit ethernet: 81k – 1.5M packets per second • Over 130 000 flows per second on 10 GigE (< 7.7 µs) Person-level data rates: • 500M posts per day on Twitter (6k / sec)1 • 3M posts per minute on Facebook (50k / sec)2 We need to analyze only changes and not entire graph. Throughput & latency trade off and expose different levels of concurrency. 1 www.internetlivestats.com/twitter-statistics/ 2 www.jeffbullas.com/2015/04/17/21-awesome-facebook-facts-and-statistics-you-need-to-check-out/ 7
  8. 8. streaming graph analysis Terminology: • Streaming changes into a massive, evolving graph • Not CS streaming algorithm (tiny memory) • Need to handle deletions as well as insertions Previous throughput results (not comprehensive review): Data ingest >2M up/sec [Ediger, McColl, Poovey, Campbell, & B 2014] Clustering coefficients >100K up/sec [R, Meyerhenke, Bader, Ediger, & Mattson 2012] Connected comp. >1M up/sec [McColl, Green, & B 2013] Community clustering >100K up/sec∗ [R & B 2013] 8
  9. 9. incremental pagerank
  10. 10. pagerank Everyone’s “favorite” metric: PageRank. • Stationary distribution of the random surfer model. • Eigenvalue problem can be re-phrased as a linear system ( I − αAT D−1 ) x = kv, with α teleportation constant, much < 1 A adjacency matrix D diagonal matrix of out degrees, with x/0 = x (self-loop) v personalization vector, here 1/|V| k irrelevant scaling constant • Amenable to analysis, etc. 10
  11. 11. incremental pagerank • Streaming data setting, update PageRank without touching the entire graph. • Existing methods maintain databases of walks, etc. • Let A∆ = A + ∆A, D∆ = D + ∆D for the new graph, want to solve for x + ∆x. • Simple algebra: ( I − αAT ∆D−1 ∆ ) ∆x = α ( A∆D−1 ∆ − AD−1 ) x, and the right-hand side is sparse. • Re-arrange for Jacobi, ∆x(k+1) = αAT ∆D−1 ∆ ∆x(k) + α ( A∆D−1 ∆ − AD−1 ) x, iterate, ... 11
  12. 12. incremental pagerank: accumulating error • And fail. The updated solution wanders away from the true solution. Top rankings stay the same... 12
  13. 13. incremental pagerank: think instead • The old solution x is an ok, not exact, solution to the original problem, now a nearby problem. • How close? Residual: r′ = kv − x + αA∆D−1 ∆ x = r + α ( A∆D−1 ∆ − AD−1 ) x. • Solve (I − αA∆D−1 ∆ )∆x = r′ . • Cheat by not refining all of r′ , only region growing around the changes: (I − αA∆D−1 ∆ )∆x = r′ |∆ • (Also cheat by updating r rather than recomputing at the changes.) 13
  14. 14. incremental pagerank: works Riedy, GABB at IPDPS 2016, to appear. 14
  15. 15. incremental pagerank: worst latency q q q q q q q q q q q q q q q 0.001 0.010 0.001 0.010 0.01 0.10 1.00 1 100 1 10 100 powerPGPgiantcompocaidaRouterLevelbelgium.osmcoPapersCiteseer 10 100 1000 Batch size Updatetime(s) Algorithm q dpr dprheld pr_restart Riedy, GABB at IPDPS 2016, to appear. 15
  16. 16. seed set expansion
  17. 17. graphs: big, nasty hairballs Yifan Hu’s (AT&T) visualization of the in-2004 data set http://www2.research.att.com/~yifanhu/gallery.html 17
  18. 18. but no shortage of structure... Protein interactions, Giot et al., “A Protein Interaction Map of Drosophila melanogaster”, Science 302, 1722-1736, 2003. Jason’s network via LinkedIn Labs • Locally, there are clusters or communities. • There are methods for global community detection. • Also need local communities around seeds for queries and targetted analysis. 18
  19. 19. seed set expansion • Seed set expansion finds the “best” subgraph or communities for a set of vertices of interest • Many quality criteria: Modularity, conductance, etc. • Can be applied to cryptocurrency to identify and track groups of interacting entities • Dynamic algorithm updates communities faster than recomputation, allowing us to keep up with new data produced 19
  20. 20. static seed set expansion Greedy expansion starting from S = { seed vertices }: 1. Check the fitness of every vertex v neighboring S. • fitness = f(S ∪ {v}) − f(S) 2. It any fitness is positive, include most fit v in S. • Currently sequential, could include all sufficiently good neighbors. 3. Record at which step v is included. • This list is the base for updates. Now the dynamic version by example... 20
  21. 21. dynamic seed set example In preparation with Anita Zakrzewska and Eisha Nathan 21
  22. 22. dynamic seed set example In preparation with Anita Zakrzewska and Eisha Nathan 21
  23. 23. dynamic seed set example In preparation with Anita Zakrzewska and Eisha Nathan 21
  24. 24. dynamic seed set quality Graphs from the Koblenz Network Collection. 22
  25. 25. dynamic seed set speed-up Graphs from the Koblenz Network Collection. 23
  26. 26. global community updates
  27. 27. community detection • Partition a graph’s vertices into disjoint communities. • A community locally optimizes some metric, NP-hard. • Trying to capture that vertices are more similar within one community than between communities. Jason’s network via LinkedIn Labs 25
  28. 28. what about streaming? • Simple approach based on agglomeration: 1. Extract all vertices touched by an update. 2. Re-start agglomeration. • “Works” and is fast (MTAAP 2013), but never mentioned quality. • Extracted vertices form bridges and do not re-merge. • Some methods based on label propagation, not metric-driven. • Backtracking (e.g. Görke, et al., JEA 2013) preserves quality at cost of change size. • Ongoing: Can we limit backtracking? 26
  29. 29. community quality: achievable Data from Stanford SNAP archive: Facebook. Stream generation: Reversing the graph. In preparation with Pushkar Godbolé 27
  30. 30. community quality: change size Data from Stanford SNAP archive: Facebook. Stream generation: Reversing the graph. In preparation with Pushkar Godbolé 28
  31. 31. closing
  32. 32. future directions • Of course, continuing to develop streaming / dynamic / incremental algorithms. • For massive graphs, computing small changes is always a win. • Improving approximations or replacing expensive metrics like betweenness centrality would be great. • Including more external and semantic data. • If vertices are documents or data records, many more measures of similarity. • Only now being exploited in concert with static graph algorithms. 30
  33. 33. hpc lab people Faculty: • David A. Bader • Oded Green Data: • Pushkar Godbolé • Anita Zakrzewska • Eisha Nathan STINGER: • Robert McColl, • James Fairbanks, • Adam McLaughlin, • Daniel Henderson, • David Ediger (now GTRI), • Jason Poovey (GTRI), • Karl Jiang, and • feedback from users in industry, government, academia Support: DoD, DoE, NSF, Intel, IBM, Oracle 31
  34. 34. stinger: where do you get it? Home: www.cc.gatech.edu/stinger/ Code: git.cc.gatech.edu/git/u/eriedy3/stinger.git/ Gateway to • code, • development, • documentation, • presentations... Remember: Academic code, but maturing with contributions. Users / contributors / questioners: Georgia Tech, PNNL, CMU, Berkeley, Intel, Cray, NVIDIA, IBM, Federal Government, Ionic Security, Citi, ... 32

×