Contributors
• David Bader
• David Ediger
• Rob McColl
• Jason Riedy
• Kamesh Madduri
• Jason Poovey
Outline
• Motivation
• Dynamic Graph Basics
• What is STINGER?
• What can STINGER do?
• Why STINGER?
Big Data problems need Graph Analysis
• Health Care
  • Finding outbreaks, population epidemiology
• Social Networks
  • Advertising, searching, grouping, influence
• Intelligence
  • Decisions at scale, regulating algorithms
• Systems Biology
  • Understanding interactions, drug design
• Power Grid
  • Disruptions, conversion
• Simulation
  • Discrete events, cracking meshes
Graphs are pervasive
• Graphs: things and relationships
• Different kinds of things, different kinds of relationships, but graphs provide a framework for analyzing the relationships.
• New challenges for analysis: data sizes, heterogeneity, uncertainty, data quality.
Astrophysics
• Problem: Outlier detection
• Challenges: Massive data sets, temporal variation
• Graph Problems: matching, clustering
Bioinformatics
• Problem: Identifying target proteins
• Challenges: Data heterogeneity, quality
• Graph Problems: Centrality, clustering
Social Informatics
• Problem: Emergent behavior, information spread
• Challenges: New analysis, data uncertainty, scale
• Graph Problems: clustering, flows, shortest paths
Data rates and volumes are immense
• Facebook:
  • ~1 billion users
  • average 130 friends
  • 30 billion pieces of content shared / month
• Twitter:
  • 500 million active users
  • 340 million tweets / day
• Internet – 100s of exabytes / year
  • 300 million new websites per year
  • 48 hours of video uploaded to YouTube per minute
  • 30,000 YouTube videos played per second
Our focus is streaming graphs
• As relationships change
  • Edges (relationships) are inserted, updated, and removed
  • New vertices (things) join and leave the network
• What are the effects?
  • On information flow
  • On community structure
  • On the integrity of data and structure
• Which actors and relationships are…
  • The key players and influencers in the change?
  • The anomalies and threats?
What is STINGER?
Spatio-Temporal Interaction Networks and Graphs Extensible Representation
D. A. Bader, J. Berry, A. Amos-Binks, D. Chavarría-Miranda, C. Hastings, K. Madduri, S. C. Poulos
• A scalable, high-performance in-memory dynamic graph data structure
  • Stores semantic and temporal information.
  • Designed to be flexible and extendable.
  • Be useful for the entire "large graph" community.
  • Permit good performance: No single structure is optimal for all.
  • Assume globally addressable memory access.
  • Support multiple, parallel readers and a single parallel writer.
• A software suite for dynamic graph analysis
  • Targets large shared-memory x86 and the Cray XMT
  • Written in C with OpenMP and XMT pragma support for parallelism
As a data structure
• Fast insertions, deletions, and updates: A data structure that grows and changes at the speed of the data.
• Edge and vertex types and weights: Represent complex relationships and multiple simultaneous networks.
• Filtering traversal mechanisms: Traverse serially or in parallel on specific edge types, time ranges, vertex sets, etc.
• Experimental workflow server: Multiple data streams and analytics with one persistent data structure.
• Experimental Java and Python bindings: Use efficiency-oriented languages without sacrificing performance-oriented results.
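The features above (typed, weighted, timestamped edges plus filtered traversal) can be illustrated with a toy structure. This is a minimal sketch in Python, not STINGER's C API or memory layout: the names `ToyDynamicGraph`, `insert_or_update`, `remove`, and `neighbors` are hypothetical, and accumulating the weight on a repeated insertion is an assumption made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Edge:
    dst: int
    etype: str
    weight: float
    first_seen: int  # timestamp of the first insertion
    last_seen: int   # timestamp of the most recent update


class ToyDynamicGraph:
    """Illustrative only: hypothetical names, not STINGER's interface."""

    def __init__(self):
        # src -> {(dst, etype): Edge}; one entry per typed edge
        self.adj = defaultdict(dict)

    def insert_or_update(self, src, dst, etype, weight, ts):
        key = (dst, etype)
        edge = self.adj[src].get(key)
        if edge is None:
            self.adj[src][key] = Edge(dst, etype, weight, ts, ts)
        else:
            edge.weight += weight  # assumption: updates accumulate weight
            edge.last_seen = ts

    def remove(self, src, dst, etype):
        # Returns True if the edge existed and was removed.
        return self.adj[src].pop((dst, etype), None) is not None

    def neighbors(self, src, etype=None, since=None):
        # Filtered traversal: optionally restrict by edge type and recency.
        for edge in self.adj.get(src, {}).values():
            if etype is not None and edge.etype != etype:
                continue
            if since is not None and edge.last_seen < since:
                continue
            yield edge


g = ToyDynamicGraph()
g.insert_or_update(0, 1, "follows", 1.0, ts=10)
g.insert_or_update(0, 1, "follows", 1.0, ts=20)   # update of an existing edge
g.insert_or_update(0, 2, "mentions", 1.0, ts=15)
recent = [e.dst for e in g.neighbors(0, since=16)]
```

Keying edges by (destination, type) lets several simultaneous networks share one vertex set, which is the idea behind the "multiple simultaneous networks" bullet.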
As an analysis package
• Streaming edge insertions and deletions: Performs new edge insertions, updates, and deletions in batches or individually.
• Streaming clustering coefficients: Tracks the local and global clustering coefficients of a graph under both edge insertions and deletions.
• Streaming connected components: Accurately tracks the connected components of a graph with insertions and deletions.
• Streaming community detection: Track and update the community structures within the graph as they change.
• Parallel agglomerative clustering: Find clusters that are optimized for a user-defined edge scoring function.
• Streaming Betweenness Centrality: Find the key points within information flows and structural vulnerabilities.
• K-core Extraction: Extract additional communities and filter noisy high-degree vertices.
• Classic breadth-first search: Performs a parallel breadth-first search of the graph starting at a given source vertex to find shortest paths.
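To make the streaming clustering coefficient idea concrete, the sketch below keeps per-vertex triangle counts up to date as single edges arrive and leave, using neighbor-set intersection. It is a serial, generic illustration under the assumption of a simple undirected graph; the class and method names are hypothetical and it does not reproduce STINGER's parallel implementation.

```python
from collections import defaultdict


class StreamingTriangles:
    """Per-vertex triangle counts under edge insertions and deletions."""

    def __init__(self):
        self.nbrs = defaultdict(set)
        self.tri = defaultdict(int)  # number of triangles through each vertex

    def insert(self, u, v):
        if u == v or v in self.nbrs[u]:
            return  # ignore self-loops and duplicate edges
        common = self.nbrs[u] & self.nbrs[v]  # each common neighbor closes a triangle
        self.nbrs[u].add(v)
        self.nbrs[v].add(u)
        self._adjust(u, v, common, +1)

    def delete(self, u, v):
        if v not in self.nbrs[u]:
            return
        self.nbrs[u].discard(v)
        self.nbrs[v].discard(u)
        common = self.nbrs[u] & self.nbrs[v]  # triangles that just opened
        self._adjust(u, v, common, -1)

    def _adjust(self, u, v, common, sign):
        self.tri[u] += sign * len(common)
        self.tri[v] += sign * len(common)
        for w in common:
            self.tri[w] += sign

    def local_cc(self, v):
        # Local clustering coefficient: triangles / possible neighbor pairs.
        d = len(self.nbrs[v])
        return 0.0 if d < 2 else 2.0 * self.tri[v] / (d * (d - 1))


st = StreamingTriangles()
for a, b in [(0, 1), (1, 2), (0, 2)]:
    st.insert(a, b)
cc_after_triangle = st.local_cc(0)
```

Only the endpoints of the changed edge and their common neighbors are touched, so the cost of an update depends on local degree rather than on the size of the whole graph.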
What can STINGER represent?
• Nearly any set of relationships
  • Healthcare
  • Social Networks
  • Intelligence
  • Systems biology
  • Power grid
  • Travel networks
• Example: Twitter
  • Users, hashtags, tweets as vertex types
  • Authorship, retweet, mentions, follows / followed by edge types
• Example: Work Environment
  • Users, PCs, printers, emails, URLs, files, etc. as vertex types
  • Email alias, from, to, access, logon/off, print, IM, etc. as edge types
What can STINGER do?
• Optimized to update at rates of over 3 million edges per second on graphs of one billion edges
  • D. Ediger, R. McColl, J. Riedy, and D. A. Bader, "STINGER: High Performance Data Structure for Streaming Graphs," The IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, September 20-22, 2012. Best Paper Award.
RMAT – Recursive MATrix graph generator. RMAT(N) indicates 2^N vertices.
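For readers unfamiliar with RMAT, the following is a minimal sketch of how a recursive-matrix generator draws edges for a graph with 2^N vertices. The quadrant probabilities (0.57, 0.19, 0.19, and the remainder 0.05) are an assumption borrowed from common benchmark practice; the slide does not state which parameters were used, and the function names are hypothetical.

```python
import random


def rmat_edge(scale, a=0.57, b=0.19, c=0.19, rng=random):
    """Draw one (src, dst) pair from a 2**scale x 2**scale adjacency matrix.

    At each of `scale` levels, one quadrant of the matrix is chosen with
    probabilities a, b, c, and 1 - a - b - c, which fixes one bit of each ID.
    """
    src = dst = 0
    for _ in range(scale):
        r = rng.random()
        src <<= 1
        dst <<= 1
        if r < a:
            pass            # top-left quadrant: both bits stay 0
        elif r < a + b:
            dst |= 1        # top-right quadrant
        elif r < a + b + c:
            src |= 1        # bottom-left quadrant
        else:
            src |= 1        # bottom-right quadrant
            dst |= 1
    return src, dst


def rmat_edges(scale, num_edges, seed=0):
    """Generate a reproducible list of RMAT edges (assumed helper)."""
    rng = random.Random(seed)
    return [rmat_edge(scale, rng=rng) for _ in range(num_edges)]


edges = rmat_edges(scale=10, num_edges=1000, seed=1)  # RMAT(10): 1024 vertices
```

Skewing the probability toward one quadrant produces the heavy-tailed degree distribution that makes RMAT graphs a common stand-in for social network data.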
What can STINGER do?
• Maintaining connected components in a graph of half a billion edges
  • Up to 1.26 million updates per sec.
  • 137x faster than recomputing.
• Scalable parallel streaming community detection
  • Built on parallel insert / delete mechanisms.
• Streaming approximate betweenness
  • Used to analyze influencers on Twitter during Hurricane Sandy over time.
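The gap between updating and recomputing connected components can be seen in a small sketch. The code below uses a textbook union-find for insertions and, as a deliberately naive fallback, rebuilds everything on a deletion. It is a generic illustration with hypothetical names, not the algorithm behind the figures above; avoiding that rebuild for most deletions is exactly the hard part a streaming method has to solve.

```python
class IncrementalComponents:
    """Union-find for insertions; naive full rebuild on deletion."""

    def __init__(self):
        self.parent = {}
        self.edges = set()

    def _find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:  # path compression
            self.parent[x], x = root, self.parent[x]
        return root

    def _union(self, u, v):
        ru, rv = self._find(u), self._find(v)
        if ru != rv:
            self.parent[ru] = rv

    def insert(self, u, v):
        # Cheap path: an insertion can only merge two components.
        self.edges.add((min(u, v), max(u, v)))
        self._union(u, v)

    def delete(self, u, v):
        # Expensive path (assumption for this sketch): recompute from scratch.
        self.edges.discard((min(u, v), max(u, v)))
        self.parent = {x: x for x in self.parent}
        for a, b in self.edges:
            self._union(a, b)

    def connected(self, u, v):
        return self._find(u) == self._find(v)

    def num_components(self):
        return len({self._find(x) for x in list(self.parent)})


cc = IncrementalComponents()
for a, b in [(1, 2), (2, 3), (4, 5)]:
    cc.insert(a, b)
cc.delete(2, 3)  # vertex 3 becomes its own component
```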
What does STINGER not do?
• Does not provide all ACID properties
  • Why: Not intended to be the backing data store.
  • Why: Allows for greater ingest and processing speeds.
  • Alternative: Back STINGER ingest with an ACID DB
  • Alternative: STINGER does provide consistency, partial isolation
• No text-based query language, for now
  • Why: Currently, no language is general enough to describe most or all queries
  • Alternative: Filtering traversal APIs, unlimited query flexibility through code
  • Alternative: Productivity language bindings (Python, Java)
• No distributed / Hadoop-like cluster support
  • Why: Good fit for ingest, but poor for streaming analysis; random access is too slow
  • Alternative: Larger shared memory systems such as the Cray XMT and SGI UV systems
  • Alternative: Processing billion-edge graphs in shared memory on affordable Intel servers
  • Alternative: Extract key portions of the graph from a larger data store and perform fast in-memory processing in STINGER
What sizes, performance can it handle?

Desktop (Intel Core i7-2600, 16GB DDR3)
V      E      Config   Size (GB)   Connected Components (s)   Updates per Sec.
1M     8M     22-14    1.184       0.316                      2.7M
2M     16M    22-14    2.384       0.75                       2.3M
4M     33M    22-14    4.768       2                          2.3M
8M     67M    24-14    9.536       5.36                       0.85M
4M     67M    24-14    7.984       3                          1.38M
4M     134M   24-14    14.336      5.7                        0.8M

Server (4x Opteron 6282, 256GB DDR3)
V      E      Config   Size (GB)   Connected Components (s)   Updates per Sec.
16M    512M   25-14    60          13.7                       696K
16M    256M   25-14    24.6        9.82                       2.1M

Cray XMT2 – 64 Processors, 2TB DDR2
V      E      Config   Size (GB)   Connected Components (s)   Updates per Sec.
67M    512M   28-32    86          13.8                       3.3M
268M   4.3B   28-32    312         52.3                       2.34M

• The only limitation on size is system memory
• Billions of vertices and edges are possible
• V vertices and E edges in each graph
• E counts are undirected
• STINGER stores both directions
• Config is STINGER-specific parameters
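Because E counts undirected edges and both directions are stored, the figures above allow a rough estimate of memory per stored edge. The helper below is a back-of-the-envelope sketch under stated assumptions: 1 GB is taken as 10^9 bytes, "M" as 10^6, and the entire footprint is attributed to edges, so vertex data and allocation slack inflate the result. The function name is hypothetical.

```python
def bytes_per_stored_edge(size_gb, undirected_edges):
    """Rough footprint per stored (directed) edge.

    Assumes 1 GB = 1e9 bytes and that each undirected edge is stored twice,
    once per direction, as the slide notes.
    """
    stored_edges = 2 * undirected_edges
    return size_gb * 1e9 / stored_edges


# Desktop row: 8M undirected edges in 1.184 GB.
desktop = bytes_per_stored_edge(1.184, 8e6)
# Cray XMT2 row: 4.3B undirected edges in 312 GB.
cray = bytes_per_stored_edge(312, 4.3e9)
```

Dividing available memory by such an estimate gives a first guess at the largest graph a given machine can hold, which is the practical meaning of "the only limitation on size is system memory".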
Why not existing technologies?
• Traditional SQL databases
  • Not structured to do any meaningful graph queries with any level of efficiency or timeliness
• Graph databases, mostly on-disk
  • Distributed disk can keep up with storing / indexing, but is simply too slow at random graph access to process as the graph updates
• Hadoop and HDFS-based projects
  • Not really the right programming model for many structural queries over the entire graph; random access performance is poor
• Smaller graph libraries, processing tools
  • Can't scale, can't process dynamic graphs, and frequently lead to impossible visualization attempts
Who is GTRI?
• Georgia Tech Research Institute
  • Largest research entity at Georgia Institute of Technology
  • One of the world's premier university-based applied R&D organizations for 75 years
  • Non-profit with over 1,600 employees and 21 locations world-wide
  • Over $240 million per year of government and industry contracts
• Innovative Computing Division of the Cyber Technology and Information Security Lab
  • Dedicated to the application of practical HPC expertise and cutting-edge fundamental research to solve real-world problems
  • Experts in high-performance computing, algorithms, and big data
How can I start using STINGER?
• Information, code, help
  • http://cc.gatech.edu/stinger
  • email@example.com
• Together, GTRI and Georgia Tech can offer
  • Consulting: Understand how your organization can benefit from graph analytics.
  • Training: Learn how to use graph analysis and apply STINGER to your data.
  • Implementation: Customize and extend STINGER to suit your needs using our experts.
  • Research Expertise: Connect with researchers on the cutting edge of big data to develop novel solutions to your open problems.