STING is a framework for real-time analysis of streaming graph-structured data on Intel-based platforms. It targets massive graphs that change over time, such as social networks, and maintains both the graph and derived metrics as the data changes rapidly. It uses a lockless data structure, STINGER, to accumulate the graph and its timestamps. STING has seen performance improvements of up to 14x on Intel platforms and is being used by several projects to analyze dynamic graphs and detect anomalies.
An introduction to the STINGER dynamic graph structure and analysis package. Shows the motivation for STINGER, what has been done with it, and how you can use it. More at http://cc.gatech.edu/stinger
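To illustrate the idea of accumulating a graph with timestamps, here is a minimal Python sketch. The layout and names are illustrative assumptions on my part; STINGER itself is a lockless, blocked adjacency structure written in C, not a Python dict.

```python
from collections import defaultdict

class TimestampedGraph:
    """Toy streaming graph: adjacency plus first/last-seen timestamps.
    Illustrative only; not the actual STINGER data structure."""
    def __init__(self):
        self.adj = defaultdict(dict)  # u -> {v: (first_ts, last_ts)}

    def update(self, u, v, ts):
        # Insert the undirected edge or refresh its most recent timestamp.
        for a, b in ((u, v), (v, u)):
            first, _ = self.adj[a].get(b, (ts, ts))
            self.adj[a][b] = (first, ts)

    def neighbors(self, u):
        return self.adj[u].keys()

g = TimestampedGraph()
g.update(1, 2, ts=100)
g.update(1, 2, ts=250)  # repeated edge: first_ts kept, last_ts refreshed
```

The per-edge (first, last) pair is what lets temporal metrics distinguish long-lived relationships from recent activity without replaying the whole stream.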
2. Outline
• Motivation
• STING for streaming, graph-structured data on Intel-based platforms
• World-leading community detection
• Community-building interactions
• Plans and direction
Note: Without hard limits, academics will use at least all the time...
3. (Prefix)scale Data Analysis
• Health care: finding outbreaks, population epidemiology
• Social networks: advertising, searching, grouping
• Intelligence: decisions at scale, regulating algorithms
• Systems biology: understanding interactions, drug design
• Power grid: disruptions, conservation
• Simulation: discrete events, cracking meshes
4. Graphs are pervasive
Graphs: things and relationships
• Different kinds of things, different kinds of relationships, but graphs provide a framework for analyzing the relationships.
• New challenges for analysis: data sizes, heterogeneity, uncertainty, data quality.
• Astrophysics: Problem: outlier detection. Challenges: massive data sets, temporal variation. Graph problems: matching, clustering.
• Bioinformatics: Problem: identifying target proteins. Challenges: data heterogeneity, quality. Graph problems: centrality, clustering.
• Social informatics: Problem: emergent behavior, information spread. Challenges: new analysis, data uncertainty. Graph problems: clustering, flows, shortest paths.
5. No shortage of data...
Existing (some out-of-date) data volumes:
• NYSE: 1.5 TB generated daily, maintaining an 8 PB archive
• Google: "several dozen" 1 PB data sets (CACM, Jan 2010)
• Wal-Mart: 536 TB, 1B entries daily (2006)
• EBay: 2 PB traditional DB and 6.5 PB streaming; 17 trillion records, 1.5B records/day; each web click is 50-150 details. http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/
• Facebook: 845 M users... and growing(?)
• All data is rich and semantic (graphs!) and changing.
• Base data rates include items and not relationships.
6. General approaches
• High-performance static graph analysis
• Develop techniques that apply to unchanging massive
graphs.
• Provides useful after-the-fact information, starting points.
• Serves many existing applications well: market research,
much bioinformatics, ...
• High-performance streaming graph analysis
• Focus on the dynamic changes within massive graphs.
• Find trends or new information as they appear.
• Serves upcoming applications: fault or threat detection,
trend analysis, ...
Both very important to different areas.
GT’s primary focus is on streaming.
Note: Not CS theory streaming, but analysis of streaming data.
7. Why analyze data streams?
Data volumes
• NYSE: 1.5TB daily
• LHC: 41TB daily
• Facebook, etc.: who knows?
Data transfer
• 1 Gb Ethernet: 8.7TB daily at 100%, 5-6TB daily realistic
• Multi-TB storage on 10GE: 300TB daily read, 90TB daily write
• CPU ↔ Memory (QPI, HT): 2PB/day at 100%
Data growth
• Facebook: > 2×/yr
• Twitter: > 10×/yr
• Growing sources: bioinformatics, µsensors, security
Speed growth
• Ethernet/IB/etc.: 4× in the next 2 years. Maybe.
• Flash storage, direct: 10× write, 4× read. Relatively huge cost.
8. Intel’s non-numeric computing
program (our perspective)
Supporting analysis of massive graphs under rapid change
across the spectrum of Intel-based platforms.
STING on Intel-based platforms
• Maintain a graph and metrics under large data changes
• Performance, documentation, distribution
• Greatly improved ingestion rate, improved example metrics
• Available at http://www.cc.gatech.edu/stinger/.
• Algorithm development
• World-leading community detection (clustering) performance
on Intel-based platforms
• Good for focused or multi-level analysis, visualization, ...
• Developments on community monitoring in dynamic graphs
9. On to STING...
Motivation
STING for streaming, graph-structured data on Intel-based
platforms
Assumptions about the data
High-level architecture
Algorithms and performance
World-leading community detection
Community-building interactions
Plans and direction
10. Overall streaming approach
Yifan Hu’s (AT&T) visualization of the Livejournal data set
Jason’s network via LinkedIn Labs
Assumptions
• A graph represents some real-world phenomenon.
• But not necessarily exactly!
• Noise comes from lost updates, partial information, ...
11. Overall streaming approach
Assumptions
• We target massive, “social network” graphs.
• Small diameter, power-law degrees
• Small changes in massive graphs often are unrelated.
12. Overall streaming approach
Assumptions
• The graph changes but we don’t need a continuous view.
• We can accumulate changes into batches...
• But not so many that it impedes responsiveness.
13. STING’s focus
[Diagram: source data flows into STING’s simulation / query component and on to visualization; control feeds back actions, predictions, and summaries]
• STING manages queries against changing graph data.
• Visualization and control often are application specific.
• Ideal: Maintain many persistent graph analysis kernels.
• One current snapshot of the graph, kernels keep smaller
histories.
• Also (a harder goal), coordinate the kernels’ cooperation.
• STING components:
• Framework for batching input changes.
• Lockless data structure, STINGER, accumulating a typed
graph with timestamps.
14. STINGER
STING Extensible Representation:
• Rule #1: No explicit locking.
• Rely on atomic operations.
• Massive graph: Scattered updates, scattered reads rarely
conflict.
• Use time stamps for some view of time.
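Rule #1 can be sketched with portable C11 atomics: a hypothetical edge slot whose weight is accumulated with fetch-and-add and whose timestamp is advanced with a compare-and-swap loop, never a mutex. This is an illustration of the lockless style only, not STINGER's actual data layout or API.

```c
#include <stdatomic.h>
#include <stdint.h>

enum { NUPDATES = 400000 };

typedef struct {
    _Atomic int64_t weight;     /* accumulated edge weight */
    _Atomic int64_t timestamp;  /* most recent modification time */
} edge_slot;

/* Apply one streamed update to an edge without taking a lock:
   fetch-and-add for the weight, a CAS loop to keep the newest timestamp. */
static void apply_update(edge_slot *e, int64_t dw, int64_t when)
{
    atomic_fetch_add(&e->weight, dw);
    int64_t old = atomic_load(&e->timestamp);
    while (old < when &&
           !atomic_compare_exchange_weak(&e->timestamp, &old, when))
        ;  /* another thread raced us; retry against its value */
}

int64_t run_lockless_updates(void)
{
    static edge_slot e;
    atomic_store(&e.weight, 0);
    atomic_store(&e.timestamp, 0);
    /* On a massive graph, scattered updates rarely conflict; here
       every update hits one slot, the worst case, and still sums up. */
    #pragma omp parallel for
    for (int64_t i = 0; i < NUPDATES; i++)
        apply_update(&e, 1, i);
    return atomic_load(&e.weight);
}
```

No update is lost even under contention, which is the point of relying on atomic operations rather than explicit locks.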
15. STING’s year
Over the past year
• Improved performance on Intel-based platforms by 14× [HPEC12]
• Evaluated multiple dynamic kernels across platforms [MTAAP12, ICASSP 2012]
• Documentation and tutorials
• Assist projects with STING adoption on Intel-based
platforms
16. STING in use
Where is it?
http://www.cc.gatech.edu/stinger/
Code, development, documentation, presentations...
Who is using it?
• Center for Adaptive Supercomputing Software –
Multithreaded Architectures (CASS-MT) directed by PNNL
• PRODIGAL team for DARPA ADAMS (Anomaly Detection at
Multiple Scales) including GTRI, SAIC, Oregon State U., U.
Mass., and CMU
• DARPA SMISC (Social Media in Strategic Communication)
including GTRI
• Well-attended tutorials given at PPoPP, in MD, and locally.
17. Experimental setup
Unless otherwise noted
Line       Model     Speed (GHz)  Sockets  Cores
Nehalem    X5570     2.93         2        4
Westmere   X5650     2.66         2        6
Westmere   E7-8870   2.40         4        10
• Large Westmere, mirasol, loaned by Intel (thank you!)
• All memory: 1067MHz DDR3, installed appropriately
• Implementations: OpenMP, gcc 4.6.1, Linux ≈ 3.2 kernel
• Artificial graph and edge stream generated by R-MAT
[Chakrabarti, Zhan, & Faloutsos].
• Scale x, edge factor f ⇒ 2^x vertices, ≈ f · 2^x edges.
• Edge actions: 7/8th insertions, 1/8th deletions
• Results over five batches of edge actions.
• Caveat: No vector instructions or low-level optimizations yet.
• Portable: OpenMP, Cray XMT family
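As a quick sanity check on the scale/edge-factor arithmetic above: scale 24 with edge factor 8 gives 2^24 ≈ 16.8M vertices and about 134M edges. A tiny helper (not part of STING, purely illustrative) makes this explicit:

```c
#include <stdint.h>

/* Graph size implied by an R-MAT scale and edge factor:
   2^scale vertices and roughly edgefactor * 2^scale edges. */
static int64_t rmat_nv(int scale)
{
    return INT64_C(1) << scale;
}

static int64_t rmat_ne(int scale, int edgefactor)
{
    return (int64_t)edgefactor << scale;
}
```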
20. Clustering coefficients
[Figure: vertex v with closed triplet i – v – j and open triplet m – v – n]
• Used to measure “small-world-ness” [Watts & Strogatz] and potential community structure
• Larger clustering coefficient ⇒ more inter-connected
• Roughly the ratio of the number of actual to potential triangles
• Defined in terms of triplets:
• i – v – j is a closed triplet (triangle).
• m – v – n is an open triplet.
• Clustering coefficient: (# of closed triplets) / (total # of triplets)
• Locally around v or globally for the entire graph.
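The definition can be made concrete with a small, hypothetical helper that computes the local coefficient of one vertex from a dense 0/1 adjacency matrix. STING itself works on adjacency lists under batched updates; this sketch only illustrates the formula: triangles through v divided by deg(v)·(deg(v)−1)/2 potential ones.

```c
#include <stddef.h>

/* Local clustering coefficient of vertex v:
   (# closed triplets centered at v) / (total triplets centered at v).
   adj is a dense n-by-n 0/1 adjacency matrix, row-major. */
double local_clustering(const int *adj, int n, int v)
{
    int deg = 0, closed = 0;
    for (int i = 0; i < n; i++)
        deg += adj[v * n + i];
    if (deg < 2)
        return 0.0;  /* no triplets possible around v */
    /* Count edges between distinct neighbors i, j of v: each such
       edge closes the triplet i - v - j into a triangle. */
    for (int i = 0; i < n; i++) {
        if (!adj[v * n + i]) continue;
        for (int j = i + 1; j < n; j++)
            if (adj[v * n + j] && adj[i * n + j])
                closed++;
    }
    return (double)closed / ((double)deg * (deg - 1) / 2.0);
}

/* Small example graph: a triangle 0-1-2 with a pendant vertex 3 on 0. */
static const int example_adj[16] = {
    0, 1, 1, 1,
    1, 0, 1, 0,
    1, 1, 0, 0,
    1, 0, 0, 0,
};
```

In the example graph, vertex 1 sits entirely inside the triangle (coefficient 1), while vertex 0's pendant neighbor opens two triplets that never close (coefficient 1/3).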
23. Connected components
• Maintain a mapping from vertex to
component.
• Global property, unlike triangle
counts
• In “scale free” social networks:
• Often one big component, and
• many tiny ones.
• Edge changes often sit within
components.
• Remaining insertions merge
components.
• Deletions are more difficult...
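One common way to maintain the vertex-to-component mapping under insertions, consistent with the sketch above though not necessarily STING's exact implementation, is a union-find structure: an edge inserted inside a component is a no-op (the frequent case in social networks), and an edge between components merges their labels.

```c
enum { NV = 8 };
static int comp[NV];  /* union-find parent / component label per vertex */

static void components_init(void)
{
    for (int i = 0; i < NV; i++)
        comp[i] = i;  /* every vertex starts in its own component */
}

/* Find the component root, halving paths as we go. */
static int component_of(int v)
{
    while (comp[v] != v) {
        comp[v] = comp[comp[v]];
        v = comp[v];
    }
    return v;
}

/* An inserted edge either lands inside one component (no work)
   or merges two components; returns 1 on a merge. */
static int insert_edge(int u, int v)
{
    int ru = component_of(u), rv = component_of(v);
    if (ru == rv)
        return 0;  /* edge change sits within a component */
    comp[ru] = rv;
    return 1;
}
```

Deletions get no such shortcut: removing an edge may or may not split a component, which is why the next slide treats them separately.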
24. Connected components: Deleted edges
The difficult case
• Very few deletions matter.
• Maintain a spanning tree, and ignore a deletion if
1. it is not in the spanning tree,
2. its endpoints share a common neighbor∗, or
3. the loose endpoint reaches the root∗.
• In the last two (∗), also fix the spanning tree.
• Rules out 99.7% of deletions.
• (Long history; see related work in [MTAAP11].)
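Rule 2, checking whether the endpoints of a deleted edge still share a neighbor (so the component cannot have split), can be sketched as a linear merge over sorted adjacency lists. This is a simplification for illustration; the example arrays are hypothetical.

```c
/* Return 1 if the two sorted neighbor lists share a vertex.
   If u and v have a common neighbor, deleting edge (u, v)
   cannot disconnect their component. */
static int share_common_neighbor(const int *nbr_u, int du,
                                 const int *nbr_v, int dv)
{
    int i = 0, j = 0;
    while (i < du && j < dv) {
        if (nbr_u[i] == nbr_v[j])
            return 1;               /* common neighbor found */
        if (nbr_u[i] < nbr_v[j])
            i++;                    /* advance the smaller side */
        else
            j++;
    }
    return 0;
}

/* Hypothetical sorted neighbor lists for two deletion scenarios. */
static const int ex_u[] = {1, 3, 5};
static const int ex_v[] = {2, 3, 7};  /* shares neighbor 3 with ex_u */
static const int ex_w[] = {2, 4, 6};  /* shares nothing with ex_u */
```

The check costs O(deg(u) + deg(v)), cheap enough to run on every deletion before falling back to the more expensive spanning-tree repair.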