Presentation for the paper "C-SAW: A Framework for Graph Sampling and Random Walk on GPUs", published at SC20.
Paper link: https://arxiv.org/pdf/2009.09103.pdf
C-SAW: A Framework for Graph Sampling and Random Walk on GPUs
1. C-SAW: A Framework for Graph Sampling and Random Walk on GPUs
Santosh Pandey, Lingda Li, Adolfy Hoisie, Xiaoye S. Li, Hang Liu
Source code: https://github.com/concept-inversion/C-SAW
2. Mining Large Graphs
• Graphs: a natural representation of data; present everywhere.
• A plethora of algorithms exists to extract information from graphs:
  • Graph embedding
  • Graph visualization
  • Graph neural networks
• Real-world graphs are large: millions/billions of vertices and edges.
• Applying these algorithms directly incurs huge storage requirements and computational expense.
• Extracting information from a large graph is challenging.
3. Graph Sampling and Random Walk (RW)
• Reduces computational complexity and memory requirements.
• Pipeline: from the graph G(V, E), (1) generate samples/RWs, (2) train a model, (3) compute embeddings.
Reference: https://towardsdatascience.com/graph-embeddings-the-summary-cc6075aba007
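The sampling step of the pipeline above can be sketched on the CPU. This is a minimal, illustrative Python version of an unbiased random walk; the toy graph, seed, and function name are assumptions for illustration and not part of C-SAW:

```python
import random

def random_walk(adj, start, length, seed=0):
    """Generate one fixed-length random walk by repeatedly
    moving to a uniformly chosen neighbor of the current vertex."""
    rng = random.Random(seed)
    walk = [start]
    for _ in range(length - 1):
        neighbors = adj[walk[-1]]
        if not neighbors:          # dead end: stop the walk early
            break
        walk.append(rng.choice(neighbors))
    return walk

# Toy graph: vertex -> list of neighbors
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
walk = random_walk(adj, start=0, length=5)
print(walk)
```

Many independent walks like this one (one per GPU thread, in C-SAW's setting) form the reduced representation that downstream training consumes.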
5. Framework for Graph Sampling and RW
Challenge 1: no generic framework.
• KnightKing: a distributed framework for random walk.
• GraphSAINT: a sampler for generating graph embeddings.

Framework        | Sampling algorithms* | RW algorithms | GPU support
KnightKing       | ❌                   | ✅            | ❌
GraphSAINT       | ❌                   | ✅            | ❌
C-SAW (our work) | ✅                   | ✅            | ✅

* Traversal-based graph sampling algorithms.
• C-SAW addresses these limitations and allows implementation with a few lines of code.
6. Sampling Example with C-SAW: Multi-dimensional RW (Challenge 1)
• Start from a randomly generated frontier set (FrontierPool_t = {8, 0, 3}).
• Sample a frontier vertex with a bias — VERTEXBIAS() (Frontier_t = 8).
• Sample a neighbor vertex without a bias — EDGEBIAS() (neighbor 7 from the NeighborPool {5, 7, 9, 10, 11}); edge (8, 7) joins the sampled edges.
• Replace the frontier vertex with the sampled neighbor — UPDATE() (FrontierPool_t+1 = {0, 3, 7}).
Almost all graph sampling/RW algorithms can be defined with a similar flow; they differ only in 1) the bias and 2) the method used to update the frontier set.
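The shared flow above can be sketched in plain Python. This is a sequential CPU-side illustration of the loop, with callbacks named after the C-SAW API (VERTEXBIAS, EDGEBIAS, UPDATE); the signatures, toy graph, and seed are assumptions for illustration, not C-SAW's actual GPU interface:

```python
import random

def sample(adj, frontier, num_steps, vertex_bias, edge_bias, update, seed=0):
    """Generic traversal loop shared by sampling/RW algorithms:
    pick a frontier vertex (biased), pick one of its neighbors
    (biased or unbiased), then let the update callback rewrite
    the frontier pool."""
    rng = random.Random(seed)
    frontier = list(frontier)
    sampled_edges = []
    for _ in range(num_steps):
        # 1) biased pick of a frontier vertex (VERTEXBIAS)
        v = rng.choices(frontier, weights=[vertex_bias(u) for u in frontier])[0]
        # 2) pick one neighbor of v (EDGEBIAS)
        nbrs = adj[v]
        u = rng.choices(nbrs, weights=[edge_bias(v, w) for w in nbrs])[0]
        sampled_edges.append((v, u))
        # 3) user-defined frontier update (UPDATE);
        #    multi-dimensional RW replaces v with u
        frontier = update(frontier, v, u)
    return sampled_edges

# Toy graph and frontier for a multi-dimensional RW
adj = {8: [0, 3, 2], 0: [8, 3], 3: [8, 0, 2], 2: [8, 3]}
edges = sample(
    adj, frontier=[8, 0, 3], num_steps=4,
    vertex_bias=lambda v: len(adj[v]),   # frontier pick biased by degree
    edge_bias=lambda v, w: 1.0,          # unbiased neighbor pick
    update=lambda f, v, u: [u if x == v else x for x in f],
)
print(edges)
```

Swapping the three callbacks yields different sampling/RW algorithms, which is the point of the single generic loop.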
7. C-SAW Framework (Challenge 1)
• User programming interface: VERTEXBIAS(), EDGEBIAS(), UPDATE().
• Simple and expressive; supports existing and emerging algorithms.
• Hides the complex implementation from users.
• The MAIN function is optimized for GPU.
8. Sampling More Than 1 Neighbor (Challenge 2)
• Objective: sample 2 (out of 5) neighbors of the red vertex (8) with a bias.
• Each GPU thread samples independently and concurrently — fast sampling.
• But threads 1 and 2 can sample the same vertex (7): a selection collision.
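A quick CPU-side Python sketch of why collisions arise: two independent draws over 5 neighbors pick the same one roughly 20% of the time, while drawing without replacement never collides. The helper names, seed, and trial count are illustrative; C-SAW's actual GPU collision-resolution strategy is described in the paper:

```python
import random

def sample_independent(neighbors, k, rng):
    """Each 'thread' draws independently -> duplicates are possible."""
    return [rng.choice(neighbors) for _ in range(k)]

def sample_without_replacement(neighbors, k, rng):
    """Collision-free selection of k distinct neighbors."""
    return rng.sample(neighbors, k)

rng = random.Random(42)
neighbors = [5, 7, 9, 10, 11]   # the 5 neighbors of vertex 8 in the slides

# Independent draws collide with probability 1/5 for k=2 of 5 neighbors
collisions = sum(
    len(set(sample_independent(neighbors, 2, rng))) < 2 for _ in range(1000)
)
print(f"collisions in 1000 independent trials: {collisions}")

picked = sample_without_replacement(neighbors, 2, rng)
assert len(set(picked)) == 2   # distinct by construction
```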
11. Out-of-memory Sampling (Challenge 3)
The graph is split into partitions (P1, P2, P3) on the CPU; assume only 2 partitions can fit in GPU memory at a time. For frontier {0, 2, 8}:
1. Workload-aware scheduling: count the active frontier vertices per partition (frontier queues {0, 2} for P1, ɸ for P2, {8} for P3) and order partitions by queue size.
2. Transfer the scheduled partitions to the GPU.
3. Workload balancing: kernel K1 samples the resident partitions and routes each sampled vertex (e.g., 7, 5, 4) into the frontier queue of the partition that owns it; K1 exits.
4. Kernel K2 processes the remaining frontier queues (e.g., {7, 5} and {3}) in the same manner.
Repeat until every frontier queue is { ɸ }.
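The out-of-memory flow above can be sketched sequentially in Python. The partition map, depth bound, and function names are illustrative stand-ins for C-SAW's actual GPU kernels and CPU-GPU transfers:

```python
import random

def out_of_memory_sample(adj, partition_of, num_parts, frontier, depth, seed=0):
    """Sketch of partition-based sampling: only one partition is
    'resident' at a time; always process the partition whose frontier
    queue is largest (workload-aware scheduling), and route newly
    sampled vertices to the queue of the partition that owns them."""
    rng = random.Random(seed)
    queues = [[] for _ in range(num_parts)]
    for v in frontier:
        queues[partition_of[v]].append((v, 0))
    sampled = []
    while any(queues):
        # pick the partition with the most active frontier vertices
        p = max(range(num_parts), key=lambda i: len(queues[i]))
        batch, queues[p] = queues[p], []            # "load" partition p
        for v, d in batch:
            if d >= depth or not adj[v]:            # depth bound / dead end
                continue
            u = rng.choice(adj[v])                  # sample one neighbor
            sampled.append((v, u))
            queues[partition_of[u]].append((u, d + 1))  # route to owner
    return sampled

# Toy graph and a 2-way partition of its vertices
adj = {0: [5, 7], 2: [3], 8: [5, 7], 5: [], 7: [], 3: []}
partition_of = {0: 0, 2: 0, 3: 0, 5: 1, 7: 1, 8: 1}
edges = out_of_memory_sample(adj, partition_of, 2, frontier=[0, 2, 8], depth=2)
print(edges)
```

Processing the largest queue first amortizes each partition transfer over as many frontier vertices as possible, which is the intent of the workload-aware schedule.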
12. Experimental Setup
• Comparison metric: sampled edges per second (SEPS) = #SampledEdges / Time.
• Tests performed on the Summit supercomputer at ORNL: 6 NVIDIA Tesla V100 GPUs (16 GB each), dual-socket 22-core POWER9 CPUs.
• 10 datasets.
13. Comparing with the State of the Art
• Length of the RW: 2000; number of sampling instances/walks: 4000; frontier size for multi-dimensional RW: 2000.
• C-SAW vs. KnightKing (biased RW): 10x speedup with 1 GPU, 14.7x with 6 GPUs.
• C-SAW vs. GraphSAINT (multi-dimensional RW): 8.1x speedup with 1 GPU, 11.5x with 6 GPUs.
[Charts: million SEPS per dataset (AM, AS, CP, FS, LJ, OR, RE, TW, WG, YE) for KnightKing and GraphSAINT vs. C-SAW on 1 and 6 GPUs; annotated peaks at 95 and 135 million SEPS.]
14. Scalability of C-SAW with Multiple GPUs
• Neighbor sampling with 8000 instances.
• Up to 5.2x speedup with 6 GPUs.
[Chart: speedup with 1–6 GPUs on datasets AM, AS, CP, FR, LJ, OR, RE, TW, WG, YE.]
• More detailed evaluation in the paper.
15. Conclusion
• First GPU-based framework for graph sampling and RW.
• Outperforms the state-of-the-art works by 14.7x (KnightKing) and 11.5x (GraphSAINT).
• Efficient out-of-memory sampling for handling larger graphs.
• Future work: adding support for more sampling algorithms; improving the sampling techniques.
Source code: https://github.com/concept-inversion/C-SAW
16. Acknowledgement
Thank you. Please cite this work if it was useful to you:
Pandey, Santosh, et al. "C-SAW: A framework for graph sampling and random walk on GPUs." SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020.
Editor's Notes
Graphs provide a natural representation for most data and are widely used. They are mostly used to represent social networks, web networks, knowledge graphs, etc. (N)
A plethora of algorithms exists for mining crucial information from graphs. Generating embeddings, visualizing graph data, and graph neural networks are some example algorithms. (N) But real-world graphs are large, and applying these algorithms directly over large graphs incurs huge storage requirements and computational cost. (N) Hence, extracting information from a large graph is challenging.
Graph sampling and random walk algorithms are used to overcome this challenge. Instead of directly applying algorithms over a large graph G (N), we can generate samples or random walks. As samples and random walks are reduced representations of the graph, the memory and computational requirements for processing them are also low (N). Multiple instances of samples or random walks can be used instead of the whole graph, allowing applications to train (N) and perform prediction or compute embeddings from large graphs.
Let’s take a deeper look at how we generate samples or random walks. We start from a randomly selected source vertex 8 and randomly traverse immediate neighbors for a certain number of steps (N). Here, vertices 5, 7, 9, 10 and 11 are the immediate neighbors of vertex 8. The transition probability for each neighbor can be defined as the degree of that vertex divided by the sum of the degrees of all neighbor vertices. We call this biased edge transition. For vertex 8, the sum of the degrees of all neighbor vertices is 15. For example, as the degree of vertex 7 is 6, its transition probability is 6/15. For a random transition (N), we generate a random number between 0 and 1 (N), which is denoted by a dice roll here. Based upon the random number, (N) we select a neighbor. We repeat the process from vertex 7 for a certain number of steps to generate samples or random walks.
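The biased edge transition described above can be sketched in a few lines. The talk only states that vertex 7 has degree 6 and that the neighbor degrees of vertex 8 sum to 15; the remaining degrees below are assumed for illustration so the sum comes out to 15.

```python
import random

# Immediate neighbors of vertex 8 from the talk's example graph.
neighbors = [5, 7, 9, 10, 11]
# Degree of vertex 7 is 6 (from the talk); the other degrees are
# assumed so that the degrees sum to 15, as stated in the talk.
degree = {5: 3, 7: 6, 9: 2, 10: 2, 11: 2}

def biased_transition(neighbors, degree, rng=random):
    """Pick one neighbor with probability degree(v) / sum of neighbor degrees."""
    total = sum(degree[v] for v in neighbors)
    r = rng.random() * total      # the "dice roll", scaled to the degree sum
    cumulative = 0
    for v in neighbors:
        cumulative += degree[v]
        if r < cumulative:
            return v
    return neighbors[-1]

# Transition probability of vertex 7 is 6/15, as in the talk.
print(degree[7] / sum(degree.values()))  # 0.4
```

Scaling the random number by the degree sum and scanning the cumulative degrees is equivalent to drawing from the normalized transition probabilities, and avoids a division per neighbor.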
Moving on, let's have an overview of the available frameworks for graph sampling and random walks. A framework allows us to implement sampling and random walk algorithms with just a few lines of code. Two related works propose frameworks for sampling and random walk. (N) The first one is GraphSAINT, whose focus is on defining samplers for generating graph embeddings (N). The second one is KnightKing, which is a distributed framework for random walk algorithms (N). Moving towards the limitations, (N) both GraphSAINT and KnightKing lack support for the majority of traversal-based graph sampling algorithms and also do not support GPUs. (N) This brings us to our first challenge: building a generic framework supporting all algorithms. (N) Our proposed framework C-SAW is, to our knowledge, the first framework able to support both sampling and random walk algorithms with GPU support.
Let’s see how C-SAW can be used to implement these algorithms, with multi-dimensional random walk as an example. (N) This random walk starts with a randomly generated frontier set. We have vertices 8, 0 and 3 in the frontier set. First, (N) we sample a frontier vertex with vertex degree as the bias. Here, the sampled frontier vertex is 8. Next, (N) we gather the neighbors of vertex 8 (N) and sample one neighbor with equal probability, i.e., unbiased selection. The sampled neighbor is 7, which is also added to the sampled edge list. Next, (N) we update the frontier by replacing vertex 8 in the frontier pool with vertex 7. (N) Almost all sampling and random walk algorithms can be defined with a similar flow, (N) but they differ in two things: first, how the bias is defined, and second, the method to update the frontier set. (N) Now, let's see how we can use C-SAW APIs to implement these algorithms. (N) Our first API, VertexBias, defines how a frontier vertex is sampled from the frontier pool. (N) The second API, EdgeBias, defines how a neighbor is sampled from the neighbor pool, and (N) our third API, Update, defines how the frontier is updated based upon the sampled neighbor.
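The three-API flow above can be sketched as callbacks driving one sampling step. The names follow the talk (VertexBias, EdgeBias, Update), but this Python skeleton is only illustrative; C-SAW's actual interface is a GPU implementation, and the adjacency list below is an assumed reconstruction of the example graph.

```python
import random

def sample_step(graph, frontier, vertex_bias, edge_bias, update, rng=random):
    """One sampling step driven by the three user-defined APIs."""
    # VertexBias: pick a frontier vertex according to a user-defined bias.
    weights = [vertex_bias(v, graph) for v in frontier]
    src = rng.choices(frontier, weights=weights, k=1)[0]
    # EdgeBias: pick one neighbor of the chosen vertex.
    nbrs = graph[src]
    nbr_weights = [edge_bias(src, v, graph) for v in nbrs]
    dst = rng.choices(nbrs, weights=nbr_weights, k=1)[0]
    # Update: let the user decide how the frontier evolves.
    update(frontier, src, dst)
    return (src, dst)  # the sampled edge

# Assumed adjacency list consistent with the talk's figure.
graph = {8: [5, 7, 9, 10, 11], 0: [6, 10], 3: [1, 2, 4],
         7: [3, 4, 5, 6, 8, 9], 5: [4, 7, 8], 9: [7, 8],
         10: [0, 8], 11: [8, 12], 6: [0, 7],
         1: [3], 2: [3], 4: [3, 5, 7], 12: [11]}

# Multi-dimensional random walk, as in the talk's example:
# VertexBias = degree, EdgeBias = uniform, Update = replace src with dst.
frontier = [8, 0, 3]
edge = sample_step(graph, frontier,
                   vertex_bias=lambda v, g: len(g[v]),
                   edge_bias=lambda u, v, g: 1.0,
                   update=lambda f, u, v: f.__setitem__(f.index(u), v))
```

Swapping the three lambdas is all it takes to express a different algorithm, which is the point the talk makes about the expressiveness of the API.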
Looking at the overview of the C-SAW framework, (N) C-SAW provides three simple and expressive APIs, or user programming interfaces, which can support existing algorithms and have the flexibility to support emerging ones. (N) The main function uses these APIs for the implementation and is optimized for GPU. (N) Users do not need to know about the complex implementation of C-SAW to define sampling and random walk algorithms. With this, we address our first challenge of a generic framework.
Moving on to the next challenge: while most random walk algorithms sample only a single neighbor, some graph sampling algorithms like layer sampling or neighbor sampling need to sample more than one neighbor of a vertex with a bias. Here, in this figure, our objective is to sample 2 neighbors of vertex 8 based upon a bias. (N) For faster sampling, we use two different threads to sample each neighbor. Each thread samples independently and concurrently. (N) But with biased sampling, different threads could sample the same neighbor. Here, both threads try to sample the same vertex 7, (N) which is not allowed if we do not want any duplicates. We term this duplication selection collision, which is another challenge not addressed by previous work.
One solution for selection collision could be updated sampling. In this method, we update the transition probability after sampling each neighbor. (N) Here, we first sampled neighbor 7 randomly. Then, (N) vertex 7 is removed from the neighbor list and the transition probability is updated for the remaining neighbors. (N) We perform sampling again with the updated transition probability. This time we sampled neighbor 11. As sampled vertices are removed, we avoid selection collision. (N) But updating the transition probability after each sample is very costly. (N) Another solution for selection collision is repeated sampling. In this technique, we repeat the sampling until we acquire unique neighbors. (N) First, each thread samples a neighbor independently. As there is a selection collision for a thread on vertex 7, (N) one thread repeats the sampling process. There is again a selection collision, as vertex 7 is sampled again. (N) Finally, in another repetition, the thread is able to sample a unique neighbor. This method also solves the problem of selection collision but (N) may require a high number of repetitions. (N) We propose an efficient solution inspired by both updated sampling and repeated sampling, which we term bipartite region search, or BRS. (N) If a selection collision occurs during sampling, (N) we update the random number in such a way that we jump over vertex 7. The update corresponds to a virtual removal of vertex 7 from the neighbor list. (N) Then, we repeat sampling with the updated random number in a new region. With this technique, (N) we reduce the number of repetitions for sampling and avoid the costly update. With BRS, C-SAW solves challenge 2 more efficiently. (N) More details on how BRS works can be found in the paper.
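The two baseline strategies above can be sketched as follows. These are sequential CPU sketches of updated sampling and repeated sampling only; BRS itself is not reproduced here (see the paper for the actual algorithm), and the degrees other than vertex 7's are assumed.

```python
import random

neighbors = [5, 7, 9, 10, 11]
degree = {5: 3, 7: 6, 9: 2, 10: 2, 11: 2}  # degree of 7 is 6; others assumed

def updated_sampling(neighbors, degree, k, rng=random):
    """Updated sampling: remove each sampled vertex and recompute the
    transition probabilities. Collision-free, but the recomputation after
    every sample is what the talk calls costly on a GPU."""
    pool = list(neighbors)
    sampled = set()
    for _ in range(k):
        weights = [degree[v] for v in pool]
        v = rng.choices(pool, weights=weights, k=1)[0]
        sampled.add(v)
        pool.remove(v)  # the "update" step: shrink the pool, reweight next draw
    return sampled

def repeated_sampling(neighbors, degree, k, rng=random):
    """Repeated sampling: keep re-drawing until k distinct neighbors are
    found. A duplicate draw is simply a wasted repetition, which is why the
    repetition count can grow for skewed biases."""
    sampled = set()
    weights = [degree[v] for v in neighbors]  # never updated
    while len(sampled) < k:
        sampled.add(rng.choices(neighbors, weights=weights, k=1)[0])
    return sampled

picked = repeated_sampling(neighbors, degree, k=2)
```

BRS sits between the two: like repeated sampling it never rebuilds the probability table, and like updated sampling a collided vertex is (virtually) taken out of play by adjusting the random number.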
As discussed earlier, real-world graphs can be very large. (N) The average memory of recent GPUs is 16 GB, with some V100 models having up to 32 GB. (N) The size of the graph in CSR format for graphs like Friendster and Twitter is 29 GB and 22 GB respectively, which is larger than the average memory size of recent GPUs. Even with a GPU memory size of 32 GB, graphs like ClueWeb12 and UK-2014 have more than 100 GB of space requirement. (N) This brings us to a third challenge: handling graphs larger than the memory size of the GPU. (N) For sampling and random walk, we observe that the entire graph is not required to be stored in GPU memory. We only need the active frontiers and their immediate neighbors for each step of sampling. (N) This motivates our solution: out-of-memory sampling with 1D partitioning.
Let’s see an example of out-of-memory sampling with C-SAW. (N) Assume we are sampling randomly generated source vertices 0, 2 and 8. (N) We partition the graph by equally assigning the vertices to different partitions P1, P2 and P3. Each color represents a partition in the figure. (N) At first, we have the frontiers on the CPU side. (N) Then, we determine the active frontier vertices for each partition. Here, P1, P2 and P3 have 2, 0 and 1 active vertices respectively. Our first optimization, (N) workload-aware scheduling, determines which partitions to sample based upon the workload. The partition with the higher workload is scheduled earlier, as this helps to reduce the overall partition transfers to the GPU. (N) Assuming we can only sample two partitions, we select P1 and P3. (N) Our next optimization, workload balancing, allocates computational resources based upon the workload. As the ratio of workload is 2:1 for P1 and P3, the computational resources are also allocated in the ratio of 2:1. (N) Then, the partitions are transferred to the GPU side for sampling. (N) Each partition has its own frontier queue in GPU memory. (N) After the first round of sampling, P1 has one frontier, P2 has two frontiers (7, 5), and P3 has 0 frontiers. Each selected partition continues sampling in an iteration as long as it has some frontiers. As P3 does not have any active frontier, sampling terminates for P3. (N) In the next round, only P1 is sampled, which results in the addition of one frontier vertex to P2, and sampling terminates for P1. As sampling for both P1 and P3 has terminated, one iteration is completed with 3 vertices in the frontier queue of P2. (N) For determining the next partitions and the resource allocation, the queue size of each partition is passed to the CPU side. (N) We repeat this process until the frontier is empty. With this out-of-memory sampling, we address challenge 3.
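The host-side decisions in this example, which partitions to transfer and how to split compute among them, can be sketched as below. The function names and the block-count resource model are hypothetical simplifications; the counts match the slide's example (P1, P2, P3 with 2, 0 and 1 active vertices, two partitions resident at once).

```python
def schedule_partitions(active_counts, max_resident):
    """Workload-aware scheduling: transfer the partitions with the most
    active frontier vertices first (they amortize the CPU-to-GPU transfer
    best), up to the number that fits in GPU memory. Partitions with no
    active frontiers are skipped entirely."""
    busy = [p for p, n in active_counts.items() if n > 0]
    busy.sort(key=lambda p: active_counts[p], reverse=True)
    return busy[:max_resident]

def balance_resources(selected, active_counts, total_blocks):
    """Workload balancing: split compute resources (modeled here as thread
    blocks) in proportion to each selected partition's active-frontier count."""
    total = sum(active_counts[p] for p in selected)
    return {p: total_blocks * active_counts[p] // total for p in selected}

# Slide example: P1, P2, P3 have 2, 0 and 1 active vertices, and only
# two partitions fit in GPU memory -> P1 and P3 are chosen, resources 2:1.
counts = {"P1": 2, "P2": 0, "P3": 1}
chosen = schedule_partitions(counts, max_resident=2)       # ["P1", "P3"]
blocks = balance_resources(chosen, counts, total_blocks=6)  # {"P1": 4, "P3": 2}
```

On the real system these queue sizes come back from the GPU after each round, so the same two decisions are re-made every iteration until all frontier queues are empty.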
Moving on to the evaluation, we perform all tests on the Summit supercomputer at Oak Ridge National Laboratory. (N) Each node is equipped with 6 V100 GPUs with 16 GB memory and dual-socket 22-core POWER9 CPUs. (N) We use 10 different datasets for evaluation. More details on the graph datasets and their sizes can be found in the paper. (N) For comparing our work with related works, we use sampled edges per second, or SEPS in short. It is computed as the total number of sampled edges divided by the total time for sampling.
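The SEPS metric is simple enough to compute directly; the numbers below are hypothetical, chosen only to illustrate the formula.

```python
def seps(sampled_edges, seconds):
    """Sampled edges per second: total sampled edges / total sampling time."""
    return sampled_edges / seconds

# Hypothetical run: 8 million edges sampled in 0.2 s -> 40 million SEPS.
print(seps(8_000_000, 0.2))  # 40000000.0
```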
For comparing with state-of-the-art methods, we compare our results with KnightKing and GraphSAINT. For KnightKing, we use biased random walk, and for GraphSAINT we use multi-dimensional random walk. (N) The length of the random walk is 2000 and the number of sampling instances or walks is 4000 for all comparisons. (N) The graphs show million sampled edges per second achieved for different datasets. Compared with KnightKing, we achieve 10x and 14.7x speedup with 1 GPU and 6 GPUs respectively. (N) Compared with GraphSAINT, we achieve 8.1x and 11.5x speedup with 1 GPU and 6 GPUs respectively. For multi-dimensional random walk, we use a frontier size of 2000.
Next, we compare the scalability of C-SAW with multiple GPUs using the neighbor sampling algorithm with 8000 instances of samples. (N) The graph shows the speedup achieved with different graph datasets. (N) We achieve up to 5.2x speedup with 6 GPUs. (N) A more detailed evaluation of C-SAW, along with profiling of each optimization, can be found in the paper.
In conclusion, C-SAW is the first generic GPU-based framework for both graph sampling and random walk. C-SAW outperforms the state-of-the-art implementations KnightKing and GraphSAINT by 14.7x and 11.5x respectively. C-SAW provides efficient out-of-memory sampling for handling larger graphs. For future improvements, we leave adding support for more sampling algorithms and improving the existing sampling techniques. The source code of C-SAW is open-sourced on GitHub. Please check out our paper if you find this work interesting.
This work was supported by NSF and the Department of Energy. Thank you.