This document discusses GPU-accelerated computing and programming with GPUs. It gives the characteristics of GPUs from Nvidia, AMD, and Intel, including core counts, memory size and bandwidth, and power consumption. It also outlines the steps for programming with GPUs: building and loading a GPU kernel, allocating device memory, transferring data from host to device memory, setting kernel arguments, enqueueing kernel execution, transferring results back, synchronizing the command queue, and releasing device memory. The goal is to achieve super-parallel execution with GPUs.
The column-oriented data structure of PG-Strom stores data in separate column storage (CS) tables based on the column type, with indexes to enable efficient lookups. This reduces data transfer compared to row-oriented storage and improves GPU parallelism by processing columns together.
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~, by Kohei KaiGai
GPU processing provides significant performance gains for PostgreSQL according to benchmarks. PG-Strom is an open source project that allows PostgreSQL to leverage GPUs for processing queries. It generates CUDA code from SQL queries to accelerate operations like scans, joins, and aggregations by massive parallel processing on GPU cores. Performance tests show orders of magnitude faster response times for queries involving multiple joins and aggregations when using PG-Strom compared to the regular PostgreSQL query executor. Further development aims to support more data types and functions for GPU processing.
This document discusses using GPUs and SSDs to accelerate PostgreSQL queries. It introduces PG-Strom, a project that generates CUDA code from SQL to execute queries massively in parallel on GPUs. The document proposes enhancing PG-Strom to directly transfer data from SSDs to GPUs without going through CPU/RAM, in order to filter and join tuples during loading for further acceleration. Challenges include improving the NVIDIA driver for NVMe devices and tracking shared buffer usage to avoid unnecessary transfers. The goal is to maximize query performance by leveraging the high bandwidth and parallelism of GPUs and SSDs.
1) The PG-Strom project aims to accelerate PostgreSQL queries using GPUs. It generates CUDA code from SQL queries and runs them on Nvidia GPUs for parallel processing.
2) Initial results show PG-Strom can be up to 10 times faster than PostgreSQL for queries involving large table joins and aggregations.
3) Future work includes better supporting columnar formats and integrating with PostgreSQL's native column storage to improve performance further.
This document describes using in-place computing on PostgreSQL to perform statistical analysis directly on data stored in a PostgreSQL database. Key points include:
- An F-test is used to compare the variances of accelerometer data from different phone models (Nexus 4 and S3 Mini) and activities (walking and biking).
- Performing the F-test directly in PostgreSQL via SQL queries is faster than exporting the data to an R script, as it avoids the overhead of data transfer.
- PG-Strom, an extension for PostgreSQL, is used to generate CUDA code on-the-fly to parallelize the variance calculations on a GPU, further speeding up the F-test.
PG-Strom - A FDW module utilizing GPU device, by Kohei KaiGai
PG-Strom is a module that utilizes GPUs to accelerate query processing in PostgreSQL. It uses a foreign data wrapper to push query execution to the GPU. Benchmark results show a query running 10 times faster on a table using the PG-Strom FDW compared to a regular PostgreSQL table. Future plans include supporting writable foreign tables, accelerating sort and aggregate operations using the GPU, and inheritance between regular and foreign tables. Help from the community is needed to review code, provide large real-world datasets, and understand common analytic queries.
The document discusses PG-Strom, an open source project that uses GPU acceleration for PostgreSQL. PG-Strom allows for automatic generation of GPU code from SQL queries, enabling transparent acceleration of operations like WHERE clauses, JOINs, and GROUP BY through thousands of GPU cores. It introduces PL/CUDA, which allows users to write custom CUDA kernels and integrate them with PostgreSQL for manual optimization of complex algorithms. A case study on k-nearest neighbor similarity search for drug discovery is presented to demonstrate PG-Strom's ability to accelerate computational workloads through GPU processing.
The document discusses graphics processing units (GPUs) and general-purpose GPU (GPGPU) computing. It explains that GPUs were originally designed for computer graphics but can now be used for general computations through GPGPU. The document outlines CUDA and MPI frameworks for programming GPGPU applications and discusses how GPGPU provides highly parallel processing that is much faster than traditional CPUs. Example applications mentioned include molecular dynamics, bioinformatics, and high performance computing.
PgOpenCL is a new PostgreSQL procedural language that allows developers to write OpenCL kernels to harness the parallel processing power of GPUs. It introduces a new execution model where tables can be copied to arrays, passed to an OpenCL kernel for parallel operations on the GPU, and results copied back to tables. This unlocks the potential for dramatically improved performance on compute-intensive database operations like joins, aggregations, and sorting.
PL/CUDA allows running CUDA C code directly in PostgreSQL user-defined functions. This allows advanced analytics and machine learning algorithms to be run directly in the database.
The gstore_fdw foreign data wrapper allows data to be stored directly in GPU memory, accessed via SQL, eliminating the overhead of copying data between CPU and GPU memory for each query.
Integrating PostgreSQL with GPU computing and machine learning frameworks allows for fast data exploration and model training by combining flexible SQL queries with high-performance analytics directly on the data.
Easy and High Performance GPU Programming for Java Programmers, by Kazuaki Ishizaki
IBM researchers presented techniques for executing Java programs on GPUs using IBM Java 8. Developers can write parallel programs using standard Java 8 stream APIs without annotations. The IBM Java runtime optimizes the programs for GPU execution by exploiting read-only caches, reducing data transfers between CPU and GPU, and eliminating redundant exception checks. Benchmark results showed the GPU version was 58.9x faster than single-threaded CPU code and 3.7x faster than 160-threaded CPU code on average, achieving good performance gains.
PG-Strom is an open source PostgreSQL extension that accelerates analytic queries using GPUs. Key features of version 2.0 include direct loading of data from SSDs to GPU memory for processing, an in-memory columnar data cache for efficient GPU querying, and a foreign data wrapper that allows data to be stored directly in GPU memory and queried using SQL. These features improve performance by reducing data movement and leveraging the GPU's parallel architecture. Benchmark results show the new version providing over 3.5x faster query throughput for large datasets compared to PostgreSQL alone.
PL/CUDA allows writing user-defined functions in CUDA C that can run on a GPU. This provides benefits for analytics workloads that can utilize thousands of GPU cores and wide memory bandwidth. A sample logistic regression implementation in PL/CUDA showed a 350x speedup compared to a CPU-based implementation in MADLib. Logistic regression performs binary classification by estimating weights for explanatory variables and intercept through iterative updates. This is well-suited to parallelization on a GPU.
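To make those iterative updates concrete, here is a minimal CPU-side sketch in C of one batch gradient-descent step; the logreg_step name and the plain gradient-descent rule are illustrative assumptions, not the MADLib or PL/CUDA implementation. Each row's gradient contribution is independent, which is exactly what a GPU version spreads across thousands of cores before a reduction.
#include <math.h>
#include <stddef.h>
#include <stdlib.h>

/* One gradient-descent update over an n x d dataset.
 * X is row-major, y holds 0/1 labels, w the weights, b the intercept. */
static void logreg_step(const double *X, const int *y,
                        double *w, double *b,
                        size_t n, size_t d, double lr)
{
    double gb = 0.0;
    double *gw = calloc(d, sizeof(double));   /* gradient accumulator */
    for (size_t i = 0; i < n; i++) {
        double z = *b;
        for (size_t j = 0; j < d; j++)
            z += w[j] * X[i * d + j];
        /* sigmoid prediction minus label: this row's gradient scale */
        double e = 1.0 / (1.0 + exp(-z)) - (double)y[i];
        for (size_t j = 0; j < d; j++)
            gw[j] += e * X[i * d + j];
        gb += e;
    }
    for (size_t j = 0; j < d; j++)            /* one iterative update */
        w[j] -= lr * gw[j] / n;
    *b -= lr * gb / n;
    free(gw);
}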
PG-Strom is an extension of PostgreSQL that utilizes GPUs and NVMe SSDs to enable terabyte-scale data processing and in-database analytics. It features SSD-to-GPU Direct SQL, which loads data directly from NVMe SSDs to GPUs using RDMA, bypassing CPU and RAM. This improves query performance by reducing I/O traffic over the PCIe bus. PG-Strom also uses Apache Arrow columnar storage format to further boost performance by transferring only referenced columns and enabling vector processing on GPUs. Benchmark results show PG-Strom can process over a billion rows per second on a simple 1U server configuration with an NVIDIA GPU and multiple NVMe SSDs.
Using GPUs to handle Big Data with Java, by Adam Roberts (J On The Beach)
Modern graphics processing units (GPUs) are efficient general-purpose stream processors. Learn how Java can exploit the power of GPUs to optimize high-performance enterprise and technical computing applications such as big data and analytics workloads. This presentation covers principles and considerations for GPU programming from Java and looks at the software stack and developer tools available. It also presents a demo showing GPU acceleration and discusses what is coming in the future.
Trip down the GPU lane with Machine Learning, by Renaldas Zioma
What a Machine Learning professional should know about GPUs!
Brief outline of the deck:
* GPU architecture explained with simple images
* memory bandwidth cheat-sheets for common hardware configurations
* overview of the GPU programming model
* an under-the-hood peek at the main building block of ML - matrix multiplication
* effect of mini-batch size on performance
Originally I gave this talk at the internal Machine Learning Workshop in Unity Seattle
HIGH QUALITY pdf slides: http://bit.ly/2iQxm7X (on Dropbox)
This document discusses using PostgreSQL and GPU acceleration to build a machine learning platform. It describes HeteroDB, which provides database and analytics acceleration using GPUs. It outlines how PostgreSQL's foreign data wrapper Gstore_fdw manages persistent GPU device memory, allowing data to remain on the GPU between queries for faster analytics. Gstore_fdw also enables inter-process data collaboration by allowing processes to share access to GPU memory using IPC handles. This facilitates integrating PostgreSQL with external analytics code in languages like Python.
GPUIterator: Bridging the Gap between Chapel and GPU Platforms, by Akihiro Hayashi
The ACM SIGPLAN 6th Annual Chapel Implementers and Users Workshop (CHIUW2019) co-located with PLDI 2019 / ACM FCRC 2019.
PGAS (Partitioned Global Address Space) programming models were originally designed to facilitate productive parallel programming at both the intra-node and inter-node levels in homogeneous parallel machines. However, there is a growing need to support accelerators, especially GPU accelerators, in heterogeneous nodes in a cluster. Among high-level PGAS programming languages, Chapel is well suited for this task due to its use of locales and domains to help abstract away low-level details of data and compute mappings for different compute nodes, as well as for different processing units (CPU vs. GPU) within a node. In this paper, we address some of the key limitations of past approaches on mapping Chapel onto GPUs as follows. First, we introduce a Chapel module, GPUIterator, which is a portable programming interface that supports GPU execution of a Chapel forall loop. This module makes it possible for Chapel programmers to easily use hand-tuned native GPU programs/libraries, which is an important requirement in practice since there is still a big performance gap between compiler-generated GPU code and hand-tuned GPU code; hand-optimization of CPU-GPU data transfers is also an important contributor to this performance gap. Second, though Chapel programs are regularly executed on multi-node clusters, past work on GPU enablement of Chapel programs mainly focused on single-node execution. In contrast, our work supports execution across multiple CPU+GPU nodes by accepting Chapel's distributed domains. Third, our approach supports hybrid execution of a Chapel parallel (forall) loop across both a GPU and CPU cores, which is beneficial for specific platforms. Our preliminary performance evaluations show that the use of the GPUIterator is a promising approach for Chapel programmers to easily utilize a single or multiple CPU+GPU node(s) while maintaining portability.
Accelerating Machine Learning Applications on Spark Using GPUs, by IBM
Matrix factorization (MF) is widely used in recommendation systems. We present cuMF, a highly-optimized matrix factorization tool with supreme performance on graphics processing units (GPUs) by fully utilizing the GPU compute power and minimizing the overhead of data movement. Firstly, we introduce a memory-optimized alternating least square (ALS) method by reducing discontiguous memory access and aggressively using registers to reduce memory latency. Secondly, we combine data parallelism with model parallelism to scale to multiple GPUs.
Results show that with up to four GPUs on one machine, cuMF can be up to ten times as fast as those on sizable clusters on large scale problems, and has impressively good performance when solving the largest matrix factorization problem ever reported.
Presented at the GPU Technology Conference 2012 in San Jose, California.
Tuesday, May 15, 2012.
Standards such as Scalable Vector Graphics (SVG), PostScript, TrueType outline fonts, and immersive web content such as Flash depend on a resolution-independent 2D rendering paradigm that GPUs have not traditionally accelerated. This tutorial explains a new opportunity to greatly accelerate vector graphics, path rendering, and immersive web standards using the GPU. By attending, you will learn how to write OpenGL applications that accelerate the full range of path rendering functionality. Not only will you learn how to render sophisticated 2D graphics with OpenGL, you will learn to mix such resolution-independent 2D rendering with 3D rendering and do so at dynamic, real-time rates.
This presentation describes the components of the GPU compute ecosystem, provides an overview of existing ecosystems, and contains a case study on NVIDIA Nsight.
Computational Techniques for the Statistical Analysis of Big Data in R, by herbps10
The document describes techniques for improving the computational performance of statistical analysis of big data in R. It uses as a case study the rlme package for rank-based regression of nested effects models. The workflow involves identifying bottlenecks, rewriting algorithms, benchmarking versions, and testing. Examples include replacing sorting with a faster C++ selection algorithm for the Wilcoxon Tau estimator, vectorizing a pairwise function, and preallocating memory for a covariance matrix calculation. The document suggests future directions like parallelization using MPI and GPUs to further optimize R for big data applications.
At StampedeCon 2014, John Tran of NVIDIA presented "GPUs in Big Data." Modern graphics processing units (GPUs) are massively parallel general-purpose processors that are taking Big Data by storm. In terms of power efficiency, compute density, and scalability, it is clear now that commodity GPUs are the future of parallel computing. In this talk, we will cover diverse examples of how GPUs are revolutionizing Big Data in fields such as machine learning, databases, genomics, and other computational sciences.
SIGGRAPH 2012: GPU-Accelerated 2D and Web Rendering, by Mark Kilgard
Video replay: http://nvidia.fullviewmedia.com/siggraph2012/ondemand/SS106.html
Location: West Hall Meeting Room 503, Los Angeles Convention Center
Date: Wednesday, August 8, 2012
Time: 2:40 PM – 3:40 PM
The future of GPU-based visual computing integrates the web, resolution-independent 2D graphics, and 3D to maximize interactivity and quality while minimizing consumed power. See what NVIDIA is doing today to accelerate resolution-independent 2D graphics for web content. This presentation explains NVIDIA's unique "stencil, then cover" approach to accelerating path rendering with OpenGL and demonstrates the wide variety of web content that can be accelerated with this approach.
More information: http://developer.nvidia.com/nv-path-rendering
Enabling Graph Analytics at Scale: The Opportunity for GPU-Acceleration of D..., by odsc
This document discusses opportunities for using GPU acceleration to improve the performance of data-parallel graph analytics. GPUs are well-suited for data-parallel workloads and can significantly speed up graph algorithms that exhibit data parallelism. The document was presented at the 2015 Open Data Science Conference in Boston.
In this video from SC13, Vinod Tipparaju presents an Heterogeneous System Architecture Overview.
"The HSA Foundation seeks to create applications that seamlessly blend scalar processing on the CPU, parallel processing on the GPU, and optimized processing on the DSP via high bandwidth shared memory access enabling greater application performance at low power consumption. The Foundation is defining key interfaces for parallel computation utilizing CPUs, GPUs, DSPs, and other programmable and fixed-function devices, thus supporting a diverse set of high-level programming languages and creating the next generation in general-purpose computing."
Learn more: http://hsafoundation.com/
Watch the video presentation: http://wp.me/p3RLHQ-aXk
PyData Amsterdam - Name Matching at Scale, by GoDataDriven
Wendell Kuling works as a Data Scientist at ING in the Wholesale Banking Advanced Analytics team. Their projects aim to provide better services to corporate customers of ING, by using innovative techniques from data-science. In this talk, Wendell covers key insights from their experience in matching large datasets based on names. After covering the key algorithms and packages ING uses for name matching, Wendell will share his best-practice approach in applying these algorithms at scale… would you bet on a Cruncher (48-CPU/512 MB RAM machine), a Tesla (CUDA Tesla K80 with 4992 cores, 24GB memory) or a Spark cluster (80 cores/2.5 TB memory)?
A brief intro to the problem and perspectives of OpenCL and distributed heterogeneous calculations with Hadoop. Prepared for Big Data Dive 2013 (Belarus Java User Group).
This document discusses deep learning and implementing deep belief networks on Hadoop and YARN. It introduces Adam Gibson and Josh Patterson who have worked on deep learning. It then explains what deep learning and deep belief networks are, and how DeepLearning4J implements them in Java on distributed systems using techniques like parameter averaging. Metrics show DeepLearning4J can train models faster and generalize better by distributing training across clusters. The document envisions using this system with GPUs and unlabeled data to train very large deep learning models.
From Machine Learning to Learning Machines: Creating an End-to-End Cognitive ..., by Spark Summit
This document discusses the evolution from traditional machine learning to learning machines. It outlines the machine learning process and highlights how learning machines enable continuous feedback and retraining through automated modeling. The key design principles of learning machines are presented as collaboration across roles, convergence of technologies, and simplicity through automation and intuitiveness. Examples are provided of how learning machines can power experiences and services.
DeepLearning4J and Spark: Successes and Challenges - François Garillot (sparktc)
Deeplearning4J is an open-source, distributed deep learning library written for Java and Scala. It provides tools for training neural networks on distributed systems. While large companies can distribute training across many servers, Deeplearning4J allows other organizations to do distributed training as well. It includes libraries for vectorization, linear algebra, data preprocessing, model definition and training. The library aims to make deep learning more accessible to enterprises by allowing them to train models on their own large datasets.
Containerizing GPU Applications with Docker for Scaling to the Cloud, by Subbu Rama
This document discusses containerizing GPU applications with Docker to enable scaling to the cloud. It describes how containers can solve problems of hardware and software portability by allowing applications to run consistently across different infrastructure. The document demonstrates how to build a GPU container using Dockerfiles and deploy it across multiple clouds. It also introduces Boost Containers which combine Bitfusion Boost technology with containers to build virtual GPU machines and clusters, enabling flexible scheduling of GPU workflows without code changes.
Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24..., by Chris Fregly
This document discusses TensorFrames, which bridges Spark and TensorFlow to enable data-parallel model training. TensorFrames allows Spark datasets to be used as input to TensorFlow models, and distributes the model training across Spark workers. The workers train on partitioned data in parallel and periodically aggregate results. This combines the benefits of Spark's distributed processing with TensorFlow's capabilities for neural networks and other machine learning models. A demo is provided of using TensorFrames in Python and Scala to perform distributed deep learning on Spark clusters.
The Potential of GPU-driven High Performance Data Analytics in Spark, by Spark Summit
This document discusses Andy Steinbach's presentation at Spark Summit Brussels on using GPUs to drive high performance data analytics in Spark. It summarizes that GPUs can help scale up compute intensive tasks and scale out data intensive tasks. Deep learning is highlighted as a new computing model that is being applied beyond just computer vision to areas like medicine, robotics, self-driving cars, and predictive analytics. GPU-powered systems like NVIDIA's DGX-1 are able to achieve superhuman performance for deep learning tasks by providing high memory bandwidth and FLOPS.
This document summarizes Timothée Hunter's presentation on TensorFrames, which allows running Google TensorFlow models on Apache Spark. Some key points:
- TensorFrames embeds TensorFlow into Spark to enable distributed numerical computing on big data. This leverages GPUs to speed up computationally intensive machine learning algorithms.
- An example demonstrates speedups from using TensorFrames and GPUs for kernel density estimation, a non-parametric statistical technique.
- Future improvements include better integration with Tungsten in Spark for direct memory copying and columnar storage to reduce communication costs.
GPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production Scale, by sparktc
1. GPU support in Spark allows for accelerating Spark applications by offloading compute-intensive tasks to GPUs. However, production deployments face challenges like low resource utilization and overload when scheduling mixed GPU and CPU workloads.
2. The presentation proposes solutions like recognizing GPU tasks to optimize the DAG and inserting new GPU stages. It also discusses policies for prioritizing and allocating GPU and CPU resources independently through multi-dimensional scheduling.
3. Evaluation shows the ALS Spark example achieving speedups on GPUs. IBM Spectrum Conductor provides a Spark-centric shared service with fine-grained resource scheduling, reducing wait times and improving utilization across shared GPU and CPU resources.
This document summarizes VPU and GPGPU computing technologies. It discusses that a VPU is a visual processing unit, also known as a GPU. GPUs have massively parallel architectures that allow them to perform better than CPUs for some complex computational tasks. The document then discusses GPU, PPU and GPGPU architectures, programming models like CUDA, and applications of GPGPU computing such as machine learning, robotics and scientific research.
This document summarizes VPU and GPGPU technologies. It discusses that a VPU is a visual processing unit, also known as a GPU. GPUs have massively parallel architectures that allow them to perform better than CPUs for some complex computational tasks. The document then discusses GPU architecture including stream processing, graphics pipelines, shaders, and GPU clusters. It provides an example of using CUDA for GPU computing and discusses how GPUs are used for general purpose computing through frameworks like CUDA.
This document summarizes VPU and GPGPU computing technologies. It discusses that a VPU is a visual processing unit, also known as a GPU. GPUs provide massively parallel and multithreaded processing capabilities. GPUs are now commonly used for general purpose computing due to their ability to handle complex computational tasks faster than CPUs in some cases. The document then discusses GPU and PPU architectures, programming models like CUDA, and applications of GPGPU computing such as machine learning, robotics, and scientific research.
The document discusses VPU and GPGPU computing. It explains that a VPU is a visual processing unit, also known as a GPU. GPUs are massively parallel and multithreaded processors that are better than CPUs for tasks like machine learning and graphics processing. The document then discusses GPU architecture, memory, and programming models like CUDA. It provides examples of GPU usage and concludes that GPGPU is used in fields like machine learning, robotics, and scientific computing.
PEER 1 Offers NVIDIA GPU to Accelerate High Performance Applications
PEER 1 has teamed up with NVIDIA, the creator of the GPU and a world leader in visual computing, to provide high performance GPU cloud applications. NVIDIA's GPUs are well known for making customer software run faster, and PEER 1 offers a number of services that run on NVIDIA's GPUs. PEER 1's cloud service is built on NVIDIA Tesla GPUs, delivering supercomputing performance in the cloud to solve much tougher problems. Click here to find out how PEER 1 and NVIDIA can transform your business.
This document discusses using GPUs for image processing instead of CPUs. It notes that GPUs have much higher peak performance than CPUs, growing from 5,000 triangles/second in 1995 to 350 million triangles/second in 2010. However, GPU programming is more complex than CPUs due to the different architecture and programming model. This can make it harder to implement algorithms on GPUs and to optimize for high efficiency. The document proposes a methodology for GPU acceleration including characterizing algorithms, estimating performance, using models like Roofline to analyze bottlenecks, and benchmarking. It also describes establishing a competence center to help others overcome the challenges of GPU programming.
PgOpenCL is a new PostgreSQL procedural language that allows developers to execute functions on GPUs using OpenCL. It provides a way to parallelize computations by distributing work across hundreds to thousands of threads. Functions are declared in OpenCL and compiled to binaries that can run efficiently on GPUs and other accelerated processors. This unlocks the massive parallel processing power of GPUs for complex analytics and other compute-intensive PostgreSQL queries and procedures.
GPU computing provides a way to access the power of massively parallel graphics processing units (GPUs) for general purpose computing. GPUs contain over 100 processing cores and can achieve over 500 gigaflops of performance. The CUDA programming model allows programmers to leverage this parallelism by executing compute kernels on the GPU from their existing C/C++ applications. This approach democratizes parallel computing by making highly parallel systems accessible through inexpensive GPUs in personal computers and workstations. Researchers can now explore manycore architectures and parallel algorithms using GPUs as a platform.
Congatec_Global Vendor for Innovative Embedded Solutions_Istanbul, by Işınsu Akçetin
The document discusses congatec, a company that provides computer-on-module solutions. It summarizes congatec's vision, products, partnerships, and support offerings. Specifically, it highlights congatec's COM Express and Qseven modules, cooling solutions, carrier boards, and a new digital signage controller.
Monte Carlo simulation is one of the most important numerical methods in financial derivative pricing and risk management. Due to the increasing sophistication of exotic derivative models, Monte Carlo becomes the method of choice for numerical implementations because of its flexibility in high-dimensional problems. However, the method of discretization of the underlying stochastic differential equation (SDE) has a significant effect on convergence. In addition the choice of computing platform and the exploitation of parallelism offers further efficiency gains. We consider here the effect of higher order discretization methods together with the possibilities opened up by the advent of programmable graphics processing units (GPUs) on the overall performance of Monte Carlo and quasi-Monte Carlo methods.
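To illustrate why the discretization scheme matters, here is a minimal C sketch contrasting the Euler and Milstein schemes for geometric Brownian motion; the gbm_path function and its signature are illustrative, and the standard-normal generator is assumed to be supplied by the caller. The Milstein correction term improves strong convergence from order 0.5 to order 1, which is exactly the kind of gain the abstract refers to, independent of the GPU speedup.
#include <math.h>

/* Simulates one path of dS = mu*S dt + sigma*S dW over [0, T].
 * gauss() must return independent standard-normal draws. */
static double gbm_path(double s0, double mu, double sigma,
                       double T, int steps, int milstein,
                       double (*gauss)(void))
{
    double dt = T / steps;
    double s = s0;
    for (int i = 0; i < steps; i++) {
        double dw = sqrt(dt) * gauss();           /* Brownian increment */
        double ds = mu * s * dt + sigma * s * dw; /* Euler term */
        if (milstein)                             /* Milstein correction */
            ds += 0.5 * sigma * sigma * s * (dw * dw - dt);
        s += ds;
    }
    return s;
}
On a GPU, each Monte Carlo path is independent, so thousands of gbm_path-style evaluations can run in parallel, one per thread.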
Congatec_Global Vendor for Innovative Embedded Solutions_Ankara, by Işınsu Akçetin
The document discusses congatec, a provider of computer-on-module technology. It presents congatec's vision as the preferred global vendor for innovative embedded solutions. The document provides details on congatec's product portfolio, partnerships, and software and hardware support offerings. It introduces some of congatec's computer-on-module products, including the DSC1 digital signage controller.
Graphics processing unit or GPU (also occasionally called visual processing unit or VPU) is a specialized microprocessor that offloads and accelerates graphics rendering from the central (micro)processor. Modern GPUs are very efficient at manipulating computer graphics, and their highly parallel structure makes them more effective than general-purpose CPUs for a range of complex algorithms. In a CPU, only a fraction of the chip does computation, whereas the GPU devotes more transistors to data processing.
GPGPU is a programming methodology based on modifying algorithms to run on existing GPU hardware for increased performance. Unfortunately, GPGPU programming is significantly more complex than traditional programming for several reasons.
The presentation will introduce Nvidia and the concept of GPU computing in the context of Financial Services industry. Customer successes are referenced where dramatic speed-ups in performance have been achieved.
Dustin Franklin (GPGPU Applications Engineer, GE Intelligent Platforms ) presents:
"GPUDirect support for RDMA provides low-latency interconnectivity between NVIDIA GPUs and various networking, storage, and FPGA devices. Discussion will include how the CUDA 5 technology increases GPU autonomy and promotes multi-GPU topologies with high GPU-to-CPU ratios. In addition to improved bandwidth and latency, the resulting increase in GFLOPS/watt poses a significant impact to both HPC and embedded applications. We will dig into scalable PCIe switch hierarchies, as well as software infrastructure to manage device interopability and GPUDirect streaming. Highlighting emerging architectures composed of Tegra-style SoCs that further decouple GPUs from discrete CPUs to achieve greater computational density."
Learn more at: http://www.gputechconf.com/page/home.html
Stream processing is a computer programming paradigm that allows for parallel processing of data streams. It involves applying the same kernel function to each element in a stream. Stream processing is suitable for applications involving large datasets where each data element can be processed independently, such as audio, video, and signal processing. Modern GPUs use a stream processing approach to achieve high performance by running kernels on multiple data elements simultaneously.
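A minimal C sketch of that pattern: one kernel function applied independently to every element of an input stream. The names here are illustrative; on a GPU each element would get its own thread, while here it is a plain loop.
#include <stddef.h>

typedef float (*kernel_fn)(float);

static float scale_half(float sample) { return 0.5f * sample; }  /* example kernel */

/* Apply the same kernel to every stream element; iterations are independent. */
static void run_stream(kernel_fn k, const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = k(in[i]);
}
Because no iteration depends on another, run_stream(scale_half, in, out, n) can be mapped onto a GPU by launching the kernel over all n elements at once.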
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA..., by Stefano Di Carlo
These slides have been presented by Dr. Alessandro Vallero at the IEEE VLSI Test Symposium, San Francisco, CA, USA (April 22-25, 2018).
General Purpose computing on Graphics Processing Units offers a remarkable speedup for data-parallel workloads by leveraging the computational power of GPUs. However, unlike graphics computing, it requires highly reliable operation in most application domains.
This presentation talks about a “Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA and AMD GPUs”. The work is the outcome of a collaboration between the TestGroup of Politecnico di Torino (http://www.testgroup.polito.it) and the Computer Architecture Lab of the University of Athens (dscal.di.uoa.gr) started under the FP7 Clereco Project (http://www.clereco.eu). It presents an extended study based on a consolidated workflow for the evaluation of the reliability in correlation with the performance of four GPU architectures and corresponding chips: AMD Southern Islands and NVIDIA G80/GT200/Fermi. We obtained reliability measurements (AVF and FIT) employing both fault injection and ACE-analysis based on microarchitecture-level simulators. Apart from the reliability-only and performance-only measurements, we propose combined metrics for performance and reliability (to quantify instruction throughput or task execution throughput between failures) that assist comparisons for the same application among GPU chips of different ISAs and vendors, as well as among benchmarks on the same GPU chip.
Watch the presentation at: https://youtu.be/GV5xRDgfCw4
Paper Information:
Alessandro Vallero∗, Sotiris Tselonis†, Dimitris Gizopoulos† and Stefano Di Carlo∗, “Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA and AMD GPUs”, IEEE VLSI Test Symposium 2018 (VTS 2018), San Francisco, CA (USA), April 22-25, 2018.
∗Politecnico di Torino, Italy. Email: stefano.dicarlo, alessandro.vallero@polito.it. †University of Athens, Greece. Email: dgizop@di.uoa.gr.
This document discusses using HyperLogLog (HLL) to estimate cardinality for count(distinct) queries in PostgreSQL.
HLL is an algorithm that uses constant memory to estimate the number of unique elements in a large set. It hashes each element, uses part of the hash to select a register, and keeps the maximum number of leading zeros observed in the remaining hash bits. The harmonic mean of the per-register values is then used to estimate cardinality.
PG-Strom implements HLL in PostgreSQL to enable fast count(distinct) queries on GPUs. On a table with 60 million rows and 87GB in size, HLL estimated the distinct count within 0.3% accuracy in just 9 seconds, over 40x faster than the regular count(distinct).
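For illustration, a minimal C sketch of the register update and harmonic-mean estimate just described; the register count, the hll_add/hll_estimate names, and the zero-counting direction (from the low end of the remaining bits, in this variant) are illustrative assumptions, not PG-Strom's actual implementation.
#include <stdint.h>

#define HLL_P 14                    /* 2^14 = 16384 registers */
#define HLL_M (1u << HLL_P)

typedef struct { uint8_t reg[HLL_M]; } hll_t;

/* Add one 64-bit hash: the low P bits pick a register; the register keeps
 * the maximum "rank" (position of the first 1-bit in the remaining bits). */
static void hll_add(hll_t *h, uint64_t hash)
{
    uint32_t idx = hash & (HLL_M - 1);
    uint64_t rest = hash >> HLL_P;
    uint8_t rank = 1;
    while (rank <= 64 - HLL_P && (rest & 1) == 0) {
        rank++;
        rest >>= 1;
    }
    if (rank > h->reg[idx])
        h->reg[idx] = rank;
}

/* Raw estimate: alpha * m^2 / sum(2^-reg[i]), i.e. a harmonic mean. */
static double hll_estimate(const hll_t *h)
{
    double sum = 0.0;
    for (uint32_t i = 0; i < HLL_M; i++)
        sum += 1.0 / (double)(1ull << h->reg[i]);
    double alpha = 0.7213 / (1.0 + 1.079 / HLL_M);
    return alpha * HLL_M * HLL_M / sum;
}
Because each register update touches only one fixed-size array slot, per-row updates parallelize well, which is what makes the GPU implementation attractive.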
This document provides an introduction to HeteroDB, Inc. and its chief architect, KaiGai Kohei. It discusses PG-Strom, an open source PostgreSQL extension developed by HeteroDB for high performance data processing using heterogeneous architectures like GPUs. PG-Strom uses techniques like SSD-to-GPU direct data transfer and a columnar data store to accelerate analytics and reporting workloads on terabyte-scale log data using GPUs and NVMe SSDs. Benchmark results show PG-Strom can process terabyte workloads at throughput nearing the hardware limit of the storage and network infrastructure.
2. Homogeneous vs Heterogeneous Computing
▌KPIs
 Computing Performance
 Power Consumption
 System Cost
 Variety of Applications
 Vendor Support
 Software Development
[Diagram: homogeneous computing grows by Scale-Up; heterogeneous computing combines Scale-Up with Scale-out (not a topic of today's talk).]
3. Characteristics of GPU (1/2)
                      Nvidia             AMD                Intel
                      Kepler             GCN                SandyBridge
Model                 GTX 680 (*)        FirePro S9000      Xeon E5-2690
                      (Q1/2012)          (Q3/2012)          (Q1/2012)
Number of Transistors 3.54 billion       4.3 billion        2.26 billion
Number of Cores       1536 (simple)      1792 (simple)      16 (functional)
Core clock            1006MHz            925MHz             2.9GHz
Peak FLOPS            3.01TFlops         3.23TFlops         185.6GFlops
Memory Size / Type    2GB, GDDR5         6GB, GDDR5         up to 768GB, DDR3
Memory Bandwidth      ~192GB/s           ~264GB/s           ~51.2GB/s
Power Consumption     ~195W              ~225W              ~135W
(*) Nvidia shall release the high-end model (Kepler K20) at Q4/2012
4. Characteristics of GPU (2/2)
Example) Zi = Xi + Yi (0 <= i <= n)
[Diagram: each element-wise addition X0+Y0=Z0, X1+Y1=Z1, ... Xn+Yn=Zn is assigned to a particular “core”, shown over Nvidia's GeForce GTX 680 block diagram (1536 CUDA cores).]
5. Programming with GPU (1/2)
Example) Parallel Execution of “sqrt(Xi^2 + Yi^2) < Zi”
GPU Code
__kernel void
sample_func(__global char result[], __global float x[],
            __global float y[], __global float z[])
{
    int i = get_global_id(0);
    /* “^” is bitwise XOR in OpenCL C, so the squares are written out;
       bool is not a valid kernel-argument type, hence char */
    result[i] = (sqrt(x[i]*x[i] + y[i]*y[i]) < z[i]);
}
Host Code
#define N (1<<20)
size_t g_itemsz = N;      /* global work size: the total number of work-items */
size_t l_itemsz = 1024;   /* local work size: work-items per work-group */
/* Acquire device memory and data transfer (host → device) */
X = clCreateBuffer(cxt, CL_MEM_READ_WRITE, sizeof(float)*N, NULL, &r);
clEnqueueWriteBuffer(cmdq, X, CL_TRUE, 0, sizeof(float)*N, ...);
/* Set argument of the kernel code */
clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&X);
/* Invoke device kernel */
clEnqueueNDRangeKernel(cmdq, kernel, 1, NULL, &g_itemsz, &l_itemsz, ...);
6-15. Programming with GPU (2/2)
1. Build & Load GPU Kernel
2. Allocate Device Memory
3. Enqueue DMA Transfer (host → device)
4. Setup Kernel Arguments
5. Enqueue Execution of GPU Kernel
6. Enqueue DMA Transfer (device → host)
7. Synchronize the command queue
8. Release Device Memory
[Diagram repeated across these slides: the source code passes through the OpenCL compiler to become the GPU kernel; the X, Y, Z and result buffers in host memory are connected via the command queue to device memory in device DRAM beside the L2 cache. A timeline shows the DMA transfer (host → device), the super-parallel execution of the GPU kernel, and the DMA transfer (device → host) queued back to back.]
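Put together as host code, those eight steps map onto the OpenCL API roughly as below. This is a minimal sketch, assuming the sample_func kernel from the earlier slide; the run_sample wrapper, the char result buffer, and the 1024-item work-group size are illustrative assumptions, and error handling is omitted.
#include <CL/cl.h>

#define N (1 << 20)

/* Walks the eight steps above. "source" holds the kernel source text;
 * x, y, z are N floats each; result receives one byte per element. */
int run_sample(const char *source,
               const float *x, const float *y, const float *z,
               char *result)
{
    cl_int rc;
    cl_platform_id platform;
    cl_device_id device;
    size_t g_itemsz = N;      /* total work-items */
    size_t l_itemsz = 1024;   /* assumes the device allows 1024 per group */

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &rc);
    cl_command_queue cmdq = clCreateCommandQueue(ctx, device, 0, &rc);

    /* 1. build & load the GPU kernel */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &source, NULL, &rc);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kern = clCreateKernel(prog, "sample_func", &rc);

    /* 2. allocate device memory (result is one char per element) */
    cl_mem R = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, N, NULL, &rc);
    cl_mem X = clCreateBuffer(ctx, CL_MEM_READ_ONLY, sizeof(float) * N, NULL, &rc);
    cl_mem Y = clCreateBuffer(ctx, CL_MEM_READ_ONLY, sizeof(float) * N, NULL, &rc);
    cl_mem Z = clCreateBuffer(ctx, CL_MEM_READ_ONLY, sizeof(float) * N, NULL, &rc);

    /* 3. enqueue DMA transfer (host → device); CL_FALSE = asynchronous */
    clEnqueueWriteBuffer(cmdq, X, CL_FALSE, 0, sizeof(float) * N, x, 0, NULL, NULL);
    clEnqueueWriteBuffer(cmdq, Y, CL_FALSE, 0, sizeof(float) * N, y, 0, NULL, NULL);
    clEnqueueWriteBuffer(cmdq, Z, CL_FALSE, 0, sizeof(float) * N, z, 0, NULL, NULL);

    /* 4. setup kernel arguments */
    clSetKernelArg(kern, 0, sizeof(cl_mem), &R);
    clSetKernelArg(kern, 1, sizeof(cl_mem), &X);
    clSetKernelArg(kern, 2, sizeof(cl_mem), &Y);
    clSetKernelArg(kern, 3, sizeof(cl_mem), &Z);

    /* 5. enqueue execution of the GPU kernel */
    clEnqueueNDRangeKernel(cmdq, kern, 1, NULL, &g_itemsz, &l_itemsz,
                           0, NULL, NULL);

    /* 6. enqueue DMA transfer (device → host) */
    clEnqueueReadBuffer(cmdq, R, CL_FALSE, 0, N, result, 0, NULL, NULL);

    /* 7. synchronize the command queue */
    clFinish(cmdq);

    /* 8. release device memory (and the other objects) */
    clReleaseMemObject(X);
    clReleaseMemObject(Y);
    clReleaseMemObject(Z);
    clReleaseMemObject(R);
    clReleaseKernel(kern);
    clReleaseProgram(prog);
    clReleaseCommandQueue(cmdq);
    clReleaseContext(ctx);
    return 0;
}
Everything between the write-buffer calls and clFinish() is enqueued without blocking, which is what lets the DMA transfers and the kernel execution overlap with other work on the CPU.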
16. Basic idea to utilize GPU
Simultaneous (asynchronous) execution of CPU and GPU
Minimization of data transfer between host and device
[Diagram: the on-host buffer sits in host memory (DDR3-1600, 51.2GB/s) next to the CPU; a DMA transfer over PCI-E 3.0 x16 (32.0GB/s) through the IO HUB fills the on-device buffer in device DRAM (GDDR5, 192.2GB/s) of the non-integrated GPGPU for super-parallel execution; storage attaches to the HBA via SAS 2.0 (600MB/s).]
17. Back to the PostgreSQL world
Did I forget that I'm talking at PGconf.EU 2012?
18. Re-definition of SQL/MED
▌SQL/MED (Management of External Data)
External data sources performing as if they were regular tables
Not only “management”, but external computing resources also
[Diagram: an SQL query flows through the Query Parser, Query Planner, and Query Executor; regular tables are executed against regular storage, while foreign tables are executed through FDWs such as a MySQL FDW, an Oracle FDW, and the PG-Strom FDW.]
19. Introduction of PG-Strom
▌PG-Strom is ...
An FDW extension of PostgreSQL, released under the GPL v3.
https://github.com/kaigai/pg_strom
Not a stable module yet; please don't use it in production systems.
Designed to utilize GPU devices for CPU off-load according to their characteristics.
▌Key features of PG-Strom
Just-in-time pseudo code generation for GPU execution
Column-oriented internal data structure
Asynchronous query execution
Dramatic reduction of response time!
20. Asynchronous Execution using CPU/GPU (1/2)
▌CPU characteristics
Complex instructions, less parallelism
Expensive, high power consumption per core
I/O capability
▌GPU characteristics
Simple instructions, much parallelism
Cheap, low power consumption per core
Device memory access only (except for integrated GPUs)
▌“Best Mix” strategy of PG-Strom
CPU focuses on I/O and control stuff.
GPU focuses on calculation stuff.
21. Asynchronous Execution using CPU/GPU (2/2)
[Diagram comparing vanilla PostgreSQL with PostgreSQL + PG-Strom: vanilla PostgreSQL iterates on the CPU over scanned tuples, evaluating the qualifiers one by one; with PG-Strom, the CPU scans a larger “chunk” of the database at once, memory transfer and qualifier execution run asynchronously on the GPU, and after synchronization the query finishes earlier than the “only CPU” scan. Legend: scan tuples on shared buffers; execution of the qualifiers.]
22. So what? How fast is it?
postgres=# SELECT COUNT(*) FROM rtbl
WHERE sqrt((x-256)^2 + (y-128)^2) < 40;
count
--------
100467
(1 row)
Time: 7668.684 ms
postgres=# SELECT COUNT(*) FROM ftbl
WHERE sqrt((x-256)^2 + (y-128)^2) < 40;
count
--------
100467
(1 row)
Accelerated!
Time: 857.298 ms
CPU: Xeon E5-2670 (2.60GHz), GPU: NVIDIA GeForce GT640, RAM: 384GB
Both the regular table rtbl and the PG-Strom foreign table ftbl contain 20 million rows with the same values
23. Architecture of PG-Strom
[Diagram: in the world of CPU, regular tables are paired with shadow tables, whose data is read through the shared buffers into chunks; a chunk contains both data and the pseudo code generated from the plan (SeqScan etc.), and ForeignScan hands results to the Query Executor inside each PostgreSQL backend. In the world of GPU, the PG-Strom GPU control server, an extra daemon started by the postmaster via a preload function, performs async DMA transfers into GPU device memory, monitors completion events, and drives super-parallel execution of the GPU kernel, which works according to the given pseudo code]
24. Pseudo code generation (1/2)
SELECT * FROM ftbl WHERE
    c like ‘%xyz%’ AND sqrt((x-256)^2 + (y-100)^2) < 10;
The LIKE clause contains operators / functions unsupported on the GPU, so it is left to the CPU; the arithmetic qualifier is translated to pseudo code for super-parallel execution by the GPU kernel function:
xreg10 = $(ftbl.x)
xreg12 = 256.000000::double
xreg8  = (xreg10 - xreg12)
xreg10 = 2.000000::double
xreg6  = pow(xreg8, xreg10)
xreg12 = $(ftbl.y)
xreg14 = 128.000000::double
:
25. Pseudo code generation (2/2)
Generally, we should avoid branch operations in GPU code, such as:

result = 0;
if (condition)
{
    result = a + b;
}
else
{
    result = a - b;
}
return 2 * result;

The pre-built GPU kernel instead interprets the given pseudo code as a flat command stream:

__global__
void kernel_qual(const int commands[], ...)
{
    const int *cmd = commands;
    :
    while (*cmd != GPUCMD_TERMINAL_COMMAND)
    {
        switch (*cmd)
        {
            case GPUCMD_CONREF_INT4:
                regs[*(cmd+1)] = *(cmd + 2);
                cmd += 3;
                break;
            case GPUCMD_VARREF_INT4:
                VARREF_TEMPLATE(cmd, uint);
                break;
            case GPUCMD_OPER_INT4_PL:
                OPER_ADD_TEMPLATE(cmd, int);
                break;
            :
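To make the command stream concrete, here is how an expression like x + 42 could be flattened; the opcode values and operand layouts below are invented for illustration (the real encodings live in the pg_strom sources). Since every thread walks the same command array in the same order, the interpreter's while/switch branches identically across a warp, avoiding the divergence that data-dependent if/else would cause.

/* Opcode values invented for illustration */
enum {
    GPUCMD_TERMINAL_COMMAND = 0,
    GPUCMD_CONREF_INT4,     /* regs[cmd[1]] = cmd[2]                      */
    GPUCMD_VARREF_INT4,     /* regs[cmd[1]] = column value of this thread */
    GPUCMD_OPER_INT4_PL,    /* regs[cmd[1]] = regs[cmd[2]] + regs[cmd[3]] */
};

/* "x + 42" as a flat command stream handed to kernel_qual() */
static const int commands[] = {
    GPUCMD_VARREF_INT4,  0,         /* reg0 = $(ftbl.x)    */
    GPUCMD_CONREF_INT4,  1, 42,     /* reg1 = 42           */
    GPUCMD_OPER_INT4_PL, 2, 0, 1,   /* reg2 = reg0 + reg1  */
    GPUCMD_TERMINAL_COMMAND
};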
30. OT: Why “pseudo”, not native code
Initial design as of Jan-2012:
[Diagram: the PG-Strom module hooks the Query Parser, Query Planner and Query Executor of the PostgreSQL core; a run-time GPU code generator emits GPU source for the qualifiers, nvcc compiles it into a GPU binary (cached in a pre-compiled binary cache), and the PG-Strom executor loads the binary, async-memcpys the columns used by the qualifiers and the target-list from the pg_strom schema of the regular database, and executes the scan kernels on the GPU]
The catch: invoking nvcc to compile GPU source at query run-time is expensive, hence the move to pseudo code interpreted by a pre-built kernel.
31. Save the bandwidth of PCI-Express bus
E.g.) SELECT name, tel, email, address FROM address_book
WHERE sqrt((pos_x - 24.5)^2 + (pos_y - 52.3)^2) < 10;
It makes no sense to fetch columns that are not used.
[Diagram: CPU/GPU timelines with and without column projection, each ending in a synchronization. Legend: scan tuples on the shared buffers / execution of the qualifiers / columns not used by the qualifiers. Shipping only the referenced columns reduces the data size transferred via PCI-E]
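Sketching the example above in column-oriented form: only the pos_x and pos_y columns cross the bus (16 bytes per row instead of the full row with name, tel, email and address), and only 1-byte match flags come back; the CPU then fetches the other columns just for the matching rows. Names are illustrative, not PG-Strom's actual code.

#include <cuda_runtime.h>

__global__ void eval_distance_qual(const double *pos_x, const double *pos_y,
                                   char *match, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        match[i] = (sqrt(pow(pos_x[i] - 24.5, 2.0) +
                         pow(pos_y[i] - 52.3, 2.0)) < 10.0);
}

void filter_chunk(const double *host_pos_x, const double *host_pos_y,
                  char *host_match, int n, cudaStream_t stream)
{
    double *dev_x, *dev_y;
    char   *dev_match;

    cudaMalloc((void **) &dev_x, n * sizeof(double));
    cudaMalloc((void **) &dev_y, n * sizeof(double));
    cudaMalloc((void **) &dev_match, n);

    /* ship only the two referenced columns, not whole tuples */
    cudaMemcpyAsync(dev_x, host_pos_x, n * sizeof(double),
                    cudaMemcpyHostToDevice, stream);
    cudaMemcpyAsync(dev_y, host_pos_y, n * sizeof(double),
                    cudaMemcpyHostToDevice, stream);
    eval_distance_qual<<<(n + 255) / 256, 256, 0, stream>>>(dev_x, dev_y, dev_match, n);
    cudaMemcpyAsync(host_match, dev_match, n, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    cudaFree(dev_x);
    cudaFree(dev_y);
    cudaFree(dev_match);
}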
36. Key features towards upcoming v9.3 (1/2)
▌Extra Daemon
It enables extensions to manage background worker processes.
A prerequisite for implementing PG-Strom’s GPU control server
Alvaro submitted this patch to CommitFest:Nov.
[Diagram: the postmaster manages the built-in background daemons (autovacuum, bgwriter, ...), the PostgreSQL backends, and an extension-provided extra daemon (the GPU controller); all of them share resources such as the DB cluster, shared memory and IPC]
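As a rough sketch of what this patch enables, following the background-worker API roughly as it landed in 9.3 (the struct layout changed during development, so treat the field names as approximate; gpu_control_main is a hypothetical entry point):

#include "postgres.h"
#include "postmaster/bgworker.h"

void _PG_init(void);

static void
gpu_control_main(Datum main_arg)
{
    /* open the GPU device, then loop serving requests from the backends */
}

void
_PG_init(void)
{
    BackgroundWorker worker;

    MemSet(&worker, 0, sizeof(worker));
    worker.bgw_name = "PG-Strom GPU control server";
    worker.bgw_flags = BGWORKER_SHMEM_ACCESS;          /* share resources */
    worker.bgw_start_time = BgWorkerStart_PostmasterStart;
    worker.bgw_restart_time = BGW_NEVER_RESTART;
    worker.bgw_main = gpu_control_main;                /* 9.3-style entry point */
    worker.bgw_main_arg = (Datum) 0;
    RegisterBackgroundWorker(&worker);                 /* managed by the postmaster */
}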
37. Key features towards upcoming v9.3 (2/2)
▌Writable Foreign Table
It enables the use of the usual INSERT, UPDATE or DELETE to modify foreign
tables managed by PG-Strom.
KaiGai submitted a proof-of-concept patch to CommitFest:Sep.
An in-core postgresql_fdw is needed as a working example.
[Diagram: the planner’s create_foreignscan_plan makes the FDW issue “SELECT rowid, * FROM ... WHERE ... FOR UPDATE” against the remote data source; the executor FETCHes rows through ExecForeignScan and ExecQual, then ExecModifyTable writes changes back with “UPDATE ... WHERE rowid = xxx”]
38. More Rapidness (1/2) – Parallel Data Load
[Diagram: same architecture as slide 23, but several PG-Strom data loaders run in parallel to fill the next chunk to be loaded from the shadow tables through the shared buffers, while async DMA transfers feed GPU device memory; the PG-Strom GPU control server (started by the postmaster via a preload function) monitors events and drives super-parallel execution of the GPU kernel according to the given pseudo code]
39. More Rapidness (2/2) – TargetList Push-down
SELECT ((a + b) * (c - d))^2 FROM ftbl;
    ... is executed as ...
SELECT pseudo_col FROM ftbl;

 a | b | c | d | pseudo_col
---+---+---+---+-----------
 1 | 2 | 3 | 4 |          9
 3 | 1 | 4 | 1 |        144
 2 | 4 | 1 | 4 |        324
 2 | 2 | 3 | 6 |        144
 : | : | : | : |          :

pseudo_col is computed during the ForeignScan.
The pseudo column holds the “computed” result, to be just referenced;
it performs as if extra columns existed in addition to the table definition.
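A sketch of what gets pushed down for the query above: one GPU thread per row fills the pseudo column from the four source columns (the kernel is hand-written here for illustration; PG-Strom actually expresses this as pseudo code):

__global__ void compute_pseudo_col(const int *a, const int *b,
                                   const int *c, const int *d,
                                   int *pseudo_col, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int v = (a[i] + b[i]) * (c[i] - d[i]);
        pseudo_col[i] = v * v;          /* ((a + b) * (c - d))^2 */
    }
}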
40. We need you to get involved
▌The project was launched out of my personal curiosity,
▌so it is uncertain how well PG-Strom fits “real-life” workloads.
▌We definitely have to find attractive use cases for PG-Strom
Which area?
Which problem?
How to solve it?
41. Summary
▌Characteristics of GPU device
Inflexible instructions, but much higher parallelism
Low cost & power consumption per unit of computing capability
▌PG-Strom
Utilization of GPU device for CPU off-load and rapid response
Just-in-time pseudo code generation according to the given query
Column-oriented data structure for data density on the PCI-Express bus
As a result, dramatically shorter response times
▌Upcoming development
Upstream
• Extra daemons, Writable Foreign Tables
Extension
• Move to OpenCL rather than CUDA
▌Your involvement can shape the future evolution of PG-Strom