The document discusses PG-Strom, an open source project that uses GPU acceleration for PostgreSQL. PG-Strom allows for automatic generation of GPU code from SQL queries, enabling transparent acceleration of operations like WHERE clauses, JOINs, and GROUP BY through thousands of GPU cores. It introduces PL/CUDA, which allows users to write custom CUDA kernels and integrate them with PostgreSQL for manual optimization of complex algorithms. A case study on k-nearest neighbor similarity search for drug discovery is presented to demonstrate PG-Strom's ability to accelerate computational workloads through GPU processing.
1) The PG-Strom project aims to accelerate PostgreSQL queries using GPUs. It generates CUDA code from SQL queries and runs them on Nvidia GPUs for parallel processing.
2) Initial results show PG-Strom can be up to 10 times faster than PostgreSQL for queries involving large table joins and aggregations.
3) Future work includes better supporting columnar formats and integrating with PostgreSQL's native column storage to improve performance further.
The column-oriented data structure of PG-Strom stores data in separate column storage (CS) tables based on the column type, with indexes to enable efficient lookups. This reduces data transfer compared to row-oriented storage and improves GPU parallelism by processing columns together.
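The access-pattern difference described above can be sketched in plain Python; the layouts below are illustrative toys, not PG-Strom's actual on-disk format.

```python
# Toy illustration of row- vs column-oriented access for one aggregated column.
rows = [{"id": i, "price": float(i), "note": "x" * 100} for i in range(1000)]

# Row store: every row object (including the wide "note" field) is touched
# just to aggregate the single "price" column.
row_total = sum(r["price"] for r in rows)

# Column store: "price" lives in its own contiguous array, so only that
# array is read -- which is also the shape GPU threads want for coalesced,
# parallel access.
price_col = [float(i) for i in range(1000)]
col_total = sum(price_col)
```

Both totals agree; the difference is how much unrelated data had to be moved to compute them.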
This document describes using in-place computing on PostgreSQL to perform statistical analysis directly on data stored in a PostgreSQL database. Key points include:
- An F-test is used to compare the variances of accelerometer data from different phone models (Nexus 4 and S3 Mini) and activities (walking and biking).
- Performing the F-test directly in PostgreSQL via SQL queries is faster than exporting the data to an R script, as it avoids the overhead of data transfer.
- PG-Strom, an extension for PostgreSQL, is used to generate CUDA code on-the-fly to parallelize the variance calculations on a GPU, further speeding up the F-test.
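The variance comparison at the heart of the F-test above can be sketched in a few lines of stdlib Python; the sample values are made up, standing in for the accelerometer series.

```python
import statistics

# Minimal sketch of a two-sample F-test statistic: the ratio of the two
# sample variances. The data below is illustrative, not the real
# accelerometer readings from the document.
walking = [1.0, 3.0, 5.0, 7.0]   # higher-variance activity
biking = [4.0, 5.0, 6.0]         # lower-variance activity

f_stat = statistics.variance(walking) / statistics.variance(biking)
# degrees of freedom for the F distribution:
dof = (len(walking) - 1, len(biking) - 1)
```

In SQL this is just two `var_samp()` aggregates and a division, which is exactly the shape of computation PG-Strom can push down to GPU cores.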
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~ (Kohei KaiGai)
GPU processing provides significant performance gains for PostgreSQL according to benchmarks. PG-Strom is an open source project that allows PostgreSQL to leverage GPUs for processing queries. It generates CUDA code from SQL queries to accelerate operations like scans, joins, and aggregations by massive parallel processing on GPU cores. Performance tests show orders of magnitude faster response times for queries involving multiple joins and aggregations when using PG-Strom compared to the regular PostgreSQL query executor. Further development aims to support more data types and functions for GPU processing.
This document discusses using GPUs and SSDs to accelerate PostgreSQL queries. It introduces PG-Strom, a project that generates CUDA code from SQL to execute queries massively in parallel on GPUs. The document proposes enhancing PG-Strom to directly transfer data from SSDs to GPUs without going through CPU/RAM, in order to filter and join tuples during loading for further acceleration. Challenges include improving the NVIDIA driver for NVMe devices and tracking shared buffer usage to avoid unnecessary transfers. The goal is to maximize query performance by leveraging the high bandwidth and parallelism of GPUs and SSDs.
This document discusses GPU accelerated computing and programming with GPUs. It provides characteristics of GPUs from Nvidia, AMD, and Intel including number of cores, memory size and bandwidth, and power consumption. It also outlines the 7 steps for programming with GPUs which include building and loading a GPU kernel, allocating device memory, transferring data between host and device memory, setting kernel arguments, enqueueing kernel execution, transferring results back, and synchronizing the command queue. The goal is to achieve super parallel execution with GPUs.
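The seven-step flow above can be mirrored in a plain-Python mock. Everything here (the `MockDevice` class and its methods) is a hypothetical stand-in that only traces the control flow of a real GPU API such as OpenCL or CUDA; it performs no actual device work.

```python
# Illustrative mock of the 7-step GPU programming flow; names are invented.
class MockDevice:
    def __init__(self):
        self.memory = {}   # emulates device memory buffers
        self.queue = []    # emulates the command queue

    def build_kernel(self, fn):               # step 1: build & load kernel
        return fn                              # here the "kernel" is a Python fn

    def alloc_and_copy(self, name, data):      # steps 2-3: allocate + host->device
        self.memory[name] = list(data)

    def enqueue(self, kernel, args):           # steps 4-5: set args + enqueue
        self.queue.append((kernel, args))

    def read_back(self, name):                 # step 6: device->host transfer
        return self.memory[name]

    def synchronize(self):                     # step 7: drain the command queue
        while self.queue:
            kernel, (src, dst) = self.queue.pop(0)
            self.memory[dst] = [kernel(x) for x in self.memory[src]]

dev = MockDevice()
square = dev.build_kernel(lambda x: x * x)
dev.alloc_and_copy("in", [1, 2, 3, 4])
dev.alloc_and_copy("out", [0, 0, 0, 0])
dev.enqueue(square, ("in", "out"))
dev.synchronize()
result = dev.read_back("out")
```

The point of the seven steps is that every one of them except kernel execution is host-side bookkeeping, which is exactly the overhead later sections (gstore_fdw, SSD-to-GPU direct transfer) try to eliminate.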
PL/CUDA allows running CUDA C code directly in PostgreSQL user-defined functions. This allows advanced analytics and machine learning algorithms to be run directly in the database.
The gstore_fdw foreign data wrapper allows data to be stored directly in GPU memory, accessed via SQL, eliminating the overhead of copying data between CPU and GPU memory for each query.
Integrating PostgreSQL with GPU computing and machine learning frameworks allows for fast data exploration and model training by combining flexible SQL queries with high-performance analytics directly on the data.
PG-Strom is an open source PostgreSQL extension that accelerates analytic queries using GPUs. Key features of version 2.0 include direct loading of data from SSDs to GPU memory for processing, an in-memory columnar data cache for efficient GPU querying, and a foreign data wrapper that allows data to be stored directly in GPU memory and queried using SQL. These features improve performance by reducing data movement and leveraging the GPU's parallel architecture. Benchmark results show the new version providing over 3.5x faster query throughput for large datasets compared to PostgreSQL alone.
PG-Strom is an extension of PostgreSQL that utilizes GPUs and NVMe SSDs to enable terabyte-scale data processing and in-database analytics. It features SSD-to-GPU Direct SQL, which loads data directly from NVMe SSDs to GPUs using RDMA, bypassing CPU and RAM. This improves query performance by reducing I/O traffic over the PCIe bus. PG-Strom also uses Apache Arrow columnar storage format to further boost performance by transferring only referenced columns and enabling vector processing on GPUs. Benchmark results show PG-Strom can process over a billion rows per second on a simple 1U server configuration with an NVIDIA GPU and multiple NVMe SSDs.
PL/CUDA allows writing user-defined functions in CUDA C that can run on a GPU. This provides benefits for analytics workloads that can utilize thousands of GPU cores and wide memory bandwidth. A sample logistic regression implementation in PL/CUDA showed a 350x speedup compared to a CPU-based implementation in MADLib. Logistic regression performs binary classification by estimating weights for explanatory variables and intercept through iterative updates. This is well-suited to parallelization on a GPU.
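The iterative weight updates described above amount to gradient descent on the logistic loss; a minimal pure-Python sketch follows. The data, learning rate, and iteration count are made up for illustration, and the real PL/CUDA version runs these per-row terms across thousands of GPU threads instead of a loop.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, iters=2000):
    # one explanatory variable plus an intercept: score = w[0] + w[1] * x
    w = [0.0, 0.0]
    n = len(xs)
    for _ in range(iters):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w[0] + w[1] * x) - y  # prediction error per row
            g0 += err                            # gradient wrt intercept
            g1 += err * x                        # gradient wrt weight
        w[0] -= lr * g0 / n                      # iterative update step
        w[1] -= lr * g1 / n
    return w

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 0, 1, 1, 1]                          # separable near x = 2.5
w = fit_logistic(xs, ys)
preds = [1 if sigmoid(w[0] + w[1] * x) > 0.5 else 0 for x in xs]
```

The per-row error and gradient terms are independent, which is why this workload maps so cleanly onto a GPU reduction.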
PgOpenCL is a new PostgreSQL procedural language that allows developers to write OpenCL kernels to harness the parallel processing power of GPUs. It introduces a new execution model where tables can be copied to arrays, passed to an OpenCL kernel for parallel operations on the GPU, and results copied back to tables. This unlocks the potential for dramatically improved performance on compute-intensive database operations like joins, aggregations, and sorting.
PG-Strom - A FDW module utilizing GPU device (Kohei KaiGai)
PG-Strom is a module that utilizes GPUs to accelerate query processing in PostgreSQL. It uses a foreign data wrapper to push query execution to the GPU. Benchmark results show a query running 10 times faster on a table using the PG-Strom FDW compared to a regular PostgreSQL table. Future plans include supporting writable foreign tables, accelerating sort and aggregate operations using the GPU, and inheritance between regular and foreign tables. Help from the community is needed to review code, provide large real-world datasets, and understand common analytic queries.
This document discusses using PostgreSQL and GPU acceleration to build a machine learning platform. It describes HeteroDB, which provides database and analytics acceleration using GPUs. It outlines how PostgreSQL's foreign data wrapper Gstore_fdw manages persistent GPU device memory, allowing data to remain on the GPU between queries for faster analytics. Gstore_fdw also enables inter-process data collaboration by allowing processes to share access to GPU memory using IPC handles. This facilitates integrating PostgreSQL with external analytics code in languages like Python.
1) The document describes a presentation on using SSD-to-GPU direct SQL to accelerate PostgreSQL by bypassing CPU/RAM and loading data directly from NVMe SSD to GPU.
2) Benchmark results show the technique achieved query execution performance up to 13.5GB/s, close to the raw I/O limitation of the hardware configuration.
3) One challenge discussed is performing partition-wise joins efficiently across multiple SSDs and GPUs to avoid gathering large results sets.
The document discusses graphics processing units (GPUs) and general-purpose GPU (GPGPU) computing. It explains that GPUs were originally designed for computer graphics but can now be used for general computations through GPGPU. The document outlines CUDA and MPI frameworks for programming GPGPU applications and discusses how GPGPU provides highly parallel processing that is much faster than traditional CPUs. Example applications mentioned include molecular dynamics, bioinformatics, and high performance computing.
GPU Accelerated Data Science with RAPIDS - ODSC West 2020 (John Zedlewski)
This document provides an overview of RAPIDS, an open source suite of libraries for GPU-accelerated data science. It discusses how RAPIDS uses GPUs to accelerate ETL, machine learning, and other data science workflows. Key points include:
- RAPIDS includes libraries like cuDF for dataframes, cuML for machine learning, and cuGraph for graph analytics. It aims to provide familiar Python APIs for these tasks.
- cuDF provides over 10x speedups for ETL tasks like data loading, transformations, and feature engineering by keeping data on the GPU.
- cuML provides GPU-accelerated versions of popular scikit-learn algorithms such as linear regression and random forests.
20181116 Massive Log Processing using I/O optimized PostgreSQL (Kohei KaiGai)
The document describes a technology called PG-Strom that uses GPU acceleration to optimize I/O performance for PostgreSQL. PG-Strom allows data to be transferred directly from NVMe SSDs to the GPU over the PCIe bus, bypassing the CPU and RAM. This reduces data movement and allows PostgreSQL queries to be partially executed directly on the GPU. Benchmark results show the approach can achieve throughput close to the theoretical hardware limits for a single server configuration processing large datasets.
This document summarizes Nvidia's GPU technology conference (GTC16) including announcements about their Tesla P100 GPU and DGX-1 deep learning supercomputer. Key points include:
- The new Tesla P100 GPU delivers up to 21 teraflops of performance for deep learning and uses new technologies like NVLink, HBM2 memory, and a page migration engine.
- The Nvidia DGX-1 is a deep learning supercomputer powered by 8 Tesla P100 GPUs with over 170 teraflops of performance for training neural networks.
- CUDA 8 and unified memory improvements on the P100 enable simpler programming and larger datasets by allowing allocations beyond GPU memory size.
This document provides an introduction to accelerators such as GPUs and Intel Xeon Phi. It discusses the architecture and programming of GPUs using CUDA. GPUs are massively parallel many-core processors designed for graphics processing but now used for general purpose computing. They provide much higher floating point performance than CPUs. The document outlines GPU memory architecture and programming using CUDA. It also provides an overview of Intel Xeon Phi which contains over 50 simple CPU cores for highly parallel workloads.
This document provides an introduction to HeteroDB, Inc. and its chief architect, KaiGai Kohei. It discusses PG-Strom, an open source PostgreSQL extension developed by HeteroDB for high performance data processing using heterogeneous architectures like GPUs. PG-Strom uses techniques like SSD-to-GPU direct data transfer and a columnar data store to accelerate analytics and reporting workloads on terabyte-scale log data using GPUs and NVMe SSDs. Benchmark results show PG-Strom can process terabyte workloads at throughput nearing the hardware limit of the storage and network infrastructure.
In this deck from the UK HPC Conference, Gunter Roeth from NVIDIA presents: Hardware & Software Platforms for HPC, AI and ML.
"Data is driving the transformation of industries around the world and a new generation of AI applications are effectively becoming programs that write software, powered by data, vs by computer programmers. Today, NVIDIA’s tensor core GPU sits at the core of most AI, ML and HPC applications, and NVIDIA software surrounds every level of such a modern application, from CUDA and libraries like cuDNN and NCCL embedded in every deep learning framework and optimized and delivered via the NVIDIA GPU Cloud to reference architectures designed to streamline the deployment of large scale infrastructures."
Watch the video: https://wp.me/p3RLHQ-l2Y
Learn more: http://nvidia.com
and
http://hpcadvisorycouncil.com/events/2019/uk-conference/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
This document summarizes VPU and GPGPU computing technologies. It discusses that a VPU is a visual processing unit, also known as a GPU. GPUs have massively parallel architectures that allow them to perform better than CPUs for some complex computational tasks. The document then discusses GPU, PPU and GPGPU architectures, programming models like CUDA, and applications of GPGPU computing such as machine learning, robotics and scientific research.
This document summarizes VPU and GPGPU technologies. It discusses that a VPU is a visual processing unit, also known as a GPU. GPUs have massively parallel architectures that allow them to perform better than CPUs for some complex computational tasks. The document then discusses GPU architecture including stream processing, graphics pipelines, shaders, and GPU clusters. It provides an example of using CUDA for GPU computing and discusses how GPUs are used for general purpose computing through frameworks like CUDA.
This document summarizes VPU and GPGPU computing technologies. It discusses that a VPU is a visual processing unit, also known as a GPU. GPUs provide massively parallel and multithreaded processing capabilities. GPUs are now commonly used for general purpose computing due to their ability to handle complex computational tasks faster than CPUs in some cases. The document then discusses GPU and PPU architectures, programming models like CUDA, and applications of GPGPU computing such as machine learning, robotics, and scientific research.
The document discusses VPU and GPGPU computing. It explains that a VPU is a visual processing unit, also known as a GPU. GPUs are massively parallel and multithreaded processors that are better than CPUs for tasks like machine learning and graphics processing. The document then discusses GPU architecture, memory, and programming models like CUDA. It provides examples of GPU usage and concludes that GPGPU is used in fields like machine learning, robotics, and scientific computing.
The RAPIDS suite of software libraries gives you the freedom to execute end-to-end data science and analytics pipelines entirely on GPUs. It relies on NVIDIA® CUDA® primitives for low-level compute optimization, but exposes that GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.
This document discusses NVIDIA's chips for automotive, HPC, and networking. For automotive, it describes the Tegra line of SOC chips used in cars like Tesla, and upcoming chips like Orin and Atlan. For HPC, it introduces the upcoming Grace CPU designed for giant AI models. For networking, it presents the BlueField line of data processing units (DPUs) including the new 400Gbps BlueField-3 chip and the DOCA software framework. The document emphasizes that NVIDIA's GPU, CPU, and DPU chips make yearly leaps while sharing a common architecture.
This document discusses techniques for offloading I/O transactions from the CPU to improve performance. It introduces iDMA which allows direct communication between I/O devices and system memory without CPU involvement. It also describes the "Hot Potato" approach which treats payload data as a "hot potato" passed directly between devices without CPU processing. Finally, it proposes "Device2Device" (D2D) communication which allows direct transfer of data between I/O devices like sending video data directly from a SSD to a NIC without using system memory or the CPU. Measurements show these approaches can significantly reduce latency and improve throughput and power efficiency compared to traditional CPU-managed I/O.
Red Hat Storage Day Atlanta - Designing Ceph Clusters Using Intel-Based Hardw... (Red_Hat_Storage)
This document discusses the need for storage modernization driven by trends like mobile, social media, IoT and big data. It outlines how scale-out architectures using open source Ceph software can help meet this need more cost effectively than traditional scale-up storage. Specific optimizations for IOPS, throughput and capacity are described. Intel is presented as helping advance the industry through open source contributions and optimized platforms, software and SSD technologies. Real-world examples are given showing the wide performance range Ceph can provide.
S51281 - Accelerate Data Science in Python with RAPIDS (DLow6)
RAPIDS accelerates data science and machine learning workflows in Python by leveraging GPUs. It includes cuDF for GPU-accelerated pandas functionality, cuML for scikit-learn compatible machine learning algorithms, cuGraph for graph analytics, and integrations with Dask and Spark. RAPIDS has a large community of contributors and is used by many Fortune 100 companies to speed up workflows, reduce costs, and scale to large datasets.
BGE provides clients with the capability to integrate GPUs into the IBM BladeCenter ecosystem. This is ideal for clients running applications that can leverage the value of double precision performance and also value the RAS features of IBM BladeCenter.
Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio (Alluxio, Inc.)
This document discusses accelerating Apache Spark workloads using RAPIDS Accelerator for Spark and Alluxio. It provides an introduction to RAPIDS Accelerator for Spark, shows significant performance gains over CPU-only Spark, and discusses combining GPU acceleration with Alluxio for optimized performance and cost on cloud datasets. Configuration options for RAPIDS and Alluxio are also covered.
The document describes benchmark results achieved by using NVMe SSDs and GPU acceleration to improve the performance of PostgreSQL beyond typical limitations. A benchmark was run using 13 queries on a 1055GB dataset with PostgreSQL v11beta3 + PG-Strom v2.1. This achieved a maximum query execution throughput of 13.5GB/s. PG-Strom is an extension module that uses thousands of GPU cores and wide-band memory to accelerate SQL workloads. It generates GPU code from SQL and executes queries directly on the GPU, bypassing data transfers between CPU and GPU to improve performance.
The document discusses plans to establish an institutional high performance computing (HPC) facility at North-West University. It outlines the technical goals of building a Beowulf cluster to link existing departmental clusters and integrate with national and international computational grids. It also discusses management principles for the new HPC facility to ensure sustainability, efficiency, reliability, availability and high performance.
Similar to 20170602_OSSummit_an_intelligent_storage (20)
This document discusses using HyperLogLog (HLL) to estimate cardinality for count(distinct) queries in PostgreSQL.
HLL is an algorithm that uses constant memory to estimate the number of unique elements in a large set. It works by mapping elements to registers in a bitmap and tracking the number of leading zeros in each hash value. The harmonic mean of these counts is used to estimate cardinality.
PG-Strom implements HLL in PostgreSQL to enable fast count(distinct) queries on GPUs. On a table with 60 million rows and 87GB in size, HLL estimated the distinct count within 0.3% accuracy in just 9 seconds, over 40x faster than the regular count(distinct).
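The register-and-leading-zeros mechanism described above fits in a short stdlib sketch. This is a toy HyperLogLog, not PG-Strom's implementation: the register count, hash function, and bias constant below are illustrative choices, and the small- and large-range corrections of the full algorithm are omitted.

```python
import hashlib

M = 1024   # number of registers (a power of two)
B = 10     # index bits: 2**B == M

def h64(value):
    # 64-bit hash via SHA-1 so the bit pattern is well distributed
    return int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")

def hll_estimate(values):
    regs = [0] * M
    for v in values:
        x = h64(v)
        j = x & (M - 1)                        # low bits select a register
        w = x >> B                             # remaining 54 bits are ranked
        rank = (64 - B) - w.bit_length() + 1   # leading zeros + 1
        regs[j] = max(regs[j], rank)
    alpha = 0.7213 / (1 + 1.079 / M)           # bias-correction constant
    return alpha * M * M / sum(2.0 ** -r for r in regs)

est = hll_estimate(range(100_000))             # 100,000 distinct values
```

With 1024 registers the standard error is about 1.04/sqrt(1024), roughly 3%, using a few kilobytes of memory regardless of input size; the per-element hash/rank work is embarrassingly parallel, which is what makes it a good fit for GPU execution.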
GraphRAG for Life Science to increase LLM accuracy (Tomaz Bratanic)
GraphRAG for the life science domain, where you retrieve information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers.
OpenID AuthZEN Interop Read Out - Authorization (David Brossard)
During Identiverse 2024 and EIC 2024, members of the OpenID AuthZEN WG got together and demoed their authorization endpoints conforming to the AuthZEN API.
Best 20 SEO Techniques To Improve Website Visibility In SERP (Pixlogix Infotech)
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
Generating privacy-protected synthetic data using Secludy and MilvusZilliz
During this demo, the founders of Secludy will demonstrate how their system utilizes Milvus to store and manipulate embeddings for generating privacy-protected synthetic data. Their approach not only maintains the confidentiality of the original data but also enhances the utility and scalability of LLMs under privacy constraints. Attendees, including machine learning engineers, data scientists, and data managers, will witness first-hand how Secludy's integration with Milvus empowers organizations to harness the power of LLMs securely and efficiently.
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers (akankshawande)
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
Driving Business Innovation: Latest Generative AI Advancements & Success StorySafe Software
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceIndexBug
Imagine a world where machines not only perform tasks but also learn, adapt, and make decisions. This is the promise of Artificial Intelligence (AI), a technology that's not just enhancing our lives but revolutionizing entire industries.
Skybuffer SAM4U tool for SAP license adoptionTatiana Kojar
Manage and optimize your license adoption and consumption with SAM4U, an SAP free customer software asset management tool.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
Project Management Semester Long Project - Acuityjpupo2018
Acuity is an innovative learning app designed to transform the way you engage with knowledge. Powered by AI technology, Acuity takes complex topics and distills them into concise, interactive summaries that are easy to read & understand. Whether you're exploring the depths of quantum mechanics or seeking insight into historical events, Acuity provides the key information you need without the burden of lengthy texts.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfChart Kalyan
A Mix Chart displays historical data of numbers in a graphical or tabular form. The Kalyan Rajdhani Mix Chart specifically shows the results of a sequence of numbers over different periods.
How to Get CNIC Information System with Paksim Ga.pptxdanishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Speck&Tech
ABSTRACT: A prima vista, un mattoncino Lego e la backdoor XZ potrebbero avere in comune il fatto di essere entrambi blocchi di costruzione, o dipendenze di progetti creativi e software. La realtà è che un mattoncino Lego e il caso della backdoor XZ hanno molto di più di tutto ciò in comune.
Partecipate alla presentazione per immergervi in una storia di interoperabilità, standard e formati aperti, per poi discutere del ruolo importante che i contributori hanno in una comunità open source sostenibile.
BIO: Sostenitrice del software libero e dei formati standard e aperti. È stata un membro attivo dei progetti Fedora e openSUSE e ha co-fondato l'Associazione LibreItalia dove è stata coinvolta in diversi eventi, migrazioni e formazione relativi a LibreOffice. In precedenza ha lavorato a migrazioni e corsi di formazione su LibreOffice per diverse amministrazioni pubbliche e privati. Da gennaio 2020 lavora in SUSE come Software Release Engineer per Uyuni e SUSE Manager e quando non segue la sua passione per i computer e per Geeko coltiva la sua curiosità per l'astronomia (da cui deriva il suo nickname deneb_alpha).
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on integration of Salesforce with Bonterra Impact Management.
Interested in deploying an integration with Salesforce for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Introduction of Cybersecurity with OSS at Code Europe 2024Hiroshi SHIBATA
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
1. An Intelligent Storage
~SQL Execution on GPU Closely Connected with SSD~
PG-Strom Development Team
KaiGai Kohei <kaigai@kaigai.gr.jp>
2. The PG-Strom Project
Who am I?
▌KaiGai Kohei
PG-Strom Development Team
▌Background
Linux kernel (SELinux, …)
PostgreSQL database
Enhancement of security features
Writable FDW, Remote Join
Custom Scan Interface
GPU and CUDA
▌PG-Strom
SQL Acceleration Technology using GPU
Solution for data analytics in-database
Open Source Summit Japan 2017 - An Intelligent Storage
3. The PG-Strom Project
PG-Strom (1/3) – GPU as accelerator of SQL

PG-Strom: Extension module of PostgreSQL, to accelerate SQL workloads
using the multi-thousand cores and high-bandwidth memory of GPUs.
Same SQL statement, same data structure; WHERE, JOIN and GROUP BY are
off-loaded to the GPU through auto-generation of GPU binaries from SQL.

GPU vs CPU (reference models):

                       GPU: NVIDIA Tesla P40   CPU: Intel Xeon E5-2699v4
  Code name            Pascal                  Broadwell
  Release              Q3-2016                 Q1-2016
  # of transistors     12 billion              7.2 billion
  # of cores           3840 (simple)           22 (functional)
  Core clocks          1.306GHz - 1.53GHz      2.20GHz - 3.60GHz
  Performance (FP32)   12 TFLOPS               1.2 TFLOPS (with AVX2)
  RAM size             24GB (GDDR5)            max 1.5TB (DDR4)
  Memory bandwidth     347GB/s                 76.8GB/s
  Power consumption    250W                    145W
  Manufacturing        16nm                    14nm
4. The PG-Strom Project
PG-Strom (2/3) – Transparent GPU code generation on demand
QUERY: SELECT cat, count(*), avg(x) FROM t0
       WHERE x between y and y + 20.0 GROUP BY cat;

E.g.) CUDA C code generated from the calculation formula in the WHERE-clause:

  STATIC_FUNCTION(bool)
  gpupreagg_qual_eval(kern_context *kcxt,
                      kern_data_store *kds,
                      size_t kds_index)
  {
      pg_float8_t KPARAM_1 = pg_float8_param(kcxt,1);
      pg_float8_t KVAR_3 = pg_float8_vref(kds,kcxt,2,kds_index);  /* reference to input data */
      pg_float8_t KVAR_4 = pg_float8_vref(kds,kcxt,3,kds_index);
      return EVAL((pgfn_float8ge(kcxt, KVAR_3, KVAR_4) &&
                   pgfn_float8le(kcxt, KVAR_3,
                                 pgfn_float8pl(kcxt, KVAR_4, KPARAM_1))));
  }

The SQL expression becomes CUDA source code, which the run-time compiler
(nvrtc) just-in-time compiles for parallel execution.
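As a language-neutral analogue of this on-demand code generation, the following toy Python sketch (not PG-Strom code; `codegen_qual` is a hypothetical helper) turns a WHERE-clause expression into source text, JIT-compiles it, and evaluates it over rows, the role nvrtc plays for the generated CUDA C:

```python
# Toy analogue of PG-Strom's transparent code generation: build source
# code for the qualifier from an expression string, compile it at run
# time, then evaluate it per row (on the GPU this would run per thread).
def codegen_qual(expr_src):
    src = f"def qual(x, y):\n    return {expr_src}\n"
    ns = {}
    exec(compile(src, "<generated>", "exec"), ns)   # the "run-time compiler"
    return ns["qual"]

# from: x BETWEEN y AND y + 20.0
qual = codegen_qual("y <= x <= y + 20.0")
rows = [(5.0, 1.0), (30.0, 1.0), (21.0, 1.0)]
print([qual(x, y) for x, y in rows])  # [True, False, True]
```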
5. The PG-Strom Project
PG-Strom (3/3) – An example of SQL acceleration
▌Test Query:
SELECT cat, count(*), avg(x)
FROM t0 NATURAL JOIN t1 [NATURAL JOIN t2 ...]
GROUP BY cat;
t0 contains 100M rows; t1...t8 each contain 100K rows (like a star schema)
PG-Strom microbenchmark with JOIN/GROUP BY – query response time [sec]
by number of tables joined:

  # of tables joined   PostgreSQL v9.6   PG-Strom 2.0devel
  2                     8.48              5.00
  3                    13.23              5.46
  4                    18.28              5.91
  5                    23.42              6.45
  6                    28.88              7.17
  7                    34.50              8.07
  8                    40.77              9.22
  9                    47.16             10.21

CPU: Xeon E5-2640v4 / GPU: Tesla P40 / RAM: 128GB / OS: CentOS 7.3
DB: PostgreSQL 9.6.2 + PG-Strom 2.0devel
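The microbenchmark numbers on this slide can be restated as speed-up factors; a quick script over the same figures:

```python
# Derived from the slide's measurements: PG-Strom 2.0devel speed-up over
# PostgreSQL v9.6 as the number of joined tables grows.
pg    = [8.48, 13.23, 18.28, 23.42, 28.88, 34.50, 40.77, 47.16]  # sec
strom = [5.00,  5.46,  5.91,  6.45,  7.17,  8.07,  9.22, 10.21]  # sec

for n, (a, b) in enumerate(zip(pg, strom), start=2):
    print(f"{n} tables joined: {a / b:.2f}x")
# speed-up grows from ~1.7x (2 tables) to ~4.6x (9 tables)
```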
6. The PG-Strom Project
In-database Analytics
▌Computing close to data
  Cost to move data
  Cost to transform / validate data
  Waste of DBAs' time
▌Utilization of SQL’s flexibility
  Pre-/post-processing of analytic algorithms
  Conjunction with other data (JOIN)
  Summarization of results (GROUP BY / Window Functions)
7. The PG-Strom Project
What we want to develop (1/2)
Workflow: Data Generation → Data Collection → Pre-process / Transform →
Analytics & ML → Summarize → Visualization (BI Tools).
Data size shrinks from large to small along the pipeline, while the
amount of computing grows from small to large.

Building blocks mapped onto that workflow:
  SSD-to-GPU Direct SQL Exec (NVMe-SSD + GPU)
  Row→Column transform in hardware
  Transparent GPU code generation & asynchronous execution
  Statistical analytics & machine learning libraries (DNN, k-means,
  SVM, Random Forest) – advanced algorithms as part of SQL statements,
  via PL/CUDA

Goal: execution of the entire lifecycle of data analytics in-database.
8. The PG-Strom Project
What we want to develop (2/2) – elemental technology
The same workflow as the previous slide, annotated with the elemental
technologies:
  SSD-to-GPU Direct SQL Execution  (for I/O)
  PG-Strom                         (for SQL)
  PL/CUDA and libraries            (for computing)
9. The PG-Strom Project
I/O Acceleration with GPU?
10. The PG-Strom Project
Re-definition of GPU’s role
I/O workloads Computing workloads
11. The PG-Strom Project
Brief Architecture of GPU

The GPU device carries its own device memory (GDDR5 or HBM2) and is
connected to the host over the PCIe bus; data normally moves between
host RAM and GPU device memory by DMA over PCIe. Storage devices can
likewise transfer particular blocks by DMA, directly to a destination
physical address.
12. The PG-Strom Project
Technology Basis (1/2) – GPUDirect RDMA
▌Feature for peer-to-peer DMA between the GPU and other PCIe devices
▌Originally designed for MPI over Infiniband
▌But available for any kind of PCIe device, given a custom kernel module
Copyright (c) NVIDIA corporation, 2015
13. The PG-Strom Project
Technology Basis (2/2) – GPUDirect RDMA
GPUDirect RDMA is a feature to map GPU device memory into the physical
address space of the host system (through the PCIe BAR1 area, alongside
host RAM). Once GPU device memory gets a physical address on the host,
a driver can use that address as the source or destination of DMA
operations with other PCIe devices, such as NVMe-SSDs or Infiniband
HBAs.
14. The PG-Strom Project
Idea of SSD-to-GPU Direct SQL Execution
SSD-to-GPU Direct SQL Execution – load PostgreSQL data blocks of large
tables into the GPU directly over the PCIe bus (SSD-to-GPU P2P DMA by
the NVMe-Strom driver), then reduce unnecessary rows with the
multi-thousand cores of the GPU by running WHERE, JOIN and GROUP BY
there. Compared with the existing data flow (large I/O into host RAM
first), only a much smaller amount of I/O reaches PostgreSQL.
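A back-of-envelope model of the I/O reduction (toy numbers, not a measurement; block count and selectivity are assumptions for illustration):

```python
# Toy model: with the existing flow, every block crosses into host RAM;
# with SSD-to-GPU direct execution the GPU filters rows on the P2P path,
# so only the surviving fraction reaches the host.
BLOCK_SIZE = 8192            # PostgreSQL BLCKSZ
N_BLOCKS = 100_000           # assumed table size: ~781 MiB
selectivity = 0.05           # assumed fraction of rows passing the WHERE clause

to_host_existing = N_BLOCKS * BLOCK_SIZE
to_host_direct = int(N_BLOCKS * BLOCK_SIZE * selectivity)

print(f"{to_host_existing / 2**20:.0f} MiB vs {to_host_direct / 2**20:.0f} MiB to host RAM")
```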
15. The PG-Strom Project
Linux kernel APIs of GPUDirect RDMA:

  int nvidia_p2p_get_pages(uint64_t p2p_token,      /* deprecated */
                           uint32_t va_space,       /* deprecated */
                           uint64_t virtual_address,
                           uint64_t length,
                           struct nvidia_p2p_page_table **page_table,
                           void (*free_callback)(void *private_data),
                           void *private_data);

  typedef struct nvidia_p2p_page_table {
      uint32_t version;
      uint32_t page_size;
      struct nvidia_p2p_page **pages;  /* array of I/O-mapped physical
                                        * addresses of GPU device memory */
      uint32_t entries;
      uint8_t *gpu_uuid;
  } nvidia_p2p_page_table_t;

  CUresult cuMemAlloc(CUdeviceptr *daddr, size_t bytesize);

GPU device memory allocated with cuMemAlloc() is mapped by
nvidia_p2p_get_pages(); the resulting page-table entries are used as the
destination of SSD-to-GPU DMA commands.
16. The PG-Strom Project
NVMe-Strom Software Stack
User space:   PostgreSQL + pg-strom allocate GPU device memory with
              cuMemAlloc(), then issue DMA requests via ioctl(2) on
              /proc/nvme-strom, alongside regular read(2) calls (file
              offset) through the VFS.
Kernel space: the NVMe-Strom driver translates the file offset into
              block numbers, bypasses the page cache, and drives the
              NVMe SSD driver together with the nvidia driver to run
              SSD-to-GPU peer-to-peer DMA into GPU device memory.
17. The PG-Strom Project
Benchmark (1/2) – Star Schema Benchmark
Query execution throughput [MB/s] was measured for queries Q1-1 ... Q4-3,
comparing PostgreSQL (SSDx1, SSDx2) against PG-Strom (SSDx1, SSDx2).

SQL with WHERE-clause, JOIN and aggregation/GROUP BY towards a 353GB
database; usually, the entire workload is dominated by I/O.
PostgreSQL achieves up to 1.6GB/s with regular filesystem-based access.
PG-Strom with SSD-to-GPU direct SQL execution achieved up to 3.8GB/s at
peak performance – close to the theoretical limitation of 4.4GB/s for
2x SSD (2.2GB/s for 1x SSD); a few outliers are still under
investigation.
※ (Query Execution Throughput) = (Database Size; 353GB) / (Entire Query Response Time [sec])
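The throughput metric above is simple arithmetic; a quick check of what the quoted rates imply for total query time over the 353GB database:

```python
# Throughput = database size / entire query response time, as defined
# on this slide.
DB_SIZE_GB = 353.0

def throughput_gbps(response_time_sec):
    return DB_SIZE_GB / response_time_sec

# 3.8 GB/s peak implies the whole 353GB is scanned in ~93 seconds;
# 1.6 GB/s (filesystem-based PostgreSQL) needs ~221 seconds.
print(round(DB_SIZE_GB / 3.8, 1))
print(round(DB_SIZE_GB / 1.6, 1))
```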
18. The PG-Strom Project
Benchmark (2/2) – Star Schema Benchmark
Example of query:

  SELECT sum(lo_revenue), d_year, p_brand1
    FROM lineorder, date1, part, supplier
   WHERE lo_orderdate = d_datekey
     AND lo_partkey = p_partkey
     AND lo_suppkey = s_suppkey
     AND p_category = 'MFGR#12'
     AND s_region = 'AMERICA'
   GROUP BY d_year, p_brand1
   ORDER BY d_year, p_brand1;

Table sizes:
  lineorder: 2.4B rows (351GB)
  customer:  12M rows (1.6GB)
  supplier:  4M rows (528MB)
  part:      1.8M rows (206MB)
  date1:     2500 rows (400KB)

These are typical summary queries, consisting of WHERE-clause, JOIN and
GROUP BY. Since the lineorder table is larger than physical RAM, the
majority of the workload is dominated by I/O.
19. The PG-Strom Project
Hardware configuration on the benchmark
▌Hardware/Software configuration
Model: DELL PowerEdge R730
CPU: Xeon E5-2670v3 (12C, 2.3GHz) x2
RAM: 128GB
HDD: SAS 300GB x8 (RAID5)
OS: CentOS7.2 (3.10.0-327.el7)
SW: CUDA 7.5.17
NVIDIA driver 352.99
PostgreSQL 9.6.1
PG-Strom v2.0devel
▌GPU Specifications
NVIDIA Tesla K80
2x 2496 CUDA cores (560MHz)
2x 12GB GDDR5 RAM (240GB/s)
▌SSD Specifications
Intel SSD 750 (400GB) x2
Interface: PCIe 3.0 x4 (NVMe 1.2)
Max SeqRead: 2.2GB/s
Topology: the 2x SSDs and the Tesla K80 are attached under CPU1; a
Tesla K20 GPU under CPU2 is unused in this benchmark.
20. The PG-Strom Project
(OT) Consistency with disk cache
Problem: ① a storage block may not be up-to-date if the target block is
kept in the page cache; ② short DMAs are inefficient.
Solution: write cached PostgreSQL blocks (BLCKSZ = 8KB/block) back to a
userspace DMA buffer once, then kick one bulk transfer from RAM to GPU
device memory via the CUDA API (cuMemcpyHtoDAsync), while uncached
blocks go through SSD-to-GPU P2P DMA. The cost is almost equivalent to
read(2) + cuMemcpyHtoDAsync(). Although the PostgreSQL blocks end up
reordered in GPU memory, this still generates correct results.
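The split between the two transfer paths can be sketched as follows (assumed logic for illustration, not the NVMe-Strom driver; `plan_transfers` is a hypothetical helper):

```python
# Toy sketch: partition the blocks of one read request into page-cached
# vs. uncached sets, as the consistency scheme above describes.
def plan_transfers(blocks, cached):
    """blocks: block numbers of the request; cached: set of block
    numbers currently in the page cache.
    Returns (p2p_dma_blocks, writeback_blocks)."""
    p2p = [b for b in blocks if b not in cached]  # SSD-to-GPU P2P DMA
    wb  = [b for b in blocks if b in cached]      # userspace buffer -> bulk HtoD copy
    return p2p, wb

blocks = list(range(100, 110))
cached = {101, 104, 105, 108}
p2p, wb = plan_transfers(blocks, cached)
print(p2p)  # [100, 102, 103, 106, 107, 109]
print(wb)   # [101, 104, 105, 108]
```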
21. The PG-Strom Project
Not only Scan, but GpuJoin also...
HashJoin by CPU: the hash table built over the inner relation is
searched by the CPU, and the CPU also generates the result rows for the
next step.
GpuHashJoin: the hash table search runs in parallel on the GPU, the GPU
generates the results, and the CPU side just references them. Here we
have SSD-to-GPU P2P DMA feeding the outer relation.
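A toy sketch of the GpuHashJoin flow (not PG-Strom code; `build_hash` and `probe` are hypothetical helpers, and the "parallel" probe is just a comprehension here):

```python
# Build a hash table on the small inner relation once, then probe every
# outer row; on the GPU, each probe could run in an independent thread.
def build_hash(inner, key):
    ht = {}
    for row in inner:
        ht.setdefault(row[key], []).append(row)
    return ht

def probe(outer, ht, key):
    # each (outer row, matching inner row) pair becomes a join result
    return [(o, i) for o in outer for i in ht.get(o[key], [])]

inner = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
outer = [{"id": 1, "x": 10}, {"id": 2, "x": 20}, {"id": 3, "x": 30}]
print(len(probe(outer, build_hash(inner, "id"), "id")))  # 2
```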
22. The PG-Strom Project
Not only Scan, but GpuPreAgg also...

Aggregation by CPU: reading 1M rows means processing 1M rows on the
CPU. Aggregation with GPU (GpuPreAgg): a 1st step of parallel hash
reduction and a 2nd step of atomic merge run on the GPU, so the CPU
only processes a much smaller number of interim result rows. Here we
have SSD-to-GPU P2P DMA as well.
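The two-step reduction can be sketched as follows (a toy analogue of the GpuPreAgg idea, not actual PG-Strom code; `partial_agg` and `merge` are hypothetical helpers):

```python
# 1st step: reduce each chunk to per-key partials in parallel;
# 2nd step: merge the interim results, so only a handful of rows remain.
from collections import defaultdict

def partial_agg(chunk):
    part = defaultdict(lambda: [0, 0.0])   # key -> [count, sum]
    for cat, x in chunk:
        part[cat][0] += 1
        part[cat][1] += x
    return part

def merge(partials):
    out = defaultdict(lambda: [0, 0.0])
    for p in partials:
        for cat, (n, s) in p.items():
            out[cat][0] += n
            out[cat][1] += s
    return {cat: (n, s / n) for cat, (n, s) in out.items()}  # count, avg

rows = [("a", 1.0), ("b", 2.0), ("a", 3.0), ("b", 4.0), ("a", 5.0)]
chunks = [rows[:3], rows[3:]]
print(merge(partial_agg(c) for c in chunks))  # {'a': (3, 3.0), 'b': (2, 3.0)}
```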
24. The PG-Strom Project
Single Node Solution
Hardware block diagram: the CPU (with its RAM) connects over PCI-E to
2x enterprise-grade SSDs (PCIe x8 based) and an NVIDIA Tesla P40 GPU,
giving max 10GB/s capability of query processing per node.
25. The PG-Strom Project
Multi Node Solution (?)
Joint solution with PostgreSQL’s scale-out technologies: each node
keeps the single-node configuration of 2x enterprise-grade SSDs (PCIe
x8 based) plus an NVIDIA Tesla P40.
26. The PG-Strom Project
Scale Up Solution (?)
Deployment on a GPU monster box: 5x NVIDIA Tesla P40 and 5x
enterprise-grade SSDs (PCIe x16 based), paired GPU+SSD under PCIe
switches in PCIe x16 slots. Two CPUs (96 PCIe lanes each, connected by
2x QPI) host the switches and their local RAM.
27. The PG-Strom Project
Towards columnar structure (1/3) – On-memory Columnar Cache
▌On-memory Columnar Cache
  Caches PostgreSQL data blocks (row-data) in memory as column-data
  Reduces data transfer between RAM and GPU
  Optimizes usage of the GPU memory bandwidth
  The GPU kernel to invoke is switched according to the existence of a
  column cache

A background worker constructs the columnar cache for blocks whose
all_visible flag is true (e.g., BlockNum=102); an UPDATE invalidates
the cache for that block. PG-Strom then runs the GPU kernel for
row-data on uncached blocks, and the GPU kernel for column-data where a
columnar cache exists.
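The per-block kernel switch can be sketched like this (assumed dispatch logic for illustration; the function names are hypothetical, not PG-Strom symbols):

```python
# Toy dispatcher: pick the row-format or column-format "GPU kernel" for
# each block, depending on whether a columnar cache exists for it.
def run_kernel_for_row(block_num):
    return ("row-kernel", block_num)

def run_kernel_for_column(block_num):
    return ("column-kernel", block_num)

def dispatch(block_num, columnar_cache):
    if block_num in columnar_cache:
        return run_kernel_for_column(block_num)
    return run_kernel_for_row(block_num)

cache = {102}                  # background worker built a cache for block 102
print(dispatch(101, cache))    # ('row-kernel', 101)
print(dispatch(102, cache))    # ('column-kernel', 102)
```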
28. The PG-Strom Project
Towards columnar structure (2/3) – Background
Why a columnar structure is preferable for the GPU architecture:

▌Row data format – random memory access
  Only 32bit x1 = 32bit is valid in a 256bit-wide memory transaction
  (usage ratio: 12.5%); lower usage of the memory bus and a larger
  number of memory transactions.
▌Column data format – coalesced memory access
  32bit x8 = 256bit is valid in a 256bit memory transaction (usage
  ratio: 100.0%); optimal memory bus usage and the least number of
  memory transactions by the GPU cores.
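The transaction counts behind those usage ratios can be modeled with a few lines (a simplification that assumes one 32-bit value per thread and 256-bit-wide transactions, as on this slide):

```python
# Count 256-bit memory transactions needed for 32 threads each reading
# one 32-bit value, coalesced (column format) vs. strided (row format).
TXN_BITS, VAL_BITS = 256, 32
VALS_PER_TXN = TXN_BITS // VAL_BITS      # 8 values fit in one transaction

def transactions(n_threads, coalesced):
    if coalesced:                         # adjacent values share transactions
        return -(-n_threads // VALS_PER_TXN)   # ceiling division
    return n_threads                      # one transaction per value

print(transactions(32, coalesced=True))   # 4  (usage ratio 100.0%)
print(transactions(32, coalesced=False))  # 32 (usage ratio 32/256 = 12.5%)
```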
29. The PG-Strom Project
Towards columnar structure (3/3) – FPGA engine on SSD

FPGA logic on the SSD bridges the world of OLTP and the world of OLAP:
it enables write by row, then read as column. Transactional data is
written in row format; the Row→Column transformer in the FPGA performs
the data-format transformation on the SSD itself, so analytic reads
fetch only the required columns as pre-processed, column-format data –
the layout best suited for SQL execution on the GPU.
30. The PG-Strom Project
Towards the “true” intelligent storage system
Step-1: a pair of NVMe-SSD and GPU looks like an intelligent storage
device that understands SQL workloads: via SSD-to-GPU P2P DMA (the
NVMe-Strom driver) over the PCIe bus, it filters out unnecessary rows
and pre-processes JOIN/GROUP BY before the data blocks of large
PostgreSQL tables are loaded into main memory, for a smaller amount of
I/O.
Step-2: a special NVMe-SSD transforms the data format into one more
suitable for processing, based on knowledge of the requester
(PostgreSQL’s block format in this case).