The document describes a technology called PG-Strom that uses GPU acceleration to optimize I/O performance for PostgreSQL. PG-Strom allows data to be transferred directly from NVMe SSDs to the GPU over the PCIe bus, bypassing the CPU and RAM. This reduces data movement and allows PostgreSQL queries to be partially executed directly on the GPU. Benchmark results show the approach can achieve throughput close to the theoretical hardware limits for a single server configuration processing large datasets.
KaiGai's talk at PGconf.EU 2018, Lisbon.
It shows how SSD2GPU Direct SQL of PG-Strom accelerates I/O intensive big-data queries using GPU in contradiction to the common sense.
K-Means clustering is a popular clustering algorithm in data mining. Clustering large data sets can be time consuming, and in an attempt to minimize this time, our project is a parallel implementation of K-Means clustering algorithm on CUDA using C. We present the performance analysis and implementation of our approach to parallelizing K-Means clustering.
KaiGai's talk at PGconf.EU 2018, Lisbon.
It shows how SSD2GPU Direct SQL of PG-Strom accelerates I/O intensive big-data queries using GPU in contradiction to the common sense.
K-Means clustering is a popular clustering algorithm in data mining. Clustering large data sets can be time consuming, and in an attempt to minimize this time, our project is a parallel implementation of K-Means clustering algorithm on CUDA using C. We present the performance analysis and implementation of our approach to parallelizing K-Means clustering.
Parallel Implementation of K Means Clustering on CUDAprithan
K-Means clustering is a popular clustering algorithm in data mining. Clustering large data sets can be
time consuming, and in an attempt to minimize this time, our project is a parallel implementation of KMeans
clustering algorithm on CUDA using C. We present the performance analysis and implementation
of our approach to parallelizing K-Means clustering.
아파치 네모로 빠르고 효율적으로 빅데이터 처리하기
- 송원욱, 양영석(서울대학교 컴퓨터공학부 소프트웨어 플랫폼 연구실)
개요 #
아파치 네모(Apache Nemo)는 빅데이터 애플리케이션의 분산 수행 방식을 다양한 자원 환경 및 데이터 특성에 맞춰 최적화하는 시스템입니다. Geo-distributed resources, transient resources, large data shuffle, skewed data 처리 상황에서 아파치 네모는 아파치 스파크(Apache Spark) 보다 월등하게 높은 성능을 보입니다.
목차 #
아파치 네모의 최적화 케이스 스터디
아파치 네모의 분산 실행 과정
앞으로의 연구 방향
There have been plenty of “explaining EXPLAIN” type talks over the years, which provide a great introduction to it. They often also cover how to identify a few of the more common issues through it. EXPLAIN is a deep topic though, and to do a good introduction talk, you have to skip over a lot of the tricky bits. As such, this talk will not be a good introduction to EXPLAIN, but instead a deeper dive into some of the things most don’t cover. The idea is to start with some of the more complex and unintuitive calculations needed to work out the relationships between operations, rows, threads, loops, timings, buffers, CTEs and subplans. Most popular tools handle at least several of these well, but there are cases where they don’t that are worth being conscious of and alert to. For example, we’ll have a look at whether certain numbers are averaged per-loop or per-thread, or both. We’ll also cover a resulting rounding issue or two to be on the lookout for. Finally, some per-operation timing quirks are worth looking out for where CTEs and subqueries are concerned, for example CTEs that are referenced more than once. As time allows, we can also look at a few rarer issues that can be spotted via EXPLAIN, as well as a few more gotchas that we’ve picked up along the way. This includes things like spotting when the query is JIT, planning, or trigger time dominated, spotting the signs of table and index bloat, issues like lossy bitmap scans or index-only scans fetching from the heap, as well as some things to be aware of when using auto_explain.
Clickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlareAltinity Ltd
Presented on December ClickHouse Meetup. Dec 3, 2019
Concrete findings and "best practices" from building a cluster sized for 150 analytic queries per second on 100TB of http logs. Topics covered: hardware, clients (http vs native), partitioning, indexing, SELECT vs INSERT performance, replication, sharding, quotas, and benchmarking.
Gruter_TECHDAY_2014_04_TajoCloudHandsOn (in Korean)Gruter
Big data analysis using Tajo on AWS (Hands-on session)
- presented by Young-kyong Ko, data analyst at Gruter
- at Gruter TECHDAY 2014 (Oct. 29 Seoul, Korea)
Webinar: Secrets of ClickHouse Query Performance, by Robert HodgesAltinity Ltd
From webinars September 11 and September 17, 2019
ClickHouse is famous for speed. That said, you can almost always make it faster! This webinar uses examples to teach you how to deduce what queries are actually doing by reading the system log and system tables. We'll then explore standard ways to increase query speed: data types and encodings, filtering, join reordering, skip indexes, materialized views, session parameters, to name just a few. In each case we'll circle back to query plans and system metrics to demonstrate changes in ClickHouse behavior that explain the boost in performance. We hope you'll enjoy the first step to becoming a ClickHouse performance guru!
Speaker Bio:
Robert Hodges is CEO of Altinity, which offers enterprise support for ClickHouse. He has over three decades of experience in data management spanning 20 different DBMS types. ClickHouse is his current favorite. ;)
The search for faster computing remains of great importance to the software community. Relatively inexpensive modern hardware, such as GPUs, allows users to run highly parallel code on thousands, or even millions of cores on distributed systems.
Building efficient GPU software is not a trivial task, often requiring a significant amount of engineering hours to attain the best performance. Similarly, distributed computing systems are inherently complex. In recent years, several libraries were developed to solve such problems. However, they often target a single aspect of computing, such as GPU computing with libraries like CuPy, or distributed computing with Dask.
Libraries like Dask and CuPy tend to provide great performance while abstracting away the complexity from non-experts, being great candidates for developers writing software for various different applications. Unfortunately, they are often difficult to be combined, at least efficiently.
With the recent introduction of NumPy community standards and protocols, it has become much easier to integrate any libraries that share the already well-known NumPy API. Such changes allow libraries like Dask, known for its easy-to-use parallelization and distributed computing capabilities, to defer some of that work to other libraries such as CuPy, providing users the benefits from both distributed and GPU computing with little to no change in their existing software built using the NumPy API.
Parallel Implementation of K Means Clustering on CUDAprithan
K-Means clustering is a popular clustering algorithm in data mining. Clustering large data sets can be
time consuming, and in an attempt to minimize this time, our project is a parallel implementation of KMeans
clustering algorithm on CUDA using C. We present the performance analysis and implementation
of our approach to parallelizing K-Means clustering.
아파치 네모로 빠르고 효율적으로 빅데이터 처리하기
- 송원욱, 양영석(서울대학교 컴퓨터공학부 소프트웨어 플랫폼 연구실)
개요 #
아파치 네모(Apache Nemo)는 빅데이터 애플리케이션의 분산 수행 방식을 다양한 자원 환경 및 데이터 특성에 맞춰 최적화하는 시스템입니다. Geo-distributed resources, transient resources, large data shuffle, skewed data 처리 상황에서 아파치 네모는 아파치 스파크(Apache Spark) 보다 월등하게 높은 성능을 보입니다.
목차 #
아파치 네모의 최적화 케이스 스터디
아파치 네모의 분산 실행 과정
앞으로의 연구 방향
There have been plenty of “explaining EXPLAIN” type talks over the years, which provide a great introduction to it. They often also cover how to identify a few of the more common issues through it. EXPLAIN is a deep topic though, and to do a good introduction talk, you have to skip over a lot of the tricky bits. As such, this talk will not be a good introduction to EXPLAIN, but instead a deeper dive into some of the things most don’t cover. The idea is to start with some of the more complex and unintuitive calculations needed to work out the relationships between operations, rows, threads, loops, timings, buffers, CTEs and subplans. Most popular tools handle at least several of these well, but there are cases where they don’t that are worth being conscious of and alert to. For example, we’ll have a look at whether certain numbers are averaged per-loop or per-thread, or both. We’ll also cover a resulting rounding issue or two to be on the lookout for. Finally, some per-operation timing quirks are worth looking out for where CTEs and subqueries are concerned, for example CTEs that are referenced more than once. As time allows, we can also look at a few rarer issues that can be spotted via EXPLAIN, as well as a few more gotchas that we’ve picked up along the way. This includes things like spotting when the query is JIT, planning, or trigger time dominated, spotting the signs of table and index bloat, issues like lossy bitmap scans or index-only scans fetching from the heap, as well as some things to be aware of when using auto_explain.
Clickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlareAltinity Ltd
Presented on December ClickHouse Meetup. Dec 3, 2019
Concrete findings and "best practices" from building a cluster sized for 150 analytic queries per second on 100TB of http logs. Topics covered: hardware, clients (http vs native), partitioning, indexing, SELECT vs INSERT performance, replication, sharding, quotas, and benchmarking.
Gruter_TECHDAY_2014_04_TajoCloudHandsOn (in Korean)Gruter
Big data analysis using Tajo on AWS (Hands-on session)
- presented by Young-kyong Ko, data analyst at Gruter
- at Gruter TECHDAY 2014 (Oct. 29 Seoul, Korea)
Webinar: Secrets of ClickHouse Query Performance, by Robert HodgesAltinity Ltd
From webinars September 11 and September 17, 2019
ClickHouse is famous for speed. That said, you can almost always make it faster! This webinar uses examples to teach you how to deduce what queries are actually doing by reading the system log and system tables. We'll then explore standard ways to increase query speed: data types and encodings, filtering, join reordering, skip indexes, materialized views, session parameters, to name just a few. In each case we'll circle back to query plans and system metrics to demonstrate changes in ClickHouse behavior that explain the boost in performance. We hope you'll enjoy the first step to becoming a ClickHouse performance guru!
Speaker Bio:
Robert Hodges is CEO of Altinity, which offers enterprise support for ClickHouse. He has over three decades of experience in data management spanning 20 different DBMS types. ClickHouse is his current favorite. ;)
The search for faster computing remains of great importance to the software community. Relatively inexpensive modern hardware, such as GPUs, allows users to run highly parallel code on thousands, or even millions of cores on distributed systems.
Building efficient GPU software is not a trivial task, often requiring a significant amount of engineering hours to attain the best performance. Similarly, distributed computing systems are inherently complex. In recent years, several libraries were developed to solve such problems. However, they often target a single aspect of computing, such as GPU computing with libraries like CuPy, or distributed computing with Dask.
Libraries like Dask and CuPy tend to provide great performance while abstracting away the complexity from non-experts, being great candidates for developers writing software for various different applications. Unfortunately, they are often difficult to be combined, at least efficiently.
With the recent introduction of NumPy community standards and protocols, it has become much easier to integrate any libraries that share the already well-known NumPy API. Such changes allow libraries like Dask, known for its easy-to-use parallelization and distributed computing capabilities, to defer some of that work to other libraries such as CuPy, providing users the benefits from both distributed and GPU computing with little to no change in their existing software built using the NumPy API.
The RAPIDS suite of software libraries gives you the freedom to execute end-to-end data science and analytics pipelines entirely on GPUs. It relies on NVIDIA® CUDA® primitives for low-level compute optimization, but exposes that GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.
This is from a talk we gave at Graph the Planet 2019 where we demonstrated how BlazingSQL quickly queried Apache Parquet files and seamlessly sent the results to Graphistry to beautifully visualize Netflow Data in a Graph.
RAPIDS – Open GPU-accelerated Data ScienceData Works MD
RAPIDS – Open GPU-accelerated Data Science
RAPIDS is an initiative driven by NVIDIA to accelerate the complete end-to-end data science ecosystem with GPUs. It consists of several open source projects that expose familiar interfaces making it easy to accelerate the entire data science pipeline- from the ETL and data wrangling to feature engineering, statistical modeling, machine learning, and graph analysis.
Corey J. Nolet
Corey has a passion for understanding the world through the analysis of data. He is a developer on the RAPIDS open source project focused on accelerating machine learning algorithms with GPUs.
Adam Thompson
Adam Thompson is a Senior Solutions Architect at NVIDIA. With a background in signal processing, he has spent his career participating in and leading programs focused on deep learning for RF classification, data compression, high-performance computing, and managing and designing applications targeting large collection frameworks. His research interests include deep learning, high-performance computing, systems engineering, cloud architecture/integration, and statistical signal processing. He holds a Masters degree in Electrical & Computer Engineering from Georgia Tech and a Bachelors from Clemson University.
GPS Insight on Using Presto with Scylla for Data Analytics and Data ArchivalScyllaDB
GPS Insight is a leader in fleet vehicle management using IoT. Internally they use a combination of SQL and NoSQL big data technologies, including distributed SQL data analytics via Presto, an open-source query engine developed by Facebook. Learn how to set up, configure, and use Presto with Scylla for supporting ad hoc non-partition key queries for analytics and data scientists. Plus hear how to use Presto for a Data Archival approach with csv files on S3 or similar storage appliance.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies needs to be explicitly articulated, and we need to develop theories of change in context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
2. about HeteroDB
Massive log data processing using I/O optimized PostgreSQL
Corporate overview
Name HeteroDB,Inc
Established 4th-Jul-2017
Location Shinagawa, Tokyo, Japan
Businesses Production of accelerated PostgreSQL database
Technical consulting on GPU&DB region
By the heterogeneous-computing technology on the database area,
we provides a useful, fast and cost-effective data analytics platform
for all the people who need the power of analytics.
CEO Profile
KaiGai Kohei – He has contributed for PostgreSQL and Linux kernel
development in the OSS community more than ten years, especially,
for security and database federation features of PostgreSQL.
Award of “Genius Programmer” by IPA MITOH program (2007)
The posters finalist at GPU Technology Conference 2017.
3. Features of RDBMS
High-availability / Clustering
DB administration and backup
Transaction control
BI and visualization
We can use the products that
support PostgreSQL as-is.
Core technology – PG-Strom
Massive log data processing using I/O optimized PostgreSQL
PG-Strom: An extension module for PostgreSQL, to accelerate SQL
workloads by the thousands cores and wide-band memory of GPU.
GPU
Big-data Processing
PG-Strom
Machine-learning & Statistics
4. GPU’s characteristics - mostly as a computing accelerator
Massive log data processing using I/O optimized PostgreSQL
Over 10years history in HPC, then massive popularization in Machine-Learning
NVIDIA Tesla V100
Super Computer
(TITEC; TSUBAME3.0) Computer Graphics Machine-Learning
How I/O workloads are accelerated by GPU that is a computing accelerator?
Simulation
5. Where is the best “location” to process massive data?
Massive log data processing using I/O optimized PostgreSQL
Data transportation is fundamentally expensive.
SQL is a simple and flexible tool for data management.
Long-standing familiar tool for engineers.
Data
Visualization
Data
Processing
Data
Accumulation
Data
Generation
Statistics &
Machine-
Learning
PG-Strom for
Big-Data
PG-Strom for
Analytics
SSD-to-GPU
Direct SQL Columnar Data
(Parquet)
CREATE FUNCTION
my_logic( reggstore, text )
RETURNS matrix
AS $$
$$ LANGUAGE ‘plcuda’;
PL/CUDA
INSERT
Gstore_fdw Pystrom
Database system is closest to the data.
Simple & Flexible data management without data transportation.
BI Tools
6. DataSize
Of course, mankind has no “silver bullets” for all the usage scenarios
Massive log data processing using I/O optimized PostgreSQL
Small Tier
• Small & Middle Businesses or
Branch of Big Companies
• Data size: Less than 100GB
• Software: RDBMS
Middle Tier
• Global or domestic large company
• Data size: 1TB〜100TB
• Software: Small scale Hadoop, or
DWH solution
Top Tier
• Global Mega-Company
• Data size: 100TB, or often PB class
• Software: Large scale Hadoop cluster
8. Technical Challenges for Log Processing (2/3)
Massive log data processing using I/O optimized PostgreSQL
Distributed System has less H/W load, but more expensive in administration
Data
Management
with
Reasonable
Design
9. Data
Management
Technical Challenges for Log Processing (3/3)
Massive log data processing using I/O optimized PostgreSQL
Ultra
Fast I/O
Parallel
Computing
Hardware Evolution pulls up single-node potential to big-data processing
GPU
NVME-
SSD
10. Massive log data processing using I/O optimized PostgreSQL
How PostgreSQL utilizes GPU?
〜Architecture of PG-Strom〜
11. GPU code generation from SQL - Example of WHERE-clause
Massive log data processing using I/O optimized PostgreSQL
QUERY: SELECT cat, count(*), avg(x) FROM t0
WHERE x between y and y + 20.0 GROUP BY cat;
:
STATIC_FUNCTION(bool)
gpupreagg_qual_eval(kern_context *kcxt,
kern_data_store *kds,
size_t kds_index)
{
pg_float8_t KPARAM_1 = pg_float8_param(kcxt,1);
pg_float8_t KVAR_3 = pg_float8_vref(kds,kcxt,2,kds_index);
pg_float8_t KVAR_4 = pg_float8_vref(kds,kcxt,3,kds_index);
return EVAL((pgfn_float8ge(kcxt, KVAR_3, KVAR_4) &&
pgfn_float8le(kcxt, KVAR_3,
pgfn_float8pl(kcxt, KVAR_4, KPARAM_1))));
} :
E.g) Transformation of the numeric-formula
in WHERE-clause to CUDA C code on demand
Reference to input data
SQL expression in CUDA source code
Run-time
compiler
Parallel
Execution
12. EXPLAIN shows Query Execution plan with Custom GPU-node
Massive log data processing using I/O optimized PostgreSQL
postgres=# EXPLAIN ANALYZE
SELECT cat,count(*),sum(ax) FROM tbl NATURAL JOIN t1 WHERE cid % 100 < 50 GROUP BY cat;
QUERY PLAN
---------------------------------------------------------------------------------------------------
GroupAggregate (cost=203498.81..203501.80 rows=26 width=20)
(actual time=1511.622..1511.632 rows=26 loops=1)
Group Key: tbl.cat
-> Sort (cost=203498.81..203499.26 rows=182 width=20)
(actual time=1511.612..1511.613 rows=26 loops=1)
Sort Key: tbl.cat
Sort Method: quicksort Memory: 27kB
-> Custom Scan (GpuPreAgg) (cost=203489.25..203491.98 rows=182 width=20)
(actual time=1511.554..1511.562 rows=26 loops=1)
Reduction: Local
Combined GpuJoin: enabled
-> Custom Scan (GpuJoin) on tbl (cost=13455.86..220069.26 rows=1797115 width=12)
(never executed)
Outer Scan: tbl (cost=12729.55..264113.41 rows=6665208 width=8)
(actual time=50.726..1101.414 rows=19995540 loops=1)
Outer Scan Filter: ((cid % 100) < 50)
Rows Removed by Outer Scan Filter: 10047462
Depth 1: GpuHashJoin (plan nrows: 6665208...1797115,
actual nrows: 9948078...2473997)
HashKeys: tbl.aid
JoinQuals: (tbl.aid = t1.aid)
KDS-Hash (size plan: 11.54MB, exec: 7125.12KB)
-> Seq Scan on t1 (cost=0.00..2031.00 rows=100000 width=12)
(actual time=0.016..15.407 rows=100000 loops=1)
Planning Time: 0.721 ms
Execution Time: 1595.815 ms
(19 rows)
13. A usual composition of x86_64 server
Massive log data processing using I/O optimized PostgreSQL
GPUSSD
CPU
RAM
HDD
N/W
14. Data flow to process a massive amount of data
Massive log data processing using I/O optimized PostgreSQL
CPU RAM
SSD GPU
PCIe
PostgreSQL
Data Blocks
Normal Data Flow
All the records, including junks, must be loaded
onto RAM once, because software cannot check
necessity of the rows prior to the data loading.
So, amount of the I/O traffic over PCIe bus tends
to be large.
Unless records are not loaded to CPU/RAM once, over the PCIe bus,
software cannot check its necessity even if it is “junk”.
15. Core Feature: SSD-to-GPU Direct SQL
Massive log data processing using I/O optimized PostgreSQL
CPU RAM
SSD GPU
PCIe
PostgreSQL
Data Blocks
NVIDIA GPUDirect RDMA
It allows to load the data blocks on NVME-SSD
to GPU using peer-to-peer DMA over PCIe-bus;
bypassing CPU/RAM. WHERE-clause
JOIN
GROUP BY
Run SQL by GPU
to reduce the data size
Data Size: Small
v2.0
16. Benchmark Results – single-node version
Massive log data processing using I/O optimized PostgreSQL
2172.3 2159.6 2158.9 2086.0 2127.2 2104.3
1920.3
2023.4 2101.1 2126.9
1900.0 1960.3
2072.1
6149.4 6279.3 6282.5
5985.6 6055.3 6152.5
5479.3
6051.2 6061.5 6074.2
5813.7 5871.8 5800.1
0
1000
2000
3000
4000
5000
6000
7000
Q1_1 Q1_2 Q1_3 Q2_1 Q2_2 Q2_3 Q3_1 Q3_2 Q3_3 Q3_4 Q4_1 Q4_2 Q4_3
QueryProcessingThroughput[MB/sec]
Star Schema Benchmark on NVMe-SSD + md-raid0
PgSQL9.6(SSDx3) PGStrom2.0(SSDx3) H/W Spec (3xSSD)
SSD-to-GPU Direct SQL pulls out an awesome performance close to the H/W spec
Measurement by the Star Schema Benchmark; which is a set of typical batch / reporting workloads.
CPU: Intel Xeon E5-2650v4, RAM: 128GB, GPU: NVIDIA Tesla P40, SSD: Intel 750 (400GB; SeqRead 2.2GB/s)x3
Size of dataset is 353GB (sf: 401), to ensure I/O bounds workload
17. Technology Basis - GPUDirect RDMA
Massive log data processing using I/O optimized PostgreSQL
▌P2P data transfer technology between GPU and other PCIe devices, bypass CPU
Originally designed for multi-nodes MPI over Infiniband
Infrastructure of Linux kernel driver for other PCIe devices, including NVME-SSDs.
Copyright (c) NVIDIA corporation, 2015
18. SSD-to-GPU Direct SQL - Software Stack
Massive log data processing using I/O optimized PostgreSQL
Filesystem
(ext4, xfs)
nvme_strom
kernel module
NVMe SSD
NVIDIA Tesla GPU
PostgreSQL
pg_strom
extension
read(2) ioctl(2)
Hardware
Layer
Operating
System
Software
Layer
Database
Software
Layer
blk-mq
nvme pcie nvme rdma
Network HBA
File-offset to NVME-SSD sector
number translation
NVMEoF Target
(JBOF)
NVMe
Request
■ Other software component
■ Our developed software
■ Hardware
NVME over Fabric
nvme_strom v2.0 supports
NVME-over-Fabric (RDMA).
19. 40min queries
in the former version,
now 30sec
Case Study: Log management and analytics solution
Massive log data processing using I/O optimized PostgreSQL
Database server with
GPU+NVME-SSD
Search by SQL
Security incident
• Fast try & error
• Familiar interface
• Search on row-data
Aug 12 18:01:01 saba systemd: Started
Session 21060 of user root.
Understanding the
situation of damage
Identifying the impact
Reporting to the manager
21. Next bottleneck – Flood of the data over CPU’s capacity
NVME and GPU accelerates PostgreSQL beyond the limitation -PGconf.EU 2018-21
Skylake-SP allows P2P DMA routing up to 8.5GB/s over PCIe root complex.
GPU
SSD-1
SSD-2
SSD-3
md-raid0
Xeon
Gold
6126T
routing
by CPU
22. Consideration for the hardware configuration
Massive log data processing using I/O optimized PostgreSQL
PCIe-switch (PLX) can make CPU more relaxed.
CPU CPU
PLX
SSD GPU
PLX
SSD GPU
PLX
SSD GPU
PLX
SSD GPU
SCAN SCAN SCAN SCAN
JOIN JOIN JOIN JOIN
GROUP BY GROUP BY GROUP BY GROUP BY
Very small amount of data
GATHER GATHER
23. Hardware with PCIe switch (1/2) - HPC Server
Massive log data processing using I/O optimized PostgreSQL
Supermicro SYS-4029TRT2
x96 lane
PCIe switch
x96 lane
PCIe switch
CPU2 CPU1
QPI
Gen3
x16
Gen3 x16
for each slot
Gen3 x16
for each slot
Gen3
x16
24. Hardware with PCIe switch (2/2) - I/O expansion box
Massive log data processing using I/O optimized PostgreSQL
NEC ExpEther 40G (4slots)
4 slots of
PCIe Gen3 x8
PCIe
Swich
40Gb
Ethernet Network
switch
CPU
NIC
extra I/O boxes
25. System configuration with I/O expansion boxes
Massive log data processing using I/O optimized PostgreSQL
PCIe I/O Expansion Box
Host System
(x86_64 server)
NVMe SSD
PostgreSQL Tables
PostgreSQL
Data Blocks
Internal
PCIe Switch
SSD-to-GPU P2P DMA
(Large data size)
GPU
WHERE-clause
JOIN
GROUP BY
PCIe over
Ethernet
Pre-processed
small data
A few GB/s
SQL execution
performance
per box
A few GB/s
SQL execution
performance
per box
A few GB/s
SQL execution
performance
per box
NIC / HBA
Simplified DB operations and APP development
by the simple single-node PostgreSQL configuration
Enhancement of capacity & performance
Visible as leafs of partitioned child-tables on PostgreSQL
v2.1
26. Benchmark (1/2) - System configuration
Massive log data processing using I/O optimized PostgreSQL
x 1
x 3
x 6
x 3
NEC Express5800/R120h-2m
CPU: Intel Xeon E5-2603 v4 (6C, 1.7GHz)
RAM: 64GB
OS: Red Hat Enterprise Linux 7
(kernel: 3.10.0-862.9.1.el7.x86_64)
CUDA-9.2.148 + driver 396.44
DB: PostgreSQL 11beta3 + PG-Strom v2.1devel
NEC ExpEther 40G (4slots)
I/F: PCIe 3.0 x8 (x16 physical) ... 4slots
with internal PCIe switch
N/W: 40Gb-ethernet
Intel DC P4600 (2.0TB; HHHL)
SeqRead: 3200MB/s, SeqWrite: 1575MB/s
RandRead: 610k IOPS, RandWrite: 196k IOPS
I/F: PCIe 3.0 x4
NVIDIA Tesla P40
# of cores: 3840 (1.3GHz)
Device RAM: 24GB (347GB/s, GDDR5)
CC: 6.1 (Pascal, GP104)
I/F: PCIe 3.0 x16
SPECIAL THANKS FOR
v2.1
27. Benchmark (2/2) - Result of query execution performance
Massive log data processing using I/O optimized PostgreSQL
13 SSBM queries to 1055GB database in total (a.k.a 351GB per I/O expansion box)
Raw I/O data transfer without SQL execution was up to 9GB/s.
In other words, SQL execution was faster than simple storage read with raw-I/O.
2,388 2,477 2,493 2,502 2,739 2,831
1,865
2,268 2,442 2,418
1,789 1,848
2,202
13,401 13,534 13,536 13,330
12,696
12,965
12,533
11,498
12,312 12,419 12,414 12,622 12,594
0
2,000
4,000
6,000
8,000
10,000
12,000
14,000
Q1_1 Q1_2 Q1_3 Q2_1 Q2_2 Q2_3 Q3_1 Q3_2 Q3_3 Q3_4 Q4_1 Q4_2 Q4_3
QueryExecutionThroughput[MB/s]
Star Schema Benchmark for PgSQL v11beta3 / PG-Strom v2.1devel on NEC ExpEther x3
PostgreSQL v11beta3 PG-Strom v2.1devel Raw I/O Limitation
max 13.5GB/s for query execution performance with 3x GPU/SSD units!!
v2.1
28. PostgreSQL is a successor of XXXX?
Massive log data processing using I/O optimized PostgreSQL
13.5GB/s with
3x I/O boxes
30GB/s with
8x I/O boxes
is EXPECTED
Single-node PostgreSQL now offers DWH-appliance grade performance
29. Massive log data processing using I/O optimized PostgreSQL
The Future Direction
〜How latest hardware makes Big-Data easy〜
30. NVME-over-Fabric and GPU RDMA support
Massive log data processing using I/O optimized PostgreSQL
ready
Host System
SSD
SSD
SSD
HBA
CPU
100Gb-
Network
JBOF
More NVME-SSDs than what we can install on single-node with GPUs
Flexibility of storage capacity enhancement on demand
GPU/DB-node
Storage-node
SSD-to-GPU Direct SQL
over 100Gb Ethernet
with RDMA protocol
SSD
SSD
SSD
SSD
SSD
SSD
SSD
SSD
SSD
HBA
JBOF
SSD
SSD
SSD
SSD
SSD
SSD
HBA
31. Direct Load of External Columnar Data File (Parquet)
Massive log data processing using I/O optimized PostgreSQL
Large Data import to PostgreSQL takes long time before analytics.
Reference to external data files can skip the data importing.
Columnar format minimizes amount of I/O to be loaded from storage.
SSD-to-GPU Direct SQL on the columnar-file optimizes I/O efficiency.
Foreign Table allows reference to various kind of external data source.
It can be an infrastructure for SSD-to-GPU Direct SQL on Columnar Data Files.
Table
Foreign Table
(postgres_fdw)
Foreign Table
(file_fdw)
Foreign Table
(Gstore_fdw)
Foreign Table
(Parquet_fdw)
External RDMS server
(Oracle, MySQL, ...)
CSV File
GPU Device Memory
Parquet File
External
Data
Source
Data loading
takes
long time
WIP
32. Expected usage – IoT grade log data processing and analysis
Massive log data processing using I/O optimized PostgreSQL
As a data management, analytics and machine-learning platform for log data daily growing up
Manufacturing Logistics Mobile Home electronics
Why PG-Strom?
It covers around 30-50TB data with single node by addition of I/O expansion box.
It allows to process the raw log data as is, more than max performance of H/W.
Users can continue to use the familiar SQL statement and applications.
BI Tools
massive log collection
and management
JBoF: Just Bunch of Flash
NVME over Fabric
(RDMA)
PG-Strom
33. Summary
Massive log data processing using I/O optimized PostgreSQL
PG-Strom
An extension of PostgreSQL, to accelerate SQL execution by GPU.
It pulls out max capability of the latest hardware for big-data processing workloads.
Core feature: SSD-to-GPU Direct SQL
It directly loads data blocks on NVME-SSD to GPU using P2P DMA, then runs SQL
workloads on GPU on the middle of I/O path. By reduction of the data to be processed,
it improves the performance of I/O bound jobs.
Major Usage?
Large Log Processing - M2M/IoT grows the data size to be processed
rapidly, however, proper usage of the latest hardware can cover most of
the workloads, before the big architecture change to distributed cluster.
Simple single-node system, Familiar SQL interface, 10GB/s〜 performance
What I didn’t talk in this session
Advanced Analytics, Machine-Learning & Similarity Search on GPU
Let us make another chance
36. uninitialized
File
BLK-100: unknown
BLK-101: unknown
BLK-102: unknown
BLK-103: unknown
BLK-104: unknown
BLK-105: unknown
BLK-106: unknown
BLK-107: unknown
BLK-108: unknown
BLK-109: unknown
Consistency with disk buffers of PostgreSQL / Linux kernel
Massive log data processing using I/O optimized PostgreSQL
① PG-Strom checks PG’s shared buffer of the blocks to read
If block is already cached on the shared buffer of PostgreSQL.
If block may not be all-visible using visibility-map.
② NVME-Strom checks OS’s page cache of the blocks to read
If block is already cached on the page cache of Linux kernel.
But, on fast-SSD mode, page cache shall not be copied unless it is not diry.
③ Then, kicks SSD-to-GPU P2P DMA on the uncached blocks
BLK-100: uncached
BLK-101: cached by PG
BLK-102: uncached
BLK-103: uncached
BLK-104: cached by OS
BLK-105: cached by PG
BLK-106: uncached
BLK-107: uncached
BLK-108: cached by OS
BLK-109: uncached
BLCKSZ
(=8KB)
Transfer Size
Per Request
BLCKSZ *
NChunks
BLK-108: cached by OS
BLK-104: cached by OS
BLK-105: cached by PG
BLK-101: cached by PG
BLK-100: uncached
BLK-102: uncached
BLK-103: uncached
BLK-106: uncached
BLK-107: uncached
BLK-109: uncached
unused
SSD-to-GPU
P2P DMA
Userspace DMA
Buffer (RAM)
Device Memory
(GPU)
CUDA API
(userspace)
cuMemcpyHtoDAsync
BLK-108: cached by OS
BLK-104: cached by OS
BLK-105: cached by PG
BLK-101: cached by PG
37. (Extra Info) about Star Schema Benchmark
Massive log data processing using I/O optimized PostgreSQL
Q1-1)
select sum(lo_extendedprice*lo_discount) as revenue
from lineorder,date1
where lo_orderdate = d_datekey
and d_year = 1993
and lo_discount between 1 and 3
and lo_quantity < 25;
Q4-1)
select d_year, c_nation,
sum(lo_revenue - lo_supplycost) as profit
from date1, customer, supplier, part, lineorder
where lo_custkey = c_custkey
and lo_suppkey = s_suppkey
and lo_partkey = p_partkey
and lo_orderdate = d_datekey
and c_region = 'AMERICA'
and s_region = 'AMERICA'
and (p_mfgr = 'MFGR#1' or p_mfgr = 'MFGR#2')
group by d_year, c_nation
order by d_year, c_nation ;
1.6GB
customer
528MB
supplier
416KB
date
206MB
parts
lineorder
1051GB