The document discusses how theoretical FLOPs per clock do not necessarily correlate with real application performance. It uses an AMD processor called "Fangio" that has its floating point capability capped to 2 FLOPs/clock compared to 4 FLOPs/clock normally. Despite having only half the theoretical FLOPs, Fangio delivers similar performance to the normal processor on many applications. This shows that FLOPs alone do not determine performance, and that code vectorization and algorithm design are also important factors.
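The vectorization point can be made concrete with a small sketch (illustrative, not from the document): the two SAXPY variants below compute the same result, but only the vectorized form maps onto wide SIMD/FMA units and can approach a core's peak FLOP rate.

```python
import numpy as np

def saxpy_scalar(a, x, y):
    # One Python-level operation at a time; the SIMD units sit idle.
    return [a * xi + yi for xi, yi in zip(x, y)]

def saxpy_vectorized(a, x, y):
    # One fused multiply-add per element, issued in SIMD batches.
    return a * x + y

x = np.arange(4, dtype=np.float64)
y = np.ones(4)
assert np.allclose(saxpy_scalar(2.0, x, y), saxpy_vectorized(2.0, x, y))
```

Identical results, very different fractions of the theoretical FLOP budget actually used.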
This document discusses GPU accelerated computing and programming with GPUs. It provides characteristics of GPUs from Nvidia, AMD, and Intel including number of cores, memory size and bandwidth, and power consumption. It also outlines the 7 steps for programming with GPUs which include building and loading a GPU kernel, allocating device memory, transferring data between host and device memory, setting kernel arguments, enqueueing kernel execution, transferring results back, and synchronizing the command queue. The goal is to achieve super parallel execution with GPUs.
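The seven steps read like an OpenCL-style host program. A CPU-only Python mock (every name here, such as `FakeDevice`, is invented for illustration; this is not a real GPU API) shows the control flow:

```python
import numpy as np

class FakeDevice:
    """A CPU stand-in for a GPU, used only to illustrate the seven steps."""
    def __init__(self):
        self.mem = {}
    def alloc(self, name, nbytes):
        self.mem[name] = np.zeros(nbytes // 8)   # float64 buffer
    def upload(self, name, host):
        self.mem[name][:] = host                 # host -> device copy
    def download(self, name):
        return self.mem[name].copy()             # device -> host copy

def run_pipeline(host_in):
    dev = FakeDevice()
    kernel = lambda buf: buf * 2.0              # 1. "build and load" a kernel
    dev.alloc("in", host_in.nbytes)             # 2. allocate device memory
    dev.alloc("out", host_in.nbytes)
    dev.upload("in", host_in)                   # 3. transfer host -> device
    args = ("in", "out")                        # 4. set kernel arguments
    dev.mem[args[1]] = kernel(dev.mem[args[0]]) # 5. "enqueue" kernel execution
    result = dev.download("out")                # 6. transfer results back
    # 7. synchronize: this mock is synchronous, so there is nothing to wait on
    return result

print(run_pipeline(np.ones(4)))  # -> [2. 2. 2. 2.]
```

On a real device, steps 5-7 are asynchronous, which is exactly why the final synchronization step exists.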
The column-oriented data structure of PG-Strom stores data in separate column storage (CS) tables based on the column type, with indexes to enable efficient lookups. This reduces data transfer compared to row-oriented storage and improves GPU parallelism by processing columns together.
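The transfer saving can be sketched in plain Python (sizes and table contents below are made up): with a columnar layout, a query that references two columns ships only those columns to the GPU, not whole tuples.

```python
import numpy as np

# Row store: every scan touches whole tuples, payload included.
rows = [(i, float(i), "pad" * 10) for i in range(1000)]   # (id, value, payload)

# Column store: the same data, one contiguous array per column.
cols = {
    "id": np.arange(1000, dtype=np.int64),
    "value": np.arange(1000, dtype=np.float64),
}

# SELECT sum(value) WHERE id < 500 needs only two columns:
transferred = cols["id"].nbytes + cols["value"].nbytes    # bytes shipped to GPU
result = cols["value"][cols["id"] < 500].sum()
```

The contiguous per-column arrays are also what lets adjacent GPU threads read adjacent addresses, which is the parallelism benefit the text mentions.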
Recent developments in Graphics Processing Units (GPUs) have opened a new opportunity for harnessing their computing power as a general-purpose computing paradigm through CUDA parallel programming. However, porting applications to CUDA remains a challenge for average programmers. We have developed a restructuring software compiler (RT-CUDA) with the best possible kernel optimizations to bridge the gap between high-level languages and the machine-dependent CUDA environment. RT-CUDA is based upon a set of compiler optimizations. It takes a C-like program and converts it into an optimized CUDA kernel, with user directives in a configuration file guiding the compiler. While the invocation of external libraries is not possible with the commercial OpenACC compiler, RT-CUDA allows transparent invocation of highly optimized external math libraries like cuSPARSE and cuBLAS. For this, RT-CUDA uses interfacing APIs, error-handling interpretation, and user-transparent programming. This enables efficient design of linear algebra solvers (LAS). Evaluation of RT-CUDA has been performed on a Tesla K20c GPU with a variety of basic linear algebra operators (M+, MM, MV, VV, etc.) as well as solvers of systems of linear equations like Jacobi and Conjugate Gradient. We obtained significant speedup over other compilers like OpenACC and GPGPU compilers. RT-CUDA facilitates the design of efficient parallel software for developing parallel simulators (reservoir simulators, molecular dynamics, etc.) which are critical for the Oil & Gas industry. We expect RT-CUDA to be needed by many industries dealing with science and engineering simulation on massively parallel computers like NVIDIA GPUs.
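As a CPU-side illustration of one of the solvers the abstract mentions, a minimal Jacobi iteration in NumPy (a sketch, not RT-CUDA output) looks like this; each element of the update is independent, which is what makes the method a natural GPU kernel:

```python
import numpy as np

def jacobi(A, b, iters=100):
    """Jacobi iteration x_{k+1} = D^{-1} (b - R x_k), with R = A - D."""
    D = np.diag(A)              # diagonal entries
    R = A - np.diag(D)          # off-diagonal part
    x = np.zeros_like(b)
    for _ in range(iters):
        x = (b - R @ x) / D     # every component updates independently
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])   # diagonally dominant -> convergence
b = np.array([1.0, 2.0])
x = jacobi(A, b)
assert np.linalg.norm(A @ x - b) < 1e-6  # check the residual, not an exact answer
```

Conjugate Gradient has the same per-iteration building blocks (MV, VV, scalar reductions), which is why the evaluated operator set covers both solvers.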
The document discusses NVIDIA data center GPUs such as the A100, A30, A40, and A10 and their performance capabilities. It provides examples of GPU accelerated application performance showing simulations in Simulia CST Studio, Altair CFD, and Rocky DEM achieving excellent speedups on GPUs. It also discusses Paraview visualization being accelerated with NVIDIA OptiX ray tracing, further sped up using RT cores. Looking ahead, the document outlines NVIDIA Grace CPUs which are designed to improve memory bandwidth between CPUs and GPUs for giant AI and HPC models.
The document discusses PG-Strom, an open source project that uses GPU acceleration for PostgreSQL. PG-Strom allows for automatic generation of GPU code from SQL queries, enabling transparent acceleration of operations like WHERE clauses, JOINs, and GROUP BY through thousands of GPU cores. It introduces PL/CUDA, which allows users to write custom CUDA kernels and integrate them with PostgreSQL for manual optimization of complex algorithms. A case study on k-nearest neighbor similarity search for drug discovery is presented to demonstrate PG-Strom's ability to accelerate computational workloads through GPU processing.
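The k-nearest-neighbour similarity search in the case study can be sketched on the CPU with Jaccard (Tanimoto) similarity over binary fingerprints; the fingerprint data and helper names below are illustrative, and a PL/CUDA version would evaluate all database rows in parallel instead of looping.

```python
import numpy as np

def tanimoto(a, b):
    """Jaccard/Tanimoto similarity of two 0/1 fingerprint vectors."""
    inter = np.count_nonzero(a & b)
    union = np.count_nonzero(a | b)
    return inter / union if union else 0.0

def knn(query, db, k):
    # Score every fingerprint, then take the k highest-scoring indices.
    scores = np.array([tanimoto(query, fp) for fp in db])
    return np.argsort(-scores)[:k]

rng = np.random.default_rng(0)
db = rng.integers(0, 2, size=(100, 64), dtype=np.uint8)  # 100 random fingerprints
query = db[42]                                           # query is itself in the database
assert knn(query, db, 3)[0] == 42                        # its nearest neighbour is itself
```

The per-row scoring loop is embarrassingly parallel, which is why this workload maps so well onto thousands of GPU cores.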
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~, by Kohei KaiGai
GPU processing provides significant performance gains for PostgreSQL according to benchmarks. PG-Strom is an open source project that allows PostgreSQL to leverage GPUs for processing queries. It generates CUDA code from SQL queries to accelerate operations like scans, joins, and aggregations by massive parallel processing on GPU cores. Performance tests show orders of magnitude faster response times for queries involving multiple joins and aggregations when using PG-Strom compared to the regular PostgreSQL query executor. Further development aims to support more data types and functions for GPU processing.
This document describes using in-place computing on PostgreSQL to perform statistical analysis directly on data stored in a PostgreSQL database. Key points include:
- An F-test is used to compare the variances of accelerometer data from different phone models (Nexus 4 and S3 Mini) and activities (walking and biking).
- Performing the F-test directly in PostgreSQL via SQL queries is faster than exporting the data to an R script, as it avoids the overhead of data transfer.
- PG-Strom, an extension for PostgreSQL, is used to generate CUDA code on-the-fly to parallelize the variance calculations on a GPU, further speeding up the F-test.
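The F-statistic those queries compute is just a ratio of sample variances; a small Python sketch with made-up accelerometer samples (the real data comes from the phones' sensors):

```python
import statistics

walking = [1.2, 1.4, 1.1, 1.5, 1.3]   # illustrative accelerometer magnitudes
biking  = [2.1, 3.4, 1.2, 4.0, 2.8]

# F = s1^2 / s2^2, the ratio of the two sample variances.
f_stat = statistics.variance(biking) / statistics.variance(walking)

# A large F suggests the two activities differ in variance; the decision
# threshold comes from the F distribution with (n1 - 1, n2 - 1) degrees
# of freedom, which is the part done outside the variance computation.
```

Because each variance is a sum over independent rows, PG-Strom can evaluate the inner sums on the GPU and leave only the final division and the threshold lookup to the host.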
1. The document discusses implementing distributed mclock in Ceph for quality of service (QoS). It describes implementing QoS units at the pool, RBD image, and universal levels.
2. It covers inserting delta/rho/phase parameters into Ceph classes for distributed mclock. Issues addressed include number of shards and background I/O.
3. An outstanding I/O based adaptive throttle is introduced to suspend mclock scheduling if the I/O load is too high. Testing showed it effectively maintained maximum throughput.
4. Future plans include improving the mclock algorithm, extending QoS to individual RBDs, adding metrics, and testing in various environments. Collaboration with the community is also planned.
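The outstanding-I/O adaptive throttle described above can be sketched as a small state machine (class and method names here are illustrative, not Ceph's actual code): dispatches are suspended once in-flight I/Os hit a limit, and resume as completions arrive.

```python
class AdaptiveThrottle:
    """Suspend scheduling when too many I/Os are outstanding."""
    def __init__(self, limit):
        self.limit = limit
        self.outstanding = 0

    def try_dispatch(self):
        if self.outstanding >= self.limit:
            return False          # load too high: suspend mclock dispatch
        self.outstanding += 1     # account for the new in-flight I/O
        return True

    def on_complete(self):
        self.outstanding -= 1     # a completion frees a slot

t = AdaptiveThrottle(limit=2)
assert t.try_dispatch() and t.try_dispatch()   # two I/Os in flight
assert not t.try_dispatch()                    # throttled at the limit
t.on_complete()
assert t.try_dispatch()                        # resumes after a completion
```

The real implementation would additionally adapt `limit` to the observed device throughput, which is the "adaptive" part of the design.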
This document discusses using GPUs and SSDs to accelerate PostgreSQL queries. It introduces PG-Strom, a project that generates CUDA code from SQL to execute queries massively in parallel on GPUs. The document proposes enhancing PG-Strom to directly transfer data from SSDs to GPUs without going through CPU/RAM, in order to filter and join tuples during loading for further acceleration. Challenges include improving the NVIDIA driver for NVMe devices and tracking shared buffer usage to avoid unnecessary transfers. The goal is to maximize query performance by leveraging the high bandwidth and parallelism of GPUs and SSDs.
1) The PG-Strom project aims to accelerate PostgreSQL queries using GPUs. It generates CUDA code from SQL queries and runs them on Nvidia GPUs for parallel processing.
2) Initial results show PG-Strom can be up to 10 times faster than PostgreSQL for queries involving large table joins and aggregations.
3) Future work includes better supporting columnar formats and integrating with PostgreSQL's native column storage to improve performance further.
PG-Strom - A FDW module utilizing GPU device, by Kohei KaiGai
PG-Strom is a module that utilizes GPUs to accelerate query processing in PostgreSQL. It uses a foreign data wrapper to push query execution to the GPU. Benchmark results show a query running 10 times faster on a table using the PG-Strom FDW compared to a regular PostgreSQL table. Future plans include supporting writable foreign tables, accelerating sort and aggregate operations using the GPU, and inheritance between regular and foreign tables. Help from the community is needed to review code, provide large real-world datasets, and understand common analytic queries.
The document discusses graphics processing units (GPUs) and general-purpose GPU (GPGPU) computing. It explains that GPUs were originally designed for computer graphics but can now be used for general computations through GPGPU. The document outlines CUDA and MPI frameworks for programming GPGPU applications and discusses how GPGPU provides highly parallel processing that is much faster than traditional CPUs. Example applications mentioned include molecular dynamics, bioinformatics, and high performance computing.
In this deck from the UK HPC Conference, Gunter Roeth from NVIDIA presents: Hardware & Software Platforms for HPC, AI and ML.
"Data is driving the transformation of industries around the world and a new generation of AI applications are effectively becoming programs that write software, powered by data, vs by computer programmers. Today, NVIDIA’s tensor core GPU sits at the core of most AI, ML and HPC applications, and NVIDIA software surrounds every level of such a modern application, from CUDA and libraries like cuDNN and NCCL embedded in every deep learning framework and optimized and delivered via the NVIDIA GPU Cloud to reference architectures designed to streamline the deployment of large scale infrastructures."
Watch the video: https://wp.me/p3RLHQ-l2Y
Learn more: http://nvidia.com
and
http://hpcadvisorycouncil.com/events/2019/uk-conference/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
RISC-V and OpenPOWER open-ISA and open-HW - a swiss army knife for HPC, by Ganesan Narayanasamy
To cope with the end of Moore's law and Dennard scaling, the world of High-Performance Computing is rapidly evolving toward high-throughput architectures with specialized hardware for vector and tensor operations, in conjunction with sophisticated power management subsystems. The RISC-V ISA and open hardware can prove their effectiveness in fostering innovation in the HPC market as they have done in the embedded one. In this talk, I will introduce a set of building blocks for future HPC systems we have been designing at ETH Zurich and the University of Bologna.
dCUDA: Distributed GPU Computing with Hardware Overlap, by inside-BigData.com
Torsten Hoefler from ETH Zurich presented this deck at the Switzerland HPC Conference.
"Over the last decade, CUDA and the underlying GPU hardware architecture have continuously gained popularity in various high-performance computing application domains such as climate modeling, computational chemistry, or machine learning. Despite this popularity, we lack a single coherent programming model for GPU clusters. We therefore introduce the dCUDA programming model, which implements device-side remote memory access with target notification. To hide instruction pipeline latencies, CUDA programs over-decompose the problem and over-subscribe the device by running many more threads than there are hardware execution units. Whenever a thread stalls, the hardware scheduler immediately proceeds with the execution of another thread ready for execution. This latency-hiding technique is key to make best use of the available hardware resources. With dCUDA, we apply latency hiding at cluster scale to automatically overlap computation and communication. Our benchmarks demonstrate perfect overlap for memory bandwidth-bound tasks and good overlap for compute-bound tasks."
Watch the video presentation: http://wp.me/p3RLHQ-gCB
CUDA 6.0 provides performance improvements and new features for several CUDA libraries and tools. Key updates include up to 2x faster kernel launches, new cuFFT and cuBLAS features for multi-GPU support, up to 700 GFLOPS performance from cuFFT, over 3 TFLOPS from cuBLAS, and 5x faster cuSPARSE performance compared to MKL. New features also improve the performance of cuRAND, NPP, and Thrust.
PL/CUDA allows running CUDA C code directly in PostgreSQL user-defined functions. This allows advanced analytics and machine learning algorithms to be run directly in the database.
The gstore_fdw foreign data wrapper allows data to be stored directly in GPU memory, accessed via SQL, eliminating the overhead of copying data between CPU and GPU memory for each query.
Integrating PostgreSQL with GPU computing and machine learning frameworks allows for fast data exploration and model training by combining flexible SQL queries with high-performance analytics directly on the data.
In this deck from the Argonne Training Program on Extreme-Scale Computing 2019, Howard Pritchard from LANL and Simon Hammond from Sandia present: NNSA Explorations: ARM for Supercomputing.
"The Arm-based Astra system at Sandia will be used by the National Nuclear Security Administration (NNSA) to run advanced modeling and simulation workloads for addressing areas such as national security, energy and science.
"By introducing Arm processors with the HPE Apollo 70, a purpose-built HPC architecture, we are bringing powerful elements, like optimal memory performance and greater density, to supercomputers that existing technologies in the market cannot match,” said Mike Vildibill, vice president, Advanced Technologies Group, HPE. “Sandia National Laboratories has been an active partner in leveraging our Arm-based platform since its early design, and featuring it in the deployment of the world’s largest Arm-based supercomputer, is a strategic investment for the DOE and the industry as a whole as we race toward achieving exascale computing.”
Watch the video: https://wp.me/p3RLHQ-l29
Learn more: https://insidehpc.com/2018/06/arm-goes-big-hpe-builds-petaflop-supercomputer-sandia/
and
https://extremecomputingtraining.anl.gov/agenda-2019/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Exceeding the Limits of Air Cooling to Unlock Greater Potential in HPC, by inside-BigData.com
In this deck from the Perth HPC Conference, Werner Scholz from XENON Systems presents: Exceeding the Limits of Air Cooling to Unlock Greater Potential in HPC.
"A decade ago, 100 watts per CPU was devastating to thermal design. Today, Intel’s highest performing CPUs (e.g. Intel Cascade Lake-AP 9282 processor) have a thermal design envelope of 400 watts. There really is no end in sight, and accommodating more power is critical to advancing performance. The ability to dissipate the resulting heat is the hard ceiling that systems face in terms of performance – giving greater importance to liquid cooling breakthroughs. With liquid cooling, less energy is expended to cool systems – a significant savings in HPC deployments with arrays of servers drawing energy and generating heat. Electrical current drives the CPU and enables it to function. This electrical power is converted into thermal energy (heat). To maintain a stable temperature, the CPU needs to be cooled by efficiently removing this heat and releasing it. Liquid cooling is the best way to cool a system because liquid transfers heat much more efficiently than air. From an environmental perspective, liquid cooling reduces both those characteristics to create a smarter and more ecological approach on a grand scale. The cascade of value continues, as ambient heat removed from systems can then be used to heat buildings and augment or replace traditional heating systems. It’s an intelligent approach to thermal management, distributing the economic value of reduced energy use and transforming heat into an enterprise asset."
Watch the video: https://wp.me/p3RLHQ-kZa
Learn more: https://www.xenon.com.au/
and
http://hpcadvisorycouncil.com/events/2019/australia-conference/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
This document summarizes a presentation on GPGPU-Sim, a widely used GPU simulator. It introduces GPGPU-Sim and its key components, including its functional model that simulates the GPU programming model and virtual/machine instruction sets, its performance model that simulates timing of GPU components, and its power model GPUWattch. It outlines new features in GPGPU-Sim, including an improved Volta architecture model, the ability to run closed source libraries like cuDNN, modeling of tensor cores, and running the CUTLASS library. The document provides an overview of these new developments and how they enhance GPGPU-Sim's accuracy in simulating modern GPUs.
The document summarizes a presentation given by Stephan Hodes on optimizing performance for AMD's Graphics Core Next (GCN) architecture. The presentation covers key aspects of the GCN architecture, including compute units, registers, and latency hiding. It then provides a top 10 list of performance advice for GCN, such as using DirectCompute threads in groups of 64, avoiding over-tessellation, keeping shader pipelines short, and batching draw calls.
What do data center operators need to know when deploying Hadoop in the Data Center? Multi-tenancy, network topology, workload types, and myriad other factors affect the way applications run and perform in the data center. Understanding performance characteristics of the distributed system is key to not only optimize for Hadoop, but allows Hadoop to seamlessly operate side-by-side existing applications.
This document discusses the design and implementation of AM and QPSK software defined radio transmitters and receivers using the ZedBoard and FMComms4 RF transceiver. It begins with descriptions of the QPSK transmitter and receiver designs in Simulink, including resampling, modulation, filtering and data synchronization components. Implementation of the designs in HDL and resource utilization on the ZedBoard is also covered. In addition, an AM transmitter and receiver is presented which is able to transmit and recover an audio signal in under a second. The document provides guidance on building SDR systems on the ZedBoard from initial simulation to final hardware implementation.
This document summarizes VPU and GPGPU computing technologies. It discusses that a VPU is a visual processing unit, also known as a GPU. GPUs have massively parallel architectures that allow them to perform better than CPUs for some complex computational tasks. The document then discusses GPU, PPU and GPGPU architectures, programming models like CUDA, and applications of GPGPU computing such as machine learning, robotics and scientific research.
This document summarizes VPU and GPGPU technologies. It discusses that a VPU is a visual processing unit, also known as a GPU. GPUs have massively parallel architectures that allow them to perform better than CPUs for some complex computational tasks. The document then discusses GPU architecture including stream processing, graphics pipelines, shaders, and GPU clusters. It provides an example of using CUDA for GPU computing and discusses how GPUs are used for general purpose computing through frameworks like CUDA.
This document summarizes VPU and GPGPU computing technologies. It discusses that a VPU is a visual processing unit, also known as a GPU. GPUs provide massively parallel and multithreaded processing capabilities. GPUs are now commonly used for general purpose computing due to their ability to handle complex computational tasks faster than CPUs in some cases. The document then discusses GPU and PPU architectures, programming models like CUDA, and applications of GPGPU computing such as machine learning, robotics, and scientific research.
The document discusses VPU and GPGPU computing. It explains that a VPU is a visual processing unit, also known as a GPU. GPUs are massively parallel and multithreaded processors that are better than CPUs for tasks like machine learning and graphics processing. The document then discusses GPU architecture, memory, and programming models like CUDA. It provides examples of GPU usage and concludes that GPGPU is used in fields like machine learning, robotics, and scientific computing.
The document discusses NVIDIA's new Volta GPU architecture and its Tesla V100 GPU. Some key points:
- The Tesla V100 GPU uses the new Volta architecture and features new Tensor Cores that provide a major speedup for deep learning workloads.
- Compared to the previous Pascal GPU, the V100 offers 6x higher deep learning performance using FP16 and 1.5-1.9x higher performance for FP32 and FP64 workloads.
- The V100's Tensor Cores enable mixed precision training where most operations can be done in FP16 with no loss of accuracy using techniques like loss scaling.
- Benchmark results show training ResNet-50 on
Warp processing is a technique that dynamically optimizes software to improve performance and energy efficiency. It works by profiling an application to identify critical regions, then partitioning those regions to hardware using an FPGA. The binary is updated to execute the partitioned regions on the FPGA circuit while the rest continues in software. This allows applications to achieve speedups of 2-100x or more while using 20x less memory and reducing power consumption by 38-94%.
The document discusses the features and capabilities of the QNAP TS-832PX and TS-932PX network attached storage (NAS) devices. Both NAS devices come with dual 10GbE SFP+ and dual 2.5GbE RJ45 ports to provide faster network speeds. They are suitable for small and medium sized business environments that have an increasing number of connected devices and larger file sizes. The document provides details on the specifications, performance tests results, and software applications that come with the NAS devices.
Revisiting Co-Processing for Hash Joins on the Coupled CPU-GPU Architecture
This document summarizes research on revisiting co-processing techniques for hash joins on coupled CPU-GPU architectures. It discusses three co-processing mechanisms: off-loading, data dividing, and pipelined execution. Off-loading involves assigning entire operators like joins to either the CPU or GPU. Data dividing partitions data between the processors. Pipelined execution aims to schedule workloads adaptively between the CPU and GPU to maximize efficiency on the coupled architecture. The researchers evaluate these approaches for hash join algorithms, which first partition, build hash tables, and probe tables on the input relations.
FPGAs can compete with GPUs for some applications but with some key differences:
1) FPGAs are configured to create custom hardware for an algorithm rather than using predefined hardware like GPUs. This allows high efficiency but is more difficult to program.
2) While OpenCL provides a common language, FPGAs and GPUs have very different architectures and optimizing algorithms requires different approaches for each.
3) For applications with high bandwidth I/O or flexibility requirements, FPGAs may have advantages over GPUs, but GPUs typically have higher performance for compute-heavy applications and better energy efficiency. Overall, FPGAs have become more accessible but still require more programming effort than GPUs.
1) The document describes an FPGA-based modular and generic automated test equipment (ATE) designed for testing a digital beam forming (DBF) unit.
2) The ATE uses a multi-FPGA, multi-card solution to emulate radar components like receivers and transmitters and test the DBF system at full operating speeds.
3) The ATE architecture is modular and scalable, allowing it to test DBF systems with varying numbers of receivers through reconfiguration of the FPGA designs.
The RAPIDS suite of software libraries gives you the freedom to execute end-to-end data science and analytics pipelines entirely on GPUs. It relies on NVIDIA® CUDA® primitives for low-level compute optimization, but exposes that GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.
Ceph at Work in Bloomberg: Object Store, RBD and OpenStack
Bloomberg's Chris Jones and Chris Morgan joined Red Hat Storage Day New York on 1/19/16 to explain how Red Hat Ceph Storage helps the financial giant tackle its data storage challenges.
Performance analysis of 3D Finite Difference computational stencils on Seamic...
Seamicro fabric compute systems offers an array of low power compute nodes interconnected with a 3D torus network fabric (branded Freedom Supercomputer Fabric). This specific network topology allows very efficient point to point communications where only your neighbor compute nodes are involved in the communications. Such type of communication pattern arises in a wide variety of distributed memory applications like in 3D Finite Difference computational stencils, present on many computationally expensive scientific applications (eg. seismic, computational fluid dynamics). We present the performance analysis (computation, communication, scalability) of a generic 3D Finite Difference computational stencil on such a system. We aim to demonstrate with this analysis the suitability of Seamicro fabric compute systems for HPC applications that exhibit this communication pattern.
Exploring the Performance Impact of Virtualization on an HPC Cloud
The document evaluates the performance impact of virtualization on high-performance computing (HPC) clouds. Experiments were conducted on the AIST Super Green Cloud, a 155-node HPC cluster. Benchmark results show that while PCI passthrough mitigates I/O overhead, virtualization still incurs performance penalties for MPI collectives as node counts increase. Application benchmarks demonstrate overhead is limited to around 5%. The study concludes HPC clouds are promising due to utilization improvements from virtualization, but further optimization of virtual machine placement and pass-through technologies could help reduce overhead.
1. Do theoretical FLOPs matter for
real application's performance?
By Joshua.Mora@amd.com
Abstract: The most intelligent answer to this question is "it depends on
the application". To prove that, we will show a few examples from both the
theoretical and the practical point of view. To validate it experimentally,
a modified AMD processor named "Fangio" (AMD Opteron 6275 Processor) will
be used, which has its floating point capability limited to 2
FLOPs/clk/BD unit, delivering less (-8% on average) but close to the
performance of the AMD Opteron 6276 Processor with 4 times the floating
point capability, i.e. 8 FLOPs/clk/BD unit.
The intention of this work is threefold: i) to demonstrate that the
FLOPs/clk/core of a microprocessor architecture is not necessarily a good
performance metric, even though it is heavily used by the industry (e.g.
HPL); ii) to show that the code vectorization technology of compilers is
fundamental to extracting as much real application performance as
possible, but still has a long way to go; iii) to note that it would not
be fair to blame compiler technology exclusively: algorithms are often not
designed and written in a way that lets compilers exploit vector
instructions (i.e. SSE, AVX and FMA).
Saudi Arabia HPC, KAUST, Thuwal, 2012
2. Agenda
• Concepts
– Kinds of FLOPs/clk
– Single Instruction Single Data, Single Instruction Multiple Data
• AMD Interlagos processor FPU
– FPU, see Understanding Interlagos arch through HPL (HPC
Advisory Council workshop, ISC 2012)
– Roofline model
• AMD Fangio processor
– FPU capping, roofline model
• Results/Conclusions within roofline model for Interlagos and
Fangio.
– Benchmarks: HPL, stream, CFD apps, SPEC fp benchmarks.
4. SISD: Single Instruction Single Data
SIMD: Single Instruction Multiple Data
SISD handles a single data input and result at each clock, stored in
scalar format (1 clock per operation). SIMD handles streams of input data
and results, stored in vectors or packed format (SSE, AVX, FMA); bubbles
(no work) appear when the vectors cannot be filled.
Current CPU cores can crunch 8 DP numbers at a time. GPU streaming cores
can crunch 2-4 DP numbers, and there are several thousand streaming cores
per GPU.
SIMD allows processing of more data, but the data needs to be
formatted / packed to fit the vectors. THAT IS THE CHALLENGE.
5. A few slides from the presentation "Understanding Interlagos arch through HPL" (HPC Advisory Council workshop, ISC 2012):
11. Roofline for AMD Interlagos
System: 2P, 2.3GHz, 16 cores, 1600MHz DDR3.
Real GFLOP/s in double precision (Linpack benchmark):
2 procs x 8 core-pairs x 2.3GHz x 8 DP FLOP/clk/core-pair x 0.85 eff =
250 DP GF/s
Very high numerical intensity, i.e. (FLOP/s)/(Byte/s):
- uses the AMD Core Math Library (FMA4 instructions)
- cache friendly
- reuses data
- DGEMM is L3 BLAS with arithmetic intensity of order N (the problem size)
Real GB/s: 72 GB/s (stream benchmark)
Low numerical intensity:
- uses non-temporal stores (the write-combine buffer is used instead of
evicting data through L2 -> L3 -> RAM, speeding up writes to RAM)
- not cache friendly
- no reuse of data
- cores wait for data most of the time (low FLOP/clk despite using SSE2 and FMA4)
- stream is L1 BLAS with arithmetic intensity of order 1 (size independent)
12. AMD Fangio, FPU capping
• Fangio is the Interlagos processor model 6276 but with the FPU capped
from 8 DP FLOP/clk to 2 DP FLOP/clk, by slowing down FPU instruction
retirement.
• It keeps the same instruction set architecture as Interlagos.
• System: 2P, 2.3GHz, 16 cores, 1600MHz DDR3.
Performance impact depends on the workload:
Real GFLOP/s in double precision (Linpack benchmark):
2 procs x 8 core-pairs x 2.3GHz x 2 DP FLOP/clk/core-pair =
73.6 DP GF/s theoretical (~75 DP GF/s delivered)
Real GB/s: 72 GB/s (stream benchmark)
unmodified memory throughput performance!
13. HPL runs to confirm FPU capping
2P 16 cores @ 2.3GHz (6276 Interlagos)
==============================================================================
T/V                N    NB     P     Q               Time             Gflops
--------------------------------------------------------------------------------
WR01R2L4       86400   100     4     8            1774.55          2.423e+02
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0022068 ...... PASSED
==============================================================================
2P 16 cores @ 2.3GHz (6275 Interlagos "Fangio"): 3x longer time
==============================================================================
T/V                N    NB     P     Q               Time             Gflops
--------------------------------------------------------------------------------
WR01R2L4       86400   100     4     8            5494.39          7.826e+01
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0022316 ...... PASSED
==============================================================================
78.26 GF/s measured against a theoretical peak of 73.6 GF/s
(16 CU x 2 GF/clk/CU x 2.3GHz): 106% HPL efficiency!! (above 100% because
the cores boost above the nominal frequency)
14. Stream runs on 6275 to confirm
no drop in memory throughput
-------------------------------------------------------------
Function Rate (MB/s) Avg time Min time Max time
Copy: 73089.4045 0.0443 0.0438 0.0449
Scale: 68952.3038 0.0469 0.0464 0.0472
Add: 66289.3072 0.0729 0.0724 0.0734
Triad: 66301.0957 0.0730 0.0724 0.0734
-------------------------------------------------------------
Scale, Add and Triad do FLOPs in double precision.
Triad is plotted in the roofline model since it is the kernel with the
most FLOPs per element: an add and a multiply, combined via FMA4.
#pragma omp parallel for
for (j=0; j<N; j++) a[j] = b[j]+scalar*c[j];
15. Summary of measurements per node plotted in roofline model
(Frequencies below are the nominal 2.3GHz, not the effective frequency due
to boost.)
HPL efficiency: Interlagos (2*7.8)/(2.3GHz*8) = 85% eff;
Fangio (2*2.3)/(2.3GHz*2) = 100% eff !!

Workload      Interlagos 6276                  Interlagos 6275 "Fangio"
              GB/s      DP GF/s      AI=F/B    GB/s        DP GF/s     AI=F/B
HPL           6*4=24    7.8*32=250   10.6      1.8*4=6.8   2.3*32=75   11.7
STR. TRIAD    17*4=68   0.5*32=16    0.08      17*4=68     0.5*32=16   0.08
OPENFOAM      15*4=60   0.8*32=25    0.41      14*4=56     0.7*32=22   0.39

• 1 computer has 2 processors, with a total of 4 numanodes and 32 cores in
16 compute units.
• 1 numanode has a total of 4 compute units.
• Memory bandwidth in GB/s is measured per numanode.
• Double precision floating point throughput is measured per core.
16. Roofline for Interlagos and Fangio
Measured and plotted: GF/s (log2 scale) against arithmetic intensity
(GF/s)/(GB/s) (log2 scale). Both processors have the same memory
bandwidth, i.e. the same bandwidth slope; the compute rooflines are AMD
Interlagos at 250 GF/s and AMD Fangio at 75 GF/s.
• HPL (L3 BLAS, e.g. DGEMM; benefits from vectorization: FMA, AVX, SSE):
75% perf drop on Fangio.
• SPECfp: 3-20% perf drop, 8% on average.
• Sparse algebra such as CFD apps (OpenFOAM, FLUENT, STAR-CCM+, ...):
~6-8% perf drop.
• TRIAD (data dependencies, scalar code, no benefit from vectorization):
0% perf drop.
17. Performance impact on SPEC fp 2006 rate peak: resource utilization
• SPEC website link: www.spec.org
• Runs were done with the peak flags configuration in order to utilize
the compiler technology optimally.
• In this case the Open64 compiler has been used.
• Runs were done with only 1 copy per Bulldozer unit, to allow each
process/copy to fully utilize the available computing resources without
constraints originating from the shared resources of the Bulldozer
compute unit (e.g. L2, FPU, instruction scheduler).
18. Performance impact on SPEC fp 2006 rate peak (cont)

Benchmark   Application area        Brief description                         % perf. drop
Bwaves      Fluid Dynamics          3D transonic transient laminar            0.09%
                                    viscous flow.
Gamess      Quantum Chemistry       Self-consistent field calculations        -10.51%
                                    using the Restricted Hartree-Fock
                                    method.
Milc        Quantum                 Gauge field generating program for        0.10%
            Chromodynamics          lattice gauge theory programs with
                                    dynamical quarks.
Zeusmp      Fluid Dynamics          NCSA code, CFD simulation of              -7.47%
                                    astrophysical phenomena.
Gromacs     Biochemistry /          Newtonian equations of motion for         -32.17% (GPU candidate)
            Molecular Dynamics      hundreds to millions of particles.
CactusADM   General Relativity      Solves the Einstein evolution             -2.01%
                                    equations using a staggered-leapfrog
                                    numerical method.
Leslie3d    Fluid Dynamics          CFD, Large Eddy Simulation.               -0.44%
Namd        Biology / Molecular     Large biomolecular systems; the test      -24.23% (GPU candidate)
            Dynamics                case has 92,224 atoms of
                                    apolipoprotein A-I.
19. Performance impact on SPEC fp 2006 rate peak (cont)

Benchmark   Application area        Brief description                         % perf. drop
dealII      Finite Element          Adaptive finite elements and error        -9.09%
            Analysis                estimation; Helmholtz-type equation.
Soplex      Linear Programming,     Simplex algorithm and sparse linear       1.86%
            Optimization            algebra; test cases include railroad
                                    planning and military airlift models.
Povray      Image Ray-tracing       Image rendering; the testcase is a        -12.15%
                                    1280x1024 anti-aliased image of a
                                    landscape.
Calculix    Structural Mechanics    Finite element code for linear and        -26.82% (GPU candidate)
                                    nonlinear 3D structural applications.
GemsFDTD    Computational           Solves the Maxwell equations in 3D        -0.67%
            Electromagnetics        using the finite-difference
                                    time-domain (FDTD) method.
Tonto       Quantum Chemistry       Molecular Hartree-Fock wavefunction       -14.43%
                                    calculation to better match
                                    experimental X-ray diffraction data.
Lbm         Fluid Dynamics          "Lattice-Boltzmann Method" to             0.58%
                                    simulate incompressible fluids in 3D.
Wrf         Weather                 Weather modeling from scales of           -0.95%
                                    meters to thousands of kilometers.
Sphinx3     Speech recognition      Speech recognition system from            -3.00%
                                    Carnegie Mellon University.

AVERAGE REAL PERFORMANCE DROP WITH THEORETICAL FLOPs REDUCED BY 75%: -8.94%
20. Performance impact on (most) CFD apps
• Most CFD apps with an Eulerian formulation use sparse linear algebra to
represent the linearized Navier-Stokes equations on unstructured grids.
• The higher the order of the discretization scheme, the higher the
arithmetic intensity.
• Data dependencies in both space and time prevent vectorization.
• Large datasets have low cache reuse.
• Cores spend most of their time waiting for new data to arrive in the caches.
• Once data is in the caches, the floating point instructions are mostly
scalar instead of packed.
• Compilers have a hard time finding opportunities to vectorize loops.
• Loop unrolling and partial vectorization of independent data help very
little, because the cores are waiting for that data anyway.
• Overall: low performance from a FLOP/s point of view.
• Therefore, capping the FPU in terms of FLOPs/clk barely impacts these
applications' performance.
• Theoretical FLOP/s is therefore not a good indicator of how
applications such as CFD (and many more) will perform.
21. What should we do, moving forward?
• Multidisciplinary teams to work on:
– Algorithm research and development to make algorithms more hardware aware.
– Software research and development to implement the algorithms
efficiently (e.g. communication avoidance, dynamic task scheduling, work
stealing, locality, power awareness, resilience, ...).
– Interaction between domain scientists and computer (HW+SW) scientists
to develop new formulations of the equations that yield algorithms better
suited to new computer architectures.
– Research and development on compiler and programming language
technology to detect algorithm properties and exploit hardware features.
• Supercomputing datacenter institutions to work on:
– Enabling science by proper exploitation of computational resources.
– Multidisciplinary teams educating scientists on how to use the resources.
– Funding and measuring supercomputing investments in terms of the number
and quality of scientific projects, not in terms of CPU utilization (CPU
utilization isn't CPU efficiency, just as theoretical FLOPs aren't real
application performance).