Paper presented at the 2nd International Workshop on Deployment and Use of Accelerators (DUAC). Co-located with the 51st International Conference on Parallel Processing (ICPP). August 29, 2022 (virtual event). More information at: https://duac2022.wordpress.com/
OpenACC and Open Hackathons Monthly Highlights August 2022 - OpenACC
Stay up-to-date with the OpenACC and Open Hackathons Monthly Highlights. August’s edition covers the 2022 OpenACC and Hackathons Asia-Pacific Summit, NVIDIA’s GTC, upcoming Open Hackathons and Bootcamps, EuroHPC, the launch of Frontier and Polaris supercomputers, recent research, new resources, and more!
In this deck from the GPU Technology Conference, Kevin Roe from the Maui High Performance Computing Center presents: Multi-GPU FFT Performance on Different Hardware Configurations.
"We will characterize the performance of multi-GPU systems in an effort to determine their viability for running physics-based applications using Fast Fourier Transforms (FFTs). Additionally, we'll discuss how multi-GPU FFTs allow available memory to exceed the limits of a single GPU and how they can reduce computational time for larger problem sizes."
Watch the video: https://wp.me/p3RLHQ-kjQ
Learn more: https://www.mhpcc.hpc.mil/ and https://www.nvidia.com/en-us/gtc/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
This document discusses the design and implementation of AM and QPSK software defined radio transmitters and receivers using the ZedBoard and FMComms4 RF transceiver. It begins with descriptions of the QPSK transmitter and receiver designs in Simulink, including resampling, modulation, filtering and data synchronization components. Implementation of the designs in HDL and resource utilization on the ZedBoard is also covered. In addition, an AM transmitter and receiver is presented which is able to transmit and recover an audio signal in under a second. The document provides guidance on building SDR systems on the ZedBoard from initial simulation to final hardware implementation.
Exploring the Performance Impact of Virtualization on an HPC Cloud - Ryousei Takano
The document evaluates the performance impact of virtualization on high-performance computing (HPC) clouds. Experiments were conducted on the AIST Super Green Cloud, a 155-node HPC cluster. Benchmark results show that while PCI passthrough mitigates I/O overhead, virtualization still incurs performance penalties for MPI collectives as node counts increase. Application benchmarks demonstrate overhead is limited to around 5%. The study concludes HPC clouds are promising due to utilization improvements from virtualization, but further optimization of virtual machine placement and pass-through technologies could help reduce overhead.
This document discusses QGATE, a quantum circuit simulator that can accelerate simulations using GPUs. QGATE uses several techniques to optimize simulations, including gate cancellation, dynamic qubit grouping, and operator reordering. Gate cancellation removes redundant gates, dynamic qubit grouping reduces the number of variables needed for state vectors when qubits are not entangled, and operator reordering maximizes the effects of dynamic qubit grouping by rearranging gates and measurements. These optimizations aim to improve simulation performance by reducing calculation amounts. Benchmark results show QGATE achieves up to a 220x speedup over CPU simulations for a circuit with 30 qubits and 10 Hadamard gates on each qubit.
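The gate cancellation idea described above can be illustrated with a minimal sketch. This is not QGATE's API: the circuit representation (a list of `(gate, qubit)` tuples), the function name `cancel_gates`, and the restriction to self-inverse single-qubit gates are all assumptions made for illustration.

```python
# Hypothetical sketch of gate cancellation; not taken from QGATE itself.
# Assumption: the circuit contains only single-qubit gates, given as
# (gate_name, qubit_index) tuples in execution order.
SELF_INVERSE = {"H", "X", "Y", "Z"}  # applying any of these twice is the identity


def cancel_gates(circuit):
    """Remove adjacent pairs of identical self-inverse gates on the same qubit."""
    kept = []        # surviving gates; cancelled slots are set to None
    alive = {}       # qubit -> stack of indices into `kept` that are still alive
    for gate, qubit in circuit:
        stack = alive.setdefault(qubit, [])
        if stack and gate in SELF_INVERSE and kept[stack[-1]][0] == gate:
            # The previous live gate on this qubit is the same gate: both vanish.
            kept[stack.pop()] = None
        else:
            stack.append(len(kept))
            kept.append((gate, qubit))
    return [entry for entry in kept if entry is not None]


if __name__ == "__main__":
    print(cancel_gates([("H", 0), ("H", 0), ("X", 1), ("H", 0)]))
```

Under these assumptions, an even number of Hadamard gates in a row on one qubit cancels completely, which suggests why removing redundant gates can reduce the amount of calculation before any state-vector update is performed.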
Improvements in space radiation-tolerant FPGA implementation of land surface ... - IJECEIAES
The trend in satellite remote sensing has been toward hardware devices with more flexibility, smaller size, and higher computational power. Field programmable gate array (FPGA) technology is therefore often used by the scientific community and by equipment developers to run satellite remote sensing algorithms. This article explains a hardware implementation of the land surface temperature split window (LST-SW) algorithm on an FPGA. To achieve high-speed, real-time processing, VHSIC hardware description language (VHDL) was used to design the LST-SW algorithm. The paper presents the benefits of the Virtex-4QV, an FPGA from a radiation-tolerant series. The experimental results show that the proposed implementation on the Virtex-4QV achieved a throughput of 435.392 Mbps and a processing time of 2.95 ms. A comparison with existing work shows better area utilization: a 1.17% reduction in the number of slices used and a 1.06% reduction in LUTs. A further area advantage is that the design uses no block RAMs, whereas the existing work uses three. Finally, the comparison shows 2.28% higher frequency, 3.66x higher throughput, and 1.19% faster processing time.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Stay up-to-date on the latest news, events and resources for the OpenACC community. This month’s highlights cover the first remote GPU Hackathons, a complete schedule of upcoming events, using OpenACC for a biophysics problem, NVIDIA HPC SDK, GCC 10, new resources and more!
Final presentation [dissertation project], 20192 esv0002 - MOHAMMED FURQHAN
This document presents a dissertation project on implementing a real-time data acquisition system on an FPGA. The project was presented by Mohammed Furqhan and guided by Mr. Subramanyam Vinayaka Babu. The project involves interfacing an LCD with a Spartan-6 FPGA for real-time data display from an ADC and DAC through sensors. A LabVIEW interface will also be created for data logging and acquisition. The document outlines the objectives, components used including an LCD, ADC, DAC, and FPGA, and provides information on future work and conclusions.
The document discusses the architecture of CPLDs and FPGAs. It begins by explaining the problems with using basic logic gates on PCBs and introduces programmable logic devices as a solution. It then describes different types of PLDs including PLA, PAL, GAL, CPLD and FPGA. CPLDs have a complexity between FPGAs and basic PLDs, containing non-volatile memory and supporting larger logic than PLDs. FPGAs contain logic cells, interconnects, and can implement thousands of gates. The document provides examples of implementing logic with different PLDs and describes the architecture and programming of CPLDs and FPGAs.
11 Synchoricity as the basis for going Beyond Moore - RCCSRENKEI
The document discusses synchronicity as a basis for going beyond Moore's law through the use of silicon lego (SiLago) blocks. SiLago blocks allow for the temporal and spatial composition of designs by ensuring clock and grid cell alignment during composition. This enables very large designs to be synthesized from higher levels of abstraction. Example SiLago block types include functional units like dense linear algebra blocks as well as infrastructure units like networks-on-chips. The document argues that treating SiLago blocks as the new standard cells could enable new design methodologies and computational paradigms like computation in memory to achieve major improvements in performance, energy, and cost beyond what is possible with conventional CMOS scaling alone.
FPGAs are a special form of programmable logic devices (PLDs) with higher densities as compared to custom ICs, and are capable of implementing functionality in a short period of time using computer aided design (CAD) software....by mathewsubin3388@gmail.com
OpenACC and Open Hackathons Monthly Highlights: July 2022.pptx - OpenACC
Stay up-to-date with the OpenACC and Open Hackathons Monthly Highlights. July’s edition covers the 2022 OpenACC and Hackathons Summit, NVIDIA’s Applied Research Accelerator Program, upcoming Open Hackathons and Bootcamps, recent research, new resources, and more!
The document discusses PG-Strom, an open source project that uses GPU acceleration for PostgreSQL. PG-Strom allows for automatic generation of GPU code from SQL queries, enabling transparent acceleration of operations like WHERE clauses, JOINs, and GROUP BY through thousands of GPU cores. It introduces PL/CUDA, which allows users to write custom CUDA kernels and integrate them with PostgreSQL for manual optimization of complex algorithms. A case study on k-nearest neighbor similarity search for drug discovery is presented to demonstrate PG-Strom's ability to accelerate computational workloads through GPU processing.
Stay up-to-date on the latest news, events and resources for the OpenACC community. This month’s highlights cover working on applications for the new Frontier supercomputer, using OpenACC for weather forecasting, upcoming GPU Hackathons and Bootcamps, and new resources!
ESPM2 2018 - Automatic Generation of High-Order Finite-Difference Code with T... - Hideyuki Tanaka
This document summarizes research on optimizing an explicit finite-difference scheme for fluid dynamics simulations to achieve high performance on many-core systems like the PEZY-SC2 processor. The researchers developed a code generation framework that uses temporal blocking to optimize for low memory bandwidth. On a PEZY-SC2 system with 16 million cores, they achieved 4.78 PFlops and 21.5% efficiency, comparable to other works on higher bandwidth machines. Temporal blocking reduced the required memory bandwidth and allowed good weak scaling to larger core counts.
This document discusses using CUDA on GPUs to accelerate map projection calculations. It presents a method for implementing the Universal Transverse Mercator projection on a GPU using CUDA. Experiments show the GPU implementation provides a 6-8x speedup over a CPU version when including data transfer times, and a 70-90x speedup when only considering calculation times. Two task assignment approaches are evaluated, with striped partitioning performing slightly better than a matrix distribution method. Future work is proposed to implement other GIS algorithms on GPUs to take advantage of the significant speed increases possible.
How to Terminate the GLIF by Building a Campus Big Data Freeway System - Larry Smarr
12.10.11
Keynote Lecture
12th Annual Global LambdaGrid Workshop
Title: How to Terminate the GLIF by Building a Campus Big Data Freeway System
Chicago, IL
Design And Simulation of Modulation Schemes used for FPGA Based Software Defi... - Sucharita Saha
Design of BPSK and QPSK digital modulation schemes and their implementation on FPGAs for universal mobile telecommunications system and SDR applications. The system is simulated in the MATLAB Simulink environment with System Generator, a tool used for FPGA design. Hardware co-simulation is designed using VHDL, a hardware description language, targeting a Xilinx FPGA, and is verified using MATLAB Simulink. It is then converted to VHDL using Simulink HDL Coder. The design is synthesized and fitted with Xilinx ISE 14.2 software and downloaded to a Spartan 3E (XC3S500E) board.
OpenACC and Open Hackathons Monthly Highlights: April 2022 - OpenACC
Stay up-to-date on the latest news, events and resources for the OpenACC and Open Hackathon community. This month’s highlights cover upcoming GPU Hackathons and Bootcamps, the call for speakers for the OpenACC and Hackathons 2022 Summit, recent research, new resources and more!
The Cygnus supercomputer combines GPUs and FPGAs to provide high performance computing capabilities. It has 81 nodes with a total peak performance of 2.4 PFLOPS from GPUs, CPUs, and FPGAs. 49 nodes contain GPUs only, while 32 nodes contain GPUs, FPGAs, and high-speed interconnects between the FPGAs. The FPGAs allow for application-specific acceleration and high-speed external communication. Cygnus aims to enhance performance through mixed and variable precision operations on the FPGAs.
This document provides a monthly highlights summary of OpenACC:
- OpenACC is a programming model for parallel computing on CPUs and GPUs using compiler directives to add parallelism to existing serial code.
- OpenACC is seeing wide adoption across major HPC applications and allows performance portability between CPU and GPU.
- The document highlights recent optimizations, events, publications and resources around OpenACC programming.
QoS based MAC protocol for medical wireless body area sensor networks - Iffat Anjum
The document proposes a QoS-based MAC protocol for medical wireless body area sensor networks that prioritizes critical traffic. The protocol differentiates traffic into critical and non-critical based on sensed values and thresholds. It allows critical packets more retransmissions to increase throughput and decrease rejection rates, while maintaining minimum QoS for non-critical traffic. Mathematical analysis of the protocol examines aggregate traffic generation, throughput, and rejection rates for different packet arrival rates. Performance evaluation shows the protocol increases critical throughput and decreases critical packet rejection rates compared to standard protocols.
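The prioritization described above can be sketched roughly as follows. The retry budgets, the threshold band, and the function names are hypothetical, and the rejection formula assumes each transmission attempt fails independently with the same probability; the summary does not give the protocol's actual parameters or analysis.

```python
# Hypothetical retry budgets: the summary only states that critical packets
# are allowed more retransmissions than non-critical ones.
MAX_RETRIES = {"critical": 7, "non-critical": 3}


def classify(value, low, high):
    """Label a sensed value critical when it falls outside the [low, high] band."""
    return "critical" if value < low or value > high else "non-critical"


def rejection_probability(p_fail, retries):
    """Probability that the first attempt and every retry all fail.

    Assumes independent attempts with identical failure probability p_fail.
    """
    return p_fail ** (retries + 1)


if __name__ == "__main__":
    # Illustrative body-temperature band in degrees Celsius (an assumption).
    traffic_class = classify(39.5, low=36.0, high=38.0)
    print(traffic_class, rejection_probability(0.5, MAX_RETRIES[traffic_class]))
```

With these illustrative numbers, the larger retry budget gives critical packets a lower rejection probability than non-critical packets at the same per-attempt failure rate, which is the effect the summary attributes to the protocol.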
A High-Performance Campus-Scale Cyberinfrastructure for Effectively Bridging ... - Larry Smarr
10.04.07
Presentation by Larry Smarr to the NSF Campus Bridging Workshop
University Place Conference Center
Title: A High-Performance Campus-Scale Cyberinfrastructure for Effectively Bridging End-User Laboratories to Data-Intensive Sources
Indianapolis, IN
Warp processing is a technique that dynamically optimizes software to improve performance and energy efficiency. It works by profiling an application to identify critical regions, then partitioning those regions to hardware using an FPGA. The binary is updated to execute the partitioned regions on the FPGA circuit while the rest continues in software. This allows applications to achieve speedups of 2-100x or more while using 20x less memory and reducing power consumption by 38-94%.
Welcome slides presented at the 2nd International Workshop on Deployment and Use of Accelerators (DUAC). Co-located with the 51st International Conference on Parallel Processing (ICPP). August 29, 2022 (virtual event). More information at: https://duac2022.wordpress.com/
vAccel is a framework that allows serverless workloads running in virtual machines to access hardware accelerators. It defines a generic API that can be mapped to specific accelerator implementations through plugins. This allows workloads to leverage accelerators while maintaining isolation between tenants. vAccel currently supports frameworks like TensorFlow and frameworks running on hypervisors like QEMU and Firecracker through virtio and vsock plugins. Performance evaluation shows vAccel introduces minimal overhead for accelerated workloads running in virtual machines.
Similar to Cygnus - World First Multi-Hybrid Accelerated Cluster with GPU and FPGA Coupling
A Study on Atomics-based Integer Sum Reduction in HIP on AMD GPU - Carlos Reaño González
Paper presented at the 2nd International Workshop on Deployment and Use of Accelerators (DUAC). Co-located with the 51st International Conference on Parallel Processing (ICPP). August 29, 2022 (virtual event). More information at: https://duac2022.wordpress.com/
Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs ... - Carlos Reaño González
Paper presented at the 2nd International Workshop on Deployment and Use of Accelerators (DUAC). Co-located with the 51st International Conference on Parallel Processing (ICPP). August 29, 2022 (virtual event). More information at: https://duac2022.wordpress.com/
Pipelined Compression in Remote GPU Virtualization Systems using rCUDA: Early... - Carlos Reaño González
This document discusses pipelined compression in remote GPU virtualization systems using rCUDA. It introduces remote GPU virtualization and the challenges of slow networks. It then describes a pipelined compression architecture that can compress data on the fly during transfer. Experimental results show that compression libraries reduce execution time by 1-6 minutes for various machine learning models. Analysis finds that over 90% of transfers are small, between 1 byte and 1 KB, and could benefit from further compression. The initial implementation shows potential for reducing execution time but leaves room for improvement.
A framework for low communication approaches for large scale 3D convolutionCarlos Reaño González
Paper presented at the 2nd International Workshop on Deployment and Use of Accelerators (DUAC). Co-located with the 51st International Conference on Parallel Processing (ICPP). August 29, 2021 (virtual event). More information at: https://duac2022.wordpress.com/
Cygnus - World First Multi-Hybrid Accelerated Cluster with GPU and FPGA Coupling
1. Center for Computational Sciences, Univ. of Tsukuba
Taisuke Boku Norihisa Fujita Ryohei Kobayashi Osamu Tatebe
Center for Computational Sciences
University of Tsukuba
{taisuke,fujita}@ccs.tsukuba.ac.jp {kobayashi,tatebe}@cs.tsukuba.ac.jp
Cygnus – World First Multihybrid Accelerated Cluster
with GPU and FPGA Coupling
2022/08/29 DUAC2022
2. Center for Computational Sciences, Univ. of Tsukuba
Accelerators in HPC ⇒ majority = GPU
• Is the GPU perfect?
  − Good for many applications (replacing vector machines)
    − relies on very wide and regular parallelism
    − large-scale SIMD (STMD) mechanism in a chip
    − high-bandwidth memory (HBM, HBM2) and local memory
  − Insufficient for cases with...
    − not enough parallelism
    − irregular computation (warp divergence)
    − frequent inter-node communication (kernel switch, going back to the CPU)
[Figure: NVIDIA Tesla A100 Tensor Core GPU (from NVIDIA web page)]
3. Center for Computational Sciences, Univ. of Tsukuba
FPGA in HPC
• Strengths of recent FPGAs for HPC
  − True co-design with applications (essential)
  − Programmability improvement: OpenCL, other high-level languages
  − High-performance interconnect: 100Gbps x N
  − Precision control is possible
  − Relatively low power
• Problems
  − Programmability: OpenCL is not enough, not efficient
  − Low standard FLOPS: still cannot catch up to GPU
    -> "never try what GPU works well on"
  − Memory bandwidth: one generation behind high-end CPU/GPU
    -> to be improved by HBM (Stratix10)
[Figure: BittWare 520N with Intel Stratix10 FPGA, equipped with 4x 100Gbps optical interconnection interfaces]
5. Center for Computational Sciences, Univ. of Tsukuba
CHARM: Cooperative Heterogeneous Acceleration with
Reconfigurable Multi-devices
[Figure: CHARM concept diagram. Node labels: CPU, GPU, FPGA, PCIe, comp., comm. Annotations: the CPU invokes GPU/FPGA kernels; data transfer via PCIe (invoked from FPGA); basic cluster with GPUs (by InfiniBand); FPGA network for application-oriented FPGA-FPGA communication over 100Gbps direct optical links; target is multi-physics/multi-scale complicated problems; cooperative computing with GPU and FPGA.]
6. Center for Computational Sciences, Univ. of Tsukuba
Cygnus: world first multi-hybrid cluster with GPU+FPGA
Cygnus supercomputer at Center for Computational Sciences, Univ. of Tsukuba (Apr. 2019~)
85 nodes in total including 32 “Albireo” nodes with GPU+FPGA (other “Deneb” nodes have GPU only)
@ CCS, Univ. of Tsukuba (deployed by NEC)
7. Center for Computational Sciences, Univ. of Tsukuba
Single node configuration (Albireo)
[Figure: block diagram of a single node (with FPGA). The labels show two identical halves, each with a CPU, a PCIe network (switch), two GPUs, an FPGA, and two HCAs, together with an inter-FPGA direct network (100Gbps x4) and a network switch (100Gbps x2).]
• Each node is equipped with both IB EDR and FPGA-direct network
• Some nodes are equipped with both FPGAs and GPUs, and other nodes have GPUs only
8. Center for Computational Sciences, Univ. of Tsukuba
Two types of interconnection network
[Figure: Deneb and Albireo computation nodes attached to the InfiniBand HDR100/200 network, used for parallel processing communication and shared file system access from all nodes; the FPGAs of the Albireo nodes are additionally linked by the inter-FPGA direct network.]
IB HDR100/200 Network (100Gbps x4/node)
All computation nodes (Albireo and Deneb) are connected by a full-bisection Fat Tree network with 4 channels of InfiniBand HDR100 (combined into HDR200 at the switch). It carries parallel processing communication such as MPI and is also used to access the Lustre shared file system.
Inter-FPGA direct network (only for Albireo nodes)
Inter-FPGA torus network: the 64 FPGAs on the Albireo nodes (2 FPGAs/node) are connected by an 8x8 2D torus network without a switch.
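As a rough illustration of the switchless 8x8 2D torus described above, the following Python sketch lists the four direct neighbours of an FPGA and the minimal hop count between two FPGAs. The (row, column) coordinates and the function names are assumptions made for illustration; the slides do not say how Cygnus numbers its FPGAs or routes packets.

```python
# Minimal sketch of an 8x8 2D torus (64 FPGAs).
# Assumption: each FPGA is identified by hypothetical (row, col) coordinates.
DIM = 8  # torus dimension per axis, from the slide (8x8)

def torus_neighbors(row, col, dim=DIM):
    """Return the four directly linked neighbours, with wrap-around edges."""
    return [
        ((row - 1) % dim, col),  # north
        ((row + 1) % dim, col),  # south
        (row, (col - 1) % dim),  # west
        (row, (col + 1) % dim),  # east
    ]

def torus_hops(a, b, dim=DIM):
    """Minimal number of hops between coordinates a and b on the torus."""
    def ring_distance(x, y):
        d = abs(x - y)
        return min(d, dim - d)  # going the other way around may be shorter
    return ring_distance(a[0], b[0]) + ring_distance(a[1], b[1])

print(torus_neighbors(0, 0))
print(torus_hops((0, 0), (4, 4)))
```

Four neighbours per FPGA matches the four 100Gbps optical links per board mentioned in the slides; the wrap-around links are what keep the worst-case distance low without any switch.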
9. Center for Computational Sciences, Univ. of Tsukuba
[Figure: Albireo node, 1.2Tbps/node. Labels: 2 CPUs, 4 GPUs, 2 FPGAs; IB HDR100 x4 ⇨ HDR200 x2 to the IB HDR200 switch (for full-bisection Fat-Tree); FPGA optical network 100Gbps x4, x2.]
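One way to read the 1.2Tbps/node figure is as the sum of the InfiniBand and FPGA optical links listed on this slide. The short Python sketch below works through that arithmetic; treating the total as a simple sum of link rates is an assumption, and the constant names are hypothetical.

```python
# Sketch of the per-node aggregate bandwidth, assuming it is the plain sum
# of the link rates shown on the slide (constant names are hypothetical).
IB_CHANNELS = 4           # IB HDR100 x4
IB_GBPS_PER_CHANNEL = 100
FPGAS_PER_NODE = 2        # FPGA optical network x2
LINKS_PER_FPGA = 4        # 100Gbps x4 per FPGA
FPGA_GBPS_PER_LINK = 100

def infiniband_gbps():
    return IB_CHANNELS * IB_GBPS_PER_CHANNEL

def fpga_network_gbps():
    return FPGAS_PER_NODE * LINKS_PER_FPGA * FPGA_GBPS_PER_LINK

def node_total_gbps():
    return infiniband_gbps() + fpga_network_gbps()

print(infiniband_gbps(), fpga_network_gbps(), node_total_gbps())
```

Under this reading, two thirds of a node's external bandwidth sits on the FPGA side.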
10. Center for Computational Sciences, Univ. of Tsukuba
Research to support CHARM model on Cygnus
• FPGA-network: CIRCUS (Communication Integrated Reconfigurable CompUting System)
  − direct interconnect facility among FPGA boards by multi-dimensional optical links (~100Gbps) with router and OpenCL-ready API
  − pipelining all computation and communication seamlessly
• GPU-FPGA DMA: kicked by FPGA (without CPU)
  − PCIe-protocol-based DMA engine to reduce the cost of high-speed data transfer between multiple devices
• Programming:
  − Intel oneAPI
    ⇒ task-by-task assignment of computation parts to GPU and FPGA under DPC++ device queue management
• Application: ARGOT, an astrophysics application for early-universe object generation
  − two main parts are executed by GPU and FPGA
11. CIRCUS
• Intel FPGA SDK for OpenCL
  − We can describe FPGA hardware in OpenCL
• Problem: How to write inter-FPGA communication code in OpenCL?
  − MPI is the standard method for HPC applications
  − It is memory-to-memory communication, not suitable for FPGAs
  − We need to utilize pipeline-based communication in an FPGA
• → CIRCUS: Communication Integrated Reconfigurable CompUting System
  − Pipelined communication and computation
  − communicate from or to a computation pipeline directly
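// The declarations of the channels simple_out and simple_in are not shown on the slide.
__kernel void sender(__global float* restrict x, int n)
{
    for (int i = 0; i < n; i++) {
        float v = x[i];                      // read one element from global memory
        write_channel_intel(simple_out, v);  // stream it into the outgoing channel
    }
}
sender code on FPGA1
__kernel void receiver(__global float* restrict x, int n)
{
    for (int i = 0; i < n; i++) {
        float v = read_channel_intel(simple_in);  // take one element from the incoming channel
        x[i] = v;                                 // store it in global memory
    }
}
receiver code on FPGA2
[Figure label: Comm. Backend]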
* N. Fujita, et al., Performance Evaluation of Pipelined Communication Combined with Computation in OpenCL Programming on FPGA , AsHES2020.
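The sender/receiver pair above streams elements through a channel instead of copying a whole buffer from memory to memory. As a purely software analogy (not CIRCUS itself), the following Python sketch emulates that pattern with a bounded FIFO between two threads. The queue depth, the end-of-stream marker, and the function names are assumptions for illustration; the OpenCL kernels use an element count n instead of a marker.

```python
import queue
import threading

# Software emulation of channel-style streaming between two pipelines.
# Assumption: a bounded queue.Queue stands in for the hardware channel.
END_OF_STREAM = object()  # hypothetical marker, not part of the slide's code

def sender(data, channel):
    """Push elements one by one; put() blocks while the FIFO is full."""
    for v in data:
        channel.put(v)
    channel.put(END_OF_STREAM)

def receiver(channel, out):
    """Pop elements one by one; get() blocks while the FIFO is empty."""
    while True:
        v = channel.get()
        if v is END_OF_STREAM:
            break
        out.append(v)

def run_pipeline(data, depth=4):
    channel = queue.Queue(maxsize=depth)  # small FIFO, like a channel depth
    out = []
    t_send = threading.Thread(target=sender, args=(data, channel))
    t_recv = threading.Thread(target=receiver, args=(channel, out))
    t_send.start()
    t_recv.start()
    t_send.join()
    t_recv.join()
    return out

print(run_pipeline([0.5, 1.5, 2.5]))
```

Because the FIFO holds only a few elements, the receiver starts consuming while the sender is still producing, which is the overlap of communication and computation that the slide contrasts with memory-to-memory MPI transfers.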
12. CIRCUS performance
[Figures: latency (1 hop to 7 hops) and throughput (1 hop to 7 hops)]
• Min. latency: 500 ns
• Additional latency per hop: ~250 ns
• Max. throughput: 90.2 Gbps
Evaluated on up to 8 Bittware 520N FPGA boards in Cygnus supercomputer at CCS, University of Tsukuba
N. Fujita, et al., Performance Evaluation of Pipelined Communication Combined with Computation in OpenCL Programming on FPGA , AsHES2020.
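The figures reported on this slide suggest a simple back-of-the-envelope model, sketched below in Python. It assumes that the 500 ns minimum corresponds to a single hop, that each further hop adds a constant 250 ns, and that the link peak is the 100 Gbps quoted elsewhere in the deck; none of these modelling choices is stated on the slide, and the function names are hypothetical.

```python
# Rough linear latency model built from the slide's numbers.
# Assumptions: 500 ns is the 1-hop latency; every extra hop adds ~250 ns.
MIN_LATENCY_NS = 500
PER_HOP_NS = 250
MAX_THROUGHPUT_GBPS = 90.2
LINK_PEAK_GBPS = 100.0  # assumed peak rate of one optical link

def estimated_latency_ns(hops):
    """Estimated latency for a path of `hops` hops (hops >= 1)."""
    if hops < 1:
        raise ValueError("hops must be at least 1")
    return MIN_LATENCY_NS + PER_HOP_NS * (hops - 1)

def link_efficiency():
    """Fraction of the assumed link peak reached by the measured throughput."""
    return MAX_THROUGHPUT_GBPS / LINK_PEAK_GBPS

for h in (1, 4, 7):
    print(h, estimated_latency_ns(h))
print(link_efficiency())
```

Under these assumptions, even the longest measured path (7 hops) stays at about 2 microseconds.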
13. Center for Computational Sciences, Univ. of Tsukuba
What does CIRCUS provide?
• CIRCUS: Communication Integrated Reconfigurable CompUting System
  − Goal 1: providing a High Level Synthesis programming environment for parallel FPGA systems with an FPGA-FPGA communication link
  − Goal 2: combining the computation pipeline and the communication pipeline seamlessly to fully utilize the strengths of FPGA computation/communication
14. Center for Computational Sciences, Univ. of Tsukuba
CHARM by oneAPI
• In oneAPI, programming in DPC++ is recommended
  − (a) approach
  − Problem: existing GPU and FPGA code is written in other languages such as CUDA, OpenCL, etc.; these code assets already exist
  − Reimplementation in DPC++ is a burden for users
• oneAPI can also use modules written in other languages
  − (b) approach
  − Code can be reused
15. Center for Computational Sciences, Univ. of Tsukuba
Application Example – ARGOT code
• ARGOT (Accelerated Radiative transfer on Grids using Oct-Tree)
  − Simulator for the early-stage universe where the first stars and galaxies were born
  − Radiative transfer code developed at the Center for Computational Sciences (CCS), University of Tsukuba
  − CPU (OpenMP) and GPU (CUDA) implementations are available
  − Inter-node parallelism is also supported using MPI
• ART (Authentic Radiation Transfer) method
  − It solves radiative transfer from light sources spreading out in space
  − Dominant computation part (90% or more) of the ARGOT program
• We accelerate the ART method on an FPGA using Intel FPGA SDK for OpenCL as an HLS environment (with oneAPI)
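Since the ART method accounts for roughly 90% of the ARGOT run time, the benefit of offloading it can be estimated with Amdahl's law. The slides do not use this formula; the Python sketch below is an illustration only, and the 10x ART speedup in the usage line is a hypothetical input, not a measured value.

```python
# Amdahl's law sketch: overall speedup when only a fraction of the run time
# is accelerated. The 0.9 fraction comes from the slide; the law is standard,
# but applying it here is our own illustration.
ART_FRACTION = 0.9

def overall_speedup(accelerated_fraction, part_speedup):
    """Overall speedup when `accelerated_fraction` of time runs `part_speedup`x faster."""
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / part_speedup)

def speedup_limit(accelerated_fraction):
    """Upper bound on overall speedup as the accelerated part becomes free."""
    return 1.0 / (1.0 - accelerated_fraction)

print(overall_speedup(ART_FRACTION, 10.0))  # hypothetical 10x faster ART
print(speedup_limit(ART_FRACTION))
```

If ART were exactly 90% of the run time, speeding up ART alone could not give more than about 10x overall; larger gains require the ART share to exceed 90% or the remaining parts to be accelerated as well.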
16. Cosmic Radiative Transfer Simulation
ARGOT *
Point Source Diffuse Photon
Two computation elements in ARGOT code: ARGOT method and ART method
• ARGOT method: Point Source processing
• ART method (Authentic Radiation Transfer): Diffused Photon processing
17. Cosmic Radiative Transfer Simulation
ARGOT *
[Figure: ARGOT (Accelerated Radiative transfer on Grids using Oct-Tree) code under CHARM, with GPUs and FPGAs working together. GPU acceleration: ARGOT scheme for radiative transfer (RT) from point sources. FPGA acceleration: ART scheme for RT from matter spread out spatially (diffuse photons).]
Two computation elements in ARGOT code: ARGOT method and ART method
• ARGOT method: Point Source processing
• ART method (Authentic Radiation Transfer): Diffused Photon processing
18. Center for Computational Sciences, Univ. of Tsukuba
Performance evaluation (bare CUDA+OpenCL vs with oneAPI)
• Problem size of 32^3
• Single node (1 GPU + 1 FPGA)
• ART
  − ART on GPU is slow
  − FPGA can accelerate it in a pipelined manner
• oneAPI vs CUDA+OpenCL
  − The execution time with oneAPI increases by 1.5%
    => almost no overhead
[Figure: execution time comparison; lower is better]
R. Kashino, et al., Multi-hetero Acceleration by GPU and FPGA for Astrophysics Simulation on oneAPI Environment , HPC Asia 2022, Jan. 2022.
19. ARGOT with 2 nodes (2 GPUs/IB + 2 FPGAs/CIRCUS)
• Weak scaling, 32x32x32 mesh for each node
[Figure: stacked bar chart of execution time [s] versus # of nodes / total mesh size; lower is better. Legend: ARGOT init, ARGOT comp., Ray segment assignment, Optical depth accumulation, ART init, ART comp., ART comm., Others.]

# of nodes / total mesh size | GPU-only | GPU + FPGA
1 node / (32, 32, 32)        | 0.89 s   | 0.13 s
2 nodes / (64, 32, 32)       | 2.05 s   | 0.16 s

・1 node performance
・GPU+FPGA : GPU-only = 6.8x higher
・2 nodes performance
・GPU+FPGA : GPU-only = 12.8x higher
ART method part:
GPU-GPU MPI comm. is so heavy
→ large overhead from copying many small chunks of data
FPGA-FPGA MPI comm. by CIRCUS
→ very effective
・low latency & high bandwidth
・comp. + comm. pipelining
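The speedup figures on this slide can be reproduced from the four execution times shown in the chart. The Python sketch below does so and also derives a weak-scaling efficiency, which the slide does not report. Which time belongs to which bar is inferred from their order on the slide (it is consistent with the stated 6.8x and 12.8x); the dictionary layout and function names are hypothetical.

```python
# Execution times [s] read from the chart; the bar-to-value mapping is an
# inference that happens to reproduce the slide's 6.8x and 12.8x figures.
TIMES = {
    1: {"GPU-only": 0.89, "GPU+FPGA": 0.13},  # 1 node, (32, 32, 32) mesh
    2: {"GPU-only": 2.05, "GPU+FPGA": 0.16},  # 2 nodes, (64, 32, 32) mesh
}

def fpga_speedup(nodes):
    """How many times faster GPU+FPGA is than GPU-only at a given node count."""
    t = TIMES[nodes]
    return t["GPU-only"] / t["GPU+FPGA"]

def weak_scaling_efficiency(config):
    """1-node time divided by 2-node time (1.0 would be ideal weak scaling)."""
    return TIMES[1][config] / TIMES[2][config]

for n in (1, 2):
    print(n, round(fpga_speedup(n), 1))
for cfg in ("GPU-only", "GPU+FPGA"):
    print(cfg, round(weak_scaling_efficiency(cfg), 2))
```

The derived efficiencies (about 0.43 for GPU-only versus about 0.81 for GPU+FPGA) put a number on the slide's remark that GPU-GPU MPI communication is heavy while FPGA-FPGA communication through CIRCUS is very effective.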
20. Center for Computational Sciences, Univ. of Tsukuba
Summary
• Toward the exascale era, homogeneous or single-accelerator systems will be limited in application variety and scalability
• CCS, U. Tsukuba, is running a multi-hetero supercomputer named Cygnus under the CHARM (Cooperative Heterogeneous Acceleration with Reconfigurable Multi-devices) concept with GPU+FPGA
• Several supporting systems for GPU-FPGA cooperation have been developed, including a language solution, toward high sustained performance of multi-physics simulations
• FPGA for HPC is a new concept toward the next generation's flexible and low-power solution beyond GPU-only computing
• Multi-physics simulation is the first-stage target of Cygnus, and it will be expanded to a variety of applications where a GPU-only solution has some bottleneck
• The current FPGA-side implementation is based on bare OpenCL, and we need to expand to other languages and other run-time systems
• Call me if you want to use Cygnus with us!