This document discusses recent developments in the HPX and Octo-Tiger software frameworks. It describes how HPX provides uniform APIs for local and remote parallel operations and exposes performance counters. Octo-Tiger is an astrophysics simulation program that uses HPX for parallelization and supports MPI, Kokkos, and CUDA/HIP backends. New features discussed include integrating Kokkos with HPX to enable vectorization, different work aggregation strategies, and measurements showing lightweight threads in HPX have low overhead for large simulations. Scaling tests on supercomputers like Summit demonstrate good strong and weak scaling using HPX's asynchronous communication compared to MPI.
Recent developments in HPX and Octo-Tiger (Patrick Diehl)
In this talk, we will briefly introduce the astrophysics application Octo-Tiger, which simulates the evolution of star systems based on the fast multipole method on adaptive octrees. This application is the most advanced HPX application with support for CUDA and AMD accelerator cards. In the remainder of the talk, we will discuss the pure CUDA integration and the recently added Kokkos integration, which provides portability across heterogeneous accelerator cards, especially AMD GPUs. Here, we showcase scaling results on ORNL's Summit and CSCS's Piz Daint. Another aspect is performance profiling in asynchronous applications: we recently added CUDA profiling to HPX's performance framework, APEX, so we can collect combined distributed CPU and GPU profiles to analyze the performance of HPX and Octo-Tiger. We show runs on Summit and Piz Daint to compare the performance of different architectures and GPUs. The final aspect is the overhead introduced by the performance framework, where we study the overhead of CPU-only profiling and the much more expensive overhead added by CUDA profiling.
Simulating Stellar Merger using HPX/Kokkos on A64FX on Supercomputer Fugaku (Patrick Diehl)
This document summarizes a presentation on simulating stellar mergers using HPX/Kokkos on the Fugaku supercomputer. It discusses the astrophysical application of merging stars, the Octo-Tiger software framework, Kokkos for performance portability, HPX for concurrency and parallelism, porting to the Fugaku supercomputer, node-level and distributed scaling tests, performance optimizations, and conclusions about the success and potential for future work.
Porting our astrophysics application to Arm64FX and adding Arm64FX support us... (Patrick Diehl)
This document discusses porting the astrophysics application Octo-Tiger to Arm64FX processors and adding Arm64FX support using the Kokkos and HPX frameworks. It provides an overview of Octo-Tiger, HPX, and how Kokkos is integrated. Preliminary scaling tests on an Arm64FX node show good node-level scaling and basic distributed scaling across multiple nodes. Future work includes optimizing for SVE and porting more components to Kokkos to improve performance.
OpenACC and Open Hackathons Monthly Highlights: September 2022.pptx (OpenACC)
Stay up-to-date on the latest news, research and resources. This month's edition covers the Princeton GPU Hackathon, OpenACC at SC22, updates from GNU Tools Cauldron, the upcoming UK DPU Hackathon, relevant research and more!
OpenACC and Open Hackathons Monthly Highlights: July 2022.pptx (OpenACC)
Stay up-to-date with the OpenACC and Open Hackathons Monthly Highlights. July’s edition covers the 2022 OpenACC and Hackathons Summit, NVIDIA’s Applied Research Accelerator Program, upcoming Open Hackathons and Bootcamps, recent research, new resources, and more!
Stay up-to-date with the OpenACC Monthly Highlights. February's edition covers the updated specification OpenACC 3.2, upcoming GPU Hackathons and Bootcamps, OpenACC's BOF at SC21, recent research, new resources and more!
Stay up-to-date on the latest news, events and resources for the OpenACC community. This month’s highlights covers the first remote GPU Hackathons, a complete schedule of upcoming events, using OpenACC for a biophysics problem, NVIDIA HPC SDK, GCC 10, new resources and more!
Stay up-to-date on the latest news, events and resources for the OpenACC community. This month’s highlights covers working on applications for the new Frontier supercomputer, using OpenACC for weather forecasting, upcoming GPU Hackathons and Bootcamps, and new resources!
Developing, experimenting, and deploying ML models at scale requires substantial tooling, scripting, tracking, versioning, and monitoring.
Watch full video here: https://cnvrg.io/webinars-and-workshops/scaling-mlops-on-nvidia-dgx-systems/
Data scientists want to do data science – and are slowed down by MLOps and DevOps tasks.
They lack user friendly tools needed to track experiments, attach resources, manage datasets and launch multiple ML pipelines.
In this presentation, cnvrg.io CEO Yochay Ettun hosts a special guest from NVIDIA, Michael Balint, Sr. Product Manager for NVIDIA DGX systems, to discuss how to optimize the use of any NVIDIA DGX and NVIDIA GPU asset, both on-prem and in the cloud, with the cnvrg.io machine learning platform.
We will show best practices to reach high utilization of NVIDIA DGX systems, while conducting meta-scheduling across multiple heterogeneous Kubernetes/OpenShift/Linux server clusters.
In addition, we will introduce the concept of production flows, which automate hundreds of models from the data hub to deployment. We will wrap up with a real-life demo of flows, exercising many experiments across DGX platforms.
What you will learn:
- Creating a data science flow: from data to deployment, while attaching different NVIDIA DGX Kubernetes clusters to each step of the flow
- The concept of a meta-scheduler: scheduling experiments across dispersed resources or other schedulers, achieving high utilization at scale
- How the NVIDIA DGX ecosystem with cnvrg.io makes GPU assets easy to consume with one click, bypassing the complexity of MLOps
- How to leverage NGC containers in ML pipelines
You can watch the full presentation along with audio and video in the link here: https://cnvrg.io/webinars-and-workshops/scaling-mlops-on-nvidia-dgx-systems/
OpenACC and Open Hackathons Monthly Highlights August 2022 (OpenACC)
Stay up-to-date with the OpenACC and Open Hackathons Monthly Highlights. August’s edition covers the 2022 OpenACC and Hackathons Asia-Pacific Summit, NVIDIA’s GTC, upcoming Open Hackathons and Bootcamps, EuroHPC, the launch of Frontier and Polaris supercomputers, recent research, new resources, and more!
Stay up-to-date with the OpenACC Monthly Highlights. June's edition covers the OpenACC Summit 2021, NVIDIA GTC'21 on-demand sessions, upcoming GPU Hackathons and Bootcamps, Intersect360 Research HPC market forecast, recent research, new resources and more!
Stay up-to-date with the OpenACC Monthly Highlights. July's edition covers the OpenACC Summit 2021, upcoming GPU Hackathons and Bootcamps, PEARC21 panel review, recent research, new resources and more!
Evaluating HPX and Kokkos on RISC-V Using an Astrophysics Application Octo-Tiger (Patrick Diehl)
In recent years, computers based on the RISC-V architecture have raised broad interest in the high-performance computing (HPC) community. As the RISC-V community develops the core instruction set architecture (ISA) along with ISA extensions, the HPC community has been actively ensuring HPC applications and environments are supported. In this context, assessing the performance of asynchronous many-task runtime systems (AMT) is important. We describe our experience with porting of a full 3D adaptive mesh-refinement, multi-scale, multi-model, and multi-physics application, Octo-Tiger, that is based on the HPX AMT, and we explore its performance characteristics on different RISC-V systems. The demonstrated results confirm that Octo-Tiger shows good scaling behavior on all tested systems. We, however, expect that exceptional hardware support based on dedicated ISA extensions (such as single-cycle context switches, extended atomic operations, and direct support for HPX's global address space) would allow for even better performance results.
Big Data, Big Computing, AI, and Environmental Science (Ian Foster)
I presented to the Environmental Data Science group at UChicago, with the goal of getting them excited about the opportunities inherent in big data, big computing, and AI--and to think about how to collaborate with Argonne in those areas. We had a great and long conversation about Takuya Kurihana's work on unsupervised learning for cloud classification. I also mentioned our work making NASA and CMIP data accessible on AI supercomputers.
At the technology meeting of the Association of Independent Research Centers (http://airi.org): An overview of recent Scientific Computing activities at Fred Hutch, Seattle
Stay up-to-date on the latest news, events and resources for the OpenACC community. This month’s highlights covers the newly released PGI 19.7, the upcoming 2019 OpenACC Annual Meeting, GPU Bootcamp at RIKEN R-CCS, a complete schedule of GPU hackathons and more!
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility (inside-BigData.com)
In this deck from the Swiss HPC Conference, Mark Wilkinson presents: 40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility.
"DiRAC is the integrated supercomputing facility for theoretical modeling and HPC-based research in particle physics, and astrophysics, cosmology, and nuclear physics, all areas in which the UK is world-leading. DiRAC provides a variety of compute resources, matching machine architecture to the algorithm design and requirements of the research problems to be solved. As a single federated Facility, DiRAC allows more effective and efficient use of computing resources, supporting the delivery of the science programs across the STFC research communities. It provides a common training and consultation framework and, crucially, provides critical mass and a coordinating structure for both small- and large-scale cross-discipline science projects, the technical support needed to run and develop a distributed HPC service, and a pool of expertise to support knowledge transfer and industrial partnership projects. The on-going development and sharing of best-practice for the delivery of productive, national HPC services with DiRAC enables STFC researchers to produce world-leading science across the entire STFC science theory program."
Watch the video: https://wp.me/p3RLHQ-k94
Learn more: https://dirac.ac.uk/
and
http://hpcadvisorycouncil.com/events/2019/swiss-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
The document discusses accelerating science discovery with AI inference-as-a-service. It describes showcases using this approach for high energy physics and gravitational wave experiments. It outlines the vision of the A3D3 institute to unite domain scientists, computer scientists, and engineers to achieve real-time AI and transform science. Examples are provided of using AI inference-as-a-service to accelerate workflows for CMS, ProtoDUNE, LIGO, and other experiments.
The document provides an overview of grid computing concepts, architecture, and applications. It discusses how grid users like scientists need coordinated sharing of resources for tasks like data-intensive analysis and simulations. The key challenges for grids are establishing trust between participants and enabling secure sharing of applications and data. Standards like Globus and OGSA have evolved to address these through services, virtual organizations, and other architectural components. Example applications described are the NEESgrid for earthquake engineering and the Virtual Observatory for astronomy data sharing.
OpenACC and Open Hackathons Monthly Highlights June 2022.pdf (OpenACC)
Stay up-to-date with the OpenACC and Open Hackathons Monthly Highlights. June’s edition covers the 2022 OpenACC and Hackathons Summit, NSF’s Traineeship Program, NVIDIA’s Academic Hardware Grant program, upcoming Open Hackathons and Bootcamps, recent research, new resources, and more!
4 TeraGrid sites have focal points:
- SDSC – The Data Place: large-scale and high-performance data analysis/handling; every cluster node is directly attached to the SAN
- NCSA – The Compute Place: large-scale, large-FLOPS computation
- Argonne – The Viz Place: scalable viz walls
- Caltech – The Applications Place: data and FLOPS for applications, especially some of the GriPhyN apps
Specific machine configurations reflect this.
Mulvery is a Ruby-based development environment for CPU+FPGA platforms that allows hardware synthesis from Ruby code. The developer created Mulvery to simplify development for CPU+FPGA systems by allowing developers to use only Ruby. Mulvery uses reactive programming concepts and extracts parts of code suitable for hardware implementation, generating hardware designs without requiring hardware knowledge from developers.
Augmenting Amdahl's Second Law for Cost-Effective and Balanced HPC Infrastruc... (Arghya Kusum Das)
The paper is published in IEEE Cloud 2017 with title 'Augmenting Amdahl's Second Law: A Theoretical Model to Build Cost-Effective Balanced HPC Infrastructure for Data-Driven Science'
Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark (Databricks)
Big data and AI are joined at the hip: the best AI applications require massive amounts of constantly updated training data to build state-of-the-art models. AI has always been one of the most exciting applications of big data and Apache Spark. Increasingly, Spark users want to integrate Spark with distributed deep learning and machine learning frameworks built for state-of-the-art training. On the other side, DL/AI users increasingly want to handle the large and complex data scenarios needed for their production pipelines.
This talk introduces a new project that substantially improves the performance and fault recovery of distributed deep learning and machine learning frameworks on Spark. We will introduce the major directions and provide progress updates, including 1) barrier execution mode for distributed DL training, 2) fast data exchange between Spark and DL frameworks, and 3) accelerator-aware scheduling.
Grid computing involves distributing computing resources across a network to tackle large problems. The Worldwide LHC Computing Grid (WLCG) was established to support the Large Hadron Collider (LHC) experiment, which produces around 15 petabytes of data annually. The WLCG uses a four-tiered model, with raw data stored at Tier-0 (CERN), copies distributed to Tier-1 data centers, computational resources provided by Tier-2 centers, and Tier-3 facilities providing additional analysis capabilities. This distributed model has proven effective in supporting the first year of LHC data collection and analysis through globally shared computing resources.
Low Power High-Performance Computing on the BeagleBoard Platform (a3labdsp)
The ever-increasing energy requirements of supercomputers and server farms are driving the scientific and industrial communities to take the energy efficiency of computing equipment into deeper consideration. This contribution addresses the issue by proposing a cluster of ARM processors for high-performance computing. The cluster is composed of five BeagleBoard-xM boards, with one board managing the cluster and the other boards executing the actual processing. The software platform is based on the Angstrom GNU/Linux distribution and is equipped with a distributed file system to ease sharing data and code among the nodes of the cluster, and with tools for managing tasks and monitoring the status of each node. The computational capabilities of the cluster have been assessed through High-Performance Linpack and a cluster-wide speaker diarization algorithm, while power consumption has been measured using a clamp meter. Experimental results obtained in the speaker diarization task showed that the energy efficiency of the BeagleBoard-xM cluster is comparable to that of a laptop computer equipped with an Intel Core2 Duo T8300 running at 2.4 GHz. Furthermore, removing the bottleneck due to the Ethernet interface, the BeagleBoard-xM cluster is able to achieve a superior energy efficiency.
D-HPC Workshop Panel: S4PST: Stewardship of Programming Systems and Tools (Patrick Diehl)
Patrick Diehl received his Ph.D. in Applied Mathematics from the University of Bonn. He was a postdoctoral fellow at École Polytechnique de Montréal prior to joining Louisiana State University as a research scientist and adjunct faculty. His research interests include scientific high-performance computing, asynchronous many-task runtime systems, and Modern C++. He teaches Modern C++ and advocates for open-source software in science to enhance reproducibility.
Benchmarking the Parallel 1D Heat Equation Solver in Chapel, Charm++, C++, HP... (Patrick Diehl)
Many scientific high performance codes that simulate e.g. black holes, coastal waves, climate and weather, etc. rely on block-structured meshes and use finite differencing methods to iteratively solve the appropriate systems of differential equations. In this paper we investigate implementations of an extremely simple simulation of this type using various programming systems and languages. We focus on a shared memory, parallelized algorithm that simulates a 1D heat diffusion using asynchronous queues for the ghost zone exchange. We discuss the advantages of the various platforms and explore the performance of this model code on different computing architectures: Intel, AMD, and ARM64FX. As a result, Python was the slowest of the set we compared. Java, Go, Swift, and Julia were the intermediate performers. The higher performing platforms were C++, Rust, Chapel, Charm++, and HPX.
The document discusses two contexts of subtle asynchrony. First, it discusses how to bring asynchronous task parallelism to Fortran without relying on threads. Second, it describes how NWChem achieves asynchronous task parallelism through overdecomposition of work, without programmers explicitly using tasks. This demonstrates that asynchronous many-task execution principles can be achieved without specialized runtime systems or programming abstractions. Quantum chemistry algorithms are provided as an example where overdecomposition leads to implicit asynchronous parallelism through dynamic scheduling of irregularly distributed tasks.
Framework for Extensible, Asynchronous Task Scheduling (FEATS) in Fortran (Patrick Diehl)
Most parallel scientific programs contain compiler directives (pragmas) such as those from OpenMP, explicit calls to runtime library procedures such as those implementing the Message Passing Interface (MPI), or compiler-specific language extensions such as those provided by CUDA. By contrast, the recent Fortran standards empower developers to express parallel algorithms without directly referencing lower-level parallel programming models. Fortran's parallel features place the language within the Partitioned Global Address Space (PGAS) class of programming models. When writing programs that exploit data-parallelism, application developers often find it straightforward to develop custom parallel algorithms. Problems involving complex, heterogeneous, staged calculations, however, pose much greater challenges. Such applications require careful coordination of tasks in a manner that respects dependencies prescribed by a directed acyclic graph. When rolling one's own solution proves difficult, extending a customizable framework becomes attractive. The paper presents the design, implementation, and use of the Framework for Extensible Asynchronous Task Scheduling (FEATS), which we believe to be the first task-scheduling tool written in modern Fortran. We describe the benefits and compromises associated with choosing Fortran as the implementation language, and we propose ways in which future Fortran standards can best support the use case in this paper.
JOSS and FLOSS for science: Examples for promoting open source software and s... (Patrick Diehl)
P. Diehl. JOSS and FLOSS for science: Examples for promoting open source software and science communication. SIGDIUS Seminars, 14.06.2023, Virtual event.
A tale of two approaches for coupling nonlocal and local models (Patrick Diehl)
The document summarizes a presentation on coupling nonlocal peridynamic models with local continuum models for fracture simulations. It discusses two approaches: 1) Enriching the partition of unity method (PUM) with peridynamic models by using a global-local solution, where the global PUM solution identifies regions for local peridynamic solving. 2) Directly coupling classical and peridynamic models. Several numerical examples are shown comparing the two models on problems like beam bending, stationary cracks, and inclined cracks. The results demonstrate the models match when peridynamics is in the linear regime but diverge once bond softening occurs.
Challenges for coupling approaches for classical linear elasticity and bond-b... (Patrick Diehl)
The document presents coupling approaches for combining classical linear elasticity models with non-local peridynamic models for applications in computational mechanics. It describes three coupling methods - matching displacements (MDCM), matching stresses (MSCM), and variable horizon (VHCM). Numerical examples are presented to compare the accuracy of the three methods on manufactured solutions using cubic and quartic polynomials, demonstrating that all methods converge with refinement but VHCM typically has the lowest error.
Quantifying Overheads in Charm++ and HPX using Task Bench (Patrick Diehl)
Asynchronous Many-Task (AMT) runtime systems take advantage of multi-core architectures with light-weight threads, asynchronous executions, and smart scheduling. In this paper, we present a comparison of the AMT systems Charm++ and HPX with the mainstream MPI, OpenMP, and MPI+OpenMP libraries using the Task Bench benchmarks. Charm++ is a parallel programming language based on C++, supporting stackless tasks as well as light-weight threads, along with an adaptive runtime system. HPX is a C++ library for concurrency and parallelism, exposing a C++ standards-conforming API. First, we analyze the commonalities, differences, and advantageous scenarios of Charm++ and HPX in detail. Further, to investigate the potential overheads introduced by the tasking systems of Charm++ and HPX, we utilize an existing parameterized benchmark, Task Bench, in which 15 different programming systems were implemented, e.g., MPI, OpenMP, and MPI + OpenMP, and extend Task Bench by adding HPX implementations. We quantify the overheads of Charm++, HPX, and the mainstream libraries in different scenarios where a single task and multiple tasks are assigned to each core, respectively. We also investigate each system's scalability and its ability to hide communication latency.
Interactive C++ code development using C++Explorer and GitHub Classroom for e... (Patrick Diehl)
The document describes using C++Explorer and GitHub Classroom for interactive C++ code development and teaching parallel programming concepts. C++Explorer allows running C++ code interactively in Jupyter notebooks and includes Cling, HPX, and Blaze libraries. GitHub Classroom is used for version control and collaboration. A survey of students found the environment was mostly well-received, with room for additional features. The materials are open source to allow other educators to use in their teaching.
An asynchronous and task-based implementation of peridynamics utilizing HPX—t... (Patrick Diehl)
On modern supercomputers, asynchronous many-task systems are emerging to address the new architecture of computational nodes. With this shift to increasing core counts per node, a new programming model is needed that focuses on handling the fine-grain parallelism of nodes with many cores. Asynchronous Many Task (AMT) runtime systems represent a paradigm for addressing this fine-grain parallelism: they handle the increasing number of threads per node and the resulting concurrency. HPX, an open source C++ standard library for parallelism and concurrency, is one AMT conforming to the C++ standard. Motivated by the impressive performance of asynchronous task-based parallelism through HPX in N-body problems and astrophysics simulations, in this work we consider its application to the peridynamics theory. Peridynamics is a non-local generalization of continuum mechanics tailored to address discontinuous displacement fields arising in fracture mechanics. Peridynamics requires considerable computing resources, owing to the non-local nature of its formulation, offering scope for improved computing performance via asynchronous task-based parallelism. Our results show that HPX-based peridynamic computation is scalable, and the scalability is in agreement with the theory. In addition to the scalability, we also show validation results and mesh convergence results. For the validation, we consider implicit time integration and compare the result with classical continuum mechanics (CCM) (peridynamics under small deformation should give similar results as CCM). For the mesh convergence, we consider explicit time integration and show that the results are in agreement with theoretical claims in previous works.
Quasistatic Fracture using Nonlinear-Nonlocal Elastostatics with an Analytic T... (Patrick Diehl)
The document discusses a new method for quasistatic fracture simulation using a regularized nonlinear pairwise (RNP) potential. Key points:
1) An analytic tangent stiffness matrix is derived for the RNP potential by taking the derivative of the bond potential, allowing for more efficient simulations.
2) Two loading algorithms are presented - soft loading and hard loading. Soft loading uses bond softening while hard loading applies a prescribed displacement field.
3) Numerical results show the method can capture linear elastic behavior, bond softening prior to crack growth, and eventual stable crack propagation under both soft and hard loading conditions.
A review of benchmark experiments for the validation of peridynamics models (Patrick Diehl)
This document reviews peridynamic models that have been compared to experimental data. It summarizes 39 papers that compared peridynamic simulations to experiments in areas like wave propagation, crack initiation/propagation in materials like composites, steel, aluminum, concrete and glass. It evaluates the confidence of peridynamic models by looking at metrics like relative error of scalar observables and R^2 correlation of observable series. It also discusses two papers on advanced visualization techniques for fracture simulations: physically-based rendering and extraction of fragments. The document concludes with an outlook on future work.
Deploying a Task-based Runtime System on Raspberry Pi Clusters (Patrick Diehl)
This document summarizes research deploying a task-based runtime system on Raspberry Pi clusters. The researchers used Raspberry Pi 3B, 3B+, and 4 models to build small, low-cost clusters. They benchmarked HPX and Phylanx applications, finding best performance on 2 cores due to memory bandwidth limitations. Multi-node codes scaled well but threads provided little gain. The clusters showed modest performance at reasonable cost and could be used for teaching and collecting sensor data in the field.
EMI 2021 - A comparative review of peridynamics and phase-field models for en... (Patrick Diehl)
This document provides a comparative review of peridynamics and phase-field models for engineering fracture mechanics. It discusses the computational aspects of both models, their advantages in modeling crack initiation and propagation, and common challenges. Some specific challenges for each model are also outlined, such as applying boundary conditions for peridynamics and modeling fast crack propagation under dynamic loading for phase-field. The conclusion states that both models can capture microscale fracture physics but comparative validation studies against experimental data are still lacking.
Google Summer of Code mentor summit 2020 - Session 2 - Open Science and Open ... (Patrick Diehl)
This document outlines the agenda for a session on open science and open source software. The session will discuss why open source software is essential for open science, the role of Google Summer of Code in this process, and how to make students aware of the importance of open science and open source topics through undergraduate and graduate education. The agenda also allows time for general discussion of these issues.
Immersive Learning That Works: Research Grounding and Paths Forward (Leonel Morgado)
We will metaverse into the essence of immersive learning, into its three dimensions and conceptual models. This approach encompasses elements from teaching methodologies to social involvement, through organizational concerns and technologies. Challenging the perception of learning as knowledge transfer, we introduce a 'Uses, Practices & Strategies' model operationalized by the 'Immersive Learning Brain' and ‘Immersion Cube’ frameworks. This approach offers a comprehensive guide through the intricacies of immersive educational experiences and spotlighting research frontiers, along the immersion dimensions of system, narrative, and agency. Our discourse extends to stakeholders beyond the academic sphere, addressing the interests of technologists, instructional designers, and policymakers. We span various contexts, from formal education to organizational transformation to the new horizon of an AI-pervasive society. This keynote aims to unite the iLRN community in a collaborative journey towards a future where immersive learning research and practice coalesce, paving the way for innovative educational research and practice landscapes.
Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ... (Travis Hills MN)
By harnessing the power of High Flux Vacuum Membrane Distillation, Travis Hills from MN envisions a future where clean and safe drinking water is accessible to all, regardless of geographical location or economic status.
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita... (Advanced-Concepts-Team)
Presentation in the Science Coffee of the Advanced Concepts Team of the European Space Agency on the 07.06.2024.
Speaker: Diego Blas (IFAE/ICREA)
Title: Gravitational wave detection with orbital motion of Moon and artificial satellites
Abstract:
In this talk I will describe some recent ideas to find gravitational waves from supermassive black holes or of primordial origin by studying their secular effect on the orbital motion of the Moon or satellites that are laser ranged.
When I was asked to give a companion lecture in support of ‘The Philosophy of Science’ (https://shorturl.at/4pUXz) I decided not to walk through the detail of the many methodologies in order of use. Instead, I chose to employ a long standing, and ongoing, scientific development as an exemplar. And so, I chose the ever evolving story of Thermodynamics as a scientific investigation at its best.
Conducted over a period of >200 years, Thermodynamics R&D, and application, benefitted from the highest levels of professionalism, collaboration, and technical thoroughness. New layers of application, methodology, and practice were made possible by the progressive advance of technology. In turn, this has seen measurement and modelling accuracy continually improved at a micro and macro level.
Perhaps most importantly, Thermodynamics rapidly became a primary tool in the advance of applied science/engineering/technology, spanning micro-tech, to aerospace and cosmology. I can think of no better a story to illustrate the breadth of scientific methodologies and applications at their best.
PPT on Direct Seeded Rice presented at the three-day 'Training and Validation Workshop on Modules of Climate Smart Agriculture (CSA) Technologies in South Asia' workshop on April 22, 2024.
The binding of cosmological structures by massless topological defects (Sérgio Sacani)
Assuming spherical symmetry and weak field, it is shown that if one solves the Poisson equation or the Einstein field equations sourced by a topological defect, i.e. a singularity of a very specific form, the result is a localized gravitational field capable of driving flat rotation (i.e. Keplerian circular orbits at a constant speed for all radii) of test masses on a thin spherical shell without any underlying mass. Moreover, a large-scale structure which exploits this solution by assembling concentrically a number of such topological defects can establish a flat stellar or galactic rotation curve, and can also deflect light in the same manner as an equipotential (isothermal) sphere. Thus, the need for dark matter or modified gravity theory is mitigated, at least in part.
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf (Selcen Ozturkcan)
Ozturkcan, S., Berndt, A., & Angelakis, A. (2024). Mending clothing to support sustainable fashion. Presented at the 31st Annual Conference by the Consortium for International Marketing Research (CIMaR), 10-13 Jun 2024, University of Gävle, Sweden.
Juaristi, Jon. - El canon espanol. El legado de la cultura española a la civi...
Recent developments in HPX and Octo-Tiger
1. Recent developments in HPX and Octo-Tiger
Patrick Diehl
Joint work with: Gregor Daiß, Sagiv Schieber, Dominic Marcello, Kevin Huck,
Hartmut Kaiser, Juhan Frank, Geoffrey Clayton, Patrick Motl, Dirk Pflüger,
Orsola De Marco, Mikael Simberg, John Biddiscombe, Srinivas Yadav, and many
more
Center for Computation & Technology, Louisiana State University
Department of Physics & Astronomy
patrickdiehl@lsu.edu
November 2022
Patrick Diehl (CCT/LSU) HPX & Octo-Tiger November 2022 1 / 44
2. Motivation
Astrophysical event: Merging of two stars – flow on the surface, which corresponds, in layman's terms, to the weather on the stars.
Patrick Diehl (CCT/LSU) HPX & Octo-Tiger November 2022 2 / 44
3. Outline
1 Astrophysical application
2 Software framework
Octo-Tiger
HPX
APEX
3 New features
Kokkos and HPX
Vectorization
Work aggregation
4 Overhead measurements
5 Scaling
Synchronous (MPI) vs asynchronous communication (libfabric)
Scaling on ORNL’s Summit
First experience on Fugaku using A64FX
6 Performance profiling
7 Conclusion and Outlook
Patrick Diehl (CCT/LSU) HPX & Octo-Tiger November 2022 3 / 44
6. V1309 Scorpii
At peak brightness, the rare 2002 red nova V838 Monocerotis briefly rivalled the most powerful stars in the Galaxy. Credit: NASA/ESA/H. E. Bond (STScI)
A near-infrared (I band) light curve for V1309 Scorpii, plotted from OGLE data
Reference
Tylenda, R., et al. ”V1309 Scorpii: merger of a contact binary.” Astronomy & Astrophysics 528 (2011): A114.
Patrick Diehl (CCT/LSU) HPX & Octo-Tiger November 2022 6 / 44
8. Octo-Tiger
Astrophysics open source program1 simulating the evolution of star systems based on the fast multipole method on adaptive octrees.
Modules
Hydro
Gravity
Radiation (benchmarking)
Supports
Communication: MPI/libfabric
Backends: CUDA, HIP, Kokkos
Reference
Marcello, Dominic C., et al. ”octo-tiger: a new, 3D hydrodynamic code for stellar mergers that uses hpx parallelization.”
Monthly Notices of the Royal Astronomical Society 504.4 (2021): 5345-5382.
1 https://github.com/STEllAR-GROUP/octotiger
Patrick Diehl (CCT/LSU) HPX & Octo-Tiger November 2022 8 / 44
9. Example of adaptive mesh refinement
Reference
Heller, Thomas, et al. ”Harnessing billions of tasks for a scalable portable hydrodynamic simulation of the merger of two
stars.” The International Journal of High Performance Computing Applications 33.4 (2019): 699-715.
Patrick Diehl (CCT/LSU) HPX & Octo-Tiger November 2022 9 / 44
11. HPX
HPX is an open source C++ Standard Library for Concurrency and Parallelism2.
Features
HPX exposes a uniform, standards-oriented API for ease of programming parallel and distributed applications.
HPX provides unified syntax and semantics for local and remote operations.
HPX exposes a uniform, flexible, and extendable performance counter framework which can enable runtime adaptivity.
Reference
Kaiser, Hartmut, et al. ”Hpx-the c++ standard library for parallelism and concurrency.” Journal of Open Source
Software 5.53 (2020): 2352.
2 https://github.com/STEllAR-GROUP/hpx
Patrick Diehl (CCT/LSU) HPX & Octo-Tiger November 2022 10 / 44
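To make the unified local/remote semantics concrete, here is a minimal sketch (not from the talk; the function and values are illustrative only) of HPX's standards-conforming, future-based API:

#include <hpx/hpx_main.hpp>
#include <hpx/hpx.hpp>
#include <iostream>

int square(int x) { return x * x; }

int main() {
    // hpx::async mirrors std::async but schedules an HPX light-weight thread;
    // registering square as an HPX action would let the same call site target
    // a remote locality with unchanged syntax.
    hpx::future<int> f = hpx::async(square, 21);
    std::cout << f.get() << std::endl;  // prints 441
    return 0;
}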
12. HPX’s architecture
[Architecture diagram: between the application and the operating system, HPX layers the C++2z concurrency/parallelism APIs, the threading subsystem, the Active Global Address Space (AGAS), Local Control Objects (LCOs), the parcel transport layer (networking), the performance counter framework, and the policy engine/policies.]
Reference
Kaiser, Hartmut, et al. ”Hpx-the c++ standard library for parallelism and concurrency.” Journal of Open Source
Software 5.53 (2020): 2352.
Patrick Diehl (CCT/LSU) HPX & Octo-Tiger November 2022 11 / 44
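As a hedged illustration of the performance counter framework in the diagram above: HPX applications can print counters from the command line, e.g. (this counter name follows the HPX documentation; available counters vary by version)

./my_hpx_app --hpx:print-counter=/threads{locality#0/total}/count/cumulative

which reports the cumulative number of HPX tasks executed on locality 0.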
13. APEX
APEX: Autonomous Performance Environment for Exascale, a performance measurement library for distributed, asynchronous multitasking systems.
CUPTI used to capture CUDA events
NVML used to monitor the GPU
OTF2 and Google Trace Events trace
output
Task Graphs and Trees
Scatterplots of timers and counters
Reference
Huck, Kevin A., et al. ”An autonomic performance environment for exascale.” Supercomputing frontiers and innovations
2.3 (2015): 49-66.
Patrick Diehl (CCT/LSU) HPX & Octo-Tiger November 2022 12 / 44
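A hedged usage note (not from the slides): APEX is typically enabled when HPX is built with APEX support and is then controlled through environment variables; assuming APEX's documented switches, a traced run might look like

APEX_SCREEN_OUTPUT=1 APEX_OTF2=1 ./octotiger ...

yielding a screen summary plus an OTF2 trace that trace viewers can display.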
14. APEX
To support performance measurement in systems that employ user-level threading, APEX uses a dependency chain in addition to the call stack to produce traces and task dependency graphs.
CUPTI used to capture CUDA events
NVML used to monitor the GPU
OTF2 and Google Trace Events trace
output
Task Graphs and Trees
Scatterplots of timers and counters
Reference
Huck, Kevin A., et al. ”An autonomic performance environment for exascale.” Supercomputing frontiers and innovations
2.3 (2015): 49-66.
Patrick Diehl (CCT/LSU) HPX & Octo-Tiger November 2022 12 / 44
16. HPX and Kokkos
Reference
Edwards, H. Carter, Christian R. Trott, and Daniel Sunderland. ”Kokkos: Enabling manycore performance portability
through polymorphic memory access patterns.” Journal of parallel and distributed computing 74.12 (2014): 3202-3216.
Daiß, Gregor, et al. ”Beyond Fork-Join: Integration of Performance Portable Kokkos Kernels with HPX.” 2021 IEEE
International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 2021.
Patrick Diehl (CCT/LSU) HPX & Octo-Tiger November 2022 14 / 44
17. Overhead
Reference
Daiß, Gregor, et al. ”Beyond Fork-Join: Integration of Performance Portable Kokkos Kernels with HPX.” 2021 IEEE
International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 2021.
Patrick Diehl (CCT/LSU) HPX & Octo-Tiger November 2022 15 / 44
19. Vectorization using Kokkos + HPX
We can now easily switch both the SIMD library (Kokkos SIMD or std::experimental::simd) and the SIMD extensions used.
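As a minimal sketch of this pattern (our own illustration using std::experimental::simd, shipped e.g. with GCC 11+; a Kokkos SIMD type could be substituted through the same alias), the kernel is written once against an abstract vector type, and switching the SIMD library or extension only means changing one alias.

    #include <experimental/simd>
    #include <cstddef>

    namespace stdx = std::experimental;

    // Switching the SIMD library or extension means changing this alias only.
    using simd_t = stdx::native_simd<double>;

    void scale(double* data, std::size_t n, double factor)
    {
        std::size_t i = 0;
        // Vectorized main loop: load, multiply, and store
        // simd_t::size() lanes at a time.
        for (; i + simd_t::size() <= n; i += simd_t::size()) {
            simd_t v(&data[i], stdx::element_aligned);
            v *= factor;
            v.copy_to(&data[i], stdx::element_aligned);
        }
        // Scalar remainder loop for the leftover elements.
        for (; i < n; ++i)
            data[i] *= factor;
    }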
Reference
Daiß, Gregor, et al. "From Merging Frameworks to Merging Stars: Experiences using HPX, Kokkos and SIMD Types." arXiv preprint arXiv:2210.06439 (2022). (Accepted to SC22 workshop proceedings)
20. Single node runs on different CPU architectures
21. Work aggregation strategies
Reference
Daiß, Gregor, et al. "From Task-Based GPU Work Aggregation to Stellar Mergers: Turning Fine-Grained CPU Tasks into Portable GPU Kernels." arXiv preprint arXiv:2210.06438 (2022). (Accepted to SC22 workshop proceedings)
24. Task Bench
Task Bench is a configurable benchmark for evaluating the efficiency and performance of parallel and distributed programming models, runtimes, and languages. It is primarily intended for evaluating task-based models, in which the basic unit of execution is a task, but it can be implemented in any parallel system.
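To illustrate what such a task-based workload looks like (our own sketch in HPX, not Task Bench code), the following builds a small 1D-stencil task graph in the spirit of Task Bench's parameterized graphs, with each task depending on its two neighbors from the previous time step:

    #include <hpx/future.hpp>
    #include <hpx/hpx_main.hpp>

    #include <utility>
    #include <vector>

    int main()
    {
        int const width = 8;   // tasks per time step
        int const steps = 4;   // time steps (depth of the task graph)

        std::vector<hpx::shared_future<double>> prev;
        for (int i = 0; i < width; ++i)
            prev.push_back(hpx::make_ready_future(double(i)).share());

        for (int t = 0; t < steps; ++t) {
            std::vector<hpx::shared_future<double>> next;
            for (int i = 0; i < width; ++i) {
                int const l = (i + width - 1) % width;  // periodic boundaries
                int const r = (i + 1) % width;
                // hpx::dataflow schedules the task once both inputs are ready.
                next.push_back(hpx::dataflow(
                    [](hpx::shared_future<double> a,
                       hpx::shared_future<double> b) {
                        return 0.5 * (a.get() + b.get());
                    },
                    prev[l], prev[r]).share());
            }
            prev = std::move(next);
        }
        return prev[0].get() >= 0.0 ? 0 : 1;
    }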
Reference
Slaughter, Elliott, et al. "Task Bench: A parameterized benchmark for evaluating parallel runtime performance." SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020.
25. Measurements
Single node: 1 task per core
Distributed runs: 16 tasks per core
METG (Minimum Effective Task Granularity): the 50% effective task granularity, i.e. the smallest task granularity at which a system still achieves 50% overall efficiency (sketched below).
Takeaway: Lightweight threads and work stealing come with overhead, but not for large simulations.
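As a hedged illustration of how METG(50%) is read off measured data (a hypothetical helper, not the actual Task Bench tooling): compute the overall efficiency for each tested task granularity and take the smallest granularity that still reaches 50%.

    #include <algorithm>
    #include <limits>
    #include <vector>

    struct Sample
    {
        double granularity_us;  // average task granularity in microseconds
        double efficiency;      // measured overall efficiency in [0, 1]
    };

    double metg_50(std::vector<Sample> samples)
    {
        // Scan from the finest to the coarsest granularity.
        std::sort(samples.begin(), samples.end(),
            [](Sample const& a, Sample const& b) {
                return a.granularity_us < b.granularity_us;
            });
        for (Sample const& s : samples)
            if (s.efficiency >= 0.5)  // first granularity reaching 50%
                return s.granularity_us;
        return std::numeric_limits<double>::quiet_NaN();  // never reached
    }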
Reference
Wu, Nanmiao, et al. "Quantifying Overheads in Charm++ and HPX using Task Bench." arXiv preprint arXiv:2207.12127 (2022). (Accepted to Euro-Par 22 workshop proceedings)
27. Synchronous (MPI) vs asynchronous communication (libfabric)
28. Configuration
Reference
Daiß, Gregor, et al. "From Piz Daint to the stars: Simulation of stellar mergers using high-level abstractions." Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2019.
29. Synchronous vs asynchronous communication
Reference
Daiß, Gregor, et al. "From Piz Daint to the stars: Simulation of stellar mergers using high-level abstractions." Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2019.
30. Scaling on ORNL’s Summit
31. Node level scaling: Hydro
Reference
Diehl, Patrick, et al. "Octo-Tiger's New Hydro Module and Performance Using HPX+CUDA on ORNL's Summit." 2021 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 2021.
32. Distributed scaling: Hydro
Reference
Diehl, Patrick, et al. "Octo-Tiger's New Hydro Module and Performance Using HPX+CUDA on ORNL's Summit." 2021 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 2021.
33. Node level scaling: Hydro + Gravity
Reference
Diehl, Patrick, et al. "Octo-Tiger's New Hydro Module and Performance Using HPX+CUDA on ORNL's Summit." 2021 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 2021.
34. Distributed scaling: Hydro + Gravity
Reference
Diehl, Patrick, et al. "Octo-Tiger's New Hydro Module and Performance Using HPX+CUDA on ORNL's Summit." 2021 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 2021.
35. First experience on Fugaku using A64FX
36. Porting HPX to Arm
Challenges on Fugaku:
Add support for the parallel job manager (PJM).
Cross compilation on the x86 head node for the A64FX compute nodes; some of the dependencies do not support cross compilation.
Reference
Gupta, Nikunj, et al. "Deploying a task-based runtime system on Raspberry Pi clusters." 2020 IEEE/ACM Fifth International Workshop on Extreme Scale Programming Models and Middleware (ESPM2). IEEE, 2020.
37. Node-level scaling
SVE vectorization reduced the computation time by a factor of two
compared to Neon.
38. Distributed scaling
Due to the 28 GB of memory per node, more nodes are required on Fugaku than on other machines.
References: HPCI report submitted and IPDPS workshop paper in preparation.
41. Task trees and task graphs
Reference
Diehl, Patrick, et al. "Octo-Tiger's New Hydro Module and Performance Using HPX+CUDA on ORNL's Summit." 2021 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 2021.
Diehl, Patrick, et al. "Distributed, combined CPU and GPU profiling within HPX using APEX." arXiv preprint arXiv:2210.06437 (2022).
42. Sampled profile of tasks on Piz Daint and Summit
44. Conclusion
HPX and Octo-Tiger
Asynchronous integration of GPUs in HPX:
→ using the CUDA or HIP API
→ using Kokkos for NVIDIA or AMD GPUs
Providing HPX as a backend for Kokkos and integration of asynchronous Kokkos launches
Work aggregation for small tasks on CPU and GPU
Vectorization using std::simd
Alternatives to MPI, like libfabric or LCI
Programming model
Work stealing, overlapping communication and computation, and lightweight threads can improve the performance of irregular workloads.
45. Outlook
Distributed scaling data and optimization for AMD GPUs
Optimization and large scale runs on Fugaku or Ookami
Get ready for Intel GPU support
Test the effect of libfabric on other systems
More astrophysics studies to prepare for the production run with the
light curve
Advanced visualization of the large scale results
Thanks to all my collaborators; without their effort, I could not present all these results.
Thanks for your attention! Questions?
47. Resolution convergence: Double white dwarf merger
Reference
Diehl, Patrick, et al. "Performance Measurements Within Asynchronous Task-Based Runtime Systems: A Double White Dwarf Merger as an Application." Computing in Science & Engineering 23.3 (2021): 73-81.
48. Higher reconstruction in the hydro module
Reference
Diehl, Patrick, et al. "Octo-Tiger's New Hydro Module and Performance Using HPX+CUDA on ORNL's Summit." arXiv preprint arXiv:2107.10987 (2021). (Accepted to IEEE Cluster 21)
49. Comparison with Flash I
Reference
Marcello, Dominic C., et al. "Octo-Tiger: A new, 3D hydrodynamic code for stellar mergers that uses HPX parallelization." Monthly Notices of the Royal Astronomical Society 504.4 (2021): 5345-5382.
50. Comparison with Flash II
Reference
Marcello, Dominic C., et al. "Octo-Tiger: A new, 3D hydrodynamic code for stellar mergers that uses HPX parallelization." Monthly Notices of the Royal Astronomical Society 504.4 (2021): 5345-5382.
51. This work is licensed under a Creative Commons "Attribution-NonCommercial-NoDerivatives 4.0 International" license.