The DOE Exascale Computing Project (ECP) Software Technology focus area
is developing an HPC software ecosystem that will enable the efficient
and performant execution of exascale applications. Through the
Extreme-scale Scientific Software Stack (E4S), it is developing a
comprehensive and coherent software stack that will enable application
developers to productively write highly parallel applications that can
portably target diverse exascale architectures, including IBM
OpenPOWER systems with NVIDIA GPUs. E4S features a broad collection of
HPC software packages, including the TAU Performance System(R) for
performance evaluation of HPC and AI/ML codes. TAU is a versatile
profiling and tracing toolkit that supports performance engineering of
codes written for CPUs and GPUs, and it supports most IBM platforms.
This talk will give an overview of TAU and E4S and how developers can
use these tools to analyze the performance of their codes. TAU supports
transparent instrumentation of codes without modifying the application
binary. The talk will describe TAU's support for CUDA, OpenACC, pthread,
OpenMP, Kokkos, and MPI applications. It will describe TAU's use with
Python-based frameworks such as TensorFlow and PyTorch. It will cover
the use of TAU in E4S containers using Docker and Singularity runtimes
under ppc64le. E4S provides both source builds through the Spack
platform and a set of containers that feature a broad collection of HPC
software packages. E4S exists to accelerate the development, deployment, and use of HPC software, lowering the barriers for HPC users.
The session will present HPC challenges in integrating machine learning and deep learning into simulations. In addition, we will present a user-centric view of IBM Watson ML Community Edition and the adoption of the new IBM IC922 inference system into the AIOps of large HPC clusters, from deployment to inference.
This deck is from the opening session of the "Introduction to Programming Pascal (P100) with CUDA 8" workshop at CSCS in Lugano, Switzerland. The three-day course is intended to offer an introduction to Pascal computing using CUDA 8.
Watch the video: http://wp.me/p3RLHQ-gsQ
Learn more: http://www.cscs.ch/events/event_detail/index.html?tx_seminars_pi1%5BshowUid%5D=155
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Utilizing AMD GPUs: Tuning, Programming Models, and Roadmap - George Markomanolis
A presentation at FOSDEM 2022 about AMD GPUs: tuning, programming models, and the software roadmap. This is a continuation of the previous talk (FOSDEM 2021).
For the full video of this presentation, please visit:
https://www.embedded-vision.com/platinum-members/arm/embedded-vision-training/videos/pages/may-2018-embedded-vision-summit-chandra
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Vikas Chandra, Senior Principal Engineer and Director of Machine Learning at Arm, presents the "Deep Learning on Arm Cortex-M Microcontrollers" tutorial at the May 2018 Embedded Vision Summit.
Deep learning algorithms are gaining popularity in IoT edge devices due to their human-level accuracy in many tasks, such as image classification and speech recognition. As a result, there is increasing interest in deploying neural networks (NNs) on the types of low-power processors found in always-on systems, such as those based on Arm Cortex-M microcontrollers.
In this talk, Chandra introduces the challenges of deploying neural networks on microcontrollers with limited memory, compute resources, and power budgets. He introduces CMSIS-NN, a library of optimized software kernels that enables deployment of neural networks on Cortex-M cores. He also presents techniques for NN algorithm exploration to develop lightweight models suitable for resource-constrained systems, using image classification as an example.
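As a rough sketch of the quantized arithmetic such kernels are built on (plain Python for illustration, not CMSIS-NN's actual C API): weights and activations are mapped to int8 with a scale factor, the dot product runs in integer arithmetic with a wide accumulator, and the result is rescaled at the end.

```python
def quantize(values, scale):
    """Map floats to int8 with a per-tensor scale, saturating at the type range."""
    return [max(-128, min(127, round(v / scale))) for v in values]

def int8_dot(xq, wq):
    """Integer dot product; accumulation happens in a wider type (Python ints)."""
    return sum(a * b for a, b in zip(xq, wq))

# Toy activation/weight vectors and scales (illustrative values only).
x = [0.5, -1.0, 0.25, 0.0]
w = [1.0, 0.5, -0.5, 2.0]
x_scale, w_scale = 0.01, 0.01

acc = int8_dot(quantize(x, x_scale), quantize(w, w_scale))
approx = acc * x_scale * w_scale          # dequantize the accumulator
exact = sum(a * b for a, b in zip(x, w))  # float reference
print(approx, exact)
```

The point of the exercise is that the inner loop uses only small-integer multiplies and adds, which is exactly what fits the instruction set and memory budget of a Cortex-M class core.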
In this deck, Peter Braam looks at how TensorFlow framework could be used to accelerate high performance computing.
Google has developed TensorFlow, a truly complete platform for ML. The performance of the platform is amazing, and it begs the question of whether it will be useful for HPC in a similar manner to how GPUs heralded a revolution.
As described in his talk at the CHPC 2018 Conference in South Africa, TensorFlow contains many ingredients, for example:
* many domain-specific libraries for machine learning
* the TensorFlow domain-specific data-flow language
* carefully organized input and output for data flow
* an optimizing runtime and compiler
* hardware implementations of TensorFlow operations in TensorFlow processing unit (TPU) chips
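The dataflow idea at the heart of the list above can be sketched in a few lines of plain Python (a toy illustration, not TensorFlow's API): operations form a graph, and a runtime evaluates them in dependency order, reusing shared subexpressions.

```python
class Node:
    """One operation in the dataflow graph, with references to its inputs."""
    def __init__(self, op, *inputs):
        self.op, self.inputs = op, inputs

def evaluate(node, cache=None):
    # Walk the graph in dependency order, caching each node's result
    # so shared subexpressions are computed exactly once.
    cache = {} if cache is None else cache
    if id(node) not in cache:
        args = [evaluate(n, cache) for n in node.inputs]
        cache[id(node)] = node.op(*args)
    return cache[id(node)]

const = lambda v: Node(lambda: v)

# Build the graph once: y = (a + b) * (a + b)
a, b = const(2.0), const(3.0)
s = Node(lambda x, y: x + y, a, b)
y = Node(lambda u, v: u * v, s, s)   # 's' is shared, computed only once

print(evaluate(y))
```

Separating graph construction from execution is what lets a real runtime optimize, partition, and offload the graph (for example, to TPUs) before anything runs.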
Learn more: https://wp.me/p3RLHQ-jMv
and
https://www.tensorflow.org/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2021/10/deploying-pytorch-models-for-real-time-inference-on-the-edge-a-presentation-from-nomitri/
Moritz August, CDO at Nomitri GmbH, presents the “Deploying PyTorch Models for Real-time Inference On the Edge” tutorial at the May 2021 Embedded Vision Summit.
In this presentation, August provides an overview of workflows for deploying compressed deep learning models, starting with PyTorch and creating native C++ application code running in real time on embedded hardware platforms. He illustrates these workflows on smartphones with real-world examples targeting Arm-based CPUs, GPUs, and NPUs, as well as embedded chips and modules like the NXP i.MX8+ and the NVIDIA Jetson Nano.
August examines TorchScript, architecture-side optimizations, quantization, and common pitfalls. Additionally, he shows how the PyTorch deployment workflow can be extended to conversion to ONNX and quantization of ONNX models using ONNX Runtime. On the application side, he demonstrates how deployed models can be integrated efficiently into a C++ library that runs natively on mobile and embedded devices, and highlights known limitations.
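Trace-based export of the kind TorchScript's tracing mode performs can be illustrated in plain Python (a conceptual sketch, not PyTorch's API): run the model once on proxy inputs that record every operation, producing a graph that can later be replayed without the original Python code.

```python
class Traced:
    """A proxy value that records each arithmetic op applied to it."""
    def __init__(self, value, graph, name):
        self.value, self.graph, self.name = value, graph, name

    def _emit(self, op, other):
        ov = other.value if isinstance(other, Traced) else other
        result = {"add": self.value + ov, "mul": self.value * ov}[op]
        name = f"t{len(self.graph)}"
        self.graph.append((name, op, self.name, getattr(other, "name", repr(other))))
        return Traced(result, self.graph, name)

    def __add__(self, other): return self._emit("add", other)
    def __mul__(self, other): return self._emit("mul", other)

def model(x):          # ordinary Python code, no export annotations
    return x * 2.0 + 1.0

graph = []
out = model(Traced(3.0, graph, "x"))
print(out.value)       # the traced run still computes the real result
print(graph)           # recorded ops, in execution order
```

A real tracer records tensor ops instead of float ops, but the trade-off is the same: only the path actually executed on the example input ends up in the exported graph, which is why data-dependent control flow is a classic tracing pitfall.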
The first version of eBPF hardware offload was merged into the Linux kernel in October 2016 and became part of Linux v4.9. For the last two years the project has been growing and evolving to integrate more closely with the core kernel infrastructure and enable more advanced use cases. This talk will explain the internals of the kernel architecture of the offload and how it allows seamless execution of unmodified eBPF datapaths in hardware.
Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - AMD Developer Central
Keynote presentation, Is There Anything New in Heterogeneous Computing, by Mike Muller, Chief Technology Officer, ARM, at the AMD Developer Summit (APU13), Nov. 11-13, 2013.
CC-4005, Performance Analysis of 3D Finite Difference Computational Stencils on Seamicro Fabric Compute Systems - AMD Developer Central
Presentation CC-4005, Performance analysis of 3D Finite Difference computational stencils on Seamicro fabric compute systems, by Joshua Mora from the AMD Developer Summit (APU13) November 2013.
Atsushi Hori
RIKEN
A new portable and practical parallel execution model, Process in Process (PiP for short), will be presented. PiP tasks share the same virtual address space, as in the multi-thread model, and have privatized variables, as in the multi-process model. Because of this, PiP provides the best of both worlds: multi-process (MPI) and multi-thread (OpenMP).
Researcher, System Software Development Team, RIKEN
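A loose Python analogy for this combination (standard library only, an illustration rather than PiP itself): threads share one address space, as PiP tasks do, while threading.local gives each task its own copy of a variable, like the privatized variables of the multi-process model.

```python
import threading

shared = []                    # one address space: visible to every task
private = threading.local()    # each task sees its own 'private.rank'

def task(rank):
    private.rank = rank               # privatized: no other task sees this
    shared.append(("rank", rank))     # shared: everyone appends to one list

threads = [threading.Thread(target=task, args=(r,)) for r in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(shared))  # entries written by 4 tasks into one shared structure
```

PiP achieves this at the process level (with genuinely private global variables per task), which is what makes it attractive for running MPI ranks inside a single address space.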
This Webinar explores a variety of new and updated features in Java 8 and discusses how these changes can positively impact your day-to-day programming.
Watch the video replay here: http://bit.ly/1vStxKN
Your Webinar presenter, Marnie Knue, is an instructor for Develop Intelligence and has taught Sun & Oracle certified Java classes, RedHat JBoss administration, Spring, and Hibernate. Marnie also has spoken at JavaOne.
SDVIs and In-Situ Visualization on TACC's Stampede - Intel® Software
Speaker: Paul Navrátil, Texas Advanced Computing Center (TACC)
The design emphasis for supercomputing systems has moved from raw performance to performance-per-watt, and as a result, supercomputing architectures are converging on processors with wide vector units and many processing cores per chip. Such processors are capable of performant image rendering purely in software. This improved capability is fortuitous, since the prevailing homogeneous system designs lack dedicated, hardware-accelerated rendering subsystems for use in data visualization. Reliance on this “software-defined” rendering capability will grow in importance since, due to growing data sizes, visualizations must be performed on the same machine where the data is produced. Further, as data sizes outgrow disk I/O capacity, visualization will be increasingly incorporated into the simulation code itself (in situ visualization).
This talk presents recent work in high-fidelity visualization using the OSPRay ray tracing framework on TACC’s local and remote visualization systems. We present work using OSPRay within ParaView Catalyst in situ framework from Kitware, including capitalizing on opportunities to reduce data costs migrating through VTK filters for visualization. We highlight the performance opportunities and advantages of Intel® Advanced Vector Extensions 512, the memory system improvements possible with Intel® Xeon Phi™ processor multi-channel DRAM (MCDRAM) and the Intel® Omni-Path Architecture interconnect.
Fast Scalable Easy Machine Learning with OpenPOWER, GPUs and Docker - Indrajit Poddar
Transparently accelerated deep learning workloads on OpenPOWER systems and GPUs, using easy-to-use open source frameworks such as Caffe, Torch, TensorFlow, and Theano.
Best Practices and Performance Studies for High-Performance Computing Clusters - Intel® Software
This session focuses on key system tunables for maximizing application performance of high-performance computing (HPC) workloads, and addresses porting, optimizing, and running applications to maximize performance. We present practical tips and techniques for building and running applications on multicore processors. We analyze sample performance and scaling data from various applications, and identify the best options.
For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/embedded-vision-alliance/embedded-vision-training/videos/pages/may-2014-embedded-vision-summit-khronos
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Neil Trevett, President of Khronos and Vice President at NVIDIA, presents the "OpenVX Hardware Acceleration API for Embedded Vision Applications and Libraries" tutorial at the May 2014 Embedded Vision Summit.
This presentation introduces OpenVX, a new application programming interface (API) from the Khronos Group. OpenVX enables performance and power optimized vision algorithms for use cases such as face, body and gesture tracking, smart video surveillance, automatic driver assistance systems, object and scene reconstruction, augmented reality, visual inspection, robotics and more.
OpenVX enables significant implementation innovation while maintaining a consistent API for developers. OpenVX can be used directly by applications or to accelerate higher-level middleware with platform portability. OpenVX complements the popular OpenCV open source vision library that is often used for application prototyping.
An Overview of the IHK/McKernel Multi-kernel Operating System - Linaro
By Balazs Gerofi, RIKEN Advanced Institute For Computational Science
RIKEN Advanced Institute for Computational Science is in charge of leading the development of Japan's next-generation flagship supercomputer, the successor of the K computer. Part of this effort is to design and develop a system software stack that suits the needs of future extreme-scale computing. In this talk, we focus on operating system (OS) requirements for HPC and discuss IHK/McKernel, a multi-kernel based operating system framework. IHK/McKernel runs Linux side-by-side with a lightweight kernel (LWK) on compute nodes, with the primary motivation of providing scalable, consistent performance for large-scale HPC simulations while retaining a fully Linux-compatible execution environment. We provide an overview of the project and discuss the status of its support for the Arm architecture.
Balazs Gerofi Bio
Research Scientist at RIKEN Advanced Institute For Computational Science.
Email
bgerofi@riken.jp
For more info on The Linaro High Performance Computing (HPC) visit https://www.linaro.org/sig/hpc/
This is a presentation about how to use Kubeflow for "AI pipeline optimization": we show the "traditional" pipeline and why it should be optimized to make it available to a wider audience. Services are getting more and more important nowadays; that's why we call it "Data Science as a Service".
AMD’s math libraries can support a range of programmers from hobbyists to ninja programmers. Kent Knox from AMD’s library team introduces you to OpenCL libraries for linear algebra, FFT, and BLAS, and shows you how to leverage the speed of OpenCL through the use of these libraries.
Review the material presented in the AMD Math libraries webinar in this deck.
For more:
Visit the AMD Developer Forums: http://devgurus.amd.com/welcome
Watch the replay: www.youtube.com/user/AMDDevCentral
Follow us on Twitter: https://twitter.com/AMDDevCentral
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar - AMD Developer Central
This deck presents highlights from the Introduction to OpenCL™ Programming Webinar presented by Acceleware & AMD on Sept. 17, 2014. Watch a replay of this popular webinar on the AMD Dev Central YouTube channel here: https://www.youtube.com/user/AMDDevCentral or here for the direct link: http://bit.ly/1r3DgfF
In this deck from the 2019 UK HPC Conference, Dr. Oliver Perks from Arm presents: Arm as a Viable Architecture for HPC & AI.
"In the past two years, Arm has transitioned from being a novelty research project for HPC to a viable candidate for large-scale procurements. Through the advent of competitive processors, such as the Marvell ThunderX2, Arm is being taken increasingly seriously as an alternative to traditional x86-based supercomputers. Whilst the novelty lies within the architectural design, the most significant body of work has taken place in the ecosystem and applications space, ensuring a smooth transition for production scientific workloads. In this talk we will present the current status of Arm in HPC and scientific computing, and what to expect from future generations of Arm-based processors. Additionally, we will cover best practices for the adoption of Arm technology in a production HPC setting."
Watch the video: https://wp.me/p3RLHQ-kV5
Learn more: https://developer.arm.com/solutions/hpc
and
http://hpcadvisorycouncil.com/events/2019/uk-conference/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Containerizing HPC and AI Applications Using E4S and Performance Monitoring Tools - Ganesan Narayanasamy
The DOE Exascale Computing Project (ECP) Software Technology focus area is developing an HPC software ecosystem that will enable the efficient and performant execution of exascale applications. Through the Extreme-scale Scientific Software Stack (E4S) [https://e4s.io], it is developing a comprehensive and coherent software stack that will enable application developers to productively write highly parallel applications that can portably target diverse exascale architectures. E4S provides both source builds through the Spack platform and a set of containers that feature a broad collection of HPC software packages. E4S exists to accelerate the development, deployment, and use of HPC software, lowering the barriers for HPC users. It provides container images, build manifests, and turn-key, from-source builds of popular HPC software packages developed as Software Development Kits (SDKs). This effort spans a broad range of areas, including programming models and runtimes (MPICH, Kokkos, RAJA, OpenMPI), development tools (TAU, HPCToolkit, PAPI), math libraries (PETSc, Trilinos), data and visualization tools (Adios, HDF5, ParaView), and compilers (LLVM), all available through the Spack package manager. The talk will describe the community engagements and interactions that led to the many artifacts produced by E4S, and how E4S containers are being deployed on HPC systems at DOE national laboratories using the Singularity, Shifter, and Charliecloud container runtimes.
This talk will describe how E4S can support the OpenPOWER platform with NVIDIA GPUs.
TAU for Accelerating AI Applications at OpenPOWER Summit Europe - OpenPOWERorg
Sameer Shende, director, Performance Research Laboratory, University of Oregon, presents TAU for Accelerating AI Applications at OpenPOWER Summit Europe 2018.
An overview of challenges faced by the AI community in achieving high-performance, scalable, and distributed DNN training on modern HPC systems with both scale-up and scale-out strategies. After that, the talk will focus on a range of solutions being carried out in my group to address these challenges. The solutions include: 1) MPI-driven deep learning; 2) co-designing deep learning stacks with high-performance MPI; 3) out-of-core DNN training; and 4) hybrid (data and model) parallelism. Case studies accelerating DNN training with popular frameworks like TensorFlow, PyTorch, MXNet, and Caffe on modern HPC systems will be presented.
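The data-parallel pattern underlying MPI-driven training can be sketched in plain Python (plain lists stand in for MPI_Allreduce here; this is an illustration, not any framework's API): each worker computes gradients on its own data shard, and an allreduce averages them so every worker applies the identical update.

```python
def local_gradient(weights, shard):
    # Toy gradient of mean squared error for y = w*x on this worker's shard.
    return [sum(2 * (w * x - y) * x for x, y in shard) / len(shard)
            for w in weights]

def allreduce_mean(per_worker_grads):
    # Element-wise average across workers, as MPI_Allreduce(SUM)/n would give.
    n = len(per_worker_grads)
    return [sum(g[i] for g in per_worker_grads) / n
            for i in range(len(per_worker_grads[0]))]

weights = [0.0]
# Four workers, each holding one sample of the line y = 2x.
shards = [[(1.0, 2.0)], [(2.0, 4.0)], [(3.0, 6.0)], [(4.0, 8.0)]]

for _ in range(200):  # synchronous SGD steps
    grads = [local_gradient(weights, s) for s in shards]
    avg = allreduce_mean(grads)
    weights = [w - 0.05 * g for w, g in zip(weights, avg)]

print(round(weights[0], 3))  # converges toward the true slope 2.0
```

In a real MPI-driven stack the allreduce runs over the interconnect, and its cost relative to the local gradient computation is exactly the co-design problem the talk addresses.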
(Open) MPI, Parallel Computing, Life, the Universe, and Everything - Jeff Squyres
This talk is a general discussion of the current state of Open MPI, and a deep dive on two new features:
1. The flexible process affinity system (I presented many of these slides at the Madrid EuroMPI'13 conference in September 2013).
2. The MPI-3 "MPI_T" tools interface.
I originally gave this talk at Lawrence Berkeley Labs on Thursday, November 7, 2013.
ScicomP 2015 presentation discussing best practices for debugging CUDA and OpenACC applications with a case study on our collaboration with LLNL to bring debugging to the OpenPOWER stack and OMPT.
SREcon 2016 Performance Checklists for SREs - Brendan Gregg
Talk from SREcon2016 by Brendan Gregg. Video: https://www.usenix.org/conference/srecon16/program/presentation/gregg . "There's limited time for performance analysis in the emergency room. When there is a performance-related site outage, the SRE team must analyze and solve complex performance issues as quickly as possible, and under pressure. Many performance tools and techniques are designed for a different environment: an engineer analyzing their system over the course of hours or days, and given time to try dozens of tools: profilers, tracers, monitoring tools, benchmarks, as well as different tunings and configurations. But when Netflix is down, minutes matter, and there's little time for such traditional systems analysis. As with aviation emergencies, short checklists and quick procedures can be applied by the on-call SRE staff to help solve performance issues as quickly as possible.
In this talk, I'll cover a checklist for Linux performance analysis in 60 seconds, as well as other methodology-derived checklists and procedures for cloud computing, with examples of performance issues for context. Whether you are solving crises in the SRE war room, or just have limited time for performance engineering, these checklists and approaches should help you find some quick performance wins. Safe flying."
Preparing to program Aurora at Exascale - Early experiences and future directions - inside-BigData.com
In this deck from IWOCL / SYCLcon 2020, Hal Finkel from Argonne National Laboratory presents: Preparing to program Aurora at Exascale - Early experiences and future directions.
"Argonne National Laboratory’s Leadership Computing Facility will be home to Aurora, our first exascale supercomputer. Aurora promises to take scientific computing to a whole new level, and scientists and engineers from many different fields will take advantage of Aurora’s unprecedented computational capabilities to push the boundaries of human knowledge. In addition, Aurora’s support for advanced machine-learning and big-data computations will enable scientific workflows incorporating these techniques along with traditional HPC algorithms. Programming the state-of-the-art hardware in Aurora will be accomplished using state-of-the-art programming models. Some of these models, such as OpenMP, are long-established in the HPC ecosystem. Other models, such as Intel’s oneAPI, based on SYCL, are relatively-new models constructed with the benefit of significant experience. Many applications will not use these models directly, but rather, will use C++ abstraction libraries such as Kokkos or RAJA. Python will also be a common entry point to high-performance capabilities. As we look toward the future, features in the C++ standard itself will become increasingly relevant for accessing the extreme parallelism of exascale platforms.
This presentation will summarize the experiences of our team as we prepare for Aurora, exploring how to port applications to Aurora’s architecture and programming models, and distilling the challenges and best practices we’ve developed to date. oneAPI/SYCL and OpenMP are both critical models in these efforts, and while the ecosystem for Aurora has yet to mature, we’ve already had a great deal of success. Importantly, we are not passive recipients of programming models developed by others. Our team works not only with vendor-provided compilers and tools, but also develops improved open-source LLVM-based technologies that feed both open-source and vendor-provided capabilities. In addition, we actively participate in the standardization of OpenMP, SYCL, and C++. To conclude, I’ll share our thoughts on how these models can best develop in the future to support exascale-class systems."
Watch the video: https://wp.me/p3RLHQ-lPT
Learn more: https://www.iwocl.org/iwocl-2020/conference-program/
and
https://www.anl.gov/topic/aurora
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
This is the material of a Burst Buffer training that was presented for the early users program. It covers an introduction to parallel I/O, Lustre, the Darshan tool, Burst Buffer, and optimization parameters for MPI I/O.
Full video: https://youtu.be/8zLcZmiTweg
ThoughtWorks Tech Talks NYC: DevOops, 10 Ops Things You Might Have Forgotten ... - Rosemary Wang
Thoughtworks Tech Talks NYC, 11/30
We built an application or a platform! However, we soon realize that it is t-minus two weeks before release and we have no way of supporting it when it goes to production. Operations has not been trained, no one will know if a component goes down, and somehow the pipeline used in testing does not work in production. Oops. In this talk, we'll cover ten tips from the operations battlefront to remember as you develop an application or platform. With a focus on operations as a user and designing for support, these tips range from reminders on systems quirks to practices on engaging operations early in the development process. By taking a bit of an "operations" mindset in the development process, we can ease the release process and move closer to DevOps culture.
Native Support of Prometheus Monitoring in Apache Spark 3.0 - Databricks
All production environments require monitoring and alerting. Apache Spark has a configurable metrics system that allows users to report Spark metrics to a variety of sinks. Prometheus is one of the popular open-source monitoring and alerting toolkits, and it is often used together with Apache Spark.
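As a rough sketch of how that native support is switched on, Spark 3.0's monitoring system can expose metrics in Prometheus format through two settings; the property names below follow the Spark 3.0 monitoring documentation, but should be verified against your Spark version:

```properties
# spark-defaults.conf: expose executor metrics on the driver UI
# at /metrics/executors/prometheus (experimental in Spark 3.0)
spark.ui.prometheus.enabled  true

# metrics.properties: serve component metrics in Prometheus format
*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus
```

A Prometheus server can then scrape these endpoints directly, without a separate exporter process.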
Application Profiling at the HPCAC High Performance Center - inside-BigData.com
Pak Lui from the HPC Advisory Council presented this deck at the 2017 Stanford HPC Conference.
"Achieving good scalability on HPC scientific applications typically involves a good understanding of the workload through profile analysis, and comparing behaviors on different hardware to pinpoint bottlenecks in different areas of the HPC cluster. In this session, a selection of HPC applications will be shown to demonstrate various methods of profiling and analysis to determine the bottleneck, and the effectiveness of the tuning to improve application performance, from tests conducted at the HPC Advisory Council High Performance Center."
Watch the video presentation: http://wp.me/p3RLHQ-gpY
Learn more: http://hpcadvisorycouncil.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Similar to Performance Evaluation using TAU Performance System and E4S
The Libre-SOC Project aims to create an entirely Libre-Licensed, transparently-developed fully auditable Hybrid 3D CPU-GPU-VPU, using the Supercomputer-class OpenPOWER ISA as the foundation.
Our first test ASIC is a 180nm "Fixed-Point" Power ISA v3.0B processor, 5.1mm x 5.9mm, as a proof-of-concept for the team, whose primary expertise is in Software Engineering. Software Engineering training brings a radically different approach to Hardware development: extensive unit tests, source code revision control, automated development tools are normal. Libre Project Management brings even more: bug trackers, mailing lists, auditable IRC logs and a wiki are standard fare for Libre Projects that are simply not normal Industry-Standard practice.
This talk therefore goes through the workflow, from the original HDL through to the GDS-II layout, showing how we were able to keep track of the development that led to the IMEC 180nm tape-out in July 2021. In particular, it will be shown how, by following a parallel development process involving "Real" and "Symbolic" Cell Libraries developed by Chips4Makers, our developers did not need to sign a Foundry NDA but were still able to work side-by-side with a University that did. With this parallel development process, the University upheld its NDA obligations, and Libre-SOC was simultaneously able to honour its Transparency Objectives.
Workload Transformation and Innovations in POWER Architecture - Ganesan Narayanasamy
The IT industry is going through two major transformations. One is the adoption of AI and its tight integration into commercial applications and enterprise workflows. The other is the transformation of software architecture through concepts like microservices and cloud-native architecture. These transformations, alongside the aggressive adoption of IoT, mobile, and 5G in all our day-to-day activities, are making the world operate in a more real-time manner, which opens up a new challenge: improving hardware architecture to meet these requirements. Together they push the boundary of the entire systems stack, making designers rethink hardware. This talk presents a picture of how the industry-leading enterprise POWER architecture is transforming to fulfill the performance demands of these newer-generation workloads, with a primary focus on AI acceleration on the chip.
Join us on Friday, July 16th, 2021 for our newest workshop with DoMS, IIT Roorkee: Concept to Solutions using the OpenPOWER Stack. It's time to discover advances in #DeepLearning tools and techniques from the world's leading innovators across industries, research, and public speakers.
Register here:
https://lnkd.in/ggxMq2N
This presentation covers two use cases using OpenPOWER systems:
1. Diabetic retinopathy using AI on NVIDIA Jetson Nano: the objective is to classify the diabetic level solely from a retina image in a remote area with minimal doctor involvement. The model uses the VGG16 network architecture and is trained from scratch on POWER9. The model was deployed on the Jetson Nano board.
2. Classifying Covid positivity using lung X-ray images: the idea is to build ML models to detect positive cases from X-ray images. The model was trained on POWER9, and the application was developed in Python.
IBM Bayesian Optimization Accelerator (BOA) is a do-it-yourself toolkit to apply state-of-the-art Bayesian inferencing techniques and obtain optimal solutions for complex, real-world design simulations without requiring deep machine learning skills. This talk will describe IBM BOA, its differentiation and ease of use, and how researchers can take advantage of it for optimizing any arbitrary HPC simulation.
This presentation covers the partners and collaborators currently working with the OpenPOWER Foundation, use cases of OpenPOWER systems in multiple industries, OpenPOWER workgroups, and OpenCAPI features.
The IBM POWER10 processor represents the 10th generation of the POWER family of enterprise computing engines. Its performance is a result of both powerful processing cores and high-bandwidth intra- and inter-chip interconnect. POWER10 systems can be configured with up to 16 processor chips and 1920 simultaneous threads of execution. Cross-system memory sharing, through the new Memory Inception technology, and 2 Petabytes of addressing space support an expansive memory system. The POWER10 processing core has been significantly enhanced over its POWER9 predecessor, including a doubling of vector units and the addition of an all-new matrix math engine. Throughput gains from POWER9 to POWER10 average 30% at the core level and three-fold at the socket level. Those gains can reach ten- or twenty-fold at the socket level for matrix-intensive computations.
Everything is changing, from healthcare to the automotive markets, not to mention financial markets and every type of engineering: everything has stopped being created by an individual or, at best, a team, and is now developed and perfected using AI and hundreds of computers. Even AI is no longer something we can run on a single computer, no matter how powerful it is. What drives everything today is HPC, or High-Performance Computing, heavily linked to AI. In this session we will discuss AI, HPC, and the IBM Power architecture, and how it can help develop better healthcare, better automobiles, better financials, and better everything that we run on it.
Macromolecular crystallography is an experimental technique for exploring the 3D atomic structure of proteins, used by academics for research in biology and by pharmaceutical companies in rational drug design. While development of the technique was previously limited by the performance of scientific instruments, computing performance has recently become a key limitation. In my presentation I will present the computing challenge of handling an 18 GB/s data stream coming from the new X-ray detector. I will show PSI's experience applying conventional hardware to the task and why this attempt failed. I will then present how an IC922 server with OpenCAPI-enabled FPGA boards allowed us to build a sustainable and scalable solution for high-speed data acquisition. Finally, I will give a perspective on how advances in hardware development will enable better science for users of the Swiss Light Source.
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems - Ganesan Narayanasamy
As the adoption of AI technologies increases and matures, the focus will shift from exploration to time to market, productivity, and integration with existing workflows. Governing enterprise data, scaling AI model development, and selecting a complete, collaborative hybrid platform and tools for rapid solution deployment are key focus areas for growing data scientist teams tasked with responding to business challenges. This talk will cover the challenges and innovations for AI at scale in industries such as healthcare and automotive, the AI ladder and AI life cycle, and infrastructure architecture considerations.
This talk gives an introduction to healthcare use cases, the AI ladder and life cycle, and AI-at-scale themes. The iterative nature of the workflow and some of the important components to be aware of when developing AI healthcare solutions are discussed. The different types of algorithms, and when machine learning might be more appropriate than deep learning or the other way around, are also discussed, along with example use cases shared as part of this presentation.
Healthcare has become one of the most important aspects of everyone's life. Its importance has surged due to the latest outbreaks, and due to this latest pandemic it has become mandatory to collaborate to improve everyone's healthcare as soon as possible.
IBM has reacted quickly, sharing not only its knowledge but also its Artificial Intelligence supercomputers all around the world.
Those supercomputers are helping to prevail over this outbreak and future ones as well.
They have completely different features compared to proposals from other players in this supercomputer market.
We will take a quick look at the differences between these AI-focused supercomputers and how they can help in the R&D of healthcare solutions for everyone, from those with access to a big IBM AI supercomputer to those with access to only one small IBM AI-focused server.
Moving object recognition (MOR) corresponds to the localization and classification of moving objects in videos. Discriminating moving objects from static objects and background in videos is an essential task for many computer vision applications. MOR has widespread applications in intelligent visual surveillance, intrusion detection, anomaly detection and monitoring, industrial site monitoring, detection-based tracking, autonomous vehicles, and more. In this session, Murari presented a poster on deep learning algorithms that identify both the locations and the corresponding categories of moving objects with a convolutional network, and discussed the challenges in developing such algorithms.
Clarisse Hedglin from IBM presented this as part of a three-day international summit. She shared the scenarios AI can solve for today using the IBM AI infrastructure.
UiPath Test Automation using UiPath Test Suite series, part 3 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
PHP Frameworks: I want to break free (IPC Berlin 2024) - Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... - Ramesh Iyer
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. Fostering a culture of innovation, however, takes real work: it takes vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deployment Firewall and DBOM - James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work and a knack for helping others understand how things work. He brings around 20 years of solution-engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Connector Corner: Automate dynamic content and events by pushing a button - DianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality - Inflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Solutions Apricot) - Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... - UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl ... - DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Transcript: Selling digital books in 2024: Insights from industry leaders - T... - BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Performance Evaluation using TAU Performance System and E4S
1. TAU and E4S
July 9, 2020, 3:15pm CEST/6:15am PT
OpenPOWER Foundation Workshop at CINECA
CINECA, Italy
Prof. Sameer Shende
Research Associate Professor and
Director, Performance Research Laboratory, OACISS,
University of Oregon
http://tau.uoregon.edu/TAU_IBM_Cineca20.pdf
2. Challenges
• With growing hardware complexity, it is getting harder to accurately measure and optimize the performance of our HPC and AI/ML workloads.
• As our software gets more complex, it is getting harder to install tools and libraries correctly in an integrated and interoperable software stack.
3. Motivation: Improving Productivity
• TAU Performance System®:
– Deliver a scalable, portable performance evaluation toolkit for HPC and AI/ML workloads
– http://tau.uoregon.edu
• Extreme-scale Scientific Software Stack (E4S):
– Deliver a modular, interoperable, and deployable software stack
– Deliver expanded and vertically integrated software stacks to achieve the full potential of extreme-scale computing
– Lower the barrier to using software technology (ST) products from ECP
– Enable uniform APIs where possible
– https://e4s.io
4. TAU Performance System®
Parallel performance framework and toolkit
• Supports all HPC platforms, compilers, and runtime systems
• Provides portable instrumentation, measurement, and analysis
5. TAU Performance System
Instrumentation
• Python, C++, C, Fortran, UPC, Java, Chapel, Spark
• Automatic instrumentation
Measurement and analysis support
• MPI, OpenSHMEM, ARMCI, PGAS
• pthreads, OpenMP, OMPT interface, hybrid, other thread models
• GPU, ROCm, CUDA, OpenCL, OpenACC
• Parallel profiling and tracing
Analysis
• Parallel profile analysis (ParaProf), data mining (PerfExplorer)
• Performance database technology (TAUdb)
• 3D profile browser
6. Application Performance Engineering using TAU
• How much time is spent in each application routine and outer loops? Within loops, what is the contribution of each statement? What is the time spent in OpenMP loops? In kernels on GPUs? How long did it take to transfer data between host and device (GPU)?
• How many instructions are executed in these code regions? Floating point, Level 1 and 2 data cache misses, hits, branches taken? What is the extent of vectorization for loops?
• What is the memory usage of the code? When and where is memory allocated/de-allocated? Are there any memory leaks? What is the memory footprint of the application? What is the memory high-water mark?
• How much energy does the application use in Joules? What is the peak power usage?
• What are the I/O characteristics of the code? What is the peak read and write bandwidth of individual calls, the total volume?
• How does the application scale? What is the efficiency and runtime breakdown of performance across different core counts?
7. Instrumentation
Add hooks in the code to perform measurements
• Source instrumentation using a preprocessor
– Add timer start/stop calls in a copy of the source code.
– Use the Program Database Toolkit (PDT) for parsing source code.
– Requires recompiling the code using TAU shell scripts (tau_cc.sh, tau_f90.sh).
– Selective instrumentation (filter file) can reduce runtime overhead and narrow instrumentation focus.
• Compiler-based instrumentation
– Use the system compiler with a special flag to insert hooks at routine entry/exit.
– Requires recompiling using TAU compiler scripts (tau_cc.sh, tau_f90.sh, ...).
• Runtime preloading of TAU's Dynamic Shared Object (DSO)
– No need to recompile code! Use mpirun tau_exec ./app with options.
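As an illustration of the first approach, here is a minimal Python sketch of source-level instrumentation: a hypothetical `instrument` decorator plays the role of the timer start/stop calls that PDT-based rewriting inserts at routine entry and exit. The `instrument` and `profile` names are invented for this example; they are not TAU APIs.

```python
import time
from collections import defaultdict

# Accumulated (inclusive) time per routine: a stand-in for the
# per-routine profile table a source instrumentor would build.
profile = defaultdict(float)

def instrument(fn):
    """Wrap a routine with timer start/stop calls, mimicking what
    source rewriting does at routine entry and exit."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            # Timer stop: charge the elapsed time to this routine.
            profile[fn.__name__] += time.perf_counter() - start
    return wrapper

@instrument
def compute(n):
    return sum(i * i for i in range(n))

compute(100_000)
```

After the call, `profile["compute"]` holds the routine's measured time; a selective-instrumentation filter corresponds to choosing which functions get the decorator.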
8. Profiling and Tracing
• Profiling shows you how much (total) time was spent in each routine.
• Tracing shows you when the events take place on a timeline.
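The distinction can be sketched as two views fed by the same measurement events; the names below are invented for the example, not TAU data structures.

```python
import time

trace = []    # timeline view: (timestamp, event, routine) tuples
profile = {}  # aggregate view: total seconds per routine

def measured(name, work):
    """Record one routine execution into both views."""
    t0 = time.perf_counter()
    trace.append((t0, "enter", name))
    work()
    t1 = time.perf_counter()
    trace.append((t1, "exit", name))
    profile[name] = profile.get(name, 0.0) + (t1 - t0)

measured("init", lambda: time.sleep(0.01))
measured("solve", lambda: time.sleep(0.02))
measured("solve", lambda: time.sleep(0.02))

# The profile collapses both "solve" calls into one total;
# the trace keeps every enter/exit event with its timestamp.
```

Profiles stay small no matter how long the run is; traces grow with every event but preserve ordering and timing, which is why tools typically offer both.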
9. Using TAU's Runtime Preloading Tool: tau_exec
• Preload a wrapper that intercepts the runtime system call and substitutes it with another:
o MPI
o OpenMP
o POSIX I/O
o Memory allocation/deallocation routines
o Wrapper library for an external package
• No modification to the binary executable!
• Enable other TAU options (communication matrix, OTF2, event-based sampling)
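A rough Python analogue of this interposition idea: replace a library entry point with a wrapper that records the call and forwards to the real routine, while the calling "application" code stays untouched. (tau_exec does this at the dynamic-linker level, e.g. via LD_PRELOAD on Linux; the names below are invented for the sketch.)

```python
import builtins
import functools
import os

call_counts = {"open": 0}
_real_open = builtins.open  # keep a handle to the real routine

@functools.wraps(_real_open)
def wrapped_open(*args, **kwargs):
    # Record the event, then forward to the real implementation,
    # just as a preloaded I/O wrapper forwards to libc.
    call_counts["open"] += 1
    return _real_open(*args, **kwargs)

builtins.open = wrapped_open   # interpose the wrapper

# "Application" code below is unmodified and unaware of the wrapper.
with open(os.devnull, "rb") as f:
    f.read(1)

builtins.open = _real_open     # restore the original
```

The key property mirrored here is the second bullet above: the caller is never edited or recompiled; only the binding from name to implementation changes.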
13. TAU supports Python, MPI, and CUDA
Without any modification to the source code, DSOs, or interpreter, TAU instruments and samples the application using Python, MPI, and CUDA instrumentation.
% mpirun -np 230 tau_python -T cupti,mpi,pdt -ebs -cupti ./exafel.py
Instead of:
% mpirun -np 230 python ./exafel.py
Kernel on GPU
14. TAU Thread Statistics Table
Python, MPI, CUDA, and samples from DSOs are all integrated in a single view.
31. 31
TAU’s Support for Runtime Systems
MPI
PMPI profiling interface
MPI_T tools interface using performance and control variables
Pthread
Captures time spent in routines per thread of execution
OpenMP
OMPT tools interface to track salient OpenMP runtime events
Opari source rewriter
Preloaded wrapper library for the OpenMP runtime when OMPT is not supported
OpenACC
OpenACC instrumentation API
Track data transfers between host and device (per-variable)
Track time spent in kernels
32. 32
TAU’s Support for Runtime Systems (contd.)
OpenCL
OpenCL profiling interface
Track timings of kernels
CUDA
CUDA Profiling Tools Interface (CUPTI)
Track data transfers between host and GPU
Track access to unified shared memory between host and GPU
ROCm
Rocprofiler and Roctracer instrumentation interfaces
Track data transfers and kernel execution between host and GPU
Kokkos
Kokkos profiling API
Push/pop interface for regions; callbacks for kernel execution
Python
Python interpreter instrumentation API
Tracks Python routine transitions as well as Python to C transitions
33. 33
Examples of Multi-Level Instrumentation
MPI + OpenMP
MPI_T + PMPI + OMPT may be used to track MPI and OpenMP
MPI + CUDA
PMPI + CUPTI interfaces
OpenCL + ROCm
Rocprofiler + OpenCL instrumentation interfaces
Kokkos + OpenMP
Kokkos profiling API + OMPT to transparently track events
Kokkos + pthread + MPI
Kokkos + pthread wrapper interposition library + PMPI layer
Python + CUDA + MPI
Python + CUPTI + pthread profiling interfaces (e.g., Tensorflow, PyTorch) + MPI
MPI + OpenCL
PMPI + OpenCL profiling interfaces
36. 36
Extreme-scale Scientific Software Stack (E4S)
https://e4s.io
§ E4S is a community effort to provide open source software packages for
developing, deploying, and running scientific applications on HPC platforms.
§ E4S provides both source builds and containers of a broad collection of HPC
software packages.
§ E4S exists to accelerate the development, deployment and use of HPC
software, lowering the barriers for HPC users.
§ E4S provides containers and turn-key, from-source builds of 80+ popular
HPC software packages:
§ MPI: MPICH and OpenMPI
§ Development tools: TAU, HPCToolkit, and PAPI
§ Math libraries: PETSc and Trilinos
§ Data and Viz tools: Adios, HDF5, and Paraview
37. 37
ECP is working towards a periodic, hierarchical release process
• In ECP, teams increasingly need to ensure that their libraries and
components work together
– Historically, HPC codes used very few dependencies
• Now, groups of related SW products work together on small releases
of “Software Development Kits” - SDKs
• SDKs will be rolled into a larger, periodic release – E4S.
• Deployment at Facilities builds on SDKs and E4S
[Diagram: each SDK (Math Libraries; Data and Visualization; Programming Models; …) runs its own develop-package-build-test-deploy cycle; the SDKs are then integrated, built, tested, and deployed together as E4S, the ECP-wide software release. https://e4s.io]
38. 38
E4S: Extreme-scale Scientific Software Stack
• Curated release of ECP ST products based on Spack [http://spack.io] package manager
• Spack binary build caches for bare-metal installs
– x86_64, ppc64le (IBM Power 9), and aarch64 (ARM64)
• Container images on DockerHub and E4S website of pre-built binaries of ECP ST products
• Base images and full featured containers (GPU support)
• GitHub recipes for creating custom images from base images
• GitLab integration for building E4S images
• E4S validation test suite on GitHub
• E4S VirtualBox image with support for container runtimes
• Docker
• Singularity
• Shifter
• Charliecloud
• AWS image to deploy E4S on EC2
https://e4s.io
39. 39
Integration and Interoperability: E4S
• E4S is released twice a year. ECP Annual Meeting release of E4S v1.1:
– Containers and turn-key, from-source builds of popular HPC software packages
– 45+ full release ECP ST products including:
• MPI: MPICH and OpenMPI
• Development tools: TAU, HPCToolkit, and PAPI
• Math libraries: PETSc and Trilinos
• Data and Viz tools: Adios, HDF5
– Limited access to 10 additional ECP ST products
– GPU support
41. 41
E4S v1.1 Release at DockerHub: GPU Support
• 40+ ECP ST Products
• Support for GPUs
• NVIDIA
(CUDA 10.1.243)
• ppc64le and x86_64
% docker pull ecpe4s/ubuntu18.04-e4s-gpu
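A sketch of running the image under either runtime (the image name is from this slide; the flags assume Docker ≥ 19.03 and a Singularity build with NVIDIA support):

```shell
IMAGE=ecpe4s/ubuntu18.04-e4s-gpu
# Docker, exposing the GPUs to the container:
#   docker pull "$IMAGE"
#   docker run --gpus all -it "$IMAGE"
# Singularity on ppc64le clusters (--nv maps in the NVIDIA driver):
#   singularity pull e4s-gpu.sif docker://"$IMAGE"
#   singularity exec --nv e4s-gpu.sif tau_exec ./app
```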
42. 42
E4S v1.1 GPU Release
• 46 ECP ST products
• Ubuntu v18.04 ppc64le
• Support for GPUs
• NVIDIA
46. 46
Spack
• E4S uses the Spack package manager for software delivery
• Spack provides the ability to specify versions of software packages that are and are not
interoperable.
• Spack is a build layer for not only E4S software, but also a large collection of software tools
and libraries outside of ECP ST.
• Spack supports achieving and maintaining interoperability between ST software packages.
48. 48
Even proprietary codes are based on many open source libraries
• Half of this DAG is external (blue); more than half of it is open source.
• Nearly all of it needs to be built specially for HPC to get the best performance.
[Diagram: the dependency DAG of the ARES code, spanning physics, utility, math, and external packages — e.g. python, numpy, scipy, cmake, boost, zlib, LAPACK, BLAS, HDF5, papi, gperftools, HYPRE, SAMRAI, Silo, MPI, ncurses, sqlite, readline, openssl.]
49. 49
Spack is a flexible package manager for HPC
• How to install Spack (works out of the box):
$ git clone https://github.com/spack/spack
$ . spack/share/spack/setup-env.sh
• How to install a package:
$ spack install tau
• TAU and its dependencies are installed within the Spack directory.
• Unlike typical package managers, Spack can also install many variants of the same build.
– Different compilers
– Different MPI implementations
– Different build options
50. 50
• Each expression is a spec for a particular configuration
– Each clause adds a constraint to the spec
– Constraints are optional – specify only what you need.
– Customize install on the command line!
• Spec syntax is recursive
– Full control over the combinatorial build space
Spack provides the spec syntax to describe custom configurations
$ spack install tau unconstrained
$ spack install tau@2.29 @ custom version
$ spack install tau@2.29 %gcc@7.3.0 % custom compiler
$ spack install tau@2.29 %gcc@7.3.0 +mpi+python+pthreads +/- build option
$ spack install tau@2.29 %gcc@7.3.0 +mpi ^mvapich2@2.3~wrapperrpath ^ dependency information
51. 51
`spack find` shows what is installed
• All the versions coexist!
– Multiple versions of the same
package are ok.
• Packages are installed to
automatically find correct
dependencies.
• Binaries work regardless of
user’s environment.
• Spack also generates
module files.
– Don’t have to use them.
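For example (standard Spack subcommands; the package name is just an illustration):

```shell
PKG=tau
# List every installed variant, with hash and variant string:
#   spack find -lv "$PKG"
# Put one build's bin/ and lib/ paths into the current shell:
#   spack load "$PKG"
# Or use the generated module files instead:
#   module avail "$PKG"
```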
52. 52
• Spack simplifies HPC software for:
– Users
– Developers
– Cluster installations
– The largest HPC facilities
• Spack is central to ECP’s software strategy
– Enable software reuse for developers and users
– Allow the facilities to consume the entire ECP stack
• The roadmap is packed with new features:
– Building the ECP software distribution
– Better workflows for building containers
– Stacks for facilities
– Chains for rapid dev workflow
– Optimized binaries
– Better dependency resolution
The Spack community is growing rapidly
@spackpm
github.com/spack/spack
Visit spack.io
54. 54
Reproducible, Customizable Container Builds & Spack Mirrors
• E4S provides base images and recipes for building Docker containers based on SDKs
– Git: https://github.com/UO-OACISS/e4s
– E4S provides Spack build caches for native bare-metal as well as container-based
installations of ST products
– Build caches: https://oaciss.uoregon.edu/e4s/inventory.html
• The build cache model can be extended to target platforms, and can be managed by facilities staff when
appropriate.
55. 55
E4S: Spack Build Cache at U. Oregon
• https://oaciss.uoregon.edu/e4s/inventory.html
• 10,000+ binaries
• S3 mirror
• No need to build
from source code!
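A sketch of consuming the cache from a Spack install (the mirror URL is an assumption; verify it against the inventory page on this slide, and the subcommand names are per recent Spack releases):

```shell
# Hypothetical mirror URL -- check the E4S inventory page for the current one.
MIRROR=https://cache.e4s.io
# Register the mirror and trust its signing keys, then install from binaries:
#   spack mirror add E4S "$MIRROR"
#   spack buildcache keys --install --trust
#   spack install tau          # now resolved from the build cache when possible
```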
59. 59
E4S: Build Pipeline for Creating Spack Build Cache at U. Oregon
• GitLab
• Multi-platform builds
60. 60
E4S Validation Test Suite
• git clone https://github.com/E4S-Project/testsuite.git
To validate packages built in E4S
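A sketch of running the suite after the clone above (the driver-script name is an assumption; check the repository's README):

```shell
REPO=https://github.com/E4S-Project/testsuite.git
# Clone and run the top-level driver (requires a Spack/E4S environment):
#   git clone "$REPO" && cd testsuite
#   ./test-all.sh              # driver script name assumed; see the repo README
```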
61. 61
Reproducible Container Builds using E4S Base Images
● PMR SDK base image has Spack build cache mirror and
GPG key installed.
● Base image has GCC and MPICH configured for MPICH ABI-level
replacement (with the system MPI).
● Customized container build using binaries from E4S
Spack build cache for fast deployment.
● No need to rebuild packages from the source code.
● Same recipe for container and native bare-metal builds
with Spack!
63. 63
Future work, issues…
● Improved support for GPUs and visualization tools
● DOE LLVM
● Addition of CI testing
● Facility deployment
● Scalable startup with full-featured “Supercontainers”
● Improving the launch of MPI applications
64. 64
Conclusions
● TAU is a versatile tool for measuring the performance of HPC and AI/ML workloads
● TAU is part of the bigger ECP E4S project
● E4S provides a well integrated software stack for AI/ML and HPC for IBM ppc64le
● From-source builds assisted by a binary build cache or containers
● Docker and Singularity images are available for download
● https://e4s.io
65. 65
Download TAU from U. Oregon
http://tau.uoregon.edu
for more information
Free download, open source, BSD license
67. 67
• US Department of Energy (DOE)
– ANL
– Office of Science contracts, ECP
– SciDAC, LBL contracts
– LLNL-LANL-SNL ASC/NNSA contract
– Battelle, PNNL and ORNL contract
• Department of Defense (DoD)
– PETTT, HPCMP
• National Science Foundation (NSF)
– SI2-SSI, Glassbox
• NASA
• CEA, France
• Partners:
–University of Oregon
–The Ohio State University
–ParaTools, Inc.
–University of Tennessee, Knoxville
–T.U. Dresden, GWT
–Jülich Supercomputing Center
Support Acknowledgements
68. 68
Acknowledgment
“This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of
two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security
Administration) responsible for the planning and preparation of a capable exascale ecosystem, including
software, applications, hardware, advanced system engineering, and early testbed platforms, in support
of the nation’s exascale computing imperative.”