In this deck from the 2018 Swiss HPC Conference, DK Panda from Ohio State University presents: Designing Scalable HPC, Deep Learning and Cloud Middleware for Exascale Systems.
"This talk will focus on challenges in designing HPC, Deep Learning, and HPC Cloud middleware for Exascale systems with millions of processors and accelerators. For the HPC domain, we will discuss the challenges in designing runtime environments for MPI+X (PGAS - OpenSHMEM/UPC/CAF/UPC++, OpenMP, and CUDA) programming models by taking into account support for multi-core systems (KNL and OpenPower), high-performance networks, GPGPUs (including GPUDirect RDMA), and energy-awareness. Features, sample performance numbers and best practices of using MVAPICH2 libraries (http://mvapich.cse.ohio-state.edu) will be presented.
For the Deep Learning domain, we will focus on popular Deep Learning frameworks (Caffe, CNTK, and TensorFlow) to extract performance and scalability with the MVAPICH2-GDR MPI library. Finally, we will outline the challenges in moving this middleware to Cloud environments."
Watch the video: https://wp.me/p3RLHQ-iyc
Learn more: http://www.cse.ohio-state.edu/~panda and http://www.hpcadvisorycouncil.com/events/2018/swiss-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from the 2016 HPC Advisory Council Switzerland Conference, DK Panda from Ohio State University presents: High-Performance and Scalable Designs of Programming Models for Exascale Systems.
"This talk will focus on challenges in designing runtime environments for Exascale systems with millions of processors and accelerators to support various programming models. We will focus on MPI, PGAS (OpenSHMEM, CAF, UPC and UPC++) and Hybrid MPI+PGAS programming models by taking into account support for multi-core, high-performance networks, accelerators (GPUs and Intel MIC) and energy-awareness. Features and sample performance numbers from the MVAPICH2 libraries will be presented."
Watch the video presentation: http://wp.me/p3RLHQ-f7c
See more talks in the Swiss Conference Video Gallery: http://insidehpc.com/2016-swiss-hpc-conference/
Dr. Robert Voigt from the Krell Institute presented this deck at the recent HPC Saudi conference.
"This talk will provide a historical perspective on the challenges of educating computational scientists based on my personal involvement over a number of years. Three decidedly different activities will be drawn on to indicate how one can successfully approach the challenge. The first is based on experiences at the Institute for Computer Applications in Science and Engineering at the NASA Langley Research Center where visiting students were exposed to multidisciplinary research driven by computer simulations. The second is the Predictive Science Academic Alliance Program funded by the National Nuclear Security Administration, a component of the US Department of Energy (DOE). The third is the Computational Science Graduate Fellowship program funded by the DOE. The latter two programs provide students with exposure to multidisciplinary research and, perhaps more uniquely, require them to spend a three-month period at one of the DOE national laboratories. My experience with these three efforts suggests that the development of computational scientists requires three key components: classroom exposure to applied mathematics, computer science and a scientific or engineering discipline; exposure to teams conducting multidisciplinary research; and a significant internship at a major research facility."
Watch a conversation with Dr. Robert Voigt: http://wp.me/p3RLHQ-gBl
Scalable and Distributed DNN Training on Modern HPC Systems (inside-BigData.com)
This document discusses scaling deep learning training on HPC systems. It begins by providing background on deep learning and how interest in it has grown significantly. It then discusses how HPC systems can be leveraged for deep learning by supporting distributed training across multiple nodes. Several challenges of designing deep learning frameworks for HPC are outlined, including memory and communication overhead. The document proposes a co-design approach between deep learning frameworks and communication runtimes to better support distributed training and exploit HPC resources. MVAPICH2 software is discussed as an example that provides optimized MPI support for CPU- and GPU-based deep learning on HPC clusters.
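As a rough illustration of what distributed training across multiple nodes involves, the sketch below simulates data-parallel training in plain Python: each worker computes a gradient on its own data shard, the gradients are averaged (the result an MPI sum-allreduce followed by a division would produce), and every worker applies the same update. This is an assumption about the general technique, not code from the deck; the function names `allreduce_mean` and `sgd_step` are hypothetical and no MPI library is actually called.

```python
from typing import List


def allreduce_mean(local_grads: List[List[float]]) -> List[List[float]]:
    """Simulate a sum-allreduce followed by division by the worker count.

    Hypothetical helper: a real run would delegate this to an MPI library.
    """
    n_workers = len(local_grads)
    n_params = len(local_grads[0])
    # Reduction step: element-wise sum of the gradients from all workers.
    summed = [sum(g[i] for g in local_grads) for i in range(n_params)]
    mean = [s / n_workers for s in summed]
    # Broadcast step: every worker ends up with an identical copy.
    return [list(mean) for _ in range(n_workers)]


def sgd_step(weights: List[float], grad: List[float], lr: float) -> List[float]:
    """Apply one plain SGD update to a worker's copy of the weights."""
    return [w - lr * g for w, g in zip(weights, grad)]


# Four simulated workers, each holding a gradient from its own data shard.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
reduced = allreduce_mean(grads)

# All workers start from the same weights and apply the same averaged gradient,
# so their model replicas stay in sync after the step.
weights = [[0.0, 0.0] for _ in grads]
weights = [sgd_step(w, g, lr=0.5) for w, g in zip(weights, reduced)]
```

The averaging step is where the communication overhead mentioned above arises: its cost grows with the number of parameters and workers, which is why the communication runtime matters for scaling.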
In this deck from the 2017 MVAPICH User Group, DK Panda from Ohio State University presents: Overview of the MVAPICH Project and Future Roadmap.
"This talk will provide an overview of the MVAPICH project (past, present and future). Future roadmap and features for upcoming releases of the MVAPICH2 software family (including MVAPICH2-X, MVAPICH2-GDR, MVAPICH2-Virt, MVAPICH2-EA and MVAPICH2-MIC) will be presented. Current status and future plans for OSU INAM, OEMT and OMB will also be presented."
Watch the video: https://www.youtube.com/watch?v=wF7t-oH7wi4
Addressing Emerging Challenges in Designing HPC Runtimes (inside-BigData.com)
The document discusses several challenges in designing HPC runtimes for exascale systems, including energy awareness, accelerators, and virtualization. It focuses on the MVAPICH2 project which addresses these challenges. MVAPICH2 provides integrated support for GPUs and MICs, virtualization using SR-IOV and containers, and energy awareness. It also achieves high performance for GPU-aware MPI using features like GPUDirect RDMA. Application tests with HOOMD-blue and COSMO show improvements from MVAPICH2's GPU support.
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod... (inside-BigData.com)
In this deck from the Stanford HPC Conference, DK Panda from Ohio State University presents: How to Achieve High-Performance, Scalable and Distributed DNN Training on Modern HPC Systems?
"This talk will start with an overview of challenges being faced by the AI community to achieve high-performance, scalable and distributed DNN training on Modern HPC systems with both scale-up and scale-out strategies. After that, the talk will focus on a range of solutions being carried out in my group to address these challenges. The solutions will include: 1) MPI-driven Deep Learning, 2) Co-designing Deep Learning Stacks with High-Performance MPI, 3) Out-of-core DNN training, and 4) Hybrid (Data and Model) parallelism. Case studies to accelerate DNN training with popular frameworks like TensorFlow, PyTorch, MXNet and Caffe on modern HPC systems will be presented."
Watch the video: https://youtu.be/LeUNoKZVuwQ
Learn more: http://web.cse.ohio-state.edu/~panda.2/ and http://www.hpcadvisorycouncil.com/events/2020/stanford-workshop/
How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ... (inside-BigData.com)
In this deck from the 2019 Stanford HPC Conference, DK Panda from Ohio State University presents: How to Design Scalable HPC, Deep Learning and Cloud Middleware for Exascale Systems.
"This talk will focus on challenges in designing HPC, Deep Learning, and HPC Cloud middleware for Exascale systems with millions of processors and accelerators. For the HPC domain, we will discuss the challenges in designing runtime environments for MPI+X (PGAS - OpenSHMEM/UPC/CAF/UPC++, OpenMP, and CUDA) programming models, taking into account support for multi-core systems (Xeon, OpenPower, and ARM), high-performance networks, GPGPUs (including GPUDirect RDMA), and energy-awareness. Features and sample performance numbers from the MVAPICH2 libraries (http://mvapich.cse.ohio-state.edu) will be presented. For the Deep Learning domain, we will focus on popular Deep Learning frameworks (Caffe, CNTK, and TensorFlow) to extract performance and scalability with the MVAPICH2-GDR MPI library and RDMA-Enabled Big Data stacks. Finally, we will outline the challenges in moving middleware to Cloud environments."
Watch the video: https://youtu.be/hR8cnFVF8Zg
Learn more: http://www.cse.ohio-state.edu/~panda and http://hpcadvisorycouncil.com/events/2019/stanford-workshop/
Welcome to the 2016 HPC Advisory Council Switzerland Conference (inside-BigData.com)
This document contains the agenda for the HPC Advisory Council Swiss Conference 2016, which will take place in March. It provides details on keynote speakers, tutorials, and best practices sessions covering topics like deep learning, programming models, containers, and more. Sponsor and exhibitor information is also included.
Challenges and Opportunities for HPC Interconnects and MPI (inside-BigData.com)
In this video from the 2017 MVAPICH User Group, Ron Brightwell from Sandia presents: Challenges and Opportunities for HPC Interconnects and MPI.
"This talk will reflect on prior analysis of the challenges facing high-performance interconnect technologies intended to support extreme-scale scientific computing systems, how some of these challenges have been addressed, and what new challenges lie ahead. Many of these challenges can be attributed to the complexity created by hardware diversity, which has a direct impact on interconnect technology, but new challenges are also arising indirectly as reactions to other aspects of high-performance computing, such as alternative parallel programming models and more complex system usage models. We will describe some near-term research on proposed extensions to MPI to better support massive multithreading and implementation optimizations aimed at reducing the overhead of MPI tag matching. We will also describe a new portable programming model to offload simple packet processing functions to a network interface that is based on the current Portals data movement layer. We believe this capability will offer significant performance improvements to applications and services relevant to high-performance computing as well as data analytics."
Watch the video: https://wp.me/p3RLHQ-hhK
Learn more: http://mug.mvapich.cse.ohio-state.edu/program/
In this deck, Gilad Shainer from the HPC-AI Advisory Council describes how the organization fosters innovation in the high performance computing community.
"The HPC-AI Advisory Council’s mission is to bridge the gap between high-performance computing (HPC) and Artificial Intelligence (AI) use and its potential, bring the beneficial capabilities of HPC and AI to new users for better research, education, innovation and product manufacturing, bring users the expertise needed to operate HPC and AI systems, provide application designers with the tools needed to enable parallel computing, and to strengthen the qualification and integration of HPC and AI system products."
Watch the video: https://wp.me/p3RLHQ-lNz
Learn more: http://hpcadvisorycouncil.com
A Library for Emerging High-Performance Computing Clusters (Intel® Software)
This document discusses the challenges of developing communication libraries for exascale systems using hybrid MPI+X programming models. It describes how current MPI+PGAS approaches use separate runtimes, which can lead to issues like deadlock. The document advocates for a unified runtime that can support multiple programming models simultaneously to avoid such issues and enable better performance. It also outlines MVAPICH2's work on designs like multi-endpoint that integrate MPI and OpenMP to efficiently support emerging highly threaded systems.
In this deck from the GoingARM workshop at SC17, Filippo Mantovani describes the contributions of the Barcelona Supercomputing Center to the European Mont-Blanc project.
"Since 2011, Mont-Blanc has pushed the adoption of Arm technology in High Performance Computing, deploying Arm-based prototypes, enhancing system software ecosystem and projecting performance of current systems for developing new, more powerful and less power hungry HPC computing platforms based on Arm SoC. In this talk, Filippo introduces the last Mont-Blanc system, called Dibona, designed and integrated by the coordinator and industrial partner of the project, Bull/ATOS. He also talks about tests performed at BSC of the Arm software tools (HPC compiler and mathematical libraries) as well as the Dynamic Load Balancing (DLB) technique and the Multiscale Simulator Architecture (MUSA)."
Watch the video: https://wp.me/p3RLHQ-i6o
Learn more: http://www.goingarm.com/
TAU Performance System and the Extreme-scale Scientific Software Stack (E4S) aim to improve productivity for HPC and AI workloads. TAU provides a portable performance evaluation toolkit, while E4S delivers modular and interoperable software stacks. Together, they lower barriers to using software tools from the Exascale Computing Project and enable performance analysis of complex, multi-component applications.
Designing HPC, Deep Learning, and Cloud Middleware for Exascale Systems (inside-BigData.com)
In this deck from the Stanford HPC Conference, DK Panda from Ohio State University presents: Designing HPC, Deep Learning, and Cloud Middleware for Exascale Systems.
"This talk will focus on challenges in designing HPC, Deep Learning, and HPC Cloud middleware for Exascale systems with millions of processors and accelerators. For the HPC domain, we will discuss the challenges in designing runtime environments for MPI+X (PGAS - OpenSHMEM/UPC/CAF/UPC++, OpenMP and CUDA) programming models by taking into account support for multi-core systems (KNL and OpenPower), high-performance networks, GPGPUs (including GPUDirect RDMA) and energy awareness. Features and sample performance numbers from the MVAPICH2 libraries will be presented. For the Deep Learning domain, we will focus on popular Deep Learning frameworks (Caffe, CNTK, and TensorFlow) to extract performance and scalability with the MVAPICH2-GDR MPI library and RDMA-enabled Big Data stacks. Finally, we will outline the challenges in moving this middleware to Cloud environments."
Watch the video: https://youtu.be/i2I6XqOAh_I
Learn more: http://web.cse.ohio-state.edu/~panda.2/ and http://hpcadvisorycouncil.com
The Sierra Supercomputer: Science and Technology on a Mission (inside-BigData.com)
In this deck from the Stanford HPC Conference, Adam Bertsch from LLNL presents: The Sierra Supercomputer: Science and Technology on a Mission.
"LLNL just celebrated its 65th anniversary. Since 1952, the laboratory has been at the forefront of high performance computing. Initially, HPC was used to accelerate the design and testing of the nation's nuclear stockpile. Since the last U.S. nuclear test in 1992, HPC has been used to validate the safety, security, and reliability of the stockpile without nuclear testing.
Our next flagship HPC system at LLNL will be called Sierra. A collaboration between multiple government and industry partners, Sierra and its sister system Summit at ORNL will pave the way towards Exascale computing architectures and predictive capability."
Watch the video: https://wp.me/p3RLHQ-i4K
Learn more: https://computation.llnl.gov/computers/sierra and http://hpcadvisorycouncil.com
In this video from the HPC User Forum in Santa Fe, Yoonho Park from IBM presents: IBM Datacentric Servers & OpenPOWER.
"Big data analytics, machine learning and deep learning are among the most rapidly growing workloads in the data center. These workloads have the compute performance requirements of traditional technical computing or high performance computing, coupled with a much larger volume and velocity of data."
Watch the video: http://wp.me/p3RLHQ-gJv
Learn more: https://openpowerfoundation.org/
EuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big Computing (Jonathan Dursi)
HTML slides and longer abstract can be found at https://github.com/ljdursi/EuroMPI2016.
For years, the academic science and engineering community was almost alone in pursuing very large-scale numerical computing, and MPI was the lingua franca for such work. But starting in the mid-2000s, we were no longer alone. First internet-scale companies like Google and Yahoo! started performing fairly basic analytics tasks at enormous scale, and since then others have begun tackling increasingly complex and data-heavy machine-learning computations, which involve very familiar scientific computing primitives such as linear algebra, unstructured mesh decomposition, and numerical optimization. These new communities have created programming environments which emphasize what we’ve learned about computer science and programmability since 1994 – with greater levels of abstraction and encapsulation, separating high-level computation from the low-level implementation details.
At about the same time, new academic research communities began using computing at scale to attack their problems - but in many cases, an ideal distributed-memory application for them begins to look more like the new concurrent distributed databases than a large CFD simulation, with data structures like dynamic hash tables and Bloom trees playing more important roles than rectangular arrays or unstructured meshes. These new academic communities are among the first to adopt emerging big-data technologies over traditional HPC options; but as big-data technologies improve their tightly-coupled number-crunching capabilities, they are unlikely to be the last.
In this talk, I sketch out the landscape of distributed technical computing frameworks and environments, and look to see where MPI and the MPI community fit into this new ecosystem.
Real time machine learning proposers day v3 (mustafa sarac)
This document discusses DARPA's Real Time Machine Learning (RTML) program. The objective is to develop hardware generators and compilers that can automatically create application-specific integrated circuits for machine learning from high-level code. This would allow no-human-in-the-loop creation of efficient neural network hardware. The program has two phases: phase 1 develops an ML hardware compiler, and phase 2 demonstrates RTML systems for applications like wireless communication and image processing. Key goals are high performance, low power consumption, and support for a variety of neural network architectures and machine learning techniques.
Increasing Cluster Performance by Combining rCUDA with Slurm (inside-BigData.com)
Federico Silla from the Technical University of Valencia presented this deck at the Switzerland HPC Conference.
"Graphics Processing Units (GPUs) are currently used in data centers to reduce the execution time of compute-intensive applications. However, the use of GPUs presents several side effects, such as increased acquisition costs as well as larger space requirements. Furthermore, GPUs require a non-negligible amount of energy even while idle. Additionally, GPU utilization is usually low for most applications. The use of virtual GPUs may address these concerns. In this regard, the remote GPU virtualization mechanism could be leveraged to share the GPUs present in the computing facility among the nodes of the cluster. This would increase overall GPU utilization, thus reducing the negative impact of the increased costs mentioned before. Reducing the number of GPUs installed in the cluster could also be possible. In this talk we present the remote GPU virtualization mechanism, using as a case study the performance attained by a cluster using the rCUDA middleware and a modified version of the Slurm scheduler, which is able to map remote virtual GPUs to jobs. By leveraging rCUDA+Slurm, cluster throughput, measured as jobs completed per time unit, is doubled while total energy consumption is reduced by up to 40%. GPU utilization is also increased."
Watch the video presentation: https://www.youtube.com/watch?v=yQWiQiyFpAg
See more talks in the Swiss Conference Video Gallery: http://insidehpc.com/2016-swiss-hpc-conference/
Performance Characterization and Optimization of In-Memory Data Analytics on ... (Ahsan Javed Awan)
The document discusses performance optimization of Apache Spark on scale-up servers through near-data processing. It finds that Spark workloads have poor multi-core scalability and high I/O wait times on scale-up servers. It proposes exploiting near-data processing through in-storage processing and 2D-integrated processing-in-memory to reduce data movements and latency. The author evaluates these techniques through modeling and a programmable FPGA accelerator to improve the performance of Spark MLlib workloads by up to 9x. Challenges in hybrid CPU-FPGA design and attaining peak performance are also discussed.
In this deck from the 2019 Stanford HPC Conference, Rob Neely from Lawrence Livermore National Laboratory presents: Sierra - Science Unleashed.
"This talk will give an overview of Sierra and some of the early science results it has enabled. Sierra is an IBM system harnessing the power of over 17,000 NVIDIA Volta GPUs recently deployed at Lawrence Livermore National Laboratory and is currently ranked as the #2 system on the Top500. Before being turned over for use in the classified mission, Sierra spent months in an “open science campaign” where we got an early glimpse at some of the truly game-changing science this system will unleash – selected results of which will be presented."
Rob Neely is a Computer Scientist and Technical Manager at Lawrence Livermore National Laboratory where he is the Weapon Simulation & Computing Program Coordinator for Computing Environments, and the Associate Division Lead for the Center for Applied Scientific Computing (CASC). He also is the DOE Exascale Computing Project lead for Software Technologies Ecosystem and Delivery. He has been involved in High Performance Computing for his entire 25+ year career.
Learn more: https://computation.llnl.gov/computers/sierra and http://hpcadvisorycouncil.com/events/2019/stanford-workshop/
SystemML is an Apache project that provides a declarative machine learning language for data scientists. It aims to simplify the development of custom machine learning algorithms and enable scalable execution on everything from single nodes to clusters. SystemML provides pre-implemented machine learning algorithms, APIs for various languages, and a cost-based optimizer to compile execution plans tailored to workload and hardware characteristics in order to maximize performance.
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors (Intel® Software)
The second-generation Intel® Xeon Phi™ processor offers new and enhanced features that provide significant performance gains in modernized code. For this lab, we pair these features with Intel® Software Development Products and methodologies to enable developers to gain insights on application behavior and to find opportunities to optimize parallelism, memory, and vectorization features.
This document discusses current trends in high performance computing. It begins with an introduction to high performance computing and its applications in science, engineering, business analysis, and more. It then discusses why high performance computing is needed due to changes in scientific discovery, the need to solve larger problems, and modern business needs. The document also discusses the top 500 supercomputers in the world and provides examples of some of the most powerful systems. It then covers performance development trends and challenges in increasing processor speeds. The rest of the document discusses parallel computing approaches using multi-core and many-core architectures, as well as cluster, grid, and cloud computing models for high performance.
IBM provides infrastructure to accelerate medical research tasks like genomics, molecular simulation, diagnostics, and quality inspection. This infrastructure delivers faster insights through high-performance data and AI deployed at massive scale on IBM Power Systems and Storage. Case studies show the infrastructure reduces time to results for tasks like processing millions of cryogenic electron microscope images from days to hours.
Xilinx provides adaptable acceleration platforms for data centers. Their Alveo product lineup includes the U280, U250, U200, and low-profile U50 accelerator cards. The cards feature FPGAs with up to 1.3 million logic cells and high-speed memory. Xilinx also offers the U25 SmartNIC which combines an FPGA, ARM CPU, and dual 25GbE ports. These platforms accelerate workloads such as AI, databases, storage, and networking using reconfigurable and adaptable hardware. Xilinx supports deployment from their devices to cloud platforms using a unified software stack.
The document discusses strategies for improving application performance on POWER9 processors using IBM XL and open source compilers. It reviews key POWER9 features and outlines common bottlenecks like branches, register spills, and memory issues. It provides guidelines on using compiler options and coding practices to address these bottlenecks, such as unrolling loops, inlining functions, and prefetching data. Tools like perf are also described for analyzing performance bottlenecks.
High-Performance and Scalable Designs of Programming Models for Exascale Systems - inside-BigData.com
- The document discusses programming models and challenges for exascale systems. It focuses on MPI and PGAS models like OpenSHMEM.
- Key challenges include supporting hybrid MPI+PGAS programming, efficient communication for multi-core and accelerator nodes, fault tolerance, and extremely low memory usage.
- The MVAPICH2 project aims to address these challenges through its high performance MPI and PGAS implementation and optimization of communication for technologies like InfiniBand.
In this deck from the HPC Advisory Council Spain Conference, DK Panda from Ohio State University presents: Communication Frameworks for HPC and Big Data.
Watch the video presentation: http://insidehpc.com/2015/09/video-communication-frameworks-for-hpc-and-big-data/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Learn more: http://www.hpcadvisorycouncil.com/events/2015/spain-workshop/agenda.php
Welcome to the 2016 HPC Advisory Council Switzerland Conference - inside-BigData.com
This document contains the agenda for the HPC Advisory Council Swiss Conference 2016, which will take place in March. It provides details on keynote speakers, tutorials, and best practices sessions covering topics like deep learning, programming models, containers, and more. Sponsor and exhibitor information is also included.
Challenges and Opportunities for HPC Interconnects and MPI - inside-BigData.com
In this video from the 2017 MVAPICH User Group, Ron Brightwell from Sandia presents: Challenges and Opportunities for HPC Interconnects and MPI.
"This talk will reflect on prior analysis of the challenges facing high-performance interconnect technologies intended to support extreme-scale scientific computing systems, how some of these challenges have been addressed, and what new challenges lie ahead. Many of these challenges can be attributed to the complexity created by hardware diversity, which has a direct impact on interconnect technology, but new challenges are also arising indirectly as reactions to other aspects of high-performance computing, such as alternative parallel programming models and more complex system usage models. We will describe some near-term research on proposed extensions to MPI to better support massive multithreading and implementation optimizations aimed at reducing the overhead of MPI tag matching. We will also describe a new portable programming model to offload simple packet processing functions to a network interface that is based on the current Portals data movement layer. We believe this capability will offer significant performance improvements to applications and services relevant to high-performance computing as well as data analytics."
Watch the video: https://wp.me/p3RLHQ-hhK
Learn more: http://mug.mvapich.cse.ohio-state.edu/program/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck, Gilad Shainer from the HPC AI Advisory Council describes how this organization fosters innovation in the high performance computing community.
"The HPC-AI Advisory Council’s mission is to bridge the gap between high-performance computing (HPC) and Artificial Intelligence (AI) use and its potential, bring the beneficial capabilities of HPC and AI to new users for better research, education, innovation and product manufacturing, bring users the expertise needed to operate HPC and AI systems, provide application designers with the tools needed to enable parallel computing, and to strengthen the qualification and integration of HPC and AI system products."
Watch the video: https://wp.me/p3RLHQ-lNz
Learn more: http://hpcadvisorycouncil.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
A Library for Emerging High-Performance Computing Clusters - Intel® Software
This document discusses the challenges of developing communication libraries for exascale systems using hybrid MPI+X programming models. It describes how current MPI+PGAS approaches use separate runtimes, which can lead to issues like deadlock. The document advocates for a unified runtime that can support multiple programming models simultaneously to avoid such issues and enable better performance. It also outlines MVAPICH2's work on designs like multi-endpoint that integrate MPI and OpenMP to efficiently support emerging highly threaded systems.
In this deck from the GoingARM workshop at SC17, Filippo Mantovani describes the contributions of the Barcelona Supercomputing Center to the European Mont-Blanc project.
"Since 2011, Mont-Blanc has pushed the adoption of Arm technology in High Performance Computing, deploying Arm-based prototypes, enhancing system software ecosystem and projecting performance of current systems for developing new, more powerful and less power hungry HPC computing platforms based on Arm SoC. In this talk, Filippo introduces the last Mont-Blanc system, called Dibona, designed and integrated by the coordinator and industrial partner of the project, Bull/ATOS. He also talks about tests performed at BSC of the Arm software tools (HPC compiler and mathematical libraries) as well as the Dynamic Load Balancing (DLB) technique and the Multiscale Simulator Architecture (MUSA)."
Watch the video: https://wp.me/p3RLHQ-i6o
Learn more: http://www.goingarm.com/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
TAU Performance System and the Extreme-scale Scientific Software Stack (E4S) aim to improve productivity for HPC and AI workloads. TAU provides a portable performance evaluation toolkit, while E4S delivers modular and interoperable software stacks. Together, they lower barriers to using software tools from the Exascale Computing Project and enable performance analysis of complex, multi-component applications.
Designing HPC, Deep Learning, and Cloud Middleware for Exascale Systems - inside-BigData.com
In this deck from the Stanford HPC Conference, DK Panda from Ohio State University presents: Designing HPC, Deep Learning, and Cloud Middleware for Exascale Systems.
"This talk will focus on challenges in designing HPC, Deep Learning, and HPC Cloud middleware for Exascale systems with millions of processors and accelerators. For the HPC domain, we will discuss the challenges in designing runtime environments for MPI+X (PGAS-OpenSHMEM/UPC/CAF/UPC++, OpenMP and CUDA) programming models by taking into account support for multi-core systems (KNL and OpenPower), high-performance networks, GPGPUs (including GPUDirect RDMA) and energy awareness. Features and sample performance numbers from MVAPICH2 libraries will be presented. For the Deep Learning domain, we will focus on popular Deep Learning frameworks (Caffe, CNTK, and TensorFlow) to extract performance and scalability with MVAPICH2-GDR MPI library and RDMA-enabled Big Data stacks. Finally, we will outline the challenges in moving these middleware to the Cloud environments."
Watch the video: https://youtu.be/i2I6XqOAh_I
Learn more: http://web.cse.ohio-state.edu/~panda.2/
and
http://hpcadvisorycouncil.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
The Sierra Supercomputer: Science and Technology on a Mission - inside-BigData.com
In this deck from the Stanford HPC Conference, Adam Bertsch from LLNL presents: The Sierra Supercomputer: Science and Technology on a Mission.
"LLNL just celebrated its 65th anniversary. Since 1952, the laboratory has been at the forefront of high performance computing. Initially, HPC was used to accelerate the design and testing of the nation's nuclear stockpile. Since the last U.S. nuclear test in 1992, HPC has been used to validate the safety, security, and reliability of stockpile without nuclear testing.
Our next flagship HPC system at LLNL will be called Sierra. A collaboration between multiple government and industry partners, Sierra and its sister system Summit at ORNL, will pave the way towards Exascale computing architectures and predictive capability."
Watch the video: https://wp.me/p3RLHQ-i4K
Learn more: https://computation.llnl.gov/computers/sierra
and
http://hpcadvisorycouncil.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this video from the HPC User Forum in Santa Fe, Yoonho Park from IBM presents: IBM Datacentric Servers & OpenPOWER.
"Big data analytics, machine learning and deep learning are among the most rapidly growing workloads in the data center. These workloads have the compute performance requirements of traditional technical computing or high performance computing, coupled with a much larger volume and velocity of data."
Watch the video: http://wp.me/p3RLHQ-gJv
Learn more: https://openpowerfoundation.org/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
EuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big Computing - Jonathan Dursi
HTML slides and longer abstract can be found at https://github.com/ljdursi/EuroMPI2016.
For years, the academic science and engineering community was almost alone in pursuing very large-scale numerical computing, and MPI was the lingua franca for such work. But starting in the mid-2000s, we were no longer alone. First internet-scale companies like Google and Yahoo! started performing fairly basic analytics tasks at enormous scale, and since then others have begun tackling increasingly complex and data-heavy machine-learning computations, which involve very familiar scientific computing primitives such as linear algebra, unstructured mesh decomposition, and numerical optimization. These new communities have created programming environments which emphasize what we’ve learned about computer science and programmability since 1994 – with greater levels of abstraction and encapsulation, separating high-level computation from the low-level implementation details.
At about the same time, new academic research communities began using computing at scale to attack their problems - but in many cases, an ideal distributed-memory application for them begins to look more like the new concurrent distributed databases than a large CFD simulation, with data structures like dynamic hash tables and Bloom trees playing more important roles than rectangular arrays or unstructured meshes. These new academic communities are among the first to adopt emerging big-data technologies over traditional HPC options; but as big-data technologies improve their tightly-coupled number-crunching capabilities, they are unlikely to be the last.
In this talk, I sketch out the landscape of distributed technical computing frameworks and environments, and look to see where MPI and the MPI community fits in to this new ecosystem.
Real time machine learning proposers day v3 - mustafa sarac
This document discusses DARPA's Real Time Machine Learning (RTML) program. The objective is to develop hardware generators and compilers that can automatically create application-specific integrated circuits for machine learning from high-level code. This would allow no-human-in-the-loop creation of efficient neural network hardware. The program has two phases: phase 1 develops an ML hardware compiler, and phase 2 demonstrates RTML systems for applications like wireless communication and image processing. Key goals are high performance, low power consumption, and support for a variety of neural network architectures and machine learning techniques.
Increasing Cluster Performance by Combining rCUDA with Slurm - inside-BigData.com
Federico Silla from the Technical University of Valencia presented this deck at the Switzerland HPC Conference.
"Graphics Processing Units (GPUs) are currently used in data centers to reduce the execution time of compute-intensive applications. However, the use of GPUs presents several side effects, such as increased acquisition costs as well as larger space requirements. Furthermore, GPUs require a non-negligible amount of energy even while idle. Additionally, GPU utilization is usually low for most applications. The use of virtual GPUs may address these concerns. In this regard, the remote GPU virtualization mechanism could be leveraged to share the GPUs present in the computing facility among the nodes of the cluster. This would increase overall GPU utilization, thus reducing the negative impact of the increased costs mentioned before. Reducing the number of GPUs installed in the cluster could also be possible. In this talk we present the remote GPU virtualization mechanism using as case study the performance attained by a cluster using the rCUDA middleware and a modified version of the Slurm scheduler, which is able to map remote virtual GPUs to jobs. By leveraging rCUDA+Slurm, cluster throughput, measured as jobs completed per time unit, is doubled at the same time that total energy consumption is reduced up to 40%. GPU utilization is also increased."
Watch the video presentation: https://www.youtube.com/watch?v=yQWiQiyFpAg
See more talks in the Swiss Conference Video Gallery: http://insidehpc.com/2016-swiss-hpc-conference/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Performance Characterization and Optimization of In-Memory Data Analytics on ... - Ahsan Javed Awan
The document discusses performance optimization of Apache Spark on scale-up servers through near-data processing. It finds that Spark workloads have poor multi-core scalability and high I/O wait times on scale-up servers. It proposes exploiting near-data processing through in-storage processing and 2D-integrated processing-in-memory to reduce data movements and latency. The author evaluates these techniques through modeling and a programmable FPGA accelerator to improve the performance of Spark MLlib workloads by up to 9x. Challenges in hybrid CPU-FPGA design and attaining peak performance are also discussed.
In this deck from the 2019 Stanford HPC Conference, Rob Neely, from Lawrence Livermore National Laboratory presents: Sierra - Science Unleashed.
"This talk will give an overview of Sierra and some of the early science results it has enabled. Sierra is an IBM system harnessing the power of over 17,000 NVIDIA Volta GPUs recently deployed at Lawrence Livermore National Laboratory and is currently ranked as the #2 system on the Top500. Before being turned over for use in the classified mission, Sierra spent months in an “open science campaign” where we got an early glimpse at some of the truly game-changing science this system will unleash – selected results of which will be presented."
Rob Neely is a Computer Scientist and Technical Manager at Lawrence Livermore National Laboratory where he is the Weapon Simulation & Computing Program Coordinator for Computing Environments, and the Associate Division Lead for the Center for Applied Scientific Computing (CASC). He also is the DOE Exascale Computing Project lead for Software Technologies Ecosystem and Delivery. He has been involved in High Performance Computing for his entire 25+ year career.
Learn more: https://computation.llnl.gov/computers/sierra
and
http://hpcadvisorycouncil.com/events/2019/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
SystemML is an Apache project that provides a declarative machine learning language for data scientists. It aims to simplify the development of custom machine learning algorithms and enable scalable execution on everything from single nodes to clusters. SystemML provides pre-implemented machine learning algorithms, APIs for various languages, and a cost-based optimizer to compile execution plans tailored to workload and hardware characteristics in order to maximize performance.
Designing HPC & Deep Learning Middleware for Exascale Systems - inside-BigData.com
DK Panda from Ohio State University presented this deck at the 2017 HPC Advisory Council Stanford Conference.
"This talk will focus on challenges in designing runtime environments for exascale systems with millions of processors and accelerators to support various programming models. We will focus on MPI, PGAS (OpenSHMEM, CAF, UPC and UPC++) and Hybrid MPI+PGAS programming models by taking into account support for multi-core, high-performance networks, accelerators (GPGPUs and Intel MIC), virtualization technologies (KVM, Docker, and Singularity), and energy-awareness. Features and sample performance numbers from the MVAPICH2 libraries will be presented."
Watch the video: http://wp.me/p3RLHQ-glW
Learn more: http://hpcadvisorycouncil.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Overview of challenges being faced by the AI community to achieve high-performance, scalable and distributed DNN training on Modern HPC systems with both scale-up and scale-out strategies. After that, the talk will focus on a range of solutions being carried out in my group to address these challenges. The solutions will include: 1) MPI-driven Deep Learning, 2) Co-designing Deep Learning Stacks with High-Performance MPI, 3) Out-of-core DNN training, and 4) Hybrid (Data and Model) parallelism. Case studies to accelerate DNN training with popular frameworks like TensorFlow, PyTorch, MXNet and Caffe on modern HPC systems will be presented.
Designing Software Libraries and Middleware for Exascale Systems: Opportuniti... - inside-BigData.com
This talk will focus on challenges in designing software libraries and middleware for upcoming exascale systems with millions of processors and accelerators. Two kinds of application domains, Scientific Computing and Big Data, will be considered. For the scientific computing domain, we will discuss challenges in designing runtime environments for MPI and PGAS (UPC and OpenSHMEM) programming models by taking into account support for multi-core, high-performance networks, GPGPUs and Intel MIC. Features and sample performance numbers from MVAPICH2 and MVAPICH2-X (supporting Hybrid MPI and PGAS (UPC and OpenSHMEM)) libraries will be presented. For the Big Data domain, we will focus on high-performance and scalable designs of Hadoop (including HBase) and Memcached using native RDMA support of InfiniBand and RoCE.
Building Efficient HPC Clouds with MVAPICH2 and RDMA-Hadoop over SR-IOV Infin... - inside-BigData.com
Xiaoyi Lu from Ohio State University presented this deck at the OpenFabrics Workshop.
"Single Root I/O Virtualization (SR-IOV) technology has been steadily gaining momentum for high performance interconnects such as InfiniBand. SR-IOV can deliver near native performance but lacks locality-aware communication support. This talk presents an efficient approach to building HPC clouds based on MVAPICH2 and RDMA-Hadoop with SR-IOV. We discuss high-performance designs of the virtual machine and container aware MVAPICH2 library over SR-IOV enabled HPC Clouds. This talk will also present a high-performance virtual machine migration framework for MPI applications on SR-IOV enabled InfiniBand clouds. The MVAPICH2 software for building HPC Clouds presented in this talk is publicly available. We will also discuss how to leverage the high-performance networking features (e.g., RDMA, SR-IOV) on cloud environments to accelerate data processing through the RDMA-Hadoop package, which is publicly available. Comprehensive performance evaluations on NSF-supported Chameleon Cloud show that our design can deliver near bare-metal performance."
Watch the video: http://wp.me/p3RLHQ-gB3
Learn more: http://mvapich.cse.ohio-state.edu/
and
https://www.openfabrics.org/index.php/abstracts-agenda.html
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans - inside-BigData.com
In this video from the 2014 HPC Advisory Council Europe Conference, DK Panda from Ohio State University presents: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans.
"This talk will focus on latest developments and future plans for the MVAPICH2 and MVAPICH2-X projects. For the MVAPICH2 project, we will focus on scalable and highly-optimized designs for pt-to-pt communication (two-sided and one-sided MPI-3 RMA), collective communication (blocking and MPI-3 non-blocking), support for GPGPUs and Intel MIC, support for MPI-T interface and schemes for fault-tolerance/fault-resilience. For the MVAPICH2-X project, we will focus on efficient support for hybrid MPI and PGAS (UPC and OpenSHMEM) programming model with unified runtime."
Watch the video presentation: http://wp.me/p3RLHQ-coF
Accelerate Big Data Processing with High-Performance Computing Technologies - Intel® Software
Learn about opportunities and challenges for accelerating big data middleware on modern high-performance computing (HPC) clusters by exploiting HPC technologies.
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr... - inside-BigData.com
In this deck from the Stanford HPC Conference, DK Panda from Ohio State University presents: Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Processing.
"This talk will provide an overview of challenges in accelerating Hadoop, Spark and Memcached on modern HPC clusters. An overview of RDMA-based designs for Hadoop (HDFS, MapReduce, RPC and HBase), Spark, Memcached, Swift, and Kafka using native RDMA support for InfiniBand and RoCE will be presented. Enhanced designs for these components to exploit NVM-based in-memory technology and parallel file systems (such as Lustre) will also be presented. Benefits of these designs on various cluster configurations using the publicly available RDMA-enabled packages from the OSU HiBD project (http://hibd.cse.ohio-state.edu) will be shown."
Watch the video: https://youtu.be/iLTYkTandEA
Learn more: http://web.cse.ohio-state.edu/~panda.2/
and
http://hpcadvisorycouncil.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
UCX: An Open Source Framework for HPC Network APIs and Beyond - Ed Dodds
UCX is an open source framework for high performance computing (HPC) network APIs and beyond. It is a collaborative effort between industry, national laboratories, and academia to develop the next generation HPC communication framework. UCX aims to provide a unified communication API that supports multiple network architectures and HPC programming models through a performance-oriented and community-driven approach.
The document discusses accelerating Apache Hadoop through high-performance networking and I/O technologies. It describes how technologies like InfiniBand, RoCE, SSDs, and NVMe can benefit big data applications by alleviating bottlenecks. It outlines projects from the High-Performance Big Data project that implement RDMA for Hadoop, Spark, HBase and Memcached to improve performance. Evaluation results demonstrate significant acceleration of HDFS, MapReduce, and other workloads through the high-performance designs.
UCX: An Open Source Framework for HPC Network APIs and Beyond - inside-BigData.com
"Unified Communication X (UCX) is a set of network APIs and their implementations for high performance computing. UCX comes from the combined efforts of national laboratories, industry, and academia to co-design and implement high-performing and highly scalable communication APIs for next generation applications and systems. UCX solves the problem of moving data from memory location "A" to memory location "B" across multiple types of memory (DRAM, accelerator memories, etc.) and multiple transports (e.g. InfiniBand, uGNI, Shared Memory, CUDA, etc.), while minimizing latency and maximizing bandwidth and message rate."
Designing High performance & Scalable Middleware for HPC - Object Automation
The document discusses the design of high-performance middleware for HPC, AI, and data science applications. It describes challenges including supporting various programming models, applications, network technologies, and architectures. The MVAPICH project is presented as an open-source MPI library that supports these domains and has been downloaded over 1.5 million times. It provides optimized communication through features like GPU-direct support and improved nested datatype transfers.
Designing High-Performance and Scalable Middleware for HPC, AI and Data Science - Object Automation
This document discusses designing high-performance middleware for HPC, AI, and data science applications. It provides an overview of the MVAPICH2 project, which develops an open-source MPI library supporting modern HPC architectures and networking technologies. MVAPICH2 aims to provide a converged software stack for HPC, deep learning, and data science through libraries like MVAPICH2, HiDL, and HiBD. The document outlines challenges in communication library design for exascale systems and MVAPICH2's architecture supporting programming models across domains.
Accelerating TensorFlow with RDMA for high-performance deep learning - DataWorks Summit
Google’s TensorFlow is one of the most popular deep learning (DL) frameworks. In distributed TensorFlow, gradient updates are a critical step governing the total model training time. These updates incur a massive volume of data transfer over the network.
In this talk, we first present a thorough analysis of the communication patterns in distributed TensorFlow. Then we propose a unified way of achieving high performance through enhancing the gRPC runtime with Remote Direct Memory Access (RDMA) technology on InfiniBand and RoCE. Through our proposed RDMA-gRPC design, TensorFlow only needs to run over the gRPC channel and gets the optimal performance. Our design includes advanced features such as message pipelining, message coalescing, zero-copy transmission, etc. The performance evaluations show that our proposed design can significantly speed up gRPC throughput by up to 1.5x compared to the default gRPC design. By integrating our RDMA-gRPC with TensorFlow, we are able to achieve up to 35% performance improvement for TensorFlow training with CNN models.
Speakers
Dhabaleswar K (DK) Panda, Professor and University Distinguished Scholar, The Ohio State University
Xiaoyi Lu, Research Scientist, The Ohio State University
Designing RISC-V-based Accelerators for next generation Computers (DRAC) is a 3-year project (2019-2022) funded by the ERDF Operational Program of Catalonia 2014-2020. DRAC will design, verify, implement and fabricate a high performance general purpose processor that will incorporate different accelerators based on the RISC-V technology, with specific applications in the field of post-quantum security, genomics and autonomous navigation. In this talk, we will provide an overview of the main achievements in the DRAC project, including the fabrication of Lagarto, the first RISC-V processor developed in Spain.
High Performance Processing of Streaming Data - Geoffrey Fox
Describes two parallel robot planning algorithms implemented with Apache Storm on OpenStack -- SLAM (Simultaneous Localization & Mapping) and collision avoidance. Performance (response time) studied and improved as example of HPC-ABDS (High Performance Computing enhanced Apache Big Data Software Stack) concept.
OpenPOWER Acceleration of HPCC Systems - HPCC Systems
JT Kellington, IBM and Allan Cantle, Nallatech present at the 2015 HPCC Systems Engineering Summit Community Day about porting HPCC Systems to the POWER8-based ppc64el architecture.
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes... - Geoffrey Fox
“Next Generation Grid – HPC Cloud” proposes a toolkit capturing current capabilities of Apache Hadoop, Spark, Flink and Heron as well as MPI and Asynchronous Many Task systems from HPC. This supports a Cloud-HPC-Edge (Fog, Device) Function as a Service Architecture. Note this "new grid" is focussed on data and IoT; not computing. Use interoperable common abstractions but multiple polymorphic implementations.
The document discusses the top 5 technologies that all organizations must understand: digital transformation, quantum computing, IoT, 5G, and AI/HPC. It provides an overview of each technology including opportunities and threats to organizations. The document emphasizes that understanding these emerging technologies is mandatory as the information revolution changes many aspects of life and business.
Preparing to program Aurora at Exascale - Early experiences and future direct... - inside-BigData.com
In this deck from IWOCL / SYCLcon 2020, Hal Finkel from Argonne National Laboratory presents: Preparing to program Aurora at Exascale - Early experiences and future directions.
"Argonne National Laboratory’s Leadership Computing Facility will be home to Aurora, our first exascale supercomputer. Aurora promises to take scientific computing to a whole new level, and scientists and engineers from many different fields will take advantage of Aurora’s unprecedented computational capabilities to push the boundaries of human knowledge. In addition, Aurora’s support for advanced machine-learning and big-data computations will enable scientific workflows incorporating these techniques along with traditional HPC algorithms. Programming the state-of-the-art hardware in Aurora will be accomplished using state-of-the-art programming models. Some of these models, such as OpenMP, are long-established in the HPC ecosystem. Other models, such as Intel’s oneAPI, based on SYCL, are relatively new models constructed with the benefit of significant experience. Many applications will not use these models directly, but rather, will use C++ abstraction libraries such as Kokkos or RAJA. Python will also be a common entry point to high-performance capabilities. As we look toward the future, features in the C++ standard itself will become increasingly relevant for accessing the extreme parallelism of exascale platforms.
This presentation will summarize the experiences of our team as we prepare for Aurora, exploring how to port applications to Aurora’s architecture and programming models, and distilling the challenges and best practices we’ve developed to date. oneAPI/SYCL and OpenMP are both critical models in these efforts, and while the ecosystem for Aurora has yet to mature, we’ve already had a great deal of success. Importantly, we are not passive recipients of programming models developed by others. Our team works not only with vendor-provided compilers and tools, but also develops improved open-source LLVM-based technologies that feed both open-source and vendor-provided capabilities. In addition, we actively participate in the standardization of OpenMP, SYCL, and C++. To conclude, I’ll share our thoughts on how these models can best develop in the future to support exascale-class systems."
Watch the video: https://wp.me/p3RLHQ-lPT
Learn more: https://www.iwocl.org/iwocl-2020/conference-program/
and
https://www.anl.gov/topic/aurora
In this deck, Greg Wahl from Advantech presents: Transforming Private 5G Networks.
Advantech Networks & Communications Group is driving innovation in next-generation network solutions with their High Performance Servers. We provide business critical hardware to the world's leading telecom and networking equipment manufacturers with both standard and customized products. Our High Performance Servers are highly configurable platforms designed to balance the best in x86 server-class processing performance with maximum I/O and offload density. The systems are cost effective, highly available and optimized to meet next generation networking and media processing needs.
“Advantech’s Networks and Communication Group has been both an innovator and trusted enabling partner in the telecommunications and network security markets for over a decade, designing and manufacturing products for OEMs that accelerate their network platform evolution and time to market,” said Advantech Vice President of Networks & Communications Group, Ween Niu. “In the new IP Infrastructure era, we will be expanding our expertise in Software Defined Networking (SDN) and Network Function Virtualization (NFV), two of the essential conduits to 5G infrastructure agility making networks easier to install, secure, automate and manage in a cloud-based infrastructure.”
In addition to innovation in air interface technologies and architecture extensions, 5G will also need a new generation of network computing platforms to run the emerging software defined infrastructure, one that provides greater topology flexibility, essential to deliver on the promises of high availability, high coverage, low latency and high bandwidth connections. This will open up new parallel industry opportunities through dedicated 5G network slices reserved for specific industries dedicated to video traffic, augmented reality, IoT, connected cars etc. 5G unlocks many new doors and one of the keys to its enablement lies in the elasticity and flexibility of the underlying infrastructure.
Advantech’s corporate vision is to enable an intelligent planet. The company is a global leader in the fields of IoT intelligent systems and embedded platforms. To embrace the trends of IoT, big data, and artificial intelligence, Advantech promotes IoT hardware and software solutions with the Edge Intelligence WISE-PaaS core to assist business partners and clients in connecting their industrial chains. Advantech is also working with business partners to co-create business ecosystems that accelerate the goal of industrial intelligence.
Watch the video: https://wp.me/p3RLHQ-lPQ
* Company website: https://www.advantech.com/
* Solution page: https://www2.advantech.com/nc/newsletter/NCG/SKY/benefits.html
The Incorporation of Machine Learning into Scientific Simulations at Lawrence... - inside-BigData.com
In this deck from the Stanford HPC Conference, Katie Lewis from Lawrence Livermore National Laboratory presents: The Incorporation of Machine Learning into Scientific Simulations at Lawrence Livermore National Laboratory.
"Scientific simulations have driven computing at Lawrence Livermore National Laboratory (LLNL) for decades. During that time, we have seen significant changes in hardware, tools, and algorithms. Today, data science, including machine learning, is one of the fastest growing areas of computing, and LLNL is investing in hardware, applications, and algorithms in this space. While the use of simulations to focus and understand experiments is well accepted in our community, machine learning brings new challenges that need to be addressed. I will explore applications for machine learning in scientific simulations that are showing promising results and further investigation that is needed to better understand its usefulness."
Watch the video: https://youtu.be/NVwmvCWpZ6Y
Learn more: https://computing.llnl.gov/research-area/machine-learning
and
http://www.hpcadvisorycouncil.com/events/2020/stanford-workshop/
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ... - inside-BigData.com
In this deck from the Stanford HPC Conference, Nick Nystrom and Paola Buitrago provide an update from the Pittsburgh Supercomputing Center.
Nick Nystrom is Chief Scientist at the Pittsburgh Supercomputing Center (PSC). Nick is architect and PI for Bridges, PSC's flagship system that successfully pioneered the convergence of HPC, AI, and Big Data. He is also PI for the NIH Human Biomolecular Atlas Program’s HIVE Infrastructure Component and co-PI for projects that bring emerging AI technologies to research (Open Compass), apply machine learning to biomedical data for breast and lung cancer (Big Data for Better Health), and identify causal relationships in biomedical big data (the Center for Causal Discovery, an NIH Big Data to Knowledge Center of Excellence). His current research interests include hardware and software architecture, applications of machine learning to multimodal data (particularly for the life sciences) and to enhance simulation, and graph analytics.
Watch the video: https://youtu.be/LWEU1L1o7yY
Learn more: https://www.psc.edu/
and
http://www.hpcadvisorycouncil.com/events/2020/stanford-workshop/
The document discusses using systems intelligence and artificial intelligence/neural networks to enhance semiconductor electronic design automation (EDA) workflows. Telemetry data collected from EDA jobs and infrastructure is analyzed using complex event processing, machine learning models, and messaging substrates to provide insights that could optimize EDA pipelines and infrastructure. The approach aims to allow both internal and external augmentation of EDA processes and environments through unsupervised and incremental learning.
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring - inside-BigData.com
In this deck from the Stanford HPC Conference, Nicole Xu from Stanford University describes how she transformed a common jellyfish into a bionic creature that is part animal and part machine.
"Animal locomotion and bioinspiration have the potential to expand the performance capabilities of robots, but current implementations are limited. Mechanical soft robots leverage engineered materials and are highly controllable, but these biomimetic robots consume more power than corresponding animal counterparts. Biological soft robots from a bottom-up approach offer advantages such as speed and controllability but are limited to survival in cell media. Instead, biohybrid robots that comprise live animals and self-contained microelectronic systems leverage the animals’ own metabolism to reduce power constraints and body as a natural scaffold with damage tolerance. We demonstrate that by integrating onboard microelectronics into live jellyfish, we can enhance propulsion up to threefold, using only 10 mW of external power input to the microelectronics and at only a twofold increase in cost of transport to the animal. This robotic system uses 10 to 1000 times less external power per mass than existing swimming robots in literature and can be used in future applications for ocean monitoring to track environmental changes."
Watch the video: https://youtu.be/HrmJFyvInj8
Learn more: https://sanfrancisco.cbslocal.com/2020/02/05/stanford-research-project-common-jellyfish-bionic-sea-creatures/
and
http://www.hpcadvisorycouncil.com/events/2020/stanford-workshop/
In this deck from the Stanford HPC Conference, Peter Dueben from the European Centre for Medium-Range Weather Forecasts (ECMWF) presents: Machine Learning for Weather Forecasts.
"I will present recent studies that use deep learning to learn the equations of motion of the atmosphere, to emulate model components of weather forecast models and to enhance usability of weather forecasts. I will then talk about the main challenges for the application of deep learning in cutting-edge weather forecasts and suggest approaches to improve usability in the future."
Peter is contributing to the development and optimization of weather and climate models for modern supercomputers. He is focusing on a better understanding of model error and model uncertainty, on the use of reduced numerical precision that is optimised for a given level of model error, on global cloud-resolving simulations with ECMWF's forecast model, and the use of machine learning, and in particular deep learning, to improve the workflow and predictions. Peter graduated in Physics and wrote his PhD thesis at the Max Planck Institute for Meteorology in Germany. He worked as Postdoc with Tim Palmer at the University of Oxford and has taken up a position as University Research Fellow of the Royal Society at the European Centre for Medium-Range Weather Forecasts (ECMWF) in 2017.
Watch the video: https://youtu.be/ks3fkRj8Iqc
Learn more: https://www.ecmwf.int/
and
http://www.hpcadvisorycouncil.com/events/2020/stanford-workshop/
Today RIKEN in Japan announced that the Fugaku supercomputer will be made available for research projects aimed at combating COVID-19.
"Fugaku is currently being installed and is scheduled to be available to the public in 2021. However, faced with the devastating disaster unfolding before our eyes, RIKEN and MEXT decided to make a portion of the computational resources of Fugaku available for COVID-19-related projects ahead of schedule while continuing the installation process.
Fugaku is being developed not only for the progress in science, but also to help build the society dubbed as the “Society 5.0” by the Japanese government, where all people will live safe and comfortable lives. The current initiative to fight against the novel coronavirus is driven by the philosophy behind the development of Fugaku."
Initial Projects
Exploring new drug candidates for COVID-19 by "Fugaku"
Yasushi Okuno, RIKEN / Kyoto University
Prediction of conformational dynamics of proteins on the surface of SARS-CoV-2 using Fugaku
Yuji Sugita, RIKEN
Simulation analysis of pandemic phenomena
Nobuyasu Ito, RIKEN
Fragment molecular orbital calculations for COVID-19 proteins
Yuji Mochizuki, Rikkyo University
In this deck from the Performance Optimisation and Productivity group, Lubomir Riha from IT4Innovations presents: Energy Efficient Computing using Dynamic Tuning.
"We now live in a world of power-constrained architectures and systems and power consumption represents a significant cost factor in the overall HPC system economy. For these reasons, in recent years researchers, supercomputing centers and major vendors have developed new tools and methodologies to measure and optimize the energy consumption of large-scale high performance system installations. Due to the link between energy consumption, power consumption and execution time of an application executed by the final user, it is important for these tools and the methodology used to consider all these aspects, empowering the final user and the system administrator with the capability of finding the best configuration given different high level objectives.
This webinar focused on tools designed to improve the energy-efficiency of HPC applications using a methodology of dynamic tuning of HPC applications, developed under the H2020 READEX project. The READEX methodology has been designed for exploiting the dynamic behaviour of software. At design time, different runtime situations (RTS) are detected and optimized system configurations are determined. RTSs with the same configuration are grouped into scenarios, forming the tuning model. At runtime, the tuning model is used to switch system configurations dynamically.
The MERIC tool, which implements the READEX methodology, is presented. It supports manual or binary instrumentation of the analysed applications to simplify the analysis. This instrumentation is used to identify and annotate the significant regions in the HPC application. Automatic binary instrumentation annotates regions with significant runtime. Manual instrumentation, which can be combined with automatic, allows the code developer to annotate regions of particular interest."
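The READEX flow described above (determine a configuration per runtime situation at design time, group situations with the same configuration into scenarios, then switch at runtime) can be illustrated with a small sketch. This is not the MERIC or READEX API: the region names, the configuration fields `core_freq_ghz` and `threads`, and the `apply_config` stub are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Config(NamedTuple):
    """Hypothetical system configuration; the fields are illustrative only."""
    core_freq_ghz: float
    threads: int


def build_tuning_model(best_config_per_rts):
    """Design time: group runtime situations (RTSs) that share a configuration."""
    scenarios = defaultdict(list)
    for rts, config in best_config_per_rts.items():
        scenarios[config].append(rts)
    # Flatten into a lookup table so the runtime switch is a single dict access.
    lookup = {rts: cfg for cfg, members in scenarios.items() for rts in members}
    return lookup, dict(scenarios)


def apply_config(config, log):
    """Stand-in for a real hardware/runtime switch; it only records the change."""
    log.append(config)


def run(regions, lookup, default, log):
    """Runtime: switch only when the next region needs a different configuration."""
    current = None
    for region in regions:
        wanted = lookup.get(region, default)
        if wanted != current:
            apply_config(wanted, log)
            current = wanted


# Hypothetical design-time results: region name -> best configuration found.
best = {
    "assemble_matrix": Config(2.4, 24),
    "solve_dense": Config(2.4, 24),
    "write_checkpoint": Config(1.2, 4),
}
lookup, scenarios = build_tuning_model(best)
switch_log = []
run(["assemble_matrix", "solve_dense", "write_checkpoint", "assemble_matrix"],
    lookup, Config(2.0, 12), switch_log)
print(len(scenarios), "scenarios,", len(switch_log), "switches")
```

Grouping by configuration means consecutive regions in the same scenario cost no switch at all, which is the point of building scenarios instead of tuning every region independently.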
Watch the video: https://wp.me/p3RLHQ-lJP
Learn more: https://pop-coe.eu/blog/14th-pop-webinar-energy-efficient-computing-using-dynamic-tuning
and
https://code.it4i.cz/vys0053/meric
The document discusses how DDN A3I storage solutions and Nvidia's SuperPOD platform can enable HPC at scale. It provides details on DDN's A3I appliances that are optimized for AI and deep learning workloads and validated for Nvidia's DGX-2 SuperPOD reference architecture. The solutions are said to deliver the fastest performance, effortless scaling, reliability and flexibility for data-intensive workloads.
In this deck, Paul Isaacs from Linaro presents: State of ARM-based HPC. This talk provides an overview of applications and infrastructure services successfully ported to AArch64 and benefiting from scale.
"With its debut on the TOP500, the 125,000-core Astra supercomputer at New Mexico’s Sandia Labs uses Cavium ThunderX2 chips to mark Arm’s entry into the petascale world. In Japan, the Fujitsu A64FX Arm-based CPU in the pending Fugaku supercomputer has been optimized to achieve high-level, real-world application performance, anticipating up to one hundred times the application execution performance of the K computer. K was the first computer to top 10 petaflops in 2011."
Watch the video: https://wp.me/p3RLHQ-lIT
Learn more: https://www.linaro.org/
Versal Premium ACAP for Network and Cloud Acceleration - inside-BigData.com
Today Xilinx announced Versal Premium, the third series in the Versal ACAP portfolio. The Versal Premium series features highly integrated, networked and power-optimized cores and the industry’s highest bandwidth and compute density on an adaptable platform. Versal Premium is designed for the highest bandwidth networks operating in thermally and spatially constrained environments, as well as for cloud providers who need scalable, adaptable application acceleration.
Versal is the industry’s first adaptive compute acceleration platform (ACAP), a revolutionary new category of heterogeneous compute devices with capabilities that far exceed those of conventional silicon architectures. Developed on TSMC’s 7-nanometer process technology, Versal Premium combines software programmability with dynamically configurable hardware acceleration and pre-engineered connectivity and security features to enable a faster time-to-market. The Versal Premium series delivers up to 3X higher throughput compared to current generation FPGAs, with built-in Ethernet, Interlaken, and cryptographic engines that enable fast and secure networks. The series doubles the compute density of currently deployed mainstream FPGAs and provides the adaptability to keep pace with increasingly diverse and evolving cloud and networking workloads.
Learn more: https://insidehpc.com/2020/03/xilinx-announces-versal-premium-acap-for-network-and-cloud-acceleration/
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently - inside-BigData.com
In this video from the Rice Oil & Gas Conference, Chin Fang from Zettar presents: Moving Massive Amounts of Data across Any Distance Efficiently.
The objective of this talk is to present two ongoing projects aiming at improving and ensuring highly efficient bulk transferring or streaming of massive amounts of data over digital connections across any distance. It examines the current state of the art, a few very common misconceptions, the differences among the three major types of data movement solutions, a current initiative attempting to improve the data movement efficiency from the ground up, and another multi-stage project that shows how to conduct long distance large scale data movement at speed and scale internationally. Both projects have real world motivations, e.g. the ambitious data transfer requirements of Linac Coherent Light Source II (LCLS-II) [1], a premier preparation project of the U.S. DOE Exascale Computing Initiative (ECI) [2]. Their immediate goals are described and explained, together with the solution used for each. Findings and early results are reported. Possible future work is outlined.
Watch the video: https://wp.me/p3RLHQ-lBX
Learn more: https://www.zettar.com/
and
https://rice2020oghpc.rice.edu/program-2/
In this deck from the Rice Oil & Gas Conference, Bradley McCredie from AMD presents: Scaling TCO in a Post Moore's Law Era.
"While foundries bravely drive forward to overcome the technical and economic challenges posed by scaling to 5nm and beyond, Moore’s law alone can provide only a fraction of the performance / watt and performance / dollar gains needed to satisfy the demands of today’s high performance computing and artificial intelligence applications. To close the gap, multiple strategies are required. First, new levels of innovation and design efficiency will supplement technology gains to continue to deliver meaningful improvements in SoC performance. Second, heterogeneous compute architectures will create x-factor increases of performance efficiency for the most critical applications. Finally, open software frameworks, APIs, and toolsets will enable broad ecosystems of application level innovation."
Learn more: http://amd.com
and
https://rice2020oghpc.rice.edu/program-2/
CUDA-Python and RAPIDS for blazing fast scientific computing - inside-BigData.com
In this deck from the ECSS Symposium, Abe Stern from NVIDIA presents: CUDA-Python and RAPIDS for blazing fast scientific computing.
"We will introduce Numba and RAPIDS for GPU programming in Python. Numba allows us to write just-in-time compiled CUDA code in Python, giving us easy access to the power of GPUs from a powerful high-level language. RAPIDS is a suite of tools with a Python interface for machine learning and dataframe operations. Together, Numba and RAPIDS represent a potent set of tools for rapid prototyping, development, and analysis for scientific computing. We will cover the basics of each library and go over simple examples to get users started. Finally, we will briefly highlight several other relevant libraries for GPU programming."
Watch the video: https://wp.me/p3RLHQ-lvu
Learn more: https://developer.nvidia.com/rapids
and
https://www.xsede.org/for-users/ecss/ecss-symposium
In this deck from FOSDEM 2020, Colin Sauze from Aberystwyth University describes the development of a RaspberryPi cluster for teaching an introduction to HPC.
"The motivation for this was to overcome four key problems faced by new HPC users:
* The availability of a real HPC system and the effect running training courses can have on the real system; conversely, the availability of spare resources on the real system can cause problems for the training course.
* A fear of using a large and expensive HPC system for the first time and worries that doing something wrong might damage the system.
* That HPC systems are very abstract systems sitting in data centres that users never see; it is difficult for them to understand exactly what it is they are using.
* That new users fail to understand resource limitations, in part because the vast resources in modern HPC systems mean a lot of mistakes can be made before running out of resources. A more resource constrained system makes it easier to understand this.
The talk will also discuss some of the technical challenges in deploying an HPC environment to a Raspberry Pi and attempts to keep that environment as close to a "real" HPC as possible. The issue of trying to automate the installation process will also be covered."
Learn more: https://github.com/colinsauze/pi_cluster
and
https://fosdem.org/2020/schedule/events/
In this deck from ATPESC 2019, Ken Raffenetti from Argonne presents an overview of HPC interconnects.
"The Argonne Training Program on Extreme-Scale Computing (ATPESC) provides intensive, two-week training on the key skills, approaches, and tools to design, implement, and execute computational science and engineering applications on current high-end computing systems and the leadership-class computing systems of the future."
Watch the video: https://wp.me/p3RLHQ-luc
Learn more: https://extremecomputingtraining.anl.gov/
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc... - inside-BigData.com
In this deck from FOSDEM 2020, Frank McQuillan from Pivotal presents: Efficient Model Selection for Deep Neural Networks on Massively Parallel Processing Databases.
"In this session we will present an efficient way to train many deep learning model configurations at the same time with Greenplum, a free and open source massively parallel database based on PostgreSQL. The implementation involves distributing data to the workers that have GPUs available and hopping model state between those workers, without sacrificing reproducibility or accuracy. Then we apply optimization algorithms to generate and prune the set of model configurations to try.
Deep neural networks are revolutionizing many machine learning applications, but hundreds of trials may be needed to generate a good model architecture and associated hyperparameters. This is the challenge of model selection. It is time consuming and expensive, especially if you are only training one model at a time.
Massively parallel processing databases can have hundreds of workers, so can you use this parallel compute architecture to address the challenge of model selection for deep nets, in order to make it faster and cheaper?
It’s possible!
We will demonstrate results from this project using a version of Hyperband, which is a well known hyperparameter optimization algorithm, and the deep learning frameworks Keras and TensorFlow, all running on Greenplum database using Apache MADlib. Other topics will include architecture, scalability results and bright opportunities for the future."
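Hyperband, mentioned above, is built around successive halving: evaluate many configurations on a small budget, keep the best fraction, and give the survivors a larger budget. The sketch below shows only that inner loop on a toy objective. It is not the Greenplum or Apache MADlib implementation, and `toy_loss`, the learning-rate grid, and the budget values are made up for illustration.

```python
def toy_loss(lr, budget):
    """Made-up stand-in for 'train with learning rate lr for budget epochs'."""
    # Loss shrinks with more budget and is lowest at lr = 0.1.
    return abs(lr - 0.1) + 1.0 / budget


def successive_halving(configs, min_budget=1, eta=2):
    """Evaluate all survivors, keep the best 1/eta, multiply the budget by eta."""
    survivors = list(configs)
    budget = min_budget
    history = []  # (budget, number of configurations evaluated) per round
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda lr: toy_loss(lr, budget))
        history.append((budget, len(survivors)))
        survivors = ranked[:max(1, len(survivors) // eta)]
        budget *= eta
    return survivors[0], history


learning_rates = [0.001, 0.01, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0]
best_lr, rounds = successive_halving(learning_rates)
print(best_lr, rounds)
```

With eight candidates and `eta=2`, the loop runs three rounds (8, 4, then 2 configurations), so most of the total budget goes to the configurations that looked best early on. Full Hyperband repeats this with several different starting budgets to hedge against configurations that only improve late.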
Watch the video: https://wp.me/p3RLHQ-lsQ
Learn more: https://fosdem.org/2020/
In this deck, Huihuo Zheng from Argonne National Laboratory presents: Data Parallel Deep Learning.
"The Argonne Training Program on Extreme-Scale Computing (ATPESC) provides intensive, two weeks of training on the key skills, approaches, and tools to design, implement, and execute computational science and engineering applications on current high-end computing systems and the leadership-class computing systems of the future."
Watch the video: https://wp.me/p3RLHQ-lsl
Learn more: https://extremecomputingtraining.anl.gov/archive/atpesc-2019/agenda-2019/
In the realm of cybersecurity, offensive security practices act as a critical shield. By simulating real-world attacks in a controlled environment, these techniques expose vulnerabilities before malicious actors can exploit them. This proactive approach allows manufacturers to identify and fix weaknesses, significantly enhancing system security.
This presentation delves into the development of a system designed to mimic Galileo's Open Service signal using software-defined radio (SDR) technology. We'll begin with a foundational overview of both Global Navigation Satellite Systems (GNSS) and the intricacies of digital signal processing.
The presentation culminates in a live demonstration. We'll showcase the manipulation of Galileo's Open Service pilot signal, simulating an attack on various software and hardware systems. This practical demonstration serves to highlight the potential consequences of unaddressed vulnerabilities, emphasizing the importance of offensive security practices in safeguarding critical infrastructure.
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf - Chart Kalyan
A Mix Chart displays historical data of numbers in a graphical or tabular form. The Kalyan Rajdhani Mix Chart specifically shows the results of a sequence of numbers over different periods.
Conversational agents, or chatbots, are increasingly used to access all sorts of services using natural language. While open-domain chatbots - like ChatGPT - can converse on any topic, task-oriented chatbots - the focus of this paper - are designed for specific tasks, like booking a flight, obtaining customer support, or setting an appointment. Like any other software, task-oriented chatbots need to be properly tested, usually by defining and executing test scenarios (i.e., sequences of user-chatbot interactions). However, there is currently a lack of methods to quantify the completeness and strength of such test scenarios, which can lead to low-quality tests, and hence to buggy chatbots.
To fill this gap, we propose adapting mutation testing (MuT) for task-oriented chatbots. To this end, we introduce a set of mutation operators that emulate faults in chatbot designs, an architecture that enables MuT on chatbots built using heterogeneous technologies, and a practical realisation as an Eclipse plugin. Moreover, we evaluate the applicability, effectiveness and efficiency of our approach on open-source chatbots, with promising results.
The Department of Veterans Affairs (VA) invited Taylor Paschal, Knowledge & Information Management Consultant at Enterprise Knowledge, to speak at a Knowledge Management Lunch and Learn hosted on June 12, 2024. All Office of Administration staff were invited to attend and received professional development credit for participating in the voluntary event.
The objectives of the Lunch and Learn presentation were to:
- Review what KM ‘is’ and ‘isn’t’
- Understand the value of KM and the benefits of engaging
- Define and reflect on your “what’s in it for me?”
- Share actionable ways you can participate in Knowledge Capture & Transfer
Monitoring and Managing Anomaly Detection on OpenShift.pdf - Tosin Akinosho
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Designing Scalable HPC, Deep Learning and Cloud Middleware for Exascale Systems
1. Designing Scalable HPC, Deep Learning, and Cloud Middleware for Exascale Systems
Talk at Swiss Conference (April ’18)
by
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda
2. HPCAC-Switzerland (April ’18), Network Based Computing Laboratory
High-End Computing (HEC): Towards Exascale
Expected to have an ExaFlop system in 2019-2021!
100 PFlops in 2016; 1 EFlops in 2019-2021?
3.
Increasing Usage of HPC, Big Data and Deep Learning
• Big Data (Hadoop, Spark, HBase, Memcached, etc.)
• Deep Learning (Caffe, TensorFlow, BigDL, etc.)
• HPC (MPI, RDMA, Lustre, etc.)
Convergence of HPC, Big Data, and Deep Learning!
Increasing Need to Run these applications on the Cloud!!
4.
• Traditional HPC
– Message Passing Interface (MPI), including MPI + OpenMP
– Support for PGAS and MPI + PGAS (OpenSHMEM, UPC)
– Exploiting Accelerators
• Deep Learning
– Acceleration of Caffe, CNTK, TensorFlow, and many more
– Out-of-core Processing
• Cloud for HPC and BigData
– Virtualization with SR-IOV and Containers
HPC, Deep Learning, and Cloud
5.
Parallel Programming Models Overview
• Shared Memory Model (SHMEM, DSM): P1, P2, P3 access a single shared memory
• Distributed Memory Model (MPI, Message Passing Interface): P1, P2, P3 each have their own memory
• Partitioned Global Address Space (PGAS) (Global Arrays, UPC, Chapel, X10, CAF, …): P1, P2, P3 each have their own memory, combined into a logical shared memory
• Programming models provide abstract machine models
• Models can be mapped on different types of systems
– e.g. Distributed Shared Memory (DSM), MPI within a node, etc.
• PGAS models and Hybrid MPI+PGAS models are gradually gaining importance
6.
Partitioned Global Address Space (PGAS) Models
• Key features
- Simple shared memory abstractions
- Light weight one-sided communication
- Easier to express irregular communication
• Different approaches to PGAS
- Languages
• Unified Parallel C (UPC)
• Co-Array Fortran (CAF)
• X10
• Chapel
- Libraries
• OpenSHMEM
• UPC++
• Global Arrays
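The key features above can be illustrated with a minimal single-process Python sketch of a partitioned global address space: each simulated process owns one partition of a global array, and any process can read or write any element through one-sided put/get calls. The class and method names (PartitionedGlobalArray, owner, put, get) are assumptions made for illustration; they are not the API of OpenSHMEM, UPC++, or Global Arrays.

```python
class PartitionedGlobalArray:
    """Toy PGAS array: a global index space split into per-process partitions."""

    def __init__(self, num_procs, elems_per_proc):
        self.elems_per_proc = elems_per_proc
        # One private list per simulated process; together they form the global array.
        self.partitions = [[0] * elems_per_proc for _ in range(num_procs)]

    def owner(self, global_index):
        """Return (owning process, local offset) for a global index."""
        return divmod(global_index, self.elems_per_proc)

    def put(self, global_index, value):
        """One-sided write: the owning process posts no matching receive."""
        proc, offset = self.owner(global_index)
        self.partitions[proc][offset] = value

    def get(self, global_index):
        """One-sided read from whichever partition holds the element."""
        proc, offset = self.owner(global_index)
        return self.partitions[proc][offset]


pga = PartitionedGlobalArray(num_procs=3, elems_per_proc=4)
pga.put(9, 42)             # global index 9 lives on process 2, offset 1
remote_value = pga.get(9)
```

The caller addresses data by global index only, which is why irregular access patterns are easier to express than with matched send/receive pairs.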
7.
Hybrid (MPI+PGAS) Programming
• Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics
• Benefits:
– Best of Distributed Computing Model
– Best of Shared Memory Computing Model
[Figure: an HPC application made up of Kernel 1 through Kernel N, all written in MPI, with Kernel 2 and Kernel N re-written in PGAS]
8.
Supporting Programming Models for Multi-Petaflop and Exaflop Systems: Challenges
• Application Kernels/Applications
• Programming Models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.
• Middleware: Communication Library or Runtime for Programming Models
– Point-to-point Communication
– Collective Communication
– Energy-Awareness
– Synchronization and Locks
– I/O and File Systems
– Fault Tolerance
• Networking Technologies (InfiniBand, 40/100GigE, Aries, and Omni-Path)
• Multi-/Many-core Architectures
• Accelerators (GPU and FPGA)
• Co-Design Opportunities and Challenges across Various Layers
• Performance, Scalability, Resilience
9.
• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
– Scalable job start-up
• Scalable Collective communication
– Offload
– Non-blocking
– Topology-aware
• Balancing intra-node and inter-node communication for next generation nodes (128-1024 cores)
– Multiple end-points per node
• Support for efficient multi-threading
• Integrated Support for GPGPUs and Accelerators
• Fault-tolerance/resiliency
• QoS support for communication and I/O
• Support for Hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM,
MPI+UPC++, CAF, …)
• Virtualization
• Energy-Awareness
Broad Challenges in Designing Communication Middleware for (MPI+X) at
Exascale
10.
• Extreme Low Memory Footprint
– Memory per core continues to decrease
• D-L-A Framework
– Discover
• Overall network topology (fat-tree, 3D, …), Network topology for processes for a given job
• Node architecture, Health of network and node
– Learn
• Impact on performance and scalability
• Potential for failure
– Adapt
• Internal protocols and algorithms
• Process mapping
• Fault-tolerance solutions
– Low overhead techniques while delivering performance, scalability and fault-tolerance
Additional Challenges for Designing Exascale Software Libraries
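One way to picture the Discover-Learn-Adapt loop is as a small selector that records what it sees about the system, accumulates observed latencies, and then picks an algorithm. The Python sketch below is purely illustrative: the class name DlaSelector, the algorithm labels, and the latency values are hypothetical and do not describe how MVAPICH2 implements the framework.

```python
from collections import defaultdict
from statistics import mean


class DlaSelector:
    """Toy Discover-Learn-Adapt loop for choosing a collective algorithm."""

    def __init__(self, algorithms):
        self.algorithms = list(algorithms)
        self.system = {}
        self.samples = defaultdict(list)

    def discover(self, topology, cores_per_node):
        # Discover: record what the job can see about the system.
        self.system = {"topology": topology, "cores_per_node": cores_per_node}

    def learn(self, algorithm, latency_us):
        # Learn: accumulate observed latencies per algorithm.
        self.samples[algorithm].append(latency_us)

    def adapt(self):
        # Adapt: try any algorithm with no samples first,
        # otherwise pick the one with the lowest mean latency.
        for name in self.algorithms:
            if not self.samples[name]:
                return name
        return min(self.algorithms, key=lambda name: mean(self.samples[name]))


selector = DlaSelector(["flat", "two_level"])   # hypothetical algorithm labels
selector.discover(topology="fat-tree", cores_per_node=64)
first_choice = selector.adapt()
selector.learn("flat", 120.0)                   # made-up latencies in microseconds
selector.learn("two_level", 45.0)
selector.learn("flat", 110.0)
best_choice = selector.adapt()
```

Keeping only a running list of samples per algorithm is a deliberately low-overhead choice, in the spirit of the last bullet on the slide.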
11.
Overview of the MVAPICH2 Project
• High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002
– MVAPICH2-X (MPI + PGAS), Available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014
– Support for Virtualization (MVAPICH2-Virt), Available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015
– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
– Used by more than 2,875 organizations in 86 countries
– More than 462,000 (> 0.46 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (Nov ‘17 ranking)
• 1st, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
• 9th, 556,104 cores (Oakforest-PACS) in Japan
• 12th, 368,928-core (Stampede2) at TACC
• 17th, 241,108-core (Pleiades) at NASA
• 48th, 76,032-core (Tsubame 2.5) at Tokyo Institute of Technology
– Available with software stacks of many vendors and Linux Distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
13.
Architecture of MVAPICH2 Software Family
• High Performance Parallel Programming Models
– Message Passing Interface (MPI)
– PGAS (UPC, OpenSHMEM, CAF, UPC++)
– Hybrid --- MPI + X (MPI + PGAS + OpenMP/Cilk)
• High Performance and Scalable Communication Runtime: Diverse APIs and Mechanisms
– Point-to-point Primitives, Collectives Algorithms, Energy-Awareness, Remote Memory Access, I/O and File Systems, Fault Tolerance, Virtualization, Active Messages, Job Startup, Introspection & Analysis
• Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path)
– Transport Protocols: RC, XRC, UD, DC
– Modern Features: UMR, ODP, SR-IOV, Multi Rail
• Support for Modern Multi-/Many-core Architectures (Intel-Xeon, OpenPower, Xeon-Phi, ARM, NVIDIA GPGPU)
– Transport Mechanisms: Shared Memory, CMA, IVSHMEM, XPMEM*
– Modern Features: MCDRAM*, NVLink*, CAPI*
* Upcoming
14.
• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication
– Scalable Start-up
– Optimized Collectives using SHArP and Multi-Leaders
– Optimized CMA-based Collectives
– Upcoming Optimized XPMEM-based Collectives
• Integrated Support for GPGPUs
• Optimized MVAPICH2 for OpenPower (with NVLink) and ARM
• Application Scalability and Best Practices
Overview of A Few Challenges being Addressed by the MVAPICH2
Project for Exascale
17.
Startup Performance on KNL + Omni-Path
[Figure: MPI_Init & Hello World on Oakforest-PACS. Time taken (seconds) vs. number of processes (64 to 64K); series: Hello World (MVAPICH2-2.3a) and MPI_Init (MVAPICH2-2.3a)]
• MPI_Init takes 22 seconds on 229,376 processes on 3,584 KNL nodes (Stampede2 – Full scale)
• 8.8 times faster than Intel MPI at 128K processes (Courtesy: TACC)
• At 64K processes, MPI_Init and Hello World take 5.8s and 21s respectively (Oakforest-PACS)
• All numbers reported with 64 processes per node
New designs available since MVAPICH2-2.3a and as patch for SLURM 15, 16, and 17
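The process counts on this slide follow from nodes × processes per node. A quick Python check, assuming "64K" means 65,536 processes (the helper names are illustrative):

```python
PROCESSES_PER_NODE = 64  # "64 processes per node", as stated on the slide


def total_processes(nodes, ppn=PROCESSES_PER_NODE):
    """Processes launched when every node runs ppn ranks."""
    return nodes * ppn


def nodes_needed(processes, ppn=PROCESSES_PER_NODE):
    """Smallest node count that can host the given number of ranks."""
    return -(-processes // ppn)  # ceiling division


stampede2_procs = total_processes(3584)    # full-scale Stampede2 KNL run
oakforest_nodes = nodes_needed(64 * 1024)  # assumption: 64K = 65,536 processes
```

With these inputs the full-scale Stampede2 figure of 229,376 processes is reproduced, and the 64K-process Oakforest-PACS run corresponds to 1,024 nodes.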
18.
Advanced Allreduce Collective Designs Using SHArP
[Figure: Mesh Refinement Time of MiniAMR. Latency (seconds) vs. (number of nodes, PPN) at (4,28), (8,28), (16,28); series: MVAPICH2 and MVAPICH2-SHArP; 13% annotated]
[Figure: Avg DDOT Allreduce time of HPCG. Latency (seconds) vs. (number of nodes, PPN) at (4,28), (8,28), (16,28); series: MVAPICH2 and MVAPICH2-SHArP; 12% annotated]
SHArP Support is available since MVAPICH2 2.3a
M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design, SuperComputing '17.
19.
Performance of MPI_Allreduce On Stampede2 (10,240 Processes)
[Figure: two panels, OSU Micro Benchmark, 64 PPN. Latency (us) vs. message size (4 to 4096 bytes, and 8K to 256K bytes); series: MVAPICH2, MVAPICH2-OPT, IMPI; 2.4X annotated]
• For MPI_Allreduce latency with 32K bytes, MVAPICH2-OPT can reduce the latency by 2.4X
M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, Scalable Reduction Collectives with Data Partitioning-based
Multi-Leader Design, SuperComputing '17.
Available in MVAPICH2-X 2.3b
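The title of the cited paper suggests a design in which the message is partitioned among several leader processes per node. The following Python sketch simulates that idea for a sum-allreduce with plain lists. It is a simplified reading of the paper title, not the MVAPICH2 implementation; the function name and the data layout are assumptions.

```python
def multi_leader_allreduce(node_buffers, num_leaders):
    """Sum-allreduce over node_buffers[node][local_rank], a list of equal-length vectors.

    Returns the reduced vector that every process would hold afterwards.
    """
    length = len(node_buffers[0][0])
    # Split the vector into one contiguous partition per leader.
    bounds = [(i * length // num_leaders, (i + 1) * length // num_leaders)
              for i in range(num_leaders)]

    # Phase 1 (intra-node): leader i reduces partition i over all local ranks.
    partials = []
    for ranks in node_buffers:
        node_partial = []
        for lo, hi in bounds:
            node_partial.append([sum(buf[k] for buf in ranks) for k in range(lo, hi)])
        partials.append(node_partial)

    # Phase 2 (inter-node): leaders with the same index combine their partitions.
    result = []
    for i, (lo, hi) in enumerate(bounds):
        for k in range(hi - lo):
            result.append(sum(node[i][k] for node in partials))

    # Phase 3 (intra-node): in a real run each leader would share its reduced
    # partition with the local ranks; the simulation simply returns the vector.
    return result


# Two nodes with two ranks each, four elements per rank (made-up data).
buffers = [
    [[1, 2, 3, 4], [10, 20, 30, 40]],
    [[5, 6, 7, 8], [50, 60, 70, 80]],
]
reduced = multi_leader_allreduce(buffers, num_leaders=2)
```

With several leaders, each inter-node exchange carries only a fraction of the message, and the per-partition reductions can proceed in parallel on different cores.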
20.
Optimized CMA-based Collectives for Large Messages
[Figure: three panels, KNL (2 Nodes, 128 Procs), KNL (4 Nodes, 256 Procs), and KNL (8 Nodes, 512 Procs). Latency (us, log scale) vs. message size (1K up to 4M, 2M, and 1M respectively); series: MVAPICH2-2.3a, Intel MPI 2017, OpenMPI 2.1.0, Tuned CMA]
• Significant improvement over existing implementation for Scatter/Gather with
1MB messages (up to 4x on KNL, 2x on Broadwell, 14x on OpenPower)
• New two-level algorithms for better scalability
• Improved performance for other collectives (Bcast, Allgather, and Alltoall)
Annotated improvements on the charts: ~2.5x, ~3.2x, ~4x, and ~17x better
S. Chakraborty, H. Subramoni, and D. K. Panda, Contention Aware Kernel-Assisted MPI
Collectives for Multi/Many-core Systems, IEEE Cluster ’17, BEST Paper Finalist
Performance of MPI_Gather on KNL nodes (64PPN)
Available in MVAPICH2-X 2.3b
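The "two-level algorithms" bullet can be illustrated with a generic node-leader gather: ranks first gather to a leader inside each node, then the leaders send one aggregated buffer each to the root. The Python simulation below is a sketch under that assumption; it does not model CMA itself and is not the algorithm from the cited paper.

```python
def two_level_gather(node_data):
    """node_data[node][local_rank] is the value contributed by that process.

    Returns (values in global rank order, number of inter-node messages).
    """
    # Level 1: each node leader (local rank 0) gathers from its own node.
    leader_buffers = [list(local_values) for local_values in node_data]

    # Level 2: the root (leader of node 0) receives one aggregated buffer
    # from every other node leader.
    gathered = []
    for buffer in leader_buffers:
        gathered.extend(buffer)
    inter_node_messages = len(node_data) - 1
    return gathered, inter_node_messages


def flat_gather_messages(node_data):
    """Inter-node messages if every rank outside node 0 sends directly to the root."""
    return sum(len(local_values) for local_values in node_data[1:])


data = [[0, 1], [2, 3], [4, 5]]  # 3 nodes, 2 ranks per node (made-up values)
gathered, two_level_msgs = two_level_gather(data)
flat_msgs = flat_gather_messages(data)
```

In this toy case the root receives 2 inter-node messages instead of 4; the gap grows with the number of processes per node.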
21.
Shared Address Space (XPMEM)-based Collectives Design
[Figure: OSU_Allreduce (Broadwell 256 procs). Latency (us, log scale) vs. message size (16K to 4M); series: MVAPICH2-2.3b, IMPI-2017v1.132, MVAPICH2-Opt; 1.8X annotated]
[Figure: OSU_Reduce (Broadwell 256 procs). Latency (us, log scale) vs. message size (16K to 4M); series: MVAPICH2-2.3b, IMPI-2017v1.132, MVAPICH2-Opt; 4X annotated]
Data labels shown on the charts: 73.2, 36.1, 37.9, 16.8
• “Shared Address Space”-based true zero-copy Reduction collective designs in MVAPICH2
• Offloaded computation/communication to peer ranks in reduction collective operation
• Up to 4X improvement for 4MB Reduce and up to 1.8X improvement for 4MB AllReduce
J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and D. Panda, Designing Efficient Shared Address Space Reduction
Collectives for Multi-/Many-cores, International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018.
Will be available in future
22.
• Scalability for million to billion processors
• Integrated Support for GPGPUs
– CUDA-aware MPI
– GPUDirect RDMA (GDR) Support
• Optimized MVAPICH2 for OpenPower (with NVLink) and ARM
• Application Scalability and Best Practices
Overview of A Few Challenges being Addressed by the MVAPICH2
Project for Exascale
23.
GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU
• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from GPU with RDMA transfers (inside MVAPICH2)
At Sender:
MPI_Send(s_devbuf, size, …);
At Receiver:
MPI_Recv(r_devbuf, size, …);
High Performance and High Productivity
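Why overlapping the GPU copy with the network transfer helps can be seen from a simple cost model. The sketch below assumes the message is split into equal chunks, each needing one device-to-host copy and one network send. The chunk count and the per-chunk times are hypothetical values, not measurements of MVAPICH2.

```python
def staged_time(chunks, copy_us, send_us):
    """No overlap: every chunk is copied, then sent, one after the other."""
    return chunks * (copy_us + send_us)


def pipelined_time(chunks, copy_us, send_us):
    """Overlap: while chunk i is on the wire, chunk i+1 is being copied.

    The first copy and the last send cannot be hidden; in between, the
    slower of the two stages sets the pace.
    """
    return copy_us + (chunks - 1) * max(copy_us, send_us) + send_us


# Hypothetical numbers: 8 chunks, 10 us per copy, 12 us per send.
no_overlap = staged_time(8, 10, 12)
with_overlap = pipelined_time(8, 10, 12)
```

Under these assumed numbers the pipelined transfer takes 106 us against 176 us without overlap; with a single chunk the two models coincide.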
24.
CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.3 Releases
• Support for MPI communication from NVIDIA GPU device memory
• High performance RDMA-based inter-node point-to-point communication
(GPU-GPU, GPU-Host and Host-GPU)
• High performance intra-node point-to-point communication for multi-GPU
adapters/node (GPU-GPU, GPU-Host and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node
communication for multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from
GPU device buffers
• Unified memory
27.
Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland
[Figure: CSCS GPU cluster. Normalized execution time vs. number of GPUs (16, 32, 64, 96); series: Default, Callback-based, Event-based]
[Figure: Wilkes GPU Cluster. Normalized execution time vs. number of GPUs (4, 8, 16, 32); series: Default, Callback-based, Event-based]
• 2X improvement on 32 GPU nodes
• 30% improvement on 96 GPU nodes (8 GPUs/node)
C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS’16
On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and Cosmo Application
Cosmo model: http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/
28.
• Scalability for million to billion processors
• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, UPC++, …)
• Integrated Support for GPGPUs
• Optimized MVAPICH2 for OpenPower (with NVLink) and ARM
• Application Scalability and Best Practices
Overview of A Few Challenges being Addressed by the MVAPICH2
Project for Exascale
36.
Performance of SPEC MPI 2007 Benchmarks (KNL + Omni-Path)
[Figure: execution time (s) for milc, leslie3d, pop2, lammps, wrf2, tera_tf, lu; series: Intel MPI 18.0.0 and MVAPICH2 2.3 rc1; annotated differences as extracted: 2%, 6%, 1%, 6%, 10%, 2%, 4%]
MVAPICH2 outperforms Intel MPI by up to 10%
448 processes on 7 KNL nodes of TACC Stampede2 (64 ppn)
37.
Performance of SPEC MPI 2007 Benchmarks (Skylake + Omni-Path)
[Figure: execution time (s) for milc, leslie3d, pop2, lammps, wrf2, GaP, tera_tf, lu; series: Intel MPI 18.0.0 and MVAPICH2 2.3 rc1; annotated differences as extracted: 2%, 4%, 0%, 1%, 0%, -4%, 38%, -3%]
MVAPICH2 outperforms Intel MPI by up to 38%
480 processes on 10 Skylake nodes of TACC Stampede2 (48 ppn)
39.
• MPI runtime has many parameters
• Tuning a set of parameters can help you to extract higher performance
• Compiled a list of such contributions through the MVAPICH Website
– http://mvapich.cse.ohio-state.edu/best_practices/
• Initial list of applications
– Amber
– HoomDBlue
– HPCG
– Lulesh
– MILC
– Neuron
– SMG2000
– Cloverleaf
– SPEC (LAMMPS, POP2, TERA_TF, WRF2)
• Soliciting additional contributions, send your results to mvapich-help at cse.ohio-state.edu.
• We will link these results with credits to you.
Applications-Level Tuning: Compilation of Best Practices
40.
• Traditional HPC
– Message Passing Interface (MPI), including MPI + OpenMP
– Support for PGAS and MPI + PGAS (OpenSHMEM, UPC)
– Exploiting Accelerators
• Deep Learning
– Acceleration of Caffe, CNTK, TensorFlow, and many more
– Out-of-core Processing
• Cloud for HPC and BigData
– Virtualization with SR-IOV and Containers
HPC, Deep Learning, and Cloud
41.
• Deep Learning frameworks are a different game
altogether
– Unusually large message sizes (order of megabytes)
– Most communication based on GPU buffers
• Existing State-of-the-art
– cuDNN, cuBLAS, NCCL --> scale-up performance
– NCCL2, CUDA-Aware MPI --> scale-out performance
• For small and medium message sizes only!
• Proposed: Can we co-design the MPI runtime (MVAPICH2-
GDR) and the DL framework (Caffe) to achieve both?
– Efficient Overlap of Computation and Communication
– Efficient Large-Message Communication (Reductions)
– What application co-designs are needed to exploit
communication-runtime co-designs?
Deep Learning: New Challenges for MPI Runtimes
[Figure: scale-up performance vs. scale-out performance, positioning cuDNN, cuBLAS, NCCL, NCCL2, gRPC, Hadoop, MPI, and the Proposed Co-Designs]
A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17)
43.
MVAPICH2-GDR vs. NCCL2 – Broadcast Operation
• Optimized designs in MVAPICH2-GDR 2.3b* offer better/comparable performance for most cases
• MPI_Bcast (MVAPICH2-GDR) vs. ncclBcast (NCCL2) on 16 K-80 GPUs
*Will be available with upcoming MVAPICH2-GDR 2.3b
[Figure: two panels, 1 GPU/node and 2 GPUs/node. Latency (us, log scale) vs. message size (bytes); series: NCCL2 and MVAPICH2-GDR; annotations: ~10X better, ~4X better]
Platform: Intel Xeon (Broadwell) nodes equipped with a dual-socket CPU, 2 K-80 GPUs, and EDR InfiniBand Inter-connect
44.
MVAPICH2-GDR vs. NCCL2 – Reduce Operation
• Optimized designs in MVAPICH2-GDR 2.3b* offer better/comparable performance for most cases
• MPI_Reduce (MVAPICH2-GDR) vs. ncclReduce (NCCL2) on 16 GPUs
*Will be available with upcoming MVAPICH2-GDR 2.3b
[Figure: two panels. Latency (us, log scale) vs. message size (bytes), one panel covering 4 bytes to 64K; series: MVAPICH2-GDR and NCCL2; annotations: ~5X better, ~2.5X better]
Platform: Intel Xeon (Broadwell) nodes equipped with a dual-socket CPU, 1 K-80 GPU, and EDR InfiniBand Inter-connect
45.
MVAPICH2-GDR vs. NCCL2 – Allreduce Operation
• Optimized designs in MVAPICH2-GDR 2.3b* offer better/comparable performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on 16 GPUs
*Will be available with upcoming MVAPICH2-GDR 2.3b
[Figure: two panels. Latency (us, log scale) vs. message size (bytes), one panel covering 4 bytes to 64K; series: MVAPICH2-GDR and NCCL2; annotations: ~1.2X better, ~3X better]
Platform: Intel Xeon (Broadwell) nodes equipped with a dual-socket CPU, 1 K-80 GPU, and EDR InfiniBand Inter-connect
46.
• Caffe : A flexible and layered Deep Learning framework.
• Benefits and Weaknesses
– Multi-GPU Training within a single node
– Performance degradation for GPUs across different
sockets
– Limited Scale-out
• OSU-Caffe: MPI-based Parallel Training
– Enable Scale-up (within a node) and Scale-out (across
multi-GPU nodes)
– Scale-out on 64 GPUs for training CIFAR-10 network on
CIFAR-10 dataset
– Scale-out on 128 GPUs for training GoogLeNet network on
ImageNet dataset
OSU-Caffe: Scalable Deep Learning
[Figure: GoogLeNet (ImageNet) on 128 GPUs. Training time (seconds) vs. number of GPUs (8, 16, 32, 64, 128); series: Caffe, OSU-Caffe (1024), OSU-Caffe (2048); one configuration is marked "Invalid use case"]
OSU-Caffe publicly available from
http://hidl.cse.ohio-state.edu/
Support on OPENPOWER will be available soon
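MPI-based parallel training is commonly organized as data parallelism: each GPU computes a gradient on its own shard of the batch, the gradients are averaged with an allreduce, and every rank applies the same update. The Python sketch below simulates that pattern on a one-parameter model. It is a generic illustration, not OSU-Caffe code; the function names, learning rate, and data are made up.

```python
def local_gradient(weight, samples):
    """Gradient of mean squared error for the model y = weight * x on one shard."""
    return sum(2 * (weight * x - y) * x for x, y in samples) / len(samples)


def allreduce_mean(values):
    """Stand-in for an MPI_Allreduce (sum) followed by division by the rank count."""
    return sum(values) / len(values)


def train(shards, weight=0.0, lr=0.05, steps=200):
    for _ in range(steps):
        # One gradient per simulated GPU, each computed on its own shard.
        grads = [local_gradient(weight, shard) for shard in shards]
        # Every rank applies the same averaged gradient, so replicas stay in sync.
        weight -= lr * allreduce_mean(grads)
    return weight


data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]  # samples from y = 3x
shards = [data[:2], data[2:]]                         # two simulated GPUs
learned = train(shards)
```

The allreduce is the only communication step per iteration, which is why large-message reduction performance matters so much for scale-out training.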
47.
High Productivity and High Performance Out-of-Core DNN Training
• Large Size Deep Neural Networks (DNNs) cannot be trained on
GPUs due to memory limitation!
– ResNet-50 - state-of-the-art DNN architecture for Image
Recognition (Trainable with a small batch size of 45)
– Next generation models like Neural Machine Translation
(NMT) are emerging that require even more memory
• Can we design Out-of-core DNN training support using new
features in CUDA 8/9 and hardware mechanisms in
Pascal/Volta GPUs?
• The proposed framework called OC-Caffe (Out-of-Core Caffe)
shows the potential of managed memory designs that can
provide performance with negligible/no overhead.
– OC-Caffe eliminates 3,000 lines of code for a high-
productivity design by exploiting Unified Memory features
Submission Under Review
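Whether a training configuration is "in-memory" or "out-of-core" comes down to comparing its memory demand with the GPU capacity. The Python sketch below uses a deliberately simple linear model (fixed model size plus a per-sample cost). All sizes are hypothetical and are not taken from the OC-Caffe work.

```python
def memory_needed_gib(model_gib, per_sample_gib, batch_size):
    """Simple linear estimate: weights plus activations for one batch."""
    return model_gib + per_sample_gib * batch_size


def max_in_memory_batch(gpu_gib, model_gib, per_sample_gib):
    """Largest batch size that still fits entirely in GPU memory."""
    return int((gpu_gib - model_gib) // per_sample_gib)


def oversubscription(gpu_gib, model_gib, per_sample_gib, batch_size):
    """Ratio of demand to capacity; a value above 1.0 means out-of-core."""
    return memory_needed_gib(model_gib, per_sample_gib, batch_size) / gpu_gib


# Hypothetical sizes: 16 GiB GPU, 1 GiB model, 0.25 GiB per sample.
largest_batch = max_in_memory_batch(16, 1, 0.25)
ratio_at_90 = oversubscription(16, 1, 0.25, 90)
```

With these assumed sizes, batches up to 60 fit in memory, while a batch of 90 over-subscribes the GPU by roughly 1.47x and would need out-of-core support such as managed memory.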
48.
Performance Trends for OC-Caffe
• Comparable performance to Caffe-Default for “in-memory” batch sizes
• OC-Caffe-Opt: up to 5X improvement over Intel-MKL-optimized CPU-based AlexNet training on Volta V100 GPU with CUDA9 and CUDNN7
[Figure: trainable (in-memory) case. Images/sec (higher is better); series: caffe-gpu, oc-caffe-naïve, oc-caffe-opt; oc-caffe-opt shows only 2.9% degradation]
[Figure: out-of-core (over-subscription) case. Images/sec (higher is better); series: caffe-gpu, oc-caffe-naïve, oc-caffe-opt, caffe-cpu, intel-caffe, intel-caffe-opt; caffe-gpu cannot run; oc-caffe-opt is 5X better than intel-caffe-opt]
OC-Caffe will be released by the HiDL Team@OSU
hidl.cse.ohio-state.edu
Submission Under Review
49.
• Traditional HPC
– Message Passing Interface (MPI), including MPI + OpenMP
– Support for PGAS and MPI + PGAS (OpenSHMEM, UPC)
– Exploiting Accelerators
• Deep Learning
– Acceleration of Caffe, CNTK, TensorFlow, and many more
– Out-of-core Processing
• Cloud for HPC and BigData
– Virtualization with SR-IOV and Containers
HPC, Deep Learning, and Cloud
50.
Can HPC and Virtualization be Combined?
• Virtualization has many benefits
– Fault-tolerance
– Job migration
– Compaction
• Has not been very popular in HPC due to the overhead associated with Virtualization
• New SR-IOV (Single Root – IO Virtualization) support available with Mellanox InfiniBand adapters changes the field
• Enhanced MVAPICH2 support for SR-IOV
• MVAPICH2-Virt 2.2 supports:
– OpenStack, Docker, and Singularity
J. Zhang, X. Lu, J. Jose, R. Shi and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based
Virtualized InfiniBand Clusters? EuroPar'14
J. Zhang, X. Lu, J. Jose, M. Li, R. Shi and D.K. Panda, High Performance MPI Library over SR-IOV enabled InfiniBand
Clusters, HiPC’14
J. Zhang, X. Lu, M. Arnold and D. K. Panda, MVAPICH2 Over OpenStack with SR-IOV: an Efficient Approach to build HPC Clouds, CCGrid’15
51. Application-Level Performance on Chameleon
[Figure: SPEC MPI2007 execution time (s) for milc, leslie3d, pop2, GAPgeofem, zeusmp2, and lu, comparing MV2-SR-IOV-Def, MV2-SR-IOV-Opt, and MV2-Native; annotated overheads of 1% and 9.5%.]
[Figure: Graph500 execution time (ms) by problem size (Scale, Edgefactor) for 22,20; 24,10; 24,16; 24,20; 26,10; 26,16, comparing MV2-SR-IOV-Def, MV2-SR-IOV-Opt, and MV2-Native; annotated overheads of 2% and 5%.]
• 32 VMs, 6 cores/VM
• Compared to Native, 2-5% overhead for Graph500 with 128 processes
• Compared to Native, 1-9.5% overhead for SPEC MPI2007 with 128 processes
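The overhead figures on this slide compare virtualized runs against native runs. Below is a minimal sketch of that calculation, assuming overhead means the relative increase in execution time over native. The timings in the usage line are hypothetical values chosen for illustration, not measurements from the charts.

```python
def overhead_percent(virtualized_time, native_time):
    """Relative slowdown of a virtualized run versus a native run, in percent."""
    if native_time <= 0:
        raise ValueError("native_time must be positive")
    return (virtualized_time - native_time) / native_time * 100.0


# Hypothetical timings in seconds, for illustration only.
print(f"{overhead_percent(105.0, 100.0):.1f}% overhead")
```

Both arguments must use the same unit, so the function works equally for the seconds of SPEC MPI2007 and the milliseconds of Graph500.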
52. Application-Level Performance on Singularity with MVAPICH2
[Figure: Graph500 BFS execution time (ms) by problem size (Scale, Edgefactor) for 22,16; 22,20; 24,16; 24,20; 26,16; 26,20, comparing Singularity and Native; annotated overhead of 6%.]
[Figure: NPB Class D execution time (s) for CG, EP, FT, IS, LU, and MG, comparing Singularity and Native; annotated overhead of 7%.]
• 512 processes across 32 nodes
• Less than 7% and 6% overhead for NPB and Graph500, respectively
J. Zhang, X. Lu and D. K. Panda, Is Singularity-based Container Technology Ready for Running MPI Applications on HPC Clouds?, UCC ’17, Best Student Paper Award
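A common way to run MPI applications under Singularity is to let the host-side MPI launcher start one `singularity exec` per rank. The sketch below only builds such a command line and does not run it. It assumes MVAPICH2's `mpirun_rsh` launcher with its `-np` and `-hostfile` options; the host file, image name, and benchmark binary are hypothetical placeholders, and the slide does not state which launch method was used for these results.

```python
import shlex


def singularity_mpi_command(num_procs, hostfile, image, app, launcher="mpirun_rsh"):
    """Build a launch line that starts each MPI rank inside a Singularity container."""
    if num_procs < 1:
        raise ValueError("num_procs must be at least 1")
    return [
        launcher, "-np", str(num_procs), "-hostfile", hostfile,
        "singularity", "exec", image, app,
    ]


# Hypothetical file names; 512 processes matches the run size on the slide.
cmd = singularity_mpi_command(512, "hosts", "npb.simg", "./cg.D.512")
print(shlex.join(cmd))
```

Returning the command as a list keeps it safe to pass directly to `subprocess.run` without shell quoting issues.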
54. MVAPICH2 – Plans for Exascale
• Performance and Memory scalability toward 1-10M cores
• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF …)
– MPI + Task*
• Enhanced Optimization for GPU Support and Accelerators
• Taking advantage of advanced features of Mellanox InfiniBand
– Tag Matching*
– Adapter Memory*
• Enhanced communication schemes for upcoming architectures
– Knights Landing with MCDRAM*
– NVLINK*
– CAPI*
• Enhanced Support for Deep Learning
• Extended topology-aware collectives
• Extended Energy-aware designs and Virtualization Support
• Extended Support for MPI Tools Interface (as in MPI 3.0)
• Extended FT support
• Features marked with * will be available in future MVAPICH2 releases
55. Funding Acknowledgments
Funding Support by
Equipment Support by
56. Personnel Acknowledgments
Current Students (Graduate)
– A. Awan (Ph.D.)
– R. Biswas (M.S.)
– M. Bayatpour (Ph.D.)
– S. Chakraborthy (Ph.D.)
– C.-H. Chu (Ph.D.)
– S. Guganani (Ph.D.)
– J. Hashmi (Ph.D.)
– H. Javed (Ph.D.)
– P. Kousha (Ph.D.)
– D. Shankar (Ph.D.)
– H. Shi (Ph.D.)
– J. Zhang (Ph.D.)
Current Students (Undergraduate)
– N. Sarkauskas (B.S.)
Current Research Scientists
– X. Lu
– H. Subramoni
Current Research Specialists
– J. Smith
– M. Arnold
Current Post-docs
– A. Ruhela
– K. Manian
Past Students
– A. Augustine (M.S.)
– P. Balaji (Ph.D.)
– S. Bhagvat (M.S.)
– A. Bhat (M.S.)
– D. Buntinas (Ph.D.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– J. Jose (Ph.D.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– K. Kulkarni (M.S.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– K. Kandalla (Ph.D.)
– M. Li (Ph.D.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– A. Moody (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.)
– R. Rajachandrasekar (Ph.D.)
– G. Santhanaraman (Ph.D.)
– A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– H. Subramoni (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
Past Research Scientists
– K. Hamidouche
– S. Sur
Past Post-Docs
– D. Banerjee
– X. Besseron
– H.-W. Jin
– J. Lin
– M. Luo
– E. Mancini
– S. Marcarelli
– J. Vienne
– H. Wang
Past Programmers
– D. Bureddy
– J. Perkins
57. Thank You!
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
panda@cse.ohio-state.edu
The High-Performance MPI/PGAS Project
http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project
http://hidl.cse.ohio-state.edu/
The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/