In this deck from the 2016 HPC Advisory Council Switzerland Conference, DK Panda from Ohio State University presents: High-Performance and Scalable Designs of Programming Models for Exascale Systems.
"This talk will focus on challenges in designing runtime environments for Exascale systems with millions of processors and accelerators to support various programming models. We will focus on MPI, PGAS (OpenSHMEM, CAF, UPC and UPC++) and Hybrid MPI+PGAS programming models by taking into account support for multi-core, high-performance networks, accelerators (GPUs and Intel MIC) and energy-awareness. Features and sample performance numbers from the MVAPICH2 libraries will be presented."
Watch the video presentation: http://wp.me/p3RLHQ-f7c
See more talks in the Swiss Conference Video Gallery: http://insidehpc.com/2016-swiss-hpc-conference/
Welcome to the 2016 HPC Advisory Council Switzerland Conference
This document contains the agenda for the HPC Advisory Council Swiss Conference 2016, which will take place in March. It provides details on keynote speakers, tutorials, and best practices sessions covering topics like deep learning, programming models, containers, and more. Sponsor and exhibitor information is also included.
How to Achieve High-Performance, Scalable and Distributed DNN Training on Modern HPC Systems?
In this deck from the Stanford HPC Conference, DK Panda from Ohio State University presents: How to Achieve High-Performance, Scalable and Distributed DNN Training on Modern HPC Systems?
"This talk will start with an overview of challenges being faced by the AI community to achieve high-performance, scalable and distributed DNN training on Modern HPC systems with both scale-up and scale-out strategies. After that, the talk will focus on a range of solutions being carried out in my group to address these challenges. The solutions will include: 1) MPI-driven Deep Learning, 2) Co-designing Deep Learning Stacks with High-Performance MPI, 3) Out-of- core DNN training, and 4) Hybrid (Data and Model) parallelism. Case studies to accelerate DNN training with popular frameworks like TensorFlow, PyTorch, MXNet and Caffe on modern HPC systems will be presented."
Watch the video: https://youtu.be/LeUNoKZVuwQ
Learn more: http://web.cse.ohio-state.edu/~panda.2/
and
http://www.hpcadvisorycouncil.com/events/2020/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
CUDA-Python and RAPIDS for blazing fast scientific computing
In this deck from the ECSS Symposium, Abe Stern from NVIDIA presents: CUDA-Python and RAPIDS for blazing fast scientific computing.
"We will introduce Numba and RAPIDS for GPU programming in Python. Numba allows us to write just-in-time compiled CUDA code in Python, giving us easy access to the power of GPUs from a powerful high-level language. RAPIDS is a suite of tools with a Python interface for machine learning and dataframe operations. Together, Numba and RAPIDS represent a potent set of tools for rapid prototyping, development, and analysis for scientific computing. We will cover the basics of each library and go over simple examples to get users started. Finally, we will briefly highlight several other relevant libraries for GPU programming."
Watch the video: https://wp.me/p3RLHQ-lvu
Learn more: https://developer.nvidia.com/rapids
and
https://www.xsede.org/for-users/ecss/ecss-symposium
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
TAU Performance System and the Extreme-scale Scientific Software Stack (E4S) aim to improve productivity for HPC and AI workloads. TAU provides a portable performance evaluation toolkit, while E4S delivers modular and interoperable software stacks. Together, they lower barriers to using software tools from the Exascale Computing Project and enable performance analysis of complex, multi-component applications.
Axel Koehler from Nvidia presented this deck at the 2016 HPC Advisory Council Switzerland Conference.
“Accelerated computing is transforming the data center that delivers unprecedented throughput, enabling new discoveries and services for end users. This talk will give an overview about the NVIDIA Tesla accelerated computing platform including the latest developments in hardware and software. In addition it will be shown how deep learning on GPUs is changing how we use computers to understand data.”
In related news, the GPU Technology Conference takes place April 4-7 in Silicon Valley.
Watch the video presentation: http://insidehpc.com/2016/03/tesla-accelerated-computing/
See more talks in the Swiss Conference Video Gallery:
http://insidehpc.com/2016-swiss-hpc-conference/
Sign up for our insideHPC Newsletter:
http://insidehpc.com/newsletter
In this deck, Paul Isaacs from Linaro presents: State of ARM-based HPC. This talk provides an overview of applications and infrastructure services successfully ported to AArch64 and benefiting from scale.
"With its debut on the TOP500, the 125,000-core Astra supercomputer at New Mexico’s Sandia Labs uses Cavium ThunderX2 chips to mark Arm’s entry into the petascale world. In Japan, the Fujitsu A64FX Arm-based CPU in the pending Fugaku supercomputer has been optimized to achieve high-level, real-world application performance, anticipating up to one hundred times the application execution performance of the K computer. K was the first computer to top 10 petaflops in 2011."
Watch the video: https://wp.me/p3RLHQ-lIT
Learn more: https://www.linaro.org/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
This document discusses how HPC infrastructure is being transformed with AI. It summarizes that cognitive systems use distributed deep learning across HPC clusters to speed up training times. It also outlines IBM's hardware portfolio expansion for AI training, inference, and storage capabilities. The document discusses software stacks for AI like Watson Machine Learning Community Edition that use containers and universal base images to simplify deployment.
High-Performance and Scalable Designs of Programming Models for Exascale Systems
- The document discusses programming models and challenges for exascale systems. It focuses on MPI and PGAS models like OpenSHMEM.
- Key challenges include supporting hybrid MPI+PGAS programming, efficient communication for multi-core and accelerator nodes, fault tolerance, and extreme low memory usage.
- The MVAPICH2 project aims to address these challenges through its high performance MPI and PGAS implementation and optimization of communication for technologies like InfiniBand.
Preparing to program Aurora at Exascale - Early experiences and future directions
In this deck from IWOCL / SYCLcon 2020, Hal Finkel from Argonne National Laboratory presents: Preparing to program Aurora at Exascale - Early experiences and future directions.
"Argonne National Laboratory’s Leadership Computing Facility will be home to Aurora, our first exascale supercomputer. Aurora promises to take scientific computing to a whole new level, and scientists and engineers from many different fields will take advantage of Aurora’s unprecedented computational capabilities to push the boundaries of human knowledge. In addition, Aurora’s support for advanced machine-learning and big-data computations will enable scientific workflows incorporating these techniques along with traditional HPC algorithms. Programming the state-of-the-art hardware in Aurora will be accomplished using state-of-the-art programming models. Some of these models, such as OpenMP, are long-established in the HPC ecosystem. Other models, such as Intel’s oneAPI, based on SYCL, are relatively-new models constructed with the benefit of significant experience. Many applications will not use these models directly, but rather, will use C++ abstraction libraries such as Kokkos or RAJA. Python will also be a common entry point to high-performance capabilities. As we look toward the future, features in the C++ standard itself will become increasingly relevant for accessing the extreme parallelism of exascale platforms.
This presentation will summarize the experiences of our team as we prepare for Aurora, exploring how to port applications to Aurora’s architecture and programming models, and distilling the challenges and best practices we’ve developed to date. oneAPI/SYCL and OpenMP are both critical models in these efforts, and while the ecosystem for Aurora has yet to mature, we’ve already had a great deal of success. Importantly, we are not passive recipients of programming models developed by others. Our team works not only with vendor-provided compilers and tools, but also develops improved open-source LLVM-based technologies that feed both open-source and vendor-provided capabilities. In addition, we actively participate in the standardization of OpenMP, SYCL, and C++. To conclude, I’ll share our thoughts on how these models can best develop in the future to support exascale-class systems."
Watch the video: https://wp.me/p3RLHQ-lPT
Learn more: https://www.iwocl.org/iwocl-2020/conference-program/
and
https://www.anl.gov/topic/aurora
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
The document discusses strategies for improving application performance on POWER9 processors using IBM XL and open source compilers. It reviews key POWER9 features and outlines common bottlenecks like branches, register spills, and memory issues. It provides guidelines on using compiler options and coding practices to address these bottlenecks, such as unrolling loops, inlining functions, and prefetching data. Tools like perf are also described for analyzing performance bottlenecks.
IBM provides infrastructure to accelerate medical research tasks like genomics, molecular simulation, diagnostics, and quality inspection. This infrastructure delivers faster insights through high-performance data and AI deployed at massive scale on IBM Power Systems and Storage. Case studies show the infrastructure reduces time to results for tasks like processing millions of cryogenic electron microscope images from days to hours.
Increasing Cluster Performance by Combining rCUDA with Slurm
Federico Silla from the Technical University of Valencia presented this deck at the Switzerland HPC Conference.
"Graphics Processing Units (GPUs) are currently used in data centers to reduce the execution time of compute-intensive applications. However, the use of GPUs presents several side ef- fects, such as increased acquisition costs as well as larger space requirements. Furthermore, GPUs require a non-negligible amount of energy even while idle. Additionally, GPU utilization is usually low for most applications. The use of virtual GPUs may address these concerns. In this regard, the remote GPU virtualization mechanism could be leveraged to share the GPUs present in the computing facility among the nodes of the cluster. This would increase overall GPU utilization, thus reducing the negative impact of the increased costs mentioned before. Reducing the amount of GPUs installed in the cluster could also be possible. In this talk
we present the remote GPU virtualization mechanism using as case study the performance attained by a cluster using the rCUDA middleware and a modified version of the Slurm sched- uler, which is able to map remote virtual GPUs to jobs. By leveraging rCUDA+Slurm, cluster throughput, measured as jobs completed per time unit, is doubled at the same time that total energy consumption is reduced up to 40%. GPU utilization is also increased."
Watch the video presentation: https://www.youtube.com/watch?v=yQWiQiyFpAg
See more talks in the Swiss Conference Video Gallery: http://insidehpc.com/2016-swiss-hpc-conference/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from the 2018 Swiss HPC Conference, DK Panda from Ohio State University presents: Exploiting HPC Technologies for Accelerating Big Data Processing and Associated Deep Learning.
"This talk will provide an overview of challenges in accelerating Hadoop, Spark, and Memcached on modern HPC clusters. An overview of RDMA-based designs for Hadoop (HDFS, MapReduce, RPC and HBase), Spark, Memcached, Swift, and Kafka using native RDMA support for InfiniBand and RoCE will be presented. Enhanced designs for these components to exploit NVM-based in-memory technology and parallel file systems (such as Lustre) will also be presented. Benefits of these designs on various cluster configurations using the publicly available RDMA-enabled packages from the OSU HiBD project will be shown. Benefits of these stacks to accelerate deep learning frameworks (such as CaffeOnSpark and TensorFlowOnSpark) will be presented."
Watch the video: https://wp.me/p3RLHQ-iko
Learn more: http://www.hpcadvisorycouncil.com/events/2018/swiss-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from FOSDEM 2020, Colin Sauze from Aberystwyth University describes the development of a Raspberry Pi cluster for teaching an introduction to HPC.
"The motivation for this was to overcome four key problems faced by new HPC users:
* The availability of a real HPC system and the effect running training courses can have on the real system, conversely the availability of spare resources on the real system can cause problems for the training course.
* A fear of using a large and expensive HPC system for the first time and worries that doing something wrong might damage the system.
* That HPC systems are very abstract systems sitting in data centres that users never see, it is difficult for them to understand exactly what it is they are using.
* That new users fail to understand resource limitations, in part because of the vast resources in modern HPC systems a lot of mistakes can be made before running out of resources. A more resource constrained system makes it easier to understand this.
The talk will also discuss some of the technical challenges in deploying an HPC environment to a Raspberry Pi and attempts to keep that environment as close to a "real" HPC as possible. The issues of trying to automate the installation process will also be covered."
Learn more: https://github.com/colinsauze/pi_cluster
and
https://fosdem.org/2020/schedule/events/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Accelerating Hadoop, Spark, and Memcached with HPC Technologies
DK Panda from Ohio State University presented this deck at the OpenFabrics Workshop.
"Modern HPC clusters are having many advanced features, such as multi-/many-core architectures, highperformance RDMA-enabled interconnects, SSD-based storage devices, burst-buffers and parallel file systems. However, current generation Big Data processing middleware (such as Hadoop, Spark, and Memcached) have not fully exploited the benefits of the advanced features on modern HPC clusters. This talk will present RDMA-based designs using OpenFabrics Verbs and heterogeneous storage architectures to accelerate multiple components of Hadoop (HDFS, MapReduce, RPC, and HBase), Spark and Memcached. An overview of the associated RDMA-enabled software libraries (being designed and publicly distributed as a part of the HiBD project for Apache Hadoop (integrated and plug-ins for Apache, HDP, and Cloudera distributions), Apache Spark and Memcached will be presented. The talk will also address the need for designing benchmarks using a multi-layered and systematic approach, which can be used to evaluate the performance of these Big Data processing middleware."
Watch the video presentation: http://wp.me/p3RLHQ-gzg
Learn more: http://hibd.cse.ohio-state.edu/
and
https://www.openfabrics.org/index.php/abstracts-agenda.html
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck, Gilad Shainer from Mellanox announces the world’s first HDR 200Gb/s data center interconnect solutions. "These 200Gb/s HDR InfiniBand solutions maintain Mellanox’s generation-ahead leadership while enabling customers and users to leverage an open, standards-based technology that maximizes application performance and scalability while minimizing overall data center total cost of ownership. Mellanox 200Gb/s HDR solutions will become generally available in 2017."
Watch the video presentation: http://insidehpc.com/2016/11/hdr-infiniband/
SCFE 2020 OpenCAPI presentation as part of the OpenPOWER Tutorial, by Ganesan Narayanasamy
This document introduces hardware acceleration using FPGAs with OpenCAPI. It discusses how classic FPGA acceleration has issues like slow CPU-managed memory access and lack of data coherency. OpenCAPI allows FPGAs to directly access host memory, providing faster memory access and data coherency. It also introduces the OC-Accel framework that allows programming FPGAs using C/C++ instead of HDL languages, addressing issues like long development times. Example applications demonstrated significant performance improvements using this approach over CPU-only or classic FPGA acceleration methods.
In this deck from the 2017 MVAPICH User Group, DK Panda from Ohio State University presents: Overview of the MVAPICH Project and Future Roadmap.
"This talk will provide an overview of the MVAPICH project (past, present and future). Future roadmap and features for upcoming releases of the MVAPICH2 software family (including MVAPICH2-X, MVAPICH2-GDR, MVAPICH2-Virt, MVAPICH2-EA and MVAPICH2-MIC) will be presented. Current status and future plans for OSU INAM, OEMT and OMB will also be presented."
Watch the video: https://www.youtube.com/watch?v=wF7t-oH7wi4
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Scalable and Distributed DNN Training on Modern HPC Systems
This document discusses scaling deep learning training on HPC systems. It begins by providing background on deep learning and how interest in it has grown significantly. It then discusses how HPC systems can be leveraged for deep learning by supporting distributed training across multiple nodes. Several challenges of designing deep learning frameworks for HPC are outlined, including memory and communication overhead. The document proposes a co-design approach between deep learning frameworks and communication runtimes to better support distributed training and exploit HPC resources. MVAPICH2 software is discussed as an example that provides optimized MPI support for CPU- and GPU-based deep learning on HPC clusters.
In this deck from ATPESC 2019, Ken Raffenetti from Argonne presents an overview of HPC interconnects.
"The Argonne Training Program on Extreme-Scale Computing (ATPESC) provides intensive, two-week training on the key skills, approaches, and tools to design, implement, and execute computational science and engineering applications on current high-end computing systems and the leadership-class computing systems of the future."
Watch the video: https://wp.me/p3RLHQ-luc
Learn more: https://extremecomputingtraining.anl.gov/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
The document discusses temporal shift modules (TSM) for efficient video recognition. TSM enables temporal modeling in 2D CNNs at no additional computation cost, and TSM models achieve better performance than 3D CNNs and previous methods while using less computation. Applications include online video understanding, low-latency deployment on edge devices, and large-scale distributed training on supercomputers.
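For readers unfamiliar with the mechanism, here is a minimal PyTorch sketch of the core shift operation, an illustration based on the published TSM idea rather than code from this document; the function name, segment count, and fold_div value are illustrative.

    # temporal_shift.py -- illustrative sketch of the TSM shift operation
    import torch

    def temporal_shift(x, n_segments, fold_div=8):
        # x has shape [N*T, C, H, W], with the T frames of each clip contiguous.
        nt, c, h, w = x.size()
        n = nt // n_segments
        x = x.view(n, n_segments, c, h, w)
        fold = c // fold_div

        out = torch.zeros_like(x)
        out[:, :-1, :fold] = x[:, 1:, :fold]                   # shift some channels back in time
        out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # shift some channels forward in time
        out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # leave the remaining channels untouched
        return out.view(nt, c, h, w)

    # Example: 2 clips of 4 frames, 16 channels, 8x8 feature maps.
    feats = torch.randn(2 * 4, 16, 8, 8)
    shifted = temporal_shift(feats, n_segments=4)

Because the shift is just a memory copy between neighboring frames, it gives an otherwise 2D convolutional backbone temporal reach without extra multiply-accumulate operations, which is the "no additional computation cost" claim above.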
Xilinx provides adaptable acceleration platforms for data centers. Their Alveo product lineup includes the U280, U250, U200, and low-profile U50 accelerator cards. The cards feature FPGAs with up to 1.3 million logic cells and high-speed memory. Xilinx also offers the U25 SmartNIC which combines an FPGA, ARM CPU, and dual 25GbE ports. These platforms accelerate workloads such as AI, databases, storage, and networking using reconfigurable and adaptable hardware. Xilinx supports deployment from their devices to cloud platforms using a unified software stack.
Macromolecular crystallography is an experimental technique for exploring the 3D atomic structure of proteins, used by academics for research in biology and by pharmaceutical companies in rational drug design. While development of the technique has so far been limited by the performance of scientific instruments, computing performance is now becoming a key limitation. In my presentation I will describe the computing challenge of handling an 18 GB/s data stream coming from the new X-ray detector. I will show PSI's experience in applying conventional hardware to the task and why this attempt failed. I will then present how an IC 922 server with OpenCAPI-enabled FPGA boards allowed us to build a sustainable and scalable solution for high-speed data acquisition. Finally, I will give a perspective on how advances in hardware development will enable better science for users of the Swiss Light Source.
In this deck from the Argonne Training Program on Extreme-Scale Computing 2019, Howard Pritchard from LANL and Simon Hammond from Sandia present: NNSA Explorations: ARM for Supercomputing.
"The Arm-based Astra system at Sandia will be used by the National Nuclear Security Administration (NNSA) to run advanced modeling and simulation workloads for addressing areas such as national security, energy and science.
"By introducing Arm processors with the HPE Apollo 70, a purpose-built HPC architecture, we are bringing powerful elements, like optimal memory performance and greater density, to supercomputers that existing technologies in the market cannot match,” said Mike Vildibill, vice president, Advanced Technologies Group, HPE. “Sandia National Laboratories has been an active partner in leveraging our Arm-based platform since its early design, and featuring it in the deployment of the world’s largest Arm-based supercomputer, is a strategic investment for the DOE and the industry as a whole as we race toward achieving exascale computing.”
Watch the video: https://wp.me/p3RLHQ-l29
Learn more: https://insidehpc.com/2018/06/arm-goes-big-hpe-builds-petaflop-supercomputer-sandia/
and
https://extremecomputingtraining.anl.gov/agenda-2019/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from the UK HPC Conference, Gunter Roeth from NVIDIA presents: Hardware & Software Platforms for HPC, AI and ML.
"Data is driving the transformation of industries around the world and a new generation of AI applications are effectively becoming programs that write software, powered by data, vs by computer programmers. Today, NVIDIA’s tensor core GPU sits at the core of most AI, ML and HPC applications, and NVIDIA software surrounds every level of such a modern application, from CUDA and libraries like cuDNN and NCCL embedded in every deep learning framework and optimized and delivered via the NVIDIA GPU Cloud to reference architectures designed to streamline the deployment of large scale infrastructures."
Watch the video: https://wp.me/p3RLHQ-l2Y
Learn more: http://nvidia.com
and
http://hpcadvisorycouncil.com/events/2019/uk-conference/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
"Algorithmic processing performed in High Performance Computing environments impacts the lives of billions of people, and planning for exascale computing presents significant power challenges to the industry. ARM delivers the enabling technology behind HPC. The 64-bit design of the ARMv8-A architecture combined with Advanced SIMD vectorization are ideal to enable large scientific computing calculations to be executed efficiently on ARM HPC machines. In addition ARM and its partners are working to ensure that all the software tools and libraries, needed by both users and systems administrators, are provided in readily available, optimized packages."
Learn more: https://developer.arm.com/hpc
and
http://hpcuserforum.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck, Gilad Shainer from the HPC AI Advisory Council describes how this organization fosters innovation in the high performance computing community.
"The HPC-AI Advisory Council’s mission is to bridge the gap between high-performance computing (HPC) and Artificial Intelligence (AI) use and its potential, bring the beneficial capabilities of HPC and AI to new users for better research, education, innovation and product manufacturing, bring users the expertise needed to operate HPC and AI systems, provide application designers with the tools needed to enable parallel computing, and to strengthen the qualification and integration of HPC and AI system products."
Watch the video: https://wp.me/p3RLHQ-lNz
Learn more: http://hpcadvisorycouncil.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Challenges and Opportunities for HPC Interconnects and MPI
In this video from the 2017 MVAPICH User Group, Ron Brightwell from Sandia presents: Challenges and Opportunities for HPC Interconnects and MPI.
"This talk will reflect on prior analysis of the challenges facing high-performance interconnect technologies intended to support extreme-scale scientific computing systems, how some of these challenges have been addressed, and what new challenges lay ahead. Many of these challenges can be attributed to the complexity created by hardware diversity, which has a direct impact on interconnect technology, but new challenges are also arising indirectly as reactions to other aspects of high-performance computing, such as alternative parallel programming models and more complex system usage models. We will describe some near-term research on proposed extensions to MPI to better support massive multithreading and implementation optimizations aimed at reducing the overhead of MPI tag matching. We will also describe a new portable programming model to offload simple packet processing functions to a network interface that is based on the current Portals data movement layer. We believe this capability will offer significant performance improvements to applications and services relevant to high-performance computing as well as data analytics."
Watch the video: https://wp.me/p3RLHQ-hhK
Learn more: http://mug.mvapich.cse.ohio-state.edu/program/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Designing Scalable HPC, Deep Learning and Cloud Middleware for Exascale Systems
In this deck from the 2018 Swiss HPC Conference, DK Panda from Ohio State University presents: Designing Scalable HPC, Deep Learning and Cloud Middleware for Exascale Systems.
"This talk will focus on challenges in designing HPC, Deep Learning, and HPC Cloud middleware for Exascale systems with millions of processors and accelerators. For the HPC domain, we will discuss about the challenges in designing runtime environments for MPI+X (PGAS - OpenSHMEM/UPC/CAF/UPC++, OpenMP, and CUDA) programming models by taking into account support for multi-core systems (KNL and OpenPower), high-performance networks, GPGPUs (including GPUDirect RDMA), and energy-awareness. Features, sample performance numbers and best practices of using MVAPICH2 libraries (http://mvapich.cse.ohio-state.edu)will be presented.
For the Deep Learning domain, we will focus on popular Deep Learning frameworks (Caffe, CNTK, and TensorFlow) to extract performance and scalability with MVAPICH2-GDR MPI library. Finally, we will outline the challenges in moving these middleware to the Cloud environments."
Watch the video: https://wp.me/p3RLHQ-iyc
Learn more: http://www.cse.ohio-state.edu/~panda
and
http://www.hpcadvisorycouncil.com/events/2018/swiss-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from the HPC Advisory Council Spain Conference, DK Panda from Ohio State University presents: Communication Frameworks for HPC and Big Data.
Watch the video presentation: http://insidehpc.com/2015/09/video-communication-frameworks-for-hpc-and-big-data/
Learn more: http://www.hpcadvisorycouncil.com/events/2015/spain-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
How to Design Scalable HPC, Deep Learning and Cloud Middleware for Exascale Systems
In this deck from the 2019 Stanford HPC Conference, DK Panda from Ohio State University presents: How to Design Scalable HPC, Deep Learning and Cloud Middleware for Exascale Systems.
"This talk will focus on challenges in designing HPC, Deep Learning, and HPC Cloud middleware for Exascale systems with millions of processors and accelerators. For the HPC domain, we will discuss about the challenges in designing runtime environments for MPI+X (PGAS - OpenSHMEM/UPC/CAF/UPC++, OpenMP, and CUDA) programming models taking into account support for multi-core systems (Xeon, OpenPower, and ARM), high-performance networks, GPGPUs (including GPUDirect RDMA), and energy-awareness. Features and sample performance numbers from the MVAPICH2 libraries (http://mvapich.cse.ohio-state.edu) will be presented. For the Deep Learning domain, we will focus on popular Deep Learning frameworks (Caffe, CNTK, and TensorFlow) to extract performance and scalability with MVAPICH2-GDR MPI library and RDMA-Enabled Big Data stacks. Finally, we will outline the challenges in moving middleware to the Cloud environments."
Watch the video: https://youtu.be/hR8cnFVF8Zg
Learn more: http://www.cse.ohio-state.edu/~panda
and
http://hpcadvisorycouncil.com/events/2019/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Designing HPC, Deep Learning, and Cloud Middleware for Exascale Systemsinside-BigData.com
In this deck from the Stanford HPC Conference, DK Panda from Ohio State University presents: Designing HPC, Deep Learning, and Cloud Middleware for Exascale Systems.
"This talk will focus on challenges in designing HPC, Deep Learning, and HPC Cloud middleware for Exascale systems with millions of processors and accelerators. For the HPC domain, we will discuss the challenges in designing runtime environments for MPI+X (PGAS-OpenSHMEM/UPC/CAF/UPC++, OpenMP and Cuda) programming models by taking into account support for multi-core systems (KNL and OpenPower), high networks, GPGPUs (including GPUDirect RDMA) and energy awareness. Features and sample performance numbers from MVAPICH2 libraries will be presented. For the Deep Learning domain, we will focus on popular Deep Learning framewords (Caffe, CNTK, and TensorFlow) to extract performance and scalability with MVAPICH2-GDR MPI library and RDMA-enabled Big Data stacks. Finally, we will outline the challenges in moving these middleware to the Cloud environments."
Watch the video: https://youtu.be/i2I6XqOAh_I
Learn more: http://web.cse.ohio-state.edu/~panda.2/
and
http://hpcadvisorycouncil.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Designing Software Libraries and Middleware for Exascale Systems: Opportuniti...inside-BigData.com
This talk will focus on challenges in designing software libraries and middleware for upcoming exascale systems with millions of processors and accelerators. Two application domains will be considered: Scientific Computing and Big Data. For the scientific computing domain, we will discuss the challenges in designing runtime environments for MPI and PGAS (UPC and OpenSHMEM) programming models by taking into account support for multi-core, high-performance networks, GPGPUs and Intel MIC. Features and sample performance numbers from MVAPICH2 and MVAPICH2-X (supporting Hybrid MPI and PGAS (UPC and OpenSHMEM)) libraries will be presented. For the Big Data domain, we will focus on high-performance and scalable designs of Hadoop (including HBase) and Memcached using native RDMA support of InfiniBand and RoCE.
A Library for Emerging High-Performance Computing ClustersIntel® Software
This document discusses the challenges of developing communication libraries for exascale systems using hybrid MPI+X programming models. It describes how current MPI+PGAS approaches use separate runtimes, which can lead to issues like deadlock. The document advocates for a unified runtime that can support multiple programming models simultaneously to avoid such issues and enable better performance. It also outlines MVAPICH2's work on designs like multi-endpoint that integrate MPI and OpenMP to efficiently support emerging highly threaded systems.
Performance Characterization and Optimization of In-Memory Data Analytics on ...Ahsan Javed Awan
The document discusses performance optimization of Apache Spark on scale-up servers through near-data processing. It finds that Spark workloads have poor multi-core scalability and high I/O wait times on scale-up servers. It proposes exploiting near-data processing through in-storage processing and 2D-integrated processing-in-memory to reduce data movements and latency. The author evaluates these techniques through modeling and a programmable FPGA accelerator to improve the performance of Spark MLlib workloads by up to 9x. Challenges in hybrid CPU-FPGA design and attaining peak performance are also discussed.
MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plansinside-BigData.com
In this video from the 2014 HPC Advisory Council Europe Conference, DK Panda from Ohio State University presents: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans.
"This talk will focus on the latest developments and future plans for the MVAPICH2 and MVAPICH2-X projects. For the MVAPICH2 project, we will focus on scalable and highly-optimized designs for pt-to-pt communication (two-sided and one-sided MPI-3 RMA), collective communication (blocking and MPI-3 non-blocking), support for GPGPUs and Intel MIC, support for the MPI-T interface and schemes for fault-tolerance/fault-resilience. For the MVAPICH2-X project, we will focus on efficient support for the hybrid MPI and PGAS (UPC and OpenSHMEM) programming model with a unified runtime."
Watch the video presentation: http://wp.me/p3RLHQ-coF
Designing HPC & Deep Learning Middleware for Exascale Systemsinside-BigData.com
DK Panda from Ohio State University presented this deck at the 2017 HPC Advisory Council Stanford Conference.
"This talk will focus on challenges in designing runtime environments for exascale systems with millions of processors and accelerators to support various programming models. We will focus on MPI, PGAS (OpenSHMEM, CAF, UPC and UPC++) and Hybrid MPI+PGAS programming models by taking into account support for multi-core, high-performance networks, accelerators (GPGPUs and Intel MIC), virtualization technologies (KVM, Docker, and Singularity), and energy-awareness. Features and sample performance numbers from the MVAPICH2 libraries will be presented."
Watch the video: http://wp.me/p3RLHQ-glW
Learn more: http://hpcadvisorycouncil.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Real time machine learning proposers day v3mustafa sarac
This document discusses DARPA's Real Time Machine Learning (RTML) program. The objective is to develop hardware generators and compilers that can automatically create application-specific integrated circuits for machine learning from high-level code. This would allow no-human-in-the-loop creation of efficient neural network hardware. The program has two phases: phase 1 develops an ML hardware compiler, and phase 2 demonstrates RTML systems for applications like wireless communication and image processing. Key goals are high performance, low power consumption, and support for a variety of neural network architectures and machine learning techniques.
Addressing Emerging Challenges in Designing HPC Runtimesinside-BigData.com
The document discusses several challenges in designing HPC runtimes for exascale systems, including energy awareness, accelerators, and virtualization. It focuses on the MVAPICH2 project which addresses these challenges. MVAPICH2 provides integrated support for GPUs and MICs, virtualization using SR-IOV and containers, and energy awareness. It also achieves high performance for GPU-aware MPI using features like GPUDirect RDMA. Application tests with HOOMD-blue and COSMO show improvements from MVAPICH2's GPU support.
Stories About Spark, HPC and Barcelona by Jordi TorresSpark Summit
HPC in Barcelona is centered around the MareNostrum supercomputer and BSC's 425-person team from 40 countries. MareNostrum allows simulation and analysis in fields like life sciences, earth sciences, and engineering. To meet new demands of big data analytics, BSC developed the Spark4MN module to run Spark workloads on MareNostrum. Benchmarking showed Spark4MN achieved good speed-up and scale-out. Further work profiles Spark using BSC tools and benchmarks workloads like image analysis on different hardware. BSC's vision is to advance understanding through technologies like cognitive computing and deep learning.
In this deck from the HPC User Forum in Austin, Yutaka Ishikawa from Riken AICS presents: Japan's post K Computer.
Watch the video presentation: http://wp.me/p3RLHQ-fJ6
Learn more: http://hpcuserforum.com
The convergence of HPC and BigData: What does it mean for HPC sysadmins?inside-BigData.com
In this deck from FOSDEM'19, Damien Francois from the Université catholique de Louvain presents: The convergence of HPC and BigData: What does it mean for HPC sysadmins?
"There are mainly two types of people in the scientific computing world: those who produce data and those who consume it. Those who have models and generate data from those models, a process known as 'simulation', and those who have data and infer models from the data ('analytics'). The former often originate from disciplines such as Engineering, Physics, or Climatology, while the latter are most often active in Remote sensing, Bioinformatics, Sociology, or Management.
Simulations often require large amounts of computation, so they are often run on generic High-Performance Computing (HPC) infrastructures built on a cluster of powerful high-end machines linked together with high-bandwidth low-latency networks. The cluster is often augmented with hardware accelerators (co-processors such as GPUs or FPGAs) and a large and fast parallel filesystem, all set up and tuned by systems administrators. By contrast, in analytics, the focus is on the storage and access of the data, so analytics is often performed on a BigData infrastructure suited for the problem at hand. Those infrastructures offer specific data stores and are often installed in a more or less self-service way on a public or private 'Cloud' typically built on top of 'commodity' hardware.
Those two worlds, the world of HPC and the world of BigData, are slowly, but surely, converging. The HPC world realizes that there is more to data storage than just files and that 'self-service' ideas are tempting. In the meantime, the BigData world realizes that co-processors and fast networks can really speed up analytics. And indeed, all major public Cloud services now have an HPC offering. And many academic HPC centres are starting to offer Cloud infrastructures and BigData-related tools.
This talk will focus on the latter point of view and review the tools originating from the BigData world and the ideas from the Cloud that can be implemented in an HPC context to enlarge the offer for scientific computing in universities and research centres."
This document provides a history and market overview of Apache Spark. It discusses the motivation for distributed data processing due to increasing data volumes, velocities and varieties. It then covers brief histories of Google File System, MapReduce, BigTable, and other technologies. Hadoop and MapReduce are explained. Apache Spark is introduced as a faster alternative to MapReduce that keeps data in memory. Competitors like Flink, Tez and Storm are also mentioned.
Accelerate Big Data Processing with High-Performance Computing TechnologiesIntel® Software
Learn about opportunities and challenges for accelerating big data middleware on modern high-performance computing (HPC) clusters by exploiting HPC technologies.
In this video from the 2016 Stanford HPC Conference, DK Panda from Ohio State University presents: Programming Models for Exascale Systems.
"This talk will focus on programming models and their designs for upcoming exascale systems with millions of processors and accelerators. Current status and future trends of MPI and PGAS (UPC and OpenSHMEM) programming models will be presented."
Percy Tzelnic from Dell Technologies presented this deck at the HPC User Forum in Austin.
Watch the video presentation: http://insidehpc.com/2016/09/emc-in-hpc-the-journey-so-far-and-the-road-ahead/
Learn more: http://emc.com/
FPGAs as Components in Heterogeneous HPC Systems (paraFPGA 2015 keynote) Wim Vanderbauwhede
This document provides a historical overview of the evolution of FPGA technology and programming approaches over several decades. It discusses early theoretical foundations in the 1930s-40s and the development of integrated circuits, hardware description languages, and high-level synthesis tools from the 1950s onwards. More recently, it describes the rise of heterogeneous computing using GPUs, FPGAs and other accelerators, and the ongoing challenges around programming such systems at a suitable level of abstraction.
Near Data Computing Architectures: Opportunities and Challenges for Apache SparkAhsan Javed Awan
The document discusses opportunities for improving Apache Spark performance using near data computing architectures. It proposes exploiting in-storage processing and 2D integrated processing-in-memory to reduce data movement between CPUs and memory. Certain Spark workloads like joins and aggregations that are I/O bound would benefit from in-storage processing, while iterative workloads are more suited for 2D integrated processing-in-memory. The document outlines a system design using FPGAs to emulate these architectures for evaluating Spark machine learning workloads like k-means clustering.
Similar to Programming Models for Exascale Systems (20)
The document discusses the top 5 technologies that all organizations must understand: digital transformation, quantum computing, IoT, 5G, and AI/HPC. It provides an overview of each technology including opportunities and threats to organizations. The document emphasizes that understanding these emerging technologies is mandatory as the information revolution changes many aspects of life and business.
In this deck, Greg Wahl from Advantech presents: Transforming Private 5G Networks.
Advantech Networks & Communications Group is driving innovation in next-generation network solutions with their High Performance Servers. We provide business critical hardware to the world's leading telecom and networking equipment manufacturers with both standard and customized products. Our High Performance Servers are highly configurable platforms designed to balance the best in x86 server-class processing performance with maximum I/O and offload density. The systems are cost effective, highly available and optimized to meet next generation networking and media processing needs.
“Advantech’s Networks and Communication Group has been both an innovator and trusted enabling partner in the telecommunications and network security markets for over a decade, designing and manufacturing products for OEMs that accelerate their network platform evolution and time to market.” Said Advantech Vice President of Networks & Communications Group, Ween Niu. “In the new IP Infrastructure era, we will be expanding our expertise in Software Defined Networking (SDN) and Network Function Virtualization (NFV), two of the essential conduits to 5G infrastructure agility making networks easier to install, secure, automate and manage in a cloud-based infrastructure.”
In addition to innovation in air interface technologies and architecture extensions, 5G will also need a new generation of network computing platforms to run the emerging software defined infrastructure, one that provides greater topology flexibility, essential to deliver on the promises of high availability, high coverage, low latency and high bandwidth connections. This will open up new parallel industry opportunities through dedicated 5G network slices reserved for specific industries dedicated to video traffic, augmented reality, IoT, connected cars etc. 5G unlocks many new doors and one of the keys to its enablement lies in the elasticity and flexibility of the underlying infrastructure.
Advantech’s corporate vision is to enable an intelligent planet. The company is a global leader in the fields of IoT intelligent systems and embedded platforms. To embrace the trends of IoT, big data, and artificial intelligence, Advantech promotes IoT hardware and software solutions with the Edge Intelligence WISE-PaaS core to assist business partners and clients in connecting their industrial chains. Advantech is also working with business partners to co-create business ecosystems that accelerate the goal of industrial intelligence."
Watch the video: https://wp.me/p3RLHQ-lPQ
* Company website: https://www.advantech.com/
* Solution page: https://www2.advantech.com/nc/newsletter/NCG/SKY/benefits.html
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...inside-BigData.com
In this deck from the Stanford HPC Conference, Katie Lewis from Lawrence Livermore National Laboratory presents: The Incorporation of Machine Learning into Scientific Simulations at Lawrence Livermore National Laboratory.
"Scientific simulations have driven computing at Lawrence Livermore National Laboratory (LLNL) for decades. During that time, we have seen significant changes in hardware, tools, and algorithms. Today, data science, including machine learning, is one of the fastest growing areas of computing, and LLNL is investing in hardware, applications, and algorithms in this space. While the use of simulations to focus and understand experiments is well accepted in our community, machine learning brings new challenges that need to be addressed. I will explore applications for machine learning in scientific simulations that are showing promising results and further investigation that is needed to better understand its usefulness."
Watch the video: https://youtu.be/NVwmvCWpZ6Y
Learn more: https://computing.llnl.gov/research-area/machine-learning
and
http://www.hpcadvisorycouncil.com/events/2020/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...inside-BigData.com
In this deck from the Stanford HPC Conference, Nick Nystrom and Paola Buitrago provide an update from the Pittsburgh Supercomputing Center.
Nick Nystrom is Chief Scientist at the Pittsburgh Supercomputing Center (PSC). Nick is architect and PI for Bridges, PSC's flagship system that successfully pioneered the convergence of HPC, AI, and Big Data. He is also PI for the NIH Human Biomolecular Atlas Program’s HIVE Infrastructure Component and co-PI for projects that bring emerging AI technologies to research (Open Compass), apply machine learning to biomedical data for breast and lung cancer (Big Data for Better Health), and identify causal relationships in biomedical big data (the Center for Causal Discovery, an NIH Big Data to Knowledge Center of Excellence). His current research interests include hardware and software architecture, applications of machine learning to multimodal data (particularly for the life sciences) and to enhance simulation, and graph analytics.
Watch the video: https://youtu.be/LWEU1L1o7yY
Learn more: https://www.psc.edu/
and
http://www.hpcadvisorycouncil.com/events/2020/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
The document discusses using systems intelligence and artificial intelligence/neural networks to enhance semiconductor electronic design automation (EDA) workflows. Telemetry data collected from EDA jobs and infrastructure is analyzed with complex event processing, machine learning models, and messaging substrates to provide insights that can optimize EDA pipelines and infrastructure. The approach aims to allow both internal and external augmentation of EDA processes and environments through unsupervised and incremental learning.
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoringinside-BigData.com
In this deck from the Stanford HPC Conference, Nicole Xu from Stanford University describes how she transformed a common jellyfish into a bionic creature that is part animal and part machine.
"Animal locomotion and bioinspiration have the potential to expand the performance capabilities of robots, but current implementations are limited. Mechanical soft robots leverage engineered materials and are highly controllable, but these biomimetic robots consume more power than corresponding animal counterparts. Biological soft robots from a bottom-up approach offer advantages such as speed and controllability but are limited to survival in cell media. Instead, biohybrid robots that comprise live animals and self- contained microelectronic systems leverage the animals’ own metabolism to reduce power constraints and body as an natural scaffold with damage tolerance. We demonstrate that by integrating onboard microelectronics into live jellyfish, we can enhance propulsion up to threefold, using only 10 mW of external power input to the microelectronics and at only a twofold increase in cost of transport to the animal. This robotic system uses 10 to 1000 times less external power per mass than existing swimming robots in literature and can be used in future applications for ocean monitoring to track environmental changes."
Watch the video: https://youtu.be/HrmJFyvInj8
Learn more: https://sanfrancisco.cbslocal.com/2020/02/05/stanford-research-project-common-jellyfish-bionic-sea-creatures/
and
http://www.hpcadvisorycouncil.com/events/2020/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from the Stanford HPC Conference, Peter Dueben from the European Centre for Medium-Range Weather Forecasts (ECMWF) presents: Machine Learning for Weather Forecasts.
"I will present recent studies that use deep learning to learn the equations of motion of the atmosphere, to emulate model components of weather forecast models and to enhance usability of weather forecasts. I will than talk about the main challenges for the application of deep learning in cutting-edge weather forecasts and suggest approaches to improve usability in the future."
Peter is contributing to the development and optimization of weather and climate models for modern supercomputers. He is focusing on a better understanding of model error and model uncertainty, on the use of reduced numerical precision that is optimised for a given level of model error, on global cloud-resolving simulations with ECMWF's forecast model, and the use of machine learning, and in particular deep learning, to improve the workflow and predictions. Peter graduated in Physics and wrote his PhD thesis at the Max Planck Institute for Meteorology in Germany. He worked as a Postdoc with Tim Palmer at the University of Oxford and took up a position as University Research Fellow of the Royal Society at the European Centre for Medium-Range Weather Forecasts (ECMWF) in 2017.
Watch the video: https://youtu.be/ks3fkRj8Iqc
Learn more: https://www.ecmwf.int/
and
http://www.hpcadvisorycouncil.com/events/2020/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Today RIKEN in Japan announced that the Fugaku supercomputer will be made available for research projects aimed to combat COVID-19.
"Fugaku is currently being installed and is scheduled to be available to the public in 2021. However, faced with the devastating disaster unfolding before our eyes, RIKEN and MEXT decided to make a portion of the computational resources of Fugaku available for COVID-19-related projects ahead of schedule while continuing the installation process.
Fugaku is being developed not only for the progress in science, but also to help build the society dubbed as the “Society 5.0” by the Japanese government, where all people will live safe and comfortable lives. The current initiative to fight against the novel coronavirus is driven by the philosophy behind the development of Fugaku."
Initial Projects
Exploring new drug candidates for COVID-19 by "Fugaku"
Yasushi Okuno, RIKEN / Kyoto University
Prediction of conformational dynamics of proteins on the surface of SARS-CoV-2 using Fugaku
Yuji Sugita, RIKEN
Simulation analysis of pandemic phenomena
Nobuyasu Ito, RIKEN
Fragment molecular orbital calculations for COVID-19 proteins
Yuji Mochizuki, Rikkyo University
In this deck from the Performance Optimisation and Productivity group, Lubomir Riha from IT4Innovations presents: Energy Efficient Computing using Dynamic Tuning.
"We now live in a world of power-constrained architectures and systems and power consumption represents a significant cost factor in the overall HPC system economy. For these reasons, in recent years researchers, supercomputing centers and major vendors have developed new tools and methodologies to measure and optimize the energy consumption of large-scale high performance system installations. Due to the link between energy consumption, power consumption and execution time of an application executed by the final user, it is important for these tools and the methodology used to consider all these aspects, empowering the final user and the system administrator with the capability of finding the best configuration given different high level objectives.
This webinar focused on tools designed to improve the energy-efficiency of HPC applications using a methodology of dynamic tuning of HPC applications, developed under the H2020 READEX project. The READEX methodology has been designed for exploiting the dynamic behaviour of software. At design time, different runtime situations (RTS) are detected and optimized system configurations are determined. RTSs with the same configuration are grouped into scenarios, forming the tuning model. At runtime, the tuning model is used to switch system configurations dynamically.
The MERIC tool, which implements the READEX methodology, is presented. It supports manual or binary instrumentation of the analysed applications to simplify the analysis. This instrumentation is used to identify and annotate the significant regions in the HPC application. Automatic binary instrumentation annotates regions with significant runtime. Manual instrumentation, which can be combined with automatic instrumentation, allows the code developer to annotate regions of particular interest."
Watch the video: https://wp.me/p3RLHQ-lJP
Learn more: https://pop-coe.eu/blog/14th-pop-webinar-energy-efficient-computing-using-dynamic-tuning
and
https://code.it4i.cz/vys0053/meric
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
The document discusses how DDN A3I storage solutions and Nvidia's SuperPOD platform can enable HPC at scale. It provides details on DDN's A3I appliances that are optimized for AI and deep learning workloads and validated for Nvidia's DGX-2 SuperPOD reference architecture. The solutions are said to deliver the fastest performance, effortless scaling, reliability and flexibility for data-intensive workloads.
Versal Premium ACAP for Network and Cloud Accelerationinside-BigData.com
Today Xilinx announced Versal Premium, the third series in the Versal ACAP portfolio. The Versal Premium series features highly integrated, networked and power-optimized cores and the industry’s highest bandwidth and compute density on an adaptable platform. Versal Premium is designed for the highest bandwidth networks operating in thermally and spatially constrained environments, as well as for cloud providers who need scalable, adaptable application acceleration.
Versal is the industry’s first adaptive compute acceleration platform (ACAP), a revolutionary new category of heterogeneous compute devices with capabilities that far exceed those of conventional silicon architectures. Developed on TSMC’s 7-nanometer process technology, Versal Premium combines software programmability with dynamically configurable hardware acceleration and pre-engineered connectivity and security features to enable a faster time-to-market. The Versal Premium series delivers up to 3X higher throughput compared to current generation FPGAs, with built-in Ethernet, Interlaken, and cryptographic engines that enable fast and secure networks. The series doubles the compute density of currently deployed mainstream FPGAs and provides the adaptability to keep pace with increasingly diverse and evolving cloud and networking workloads.
Learn more: https://insidehpc.com/2020/03/xilinx-announces-versal-premium-acap-for-network-and-cloud-acceleration/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Zettar: Moving Massive Amounts of Data across Any Distance Efficientlyinside-BigData.com
In this video from the Rice Oil & Gas Conference, Chin Fang from Zettar presents: Moving Massive Amounts of Data across Any Distance Efficiently.
The objective of this talk is to present two ongoing projects aiming at improving and ensuring highly efficient bulk transferring or streaming of massive amounts of data over digital connections across any distance. It examines the current state of the art, a few very common misconceptions, the differences among the three major types of data movement solutions, a current initiative attempting to improve the data movement efficiency from the ground up, and another multi-stage project that shows how to conduct long-distance, large-scale data movement at speed and scale internationally. Both projects have real-world motivations, e.g. the ambitious data transfer requirements of Linac Coherent Light Source II (LCLS-II) [1], a premier preparation project of the U.S. DOE Exascale Computing Initiative (ECI) [2]. Their immediate goals are described and explained, together with the solution used for each. Findings and early results are reported. Possible future work is outlined.
Watch the video: https://wp.me/p3RLHQ-lBX
Learn more: https://www.zettar.com/
and
https://rice2020oghpc.rice.edu/program-2/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from the Rice Oil & Gas Conference, Bradley McCredie from AMD presents: Scaling TCO in a Post Moore's Law Era.
"While foundries bravely drive forward to overcome the technical and economic challenges posed by scaling to 5nm and beyond, Moore’s law alone can provide only a fraction of the performance / watt and performance / dollar gains needed to satisfy the demands of today’s high performance computing and artificial intelligence applications. To close the gap, multiple strategies are required. First, new levels of innovation and design efficiency will supplement technology gains to continue to deliver meaningful improvements in SoC performance. Second, heterogenous compute architectures will create x-factor increases of performance efficiency for the most critical applications. Finally, open software frameworks, APIs, and toolsets will enable broad ecosystems of application level innovation."
Watch the video:
Learn more: http://amd.com
and
https://rice2020oghpc.rice.edu/program-2/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...inside-BigData.com
In this deck from FOSDEM 2020, Frank McQuillan from Pivotal presents: Efficient Model Selection for Deep Neural Networks on Massively Parallel Processing Databases.
"In this session we will present an efficient way to train many deep learning model configurations at the same time with Greenplum, a free and open source massively parallel database based on PostgreSQL. The implementation involves distributing data to the workers that have GPUs available and hopping model state between those workers, without sacrificing reproducibility or accuracy. Then we apply optimization algorithms to generate and prune the set of model configurations to try.
Deep neural networks are revolutionizing many machine learning applications, but hundreds of trials may be needed to generate a good model architecture and associated hyperparameters. This is the challenge of model selection. It is time consuming and expensive, especially if you are only training one model at a time.
Massively parallel processing databases can have hundreds of workers, so can you use this parallel compute architecture to address the challenge of model selection for deep nets, in order to make it faster and cheaper?
It’s possible!
We will demonstrate results from this project using a version of Hyperband, which is a well known hyperparameter optimization algorithm, and the deep learning frameworks Keras and TensorFlow, all running on Greenplum database using Apache MADlib. Other topics will include architecture, scalability results and bright opportunities for the future."
Watch the video: https://wp.me/p3RLHQ-lsQ
Learn more: https://fosdem.org/2020/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck, Huihuo Zheng from Argonne National Laboratory presents: Data Parallel Deep Learning.
"The Argonne Training Program on Extreme-Scale Computing (ATPESC) provides intensive, two weeks of training on the key skills, approaches, and tools to design, implement, and execute computational science and engineering applications on current high-end computing systems and the leadership-class computing systems of the future."
Watch the video: https://wp.me/p3RLHQ-lsl
Learn more: https://extremecomputingtraining.anl.gov/archive/atpesc-2019/agenda-2019/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from DOE CSGF 2019, Chelsea Harris from the University of Michigan presents: Making Supernovae with Jets.
"Supernovae are the explosions of stars. One reason they are fundamentally important is they create and disperse elements heavier than carbon throughout the universe. Different stars explode in different ways, but the most common supernova type is from massive stars (greater than 10 times the mass of the sun) whose cores collapse to form a neutron star or black hole – "core collapse supernovae" or CC SNe. Even among massive stars, though, there are differences that can affect the outcome of core collapse. I am specifically interested in progenitors whose cores are rotating and magnetic (magnetorotational). Such cores may experience instabilities after collapse that launch a fast jet, which could rescue r-process elements formed near the proto-neutron star from destruction. The instabilities can also add power to the explosion and relieve tension between observations and theory. In addition to running simulations with existing FLASH code modules, I am developing a FLASH hydrodynamics module, SparkJoy, to perform these simulations at high order. These projects are part of a DOE INCITE project to explore progenitor effects on CC SNe and of the DOE SciDAC program "Towards Exascale Astrophysics of Mergers and Supernovae," a nationwide collaboration of supernova theorists unprecedented in its collaborative scale. Research like mine is made much easier at Michigan State University through the Department of Computational Mathematics, Science, and Engineering which, like the DOE CSGF, brings together members from different areas to share knowledge and strengthen each other's research."
Watch the video: https://wp.me/p3RLHQ-lr0
Learn more: https://www.krellinst.org/csgf/conf/2019/video/charris
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
The document discusses dense linear algebra solvers and algorithms. It provides an overview of existing software for dense linear algebra including LINPACK, EISPACK, LAPACK, ScaLAPACK, PLASMA, and MAGMA. It then discusses challenges with dense linear algebra on modern hardware including distributed memory, heterogeneity, and the high cost of communication. It introduces tile algorithms as an approach to address these challenges compared to traditional LAPACK algorithms.
Scientific Applications and Heterogeneous Architecturesinside-BigData.com
This document discusses extending high-performance computing (HPC) to integrate data analytics and connect to edge computing. It presents two use cases: 1) augmenting molecular dynamics workflows with in situ and in transit analytics to capture protein structural information, and 2) connecting HPC to sensors at the edge for precision farming applications involving soil moisture data prediction. The document outlines approaches for building closed-loop workflows that integrate simulation, data generation, analytics, and data feedback between HPC and edge resources to enable real-time decision making.
In this deck from ATPESC 2019, Yunong Shi from the University of Chicago presents: SW/HW co-design for near-term quantum computing.
"The Argonne Training Program on Extreme-Scale Computing (ATPESC) provides intensive, two weeks of training on the key skills, approaches, and tools to design, implement, and execute computational science and engineering applications on current high-end computing systems and the leadership-class computing systems of the future."
Watch the video: https://wp.me/p3RLHQ-lpv
Learn more: https://extremecomputingtraining.anl.gov/archive/atpesc-2019/agenda-2019/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from ATPESC 2019, James Moawad and Greg Nash from Intel present: FPGAs and Machine Learning.
"Neural networks are inspired by biological systems, in particular the human brain. Through the combination of powerful computing resources and novel architectures for neurons, neural networks have achieved state-of-the-art results in many domains such as computer vision and machine translation. FPGAs are a natural choice for implementing neural networks as they can handle different algorithms in computing, logic, and memory resources in the same device. Faster performance comparing to competitive implementations as the user can hardcore operations into the hardware. Software developers can use the OpenCL device C level programming standard to target FPGAs as accelerators to standard CPUs without having to deal with hardware level design."
Watch the video: https://wp.me/p3RLHQ-lnc
Learn more: https://extremecomputingtraining.anl.gov/archive/atpesc-2019/agenda-2019/
and
https://www.intel.com/content/www/us/en/products/programmable/fpga.html
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
2. High-End Computing (HEC): ExaFlop & ExaByte
• ExaFlop & HPC: 100-200 PFlops in 2016-2018; 1 EFlops in 2020-2024?
• ExaByte & BigData: 10K-20K EBytes in 2016-2018; 40K EBytes in 2020?
[Figure 1. Source: IDC's Digital Universe Study, sponsored by EMC, December 2012: "Within these broad outlines of the digital universe are some singularities worth noting. First, while the portion of the digital universe holding potential analytic value is growing, only a tiny fraction of territory has been explored. IDC estimates that by 2020, as much as 33% of the digital ..."]
4. Drivers of Modern HPC Cluster Architectures
• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
• Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)
[Figure labels: Multi-core Processors; High-Performance Interconnects - InfiniBand (<1 usec latency, 100 Gbps bandwidth); Accelerators / Coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip); SSD, NVMe-SSD, NVRAM. Example systems pictured: Tianhe-2, Titan, Stampede, Tianhe-1A]
5. Large-scale InfiniBand Installations
• 235 IB Clusters (47%) in the Nov' 2015 Top500 list (http://www.top500.org)
• Installations in the Top 50 (21 systems):
  – 462,462 cores (Stampede) at TACC (10th)
  – 185,344 cores (Pleiades) at NASA/Ames (13th)
  – 72,800 cores Cray CS-Storm in US (15th)
  – 72,800 cores Cray CS-Storm in US (16th)
  – 265,440 cores SGI ICE at Tulip Trading Australia (17th)
  – 124,200 cores (Topaz) SGI ICE at ERDC DSRC in US (18th)
  – 72,000 cores (HPC2) in Italy (19th)
  – 152,692 cores (Thunder) at AFRL/USA (21st)
  – 147,456 cores (SuperMUC) in Germany (22nd)
  – 86,016 cores (SuperMUC Phase 2) in Germany (24th)
  – 76,032 cores (Tsubame 2.5) at Japan/GSIC (25th)
  – 194,616 cores (Cascade) at PNNL (27th)
  – 76,032 cores (Makman-2) at Saudi Aramco (32nd)
  – 110,400 cores (Pangea) in France (33rd)
  – 37,120 cores (Lomonosov-2) at Russia/MSU (35th)
  – 57,600 cores (SwiftLucy) in US (37th)
  – 55,728 cores (Prometheus) at Poland/Cyfronet (38th)
  – 50,544 cores (Occigen) at France/GENCI-CINES (43rd)
  – 76,896 cores (Salomon) SGI ICE in Czech Republic (47th)
  – and many more!
6. Two Major Categories of Applications
• Scientific Computing
  – Message Passing Interface (MPI), including MPI + OpenMP, is the dominant programming model
  – Many discussions towards Partitioned Global Address Space (PGAS)
    • UPC, OpenSHMEM, CAF, etc.
  – Hybrid Programming: MPI + PGAS (OpenSHMEM, UPC)
• Big Data/Enterprise/Commercial Computing
  – Focuses on large data and data analysis
  – Hadoop (HDFS, HBase, MapReduce)
  – Spark is emerging for in-memory computing
  – Memcached is also used for Web 2.0
7. Towards Exascale System (Today and Target)
Systems | 2016 (Tianhe-2) | 2020-2024 | Difference (Today & Exascale)
System peak | 55 PFlop/s | 1 EFlop/s | ~20x
Power | 18 MW (3 Gflops/W) | ~20 MW (50 Gflops/W) | O(1), ~15x
System memory | 1.4 PB (1.024 PB CPU + 0.384 PB CoP) | 32-64 PB | ~50x
Node performance | 3.43 TF/s (0.4 CPU + 3 CoP) | 1.2 or 15 TF | O(1)
Node concurrency | 24-core CPU + 171-core CoP | O(1k) or O(10k) | ~5x - ~50x
Total node interconnect BW | 6.36 GB/s | 200-400 GB/s | ~40x - ~60x
System size (nodes) | 16,000 | O(100,000) or O(1M) | ~6x - ~60x
Total concurrency | 3.12M (12.48M threads, 4/core) | O(billion) for latency hiding | ~100x
MTTI | Few/day | Many/day | O(?)
Courtesy: Prof. Jack Dongarra
8. Basic Design Challenges for Exascale Systems
• Energy and Power Challenge
  – Hard to solve power requirements for data movement
• Memory and Storage Challenge
  – Hard to achieve high capacity and high data rate
• Concurrency and Locality Challenge
  – Management of a very large amount of concurrency (billion threads)
• Resiliency Challenge
  – Low voltage devices (for low power) introduce more faults
9. Parallel Programming Models Overview
[Diagram: Shared Memory Model (SHMEM, DSM): P1, P2, P3 over a single shared memory. Distributed Memory Model (MPI - Message Passing Interface): P1, P2, P3 each with its own memory. Partitioned Global Address Space (PGAS: Global Arrays, UPC, Chapel, X10, CAF, ...): P1, P2, P3 with per-process memories presented as a logical shared memory.]
• Programming models provide abstract machine models
• Models can be mapped on different types of systems
  – e.g. Distributed Shared Memory (DSM), MPI within a node, etc.
• PGAS models and Hybrid MPI+PGAS models are gradually receiving importance (a small OpenSHMEM sketch follows this slide)
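To make the PGAS column of the diagram concrete, here is a minimal sketch in C using the OpenSHMEM API, one of the PGAS models named on this slide. It assumes an OpenSHMEM 1.2-or-later library (for example the one bundled with MVAPICH2-X); the buffer name and the values are illustrative only and not taken from the deck.

```c
/* Minimal OpenSHMEM sketch: every PE allocates a slot in the symmetric heap,
 * and PE 0 writes directly into the logically shared address space of PE 1.
 * Illustrative only; assumes an OpenSHMEM 1.2+ library. */
#include <stdio.h>
#include <shmem.h>

int main(void)
{
    shmem_init();
    int me   = shmem_my_pe();    /* my processing element (PE) id */
    int npes = shmem_n_pes();    /* total number of PEs           */

    /* Symmetric allocation: the same remotely accessible buffer on every PE */
    long *slot = (long *)shmem_malloc(sizeof(long));
    *slot = -1;
    shmem_barrier_all();

    if (me == 0 && npes > 1) {
        long value = 42;
        /* One-sided put: no matching receive is needed on PE 1 */
        shmem_long_put(slot, &value, 1, 1);
    }
    shmem_barrier_all();         /* ensure the put is complete and visible */

    printf("PE %d of %d sees slot = %ld\n", me, npes, *slot);

    shmem_free(slot);
    shmem_finalize();
    return 0;
}
```

The key point is the one-sided put: PE 0 deposits data into PE 1's slot without a matching receive call, which is the "logical shared memory" behaviour sketched in the diagram.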
10. MPI Overview and History
• Message Passing Library standardized by the MPI Forum
  – C and Fortran
• Goal: portable, efficient and flexible standard for writing parallel applications (a minimal C example follows this slide)
• Not an IEEE or ISO standard, but widely considered the "industry standard" for HPC applications
• Evolution of MPI
  – MPI-1: 1994
  – MPI-2: 1996
  – MPI-3.0: 2008 - 2012, standardized on September 21, 2012
  – MPI-3.1: 2012 - 2015, standardized on June 4, 2015
  – Next plan is for MPI 4.0
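For readers who have not seen the C binding, a minimal MPI program is sketched below. It is generic MPI usage, not specific to MVAPICH2 or any other library; the payload value and tag are arbitrary.

```c
/* Minimal two-sided MPI example in C: rank 0 sends one integer to rank 1. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0 && size > 1) {
        int payload = 123;
        MPI_Send(&payload, 1, MPI_INT, /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int payload = 0;
        MPI_Recv(&payload, 1, MPI_INT, /*source=*/0, /*tag=*/0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", payload);
    }

    MPI_Finalize();
    return 0;
}
```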
16. MPI-3 Non-blocking Collective (NBC) Operations
• Enables overlap of computation with communication
• Non-blocking calls do not match blocking collective calls
  – MPI may use different algorithms for blocking and non-blocking collectives
  – Blocking collectives: optimized for latency
  – Non-blocking collectives: optimized for overlap
• A process calling an NBC operation
  – Schedules the collective operation and immediately returns
  – Executes application computation code
  – Waits for the end of the collective
• Communication progress is driven by
  – Application code through MPI_Test
  – Network adapter (HCA) with hardware support
  – Dedicated processes / threads in the MPI library
• There is a non-blocking equivalent for each blocking operation
  – Has an "I" in the name (MPI_Bcast -> MPI_Ibcast; MPI_Reduce -> MPI_Ireduce); a usage sketch follows this slide
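A hedged sketch of the pattern described above, using MPI_Ibcast to overlap a broadcast with independent computation. The do_independent_work routine and the buffer sizes are placeholders, and how much overlap is actually achieved depends on the library's progress mechanisms listed on this slide.

```c
/* Non-blocking collective sketch: start a broadcast, compute while it is in
 * flight, and complete it with MPI_Test/MPI_Wait. Buffer size and the local
 * work are placeholders for illustration. */
#include <mpi.h>

#define N 1024

static void do_independent_work(double *local, int n)
{
    for (int i = 0; i < n; i++)        /* placeholder computation that does */
        local[i] = local[i] * 2.0 + 1;  /* not touch the broadcast buffer    */
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    double bcast_buf[N], local_buf[N];
    for (int i = 0; i < N; i++) { bcast_buf[i] = i; local_buf[i] = i; }

    MPI_Request req;
    /* Non-blocking equivalent of MPI_Bcast: schedules the collective and returns */
    MPI_Ibcast(bcast_buf, N, MPI_DOUBLE, /*root=*/0, MPI_COMM_WORLD, &req);

    do_independent_work(local_buf, N);  /* overlap: compute while the collective
                                           progresses in the background        */

    /* Optionally poke the progress engine, then wait for completion */
    int done = 0;
    MPI_Test(&req, &done, MPI_STATUS_IGNORE);
    if (!done)
        MPI_Wait(&req, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}
```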
18. HPCAC-Switzerland (Mar ‘16) 18 Network Based CompuNng Laboratory
• MPI 3.1 was approved on June 4, 2015
– The specification is available from: http://mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf
• Major features and enhancements:
– Corrections to the Fortran bindings introduced in MPI-3.0
– New functions added, including routines to manipulate MPI_Aint values in a portable manner
– Nonblocking collective I/O routines
– Routines to get the index value by name for MPI_T performance and control variables
MPI-3.1 Enhancements
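A hedged sketch of the nonblocking collective I/O routines added in MPI-3.1 (illustrative C; the file name and buffer are made up):

/* Nonblocking collective write (MPI-3.1), overlapping I/O with computation. */
#include <mpi.h>

void write_async(MPI_Comm comm, const double *data, int count)
{
    MPI_File fh;
    MPI_Request req;

    MPI_File_open(comm, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* All ranks participate, but the call returns immediately */
    MPI_File_iwrite_all(fh, data, count, MPI_DOUBLE, &req);

    /* ... independent computation could be overlapped here ... */

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}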
21. HPCAC-Switzerland (Mar ‘16) 21 Network Based Computing Laboratory
• UPC: Unified Parallel C - PGAS based language extension to C
– An ISO C99-based language providing a uniform programming model for both shared- and distributed-memory hardware to support HPC
– UPC = UPC translator + C compiler + UPC runtime
• Coarray Fortran (CAF): Language-level PGAS support in Fortran
– An extension to Fortran to support global shared arrays (coarrays) in parallel Fortran applications
– CAF = CAF compiler + CAF runtime (libcaf)
– Basic support in Fortran 2008 and extended support for collectives in Fortran 2015
• UPC++: An Object-Oriented PGAS Programming Model
– A compiler-free PGAS programming model in the context of C++
– Built on top of C++ standard templates and runtime libraries
– Extends UPC’s programming idioms
– Registers tasks for asynchronous execution
UPC, CAF and UPC++
22. HPCAC-Switzerland (Mar ‘16) 22 Network Based Computing Laboratory
• Hierarchical architectures with multiple address spaces
• (MPI + PGAS) Model
– MPI across address spaces
– PGAS within an address space
• MPI is good at moving data between address spaces
• Within an address space, MPI can interoperate with other shared memory programming
models
• Applications can have kernels with different communication patterns
• Can benefit from different models
• Re-writing complete applications can be a huge effort
• Port critical kernels to the desired model instead
MPI+PGAS for Exascale Architectures and Applications
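A hedged sketch of the (MPI + PGAS) style described above, using OpenSHMEM for the PGAS part (illustrative only; initialization order and interoperability details are runtime-specific):

/* Hybrid MPI + OpenSHMEM sketch (illustrative; not an application kernel). */
#include <mpi.h>
#include <shmem.h>

static long counter = 0;   /* symmetric (PGAS) variable, one copy per PE */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    shmem_init();

    int me  = shmem_my_pe();
    int npe = shmem_n_pes();

    /* PGAS-style one-sided put into the logically shared address space */
    shmem_long_p(&counter, (long)me, (me + 1) % npe);
    shmem_barrier_all();

    /* MPI collective for the message-passing part of the kernel */
    long total = 0;
    MPI_Allreduce(&counter, &total, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

    shmem_finalize();
    MPI_Finalize();
    return 0;
}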
25. HPCAC-Switzerland (Mar ‘16) 25 Network Based Computing Laboratory
• Scalability from millions to billions of processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
– Scalable job start-up
• Scalable Collective communication
– Offload
– Non-blocking
– Topology-aware
• Balancing intra-node and inter-node communication for next-generation nodes (128-1024 cores)
– Multiple end-points per node
• Support for efficient multi-threading
• Integrated Support for GPGPUs and Accelerators
• Fault-tolerance/resiliency
• QoS support for communication and I/O
• Support for Hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, CAF, …)
• Virtualization
• Energy-Awareness
Broad Challenges in Designing Communication Libraries for (MPI+X) at Exascale
26. HPCAC-Switzerland (Mar ‘16) 26 Network Based Computing Laboratory
• Extreme Low Memory Footprint
– Memory per core continues to decrease
• D-L-A Framework
– Discover
• Overall network topology (fat-tree, 3D, …) and the network topology of the processes for a given job
• Node architecture, health of network and node
– Learn
• Impact on performance and scalability
• Potential for failure
– Adapt
• Internal protocols and algorithms
• Process mapping
• Fault-tolerance solutions
– Low-overhead techniques while delivering performance, scalability and fault-tolerance
Additional Challenges for Designing Exascale Software Libraries
27. HPCAC-Switzerland (Mar ‘16) 27 Network Based Computing Laboratory
Overview of the MVAPICH2 Project
• High Performance open-source MPI Library for InfiniBand, 10-40Gig/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), Available since 2002
– MVAPICH2-X (MPI + PGAS), Available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014
– Support for Virtualization (MVAPICH2-Virt), Available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015
– Used by more than 2,525 organizations in 77 countries
– More than 356,000 (> 0.36 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (Nov ‘15 ranking)
• 10th ranked 519,640-core cluster (Stampede) at TACC
• 13th ranked 185,344-core cluster (Pleiades) at NASA
• 25th ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology and many others
– Available with software stacks of many vendors and Linux distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
– System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->
– Stampede at TACC (10th in Nov ’15, 519,640 cores, 5.168 PFlops)
32. HPCAC-Switzerland (Mar ‘16) 32 Network Based Computing Laboratory
• Scalability from millions to billions of processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)
– Support for advanced IB mechanisms (UMR and ODP)
– Extremely minimal memory footprint
– Scalable job start-up
• Collective communication
• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)
• InfiniBand Network Analysis and Monitoring (INAM)
• Integrated Support for GPGPUs
• Integrated Support for MICs
• Virtualization (SR-IOV and Containers)
• Energy-Awareness
Overview of a Few Challenges Being Addressed by the MVAPICH2 Project for Exascale
36. HPCAC-Switzerland (Mar ‘16) 36 Network Based Computing Laboratory
• Introduced by Mellanox to support direct local and remote noncontiguous memory access
– Avoid packing at sender and unpacking at receiver
• Available with MVAPICH2-X 2.2b
User-mode Memory Registration (UMR)
[Charts: UMR vs. Default latency - Small & Medium Message Latency (4K-1M bytes) and Large Message Latency (2M-16M bytes); Latency (us) vs. Message Size (Bytes)]
Connect-IB (54 Gbps): 2.8 GHz Dual Ten-core (IvyBridge) Intel PCI Gen3 with Mellanox IB FDR switch
M. Li, H. Subramoni, K. Hamidouche, X. Lu and D. K. Panda, High Performance MPI Datatype Support with User-mode Memory Registration: Challenges, Designs and Benefits, CLUSTER 2015
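For context, a hedged sketch of the kind of noncontiguous MPI datatype transfer that UMR can accelerate by avoiding sender-side packing and receiver-side unpacking (illustrative C only; the matrix size N and the peer rank are made up):

/* Send one column of a row-major NxN matrix as a noncontiguous datatype.
 * With hardware support such as UMR, the HCA can gather/scatter such data
 * directly instead of the library packing/unpacking it (illustrative sketch). */
#include <mpi.h>

#define N 1024

void send_column(double (*a)[N], int col, int peer, MPI_Comm comm)
{
    MPI_Datatype column;

    /* N blocks of 1 double each, separated by a stride of N doubles */
    MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    MPI_Send(&a[0][col], 1, column, peer, /*tag=*/0, comm);

    MPI_Type_free(&column);
}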
37. HPCAC-Switzerland (Mar ‘16) 37 Network Based Computing Laboratory
• Introduced by Mellanox to support direct remote memory access without pinning
• Memory regions paged in/out dynamically by the HCA/OS
• Size of registered buffers can be larger than physical memory
• Will be available in future MVAPICH2 release
On-Demand Paging (ODP)
Connect-IB (54 Gbps): 2.6 GHz Dual Octa-core (SandyBridge) Intel PCI Gen3 with Mellanox IB FDR switch
[Charts: Graph500 Pin-down Buffer Sizes (Pin-down Buffer Size in MB vs. Number of Processes: 16, 32, 64) and Graph500 BFS Kernel (Execution Time in s vs. Number of Processes: 16, 32, 64), comparing Pin-down and ODP]
38. HPCAC-Switzerland (Mar ‘16) 38 Network Based Computing Laboratory
Minimizing Memory Footprint by Dynamic Connected (DC) Transport
[Diagram: four nodes connected through the IB network (Node 0: P0, P1; Node 1: P2, P3; Node 2: P4, P5; Node 3: P6, P7)]
• Constant connection cost (one QP for any peer)
• Full feature set (RDMA, atomics, etc.)
• Separate objects for send (DC Initiator) and receive (DC Target)
– DC Target identified by “DCT Number”
– Messages routed with (DCT Number, LID)
– Requires the same “DC Key” to enable communication
• Available since MVAPICH2-X 2.2a
[Charts: NAMD Apoa1 (large data set) normalized execution time vs. Number of Processes (160, 320, 620), and Memory Footprint for Alltoall, Connection Memory (KB) vs. Number of Processes (80, 160, 320, 640), comparing RC, DC-Pool, UD and XRC]
H. Subramoni, K. Hamidouche, A. Venkatesh, S. Chakraborty and D. K. Panda, Designing MPI Library with Dynamic Connected Transport (DCT) of InfiniBand: Early Experiences. IEEE International Supercomputing Conference (ISC ’14)
39. HPCAC-Switzerland (Mar ‘16) 39 Network Based Computing Laboratory
• Near-constant MPI and OpenSHMEM initialization time at any process count
• 10x and 30x improvements in startup time of MPI and OpenSHMEM, respectively, at 16,384 processes
• Memory consumption for remote endpoint information reduced by O(processes per node)
• 1 GB of memory saved per node with 1M processes and 16 processes per node
Towards High Performance and Scalable Startup at Exascale
[Diagram: job startup performance vs. memory required to store endpoint information; P = PGAS (state of the art), M = MPI (state of the art), O = PGAS/MPI optimized; techniques: PMIX_Ring, PMIX_Ibarrier, PMIX_Iallgather, shmem-based PMI, on-demand connection]
On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI. S. Chakraborty, H. Subramoni, J. Perkins, A. A. Awan, and D. K. Panda, 20th International Workshop on High-level Parallel Programming Models and Supportive Environments (HIPS ’15)
PMI Extensions for Scalable MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, J. Perkins, M. Arnold, and D. K. Panda, Proceedings of the 21st European MPI Users’ Group Meeting (EuroMPI/Asia ’14)
Non-blocking PMI Extensions for Fast MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, A. Venkatesh, J. Perkins, and D. K. Panda, 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid ’15)
SHMEMPMI – Shared Memory based PMI for Improved Performance and Scalability. S. Chakraborty, H. Subramoni, J. Perkins, and D. K. Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid ’16), accepted for publication
40. HPCAC-Switzerland (Mar ‘16) 40 Network Based Computing Laboratory
• SHMEMPMI allows MPI processes to directly read remote endpoint (EP) information from the process manager through shared memory segments
• Only a single copy per node: an O(processes per node) reduction in memory usage
• Estimated savings of 1 GB per node with 1 million processes and 16 processes per node
• Up to 1,000 times faster PMI Gets compared to the default design. Will be available in MVAPICH2 2.2RC1.
Process Management Interface over Shared Memory (SHMEMPMI)
TACC Stampede - Connect-IB (54 Gbps): 2.6 GHz Quad Octa-core (SandyBridge) Intel PCI Gen3 with Mellanox IB FDR
SHMEMPMI – Shared Memory Based PMI for Performance and Scalability. S. Chakraborty, H. Subramoni, J. Perkins, and D. K. Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid ’16), accepted for publication
[Charts: Time Taken by one PMI_Get (milliseconds) vs. Number of Processes per Node (1-32), Default vs. SHMEMPMI (estimated 1000x, actual 16x); Memory Usage per Node (MB) for Remote EP Information vs. Number of Processes per Job (16-1M), Fence/Allgather with Default vs. Shmem]
42. HPCAC-Switzerland (Mar ‘16) 42 Network Based Computing Laboratory
Co-Design with MPI-3 Non-Blocking Collectives and Collective Offload Co-Direct Hardware (Available since MVAPICH2-X 2.2a)
• Modified HPL with Offload-Bcast does up to 4.5% better than the default version (512 processes)
• Modified P3DFFT with Offload-Alltoall does up to 17% better than the default version (128 processes)
• Modified Preconditioned Conjugate Gradient (PCG) solver with Offload-Allreduce does up to 21.8% better than the default version
[Charts: application run-time (s) vs. data size (512, 600, 720, 800); run-time (s) vs. number of processes (64-512) for PCG-Default vs. Modified-PCG-Offload (21.8%); HPL normalized performance vs. HPL problem size (N) as % of total memory (10-70) for HPL-Offload, HPL-1ring and HPL-Host (4.5%); 17%]
K. Kandalla, et al., High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT, ISC 2011
K. Kandalla, et al., Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HotI 2011
K. Kandalla, et al., Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS ’12
K. Kandalla, A. Buluc, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, and D. K. Panda, Can Network-Offload based Non-Blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms?, IWPAPS ’12
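A hedged sketch of the overlap pattern that Offload-Allreduce exploits in a CG-type solver (illustrative C; dot_local() and local_spmv() are hypothetical helpers): the reduction is issued non-blockingly so the library or offload hardware can progress it while independent local work proceeds.

/* CG-style step overlapping an allreduce with independent work
 * (illustrative sketch; dot_local() and local_spmv() are hypothetical). */
#include <mpi.h>

double dot_local(const double *x, const double *y, int n);
void   local_spmv(const double *x, double *y, int n);

double overlapped_dot(const double *x, const double *y, double *z, int n,
                      MPI_Comm comm)
{
    double local = dot_local(x, y, n), global = 0.0;
    MPI_Request req;

    /* Issue the reduction; with collective offload the HCA can progress it */
    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm, &req);

    /* Independent work that does not need the reduced value */
    local_spmv(x, z, n);

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    return global;
}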
43. HPCAC-Switzerland (Mar ‘16) 43 Network Based Computing Laboratory
Network-Topology-Aware Placement of Processes
• Can we design a highly scalable network topology detection service for IB?
• How do we design the MPI communication library in a network-topology-aware manner to efficiently leverage the topology information generated by our service?
• What are the potential benefits of using a network-topology-aware MPI library on the performance of parallel scientific applications?
[Charts: overall performance and split-up of physical communication for MILC on Ranger; performance for varying system sizes; Default vs. Topo-Aware for a 2048-core run (15%)]
H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody, and D. K. Panda, Design of a Scalable InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes, SC ’12. Best Paper and Best Student Paper Finalist
• Reduce network topology discovery time from O(Nhosts²) to O(Nhosts)
• 15% improvement in MILC execution time at 2048 cores
• 15% improvement in Hypre execution time at 1024 cores
47. HPCAC-Switzerland (Mar ‘16) 47 Network Based Computing Laboratory
MiniMD – Total Execution Time
• Hybrid design performs better than the MPI implementation
• 1,024 processes
- 17% improvement over MPI version
• Strong Scaling
Input size: 128 * 128 * 128
[Charts: Performance and Strong Scaling; Time (ms) vs. # of Cores (256, 512, 1,024) for Hybrid-Barrier, MPI-Original and Hybrid-Advanced; 17% improvement at 1,024 cores]
M. Li, J. Lin, X. Lu, K. Hamidouche, K. Tomko and D. K. Panda, Scalable MiniMD Design with Hybrid MPI and OpenSHMEM, OpenSHMEM User Group Meeting (OUG ’14), held in conjunction with the 8th International Conference on Partitioned Global Address Space Programming Models (PGAS ’14)
48. HPCAC-Switzerland (Mar ‘16) 48 Network Based Computing Laboratory
Hybrid MPI+UPC NAS-FT
• Modified NAS FT UPC all-to-all pattern using MPI_Alltoall
• Truly hybrid program
• For FT (Class C, 128 processes)
• 34% improvement over UPC-GASNet
• 30% improvement over UPC-OSU
[Chart: Time (s) vs. NAS Problem Size – System Size (B-64, C-64, B-128, C-128) for UPC-GASNet, UPC-OSU and Hybrid-OSU (34%)]
J. Jose, M. Luo, S. Sur and D. K. Panda, Unifying UPC and MPI Runtimes: Experience with MVAPICH, Fourth Conference on Partitioned Global Address Space Programming Model (PGAS ’10), October 2010
Hybrid MPI + UPC support available since MVAPICH2-X 1.9 (2012)
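The all-to-all exchange that the modified NAS-FT performs with MPI_Alltoall might look like this hedged sketch in C (illustrative only; the per-rank block size and the surrounding UPC data layout are made up):

/* All-to-all exchange of the kind used by the hybrid NAS-FT via MPI_Alltoall
 * (illustrative sketch; block_per_rank is made up). */
#include <mpi.h>

void exchange_blocks(const double *sendbuf, double *recvbuf,
                     int block_per_rank, MPI_Comm comm)
{
    /* Each rank sends block_per_rank doubles to, and receives the same
       amount from, every other rank in a single collective call. */
    MPI_Alltoall(sendbuf, block_per_rank, MPI_DOUBLE,
                 recvbuf, block_per_rank, MPI_DOUBLE, comm);
}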
50. HPCAC-Switzerland (Mar ‘16) 50 Network Based Computing Laboratory
Overview of OSU INAM
• A network monitoring and analysis tool that is capable of analyzing traffic on the InfiniBand network with inputs from the MPI runtime
– http://mvapich.cse.ohio-state.edu/tools/osu-inam/
– http://mvapich.cse.ohio-state.edu/userguide/osu-inam/
• Monitors IB clusters in real time by querying various subnet management entities and gathering input from the MPI runtimes
• Capability to analyze and profile node-level, job-level and process-level activities for MPI communication (point-to-point, collectives and RMA)
• Ability to filter data based on type of counters using a drop-down list
• Remotely monitor various metrics of MPI processes at user-specified granularity
• “Job Page” to display jobs in ascending/descending order of various performance metrics in conjunction with MVAPICH2-X
• Visualize the data transfer happening in a “live” or “historical” fashion for the entire network, a job, or a set of nodes
55. HPCAC-Switzerland (Mar ‘16) 55 Network Based Computing Laboratory
List of Supported Switch Counters
• The following counters are queried from the InfiniBand Switches
• Xmit Data
– Total number of data octets, divided by 4, transmitted on all VLs from the port
– This includes all octets between (and not including) the start of packet delimiter and the VCRC, and
may include packets containing errors
– Excludes all link packets.
• Rcv Data
– Total number of data octets, divided by 4, received on all VLs from the port
– This includes all octets between (and not including) the start of packet delimiter and the VCRC, and
may include packets containing errors
– Excludes all link packets.
• Max [Xmit Data/Rcv Data]: Maximum of the two values above
57. HPCAC-Switzerland (Mar ‘16) 57 Network Based Computing Laboratory
List of Supported MPI Process Level Counters (Cont.)
• Max [Coll Bytes Sent/Rcvd]
– Maximum of the two values above
• RMA Bytes Sent
– Total number of bytes transmitted as part of MPI RMA operations
– Note that due to the nature of the RMA operations, bytes received for RMA operations cannot be counted
• RC VBUF
– The number of internal communication buffers used for reliable connection (RC)
• UD VBUF
– The number of internal communication buffers used for unreliable datagram (UD)
• VM Size
– Total number of bytes used by the program for its virtual memory
• VM Peak
– Maximum number of virtual memory bytes for the program
• VM RSS
– The number of bytes resident in the memory (Resident set size)
• VM HWM
– The maximum number of bytes that have been resident in memory (peak resident set size, or high-water mark)
58. HPCAC-Switzerland (Mar ‘16) 58 Network Based Computing Laboratory
List of Supported Network Error Counters (Cont.)
• XmtDiscards
– Total number of outbound packets discarded by the port because the port is down or congested. Reasons for this include:
• Output port is not in the active state
• Packet length exceeded NeighborMTU
• Switch Lifetime Limit exceeded
• Switch HOQ Lifetime Limit exceeded. This may also include packets discarded while in the VLStalled state.
• XmtConstraintErrors
– Total number of packets not transmitted from the switch physical port for the following reasons:
• FilterRawOutbound is true and packet is raw
• PartitionEnforcementOutbound is true and packet fails partition key check or IP version check
• RcvConstraintErrors
– Total number of packets not received from the switch physical port for the following reasons:
• FilterRawInbound is true and packet is raw
• PartitionEnforcementInbound is true and packet fails partition key check or IP version check
• LinkIntegrityErrors
– The number of times that the count of local physical errors exceeded the threshold specified by LocalPhyErrors
• ExcBufOverrunErrors
– The number of times that OverrunErrors consecutive flow control update periods occurred, each having at least one overrun error
• VL15Dropped: Number of incoming VL15 packets dropped due to resource limitations (e.g., lack of buffers) in the port
59. HPCAC-Switzerland (Mar ‘16) 59 Network Based Computing Laboratory
List of Supported Network Error Counters
• The following error counters are available both at switch and process level:
• SymbolErrors
– Total number of minor link errors detected on one or more physical lanes
• LinkRecovers
– Total number of times the Port Training state machine has successfully completed the link error recovery process
• LinkDowned
– Total number of times the Port Training state machine has failed the link error recovery process and downed the link
• RcvErrors
– Total number of packets containing an error that were received on the port. These errors include:
• Local physical errors
• Malformed data packet errors
• Malformed link packet errors
• Packets discarded due to buffer overrun
• RcvRemotePhysErrors
– Total number of packets marked with the EBP delimiter received on the port.
• RcvSwitchRelayErrors
– Total number of packets received on the port that were discarded because they could not be forwarded by the switch relay