Optimize Single Particle Orbital (SPO) Evaluations Based on B-splines – Intel® Software
Orbital representations based on B-splines are widely used in quantum Monte Carlo (QMC) simulations of solids, where their evaluation has historically taken as much as 50 percent of the total runtime. Random access to a large four-dimensional array makes it challenging to use the caches and wide vector units of modern CPUs efficiently. We therefore present node-level optimizations of B-spline evaluations on multicore and manycore shared-memory processors.
To increase single instruction, multiple data (SIMD) efficiency and bandwidth utilization, we first apply a data layout transformation from an array of structures (AoS) to a structure of arrays (SoA). Then, by blocking SoA objects, we optimize cache reuse and obtain sustained throughput over a range of problem sizes. We implement efficient nested threading in the B-spline orbital evaluation kernels, paving the way toward strong scaling of QMC simulations and yielding further performance improvements. Finally, we employ roofline performance analysis to model the impact of our optimizations.
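The AoS-to-SoA transformation described above can be sketched in a few lines of NumPy. This is a minimal illustration of the layout idea only, not the paper's C++ B-spline kernels; the particle fields and the distance computation are made-up stand-ins:

```python
import numpy as np

n = 1024
rng = np.random.default_rng(0)
pos = rng.random((n, 3))

# AoS: one interleaved (x, y, z) record per element. Reading all x values
# strides through memory, wasting cache lines and SIMD lanes.
aos = np.zeros(n, dtype=[("x", "f8"), ("y", "f8"), ("z", "f8")])
aos["x"], aos["y"], aos["z"] = pos[:, 0], pos[:, 1], pos[:, 2]

# SoA: one contiguous array per field, so each field is a unit-stride
# stream that maps directly onto wide vector loads.
x, y, z = pos[:, 0].copy(), pos[:, 1].copy(), pos[:, 2].copy()

# Same computation (squared distance from the origin) on both layouts.
r2_aos = aos["x"] ** 2 + aos["y"] ** 2 + aos["z"] ** 2  # strided field gathers
r2_soa = x * x + y * y + z * z                          # three contiguous streams
```

The results are identical; the difference is purely in memory access pattern, which is what determines SIMD and bandwidth efficiency on real hardware.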
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors – Intel® Software
The second-generation Intel® Xeon Phi™ processor offers new and enhanced features that provide significant performance gains in modernized code. For this lab, we pair these features with Intel® Software Development Products and methodologies to enable developers to gain insights on application behavior and to find opportunities to optimize parallelism, memory, and vectorization features.
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra... – Intel® Software
In this presentation, we focus on an alternative approach that uses nodes containing Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors. Programming models and development tools are identical for these resources, greatly simplifying development. We discuss how the same models for vectorization and threading can be used across these compute resources to create software that performs well on all of them. We further propose an extension to the Intel® Threading Building Blocks (Intel® TBB) flow graph interface that enables intra-node distributed memory programming, simplifying communication and load balancing between the processors and coprocessors. Finally, we validate this approach by presenting a benchmark of a risk analysis implementation that achieves record-setting performance.
In this deck from the Perth HPC Conference, Rob Farber from TechEnablement presents: AI is Impacting HPC Everywhere.
"The convergence of AI and HPC has created a fertile venue that is ripe for imaginative researchers — versed in AI technology — to make a big impact in a variety of scientific fields. From new hardware to new computational approaches, the true impact of deep- and machine learning on HPC is, in a word, “everywhere”. Just as technology changes in the personal computer market brought about a revolution in the design and implementation of the systems and algorithms used in high performance computing (HPC), so are recent technology changes in machine learning bringing about an AI revolution in the HPC community. Expect new HPC analytic techniques including the use of GANs (Generative Adversarial Networks) in physics-based modeling and simulation, as well as reduced precision math libraries such as NLAFET and HiCMA to revolutionize many fields of research. Other benefits of the convergence of AI and HPC include the physical instantiation of data flow architectures in FPGAs and ASICs, plus the development of powerful data analytic services."
Learn more: http://www.techenablement.com/
and
http://hpcadvisorycouncil.com/events/2019/australia-conference/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
The growing interest in FPGA-based solutions for accelerating compute-demanding algorithms is increasing the need for new tools and methods to improve productivity. High-Level Synthesis (HLS) tools already provide a handy way to describe an FPGA-based hardware implementation starting from a software description of an algorithm. However, HLS directives improve the hardware design only from a computational perspective, requiring manual code restructuring when memory transfers need optimizing. This limits the effectiveness of Design Space Exploration (DSE) approaches that target only HLS directives. We therefore present a comprehensive methodology to support the designer in generating optimal HLS-based hardware implementations. First, we propose an automated roofline model generation that operates directly on a C/C++ description of the target algorithm. The approach enables a fast evaluation of the operational intensity of the target function and visualizes the main bottlenecks of the current HLS implementation, providing guidance on how to improve it. Second, we introduce a DSE methodology for quickly evaluating different HLS directives to identify an optimal implementation. We report the DSE performance on the PolyBench test suite, outperforming previous automated solutions in the literature. Finally, we illustrate how our framework accelerates a complex application, the N-body physics simulation algorithm, achieving results comparable to bespoke state-of-the-art implementations.
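The directive-level DSE described above can be sketched as an exhaustive search over a small configuration space with a cost model. The cost model below is entirely hypothetical (the paper derives its estimates from HLS reports, which are not reproduced here); it only illustrates the search structure:

```python
from itertools import product

# Hypothetical cost model for one HLS loop nest: unrolling divides the trip
# count, and array partitioning removes memory-port conflicts up to its
# factor. All constants below are illustrative, not taken from the paper.
TRIP_COUNT = 1024

def estimated_cycles(unroll, partition):
    parallel_reads = min(unroll, partition)   # memory ports limit parallelism
    ii = max(1, unroll // parallel_reads)     # initiation interval per group
    return (TRIP_COUNT // unroll) * ii + 50   # 50-cycle fixed pipeline overhead

# Exhaustive DSE over a small directive space (unroll factor x partition factor).
space = product([1, 2, 4, 8, 16], [1, 2, 4, 8])
best_cfg = min(space, key=lambda cfg: estimated_cycles(*cfg))
```

With this toy model the search settles on unroll 8 with partition factor 8: unrolling beyond the number of available memory ports raises the initiation interval and buys nothing, which is exactly the kind of interaction an automated DSE is meant to discover.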
Programming Languages & Tools for Higher Performance & Productivity – Linaro
By Hitoshi Murai, RIKEN AICS
For higher performance and productivity of HPC systems, it is important to provide users with a good programming environment, including languages, compilers, and tools. This talk presents the programming model of the post-K supercomputer.
Hitoshi Murai Bio
Hitoshi Murai received a master's degree in information science from Kyoto University in 1996. He worked as a software developer at NEC from 1996 to 2010 and received a Ph.D. in computer science from the University of Tsukuba in 2010. He is currently a research scientist in the programming environment research team and the Flagship 2020 project at the RIKEN Advanced Institute for Computational Science. His research interests include compilers and parallel programming languages.
Email
h-murai@riken.jp
For more info on The Linaro High Performance Computing (HPC) visit https://www.linaro.org/sig/hpc/
For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/arm/embedded-vision-training/videos/pages/may-2016-embedded-vision-summit-iodice
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Gian Marco Iodice, Software Engineer at ARM, presents the "Using SGEMM and FFTs to Accelerate Deep Learning" tutorial at the May 2016 Embedded Vision Summit.
Matrix multiplication and the fast Fourier transform are numerical foundation stones for a wide range of scientific algorithms. With the emergence of deep learning, they are becoming even more important, particularly as use cases extend into mobile and embedded devices. In this presentation, Iodice discusses and analyzes how these two key, computationally intensive algorithms can be used to gain significant performance improvements for convolutional neural network (CNN) implementations.
After a brief introduction to the nature of CNN computations, Iodice explores the use of GEMM (General Matrix Multiplication) and mixed-radix FFTs to accelerate 3D convolution. He shows examples of OpenCL implementations of these functions and highlights their advantages, limitations and trade-offs. Central to the techniques explored is an emphasis on cache-efficient memory accesses and the crucial role of reduced-precision data types.
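The GEMM-based convolution Iodice describes rests on the im2col trick: unroll every input patch into a column, then express the whole convolution as one matrix multiplication. The talk's implementations are in OpenCL; the NumPy sketch below (single channel, stride 1, no padding) only illustrates the data rearrangement, checked against a direct convolution:

```python
import numpy as np

def im2col(x, kh, kw):
    """Unroll each kh x kw patch of a 2-D input into one column (stride 1, no pad)."""
    h, w = x.shape
    oh, ow = h - kh + 1, w - kw + 1
    cols = np.empty((kh * kw, oh * ow))
    for i in range(kh):
        for j in range(kw):
            cols[i * kw + j] = x[i:i + oh, j:j + ow].ravel()
    return cols

def conv2d_gemm(x, k):
    """Convolution (cross-correlation, as CNNs compute it) as a single GEMM."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    return (k.ravel() @ im2col(x, kh, kw)).reshape(oh, ow)

def conv2d_direct(x, k):
    """Reference sliding-window convolution to validate the GEMM path."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(1)
x = rng.random((8, 8))
k = rng.random((3, 3))
```

The payoff is that the GEMM form runs on heavily tuned, cache-blocked matrix kernels instead of a bespoke sliding-window loop, at the cost of the extra memory used by the unrolled patch matrix.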
FPGAs as Components in Heterogeneous HPC Systems (paraFPGA 2015 keynote) – Wim Vanderbauwhede
Keynote I gave at the ParCo conference (http://www.parco2015.org) workshop paraFPGA in Edinburgh, Sept 2015, on the need to raise the abstraction level for programming of heterogeneous systems.
Assisting Users’ Transition to Titan’s Accelerated Architecture – inside-BigData.com
Oak Ridge National Lab is home to Titan, the largest GPU-accelerated supercomputer in the world. Its scale alone can be intimidating for users new to leadership computing facilities. Our facility has accumulated over four years of experience helping users port applications to Titan. This talk explains common paths and tools for porting applications successfully, and exposes common difficulties experienced by new users. Lastly, learn how our free and open training program can assist your organization in this transition.
Moldable pipelines for CNNs on heterogeneous edge devices – LEGATO project
Abstract: Modern edge devices are equipped with more powerful computing resources than ever before, which opens up the opportunity to execute deep neural networks on the devices instead of in the cloud. Existing DNN frameworks such as TensorFlow, Caffe, and Torch do not exploit the heterogeneous features of these devices beyond supporting GPUs. Modern edge devices contain different categories of heterogeneity, i.e., clusters of different types of cores fabricated on the same chip; an example of such a board is the Jetson TX2 from Nvidia. To support DNN applications on heterogeneous edge devices, we have developed a framework that generates an efficient and balanced parallel pipelined implementation of CNN inference from a simplified template language interface. We leverage the input and output information provided by the template language to generate a balanced pipeline. Since the cores are heterogeneous, we run a brief online training phase to find the best core distribution for a balanced, high-throughput pipeline.
Our experiments show that a pipeline mapping configuration obtained from online training yields a pipeline up to 22% faster than the baseline. We compare against a kernel-level parallel implementation of VGG-16, a widely used image-classification CNN, on the Nvidia Jetson TX2 board.
Poster presented by Pirah Noor Soomro at the LEGaTO Final Event: 'Low-Energy Heterogeneous Computing Workshop'
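The core mapping problem behind the moldable-pipeline idea can be sketched as a search: given per-stage costs on each core type, pick the assignment that minimizes the slowest stage, since a pipeline's steady-state throughput is set by its bottleneck. All costs and the big-core budget below are invented for illustration; the framework's online training phase would measure them instead:

```python
from itertools import product

# Hypothetical per-stage costs (ms) on each core type of a big.LITTLE-style
# chip such as the Jetson TX2's CPU complex. Numbers are made up.
cost = {
    "big":    [4.0, 6.0, 3.0, 5.0],   # one entry per pipeline stage
    "little": [9.0, 14.0, 7.0, 11.0],
}
stages = range(4)

def throughput_bound(assignment):
    # A pipeline's steady-state period equals its slowest stage.
    return max(cost[core][s] for s, core in zip(stages, assignment))

# Brute-force search, with at most two stages allowed on big cores
# (a stand-in for the limited number of fast cores on the chip).
candidates = [a for a in product(cost, repeat=4) if a.count("big") <= 2]
best = min(candidates, key=throughput_bound)
```

With these numbers the search puts the big cores on the two heaviest little-core stages, cutting the pipeline period from 14 ms to 9 ms; the real framework faces the same trade-off but measures costs online rather than assuming them.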
PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F... – Shinya Takamaeda-Y
Presentation slide for CARL2013 (Co-located with MICRO-46) at Davis, CA.
PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern FPGA-based Computing
SX Aurora TSUBASA (Vector Engine) a Brand-new Vector Supercomputing power in... – inside-BigData.com
In this deck from the HPC User Forum at Argonne, Deepak Pathania presents: SX Aurora TSUBASA (Vector Engine) a Brand-new Vector Supercomputing power in Server Chassis.
"The NEC Vector Engine Processor was developed using 16 nm FinFET process technology for extreme high performance and low power consumption. The Vector Engine Processor has the world's first implementation of one processor with six HBM2 memory modules using Chip-on-Wafer-on-Substrate technology, leading to the world-record memory bandwidth of 1.2 TB/s."
Watch the video: https://wp.me/p3RLHQ-kOK
Learn more: https://www.nec.com/en/global/solutio...
and
http://hpcuserforum.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ... – Intel® Software
Learn about the latest developments and tools for high-performance Python*, which are used with scikit-learn, NumPy, SciPy, pandas, mpi4py, and Numba*. Apply low-overhead profiling tools, including Intel® VTune™ Amplifier, to analyze mixed C, C++, and Python applications to detect performance bottlenecks in the code and to pinpoint hotspots as the target for performance tuning. Get the best performance from your Python application with the best-known methods, tools, and libraries.
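The profile-then-tune workflow described above can be tried without any vendor tooling. The talk's tool of choice is Intel® VTune™ Amplifier; the sketch below uses the standard library's cProfile as a stand-in to show the same loop of running a workload under a profiler and reading off the hotspot:

```python
import cProfile
import io
import pstats

def slow_kernel(n):
    # Deliberately scalar Python loop: the kind of hotspot profiling surfaces.
    total = 0.0
    for i in range(n):
        total += i * 0.5
    return total

def fast_path(n):
    # Built-in reduction, far cheaper per element than the loop above.
    return sum(range(n))

profiler = cProfile.Profile()
profiler.enable()
slow_kernel(200_000)
fast_path(200_000)
profiler.disable()

# Render the top entries by cumulative time into a string report.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
report = out.getvalue()
```

The report ranks `slow_kernel` as the dominant cost, which is the cue to replace it with a NumPy or Numba* implementation; VTune provides the same signal with lower overhead and native-code visibility for mixed C/C++/Python applications.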
Extracting a Rails Engine to a separate application – Jônatas Paganini
As a Rails application grows, there is a need to decouple heavy subsystems from the monolithic application. Teams in many companies are doing the same: extracting (micro)services from their monolithic applications to give engineering teams more flexibility and speed up the workflow.
From the separation of the business logic to the server's setup, every change should respect the zero-downtime approach.
This talk shares the automated steps and exercises we created to have a smooth transition to the new system.
I'll share the context of the tool that automatically extracts an entire Rails engine from a project and moves it to a separate service.
We build AI and HPC solutions. Expertise: highly optimized AI Engines and HPC Apps.
• HPC: accelerating time to results and adapting complex algorithms to GPU, FPGA, many-CPU architectures.
Leverage byteLAKE's expertise in adapting and optimizing complex algorithms for NVIDIA GPUs, Xilinx Alveo FPGAs, and Intel, AMD, and ARM solutions. From single nodes to clusters.
More: www.byteLAKE.com/en/Alveo
A short introduction to FPGA acceleration and the impact of the new High-Level Synthesis toolchains on FPGA programmability.
Video here: https://www.linkedin.com/posts/marcobarbone_can-my-application-benefit-from-fpga-acceleration-activity-6848674747375460352-0fua
The increasing demand for computing power in fields such as biology, finance, and machine learning is pushing the adoption of reconfigurable hardware in order to keep up with the required performance level at a sustainable power consumption. Within this context, FPGA devices represent an interesting solution, as they combine power efficiency, performance, and flexibility. Nevertheless, the steep learning curve and the experience needed to develop efficient FPGA-based systems remain among the main limiting factors for broad adoption of such devices.
In this talk, we present CAOS, a framework that helps the application designer identify acceleration opportunities and guides them through the implementation of the final FPGA-based system. The CAOS platform targets the full stack of the application optimization process, from identifying the kernel functions to accelerate, to optimizing those kernels, to generating the runtime management and the configuration files needed to program the FPGA.
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin... – Intel® Software
Integrated into Intel® Advisor, Cache-aware Roofline Modeling (CARM) provides insight into how an application behaves by helping to determine a) how optimally it runs on given hardware, b) the main factors that limit performance, c) whether the workload is memory-bound or compute-bound, and d) the right strategy to improve application performance.
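The memory-bound versus compute-bound decision that CARM automates reduces to a simple formula: attainable performance is the minimum of peak compute and bandwidth times arithmetic intensity. The machine numbers below are illustrative placeholders, not measurements from any Intel tool:

```python
# Roofline sketch: attainable GFLOP/s is capped either by peak compute or
# by memory bandwidth times arithmetic intensity (AI, in FLOPs per byte).
# Both machine parameters below are made-up round numbers.
PEAK_GFLOPS = 1000.0      # peak double-precision compute
PEAK_BW_GBS = 100.0       # sustained memory bandwidth, GB/s

def attainable_gflops(ai):
    return min(PEAK_GFLOPS, PEAK_BW_GBS * ai)

def bound(ai):
    # Left of the ridge point (AI = PEAK_GFLOPS / PEAK_BW_GBS) the memory
    # roof is the binding constraint; right of it, the compute roof is.
    return "memory-bound" if PEAK_BW_GBS * ai < PEAK_GFLOPS else "compute-bound"

# Example: a stream triad a[i] = b[i] + s * c[i] performs 2 FLOPs while
# moving three 8-byte doubles, so AI = 2/24 -- deep in the memory-bound region.
triad_ai = 2 / 24
```

Plotting a kernel's measured AI against these two roofs tells you immediately whether to chase vectorization (compute-bound) or data locality and traffic reduction (memory-bound), which is the strategy guidance CARM provides.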
We live in an era where the atomic building elements of silicon computers, e.g., transistors and wires, are no longer visible with traditional optical microscopes and their sizes are measured in just tens of angstroms. In addition, power dissipation per unit volume is bounded by the laws of physics, which, among other effects, has resulted in stagnating processor clock frequencies. Adding more and more processor cores that perform simpler and simpler tasks, in an attempt to efficiently fill the available on-chip area, seems to be the current trend taken by the industry.
The Barcelona Supercomputing Center (BSC) was established in 2005 and hosts MareNostrum, one of the most powerful supercomputers in Spain. We are the pioneering supercomputing centre in Spain. Our specialty is high-performance computing (HPC), and our mission is twofold: to offer supercomputing infrastructure and services to Spanish and European scientists, and to generate knowledge and technology to transfer to society. We are a Severo Ochoa Centre of Excellence, a first-level member of the European research infrastructure PRACE (Partnership for Advanced Computing in Europe), and we manage the Spanish Supercomputing Network (RES). As a research centre, we have more than 456 experts from 45 countries, organized in four major research areas: Computer Sciences, Life Sciences, Earth Sciences, and computational applications in science and engineering.
Performance Optimization of SPH Algorithms for Multi/Many-Core Architectures – Dr. Fabio Baruffa
In the framework of the Intel Parallel Computing Centre at the Research Campus Garching in Munich, our group at LRZ presents recent results on performance optimization of Gadget-3, a widely used community code for computational astrophysics. We identify and isolate a sample code kernel, which is representative of a typical Smoothed Particle Hydrodynamics (SPH) algorithm and focus on threading parallelism optimization, change of the data layout into Structure of Arrays (SoA), compiler auto-vectorization and algorithmic improvements in the particle sorting. We measure lower execution time and improved threading scalability both on Intel Xeon (2.6× on Ivy Bridge) and Xeon Phi (13.7× on Knights Corner) systems. First tests on second generation Xeon Phi (Knights Landing) demonstrate the portability of the devised optimization solutions to upcoming architectures.
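One of the algorithmic improvements mentioned above, particle sorting, can be sketched independently of Gadget-3. The idea is to bin particles on a spatial grid and reorder the particle arrays by cell index, so that spatial neighbours become memory neighbours. The grid and particle counts below are arbitrary illustration values:

```python
import numpy as np

rng = np.random.default_rng(2)
n, box, cell = 10_000, 1.0, 0.1
pos = rng.random((n, 3)) * box

# Bin particles on a regular grid and compute a linear cell index per particle.
ncell = int(box / cell)
ijk = np.minimum((pos / cell).astype(int), ncell - 1)
cell_id = (ijk[:, 0] * ncell + ijk[:, 1]) * ncell + ijk[:, 2]

# Reorder the particle arrays by cell index: neighbour-list walks in an SPH
# kernel then touch contiguous cache lines instead of scattered addresses.
order = np.argsort(cell_id, kind="stable")
pos_sorted = pos[order]
cell_sorted = cell_id[order]
```

After the sort, every grid cell occupies one contiguous run of the arrays, which is what makes the subsequent neighbour interactions cache-friendly and, combined with the SoA layout, auto-vectorizable.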
Scrooge Attack: Undervolting ARM Processors for Profit – LEGATO project
A malicious cloud provider can intentionally undervolt its cloud infrastructure for additional savings on the electricity bill. ARM processors are low-power processors whose undervolting can yield substantial energy savings for cloud providers. In our scenario we consider a scrooge cloud provider that undervolts its ARM infrastructure for profit. The instances can be undervolted in a stealthy manner by avoiding critical voltage regions. Applications running under critical undervolting conditions can malfunction, and these conditions can be exploited by a cloud user to uncover undervolted instances. For this novel attack scenario we present a detection method for cloud users. The detection method non-selectively injects faults into processes with the intent to crash the cloud instance. Even if the cloud provider can spoof the temperature and voltage readings of the processor, the cloud user is able to uncover undervolted instances. By crashing instances simultaneously using the detection method, the cloud user is covered by the service level agreement and exposes the scrooge cloud provider.
TEEMon: A continuous performance monitoring framework for TEEsLEGATO project
LEGaTO paper presented at ACM Middleware 2020 by Robert Krahn, Donald Dragoti, Franz Gregor, Do Le Quoc, Valerio Schiavoni, Pascal Felber, Clenimar Souza, Andrey Brito and Christof Fetzer
LEGaTO: Low-Energy Heterogeneous Computing Use of AI in the projectLEGATO project
Presentation by Osman Unsal and Pirah Noor Soomro at the webinar AI4EU WebCafé: 'Energy-efficient AI, a perspective from the LEGaTO project' on 28 October 2020
Presentation given by Jens Hagemeyer (Bielefeld University) at the ‘Low-Energy Heterogeneous Computing Workshop’ on 16 October 2020 within HiPEAC CSW Autumn 2020
TZ4Fabric: Executing Smart Contracts with ARM TrustZoneLEGATO project
Paper presented by Christian Göttel at SRDS'20.
Abstract: Transparency in blockchains can be an advantage and a disadvantage, in particular if confidential information such as assets or business interactions are exposed. There are no confidentiality guarantees in blockchain systems to protect the logic of a smart contract or the data it processes. One solution to this problem can be trusted execution environments (TEE) which are an emerging technology for example available in edge or mobile-grade processors (e.g., ARM TrustZone) or in server-grade processors (e.g., Intel SGX). In this presentation we introduce TZ4Fabric, an extension of Hyperledger Fabric which leverages ARM TrustZone to shield the execution of smart contracts from compromised systems and powerful attackers. TZ4Fabric exploits the open source OP-TEE framework to enable ARM TrustZone features. We evaluate our prototype on the Raspberry Pi platform and highlight energy and performance trade-offs.
Infection Research with Maxeler Dataflow ComputingLEGATO project
Presentation given by Tobias Becker (Maxeler) at the LEGaTO Final Event: Low-Energy Heterogeneous Computing Workshop on 4 September 2020
This event was collocated with FPL 2020
Presentation given by Nils Kucza (Bielefeld University) at the LEGaTO Final Event: Low-Energy Heterogeneous Computing Workshop on 4 September 2020
This event was collocated with FPL 2020
FPGA Undervolting and Checkpointing for Energy-Efficiency and Error-ResiliencyLEGATO project
Tutorial by Behzad Salami, Osman Unsal and Leonardo Bautista at 30th International Conference on Field-Programmable Logic and Applications (FPL2020), 3 September 2020
Scheduling Task-parallel Applications in Dynamically Asymmetric EnvironmentsLEGATO project
Presentation by Jing Chen and Pirah Noor Soomro (Chalmers University of Technology) at the 16th International Workshop on Scheduling and Resource Management for Parallel and Distributed Systems (SRMPDS 2020) on 17 August 2020.
SRMPDS was a virtual event and collocated with ICPP’20 - 2020 International Conference on Parallel Processing.
RECS – Cloud to Edge Microserver Platform for Energy-Efficient ComputingLEGATO project
Abstract: Today, application developers and data center operators face the challenging task of achieving high performance while at the same time needing to reduce the total cost of ownership, which is driven especially by the energy consumption of the server itself.
This poster shows the RECS Microserver platform, developed by Christmann and Bielefeld University. RECS simplifies the combined use of heterogeneous target architectures to achieve high performance and superior energy-efficiency.
Poster presented by Martin Kaiser at the LEGaTO Final Event: 'Low-Energy Heterogeneous Computing Workshop'
HiPerMAb: A statistical tool for judging the potential of short fat dataLEGATO project
Abstract: Common statistical approaches are not designed to deal with so-called "short fat data" in biomarker pilot studies, where the number of biomarker candidates exceeds the sample size by orders of magnitude. Because of the high cost and long time needed to collect and prepare the data in this type of study, researchers prefer to check the potential of the large set of biomarker candidates in a small pilot study.
The aim of the pilot study is to answer the question of whether it is worthwhile to extend the study to a larger one, and to obtain information about the required sample size. The HiPerMAb tool is proposed as a method to judge the potential of a small biomarker pilot study without the need to explicitly identify and confirm a specific subset of biomarkers. It allows pilot studies to be evaluated with performance measures such as multiclass AUC, entropy, area above the cost curve, hypervolume under manifold, and misclassification rate. Entropy is a useful tool in machine learning and has become one of the most exciting developments in biology. However, unlike the area under the ROC curve (AUC), it has no closed-form solution for estimating the p-values HiPerMAb requires. A possible solution is Monte Carlo simulation, but with the number of biomarker candidates in such studies the number of simulations becomes very large, computationally costly, and energy consuming. By using Maxeler DFEs on the Jülich testbed, we are able to look at studies with more than 50,000 biomarkers; we then need to estimate probabilities smaller than 1/50,000, which means running up to 50 million simulations. The number of "good" biomarker candidates is then compared to the expected number of "good" biomarker candidates in a dataset with no association to the considered disease states, to judge whether the study is worth extending with an appropriate sample size to find and evaluate a final combination of biomarkers with high predictive value.
Poster presented by Amani Al-Mekhlafi at the LEGaTO Final Event: 'Low-Energy Heterogeneous Computing Workshop'
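The Monte Carlo p-value estimation the abstract relies on can be sketched as follows. This is a generic illustration, not HiPerMAb's implementation; the function and parameter names are invented.

```python
import random

def mc_p_value(observed, null_sampler, n_sim=100_000, seed=0):
    """Estimate P(T >= observed) by simulating the null distribution.

    `null_sampler(rng)` draws one value of the test statistic under the
    null hypothesis (both names are illustrative).
    """
    rng = random.Random(seed)
    hits = sum(null_sampler(rng) >= observed for _ in range(n_sim))
    # Add-one smoothing keeps the estimate strictly positive. To credibly
    # resolve p < 1/50,000, n_sim must be far larger than 50,000 -- which
    # is why the abstract needs up to 50 million simulations.
    return (hits + 1) / (n_sim + 1)
```

Accelerating exactly this kind of embarrassingly parallel simulation loop is what the Maxeler dataflow engines are used for.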
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...Sérgio Sacani
Since volcanic activity was first discovered on Io from Voyager images in 1979, changes on Io's surface have been monitored from both spacecraft and ground-based telescopes. Here, we present the highest spatial resolution images of Io ever obtained from a ground-based telescope. These images, acquired by the SHARK-VIS instrument on the Large Binocular Telescope, show evidence of a major resurfacing event on Io's trailing hemisphere. When compared to the most recent spacecraft images, the SHARK-VIS images show that a plume deposit from a powerful eruption at Pillan Patera has covered part of the long-lived Pele plume deposit. Although this type of resurfacing event may be common on Io, few have been detected due to the rarity of spacecraft visits and the previously low spatial resolution available from Earth-based telescopes. The SHARK-VIS instrument ushers in a new era of high-resolution imaging of Io's surface using adaptive optics at visible wavelengths.
Richard's entangled adventures in wonderlandRichard Gill
Since the loophole-free Bell experiments of 2020 and the Nobel prizes in physics of 2022, critics of Bell's work have retreated to the fortress of super-determinism. Now, super-determinism is a derogatory word - it just means "determinism". Palmer, Hance and Hossenfelder argue that quantum mechanics and determinism are not incompatible, using a sophisticated mathematical construction based on a subtle thinning of allowed states and measurements in quantum mechanics, such that what is left appears to make Bell's argument fail, without altering the empirical predictions of quantum mechanics. I think however that it is a smoke screen, and the slogan "lost in math" comes to my mind. I will discuss some other recent disproofs of Bell's theorem using the language of causality based on causal graphs. Causal thinking is also central to law and justice. I will mention surprising connections to my work on serial killer nurse cases, in particular the Dutch case of Lucia de Berk and the current UK case of Lucy Letby.
Cancer cell metabolism: special Reference to Lactate PathwayAADYARAJPANDEY1
Normal Cell Metabolism:
Cellular respiration describes the series of steps that cells use to break down sugar and other chemicals to get the energy they need to function.
Energy is stored in the bonds of glucose and when glucose is broken down, much of that energy is released.
Cells utilize energy in the form of ATP.
The first step of respiration is called glycolysis. In a series of steps, glycolysis breaks glucose into two smaller molecules of a chemical called pyruvate. A small amount of ATP is formed during this process.
Most healthy cells continue the breakdown in a second process, called the Krebs cycle. The Krebs cycle allows cells to "burn" the pyruvate made in glycolysis to get more ATP.
The last step in the breakdown of glucose is called oxidative phosphorylation (Ox-Phos).
It takes place in specialized cell structures called mitochondria. This process produces a large amount of ATP. Importantly, cells need oxygen to complete oxidative phosphorylation.
If a cell completes only glycolysis, only 2 molecules of ATP are made per glucose. However, if the cell completes the entire respiration process (glycolysis - Krebs cycle - oxidative phosphorylation), about 36 molecules of ATP are created, giving it much more energy to use.
IN CANCER CELL:
Unlike healthy cells that "burn" the entire molecule of sugar to capture a large amount of energy as ATP, cancer cells are wasteful.
Cancer cells only partially break down sugar molecules. They overuse the first step of respiration, glycolysis. They frequently do not complete the second step, oxidative phosphorylation.
This results in only 2 molecules of ATP per each glucose molecule instead of the 36 or so ATPs healthy cells gain. As a result, cancer cells need to use a lot more sugar molecules to get enough energy to survive.
INTRODUCTION TO THE WARBURG PHENOMENON:
WARBURG EFFECT: Usually, cancer cells are highly glycolytic ("glucose addiction") and take up more glucose from outside than normal cells do.
Otto Heinrich Warburg (8 October 1883 – 1 August 1970) was awarded the 1931 Nobel Prize in Physiology or Medicine for his "discovery of the nature and mode of action of the respiratory enzyme."
The tendency of cancer cells under aerobic (well-oxygenated) conditions to metabolize glucose to lactate (aerobic glycolysis) is known as the Warburg effect. Warburg made the observation that tumor slices consume glucose and secrete lactate at a higher rate than normal tissues.
Richard's adventures in two entangled wonderlandsRichard Gill
1. The LEGaTO project has received funding from the European Union's Horizon 2020 research and
innovation programme under the grant agreement No 780681
LEGaTO: Software Stack Runtimes
HiPEAC 2020 Computer Systems Week, 16-10-2020
Miquel Pericas, Chalmers University of Technology
3. HiPEAC CSW Autumn 2020
Slurm and RECS Master
• Integration of Slurm with RECS Master
o Nodes specification at slurm configuration (partitions, limits…)
o Slurm gets node specification and selects target nodes
o Allocates, joins and starts nodes
o Executes the application(s)
o Shuts down nodes and destroys the allocation
$ sinfo
PART… AVAIL LIMIT NODES STATE NODELIST
debug* up infinite 1 idle* pcxavim5
debug* up infinite 16 idle BB_1_[0,2-15],pcxavim6
4.
Slurm and RECS Master
• Slurm contacts RECS Master at job execution and termination times
#!/bin/bash
#SBATCH -N 10
#SBATCH --constraint=ARM,bigLITTLE,hasGPU
#SBATCH -o test-%j.out
#SBATCH -e test-%j.err
# App invocation
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 1 idle* pcxavim5
debug* up infinite 10 alloc BB_1_[0,2-10]
debug* up infinite 6 idle BB_1_[11-15],pcxavim6
$ sbatch batch-10-bl.sh
Submitted batch job 39
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
39 debug batch-10 xavim R 0:42 10 BB_1_[0,2-10]
5.
Slurm and RECS Master
• Composed nodes are created using the RECS Master webservice
• They are started and stopped automatically
10 nodes are turned on
7.
● Acceleration of matrix multiplication on FPGAs
− 4 ARM cores (OpenBLAS)
− 1 to 3 IP cores
● Block size 256x256
[Chart: Matrix multiply, energy efficiency — omitted. GFlops and GFlops/W for 4 ARM cores (OpenBLAS) and 1 to 3 IP cores.]
● Best performance: 3 IP cores
● Best energy-efficiency: 2 IP cores
OmpSs@FPGA
8.
XiTAO: Energy Aware Scheduler
• Module 1: Power Profiling
  • Helps the runtime understand CPU power consumption trends (number/type of cores, different frequencies)
• Module 2: Dynamic Performance Modeling
  • Provides an accurate prediction for a future task given a set of resources
  • Independent of platforms and frequencies
  • Achieves the scalability and portability goals
• Module 3: Idleness Tracing
  • Gives information about the real-time status of cores
  • Puts cores to "sleep" when they are under-utilized
  • Sleeping time follows an exponential backoff strategy
  • Provides the real-time parallel slackness of active cores => calculation of shared board static power for each running task
• Module 4: Task Mapping Algorithm (per-task level)
  For a given configuration (start core, number of cores):
  • Performance Tracer => Execution Time Prediction
  • Power Profiles => Dynamic Power Prediction
  • Power Profiles + Idleness Tracer => Static Power Prediction
  • Energy Prediction = (Static Power + Dynamic Power) x Execution Time
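Module 4's energy formula can be expressed as a small sketch. Everything below is illustrative only: in the real XiTAO scheduler the execution-time and power terms come from its performance tracer, power profiles, and idleness tracer, whereas here they are supplied as plain numbers.

```python
def predicted_energy(exec_time_s, dynamic_power_w, static_power_w):
    """Energy Prediction = (Static Power + Dynamic Power) x Execution Time,
    in joules, for one candidate (start core, width) configuration."""
    return (static_power_w + dynamic_power_w) * exec_time_s

def pick_config(candidates):
    """Pick the configuration with the lowest predicted energy.
    Each candidate is (leader, width, exec_time_s, dyn_w, static_w);
    the tuple layout is an assumption for this sketch."""
    return min(candidates, key=lambda c: predicted_energy(c[2], c[3], c[4]))
```

For example, a wide configuration that finishes much faster can win on energy even though it draws more power, which is exactly the trade-off the per-task mapping explores.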
9.
XiTAO: Energy Aware Scheduler
● 31%-74% energy reduction compared to RWS
● 19%-68% energy reduction compared to FCC
● 25%-73% energy reduction compared to LCC
Name | Acronym | Notion
Random Work Stealing (+Sleep) | RWS (+S) | Typical greedy scheduling (enhanced with Sleep)
Fastest Cores with Criticality (+Sleep) | FCC (+S) | Critical tasks are mapped to the set of cores that minimizes execution time and are not subject to work stealing; non-critical tasks follow the parent queue and only search for the best number of cores that minimizes the execution time of the task (enhanced with Sleep)
Lowest Cost with Criticality (+Sleep) | LCC (+S) | The difference between LCC and FCC is that minimizing execution time becomes minimizing parallel cost, where parallel cost means "execution time x number of cores" (enhanced with Sleep)
Lowest Energy without Criticality | LENC | Task scheduling targets the lowest energy; no need for criticality awareness
10.
• Mapping logical data locations to physical locations (to create a model per locality)
• The Software Topology Address (STA) is a portable key that is interpreted by the XiTAO runtime to map a task to a place.
  • Example: a space-filling order is used as an STA, transforming coordinates into an integer for Cartesian inputs. The paper includes other example keys.
• This STA-to-location mapping is leveraged to model the performance per task's data locality.
  • A performance model is created per (STA, task_type) tuple.
  • An energy-aware model could potentially be used here.
• Example: the system's elastic partitions to be used by the model
XiTAO: Software & Hardware Topologies
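As a concrete example of a space-filling order usable as such a key, a Z-order (Morton) encoding interleaves the bits of 2-D Cartesian coordinates into a single integer, so that nearby points tend to get nearby keys. This is offered only as an illustration of the idea; it is not necessarily the curve the XiTAO paper uses.

```python
def _part1by1(v):
    """Spread the low 16 bits of v so that bit i moves to bit 2*i."""
    v &= 0xFFFF
    v = (v | (v << 8)) & 0x00FF00FF
    v = (v | (v << 4)) & 0x0F0F0F0F
    v = (v | (v << 2)) & 0x33333333
    v = (v | (v << 1)) & 0x55555555
    return v

def morton_key(x, y):
    """Z-order (Morton) key: interleave the bits of x and y into one
    integer, a portable locality-preserving key for Cartesian inputs."""
    return _part1by1(x) | (_part1by1(y) << 1)
```

The runtime can then hash or range-partition such integer keys to assign tasks to places while preserving spatial locality.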
11.
XiTAO: Model Validation on DAG Chain
• Adaptive resource selection (leader, width) for a cache-intensive task. Green is the NUMA node where the task (depicted by its STA) is initialized.
  • The scheduler mostly chooses widths 1 and 2 (within the shared L2 cache).
• Adaptive resource selection (leader, width) for a memory-intensive task.
  • The scheduler mostly chooses width 12 (a socket encapsulating 2 NUMA nodes).
• Random work-stealing behavior for compute-bound tasks, while preferring larger widths.
• Scalability of the model running memory-bound DAG chains: up to 2.5x speedup with larger task counts.
• To validate the STA-driven performance modeling, we
  − test on a 4-socket AMD system (2 NUMA nodes each)
  − print a resource selection trace of a chain of tasks
• The scheduler adaptively behaves as locality-aware for memory/cache-intensive tasks, and as a work-stealing scheduler for compute-bound tasks.
12.
XiTAO: Moldable pipelines for CNNs on heterogeneous edge devices
● A simple template tensor language to develop CNN networks.
● XiTAO pipelines are generated using the information provided by the language interface.
● An online training phase determines the optimal pipeline configuration:
  • network layer distribution among pipeline stages
  • resource partitioning among pipeline stages
● The training is led by a search algorithm which utilizes computational hints provided by the language interface.
13.
Network description in template language
main(){
…
Conv1 = CONV(ip, op, weights);
Conv2 = CONV(Conv1, op, weights);
….
network.add(Conv1);
network.add(Conv2);
…
network.execute();
}
XiTAO: Moldable pipelines for CNNs on heterogeneous edge devices
14.
FPGA Undervolting
Problem: FPGAs are at least 10X less power-efficient than equivalent ASICs
Goal: Bridge the power-efficiency gap between ASICs and FPGAs by undervolting below the nominal level
• Case study: power consumption of neural networks is a main concern
  ✔ Hardware acceleration: GPUs, FPGAs, and ASICs
Evaluation Setup
  ✔ 5 image classification workloads
  ✔ 3 Xilinx UltraScale+ ZCU102 platforms
  ✔ 2 on-chip voltage rails
Main Results
  ✔ Large voltage guardband (i.e., 33%)
  ✔ >3X power-efficiency gain
15.
Overall Voltage Behavior
Slight variation of voltage behavior across platforms and benchmarks:
❑ Crash: FPGA stops operating
❑ Guardband: no performance or reliability loss; added by the vendor to ensure worst-case conditions; large, on average 33%
❑ Critical: a narrow voltage region where neural network accuracy collapses
16.
GPU Checkpointing with FTI
● Transparent multi-GPU/multi-node checkpointing
● Parallel streams to improve I/O efficiency
● Fast checksum calculation using a GPU MD5 algorithm
17.
GPU Checkpointing with FTI
● Over 100x speedup with the new GPU MD5 algorithm
● A checkpoint takes less than 1 second
● FPGA checkpoint implementation coming