The document discusses best practices and a performance study of HPC clusters. It covers system configuration and tuning, building applications, Intel Xeon processors, efficient execution methods, tools for boosting performance, and application performance highlights using HPL and HPCG benchmarks. The document contains agenda items, market share data, typical BIOS settings, compiler flags, MPI usage, and performance results from single node and cluster runs of the benchmarks.
Accelerate Big Data Processing with High-Performance Computing Technologies (Intel® Software)
Learn about opportunities and challenges for accelerating big data middleware on modern high-performance computing (HPC) clusters by exploiting HPC technologies.
SDVIs and In-Situ Visualization on TACC's Stampede (Intel® Software)
Speaker: Paul Navrátil, Texas Advanced Computing Center (TACC)
The design emphasis for supercomputing systems has moved from raw performance to performance-per-watt, and as a result, supercomputing architectures are converging on processors with wide vector units and many processing cores per chip. Such processors are capable of performant image rendering purely in software. This improved capability is fortuitous, since the prevailing homogeneous system designs lack dedicated, hardware-accelerated rendering subsystems for use in data visualization. Reliance on this “software-defined” rendering capability will grow in importance since, due to growing data sizes, visualizations must be performed on the same machine where the data is produced. Further, as data sizes outgrow disk I/O capacity, visualization will be increasingly incorporated into the simulation code itself (in situ visualization).
This talk presents recent work in high-fidelity visualization using the OSPRay ray tracing framework on TACC’s local and remote visualization systems. We present work using OSPRay within the ParaView Catalyst in situ framework from Kitware, including opportunities to reduce the cost of data migrating through VTK filters for visualization. We highlight the performance opportunities and advantages of Intel® Advanced Vector Extensions 512, the memory system improvements possible with Intel® Xeon Phi™ processor multi-channel DRAM (MCDRAM), and the Intel® Omni-Path Architecture interconnect.
"OpenHPC is a collaborative, community effort that initiated from a desire to aggregate a number of common ingredients required to deploy and manage High Performance Computing (HPC) Linux clusters including provisioning tools, resource management, I/O clients, development tools, and a variety of scientific libraries. Packages provided by OpenHPC have been pre-built with HPC integration in mind with a goal to provide re-usable building blocks for the HPC community. Over time, the community also plans to identify and develop abstraction interfaces between key components to further enhance modularity and interchangeability. The community includes representation from a variety of sources including software vendors, equipment manufacturers, research institutions, supercomputing sites, and others."
Watch the video: http://wp.me/p3RLHQ-gKz
Learn more: http://openhpc.community/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Using a Field Programmable Gate Array to Accelerate Application Performance (Odinot Stanislas)
Intel is particularly interested in FPGAs, and especially in the potential they offer when ISVs and developers have very specific needs in genomics, image processing, database processing, and even in the cloud. In this document you will have the opportunity to learn more about our strategy, and about a research program launched by Intel and Altera involving Xeon E5 processors equipped with... FPGAs.
Author(s):
P. K. Gupta, Director of Cloud Platform Technology, Intel Corporation
Trends in Systems and How to Get Efficient Performance (inside-BigData.com)
In this video from the Switzerland HPC Conference, Martin Hilgeman from Dell presents: HPC Workload Efficiency and the Challenges for System Builders.
"With all the advances in massively parallel and multi-core computing with CPUs and accelerators it is often overlooked whether the computational work is being done in an efficient manner. This efficiency is largely being determined at the application level and therefore puts the responsibility of sustaining a certain performance trajectory into the hands of the user. It is observed that the adoption rate of new hardware capabilities is decreasing and lead to a feeling of diminishing returns. This presentation shows the well-known laws of parallel performance from the perspective of a system builder. It also covers through the use of real case studies, examples of how to program for energy efficient parallel application performance."
Watch the video: http://wp.me/p3RLHQ-gIS
Learn more: http://dell.com
and
http://www.hpcadvisorycouncil.com/events/2017/swiss-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from the HPC User Forum in Santa Fe, Sibendu Som from Argonne presents: HPC Accelerating Combustion Engine Design.
Sibendu Som is a mechanical engineer and principal investigator for developing predictive spray and combustion modeling capabilities for compression ignition engines. With the aid of high-performance computing, Sibendu focuses on developing robust models which can improve the performance and emission characteristics of a variety of bio-derived fuels. Predictive simulation capability can provide significant insights on how to improve the efficiency and emissions for different bio-derived fuels of interest.
Watch the video: http://wp.me/p3RLHQ-gHO
Learn more: http://hpcuserforum.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck, Jean-Pierre Panziera from Atos presents: BXI - Bull eXascale Interconnect.
"Exascale entails an explosion of performance, of the number of nodes/cores, of data volume and data movement. At such a scale, optimizing the network that is the backbone of the system becomes a major contributor to global performance. The interconnect is going to be a key enabling technology for exascale systems. This is why one of the cornerstones of Bull’s exascale program is the development of our own new-generation interconnect. The Bull eXascale Interconnect or BXI introduces a paradigm shift in terms of performance, scalability, efficiency, reliability and quality of service for extreme workloads."
Watch the video: http://wp.me/p3RLHQ-gJa
Learn more: https://bull.com/bull-exascale-interconnect/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from the Performance Optimisation and Productivity group, Lubomir Riha from IT4Innovations presents: Energy Efficient Computing using Dynamic Tuning.
"We now live in a world of power-constrained architectures and systems and power consumption represents a significant cost factor in the overall HPC system economy. For these reasons, in recent years researchers, supercomputing centers and major vendors have developed new tools and methodologies to measure and optimize the energy consumption of large-scale high performance system installations. Due to the link between energy consumption, power consumption and execution time of an application executed by the final user, it is important for these tools and the methodology used to consider all these aspects, empowering the final user and the system administrator with the capability of finding the best configuration given different high level objectives.
This webinar focused on tools designed to improve the energy-efficiency of HPC applications using a methodology of dynamic tuning of HPC applications, developed under the H2020 READEX project. The READEX methodology has been designed for exploiting the dynamic behaviour of software. At design time, different runtime situations (RTS) are detected and optimized system configurations are determined. RTSs with the same configuration are grouped into scenarios, forming the tuning model. At runtime, the tuning model is used to switch system configurations dynamically.
The MERIC tool, which implements the READEX methodology, is presented. It supports manual or binary instrumentation of the analysed applications to simplify the analysis. This instrumentation is used to identify and annotate the significant regions in the HPC application. Automatic binary instrumentation annotates regions with significant runtime. Manual instrumentation, which can be combined with automatic instrumentation, allows the code developer to annotate regions of particular interest."
Watch the video: https://wp.me/p3RLHQ-lJP
Learn more: https://pop-coe.eu/blog/14th-pop-webinar-energy-efficient-computing-using-dynamic-tuning
and
https://code.it4i.cz/vys0053/meric
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from ATPESC 2019, Ken Raffenetti from Argonne presents an overview of HPC interconnects.
"The Argonne Training Program on Extreme-Scale Computing (ATPESC) provides intensive, two-week training on the key skills, approaches, and tools to design, implement, and execute computational science and engineering applications on current high-end computing systems and the leadership-class computing systems of the future."
Watch the video: https://wp.me/p3RLHQ-luc
Learn more: https://extremecomputingtraining.anl.gov/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
OpenPOWER Acceleration of HPCC Systems (HPCC Systems)
JT Kellington (IBM) and Allan Cantle (Nallatech) present at the 2015 HPCC Systems Engineering Summit Community Day on porting HPCC Systems to the POWER8-based ppc64el architecture.
"Algorithmic processing performed in High Performance Computing environments impacts the lives of billions of people, and planning for exascale computing presents significant power challenges to the industry. ARM delivers the enabling technology behind HPC. The 64-bit design of the ARMv8-A architecture combined with Advanced SIMD vectorization are ideal to enable large scientific computing calculations to be executed efficiently on ARM HPC machines. In addition ARM and its partners are working to ensure that all the software tools and libraries, needed by both users and systems administrators, are provided in readily available, optimized packages."
Learn more: https://developer.arm.com/hpc
and
http://hpcuserforum.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
ILP32 is a programming model that may be useful on AArch64 systems for performance and also for legacy code with 32-bit data size assumptions. We combined ILP32 support from upstream projects with the LEAP distribution to enable experimentation with this model. This talk discusses the relative benchmark performance of the LP64 and ILP32 programming models under AArch64.
This was presented by Yong LU at the OpenPOWER Summit EU 2019. The original slides are available at:
https://static.sched.com/hosted_files/opeu19/16/OpenCAPI%20Acceleration%20Framework_YongLu_ver2.pdf
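To make the LP64/ILP32 distinction concrete, here is a minimal C sketch (ours, not from the talk). Under LP64, long and pointers are 8 bytes; under an ILP32 ABI on AArch64 they shrink to 4 bytes while the core still executes in the 64-bit AArch64 state. The build lines assume a toolchain built with ILP32 multilib support.

    /* data_model.c - print the sizes that distinguish LP64 from ILP32.
     * Hypothetical build lines (require an ILP32-capable AArch64 toolchain):
     *   aarch64-linux-gnu-gcc -mabi=lp64  data_model.c   -> prints 4/8/8
     *   aarch64-linux-gnu-gcc -mabi=ilp32 data_model.c   -> prints 4/4/4 */
    #include <stdio.h>

    int main(void)
    {
        printf("sizeof(int)    = %zu\n", sizeof(int));
        printf("sizeof(long)   = %zu\n", sizeof(long));
        printf("sizeof(void *) = %zu\n", sizeof(void *));
        return 0;
    }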
In the design of electronics and semiconductors, challenges are compounded by the integration of AI, multi-core, real-time software, network, connectivity, diagnostics, and security. Performance limits, battery life, and cost are adoption barriers. It is extremely important to have tools and processes that deliver efficiency throughout the design cycle.
Continuous verification from planning to development addresses the multi-discipline needs of hardware, software, and networks. This unique approach accelerates the design phase, defines the test efforts, and finds defects during specification. Architecture modeling is required to meet timing deadlines, generate the lowest power consumption, attain the highest quality of service, optimize the electronic design system, and design custom components.
Xilinx vs Intel (Altera) FPGA Performance Comparison (Roy Messinger)
You're welcome to check out this comparison I conducted between the two vendors. The results are very interesting and surprising (I did not expect such differences).
In this deck from the UK HPC Conference, Gunter Roeth from NVIDIA presents: Hardware & Software Platforms for HPC, AI and ML.
"Data is driving the transformation of industries around the world and a new generation of AI applications are effectively becoming programs that write software, powered by data, vs by computer programmers. Today, NVIDIA’s tensor core GPU sits at the core of most AI, ML and HPC applications, and NVIDIA software surrounds every level of such a modern application, from CUDA and libraries like cuDNN and NCCL embedded in every deep learning framework and optimized and delivered via the NVIDIA GPU Cloud to reference architectures designed to streamline the deployment of large scale infrastructures."
Watch the video: https://wp.me/p3RLHQ-l2Y
Learn more: http://nvidia.com
and
http://hpcadvisorycouncil.com/events/2019/uk-conference/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck, Ronald P. Luijten from IBM Research in Zurich presents: DOME 64-bit μDataCenter.
"I like to call it a datacenter in a shoebox. With the combination of power and energy efficiency, we believe the microserver will be of interest beyond the DOME project, particularly for cloud data centers and Big Data analytics applications."
The microserver team has designed and demonstrated a prototype 64-bit microserver using a PowerPC-based chip from Freescale Semiconductor running Linux Fedora and IBM DB2. At 133 × 55 mm² the microserver contains all of the essential functions of today's servers, which are 4 to 10 times larger in size. Not only is the microserver compact, it is also very energy-efficient.
Watch the video: http://wp.me/p3RLHQ-gJM
Learn more: https://www.zurich.ibm.com/microserver/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from the Argonne Training Program on Extreme-Scale Computing 2019, Howard Pritchard from LANL and Simon Hammond from Sandia present: NNSA Explorations: ARM for Supercomputing.
"The Arm-based Astra system at Sandia will be used by the National Nuclear Security Administration (NNSA) to run advanced modeling and simulation workloads for addressing areas such as national security, energy and science.
"By introducing Arm processors with the HPE Apollo 70, a purpose-built HPC architecture, we are bringing powerful elements, like optimal memory performance and greater density, to supercomputers that existing technologies in the market cannot match,” said Mike Vildibill, vice president, Advanced Technologies Group, HPE. “Sandia National Laboratories has been an active partner in leveraging our Arm-based platform since its early design, and featuring it in the deployment of the world’s largest Arm-based supercomputer, is a strategic investment for the DOE and the industry as a whole as we race toward achieving exascale computing.”
Watch the video: https://wp.me/p3RLHQ-l29
Learn more: https://insidehpc.com/2018/06/arm-goes-big-hpe-builds-petaflop-supercomputer-sandia/
and
https://extremecomputingtraining.anl.gov/agenda-2019/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Compare Performance-Power of Arm Cortex vs RISC-V for AI Applications, Oct 2021 (Deepak Shankar)
Abstract: In this webinar, we will show you how to construct, simulate, analyze, validate, and optimize an architecture model using pre-built components. We will compare micro- and application benchmarks on system SoC models containing clusters of ARM Cortex-A53, SiFive U74, ARM Cortex-A77, and other vendor cores. The system will be built around custom switches, ingress/egress buffers, credit flow control, AI accelerators, NoC and AMBA AXI buses with multi-level caches, DDR4 DRAM, and DMA. The evaluation and optimization criteria will be task latency, dCache hit ratio, power consumed per task, and memory bandwidth. The parameters to be modified are bus topology, cache size, processor clock speed, custom arbiters, task thread allocation, and the processor pipeline.
Selection of cores is a combination of financial and technical bias. Technical comparison of processor cores requires an understanding of the workload, task partitioning, and cache-memory structure. A core must be evaluated in the context of the target application. To evaluate these selections, architecture simulation software must be fortified with a library of intellectual property for power- and timing-accurate processor cores, a simulator running at 100 million events per second, peripherals, and all possible traffic distributions.
Key Takeaways:
1. Validating architecture models using mathematical calculus and hardware traces
2. Construct custom policies, arbitrations and configure processor cores
3. Select the right combination of statistics to detect bottlenecks and optimize the architecture
4. Identify the right use of stochastic, transaction, cycle-accurate and traces to construct the model
Speaker Bio:
Alex Su is an FPGA solution architect at E-Elements Technology, Hsinchu, Taiwan. He has been an FPGA solution architect and Xilinx FPGA trainer for a number of years, supporting companies, research centers, and universities in China and Taiwan. Prior to that, Mr. Su worked at ARM Ltd for 5 years in technical support of Arm CPU and System IP. Alex has also been engaged with a variety of FPGA-based hardware emulation systems and has over ten years of experience as an ASIC/SoC design and verification engineer.
Deepak Shankar is the Founder of Mirabilis Design and has been involved in the architecture exploration of over 250 SoCs and processors. Mr. Shankar started Mirabilis Design because of a vacuum in the systems engineering and modeling space with the focus shifting to network design and early software development. Deepak has published over 50 articles and presented at over 30 conferences in EDA, semiconductors, and embedded computing. Mr. Shankar has an MBA from UC Berkeley, an MS from Clemson University, and a BS from Coimbatore Institute of Technology, the latter two both in Electronics and Communication.
The Microarchitecture of FPGA-Based Soft Processors (Deepak Tomar)
This presentation is based on the paper "The Microarchitecture of FPGA-Based Soft Processors" by Peter Yiannacouras, Jonathan Rose, and J. Gregory Steffan, Dept. of Electrical and Computer Engineering, University of Toronto.
Macromolecular crystallography is an experimental technique for exploring the 3D atomic structure of proteins, used by academics for research in biology and by pharmaceutical companies in rational drug design. While development of the technique has until now been limited by the performance of scientific instruments, computing performance is becoming a key limitation. In my presentation I will present the computing challenge of handling an 18 GB/s data stream coming from the new X-ray detector. I will show PSI's experience in applying conventional hardware to the task and why this attempt failed. I will then present how an IC 922 server with OpenCAPI-enabled FPGA boards allowed us to build a sustainable and scalable solution for high-speed data acquisition. Finally, I will give a perspective on how advances in hardware development will enable better science for users of the Swiss Light Source.
Luigi Brochard from Lenovo gave this talk at the Switzerland HPC Conference. "High performance computing is converging more and more with the big data topic and related infrastructure requirements in the field. Lenovo is investing in developing systems designed to resolve today's and future problems in a more efficient way and to respond to the demands of the industrial and research application landscape."
Watch the video: http://wp.me/p3RLHQ-gDC
Learn more: http://www3.lenovo.com/us/en/data-center/solutions/hpc/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Redfish is an IPMI replacement standardized by the DMTF. It provides a RESTful API for server out-of-band management and a lightweight data model specification that is scalable, discoverable, and extensible (cf. http://www.dmtf.org/standards/redfish). This presentation will start by detailing its role and the features it provides, with examples. It will demonstrate the benefits it provides to system administrators by offering a standardized open interface for multiple servers, and also storage systems.
We will then cover various tools such as the DMTF ones and the python-redfish library (Cf: https://github.com/openstack/python-redfish) offering Redfish abstractions.
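As a rough sketch of what a Redfish call looks like at the HTTP level (our illustration, not from the presentation; the BMC hostname and credentials are placeholders), the C program below fetches the standard /redfish/v1/ service root with libcurl:

    /* redfish_get.c - fetch the Redfish service root with libcurl.
     * Build: cc redfish_get.c -lcurl
     * The endpoint and credentials are placeholders; production code
     * should verify TLS certificates and parse the JSON response. */
    #include <stdio.h>
    #include <curl/curl.h>

    int main(void)
    {
        CURL *curl = curl_easy_init();
        if (!curl)
            return 1;

        /* /redfish/v1/ is the service root defined by the DMTF spec. */
        curl_easy_setopt(curl, CURLOPT_URL, "https://bmc.example.com/redfish/v1/");
        curl_easy_setopt(curl, CURLOPT_USERPWD, "admin:password"); /* placeholder */
        curl_easy_setopt(curl, CURLOPT_SSL_VERIFYPEER, 0L);        /* lab only */

        CURLcode rc = curl_easy_perform(curl); /* JSON body goes to stdout */
        if (rc != CURLE_OK)
            fprintf(stderr, "request failed: %s\n", curl_easy_strerror(rc));

        curl_easy_cleanup(curl);
        return rc == CURLE_OK ? 0 : 1;
    }

Higher-level wrappers such as the python-redfish library mentioned below abstract these raw calls behind resource objects.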
One of the biggest issues for a developer – whether they are an engineer at an OEM or working for a mobile AI application startup – is that their apps are at the mercy of pre-set power and performance settings as defined by OEMs or Silicon vendors. So how can a developer break through that barrier when it seems their hands are tied behind their backs? The Snapdragon Power Optimization SDK allows developers to control the CPU and GPU frequency much more finely from their own application logic. This provides developers with more control within the bounds of the power/thermal framework.
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors (Intel® Software)
The second-generation Intel® Xeon Phi™ processor offers new and enhanced features that provide significant performance gains in modernized code. For this lab, we pair these features with Intel® Software Development Products and methodologies to enable developers to gain insights on application behavior and to find opportunities to optimize parallelism, memory, and vectorization features.
Intel Distribution for Python - Scaling for HPC and Big Data (Desmond Yuen)
Make Python usable beyond the prototyping environment by scaling out to HPC and big data environments. Hardware and software efficiency is crucial in production (perf/watt, etc.).
The Cypress PSoC is a programmable “system on chip” device which includes all the functions of a traditional microcontroller, in addition to programmable analog and digital blocks. This combination of resources makes the chip well suited to robotics applications. This will be an introductory talk covering the basic architecture and development tools.
Watch the replay: http://cs.co/90028vTty
You have heard of NETCONF, YANG, Python, and REST. However, do you really understand what they can do for your business, network, and networking career?
Learn how you can create apps that automate operations, consume network telemetry, and control network resources with minimal programming experience.
Resources:
Watch the New Era of Networking playlist: http://cs.co/90058sI1U
AI for All: Biology is eating the world & AI is eating Biology (Intel® Software)
Advances in cell biology and the creation of immense amounts of data are converging with advances in machine learning to analyze this data. Biology is experiencing its AI moment, driving the massive computation involved in understanding biological mechanisms and driving interventions. Learn about how cutting-edge technologies such as Software Guard Extensions (SGX) in the latest Intel Xeon processors and Open Federated Learning (OpenFL), an open framework for federated learning developed by Intel, are helping advance AI in gene therapy, drug design, disease identification, and more.
Python Data Science and Machine Learning at Scale with Intel and Anaconda (Intel® Software)
Python is the number one language for data scientists, and Anaconda is the most popular Python platform. Intel and Anaconda have partnered to bring scalability and near-native performance to Python with simple installations. Learn how data scientists can now access oneAPI-optimized Python packages such as NumPy, Scikit-Learn, Modin, Pandas, and XGBoost directly from the Anaconda repository through simple installation and minimal code changes.
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci (Intel® Software)
Preprocess, visualize, and build AI faster at scale on Intel architecture. Develop end-to-end AI pipelines for inferencing, including data ingestion, preprocessing, and model inferencing with tabular, NLP, RecSys, video, and image data, using the Intel oneAPI AI Analytics Toolkit and other optimized libraries. Build at-scale, performant pipelines with Databricks and end-to-end Xeon optimizations. Learn how to visualize with the OmniSci Immerse Platform and experience a live demonstration of the Intel Distribution of Modin and OmniSci.
AI for good: Scaling AI in science, healthcare, and more (Intel® Software)
How do we scale AI to its full potential to enrich the lives of everyone on earth? Learn about AI hardware and software acceleration and how Intel AI technologies are being used to solve critical problems in high energy physics, cancer research, financial inclusion, and more. Get started on your AI Developer Journey @ software.intel.com/ai
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su... (Intel® Software)
Software AI Accelerators deliver orders of magnitude performance gain for AI across deep learning, classical machine learning, and graph analytics and are key to enabling AI Everywhere. Get started on your AI Developer Journey @ software.intel.com/ai.
Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization... (Intel® Software)
Learn about the algorithms and associated implementations that power SigOpt, a platform for efficiently conducting model development and hyperparameter optimization. Get started on your AI Developer Journey @ software.intel.com/ai.
Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency | S... (Intel® Software)
oneDNN Graph API extends oneDNN with a graph interface which reduces deep learning integration costs and maximizes compute efficiency across a variety of AI hardware including AI accelerators. Get started on your AI Developer Journey @ software.intel.com/ai.
AWS & Intel Webinar Series - Accelerating AI ResearchIntel® Software
Scale your research workloads faster with Intel on AWS. Learn how the performance and productivity of Intel Hardware and Software help bridge the gap between ideation and results in Data Science. Get started on your AI Developer Journey @ software.intel.com/ai.
Whether you are an AI, HPC, IoT, Graphics, Networking or Media developer, visit the Intel Developer Zone today to access the latest software products, resources, training, and support. Test-drive the latest Intel hardware and software products on DevCloud, our online development sandbox, and use DevMesh, our online collaboration portal, to meet and work with other innovators and product leaders. Get started by joining the Intel Developer Community @ software.intel.com.
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl... (Intel® Software)
Explore practical elements, such as performance profiling, debugging, and porting advice. Get an overview of advanced programming topics, like common design patterns, SIMD lane interoperability, data conversions, and more.
Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses... (Intel® Software)
Explore how to build a unified framework based on FFmpeg and GStreamer to enable video analytics on all Intel® hardware, including CPUs, GPUs, VPUs, FPGAs, and in-circuit emulators.
Review state-of-the-art techniques that use neural networks to synthesize motion, such as mode-adaptive neural network and phase-functioned neural networks. See how next-generation CPUs with reinforcement learning can offer better performance.
RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect... (Intel® Software)
This talk focuses on the newest release in RenderMan* 22.5 and its adoption at Pixar Animation Studios* for rendering future movies. With native support for Intel® Advanced Vector Extensions, Intel® Advanced Vector Extensions 2, and Intel® Advanced Vector Extensions 512, it includes enhanced library features, debugging support, and an extensive test framework.
Neuro-symbolic is not enough, we need neuro-*semantic* (Frank van Harmelen)
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti... (Jeffrey Haguewood)
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Generating a custom Ruby SDK for your web service or Rails API using Smithy (g2nightmarescribd)
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf (91mobiles)
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Elevating Tactical DDD Patterns Through Object Calisthenics (Dorra Bartaguiz)
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Connector Corner: Automate dynamic content and events by pushing a button (DianaGray10)
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... (Ramesh Iyer)
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024 (Albert Hoitingh)
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Securing your Kubernetes cluster: a step-by-step guide to success! (KatiaHIMEUR1)
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
2. Agenda
HPC at HPE
System Configuration and Tuning
Best Practices for Building Applications
Intel Xeon Processors
Efficient Methods in Executing Applications
Tools and Techniques for Boosting Performance
Application Performance Highlights
Conclusions
6. Typical BIOS Settings: Processor Options
– Hyperthreading Options = Disabled: Better scaling for most HPC workloads.
– Processor Core Disable = 0: Enables all available cores.
– Intel Turbo Boost Technology = Enabled: Increases clock frequency (the actual increase depends on workload, active core count, and power/thermal headroom).
– ACPI SLIT Preferences = Enabled: Lets the OS improve performance by allocating resources efficiently among the processor, memory, and I/O subsystems.
– QPI Snoop Configuration = Home/Early/COD: Experiment and set the right configuration for your workload.
Home: High memory bandwidth for average NUMA workloads.
COD: Cluster On Die; increased memory bandwidth for optimized and aggressive NUMA workloads.
Early: Decreases latency but may also decrease memory bandwidth compared to the other two modes.
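After changing these settings, it is worth confirming from the OS that they actually took effect. A minimal sketch, assuming a Linux node (exact sysfs paths vary with the kernel and cpufreq driver):
# "Thread(s) per core: 1" confirms Hyperthreading is off
lscpu | grep -E 'Thread|Core|Socket'
# With the intel_pstate driver, no_turbo = 0 means Turbo Boost is enabled
cat /sys/devices/system/cpu/intel_pstate/no_turbo
# The NUMA layout should reflect the snoop mode (COD exposes two nodes per socket)
numactl --hardware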
7. Typical BIOS Settings: Power Settings and Management
– HPE Power Profile should be set to Maximum Performance for the best performance (idle and average power will increase significantly).
– Custom Power Profile reduces idle and average power at the expense of a 1-2% performance reduction.
– To get the highest Turbo clock speeds (when only some cores are in use), use the Power Savings settings.
– For Custom Power Profile, you will also have to set the following additional settings:
9. Building Applications: Intel Compiler Flags
-O2 enable optimizations (= -O, the default)
-O1 optimize for speed, but disable optimizations that increase code size for a small speed benefit
-O3 enable -O2 plus more aggressive optimizations that may or may not improve performance for all programs
-fast enable -O3 -ipo -static
-xHOST optimize code for the node used for compilation
-xAVX enable the advanced vector instruction set AVX (for Ivy Bridge performance)
-xCORE-AVX2 enable the advanced vector instruction set AVX2 (key for Haswell/Broadwell performance)
-xMIC-AVX512 enable the AVX-512 instruction set (for KNL; Skylake-based systems use -xCORE-AVX512)
-mp maintain floating-point precision (disables some optimizations)
-parallel enable the auto-parallelizer to generate multi-threaded code
-openmp generate multi-threaded parallel code based on OpenMP directives
-ftz enable/disable flushing denormalized results to zero
-opt-streaming-stores [always|auto|never] generate streaming stores
-mcmodel=[small|medium|large] control code and data memory allocation
-fp-model [fast|precise|source|strict] control floating-point model variation
-mkl=[parallel|sequential|cluster] link to the Intel MKL library to build optimized code
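For illustration, a typical build combining these flags might look like the following (the source file, binary name, and flag choices are hypothetical; per the tips later in this deck, start simple and add flags one at a time):
# Optimize for the build node, enable OpenMP, link Intel MKL
mpiifort -O3 -xHOST -openmp -mkl=parallel -o myprogram.exe myprogram.f90
# If numerical results drift with optimization, tighten the floating-point model
mpiifort -O2 -xHOST -fp-model precise -o myprogram.exe myprogram.f90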
15-16. Intel Xeon Processors: Turbo, AVX and more
[Tables of per-model Base and Turbo frequencies; complete specifications at:
http://www.intel.com/content/www/us/en/processors/xeon/xeon-e5-v3-spec-update.html]
17. Intel Xeon Processors: Turbo, AVX and more (Contd…)
Intel publishes 4 different reference frequencies for every Xeon processor:
1. Base Frequency  2. Non-AVX Turbo  3. AVX Base Frequency  4. AVX Turbo
• The Turbo clock for a given model can vary by as much as 5% from one processor to another
• Four possible scenarios exist:
• Turbo=OFF and AVX=NO => clock is set to the Base frequency
• Turbo=ON and AVX=NO => clock ranges from Base to Non-AVX Turbo
• Turbo=OFF and AVX=YES => clock ranges from AVX Base to Base frequency
• Turbo=ON and AVX=YES => clock ranges from AVX Base to AVX Turbo
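Because the effective clock depends on which scenario applies, it helps to observe the actual frequency while the workload runs. A minimal sketch for a Linux node (tool availability and option spellings vary by version):
# Instantaneous per-core clocks; run while the application is under load
grep 'cpu MHz' /proc/cpuinfo | sort | uniq -c
# turbostat (kernel tools) reports average MHz and busy % per core; older versions use -i 5
turbostat --interval 5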
19. Running Parallel Programs in a Cluster: Intel MPI
– Environment in general
– export PATH
– export LD_LIBRARY_PATH
– export MPI_ROOT
– export I_MPI_FABRICS=shm:dapl
– export I_MPI_DAPL_PROVIDER=ofa-v2-mlx5_0-1u
– export NPROCS=256
– export PPN=16
– export I_MPI_PIN_PROCESSOR_LIST=0-15
– export OMP_NUM_THREADS=2
– export KMP_STACKSIZE=400M
– export KMP_SCHEDULE=static,balanced
– Example command using Intel MPI
– time mpirun -np $NPROCS -hostfile ./hosts -genvall -ppn $PPN -genv I_MPI_PIN_DOMAIN=omp ./myprogram.exe
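The -hostfile argument points to a plain text file listing the target nodes. A minimal sketch (hostnames are hypothetical; with NPROCS=256 and PPN=16, the run spans 16 nodes):
# ./hosts: one hostname per line (continue through node016 for 16 nodes)
node001
node002
node003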
20. Profiling a Parallel Program: Intel MPI
– Using Intel MPS (MPI Performance Snapshot)
– Set all environment variables needed to run the Intel MPI based application
– Additionally source the following:
source /opt/intel/16.0/itac/9.1.2.024/intel64/bin/mpsvars.sh [--papi | --vtune]
– Run your application as: mpirun -mps -np $NPROCS -hostfile ./hosts ….
– Two files, app_stat_xxx.txt and stats_xxx.txt, will be available at the end of the job.
– Analyze these *.txt files using the mps tool.
– Sample data you can gather:
– Computation Time: 174.54 sec (51.93%)
– MPI Time: 161.58 sec (48.07%)
– MPI Imbalance: 147.27 sec (43.81%)
– OpenMP Time: 155.79 sec (46.35%)
– I/O wait time: 576.47 sec (0.08%)
– Using Intel MPI built-in profiling capabilities
– Native mode: mpirun -env I_MPI_STATS 1-4 -env I_MPI_STATS_FILE native_1to4.txt …
– IPM mode: mpirun -env I_MPI_STATS ipm -env I_MPI_STATS_FILE ipm_full.txt
22. Tools, Techniques and Commands
– Check the Linux pseudo files and confirm the system details:
– cat /proc/cpuinfo >> provides processor details (Intel's tool: cpuinfo.x)
– cat /proc/meminfo >> shows the memory details
– /usr/sbin/ibstat >> shows the InfiniBand interconnect fabric details
– /sbin/sysctl -a >> shows details of the system (kernel, file system, etc.)
– /usr/bin/lscpu >> shows CPU details including cache sizes
– /usr/bin/lstopo >> shows the hardware topology
– /bin/uname -a >> shows the system information
– /bin/rpm -qa >> shows the list of installed products including versions
– cat /etc/redhat-release >> shows the Red Hat release version
– /usr/sbin/dmidecode >> shows system hardware and other details (needs root)
– /bin/dmesg >> shows system boot-up messages
– /usr/bin/numactl >> checks or sets the NUMA policy for processes or shared memory
– /usr/bin/taskset >> sets or retrieves a running process's CPU affinity
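To make the last two commands concrete, a short sketch (the binary name, core numbers, and PID are illustrative):
# Show the NUMA topology of the node
numactl --hardware
# Run a program with CPUs and memory confined to NUMA node 0
numactl --cpunodebind=0 --membind=0 ./myprogram.exe
# Launch on cores 0-7 only, or inspect the affinity of a running process
taskset -c 0-7 ./myprogram.exe
taskset -p 12345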
23. Top 10 Practical Tips for Boosting Performance
– Check the system details thoroughly (never assume!)
– Choose a compiler and MPI carefully when building your application (they are not all the same!)
– Start with some basic compiler flags and try additional flags one at a time (optimization is incremental!)
– Use the built-in libraries and tools to save time and improve performance (libraries and tools are your friends!)
– Change compiler or MPI if your code fails to compile or run correctly (trying to force a fix is often futile!)
– Test your application at every level to arrive at an optimized code (remember the 80-20 rule!)
– Customize your runtime environment to achieve the desired goals (process-parallel, hybrid runs, etc.)
– Always place and bind processes and threads appropriately (a life saver! see the sketch below)
– Gather, check and correct your runtime environment (what you get may not be what you want!)
– Profile and adjust optimization and runtime environments accordingly (exercise caution!)
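As an illustration of the placement-and-binding tip, a hybrid MPI+OpenMP launch might pin one rank per socket with its threads packed inside that socket (values are examples, not recommendations):
# One MPI rank per socket; OpenMP threads pinned within each rank's socket
export I_MPI_PIN_DOMAIN=socket
export OMP_NUM_THREADS=16
export KMP_AFFINITY=granularity=fine,compact,1,0
mpirun -np $NPROCS -ppn 2 -hostfile ./hosts ./myprogram.exe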
25. Application: High Performance Linpack (HPL)
Description:
A benchmark that measures floating-point rates (and times) by solving a random dense linear system of equations in double precision.
• Originally developed by Jack Dongarra's team at the Univ. of Tennessee.
• The Intel-optimized HPL binary was used for this study.
• The code was run in hybrid mode: one MPI process per processor, with each process launching as many threads as there are cores on the processor.
• Explicit placement and binding of threads was used.
• Various choices of array sizes and other parameters were attempted to identify the best performance.
• The code provides a self-check to validate the results.
Additional details at: http://icl.eecs.utk.edu/hpl/
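A minimal sketch of such a hybrid single-node run, assuming the Intel-optimized xhpl binary on a two-socket node (thread count and HPL.dat values are illustrative and must be tuned per system):
# One rank per processor (socket); threads = cores per socket
export I_MPI_PIN_DOMAIN=socket
export OMP_NUM_THREADS=22
mpirun -np 2 -ppn 2 ./xhpl
# HPL.dat supplies the problem size N (sized to fill most of memory),
# block size NB, and the P x Q process grid (P*Q = number of MPI ranks)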
28. Application: High Performance Conjugate Gradient (HPCG)
Description:
A benchmark designed to create a new metric for ranking HPC systems, complementing the current HPL benchmark. HPCG exercises computational and data-access patterns that more closely match a different, broad set of important HPC applications.
• Supports various operations in a standalone, unified code.
• The reference implementation is written in C++ with MPI and OpenMP support.
• Driven by a multigrid-preconditioned conjugate gradient algorithm that exercises the key kernels on a nested set of coarse grids.
• Unlike HPL, HPCG can be run for a predetermined time (an input parameter).
• The local domain size (an input) for a node is replicated to form the global domain, resulting in near-linear speed-up.
• Performance is measured by the GFLOP/s rating reported by the code.
• An Intel-optimized HPCG binary was used for this benchmark study.
• The HPCG binary was run in hybrid mode: MPI processes + OpenMP threads.
Additional details at: http://hpcg-benchmark.org/
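The two inputs mentioned above (local domain size and run time) are supplied in the hpcg.dat input file. A minimal sketch: the first two lines are free-text headers, line 3 gives the local domain nx ny nz per process, and line 4 the target run time in seconds (values illustrative; official submissions require 1800 s):
HPCG benchmark input file
(header line; free text)
104 104 104
1800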
30. Application: Graph500
Description:
A benchmark designed to address the performance of data-intensive HPC applications using graph algorithms.
• The code generates the problem from a scale parameter (input), creating 2^scale vertices.
• Performance is measured in TEPS (Traversed Edges Per Second).
• The median TEPS, in either GTEPS (Giga-TEPS) or MTEPS (Mega-TEPS), is reported.
• Source code optimized for a scale-out (DMP) system by Kyushu University (Japan) was used.
• The application is written in C.
• Compiled using the GNU compiler, gcc.
• The code automatically detects the number of processors and cores and runs optimally.
• No external placement or binding by the user is needed.
• A large memory footprint is needed to run very large-scale problems.
Additional details at: http://www.graph500.org/
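For readers who want to try the idea with the publicly available reference implementation rather than the Kyushu-optimized code, a run might look like this (binary name, scale, and rank count are illustrative; check the usage string of your build):
# Reference MPI BFS code: arguments are SCALE (2^26 vertices here) and
# edge factor (average edges per vertex; the Graph500 default is 16)
mpirun -np 64 ./graph500_reference_bfs 26 16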
32. Application: Weather Research and Forecasting (WRF)
Description:
WRF is a Numerical Weather Prediction (NWP) model designed to serve both atmospheric research and operational forecasting needs. NWP refers to the simulation and prediction of the atmosphere with a computer model, and WRF is a set of software to accomplish this.
• The code was jointly developed by NCAR, NCEP, FSL, AFWA, NRL, the Univ. of Oklahoma, and the FAA.
• WRF is freely distributed and supported by NCAR.
• Offers two dynamical solvers: WRF-ARW (Advanced Research WRF) and WRF-NMM (Nonhydrostatic Mesoscale Model).
• Modules can be mixed and matched to simulate various atmospheric conditions and coupled with other NWP models (e.g., ocean modeling codes).
• Can accommodate simulations with nested data domains, coarse to very fine grids, in a single simulation.
• Popular data sets to port and optimize are CONUS12 and CONUS2.5 (available from NCAR).
• Options exist to use dedicated processors for I/O (quilting) and various processor layouts (tiling); see the namelist sketch below.
Additional details at: http://www.wrf-model.org/index.php
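To illustrate the quilting option, the relevant fragment of WRF's namelist.input might look like this (task counts are illustrative and should be sized to the run):
&namelist_quilt
 nio_tasks_per_group = 2,   ! dedicated I/O tasks per group
 nio_groups = 1,            ! number of I/O groups
/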
33. WRF (v3.8.1) Results with CONUS 2.5km Data Set
[Chart: WRF scaling with CONUS 2.5km; Ave. Time/step (s) and Speedup vs. No. of Nodes (4 to 128)]
WRF GFLOPS with CONUS 2.5km:
No. of Nodes: 4    8    16    24    32    48    64    96    128
GFLOPS:       202  448  1017  1585  2034  3234  4065  5245  7964
System: XL170r Gen9, Intel E5-2699 v4, 2.2 GHz, 2P/44C, 256 GB (DDR4-2133 MHz) memory, RHEL 6.7, IB/EDR 1:1, Intel Composer XE and Intel MPI (2016.3.210), Turbo ON, Hyperthreading OFF
34. Application: CloverLeaf
Description:
CloverLeaf is an open-source Computational Fluid Dynamics (CFD) code developed and distributed by the UK Mini-Application Consortium (UK-MAC).
• Solves the compressible Euler equations on a Cartesian grid using an explicit, second-order-accurate method.
• Uses a 'kernel' (low-level building block) approach with minimal control logic to increase compiler optimization.
• Support for accelerators (using both OpenACC and OpenCL) is available.
• Sacrifices memory (saving intermediate results rather than re-computing them) to improve performance.
• Available in two flavors: 2-Dimensional (2D) and 3-Dimensional (3D) modeling.
• Rugged, easy to port and run, and a good candidate for evaluating and comparing systems.
• A large number of data sets are available for the 2D and 3D models, with run times controllable from a few seconds to hours.
Additional details at: http://uk-mac.github.io/CloverLeaf/
35. CloverLeaf (3D) Results with bm256 Data Set
[Chart: CloverLeaf scaling with bm256 data; Wall Clock (s) and Speedup vs. No. of Nodes (1 to 128)]
System: XL230a Gen9, Intel E5-2697A v4, 2.6 GHz, 2P/32C, 128 GB (DDR4-2400 MHz) memory, RHEL 7.2, IB/EDR 1:1, Intel Composer XE and Intel MPI (2016.3.210), Turbo ON, Hyperthreading OFF
37. Conclusions
HPE is the No. 1 vendor in HPC and cluster solutions
Configure and tune the system first
Check the system details (processor, clock, memory and BIOS settings)
Investigate the compiler and flags that best suit your application
Profile the application and optimize further to boost performance
Explore and decide on the right interconnect and protocols
Take advantage of tools and commands to improve performance
Run the application the right way (environment, placement, etc.)
Choose the right file system (local disk, NFS, Lustre, IBRIX, etc.)
Settle on an environment that is best for your application, time, and value
Never assume; always check the cluster before benchmarking