This document discusses managing memory on supercomputers with heterogeneous memory systems. It begins with background on the author's qualifications and research. It then outlines techniques for characterizing memory on these systems, including attributes like bandwidth and latency. APIs are presented for accessing memory attribute information. Benchmarking, profiling, and static code analysis are discussed as methods for determining how best to allocate memory for applications on heterogeneous systems based on each application's sensitivity to different memory metrics. The document emphasizes strategies for improving application productivity on these complex memory architectures.
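The allocation guidance described above can be made concrete with a small sketch: given a characterization of each memory tier's attributes and an application's measured sensitivity to them, pick the best-scoring tier that fits the working set. The tier names and numbers below are hypothetical, not taken from any system in the document.

```python
# Hypothetical sketch: pick a memory tier for an application based on its
# sensitivity to bandwidth and latency. All attribute values are illustrative.

TIERS = {
    "HBM": {"bandwidth_gbs": 800, "latency_ns": 120, "capacity_gb": 16},
    "DDR": {"bandwidth_gbs": 200, "latency_ns": 90,  "capacity_gb": 256},
    "NVM": {"bandwidth_gbs": 40,  "latency_ns": 300, "capacity_gb": 1024},
}

def choose_tier(bw_sensitivity, lat_sensitivity, footprint_gb):
    """Score each tier that fits the footprint; bandwidth helps, latency hurts."""
    candidates = {name: t for name, t in TIERS.items()
                  if t["capacity_gb"] >= footprint_gb}
    def score(t):
        return bw_sensitivity * t["bandwidth_gbs"] - lat_sensitivity * t["latency_ns"]
    return max(candidates, key=lambda name: score(candidates[name]))

# A bandwidth-bound solver with an 8 GB working set lands in HBM:
print(choose_tier(bw_sensitivity=1.0, lat_sensitivity=0.1, footprint_gb=8))
```

In practice the attribute table would come from benchmarking or from a memory-attribute API rather than being hard-coded.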
A New Direction for Computer Architecture Research (dbpublications)
In this paper we suggest a different computing environment as a worthy new direction for computer architecture research: personal mobile computing, where portable devices are used for visual computing and personal communications tasks. Such a device supports, in an integrated fashion, all the functions provided today by a portable computer, a cellular phone, a digital camera, and a video game. The requirements placed on the processor in this environment are energy efficiency, high performance for multimedia and DSP functions, and area-efficient, scalable designs. We examine the architectures recently proposed for billion-transistor microprocessors. While these are very promising for stationary desktop and server workloads, we find that most of them are unable to meet the challenges of the new environment and provide the necessary enhancements for multimedia applications running on portable devices.
Efficient Machine Learning and Machine Learning for Efficiency in Information... (Bhaskar Mitra)
Emerging machine learning approaches, including deep learning methods, for information retrieval (IR) have recently demonstrated significant improvements in the accuracy of relevance estimation, at the cost of increased model complexity and a corresponding rise in the computational and environmental costs of training and inference. In web search, these costs are further compounded by the necessity to train on large-scale datasets, consume long documents as inputs, and retrieve relevant documents from web-scale collections within milliseconds in response to high-volume query traffic. A typical playbook for developing deep learning models for IR involves largely ignoring efficiency concerns during model development and then later scaling these methods by either finding faster approximations of the same models or employing heuristics to reduce the input space over which these models operate. Domain knowledge about the specific IR task and a deeper understanding of the system design and data structures in whose context these models are deployed can significantly help not only with model simplification but also with informing data-structure-specific machine learning model design. Alternatively, predictive machine learning can also be employed specifically to improve efficiency in large-scale IR settings. In this talk, I will cover several case studies, both of improving the efficiency of machine learning models for IR and of the direct application of machine learning to improve retrieval efficiency, and conclude with a brief discussion of potential future directions for efficiency-sensitive benchmarking of machine learning models for IR.
Accelerating Real Time Applications on Heterogeneous Platforms (IJMER)
In this paper we describe a novel implementation of depth estimation from stereo images using feature extraction algorithms that run on the graphics processing unit (GPU), making it suitable for real-time applications such as analyzing video in real-time vision systems. Modern graphics cards contain large numbers of parallel processors and high-bandwidth memory that accelerate data-parallel computation. We give a general overview of how to accelerate real-time applications using heterogeneous platforms, and propose using the added resources to apply more computationally involved optimization methods. This approach can also indirectly accelerate a database by producing better-quality query plans.
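As a point of reference for the stereo pipeline described above, a scalar block-matching sketch (illustrative only, not the paper's implementation) shows the per-pixel sum-of-absolute-differences search that the GPU's parallel processors accelerate, with one independent search per pixel:

```python
def disparity_row(left, right, window=1, max_disp=4):
    """Sum-of-absolute-differences block matching on one scanline.
    For each pixel in `left`, find the shift d (0..max_disp) into `right`
    whose neighborhood matches best; that shift is the disparity."""
    n = len(left)
    disp = [0] * n
    for x in range(window, n - window):
        best_cost, best_d = float("inf"), 0
        for d in range(min(max_disp, x - window) + 1):
            cost = sum(abs(left[x + k] - right[x - d + k])
                       for k in range(-window, window + 1))
            if cost < best_cost:
                best_cost, best_d = cost, d
        disp[x] = best_d
    return disp

# A bright patch shifted right by 2 pixels: the interior pixels
# around it (indices 3-6) recover disparity 2.
left  = [0, 0, 0, 9, 9, 9, 0, 0, 0, 0]
right = [0, 9, 9, 9, 0, 0, 0, 0, 0, 0]
print(disparity_row(left, right))
```

On a GPU, each pixel's search runs in its own thread; this data parallelism is exactly what the abstract's "large numbers of parallel processors" exploit.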
The document discusses plans to establish an institutional high performance computing (HPC) facility at North-West University. It outlines the technical goals of building a Beowulf cluster to link existing departmental clusters and integrate with national and international computational grids. It also discusses management principles for the new HPC facility to ensure sustainability, efficiency, reliability, availability and high performance.
This document provides an overview of HPE solutions for challenges in AI and big data. It discusses HPE storage solutions including aggregated storage-in-compute using NVMe devices, tiered storage using flash, disk, and object storage, and zero watt storage to reduce power usage. It also covers the Scality object storage platform and WekaIO parallel file system for all-flash environments. The document aims to illustrate how HPE technologies can provide efficient, scalable storage for challenging AI and big data workloads.
OpenACC and Hackathons Monthly Highlights: April 2023 (OpenACC)
Stay up-to-date on the latest news, research and resources. This month's edition covers the Open Hackathon Mentor Program, highlights from the recent UK National Hackathon, upcoming Open Hackathon and Bootcamp events, and more!
ACACES 2019: Towards Energy Efficient Deep Learning (LEGATO project)
Motivations of this work:
• Efficient exploitation of heterogeneous hardware for deep learning and higher-order physics applications.
• High efficiency in real time.
• Intelligent resource management that reduces inter-application interference and intra-application contention.
• Task scheduling enhanced with knowledge of workload and resource requirements.
• Adaptation to resource availability and the underlying hardware topology.
Stay up-to-date on the latest news, events and resources for the OpenACC community. This month's highlights cover work on applications for the new Frontier supercomputer, using OpenACC for weather forecasting, upcoming GPU Hackathons and Bootcamps, and new resources!
Performance Comparison between Pytorch and Mindspore (ijdms)
Deep learning is now widely used across many fields. However, training neural networks involves large amounts of data, which has driven the emergence of many deep learning frameworks that aim to serve practitioners with more convenient and better-performing services. MindSpore and PyTorch are both deep learning frameworks: MindSpore is developed by HUAWEI, while PyTorch is developed by Facebook. Some believe that HUAWEI's MindSpore performs better than Facebook's PyTorch, which leaves deep learning practitioners confused about the choice between the two. In this paper, we perform analytical and experimental analysis to compare the training speed of MindSpore and PyTorch on a single GPU. To make our survey as comprehensive as possible, we carefully selected neural networks in two main domains, computer vision and natural language processing (NLP). We conduct detailed benchmarking experiments on MindSpore and PyTorch to analyze the reasons for their performance differences. This work provides guidance for end users choosing between these two frameworks.
[EWiLi2016] Towards a performance-aware power capping orchestrator for the Xe... (Matteo Ferroni)
In the last few years, multi-core processors have entered the domain of embedded systems; this, together with virtualization techniques, allows multiple applications to easily run on the same System-on-Chip (SoC). As power consumption remains one of the most significant costs of any digital system, several approaches have been explored in the literature to cope with power caps while trying to maximize the performance of the hosted applications. In this paper, we present some preliminary results and opportunities towards a performance-aware power capping orchestrator for the Xen hypervisor. The proposed solution, called XeMPUPiL, uses the Intel Running Average Power Limit (RAPL) hardware interface to set a strict limit on the processor's power consumption, while a software-level Observe-Decide-Act (ODA) loop explores the available resource allocations to find the most power-efficient one for the running workload. We show how XeMPUPiL is able to achieve higher performance under different power caps for almost all the classes of benchmarks analyzed (e.g., CPU-, memory- and IO-bound).
Full paper: http://ceur-ws.org/Vol-1697/EWiLi16_17.pdf
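The Observe-Decide-Act pattern behind XeMPUPiL can be sketched schematically. Everything below is a toy model (a simulated power draw standing in for RAPL readings, a core count standing in for the resource allocation), not the orchestrator's actual code:

```python
# Schematic Observe-Decide-Act loop in the spirit of XeMPUPiL (illustrative:
# the real orchestrator drives Intel RAPL and Xen; here power is simulated).

def simulated_power(cores):
    """Toy model: each active core draws ~15 W on top of a 20 W base."""
    return 20 + 15 * cores

def oda_loop(power_cap, max_cores=8, steps=10):
    cores = max_cores
    for _ in range(steps):
        power = simulated_power(cores)            # Observe
        if power > power_cap and cores > 1:       # Decide
            cores -= 1                            # Act: shrink the allocation
        elif simulated_power(cores + 1) <= power_cap and cores < max_cores:
            cores += 1                            # Act: reclaim headroom
    return cores

# Under a 100 W cap the loop settles on 5 cores (20 + 15*5 = 95 W):
print(oda_loop(power_cap=100))
```

A real implementation would replace the toy model with measured power and a performance signal, so the Decide step can pick the most power-efficient allocation rather than just the largest one under the cap.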
Arm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AI (inside-BigData.com)
Satoshi Matsuoka from RIKEN gave this talk at the HPC User Forum in Santa Fe.
"With the rapid rise of Big Data and AI as a new breed of high-performance workloads on supercomputers, we need to accommodate them at scale, and thus need R&D on hardware and software infrastructures where traditional simulation-based HPC and BD/AI converge, in a BYTES-oriented fashion. Post-K is the flagship next-generation national supercomputer being developed by Riken and Fujitsu in collaboration. Post-K will have hyperscale-class resources in one exascale machine, with well more than 100,000 nodes of server-class A64fx many-core Arm CPUs, realized through an extensive co-design process involving the entire Japanese HPC community.
Rather than focusing on double-precision flops, which are of lesser utility, Post-K, especially its A64fx processor and Tofu-D network, is designed to sustain extreme bandwidth on realistic applications, including those for oil and gas such as seismic wave propagation, CFD, and structural codes, besting its rivals by several factors in measured performance. Post-K is slated to perform 100 times faster on some key applications compared to its predecessor, the K-Computer, and will also likely be the premier big data and AI/ML infrastructure. Currently, we are conducting research to scale deep learning to more than 100,000 nodes on Post-K, where we would obtain near top GPU-class performance on each node."
Watch the video: https://wp.me/p3RLHQ-k6G
Learn more: https://en.wikichip.org/wiki/supercomputers/post-k
and
http://hpcuserforum.com
Performance Optimization of SPH Algorithms for Multi/Many-Core Architectures (Dr. Fabio Baruffa)
In the framework of the Intel Parallel Computing Centre at the Research Campus Garching in Munich, our group at LRZ presents recent results on performance optimization of Gadget-3, a widely used community code for computational astrophysics. We identify and isolate a sample code kernel, which is representative of a typical Smoothed Particle Hydrodynamics (SPH) algorithm and focus on threading parallelism optimization, change of the data layout into Structure of Arrays (SoA), compiler auto-vectorization and algorithmic improvements in the particle sorting. We measure lower execution time and improved threading scalability both on Intel Xeon (2.6× on Ivy Bridge) and Xeon Phi (13.7× on Knights Corner) systems. First tests on second generation Xeon Phi (Knights Landing) demonstrate the portability of the devised optimization solutions to upcoming architectures.
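The AoS-to-SoA layout change at the heart of this optimization can be illustrated with NumPy (a generic sketch, not Gadget-3 code): interleaved per-particle records become one contiguous array per field, giving the unit-stride access that auto-vectorization needs.

```python
# Illustration of the AoS -> SoA transformation described for the SPH kernel.
import numpy as np

# Array of Structures: one record per particle, fields interleaved in memory.
aos = np.zeros(4, dtype=[("x", "f8"), ("y", "f8"), ("z", "f8"), ("mass", "f8")])
aos["mass"] = [1.0, 2.0, 3.0, 4.0]

# Structure of Arrays: one contiguous array per field.
soa = {name: np.ascontiguousarray(aos[name]) for name in aos.dtype.names}

# A field-wise kernel (e.g. a momentum update) now streams unit-stride data:
assert soa["mass"].strides == (8,)    # contiguous doubles
assert aos["mass"].strides == (32,)   # strided through 4-field (32-byte) records
```

With the AoS layout a loop over `mass` touches only 8 useful bytes per 32-byte record, wasting cache bandwidth and defeating vector loads; the SoA copy removes that stride, which is the effect the abstract's compiler auto-vectorization work relies on.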
This document proposes a design procedure for a re-configurable convolutional neural network (CNN) engine for field-programmable gate array (FPGA) applications. The procedure includes developing an accurate CNN model using TensorFlow and Python, and implementing a re-configurable CNN engine from scratch using register-transfer level design. The proposed engine was synthesized for 180nm CMOS technology and achieved 96% accuracy on MNIST and CIFAR-10 datasets. A graphical user interface was also designed for loading and testing datasets on the hardware engine.
Exploring emerging technologies in the HPC co-design spacejsvetter
This document discusses emerging technologies for high performance computing (HPC), focusing on heterogeneous computing and non-volatile memory. It provides an overview of HPC architectures past and present, highlighting the trend toward more heterogeneous systems using GPUs and other accelerators. The document discusses challenges for applications to adapt to these changing architectures. It also explores potential future technologies like 3D memory and discusses the Department of Energy's efforts in codesign centers to facilitate collaboration between application developers and emerging hardware.
Elastic multicore scheduling with the XiTAO runtime (Miquel Pericas)
This presentation describes the XiTAO scheduler for heterogeneous computing that is currently under development in the EU LEGaTO project. The scheduler targets mixed-mode parallelism and assigns resource partitions just-in-time by creating a model of the platform's static and dynamic heterogeneity.
A SURVEY OF NEURAL NETWORK HARDWARE ACCELERATORS IN MACHINE LEARNING (mlaij)
The use of machine learning in artificial intelligence is the inspiration that shaped technology as it is today. Machine learning has the power to greatly simplify our lives: improvements in speech recognition and language understanding help the community interact more naturally with technology. The popularity of machine learning opens up opportunities for optimizing the design of computing platforms using well-defined hardware accelerators. In the upcoming few years, cameras will be utilized as sensors for several applications. For ease of use and privacy restrictions, the requested image processing should be limited to a local embedded computer platform, achieve high accuracy, and consume less energy. Dedicated acceleration of convolutional neural networks can achieve these targets with high flexibility to perform multiple vision tasks. However, due to the exponential growth in technology constraints (especially in terms of energy), which could lead to heterogeneous multicores, and an increasing number of defects, defect-tolerant accelerators for heterogeneous multi-cores may become a main micro-architecture research issue. State-of-the-art accelerators still face performance issues such as memory limitations, bandwidth, and speed. This survey summarizes recent work on accelerators, including their advantages and disadvantages, to make it easier for developers with an interest in neural networks to further improve what has already been established.
IRJET- Python Libraries and Packages for Deep Learning-A Survey (IRJET Journal)
This document summarizes a survey of Python libraries and packages that are commonly used for deep learning. It discusses several popular deep learning frameworks like TensorFlow, Keras, Caffe, PyTorch, and Theano that can be used with Python. It also summarizes several research papers that utilized these Python deep learning libraries and packages to implement applications like image classification on embedded devices, mobile edge caching using deep learning, and high performance text recognition. The document highlights the benefits of using Python for deep learning due to its extensive library support, simplicity, reliability, and ease of developing applications.
PEARC17: Interactive Code Adaptation Tool for Modernizing Applications for In... (Ritu Arora)
Often, HPC software outlives the HPC systems for which it was initially developed. Innovations in HPC platforms' hardware and parallel programming standards drive the modernization of HPC applications so that they remain performant. While such code modernization efforts may not be very challenging for HPC experts and well-funded research groups, many domain experts may find it difficult to adapt their applications for the latest HPC platforms due to a lack of expertise, time, and funds. These challenges can be mitigated by providing domain experts with high-level tools for code modernization and migration.
IRJET - Positioning and Tracking of a Person using Embedded Controller in a D... (IRJET Journal)
This document proposes a system to track and monitor the location of individuals within a defined area using GPS. The system uses an ESP8266 microcontroller interfaced with GPS modules to acquire location data and update it to a cloud database. An administrator can then monitor locations in real-time through a mobile app or web interface by requesting location coordinates from the cloud. The system aims to provide easier tracking of individuals compared to conventional camera-based methods while eliminating the need for continuous human monitoring.
The document describes a distributed memory architecture for a coarse-grain reconfigurable architecture (CGRA) with network-on-chip (NoC) capabilities. The key aspects of the architecture are:
1) It uses a distributed memory approach with memory banks (mBanks) connected via a circuit-switched NoC to enable private and parallel execution environments (PREX).
2) The memory is partitionable and partitions can be reconfigured at runtime to modify the memory to computation ratio.
3) Controllers synchronize data streaming from mBanks to computation elements to improve performance and energy efficiency.
Programming Modes and Performance of Raspberry-Pi Clusters (AM Publications)
In present times, updated information and knowledge have become readily accessible to researchers, enthusiasts, developers, and academics through the Internet on many subjects across wide areas of application. The underlying framework facilitating such possibilities is the networking of servers, nodes, and personal computers. However, such setups, comprising mainframes, servers, and networking devices, are inaccessible to many, costly, and not portable. In addition, students and lab-level enthusiasts do not have the requisite access to modify the functionality to suit specific purposes. The Raspberry-Pi (R-Pi) is a small device capable of many functionalities akin to supercomputing while being portable, economical, and flexible. It runs on open-source Linux, making it a preferred choice for lab-level research and studies. Users have started using its embedded networking capability to design portable clusters that replace costlier machines. This paper introduces new users to the most commonly used frameworks and some recent developments that best exploit the capabilities of the R-Pi when used in clusters. It also introduces some of the tools and measures that rate the efficiency of clusters, helping users assess the quality of a cluster design, and aims to make users aware of the various parameters in a cluster environment.
The document discusses the future of computing platforms and how they will change to handle massive amounts of data and machine learning tasks. Some key points:
- Traditional views of performance gains from clock speed increases are over. New architectures enabled by multi-core CPUs will radically change computing.
- "Big data" tasks like search, machine learning, and real-time data analysis will be increasingly important drivers of new computing platforms.
- Simple machine learning models applied to massive amounts of data can produce useful results, even without deep domain expertise. This approach has been demonstrated to work well for tasks like language translation.
- Future platforms may blend CPUs and GPUs differently to best handle both serial and parallel tasks for big data and machine
The document discusses performance characterization across a computing continuum from the edge to the cloud. It evaluates the performance of video encoding and machine learning tasks on different devices. For video encoding, older single-board computers had significantly higher encoding times than other resources but provided lower data transfer times. For machine learning, training a convolutional neural network took much longer than training a simpler model. Cloud and fog resources generally outperformed edge devices for more complex tasks. The document recommends offloading large or complex tasks to more powerful resources when possible.
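The offloading recommendation above follows from a simple cost model: run a task where transfer time plus compute time is smallest. A hedged sketch with made-up site characteristics (not the testbed evaluated in the document):

```python
# Illustrative offload decision for the edge-to-cloud continuum.
# Site names and numbers are hypothetical, chosen only to show the trade-off.

def best_site(work_gflop, input_mb, sites):
    """Pick the site that minimizes transfer time + compute time (seconds)."""
    def total_time(site):
        transfer = input_mb / site["uplink_mb_s"]
        compute = work_gflop / site["gflops"]
        return transfer + compute
    return min(sites, key=lambda name: total_time(sites[name]))

SITES = {
    "edge":  {"uplink_mb_s": float("inf"), "gflops": 5},   # data already local
    "fog":   {"uplink_mb_s": 50,           "gflops": 50},
    "cloud": {"uplink_mb_s": 10,           "gflops": 500},
}

# A small task stays on the slow-but-local edge; a heavy one is worth
# shipping to the cloud despite the transfer cost.
print(best_site(work_gflop=1, input_mb=100, sites=SITES))
print(best_site(work_gflop=5000, input_mb=100, sites=SITES))
```

This mirrors the document's finding: transfer cost favors the edge for small tasks, while compute cost makes cloud and fog resources win for complex ones.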
Performance evaluation of ECC in single and multi (elliptic curve) (Danilo Calle)
The document discusses a performance evaluation of ECC (Elliptic Curve Cryptography) implementations on FPGA-based embedded systems using single- and dual-processor architectures. It explores implementing ECC using a single MicroBlaze soft-processor core and a dual MicroBlaze core design with shared memory for inter-processor communication. Experimental results show the dual-core design encrypts data 3.3 times faster than the single-core design, but utilizes more resources and power due to the additional processor core.
INTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLES (anfaltahir1010)
Image: Include an image that represents the concept of precision, such as a DNA helix or a futuristic healthcare setting.
Objective: Provide a foundational understanding of precision medicine and its departure from traditional
approaches
Role of theory: Discuss how genomics, the study of an organism's complete set of DNA, plays a crucial role in precision medicine.
Customizing treatment plans: Highlight how genetic information is used to customize
treatment plans based on an individual's genetic makeup.
Examples: Provide real-world examples of successful applications of AI, such as genetic therapies or targeted treatments.
Importance of molecular diagnostics: Explain the role of molecular diagnostics in identifying
molecular and genetic markers associated with diseases.
Biomarker testing: Showcase how biomarker testing aids in creating personalized treatment plans.
Content:
• Ethical issues: Examine ethical concerns related to precision medicine, such as privacy, consent, and
potential misuse of genetic information.
• Regulations and guidelines: Present examples of ethical guidelines and regulations in place to safeguard
patient rights.
• Visuals: Include images or icons representing ethical considerations.
Real-world case study: Present a detailed case study showcasing the success of precision
medicine in a specific medical scenario.
Patient's journey: Discuss the patient's journey, treatment plan, and outcomes.
Impact: Emphasize the transformative effect of precision medicine on the individual's
health.
Objective: Ground the presentation in a real-world example, highlighting the practical
application and success of precision medicine.
Data challenges: Address the challenges associated with managing large sets of patient data in precision
medicine.
Technological solutions: Discuss technological innovations and solutions for handling and analyzing vast
datasets.
Visuals: Include graphics representing data management challenges and technological solutions.
Objective: Acknowledge the data-related challenges in precision medicine and highlight innovative solutions.
This document proposes a design procedure for a re-configurable convolutional neural network (CNN) engine for field-programmable gate array (FPGA) applications. The procedure includes developing an accurate CNN model using TensorFlow and Python, and implementing a re-configurable CNN engine from scratch using register-transfer level design. The proposed engine was synthesized for 180nm CMOS technology and achieved 96% accuracy on MNIST and CIFAR-10 datasets. A graphical user interface was also designed for loading and testing datasets on the hardware engine.
Exploring emerging technologies in the HPC co-design spacejsvetter
This document discusses emerging technologies for high performance computing (HPC), focusing on heterogeneous computing and non-volatile memory. It provides an overview of HPC architectures past and present, highlighting the trend toward more heterogeneous systems using GPUs and other accelerators. The document discusses challenges for applications to adapt to these changing architectures. It also explores potential future technologies like 3D memory and discusses the Department of Energy's efforts in codesign centers to facilitate collaboration between application developers and emerging hardware.
Elastic multicore scheduling with the XiTAO runtimeMiquel Pericas
This presentation describes the XiTAO scheduler for heterogeneous computing that is currently under development in the EU LEGaTO project. The scheduler targets mixed-mode parallelism and assigns resource partitions just-in-time by creating a model of the platform's static and dynamic heterogeneity.
Elastic multicore scheduling with the XiTAO runtimeLEGATO project
This presentation describes the XiTAO scheduler for heterogeneous computing that is currently under development in the EU LEGaTO project. The scheduler targets mixed-mode parallelism and assigns resource partitions just-in-time by creating a model of the platform's static and dynamic heterogeneity.
A SURVEY OF NEURAL NETWORK HARDWARE ACCELERATORS IN MACHINE LEARNING mlaij
The use of Machine Learning in Artificial Intelligence is the inspiration that shaped technology as it is today. Machine Learning has the power to greatly simplify our lives. Improvement in speech recognition and language understanding help the community interact more naturally with technology. The popularity of machine learning opens up the opportunities for optimizing the design of computing platforms using welldefined hardware accelerators. In the upcoming few years, cameras will be utilised as sensors for several applications. For ease of use and privacy restrictions, the requested image processing should be limited to a local embedded computer platform and with a high accuracy. Furthermore, less energy should be consumed. Dedicated acceleration of Convolutional Neural Networks can achieve these targets with high flexibility to perform multiple vision tasks. However, due to the exponential growth in technology constraints (especially in terms of energy) which could lead to heterogeneous multicores, and increasing number of defects, the strategy of defect-tolerant accelerators for heterogeneous multi-cores may become a main micro-architecture research issue. The up to date accelerators used still face some performance issues such as memory limitations, bandwidth, speed etc. This literature summarizes (in terms of a survey) recent work of accelerators including their advantages and disadvantages to make it easier for developers with neural network interests to further improve what has already been established.
IRJET- Python Libraries and Packages for Deep Learning-A SurveyIRJET Journal
This document summarizes a survey of Python libraries and packages that are commonly used for deep learning. It discusses several popular deep learning frameworks like TensorFlow, Keras, Caffe, PyTorch, and Theano that can be used with Python. It also summarizes several research papers that utilized these Python deep learning libraries and packages to implement applications like image classification on embedded devices, mobile edge caching using deep learning, and high performance text recognition. The document highlights the benefits of using Python for deep learning due to its extensive library support, simplicity, reliability, and ease of developing applications.
PEARC17: Interactive Code Adaptation Tool for Modernizing Applications for In...Ritu Arora
Often, HPC software outlives the HPC systems for which they are initially developed. The innovations in the HPC platforms’ hardware and parallel programming standards drive the modernization of HPC applications so that they continue being performant. While such code modernization efforts may not be very challenging for HPC experts and well-funded research groups, many domain-experts may find it challenging to adapt their applications for latest HPC platforms due to lack of expertise, time, and funds. The challenges of such domain-experts can be mitigated by providing them high-level tools for code modernization and migration.
IRJET - Positioning and Tracking of a Person using Embedded Controller in a D...IRJET Journal
This document proposes a system to track and monitor the location of individuals within a defined area using GPS. The system uses an ESP8266 microcontroller interfaced with GPS modules to acquire location data and update it to a cloud database. An administrator can then monitor locations in real-time through a mobile app or web interface by requesting location coordinates from the cloud. The system aims to provide easier tracking of individuals compared to conventional camera-based methods while eliminating the need for continuous human monitoring.
The document describes a distributed memory architecture for a coarse-grain reconfigurable architecture (CGRA) with network-on-chip (NoC) capabilities. The key aspects of the architecture are:
1) It uses a distributed memory approach with memory banks (mBanks) connected via a circuit-switched NoC to enable private and parallel execution environments (PREX).
2) The memory is partitionable and partitions can be reconfigured at runtime to modify the memory to computation ratio.
3) Controllers synchronize data streaming from mBanks to computation elements to improve performance and energy efficiency.
Programming Modes and Performance of Raspberry-Pi ClustersAM Publications
In present times, updated information and knowledge has become readily accessible to researchers, enthusiasts, developers, and academics through the Internet on many different subjects for wider areas of application. The underlying framework facilitating such possibilities is networking of servers, nodes, and personal computers. However, such setups, comprising of mainframes, servers and networking devices are inaccessible to many, costly, and are not portable. In addition, students and lab-level enthusiasts do not have the requisite access to modify the functionality to suit specific purposes. The Raspberry-Pi (R-Pi) is a small device capable of many functionalities akin to super-computing while being portable, economical and flexible. It runs on open source Linux, making it a preferred choice for lab-level research and studies. Users have started using the embedded networking capability to design portable clusters that replace the costlier machines. This paper introduces new users to the most commonly used frameworks and some recent developments that best exploit the capabilities of R-Pi when used in clusters. This paper also introduces some of the tools and measures that rate efficiencies of clusters to help users assess the quality of cluster design. The paper aims to make users aware of the various parameters in a cluster environment.
The document discusses the future of computing platforms and how they will change to handle massive amounts of data and machine learning tasks. Some key points:
- Traditional views of performance gains from clock speed increases are over. New architectures enabled by multi-core CPUs will radically change computing.
- "Big data" tasks like search, machine learning, and real-time data analysis will be increasingly important drivers of new computing platforms.
- Simple machine learning models applied to massive amounts of data can produce useful results, even without deep domain expertise. This approach has been demonstrated to work well for tasks like language translation.
- Future platforms may blend CPUs and GPUs differently to best handle both serial and parallel tasks for big data and machine
The document discusses performance characterization across a computing continuum from the edge to the cloud. It evaluates the performance of video encoding and machine learning tasks on different devices. For video encoding, older single-board computers had significantly higher encoding times than other resources but provided lower data transfer times. For machine learning, training a convolutional neural network took much longer than a simpler model. Cloud and fog resources generally outperformed edge devices for more complex tasks. The document recommends offloading large or complex tasks to more powerful resources when possible.
Performance evaluation of ecc in single and multi( eliptic curve)Danilo Calle
The document discusses performance evaluation of ECC (Elliptic Curve Cryptography) implementation on FPGA-based embedded systems using single and dual processor architectures. It explores implementing ECC using a single MicroBlaze soft processor core and a dual MicroBlaze core design with shared memory for inter-processor communication. Experimental results show the dual core design improves throughput by 3.3x over the single core design, encrypting data 3.3 times faster, but utilizes more resources and power due to the additional processor core.
Similar to Performence Metrics to Manage Memory SC. (20)
INTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLESanfaltahir1010
Image: Include an image that represents the concept of precision, such as a AI helix or a futuristic healthcare
setting.
Objective: Provide a foundational understanding of precision medicine and its departure from traditional
approaches
Role of theory: Discuss how genomics, the study of an organism's complete set of AI ,
plays a crucial role in precision medicine.
Customizing treatment plans: Highlight how genetic information is used to customize
treatment plans based on an individual's genetic makeup.
Examples: Provide real-world examples of successful application of AI such as genetic
therapies or targeted treatments.
Importance of molecular diagnostics: Explain the role of molecular diagnostics in identifying
molecular and genetic markers associated with diseases.
Biomarker testing: Showcase how biomarker testing aids in creating personalized treatment plans.
Content:
• Ethical issues: Examine ethical concerns related to precision medicine, such as privacy, consent, and
potential misuse of genetic information.
• Regulations and guidelines: Present examples of ethical guidelines and regulations in place to safeguard
patient rights.
• Visuals: Include images or icons representing ethical considerations.
Content:
• Ethical issues: Examine ethical concerns related to precision medicine, such as privacy, consent, and
potential misuse of genetic information.
• Regulations and guidelines: Present examples of ethical guidelines and regulations in place to safeguard
patient rights.
• Visuals: Include images or icons representing ethical considerations.
Content:
• Ethical issues: Examine ethical concerns related to precision medicine, such as privacy, consent, and
potential misuse of genetic information.
• Regulations and guidelines: Present examples of ethical guidelines and regulations in place to safeguard
patient rights.
• Visuals: Include images or icons representing ethical considerations.
Real-world case study: Present a detailed case study showcasing the success of precision
medicine in a specific medical scenario.
Patient's journey: Discuss the patient's journey, treatment plan, and outcomes.
Impact: Emphasize the transformative effect of precision medicine on the individual's
health.
Objective: Ground the presentation in a real-world example, highlighting the practical
application and success of precision medicine.
Data challenges: Address the challenges associated with managing large sets of patient data in precision
medicine.
Technological solutions: Discuss technological innovations and solutions for handling and analyzing vast
datasets.
Visuals: Include graphics representing data management challenges and technological solutions.
Objective: Acknowledge the data-related challenges in precision medicine and highlight innovative solutions.
Data challenges: Address the challenges associated with managing large sets of patient data in precision
medicine.
Technological solutions: Discuss technological innovations and solutions
Most important New features of Oracle 23c for DBAs and Developers. You can get more idea from my youtube channel video from https://youtu.be/XvL5WtaC20A
DECODING JAVA THREAD DUMPS: MASTER THE ART OF ANALYSISTier1 app
Are you ready to unlock the secrets hidden within Java thread dumps? Join us for a hands-on session where we'll delve into effective troubleshooting patterns to swiftly identify the root causes of production problems. Discover the right tools, techniques, and best practices while exploring *real-world case studies of major outages* in Fortune 500 enterprises. Engage in interactive lab exercises where you'll have the opportunity to troubleshoot thread dumps and uncover performance issues firsthand. Join us and become a master of Java thread dump analysis!
Consistent toolbox talks are critical for maintaining workplace safety, as they provide regular opportunities to address specific hazards and reinforce safe practices.
These brief, focused sessions ensure that safety is a continual conversation rather than a one-time event, which helps keep safety protocols fresh in employees' minds. Studies have shown that shorter, more frequent training sessions are more effective for retention and behavior change compared to longer, infrequent sessions.
Engaging workers regularly, toolbox talks promote a culture of safety, empower employees to voice concerns, and ultimately reduce the likelihood of accidents and injuries on site.
The traditional method of conducting safety talks with paper documents and lengthy meetings is not only time-consuming but also less effective. Manual tracking of attendance and compliance is prone to errors and inconsistencies, leading to gaps in safety communication and potential non-compliance with OSHA regulations. Switching to a digital solution like Safelyio offers significant advantages.
Safelyio automates the delivery and documentation of safety talks, ensuring consistency and accessibility. The microlearning approach breaks down complex safety protocols into manageable, bite-sized pieces, making it easier for employees to absorb and retain information.
This method minimizes disruptions to work schedules, eliminates the hassle of paperwork, and ensures that all safety communications are tracked and recorded accurately. Ultimately, using a digital platform like Safelyio enhances engagement, compliance, and overall safety performance on site. https://safelyio.com/
UI5con 2024 - Boost Your Development Experience with UI5 Tooling ExtensionsPeter Muessig
The UI5 tooling is the development and build tooling of UI5. It is built in a modular and extensible way so that it can be easily extended by your needs. This session will showcase various tooling extensions which can boost your development experience by far so that you can really work offline, transpile your code in your project to use even newer versions of EcmaScript (than 2022 which is supported right now by the UI5 tooling), consume any npm package of your choice in your project, using different kind of proxies, and even stitching UI5 projects during development together to mimic your target environment.
Enhanced Screen Flows UI/UX using SLDS with Tom KittPeter Caitens
Join us for an engaging session led by Flow Champion, Tom Kitt. This session will dive into a technique of enhancing the user interfaces and user experiences within Screen Flows using the Salesforce Lightning Design System (SLDS). This technique uses Native functionality, with No Apex Code, No Custom Components and No Managed Packages required.
UI5con 2024 - Bring Your Own Design SystemPeter Muessig
How do you combine the OpenUI5/SAPUI5 programming model with a design system that makes its controls available as Web Components? Since OpenUI5/SAPUI5 1.120, the framework supports the integration of any Web Components. This makes it possible, for example, to natively embed own Web Components of your design system which are created with Stencil. The integration embeds the Web Components in a way that they can be used naturally in XMLViews, like with standard UI5 controls, and can be bound with data binding. Learn how you can also make use of the Web Components base class in OpenUI5/SAPUI5 to also integrate your Web Components and get inspired by the solution to generate a custom UI5 library providing the Web Components control wrappers for the native ones.
Microservice Teams - How the cloud changes the way we workSven Peters
A lot of technical challenges and complexity come with building a cloud-native and distributed architecture. The way we develop backend software has fundamentally changed in the last ten years. Managing a microservices architecture demands a lot of us to ensure observability and operational resiliency. But did you also change the way you run your development teams?
Sven will talk about Atlassian’s journey from a monolith to a multi-tenanted architecture and how it affected the way the engineering teams work. You will learn how we shifted to service ownership, moved to more autonomous teams (and its challenges), and established platform and enablement teams.
Preparing Non - Technical Founders for Engaging a Tech AgencyISH Technologies
Preparing non-technical founders before engaging a tech agency is crucial for the success of their projects. It starts with clearly defining their vision and goals, conducting thorough market research, and gaining a basic understanding of relevant technologies. Setting realistic expectations and preparing a detailed project brief are essential steps. Founders should select a tech agency with a proven track record and establish clear communication channels. Additionally, addressing legal and contractual considerations and planning for post-launch support are vital to ensure a smooth and successful collaboration. This preparation empowers non-technical founders to effectively communicate their needs and work seamlessly with their chosen tech agency.Visit our site to get more details about this. Contact us today www.ishtechnologies.com.au
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdfVALiNTRY360
Salesforce Healthcare CRM, implemented by VALiNTRY360, revolutionizes patient management by enhancing patient engagement, streamlining administrative processes, and improving care coordination. Its advanced analytics, robust security, and seamless integration with telehealth services ensure that healthcare providers can deliver personalized, efficient, and secure patient care. By automating routine tasks and providing actionable insights, Salesforce Healthcare CRM enables healthcare providers to focus on delivering high-quality care, leading to better patient outcomes and higher satisfaction. VALiNTRY360's expertise ensures a tailored solution that meets the unique needs of any healthcare practice, from small clinics to large hospital systems.
For more info visit us https://valintry360.com/solutions/health-life-sciences
Flutter is a popular open source, cross-platform framework developed by Google. In this webinar we'll explore Flutter and its architecture, delve into the Flutter Embedder and Flutter’s Dart language, discover how to leverage Flutter for embedded device development, learn about Automotive Grade Linux (AGL) and its consortium and understand the rationale behind AGL's choice of Flutter for next-gen IVI systems. Don’t miss this opportunity to discover whether Flutter is right for your project.
8 Best Automated Android App Testing Tool and Framework in 2024.pdfkalichargn70th171
Regarding mobile operating systems, two major players dominate our thoughts: Android and iPhone. With Android leading the market, software development companies are focused on delivering apps compatible with this OS. Ensuring an app's functionality across various Android devices, OS versions, and hardware specifications is critical, making Android app testing essential.
The Rising Future of CPaaS in the Middle East 2024Yara Milbes
Explore "The Rising Future of CPaaS in the Middle East in 2024" with this comprehensive PPT presentation. Discover how Communication Platforms as a Service (CPaaS) is transforming communication across various sectors in the Middle East.
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...XfilesPro
Wondering how X-Sign gained popularity in a quick time span? This eSign functionality of XfilesPro DocuPrime has many advancements to offer for Salesforce users. Explore them now!
ALGIT - Assembly Line for Green IT - Numbers, Data, Facts
Performance Metrics to Manage Memory on Supercomputers
1. Performance Metrics to Manage
Memory on Supercomputers
Andrès RUBIO PROAÑO
Post-doctoral Researcher, High Performance Big Data Research Team, RIKEN R-CCS,
21/12/2023
2. My Background
⚫ 2015 Electronics Engineering Bachelor
Degree at Escuela Politécnica Nacional,
Ecuador.
⚫ CEDIA CEPRA Projects
⚫ 2018 Computer Engineering Master’s
Degree at Universitat Autònoma de
Barcelona, Spain.
⚫ 2021 PhD in Computer Science at
Université de Bordeaux, France
⚫ 3-year PhD contract with Inria,
funded by Intel.
⚫ 2021-present: Postdoctoral Researcher
at RIKEN R-CCS, Japan.
3. Outline
1. Background
⚫ The big question
⚫ Contribution
2. Complex Memory Spaces
⚫ Memory characterisation and memory attributes
⚫ API
⚫ Attribute Values
3. Preparing HPC applications to complex HMS
⚫ Benchmarking as an allocation criterion
⚫ Profiling as an allocation criterion
⚫ Static code analysis as an allocation criterion
4. Summary
4. INTRODUCTION:
From Real World to HPC Applications
Real World Problem
• Meshes, sparse matrices, etc.
• Need to be optimally partitioned
Surrogate Model
• Molecular dynamics
• Oil and gas exploration
• Forecast climate changes
• Discover new drugs for diseases
Computer Simulation Model
• AI, DL, ML workloads
6. From Homogeneity to Heterogeneity:
Memory/Storage System

Traditional homogeneous hierarchy:
⚫ Register (CPU): ~0.1 ns
⚫ Cache (Level 1/2/3): ~1-50 ns
⚫ Memory (DRAM, on the memory bus): ~80-100 ns
⚫ Storage (HDD, on the I/O bus): ~10 ms

Heterogeneous hierarchy (capacity and latency increase downwards):
⚫ Register (CPU): ~0.1 ns
⚫ Cache (Level 1/2/3): ~1-50 ns
⚫ In-package memory (HBM, inside the processor package)
⚫ Volatile memory (DRAM, on the memory bus): ~80-100 ns
⚫ Disaggregated memory (CXL.mem: HBM, DRAM, NVDIMM): ~170-250 ns, with improved bandwidth
⚫ Persistent memory (NVDIMM): ~350 ns-1 µs
⚫ Solid-state media (NVMe SSD, SATA SSD, on the I/O bus): ~10-100 µs
⚫ Mechanical media (HDD): ~10 ms
⚫ Sequential media (tape): ~100 ms
7. Heterogeneous Memory Systems
⚫ 2MK HMS (CPU + NVM + DRAM): Xeon Cascade Lake, Xeon Icelake, Sapphire Rapids
⚫ 2MK HMS (CPU + DRAM + HBM): KNL, Sapphire Rapids
⚫ 3MK HMS (CPU + NVM + DRAM + HBM): next generations? CXL.mem?
10. Where to allocate?
An application issues N allocation requests in total, one per buffer, to the memory system (NVM, DRAM, HBM):
⚫ Allocating on a homogeneous memory system?
⚫ Allocating on a NUMA memory system (NUMA 0, NUMA 1)
⚫ Allocating on a heterogeneous memory system
11. Contribution
1. How to expose heterogeneous memory systems to applications/runtimes?
2. Where to allocate memory buffers?
3. Developing for heterogeneous memory systems without having access to the hardware?
4. How to manage HMS in batch schedulers?
5. Power consumption metric on HMS
Goglin, B., & Rubio Proaño, A. (2019, August). Opportunities for Partitioning Non-volatile Memory DIMMs Between Co-scheduled Jobs on HPC Nodes. In European Conference on Parallel Processing (pp. 82-94).
León, E. A., Goglin, B., & Rubio Proaño, A. (2019, September). M&MMs: navigating complex memory spaces with hwloc. In Proceedings of the International Symposium on Memory Systems (pp. 149-155).
Rubio Proaño, A. (2020, June). Exposer les caractéristiques des architectures à mémoires hétérogènes aux applications parallèles. In COMPAS 2020 - Conférence francophone d'informatique en Parallélisme, Architecture et Système.
Goglin, B., & Rubio Proaño, A. (2022, May). Using Performance Attributes for Managing Heterogeneous Memory in HPC Applications. In 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (pp. 890-899). IEEE.
Foyer, C., Goglin, B., & Rubio Proaño, A. (2023). A survey of software techniques to emulate heterogeneous memory systems in high-performance computing. Parallel Computing, 103023.
Rubio Proaño, A., & Sato, K. (2023, December). Understanding Power Consumption Metric on Heterogeneous Memory Systems. In the 29th IEEE International Conference on Parallel and Distributed Systems (ICPADS 2023).
12. Outline
1. Background
⚫ The big question
⚫ Contribution
2. Complex Memory Spaces
⚫ Memory characterisation and memory attributes
⚫ API
⚫ Attribute Values
3. Preparing HPC applications to complex HMS
⚫ Benchmarking as an allocation criterion
⚫ Profiling as an allocation criterion
⚫ Static code analysis as an allocation criterion
4. Summary
13. Memory Attributes
Memory kinds trade off different attributes: high bandwidth, high capacity, low power consumption, low latency (HBM, DRAM, and NVM each favor different ones).

Attribute discovery:
⚫ Capacity, Locality: natively supported
⚫ Bandwidth, Latency: on most platforms
⚫ R/W Bandwidth, R/W Latency: on some platforms
⚫ Reliability, Persistence, Endurance, Power Consumption: under investigation
14. Hwloc:
Apple Mac Mini with M1 hybrid processor
4 E-cores on top (energy efficient), 4 P-cores below (performance, with bigger caches).
The machine has 16GB of memory but most of it is given to the GPU (as shown in the OpenCL device)
16. Hwloc:
2x Xeon SapphireRapids Max 9460
Processors are configured in SubNUMA-Cluster mode, hence showing 4 DRAM NUMA nodes and 4 HBMs
in each package.
17. API functions to manage memory attributes
⚫ Get the array of memory targets that are local to a given initiator:
hwloc_get_local_numanode_objs(topology, initiator, &nr, &targets)
⚫ Get the best memory target (and its value) for the given initiator and attribute:
hwloc_memattr_get_best_target(topology, attribute, initiator, &best_target, &target_value)
⚫ Get the value of an attribute for the given memory target and initiator:
hwloc_memattr_get_value(topology, attribute, target, initiator, &value)
⚫ Add a custom memory attribute (e.g. a STREAM-triad kernel):
hwloc_memattr_register(topology, name, flags, &attribute_id)
18. E.g. Allocate on the best target for an existing attribute
/* Initialise topology */
hwloc_topology_init(&topology);
hwloc_topology_load(topology);
[...]
/* Allocating function (slide-level sketch; initiator is an hwloc_location) */
void * alloc_on_best_target(topology, initiator, attribute, size)
{
  hwloc_obj_t best_target;
  hwloc_memattr_get_best_target(topology, attribute, initiator, 0, &best_target, NULL);
  return hwloc_alloc_membind(topology, size, best_target->nodeset,
                             HWLOC_MEMBIND_BIND, HWLOC_MEMBIND_BYNODESET);
}
[...]
/* Allocating 1MB on the best-bandwidth memory near a given core */
void * buffer = alloc_on_best_target(topology, core->cpuset, HWLOC_MEMATTR_ID_BANDWIDTH, 1024*1024);
22. Summary
⚫ Expected:
⚫ Obtain memory attributes information from HMAT (simpler)
⚫ Vendors start using HMAT and put reliable information
⚫ No need to spend time benchmarking
⚫ Currently:
⚫ Benchmarking is the safest way to get attribute values
⚫ HMAT has only been appearing in generic platforms' ACPI tables since 2021
23. Contribution
1. How to expose heterogeneous memory systems to applications/runtimes?
2. Criteria for where to allocate memory buffers?
3. Developing for heterogeneous memory systems without having access to the hardware?
4. How to manage HMS in batch schedulers?
5. Power consumption metric on HMS
Goglin, B., & Rubio Proaño, A. (2019, August). Opportunities for Partitioning Non-volatile Memory DIMMs Between Co-scheduled Jobs on HPC Nodes. In European Conference on Parallel Processing (pp. 82-94).
León, E. A., Goglin, B., & Rubio Proaño, A. (2019, September). M&MMs: navigating complex memory spaces with hwloc. In Proceedings of the International Symposium on Memory Systems (pp. 149-155).
Rubio Proaño, A. (2020, June). Exposer les caractéristiques des architectures à mémoires hétérogènes aux applications parallèles. In COMPAS 2020 - Conférence francophone d'informatique en Parallélisme, Architecture et Système.
Goglin, B., & Rubio Proaño, A. (2022, May). Using Performance Attributes for Managing Heterogeneous Memory in HPC Applications. In 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (pp. 890-899). IEEE.
Foyer, C., Goglin, B., & Rubio Proaño, A. (2023). A survey of software techniques to emulate heterogeneous memory systems in high-performance computing. Parallel Computing, 103023.
24. Outline
1. Background
⚫ The big question
⚫ Contribution
2. Complex Memory Spaces
⚫ Memory characterisation and memory attributes
⚫ API
⚫ Attribute values
3. Preparing HPC applications for complex HMS
⚫ Benchmarking as an allocation criterion
⚫ Profiling as an allocation criterion
⚫ Static code analysis as an allocation criterion
4. Summary
25. Strategy Framework -> High Productivity
[Diagram: the application sends allocation requests to a heterogeneous allocator built on hwloc and its API extension, which exposes the hardware's memory targets and attributes (measured performance and hardware performance information) as memory identifiers. Allocation criteria — profiling, benchmarking, and static code analysis — determine the application's sensitivity to memory metrics.]
26. Benchmarking (Bandwidth): Stream-Triad
[Diagram: an application with three buffers (A, B, C) on a 3-memory-kind HMS (NVM, DRAM, HBM); binding each buffer to each of the N available memory kinds gives N³ binding combinations (here 2³ = 8, using only DRAM and NVM).]
Buffer target node (0 → DRAM, 1 → NVM) vs. best Triad rate:
A B C | Triad (GB/s)
0 0 0 | 74.97
0 0 1 | 51.88
0 1 0 | 55.59
0 1 1 | 38.32
1 0 0 |  9.92
1 0 1 |  9.05
1 1 0 |  9.16
1 1 1 |  8.50
27. Benchmarking
⚫ Gives a general idea of the application's sensitivity
⚫ Hard to evaluate when taking all buffers into account separately (the number of binding combinations grows quickly)
31. Profiling
⚫ Perform an analysis of the execution:
⚫ Kind of memory used
⚫ Most relevant buffers
⚫ Analyse the related source code lines
⚫ Identify memory-related issues:
⚫ Bottlenecks, hot spots, etc.
⚫ Fewer runs, but the analysis can be more difficult
32. Contribution
1. How to expose heterogeneous memory systems to applications/runtimes?
2. A criterion for where to allocate memory buffers?
3. Developing for heterogeneous memory systems without having access to the hardware?
4. How to manage HMS in batch schedulers?
5. A power consumption metric on HMS
Goglin, B., & Rubio Proaño, A. (2019, August). Opportunities for Partitioning Non-volatile Memory DIMMs Between Co-scheduled Jobs on HPC Nodes. In European Conference on Parallel Processing (pp. 82-94).
León, E. A., Goglin, B., & Rubio Proaño, A. (2019, September). M&MMs: navigating complex memory spaces with hwloc. In Proceedings of the International Symposium on Memory Systems (pp. 149-155).
Rubio Proaño, A. (2020, June). Exposer les caractéristiques des architectures à mémoires hétérogènes aux applications parallèles. In COMPAS 2020 - Conférence francophone d'informatique en Parallélisme, Architecture et Système.
Goglin, B., & Rubio Proaño, A. (2022, May). Using Performance Attributes for Managing Heterogeneous Memory in HPC Applications. In 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (pp. 890-899). IEEE.
Foyer, C., Goglin, B., & Rubio Proaño, A. (2023). A survey of software techniques to emulate heterogeneous memory systems in high-performance computing. Parallel Computing, 103023.
33. Power Consumption
⚫ To rank memory targets for applications running in power-constrained scenarios, or in situations that require a balance between performance and power consumption, we need to understand and characterise power consumption within a Heterogeneous Memory System (HMS). This requires a strategy.
[Diagram: power domains — package domain, cores domain, and memory domain (DRAM and NVDIMM); energy is read via MSR_ENERGY_STATUS and MSR_DRAM_ENERGY_STATUS.]
36. Summary
⚫ Managed the complexity of HMS through an hwloc extension.
⚫ Presented a strategy that allows HPC applications to detect affinities for certain kinds of memory and to allocate their buffers in the right place.
⚫ The presented strategy framework targets high productivity (for non-experienced developers) and better utilisation of the memory system.
⚫ Managed performance counters in a manner that allows us to differentiate the power consumption of different types of memory.
37. Future Work
⚫ Validate and extend our work on emerging platforms.
⚫ Static code analysis for making allocation decisions.
⚫ Extend our allocation policies to handle more application requirements.