This document summarizes Issam Said's December 21, 2015 Ph.D. defense presentation on contributions of hybrid architectures to depth imaging. The presentation compared the performance of CPU, GPU, and APU architectures on seismic imaging algorithms like reverse time migration (RTM). It found that the APU, with its integrated CPU and GPU, is a promising HPC solution for depth imaging as it may be more efficient than CPUs and able to overcome GPU limitations through unified memory and lower power consumption. Evaluation of finite difference stencils, a building block of RTM, showed the APU achieved best performance using local memory with zero-copy data placement for small problems and explicit copy for large problems.
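The finite-difference stencil that the thesis benchmarks can be illustrated with a minimal sketch. This is a hypothetical, simplified 1D second-order example (real RTM kernels are 3D with higher-order stencils), shown only to make the memory access pattern concrete; `laplacian_1d` and `wave_step` are illustrative names, not the thesis's code.

```python
# Hypothetical sketch: the 3-point finite-difference stencil at the core of
# reverse time migration (RTM) wave propagation, in its simplest 1D form.
# The neighbor reads u[i-1], u[i], u[i+1] are what make stencils memory-bound
# and what local-memory/zero-copy placement on the APU tries to optimize.

def laplacian_1d(u, dx):
    """Second-order central difference of u, zero at the boundaries."""
    n = len(u)
    out = [0.0] * n
    for i in range(1, n - 1):
        out[i] = (u[i - 1] - 2.0 * u[i] + u[i + 1]) / (dx * dx)
    return out

def wave_step(u_prev, u_curr, c, dx, dt):
    """One explicit time step of the scalar wave equation u_tt = c^2 u_xx."""
    lap = laplacian_1d(u_curr, dx)
    return [2.0 * u_curr[i] - u_prev[i] + (c * dt) ** 2 * lap[i]
            for i in range(len(u_curr))]
```

In an RTM run this step is applied millions of times, which is why the data-placement strategy (zero-copy versus explicit copy) dominates performance.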
This talk will provide an in-depth treatment of satellite telephony networks from a security perspective. Although the overall system appears secure, in practice it cannot be relied upon to be fully so.
We will briefly cover the satellite mobile system architecture, then discuss GMR (GEO-Mobile Radio) system elements: the GSS (Gateway Station Subsystem), MES (Mobile Earth Station), AOC (Advanced Operation Center), and TCS (Traffic Control Subsystem) for GMR-1 systems, and the NCC (Network Control Center), GW (Gateway), SCF (Satellite Control Facility), and CMIS (Customer Management Information System) for GMR-2 systems.
From there, we will discuss the security issues of the GMR system, which shares many vulnerabilities with GSM: GMR is derived from the terrestrial digital cellular standard GSM and supports access to GSM core networks. We will also present some interesting demos.
Time permitting, a question-and-answer session at the end of the presentation will allow participants to raise any additional issues in satellite telephony systems they'd like to discuss.
This document discusses the GRACE (Gravity Recovery and Climate Experiment) mission, which consists of two satellites that measure changes in Earth's gravity field to map variations in ocean and land water, ice sheets, and groundwater. The original GRACE mission operated from 2002 to 2017, exceeding its expected 5-year lifespan. GRACE Follow-On was launched in 2018 to continue the mission after the original satellites were decommissioned.
Application of Geo-informatics in Environmental Management (MahaMadhu2)
Geo-informatics is the science and technology that develops and uses information science and infrastructure to address the problems of geography, the geosciences, and related branches of engineering; it has been described as "the art, science or technology dealing with the acquisition, storage, processing, production, presentation and dissemination of geo-information". Perhaps the most important concern for all of us today is protecting the environment we live and breathe in. Climate change is wreaking havoc, with erratic weather patterns affecting everything from crop production to the untimely melting of glaciers.
There is a lot to worry about, and immediate action is definitely required. It is not that the world has not geared up to take corrective action, but we need to do more, and geo-informatics can help us achieve that. Geo-informatics is a powerful platform that enables every sector to perform better, and the environment is no exception. Coupled with a digital map, GIS allows a user to see locations, events, features, and environmental changes with unprecedented clarity, showing layer upon layer of information such as environmental trends, soil stability, pesticide use, migration corridors, hazardous waste generators, dust source points, lake remediation efforts, and at-risk water wells. Effective environmental practice considers the whole spectrum of the environment. ArcGIS® and other GIS technologies offer a wide variety of analytical tools to meet the needs of many people, helping them make better decisions about the environment. People in the environmental management community use GIS to organize existing information and communicate that information throughout their organizations. GIS can be used as a strategic tool to automate processes, transform environmental management operations by garnering new knowledge, and support decisions that make a profound difference to our environment.
Speech by Jan Pieter De Nul at the HR Manager of the Year award ceremony (Soliday das Sonnensegel)
He rarely speaks in public, but Mr De Nul brought out his filleting knife to dissect the business climate and Belgian politics. A very accurate and, above all, painful analysis.
The GRACE and GRACE Follow-On missions used satellites to measure changes in Earth's gravity field caused by movements of mass, such as water. This allowed terrestrial water storage (TWS) to be monitored on a global scale. The missions provided monthly maps of TWS anomalies at a coarse spatial resolution of about 10⁵ km². Assimilating these measurements into land surface models helped improve estimates of TWS and its components, such as snowpack and soil moisture, at finer scales. The combined data helped analyze trends, floods, and droughts around the world.
SRTM is the Shuttle Radar Topography Mission that obtained digital elevation models on a near-global scale using radar. SRTM data is available in different formats from various sources and can be used for mapping geomorphology, structures, and revealing subsurface geological patterns when combined with Landsat data. SRTM imagery and derived hillshades can identify drainage patterns and be compared to paleochannels from aeromagnetic data to study geological processes over time.
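The hillshades derived from SRTM elevation data can be sketched with a minimal gradient-based computation. This is a simplified, illustrative version of the standard hillshade formula used by GIS packages (sign conventions condensed); `hillshade` and its parameter names are assumptions, not any particular tool's API.

```python
import math

def hillshade(dem, cellsize, azimuth_deg=315.0, altitude_deg=45.0):
    """Hillshade in [0, 1] for each interior cell of a small DEM grid,
    using central differences for the surface gradient (a simplified form
    of the Horn method used by most GIS packages)."""
    az = math.radians(azimuth_deg)
    zenith = math.radians(90.0 - altitude_deg)
    rows, cols = len(dem), len(dem[0])
    shade = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            dzdx = (dem[r][c + 1] - dem[r][c - 1]) / (2.0 * cellsize)
            dzdy = (dem[r + 1][c] - dem[r - 1][c]) / (2.0 * cellsize)
            slope = math.atan(math.hypot(dzdx, dzdy))
            aspect = math.atan2(dzdy, -dzdx)
            v = (math.cos(zenith) * math.cos(slope)
                 + math.sin(zenith) * math.sin(slope) * math.cos(az - aspect))
            shade[r][c] = max(0.0, v)  # clamp slopes facing away from the sun
    return shade
```

Low-relief features such as paleochannels stand out in such shaded renderings because small slope changes modulate the illumination term.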
91 Free Twitter Tools and Apps to Fit Any Need (Buffer)
We’ve collected a great bunch of free tools for Twitter - all the tools we’ve found helpful and many more that we’re excited to try. If there’s a free Twitter tool out there, you’re likely to find a mention here in our list.
Cache Optimization Techniques for General Purpose Graphic Processing Units (Vajira Thambawita)
This document summarizes research on adapting CPU cache optimization techniques for general purpose graphic processing units (GPGPUs). It first discusses related work on CPU and GPGPU cache architectures and optimization techniques. It then presents the conceptual design of selecting CPU techniques and analyzing their adaptation to GPGPUs. Two common CPU techniques, stride-one access and blocking, are adapted, and experimental results show their effectiveness on a GPGPU, with blocking providing better performance than non-blocking approaches. The research contributes techniques programmers can use to optimize GPGPU cache performance.
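The blocking technique named above can be sketched with a matrix transpose, the textbook example of restructuring loops for cache locality. This is a hypothetical illustration of the general idea, not code from the paper; on a GPGPU the tile would live in shared memory rather than a CPU cache.

```python
# Hypothetical sketch of "blocking": visit the matrix one BxB tile at a time
# so every element of a tile is touched while the tile is still cache-
# (or shared-memory-) resident, instead of striding across whole rows.

def transpose_blocked(a, block=32):
    """Transpose square matrix a using BxB tiles."""
    n = len(a)
    out = [[0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for jj in range(0, n, block):
            # stride-one access within the tile keeps loads coalesced/cached
            for i in range(ii, min(ii + block, n)):
                for j in range(jj, min(jj + block, n)):
                    out[j][i] = a[i][j]
    return out
```

The naive version writes `out[j][i]` with stride-n in one of the two matrices for every element; blocking confines that stride to a tile that fits in fast memory.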
The increasing demand for computing power in fields such as biology, finance, and machine learning is pushing the adoption of reconfigurable hardware to keep up with the required performance at a sustainable power consumption. Within this context, FPGA devices represent an interesting solution as they combine the benefits of power efficiency, performance, and flexibility. Nevertheless, the steep learning curve and experience needed to develop efficient FPGA-based systems remain among the main limiting factors for broad utilization of such devices.
In this talk, we present CAOS, a framework that helps the application designer identify acceleration opportunities and guides them through the implementation of the final FPGA-based system. The CAOS platform targets the full stack of the application optimization process, from the identification of the kernel functions to accelerate, to the optimization of those kernels, to the generation of the runtime management and the configuration files needed to program the FPGA.
The document provides an overview of Sundance Multiprocessor Technology Ltd. and their EM3V - Embedded Vision product. Some key points:
- Sundance is an employee-owned company with over 300 years of experience designing and building their own products.
- Their VCS-1 (EMC2) system is a modular and reconfigurable hardware platform compatible with Zynq UltraScale+ MPSoC devices and a wide range of sensors.
- The system includes open source software, firmware and documentation and is compatible with popular frameworks like ROS, OpenCV and deep learning stacks for running neural networks.
This document discusses optimizations for high performance and energy efficient implementations of the Smith-Waterman algorithm on FPGAs using OpenCL. It describes an architecture with a systolic array for parallel computation along anti-diagonals and compression techniques to address the memory-bound nature. Experimental results on two FPGA boards show up to 42.5 GCUPS performance with the best performance/power ratio compared to CPUs and other FPGA implementations.
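The anti-diagonal parallelism the FPGA architecture exploits can be shown in a minimal sketch. This is an illustrative serial version (hypothetical function name, standard scoring), not the paper's OpenCL code: it fills the dynamic-programming matrix one anti-diagonal at a time, making explicit that all cells on a diagonal are independent.

```python
def smith_waterman_antidiag(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local-alignment score, filled anti-diagonal by
    anti-diagonal: every cell on a diagonal depends only on the two previous
    diagonals, which is what lets a systolic array compute a whole diagonal
    in parallel."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for d in range(2, n + m + 1):              # anti-diagonal index i + j
        for i in range(max(1, d - m), min(n, d - 1) + 1):
            j = d - i
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,  # diagonal: match/mismatch
                          H[i - 1][j] + gap,    # up: gap in b
                          H[i][j - 1] + gap)    # left: gap in a
            best = max(best, H[i][j])
    return best
```

On the FPGA, each processing element of the systolic array owns one column and the diagonal wavefront flows through the array, which is why performance is reported in cell updates per second (GCUPS).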
This document summarizes a presentation on GPGPU-Sim, a widely used GPU simulator. It introduces GPGPU-Sim and its key components, including its functional model that simulates the GPU programming model and virtual/machine instruction sets, its performance model that simulates timing of GPU components, and its power model GPUWattch. It outlines new features in GPGPU-Sim, including an improved Volta architecture model, the ability to run closed source libraries like cuDNN, modeling of tensor cores, and running the CUTLASS library. The document provides an overview of these new developments and how they enhance GPGPU-Sim's accuracy in simulating modern GPUs.
OCP liquid direct to chip temperature guideline.pdf (bui thequan)
Liquid cooling is becoming necessary for machine learning training modules and high-speed switch fabric ASICs as they exceed the limits of air-based cooling. The document proposes that the industry converge on a coolant temperature standard of 30°C, which would allow for efficient data center design while supporting emerging hardware for many generations. Optimizations across the thermal stack from chips to data centers could further extend the viability of 30°C cooling. The author calls for industry collaboration to establish standards and investments that leverage cooling infrastructure for as long as possible.
Advanced Computer Architecture – An Introduction (Dilum Bandara)
Introduction to advanced computer architecture, covering:
- Classes of computers
- Instruction set architecture
- Trends in technology, power and energy, and cost
- Principles of computer design
The document discusses PG-Strom, an open source project that uses GPU acceleration for PostgreSQL. PG-Strom allows for automatic generation of GPU code from SQL queries, enabling transparent acceleration of operations like WHERE clauses, JOINs, and GROUP BY through thousands of GPU cores. It introduces PL/CUDA, which allows users to write custom CUDA kernels and integrate them with PostgreSQL for manual optimization of complex algorithms. A case study on k-nearest neighbor similarity search for drug discovery is presented to demonstrate PG-Strom's ability to accelerate computational workloads through GPU processing.
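The k-nearest-neighbour similarity search from the PG-Strom case study can be sketched in its simplest form. This is a hypothetical serial illustration using Jaccard (Tanimoto) similarity over binary chemical fingerprints; PG-Strom would evaluate the same per-row similarity inside a PL/CUDA kernel across thousands of GPU threads, and the names below are assumptions, not PG-Strom's API.

```python
# Hypothetical sketch: brute-force k-NN by Tanimoto similarity over
# integer-encoded binary fingerprints, the computation a PL/CUDA kernel
# would parallelise across GPU cores.

def tanimoto(fp_a, fp_b):
    """Jaccard similarity of two bit-vector fingerprints stored as ints."""
    inter = bin(fp_a & fp_b).count("1")
    union = bin(fp_a | fp_b).count("1")
    return inter / union if union else 0.0

def knn(query_fp, database, k=3):
    """Return the k database ids most similar to query_fp."""
    scored = sorted(database.items(),
                    key=lambda item: tanimoto(query_fp, item[1]),
                    reverse=True)
    return [mol_id for mol_id, _ in scored[:k]]
```

The workload is embarrassingly parallel across database rows, which is why pushing it into the database's GPU executor pays off for large compound libraries.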
The document discusses some real needs for, and limits of, reconfigurable computing systems. It describes how partial dynamic reconfiguration can provide flexibility and enhance performance but introduces drawbacks. Simulation and verification tools are needed to design such systems. Reconfiguration times significantly impact latency, so tasks should be reused and reconfiguration hidden where possible through techniques such as relocation.
The increasing demand for computing power in fields such as biology, finance, machine learning is pushing the adoption of reconfigurable hardware in order to keep up with the required performance level at a sustainable power consumption. Within this context, FPGA devices represent an interesting solution as they combine the benefits of power efficiency, performance and flexibility. Nevertheless, the steep learning curve and experience needed to develop efficient FPGA-based systems represents one of the main limiting factor for a broad utilization of such devices.
In this talk, we present CAOS, a framework which helps the application designer in identifying acceleration opportunities and guides through the implementation of the final FPGA-based system. The CAOS platform targets the full stack of the application optimization process, starting from the identification of the kernel functions to accelerate, to the optimization of such kernels and to the generation of the runtime management and the configuration files needed to program the FPGA.
byteLAKE's expertise across NVIDIA architectures and configurations (byteLAKE)
AI Solutions for Industries | Quality Inspection | Data Insights | AI-accelerated CFD | Self-Checkout | byteLAKE.com
byteLAKE: Empowering Industries with AI Solutions. Embrace cutting-edge technology for advanced quality inspection, data insights, and more. Harness the potential of our CFD Suite, accelerating Computational Fluid Dynamics for heightened productivity. Unlock new possibilities with Cognitive Services: image analytics for precise visual inspection for Manufacturing, sound analytics enabling proactive maintenance for Automotive, and wet line analytics for the Paper Industry. Seamlessly convert data into actionable insights using Data Insights' AI module, enabling advanced predictive maintenance and risk detection. Simplify Restaurant and Retail operations with our efficient self-checkout solution, recognizing meals and groceries and elevating customer satisfaction. Custom AI Development services available for tailored solutions. Discover more at www.byteLAKE.com.
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility (inside-BigData.com)
In this deck from the Swiss HPC Conference, Mark Wilkinson presents: 40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility.
"DiRAC is the integrated supercomputing facility for theoretical modeling and HPC-based research in particle physics, astrophysics, cosmology, and nuclear physics, all areas in which the UK is world-leading. DiRAC provides a variety of compute resources, matching machine architecture to the algorithm design and requirements of the research problems to be solved. As a single federated Facility, DiRAC allows more effective and efficient use of computing resources, supporting the delivery of the science programs across the STFC research communities. It provides a common training and consultation framework and, crucially, provides critical mass and a coordinating structure for both small- and large-scale cross-discipline science projects, the technical support needed to run and develop a distributed HPC service, and a pool of expertise to support knowledge transfer and industrial partnership projects. The ongoing development and sharing of best practice for the delivery of productive, national HPC services with DiRAC enables STFC researchers to produce world-leading science across the entire STFC science theory program."
Watch the video: https://wp.me/p3RLHQ-k94
Learn more: https://dirac.ac.uk/
and
http://hpcadvisorycouncil.com/events/2019/swiss-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
The document discusses accelerating science discovery with AI inference-as-a-service. It describes showcases using this approach for high energy physics and gravitational wave experiments. It outlines the vision of the A3D3 institute to unite domain scientists, computer scientists, and engineers to achieve real-time AI and transform science. Examples are provided of using AI inference-as-a-service to accelerate workflows for CMS, ProtoDUNE, LIGO, and other experiments.
Reconfigurable CORDIC Low-Power Implementation of Complex Signal Processing f... (Editor IJMTER)
This document describes a proposed low-power CORDIC-based DCT architecture that prioritizes processing of low-frequency DCT coefficients over high-frequency coefficients to reduce power consumption with minimal image quality degradation. It uses a look-ahead CORDIC approach to allow varying the number of CORDIC iterations for different coefficients. Experimental results show the proposed architecture achieves 38.1% area and power savings compared to DA-based DCT, with comparable power to MCM-based DCT but using 100% less area and a minor 0.04dB quality loss.
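The CORDIC iterations underlying the architecture can be shown in a minimal sketch. This is a hypothetical floating-point illustration of the standard rotation-mode CORDIC (the hardware uses fixed-point shifts and adds); the point is that each iteration is one shift-and-add step, so truncating the iteration count, as the paper does for high-frequency coefficients, trades accuracy for power.

```python
import math

def cordic_cos_sin(angle, iterations=16):
    """Approximate (cos(angle), sin(angle)) via shift-and-add rotations.
    Fewer iterations = less work (and power) but coarser accuracy."""
    # Pre-computed gain K = prod(1/sqrt(1 + 2^-2i)) for the chosen depth;
    # hardware bakes this constant in.
    k = 1.0
    for i in range(iterations):
        k *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = k, 0.0, angle
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0          # rotate toward zero residual
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atan(2.0 ** -i)
    return x, y
```

A look-ahead variant, as used in the paper, restructures these dependent iterations so the rotation direction for several steps can be decided at once.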
Task allocation on many core-multi processor distributed system (Deepak Shankar)
Migration of software from single- to multi-core, from single- to multi-threaded, and into a distributed system requires knowledge of the system and of scheduling algorithms. The system consists of a combination of hardware, RTOS, network, and traffic profiles. Of the 100+ popular scheduling algorithms, the majority use First Come-First Served with priority and preemption, Weighted Round Robin, or slot-based scheduling. Task allocation must take into consideration a number of factors, including the hardware configuration, RTOS scheduling, task dependencies, parallel partitioning, shared resources, and memory access. Additionally, embedded system architectures always have the option of using custom hardware to implement tasks such as artificial intelligence, diagnostics, or image processing.
In this webinar, we will show you how to conduct trade-offs using a system model of the tasks and the target resources. You will learn to make decisions based on hardware and network statistics, which help identify deadlocks, bottlenecks, possible failures, and hardware requirements. To estimate the best task allocation and partitioning, a discrete-event simulation with both time- and quantity-shared resource modeling is essential. The software must be defined as a UML model or a task graph.
Web: www.mirabilisdesign.com
Webinar Youtube Link: https://youtu.be/ZrV39SYTWSc
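Weighted Round Robin, one of the scheduling policies named above, can be sketched in a few lines. This is a hypothetical simplified illustration (function and variable names are assumptions), not Mirabilis Design's model: each task queue is visited in turn and may dispatch up to `weight` jobs per cycle, so heavier-weighted queues get a proportional share.

```python
from collections import deque

def weighted_round_robin(queues, weights):
    """queues: dict name -> deque of jobs; weights: dict name -> int.
    Returns jobs in WRR dispatch order until all queues drain."""
    order = []
    names = list(queues)
    while any(queues[n] for n in names):
        for n in names:                       # one scheduling cycle
            for _ in range(weights[n]):       # up to `weight` jobs per visit
                if queues[n]:
                    order.append(queues[n].popleft())
    return order
```

In a discrete-event simulation, each dispatch would additionally consume modeled time on a shared resource, which is how bottlenecks and deadline misses become visible.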
For the full video of this presentation, please visit:
https://www.embedded-vision.com/platinum-members/embedded-vision-alliance/embedded-vision-training/videos/pages/may-2018-embedded-vision-summit-sze
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Vivienne Sze, Associate Professor at MIT, presents the "Approaches for Energy Efficient Implementation of Deep Neural Networks" tutorial at the May 2018 Embedded Vision Summit.
Deep neural networks (DNNs) are proving very effective for a variety of challenging machine perception tasks. But these algorithms are very computationally demanding. To enable DNNs to be used in practical applications, it’s critical to find efficient ways to implement them.
This talk explores how DNNs are being mapped onto today’s processor architectures, and how these algorithms are evolving to enable improved efficiency. Sze explores the energy consumption of commonly used CNNs versus their accuracy, and provides insights on "energy-aware" pruning of these networks.
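The simplest building block of the pruning Sze discusses is magnitude-based weight pruning, sketched below as a hypothetical illustration. Energy-aware pruning goes further by ranking layers with an energy model rather than by weight magnitude alone; that model is not reproduced here, and the function name is an assumption.

```python
# Hypothetical sketch of magnitude-based pruning: zeroing the smallest
# weights removes multiplies (and thus energy) at a modest accuracy cost.

def prune_by_magnitude(weights, sparsity):
    """Zero out the `sparsity` fraction of weights smallest in |value|."""
    flat = sorted(abs(w) for w in weights)
    cut = flat[int(sparsity * len(flat))] if sparsity > 0 else 0.0
    return [0.0 if abs(w) < cut else w for w in weights]
```

After pruning, the network is typically fine-tuned to recover accuracy; the energy-aware variant repeats this loop, pruning the most energy-hungry layers first.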
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...) (Fisnik Kraja)
This document summarizes the results of performance analysis and optimizations done on the STAR-CCM+ application run on different Intel CPU configurations. The analysis showed that the application's performance was highly dependent on CPU frequency (85-88%) and benefited from optimizations like CPU binding, huge pages, and scatter task placement. Comparing CPU types showed the 12-core CPU was 8-9% faster. Hyperthreading had a minimal impact on performance. Turbo Boost was effective but its benefits reduced as fewer cores were utilized.
Reconfigurable CORDIC Low-Power Implementation of Complex Signal Processing f...Editor IJMTER
This document describes a proposed low-power CORDIC-based DCT architecture that prioritizes processing of low-frequency DCT coefficients over high-frequency coefficients to reduce power consumption with minimal image quality degradation. It uses a look-ahead CORDIC approach to allow varying the number of CORDIC iterations for different coefficients. Experimental results show the proposed architecture achieves 38.1% area and power savings compared to DA-based DCT, with comparable power to MCM-based DCT but using 100% less area and a minor 0.04dB quality loss.
Task allocation on many core-multi processor distributed systemDeepak Shankar
Migration of software from a single to multi-core, single to multi-thread, and integrated into a distributed system requires a knowledge of the system and scheduling algorithms. The system consists of a combination of hardware, RTOS, network, and traffic profiles. Of the 100+ popular scheduling algorithms, the majority use First Come-First Server with priority and preemption, Weight Round Robin, and Slot-based. The task allocation must take into consideration a number of factors including the hardware configuration, the RTOS scheduling, task dependency, parallel partitioning, shared resources, and memory access. Additionally, embedded system architectures always have the possibility of using custom hardware to implement tasks that may be associated with Artificial Intelligence, diagnostic or image processing.
In this Webinar, we will show you how to conduct trade-offs using a system model of the tasks and the target resources. You will learn to make decisions based on the hardware and network statistics. The statistics will assist in identifying deadlocks, bottlenecks, possible failures and hardware requirements. To estimate the best task allocation and partitioning, a discrete-event simulation with both time- and quantity-shared resource modeling is essential. The software must be defined as a UML or a task graph.
Web: www.mirabilisdesign.com
Webinar Youtube Link: https://youtu.be/ZrV39SYTWSc
For the full video of this presentation, please visit:
https://www.embedded-vision.com/platinum-members/embedded-vision-alliance/embedded-vision-training/videos/pages/may-2018-embedded-vision-summit-sze
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Vivienne Sze, Associate Professor at MIT, presents the "Approaches for Energy Efficient Implementation of Deep Neural Networks" tutorial at the May 2018 Embedded Vision Summit.
Deep neural networks (DNNs) are proving very effective for a variety of challenging machine perception tasks. But these algorithms are very computationally demanding. To enable DNNs to be used in practical applications, it’s critical to find efficient ways to implement them.
This talk explores how DNNs are being mapped onto today’s processor architectures, and how these algorithms are evolving to enable improved efficiency. Sze explores the energy consumption of commonly used CNNs versus their accuracy, and provides insights on "energy-aware" pruning of these networks.
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...Fisnik Kraja
This document summarizes the results of performance analysis and optimizations done on the STAR-CCM+ application run on different Intel CPU configurations. The analysis showed that the application's performance was highly dependent on CPU frequency (85-88%) and benefited from optimizations like CPU binding, huge pages, and scatter task placement. Comparing CPU types showed the 12-core CPU was 8-9% faster. Hyperthreading had a minimal impact on performance. Turbo Boost was effective but its benefits reduced as fewer cores were utilized.
2. Energy supply and demand
• 40% more energy is needed by 2035
• No alternative yet to oil, gas, and coal
• Hence the need for sophisticated seismic methods
I. Said Ph.D. defense 12/21/2015 1/50
3. Seismic methods for Oil & Gas exploration
Acquisition Processing Interpretation
Shot = source activation + data collection (receivers)
Seismic survey
• Air-gun array
• Hydrophones
Shot record
4. Seismic methods for Oil & Gas exploration
Acquisition Processing Interpretation
Processing: noise attenuation → demultiple → interpolation → imaging ⇒ subsurface image
6. Seismic methods for Oil & Gas exploration
Acquisition Processing Interpretation
Calculate seismic attributes
• Dip
• Azimuth
• Coherence (courtesy of Total)
8. Reverse Time Migration (RTM)
• The reference computer-based imaging algorithm in the industry
• Repositions seismic events at their true locations in the subsurface
• Sub-salt and steep dips imaging
• Accurate (full wave equation (two-way))
• Requires massive compute resources (compute and storage)
11. RTM workflow
Forward modeling (FWD) Backward modeling (BWD) Imaging condition
14. The underlying theory of the RTM algorithm
The RTM operator

Img(x) = ∫_0^H ∫_0^T S_h(x, t) · R_h(x, T − t) dt dh

The Cauchy problem

(1/c²) ∂²u(x, t)/∂t² − Δu(x, t) = s(t), in Ω
u(x, 0) = 0
∂u(x, 0)/∂t = 0

Boundary condition
u = 0 on ∂Ω
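In discrete form, the operator above is a zero-lag cross-correlation of the source wavefield and the time-reversed receiver wavefield, summed over time steps and shots. A minimal plain-Python sketch, with hypothetical nested-list wavefield arrays:

```python
def imaging_condition(S, R):
    """Discrete form of the RTM operator above.

    S[h][t][x]: source wavefield of shot h at time step t, grid point x.
    R[h][t][x]: receiver wavefield, assumed already stored time-reversed,
                i.e. R[h][t][x] corresponds to R_h(x, T - t).
    Returns Img[x] = sum over h and t of S_h(x, t) * R_h(x, T - t).
    """
    n_x = len(S[0][0])
    img = [0.0] * n_x
    for S_h, R_h in zip(S, R):             # integral over shots h
        for S_t, R_t in zip(S_h, R_h):     # integral over time t
            for x in range(n_x):
                img[x] += S_t[x] * R_t[x]  # zero-lag cross-correlation
    return img
```

In the actual workflow, S comes from forward modeling and R from backward modeling of the shot record.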
15. Finite Difference Time Domain for RTM
• Finite Difference Time Domain (8th-order in space, 2nd-order in time)
• Regular grids
• Perfectly Matched Layers (PML) as an absorbing boundary condition
U^{n+1}_{i,j,k} = 2 U^n_{i,j,k} − U^{n−1}_{i,j,k} + c²_{i,j,k} Δt² ΔU^n_{i,j,k} + c²_{i,j,k} Δt² s^n
• Heavy computation (hours to days of processing time)
• Terabytes of temporary data
• Requires High Performance Computing
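A minimal sketch of one time step of this scheme, in 1D with a 2nd-order Laplacian for brevity (the slides use an 8th-order spatial stencil) and without the PML layers:

```python
def fdtd_step(u_prev, u_curr, c, dt, dx, src):
    """One leapfrog time step of
        U^{n+1} = 2 U^n - U^{n-1} + c^2 dt^2 (Lap U^n) + c^2 dt^2 s^n,
    sketched in 1D with a 2nd-order Laplacian; PML layers are omitted
    and the edges are simply held at zero."""
    n = len(u_curr)
    u_next = [0.0] * n
    for i in range(1, n - 1):
        lap = (u_curr[i - 1] - 2.0 * u_curr[i] + u_curr[i + 1]) / dx ** 2
        u_next[i] = (2.0 * u_curr[i] - u_prev[i]
                     + c[i] ** 2 * dt ** 2 * lap
                     + c[i] ** 2 * dt ** 2 * src[i])
    return u_next
```

Two such time loops (forward for S, backward for R) plus the imaging condition give the RTM skeleton.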
16. HPC solutions for RTM
CPU clusters are the reference
• Process large data sets across interconnected multi-core CPUs
• Advanced optimization techniques (vectorization, cache blocking)
Hardware accelerators and co-processors
• RTM is massively parallel
• GPU, FPGA, Intel Xeon Phi
• Dominance of GPUs:
• Huge compute power (up to 5 TFlop/s)
• High memory bandwidth (up to 300 GB/s)
• Possible PCI overheads (sustained bandwidth up to 12 GB/s)
• Data snapshotting
• MPI communications with neighbors (multi-GPU)
• Limited memory capacities
• A high-end GPU has only 12 GB at most
• CPU based compute nodes have 128 GB
• High power consumption: up to 400 W (CPU+GPU, as the GPU is not standalone)
17. GPU based solutions for RTM
• Possible software techniques to overcome RTM limits on GPUs:
• Temporal blocking (PCI overhead)
• Overlapping CPU-GPU transfers with computations (PCI overhead)
• Out-of-core algorithms (memory limitation)
• Extensive efforts and investments
• Hardware solution with an acceptable performance/efforts trade-off?
18. Towards unifying CPUs and GPUs
[Block diagrams: a discrete GPU (compute units CU_0 … CU_{N−1}, each with a register file, local memory and L1 cache, sharing an L2 cache and the GPU main memory); a multi-core CPU (per-core FPU, L1/L2 caches, shared L3, system memory); a quad-core CPU coupled to a discrete GPU over the PCI Express bus; and the Accelerated Processing Unit (APU), which integrates a quad-core CPU module and a GPU module on one die, sharing the memory controller through the Garlic and Onion buses]
19. Towards unifying CPUs and GPUs
Strengths
• No PCI Express bus
• Integrated GPUs can address the entire memory
• Low-power processors (95 W TDP at most):
• vs. CPUs at up to 150 W TDP
• and GPUs at up to 300 W
Weaknesses
• Low compute power as compared to GPUs:
• Kaveri APU 730 GFlop/s (integrated GPU)
• Phenom CPU 150 GFlop/s
• Tahiti GPU 3700 GFlop/s
• An order of magnitude less memory bandwidth than GPUs:
• APU up to 25 GB/s memory bandwidth
• GPU 300 GB/s
25. Evaluation of the APU technology
[Roadmap of the study: architectures (CPU, GPU, APU) with their data placement strategies; applications (matrix multiplication, finite difference stencils, hybrid strategy, modeling, RTM); evaluation axes (performance and power efficiency, one node and large scale, strong and weak scaling, successive generations); programming models (OpenCL, OpenACC)]
26. The APU memory subsystem
• Onion: coherent bus (slow)
• Garlic: non coherent bus (full memory bandwidth)
27. The APU memory subsystem
• c: regular CPU memory (size depends on the RAM)
28. The APU memory subsystem
• g: fixed size (512 MB to 4 GB)
• cg: explicit copy from CPU memory to GPU memory
• gc: explicit copy from GPU memory to CPU memory
29. The APU memory subsystem
• u: zero-copy and non coherent (read-only accesses from GPU cores)
• Fixed and limited size (up to 1 GB)
31. The APU memory subsystem
• z: zero-copy and coherent memory
• Variable size (up to the maximum CPU memory size)
32. Data placement strategies on APU
• OpenCL data copy kernel
• From buffer A to buffer B
• Store buffers A and B in different memory locations
• Evaluate different combinations, e.g. cggc (explicit copy) and zz (zero-copy)
33. Data placement benchmark results on APU
[Bar chart: copy time in ms (lower is better) for the data placement strategies cggc, zgc, ugc, zz and uz; buffer size 128 MB; bars split into kernel copy time, GPU-to-CPU and CPU-to-GPU transfers]
• Zero-copy reaches only about 60% of the maximum sustained bandwidth
• Select the most relevant strategies: cggc, ugc and zz
34. Applicative benchmarks on APU
Matrix multiplication
• Compute bound algorithm
• Evaluate the sustained compute gap between GPUs and APUs
8th-order 3D finite difference stencil
• Memory bound algorithm
• Building block of the Reverse Time Migration
• Evaluate the APU memory performance
Impact of data placement strategies on the APU performance
35. Finite difference stencils
ΔU^n_{i,j,k} = (1/Δx²) Σ_{l=−p/2}^{p/2} a_l U^n_{i+l,j,k}
             + (1/Δy²) Σ_{l=−p/2}^{p/2} a_l U^n_{i,j+l,k}
             + (1/Δz²) Σ_{l=−p/2}^{p/2} a_l U^n_{i,j,k+l},   p = 8

• Compute complexity O(N³)
• Storage complexity O(N³)
• Data snapshotting (K ∈ [1, 10])
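As an illustration, one axis term of this Laplacian can be written out with central-difference coefficients. The slide does not list the a_l values, so the textbook 8th-order coefficients below are an assumption:

```python
# Textbook 8th-order (p = 8) central-difference coefficients a_l for the
# second derivative, l = -4 .. 4; assumed values, not taken from the slide.
A = [-1.0 / 560, 8.0 / 315, -1.0 / 5, 8.0 / 5, -205.0 / 72,
     8.0 / 5, -1.0 / 5, 8.0 / 315, -1.0 / 560]

def d2_8th(u, i, h):
    """8th-order approximation of the second derivative at index i,
    i.e. one of the three axis terms of the Laplacian above."""
    return sum(a * u[i + l] for l, a in zip(range(-4, 5), A)) / h ** 2
```

Applied to u(x) = x² the stencil is exact and returns 2 up to rounding, which is a quick sanity check for any coefficient set.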
36. Stencils: implementation details
• 2D work-item grid on the 3D domain
• 1 column along the Z axis per work-item
• Register blocking when traversing the Z dimension
• Implementations:
• scalar: global memory
• local scalar: local memory to exploit memory access redundancies
• vectorized: global memory + explicit vectorization
• local vectorized: local memory + explicit vectorization
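The traversal pattern can be sketched as follows: each work-item owns one (x, y) column and slides a window of values along Z, held in "registers" (plain locals here), so every input is loaded once. The uniform stencil weights are purely illustrative, standing in for the real finite-difference coefficients:

```python
def column_stencil_z(u, radius):
    """Register-blocking sketch: walk one Z column `u`, keeping a sliding
    window of 2*radius + 1 values so each element is read only once.
    The uniform average is a placeholder for the real stencil weights."""
    n = len(u)
    window = list(u[: 2 * radius + 1])   # initial register block
    out = []
    for k in range(radius, n - radius):
        out.append(sum(window) / len(window))
        if k + radius + 1 < n:
            window.pop(0)                # shift the window up one plane
            window.append(u[k + radius + 1])
    return out
```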
37. Stencil computations on CPU
[Plot: GFlop/s (higher is better) on N×N×32 grids, N = 64 … 1024, for the scalar, vectorized, local vectorized and OpenMP implementations]
• Explicit vectorization helped to deliver the best performance (SSE)
• OpenCL ≥ OpenMP
38. Stencil computations on GPU
[Plot: GFlop/s (higher is better) on N×N×32 grids, N = 64 … 1024, for the scalar, local scalar, vectorized and local vectorized implementations]
• Scalar ≥ vectorized thanks to GCN (Graphics Core Next)
• Scalar code + OpenCL local memory offered the best performance
39. Stencil computations on APU
[Plot: GFlop/s (higher is better) on N×N×32 grids, N = 64 … 1024, for the scalar, local scalar, vectorized and local vectorized implementations]
• Local scalar gives the best performance numbers for N ≥ 128
• Vectorization is not needed thanks to GCN
40. Stencils: data placement strategies
• Fixed problem size (1024 × 1024 × 32)
• One snapshot every K computations (1 ≤ K ≤ 10)
• Select the best OpenCL implementations (scalar, local scalar)
• Combine them with data placement strategies: cggc, ugc, zz
[Plot: GFlop/s (higher is better) vs. K (computations per snapshot), problem size 1024×1024×32 (128 MB), for scalar and local scalar combined with the cggc, ugc and zz strategies]
• Best: local scalar (zz) for 1 ≤ K ≤ 3 and (cggc) for 3 ≤ K ≤ 10
• Select explicit copy (cggc) and zero-copy (zz) for RTM
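The crossover between zero-copy and explicit copy can be illustrated with a toy cost model: per K computations plus one snapshot, cggc pays a copy per snapshot but runs the faster kernels, while zz runs slower kernels with free snapshots. All constants below are hypothetical, chosen only to reproduce the qualitative crossover:

```python
def cycle_time(K, t_compute, t_transfer=0.0):
    """Time for K stencil computations followed by one snapshot."""
    return K * t_compute + t_transfer

def best_strategy(K, t_fast=1.0, t_slow=1.4, t_copy=1.0):
    """Toy model of the zz / cggc trade-off: explicit copy (cggc) runs
    faster kernels but pays one copy per snapshot; zero-copy (zz) runs
    slower kernels with free snapshots.  Constants are illustrative,
    not measured values from the thesis."""
    t_cggc = cycle_time(K, t_fast, t_copy)
    t_zz = cycle_time(K, t_slow)
    return "zz" if t_zz < t_cggc else "cggc"
```

With these illustrative constants the model picks zz for small K and cggc for larger K, matching the trend observed above.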
41. Stencils: performance comparison
[Plot (performance projection): GFlop/s (higher is better) vs. K, problem size 1024×1024×32 (128 MB), comparing CPU, GPU, APU and a projected APU with Onion = Garlic bandwidth]
• APU > CPU ∀K
• GPU > APU, 2 ≤ K ≤ 10
• APU > GPU when performing one snapshot after each iteration
42. Stencils: conclusion
• APU can be an attractive solution:
• For a high rate of data snapshotting (finite difference)
• For medium sized problems (matrix multiplication)
• An order-of-magnitude theoretical performance gap between GPU and APU:
• But only 3× to 4× in practice
• Performance only: the GPU remains the privileged solution
• Power is gaining interest in the HPC community (Green500)
• Power wall and Exascale
• What about power consumption?
43. Power measurement methodology
• Raritan PX (DPXR8A-16) PDU to monitor the power consumption
• Performance per Watt (PPW) metric
Methodology
• The power drawn by the system as a whole:
• Same functional hardware components for the 3 architectures
• CPU+GPU for GPU based solutions
• Importance of Power Supply Units (PSUs) electric efficiency
44. Stencils: power efficiency comparison
[Bar chart: GFlop/s/W (higher is better) vs. K for CPU, GPU and APU, problem size 1024×1024×32 (128 MB); annotated peak power draws of 62 W, 222 W and 159 W]
• CPU offers a very low power efficiency (0.08 GFlop/s/W)
• APU is 13% more power efficient than the GPU
• Higher gain for compute bound algorithm (matrix multiplication):
• Flops consume less power than memory accesses
45. RTM on one HPC node
[Roadmap of the study (as on slide 25), here highlighting the one-node RTM evaluation]
46. One-node RTM GPU/APU implementations
• Multiple OpenCL kernels (PML):
• Reduce compute/memory divergence
• Stencils study conclusions:
• Stencil optimizations and auto-tuning
• scalar and local scalar
• Data placement strategies (APU)
• Imaging condition on CPU
• Evaluate:
• Kernels (kernels only)
• Full application (overall)
[Diagram: 3D compute domain (X, Y, Z) with a free surface on top and absorbing boundary layers around the physical domain]
Case study
• 3D SEG/EAGE Salt velocity model
• Compute grid that fits in one GPU compute node (less than 3 GB)
• Selective checkpointing frequency K=10
47. One-node RTM on GPU/APU
                      kernels only (GFlop/s)   overall (GFlop/s)   %loss
GPU                          141.77                  29.65           79%
APU (explicit copy)           32.42                  15.93           50%
APU (zero-copy)               15.2                   11.45           24%
GPU
• Best: scalar implementation
• Impact of PCI+IO (snapshotting) on performance
APU
• Best: scalar using explicit data copies (cggc)
• Local memory is beneficial when using zero-copy memory objects
48. One-node RTM: performance comparison
[Bar chart: GFlop/s (higher is better), one node per architecture (CPU, APU zero-copy, APU explicit copy, GPU), for RTM kernels only and RTM overall]
• Poor performance on the Phenom CPU (OpenCL)
• Gap between GPU and APU:
• 4.4× with kernels only
• only 1.8× when considering overall timings
49. One-node RTM: power efficiency comparison
[Bar chart: GFlop/s/W (higher is better) for CPU (137 W), APU zero-copy (62 W), APU explicit copy (62 W) and GPU (198 W)]
• Performance numbers based on overall timings
• Poor power efficiency on the Phenom CPU (0.013 GFlop/s/W)
• APU can be more power efficient than the GPU:
• 1.80× (explicit copy)
• 1.23× (zero-copy)
50. One-node RTM: conclusion
• RTM(kernels only): huge gap between APU and GPU
• RTM(overall): the performance gap is reduced
• Performance + power: the APU is almost twice as efficient as the GPU
51. RTM on multi-node hybrid architectures
[Roadmap of the study (as on slide 25), here highlighting the large-scale, multi-node RTM evaluation]
52. RTM on multi-node hybrid architectures
Motivations
• Real-world cases generate large amounts of data (on the order of 1 terabyte)
• Larger than a single node's memory capacity
• Impact of MPI communications on the PCI overhead (GPU)?
• Impact of zero-copy on MPI communications (APU)?
Clusters (located at Total)

                  CPU cluster              APU cluster            GPU cluster
Nodes used        16                       16                     16
Processors/node   2× Intel Xeon E5-2670    1× AMD A10-7850K       1× NVIDIA Tesla K40s
                                           (Kaveri)               + 1× Intel Xeon E5-2680

Case study
• Same velocity model, K = 10
• Compute grid size: 25 GB
53. Multi-node RTM: implementation
• 3D domain decomposition
• One-node study conclusions
• Boundaries copied to contiguous buffers:
• For GPUs, using OpenCL kernels
• For GPUs, PCI memory transfers:
• Communications with neighbors
• I/O operations for snapshotting
[Diagram: 3D subdomain (X, Y, Z) with its six exchange faces: north, south, east, west, front, back]
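The 3D domain decomposition implies that each subdomain exchanges halos with up to six face neighbors. A sketch of the neighbor computation in a Cartesian process grid (the rank layout, x fastest, is an assumption, not taken from the thesis):

```python
def neighbors(rank, px, py, pz):
    """Six face neighbors of `rank` in a px * py * pz Cartesian process
    grid (3D domain decomposition).  Returns None on faces where the
    subdomain touches the physical boundary (no exchange needed)."""
    x, y, z = rank % px, (rank // px) % py, rank // (px * py)

    def rank_of(i, j, k):
        if 0 <= i < px and 0 <= j < py and 0 <= k < pz:
            return i + j * px + k * px * py
        return None  # physical boundary

    return {
        "west":  rank_of(x - 1, y, z), "east":  rank_of(x + 1, y, z),
        "south": rank_of(x, y - 1, z), "north": rank_of(x, y + 1, z),
        "back":  rank_of(x, y, z - 1), "front": rank_of(x, y, z + 1),
    }
```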
54. Multi-node RTM: MPI overlapping
Problem: ineffective non-blocking communications (initial)
[Timeline: process P0 issues Isend(buf), does work that does not touch buf, then Wait() before using buf; P1 posts Recv(buf); with no progress thread, the transfer does not advance during the work]

Solution: explicit overlap technique (overlap)
[Timeline: an auxiliary thread is activated to perform the blocking MPI communications and then update the domain boundaries, while the user thread updates the inner domain; both synchronize before the next time step]
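The overlap scheme can be sketched with two threads; the three callables below are hypothetical stand-ins for the real RTM routines:

```python
import threading

def overlapped_step(update_inner, update_boundaries, exchange_halos):
    """Explicit overlap: an auxiliary thread performs the (blocking)
    halo exchange and then the boundary update, while the main thread
    updates the inner domain; join() is the sync point before the next
    time step.  The callables stand in for the real RTM routines."""
    def aux():
        exchange_halos()       # blocking MPI communications
        update_boundaries()    # boundary points need the halos first
    t = threading.Thread(target=aux)
    t.start()
    update_inner()             # overlapped with the communication
    t.join()                   # synchronize before the next step
```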
64. Multi-node RTM: performance comparison
[Bar chart: run time in s (lower is better) on 16 nodes per cluster: CPU, APU (zero-copy), APU (explicit copy), GPU]
• APU cluster (explicit copy) > CPU cluster:
• 1.6× (node (2 CPUs) to node (1 APU))
• 3.2× (socket (1 CPU) to socket (1 APU))
• GPU cluster > APU cluster (explicit copy) by 3.5×
• GPU cluster > APU cluster (zero-copy) by 8.3×
• APU cluster (explicit copy) > APU cluster (zero-copy) by 2.3×
65. Multi-node RTM: estimated power efficiency
[Bar chart: run time in s (lower is better) under a fixed 1600 W power budget: CPU (8 nodes), APU zero-copy (16 nodes), APU explicit copy (16 nodes), GPU (4 nodes); hardware cost annotations of $3200 and $12000]
• Power budget 1600 W (TDP and maximum power consumption)
• APU cluster (zero-copy) > CPU cluster
• APU cluster (explicit copy) = GPU cluster
66. Conclusions
• Evaluation of the APU technology:
• Performance standpoint: GPU > APU
• Performance + power: the APU becomes an attractive solution
• Importance of data placement strategies
• One-node RTM study:
• The same conclusions (APU evaluation) were confirmed
• Multi-node study of the RTM:
• GPU/APU: I/O and communications = high fraction of run times
• GPU/APU: overlapping I/O and communications is mandatory
• Kaveri APU 3.2× speedup over Intel Xeon E5-2670
• Kaveri APU falls behind NVIDIA Tesla K40s GPU by 3.5×
• APU = GPU (power efficiency)
67. Conclusions on programming models
• 3 OpenACC based solutions:
• OpenACC only
• OpenACC+HMPPcg (extension to HMPP provided by CAPS)
• OpenACC+code modification
            APU (GFlop/s)   GPU (GFlop/s)   #LOC
OpenACC         17.61            77.55        34
OpenCL          32.42           141.77       779
• OpenACC+HMPPcg offers the best directive-based performance
• OpenACC+HMPPcg provides only half the OpenCL performance
• But with 26× fewer lines of code (LOC)
68. Perspectives
• Directive-based approach for multi-node RTM
• Upcoming APU roadmap
• Full memory unification (hardware level)
• HBM (High Bandwidth Memory) + compute units count increase
• OpenPower and NVLink
• More complex and realistic RTM algorithms:
• Adding anisotropy
• Elastic media
69. Thank you for your attention, questions?
List of publications
• H. Calandra, R. Dolbeau, P. Fortin, J.-L. Lamotte, I. Said,
Assessing the relevance of APU for high performance scientific computing,
AMD Fusion Developer Summit (AFDS), 2012.
• H. Calandra, R. Dolbeau, P. Fortin, J.-L. Lamotte, I. Said,
Evaluation of successive CPUs/APUs/GPUs based on an OpenCL finite difference stencil,
21st Euromicro International Conference on Parallel, Distributed and Network-Based Processing, PDP 2013.
• H. Calandra, R. Dolbeau, P. Fortin, J.-L. Lamotte, I. Said,
Forward seismic modeling on AMD Accelerated Processing Unit,
2013 Rice Oil & Gas HPC Workshop.
• P. Eberhart, I. Said, P. Fortin, H. Calandra,
Hybrid strategy for stencil computations on the APU,
The 1st International Workshop on High-Performance Stencil Computations, 2014.
• F. Jézéquel, J.-L. Lamotte, I. Said,
Estimation of numerical reproducibility on CPU and GPU,
Federated Conference on Computer Science and Information Systems, 2015.
• I. Said, P. Fortin, J.-L. Lamotte and H. Calandra,
Leveraging the Accelerated Processing Units for seismic imaging: a performance and power efficiency comparison against CPUs and GPUs,
(submitted on October 2015 to an international journal).
• I. Said, P. Fortin, J.-L. Lamotte, H. Calandra,
Efficient Reverse Time Migration on APU clusters,
2016 Rice Oil & Gas HPC Workshop (submitted on November 2015).