This document summarizes a presentation about runtime mapping of hardware accelerators on an embedded field-programmable gate array (FPGA) layer. It discusses how the multicore era is hitting a utilization wall due to power constraints limiting the percentage of a chip that can operate at full frequency. It proposes heterogeneous multicores with reconfigurable hardware accelerators and a 3D-stacked design to improve bandwidth, latency, resource usage, performance and energy efficiency. The document outlines an embedded FPGA reconfigurable fabric and its expected features like low reconfiguration overhead through a double-context configuration memory. It describes how tasks can be more easily allocated and migrated in this eFPGA compared to a traditional FPGA through the use of a
Slides presented at the FlexTiles Workshop at FPL'2014.
Presentation #7: FlexTiles Emulation platform
FlexTiles is a heterogeneous many-core platform reconfigurable at run-time developed within an FP7 project.
FPL'2014 - FlexTiles Workshop - 8 - FlexTiles Demo (FlexTiles Team)
Slides presented at the FlexTiles Workshop at FPL'2014.
Presentation #8: FlexTiles Demo
FlexTiles is a heterogeneous many-core platform reconfigurable at run-time developed within an FP7 project.
FlexTiles Platform integrated in 19" Rack Enclosure (FlexTiles Team)
The FlexTiles Development Platform is suitable for verifying the concept of a single 3D SoC chip design, but with a 19" Rack enclosure it can either scale to a larger 3D chip design or become an FPGA-based HPC
Conference on Adaptive Hardware and Systems (AHS'14) - The FlexTiles Embedded... (FlexTiles Team)
The FP7 FlexTiles Project will provide tools for building a 3D SoC chip. The chip embeds an FPGA, and these slides explain the ideas behind it and how we will make it a reconfigurable fabric like never seen before.
Slides presented at the FlexTiles Workshop at FPL'2014.
Presentation #3: FlexTiles DSP Accelerators
FlexTiles is a heterogeneous many-core platform reconfigurable at run-time developed within an FP7 project.
Preparing to program Aurora at Exascale - Early experiences and future direct... (inside-BigData.com)
In this deck from IWOCL / SYCLcon 2020, Hal Finkel from Argonne National Laboratory presents: Preparing to program Aurora at Exascale - Early experiences and future directions.
"Argonne National Laboratory’s Leadership Computing Facility will be home to Aurora, our first exascale supercomputer. Aurora promises to take scientific computing to a whole new level, and scientists and engineers from many different fields will take advantage of Aurora’s unprecedented computational capabilities to push the boundaries of human knowledge. In addition, Aurora’s support for advanced machine-learning and big-data computations will enable scientific workflows incorporating these techniques along with traditional HPC algorithms. Programming the state-of-the-art hardware in Aurora will be accomplished using state-of-the-art programming models. Some of these models, such as OpenMP, are long-established in the HPC ecosystem. Other models, such as Intel’s oneAPI, based on SYCL, are relatively-new models constructed with the benefit of significant experience. Many applications will not use these models directly, but rather, will use C++ abstraction libraries such as Kokkos or RAJA. Python will also be a common entry point to high-performance capabilities. As we look toward the future, features in the C++ standard itself will become increasingly relevant for accessing the extreme parallelism of exascale platforms.
This presentation will summarize the experiences of our team as we prepare for Aurora, exploring how to port applications to Aurora’s architecture and programming models, and distilling the challenges and best practices we’ve developed to date. oneAPI/SYCL and OpenMP are both critical models in these efforts, and while the ecosystem for Aurora has yet to mature, we’ve already had a great deal of success. Importantly, we are not passive recipients of programming models developed by others. Our team works not only with vendor-provided compilers and tools, but also develops improved open-source LLVM-based technologies that feed both open-source and vendor-provided capabilities. In addition, we actively participate in the standardization of OpenMP, SYCL, and C++. To conclude, I’ll share our thoughts on how these models can best develop in the future to support exascale-class systems."
Watch the video: https://wp.me/p3RLHQ-lPT
Learn more: https://www.iwocl.org/iwocl-2020/conference-program/
and
https://www.anl.gov/topic/aurora
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from the Argonne Training Program on Extreme-Scale Computing 2019, Howard Pritchard from LANL and Simon Hammond from Sandia present: NNSA Explorations: ARM for Supercomputing.
"The Arm-based Astra system at Sandia will be used by the National Nuclear Security Administration (NNSA) to run advanced modeling and simulation workloads for addressing areas such as national security, energy and science.
"By introducing Arm processors with the HPE Apollo 70, a purpose-built HPC architecture, we are bringing powerful elements, like optimal memory performance and greater density, to supercomputers that existing technologies in the market cannot match,” said Mike Vildibill, vice president, Advanced Technologies Group, HPE. “Sandia National Laboratories has been an active partner in leveraging our Arm-based platform since its early design, and featuring it in the deployment of the world’s largest Arm-based supercomputer, is a strategic investment for the DOE and the industry as a whole as we race toward achieving exascale computing.”
Watch the video: https://wp.me/p3RLHQ-l29
Learn more: https://insidehpc.com/2018/06/arm-goes-big-hpe-builds-petaflop-supercomputer-sandia/
and
https://extremecomputingtraining.anl.gov/agenda-2019/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from ATPESC 2019, Ken Raffenetti from Argonne presents an overview of HPC interconnects.
"The Argonne Training Program on Extreme-Scale Computing (ATPESC) provides intensive, two-week training on the key skills, approaches, and tools to design, implement, and execute computational science and engineering applications on current high-end computing systems and the leadership-class computing systems of the future."
Watch the video: https://wp.me/p3RLHQ-luc
Learn more: https://extremecomputingtraining.anl.gov/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck, Paul Isaacs from Linaro presents: State of ARM-based HPC. This talk provides an overview of applications and infrastructure services successfully ported to Aarch64 and benefiting from scale.
"With its debut on the TOP500, the 125,000-core Astra supercomputer at New Mexico’s Sandia Labs uses Cavium ThunderX2 chips to mark Arm’s entry into the petascale world. In Japan, the Fujitsu A64FX Arm-based CPU in the pending Fugaku supercomputer has been optimized to achieve high-level, real-world application performance, anticipating up to one hundred times the application execution performance of the K computer. K was the first computer to top 10 petaflops in 2011."
Watch the video: https://wp.me/p3RLHQ-lIT
Learn more: https://www.linaro.org/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
CUDA-Python and RAPIDS for blazing fast scientific computing (inside-BigData.com)
In this deck from the ECSS Symposium, Abe Stern from NVIDIA presents: CUDA-Python and RAPIDS for blazing fast scientific computing.
"We will introduce Numba and RAPIDS for GPU programming in Python. Numba allows us to write just-in-time compiled CUDA code in Python, giving us easy access to the power of GPUs from a powerful high-level language. RAPIDS is a suite of tools with a Python interface for machine learning and dataframe operations. Together, Numba and RAPIDS represent a potent set of tools for rapid prototyping, development, and analysis for scientific computing. We will cover the basics of each library and go over simple examples to get users started. Finally, we will briefly highlight several other relevant libraries for GPU programming."
Watch the video: https://wp.me/p3RLHQ-lvu
Learn more: https://developer.nvidia.com/rapids
and
https://www.xsede.org/for-users/ecss/ecss-symposium
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from the UK HPC Conference, Gunter Roeth from NVIDIA presents: Hardware & Software Platforms for HPC, AI and ML.
"Data is driving the transformation of industries around the world and a new generation of AI applications are effectively becoming programs that write software, powered by data, vs by computer programmers. Today, NVIDIA’s tensor core GPU sits at the core of most AI, ML and HPC applications, and NVIDIA software surrounds every level of such a modern application, from CUDA and libraries like cuDNN and NCCL embedded in every deep learning framework and optimized and delivered via the NVIDIA GPU Cloud to reference architectures designed to streamline the deployment of large scale infrastructures."
Watch the video: https://wp.me/p3RLHQ-l2Y
Learn more: http://nvidia.com
and
http://hpcadvisorycouncil.com/events/2019/uk-conference/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from the Performance Optimisation and Productivity group, Lubomir Riha from IT4Innovations presents: Energy Efficient Computing using Dynamic Tuning.
"We now live in a world of power-constrained architectures and systems and power consumption represents a significant cost factor in the overall HPC system economy. For these reasons, in recent years researchers, supercomputing centers and major vendors have developed new tools and methodologies to measure and optimize the energy consumption of large-scale high performance system installations. Due to the link between energy consumption, power consumption and execution time of an application executed by the final user, it is important for these tools and the methodology used to consider all these aspects, empowering the final user and the system administrator with the capability of finding the best configuration given different high level objectives.
This webinar focused on tools designed to improve the energy-efficiency of HPC applications using a methodology of dynamic tuning of HPC applications, developed under the H2020 READEX project. The READEX methodology has been designed for exploiting the dynamic behaviour of software. At design time, different runtime situations (RTS) are detected and optimized system configurations are determined. RTSs with the same configuration are grouped into scenarios, forming the tuning model. At runtime, the tuning model is used to switch system configurations dynamically.
The MERIC tool, that implements the READEX methodology, is presented. It supports manual or binary instrumentation of the analysed applications to simplify the analysis. This instrumentation is used to identify and annotate the significant regions in the HPC application. Automatic binary instrumentation annotates regions with significant runtime. Manual instrumentation, which can be combined with automatic, allows code developer to annotate regions of particular interest."
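The grouping step described in the abstract above can be illustrated with a minimal Python sketch. It is not MERIC or READEX code: the region names, the configuration tuples (core frequency, uncore frequency, thread count) and the helper functions `build_tuning_model` and `config_for` are all hypothetical, chosen only to show how runtime situations with the same configuration might be merged into scenarios and looked up at runtime.

```python
from collections import defaultdict

# Hypothetical design-time result: for each runtime situation (RTS), the best
# configuration found, as (core GHz, uncore GHz, threads). All values invented.
best_config_per_rts = {
    "assemble_matrix": (2.4, 1.6, 24),
    "solve_linear_system": (1.8, 2.4, 24),
    "exchange_halo": (1.2, 2.4, 12),
    "write_checkpoint": (1.2, 2.4, 12),
}


def build_tuning_model(best_config_per_rts):
    """Group RTSs that share a configuration into scenarios."""
    members_by_config = defaultdict(list)
    for rts, config in best_config_per_rts.items():
        members_by_config[config].append(rts)

    scenario_configs = {}  # scenario id -> configuration
    rts_to_scenario = {}   # RTS name -> scenario id
    # Sort by configuration so scenario ids are deterministic.
    for scenario_id, config in enumerate(sorted(members_by_config)):
        scenario_configs[scenario_id] = config
        for rts in members_by_config[config]:
            rts_to_scenario[rts] = scenario_id
    return scenario_configs, rts_to_scenario


def config_for(rts, scenario_configs, rts_to_scenario, default):
    """Runtime lookup: return the configuration to switch to when entering an RTS."""
    scenario_id = rts_to_scenario.get(rts)
    if scenario_id is None:
        return default  # region was not seen at design time
    return scenario_configs[scenario_id]


scenario_configs, rts_to_scenario = build_tuning_model(best_config_per_rts)
default_config = (2.4, 2.4, 24)  # assumed system default
for region in ("assemble_matrix", "exchange_halo", "unknown_region"):
    print(region, config_for(region, scenario_configs, rts_to_scenario, default_config))
```

In this sketch the four regions collapse into three scenarios, because `exchange_halo` and `write_checkpoint` share a configuration; a real tool would apply the selected configuration through hardware or runtime interfaces rather than print it.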
Watch the video: https://wp.me/p3RLHQ-lJP
Learn more: https://pop-coe.eu/blog/14th-pop-webinar-energy-efficient-computing-using-dynamic-tuning
and
https://code.it4i.cz/vys0053/meric
Sign up for our insideHPC Newsletter: http://insidehpc.com/newslett
In this deck, Ronald P. Luijten from IBM Research in Zurich presents: DOME 64-bit μDataCenter.
"I like to call it a datacenter in a shoebox. With the combination of power and energy efficiency, we believe the microserver will be of interest beyond the DOME project, particularly for cloud data centers and Big Data analytics applications."
The microserver’s team has designed and demonstrated a prototype 64-bit microserver using a PowerPC-based chip from Freescale Semiconductor running Linux Fedora and IBM DB2. At 133 × 55 mm² the microserver contains all of the essential functions of today’s servers, which are 4 to 10 times larger in size. Not only is the microserver compact, it is also very energy-efficient.
Watch the video: http://wp.me/p3RLHQ-gJM
Learn more: https://www.zurich.ibm.com/microserver/
Sign up for our insideHPC Newsletter: http://insideHPC/newsletter
CINECA for HPC and e-infrastructures (Cineca)
Sanzio Bassini, Head of the HPC Department of Cineca. Cineca is the technological partner of the Ministry of Education and takes part in the Italian commitment to developing e-infrastructure in Italy and in Europe for HPC and HPC technologies; scientific data repository and management; cloud computing for industry and public administration; and the development of computing-intensive and data-intensive methods for science and engineering.
Cineca offers: open access to the integrated tier-0 and tier-1 national HPC infrastructure; education and training activities under the umbrella of the PRACE Advanced Training Centre action; and an integrated help desk and scale-up process for HPC user support.
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod... (inside-BigData.com)
In this deck from the Stanford HPC Conference, DK Panda from Ohio State University presents: How to Achieve High-Performance, Scalable and Distributed DNN Training on Modern HPC Systems?
"This talk will start with an overview of challenges being faced by the AI community to achieve high-performance, scalable and distributed DNN training on Modern HPC systems with both scale-up and scale-out strategies. After that, the talk will focus on a range of solutions being carried out in my group to address these challenges. The solutions will include: 1) MPI-driven Deep Learning, 2) Co-designing Deep Learning Stacks with High-Performance MPI, 3) Out-of-core DNN training, and 4) Hybrid (Data and Model) parallelism. Case studies to accelerate DNN training with popular frameworks like TensorFlow, PyTorch, MXNet and Caffe on modern HPC systems will be presented."
Watch the video: https://youtu.be/LeUNoKZVuwQ
Learn more: http://web.cse.ohio-state.edu/~panda.2/
and
http://www.hpcadvisorycouncil.com/events/2020/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Trends in Systems and How to Get Efficient Performance (inside-BigData.com)
In this video from the Switzerland HPC Conference, Martin Hilgeman from Dell presents: HPC Workload Efficiency and the Challenges for System Builders.
"With all the advances in massively parallel and multi-core computing with CPUs and accelerators it is often overlooked whether the computational work is being done in an efficient manner. This efficiency is largely being determined at the application level and therefore puts the responsibility of sustaining a certain performance trajectory into the hands of the user. It is observed that the adoption rate of new hardware capabilities is decreasing and lead to a feeling of diminishing returns. This presentation shows the well-known laws of parallel performance from the perspective of a system builder. It also covers through the use of real case studies, examples of how to program for energy efficient parallel application performance."
Watch the video: http://wp.me/p3RLHQ-gIS
Learn more: http://dell.com
and
http://www.hpcadvisorycouncil.com/events/2017/swiss-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
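The Dell abstract above refers to "the well-known laws of parallel performance" without naming them. Assuming Amdahl's law is one of the laws meant (the abstract does not say so), a minimal Python sketch shows why adding hardware gives diminishing returns when part of the work stays serial. The function names and the 95% parallel fraction are illustrative assumptions, not figures from the talk.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Speedup predicted by Amdahl's law for a given parallel fraction and worker count."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def parallel_efficiency(parallel_fraction, workers):
    """Speedup per worker: 1.0 means every added worker is fully used."""
    return amdahl_speedup(parallel_fraction, workers) / workers


# Illustrative assumption: 95% of the runtime parallelizes perfectly.
for workers in (1, 8, 64, 512):
    speedup = amdahl_speedup(0.95, workers)
    efficiency = parallel_efficiency(0.95, workers)
    print(f"{workers:4d} workers: speedup {speedup:6.2f}, efficiency {efficiency:6.1%}")
```

With any serial fraction s greater than zero, the speedup can never exceed 1/s, so efficiency per worker falls as the worker count grows. That is one way to read the "diminishing returns" the abstract mentions.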
Conference on Adaptive Hardware and Systems (AHS'14) - FlexTiles Introductions (FlexTiles Team)
FlexTiles is an FP7 project whose goal is to design a tool-chain for designing a 3D SoC and prototyping it on an FPGA development platform. This presentation covers the "why, how, when and where" of the project, which will complete in 2015.
"Algorithmic processing performed in High Performance Computing environments impacts the lives of billions of people, and planning for exascale computing presents significant power challenges to the industry. ARM delivers the enabling technology behind HPC. The 64-bit design of the ARMv8-A architecture combined with Advanced SIMD vectorization are ideal to enable large scientific computing calculations to be executed efficiently on ARM HPC machines. In addition ARM and its partners are working to ensure that all the software tools and libraries, needed by both users and systems administrators, are provided in readily available, optimized packages."
Learn more: https://developer.arm.com/hpc
and
http://hpcuserforum.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck, Jean-Pierre Panziera from Atos presents: BXI - Bull eXascale Interconnect.
"Exascale entails an explosion of performance, of the number of nodes/cores, of data volume and data movement. At such a scale, optimizing the network that is the backbone of the system becomes a major contributor to global performance. The interconnect is going to be a key enabling technology for exascale systems. This is why one of the cornerstones of Bull’s exascale program is the development of our own new-generation interconnect. The Bull eXascale Interconnect or BXI introduces a paradigm shift in terms of performance, scalability, efficiency, reliability and quality of service for extreme workloads."
Watch the video: http://wp.me/p3RLHQ-gJa
Learn more: https://bull.com/bull-exascale-interconnect/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
The NPOESS Integrated Program Office and its principal contractors -- Raytheon and Northrop-Grumman -- are preparing for the launch of the NPP risk-reduction mission in 2006. This presentation will review program status, and how HDF will be used to deliver NPOESS products at the domestic weather centrals and worldwide field terminals.
Slides presented at the FlexTiles Workshop at FPL'2014.
Presentation #5: FlexTiles Simulation Platform
FlexTiles is a heterogeneous many-core platform reconfigurable at run-time developed within an FP7 project.
Slides presented at the FlexTiles Workshop at FPL'2014.
Presentation #4: FlexTiles Virtual Platform
FlexTiles is a heterogeneous many-core platform reconfigurable at run-time developed within an FP7 project.
In this deck from the Argonne Training Program on Extreme-Scale Computing 2019, Howard Pritchard from LANL and Simon Hammond from Sandia present: NNSA Explorations: ARM for Supercomputing.
"The Arm-based Astra system at Sandia will be used by the National Nuclear Security Administration (NNSA) to run advanced modeling and simulation workloads for addressing areas such as national security, energy and science.
"By introducing Arm processors with the HPE Apollo 70, a purpose-built HPC architecture, we are bringing powerful elements, like optimal memory performance and greater density, to supercomputers that existing technologies in the market cannot match,” said Mike Vildibill, vice president, Advanced Technologies Group, HPE. “Sandia National Laboratories has been an active partner in leveraging our Arm-based platform since its early design, and featuring it in the deployment of the world’s largest Arm-based supercomputer, is a strategic investment for the DOE and the industry as a whole as we race toward achieving exascale computing.”
Watch the video: https://wp.me/p3RLHQ-l29
Learn more: https://insidehpc.com/2018/06/arm-goes-big-hpe-builds-petaflop-supercomputer-sandia/
and
https://extremecomputingtraining.anl.gov/agenda-2019/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from ATPESC 2019, Ken Raffenetti from Argonne presents an overview of HPC interconnects.
"The Argonne Training Program on Extreme-Scale Computing (ATPESC) provides intensive, two-week training on the key skills, approaches, and tools to design, implement, and execute computational science and engineering applications on current high-end computing systems and the leadership-class computing systems of the future."
Watch the video: https://wp.me/p3RLHQ-luc
Learn more: https://extremecomputingtraining.anl.gov/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck, Paul Isaacs from Linaro presents: State of ARM-based HPC. This talk provides an overview of applications and infrastructure services successfully ported to Aarch64 and benefiting from scale.
"With its debut on the TOP500, the 125,000-core Astra supercomputer at New Mexico’s Sandia Labs uses Cavium ThunderX2 chips to mark Arm’s entry into the petascale world. In Japan, the Fujitsu A64FX Arm-based CPU in the pending Fugaku supercomputer has been optimized to achieve high-level, real-world application performance, anticipating up to one hundred times the application execution performance of the K computer. K was the first computer to top 10 petaflops in 2011."
Watch the video: https://wp.me/p3RLHQ-lIT
Learn more: https://www.linaro.org/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
CUDA-Python and RAPIDS for blazing fast scientific computinginside-BigData.com
In this deck from the ECSS Symposium, Abe Stern from NVIDIA presents: CUDA-Python and RAPIDS for blazing fast scientific computing.
"We will introduce Numba and RAPIDS for GPU programming in Python. Numba allows us to write just-in-time compiled CUDA code in Python, giving us easy access to the power of GPUs from a powerful high-level language. RAPIDS is a suite of tools with a Python interface for machine learning and dataframe operations. Together, Numba and RAPIDS represent a potent set of tools for rapid prototyping, development, and analysis for scientific computing. We will cover the basics of each library and go over simple examples to get users started. Finally, we will briefly highlight several other relevant libraries for GPU programming."
Watch the video: https://wp.me/p3RLHQ-lvu
Learn more: https://developer.nvidia.com/rapids
and
https://www.xsede.org/for-users/ecss/ecss-symposium
Sign up for our insideHPC Newsletter: http://insidehp.com/newsletter
In this deck from the UK HPC Conference, Gunter Roeth from NVIDIA presents: Hardware & Software Platforms for HPC, AI and ML.
"Data is driving the transformation of industries around the world and a new generation of AI applications are effectively becoming programs that write software, powered by data, vs by computer programmers. Today, NVIDIA’s tensor core GPU sits at the core of most AI, ML and HPC applications, and NVIDIA software surrounds every level of such a modern application, from CUDA and libraries like cuDNN and NCCL embedded in every deep learning framework and optimized and delivered via the NVIDIA GPU Cloud to reference architectures designed to streamline the deployment of large scale infrastructures."
Watch the video: https://wp.me/p3RLHQ-l2Y
Learn more: http://nvidia.com
and
http://hpcadvisorycouncil.com/events/2019/uk-conference/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from the Performance Optimisation and Productivity group, Lubomir Riha from IT4Innovations presents: Energy Efficient Computing using Dynamic Tuning.
"We now live in a world of power-constrained architectures and systems and power consumption represents a significant cost factor in the overall HPC system economy. For these reasons, in recent years researchers, supercomputing centers and major vendors have developed new tools and methodologies to measure and optimize the energy consumption of large-scale high performance system installations. Due to the link between energy consumption, power consumption and execution time of an application executed by the final user, it is important for these tools and the methodology used to consider all these aspects, empowering the final user and the system administrator with the capability of finding the best configuration given different high level objectives.
This webinar focused on tools designed to improve the energy-efficiency of HPC applications using a methodology of dynamic tuning of HPC applications, developed under the H2020 READEX project. The READEX methodology has been designed for exploiting the dynamic behaviour of software. At design time, different runtime situations (RTS) are detected and optimized system configurations are determined. RTSs with the same configuration are grouped into scenarios, forming the tuning model. At runtime, the tuning model is used to switch system configurations dynamically.
The MERIC tool, that implements the READEX methodology, is presented. It supports manual or binary instrumentation of the analysed applications to simplify the analysis. This instrumentation is used to identify and annotate the significant regions in the HPC application. Automatic binary instrumentation annotates regions with significant runtime. Manual instrumentation, which can be combined with automatic, allows code developer to annotate regions of particular interest."
Watch the video: https://wp.me/p3RLHQ-lJP
Learn more: https://pop-coe.eu/blog/14th-pop-webinar-energy-efficient-computing-using-dynamic-tuning
and
https://code.it4i.cz/vys0053/meric
Sign up for our insideHPC Newsletter: http://insidehpc.com/newslett
In this deck, Ronald P. Luijten from IBM Research in Zurich presents: DOME 64-bit μDataCenter.
I like to call it a datacenter in a shoebox. With the combination of power and energy efficiency, we believe the microserver will be of interest beyond the DOME project, particularly for cloud data centers and Big Data analytics applications."
The microserver’s team has designed and demonstrated a prototype 64-bit microserver using a PowerPC based chip from Freescale Semiconductor running Linux Fedora and IBM DB2. At 133 × 55 mm2 the microserver contains all of the essential functions of today’s servers, which are 4 to 10 times larger in size. Not only is the microserver compact, it is also very energy-efficient.
Watch the video: http://wp.me/p3RLHQ-gJM
Learn more: https://www.zurich.ibm.com/microserver/
Sign up for our insideHPC Newsletter: http://insideHPC/newsletter
CINECA for HCP and e-infrastructures infrastructuresCineca
Sanzio Bassini. Head of the HPC Department of Cineca. Cineca is the technological partner of the Ministry of Education, and takes part in the Italian commitment for the development of e-infrastrcuture in Italy and in Europe for HCP and HCP technologies; scientific data repository and management, cloud computing for industries and Public administration, for the development of computing intensive and data intensive methods for science and engineering
Cineca offers a unique offer for: open access of integrated tier0 and tier1 HCP national infrastructure; of education and training activities under the umbrella of PRACE Training
advanced center action; integrated help desk and scale up process for HCP users support
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...inside-BigData.com
In this deck from the Stanford HPC Conference, DK Panda from Ohio State University presents: How to Achieve High-Performance, Scalable and Distributed DNN Training on Modern HPC Systems?
"This talk will start with an overview of challenges being faced by the AI community to achieve high-performance, scalable and distributed DNN training on Modern HPC systems with both scale-up and scale-out strategies. After that, the talk will focus on a range of solutions being carried out in my group to address these challenges. The solutions will include: 1) MPI-driven Deep Learning, 2) Co-designing Deep Learning Stacks with High-Performance MPI, 3) Out-of- core DNN training, and 4) Hybrid (Data and Model) parallelism. Case studies to accelerate DNN training with popular frameworks like TensorFlow, PyTorch, MXNet and Caffe on modern HPC systems will be presented."
Watch the video: https://youtu.be/LeUNoKZVuwQ
Learn more: http://web.cse.ohio-state.edu/~panda.2/
and
http://www.hpcadvisorycouncil.com/events/2020/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Trends in Systems and How to Get Efficient Performanceinside-BigData.com
In this video from Switzerland HPC Conference, Martin Hilgeman from Dell presents: HPC Workload Efficiency and the Challenges for System Builders.
"With all the advances in massively parallel and multi-core computing with CPUs and accelerators it is often overlooked whether the computational work is being done in an efficient manner. This efficiency is largely being determined at the application level and therefore puts the responsibility of sustaining a certain performance trajectory into the hands of the user. It is observed that the adoption rate of new hardware capabilities is decreasing and lead to a feeling of diminishing returns. This presentation shows the well-known laws of parallel performance from the perspective of a system builder. It also covers through the use of real case studies, examples of how to program for energy efficient parallel application performance."
Watch the video: http://wp.me/p3RLHQ-gIS
Learn more: http://dell.com
and
http://www.hpcadvisorycouncil.com/events/2017/swiss-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Conference on Adaptive Hardware and Systems (AHS'14) - FlexTiles IntroductionsFlexTiles Team
FlexTiles is an FP7 project whose goal is to design a tool-chain for designing a 3D SoC and to prototype it on an FPGA development platform. This presentation covers the "why, how, when and where" of the project, which will complete in 2015.
"Algorithmic processing performed in High Performance Computing environments impacts the lives of billions of people, and planning for exascale computing presents significant power challenges to the industry. ARM delivers the enabling technology behind HPC. The 64-bit design of the ARMv8-A architecture combined with Advanced SIMD vectorization are ideal to enable large scientific computing calculations to be executed efficiently on ARM HPC machines. In addition ARM and its partners are working to ensure that all the software tools and libraries, needed by both users and systems administrators, are provided in readily available, optimized packages."
Learn more: https://developer.arm.com/hpc
and
http://hpcuserforum.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck, Jean-Pierre Panziera from Atos presents: BXI - Bull eXascale Interconnect.
"Exascale entails an explosion of performance, of the number of nodes/cores, of data volume and data movement. At such a scale, optimizing the network that is the backbone of the system becomes a major contributor to global performance. The interconnect is going to be a key enabling technology for exascale systems. This is why one of the cornerstones of Bull’s exascale program is the development of our own new-generation interconnect. The Bull eXascale Interconnect or BXI introduces a paradigm shift in terms of performance, scalability, efficiency, reliability and quality of service for extreme workloads."
Watch the video: http://wp.me/p3RLHQ-gJa
Learn more: https://bull.com/bull-exascale-interconnect/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
The NPOESS Integrated Program Office and its principal contractors -- Raytheon and Northrop-Grumman -- are preparing for the launch of the NPP risk-reduction mission in 2006. This presentation will review program status, and how HDF will be used to deliver NPOESS products at the domestic weather centrals and worldwide field terminals.
Slides presented at the FlexTiles Workshop at FPL'2014.
Presentation #5: FlexTiles Simulation Platform
FlexTiles is a heterogeneous many-core platform reconfigurable at run-time developed within an FP7 project.
Slides presented at the FlexTiles Workshop at FPL'2014.
Presentation #4: FlexTiles Virtual Platform
FlexTiles is a heterogeneous many-core platform reconfigurable at run-time developed within an FP7 project.
FPL'2014 - FlexTiles Workshop - 1 - FlexTiles OverviewFlexTiles Team
Slides presented at the FlexTiles Workshop at FPL'2014.
Presentation #1: FlexTiles overview
FlexTiles is a heterogeneous many-core platform reconfigurable at run-time developed within an FP7 project.
Conference on Adaptive Hardware and Systems (AHS'14) - The 3D FlexTiles ConceptFlexTiles Team
The ultimate goal of the FP7 FlexTiles Project is to design tools that enable the design of a System-on-Chip containing CPUs/GPPs, DSPs and FPGA logic. This is not an ordinary SoC: it is a 3D chip, and these slides explain why a 3D concept is required.
Conference on Adaptive Hardware and Systems (AHS'14) - The DSP for FlexTilesFlexTiles Team
The FP7 FlexTiles Project uses DSP accelerators. They are connected to each other and to the general-purpose processors (GPPs) through a Network-on-Chip (NoC). These slides give the details of the DSP accelerator.
NFV and SDN: 4G LTE and 5G Wireless Networks on Intel(r) ArchitectureMichelle Holley
The Presentation will outline the KPIs and key optimizations at the platform, NFVi and Stack level in implementing wireless base station stack and Telco Edge cloud on Intel Architecture. The presentation will use the FlexRAN LTE Reference PHY and NEV SDK for MEC to outline the NFV and 5G use cases like network slicing.
Conference on Adaptive Hardware and Systems (AHS'14) - FlexTiles FPGA EmulationFlexTiles Team
The FP7 FlexTiles Project will provide a tool-chain that allows DSPs, CPUs and an FPGA to be implemented on the FlexTiles Development Platform. These slides give some details about the dynamic reconfiguration of the FPGA by the CPU.
Introduction to SDN and Network Programmability - BRKRST-1014 | 2017/Las VegasBruno Teixeira
Jason Davis, Distinguished Services Engineer, Cisco. Software-Defined Networking (SDN) is an exciting new approach to network IT Service Management. If you are trying to understand what SDN is and want to understand more about Controllers, APIs, Overlays, OpenFlow and ACI, then this introductory session is for you! We will cover the genesis of SDN, what it is, what it is not, and Cisco's involvement in this space. You may also be wondering what products and services are SDN-enabled and how you can solve your unique business challenges by enhancing and differentiating your services by leveraging network programmability. Cisco's SDN-enabled Products and Services will be explained, enabling you to consider your own implementations. Since SDN extends network flexibility and functionality which impacts Network Engineering and Operations teams, we'll also cover the IT Service Management impact. Finally, we'll explore what skills and capabilities are needed to take advantage of SDN and Network Programmability. Network engineers, network operation staff, IT Service Managers, IT personnel managers, and application/compute SMEs will benefit from this session.
Conference on Adaptive Hardware and Systems (AHS'14) - Why FlexTiles uses OVP...FlexTiles Team
The FlexTiles concept integrates DSPs, GPPs/CPUs and an embedded FPGA. The OVP tools (http://www.ovpworld.org/) make it easier to simulate, and these slides explain how.
Adaptive Hardware and Systems (AHS'14) - FlexTiles OVP DemoFlexTiles Team
The OVP tools (http://www.ovpworld.org/) are used by the FlexTiles Team to simulate the multicore implementation of the GPP. In this example, two MicroBlaze CPUs run in parallel.
1. www.flextiles.eu
FlexTiles
Runtime Mapping of Hardware Accelerators on the Embedded FPGA Layer
FPL’14, FlexTiles Workshop September 1st 2014
Olivier SENTIEYS★, Christophe HURIAUX, Antoine COURTAY University of Rennes 1
★ Inria
2. 2 /
The information contained in this document and any attachments are the property of FlexTiles consortium. You are hereby notified that any review, dissemination, distribution, copying or otherwise use of this document must be done in accordance with the CA of the project (TRT/DJ/624412785.2011). Template version 1.0
University of Rennes 1 – FPL’14 FlexTiles Workshop
32
The Multicore Era is Hitting the Utilization Wall
The multicore era has been a reality since 2005-2008, but what’s next?
Energy efficiency is not scaling along with integration capacity
Transistor and power budgets no longer balanced
Parameter            Classical scaling   Leakage-limited scaling
Device count         S²                  S²
Device frequency     S                   S
Device power (cap)   1/S                 1/S
Device power (Vdd)   1/S²                ~1
Utilization          1                   1/S²
Power of Core i: $P_i = a_i \, f_i \, C_i \, V_{dd,i}^2$
[Venkatesh et al., ASPLOS’10]
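As a rough check of the two scaling regimes listed above, the following Python sketch multiplies the per-generation factors and derives the fraction of the chip that can run at full frequency. The function name and the fixed power budget are assumptions made for illustration; they are not taken from the slides.

```python
def utilization(s: float, leakage_limited: bool) -> float:
    """Fraction of the chip that can switch at full frequency after one
    scaling step by factor s, assuming a constant chip power budget."""
    device_count = s ** 2            # S^2 more devices per chip
    frequency = s                    # each device switches S times faster
    capacitance = 1 / s              # "Device power (cap)" entry
    # "Device power (Vdd)" entry: 1/S^2 classically, ~1 when leakage limited
    vdd_squared = 1.0 if leakage_limited else 1 / s ** 2
    full_chip_power = device_count * frequency * capacitance * vdd_squared
    return min(1.0, 1 / full_chip_power)


if __name__ == "__main__":
    for s in (1.4, 2.0):
        print(s, utilization(s, False), utilization(s, True))
```

With S = 2 the classical regime keeps the whole chip usable, while the leakage-limited regime leaves a quarter of it, which matches the 1 and 1/S² utilization entries.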
3. 3 /
The Utilization Wall
With each successive process generation, the percentage of a chip that can switch at full frequency drops exponentially due to power constraints
8nm in 2018: best-case average 3.7x speedup, 14% per year (highly parallel codes and optimal per-benchmark)
[Esmaeilzadeh et al., ISCA’11]
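The 3.7x and 14% figures can be related with a short compound-growth calculation. The sketch below assumes the 3.7x speedup accumulates over ten years; the slide gives only the end year (2018), so the ten-year span is an assumption.

```python
def annual_growth(total_speedup: float, years: float) -> float:
    """Constant yearly growth rate that compounds to total_speedup."""
    return total_speedup ** (1 / years) - 1


# Ten years is an assumed span, not a value stated on the slide.
rate = annual_growth(3.7, 10)
print(f"{rate:.1%}")
```

Under that assumption the rate comes out at roughly 14% per year.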
4. 4 /
Multicore and Dark Silicon
[Figure: speedup (axis 0 to 20) across technology nodes 45nm, 32nm, 22nm, 16nm, 11nm and 8nm; Historical Scaling reaches 18x, ITRS Scaling 7.9x, Realistic Scaling 3.7x]
[Doug Burger, HiPEAC’13]
[Figure: dark silicon percentages 47%, 36%, 71%, 51%, 62%, 40%, 17%, 1%; timeline labels 2014, >2016, >2018]
5. 5 /
The Efficiency of Specialization
* Source: Ning Zhang and Bob Brodersen, ISSCC data
100-1000X gap in efficiency, but specialization comes with penalties in programmability
[Figure labels: ASICs, FPGAs]
6. 6 /
Heterogeneous Multicores
Different cores on a single chip
GPPs, HW accelerators, memory, network-on-chip
Reconfigurable HW accelerators keep flexibility while increasing area and energy efficiency
Self-adapting devices
Dynamically adapt the hardware to the application and to changing environments
[Figure: array of cores combined with Proc., Reconf. HW, Mem. and HW Acc. blocks]
7. 7 /
Can 3D Stacking Help?
3D-Stacked Reconfigurable Accelerators
Improved bandwidth/latency between cores and accelerators
Improved resource usage
Improved performance and energy efficiency
[Figure: a reconfigurable layer stacked on a multicore layer made of cores]
8. 8 /
Outline
eFPGA Reconfigurable Fabric
General architecture overview
Expected features
Task migration in FPGA vs. task migration in eFPGA
Virtual Bit-Stream
Coping with Heterogeneous Blocks
Development Flow
Achievements & Conclusion
9. 9 /
FlexTiles Architecture Overview
[Figure labels: 3D interface to the NoC, DSP blocks, Memory blocks]
10. 10 /
Expected Features of the Reconfigurable Layer
Main expected features
Low reconfiguration time (and power) overhead
Double-context configuration memory
Low complexity reconfiguration control
Easier resource sharing/distribution, simplified task migration
No predefined configuration domains
Bit-stream independent from task location
Smaller bit-stream size in configuration memory
Virtual Bit-Stream (VBS)
11. 11 /
Task Allocation & Migration in an FPGA
Predefined reconfigurable regions
Bit-stream depends on task location
[Figure: FPGA fabric with an I/O ring; HW Accelerator #1 shown at two locations, with bit-streams BS #1 and BS #2]
12. 12 /
Task Migration in eFPGA
[Figure: eFPGA fabric with 3D NI and RAM blocks; HW Accelerator #1 (BS #1) and HW Accelerator #2 (BS #2)]
13. 13 /
Outline
eFPGA Reconfigurable Fabric
Virtual Bit-Stream
Concept
Abstraction of routing details
Results
Coping with Heterogeneous Fabric
Development Flow
Achievements & Conclusion
14. 14 /
Concept of Virtual Bit-Stream
A task is synthesized and placed & routed into a Virtual Bit-Stream (VBS)
Hide some routing details which are architecture dependent
Remove details coming from the task's physical location in the fabric
No predefined configuration domains
Final Bit-Stream is generated at run time
Resource sharing/distribution becomes easier, task migration is simplified
[Figure label: Quartus II]
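The idea of a location-independent bit-stream that is finalized at run time can be sketched as follows. This is a minimal illustration only: the `VirtualCell` record, its fields and the `finalize` function are hypothetical names, and the real VBS format is not described on the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualCell:
    dx: int        # column offset relative to the task origin (hypothetical field)
    dy: int        # row offset relative to the task origin (hypothetical field)
    config: int    # opaque logic configuration word


def finalize(vbs: list[VirtualCell], origin: tuple[int, int]) -> dict[tuple[int, int], int]:
    """Map a location-independent VBS onto absolute fabric coordinates."""
    ox, oy = origin
    return {(ox + cell.dx, oy + cell.dy): cell.config for cell in vbs}


task = [VirtualCell(0, 0, 0xA5), VirtualCell(1, 0, 0x3C)]
first = finalize(task, (2, 4))
moved = finalize(task, (7, 1))   # same VBS, different location in the fabric
```

Because the stored description contains only relative offsets, migrating the task amounts to calling `finalize` again with a new origin.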
15. 15 /
Interconnection Architecture
Hiding routing details
Full BS is 129 bits
Could be reduced by giving fewer details
[Figure: interconnection architecture with terminals numbered 0 to 20 and CLB pins CLBIN[0] to CLBIN[3] and CLBOUT]
16. 16 /
Virtual Bit Stream
Hiding routing details
List of I/O and connections
Connection pairs: (20, 8), (1, 9), (5, 18)
[Figure: interconnection architecture with terminals numbered 0 to 20]
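To see why a list of connections can be smaller than the full 129-bit configuration, the sketch below counts the bits needed to store the three pairs from the slide. The fixed-width encoding, the absence of any header, and reading each pair as two connected terminals are assumptions; the actual VBS coding is not given here.

```python
CONNECTIONS = [(20, 8), (1, 9), (5, 18)]   # pairs listed on the slide
NUM_TERMINALS = 21                          # terminals are numbered 0..20
FULL_BS_BITS = 129                          # "Full BS is 129 bits"


def connection_list_bits(connections, num_terminals):
    """Bits for a plain fixed-width list of (source, destination) pairs."""
    bits_per_terminal = (num_terminals - 1).bit_length()   # 5 bits for 0..20
    return len(connections) * 2 * bits_per_terminal


vbs_bits = connection_list_bits(CONNECTIONS, NUM_TERMINALS)
print(vbs_bits, "bits versus", FULL_BS_BITS, "bits")
```

Under these assumptions the three connections take 30 bits, and the cost grows with the number of connections actually used rather than with the number of configurable switches.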
17. 17 /
Results
VBS is independent of task location with a smaller size than BS
[Figure: BS size and VBS size in kilobits (axis 0 to 1600) and compression ratio (axis 0.0% to 100.0%) for benchmarks tseng, tseng, diffeq, diffeq, apex4, des, ex5p, misex3; compression ratio labels 44.4%, 49.2%, 47.2%, 55.2%, 49.7%, 29.5%, 27.4%, 26.6%]
3-4 times smaller for large bit-streams
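The compression ratios and the "3-4 times smaller" claim can be connected with a one-line conversion. The sketch assumes the ratio is defined as VBS size divided by BS size, which the slide does not state explicitly.

```python
# Compression ratio labels from the chart, in the order they appear.
RATIOS = [0.444, 0.492, 0.472, 0.552, 0.497, 0.295, 0.274, 0.266]


def reduction_factor(ratio: float) -> float:
    """How many times smaller the VBS is, assuming ratio = VBS size / BS size."""
    return 1 / ratio


factors = [round(reduction_factor(r), 2) for r in RATIOS]
print(factors)
```

With that definition, the three lowest ratios correspond to reduction factors between 3 and 4, and the others to roughly a factor of 2.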
18. 18 /
eFPGA Architecture using VBS
Reconfiguration controller
Upon GPP request: can place, duplicate and migrate tasks
Finalizes VBS
[Figure: external memory holding VBS 1, VBS 2, VBS 3, ..., VBS N; buffer memory; reconfiguration controller with data and control paths; steps 1 and 2]
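The three operations attributed to the reconfiguration controller can be modelled with a small bookkeeping class. This is a toy sketch: the class, its method names and the dictionary standing in for external memory are all hypothetical, and VBS finalization itself is left as a comment.

```python
class ReconfigurationController:
    """Toy model: tracks which task instance sits at which fabric origin."""

    def __init__(self, vbs_store: dict[str, bytes]):
        self.vbs_store = vbs_store   # stands in for the external VBS memory
        self.instances: dict[int, tuple[str, tuple[int, int]]] = {}
        self._next_id = 0

    def place(self, task: str, origin: tuple[int, int]) -> int:
        if task not in self.vbs_store:
            raise KeyError(f"no VBS stored for task {task!r}")
        # A real controller would finalize the VBS for `origin` here.
        instance_id = self._next_id
        self._next_id += 1
        self.instances[instance_id] = (task, origin)
        return instance_id

    def duplicate(self, instance_id: int, origin: tuple[int, int]) -> int:
        task, _ = self.instances[instance_id]
        return self.place(task, origin)

    def migrate(self, instance_id: int, origin: tuple[int, int]) -> None:
        task, _ = self.instances[instance_id]
        self.instances[instance_id] = (task, origin)


ctrl = ReconfigurationController({"fir": b"\x01"})
a = ctrl.place("fir", (0, 0))
b = ctrl.duplicate(a, (4, 0))
ctrl.migrate(a, (0, 8))
```

All three operations reuse the same stored VBS, which is the point of keeping the bit-stream independent of task location.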
19. 19 /
Outline
eFPGA Reconfigurable Fabric
Virtual Bit-Stream
Coping with Heterogeneous Fabric
Heterogeneous Blocks
Task placement in a Homogeneous context
Task placement in a Heterogeneous context
Development Flow
Achievements & Conclusion
20. 20 /
Heterogeneous Blocks
Logic Elements
Cluster of four 6-input LUTs
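3309 µm²
Arithmetic Elements
18x18 multiplier, 48-bit adder/subtractor
4351 µm²
[Figure: logic element with four LUTs between CLBIN and CLBOUT; arithmetic element with 18-bit inputs A and B, a 36-bit product and a 48-bit adder/subtractor]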
21. 21 /
Heterogeneous Blocks
Memories
1024 x 16-bit word SRAM
6570 µm²
3D TSV and Accelerator Interface
[Figure: reconfiguration controller, reconfiguration RAM, 3D TSV blocks and 3DNI blocks]
NoC Link (400 I/O)
Pitch   X    Y    size X   size Y   Area (mm²)
40      20   20   800      800      0.64
26.95 mm²
Work In Progress
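The NoC link figures are mutually consistent if the pitch and sizes are read in micrometres, which is an assumption since the slide gives a unit only for the area. A short check:

```python
PITCH_UM = 40            # pitch from the table; micrometres is an assumption
COUNT_X, COUNT_Y = 20, 20

io_count = COUNT_X * COUNT_Y                          # number of I/Os in the array
size_x_um = COUNT_X * PITCH_UM                        # footprint width
size_y_um = COUNT_Y * PITCH_UM                        # footprint height
area_mm2 = (size_x_um / 1000) * (size_y_um / 1000)    # convert um to mm first

print(io_count, size_x_um, size_y_um, round(area_mm2, 2))
```

A 20 by 20 array at this pitch gives the 400 I/Os, the 800 by 800 footprint and the 0.64 mm² area listed in the table.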
22. 22 /
eFPGA Floorplan (heterogeneous)
[Figure legend: Logic Block, Arithmetic Accelerator, Memories, Accelerator Interface]
23. 23 /
Task Placement & Migration
Homogeneous case
No constraint on task placement
Regular routing architecture
Easy! (thanks to the Virtual Bit-Stream)
Cope with heterogeneity
RAM, DSP, 3D I/Os
Migration is limited
vertically to the same column
to the next column containing the same complex blocks
[Figure legend: Task, Configured LE, Logic Element (LE)]
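The horizontal migration constraint can be illustrated with a column-type match. In this sketch the fabric is modelled as a list of column types and a task as a contiguous span of columns; the column layout, the names and the function are hypothetical, and vertical constraints are not modelled.

```python
def migration_targets(columns: list[str], start: int, width: int) -> list[int]:
    """Start columns where the fabric offers the same block pattern as the
    one currently under the task (the current position is excluded)."""
    pattern = columns[start:start + width]
    return [
        x for x in range(len(columns) - width + 1)
        if x != start and columns[x:x + width] == pattern
    ]


# Hypothetical column layout: logic elements with RAM and DSP columns.
FABRIC = ["LE", "RAM", "LE", "LE", "DSP", "LE", "RAM", "LE", "LE", "DSP"]

ram_task_targets = migration_targets(FABRIC, start=0, width=3)
logic_task_targets = migration_targets(FABRIC, start=2, width=2)
print(ram_task_targets, logic_task_targets)
```

A task that uses a RAM column can only move to a position where a RAM column sits at the same relative offset, which is why heterogeneity restricts migration compared with the homogeneous case.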
24. 24 /
eFPGA: Handling of Complex Blocks
Heterogeneous blocks routing is abstracted from logic routing
Long lines allow a trade-off between placement flexibility and routing complexity
A two-level routing is performed at runtime:
Logic routing (as in the homogeneous case)
Heterogeneous block routing through long lines
25. 25 /
eFPGA: Handling of Complex Blocks
Delay depends on final placement
Only worst-case delay can be estimated offline
Flexibility is still limited in the vertical axis
Multiple of block height
Length of long lines and of long-line to routing-resource connections should be limited
Area overhead, but slight delay penalty
(see our paper at FPL’14 on Wednesday)
26. 26 /
Outline
eFPGA Reconfigurable Fabric
Virtual Bit-Stream
Coping with Heterogeneous Fabric
Development Flow
Achievements & Conclusion
27. 27 /
Development Flow
Custom development flow from C to Virtual Bit-Stream
[Figure: flow stages High-level Synthesis, HDL Synthesis, Technology mapping, Placer, Router and Bitstream generation; data items High-level task description, RTL task description, HDL task description, Flat logic netlist, Mapped logic netlist, Placement data, Routing data, Arch. netlist, Arch. description and Virtual bit-stream]
Integrated within the FlexTiles development flow
Generates VBS from a C description or an HDL description
28. 28 /
Development Flow
Custom development flow from C to Virtual Bit-Stream
Relies on Catapult C from Calypto Design Systems
High-level synthesis from C to VHDL
29. 29 /
Development Flow
Custom development flow from C to Virtual Bit-Stream
Use the Verilog To Routing (VTR) academic tool flow to generate netlist and routing data from Verilog
[Figure: VTR portion of the flow, with stages HDL Synthesis, Technology mapping, Placer and Router, and data items RTL task description, HDL task description, Flat logic netlist, Mapped logic netlist, Placement data, Routing data, Arch. netlist and Arch. description]
30. 30 /
Development Flow
Custom development flow from C to Virtual Bit-Stream
A custom back-end generates the VBS from the data generated by VTR
The VBS can be loaded on the FlexTiles platform
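The development flow described over the last slides can be summarized as a chain of stages and tools. The table below is a simplified sketch: grouping the flow into three stages and the artifact names are assumptions made for illustration, while the tool names are those given on the slides.

```python
FLOW = [
    # (stage, tool named on the slides, input artifact, output artifact)
    ("High-level synthesis", "Catapult C",
     "C task description", "HDL task description"),
    ("Synthesis, mapping, place and route", "VTR",
     "HDL task description", "Placement and routing data"),
    ("Bit-stream generation", "Custom back-end",
     "Placement and routing data", "Virtual bit-stream"),
]


def check_chain(flow) -> bool:
    """Verify that each stage consumes what the previous stage produced."""
    return all(prev[3] == cur[2] for prev, cur in zip(flow, flow[1:]))


for stage, tool, source, result in FLOW:
    print(f"{stage} ({tool}): {source} -> {result}")
```

Starting from an HDL description instead of C simply means entering the chain at the second stage.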
31. 31 /
Conclusions
Overall results and achievements
3-D stacked embedded FPGA coupled to a processor layer
Flexible resource allocation/sharing
Seamless task migration
Virtual Bit-Stream
VBS also reduces bitstream size
eFPGA Chip “Proof of Concept”
65nm CMOS
Homogeneous Fabric of LBs
I/O Ring (not 3D…)
External Reconfiguration Controller
32.
Results
Thank you for your attention
33.
Where do the energy savings come from?
MIPS baseline: 91 pJ/instr.
I-cache 23%
Fetch/Decode 19%
Reg. File 14%
Datapath 38%
D-cache 6%
Specialized core: 8 pJ/instr.
D-cache 6%
Datapath 3%
Energy Saved 91%
[Goulding et al., Hot Chips’10]
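The two figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below assumes that the specialized-core percentages are expressed as shares of the baseline energy, which is how the 91% "Energy Saved" slice reads; the variable and function names are chosen for this example.

```python
# Figures from the slide: MIPS baseline energy and its breakdown (percent).
BASELINE_PJ = 91.0
baseline_share = {
    "I-cache": 23,
    "Fetch/Decode": 19,
    "Reg. File": 14,
    "Datapath": 38,
    "D-cache": 6,
}
# Specialized core, assumed to be expressed as shares of the baseline energy.
specialized_share = {"D-cache": 6, "Datapath": 3}


def remaining_energy_pj(baseline_pj, shares):
    """Energy per instruction when only the given baseline shares remain."""
    return baseline_pj * sum(shares.values()) / 100.0


specialized_pj = remaining_energy_pj(BASELINE_PJ, specialized_share)
saved_percent = 100 - sum(specialized_share.values())
print(f"specialized core: {specialized_pj:.2f} pJ/instr., saved: {saved_percent}%")
# prints "specialized core: 8.19 pJ/instr., saved: 91%"
```

The result of about 8.2 pJ matches the 8 pJ/instr. quoted for the specialized core.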
34.
Energy per operation: 45nm CMOS, 40nm V6 FPGA
HW operators (45nm)
32-bit addition: 0.5 pJ
16-bit multiply: 2.2 pJ
64-bit FPU: 50 pJ/op
40nm V6 FPGA
16/32-bit multiply and add: 114 pJ (DSP blocks), 170 pJ (LUT)
32-bit I/O access: 1.47 nJ
32-bit memory read: 660 pJ
32-bit register R/W: 1.12 pJ
Embedded RISC Processor (45nm)
32-bit register R/W: 0.33 pJ
32-bit cache R/W: 3.5 pJ
add instruction⋆⋆: 5.32 pJ
⋆⋆ add instruction (best case) = fetch, decode, read 2 operands from RF, execute, write back (into local reg. first, then copy into RF)
[Dally et al., Computer, 2010]
[Bonamy et al., 2013]
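A short sketch makes the gaps between these figures explicit. The values are copied from the slide; the dictionary keys and the `ratio` helper are names chosen for this example, and comparisons across the 45nm and 40nm FPGA groups are indicative only since the technologies differ.

```python
# Energy per operation in picojoules, as listed on the slide.
ENERGY_PJ = {
    "hw_add32": 0.5,            # HW operator, 45nm
    "risc_add_instr": 5.32,     # embedded RISC processor, 45nm (best case)
    "fpga_mac_dsp": 114.0,      # 40nm V6 FPGA, DSP blocks
    "fpga_mac_lut": 170.0,      # 40nm V6 FPGA, LUTs
    "fpga_mem_read32": 660.0,   # 40nm V6 FPGA
    "fpga_io32": 1470.0,        # 40nm V6 FPGA (1.47 nJ)
}


def ratio(a, b):
    """How many times more energy operation a costs than operation b."""
    return ENERGY_PJ[a] / ENERGY_PJ[b]


print(f"RISC add instruction vs HW 32-bit add: {ratio('risc_add_instr', 'hw_add32'):.2f}x")
print(f"FPGA MAC in LUTs vs DSP blocks: {ratio('fpga_mac_lut', 'fpga_mac_dsp'):.2f}x")
print(f"FPGA I/O access vs memory read: {ratio('fpga_io32', 'fpga_mem_read32'):.2f}x")
```

Under these numbers, executing an add as a processor instruction costs roughly ten times the energy of the bare 32-bit adder.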
35.
The Energy Cost of Data Movement
Fetching operands costs more than computing
Energy cost of cache coherence is huge!
[Figure: energy costs in 28nm CMOS]
64-bit DP: 20 pJ
256-bit access, 8 kB SRAM: 50 pJ
256-bit buses: 26 pJ, 256 pJ, 1 nJ
Efficient off-chip link: 500 pJ
DRAM Rd/Wr: 16 nJ
[Dally, IPDPS’11]
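To illustrate the claim that fetching operands costs more than computing, the sketch below expresses each data-movement cost as a multiple of the 64-bit DP figure. It assumes "64-bit DP" denotes one double-precision operation; the three bus figures are left out because the slide text does not say which distance each one corresponds to. All names are chosen for this example.

```python
# Energy figures from the slide (28nm CMOS), in picojoules.
DP_64BIT_PJ = 20.0  # assumed to be one 64-bit double-precision operation

MOVEMENT_PJ = {
    "256-bit access, 8 kB SRAM": 50.0,
    "efficient off-chip link": 500.0,
    "DRAM Rd/Wr": 16_000.0,  # 16 nJ
}


def dp_ops_equivalent(cost_pj):
    """Number of 64-bit DP operations that cost as much as one transfer."""
    return cost_pj / DP_64BIT_PJ


for name, cost in MOVEMENT_PJ.items():
    print(f"{name}: {dp_ops_equivalent(cost):g}x a 64-bit DP operation")
```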
36.
Efficient Hardware Task Swapping
Hiding reconfiguration time with computing
Single-context memory
Double-context memory
eFPGA will use double-context memory
Gain in dynamic reconfiguration efficiency
At the cost of ~50% overhead
[Figure: timelines of Cfg. 1, Task 1, Cfg. 2 and Task 2 over time, for single-context and double-context memory]
[Figure: configuration memory cell, labels: CB, FF, Latch, ConfClk, ConfEn]
CB: one configuration bit
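The benefit of hiding reconfiguration behind computation can be estimated with a simple timing model. This is a sketch under stated assumptions, not the FlexTiles scheduler: tasks run one after another, the context switch itself is instantaneous, and with double-context memory the next configuration starts loading as soon as the current task starts. The task durations in the example are hypothetical.

```python
def single_context_time(tasks):
    """tasks: list of (cfg_time, exec_time); loading and running alternate."""
    return sum(cfg + run for cfg, run in tasks)


def double_context_time(tasks):
    """The next configuration loads into the inactive context while the
    current task runs, so only the longer of the two is paid."""
    if not tasks:
        return 0
    total = tasks[0][0]  # the first configuration cannot be hidden
    for i, (_, run) in enumerate(tasks):
        next_cfg = tasks[i + 1][0] if i + 1 < len(tasks) else 0
        total += max(run, next_cfg)
    return total


# Hypothetical durations in arbitrary time units: (Cfg. i, Task i).
tasks = [(2, 5), (3, 4)]
print(single_context_time(tasks))  # 14
print(double_context_time(tasks))  # 11
```

In this example Cfg. 2 is fully hidden behind Task 1. When a configuration takes longer than the task it overlaps with, only the excess shows up in the total.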
37.
eFPGA(V1) Architecture
[Figure: Logic Block, labels: LUT, FF, mux, CB; ports: CLBIN, CLBOUT, clk, rstb, ScanIn, ScanOut]
[Figure: Switch Block, labels: CB (x4); ports: NORTH(i), SOUTH(i), EAST(i), WEST(i), ScanIn, ScanOut]
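A behavioural model helps show how the logic block elements fit together. The sketch below is a generic LUT, FF and mux cell, not the actual eFPGA netlist: it assumes a 4-input LUT (suggested by the CLBIN[0] to CLBIN[3] pins), 16 configuration bits for the truth table, and one extra configuration bit driving a mux that selects between the LUT output and the FF output. The class and attribute names are hypothetical.

```python
class LogicBlock:
    """Generic LUT + FF + mux cell (assumed structure, for illustration)."""

    def __init__(self, lut_bits, use_ff):
        if len(lut_bits) != 16:
            raise ValueError("a 4-input LUT needs 16 configuration bits")
        self.lut_bits = list(lut_bits)  # truth table held in configuration bits
        self.use_ff = use_ff            # configuration bit driving the mux
        self.ff = 0                     # flip-flop state, cleared at reset

    def lut(self, clbin):
        """Look up the truth table; clbin[0] is the least significant bit."""
        index = sum(bit << k for k, bit in enumerate(clbin))
        return self.lut_bits[index]

    def clock(self, clbin):
        """Rising clock edge: the FF captures the LUT output."""
        self.ff = self.lut(clbin)

    def clbout(self, clbin):
        """Mux: registered output if use_ff is set, else combinational."""
        return self.ff if self.use_ff else self.lut(clbin)


# Configure the LUT as a 4-input AND: only index 15 (all inputs high) is 1.
and4 = [0] * 15 + [1]
comb = LogicBlock(and4, use_ff=False)
print(comb.clbout([1, 1, 1, 1]))  # 1
print(comb.clbout([1, 0, 1, 1]))  # 0
```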
38.
eFPGA Architecture
Interconnection Block
[Figure: Interconnection Block showing CLBIN[0] to CLBIN[3] and CLBOUT, with tracks 0 to 3 on the NORTH, SOUTH, WEST and EAST sides]
39.
eFPGA Architecture
eFPGA macro
[Figure: eFPGA macro showing CLB(i,j), SB(i,j), CHANX(i,j) and CHANY(i,j) with neighbouring CLB(i+1,j), CLB(i,j+1), SB(i-1,j), SB(i,j-1), CHANX(i+1,j) and CHANY(i,j+1); CLB pins CLBIN[0] to CLBIN[3] and CLBOUT]
40.
eFPGA Floorplan