Excellent presentation by Hal Finkel (hfinkel@anl.gov), Kazutomo Yoshii, and Franck Cappello on why FPGAs are a competitive accelerator technology for HPC.
This presentation introduces coarse-grained FPGAs like the Xilinx 7 series and Altera 10/V series. It compares the features of different FPGAs, including logic elements, slices/ALMs, registers, memory blocks, and DSPs. The internal structures of slices and DSP blocks for some FPGAs are shown. In conclusion, the presentation provides an overview and comparison of coarse-grained FPGA capabilities and architectures.
PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F... — Shinya Takamaeda-Y
This document describes PyCoRAM, a Python-based implementation of the CoRAM memory architecture for FPGA-based computing. PyCoRAM provides a high-level abstraction for memory management that decouples computing logic from memory access behaviors. It allows defining memory access patterns using Python control threads. PyCoRAM generates an IP core that integrates with standard IP cores on Xilinx FPGAs using the AMBA AXI4 interconnect. It supports parameterized RTL design and achieves high memory bandwidth utilization of over 84% on two FPGA boards in evaluations of an array summation application.
The document describes the design and implementation of a 16-bit greatest common divisor (GCD) circuit on a Xilinx Spartan6 FPGA. It presents the GCD circuit at different behavioral and structural levels of abstraction, from a high-level behavioral model to an optimized structural model. Implementation results for each level show tradeoffs in resources used and speed obtained. The optimized structural model achieved the fastest speed of 23.6 delay units using the fewest resources of 33 slices, 110 LUTs, and 16 MUXCYs.
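The summary above does not spell out the circuit's algorithm, so here is a short Python sketch of the classic subtraction-based GCD datapath that 16-bit hardware implementations like this typically use; the function name and structure are our illustration, not code from the slides:

```python
def gcd16(a, b):
    """Model a subtraction-based 16-bit GCD datapath.

    Each loop iteration corresponds to one 'clock cycle' of the circuit:
    compare the two registers and subtract the smaller from the larger
    until they converge. Inputs are masked to 16 bits like the registers
    in the hardware design.
    """
    a &= 0xFFFF
    b &= 0xFFFF
    # Guard the degenerate cases the subtraction loop cannot handle.
    if a == 0:
        return b
    if b == 0:
        return a
    while a != b:
        if a > b:
            a -= b
        else:
            b -= a
    return a
```

A behavioral model like this is the usual starting point before refining toward the structural, slice-level implementation the document evaluates.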
Small introduction to FPGA acceleration and the impact of the new High-Level Synthesis toolchains on their programmability.
Video here: https://www.linkedin.com/posts/marcobarbone_can-my-application-benefit-from-fpga-acceleration-activity-6848674747375460352-0fua
This document provides an overview of reconfigurable computing and field programmable gate arrays (FPGAs). It discusses the history and flexibility advantages of FPGAs compared to application-specific integrated circuits (ASICs) and general purpose processors (GPPs). The document outlines FPGA architecture including logic blocks, interconnect networks, memory and digital signal processing blocks. It also covers FPGA programming technologies, data flow graphs, and considerations for implementing algorithms on FPGAs, which requires a codesign approach.
This document proposes a platform for networked FPGA systems using an ORB (Object Request Broker) engine with a Java-to-HDL synthesizer called JavaRock. This allows software engineers to easily use FPGAs. As a demonstration, a remotely controlled FPGA system for an inverted pendulum was designed. Measurements found the ORB engine latency to be below 200 microseconds with a hardware resource overhead of 372 slices on a Spartan6 FPGA. Future work aims to improve performance and build an ecosystem to support networked and distributed FPGA system design using the ORB architecture.
This document discusses the challenges of building and optimizing open RAN systems for 5G networks. It describes Picocom's 5G baseband system-on-chip architecture using multiple RISC-V clusters and hardware accelerators. Maintaining performance and detecting problems is difficult due to the complex timing requirements across hundreds of users. Mentor's embedded analytics solution monitors the system non-intrusively using on-chip sensors to detect issues like timing overruns and help optimize performance both during development and over the lifetime of deployments.
FPGA introduction for absolute beginners
- What is inside FPGA (Altera example)
- What are the major differences between firmware development for MCU and FPGA
- Some very basics of Verilog HDL language (by similarities with C/C++)
- Testbench approach and Icarus simulator demonstration
- Altera Quartus IDE demonstration -- creating project, compilation, and download
- Signal-Tap internal logic analyzer demonstration
(Verilog source code examples attached inside presentation)
The document discusses optimizing deep neural networks (DNNs) for deployment on ultra-low power RISC-V cores. It describes the PULP-NN library which optimizes the computational backend for int8 arithmetic. PULP-NN maximizes data reuse, improves kernel regularity, and exploits parallelism to achieve high utilization. It also introduces DORY, a tool for tiling and code generation that formulates tiling as a constraint programming problem to maximize tile sizes while fitting memory constraints.
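The tiling idea in the summary above can be illustrated with a toy search. Real DORY formulates this as a constraint-programming problem with many more terms (weights, im2col buffers, double-buffering); this sketch, with names of our own invention, keeps only the core objective of maximizing the tile that fits the memory budget:

```python
def best_tile(layer_h, layer_w, channels, mem_budget_bytes, bytes_per_elem=1):
    """Pick the largest (tile_h, tile_w) whose activation footprint fits
    in the on-chip memory budget — a brute-force stand-in for the
    constraint solver DORY actually uses."""
    best = None
    for th in range(1, layer_h + 1):
        for tw in range(1, layer_w + 1):
            footprint = th * tw * channels * bytes_per_elem
            if footprint <= mem_budget_bytes:
                # Keep the tile with the largest area seen so far.
                if best is None or th * tw > best[0] * best[1]:
                    best = (th, tw)
    return best

# Example: a 32x32x16 int8 feature map with a 4 KiB scratchpad.
tile = best_tile(32, 32, 16, 4096)
```

The point of maximizing tile size is the same as in PULP-NN's data-reuse strategy: larger tiles amortize the cost of each DMA transfer over more int8 multiply-accumulates.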
This document discusses configurable logic devices (CLDs) and field programmable gate arrays (FPGAs). It describes the structure of a CPLD as consisting of logic blocks with macrocells and programmable interconnects. An FPGA consists of an array of configurable logic blocks surrounded by programmable I/O blocks and connected with programmable interconnects. The document provides details on the architecture and components of Xilinx XC9500 CPLDs and the configuration of logic blocks, look-up tables, flip-flops, and programmable interconnects in FPGAs.
The document discusses ASIC prototyping using FPGAs at Ericsson. It provides an overview of the ASIC prototyping team and their history of using FPGAs to prototype ASIC designs. It describes the FPGA platforms used from 2001-present, the design flow and differences between prototyping with FPGAs versus ASICs. Timescales and resources needed for FPGA firmware development are also outlined.
This document discusses various design options for digital systems including ASICs, FPGAs, and PLDs. It provides details on full-custom and cell-based ASIC design, gate array design, FPGA architecture, and different types of PLDs including ROM, PAL, and PLA. Examples are given to compare implementation of logic functions using these different PLD types. The document also discusses hierarchical system design at different levels from system to circuit.
TRACK F: OpenCL for ALTERA FPGAs, Accelerating performance and design product... — chiportal
The document discusses OpenCL for accelerating FPGA designs. It provides an overview of technology trends favoring parallelism and programmability. OpenCL is presented as a solution to bring FPGA design closer to software development by providing a standard programming model and faster compilation. The document describes how OpenCL maps to FPGAs by compiling kernels to hardware pipelines and discusses examples accelerated using OpenCL on FPGAs, including AES encryption, option pricing, document filtering, and video compression.
The document discusses the structure and components of field programmable gate arrays (FPGAs). FPGAs consist of programmable logic blocks, interconnects, and input/output blocks. The logic blocks contain lookup tables and flip flops that can be programmed to implement desired logic functions. The interconnects include vertical and horizontal routing channels and switch boxes that allow the logic blocks to be connected as needed. The input/output blocks provide interfaces between the FPGA and external devices.
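The role of the lookup tables mentioned above is easy to model in software: a k-input LUT is nothing more than a 2^k-entry truth table filled in by the bitstream, with the logic inputs selecting one entry. The following sketch (our illustration, not from the document) models a 4-input LUT configured to implement f(a, b, c, d) = (a AND b) XOR (c OR d):

```python
def make_lut(func, k=4):
    """Precompute the 2^k truth-table entries for a boolean function,
    as FPGA configuration would store them in the LUT's SRAM cells."""
    table = []
    for idx in range(2 ** k):
        bits = [(idx >> i) & 1 for i in range(k)]
        table.append(func(*bits))
    return table

def lut_eval(table, *inputs):
    """Evaluate the LUT: the input bits form the index into the table."""
    idx = sum(bit << i for i, bit in enumerate(inputs))
    return table[idx]

f = lambda a, b, c, d: (a & b) ^ (c | d)
lut = make_lut(f)
```

Any 4-input boolean function fits the same 16-entry table, which is why a grid of identical LUTs plus routing can implement arbitrary logic.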
The document discusses factors to consider when selecting an FPGA device for a project, including technical requirements, vendor options, specific device resources, costs, and compatibility with tools and IPs. The key steps are to specify requirements, research compatible devices from vendors, isolate top candidates, compare based on resources, costs, and support, then select the best device.
The document discusses the architecture and programming of CPLDs and FPGAs. CPLDs and FPGAs are types of programmable logic devices (PLDs) that can implement complex digital logic functions. CPLDs contain logic blocks that can be programmed, while FPGAs contain an array of configurable logic blocks and interconnects. The document describes the components and programming of PLDs like PLA and PAL, as well as the logic cells and interconnects that make up CPLDs and FPGAs.
This document provides an overview of FPGA computing using an Intel/Altera Arria 10 FPGA. It begins with the history of the FPGA computing project and defines what an FPGA is. It then discusses the Intel Arria 10 FPGA used in tests. To measure performance on new architectures, it introduces the concept of the "Seven Dwarfs" benchmark algorithms and arithmetic intensity. The Roofline performance model is explained as a way to estimate performance based on peak flop rate and memory bandwidth. Actual performance results are shown for algorithms like vector addition, stencil code, and matrix multiplication run on the FPGA compared to CPU. OpenCL is discussed as a programming model for FPGAs.
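The Roofline model referenced above reduces to a single formula: attainable throughput is the minimum of peak compute and the product of memory bandwidth with arithmetic intensity. A minimal sketch, with illustrative numbers that are not taken from the slides:

```python
def roofline(peak_gflops, mem_bw_gbs, arithmetic_intensity):
    """Attainable GFLOP/s under the Roofline model.

    arithmetic_intensity is in FLOPs per byte moved; a kernel is
    memory-bound when bandwidth * intensity falls below peak compute.
    """
    return min(peak_gflops, mem_bw_gbs * arithmetic_intensity)

# Vector addition: 1 FLOP per 12 bytes (two 4-byte loads, one store),
# so even a device with 1000 GFLOP/s peak is memory-bound here.
vec_add_perf = roofline(1000.0, 34.0, 1.0 / 12)
```

This is why low-intensity kernels such as vector addition land on the bandwidth slope of the roofline plot, while matrix multiplication, with far more FLOPs per byte, can approach the compute ceiling.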
This document provides an overview of FPGA technology. It describes that an FPGA is a field programmable gate array that can be reprogrammed after manufacturing. The core components of an FPGA include look-up tables, flip-flops, multiplexors, I/O blocks, programmable interconnects, and SRAM memory cells. FPGAs offer advantages over ASICs like quick time to market and reprogrammability. Major FPGA manufacturers like Xilinx and Altera integrate additional components into their devices like RAM blocks, DSP blocks, and embedded processor cores.
An Open Discussion of RISC-V BitManip, trends, and comparisons _ Cuff — RISC-V International
Join RISC-V BitManip industry leader Claire Xenia Wolf and Dr. James Cuff for an open and lively discussion with an interactive Q&A on RISC-V and BitManip, covering trends, comparisons with the existing architecture landscape (including x86 and ARM), and what specifically makes RISC-V unique.
SDVIs and In-Situ Visualization on TACC's Stampede — Intel® Software
Speaker: Paul Navrátil, Texas Advanced Computing Center (TACC)
The design emphasis for supercomputing systems has moved from raw performance to performance-per-watt, and as a result, supercomputing architectures are converging on processors with wide vector units and many processing cores per chip. Such processors are capable of performant image rendering purely in software. This improved capability is fortuitous, since the prevailing homogeneous system designs lack dedicated, hardware-accelerated rendering subsystems for use in data visualization. Reliance on this “software-defined” rendering capability will grow in importance since, due to growing data sizes, visualizations must be performed on the same machine where the data is produced. Further, as data sizes outgrow disk I/O capacity, visualization will be increasingly incorporated into the simulation code itself (in situ visualization).
This talk presents recent work in high-fidelity visualization using the OSPRay ray tracing framework on TACC’s local and remote visualization systems. We present work using OSPRay within the ParaView Catalyst in situ framework from Kitware, including capitalizing on opportunities to reduce the cost of data migrating through VTK filters for visualization. We highlight the performance opportunities and advantages of Intel® Advanced Vector Extensions 512, the memory system improvements possible with Intel® Xeon Phi™ processor multi-channel DRAM (MCDRAM), and the Intel® Omni-Path Architecture interconnect.
The document is a seminar report on FPGA technology in outer space applications. It discusses the history and evolution of FPGA technology over time, including increasing gate densities and falling prices. It describes typical FPGA architecture which includes configurable logic blocks, interconnects, and I/O pads. Modern FPGAs integrate additional resources like memory blocks, DSP slices, and soft processor cores. The document highlights applications of FPGAs in aerospace, including COTS boards and development kits. It also outlines future potential for FPGAs in more complex roles in space systems.
The document provides a history of digital logic and programmable logic devices such as PLDs, CPLDs, and ASICs. It describes the advantages of FPGAs over other technologies including lower costs, faster time to market, and easier design changes. The architecture of FPGAs is explained including logic blocks, interconnects, embedded memory and DSP blocks. Modern SoC FPGAs integrate an ARM processor for improved performance. Applications include automotive, wireless, military, and medical imaging systems.
This document discusses image processing applications using Vivado for FPGAs. It provides information on FPGA architecture including distributed memory, block RAM features, and core generator. An example of a real-time breast cancer diagnosis application using YOLO on an FPGA board is described. A second example discusses implementing CCSDS standard DWT-based hyperspectral image decompression on an FPGA using techniques like Haar wavelet transform and MAP encoding.
The document discusses plans to establish an institutional high performance computing (HPC) facility at North-West University. It outlines the technical goals of building a Beowulf cluster to link existing departmental clusters and integrate with national and international computational grids. It also discusses management principles for the new HPC facility to ensure sustainability, efficiency, reliability, availability and high performance.
Programmable logic devices (PLDs) allow users to implement digital logic designs on a single chip. PLDs have advantages over traditional integrated circuits like lower costs for lower production volumes and shorter design times. Common types of PLDs include simple programmable logic devices like PALs, GALs, and CPLDs. PLDs are configured using memory like SRAM, EPROM, EEPROM, or flash to store the programmed logic pattern. Reprogrammability allows PLDs to be reused for different logic functions.
This document discusses CPLDs (Complex Programmable Logic Devices), including their general architecture, reprogrammability, density, and common vendors and families. CPLDs contain programmable macrocells (equivalent to around 20 gates each) connected by a programmable interconnect and support up to 200 I/O pins. They are between FPGAs and SPLDs in complexity. The document describes CPLD architecture including logic array blocks (LABs) and programmable interconnect arrays (PIAs). It provides examples of Xilinx CPLD families, packages, and a datasheet for the XC9572 device.
Hardware for deep learning includes CPUs, GPUs, FPGAs, and ASICs. CPUs are general purpose but support deep learning through instructions like AVX-512 and libraries. GPUs like NVIDIA and AMD models are commonly used due to high parallelism and memory bandwidth. FPGAs offer high efficiency but require specialized programming. ASICs like Google's TPU are customized for deep learning and provide high performance but limited flexibility. Emerging hardware aims to improve efficiency and better match neural network computations.
A Primer on FPGAs - Field Programmable Gate Arrays — Taylor Riggan
A focus on the use of FPGAs by cloud service providers. Includes Microsoft Azure Catapult, Google Tensor Processors, and Amazon EC2 F1 instances. Also includes background info on how to get started with FPGAs
This document discusses soft processors like the NIOS II that can be implemented on FPGAs. It provides details on the NIOS II architecture, implementation, and IDE. It compares NIOS II to the TigerSHARC architecture. It also analyzes the performance of a FIR filter algorithm on both platforms, showing NIOS II is slower but hardware acceleration could improve its performance. Overall it presents NIOS II as a customizable alternative to DSPs that blurs the line between FPGAs and processors.
The Cygnus supercomputer combines GPUs and FPGAs to provide high performance computing capabilities. It has 81 nodes with a total peak performance of 2.4 PFLOPS from GPUs, CPUs, and FPGAs. 49 nodes contain GPUs only, while 32 nodes contain GPUs, FPGAs, and high-speed interconnects between the FPGAs. The FPGAs allow for application-specific acceleration and high-speed external communication. Cygnus aims to enhance performance through mixed and variable precision operations on the FPGAs.
2. FPGA for dummies: modern FPGA architecture — Maurizio Donna
The document discusses the architecture and components of field programmable gate arrays (FPGAs). It describes the basic building blocks of FPGAs, including look-up tables (LUTs), flip-flops (FFs), wires, and input/output pads. It notes that modern FPGAs also include additional elements like embedded memory, phase-locked loops, high-speed transceivers, memory controllers, multiply-accumulate blocks, and embedded processors. The document provides details on these individual components, such as the use of block RAM for memory, PLLs and DLLs for clocking, transceivers for high-speed communication, and DSP blocks for arithmetic functions.
The document discusses Altera's system on chip (SoC) devices that integrate an ARM-based hard processor system with an FPGA fabric. The SoCs allow users to select the needed intellectual property and peripherals to create a custom system on chip tailored for their application. The SoCs provide benefits like reducing power, cost, and board size while increasing performance compared to discrete processor and FPGA solutions. Altera offers a range of SoC devices in their 28nm Cyclone V and Arria V families.
This document provides an overview of field-programmable gate arrays (FPGAs):
FPGAs can be configured to act like any circuit and do many computational tasks. Unlike CPUs and GPUs which have fixed hardware, FPGAs have programmable hardware that can be optimized for specific applications. This specialized hardware enables much higher performance and efficiency than general purpose processors. Programming FPGAs involves using hardware description languages to describe circuits or high-level synthesis tools to compile software languages into hardware implementations.
Exploring the Performance Impact of Virtualization on an HPC Cloud, by Ryousei Takano
The document evaluates the performance impact of virtualization on high-performance computing (HPC) clouds. Experiments were conducted on the AIST Super Green Cloud, a 155-node HPC cluster. Benchmark results show that while PCI passthrough mitigates I/O overhead, virtualization still incurs performance penalties for MPI collectives as node counts increase. Application benchmarks demonstrate overhead is limited to around 5%. The study concludes HPC clouds are promising due to utilization improvements from virtualization, but further optimization of virtual machine placement and pass-through technologies could help reduce overhead.
The document describes an IBM workshop on CAPI and OpenCAPI technologies. It provides an overview of FPGA acceleration using SNAP, including how SNAP simplifies FPGA programming using a C/C++ based approach. Examples of use cases for FPGA acceleration like video processing and machine learning inference are also presented.
The GIST AI-X Computing Cluster provides powerful accelerated computation resources for machine learning using GPUs and other hardware. It includes DGX A100 and DGX-1V nodes with 8 NVIDIA A100 or V100 GPUs each, connected by high-speed networking. The cluster uses Singularity containers, Slurm scheduling, and Ceph storage. It allows researchers to request resources, build container images, and run distributed deep learning jobs across multiple GPUs.
SCFE 2020 OpenCAPI presentation as part of OpenPOWER Tutorial, by Ganesan Narayanasamy
This document introduces hardware acceleration using FPGAs with OpenCAPI. It discusses how classic FPGA acceleration has issues like slow CPU-managed memory access and lack of data coherency. OpenCAPI allows FPGAs to directly access host memory, providing faster memory access and data coherency. It also introduces the OC-Accel framework that allows programming FPGAs using C/C++ instead of HDL languages, addressing issues like long development times. Example applications demonstrated significant performance improvements using this approach over CPU-only or classic FPGA acceleration methods.
Using a Field Programmable Gate Array to Accelerate Application Performance, by Odinot Stanislas
Intel is looking at FPGA and what they bring to ISVs and developers and their very specific needs in genomics, image processing, databases, and even in the cloud. In this document you will have the opportunity to learn more about our strategy, and a research program initiated by Intel and Altera involving Xeon E5 with... FPGA inside.
Author(s):
P. K. Gupta, Director of Cloud Platform Technology, Intel Corporation
A short survey of the current state of field-programmable gate array usage in deep learning by companies like Intel (Nervana) and Google (with its TPUs, tensor processing units), compared with GPU usage in terms of energy consumption and performance.
An FPGA (field-programmable gate array) is an integrated circuit designed to be configured by a customer after manufacturing. FPGAs contain programmable logic blocks and a hierarchy of reconfigurable interconnects that allow the blocks to be wired together in different configurations. This flexibility allows FPGAs to implement any logical function that an ASIC could perform, with advantages including the ability to reprogram functionality after shipping and lower engineering costs than an ASIC. Common applications of FPGAs include digital signal processing, software-defined radio, medical imaging, and more.
This presentation is dedicated to Field Programmable Gate Arrays (FPGAs), their interaction with CPUs, distributed resources, the use of FPGAs for prototyping chips, and OpenCL on FPGAs.
This presentation by Andriy Smolskyy, GlobalLogic Engineering Consultant, was delivered at a GlobalLogic Embedded TechTalk in Lviv on March 29, 2017.
An Arithmetic Logic Unit (ALU) is a digital circuit that performs arithmetic and logic operations on binary numbers. The student has designed a 4-bit ALU using VHDL that can perform operations like addition, subtraction, AND, OR, etc. based on control inputs. The design was tested through simulation to verify it functions correctly as intended. Project planning involved estimating cost, duration, and resources needed. The ALU design was developed modularly using a top-down approach in VHDL.
This document discusses OpenCAPI acceleration using the OpenCAPI Acceleration Framework (oc-accel). It provides an overview of the oc-accel components and workflow, benchmarks the OC-Accel bandwidth and latency, and provides examples of how to fully utilize OC-Accel capabilities to accelerate functions on an FPGA. The document also outlines the OC-Accel development process and previews upcoming features like support for ODMA to port existing PCIe accelerators to OpenCAPI.
1. FPGAs for Supercomputing: The Why and How
Hal Finkel² (hfinkel@anl.gov), Kazutomo Yoshii¹, and Franck Cappello¹
¹ Mathematics and Computer Science (MCS)
² Leadership Computing Facility (ALCF)
Argonne National Laboratory
Advanced Scientific Computing Advisory Committee
Tuesday, December 20, 2016
Washington, DC
2. Outline
● Why are FPGAs interesting?
● Can FPGAs competitively accelerate traditional HPC workloads?
● Challenges in FPGA programming, and potential solutions.
3. For some things, FPGAs are really good!
http://escholarship.org/uc/item/35x310n6
Bioinformatics: 70x faster!
4. For some things, FPGAs are really good!
machine learning and neural networks
http://ieeexplore.ieee.org/abstract/document/7577314/
The FPGA is faster than both the CPU and GPU, 10x more power efficient, and achieves a much higher percentage of peak!
7. To Decrease Energy, Move Data Less!
http://www.socforhpc.org/wp-content/uploads/2015/06/SBorkar-SoC-WS-DAC-June-7-2015-v1.pptx
[Chart: on-die interconnect energy/mm vs. compute energy across technology nodes from 90 nm down to 7 nm. Source: Intel]
On-die Data Movement vs Compute:
● Interconnect energy (per mm) reduces slower than compute energy.
● On-die data movement energy will start to dominate.
https://www.semiwiki.com/forum/content/6160-2016-leading-edge-semiconductor-landscape.html
8. Compute vs. Movement – Changes Afoot
http://iwcse.phys.ntu.edu.tw/plenary/HorstSimon_IWCSE2013.pdf
(2013)
10. Where Does the Power Go (CPU)?
http://link.springer.com/article/10.1186/1687-3963-2013-9
(Model with (# register files) x (read ports) x (write ports))
Fetch and decode take most of the energy!
More centralized register files mean more data movement, which takes more power.
Only a small portion of the energy goes to the underlying computation.
See also: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-2008-130.pdf
11. Modern FPGAs: DSP Blocks and Block RAM
http://yosefk.com/blog/category/hardware
Design mapped (Place & Route)
Intel Stratix 10 will have up to:
● 5760 DSP Blocks = 9.2 SP TFLOPS
● 11721 20Kb Block RAMs = 28MB
● 64-bit 4-core ARM @ 1.5 GHz
https://www.altera.com/products/fpga/stratix-series/stratix-10/features.html
DSP blocks multiply (Intel/Altera FPGAs have full SP FMA)
12. GFLOPS/Watt (Single Precision)
[Chart: GFLOPS/Watt (scale 0–120) for Intel Skylake, Intel Knights Landing, NVIDIA Pascal, Altera Stratix 10, and Xilinx Virtex UltraScale+]
● http://wccftech.com/massive-intel-xeon-e5-xeon-e7-skylake-purley-biggest-advancement-nehalem/ - Taking 165 W max range
● http://cgo.org/cgo2016/wp-content/uploads/2016/04/sodani-slides.pdf
● http://www.xilinx.com/applications/high-performance-computing.html - UltraScale+ figure inferred from a 33% performance increase (from Hot Chips presentation)
● https://devblogs.nvidia.com/parallelforall/inside-pascal/
● https://www.altera.com/products/fpga/stratix-series/stratix-10/features.html
Marketing numbers for unreleased products... (be skeptical)
Do these FPGA numbers include system memory?
13. GFLOPS/Watt (Single Precision) – Let's be more realistic...
[Chart: GFLOPS/Watt (scale 0–120) for Intel Skylake, Intel Knights Landing, NVIDIA Pascal, Altera Stratix 10, and Xilinx Virtex UltraScale+]
● http://www.tomshardware.com/reviews/intel-core-i7-5960x-haswell-e-cpu,3918-13.html
● https://hal.inria.fr/hal-00686006v2/document
● http://www.eecg.toronto.edu/~davor/papers/capalija_fpl2014_slides.pdf - Tile approach yields 75% of peak clock rate on full device
90% of peak on a CPU is excellent!
70% of peak on a GPU is excellent!
Plus system memory: assuming 6 W for 16 GB DDR4 (and 150 W for the FPGA).
Conclusion: FPGAs are a competitive HPC accelerator technology by 2017!
14. GFLOPS/device (Single Precision)
[Chart: GFLOPS (scale 0–12000) for Intel Skylake, Intel Knights Landing, NVIDIA Pascal, Altera Stratix 10, and Xilinx Virtex UltraScale+]
● https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/pt/stratix-10-product-table.pdf - Largest variant with all DSPs doing FMAs @ the 800 MHz max
● http://www.xilinx.com/support/documentation/ip_documentation/ru/floating-point.html
● http://www.xilinx.com/support/documentation/selection-guides/ultrascale-plus-fpga-product-selection-guide.pdf - LUTs, not DSPs, are the limiting resource – filling the device with FMAs @ 1 GHz
● https://devblogs.nvidia.com/parallelforall/inside-pascal/
● http://wccftech.com/massive-intel-xeon-e5-xeon-e7-skylake-purley-biggest-advancement-nehalem/ - 28 cores @ 3.7 GHz * 16 FP ops per cycle * 2 for FMA (assuming same clock rate as the E5-1660 v2)
● http://cgo.org/cgo2016/wp-content/uploads/2016/04/sodani-slides.pdf
All in theory...
15. GFLOPS/device (Single Precision) – Let's be more realistic...
[Chart: GFLOPS (scale 0–12000) for Intel Skylake, Intel Knights Landing, NVIDIA Pascal, Altera Stratix 10, and Xilinx Virtex UltraScale+]
● https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/wp/wp-01222-understanding-peak-floating-point-performance-claims.pdf
● https://www.altera.com/en_US/pdfs/literature/wp/wp-01028.pdf (old but still useful)
90% of peak on a CPU is excellent!
70% of peak on a GPU is excellent!
80% usage at peak frequency of an FPGA is excellent!
Xilinx has no hard FP logic... reserving 30% of the LUTs for other purposes.
16. For FPGAs, Parallelism is Essential
[Chart: FPGA (90 nm) vs. CPU/GPU (90 nm and 65 nm) devices]
http://rssi.ncsa.illinois.edu/proceedings/academic/Williams.pdf
(2008)
17. An experiment...
● Nallatech 385A Arria 10 board
● 200 – 300 MHz (depending on the design)
● 20 nm
● two DRAM channels, 34.1 GB/s peak
● Sandy Bridge E5-2670
● 2.6 GHz (3.3 GHz w/ turbo)
● 32 nm
● four DRAM channels, 51.2 GB/s peak
18. An experiment: Power is Measured...
● Intel RAPL is used to measure CPU energy
– CPU and memory
● A Yokogawa WT310, an external power meter, is used to measure the FPGA power
– FPGA_pwr = meter_pwr - host_idle_pwr + FPGA_idle_pwr (~17 W)
– Note that meter_pwr includes both CPU and FPGA
19. An experiment: Random Access with Computation using OpenCL
● # work-units is 256
● CPU: Sandy Bridge (4ch memory)
● FPGA: Arria 10 (2ch memory)
for (int i = 0; i < M; i++) {
double8 tmp;
index = rand() % len;
tmp = array[index];
sum += (tmp.s0 + tmp.s1) / 2.0;
sum += (tmp.s2 + tmp.s3) / 2.0;
sum += (tmp.s4 + tmp.s5) / 2.0;
sum += (tmp.s6 + tmp.s7) / 2.0;
}
20. An experiment: Random Access with Computation using OpenCL
● # work-units is 256
● CPU: Sandy Bridge (2ch memory)
● FPGA: Arria 10 (2ch memory)
for (int i = 0; i < M; i++) {
double8 tmp;
index = rand() % len;
tmp = array[index];
sum += (tmp.s0 + tmp.s1) / 2.0;
sum += (tmp.s2 + tmp.s3) / 2.0;
sum += (tmp.s4 + tmp.s5) / 2.0;
sum += (tmp.s6 + tmp.s7) / 2.0;
}
Make the comparison more fair...
23. High-End CPU + FPGA Systems Are Coming...
● Intel/Altera are starting to produce Xeon + FPGA systems
● Xilinx are producing ARM + FPGA systems
These are not just embedded cores, but state-of-the-art multicore CPUs.
Low latency and high bandwidth.
CPU + FPGA systems fit nicely into the HPC accelerator model! (“#pragma omp target” can work for FPGAs too)
https://www.nextplatform.com/2016/03/14/intel-marrying-fpga-beefy-broadwell-open-compute-future/
24. Common Algorithm Classes in HPC
http://crd.lbl.gov/assets/pubs_presos/CDS/ATG/WassermanSOTON.pdf
25. Common Algorithm Classes in HPC – What do they need?
http://crd.lbl.gov/assets/pubs_presos/CDS/ATG/WassermanSOTON.pdf
26. FPGAs Can Help Everyone!
Compute Bound (FPGAs have lots of compute)
Memory-Latency Bound (FPGAs can pipeline deeply)
Memory-Bandwidth Bound (FPGAs can do on-the-fly compression)
FPGAs have lots of registers
FPGAs have lots of embedded memory
27. FPGA Programming: Levels of Abstraction
Levels of abstraction, from highest to lowest:
● High-level languages (OpenMP, OpenACC, etc.) → source-to-source translation
● Behavior level (C, C++, SystemC, OpenCL) → high-level synthesis (producing a datapath and controller)
● RT level (VHDL, Verilog) → logic synthesis
● Gate level (netlist) → place & route (Altera/Xilinx toolchains)
● Bitstream
Derived from Deming Chen’s slide (UIUC).
28. FPGA Programming Techniques
● Use FPGAs as accelerators through (vendor-)optimized libraries (lowest risk, lowest user difficulty)
● Use of FPGAs through overlay architectures (pre-compiled custom processors)
● Use of FPGAs through high-level synthesis (e.g. via OpenMP)
● Use of FPGAs through programming in Verilog/VHDL (the FPGA “assembly language”) (highest risk, highest user difficulty)
29. Beware of Compile Time...
● Compiling a full design for a large FPGA (synthesis + place & route) can take many hours!
● Tile-based designs can help, but can still take tens of minutes!
● Overlay architectures (pre-compiled custom processors and on-chip networks) can help...
Is the kernel really important in this application? If not, use traditional compilation for an optimized overlay architecture. If so, use high-level synthesis to generate custom hardware.
32. A Toolchain using HLS in Practice?
Compiler (C/C++/Fortran) → Executable: extract parallel regions and compile for the host in the usual way; the parallel regions go through high-level synthesis and then place and route.
If placement and routing takes hours, we can't do it this way!
33. A Toolchain using HLS in Practice?
Compiler (C/C++/Fortran) → Executable: extract parallel regions and compile for the host in the usual way.
High-level synthesis → place and route, connected to the executable by some kind of token.
34. Challenges Remain...
● OpenMP 4 technology for FPGAs is in its infancy (even less mature than the GPU implementations).
● High-level synthesis technology has come a long way, but is only now starting to give performance competitive with hand-programmed HDL designs.
● CPU + FPGA systems with cache-coherent interconnects are very new.
● High-performance overlay architectures have been created in academia, but none targeting HPC workloads. High-performance on-chip networks are tricky.
● No one has yet created a complete HPC-practical toolchain.
Theoretical maximum performance on many algorithms on GPUs is 50-70%. This is lower than on CPU systems, but CPU systems have higher overhead. In theory, FPGAs offer a high percentage of peak and low overhead, but can that be realized in practice?
35. Conclusions
✔ FPGA technology offers the most-promising direction toward higher FLOPS/Watt.
✔ FPGAs, soon combined with powerful CPUs, will naturally fit into our accelerator-infused HPC ecosystem.
✔ FPGAs can compete with CPUs/GPUs on traditional workloads while excelling at bioinformatics, machine learning, and more!
✔ Combining high-level synthesis with overlay architectures can address FPGA programming challenges.
✔ Even so, pulling all of the pieces together will be challenging!
➔ ALCF is supported by DOE/SC under contract DE-AC02-06CH11357
47. OpenMP Accelerator Support – An Example (SAXPY)
http://llvm-hpc2-workshop.github.io/slides/Wong.pdf
Memory transfer if necessary.
Traditional CPU-targeted OpenMP might only need this directive!
48. HPC-relevant Parallelism is Coming to C++17!
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4071.htm
using namespace std::execution;  // final C++17 spelling; the earlier Parallelism TS used std::experimental::parallel

int a[] = {0,1};
for_each(par, std::begin(a), std::end(a), [&](int i) {
  do_something(i);
});

void f(float* a, float* b) {
  ...
  for_each(par_unseq, begin, end, [&](int i) {
    a[i] = b[i] + c;  // begin, end, and c are assumed defined elsewhere
  });
}

The “par_unseq” execution policy allows for vectorization as well.
Almost as concise as OpenMP, but in many ways more powerful!
49. HPC-relevant Parallelism is Coming to C++17!
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4071.htm
50. Clang/LLVM: Where do we stand now?
● Clang (OpenMP 4 support nearly done); Intel, IBM, and others are finishing target offload support.
● LLVM with Polly (polyhedral optimizations).
● SPIR-V: prototypes available, but only for LLVM 3.6; vendor tools not yet ready.
● C backend: not upstream; there is a relatively-recent version on GitHub. Its output feeds vendor HLS / OpenCL tools.
● Or generate VHDL/Verilog directly?
51. Current FPGA + CPU System
http://www.panoradio-sdr.de/sdr-implementation/fpga-software-design/
The Xilinx Zynq 7020 has two ARM Cortex-A9 cores, 53,200 LUTs, 560 KB SRAM, and 220 DSP slices.
53. CPU and GPU Trends
https://www.hpcwire.com/2016/08/23/2016-important-year-hpc-two-decades/
[Chart: CPU and GPU trends, with KNL (Knights Landing) data points marked]
54. CPU vs. FPGA Efficiency
http://authors.library.caltech.edu/1629/1/DEHcomputer00.pdf
CPUs and FPGAs achieve maximum algorithmic efficiency at polar opposite sides of the parameter space!