FPGAs for Supercomputing: The Why and How
Hal Finkel (2) (hfinkel@anl.gov), Kazutomo Yoshii (1), and Franck Cappello (1)
(1) Mathematics and Computer Science (MCS)
(2) Leadership Computing Facility (ALCF)
Argonne National Laboratory
Advanced Scientific Computing Advisory Committee
Tuesday, December 20, 2016
Washington, DC
Outline
● Why are FPGAs interesting?
● Can FPGAs competitively accelerate traditional HPC workloads?
● Challenges and potential solutions to FPGA programming.
For some things, FPGAs are really good!
http://escholarship.org/uc/item/35x310n6
70x faster!
bioinformatics
For some things, FPGAs are really good!
machine learning and neural networks
http://ieeexplore.ieee.org/abstract/document/7577314/
The FPGA is faster than both the CPU and the GPU, 10x more power efficient, and achieves a much higher percentage of peak!
http://www.socforhpc.org/wp-content/uploads/2015/06/SBorkar-SoC-WS-DAC-June-7-2015-v1.pptx
Parallelism Triumphs As We Head Toward Exascale
[Figure: relative transistor performance (log scale) vs. year, 1986 to 2021, spanning the Giga, Tera, Peta, and Exa eras. Chart annotations, in order:]
● 32x from transistor, 32x from parallelism
● 8x from transistor, 128x from parallelism
● 1.5x from transistor, 670x from parallelism
System performance from parallelism
http://science.energy.gov/~/media/ascr/ascac/pdf/meetings/201604/McCormick-ASCAC.pdf
(Maybe) It's All About the Power...
Do FPGAs perform less data movement per computation?
http://www.socforhpc.org/wp-content/uploads/2015/06/SBorkar-SoC-WS-DAC-June-7-2015-v1.pptx
To Decrease Energy, Move Data Less!
On-die Data Movement vs Compute
Interconnect energy (per mm) reduces slower than compute
On-die data movement energy will start to dominate
[Figure: relative on-die interconnect energy per mm and compute energy (axis 0 to 1.2) vs. technology node (90, 65, 45, 32, 22, 14, 10, 7 nm). Chart annotations: 6X and 60%. Source: Intel]
https://www.semiwiki.com/forum/content/6160-2016-leading-edge-semiconductor-landscape.html
Compute vs. Movement – Changes Afoot
http://iwcse.phys.ntu.edu.tw/plenary/HorstSimon_IWCSE2013.pdf
(2013)
FPGAs vs. CPUs
http://evergreen.loyola.edu/dhhoe/www/HoeResearchFPGA.htm
FPGA
http://www.ics.ele.tue.nl/~heco/courses/EmbSystems/adv-architectures.ppt
CPU
Where Does the Power Go (CPU)?
http://link.springer.com/article/10.1186/1687-3963-2013-9
(Model with (# register files) x (read ports) x (write ports))
Fetch and decode take most of the energy!
More centralized register files mean more data movement, which takes more power.
Only a small portion of the energy goes to the underlying computation.
See also: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-2008-130.pdf
Modern FPGAs: DSP Blocks and Block RAM
http://yosefk.com/blog/category/hardware
Design mapped
(Place & Route)
Intel Stratix 10 will have up to:
● 5760 DSP Blocks = 9.2 SP TFLOPS
● 11721 20Kb Block RAMs = 28MB
● 64-bit 4-core ARM @ 1.5 GHz
https://www.altera.com/products/fpga/stratix-series/stratix-10/features.html
DSP blocks multiply
(Intel/Altera FPGAs have full SP FMA)
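The headline Stratix 10 figures follow from simple arithmetic. The sketch below is only an illustration, assuming that each DSP block retires one single-precision FMA (counted as 2 FLOPs) per cycle at the 800 MHz maximum clock cited elsewhere in this deck, and that "20Kb" means 20 x 1024 bits. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Peak single-precision GFLOPS if every DSP block retires one fused
// multiply-add (counted as 2 FLOPs) per clock cycle.
// Assumption: clock_ghz is the 800 MHz maximum quoted for Stratix 10.
double peak_sp_gflops(std::int64_t dsp_blocks, double clock_ghz) {
    assert(dsp_blocks >= 0 && clock_ghz >= 0.0);
    const double flops_per_fma = 2.0;
    return static_cast<double>(dsp_blocks) * flops_per_fma * clock_ghz;
}

// Total on-chip block RAM in MiB, given the number of blocks and the size
// of each block in kilobits (assumed here to be 1024 bits per kilobit).
double block_ram_mib(std::int64_t blocks, double kilobits_per_block) {
    assert(blocks >= 0 && kilobits_per_block >= 0.0);
    const double bits = static_cast<double>(blocks) * kilobits_per_block * 1024.0;
    return bits / 8.0 / (1024.0 * 1024.0);
}
```

With the slide's numbers, `peak_sp_gflops(5760, 0.8)` gives 9216 GFLOPS (about 9.2 TFLOPS), and `block_ram_mib(11721, 20.0)` gives roughly 28.6 MiB, consistent with the 28 MB quoted above.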
GFLOPS/Watt (Single Precision)
[Bar chart: single-precision GFLOPS/Watt (axis 0 to 120) for Intel Skylake, Intel Knights Landing, NVIDIA Pascal, Altera Stratix 10, and Xilinx Virtex Ultrascale+]
● http://wccftech.com/massive-intel-xeon-e5-xeon-e7-skylake-purley-biggest-advancement-nehalem/ - Taking 165 W max range
● http://cgo.org/cgo2016/wp-content/uploads/2016/04/sodani-slides.pdf
● http://www.xilinx.com/applications/high-performance-computing.html - Ultrascale+ figure inferred by a 33% performance increase (from Hotchips presentation)
● https://devblogs.nvidia.com/parallelforall/inside-pascal/
● https://www.altera.com/products/fpga/stratix-series/stratix-10/features.html
Marketing numbers for unreleased products… (be skeptical)
Do these FPGA numbers include system memory?
GFLOPS/Watt (Single Precision) – Let's be more realistic...
[Bar chart: more realistic single-precision GFLOPS/Watt (axis 0 to 120) for Intel Skylake, Intel Knights Landing, NVIDIA Pascal, Altera Stratix 10, and Xilinx Virtex Ultrascale+]
● http://www.tomshardware.com/reviews/intel-core-i7-5960x-haswell-e-cpu,3918-13.html
● https://hal.inria.fr/hal-00686006v2/document
● http://www.eecg.toronto.edu/~davor/papers/capalija_fpl2014_slides.pdf - Tile approach yields 75% of peak clock rate on full device
Conclusion: FPGAs are a competitive HPC accelerator technology by 2017!
90% of peak on a CPU is excellent!
70% of peak on a GPU is excellent!
Plus system memory: assuming 6 W for 16 GB DDR4 (and 150 W for the FPGA)
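To see how much the system-memory assumption moves the efficiency figure, the following minimal sketch divides a throughput by the combined device and memory power. The 150 W and 6 W values are the slide's assumptions; the 9216 GFLOPS used in the usage note is the theoretical Stratix 10 peak (5760 DSP blocks x 2 FLOPs x 0.8 GHz), not necessarily the value plotted in the chart. The function name is hypothetical.

```cpp
#include <cassert>

// Energy efficiency in GFLOPS per watt when the power budget includes both
// the accelerator itself and its attached system memory.
double gflops_per_watt(double gflops, double device_watts, double memory_watts) {
    const double total_watts = device_watts + memory_watts;
    assert(total_watts > 0.0);
    return gflops / total_watts;
}
```

For example, `gflops_per_watt(9216.0, 150.0, 0.0)` is 61.44, while `gflops_per_watt(9216.0, 150.0, 6.0)` is about 59.1, so under these assumptions the memory costs roughly 4% of the efficiency.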
GFLOPS/device (Single Precision)
[Bar chart: theoretical single-precision GFLOPS per device (axis 0 to 12000) for Intel Skylake, Intel Knights Landing, NVIDIA Pascal, Altera Stratix 10, and Xilinx Virtex Ultrascale+]
● https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/pt/stratix-10-product-table.pdf - Largest variant with all DSPs doing FMAs @ the 800 MHz max
● http://www.xilinx.com/support/documentation/ip_documentation/ru/floating-point.html
● http://www.xilinx.com/support/documentation/selection-guides/ultrascale-plus-fpga-product-selection-guide.pdf - LUTs, not DSPs, are the limiting resource – filling device with FMAs @ 1 GHz
● https://devblogs.nvidia.com/parallelforall/inside-pascal/
● http://wccftech.com/massive-intel-xeon-e5-xeon-e7-skylake-purley-biggest-advancement-nehalem/ - 28 cores @ 3.7 GHz * 16 FP ops per cycle * 2 for FMA (assuming same clock rate as the E5-1660 v2)
● http://cgo.org/cgo2016/wp-content/uploads/2016/04/sodani-slides.pdf
All in theory...
GFLOPS/device (Single Precision) – Let's be more realistic...
[Bar chart: more realistic single-precision GFLOPS per device (axis 0 to 12000) for Intel Skylake, Intel Knights Landing, NVIDIA Pascal, Altera Stratix 10, and Xilinx Virtex Ultrascale+]
● https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/wp/wp-01222-understanding-peak-floating-point-performance-claims.pdf
● https://www.altera.com/en_US/pdfs/literature/wp/wp-01028.pdf (old but still useful)
90% of peak on a CPU is excellent!
70% of peak on a GPU is excellent!
80% usage at peak frequency of an FPGA is excellent!
Xilinx has no hard FP logic... Reserving 30% of the LUTs for other purposes.
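The derating applied here can be sketched as a peak calculation followed by a scale factor. This is an illustration only: the Skylake inputs (28 cores, 3.7 GHz, 16 FP ops per cycle, FMA) come from the reference notes in this deck, the 9216 GFLOPS Stratix 10 peak is 5760 DSP blocks x 2 FLOPs x 0.8 GHz, and the function names are hypothetical. Whether the chart applied exactly this formula is an assumption.

```cpp
#include <cassert>

// Theoretical peak single-precision GFLOPS of a CPU:
// cores x clock (GHz) x SIMD FP ops per cycle, doubled if each op is an FMA.
double cpu_peak_sp_gflops(int cores, double clock_ghz, int fp_ops_per_cycle, bool fma) {
    assert(cores > 0 && clock_ghz > 0.0 && fp_ops_per_cycle > 0);
    return cores * clock_ghz * fp_ops_per_cycle * (fma ? 2.0 : 1.0);
}

// Scales a theoretical peak by the fraction that a well-tuned code achieves
// (0.9 for a CPU, 0.7 for a GPU, 0.8 for an FPGA on the slide).
double sustained_gflops(double peak_gflops, double fraction_of_peak) {
    assert(fraction_of_peak >= 0.0 && fraction_of_peak <= 1.0);
    return peak_gflops * fraction_of_peak;
}
```

With these inputs, `cpu_peak_sp_gflops(28, 3.7, 16, true)` is about 3315 GFLOPS, which drops to about 2984 GFLOPS at 90% of peak, and `sustained_gflops(9216.0, 0.8)` is about 7373 GFLOPS for the FPGA.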
For FPGAs, Parallelism is Essential
[Figure: CPU/GPU and FPGA panels; process-node labels: 90 nm, 90 nm, 65 nm]
http://rssi.ncsa.illinois.edu/proceedings/academic/Williams.pdf
(2008)
An experiment...
FPGA:
● Nallatech 385A Arria 10 board
● 200 to 300 MHz (depending on the design)
● 20 nm
● Two DRAM channels, 34.1 GB/s peak
CPU:
● Sandy Bridge E5-2670
● 2.6 GHz (3.3 GHz w/ turbo)
● 32 nm
● Four DRAM channels, 51.2 GB/s peak
An experiment: Power is Measured...
● Intel RAPL is used to measure CPU energy
– CPU and memory
● Yokogawa WT310, an external power meter, is used to measure the FPGA power
– FPGA_pwr = meter_pwr - host_idle_pwr + FPGA_idle_pwr (~17 W)
– Note that meter_pwr includes both CPU and FPGA
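The FPGA power formula above can be written as a one-line helper. In this sketch the function name and the example meter readings are hypothetical; only the roughly 17 W FPGA idle power comes from the slide. Presumably the idle host measurement already contains the idle FPGA board, which is why the idle FPGA power is added back.

```cpp
#include <cassert>

// FPGA_pwr = meter_pwr - host_idle_pwr + FPGA_idle_pwr, all in watts.
// meter_watts:     wall-meter reading while the kernel runs (CPU + FPGA)
// host_idle_watts: wall-meter reading with the whole host idle
// fpga_idle_watts: idle power of the FPGA board alone (~17 W on the slide)
double fpga_power_watts(double meter_watts, double host_idle_watts,
                        double fpga_idle_watts) {
    assert(meter_watts >= 0.0 && host_idle_watts >= 0.0 && fpga_idle_watts >= 0.0);
    return meter_watts - host_idle_watts + fpga_idle_watts;
}
```

For instance, with a hypothetical meter reading of 120 W under load and 95 W at idle, `fpga_power_watts(120.0, 95.0, 17.0)` attributes 42 W to the FPGA.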
An experiment: Random Access with Computation using OpenCL
● # work-units is 256
● CPU: Sandy Bridge (4ch memory)
● FPGA: Arria 10 (2ch memory)
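double sum = 0.0;                  // accumulator (declaration not shown on the slide)
for (int i = 0; i < M; i++) {
    int index = rand() % len;      // random index into array
    double8 tmp = array[index];    // one 64-byte load: eight doubles
    sum += (tmp.s0 + tmp.s1) / 2.0;
    sum += (tmp.s2 + tmp.s3) / 2.0;
    sum += (tmp.s4 + tmp.s5) / 2.0;
    sum += (tmp.s6 + tmp.s7) / 2.0;
}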
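For readers who want to try the access pattern without an OpenCL toolchain, here is a host-side C++ sketch of the same loop. It is not the benchmark used in the experiment: `Double8` is a hypothetical stand-in for OpenCL's `double8`, `std::mt19937` replaces the slide's `rand()`, and the function name is made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Host-side stand-in for OpenCL's double8: eight doubles, 64 bytes.
struct Double8 {
    double s[8];
};

// Performs m random 64-byte loads from `array` and accumulates the pairwise
// averages, mirroring the loop body of the kernel.
// Assumption: std::mt19937 replaces the rand() call shown on the slide.
double random_access_sum(const std::vector<Double8>& array, int m, unsigned seed) {
    assert(!array.empty());
    std::mt19937 gen(seed);
    std::uniform_int_distribution<std::size_t> pick(0, array.size() - 1);
    double sum = 0.0;
    for (int i = 0; i < m; ++i) {
        const Double8& tmp = array[pick(gen)];  // random access
        sum += (tmp.s[0] + tmp.s[1]) / 2.0;
        sum += (tmp.s[2] + tmp.s[3]) / 2.0;
        sum += (tmp.s[4] + tmp.s[5]) / 2.0;
        sum += (tmp.s[6] + tmp.s[7]) / 2.0;
    }
    return sum;
}
```

Each iteration does one cache-line-sized load and only a handful of floating-point operations, so the loop is dominated by memory latency rather than compute.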
An experiment: Random Access with Computation using OpenCL
● # work-units is 256
● CPU: Sandy Bridge (2ch memory)
● FPGA: Arria 10 (2ch memory)
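double sum = 0.0;                  // accumulator (declaration not shown on the slide)
for (int i = 0; i < M; i++) {
    int index = rand() % len;      // random index into array
    double8 tmp = array[index];    // one 64-byte load: eight doubles
    sum += (tmp.s0 + tmp.s1) / 2.0;
    sum += (tmp.s2 + tmp.s3) / 2.0;
    sum += (tmp.s4 + tmp.s5) / 2.0;
    sum += (tmp.s6 + tmp.s7) / 2.0;
}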
Make the comparison more fair...
FPGAs – Molecular Dynamics – Strong Scaling Again!
Martin Herbordt (Boston University)
High-End CPU + FPGA Systems Are Coming...
● Intel/Altera are starting to produce Xeon + FPGA systems
● Xilinx are producing ARM + FPGA systems
These are not just embedded cores, but state-of-the-art multicore CPUs
Low latency and high bandwidth
CPU + FPGA systems fit nicely into the HPC accelerator model! (“#pragma omp target” can work for FPGAs too)
https://www.nextplatform.com/2016/03/14/intel-marrying-fpga-beefy-broadwell-open-compute-future/
Common Algorithm Classes in HPC
http://crd.lbl.gov/assets/pubs_presos/CDS/ATG/WassermanSOTON.pdf
Common Algorithm Classes in HPC – What do they need?
http://crd.lbl.gov/assets/pubs_presos/CDS/ATG/WassermanSOTON.pdf
FPGAs Can Help Everyone!
Compute Bound
(FPGAs have lots of compute)
Memory-Latency Bound
(FPGAs can pipeline deeply)
Memory-Bandwidth Bound
(FPGAs can do on-the-fly compression)
FPGAs have lots of registers
FPGAs have lots of embedded memory
FPGA Programming: Levels of Abstraction
Levels of abstraction, from highest to lowest, with the step that lowers each one:
● High-level languages (OpenMP, OpenACC, etc.): source to source
● Behavior level (C, C++, SystemC, OpenCL): high-level synthesis (datapath, controller)
● RT level (VHDL, Verilog): logic synthesis
● Gate level (netlist): place & route
● Bitstream
Altera/Xilinx toolchains
Derived from Deming Chen’s slide (UIUC).
FPGA Programming Techniques
● Use FPGAs as accelerators through (vendor-)optimized libraries
● Use of FPGAs through overlay architectures (pre-compiled custom processors)
● Use of FPGAs through high-level synthesis (e.g. via OpenMP)
● Use of FPGAs through programming in Verilog/VHDL (the FPGA “assembly language”)
(Ordered from lowest risk and lowest user difficulty at the top to highest risk and highest user difficulty at the bottom.)
Beware of Compile Time...
● Compiling a full design for a large FPGA (synthesis + place & route) can take many hours!
● Tile-based designs can help, but can still take tens of minutes!
● Overlay architectures (pre-compiled custom processors and on-chip networks) can help...
Is the kernel really important in this application?
● Traditional compilation for an optimized overlay architecture.
● Use high-level synthesis to generate custom hardware.
Overlay (iDEA)
https://www2.warwick.ac.uk/fac/sci/eng/staff/saf/publications/fpt2012-cheah.pdf
● A very small CPU.
● Runs near peak clock rate of the block RAM / DSP block!
● Makes use of dynamic configuration of the DSP block.
Overlay (DeCO)
https://www2.warwick.ac.uk/fac/sci/eng/staff/saf/publications/fccm2016-jain.pdf
● Also spatial computing, but with much coarser resources.
● Place & Route is much faster!
● Performance is very good.
Each of these is a small soft CPU.
A Toolchain using HLS in Practice?
Compiler (C/C++/Fortran)
Executable
Extract parallel regions and compile for the host in the usual way
High-level Synthesis
Place and Route
If placement and routing take hours, we can't do it this way!
A Toolchain using HLS in Practice?
Compiler (C/C++/Fortran)
Executable
Extract parallel regions and compile for the host in the usual way
High-level Synthesis
Place and Route
Some kind of token
Challenges Remain...
● OpenMP 4 technology for FPGAs is in its infancy (even less mature than the GPU implementations).
● High-level synthesis technology has come a long way, but is only now starting to deliver performance competitive with hand-programmed HDL designs.
● CPU + FPGA systems with cache-coherent interconnects are very new.
● High-performance overlay architectures have been created in academia, but none target HPC workloads. High-performance on-chip networks are tricky.
● No one has yet created a complete HPC-practical toolchain.
The theoretical maximum performance of many algorithms on GPUs is 50-70% of peak.
This is lower than on CPU systems, but CPU systems have higher overhead.
In theory, FPGAs offer a high percentage of peak and low overhead, but can that be realized in practice?
Conclusions
✔ FPGA technology offers the most-promising direction toward higher FLOPS/Watt.
✔ FPGAs, soon combined with powerful CPUs, will naturally fit into our accelerator-infused HPC ecosystem.
✔ FPGAs can compete with CPUs/GPUs on traditional workloads while excelling at bioinformatics, machine learning, and more!
✔ Combining high-level synthesis with overlay architectures can address FPGA programming challenges.
✔ Even so, pulling all of the pieces together will be challenging!
➔ ALCF is supported by DOE/SC under contract DE-AC02-06CH11357
Extra Slides
http://fire.pppl.gov/FESAC_AdvComput_Binkley_041014.pdf
ALCF Systems
https://www.alcf.anl.gov/files/alcfscibro2015.pdf
Current Large-Scale Scientific Computing
http://science.energy.gov/~/media/ascr/ascac/pdf/meetings/201604/2016-0404-ascac-01.pdf
http://science.energy.gov/~/media/ascr/ascac/pdf/meetings/20150324/20150324_ASCAC_02a_No_Backups.pdf
How do we express parallelism?
http://llvm-hpc2-workshop.github.io/slides/Tian.pdf
How do we express parallelism - MPI+X?
http://llvm-hpc2-workshop.github.io/slides/Tian.pdf
OpenMP Evolving Toward Accelerators
http://llvm-hpc2-workshop.github.io/slides/Tian.pdf
New in OpenMP 4
OpenMP Accelerator Support – An Example (SAXPY)
http://llvm-hpc2-workshop.github.io/slides/Wong.pdf
Memory transfer if necessary.
Traditional CPU-targeted OpenMP might only need this directive!
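Since the slide's code image did not survive extraction, here is a minimal SAXPY sketch in the same spirit; it is not the code from the cited slides. It uses the OpenMP 4 combined `target teams distribute parallel for` construct, with `map` clauses expressing the memory transfers. A compiler without OpenMP offload support ignores the pragma and simply runs the loop on the host.

```cpp
#include <cassert>
#include <cstddef>

// SAXPY: y = a * x + y for n elements.
// map(to: ...) copies x to the device if necessary; map(tofrom: ...) copies
// y to the device and back. On a CPU-only OpenMP build, the
// "parallel for" part of the directive is the piece that matters.
void saxpy(std::size_t n, float a, const float* x, float* y) {
    assert(n == 0 || (x != nullptr && y != nullptr));
    #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The loop body is unchanged from serial code; only the directive says where it runs and which arrays must be moved, which is what makes the same model attractive for FPGAs.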
HPC-relevant Parallelism is Coming to C++17!
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4071.htm
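#include <algorithm>
#include <execution>  // C++17 execution policies
#include <iterator>
using namespace std::execution;  // par, par_unseq (as standardized in C++17)
int a[] = {0,1};
std::for_each(par, std::begin(a), std::end(a), [&](int i) {
  do_something(i);
});
void f(float* a, float*b) {
  ...
  // begin and end are assumed to delimit a range of indices, and c to be a
  // scalar; both come from the code elided above.
  std::for_each(par_unseq, begin, end, [&](int i)
  {
    a[i] = b[i] + c;
  });
}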
The “par_unseq” execution policy allows for vectorization as well.
Almost as concise as OpenMP, but in many ways more powerful!
Clang/LLVM: Where do we stand now?
● Clang: OpenMP 4 support nearly done. Intel, IBM, and others are finishing target offload support.
● LLVM
● Polly (polyhedral optimizations)
● SPIR-V: prototypes available, but only for LLVM 3.6. Vendor tools not yet ready.
● C: the C backend is not upstream. There is a relatively recent version on GitHub.
● Vendor HLS / OpenCL tools
● Generate VHDL/Verilog directly?
Current FPGA + CPU System
http://www.panoradio-sdr.de/sdr-implementation/fpga-software-design/
Xilinx Zynq 7020 has two ARM Cortex A9 cores.
53,200 LUTs
560 KB SRAM
220 DSP slices
http://www.socforhpc.org/wp-content/uploads/2015/06/SBorkar-SoC-WS-DAC-June-7-2015-v1.pptx
Interconnect Energy
Interconnect Structures
● Shared bus: buses over short distance; 1 to 10 fJ/bit; 0 to 5 mm; limited scalability
● Multi-ported memory: shared memory; 10 to 100 fJ/bit; 1 to 5 mm; limited scalability
● X-Bar: crossbar switch; 0.1 to 1 pJ/bit; 2 to 10 mm; moderate scalability
● Packet-switched network: 1 to 3 pJ/bit; >5 mm; scalable
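To put the per-bit figures in perspective, this small sketch converts them into the energy needed to move a block of data. The 64-byte transfer size in the usage note is an assumption (a typical cache line, and the size of one `double8` load), and the function name is hypothetical.

```cpp
#include <cassert>

// Energy in picojoules to move `bytes` of data across an interconnect that
// costs `pj_per_bit` picojoules per bit (1 fJ/bit = 0.001 pJ/bit).
double transfer_energy_pj(double bytes, double pj_per_bit) {
    assert(bytes >= 0.0 && pj_per_bit >= 0.0);
    return bytes * 8.0 * pj_per_bit;
}
```

Moving 64 bytes over a packet-switched network at 1 to 3 pJ/bit costs `transfer_energy_pj(64.0, 1.0)` = 512 pJ up to `transfer_energy_pj(64.0, 3.0)` = 1536 pJ, whereas a short shared bus at 10 fJ/bit needs only about 5 pJ for the same data.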
CPU and GPU Trends
https://www.hpcwire.com/2016/08/23/2016-important-year-hpc-two-decades/
CPU vs. FPGA Efficiency
http://authors.library.caltech.edu/1629/1/DEHcomputer00.pdf
CPU and FPGA achieve maximum algorithmic efficiency at polar opposite sides of the parameter space!

Transforming the Modern City with the Intel-based 5G Smart City Road Side Uni...
 
Accelerate Your AI Today
Accelerate Your AI TodayAccelerate Your AI Today
Accelerate Your AI Today
 
Increasing Throughput per Node for Content Delivery Networks
Increasing Throughput per Node for Content Delivery NetworksIncreasing Throughput per Node for Content Delivery Networks
Increasing Throughput per Node for Content Delivery Networks
 
3rd Generation Intel® Xeon® Scalable Processor - Achieving 1 Tbps IPsec with ...
3rd Generation Intel® Xeon® Scalable Processor - Achieving 1 Tbps IPsec with ...3rd Generation Intel® Xeon® Scalable Processor - Achieving 1 Tbps IPsec with ...
3rd Generation Intel® Xeon® Scalable Processor - Achieving 1 Tbps IPsec with ...
 
"Life and Learning After One-Hundred Years: Trust Is The Coin Of The Realm."
"Life and Learning After One-Hundred Years: Trust Is The Coin Of The Realm.""Life and Learning After One-Hundred Years: Trust Is The Coin Of The Realm."
"Life and Learning After One-Hundred Years: Trust Is The Coin Of The Realm."
 
Telefónica views on the design, architecture, and technology of 4G/5G Open RA...
Telefónica views on the design, architecture, and technology of 4G/5G Open RA...Telefónica views on the design, architecture, and technology of 4G/5G Open RA...
Telefónica views on the design, architecture, and technology of 4G/5G Open RA...
 

FPGAs for Supercomputing: The Why and How

  • 1. FPGAs for Supercomputing: The Why and How. Hal Finkel [2] (hfinkel@anl.gov), Kazutomo Yoshii [1], and Franck Cappello [1]. [1] Mathematics and Computer Science (MCS); [2] Leadership Computing Facility (ALCF); Argonne National Laboratory. Advanced Scientific Computing Advisory Committee, Tuesday, December 20, 2016, Washington, DC
  • 2. Outline ● Why are FPGAs interesting? ● Can FPGAs competitively accelerate traditional HPC workloads? ● Challenges and potential solutions to FPGA programming.
  • 3. For some things, FPGAs are really good! http://escholarship.org/uc/item/35x310n6 70x faster! bioinformatics
  • 4. For some things, FPGAs are really good! machine learning and neural networks http://ieeexplore.ieee.org/abstract/document/7577314/ FPGA is faster than both the CPU and GPU, 10x more power efficient, and a much higher percentage of peak!
  • 5. Parallelism Triumphs As We Head Toward Exascale [Figure: relative transistor performance and system performance, 1986 to 2021, spanning the Giga, Tera, Peta, and Exa eras; successive steps are annotated 32x from transistor and 32x from parallelism, then 8x from transistor and 128x from parallelism, then 1.5x from transistor and 670x from parallelism] System performance from parallelism http://www.socforhpc.org/wp-content/uploads/2015/06/SBorkar-SoC-WS-DAC-June-7-2015-v1.pptx
  • 6. (Maybe) It's All About the Power... Do FPGAs perform less data movement per computation? http://science.energy.gov/~/media/ascr/ascac/pdf/meetings/201604/McCormick-ASCAC.pdf
  • 7. To Decrease Energy, Move Data Less! On-die Data Movement vs Compute ● Interconnect energy (per mm) reduces slower than compute ● On-die data movement energy will start to dominate [Figure: on-die interconnect energy/mm and compute energy vs. technology node, 90 nm down to 7 nm; annotations: 6X, 60%. Source: Intel] http://www.socforhpc.org/wp-content/uploads/2015/06/SBorkar-SoC-WS-DAC-June-7-2015-v1.pptx https://www.semiwiki.com/forum/content/6160-2016-leading-edge-semiconductor-landscape.html
  • 8. Compute vs. Movement – Changes Afoot http://iwcse.phys.ntu.edu.tw/plenary/HorstSimon_IWCSE2013.pdf (2013)
  • 10. Where Does the Power Go (CPU)? http://link.springer.com/article/10.1186/1687-3963-2013-9 (Model with (# register files) x (read ports) x (write ports)) Fetch and decode take most of the energy! More centralized register files means more data movement which takes more power. Only a small portion of the energy goes to the underlying computation. See also: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-2008-130.pdf
  • 11. Modern FPGAs: DSP Blocks and Block RAM http://yosefk.com/blog/category/hardware Design mapped (Place & Route) Intel Stratix 10 will have up to: ● 5760 DSP Blocks = 9.2 SP TFLOPS ● 11721 20Kb Block RAMs = 28MB ● 64-bit 4-core ARM @ 1.5 GHz https://www.altera.com/products/fpga/stratix-series/stratix-10/features.html DSP blocks multiply (Intel/Altera FPGAs have full SP FMA)
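The headline figures on this slide can be checked with simple arithmetic. The sketch below assumes each DSP block retires one single-precision FMA (counted as 2 floating-point operations) per cycle at 800 MHz; that clock is the maximum quoted on the later per-device peak slide, not on this one. It also assumes 1 Kb = 1024 bits. The function names are illustrative only.

```cpp
#include <cassert>

// Peak single-precision GFLOPS if every DSP block completes one FMA per cycle.
// An FMA is counted as 2 floating-point operations (assumption stated above).
inline double peak_sp_gflops(int dsp_blocks, double clock_ghz) {
    assert(dsp_blocks >= 0 && clock_ghz >= 0.0);
    return dsp_blocks * 2.0 * clock_ghz;
}

// Total block RAM capacity in MiB for `blocks` memories of
// `kbits_per_block` kilobits each (1 Kb taken as 1024 bits).
inline double block_ram_mib(int blocks, int kbits_per_block) {
    assert(blocks >= 0 && kbits_per_block >= 0);
    const double kbytes = blocks * static_cast<double>(kbits_per_block) / 8.0;
    return kbytes / 1024.0;
}
```

Under these assumptions, `peak_sp_gflops(5760, 0.8)` gives about 9216 GFLOPS, matching the 9.2 SP TFLOPS quoted, and `block_ram_mib(11721, 20)` gives about 28.6 MiB, in line with the 28MB figure.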
  • 12. GFLOPS/Watt (Single Precision) [Bar chart: GFLOPS/Watt, axis 0 to 120, for Intel Skylake, Intel Knights Landing, NVIDIA Pascal, Altera Stratix 10, and Xilinx Virtex Ultrascale+] ● http://wccftech.com/massive-intel-xeon-e5-xeon-e7-skylake-purley-biggest-advancement-nehalem/ - Taking 165 W max range ● http://cgo.org/cgo2016/wp-content/uploads/2016/04/sodani-slides.pdf ● http://www.xilinx.com/applications/high-performance-computing.html - Ultrascale+ figure inferred by a 33% performance increase (from Hotchips presentation) ● https://devblogs.nvidia.com/parallelforall/inside-pascal/ ● https://www.altera.com/products/fpga/stratix-series/stratix-10/features.html Marketing numbers for unreleased products… (be skeptical) Do these FPGA numbers include system memory?
  • 13. GFLOPS/Watt (Single Precision) – Let's be more realistic... [Bar chart: GFLOPS/Watt, axis 0 to 120, for Intel Skylake, Intel Knights Landing, NVIDIA Pascal, Altera Stratix 10, and Xilinx Virtex Ultrascale+] ● http://www.tomshardware.com/reviews/intel-core-i7-5960x-haswell-e-cpu,3918-13.html ● https://hal.inria.fr/hal-00686006v2/document ● http://www.eecg.toronto.edu/~davor/papers/capalija_fpl2014_slides.pdf - Tile approach yields 75% of peak clock rate on full device Conclusion: FPGAs are a competitive HPC accelerator technology by 2017! 90% of peak on a CPU is excellent! 70% of peak on a GPU is excellent! Plus system memory: assuming 6 W for 16 GB DDR4 (and 150 W for the FPGA)
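A "more realistic" efficiency figure of this kind can be sketched as sustained throughput divided by device power plus memory power. The function below is only an illustration: the name and parameters are hypothetical, and the example inputs (9216 GFLOPS peak, 80% of peak, 150 W device, 6 W memory) are assembled from numbers quoted across these slides rather than taken from the plotted bars.

```cpp
#include <cassert>

// Sustained GFLOPS per watt, charging the accelerator for its system memory.
// fraction_of_peak is the share of theoretical peak assumed achievable (0..1).
inline double gflops_per_watt(double peak_gflops, double fraction_of_peak,
                              double device_watts, double memory_watts) {
    assert(fraction_of_peak >= 0.0 && fraction_of_peak <= 1.0);
    assert(device_watts + memory_watts > 0.0);
    return (peak_gflops * fraction_of_peak) / (device_watts + memory_watts);
}
```

With those illustrative inputs, `gflops_per_watt(9216.0, 0.8, 150.0, 6.0)` comes out at roughly 47 GFLOPS/Watt; the value actually plotted on the slide may rest on different assumptions.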
  • 14. GFLOPS/device (Single Precision) [Bar chart: GFLOPS, axis 0 to 12000, for Intel Skylake, Intel Knights Landing, NVIDIA Pascal, Altera Stratix 10, and Xilinx Virtex Ultrascale+] ● https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/pt/stratix-10-product-table.pdf - Largest variant with all DSPs doing FMAs @ the 800 MHz max ● http://www.xilinx.com/support/documentation/ip_documentation/ru/floating-point.html ● http://www.xilinx.com/support/documentation/selection-guides/ultrascale-plus-fpga-product-selection-guide.pdf - LUTs, not DSPs, are the limiting resource – filling device with FMAs @ 1 GHz ● https://devblogs.nvidia.com/parallelforall/inside-pascal/ ● http://wccftech.com/massive-intel-xeon-e5-xeon-e7-skylake-purley-biggest-advancement-nehalem/ - 28 cores @ 3.7 GHz * 16 FP ops per cycle * 2 for FMA (assuming same clock rate as the E5-1660 v2) ● http://cgo.org/cgo2016/wp-content/uploads/2016/04/sodani-slides.pdf All in theory...
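The Skylake estimate quoted in the source notes above is a plain product of four factors. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the inputs are the slide's own theoretical assumptions, not measurements.

```cpp
#include <cassert>

// Theoretical peak single-precision GFLOPS for a multicore CPU:
// cores * clock (GHz) * SIMD FP operations per cycle * 2 when FMA is available.
inline double cpu_peak_sp_gflops(int cores, double clock_ghz,
                                 int fp_ops_per_cycle, bool has_fma) {
    assert(cores > 0 && clock_ghz > 0.0 && fp_ops_per_cycle > 0);
    const double fma_factor = has_fma ? 2.0 : 1.0;
    return cores * clock_ghz * fp_ops_per_cycle * fma_factor;
}
```

Calling `cpu_peak_sp_gflops(28, 3.7, 16, true)` reproduces the slide's assumption of 28 cores at 3.7 GHz, which works out to about 3315 GFLOPS.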
  • 15. GFLOPS/device (Single Precision) – Let's be more realistic... [Bar chart: GFLOPS, axis 0 to 12000, for Intel Skylake, Intel Knights Landing, NVIDIA Pascal, Altera Stratix 10, and Xilinx Virtex Ultrascale+] ● https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/wp/wp-01222-understanding-peak-floating-point-performance-claims.pdf ● https://www.altera.com/en_US/pdfs/literature/wp/wp-01028.pdf (old but still useful) 90% of peak on a CPU is excellent! 70% of peak on a GPU is excellent! 80% usage at peak frequency of an FPGA is excellent! Xilinx has no hard FP logic... Reserving 30% of the LUTs for other purposes.
  • 16. For FPGAs, Parallelism is Essential [Figure: CPU/GPU vs. FPGA comparison; panels labeled 90 nm, 90 nm, and 65 nm] http://rssi.ncsa.illinois.edu/proceedings/academic/Williams.pdf (2008)
  • 17. An experiment... FPGA: ● Nallatech 385A Arria 10 board ● 200 – 300 MHz (depends on the design) ● 20 nm ● two DRAM channels, 34.1 GB/s peak CPU: ● Sandy Bridge E5-2670 ● 2.6 GHz (3.3 GHz w/ turbo) ● 32 nm ● four DRAM channels, 51.2 GB/s peak
  • 18. An experiment: Power is Measured... ● Intel RAPL is used to measure CPU energy – CPU and memory ● Yokogawa WT310, an external power meter, is used to measure the FPGA power – FPGA_pwr = meter_pwr - host_idle_pwr + FPGA_idle_pwr (~17 W) – Note that meter_pwr includes both CPU and FPGA
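The FPGA power correction on this slide can be written as a one-line function. The sketch below assumes, as the formula suggests, that the idle-host reading already contains the idle FPGA board, so the FPGA's idle draw has to be added back after the subtraction. The function and parameter names are hypothetical.

```cpp
#include <cassert>

// FPGA_pwr = meter_pwr - host_idle_pwr + FPGA_idle_pwr, all in watts.
// meter_w:     wall-meter reading while the kernel runs (CPU and FPGA together)
// host_idle_w: wall-meter reading with the whole host idle
// fpga_idle_w: idle draw of the FPGA board alone (about 17 W on the slide)
inline double fpga_power_watts(double meter_w, double host_idle_w,
                               double fpga_idle_w) {
    assert(meter_w >= 0.0 && host_idle_w >= 0.0 && fpga_idle_w >= 0.0);
    return meter_w - host_idle_w + fpga_idle_w;
}
```

For example, with made-up readings of 120 W under load and 90 W idle, `fpga_power_watts(120.0, 90.0, 17.0)` attributes 47 W to the FPGA.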
  • 19. An experiment: Random Access with Computation using OpenCL ● # work-units is 256 ● CPU: Sandy Bridge (4ch memory) ● FPGA: Arria 10 (2ch memory) for (int i = 0; i < M; i++) { double8 tmp; index = rand() % len; tmp = array[index]; sum += (tmp.s0 + tmp.s1) / 2.0; sum += (tmp.s2 + tmp.s3) / 2.0; sum += (tmp.s4 + tmp.s5) / 2.0; sum += (tmp.s6 + tmp.s7) / 2.0; }
  • 20. An experiment: Random Access with Computation using OpenCL ● # work-units is 256 ● CPU: Sandy Bridge (2ch memory) ● FPGA: Arria 10 (2ch memory) for (int i = 0; i < M; i++) { double8 tmp; index = rand() % len; tmp = array[index]; sum += (tmp.s0 + tmp.s1) / 2.0; sum += (tmp.s2 + tmp.s3) / 2.0; sum += (tmp.s4 + tmp.s5) / 2.0; sum += (tmp.s6 + tmp.s7) / 2.0; } Make the comparison more fair...
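For readers who want to try the access pattern on a host, here is a plain C++ reference sketch of the loop shown on these two slides. It is not the OpenCL kernel itself: `Double8` is a stand-in for OpenCL's `double8`, `std::mt19937` replaces `rand()`, and the function name and seed parameter are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

using Double8 = std::array<double, 8>;  // stand-in for OpenCL double8

// Gather one 64-byte element at a random index per iteration and fold its
// four adjacent pairs into a running sum, mirroring the slide's loop body.
inline double random_access_sum(const std::vector<Double8>& array,
                                std::size_t iterations, std::uint32_t seed) {
    assert(!array.empty());
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> pick(0, array.size() - 1);
    double sum = 0.0;
    for (std::size_t i = 0; i < iterations; ++i) {
        const Double8& tmp = array[pick(rng)];  // the random, cache-unfriendly load
        sum += (tmp[0] + tmp[1]) / 2.0;
        sum += (tmp[2] + tmp[3]) / 2.0;
        sum += (tmp[4] + tmp[5]) / 2.0;
        sum += (tmp[6] + tmp[7]) / 2.0;
    }
    return sum;
}
```

Each iteration performs one dependent-free random load followed by a handful of floating-point operations, which is why the test is sensitive to memory latency and the number of DRAM channels rather than to raw compute.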
  • 21. FPGAs – Molecular Dynamics – Strong Scaling Again! Martin Herbordt (Boston University)
  • 22. FPGAs – Molecular Dynamics – Strong Scaling Again! Martin Herbordt (Boston University)
  • 23. High-End CPU + FPGA Systems Are Coming... ● Intel/Altera are starting to produce Xeon + FPGA systems ● Xilinx are producing ARM + FPGA systems These are not just embedded cores, but state-of-the-art multicore CPUs Low latency and high bandwidth CPU + FPGA systems fit nicely into the HPC accelerator model! (“#pragma omp target” can work for FPGAs too) https://www.nextplatform.com/2016/03/14/intel-marrying-fpga-beefy-broadwell-open-compute-future/
  • 24. Common Algorithm Classes in HPC http://crd.lbl.gov/assets/pubs_presos/CDS/ATG/WassermanSOTON.pdf
  • 25. Common Algorithm Classes in HPC – What do they need? http://crd.lbl.gov/assets/pubs_presos/CDS/ATG/WassermanSOTON.pdf
  • 26. FPGAs Can Help Everyone! Compute Bound (FPGAs have lots of compute) Memory-Latency Bound (FPGAs can pipeline deeply) Memory-Bandwidth Bound (FPGAs can do on-the-fly compression) FPGAs have lots of registers. FPGAs have lots of embedded memory.
  • 27. FPGA Programming: Levels of Abstraction ● High-level languages (OpenMP, OpenACC, etc.), lowered source to source ● Behavior level: C, C++, SystemC, OpenCL, lowered by high-level synthesis ● RT level (VHDL, Verilog): datapath and controller, lowered by logic synthesis ● Gate level (netlist), lowered by place & route ● Bitstream [Diagram label: Altera/Xilinx toolchains] Derived from Deming Chen’s slide (UIUC).
  • 28. FPGA Programming Techniques (ordered from lowest risk and lowest user difficulty to highest risk and highest user difficulty) ● Use FPGAs as accelerators through (vendor-)optimized libraries ● Use of FPGAs through overlay architectures (pre-compiled custom processors) ● Use of FPGAs through high-level synthesis (e.g. via OpenMP) ● Use of FPGAs through programming in Verilog/VHDL (the FPGA “assembly language”)
  • 29. Beware of Compile Time... ● Compiling a full design for a large FPGA (synthesis + place & route) can take many hours! ● Tile-based designs can help, but can still take tens of minutes! ● Overlay architectures (pre-compiled custom processors and on-chip networks) can help... [Decision diagram] Is the kernel really important in this application? Options: traditional compilation for an optimized overlay architecture, or high-level synthesis to generate custom hardware.
  • 30. Overlay (iDEA) https://www2.warwick.ac.uk/fac/sci/eng/staff/saf/publications/fpt2012-cheah.pdf ● A very small CPU. ● Runs near the peak clock rate of the block RAM / DSP block! ● Makes use of dynamic configuration of the DSP block.
  • 31. Overlay (DeCO) https://www2.warwick.ac.uk/fac/sci/eng/staff/saf/publications/fccm2016-jain.pdf ● Also spatial computing, but with much coarser resources. ● Place & Route is much faster! ● Performance is very good. Each of these is a small soft CPU.
  • 32. A Toolchain using HLS in Practice? Compiler (C/C++/Fortran) Executable Extract parallel regions and compile for the host in the usual way High-level Synthesis Place and Route If placement and routing takes hours, we can't do it this way!
  • 33. A Toolchain using HLS in Practice? Compiler (C/C++/Fortran) Executable Extract parallel regions and compile for the host in the usual way High-level Synthesis Place and Route Some kind of token
  • 34. Challenges Remain... ● OpenMP 4 technology for FPGAs is in its infancy (even less mature than the GPU implementations). ● High-level synthesis technology has come a long way, but is only now starting to deliver performance competitive with hand-programmed HDL designs. ● CPU + FPGA systems with cache-coherent interconnects are very new. ● High-performance overlay architectures have been created in academia, but none targeting HPC workloads. High-performance on-chip networks are tricky. ● No one has yet created a complete HPC-practical toolchain. On GPUs, the theoretical maximum performance for many algorithms is 50-70% of peak. This is lower than on CPU systems, but CPU systems have higher overhead. In theory, FPGAs offer a high percentage of peak and low overhead, but can that be realized in practice?
  • 35. Conclusions ✔ FPGA technology offers the most-promising direction toward higher FLOPS/Watt. ✔ FPGAs, soon combined with powerful CPUs, will naturally fit into our accelerator-infused HPC ecosystem. ✔ FPGAs can compete with CPUs/GPUs on traditional workloads while excelling at bioinformatics, machine learning, and more! ✔ Combining high-level synthesis with overlay architectures can address FPGA programming challenges. ✔ Even so, pulling all of the pieces together will be challenging! ➔ ALCF is supported by DOE/SC under contract DE-AC02-06CH11357
  • 43. How do we express parallelism? http://llvm-hpc2-workshop.github.io/slides/Tian.pdf
  • 44. How do we express parallelism - MPI+X? http://llvm-hpc2-workshop.github.io/slides/Tian.pdf
  • 45. OpenMP Evolving Toward Accelerators http://llvm-hpc2-workshop.github.io/slides/Tian.pdf New in OpenMP 4
  • 46. OpenMP Accelerator Support – An Example (SAXPY) http://llvm-hpc2-workshop.github.io/slides/Wong.pdf
  • 47. OpenMP Accelerator Support – An Example (SAXPY) http://llvm-hpc2-workshop.github.io/slides/Wong.pdf Memory transfer if necessary. Traditional CPU-targeted OpenMP might only need this directive!
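The SAXPY code on these two slides is an image and did not survive in the transcript. As a rough reconstruction of the idea, not the slide's actual listing, the sketch below offloads SAXPY with an OpenMP 4 `target` construct; the `map` clauses describe the memory transfers that happen only if the device has separate memory. The function name and signature are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y = a * x + y, offloaded to an accelerator when one is available.
// Without OpenMP enabled, the pragma is ignored and the loop runs serially.
inline void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    const float* xp = x.data();
    float* yp = y.data();
    // map(to: ...) copies x to the device; map(tofrom: ...) copies y both ways.
    #pragma omp target teams distribute parallel for map(to: xp[0:n]) map(tofrom: yp[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

A traditional CPU-targeted version would typically need only `#pragma omp parallel for` on the same loop, which is the point the slide annotation makes.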
  • 48. HPC-relevant Parallelism is Coming to C++17! http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4071.htm using namespace std::execution::parallel; int a[] = {0,1}; for_each(par, std::begin(a), std::end(a), [&](int i) { do_something(i); }); void f(float* a, float*b) { ... for_each(par_unseq, begin, end, [&](int i) { a[i] = b[i] + c; }); } The “par_unseq” execution policy allows for vectorization as well. Almost as concise as OpenMP, but in many ways more powerful!
  • 49. HPC-relevant Parallelism is Coming to C++17! http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4071.htm
  • 50. Clang/LLVM Where do we stand now? Clang (OpenMP 4 support nearly done) Intel, IBM, and others finishing target offload support LLVM Polly (Polyhedral optimizations) SPIR-V (Prototypes available, but only for LLVM 3.6) Vendor tools not yet ready C C backend not upstream. There is a relatively-recent version on github. Vendor HLS / OpenCL tools Generate VHDL/Verilog directly?
  • 51. Current FPGA + CPU System http://www.panoradio-sdr.de/sdr-implementation/fpga-software-design/ Xilinx Zynq 7020 has two ARM Cortex A9 cores. 53,200 LUTs 560 KB SRAM 220 DSP slices
  • 52. Interconnect Energy: Interconnect Structures ● Shared bus (buses over short distance): 1 to 10 fJ/bit, 0 to 5 mm, limited scalability ● Multi-ported memory (shared memory): 10 to 100 fJ/bit, 1 to 5 mm, limited scalability ● X-Bar (crossbar switch): 0.1 to 1 pJ/bit, 2 to 10 mm, moderate scalability ● Packet-switched network: 1 to 3 pJ/bit, >5 mm, scalable http://www.socforhpc.org/wp-content/uploads/2015/06/SBorkar-SoC-WS-DAC-June-7-2015-v1.pptx
  • 53. CPU and GPU Trends https://www.hpcwire.com/2016/08/23/2016-important-year-hpc-two-decades/ [Figure: trend charts with KNL marked]
  • 54. CPU vs. FPGA Efficiency http://authors.library.caltech.edu/1629/1/DEHcomputer00.pdf CPU and FPGA achieve maximum algorithmic efficiency at polar opposite sides of the parameter space!