Building affordable and programmable exascale-capable computers
SURFsara, 18 April 2019
Georgi Gaydadjiev, Director of Maxeler IoT-Labs BV, Delft
Honorary Visiting Professor at the Department of Computing, Imperial College London
Think Ångströms, not nanometers, in 2019
Ideally, we would steer the movement of almost every individual electron to solve our problems.
[Scale figure: a logarithmic length axis from 1 Å to 10⁴ Å, locating the C-C bond, glucose, hemoglobin, the ribosome, DNA and cells against the resolution of a light microscope. 0.1 nm = 1 Å, so 14 nm = 140 Å: 100s of Si atoms span a 14 nm feature, while 3 nm / 30 Å is only 6 to 12 atoms.]
The power challenge

Moving data on-chip will use as much energy as computing with it; moving data off-chip will use 200x more energy, and is much slower as well.
Energy per operation              Today*     2020
Double-precision float op         ~20 pJ     <10 pJ
Moving data on-chip: 1 mm         6 pJ
Moving data on-chip: 20 mm        120 pJ
Moving data to off-chip memory    4,000 pJ   2,000 pJ
The data movement challenge

Next-generation computer systems should take data movement at all levels very seriously.
Data movement challenge confirmed

The wires that carry the data (and instructions) become more and more important. (Courtesy: NVIDIA and ITRS)
Why is all of this important?

“… without dramatic increases in efficiency, ICT industry could use 20% of all electricity and emit up to 5.5% of the world’s carbon emissions by 2025.”

“We have a tsunami of data approaching. Everything which can be is being digitalised. It is a perfect storm.”

“… a single $1bn Apple data centre planned for Athenry in Co Galway, expects to eventually use 300MW of electricity, or over 8% of the national capacity and more than the daily entire usage of Dublin. It will require 144 large diesel generators as back up for when the wind does not blow.”
Build Computers for your Problem and Data

Computing in Time: follow a recipe step by step, one step at a time.

Computing in Space: build a “recipe-specific” factory with multiple paths performed simultaneously, producing one result per clock cycle. Efficient, predictable, reliable “mass production” of huge amounts of data.
Programming a Dataflow “mass production” Engine

1. Describe Conjugate Gradient as a dataflow graph
2. Compile the dataflow structure and load it to hardware
3. Stream data through the custom accelerator

Create customized mega-accelerators with massive inherent throughput.
Solving Computing Problems Vertically

[Figure: the vertical stack Problem → Algorithm → Program in HL Language → Machine Architecture → Implementation → Circuits → Devices, with solutions spanning the layers.]

Co-optimise the HW and the SW stack for the performance-critical areas of the application.
From Equations to Dataflow Hardware
[Figure: a partial differential equation in u (with terms in p, F and a logarithm) and the dataflow hardware it maps to.]
Real dataflow graph as generated by MaxCompiler: 4,866 nodes; 10,000s of stages/cycles.

Full customization in Space, Value and Time (SVT).
Easy it is not (and not really new)
Slotnick’s law (of effort):
“The parallel approach to computing does require that
some original thinking be done about numerical
analysis and data management in order to secure
efficient use.
In an environment which has represented the
absence of the need to think as the highest virtue this
is a decided disadvantage.”
Daniel Slotnick (1931-1985)
Chief Architect of Illiac IV
Programming in Space basics
• Control and Data-flows are decoupled
– Both are fully programmable
• Operations exist in space and by default run in parallel
– Their number is limited only by the available space
• All operations can be customized at various levels
– e.g., from the algorithm down to the number representation (see the sketch after this list)
• Multiple operations constitute kernels
• Data streams through the operations / kernels
• The data transport and processing can be balanced
• All resources work all of the time for max performance
• The In/Out data rates determine the operating frequency
Equally spread the available “forces” and move no faster than required by the application
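To make the “down to the number representation” point concrete, here is an illustrative MaxJ fragment (a sketch only: the stream names and the narrow 24-bit float are invented for this example; dfeFloat and dfeUInt themselves appear in the kernels later in this deck):

// Illustrative only: per-operand format choice in MaxJ.
DFEVar aSingle = io.input("a", dfeFloat(8, 24));  // IEEE-style single precision
DFEVar bCustom = io.input("b", dfeFloat(7, 17));  // narrower, application-chosen float
DFEVar count   = io.input("n", dfeUInt(16));      // integer sized to the data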
The Computational Model

• Dataflow sub-system (DataFlow Engine, DFE)
– Spatial arithmetic chip “hardware” technology with flexible arithmetic units and programmable interconnect (looks like FPGAs, but is not limited to them)
– Programmable Static Dataflow
– Systolic execution at kernel level
– Streaming custom computing at system level
– Implicit GALS* IO and kernel-to-kernel communication
• Dedicated software (MaxCompiler, MaxelerOS and SLiC)
– Compilation toolchain and design methodology
– Integrated simulation and debug environment for rapid development
– Fully Linux-integrated runtime system and low-level software support
– Helps the designer focus on the data/algorithm and the system architecture
• Only three basic memory types (explicitly exposed; see the sketch below)
– Scalars (exposed to the CPU)
– Fast Memory (FMEM): small and fast (on-chip)
– Large Memory (LMEM): large and slow (off-chip)

* GALS – Globally Asynchronous Locally Synchronous
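A kernel-side sketch of the first two memory types, reusing only the calls that appear in the resource-usage report later in this deck (io.scalarInput and mem.romMapped); LMEM, by contrast, is reached through manager-configured streams rather than from kernel code:

DFEVar q      = io.input("q", dfeUInt(8));            // streaming input
DFEVar offset = io.scalarInput("offset", dfeUInt(8)); // scalar: set by the CPU
DFEVar addr   = offset + q;
DFEVar v      = mem.romMapped("table", addr,          // FMEM: small, fast,
                              dfeFloat(8, 24), 256);  // on-chip 256-entry table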
Maxeler’s DataFlow Engine (DFE, MAX4)

[Figure: DFE card with a reconfigurable compute fabric (dataflow cores & FMEM, Fast Memory), LMEM (Large Memory, 4-96GB) over a high-bandwidth memory link, MaxRing interconnect links, and a link to the main data network (e.g., PCIe, Infiniband).]

MAX4 highlights:
• 48GB DRAM (LMEM)
• Stratix V D8
• MaxRing interconnect
• 4,000 multipliers
• 700K logic cells
• 6.25MB of FMEM
Application Level Components

[Figure: the host CPU (application in C, Python, Matlab, …, on top of SLiC and MaxelerOS, with its own memory) connected over PCI Express or Infiniband to the DFE (with its own memory), where MaxJ kernels instantiate the arithmetic structure and a MaxJ manager arranges the data orchestration.]
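To sketch how the kernel and manager sides meet, the fragment below wires a hypothetical kernel’s streams to the host link. Treat it as an assumption-laden sketch: the CustomManager class, the addStreamFromCPU/addStreamToCPU calls and the <== stream-connection operator follow the MaxCompiler manuals, and MyKernel and its stream names are invented here:

class MyManager extends CustomManager {
    MyManager(EngineParameters params) {
        super(params);
        // Instantiate the arithmetic structure ...
        KernelBlock k = addKernel(new MyKernel(makeKernelParameters("MyKernel")));
        // ... and arrange the data orchestration: CPU -> kernel -> CPU.
        k.getInput("x") <== addStreamFromCPU("x");
        addStreamToCPU("y") <== k.getOutput("y");
    }
}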
MaxJ: Moving Average of three numbers
Dataflow computing in hardware using a language you know
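The transcript does not carry the slide’s code; below is a sketch of the classic three-point moving average in MaxJ, assuming the stream.offset call from the MaxCompiler tutorials (boundary handling at the ends of the stream is omitted):

class MovingAverageKernel extends Kernel {
    MovingAverageKernel(KernelParameters parameters) {
        super(parameters);
        DFEVar x = io.input("x", dfeFloat(8, 24));
        // Look one element backwards and one forwards in the stream.
        DFEVar prev = stream.offset(x, -1);
        DFEVar next = stream.offset(x, +1);
        DFEVar result = (prev + x + next) / 3;
        io.output("y", result, dfeFloat(8, 24));
    }
}

Because the three taps exist side by side in space, the average still leaves the kernel at one result per clock cycle.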
Simple example: y = x² + 30

[Figure: dataflow graph with input x feeding a multiplier (x times x), an adder with constant 30, and output y.]

DFEVar x = io.input("x", dfeFloat(10,31));
DFEVar result = x * x + 30;
io.output("y", result, dfeFloat(10,31));
MaxJ example: Control in Space

[Figure: dataflow graph in which the comparison x > 10 selects between x + 1 and x - 1 to produce y.]

class SimpleKernel extends Kernel {
    SimpleKernel() {
        DFEVar x = io.input("x", dfeInt(24));
        DFEVar result = (x > 10) ? x + 1 : x - 1;
        io.output("y", result, dfeInt(25));
    }
}
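The ternary operator above compiles to a multiplexer in space rather than a branch in time; assuming the control.mux call from the MaxCompiler documentation, the same graph can be written explicitly (a sketch, not the slide’s code):

DFEVar select = x > 10;                            // the comparison is a 1-bit stream
DFEVar result = control.mux(select, x - 1, x + 1); // select == 0 picks x - 1, 1 picks x + 1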
Non-Traditional Design Process

[Figure: the design loop ANALYSE → ARCHITECT → PROGRAM → GENERATE DATAFLOW → SIMULATE AND DEBUG, iterating (“OK? many hours …”) before producing the custom HW.]

Used to build real systems; however, it is very difficult to learn and to teach.
Optimizations at all levels

Multiple scales of computing; important features for optimization:

• complete system level → balance compute, storage and IO
• parallel node level → maximize utilization of compute and interconnect
• microarchitecture level → minimize data movement
• arithmetic level → trade off range, precision and accuracy (= discretize in Time, Space and Value)
• bit level → encode and add redundancy
• transistor level → manipulate ‘0’ and ‘1’

And more, e.g., trade/hide Communication (Time) for/behind Computation (Space).
Some of the challenges

1. Higher chip/system price compared to microprocessors
2. Long design lead times (3 months in the best case)
a. Complex numerical transformations
b. Non-trivial area and data movement optimizations
3. “Painful” Place & Route times (12 to 24 hours)
a. Expensive vendor-specific tools
b. Serious developments require dedicated build clusters
4. Need to compete at 200MHz with processors at 3GHz
5. Current HW technology is sub-optimal
• On-chip memory not built for stream processing
• On-chip interconnect overdesigned for dataflow
6. Long learning curve (tools and methods needed)
7. Designer’s productivity should improve (tools and methods)
8. …

Ongoing effort on improved methodologies and tools.
Multiple platforms, single DFE abstraction

[Figure: the same DFE block diagram (reconfigurable compute fabric with dataflow cores & FMEM, LMEM 4-96GB, MaxRing links, link to the main data network) instantiated on two generations, gen4 (Intel based) and gen5 (Xilinx based); one application written in MaxJ migrates between them with portable performance.]
Substrate-Agnostic Compilation

• MaxCompiler generates VHDL ready for the FPGA vendor tools
• Synthesis transforms the VHDL into a logical “netlist”: sets of basic logic expressions
• Map fits the basic logic into N-input look-up tables (LUTs)
• Place puts LUTs, DSPs, RAMs, etc. at specific locations on the chip
• Route sets up the wiring between blocks

[Figure: the tool flow MaxCompiler compilation → VHDL → Synthesis → complete netlist → Map → LUTs → Place → placed FPGA → Route → complete FPGA → Generate Maxfile.]
DFE Place and Route example
Mon 16:27: MaxCompiler version: 2012.2
Mon 16:27: Build "MyKernel" start time: Mon Apr 08 16:27:24 BST 2013
Mon 16:27: Main build process running as user training1 on host Maxworkstation7478
Mon 16:27: Build location: /home/training1/maxcompiler-builds/MyKernel
Mon 16:27: Instantiating manager
Mon 16:27: Instantiating kernel "MyKernel"
Mon 16:27: Compiling manager (CPU I/O Only)
Mon 16:27: Compiling kernel "MyKernel"
Mon 16:27: Generating input files (VHDL, netlists, CoreGen)
Mon 16:27: Running back-end build (12 phases)
Mon 16:27: (1/12) - Prepare MaxFile Data (GenerateMaxFileDataFile)
Mon 16:27: (2/12) - Synthesize DFE Modules (XST)
Mon 16:30: (3/12) - Link DFE Modules (NGCBuild)
Mon 16:30: (4/12) - Prepare for Resource Analysis (EDIF2MxruBuildPass)
Mon 16:30: (5/12) - Generate Preliminary Annotated Source Code
Mon 16:30: (6/12) - Report Resource Usage (ResourceCounter)
Mon 16:30: About to start chip vendor Map/Place/Route toolflow. This will take some time.
Mon 16:30: (7/12) - Prepare for Placement (NGDBuild)
Mon 16:30: (8/12) - Place and Route DFE (MPPR)
Mon 16:30: Executing MPPR with 1 cost tables and 1 threads.
Mon 16:30: MPPR: Starting 1 cost table
Mon 16:43: MPPR: Cost table 1 met timing with score 0 (best score 0)
Mon 16:43: (9/12) - Prepare for Resource Analysis (XDLBuild)
Mon 16:44: (10/12) - Generate Resource Report (ResourceUsageBuildPass)
Mon 16:44: (11/12) - Generate Annotated Source Code (ResourceAnnotationBuildPass)
Mon 16:44: (12/12) - Generate MaxFile (GenerateMaxFile)
Mon 16:45:
Mon 16:45: FINAL RESOURCE USAGE
Mon 16:45: LUTs: 9503 / 149760 (6.35%)
Mon 16:45: FFs: 12749 / 149760 (8.51%)
Mon 16:45: BRAMs: 34 / 516 (6.59%)
Mon 16:45: DSPs: 0 / 1056 (0.00%)
Mon 16:45:
Mon 16:45: MaxFile: /home/training1/maxcompiler-builds/MyKernel/results/MyKernel.max
(MD5Sum: e564cd922aeeda04acfa2f4ecce8236d)
Mon 16:45: Build completed: Mon Apr 08 16:45:58 BST 2013 (took 18 mins, 33 secs)
FPGA vendor-specific back-end tool flow, abstracted by MaxCompiler.
DFE Resource Usage Reporting

• Shows which lines of code use which resources, so optimization effort can be focused
• Separate reports for each kernel and for the manager
LUTs FFs BRAMs DSPs : MyKernel.java
727 871 1.0 2 : resources used by this file
0.24% 0.15% 0.09% 0.10% : % of available
71.41% 61.82% 100.00% 100.00% : % of total used
94.29% 97.21% 100.00% 100.00% : % of user resources
:
: public class MyKernel extends Kernel {
: public MyKernel (KernelParameters parameters) {
: super(parameters);
1 31 0.0 0 : DFEVar p = io.input("p", dfeFloat(8,24));
2 9 0.0 0 : DFEVar q = io.input("q", dfeUInt(8));
: DFEVar offset = io.scalarInput("offset", dfeUInt(8));
8 8 0.0 0 : DFEVar addr = offset + q;
18 40 1.0 0 : DFEVar v = mem.romMapped("table", addr,
: dfeFloat(8,24), 256);
139 145 0.0 2 : p = p * p;
401 541 0.0 0 : p = p + v;
: io.output("r", p, dfeFloat(8,24));
: }
: }
[Figure: chip floorplan highlighting DSP blocks, block RAMs, IO blocks and LUTs/FFs; different operations use different resources.]
Optimization Feedback

• MaxCompiler gives detailed latency and area annotations back to the programmer
• Evaluate the precise effect of code on latency and chip area

[Figure: annotated dataflow pipeline; 12.8 ns + 6.4 ns = 19.2 ns total compute latency.]
Pilot System Deployed at Jülich

Small pilot system deployed in Oct 2017:
• one 1U MPC-X with 8 MAX5 DFEs
• one 1U AMD EPYC based server
• one 1U login head node

Scaling using the Amazon AWS cloud:
• MAX5 fully compatible with F1 instances
• Elastic scaling between private and public

[Figure: remote users connect at 10 Gbps to a head/build node (ipmi) and a Supermicro EPYC node (1TB DDR4), which links to the MPC-X node with MAX5 DFEs over 2x Infiniband @ 56 Gbps.]

http://www.prace-ri.eu/pcp/
PRACE-PCP: SpecFEM3D on DFE
The BQCD Chip - AERIAL VIEW
Scalable Conjugate Gradient Design for the CG step of BQCD
Achieved Performance on PRACE workloads

Problem   Case (Small/Large)       System (composition) (size)                      TTS [sec]   ETS [kWh]   DTS (F1)
BQCD      32x32x32x32              PRACE pilot (8 DFEs, 64 EPYC cores) (2U)         1,054       0.44        -
          64x64x64x64              1PF equivalent (48 DFEs, 512 EPYC cores) (14U)   1,703.8     4.26        $39.93
NEMO      GYRE6                    PRACE pilot (8 DFEs, 64 EPYC cores) (2U)         388         0.164       -
          GYRE144                  1PF equivalent (48 DFEs, 92 EPYC cores) (8U)     1,942       3.77        $42.72
SFM3D     1 chunk x64x64           PRACE pilot (8 DFEs, 64 EPYC cores) (2U)         232         0.096       -
          6 chunks x1,440x1,440    1PF equivalent (384 DFEs, 768 EPYC cores) (60U)  5,150       70.1        $1,267.2
QE        Al2O3                    PRACE pilot (8 DFEs, 64 EPYC cores) (2U)         32          0.013       -
          Ta2O5                    1PF equivalent (64 DFEs, 64 EPYC cores) (9U)     3,210       7.58        $94.16

(TTS: time to solution; ETS: energy to solution; DTS: dollar cost to solution on AWS F1.)
Global Weather Simulation with DFEs in China

⬥ L. Gan, H. Fu, W. Luk, C. Yang, W. Xue, X. Huang, Y. Zhang, and G. Yang, Accelerating solvers for global atmospheric equations through mixed-precision data flow engine, FPL 2013
⬥ Joint research with Imperial College and Tsinghua University
⬥ Simulating the atmosphere using the shallow water equations

Platform          Speedup   Efficiency
6-core CPU        1x        1x
Tianhe-1A node    23x       15x
Maxeler MPC-X     330x      145x

An order of magnitude improvement over Linpack-driven supercomputer technology.
Conclusions

• A (fancy) name does not help with solving the problem at hand
– Cloud, (Intelligent) Edge, Fog are just names, like … Maria
• FPGA is just a technology that can help bridge the gap to something better (Spatial Computing Acceleration HW, Quantum Processing, …)
– Just focus on building the best computer for the given job
• Learn, think, pioneer and always stay critical
– Abstraction is powerful but quite often not needed
– Use it with great care and remember Dan Slotnick
• We are turning Earth into a heterogeneous, planet-wide computer
– So we should try not to kill it in the process
• There is a lot of interest in this topic
Questions?

Contact me at: georgi@maxeler.com
Or find me on Google: “Georgi computer” should do.
Some links with more information
Maxeler Multiscale Dataflow Computing:
https://www.maxeler.com/technology/dataflow-computing/
Computing in Space explained by Mike Flynn:
http://www.openspl.org/what-is-openspl/
Computing in Space Course at Imperial College:
http://cc.doc.ic.ac.uk/openspl16/
Exciting Applications for DFEs (and JDFEs):
http://appgallery.maxeler.com
Maxeler DFEs on AWS EC2 F1:
https://aws.amazon.com/marketplace/seller-profile?id=2780c6ec-d326-47fc-9ff6-c66ab2ba202a
Maxeler and Xilinx Alveo collaboration:
https://www.xilinx.com/products/boards-and-kits/alveo.html
Maxeler Applications Gallery
Dataflow Engine (DFE) Ecosystem
⬥ With over 150 universities in our university program, we
decided to create an app gallery to enable the community
to share applications, examples, demos, …
⬥ The App Gallery is complemented by a teaching program, with the first successful course taught at Imperial College in 2014; see http://cc.doc.ic.ac.uk/openspl14
⬥ Top 10 APPS:
➢ Correlation: in real-time, pairwise, on 6,000 streams
➢ 100% Guaranteed Packet Capture
➢ Webserver, cache and load balancing
➢ HESTON Option pricer
➢ N-body simulation
➢ Regex matching (e.g. for Security)
➢ Brain network simulation
➢ Quantum Chromo-Dynamics kernel
➢ Seismic Imaging
➢ Realtime Classification
Dataflow Apps and Analytics for Machine Learning http://appgallery.maxeler.com/
Peer Reviewed Dataflow Publications
2008: Seismic Imaging with Dataflow Engines 25x faster, An Implementation of the Acoustic
Wave Equation, T. Nemeth et al, Chevron, Society of Exploration Geophysicists, Nov 2008.
2010: Credit Derivatives Valuation and Risk, from 8 hours to 2 minutes, American Finance
Technology Award, with JP Morgan.
2011: Modeling and Imaging with Schlumberger, Beyond Traditional Microprocessors for
Geoscience High-Performance Computing Applications, O. Lindtjörn et al, Schlumberger,
IEEE Micro, vol. 31, no. 2, March/April 2011.
2012: Weather Imaging with CRS4, 60x faster, Acceleration of a Meteorological Limited Area Model with Dataflow Engines, Diego Oriato†, Simon Tilbury†, Marino Marrocu§, Gabrielle Pusceddu§ (†Maxeler, §CRS4), 2012 Symposium on Application Accelerators in HPC.
2013: Convergence of Risk and Trading in partnership with CME Group and birth of OpenSPL
industry standard (www.openspl.org), In Cloud Computing it’s the Era of Convergence, Open
Markets Magazine, Ari Studnitzer, CME Group.
2014: Brain Simulation with Erasmus, Real-Time Olivary Neuron Simulations on Dataflow
Computing Machines, Georgios Smaragdos, Craig Davies, Christos Strydis, Ioannis Sourdis,
Catalin Ciobanu, Oskar Mencer, and Chris I. De Zeeuw, Supercomputing, Springer, pp. 487-497.
2017: High Energy Physics with Imperial, Using MaxCompiler for the high level synthesis
of trigger algorithms, S. Summers, A. Rose and P. Sanders, Journal of Instrumentation,
Volume 12, IOP Publishing.
Maxeler University Program
Maxeler Trophy Cabinet
Academic History since 2005
• Imperial College Research Excellence Award
• Top EPSRC Advanced Fellowship
• Two Best Paper Awards
• Early Dataflow paper by Maxeler’s Founder has been recognized as one of the most
influential papers at the FPL conference in the last 25 years.
Recent Commercial Awards
• HPCwire Editors Choice Award, November 2011.
• American Finance Technology Awards, New York, winner,
“Most Cutting Edge IT Initiative,” December 2011.
• Golden Arrow, “…for revolutionizing Computers,” COM-SULT, January 2012.
• Gartner “Cool Vendor of the Year,” March 2012.
• Frost and Sullivan “Most innovative IT vendor,” Dec 2013.
• CIO Review, 20 Most Promising Networking Companies, March 2014.
• CIO Review, 20 Most Promising HPC Companies, March 2015.