Building affordable and programmable exascale-capable computers
SURFsara, 18 April 2019
Georgi Gaydadjiev, Director of Maxeler IoT-Labs BV, Delft
Honorary Visiting Professor at the Department of Computing, Imperial College London
Think Ångströms, not nanometers, in 2019
Ideally, we would steer the movement of almost every individual electron to solve our problems.
[Scale figure: a logarithmic length axis from 1 Å to 10⁴ Å, locating the C-C bond, glucose, hemoglobin, the ribosome, DNA and cells against the resolution of a light microscope. 0.1 nm = 1 Å, so 14 nm = 140 Å: 100s of Si atoms span a 14 nm feature, while 3 nm / 30 Å is only 6 to 12 atoms.]
The power challenge

Moving data on-chip will use as much energy as computing with it; moving data off-chip will use 200x more energy, and is much slower as well.
Energy per operation              Today*     2020
Double-precision float op         ~20 pJ     <10 pJ
Moving data on-chip: 1 mm         6 pJ
Moving data on-chip: 20 mm        120 pJ
Moving data to off-chip memory    4,000 pJ   2,000 pJ
The data movement challenge

Next-generation computer systems should take data movement at all levels very seriously.
Data movement challenge confirmed

The wires that carry the data (and instructions) become more and more important. (Courtesy: NVIDIA and ITRS)
Why is all of this important?

“… without dramatic increases in efficiency, ICT industry could use 20% of all electricity and emit up to 5.5% of the world’s carbon emissions by 2025.”

“We have a tsunami of data approaching. Everything which can be is being digitalised. It is a perfect storm.”

“… a single $1bn Apple data centre planned for Athenry in Co Galway, expects to eventually use 300MW of electricity, or over 8% of the national capacity and more than the daily entire usage of Dublin. It will require 144 large diesel generators as back up for when the wind does not blow.”
Build Computers for your Problem and Data

Computing in Time: follow a recipe step by step, one step at a time.

Computing in Space: build a “recipe-specific” factory with multiple paths performed simultaneously, producing one result per clock cycle. Efficient, predictable, reliable “mass production” of huge amounts of data.
Programming a Dataflow “mass production” Engine

1. Describe Conjugate Gradient as a dataflow graph
2. Compile the dataflow structure and load it to hardware
3. Stream data through the custom accelerator

Create customized mega-accelerators with massive inherent throughput.
Solving Computing Problems Vertically

[Figure: the vertical stack Problem → Algorithm → Program in HL Language → Machine Architecture → Implementation → Circuits → Devices, with solutions spanning the layers.]

Co-optimise the HW and the SW stack for the performance-critical areas of the application.
From Equations to Dataflow Hardware
[Figure: a partial differential equation in u (with terms in p, F and a logarithm) and the dataflow hardware it maps to.]
Real dataflow graph as generated by MaxCompiler: 4,866 nodes; 10,000s of stages/cycles.

Full customization in Space, Value and Time (SVT).
Easy it is not (and not really new)
Slotnick’s law (of effort):
“The parallel approach to computing does require that
some original thinking be done about numerical
analysis and data management in order to secure
efficient use.
In an environment which has represented the
absence of the need to think as the highest virtue this
is a decided disadvantage.”
Daniel Slotnick (1931-1985)
Chief Architect of Illiac IV
Programming in Space basics
• Control and Data-flows are decoupled
– Both are fully programmable
• Operations exist in space and by default run in parallel
– Their number is limited only by the available space
• All operations can be customized at various levels
– e.g., from the algorithm down to the number representation (see the sketch after this list)
• Multiple operations constitute kernels
• Data streams through the operations / kernels
• The data transport and processing can be balanced
• All resources work all of the time for max performance
• The In/Out data rates determine the operating frequency
Equally spread the available “forces” and move no faster than required by the application
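To make the “down to the number representation” point concrete, here is an illustrative MaxJ fragment (a sketch only: the stream names and the narrow 24-bit float are invented for this example; dfeFloat and dfeUInt themselves appear in the kernels later in this deck):

// Illustrative only: per-operand format choice in MaxJ.
DFEVar aSingle = io.input("a", dfeFloat(8, 24));  // IEEE-style single precision
DFEVar bCustom = io.input("b", dfeFloat(7, 17));  // narrower, application-chosen float
DFEVar count   = io.input("n", dfeUInt(16));      // integer sized to the data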
The Computational Model

• Dataflow sub-system (DataFlow Engine, DFE)
– Spatial arithmetic chip “hardware” technology with flexible arithmetic units and programmable interconnect (looks like FPGAs, but is not limited to them)
– Programmable Static Dataflow
– Systolic execution at kernel level
– Streaming custom computing at system level
– Implicit GALS* IO and kernel-to-kernel communication
• Dedicated software (MaxCompiler, MaxelerOS and SLiC)
– Compilation toolchain and design methodology
– Integrated simulation and debug environment for rapid development
– Fully Linux-integrated runtime system and low-level software support
– Helps the designer focus on the data/algorithm and the system architecture
• Only three basic memory types (explicitly exposed; see the sketch below)
– Scalars (exposed to the CPU)
– Fast Memory (FMEM): small and fast (on-chip)
– Large Memory (LMEM): large and slow (off-chip)

* GALS – Globally Asynchronous Locally Synchronous
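A kernel-side sketch of the first two memory types, reusing only the calls that appear in the resource-usage report later in this deck (io.scalarInput and mem.romMapped); LMEM, by contrast, is reached through manager-configured streams rather than from kernel code:

DFEVar q      = io.input("q", dfeUInt(8));            // streaming input
DFEVar offset = io.scalarInput("offset", dfeUInt(8)); // scalar: set by the CPU
DFEVar addr   = offset + q;
DFEVar v      = mem.romMapped("table", addr,          // FMEM: small, fast,
                              dfeFloat(8, 24), 256);  // on-chip 256-entry table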
Maxeler’s DataFlow Engine (DFE, MAX4)

[Figure: DFE card with a reconfigurable compute fabric (dataflow cores & FMEM, Fast Memory), LMEM (Large Memory, 4-96GB) over a high-bandwidth memory link, MaxRing interconnect links, and a link to the main data network (e.g., PCIe, Infiniband).]

MAX4 highlights:
• 48GB DRAM (LMEM)
• Stratix V D8
• MaxRing interconnect
• 4,000 multipliers
• 700K logic cells
• 6.25MB of FMEM
Application Level Components

[Figure: the host CPU (application in C, Python, Matlab, …, on top of SLiC and MaxelerOS, with its own memory) connected over PCI Express or Infiniband to the DFE (with its own memory), where MaxJ kernels instantiate the arithmetic structure and a MaxJ manager arranges the data orchestration.]
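To sketch how the kernel and manager sides meet, the fragment below wires a hypothetical kernel’s streams to the host link. Treat it as an assumption-laden sketch: the CustomManager class, the addStreamFromCPU/addStreamToCPU calls and the <== stream-connection operator follow the MaxCompiler manuals, and MyKernel and its stream names are invented here:

class MyManager extends CustomManager {
    MyManager(EngineParameters params) {
        super(params);
        // Instantiate the arithmetic structure ...
        KernelBlock k = addKernel(new MyKernel(makeKernelParameters("MyKernel")));
        // ... and arrange the data orchestration: CPU -> kernel -> CPU.
        k.getInput("x") <== addStreamFromCPU("x");
        addStreamToCPU("y") <== k.getOutput("y");
    }
}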
MaxJ: Moving Average of three numbers
Dataflow computing in hardware using a language you know
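The transcript does not carry the slide’s code; below is a sketch of the classic three-point moving average in MaxJ, assuming the stream.offset call from the MaxCompiler tutorials (boundary handling at the ends of the stream is omitted):

class MovingAverageKernel extends Kernel {
    MovingAverageKernel(KernelParameters parameters) {
        super(parameters);
        DFEVar x = io.input("x", dfeFloat(8, 24));
        // Look one element backwards and one forwards in the stream.
        DFEVar prev = stream.offset(x, -1);
        DFEVar next = stream.offset(x, +1);
        DFEVar result = (prev + x + next) / 3;
        io.output("y", result, dfeFloat(8, 24));
    }
}

Because the three taps exist side by side in space, the average still leaves the kernel at one result per clock cycle.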
Simple example: y = x² + 30

[Figure: dataflow graph with input x feeding a multiplier (x times x), an adder with constant 30, and output y.]

DFEVar x = io.input("x", dfeFloat(10,31));
DFEVar result = x * x + 30;
io.output("y", result, dfeFloat(10,31));
MaxJ example: Control in Space

[Figure: dataflow graph in which the comparison x > 10 selects between x + 1 and x - 1 to produce y.]

class SimpleKernel extends Kernel {
    SimpleKernel() {
        DFEVar x = io.input("x", dfeInt(24));
        DFEVar result = (x > 10) ? x + 1 : x - 1;
        io.output("y", result, dfeInt(25));
    }
}
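The ternary operator above compiles to a multiplexer in space rather than a branch in time; assuming the control.mux call from the MaxCompiler documentation, the same graph can be written explicitly (a sketch, not the slide’s code):

DFEVar select = x > 10;                            // the comparison is a 1-bit stream
DFEVar result = control.mux(select, x - 1, x + 1); // select == 0 picks x - 1, 1 picks x + 1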
Non-Traditional Design Process

[Figure: the design loop ANALYSE → ARCHITECT → PROGRAM → GENERATE DATAFLOW → SIMULATE AND DEBUG, iterating (“OK? many hours …”) before producing the custom HW.]

Used to build real systems; however, it is very difficult to learn and to teach.
Optimizations at all levels

Multiple scales of computing; important features for optimization:

• complete system level → balance compute, storage and IO
• parallel node level → maximize utilization of compute and interconnect
• microarchitecture level → minimize data movement
• arithmetic level → trade off range, precision and accuracy (= discretize in Time, Space and Value)
• bit level → encode and add redundancy
• transistor level → manipulate ‘0’ and ‘1’

And more, e.g., trade/hide Communication (Time) for/behind Computation (Space).
Some of the challenges

1. Higher chip/system price compared to microprocessors
2. Long design lead times (3 months in the best case)
a. Complex numerical transformations
b. Non-trivial area and data movement optimizations
3. “Painful” Place & Route times (12 to 24 hours)
a. Expensive vendor-specific tools
b. Serious developments require dedicated build clusters
4. Need to compete at 200MHz with processors at 3GHz
5. Current HW technology is sub-optimal
• On-chip memory not built for stream processing
• On-chip interconnect overdesigned for dataflow
6. Long learning curve (tools and methods needed)
7. Designer’s productivity should improve (tools and methods)
8. …

Ongoing effort on improved methodologies and tools.
Multiple platforms, single DFE abstraction

[Figure: the same DFE block diagram (reconfigurable compute fabric with dataflow cores & FMEM, LMEM 4-96GB, MaxRing links, link to the main data network) instantiated on two generations, gen4 (Intel based) and gen5 (Xilinx based); one application written in MaxJ migrates between them with portable performance.]
Substrate-Agnostic Compilation

• MaxCompiler generates VHDL ready for the FPGA vendor tools
• Synthesis transforms the VHDL into a logical “netlist”: sets of basic logic expressions
• Map fits the basic logic into N-input look-up tables (LUTs)
• Place puts LUTs, DSPs, RAMs, etc. at specific locations on the chip
• Route sets up the wiring between blocks

[Figure: the tool flow MaxCompiler compilation → VHDL → Synthesis → complete netlist → Map → LUTs → Place → placed FPGA → Route → complete FPGA → Generate Maxfile.]
DFE Place and Route example
Mon 16:27: MaxCompiler version: 2012.2
Mon 16:27: Build "MyKernel" start time: Mon Apr 08 16:27:24 BST 2013
Mon 16:27: Main build process running as user training1 on host Maxworkstation7478
Mon 16:27: Build location: /home/training1/maxcompiler-builds/MyKernel
Mon 16:27: Instantiating manager
Mon 16:27: Instantiating kernel "MyKernel"
Mon 16:27: Compiling manager (CPU I/O Only)
Mon 16:27: Compiling kernel "MyKernel"
Mon 16:27: Generating input files (VHDL, netlists, CoreGen)
Mon 16:27: Running back-end build (12 phases)
Mon 16:27: (1/12) - Prepare MaxFile Data (GenerateMaxFileDataFile)
Mon 16:27: (2/12) - Synthesize DFE Modules (XST)
Mon 16:30: (3/12) - Link DFE Modules (NGCBuild)
Mon 16:30: (4/12) - Prepare for Resource Analysis (EDIF2MxruBuildPass)
Mon 16:30: (5/12) - Generate Preliminary Annotated Source Code
Mon 16:30: (6/12) - Report Resource Usage (ResourceCounter)
Mon 16:30: About to start chip vendor Map/Place/Route toolflow. This will take some time.
Mon 16:30: (7/12) - Prepare for Placement (NGDBuild)
Mon 16:30: (8/12) - Place and Route DFE (MPPR)
Mon 16:30: Executing MPPR with 1 cost tables and 1 threads.
Mon 16:30: MPPR: Starting 1 cost table
Mon 16:43: MPPR: Cost table 1 met timing with score 0 (best score 0)
Mon 16:43: (9/12) - Prepare for Resource Analysis (XDLBuild)
Mon 16:44: (10/12) - Generate Resource Report (ResourceUsageBuildPass)
Mon 16:44: (11/12) - Generate Annotated Source Code (ResourceAnnotationBuildPass)
Mon 16:44: (12/12) - Generate MaxFile (GenerateMaxFile)
Mon 16:45:
Mon 16:45: FINAL RESOURCE USAGE
Mon 16:45: LUTs: 9503 / 149760 (6.35%)
Mon 16:45: FFs: 12749 / 149760 (8.51%)
Mon 16:45: BRAMs: 34 / 516 (6.59%)
Mon 16:45: DSPs: 0 / 1056 (0.00%)
Mon 16:45:
Mon 16:45: MaxFile: /home/training1/maxcompiler-builds/MyKernel/results/MyKernel.max
(MD5Sum: e564cd922aeeda04acfa2f4ecce8236d)
Mon 16:45: Build completed: Mon Apr 08 16:45:58 BST 2013 (took 18 mins, 33 secs)
FPGA vendor-specific back-end tool flow, abstracted by MaxCompiler.
DFE Resource Usage Reporting

• Shows which lines of code use which resources, so optimization effort can be focused
• Separate reports for each kernel and for the manager
LUTs FFs BRAMs DSPs : MyKernel.java
727 871 1.0 2 : resources used by this file
0.24% 0.15% 0.09% 0.10% : % of available
71.41% 61.82% 100.00% 100.00% : % of total used
94.29% 97.21% 100.00% 100.00% : % of user resources
:
: public class MyKernel extends Kernel {
: public MyKernel (KernelParameters parameters) {
: super(parameters);
1 31 0.0 0 : DFEVar p = io.input("p", dfeFloat(8,24));
2 9 0.0 0 : DFEVar q = io.input("q", dfeUInt(8));
: DFEVar offset = io.scalarInput("offset", dfeUInt(8));
8 8 0.0 0 : DFEVar addr = offset + q;
18 40 1.0 0 : DFEVar v = mem.romMapped("table", addr,
: dfeFloat(8,24), 256);
139 145 0.0 2 : p = p * p;
401 541 0.0 0 : p = p + v;
: io.output("r", p, dfeFloat(8,24));
: }
: }
[Figure: chip floorplan highlighting DSP blocks, block RAMs, IO blocks and LUTs/FFs; different operations use different resources.]
Optimization Feedback

• MaxCompiler gives detailed latency and area annotations back to the programmer
• Evaluate the precise effect of code on latency and chip area

[Figure: annotated dataflow pipeline; 12.8 ns + 6.4 ns = 19.2 ns total compute latency.]
Pilot System Deployed at Jülich

Small pilot system deployed in Oct 2017:
• one 1U MPC-X with 8 MAX5 DFEs
• one 1U AMD EPYC based server
• one 1U login head node

Scaling using the Amazon AWS cloud:
• MAX5 fully compatible with F1 instances
• Elastic scaling between private and public

[Figure: remote users connect at 10 Gbps to a head/build node (ipmi) and a Supermicro EPYC node (1TB DDR4), which links to the MPC-X node with MAX5 DFEs over 2x Infiniband @ 56 Gbps.]

http://www.prace-ri.eu/pcp/
PRACE-PCP: SpecFEM3D on DFE
The BQCD Chip - AERIAL VIEW
Scalable Conjugate Gradient Design for the CG step of BQCD
Achieved Performance on PRACE workloads

Problem   Case (Small/Large)       System (composition) (size)                      TTS [sec]   ETS [kWh]   DTS (F1)
BQCD      32x32x32x32              PRACE pilot (8 DFEs, 64 EPYC cores) (2U)         1,054       0.44        -
          64x64x64x64              1PF equivalent (48 DFEs, 512 EPYC cores) (14U)   1,703.8     4.26        $39.93
NEMO      GYRE6                    PRACE pilot (8 DFEs, 64 EPYC cores) (2U)         388         0.164       -
          GYRE144                  1PF equivalent (48 DFEs, 92 EPYC cores) (8U)     1,942       3.77        $42.72
SFM3D     1 chunk x64x64           PRACE pilot (8 DFEs, 64 EPYC cores) (2U)         232         0.096       -
          6 chunks x1,440x1,440    1PF equivalent (384 DFEs, 768 EPYC cores) (60U)  5,150       70.1        $1,267.2
QE        Al2O3                    PRACE pilot (8 DFEs, 64 EPYC cores) (2U)         32          0.013       -
          Ta2O5                    1PF equivalent (64 DFEs, 64 EPYC cores) (9U)     3,210       7.58        $94.16

(TTS: time to solution; ETS: energy to solution; DTS: dollar cost to solution on AWS F1.)
Global Weather Simulation with DFEs in China

⬥ L. Gan, H. Fu, W. Luk, C. Yang, W. Xue, X. Huang, Y. Zhang, and G. Yang, Accelerating solvers for global atmospheric equations through mixed-precision data flow engine, FPL 2013
⬥ Joint research with Imperial College and Tsinghua University
⬥ Simulating the atmosphere using the shallow water equations

Platform          Speedup   Efficiency
6-core CPU        1x        1x
Tianhe-1A node    23x       15x
Maxeler MPC-X     330x      145x

An order of magnitude improvement over Linpack-driven supercomputer technology.
Conclusions

• A (fancy) name does not help with solving the problem at hand
– Cloud, (Intelligent) Edge, Fog are just names, like … Maria
• FPGA is just a technology that can help bridge the gap to something better (Spatial Computing Acceleration HW, Quantum Processing, …)
– Just focus on building the best computer for the given job
• Learn, think, pioneer and always stay critical
– Abstraction is powerful but quite often not needed
– Use it with great care and remember Dan Slotnick
• We are turning Earth into a heterogeneous, planet-wide computer
– So we should try not to kill it in the process
• There is a lot of interest in this topic
Questions?

Contact me at: georgi@maxeler.com
Or find me on Google: “Georgi computer” should do.
Some links with more information
Maxeler Multiscale Dataflow Computing:
https://www.maxeler.com/technology/dataflow-computing/
Computing in Space explained by Mike Flynn:
http://www.openspl.org/what-is-openspl/
Computing in Space Course at Imperial College:
http://cc.doc.ic.ac.uk/openspl16/
Exciting Applications for DFEs (and JDFEs):
http://appgallery.maxeler.com
Maxeler DFEs on AWS EC2 F1:
https://aws.amazon.com/marketplace/seller-profile?id=2780c6ec-d326-47fc-9ff6-c66ab2ba202a
Maxeler and Xilinx Alveo collaboration:
https://www.xilinx.com/products/boards-and-kits/alveo.html
Maxeler Applications Gallery
Dataflow Engine (DFE) Ecosystem
⬥ With over 150 universities in our university program, we
decided to create an app gallery to enable the community
to share applications, examples, demos, …
⬥ The App Gallery is complemented by a teaching program, with the first successful course taught at Imperial College in 2014; see http://cc.doc.ic.ac.uk/openspl14
⬥ Top 10 APPS:
➢ Correlation: in real-time, pairwise, on 6,000 streams
➢ 100% Guaranteed Packet Capture
➢ Webserver, cache and load balancing
➢ HESTON Option pricer
➢ N-body simulation
➢ Regex matching (e.g. for Security)
➢ Brain network simulation
➢ Quantum Chromo-Dynamics kernel
➢ Seismic Imaging
➢ Realtime Classification
Dataflow Apps and Analytics for Machine Learning http://appgallery.maxeler.com/
Peer Reviewed Dataflow Publications
2008: Seismic Imaging with Dataflow Engines 25x faster, An Implementation of the Acoustic
Wave Equation, T. Nemeth et al, Chevron, Society of Exploration Geophysicists, Nov 2008.
2010: Credit Derivatives Valuation and Risk, from 8 hours to 2 minutes, American Finance
Technology Award, with JP Morgan.
2011: Modeling and Imaging with Schlumberger, Beyond Traditional Microprocessors for
Geoscience High-Performance Computing Applications, O. Lindtjörn et al, Schlumberger,
IEEE Micro, vol. 31, no. 2, March/April 2011.
2012: Weather Imaging with CRS4, 60x faster, Acceleration of a Meteorological Limited Area Model with Dataflow Engines, Diego Oriato†, Simon Tilbury†, Marino Marrocu§, Gabrielle Pusceddu§ (†Maxeler, §CRS4), 2012 Symposium on Application Accelerators in HPC.
2013: Convergence of Risk and Trading in partnership with CME Group and birth of OpenSPL
industry standard (www.openspl.org), In Cloud Computing it’s the Era of Convergence, Open
Markets Magazine, Ari Studnitzer, CME Group.
2014: Brain Simulation with Erasmus, Real-Time Olivary Neuron Simulations on Dataflow
Computing Machines, Georgios Smaragdos, Craig Davies, Christos Strydis, Ioannis Sourdis,
Catalin Ciobanu, Oskar Mencer, and Chris I. De Zeeuw, Supercomputing, Springer, pp. 487-497.
2017: High Energy Physics with Imperial, Using MaxCompiler for the high level synthesis
of trigger algorithms, S. Summers, A. Rose and P. Sanders, Journal of Instrumentation,
Volume 12, IOP Publishing.
Maxeler University Program
Maxeler Trophy Cabinet
Academic History since 2005
• Imperial College Research Excellence Award
• Top EPSRC Advanced Fellowship
• Two Best Paper Awards
• Early Dataflow paper by Maxeler’s Founder has been recognized as one of the most
influential papers at the FPL conference in the last 25 years.
Recent Commercial Awards
• HPCwire Editors Choice Award, November 2011.
• American Finance Technology Awards, New York, winner,
“Most Cutting Edge IT Initiative,” December 2011.
• Golden Arrow, “…for revolutionizing Computers,” COM-SULT, January 2012.
• Gartner “Cool Vendor of the Year,” March 2012.
• Frost and Sullivan “Most innovative IT vendor,” Dec 2013.
• CIO Review, 20 Most Promising Networking Companies, March 2014.
• CIO Review, 20 Most Promising HPC Companies, March 2015.