FPGAs as Components in Heterogeneous HPC Systems (paraFPGA 2015 keynote)
Keynote I gave at the ParCo conference (http://www.parco2015.org) workshop paraFPGA in Edinburgh, Sept 2015, on the need to raise the abstraction level for programming of heterogeneous systems.
1. FPGAs as Components in Heterogeneous High-Performance Computing Systems: Raising the Abstraction Level
Wim Vanderbauwhede
School of Computing Science, University of Glasgow
3. 80 Years Ago: The Theory
Turing, Alan Mathison. "On computable numbers, with an application to the Entscheidungsproblem." J. of Math 58, no. 345-363 (1936): 5.
4. 1936: Universal machine (Alan Turing)
1936: Lambda calculus (Alonzo Church)
1936: Stored-program concept (Konrad Zuse)
1937: Church-Turing thesis
1945: The von Neumann architecture
Church, Alonzo. "A set of postulates for the foundation of logic." Annals of Mathematics (1932): 346-366.
6. 1957: Fortran, John Backus, IBM
1958: First IC, Jack Kilby, Texas Instruments
1965: Moore's law
1971: First microprocessor, Texas Instruments
1972: C, Dennis Ritchie, Bell Labs
1977: Fortran-77
1977: The von Neumann bottleneck, John Backus
8. 1984: Verilog
1984: First reprogrammable logic device, Altera
1985: First FPGA, Xilinx
1987: VHDL Standard IEEE 1076-1987
1989: Algotronix CAL1024, the first FPGA to offer random access to its control memory
9. 20 Years Ago: High-Level Synthesis
Page, Ian. "Closing the gap between hardware and software: hardware-software cosynthesis at Oxford." (1996): 2-2.
16. High-Level Synthesis
• For many years, Verilog/VHDL were good enough.
• Then the complexity gap created the need for HLS.
• This reflects the rationale behind VHDL: "A language with a wide range of descriptive capability that was independent of technology or design methodology."
• What is lacking in this requirement is "capability for scalable abstraction".
17. "C to Gates"
• "C-to-Gates" offered that higher abstraction level.
• But it was in a way a return to the days before standardised VHDL/Verilog: the various components making up a system were designed and verified using a wide range of different and incompatible languages and tools.
18. The Choice of C
• C was designed by Ritchie for the specific purpose of writing the UNIX operating system,
• i.e. to create a control system for a RAM-based single-threaded system.
• It is basically a syntactic layer over assembly language.
• Very different semantics from HDLs.
• But it became the lingua franca for engineers, and hence the de facto language for HLS tools.
19. Really C?
• None of them was ever really C, though:
• "C with restrictions and pragmas" (e.g. DIME-C)
• "C with restrictions and a CSP API" (e.g. Impulse-C)
• "C-syntax language with parallel and CSP semantics" (e.g. Handel-C)
• Typically no recursion or function pointers (no stack), and no dynamic allocation (no OS).
20. The Odd Ones Out
• Mitrion-C is a functional dataflow language, very different from C in abstraction level and semantics.
• Bluespec too was a radical departure from "C-to-Gates".
• Both were inspired by functional languages like Haskell.
23. GPUs, Manycores and FPGAs
• Accelerators attached to host systems have become increasingly popular:
• Mainly GPUs,
• But increasingly manycores (MIC, Tilera),
• And FPGAs.
25. Heterogeneous Programming
• State of affairs today:
• The programmer must decide what to offload,
• Write host-accelerator control and data-movement code using a dedicated API,
• Write accelerator code using a dedicated language.
• Many approaches (CUDA, OpenCL, MaxJ, C++ AMP).
26. Programming Model
• All solutions assume data parallelism:
• Each kernel is single-threaded and works on a portion of the data,
• The programmer must identify these portions and the amount of parallelism (see the sketch after this list),
• So not ideal for FPGAs.
• Recent OpenCL specifications have kernel pipes, allowing construction of pipelines,
• Also support for a unified memory space.
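As a minimal sketch of this data-parallel model (plain Haskell with illustrative names; portions here stand in for OpenCL work-items or work-groups, not any actual API):

-- Split the input into fixed-size portions, run a single-threaded
-- "kernel" on each portion, and recombine the results.
portions :: Int -> [a] -> [[a]]
portions _ [] = []
portions n xs = let (p, rest) = splitAt n xs in p : portions n rest

-- A single-threaded kernel operating on one portion of the data.
kernel :: [Float] -> [Float]
kernel = map (\x -> x * x + 1.0)

-- The programmer chooses the portion size n, i.e. the amount of parallelism.
runDataParallel :: Int -> [Float] -> [Float]
runDataParallel n = concatMap kernel . portions n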
27. Performance
Speed-up, OpenCL-FPGA vs OpenCL-CPU:

                 Nearest neighbour   LavaMD        Document Classification
Kernel Speed     5.345454545         4.317035156   1.306521951
Total Speed      1.603603604         4.232313328   1.037712032

Segal, Oren, Nasibeh Nasiri, Martin Margala, and Wim Vanderbauwhede. "High level programming of FPGAs for HPC and data centric applications." Proc. IEEE HPEC 2014, pp. 1-3.
28. Power Savings
CPU/FPGA power consumption ratio:

                 Nearest neighbour   LavaMD        Document Classification
Kernel Power     5.241358852         4.232966577   1.281079155
Total Power      1.572375533         4.149894595   1.017503956

Segal, Oren, Nasibeh Nasiri, Martin Margala, and Wim Vanderbauwhede. "High level programming of FPGAs for HPC and data centric applications." Proc. IEEE HPEC 2014, pp. 1-3.
30. Heterogeneous HPC Systems
• A modern HPC cluster node:
• Multicore/manycore host,
• Accelerators: GPGPU, MIC and, increasingly, FPGAs.
• HPC workloads:
• Very complex codebase,
• Legacy code.
31. Example: WRF
• Weather Research and Forecasting Model
• Fortran-90, with support for MPI and OpenMP
• 1,263,320 lines of code,
• So about ten thousand pages of code listings.
• Parts of it have been accelerated manually on GPU (a few thousand lines).
• Changing the code for a GPU/FPGA system would be a huge task, and the result would not be portable.
32. FPGAs in HPC
• FPGAs are good at some tasks, e.g.:
• Bit-level, integer and string operations,
• Pipeline parallelism rather than data parallelism,
• Superior internal memory bandwidth,
• Streaming dataflow computations.
• But not so good at others:
• Double-precision floating-point computations,
• Random-memory-access computations.
33. "On the Capability and Achievable Performance of FPGAs for HPC Applications"
Wim Vanderbauwhede
School of Computing Science, University of Glasgow, UK
http://www.slideshare.net/WimVanderbauwhede
35. One Codebase, Many Components
• For complex HPC applications, FPGAs will never be optimal for the whole codebase.
• But neither will multicores or GPUs.
• So we need to be able to split the codebase automatically over the different components in the heterogeneous system.
• Therefore, we need to raise the abstraction level beyond "heterogeneous programming" and "high-level synthesis".
36.
Today’s
high-‐level
language
is
tomorrow’s
compiler
target
37.
• Device-specific high-level abstraction is no longer good enough.
• OpenCL is relatively high-level and device-independent, but it is still not good enough.
• High-level synthesis languages and heterogeneous programming frameworks should be compilation targets!
• Just like assembly/IR languages and HDLs.
38. Automatic Program Transformations and Cost Models
39.
• Starting from a complete, unoptimised program,
• Compiler-based program transformations,
• Correct-by-construction,
• A component-based, hierarchical cost model for the full system.
• Optimisation problem: find the optimal program variant given the system cost model (a sketch of this selection step follows).
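A minimal Haskell sketch of that selection step, under the assumption that the compiler has already produced a set of costed variants (Variant, systemCost and bestVariant are hypothetical names, not part of any actual toolchain):

import Data.List (minimumBy)
import Data.Ord (comparing)

-- A correct-by-construction program variant, paired with its estimated
-- cost under the hierarchical system cost model.
data Variant = Variant { variantName :: String, systemCost :: Double }

-- The optimisation problem: pick the variant with the minimal cost.
bestVariant :: [Variant] -> Variant
bestVariant = minimumBy (comparing systemCost)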
40.
A
Func6onal-‐Programming
Approach
•
For
the
par6cular
case
of
scien6fic
HPC
codes
•
Focus
on
array
computa6ons
•
Express
the
program
using
higher-‐order
func6ons
•
Type
Transforma6on
based
program
transforma6on
41.
Func6onal
Programming
•
There
are
only
func6ons
•
Func6ons
can
operate
on
func6ons
•
Func6ons
can
return
func6ons
•
Syntac6c
sugar
over
the
λ-‐calculus
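A two-line Haskell illustration (twice is an illustrative name, not from the talk): a function that takes a function and returns a new function.

-- 'twice' takes a function f and returns the function that applies f twice.
twice :: (a -> a) -> (a -> a)
twice f = f . f
-- e.g. twice (+1) 3 evaluates to 5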
42.
Types
in
Func6onal
Programming
•
Types
are
just
labels
to
help
us
reason
about
the
values
in
a
computa6on
•
More
general
than
types
in
e.g
C
•
For
our
purpose,
we
focus
on
types
of
func6ons
that
perform
array
opera6ons
•
Func6ons
are
values,
so
they
need
a
type
43.
Examples
of
Types
-‐-‐
a
func6on
f
taking
a
vector
of
n
values
of
type
a
and
returing
a
vector
of
m
values
of
type
b
f : Vec a n -> Vec b m
-‐-‐
a
func6on
map
taking
a
func0on
from
a
to
b
and
a
vector
of
type
a,
and
returning
a
vector
of
type
b
map : (a -> b) -> Vec a n -> Vec b n
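The Vec notation above is language-agnostic; as a sketch of one possible encoding (an assumption, not the talk's implementation), here are length-indexed vectors in Haskell with type-level Peano naturals:

{-# LANGUAGE DataKinds, GADTs, KindSignatures #-}

-- Peano naturals, promoted to the type level.
data Nat = Z | S Nat

-- Vec a n holds exactly n values of type a; the length is in the type.
data Vec a (n :: Nat) where
  VNil  :: Vec a 'Z
  VCons :: a -> Vec a n -> Vec a ('S n)

-- map for sized vectors: the output length equals the input length,
-- matching map : (a -> b) -> Vec a n -> Vec b n above.
vmap :: (a -> b) -> Vec a n -> Vec b n
vmap _ VNil         = VNil
vmap f (VCons x xs) = VCons (f x) (vmap f xs)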
44. Type Transformations
• Transform the type of a function into another type.
• The function transformation can be derived automatically from the type transformation.
• The type transformations are provably correct.
• Thus the transformed program is correct by construction!
45. Array Type Transformations
• For this talk, focus on:
• Vector (array) types,
• The FPGA cost model.
• Programs must be composed using particular higher-order functions (correctness conditions).
• Transformations essentially reshape the arrays.
46. Higher-Order Functions
• map: perform a computation on all elements of an array independently, e.g. square all values.
• Can be done sequentially, in parallel, or using a pipeline if the computation is pipelined.
• foldl: reduce an array to a value using an accumulator, e.g. sum all values.
• Can be done sequentially or, if the computation is associative, using a binary tree (see the sketch after this list).
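A small Haskell sketch of both patterns (squares, total and treeFold are illustrative names): map squares all values, foldl sums them, and for an associative operation the same reduction can be restructured as a binary tree.

-- map: square all values (sequential, parallel or pipelined execution).
squares :: [Double] -> [Double]
squares = map (\x -> x * x)

-- foldl: sum all values using an accumulator.
total :: [Double] -> Double
total = foldl (+) 0

-- For an associative operation, reduce as a binary tree instead:
-- pair up elements level by level (assumes a non-empty input list).
treeFold :: (a -> a -> a) -> [a] -> a
treeFold _ [x] = x
treeFold f xs  = treeFold f (pairUp xs)
  where
    pairUp (a:b:rest) = f a b : pairUp rest
    pairUp rest       = rest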
47. Example: SOR
• Successive Over-Relaxation (SOR) kernel from a Large-Eddy simulator (weather simulation), in Fortran.
48. Example: SOR Using map
• The Fortran code rewritten as a map of a function over a 1-D vector (a schematic sketch follows).
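The extracted slides do not include the Fortran kernel itself, so the following Haskell sketch is schematic only: sorPoint and omega are placeholders for the real SOR update, and each element is bundled with its neighbours so the point function can be mapped independently.

-- Hypothetical over-relaxation factor; the real kernel's value is not shown.
omega :: Double
omega = 1.5

-- One relaxation step for a point, given its left and right neighbours
-- (placeholder arithmetic, not the simulator's actual update).
sorPoint :: (Double, Double, Double) -> Double
sorPoint (left, centre, right) =
  centre + omega * ((left + right) / 2 - centre)

-- The kernel as a map of a function over a 1-D vector.
sorStep :: [(Double, Double, Double)] -> [Double]
sorStep = map sorPoint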
49. Example: Type Transformation
• Transform the 1-D vector into a 2-D vector.
• The program transformation is derived (a minimal sketch follows).
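A minimal sketch of the derivation, with plain Haskell lists standing in for the sized Vec types (reshape and the chunk size m are illustrative, not the actual compiler machinery): transforming the type from a 1-D vector of n*m elements to a 2-D vector of n rows of m turns map f into map (map f), and flattening the derived program recovers the original one.

-- Reshape a 1-D vector of n*m elements into n rows of m elements each.
reshape :: Int -> [a] -> [[a]]
reshape _ [] = []
reshape m xs = let (row, rest) = splitAt m xs in row : reshape m rest

-- Original program over the 1-D type; derived program over the 2-D type.
-- Correctness: concat (derived f m xs) == original f xs
original :: (a -> b) -> [a] -> [b]
original f = map f

derived :: (a -> b) -> Int -> [a] -> [[b]]
derived f m = map (map f) . reshape m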
50. FPGA Cost Modeling
• Based on an overlay architecture.
51. Cost Calculation
• Uses an intermediate-representation language, the TyTra-IR.
• TyTra-IR uses LLVM syntax but can express sequential, parallel and pipeline semantics,
• Thus a direct mapping to the higher-order functions.
• But the cost of computation and communication can be computed directly from the TyTra-IR program:
• No need for synthesis.
53. Cost Space and Cost Estimation
[Figure: the cost space, with axes for logic and memory resources, communication bandwidth (local memory, global memory, host) and performance (throughput); designs are bounded by the Resource Wall (computation-bound) and the Bandwidth Wall (communication-bound).]
56. Full-Program Transformation
• Type transformations are not FPGA-specific:
• The compiler can create variants of the full program,
• Then separate out parts of the program based on minimal cost on given components of the system.
• For parallelisation over a cluster, use Multi-Party Session Types to transform the program into communicating processes.
57. Conclusions
• FPGAs have reached maturity as HPC platforms.
• High-Level Synthesis and Heterogeneous Programming are both very important steps forward, and performance is already impressive.
• But we need to raise the abstraction level even more:
• Full-system compilers for heterogeneous systems,
• FPGAs are merely components in such systems.
• Type Transformations are one possible way.