Despite Moore's "law", uniprocessor clock speeds have now stalled. Rather than single processors running at ever higher clock speeds, it is
common to find dual-, quad- or even hexa-core processors, even in consumer laptops and desktops.
Future hardware will not be slightly parallel, however, as in today's multicore systems, but will be
massively parallel, with manycore and perhaps even megacore systems
becoming mainstream.
This means that programmers need to start thinking parallel. To achieve this they must move away
from traditional programming models where parallelism is a
bolted-on afterthought. Rather, programmers must use languages where parallelism is deeply embedded into the programming model
from the outset.
By providing a high-level model of computation, without explicit ordering of computations, declarative languages in general, and functional languages in particular, offer many advantages for parallel programming.
One of the most fundamental advantages of the functional paradigm is purity.
In a purely functional language, as exemplified by Haskell, there are simply no side effects: it is therefore impossible for parallel computations to conflict with each
other in ways that are not well understood.
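To make this concrete, here is a minimal sketch (not from the original abstract) using GHC's standard parallel package: two pure computations are sparked in parallel with par, and purity guarantees the parallel result matches the sequential one.

    import Control.Parallel (par, pseq)

    -- Two pure computations evaluated in parallel. Because neither has
    -- side effects, sparking `evens` on another core cannot change the
    -- result: the program is deterministic whether or not the spark
    -- actually runs in parallel.
    sumSplit :: [Int] -> Int
    sumSplit xs = evens `par` (odds `pseq` (evens + odds))
      where
        evens = sum (filter even xs)
        odds  = sum (filter odd  xs)

Since evens and odds share no mutable state, no locks are needed and no race conditions are possible.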
ParaForming aims to radically improve the process
of parallelising purely functional programs through a comprehensive set of high-level parallel refactoring patterns for Parallel Haskell,
supported by advanced refactoring tools.
By matching parallel design patterns with appropriate algorithmic skeletons
using advanced software refactoring techniques and novel cost information, we will bridge the gap between fully automatic
and fully explicit approaches to parallelisation, helping programmers "think parallel" in a systematic,
guided way. This talk introduces the ParaForming approach, gives some examples and shows how
effective parallel programs can be developed using advanced refactoring technology.
Exploring emerging technologies in the HPC co-design space (jsvetter)
This document discusses emerging technologies for high performance computing (HPC), focusing on heterogeneous computing and non-volatile memory. It provides an overview of HPC architectures past and present, highlighting the trend toward more heterogeneous systems using GPUs and other accelerators. The document discusses challenges for applications to adapt to these changing architectures. It also explores potential future technologies like 3D memory and discusses the Department of Energy's efforts in codesign centers to facilitate collaboration between application developers and emerging hardware.
Deploying deep learning models with Docker and Kubernetes (Petteri Teikari, PhD)
Short introduction for platform agnostic production deployment with some medical examples.
Alternative download: https://www.dropbox.com/s/qlml5k5h113trat/deep_cloudArchitecture.pdf?dl=0
Hardware Acceleration for Machine Learning (CastLab, KAIST)
This document provides an overview of a lecture on hardware acceleration for machine learning. The lecture will cover deep neural network models like convolutional neural networks and recurrent neural networks. It will also discuss various hardware accelerators developed for machine learning, including those designed for mobile/edge and cloud computing environments. The instructor's background and the agenda topics are also outlined.
The document discusses parallel computing and multicore processors. It notes that Berkeley researchers believe multicore is the future of computing. It also discusses building an academic "manycore" research system using FPGAs to allow researchers to experiment with parallel algorithms, compilers, and programming models on thousands of processor cores. This would help drive innovation and avoid long waits between hardware and software iterations.
In this deck from the 2016 HPC Advisory Council Switzerland Conference, DK Panda from Ohio State University presents: High-Performance and Scalable Designs of Programming Models for Exascale Systems.
"This talk will focus on challenges in designing runtime environments for Exascale systems with millions of processors and accelerators to support various programming models. We will focus on MPI, PGAS (OpenSHMEM, CAF, UPC and UPC++) and Hybrid MPI+PGAS programming models by taking into account support for multi-core, high-performance networks, accelerators (GPUs and Intel MIC) and energy-awareness. Features and sample performance numbers from the MVAPICH2 libraries will be presented."
Watch the video presentation: http://wp.me/p3RLHQ-f7c
See more talks in the Swiss Conference Video Gallery: http://insidehpc.com/2016-swiss-hpc-conference/
Models for Parallel, Concurrent and Distributed Processing for Bioinformatics Software
Novartis Institute for BioMedical Research (NIBR) Geek Speak - Dec 4, 2014
Early Benchmarking Results for Neuromorphic Computing (DESMOND YUEN)
This document summarizes early benchmarking results for neuromorphic computing using Intel's Loihi chip. It finds that Loihi provides orders of magnitude gains over CPUs and GPUs for certain workloads that are directly trained on the chip or use novel bio-inspired algorithms. These include online learning, adaptive control, event-based vision and tactile sensing, constraint satisfaction problems, and nearest neighbor search. Larger networks and problems tend to provide greater performance gains with Loihi.
"Session ID: BUD17-503
Session Name: The HPE Machine and Gen-Z - BUD17-503
Speaker:
Grant Likely
Track:
★ Session Summary ★
With the exponential rise in quantity of data to manage, the modern data centre is increasingly limited by the capacity of individual machines. Since storage and compute demand more capacity than can be provided by a single machine, we distribute both over large clusters and use the network to transfer data between where it is stored and where it is processed. Moving all that data around uses deep storage stacks which incur a significant performance impact. If we could somehow flatten the storage stack and provide applications with direct access to data, then we could improve performance by orders of magnitude.
Hewlett Packard Enterprise recently demonstrated that we can do exactly that with their research project, "The Machine". Instead of moving data around with a network, The Machine uses multiple terabytes of persistent memory and a next-generation fabric-attached memory interconnect to provide a single pool of storage which can be accessed by any processor in the cluster. It shows that we can provide applications with immediate load/store access to huge data sets, in a model called Memory-Driven Computing.
Proof in hand, it is now time to bring Memory-Driven Computing to the data centre. Gen-Z is an open systems interconnect designed to provide memory-semantic access to data and devices via direct-attached, switched or fabric topologies. HPE has joined the Gen-Z consortium and is using the knowledge gained with The Machine to help shape Gen-Z and set the stage for true Memory-Driven Computing. By putting memory at the centre, we can overcome the limitations of today's computing systems and power new innovations.
This session will cover two topics. It will start with a status update on The Machine and an overview of how it works. Then we'll shift into an introduction of Gen-Z, and how it can reshape the architecture of computing in the years to come.
---------------------------------------------------
★ Resources ★
Event Page: http://connect.linaro.org/resource/bud17/bud17-503/
Presentation:
Video: https://youtu.be/1BVtChDQVyQ
---------------------------------------------------
★ Event Details ★
Linaro Connect Budapest 2017 (BUD17)
6-10 March 2017
Corinthia Hotel, Budapest,
Erzsébet krt. 43-49,
1073 Hungary
---------------------------------------------------
Keyword: HPE, Gen-Z
http://www.linaro.org
http://connect.linaro.org
---------------------------------------------------
Follow us on Social Media
https://www.facebook.com/LinaroOrg
https://twitter.com/linaroorg
https://www.youtube.com/user/linaroorg?sub_confirmation=1
https://www.linkedin.com/company/1026961
The document discusses parallel programming approaches for multicore processors, advocating for using Haskell and embracing diverse approaches like task parallelism with explicit threads, semi-implicit parallelism by evaluating pure functions in parallel, and data parallelism. It argues that functional programming is well-suited for parallel programming due to its avoidance of side effects and mutable state, but that different problems require different solutions and no single approach is a silver bullet.
Simon Peyton Jones: Managing parallelism (Skills Matter)
If you want to program a parallel computer, it obviously makes sense to start with a computational paradigm in which parallelism is the default (ie functional programming), rather than one in which computation is based on sequential flow of control (the imperative paradigm). And yet, and yet ... functional programmers have been singing this tune since the 1980s, but do not yet rule the world. In this talk I’ll say why I think parallelism is too complex a beast to be slain at one blow, and how we are going to be driven, willy-nilly, towards a world in which side effects are much more tightly controlled than now. I’ll sketch a whole range of ways of writing parallel program in a functional paradigm (implicit parallelism, transactional memory, data parallelism, DSLs for GPUs, distributed processes, etc, etc), illustrating with examples from the rapidly moving Haskell community, and identifying some of the challenges we need to tackle.
The document discusses the development of the Parallella computing board, which was created to address challenges with parallel computing. It was launched in 2012 with the goal of making parallel computing more accessible through an open source hardware and software platform costing $99. The document outlines Parallella's architecture and collaborations with universities. It argues that parallel programming is difficult but necessary for the future, and that open collaboration is needed to train developers and create parallel algorithms and software stacks.
Sharding Containers: Make Go Apps Computer-Friendly Again, by Andrey Sibiryov (Docker, Inc.)
The document discusses how modern hardware has become more complex with multi-core, multi-socket CPUs and deep cache hierarchies. This complexity introduces latency and performance issues for software. The author describes their service that processes millions of requests per second spending a large amount of time on garbage collection, context switching, and CPU stalls. They developed a tool called Tesson that analyzes hardware topology and shards containerized applications across CPU cores, pinning linked components closer together to improve locality and performance. Tesson integrates with a local load balancer to distribute workloads efficiently utilizing the system resources.
Performance Optimization of SPH Algorithms for Multi/Many-Core Architectures (Dr. Fabio Baruffa)
In the framework of the Intel Parallel Computing Centre at the Research Campus Garching in Munich, our group at LRZ presents recent results on performance optimization of Gadget-3, a widely used community code for computational astrophysics. We identify and isolate a sample code kernel, which is representative of a typical Smoothed Particle Hydrodynamics (SPH) algorithm and focus on threading parallelism optimization, change of the data layout into Structure of Arrays (SoA), compiler auto-vectorization and algorithmic improvements in the particle sorting. We measure lower execution time and improved threading scalability both on Intel Xeon (2.6× on Ivy Bridge) and Xeon Phi (13.7× on Knights Corner) systems. First tests on second generation Xeon Phi (Knights Landing) demonstrate the portability of the devised optimization solutions to upcoming architectures.
This document provides a summary of large scale machine learning frameworks. It discusses out-of-core learning, data parallelism using MapReduce, graph parallel frameworks like Pregel, and model parallelism using parameter servers. Spark is described as easy to use with a well-designed API, while GraphLab is designed for ML researchers with vertex programming. Parameter servers are presented as aiming to support very large learning but still being in early development.
1. Building exascale computers requires moving to sub-nanometer scales and steering individual electrons to solve problems more efficiently.
2. Moving data is a major challenge, as moving data off-chip uses 200x more energy than computing with it on-chip.
3. Future computers should optimize for data movement at all levels, from system design to microarchitecture, to minimize energy usage.
This document discusses challenges and opportunities in parallel graph processing for big data. It describes how graphs are ubiquitous but processing large graphs at scale is difficult due to their huge size, complex correlations between data entities, and skewed distributions. Current computation models have problems with ghost vertices, too much interaction between partitions, and lack of support for iterative graph algorithms. New frameworks are needed to handle these graphs in a scalable way with low memory usage and balanced computation and communication.
The document introduces Parallel Pixie Dust (PPD), a cross-platform thread library that aims to guarantee deadlock-free and race-condition free schedules that are optimal. It discusses the need for multiple threads due to factors like the memory wall. Current threading models are problematic because testing and debugging threaded code is difficult. PPD uses futures and thread pools to simulate data flow and generate tree-like thread schedules. It provides parallel versions of functions and thread-safe containers to enable multi-threaded standard library algorithms. The goal is to make writing correct multi-threaded programs easier.
- The document discusses building a predictive anomaly detection model for network traffic using streaming data technologies.
- It proposes using Apache Kafka to ingest and process network packet and Netflow data in real-time, and Akka clustering to build predictive models that can guide human cybersecurity experts.
- The solution aims to more effectively guide human awareness of network threats by complementing localized rule-matching with predictive modeling of aggregate network behavior based on streaming metrics.
Historically, sharing a Linux server entailed all kinds of untenable compromises. In addition to the security concerns, there was simply no good way to keep one application from hogging resources and messing with the others. The classic “noisy neighbor” problem made shared systems the bargain-basement slums of the Internet, suitable only for small or throwaway projects.
Serious use-cases traditionally demanded dedicated systems. Over the past decade virtualization (in conjunction with Moore’s law) has democratized the availability of what amount to dedicated systems, and the result is hundreds of thousands of websites and applications deployed into VPS or cloud instances. It’s a step in the right direction, but still has glaring flaws.
Most of these websites are just piles of code sitting on a server somewhere. How did that code get there? How can it be scaled? Secured? Maintained? It's anybody's guess. There simply isn't enough SysAdmin talent in the world to meet the demands of managing all these apps with anything close to best practices without a better model.
Containers are a whole new ballgame. Unlike VMs, you skip the overhead of running an entire OS for every application environment. There’s also no need to provision a whole new machine to have a place to deploy, meaning you can spin up or scale your application with orders of magnitude more speed and accuracy.
At the Crossroads of HPC and Cloud Computing with Openstack (Ryan Aydelott)
Openstack is an open-source cloud computing platform that is widely used. It allows for the provisioning of computing, storage, and networking resources on demand in a manner similar to public cloud services like Amazon Web Services. The presentation discusses Openstack's architecture, current uses, development status, and relationship to high-performance computing. It also covers how Argonne National Labs uses Openstack and potential future directions, like more native support for HPC workloads and integrated application platforms.
We live in an era where the atomic building elements of silicon computers, e.g. transistors and wires, are no longer visible using traditional optical microscopes, and their sizes are measured in just tens of Angstroms. In addition, power dissipation per unit volume is bounded by the laws of physics, which has resulted, among other things, in stagnating processor clock frequencies. Adding more and more processor cores that perform simpler and simpler tasks, in an attempt to efficiently fill the available on-chip area, seems to be the current trend taken by the industry.
The Berkeley View on the Parallel Computing Landscape (ugur candan)
This document discusses the need for a new approach to hardware and software for parallel computing. It argues that the conventional wisdom about uniprocessor architectures is outdated given physical limits. A group of researchers from UC Berkeley met to discuss parallelism and developed 7 questions to frame research. They identified 13 computational patterns or "dwarfs" that will be important for the next decade. The document discusses building hardware from small cores, connecting processors, the need for a new human-centric programming model, and testing models with human subjects. It calls for innovations in both hardware and software to address the challenges of parallel computing.
Near Data Computing Architectures: Opportunities and Challenges for Apache Spark (Ahsan Javed Awan)
The document discusses opportunities for improving Apache Spark performance using near data computing architectures. It proposes exploiting in-storage processing and 2D integrated processing-in-memory to reduce data movement between CPUs and memory. Certain Spark workloads like joins and aggregations that are I/O bound would benefit from in-storage processing, while iterative workloads are more suited for 2D integrated processing-in-memory. The document outlines a system design using FPGAs to emulate these architectures for evaluating Spark machine learning workloads like k-means clustering.
Near Data Computing Architectures for Apache Spark: Challenges and Opportunities (Spark Summit)
Scale-out big data processing frameworks like Apache Spark have been designed to run on off-the-shelf commodity machines, where each machine has a modest amount of compute, memory and storage capacity. Recent advancements in hardware technology motivate understanding Spark performance on novel hardware architectures. Our earlier work has shown that the performance of Spark-based data analytics is bounded by frequent accesses to the DRAM. In this talk, we argue in favour of Near Data Computing architectures that enable processing the data where it resides (e.g. Smart SSDs and Compute Memories) for Apache Spark. We envision a programmable-logic-based hybrid near-memory and near-storage compute architecture for Apache Spark. Furthermore, we discuss the challenges involved in achieving a 10x performance gain for Apache Spark on NDC architectures.
Container Attached Storage (CAS) with OpenEBS - Berlin Kubernetes Meetup - Ma... (OpenEBS)
The OpenEBS project has taken a different approach to storage when it comes to containers. Instead of using existing storage systems and making them work with containers, what if you were to redesign something from scratch using the same paradigms used in the container world? This resulted in the effort of containerizing the storage controller. Also, as the applications that consume storage are changing, do we need scale-out distributed storage systems?
ParaForming - Patterns and Refactoring for Parallel Programming
1. ParaForming: Forming Parallel (Functional) Programs from High-Level Patterns using Advanced Refactoring
Kevin Hammond, Chris Brown, Vladimir Janjic
University of St Andrews, Scotland
Build Stuff, Vilnius, Lithuania, December 10 2013
T: @paraphrase_fp7, @khstandrews
E: kh@cs.st-andrews.ac.uk
W: http://www.paraphrase-ict.eu
4. What will "megacore" computers look like?
§ Probably not just scaled versions of today's multicore
  § Perhaps hundreds of dedicated lightweight integer units
  § Hundreds of floating point units (enhanced GPU designs)
  § A few heavyweight general-purpose cores
  § Some specialised units for graphics, authentication, network etc
  § possibly soft cores (FPGAs etc)
  § Highly heterogeneous
5. What will "megacore" computers look like?
§ Probably not uniform shared memory
§ NUMA is likely, even hardware distributed shared memory
  § or even message-passing systems on a chip
§ shared-memory will not be a good abstraction

    int arr[x][y];
6. Laki (NEC Nehalem Cluster) and hermit (XE6)
Laki:
§ 700 dual-socket Xeon 5560 2.8GHz ("Gainestown") nodes
§ 12 GB DDR3 RAM / node
§ Infiniband (QDR)
§ 32 nodes with additional Nvidia Tesla S1070
§ Scientific Linux 6.0
hermit (phase 1 step 1):
§ 38 racks with 96 nodes each
§ 96 service nodes and 3552 compute nodes
§ Each compute node will have 2 sockets, AMD Interlagos @ 2.3GHz with 16 cores each, leading to 113,664 cores
§ Nodes with 32GB and 64GB memory, reflecting different user needs
§ 2.7PB storage capacity @ 150GB/s IO bandwidth
§ External Access Nodes, Pre-/Postprocessing Nodes, Remote Visualization Nodes
7. The Biggest Computer in the World
Tianhe-2, Chinese National University of Defence Technology
33.86 petaflops/s (June 17, 2013)
16,000 nodes, each with 2 Ivy Bridge multicores and 3 Xeon Phis
3,120,000 x86 cores in total!!!
8. It's not just about large systems
§ Even mobile phones are multicore
  § Samsung Exynos 5 Octa has 8 cores, 4 of which are "dark"
§ Performance/energy tradeoffs mean systems will be increasingly parallel
§ If we don't solve the multicore challenge, then no other advances will matter!
ALL Future Programming will be Parallel!
9. The Manycore Challenge
"Ultimately, developers should start thinking about tens, hundreds, and thousands of cores now in their algorithmic development and deployment pipeline."
Anwar Ghuloum, Principal Engineer, Intel Microprocessor Technology Lab
The ONLY important challenge in Computer Science (Intel)
"The dilemma is that a large percentage of mission-critical enterprise applications will not ``automagically'' run faster on multi-core servers. In fact, many will actually run slower. We must make it as easy as possible for applications programmers to exploit the latest developments in multi-core/many-core architectures, while still making it easy to target future (and perhaps unanticipated) hardware developments."
Patrick Leonard, Vice President for Product Development, Rogue Wave Software
Also recognised as thematic priorities by EU and national funding bodies
10. But doesn't that mean millions of threads on a megacore machine??
11. How to build a wall (with apologies to Ian Watson, Univ. Manchester)
13. How NOT to build a wall
Typical CONCURRENCY approaches require the programmer to solve these.
Task identification is not the only problem...
Must also consider coordination, communication, placement, scheduling, ...
14. We need structure
We need abstraction
We don't need another brick in the wall
15. Thinking Parallel
§ Fundamentally, programmers must learn to "think parallel"
  § this requires new high-level programming constructs
  § perhaps dealing with hundreds of millions of threads
§ You cannot program effectively while worrying about deadlocks etc.
  § they must be eliminated from the design!
§ You cannot program effectively while fiddling with communication etc.
  § this needs to be packaged/abstracted!
§ You cannot program effectively without performance information
  § this needs to be included as part of the design!
16. A Solution?
"The only thing that works for parallelism is functional programming"
Bob Harper, Carnegie Mellon University
17. Parallel Functional Programming
§ No explicit ordering of expressions
§ Purity means no side-effects
  § Impossible for parallel processes to interfere with each other
  § Can debug sequentially but run in parallel
§ Enormous saving in effort
  § Programmer concentrates on solving the problem
  § Not porting a sequential algorithm into an (ill-defined) parallel domain
§ No locks, deadlocks or race conditions!!
§ Huge productivity gains!
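As an illustration of "debug sequentially but run in parallel" (a hedged sketch assuming GHC and the parallel package; this code is not from the slides), the Eval monad gives deterministic parallelism: the same program computes the same answer with or without +RTS -N.

    import Control.Parallel.Strategies (rpar, rseq, runEval)

    -- A naive Fibonacci, used only as a source of work.
    fib :: Int -> Integer
    fib n | n < 2     = toInteger n
          | otherwise = fib (n - 1) + fib (n - 2)

    -- Evaluate two calls in parallel. Compiled with -threaded and run
    -- with +RTS -N this uses two cores; without, it still computes
    -- exactly the same pair, so it can be debugged sequentially.
    fibPair :: Int -> (Integer, Integer)
    fibPair n = runEval $ do
      a <- rpar (fib n)        -- sparked: may run on another core
      b <- rseq (fib (n + 1))  -- evaluated in this thread
      _ <- rseq a              -- wait for the spark before returning
      return (a, b)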
18. ParaPhrase Project: Parallel Patterns for Heterogeneous Multicore Systems
(ICT-288570), 2011-2014, €4.2M budget
13 partners, 8 European countries: UK, Italy, Germany, Austria, Ireland, Hungary, Poland, Israel
Coordinated by Kevin Hammond, St Andrews
19. The ParaPhrase Approach
§ Start bottom-up
  § identify (strongly hygienic) COMPONENTS
  § using semi-automated refactoring, for both legacy and new programs
§ Think about the PATTERN of parallelism
  § e.g. map(reduce), task farm, parallel search, parallel completion, ...
§ STRUCTURE the components into a parallel program
  § turn the patterns into concrete (skeleton) code
§ Take performance, energy etc. into account (multi-objective optimisation)
  § also using refactoring
§ RESTRUCTURE if necessary! (also using refactoring)
20. Some Common Patterns
§ High-level abstract patterns of common parallel algorithms
§ Google map-reduce combines two of these!
§ Generally, we need to nest/combine patterns in arbitrary ways
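As a sketch of that nesting (again assuming GHC's parallel package; mapReduce is a name coined here, not a library function), Google-style map-reduce is exactly a parallel map composed with a reduction:

    import Control.DeepSeq (NFData)
    import Control.Parallel.Strategies (parMap, rdeepseq)

    -- Map-reduce as the composition of two patterns: a parallel map
    -- (a task farm over the list) followed by a sequential reduction.
    mapReduce :: NFData b => (a -> b) -> (b -> b -> b) -> b -> [a] -> b
    mapReduce f combine z = foldr combine z . parMap rdeepseq f

    -- e.g. a parallel word count over documents:
    --   mapReduce (length . words) (+) 0 docs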
21. The Skel Library for Erlang
§ Skeletons implement specific parallel patterns
  § Pluggable templates
§ Skel is a new (AND ONLY!) skeleton library in Erlang
  § map, farm, reduce, pipeline, feedback
  § instantiated using skel:run
  § Fully nestable
§ A DSL for parallelism

    OutputItems = skel:run(Skeleton, InputItems).

chrisb.host.cs.st-andrews.ac.uk/skel.html
https://github.com/ParaPhrase/skel
22. The Parallel Pipeline Skeleton
§ Each stage of the pipeline can be executed in parallel
§ The input and output are streams

{pipe, [Skel1, Skel2, ..., SkelN]}

Tn · · · T1 → Skel1 → Skel2 → · · · → SkelN → Tn · · · T1

skel:run([{pipe, [Skel1, Skel2, .., SkelN]}], Inputs).

Inc    = {seq, fun(X) -> X + 1 end},
Double = {seq, fun(X) -> X * 2 end},
skel:run({pipe, [Inc, Double]}, [1,2,3,4,5,6]).
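For the Inc/Double pipeline above, each input is incremented and then doubled; assuming skel:run here returns the output stream as a list (as in the OutputItems = skel:run(...) usage on the previous slide), [1,2,3,4,5,6] would yield [4,6,8,10,12,14].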
23. The Farm Skeleton
§ Each worker is executed in parallel
§ A bit like a 1-stage pipeline

{farm, Skel, M}

Tn · · · T1 → [ Skel1 | Skel2 | · · · | SkelM ] → Tn · · · T1   (M parallel workers)

skel:do([{farm, Skel, M}], Inputs).

Inc = {seq, fun(X) -> X + 1 end},
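As a minimal hedged sketch (my example, following the {farm, Skel, M} shape above), farming the Inc worker over a list of inputs might look like this; note that a farm need not preserve the order of its outputs:

%% Farm the Inc worker across 4 parallel copies.
%% Assumes skel:do/2 blocks and returns the collected outputs.
Inc = {seq, fun(X) -> X + 1 end},
Out = skel:do([{farm, [Inc], 4}], lists:seq(1, 100)).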
24. Using The Right Pattern Matters
[Figure: Speedups for Matrix Multiplication on up to 24 cores, comparing Naive Parallel, Farm, and Farm with Chunk 16.]
26. Refactoring
§ Refactoring changes the structure of the source code
§ using well-defined rules
§ semi-automatically under programmer guidance
27. Refactoring: Farm Introduction

Figure 3.3: Some Standard Skeleton Equivalences

S1 ∘ S2 ≡ Pipe(S1, S2)    (pipe/seq)
Map(S1 ∘ S2, d, r) ≡ Map(S1, d, r) ∘ Map(S2, d, r)    (map fission/fusion)
S ≡ Farm(S)    (farm intro/elim)
Map(F, d, r) ≡ Pipe(Decomp(d), Farm(F), Recomp(r))    (data2stream)
S1′ ≡ Map(S1, d, r)    (map intro/elim)

The following describes each of the patterns in turn:
• a MAP is made up of three OPERATIONs: a worker, a partitioner, and a combiner, followed by an INPUT;
• a SEQ is made up of a single OPERATION denoting the sequential computation to be performed, followed by an INPUT;
• a FARM is made up of a single OPERATION denoting the worker, an INPUT ...
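As a hedged illustration (mine, not the slide's), the farm intro/elim rule corresponds to a purely structural rewrite in Skel terms, where expensive/1 stands for any per-item worker:

%% Farm introduction: S and Farm(S) compute the same results,
%% but the farmed form exposes 8-way parallelism.
S      = {seq, fun expensive/1},
SeqOut = skel:do([S], Inputs),
ParOut = skel:do([{farm, [S], 8}], Inputs).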
33. Large-Scale Demonstrator Applications
§ ParaPhrase tools are being used by commercial/end-user partners
§ SCCH (SME, Austria)
§ Erlang Solutions Ltd (SME, UK)
§ Mellanox (Israel)
§ ELTE-Soft (SME, Hungary)
§ AGH (University, Poland)
§ HLRS (High Performance Computing Centre, Germany)
34. Speedup Results (demonstrators)
[Figure: Speedups for Ant Colony, BasicN2 and Graphical Lasso on 1-24 workers, comparing the refactored versions against manual implementations.]
Speedup close to or better than manual optimisation.
35. Bowtie2: the most widely used DNA alignment tool
[Figure: speedup of the ParaPhrase version (Bt2FF-pin+int) vs. the original Bowtie2 (Bt2), plotted against read length (20-110) and against quality (28-40).]
C. Misale. Accelerating Bowtie2 with a lock-less concurrency approach and memory affinity. IEEE PDP 2014. To appear.
36. Comparison of Development Times
Use case          Man. Time   Refac. Time   LOC Intro.
Convolution       3 days      3 hours       58
Ant Colony        1 day       1 hour        32
BasicN2           5 days      5 hours       40
Graphical Lasso   15 hours    2 hours       53

Figure 3. Approximate manual implementation time of use-cases vs. refactoring time, with lines of code introduced by the refactoring tool.
38. Example: Enumerate Skeleton Configurations for Image Convolution
(r : read image file; p : process image file)

r p
r || p
Δ(r) p
r Δ(p)
Δ(r) Δ(p)
r || Δ(p)
Δ(r p)
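Reading Δ(x) as a farmed version of stage x (my assumption; the slide leaves Δ informal), two of these configurations might be written in Skel as follows, with read_image/1 and process_image/1 as placeholder stages:

%% r p       : two-stage pipeline, both stages sequential.
%% Δ(r) Δ(p) : the same pipeline with each stage farmed.
R = {seq, fun read_image/1},
P = {seq, fun process_image/1},
skel:run([{pipe, [R, P]}], Files),
skel:run([{pipe, [{farm, [R], 4}, {farm, [P], 4}]}], Files).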
39. Results on Benchmark: Image Convolution
MCTS Mapping (C, G): (6, 0) || (0, 3)
Speedup: 39.12
Best Speedup: 40.91
40. Conclusions
§ The manycore revolution is upon us
§ Computer hardware is changing very rapidly (more than in the last 50 years)
§ The megacore era is here (aka exascale, BIG data)
§ Heterogeneity and energy are both important
§ Most programming models are too low-level
§ concurrency based
§ need to expose mass parallelism
§ Patterns and functional programming help with abstraction
§ millions of threads, easily controlled
41. Conclusions (2)
§ Functional programming makes it easy to introduce parallelism
§ No side effects means any computation could be parallel
§ Matches pattern-based parallelism
§ Much detail can be abstracted
§ Lots of problems can be avoided
§ e.g. freedom from deadlock
§ Parallel programs give the same results as sequential ones!
§ Automation is very important
§ Refactoring dramatically reduces development time (while keeping the programmer in the loop)
§ Machine learning is very promising for determining complex performance settings
42. But isn't this all just wishful thinking?
Rampant-Lambda-Men in St Andrews
43. NO!
§ C++11 has lambda functions (and some other nice functional-inspired features)
§ Java 8 will have lambdas (closures)
§ Apple uses closures in Grand Central Dispatch
44. ParaPhrase Parallel C++ Refactoring
§ Integrated into Eclipse
§ Supports the full C++(11) standard
§ Uses strongly hygienic components
§ functional encapsulation (closures)
48. Funded by
• ParaPhrase (EU FP7), Patterns for heterogeneous multicore, €4.2M, 2011-2014
• SCIEnce (EU FP6), Grid/Cloud/Multicore coordination, €3.2M, 2005-2012
• Advance (EU FP7), Multicore streaming, €2.7M, 2010-2013
• HPC-GAP (EPSRC), Legacy system on thousands of cores, £1.6M, 2010-2014
• Islay (EPSRC), Real-time FPGA streaming implementation, £1.4M, 2008-2011
• TACLe: European Cost Action on Timing Analysis, €300K, 2012-2015
49. Some of our Industrial Connections
Mellanox Inc.
Erlang Solutions Ltd
SAP GmbH, Karlsruhe
BAe Systems
Selex Galileo
BioID GmbH, Stuttgart
Philips Healthcare
Software Competence Centre, Hagenberg
Microsoft Research
Well-Typed LLP
50. ParaPhrase Needs You!
• Please join our mailing list and help grow our user community
§ news items
§ access to free development software
§ chat to the developers
§ free developer workshops
§ bug tracking and fixing
§ tools for both Erlang and C++
• Subscribe at https://mailman.cs.st-andrews.ac.uk/mailman/listinfo/paraphrase-news
• We're also looking for open source developers...
• We also have 8 PhD studentships...
51. Further Reading

Chris Brown, Vladimir Janjic, Kevin Hammond, Mehdi Goli and John McCall. "Bridging the Divide: Intelligent Mapping for the Heterogeneous Parallel Programmer". Submitted to IPDPS 2014.

Chris Brown, Marco Danelutto, Kevin Hammond, Peter Kilpatrick and Sam Elliot. "Cost-Directed Refactoring for Parallel Erlang Programs". To appear in International Journal of Parallel Programming, 2013.

Vladimir Janjic, Chris Brown, Max Neunhoffer, Kevin Hammond, Steve Linton and Hans-Wolfgang Loidl. "Space Exploration using Parallel Orbits". Proc. PARCO 2013: International Conf. on Parallel Computing, Munich, Sept. 2013.

Chris Brown, Hans-Wolfgang Loidl and Kevin Hammond. "ParaForming: Forming Parallel Haskell Programs using Novel Refactoring Techniques". Proc. 2011 Trends in Functional Programming (TFP), Madrid, Spain, May 2011.

Henrique Ferreiro, David Castro, Vladimir Janjic and Kevin Hammond. "Repeating History: Execution Replay for Parallel Haskell Programs". Proc. 2012 Trends in Functional Programming (TFP), St Andrews, UK, June 2012.

Ask me for copies! Many technical results are also on the project web site, free for download.