This talk was based on my Master's thesis, which I had completed earlier that year. It gives an overview of how certain dynamic programming recurrences can be computed efficiently in parallel, and of what we want "efficient" to mean here.
The plots in "Performance Examples" show speedup S on the left and efficiency E on the right, both against input size.
Read more over here: http://reitzig.github.io/publications/Reitzig2012
5. Goals
Goal 1: Understand what efficiency means in parallel algorithms.
Goal 2: Characterise dynamic programming recurrences in a suitable way.
Goal 3: Find and implement efficient parallel algorithms for DP.
10. Complexity theory
Classifies problems
Focuses on inherent parallelism
Answers: How many processors do you need to be really fast on inputs of a given size?
But...
...p grows with n – no statement about constant p and growing n!
16. Work and depth
Work W = T^A_1 and depth D = T^A_∞.
Brent's Law: an algorithm A with W/p ≤ T^A_p < W/p + D is possible in a certain setting.
But...
...it has limited applicability, and D can be slippery!
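To make work and depth concrete, here is a minimal fork/join sketch (my own illustration in Go, not part of the talk): summing n numbers by recursive halving has work W = Θ(n) and depth D = Θ(log n), so by Brent's Law a parallel time close to W/p + D is achievable.

```go
// Illustration only: a fork/join parallel sum with W = Θ(n), D = Θ(log n).
package main

import "fmt"

// parSum recursively splits xs and sums the two halves concurrently.
func parSum(xs []int) int {
	if len(xs) <= 1024 { // sequential cutoff keeps spawn overhead in check
		s := 0
		for _, x := range xs {
			s += x
		}
		return s
	}
	mid := len(xs) / 2
	ch := make(chan int)
	go func() { ch <- parSum(xs[:mid]) }() // "spawn"
	right := parSum(xs[mid:])              // work on the other half ourselves
	return <-ch + right                    // "join"
}

func main() {
	xs := make([]int, 1<<20)
	for i := range xs {
		xs[i] = 1
	}
	fmt.Println(parSum(xs)) // prints 1048576
}
```

The cutoff is a practical concession; it changes neither W nor D asymptotically.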
21. Relative runtimes
Speedup S^A_p := T^A_1 / T^A_p.
Efficiency E^A_p := T^B / (p · T^A_p), with B a sequential reference algorithm.
But...
...what are good values?
Clear: S^A_p ∈ [0, p] and E^A_p ∈ [0, 1] – but we certainly cannot always hit the optima!
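A tiny numeric sketch of these definitions (the helper names and timings are hypothetical, not from the talk):

```go
// Hypothetical helpers: speedup and efficiency from measured runtimes.
// t1 = sequential time of A, tp = time of A on p processors,
// tb = time of the sequential reference algorithm B.
package main

import "fmt"

func speedup(t1, tp float64) float64 { return t1 / tp }

func efficiency(tb, tp float64, p int) float64 {
	return tb / (float64(p) * tp)
}

func main() {
	// Assumed numbers: B needs 8s, A needs 10s sequentially and 3s on p = 4.
	fmt.Printf("S_4 = %.2f\n", speedup(10, 3))      // ≈ 3.33 of at most 4
	fmt.Printf("E_4 = %.2f\n", efficiency(8, 3, 4)) // ≈ 0.67 of at most 1
}
```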
23. Proposal: Asymptotic relative runtimes
Definition
S^A_p(∞) := liminf_{n→∞} S^A_p(n); is it = p?
E^A_p(∞) := liminf_{n→∞} E^A_p(n); is it = 1?
Goal
Find parallel algorithms that are asymptotically as scalable and efficient as possible for all p.
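For intuition, here is a small worked example (runtimes assumed by me, not taken from the talk) showing how an additive, p-dependent overhead still allows optimal asymptotic speedup:

```latex
% Worked example with assumed runtimes: suppose algorithm A runs in
% T^A_p(n) = n/p + log p, so that T^A_1(n) = n. (Needs amsmath for \text.)
\[
  S^A_p(n) \;=\; \frac{T^A_1(n)}{T^A_p(n)}
           \;=\; \frac{n}{n/p + \log p}
           \;\longrightarrow\; p
  \quad (n \to \infty),
  \qquad \text{so } S^A_p(\infty) = p .
\]
% For any fixed p, the additive log p term is dwarfed by n/p as n grows,
% so this algorithm is asymptotically optimally scalable in the above sense.
```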
26. Disclaimer
This means:
A good parallel algorithm can utilise any number of processors if the inputs are large enough.
Not:
More processors are always better.
Just as in sequential algorithmics.
29. Afterthoughts
Machine model
Keep it simple: (P)RAM with p processors and spawn/join.
Which quantities to analyse?
Elementary operations, memory accesses, inter-thread communication, ...
Implicit interaction – blocking, communication via memory, ... – is invisible in code!
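To illustrate that last point, here is a Go sketch of my own (not from the talk): the two increments below look identical in the source, yet one is a data race and the other causes inter-core cache traffic on every call. Neither cost shows up in a naive operation count.

```go
// Illustration only: interaction through shared memory is invisible in code.
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	var plain int64       // unsynchronised shared counter
	var safe atomic.Int64 // synchronised shared counter

	var wg sync.WaitGroup
	for w := 0; w < 4; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 100_000; i++ {
				plain++     // data race: concurrent read-modify-write, result unpredictable
				safe.Add(1) // correct, but each call is implicit inter-core communication
			}
		}()
	}
	wg.Wait()
	fmt.Println(plain, safe.Load()) // plain is likely < 400000; safe is exactly 400000
}
```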
65. Future Work
Fill gaps in theory (caching and communication).
Generalise theory to more dimensions and interleaved DPs.
Improve and extend implementations.
More experiments (different problems, more diverse machines).
Improve compiler integration (detection, backtracing, result functions).
Integrate with other tools.