3. SKA Science Data Processor (SDP) high-level dataflow
[Pipeline diagram: Data ingestion (0.5 TB/s per site) → Data management → Data processing (130 PFlops per site) → Data analysis and visualisation]
4. SKA Data Challenges
• Multiple concurrent observing projects
• Data sharing between projects
• Capital and operational budget limited
• Power, Cooling
• Acquisition, maintenance & software development costs
• Throughput: produce ~0.2-10 Tera Voxels/second
• Automated 24/7 operation
• Data parallelism: Millions of related tasks on thousands of nodes
5. Data deluge

Telescope    Raw Data Rate    Archive Growth
MWA          1.4 TB/hour      5 PB/year
LSST         1.5 TB/hour      6 PB/year
ASKAP        9 TB/hour        5.5 PB/year
SKA1-Low     1,400 TB/hour    150 PB/year
arxiv.org/abs/1702.07617
6. DALiuGE
• Defined once, executed anywhere (well)
– Separation
– Coherence
• Work with existing software components
• Extended dataflow model
– Unlock “hidden” parallelism
– Data is given autonomy
• Decentralised execution via event propagation
• Built-in Data lifecycle management
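The “decentralised execution via event propagation” idea can be sketched with two toy classes (hypothetical names, not the actual DALiuGE API): a data Drop fires a completion event that triggers its consumer applications, so the graph drives itself without a central scheduler.

```python
# Toy sketch of event-driven graph execution
# (hypothetical classes, not the real DALiuGE API).
class DataDrop:
    """Holds a piece of data and notifies consumers when it is written."""
    def __init__(self, name):
        self.name = name
        self.consumers = []
        self.data = None

    def subscribe(self, app):
        self.consumers.append(app)

    def write(self, data):
        self.data = data
        # completion event propagates to all consumer applications
        for app in self.consumers:
            app.on_input_complete(self)


class AppDrop:
    """Runs a function when its input Drop completes, writing the result."""
    def __init__(self, fn, output):
        self.fn = fn
        self.output = output

    def on_input_complete(self, drop):
        self.output.write(self.fn(drop.data))


# a -> square -> b: writing to `a` cascades through the graph
a, b = DataDrop("a"), DataDrop("b")
a.subscribe(AppDrop(lambda x: x * x, b))
a.write(3)
print(b.data)  # -> 9
```

Chaining more Drops extends the cascade: each write is itself an event, so execution spreads through the graph without any node holding a global view.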
7. Related work
• Dataflow (DAG) computation model [7]
– Unlock “hidden” parallelism
• DAG mapping (QAP) is a hard problem [5]
• Exact solutions
– Assignment graph [2], allocation graph [19] (max flow)
– O(|V| · P) → works on small graphs on small clusters
• Heuristics
– One-phase (HEFT) [18]
• Direct mapping from Ranked List A to Ranked List B
– Two-phase [13, 16]:
• (1) Partitioning (offline)
• (2) Mapping (online)
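The one-phase HEFT approach can be illustrated with a compact sketch (a simplification: uniform processors are assumed here, whereas real HEFT averages per-processor costs): tasks are ordered by upward rank, then greedily mapped to the processor giving the earliest finish time.

```python
def heft_schedule(cost, succ, comm, n_procs):
    """One-phase list scheduling in the spirit of HEFT (uniform
    processors assumed for brevity)."""
    preds = {t: [] for t in cost}
    for u, vs in succ.items():
        for v in vs:
            preds[v].append(u)

    rank = {}
    def upward(t):
        # upward rank: own cost + most expensive path to an exit task
        if t not in rank:
            rank[t] = cost[t] + max((comm.get((t, s), 0) + upward(s)
                                     for s in succ.get(t, [])), default=0)
        return rank[t]

    placed = {}                     # task -> (proc, start, finish)
    free = [0.0] * n_procs
    for t in sorted(cost, key=upward, reverse=True):
        best = None
        for p in range(n_procs):
            # data is ready when predecessors finish (+ comm cost if remote)
            ready = max((placed[u][2] + (0 if placed[u][0] == p
                                         else comm.get((u, t), 0))
                         for u in preds[t]), default=0)
            start = max(free[p], ready)
            if best is None or start + cost[t] < best[2]:
                best = (p, start, start + cost[t])
        placed[t] = best
        free[best[0]] = best[2]
    return placed

# Diamond DAG: a fans out to b and c, which join at d
cost = {'a': 2, 'b': 3, 'c': 3, 'd': 2}
succ = {'a': ['b', 'c'], 'b': ['d'], 'c': ['d']}
comm = {('a', 'b'): 1, ('a', 'c'): 1, ('b', 'd'): 1, ('c', 'd'): 1}
sched = heft_schedule(cost, succ, comm, n_procs=2)
print(sched['d'][2])  # makespan -> 8.0
```

The greedy earliest-finish rule is exactly what makes HEFT “one-phase”: ranking and mapping happen in a single pass, with no separate partitioning step.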
8. Related work (cont.)
• Resource Demand Abstraction (RDA)
– Aggregated workload “per partition”
– Estimates and capacity planning
• Existing two-phase methods mostly target
– multi-processors on a single node
– we need multi-level scheduling/mapping
• Goal ≠ Maximum parallelism
– Resource footprint vs. execution latency
• Graph partitioning vs. Dataflow partitioning
– [1, 5, 20] vs. [16],…
– dataflows vs. long running MPI processes
[Figure: example DAG with nodes A–H]
10. Partition problem
• M(·) is a function that outputs the number of partitions M, given a PGT and a partition solution p
• T(·) is a function that outputs the completion time T, given a PGT and a partition solution p
• Ri(t) denotes the aggregated resource demand from all running Drops in partition i at time t
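Under a deliberately simple model (assumed here for illustration: Drops inside a partition run sequentially, partitions run concurrently, and each Drop has a known duration and resource demand), the three quantities above can be computed directly:

```python
def partition_stats(partitions, duration, demand):
    """M(p), T(p) and R_i(t) under a toy model: Drops inside a
    partition run sequentially; partitions run concurrently."""
    M = len(partitions)                      # M(.): number of partitions
    finish = [sum(duration[d] for d in part) for part in partitions]
    T = max(finish)                          # T(.): overall completion time

    def R(i, t):
        # R_i(t): demand of whichever Drop in partition i runs at time t
        elapsed = 0
        for d in partitions[i]:
            if elapsed <= t < elapsed + duration[d]:
                return demand[d]
            elapsed += duration[d]
        return 0

    return M, T, R

partitions = [['a', 'b'], ['c']]
duration = {'a': 1, 'b': 2, 'c': 4}
demand = {'a': 5, 'b': 3, 'c': 2}
M, T, R = partition_stats(partitions, duration, demand)
print(M, T, R(0, 1.5))  # -> 2 4 3
```

The real trade-off the slides describe lives in this triple: merging partitions lowers M and the peak Ri(t), but serialises more Drops and so pushes T up.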
12. Partition algorithm (WIP: less greedy)
• Stochastic Local Search Heuristics
– Meta-Heuristics
• Particle Swarm Optimisation
• Genetic algorithm
– Statistical mechanics
• Simulated annealing (MCMC)
• Mean field annealing
• Constraints-based Local Search
• Reinforcement learning (MDP)
– Monte Carlo Tree Search
Comparison on LOFAR Imaging (no deadline, DoP = 4)

Algorithm                                          Min Cost   # Parts   Run Time
Direct heuristics (edge zeroing)                   403        50        3
Particle Swarm Optimisation                        423        57        5
Simulated annealing                                713        73        64
Monte Carlo Tree Search (250 ms “thinking” time)   403        51        57
Monte Carlo Tree Search (150 ms “thinking” time)   408        52        35
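Of the stochastic local-search heuristics listed above, simulated annealing is the easiest to sketch generically. The skeleton below is a plain MCMC-style Metropolis acceptance loop (an illustrative sketch, not the implementation benchmarked in the table); it takes any state, cost function and neighbourhood move:

```python
import math
import random

def anneal(state, cost, neighbour, t0=1.0, cooling=0.995, steps=5000, seed=0):
    """Generic simulated-annealing skeleton with Metropolis acceptance."""
    rng = random.Random(seed)
    cur, cur_c = state, cost(state)
    best, best_c = cur, cur_c
    t = t0
    for _ in range(steps):
        cand = neighbour(cur, rng)
        c = cost(cand)
        # always accept improvements; accept worsening moves with
        # probability exp(-delta / t), which shrinks as t cools
        if c < cur_c or rng.random() < math.exp((cur_c - c) / t):
            cur, cur_c = cand, c
            if c < best_c:
                best, best_c = cand, c
        t *= cooling
    return best, best_c

# Toy usage: minimise (x - 3)^2 over the integers, moving x by +/-1
best, best_c = anneal(20, lambda x: (x - 3) ** 2,
                      lambda x, rng: x + rng.choice((-1, 1)))
print(best, best_c)
```

For graph partitioning, `state` would be an assignment of Drops to partitions and `neighbour` a move of one Drop between partitions; the cooling schedule then controls how long the search tolerates constraint-violating intermediate states.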
13. Partitioning constraint (DoP)
• How to preserve constraints → graph theory to the rescue!
– Brute force does not work well due to the huge number of antichains
– Dilworth's theorem (normal antichain)
• Let bpg = bipartite_graph(DAG)
• DoP == poset width == len(max_antichain) == min_num_chains == cardinality(dag) - len(max_matching(bpg))
– Maximum weighted K-families (weighted antichain)
• Split graph → admissible graph → residual graph (using maxflow) → Pi
• Drops that satisfy a Pi equation are in the maximum weighted antichain
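The Dilworth-based DoP computation can be sketched end-to-end: take the transitive closure of the DAG, form the bipartite split graph, find a maximum matching with augmenting paths, and read off width = |V| − |matching|. (A toy diamond DAG is used below; the node names are illustrative.)

```python
def dag_width(nodes, edges):
    """Width of a DAG (max antichain size) via Dilworth's theorem:
    width = |V| - |max matching| on the transitively-closed split graph."""
    adj = {n: [] for n in nodes}
    for u, v in edges:
        adj[u].append(v)

    def reachable(u):
        # transitive closure: every node reachable from u
        seen, stack = set(), list(adj[u])
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(adj[v])
        return seen

    closure = {u: reachable(u) for u in nodes}
    match = {}  # right-copy node -> matched left-copy node

    def augment(u, visited):
        # Hungarian-style augmenting-path search
        for v in closure[u]:
            if v not in visited:
                visited.add(v)
                if v not in match or augment(match[v], visited):
                    match[v] = u
                    return True
        return False

    matching = sum(augment(u, set()) for u in nodes)
    return len(nodes) - matching

# Diamond DAG: {B, C} is the largest antichain, so DoP = 2
print(dag_width('ABCD', [('A', 'B'), ('A', 'C'), ('B', 'D'), ('C', 'D')]))  # -> 2
```

The transitive-closure step is what distinguishes a minimum chain cover from a minimum path cover; skipping it would over-count chains and inflate the computed width.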
20. Graph execution on Tianhe-2
70K Drops running on 500 compute nodes of the Tianhe-2 supercomputer for a simulated LOFAR imaging run.

Gray – Drops not yet started
Yellow – Drops being executed
Green – Drops with completed executions
Red – Drops that failed
21. Summary
• SKA Dataflows
• Related work
• Graph execution engine → DALiuGE
• Partitioning problem
• Partitioning algorithm (current + WIP)
• Partitioning constraint → DoP
• Case study and preliminary results