3. SKA Science Data Processor (SDP) high-level dataflow
[Pipeline diagram: Data ingestion (0.5 TB/s per site) → Data management → Data processing (130 PFlops per site) → Data analysis and visualisation]
4. SKA Data Challenges
• Multiple concurrent observing projects
• Data sharing between projects
• Capital and operational budget limited
• Power, Cooling
• Acquisition, maintenance & software development costs
• Throughput: produce ~0.2-10 Tera Voxels/second
• Automated 24/7 operation
• Data parallelism: Millions of related tasks on thousands of nodes
5. Data deluge

Telescope    Raw Data Rate    Archive Growth
MWA          1.4 TB/hour      5 PB/year
LSST         1.5 TB/hour      6 PB/year
ASKAP        9 TB/hour        5.5 PB/year
SKA1-Low     1,400 TB/hour    150 PB/year
arxiv.org/abs/1702.07617
6. DALiuGE
• Defined once, executed anywhere (well)
– Separation
– Coherence
• Work with existing software components
• Extended dataflow model
– Unlock “hidden” parallelism
– Data is given autonomy
• Decentralised execution via event propagation
• Built-in Data lifecycle management
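The “decentralised execution via event propagation” idea can be sketched with two toy classes (hypothetical names, not the actual DALiuGE API): a data Drop fires a completion event that triggers its consumer applications, so the graph drives itself without a central scheduler.

```python
# Toy sketch of event-driven graph execution
# (hypothetical classes, not the real DALiuGE API).
class DataDrop:
    """Holds a piece of data and notifies consumers when it is written."""
    def __init__(self, name):
        self.name = name
        self.consumers = []
        self.data = None

    def subscribe(self, app):
        self.consumers.append(app)

    def write(self, data):
        self.data = data
        # completion event propagates to all consumer applications
        for app in self.consumers:
            app.on_input_complete(self)


class AppDrop:
    """Runs a function when its input Drop completes, writing the result."""
    def __init__(self, fn, output):
        self.fn = fn
        self.output = output

    def on_input_complete(self, drop):
        self.output.write(self.fn(drop.data))


# a -> square -> b: writing to `a` cascades through the graph
a, b = DataDrop("a"), DataDrop("b")
a.subscribe(AppDrop(lambda x: x * x, b))
a.write(3)
print(b.data)  # -> 9
```

Chaining more Drops extends the cascade: each write is itself an event, so execution spreads through the graph without any node holding a global view.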
7. Related work
• Dataflow (DAG) computation model [7]
– Unlock “hidden” parallelism
• DAG mapping (QAP) is a hard problem [5]
• Exact solutions
– Assignment graph [2], allocation graph [19] (max flow)
– O(|V| · P) → works on small graphs on small clusters
• Heuristics
– One-phase (HEFT) [18]
• Direct mapping from Ranked List A to Ranked List B
– Two-phase [13, 16]:
• (1) Partitioning (offline)
• (2) Mapping (online)
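The one-phase HEFT approach can be illustrated with a compact sketch (a simplification: uniform processors are assumed here, whereas real HEFT averages per-processor costs): tasks are ordered by upward rank, then greedily mapped to the processor giving the earliest finish time.

```python
def heft_schedule(cost, succ, comm, n_procs):
    """One-phase list scheduling in the spirit of HEFT (uniform
    processors assumed for brevity)."""
    preds = {t: [] for t in cost}
    for u, vs in succ.items():
        for v in vs:
            preds[v].append(u)

    rank = {}
    def upward(t):
        # upward rank: own cost + most expensive path to an exit task
        if t not in rank:
            rank[t] = cost[t] + max((comm.get((t, s), 0) + upward(s)
                                     for s in succ.get(t, [])), default=0)
        return rank[t]

    placed = {}                     # task -> (proc, start, finish)
    free = [0.0] * n_procs
    for t in sorted(cost, key=upward, reverse=True):
        best = None
        for p in range(n_procs):
            # data is ready when predecessors finish (+ comm cost if remote)
            ready = max((placed[u][2] + (0 if placed[u][0] == p
                                         else comm.get((u, t), 0))
                         for u in preds[t]), default=0)
            start = max(free[p], ready)
            if best is None or start + cost[t] < best[2]:
                best = (p, start, start + cost[t])
        placed[t] = best
        free[best[0]] = best[2]
    return placed

# Diamond DAG: a fans out to b and c, which join at d
cost = {'a': 2, 'b': 3, 'c': 3, 'd': 2}
succ = {'a': ['b', 'c'], 'b': ['d'], 'c': ['d']}
comm = {('a', 'b'): 1, ('a', 'c'): 1, ('b', 'd'): 1, ('c', 'd'): 1}
sched = heft_schedule(cost, succ, comm, n_procs=2)
print(sched['d'][2])  # makespan -> 8.0
```

The greedy earliest-finish rule is exactly what makes HEFT “one-phase”: ranking and mapping happen in a single pass, with no separate partitioning step.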
8. Related work (cont.)
• Resource Demand Abstraction (RDA)
– Aggregated workload “per partition”
– Estimates and capacity planning
• Existing two-phase methods mostly target
– multi-processors on a single node
– we need multi-level scheduling/mapping
• Goal ≠ Maximum parallelism
– Resource footprint vs. execution latency
• Graph partitioning vs. Dataflow partitioning
– [1, 5, 20] vs. [16],…
– dataflows vs. long running MPI processes
[Figure: example DAG with nodes A–H]
10. Partition problem
• M(·) is a function that outputs the number of partitions M, given a PGT and a partition solution p
• T(·) is a function that outputs the completion time T, given a PGT and a partition solution p
• Ri(t) denotes the aggregated resource demand from all running Drops in partition i at time t
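Under a deliberately simple model (assumed here for illustration: Drops inside a partition run sequentially, partitions run concurrently, and each Drop has a known duration and resource demand), the three quantities above can be computed directly:

```python
def partition_stats(partitions, duration, demand):
    """M(p), T(p) and R_i(t) under a toy model: Drops inside a
    partition run sequentially; partitions run concurrently."""
    M = len(partitions)                      # M(.): number of partitions
    finish = [sum(duration[d] for d in part) for part in partitions]
    T = max(finish)                          # T(.): overall completion time

    def R(i, t):
        # R_i(t): demand of whichever Drop in partition i runs at time t
        elapsed = 0
        for d in partitions[i]:
            if elapsed <= t < elapsed + duration[d]:
                return demand[d]
            elapsed += duration[d]
        return 0

    return M, T, R

partitions = [['a', 'b'], ['c']]
duration = {'a': 1, 'b': 2, 'c': 4}
demand = {'a': 5, 'b': 3, 'c': 2}
M, T, R = partition_stats(partitions, duration, demand)
print(M, T, R(0, 1.5))  # -> 2 4 3
```

The real trade-off the slides describe lives in this triple: merging partitions lowers M and the peak Ri(t), but serialises more Drops and so pushes T up.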
12. Partition algorithm (WIP: less greedy)
• Stochastic Local Search Heuristics
– Meta-Heuristics
• Particle Swarm Optimisation
• Genetic algorithm
– Statistical mechanics
• Simulated annealing (MCMC)
• Mean field annealing
• Constraints-based Local Search
• Reinforcement learning (MDP)
– Monte Carlo Tree Search
Comparison on LOFAR Imaging (no deadline, DoP = 4)

Algorithm                                          Min Cost   # Parts   Run Time
Direct heuristics (edge zeroing)                   403        50        3
Particle Swarm Optimisation                        423        57        5
Simulated annealing                                713        73        64
Monte Carlo Tree Search (250 ms “thinking” time)   403        51        57
Monte Carlo Tree Search (150 ms “thinking” time)   408        52        35
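Of the stochastic local-search heuristics listed above, simulated annealing is the easiest to sketch generically. The skeleton below is a plain MCMC-style Metropolis acceptance loop (an illustrative sketch, not the implementation benchmarked in the table); it takes any state, cost function and neighbourhood move:

```python
import math
import random

def anneal(state, cost, neighbour, t0=1.0, cooling=0.995, steps=5000, seed=0):
    """Generic simulated-annealing skeleton with Metropolis acceptance."""
    rng = random.Random(seed)
    cur, cur_c = state, cost(state)
    best, best_c = cur, cur_c
    t = t0
    for _ in range(steps):
        cand = neighbour(cur, rng)
        c = cost(cand)
        # always accept improvements; accept worsening moves with
        # probability exp(-delta / t), which shrinks as t cools
        if c < cur_c or rng.random() < math.exp((cur_c - c) / t):
            cur, cur_c = cand, c
            if c < best_c:
                best, best_c = cand, c
        t *= cooling
    return best, best_c

# Toy usage: minimise (x - 3)^2 over the integers, moving x by +/-1
best, best_c = anneal(20, lambda x: (x - 3) ** 2,
                      lambda x, rng: x + rng.choice((-1, 1)))
print(best, best_c)
```

For graph partitioning, `state` would be an assignment of Drops to partitions and `neighbour` a move of one Drop between partitions; the cooling schedule then controls how long the search tolerates constraint-violating intermediate states.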
13. Partitioning constraint (DoP)
• How to preserve constraints → graph theory to the rescue!
– Brute force does not work well due to the huge number of antichains
– Dilworth's theorem (normal antichain)
• Let bpg = bipartite_graph(DAG)
• DoP == poset width == len(max_antichain) == min_num_chains == cardinality(dag) - len(max_matching(bpg))
– Maximum weighted K-families (weighted antichain)
• Split graph → admissible graph → residual graph (using maxflow) → Pi
• Drops that satisfy a Pi equation are in the maximum weighted antichain
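The Dilworth-based DoP computation can be sketched end-to-end: take the transitive closure of the DAG, form the bipartite split graph, find a maximum matching with augmenting paths, and read off width = |V| − |matching|. (A toy diamond DAG is used below; the node names are illustrative.)

```python
def dag_width(nodes, edges):
    """Width of a DAG (max antichain size) via Dilworth's theorem:
    width = |V| - |max matching| on the transitively-closed split graph."""
    adj = {n: [] for n in nodes}
    for u, v in edges:
        adj[u].append(v)

    def reachable(u):
        # transitive closure: every node reachable from u
        seen, stack = set(), list(adj[u])
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(adj[v])
        return seen

    closure = {u: reachable(u) for u in nodes}
    match = {}  # right-copy node -> matched left-copy node

    def augment(u, visited):
        # Hungarian-style augmenting-path search
        for v in closure[u]:
            if v not in visited:
                visited.add(v)
                if v not in match or augment(match[v], visited):
                    match[v] = u
                    return True
        return False

    matching = sum(augment(u, set()) for u in nodes)
    return len(nodes) - matching

# Diamond DAG: {B, C} is the largest antichain, so DoP = 2
print(dag_width('ABCD', [('A', 'B'), ('A', 'C'), ('B', 'D'), ('C', 'D')]))  # -> 2
```

The transitive-closure step is what distinguishes a minimum chain cover from a minimum path cover; skipping it would over-count chains and inflate the computed width.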
20. Graph execution on Tianhe-2
70K Drops running on 500 compute nodes of the Tianhe-2 supercomputer for a simulated LOFAR imaging run.

Gray – Drops not yet started
Yellow – Drops being executed
Green – Drops with completed executions
Red – Drops that failed
21. Summary
• SKA Dataflows
• Related work
• Graph execution engine → DALiuGE
• Partitioning problem
• Partitioning algorithm (current + WIP)
• Partitioning constraint → DoP
• Case study and preliminary results