Onyx:
A Flexible and Extensible
Data Processing System
전병곤, 김주연, 송원욱
Software Platform Lab
Joint work with 양영석, 이산하, 서장호, 어정윤, 이계원, 엄태건, 이우연,
이윤성, 정주성, 하현민, 정은지, 김수정, 유경인, 신동진
1
Data Processing from 10,000 Feet
2
Data Processing Application
Data Processing Framework
Resource Environment
Spark, Flink,
Hadoop MR,
Dryad, Tez,
...
Data Processing from 10,000 Feet
3
Data Processing Application
Data Processing Framework
Resource Environment
Spark, Flink,
Hadoop MR,
Dryad, Tez,
...
Existing frameworks perform poorly in new resource
environments (e.g., disaggregation, transient resources)
Disaggregation
4
Compute Storage
(Ref. OpenCompute)
Intermediate data generated from compute nodes
should be written to and read from storage nodes.
Transient Resources
5
Preemption!
Task preemption can cause expensive recomputation.
Cross Datacenter
6
Wide-area network bandwidth is scarce and expensive
Data Processing from 10,000 Feet
7
Data Processing Application
Data Processing Framework
Resource Environment
Spark, Flink,
Hadoop MR,
Dryad, Tez,
...
It is hard to add new application optimization features
to existing frameworks.
Dynamic Optimization
Dynamic skew handling
Optimizing job execution based on its characteristics
Adapting execution to resource elasticity
8
Key Observation
Current data processing frameworks
are not flexible and extensible.
9
=> A new flexible and extensible data processing system
Onyx Architecture
Dataflow Program
Onyx Compiler
Onyx Runtime
Cluster
10
Onyx Compiler
11
Beam Program
Physical Execution Plan
Onyx Compiler
Beam Frontend
Onyx Backend
Spark Frontend
Spark Program
IR
DAG
IR (Intermediate Representation) DAG: Program-agnostic DAG with Annotations
12
Vertex Edge
Vertex Labels
Type: Operator/Loop
Placement: GPUNode/ReservedNode/TransientNode/Any
Parallelism
Edge Labels
Type: 1:1/Broadcast/Shuffle
Mode: Push/Pull
Storage: Memory/Disk/RemoteDisk
MapReduce Example
13
Shuffle,Pull,Disk
Classical MapReduce
Small-scale MapReduce
Shuffle,Push,Memory
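The vertex and edge labels above can be sketched as a small data structure. This is a minimal illustration only: the class names (Vertex, Edge, IRDag) and the mapreduce helper are hypothetical and are not taken from Onyx's code; only the label values come from the slides.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Vertex:
    name: str
    type: str = "Operator"             # "Operator" or "Loop"
    placement: str = "Any"             # "GPUNode", "ReservedNode", "TransientNode", or "Any"
    parallelism: Optional[int] = None  # left unset until a pass annotates it


@dataclass
class Edge:
    src: str
    dst: str
    type: str     # "1:1", "Broadcast", or "Shuffle"
    mode: str     # "Push" or "Pull"
    storage: str  # "Memory", "Disk", or "RemoteDisk"

    def label(self) -> str:
        return f"{self.type},{self.mode},{self.storage}"


@dataclass
class IRDag:
    vertices: List[Vertex] = field(default_factory=list)
    edges: List[Edge] = field(default_factory=list)


def mapreduce(mode: str, storage: str) -> IRDag:
    """Same Map -> Reduce program; only the edge annotations differ."""
    return IRDag(
        vertices=[Vertex("Map"), Vertex("Reduce")],
        edges=[Edge("Map", "Reduce", "Shuffle", mode, storage)],
    )


classical = mapreduce("Pull", "Disk")
small_scale = mapreduce("Push", "Memory")
print(classical.edges[0].label())    # Shuffle,Pull,Disk
print(small_scale.edges[0].label())  # Shuffle,Push,Memory
```

The point of the sketch is that the program (the vertices) stays the same while the execution behavior is chosen purely through annotations.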
Compiler Passes
Transform an IR DAG into an optimized IR DAG after a series of “passes”
Compile-Time Annotation Pass examples:
● Parallelism Pass
● Executor Placement Pass
● Data Flow Model Pass
● Stage Partitioning Pass
14
● Transient Resource EP Pass
● Transient Resource DFM Pass
● Resource Disaggregation EP Pass
● Resource Disaggregation DFM Pass
Variations
Transform an IR DAG into an optimized IR DAG after a series of “passes”
Compile-Time Annotation Pass examples:
● Parallelism Pass
● Executor Placement Pass
● Data Flow Model Pass
● Stage Partitioning Pass
● Transient Resource EP Pass
● Transient Resource DFM Pass
● Resource Disaggregation EP Pass
● Resource Disaggregation DFM Pass
Compiler Passes
15
Common
Specialized
Specialized
Variations
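One way to picture "a series of passes" is as a list of functions that each take an IR DAG and return an annotated IR DAG. The sketch below assumes a toy dict-based DAG; the function names, the default parallelism of 4, and the annotation rules are illustrative assumptions, not Onyx's actual passes.

```python
from typing import Callable, Dict, List

Dag = Dict[str, List[dict]]   # {"vertices": [...], "edges": [...]}
Pass = Callable[[Dag], Dag]


def make_mapreduce() -> Dag:
    return {
        "vertices": [{"name": "Map"}, {"name": "Reduce"}],
        "edges": [{"src": "Map", "dst": "Reduce", "type": "Shuffle"}],
    }


def parallelism_pass(dag: Dag) -> Dag:
    """Common pass (toy rule): give every vertex a default parallelism."""
    for v in dag["vertices"]:
        v.setdefault("parallelism", 4)
    return dag


def default_dfm_pass(dag: Dag) -> Dag:
    """Common data flow model pass (toy rule): pull from local disk."""
    for e in dag["edges"]:
        e["mode"], e["storage"] = "Pull", "Disk"
    return dag


def disaggregation_dfm_pass(dag: Dag) -> Dag:
    """Specialized variation (toy rule): intermediate data goes to storage nodes."""
    for e in dag["edges"]:
        e["mode"], e["storage"] = "Pull", "RemoteDisk"
    return dag


def run_passes(dag: Dag, passes: List[Pass]) -> Dag:
    for p in passes:
        dag = p(dag)
    return dag


# A policy is just an ordered list of passes: common passes plus variations.
default_policy = [parallelism_pass, default_dfm_pass]
disaggregation_policy = [parallelism_pass, disaggregation_dfm_pass]

print(run_passes(make_mapreduce(), default_policy)["edges"][0]["storage"])         # Disk
print(run_passes(make_mapreduce(), disaggregation_policy)["edges"][0]["storage"])  # RemoteDisk
```

Swapping one specialized pass for another changes the policy without touching the common passes or the program.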
Compiler Passes
Transform an IR DAG into an optimized IR DAG after a series of “passes”
Compile-Time Reshaping Pass examples:
● Loop Extraction Pass
● Loop Fusion Pass (Loop Optimization)
● Common Subexpression Elimination Pass
● Data Skew Reshaping Pass
Runtime Pass example:
● Data Skew Runtime Pass
16
Compiler Passes
Transform an IR DAG into an optimized IR DAG after a series of “passes”
Compile-Time Reshaping Pass examples:
● Loop Extraction Pass
● Loop Fusion Pass (Loop Optimization)
● Common Subexpression Elimination Pass
● Data Skew Reshaping Pass
Runtime Pass example:
● Data Skew Runtime Pass
17
Specialized
Compiler to Runtime
18
Type: “Map” Operator
Placement: “Compute” Node
Parallelism: 100
Shuffle,Pull,Disk
Type: “Reduce” Operator
Placement: “Compute” Node
Parallelism: 50
Map Stage Reduce Stage
Optimized IR DAG
Compiler to Runtime
19
PhysicalStage PhysicalStage
“Map” Tasks x 100 “Reduce” Tasks x 50
I/O channels for intermediate data flow between tasks
Physical DAG
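The expansion from an annotated stage to parallel tasks and I/O channels can be sketched as follows. The function names and task naming scheme are hypothetical; the sketch only assumes that a Shuffle edge connects every source task to every destination task.

```python
from itertools import product
from typing import List, Tuple


def expand_stage(name: str, parallelism: int) -> List[str]:
    """One task per unit of parallelism."""
    return [f"{name}-{i}" for i in range(parallelism)]


def expand_edge(src_tasks: List[str], dst_tasks: List[str],
                edge_type: str) -> List[Tuple[str, str]]:
    """Create the I/O channels implied by an IR edge."""
    if edge_type == "1:1":
        if len(src_tasks) != len(dst_tasks):
            raise ValueError("1:1 edges need equal parallelism on both sides")
        return list(zip(src_tasks, dst_tasks))
    # Shuffle and Broadcast both connect every source task to every
    # destination task; they differ in which data flows over each channel.
    return list(product(src_tasks, dst_tasks))


map_tasks = expand_stage("Map", 100)
reduce_tasks = expand_stage("Reduce", 50)
channels = expand_edge(map_tasks, reduce_tasks, "Shuffle")
print(len(map_tasks), len(reduce_tasks), len(channels))  # 100 50 5000
```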
Distributed Execution in Onyx Runtime
Stage
20
Executor Executor Executor Executor
Master
Distributed Execution in Onyx Runtime
Master Stage
21
Executor Executor Executor Executor
TaskGroup(Tasks)
Distributed Execution in Onyx Runtime
Master Stage
22
Executor Executor Executor Executor
Onyx in Action
23
Onyx in Action
● Onyx compiler and runtime components
● Onyx job execution: MR, ALS
● Onyx runtime optimization: dynamic skew handling
● Harnessing transient resources with Onyx
Other optimizations are omitted due to time constraints!
24
Key Components (Compiler)
25
Key Components (Runtime)
26
Key Components (Runtime)
27
Job Execution Demo
28
MapReduce
● We will show two executions of MapReduce using
different settings:
○ Intermediate data is saved on disk, and pulled by the reducers
○ Intermediate data is saved in memory, and pushed to the reducers
● In order to vary the settings, we go through the following
passes:
○ A data store pass
○ A data flow model pass
○ Both of these are “Annotation” passes
29
Demo
Map Data in Disk, Pulled
30
Type: “Map” Operator
Placement: “Compute” Node
Shuffle,Pull,Disk
Type: “Reduce” Operator
Placement: “Compute” Node
Reduce
Stage
Map
Stage
Demo
Map Data in Memory, Pushed
31
Type: “Map” Operator
Placement: “Compute” Node
Shuffle,Push,Memory
Type: “Reduce” Operator
Placement: “Compute” Node
Reduce
Stage
Map
Stage
Alternating Least Squares Example
● Alternating Least Squares (ALS) is an ML algorithm commonly
used in recommendation systems.
● Most ML algorithms are iterative processes
=> ALS is one of them!
● But how is this expressed in terms of a DAG? (Acyclic!)
32
Alternating Least Squares Example
Naively…
33
(Read input data) . . . . . . . . . . . . (Write output). . . . . . .
Iteration 1 Iteration 2 Iteration N
But what if we want to decide this
“N” according to some condition?
(e.g., model convergence in ML)
A set of operators that executes the ALS algorithm
Alternating Least Squares Example
Something special we have for the ALS example: Loops!
34
(Read input data) . . . . . . . . . . . . (Write output)
LoopVertex
with termination condition
(Read input data) . . . . . . . . . (Write output). . . . . .
Iteration 1 Iteration 2 Iteration N
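A loop vertex with a termination condition can be sketched as an object that holds a loop body and a stop predicate, and is unrolled at run time instead of at compile time. The class below is a hypothetical illustration, not Onyx's LoopVertex API, and the body is a simple numeric iteration standing in for one ALS step.

```python
from typing import Callable, Tuple


class LoopVertex:
    """Wraps a loop body (a sub-DAG in a real system) and a termination condition."""

    def __init__(self, body: Callable[[float], float],
                 should_stop: Callable[[float, float], bool],
                 max_iterations: int) -> None:
        self.body = body
        self.should_stop = should_stop
        self.max_iterations = max_iterations

    def run(self, state: float) -> Tuple[float, int]:
        iterations = 0
        while iterations < self.max_iterations:
            new_state = self.body(state)
            iterations += 1
            stop = self.should_stop(state, new_state)
            state = new_state
            if stop:  # e.g., the model has converged
                break
        return state, iterations


# Stand-in for an iterative ML job: Newton's iteration for sqrt(2).
loop = LoopVertex(
    body=lambda x: (x + 2.0 / x) / 2.0,
    should_stop=lambda old, new: abs(new - old) < 1e-9,
    max_iterations=100,
)
model, n = loop.run(1.0)
print(model, n)
```

The number of iterations is not fixed in the DAG; it is decided by the condition while the job runs, bounded by max_iterations.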
Demo
ALS
35
Dynamic Data Partitioning Example
● What happens if there is a data skew while executing a job?
● How do we detect such a data skew and partition data appropriately?
36
Onyx Compiler
Onyx Runtime
AnnotationPass(es) and
ReshapingPass(es)
IR DAG
Dynamic Data Partitioning Example
● What happens if there is a data skew while executing a job?
● How do we detect such a data skew and partition data appropriately?
37
Onyx Compiler
Onyx Runtime
Physical DAG Conversion
Shuffle,Pull,Disk
Stage Stage
Optimized IR DAG
Dynamic Data Partitioning Example
38
Onyx Compiler
Onyx Runtime
PhysicalStage PhysicalStage
Physical DAG
Physical DAG Conversion
● What happens if there is a data skew while executing a job?
● How do we detect such a data skew and partition data appropriately?
Dynamic Data Partitioning Example
39
Onyx Compiler
Onyx Runtime
Execute!
PhysicalStage PhysicalStage
Physical DAG
● What happens if there is a data skew while executing a job?
● How do we detect such a data skew and partition data appropriately?
Dynamic Data Partitioning Example
40
Onyx Compiler
Onyx Runtime
Data Size Metric
Physical DAG Executing...
● What happens if there is a data skew while executing a job?
● How do we detect such a data skew and partition data appropriately?
Dynamic Data Partitioning Example
41
Onyx Compiler
Onyx Runtime
New DAG
RuntimePass(es)
● What happens if there is a data skew while executing a job?
● How do we detect such a data skew and partition data appropriately?
Dynamic Data Partitioning Example
42
Onyx Compiler
Onyx Runtime
Execute! New DAG
● What happens if there is a data skew while executing a job?
● How do we detect such a data skew and partition data appropriately?
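The idea of using a data size metric to repartition can be sketched as follows. This is an assumption-laden illustration: it supposes the runtime reports the size of each hash partition of the map output, and it uses a greedy largest-first assignment as a stand-in for whatever rule the Data Skew Runtime Pass actually applies. All names are hypothetical.

```python
import heapq
from typing import Dict, List


def static_assignment(sizes: Dict[int, int], num_reducers: int) -> Dict[int, List[int]]:
    """Baseline: partition id modulo the number of reducers, ignoring sizes."""
    assignment: Dict[int, List[int]] = {r: [] for r in range(num_reducers)}
    for pid in sizes:
        assignment[pid % num_reducers].append(pid)
    return assignment


def skew_aware_assignment(sizes: Dict[int, int], num_reducers: int) -> Dict[int, List[int]]:
    """Greedy: give the next-largest partition to the least-loaded reducer."""
    heap = [(0, r) for r in range(num_reducers)]  # (current load, reducer id)
    heapq.heapify(heap)
    assignment: Dict[int, List[int]] = {r: [] for r in range(num_reducers)}
    for pid, size in sorted(sizes.items(), key=lambda kv: -kv[1]):
        load, r = heapq.heappop(heap)
        assignment[r].append(pid)
        heapq.heappush(heap, (load + size, r))
    return assignment


def max_load(assignment: Dict[int, List[int]], sizes: Dict[int, int]) -> int:
    return max(sum(sizes[p] for p in parts) for parts in assignment.values())


# Hypothetical data size metric: partition 0 is heavily skewed.
sizes = {0: 90, 1: 10, 2: 10, 3: 10, 4: 10, 5: 10, 6: 10, 7: 10}
print(max_load(static_assignment(sizes, 2), sizes))      # 120
print(max_load(skew_aware_assignment(sizes, 2), sizes))  # 90
```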
Demo
Dynamic Data Partitioning
43
Harnessing Transient Resources with Onyx
44
Harnessing Transient Resources with Onyx
45
Using the techniques introduced in
Pado: A Data Processing Engine for
Harnessing Transient Resources in Datacenters
from EuroSys 2017
Batch Engine
46
MapReduce
Flume
Spark
...
Transient Resources
?
47
Transient Resources
Resources borrowed from
over-provisioned latency-critical jobs
(search service, online mall, etc.)
Data Analytics with Transient Resources
48
....
Dataflow
Program
Transient
Data Analytics with Transient Resources
49
....
Dataflow
Program
Execute! Transient
Tasks Tasks Tasks Tasks
Tasks Tasks Tasks Tasks
Tasks Tasks Tasks Tasks
Data Analytics with Transient Resources
50
....
Dataflow
Program
Execute! Transient
Tasks Tasks Tasks Tasks
Tasks Tasks Tasks Tasks
Tasks Tasks Tasks Tasks
Data Analytics with Transient Resources
51
....
Dataflow
Program
Execute! Transient
Data
Data
Data
Solution
52
....
Dataflow
Program Transient
Solution
53
....
Dataflow
Program Transient
Analyze
Solution
54
....
Dataflow
Program
Other
Computations
Valuable
Computations Reserved
Transient
Analyze
Valuable
Our definition of Valuable computations
Not so valuable
One-to-One One-to-Many Many-to-One Many-to-Many
Valuable
Our definition of Valuable computations
Not so valuable
One-to-One One-to-Many Many-to-One Many-to-Many
... ... ... ...
Map-Reduce with Transient Containers
(Case #1) Batch Engines (e.g., Spark)
(Case #2) Our Approach
57
Many-to-Many
Map Reduce
Batch Engines (e.g., Spark)
2 Transient, 1 Reserved Containers
58
Our Approach
Reserved Transient
Batch Engines (e.g., Spark)
Map, Reduce tasks on each container
59
Reserved Transient
Our Approach
Map1 Map2 Map3
Reduce1 Reduce2 Reduce3
60
No dependency Many-to-Many
Map Reduce
Many-to-Many
Map Reduce
61
No dependency
⇒ Not so valuable
⇒ Transient
Many-to-Many
⇒ Valuable
⇒ Reserved
Map Reduce
Many-to-Many
Map Reduce
Batch Engines (e.g., Spark)
Map tasks on Transient and Reduce task on Reserved
62
Our Approach
Map1 Map2 Map3
Reduce1 Reduce2 Reduce3 Reduce1
Map1 Map2
Reserved Transient
Batch Engines (e.g., Spark)
63
Our Approach
Map1 Map2 Map3
Maintain Map Outputs
on Local Disks
Reserved Transient
Batch Engines (e.g., Spark)
64
Our Approach
Map1 Map2 Map3 Map1 Map2
Push Map Outputs to Destination
Reserved Containers
Reserved Transient
Batch Engines (e.g., Spark)
65
Our Approach
Reduce1 Reduce2 Reduce3
Pull Map Outputs
Map1 Map2
Reserved Transient
Batch Engines (e.g., Spark)
66
Our Approach
Reduce1 Reduce2 Reduce3
Reserved Transient
Reduce1
Read Input Data from Local
Reserved Containers
Batch Engines (e.g., Spark)
67
Our Approach
Reduce1 Reduce2 Reduce3
Eviction of Transient Containers
→ Map Outputs Destroyed
Reserved Transient
Reduce1
Batch Engines (e.g., Spark)
68
Our Approach
Reduce1 Reduce2 Reduce3
Reserved Transient
Reduce1
Eviction of Transient Containers
→ Map Outputs Not Destroyed
Batch Engines (e.g., Spark)
69
Our Approach
Reduce1 Reduce2 Reduce3
Map1 Map2 Map3
Cascading Recomputation of
5 Tasks
Reserved Transient
Reduce1
No Recomputation
Step 1:
Transient/Reserved
Executor Placement Pass
70
Operator Placement Example with the
Transient Resource Policy
Multinomial Logistic Regression (MLR): a machine learning
application for classifying inputs, such as tumors as malignant
or benign, or ad clicks as profitable or not.
Gradients are used to update the regression
model, which is used for prediction.
71
Executor Placement Example
Create
1st
Model
Compute
Gradient
Aggr
Gradient
Compute
2nd
Model
Read
Training
Data
....
72
One-to-One
One-to-Many
Many-to-One Costly!
Create
1st
Model
Compute
Gradient
Aggr
Gradient
Compute
2nd
Model
Read
Training
Data
....
Reserved Transient
No Dependency
No Dependency
73
Many-to-One Costly!
One-to-One
One-to-Many
Executor Placement Example
Create
1st
Model
Compute
Gradient
Aggr
Gradient
Compute
2nd
Model
Read
Training
Data
....
Reserved Transient
74
Many-to-One Costly!
No Costly Dependency
with Parents
One-to-One
One-to-Many
Executor Placement Example
Compute
Gradient
Aggr
Gradient
Compute
2nd
Model
Read
Training
Data
....
Reserved Transient
Costly Dependency with Parent
75
Many-to-One Costly!
One-to-One
One-to-Many
Costly Dependency
with Parent, Pipelined
Executor Placement Example
Create
1st
Model
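The placement walk-through above can be sketched as a small pass. The rule used here is an assumption pieced together from the slides: an operator goes to Reserved if it has a costly (Many-to-One or Many-to-Many) dependency on a parent, or if it is pipelined one-to-one behind Reserved parents; everything else goes to Transient. The edge list is also an assumed reading of the MLR figure, and the function name is hypothetical.

```python
from typing import Dict, List, Tuple

COSTLY = {"Many-to-One", "Many-to-Many"}


def executor_placement_pass(vertices: List[str],
                            edges: List[Tuple[str, str, str]]) -> Dict[str, str]:
    """vertices must be in topological order; edges are (src, dst, dependency)."""
    placement: Dict[str, str] = {}
    for v in vertices:
        incoming = [(src, kind) for src, dst, kind in edges if dst == v]
        costly = any(kind in COSTLY for _, kind in incoming)
        pipelined = bool(incoming) and all(
            kind == "One-to-One" and placement[src] == "Reserved"
            for src, kind in incoming
        )
        placement[v] = "Reserved" if costly or pipelined else "Transient"
    return placement


# Assumed shape of one MLR iteration.
vertices = ["Read Training Data", "Create 1st Model", "Compute Gradient",
            "Aggr Gradient", "Compute 2nd Model"]
edges = [
    ("Read Training Data", "Compute Gradient", "One-to-One"),
    ("Create 1st Model", "Compute Gradient", "One-to-Many"),
    ("Compute Gradient", "Aggr Gradient", "Many-to-One"),
    ("Aggr Gradient", "Compute 2nd Model", "One-to-One"),
]
for name, where in executor_placement_pass(vertices, edges).items():
    print(f"{name}: {where}")
```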
Step 2:
Data Flow Model Pass
76
Compute
Gradient
Aggr
Gradient
Compute
2nd
Model
Read
Training
Data
....
Reserved Transient
77
Recall...
Safe! Prone to evictions :(
Create
1st
Model
Compute
Gradient
Aggr
Gradient
Compute
2nd
Model
Read
Training
Data
....
Reserved Transient
78
Must evacuate data out of transient executors ASAP
Create
1st
Model
Compute
Gradient
Aggr
Gradient
Compute
2nd
Model
Read
Training
Data
....
Reserved Transient
79
Push data out as soon as it is ready!
Push
Push Push
Create
1st
Model
Push
Compute
Gradient
Aggr
Gradient
Compute
2nd
Model
Read
Training
Data
....
Reserved Transient
80
No need to hurry for data in Reserved containers
Pull Pull
Push
Push Push
Create
1st
Model
Push
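The push/pull decision can be sketched as a pass over edges that looks only at where the producing operator runs. The rule (Push when the source is on Transient, Pull otherwise) is an assumption drawn from the two slide captions above; the placement map, edge list, and function name are hypothetical.

```python
from typing import Dict, List, Tuple


def data_flow_model_pass(edges: List[Tuple[str, str]],
                         placement: Dict[str, str]) -> Dict[Tuple[str, str], str]:
    """Annotate each edge with Push or Pull based on the source's placement."""
    modes: Dict[Tuple[str, str], str] = {}
    for src, dst in edges:
        # Data on transient executors may vanish on eviction: push it out early.
        # Data on reserved executors is safe: consumers can pull when ready.
        modes[(src, dst)] = "Push" if placement[src] == "Transient" else "Pull"
    return modes


# Assumed output of an earlier placement step for one MLR iteration.
placement = {
    "Read Training Data": "Transient",
    "Create 1st Model": "Transient",
    "Compute Gradient": "Transient",
    "Aggr Gradient": "Reserved",
    "Compute 2nd Model": "Reserved",
}
edges = [
    ("Read Training Data", "Compute Gradient"),
    ("Create 1st Model", "Compute Gradient"),
    ("Compute Gradient", "Aggr Gradient"),
    ("Aggr Gradient", "Compute 2nd Model"),
]
for (src, dst), mode in data_flow_model_pass(edges, placement).items():
    print(f"{src} -> {dst}: {mode}")
```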
Step 3:
Stage Partitioning Pass
81
Stage Partitioning in Compiler
82
Execute subgraph-by-subgraph
⇒ Partition into subgraphs
⇒ Good abstraction for handling evictions/faults
Compute
Gradient
Aggr
Gradient
Compute
2nd
Model
Read
Training
Data
....
Reserved Transient
83
Stage Partitioning Example
Create
1st
Model
Compute
Gradient
Aggr
Gradient
Compute
2nd
Model
Read
Training
Data
....
Stage 1
Reserved Transient
84
Stage Partitioning Example
Create
1st
Model
Compute
Gradient
Aggr
Gradient
Compute
2nd
Model
Read
Training
Data
....
Stage 1
Stage 2
Reserved Transient
85
Stage Partitioning Example
Create
1st
Model
Compute
Gradient
Aggr
Gradient
Compute
2nd
Model
Read
Training
Data
....
Stage 1
Stage 2
Reserved Transient
86
Stage Partitioning Example
Stage 3
Create
1st
Model
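Stage partitioning can be sketched as grouping operators that can run together into one subgraph. The rule below is an assumption for illustration: an operator joins its parent's stage when they are connected by a one-to-one edge and share the same placement, and otherwise starts a new stage. The DAG shape, placement map, and function name are hypothetical, and the resulting grouping is not claimed to match the figure exactly.

```python
from typing import Dict, List, Tuple


def stage_partitioning_pass(vertices: List[str],
                            edges: List[Tuple[str, str, str]],
                            placement: Dict[str, str]) -> Dict[str, int]:
    """vertices must be in topological order; returns a stage number per vertex."""
    stage_of: Dict[str, int] = {}
    next_stage = 1
    for v in vertices:
        mergeable_parents = [
            src for src, dst, kind in edges
            if dst == v and kind == "One-to-One" and placement[src] == placement[v]
        ]
        if mergeable_parents:
            stage_of[v] = stage_of[mergeable_parents[0]]
        else:
            stage_of[v] = next_stage
            next_stage += 1
    return stage_of


vertices = ["Read Training Data", "Create 1st Model", "Compute Gradient",
            "Aggr Gradient", "Compute 2nd Model"]
edges = [
    ("Read Training Data", "Compute Gradient", "One-to-One"),
    ("Create 1st Model", "Compute Gradient", "One-to-Many"),
    ("Compute Gradient", "Aggr Gradient", "Many-to-One"),
    ("Aggr Gradient", "Compute 2nd Model", "One-to-One"),
]
placement = {
    "Read Training Data": "Transient",
    "Create 1st Model": "Transient",
    "Compute Gradient": "Transient",
    "Aggr Gradient": "Reserved",
    "Compute 2nd Model": "Reserved",
}
stages = stage_partitioning_pass(vertices, edges, placement)
print(stages)
```

Because each stage is a self-contained subgraph, an eviction or fault can be handled by re-running only the affected stage rather than the whole job.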
Demo
Executor Placement Pass
DataFlowModel Pass
Stage Partitioning Pass
with MLR example
87
Batch Engines
88
Spark 2.0.0
Onyx with
suggested
optimizations
VS
Containers
● Amazon EC2 instances (with local SSDs) as containers
● 40 Transient Containers, 5 Reserved Containers
● All containers used for computation
89
Workloads
● Alternating Least Squares
Yahoo! Music User Ratings of Songs with Artist, Album, and Genre Meta
Information, v. 1.0. https://webscope.sandbox.yahoo.com/catalog.php?datatype=r
● Multinomial Logistic Regression
Synthetic
● Map-Reduce
Page view statistics for Wikimedia projects.
https://dumps.wikimedia.org/other/pagecounts-raw
90
Job Completion Time (Lower is Better)
91
4.13x
3.52x
5.15x
Summary
● Introduces a new data processing system that is flexible
and extensible
○ Compiler that represents various execution policies
○ Runtime that is modular and reconfigurable
● Adapts data processing seamlessly for new deployment
and application requirements
92
93
We are working on creating an Apache incubator
project. We look forward to contributions from many
developers!
We are hiring software developers!
Contact: onyx@spl.snu.ac.kr
Software platform lab site: http://spl.snu.ac.kr

[214] Flexible and Extensible Big Data Processing