PACT19, MOSAIC : Heterogeneity-, Communication-, and Constraint-Aware Model Slicing and Execution for Accurate and Efficient Inference

MOSAIC: Heterogeneity-, Communication-, and Constraint-Aware Model Slicing
and Execution for Accurate and Efficient Inference
Myeonggyun Han, Jihoon Hyun, Seongbeom Park,
Jinsu Park, Woongki Baek
Computer Architecture and Systems Lab (CASL)
UNIST
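Presenter: Jemin Lee (leejaymin@etri.re.kr)
Slides prepared by: Jinse Kwon (kwonse@cnu.ac.kr)
28th International Conference on Parallel Architectures and Compilation
Techniques
- PACT '19
- Acceptance rate: 27% (34/126)
- KIISE (Korean Institute of Information Scientists and Engineers) list of distinguished conferences, IF: 2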

DL Inference on mobile systems
• Growing need for accurate and efficient
DL inference on mobile systems
- For security- and privacy-sensitive applications
• Heterogeneous embedded systems are rapidly
emerging as a promising solution to enable
accurate and efficient inference
- Big-core cluster, little-core cluster, GPU, NPU
- These computing devices exhibit widely different
characteristics in terms of ① performance,
② energy consumption, ③ supported operations,
④ memory capacity, and ⑤ communication overheads
• Existing approaches lack consideration of the efficiency
heterogeneity and the memory and functionality
constraints of inference workloads and emerging
computing devices (e.g., NPU)

Related Work
- Despite extensive prior work, system-software
support that efficiently executes inference
workloads on heterogeneous embedded systems
by judiciously considering their characteristics
remains unexplored.

Mosaic
• Heterogeneity-, communication-, and constraint-aware model slicing
- Execution plan based on dynamic programming
• Prototype implementation of MOSAIC
- as a user-level runtime system
- using TensorFlow Lite 1.11.0
- on Android 8.1

Preliminary Experiments
[Figure: GPU vs. NPU comparison; regions annotated "GPU > NPU" and "GPU < NPU"]

1. Figure 5 of the paper shows the inference latency of MO-1.4
when it is decomposed into three slices with various slicing plans.
2. The performance difference between the best and
worst slicing plans is 34.1%.
[Figure annotations: "Little < NPU", "Little > NPU"]
Overall, our experimental results show that
the evaluated inference workloads exhibit
widely-different characteristics in terms of
the model size, performance and energy
heterogeneity, and communication
overheads.

Design of MOSAIC
• A. Inference Workload Profiler
- Profiles the total costs (e.g., latency, energy
consumption) of executing each layer
- If a computing device supports DVFS, the total costs
of executing the layer are collected at two frequencies
(maximum and minimum)

Design of MOSAIC (Execution and Communication Overheads)
• B. Execution and Communication Cost Estimators
- i) Communication Cost Estimator (GPU, NPU)
- ii) Execution Cost Estimator
- Performance Estimator
- Power Estimator
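
To make the role of the performance estimator concrete, here is a minimal sketch of how a layer's latency at an intermediate DVFS frequency could be estimated from the two profiled points (maximum and minimum frequency). The slides do not give the estimator's formula; the model t(f) = a / f + b (a frequency-scaled part plus a frequency-independent part) is an assumption made here for illustration, and the names `LayerProfile`, `fit_latency_model`, and `estimate_latency` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerProfile:
    """Profiled latency of one layer on one device (hypothetical structure)."""
    f_max_hz: float      # maximum device frequency
    t_at_f_max_s: float  # latency measured at f_max_hz
    f_min_hz: float      # minimum device frequency
    t_at_f_min_s: float  # latency measured at f_min_hz


def fit_latency_model(p: LayerProfile) -> tuple[float, float]:
    """Fit the ASSUMED model t(f) = a / f + b through the two profiled points."""
    inv_max = 1.0 / p.f_max_hz
    inv_min = 1.0 / p.f_min_hz
    a = (p.t_at_f_min_s - p.t_at_f_max_s) / (inv_min - inv_max)
    b = p.t_at_f_max_s - a * inv_max
    return a, b


def estimate_latency(p: LayerProfile, f_hz: float) -> float:
    """Estimate the layer latency at frequency f_hz under the assumed model."""
    a, b = fit_latency_model(p)
    return a / f_hz + b


# Illustrative numbers only, not measurements from the paper.
profile = LayerProfile(f_max_hz=2.0e9, t_at_f_max_s=0.010,
                       f_min_hz=1.0e9, t_at_f_min_s=0.018)
mid_estimate = estimate_latency(profile, 1.6e9)
```

By construction the fitted curve passes through both profiled points, so the estimate is exact at the maximum and minimum frequencies and is an interpolation in between. How well that interpolation holds depends on how closely the real device follows the assumed model.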

Design of MOSAIC (Slicer & Scheduler)
• C. Model Slicer and Scheduler
- i) For performance optimization, D is defined as
- ii) For energy optimization, D is defined as
- 25 DVFS options in total (Big: 9, Little: 7, GPU: 8, NPU: 1)
- Dynamic Programming
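
The slides name dynamic programming but do not show the recurrence, so the following is only a sketch of one plausible formulation, not the paper's algorithm. It assumes a per-layer execution cost table (with `math.inf` marking an operation a device does not support), a device-to-device communication cost matrix with a zero diagonal, and a single additive objective such as latency or energy. The function name `plan_slices` and both tables are hypothetical, and memory-capacity constraints are not modeled.

```python
import math


def plan_slices(exec_cost, comm_cost):
    """Assign each layer to a device so that total cost is minimal.

    exec_cost[i][d]: cost of running layer i on device d (math.inf if unsupported).
    comm_cost[p][d]: cost of handing data from device p to device d (0 if p == d).
    Returns (total_cost, plan), where plan[i] is the device index for layer i.
    Consecutive layers on the same device form one slice.
    """
    n_layers = len(exec_cost)
    n_dev = len(exec_cost[0])

    best = [list(exec_cost[0])]  # best[i][d]: min cost of layers 0..i ending on d
    back = []                    # back[i-1][d]: device chosen for layer i-1

    for i in range(1, n_layers):
        row, ptr = [], []
        for d in range(n_dev):
            # Pick the cheapest predecessor device, including the transfer cost.
            prev = min(range(n_dev),
                       key=lambda p: best[i - 1][p] + comm_cost[p][d])
            row.append(best[i - 1][prev] + comm_cost[prev][d] + exec_cost[i][d])
            ptr.append(prev)
        best.append(row)
        back.append(ptr)

    last = min(range(n_dev), key=lambda d: best[-1][d])
    if math.isinf(best[-1][last]):
        raise ValueError("no feasible plan: some layer is unsupported everywhere")

    # Walk the back-pointers from the last layer to the first.
    plan = [last]
    for ptr in reversed(back):
        plan.append(ptr[plan[-1]])
    plan.reverse()
    return best[-1][last], plan


# Illustrative costs: device 0 is slow but supports everything,
# device 1 is fast but cannot run the middle layer.
exec_cost = [[5, 1], [5, math.inf], [5, 1]]
comm_cost = [[0, 1], [1, 0]]
total, plan = plan_slices(exec_cost, comm_cost)
```

With L layers and K columns this runs in O(L · K²) time. One way to fold frequency selection into the same search, again as an assumption, is to let each column stand for a (device, frequency) pair, for example the 25 DVFS options listed above.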

Design of MOSAIC (Executor)
• D. Inference Workload Executor

Evaluation
• Hardware
- HiKey 970 embedded development board
- Four Cortex-A73 (big) cores, four Cortex-A53 (little) cores,
- Mali-G72 MP12 GPU, and NPU
• Comparison targets (governor / heuristic vs. MOSAIC)
- Governor: performance (TF-BIG-P, TF-LITTLE-P, TF-GPU-P, TF-NPU-P)
- Governor: on-demand (TF-BIG-O, TF-LITTLE-O, TF-GPU-O, TF-NPU-O)
- Exhaustive (1000 unique slicing plans)
• Evaluation items
- (1) inference latency,
- (2) inference energy,
- (3) impact of the MOSAIC components,
- (4) efficiency with smaller models,
- (5) estimation accuracy,
- (6) overheads for generating the model slicing and execution plan.

Evaluation: MOSAIC Optimal Plans

Evaluation: (1) Latency, (2) Energy

Evaluation: (3) Impact of the MOSAIC Components, (4) Efficiency with Smaller Models

Evaluation: (5) Estimation Accuracy, (6) Overheads

Summary
• This paper presents MOSAIC, heterogeneity-,
communication-, and constraint-aware model slicing
and execution for accurate and efficient inference on
heterogeneous embedded systems.
• MOSAIC uses accurate models to estimate the
execution and communication costs of the target
inference workload and generates an efficient model
slicing and execution plan with low time complexity.
- 29.2% lower inference latency than TF-NPU-P
- 36.6% lower energy than TF-NPU-O
• MOSAIC achieves high estimation accuracy and incurs
small overheads.
