MOSAIC: Heterogeneity-, Communication-, and Constraint-Aware Model Slicing
and Execution for Accurate and Efficient Inference
Myeonggyun Han, Jihoon Hyun, Seongbeom Park,
Jinsu Park, Woongki Baek
Computer Architecture and Systems Lab (CASL)
UNIST
Presenter: 이제민 (leejaymin@etri.re.kr)
Slides prepared by: 권진세 (kwonse@cnu.ac.kr)
28th International Conference on Parallel Architectures and Compilation
Techniques (PACT '19)
- Acceptance rate: 27% (34 / 126)
- KIISE (Korean Institute of Information Scientists and Engineers) distinguished conference list IF: 2
DL Inference on mobile systems
• Growing need for accurate and efficient DL inference on mobile systems
- For security- and privacy-sensitive applications
• Heterogeneous embedded systems are rapidly emerging as a promising solution to enable accurate and efficient inference
- Big-core cluster, little-core cluster, GPU, NPU
- These computing devices exhibit widely different characteristics in terms of ① performance, ② energy consumption, ③ supported operations, ④ memory capacity, and ⑤ communication overheads
- Prior work lacks consideration of the efficiency heterogeneity and the memory and functionality constraints of inference workloads and emerging computing devices (e.g., NPU)
Related Work
- Despite extensive prior work, system-software support that efficiently executes inference workloads on heterogeneous embedded systems by judiciously considering their characteristics remains unexplored.
Mosaic
• Heterogeneity-, communication-, and constraint-aware model slicing
- Execution plan based on dynamic programming
• Prototype implementation of MOSAIC
- as a user-level runtime system
- using TensorFlow Lite 1.11.0
- on Android 8.1
Preliminary Experiments
[Figure annotations: GPU > NPU, GPU < NPU]
Preliminary Experiments
1. Figure 5 shows the inference latency of MO-1.4 when the model is decomposed into three slices with various slicing plans
2. The performance difference between the best and worst slicing plans is 34.1%
[Figure annotations: Little < NPU, Little > NPU]
Overall, our experimental results show that the evaluated inference workloads exhibit widely different characteristics in terms of model size, performance and energy heterogeneity, and communication overheads.
Design of MOSAIC
• A. Inference Workload Profiler
- Profiles the total costs (e.g., latency, energy consumption) of executing each layer
- If a computing device supports DVFS, the total costs of executing the layer are collected at two frequencies (maximum and minimum)
Design of MOSAIC (Execution & Communication Overheads)
• B. Execution and Communication Cost Estimators
- i) Communication Cost Estimator (GPU, NPU)
- ii) Execution Cost Estimator
- Performance Estimator
- Power Estimator
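The slides do not give the estimators' formulas, so the following is only a minimal sketch of how two profiled frequency points could feed an execution cost estimator. It assumes (this is an assumption, not the paper's stated model) that a layer's latency splits into a part proportional to 1/f and a frequency-independent part. The function names are hypothetical.

```python
def fit_latency_model(f_min, t_min, f_max, t_max):
    """Fit t(f) = a / f + b through two profiled (frequency, latency) points.

    ASSUMPTION: latency is affine in 1/f. The slides only say that costs are
    profiled at the minimum and maximum frequency, not how they are combined.
    """
    a = (t_min - t_max) / (1.0 / f_min - 1.0 / f_max)
    b = t_max - a / f_max
    return a, b


def estimate_latency(model, f):
    """Estimated latency of the layer at frequency f under the fitted model."""
    a, b = model
    return a / f + b


def estimate_energy(avg_power_watts, latency_s):
    """Energy in joules is average power multiplied by execution time."""
    return avg_power_watts * latency_s


# Hypothetical profile: 10 ms at 1.0 GHz and 6 ms at 2.0 GHz.
model = fit_latency_model(f_min=1.0, t_min=10.0, f_max=2.0, t_max=6.0)
latency_mid = estimate_latency(model, 1.6)
```

The fitted curve passes through both profiled points by construction, so intermediate DVFS settings are interpolated rather than measured.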
Design of MOSAIC (Slicer & Scheduler)
• C. Model Slicer and Scheduler
- i) For performance optimization, D is defined as [equation not preserved in the extracted slide text]
- ii) For energy optimization, D is defined as [equation not preserved in the extracted slide text]
- 25 DVFS options in total (Big 9, Little 7, GPU 8, NPU 1)
- Dynamic Programming
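The dynamic-programming step listed above can be illustrated with a generic chain DP. This is a sketch under stated assumptions, not MOSAIC's actual formulation: the definition of D is not reproduced here, the hand-off cost is simplified to depend only on the device pair, and all names (`plan_slices`, `exec_cost`, `comm_cost`, `supported`) are hypothetical. The cost tables can hold latency or energy, depending on the optimization target.

```python
import math


def plan_slices(exec_cost, comm_cost, supported):
    """Assign each layer to a device, minimizing execution + hand-off cost.

    exec_cost[i][d]: estimated cost of layer i on device d (latency or energy)
    comm_cost[p][d]: cost of moving data from device p to device d (0 if p == d)
    supported[i][d]: False if device d cannot run layer i (constraint)
    Returns (total_cost, per-layer device assignment). Runs in O(layers * devices^2).
    """
    n, m = len(exec_cost), len(exec_cost[0])
    best = [[math.inf] * m for _ in range(n)]  # best[i][d]: min cost of layers 0..i ending on d
    prev = [[None] * m for _ in range(n)]      # back-pointers for plan reconstruction
    for d in range(m):
        if supported[0][d]:
            best[0][d] = exec_cost[0][d]
    for i in range(1, n):
        for d in range(m):
            if not supported[i][d]:
                continue
            for p in range(m):
                c = best[i - 1][p] + comm_cost[p][d] + exec_cost[i][d]
                if c < best[i][d]:
                    best[i][d], prev[i][d] = c, p
    last = min(range(m), key=lambda k: best[n - 1][k])
    if math.isinf(best[n - 1][last]):
        raise ValueError("no feasible plan under the given constraints")
    plan = [last]
    for i in range(n - 1, 0, -1):
        plan.append(prev[i][plan[-1]])
    plan.reverse()
    return best[n - 1][last], plan


# Hypothetical 3-layer model on two devices (0 = CPU cluster, 1 = NPU).
exec_cost = [[4, 1], [4, 1], [2, 5]]
comm_cost = [[0, 2], [2, 0]]
all_ok = [[True, True]] * 3
total, plan = plan_slices(exec_cost, comm_cost, all_ok)
```

Consecutive layers assigned to the same device form one slice. In this example the first two layers run on device 1 and the last on device 0, for a total cost of 6. Marking a layer as unsupported on a device removes that option from the search, which is one simple way to express functionality constraints.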
Design of MOSAIC (Executor)
• D. Inference Workload Executor
Evaluation
• Hardware
- HiKey 970 embedded development board
- Four Cortex-A73 (big) cores, four Cortex-A53 (little) cores,
- Mali-G72 MP12 GPU, and NPU
• Comparison targets (governor / heuristic vs. MOSAIC)
- Performance governor (TF-BIG-P, TF-LITTLE-P, TF-GPU-P, TF-NPU-P)
- On-demand governor (TF-BIG-O, TF-LITTLE-O, TF-GPU-O, TF-NPU-O)
- Exhaustive (1000 unique slicing plans)
• Evaluation items
- (1) inference latency,
- (2) inference energy,
- (3) impact of the MOSAIC components,
- (4) efficiency with smaller models,
- (5) estimation accuracy,
- (6) overheads for generating the model slicing and execution plan.
Evaluation - MOSAIC Optimal Plans
Evaluation - (1) Latency, (2) Energy
Evaluation - (3) Impact of MOSAIC Components, (4) Efficiency with Smaller Models
Evaluation - (5) Estimation Accuracy, (6) Overheads
Summary
• This paper presents MOSAIC, heterogeneity-, communication-, and constraint-aware model slicing and execution for accurate and efficient inference on heterogeneous embedded systems.
• MOSAIC uses accurate models to estimate the execution and communication costs of the target inference workload and generates an efficient model slicing and execution plan with low time complexity.
- 29.2% lower inference latency than TF-NPU-P
- 36.6% lower energy than TF-NPU-O
• MOSAIC achieves high estimation accuracy and incurs small overheads.
