PACT19, MOSAIC : Heterogeneity-, Communication-, and Constraint-Aware Model Slicing and Execution for Accurate and Efficient Inference
1. MOSAIC: Heterogeneity-, Communication-, and Constraint-Aware Model Slicing and Execution for Accurate and Efficient Inference
Myeonggyun Han, Jihoon Hyun, Seongbeom Park,
Jinsu Park, Woongki Baek
Computer Architecture and Systems Lab (CASL)
UNIST
Presenter: 이제민 (leejaymin@etri.re.kr)
Slides prepared by: 권진세 (kwonse@cnu.ac.kr)
28th International Conference on Parallel Architectures and Compilation Techniques
- PACT 19
- Acceptance rate: 27% (34 / 126)
- Korean Institute of Information Scientists and Engineers (KIISE) distinguished conference list, IF: 2
2. DL Inference on mobile systems
• Growing need for accurate and efficient DL inference on mobile systems
- For security- and privacy-sensitive applications
• Heterogeneous embedded systems are rapidly emerging as a promising solution to enable accurate and efficient inference
- Big-core cluster, little-core cluster, GPU, NPU
- The computing devices exhibit widely different characteristics in terms of ① performance, ② energy consumption, ③ supported operations, ④ memory capacity, and ⑤ communication overheads
3. DL Inference on mobile systems
- Existing approaches lack consideration of the efficiency heterogeneity and the memory and functionality constraints of inference workloads and emerging computing devices (e.g., NPU)
5. Related Work
- Despite extensive prior work, system-software support that efficiently executes inference workloads on heterogeneous embedded systems by judiciously considering their characteristics remains unexplored.
6. MOSAIC
• Heterogeneity-, communication-, and constraint-aware model slicing
- Execution plan based on dynamic programming
• Prototype implementation of MOSAIC
- as a user-level runtime system
- using TensorFlow Lite 1.11.0
- on Android 8.1
9. Preliminary Experiments
1. Figure 5 shows the inference latency of MO-1.4 when it is decomposed into three slices with various slicing plans.
2. The performance difference between the best and worst slicing plans is 34.1%.
Figure 5 annotations: Little < NPU, Little > NPU
10. Preliminary Experiments
Overall, the experimental results show that the evaluated inference workloads exhibit widely different characteristics in terms of model size, performance and energy heterogeneity, and communication overheads.
11. Design of MOSAIC
• A. Inference Workload Profiler
- Profiles the total costs (e.g., latency, energy consumption) of executing each layer
- If a computing device supports DVFS, the total costs of executing the layer are collected at two frequencies (maximum and minimum)
12. Design of MOSAIC (execution and communication overheads)
• B. Execution and Communication Cost Estimators
- i) Communication Cost Estimator (GPU, NPU)
- ii) Execution Cost Estimator
- Performance Estimator
- Power Estimator
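The slide does not reproduce the estimator formulas. As a minimal sketch of how two profiled points per layer (minimum and maximum frequency) can yield an estimate at any other DVFS level, the code below assumes a latency model of the form T(f) = a / f + b. This model, the function names, and the numbers are illustrative assumptions, not taken from the paper.

```python
def fit_latency_model(f_min, t_min, f_max, t_max):
    """Fit T(f) = a / f + b through two profiled (frequency, latency) points.

    Assumed model for illustration only; not the paper's estimator.
    """
    if f_min == f_max:
        raise ValueError("need two distinct frequencies")
    a = (t_min - t_max) / (1.0 / f_min - 1.0 / f_max)
    b = t_max - a / f_max
    return a, b


def estimate_latency(model, freq):
    """Estimate layer latency at an arbitrary frequency from the fitted model."""
    a, b = model
    return a / freq + b


# Hypothetical profile of one layer: 12.0 ms at 0.5 GHz, 4.0 ms at 2.0 GHz.
model = fit_latency_model(0.5, 12.0, 2.0, 4.0)
latency_at_1ghz = estimate_latency(model, 1.0)
```

The fitted curve passes exactly through both profiled points, so the estimate at an intermediate frequency always lies between the two measured latencies.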
13. Design of MOSAIC (Slicer & Scheduler)
• C. Model Slicer and Scheduler
- i) In case of performance optimization, D is defined as
- ii) In case of energy optimization, D is defined as
- 25 DVFS options in total (big cluster 9, little cluster 7, GPU 8, NPU 1)
- Dynamic Programming
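The definition of D is not reproduced on this slide, so the following is only a sketch of how a dynamic-programming planner of this kind can work, under assumptions that are not taken from the paper: D[i][d] is the minimum total cost of executing layers 0..i with layer i placed on device d, exec_cost[i][d] is the estimated cost of layer i on device d (infinite if the device cannot run that layer), and comm_cost[p][d] is a fixed cost of handing an intermediate tensor from device p to device d. All names and numbers are hypothetical.

```python
import math


def plan(exec_cost, comm_cost):
    """Return (total_cost, device_per_layer) minimizing execution + communication cost.

    exec_cost[i][d]: cost of layer i on device d (math.inf if unsupported).
    comm_cost[p][d]: cost of moving the output from device p to device d.
    """
    n_layers, n_dev = len(exec_cost), len(comm_cost)
    D = [[math.inf] * n_dev for _ in range(n_layers)]
    prev = [[-1] * n_dev for _ in range(n_layers)]
    D[0] = list(exec_cost[0])
    for i in range(1, n_layers):
        for d in range(n_dev):
            for p in range(n_dev):
                c = D[i - 1][p] + comm_cost[p][d] + exec_cost[i][d]
                if c < D[i][d]:
                    D[i][d], prev[i][d] = c, p
    # Pick the cheapest device for the last layer, then backtrack.
    d = min(range(n_dev), key=lambda k: D[-1][k])
    total = D[-1][d]
    if math.isinf(total):
        raise ValueError("no feasible plan")
    devices = [d]
    for i in range(n_layers - 1, 0, -1):
        d = prev[i][d]
        devices.append(d)
    devices.reverse()
    return total, devices


# Hypothetical example: device 0 = CPU, device 1 = NPU; the NPU cannot run layer 2.
exec_cost = [[4, 1], [3, 1], [2, math.inf]]
comm_cost = [[0, 2], [2, 0]]
total, devices = plan(exec_cost, comm_cost)
```

Consecutive layers assigned to the same device form one slice. In this example the plan runs layers 0 and 1 on the NPU and layer 2 on the CPU, which is cheaper than running everything on the CPU even after paying one communication cost. The table has one entry per (layer, device) pair and each entry scans all devices, so the running time grows linearly with the number of layers and quadratically with the number of devices.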
23. Summary
• This paper presents MOSAIC, heterogeneity-,
communication-, and constraint-aware model slicing
and execution for accurate and efficient inference on
heterogeneous embedded systems.
• MOSAIC uses accurate models to estimate the execution and communication costs of the target inference workload and generates an efficient model slicing and execution plan with low time complexity.
- 29.2% lower inference latency than TF-NPU-P
- 36.6% lower energy than TF-NPU-O
• MOSAIC achieves high estimation accuracy and incurs small overheads.