MOSAIC: Heterogeneity-, Communication-, and Constraint-Aware
Model Slicing and Execution for Accurate and Efficient Inference
Myeonggyun Han, Jihoon Hyun, Seongbeom Park,
Jinsu Park, Woongki Baek
Computer Architecture and Systems Lab (CASL)
UNIST
Presenter: 이제민 (leejaymin@etri.re.kr)
Slides prepared by: 권진세 (kwonse@cnu.ac.kr)
28th International Conference on Parallel Architectures and Compilation
Techniques
- PACT '19
- Acceptance rate: 27% (34 / 126)
- KIISE (Korean Institute of Information Scientists and Engineers) distinguished conference list, IF: 2
DL Inference on mobile systems
• Growing need for accurate and efficient
DL inference on mobile systems
- For security- and privacy-sensitive applications
• Heterogeneous embedded systems are rapidly
emerging as a promising solution to enable
accurate and efficient inference
- Big-core cluster, little-core cluster, GPU, NPU
- These computing devices differ widely in
① performance, ② energy consumption,
③ supported operations, ④ memory capacity,
and ⑤ communication overheads
- Prior work lacks consideration of the efficiency
heterogeneity and the memory and functionality
constraints of inference workloads and emerging
computing devices (e.g., NPU)
Related Work
- Despite extensive prior work, system-software
support that efficiently executes inference
workloads on heterogeneous embedded systems
by judiciously considering their characteristics
remains unexplored.
Mosaic
• Heterogeneity-, communication-, and constraint-aware model slicing
- Execution plan based on dynamic programming
• Prototype implementation of MOSAIC
- as a user-level runtime system
- using TensorFlow Lite 1.11.0
- on Android 8.1
Preliminary Experiments
[Figure annotations: GPU > NPU; GPU < NPU]
1. Figure 5 shows the inference latency of MO-1.4
when it is decomposed into three slices with
various slicing plans.
2. The performance difference between the best and
worst slicing plans is 34.1%.
[Figure annotations: Little < NPU; Little > NPU]
Overall, our experimental results show that
the evaluated inference workloads exhibit
widely-different characteristics in terms of
the model size, performance and energy
heterogeneity, and communication
overheads.
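To make the term "slicing plan" concrete, the sketch below enumerates every way to cut a chain of layers into three contiguous slices and to assign each slice to a device. The layer count, the device names, and the rule that adjacent slices must use different devices are illustrative assumptions, not details taken from the slides.

```python
from itertools import combinations, product

def enumerate_slicing_plans(num_layers, devices, num_slices=3):
    """Yield (slices, assignment) pairs for a chain of layers.

    slices     : list of (start, end) layer index ranges, end exclusive
    assignment : tuple with one device name per slice
    """
    # Choose num_slices - 1 cut points between consecutive layers.
    for cuts in combinations(range(1, num_layers), num_slices - 1):
        bounds = (0, *cuts, num_layers)
        slices = [(bounds[i], bounds[i + 1]) for i in range(num_slices)]
        for assignment in product(devices, repeat=num_slices):
            # Assumption: two adjacent slices on the same device would really
            # be a single slice, so such plans are skipped.
            if any(a == b for a, b in zip(assignment, assignment[1:])):
                continue
            yield slices, assignment

# Hypothetical example: 5 layers, 4 devices.
plans = list(enumerate_slicing_plans(5, ["big", "little", "gpu", "npu"]))
print(len(plans))  # 6 cut choices x 36 device assignments = 216
```

Even this toy case yields 216 candidate plans, and the count grows quickly with the number of layers, which is one reason an exhaustive search is only practical as a comparison baseline.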
Design of MOSAIC
• A. Inference Workload Profiler
- Profiles the total costs (e.g., latency, energy
consumption) of executing each layer
- If a computing device supports DVFS, the total costs
of executing the layer are collected at two frequencies
(maximum and minimum)
Design of MOSAIC (Execution & Communication Overheads)
• B. Execution and Communication Cost Estimators
- i) Communication Cost Estimator (GPU, NPU)
- ii) Execution Cost Estimator
- Performance Estimator
- Power Estimator
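The slides do not reproduce the estimator formulas, so the following is only a sketch of how costs profiled at two frequencies could be turned into estimates at other DVFS levels. The latency model t(f) = a + b / f, the energy model (power times latency), the linear communication model, and all function and parameter names are assumptions made for illustration, not the models used by MOSAIC.

```python
def fit_latency_model(f_min, t_min, f_max, t_max):
    """Fit t(f) = a + b / f through two profiled (frequency, latency) points.

    Assumed model: a is a frequency-independent part, b / f scales with
    the clock period.
    """
    b = (t_min - t_max) / (1.0 / f_min - 1.0 / f_max)
    a = t_max - b / f_max
    return a, b

def estimate_latency(model, f):
    """Estimated layer latency at frequency f under the assumed model."""
    a, b = model
    return a + b / f

def estimate_energy(power, latency):
    """Assumed energy model: average power multiplied by latency."""
    return power * latency

def estimate_comm_latency(num_bytes, fixed_overhead, bandwidth):
    """Assumed communication model: fixed setup cost plus transfer time."""
    return fixed_overhead + num_bytes / bandwidth

# Hypothetical profile: 20 ms at 500 MHz and 8 ms at 2000 MHz.
model = fit_latency_model(500.0, 20.0, 2000.0, 8.0)
print(estimate_latency(model, 1000.0))  # approximately 12 ms
```

With only two profiled points per layer, a two-parameter model is the most that can be fitted; a real estimator may need more measurements or a different functional form.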
Design of MOSAIC (Slicer & Scheduler)
• C. Model Slicer and Scheduler
- i) In case of performance optimization, D is defined as
- ii) In case of energy optimization, D is defined as
- 25 DVFS options in total (big: 9, little: 7, GPU: 8, NPU: 1)
- Dynamic Programming
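A dynamic-programming planner of this kind can be sketched as follows. This is a generic formulation under stated assumptions, not the definition of D from the slides: the model is treated as a linear chain of layers, the objective is the sum of per-layer execution costs plus a switching cost whenever two consecutive layers run on different devices, and an unsupported layer is modeled with an infinite cost. The names `plan_slices`, `exec_cost`, and `switch_cost` are hypothetical.

```python
import math

def plan_slices(exec_cost, switch_cost):
    """Assign each layer to a device so that the summed cost is minimal.

    exec_cost[i][d] : cost of running layer i on device d
                      (math.inf if d cannot run the layer)
    switch_cost[i]  : cost paid when layer i-1 and layer i run on
                      different devices (switch_cost[0] is unused)
    Returns (total_cost, slices); slices is a list of
    (device, first_layer, last_layer) tuples.
    """
    n = len(exec_cost)
    devices = list(exec_cost[0])
    best = [{d: exec_cost[0][d] for d in devices}]
    prev = [{d: None for d in devices}]
    for i in range(1, n):
        best.append({})
        prev.append({})
        for d in devices:
            # Best predecessor device for running layer i on device d.
            cand = min(
                devices,
                key=lambda p: best[i - 1][p] + (0.0 if p == d else switch_cost[i]),
            )
            trans = 0.0 if cand == d else switch_cost[i]
            best[i][d] = best[i - 1][cand] + trans + exec_cost[i][d]
            prev[i][d] = cand
    # Backtrack from the cheapest device for the last layer.
    last = min(devices, key=lambda d: best[n - 1][d])
    total = best[n - 1][last]
    assignment = [last]
    for i in range(n - 1, 0, -1):
        assignment.append(prev[i][assignment[-1]])
    assignment.reverse()
    # Merge consecutive layers on the same device into slices.
    slices = []
    for i, d in enumerate(assignment):
        if slices and slices[-1][0] == d:
            slices[-1] = (d, slices[-1][1], i)
        else:
            slices.append((d, i, i))
    return total, slices

# Hypothetical costs: layer 2 is not supported on the NPU.
exec_cost = [
    {"gpu": 4.0, "npu": 2.0},
    {"gpu": 3.0, "npu": 1.0},
    {"gpu": 2.0, "npu": math.inf},
    {"gpu": 2.0, "npu": 1.0},
]
switch_cost = [0.0, 0.5, 0.5, 2.0]
total, slices = plan_slices(exec_cost, switch_cost)
print(total, slices)  # 7.5 [('npu', 0, 1), ('gpu', 2, 3)]
```

The running time is proportional to the number of layers times the square of the number of device options. Under the same assumptions, each (device, DVFS level) pair could be treated as a separate entry in `exec_cost`, at the price of a larger option set; memory-capacity constraints are not modeled in this sketch.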
Design of MOSAIC (Executor)
• D. Inference Workload Executor
Evaluation
• Hardware
- HiKey 970 embedded development board
- Four Cortex-A73 (big) cores, four Cortex-A53 (little) cores,
- Mali-G72 MP12 GPU, and NPU
• Comparison targets (governor / heuristic vs. MOSAIC)
- Governor: performance (TF-BIG/LITTLE/GPU/NPU-P)
- Governor: on-demand (TF-BIG/LITTLE/GPU/NPU-O)
- Exhaustive (1000 unique slicing plans)
• Evaluation items
- (1) inference latency,
- (2) inference energy,
- (3) impact of the MOSAIC components,
- (4) efficiency with smaller models,
- (5) estimation accuracy,
- (6) overheads for generating the model slicing and execution plan.
Evaluation - MOSAIC Optimal Plans
Evaluation - (1) Latency, (2) Energy
Evaluation - (3) Impact of MOSAIC, (4) Efficiency with Smaller Models
Evaluation - (5) Estimation Accuracy, (6) Overheads
Summary
• This paper presents MOSAIC, heterogeneity-,
communication-, and constraint-aware model slicing
and execution for accurate and efficient inference on
heterogeneous embedded systems.
• MOSAIC uses accurate models to estimate the
execution and communication costs of the target
inference workload and generates an efficient model
slicing and execution plan with low time complexity.
- 29.2% lower inference latency than TF-NPU-P
- 36.6% lower energy than TF-NPU-O
• MOSAIC achieves high estimation accuracy and incurs
small overheads.
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
 
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdf
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanation
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.ppt
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 

PACT19, MOSAIC : Heterogeneity-, Communication-, and Constraint-Aware Model Slicing and Execution for Accurate and Efficient Inference

  • 1. MOSAIC: Heterogeneity-, Communication-, and Constraint-Aware Model Slicing and Execution for Accurate and Efficient Inference
      Myeonggyun Han, Jihoon Hyun, Seongbeom Park, Jinsu Park, Woongki Baek
      Computer Architecture and Systems Lab (CASL), UNIST
      Presenter: 이제민 (leejaymin@etri.re.kr)
      Slides prepared by: 권진세 (kwonse@cnu.ac.kr)
      28th International Conference on Parallel Architectures and Compilation Techniques (PACT 19)
      - Acceptance rate: 27% (34 / 126)
      - KIISE (Korean Institute of Information Scientists and Engineers) distinguished-conference list IF: 2
  • 3. DL Inference on mobile systems
      • The need for accurate and efficient DL inference on mobile systems has increased
        - For security- and privacy-sensitive applications
      • Heterogeneous embedded systems are rapidly emerging as a promising solution for accurate and efficient inference
        - Big-core cluster, little-core cluster, GPU, NPU
        - The computing devices exhibit widely different characteristics in terms of ① performance, ② energy consumption, ③ supported operations, ④ memory capacity, ⑤ communication overheads
        - Existing approaches lack consideration of the efficiency heterogeneity and the memory and functionality constraints of inference workloads and emerging computing devices (e.g., NPU)
  • 5. Related Work
      - Despite extensive prior work, system-software support that efficiently executes inference workloads on heterogeneous embedded systems by judiciously considering their characteristics remains unexplored.
  • 6. Mosaic
      • Heterogeneity-, communication-, and constraint-aware model slicing
        - Execution plan based on dynamic programming
      • Prototype implementation of MOSAIC
        - As a user-level runtime system
        - Using TensorFlow Lite 1.11.0
        - On Android 8.1
  • 10. Preliminary Experiments
      1. Figure 5 shows the inference latency of MO-1.4 when the model is decomposed into three slices with various slicing plans
      2. The performance difference between the best and worst slicing plans is 34.1%
      (Figure annotations: Little < NPU, Little > NPU)
      Overall, the experimental results show that the evaluated inference workloads exhibit widely different characteristics in terms of model size, performance and energy heterogeneity, and communication overheads.
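To make "three slices with various slicing plans" concrete, the sketch below enumerates every way to cut a chain of layers into contiguous slices. It is only an illustration: the slides do not say how the plans in Figure 5 were generated, the name `slicing_plans` is hypothetical, and the mapping of slices to devices is left out.

```python
from itertools import combinations

def slicing_plans(n_layers, n_slices=3):
    """Yield every way to cut layers 0..n_layers-1 into contiguous slices.

    Each plan is a list of (first_layer, last_layer) pairs, both inclusive.
    Hypothetical helper for illustration; not part of MOSAIC.
    """
    # Choosing n_slices - 1 cut positions between layers fixes one plan.
    for cuts in combinations(range(1, n_layers), n_slices - 1):
        bounds = (0, *cuts, n_layers)
        yield [(bounds[k], bounds[k + 1] - 1) for k in range(n_slices)]

plans = list(slicing_plans(5))
```

A model with n layers has C(n-1, 2) three-slice plans before devices are even assigned, which is why a search over plans benefits from something cheaper than brute force.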
  • 11. Design of MOSAIC
      • A. Inference Workload Profiler
        - Profiles the total costs (e.g., latency, energy consumption) of executing each layer
        - If a computing device supports DVFS, the total costs of executing the layer are collected at two frequencies (maximum and minimum)
  • 12. Design of MOSAIC (execution and communication overheads)
      • B. Execution and Communication Cost Estimators
        - i) Communication Cost Estimator (GPU, NPU)
        - ii) Execution Cost Estimator
          - Performance Estimator
          - Power Estimator
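One way two profiled frequencies could feed a performance estimator is sketched below. The transcript does not give MOSAIC's actual estimator, so the model here, t(f) = a / f + b (a frequency-scaled part plus a fixed part), is an assumption chosen for illustration, and the function names and sample numbers are hypothetical.

```python
def fit_latency_model(f_min, t_min, f_max, t_max):
    """Fit t(f) = a / f + b through two profiled (frequency, latency) points.

    Assumed model for illustration only; not taken from the MOSAIC slides.
    """
    a = (t_min - t_max) / (1.0 / f_min - 1.0 / f_max)
    b = t_max - a / f_max
    return a, b

def estimate_latency(model, f):
    """Estimate the layer latency at an unprofiled frequency f."""
    a, b = model
    return a / f + b

# Hypothetical profile: 10 ms at 500 MHz, 4 ms at 2000 MHz.
model = fit_latency_model(500.0, 10.0, 2000.0, 4.0)
mid = estimate_latency(model, 1000.0)
```

With only two measurements per layer, a two-parameter model is the most that can be fitted, which keeps profiling cheap at the price of accuracy at intermediate frequencies.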
  • 13. Design of MOSAIC (Slicer & Scheduler)
      • C. Model Slicer and Scheduler
        - i) For performance optimization, D is defined as (equation not captured in the transcript)
        - ii) For energy optimization, D is defined as (equation not captured in the transcript)
        - 25 DVFS options (Big 9, Little 7, GPU 8, NPU 1)
        - Dynamic programming
  • 14. Design of MOSAIC (Slicer & Scheduler)
      • C. Model Slicer and Scheduler
        - Dynamic programming
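The dynamic-programming idea named on these slides can be sketched as follows. This is a minimal illustration under stated assumptions, not MOSAIC's actual algorithm: `plan_slices`, `to_slices`, the per-layer table `exec_cost`, and the device-to-device table `comm_cost` are hypothetical, and DVFS levels and memory limits are ignored. An unsupported operation is modelled as an infinite execution cost.

```python
from itertools import groupby
from math import inf

def plan_slices(exec_cost, comm_cost):
    """Pick a device for every layer so the summed cost is minimal.

    exec_cost[i][d]: assumed cost of layer i on device d (inf if unsupported).
    comm_cost[p][d]: assumed cost of handing activations from device p to d.
    Returns (total_cost, devices), where devices[i] is the device of layer i.
    """
    n_layers, n_dev = len(exec_cost), len(comm_cost)
    # best[i][d]: cheapest cost of layers 0..i with layer i placed on device d
    best = [[inf] * n_dev for _ in range(n_layers)]
    prev = [[-1] * n_dev for _ in range(n_layers)]
    best[0] = list(exec_cost[0])
    for i in range(1, n_layers):
        for d in range(n_dev):
            for p in range(n_dev):
                cand = best[i - 1][p] + comm_cost[p][d] + exec_cost[i][d]
                if cand < best[i][d]:
                    best[i][d], prev[i][d] = cand, p
    # Backtrack from the cheapest final device.
    d = min(range(n_dev), key=lambda k: best[-1][k])
    total, devices = best[-1][d], [d]
    for i in range(n_layers - 1, 0, -1):
        d = prev[i][d]
        devices.append(d)
    devices.reverse()
    return total, devices

def to_slices(devices):
    """Group consecutive layers on the same device into (first, last, device)."""
    slices, start = [], 0
    for dev, run in groupby(devices):
        length = len(list(run))
        slices.append((start, start + length - 1, dev))
        start += length
    return slices

# Hypothetical example: device 0 = CPU, device 1 = NPU; layer 2 is NPU-unsupported.
exec_cost = [[4, 1], [4, 1], [2, inf], [3, 1]]
comm_cost = [[0, 1], [1, 0]]
total, devices = plan_slices(exec_cost, comm_cost)
slices = to_slices(devices)
```

The table has one entry per (layer, device) pair and each entry looks at every previous device, so the running time is proportional to layers × devices², far below the cost of trying every plan.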
  • 15. Design of MOSAIC (Executor)
      • D. Inference Workload Executor
  • 16. Evaluation
      • Hardware
        - HiKey 970 embedded development board
        - Four Cortex-A73 (big) cores, four Cortex-A53 (little) cores, Mali-G72 MP12 GPU, and NPU
      • Comparison targets (governor / heuristic vs. MOSAIC)
        - Governor: performance (TF-BIG/LITTLE/GPU/NPU-P)
        - Governor: on-demand (TF-BIG/LITTLE/GPU/NPU-O)
        - Exhaustive (1000 unique slicing plans)
      • Evaluation items
        - (1) inference latency
        - (2) inference energy
        - (3) impact of the MOSAIC components
        - (4) efficiency with smaller models
        - (5) estimation accuracy
        - (6) overheads for generating the model slicing and execution plan
  • 20. Evaluation: (3) impact of MOSAIC, (4) efficiency with smaller models
  • 23. Summary
      • This paper presents MOSAIC, heterogeneity-, communication-, and constraint-aware model slicing and execution for accurate and efficient inference on heterogeneous embedded systems.
      • MOSAIC uses accurate models to estimate the execution and communication costs of the target inference workload and generates an efficient model slicing and execution plan with low time complexity.
        - 29.2% lower inference latency than TF-NPU-P
        - 36.6% lower energy than TF-NPU-O
      • MOSAIC achieves high estimation accuracy and incurs small overheads.