ENERGY-EFFICIENT AI WORKLOAD
PARTITIONING ON MULTI-CORE SYSTEMS
Deepak Shankar
Founder
Mirabilis Design Inc.
Email: dshankar@mirabilisdesign.com
Tom Jose
R&D Engineer
Mirabilis Design Inc.
Email: tjose@mirabilisdesign.com
About Mirabilis Design
Started in 2007 and based in Santa Clara, CA, USA.
Development and support centers in the US, India, Germany, China, Japan, Taiwan and the Czech Republic
Largest provider of Intellectual Property and Software for System architecture exploration
Used in the design of electronics, semiconductors and software
Over 250 products worldwide across Semiconductors, Aerospace, Computing and Automotive
VisualSim: modeling and simulation software
Hundreds of man-years of experience in system design and exploration of digital electronics
Select the “Right” configuration to match customer request
VisualSim Architect
Comprehensive Architecture Exploration Solution
• Modeling and Simulation Software
• Graphical and Hierarchical
• Performance, Power, Functional
• Tuning
• Open API for third-party integration
• Systems & Networks, Software, IP/SoC
Scope of System Architecture Exploration
Concept to Validation for the entire system
AI for Data center or Edge
Track 30 targets per minute
Wifi or 5G device
Process 3 cameras, 4 Lidars & 5 Radars
95% cache hit-ratio
Gateways to ECAN, WiFi, BLE and TSN
SoC Specification
Core
RTOS
Interface
AI/CCN
Network
Size and Design Trade-off
Workload Partitioning
HW-SW Distribution
Parameter tuning
Select Traffic, Workload, Use-cases
SW Performance Tuning
System Specification
IP and Core Selection
Optimize for Requirements
Software Performance Tuning
AI and Software Code
Compile code for the target hardware
Translate into Interim trace
Execute Software trace on SoC Platform
Delay_95_StDev_196s = 4.39E-6
Latency_Value = 7.46E-6
Mean_95_Confidence = 3.61E-7
Mean_Value = 4.01E-6
Min_Value = 3.62E-7
StDev_Value = 2.24E-6
Performance Reports
Modify the code to improve performance and repeat the loop
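The report fields above resemble standard summary statistics over latency samples. The slides do not say how VisualSim computes them, so the following is only a sketch under assumptions: the function name, the dictionary keys and the normal-approximation confidence interval are illustrative choices, not VisualSim's method.

```python
import math
import statistics


def summarize(latencies_s):
    """Return summary statistics for a list of latency samples in seconds."""
    n = len(latencies_s)
    mean = statistics.fmean(latencies_s)
    # Sample standard deviation needs at least two samples.
    stdev = statistics.stdev(latencies_s) if n > 1 else 0.0
    # Assumption: half-width of a normal-approximation 95% interval of the mean.
    mean_95_confidence = 1.96 * stdev / math.sqrt(n)
    return {
        "min": min(latencies_s),
        "max": max(latencies_s),
        "mean": mean,
        "stdev": stdev,
        "mean_95_confidence": mean_95_confidence,
    }
```

Comparing these values before and after a code change gives one pass of the modify-and-repeat loop.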
Behavioral flow and HW architecture
• Behavioral flow -> sequence-flow representation of how tasks will be executed
• HW architecture -> representation of how the hardware of a device is implemented
• Example:
HW Architecture: CPU_1, CPU_2, Cache, RAM, Bus
Behavior Flow: Trig, Task 1, Task 2, Task 3
• Task sequence defined from behavior flow
• Task 3 dependent on the output of Task 1 and Task 2
• Task mapping to CPU cores done from behavior flow
• Task 1, 3 -> CPU_1
• Task 2 -> CPU_2
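The example above can be sketched as a small dependency-aware schedule. The task names, dependencies and core mapping come from the slide; the task durations are hypothetical values chosen only for illustration, and the scheduler itself is a simplification, not how VisualSim executes a behavior flow.

```python
# Hypothetical task durations in microseconds (not from the slides).
DURATION_US = {"Task 1": 4.0, "Task 2": 6.0, "Task 3": 3.0}
# From the slide: Task 3 depends on the output of Task 1 and Task 2.
DEPENDS_ON = {"Task 1": [], "Task 2": [], "Task 3": ["Task 1", "Task 2"]}
# From the slide: Task 1 and Task 3 run on CPU_1, Task 2 on CPU_2.
MAPPING = {"Task 1": "CPU_1", "Task 2": "CPU_2", "Task 3": "CPU_1"}


def schedule(order, duration, depends_on, mapping):
    """Return {task: (start, finish)} for tasks listed in dependency order."""
    core_free = {}  # time at which each core becomes idle
    times = {}
    for task in order:
        core = mapping[task]
        # A task is ready once all of its predecessors have finished.
        ready = max((times[dep][1] for dep in depends_on[task]), default=0.0)
        start = max(ready, core_free.get(core, 0.0))
        finish = start + duration[task]
        times[task] = (start, finish)
        core_free[core] = finish
    return times


TIMES = schedule(["Task 1", "Task 2", "Task 3"], DURATION_US, DEPENDS_ON, MAPPING)
```

With these durations Task 3 cannot start on CPU_1 until Task 2 finishes on CPU_2, which is the kind of cross-core dependency the behavior flow captures.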
Examples
Accuracy of VisualSim Power Profiler
Device: ARM Cortex A53
Source: Anandtech.com

Frequency     Max Power observed   Real System Power   Delta percentage
500.0 MHz     0.037 W              0.038 W              2.63%
600.0 MHz     0.053 W              0.051 W             -3.92%
700.0 MHz     0.073 W              0.080 W              8.75%
800.0 MHz     0.097 W              0.090 W             -7.77%
1000.0 MHz    0.157 W              0.159 W              1.25%
1100.0 MHz    0.193 W              0.188 W             -2.65%
1200.0 MHz    0.233 W              0.227 W             -2.64%
1300.0 MHz    0.277 W              0.269 W             -2.97%
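The delta column appears to be the difference between the real and the observed power, relative to the real power. The sketch below recomputes it under that assumption; the formula is inferred from the numbers rather than stated in the slides, and the variable names are illustrative.

```python
# (frequency in MHz, max power observed in W, real system power in W, published delta in %)
ROWS = [
    (500, 0.037, 0.038, 2.63),
    (600, 0.053, 0.051, -3.92),
    (700, 0.073, 0.080, 8.75),
    (800, 0.097, 0.090, -7.77),
    (1000, 0.157, 0.159, 1.25),
    (1100, 0.193, 0.188, -2.65),
    (1200, 0.233, 0.227, -2.64),
    (1300, 0.277, 0.269, -2.97),
]


def delta_percent(observed_w, real_w):
    """Assumed formula: (real - observed) / real, expressed in percent."""
    return (real_w - observed_w) / real_w * 100.0


# Largest absolute deviation across the table.
WORST_CASE = max(abs(delta_percent(obs, real)) for _, obs, real, _ in ROWS)
```

Under this reading the recomputed values agree with the published column to within about 0.01 percentage points, and the largest deviation is the 700 MHz row.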
SoC Power-Performance demo
MULTI MEDIA APPLICATION USE CASE
Media Application
Target
1. Power < 2.0W
2. Number of frames in 20 ms > 50K
Three architecture evaluations
1. All tasks deployed in Software
2. Migrate a few tasks to Hardware accelerators
3. Add power management to reduce power
Block Diagram
Device Configuration
Processor Core – Cortex A53
Processor Speed – 1200 MHz
L1 Cache
I Cache : 32 KB, 2-way set associative
D Cache : 32 KB, 4-way set associative
L2 Cache
Size : 1 MB
Associativity : 16-way
Ext DRAM
Size : 4 GB
Type : DDR4
Speed : 2400 MHz
HW Accelerator
Speed : 800 MHz
VisualSim Model
Processor Bus Topology
MPEG Application
IP or ARM level
• Evaluate pipeline stages
• Width, Speed
• Number of execution units, Levels of cache
SoC
• Number of ARM cores
• Accelerators
• Cache memory hierarchy and coherence
System level
• Development of an IoT device, ECU or an
integrated platform
Model parameters
Select_Partitioning = “HW” -> map Rotate frame task to the HW_Accelerator
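A parameter-driven mapping of this kind can be sketched as a simple lookup. Only the "HW" value and the HW_Accelerator target come from the slide; the "SW" value, the A53_Core name and the function itself are assumptions made for illustration.

```python
def resolve_mapping(select_partitioning):
    """Return the resource that runs the Rotate Frame task for a parameter value."""
    if select_partitioning == "HW":
        # From the slide: "HW" maps the Rotate Frame task to the HW_Accelerator.
        return {"Rotate_Frame": "HW_Accelerator"}
    if select_partitioning == "SW":
        # Assumed counterpart value: keep the task on the processor core.
        return {"Rotate_Frame": "A53_Core"}
    raise ValueError(f"unknown Select_Partitioning value: {select_partitioning!r}")
```

Keeping the partitioning decision in one parameter lets the same model be rerun for each architecture option without editing the behavior flow.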
Case study
ACHIEVING THE REQUIRED PERFORMANCE
Case study environment
• Single core Cortex A53
• Analysis being done before software development
• No C/C++ code written
• Uses the Task Generator to generate an instruction sequence based on a task profile set via the Instruction Mix Table (see the Task Generator Module slides)
CASE 1: Run all tasks in SW (on the A53 core)
Observations:
1. Avg power consumption within requirements (<2.0 W)
2. Performance requirement not achieved (only a max of 23.3K frames)
Cache Stats
Stats generated per cache block:
• Cache Hit Ratio, Miss Ratio
• Number of instructions
• Number of Prefetches
• Buffer Occupancy, Buffer Overflow
• Latency – Min, Max and Mean
• Read Hit Ratio
• MBps – Read, Write and Total
• Write Hit Ratio
• Evicted Block Count
• Written-back block count
• Utilization
Sequence diagram
Rotate Frame task is found to be resource intensive
CASE 2: Run Rotate Frame task on HW Acc
Observations:
1. Avg power consumption requirement not met (>2.0 W)
2. Performance requirement achieved (max of 125.5K frames)
CASE 2: Observations
• The avg power consumption is greater than the threshold
• Options for the Architect:
• Use a HW accelerator from another vendor
• Use the same HW accelerator, but apply power management
• The power plot for the HW accelerator shows that it is active only for a short period of time
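Why a short active period makes power management attractive can be seen from a duty-cycle estimate of average power. The formula is the standard time-weighted average; every number in the comments and usage below is hypothetical, since the slides give no power figures for the accelerator.

```python
def average_power(active_w, idle_w, duty_cycle):
    """Time-weighted average power for a block that is active duty_cycle of the time."""
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be between 0 and 1")
    return duty_cycle * active_w + (1.0 - duty_cycle) * idle_w


# Hypothetical accelerator: 1.0 W when active, active 10% of the time.
# Without power management it is assumed to draw 0.6 W while waiting;
# with power management the waiting power is assumed to drop to 0.05 W.
UNMANAGED_W = average_power(active_w=1.0, idle_w=0.6, duty_cycle=0.1)
MANAGED_W = average_power(active_w=1.0, idle_w=0.05, duty_cycle=0.1)
```

The lower the duty cycle, the more the waiting-state power dominates the average, so lowering it has a large effect without touching throughput.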
CASE 3: Run Rotate Frame task on HW Acc + Power management
Observations:
1. Avg power consumption requirement met (<2.0 W)
2. Performance requirement achieved (max of 125.5K frames)
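The three cases can be checked against the two targets in a few lines. Frame counts and the power verdicts are taken from the observations above; the slides do not give the measured power values, so power is recorded only as within or over the 2.0 W limit. The names used are illustrative.

```python
FRAME_TARGET = 50_000  # target: more than 50K frames in 20 ms

# (case description, max frames reported, average power within the 2.0 W limit?)
CASES = [
    ("Case 1: all tasks in SW", 23_300, True),
    ("Case 2: Rotate Frame on HW accelerator", 125_500, False),
    ("Case 3: HW accelerator + power management", 125_500, True),
]


def meets_targets(frames, power_within_limit):
    """A configuration passes only if both the frame and power targets are met."""
    return frames > FRAME_TARGET and power_within_limit


PASSING = [name for name, frames, power_ok in CASES if meets_targets(frames, power_ok)]
```

Only Case 3 satisfies both targets, which matches the progression of the case study.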
Case study environment
• Dual core Cortex A53
• C code for one of the tasks is available (Render Frame)
• Execute the C code on GEM5
• GEM5 generates output traces
• Use Python code to parse the GEM5 output into a VisualSim-readable format
• Use this trace to provide the instruction sequence and addresses
More information on trace generation and usage is given in the GEM5 – Trace generation slides
Database updated
Render_Frame task is mapped to the second core
CASE 4: Run Render_Frame task using gem5 trace
Observations:
1. Avg power consumption within requirements (<2.0 W)
2. Performance requirement achieved (max of 105K frames)
Gnuplot – power plot
ARM Cortex A65 AE demo
16 ARM CORES
Block Diagram
Components: A65 AE Cluster (x2), L4 Cache, DRAM, Router (x4)
Block Diagram – A65 AE Cluster
A65 AE Cluster – 8 cores
Per core (1, 2, ... 8): Core, I1, D1, L2 Private Cache, Bridge
Shared: L3, DSU
Device Configuration
Processor Core – Cortex A65 AE
Processor Speed – 1200 MHz
L1 Cache
I Cache : 32 KB, 4-way set associative
D Cache : 64 KB, 4-way set associative
L2 Cache
Size : 256 KB
Associativity : 4-way
L3 Cache
Size : 512 KB
Associativity : 16-way
DRAM
Type : DDR4
Speed : 2000 MHz
VisualSim Model
Behavioral Flow at the bottom shows the Tasks/Applications running on the Hardware.
Both Cluster 1 and Cluster 2 are executing tasks in parallel
Model parameters
Task Details
Since the model is running in Lock Mode, the same application is also run on the redundant core
Results
Cache Stats
Stats generated per cache block:
• Cache Hit Ratio, Miss Ratio
• Number of instructions
• Number of Prefetches
• Buffer Occupancy, Buffer Overflow
• Latency – Min, Max and Mean
• Read Hit Ratio
• MBps – Read, Write and Total
• Write Hit Ratio
• Evicted Block Count
• Written-back block count
• Utilization
Router Stats
Since there are no devices attached to Router R_2_2 (bottom right in the block diagram), the data transferred through R_2_2 is 0.
A65AE – Split mode – VisualSim model
Different applications running on each core
Software mapping in NW processor
Results
Conclusion
Architecture exploration can be easy, fast and accurate
Architecture modeling IP is now available and can be quickly configured to create a new system
Integrate IP, SoC and Systems with software, interfaces and RTOS
Architecture accuracy at implementation-level
Simulation speed that enables large scale testing
GEM5 – Trace generation
• Software code is executed in GEM5
• GEM5 generates output in .txt format
• Custom Python code parses the GEM5 output
• The parser writes the executed instruction sequence in CSV format
• The CSV file is provided as input to VisualSim Architect
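The parsing step might look like the following sketch. The deck's own Python script is not reproduced in this transcript, so everything here is an assumption: the layout of a gem5 execution-trace line (which varies with gem5 version and debug flags), the regular expression, and the CSV column names, which are hypothetical and not VisualSim's actual trace schema.

```python
import csv
import io
import re

# Assumed layout of one gem5 execution-trace line:
#   <tick>: <cpu>: T<n> : <pc> @<symbol> : <disassembly> : <OpClass> : ...
LINE_RE = re.compile(
    r"^\s*(?P<tick>\d+):\s+\S+:\s+T\d+\s+:\s+(?P<pc>0x[0-9a-fA-F]+)"
    r"(?:\s+@\S+)?\s+:\s+(?P<disasm>[^:]+?)\s+:\s+(?P<opclass>\w+)\s+:"
)


def convert(gem5_lines, out_file):
    """Write one CSV row per recognised trace line and return the row count."""
    writer = csv.writer(out_file)
    writer.writerow(["tick", "pc", "mnemonic", "op_class"])  # hypothetical columns
    rows = 0
    for line in gem5_lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip banners, warnings and anything unrecognised
        mnemonic = match["disasm"].split()[0]
        writer.writerow([match["tick"], match["pc"], mnemonic, match["opclass"]])
        rows += 1
    return rows


# Usage with an in-memory sample; a real run would pass open file objects.
SAMPLE = [
    "gem5 Simulator System.",
    "   1000: system.cpu: T0 : 0x4003f0 @_start    : mov   x29, #0  : IntAlu :  D=0x0000000000000000  flags=(IsInteger)",
]
BUFFER = io.StringIO()
ROW_COUNT = convert(SAMPLE, BUFFER)
```

Lines that do not match the assumed layout are skipped rather than guessed at, so a format change shows up as a low row count instead of corrupt rows.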
C code – snippet and GEM5 output
Python code and output file (Trace.csv)
TrafficReader – used for reading Traces
• When an application trigger reaches the render frame stage, Trace.csv is read.
• The TrafficReader reads one line per trigger
• So a loop mechanism is defined to read until the End of File
• Once the last line from Trace.csv is read out and executed in the core, the trigger is sent to the next stage via the rendered_frame output port
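The read-until-end-of-file loop described above can be sketched with a generator. This is an analogy in plain Python, not the VisualSim TrafficReader block; the function names and the returned tuple are illustrative, with only Trace.csv and rendered_frame taken from the slide.

```python
import csv
import io


def trace_rows(trace_file):
    """Yield one trace row per request, mimicking one line read per trigger."""
    yield from csv.reader(trace_file)


def render_frame(trace_file, execute):
    """Execute every trace line, then signal the next stage."""
    executed = 0
    for row in trace_rows(trace_file):  # loop until end of file
        execute(row)
        executed += 1
    # Only after the last line has been executed is the trigger passed on.
    return "rendered_frame", executed


# Usage with an in-memory stand-in for Trace.csv.
SEEN = []
PORT, COUNT = render_frame(io.StringIO("1,add\n2,ldr\n3,str\n"), SEEN.append)
```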
Task Generator Module
• Custom Task generator – number of instructions, type of instructions, order of tasks (loop, random) can be set
• More dynamic and distributed traffic profile can be generated
• “n” number of Software tasks can be defined
• In case software development hasn’t started yet, we can use this module to generate the instruction traces
9/2/2021 MIRABILIS DESIGN INC. 45
Task Generator – Config File (Instruction Mix Table)
Software tasks
Number of instructions per task
The total number of instructions is made up of instructions of different types. The percentage of each type of instruction is specified here.
Task Generator - Config File (Instruction Mix Table)
This type descriptor is used in the previous slide.
The user can specify the percentage of each type of instruction for each software operation
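Generating an instruction sequence from a total count and a percentage mix can be sketched as follows. The instruction type names, the seeded shuffle and the function signature are assumptions for illustration; the actual Task Generator configuration format is shown only as an image in the slides.

```python
import random


def generate_instructions(total, mix_percent, seed=0):
    """Return a shuffled instruction-type sequence matching an integer percentage mix."""
    if sum(mix_percent.values()) != 100:
        raise ValueError("percentages must add up to 100")
    sequence = []
    for kind, percent in mix_percent.items():
        sequence.extend([kind] * (total * percent // 100))
    # Integer division can leave a shortfall; top up with the first type.
    first_kind = next(iter(mix_percent))
    sequence.extend([first_kind] * (total - len(sequence)))
    # A seeded generator keeps the sequence reproducible between runs.
    random.Random(seed).shuffle(sequence)
    return sequence


# Hypothetical instruction mix for one task of 1000 instructions.
MIX = {"int": 50, "fp": 20, "load": 20, "store": 10}
SEQUENCE = generate_instructions(1000, MIX)
```

A sequence like this stands in for a real trace when no software exists yet, which is the use described for the Task Generator.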
Example: One task, one Processor core
Example: Three tasks, one Processor core
Example: Three tasks, one Processor core, Preemption enabled
Priority : Task3 > Task2 > Task1
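Priority-based preemption on a single core can be illustrated with a tick-by-tick sketch. Only the priority order Task3 > Task2 > Task1 comes from the slide; the arrival times, durations and the one-tick scheduling granularity are hypothetical, and this is not the scheduler used inside VisualSim.

```python
def simulate(tasks):
    """tasks: {name: (priority, arrival_tick, duration_ticks)}.

    Returns the name of the task that ran in each tick (None when idle).
    """
    remaining = {name: duration for name, (_, _, duration) in tasks.items()}
    timeline = []
    tick = 0
    while any(remaining.values()):
        ready = [
            name
            for name, (_, arrival, _) in tasks.items()
            if arrival <= tick and remaining[name] > 0
        ]
        if ready:
            # The highest-priority ready task always gets the core,
            # preempting whatever ran in the previous tick.
            running = max(ready, key=lambda name: tasks[name][0])
            remaining[running] -= 1
            timeline.append(running)
        else:
            timeline.append(None)
        tick += 1
    return timeline


# Hypothetical arrivals and durations; a larger number means higher priority.
TASKS = {"Task1": (1, 0, 4), "Task2": (2, 1, 2), "Task3": (3, 2, 1)}
TIMELINE = simulate(TASKS)
```

In this run Task1 is preempted by Task2, which is in turn preempted by Task3; each lower-priority task resumes only after the higher-priority ones complete.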
Backup slides
Ultrascale demo
QUAD CORE ARM CORTEX A53
Block Diagram
Four A53 cores (A53 Core 1 to A53 Core 4), each with private I1 and D1 caches
Local Bus
Shared L2
AMBA AXI
DMA
DRAM
VisualSim Model
Behavioral Flow at the bottom shows the Tasks/Applications running on the Hardware.
Model parameters
Task Details
CF1 - Inlining test for functions containing loops
CS1 - Switch case statement of size 10 (different case each time)
EM1 - Integer execution (length 1 dependency chain each loop with multiplies)
DPcvt - Simple data parallel loop (float/double conversion)
The C code corresponding to each task was run on GEM5, and the GEM5 output was parsed into a VisualSim-readable format.
Results
Cache Stats
Stats generated per cache block:
• Cache Hit Ratio, Miss Ratio
• Number of instructions
• Number of Prefetches
• Buffer Occupancy, Buffer Overflow
• Latency – Min, Max and Mean
• Read Hit Ratio
• MBps – Read, Write and Total
• Write Hit Ratio
• Evicted Block Count
• Written-back block count
• Utilization
Stats Explanation – Core Latency
Each core runs a different task. Depending on the task profile of the application, the number of instructions to be executed differs.
Core 4 runs the DPcvt task, which has the largest trace of the four. Hence Core 4 has a higher latency.
Why are there multiple points in the graph?
This demo model runs the traces 5 times, so each sample point corresponds to the latency of one run.
Why is the latency of each core high for the first run?
Cache loading delay. After the first run, the data is available in the private L1, so very few requests reach L2/DRAM.
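The first-run effect can be reproduced with a toy cache model. This is a sketch under stated assumptions, not the VisualSim cache block: a fully associative LRU cache, 64-byte lines, a 512-line capacity (32 KB) and a made-up sequential address stream that fits in the cache.

```python
from collections import OrderedDict


def misses_per_run(addresses, runs, capacity_lines, line_bytes=64):
    """Replay the same address stream several times and count misses in each run."""
    cache = OrderedDict()  # least recently used line first
    result = []
    for _ in range(runs):
        misses = 0
        for address in addresses:
            line = address // line_bytes
            if line in cache:
                cache.move_to_end(line)  # hit: mark as most recently used
            else:
                misses += 1  # miss: the line must be fetched from L2/DRAM
                cache[line] = True
                if len(cache) > capacity_lines:
                    cache.popitem(last=False)  # evict the least recently used line
        result.append(misses)
    return result


# Hypothetical trace: 4 KB touched in 8-byte steps, replayed 5 times.
MISSES = misses_per_run(range(0, 4096, 8), runs=5, capacity_lines=512)
```

All misses fall in the first run because the working set fits in the cache; later runs hit every time, which corresponds to the drop in latency after the first sample point.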
Stats Explanation – Beh_Latency
Beh_Latency_Plot plots the total time taken to complete the full software sequence defined in the behavioral flow.
This is the time taken to complete the tasks running on Core 1, Core 2, Core 3 and Core 4.
Since the tasks are run periodically, there are 5 sample points in the graph.
Why is the first-run latency higher?
Cache loading delay in all cores for the first run. Once the caches are loaded, the subsequent runs are comparatively faster.

More Related Content

What's hot

MIT's experience on OpenPOWER/POWER 9 platform
MIT's experience on OpenPOWER/POWER 9 platformMIT's experience on OpenPOWER/POWER 9 platform
MIT's experience on OpenPOWER/POWER 9 platform
Ganesan Narayanasamy
 
OpenPOWER Webinar
OpenPOWER Webinar OpenPOWER Webinar
OpenPOWER Webinar
Ganesan Narayanasamy
 
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
Ganesan Narayanasamy
 
TAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platformTAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platform
Ganesan Narayanasamy
 
2018 bsc power9 and power ai
2018   bsc power9 and power ai 2018   bsc power9 and power ai
2018 bsc power9 and power ai
Ganesan Narayanasamy
 
BSC LMS DDL
BSC LMS DDL BSC LMS DDL
BSC LMS DDL
Ganesan Narayanasamy
 
POWER9 for AI & HPC
POWER9 for AI & HPCPOWER9 for AI & HPC
POWER9 for AI & HPC
inside-BigData.com
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and ML
inside-BigData.com
 
Omp tutorial cpugpu_programming_cdac
Omp tutorial cpugpu_programming_cdacOmp tutorial cpugpu_programming_cdac
Omp tutorial cpugpu_programming_cdac
Ganesan Narayanasamy
 
AMD It's Time to ROC
AMD It's Time to ROCAMD It's Time to ROC
AMD It's Time to ROC
inside-BigData.com
 
CFD on Power
CFD on Power CFD on Power
CFD on Power
Ganesan Narayanasamy
 
It's Time to ROCm!
It's Time to ROCm!It's Time to ROCm!
It's Time to ROCm!
inside-BigData.com
 
Accelerate Big Data Processing with High-Performance Computing Technologies
Accelerate Big Data Processing with High-Performance Computing TechnologiesAccelerate Big Data Processing with High-Performance Computing Technologies
Accelerate Big Data Processing with High-Performance Computing Technologies
Intel® Software
 
SDVIs and In-Situ Visualization on TACC's Stampede
SDVIs and In-Situ Visualization on TACC's StampedeSDVIs and In-Situ Visualization on TACC's Stampede
SDVIs and In-Situ Visualization on TACC's Stampede
Intel® Software
 
POWER10 innovations for HPC
POWER10 innovations for HPCPOWER10 innovations for HPC
POWER10 innovations for HPC
Ganesan Narayanasamy
 
IBM HPC Transformation with AI
IBM HPC Transformation with AI IBM HPC Transformation with AI
IBM HPC Transformation with AI
Ganesan Narayanasamy
 
Summit workshop thompto
Summit workshop thomptoSummit workshop thompto
Summit workshop thompto
Ganesan Narayanasamy
 
Heterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of SystemsHeterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of Systems
Anand Haridass
 
An Update on Arm HPC
An Update on Arm HPCAn Update on Arm HPC
An Update on Arm HPC
inside-BigData.com
 
00 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver200 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver2
Yutaka Kawai
 

What's hot (20)

MIT's experience on OpenPOWER/POWER 9 platform
MIT's experience on OpenPOWER/POWER 9 platformMIT's experience on OpenPOWER/POWER 9 platform
MIT's experience on OpenPOWER/POWER 9 platform
 
OpenPOWER Webinar
OpenPOWER Webinar OpenPOWER Webinar
OpenPOWER Webinar
 
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
 
TAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platformTAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platform
 
2018 bsc power9 and power ai
2018   bsc power9 and power ai 2018   bsc power9 and power ai
2018 bsc power9 and power ai
 
BSC LMS DDL
BSC LMS DDL BSC LMS DDL
BSC LMS DDL
 
POWER9 for AI & HPC
POWER9 for AI & HPCPOWER9 for AI & HPC
POWER9 for AI & HPC
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and ML
 
Omp tutorial cpugpu_programming_cdac
Omp tutorial cpugpu_programming_cdacOmp tutorial cpugpu_programming_cdac
Omp tutorial cpugpu_programming_cdac
 
AMD It's Time to ROC
AMD It's Time to ROCAMD It's Time to ROC
AMD It's Time to ROC
 
CFD on Power
CFD on Power CFD on Power
CFD on Power
 
It's Time to ROCm!
It's Time to ROCm!It's Time to ROCm!
It's Time to ROCm!
 
Accelerate Big Data Processing with High-Performance Computing Technologies
Accelerate Big Data Processing with High-Performance Computing TechnologiesAccelerate Big Data Processing with High-Performance Computing Technologies
Accelerate Big Data Processing with High-Performance Computing Technologies
 
SDVIs and In-Situ Visualization on TACC's Stampede
SDVIs and In-Situ Visualization on TACC's StampedeSDVIs and In-Situ Visualization on TACC's Stampede
SDVIs and In-Situ Visualization on TACC's Stampede
 
POWER10 innovations for HPC
POWER10 innovations for HPCPOWER10 innovations for HPC
POWER10 innovations for HPC
 
IBM HPC Transformation with AI
IBM HPC Transformation with AI IBM HPC Transformation with AI
IBM HPC Transformation with AI
 
Summit workshop thompto
Summit workshop thomptoSummit workshop thompto
Summit workshop thompto
 
Heterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of SystemsHeterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of Systems
 
An Update on Arm HPC
An Update on Arm HPCAn Update on Arm HPC
An Update on Arm HPC
 
00 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver200 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver2
 

Similar to Energy efficient AI workload partitioning on multi-core systems

Webinar on RISC-V
Webinar on RISC-VWebinar on RISC-V
Webinar on RISC-V
Deepak Shankar
 
How to create innovative architecture using VisualSim?
How to create innovative architecture using VisualSim?How to create innovative architecture using VisualSim?
How to create innovative architecture using VisualSim?
Deepak Shankar
 
How to create innovative architecture using ViualSim?
How to create innovative architecture using ViualSim?How to create innovative architecture using ViualSim?
How to create innovative architecture using ViualSim?
Deepak Shankar
 
How to create innovative architecture using VisualSim?
How to create innovative architecture using VisualSim?How to create innovative architecture using VisualSim?
How to create innovative architecture using VisualSim?
Deepak Shankar
 
System Architecture Exploration Training Class
System Architecture Exploration Training ClassSystem Architecture Exploration Training Class
System Architecture Exploration Training Class
Deepak Shankar
 
Typesafe spark- Zalando meetup
Typesafe spark- Zalando meetupTypesafe spark- Zalando meetup
Typesafe spark- Zalando meetup
Stavros Kontopoulos
 
Design Like a Pro: How to Pick the Right System Architecture
Design Like a Pro: How to Pick the Right System ArchitectureDesign Like a Pro: How to Pick the Right System Architecture
Design Like a Pro: How to Pick the Right System Architecture
Inductive Automation
 
Exploration of Radars and Software Defined Radios using VisualSim
Exploration of  Radars and Software Defined Radios using VisualSimExploration of  Radars and Software Defined Radios using VisualSim
Exploration of Radars and Software Defined Radios using VisualSim
Deepak Shankar
 
Mirabilis_Design AMD Versal System-Level IP Library
Mirabilis_Design AMD Versal System-Level IP LibraryMirabilis_Design AMD Versal System-Level IP Library
Mirabilis_Design AMD Versal System-Level IP Library
Deepak Shankar
 
Webinar on radar
Webinar on radarWebinar on radar
Webinar on radar
Deepak Shankar
 
Architecting for Hyper-Scale Datacenter Efficiency
Architecting for Hyper-Scale Datacenter EfficiencyArchitecting for Hyper-Scale Datacenter Efficiency
Architecting for Hyper-Scale Datacenter Efficiency
Intel IT Center
 
Task allocation on many core-multi processor distributed system
Task allocation on many core-multi processor distributed systemTask allocation on many core-multi processor distributed system
Task allocation on many core-multi processor distributed system
Deepak Shankar
 
Accelerated development in Automotive E/E Systems using VisualSim Architect
Accelerated development in Automotive E/E Systems using VisualSim ArchitectAccelerated development in Automotive E/E Systems using VisualSim Architect
Accelerated development in Automotive E/E Systems using VisualSim Architect
Deepak Shankar
 
Webinar on Latency and throughput computation of automotive EE network
Webinar on Latency and throughput computation of automotive EE networkWebinar on Latency and throughput computation of automotive EE network
Webinar on Latency and throughput computation of automotive EE network
Deepak Shankar
 
Processors selection
Processors selectionProcessors selection
Processors selection
Pradeep Shankhwar
 
How to achieve 95%+ Accurate power measurement during architecture exploration?
How to achieve 95%+ Accurate power measurement during architecture exploration? How to achieve 95%+ Accurate power measurement during architecture exploration?
How to achieve 95%+ Accurate power measurement during architecture exploration?
Deepak Shankar
 
ERTS_Unit 1_PPT.pdf
ERTS_Unit 1_PPT.pdfERTS_Unit 1_PPT.pdf
ERTS_Unit 1_PPT.pdf
VinothkumarUruman1
 
RISC-V & SoC Architectural Exploration for AI and ML Accelerators
RISC-V & SoC Architectural Exploration for AI and ML AcceleratorsRISC-V & SoC Architectural Exploration for AI and ML Accelerators
RISC-V & SoC Architectural Exploration for AI and ML Accelerators
RISC-V International
 
Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8
MongoDB
 
UNIT I.pptx
UNIT I.pptxUNIT I.pptx
UNIT I.pptx
SeshuSrinivas2
 

Similar to Energy efficient AI workload partitioning on multi-core systems (20)

Webinar on RISC-V
Webinar on RISC-VWebinar on RISC-V
Webinar on RISC-V
 
How to create innovative architecture using VisualSim?
How to create innovative architecture using VisualSim?How to create innovative architecture using VisualSim?
How to create innovative architecture using VisualSim?
 
How to create innovative architecture using ViualSim?
How to create innovative architecture using ViualSim?How to create innovative architecture using ViualSim?
How to create innovative architecture using ViualSim?
 
How to create innovative architecture using VisualSim?
How to create innovative architecture using VisualSim?How to create innovative architecture using VisualSim?
How to create innovative architecture using VisualSim?
 
System Architecture Exploration Training Class
System Architecture Exploration Training ClassSystem Architecture Exploration Training Class
System Architecture Exploration Training Class
 
Typesafe spark- Zalando meetup
Typesafe spark- Zalando meetupTypesafe spark- Zalando meetup
Typesafe spark- Zalando meetup
 
Design Like a Pro: How to Pick the Right System Architecture
Design Like a Pro: How to Pick the Right System ArchitectureDesign Like a Pro: How to Pick the Right System Architecture
Design Like a Pro: How to Pick the Right System Architecture
 
Exploration of Radars and Software Defined Radios using VisualSim
Exploration of  Radars and Software Defined Radios using VisualSimExploration of  Radars and Software Defined Radios using VisualSim
Exploration of Radars and Software Defined Radios using VisualSim
 
Mirabilis_Design AMD Versal System-Level IP Library
Mirabilis_Design AMD Versal System-Level IP LibraryMirabilis_Design AMD Versal System-Level IP Library
Mirabilis_Design AMD Versal System-Level IP Library
 
Webinar on radar
Webinar on radarWebinar on radar
Webinar on radar
 
Architecting for Hyper-Scale Datacenter Efficiency
Architecting for Hyper-Scale Datacenter EfficiencyArchitecting for Hyper-Scale Datacenter Efficiency
Architecting for Hyper-Scale Datacenter Efficiency
 
Task allocation on many core-multi processor distributed system
Task allocation on many core-multi processor distributed systemTask allocation on many core-multi processor distributed system
Task allocation on many core-multi processor distributed system
 
Accelerated development in Automotive E/E Systems using VisualSim Architect
Accelerated development in Automotive E/E Systems using VisualSim ArchitectAccelerated development in Automotive E/E Systems using VisualSim Architect
Accelerated development in Automotive E/E Systems using VisualSim Architect
 
Webinar on Latency and throughput computation of automotive EE network
Webinar on Latency and throughput computation of automotive EE networkWebinar on Latency and throughput computation of automotive EE network
Webinar on Latency and throughput computation of automotive EE network
 
Processors selection
Processors selectionProcessors selection
Processors selection
 
How to achieve 95%+ Accurate power measurement during architecture exploration?
How to achieve 95%+ Accurate power measurement during architecture exploration? How to achieve 95%+ Accurate power measurement during architecture exploration?
How to achieve 95%+ Accurate power measurement during architecture exploration?
 
ERTS_Unit 1_PPT.pdf
ERTS_Unit 1_PPT.pdfERTS_Unit 1_PPT.pdf
ERTS_Unit 1_PPT.pdf
 
RISC-V & SoC Architectural Exploration for AI and ML Accelerators
RISC-V & SoC Architectural Exploration for AI and ML AcceleratorsRISC-V & SoC Architectural Exploration for AI and ML Accelerators
RISC-V & SoC Architectural Exploration for AI and ML Accelerators
 
Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8
 
UNIT I.pptx
UNIT I.pptxUNIT I.pptx
UNIT I.pptx
 

More from Deepak Shankar

Mastering IoT Design: Sense, Process, Connect: Processing: Turning IoT Data i...
Mastering IoT Design: Sense, Process, Connect: Processing: Turning IoT Data i...Mastering IoT Design: Sense, Process, Connect: Processing: Turning IoT Data i...
Mastering IoT Design: Sense, Process, Connect: Processing: Turning IoT Data i...
Deepak Shankar
 
Modeling Abstraction
Modeling AbstractionModeling Abstraction
Modeling Abstraction
Deepak Shankar
 
Evaluating UCIe based multi-die SoC to meet timing and power
Evaluating UCIe based multi-die SoC to meet timing and power Evaluating UCIe based multi-die SoC to meet timing and power
Evaluating UCIe based multi-die SoC to meet timing and power
Deepak Shankar
 
ROLE OF DIGITAL SIMULATION IN CONFIGURING NETWORK PARAMETERS
ROLE OF DIGITAL SIMULATION IN CONFIGURING NETWORK PARAMETERSROLE OF DIGITAL SIMULATION IN CONFIGURING NETWORK PARAMETERS
ROLE OF DIGITAL SIMULATION IN CONFIGURING NETWORK PARAMETERS
Deepak Shankar
 
Capacity Planning and Power Management of Data Centers.
Capacity Planning and Power Management of Data Centers. Capacity Planning and Power Management of Data Centers.
Capacity Planning and Power Management of Data Centers.
Deepak Shankar
 
Automotive network and gateway simulation
Automotive network and gateway simulationAutomotive network and gateway simulation
Automotive network and gateway simulation
Deepak Shankar
 
Using ai for optimal time sensitive networking in avionics
Using ai for optimal time sensitive networking in avionicsUsing ai for optimal time sensitive networking in avionics
Using ai for optimal time sensitive networking in avionics
Deepak Shankar
 
Designing memory controller for ddr5 and hbm2.0
Designing memory controller for ddr5 and hbm2.0Designing memory controller for ddr5 and hbm2.0
Designing memory controller for ddr5 and hbm2.0
Deepak Shankar
 
Develop High-bandwidth/low latency electronic systems for AI/ML application
Develop High-bandwidth/low latency electronic systems for AI/ML applicationDevelop High-bandwidth/low latency electronic systems for AI/ML application
Develop High-bandwidth/low latency electronic systems for AI/ML application
Deepak Shankar
 
Webinar: Detecting Deadlocks in Electronic Systems using Time-based Simulation
Webinar: Detecting Deadlocks in Electronic Systems using Time-based SimulationWebinar: Detecting Deadlocks in Electronic Systems using Time-based Simulation
Webinar: Detecting Deadlocks in Electronic Systems using Time-based Simulation
Deepak Shankar
 
Using VisualSim Architect for Semiconductor System Analysis
Using VisualSim Architect for Semiconductor System AnalysisUsing VisualSim Architect for Semiconductor System Analysis
Using VisualSim Architect for Semiconductor System Analysis
Deepak Shankar
 
Webinar on Functional Safety Analysis using Model-based System Analysis
Webinar on Functional Safety Analysis using Model-based System AnalysisWebinar on Functional Safety Analysis using Model-based System Analysis
Webinar on Functional Safety Analysis using Model-based System Analysis
Deepak Shankar
 
Is accurate system-level power measurement challenging? Check this out!
Is accurate system-level power measurement challenging? Check this out!Is accurate system-level power measurement challenging? Check this out!
Is accurate system-level power measurement challenging? Check this out!
Deepak Shankar
 
Architectural tricks to maximize memory bandwidth
Architectural tricks to maximize memory bandwidthArchitectural tricks to maximize memory bandwidth
Architectural tricks to maximize memory bandwidth
Deepak Shankar
 
Mirabilis design Inc - Brochure
Mirabilis design Inc - BrochureMirabilis design Inc - Brochure
Mirabilis design Inc - Brochure
Deepak Shankar
 


Energy efficient AI workload partitioning on multi-core systems

  • 1. ENERGY-EFFICIENT AI WORKLOAD PARTITIONING ON MULTI-CORE SYSTEMS Deepak Shankar Founder Mirabilis Design Inc. Email: dshankar@mirabilisdesign.com Tom Jose R&D Engineer Mirabilis Design Inc. Email: tjose@mirabilisdesign.com
  • 2. About Mirabilis Design • Started in 2007 and based in Santa Clara, CA, USA • Development and support centers in the US, India, Germany, China, Japan, Taiwan and the Czech Republic • Largest provider of intellectual property and software for system architecture exploration • Used in the design of electronics, semiconductors and software • Over 250 products worldwide across semiconductors, aerospace, computing and automotive • VisualSim: modeling and simulation software • Hundreds of man-years of experience in system design and exploration of digital electronics • Select the “right” configuration to match the customer request
  • 3. VisualSim Architect Modeling and Simulation Software Graphical and Hierarchical Performance Power Functional Tuning Open API for third-party integration Systems & Networks Software IP/SoC Comprehensive Architecture Exploration Solution
  • 4. Scope of System Architecture Exploration: concept to validation for the entire system. AI for data center or edge; track 30 targets per minute; WiFi or 5G device; process 3 cameras, 4 Lidars & 5 Radars; 95% cache hit-ratio; gateways to ECAN, WiFi, BLE and TSN. SoC Specification: Core, RTOS, Interface, AI/CCN, Network. Size and Design Trade-off; Workload Partitioning; HW-SW Distribution; Parameter tuning. Select Traffic, Workload, Use-cases. SW Performance Tuning; System Specification; IP and Core Selection; Optimize for Requirements.
  • 5. Software Performance Tuning • AI and software code • Compile code for the target hardware • Translate into interim trace • Execute software trace on SoC platform • Performance reports: Delay_95_StDev_196s = 4.39E-6, Latency_Value = 7.46E-6, Mean_95_Confidence = 3.61E-7, Mean_Value = 4.01E-6, Min_Value = 3.62E-7, StDev_Value = 2.24E-6 • Modify the code to improve performance and repeat the loop
  • 6. Behavioral flow and HW architecture • Behavioral flow -> sequence-flow representation of how tasks will be executed • HW architecture -> representation of how the hardware of a device is implemented • Example HW architecture: CPU_1, CPU_2, Cache, RAM, Bus • Example behavior flow: Trig, Task 1, Task 2, Task 3 • Task sequence defined from behavior flow • Task 3 depends on the output of Task 1 and Task 2 • Task mapping to CPU cores done from behavior flow • Task 1, 3 -> CPU_1 • Task 2 -> CPU_2
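The example on this slide can be sketched in a few lines of Python. Only the dependency (Task 3 waits for Task 1 and Task 2) and the mapping (Task 1, 3 -> CPU_1; Task 2 -> CPU_2) come from the slide. The durations are made-up placeholders, and the `schedule` helper is hypothetical, not a VisualSim API.

```python
# Dependencies and core mapping taken from the slide.
DEPENDS_ON = {"Task1": [], "Task2": [], "Task3": ["Task1", "Task2"]}
MAPPING = {"Task1": "CPU_1", "Task2": "CPU_2", "Task3": "CPU_1"}

# Assumed durations in microseconds (illustrative only, not from the slides).
DURATION_US = {"Task1": 4.0, "Task2": 6.0, "Task3": 3.0}


def schedule(order):
    """Return {task: (start, end)} for tasks visited in dependency order."""
    core_free = {}  # time at which each core becomes idle
    times = {}
    for task in order:
        core = MAPPING[task]
        # A task starts once its core is free and all of its inputs are ready.
        ready = max((times[dep][1] for dep in DEPENDS_ON[task]), default=0.0)
        start = max(ready, core_free.get(core, 0.0))
        end = start + DURATION_US[task]
        times[task] = (start, end)
        core_free[core] = end
    return times


timeline = schedule(["Task1", "Task2", "Task3"])
for task, (start, end) in timeline.items():
    print(f"{task} on {MAPPING[task]}: {start:.1f} -> {end:.1f} us")
```

With these placeholder durations Task 3 cannot start until Task 2 finishes on CPU_2, even though CPU_1 has been idle since Task 1 ended. This kind of stall is what a behavior-flow-to-core mapping is meant to expose.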
  • 7. Accuracy of VisualSim Power Profiler (Device: ARM Cortex A53; Source: Anandtech.com)
      Frequency     Max power observed   Real system power   Delta percentage
      500.0 MHz     0.037 W              0.038 W              2.63%
      600.0 MHz     0.053 W              0.051 W             -3.92%
      700.0 MHz     0.073 W              0.080 W              8.75%
      800.0 MHz     0.097 W              0.090 W             -7.77%
      1000.0 MHz    0.157 W              0.159 W              1.25%
      1100.0 MHz    0.193 W              0.188 W             -2.65%
      1200.0 MHz    0.233 W              0.227 W             -2.64%
      1300.0 MHz    0.277 W              0.269 W             -2.97%
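The slide does not state how the delta percentage is computed. The sketch below assumes it is (real - observed) / real, which reproduces every row of the table to within 0.01 percentage points (the slide appears to truncate rather than round). The function name `delta_percent` is illustrative.

```python
# (frequency MHz, observed max power W, real system power W, delta % as printed)
ROWS = [
    (500, 0.037, 0.038, 2.63),
    (600, 0.053, 0.051, -3.92),
    (700, 0.073, 0.080, 8.75),
    (800, 0.097, 0.090, -7.77),
    (1000, 0.157, 0.159, 1.25),
    (1100, 0.193, 0.188, -2.65),
    (1200, 0.233, 0.227, -2.64),
    (1300, 0.277, 0.269, -2.97),
]


def delta_percent(observed_w, real_w):
    """Signed error of the simulated power relative to the real measurement."""
    return (real_w - observed_w) / real_w * 100.0


deltas = {freq: delta_percent(obs, real) for freq, obs, real, _ in ROWS}
worst = max(deltas, key=lambda freq: abs(deltas[freq]))

for freq, value in deltas.items():
    print(f"{freq} MHz: {value:+.2f}%")
print(f"largest |delta| at {worst} MHz")
```

Under this assumed formula the largest deviation in the table is at 700 MHz, and every other point is within 8% of the real-system figure.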
  • 8. SoC Power-Performance demo MULTI MEDIA APPLICATION USE CASE
  • 9. Media Application Target: 1. Power < 2.0 W 2. Number of frames in 20 ms > 50K. Three architecture evaluations: 1. All tasks deployed in software 2. Migrate a few tasks to hardware accelerators 3. Add power management to reduce power. (Block diagram)
  • 10. Device Configuration • Processor core: Cortex A53 • Processor speed: 1200 MHz • L1 I-cache: 32 KB, 2-way set associative • L1 D-cache: 32 KB, 4-way set associative • L2 cache: 1 MB, 16-way associative • External DRAM: 4 GB, DDR4, 2400 MHz • HW accelerator speed: 800 MHz
  • 11. VisualSim Model (diagram labels: Processor, Bus Topology, MPEG Application). IP or ARM level • Evaluate pipeline stages • Width, speed • Number of execution units, levels of cache. SoC • Number of ARM cores • Accelerators • Cache memory hierarchy and coherence. System level • Development of an IoT device, ECU or an integrated platform
  • 12. Model parameters Select_Partitioning = “HW” -> map Rotate frame task to the HW_Accelerator
  • 13. Case study ACHIEVING THE REQUIRED PERFORMANCE
  • 14. Case study environment • Single-core Cortex A53 • Analysis done before software development • No C/C++ code written • Uses the Task Generator to generate an instruction sequence based on a task profile set via the Instruction Mix Table (see slides 45 to 47 for details on the Task Generator)
  • 15. CASE 1: Run all tasks in SW (on the A53 core) Observations: 1. Avg power consumption within requirements (<2.0 W) 2. Performance requirement not achieved (max of only 23.3K frames)
  • 16. Cache Stats Stats generated per cache block: • Cache Hit Ratio , Miss Ratio • Number of instructions • Number of Prefetches • Buffer Occupancy, Buffer Overflow • Latency – Min, Max and Mean • Read Hit Ratio • MBps – Read, Write and Total • Write Hit Ratio • Evicted Block Count • Write Backed block count • Utilization
  • 17. Sequence diagram Rotate Frame task is found to be resource intensive
  • 18. CASE 2: Run Rotate Frame task on HW Acc Observations: 1. Avg power consumption requirement not met (>2.0 W) 2. Performance requirement achieved (max of 125.5K frames)
  • 19. CASE 2: Observations • The avg power consumption is greater than the threshold • Options for the architect: • Use a HW accelerator from another vendor • Use the same HW accelerator, but apply power management • The power plot for the HW accelerator shows that it is active only for a short period of time
  • 20. CASE 3: Run Rotate Frame task on HW Acc + Power management Observations: 1. Avg power consumption requirement met (<2.0 W) 2. Performance requirement achieved (max of 125.5K frames)
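Why power management helps in CASE 3 can be illustrated with a time-weighted average. If a block is busy only for a small fraction of the time, its average power is dominated by what it draws while idle. All numbers below are placeholders chosen for illustration; the slides give only the 2.0 W target, not the accelerator's active or idle power, and the sketch ignores wake-up latency and energy.

```python
def average_power_w(active_w, idle_w, duty_cycle):
    """Time-weighted average power of a block busy for `duty_cycle` of the time."""
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be between 0 and 1")
    return duty_cycle * active_w + (1.0 - duty_cycle) * idle_w


# Placeholder numbers, NOT taken from the slides.
ACTIVE_W = 1.5       # accelerator power while processing a frame
IDLE_ON_W = 1.0      # idle power when left fully powered (no power management)
IDLE_GATED_W = 0.05  # idle power when gated between bursts of work
DUTY = 0.2           # fraction of time the accelerator is busy

without_pm = average_power_w(ACTIVE_W, IDLE_ON_W, DUTY)
with_pm = average_power_w(ACTIVE_W, IDLE_GATED_W, DUTY)
print(f"without power management: {without_pm:.2f} W")
print(f"with power management:    {with_pm:.2f} W")
```

Throughput is unchanged in this simple model because the busy time is the same in both cases, which matches the slide's observation that CASE 3 keeps the 125.5K frame count while bringing average power under the target.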
  • 21. Case study environment • Dual-core Cortex A53 • C code for one of the tasks is available (Render Frame) • Execute the C code on GEM5 • GEM5 generates output traces • Use Python code to parse the GEM5 output into a VisualSim-readable format • Use this trace to provide the instruction sequence and addresses (see slides 41 to 44 for more information on trace generation and usage)
  • 22. Database updated Render_Frame task is mapped to the second core
  • 23. CASE 4: Run Render_Frame task using gem5 trace Observations: 1. Avg power consumption within requirements (<2.0 W) 2. Performance requirement achieved (max of 105K frames)
  • 25. ARM Cortex A65 AE demo 16 ARM CORES
  • 26. Block Diagram: two A65 AE clusters, L4 cache, DRAM and four routers
  • 27. Block Diagram – A65 AE Cluster: 8 cores. Diagram labels: Core, I1, D1, L2, Bridge, Private Cache, 1, 2, 8, L3, DSU
  • 28. Device Configuration • Processor core: Cortex A65 AE • Processor speed: 1200 MHz • L1 I-cache: 32 KB, 4-way set associative • L1 D-cache: 64 KB, 4-way set associative • L2 cache: 256 KB, 4-way set associative • L3 cache: 512 KB, 16-way associative • DRAM: DDR4, 2000 MHz
  • 29. VisualSim Model: the Behavioral Flow at the bottom shows the tasks/applications running on the hardware. Cluster 1 and Cluster 2 execute tasks in parallel.
  • 31. Task Details: since the model is running in Lock Mode, the same application is run on the redundant core as well
  • 34. Cache Stats Stats generated per cache block: • Cache Hit Ratio , Miss Ratio • Number of instructions • Number of Prefetches • Buffer Occupancy, Buffer Overflow • Latency – Min, Max and Mean • Read Hit Ratio • MBps – Read, Write and Total • Write Hit Ratio • Evicted Block Count • Write Backed block count • Utilization
  • 35. Router Stats: since no devices are attached to Router R_2_2 (bottom right in the block diagram), the data transferred through R_2_2 is 0.
  • 36. A65AE – Split mode – VisualSim model Different applications running on each core
  • 37. Software mapping in NW processor
  • 39. Conclusion • Architecture exploration can be easy, fast and accurate • Architecture modeling IP is now available and can be quickly configured to create a new system • Integrate IP, SoC and systems with software, interfaces and RTOS • Architecture accuracy at implementation level • Simulation speed that enables large-scale testing
  • 41. GEM5 – Trace generation •Software code executed in GEM5 •GEM5 generates output in .txt format •Custom Python code is used to parse the GEM5 output •Python parses and generates the executed instruction sequence in csv format •The csv file is provided as input to VisualSim Architect
  • 42. C code – snippet and GEM5 output
  • 43. Python code and output file (Trace.csv)
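The parser shown on the slides is not reproduced in this transcript, so the following is only a minimal sketch of the GEM5-to-CSV step. It assumes instruction records whose fields are separated by " : " (tick and CPU, program counter and symbol, disassembly, operation class, then data and address), which is roughly the shape of an instruction-execution trace. Both the sample lines and the CSV column names are assumptions; the column layout VisualSim actually expects is not given here.

```python
import csv
import io
import re

# Sample lines in an ASSUMED layout (not copied from the slides).
SAMPLE = """\
  77000: system.cpu: T0 : 0x400580 @main : ldr x0, [sp, #8] : MemRead : D=0x2a A=0x7ffe10
  77500: system.cpu: T0 : 0x400584 @main+4 : add x0, x0, #1 : IntAlu : D=0x2b
warn: this line is not an instruction record
  78000: system.cpu: T0 : 0x400588 @main+8 : str x0, [sp, #8] : MemWrite : D=0x2b A=0x7ffe10
"""

ADDR_RE = re.compile(r"\bA=(0x[0-9a-fA-F]+)")


def parse_line(line):
    """Return (tick, pc, mnemonic, op_class, address), or None for other lines."""
    fields = [field.strip() for field in line.split(" : ")]
    if len(fields) < 4 or not fields[1] or not fields[2]:
        return None
    tick_text = fields[0].split(":", 1)[0].strip()
    if not tick_text.isdigit():
        return None
    pc = fields[1].split()[0]
    mnemonic = fields[2].split()[0]
    op_class = fields[3]
    match = ADDR_RE.search(" ".join(fields[4:]))
    address = match.group(1) if match else ""
    return int(tick_text), pc, mnemonic, op_class, address


def to_csv(trace_text):
    """Convert trace text to CSV, skipping lines that are not instructions."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["tick", "pc", "mnemonic", "op_class", "address"])
    for line in trace_text.splitlines():
        record = parse_line(line)
        if record is not None:
            writer.writerow(record)
    return out.getvalue()


trace_csv = to_csv(SAMPLE)
print(trace_csv)
```

Keeping the memory address alongside the operation class matters for the later cache statistics, since hit and miss ratios depend on which addresses the loads and stores touch.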
  • 44. TrafficReader – used for reading traces • When an application trigger reaches the render frame stage, Trace.csv is read • The TrafficReader reads one line per trigger • A loop mechanism is therefore defined to keep reading until the end of file • Once the last line of Trace.csv has been read and executed in the core, the trigger is sent to the next stage via the rendered_frame output port
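The read-one-line-per-trigger loop can be mimicked in plain Python. This is a behavioral sketch only: `TraceReader` and `render_frame` are hypothetical stand-ins, not VisualSim blocks, and the two-column trace is an assumed layout. Only the port name `rendered_frame` comes from the slide.

```python
import csv
import io

# Stand-in for Trace.csv; the column layout is an assumption.
TRACE_CSV = "mnemonic,address\nldr,0x7ffe10\nadd,\nstr,0x7ffe10\n"


class TraceReader:
    """Hands out one trace row per trigger."""

    def __init__(self, text):
        self._rows = csv.DictReader(io.StringIO(text))

    def on_trigger(self):
        """Return the next row, or None once the end of file is reached."""
        return next(self._rows, None)


def render_frame(reader, execute):
    """Re-trigger the reader until EOF, then signal the next stage."""
    executed = 0
    while True:
        row = reader.on_trigger()
        if row is None:
            break  # last line already executed; leave the loop
        execute(row)
        executed += 1
    return "rendered_frame", executed


log = []
port, count = render_frame(TraceReader(TRACE_CSV), log.append)
print(port, count)
```

The loop exists because a single trigger yields a single instruction record, so the stage has to feed the trigger back to itself until the whole trace has been replayed.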
  • 45. Task Generator Module •Custom Task generator – number of instructions, type of instructions, order of tasks (loop, random) can be set •More dynamic and distributed traffic profile can be generated •“n” number of Software tasks can be defined •In case software development hasn’t started yet, we can use this module to generate the instruction traces 9/2/2021 MIRABILIS DESIGN INC. 45
  • 46. Task Generator – Config File (Instruction Mix Table). Callouts: software tasks; number of instructions per task. The total number of instructions is made up of instructions of different types, and the percentage of each type is specified here.
  • 47. Task Generator – Config File (Instruction Mix Table). This type descriptor is used in the previous slide. The user can specify the percentage of each type of instruction for each software operation.
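The idea behind the Instruction Mix Table can be sketched as follows: each task has a total instruction count and a percentage per instruction type, and the generator expands that into a sequence. The table contents, type names and the `generate_task` helper are all hypothetical; the actual config file format appears only as an image in the slides.

```python
import random

# Hypothetical mix table: task -> (instruction count, {type: percent}).
MIX_TABLE = {
    "Task1": (1000, {"int": 50, "load": 25, "store": 15, "branch": 10}),
    "Task2": (400, {"int": 30, "fp": 40, "load": 20, "store": 10}),
}


def generate_task(name, seed=0):
    """Return a shuffled instruction-type sequence matching the task's mix."""
    total, mix = MIX_TABLE[name]
    if sum(mix.values()) != 100:
        raise ValueError(f"percentages for {name} must sum to 100")
    counts = {kind: total * percent // 100 for kind, percent in mix.items()}
    # Give any rounding remainder to the most frequent instruction type.
    largest = max(mix, key=mix.get)
    counts[largest] += total - sum(counts.values())
    sequence = []
    for kind, count in counts.items():
        sequence.extend([kind] * count)
    random.Random(seed).shuffle(sequence)  # seeded, so runs are repeatable
    return sequence


task1 = generate_task("Task1")
print(len(task1), task1[:8])
```

Shuffling with a fixed seed is one possible ordering policy. The slides mention that the order of tasks can be set to loop or random, so a real configuration would expose that choice.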
  • 48. Example: One task, one Processor core
  • 49. Example: Three tasks, one Processor core
  • 50. Example: Three tasks, one Processor core, Preemption enabled Priority : Task3 > Task2 > Task1
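The preemption example can be reproduced with a small time-stepped simulation of one core under fixed-priority preemptive scheduling. Only the priority order (Task3 > Task2 > Task1) comes from the slide. The arrival times and run times are invented, and `simulate` is an illustrative helper rather than anything from VisualSim.

```python
# Hypothetical task set: (name, priority, arrival time, run time), arbitrary time units.
TASKS = [("Task1", 1, 0, 4), ("Task2", 2, 1, 3), ("Task3", 3, 2, 2)]


def simulate(tasks):
    """Return which task runs in each time unit on a single core."""
    remaining = {name: run for name, _, _, run in tasks}
    timeline = []
    t = 0
    while any(remaining.values()):
        ready = [
            (prio, name)
            for name, prio, arrival, _ in tasks
            if arrival <= t and remaining[name] > 0
        ]
        if ready:
            _, name = max(ready)  # highest priority wins and preempts the rest
            remaining[name] -= 1
            timeline.append(name)
        else:
            timeline.append("idle")
        t += 1
    return timeline


timeline = simulate(TASKS)
print(" ".join(timeline))
```

In this run Task1 is preempted after one time unit by Task2, which is in turn preempted by Task3. The lower-priority tasks resume in priority order once Task3 completes.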
  • 53. Ultrascale demo QUAD CORE ARM CORTEX A53
  • 54. Block Diagram: four A53 cores (Core 1 to Core 4), each with D1 and I1 caches; Local Bus; Shared L2; AMBA AXI; DMA; DRAM
  • 55. VisualSim Model: the Behavioral Flow at the bottom shows the tasks/applications running on the hardware.
  • 57. Task Details • CF1: inlining test for functions containing loops • CS1: switch-case statement of size 10 (different case each time) • EM1: integer execution (length-1 dependency chain each loop, with multiplies) • DPcvt: simple data-parallel loop (float/double conversion). The C code corresponding to each task was run on GEM5, and the GEM5 output was parsed into a VisualSim-readable format.
  • 60. Cache Stats Stats generated per cache block: • Cache Hit Ratio , Miss Ratio • Number of instructions • Number of Prefetches • Buffer Occupancy, Buffer Overflow • Latency – Min, Max and Mean • Read Hit Ratio • MBps – Read, Write and Total • Write Hit Ratio • Evicted Block Count • Write Backed block count • Utilization
  • 61. Stats Explanation – Core Latency. Each core runs a different task, so the number of instructions to be executed differs per core according to the task profile. Core 4 runs the DPcvt task, which has the largest trace of the four; hence Core 4 has a higher latency. Why are there multiple points in the graph? This demo model runs the traces five times, so each sample point is the latency of one run. Why is the latency of each core high for the first run? Cache loading delay. After the first run the data is available in the private L1, so very few requests reach L2/DRAM.
  • 62. Stats Explanation – Beh_Latency. Beh_Latency_Plot plots the total time taken to complete the full software sequence defined in the behavioral flow, that is, the time taken to complete the tasks running on Core 1, Core 2, Core 3 and Core 4. Since the tasks run periodically, the graph shows 5 sample points. Why is the first-run latency higher? Cache loading delay in all cores for the first run. Once the caches are loaded, subsequent runs are comparatively faster.
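The first-run effect described on the last two slides can be reproduced with a toy cache model. The sketch replays the same address trace five times against a fully associative LRU cache, which is a simplification of a real set-associative L1. The 32 KB capacity matches the L1 D-cache size listed for the Cortex A53 on slide 10. The line size, the hit and miss latencies and the synthetic trace are assumptions.

```python
from collections import OrderedDict

LINE_BYTES = 64    # assumed cache line size
CACHE_LINES = 512  # 512 x 64 B = 32 KB
HIT_CYCLES = 1     # placeholder latencies, not VisualSim or A53 figures
MISS_CYCLES = 100


def run_latency(cache, addresses):
    """Replay one run of the trace against an LRU cache; return total cycles."""
    cycles = 0
    for addr in addresses:
        line = addr // LINE_BYTES
        if line in cache:
            cache.move_to_end(line)  # refresh the LRU position on a hit
            cycles += HIT_CYCLES
        else:
            cache[line] = True
            if len(cache) > CACHE_LINES:
                cache.popitem(last=False)  # evict the least recently used line
            cycles += MISS_CYCLES
    return cycles


# Synthetic trace: walk a 16 KB buffer word by word (it fits in the cache).
trace = list(range(0, 16 * 1024, 4))
cache = OrderedDict()
latencies = [run_latency(cache, trace) for _ in range(5)]
print(latencies)
```

Because the working set fits in the cache, only the first run pays the miss penalty, and the remaining four runs hit on every access. A working set larger than the cache would keep missing on every run, and the five sample points would then stay close together.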