Energy efficient AI workload partitioning on multi-core systems

Deepak Shankar
Founder
Mirabilis Design Inc.
Email: dshankar@mirabilisdesign.com
Tom Jose
R&D Engineer
Mirabilis Design Inc.
Email: tjose@mirabilisdesign.com

About Mirabilis Design
Founded in 2007 and based in Santa Clara, CA, USA.
Development and support centers in the US, India, Germany, China, Japan, Taiwan and the Czech Republic
Largest provider of intellectual property and software for system architecture exploration
Used in the design of electronics, semiconductors and software
Over 250 products worldwide across semiconductors, aerospace, computing and automotive
VisualSim: modeling and simulation software
Hundreds of man-years of experience in system design and exploration of digital electronics
Select the “right” configuration to match the customer request

VisualSim Architect
Comprehensive Architecture Exploration Solution
[Figure: VisualSim Architect overview. Labels: Modeling and Simulation Software; Graphical and Hierarchical; Performance, Power, Functional; Tuning; Open API for third-party integration; Systems & Networks; Software; IP/SoC]

Scope of System Architecture Exploration
Concept to Validation for the entire system
AI for Data center or Edge
Track 30 targets per minute
Wifi or 5G device
Process 3 cameras, 4 Lidars & 5 Radars
95% cache hit-ratio
Gateways to ECAN, WiFi, BLE and TSN
[Figure: scope of exploration. Labels: SoC Specification (Core, RTOS, Interface, AI/CCN, Network); Size and Design Trade-off; Workload Partitioning; HW-SW Distribution; Parameter tuning; Select Traffic, Workload, Use-cases; SW Performance Tuning; System Specification; IP and Core Selection; Optimize for Requirements]

Software Performance Tuning
1. AI and software code
2. Compile the code for the target hardware
3. Translate into an interim trace
4. Execute the software trace on the SoC platform
5. Review the performance reports, for example:
   Delay_95_StDev_196s = 4.39E-6
   Latency_Value = 7.46E-6
   Mean_95_Confidence = 3.61E-7
   Mean_Value = 4.01E-6
   Min_Value = 3.62E-7
   StDev_Value = 2.24E-6
6. Modify the code to improve performance, and repeat the loop

Behavioral flow and HW architecture
•Behavioral flow -> sequence flow representation of how tasks will be executed
•HW architecture -> Representation of how the HW architecture of a device is implemented
•Example:
• Task sequence defined from behavior flow
• Task 3 dependent on the output of Task 1 and Task 2
• Task mapping to CPU cores done from behavior flow
• Task 1, 3 -> CPU_1
• Task 2 -> CPU_2
[Figure: Behavior Flow. Labels: Trig, Task 1, Task 2, Task 3]
[Figure: HW Architecture. Labels: CPU_1, CPU_2, Cache, RAM, Bus]
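The example above can be sketched in a few lines of Python. This is only an illustration of how a behavior flow (dependencies) and a mapping (task to core) combine into a schedule; it is not how VisualSim represents them. The task durations are hypothetical, since the slides give no timing values.

```python
# Hypothetical task durations in microseconds (not from the slides).
DURATION_US = {"Task 1": 4.0, "Task 2": 6.0, "Task 3": 3.0}
# Dependencies and core mapping taken from the example above.
DEPENDS_ON = {"Task 1": [], "Task 2": [], "Task 3": ["Task 1", "Task 2"]}
MAPPING = {"Task 1": "CPU_1", "Task 2": "CPU_2", "Task 3": "CPU_1"}


def schedule(order):
    """Return {task: (start, finish)} for tasks listed in dependency order."""
    core_free = {}  # time at which each core becomes idle
    times = {}
    for task in order:
        core = MAPPING[task]
        # A task is ready once all of its predecessors have finished.
        ready = max((times[dep][1] for dep in DEPENDS_ON[task]), default=0.0)
        start = max(ready, core_free.get(core, 0.0))
        finish = start + DURATION_US[task]
        times[task] = (start, finish)
        core_free[core] = finish
    return times


result = schedule(["Task 1", "Task 2", "Task 3"])
for task, (start, finish) in result.items():
    print(f"{task} on {MAPPING[task]}: {start:.1f} -> {finish:.1f} us")
```

With these assumed durations, Task 3 cannot start when CPU_1 becomes free; it has to wait for Task 2 on CPU_2, which is the dependency the behavior flow expresses.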

Accuracy of VisualSim Power Profiler
Device: ARM Cortex A53

Frequency     Max power observed   Real system power   Delta percentage
500.0 MHz     0.037 W              0.038 W              2.63%
600.0 MHz     0.053 W              0.051 W             -3.92%
700.0 MHz     0.073 W              0.080 W              8.75%
800.0 MHz     0.097 W              0.090 W             -7.77%
1000.0 MHz    0.157 W              0.159 W              1.25%
1100.0 MHz    0.193 W              0.188 W             -2.65%
1200.0 MHz    0.233 W              0.227 W             -2.64%
1300.0 MHz    0.277 W              0.269 W             -2.97%

Source: Anandtech.com
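The delta column can be reproduced from the two power columns. The formula below, (real - observed) / real, is inferred from the table values and is an assumption; it matches the listed percentages to within rounding in the last digit. "Observed" is taken here to mean the value reported by the model.

```python
# (frequency in MHz, max power observed in W, real system power in W)
ROWS = [
    (500.0, 0.037, 0.038),
    (600.0, 0.053, 0.051),
    (700.0, 0.073, 0.080),
    (800.0, 0.097, 0.090),
    (1000.0, 0.157, 0.159),
    (1100.0, 0.193, 0.188),
    (1200.0, 0.233, 0.227),
    (1300.0, 0.277, 0.269),
]


def delta_percent(observed_w, real_w):
    """Relative difference between real and observed power, in percent."""
    return (real_w - observed_w) / real_w * 100.0


deltas = [delta_percent(obs, real) for _, obs, real in ROWS]
for (freq, _, _), delta in zip(ROWS, deltas):
    print(f"{freq:7.1f} MHz  {delta:6.2f}%")
print(f"largest absolute delta: {max(abs(d) for d in deltas):.2f}%")
```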

SoC Power-Performance demo
MULTI MEDIA APPLICATION USE CASE

Media Application
Target
1. Power < 2.0W
2. Number of frames in 20 ms > 50K
Three architecture evaluations
1. All tasks deployed in software
2. Migrate a few tasks to hardware accelerators
3. Add power management to reduce power
Block Diagram

Device Configuration
Processor Core – Cortex A53
Processor Speed – 1200 MHz
L1 Cache
I Cache : 32 KB, 2-way set associative
D Cache : 32 KB
L2 Cache
Size : 1 MB
Associativity : 16-way
Ext DRAM
Size : 4 GB
Type : DDR4
Speed : 2400 MHz
HW Accelerator
Speed : 800 MHz

VisualSim Model
Processor Bus Topology
MPEG Application
IP or ARM level
• Evaluate pipeline stages
• Width, Speed
• Number of execution units, Levels of cache
SoC
• Number of ARM cores
• Accelerators
• Cache memory hierarchy and coherence
System level
• Development of an IoT device, ECU or an integrated platform

Model parameters
Select_Partitioning = “HW” -> map Rotate frame task to the HW_Accelerator

Case study
ACHIEVING THE REQUIRED PERFORMANCE

Case study environment
•Single core Cortex A53
•Analysis being done before software development
• No C/C++ code written
• Uses the Task Generator to generate an instruction sequence based on a task profile set via the Instruction Mix Table (see the Task Generator Module section for details)

CASE 1: Run all tasks in SW (on A53 core)
Observations:
1. Avg power consumption within requirements (<2.0 W)
2. Performance requirement not achieved (only a max of 23.3K frames)

Cache Stats
Stats generated per cache block:
• Cache Hit Ratio, Miss Ratio
• Number of instructions
• Number of Prefetches
• Buffer Occupancy, Buffer Overflow
• Latency – Min, Max and Mean
• Read Hit Ratio
• MBps – Read, Write and Total
• Write Hit Ratio
• Evicted Block Count
• Written-back Block Count
• Utilization

Sequence diagram
Rotate Frame task is found to be resource intensive

CASE 2: Run Rotate Frame task on HW Acc
Observations:
1. Avg power consumption requirement not met (>2.0 W)
2. Performance requirement achieved (max of 125.5K frames)

CASE 2: Observations
•The avg power consumption is greater than the threshold
•Options for the architect:
• Use a HW accelerator from another vendor
• Use the same HW accelerator, but apply power management
• The power plot for the HW accelerator shows that it is active only for a short period of time

CASE 3: Run Rotate Frame task on HW Acc + Power management
Observations:
1. Avg power consumption requirement met (<2.0 W)
2. Performance requirement achieved (max of 125.5K frames)
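Why power management helps in Case 3 can be illustrated with a time-weighted average: a block that is busy only a small fraction of the time is dominated by its idle power. The sketch below is a simplified model, not the VisualSim power model, and every number in it is hypothetical; the slides do not give the accelerator's power states or duty cycle.

```python
def average_power_w(active_w, idle_w, duty_cycle):
    """Time-weighted average power of a block that is active part of the time."""
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be between 0 and 1")
    return duty_cycle * active_w + (1.0 - duty_cycle) * idle_w


# Hypothetical values, chosen only to illustrate the effect.
ACTIVE_W = 1.5    # accelerator power while processing a frame
STANDBY_W = 1.2   # idle power without power management
GATED_W = 0.1     # idle power when the block is put into a low-power state
DUTY_CYCLE = 0.2  # fraction of time the accelerator is busy

without_pm = average_power_w(ACTIVE_W, STANDBY_W, DUTY_CYCLE)
with_pm = average_power_w(ACTIVE_W, GATED_W, DUTY_CYCLE)
print(f"without power management: {without_pm:.2f} W")
print(f"with power management:    {with_pm:.2f} W")
```

The lower the duty cycle, the larger the share of the average that comes from the idle state, which is why a short active period makes the accelerator a good candidate for power management.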

Case study environment
•Dual core Cortex A53
•C code for one of the tasks is available (Render Frame)
• Execute the C code on GEM5
• GEM5 generates output traces
• Use Python code to parse the GEM5 output into a VisualSim-readable format
• Use this trace to provide the instruction sequence and addresses
See the GEM5 – Trace generation section for more information on trace generation and usage

Database updated
Render_Frame task is mapped to the second core

CASE 4: Run Render_Frame task using gem5 trace
Observations:
1. Avg power consumption within requirements (<2.0 W)
2. Performance requirement achieved (max of 105K frames)

ARM Cortex A65 AE demo
16 ARM CORES

Block Diagram
[Figure: system block diagram. Labels: A65 AE Cluster (x2), L4 Cache, DRAM, Router (x4)]

Block Diagram – A65 AE Cluster
A65 AE Cluster – 8 cores
[Figure: cluster block diagram. Labels: Core (1, 2, ... 8), I1, D1, L2 Bridge, Private Cache, L3, DSU]

Device Configuration
Processor Core – Cortex A65 AE
Processor Speed – 1200 MHz
L1 cache:
I Cache : 32 KB
D Cache : 64 KB
L2 cache: 256 KB
L3 Cache
Size :512 KB
Associativity :16 way
DRAM
Type :DDR4
Speed :2000 MHz

VisualSim Model
Behavioral Flow at the bottom shows the Tasks/Applications running on the Hardware.
Both Cluster 1 and Cluster 2 are executing tasks in parallel

Task Details
Since the model is running in Lock Mode, the same application is also run on the redundant core

Router Stats
Since there are no devices attached to Router R_2_2 (bottom right in the block diagram), the data transferred through R_2_2 is 0.

A65AE – Split mode – VisualSim model
Different applications running on each core

Software mapping in NW processor

Conclusion
Architecture exploration can be easy, fast and accurate
Architecture modeling IP is now available and can be quickly configured to create a new system
Integrate IP, SoC and Systems with software, interfaces and RTOS
Architecture accuracy at implementation-level
Simulation speed that enables large scale testing

GEM5 – Trace generation
•Software code executed in GEM5
•GEM5 generates output in .txt format
•Custom Python code is used to parse the GEM5 output
•Python parses and generates the executed instruction sequence in csv format
•The csv file is provided as input to VisualSim Architect
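The parsing step can be sketched as follows. This is not the authors' script: the input line layout is an assumption loosely modeled on gem5 instruction-trace output, which varies with the gem5 version and debug flags, and the CSV column names are hypothetical because the slides do not show the format VisualSim expects.

```python
import csv
import io
import re

# Assumed layout: "<tick>: <cpu>: T<n> : <pc> : <instruction> : <op class> : <details>"
LINE_RE = re.compile(
    r"^\s*(?P<tick>\d+):\s*(?P<cpu>[\w.]+):\s*T\d+\s*:\s*(?P<pc>\S+)\s*:\s*"
    r"(?P<inst>[^:]+?)\s*:\s*(?P<op_class>\w+)\s*:\s*(?P<rest>.*)$"
)
ADDR_RE = re.compile(r"A=(0x[0-9a-fA-F]+)")  # memory address, if present

# Made-up sample input in the assumed layout.
SAMPLE = """\
  1000: system.cpu: T0 : 0x400100 : ldr x0, [sp, #8] : MemRead : D=0x0000000000000001 A=0x7ffffe08
  1500: system.cpu: T0 : 0x400104 : add x0, x0, #1 : IntAlu : D=0x0000000000000002
this line is not an instruction record
  2000: system.cpu: T0 : 0x400108 : str x0, [sp, #8] : MemWrite : D=0x0000000000000002 A=0x7ffffe08
"""

FIELDS = ["tick", "pc", "instruction", "op_class", "address"]


def parse_trace(text):
    """Extract one record per instruction line; other lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue
        addr = ADDR_RE.search(match["rest"])
        rows.append({
            "tick": int(match["tick"]),
            "pc": match["pc"],
            "instruction": match["inst"],
            "op_class": match["op_class"],
            "address": addr.group(1) if addr else "",
        })
    return rows


def to_csv(rows):
    """Serialize the records as CSV text with a header row."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


rows = parse_trace(SAMPLE)
print(to_csv(rows))
```

For a real run, `SAMPLE` would be replaced by the contents of the gem5 output file, and the regular expression adjusted to the actual line layout.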

C code – snippet and GEM5 output

Python code and output file (Trace.csv)

TrafficReader – used for reading Traces
• When an application trigger reaches the render frame stage, Trace.csv is read.
• The TrafficReader reads one line per trigger
• So a loop mechanism is defined to read until the end of the file
• Once the last line from Trace.csv is read out and executed in the core, the trigger is sent to the next stage via the rendered_frame output port
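The read loop described above can be mimicked in plain Python. This is a conceptual sketch, not the VisualSim TrafficReader block: the class and function names are hypothetical, and the trace columns are made up because the slides do not show the contents of Trace.csv.

```python
import csv
import io

# Hypothetical trace content standing in for Trace.csv.
TRACE_CSV = "instruction,address\nldr,0x1000\nadd,\nstr,0x1008\n"


class LineReader:
    """Returns one trace record per trigger, or None at end of file."""

    def __init__(self, stream):
        self._rows = csv.DictReader(stream)

    def on_trigger(self):
        return next(self._rows, None)


def render_frame(stream, execute):
    """Feed every trace line to the core, then signal the next stage."""
    reader = LineReader(stream)
    count = 0
    while True:                        # loop until end of file
        record = reader.on_trigger()   # one line per trigger
        if record is None:
            break
        execute(record)                # stand-in for execution on the core
        count += 1
    return "rendered_frame", count     # output port name, lines executed


executed = []
port, n = render_frame(io.StringIO(TRACE_CSV), executed.append)
print(port, n)
```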

Task Generator Module
•Custom Task Generator – the number of instructions, type of instructions and order of tasks (loop, random) can be set
•A more dynamic and distributed traffic profile can be generated
•“n” number of software tasks can be defined
•If software development hasn’t started yet, this module can be used to generate the instruction traces
9/2/2021 MIRABILIS DESIGN INC.

Task Generator – Config File (Instruction Mix Table)
The table lists the software tasks and the number of instructions per task.
The total number of instructions is made up of instructions of different types. The percentage of each type of instruction is specified here.

Task Generator - Config File (Instruction Mix Table)
This type descriptor is used in the previous slide.
The user can specify the percentage of each type of instruction for each software operation
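The idea behind the Instruction Mix Table can be sketched as follows: each task has an instruction count and a percentage per instruction type, and a generator expands that into a synthetic instruction sequence. The table layout, instruction type names, counts and percentages below are all hypothetical; only the task names come from the slides, and this is not the VisualSim file format.

```python
import random

# Hypothetical instruction mix table: task -> (instruction count, {type: percent}).
INSTRUCTION_MIX = {
    "Rotate_Frame": (1000, {"INT": 50, "FP": 20, "LOAD": 20, "STORE": 10}),
    "Render_Frame": (400, {"INT": 70, "LOAD": 20, "STORE": 10}),
}


def generate_task(name, seed=0):
    """Expand one table row into a shuffled list of instruction types."""
    count, mix = INSTRUCTION_MIX[name]
    if sum(mix.values()) != 100:
        raise ValueError(f"percentages for {name} must sum to 100")
    sequence = []
    for kind, percent in mix.items():
        sequence.extend([kind] * (count * percent // 100))
    # Integer rounding can leave a few instructions unassigned; pad with the first type.
    sequence.extend([next(iter(mix))] * (count - len(sequence)))
    random.Random(seed).shuffle(sequence)
    return sequence


def task_order(mode, rounds, seed=0):
    """Order in which tasks are issued: round-robin ("loop") or "random"."""
    names = list(INSTRUCTION_MIX)
    if mode == "loop":
        return [names[i % len(names)] for i in range(rounds)]
    if mode == "random":
        rng = random.Random(seed)
        return [rng.choice(names) for _ in range(rounds)]
    raise ValueError(f"unknown mode: {mode}")


rotate = generate_task("Rotate_Frame")
print(len(rotate), rotate.count("INT"), rotate.count("FP"))
print(task_order("loop", 3))
```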

Example: One task, one Processor core

Example: Three tasks, one Processor core

Example: Three tasks, one Processor core, Preemption enabled
Priority : Task3 > Task2 > Task1
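The effect of preemption with this priority order can be illustrated with a small single-core simulation. It is a generic preemptive priority scheduler in unit time steps, not the VisualSim scheduler, and the arrival and execution times are hypothetical.

```python
# Hypothetical workload: (name, priority, arrival time, execution time) in arbitrary time units.
# A larger number means a higher priority, matching Task3 > Task2 > Task1.
TASKS = [("Task1", 1, 0, 5), ("Task2", 2, 1, 3), ("Task3", 3, 2, 2)]


def simulate(tasks):
    """Single-core preemptive priority scheduling, one time unit per step."""
    remaining = {name: burst for name, _, _, burst in tasks}
    finish = {}
    timeline = []
    time = 0
    while remaining:
        ready = [(prio, name) for name, prio, arrival, _ in tasks
                 if name in remaining and arrival <= time]
        if not ready:
            timeline.append(None)      # core is idle in this step
        else:
            _, name = max(ready)       # highest-priority ready task runs
            timeline.append(name)
            remaining[name] -= 1
            if remaining[name] == 0:
                del remaining[name]
                finish[name] = time + 1
        time += 1
    return timeline, finish


timeline, finish = simulate(TASKS)
print(timeline)
print(finish)
```

In this assumed workload, Task1 starts first but is preempted as soon as Task2 arrives, which is in turn preempted by Task3; the lower-priority tasks resume only after the higher-priority ones complete.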

Ultrascale demo
QUAD CORE ARM CORTEX A53

Block Diagram
[Figure: quad-core block diagram. Labels: A53 Core 1 to A53 Core 4, each with I1 and D1; Local Bus; Shared L2; AMBA AXI; DMA; DRAM]

VisualSim Model
Behavioral Flow at the bottom shows the Tasks/Applications running on the Hardware.

Task Details
CF1 - Inlining test for functions containing loops
CS1 - Switch case statement of size 10 (different case each time)
EM1 - Integer execution (length 1 dependency chain each loop with multiplies)
DPcvt - Simple data parallel loop (float/double conversion)
The C code corresponding to each task was run on GEM5, and the output from GEM5 was parsed into a VisualSim-readable format.

Stats Explanation – Core Latency
Each core runs a different task. Depending on the task profile of the application being run, the number of instructions to be executed differs.
Core 4 runs the DPcvt task, which is the biggest trace of the four. Hence Core 4 has a higher latency.
Why are there multiple points in the graph?
This demo model runs the traces multiple times (5 times), so each sample point corresponds to the latency of one run.
Why is the latency of each core high for the first run?
Cache loading delay. After the first run, the data is available in the private L1, so very few requests reach L2/DRAM.

Stats Explanation – Beh_Latency
Beh_Latency_Plot plots the total time taken to complete the full software sequence defined in the behavioral flow.
That is, it is the time taken to complete the tasks running on Core 1, Core 2, Core 3 and Core 4.
Since the tasks are run periodically, there are 5 sample points in the graph.
Why is the first-run latency higher?
Cache loading delay in all cores for the first run. Once the caches are loaded, the subsequent runs are comparatively faster.
