Develop High-bandwidth/low latency electronic systems for AI/ML application

DEVELOP HIGH-BANDWIDTH, LOW-LATENCY
ELECTRONIC SYSTEMS FOR AI/ML APPLICATIONS
Deepak Shankar
Founder
Mirabilis Design Inc.
Email: dshankar@mirabilisdesign.com

Logistics
All attendees are set on mute.
To ask a question, click the arrow to the left of Chat and type your question.
Folks are standing by to answer your questions, and there will also be time
at the end for Q&A.

Agenda
Architecture Exploration of Electronic Systems
Introduction to System Modeling
VisualSim Libraries and Architecture Exploration requirements
VisualSim Demonstration and Analysis
◦ Software
◦ Semiconductor
◦ Power-Performance trade-off
Company profile

Exploration of Electronic Systems
INTRODUCTION AND BENEFITS

Modeling Electronic Systems
Current approach
◦ Use of analytical models such as spreadsheets and worst-case execution time (WCET)
◦ Move directly from the high-level specification to building prototypes
◦ WCET and spreadsheet estimates are highly inaccurate for dynamic systems with shared resources
◦ Prototypes take too long to develop and have limited exploration capacity
Proposed Approach
◦ Add a systems engineering layer after the analytical analysis
◦ Create a virtual prototype of the full system: hardware, software, RTOS and network connections
◦ Conduct trade-offs early in the design cycle with detailed knowledge of the system operation
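The inaccuracy of static estimates is easy to see with a small experiment. The Python sketch below is not VisualSim code; the single FIFO server, Poisson arrivals and the 1 us service time at 80% utilization are illustrative assumptions. A utilization-only spreadsheet would report the 1 us service time as the latency, while the simulation also captures the queueing delay.

```python
import random

def simulate_fifo(arrival_rate, service_time, n_requests, seed=1):
    """Mean and worst-case latency of one FIFO server with Poisson arrivals
    and a fixed service time (an M/D/1 queue)."""
    rng = random.Random(seed)
    t_arrive = 0.0      # arrival time of the current request
    server_free = 0.0   # time at which the server finishes its backlog
    latencies = []
    for _ in range(n_requests):
        t_arrive += rng.expovariate(arrival_rate)   # next random arrival
        start = max(t_arrive, server_free)           # queue if the server is busy
        server_free = start + service_time
        latencies.append(server_free - t_arrive)     # queueing delay + service time
    return sum(latencies) / len(latencies), max(latencies)

# Illustrative assumption: 1 us per request at 80% average utilization
SERVICE = 1e-6
mean_lat, worst_lat = simulate_fifo(arrival_rate=0.8 / SERVICE,
                                    service_time=SERVICE, n_requests=100_000)
print(f"static estimate: {SERVICE * 1e6:.2f} us")
print(f"simulated mean:  {mean_lat * 1e6:.2f} us, worst: {worst_lat * 1e6:.2f} us")
```

For these parameters the Pollaczek-Khinchine formula predicts a mean latency of about 3 us, three times the 1 us that a utilization-only spreadsheet would report.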

Traditional Approach: Spreadsheet-based Traffic and Power Analysis

Proposed Approach: Full Braking System
[Figure: input spreadsheet and trace file; generated reports and plots]
Reuse existing data to kickstart model development
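As a sketch of reusing a trace file, the Python function below turns a timestamped packet trace into per-window throughput and a mean inter-arrival time, the kind of numbers that could parameterize a traffic generator. The CSV column names `time_s` and `bytes` and the sample rows are hypothetical; a real trace will have its own format.

```python
import csv
import io

def summarize_trace(csv_text, window_s):
    """Turn a (time_s, bytes) trace into per-window throughput (bytes/s)
    and the mean inter-arrival time (s)."""
    rows = sorted((float(r["time_s"]), int(r["bytes"]))
                  for r in csv.DictReader(io.StringIO(csv_text)))
    windows = {}
    for t, size in rows:
        w = int(t // window_s)                       # index of the time window
        windows[w] = windows.get(w, 0) + size
    throughput = {w: total / window_s for w, total in sorted(windows.items())}
    gaps = [b[0] - a[0] for a, b in zip(rows, rows[1:])]
    mean_gap = sum(gaps) / len(gaps) if gaps else 0.0
    return throughput, mean_gap

# Hypothetical trace: timestamp in seconds, packet size in bytes
TRACE = """time_s,bytes
0.0000,1500
0.0004,64
0.0011,1500
0.0013,512
"""
tp, gap = summarize_trace(TRACE, window_s=0.001)
print(tp, gap)
```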

Analysis and Experiment
Understand the connectivity between all the individual components and sub-systems
Evaluate timing, throughput, power, heat and functional correctness using a single model
Measure the latency between network interface and processed output
Identify opportunities for hardware acceleration
Partition applications across multi-core, multi-processor and multi-chassis
Explore emerging technologies
◦ New processor families, new backplane technologies and better-integrated memory
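Measuring the latency between the network interface and the processed output comes down to pairing timestamps taken at both points. A minimal Python sketch, assuming each packet carries a unique id and both probe points log (id, time) pairs; the dictionaries below are hypothetical sample data:

```python
def latency_stats(ingress_ts, egress_ts):
    """Pair ingress and egress timestamps (seconds) by packet id.
    Returns (mean latency, max latency, ids that never reached the output)."""
    latencies = [egress_ts[pid] - t for pid, t in ingress_ts.items() if pid in egress_ts]
    missing = sorted(pid for pid in ingress_ts if pid not in egress_ts)
    if not latencies:
        return float("nan"), float("nan"), missing
    return sum(latencies) / len(latencies), max(latencies), missing

# Hypothetical timestamps captured at the network interface and at the output
ingress = {1: 0.0, 2: 1e-6, 3: 2e-6}
egress = {1: 5e-6, 2: 9e-6}
mean_s, max_s, lost = latency_stats(ingress, egress)
print(mean_s, max_s, lost)
```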

Why Deploy the New Approach
Eliminate all surprises before integration
Gain visibility into system operation and requirements early in the design process
Complete visibility into the constraints on each packet/request, protocol/control and software/hardware element
Determine requirements for hardware and network components
Identify bottlenecks, limitations and reusability

Introduction to System Modeling
DEMO MODELS

System Architecture Modeling Methods
Application and Software behavior
Network or backplane Modeling
Hardware architecture

Network Model with Scheduler and Flow Control

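Software Code for Scheduler Algorithm
/* Scan Queues based on receiving input, user algorithm here */
Select = 1
WAIT (1.0E-08)
while (true) {
    /* Visit each ingress queue once per pass (round-robin scan) */
    while (Select <= Ingress_Size) {
        /* Forward only if queue Select holds data and its egress queue is below Threshold */
        if (getBlockStatus(Smart_Resource_Name,"length",Select) > 0 && getBlockStatus("Egress","length",Select) < Threshold) {
            token = getBlockStatus(Smart_Resource_Name,"copy",Select)
            /* Model the scan time for this packet */
            WAIT ((token.Size) / Scan_Rate)
            SEND (pop,Select)
            /* Accumulate the bytes forwarded from each queue (0-based index) */
            Index = Select - 1
            InThru(Index) = InThru(Index) + token.Size
        }
        Select = Select + 1
    }
    /* Restart the scan after a short idle wait */
    Select = 1
    WAIT (1.0E-09)
}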
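For readers without VisualSim, the plain-Python sketch below mirrors one pass of the scheduler script above. Queues are modeled as deques of packet sizes in bytes; `scan_rate` stands in for `Scan_Rate` (assumed here to be bytes per second), and the functions `scan_pass` and `run`, the fixed number of passes and the sample queues are assumptions of the sketch rather than VisualSim behavior.

```python
from collections import deque

def scan_pass(ingress, egress, threshold, scan_rate, in_thru):
    """One round-robin pass over all ingress queues, mirroring the inner
    while-loop of the script. Returns the simulated time of the pass (s)."""
    elapsed = 0.0
    for idx, queue in enumerate(ingress):          # idx = Select - 1
        # Forward only if there is data and the egress queue is below threshold
        if queue and len(egress[idx]) < threshold:
            size = queue[0]                        # head-of-line packet size (bytes)
            elapsed += size / scan_rate            # like WAIT ((token.Size) / Scan_Rate)
            egress[idx].append(queue.popleft())    # like SEND (pop, Select)
            in_thru[idx] += size                   # like InThru(Index) += token.Size
    return elapsed + 1e-9                          # like WAIT (1.0E-09) between passes

def run(ingress, egress, threshold, scan_rate, passes):
    """Run a fixed number of passes instead of the script's endless loop."""
    in_thru = [0] * len(ingress)
    t = 1e-8                                       # like the initial WAIT (1.0E-08)
    for _ in range(passes):
        t += scan_pass(ingress, egress, threshold, scan_rate, in_thru)
    return t, in_thru

ingress = [deque([100, 200]), deque(), deque([50])]
egress = [deque(), deque(), deque([1, 2])]         # third egress queue is already at threshold
t, in_thru = run(ingress, egress, threshold=2, scan_rate=1e9, passes=3)
print(t, in_thru)
```

The third ingress queue is never served because its egress queue stays at the threshold, which is the flow-control behavior the `Threshold` test encodes.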

Software Profiling of the Scheduler Code
Address  Number  Mean_Time         Script_RegEx_Statement
0        1       116.10900000 us   Select = 1
1        1       69.97000000 us    WAIT (1.0E-08)
2        404     206.66089 ns      if (true) false, expression plus 13, else plus 1.
3        6462    258.44181 ns      if (Select <= Ingress_Size) false, expression plus 9, else plus 1.
4        6059    8.07862948 us     if (getBlockStatus(Smart_Resource_Name,"length",Select) > 0 && getBlockStatus("Egress","length",Select) < Threshold) false, expression plus 6, else plus 1.
5        1168    6.47288699 us     token = getBlockStatus(Smart_Resource_Name,"copy",Select)
6        1168    20.36501199 us    WAIT ((token.Size) / Scan_Rate)
7        1167    1.59209769 us     SEND (pop,Select)
8        1167    891.31791 ns      Index = Select - 1
9        1167    4.95694859 us     InThru(Index) = InThru(Index) + token.Size
10       6058    318.42786 ns      Select = Select + 1
11       6058    85.02542 ns       GTO (-8)
12       403     289.43921 ns      Select = 1
13       403     44.19382630 us    WAIT (1.0E-09)
14       403     295.18114 ns      GTO (-12)
15       0       0.0000000         GTO (EndThread)
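Assuming Mean_Time is the average time per execution of each statement, Number times Mean_Time estimates the total time attributed to it, which ranks the hotspots. A minimal Python sketch using the values from the profile above, converted to microseconds; the function name `hotspots` is illustrative:

```python
# (address, executions, mean time in microseconds, statement) from the profile
PROFILE = [
    (0, 1, 116.109, "Select = 1"),
    (1, 1, 69.97, "WAIT (1.0E-08)"),
    (2, 404, 0.20666089, "if (true)"),
    (3, 6462, 0.25844181, "if (Select <= Ingress_Size)"),
    (4, 6059, 8.07862948, "if (... length ... < Threshold)"),
    (5, 1168, 6.47288699, "token = getBlockStatus(..., copy, ...)"),
    (6, 1168, 20.36501199, "WAIT ((token.Size) / Scan_Rate)"),
    (7, 1167, 1.59209769, "SEND (pop,Select)"),
    (8, 1167, 0.89131791, "Index = Select - 1"),
    (9, 1167, 4.95694859, "InThru(Index) = InThru(Index) + token.Size"),
    (10, 6058, 0.31842786, "Select = Select + 1"),
    (11, 6058, 0.08502542, "GTO (-8)"),
    (12, 403, 0.28943921, "Select = 1"),
    (13, 403, 44.1938263, "WAIT (1.0E-09)"),
    (14, 403, 0.29518114, "GTO (-12)"),
]

def hotspots(profile, top=3):
    """Rank statements by estimated total time = executions x mean time."""
    totals = [(addr, count * mean_us, stmt) for addr, count, mean_us, stmt in profile]
    return sorted(totals, key=lambda row: row[1], reverse=True)[:top]

for addr, total_us, stmt in hotspots(PROFILE):
    print(f"{addr:>2}  {total_us / 1000:8.2f} ms  {stmt}")
```

Under that assumption, the queue-status check at address 4 (about 49 ms in total) dominates, followed by the per-packet WAIT at address 6 (about 24 ms) and the idle WAIT at address 13 (about 18 ms).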

Mapping Scheduler Code to Pseudo Instructions
Instruction Sequence corresponding to the code execution
{"FXA_b", "LTE", "LT", "ADD", "BCH", "LTE", "LT", "ADD", "BCH", "LTE", "LT", "ADD", "BCH", "LTE", "LT", "ADD", "BCH",
"LTE", "LT", "ADD", "BCH", "LTE", "LT", "ADD", "BCH", "LTE", "LT", "ADD", "BCH", "LTE", "LT", "ADD", "BCH", "LTE", "LT",
"ADD", "BCH", "LTE", "LT", "ADD", "BCH", "LTE", "LT", "ADD", "BCH", "LTE", "LT", "ADD", "BCH", "LTE", "LT", "ADD", "BCH",
"LTE", "LT", "ADD", "BCH", "LTE", "LT", "ADD", "BCH", "LTE", "IMM", "WAIT_s"}
Software code address line execution order
0, 1, 2, 3, 4, 5, 2, 3, 4, 5, 2, 3, 4, 5, 2, 3, 4, 5, 2, 3, 4, 5, 2, 3, 4, 5, 2, 3, 4, 5, 2, 3, 4, 5, 2, 3, 4, 5, 2, 3, 4, 5, 2, 6
List of Pseudo Instructions
FXA_b = Function w/ Args, boolean
FXA_r = Function w/ Args, struct (record)
FXA_a = Function w/ Args, array
FXA_m = Function w/ Args, matrix
WAIT_s = WAIT string, event
WAIT_d = WAIT double, delay
DEC = --
GT = Greater than
LT = Less than
BCH = Branch
ADD = Add
SUB = Subtract
MUL = Multiply
INC = ++
SHIFT = >> or <<
SEND = Send to Label, Block or Port
LTE = Less than or equal
MOD = Modulo
POW = Power
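Once code is expressed as pseudo instructions, execution time on a candidate processor can be estimated by assigning a cost to each instruction. The Python sketch below rebuilds the instruction sequence listed above; the cycle counts and the 2 GHz clock are hypothetical values chosen for illustration, not VisualSim processor data (IMM is given a cost even though it does not appear in the list).

```python
# Instruction sequence listed above: FXA_b, fifteen LTE/LT/ADD/BCH groups, then LTE, IMM, WAIT_s
SEQUENCE = ["FXA_b"] + ["LTE", "LT", "ADD", "BCH"] * 15 + ["LTE", "IMM", "WAIT_s"]

# Hypothetical cycles per pseudo instruction for an assumed target core
CYCLES = {"FXA_b": 10, "LTE": 1, "LT": 1, "ADD": 1, "BCH": 2, "IMM": 1, "WAIT_s": 0}

def estimate(sequence, cycles, clock_hz):
    """Total cycles and execution time (s) for a pseudo-instruction sequence."""
    total = sum(cycles[op] for op in sequence)
    return total, total / clock_hz

total_cycles, seconds = estimate(SEQUENCE, CYCLES, clock_hz=2e9)
print(len(SEQUENCE), total_cycles, seconds)
```

Swapping in a different cost table is how the same instruction sequence could be compared across processor candidates.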

Mapping of Two Applications to a Single-Board Computer
Applications are a set of complex tasks
• Variable-rate input stream
• Tasks and transfers between tasks
Contention for resources by tasks
• Resources are the hardware blocks
• Assign tasks to resources
• Transfers flow across buses and bridges
Trade-off between processing and transfer
• Efficient: more processing and less transfer
• Minimize power consumption
[Figure: task1, task2, task3 and task4 mapped onto I/O, DSP, CPU1 and CPU2]
Scheduling software tasks using limited resources
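The processing-versus-transfer trade-off can be made concrete with a simple cost model for a chain of tasks. In this Python sketch the execution times per resource and the fixed bus transfer cost are hypothetical, and contention between the two applications is ignored; that contention is exactly what a full simulation adds.

```python
# Hypothetical execution times (us) of each task on each candidate resource
EXEC_US = {
    "task1": {"I/O": 2, "CPU1": 5},
    "task2": {"DSP": 3, "CPU1": 6, "CPU2": 6},
    "task3": {"DSP": 4, "CPU1": 7, "CPU2": 7},
    "task4": {"CPU1": 2, "CPU2": 2},
}
TRANSFER_US = 4  # hypothetical cost of moving data between two resources over the bus

def chain_latency(mapping, order=("task1", "task2", "task3", "task4")):
    """End-to-end latency of a task chain: processing time plus a bus
    transfer whenever consecutive tasks run on different resources."""
    total = 0
    for i, task in enumerate(order):
        total += EXEC_US[task][mapping[task]]
        if i > 0 and mapping[order[i - 1]] != mapping[task]:
            total += TRANSFER_US
    return total

MAPPINGS = {
    "accelerated": {"task1": "I/O", "task2": "DSP", "task3": "DSP", "task4": "CPU2"},
    "single_cpu": {"task1": "CPU1", "task2": "CPU1", "task3": "CPU1", "task4": "CPU1"},
    "scattered": {"task1": "I/O", "task2": "DSP", "task3": "CPU1", "task4": "CPU2"},
}
for name, mapping in MAPPINGS.items():
    print(name, chain_latency(mapping))
```

Here keeping task2 and task3 together on the DSP (19 us) beats both running everything on CPU1 (20 us) and spreading the tasks across every resource (26 us), because each extra hop adds a transfer.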

VisualSim Block Diagram
[Figure labels: Library, Folder Parameters, Reports & Statistics, Single Board Computer Architecture, Application 1, Application 2, Workload, Mapping, Power Data]

Run Simulations using Two Parameter Variations of the Bus Speed
The system with the faster bus is slower in places
Unpredictable system response

VisualSim Libraries and Architecture Challenges
DEMO MODELS

Systems-Level Block Library
Largest library of traffic, resources, hardware, software and analysis
Traffic
• Distribution
• Sequence
• Trace file
• Instruction profile
Reports
• Timing and Buffer
• Throughput/Util
• Ave/peak power
• Statistics
Power
• State power table
• Power management
• Energy harvesters
• Battery
• RegEx operators
SoC Buses
• AMBA and CoreLink
• AHB, APB, AXI, ACE, CHI, CMN-600
• Network-on-Chip
• TileLink
System Bus
• PCI/PCI-X/PCIe
• RapidIO
• AFDX
• OpenVPX
• VME
• SPI 3.0
• 1553B
Processors
• GPU, DSP, µP and µC
• RISC-V
• Nvidia Drive PX
• PowerPC
• x86: Intel and AMD
• DSP: TI and ADI
• MIPS, Tensilica, SH
ARM
• M-, R-, 7TDMI
• A8, A53, A55, A72, A76, A77
Custom Creator
• Script language
• 600 RegEx fn
• Task graph
• Tracer
• C/C++/Java
• Python
Support
• Listener and Trace
• Debuggers
• Assertions
• Assertions
Stochastic
• FIFO/LIFO Queue
• Time Queue
• Quantity Queue
• System Resource
• Schedulers
• Cyber Security
RTOS
• Template
• ARINC 653
• AUTOSAR
Memory
• Memory Controller
• DDR DRAM 2, 3, 4, 5
• LPDDR 2, 3, 4
• HBM, HMC
• SDR, QDR, RDRAM
Storage
• Flash & NVMe
• Storage Array
• Disk and SATA
• Fibre Channel
• FireWire
Networking
• Ethernet & GigE
• Audio-Video Bridging
• 802.11 and Bluetooth
• 5G
• SpaceWire
• CAN-FD
• TTEthernet
• FlexRay
• TSN & IEEE802.1Q
FPGA
• Xilinx: Zynq, Virtex, Kintex
• Intel: Stratix, Arria
• Microsemi: SmartFusion
• Programmable logic template
• Interface traffic generator
Software
• gem5
• Software code integration
• Instruction trace
• Statistical software model
• Task graph
Interfaces
• Virtual Channel
• DMA
• Crossbar
• Serial Switch
• Bridge
RTL-like
• Clock, Wire-Delay
• Registers, Latches
• Flip-flop
• ALU and FSM
• Mux, DeMux
• Lookup table

Application Template Library
VisualSim Modeling Library provides templates covering applications across the electronics domain

Electronic System Challenges
Systems Engineering
◦ Top-level view of the entire system without worrying about the exact implementation details
◦ Capture the data flow, application task sequence and mapping to System resources
◦ Generate statistics for response time, throughput and power consumed
Hardware-Software selection
◦ Select the appropriate hardware blocks including processor, memory and bus/network
◦ Determine the number of independent boards and chassis for symmetrical processing
◦ Experiment with different mapping strategies and select accelerators
◦ Reuse the systems engineering data flow and application task sequence
System level
◦ Develop the specification for integration and test cases

Mathematical Function Allocation and Partitioning
DEMO MODELS

Modeling Complex AI/ML Processing in an Image-based Application

Check Correctness of AI/ML Math
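A common way to check the math in a model is to compare its output against an independent golden reference within a tolerance. The Python sketch below computes a reference 2D convolution, the core operation of image-based networks, and compares a hypothetical model output against it; the tolerance, sample image and model output are assumptions.

```python
def conv2d_valid(image, kernel):
    """Reference 2D 'valid' cross-correlation (the convolution used in most CNNs)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def matches(model_out, reference, tol):
    """True if shapes agree and every element is within tol of the reference."""
    if len(model_out) != len(reference) or any(
            len(a) != len(b) for a, b in zip(model_out, reference)):
        return False
    return all(abs(m - r) <= tol
               for a, b in zip(model_out, reference) for m, r in zip(a, b))

image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, -1]]
golden = conv2d_valid(image, kernel)               # [[-4, -4], [-4, -4]]
model_out = [[-4.0, -4.0002], [-3.9999, -4.0]]     # hypothetical output exported from the model
print(matches(model_out, golden, tol=1e-3))
```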

Host to Data center using Ethernet AVB
6/4/2020

Analysis
Latency from gateway to gateway, client to server, master to slave or node to node
Effects of communication stack activity
Scheduling of different traffic classes for policing and shaping
Trade-off: switch vs. gateway
Effect of global vs. local multicast
Impact of clock jitter
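Ethernet AVB shapes its stream-reservation traffic classes with the credit-based shaper defined in IEEE 802.1Qav: credit grows at idleSlope while a frame waits, falls at sendSlope (idleSlope minus the port rate) while a frame transmits, and a frame may start only when credit is non-negative. The Python sketch below is a simplified model that assumes a backlog of equal-size frames in one class and no interfering traffic; the frame size and rates are illustrative.

```python
def cbs_start_times(frame_bits, n_frames, port_rate_bps, idle_slope_bps):
    """Transmission start times (s) of a backlog of equal-size frames in one
    AVB class under a credit-based shaper, assuming no other traffic."""
    send_slope = idle_slope_bps - port_rate_bps   # credit slope while sending (< 0)
    t, credit, starts = 0.0, 0.0, []
    for _ in range(n_frames):
        if credit < 0:                            # wait until credit recovers to zero
            t += -credit / idle_slope_bps
            credit = 0.0
        starts.append(t)
        tx_time = frame_bits / port_rate_bps
        t += tx_time
        credit += send_slope * tx_time            # credit drains while transmitting
    return starts

# 1500-byte frames on a 100 Mb/s port with 25 Mb/s reserved for the class
starts = cbs_start_times(12_000, 3, 100e6, 25e6)
print([round(s * 1e6, 3) for s in starts])
```

Frames start every 480 us, so the class is held to 12,000 bits per 480 us, which equals its 25 Mb/s idleSlope.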

Interface Modeling: Network on Chip
Block Diagram
VisualSim Model

VisualSim Software Architecture and Mapping to Hardware
DEMO MODELS

Application Task Graph
(Implementation can be in HW or SW)
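Whether each task runs in hardware or software, the end-to-end latency of a task graph can never be shorter than its critical path. A minimal Python sketch using the standard-library `graphlib` module (Python 3.9 or later); the task names and durations are hypothetical:

```python
from graphlib import TopologicalSorter  # Python 3.9+

def critical_path(duration, preds):
    """Longest (critical) path through a task DAG.
    duration: task -> execution time; preds: task -> tasks it depends on."""
    finish, via = {}, {}
    for task in TopologicalSorter(preds).static_order():
        start, via[task] = 0, None
        for p in preds.get(task, ()):
            if finish[p] > start:              # latest-finishing predecessor gates the start
                start, via[task] = finish[p], p
        finish[task] = start + duration[task]
    node = max(finish, key=finish.get)         # task that finishes last
    length, path = finish[node], []
    while node is not None:                    # walk back along the gating predecessors
        path.append(node)
        node = via[node]
    return length, path[::-1]

# Hypothetical task graph (times in ms); any task could be mapped to HW or SW
DURATION = {"capture": 2, "filter": 3, "detect": 5, "track": 4, "fuse": 1}
PREDS = {"capture": (), "filter": ("capture",), "detect": ("filter",),
         "track": ("filter",), "fuse": ("detect", "track")}
print(critical_path(DURATION, PREDS))
```

Moving a task on the critical path (here `detect`) to a hardware accelerator shortens the graph; speeding up `track` does not.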

VisualSim Model of the Task Graph

Block Diagram of a Software System: Radar
Analyze system behavior with deterministic and non-deterministic workloads

Behavior Model of Radar Software

Mapping Radar Software Tasks to Two Hardware Architectures
x86-based ECU
DSP-based ECU

Comparing Mapping on x86 vs DSP
Key parameters are the latency, processing efficiency and the throughput

Failure Impact on RTOS and Scheduling
[Figure panels: Without Faults, With Faults]
With faults, the time between Ready-to-Run and Run increases rapidly

VisualSim Semiconductor
DEMO MODELS

Designing for an SoC Block Diagram
Target
Power < 1.0W
Number of frames in 20 ms > 13K
Three Explorations
1. All tasks deployed in Software
2. Migrate few tasks to Hardware accelerators
3. Add power management to reduce power
[Block diagram: ARM Cortex-A77, ARM AMBA AXI, ARM AMBA AXI CoreLink CMN-600, AMBA links, Controller]
VisualSim can model any processor architecture
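The two targets translate directly into a per-frame budget. Reading 13K as 13,000 frames, the Python sketch below computes the time and energy available per frame and checks a configuration against both limits; the constant and function names are illustrative.

```python
WINDOW_S = 20e-3      # frame-count window from the target
MIN_FRAMES = 13_000   # "13K", read here as 13,000 frames
MAX_POWER_W = 1.0

def per_frame_budget():
    """Time (s) and energy (J) available per frame if the targets are met exactly."""
    t_frame = WINDOW_S / MIN_FRAMES
    return t_frame, MAX_POWER_W * t_frame

def meets_target(avg_power_w, frames_in_window):
    """Both limits are strict, as in the target."""
    return avg_power_w < MAX_POWER_W and frames_in_window > MIN_FRAMES

t_frame, e_frame = per_frame_budget()
print(f"{t_frame * 1e6:.3f} us and {e_frame * 1e6:.3f} uJ per frame")
```

Every exploration (software only, hardware accelerators or power management) has to fit each frame into roughly 1.54 us and 1.54 uJ.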

Translate SoC Block Diagram into VisualSim Model
[Figure labels: Processor, Bus Topology, Memory Controller, Hardware Accelerators, Power Management, Use Cases]
The SoC design methodology offers flexibility in both the level of detail and the type of analysis

Comparing Power and Performance across Multiple Parameter Values
[Figure: results for software-only (SW) and hardware-accelerated (HW) configurations]
The post-processor and batch-mode simulation make it easy to compare results across simulations
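When batch-mode runs sweep several parameters, a simple post-processing step is to keep only the Pareto-optimal configurations, those that no other run beats on both power and throughput. A Python sketch with made-up results for illustration; the configuration names and numbers are hypothetical:

```python
# Hypothetical batch results; in practice these would come from the post-processor
RESULTS = [
    {"name": "sw_only", "power_w": 1.3, "frames": 9_000},
    {"name": "hw_accel", "power_w": 1.1, "frames": 15_000},
    {"name": "hw_accel_pm", "power_w": 0.9, "frames": 14_000},
    {"name": "sw_pm", "power_w": 1.0, "frames": 8_000},
]

def dominates(a, b):
    """a is at least as good as b on both metrics and strictly better on one."""
    no_worse = a["power_w"] <= b["power_w"] and a["frames"] >= b["frames"]
    better = a["power_w"] < b["power_w"] or a["frames"] > b["frames"]
    return no_worse and better

def pareto_front(results):
    """Names of the runs that no other run dominates."""
    return [r["name"] for r in results
            if not any(dominates(o, r) for o in results if o is not r)]

print(pareto_front(RESULTS))
```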

Power and Thermal Analysis in an Application
DEMO MODELS

VisualSim Model of a Braking System

Power, Heat, Functional and Timing
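Average power follows from a state power table: energy is the sum of power times residency in each state. The Python sketch below also adds a first-order steady-state temperature estimate; the states, residencies, thermal resistance and ambient temperature are hypothetical, and the thermal estimate ignores thermal capacitance and transients.

```python
# Hypothetical state power table: state -> (power in W, time spent in state in s)
STATES = {"active": (0.9, 4e-3), "idle": (0.2, 10e-3), "sleep": (0.01, 6e-3)}

def average_power(states):
    """Energy (J) and average power (W) from per-state power and residency."""
    energy = sum(p * t for p, t in states.values())
    duration = sum(t for _, t in states.values())
    return energy, energy / duration

def steady_state_temp(avg_power_w, r_th_k_per_w, t_ambient_c):
    """First-order thermal estimate: T = T_ambient + P * R_th."""
    return t_ambient_c + avg_power_w * r_th_k_per_w

energy_j, avg_w = average_power(STATES)
print(energy_j, avg_w, steady_state_temp(avg_w, r_th_k_per_w=40.0, t_ambient_c=25.0))
```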

ABOUT MIRABILIS DESIGN

Company Milestones
Continuous Innovation, Awards and World-Wide Presence
2003: Company incorporated
2005: Modeling services; 1st customer
2008: Stochastic modeling; Innovation Award
2010: Integration API; 10th customer
2011: Network modeling; university program
2013 and 2015: VisualSim Aerospace; Simulator of the Year; hardware modeling; 40th customer
2018: 50th customer; Best ESL at DAC; 2nd at Arm TechCon
2019: VisualSim Automotive; 250 products built; started Europe operations
2020: VisualSim Functional Analysis ISO/DO/IEC; started Asia operations

VisualSim software with libraries
Training: training and modeling support; the user builds the components and models
Services:
• Develop a custom library; the user assembles the models
• Develop custom libraries and models; the user conducts the parameter study
• Architecture evaluation; Mirabilis Design develops the model, analyzes it and provides feedback
Model-based Systems Engineering simplified and made easy to adopt
Mirabilis Design Software and Solutions

Engineering Benefits
Average increase in revenue per project = $??M
Project Schedule (all times in months)
Using Alternate Design Methodology: Model Creation (6), Implementation (18), Analysis (1.5), Communication and Refinement (6)
Using VisualSim Model-Based Design Methodology: Model Creation (1), Analysis (2.5), Communication and Refinement (4), Implementation (15)
Average gain for a 24-month project is 25%-30%
Ensuring Highest Quality Product
Accelerate Model Development
