Mirabilis Design provides the VisualSim Versal Library, which enables system architects and algorithm designers to quickly map signal processing algorithms onto the Versal FPGA and define the fabric based on performance. The Versal IP supports all the heterogeneous resources.
Mirabilis Design AMD Versal System-Level IP Library
2. Introduction to AMD Versal SOC-FPGA
Versal adaptive SoCs deliver unparalleled application- and system-level value for cloud, network, and edge applications.
The disruptive 7 nm architecture combines heterogeneous compute engines with a breadth of hardened memory and interfacing technologies for superior performance/watt.
The system-on-chip (SoC) combines CPUs, DSPs, I/O and RAM control along with programmable hardware logic.
It is built around an integrated shell composed of a programmable network on chip (NoC), which enables seamless memory-mapped access to the full height and width of the device.
3. Current Approach to Architecting FPGA
Math algorithm modeling
◦ Conducts Functional or math simulation to study precision and fidelity of new algorithms
Requirements database
◦ Requirements modeling list and static test changes
List the delays in a spreadsheet and add them up
◦ Average or worst case, without accounting for concurrent activity
Emulation/Boards
◦ Run benchmarks and capture the latency and memory throughput
4. FPGA Designer Requirements
1. High-level system architecture mapping: system architects (MATLAB) evaluate the advantage provided by the Xilinx Versal heterogeneous architecture
• Persona: System architect
• High-level application capture w/ models of key application building blocks
• Fast application exploration to heterogeneous HW targets (PS, PL, AIE, PL+AIE)
2. Algorithm trade-offs: initial application mapping decisions to choose the right fabric for optimal performance of each algorithm
• Persona: PL/RTL, AIE, & PS designers
• Make trade studies of how different implementations and mappings change system performance
• Identify potential bottlenecks
5. Introduction to VisualSim System-Level IP Architecture Platform for AMD Versal FPGA
VisualSim is system-level modeling and simulation software
Platform for rapid trade-offs
◦ Performance/Power/Area during planning
◦ Study speed, power, failure and bottlenecks
Optimize implementation, resource and timing constraints of algorithm tasks
The new Versal FPGA IP is a stochastic model containing
◦ Heterogeneous compute resources
◦ DDR and HBM memory interfaces
◦ Statistics for latency, throughput, utilization and power
◦ User expandable resource usage table
◦ External Interfaces
◦ Task block with mapping function
◦ Traffic generator for workloads
6. System-Level Application Exploration Tool
[Figure: exploration flow with the stages App Model, Resource Model, App resource target, Simulate/test, Identify congestion?]
What is it?
• Stochastic models focused on early application-mapping exploration onto heterogeneous SoC compute resources
• Users are system architects responsible for high-level complex system mapping, not design entry
• Rapid trade-off and design iteration prior to algorithm and design entry
Existing AMD-Xilinx tools
Where does it fit?
• Extends the general AMD-Xilinx toolset with application analysis focused on the pre-design-entry stage
• Lower-fidelity, stochastic-based model vs. full design-entry simulation tools
• Provides guidance to downstream subsystem (AIE, PL, PS) design-entry development teams
Persona and what each generates
• System Architect: subsystem requirements
• Developer: PL: RTL/HLS; AIE: C/C++; PS: C/C++/Python
• System Eng: BootFW, Gen configs
• SW Eng: Runtime SW, OTA life-cycle
Tool flow
• App Exploration Analysis Tool (VisualSim)
• Design Entry (Vivado & Vitis): RTL, C/C++
• Design Generation (Vivado & Vitis): *.bit, *.pdi, *.elf
• On-Target Runtime Deployment (Linux, libs, ??
7. App Exploration Tool – Elements
Resource models
• AIE tile & subsystem
• NOC backbone, AXI interconnects and Direct Path
• PL function “task” models
• DDR memory controller & devices
• Arm CPU models
User app functions & stimulus
• Target persona: System architect
• Traffic pattern generators
• Task/compute behavior models
• Task & data description via XML semantic language including SysML
• C code for CPU- and GPU-like targets, including Tarmac and Gem5 traces
• Stochastic and cycle-accurate function models for FPGA
[Figure: resource-model block diagram with AIE tiles, interconnect models, NOC backbone, PL custom models, DDR memory, PS subsystem and interfaces]
9. Mapping Algorithm to Versal SoC-FPGA
Implementing an image processing algorithm on an AMD-Xilinx Versal FPGA.
Each task is mapped to a resource.
[Figure: model of the image processing algorithm; callout: Standard Library Component]
Basic/Starting Configuration
Grayscale_Conversion – PS [A72 Core 1]
IIR – Logic (PL)
FFT – AI Engine Tile
Edge_Image – Logic (PL)
iFFT – AI Engine Tile
Edge_Image_Enhancement – Logic (PL)
Segmentation – PS [A72 Core 2]
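The starting configuration above can be written down as a simple lookup table. The sketch below is only an illustration in Python: the dictionary layout and the `remap` and `tasks_on` helpers are assumptions for this example, not VisualSim syntax or API.

```python
# Task-to-resource mapping taken from the "Basic/Starting Configuration" list.
# The dict layout and helper functions are illustrative, not VisualSim syntax.
BASE_MAPPING = {
    "Grayscale_Conversion": "PS [A72 Core 1]",
    "IIR": "Logic (PL)",
    "FFT": "AI Engine Tile",
    "Edge_Image": "Logic (PL)",
    "iFFT": "AI Engine Tile",
    "Edge_Image_Enhancement": "Logic (PL)",
    "Segmentation": "PS [A72 Core 2]",
}


def remap(mapping, task, resource):
    """Return a copy of the mapping with one task moved to another resource."""
    if task not in mapping:
        raise KeyError(f"unknown task: {task}")
    updated = dict(mapping)
    updated[task] = resource
    return updated


def tasks_on(mapping, resource):
    """List the tasks mapped to a given resource, in pipeline order."""
    return [task for task, target in mapping.items() if target == resource]


# Moving Segmentation to an AI Engine Tile is the change explored in Run 2.
run2_mapping = remap(BASE_MAPPING, "Segmentation", "AI Engine Tile")
print(tasks_on(run2_mapping, "AI Engine Tile"))
```

Keeping the base mapping immutable and deriving each run from it makes it easy to compare configurations side by side.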
10. Algorithm Task Table
This table defines the amount of resources each task would consume on each resource type (PS, PL, AI Tiles) if it were mapped there. Each task is then mapped to the resource of choice from the behavioural flow.
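As a rough illustration of what such a task table holds, the sketch below stores one cost per task and resource type. The cycle counts are made-up placeholders and the function names are hypothetical; the actual VisualSim table format is not shown in the slides.

```python
# Hypothetical task table: cost (in cycles) of each task on each resource type.
# All numbers are made-up placeholders, not values from the VisualSim library.
TASK_TABLE = {
    "FFT": {"PS": 4000, "PL": 900, "AIE": 300},
    "Segmentation": {"PS": 9000, "PL": 2500, "AIE": 1200},
}


def task_cost(task, resource):
    """Look up the cost of running a task on a resource type."""
    return TASK_TABLE[task][resource]


def cheapest_resource(task):
    """Return the resource type with the lowest cost for a task."""
    costs = TASK_TABLE[task]
    return min(costs, key=costs.get)


print(cheapest_resource("Segmentation"), task_cost("Segmentation", "PS"))
```

Because every task has an entry for every resource type, a remapping experiment only changes which column is read, not the table itself.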
11. Requirements and AI-based Tracking
All the requirements (latency, throughput, power, utilization, etc.) can be listed in a CSV file.
At the end of the simulation, a report is generated stating whether each requirement was met.
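A requirements check of this kind can be sketched in a few lines of Python. The CSV column names, the NoC utilization limit and the measured values below are assumptions for illustration; only the 80 ms application-latency limit appears in the slides, and the real VisualSim file layout may differ.

```python
import csv
import io
import operator

# Hypothetical CSV layout: one requirement per row (metric, comparison, limit).
# Only the 80 ms application-latency limit comes from the slides.
REQUIREMENTS_CSV = (
    "metric,comparison,limit\n"
    "app_latency_ms,<,80\n"
    "noc_utilization_pct,<,70\n"
)

COMPARATORS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def check_requirements(csv_text, measured):
    """Return (metric, measured value, met?) for every requirement row."""
    report = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = measured[row["metric"]]
        met = COMPARATORS[row["comparison"]](value, float(row["limit"]))
        report.append((row["metric"], value, met))
    return report


# Made-up simulation results, for illustration only.
measured = {"app_latency_ms": 62.0, "noc_utilization_pct": 85.0}
report = check_requirements(REQUIREMENTS_CSV, measured)
for metric, value, met in report:
    print(f"{metric}: {value} -> {'met' if met else 'NOT met'}")
```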
12. Run 1 – Base Configuration
Application latency increases over time.
The increase in latency is due to Segmentation.
Next step: remap the Segmentation task to AI Engine Tiles.
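Latency that keeps growing over time is the usual signature of a stage that takes longer to process a frame than the interval between frames. The sketch below is a minimal single-server queue in Python, not the VisualSim model; the period and service times are hypothetical values chosen only to show the two behaviors.

```python
def frame_latencies(period_ms, service_ms, frames):
    """Latency of each frame through one FIFO-served stage.

    Frames arrive every period_ms; the stage needs service_ms per frame.
    """
    latencies = []
    stage_free_at = 0.0
    for i in range(frames):
        arrival = i * period_ms
        start = max(arrival, stage_free_at)  # wait if the stage is still busy
        stage_free_at = start + service_ms
        latencies.append(stage_free_at - arrival)
    return latencies


# Hypothetical numbers: an overloaded stage vs. one that keeps up.
print(frame_latencies(period_ms=10, service_ms=12, frames=5))
print(frame_latencies(period_ms=10, service_ms=8, frames=5))
```

In the first call the backlog grows by 2 ms per frame, so latency never settles; in the second it stays constant, which is the bounded behavior a faster resource would give.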
13. Run 2 – Segmentation Mapped to AI Engine
Application latency is in a bounded range.
NoC Utilization is high.
To reduce utilization, change the interconnect for Segmentation from NoC to Direct Path.
14. Run 3 – Using Direct Path between PS and AI
Latency is deterministic.
Latency requirement (App latency < 80 msec) is met.
Utilization across NoC is acceptable
19. Behavior Graph and Mapping File
VisualSim Architect
The Dispatcher sends each task to the target hardware module for processing and handles transitions
Map individual functions to resources in the Mapping Table
20. Simulate Base Model (Clk = 600 MHz)
The latency requirements for both ST (Static Target) and MT (Moving Target) estimation are not met
21. Parameter Regression on Multi-Core
Different parameter combinations are generated from the configured ranges and simulated
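Generating every combination from a set of configured ranges is a Cartesian product. The sketch below shows the idea with Python's `itertools.product`; the parameter names and most of the values are hypothetical (only the 600 MHz and 1000 MHz clock settings appear in the slides), and this is not how VisualSim itself is configured.

```python
from itertools import product

# Hypothetical parameter ranges for a sweep.
# Only the 600 and 1000 MHz clock values are mentioned in the slides.
PARAMETER_RANGES = {
    "clock_mhz": [600, 800, 1000],
    "num_cores": [2, 4],
}


def generate_runs(ranges):
    """Return one dict per combination of parameter values."""
    names = list(ranges)
    return [
        dict(zip(names, values))
        for values in product(*(ranges[name] for name in names))
    ]


runs = generate_runs(PARAMETER_RANGES)
for run_number, params in enumerate(runs, start=1):
    print(run_number, params)
```

The number of runs is the product of the range sizes, so sweeps grow quickly as parameters are added.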
22. AI-Based Study using Requirements
• Run number 19, with the clock frequency at 1000 MHz, satisfied the performance requirements we had set.
• Because the frequency was increased from 600 MHz, total power consumption went up when running the system at 1000 MHz.
• The architect can evaluate different processing resources (DSP vs. Xeon cores vs. Power cores) if they have stringent power thresholds.
Requirements are evaluated for each simulation run in the parameter sweep.
Overall results: we can identify the simulation runs that meet the requirements and select the right configuration after considering cost vs. performance trade-offs.
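The power increase at the higher clock is consistent with the standard CMOS approximation that dynamic power scales with C x V^2 x f. The sketch below applies that approximation; it assumes constant switched capacitance and, by default, constant supply voltage, and it is not the power model used by VisualSim, which the slides do not describe.

```python
def dynamic_power_ratio(f_old_mhz, f_new_mhz, v_old=1.0, v_new=1.0):
    """Ratio of new to old dynamic power under P ~ C * V^2 * f.

    Assumes the switched capacitance C is unchanged between the two points.
    """
    return (f_new_mhz / f_old_mhz) * (v_new / v_old) ** 2


# Going from 600 MHz to 1000 MHz at the same voltage (an assumption).
ratio = dynamic_power_ratio(600, 1000)
print(f"dynamic power scales by about {ratio:.2f}x")
```

If the higher clock also needs a higher supply voltage, the quadratic voltage term makes the increase steeper than the frequency ratio alone.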
24. Failure Analysis
Hardware Failure
Loss of processing cores, limited storage, reduced or lost memory devices, bus overload or incorrect signals
Software failure
Resource starvation, deadlocks, data overwrite
Network failure
Network congestion, misconfiguration, link loss and network errors
RTOS failure
Missed real-time deadlines, malicious changes to the schedule table, and execution beyond time slots
Power failure
Both reduced and full power failure: slower processing speed and a limited number of resources that can execute concurrently
26. Software Design and Optimization
[Figure: software optimization flow; steps as labeled]
• Source code
• GCC Compiler – compile and disassemble for the target architecture
• Objdump – disassembly
• Trace in VisualSim usable format
• Select processor core
• Obtain pipeline structure from official documentation
• Create the list of parameters and their possible values
• Map parameters
• Stats
• Reconfigure parameter map to improve performance
• Update source code to improve performance
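The disassembly step can be post-processed with a short script. The sketch below counts instruction mnemonics in `objdump -d` style text; the sample lines are hand-written AArch64-style output for illustration, and the resulting instruction mix is only a stand-in, since the slides do not specify the trace format VisualSim expects.

```python
import re
from collections import Counter

# One instruction line of `objdump -d` output:
# address, colon, tab, raw bytes, tab, mnemonic, optional operands.
INSTRUCTION_LINE = re.compile(r"^\s*[0-9a-f]+:\t[0-9a-f ]+\t(\S+)")


def instruction_mix(disassembly):
    """Count how often each mnemonic appears in a disassembly listing."""
    counts = Counter()
    for line in disassembly.splitlines():
        match = INSTRUCTION_LINE.match(line)
        if match:  # skips symbol headers and blank lines
            counts[match.group(1)] += 1
    return counts


# Hand-written sample in objdump's layout, for illustration only.
SAMPLE = (
    "0000000000000000 <add>:\n"
    "   0:\td10043ff \tsub\tsp, sp, #0x10\n"
    "   4:\tb9000fe0 \tstr\tw0, [sp, #12]\n"
    "   8:\tb9400fe0 \tldr\tw0, [sp, #12]\n"
    "   c:\t910043ff \tadd\tsp, sp, #0x10\n"
    "  10:\td65f03c0 \tret\n"
)

mix = instruction_mix(SAMPLE)
print(dict(mix))
```

In practice the listing would come from running the compiler and `objdump -d` on the real binary; the counting logic stays the same.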
28. Interconnect Architecture Exploration
Analyze SoC NOC and memory subsystem architectures
Coherent vs. non-coherent subsystem connectivity
IO coherency BW allocation
QoS – control, configuration and data intensive
Analyze SoC end-to-end flow control, credits, queueing and arbitration mechanisms
Analyze scheduling and distribution of tasks throughout the compute pipeline
Analyze the importance of different flow control mechanisms, e.g., credit allocation schemes, token bucket mechanisms and rate limit configurations
Analyze SW-HW interfaces and communication
End-to-End Latency – time taken for the return trip
1. Cross point delay
2. Buffering at cross point and slave
3. Transfer and control delay at cross point, slave and cache coherent domains
4. Memory read or write delay
5. Wire delay
Network Latency – latency across the cross point
Throughput – memory and PL-AIE bandwidth
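The five latency contributors listed above can be combined into a simple budget. The sketch below treats end-to-end latency as a plain sum, which is an assumption (in a real pipeline some components may overlap); the class name, field names and all delay values are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class PathDelays:
    """Latency contributors along one request path, in nanoseconds."""

    cross_point_ns: float
    buffering_ns: float
    transfer_control_ns: float
    memory_access_ns: float
    wire_ns: float

    def end_to_end_ns(self):
        """End-to-end latency modeled as the sum of all contributors."""
        return (
            self.cross_point_ns
            + self.buffering_ns
            + self.transfer_control_ns
            + self.memory_access_ns
            + self.wire_ns
        )


# Placeholder values for illustration only.
path = PathDelays(
    cross_point_ns=2.0,
    buffering_ns=10.0,
    transfer_control_ns=8.0,
    memory_access_ns=50.0,
    wire_ns=1.0,
)
print(f"end-to-end latency: {path.end_to_end_ns()} ns")
```

Keeping the contributors separate makes it clear which term dominates before any interconnect change is tried.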
30. Analysis Scenarios
Scenario | 1 | 2
Optimal network configuration (packets only have to take one or two hops to reach the destination) | Yes | Non-optimal network configuration (non-optimal placement of nodes)
Router frequency in MHz (frequency at which the XP operates) | 2500 | 2500
Flit_Size in bytes (max packet size allowed on the network; an incoming packet larger than Flit_Size is fragmented) | 256 | 1024
X-dimension | 8 | 8
Y-dimension | 8 | 8
Packet size | 256 | 1024
Analysis shows the HBM throughput is 40 GBps because of the optimal network configuration and high frequency.
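Two of the quantities in these scenarios are easy to compute by hand: how many flits a packet is split into, and how many hops separate two nodes. The sketch below assumes a 2D mesh with dimension-ordered (XY) routing, which the slides suggest through the X and Y dimensions but do not state; the function names are hypothetical.

```python
def flits_per_packet(packet_bytes, flit_bytes):
    """Number of flits needed to carry one packet (ceiling division)."""
    return -(-packet_bytes // flit_bytes)


def mesh_hops(src, dst):
    """Hop count between two (x, y) nodes on a 2D mesh with XY routing."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


# Both scenarios use a packet size equal to the flit size: no fragmentation.
print(flits_per_packet(256, 256), flits_per_packet(1024, 1024))
# A 1024-byte packet on a 256-byte flit network would be fragmented.
print(flits_per_packet(1024, 256))
# Worst-case corner-to-corner distance on an 8 x 8 mesh.
print(mesh_hops((0, 0), (7, 7)))
```

Node placement changes the hop count per transfer, which is why a non-optimal placement raises latency even when frequency and mesh size are unchanged.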
40. Architecting Hardware-Software for Infotainment System
[Figure: block diagram with Video Camera, PCIe, SRAM, CPU, GPU, Display Ctrl, Display, IO and DRAM on an AMBA AXI bus; data unit labeled Packet]
System Overview
◦ Camera: 30 fps, VGA equivalent
◦ CPU: multi-core ARM Cortex-A53, 1.2 GHz
◦ GPU: 64 cores (8 warps × 8 PEs), 32 threads, 1 GHz
◦ DisplayCtrl: display buffer of 293,888 bytes
◦ SRAM: SDR, 64 MB, 1.0 GHz
◦ DRAM: DDR3, 64 MB, 2.4 GHz
Explore at the board and semiconductor level to size the uP/GPU, memory bandwidth and bus/switch configuration
Develop an integrated infotainment processor
• Size the GPU, AXI bus and memory controller
• Target is high-end automotive infotainment
• Ensure the sequence of flows from Video Camera to Display Controller is correct
• Determine the maximum throughput that can be processed with no overflows
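A quick back-of-the-envelope figure for the camera stream helps when sizing the bus and memory. The sketch below assumes VGA means 640 x 480 pixels and assumes 2 bytes per pixel; the slides give only the frame rate and "VGA", so both values are illustrative.

```python
def video_bandwidth_bytes_per_s(width, height, bytes_per_pixel, fps):
    """Raw (uncompressed) video bandwidth in bytes per second."""
    return width * height * bytes_per_pixel * fps


# 30 fps comes from the slides; resolution and pixel depth are assumptions.
camera_bw = video_bandwidth_bytes_per_s(width=640, height=480, bytes_per_pixel=2, fps=30)
print(f"{camera_bw} bytes/s (~{camera_bw / 1e6:.1f} MB/s)")
```

Every copy of the stream between SRAM, DRAM and the display controller adds this amount again to the bus load, which is what the trade-off study then explores.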
41. VisualSim Model of Infotainment System
[Figure: VisualSim model with blocks Video IN, DMA, ARM A53, GPU, Display Ctrl, SRAM3, DRAM3, Video OUT and Parameters; callouts: NXP i.MX6 / nVIDIA Drive PX, Xilinx FPGA Kintex 8, Discrete]
42. Conducting Architecture Trade-off
• By changing the amount of video input data (packet number), observe the SRAM -> DRAM transfer performance and find the upper limit of video input that the system can tolerate.
[Figure: plot labels 210 Packet/Sec, 12 ms, 21 Packet/Sec, 41.4 us, 300 Packet/Sec]
• 250 Packet/Sec is the system limit.
• At 300 Packet/Sec, the simulation cannot complete because of FIFO buffer overflow.
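The overflow at 300 Packet/Sec follows from a simple rate argument: once packets arrive faster than they can be drained, a finite FIFO must eventually fill. The sketch below is a constant-rate approximation, not the VisualSim model; it assumes the drain rate equals the reported 250 Packet/Sec limit, and the FIFO depth of 16 packets is a made-up placeholder.

```python
def time_to_overflow_s(arrival_pps, drain_pps, fifo_depth_packets):
    """Seconds until a FIFO overflows under constant rates, or None if never."""
    surplus = arrival_pps - drain_pps
    if surplus <= 0:
        return None  # the FIFO drains at least as fast as it fills
    return fifo_depth_packets / surplus


DRAIN_PPS = 250   # assumed equal to the reported system limit
FIFO_DEPTH = 16   # hypothetical depth, in packets

for rate in (210, 250, 300):
    print(rate, time_to_overflow_s(rate, DRAIN_PPS, FIFO_DEPTH))
```

Under these assumptions the FIFO never overflows at or below 250 Packet/Sec and fills in a fraction of a second at 300 Packet/Sec; a deeper FIFO only delays the overflow, it does not prevent it.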
48. About Mirabilis Design
Software Company based in Silicon Valley
Integrates Model-based Systems Engineering with the electronics development flow
Development and Support Centers
USA, India, China, South Korea, Japan and Europe
VisualSim Architect - Modeling and Simulation Software
Graphical modeling, multi-domain simulator, system-level IP, analysis tools and open API
Digital Enablement of the Electronics Product Development Front-End
Market Segments
Semiconductors, Automotive, and Aerospace and Defense
Design Enablement
Architecture trade-offs, system validation, early functional testing and communication
Networking
49. System Design Solution and Platform
• VisualSim Architect: graphical and hierarchical modeling
• System-Level IP: parameterized components that cover hardware, software and networking
• Multi-domain Simulator: Digital, FSM, Untimed & Continuous
• MBSE: linking requirements with multi-core regression with AI
Cloud and Desktop Version available
Key Innovations
• Parameterized library components for hardware to create any vendor variation
• Real-time plotting
• Single-event calendar that can communicate with both VisualSim and external simulators
• Behavior to architecture mapping
• Support for all design and analysis from concept to implementation flow for electronics
50. VisualSim System Level IP
• Custom Creator: script language, 600 RegEx, task graph, use cases, programming languages
• Algorithm: control, analog, DSP, communication, audio, imaging
• Power: table, energy harvesters, battery
• Traffic: distribution, sequence, trace file, instruction profile
• Reports: latency, throughput, utilization, ave/peak power, statistics
• RTL-Like: clock, wire delay, registers, latches and flip-flops, ALU and FSM, mux, demux, lookup table
• RTOS: generic RTOS, ARINC 653, AUTOSAR
• SOC: AMBA (AHB/APB/AXI), CoreLink, CoreConnect, Network-on-Chip, Virtual Channel, DMA, Crossbar, Serial Switch, Bridge
• Board-Level: VME, PCI/PCI-X/PCIe, SPI 3.0, Rapid IO, 1553B, FlexRay, CAN-FD, AFDX, TTEthernet, OpenVPX
• Processors: ARM (M-Series), ARM (A8, A72, A53, A76), RISC-V, Nvidia Drive-PX, configurable GPU, DSP, µP and µC, PowerPC, x86 (Intel and AMD), DSP (TI and ADI); others: MIPS, Tensilica, Renesas SH, Marvell
• Stochastic: Queue, Time Queue, Quantity Queue, System Resources, scheduling algorithms
• Storage: Flash, NVMe, Disk, Memory Controller, MPMC, Fibre Channel, FireWire
• Networking: Switched Ethernet, Resilient Packet Ring, RP3, Wireless LAN 802.11, Bluetooth and PAN, SpaceWire, Audio-Video Bridging, IEEE 802.1Q
• Memory: Memory Controller, SDR, DDR DRAM 2, 3, 4, LPDDR 2, 3, 4, HBM, HMC, QDR, RDRAM
• FPGA: Xilinx (Zynq, Virtex, Kintex), Intel (Stratix, Arria), Microsemi (SmartFusion), programmable logic generator, external links to I/O, network and memory
Largest Library of System Modeling IP Components
51. VisualSim Integrated System Design Flow
[Figure: integrated system design flow around VisualSim Architect, starting from the MBSE Concept; labels listed below]
1. Algorithmic Optimization (Fidelity & Precision)
2. Architecture Exploration (Speed, Power & Area)
3. Specialized Testing and Demo
4. Communication & Sharing
5. Functional Testing
Integrating What-if's to Functional Testing
Domain labels: Failure & Security, Functional Unit Testing, Embedded Systems, FPGA/ASIC, Mission and Vehicles, RF/Analog/Antenna, Protocols & DSP/Imaging, 3rd Party Provided
Hardware/Software Flow to Implementation (Schematics, HDL, Embedded C/C++/Java, Emulators, test equipment, FPGA Boards)
Document Generation
External Users: Government, Systems Integrator
Other labels: Systems engineering, Marketing
52. VisualSim drives Efficiency & Productivity
Project Schedule (note: all times in months)
Using Current Design Methodology: Model Creation (6), Implementation (18)
Using VisualSim Design Methodology: Implementation (12)
Other chart labels: Communication and Refinement (4), Analysis (2.5), Model Creation (0.5), Analysis (1.5), Communication and Refinement (6)
Time savings based on a 24-month project is 20-40%
Advantageous over a generic modeling environment due to less time and greater applicability across the organization
Editor's Notes
Instant Power is the instantaneous power consumption of the devices listed in the power table, sampled at every clock cycle.
Average Power is the average power of each device across its states (Standby, Active, Wait, Idle and Retention).
Here the maximum network latency is 3.5 x 10^-9 s (3.5 ns) and the maximum end-to-end latency is 1.6 x 10^-7 s (160 ns). The analysis shows that an optimal network configuration and a high frequency result in better latency.