Evaluating UCIe based multi-die SoC to meet timing and power

Evaluating UCIe based multi-die SoC to
meet timing and power

Logistics of the Webinar
2
All attendees will be placed on mute
To ask a question, click on Cloud Chat sign and type the
question. Folks are standing by to answer your questions.
There will also be a time at the end for Q&A

Agenda
Overview of UCIe™ — Universal Chiplet Interconnect Express™
Introduction to system modeling with UCIe and other Intellectual Properties
Assembling System models using UCIe protocol
Examples of SoC architectures using UCIe
Use Case
Mirabilis Design and VisualSim Architect

Background on die-to-die Interconnect
•Packing large number of functions at different clock rate onto a monolithic die is not scalable
•Solution: Integrate multiple dies into a single package – Chiplets
•Chiplet Challenge:
• Die-to-die communication is very slow and consumes too much power
• No single standard available to handle the routing, signalling and multiple clock domains
• Cache coherency across dies
• Support for multiple protocols
•Exploration:
• Need a mechanism to predict the expected latency and power consumption
• Test feasibility of different configurations and assign compute resources on individual dies
• Study the impact of failures or extreme latency
• Explore different scheduling and Quality-of-Service algorithms

Universal Chiplet Interconnect Express or
UCIe is the Future
•Customizable, package-level integration of chiplets
• Combines best-in-class die-to-die interconnect and protocol connections from an interoperable, multi-vendor ecosystem
• open industry standard interconnect
•Offers high-bandwidth, low-latency, power-efficient, and cost-effective on-package connectivity
•Implement compute in an advanced process node to deliver power-efficient performance at higher cost
with memory and I/O controller reused from earlier design in an established (n-1 or n-2) process node
•Future design will incorporate interaction between AI engines on different dies connected and require
deterministic latency
•Optimal design requires accurate assignment of resource pooling, resource sharing and messaging passing
•UCIe theoretical bandwidth is 4x bandwidth of PCIe 6.0 (Tbps range)
•Actual bandwidth depends on burst data available, buffer size for the Tx and replay buffer

UCIe Specification at a Glance

How UCIe Works?
Multiple layers separate out the interconnect tasks
Physical layer is responsible for the electrical signaling,
clocking, link training and sideband
Die-to-Die adapter provides the link state management
and parameter negotiation for the chiplets. It optionally
guarantees reliable delivery of data through CRC and link
level retry mechanism.
◦ When multiple protocols are supported, it defines the
underlying arbitration mechanism.
The FLIT (flow control unit) defines the underlying transfer
mechanism when the adapter is responsible for reliable
transfer

UCIe Packaging
Two package types - standard and advanced.
◦ Standard has 16 lanes
◦ Advanced has 64 lanes
To increase bandwidth, support for multi-module
For 2 modules
◦ Standard has 32 lanes
◦ Advanced has 128 lanes
◦ Each will send different bytes of data
Increase in number of lanes as module count
increases
Multiple PHY logic provides for greater data transfer
with better scheduling

Transition to UCIe
CXL
2.0
PCIe Gen 6
interface
PCIe Gen 6
interface
Strea
ming
Typical SoC – monolithic approach Next Gen SoC – Use Chiplets in modular approach

Commonality with PCIe 6.0
•UCIe protocol emulates PCIe for chiplets
•UCIe transfers packets in FLITs
• PCIe 6.0 uses fixed value of 256 bytes
• UCIe FLIT Size is variable based on the sender and receiver protocol
•Credit based flow control mechanism
•Packets use ACK or NAK to confirm good reception
• Selective and Standard ACK options
•Advanced port status and error checking
• CRC checksum
•Bandwidth depends on the number of lanes
• Standard vs Advanced package
• Multi-Module option

System-level Architecture Analysis of UCIe
based multi-die SoC

Multi-Media Application –
UCIe Template provided by Intel
CPU – High
Performance
cores
CPU – Low
Power cores
Audio/Video
Encoder/Decoder
I/O Tile
M
E
M
M
E
M
M
E
M
PCIe 6.0 PCIe 6.0
P
C
I
e
6
.
0
C
X
L
3
.
0
UCIe
Retimer
Off-Package
Interconnect
NVMe SSD
chiplet
UCIe
Retimer
C
X
L
3
.
0
How much should the
retimer timeout be set to?
Do we need a multi module setup?
How much
should the
transfer rate
between UCIe
links be set to?
4 GTs or 8 GTs
… or 32 GTs?
Start with a System Block Diagram

VisualSim Model
Create a VisualSim model using existing building blocks

Stats
Advanced package, 4 module, 32 GT/s config Standard package, Single module, 4 GT/s config
~300x latency
difference can be
observed. However, for
non-time critical
applications, Standard
UCIe package option
looks attractive
Study the statistics to decide on the best configuration

Application Examples of UCIe based multi-
die SoC

Example 1 – Multi-Media applications
CPU – High
Performance
cores
CPU – Low
Power cores
Audio/Video
Encoder/Decoder
I/O Tile
M
E
M
M
E
M
M
E
M
PCIe 6.0 PCIe 6.0
P
C
I
e
6
.
0
C
X
L
3
.
0
Retimer Off-Package Interconnect

Example 2 :
Automotive Autonomous Driving
UCIe
AI Engine Tiles
Warp
Scheduler
PE
PE
PE
PE
Local Mem
GPU
Analog Chiplet
ADC DAC
PLL
ADC DAC
PLL
Processor subsystem
Core L1
B
u
s
SLC

Example 3 : Cache Coherency using UCIe
UCIe
SERDES
32nm
GPU
7nm
RISC-V Cores
5nm
ARM Cores
10nm
DSP
10nm
SLC chiplet
22nm
LPDDR5
28nm
C
a
c
h
e
C
a
c
h
e
C
a
c
h
e
C
a
c
h
e

Design Challenges in Implementing UCIe
•Huge memory transaction blocks a high priority control access
• For time critical application, these situations are not desirable
• Example : Automotive communication system
•Multiple chiplets can be connected easily and efficiently
• Resource sizing per chiplets needs to be correct to maximize bandwidth usage
• Example applications : Data Center and AI Accelerators
•Migrating from monolithic die to Chiplet in smartphones is efficient
• Limited memory needs to be partitioned for different dies to access with minimal contention
• Example: Apple M1 Ultra uses Chiplets to double the performance

Performance challenges
•User defines CXL stacks with two protocols sharing the physical link.
•Arbiter across the Die-to-Die adapter must send Flits alternatively between the 2 protocols.
• If one of the Protocol layers doesn’t have data to transmit, then instead of payload, “NOP” frames are
inserted. If one of the Protocol stacks is idle for most of the time, then bandwidth could essentially be wasted
on the “NOP” frames.
•Increasing the number of modules for either the standard or advanced package provides more
bandwidth.
• But is that extra bandwidth needed for the application?
•What happens if multiple chiplets in your design require the data stored at the same address location
which is in another chiplet?
• Consider the impact of cache coherency
•Can peak throughput be guaranteed for your application in a shared resource environment?
• AI Engine distributed across multiple dies

Analyzing UCIe based multi-die SoC using
VisualSim System Model

Autonomous driving
UCIe
AI Engine Tiles
Warp
Scheduler
PE
PE
PE
PE
Local Mem
GPU
Memory chiplet
ADC
DDR5
Processor subsystem
Core L1
B
u
s
SLC
• Optimal
mesh size
(mxn) ?
• Best sample
size (16
bytes vs 32
bytes etc) ?
Use a single protocol
stack or multi protocol
stack?
Do we need PCIe
gen6 or still use
gen5 for meeting
application
requirements?

VisualSim System Model using UCIe in
ADAS SoC

Statistics for Multi-Die SoC
• Note the AI Engine
latency spikes
• For multi protocol,
half bandwidth for
each protocol.
• Older gen protocols
are mixed with PCIe 6,
• Lower FLIT size
increases latency.

Comparing Different Configurations using
UCIe Interface
All Die Adapters use PCIe 6.0
Die Adapters use PCIe 6.0 and
Streaming Protocols (AXI)
Lower latency when using PCIe 6.0

Mirabilis Design
VisualSim Architect

About Mirabilis Design
Engineering Solutions focused on innovation in electronics
Based in Silicon Valley, USA
Development and support centers in US, India, Japan, China and Czech
60 large corporations, research centers and 73 universities as customers
Enabled 250 products in semiconductors, automotive, defense and space
VisualSim Architect is the system simulation and IP for hardware, software and networking

Mirabilis Design – Milestones
VisualSim Aerospace
Simulator of the Year
Hardware
Modeling
2003
Company
Incorporated
2005
Modeling Services
1st Customer
2008
Stochastic Modeling
Innovation Award
2010
Integration API
10th customer
2011
Network Modeling
University Program
2013
2015
2018
Best ESL at DAC
2nd at Arm TechCon
2019
VisualSim Automotive
Europe operations
2020
Failure Analysis
Created Asia Team
2021
Best Embedded Systems
Presentation Award – DAC
2021
SysML API
Requirements
2018
New
VisualSim
2022
Best in Show
Embedded World
2023
Communication System
Designer
2022
System Verilog and
UPF/CPF Link

VisualSim Architect
Cloud and
Desktop
Multi-simulation
engine- Digital,
Untimed &
Continuous
Library of Systems,
Networks, Semi,
FPGA & Software
Generate statistics,
documentation &
traces
Algorithms
Protocol
AI Insight
Performance
Power
Functional
Stochastic
Scripting
Sim API
Performance
Latency, Throughput, Buffer occupancy
Power
Instant, Average, Cumulative, Heat, Temperature
Battery and power generation sizing
Functionality
Correctness, efficiency and Quality-of-Service
Failure Analysis and Functional Safety
Generate errors and test for compliance
Software Evaluation
Test quality of C++ and impact on system performance
System-level Modeling and Simulation Software
that integrates requirements, exploration & verification

Over 500 Systems-Level IP Components
Comprehensive implementation-accurate Library
Traffic
• Distribution
• Sequence
• Trace file
• Instruction
profile
Power
• State power table
• Power management
• Energy harvesters
• Battery
• RegEx operators
SoC Buses
• AMBA and Corelink
• AHB, APB, AXI, ACE,
CHI, CMN600
• Network-on-Chip
• TileLink
System Bus
• PCI / PCI-
X / PCIe
• Rapid IO
• AFDX
• OpenVPX
• VME
• SPI 3.0
• 1553B
ARM
• M-, R-, 7TDMI
• A8, A53, A55, A72, A76,
A77, Neoverse
Custom
Creator
• Script language
• 600 RegEx fn
• Task graph
• Tracer
• C/C++/Java
• Python
Stochastic
• FIFO/LIFO Queue
• Time Queue
• Quantity Queue
• System Resource
• Schedulers
• Cyber Security
Memory
• Memory Controller
• DDR DRAM 2,3,4, 5
• LPDDR 2, 3, 4
• HBM, HMC
• SDR, QDR, RDRAM
Networking
• Ethernet & GiE
• Audio-Video Bridging
• 802.11 and Bluetooth
• 5G
• Spacewire
• CAN-FD
• TTEthernet
• FlexRay
• TSN & IEEE802.1Q
• ARINC 664/AFDX
Interfaces
• Virtual
Channel
• DMA
• Crossbar
• Serial
Switch
• Bridge
Algorithms
• Signal Processing
• Analog
• Antenna
RTOS
• Template
• ARINC 653
• AUTOSAR
Storage
• Flash & NVMe
• Storage Array
• Disk and SATA
• Fibre Channel
• FireWire
Software
• GEM5
• Software
code
integration
• Instruction
trace
• Statistical
software
model
• Task graph
RTL-Like
• Clock, Wire-Delay
• Registers, Latches
• Flip-flop
• ALU and FSM
• Mux, DeMux
• Lookup table
Processors
• GPU, DSP, mP and mC
• RISC-V
• SiFive u74
• Nvidia- Drive-PX
• PowerPC
• X86- Intel and AMD
• DSP- TI and ADI
• MIPS, Tensilica, SH
Reports
• Timing and Buffer
• Throughput/Util
• Ave/peak power
• Statistics
FPGA
• Xilinx- Zynq, Virtex,
Kintex
• Intel-Stratix, Arria
• Microsemi-
Smartfusion
• Programmable logic
template
• Interface traffic
generator

Evaluating UCIe based multi-die SoC to meet timing and power

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Evaluating UCIe based multi-die SoC to meet timing and power

Similar to Evaluating UCIe based multi-die SoC to meet timing and power (20)

More from Deepak Shankar

More from Deepak Shankar (20)

Recently uploaded

Recently uploaded (20)

Evaluating UCIe based multi-die SoC to meet timing and power

Editor's Notes