Accelerating EDA workloads on Azure
- Best Practices and Benchmarks on the Intel EMR CPU
Meng-Ru Tsai
Principal Technical Program Manager, Microsoft
Jennifer Zickel
Director, Xeon Product Line Management, Intel
Abstract/Agenda
• This session will introduce the best practices for running EDA on Azure, covering the recommended architecture.
• We will present benchmark results of running EDA tools on Azure, including Synopsys VCS and Cadence Spectre-X, highlighting the capabilities of the latest Azure VMs equipped with the new 5th Gen Intel® Xeon® Platinum 8537C (Emerald Rapids) processor.
Context
| Phase | Flow step | Scope | # iterations in full design cycle (e.g., 9 mo) | # parallel jobs (distributed) | Peak mem across all jobs (GB) | Avg mem per job (GB) | # cores per job (multi-threading) | Data I/O per iteration (GB) | Avg runtime per job (hrs) |
|---|---|---|---|---|---|---|---|---|---|
| AMS/IP Design | Circuit Layout | Full Chip | 50 | 1 | 10 | 10 | 8 | 10 | 8 |
| AMS/IP Design | Circuit Simulation - Cells | Block | 50 | 1,000 | 1 | 0.1 | 1 | 100 | 24 |
| AMS/IP Design | Circuit Simulation - MEM/IP | Block | 50 | 100 | 60 | 16 | 1 | 100 | 24 |
| Chip Design (Front End) | High Level Synth (HLS) | Block | 10 | 20 | 50 | 50 | 8 | 10 | 12 |
| Chip Design (Front End) | Functional Simulation (RTL) | Block | 810 | 1,000 | 8 | 4 | 1 | 3 | 0.20 |
| Chip Design (Front End) | Functional Simulation (RTL) | Full Chip | 270 | 500 | 64 | 16 | 1 | 10 | 0.75 |
| Chip Design (Front End) | Functional Simulation (Gate Level) | Block | 20 | 2 | 384 | 128 | 1 | 10 | 12 |
| Chip Design (Front End) | Functional Simulation (Gate Level) | Full Chip | 5 | 1 | 1,500 | 1,500 | 1 | 100 | 72 |
| Chip Design (Front End) | RTL Synthesis | Block | 90 | 50 | 64 | 32 | 8 | 50 | 8 |
| Chip Design (Front End) | RTL Synthesis | Full Chip | 20 | 4 | 768 | 768 | 16 | 100 | 24 |
| Chip Design (Front End) | CDC (Clock Domain Crossing) | Block | 10 | 8 | 30 | 30 | 16 | 50 | 4 |
| Chip Design (Front End) | Formal Verification | Block | 90 | 40 | 50 | 50 | 16 | 50 | 8 |
| Chip Design (Front End) | DFT (Scan/BIST/ATPG) | Block | 30 | 4 | 384 | 384 | 16 | 50 | 4 |
| Chip Design (Front End) | RTL Power Analysis | Block | 90 | 4 | 64 | 64 | 16 | 50 | 4 |
| Chip Design (Back End) | APR (P&R) | Block | 30 | 50 | 384 | 128 | 16 | 200 | 72 |
| Chip Design (Back End) | APR (P&R) | Full Chip | 20 | 4 | 768 | 768 | 16 | 500 | 72 |
| Chip Design (Back End) | Signoff Timing | Block | 90 | 250 | 128 | 80 | 16 | 100 | 6 |
| Chip Design (Back End) | Signoff Timing | Full Chip | 60 | 50 | 800 | 800 | 16 | 700 | 12 |
| Chip Design (Back End) | Extraction | Block | 90 | 30 | 100 | 50 | 32 | 200 | 6 |
| Chip Design (Back End) | Extraction | Full Chip | 30 | 256 | 300 | 300 | 32 | 1,000 | 6 |
| Chip Design (Back End) | Signoff DRC/LVS | Block | 90 | 16 | 384 | 200 | 200 | 200 | 8 |
| Chip Design (Back End) | Signoff DRC/LVS | Full Chip | 20 | 10 | 2,000 | 2,000 | 244 | 1,000 | 12 |
| Chip Design (Back End) | IR Drop | Full Chip | 30 | 700 | 128 | 128 | 64 | 200 | 12 |
| Chip Design (Back End) | ECO (e.g., Tweaker) | Full Chip | 10 | 10 | 500 | 500 | 16 | 200 | 12 |
Examples of silicon design workloads
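One way to read this table when sizing a cloud cluster is in core-hours: parallel jobs × cores per job × average runtime gives the compute per design iteration, and jobs × cores gives the peak core count if everything runs at once. The sketch below is our own illustration of that arithmetic (the deck does not prescribe it); the sample rows are copied from the table above.

```python
# Rough sizing arithmetic from the "Examples of silicon design workloads" table.
# Each tuple: (flow step, parallel jobs, cores per job, average runtime per job in hours).
workloads = [
    ("Functional Simulation (RTL), Block", 1000,  1, 0.20),
    ("RTL Synthesis, Block",                 50,  8, 8),
    ("APR (P&R), Block",                     50, 16, 72),
    ("Signoff Timing, Block",               250, 16, 6),
]

for name, jobs, cores, hours in workloads:
    peak_cores = jobs * cores                  # cores needed if all jobs run concurrently
    core_hours = jobs * cores * hours          # total compute per design iteration
    print(f"{name}: {peak_cores:,} peak cores, {core_hours:,.0f} core-hours per iteration")
```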
Chip Design productivity
The development cycle is dominated by alternating phases of EDA tool simulation time and designer debug time.
[Figure: development time split between EDA simulation time and designer productivity]
EDA Tools/ISV landscape

| Stage | EDA Flow | Synopsys | Mentor | Cadence | Empyrean | Ansys |
|---|---|---|---|---|---|---|
| IP | Circuit Layout | Custom Compiler | Tanner | Virtuoso | Aether | x |
| IP | Circuit Simulation - Cells | HSPICE | Eldo/AFS | Spectre | Qualib | x |
| IP | Circuit Simulation - MEM/IP | HSPICE | Eldo/AFS | Spectre | ALPS | x |
| Front-End | High Level Synth (HLS) | x | Catapult | Stratus | | x |
| Front-End | Functional Simulation (RTL) | VCS | Questa | Xcelium / NCsim | | x |
| Front-End | Functional Simulation (Gates) | VCS | Questa | Xcelium / NCsim | | x |
| Front-End | RTL Synthesis | Design Compiler | Oasys-RTL | Genus | | x |
| Front-End | CDC (Clock Domain Crossing) | SpyGlass CDC | Questa CDC | Conformal CDC | | x |
| Front-End | Formal Verification | VC Formal / Formality | FormalPro | JasperGold / Conformal | | x |
| Front-End | DFT (Scan/BIST/ATPG) | DFTMAX / TetraMAX | Tessent | Modus | | x |
| Front-End | RTL Power Analysis | PrimePower | PowerPro | Joules | | PowerArtist |
| Back-End | APR (P&R) | ICC-II | Nitro | Innovus | Argus | x |
| Back-End | Signoff Timing | PrimeTime | Optimus | Tempus | | x |
| Back-End | Signoff Extraction | StarRC | Xact-RC | Quantus / QRC | RCExplorer Extraction | x |
| Back-End | Signoff DRC/LVS | ICV | Calibre | Pegasus / Assura | Argus | x |
| Back-End | Signoff EM/IR Drop/Power | PrimePower | BlueWave | Voltus | | RedHawk / RedHawk-SC |
| Back-End | Programmable ERC | ICV | Calibre PERC | Pegasus | | x |
| Back-End | ECO (Tweaker) | PrimeTime ECO | Optimus | Tempus ECO | | x |
| Post-Tapeout | Computational Lithography (OPC, RET) | Proteus | Calibre | x | | x |
Intel, AMD, Qualcomm, MediaTek (5nm & 7 nm), TSMC (DTP), etc.
Why Cloud?
Source: TSMC eNewsletter
• Accelerates design and characterization.
• Eliminates purchasing in-house CPUs that would sit idle during off-peak times.
• Improves quality through higher simulation coverage.
• Enables designers around the world to collaborate.
A simple pipe-cleaning setup
[Architecture diagram: on-prem license server and scheduler connected over VPN/ExpressRoute; EDA data from on-prem (read/write); Azure NetApp Files as the managed NFS service; a license server and scheduler on the Azure side; compute VMs with:]
• VM scale set (VMSS)
• Local /tmp
• Accelerated networking
A 200-job cluster, CycleCloud for orchestration
[Architecture diagram: same on-prem license server, scheduler, and VPN/ExpressRoute connectivity; EDA data from on-prem (read/write); Azure NetApp Files as the managed NFS service; Log Analytics and CycleCloud added on the Azure side; compute VMs with:]
• VM scale set (VMSS)
• Local /tmp
• Accelerated networking
CycleCloud provides:
• Dynamic scale up and down
• Parallel VM provisioning
A full-production cluster w/ 50,000+ cores
[Architecture diagram: same on-prem license server, scheduler, and VPN/ExpressRoute connectivity; EDA data from on-prem; Log Analytics and CycleCloud on the Azure side; managed NFS split across multiple Azure NetApp Files (ANF) volumes: tools (read), /scratch (read/write), and output (write); compute VMs with:]
• VM scale set (VMSS)
• Local /tmp
• Accelerated networking
CycleCloud provides:
• Dynamic scale up and down
• Parallel VM provisioning
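A minimal sketch of how a compute node could mount the three ANF volumes shown in this architecture (tools read-only, /scratch read/write, output for writes). The export addresses and paths are hypothetical placeholders, and the NFS options are commonly used defaults rather than values taken from this deck; run as root.

```python
# Hypothetical mounts for the three ANF volumes in the diagram above (run as root).
# The IP addresses and export paths are placeholders; replace with your ANF exports.
import subprocess

NFS_OPTS = "hard,vers=3,tcp,rsize=262144,wsize=262144"

MOUNTS = [
    ("10.0.2.4:/eda-tools",   "/tools",   "ro," + NFS_OPTS),   # ANF read: tool binaries
    ("10.0.2.5:/eda-scratch", "/scratch", "rw," + NFS_OPTS),   # ANF read/write: /scratch
    ("10.0.2.6:/eda-output",  "/output",  "rw," + NFS_OPTS),   # ANF write: job output
]

for export, mountpoint, opts in MOUNTS:
    subprocess.run(["mkdir", "-p", mountpoint], check=True)
    subprocess.run(["mount", "-t", "nfs", "-o", opts, export, mountpoint], check=True)
    print(f"mounted {export} at {mountpoint} ({opts})")
```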
Testing environment
• Azure NetApp Files (ANF) serves as the NFS storage solution, with a Premium 4 TiB volume.
• To minimize network latency, the compute VMs, the license server VM, and storage are all placed within the same Proximity Placement Group.
[Intel] Dlv6, Dv6, Ev6 VMs based on the Intel Emerald Rapids CPU
VM size changes:
o 2:1 (Dlv6), 4:1 (Dv6), 8:1 (Ev6) mem:vCPU ratios (sizing sketch below)
o Dv6 sizes ranging from 2 to 128 vCPUs, up to 512 GiB RAM (D192 size under evaluation)
o Ev6 sizes ranging from 2 to 192 vCPUs, up to 1,832 GiB RAM
Expected improvements vs. the previous v5 VMs (depending on size):
o >15-20% CPU performance on average measured by SPECint; >3X L3 cache
o Max remote storage IOPS increase from 80k to 260k with Premium v1 SSDs and 400k with Premium v2 SSDs
o Max remote storage throughput increase from 2.6 GB/s to 6.8 GB/s (D128) or 12 GB/s (E192i)
o 4X faster local NVMe SSD read IOPS, +50% local SSD capacity
o Up to 200 Gbps network bandwidth
Public Preview Plan (subject to change):
o Preview from July 2024 in US East & US West regions
o Attend the preview by filling out this survey
o VM specifications
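As a sizing illustration for the ratios above: dividing a job's average memory by its core count gives the GiB-per-vCPU it needs, which maps onto the 2:1/4:1/8:1 families. The helper and the example jobs below are our own assumptions, not guidance from the slide.

```python
# Map a job's memory-per-core requirement onto the Dlv6 / Dv6 / Ev6 mem:vCPU ratios.
# The family ratios come from the slide above; the example jobs are illustrative only.
FAMILIES = [("Dlv6 (2:1)", 2), ("Dv6 (4:1)", 4), ("Ev6 (8:1)", 8)]  # GiB per vCPU

def pick_family(cores_per_job: int, avg_mem_gb: float) -> str:
    needed = avg_mem_gb / cores_per_job
    for name, ratio in FAMILIES:
        if ratio >= needed:
            return name
    return "beyond 8:1 -- consider a high-memory size such as FX"

# (cores per job, average memory in GB), loosely based on the workload table earlier.
for cores, mem in [(1, 4), (8, 32), (16, 128), (16, 768)]:
    print(f"{cores} cores, {mem} GB/job -> {pick_family(cores, mem)}")
```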
Azure FXv2-series VMs (Preview)
• Preview: Compute-optimized FXmdsv2 and FXmsv2
• Processor: 5th Generation Intel® Xeon® Platinum
Emerald Rapids processor in a hyper-threaded configuration
• Workloads: Ideal for large databases, data analytics, SQL, and
EDA workloads
• Regions: West US 3 and Southeast Asia (will expand beyond
2024)
Learn more and get started
aka.ms/FXv2-series-Preview-Blog
Limited time offer on Linux VMs
aka.ms/LinuxPromoOffer
Compared to our previous generation FXv1-based VMs, up to:
• Increased vCPUs up to 96
• Larger memory up to 1,832 GiB w/ up to 21:1 memory-to-vCPU ratios
• Up to 50% increased CPU performance
• Up to 100% increase in local storage (read) IOPS
• 100% increase in IOPS & 400% increase in remote storage throughput with Premium v1 remote storage SSDs
• Up to 400k IOPS and up to 11 GB/s throughput with Premium v2 / Ultra Disk support
Synopsys VCS
Jennifer Zickel
Director, Xeon Product Line Management, Intel
Intel RTL Design: 1 to 32 simulations
FX64v2 (Emerald Rapids) and D64dsv5 (Ice Lake) Azure instances
Intel RTL design:
• VCS RTL simulation test design
• Complex RTL design (>10M gates)
• SVTB (SystemVerilog Test Bench) simulation test for 100K cycles
• Resident memory footprint per simulation instance is 7 GB
VCS is Synopsys' functional verification solution in the EDA space.
We observe a 17% to 43% speedup for the Emerald Rapids instance over the Ice Lake instance across the range of simultaneous simulations shown in the chart below. Emerald Rapids performance vectors: a newer-generation architecture delivering higher IPC, higher all-core turbo frequency, higher bandwidth and larger L2/L3 caches, faster UPI NUMA links, and PCIe 5.0 support vs. PCIe 4.0.
[Chart: completion time in seconds (lower is better) for 1, 2, 4, 8, 16, 24, and 32 simultaneous simulations on FX64v2 (Emerald Rapids, all-core turbo frequency up to 4.0 GHz) vs. D64dsv5 (Ice Lake, all-core turbo frequency up to 3.5 GHz)]
Intel RTL Design
FX64v2 (Emerald Rapids) and D64dsv5 (Ice Lake) Azure instances
Number of simultaneous simulations tested; seconds to completion (lower is better).
The first table has the raw completion times; the second has the speedup of the Emerald Rapids instance relative to the Ice Lake instance.
| Instance \ Simulations | 1 | 2 | 4 | 8 | 16 | 24 | 32 |
|---|---|---|---|---|---|---|---|
| FX64v2 (Emerald Rapids) | 879 | 876 | 910 | 989 | 1088 | 1140 | 1196 |
| D64dsv5 (Ice Lake) | 1196 | 1249 | 1262 | 1295 | 1316 | 1348 | 1403 |
| Speedup \ Simulations | 1 | 2 | 4 | 8 | 16 | 24 | 32 |
|---|---|---|---|---|---|---|---|
| D64dsv5 / FX64v2 | 1.36 | 1.43 | 1.39 | 1.31 | 1.21 | 1.18 | 1.17 |
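The speedup row is simply the Ice Lake completion time divided by the Emerald Rapids completion time at each simulation count. A small sketch reproducing it from the numbers above:

```python
# Reproduce the D64dsv5 / FX64v2 speedup row from the completion times above (seconds).
sims    = [1, 2, 4, 8, 16, 24, 32]
fx64v2  = [879, 876, 910, 989, 1088, 1140, 1196]       # Emerald Rapids
d64dsv5 = [1196, 1249, 1262, 1295, 1316, 1348, 1403]   # Ice Lake

for n, emr, icl in zip(sims, fx64v2, d64dsv5):
    speedup = icl / emr
    print(f"{n:>2} simulations: {speedup:.2f}x ({(speedup - 1) * 100:.0f}% faster)")
```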
Notices and Disclaimers
• Performance varies by use, configuration and other factors. Learn more on the Performance Index site.
• Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.
• Your costs and results may vary.
• Intel technologies may require enabled hardware, software or service activation.
• © Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
Cadence Spectre-X
Meng-Ru Tsai
Principal Technical Program Manager, Microsoft
Observation
• Test design: post-layout DSPF design with 100K+ circuit inventories.
• Average CPU utilization stayed above 95% during the run; the workload is very compute-intensive and CPU-bound.
Simulation time and scalability
• Total elapsed time (seconds); lower is better.
• Comparison to Ice Lake in %.
| # of threads | D64dsv5 (Ice Lake, all-core turbo up to 3.5 GHz) | D64dsv6 (Emerald Rapids, all-core turbo up to 3.6 GHz) | FX64v2 (Emerald Rapids, all-core turbo up to 4.0 GHz) |
|---|---|---|---|
| 1 | 7010 | 5740 | 4990 |
| 2 | 3590 | 3070 | 2690 |
| 4 | 1970 | 1740 | 1500 |
| 8 | 1190 | 1050 | 925 |
| # of threads | D64dsv5 | D64dsv6 | FX64v2 |
|---|---|---|---|
| 1 | 100% | 82% | 71% |
| 2 | 100% | 86% | 75% |
| 4 | 100% | 88% | 76% |
| 8 | 100% | 88% | 78% |
[Chart: elapsed time in seconds vs. number of threads (1-8), showing how performance improves for multithreaded Spectre-X jobs on D64dsv5 (Ice Lake, all-core turbo up to 3.5 GHz), D64dsv6 (Emerald Rapids, all-core turbo up to 3.6 GHz), and FX64v2 (Emerald Rapids, all-core turbo up to 4.0 GHz)]
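The percentage table is the Emerald Rapids elapsed time divided by the Ice Lake elapsed time at the same thread count; the sketch below reproduces those ratios from the measured values and also derives the thread-scaling speedup within each VM (our own addition, not shown in the deck).

```python
# Spectre-X elapsed times in seconds, copied from the tables above (1, 2, 4, 8 threads).
times = {
    "D64dsv5 (Ice Lake)":       [7010, 3590, 1970, 1190],
    "D64dsv6 (Emerald Rapids)": [5740, 3070, 1740, 1050],
    "FX64v2 (Emerald Rapids)":  [4990, 2690, 1500,  925],
}

baseline = times["D64dsv5 (Ice Lake)"]
for vm, series in times.items():
    vs_icelake = [f"{t / b:.0%}" for t, b in zip(series, baseline)]  # lower is better
    scaling    = [f"{series[0] / t:.2f}x" for t in series]           # speedup vs. 1 thread
    print(f"{vm}: vs Ice Lake {vs_icelake}; thread scaling {scaling}")
```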
Cost-effectiveness estimation
• Estimated total time and VM cost for running 500 single-threaded Spectre-X jobs:
o D64lds v6 has the lowest cost.
o FX64mds v2 has the shortest total time.
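The deck does not show the prices behind this estimate, so the sketch below only illustrates the shape of the calculation: the single-thread runtimes are the measured values above, while the hourly prices, the two-VM pool, and the one-job-per-vCPU packing are placeholder assumptions to replace with your own numbers.

```python
import math

JOBS = 500          # single-threaded Spectre-X jobs to run
SLOTS_PER_VM = 64   # assumption: one job per vCPU on a 64-vCPU VM
NUM_VMS = 2         # assumption: size of the VM pool

# (VM size, measured single-thread runtime in seconds, assumed $/hour -- placeholder prices).
vms = [
    ("D64ds v5", 7010, 3.00),
    ("D64ds v6", 5740, 3.20),
    ("FX64 v2",  4990, 5.00),
]

for name, runtime_s, price in vms:
    waves     = math.ceil(JOBS / (SLOTS_PER_VM * NUM_VMS))  # batches of concurrent jobs
    elapsed_h = waves * runtime_s / 3600                    # wall-clock time for all jobs
    vm_hours  = JOBS * runtime_s / 3600 / SLOTS_PER_VM      # billed VM-hours, ideal packing
    print(f"{name}: ~{elapsed_h:.1f} h elapsed, ~${vm_hours * price:,.2f} at the assumed price")
```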
Thank you!