This event includes forward-looking statements about future products and other topics, which are based on our current expectations and subject to risks and uncertainties. Please refer to the press release for this event and our SEC filings at intc.com for more information on the risk factors that could cause actual results to differ materially.
Hybrid Computing
Architectures · Caches · Memory · Process · Packaging · Interconnect

Stephen Robinson
Highly Scalable Architecture to Address the Throughput Efficiency Needs for the Next Decade of Compute
- Intel's most efficient, performant CPU
- Wide dynamic range
- Dense & highly scalable
- Vector and AI instruction support
Modern Instruction Set

Security
- Intel® VT-rp (Virtualization Technology redirect protection) supported
- Advanced speculative-execution validation methodology
- Intel® Control-flow Enforcement Technology, designed to improve defense in depth

Wide Vector Instruction Set Architecture
- Support for Advanced Vector Instructions with AI extensions
- Key instruction additions to enable integer AI throughput (VNNI); see the sketch below
- Floating-point multiply-accumulate (FMA) instructions for 2x throughput
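To make the VNNI addition concrete, here is a minimal sketch (not from the presentation) of the fused int8 dot-product primitive it provides. It assumes a compiler exposing the AVX-512 VNNI intrinsics (e.g. gcc -mavx512vnni -mavx512vl); cores with only AVX-VNNI use the _avx_-suffixed variants of the same intrinsics.

    #include <immintrin.h>

    // One step of an int8 dot product: in each of the 8 int32 lanes, four
    // unsigned-byte x signed-byte products are summed and accumulated.
    // vpdpbusd fuses what previously took three instructions
    // (vpmaddubsw + vpmaddwd + vpaddd), which is where the integer-AI
    // throughput gain comes from.
    static inline __m256i dot_step_u8s8(__m256i acc, __m256i a_u8, __m256i b_s8) {
        return _mm256_dpbusd_epi32(acc, a_u8, b_s8);
    }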
Efficiency in Both Power and Performance per Transistor
- Intense focus on feature selection and design-implementation costs to maximize area efficiency, which in turn enables core-count scaling
- Low switching energy per instruction to maximize power-constrained throughput, key for today's throughput-driven workloads
- Reduced operating voltage required at all frequencies, saving power while extending the performance range
Latency Performance - Efficient core (1C1T) vs. Skylake core (1C1T)

[Chart: single-thread performance vs. power]
- >40% more performance at ISO power
- <40% of the power at ISO performance

SPECrate2017_int_base estimates using an open-source compiler, iso-binary. For workloads and configurations visit www.intel.com/ArchDay21claims. Results may vary.
Throughput Performance - 4 Efficient cores (4C4T) vs. Skylake core (2C4T)

[Chart: multi-thread performance vs. power]
- +80% performance at less power
- -80% power at ISO performance

SPECrate2017_int_base estimates using an open-source compiler, iso-binary. For workloads and configurations visit www.intel.com/ArchDay21claims. Results may vary.
Adi Yoaz

Architecture Goals
- Deliver a step function in general-purpose CPU performance
- Innovate with the next disruption in AI performance acceleration
- Advance the architecture/microarchitecture with new features for evolving workload patterns
Intel's New Architecture

Designed for speed, pushing the limits of low-latency and single-threaded application performance via a wider, deeper, smarter machine:
- Acceleration of workloads with large code footprints & large data sets
- NEW: AI acceleration technology via a coprocessor for matrix multiplication
- NEW: smart PM controller for fine-grained power-budget management
General-Purpose Performance vs. 11th Gen Intel® Core™

[Chart: ISO-frequency performance vs. Cypress Cove core, ranging roughly 0.8x-1.6x across workloads]

19% performance improvement at ISO frequency¹

Workloads: SPEC CPU 2017, SYSmark 25, CrossMark, PCMark 10, WebXPRT3, Geekbench 5.4.1
¹ Geomean of Performance core (ADL) vs. Cypress Cove (RKL) core @ ISO 3.3GHz frequency. For workloads and configurations visit www.intel.com/ArchDay21claims. Results may vary.
Intel® Advanced Matrix Extensions (Intel® AMX)
Tiled Matrix Multiplication Accelerator - Data Center

VNNI: 256 int8 operations / cycle / core
AMX: 2048 int8 operations / cycle / core - 8x
Intel® Advanced Matrix Extensions (Intel® AMX)
Tiled Matrix Multiplication Accelerator - Data Center

The AMX architecture has two components:

TILES
- A new expandable 2D register file - 8 new registers, 1KB each: T0-T7
- The register file supports basic data operators - load/store, clear, set to constant, etc.
- TILES declares the state and is OS-managed via the XSAVE architecture

TMUL
- A set of matrix-multiply instructions, the first (AMX) operators on TILEs
- A MAC computation grid calculates 'tiles' of data
- TMUL performs matrix multiply-accumulate, C += A*B, using three tile registers (T2 += T1*T0)
- TMUL requires TILES to be present

[Diagram: C += A*B computed tile by tile: C1 += A1*B1, C2 += A2*B1]

Express more work per instruction and per µop - saving power in fetch/decode/OOO.
Intel® Advanced Matrix Extensions (Intel® AMX) Architecture

[Diagram: the host architecture issues TILECONFIG and coprocessor commands; tile registers tmm0..tmm[n-1] feed the TMUL engine (e.g. Tmm0 += Tmm1 * Tmm2) over a coherent memory interface]
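As a hedged sketch of how the pieces above fit together (tile config, tile registers T0-T7, and a TMUL T0 += T1*T2), here is one int8 tile multiply-accumulate using the public AMX intrinsics. It assumes Linux >= 5.16 (which gates AMX state behind an arch_prctl permission request) and gcc -mamx-tile -mamx-int8; the 16x64 buffer shapes are illustrative, and B must already be in the K-packed (VNNI) layout for the math to be meaningful.

    #include <immintrin.h>
    #include <stdint.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define ARCH_REQ_XCOMP_PERM 0x1023
    #define XFEATURE_XTILEDATA  18

    static void amx_i8_tile_matmul(int32_t c[16][16],
                                   const int8_t a[16][64], const int8_t b[16][64]) {
        // Ask the kernel for permission to use the AMX tile state (XTILEDATA).
        syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA);

        // 64-byte tile configuration: palette 1; three tiles of 16 rows x 64 bytes.
        struct __attribute__((packed)) {
            uint8_t  palette_id, start_row, reserved[14];
            uint16_t colsb[16];   // bytes per row, per tile register
            uint8_t  rows[16];    // rows, per tile register
        } cfg = {};
        cfg.palette_id = 1;
        for (int t = 0; t < 3; t++) { cfg.colsb[t] = 64; cfg.rows[t] = 16; }
        _tile_loadconfig(&cfg);   // TILECONFIG: declare the tile shapes

        _tile_loadd(1, a, 64);    // T1 <- A, 16x64 int8
        _tile_loadd(2, b, 64);    // T2 <- B, 16x64 int8 (K-packed)
        _tile_zero(0);            // T0 <- 0, the 16x16 int32 accumulator
        _tile_dpbssd(0, 1, 2);    // TMUL: T0 += T1 * T2 (signed int8 dot products)
        _tile_stored(0, c, 64);   // write the int32 results back to memory
        _tile_release();          // free the tile state
    }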
Rajshree Chabukswar

Performance Hybrid Scheduling Goals
- Real-time adaptive
- Software transparent
- Scalable from mobile to desktop
Introducing Intel Thread Director

Intelligence built directly into the core:
- Monitors the runtime instruction mix of each thread, as well as the state of each core - with nanosecond precision
- Provides runtime feedback to the OS scheduler to make the optimal scheduling decision for any workload or workflow
- Dynamically adapts guidance based on the thermal design point, operating conditions, and power settings - without any user input
Introducing Intel Thread Director - Scheduling Examples
1. Priority tasks scheduled on P-cores
2. Background tasks scheduled on E-cores
3. A new AI thread becomes ready and is prioritized on a P-core
4. A spin-loop wait is moved from a P-core to an E-core
Arik Gihon

Introducing Alder Lake - Reinventing Multi-Core Architecture
- All-new core design
- Performance hybrid with Intel Thread Director
- Industry-leading memory & I/O: DDR5, PCIe Gen5, Thunderbolt™ 4, Wi-Fi 6E
- Single, scalable SoC architecture: all client segments - 9W to 125W - built on the Intel 7 process
Scalable Client Architecture
- Desktop: LGA 1700 socket
- Mobile: BGA Type3, 50 x 25 x 1.3 mm
- Ultra Mobile: BGA Type4 HDI, 28.5 x 19 x 1.1 mm

Visit www.intel.com/ArchDay21claims for details
Alder Lake Building Blocks
- P-Core and E-Cores
- Xe architecture graphics
- SoC: LLC, memory, display, IPU, PCIe, TBT, GNA 3.0

Alder Lake Core/Cache
- Up to 16 cores: 8 Performance + 8 Efficient
- Up to 24 threads: 2T per P-core, 1T per E-core
- Up to 30 MB non-inclusive LL cache
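Since one SoC now mixes P-cores and E-cores, software can ask which kind of core it is currently running on. Here is a minimal sketch (not from the presentation) using the x86 hybrid-information CPUID leaf 0x1A, where EAX[31:24] reports 0x40 for a Core (Performance) core and 0x20 for an Atom (Efficient) core; pin the thread's affinity first for a meaningful answer.

    #include <cpuid.h>
    #include <cstdio>

    int main() {
        unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
        // Leaf 0x1A is the hybrid/native-model-ID leaf; EAX == 0 means the
        // part does not report hybrid information.
        if (!__get_cpuid_count(0x1A, 0, &eax, &ebx, &ecx, &edx) || eax == 0) {
            std::puts("hybrid core info not reported");
            return 1;
        }
        unsigned core_type = eax >> 24;
        std::printf("running on a %s core\n",
                    core_type == 0x40 ? "Performance"
                    : core_type == 0x20 ? "Efficient" : "unknown");
        return 0;
    }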
Alder Lake Memory
- Leading the industry transition to DDR5
- Support for all four major memory technologies: DDR5-4800 (new), LP5-5200 (new), DDR4-3200, LP4x-4266
- Dynamic voltage-frequency scaling
- Enhanced overclocking support
Alder Lake PCIe
- Leading the industry transition to PCIe Gen5 (new): up to 2X bandwidth vs. Gen4, up to 64 GB/s with x16 lanes
- Platform lanes: x16 PCIe Gen5, x4 PCIe Gen4, x12 PCIe Gen4, x16 PCIe Gen3

Visit www.intel.com/ArchDay21claims for details
Alder Lake Interconnect
- Compute fabric: up to 1000 GB/s; dynamic latency optimization
- Memory fabric: up to 204 GB/s; dynamic bus width & frequency
- I/O fabric: up to 64 GB/s; real-time, demand-based bandwidth control

Visit www.intel.com/ArchDay21claims for details
Alder Lake - Reinventing Multi-Core Architecture - Beginning Fall 2021
- All-new core design
- Performance hybrid with Intel Thread Director
- Industry-leading memory & I/O: DDR5, PCIe Gen5, Thunderbolt™ 4, Wi-Fi 6E
- Single, scalable SoC architecture: all client segments - 9W to 125W - built on the Intel 7 process
Vivid PC Graphics Market
- 1.5B PC gamers¹; over the last 4 years, the number of concurrent users on Steam has doubled
- 13M+ game developers³; over 10,000 games released on Steam in 2020
- 8.8B hours of live streams watched²; Twitch.tv viewership has doubled in one year

1. Source: https://www.pcgamesn.com/pc-gaming-study 2. Source: https://blog.streamlabs.com/streamlabs-stream-hatchet-q1-2021-live-streaming-industry-report-eaba2143f492 3. Source: Part 1: Game Developer Population Forecast 2020, April 2020, SlashData
Powered by

Software First
[Diagram: graphics software stack - Experiences (smooth gaming, performance tuning, live game streaming, modern user interface) on top of APIs and engines, on top of the core driver (DDI, OS/GPU scheduling, submission, memory management, threading, command encoding), with a GPU profiler and the Intel Graphics Compiler alongside]
Render Quality Trade-off

[Chart, repeated across build slides: image quality vs. performance - high-quality images cost render performance, low-quality images render fast; traditional upscaling moves along the same trade-off curve]
Xe Super Sampling (XeSS)

[Diagram: XeSS inputs - jitter, velocity, frame N, and history]
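A conceptual sketch only - XeSS itself reconstructs each pixel with a trained network executing on XMX (or DP4a), not a fixed blend - but it shows how the inputs named above combine: the history buffer is warped by the per-pixel velocity, then merged with the jittered frame-N sample.

    struct Color { float r, g, b; };

    // Toy temporal upscale of one output pixel. 'warped_history' is the
    // previous high-res result reprojected along the motion vector; the
    // jitter sequence makes frame N sample a different sub-pixel each frame,
    // so accumulated history converges toward a supersampled image.
    Color temporal_upscale_pixel(Color frame_n_sample, Color warped_history,
                                 float history_weight /* e.g. ~0.9 */) {
        auto mix = [=](float h, float c) {
            return h * history_weight + c * (1.0f - history_weight);
        };
        return { mix(warped_history.r, frame_n_sample.r),
                 mix(warped_history.g, frame_n_sample.g),
                 mix(warped_history.b, frame_n_sample.b) };
    }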
[Diagram: in the render pipeline, XeSS sits between raster & lighting and post-process]

Hits the Sweet Spot
[Chart: frame render time - traditional upscaling trades a low-quality image for speed; XeSS (via XMX or DP4a) reconstructs a "4K" image from a 1080p render in a fraction of the native-4K render time, approaching the high-quality image. Graph is for conceptual illustration purposes only; subject to revision with further testing.]

XeSS SDK: available this month
Scalability
Xe-core: Compute Building Block of Xe HPG-based GPUs
Optimized for high-performance gaming
- 16 Vector Engines, 256-bit per engine
- 16 Matrix Engines (XMX), 1024-bit per engine
Render Slice
- 4 Xe-cores with XMX
- Fixed function optimized for DX12 Ultimate gaming: samplers, pixel backends, geometry pipeline, rasterization pipeline
- 4 Ray Tracing Units: ray traversal, triangle intersection, bounding-box intersection
Scaling the Graphics Engine
Leadership IP Performance/Watt

[Chart: voltage vs. frequency, and performance/watt - Xe HPG achieves 1.5x frequency and 1.5x performance/watt vs. Xe LP discrete]

Gains driven across software, architecture, logic design, circuit design, and process technology.

For workloads and configurations visit www.intel.com/ArchDay21claims. Results may vary.
N6

"In the world of graphics, there is an insatiable demand for better performance and more realism. TSMC is excited that Intel has chosen our N6 technology for their Alchemist family of discrete graphics solutions. There are many ingredients to a successful graphics product, including the semiconductor technology. With N6, TSMC provides an optimal balance of performance, density and power efficiency that is ideal for modern GPUs. We are pleased with the collaboration with Intel on the Alchemist family of discrete GPUs."

Dr. Kevin Zhang, Senior Vice President of Business Development, TSMC
Multi-Year Xe HPG Roadmap (performance increasing generation over generation)
- Alchemist (Xe HPG) - Q1 2022
- Battlemage (Xe2 HPG)
- Celestial (Xe3 HPG)
- Druid (Xe Next Architecture)
Sailesh Kottapalli

Introducing the Next-Gen Intel Xeon Scalable Processor
- New standard for data center architecture
- Designed for microservices & AI workloads
- Pioneering advanced memory & IO transitions
Node Performance & Data Center Performance
Node Performance
Pillars: scalar performance · data-parallel performance · cache & memory subsystem architecture · intra/inter-socket scaling
- New Performance core microarchitecture
- Increased core counts
- Multiple integrated acceleration engines
- Larger private & shared caches
- DDR5; next-gen Optane support
- PCIe 5.0
- Embedded silicon bridge (EMIB); wider & faster UPI
- Modular SoC with modular die fabric
Data Center Performance
Pillars: consolidation & orchestration · performance consistency · elasticity & efficient data center utilization · infrastructure & framework overhead
- IO virtualization; better telemetry; fast VM migration
- Low-jitter architecture; consistent caching & memory latency
- Inter-processor interrupt virtualization
- CXL 1.1; next-gen Optane support
- Broad workload/usage support and optimizations
- Next-gen quality-of-service capabilities; integrated workload accelerators
- Improved security & RAS
From a Single Monolithic Die to a Multi-Tile Design for Increased Scalability

Delivers a scalable, balanced architecture that leverages existing software paradigms for monolithic CPUs via a modular architecture.
Sapphire Rapids: Multiple Tiles, Single CPU
- Provides consistent low latency & high cross-section bandwidth across the entire SoC
- Every thread has full access to all resources on all tiles: cache, memory, IO…
Sapphire Rapids SoC - Key Building Blocks
Seamless integration of:
- Compute IP: cores, acceleration engines
- Memory IP: DDR5, Optane, HBM
- I/O IP: PCIe Gen5, CXL 1.1, UPI 2.0
Performance Core - Built for the Data Center
- Major microarchitecture and IPC improvements
- Improved support for large code/data footprints
- Autonomous/fast power management for high frequency at low jitter
- Consistent performance for multi-tenant usages
Performance Core - Architecture Improvements for DC Workloads & Usages
- Cache management: CLDEMOTE, proactive placement of cache contents (see the sketch below)
- Half-precision: FP16 support for higher-throughput, lower-precision computation
- AI: Intel® Advanced Matrix Extensions (AMX), tiled matrix operations for inference & training acceleration
- Attached devices: Accelerator Interfacing Architecture (AiA), efficient dispatch, signaling & synchronization from user level
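As referenced above, a minimal sketch of CLDEMOTE as a producer-side hint (assumes gcc -mcldemote; not code from the presentation): after filling a buffer destined for another core or device, demoting its lines toward the shared LLC can cut the consumer's cache-to-cache transfer latency. It is purely a performance hint with no architectural effect.

    #include <immintrin.h>
    #include <stddef.h>

    void publish_buffer(char *buf, size_t len) {
        for (size_t off = 0; off < len; off += 64)  // one hint per 64-byte line
            _cldemote(buf + off);                   // demote toward the LLC
    }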
Sapphire Rapids Acceleration Engines

Increasing the effectiveness of cores, processes, containers and VMs by enabling offload of common-mode tasks to seamlessly integrated acceleration engines:
- Concurrently shareable
- Coherent, shared memory space between cores & acceleration engines
- Native dispatch, signaling & synchronization from user space (see the sketch below)

[Diagram: without acceleration, cores spend cycles on common-mode tasks; with acceleration engines those cycles return as additional workload capacity for critical workloads]
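A hedged sketch of the user-space dispatch path AiA enables (assumes gcc -menqcmd; not code from the presentation): ENQCMD posts a 64-byte descriptor to a shared accelerator work queue directly from user space, with no ring transition. The descriptor layout is device-specific - the struct below is a placeholder - and the portal pointer would come from mmap'ing the device's submission portal.

    #include <immintrin.h>
    #include <stdint.h>

    struct desc64 { uint8_t bytes[64]; };   // placeholder 64-byte descriptor

    // Returns 0 if the device accepted the descriptor; nonzero means the
    // shared work queue was full and the submission should be retried.
    static inline int submit(void *portal, const struct desc64 *d) {
        return _enqcmd(portal, d);
    }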
Intel® Data Streaming Accelerator (DSA) - Acceleration Engine
Optimizing streaming data-movement and transformation operations
- Up to 4 instances per socket
- Low-latency invocation
- No memory-pinning overhead

[Chart: core utilization of Open vSwitch @ 1512B packet size - without offload, 47% of CPU cycles go to compute and 53% to data movement; with offload, 86% compute and 14% data movement - 39% additional CPU core cycles after DSA offload]

Results have been estimated or simulated and based on tests with Ice Lake with Intel QAT. For workloads and configurations visit www.intel.com/ArchDay21claims. Results may vary.
Intel® QuickAssist Technology (QAT) - Acceleration Engine
Accelerating cryptography and data de/compression
- Up to 400Gb/s symmetric crypto
- Up to 160Gb/s compression + 160Gb/s decompression
- Fused operations

[Chart: expected core utilization on SPR for Zlib L9 - 100% without offload; 98% additional workload capacity after QAT offload]

Results have been estimated or simulated. Sapphire Rapids estimation based on architecture models and baseline testing with Ice Lake and Intel QAT. For workloads and configurations visit www.intel.com/ArchDay21claims. Results may vary.
Sapphire Rapids I/O Advancements
- Expanded device performance via PCIe 5.0 & connectivity
- Introducing Compute Express Link (CXL) 1.1: accelerator and memory expansion in the data center
- Improved multi-socket scaling via Intel® Ultra Path Interconnect (UPI) 2.0: up to 4 x24 UPI links operating @ 16 GT/s; new 8S-4UPI performance-optimized topology
- Improved DDIO & QoS capabilities
Sapphire Rapids Memory and Last-Level Cache
- Increased bandwidth, security & reliability via DDR5 memory: 4 memory controllers supporting 8 channels
- Intel® Optane™ Persistent Memory 300 series
- Increased shared last-level cache (LLC): up to >100 MB shared across ALL cores
Sapphire Rapids High Bandwidth Memory
- Significantly higher memory bandwidth vs. baseline Xeon-SP with 8 channels of DDR5
- Increased capacity and bandwidth; some usages can eliminate the need for DDR entirely
- 2 modes: HBM Flat mode (flat memory regions with HBM & DRAM) and HBM Caching mode (DRAM-backed cache)
Sapphire Rapids - Architected for AI

AI has become ubiquitous across usages - AI performance is required in all tiers of computing.

Goal: enable efficient usage of AI across all services deployed on the elastic general-purpose tier, by delivering many times more AI performance at lower CPU utilization.

Acceleration at the ISA level for deep-learning datatypes - ops/cycle per core @ 100% utilization:
- Intel® AVX-512 (2xFMA), FP32: 64
- Intel® AVX-512 (2xFMA), INT8: 256
- AMX (TMUL), BF16: 1024
- AMX (TMUL), INT8: 2048

(For the AVX-512 INT8 figure: 2 FMA ports × 64 int8 lanes per 512-bit register × 2 ops per MAC = 256; the AMX TMUL sustains 8x that per core.)

Available and integrated with industry-relevant frameworks & libraries.

Results have been simulated. For workloads and configurations visit www.intel.com/ArchDay21claims. Results may vary.
Sapphire Rapids - Built for Elastic Computing Models: Microservices

>80% of new cloud-native and SaaS applications are expected to be built as microservices.

Goal: enable higher throughput while meeting latency requirements and reducing the infrastructure overhead for execution, monitoring and orchestration of thousands of microservices.

- Improved performance and quality of service
- Reduced infrastructure overhead
- Better distributed communication

Microservices performance (throughput per core under a latency SLA of p99 <30ms): Cascade Lake 1.0x → Ice Lake Server +24% → Sapphire Rapids +69%

Results have been simulated. For workloads and configurations visit www.intel.com/ArchDay21claims. Results may vary.
Sapphire Rapids: the Biggest Leap in Data Center Capabilities in over a Decade

New standard in data center architecture
- Performance core architecture
- Multi-tile SoC for scalability: physically tiled, logically monolithic

Designed for microservices and AI workloads
- Workload-specialized acceleration
- General-purpose & dedicated acceleration engines
- Enhanced virtualization capabilities

Pioneering advanced memory & IO transitions
- DDR5 & HBM
- PCIe 5.0
Guido Appenzeller

Server Architecture in a Classic Data Center
Software and infrastructure are all controlled by one entity.
- Server hardware: a CPU for general-purpose compute - everything runs on the CPU
- Network interface card (NIC); the virtualized network also runs on the CPU
Classic Server Architecture vs. Cloud Server Architecture
- Classic: general-purpose compute and the virtualized network - everything runs on the CPU, behind a plain NIC
- Cloud: tenant code & hypervisor run on the CPU; CSP code runs on the Infrastructure Processing Unit (IPU), which implements the virtualized network

[Illustration: home vs. hotel analogy - a home kitchen, living room and dinner table vs. a hotel's guest rooms, dining hall and shared kitchen]
Major Advantages of IPUs
1. Separation of infrastructure & tenant: the guest can fully control the CPU with their SW, while the CSP maintains control of the infrastructure and root of trust
2. Infrastructure offload: accelerators help process these tasks efficiently, minimizing latency and jitter and maximizing revenue from the CPU
3. Diskless server architecture: simplifies data center architecture while adding flexibility for the CSP
Advantage 1 - Separation of Infrastructure and Tenant
Maximum control and isolation for the tenant.

[Diagram: the CPU runs the customer's software; the IPU runs the CSP's storage, network and management software and the virtualized network]
Advantage 2 - Infrastructure Offload
In some cases, the majority of CPU cycles are spent on overhead.

[Chart: % of cycles spent in microservice functionality vs. overhead at Facebook, across Web, Feed 1/2, Ads 1/2 and Cache 1/2 services]

Source: Accelerometer: Understanding Acceleration Opportunities for Data Center Overheads at Hyperscale. Akshitha Sriraman, Abhishek Dhanotia. Facebook.
Advantage 2 - Infrastructure Offload
Dedicated accelerators free up CPU capacity.

[Diagram: instead of software-based infrastructure processing on the CPU, the customer's software runs on a standard OS while the IPU's accelerators handle storage, network and management for efficient processing]
Advantage 3 - Diskless Server Architecture
Scale with virtual storage via the network.

[Diagram: the CPU runs the customer's software on a standard OS and sees a newly created virtual NVMe SSD over PCIe (e.g. SSD1 - 250GB); the IPU's storage service and control plane create the virtual volume, backed by shared storage reached over the virtualized network. The storage data path is fully implemented in hardware.]
Broad Infrastructure Acceleration Portfolio

IPU platforms & adapters
- Dedicated-ASIC IPU: performance and power optimized; optimized secure networking and storage pipeline; onboard Intel® Xeon® processor
- FPGA-based IPU: faster time to market for evolving standards; re-programmable secure datapath enables flexible/customizable workload offload (future-proof); programmable accelerated infrastructure workloads with customizable packet processing

SmartNICs
- Intel Ethernet NIC with DPDK support

Note: future Intel IPUs may integrate both ASIC and FPGA.
Introducing Oak Springs Canyon
High-performance networking and storage acceleration for cloud service providers
- Built with Intel® Agilex FPGA and Xeon-D SoC (2x 16GB DDR4)
- Customizable solutions with FPGA: OVS, NVMe over Fabrics, and RoCE solutions
- Programmable through Intel OFS, DPDK, and SPDK
- Hardware crypto block enables security at line rate
- High-speed Ethernet support - 2x100G
- PCIe Gen4 x16
Introducing Arrow Creek
Acceleration Development Platform (ADP) for high-performance 100G networking acceleration
- Built with Intel® Agilex FPGA and Ethernet E810 controller
- Accelerated infrastructure workloads: Juniper Contrail, OVS, SRv6, vFW
- Customizable packet processing, including bridging and networking services
- Programmable through Intel OFS and DPDK
- Secure remote update of FPGA and firmware over PCIe; on-board root of trust
- 2x QSFP56 supporting up to 2x100GbE networking; precision clocking
- 16GB DDR4 memory (4x 4GB), 1GB DDR4 to the integrated CPU
- PCIe Gen4 x16 (2x x8); full-height, half-length PCIe form factor; passively cooled
Naru Sundar

Introducing Mount Evans - Intel's 200G IPU

Hyperscale ready
- Co-designed with a top cloud provider
- Integrated learnings from multiple generations of FPGA SmartNICs
- High performance under real-world load
- Security and isolation from the ground up

Technology innovation
- Best-in-class programmable packet-processing engine
- NVMe storage interface scaled up from Intel Optane technology
- Next-generation reliable transport
- Advanced crypto and compression acceleration

Software
- SW/HW/accelerator co-design
- P4 Studio based on Barefoot
- Leverages and extends DPDK and SPDK
Mount Evans - Architectural Breakdown
- Support for up to 4 host Xeons with 200Gb/s full duplex
- Programmable packet pipeline with QoS and telemetry capabilities
- High-performance RoCEv2
- NVMe offload engine
- Inline IPSec

Mount Evans - Compute Complex
- Up to 16 Arm Neoverse® N1 cores
- Dedicated compute and cache with up to 3 memory channels
- Lookaside crypto and compression
- Dedicated management processor
Xe-core: Compute Building Block of Xe HPC-based GPUs
- 8 Vector Engines, 512-bit per engine
- 8 Matrix Engines, 4096-bit per engine
- Load/store: 512 B/clk
- L1 cache (512KB) / SLM, plus I$
Xe-core compute rates (ops/clk per core):
- Vector Engine: FP64 256 · FP32 256 · FP16 512
- Matrix Engine (XMX): TF32 2048 · FP16 4096 · BF16 4096 · INT8 8192
Slice
- 16 Xe-cores
- 8MB L1 cache
- 16 Ray Tracing Units: ray traversal, triangle intersection, bounding-box intersection
2-Stack
- 8 slices
- 128 Xe-cores
- 128 Ray Tracing Units
- 8 HBM2e controllers
- 16 Xe Links
- 8 media engines
- Hardware contexts

For workloads and configurations visit www.intel.com/ArchDay21claims. Results may vary.
Xe Link: High-Speed Coherent Unified Fabric
- Up to 8 fully connected GPUs through an embedded switch
- Load/store, bulk data transfer & sync semantics (GPU to GPU)
Xe Link for Scalability

[Diagram, repeated across build slides: GPUs fully connected to each other via Xe Link]
8x System Compute Rates (vector + matrix)
- FP64: up to 32,768 ops/clk
- FP32: up to 32,768 ops/clk
- TF32: up to 262,144 ops/clk
- BF16: up to 524,288 ops/clk
- INT8: up to 1,048,576 ops/clk

[Diagram: GPUs fully connected via Xe Link]
Masooma Bhaiwala

New verification methodology · new software · new reliability methodology · new signal-integrity techniques · new interconnects · new power-delivery technology · new packaging technology · new I/O architecture · new memory architecture · new IP architecture · new SoC architecture
Ponte Vecchio Key Challenges
- Scale of integration
- Foveros implementation
- Verification tools & methods
- Signal integrity, reliability & power delivery

Multi-Tile Package
Foveros base tile · compute tile · Rambo tile · HBM tile · Xe Link tile · EMIB
Ponte Vecchio Compute Tile
- Built on TSMC N5
- 8 Xe-cores per tile
- 4MB L1 cache per tile
- Foveros bump pitch: 36µm
Ponte Vecchio Base Tile
- Built on Intel 7
- Area: 640mm²
- 144MB L2 cache
- Foveros die-to-die interconnect, EMIB, MDFI
- HBM2e
- Host interface: PCIe Gen5
Ponte Vecchio Xe Link Tile
- Built on TSMC N7
- 8 Xe Links per tile
- Up to 90G SerDes
- Embedded 8-port switch

For workloads and configurations visit www.intel.com/ArchDay21claims. Results may vary.
Ponte Vecchio Execution Progress - A0 Silicon Current Status

[Chart: A0 silicon measurements of connectivity bandwidth, memory-fabric bandwidth, and FP32 throughput]

For workloads and configurations visit www.intel.com/ArchDay21claims. Results may vary.
Accelerated Compute Systems
- Ponte Vecchio OAM
- Ponte Vecchio x4 subsystem with Xe Links
- Ponte Vecchio with Xe Links + 2S Sapphire Rapids
Overcoming Separate CPU and GPU Software Stacks

[Diagram: today, a CPU-optimized stack and a GPU-optimized stack each duplicate languages, middleware/frameworks & runtimes, low-level libraries and a hardware abstraction layer beneath applications & services]
Open, Standards-Based Unified Software Stack
- Freedom from proprietary programming models
- Full performance from the hardware
- Peace of mind for developers

[Diagram: applications & services; middleware, frameworks & runtimes; languages - DPC++ and other languages; low-level libraries - oneTBB, oneMKL, oneDNN, oneCCL, oneDPL, oneDAL, oneVPL, other libraries; Level Zero hardware abstraction layer; CPU, GPU and other compute hardware]
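A minimal DPC++/SYCL sketch of the "one stack" idea (illustrative, not from the presentation): the same kernel runs on a CPU or a GPU, with the runtime choosing the device.

    #include <sycl/sycl.hpp>
    #include <vector>

    int main() {
        sycl::queue q{sycl::default_selector_v};  // CPU or GPU, chosen at runtime
        std::vector<float> a(1024, 1.0f), b(1024, 2.0f), c(1024);
        {
            sycl::buffer A(a), B(b), C(c);
            q.submit([&](sycl::handler &h) {
                sycl::accessor x(A, h, sycl::read_only);
                sycl::accessor y(B, h, sycl::read_only);
                sycl::accessor z(C, h, sycl::write_only, sycl::no_init);
                h.parallel_for(sycl::range<1>(1024),
                               [=](sycl::id<1> i) { z[i] = x[i] + y[i]; });
            });
        }  // buffer destruction copies results back into the vectors
        return c[0] == 3.0f ? 0 : 1;
    }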
oneAPI Industry Momentum
- Cross-vendor: Nvidia GPU, AMD GPU, Arm CPU, Huawei ASIC
- Evolving spec: + advanced ray-tracing libraries; + graph interfaces for deep-learning workloads
Industry Momentum
End users · ISVs & OSVs · OEMs & SIs · CSPs & frameworks · national labs · universities & research institutes - including the University of Cambridge, University College London, the Indian Institute of Science Bangalore, the Indian Institute of Science Education & Research Pune, the Indian Institutes of Technology Delhi / Kharagpur / Roorkee, and the Theoretical and Computational Biophysics Group
Intel® oneAPI Toolkits v2021.3 - Available Now
- >200K developers: unique installs of the Intel® oneAPI product since the Dec'20 release
- >300 applications: deployed in market using Intel® oneAPI languages & libraries
- >80 HPC & AI applications: functional on Intel's Xe HPC architecture using Intel® oneAPI
The vision 2 years ago...
- Leadership performance for HPC/AI
- Unified programming model powered by oneAPI
- Connectivity to drive scale-up and scale-out
See you at...