© 2019
Delivering Heterogeneous Accelerators for the Data Center and Edge
Disaggregation, a Primer:
Optimizing Design for Edge Cloud & Bare Metal Applications
Infra//Structure, May 2019
DISAGGREGATED ARCHITECTURE
▪ Performance is throttled when the CPU manages all traffic
▪ Scaling performance is breaking the cost/power budget
▪ CPUs alone cannot keep up with the demands of new workloads
Traditional Access to Disaggregated Architecture
▪ Disaggregation is key to unlocking performance
▪ Enables servers to use more cost/power-effective CPUs
▪ Provides lower latency to applications and Edge devices
SmartNIC Access to Disaggregated Architecture
[Diagram labels: Data, Traditional NIC, Host Processors Bottleneck (CPU), Networking Coprocessors (SmartNIC), CPU, GPU/coprocessor, Storage/NVMe, Storage/NVMeoTCP, Memory]
PACKET DROPS & TAIL LATENCY
▪ Disaggregated flash storage improves utilization by up to 50%
  • Moves memory access from within a server to the network
▪ Agilio SmartNICs eliminate packet drops by using Advanced Buffer Management
  • Standards-based and open source Explicit Congestion Notification and buffering
  • Dramatically improves Web 2.0 application performance due to reduced tail latency
[Figure: Local Flash vs. Disaggregated Flash]
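The idea behind Explicit Congestion Notification plus buffering can be illustrated with a toy queue model. This is only a sketch under stated assumptions, not Netronome's Advanced Buffer Management: the `EcnQueue` class, the `mark_threshold` parameter, and the policy of dropping non-ECN packets above the threshold are all hypothetical choices made for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Packet:
    flow_id: int
    ecn_capable: bool = True   # sender negotiated ECN
    ce_marked: bool = False    # "Congestion Experienced" mark


class EcnQueue:
    """Hypothetical single-queue model: mark early, drop only when full."""

    def __init__(self, capacity: int, mark_threshold: int):
        if not 0 < mark_threshold <= capacity:
            raise ValueError("need 0 < mark_threshold <= capacity")
        self.capacity = capacity
        self.mark_threshold = mark_threshold
        self.queue = deque()
        self.marked = 0
        self.dropped = 0

    def enqueue(self, pkt: Packet) -> bool:
        """Return True if the packet was queued, False if it was dropped."""
        depth = len(self.queue)
        if depth >= self.capacity:
            # Buffer exhausted: nothing left to do but drop.
            self.dropped += 1
            return False
        if depth >= self.mark_threshold:
            if pkt.ecn_capable:
                # Signal congestion without losing the packet.
                pkt.ce_marked = True
                self.marked += 1
            else:
                # Assumed policy: non-ECN traffic is signalled by a drop.
                self.dropped += 1
                return False
        self.queue.append(pkt)
        return True


def run_burst(n_packets: int, capacity: int, mark_threshold: int) -> EcnQueue:
    """Push a burst into an empty queue with no draining (worst case)."""
    q = EcnQueue(capacity, mark_threshold)
    for i in range(n_packets):
        q.enqueue(Packet(flow_id=i % 4))
    return q
```

In this model, senders that react to the marks slow down before the buffer fills, so drops only occur when a burst outruns both the threshold and the remaining buffer space.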
OPTIMIZING FOR PERFORMANCE/POWER/$ -> YOSEMITE OCP SERVERS
• Small form factor, low power servers (e.g. microservers)
  ▪ 4x 12-core Xeon-D (45-65W) vs. Skylake (130-150W)
  ▪ 12 servers in 4U
• One port shared between 4 servers (impedance mismatch)
  ▪ 50Gb/s in and PCIe Gen 3 x2/x4 (12-24Gb/s) out to each server
  ▪ Causes significant packet drops; untenable tail latency
• Crypto for each connection creates high CPU overhead
[Diagram: one 50Gb/s link in; 12-24Gb/s PCIe out on each link]
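The impedance mismatch can be put into numbers with a short calculation. The sketch below uses the slide's link rates (50Gb/s in, 12 to 24Gb/s out per host); the 1 MB buffer size and the function names are assumptions chosen only to illustrate how quickly a burst aimed at one host exhausts on-NIC buffering.

```python
def oversubscription(ingress_gbps: float, egress_gbps: float) -> float:
    """How many times faster traffic can arrive than one host link drains it."""
    return ingress_gbps / egress_gbps


def buffer_fill_time_us(buffer_bytes: float, ingress_gbps: float,
                        egress_gbps: float) -> float:
    """Microseconds until a sustained burst to a single host fills the buffer."""
    excess_gbps = ingress_gbps - egress_gbps
    if excess_gbps <= 0:
        return float("inf")  # the host link keeps up; the buffer never fills
    return buffer_bytes * 8 / (excess_gbps * 1e9) * 1e6


if __name__ == "__main__":
    BUFFER_BYTES = 1e6  # hypothetical 1 MB of per-host buffering
    for pcie_gbps in (12.0, 24.0):  # PCIe Gen 3 x2 and x4, per the slide
        ratio = oversubscription(50.0, pcie_gbps)
        fill = buffer_fill_time_us(BUFFER_BYTES, 50.0, pcie_gbps)
        print(f"PCIe {pcie_gbps:.0f}Gb/s: {ratio:.2f}x oversubscribed, "
              f"1 MB buffer fills in {fill:.0f} us")
```

Under these assumptions a full-rate burst to one host fills the buffer in a few hundred microseconds, which is why buffering alone does not remove drops and congestion signalling is also needed.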
OPEN19 MICROSERVER DESIGN FOR BARE METAL AND EDGE CLOUD
Maximum flexibility for tenants | Advanced bare metal services by the operator | At lowest TCO
[Diagram: OCP or Open19 Sled Design and OCP or Open19 Rack Design. Labels: Server 1 to Server 4, each with x86 CPU, Local Storage and LOM/SmartNIC Controller; SmartNIC SoC; Interposer Board for Bare Metal Services with SmartNIC SoC (Arm Cores, DDR, RoT Si); PCIe/1x10GbE; 2x25GbE, 2x25GbE to TOR Switches and Remote Storage]
Optional: SmartNIC SoC on microserver:
• Provides tenant the ability to develop and deploy accelerated data plane apps for higher server efficiency.
• Can also enable local NVMe storage offload and acceleration.
Interposer Board:
• Controlled by Operator
• Virtual Networking offload and acceleration
• Remote/disaggregated storage offload and acceleration
• Secure boot and hardware-based Root of Trust
• Integration with SDN Controller/Cloud Orchestration software
• Congestion management for lowest tail latency
• Failure redundancy
INNOVATIVE EDGE CLOUD MICROSERVER FORM FACTORS
Standard 1U Form Factor Solution with 4 Microservers and Integrated Netronome SmartNIC
Open19-based Solution with 4 Microservers and Integrated Netronome SmartNIC
[Figure: Open19 Chassis]
XDP: BEST-IN-CLASS LATENCY REDUCTION
3X Latency Reduction: P(99). For P(99.99) see the next slide
[Chart: P(99) Latency, Competitor NIC vs. Agilio SmartNIC with Congestion and Buffer Management; x-axis T0 to T3]
P(99) tail latency is >12ms with the competitor NIC
P(99) tail latency is <4ms with the Netronome Agilio SmartNIC
3X Benefit for P(99) Latency
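For readers unfamiliar with the P(99) and P(99.99) notation, the following sketch shows one common way to compute a tail-latency percentile from measured samples (the nearest-rank method). The sample values are made up for illustration and are not the benchmark data behind these slides; the function name `percentile` is likewise an assumption.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    # round() guards against floating-point noise before taking the ceiling
    rank = math.ceil(round(p * len(ordered) / 100, 9))
    return ordered[rank - 1]


# Made-up latencies in milliseconds: most requests are fast, a few are slow.
latencies_ms = [1.0] * 985 + [5.0] * 10 + [15.0] * 5

if __name__ == "__main__":
    for p in (50, 99, 99.9):
        print(f"P({p}) = {percentile(latencies_ms, p)} ms")
```

The median hides the slow requests entirely, while the higher percentiles expose them, which is why tail latency is quoted at P(99) and beyond.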
BEST-IN-CLASS LATENCY REDUCTION
[Chart: latency, Standard NIC vs. Agilio SmartNIC; callouts: 2.7-3.8X Latency Reduction and 33-70X Latency Reduction]
In an OCP data center, this allows 2 type 1 servers (Twin Lake) to replace a type 6 server (Leopard/Tioga Pass) within a disaggregated flash architecture. Saves ~100W (33% of total power) per replacement.
Up to 70X Improvement for P(99.99) Latency
>20X EBPF/XDP LOAD BALANCER OFFLOAD PERFORMANCE
[Chart: eBPF: Agilio vs. x86; y-axis: Performance (Mpps), 0 to 50; bars: Agilio, x86 Core]
▪ Enables on-the-fly changing of network algorithms as needed in hyperscale data centers (e.g., to direct traffic to available servers based on time of day, geographies served and other security-related criteria)
  • Filtering, Load Balancing, DDoS, Monitoring
  • eBPF-based security for bare metal servers
  • Drivers upstreamed and included in the Linux kernel
• Offload of open source eBPF/XDP Load Balancer code to the Agilio SmartNIC yields up to 42Mpps using a single x86 CPU core
• Without offload, 16 x86 CPU cores yield 20Mpps, about 1.5Mpps per CPU core
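The headline ratio can be checked directly from the figures on this slide. The sketch below takes the numbers at face value and assumes software throughput scales linearly with core count; the function and key names are hypothetical. Note that 20Mpps over 16 cores works out to 1.25Mpps per core rather than the quoted "about 1.5Mpps", so both values are shown.

```python
import math


def offload_comparison(offloaded_mpps: float, software_mpps: float,
                       software_cores: int,
                       quoted_per_core_mpps: float) -> dict:
    """Compare offloaded throughput with per-core software throughput."""
    per_core = software_mpps / software_cores
    return {
        "per_core_mpps": per_core,
        "speedup_vs_computed_core": offloaded_mpps / per_core,
        "speedup_vs_quoted_core": offloaded_mpps / quoted_per_core_mpps,
        # Assumes perfectly linear scaling across cores.
        "cores_to_match_offload": math.ceil(offloaded_mpps / per_core),
    }


# Slide figures: 42Mpps offloaded; 20Mpps on 16 cores; ~1.5Mpps/core quoted.
result = offload_comparison(42.0, 20.0, 16, 1.5)

if __name__ == "__main__":
    for key, value in result.items():
        print(f"{key}: {value}")
```

Either per-core figure gives a ratio above the ">20X" in the slide title.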
OFFLOADED TLS ENABLES SUPERIOR APPLICATION PERFORMANCE
Hardware offload provides significant performance improvement and efficiency
Hardware-offloaded TLS provides a 20% query throughput improvement
Hardware-offloaded TLS provides a 22% latency reduction
Benchmarked against Mellanox ConnectX-5, 50GbE
20% improvement on a 28-core server = 6 cores released back to the application
[Charts: Queries per Second (20% Throughput Improvement) and Latency (22% Latency Reduction), each comparing HW TLS with SW TLS]
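The "6 cores released" figure appears to follow from simple proportional arithmetic, sketched below. This assumes the throughput gain maps linearly onto core count, which is an interpretation of the slide rather than a statement of how the benchmark was run; the function name is hypothetical.

```python
def cores_released(total_cores: int, throughput_gain: float) -> float:
    """Core-equivalents freed if the gain maps linearly onto cores."""
    return total_cores * throughput_gain


freed = cores_released(28, 0.20)  # 28-core server, 20% improvement

if __name__ == "__main__":
    print(f"{freed:.1f} core-equivalents, rounded to {round(freed)} cores")
```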
AGILIO SMARTNIC IN WEB 2.0 APP AND DATA SERVERS
CONGESTION AND TAIL LATENCY REDUCTION FOR DISAGGREGATED STORAGE
[Diagram: Agilio SmartNIC with a 25/50/100GbE network port and multiple PCIe interfaces serving four Web 2.0 app and data servers. SmartNIC functions: Basic NIC TX/RX and Stateless Offloads, Switching, Buffer Management, Congestion Management, Update Statistics, Deliver to Host. Each SERVER CPU runs Chef Orchestration and three Applications.]
Multi-host systems can substantially reduce server and cabling costs.
Reduction of storage tail latency is critical to ensure a quality user experience.
DISAGGREGATION: MULTI-HOST SMARTNIC
Agilio CX 50GbE SmartNIC
1x50GbE network port
4x PCIe Gen3 x4 connectivity to host
NFP-5000 Network Flow Processor for data plane fast path
84 cores, 8 threads each
17MB internal SRAM for buffering and tables/cache
2GB DRAM for up to 2M stateful connections
Key Features
• 50GbE SmartNIC
• OCP Mezzanine v2 form factor
• 2GB on-card memory for up to 2M stateful connections
• Advanced TLS-based cryptography support at line rate
• Highly programmable with eBPF, C and P4
Highly efficient implementation of cryptography and memory access technologies to maximize the performance potential of cost-optimized OCP Yosemite v2 servers
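A few back-of-envelope figures follow from these specifications. The sketch below is illustrative only and rests on several assumptions: that "2GB" means 2 GiB and "2M" means two million connections, that the "Runs at 500MHz" editor's note refers to the NFP-5000 cores, and that line rate is measured with standard Ethernet framing overhead of 20 bytes per frame (preamble, start-of-frame delimiter and inter-frame gap). All constant and function names are hypothetical.

```python
CORES = 84
THREADS_PER_CORE = 8
DRAM_BYTES = 2 * 2**30          # assumption: "2GB" read as 2 GiB
MAX_CONNECTIONS = 2_000_000     # assumption: "2M" read as two million
LINK_BPS = 50e9                 # 1x50GbE network port
CORE_HZ = 500e6                 # assumption: editor's note "Runs at 500MHz"
WIRE_OVERHEAD_BYTES = 20        # preamble + SFD (8) and inter-frame gap (12)

hardware_threads = CORES * THREADS_PER_CORE
bytes_per_connection = DRAM_BYTES / MAX_CONNECTIONS


def packets_per_second(frame_bytes: int, link_bps: float = LINK_BPS) -> float:
    """Line-rate packet rate for a given Ethernet frame size."""
    return link_bps / ((frame_bytes + WIRE_OVERHEAD_BYTES) * 8)


def cycles_per_packet(frame_bytes: int) -> float:
    """Aggregate core cycles available per packet at line rate."""
    return CORES * CORE_HZ / packets_per_second(frame_bytes)


if __name__ == "__main__":
    print(f"hardware threads: {hardware_threads}")
    print(f"DRAM per connection: {bytes_per_connection:.0f} bytes")
    print(f"64B line rate: {packets_per_second(64) / 1e6:.1f} Mpps")
    print(f"cycle budget at 64B: {cycles_per_packet(64):.0f} cycles/packet")
```

Under these assumptions each stateful connection has roughly 1KB of DRAM available, and the cores collectively have a few hundred cycles per minimum-size packet at line rate.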
Multiple chiplets need to function as though they are on one die
Architecture Interface
Open Interface for Chiplet-Based Design
Delivering Heterogeneous Accelerators for the Data Center and Edge
THANK YOU
CORE COMPETENCY, S/W AND SILICON ARCHITECTURE FOR FAST INNOVATION
Enable the same benefits as owning your own silicon and hardware
Readily Customizable Silicon Architecture
▪ Silicon is built using Islands designed and verified independently
▪ Extending the silicon with new features is possible with fewer resources and quick time-to-market
Extensible NIC Software, Available Data Plane Acceleration Libraries
▪ CoreNIC: Fundamental building block is a traditional NIC driver with stateless offloads, DPDK, SR-IOV, and accelerated Virtio
▪ Extensive pre-developed list of data plane libraries available to quickly bolt on to CoreNIC
Culture that Fosters Joint Innovation with Customers
▪ Core silicon and software architecture complemented by software and systems-level expertise makes Netronome unique
▪ Netronome’s most successful customer engagements to date have been achieved through deep co-development models
[Diagram: CoreNIC with available libraries: VxLAN, Geneve, MPLS, GTP, TC/BPF Filter, TC Action, vRouter Match, vRouter Action, Conntrack, ABM, Prog RSS, INT, KTLS/SSL, IPsec, + …]

Editor's Notes

  • #4 Disaggregated storage
  • #5 Bandwidth to each server is x2/x4 (~13.5/27 Gbps)
  • #11 Version of TLS being used is kernel accelerated TLS
  • #13 Runs at 500MHz
  • #14 In Gen 2, we aim to centralize memory and host access for components on the platform.